Technical Writing is Hard - The Git Glossary
A member of the Soylent News community has published the start of a glossary for git (the version control software).
He points out many problems with terminology within the official git documentation, and hopes to fix this by starting a glossary which is unambiguous.
He states his target audience, which should include me, because I have a mediocre understanding of git. But I will never use his git glossary (despite it being jolly good already, and very well researched). I learn by example, and grasp the fundamentals gradually with experience. I can't recall ever looking to a glossary to find meaning. I look at examples, read documentation, and gradually build an understanding of the system.
The glossary writer wants hard and fast definitions up front, and then to use those definitions rigorously. I suppose that is mandatory for a glossary writer, but should all the documentation be so rigorous?
That isn't how children learn languages, so should we expect technical documentation to be different? Should we teach children the complete definition of "cat" up front, or be fast and loose at first, and then surprise (and confuse) them when we say that a lion is a cat?
Maybe it is necessary when a subject gets past a threshold level of abstraction. Natural language can then no longer convey meaning, because a beginner cannot resolve the ambiguities until they are fluent: a chicken and egg problem.
I think part of his problem is summed up by René Magritte's This Is Not a Pipe.
If I write "a pipe" on a shopping list, have I added a pipe to the list, or just the words "a pipe"? A pipe is different to a picture of a pipe, which is different again to the word "pipe". But in everyday language, we can infer meaning (despite the ambiguities) using context.
"My shopping list contains a pipe and an apple" is unambiguous to a native English speaker. We know there is nothing to eat or smoke, there is merely ink!
Demystifying Remotes
With regard to git, "remote" is a synonym for "remote repository". But it is used to mean the actual object AND the item on the shopping list, in exactly the same way that I've used the word "pipe" above.
We can add and remove remotes in the same way we can add and remove items from a shopping list.
We can also create and delete remotes (the actual repositories, not just references to them), in the same way we can manufacture and destroy wooden pipes.
With regard to remotes, things are more confusing, because the list contains a shorthand name as well as the location.
Informally, we might use the word "remote" in three contexts :
- the remote on gitlab (i.e. meaning the actual repository),
- the remote "bob" (i.e. the shorthand name of the remote on gitlab),
- the remote "https://gitlab.com/..." (i.e. meaning the location of the remote repository)
My father is Peter. "Father" is a reference to a person (which is only resolved if you know who is speaking). "Peter" is the name of a person. Thirdly, there's the actual human being.
Is father a person? Is Peter a person? Is that bloke over there a person?
The glossary is attempting to set a rule that "person" cannot refer to more than one of these things, for fear of confusion.
While it is useful to understand the difference between "father", "Peter" and a particular 6 foot conglomeration of atoms, IMHO, it is quite acceptable to use the same phrase "person" to refer to all three.
He is advocating for all phrases to be fully qualified, and for each concept to be referred to by a single phrase, i.e. "person nickname", "person name" and "person". You must never call my father a person, because "my father" is a nickname of a person.
So when adding a remote's address to a list, we must use the fully qualified phrase "remote reference", not "remote".
Also, we must never shorten "remote repository" to just "remote" because that creates two phrases for a single concept.
When we follow these rules, our documentation becomes unambiguous, but long-winded. We end up with sentences such as :
After we create a remote repository on github, we can add a remote repository reference to our remote repository list using the command :
git remote add LOCAL_NICKNAME https://github.com/OUR_NAME/OUR_PROJECT.git
The official git documentation tends to break these rules like so :
After we create a remote on github, we can add it to our remotes using the command ...
This breaks the proposed rules in many ways!
BTW, note that my attempt at full rigour failed, because the repository "list" is really a "map" keyed on remote repository nickname, with a value of a remote repository location. Wow, what a mouthful that is. And for a newbie it gives no additional insight at all.
If we then logged onto a remote server using ssh, and did the same thing again, that would be confusing, because this time "remote" means "a computer far away", not a "remote repository". In everyday English we would clarify this second usage of the word, because the context is "remote repositories", but the usage differs from the context.
Note, at one point he implies that the commands should have the same degree of rigour as the documentation. I don't think he really means that. The word "remote" in the command could be confusing on two counts.
- It is using different terminology, because it uses the phrase "remote" as opposed to "remote repository".
- It isn't specifying that we are adding a reference, and not adding an actual repository.
This is obviously going too far IMHO; surely he doesn't expect the command to read :
git remote_repository_reference add https://github.com/OUR_NAME/OUR_PROJECT.git
BTW, isn't using the words "add" and "remove" enough to convey "reference"? If we were creating actual repositories, we would use the verbs "create" and "delete".
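For what it's worth, the three meanings of "remote" can be sketched as a toy model in Python. This is only an illustration, assuming a plain dictionary for the nickname-to-location map; it is not how git stores remotes, and every name in it (Repository, hosting, remotes, create_repository, add_remote and so on) is made up for the example.

```python
# Toy model only: this is NOT how git stores remotes internally.
# Every name below is hypothetical and exists purely for illustration.


class Repository:
    """Stands in for an actual repository living on some server."""

    def __init__(self):
        self.commits = []


# The outside world: actual repositories, keyed by their location.
hosting = {}

# Our local "list" of remotes, which is really a map: nickname -> location.
remotes = {}


def create_repository(location):
    """Manufacture the pipe: bring an actual repository into existence."""
    hosting[location] = Repository()


def delete_repository(location):
    """Destroy the pipe: the actual repository is gone."""
    del hosting[location]


def add_remote(nickname, location):
    """Write "a pipe" on the shopping list: record a reference only."""
    remotes[nickname] = location


def remove_remote(nickname):
    """Cross the item off the list; the actual repository is untouched."""
    del remotes[nickname]


location = "https://github.com/OUR_NAME/OUR_PROJECT.git"
create_repository(location)
add_remote("bob", location)

# Three things a writer might informally call "the remote":
nickname = "bob"         # the shorthand name
url = remotes[nickname]  # the location
repo = hosting[url]      # the actual repository

remove_remote("bob")
print("bob" in remotes)     # False: the reference is gone
print(location in hosting)  # True: the repository still exists
```

Removing the remote only crosses the item off the list. The repository itself survives until somebody calls delete_repository, which is the add/remove versus create/delete distinction in miniature.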
Staging Areas
He asks : "My question is very simple: Is the staging area or staging index part of a repository or not?"
Why do we care? This seems like an implementation detail which could be changed at any time without affecting anybody.
At present we have a "working tree" containing our project's files, and a ".git" directory containing everything else. The staging area is clearly "inside" the ".git" directory as there is nowhere else it could be.
What if, in the next release of git, they change the inner workings and we now have an extra ".staging" directory? It would contain the information about changes which have been added via the "git add" command, but which have not yet been committed using the "git commit" command. (Let's ignore stashed files, though we could imagine them living in yet another place, a ".stash" directory.)
My point is the exact location of the data doesn't matter. We (as users of git) do not touch that level of abstraction. IMHO, it is meaningless to say that the staging area is contained *anywhere*. It is local to my project's directory; that is all I need to know.
The unix command rm
If we consider the simple unix command "rm", what does it do? It deletes a file from a directory, right?
If we are deleting, surely "del" would be a more appropriate name? DOS got it right !?!
Well, it turns out that this is a cat/lion problem. To newbies, "rm" does delete files. But when we grow up (and learn that a lion is a cat), we also discover that after we rm a huge file it is possible that we've not freed up any disk space.
If the file has a hard link in another directory, the space won't be freed. The file is not deleted at all.
If the file is open, the file will only be deleted when the last file handle is closed (I think!).
It is similar-ish to garbage collection in a managed language; a file is only deleted when there are no more references to it.
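The hard-link and open-file cases above can be tried out with a small Python sketch. It assumes a POSIX system (Linux or macOS) whose temporary directory supports hard links; the file names are hypothetical, and os.remove stands in for rm here, since both remove a single directory entry.

```python
import os
import tempfile


def link_count(path):
    """Number of directory entries (hard links) pointing at the file."""
    return os.stat(path).st_nlink


with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "huge_file")
    alias = os.path.join(tmp, "alias")

    with open(original, "w") as f:
        f.write("some data")

    # Case 1: a hard link keeps the file alive.
    os.link(original, alias)          # second name for the same file
    count_before = link_count(alias)  # two names
    os.remove(original)               # like "rm huge_file"
    count_after = link_count(alias)   # one name left
    with open(alias) as f:
        surviving_data = f.read()     # contents are untouched

    # Case 2: an open handle keeps the file alive (POSIX behaviour).
    handle = open(alias)
    os.remove(alias)                  # the last name is gone...
    name_exists = os.path.exists(alias)
    data_via_handle = handle.read()   # ...but the data is still readable
    handle.close()                    # only now can the space be reclaimed

print(count_before, count_after)  # 2 1
print(surviving_data)             # some data
print(name_exists)                # False
print(data_via_handle)            # some data
```

On Windows the second case behaves differently, because removing a file that is still open normally fails there.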
"rm" removes one of the references. Most of the time the reference count reaches zero, so the file is also deleted.
IMHO, good technical documentation should say that rm deletes a file, despite that being technically wrong. Only when you dig deeper into the documentation should it say what really happens!
Git is complex, and we should expect the top-level documentation to leave out some of the subtle distinctions. If it didn't, it would be inscrutable to newbies.
Conclusion
Writing documentation is a tricky skill which is often lacking in the Free Software community.
I'm sure that his work on the git glossary will be beneficial to the community, and I wish him well. However, I have reservations about the level of rigour to which he aspires. There is a balance between rigour and readability.
I am less worried about his lack of understanding of the core concepts, because he is welcoming input from the wider community, and a group led by a keen and open leader will produce excellent results.