2009-10-18

Tidy rewritten histories with Git

I imported some of my old projects from CVS to Git. I had the CVS repository of an old student project as a tarball. That one repository contained the sources of two programs - the main project and one small utility. I was able to import them into two separate Git repositories and also rewrite their version history so that it would seem as if the utility program had always been a separate project and had always used Maven (neither of which was true).

Importing the CVS repository to Git did not succeed with git cvsimport (it failed with "fatal error - cmalloc would have returned NULL"), but cvs2git worked and was also orders of magnitude faster. I had to edit the example options file provided with cvs2git to configure the CVS repository path and the author names. If some of the authors have non-ASCII characters in their names, it's best to save the options file in UTF-8 format and use the u'Námè' format for the author names. See cvs2git's usage instructions for details on how to do the conversion.

Now that I had a Git repository with the history of both of the programs, it was time to separate the utility program's version history with git filter-branch (the main project's history did not need to be modified). It's best to take a temporary clone of the original repository before messing with filter-branch. That way it's easier to revert all changes and try again by just deleting and recreating the temporary repository.

I made a clone of the repository and in that clone I used --subdirectory-filter to remove everything except the source code of the utility program:

git filter-branch --subdirectory-filter src/hourparser -- --all

Originally the utility did not use Maven, but I wanted to modify the history to look as if it had always used Maven. So next I used --tree-filter to move all the source files into the right directory structure. I also removed the manifest file, because Maven will generate it automatically. When removing files, it's best to use --prune-empty, or you may have problems, for example during rebasing (I learned it the hard way). Also make sure that the last command in the filter will always exit successfully with exit code 0, or otherwise the whole filtering process will fail.

git filter-branch --prune-empty --tree-filter '
mkdir -p src/main/java/hourparser
mv *.java src/main/java/hourparser
rm -rf META-INF
' -- --all
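
Before handing a filter body to filter-branch, it can be tried out on a plain directory. The following is only a sketch of a more defensive variant of the filter above, not the filter I actually ran: the function name to_maven_layout and the file HourParser.java are made up for illustration. It skips the move when a commit has no .java files, and it ends with rm -rf, which exits with 0 even when there is nothing to remove.

```shell
set -eu

# Hypothetical helper: the same steps as the tree filter, wrapped in a
# function so that they can be tested outside of filter-branch.
to_maven_layout() {
    mkdir -p src/main/java/hourparser
    for f in *.java; do
        # An unmatched glob stays as the literal "*.java", so only move
        # names that really exist in this directory.
        if [ -e "$f" ]; then
            mv "$f" src/main/java/hourparser/
        fi
    done
    # Last command of the filter: succeeds even if META-INF is missing.
    rm -rf META-INF
}

# A directory that looks like an old commit: one source file and a manifest.
demo=$(mktemp -d)
cd "$demo"
mkdir META-INF
touch HourParser.java META-INF/MANIFEST.MF
to_maven_layout

# A directory with no sources and no manifest: the steps must still succeed.
empty=$(mktemp -d)
cd "$empty"
to_maven_layout
```

Once the steps behave as expected, the lines inside the function can be pasted between the quotes of --tree-filter.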

After that was done, I had to insert the pom.xml and other Maven files into the version history. I did that by first making multiple commits at the end of the history: one with the initial project files and one for each version number increment (the version number in pom.xml needs to be changed when a release is made). Then I used git rebase to reorder the commits, so that the changes to pom.xml would be in the right places in the history. Changing the initial commit was more complicated, but I was able to do it by creating a new repository with that initial commit, and then rebasing the rest of the history from the other repository on top of it.
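
I did not write down the exact commands, so the following is only a sketch of one way to put a new initial commit under an existing history. It covers the last part only (the reordering with git rebase is not shown). All paths, file contents and the branch name old-history are made up for the demonstration, and the use of git rebase --root --onto is my assumption about how to express the step.

```shell
set -eu
export GIT_AUTHOR_NAME='Example Author' GIT_AUTHOR_EMAIL=author@example.com
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"
work=$(mktemp -d)

# Stand-in for the filtered repository: two commits that touch only sources.
git init -q "$work/old"
cd "$work/old"
echo 'v1' > Main.java
git add Main.java
git commit -q -m "Add parser"
echo 'v2' > Main.java
git commit -q -am "Fix parser"
old_branch=$(git symbolic-ref --short HEAD)

# New repository whose only commit is the desired initial commit.
git init -q "$work/new"
cd "$work/new"
echo '<project/>' > pom.xml
git add pom.xml
git commit -q -m "Initial Maven project"
new_base=$(git rev-parse HEAD)

# Bring in the old history as a local branch and replay all of it,
# including its root commit, on top of the new initial commit.
git fetch -q "$work/old" "$old_branch:refs/heads/old-history"
git rebase -q --root --onto "$new_base" old-history

git log --reverse --format=%s
```

After the rebase the branch old-history starts with "Initial Maven project", followed by the two replayed commits.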

After this I had the right commits in place, but their dates were not consistent. The commits for the Maven files were dated in 2009, but everything else was dated 2005. That I was able to fix by exporting the repository into patches, editing the authors and author dates in the patches with a text editor, and finally importing the patches into a blank repository. Temporary patches are a powerful tool in editing the history.

git format-patch -M -C -k --root master
[edit the patches and move them to a new directory]
git init
git am -k 00*
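
I edited the patches by hand, but the same kind of change can be scripted. As a sketch, the following rewrites the Date header of every patch in one directory and writes the results to another. The sample patch, the directory names and the new date are made up, and a real history would need a different date for each patch. The address range 1,/^$/ limits the substitution to the mail headers, so a line starting with "Date:" inside a diff is left alone.

```shell
set -eu
work=$(mktemp -d)
mkdir "$work/patches" "$work/edited"

# Stand-in for one file written by git format-patch (only the headers
# matter for this demonstration).
cat > "$work/patches/0001-Add-pom.xml.patch" <<'EOF'
From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: Example Author <author@example.com>
Date: Sun, 18 Oct 2009 12:00:00 +0300
Subject: Add pom.xml

---
 pom.xml | 1 +
EOF

# Hypothetical author date that the commit should appear to have.
new_date='Sat, 1 Oct 2005 12:00:00 +0300'

for p in "$work"/patches/*.patch; do
    # Replace the Date header in the header block (up to the first empty line).
    sed "1,/^\$/ s/^Date: .*/Date: $new_date/" "$p" > "$work/edited/$(basename "$p")"
done

cat "$work/edited/0001-Add-pom.xml.patch"
```

The From: header can be changed the same way when the author needs fixing.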

After all this the authors and author dates were fine, but the committer and commit date information still needed fixing. I was able to change the committers to be the same as the author with the following command:

git filter-branch -f --env-filter '
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME"
export GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"
export GIT_COMMITTER_DATE="$GIT_AUTHOR_DATE"
' -- --all
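
To check whether any commits still have a committer or commit date that differs from the author information, the two sets of fields can be compared with git log. This is a sketch on a throwaway repository with made-up names and dates, not something taken from my real projects.

```shell
set -eu
export GIT_AUTHOR_NAME='Example Author' GIT_AUTHOR_EMAIL=author@example.com
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"

# A commit whose committer fields already match the author fields...
echo one > a.txt
git add a.txt
GIT_AUTHOR_DATE='Sat, 1 Oct 2005 12:00:00 +0300' \
GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL" \
GIT_COMMITTER_DATE='Sat, 1 Oct 2005 12:00:00 +0300' \
    git commit -q -m "Matching commit"

# ...and one that still has a different committer and commit date.
echo two > a.txt
GIT_AUTHOR_DATE='Sun, 2 Oct 2005 12:00:00 +0300' \
GIT_COMMITTER_NAME='Importer' GIT_COMMITTER_EMAIL=importer@example.com \
GIT_COMMITTER_DATE='Sun, 18 Oct 2009 12:00:00 +0300' \
    git commit -q -am "Mismatching commit"

# Print subject, author fields and committer fields separated by tabs,
# and keep the subjects of the commits where the two sets differ.
mismatched=$(git log --format='%s%x09%an <%ae> %ad%x09%cn <%ce> %cd' |
    awk -F '\t' '$2 != $3 { print $1 }')
echo "$mismatched"
```

An empty result means that every commit has identical author and committer information.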

After this I could publish Git repositories of the main project and the utility project with nice clean histories.

2009-10-10

TDD is not test-first. TDD is specify-first and test-last.

Recently there has been some discussion about TDD at the Object Mentor blog. In one of my comments I brought forth the idea in this article's title. It was such a nice oxymoron that I decided to elaborate here on what I mean by saying that "TDD is not test-first".

The TDD Process

Test-Driven Development has the word "test" in its name, and the people doing TDD speak about "writing tests", so there is much confusion about TDD. Frankly, the big benefits of TDD have very little to do with testing. That's what brought about Behaviour-Driven Development (BDD), which is the same as TDD done right, but without the word "test". Because BDD does not talk about testing, it helps many to focus on the things that TDD is really about.

Here is a diagram of how I have come to think about the TDD process:

When you look at that diagram, it probably seems quite similar to traditional software development methods, even quite waterfallish. Let's remind ourselves what a waterfall looks like:

The waterfall model is "Specify - Design - Implement - Verify - Maintenance". The TDD process is otherwise the same, except that it loops very quickly (one cycle usually takes a couple of minutes), it has a new "Cleanup" step, all of it is considered "Design", and all of it is also considered "Maintenance".

Step 1: Specify

The first step in TDD is to write a specification of the desired behaviour (what is usually called a "test"). Here the developer thinks about what the system should do, before thinking about how it should be implemented. The developer focuses on just one thing at a time - separate the what from the how.

When the developer has decided what the next important behaviour is that the system does not yet have, he documents the specification of that behaviour. The specifications are written in a very formal language (i.e. a programming language), so formal that they can be executed and verified automatically (not to be confused with formal verification).

Writing this executable specification will save lots of time, because the developer does not need to do the verification manually. It will also communicate the original developer's intent to other developers, because anybody can look at the specification and see what the original developer had in mind when he wrote the code. It will even help the original developer to remember what he was thinking when he returns to code that he wrote a couple of weeks ago. And best of all, anybody can verify the specifications at any moment, so any change that breaks the system will be noticed early.
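
To give a rough idea of what "executable specification" means, here is a tiny sketch in plain shell. It is not in the style of my real specs and is not taken from any of my projects: the functions to_minutes and minutes_between and the HH:MM input format are made up. The implementation is included only so that the sketch runs; in TDD the spec function would be written first and would fail until the implementation exists.

```shell
# Implementation (hypothetical): convert HH:MM to minutes since midnight.
to_minutes() {
    h=${1%:*}
    m=${1#*:}
    # Strip one leading zero so that "08" and "09" are not read as octal.
    echo $(( ${h#0} * 60 + ${m#0} ))
}

# Implementation (hypothetical): length of a period in minutes.
minutes_between() {
    echo $(( $(to_minutes "$2") - $(to_minutes "$1") ))
}

# Specification: the name and the body say what the behaviour is,
# without saying how it is implemented.
spec_reports_the_length_of_a_work_period_in_minutes() {
    [ "$(minutes_between 08:30 16:00)" -eq 450 ]
}

if spec_reports_the_length_of_a_work_period_in_minutes; then
    echo "spec met"
else
    echo "spec NOT met"
fi
```

Anybody can rerun the script at any moment to see whether the behaviour still holds.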

Step 2: Implement

After the specification has been written, it's time to think about how to implement it, and then just implement it. The developer will focus on passing just one tiny specification at a time. This is the easiest step in the whole TDD process.

If this step isn't easy, then the developer tried to take too big a step and specified too much new behaviour. In that case he should go back and write a smaller specification. With experience, the developer will learn what size of step is neither too big (so that the step is hard) nor too small (so that progress is slow).

If this step isn't easy, it could also be that the code that needs to be changed is not maintainable enough for this change. In that case the developer should first clean up and reorganize the code, so that making the change will be easy. If the code is already very clean, then only a little reorganizing is needed. If the code is dirty, then it will take more time. Little by little, as the code is being changed, the codebase will get cleaner and stay clean, because otherwise the TDD process will soon grind to a halt.

Step 3: Verify

Now the developer has implemented a couple of lines of code, which he believes will match the specification. Then he needs to verify that the code fulfills its specification. Thanks to the executable specifications, he can just click a button and after a couple of seconds his IDE will report whether the specification has been met.

This step is so quick and easy that it totally changes the way that code can be written. It will make the developers fearless in making changes to code that they do not know, because they can trust that if they break something, they will find it out in a couple of seconds. So whenever they see some bad code, they can clean it up right away, without fear of breaking something. This difference is so overwhelming that it even led Michael Feathers (in his book "Working Effectively with Legacy Code") to define "legacy code" as code without such executable specifications.

Step 4: Cleanup

When the code meets all its specifications, it's time to clean up the code. As Uncle Bob says, "the only way to go fast is to go well". We need to keep the code in top shape, so that making future changes will be easier. We can do this by following the boy scout rule: Always check in code cleaner than when you checked it out.

So when the developer has written some code that works, he will spend a few seconds or minutes in removing duplicated code, choosing more descriptive names, dividing big methods into many smaller methods and so on. Every now and then the developer will notice new structures emerging from the code, so he adjusts his original plans about the design and extracts a new class or reorganizes some existing classes.

Steps 1-4: Design

The specification, implementation and cleanup steps all include designing the code, although in each step the focus is on designing slightly different aspects of the code. As Kent Beck says in his book "Extreme Programming Explained" (2nd Ed. page 105), "far from design nothing, the XP strategy is design always."

In the specification step, the developer is first designing the behaviour of the system, what the system should do. When he is writing the specification, he is designing how the API of the code being implemented will be used.

In the implementation step, the developer is designing the structure of the code, how the code should be structured so that it will do what it should do. In this step the amount of design is quite low, because the goal is to just make the simplest possible change that will achieve the desired behaviour. It is acceptable to write dirty code just to meet the specification, because the code will be cleaned immediately after writing it.

In the cleanup step, the developer is designing the right way to structure the code, how to make the code cleaner and more maintainable. This is where the majority of the design takes place, which also makes the cleanup step the hardest step in the whole TDD process. Thanks to the automatic verification of the specifications, it is possible to evolve the design and architecture of the system in small, safe steps. When improving the design of the system, the system will be working at all times, so it is possible to do even big changes incrementally, without a grand redesign.

Steps 1-4: Maintenance

When using TDD, we are at all times in maintenance mode, because we are changing existing code all the time. Only the first cycle, the first couple of minutes, is purely greenfield code.

This continuous maintenance forces the system to be maintainable, because if it were not maintainable, the TDD process would grind to a halt very soon. On the other hand, waterfall does not force the system to be maintainable, because the maintenance mode comes only after everything else has been done, which means that with waterfall it's possible to write unmaintainable code.

Maybe this is one of the reasons why TDD produces better code, more maintainable code. If some piece of code is not maintainable, it will become apparent very quickly, even before that piece of code has been completed. This early feedback in turn will drive the developer into changing the code to be more maintainable, because he can feel the pain of changing non-maintainable code.


Updated 2009-10-15:

Somebody posted this at Reddit and in the comments there appears to be some confusion about the kinds of specs that I'm referring to in this article and which are useful in TDD. To find out in what style my specs are written, have a look at the TDD tutorial which I have created. To see TDD in action in a non-trivial application, have a look at my current project.

And of course the executable specs are not the only kinds of specifications that a real-life project needs. Just as I said above, they are "a specification of the desired behaviour", not the only specification. TDD specs are written at the level of individual components, which makes them useful for driving the design of the code in the components. They are the lowest level specifications that a system has. But before diving into the code, the project should first have high-level requirements and specifications describing, from a user's point of view, what the system should do. A high-level architectural description is also useful.

I'm also into user interface design, so whenever the system being built will be used by human users, the first thing I'll do in such a project is to gather the goals and tasks of the users. Based on those I will design a user interface specification in the form of a paper prototype, but that would be the topic for a whole other article...