Esko Luontola: 2010

2010-09-22

Let's Code Dimdwarf

I'm starting a new screencast series, Let's Code, where I will be recording myself developing some open source projects. This was inspired by James Shore's Let's Play TDD series and I will try doing something similar. My goal is not to teach the basics of how to do TDD, but to show how one developer does it - in the hope that something can be learned from it. Each episode will be about 25 minutes long ("one pomodoro") and I will try to release a new episode every couple of days, but no promises about that.

The first episode can be seen at my new blog where I will announce all new Let's Code episodes.

P.S. I'll be writing a blog article about my screencast toolchain and my experiences with different video hosting providers. The video quality of YouTube and Vimeo was not good enough for high resolution text (these screencasts are 1440x1080 resolution with font size 16), but Blip.tv was just perfect since they won't re-encode my videos.

2010-07-28

Design for Integrability

There is already the term Design for Testability - it's easy to write tests for the software. I would like to coin a new term, Design for Integrability - it's easy to integrate the system with its external environment. (Yes, integrability is a word.)

Designing for testability is closely linked with how good the design of the code is. A good way to design for testability is to write the tests first, in short cycles, which leads to all code by definition being tested. As a result, the developers have to go through the pains of making the design testable early on, because otherwise it would be hard for them to write tests for it.

Designing for integrability is possible with a similar technique. The book Growing Object-Oriented Software, Guided by Tests (GOOS) presents a style of doing TDD where the project is started by writing end-to-end tests, which are then used in driving the design and getting early feedback (see pages 8-10, 31-37, 84-88 and the code examples). Also the "end-to-end" of the GOOS authors is probably more end-to-end than the "end-to-end" of many others. Quoted from page 9:

For us, "end-to-end" means more than just interacting with the system from the outside - that might be better called "edge-to-edge" testing. We prefer to have the end-to-end tests exercise both the system and the process by which it's built and deployed. An automated build, usually triggered by someone checking code into the source repository, will: check out the latest version; compile and unit-test the code; integrate and package the system; perform a production-like deployment into a realistic environment; and, finally, exercise the system through its external access points. This sounds like a lot of effort (it is), but has to be done anyway repeatedly during the software's lifetime. Many of the steps might be fiddly and error-prone, so the end-to-end build cycle is an ideal candidate for automation. You'll see in Chapter 10 how early in a project we get this working.

A system's interaction with its external environment is often one of the riskiest areas in its development, so the authors of the GOOS book prefer to expose the uncertainties early by starting with a walking skeleton, which contains the basic infrastructure and integrates with the external environment. On page 32 they define walking skeleton as "an implementation of the thinnest possible slice of real functionality that we can automatically build, deploy, and test end-to-end. It should include just enough of the automation, the major components, and communication mechanisms to allow us to start working on the first feature." This forces the team to address the "unknown unknown" technical and organizational risks at the beginning of the project, while there is still time to do something, instead of starting the integration only at the end of the project.

Starting with the integration also means that the chaos related to solving the uncertainties moves from the end of the project to the beginning of the project. But once the end-to-end infrastructure is in place, the rest of the project will be easier. On page 37 there is a nice illustration of this.

I also perceive the end-to-end tests to be helpful in guiding the design of the system towards integrability. When writing software with an inside-out approach to TDD, starting with low-level components and gluing them together until all components are ready, it's possible that once the development reaches high-level components which need to interact with external systems and libraries, the design of the low-level components makes the integration hard. So then you will need to change the design to make integration easier. But when developing outside-in and starting with end-to-end tests, those integration problems will be solved before the rest of the system is implemented - when changing the design is easier.

Listening to the feedback from the end-to-end tests can also improve the management interfaces of the system. Nat Pryce writes in TDD at the System Scale that the things that make writing reliable end-to-end tests hard, are also what makes managing a system hard. He writes: "If our system tests are unreliable, that's a sign that we need to add interfaces to our system through which tests can better observe, synchronise with and control the activity of the system. Those changes turned out to be exactly what we need to better manage the systems we built. We used the same interfaces that the system exposed to the tests to build automated and manual support tools."

By starting with end-to-end tests it's possible to get early feedback and know whether we are moving in the right direction. Also the system will be by definition integrable, because it has been integrated since the beginning.

Note however, that what J.B. Rainsberger says in Integration Tests Are a Scam still applies. You should not rely on the end-to-end tests for the basic correctness of the system, but you should have unit-level tests which in themselves provide good coverage. End-to-end tests take lots of time to execute, so it's impractical to execute them all the time while refactoring (my personal pain threshold for recompiling and running all tests after a one-liner change is less than 5-10 seconds). In the approach of the GOOS authors the emphasis is more on "test-driving" end-to-end than on "testing" end-to-end. See the discussion at Steve Freeman's blog post on the topic (also the comments).

Experience 1

The first project where I have tried this approach is Dimdwarf - a distributed application server for online games. I started by writing an end-to-end test the way I would like to write it (ClientConnectionTest). I configured Maven to unpack the distribution package into a sandbox directory (end-to-end-tests/pom.xml) against which I will then run my end-to-end tests in Maven's integration-test phase. The test double applications which I deploy on the server are in the main sources of the end-to-end-tests module, and I deploy them by copying the JAR file and writing the appropriate configuration file (ServerRunner.deployApplication()). It takes about half a second for the server to start (class loading is what takes most of the time), so the tests will wait until the server prints to the logs that it is ready (ServerRunner.startApplication()). The server is launched in a separate process using ProcessRunner and its stdout/stderr are redirected to the test runner's stdout/stderr and to a StreamWatcher which allows the tests to examine the output using ProcessRunner.waitForOutput(). There is a similar test driver for a client, which connects to the server via a socket, and it has some helper methods for sending messages to the server and checking the responses (ClientRunner). When the test ends, the server process is killed by calling Process.destroy() - after all, it is meant to be crash-only software.

Getting the end-to-end tests in place went nicely. It took 9.5 hours to write the end-to-end test infrastructure (and the tests are now deterministic), plus about 8 hours to build the infrastructure for starting up the server, some reorganizing of the project, and enough of the network layer to get the first end-to-end test to pass (the server sends a login failure message when any client tries to log in). The walking skeleton does not yet integrate with all third-party components that will be part of the final system. For example the system has not yet been integrated with an on-disk database (although the system can be almost fully implemented without it, because the system relies primarily on its internal in-memory database anyway).

It takes 0.8 seconds to run just one end-to-end test, which is awfully slow compared to the unit tests (I could run the rest of the 400+ tests in the same time if JUnit just would run the tests in parallel on 4 CPU cores), in addition to which it takes 13 seconds to package the project with Maven, so the end-to-end tests won't be of much use while refactoring, but they were very helpful in getting the deployment infrastructure ready. I will probably write end-to-end tests for all client communication and a couple of tests for some internal features (for example that persistence works over restarts and that database garbage collection deletes the garbage). The test runner infrastructure should also be helpful in writing tests for non-functional requirements, such as robustness and scalability.

Experience 2

In another project I was coaching a team of 10 university graduate students during a 7-week course (i.e. they had been 3-5 years at university studying computer science - actually I was also a graduate student and had been there for 8 years). We were building a web application using Scala + Lift + CouchDB as our stack. The external systems to which the application connects are its own database and an external web service. We started by writing an end-to-end test which starts up the application and the external web service in their own HTTP servers using Jetty, puts some data - actually just a single string - to the external web service, the application fetches the data from the web service and saves it to the database, after which the test connects to the application using Selenium's HtmlUnitDriver and checks whether the data is shown on the page. All applications were run inside the same JVM and the CouchDB server was assumed to be already running on localhost without any password.

It took a bit over one week (30 h/week × 10 students) to get the walking skeleton up and ~~walking~~ crawling. I was helping with some things, such as getting Maven configured and tests running, but otherwise I was trying to keep away from the keyboard and focus on instructing others on how to do things. I also code reviewed (and refactored) almost all code. Before getting started with the walking skeleton, we had spent about 2 weeks learning TDD, Scala, Lift, CouchDB and evaluating some JavaScript testing frameworks.

The end-to-end tests had lots of nondeterminism and were flaky. Parsing the HTML pages produced by the application made writing tests hard, especially when some of that HTML was generated dynamically with JavaScript and updated with Ajax/Comet. There were conflicts with port numbers and database names, which were found out when the CI server ran two builds in parallel. There were also issues with the testing framework, ScalaTest, which by default creates only one instance of the test class and reuses it for all tests - it took some time hunting weird bugs until we noticed it (the solution is to mix in the OneInstancePerTest trait). It would have been better to start the application-under-test in its own process, because reusing the JVM might also have been the cause for some of the side-effects between the tests, and during the project we did not yet get all deployment infrastructure ready (for example some settings were passed via System.setProperty()).
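One common way to avoid such port and database name conflicts between parallel builds is to let the operating system pick a free port and to give each test run a randomly named database. The sketch below shows the idea in Go; the function names and the "test_db" prefix are assumptions for illustration, not what the project used.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"net"
)

// FreePort asks the operating system for a currently unused TCP port by
// listening on port 0 and reading back the port that was assigned.
func FreePort() (int, error) {
	listener, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer listener.Close()
	return listener.Addr().(*net.TCPAddr).Port, nil
}

// UniqueName appends 16 random lowercase hex digits to prefix, so that
// parallel test runs are very unlikely to pick the same database name.
func UniqueName(prefix string) (string, error) {
	suffix := make([]byte, 8)
	if _, err := rand.Read(suffix); err != nil {
		return "", err
	}
	return prefix + "_" + hex.EncodeToString(suffix), nil
}

func main() {
	port, err := FreePort()
	if err != nil {
		fmt.Println("could not find a free port:", err)
		return
	}
	name, err := UniqueName("test_db")
	if err != nil {
		fmt.Println("could not generate a name:", err)
		return
	}
	fmt.Println("port:", port, "database:", name)
}
```

There is still a small window between closing the listener and the server binding to the port, during which another process could take it. Where the server API allows it, handing the open listener directly to the server avoids that race.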

We were also faced with far too many bugs (2-3 in total, I think) in the Specs framework, which prompted me to write a ticket for "DIY/NIH testing framework", later named Specsy, which I have been working on slowly since then. Because none of the "after blocks" in Specs really worked after every test execution, I had to use shutdown hooks to write a hack which deletes the temporary CouchDB databases after the tests are finished and the JVM exits. We used to have hundreds of stale databases with randomly generated names, because the code which was supposed to clean up after an integration test was not being executed.

The test execution times also increased towards the end of the project. One problem was that Scala is slow to compile and the end-to-end tests did a full build with Maven, which took over a minute. Another (smaller) problem was that some of the meant-to-be unit tests were needlessly using the database when it should have been faked (IIRC, it took over 10 seconds to execute the non-end-to-end tests). Let's hope that the Scala compiler will be parallelized in the near future (at least it's on a TODO list), so that the compile speeds would be more tolerable.

All in all, I think the end-to-end tests were effective in finding problems with the design of the system and the tests themselves. It requires much from the development team to write good, reliable tests. The system should now have quite good test coverage, so that its development can continue - starting with some cleaning up of the design and improving the tests.

2010-05-08

Choice of Words in Testing Frameworks (...and how many get it wrong, including RSpec)

One [word] to rule them all... and in the darkness bind them.

I want my testing framework to be able to express the ideas that I have in my mind in the best possible way. This includes giving the tests the best possible names. Unfortunately, lots of testing frameworks force the developer to start or end his test names with predefined words - such as "describe", "it", "should", "given", "when", "then". This can be harmful, because they incline the user to write his test names always following the same structure, even in situations where that way of structuring tests is suboptimal.

Predefined words produce twisted sentences

Here are some trivial counterexamples of specification-style test names, to prove that requiring the test to start with a predefined word will sometimes lower the quality of the test names. Here is a specification of Fibonacci numbers which is taken straight from their Wikipedia article:

Fibonacci numbers:
- The first two Fibonacci numbers are 0 and 1
- Each remaining number is the sum of the previous two

RSpec requires the test fixtures to start with the word "describe" and the tests with "it", in addition to which RSpec's documentation encourages the tests to start with "it should". Let's try to twist the above specification of Fibonacci numbers into that style. Here is my best attempt which still holds the same information:

Describe Fibonacci numbers:
- it should have the first two numbers be 0 and 1
- it should have each remaining number be the sum of the previous two

Urgh. Totally unnatural way of saying the same information. Lots of unnecessary words need to be added to make the test names full sentences.

If we use example-style, then it's possible to write the test names, but valuable information is lost:

Describe Fibonacci numbers:
- it should calculate the first two numbers
- it should calculate each remaining number

The problem appears to be that "it" is forced to be the subject of the sentence. We can get around that restriction by adding more "describe" elements, but then the tests become awfully verbose without adding any new information:

Describe Fibonacci numbers:
- Describe the first number:
  - it should be 0
- Describe the second number:
  - it should be 1
- Describe each remaining number:
  - it should be the sum of the previous two

Here is another example, written in a slightly different style:

Stack:
- An empty stack
    - is empty
    - After a push, the stack is no longer empty
- When objects have been pushed onto a stack
    - the object pushed last is popped first
    - the object pushed first is popped last
    - After popping all objects, the stack is empty

And the same using RSpec's predefined words:

Describe stack:
- Describe an empty stack
    - it should be empty
    - it should, after a push, be no longer empty
- Describe a stack onto which objects have been pushed
    - it should pop first the object pushed last
    - it should pop last the object pushed first
    - it should, after popping all objects, be empty

It was necessary to change the order of some of the sentences for them to make sense, and it was not natural to write the "it should, after..." tests - their sentence order should have been changed to make them more natural, but then the effect would have been before the cause, which is not good either. Also the subject of some sentences had to be changed from "the object pushed last" to "it" (i.e. stack) and the subject of the old sentence became the object of the new sentence.

The testing framework should obey the developer, not the other way around! The developer is the one who knows best how to make a sentence convey his intent. A testing framework which forces the developer to use a predefined style of writing his sentences is immoral!

Predefined words do not improve the test names

What about example-style test names? Predefined words are equally bad for them. Here are Uncle Bob's Bowling Game Kata's test names in RSpec format:

Describe bowling game:
- it should score gutter game
- it should score all ones
- it should score one spare
- it should score one strike
- it should score perfect game

Adding "it should" does not improve the test names or make their intent any clearer. It just adds lots of duplication and becomes background noise. The framework will not magically make a person who writes example-style tests suddenly start writing specification-style tests.

What about implementation-style tests? I have seen lots of implementation-style tests written in an "it should have" pseudo-specification-style like this:

Describe person:
- it should have name
- it should not allow null name
- it should have age
- it should have address
- it should save
- it should load
- it should calculate pay

Writing implementation-style tests is still perfectly possible. A framework alone can't make the developer better. He must first understand the philosophy behind the framework and how to write expressive tests, before his tests will get any better.

Many testing frameworks get it wrong

behaviour-driven.org says that "Getting the words right" was the starting point for the development of BDD, so it is absurdly ironic that lots of BDD frameworks get the words wrong.

First and foremost, RSpec gets its words wrong by forcing the tests to start with "describe" and "it", as described above. And because RSpec has become popular and was one of the first BDD frameworks, lots of other BDD frameworks copy RSpec and use the same predefined words. They just repeat mindlessly what others have done, without stopping to think why the things were done that way. They become angry monkeys and cargo-cults, which annoy me very much.

Many BDD acceptance testing frameworks force the use of words "given", "when", "then". For example Cucumber does this. Decomposing actions into those three parts gives you state transition tables. This is a very explicit way of defining actions, but also very verbose. Added verbosity does not always make things easier to read; on the contrary, it can make it harder to see what is really important. As said in a famous quote:

In anything at all, perfection is finally attained not when there is no longer anything to add, but when there is no longer anything to take away.
- Antoine de Saint-Exupéry

And another one:

Many attempts to communicate are nullified by saying too much.
- Robert Greenleaf

I have even found a framework which adds predefined words as suffixes to the test names. In the Specs framework for Scala, the top-level test names end with "should", which is even printed in all the reports (unlike RSpec's "it"). Thankfully there is a workaround to avoid that suffix.

Frameworks should not limit the developer

A good testing framework will allow the developer to choose for himself how to write his test names. Some examples of testing frameworks which do not force the use of predefined words are JUnit 4 and JDave. But those two still force a fixed level of test nesting - JUnit 4 has no nesting and JDave has one level of nesting. Of the frameworks that I know, the least restrictive one is GoSpec, which I wrote myself with that as a goal.

When designing GoSpec, my goals were to allow unlimited levels of nesting, and to not force the test names to start with any predefined words. In Scala and other similarly expressive languages it would be easy to cope without any predefined words. For example I like in Specs the "test name" in { ... } style, which can also be written using an unpronounceable symbol "test name" >> { ... }. Unfortunately Go's syntax is not as flexible, so I was satisfied with prefixing each test name with "Specify". I chose that word to be such that starting a sentence with it would be totally unnatural, so that the developers would be inclined to just ignore it. Also all the examples of using that framework are written so that they do not include that word in the test names.

In a future article I will write about my current ideals for a unit testing framework. One of the primary goals is allowing the developer to use any style that is best for the situation, but there are also other goals (for a sneak peek, see the project goals in GoSpec's README).

2010-04-10

Direct and Indirect Effects of TDD

Some time ago @unclebobmartin tweeted about the direct and indirect effects of TDD:

TDD guarantees test coverage.
3:17 PM Jan 31st

TDD helps with, but does not guarantee, good design & good code. Skill, talent, and expertise remain necessary.
3:25 PM Jan 31st

I agree with the above, but I also felt that there was still something missing, because I did not see a direct relation from good test coverage to good design - it's possible to get high test coverage even with test-last approaches, but that does not help with the design the way TDD does. So that made me think about what the direct effects of TDD are, and how they indirectly help with the design.

Here are some effects that I've noticed TDD to have, divided into direct and indirect effects. If you have noticed some more direct or indirect effects, please leave a comment.

Direct effects, given just following the three rules of TDD:

Guarantees code coverage
Amplifies the pain caused by bad code

Indirect effects, given skilled enough programmers using TDD:

Enables changing the code without breaking it
Improves the quality of the code

Direct: Guarantees code coverage

From the three rules of TDD it's easy to see that no application logic will come into existence unless some test first covers it. So if somebody just follows these rules with discipline, code coverage is guaranteed. That result is independent of the skills of the developer*.

* It will be hard for a very unskilled and undisciplined developer to follow the rules, but that's beside the point. ;)

Direct: Amplifies the pain caused by bad code

In TDD, after writing a failing test, the next step is to make it pass with the simplest possible change - the code does not need to be elegant, because it's meant to be refactored later. Generally there is very little thinking before writing some code (just a rough idea where the project is heading), but instead most of code design is meant to be done after writing the code. Also, to be able to test something, the code needs to be testable - in other words low coupling and high cohesion.

The above leads to TDD requiring existing code to be changed continuously - it's like being in maintenance mode all the time. So if the code is not maintainable, it will be hard to change. Also if the code is not testable, it will be hard to write tests for it. If the code were not written iteratively and no tests were written for it, life would be much easier for the developer. In other words, TDD increases the pain caused by bad code, because it's not possible to avoid changing the code, nor to avoid writing tests for it.

Amplifying pain might seem like a bad idea, but actually that's one of TDD's best assets. :) Keep on reading...

Indirect: Enables changing the code without breaking it

The direct effect of code coverage makes it possible to notice when something breaks. But that alone is not enough for changing the system safely. It requires skill to be able to modify the code in small, safe steps. The developer needs the ability to do even big design changes by combining multiple small refactorings*, so that the tests pass after every refactoring. This requires skill and discipline. An unskilled or undisciplined developer would get stuck in refactoring hell, or would give up and abandon the test suite.

* Programming can be thought of as the process of solving a big problem by combining multiple small elementary pieces (such as conditionals, statements and libraries). In refactoring the elementary pieces are small transformations of the source code's structure which preserve its observed behaviour (rename, extract method, move field etc.). In this sense also mathematics requires similar thinking (for example prove a conjecture by combining theorems and axioms). This kind of problem solving requires first creativity and intuition to get an idea of the solution, and then discipline and attention to detail to implement the solution; two opposite personality traits.

Indirect: Improves the quality of the code

The direct effect of amplified pain and the indirect effect of making safe changes together make it possible to improve the code quality. The important point is "listening to the tests". When something is painful while doing TDD, you should be sensitive enough to notice the pain and then react to it by fixing whatever was causing that pain.

Growing Object-Oriented Software, Guided by Tests says on page 245 under the subheading "What the Tests Will Tell Us (If We're Listening)", commenting on somebody who was suffering from unreadable tests, up to 1000 lines long test classes, and refactoring leading to massive changes in test code:

Test-driven development can be unforgiving. Poor quality tests can slow development to a crawl, and poor internal quality of the system being tested will result in poor quality tests. By being alert to the internal quality feedback we get from writing tests, we can nip this problem in the bud, long before our unit tests approach 1000 lines of code, and end up with tests we can live with. Conversely, making an effort to write tests that are readable and flexible gives us more feedback about the internal quality of the code we are testing. We end up with tests that help, rather than hinder, continued development.

Also Michael Feathers says in an interview:

It's something that people don't talk about enough and it seems like particularly in TDD, there is a really great thing that you notice that if something hurts when you are doing TDD, it often means that it's an indication of something wrong with the design. Since people are so drawn and they say "OK, this is kind of painful. It must mean that the TDD sucks." In fact, there is a way of going and getting feedback about a thing you are really working on, if you pay attention to the pain.

Noticing the pain as soon as possible and then fixing the problem - whether it is a rigid design, fragile tests or something else - requires skill. Not everybody is alert to the pain, but instead they keep on writing bad code until making changes becomes too expensive and a rewrite is needed. Not everybody fixes the problem when they feel the pain, but instead they implement a quick hack and leave an even bigger mess for the next developer. But for those who have the necessary skills and discipline, TDD can be a powerful tool and they can use it to write better code.

2010-02-13

Three Styles of Naming Tests

I have now used TDD for about 3 years, during which time I've come to notice three different styles of naming and organizing tests. In this article I'll explain the differences between what I call specification-style, example-style and implementation-style.

Tests as a specification of the system's behaviour

Specification-style originates from Behaviour-Driven Development (BDD) and it's the style that I use 90% of the time. It can be found among practitioners of BDD (for some definition of BDD), so it could also be called "BDD-style". However, just using a BDD framework does not mean that you write your tests in this style. For example the examples on Cucumber's front page (no pun intended) are more in example-style than in specification-style. (By the way, I don't buy into the customer-facing requirements-analysis side of BDD, because in my opinion interaction design is much better suited for it.)

In specification-style the tests are considered to be a specification of the system's behaviour. The test names should be sentences which describe what the system should do - what the system's features are. Just by reading the names of the tests, it should be obvious what the system does, even to such an extent that somebody can implement a similar system just by looking at the test names and not their body.

When a test fails, there are three options: (1) the implementation is broken and should be fixed, (2) the test is broken and should be fixed, (3) the test is no longer needed and should be removed. If the test has been written in specification-style, then knowing what to do is simple. Just read the name of the test and decide whether that piece of behaviour is still needed. If it is, then you keep the same test name, but change the implementation or test code. If it is not, for example if the specified behaviour conflicts with some new desired behaviour, then you can remove the test and double-check all other tests in the same file, in case some of them should also be updated.

Here are some examples of specification-style tests (using Go and GoSpec). A test for Fibonacci numbers could look like this:

func FibSpec(c gospec.Context) {
    fib := NewFib().Sequence(10)

    c.Specify("The first two Fibonacci numbers are 0 and 1", func() {
        c.Expect(fib[0], Equals, 0)
        c.Expect(fib[1], Equals, 1)
    })
    c.Specify("Each remaining number is the sum of the previous two", func() {
        for i := 2; i < len(fib); i++ {
            c.Expect(fib[i], Equals, fib[i-1] + fib[i-2])
        }
    })
}

If you look at the Wikipedia entry for Fibonacci numbers, you will notice that the above test names are directly taken from there. This is how Wikipedia defines the Fibonacci numbers: "By definition, the first two Fibonacci numbers are 0 and 1, and each remaining number is the sum of the previous two. Some sources omit the initial 0, instead beginning the sequence with two 1s." The test names should document the same specification.
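For completeness, here is one possible implementation that would satisfy those two sentences. It is only a sketch: it uses a plain function named Sequence instead of the NewFib().Sequence(10) call in the test above, whose implementation is not shown in this article.

```go
package main

import "fmt"

// Sequence returns the first n Fibonacci numbers, starting from 0 and 1.
func Sequence(n int) []int {
	if n <= 0 {
		return nil
	}
	fib := make([]int, 0, n)
	for i := 0; i < n; i++ {
		if i < 2 {
			// The first two Fibonacci numbers are 0 and 1
			fib = append(fib, i)
		} else {
			// Each remaining number is the sum of the previous two
			fib = append(fib, fib[i-1]+fib[i-2])
		}
	}
	return fib
}

func main() {
	fmt.Println(Sequence(10)) // prints [0 1 1 2 3 5 8 13 21 34]
}
```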

Each test focuses on a single piece of behaviour

One more example, in the same language and same framework, this time on stacks (ignore the comments for now). This is how I typically organize my tests:

func StackSpec(c gospec.Context) {
    stack := NewStack()

    c.Specify("An empty stack", func() { // Given

        c.Specify("is empty", func() { // Then
            c.Expect(stack.Empty(), IsTrue)
        })
        c.Specify("After a push, the stack is no longer empty", func() { // When, Then
            stack.Push("foo")
            c.Expect(stack.Empty(), IsFalse)
        })
    })

    c.Specify("When objects have been pushed onto a stack", func() { // Given, (When)
        stack.Push("one")
        stack.Push("two")

        c.Specify("the object pushed last is popped first", func() { // (When), Then
            x := stack.Pop()
            c.Expect(x, Equals, "two")
        })
        c.Specify("the object pushed first is popped last", func() { // (When), Then
            stack.Pop()
            x := stack.Pop()
            c.Expect(x, Equals, "one")
        })
        c.Specify("After popping all objects, the stack is empty", func() { // When, Then
            stack.Pop()
            stack.Pop()
            c.Expect(stack.Empty(), IsTrue)
        })
    })
}

(Note that GoSpec isolates the child specs from their siblings, so that they can safely mutate common variables. This was one of the design principles for GoSpec, and it enables the framework to be used the way I prefer to write specification-style tests. The other important ones are: allow tests to be nested without limit, and do not force the test names to begin or end with some predefined word.)

Each test has typically three parts: Arrange, Act, Assert. In BDD vocabulary they are often identified by the words Given, When, Then.

I've found it useful to arrange the tests so that the Arrange and Act parts are in the parent fixture, and then have multiple Asserts each in its own test. Organizing the tests like this follows the spirit of the One Assertion Per Test principle (more precisely, one concept per test). When each test tests only one behaviour, it makes the reason for a test failure obvious. When a test fails, you will know exactly what is wrong (it isolates the reason for failure) and you will know whether the behaviour specified by the test is still needed, or whether it is obsolete and the test should be removed.
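To illustrate the arrangement without depending on GoSpec, here is a minimal sketch in plain Go. The Stack type, the spec struct and the runSpecs helper are hypothetical names made up for this illustration, not part of GoSpec or any other library. The shared Arrange/Act step lives in one function which is called again for every spec, so each spec gets a fresh fixture, makes one assertion, and may mutate the stack freely.

```go
package main

import "fmt"

// Stack is a hypothetical minimal stack, used only for this illustration.
type Stack struct{ items []string }

func (s *Stack) Push(x string) { s.items = append(s.items, x) }

func (s *Stack) Pop() string {
    last := len(s.items) - 1
    x := s.items[last]
    s.items = s.items[:last]
    return x
}

func (s *Stack) Empty() bool { return len(s.items) == 0 }

// spec pairs a behaviour-describing name with a single check (the Then part).
type spec struct {
    name  string
    check func(s *Stack) bool
}

// runSpecs calls arrange once per spec, so that sibling specs cannot affect
// each other. It returns the full names of the specs whose check failed.
func runSpecs(given string, arrange func() *Stack, specs []spec) []string {
    var failed []string
    for _, sp := range specs {
        if !sp.check(arrange()) {
            failed = append(failed, given+" "+sp.name)
        }
    }
    return failed
}

// twoPushes is the shared Given/When part of the specs below.
func twoPushes() *Stack {
    s := &Stack{}
    s.Push("one")
    s.Push("two")
    return s
}

var pushedSpecs = []spec{
    {"the object pushed last is popped first", func(s *Stack) bool {
        return s.Pop() == "two"
    }},
    {"the object pushed first is popped last", func(s *Stack) bool {
        s.Pop()
        return s.Pop() == "one"
    }},
    {"after popping all objects, the stack is empty", func(s *Stack) bool {
        s.Pop()
        s.Pop()
        return s.Empty()
    }},
}

func main() {
    failed := runSpecs("When objects have been pushed onto a stack,", twoPushes, pushedSpecs)
    fmt.Println("failed specs:", len(failed))
    for _, name := range failed {
        fmt.Println(" -", name)
    }
}
```

Because every failure is reported by its full name, a failing spec points at exactly one behaviour, which is the point of keeping one concept per test.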

Quite often I use the words Given, When and Then in the test names, because they are part of BDD's ubiquitous language. But I always put more emphasis on making the tests readable and choosing the best possible words. So when it is obvious from the sentence, I may choose to

omit the Given/When/Then keywords,
group the Given and When parts together,
group the When and Then parts together, or even
group all three parts together.

In the above stack example, I have marked with comments which of the specs is technically a Given, When or Then. As you can see, there is a distinct structure, but also much flexibility. The word "should" I dropped a long time ago, after my second TDD project, because it was just adding noise to the test names without adding value. The value is in focusing on the behaviour, not in using some predefined words.

The specification should be decoupled from the implementation

Specification-style tests focus on the desired behaviour, or feature, at the problem domain's level of abstraction, and try to be as decoupled from the implementation as possible. The tests should not contain any implementation details (for example method names and parameters), because those implementation details are what will be designed after the test's name has been written. If the test's name already fixes the use of some implementation details (for example whether a method accepts null parameters), then refactoring the code will be harder, because it will force us to update the tests. Coupling tests to the implementation leads to implementation-style tests.

When the tests focus on the desired behaviour, then when refactoring, you won't need to change the name of the test, but only the body of the test (when refactoring affects the implementation's public interface). If you're doing a rewrite, then you may even be able to reuse the old test names - which is helpful, because thinking of the test name is what usually takes the most time in writing a test, because that is when you think about what the system should do. (If choosing the name does not take the most time, then you're not thinking about it enough, or you're writing test code that is too complex, which is a test smell indicating that the production code is too complex.)

For example, have a look at SequentialDatabaseAccessSpec and ConcurrentDatabaseAccessSpec. These are tests which I wrote 1½ years ago, and in the near future the subsystem that those tests specify will be rewritten, as the application's architecture will change from shared-state to message-passing and the programming language will mostly change from Java to Scala. Here are the names of those tests:

SequentialDatabaseAccessSpec:

When database connection is opened
  - the connection is open
  - only one connection exists per transaction
  - connection can not be used after prepare
  - connection can not be used after commit
  - connection can not be used after rollback
  - connection can not be used after prepare and rollback

When entry does not exist
  - it does not exist
  - it has an empty value

When entry is created
  - the entry exists
  - its value can be read

When entry is updated
  - its latest value can be read

When entry is deleted
  - it does not exist anymore
  - it has an empty value


ConcurrentDatabaseAccessSpec:

When entry is created in a transaction
  - other transactions can not see it
  - after commit new transactions can see it
  - after commit old transactions still can not see it
  - on rollback the modifications are discarded
  - on prepare and rollback the locks are released

When entry is updated in a transaction
  - other transactions can not see it
  - after commit new transactions can see it
  - after commit old transactions still can not see it
  - on rollback the modifications are discarded
  - on prepare and rollback the locks are released

When entry is deleted in a transaction
  - other transactions can not see it
  - after commit new transactions can see it
  - after commit old transactions still can not see it
  - on rollback the modifications are discarded
  - on prepare and rollback the locks are released

If two transactions create an entry with the same key
  - only the first to prepare will succeed
  - only the first to prepare and commit will succeed

If two transactions update an entry with the same key
  - only the first to prepare will succeed
  - only the first to prepare and commit will succeed
  - the key may be updated in a later transaction

If two transactions delete an entry with the same key
  - only the first to prepare will succeed
  - only the first to prepare and commit will succeed

When the above components are rewritten using a new architecture, new language and different programming paradigm, most of those test names will stay the same, because they are based on the problem domain of transactional database access, and not any implementation details such as the architecture, programming language, or *gasp* individual classes and methods.

In the above tests there will be only minor changes:

The first fixture of SequentialDatabaseAccessSpec may be removed, or moved to some different test, because in the new architecture opening a database connection will be quite different (implicit instead of explicit). Actually it should have been put into its own test class, named DatabaseConnectionSpec, already when it was written, because it is very much different from the focus of the rest of SequentialDatabaseAccessSpec.
In ConcurrentDatabaseAccessSpec, the test saying "on prepare and rollback the locks are released" will be removed, because the new architecture will not need any locks. The use of locks is an implementation detail and these specs were not fully decoupled from it.

What is a "unit test"?

The above example also raises the question about the size of a "unit" in a unit test. For me "a unit" is always "a behaviour". It never is "a class" or "a method" or similar implementation detail. Although following the Single Responsibility Principle often leads to one class dealing with one behaviour, that is a side-effect of following SRP and not something that would affect the way the tests are structured.

The two test classes in the above example exercise about 15 concrete production classes (excluding JDK classes) from 2 subsystems (transactions and database). The lower-level components among those 15 classes have been tested individually (the transaction subsystem has its own tests, as do the couple of data structures which were used as stepping stones for the in-memory database), because the higher-level tests will not cover the lower-level components thoroughly, and in any case I wrote those tests to drive the design of the lower-level components with TDD. So the new code produced by those two test classes is about 5 production classes (originally about 3, but they were split to follow SRP).

From TDD's point of view, it's very important to be able to run all tests quickly, in a couple of seconds (more than 10-20 seconds will make TDD painful). On my machine SequentialDatabaseAccessSpec takes about 15 ms to execute (1.2 ms/test) and ConcurrentDatabaseAccessSpec about 120 ms (5.5 ms/test). I prefer tests which execute in 1 ms or less. If it takes much longer, then I'll try to decouple the system so that I can test it in smaller parts using test doubles. So to me the "unit" in a "unit test" is one behaviour, with the added restriction that its tests can be executed quickly.

More on specification-style

To learn more about how to write tests in specification-style, do the tutorial at http://github.com/orfjackal/tdd-tetris-tutorial and also have a look at its reference implementation and tests.

Update 2010-03-17: I just started reading Growing Object-Oriented Software, Guided by Tests and I'm happy to notice that the authors of that book also prefer specification-style. In chapter 21 "Test Readability", page 249, under the subheading "Test Names Describe Features", they have the following test names:

ListTests:

- holds items in the order they were added
- can hold multiple references to the same item
- throws an exception when removing an item it doesn't hold

If you notice more books which use specification-style, please leave a comment.

Update 2015-11-01: The excellent presentation What We Talk About When We Talk About Unit Testing by Kevlin Henney also recommends specification-style.

Tests as examples of system usage

Example-style is perhaps the most popular among TDD'ers, maybe because many books, tutorials and proponents of TDD use this style. It's also quite easy to name tests with example-style, while at the same time being much better than implementation-style. I use this style maybe 10% of the time, usually to cover a corner case which I find too hard to name using specification-style, or for which the specification-style name would be too verbose without added value as documentation. For some situations one style fits better than the other.

In example-style the tests are considered to be examples of system usage, or scenarios of using the system. The test names tell what the scenario is, and you will need to read the body of the test to find out how the system will behave in that scenario. The test names are not a direct specification; to arrive at the specification, you will need to read the tests and reverse-engineer and generalize the behaviour that is happening in those tests.

A famous example which is written in example-style is Uncle Bob's Bowling Game Kata. There the test names are:

testGutterGame
testAllOnes
testOneSpare
testOneStrike
testPerfectGame

Now that you have read the test names, can you tell me the scoring rules of bowling? You can't? Exactly! That is what sets example-style apart from specification-style. In example-style you would need to reverse-engineer the scoring rules of bowling from the test code. In specification-style the test names would tell the scoring rules directly.

My domain knowledge about bowling is not good enough for me to write good specification-style tests for it, but they might look something like the list below. I took the scoring rules from the Bowling Game Kata's page 2 and reworded them a bit.

The game has 10 frames

In each frame the player has 2 opportunities (rolls) to knock down 10 pins

When the player fails to knock down some pins
  - the score is the number of pins knocked down

When the player knocks down all pins in two tries
  - he gets spare bonus: the value of the next roll

When the player knocks down all pins on his first try
  - he gets strike bonus: the value of the next two rolls

When the player does a spare or strike in the 10th frame
  - he may roll an extra ball to complete the frame
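As a rough illustration of how such names could sit next to runnable checks, here is a minimal sketch in plain Go. The Score and game functions and the specs table are hypothetical and written only for this example; Score assumes a complete, valid game given as a flat list of rolls, and the last spec only covers a spare in the 10th frame.

```go
package main

import "fmt"

// Score computes the total score of a complete, valid game.
// rolls lists the pins knocked down by each roll, in order.
func Score(rolls []int) int {
    total, i := 0, 0
    for frame := 0; frame < 10; frame++ {
        switch {
        case rolls[i] == 10: // strike: bonus is the next two rolls
            total += 10 + rolls[i+1] + rolls[i+2]
            i++
        case rolls[i]+rolls[i+1] == 10: // spare: bonus is the next roll
            total += 10 + rolls[i+2]
            i += 2
        default: // open frame: just the pins knocked down
            total += rolls[i] + rolls[i+1]
            i += 2
        }
    }
    return total
}

// game pads the given rolls with gutter balls up to 20 rolls.
func game(rolls ...int) []int {
    for len(rolls) < 20 {
        rolls = append(rolls, 0)
    }
    return rolls
}

// specs uses the scoring rules as names; the rolls are one example of each rule.
var specs = []struct {
    name  string
    rolls []int
    want  int
}{
    {"When the player fails to knock down some pins, the score is the number of pins knocked down",
        game(3, 4), 7},
    {"When the player knocks down all pins in two tries, he gets spare bonus: the value of the next roll",
        game(5, 5, 3), 16},
    {"When the player knocks down all pins on his first try, he gets strike bonus: the value of the next two rolls",
        game(10, 3, 4), 24},
    {"When the player does a spare in the 10th frame, he may roll an extra ball to complete the frame",
        append(make([]int, 18), 5, 5, 7), 17},
}

func main() {
    for _, s := range specs {
        status := "ok  "
        if got := Score(s.rolls); got != s.want {
            status = "FAIL"
        }
        fmt.Println(status, s.name)
    }
}
```

Reading only the names in the specs table gives the scoring rules, and the rolls next to each name are there just to make the rule checkable.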

Here is another example of example-style, this time from JUnit's AssertionTest class.

arraysExpectedNullMessage
arraysActualNullMessage
arraysDifferentLengthMessage
arraysDifferAtElement0nullMessage
arraysDifferAtElement1nullMessage
arraysDifferAtElement0withMessage
arraysDifferAtElement1withMessage
multiDimensionalArraysAreEqual
multiDimensionalIntArraysAreEqual
oneDimensionalPrimitiveArraysAreEqual
oneDimensionalDoubleArraysAreNotEqual
...

This is a good illustration of a situation where example-style is useful: corner cases. In English it would be possible to describe the behaviour specified by AssertionTest with one sentence. Even though there are lots of corner cases, they all are semantically very similar. Writing these tests in specification-style would be impractically verbose. Here are the specification-style tests for a generic assertion:

When the expected and actual value are equal
  - the assertion passes and does nothing

When the expected and actual value differ
  - the assertion fails and throws an exception
  - the exception has the actual value
  - the exception has the expected value
  - the exception has an optional user-defined message

Repeating that specification for every corner case is not practical, because it would just be more verbose but without any added documentation value. That's why in this case it would make more sense to write one use case with specification-style and the rest of the use cases in example-style (this is how I did it with GoSpec's matchers). Or since this particular problem domain is quite simple, just leave out the specifications and use only example-style.
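A minimal sketch of what the example-style half of that mix could look like in plain Go is shown here. The assertArrayEquals function, its error messages and the examples table are all hypothetical, made up for this illustration; they are not JUnit's or GoSpec's actual API or wording. Each row is one named corner case, and the expected message stands in for reading the body of the test.

```go
package main

import (
    "errors"
    "fmt"
)

// assertArrayEquals is a hypothetical assertion. It returns nil when the
// slices are equal, or an error describing the first difference found.
func assertArrayEquals(expected, actual []int) error {
    if expected == nil && actual == nil {
        return nil
    }
    if expected == nil {
        return errors.New("expected array was nil")
    }
    if actual == nil {
        return errors.New("actual array was nil")
    }
    if len(expected) != len(actual) {
        return fmt.Errorf("array lengths differed: expected %d, actual %d",
            len(expected), len(actual))
    }
    for i := range expected {
        if expected[i] != actual[i] {
            return fmt.Errorf("arrays first differed at element %d: expected %d, actual %d",
                i, expected[i], actual[i])
        }
    }
    return nil
}

// examples lists the corner cases in example-style: the name tells the
// scenario, and wantErr tells the outcome (empty means the assertion passes).
var examples = []struct {
    name             string
    expected, actual []int
    wantErr          string
}{
    {"arraysAreEqual", []int{1, 2}, []int{1, 2}, ""},
    {"arraysExpectedNil", nil, []int{1}, "expected array was nil"},
    {"arraysActualNil", []int{1}, nil, "actual array was nil"},
    {"arraysDifferentLength", []int{1}, []int{1, 2},
        "array lengths differed: expected 1, actual 2"},
    {"arraysDifferAtElement0", []int{1, 2}, []int{3, 2},
        "arrays first differed at element 0: expected 1, actual 3"},
    {"arraysDifferAtElement1", []int{1, 2}, []int{1, 3},
        "arrays first differed at element 1: expected 2, actual 3"},
}

// failedExamples runs every example and returns the names of those whose
// outcome did not match wantErr.
func failedExamples() []string {
    var failed []string
    for _, ex := range examples {
        got := ""
        if err := assertArrayEquals(ex.expected, ex.actual); err != nil {
            got = err.Error()
        }
        if got != ex.wantErr {
            failed = append(failed, ex.name)
        }
    }
    return failed
}

func main() {
    fmt.Println("failed examples:", failedExamples())
}
```

The table keeps each corner case to one line, which is why example-style stays readable here while a full specification per case would not.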

Tests reflecting the implementation of the system

Implementation-style is typical in test-last codebases and with people new to TDD who are still thinking about the implementation before the test. I never use this style. It was only in my very first TDD project that I also wrote implementation-style tests (for example it had tests for setters), but at least I knew about BDD and was aware of my shortcomings and tried to aim for specification-style. (It took about one year and seven projects to fine-tune my style of writing tests, after which I wrote tdd-tetris-tutorial.)

In implementation-style the tests are considered to be verifying the implementation - i.e. the tests are considered to be just tests. There is a direct relation from the implementation classes and methods to the test cases. By reading the test names you will be able to guess what methods a class has.

Typically the test names start with the name of the method being tested. Since nearly always more than one test case is needed to cover a method, people tend to append the method parameters to the test name, or append a sequential number, or *gasp* put all test cases into one test method.

As an example of implementation-style, here are some of the test cases from Project Darkstar's TestDataServiceImpl class. I know for sure that Darkstar has been written test-last, in addition to which most of its tests are integration tests (it takes 20-30 minutes to run them all, which makes it painful for me to make my changes with TDD).

testConstructorNullArgs
testConstructorNoAppName
testConstructorBadDebugCheckInterval
testConstructorNoDirectory
...
testGetName
testGetBindingNullArgs
testGetBindingEmptyName
testGetBindingNotFound
testGetBindingObjectNotFound
testGetBindingAborting
testGetBindingAborted
testGetBindingBeforeCompletion
testGetBindingPreparing
testGetBindingCommitting
testGetBindingCommitted
...

From the above test names it's possible to guess that there is a class DataServiceImpl which has a constructor which takes as parameters at least an app name, a debug check interval and some directory. It's not clear which values are valid for them and whether null arguments are allowed or not. We can also guess that the DataServiceImpl class has methods getName and getBinding, the latter of which probably takes a name as parameter. With getBinding it's possible that "something is not found" or "an object is not found". The getBinding method's behaviour also appears to depend on the state of the current transaction. It's not clear how it should behave in any of those states.

Implementation-style is bad compared to example-style and specification-style, because implementation-style is not useful as documentation - it does not tell how the system should behave or how to use it - which in turn makes it hard to know what to do when a test fails. Also implementation-style couples the tests to the implementation, which makes it hard to refactor the code; if you rename a method, you need to also rename the tests. If you do big structural refactorings, you must rewrite the tests. And when you rewrite the tests, the old tests are of little benefit in knowing which new tests to write.

Summary

Specification-style test names describe how the system will behave in different situations. By reading the test names it will be possible to implement the system. When a test fails, the test name will tell which behaviour is specified by that test, after which it's possible to decide whether that test is still needed. The test names use the problem domain's vocabulary and do not depend on implementation details.

Example-style test names describe which special cases or scenarios the system should handle. You will need to read the body of the test to find out how the system should behave in those situations.

Implementation-style test names tell what methods and classes the system has. It will be very hard to find out from the tests which situations the system should handle and how it should behave in those situations. Refactorings require you to also change the tests.