Generated Tests and TDD
Posted by Uncle Bob on 01/10/2008
TDD has become quite popular, and many companies are attempting to adopt it. However, some folks worry that it takes a long time to write all those unit tests and are looking to test-generation tools as a way to decrease that burden.
The burden is not insignificant. FitNesse, an application created using TDD, is comprised of 45,000 lines of Java code, 15,000 of which are unit tests. Simple math suggests that TDD increases the coding burden by a full third!
Of course this is a naive analysis. The benefits of using TDD are significant, and far outweigh the burden of writing the extra code. But that 33% still feels “extra” and tempts people to find ways to shrink it without losing any of the benefits.
Test Generators
Some folks have put their hope in tools that automatically generate tests by inspecting code. These tools are very clever. They generate random calls to methods and remember the results. They can automatically build mocks and stubs to break the dependencies between modules. They use remarkably clever algorithms to choose their random test data. They even provide ways for programmers to write plugins that adjust those algorithms to be a better fit for their applications.
The end result of running such a tool is a set of observations. The tool observes how the instance variables of a class change when calls are made to its methods with certain arguments. It notes the return values, changes to instance variables, and outgoing calls to stubs and mocks. And it presents these observations to the user.
The user must look through these observations and determine which are correct, which are irrelevant, and which are bugs. Once the bugs are fixed, these observations can be checked over and over again by re-running the tests. This is very similar to the record-playback model used by GUI testers. Once you have registered all the correct observations, you can play the tests back and make sure those observations are still being observed.
Some of the tools will even write the observations as JUnit tests, so that you can run them as a standard test suite. Just like TDD, right? Well, not so fast…
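To see why such observation tests feel different from TDD tests, here is a sketch of my own of what a recorded-observation JUnit-style test can look like. The class under test and the recorded values are hypothetical, not output from any real tool:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical class under test: a stack with a fixed capacity.
class BoundedStack {
    private final Deque<Integer> items = new ArrayDeque<>();
    private final int capacity;
    BoundedStack(int capacity) { this.capacity = capacity; }
    boolean push(int x) {
        if (items.size() >= capacity) return false;
        items.push(x);
        return true;
    }
    int pop() { return items.pop(); }
    int size() { return items.size(); }
}

// A generated test replays recorded observations: it asserts what the
// code *did* on some random inputs, not what it was *meant* to do.
public class GeneratedObservationTest {
    public static void main(String[] args) { // run with: java -ea
        BoundedStack s = new BoundedStack(2);
        // Observed: push returned true, true, then false at capacity.
        assert s.push(7);
        assert s.push(-3);
        assert !s.push(0);
        // Observed: pop returned the last pushed value.
        assert s.pop() == -3;
        assert s.size() == 1;
        System.out.println("recorded observations still hold");
    }
}
```

If the observed behavior was wrong to begin with, the generated test will faithfully lock in the bug; only a human reviewing the observations can tell the difference.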
Make no mistake, tools like this can be very useful. If you have a wad of untested legacy code, then generating a suite of JUnit tests that verifies some portion of the behavior of that code can be a great boon!
The Periphery Problem
On the other hand, no matter how clever the test generator is, the tests it generates will always be more naive than the tests that a human can write. As a simple example of this, I have tried to generate tests for the bowling game program using two of the better known test generation tools. The interface to the Bowling Game looks like this:
public class BowlingGame {
public void roll(int pins) {...}
public int score() {...}
}
The idea is that you call roll each time the ball is rolled, and you call score at the end of the game to get the score for that game.
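For readers who haven't seen the bowling kata, here is one minimal implementation of that interface. This sketch is mine, for illustration; the post does not show its own implementation:

```java
// A minimal BowlingGame sketch (illustrative; not the code from the post).
public class BowlingGame {
    private final int[] rolls = new int[21]; // at most 21 rolls per game
    private int current = 0;

    public void roll(int pins) {
        rolls[current++] = pins;
    }

    public int score() {
        int score = 0;
        int i = 0; // index of the first roll of the current frame
        for (int frame = 0; frame < 10; frame++) {
            if (rolls[i] == 10) {                       // strike: 10 + next two rolls
                score += 10 + rolls[i + 1] + rolls[i + 2];
                i += 1;
            } else if (rolls[i] + rolls[i + 1] == 10) { // spare: 10 + next roll
                score += 10 + rolls[i + 2];
                i += 2;
            } else {                                    // open frame
                score += rolls[i] + rolls[i + 1];
                i += 2;
            }
        }
        return score;
    }

    public static void main(String[] args) {
        BowlingGame perfect = new BowlingGame();
        for (int b = 0; b < 12; b++) perfect.roll(10); // twelve strikes
        System.out.println(perfect.score()); // a perfect game scores 300
    }
}
```

Even this small class has a protocol: the meaning of a roll depends on the rolls around it, which is exactly what trips up the generators below.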
The test generators could not randomly generate valid games. It’s not hard to see why. A valid game is a sequence of between 12 and 21 rolls, all of which must be integers between 0 and 10. What’s more, within a given frame, the sum of rolls cannot exceed 10. These constraints are just too tight for a random generator to achieve within the current age of the universe.
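A rough experiment of my own makes the point concrete. It assumes a naive generator that knows roll takes a smallish integer (0-99, say) but not the 0-10 pin rule, and it checks only a lenient subset of the validity rules; the full rules (tenth-frame bonuses, exact game length) are stricter still:

```java
import java.util.Random;

public class RandomGameExperiment {
    // A lenient validity check: every roll is 0-10 and no two-ball frame
    // exceeds 10 pins. Note that even this partial check already embodies
    // real bowling knowledge.
    static boolean looksValid(int[] rolls) {
        for (int pins : rolls) if (pins < 0 || pins > 10) return false;
        int i = 0;
        for (int frame = 0; frame < 9; frame++) {
            if (rolls[i] == 10) { i += 1; continue; } // strike
            if (rolls[i] + rolls[i + 1] > 10) return false;
            i += 2;
        }
        return true;
    }

    public static void main(String[] args) {
        Random rnd = new Random(1);
        int trials = 1_000_000, valid = 0;
        for (int t = 0; t < trials; t++) {
            int[] rolls = new int[21];
            // The hypothetical generator draws each roll uniformly from 0-99.
            for (int r = 0; r < rolls.length; r++) rolls[r] = rnd.nextInt(100);
            if (looksValid(rolls)) valid++;
        }
        System.out.println(valid + " of " + trials + " random roll sequences looked valid");
    }
}
```

With 21 independent rolls, the chance that all of them even land in 0-10 is about (11/100)^21, so a million trials will essentially never produce a valid game; constraining the generator further means teaching it the rules of bowling.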
I could have written a plugin that guided the generator to create valid games; but such an algorithm would embody much of the logic of the BowlingGame itself, so it’s not clear that the economics are advantageous.
To generalize this, the test generators have trouble getting inside algorithms that have any kind of protocol, calling sequence, or state semantics. They can generate tests around the periphery of the classes; but can’t get into the guts without help.
TDD?
The real question is whether or not such generated tests help you with Test Driven Development. TDD is the act of using tests as a way to drive the development of the system. You write unit test code first, and then you write the application code that makes that code pass. Clearly generating tests from existing code violates that simple rule. So in some philosophical sense, using test generators is not TDD. But who cares so long as the tests get written, right? Well, hang on…
One of the reasons that TDD works so well is that it is similar to the accounting practice of dual entry bookkeeping. Accountants make every entry twice; once on the credit side, and once on the debit side. These two entries follow separate mathematical pathways. In the end a magical subtraction yields a zero if all the entries were made correctly.
In TDD, programmers state their intent twice; once in the test code, and again in the production code. These two statements of intent verify each other. The tests test the intent of the code, and the code tests the intent of the tests. This works because it is a human that makes both entries! The human must state the intent twice, but in two complementary forms. This vastly reduces many kinds of errors, as well as providing significant insight into improved design.
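A tiny example of that double entry (mine, for illustration): the intent "years divisible by 4 are leap years, except centuries, except every fourth century" is stated once as assertions and once as production code, and each entry checks the other:

```java
public class LeapYear {
    // Statement of intent #2: the production code.
    static boolean isLeap(int year) {
        if (year % 400 == 0) return true;  // every fourth century is leap
        if (year % 100 == 0) return false; // other centuries are not
        return year % 4 == 0;              // otherwise, divisible by 4
    }

    // Statement of intent #1: the tests, written first under TDD.
    public static void main(String[] args) { // run with: java -ea
        assert isLeap(2000);   // fourth-century exception
        assert !isLeap(1900);  // century exception
        assert isLeap(1996);   // ordinary leap year
        assert !isLeap(1997);  // ordinary year
        System.out.println("both statements of intent agree");
    }
}
```

If either entry misstates the rule, the other catches it when the tests run, just as the two sides of a ledger catch a mis-entered transaction.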
Using a test generator breaks this concept because the generator writes the test using the production code as input. The generated test is not a human restatement, it is an automatic translation. The human states intent only once, and therefore does not gain insights from restatement, nor does the generated test check that the intent of the code was achieved. It is true that the human must verify the observations, but compared to TDD that is a far more passive action, providing far less insight into defects, design and intent.
I conclude from this that automated test generation is neither equivalent to TDD, nor is it a way to make TDD more efficient. What you gain by trying to generate the 33% test code, you lose in defect elimination, restatement of intent, and design insight. You also sacrifice depth of test coverage, because of the periphery problem.
This does not mean that test generators aren’t useful. As I said earlier, I think they can help to partially characterize a large base of legacy code. But these tools are not TDD tools. The tests they generate are not equivalent to tests written using TDD. And many of the benefits of TDD are not achieved through test generation.
Posted in Uncle Bob's Blatherings, Agile Methods
Posted by Uncle Bob on 12/13/2007
I was at a client recently. They are a successful startup who have gone through a huge growth spurt. Their software grew rapidly, through a significant hack-and-slash program. Now they have a mess, and it is slowing them way down. Defects are high. Unintended consequences of change are high. Productivity is low.
I spent two days advising them how to adopt TDD and Clean Code techniques to improve their code-base and their situation. We discussed strategies for gradual clean up, and the notion that big refactoring projects and big redesign projects have a high risk of failure. We talked about ways to clean things up over time, while incrementally insinuating tests into the existing code base.
During the sessions they told me of a software manager who is famed for having said:
“There’s a clean way to do this, and a quick-and-dirty way to do this. I want you to do it the quick-and-dirty way.”
The attitude engendered by this statement has spread throughout the company and has become a significant part of their culture. If hack-and-slash is what management wants, then that’s what they get! I spent a long time with these folks countering that attitude and trying to engender an attitude of craftsmanship and professionalism.
The developers responded to my message with enthusiasm. They want to do a good job (of course!). They just didn't know they were authorized to do good work. They thought they had to make messes. But I told them that the only way to get things done quickly, and keep getting things done quickly, is to create the cleanest code they can, to work as well as possible, and to keep the quality very high. I told them that quick-and-dirty is an oxymoron. Dirty always means slow.
On the last day of my visit the infamous manager (now the CTO) stopped into our conference room. We talked over the issues. He was constantly trying to find a quick way out. He was manipulative and cajoling. “What if we did this?” or “What if we did that?” He’d set up straw man after straw man, trying to convince his folks that there was a time and place for good code, but this was not it.
I wanted to hit him.
Then he made the dumbest, most profoundly irresponsible statement I've (all too often) heard come out of a CTO's mouth. He said:
“Business software is messy and ugly.”
No, it’s not! The rules can be complicated, arbitrary, and ad-hoc; but the code does not need to be messy and ugly. Indeed, the more arbitrary, complex, and ad-hoc the business rules are, the cleaner the code needs to be. You cannot manage the mess of the rules if they are contained by another mess! The only way to get a handle on the messy rules is to express them in the cleanest and clearest code you can.
In the end, he backed down. At least while I was there. But I have no doubt he’ll continue his manipulations. I hope the developers have the will to resist.
One of the developers asked the question point blank:
Comments
swombat 29 minutes later:
Very true. As a dedicated RSpec BDDer, it is a bit shocking to hear of people trying to take the shortcut of generating their tests. If I was in a smart-arse mood, I might comment that they should take an even quicker shortcut and just generate the test reports directly! Why bother running the tests at all? They’ve already cut out 90% or so of the benefits of testing, why not go all the way? :-)
One element which is missing from your article is the use of TDD as a design process. This is especially the case in BDD, but as BDD is supposed to simply be “TDD done right”, with a better adapted vocabulary, what’s true of BDD tends to hold for TDD as well. When you write tests first, it makes you think about the design of the item you’re writing in a way that’s immensely helpful.
Another important use of TDD is to ensure that you let user stories drive the requirements. In this case, you’d write a user story (e.g. using FIT or the RSpec Story Runner) first, then write a view spec, then a controller spec, then finally a model spec if required. Thus, every line of new code that you write is driven by a clear user benefit, and you waste no time implementing features that You Ain’t Gonna Need (It). In my experience, this goes a long way towards reducing cruft and keeping your codebase tight and focused on user benefits.
Daniel
Johan Samyn about 1 hour later:
A comment from a somewhat unusual angle:
Two things I like most about this post: the comparison with the accounting practice (great for advocacy), and the reference to the human factor. The second most of all. Indeed, we humans are the most important factor, and will stay so for quite some time I believe. This helps me understand why you run a successful company: you seem to value people, the single most important asset there is. And that's a great thing. You consider humans an important factor in the process of writing software. Not just the languages, tools, methodologies and so on, but the people using all that. They are the binding glue, the commanding factor in the process. The importance of us, as valuable human beings, can't be stressed enough. That's why other/new tools and such can't beat the fact that you can get more out of a team by educating them, because that helps make those good people (the best factor in the game) even better. Which is not always understood.
Jeff Langr about 19 hours later:
I’m wondering if FitNesse would be 75,000 total lines, no tests, were it not written test-first.
Steve Meuse about 22 hours later:
Very nice article, insightful and clear. One tiny glitch: Uncle Bob wrote, “The burden is not insignificant. FitNesse, an application created using TDD, is comprised of 45,000 lines of Java code, 15,000 of which are unit tests. Simple math suggests that TDD increases the coding burden by a full third!”
The portion of TDD code may be 1/3, but the coding burden is increased by 1/2.
Given: FitNesseWithTDD = 45,000 LOC TDDPortion = 15,000 LOC We know: FitNesseWithoutTDD = 30,000 LOC RatioOfExtraTDDCode = TDD / FitNesseWithoutTDD = 15,000 / 30,000 = 1/2
Of course, this doesn’t account for the time saved finding bugs before and during the coding phase, rather than retrospectively. A hypothetical FitNesse developed with traditional testing methods could well be bigger than 30,000 LOC. Still, the delta is a bit higher than a third, just so the folks who write the checks and set the schedules know what to expect and when.
I agree with Pavel. The bookkeeping analogy is brilliant. Thanks!
Steve Meuse about 22 hours later:
Arghh. The “Given”/”We Know” block lost its formatting. It should be eight lines, which Preview displays correctly:
Given:
FitNesseWithTDD = 45,000 LOC
TDDPortion = 15,000 LOC
We know:
FitNesseWithoutTDD = 30,000 LOC
RatioOfExtraTDDCode = TDD / FitNesseWithoutTDD
= 15,000 / 30,000
= 1/2
unclebob about 23 hours later:
The portion of TDD code may be 1/3, but the coding burden is increased by 1/2.
Damn! How do you TDD a blog?
DAR about 24 hours later:
Actually, the tests increased the size of the code base by 50%.
“45,000 lines of Java code, 15,000 of which are unit tests”
So that means 30K LOC without tests. 15K/30K = .5. So +15K means +50%.
I don’t have a problem with that (I’m a strong TDD advocate myself). But it doesn’t serve anybody well to have the numbers wrong.
Amund 3 days later:
quote: “The burden is not insignificant. FitNesse, an application created using TDD, is comprised of 45,000 lines of Java code, 15,000 of which are unit tests. Simple math suggests that TDD increases the coding burden by a full third!”
That is not simple math, it is likely to be advanced-alternative-history-math. How do you know that it would end up with 30k lines (and still work) if it was written without TDD?
TDD also drives design and my guess is that you are likely to end with quite a different application when using other development approaches.
Eric Landes 3 days later:
Bob, a great point here. I've posted some more thoughts at the URL I've posted that relate to TDD and using it with Visual Studio Testing (for those new to TDD).
Ross MacGregor 19 days later:
Bob, I’m not sure the Periphery Problem as you’ve stated it is really a good argument.
“The test generators could not randomly generate valid games. It’s not hard to see why. A valid game is a sequence of between 12 and 21 rolls, all of which must be integers between 0 and 10. What’s more, within a given frame, the sum of rolls cannot exceed 10. These constraints are just too tight for a random generator to achieve within the current age of the universe.”
Here you assume these constraints cannot be captured by the system, but why couldn’t they? Perhaps we need languages with more powerful constraint systems like Eiffel that has championed Design by Contract programming.
For example the problem of the number of pins is easily solved by creating an integer value type that only has the range of 0-10. So when it comes time to generate a random value it can only generate valid numbers.
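Such a constrained value type might look like this sketch (my illustration of the commenter's idea, not code from any of the tools mentioned):

```java
// A value type whose invariant is enforced at construction time:
// a single roll knocks down between 0 and 10 pins.
public final class PinCount {
    private final int value;

    public PinCount(int value) {
        if (value < 0 || value > 10)
            throw new IllegalArgumentException("pins must be 0-10, got " + value);
        this.value = value;
    }

    public int value() { return value; }
}
```

A generator could then sample only legal PinCount values, though, as the reply below notes, this still says nothing about frame sums or the length of a game.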
DBC like TDD is a methodology for designing software that promises to increase the quality of the design. Perhaps with enough practice with DBC programming one would be able to capture most of these elusive constraints you speak of. You want to capture these constraints programmatically so that you can verify that your software is not operating in an invalid state.
If DBC programming languages were more common perhaps these tools may actually work well and be able to auto generate most of the unit tests.
unclebob 26 days later:
For example the problem of the number of pins is easily solved by creating an integer value type that only has the range of 0-10. So when it comes time to generate a random value it can only generate valid numbers.
The problem is that a valid game is not just a sequence of rolls between 0 and 10. Within any frame (usually two rolls) the sum of the rolls cannot exceed 10; but if the first roll is 10, then that rule doesn’t count.
Trying to capture all the constraints for valid rolls is tantamount to writing the scoring algorithm.
Yet another Bob about 1 month later:
Bob,
What about code generation? Some people suggest that for certain solutions where metaprogramming is being applied you should generate tests and the corresponding code you want to test – but then you are not really expressing the intent twice (weeellll, the generating code consists of two parts, with one part generating the tests and another one generating the code).
Of course you can TDD the generation code and let it generate simple examples that you test automatically as well.
What’s your take on this?
Yazid about 1 month later:
Hello,
I love TDD (or Test First); recently the Microsoft research team created a tool called Pex.
This is a description of Pex:
Pex (Program EXploration) is an intelligent assistant to the programmer. By automatically generating unit tests, it helps to find bugs early. In addition, it suggests to the programmer how to fix the bugs. And here is a link
Can this tool be useful or does it defeat the goal of TDD?
Thx Dr Y Arezki
mmorpg about 1 month later:
funny how something with 33% more code can be that much more efficient…some of these algorithms people are developing have me in severe coder envy.
Charles 11 months later:
OK, I am willing to accept the premise. Having a third more code may be more efficient overall, but has anyone done any true tests to confirm this? I am all for increasing efficiency, but too often we programmers seem to be in an "add more code" mindset.
Peli 12 months later:
Hi Bob,
The type of test generators that you describe are indeed detrimental to TDD, because they are missing the oracle that tells whether the code is behaving as it should (the oracle sits in your head).
However, if you can provide an oracle (i.e. assertions, invariants, etc.), test generators will help the TDD process because they will cover a lot of corner cases that otherwise get forgotten by the developer.
With Pex, an automated whitebox test generator, we explore the code from user-written parameterized unit tests (i.e. unit tests with parameters). If this test contains assertions, the tool will try to fail them. If it does not, then it is a bad test, parameterized or not. Whenever you write a unit test and hard-code a value that does not matter (e.g. you hardcode "Marc" as a field name), then you should refactor out that value and let a tool 'explore' it. Note that parameterized unit tests can be written in a Test First fashion. More in this article: http://dspace.mit.edu/bitstream/handle/1721.1/40090/MIT-CSAIL-TR-2008-002.pdf
> using two of the better known test generation tools. Can you share the names?