High school lessons for performance testers

Post date: Dec 3, 2013 8:40:38 PM

(originally posted on May 14, 2013)

At the latest TestNet event I attended a presentation on the need for calibration of performance test tools. Using tests he and some colleagues had performed, the speaker showed that performance test tools don’t behave exactly like real users. For instance, when testing manually, the browser didn’t open more than 5 TCP/IP connections, whereas the load generator opened nearly 32 sessions. This can make a big difference.

That may not be news, and the tests may not meet rigorous scientific standards, but it was a good thing they showed this aspect of the discrepancy between load generated artificially by tools and load generated by real users. They did a good job making this clear.

It was their conclusion, however, that reminded me that performance testers have forgotten some basic lessons we learned in high school math. The reason they investigated the difference between artificially generated and user-generated load was that they, like so many of us, had been confronted with situations where software passed the performance test but failed in production, as well as the other way around. They used their findings to argue that we should ‘calibrate’ the performance tool. Perhaps we should, but I doubt that doing so would prevent performance test results from being way off.

High school math

In ‘high school’ (it’s not called high school where I am from) I did take math classes, and there was something I learned there. Which is odd in itself, since the few times I actually showed up for class, I hardly ever paid attention. What I learned is that an answer cannot be more precise than the measurements on which it is based.

As an example, suppose you measure two distances in millimeters with an accuracy of half a millimeter at best and get the following two values: value A: 53, value B: 27.5. If you then divide A by B, what is the correct answer?

  1. 1.9272727272727272727272727272727

  2. 1.9

  3. 2

  4. Something else

Answer 1 is what you get if you use a calculator to divide the values. If you chose answer 1, it would not only be considered wrong, you would actually get points deducted. The answer may never be more precise than the measurements. The accuracy of a result is never better than that of the least accurate measurement. So if one measurement is only accurate to half a millimeter, there is no point in getting the other measurement accurate to the micrometer. Here value A has only two significant digits, so the correct answer is 1.9.
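To make that rule concrete, here is a minimal sketch in Python (my own illustration, not something from the presentation) that rounds a quotient to the number of significant digits of the least precise measurement. The digit counting is simplified and ignores the ambiguity of trailing zeros:

    import math

    def significant_digits(value_str):
        # Simplified count: strip the decimal point and leading zeros.
        return len(value_str.replace(".", "").lstrip("0"))

    def round_to_significant(value, sig):
        # Round 'value' to 'sig' significant digits.
        if value == 0:
            return 0.0
        exponent = math.floor(math.log10(abs(value)))
        return round(value, sig - 1 - exponent)

    a, b = "53", "27.5"                                       # the measured values from the example
    sig = min(significant_digits(a), significant_digits(b))   # two significant digits
    quotient = float(a) / float(b)                            # 1.9272727...
    print(round_to_significant(quotient, sig))                # prints 1.9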

Yet in performance testing this basic knowledge seems forgotten, or people don’t realize it applies. In performance testing we are faced with a lot of assumptions that make the basis of our test inaccurate:

  • The concurrent users dilemma. What are concurrent users? Is that the number of users who click on something at exactly the same moment? How to define concurrent users is the subject of many debates, and any number put on it is at best an educated guess.

  • How do we expect the users to actually use the system, and how many of them? If you ask the project manager of the development team, he will expect the application to be used once every two hours by some old and very patient granny. If you ask marketing, they will expect the entire user base of Facebook to be anxiously awaiting the new functionality and to open it right at launch. Reality, however, is often even more bizarre. We just don’t know. We just assume some behavior.

  • We often put load on one particular function. In production, however, the other functions are used at the same time as well. Simulating this is not only hard; we can often only guess at the exact ‘mix’ of usage. There will be other processes, batch jobs, maintenance jobs, reports being generated, and so on.

  • Sometimes, just sometimes, we get a test environment with the same power as the production environment will have. But even then some parameters will differ. Network latency, for instance, often cannot be guaranteed to represent production; the test environment frequently sits in a different network with completely different values. And yes, network latency can have a huge impact.

  • Usage forecasts are usually based on averages, whereas what we most need to test for is peak load. Forecasting realistic peaks is difficult and more often than not amounts to wishful thinking.

All in all, we create a test scenario and a load profile based on many assumptions. The more assumptions you make, the less accurately you can predict. So we base our conclusions on a situation that hardly resembles reality and put load on it that represents reality in neither type nor quantity. And then someone concludes ‘passed’… In reality we can only report on the risk that the system in production will meet or fail generic requirements.

I don’t think they overcame the issues mentioned above. If they did, I would very much like to see a presentation on how they achieved that!

So while I share their conclusion that we should be aware of the inaccuracy caused by load-generating tools not behaving exactly like real users, I think their accuracy is still orders of magnitude better than that of the other aspects of our test.

It is a bit like being asked why the men’s room is so smelly. You find out that the mop is so large it doesn’t reach the corners, so you could advise using a toothbrush to clean the men’s room. Although your analysis is correct, the reason the men’s room is so smelly is that men have lousy aim, not the few spots in the corners.

So yes, be aware of the limitations in how the load generated by tools represents real load. Be aware of the available settings (such as caching, connection limits, network settings, etc.). You’ll find that most tools are actually aware of these issues and let you control such settings and behavior.
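As an illustration of the kind of control I mean, here is a minimal sketch (my own, not from the presentation) of a script-level load generator built on Python’s requests library, constrained to roughly browser-like connection behavior. The limit of six connections per host is an assumption based on common browser defaults, not a recommendation, and the URL is a placeholder:

    import requests
    from requests.adapters import HTTPAdapter

    # Assumption: a typical browser opens about six TCP connections per host.
    BROWSER_LIKE_CONNECTIONS = 6

    session = requests.Session()
    adapter = HTTPAdapter(pool_connections=BROWSER_LIKE_CONNECTIONS,
                          pool_maxsize=BROWSER_LIKE_CONNECTIONS,
                          pool_block=True)  # wait for a free connection instead of opening extra ones
    session.mount("http://", adapter)
    session.mount("https://", adapter)

    # Hypothetical target URL; replace with the system under test.
    # The cap only matters when this session is shared by concurrent worker threads.
    response = session.get("http://example.com/")
    print(response.status_code)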

Cleaning large areas with a toothbrush, by the way, is not a good idea. That’s another lesson some of us learned in high school.