First published on Scrum.org, 17 September 2018 (syndicated).
"The essence of the beautiful is unity in variety"  Mendelssohn
My gears aflame
I remember being told, many years ago when I
started university, that Information Technology is a numerate
discipline. I had been admitted onto the IT course on the strength of a
high school biology certification. That was my “technical competency”.
The professors assured me, however, that I did not need to be an expert
mathematician. The essential requirement, they said, was just to be able
to work with numbers.
I took that to mean I needed to be able to
add and subtract, which I had mastered at the age of six or seven, or
perhaps a couple of years later when I suppose I must have been about
twelve. Anyway, it turned out that they really meant I had no
obligation to study mathematics beyond second order calculus. You could
smell my poor gears burning as I hurtled into Remedial Class City. It
can be argued that I have effectively remained there, trying to deduce
inflection points on my fingers and toes, ever since.
I am the least numerate person you will ever
meet in the IT industry. The people working in the office canteen are
more mathematical than I am. At least they can work out the change. For
my part though, numbers induce within me an absurd feeling of dread.
They are at once both my nemesis and my nightmare. They have left me
with my very own case of imposter syndrome. I chew my nails wondering: when will my arithmetic incompetence be exposed to all? When will I finally be caught out?
Ducking and diving, bobbing and weaving
Agile coaching can be a good hiding place for
someone like me, at least until metrics are discussed. Fortunately,
it's possible to kick the can down the road for a while before that
actually happens. An agile coach might assert, with justification, that
true value lies in the product increment and not in throughput,
velocity, or other proxy measures. Sprint capacity planning, you can
also argue, is never meant to be a sufficiently exact science for which a
mathematical approach might truly apply. You can rightly claim that
estimates are just that – estimates – and that a team merely needs to
get its arms around how much work it thinks it can take on. You might
also assert that "agile team members should take care not to become
story point accountants" – spitting out that last word as I do, like a hardcore numerophobe.
Sometimes though, an agile coach really does have
to look at the numbers. If a Product Owner wishes to make
evidence-based future delivery projections, for example, then it will be
hard to help without conveying an understanding of burn rates,
capacity, or throughput. Then there are times when a quantitative
approach is arguably less essential, and yet you know in your heart it's
the best option. Flow optimization during a Sprint can demand the
analysis of cycle times for example, and not just the eyeballing of a
team board for bottlenecks. Improving service level expectations can
mean that work-item age ought to be assessed, since there are times when
a general reflection on team performance just won’t be enough to cut
it.
Crunch time
Sprint Planning is another occasion when an
irredeemably scattered brain, delinquent in mathematical measure, can be
forced into jackboots and expected to crunch numbers. It’s a formal
opportunity to use the best data available in order to provide the best
possible timeboxed forecast of work. According to the available
historical data, how much work can a team reasonably expect to undertake
during a Sprint? It’s time for you to challenge any numeracy demons you
may have.
If you step up gamely to the mark, you’ll
bend your mind around a product backlog’s size and the rate at which a
team burns through it. A forecast of Sprint capacity and projected
delivery dates will hopefully lurch into view. Pushing the envelope
further, you might also factor in the possibility of a Sprint Backlog
growing during the iteration. You produce a burn-up chart to accommodate
such eventualities, and then cheekily express disdain for mere
burn-downs. Apart from an occasional dalliance with cumulative flow and
Little’s Law, that’s about as far as a mathematical derelict such as I
will generally dare to venture into the arcane world of numbers.
Fortunately, I have been able to comfort
myself for years with the knowledge that, however incompetent I might be
in terms of mathematical ability, most of my peers will demonstrate no
higher accomplishment. While they must surely be better at grasping
metrics than I am, for some reason they always evidence the same dilettante standard
of burn-ups and burn-downs. With an ongoing sigh of relief, I have
found again and again, that I can wing my way past the whole issue.
Nobody seems to care about fancier metrics than those I can actually
cope with. The bar has not been set high and even I might hope to clear
it.
Raising the bar
The problem is, these usual measures will
typically express a precision with limited respect for accuracy. If we
have to forecast when a delivery is likely to be made, for example, then
we may offer a particular number as our answer. From the Product
Backlog burn rate, we might project that the relevant increment will be
forthcoming in “Sprint 12”. The calculation we have performed will
demonstrate that our assertion is fair. Yet we will also know that the
number we offer is uncertain, and that the increment might not be
provided in Sprint 12 at all.
We can usually express greater confidence in
vaguer terms, such as a range. It might be wiser to say “delivery will
probably happen between Sprints 11 and 13”. You see, the more precise
forecast of “Sprint 12” is not very accurate. It is a rough number
precisely stated. We offer it as a cipher for a vaguer and truer answer.
This is a game we play with our stakeholders. We know, and expect them
to know, that any number we forecast may not be accurate. It is an
estimate.
Why don’t we offer the sort of range we can
have better confidence in, whenever we have to make a forecast? Why are
we obsessed with giving a precise number no matter how unreliable it may
be?
Part of the answer lies in our cultural
reluctance to be vague, even under conditions of high uncertainty. We
assume that someone who fails to be exact cannot be on top of their
game. The charlatan who deals in fantasy numbers with precision is
respected, while the careful analyst whose stated range actually proves
reliable is held to be of lesser accomplishment.
Fortunately, there is a way out of this state
of affairs. We can raise the bar and still hope to meet it. All we need
to do is to ensure that any “vagueness” we express – such as a range of
dates – accurately captures the uncertainty we are dealing with. If we
can evidence that velocity is stable to within plus or minus 20% every
Sprint for example, then the range we offer in a future prediction ought
to reflect that established variation. In truth, the range we give will
not be “vague” at all. It will be a range precisely articulated, and it
will be accurate because it is founded on hard data and careful analysis.
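As a minimal sketch of that idea (the velocity and backlog figures here are hypothetical, chosen only for illustration):

```python
# Hypothetical inputs: a velocity of 100 points per Sprint that has proven
# stable to within plus or minus 20%, and 500 points of backlog remaining.
velocity = 100
backlog = 500

fast_burn = velocity * 1.2   # best observed burn rate
slow_burn = velocity * 0.8   # worst observed burn rate

best_case = backlog / fast_burn    # fewest Sprints delivery might take
worst_case = backlog / slow_burn   # most Sprints we should allow for
print(f"Delivery likely between Sprint {best_case:.1f} and Sprint {worst_case:.1f}")
```

The range is wide, but it is honest: it states no more precision than the observed variation supports.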
An example
Let’s go through a worked example. This is
entirely from real life by the way, and reflects an actual situation
where I joined a team as a Scrum Master. All of the data is real and
nothing has been made up.
The situation is that the team has been
Sprinting for a while with iterations two weeks in length. There is a
Product Backlog of an estimated 510 points remaining. We know the team
velocities for Sprints 4, 6, 7, 8, 10, 11, and 12. These were 114, 143,
116, 109, 127, 153, and 120 points respectively. We don’t have the data
for any other Sprints and some Sprints are evidently missing. It doesn’t
really matter though. We just need a representative set of recent
velocities from which variation can be evidenced.
We can see that the distribution falls
between 109 and 153 points and with no obvious pattern to the
scattering. We could of course work out the average burn rate, which is
126 points, and then estimate that it would take just over 4 Sprints to
complete the 510 points of work. However, as we have already discussed,
when we take an average we are effectively throwing away the variation
in order to arrive at our answer, and so we lack an understanding of its
accuracy. When we average stuff out, it should be because the variation
is annoying and we genuinely want to throw it away. With our greater
ambition however, we want to make use of any variation so as to deduce a
more useful and reliable range.
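For reference, the averaging calculation we have just dismissed works out as follows, using the velocities given above:

```python
# Velocities observed for Sprints 4, 6, 7, 8, 10, 11, and 12.
velocities = [114, 143, 116, 109, 127, 153, 120]
backlog = 510

average = sum(velocities) / len(velocities)   # the average burn rate
sprints_needed = backlog / average            # "just over 4 Sprints"
print(average, round(sprints_needed, 2))
```

Note that the variation between 109 and 153 points has vanished entirely from the answer.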
The first thing to do is to express the data
we have in a more flexible way. Each Sprint lasted 10 working days. If a
Sprint completed X story points of work, that means a typical point for
that Sprint will have taken 10/X days to clear. We can now produce the
following table from the data we have.
| Sprint | 4 | 6 | 7 | 8 | 10 | 11 | 12 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| # Days | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
| # Points | 114 | 143 | 116 | 109 | 127 | 153 | 120 |
| Typical Days Per Point | 0.087719 | 0.069930 | 0.086207 | 0.091743 | 0.078740 | 0.065359 | 0.083333 |

Backlog Size: 510
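The “Typical Days Per Point” row can be reproduced in a couple of lines, using the Sprint data above:

```python
days_per_sprint = 10
velocities = {4: 114, 6: 143, 7: 116, 8: 109, 10: 127, 11: 153, 12: 120}

# A Sprint that burned X points cleared a typical point in 10/X days.
typical_days_per_point = {s: days_per_sprint / v for s, v in velocities.items()}
for sprint, days in typical_days_per_point.items():
    print(f"Sprint {sprint}: {days:.6f} days per point")
```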
If you were asked to build a simulation of
how long the 510-point backlog might take to complete, you could use the
above data to help you. You’d pick one of the “Typical Days Per Point”
values at random 510 times, and then add the set of numbers all up. If
you ran the simulation once, it might come to 40.91 days. If you ran
that simulation twice more, you could get totals of 40.63 and 41.35
days. On each occasion, you would be running a realistic simulation of
how the team might be expected to saw away at 510 story points of work.
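A single run of that simulation might be sketched like this (Python standard library only; the sampling is random, so each execution gives a slightly different total):

```python
import random

# Typical days per point for each Sprint we have data for.
typical_days_per_point = [10 / v for v in (114, 143, 116, 109, 127, 153, 120)]
backlog = 510

def simulate_once(rng=random):
    """One simulated run: draw a days-per-point value at random for each
    of the 510 backlog points and total the elapsed days."""
    return sum(rng.choice(typical_days_per_point) for _ in range(backlog))

print(round(simulate_once(), 2))  # totals vary slightly from run to run
```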
If you ran the simulation a hundred times,
you’d get even more data to look at. However, the volume of data could
then prove a bit overwhelming. So, let's group the run times we get from
a hundred simulations into 10 buckets, each covering one-tenth of the
time between the fastest and slowest. In other words, we'll count the
number of runs which fall within each of the equally-sized
time boundaries of ten buckets.
| Bucket # | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Bucket boundary | 40.51 | 40.60 | 40.70 | 40.80 | 40.90 | 41.00 | 41.09 | 41.19 | 41.29 | 41.39 |
| # runs in bucket | 1 | 4 | 5 | 15 | 16 | 20 | 18 | 8 | 8 | 5 |
OK, all of the runs add up to 100. What we
appear to have though is a concentration of runs in buckets 4, 5, 6, and
7, with a peak in bucket 6 and a tail-off towards the first and last
buckets. In other words, the 510-point backlog is more likely to
complete in the 40.8 to 41.09 day range than at any other point in time.
Let’s chart this to better visualize the results.
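The bucketing procedure can be sketched as follows (standard library Python; since the sampling is random, the counts will differ from the table above on each run):

```python
import random

typical_days_per_point = [10 / v for v in (114, 143, 116, 109, 127, 153, 120)]
backlog, runs = 510, 100

# Total elapsed days for each of a hundred simulated runs.
totals = [sum(random.choice(typical_days_per_point) for _ in range(backlog))
          for _ in range(runs)]

# Split the span between the fastest and slowest run into 10 equal
# buckets, and count how many runs land in each.
fastest, slowest = min(totals), max(totals)
width = (slowest - fastest) / 10
buckets = [0] * 10
for t in totals:
    index = min(int((t - fastest) / width), 9)  # clamp the slowest run into bucket 10
    buckets[index] += 1
print(buckets)
```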
Now let’s run a thousand simulations, and see if this pattern can be confirmed and if it emerges even more clearly.
| Bucket # | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Bucket boundary | 40.357 | 40.494 | 40.631 | 40.768 | 40.905 | 41.042 | 41.18 | 41.317 | 41.454 | 41.591 |
| # runs in bucket | 7 | 17 | 80 | 161 | 276 | 251 | 143 | 53 | 11 | 1 |
By Jiminy, so it does. There really is a clustering. In fact, it’s the
sort of thing a biologist might recognize as a normal distribution, or
“bell curve”. This type of curve is encountered frequently when studying
population distributions, height or biomass variation between
individuals, pollutant concentrations, or other events which – when
examined in the small – might appear to be without pattern.
This means that if stakeholders wish to know
when a given item occupying a certain position on the backlog is likely
to be completed, we can do better than to calculate a time based upon a
potentially misleading average. Instead, we can show them the time range
in which delivery is likely to happen, and the level of confidence we
would have in delivery occurring within it.
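As a sketch of how such a confidence range might be pulled from the simulated runs (the 80% level here is an arbitrary choice for illustration):

```python
import random

typical_days_per_point = [10 / v for v in (114, 143, 116, 109, 127, 153, 120)]
backlog, runs = 510, 1000

# A thousand simulated runs, sorted from fastest to slowest.
totals = sorted(sum(random.choice(typical_days_per_point) for _ in range(backlog))
                for _ in range(runs))

# An 80% confidence range: discard the fastest 10% and slowest 10% of runs.
low = totals[int(runs * 0.10)]
high = totals[int(runs * 0.90)]
print(f"80% confident of delivery between {low:.1f} and {high:.1f} working days")
```

The stated range is then backed directly by the data, rather than by a single precisely-stated but unreliable number.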
Most occurrences cluster around the average –
but few if any will actually be the average. Very few people are of
exactly average height, for example. Remember that an “average” can be
stated precisely but rarely proves to be accurate.
Note also that if we were measuring
throughput rather than velocity – or had a rule in which each story must
be scoped to one point – then 10/X would effectively represent the
“takt time” for that Sprint. This can be a better alternative to story
point estimation in so far as it takes a team closer to gauging the
actuals of stories completed.
End note
We can never guarantee the certainty of a
forecast. The scope of work can grow, items can be reprioritized, and
unforeseen events can always happen. True value will always lie in the
increment a team delivers, and not in story points.
Nevertheless, when we collect metrics, such
as the velocities attained over a number of sprints, we ought to
remember that critically important detail can exist in the variation.
All too often we just work out a forecast such as an average burn rate,
and throw the texture of the available data away. Yet in truth, we don’t
have to give stakeholders a single “precise” date for delivery in which
we can express little confidence. We can do better, and show them the
projected range from a thousand or more simulated scenarios.
The process we have covered here is sometimes
referred to as a “Monte Carlo” method. This is a class of algorithms
which use large-scale random sampling to generate reliable predictions.
The technique was implemented computationally by Fermi, Von Neumann, and
other physicists at the Los Alamos laboratory in the 1940s. They were
working on problems such as how to work out the likely penetration of
neutrons into radiation shielding. It’s generally applicable to complex
phenomena where the inputs are uncertain, and yet the probability of
outcomes can be determined.
I suppose it’s possible that the world’s
great nuclear physicists aren’t much good at working out their change in
the canteen either. What they are certainly very good at, however, is
recognizing how new information – such as a reliable forecast – can be
inferred or deduced from observations which others ignore or take for
granted. That’s a very special kind of skill.
As a mathematical blockhead, I can’t lay
claim to either of these abilities. What I can do though is to stretch a
little. I can raise the bar a touch, and perhaps hope to meet it. You
can too. We aren’t restricted to working out an average velocity which
we can plug in to burnrate projections, for example. Monte Carlo
analysis is just one illustration of why we should take care before
stripping the texture of our data away. We should always seek to provide
stakeholders with better information, using the data we have, and by
the considered inspection and adaptation of our method.