Risk Burn-up: Daring Deeds in DevOps

First published in The DevOps Zone, 3 January 2017

Uncoupling a railway carriage in Around the World in 80 Days
“Never was anything great achieved without danger.” ― Niccolò Machiavelli



Agility can be seen as an approach towards managing risk. Delivering value early and often reduces any leap of faith that must be taken before empirical evidence of progress is seen. Hence, agile methods differ not only in the manner in which increments of value are provided but also in the way risk is controlled.

For example, a Kanban approach is likely to emphasize the achievement of the leanest possible workflow. Any leap of faith will be constrained to the work that is currently on hand. It will not be batched into any larger quantities than are necessary to assure optimal flow, and the risks to the delivery of any one item will consequently be minimal.

Scrum, on the other hand, will limit work-in-progress to a Sprint timebox of no longer than one month. The work being carried out is thus likely to encompass multiple items, but it allows a substantial goal for each timebox to be framed and allows for more complex risks to be mitigated on a regular cadence.

This is why Scrum is generally favored for development projects, while Lean Kanban approaches are more typical of operational support work once those projects have finished. Once the risks of developing a complex software product have been addressed, a flow of support and maintenance tasks can be optimized.

What, though, does this mean for DevOps? When the Development and Operations gap is effectively bridged, all of these capabilities become encapsulated within a DevOps team or studio. This can lead to the expectation that any risk can be handled by a DevOps stream. However, a DevOps outfit is still subject to the constraints that affect any workgroup. If demand is not managed effectively, performance can be expected to degrade along with the effective use of controls, including risk management.

For example, a DevOps team may adopt a "Scrumban" approach or otherwise attempt to find a practical compromise between Scrum and Kanban ways of working. A Sprint goal of sorts might be agreed upon with the business, and the team will plan to deliver increments of value which mitigate project-level risks. If any operational support requests come in during the Sprint, whether for the same product or a different one, then those requests will pre-empt the work that has been planned on the Sprint Backlog. They will be fast-tracked and handled to an agreed quality of service, such as a turnaround time established by a service-level agreement.

Clearly, this will affect the team's ability to continue with the planned work and still meet the Sprint goal. It may no longer be possible for them to achieve it at all. Their capacity to deliver the agreed Sprint increment and to manage risk at that scale will have been degraded. In short, if a DevOps team is put in the position of having to address risks of both types at the same time, the frictions of support work can be expected to take a certain toll. Each Sprint is a window of opportunity not just for project risk mitigation and the delivery of a valuable increment but also for general frictions to interfere and take effect.

A safe DevOps project is one in which these risks are understood and managed, and these frictions are kept under control. A risk burn-up chart can be a helpful tool in this regard. The principle is essentially the same as that of a product burn-up chart with which many agile practitioners are familiar. However, rather than tracking the delivery of scope, it is the mitigation of risk that is visualized.

As an example, let's suppose that a DevOps team has been given a small project to do. For the purpose of illustration, we'll make it a modest and conservative undertaking: the deployment of some new functions that make use of an upgraded system core. The team does not believe it will take much more than an hour to do the work, and so the window in which any friction can take effect is comparatively small. Bear in mind that the project challenges which face a DevOps team in real life can be quite a bit more substantial than this, and the need to evidence risk control may be correspondingly greater.

The team goes ahead and plans the tasks they expect to perform during the timebox. However, they take the additional step of enumerating the associated risks. This is done in the standard way of estimating the probability of occurrence and the expected impact. The two variables are simply multiplied together in order to give a measure of each risk. Hence an improbable risk may be a substantial one if recovery proves difficult, while a risk with a high likelihood of occurrence may be of only small magnitude if the remedy is straightforward.
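The scoring rule just described can be sketched in a few lines of Python. The function name `risk_score` is hypothetical, and the example values are borrowed from items H and E in the team's list; the scales used for probability and impact are not stated, so they are treated here simply as relative numbers.

```python
def risk_score(probability: int, impact: int) -> int:
    """Return the risk measure: probability of occurrence times expected impact."""
    return probability * impact


# A less likely risk with a heavy impact (values of item H)...
unlikely_but_severe = risk_score(probability=3, impact=10)
# ...can outweigh a likelier risk with a lighter impact (values of item E).
likely_but_mild = risk_score(probability=5, impact=5)

print(unlikely_but_severe, likely_but_mild)
```

With these values the less probable risk scores 30 against 25, which is the situation described above: likelihood alone does not determine how much a risk matters.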

The team has enumerated the risks as follows, in order of possible occurrence, from A to J. The first number is the probability, the second the impact, the third the risk, and the fourth is the cumulative risk at each point. The fifth number is the time into the project at which the risk is expected to be dealt with.

Item  Prob.  Impact  Risk  Cumul.  Time (min)  Description
A     5      7       35    35      5           Like-for-like algorithm migration (DEV)
B     5      8       40    75      8           Add new algorithms (DEV)
C     3      5       15    90      15          Modify Build & Deploy scripts (PRE-PROD)
D     10     20      200   290     20          Integration with upgraded core (PRE-PROD)
E     5      5       25    315     30          Automated regression testing (PRE-PROD)
F     5      11      55    370     33          Build and deploy (PRE-PROD)
G     4      5       20    390     40          Cohort acceptance (PRE-PROD)
H     3      10      30    420     50          Fail and recover (PRE-PROD)
I     4      10      40    460     60          Build and deploy (PROD)
J     5      8       40    500     65          Toggle to live (PROD)
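The risk and cumulative columns can be rebuilt from the probability and impact figures alone, which is a handy check when the list changes. The following is a minimal sketch; the names `RISKS`, `risks` and `cumulative` are my own and not part of any tool.

```python
from itertools import accumulate

# (item, probability, impact, time) exactly as listed in the table above.
RISKS = [
    ("A", 5, 7, 5),
    ("B", 5, 8, 8),
    ("C", 3, 5, 15),
    ("D", 10, 20, 20),
    ("E", 5, 5, 30),
    ("F", 5, 11, 33),
    ("G", 4, 5, 40),
    ("H", 3, 10, 50),
    ("I", 4, 10, 60),
    ("J", 5, 8, 65),
]

# Risk is probability times impact; the cumulative column is a running total.
risks = [prob * impact for _, prob, impact, _ in RISKS]
cumulative = list(accumulate(risks))

for (item, _, _, time), risk, total in zip(RISKS, risks, cumulative):
    print(f"{item}  risk={risk:>3}  cumulative={total:>3}  time={time:>2}")
```

The running total ends at 500, which is the full risk magnitude the project has to work through.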

Simple risk burn-up chart
Turning this data into a projective burn-up chart is quite straightforward. All we have to do is to express time on the x-axis and cumulative risk on the y-axis, then plot risks A to J at the appropriate point. The origin of the graph represents the start of the project being undertaken.

However, there is a refinement to this technique which I have found to be useful, and which is sometimes used by military planners. In this approach, the y-axis is redefined in terms of percentage risk magnitude, with 100% representing the point at which all risks have been mitigated and success is at hand. Firstly, it is recognized that the point at which overall success becomes more likely than failure lies exactly halfway up the y-axis. This is known as the relative superiority line (R.S. line). Secondly, it is recognized that the area above and to the left of the plot-line represents an area of vulnerability which ought to be minimized. Thirdly, it is recognized that it is most important to minimize the vulnerability below the R.S. line, which is to say below the level at which 50% of the risk has been mitigated, because that is the tipping point where success becomes likely.

Clearly, it is desirable to reduce the area of vulnerability as far as possible, regardless of whether it lies above or below the R.S. line. However, in practice, it is especially important to reduce the area below that line, because it means lessening the window of opportunity for friction to occur when the project is at its most vulnerable.
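To make these ideas concrete, here is a rough sketch that finds the point of relative superiority and measures the area of vulnerability for the plan above. It rests on an assumption of my own: a risk only counts as mitigated at the moment its item is dealt with, so the plot-line is treated as a series of steps rising from the origin. The function names are hypothetical, and the areas are expressed in raw risk units multiplied by time rather than as percentages.

```python
# (time, cumulative risk mitigated) for items A to J, from the table above.
POINTS = [(5, 35), (8, 75), (15, 90), (20, 290), (30, 315),
          (33, 370), (40, 390), (50, 420), (60, 460), (65, 500)]
TOTAL = POINTS[-1][1]  # 500, i.e. 100% on the redefined y-axis


def relative_superiority_time(points, total):
    """Time of the first point at which at least half of the risk is mitigated."""
    for time, cumulative in points:
        if 2 * cumulative >= total:
            return time
    raise ValueError("relative superiority is never reached")


def vulnerability(points, ceiling):
    """Area between the stepped plot-line and `ceiling`, in risk units x time.

    Assumption: the mitigated level stays flat until the next item is done.
    """
    area, prev_time, level = 0, 0, 0
    for time, cumulative in points:
        area += max(0, ceiling - level) * (time - prev_time)
        prev_time, level = time, cumulative
    return area


rs_time = relative_superiority_time(POINTS, TOTAL)
total_area = vulnerability(POINTS, TOTAL)        # everything above the plot-line
below_rs_area = vulnerability(POINTS, TOTAL / 2)  # the part below the R.S. line

print(rs_time, total_area, below_rs_area)
```

Under these assumptions relative superiority arrives with item D, 20 units of time into the project, and the vulnerability below the R.S. line comes to 3920 out of a total of 14585. A different reading of the plot-line, such as straight lines between points, would give different figures but the same comparison between plans.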

Once we have a model, we can start to optimize it. For example, we might reduce the area of vulnerability by modifying the build and deployment scripts after integration with the upgraded core. This would allow relative superiority to be achieved more quickly, reducing the window of risk during which any frictions could take effect.
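The effect of that change can be estimated with a small sketch. Two assumptions of mine are built in: each task takes as long as the gap between consecutive times in the original plan, and that duration is unchanged when the order is altered. The names `PLAN`, `burn_up` and `summarize` are hypothetical.

```python
# (item, risk, duration): duration is the gap between consecutive "Time"
# values in the original plan, assumed to be unaffected by reordering.
PLAN = [("A", 35, 5), ("B", 40, 3), ("C", 15, 7), ("D", 200, 5),
        ("E", 25, 10), ("F", 55, 3), ("G", 20, 7), ("H", 30, 10),
        ("I", 40, 10), ("J", 40, 5)]


def burn_up(plan):
    """Return (time, cumulative risk) points for the items in plan order."""
    points, time, cumulative = [], 0, 0
    for _, risk, duration in plan:
        time += duration
        cumulative += risk
        points.append((time, cumulative))
    return points


def summarize(plan):
    """Return (R.S. time, vulnerability below the R.S. line) for a plan.

    The plot-line is treated as a step function: a risk counts as mitigated
    only once its item has been dealt with.
    """
    points = burn_up(plan)
    half = points[-1][1] / 2
    rs_time = next(t for t, c in points if c >= half)
    area, prev_time, level = 0, 0, 0
    for time, cumulative in points:
        area += max(0, half - level) * (time - prev_time)
        prev_time, level = time, cumulative
    return rs_time, area


# Swap C and D so that integration with the upgraded core comes first.
reordered = PLAN[:2] + [PLAN[3], PLAN[2]] + PLAN[4:]

original_summary = summarize(PLAN)
reordered_summary = summarize(reordered)
print(original_summary, reordered_summary)
```

On these figures the swap brings relative superiority forward from 20 to 13 units of time and cuts the vulnerability below the R.S. line from 3920 to 2770. Whether the tasks can really be reordered without changing their durations is something only the team can judge.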

Advanced risk burn-up chart
Note that once a project manages its risks to the point that relative superiority is achieved, it does not mean that success is guaranteed; it just means that once that point is reached, a careful handling of each risk is likely to result in overall success. Thenceforth, it is improbable that the initiative will have to be aborted.

Of course, this is just a projective model. A risk above the R.S. line might become compounded in ways that were not foreseen, such that it cannot be dealt with, and it might thereby still lead to overall failure. For example, build and deployment may fail due to incorrectly modified scripts, a problem which might not be resolved in the time available. In other words, you can't anticipate each and every eventuality. A projective risk burn-up chart can certainly be useful for timebox planning, but it's still only a model of what you expect to happen in the real world.