
Reducing Variance and Increasing Mean in Continuous Delivery

12 Nov 2014

continuous delivery

When I apply Continuous Delivery principles to the operations-focused areas of software development, I often trot out the mantra of “Reduce Variance, Increase Mean”. A mantra is commonly defined as a phrase or saying that encapsulates certain beliefs or practices in a simple way, but the problem with boiling ideas down like this is that they often lose their subtleties. In this post, I’ll show what I mean when I talk about reducing variance and increasing mean in the context of software development and Continuous Delivery.

The idea comes from a manufacturing discipline called Six Sigma, developed at Motorola in the mid-80s and introduced to General Electric around 10 years later, and based heavily on Plan-Do-Check-Act feedback loops. In particular it focuses on using statistical methods to measure and control quality, and draws its name from the rarity of an event that lies six standard deviations from the mean - something we can rule out with a very high level of statistical certainty. For comparison, new discoveries in particle physics have to reach a five-sigma level of significance (and be independently confirmed) to be accepted as true discoveries.

How can we do that if we don’t know what we are optimising against? We should select measures of things that we are actively able to change, and monitor how they affect our pipeline. There are two important notions here that we can creatively acquire (read: steal) to use when we’re thinking about our Continuous Delivery pipelines: reducing special-cause variation and reducing common-cause variation.

Part 1: Reducing Special-Cause Variation

We will define special-cause variation as unknown factors introducing random interference into our measure. A high level of special-cause variation means that the process being measured is unstable and unpredictable. This uncertainty flies in the face of the ideas of reproducibility and determinism that become increasingly important as we grow our software.

Consider the following example of optimising against deployment time, assuming we are delivering in small incremental chunks.

The last four deployments of an application have taken 20 minutes, 12 minutes, 35 minutes, and 60 minutes.

These output values seem to have an unusually wide spread, which could be an indicator of special-cause variation.
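To put a rough number on that spread, here is a minimal sketch of the kind of calculation you could run over the timings above, using Python's statistics module; nothing here is specific to any particular pipeline tooling.

    from statistics import mean, stdev

    # The last four deployment times from the example, in minutes.
    durations = [20, 12, 35, 60]

    avg = mean(durations)      # 31.75 minutes
    spread = stdev(durations)  # roughly 21 minutes (sample standard deviation)

    # A spread of the same order of magnitude as the mean is a strong hint
    # that something unpredictable - special-cause variation - is at work.
    print(f"mean={avg:.1f}min stdev={spread:.1f}min cv={spread / avg:.0%}")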

Of course there are no cure-all solutions to these problems, which is why it’s important to monitor and consume feedback, but the most likely culprits I’ve stumbled upon tend to be tests that fail non-deterministically and calls to flaky external services. These two cases are easily solved by stubbing service calls and by aggressive test quarantining and maintenance (deleting tests that have little to no value is rarely a bad thing), and doing so forces our example deployment to become much more consistent.
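As a sketch of the stubbing idea, something like the following keeps a flaky third-party call out of the test run entirely. The payment_gateway module and the checkout function are hypothetical stand-ins for whatever your code actually calls; the technique itself is Python's unittest.mock.

    from unittest import TestCase, mock

    import payment_gateway      # hypothetical wrapper around the flaky external service
    from shop import checkout   # hypothetical code under test


    class CheckoutTest(TestCase):
        @mock.patch.object(payment_gateway, "charge", return_value={"status": "ok"})
        def test_checkout_confirms_order(self, fake_charge):
            # The real network call never happens, so the test is fast and
            # deterministic regardless of the external service's mood.
            order = checkout(basket=["widget"], card="4111-1111-1111-1111")
            self.assertEqual(order.status, "confirmed")
            fake_charge.assert_called_once()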

After these changes, the next four deployments have taken 15 minutes, 19 minutes, 16 minutes, and 15 minutes.
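Feeding both sets of timings through the same little calculation (a continuation of the earlier sketch) makes the change concrete: the spread collapses from around 21 minutes to around 2, while a typical deploy now reliably sits at about 16 minutes.

    from statistics import mean, stdev

    before = [20, 12, 35, 60]
    after = [15, 19, 16, 15]

    for label, durations in (("before", before), ("after", after)):
        print(f"{label}: mean={mean(durations):.1f}min stdev={stdev(durations):.1f}min")

    # before: mean=31.8min stdev=21.1min
    # after: mean=16.2min stdev=1.9min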

But aggressively pulling our measurements in towards the mean is only half the battle. The other half is then moving the mean itself.

Part 2: Reducing Common-Cause Variation

Unlike special-cause variation, which is caused by unknown factors acting unexpectedly, common-cause variation is the constant interference that is present in every run and so determines our measure’s average. Changing database tests to truncate tables instead of rebuilding an empty database is a good example of reducing common-cause variation, because it affects all such tests, attacking the mean and not the variance.
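As a rough sketch of what that change might look like, assuming a PostgreSQL test database reached through psycopg2 and a hypothetical list of application tables:

    import psycopg2

    APP_TABLES = ["orders", "customers", "audit_log"]  # hypothetical table names


    def reset_test_database(conn):
        # Rather than dropping the database and replaying every migration
        # before each run, keep the schema in place and just empty the tables.
        # The saving applies to every single run, so it moves the mean.
        with conn.cursor() as cur:
            cur.execute(
                "TRUNCATE {} RESTART IDENTITY CASCADE".format(", ".join(APP_TABLES))
            )
        conn.commit()


    conn = psycopg2.connect("dbname=app_test")
    reset_test_database(conn)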

But if common-cause variation affects everything, then surely dealing with it will yield bigger rewards, so why not tackle it first?

If we ignored special-cause variation and concerned ourselves only with common issues, we would run the risk of incorrectly diagnosing the former as the latter and ending up at a local optimum rather than a global one - we could make all our deploys five minutes faster, but with the variance we started with, the edge cases would still be unacceptably slow. On the other hand, reducing special-cause variation beforehand means that we can focus our time on improving at a global level, with confidence that the effect will be global.
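A back-of-the-envelope illustration of that ordering argument: apply a hypothetical common-cause fix that shaves five minutes off every deploy to both the noisy series and the steadied series from earlier.

    noisy = [20, 12, 35, 60]
    steady = [15, 19, 16, 15]

    print([t - 5 for t in noisy])   # [15, 7, 30, 55] - the worst case is still nearly an hour
    print([t - 5 for t in steady])  # [10, 14, 11, 10] - every deploy is now comfortably quick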

Optimising against an Effective Metric

Applying RVIM (Reduce Variance, Increase Mean) is only beneficial if we use it to optimise a metric in a way that increases the value of that particular application. As the saying goes, time is money, and time-based metrics are among the easiest and most profitable to optimise against. A fast deploy keeps the feedback loops tight, but why not take it one step further and consider Full-Rebuild Time: for any running application, how quickly can you get another instance of it into production and delivering value?

As an application scales up, the time between failures becomes far less important than how quickly you can recover from them. Past a certain scale, some part of the application will be failing most of the time, and fault-tolerance becomes an absolute necessity.

In eXtreme Programming Explained, Kent Beck talks about the concept of a coffee-break build - being able to build your application in the time it takes for a cup of coffee - and the idea of a coffee-break deploy is the next logical step. But what about a coffee-break rebuild? Can we deploy a change and rebuild an entire application in the time it takes to get a hot drink?

In order to shrink the pipeline like this, we must apply RVIM to almost every time-based metric within it: compilation, unit tests, full integration tests, provisioning environments, application deployment, everything. All of these steps will need to be fully automated and monitored, which for large applications can be a Herculean task. Luckily, we have a huge toolbox provided to us by the trend of Software-Defined Everything, which I cover later on.
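A minimal sketch of what that monitoring could look like: time every stage and publish the numbers. The stage commands here (make targets and a couple of shell scripts) are placeholders for whatever your pipeline actually runs.

    import subprocess
    import time
    from contextlib import contextmanager

    timings = {}


    @contextmanager
    def timed(stage):
        start = time.monotonic()
        try:
            yield
        finally:
            timings[stage] = time.monotonic() - start


    def run(stage, *cmd):
        # Run one pipeline stage and record how long it took.
        with timed(stage):
            subprocess.run(cmd, check=True)


    run("compile", "make", "build")
    run("unit-tests", "make", "test")
    run("integration-tests", "make", "integration")
    run("provision", "./provision.sh", "staging")
    run("deploy", "./deploy.sh", "staging")

    for stage, seconds in timings.items():
        print(f"{stage:>17}: {seconds:7.1f}s")
    print(f"{'full rebuild':>17}: {sum(timings.values()):7.1f}s")  # the coffee-break target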

Time-based metrics are among the lowest-hanging fruit to build feedback around, but we can be more conceptual with these ideas. How about using the difference between instances of our production environment as our measure? Running different versions of Apache, Tomcat, or PostgreSQL in the same application stack should be a smell. Running applications on different virtual instances should probably be a smell too. We can act against this configuration drift by being extremely aggressive with rebuilding and reprovisioning instances - luckily, automation and software-defined everything can help us out here too.
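Here is a sketch of one way to put a number on that drift. The inventories are hypothetical hard-coded data; in practice they would come from an inventory tool or a package query run against each host.

    def drifting_packages(inventories):
        # Return the packages whose installed versions differ across instances.
        all_packages = set().union(*inventories.values())
        return sorted(
            pkg for pkg in all_packages
            if len({inv.get(pkg) for inv in inventories.values()}) > 1
        )


    inventories = {
        "web-01": {"apache2": "2.4.10", "tomcat7": "7.0.56", "postgresql": "9.3.5"},
        "web-02": {"apache2": "2.4.10", "tomcat7": "7.0.52", "postgresql": "9.3.5"},
        "web-03": {"apache2": "2.4.7",  "tomcat7": "7.0.56", "postgresql": "9.3.5"},
    }

    # ['apache2', 'tomcat7'] - a non-empty list is a smell, and those instances
    # are candidates for an aggressive rebuild.
    print(drifting_packages(inventories))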

It’s not all sunshine and roses: one potentially negative effect of homogenisation is a reduction in innovation or evolution, although this can be countered with explicit innovation/experimentation time. We also need to be wary of turning metrics into hard targets. They are signposts, gesturing us in the right direction to solve our next problems, rather than numbers we must drive down because low values are inherently good. It’s easy to speed up test runs: remove all the tests. This is obviously a terrible idea, but it demonstrates the danger of treating metrics as targets rather than as a source of guidance.