Notes from the Week #8

Despite my best efforts, I seem to be succumbing to the same cold that’s going around at work right now, so this is a brief and late set of weeknotes from last week. My brain is full of fluff and being slow. Blerggh.

Money talks

Steve and I had another excellent Wednesday chat, and we talked a bit about our experiences with budgeting and procurement. We’re both in a similar position where the real value of our teams is often in second- or third-order effects rather than direct revenue.

For example, replacing a system might shave off toil (in the SRE Book sense of the word) for each team, increasing productivity and streamlining onboarding. From a budgetary point of view, though, it’s still just a cost, and the effects might have a long lead time.

In these cases, it’s often hard to make compelling business cases (particularly if they are risk-based, and we’ve accepted the risk up until this point).

There seems to be no easy answer to this problem other than making good-faith arguments about the perceived benefits of the change, enumerating costs and savings wherever possible, and trying to monitor things like time spent on toil within each team.

If anyone has good suggestions, please let me know through Twitter or other means!

Adopting a new platform

After a few weeks of trialling and weighing up the costs and benefits, the Shift team has moved from an in-house paging service to Opsgenie. We’re a small team, so we reap the benefits of the free plan, but we like it so much that we’re putting together a business case to roll it out across the whole of Product Development at Unruly.

The integration was delightfully easy, so major kudos to the Opsgenie team and their thorough documentation!

Moving towards SLx

Before the Shift team came into existence, shared systems were collectively owned and so ‘everyone’ was responsible for them. Now that there is a dedicated team to drive and improve them, we feel the need for a conversational tool for talking about the reliability of the mission-critical shared services we maintain (like metric collection, monitoring and paging).

We’re going to be experimenting with SLx (service level indicators/objectives/agreements) and error budgets as a way to communicate these things.
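As a rough sketch of the arithmetic behind an error budget: pick an SLO target, count good and bad events over a window, and see how much of the allowed badness has been used up. Everything here is made up for illustration (the function name, the 99.9% target, the event counts), so treat it as one possible way of framing the numbers rather than what we have actually settled on.

```python
from dataclasses import dataclass


@dataclass
class BudgetReport:
    sli: float              # measured fraction of good events in the window
    allowed_bad: float      # bad events the SLO tolerates in the window
    remaining_bad: float    # bad events still available (negative if overspent)
    budget_consumed: float  # fraction of the error budget used (1.0 = all of it)


def error_budget(slo_target: float, total_events: int, bad_events: int) -> BudgetReport:
    """Compare observed bad events against what a hypothetical SLO allows."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0 or not 0 <= bad_events <= total_events:
        raise ValueError("need total_events > 0 and 0 <= bad_events <= total_events")

    # The SLI is simply the proportion of events that were good.
    sli = (total_events - bad_events) / total_events
    # The error budget is whatever the SLO leaves over, e.g. 0.1% for a 99.9% target.
    allowed_bad = (1.0 - slo_target) * total_events

    return BudgetReport(
        sli=sli,
        allowed_bad=allowed_bad,
        remaining_bad=allowed_bad - bad_events,
        budget_consumed=bad_events / allowed_bad,
    )


if __name__ == "__main__":
    # Illustrative numbers only: 100,000 pages sent, 40 of them late or lost.
    report = error_budget(slo_target=0.999, total_events=100_000, bad_events=40)
    print(f"SLI: {report.sli:.4%}, budget consumed: {report.budget_consumed:.0%}")
```

The useful bit for conversations is `budget_consumed`: while it stays well under 1.0 there is room to take risks, and once it gets close to 1.0 the argument for prioritising reliability work more or less makes itself.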

There’s a good chapter on service level objectives in the SRE Book.