Don't worry, it's ... Probably Fine

Notes from the Week #14

11 Jan 2019

week notes

Getting back into the habit of blogging after stopping for almost a month is a bit of a wrench, but it means there’s more juicy stuff for me to talk about!

new year, new me

I’ve been digging really hard into reliability and SLXs (SLIs, SLOs and SLAs) this year - we’re already well down the road to better understanding the existing reliability of our Graphite metrics-collection system, and are hoping to apply what we learn to building SLXs and Error Budgets for our other core systems.

We’ve already gained a lot of insight into how our customers use the metrics collection API to build dashboards:

  • Setting harder timeouts at the load balancer level has exposed just how many dashboards have panels with too many metrics (as Grafana bunches them into a single request).
  • Measuring our error rates highlighted the number of dashboards relying on metrics that are no longer being collected and cause 5xx responses at the backend.
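For anyone who hasn't met Error Budgets before, the arithmetic is simple enough to sketch. This is only an illustration: the function name, the 99.9% target and the request counts are all made up, and none of it reflects our real Graphite numbers.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    allowed_failures: float  # failures the SLO tolerates over the window
    remaining: float         # allowed failures minus the failures observed
    exhausted: bool          # True once we have failed more than allowed


def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> ErrorBudget:
    """Work out how much error budget is left for one measurement window.

    slo_target is the fraction of requests that should succeed, e.g. 0.999.
    All inputs here are hypothetical; a real setup would read them from
    load balancer or backend metrics.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if failed_requests < 0 or total_requests < failed_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")

    allowed = (1 - slo_target) * total_requests
    remaining = allowed - failed_requests
    return ErrorBudget(allowed, remaining, remaining < 0)


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 250 of them 5xx responses.
    print(error_budget(0.999, 1_000_000, 250))
```

The useful part is less the maths than the conversation it enables: once the budget is spent, reliability work has a concrete argument for taking priority.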

I want Shift to set a good example for ProDev as a whole - as we scale out a service-oriented approach to architecture, being able to have concrete discussions about reliability is going to become paramount.

Keeping records of our decisions

Shift built out a new Puppet module under our open-source repository. The nrpe_custom_check module wraps several different configuration files to provide a clean interface to build NRPE checks for production machines.

We designed it to have as few configurable parts as possible (indeed, it only has 3 inputs - name, content, and whether the script needs sudo privileges) but hit a major snag on an architectural point.

There’s an existing module base which has some default NRPE plugins. Given base is designed to be completely standalone and “batteries included”, should it depend on and use the nrpe_custom_check module rather than using static files?

There was a lot of back-and-forth debate about the merits of composing small, well-defined modules together versus the risk of jumping into an abstraction too quickly rather than letting the design evolve “naturally”.

In the end, we decided that the points made were too valuable to lose in the mists of time and resolved to adopt Architecture Decision Records. The implementation we chose was the Nygard format supported by Nat Pryce’s excellent adr-tools toolchain.

You can check out our record here: 2. Standalone NRPE Custom Check module
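If you haven't seen the Nygard format before, a record is just a short numbered document with Status, Context, Decision and Consequences sections. The snippet below is a rough sketch that renders that skeleton. The render_adr helper is hypothetical (it is not part of adr-tools) and the section bodies are placeholders, not the text of our actual record.

```python
from datetime import date
from typing import Optional


def render_adr(
    number: int,
    title: str,
    context: str,
    decision: str,
    consequences: str,
    status: str = "Accepted",
    on: Optional[date] = None,
) -> str:
    """Render a Nygard-style Architecture Decision Record as text.

    Hypothetical helper for illustration only; adr-tools generates
    its records from its own template.
    """
    written = (on or date.today()).isoformat()
    sections = [
        f"# {number}. {title}",
        f"Date: {written}",
        f"## Status\n\n{status}",
        f"## Context\n\n{context}",
        f"## Decision\n\n{decision}",
        f"## Consequences\n\n{consequences}",
    ]
    # Blank line between sections, single trailing newline.
    return "\n\n".join(sections) + "\n"


if __name__ == "__main__":
    print(
        render_adr(
            2,
            "Standalone NRPE Custom Check module",
            context="(placeholder) what forces and constraints applied",
            decision="(placeholder) what we chose to do",
            consequences="(placeholder) what becomes easier or harder",
        )
    )
```

The Context section is the one that matters most here: it captures what was known when the decision was made, which is exactly the part that tends to get lost.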

Historically, we haven’t been as good as we could be at recording not just the decision but also the context in which it was made. A key part of Norm Kerth’s Prime Directive is:

Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand.

Code, documentation, and domain models all succumb to rot over time and it’s important to remember that choices made are probably correct given the information available at the time.

I’m hopeful that we continue to adopt this, especially for shared projects like our monolithic codebases, so that knowledge about why we do certain things is not purely contained in the anecdotes of people who were there at the time.

What does value actually mean?

New year, and I’m having a bit of a new think on this topic. For a team like Shift whose role is primarily that of support:

  1. What does ‘value’ mean?
  2. How do we measure it effectively?

A (paraphrased) aphorism from our CTO has been rattling around in my head for a while:

At any given point in time, there is a good argument to not do X (usually in favour of feature development). Not doing X causes more problems, and will take longer to heal, the longer we leave it.

Here X can be anything from dependency upgrades to platform investments, or even refactoring.

Shift are fundamentally an enabling team - the systems we build support the other teams in ProDev and we are also a source of extra hands for doing work related to our area of expertise.

We frequently pair with other teams to share the knowledge and help guide them around pitfalls that we have experienced, all while improving the state of our own infrastructure by learning about the needs of our teammates.

We don’t often get the quick highs of rapid feature delivery that a focused product development team does, but we are steadily building out our own tooling and systems to aid ProDev.

2019 is a year full of potential and Shift are going to seize as much of it as possible.