
Notes from the Week #15

22 Jan 2019


I’ve spent more time pairing this week than I have all year - granted that’s just over 3 weeks, but still, I often miss the tight collaboration of pairing when I have a fragmented week due to discussions and huddles across the business.

Four Things That Happened

Where’s O11y?

This week’s main event was a kick-off session to make sure Product Development is aligned around one of our explicit tech goals for the next quarter: Observability. As Shift team lead, and the owner of the observability strand of work within Shift, I’ve been quite knee-deep and hands-on with this saga.

I ran a quick, non-scientific poll in our Slack channel to get a rough idea of how people currently felt about it - my suspicion was that people had ideas about what observability meant, but that those ideas were probably inconsistent at best and contradictory at worst.

A lot of the work to improve our observability foundations will be driven through the Shift team, so members of each ProDev team got together and worked in groups to come up with user stories (as developers) for features that would improve their ability to observe their systems in production.

I’m (internally) working from the definition of Observability below:

… the ability to infer the internal state and behaviour of a system from its outputs

In the context of software development, those outputs could be metrics, logs, or actual system outputs. Shift will be sharing their recent experiences with SLx and error budgets with the wider team.
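To make that concrete, here’s a minimal sketch (in Python, with entirely made-up event and field names - this isn’t our actual schema) of one such output: a structured log line that a downstream tool could aggregate into metrics:

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")

def handle_ad_request(campaign_id):
    start = time.monotonic()
    # ... serve the ad here ...
    duration_ms = (time.monotonic() - start) * 1000
    # One structured "output" a tool (or a person) can use to infer
    # what the system is doing internally.
    logging.info(json.dumps({
        "event": "ad_request_served",
        "campaign_id": campaign_id,
        "duration_ms": round(duration_ms, 2),
    }))

handle_ad_request("campaign-1234")
```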

Load balancing between teams

We have two teams in ProDev that ostensibly have infrastructure as their primary concern:

  1. Shift, the Shared Infrastructure Team, led by me, accountable to the core development teams and the CTO.
  2. SREs, the Site Reliability Engineers, led by the VP of Architecture, accountable to the team leads and the CTO.

The work that we do tends to overlap a lot - how do we define the boundaries between the two such that neither team is stepping on the other’s toes, but we don’t let things “fall through the cracks”?

SREs help the core development teams achieve excellence in their team-specific infrastructure concerns - Shift doesn’t have the size or the bandwidth to be across all of them. Our SREs are also not part of the on-call system, and act as sources of expertise for ProDev.

Shift, on the other hand, is responsible for the operational stability of shared systems, and its explicit mission is to build a solid foundation for systems consumed by all teams, such as metrics, alerting, and configuration management.

But we work together!

SREs often prototype ideas and technologies that Shift doesn’t have the capacity to explore, and we collaborate on bringing them into the production environment in a state that Shift is happy to be on call for. On the flip side, Shift benefits from their expertise in much the same way as the core development teams do.

The result is quite a harmonious relationship, and I’m eagerly awaiting the upcoming advances that we’ll be collaborating on.

Shared S3 == Too Much Responsibility

I paired a lot more than usual this week with Petrut, one of our SREs. We’ve been fine-tuning lifecycle policies on our S3 buckets and encouraging each team to create new buckets rather than use shared ones.

When we were much smaller, having a single shared bucket for e.g. backups was fine, but as we’ve grown we’ve had to create numerous policies on the same bucket to cope with each separate prefix.
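To illustrate (with a hypothetical bucket and hypothetical prefixes - this isn’t our real configuration), the lifecycle rules on a shared bucket end up looking something like this in boto3:

```python
import boto3

s3 = boto3.client("s3")

# Each team sharing the bucket needs its own rule (or several) - the list
# below only grows as more teams and prefixes move in.
s3.put_bucket_lifecycle_configuration(
    Bucket="shared-backups",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "team-a-backups-to-glacier",
                "Filter": {"Prefix": "team-a/backups/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
            },
            {
                "ID": "team-b-expire-old-reports",
                "Filter": {"Prefix": "team-b/reports/"},
                "Status": "Enabled",
                "Expiration": {"Days": 90},
            },
            # ... and so on, one block per team/prefix combination
        ]
    },
)
```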

It’s hard to grok even for someone who’s been at Unruly for a while (like me), so we’re in the process of decommissioning our shared S3 buckets in favour of team-owned, single-responsibility buckets.

Moving data into AWS Glacier like ...

It feels a lot like refactoring a piece of code with too many responsibilities, or breaking apart a monolithic application into smaller services - except that doing a bulk data copy is neither fast nor cheap, and we want to avoid breaking changes during switch-overs too.
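The mechanics of a switch-over are simple enough to sketch - the bucket and prefix names below are invented, and a real migration would also need verification, retries, and a cut-over plan - it’s the scale and coordination that make it slow:

```python
import boto3

s3 = boto3.client("s3")

SRC_BUCKET, SRC_PREFIX = "shared-backups", "team-a/"
DEST_BUCKET = "team-a-backups"

# Copy every object under one team's prefix into that team's own bucket.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=SRC_BUCKET, Prefix=SRC_PREFIX):
    for obj in page.get("Contents", []):
        s3.copy(
            CopySource={"Bucket": SRC_BUCKET, "Key": obj["Key"]},
            Bucket=DEST_BUCKET,
            Key=obj["Key"][len(SRC_PREFIX):],  # drop the old team prefix
        )
```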

Counting S3 Vastness

I thoroughly enjoy the experience of using AWS Athena - we’ve been using it at Unruly pretty much since it was released, for reporting aggregation and building our data lake. I get into a workflow of (sketched in code below):

  1. Discover process that takes a long time or outputs a lot of data
  2. Write a Python script to generate smaller CSV files from (1)
  3. Upload the CSVs to S3
  4. Query with Athena
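Steps 3 and 4 look roughly like this in boto3 - the bucket, database, and table names are hypothetical, and this assumes an Athena table has already been defined over s3://my-data-bucket/assets/:

```python
import boto3

s3 = boto3.client("s3")
athena = boto3.client("athena")

# Step 3: upload a generated CSV next to the others backing the Athena table.
s3.upload_file("asset-metadata-00001.csv", "my-data-bucket",
               "assets/asset-metadata-00001.csv")

# Step 4: query it. Athena runs asynchronously and writes results back to S3.
response = athena.start_query_execution(
    QueryString="SELECT COUNT(*) FROM assets WHERE campaign_ended = true",
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-query-results/"},
)
print("Query submitted:", response["QueryExecutionId"])
```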

As part of the work I was doing with Petrut, we were investigating digital assets from ad campaigns that have long since concluded - videos, images, etc. There were quite a lot of these, so we uploaded almost 10,000 CSV files containing metadata about the assets to S3 and could then run sub-5-second queries over it all.

The more I do this kind of work, the more I realise the sheer utility of a simple, standard format like CSV: it can go pretty much anywhere. If we wanted to do some more fun stuff with the data, we could write some ETL jobs in e.g. Python, which has CSV and JSON support out of the box.
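For example, going from CSV to JSON takes nothing but the standard library (the file and its columns are invented for illustration):

```python
import csv
import json

# Read asset metadata from a CSV and re-emit it as JSON - no third-party
# dependencies required.
with open("asset-metadata.csv", newline="") as f:
    rows = list(csv.DictReader(f))

with open("asset-metadata.json", "w") as f:
    json.dump(rows, f, indent=2)
```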


I’m going to be leveraging the Athena learnings in my investigations into commit data using conventional commits in the near future, so watch this space for a blog post full of graphs and fun facts!