On the Monday of this week, I was drafted in to help resolve a production incident on a system I had helped build before moving teams. What made this unusual was that I had no way to actually debug it myself at the time. So here’s a small experience post about what I did and what I learned.
NB: These are my personal conclusions, YMMV
A small amount of scene-setting
- Unruly puts developers on-call because we believe it makes us build better services.
- We have primary, secondary, and team-lead levels of escalation, but engineers with relevant experience can be drafted in if they are available.
- For REASONS, I had my laptop but no internet access, so I couldn’t see anything that was going on.
- I was on a phone call with the lead of the team that owns the service, and he had all the usual tools at his disposal.
For the next half hour, I asked him questions and helped debug remotely, based on what I knew of the system.
Lesson 1: Be absolutely clear with your questions, requests, and points.
He couldn’t read my mind.
If I were debugging alone, I would likely jump back and forth between my hunches, trying to cross them off as quickly as possible. The nature of this new dynamic required a much slower, more measured approach.
Don’t say “Can you read out the logs from 3 o’clock to 5 o’clock?”
We’re trying to identify possible causes of the issue, and these (caveat: in my opinion) are better phrased as questions rather than requests.
Do say “There was a deploy at 3 o’ clock today, is there anything unusual in the app log?”
If you have the source code available to you, referring to files and line numbers is really helpful — this enabled us to identify particular lines of configuration that might be causing the issue.
Lesson 2: Say why you’re asking the question
I’m not going to treat my colleague like he’s just a pair of hands for me, and I felt it was really important to clarify why I was asking the question. It gives an opportunity to short-circuit the query if it’s something that’s already been eliminated.
Don’t say “Can you search the logs for AWS request errors?”
Bonus points: ask about the output, not the act. I don’t really mind how we get to the answer, more that we eliminate or prove a potential cause.
Do say “I think the problem might be caused by the AWS credentials lacking the correct permissions. Are there AWS permission-denied errors in the app log?”
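To make that concrete, here’s a hedged sketch of the kind of search the person at the keyboard might run in response. The log path, its format, and the “AccessDenied” error text are assumptions for illustration, not what our system actually logged.

```shell
# Hypothetical app log for illustration; real paths and formats will differ.
cat > /tmp/app.log <<'EOF'
2020-03-02T15:01:10 INFO  request served in 12ms
2020-03-02T15:01:11 ERROR AccessDenied: not authorized to perform: s3:PutObject
2020-03-02T15:01:12 INFO  request served in 9ms
EOF

# Answer the specific question asked: are there AWS permission-denied errors?
# -n prints line numbers, which are easy to read back over the phone.
grep -n "AccessDenied" /tmp/app.log
```

Because the question names the suspected cause rather than a procedure, the responder is free to reach the answer with grep, their log aggregator, or whatever tool they know best.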
Lesson 3: They are the system owners, and the experts. Treat them as such.
I was pulled in because I’d worked on the system before, but that was well over a year ago. The system might have changed in ways I don’t know about, so it was important for me to recognise that my mental model might be out of date and to tailor my questions accordingly.
Do say “I remember it behaving like X due to Y. Is this still true?”
In this scenario, there were a number of things I could probably have eliminated based on my previous knowledge of what the failure modes would look like for network issues, datastore connections, and so on, but I didn’t want to assume anything.
Lesson 4: On-call outages are inherently stressful. Don’t make it worse.
Feedback when debugging yourself arrives on the order of seconds.
The thought -> question -> action -> reply loop is significantly longer.
Given that we’re trying to solve the issue in the shortest time possible, this slower loop can make things even more stressful if it gets out of hand.
There are several things you can do, but they depend on how the other person works: for example, are they someone who likes to talk a lot, or someone more measured in the way they speak?
With the former, I would talk through my thought process to keep the conversation flowing; I wouldn’t do this in the latter case.
I hope this will be a useful read for people who do on-call and who might encounter something like this. tl;dr: try your best to help, be clear and up-front with your questions, and show empathy and care so as not to make things more stressful.