
5 ways incidents made me a better engineer

A photo of Lisa Karlin Curtis

Incidents are a great opportunity to gather both context and skill. They take people out of their day-to-day roles, and force ephemeral teams to solve unexpected and challenging problems.

In my career, I've found incidents can be a great accelerator - for both myself and others around me. It was after leading my first incident at GoCardless that I started to feel really comfortable in the codebase and the team. I had the same experience joining incident.io (yes we do have incidents, and yes it is quite 🤯).

Learn outside your box

Incidents often occur at the edges of teams. That makes them a great chance to learn about stuff that isn't in your day-to-day remit.

The obvious example for me is infrastructure: at GoCardless we had an infrastructure group who provided a platform for us to deploy our services. I didn't interact with the infrastructure directly much in my first few months, so didn't have a strong mental model of how any of it fit together. That was a huge limitation on the kinds of problems I could solve. I couldn't make good decisions about how to best use our database, or how to manage asynchronous work, as I didn't understand the trade-offs.

Watching people solve incidents was the entry point I needed to start investigating and understanding our infrastructure, and how it connected to my day-to-day trade-offs.

Get straight to the difficult bits

Incidents are usually caused by (or manifest in) the most difficult parts of the systems we interact with. Seeing multiple incidents impact the same component is a great way to learn about that component, and a clear signal that understanding it will be valuable.

I've been introduced to a number of domain areas via incidents, including database replication (often the culprit), quorum (terrifying) and DNS (a classic). After getting some initial context during an incident, I could then spend some time reading about these concepts, confident that it would prove useful.

Make code go wrong in the right way

We're not perfect: our job is hard and our code is very likely to go wrong at some point. Instead of trying to write perfect code, incidents have shown me that it's more important to make code that fails in safe ways. This includes:

  • Making code alert loudly and clearly if it sees something that 'can't happen' (famous last words). Ideally, the alert should be easy to trace to a code comment, doc or commit message explaining why (when you wrote the code) you didn't think this would happen.
  • Keeping the blast radius for failures as small as possible: think carefully about what should be considered 'critical' for a given request, and get everything else out of the way. Being unable to log a user tracking event should never degrade the customer experience.
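
As a rough illustration of both points, here is a minimal Python sketch. Every name in it (`apply_refund`, `track_event`, `handle_request`, `InvariantViolation`) is hypothetical and made up for this example, as is the refund scenario itself. It assumes a request with one critical step and one non-critical tracking call, and is not how any particular codebase does it.

```python
import logging

logger = logging.getLogger("payments")  # hypothetical logger name


class InvariantViolation(Exception):
    """Raised when code reaches a state we believed was impossible."""


def apply_refund(balance_pence: int, refund_pence: int) -> int:
    """Return the new balance after a refund (hypothetical domain logic)."""
    if refund_pence > balance_pence:
        # "Can't happen": we assume refunds are validated upstream against
        # the balance. If this fires, that assumption was wrong, so alert
        # loudly and stop rather than carry on with bad data.
        logger.error(
            "invariant violated: refund %d exceeds balance %d",
            refund_pence,
            balance_pence,
        )
        raise InvariantViolation("refund exceeds balance")
    return balance_pence - refund_pence


def track_event(name: str) -> None:
    """Stand-in for a non-critical analytics call; here it always fails."""
    raise ConnectionError("tracking service unavailable")


def handle_request(balance_pence: int, refund_pence: int) -> int:
    # Critical path: if this fails, the whole request should fail.
    new_balance = apply_refund(balance_pence, refund_pence)

    # Non-critical: a tracking failure is logged but never reaches the customer.
    try:
        track_event("refund_applied")
    except Exception:
        logger.warning("tracking failed; continuing", exc_info=True)

    return new_balance
```

In this sketch, `handle_request(1000, 250)` still returns the new balance even though tracking is down, while `handle_request(100, 250)` raises `InvariantViolation` and logs an error that can be traced back to the comment explaining the assumption.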

While it's possible to read this stuff in textbooks, seeing the impact of these choices in real incidents is what taught me how to put this advice into practice.

Observability rocks

I've been in many incidents where a graph or set of log lines has been the key bit of information to help diagnose the problem. It's also usually what tells us that the incident is over. Finding components with poor observability can be really stressful: it's like someone blindfolding you and asking you to find the front door. Possible, but not fun or efficient.

Watching more experienced colleagues use our observability tools, and then using them myself, taught me how to use them to quickly get the information I needed. Once you understand how to use the information that's already there, it's easier to understand what other information would be useful when working on other projects.

Finding great people

Incidents are a great opportunity to meet people outside your team. Many of the colleagues I respected, valued and relied on most were not people I worked with day-to-day. It's a great chance to find people who have different skill sets from your usual teammates. Maybe there's someone who knows lots about a particular technology, or someone who is a really great teacher. Having a network of talented people I could ask for advice has been the single most impactful accelerator for my growth.

Closing thoughts

  • Get involved in incidents from day one, even if you're only at the very start of your career!
  • Be respectful of other people's time and situation. Observe quietly at first, note down questions to ask later.
  • Be honest with yourself and others about what you can and can't do alone. If you're happy to take the lead but want a pair - say so. That's a great way to learn, but depending on the situation might not be appropriate.
Lisa Karlin Curtis
Technical Lead
