Writing a series on how to engineer an engineering team, I’m tempted to start with how to do things right. But a more impactful starting point is actually what to do when things go wrong.
Because things will go wrong, even if you’re doing things right.
A postmortem is a meeting held after an incident is resolved, scheduled to achieve the following goals:
ensure the root cause is fixed
review the team’s response to the incident and improve how you respond
identify patterns to prevent future similar incidents
share knowledge about the system across the team
Context Matters
All my recommendations come with the caveat that context matters. For example, how strictly should your team implement postmortems? It depends—how effectively do they currently respond to incidents?
Two years ago, our team lacked a strong postmortem culture. Recognizing this, we committed to running postmortems for even minor incidents, deliberately cultivating a habit of learning and improvement. Over time this became second nature. Today, postmortems are a core part of our process, and I no longer push for one after every small issue.
So here are my pro tips. Based on these principles, we've built a template, linked below, that anyone is free to use.
Postmortem Pro Tips
Bad teams fix the symptom. Good teams fix the root cause. High-performance teams fix the root cause and pattern-match to future-proof against similar issues.
The 5 Whys – keep asking why
In your postmortem, start with what happened, then ask why it happened. Then ask again. And again. Five times in total. Often the first why surfaces a symptom. The second or third why uncovers the root cause. And the fourth or fifth why reveals patterns you can prevent in the future.
Here is an example of the 5 Whys from a testnet incident a few months ago:
Why did the chain stall? A bug in the attest module.
Why did that bug occur? Fuzzy votes triggered a conflict.
Why was that possible? The deduplicate function misused attest headers.
Why was that mistake made? Lack of unit testing.
Why weren’t tests in place? The attest module wasn’t fully tested.
As we can see, if we had stopped after the first couple of whys, we would only have fixed the immediate issue. By going deeper, we realized we needed to increase test coverage for our modules across the board.
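To make that last step concrete, here is a minimal sketch of the kind of unit test the final whys point toward. The Vote type, the dedup function, and its policy are hypothetical stand-ins for illustration, not Omni's actual attest code:

```go
// dedup_test.go — a standalone sketch; in practice this test would live
// next to the real attest module code. All names here are hypothetical.
package attest

import "testing"

// Vote stands in for an attestation vote.
type Vote struct {
	Validator string
	BlockHash string
}

// dedup keeps at most one vote per validator (a simplified policy for the sketch).
func dedup(votes []Vote) []Vote {
	seen := make(map[string]bool)
	var out []Vote
	for _, v := range votes {
		if seen[v.Validator] {
			continue
		}
		seen[v.Validator] = true
		out = append(out, v)
	}
	return out
}

// TestDedupConflictingVotes covers the case the incident exposed:
// two conflicting votes from the same validator.
func TestDedupConflictingVotes(t *testing.T) {
	votes := []Vote{
		{Validator: "val1", BlockHash: "0xabc"},
		{Validator: "val1", BlockHash: "0xdef"}, // conflicting vote from the same validator
	}
	if got := len(dedup(votes)); got != 1 {
		t.Fatalf("expected 1 vote after dedup, got %d", got)
	}
}
```

The point isn't the logic itself: it's that a root cause like "lack of unit testing" should translate into specific, reviewable tests like this one, not just a note in a document.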
Focus on improving MTTR: Mean Time to Restore
MTTR is one of the four key metrics DORA uses to measure software delivery performance and distinguish elite teams. It measures the average time it takes to restore service after an incident occurs.
Break down each incident into 3 stages:
Detection: How long did it take for the system to flag the issue?
Diagnosis: How long did it take to pinpoint the root cause?
Resolution: How long did it take to fix and confirm stability?
Which of these 3 stages was the bottleneck? How can you improve it?
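To make the breakdown concrete, here is a minimal Go sketch that computes per-stage durations and an overall MTTR from incident timestamps. The type and field names are hypothetical, not part of any particular tooling:

```go
package main

import (
	"fmt"
	"time"
)

// Incident captures the key timestamps for one incident.
// All names here are illustrative; adapt them to your own tracking data.
type Incident struct {
	Started   time.Time // when the issue began
	Detected  time.Time // when an alert or a person flagged it
	Diagnosed time.Time // when the root cause was pinpointed
	Resolved  time.Time // when the fix was confirmed stable
}

// Stages returns the duration of each stage so you can spot the bottleneck.
func (i Incident) Stages() (detection, diagnosis, resolution time.Duration) {
	return i.Detected.Sub(i.Started), i.Diagnosed.Sub(i.Detected), i.Resolved.Sub(i.Diagnosed)
}

// MTTR is the average time from start to restore across a set of incidents.
func MTTR(incidents []Incident) time.Duration {
	if len(incidents) == 0 {
		return 0
	}
	var total time.Duration
	for _, i := range incidents {
		total += i.Resolved.Sub(i.Started)
	}
	return total / time.Duration(len(incidents))
}

func main() {
	t := time.Date(2024, 3, 1, 12, 0, 0, 0, time.UTC)
	inc := Incident{
		Started:   t,
		Detected:  t.Add(25 * time.Minute),
		Diagnosed: t.Add(70 * time.Minute),
		Resolved:  t.Add(90 * time.Minute),
	}
	det, diag, res := inc.Stages()
	fmt.Printf("detection=%s diagnosis=%s resolution=%s\n", det, diag, res)
	fmt.Printf("MTTR=%s\n", MTTR([]Incident{inc}))
}
```

In this hypothetical incident, diagnosis dominates the 90-minute restore time, so the highest-leverage action items would target faster root-cause identification: better dashboards, runbooks, or tracing.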
Get notified proactively and sooner
Here are some good questions to ask during your postmortem to identify how you can be alerted to incidents sooner.
When/how were we first alerted?
Was the alert close to the actual issue, or was it a symptom?
Were there lots of alerts firing, or was it a focused alert on the core issue?
Was the alert actionable?
Were there false positives?
How can we improve this?
Minimize blast radius
Who was affected? How were they affected? What percentage of users were affected?
Did the root issue have a larger impact than it should have? For example, a simple error that resulted in a panic and caused cascading effects across multiple services, when it should have been handled by its originating service.
Focus on better isolating failure points to prevent cascading failures.
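The panic example above maps directly to error-handling discipline in code. Here is a hedged Go sketch of two layers of containment: handle expected errors inside the originating service, and recover from unexpected panics at the request boundary so one failure cannot cascade. The handler, endpoint, and parsing logic are hypothetical:

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"net/http"
)

// handlePriceUpdate is a hypothetical handler. A malformed input is a
// simple error: handle it here, in the originating service, instead of
// letting it escalate into a panic.
func handlePriceUpdate(w http.ResponseWriter, r *http.Request) {
	price, err := parsePrice(r.URL.Query().Get("price"))
	if err != nil {
		http.Error(w, "invalid price", http.StatusBadRequest)
		return
	}
	fmt.Fprintf(w, "updated to %.2f\n", price)
}

func parsePrice(s string) (float64, error) {
	var p float64
	if _, err := fmt.Sscanf(s, "%f", &p); err != nil || p <= 0 {
		return 0, errors.New("malformed price")
	}
	return p, nil
}

// recoverMiddleware is the last line of defense: if a handler does panic,
// contain it to this one request rather than crashing the whole service
// and cascading the failure to everything that depends on it.
func recoverMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		defer func() {
			if rec := recover(); rec != nil {
				log.Printf("recovered panic in %s: %v", r.URL.Path, rec)
				http.Error(w, "internal error", http.StatusInternalServerError)
			}
		}()
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/price", handlePriceUpdate)
	log.Fatal(http.ListenAndServe(":8080", recoverMiddleware(mux)))
}
```

The first layer keeps expected errors where they belong; the recover middleware trades one failed request for the whole process staying up, which is exactly the blast-radius question a postmortem should ask.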
Postmortems don’t matter without action items
Come out of every postmortem with a list of concrete, measurable action items. Every action item should also have a DRI: a directly responsible individual who owns following through on it.
Examples
Change a threshold on an alert
Fix a bug
Add an alert
Use data – metrics, logs, or graphs to tell the story of the incident
Nothing matters without data!
If you don't have the right data to tell the story of the incident, then your action items should include adding the metrics and logs you were missing, so you have the right data next time.
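As one concrete illustration of that kind of action item, here is a minimal Go sketch that adds a counter and a paired log line, assuming the Prometheus Go client; the post doesn't prescribe any particular tooling, and the metric name and attest example are hypothetical:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// attestConflicts is a hypothetical counter: if the postmortem showed you
// couldn't tell how often conflicting votes occurred, an action item might
// be to add a metric like this so the next incident tells its own story.
var attestConflicts = promauto.NewCounter(prometheus.CounterOpts{
	Name: "attest_vote_conflicts_total",
	Help: "Conflicting attestation votes detected by the attest module.",
})

// recordConflict increments the metric and emits a log line carrying the
// detail the metric can't (which height the conflict happened at).
func recordConflict(height uint64) {
	attestConflicts.Inc()
	log.Printf("attest: conflicting votes at height %d", height)
}

func main() {
	recordConflict(42) // example usage

	// Expose the metrics endpoint for your scraper and dashboards.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":2112", nil))
}
```

Metrics give you the shape of the incident over time; logs give you the specifics. The postmortem is the right moment to notice which of the two you were missing.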
Where did you get lucky?
Identify where luck may have played a role in getting you out of a sticky situation. Look for places where the system is brittle but happened not to break this time.
Luck hides issues, but brittleness will eventually be exposed, so it’s better to identify it now.
Example: maybe the incident coincided with a low-traffic period. How would it have played out during a high-traffic period?
Embrace blameless postmortems
This is one of DORA's key recommendations for running effective postmortems.
Pointing fingers is not the point. The point is to build a better system and ultimately a better product.
Everyone should feel comfortable surfacing and discussing issues, without fear of “killing the messenger”.
Emphasize sharing and learning culture.
Allocate time for questions, discussion, and learning.
Encourage people from different teams to join – it’s a great way to understand more about complex system architectures.
Postmortem Template
Based on these principles, we have established a template used to guide our postmortem meetings. The owner of the incident should fill most of it out before the meeting, but coming up with action items should be a group effort.
Here is a link to Omni's template. Feel free to use it to implement postmortems on your own team.
Closing Thoughts
Creating a culture of learning and improvement is the most effective way to move from a good team to a great team to an elite team. It can take a long time, but postmortems are the best mechanism for taking big leaps forward as a team, because they surface your biggest bottlenecks to success.
Best of luck in your journey of getting better at getting better!
Resources
For more in-depth reading on building resilient engineering teams and effective incident management, check out the following resources:
Special shoutout to Angus Cowell for sending me the Elmo Burning meme every time the market drops by >1% in a day.