Postmortems in Software Development

22 November 2023 · 5,484 views · Updated 08 February 2024

On August 1, 2012, Knight Capital Group suffered a severe trading loss due to a software error. A technician's failure to update one of the eight servers with new code led to the unintended activation of an obsolete function, 'Power Peg', causing significant market disruption. This resulted in erratic pricing for 148 NYSE-listed stocks and generated 4 million trades over 45 minutes, translating to a pre-tax loss of $440 million. Knight Capital's stock plummeted by over 70%, leading to a $400 million rescue investment.

Or another one: In 1990, AT&T faced a catastrophic network failure due to a single line of errant code in a software update, resulting in a $60 million loss and widespread disruption. AT&T's network, a model of efficiency, was brought to its knees because someone, somewhere, messed up a line of code. A misplaced 'break' statement in a C program caused data overwrites and system resets across the network. The fallout? Over 60,000 Americans lost phone service, 500 flights got delayed, affecting 85,000 people, and AT&T's wallet got $60 million lighter.

These would’ve been great post-mortems to read. Sadly, I couldn’t find them on the internet, but I’m sure both companies spent months investigating what led to those failures and implementing fail-safes.

Let's face the truth. If you're a CTO, or any kind of leader in tech, you're going to face a crisis. Probably not on the scale of Knight Capital Group or AT&T, but there’s definitely going to be a crisis. It's not a question of if, but when. I've dealt with several myself, ranging from a project spiraling out of control to a full-blown multi-hour datacenter outage that had a huge impact on our clients. It's like watching a slow-motion car crash: you see the car in front of you, but there's not a damn thing you can do to stop it. Brace for impact.

I remember this one time, we had a major product release lined up. Huge presentation, go-live time announced, partners aligned, marketing engaged, clients notified. We had all our ducks in a row, or so we thought. Then, during the launch, everything went to shit. The payments didn’t work. The signing certificates didn’t match. It was the kind of nightmare scenario you read about and think, "Thank God that's not me."

Well, it was me. And let me tell you, the weight of that responsibility hits you like a freight train. The customers trusted you to deliver, and things did not go as planned. To keep this story short: we hopped on an emergency call, pulling in our best minds to assess the issue, check the logs, and find the culprit. It was a race against time, but thank god the team was amazing. We were fully transparent with our client about what happened, and they were still pissed. We ran an internal investigation and learned from our mistake by implementing processes so it never happens again.

🏄 That’s why I think post-mortems are a great thing. They are, in a nutshell, an autopsy of a failure. Dissecting the what, why, and how of every screw-up and success. I think they should be done more often. 

Look at aviation or construction engineering. In these fields, post-mortems are conducted with a near-religious fervor. When a plane crashes or a bridge collapses, experts meticulously pick apart every detail, from mechanical failures to human errors. It’s a brutal, often sobering process, but it's how these industries have achieved remarkable safety records. They embrace the hard truth that the best lessons are often written in the aftermath of failure.

In software engineering, the stakes might not always be life and death like in aviation, but they're pretty damn high. A coding error can cost millions, a security breach can compromise thousands of personal data records, and a system failure can cripple a business.

We need to adopt the same unflinching attitude as aviation and construction engineering. Postmortems should not be mere formalities or blame games. They should be part of every process. Something went wrong? BAM, full analysis, transparent report, checklist added to the process, so it never happens again. Things will go wrong. Of course, you should be prepared, and plan, but no amount of planning can prepare you for the moment when the plan goes out the window. But that's the job. That's what you signed up for. Welcome to the big leagues.

Crises are also relative: the people involved perceive them differently from those observing from a distance. It’s easy to say “you should’ve done this and that” after the incident has concluded and you know everything that happened. In the moment, it’s a different story. But never doubt the decisions that you made with the information that you had at the time. Don't assume events were more predictable than they actually were.

In the end, the only real failure is the failure to learn from our mistakes. That's what post-mortems are all about – turning our f*ck-ups into future successes.

So let’s talk more about how to get ready and what to do when a f*ck-up happens.

Get your team ready

Psychological safety. A fancy term that basically means your team doesn't shit their pants every time something goes sideways. For this to happen, you need an environment where people can screw up, own it, learn from it, and move on without being haunted by the fear of getting axed.

I’ll say it again: mistakes are going to happen. Hell, they need to happen for growth. But the key is how you handle them. You've got to keep people accountable, sure, but this isn't the Spanish Inquisition. Post-mortems are crucial, but they're not blame games; they're learning opportunities. Leave emotions at the door.

The Google SRE book has a good example of dealing with blame culture:

Responding when a senior executive uses blameful language can be challenging. Consider the following statement made by senior leadership at a meeting about an outage:

VP Ash: I know we are supposed to be blameless, but this is a safe space. Someone must have known beforehand this was a bad idea, so why didn’t you listen to that person?

Mitigate the damage by moving the narrative in a more constructive direction. For example:

SRE Dana: Hmmm, I’m sure everyone had the best intent, so to keep it blameless, maybe we ask generically if there were any warning signs we could have heeded, and why we might have dismissed them.

The real impact of psychological safety hits when the fan gets dirty. If your team is cool and collected, with a problem-solving mindset, you're golden. But if they're scared stiff of making a mistake or getting fired, you're basically trying to defuse a bomb with someone shaking your ladder. Not fun.


So how do you assess your team's psychological safety? Start by setting an example. As a leader, your reaction to issues sets the tone. Show your team that it's okay to not have all the answers immediately, and encourage open communication to find solutions together.

Then observe how your team acts. Pay attention to how they handle minor setbacks, then move on to conversations and ask directly: Can they handle a code red without imploding? What happens if the shit really hits the fan? Discuss it with the team and in 1:1s. Understanding this gives you a sense of what kind of crisis your team can take on and how they'll react when things get ugly, really ugly. Because trust me, they will.

Simulate f*ckups. Go through scenarios with your team where some major outage happened, or personal data leaked, or someone got compromised. Play it out with them and discuss the steps that need to happen to make sure the crisis is dealt with properly.

Before sh*t hits the fan

Before anything goes wrong, you need a plan. And not some half-assed, "we'll-cross-that-bridge-when-we-come-to-it" plan. I'm talking about a solid, "this-is-what-we-do-when-the-world-ends" kind of plan.

This means knowing exactly what to do in case of a security breach, data center meltdown, or any other apocalyptic tech scenario you can think of. I mean, if you’re ISO 27001 compliant, you should already have a list of scenarios that are most probable to happen as well as a clear, step-by-step response plan for each of those scenarios.

A simplified disaster plan looks like this:

  1. A list of things that need to be done.
  2. The order in which those things need to be done.
  3. The people who are responsible for doing those things.
  4. The results that need to be achieved.
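
As a rough illustration, the four parts of that plan map naturally onto a small data structure. The sketch below is hypothetical: the scenario, the step descriptions, and the role names are made up for the example and do not come from any real runbook.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    action: str           # what needs to be done
    owner: str            # who is responsible for doing it
    expected_result: str  # what "done" looks like


@dataclass
class DisasterPlan:
    scenario: str
    steps: list[Step] = field(default_factory=list)  # list order = execution order

    def checklist(self) -> list[str]:
        """Render the plan as numbered lines someone can follow under stress."""
        return [
            f"{i}. [{s.owner}] {s.action} -> {s.expected_result}"
            for i, s in enumerate(self.steps, start=1)
        ]


# Hypothetical example scenario; every name below is illustrative.
plan = DisasterPlan(
    scenario="Primary database unavailable",
    steps=[
        Step("Declare the incident and open a call", "on-call engineer", "responders assembled"),
        Step("Fail over to the replica", "database owner", "writes accepted again"),
        Step("Post a status page update", "communications lead", "customers informed"),
    ],
)

for line in plan.checklist():
    print(line)
```

Keeping the owner and the expected result next to each step means nobody has to guess who does what, or when a step counts as finished.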

Sounds straightforward, right? Good. Make sure your team knows their roles. Like, who handles communications? Who's on tech duty? In crisis mode, you don't want to be crafting emails from scratch. Set up communication protocols for internal teams, stakeholders, and customers. This isn't just about being prepared; it's about being smart. Because when the storm hits, you don't have time to build a shelter. You need to be ready to weather it from the get-go.

I don’t know about you, but I enjoy reading postmortems, and here’s a great collection. I suggest you read up on them; there’s nothing better than reading how others f*cked up (great list btw) and how they handled things.

And talking about client communication — honesty is your only policy. If it's a security breach and it's time to call the cops, then do it. No second-guessing. Your integrity and your company's reputation are hanging by a thread. This is where you earn your stripes as a leader. No sugarcoating, no evasive maneuvers. Just straight talk, responsibility, and a clear path to damage control.

After sh*t hits the fan

Phew, breathe out, it was a hard day, everything is under control now. The crisis might be over, but as a CTO, you're just getting started. Now comes the part where you stand in front of the board, maybe even the CEO, and lay it all out. It's not just about giving a report; it's about owning your shit.

Assemble your best CSI tech team for a deep dive. Dissect the 'why' and 'how' of the failure. Timeline every commit, every log message, every deployment, every network packet, everything that led to it and happened after. Summarize the findings and make them understandable.
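
A minimal sketch of what "timelining" can look like in practice: collect timestamped events from each source and merge them into one chronological list. The sources, timestamps, and messages here are invented for illustration; in a real investigation they would come from your own version control, deploy tooling, and logs.

```python
from datetime import datetime
from typing import NamedTuple


class Event(NamedTuple):
    at: datetime
    source: str   # e.g. "git", "deploy", "logs" (labels are assumptions)
    message: str


def build_timeline(*sources: list[Event]) -> list[Event]:
    """Merge events from several sources into one list sorted by time."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda e: e.at)


# Hypothetical sample data.
commits = [Event(datetime(2024, 1, 10, 9, 2), "git", "Merge payment client change")]
deploys = [Event(datetime(2024, 1, 10, 9, 30), "deploy", "Release rolled out to production")]
logs = [
    Event(datetime(2024, 1, 10, 9, 31), "logs", "Certificate verification failed"),
    Event(datetime(2024, 1, 10, 9, 45), "logs", "Error rate back to normal after rollback"),
]

timeline = build_timeline(logs, deploys, commits)
for e in timeline:
    print(f"{e.at:%H:%M}  {e.source:<7} {e.message}")
```

Seeing a deploy and the first error one minute apart in a single list is often what makes the sequence of events obvious to people who were not on the call.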

Ideally, everyone who was involved drafts it collaboratively. Present it, discuss it, and lay out the steps to ensure this disaster doesn't repeat. Emotions? Leave them at the door. This is about hard facts, even if it means swallowing your pride and admitting to mistakes.

Here’s an example of how to write up a post-mortem:

Title and Ownership: A clear title for the postmortem and identification of the document's owner(s).
Incident Date and Time: When the incident occurred.
Authors and Participants: Names of people who wrote the postmortem and those involved in the incident.
Status: Indicate if the postmortem is in draft form, under review, or finalized.
Executive Summary: A brief overview of the incident, including the impact and the root cause.
Impact Analysis: Detailed information on what was affected during the incident.
Root Cause Analysis: In-depth exploration of the causes of the incident.
Timeline of Events: A chronological account of how the incident unfolded.
Action Items and Remediations: Specific, actionable steps to prevent recurrence, with assigned owners and due dates.
Lessons Learned: Key takeaways and insights gained from the incident.
Who’s to blame: I said no! Here’s another article on blameless postmortems.
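
One part of this template that is easy to check mechanically is the action items: each one should have an owner and a due date. A small sketch, with hypothetical items and field names:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None


def incomplete_items(items: list[ActionItem]) -> list[ActionItem]:
    """Return action items that lack an owner or a due date."""
    return [item for item in items if not item.owner or item.due is None]


# Hypothetical action items from a draft post-mortem.
items = [
    ActionItem("Add a signing certificate check to the release checklist", "Alex", date(2024, 3, 1)),
    ActionItem("Improve documentation"),  # vague, and nobody owns it
]

for item in incomplete_items(items):
    print(f"Needs an owner and/or due date: {item.description}")
```

A check like this will not tell you whether an action item is any good, but it does catch the ones that are guaranteed to be forgotten.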

Some rules of thumb for post-mortems:

  1. Dive into the Details: When encountering a backend error, don't settle for a surface-level explanation. Investigate the specific error and its underlying causes comprehensively. Could better QA, more peer reviews, or better exception handling prevent future occurrences? Did automated CI fail? What tests were missing? What automation didn’t we have? Why didn’t we have it?
  2. Concrete Resolution Steps: Avoid vague “corporate” solutions like “We need better deployment”, “improve documentation”, or “more training”. Instead, develop specific, actionable steps that address the issue directly, for example: “X will add fuzz testing to the deployment pipeline. Y will add configuration variable checking before deployment.”
  3. Focus on Immediate Solutions First, and Long-Term Solutions Second: Prioritize fixes that can quickly prevent recurrence, i.e. those that can be implemented right now. While post-mortem analysis will lead to long-term changes, your immediate goal is rapid resolution. Avoid measures like “let’s refactor everything in this module” or “I think it’s time to switch to Rust”.
  4. Challenge the Status Quo: Use the post-mortem to question and test the team's assumptions. Just because a belief is widely held doesn't make it true. Be open to discovering and addressing underlying misconceptions.
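
To make the "configuration variable checking before deployment" example from rule 2 concrete, here is a minimal sketch of such a check. The variable names are hypothetical; a real list would come from your own service's configuration.

```python
# Hypothetical variables a service needs; replace with your own.
REQUIRED_VARS = ["DATABASE_URL", "PAYMENT_API_KEY", "SIGNING_CERT_PATH"]


def missing_config(env: dict[str, str], required: list[str]) -> list[str]:
    """Return the required variables that are unset or blank in env."""
    return [name for name in required if not env.get(name, "").strip()]


# Sample environment for illustration: one value is blank, one is absent.
sample_env = {"DATABASE_URL": "postgres://example", "PAYMENT_API_KEY": ""}

print(missing_config(sample_env, REQUIRED_VARS))
# prints ['PAYMENT_API_KEY', 'SIGNING_CERT_PATH']
```

In a pipeline you would pass `dict(os.environ)` instead of the sample dictionary and fail the job when the returned list is not empty.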

Common Mistakes

So a screw-up happened. There are several things you should avoid at all costs during any screw-up:

  1. Finger pointing and shaming.
  2. Vague communication with customers.
  3. Not owning up to your mistake.
  4. Sweeping everything under the rug.

Every one of the above points isn't just bad management; it's like throwing gasoline on a dumpster fire.

For example, I remember reading a post-mortem on Hacker News after some kind of major outage. While writing this I tried to find the link, but couldn’t remember which company it was. The post-mortem was as clear as mud, filled with tech jargon no one outside IT could possibly understand. It was unnecessarily complicated, the whole communication was a masterpiece of vagueness, and there was no mention of “we’re sorry”. Even the status page simply showed “We're addressing some technical difficulties.” No shit, Sherlock.

Clients don’t forget the time you left them in the dark or the time their data was hanging out in the wind because of your team's mistake. Customers have long memories and Twitter accounts. They'll remember how you handled (or mishandled) a crisis, and they'll be damn sure to remind you.

The above is an example of what you should avoid at all costs. I don’t think your post-mortems will end up on Hacker News, but write them as if they will and be as transparent as possible.

I've had fun reading Postmortem Culture, highly recommend it.

Conclusion

Alright, let's wrap this up. TL;DR:

  • Incidents are going to happen. Systems will fail, code will break, and sometimes, despite your best efforts, things will go south.
  • How you handle it matters.
  • Communication matters.
  • Learning from incidents matters.
  • When shit hits the fan, keep your cool, rally your team and keep them informed.

Anything I forgot? Let me know.



  • Will

    While the article idealizes the concept of learning from failures, let’s be real: mistakes can be very costly to the business and to everyone in the company. I’ve seen layoffs happen because of mistakes. I work in product, basically middle management. I appreciate the emphasis on structured postmortems, but you forgot one thing: prevention is better than cure, and the post-mortem is the cure in this case. It’s great to learn from mistakes, but it’s even better to have robust systems in place that minimize these errors from the get-go.

  • Anonymous

    Reminds me of a shitty crisis we faced in my early startup days. A major feature deploy went really bad, and it felt like a nightmare. We fixed it easily, but it took us days to figure out the root cause, and the post-mortem helped a lot. We learned the hard way that transparent communication and a no-blame culture are vital.

  • Anonymous

    Agreed, screw-ups can happen. Emphasizing a blameless culture and learning from mistakes is key for any tech leader or software development team. This approach not only improves processes but also fosters a positive work environment. It’s a reminder that in the tech industry, where errors can have significant financial and reputational impacts, the ability to effectively analyze and learn from mistakes is VERY IMPORTANT. Implementing structured postmortems is a must at a company of any size.

  • Toby

    Learned the hard way that a good monitoring setup is crucial. One project had silently failing services for days before we caught it. Setting up alerts for abnormal behavior now feels like a basic step, but it’s saved us numerous times since. That, combined with regular, understandable updates for users, has kept things running smoother and trust higher.

  • Tracy-Gregory

    Screw-ups can happen to every developer, juniors and seniors. The knack is to recognise when a task is going off the rails, but that skill only comes with experience, and it can still be missed.
    In a career of over 30 years, I had my first post mortem just last week, and it was excellent. I suspect they will only really work in mature teams where members can trust there will be no blame attributed. Failures are seldom down to one person and more often just one of those things the team missed.
    It is important to consider the what, when and how. The who cannot, and should not, be avoided, but only to get a first-hand source of information and not to point fingers.
    Investigating what went wrong enables a team to learn how to identify the early indicators and establish new processes to avoid issues in future.