
Postmortems in Software Development

22 November 2023 · Updated 04 April 2026

August 1st, 2012, 9:30 a.m., right at the NYSE open. Knight Capital flips the switch on a new release and instantly starts bleeding money. One of eight servers never got the new code, so a repurposed flag re-awakened an ancient feature called “Power Peg”, which sprayed ridiculous orders across 148 stocks. Forty-five minutes, four million trades, and $440 million in losses later, the firm needed a bailout to stay alive. (Side note: I’ve never managed to dig up their internal post-mortem. If you ever find it, send it my way.)

Same pattern, different decade. In 1990, a single misplaced break statement in AT&T’s switch software created a reboot cascade that knocked out long-distance service for roughly 60,000 people and cost the company about $60 million. Flights were delayed, pagers went silent, the whole nine yards. One misplaced statement, nationwide havoc. I’d love to pretend we’ve evolved past that, but we both know we haven’t.

Those two disasters would make fascinating public post-mortems—sadly I can’t find any. I’m guessing a small army sifted through packet captures for months, but the write-ups never saw daylight.

I’ll keep it smaller-scale. Over the years I’ve had projects skid off the runway and a multi-hour datacenter failure that shaved a few months off my life. Watching a major incident unfold in real time feels like slow-motion physics: you can see the collision coming but you still can’t reach the brake.

The worst for me? A high-profile product launch—press releases queued, clients on livestream, champagne somewhere on ice. Ten minutes in, payments started hard-failing because of a certificate mismatch that hadn’t shown up in staging (classic). We scrambled, fixed it, owned it, survived. Still hurt.

We pulled an all-hands, combed logs, published a timeline, and added checks so the same class of bug can’t ship again. The relief was real, but so was the hangover. (I’m talking adrenaline, not champagne.)

🏄 Post-mortems are the autopsies that keep the next patient alive. Skip them and you rerun the same crash.

Aviation and bridge engineering treat failure analysis as sacred ritual. Black boxes, FDR read-outs, load tests on snapped beams—the works. That rigor is why commercial flying feels routine and why bridges rarely fold under rush-hour traffic.

Our servers don’t carry passengers, but they do process paychecks, medical data, and the odd rocket launch. Consequences may be less visible, yet they’re expensive enough to bankrupt a company before lunch.

So yes, we should borrow a page from aviation—but with nuance. I’ve worked in teams where the post-mortem template turned into bureaucratic theatre: 20 pages, zero insight. The trick is honesty over ceremony. Root-cause analysis when something breaks, near-miss reviews when something almost breaks, and—this one’s underrated—“brag docs” when a risky change doesn’t break. (I stole that idea from someone on my team who swears it keeps alert fatigue in check.)

Perspective matters too. After an outage everyone suddenly “knew it was going to fail.” That hindsight bias will poison your analysis if you let it. Ask what was knowable at the time, not what’s obvious afterwards. I get this wrong more often than I want to admit.

The only unforgivable mistake is failing to learn. Everything else is tuition.

Alright—how do we set up for the next mess?

Get your team ready

“Psychological safety” sounds like HR jargon, but the test is simple: can an engineer push a bad config, say “I broke prod,” and still feel welcome at Monday stand-up? If not, you’ll never get the real story during a crisis.

Mistakes happen—occasionally by design. I’ve seen teams run controlled failure drills (“GameDays”) every sprint. It normalises error and surfaces weak spots long before customers notice. Caveat: overdo it and you burn people out. I could be wrong, but quarterly exercises feel about right for a midsize team.
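
If you want something concrete to try at your first GameDay, even a ten-line failure-injection helper is enough to get started. The sketch below is a hypothetical example, not any particular chaos-engineering tool; the helper, its parameters, and the commented usage are all made up for illustration.

```python
import random
from contextlib import contextmanager


class InjectedFailure(Exception):
    """Raised deliberately during a GameDay drill."""


@contextmanager
def gameday_failure(probability: float = 0.2, enabled: bool = False):
    """Randomly fail the wrapped code path so the team can rehearse the response.

    Only ever enable this in a staging/drill environment.
    """
    if enabled and random.random() < probability:
        raise InjectedFailure("GameDay drill: simulated dependency outage")
    yield


# Hypothetical usage during a staging-only drill:
# with gameday_failure(probability=0.5, enabled=True):
#     charge_customer(order)  # team practices detection, rollback, comms
```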

The Google SRE book nails the culture shift away from blame. Quick excerpt:

VP Ash: Someone must have known this was risky—why didn’t you listen?
SRE Dana: Everyone had good intent. Maybe let’s ask if there were warning signs we missed and why.

When the pager screams you want problem-solvers, not self-preservation instincts. If your team hides mistakes, the incident will outlast your runway.


How to gauge safety? Start with your own reaction. If I curse at a junior for a minor slip, I’ve signalled that survival > learning. After that, watch hallway chatter, read PR comments, ask in 1:1s whether folks feel comfortable flagging brittle code. The answers are rarely subtle.

Then rehearse. Table-top the ugly scenarios: leaked database, ransomware, production schema drop. Talk through who calls legal, who pauses ad spend, who rotates secrets. It feels theatrical—until it isn’t.

Before sh*t hits the fan

You need a runbook you can follow at 3 a.m. with half a coffee in your system. Not a PDF buried in Confluence. Something discoverable—pager message, Slack bookmark, whatever works.

If you’re ISO-27001 aligned you likely have a risk register already. Good start. But keep it alive: new third-party dependency? Update the register. New data residency law? Same story. I’ve lost count of audits that failed because the “living document” hadn’t been touched since 2019.

Simplified plan (a rough code sketch follows the list):

  1. Enumerate plausible incidents.
  2. Define sequence of actions per incident.
  3. Assign on-call roles (incident commander, comms lead, tech lead).
  4. Spell out “done” criteria so you know when you’re clear.
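
To make that less abstract, here’s a minimal sketch of how such a plan could live next to the code, so it gets reviewed and updated like everything else. The incident, steps, and role names are invented for illustration; shape it however fits your stack.

```python
from dataclasses import dataclass


@dataclass
class RunbookEntry:
    incident: str            # 1. a plausible incident
    steps: list[str]         # 2. sequence of actions
    roles: dict[str, str]    # 3. on-call roles
    done_when: list[str]     # 4. "done" criteria


RUNBOOK = [
    RunbookEntry(
        incident="Payment provider returning 5xx",
        steps=[
            "Declare the incident in #incidents and page the on-call tech lead",
            "Fail over to the secondary provider behind the feature flag",
            "Post a status-page update within 15 minutes",
        ],
        roles={
            "incident_commander": "on-call engineering manager",
            "comms_lead": "support lead",
            "tech_lead": "payments on-call",
        },
        done_when=[
            "Error rate below 0.1% for 30 minutes",
            "Status page marked resolved",
            "Post-mortem owner assigned",
        ],
    ),
]
```

The exact shape doesn’t matter; what matters is that the runbook is versioned, diffable, and impossible to lose in Confluence.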

The role split looks excessive until you’re juggling exec calls, rollback commands, and social-media blow-ups in parallel. Borrow the “incident commander” pattern from firefighting; it forces single-threaded decision flow and reduces cross-talk.

I binge post-mortems the way some people binge Netflix—there’s a nice public list. Reading through other teams’ disasters costs nothing and occasionally saves you from repeating one.

On communication: default to candour. If personal data is at risk, call your DPO and, if law requires, law enforcement—immediately. Trying to massage the timeline usually backfires and drags regulators into the mix.

After sh*t hits the fan

The outage is over, alerts are green, adrenaline drops. Now the real work starts. You need the narrative before rumours fill the vacuum.

Form a small squad—anyone paged during the incident plus a neutral facilitator. Collect artefacts: deploy hashes, CloudTrail logs, Slack threads, customer tickets. I aim for a 48-hour draft while memories are fresh. Faster is better, but people need sleep first.

Keep drafts collaborative. A shared doc beats a heroic single author because it catches memory gaps and prevents hindsight bias. I’ve seen a post-mortem shift root cause from “buggy code” to “ambiguous runbook step” once the operations folks annotated the timeline.

Google’s template is solid. My stripped-down version usually includes:

Title & Owners
Date/Time
Executive Summary (2-3 sentences)
Impact (customers, revenue, SLAs)
Timeline
Root Cause(s)
Short-Term Fixes
Long-Term Actions (owners + due dates)

Notice there’s no “culprit” section.
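
If your team likes starting every write-up from the same skeleton, it’s worth generating it instead of copy-pasting. A minimal sketch, assuming all you want is a pre-filled doc to hand to the incident squad; the headings mirror the list above.

```python
from datetime import date

SECTIONS = [
    "Title & Owners",
    "Date/Time",
    "Executive Summary (2-3 sentences)",
    "Impact (customers, revenue, SLAs)",
    "Timeline",
    "Root Cause(s)",
    "Short-Term Fixes",
    "Long-Term Actions (owners + due dates)",
]


def postmortem_skeleton(title: str) -> str:
    """Return an empty post-mortem document with one heading per section."""
    lines = [f"# Post-mortem: {title}", f"_Draft started {date.today()}_", ""]
    for section in SECTIONS:
        lines += [f"## {section}", "", "_TODO_", ""]
    return "\n".join(lines)


# Hypothetical usage:
# print(postmortem_skeleton("Checkout payment outage"))
```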

Rules of thumb:

  1. Get below the surface. “Database error” is not root cause. Was it schema drift? Connection pool starvation? Mis-configured AZ failover?
  2. Write concrete actions. “Improve monitoring” is fluff; “Add 95th-percentile latency alert in Prometheus” is testable (there’s a rough sketch of that after this list).
  3. Sequence matters. Quick guardrails first (feature flag, extra alert). Ambitious refactor later, maybe.
  4. Question assumptions. Just because the cluster scaled yesterday doesn’t mean it will tomorrow. Probe the “obviously safe” areas.
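
To show the difference between fluff and testable, here’s roughly what the Prometheus example from point 2 could look like once it’s concrete. A real setup would define it as a Prometheus alerting rule; this sketch polls the Prometheus HTTP API from a script instead so everything stays in one language, and the server address, metric name, and threshold are all assumptions.

```python
import requests

PROMETHEUS_URL = "http://prometheus.internal:9090"  # assumed address
# Assumed histogram metric name; use whatever your services actually export.
P95_QUERY = (
    "histogram_quantile(0.95, "
    "sum(rate(http_request_duration_seconds_bucket[5m])) by (le))"
)
THRESHOLD_SECONDS = 0.5  # made-up SLO: p95 under 500 ms


def p95_latency_seconds() -> float:
    """Ask Prometheus for the current 5-minute p95 request latency."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": P95_QUERY},
        timeout=5,
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0


if __name__ == "__main__":
    p95 = p95_latency_seconds()
    if p95 > THRESHOLD_SECONDS:
        print(f"ALERT: p95 latency {p95:.3f}s exceeds {THRESHOLD_SECONDS}s SLO")
```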

Common Mistakes

Things that make any incident worse:

  1. Public finger-pointing.
  2. Hand-wavy customer updates.
  3. Dodging responsibility.
  4. Quietly deleting evidence or logs.

I once read a post-mortem so drenched in acronyms that even I had to open Google. No apology, no plain-English explanation, just “technical difficulties.” Customers remember that stuff—and they screenshot.

Assume your post-mortem might end up on Hacker News. Write accordingly.

For a deeper dive, the “Post-mortem Culture” workbook is a fun rabbit hole.

Conclusion

Incidents are inevitable. How you react—technically and emotionally—decides whether you end up stronger or merely bruised. Keep the comms clear, the analysis honest, and the fixes actionable. That’s pretty much the whole playbook.

Missed something? Ping me.



6 Comments

  1. Anonymous

    Screw-ups can happen to every developer, junior and senior alike. The knack is to recognise when a task is going off the rails, but that skill only comes with experience, and even then things can still be missed.
    In a career of over 30 years, I had my first post mortem just last week, and it was excellent. I suspect they will only really work in mature teams where members can trust there will be no blame attributed. Failures are seldom down to one person and more often just one of those things the team missed.
    It is important to consider the what, when, and how. The who cannot, and should not, be avoided either, but only as a first-hand source of information, never to point fingers.
    Investigating what went wrong enables a team to learn how to identify the early indicators and establish new processes to avoid issues in future.

  2. Anonymous

    Learned the hard way that a good monitoring setup is crucial. One project had silently failing services for days before we caught it. Setting up alerts for abnormal behavior now feels like a basic step, but it’s saved us numerous times since. That, combined with regular, understandable updates for users, has kept things running smoother and trust higher.

  3. Anonymous

    Agreed, screw-ups can happen. Emphasizing a blameless culture and learning from mistakes is key for any tech leader or software development team. This approach not only improves processes but also fosters a positive work environment. It’s a reminder that in the tech industry, where errors can have significant financial and reputational impacts, the ability to effectively analyze and learn from mistakes is VERY IMPORTANT. Implementing structured postmortems is a must at a company of any size.

  4. Anonymous

    Reminds me of a shitty crisis we faced in my early startup days. A major feature deploy went really bad, and it felt like a nightmare. We fixed it easily, but it took us days to figure out the root cause – the post-mortem helped a lot. We learned the hard way that transparent communication and a no-blame culture are vital.

  5. Anonymous

    While the article idealizes the concept of learning from failures, let’s be real: mistakes can be very costly to the business and to everyone in the company; I’ve seen layoffs happen because of them. I work in product, basically middle management, and I appreciate the emphasis on structured postmortems, but you forgot one thing: prevention is better than cure, and the post-mortem is the cure in this case. It’s great to learn from mistakes, but it’s even better to have robust systems in place that minimize these errors from the get-go.

  6. Anonymous

    I once helmed a project that was the epitome of a dumpster fire, thanks to a cascading failure that kicked off from a minor bug. Despite our best efforts at transparency and a no-blame post-mortem, the aftermath was a morale black hole, tearing at the team’s confidence. What these articles don’t tell you is how disillusionment can set in when you realize that no amount of post-mortems can prepare you for the unpredictable chaos of tech. Each failure chipped away at our “resilience,” turning our ambitious team into a group just bracing for the next crash.
