Part of The GreenBox Story – a standalone reference for the full series.
Three customers received produce they were allergic to. One of them was Mrs Patterson – subscribed since week three, loyal through every crisis, gentle about it even when she shouldn’t have had to be. Another was a Melbourne parent with a child who has a nut allergy. “If this had been nuts instead of dairy…” She didn’t finish the sentence. She cancelled.
That’s the allergen incident from Two Squads, One Direction – the moment GreenBox’s clean architecture hurt a real person for the first time. Perth changed an API format. Melbourne’s reconciliation consumed it. Nobody told anyone. Preference flags got scrambled. Boxes went out wrong.
Priya fixed it in twelve lines of code. But fixing the code wasn’t the point. The point was making sure it never happened again – and that meant understanding why it happened, without turning “why” into “whose fault.”
That’s what a blameless post-mortem is for.
Why “blameless” matters
The natural response to an incident is to find someone to blame. Tom shipped the API change. Tom didn’t notify Melbourne. Tom’s fault.
Except it isn’t. Tom did what any competent developer would do – shipped a feature in his squad’s sprint plan. He didn’t think it would affect the other squad. He was wrong, but the system gave him no way to know he was wrong. No contract test. No cross-squad notification process. No schema validation on the consumer side.
If you blame Tom, two things happen. First, he becomes defensive – stops volunteering information about near-misses, stops saying “I didn’t think it’d affect you” honestly. Second, everyone else learns: don’t be the person holding the commit when something breaks. Hide your mistakes. Bury the near-misses.
Blame creates silence. Silence creates repeat incidents.
Sidney Dekker puts this precisely in The Field Guide to Understanding Human Error: human error is not the cause of failure – it’s the consequence of deeper systemic issues. The question isn’t “why did this person fail?” but “why did the system make it easy for a reasonable person to fail?” Dekker calls this the “New View” of human error, and it’s the foundation of every effective post-mortem process I’ve seen.
John Allspaw, who built the post-mortem culture at Etsy, makes the same argument from an engineering perspective. Engineers who feel safe to report mistakes produce better incident data, which produces better systemic fixes, which produces fewer incidents. Psychological safety is a reliability strategy, not a kindness.
This isn’t about being soft. It’s about being effective. Blame the individual and the system that allowed the error stays intact. The next developer who makes a similar reasonable assumption will produce a similar incident. You haven’t fixed anything. You’ve just made people quieter about it.
The format
Charlotte ran GreenBox’s first formal post-mortem after the allergen incident. The format she used is the one I’ve seen work best across dozens of teams and hundreds of incidents. It has five phases.
1. Timeline
Before analysis, establish facts. What happened, when, in what order. Write it on the whiteboard or in a shared document. No interpretation yet – just timestamps and events.
For the GreenBox incident:
- Tuesday, 4:47pm – Perth deploys the subscription API change (pause feature)
- Wednesday, 5:30am – Melbourne’s automated reconciliation runs against the new API format
- Wednesday, ~5:30am – Reconciliation completes with malformed data. No errors raised. 340 subscribers get incorrect allocations
- Wednesday, 9:00am – Sam receives three customer emails about allergens in boxes
- Wednesday, 9:15am – Maya calls all three customers personally
- Wednesday, 11:14am – Priya deploys the fix
The timeline stops arguments about what happened and reveals gaps. Why more than eighteen hours between the API change and the fix? Because the reconciliation runs overnight, unmonitored, with no schema validation. That gap is where the systemic problem lives.
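To make that gap concrete, here is a minimal sketch of the consumer-side check that was missing. The schema and field names are assumptions, not GreenBox's real payload; the shape is the point: validate before reconciling, and fail loudly instead of silently.

```python
# A minimal sketch (hypothetical schema and field names) of the consumer-side
# check that was missing: validate the payload against the shape the
# reconciliation expects, and fail loudly instead of silently.
from jsonschema import ValidationError, validate

SUBSCRIPTION_SCHEMA = {
    "type": "object",
    "required": ["subscriber_id", "preferences"],
    "properties": {
        "subscriber_id": {"type": "string"},
        "preferences": {
            "type": "object",
            "required": ["allergens"],
            "properties": {"allergens": {"type": "array", "items": {"type": "string"}}},
        },
    },
}

def reconcile(subscriptions, process_one, alert):
    """Validate every record before touching allocations; alert and stop on mismatch."""
    for record in subscriptions:
        try:
            validate(instance=record, schema=SUBSCRIPTION_SCHEMA)
        except ValidationError as err:
            # An alert here turns a silent overnight failure into a page
            # before any box allocation is written.
            alert(f"Subscription API schema mismatch: {err.message}")
            raise
        process_one(record)
```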
2. Root cause analysis (5 Whys)
Start with the observable failure and keep asking “why” until you reach something systemic.
Why did three customers get allergens in their boxes? Because the preference flags were scrambled.
Why were the preference flags scrambled? Because Melbourne’s reconciliation received malformed data from the subscription API.
Why was the data malformed? Because Perth restructured the API response format for the pause feature.
Why didn’t Melbourne know the format was changing? Because there’s no contract test between the services and no cross-squad notification process.
Why is there no contract test and no notification process? Because no mechanism exists for one squad to know what the other is changing.
That last answer is the root cause. It’s systemic, not personal. It would have been true regardless of who shipped the change or which API was involved.
The 5 Whys technique comes from the Toyota Production System. It works because it forces you past symptoms into structure. The first two “whys” describe what happened. The next two describe how the system allowed it. The last one describes what’s missing.
“5 Whys” doesn’t mean exactly five. Sometimes three, sometimes seven. The discipline of asking “why” one more time than feels comfortable is what matters.
3. Contributing factors
Root cause is the structural gap. Contributing factors are everything that made the incident worse or made the root cause harder to catch.
For the allergen incident:
- No consumer notification when a shared API changes (process gap)
- No API versioning (technical gap)
- The reconciliation failing silently instead of raising an alert on schema mismatch (monitoring gap)
- No cross-squad visibility into sprint plans (coordination gap)
Tom saying “I didn’t think it’d affect you” is a contributing factor, not the root cause. It’s honest – he genuinely didn’t know. The root cause is that the system gave him no way to know. Fix the system, and Tom’s reasonable assumption stops being dangerous.
This distinction matters. If you list Tom’s assumption as the root cause, the action item becomes “Tom should check with Melbourne before shipping” – a human process depending on one person remembering. It will fail. If the root cause is systemic, the action item becomes “contract tests break the build when a schema change affects a consumer.” That’s automated. It works at 2am when Tom is asleep.
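If you haven’t seen one before, here is a rough sketch of what that automated check can look like. It isn’t GreenBox’s actual test, the field names are assumptions, and a real setup would use a contract-testing tool like Pact. What matters is where it runs: in Perth’s build, encoding Melbourne’s expectations, so the breakage surfaces in CI rather than in a customer’s box.

```python
# A rough sketch of a consumer contract test, not GreenBox's actual code.
# CONSUMER_EXPECTATIONS and current_response_example are hypothetical names;
# a real setup would use a tool like Pact. The test lives in the provider's
# (Perth's) build, so breaking Melbourne's assumptions breaks CI.

CONSUMER_EXPECTATIONS = {
    # Fields Melbourne's reconciliation reads, and the types it assumes.
    "subscriber_id": str,
    "preferences": dict,
}

def current_response_example():
    # In a real setup this would come from the provider's serialiser or a
    # recorded fixture; hard-coded here to keep the sketch self-contained.
    return {"subscriber_id": "sub_123", "preferences": {"allergens": ["dairy"]}}

def test_subscription_api_honours_consumer_contract():
    response = current_response_example()
    for field, expected_type in CONSUMER_EXPECTATIONS.items():
        assert field in response, f"'{field}' removed: breaks Melbourne's reconciliation"
        assert isinstance(response[field], expected_type), (
            f"'{field}' changed type: breaks Melbourne's reconciliation"
        )
```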
4. Action items
This is where most post-mortems either succeed or collapse.
Charlotte’s allergen incident post-mortem produced three actions:
- Contract testing across bounded contexts – Priya, already done
- Cross-squad notification process for API changes – Tom and Anika, by next Friday
- Reconciliation alerting on schema mismatches – Ravi, by end of sprint
The bad post-mortem produces fifteen action items, assigns them to “the team,” and gives no deadlines. Three months later, none are done. Three actions you’ll actually complete are worth more than twenty you won’t.
5. Follow-up
Schedule a check-in – two weeks later, four weeks later, whatever fits your cadence. Were the action items completed? Did they work? Has a similar incident occurred?
Charlotte pinned a photo of the post-mortem whiteboard in GreenBox’s #incidents Slack channel and added the incident to the incident log – the spreadsheet she later projected at quarterly planning. Without follow-up, each incident is isolated. With follow-up, three incidents become a trend, and a trend demands structural change.
How to facilitate
The facilitator’s job is to keep asking “why” without it feeling like an interrogation.
Charlotte’s approach at GreenBox works because she does three things consistently.
She separates the person from the system. When Tom says “I didn’t think it’d affect you,” Charlotte doesn’t say “you should have thought about it.” She says “nothing in our process would have caught this.” The distinction is subtle but crucial. Tom hears “the system failed” rather than “you failed.” He stays engaged instead of becoming defensive.
She writes on the whiteboard, not in her notebook. Everything is visible. The timeline, the why chain, the action items. Nobody wonders what the facilitator is thinking. The whiteboard is the shared record, and everyone can see it and correct it.
She stops at three actions. The temptation is always to list everything that could be improved. Resist it. Charlotte asks: “What are the three things that would most reduce the chance of this happening again?” Then she stops.
One more thing: the facilitator should not be the person most affected by the incident. The facilitator needs enough distance to keep asking “why” without emotional investment in the answers.
Common failure modes
I’ve seen post-mortems go wrong in predictable ways. These are the patterns to watch for.
The twenty-item action list. The team lists everything that could possibly be improved. Nobody prioritises. Nobody owns most of the items. Four weeks later, two items are done (the easy ones) and eighteen are forgotten. The post-mortem felt thorough but accomplished nothing. The fix: force-rank actions by impact. Pick two or three. Do them. Come back for more if needed.
The post-mortem nobody attends. Scheduled for Friday at 4pm. Half the team has “conflicts.” The people who need to hear the analysis aren’t there. The fix: schedule within 48 hours of the incident, during core hours. If someone can’t attend, they read the write-up and sign off on the action items.
The recurring root cause. The same issue appears in three post-mortems over six months. This means action items aren’t being completed, or the fixes aren’t working. The fix: at each post-mortem, review the action items from the last two. If they’re not done, that’s the first topic. If they’re done and the problem persists, go deeper.
The blame post-mortem. Someone says “this wouldn’t have happened if Tom had checked with Melbourne.” The room goes cold. Tom stops contributing. The facilitator has lost the room. The fix: re-read the Retrospective Prime Directive at the start – Norm Kerth’s reminder that everyone did the best job they could, given what they knew at the time. If blame surfaces, redirect immediately. “That’s a human action. What’s the systemic condition that made that action risky?”
The post-mortem that’s actually a status meeting. The manager wants to know what happened and who’s fixing it. No root cause analysis, no 5 Whys. Fifteen minutes, one action: “don’t do that again.” This is an incident debrief, not a post-mortem. The fix: separate the two. Debrief immediately (what happened, what’s the immediate fix). Post-mortem within a week (why did the system allow it, what structural changes prevent recurrence).
The incident log
Charlotte keeps a spreadsheet. It’s simple: date, type (surprise or duplication), description, impact, root cause, action items, status.
The spreadsheet isn’t remarkable on its own. What’s remarkable is that she projects it at quarterly planning. Seven incidents in four weeks. Three of them in her bounded contexts. Nobody had seen the full picture before.
Individual incidents feel like bad luck. A spreadsheet reveals a pattern. That’s why Charlotte’s quarterly planning day works – she’s not asking the team to trust her instinct that coordination is broken. She’s showing them the data.
If you run post-mortems but don’t aggregate the findings, you’re learning from each incident in isolation. The real value comes from seeing that three different incidents share the same root cause.
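The aggregation itself can be tiny. Here is a sketch, assuming the spreadsheet is exported to CSV with the columns listed above; the file name is made up.

```python
# A small sketch of the aggregation step, assuming the log is exported to CSV
# with the columns listed above. The file name is made up.
import csv
from collections import Counter

def root_cause_trends(path="incident_log.csv"):
    with open(path, newline="") as f:
        counts = Counter(row["root cause"] for row in csv.DictReader(f))
    # Anything that shows up more than once is a trend, not bad luck.
    return [(cause, n) for cause, n in counts.most_common() if n > 1]
```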
After the post-mortem: the retro
A post-mortem and a retrospective are different things. The post-mortem asks “what happened and why?” The retrospective asks “how do we work differently going forward?”
Charlotte runs both. The post-mortem happens within days of the incident. The retrospective happens at the regular sprint cadence – it’s where the post-mortem’s action items get reviewed and where deeper cultural issues surface.
Retromat is useful here. It generates structured retrospective plans with different activities for each phase. The variety forces different thinking, which surfaces different observations. After a serious incident, the team needs to process more than just the technical failure.
A second example: the Saturday outage
The allergen incident isn’t GreenBox’s only brush with things going wrong. In Threat Modelling, a Friday database migration breaks the payment webhook handler. Saturday morning, Stripe webhooks return 500 errors. Customers see “payment pending” when they’ve already been charged. Ravi diagnoses it in twenty minutes. The fix is two lines.
Same format. Timeline: migration deployed Friday afternoon, webhooks fail Saturday morning, Ravi diagnoses at 7:20am, fix deployed by 8am. Root cause: no integration test covering the webhook handler’s interaction with the audit log schema. Contributing factors: Friday afternoon deployment with less weekend monitoring, no staging test exercising the full webhook flow.
Charlotte’s observation: “We built the right thing – audit logging was a mitigation from the threat model. We deployed it without adequate testing. The fix isn’t removing the audit logging. It’s improving the deployment process.”
That distinction – the feature was right, the deployment process was wrong – is exactly what a blameless post-mortem surfaces. Without the structured analysis, the false lesson would be “audit logging caused an outage, don’t add audit logging.”
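For completeness, here is a rough sketch of the kind of test that root cause points at. Every name is hypothetical, and the schema setup is hard-coded where a real test would run the production migrations. The point is that the handler and the audit-log schema get exercised together before deployment.

```python
# A rough sketch (every name hypothetical) of the missing integration test:
# exercise the webhook handler against the current database schema, so a
# migration that breaks the audit-log write fails in CI, not on Saturday.
import sqlite3

def handle_payment_webhook(event, db):
    # Stand-in for the real handler: record the payment, then audit-log it.
    db.execute("INSERT INTO payments (event_id) VALUES (?)", (event["id"],))
    db.execute(
        "INSERT INTO audit_log (event_id, action) VALUES (?, ?)",
        (event["id"], "payment.recorded"),
    )

def test_webhook_handler_against_current_schema():
    db = sqlite3.connect(":memory:")
    # A real test would apply the production migrations; hard-coded here.
    db.execute("CREATE TABLE payments (event_id TEXT)")
    db.execute("CREATE TABLE audit_log (event_id TEXT, action TEXT)")
    handle_payment_webhook({"id": "evt_1"}, db)
    assert db.execute("SELECT COUNT(*) FROM audit_log").fetchone()[0] == 1
```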
The literature
If you want to go deeper, three references are worth your time.
Sidney Dekker, The Field Guide to Understanding Human Error (2014). The definitive argument for “New View” thinking about failure. If you read one book about incident analysis, make it this one.
John Allspaw’s writing on blameless post-mortems at Etsy. Allspaw built one of the first engineering cultures to formalise blameless post-mortems at scale. Start with “Blameless PostMortems and a Just Culture” on the Etsy engineering blog.
Google’s Site Reliability Engineering book, Chapter 15: “Postmortem Culture: Learning from Failure.” Free to read online. Google’s post-mortem template is well-documented and battle-tested at enormous scale. Even if you’re not operating at Google’s scale, the principles transfer directly.
The principle
Every system fails. The question is whether the failure makes the system stronger or just makes people quieter.
A blameless post-mortem turns an incident into a structural improvement. It asks “why did the system allow this?” instead of “who did this?” It produces concrete actions with owners and deadlines. It feeds into a log that reveals patterns over time.
Tom saying “I didn’t think it’d affect you” is not negligence. It’s a reasonable assumption the system failed to catch. Fix the system. Tom will be the first person to tell you when the next assumption looks risky – but only if honesty isn’t punished.
Blame feels like accountability. It’s actually the fastest way to ensure nobody tells you about the next near-miss until it’s a full-blown incident. Build the culture where people say “I made a mistake” before someone else discovers it. That culture starts with the first post-mortem.
Related references
- Two Squads, One Direction – the allergen incident and Charlotte’s post-mortem
- Threat Modelling – the Saturday outage and its post-mortem
- Retrospectives at Every Scale – the feedback loops that follow post-mortems
- The GreenBox Cheat Sheet – every technique in one place
- The Planning Onion – every planning layer in one place
- The GreenBox Story – the full series