On-Call and Incident Response: When the Pager Goes Off

September 01, 2026 · 15 min read

Three customers received allergens in their boxes. The blameless postmortem identified the root cause: no mechanism for one squad to detect another’s changes. Contract tests went in. Cross-squad syncs started. But a question lingered: how did four hours pass between bad data hitting production and someone noticing?

Tom is asleep when the tweet arrives. It’s 3:07am on a Saturday.

His phone is on the bedside table, face down, on silent. He doesn’t see the tweet. He doesn’t see the three that follow. He doesn’t see the DM from a subscriber named Claire: “Hey @GreenboxAU, my box had capsicum in it. I flagged capsicum as an allergy. This is the second time. Please fix this.”

Claire posted at 3:07am because she’s a nurse finishing night shift and she opened her box when she got home. She’s frustrated but not in danger: her capsicum sensitivity is mild, not anaphylactic. But she doesn’t know that Greenbox doesn’t have anyone watching at 3am. She doesn’t know that nobody will see her tweet until Sam checks the social accounts at 8:30am, five and a half hours later.

Sam sees it and feels her stomach drop. Not again. She pulls up Claire’s account. The allergen flag is intact: capsicum is listed. She checks the reconciliation logs. Clean. She checks the substitution engine output. There it is: a quiet failure in the seasonal rules Tom built last sprint. When supply was constrained, the rule prioritised seasonal availability over allergen exclusions. The logic was correct in isolation (prefer seasonal produce), but it didn’t check the allergen flags first.

A bug. Not a systemic failure like the Perth API change. A regular, human-scale bug that shipped in a PR that passed all its tests, because nobody had written a test for the interaction between seasonal priority and allergen exclusions.
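The missing test is small enough to sketch. The version below is an illustration only: `Candidate` and `choose_substitute` are hypothetical names, not Greenbox’s actual substitution engine. It shows the ordering that would have prevented Claire’s box: filter out allergens first, then apply the seasonal preference to whatever is left.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A possible replacement item (hypothetical model for this sketch)."""
    name: str
    seasonal: bool


def choose_substitute(candidates, allergens, supply_constrained):
    """Pick a replacement item, never one the subscriber has flagged."""
    # Allergen exclusions are applied first, before any preference logic.
    safe = [c for c in candidates if c.name not in allergens]
    if not safe:
        return None  # nothing safe to offer; the caller decides what happens next
    if supply_constrained:
        # Prefer seasonal produce, but only among the safe options.
        seasonal = [c for c in safe if c.seasonal]
        if seasonal:
            return seasonal[0]
    return safe[0]


def test_seasonal_priority_never_overrides_allergen_exclusion():
    candidates = [
        Candidate("capsicum", seasonal=True),
        Candidate("zucchini", seasonal=False),
    ]
    chosen = choose_substitute(
        candidates, allergens={"capsicum"}, supply_constrained=True
    )
    assert chosen.name == "zucchini"
```

The test covers exactly the interaction nobody had written down: constrained supply, a seasonal item on offer, and a subscriber who is allergic to it.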

Sam tells Maya. Maya calls Claire personally. Claire is asleep by now (night shift), and Maya leaves a voicemail. Then Maya sits at the kitchen table and types a message to Charlotte: “We caught this by accident. A customer tweeted at 3am and Sam saw it five hours later. What if it had been peanuts?”

Charlotte’s reply comes at 9:14am: “You know the answer. We need to talk about on-call.”

The conversation nobody wants to have

Charlotte calls a meeting for Monday. Both squads. She starts with a question.

“How do we currently find out that something has gone wrong in production?”

Silence. Then Tom: “Customers tell us.”

“How long does that take?”

Sam checks her notes from the allergen incident. “The Perth API change shipped at 4:47pm Tuesday. The reconciliation ran at 5:30am Wednesday. I got the first customer email at 9am. Fourteen hours.”

“And this weekend?”

“The bug shipped Friday afternoon. Claire tweeted at 3:07am Saturday. Sam saw it at 8:30am. Seventeen hours from deploy to detection. Five and a half from customer report to human awareness.”

Charlotte writes both numbers on the board. Fourteen hours. Seventeen hours. She circles them.

“These are our detection times. The time between something going wrong and us knowing about it. Right now, our monitoring system is customers being harmed and then telling us.”

The room is uncomfortable. Tom crosses his arms. Priya looks at the numbers on the board.

“We need three things,” Charlotte says. “Monitoring that detects problems before customers do. Alerting that tells the right person immediately. And a response process so that person knows what to do.”

Monitoring: detecting the problem

Priya takes monitoring. She’s the one who built the contract tests after the first allergen incident, and she thinks about systems the way a doctor thinks about symptoms: what should we be watching for?

She starts with the reconciliation system, because that’s where both incidents originated. She adds checks that run after every reconciliation:

  • Does the output contain any allergen violations? (Compare box contents against subscriber allergen flags.)
  • Did the substitution engine override any allergen exclusions? (This would have caught Claire’s bug.)
  • Are there any subscribers whose box contents changed in the last hour without a corresponding supply update? (This would have caught the Perth API change.)

Each check runs automatically and writes its result to a dashboard. Green means clean. Red means something needs attention. Amber means an anomaly that might be nothing but should be checked.
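As a rough sketch of the first check, assuming box contents and allergen flags are available as plain dictionaries keyed by subscriber ID (the data shapes and function names here are made up for illustration, not Greenbox’s real schema):

```python
def find_allergen_violations(boxes, allergen_flags):
    """Return (subscriber_id, item) pairs where a box contains a flagged allergen."""
    violations = []
    for subscriber_id, items in boxes.items():
        flagged = allergen_flags.get(subscriber_id, set())
        for item in items:
            if item in flagged:
                violations.append((subscriber_id, item))
    return violations


def allergen_check_status(boxes, allergen_flags):
    """Map the check result onto a dashboard colour: any violation is red."""
    violations = find_allergen_violations(boxes, allergen_flags)
    status = "red" if violations else "green"
    return status, violations


# Hypothetical data: one subscriber has capsicum flagged and capsicum in the box.
boxes = {"sub-001": ["capsicum", "carrot"], "sub-002": ["zucchini"]}
allergen_flags = {"sub-001": {"capsicum"}}

status, violations = allergen_check_status(boxes, allergen_flags)
print(status, violations)  # red [('sub-001', 'capsicum')]
```

An allergen check has no amber state in this sketch: a box either contains a flagged item or it doesn’t. Amber is better suited to checks like the third one, where an unexplained change might be harmless.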

Tom builds the dashboard in a day. It’s simple: a status page that polls the checks every five minutes and displays the results. He puts it on a screen in the office. The first morning, it’s all green. The second morning, one amber: a substitution chain went three levels deep for a single subscriber. Not a bug, just an unusual supply week. But visible.

“This is the difference,” Charlotte tells the team. “Before, you found problems when customers emailed Sam. Now you find them when the dashboard turns amber.”

Alerting: telling the right person

Monitoring without alerting is a dashboard nobody looks at. The office screen helps during business hours, but Greenbox ships boxes seven days a week, and the reconciliation runs at 5:30am.

Charlotte introduces the concept of an on-call roster. One person carries the pager (in practice, a phone with push notifications from the monitoring system). If a check goes red, the on-call person gets an alert. They assess, respond, and escalate if needed.

Tom pushes back immediately. “We’re twenty people. If one person is on-call every night, that’s every twentieth night. That’s sustainable. But who wants to be woken up at 3am?”

“Nobody wants to be woken up at 3am,” Charlotte says. “The question is whether you’d rather be woken up at 3am by an alert, or at 8:30am by Sam telling you a customer got allergens in their box.”

“Fine. But how do we decide who’s on-call? And how often?”

The argument that follows is the most heated discussion Greenbox has had since the substitution policy debate in the Event Storming session. Not because anyone disagrees about the principle, but because on-call is personal. It touches sleep, family, weekends, fairness.

Anika raises the Melbourne question. “If on-call is shared across both squads, Melbourne developers could get paged for Perth issues they don’t understand. And vice versa.”

Ravi has a practical concern. “I have a six-month-old. I’m already not sleeping. Adding on-call on top of that…”

Charlotte listens to all of it. Then she draws a framework.

On-call principles
  • Rotation is weekly. One week on, several weeks off. Nobody does more than one week in six.
  • On-call is compensated. Time in lieu: if you get paged overnight, you start late the next day. No exceptions.
  • Scope is limited. The on-call person responds to red alerts only. Amber waits until business hours.
  • Runbooks exist. You should never be paged and not know what to do. If there’s no runbook for a red alert, the alert shouldn’t page someone at 3am.
  • Squad-scoped rosters. Perth on-call handles Perth systems. Melbourne handles Melbourne. Shared systems rotate between squads.
  • Opt-out is respected. Ravi is excused from on-call for six months. No judgement. The roster adjusts.
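A weekly rotation with opt-outs is simple enough to generate rather than maintain by hand. The sketch below is one possible approach under stated assumptions: `build_roster` is a hypothetical helper, the squad list is a placeholder, and round-robin is just the simplest scheme that satisfies the “one week in six” rule.

```python
from itertools import cycle


def build_roster(engineers, opted_out, weeks, min_gap=6):
    """Assign one on-call engineer per week, round-robin, skipping opt-outs.

    Round-robin over N eligible people means each person is on call once
    every N weeks, so N >= min_gap guarantees "no more than one week in six".
    """
    eligible = [e for e in engineers if e not in opted_out]
    if len(eligible) < min_gap:
        raise ValueError(
            f"need at least {min_gap} eligible engineers to keep everyone "
            f"at one week in {min_gap}; got {len(eligible)}"
        )
    rotation = cycle(eligible)
    return [next(rotation) for _ in range(weeks)]


# Placeholder squad of eight; Ravi has opted out for now.
squad = ["dev-1", "dev-2", "dev-3", "dev-4", "dev-5", "dev-6", "dev-7", "Ravi"]
roster = build_roster(squad, opted_out={"Ravi"}, weeks=12)
print(roster[:7])
```

Raising an error when too few people are eligible is deliberate: a squad that can’t staff the rotation fairly should find that out when the roster is built, not in week three.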

Ravi says thank you quietly. Tom nods. The framework doesn’t eliminate the discomfort of being on-call, but it makes it fair and bounded. One week in six, with time in lieu, with runbooks, with the knowledge that you won’t be paged for something you can’t handle.

Severity levels

Charlotte introduces severity levels the following week, after an incident that illustrates why they matter.

On Wednesday at 2pm, the monitoring dashboard goes red. Tom is on-call. He gets the alert on his phone, puts down his coffee, and opens the dashboard. The red check: “Substitution engine returned empty result for 3 subscribers.”

Three subscribers. Out of six thousand. The substitution engine hit an edge case with a new farm’s produce categories and returned no result instead of falling back to the default box. No allergen risk. No safety issue. Three people might get a box that’s missing an item.

Tom spends forty-five minutes diagnosing, fixing, and deploying. He misses his 2:30 meeting. He burns through his afternoon focus time. The fix is twelve lines of code.

“Was that worth a red alert?” Charlotte asks at the retro.

“No,” Tom admits. “It felt urgent because my phone buzzed. But three missing items is not the same as three allergen violations.”

Charlotte draws a severity grid on the board.

Greenbox severity levels
| Level | Definition | Response | Example |
| --- | --- | --- | --- |
| P1 | Customer safety risk or data breach | Page on-call immediately, any hour | Allergen violation in box contents |
| P2 | Major feature broken, many customers affected | Page during business hours; overnight only if >100 subscribers affected | Reconciliation producing wrong allocations |
| P3 | Minor feature broken, small number affected | Fix during business hours, next working day | 3 subscribers get incomplete substitution |
| P4 | Cosmetic or non-urgent | Add to backlog | Dashboard formatting issue |

“The severity level determines the response, not the alert,” Charlotte says. “Everything can be detected by monitoring. Only P1s, and P2s that hit more than a hundred subscribers, page someone at 3am. Other P2s page during business hours. P3s wait for the next working day. P4s go into the backlog.”
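The severity grid translates almost directly into a routing function. This is a minimal sketch, not Greenbox’s alerting config: `route_alert` and its return values are hypothetical, and the business-hours window (Monday to Friday, 9am to 5pm) is an assumption, since the grid doesn’t define it.

```python
from datetime import datetime

# Assumed business hours; the severity grid itself does not specify them.
BUSINESS_START_HOUR = 9
BUSINESS_END_HOUR = 17


def in_business_hours(now):
    """True on weekdays between the assumed start and end hours."""
    return now.weekday() < 5 and BUSINESS_START_HOUR <= now.hour < BUSINESS_END_HOUR


def route_alert(severity, now, subscribers_affected=0):
    """Decide how an alert reaches a human, following the severity grid."""
    if severity == "P1":
        return "page"  # any hour
    if severity == "P2":
        # Page in business hours; overnight only above the 100-subscriber line.
        if in_business_hours(now) or subscribers_affected > 100:
            return "page"
        return "hold-until-business-hours"
    if severity == "P3":
        return "slack"  # fixed during business hours, no phone alert
    if severity == "P4":
        return "backlog"
    raise ValueError(f"unknown severity: {severity!r}")


# Tom's incident: Wednesday 2pm, three subscribers, no safety risk.
print(route_alert("P3", datetime(2026, 9, 2, 14, 0), subscribers_affected=3))  # slack
```

The point of encoding it is that the decision is made once, calmly, in a meeting, instead of by whoever’s phone buzzes.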

Tom recategorises his Wednesday incident. It was a P3. It should have been a Slack notification, not a phone alert. He wouldn’t have missed his meeting. He wouldn’t have burned his afternoon. He would have fixed it the next morning and nobody would have noticed.

“When everything is urgent, nothing is,” Charlotte says. “Severity levels protect the on-call person from alert fatigue. If you get paged for P3s at 2am, you’ll start ignoring pages. And then when a real P1 comes (an allergen violation, a data breach), you’ll be the person who silenced their phone.”

Runbooks: what to do when the phone rings

Priya writes the first runbook. She chooses the reconciliation system because that’s where both allergen incidents started, and because she understands it better than anyone.

The runbook is a document, a page in the team wiki, that answers one question: “It’s 3am, you’ve been paged, the reconciliation check is red. What do you do?”

She writes it in numbered steps, each one specific enough to follow when you’re half-asleep and your adrenaline is spiking.

  1. Open the monitoring dashboard. Confirm the red check. Note which check failed and the timestamp.
  2. Open the reconciliation log for today’s run. Look for error messages or anomalies.
  3. If the error is “allergen violation detected”: this is a P1. Do not proceed alone. Escalate to the incident commander (currently Charlotte, fallback Maya). Then continue to step 4 while waiting for the commander.
  4. Identify affected subscribers. Run the allergen check query (linked). Note subscriber IDs and the specific violations.
  5. If boxes have not yet been packed: update the reconciliation data and re-run. Verify the output is clean.
  6. If boxes have been packed but not dispatched: contact the packing facility (number listed). Request a hold on affected boxes.
  7. If boxes have been dispatched: this is a customer contact situation. Escalate to Maya for personal calls. Sam handles email notification to affected subscribers using the allergen incident template (linked).

Seven steps. Each one ends with either a resolution or an escalation. Priya tests it by walking Ravi through a simulated incident: she marks a check red and watches him work through the steps. He completes it in nine minutes. Two of those minutes are reading the runbook.

“I’ve never touched the reconciliation system,” Ravi says. “I just followed the steps.”

“That’s the point,” Priya says.
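Steps 5 to 7 of Priya’s runbook branch on a single fact: how far the affected boxes have travelled. Written as code, purely as an illustration (the `BoxState` enum and `next_actions` function are invented for this sketch; the real runbook is a wiki page), the branching looks like this:

```python
from enum import Enum


class BoxState(Enum):
    NOT_PACKED = "not packed"
    PACKED = "packed, not dispatched"
    DISPATCHED = "dispatched"


def next_actions(box_state):
    """Return the runbook actions for an allergen violation (steps 5 to 7)."""
    if box_state is BoxState.NOT_PACKED:
        # Step 5: the problem can still be fixed in the data.
        return [
            "update the reconciliation data",
            "re-run the reconciliation",
            "verify the output is clean",
        ]
    if box_state is BoxState.PACKED:
        # Step 6: stop the boxes before they leave the facility.
        return [
            "contact the packing facility",
            "request a hold on affected boxes",
        ]
    if box_state is BoxState.DISPATCHED:
        # Step 7: it is now a customer contact situation.
        return [
            "escalate to Maya for personal calls",
            "Sam emails affected subscribers using the allergen incident template",
        ]
    raise ValueError(f"unknown box state: {box_state!r}")


for action in next_actions(BoxState.PACKED):
    print(action)
```

Every branch ends in a concrete list of actions, which is the property that let Ravi follow the runbook without knowing the system.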

Over the next two weeks, the team writes runbooks for five more failure modes: delivery tracking outage, payment processing failure, farm portal downtime, substitution engine error, and notification system failure. Each one follows the same format: confirm, assess severity, follow steps, escalate if needed.

The incident commander

Charlotte introduces one more role: the incident commander. During a P1 or P2 incident, one person coordinates. Everyone else executes.

“The commander doesn’t fix the bug. The commander makes sure the right people are fixing the bug, that customers are being communicated with, that someone is tracking the timeline, and that nobody is working on the same thing as someone else.”

She draws the model on the whiteboard:

Without a commander
  • Three people investigate the same log
  • Nobody tells the customer anything
  • Fix is deployed without checking side effects
  • Timeline is reconstructed from memory at the postmortem
With a commander
  • Commander assigns roles: investigate, communicate, verify
  • Customer comms go out within thirty minutes
  • Fix is verified against the runbook before deploy
  • Timeline is recorded in real time in the incident channel

Charlotte volunteers to be the first incident commander. She’ll rotate out after two months, once someone else has observed enough incidents to take over.

The first clean incident

It happens three weeks later. A Thursday morning, 6:15am. The reconciliation check goes amber: “Supply data incomplete for 4 farms. Reconciliation output may contain gaps.”

Kai is on-call. His phone buzzes. He checks the dashboard. Amber, not red. He reads the severity guide: amber during business hours means assess and respond, no escalation needed.

He opens the runbook for supply data issues. Step 1: check which farms have incomplete data. Step 2: check the farm portal logs. Step 3: if the farms haven’t submitted, contact them.

He finds the issue in four minutes. Rachel’s farm portal session timed out overnight: her dodgy broadband, the same satellite connection she complained about at the very first Event Storming session. Her availability data submitted partially. Two other farms had the same issue: a server-side timeout that dropped connections after 30 seconds.

Kai pings Tom in Slack: “Supply data timeout issue. Three farms affected. Partial submissions. I’m increasing the timeout to 120 seconds and re-requesting submissions.”

Tom replies: “Good catch. Fix the timeout, I’ll check the packing schedule isn’t affected.”

By 7:30am, the data is complete. The reconciliation re-runs cleanly. The dashboard goes green. Kai logs the incident in the #incidents channel: timestamp, cause, resolution, time to fix (75 minutes).

No customers affected. No boxes delayed. No phone calls from Maya.

Charlotte reads the incident log at 9am and posts a single message: “This is what good looks like.”
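Kai’s log entry has a consistent shape: when it started, when it ended, what caused it, what fixed it. A small record type keeps those fields uniform and computes the time to fix instead of relying on mental arithmetic. The `IncidentEntry` class below is a hypothetical sketch, and the calendar date is a placeholder (the story only says it was a Thursday).

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class IncidentEntry:
    detected_at: datetime
    resolved_at: datetime
    cause: str
    resolution: str

    @property
    def minutes_to_fix(self):
        """Whole minutes between detection and resolution."""
        delta = self.resolved_at - self.detected_at
        return int(delta.total_seconds() // 60)

    def summary(self):
        return (
            f"{self.detected_at:%H:%M} detected | cause: {self.cause} | "
            f"resolution: {self.resolution} | time to fix: {self.minutes_to_fix} min"
        )


entry = IncidentEntry(
    detected_at=datetime(2026, 1, 1, 6, 15),  # placeholder date, real time
    resolved_at=datetime(2026, 1, 1, 7, 30),
    cause="server-side timeout dropped farm submissions after 30 seconds",
    resolution="timeout raised to 120 seconds, submissions re-requested",
)
print(entry.minutes_to_fix)  # 75
```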

The tension

There’s a conversation that happens at the retro, and it’s harder than the technical discussion.

Tom raises it. “I’m on-call next week. I’m also supposed to be building the Brisbane onboarding flow. If I get paged twice overnight, I’ll be useless the next day. How do we reconcile ‘move fast’ with ‘be careful’?”

It’s a real tension and Charlotte doesn’t pretend it isn’t. Moving fast means shipping features, taking risks, iterating quickly. Being careful means monitoring, runbooks, on-call, postmortems. They pull in opposite directions.

“You don’t reconcile them,” Charlotte says. “You hold both. Some weeks you move fast and ship three features. Some weeks you get paged at 3am and the next day is a write-off. The on-call structure doesn’t slow you down; it catches you when speed creates problems.”

Priya adds something quieter. “Before the monitoring, we were still getting paged. It just came through Sam’s inbox five hours later. The speed was the same. The feedback loop was slower. We thought we were moving fast because nobody was telling us we’d broken things.”

Tom considers this. “So we’re not slower. We just know more.”

“Yes. And knowing more feels slower because you’re responding to things you used to ignore.”

Blameless by default

Charlotte has one final piece. She pins a message in #incidents:

Every incident gets a postmortem. Every postmortem follows the Prime Directive. This is not optional and it’s not only for disasters.

The team runs postmortems for three incidents in the first month. The Claire incident (P1, allergen bug). The Wednesday P3 that Tom over-responded to. And a P2 where the Melbourne notification system sent duplicate emails to four hundred subscribers.

The postmortems take thirty minutes each. Timeline. Root cause. Contributing factors. Actions. The actions are small: a test here, a timeout there, a severity reclassification. But they accumulate. Each postmortem makes the system slightly more resilient.

By the end of the first month, the #incidents channel has seven entries. Each one is a story: what happened, why, what was done, what changed. New developers who join the team will read those entries and understand not just how the system works, but how it fails and how the team responds to failure.

That’s the real value. Not the runbooks or the severity levels or the on-call roster, though all of those matter. The real value is a team that treats incidents as expected rather than exceptional, that responds with process rather than panic, and that learns from every failure without blaming the person who happened to be holding the keyboard when things went wrong.

Tom finishes his first on-call rotation on a Sunday evening. He wasn’t paged once. He spent the week with his phone on the bedside table, volume up, and nothing happened.

Sarah notices him checking his phone at dinner on Saturday. “Everything okay?”

“Yeah. Just checking. Force of habit.”

“Ah. The on-call thing.”

“The on-call thing.”

She puts her hand on his. “At least now you get to check a dashboard instead of finding out from an angry tweet.”

Tom laughs. He puts his phone face down on the table and goes back to his dinner. In the morning, the dashboard will be green. And if it isn’t, he’ll know what to do.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.