On-Call and Incident Response: When the Pager Goes Off

Tom is asleep when the tweet arrives. It’s 3:07am on a Saturday.

His phone is on the bedside table, face down, on silent. He doesn’t see the tweet. He doesn’t see the three that follow. He doesn’t see the DM from a subscriber named Claire: “Hey @GreenboxAU, my box had capsicum in it. I flagged capsicum as an allergy. This is the second time. Please fix this.”

Claire posted at 3:07am because she’s a nurse finishing night shift and she opened her box when she got home. She’s frustrated but not in danger: her capsicum sensitivity is mild, not anaphylactic. But she doesn’t know that Greenbox doesn’t have anyone watching at 3am. She doesn’t know that nobody will see her tweet until Sam checks the social accounts at 8:30am, nearly five and a half hours later.

Sam sees it and feels her stomach drop. Not again. She pulls up Claire’s account. The allergen flag is intact: capsicum is listed. She checks the reconciliation logs. Clean. She checks the substitution engine output. There it is: a quiet failure in the seasonal rules Tom built last sprint. The rule prioritised seasonal availability over allergen exclusions when supply was constrained. The logic was correct in isolation (prefer seasonal produce), but it didn’t check the allergen flags first.

A bug. Not a systemic failure like the Perth API change. A regular, human-scale bug that shipped in a PR that passed all its tests, because nobody had written a test for the interaction between seasonal priority and allergen exclusions.
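The missing test is small enough to sketch. The following is a minimal illustration in Python, not Tom's actual rule: the `Item` type, the `choose_substitute` function, and the sample produce are all hypothetical. The only point it makes is that allergen exclusions act as a hard filter before any seasonal preference is applied.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Item:
    # Hypothetical produce record; the real data model is not shown in the text.
    name: str
    seasonal: bool


def choose_substitute(candidates: list[Item], allergens: set[str]) -> Optional[Item]:
    """Pick a substitute: exclude allergens first, then prefer seasonal produce."""
    # Hard constraint: anything the subscriber flagged is removed up front.
    safe = [item for item in candidates if item.name not in allergens]
    if not safe:
        return None  # No safe option; the caller must handle this explicitly.
    # Soft preference: seasonal items sort ahead of non-seasonal ones.
    # sorted() is stable, so ties keep their original order.
    return sorted(safe, key=lambda item: not item.seasonal)[0]


def test_seasonal_priority_never_overrides_allergen_flag() -> None:
    # Constrained supply: the only seasonal option is the flagged allergen.
    candidates = [Item("capsicum", seasonal=True), Item("carrot", seasonal=False)]
    chosen = choose_substitute(candidates, allergens={"capsicum"})
    assert chosen == Item("carrot", seasonal=False)


test_seasonal_priority_never_overrides_allergen_flag()
```

The test encodes the interaction directly: supply is constrained, the seasonal choice is the allergen, and the safe non-seasonal item must win.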

Sam tells Maya. Maya calls Claire personally. Claire is asleep by now after her night shift, so Maya leaves a voicemail. Then Maya sits at the kitchen table and types a message to Charlotte: “We caught this by accident. A subscriber tweeted at 3am and Sam saw it five hours later. What if it had been peanuts?”

Charlotte’s reply comes at 9:06am: “You know the answer. We need to talk about on-call.”

The conversation nobody wants to have

Charlotte calls a meeting for Monday. Both squads. She starts with a question.

“How do we currently find out that something has gone wrong in production?”

Silence. Then Tom: “Subscribers tell us.”

“How long does that take?”

Sam checks her notes from the allergen incident. “The Perth API change shipped at 4:47pm Tuesday. The reconciliation ran at 5:30am Wednesday. I took Mrs Patterson’s call at 9:03. Sixteen hours.”

“And this weekend?”

“The bug shipped Friday afternoon. Claire tweeted at 3:07am Saturday. I saw it at 8:30am. Seventeen hours from deploy to detection. Five and a half from subscriber report to human awareness.”

Charlotte writes both numbers on the board. Sixteen hours. Seventeen hours. She circles them.

“These are our detection times. The time between something going wrong and us knowing about it. Right now, our monitoring system is subscribers being harmed and then telling us.”

The room is uncomfortable. Tom crosses his arms. Priya looks at the numbers on the board.

“We need three things,” Charlotte says. “Monitoring that detects problems before subscribers do. Alerting that tells the right person immediately. And a response process so that person knows what to do.”

Monitoring: detecting the problem

Priya takes monitoring. She’s the one who built the contract tests after the first allergen incident, and she thinks about systems the way a doctor thinks about symptoms: what should we be watching for?

She starts with the reconciliation system, because that’s where both incidents originated. She adds checks that run after every reconciliation:

Does the output contain any allergen violations? (Compare box contents against subscriber allergen flags.)
Did the substitution engine override any allergen exclusions? (This would have caught Claire’s bug.)
Are there any subscribers whose box contents changed in the last hour without a corresponding supply update? (This would have caught the Perth API change.)

Each check runs automatically and writes its result to a dashboard. Green means clean. Red means something needs attention. Amber means an anomaly that might be nothing but should be checked.
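The first of Priya's checks can be sketched in a few lines of Python. This is an illustration under assumptions, not Greenbox's code: the function names, the `Status` enum, and the shape of the data (subscriber ID mapped to a set of item names) are all hypothetical.

```python
from enum import Enum


class Status(Enum):
    GREEN = "green"
    AMBER = "amber"
    RED = "red"


def find_allergen_violations(
    boxes: dict[str, set[str]], allergen_flags: dict[str, set[str]]
) -> dict[str, set[str]]:
    """Return {subscriber_id: offending items} for every box containing a flagged item."""
    violations = {}
    for subscriber_id, contents in boxes.items():
        flagged = allergen_flags.get(subscriber_id, set())
        offending = contents & flagged  # set intersection: items both packed and flagged
        if offending:
            violations[subscriber_id] = offending
    return violations


def allergen_check_status(violations: dict[str, set[str]]) -> Status:
    # Assumption: any allergen violation is red; this check has no amber state.
    return Status.RED if violations else Status.GREEN


# Hypothetical sample data for one reconciliation run.
boxes = {"sub-001": {"carrot", "capsicum"}, "sub-002": {"kale"}}
flags = {"sub-001": {"capsicum"}, "sub-002": {"peanut"}}
violations = find_allergen_violations(boxes, flags)
status = allergen_check_status(violations)
```

Because the check compares final box contents against the flags, it does not care which upstream rule caused the violation, which is why it would catch a bug like Claire's regardless of where it was introduced.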

Tom builds the dashboard in a day. It’s simple: a status page that polls the checks every five minutes and displays the results. He puts it on a screen in the office. The first morning, it’s all green. The second morning, one amber: a substitution chain went three levels deep for a single subscriber. Not a bug, just an unusual supply week. But visible.

“This is the difference,” Charlotte tells the team. “Before, you found problems when subscribers emailed Sam. Now you find them when the dashboard turns amber.”

Alerting: telling the right person

Monitoring without alerting is a dashboard nobody looks at. The office screen helps during business hours, but Greenbox ships boxes seven days a week, and the reconciliation runs at 5:30am.

Charlotte introduces the concept of an on-call roster. One person carries the pager (in practice, a phone with push notifications from the monitoring system). If a check goes red, the on-call person gets an alert. They assess, respond, and escalate if needed.

Tom pushes back immediately. “We’re twenty people. If one person is on-call every night, that’s every twentieth night. That’s sustainable. But who wants to be woken up at 3am?”

“Nobody wants to be woken up at 3am,” Charlotte says. “The question is whether you’d rather be woken up at 3am by an alert, or at 8:30am by Sam telling you a subscriber got allergens in their box.”

“Fine. But how do we decide who’s on-call? And how often?”

The argument that follows is the most heated discussion Greenbox has had since the substitution policy debate in the Event Storming session. Not because anyone disagrees about the principle, but because on-call is personal. It touches sleep, family, weekends, fairness.

Anika raises the Melbourne question. “If on-call is shared across both squads, Melbourne developers could get paged for Perth issues they don’t understand. And vice versa.”

Ravi has a practical concern. “I have a six-month-old. I’m already not sleeping. Adding on-call on top of that…”

Charlotte listens to all of it. Then she draws a framework.

On-call principles

Rotation is weekly. One week on, several weeks off. Nobody does more than one week in six.
On-call is compensated. Time in lieu: if you get paged overnight, you start late the next day. No exceptions.
Scope is limited. The on-call person responds to red alerts only. Amber waits until business hours.
Runbooks exist. You should never be paged and not know what to do. If there’s no runbook for a red alert, the alert shouldn’t page someone at 3am.
Squad-scoped rosters. Perth on-call handles Perth systems. Melbourne handles Melbourne. Shared systems rotate between squads.
Opt-out is respected. Ravi is excused from on-call for six months. No judgement. The roster adjusts.

Ravi says thank you quietly. Tom nods. The framework doesn’t eliminate the discomfort of being on-call, but it makes it fair and bounded. One week in six, with time in lieu, with runbooks, with the knowledge that you won’t be paged for something you can’t handle.
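The rotation rules are mechanical enough to sketch. The following Python is a minimal illustration, not the team's tooling: `build_roster` is a hypothetical helper, and the squad list is made up (the text does not say who sits in which squad, so most names are placeholders).

```python
from itertools import cycle, islice


def build_roster(
    members: list[str], opted_out: set[str], weeks: int, min_gap_weeks: int = 6
) -> list[str]:
    """Assign one on-call person per week, round-robin, skipping opt-outs."""
    eligible = [m for m in members if m not in opted_out]
    # Round-robin gives each person one week in len(eligible), so fewer than
    # min_gap_weeks eligible people would break the "one week in six" rule.
    if len(eligible) < min_gap_weeks:
        raise ValueError(
            f"need at least {min_gap_weeks} eligible people, got {len(eligible)}"
        )
    return list(islice(cycle(eligible), weeks))


# Hypothetical squad; membership is an assumption for illustration only.
squad = ["tom", "priya", "kai", "ravi", "dev-5", "dev-6", "dev-7"]
roster = build_roster(squad, opted_out={"ravi"}, weeks=8)
```

With Ravi opted out, six people remain, so each person's turn comes round exactly every sixth week. If a second person opted out, the function would refuse to build a roster rather than quietly break the one-in-six promise.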

Severity levels

Charlotte introduces severity levels the following week, after an incident that illustrates why they matter.

On Wednesday at 2pm, the monitoring dashboard goes red. Tom is on-call. He gets the alert on his phone, puts down his coffee, and opens the dashboard. The red check: “Substitution engine returned empty result for 3 subscribers.”

Three subscribers. Out of six thousand. The substitution engine hit an edge case with a new farm’s produce categories and returned no result instead of falling back to the default box. No allergen risk. No safety issue. Three people might get a box that’s missing an item.

Tom spends forty-five minutes diagnosing, fixing, and deploying. He misses his 2:30 meeting. He burns through his afternoon focus time. The fix is twelve lines of code. While he’s there he also ssh’s into the box and bumps the substitution worker’s memory limit from 512MB to 1GB, because the edge case had spiked memory on the way down and he doesn’t want it to OOM if it hits again before the code change rolls out. He notes it in the incident channel and moves on.

The next morning Kai runs terraform plan on a routine PR and sees something he didn’t write: the substitution worker is set to 1GB in live, but the Terraform file Kai checked in still says 512MB. Plan wants to revert it. He pings Tom.

“Was that you?”

“Yesterday’s incident. I bumped it on the box so it wouldn’t OOM.”

“Plan’s going to put it back to 512 next time we apply.”

Tom stares at his screen. The Terraform hadn’t crossed his mind at 2pm with a red alert open. Charlotte joins the thread. The conversation is short: do they update the .tf to match what Tom did, or revert Tom’s fix? The fix was right (the box needs more memory for that workload), so they update the Terraform. Kai opens a one-line PR bumping the value to 1GB and links to Tom’s incident note as the rationale. It merges in ten minutes.

Charlotte writes a longer message in the channel afterwards. “From now on the Terraform is the source of truth. If you change something on a box during an incident, that’s fine; that’s what incidents are for. But the same day, before you log off, the change goes into the .tf. Otherwise the next apply silently undoes your fix and we’re debugging a ghost.”

Tom doesn’t argue. He’d been treating the Terraform as a record of the system rather than the system itself: something the pipeline ran plan against, useful but secondary. Watching it want to undo his fix changed the shape of it. For the next config change he needs, a connection-pool bump for the reconciliation worker, he writes the Terraform first, reviews the plan in the PR, and runs the apply himself once it merges. The first time, it takes him longer than ssh would have. By the third time it’s faster, because he doesn’t have to remember which box he changed.
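What Kai saw in the plan is a comparison between two descriptions of the same system. The Python below is a toy illustration of that comparison only; it is not how Terraform computes a plan, and the setting name is a hypothetical stand-in for whatever the real .tf file calls the memory limit.

```python
from typing import Optional


def find_drift(
    declared: dict[str, str], live: dict[str, str]
) -> dict[str, tuple[Optional[str], Optional[str]]]:
    """Return {setting: (declared_value, live_value)} for every setting that differs."""
    drift = {}
    # Union of keys so settings present on only one side are also reported.
    for key in declared.keys() | live.keys():
        if declared.get(key) != live.get(key):
            drift[key] = (declared.get(key), live.get(key))
    return drift


# What the checked-in file says versus what Tom left on the box.
declared = {"substitution_worker_memory": "512MB"}
live = {"substitution_worker_memory": "1GB"}
drift = find_drift(declared, live)
```

An apply resolves every entry in that result in favour of the declared side, which is why an emergency fix survives only once it has been written into the declared configuration.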

“Was that worth a red alert?” Charlotte asks at the retro.

“No,” Tom admits. “It felt urgent because my phone buzzed. But three missing items is not the same as three allergen violations.”

Charlotte draws a severity grid on the board.

Greenbox severity levels

Level	Definition	Response	Example
P1	Subscriber safety risk or data breach	Page on-call immediately, any hour	Allergen violation in box contents
P2	Major feature broken, many subscribers affected	Page during business hours; overnight only if >100 subscribers affected	Reconciliation producing wrong allocations
P3	Minor feature broken, small number affected	Fix during business hours, next working day	3 subscribers get incomplete substitution
P4	Cosmetic or non-urgent	Add to backlog	Dashboard formatting issue

“The severity level determines the response, not the alert,” Charlotte says. “Everything can be detected by monitoring. Only P1s page someone at 3am, with one exception: a P2 that hits more than a hundred subscribers. Otherwise P2s page during business hours. P3s go into the sprint. P4s go into the backlog.”
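Charlotte's grid translates almost directly into a routing function. This is a sketch in Python under stated assumptions: the `route` function and its return strings are hypothetical, and the behaviour for a small overnight P2 (hold it until business hours) is inferred from the grid rather than stated in it.

```python
from enum import Enum


class Severity(Enum):
    P1 = 1
    P2 = 2
    P3 = 3
    P4 = 4


def route(severity: Severity, business_hours: bool, subscribers_affected: int = 0) -> str:
    """Map a severity level to a response, following the severity grid."""
    if severity is Severity.P1:
        return "page"  # Safety risk or data breach: page at any hour.
    if severity is Severity.P2:
        # Page in business hours; overnight only for more than 100 subscribers.
        if business_hours or subscribers_affected > 100:
            return "page"
        return "hold_until_business_hours"  # Assumption: inferred, not stated.
    if severity is Severity.P3:
        return "notify_channel"  # A Slack notification, fixed next working day.
    return "backlog"  # P4: cosmetic or non-urgent.
```

For example, `route(Severity.P3, business_hours=True, subscribers_affected=3)` returns `"notify_channel"`, which is how Tom's Wednesday incident would have been handled under the grid.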

Tom recategorises his Wednesday incident. It was a P3. It should have been a Slack notification, not a phone alert. He wouldn’t have missed his meeting. He wouldn’t have burned his afternoon. He would have fixed it the next morning and nobody would have noticed.

“When everything is urgent, nothing is,” Charlotte says. “Severity levels protect the on-call person from alert fatigue. If you get paged for P3s at 2am, you’ll start ignoring pages. And then when a real P1 comes (an allergen violation, a data breach), you’ll be the person who silenced their phone.”

Runbooks: what to do when the phone rings

Priya writes the first runbook. She chooses the reconciliation system because that’s where both allergen incidents started, and because she understands it better than anyone.

The runbook is a document, a page in the team wiki, that answers one question: “It’s 3am, you’ve been paged, the reconciliation check is red. What do you do?”

She writes it in numbered steps, each one specific enough to follow when you’re half-asleep and your adrenaline is spiking.

1. Open the monitoring dashboard. Confirm the red check. Note which check failed and the timestamp.
2. Open the reconciliation log for today’s run. Look for error messages or anomalies.
3. If the error is “allergen violation detected”: this is a P1. Do not proceed alone. Escalate to the incident commander (currently Charlotte, fallback Maya). Then continue to step 4 while waiting for the commander.
4. Identify affected subscribers. Run the allergen check query (linked). Note subscriber IDs and the specific violations.
5. If boxes have not yet been packed: update the reconciliation data and re-run. Verify the output is clean.
6. If boxes have been packed but not dispatched: contact the packing facility (number listed). Request a hold on affected boxes.
7. If boxes have been dispatched: this is a subscriber contact situation. Escalate to Maya for personal calls. Sam handles email notification to affected subscribers using the allergen incident template (linked).

Seven steps. Each one ends with either a resolution or an escalation. Priya tests it by walking Ravi through a simulated incident: she marks a check red and watches him work through the steps. He completes it in nine minutes. Two of those minutes are reading the runbook.

“I’ve never touched the reconciliation system,” Ravi says. “I just followed the steps.”

“That’s the point,” Priya says.

Over the next two weeks, the team writes runbooks for five more failure modes: delivery tracking outage, payment processing failure, farm portal downtime, substitution engine error, and notification system failure. Each one follows the same format: confirm, assess severity, follow steps, escalate if needed.
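The last three steps of Priya's reconciliation runbook branch on a single fact: how far the affected boxes have travelled. A minimal Python sketch of that branch follows. The `BoxState` enum and `next_action` function are hypothetical names, and the runbook itself is a wiki page, not code; the sketch only shows that each state maps to exactly one action.

```python
from enum import Enum


class BoxState(Enum):
    NOT_PACKED = "not packed"
    PACKED = "packed, not dispatched"
    DISPATCHED = "dispatched"


def next_action(state: BoxState) -> str:
    """Return the runbook action for affected boxes in the given state."""
    actions = {
        BoxState.NOT_PACKED: "update reconciliation data, re-run, verify output is clean",
        BoxState.PACKED: "contact packing facility and request a hold on affected boxes",
        BoxState.DISPATCHED: "escalate to Maya for calls; Sam emails affected subscribers",
    }
    # A plain lookup: an unhandled state raises KeyError instead of guessing.
    return actions[state]
```

Writing the branch as a lookup table mirrors what makes the runbook usable at 3am: the person on call identifies the state and reads off the action, with no judgement call in between.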

The incident commander

Charlotte introduces one more role: the incident commander. During a P1 or P2 incident, one person coordinates. Everyone else executes.

“The commander doesn’t fix the bug. The commander makes sure the right people are fixing the bug, that subscribers are being communicated with, that someone is tracking the timeline, and that nobody is working on the same thing as someone else.”

She draws the model on the whiteboard:

Without a commander

Three people investigate the same log
Nobody tells the subscriber anything
Fix is deployed without checking side effects
Timeline is reconstructed from memory at the postmortem

With a commander

Commander assigns roles: investigate, communicate, verify
Subscriber comms go out within thirty minutes
Fix is verified against the runbook before deploy
Timeline is recorded in real time in the incident channel

Charlotte volunteers to be the first incident commander. She’ll rotate out after two months, once someone else has observed enough incidents to take over.

The first clean incident

It happens three weeks later. A Thursday morning, 6:15am. The reconciliation check goes amber: “Supply data incomplete for 3 farms. Reconciliation output may contain gaps.”

Kai is on-call. His phone buzzes. He checks the dashboard. Amber, not red. He reads the severity guide: amber during business hours means assess and respond, no escalation needed.

He opens the runbook for supply data issues. Step 1: check which farms have incomplete data. Step 2: check the farm portal logs. Step 3: if the farms haven’t submitted, contact them.

He finds the issue in four minutes. Rachel’s farm portal session timed out overnight: her dodgy broadband, the same satellite connection she complained about at the very first Event Storming session. Her availability data submitted only partially. Two other farms had the same issue: a server-side timeout that dropped connections after 30 seconds.

Kai pings Tom in Slack: “Supply data timeout issue. Three farms affected. Partial submissions. I’m increasing the timeout to 120 seconds and re-requesting submissions.”

Tom replies: “Good catch. Fix the timeout, I’ll check the packing schedule isn’t affected.”

By 7:30am, the data is complete. The reconciliation re-runs cleanly. The dashboard goes green. Kai logs the incident in the #incidents channel: timestamp, cause, resolution, and time to fix (75 minutes).

No subscribers affected. No boxes delayed. No phone calls from Maya.

Charlotte reads the incident log at 9am and posts a single message: “This is what good looks like.”

The tension

There’s a conversation that happens at the retro, and it’s harder than the technical discussion.

Tom raises it. “I’m on-call next week. I’m also supposed to be building the Brisbane onboarding flow. If I get paged twice overnight, I’ll be useless the next day. How do we reconcile ‘move fast’ with ‘be careful’?”

It’s a real tension and Charlotte doesn’t pretend it isn’t. Moving fast means shipping features, taking risks, iterating quickly. Being careful means monitoring, runbooks, on-call, postmortems. They pull in opposite directions.

“You don’t reconcile them,” Charlotte says. “You hold both. Some weeks you move fast and ship three features. Some weeks you get paged at 3am and the next day is a write-off. The on-call structure doesn’t slow you down; it catches you when speed creates problems.”

Priya adds something quieter. “Before the monitoring, we were still getting paged. It just came through Sam’s inbox five hours later. The speed was the same. The feedback loop was slower. We thought we were moving fast because nobody was telling us we’d broken things.”

Tom considers this. “So we’re not slower. We just know more.”

“Yes. And knowing more feels slower because you’re responding to things you used to ignore.”

Postmortems by default

Charlotte has one final piece. She pins a message in #incidents:

Every incident gets a postmortem. Every postmortem follows the Prime Directive. This is not optional and it’s not only for disasters.

The team runs postmortems for three incidents in the first month. The Claire incident (P1, allergen bug). The Wednesday P3 that Tom over-responded to. And a P2 where the Melbourne notification system sent duplicate emails to four hundred subscribers.

The postmortems take thirty minutes each. Timeline. Root cause. Contributing factors. Actions. The actions are small: a test here, a timeout there, a severity reclassification. But they accumulate. Each postmortem makes the system slightly more resilient.

By the end of the first month, the #incidents channel has seven entries. Each one is a story: what happened, why, what was done, what changed. New developers who join the team will read those entries and understand not just how the system works, but how it fails and how the team responds to failure.

That’s the real value. Not the runbooks or the severity levels or the on-call roster, though all of those matter. The real value is a team that treats incidents as expected rather than exceptional, that responds with process rather than panic, and that learns from every failure without blaming the person who happened to be holding the keyboard when things went wrong.

Tom finishes his first on-call rotation on a Sunday evening. He wasn’t paged once. He spent the week with his phone on the bedside table, volume up, and nothing happened.

Sarah notices him checking his phone at dinner on Saturday. “Everything okay?”

“Yeah. Just checking. Force of habit.”

“Ah. The on-call thing.”

“The on-call thing.”

She puts her hand on his. “At least now you get to check a dashboard instead of finding out from an angry tweet.”

Tom laughs. He puts his phone face down on the table and goes back to his dinner. In the morning, the dashboard will be green. And if it isn’t, he’ll know what to do.

The runbooks and the roster put a process around the failures that page someone. But the work that keeps Greenbox running mostly never pages anyone, and mostly never gets noticed at all, right up until the person doing it stops.

The next chapter, Sam, Jas, and the Invisible Work, publishes around 30 July.