The Boring Middle

The pause-risk work now has a goal, a don’t-call list, a holdout design, and a call script Dina wrote in four hours and Kai has read more times than he has read his employment contract. What it does not have is data that can be trusted. That is the work of February.

Kai opens the first query on a Monday morning and is in a kind of trouble by Monday afternoon.

He has been assuming, the way you assume gravity, without thinking about it, that the pause event in the subscription log means what it says. A row exists when a subscriber clicks pause. The row records when they did it, what reason they chose from the dropdown, and which box they were meant to have that week.

This turns out to be true most of the time and not-true interestingly often.

The first anomaly he finds is that 3.6% of pauses in 2025 have no reason on them. He traces it to a three-day window in August 2025 when the pause form was deployed with the reason dropdown briefly broken. The subscribers who paused in those three days pressed pause, got a confirmation screen, and moved on. The reason field is null.

He almost writes a Slack message asking if anyone knows about the August 2025 deploy. He doesn’t, because it’s ten past midnight, and also because he has just noticed the second anomaly.

The second anomaly is that about 40% of pauses in the full archive show the reason other (free text) with an empty free-text field. He thinks at first that this is another bug. It is not a bug. It is what happens when a subscriber chooses other and doesn’t type anything in the box, and the form lets them through.

Which is a product failure, but it is not the point. The point is that 40% of historical pauses have effectively no reason attached. Which means the cohort Kai had been about to train on, subscribers whose pauses were classified as preventable, is built from a column that is missing its contents for two fifths of the rows.

He messages Priya at 12:14. Are you awake.

She messages back at 12:14. Yes.

Tuesday, 7 a.m., the kitchen

Priya is already there with a coffee when he walks in. She slides a printout across the table. She has done her own queries overnight.

“I pulled every row where the reason is other with nothing after it. Forty-one percent. Then I pulled every row where the reason is moving house and cross-referenced against the subscriber’s address field thirty days before the pause and thirty days after. About twenty percent of moving house pauses are from addresses that didn’t change. They clicked moving house because it was the nearest thing to I need an excuse.”

Kai looks at the printout. “So ‘moving house’ isn’t a clean label either.”

“Nothing is a clean label. The reason field is a proxy. It was good enough for a spreadsheet Sam reads once a month. It is not good enough to train on.”

Kai sits down. He looks at his hands. He says, quietly: “I’ve spent three weeks thinking I was doing the modelling part. I’ve been doing the data part. The data part is the whole job.”

Priya: “Yes.”

“This is why teams don’t ship ML.”

“Yes. This is also why teams that do ship ML ship it well.”

He looks up. “How long does the data work take?”

She has been thinking about this. “Three more weeks. Maybe four. We need a label we can defend, and we’re not going to get it from the reason field. We’re going to get it from the outcome field (did they come back within ninety days, did they cancel) and from Sam’s team’s notes, which are free text and which nobody has ever extracted before.”

“Sam’s team’s notes.”

“Dina has been writing reason for pause: real or reason for pause: proxy in her own notes for two years because she didn’t trust the reason field either. Four thousand of them. She sent them to me on Sunday.”

Kai stares at her. “Dina’s been labelling this by hand for two years.”

“Dina has been labelling this by hand for two years because she needed it for her own work. She never told anyone because nobody asked.”

He laughs, the slightly unhinged laugh of a man who has realised Dina has been doing the work that actually matters for two years while the rest of them were building dashboards. “We need to hire more Dinas.”

“We need to not lose the one we have.”

The quiet work

Priya is correct about the three weeks. It is four, in the end.

Most of it is Dina’s notes joining to the subscription log on Kai’s laptop, one Monday evening at a time. The ambiguous reasons either get dropped or reclassified. The August 2025 missing-reason rows get attached to their ninety-day outcomes and treated as a separate cohort, because you cannot infer what you cannot measure. The don’t-call list is built, indexed, and wired into the join so the flagged rows never enter the training set at all. Not filtered. Never seen.
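A minimal sketch of what excluding at the join might look like, in plain Python. Every name here is hypothetical (build_training_rows, the subscriber_id and reason fields, labels keyed by subscriber rather than by pause event); nothing in it is taken from the team's actual code. The point it illustrates is that a don't-call subscriber is skipped before a training row is ever constructed, and that null-reason rows are kept but tagged as their own cohort.

```python
def build_training_rows(pause_events, labels_by_subscriber, dont_call_ids):
    """Join pause events to hand-written labels, leaving out don't-call subscribers.

    All field names are illustrative assumptions, not a real schema.
    """
    dont_call = set(dont_call_ids)  # a set gives constant-time membership checks
    rows = []
    for event in pause_events:
        subscriber_id = event["subscriber_id"]
        if subscriber_id in dont_call:
            # Skipped inside the join itself: no training row is ever built.
            continue
        label = labels_by_subscriber.get(subscriber_id)
        if label is None:
            # No hand-written label for this subscriber, so nothing to train on.
            continue
        # Null-reason rows stay in, but as a separate cohort.
        cohort = "missing_reason" if event.get("reason") is None else "has_reason"
        rows.append({**event, "label": label, "cohort": cohort})
    return rows


events = [
    {"subscriber_id": 1, "reason": "moving house"},
    {"subscriber_id": 2, "reason": None},
    {"subscriber_id": 3, "reason": "holiday"},
]
labels = {1: "proxy", 2: "real", 3: "real"}
rows = build_training_rows(events, labels, dont_call_ids=[3])
print([row["subscriber_id"] for row in rows])
```

In this toy run subscriber 3 is on the don't-call list, so only subscribers 1 and 2 produce rows, and subscriber 2 lands in the missing_reason cohort.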

Somewhere in the middle of the second pass, Kai wants to throw in every feature he can think of (delivery punctuality, substitution counts, email open rates, help-page visits, the thing Priya quietly calls the reply-to-the-weekly-email signal because it is the strongest single feature and she does not want to admit it), and Priya makes him stop and actually write down why each feature might matter and what a leaky version of it would look like. They cut six of his original twenty-three. One of them read the pause outcome directly. It would have made the model look brilliant and useless.

By the third week the evaluation framework is not yet what it will be. Kai wants raw accuracy. Priya has the argument with him at the coffee machine and wins it because accuracy is a trap on imbalanced classes and Kai, once he has let himself be talked out of it, knows this better than Priya does. What they settle on is lift at the top decile, because Sam’s team only has capacity to call the riskiest ten percent each week. There is a separate evaluation for explanation quality, which Priya has realised, in the same coffee-machine conversation, they are also going to need. Sam’s team is not going to act on a number. They are going to act on why.
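For readers who want the metric spelled out, here is a minimal sketch of lift at the top decile, assuming binary outcomes (1 means the subscriber paused) and one risk score per subscriber. The function name and the top_percent parameter are illustrative, not the team's. Lift is the pause rate among the highest-scored ten percent divided by the pause rate across everyone.

```python
def lift_at_top_decile(scores, outcomes, top_percent=10):
    """Pause rate in the top-scored slice divided by the overall pause rate."""
    if len(scores) != len(outcomes) or not scores:
        raise ValueError("scores and outcomes must be non-empty and the same length")
    base_rate = sum(outcomes) / len(outcomes)
    if base_rate == 0:
        raise ValueError("lift is undefined when there are no positive outcomes")
    # Integer arithmetic avoids float rounding when sizing the slice.
    k = max(1, len(scores) * top_percent // 100)
    # Rank subscribers from riskiest to safest by model score.
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    top_rate = sum(outcome for _, outcome in ranked[:k]) / k
    return top_rate / base_rate


# Twenty subscribers, four of whom paused; scores run from 1.0 down to 0.05.
scores = [i / 20 for i in range(20, 0, -1)]
outcomes = [1, 1, 0, 1, 0, 0, 1] + [0] * 13
print(lift_at_top_decile(scores, outcomes))
```

On these twenty toy rows, a model that predicts nobody ever pauses would score 80% accuracy and be useless to Sam's team, which is the trap Priya is arguing against. The lift figure instead asks whether the two subscribers the team would actually call are more likely to pause than a random pair.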

The fourth week Priya writes a one-page spec for the bake-off and four people in the room argue with it for ninety minutes before anyone agrees to run the experiments.

Two things that happen in this month are not on the spec. Kai starts sleeping badly on Monday nights, because Mondays are when the data work happens, and he has not quite recovered from finding out how bad the reason field was. And Dina, without telling anybody, spends about six hours on a Saturday going through her own old notes to check whether her two-year labelling was as consistent as she thought. It was not. She revises about two hundred of the four thousand labels and sends the revisions to Kai on Sunday evening with a one-line message: I was worse than I remembered. Sorry. Kai writes back you were better than anyone else would have been and re-runs the join on Monday morning, twelve hours behind schedule, and nobody except Dina notices the twelve hours.

The bake-off

Four models on the same training set, the same test set, the same eval metric. Priya has written them on the whiteboard in the modelling-Thursday room.

Logistic regression. Thirty features, regularised. The one you run because your supervisor would kill you if you didn’t.

Gradient-boosted trees. The default answer for tabular problems since about 2016. Kai has fit one in a notebook. It took forty minutes.

A small neural ranker. Two hidden layers. Trained on the same features. Kai’s favourite because it’s what he built at his last job.

A fine-tuned small LLM, given the subscriber’s history as structured text and asked for a risk score. Priya’s favourite because she wants the team to prove to itself whether the shiny thing wins.

She has written prediction on the board in big letters. Priya: LLM will win. Kai: tree will win. Anika: tree will win. Marcus: I don’t care, show me the lift.

They run it on a Thursday afternoon. The tree wins.

It wins on lift at the top decile. It wins on training cost by two orders of magnitude. It wins on inference cost by three orders of magnitude. It wins on latency to a degree that renders the comparison silly. The LLM comes second on lift, fourth on everything else. The neural ranker is third on lift, which is to say not competitive. Logistic regression is fourth on lift and Kai says I told you to nobody in particular.

Priya looks at the numbers. She has the face of someone who has just been told their horse came second.

“Fine,” she says. “The tree wins.”

Kai, who genuinely did not want to be the person who won this argument: “The tree wins.”

Anika, on the laptop from Melbourne: “Can we pick the tree and stop?”

Priya: “No. Because we still need the explanation.”

Where the LLM earned its place

The tree predicts. It does not explain. It returns a risk score between zero and one and a vector of feature importances. Sam’s team cannot act on a feature-importance vector. Dina would fire anyone who handed her one.

What Sam’s team needs, on the at-risk list every Tuesday morning, is a sentence. This subscriber has had three substitutions in four weeks and their email open rate has fallen by half. Or: This subscriber has doubled their visits to the pause help page and reduced their box size last month. Something Dina can read on her phone before she picks up.

The tree does not write sentences.

Kai tries to build the sentence generator as a template, a set of hand-written English patterns keyed off the top features. It works for about a third of subscribers. The other two thirds produce sentences that are technically correct and read like a tax form. He shows them to Dina. Dina reads three and says no.

Priya: “This is the LLM’s job.”

Kai: “We just said the LLM didn’t win.”

“The LLM didn’t win at ranking. Ranking is numbers in, numbers out, and the tree owns that problem. This is different. This is take the features and the risk score and write a sentence Dina can read on her phone at seven-thirty in the morning. Language in the output. That’s an LLM job.”

They wire up a small, cheap, hosted model. Bedrock, nothing bespoke. The prompt takes the top three features for each subscriber, the risk score, and Dina’s style guide from the call script, and returns one sentence. Priya reads the first batch and messages Dina two hours later.

Dina: yes. finally.

The final system is two models. A tree that ranks. An LLM that explains. The tree is the expensive bit to get correct. The LLM is thirty lines of prompt.
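A minimal sketch of how the explanation prompt might be assembled, assuming the tree's output has already been reduced to three (name, value) pairs. The function name, the feature names, and the style-guide text are all invented for illustration. The call to the hosted model is deliberately left out, since the story does not say how it is made.

```python
def build_explanation_prompt(top_features, risk_score, style_guide):
    """Assemble the text handed to the explaining model for one subscriber.

    top_features is a list of (name, value) pairs, most important first.
    Only the first three are used, matching the story's description.
    """
    feature_lines = "\n".join(
        f"- {name}: {value}" for name, value in top_features[:3]
    )
    return (
        f"{style_guide.strip()}\n\n"
        f"Risk score: {risk_score:.2f}\n"
        "Top features:\n"
        f"{feature_lines}\n\n"
        "Write exactly one plain-English sentence explaining why this "
        "subscriber is at risk of pausing."
    )


# Hypothetical inputs: none of these names or values come from the real system.
prompt = build_explanation_prompt(
    top_features=[
        ("substitutions_last_4_weeks", 3),
        ("email_open_rate_change", "-50%"),
        ("pause_help_page_visits", 4),
        ("box_size_change", "smaller"),
    ],
    risk_score=0.87,
    style_guide="Write the way a colleague would talk. No jargon.",
)
print(prompt)
```

Keeping the prompt to three features and a single requested sentence is what lets the explanation stay readable on a phone. The fourth feature passed in above is ignored.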

Kai writes both names on the Post-it on his laptop. Underneath, he writes: right tool for each half of the problem.

Friday, the car park

The bake-off is the scene people will remember when they talk about this month in six months’ time. Priya already suspects it is not the scene that mattered.

She does not write that down. She walks to the car park with Kai on Friday evening. It is getting dark earlier than it did in January, and the parrots in the peppermint tree behind the bin enclosure are making the specific angry noise they make at dusk in February. Kai has his laptop bag over one shoulder and his notebook in his other hand, and the Post-it on his laptop has, by now, five items on it in five different pens, running down the centre of the yellow square like a ledger.

“You okay?” Priya says.

“I will be okay when we’ve shipped it and nothing has set fire to Ruth.”

Priya: “That’s what shadow mode is for.”

He nods. “Next Thursday?”

“Next Thursday.”

They walk a few more paces. Kai stops at his car and puts the laptop bag on the roof while he looks for his keys.

“The tree was the right answer,” he says, more to himself than to Priya. “I wanted it to be the wrong answer because I wanted to build something more interesting. That wasn’t the job.”

“No.”

“I think I get that now.”

“I know you do.”

He finds his keys. He gets into his car. Priya walks on to her own car, which is parked further away because she was late on Monday morning. The parrots continue to be angry at the dusk. Above the office the sky is the colour of a bruise that has started to heal.

Six weeks of shadow mode, a canary that did not behave, and a rollback plan written before anyone wrote the deploy script.