Two Deploys to Ship One Feature

The first time I noticed it, I thought I was looking at a coincidence. The pattern repeated for a quarter. By then it had a name.

A team I was working with had clean DORA numbers. Deploy frequency steady at multiple-per-day, lead time under an hour, change failure rate hovering around four percent. By every measure on their dashboard the pipeline was healthy. They were also, quietly, exhausted. Pull-request reviews were taking longer. The on-call rotation was complaining about a steady drip of small incidents. Sprint reviews kept featuring the same two or three features being demonstrated for the second or third time, with a different fix each time.

The metrics said one thing. The team said another. The metrics weren’t lying. They just weren’t measuring what was actually happening.

What was actually happening was rework. Not the visible kind that the failure-rate metric catches: the rollbacks, the alerts, the incidents that wake people up. The quieter kind: a deploy that landed cleanly, then twelve hours later a follow-up deploy fixing a thing the first one missed. Then sometimes a third one fixing what the second one disturbed. Each individual deploy looked fine. The pattern across deploys was the story.

DORA gave it a name in 2025. They added Rework Rate as a fifth metric, alongside the four that had been the canonical set since Accelerate. The reason they added it now and not earlier is the same reason it was suddenly worth measuring: AI-assisted development changed the shape of the problem.

What rework rate measures

Rework Rate is the percentage of deployments that exist primarily to fix something a recent earlier deployment broke or didn’t quite get right.

The “primarily” is doing work in that sentence. Normal iterative development is not rework. A feature ships, customers use it, the team learns something, and the next sprint refines what shipped. That’s the loop working as intended. Rework is the unplanned hot-fix follow-up: the configuration tweak that should have been in the original change, the missed edge case that surfaced an hour after the deploy, the patch that compiled cleanly and passed CI but didn’t actually solve the problem in production.

The operational definition usually looks like this: a deploy counts as rework if it touches files modified by another deploy in the previous N hours and doesn’t add net new functionality. N is typically 24 to 48 hours. The “doesn’t add net new functionality” part is the judgment call. In practice, teams either trust their commit messages (a fix: prefix is a strong hint) or have someone tag rework deploys manually for the first month and then automate from the resulting dataset.
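One way to make that definition concrete is a small predicate. This is a minimal sketch, not a standard implementation: the Deploy record, the is_rework name and the FIX_PREFIXES convention are all hypothetical, and the commit-message prefix stands in for the “doesn’t add net new functionality” judgment call.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumed commit-message convention; adjust to whatever your team uses.
FIX_PREFIXES = ("fix:", "hotfix:")


@dataclass(frozen=True)
class Deploy:
    deployed_at: datetime
    message: str
    files: frozenset[str]


def is_rework(deploy: Deploy, history: list[Deploy], window_hours: int = 24) -> bool:
    """True if `deploy` looks like a fix and shares files with a recent earlier deploy."""
    if not deploy.message.lower().startswith(FIX_PREFIXES):
        return False  # rough proxy for "adds net new functionality"
    window_start = deploy.deployed_at - timedelta(hours=window_hours)
    return any(
        window_start <= earlier.deployed_at < deploy.deployed_at
        and bool(earlier.files & deploy.files)
        for earlier in history
    )


# Usage: a feature deploy followed twelve hours later by a fix to the same file.
t0 = datetime(2025, 3, 3, 9, 0)
feature = Deploy(t0, "feat: add export endpoint", frozenset({"api/export.py"}))
followup = Deploy(t0 + timedelta(hours=12), "fix: handle empty export", frozenset({"api/export.py"}))
print(is_rework(followup, [feature]))  # True
```

The window is a parameter on purpose. Running the same log with 24 and then 48 hours is a cheap way to see how sensitive your number is to the choice of N.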

This is, deliberately, a fuzzy measure. Precise rework rates are not the point. The point is that rework rate is a signal you mostly didn’t have before, and even an approximate version of it tells you something the other four metrics actively hide.

Why the other metrics hide it

A high rework rate makes deployment frequency look better. Two deploys ship one feature; the deploy counter goes up by two. The team appears to be moving faster. They aren’t; they’ve split one piece of work into two attempts, but the metric doesn’t see the difference.

Lead time can also look better with high rework. The follow-up deploy is small, fast, and goes through the pipeline quickly. It pulls the average down. You shipped the original change in four hours and the fix in twenty minutes; your average lead time is now two hours and ten minutes per deploy, which looks like an improvement until you realise that the time to an actually working feature was four hours and twenty minutes, twice what the average suggests.
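The arithmetic is worth working through once. The sketch below uses only the two lead times from the example, in minutes; the variable names are illustrative.

```python
from statistics import mean

# Lead times in minutes for the two deploys in the example.
original_change = 4 * 60  # four hours
follow_up_fix = 20        # twenty minutes

# What the dashboard reports: the mean lead time per deploy.
average_per_deploy = mean([original_change, follow_up_fix])

# What the customer experienced: elapsed pipeline time until the feature worked.
time_to_working_feature = original_change + follow_up_fix

print(average_per_deploy)       # 130 minutes, i.e. 2h10m
print(time_to_working_feature)  # 260 minutes, i.e. 4h20m
```

The per-deploy average halves the number simply because there are now two deploys to divide by.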

Change failure rate misses it almost entirely. Most rework deploys don’t trip the failure threshold. They aren’t outages. They aren’t rollbacks. They’re “oh, we forgot the timezone case” or “the LLM’s regex was too greedy”, merged, deployed, forgotten. The change failure rate metric thinks everything is fine. The codebase is gathering quiet barnacles.

Failed Deployment Recovery Time only fires when something has actually broken badly enough to count as failure. Rework lives in the long tail below that threshold.

So you can have elite numbers on all four classic metrics and still be doing twice as much work as you think you are, because half of every feature is being shipped in two passes instead of one and the second pass is invisible to the dashboard.

What changed in 2025

The reason DORA added rework rate when they did is that the rise of AI-assisted development inflated this kind of churn dramatically.

LLM-generated code has a particular failure mode that traditional metrics handle poorly. The code compiles. The tests pass. The diff looks reasonable. Code review approves it because the structure is conventional and the variable names are sensible. It deploys without incident. And then, in production, against real data and real edge cases, it turns out to be subtly wrong in a way that’s neither a crash nor an outage but a quiet “that’s not quite what was supposed to happen.” Someone files a small ticket. Someone else fixes it the next day. The fix takes another deploy.

This is not new. Junior developers have been generating “almost right” patches for as long as there have been junior developers. What’s new is the volume. A team using LLMs heavily can produce many more “almost right” diffs per week than the same team produced when every line was hand-typed. Rework rate is the metric that catches the volume.

It’s worth saying clearly: a high rework rate doesn’t mean stop using LLMs. The teams I’ve seen with the lowest rework rates often use AI assistance heavily. What distinguishes them is how they use it. They don’t ship LLM output that hasn’t been read carefully by someone with context. They don’t trust passing tests as a substitute for understanding what the code does. They treat an LLM-generated PR with the same scepticism they’d apply to a brand new junior’s first PR, not because the LLM is bad, but because the LLM, like the junior, doesn’t know what’s already in the codebase or what the team learned from last quarter’s incident.

Rework rate makes the difference visible. Two teams using the same tools, with the same other DORA numbers, can have rework rates that differ by an order of magnitude. The difference is the discipline around what gets merged.

What to do with the number

Like all DORA metrics, rework rate is a thermometer, not a target. Setting “reduce rework rate to under 5% this quarter” creates the same Goodhart’s law problem that velocity targets did. The team will stop tagging deploys as rework. They’ll roll fixes into larger changes to hide them. They’ll defer fixes until the next sprint so they look like new work. The number will go down. The thing the number was supposed to measure will not.

Used as a diagnostic, rework rate opens conversations the other metrics can’t.

When rework rate climbs sharply, ask which area of the codebase the rework is concentrated in. Usually it’s one or two services or one or two parts of the workflow. That’s where the team’s mental model and the actual behaviour have drifted apart. It’s worth a session of event storming or a refreshed pass at the integration tests in that area before adding more features there.
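Finding where the rework concentrates can be as simple as counting rework deploys per top-level directory. A rough sketch, assuming you already have the list of files each rework deploy touched as repo-relative paths, and that top-level directories map roughly onto services (both are assumptions; rework_by_area is a hypothetical helper):

```python
from collections import Counter
from pathlib import PurePosixPath


def rework_by_area(rework_file_lists: list[list[str]]) -> Counter:
    """Count rework deploys per top-level directory; a deploy counts once per area."""
    counts: Counter = Counter()
    for files in rework_file_lists:
        # A set, so a deploy touching five files in one area still counts once.
        areas = {PurePosixPath(path).parts[0] for path in files if path}
        counts.update(areas)
    return counts


# Usage: three rework deploys, two of them confined to billing.
rework = [
    ["billing/invoice.py", "billing/tax.py"],
    ["billing/invoice.py"],
    ["search/index.py", "billing/tax.py"],
]
print(rework_by_area(rework).most_common(2))  # [('billing', 3), ('search', 1)]
```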

When rework rate is high specifically on AI-assisted PRs, the conversation is about review discipline. Not about banning LLMs. About what “approved” means when the diff was generated rather than written. Most teams find they need a different review checklist for generated code, one that asks “does this match what we already do here?” rather than just “does the diff look reasonable?”

When rework rate is high but concentrated on one developer, that’s almost always a context problem rather than a competence problem. The developer is shipping into an area they don’t have full context on, and the LLM is filling in the gaps with plausible-but-wrong patterns. Pair them with someone who does have the context for a sprint.

When rework rate is steadily low and the rest of the metrics are healthy, you are in a place very few teams reach. Don’t celebrate by setting it as a target.

How to start measuring

Most teams don’t need a tool for this. They need a definition and a fortnight of paying attention.

Pick a window; 24 hours is the most common. Pull the deploy log for the last quarter. For each deploy, ask whether it touched files that another deploy in the previous 24 hours touched, and whether it primarily fixed rather than added. Count those as rework. Divide by total deploys.

Even a manual pass over a quarter’s deploys takes an afternoon and produces a number that’s interesting. The interesting part is rarely the number itself; it’s what you find on the way. You’ll notice clusters: areas of the codebase that produce rework consistently, particular types of change that always need a follow-up, specific testing gaps that the rework keeps exposing. The metric is a path into the conversation; the conversation is where the value is.

If the manual pass reveals something worth tracking, automating it is straightforward. Most CI systems already log the data needed. A scheduled job, a small SQL query, a dashboard tile. The simplest version is: count deploys per week where the commit message starts with fix: or hotfix: and the previous deploy on the same files was within 24 hours. Refine from there if the signal is useful.
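As a sketch of that simplest version, here it is in plain Python rather than SQL. The input shape (tuples of deploy time, commit message and touched files) is an assumption about what your CI log can export, and weekly_rework_counts is a hypothetical name. It does a single pass over the log in time order, remembering when each file was last deployed.

```python
from datetime import datetime, timedelta

FIX_PREFIXES = ("fix:", "hotfix:")
WINDOW = timedelta(hours=24)


def weekly_rework_counts(deploys):
    """deploys: iterable of (deployed_at, commit_message, files) tuples.

    Returns {(iso_year, iso_week): [rework_count, total_count]}.
    """
    last_touched = {}  # file path -> time of the most recent deploy touching it
    weekly = {}
    for deployed_at, message, files in sorted(deploys, key=lambda d: d[0]):
        iso = deployed_at.isocalendar()
        bucket = weekly.setdefault((iso[0], iso[1]), [0, 0])
        bucket[1] += 1
        is_fix = message.lower().startswith(FIX_PREFIXES)
        recent_overlap = any(
            f in last_touched and deployed_at - last_touched[f] <= WINDOW
            for f in files
        )
        if is_fix and recent_overlap:
            bucket[0] += 1
        # Record this deploy only after checking, so it never matches itself.
        for f in files:
            last_touched[f] = deployed_at
    return weekly


# Usage: one feature deploy, a fix 12 hours later, and a second fix 37 hours after that.
log = [
    (datetime(2025, 3, 3, 9, 0), "feat: add export endpoint", ["api/export.py"]),
    (datetime(2025, 3, 3, 21, 0), "fix: handle empty export", ["api/export.py"]),
    (datetime(2025, 3, 5, 10, 0), "fix: export column order", ["api/export.py"]),
]
for week, (rework, total) in weekly_rework_counts(log).items():
    print(week, f"{rework}/{total} deploys were rework")
```

The third deploy falls outside the window and so is not counted, even though it is plainly a fix to the same feature. That is the kind of gap to look at before trusting the number.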

What you don’t want is a precise, complex, automated rework-rate dashboard that turns the metric into a number people optimise for. Keep it scrappy. Keep it conversational. The day someone asks “why was our rework rate 0.3 percentage points higher this week?” you’ve started measuring the wrong thing.

What to remember

The four classic DORA metrics measure speed and stability of the pipeline. Rework rate measures whether the pipeline is shipping the right thing on the first try, or whether it’s shipping a rough draft and a fix-up.

In a world where tests pass but the code is subtly wrong, the difference between those two pipelines is the difference between a team that’s actually moving fast and a team that’s running on a treadmill. The other four metrics can’t tell them apart. The fifth can.

Track it. Discuss it in retros. Don’t put it on a target. Use it to ask better questions about the work you’re already doing.

The thing about quiet barnacles is that they keep accumulating until you notice them. Rework rate is what noticing looks like, written down.