Defining and Measuring High Performance

Every team I’ve worked with wants to be high-performing. Almost none of them can tell me what that means.

Ask a manager what a high-performing team looks like and you’ll get answers that circle around speed. They ship fast. They hit deadlines. They deliver a lot. Push a little harder and you’ll hear story points, velocity charts, lines of code: metrics that measure activity and call it performance.

This is wrong. Not slightly wrong. Wrong in a way that leads to bad decisions, burnt-out teams, and software that ships fast and breaks faster.

High performance isn’t about speed; it’s about the sustained ability to deliver valuable change safely and learn from the results. Speed is a side-effect of a team that’s working well. When you optimise for speed directly, you get something that looks fast but isn’t: a team that cuts corners, skips tests, avoids the hard conversations, and generates a growing pile of invisible risk that will eventually detonate.

Let me be specific about what I think high performance actually is, how to measure it without destroying it, and why most attempts at measurement make teams worse.

What high performance is not

It’s not velocity. Story points are an estimation tool, not a performance metric. They measure the team’s guess about relative effort, denominated in a made-up unit. A team that delivers 40 story points per sprint is not twice as good as one that delivers 20, because the points aren’t comparable across teams, they aren’t comparable across sprints in the same team, and they don’t tell you whether what was delivered was valuable. A team could deliver 100 story points of perfectly implemented features that no customer wants. High velocity, zero value.

It’s not hours worked. A team that works 60-hour weeks is not high-performing. They’re over-working, which is a reliable predictor of errors, burnout, and eventual attrition. The team that leaves at 5.00pm and ships something meaningful every week is out-performing the team that stays until 9.00pm and ships something buggy every two weeks, regardless of what the timesheet says.

It’s not individual heroics. The developer who pulls an all-nighter to fix a production outage is not a sign of high performance. They’re a sign of a system that produces outages requiring all-nighters. High-performing teams don’t need heroes because they build systems that don’t generate crises.

It’s not busyness. Full calendars, packed sprints, no slack in the schedule: these feel productive. They aren’t. A team with no slack has no capacity for learning, no room for improvement, and no ability to absorb surprises. Busyness is the performance of productivity, not the thing itself.

What it actually is

A high-performing team delivers valuable change to its customers frequently, safely, and sustainably, while continuously improving its ability to do so.

That sentence has five terms doing the work.

Valuable. What they deliver matters to someone. Not every feature is valuable. Not every task is valuable. A team that spends a sprint refactoring code that doesn’t need refactoring is busy but not valuable. The word “valuable” forces the question: valuable to whom? If you can’t answer that, you don’t know whether the work matters.

Frequently. The feedback loop is short. They ship often enough to learn from what they ship. A team that deploys once a quarter gets four data points a year about whether they’re building the right thing. A team that deploys daily gets hundreds. Frequency isn’t about speed; it’s about learning.

Safely. Changes don’t break things. When they do break things, the blast radius is small and the recovery is fast. Safety isn’t the absence of risk; it’s the presence of systems that manage risk. Tests, monitoring, deployment practices, incident response: these are the infrastructure of safety.

Sustainably. They can keep doing this. Not for a sprint. Not for a quarter. For years. The pace is one the team can maintain without burning out, without accumulating crippling technical debt, without losing key people because the work is unsustainable.

Continuously improving. They get better at it over time. Last quarter’s hard thing is this quarter’s routine. They invest in their own capability (learning, tooling, process improvement) as part of regular work, not as a special event.

DORA metrics: the least bad option

The DORA metrics (Deployment Frequency, Lead Time for Changes, Change Failure Rate, Failed Deployment Recovery Time, and Rework Rate) are the closest thing we have to a useful, research-backed measurement of software delivery performance. They come from the DORA research programme, popularised by the book Accelerate, which studied thousands of teams over multiple years and found statistically significant correlations between delivery metrics like these and organisational performance.

The list has shifted. For most of its history DORA was four metrics: Deployment Frequency, Lead Time, Change Failure Rate, and Mean Time to Recovery. The 2023 DORA report renamed MTTR to Failed Deployment Recovery Time (narrowing the metric to incidents caused by a deployment rather than every production incident), and the 2024 report added Rework Rate as a fifth signal, partly in response to what AI-assisted development was doing to delivery pipelines. Plenty of material still talks about “the four DORA metrics.” That material is out of date.

Here’s what they measure and why they matter.

Deployment Frequency measures how often the team ships to production. Daily is good. Multiple times a day is better. Weekly is okay. Monthly is a warning sign. This isn’t because deploying more often is intrinsically good; it’s because deploying more often requires everything else to be good. You can’t deploy daily if your tests are broken, your merge process takes two days, or your deployments require three people and a prayer.
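As a rough illustration, deployment frequency can be computed from nothing more than a list of production deployment dates. The sketch below assumes those dates have already been exported from the pipeline; the function name and the sample data are hypothetical.

```python
from collections import Counter
from datetime import date


def deployments_per_week(deploy_dates: list[date]) -> dict[tuple[int, int], int]:
    """Count production deployments per ISO (year, week)."""
    counts = Counter()
    for d in deploy_dates:
        iso = d.isocalendar()
        counts[(iso[0], iso[1])] += 1
    return dict(counts)


# Hypothetical sample: five deployments spread over two consecutive weeks.
sample_deploys = [
    date(2024, 3, 4), date(2024, 3, 5), date(2024, 3, 7),
    date(2024, 3, 12), date(2024, 3, 14),
]
weekly = deployments_per_week(sample_deploys)
```

Weeks with no deployments simply don’t appear in the result, which is itself worth noticing when you read the output.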

Lead Time for Changes measures the time from code commit to code running in production. Short lead times mean the pipeline is smooth: testing, review, deployment, all flowing without manual gates and week-long queues. Long lead times mean friction, and friction means risk, because a change that sits in a queue for two weeks is a change that’s two weeks stale by the time it ships.
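One way to compute this, assuming you can pair each change’s commit time with the time it reached production, is sketched below. The function name and sample data are hypothetical. The median is used here as a design choice, so that one change stuck in a queue doesn’t dominate the figure.

```python
from datetime import datetime, timedelta
from statistics import median


def median_lead_time(changes: list[tuple[datetime, datetime]]) -> timedelta:
    """Median time from commit to running in production.

    Each item is (committed_at, deployed_at).
    Raises statistics.StatisticsError if the list is empty.
    """
    seconds = [(deployed - committed).total_seconds() for committed, deployed in changes]
    return timedelta(seconds=median(seconds))


# Hypothetical sample: three changes with different waits.
sample_changes = [
    (datetime(2024, 3, 4, 9, 0), datetime(2024, 3, 4, 11, 0)),   # 2 hours
    (datetime(2024, 3, 4, 10, 0), datetime(2024, 3, 5, 12, 0)),  # 26 hours
    (datetime(2024, 3, 5, 9, 0), datetime(2024, 3, 5, 14, 0)),   # 5 hours
]
typical = median_lead_time(sample_changes)
```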

Change Failure Rate measures what percentage of deployments cause a failure in production: a service outage, a rollback, an incident. Lower is better. Zero isn’t realistic. The point isn’t to never fail; it’s to fail infrequently enough that failure is unusual rather than routine.
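The calculation itself is a simple ratio; the hard part is deciding what counts as a failure. A minimal sketch, assuming each deployment record carries a flag saying whether it led to an incident or rollback (the `Deployment` type and its `caused_failure` field are hypothetical, not taken from any particular tool):

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    id: str
    caused_failure: bool  # hypothetical flag: True if the deploy led to an incident or rollback


def change_failure_rate(deployments: list[Deployment]) -> float:
    """Fraction of deployments that caused a production failure."""
    if not deployments:
        return 0.0
    failed = sum(1 for d in deployments if d.caused_failure)
    return failed / len(deployments)


# Hypothetical sample: 20 deployments, one of which caused a failure.
sample = [Deployment(id=f"d{i}", caused_failure=(i == 7)) for i in range(20)]
rate = change_failure_rate(sample)
```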

Failed Deployment Recovery Time measures how quickly the team recovers when a deployment causes a production failure. This used to be called Mean Time to Recovery, and the rename matters: the old version conflated deployment-induced incidents with every other kind of incident, which made the metric noisy. Recovery from a bad deploy is something the team controls. Recovery from a third-party outage is mostly waiting on someone else’s phone call. Failed Deployment Recovery Time isolates the part the team can actually improve. Fast recovery matters more than low failure rate, because failure is inevitable and the question is not “will things break?” but “how quickly can we fix them when they do?” A team with a 5% failure rate and a ten-minute recovery time is safer than a team with a 1% failure rate and a four-hour recovery time.
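The comparison between those two teams is easy to check with arithmetic. The sketch below assumes both teams make the same number of deployments, and it uses expected minutes spent in a failed state as a crude proxy for safety; the function name is hypothetical.

```python
def expected_failure_minutes(deploys: int, failure_rate: float, recovery_minutes: float) -> float:
    """Expected total minutes spent recovering from failed deployments."""
    return deploys * failure_rate * recovery_minutes


# Assumption: each team makes 100 deployments over the same period.
team_a = expected_failure_minutes(100, 0.05, 10)    # 5% failure rate, ten-minute recovery
team_b = expected_failure_minutes(100, 0.01, 240)   # 1% failure rate, four-hour recovery
```

Under these assumptions the first team spends about 50 minutes recovering per 100 deployments and the second about 240, despite failing five times less often.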

Rework Rate measures how often a deployment requires another deployment shortly after to fix what the first one broke or didn’t quite get right. “Shortly” matters; this isn’t normal iterative work where a feature ships, customers use it, and the next sprint refines it based on feedback. Rework is the unplanned hot-fix follow-up: the configuration tweak that should have been in the original change, the missed edge case that surfaced an hour after the deploy, the LLM-generated patch that compiled cleanly and passed CI but didn’t actually solve the problem. DORA added it in 2024 because the rise of AI-assisted development was inflating this kind of churn: code that’s almost right but not quite, shipped because the diff looked good and the tests passed, fixed up the next day. A high rework rate doesn’t just slow you down; it hides the cost of the original change. Two deploys to ship one feature is twice the risk, twice the noise in the other metrics, and twice the chance of something else going wrong on the way through.
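Here is one way a team might approximate this from its own deployment log. This is a sketch, not DORA’s official definition: it assumes fix deployments are tagged with the id of the deployment they repair (the `fixes` field is hypothetical), and it treats 24 hours as “shortly”, which is an arbitrary choice.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deploy:
    id: str
    at: datetime
    fixes: Optional[str] = None  # hypothetical: id of the earlier deploy this one repairs


def rework_rate(deploys: list[Deploy], window: timedelta = timedelta(hours=24)) -> float:
    """Fraction of all deployments that needed a fix deployment within `window`."""
    by_id = {d.id: d for d in deploys}
    reworked = set()
    for d in deploys:
        original = by_id.get(d.fixes) if d.fixes else None
        if original is not None and timedelta(0) <= d.at - original.at <= window:
            reworked.add(original.id)
    return len(reworked) / len(deploys) if deploys else 0.0


t0 = datetime(2024, 3, 4, 9, 0)
sample = [
    Deploy("d1", t0),
    Deploy("d2", t0 + timedelta(hours=1), fixes="d1"),  # hot-fix inside the window
    Deploy("d3", t0 + timedelta(days=1)),
    Deploy("d4", t0 + timedelta(days=4), fixes="d3"),   # follow-up three days later, outside the window
]
rate = rework_rate(sample)
```

In this sample only `d1` counts as reworked, so the rate is one in four; the later follow-up to `d3` is treated as ordinary iteration.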

The five metrics work as a system. You can’t game one without the others revealing the game. If you increase deployment frequency by skipping tests, your change failure rate will climb. If you decrease lead time by skipping code review, your rework rate will rise. The metrics hold each other in tension, which is what makes them useful.

But, and this is critical, DORA metrics measure the delivery pipeline. They do not measure whether what you’re delivering is valuable. A team can have elite DORA metrics and build the wrong product. The metrics tell you the team is healthy and effective at shipping. They don’t tell you the team is shipping the right things. You need other instruments for that.

Team health checks

The Spotify team health check model asks teams to rate themselves across a set of dimensions: mission clarity, speed, quality, fun, learning, support, pawns-or-players (agency). It’s a subjective self-assessment, not an objective measurement, and that’s the point.

DORA metrics tell you what the delivery pipeline is doing. Health checks tell you what the humans are experiencing. Both matter. A team can have great DORA metrics and be miserable: shipping fast because they’re afraid to slow down, not because they’re working well. The health check surfaces the misery that the metrics don’t.

I’ve used a simplified version of the health check in teams I’ve worked with. Eight dimensions, rated green/amber/red, discussed in a retrospective once a quarter. The value isn’t in the ratings; it’s in the conversation. When a team rates “fun” as red, the rating is the starting point for a conversation about why. The answer is usually specific and actionable: “We’ve spent three sprints on compliance work and nobody’s done anything creative.” That’s fixable.

The danger of health checks is that they become performative. If the team thinks the ratings will be used against them (to justify reorganisation, to flag “underperformers,” to satisfy a management report) they’ll rate everything green and the exercise becomes worthless. Health checks work only in an environment of psychological safety, where people can say “this is red” without consequences beyond a conversation about how to make it better.

Leading vs lagging indicators

DORA metrics are lagging indicators. They tell you what has already happened. Deployment frequency tells you how often you shipped last month. Change failure rate tells you how many of those deployments broke something. By the time the metric changes, the cause has already occurred.

Leading indicators tell you what’s about to happen. They’re harder to measure but more useful for steering.

Code review turnaround time is a leading indicator of lead time. If reviews sit for two days before someone looks at them, lead time will be long. You can see this before it shows up in the lead time metric.
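A small sketch of how this could be surfaced, assuming you can list open review requests with the time each was opened and that none has had a first review yet. The function name, the identifiers, and the two-day default are hypothetical.

```python
from datetime import datetime, timedelta


def stale_reviews(opened_at: dict[str, datetime], now: datetime,
                  threshold: timedelta = timedelta(days=2)) -> list[str]:
    """Return ids of open review requests that have waited longer than `threshold`."""
    return sorted(pr for pr, opened in opened_at.items() if now - opened > threshold)


now = datetime(2024, 3, 8, 12, 0)
open_reviews = {
    "PR-101": datetime(2024, 3, 5, 9, 0),   # waiting just over three days
    "PR-102": datetime(2024, 3, 8, 10, 0),  # opened two hours ago
}
waiting = stale_reviews(open_reviews, now)
```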

Test suite reliability is a leading indicator of change failure rate. If the test suite is flaky (passing sometimes, failing sometimes, for reasons unrelated to the code change) developers will stop trusting it and start merging without confidence. Failures will follow.
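One simple heuristic for spotting flakiness is to look for tests that both passed and failed against the same commit. The sketch below assumes test results are available as (test name, commit, outcome) records; the names and data are hypothetical, and a real suite may need a more careful definition.

```python
from collections import defaultdict


def flaky_tests(runs: list[tuple[str, str, bool]]) -> set[str]:
    """Flag tests that both passed and failed on the same commit.

    Each run is (test_name, commit_sha, passed).
    """
    outcomes = defaultdict(set)
    for test, sha, passed in runs:
        outcomes[(test, sha)].add(passed)
    return {test for (test, _sha), seen in outcomes.items() if len(seen) == 2}


sample_runs = [
    ("test_checkout", "abc123", True),
    ("test_checkout", "abc123", False),  # same commit, different outcome
    ("test_login", "abc123", True),
    ("test_login", "def456", False),     # failed, but on a different commit
]
suspects = flaky_tests(sample_runs)
```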

Team mood is a leading indicator of everything. When people start dreading work, when Slack goes quiet, when the retro surfaces the same complaints for the third sprint running, something is wrong, and it will show up in the lagging metrics within a month or two.

On-call burden is a leading indicator of sustainability. If the same two people carry the on-call rotation and neither has had an uninterrupted weekend in a month, you’re headed for burnout and attrition. The DORA metrics won’t show it until someone quits.
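To see how concentrated the rotation is, it can help to look at each person’s share of recent shifts. A minimal sketch, assuming you have a list with one entry per on-call shift naming who covered it (the names and data are hypothetical):

```python
from collections import Counter


def on_call_share(shifts: list[str]) -> dict[str, float]:
    """Fraction of on-call shifts covered by each person."""
    counts = Counter(shifts)
    total = len(shifts)
    return {person: n / total for person, n in counts.items()}


# Hypothetical sample: eight recent shifts.
recent_shifts = ["ana", "ben", "ana", "ana", "ben", "cy", "ben", "ana"]
share = on_call_share(recent_shifts)
top_two = sum(sorted(share.values(), reverse=True)[:2])
```

In this sample two people cover seven of the eight shifts, which is the kind of pattern the paragraph above warns about.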

The best teams track both. Lagging indicators tell you where you’ve been. Leading indicators tell you where you’re going. A team that only watches lagging indicators is driving by looking in the rear-view mirror.

Goodhart’s law: the measurement trap

Goodhart’s law states: when a measure becomes a target, it ceases to be a good measure.

This is not an abstract academic concern; it’s the single most common failure mode I’ve seen in teams that try to measure performance.

The moment you tell a team “your target is 30 deployments per month,” they will optimise for deployments. They’ll split changes into smaller pieces (good, probably). They’ll deploy config changes that don’t need deploying (bad). They’ll stop doing the careful, slow work that produces fewer but more meaningful changes (very bad). The metric goes up. The performance goes down. Everyone reports success.

I’ve watched this happen with velocity, with code coverage, with deployment frequency, and with every other metric that’s been turned into a target. The pattern is always the same. The number improves and the thing the number was supposed to measure gets worse.

The fix is to use metrics as diagnostic tools, not targets. A thermometer is useful for understanding whether you have a fever. It’s useless as a target. “My goal this quarter is to maintain a body temperature of 36.8 degrees” doesn’t make you healthier; it just makes you obsess about the thermometer.

Track DORA metrics. Review them in retrospectives. Discuss trends. Ask why deployment frequency dropped this month. Ask why lead time increased. But don’t set targets. Don’t tie metrics to performance reviews. Don’t put them on a dashboard that management reviews weekly with pointed questions about why the numbers aren’t green.

How to measure without destroying what you’re measuring

Here’s the approach I recommend, having watched teams get this right and wrong for years.

Track DORA metrics passively. Instrument the pipeline. Collect the data. Generate the reports automatically. Don’t make anyone responsible for the numbers. The numbers are a signal, not a score.
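An automated report in this spirit might look something like the sketch below: it compares two periods and produces questions rather than scores. The metric names, the function, and the 20% noise filter are all hypothetical; the filter is there only to skip small fluctuations, not to act as a target.

```python
def discussion_prompts(previous: dict[str, float], current: dict[str, float],
                       min_change: float = 0.2) -> list[str]:
    """Turn notable period-over-period changes into questions for a retro."""
    prompts = []
    for name in sorted(current):
        before = previous.get(name)
        if not before:  # skip metrics with no (or zero) baseline
            continue
        change = (current[name] - before) / before
        if abs(change) >= min_change:
            direction = "rose" if change > 0 else "fell"
            prompts.append(f"{name} {direction} by {abs(change):.0%}: what changed?")
    return prompts


last_month = {"deploys_per_week": 10, "lead_time_hours": 20}
this_month = {"deploys_per_week": 6, "lead_time_hours": 21}
prompts = discussion_prompts(last_month, this_month)
```

The output is a list of open questions for the team, with no pass or fail attached to any number.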

Run quarterly health checks. Eight dimensions, self-assessed, discussed in a retro. Keep the ratings private to the team. Don’t share them upward unless the team chooses to. The point is the team’s own understanding of its health, not a management report.

Watch leading indicators informally. Code review turnaround, test suite health, team mood, on-call burden. These don’t need dashboards. They need attention. A team lead who notices that reviews are sitting for two days and asks about it is more valuable than a dashboard that turns yellow.

Have the conversation, not the metric. When something looks off (deployment frequency drops, the health check goes red on “fun,” reviews are taking longer) the response isn’t to set a target; it’s to ask the team what’s going on. The metric opened the conversation. The conversation produces the insight. The insight drives the change.

Separate measurement from evaluation. This is the hard one. Metrics exist to help the team improve. They do not exist to evaluate individuals or teams for promotion, ranking, or comparison. The moment metrics become evaluative, they become gamed, and gamed metrics are worse than no metrics because they create a false sense of understanding.

The team, not the individual

One more thing that matters: high performance is a property of teams, not individuals.

The research is clear on this. Google’s Project Aristotle found that who was on a team mattered less than how the team worked together. The same person on two different teams performed at two different levels. The team was the unit of performance, not the person.

This means measuring individual performance (stack ranking, individual velocity, personal deployment counts) is not just unhelpful but actively destructive. It incentivises individual optimisation at the expense of team outcomes. The developer who helps three colleagues solve problems and ships nothing personally has contributed more than the one who ships six features and helps nobody. Individual metrics can’t see this.

Measure the team. Develop the team. Improve the team. The individuals will improve as a side-effect, because working in a high-performing team is the single most effective development activity for any engineer.

Where to start

If your team isn’t measuring anything, start here:

Instrument deployment frequency and lead time. These are the easiest DORA metrics to collect and the most informative to discuss.
Run one health check. Keep it simple. Eight dimensions. Green/amber/red. Discuss the reds.
Pick one leading indicator (code review turnaround is usually the most immediately actionable) and pay attention to it for a month.
Have one conversation in a retro about what the data is telling you. Not what to do about it. Just what you’re seeing.

That’s enough for the first quarter. Measurement is a practice. Like all practices, it gets better with repetition and worse with intensity. Start small, stay consistent, and resist the temptation to turn signals into targets.

The teams that measure well are not the ones with the best dashboards; they’re the ones that use measurement to start conversations, and then have the courage to act on what those conversations reveal.