How Estimation Works (And Why It Doesn't)

June 30, 2026 · 16 min read

Part of Under the Hood — deep dives into the technology we use every day.

Every software project starts with someone asking “how long will it take?” Every experienced developer knows the honest answer is “longer than you think, even after accounting for the fact that it’ll be longer than you think.” That’s not cynicism. It’s a well-documented cognitive phenomenon with a name, a body of research, and – if you’re willing to change your approach – some practical solutions.

The planning fallacy

In 1979, Daniel Kahneman and Amos Tversky described what they called the planning fallacy: the systematic tendency for people to underestimate the time, cost, and risk of future actions while overestimating their benefits. The original paper, “Intuitive Prediction: Biases and Corrective Procedures”, showed that people consistently generate optimistic estimates even when they have direct experience of similar tasks taking longer than expected.

The planning fallacy isn’t about incompetence. It’s about how human brains construct predictions. When you ask a developer “how long will this take?”, their brain does something specific: it imagines the best-case scenario. It constructs a mental model of the work going well – no surprises, no blockers, no interruptions, no scope changes, no bugs in dependencies. The estimate that emerges is the time required in this imaginary best case.

Kahneman and Tversky distinguished between two modes of prediction: the inside view and the outside view. The inside view constructs a prediction by thinking about the specific task – the steps involved, the complexity, the skills required. The outside view asks: “How long have similar tasks taken in the past?” The inside view produces optimistic estimates. The outside view produces realistic ones.

The problem is that humans default to the inside view. It’s intuitive. It feels responsible – you’re thinking about this task, not some generic average. But the inside view systematically ignores the things that make real projects take longer: the unknown unknowns, the requirements that change mid-build, the dependency that turns out to be broken, the three hours spent debugging an environment issue that shouldn’t exist.

The cone of uncertainty

The cone of uncertainty, popularised by Barry Boehm in the 1980s and later refined by Steve McConnell in Software Estimation: Demystifying the Black Art (2006), describes how the range of possible outcomes narrows as a project progresses.

At the start of a project, before any detailed requirements work, the cone is wide: the actual effort might be 0.25x to 4x the initial estimate. That’s a sixteen-fold range. A task estimated at four weeks might take one week or sixteen weeks. This isn’t a failure of estimation – it’s a reflection of genuine uncertainty. At the start, you don’t know what you don’t know.

As the project progresses through requirements, design, and implementation, the cone narrows. After detailed requirements, the range might be 0.5x to 2x. After high-level design, 0.67x to 1.5x. By the time you’re well into implementation, you have a much clearer picture of how long the remaining work will take.

The cone has an important implication: early estimates are inherently imprecise, and no amount of effort will make them precise. The uncertainty isn’t in the estimating process – it’s in the project itself. You can’t estimate accurately because the information needed for an accurate estimate doesn’t exist yet. It will be discovered during the work.

Project phase | Typical range | If estimate = 4 weeks
Initial concept | 0.25x – 4x | 1 – 16 weeks
Approved product definition | 0.5x – 2x | 2 – 8 weeks
Requirements complete | 0.67x – 1.5x | 2.7 – 6 weeks
UI design complete | 0.8x – 1.25x | 3.2 – 5 weeks
Detailed design complete | 0.9x – 1.1x | 3.6 – 4.4 weeks

The numbers vary by source and context, but the shape is consistent: high uncertainty early, narrowing with discovery.

Hofstadter’s Law

Douglas Hofstadter, in his 1979 book Gödel, Escher, Bach, formulated what he called Hofstadter’s Law:

It always takes longer than you expect, even when you take into account Hofstadter’s Law.

The recursion is the point. Even when you consciously add buffer for the planning fallacy – “I know estimates are usually low, so I’ll add 50%” – the result is still too optimistic. The bias isn’t fixed by knowing about it. Kahneman has been explicit about this: awareness of cognitive biases doesn’t eliminate them. You need structural interventions, not willpower.

Reference class forecasting

One structural intervention is reference class forecasting, developed by Bent Flyvbjerg based on Kahneman and Tversky’s work on the outside view.

The idea is simple: instead of estimating from the inside (thinking about the specific task), estimate from the outside (looking at how similar tasks have actually performed). To forecast how long your project will take, find a reference class of similar projects and use their actual durations as your baseline.

Flyvbjerg’s research on large infrastructure projects (bridges, tunnels, railways) found that cost overruns were the norm, not the exception: 90% of projects exceeded their budgets, with average cost overruns of roughly 20% for roads, 45% for rail, and 34% for bridges and tunnels. In Australia, the Sydney Opera House – estimated at $7 million in 1957, completed for $102 million in 1973 – remains a salutary example. These weren’t bad estimates by bad estimators. They were the predictable result of the inside view applied to complex, uncertain endeavours.

Software is no different. The Standish Group’s CHAOS Report has been tracking software project outcomes for decades. Their findings consistently show that the majority of software projects exceed their budgets and timelines, with large projects faring worse than small ones.

Reference class forecasting says: if you want to know how long your project will take, don’t think about your project. Look at projects like yours and see how long they took. The outside view isn’t as satisfying – it doesn’t feel like you’re engaging with the specifics – but it’s consistently more accurate.
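
As a minimal sketch of the mechanics (all numbers here are invented for illustration), the outside view can be as simple as scaling your inside-view estimate by the distribution of actual-to-estimated ratios from a reference class of past projects:

```python
from statistics import median

# Hypothetical reference class: ratios of actual to estimated duration for
# past projects judged similar to this one (invented for illustration).
reference_ratios = [1.0, 1.1, 1.3, 1.5, 1.7, 1.8, 2.0, 2.4]

def outside_view(inside_estimate_weeks: float, ratios: list[float]) -> tuple[float, float]:
    """Scale an inside-view estimate by the observed overrun distribution."""
    outcomes = sorted(r * inside_estimate_weeks for r in ratios)
    p85 = outcomes[min(len(outcomes) - 1, int(0.85 * len(outcomes)))]
    return median(outcomes), p85

mid, p85 = outside_view(6, reference_ratios)
print(f"median: {mid:.1f} weeks, 85th percentile: {p85:.1f} weeks")
# median: 9.6 weeks, 85th percentile: 12.0 weeks
```

The inside-view figure of six weeks is still an input; the reference class just stops it from being the answer.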

Story points: the rise and fall

In the early 2000s, the agile movement popularised story points as an alternative to estimating in hours or days. The idea, often attributed to Ron Jeffries and the early Extreme Programming community, was to separate the size of work from the duration of work.

A story point is a relative measure of effort, complexity, and uncertainty. A simple task might be 1 point. A moderately complex task might be 3 points. A large, uncertain task might be 8 points. The scale is typically the Fibonacci sequence (1, 2, 3, 5, 8, 13) or powers of 2, deliberately using non-linear scales to acknowledge that larger tasks are harder to estimate precisely.

The team estimates stories in points, tracks how many points they complete per sprint (their velocity), and uses velocity to project how long the remaining work will take. If the team averages 20 points per sprint and there are 60 points of work remaining, that’s roughly three sprints.
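
The arithmetic is trivial; a quick sketch with invented velocity numbers:

```python
# Hypothetical sprint history: story points completed per sprint.
velocity_history = [22, 18, 21, 19, 20]
remaining_points = 60

average_velocity = sum(velocity_history) / len(velocity_history)  # 20.0 points/sprint
print(f"~{remaining_points / average_velocity:.0f} sprints remaining")  # ~3 sprints remaining
```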

In theory, this is elegant. In practice, story points have created a remarkable amount of dysfunction.

The first problem is gaming. When velocity becomes a metric that managers track, teams learn to inflate their point estimates. A task that was 3 points last quarter is now 5 points. Velocity goes up. Everyone is happy. Nothing has actually changed.

The second problem is false precision. A story estimated at 5 points implies a level of understanding that often doesn’t exist. The team spends twenty minutes debating whether something is a 5 or an 8, when the honest answer is “somewhere between 3 and 13, and we won’t know until we start.” The Fibonacci scale was supposed to prevent false precision, but human nature reasserts itself.

The third problem is comparison. Velocity is supposed to be team-specific – 20 points for Team A means something completely different from 20 points for Team B. But managers inevitably compare. “Why is Team B’s velocity only 15 when Team A does 25?” Because they’re different teams working on different things with different point scales, but that answer never quite satisfies.

The fourth problem is that story points don’t answer the question people actually care about. Nobody outside the development team wants to know the velocity. They want to know: When will it be done? Converting points to dates requires assumptions about future velocity, which are exactly the same assumptions you’d make without story points.

Mike Cohn, who literally wrote the book on agile estimation (Agile Estimating and Planning, 2005), has been increasingly candid about story points’ limitations. In recent writing, he’s acknowledged that many teams would be better served by simply counting stories and tracking throughput.

Throughput-based forecasting: counting what finishes

The #NoEstimates movement, advocated by Woody Zuill, Vasco Duarte, and others, argues that estimation effort is largely wasted and that teams should focus on throughput: how many items finish per unit of time.

The approach is straightforward:

  1. Break work into items of roughly similar size (stories, tasks, whatever you call them)
  2. Track how many items the team completes per sprint (or per week)
  3. Count the remaining items
  4. Divide

If the team finishes 8 stories per sprint and there are 40 stories left, that’s about 5 sprints. No estimation session required. No pointing poker. No debates about whether something is a 5 or an 8.

The key insight is that if work items are roughly similar in size – not identical, just roughly similar – the variance averages out over time. A team that finishes 6 items one sprint and 10 the next will average 8. The average is a better predictor than any individual estimate.
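
A minimal sketch of the same arithmetic using counts instead of points (throughput numbers invented for illustration); note that the step of assigning points has disappeared entirely:

```python
# Hypothetical sprint history: items finished per sprint, no sizing involved.
throughput_history = [6, 10, 7, 9, 8]
remaining_items = 40

average = sum(throughput_history) / len(throughput_history)  # 8.0 items/sprint
best, worst = max(throughput_history), min(throughput_history)

print(f"average: ~{remaining_items / average:.0f} sprints")  # average: ~5 sprints
print(f"range: {remaining_items / best:.0f} to {remaining_items / worst:.0f} sprints")  # range: 4 to 7 sprints
```

The range derived from the best and worst sprints is crude; the Monte Carlo approach described below turns it into a proper probability distribution.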

This is essentially what Charlotte introduced to the GreenBox team. In the planning onion work, the board-level forecasts were based on throughput data, not estimates – “about forty stories remaining, at eight stories per sprint.” During the pitch preparation, Charlotte answered the board’s timeline questions with throughput ranges rather than point-based projections. The approach worked because it was honest about uncertainty and grounded in what the team had actually delivered, not what they hoped to deliver.

There’s a legitimate objection: what if work items aren’t similarly sized? Some stories genuinely are much larger than others. The response is: then break them down. If a story is three times the size of a typical story, split it into three stories. The goal isn’t to pretend all work is equal – it’s to normalise the unit of measurement so that counting becomes meaningful.

Monte Carlo simulation

If throughput gives you a point estimate (“about 5 sprints”), Monte Carlo simulation gives you a probability distribution (“there’s an 85% chance it’ll be done within 7 sprints”).

The technique is named after the Monte Carlo Casino, because it involves running thousands of random simulations. Here’s how it works for software delivery forecasting:

  1. Collect your historical throughput data: the number of items completed in each of the last N sprints (or weeks)
  2. For each simulation run, randomly sample from that historical data to project future sprints. If you need to forecast 40 items, randomly pick a throughput value from your history for sprint 1, another for sprint 2, and so on, until the cumulative total reaches 40
  3. Record how many sprints that simulation took
  4. Repeat 10,000 times
  5. The distribution of results tells you the probability of finishing by each date

The beauty of Monte Carlo is that it naturally captures variability. If your team’s throughput is inconsistent (some sprints are 4, some are 12), the simulation reflects that – the probability distribution will be wider. If your throughput is consistent (always 7-9), the distribution will be tight.
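
Here’s a minimal sketch of that loop (the throughput history is invented; dedicated tools such as Troy Magennis’s Focused Objective spreadsheets do the same thing with far more polish):

```python
import random

# Hypothetical history: items finished in each of the last 12 sprints.
history = [7, 4, 9, 8, 6, 12, 7, 5, 8, 10, 6, 9]
remaining_items = 40
RUNS = 10_000

results = []
for _ in range(RUNS):
    done = sprints = 0
    while done < remaining_items:
        done += random.choice(history)   # sample a past sprint's throughput
        sprints += 1
    results.append(sprints)

results.sort()
for confidence in (0.50, 0.70, 0.85, 0.95):
    # The sprint count that this fraction of simulated futures finished within.
    print(f"{confidence:.0%}: {results[int(confidence * RUNS) - 1]} sprints")
```

The output has the same shape as the table below: a sprint count for each confidence level rather than a single date.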

Here’s what a typical result might look like:

Confidence level | Sprints needed | Completion date (fortnightly sprints)
50% | 5 | Mid-September
70% | 6 | Mid-October
85% | 7 | Mid-November
95% | 9 | Mid-January

The conversation changes from “it’ll take 5 sprints” to “there’s an 85% chance we’ll be done within 7 sprints.” This is dramatically more useful, because it gives stakeholders a way to make risk-informed decisions. If the deadline is mid-October, you have about a 70% chance of making it. Is that acceptable? That’s a business decision, not an engineering one.

Troy Magennis has done extensive work on applying Monte Carlo methods to software delivery forecasting, and his Focused Objective tools demonstrate the approach in practice. Daniel Vacanti’s Actionable Agile Metrics for Predictability provides the theoretical underpinning.

Estimates vs commitments

One of the most corrosive dynamics in software development is the conflation of estimates and commitments.

An estimate is a prediction: “Based on what we know, this will probably take 4-6 weeks.” It’s probabilistic, uncertain, and subject to revision as new information emerges.

A commitment is a promise: “We will deliver by March 15.” It’s binary – you either meet it or you don’t.

In healthy organisations, the flow is: engineers produce estimates, product managers assess the risk, and leadership decides which commitments to make, accepting the associated risk. An estimate of “4-6 weeks” might lead to a commitment of “we’ll have it by 8 weeks from now,” leaving buffer for the uncertainty.

In unhealthy organisations, the flow is: leadership asks for an estimate, treats the most optimistic end as a commitment, communicates it to customers, and then holds the engineering team accountable when reality intrudes. The estimate of “4-6 weeks” becomes a commitment of “4 weeks.” When it takes 7, the team has “failed,” even though 7 weeks was well within the original estimate range.

The GreenBox team navigated this tension explicitly. In the early sprints, Lee drew a distinction between what the team estimated they could deliver and what they committed to for the funding deadline. Charlotte reinforced this during the board room conversations, presenting throughput ranges rather than commitments for future quarters. The board learned to ask “what’s the 85th percentile?” instead of “when will it be done?” – and that shift in framing made the conversation productive rather than adversarial.

Why estimation fails: a summary

Estimation fails for reasons that are structural, not personal:

  1. The planning fallacy causes individuals to underestimate because they reason from the inside view
  2. The cone of uncertainty means early estimates are inherently imprecise because the information needed for accuracy doesn’t exist yet
  3. Scope creep is not an aberration – it’s the normal process of discovery during implementation. Requirements change because understanding deepens
  4. Dependencies are rarely fully understood at estimation time. The task that was estimated at 3 days requires a library upgrade that takes 2 days, which breaks a test suite that takes 1 day to fix, which reveals a bug that takes 3 days to diagnose
  5. Interruptions and context switching are systematically excluded from estimates because they’re unpredictable, but they’re a predictable fraction of every developer’s time
  6. Anchoring means that once a number is spoken, it becomes the reference point. Even if you said “it’s a rough guess,” the number 6 weeks anchors all subsequent thinking around 6 weeks

What actually works

If estimation is so problematic, what should teams do instead?

Track throughput. Measure what your team actually delivers, sprint over sprint, week over week. This is your empirical reality. It already accounts for all the things estimates miss – interruptions, scope creep, dependency surprises, sick days, public holidays, and the three hours someone spent helping a colleague with an unrelated problem.

Break work down. Smaller items are easier to estimate, but more importantly, they flow through the system faster and their variability averages out. If your backlog is full of stories that vary between 1 day and 3 months, throughput-based forecasting won’t work. If they vary between 1 day and 5 days, it works well.

Use Monte Carlo. Feed your throughput history into a simulation and present results as probability distributions. “85% chance by October” is more honest and more useful than “it’ll be done in September.”

Separate estimates from commitments. Make it safe for engineers to give honest estimates by not treating those estimates as promises. Add buffer at the organisational level, not by asking engineers to pad their estimates (which they’ll do inconsistently and which erodes trust).

Shorten the feedback loop. The most reliable forecast is “what will we ship this sprint?” Two weeks of work is much easier to predict than six months. If you need a six-month forecast, use Monte Carlo. If you need a two-week forecast, just look at the board.

Accept uncertainty as a feature, not a bug. The cone of uncertainty isn’t a problem to solve. It’s information about the nature of the work. Early in a project, uncertainty is high because you haven’t learned enough yet. That’s normal. Communicate it clearly, make decisions based on ranges, and let the cone narrow as work progresses.

The uncomfortable truth

Estimation in software is hard not because developers are bad at it, but because software development is a process of discovery. You learn what needs to be built by building it. You discover the edge cases by implementing the happy path. You find the dependency problems by integrating. Each discovery changes the estimate.

The industry has spent decades trying to make estimation more accurate. A better use of that energy is to make estimation less necessary – by shortening delivery cycles, breaking work into smaller pieces, and making decisions based on empirical throughput data rather than predictions about the future.

As Hofstadter told us in 1979: it always takes longer than you expect. The question isn’t how to estimate better. The question is how to build systems, teams, and organisations that can deliver value despite irreducible uncertainty.