Directional, Not Definitive

February 05, 2027 · 11 min read

An A/B test with three hundred visitors per arm is not the same instrument as an A/B test with thirty thousand. Both produce a table at the end. Both let you say “the test ran for two weeks and the number moved.” Only one of them lets you say the number moved because of what you changed. Most teams running tests at the smaller scale either pretend they’re running them at the larger scale or they don’t run them at all. There’s a more honest middle path.

The honest math

Power, in plain English, is the chance your experiment will detect a real effect if there is one. A test with 80% power has a one-in-five chance of missing a real effect; with 50% power, a coin-flip. Tests without explicit power calculations almost always run far below 50%.

The arithmetic is unforgiving at common startup scales. Suppose your baseline conversion rate is 7% and you want to detect a 10% relative lift (so 7% becoming 7.7%) with 80% power at a 5% significance threshold. The standard two-proportion formula puts you at roughly twenty-two thousand visitors per arm. To detect a 5% relative lift, you need closer to eighty-five thousand per arm. Most early-stage teams running website tests get three hundred to a thousand per arm on a good fortnight.
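To make that arithmetic checkable, here is a minimal sketch of the usual normal-approximation sample-size formula for comparing two conversion rates. The function name and defaults are illustrative, not something from a particular library, and other textbook variants of the formula give slightly different numbers.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors per arm for a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)       # summed per-visitor variances
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


n_10 = sample_size_per_arm(0.07, 0.10)  # 7% -> 7.7%
n_5 = sample_size_per_arm(0.07, 0.05)   # 7% -> 7.35%
print(n_10, n_5)
```

For the 7% baseline used here, this comes out near 22,000 per arm for a 10% relative lift and near 85,000 per arm for a 5% lift. Halving the effect you want to see roughly quadruples the traffic you need.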

This isn’t an exotic edge case. It’s the rule. Pricing tests, signup-flow tests, copy tweaks, button colours: almost all of them target effects in the single-digit percentage range, at baseline conversion rates in the low single digits, on visitor populations measured in the hundreds. The arithmetic is brutal: if you are running a test like this at this scale, it is almost certainly underpowered.

The corollary is the part that hurts. When real effects are small and power is low, a positive result from an underpowered test can easily be more likely to be noise than signal. The smaller your sample, the larger the effect needs to be before you can distinguish it from chance, and at small samples, “large enough to be distinguishable” usually means an effect so big you’d have seen it without measuring.

What goes wrong without it

A team running an underpowered test gets a number at the end. Variant A is at 6.4%; variant B at 7.1%. The number moved. The instinct is to ship the winner. This is the most common failure mode in product analytics, and it is exactly as wrong as it sounds.

The first thing that goes wrong is that random variation gets dressed up as truth. If you ran the same test in a different fortnight, with different visitors arriving in a different order, the numbers would shift, often enough to flip which variant looked like the winner. Acting on the result is the same gamble as acting on no data at all, except the chart on the wall makes it feel rigorous.
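A quick way to see how little a 6.4% versus 7.1% split says is to run it through a two-proportion z-test. The sketch below assumes 1,000 visitors per arm, which is a hypothetical figure chosen from the top of the range discussed earlier; the text does not give a sample size for this example.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a pooled two-proportion z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical counts: 64/1000 (6.4%) versus 71/1000 (7.1%).
p_value = two_proportion_p_value(64, 1000, 71, 1000)
print(f"p = {p_value:.2f}")
```

Under those assumed counts the p-value is a little above 0.5. A gap at least this large would turn up in roughly half of all tests where the two variants were in fact identical.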

The second thing is the trap of peeking. The test was supposed to run for two weeks. After three days, variant B looks better. Someone calls the test early. The bias here is systematic and well-documented: tests that are stopped the first time they show a positive result are wrong about the result far more often than tests that run to a fixed sample size. The dashboard creates an itch you cannot scratch without invalidating your own data.
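The peeking bias is easy to reproduce in simulation. The sketch below runs A/A tests (both arms share the same true 7% rate, so any "win" is a false positive) and compares two policies: stop at the first daily look that crosses the 5% threshold, or look only once at the end. The traffic figures, run count, and seed are all arbitrary assumptions.

```python
import random
from math import sqrt

Z_CRIT = 1.96  # two-sided 5% threshold


def z_stat(conv_a, conv_b, n):
    """Pooled two-proportion z statistic for two arms of n visitors each."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled in (0.0, 1.0):
        return 0.0  # no variation yet, so nothing to test
    return (conv_b - conv_a) / n / sqrt(pooled * (1 - pooled) * 2 / n)


def false_positive_rates(runs=1000, days=14, daily=50, rate=0.07, seed=1):
    """Simulate A/A tests; return (peeking rate, fixed-horizon rate)."""
    rng = random.Random(seed)
    peeking_hits = fixed_hits = 0
    for _ in range(runs):
        conv_a = conv_b = 0
        crossed = False
        for day in range(1, days + 1):
            conv_a += sum(rng.random() < rate for _ in range(daily))
            conv_b += sum(rng.random() < rate for _ in range(daily))
            significant = abs(z_stat(conv_a, conv_b, day * daily)) > Z_CRIT
            crossed = crossed or significant  # would have stopped at this peek
        peeking_hits += crossed
        fixed_hits += significant  # only the final look counts
    return peeking_hits / runs, fixed_hits / runs


peeking, fixed = false_positive_rates()
print(f"peeking: {peeking:.1%}  fixed horizon: {fixed:.1%}")
```

The fixed-horizon rate should sit near the nominal 5%, while the stop-at-first-significance rate should come out noticeably higher, even though nothing differs between the arms.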

The third thing is the long tail of decisions you don’t realise are downstream of the test. Once a team has “validated” a pricing change with three hundred visitors per arm, that pricing becomes the foundation for revenue forecasts, marketing spend, hiring plans, and runway calculations. The whole structure is built on a result that was within the noise. When it later doesn’t replicate at scale, the team retrofits an explanation rather than confronting that the original test never proved what they thought.

When acting on directional evidence is honest anyway

None of this means small teams should stop running tests. It means they should be honest about what the tests are telling them.

There’s a defensible case for acting on an underpowered result. It hinges on four things: the decision is easy to reverse, the expected effect is large relative to the cost of being wrong, the alternative is paralysis, and you wrote down before the test started what you’d do with each outcome.

A pricing change is a useful illustration. Charging $30 instead of $25 for a small box is, in most early-stage businesses, trivially reversible. If the new price hurts conversion at scale, you change it back, honour the old price for the people who paid it, and apologise to nobody because nobody saw the difference. The downside of being wrong is small. The downside of being wrong about whether to test at all is much larger, because untested prices compound into structural revenue gaps that go invisible for years.

The honest framing isn’t “the test showed $30 is better.” It’s “the test showed a small directional lean toward $30. We’re acting on the lean because the cost of being wrong is small and the cost of doing nothing is non-zero. We owe ourselves a properly powered look at this once we have the volume.”

This is the case where directional evidence beats no evidence. The mistake is using the same language (“we tested it, the data showed…”) for both the directional case and the powered case. They are not the same thing, and the people downstream of you (investors, board members, future hires) deserve to know which kind of evidence is underneath your decisions.

Techniques that help small samples

If you can’t escape the small-sample regime (and most teams can’t, for years at a stretch), the question becomes: what techniques squeeze more honest information out of the data you can collect?

Pre-register the decision rule. The single highest-value habit. Before any data exists, write down the call you’ll make for each outcome. “If the 95% confidence interval on the difference in conversion (test arm minus control) sits entirely above zero, we ship it. If the interval crosses zero but the point estimate is positive, we act on a directional lean and revisit in six months. If the point estimate is negative, we keep the control.” Pre-registration removes the entire class of failure where the team retrofits a decision rule to whatever result they got. If you wrote the rule first, you cannot p-hack your way past it.
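One way to make a pre-registered rule hard to wriggle out of is to commit it as code before the test goes live. The sketch below encodes the three-way rule quoted above using a simple Wald interval on the difference; the function name, the return strings, and the choice to treat an exactly zero difference as "keep control" are illustrative assumptions.

```python
from math import sqrt


def preregistered_decision(conv_control, n_control, conv_test, n_test, z=1.96):
    """Apply a decision rule that was written down before the test started."""
    p_c = conv_control / n_control
    p_t = conv_test / n_test
    diff = p_t - p_c
    se = sqrt(p_c * (1 - p_c) / n_control + p_t * (1 - p_t) / n_test)
    low = diff - z * se  # lower end of the 95% interval on the difference
    if low > 0:
        return "ship"
    if diff > 0:
        return "act on directional lean, revisit in six months"
    return "keep control"  # also covers an exactly zero difference


# Hypothetical counts: 64/1000 in control, 71/1000 in the test arm.
print(preregistered_decision(64, 1000, 71, 1000))
```

With those hypothetical counts the rule returns the directional-lean outcome. The point is that nobody gets to renegotiate the thresholds after seeing the numbers.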

Target the minimum detectable effect, not the effect you hope for. Before running the test, compute the smallest effect your sample size can reliably distinguish from noise, given your baseline rate and acceptable error. If your traffic only lets you detect a 25% relative lift, run tests where you reasonably expect a 25% or larger effect. Don’t waste two weeks trying to detect a 5% lift you have no instrument to see. The discipline isn’t to run every test; it’s to run the tests you can actually answer.
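The minimum detectable effect can be computed by turning the power calculation around: fix the sample size and search for the smallest lift that reaches the target power. This is a rough sketch using the normal approximation and a bisection search; the names are hypothetical, and it ignores the negligible chance of a significant result in the wrong direction.

```python
from math import sqrt
from statistics import NormalDist


def power_for_lift(baseline, relative_lift, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test."""
    p1, p2 = baseline, baseline * (1 + relative_lift)
    se = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_arm)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return 1 - NormalDist().cdf(z_alpha - (p2 - p1) / se)


def minimum_detectable_lift(baseline, n_per_arm, power=0.80, alpha=0.05):
    """Smallest relative lift that reaches the requested power (bisection)."""
    low, high = 0.0, (1 - baseline) / baseline  # upper bound keeps p2 at or below 1
    for _ in range(60):
        mid = (low + high) / 2
        if power_for_lift(baseline, mid, n_per_arm, alpha) < power:
            low = mid
        else:
            high = mid
    return high


mde_500 = minimum_detectable_lift(0.07, 500)
print(f"MDE at 500 per arm: {mde_500:.0%} relative lift")
```

At a 7% baseline with 500 visitors per arm, the detectable lift under these assumptions is roughly three quarters of the baseline rate. That is the scale of change a test at this traffic level can reasonably be asked to see.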

Confidence intervals beat point estimates. “Variant B converts at 7.1%” is a confident-sounding lie. “Variant B converts at 7.1%, 95% CI [5.4%, 8.8%]” is honest. The interval communicates the actual precision of your estimate. When two intervals overlap by half their width, you do not have a winner. The sharper check is an interval on the difference between the arms; if that interval straddles zero, the data cannot separate them. Train your team to think and present in intervals; the alternative is a long history of confidently-asserted point estimates that don’t replicate.
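Computing these intervals takes a few lines. The sketch below uses the simple Wald (normal-approximation) interval, which is an assumption made for brevity; it behaves poorly with very few conversions, where a Wilson interval is the usual recommendation. The counts are hypothetical.

```python
from math import sqrt


def wald_interval(conversions, visitors, z=1.96):
    """95% Wald interval for a single conversion rate."""
    p = conversions / visitors
    half = z * sqrt(p * (1 - p) / visitors)
    return p - half, p + half


def difference_interval(conv_a, n_a, conv_b, n_b, z=1.96):
    """95% Wald interval for the difference in rates (B minus A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    half = z * sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return (p_b - p_a) - half, (p_b - p_a) + half


# Hypothetical counts: A = 64/1000, B = 71/1000.
b_low, b_high = wald_interval(71, 1000)
d_low, d_high = difference_interval(64, 1000, 71, 1000)
print(f"B: [{b_low:.1%}, {b_high:.1%}]  B - A: [{d_low:+.1%}, {d_high:+.1%}]")
```

With 1,000 visitors per arm, the interval on the difference runs from about minus 1.5 to plus 2.9 percentage points. That is a much less exciting sentence than "B won", and a much more accurate one.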

Bring priors when you have them. A Bayesian framing treats your existing knowledge as part of the analysis rather than pretending each test starts from zero. If you have run three previous price-sensitivity tests and they all leaned in the same direction, your prior on the next one isn’t flat: you have evidence already, and rolling it in tightens your posterior estimate even when the current sample is thin. This isn’t statistical sleight-of-hand; it’s the correct way to combine information across small experiments.
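For conversion rates, the simplest version of this is a Beta prior updated with binomial data. The sketch below is illustrative only: the prior parameters stand in for down-weighted evidence from earlier tests and are invented for the example, as are the counts. Carrying a prior over only makes sense if the earlier tests measured a comparable change on a comparable audience.

```python
import random


def posterior_params(prior_a, prior_b, conversions, visitors):
    """Beta(prior_a, prior_b) prior plus binomial data gives a Beta posterior."""
    return prior_a + conversions, prior_b + visitors - conversions


def prob_b_beats_a(post_a, post_b, draws=20000, seed=7):
    """Monte Carlo estimate of P(rate_B > rate_A) under the two posteriors."""
    rng = random.Random(seed)
    wins = sum(
        rng.betavariate(*post_b) > rng.betavariate(*post_a)
        for _ in range(draws)
    )
    return wins / draws


# Hypothetical current test: A = 64/1000, B = 71/1000.
flat = prob_b_beats_a(
    posterior_params(1, 1, 64, 1000),
    posterior_params(1, 1, 71, 1000),
)
# Hypothetical informative priors, worth about 300 visitors per arm of
# earlier evidence that already leaned toward B.
informed = prob_b_beats_a(
    posterior_params(21, 280, 64, 1000),
    posterior_params(24, 277, 71, 1000),
)
print(f"flat prior: {flat:.2f}  informed prior: {informed:.2f}")
```

Under these made-up numbers the informed posterior leans more firmly toward B than the flat-prior one. The prior should be written down before the test, for the same reason the decision rule is.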

Use sequential testing instead of fixed-horizon plus peeking. If you genuinely need to stop the test early when the data is overwhelming, use a method designed for that: alpha-spending boundaries, mSPRT (the mixture sequential probability ratio test), or a Bayesian posterior threshold. These are designed to preserve the integrity of the test under repeated looks. The naive approach (peek every day, stop when significant) is exactly the failure mode the formal methods prevent. The trade-off is that sequential methods need a somewhat larger maximum sample for the same power; the payoff is that you can stop honestly when the data is decisive without invalidating everything.
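To give a feel for how alpha spending works, the sketch below evaluates one common spending function, the O'Brien-Fleming-type form usually attributed to Lan and DeMets. It shows only the cumulative false-positive budget released at each look. Turning that budget into actual stopping boundaries needs numerical integration, which is best left to an established statistics package, so treat this as an illustration rather than a usable stopping rule.

```python
from statistics import NormalDist


def obrien_fleming_alpha_spent(information_fraction, alpha=0.05):
    """Cumulative two-sided alpha spent once this fraction of the planned
    sample has been collected (O'Brien-Fleming-type spending function)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 * (1 - NormalDist().cdf(z / information_fraction ** 0.5))


# Four equally spaced looks at 25%, 50%, 75%, and 100% of the planned sample.
looks = [0.25, 0.50, 0.75, 1.00]
spent = [obrien_fleming_alpha_spent(t) for t in looks]
for t, a in zip(looks, spent):
    print(f"{t:.0%} of sample: cumulative alpha {a:.5f}")
```

Almost none of the 5% budget is available at the first look, and the full amount only arrives at the planned end. That is why an early stop under this scheme requires overwhelming evidence, which is the behaviour naive peeking lacks.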

Triangulate with qualitative. Five customer interviews are not a substitute for statistical power, but they are a different kind of evidence that combines well with weak quantitative data. If the test arm wins by a thin margin and five out of five interviewed test-arm subscribers say the new price felt fair, the combined signal is more credible than either alone. The interviews don’t replace the test; they corroborate it (or contradict it, which is also useful information).

Pool cohorts and replicate. A single underpowered test is brittle. The same test run three times across non-overlapping cohorts is, in effect, a small meta-analysis. If the directional lean replicates across three independent samples, that is meaningful evidence even if no single run is significant on its own. Build the pipeline so you can rerun the same experiment cleanly. Accumulating sample across reruns is the most direct way to grow your statistical reach without growing your traffic.
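The simplest way to combine replicated runs is a fixed-effect, inverse-variance pooled estimate of the difference. The sketch below assumes the cohorts are independent, ran the same change, and each arm has at least one conversion and one non-conversion; the cohort counts are hypothetical.

```python
from math import sqrt


def pooled_difference(cohorts, z=1.96):
    """Fixed-effect pooled difference in conversion rate (test minus control).

    cohorts: list of (conv_control, n_control, conv_test, n_test) tuples.
    Returns (estimate, lower, upper) for a 95% interval by default.
    """
    weighted_sum = total_weight = 0.0
    for conv_c, n_c, conv_t, n_t in cohorts:
        p_c, p_t = conv_c / n_c, conv_t / n_t
        variance = p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t
        weight = 1 / variance  # more precise cohorts count for more
        weighted_sum += weight * (p_t - p_c)
        total_weight += weight
    estimate = weighted_sum / total_weight
    se = sqrt(1 / total_weight)
    return estimate, estimate - z * se, estimate + z * se


# Three hypothetical non-overlapping cohorts of the same experiment.
cohorts = [(64, 1000, 71, 1000), (60, 900, 70, 900), (66, 1100, 80, 1100)]
single = pooled_difference(cohorts[:1])
combined = pooled_difference(cohorts)
print(f"single run: [{single[1]:+.2%}, {single[2]:+.2%}]")
print(f"pooled:     [{combined[1]:+.2%}, {combined[2]:+.2%}]")
```

The pooled interval is always narrower than any single cohort's interval under this model. Whether it clears zero is a separate question that only the data can answer.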

Pick tests with big expected effects. This is the strategic version of MDE targeting. When your traffic is the binding constraint, run the tests where the effect is likely to be huge: removing free delivery, adding a major new tier, changing the funnel entry point. Save the subtle copy-and-colour tests for when you have the volume to detect subtle changes. There is no shame in deciding the small tests are not worth running. There is significant cost to running them and pretending they answered the question.

The honest report

The way you write up an experiment shapes how the next person uses the result. Language that overstates the strength of the evidence creates a fossil record of false certainty that future decisions get built on.

Useful habits: name the sample size up front. Quote intervals, not points. Use phrases like “directional lean”, “consistent with”, “did not distinguish from chance” instead of “showed”, “proved”, “validated”. When you acted on directional evidence, say so explicitly (“we acted on the lean given the reversibility of the change and the cost of paralysis”) so anyone reading later knows what kind of result they’re inheriting.

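If the test was underpowered, write that down. Quantify it: “this experiment had 38% power to detect the lift we designed it around; rerunning at three times the sample would resolve the ambiguity.” Compute the power against the effect size you planned for, not the one you happened to observe; power calculated from the observed effect just restates the p-value. This is the single most useful sentence to leave behind, because it tells the next person whether to act on the result, repeat the experiment, or move on.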
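The power figure for the writeup can come from a short calculation. This sketch uses the same normal approximation as a standard two-proportion test and takes the target lift as an input decided before the test; the function name and the example inputs are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist


def achieved_power(baseline, target_relative_lift, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test against a
    pre-specified target lift, given the sample actually collected."""
    p1 = baseline
    p2 = baseline * (1 + target_relative_lift)
    se = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_arm)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    shift = (p2 - p1) / se
    nd = NormalDist()
    # Chance of landing beyond either critical value when the lift is real.
    return (1 - nd.cdf(z_alpha - shift)) + nd.cdf(-z_alpha - shift)


# Hypothetical: 7% baseline, 10% target relative lift, 1,000 visitors per arm.
power_1000 = achieved_power(0.07, 0.10, 1000)
print(f"power: {power_1000:.0%}")
```

With those inputs the power is under 10%, meaning a real 10% lift would go undetected more than nine times out of ten. A sentence like that in the report tells the next reader exactly how much weight the result can bear.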

When not to run the test at all

There is one more decision worth making explicit: sometimes the right answer is not to run the test. If your traffic guarantees the result will be inside the noise, and the change is cheap to make and cheap to reverse, run it as a launch and measure the aftermath instead. The information cost of a deliberately-skipped test is sometimes lower than the information cost of a test that produces a misleading number.

This isn’t permission to skip every test. It is permission to skip the tests that you know in advance cannot produce a meaningful answer. The discipline is honesty about your instrument’s resolution: if the test won’t tell you anything you can trust, don’t run it and pretend it did.

A worked example

For a specific case of an early-stage team running on directional evidence, and being explicit about it, see how a two-hundred-subscriber team handled a small-box pricing test. The experiment was underpowered. The team acted on the directional lean anyway, with a written acknowledgement that the result was directional rather than proven, and a commitment to revisit once volume allowed.

That isn’t best practice. It’s the realistic version of best practice for a team that doesn’t have the option of waiting six months for traffic. Don’t do it that way unless you can’t help it.

What’s good about that example: the team was explicit about the limits of the evidence. What’s worth adding when you find yourself in the same place: most of the techniques in the previous section. Confidence intervals on the table, not just point estimates. A pre-registered decision rule written before the test goes live. Cohort replication when the same question gets revisited. A note in the writeup quantifying the power. Each of these turns a “we acted on a lean” decision into a slightly more defensible one without requiring traffic the team doesn’t have. None of them rescue an underpowered test from being underpowered. They make it harder to mistake an underpowered test for a powered one, which is the failure that actually causes damage.

The principle

A small sample size is not a moral failing. Pretending you have a large one is. The honest path is to run tests at the scale you have, choose techniques that fit that scale, write up the results in language that matches the strength of the evidence, and revisit the question when you can.

Directional, not definitive. That’s the discipline. The teams that get hard pricing and product calls right at small scale aren’t the ones who pretend their tests are bigger than they are. They’re the ones who say, out loud, what their tests can and cannot do, and then act on the evidence they have, with their eyes open.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.