Pricing Experiments: The Right Box at the Right Price

Lee is at Maya’s kitchen table with a coffee and the canvas printed on A3. He’s drawn a circle around the Revenue Streams box and written three words next to it: “untested, but working.”

“The prices are working,” Maya says. “People are paying them. Nobody’s complained.”

“Nobody who signed up has complained. You don’t know anything about the people who looked at the site, saw $25, and closed the tab. You don’t know whether you could charge $30 and still get the same conversion. You don’t know whether the small box is mispriced relative to the large box. You’ve got one data point, the current price, and you’ve decided it’s right because people are buying.”

Maya frowns. She’d been congratulating herself, a little, that the pricing felt settled. It was one less thing to worry about.

“So what are you suggesting?”

“Pricing experiments. Not once. As a habit. You should be testing your pricing the way you test your features.”

Why pricing feels different

Pricing is one of the few decisions in a startup that feels genuinely scary to get wrong. A bad feature can be rolled back. A bad copy tweak can be un-tweaked. A bad price, once published, anchors every subscriber’s expectation of what the product costs. Raise it and people feel cheated. Lower it and the people who paid the old price feel stupid.

This is why most founders pick a number that feels right, ship it, and never touch it again. It’s not laziness. It’s loss-aversion. The downside of changing pricing feels enormous, and the upside feels uncertain.

Lee has seen this pattern many times. His view is that it’s the wrong way to think about it.

“The risk isn’t changing your price. The risk is being wrong about your price and not knowing. If you’re charging $25 for a small box that people would happily pay $30 for, you’re not being nice. You’re giving away five dollars per subscriber per week. At two hundred and ten subscribers, that’s $1,050 per week you’re leaving on the table. That’s a farm partner you could pay. That’s Sam’s hours. That’s runway.”

Maya does the maths in her head. The number is uncomfortably large.

What to test

The team sits down on a Wednesday afternoon to work out what they actually want to learn. Lee uses the same approach they’ve used for product discovery: write down the assumptions and figure out which ones are worth testing.

The assumptions on the wall, in Maya’s handwriting:

The small box at $25 and the large box at $45 are both priced correctly relative to cost.
The $20 gap between small and large reflects the actual value difference to subscribers.
Subscribers would not pay more than $25 for the small box.
A third, larger box option would not expand the market.
Weekly delivery is the only frequency that makes sense.
Free delivery is a deal-breaker if removed.

Lee reads down the list and then sets the marker down. “Which of these would you be most embarrassed to find out you’d been wrong about for six months?”

Maya doesn’t hesitate. “Three. If people would happily pay $30 for the small box, I’d feel sick.”

“Good. Let’s test three first.”

Designing the experiment

This is where it gets interesting. You can’t just change the price on the website and see what happens, because if the new price is wrong you’ve damaged the brand. You also can’t ask people “would you pay $30?” in a survey, because people lie about pricing in surveys, not deliberately, but because stated preference and revealed preference are different things.

Lee proposes a simple framework:

Test the price before the commitment. New visitors who haven’t yet chosen a box see a landing page with the price. Half see $25. Half see $30. Everybody who clicks through can choose to continue at the price they saw. Measure two things: the click-through rate (do fewer people continue at $30?) and the conversion rate (of the people who continue, how many actually subscribe?).

Be honest about what you’re doing. If a subscriber asks why they saw $30 and their friend saw $25, don’t pretend it was a glitch. Explain that you were testing pricing and wanted to understand what the right number was. Offer them whichever price they would have preferred.

Lee pauses there. “That’s the design. Now the part I can’t do for you. I’ve watched plenty of these experiments and I know what they cost when they’re set up wrong, but the actual sample-size maths, how big and how long and how confident, isn’t my world. I can tell you a pricing test is worth running. I can’t tell you whether your traffic will let you read the result. You need that worked out before you agree on a duration.”

He looks across the table. Priya is already on it.

“Give me a minute. I want to be sure we can actually read a result before we agree to run this.”

She pulls her notepad towards her.

“The thing we’re worried about with $30 is people seeing the higher number and not subscribing. So the question I have to size up is: how many visitors per arm before we’d reliably notice a drop in conversion at $30, if there’s one to notice? The maths is symmetric, the sample size comes out the same whether we frame the change we’re looking for as a fall or a lift, but the fall is what would actually hurt us, so that’s the version I’ll plug in.”

She writes a few lines.

“Before I pick a number for ‘how big a drop’, let me work out what would actually count as bad for us. At $25 with seven percent conversion we make $1.75 a visitor. We’re testing $30, so the question becomes: how far would conversion have to fall at $30 before the price hike is a loss instead of a win? Revenue per visitor matches when $30 × p equals $1.75, so p = $1.75 ÷ $30, five point eight three percent. Anything below that and $30 is making us less money than $25, not more. So the drop we’d care about catching is conversion falling from seven percent to about five point eight, roughly a seventeen percent relative drop. That’s what I’ll size for.

“Two arms, $25 and $30. Conversion seven percent today. Target drop pinned at break-even, seventeen percent relative. Standard settings for how cautious we’re being. The back-of-envelope number is roughly seven thousand visitors per arm, and at our traffic that’s the better part of a year.”

Tom holds up a hand. “Nearly a year? Where does seven thousand actually come from?”

Priya turns to a fresh page. “Four numbers we have to commit to before we run anything, not after. The baseline: seven percent, the fraction of visitors who subscribe today at $25. The shift we want to be able to detect: the break-even drop, seven percent falling to five point eight. And two dials for caution. How often are we willing to be fooled by chance into seeing an effect that isn’t there? Standard answer: one experiment in twenty. And if there really is an effect, how often do we insist on actually catching it? Standard answer: four times in five, which means even a perfectly designed experiment writes off a real difference as noise one time in five.”

Tom nods slowly. “So the dials are us deciding how cautious to be in each direction.”

“Exactly. Fix all four and a standard formula turns them into a sample size. The shape of the formula matters more than its insides: caution multiplied by noise, divided by the square of the gap you want to detect. That squaring is the killer. Halve the drop you’re trying to see and you don’t need twice the visitors, you need four times as many. Small differences look a lot like noise, and the maths charges accordingly. Plug our numbers in and it comes out just under seven thousand visitors per arm.”

She underlines the result. “Seven thousand per arm to spot the break-even drop. We get maybe three hundred visitors a week, a hundred and fifty into each arm. Forty-six weeks with both arms running side by side; call it a year. If we’d settle for spotting a much bigger collapse, twenty-five percent or so, we could call it in four or five months. A subtler ten percent drop puts us past two years. A five percent drop, well inside what a small price tweak might plausibly move, is the better part of a decade.”

Tom sits back. “So what does two weeks of data buy us?”

“Worth working out. Let me run the formula backwards. Two weeks gives us roughly three hundred visitors per arm. If I fix the caution and noise where they were and solve for the gap instead, what’s the smallest difference we’d reliably catch at n = 300?, the maths gives me about six percentage points. So a real conversion rate would have to drop from seven percent to about one percent before our two-week test would reliably notice. Anything subtler than that, we miss most of the time.”

She thinks for a second.

“More usefully: take the gap we actually care about, the break-even drop, seven down to five point eight, and ask how often we’d correctly call it. At three hundred per arm, the four-in-five catch rate we asked for collapses to about eight percent. So even if $30 genuinely tips us over to break-even, our two-week test would flag it only about one time in twelve. The other eleven times we’d shrug and write it off as noise.

“And the flip side is just as ugly. Even if $25 and $30 convert identically, we’ll see a gap that looks significant about one time in twenty. That’s the false-alarm rate we picked, and it doesn’t get kinder when the sample’s small. So a ‘significant’ result at this scale could be a real effect we got lucky enough to spot, or it could be pure chance. We can’t tell which from the data alone.”

She caps the pen.

“At our volume the test is statistically blind. Whatever number comes out, we can’t separate signal from noise. What we can do is read the direction the gap leans, and decide in advance whether that’s enough to act on.”

Maya looks at Lee. “So the experiment can’t really prove anything in any sane window.”

“Not at your volume. Which means the choice you have isn’t ‘run it until it’s statistically valid’. It’s ‘run it long enough to see the shape of the signal, then act on the direction’. You’ll be moving from one defensible price to another defensible price with a lean, not a proof. That’s the only kind of pricing decision you can make at this stage.”

Maya thinks about it. “If we’re wrong by five percent in either direction, we can adjust. We won’t be wrong by fifty percent.”

“That’s the right framing. And it tells you what the time limit is for.”

Set a time limit. Two weeks, then stop, not because two weeks will give you certainty (it won’t), but because the time box is what limits your exposure. A pricing experiment is a temporary act of price discrimination, and the longer it runs the more it corrodes trust. The time limit is containment, not measurement.

Know what you’ll do with each outcome. Before starting, write down what decision you’ll make if revenue per visitor goes up, goes down, or barely moves. “The data is inconclusive” needs to be one of the outcomes you’ve planned for, because at this volume it’s the most likely one. Decide in advance whether a directional lean is enough to act on. If it isn’t, don’t run the experiment yet; save it for when you’ve got the traffic.

Priya adds one more thing. “And we write a note in the wiki, ‘small box price, revisit when weekly visitors exceed five thousand’. Future us deserves to know we acted on a lean, not on a proof.”

Tom has a concern. “What if conversion drops by ten percent at $30? Does that mean $30 is the wrong price?”

Lee thinks about it. “Not necessarily. If conversion drops ten percent but revenue per converted subscriber goes up twenty percent, you’re still ahead. The question isn’t ‘does conversion drop’, it’s ‘does total revenue go up or down.’ And you have to weight that against the long-term effects on word-of-mouth, retention, and brand.”

Running the experiment

They run it for two weeks.

The setup is deliberately simple. Priya writes a small piece of code that assigns each new visitor to one of two groups at random and shows them the appropriate landing page. The price on the page is the price they’d pay if they subscribed. Nothing else about the site changes.

Over two weeks, 312 visitors see the $25 page. 298 visitors see the $30 page. The split is even enough to compare, and, as the team already knows going in, well short of what statistical confidence would require.

The results:

Small box pricing experiment: two weeks

Variant	Visitors	Clicked through	Subscribed	Revenue/week
$25 (control)	312	184 (59%)	22 (7.1%)	$550
$30 (test)	298	164 (55%)	19 (6.4%)	$570

Click-through drops from 59% to 55%. Conversion drops from 7.1% to 6.4%. But revenue per visitor goes up, because the people who subscribe at $30 are paying more than the people who subscribe at $25.

The total revenue from the $30 cohort is slightly higher than the $25 cohort, despite slightly fewer subscribers.

Reading the numbers

The team gathers round Priya’s laptop.

“This is roughly the shape we expected,” Priya says. “Twenty-two subscribers versus nineteen, out of about three hundred visitors each. The gap is well inside the noise, if we’d run the experiment in a different fortnight, those numbers could easily have flipped. The data isn’t conclusive. We knew going in it wouldn’t be.”

“Then what is it telling us?” Maya asks.

“Direction. Click-through is slightly lower at $30. Conversion is slightly lower. Revenue per visitor is slightly higher. That’s the shape you’d expect if $30 is closer to the right price than $25, and roughly the opposite of what you’d see if $30 were too high. It’s not proof. It’s a lean.”

Lee picks it up. “And the team agreed before we ran it that a lean is what we’d act on. The alternative was waiting years for a confidence we don’t actually need to make a five-dollar decision.”

The decision

Maya looks at the numbers again. “So we raise the price.”

“On a directional signal,” Lee says. “Not because the data proved anything, but because we said we’d act on direction and the direction is up. If it had pointed the other way we’d be having a much shorter meeting. The five percent more revenue per visitor isn’t huge, but the people who said yes at $30 are telling you they value the box at $30. The people who said no were probably never going to be great subscribers anyway. They would have subscribed for a month and cancelled.”

Maya pushes back anyway, because somebody has to say it out loud. “Freshly launches in Perth within weeks. Eighteen dollars a week, twelve million in funding, a slicker app than ours. And we’re sitting here talking about putting our price up. Part of me thinks that’s madness.”

“It would be madness if you were selling what they’re selling,” Lee says. “You can’t win an eighteen-dollar fight against twelve million dollars, so don’t enter it. The people who say yes at $30 are hiring you for the job Freshly doesn’t do: dinner decided, a recipe card on top, a box they can trust without thinking about it. If your subscribers were choosing on price alone, you’d already have lost them, at $25 or any other number. The experiment is telling you the job is worth $30 to the people who want it done. Charge what the job is worth, and let Freshly have the people who were only ever buying vegetables.”

Then he adds: “Before you change anything, how do you feel about the subscribers who paid $25?”

Maya thinks. “I don’t want to raise the price on them. They signed up at $25 and that was the deal.”

“Good instinct. Honour the original price for existing subscribers indefinitely. New subscribers sign up at $30. Your existing subscribers feel looked after. Your new subscribers feel fairly treated. The only people who lose are the ones who would have subscribed at $25 but won’t at $30, and the experiment tells you that’s a small group, and a group that probably wouldn’t have stuck around.”

It’s a clean decision, but it’s only clean because they measured first, and only honest because they were clear, before measuring, about what kind of evidence they’d accept.

What gets tested next

Maya and Lee work through the rest of the list. The $20 gap between small and large is the obvious next target, assumption 2 from the wall. After that, the mixed-sourcing pilot that’s been on the whiteboard for weeks. Each one a separate test. Each one starting the same way: write down the question, design the test, decide what you’d do with each outcome before you run it, measure, decide.

Pricing experiments. Not once. As a habit.

The team doesn’t know it yet, but the question on the wall is about to change. The discipline they’ve just learned will get its first real test on a deadline they didn’t pick. That’s a story for next week.

Lee writes a single sentence at the top of the pricing page in the team wiki:

“Price is an assumption until you’ve tested it. Test the assumptions you’d be most embarrassed to be wrong about first.”

Maya reads it, nods, and goes back to the kitchen to think about what to test next.