The GreenBox team hit 214 subscribers. The sprint cadence is working. Event Storming gave them shared understanding. Example Mapping made their stories concrete. The sprint rhythm turned sticky notes into delivery.
But bugs keep appearing.
Not catastrophic bugs – the payment system works, the delivery scheduling is solid. But edge cases slip through. The delivery date calculation breaks on public holidays because nobody checked. The box-size switch fails if a customer changes on Wednesday instead of Monday. A paused subscriber gets charged because the retry logic doesn’t check pause state. Each one is a twenty-minute fix. Each one costs trust.
The team has concrete examples from their Example Mapping sessions – context, action, outcome, written on cards. But those cards are on a table. The code is on a screen. Somewhere between the two, the details get lost.
This post is about that bridge: from Example Map to executable specifications, from specifications to tests, from tests to code. And about what happens to that bridge when you have an LLM sitting next to you.
A language for examples
The Example Map gave the team examples written as Context/Action/Outcome. That structure is doing a lot of the heavy lifting already. But there’s a step between “cards on a table” and “something a test framework can run.” The team needs a way to express those examples formally enough for a computer to use, while keeping them readable enough that Maya – who doesn’t write code – can still look at them and say “yes, that’s what I meant.”
There’s a language designed for exactly this. It’s called Gherkin.
Gherkin uses three keywords: Given, When, and Then. If that sounds familiar, it should – it maps directly to the Context/Action/Outcome pattern from the Example Mapping session.
- Given sets up the context – what’s true before anything happens.
- When describes the action – what someone does.
- Then states the outcome – what should be true afterwards.
Here’s a trivial example, nothing to do with GreenBox:
“If it’s raining and I go outside, I should get wet.”
In Gherkin, that becomes:
Given it is raining
When I go outside
Then I should get wet
That’s it. No code. No special syntax beyond those three keywords. Anyone can read it and understand what’s being described. The context is “it’s raining.” The action is “I go outside.” The outcome is “I get wet.”
It’s the same pattern the team already used on their green cards – just formalised with keywords that a test framework can parse.
From Example Map to Gherkin
With that in mind, look at the Example Map output from the Example Mapping session. The rules and examples for “Subscribe to a produce box” are already there:
- Rule: Customer must choose a box size (Small $25/week, Large $45/week)
- Rule: Payment must succeed (valid card → confirmed, declined card → retry)
- Rule: Customer sees their first delivery date (Monday → this Thursday, Friday → next Thursday)
Take one of the delivery date examples from the Example Map. On the green card, it reads something like:
Context: delivery day is Thursday, minimum lead time is 3 days. Sarah subscribes on Friday. → First delivery is next Thursday.
Translated to Gherkin:
Given today is Friday
And deliveries happen on Thursdays
And the minimum lead time is 3 days
And a customer has a valid payment method
When they subscribe to the "Small" box
Then their first delivery date should be next Thursday
The Given lines are the context from the card. The When line is the action. The Then line is the outcome. It’s a mechanical translation – the hard thinking already happened round the table with Maya and the team.
The Feature file
Individual scenarios are grouped into a Feature. A Feature file describes one coherent piece of behaviour – in this case, “Subscribe to a produce box.” It can also include a Background section for context that’s shared across every scenario, so you don’t have to repeat yourself.
Here’s the full Feature file for the subscription story, with all three rules represented:
Feature: Subscribe to a produce box
  Customers want a regular supply of fresh, local produce
  without having to think about it each week.

  Background:
    Given the following box sizes are available:
      | name  | price    |
      | Small | $25/week |
      | Large | $45/week |

  Scenario: Subscribing with a valid payment method
    Given a customer has a valid payment method
    When they subscribe to the "Small" box
    Then their subscription should be confirmed
    And they should see their first delivery date

  Scenario: Payment is declined
    Given a customer has an expired credit card
    When they subscribe to the "Small" box
    Then no subscription should be created
    And they should be asked to update their payment method

  Scenario: Subscribing without enough lead time
    Given today is Friday
    And deliveries happen on Thursdays
    And the minimum lead time is 3 days
    And a customer has a valid payment method
    When they subscribe to the "Small" box
    Then their first delivery date should be next Thursday
Each rule from the Example Map maps to one or more scenarios. Each green example card becomes the concrete data inside a scenario. The red cards (questions) that were resolved feed into the details. The red cards that weren’t resolved got deferred to separate stories.
This is what makes Example Mapping so powerful as a feeder into BDD. You’re not sitting in front of a blank Feature file wondering what to write. The conversation already happened. You’re just transcribing.
The BDD cycle: story, unit, code
Now the team has scenarios. These are acceptance tests – they describe the agreed behaviour, the contract between the team and the business. But you don’t implement them top-down. You work inward, using two loops.
The outer loop is the acceptance test – the Gherkin scenario. The inner loop is unit tests driving the implementation. The acceptance test tells you when you’re done. The unit tests tell you how to get there.
The cycle goes like this:
1. Pick a scenario. Run it. It’s RED – it fails because nothing exists yet.
2. Drop down to unit tests. Write a small, focused test for the first piece of behaviour you need. RED.
3. Write the simplest code that makes the unit test pass. GREEN.
4. Refactor if needed.
5. Repeat steps 2-4 until the acceptance test passes. GREEN.
6. Move to the next scenario.
Red, green, refactor. That’s the rhythm.
Worked example: GreenBox subscription in Go
A note about the code that follows: it’s deliberately simple. The delivery date calculator and subscription logic show the BDD rhythm – red, green, refactor – without the noise of a real production system. A real subscription service needs database transactions, Stripe webhook handling, idempotency keys, timezone-aware date logic, and retry queues. The discovery techniques produce the same concrete examples regardless of implementation complexity. The rhythm stays the same whether you’re writing a pure function or wiring up a distributed system.
Let’s walk through this concretely. Tom and Priya are going to implement the subscription story. They’re using Go. They’re sitting side by side for the first time – Priya usually works with headphones on, and Tom usually works alone. He notices, as they start writing the first test, that Priya names her tests differently from him. “How do you name tests?” he asks. “I describe what the customer expects, not what the code does,” she says. It’s a small thing. Tom starts doing it too.
Delivery date calculator
They start with the third scenario – “Subscribing without enough lead time” – because the delivery date calculation is pure logic with no external dependencies like payment gateways. It’s the ideal first slice: self-contained, well-specified by the Example Map, and easy to test in isolation.
The first piece of behaviour they need: given a date someone subscribes, when is their first delivery?
The rules from the Example Map:
- Deliveries happen on Thursdays
- Minimum lead time is 3 days
- Subscribe on Monday → this Thursday (3 days away – just enough)
- Subscribe on Friday → next Thursday (less than 3 days to this Thursday, so it rolls to the next one)
RED. Write the unit test first.
// delivery_test.go
package greenbox

import (
	"testing"
	"time"
)

func TestFirstDeliveryDate_MondaySubscription(t *testing.T) {
	// Monday 2026-03-23
	monday := time.Date(2026, 3, 23, 10, 0, 0, 0, time.UTC)
	deliveryDay := time.Thursday
	minLeadDays := 3

	got := FirstDeliveryDate(monday, deliveryDay, minLeadDays)

	// Thursday 2026-03-26 (3 days later -- just enough lead time)
	want := time.Date(2026, 3, 26, 10, 0, 0, 0, time.UTC)
	if !got.Equal(want) {
		t.Errorf("FirstDeliveryDate(%v, Thursday, 3) = %v, want %v",
			monday.Weekday(), got.Weekday(), want.Weekday())
	}
}
This won’t compile. FirstDeliveryDate doesn’t exist yet. That’s the point.
GREEN. Write the function.
// delivery.go
package greenbox

import "time"

func FirstDeliveryDate(from time.Time, deliveryDay time.Weekday, minLeadDays int) time.Time {
	earliest := from.AddDate(0, 0, minLeadDays)
	daysUntil := (int(deliveryDay) - int(earliest.Weekday()) + 7) % 7
	if daysUntil == 0 {
		return earliest
	}
	return earliest.AddDate(0, 0, daysUntil)
}
Test passes.
RED. Edge case from the Example Map: Friday subscription.
func TestFirstDeliveryDate_FridaySubscription(t *testing.T) {
	// Friday 2026-03-27
	friday := time.Date(2026, 3, 27, 10, 0, 0, 0, time.UTC)
	deliveryDay := time.Thursday
	minLeadDays := 3

	got := FirstDeliveryDate(friday, deliveryDay, minLeadDays)

	// Next Thursday: 2026-04-02 (Friday + 3 = Monday, then forward to Thursday)
	want := time.Date(2026, 4, 2, 10, 0, 0, 0, time.UTC)
	if !got.Equal(want) {
		t.Errorf("FirstDeliveryDate(%v, Thursday, 3) = %v, want %v",
			friday.Format("Monday"), got.Format("Monday 2006-01-02"),
			want.Format("Monday 2006-01-02"))
	}
}
GREEN. It already passes. The modular arithmetic in daysUntil handles this naturally: Friday + 3 = Monday, Monday to Thursday = 3 days forward. No code change needed.
This is one of the pleasures of TDD. You write a test you expect to fail, and it passes, which tells you your implementation is more general than you thought. The test still has value – it documents the edge case and will catch regressions.
REFACTOR. The function is five lines. Nothing to clean up.
Subscription creation
Now the second piece: creating the subscription itself, including the payment check.
RED. Test the happy path first.
// subscription_test.go
package greenbox

import (
	"testing"
	"time"
)

type fakeGateway struct {
	shouldSucceed bool
	chargedAmount int
}

func (f *fakeGateway) Charge(amountCents int) (bool, error) {
	f.chargedAmount = amountCents
	return f.shouldSucceed, nil
}

func TestSubscribe_ValidPayment(t *testing.T) {
	gw := &fakeGateway{shouldSucceed: true}
	delivery := time.Date(2026, 3, 26, 0, 0, 0, 0, time.UTC)

	sub, err := Subscribe("Small", 2500, gw, delivery)

	if err != nil {
		t.Fatalf("unexpected error: %v", err)
	}
	if sub.BoxSize != "Small" {
		t.Errorf("BoxSize = %q, want %q", sub.BoxSize, "Small")
	}
	if sub.PricePerWeek != 2500 {
		t.Errorf("PricePerWeek = %d, want %d", sub.PricePerWeek, 2500)
	}
	if !sub.FirstDelivery.Equal(delivery) {
		t.Errorf("FirstDelivery = %v, want %v", sub.FirstDelivery, delivery)
	}
	if gw.chargedAmount != 2500 {
		t.Errorf("charged %d, want %d", gw.chargedAmount, 2500)
	}
}
Won’t compile. No Subscribe function, no Subscription type, no PaymentGateway interface.
GREEN. Write the simplest thing that passes.
// subscription.go
package greenbox

import (
	"errors"
	"time"
)

var ErrPaymentDeclined = errors.New("payment declined")

type Subscription struct {
	BoxSize       string
	PricePerWeek  int
	FirstDelivery time.Time
}

type PaymentGateway interface {
	Charge(amountCents int) (ok bool, err error)
}

func Subscribe(boxSize string, priceCents int, gw PaymentGateway, firstDelivery time.Time) (*Subscription, error) {
	_, _ = gw.Charge(priceCents)
	return &Subscription{
		BoxSize:       boxSize,
		PricePerWeek:  priceCents,
		FirstDelivery: firstDelivery,
	}, nil
}
Test passes. But notice the implementation is deliberately naive – it ignores the payment result. That’s fine. We only need to make the current test pass. The next test will force us to handle failure.
RED. Test the declined payment path.
func TestSubscribe_DeclinedPayment(t *testing.T) {
	gw := &fakeGateway{shouldSucceed: false}
	delivery := time.Date(2026, 3, 26, 0, 0, 0, 0, time.UTC)

	sub, err := Subscribe("Small", 2500, gw, delivery)

	if err != ErrPaymentDeclined {
		t.Errorf("err = %v, want %v", err, ErrPaymentDeclined)
	}
	if sub != nil {
		t.Errorf("subscription should be nil when payment declined")
	}
}
This fails. The current implementation ignores the payment result and always returns a subscription.
GREEN. Fix the implementation to check the payment result.
func Subscribe(boxSize string, priceCents int, gw PaymentGateway, firstDelivery time.Time) (*Subscription, error) {
	ok, err := gw.Charge(priceCents)
	if err != nil {
		return nil, err
	}
	if !ok {
		return nil, ErrPaymentDeclined
	}
	return &Subscription{
		BoxSize:       boxSize,
		PricePerWeek:  priceCents,
		FirstDelivery: firstDelivery,
	}, nil
}
Both tests pass.
REFACTOR. The types are clean. The interface is narrow – one method, one responsibility. The function does one thing. Nothing to tidy up.
That’s the inner loop done for this story. Four unit tests, two source files, clean types, narrow interfaces.
Step definitions: the glue that matters
The acceptance test – the Gherkin scenario – would now pass if you wired up the step definitions. Step definitions are the code that connects Gherkin keywords to your actual application. When the test runner sees When they subscribe to the "Small" box, it needs a function that calls your real Subscribe code.
In Go, using a framework like godog, a step definition looks like this:
func iSubscribeToTheBox(ctx context.Context, size string) error {
	gw := stripeGateway()
	sub, err := greenbox.Subscribe(size, boxPrice(size), gw,
		greenbox.FirstDeliveryDate(time.Now(), time.Thursday, 3))
	if err != nil {
		lastError = err
		return nil
	}
	lastSubscription = sub
	return nil
}
This is thin on purpose. The step definition delegates to Subscribe and FirstDeliveryDate – the real functions the team already wrote and tested. It doesn’t contain business logic. It doesn’t make decisions. It’s glue.
That thinness is critical. Step definitions are the most fragile part of BDD. They sit between two worlds – the business language of Gherkin and the implementation language of Go – and they rot fast if you don’t maintain them.
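To demystify what the framework is doing with that glue, here is a toy, self-contained sketch of the matching a runner like godog performs: a pattern maps a line of Gherkin to a function, and capture groups become arguments. The registry type and its methods are invented for illustration, not godog’s API:

```go
package main

import (
	"fmt"
	"regexp"
)

// step pairs a pattern with the function it triggers.
type step struct {
	pattern *regexp.Regexp
	fn      func(args []string) error
}

type registry struct{ steps []step }

// register wires a Gherkin phrase to a step function.
func (r *registry) register(expr string, fn func(args []string) error) {
	r.steps = append(r.steps, step{regexp.MustCompile(expr), fn})
}

// run finds the first matching step and calls it with the captured arguments.
func (r *registry) run(line string) error {
	for _, s := range r.steps {
		if m := s.pattern.FindStringSubmatch(line); m != nil {
			return s.fn(m[1:])
		}
	}
	return fmt.Errorf("no step definition for: %q", line)
}

func main() {
	r := &registry{}
	r.register(`^they subscribe to the "([^"]*)" box$`, func(args []string) error {
		fmt.Println("subscribing to box:", args[0]) // a real step would call greenbox.Subscribe here
		return nil
	})
	if err := r.run(`they subscribe to the "Small" box`); err != nil {
		fmt.Println(err)
	}
	// Prints: subscribing to box: Small
}
```

The framework owns the matching; the step function should own nothing but the call into your domain code.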
Guidelines for keeping them healthy:
Keep them thin. A step definition should call a helper function or your real domain code. If you find yourself writing if statements or business logic inside a step definition, something has gone wrong. The logic belongs in the domain code where it’s unit-tested.
Use consistent language. If the team says “subscribe,” every step says “subscribe.” Not sometimes “sign up,” not sometimes “create subscription.” Inconsistent language means duplicate step definitions doing the same thing with different words. That’s tech debt.
Avoid regexp gymnastics. Step matchers should be simple and readable. When they subscribe to the {string} box is good. A regex that tries to match twelve variations of the same sentence is a maintenance nightmare.
Reuse across scenarios. One step definition per behaviour, shared across every scenario that needs it. If Given a customer has a valid payment method appears in five scenarios, that’s one step definition, not five.
Maintain them like production code. Review step definitions in pull requests. Refactor them when the domain language evolves. Delete them when scenarios are removed – orphaned step definitions are tech debt that confuses the next person who reads the test suite. If a step definition is hard to write, the scenario is probably too coupled to implementation detail. Rewrite the Gherkin, not the step code.
The payoff for this discipline is significant. Well-maintained step definitions become living documentation. A new team member can read When they subscribe to the "Small" box, follow the step definition to greenbox.Subscribe, and understand exactly what “subscribe” means in the codebase. If the step definitions are messy or out of date, that trust breaks down – and once the team stops trusting the scenarios, they stop maintaining them, and BDD quietly dies.
This is where many teams abandon BDD. The Gherkin is easy. Writing scenarios from Example Maps is almost mechanical. But the step definitions are where the discipline lives. If you skip the maintenance, the scenarios drift from reality and become fiction nobody trusts. The team stops running them, then stops writing them, and six months later someone says “we tried BDD and it didn’t work.” It worked. They just didn’t maintain the glue.
Priya suggests running the Gherkin tests automatically. “We’re writing tests that prove the code does what Maya expects. Why are we running them by hand?” She sets up a GitHub Action – tests run on every pull request. It takes her an afternoon. The first automated run catches a bug in Tom’s payment retry logic that manual testing missed. Their deploy script is still manual, but at least the tests aren’t. Tom: “That saved me a day.” Priya: “That saved a customer.”
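A workflow like Priya’s might look roughly like this. An illustrative sketch only; the file name, action versions, and Go version are assumptions, not details from the team’s actual setup:

```yaml
# .github/workflows/tests.yml (hypothetical)
name: tests
on: [pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: '1.22'
      - run: go test ./...   # runs unit tests and godog acceptance tests alike
```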
LLMs as implementation partners
Here’s the thing about everything you just read: an LLM could have written all of it.
Not the Example Map. Not the discovery conversation where Maya explained that deliveries happen on Thursdays and the minimum lead time is three days. Not the moment when Tom asked “what about Friday?” and surfaced an edge case. The LLM wasn’t in the room for that.
But the code? The Go functions, the test files, the types and interfaces? You could hand an LLM the Feature file – the Gherkin scenarios the team wrote – and say:
Here’s a Feature file for subscribing to a produce box. Write me a Go implementation with tests that makes these scenarios pass.
And it would produce something remarkably close to what you just read. Probably not identical – it might choose different names, structure the tests differently, use a table-driven test style. But the behaviour would be right, because the scenarios are concrete and unambiguous. There’s no room for the LLM to guess wrong about what “subscribe” means, because the Feature file spells it out: these box sizes, this payment behaviour, this delivery date logic.
A caveat. LLMs are good at the happy path. They’ll write code that handles the cases you specified and passes the tests you described. What they’ll miss is everything you didn’t specify: error handling for network timeouts, concurrency issues when two customers subscribe simultaneously, retry logic for flaky payment gateways, audit trails for compliance. Code review isn’t optional when an LLM writes the implementation – budget roughly half your time for reviewing, adjusting, and hardening what comes back. The discovery work is what makes this review possible. Because you have concrete examples with explicit context, action, and outcome, you can check the LLM’s output against something specific. Without that, you’re reviewing code against vibes.
Without the discovery work, the LLM would produce plausible code that implements the wrong thing – exactly like Tom did in week one, when he built a subscription system based on assumptions. With the Feature file, the LLM has the same shared understanding the team built round the table. The scenarios are the specification.
I wrote about this in The Value Is in Ideas, Not Code. The gap between “I know what to build” and “working code” has collapsed. LLMs have made implementation cheap. What they haven’t made cheap is knowing what to implement.
The discovery work – Event Storming, Example Mapping – is MORE important now, not less. The bottleneck has shifted. It used to be: we know what to build but it takes ages to build it. Now it’s: we can build anything quickly but we don’t know what to build.
The pipeline looks like this:

Event Storming → Example Mapping → Gherkin scenarios → Hand to LLM → Review → Refine → Ship

Everything to the left of “Hand to LLM” is human thinking. Everything to the right is review and refinement. The LLM sits in the middle, turning precise specifications into working code.
The human work is the thinking. The LLM work is the typing. Both are necessary. Neither is sufficient alone.
This is why I keep saying: invest in discovery. The return on that investment has gone up, not down, since LLMs arrived.
The Rhythm
The GreenBox team has found its rhythm. Example Map the story. Write the scenarios. Red. Green. Refactor. Ship.
The gap between “we understand what to build” and “it’s working in production” has collapsed. Not because the team skipped steps, but because they did the thinking first. The Example Map gave them concrete, unambiguous examples. Gherkin gave them a shared language. BDD gave them a cycle. And the LLM turned precise specifications into working code faster than anyone expected.
The hard part was never the code. It was knowing what to build. That’s still true – maybe more true than ever.
While implementing the payment integration, Tom makes a deliberate shortcut: he hardcodes the currency to AUD instead of making it configurable. He writes a comment in the code: // SHORTCUT: AUD only. If we ever go international, this needs to change. Lee sees it during a review and says: “That’s a good shortcut. You know it’s there, you know when it’ll matter, and you’ve documented it. Technical debt is fine when it’s conscious.” It plants an idea that Tom carries forward: debt is a choice, not an accident. The dangerous kind is the kind you don’t know you’re taking on.
That same week, a subscriber emails Sam on Saturday: “Your website has been showing an error since yesterday afternoon.” Nobody noticed – they don’t monitor the site outside business hours. Sam signs up for a free uptime monitor. It pings the site every five minutes and texts her if it’s down. It’s not observability. It’s a text message. But it’s the first time a machine is watching instead of a person.
But are we building the right things?
One thing Tom notices during the BDD cycle: the LLM generates code faster than he can review it. The code arrives clean and confident, but Tom can’t always tell if it’s right until he traces through it line by line. For now that’s fine – the Feature file gives him something concrete to check against. But the speed creates an odd sensation: the bottleneck isn’t writing code any more. It’s knowing whether the code is correct.
A few weeks in, the rhythm is working. The team is shipping well. Example Mapping has eliminated the surprises. BDD is catching bugs before they reach production. The code quality is up. The board looks healthy.
But the number that actually matters – active subscribers – is going backwards. They hit 214 at the end of the first sprint cycle. A month later, they’re at 197. The slope is going the wrong direction.
Maya checks the number at her kitchen table one evening. Nadia looks over her shoulder. “Is that good?” “It’s going the wrong way.” Churn is eating the growth. For every ten new subscribers, three or four existing ones cancel. The team is building well, but the subscriber count doesn’t care about code quality.
The frustrating thing is that the team is doing good work. They’ve built a solid subscription system, payment processing, delivery date logic. Tom has been improving the admin tools. Jas redesigned the onboarding flow. Sam is pushing for a farm analytics dashboard. Everyone has a reasonable next thing to build, and they’re building it well.
But nobody has stepped back to ask: which of these things will actually stop the bleeding? The team is efficiently building features, but are they the right features for the business? Is a prettier onboarding flow going to fix churn? Is a farm dashboard? Or are those just comfortable engineering tasks that feel productive without delivering bottom-line impact?
Maya raises it at the Monday standup. “We’re shipping faster than ever. But we’re shrinking. Something’s wrong and I don’t think the answer is to ship even faster.”
Example Mapping tells you what a story means. BDD turns that into working code. But which stories should the team be building? How do they connect their work to the business goal? For that, they need a technique that works backwards from outcomes – one that forces the question “why are we building this?”
That’s Impact Mapping (coming 12 May).