Most teams start by using LLMs to generate code fast. The ones that get the most value end up using LLMs to help them think. The shift isn’t about better prompts or newer models; it’s about better inputs. Discovery techniques produce structured understanding. Structured understanding produces useful LLM output. The Greenbox series provides worked examples at each stage; the principles below stand on their own.
The evolution at a glance
| Stage | Worked example | LLM role | What worked | What didn’t | Post |
|---|---|---|---|---|---|
| Code generator | Week 1–4, building the wrong thing | “Write me a subscription system” | Fast output, clean code | Built on wrong assumptions, amplified misunderstanding | Catching the Wrong Kind of Fast |
| Implementation partner | BDD/Gherkin, turning examples into code | Generate code from concrete specs | Accurate when given precise examples | Still needs human-written specs | From Stories to Working Software |
| Sprint planning assistant | First sprints, task breakdown | Break down stories, draft acceptance criteria | Speeds up planning | Can’t assess gut-feel sizing | The First Sprints |
| Research / sense-making tool | JTBD interviews, assumption mapping | Transcribe interviews, spot patterns, make sense of the data | Catches themes humans miss across many interviews | Misses cultural context, local nuance | Jobs to Be Done, Assumption Mapping |
| Board presentation drafter | Roadmapping, board decks | Draft presentations from data | Fast first draft | Needs heavy editing for narrative and nuance | What Changes First |
| Code generator from domain models | Decision tables, bounded contexts | Generate code from formal tables, generate within context boundaries | Comprehensive, consistent, testable | Needs precise domain models as input | Decision Tables, Domain-Driven Design |
| ADR drafter | Architecture decisions | Draft ADRs from conversation context | Gets written instead of deferred | Misses nuance, overstates certainty | Architecture Decision Records |
| Ensemble tool | Ensemble programming | Types while team navigates | Removes mechanical bottleneck, team focuses on thinking | Solo use misses cross-domain concerns | Ensemble Programming |
| First-pass threat modeller | Threat modelling / STRIDE | Systematic STRIDE enumeration | Covers ~70% of threats, doesn’t get tired | Misses context-specific threats, cultural factors | Threat Modelling |
| Discovery infrastructure | Continuous discovery | Transcription, sense-making, drafting across all practices | Embedded in every part of the weekly cadence | Never replaces the human judgment about what matters | Continuous Discovery |
Phase 1: “Give me the code”
This is where most teams start. Describe the feature, let the LLM write the code, ship it. The trap: a week-one build can produce a subscription system that’s clean, well-structured, and wrong, handling billing before the team understands what customers are actually subscribing to. The LLM doesn’t cause the mistake; it amplifies it. Vague understanding in, plausible-looking wrong code out. (Worked example.)
The shift comes with Example Mapping. Once a team has concrete examples, such as “Given a customer in Melbourne, when they subscribe to a weekly veggie box, then delivery is every Thursday”, the LLM stops guessing. Gherkin features written from Example Map cards become precise prompts. The same LLM that built the wrong thing now builds the correct thing, because the input changed. (Worked example.)
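As a minimal sketch of that step, the snippet below renders one example card as a Gherkin scenario that could be pasted into a prompt. The `Example` dataclass, the `to_gherkin` helper, and the feature title are assumptions made for illustration; only the Melbourne example itself comes from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """One concrete example from an Example Map card (hypothetical shape)."""
    title: str
    given: str
    when: str
    then: str


def to_gherkin(feature: str, examples: list[Example]) -> str:
    """Render examples as Gherkin text, one scenario per example."""
    lines = [f"Feature: {feature}"]
    for ex in examples:
        lines += [
            "",
            f"  Scenario: {ex.title}",
            f"    Given {ex.given}",
            f"    When {ex.when}",
            f"    Then {ex.then}",
        ]
    return "\n".join(lines)


# The example quoted in the text; the titles are invented for this sketch.
melbourne = Example(
    title="Weekly veggie box in Melbourne",
    given="a customer in Melbourne",
    when="they subscribe to a weekly veggie box",
    then="delivery is every Thursday",
)

feature_text = to_gherkin("Subscription delivery schedule", [melbourne])
print(feature_text)
```

The point of the structure is that every scenario forces someone to state a precondition, an action, and an expected outcome before the LLM sees anything.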
The lesson: the LLM amplifies whatever understanding you give it, correct or incorrect, with equal confidence. This is the single most important thing to understand about using LLMs for software development. If your team’s understanding of the problem is vague, the LLM will produce confident, plausible, wrong code.
Phase 2: “Help me understand the data”
Once you’re past code generation, LLMs become powerful research assistants. Twenty customer interviews produce thousands of words of transcript. The LLM excels here: transcription, pattern-spotting across interviews, clustering themes. In a real JTBD sense-making session, this is where an LLM finds a recurring anxiety pattern (customers checking on Monday whether their Thursday delivery will arrive) across seven separate interviews that three different interviewers conducted. No single interviewer saw the pattern. The LLM did.
But it misses context. Regional differences, cultural nuance, and the way different stakeholders describe the same concept in different terms all matter for strategy, and the LLM flattens them into generic summaries. Assumption Mapping is the technique that catches where the LLM’s pattern-finding needs human correction.
For board presentations, the LLM can draft a clean deck from roadmap data. But data is not narrative. The founder almost always has to rewrite the LLM’s draft: the facts are correct but the story is wrong. The LLM presents information; the board needs a narrative they can act on. (Worked example.)
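The mechanical part of that pattern-finding, noticing that a theme recurs across interviews run by different people, can be sketched as a simple aggregation. This is not how an LLM finds themes; it only illustrates why a pattern can be invisible to each interviewer yet obvious in the combined data. The tags, interviewer labels, and the `shared_themes` helper are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical theme tags as (interview_id, interviewer, theme).
# Invented for this sketch; not data from any real interviews.
TAGS = [
    (1, "A", "delivery anxiety"),
    (2, "A", "price"),
    (3, "B", "delivery anxiety"),
    (4, "B", "packaging"),
    (5, "C", "delivery anxiety"),
    (6, "C", "price"),
]


def shared_themes(tags, min_interviewers=2):
    """Return themes heard by at least `min_interviewers` different interviewers."""
    interviews = defaultdict(set)
    interviewers = defaultdict(set)
    for interview_id, interviewer, theme in tags:
        interviews[theme].add(interview_id)
        interviewers[theme].add(interviewer)
    return {
        theme: {"interviews": len(interviews[theme]), "interviewers": len(seen_by)}
        for theme, seen_by in interviewers.items()
        if len(seen_by) >= min_interviewers
    }


themes = shared_themes(TAGS, min_interviewers=3)
```

What the counts cannot say is whether a shared theme matters, which is the judgment the surrounding text assigns to people.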
The lesson: LLMs are strong at finding patterns across volume. They’re weak at judgment, narrative, and cultural nuance. Use them to find patterns in data you’ve already collected. Don’t trust them to tell you what the patterns mean.
Phase 3: “Generate from the model”
This is where LLM usage matures. Decision Tables are formal, complete, and unambiguous: exactly the kind of input LLMs handle well. Every condition combination and every outcome is explicitly stated. Give an LLM a complete decision table and it generates comprehensive test suites and implementation code with near-zero defects. The same is true for code generation scoped to bounded contexts: when the LLM knows the boundaries and the ubiquitous language, it stays within them.
ADRs reveal a different benefit. Teams defer documentation because writing is slow. An LLM can draft ADRs from conversation transcripts. The drafts are not perfect, but they are good enough that the team edits rather than writes from scratch. Decisions that would have gone unrecorded get captured. The risk: the LLM overstates certainty and understates trade-offs, so every draft needs human review for hedging and nuance.
The lesson: formal, structured inputs produce the best LLM outputs. The shift from “write me code” to “implement this specification” is the difference between Phase 1 and Phase 3. Any team can make this shift: the prerequisite is not a better LLM but better discovery work before you open a terminal.
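A rough sketch of what "complete" means in practice: the table below covers every combination of its conditions, which can be checked mechanically, and each row doubles as a test case. The condition names, outcomes, and helper functions are hypothetical and are not taken from the Greenbox series.

```python
from itertools import product

# Hypothetical conditions and rules, for illustration only.
CONDITIONS = ("is_paused", "payment_ok")

# Each key is one combination of condition values, in CONDITIONS order.
DECISION_TABLE = {
    (False, True): "deliver",
    (False, False): "retry_payment",
    (True, True): "skip",
    (True, False): "skip",
}


def missing_combinations(table, n_conditions):
    """Return every condition combination the table does not cover."""
    return [c for c in product((False, True), repeat=n_conditions) if c not in table]


def decide(table, **facts):
    """Look up the outcome for a set of named condition values."""
    key = tuple(facts[name] for name in CONDITIONS)
    return table[key]


def test_cases(table):
    """Yield one (inputs, expected outcome) pair per row of the table."""
    for combo, outcome in table.items():
        yield dict(zip(CONDITIONS, combo)), outcome


# A complete table has no gaps, and every row becomes a test.
gaps = missing_combinations(DECISION_TABLE, len(CONDITIONS))
cases = list(test_cases(DECISION_TABLE))
```

Running the completeness check before handing the table to an LLM keeps any gaps in the team's understanding visible, rather than leaving the model to fill them in silently.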
Phase 4: “Think with us”
Ensemble programming changes the relationship entirely. The LLM types while the team navigates: three or four people debate the correct approach, and the LLM implements their decisions in real time. The mechanical bottleneck (someone has to type) disappears. The team focuses on thinking. Solo LLM use produces code that works but misses cross-domain concerns; ensemble use catches those concerns because multiple perspectives are present.
Threat modelling with STRIDE shows the LLM as a systematic first-pass analyst. It enumerates threats at every boundary (spoofing, tampering, repudiation, information disclosure, denial of service, elevation of privilege) without getting tired or bored. In practice it’ll cover roughly 70% of what a team finds important. The remaining 30% requires human knowledge of the specific deployment context, customer behaviour patterns, and regulatory environment. The ratio varies by domain, but the pattern is consistent: the LLM handles the systematic enumeration, humans handle the judgment.
By Continuous Discovery, the LLM becomes infrastructure. Transcription, sense-making, pattern-matching, and drafting are woven into the weekly cadence at every step. The LLM is not a tool the team reaches for occasionally, but a layer underneath every practice. The human role shifts entirely to judgment: what matters, what to act on, what to ignore.
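The systematic half of that split is easy to picture as a cross product: every trust boundary paired with every STRIDE category, each entry waiting for review. The sketch below assumes two made-up boundaries and a hypothetical `first_pass_checklist` helper; it generates the questions, not the threats.

```python
from itertools import product

# The six STRIDE categories named in the text.
STRIDE = (
    "Spoofing",
    "Tampering",
    "Repudiation",
    "Information disclosure",
    "Denial of service",
    "Elevation of privilege",
)

# Hypothetical trust boundaries, invented for this sketch.
BOUNDARIES = ("browser -> web API", "web API -> payment provider")


def first_pass_checklist(boundaries, categories=STRIDE):
    """Pair every boundary with every category so that none is skipped."""
    return [
        {"boundary": b, "category": c, "status": "needs human review"}
        for b, c in product(boundaries, categories)
    ]


checklist = first_pass_checklist(BOUNDARIES)
```

Every entry starts as "needs human review" on purpose: the enumeration guarantees coverage of the grid, while the context-specific threats still have to come from the team.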
The principle
The LLM’s value is proportional to the quality of the thinking that goes into the prompt. Vague instructions produce plausible-looking wrong code. Concrete examples produce accurate implementations. Formal domain models produce comprehensive code. Discovery techniques aren’t just for humans; they produce the structured understanding that makes LLMs genuinely useful. This is true regardless of which LLM you use or what you’re building.
The anti-pattern
Using the LLM without discovery. “Give me a subscription system” versus “implement these 12 Example Map scenarios as Gherkin features.” The first produces fast, confident, wrong code. The second produces working software. The gap between the two isn’t a better prompt template or a more capable model. It’s the discovery work that happened before anyone opened a terminal.
If your team is frustrated with LLM output quality, the fix is almost never a better model or a cleverer prompt. It’s better understanding of the problem you’re asking the LLM to solve. Run an Example Mapping session. Build a decision table. Define the bounded context. Then ask the LLM again. The difference will be immediate.
Related references
- Example Mapping, the single best technique for improving LLM input quality
- Decision Tables, formal models that produce comprehensive, testable code
- Which Workshop When, every discovery and delivery technique in one place
- The Planning Onion, every planning layer in one place
- Retrospectives at Every Scale, the feedback loop that catches whether the LLM helped
- The Greenbox Story, narrative behind the worked examples