LLMs as Thinking Partners: How the Role Evolved

Most teams start by using LLMs to generate code fast. The ones that get the most value end up using LLMs to help them think. The shift isn’t about better prompts or newer models; it’s about better inputs. Discovery techniques produce structured understanding. Structured understanding produces useful LLM output. The Greenbox series provides worked examples at each stage; the principles below stand on their own.

The evolution at a glance

Stage	Worked example	LLM role	What worked	What didn’t	Post
Code generator	Week 1–4, building the wrong thing	“Write me a subscription system”	Fast output, clean code	Built on wrong assumptions, amplified misunderstanding	Catching the Wrong Kind of Fast
Implementation partner	BDD/Gherkin, turning examples into code	Generate code from concrete specs	Accurate when given precise examples	Still needs human-written specs	From Stories to Working Software
Sprint planning assistant	First sprints, task breakdown	Break down stories, draft acceptance criteria	Speeds up planning	Can’t assess gut-feel sizing	The First Sprints
Research / sense-making tool	JTBD interviews, assumption mapping	Transcribe interviews, spot patterns, make sense of the data	Catches themes humans miss across many interviews	Misses cultural context, local nuance	Jobs to Be Done, Assumption Mapping
Board presentation drafter	Roadmapping, board decks	Draft presentations from data	Fast first draft	Needs heavy editing for narrative and nuance	What Changes First
Code generator from domain models	Decision tables, bounded contexts	Generate code from formal tables, generate within context boundaries	Comprehensive, consistent, testable	Needs precise domain models as input	Decision Tables, Domain-Driven Design
ADR drafter	Architecture decisions	Draft ADRs from conversation context	Gets written instead of deferred	Misses nuance, overstates certainty	Architecture Decision Records
Ensemble tool	Ensemble programming	Types while team navigates	Removes mechanical bottleneck, team focuses on thinking	Solo use misses cross-domain concerns	Ensemble Programming
First-pass threat modeller	Threat modelling / STRIDE	Systematic STRIDE enumeration	Covers ~70% of threats, doesn’t get tired	Misses context-specific threats, cultural factors	Threat Modelling
Discovery infrastructure	Continuous discovery	Transcription, sense-making, drafting across all practices	Embedded in every part of the weekly cadence	Never replaces the human judgment about what matters	Continuous Discovery

Phase 1: “Give me the code”

This is where most teams start. Describe the feature, let the LLM write the code, ship it. The trap: a week-one build can produce a subscription system that’s clean, well-structured, and wrong, because it handles billing before the team understands what customers are actually subscribing to. The LLM doesn’t cause the mistake; it amplifies it. Vague understanding in, plausible-looking wrong code out. (Worked example.)

The shift comes with Example Mapping. Once a team has concrete examples (“Given a customer in Melbourne, when they subscribe to a weekly veggie box, then delivery is every Thursday”), the LLM stops guessing. Gherkin features from Example Map cards become precise prompts. The same LLM that built the wrong thing now builds the correct thing, because the input changed. (Worked example.)
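As a minimal sketch of how an Example Map card can become a precise prompt, the snippet below renders cards as Gherkin feature text and wraps that text in an instruction. The `Example` class, the `to_gherkin` and `build_prompt` helpers, the feature title, and the instruction wording are all hypothetical names chosen for illustration; only the Melbourne example itself comes from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """One Example Map card, already phrased as given/when/then (hypothetical shape)."""
    title: str
    given: str
    when: str
    then: str


def to_gherkin(feature: str, examples: list[Example]) -> str:
    """Render example cards as Gherkin feature text, one scenario per card."""
    lines = [f"Feature: {feature}", ""]
    for ex in examples:
        lines += [
            f"  Scenario: {ex.title}",
            f"    Given {ex.given}",
            f"    When {ex.when}",
            f"    Then {ex.then}",
            "",
        ]
    return "\n".join(lines).rstrip() + "\n"


def build_prompt(feature_text: str) -> str:
    """Wrap the feature text in an instruction (the wording is illustrative only)."""
    return (
        "Implement the behaviour described by these scenarios. "
        "Do not add behaviour that no scenario covers.\n\n" + feature_text
    )


cards = [
    Example(
        title="Weekly veggie box in Melbourne",
        given="a customer in Melbourne",
        when="they subscribe to a weekly veggie box",
        then="delivery is every Thursday",
    )
]
feature = to_gherkin("Subscription delivery schedule", cards)
prompt = build_prompt(feature)
print(prompt)
```

The point of the sketch is that the prompt contains nothing the team has not already agreed on: every line of it traces back to a card.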

The lesson: the LLM amplifies whatever understanding you give it, correct or incorrect, with equal confidence. This is the single most important thing to understand about using LLMs for software development. If your team’s understanding of the problem is vague, the LLM will produce confident, plausible, wrong code.

Phase 2: “Help me understand the data”

Once you’re past code generation, LLMs become powerful research assistants. Twenty customer interviews produce thousands of words of transcript. The LLM excels here: transcription, pattern-spotting across interviews, clustering themes. In a real JTBD sense-making session, this is where an LLM finds a recurring anxiety pattern (customers checking on Monday whether their Thursday delivery will arrive) across seven separate interviews that three different interviewers conducted. No single interviewer saw the pattern. The LLM did.

But the LLM misses context. Regional differences, cultural nuance, and the way different stakeholders talk about the same concept all matter for strategy, and the LLM flattens them into generic summaries. Assumption Mapping is the technique that catches where the LLM’s pattern-finding needs human correction.

For board presentations, the LLM can draft a clean deck from roadmap data. But data is not narrative. The founder almost always has to rewrite the LLM’s draft: the facts are correct but the story is wrong. The LLM presents information; the board needs a narrative they can act on. (Worked example.)

The lesson: LLMs are strong at finding patterns across volume. They’re weak at judgment, narrative, and cultural nuance. Use them to find patterns in data you’ve already collected. Don’t trust them to tell you what the patterns mean.
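The cross-interview pattern in this phase comes down to aggregating coded notes across interviewers. The sketch below assumes the notes have already been tagged with themes (by an LLM or by hand) and only shows the counting step; the `NOTES` data, the theme labels, and the `cross_interviewer_themes` helper are invented for illustration and are not taken from the worked example.

```python
from collections import defaultdict

# Hypothetical coded notes: (interview_id, interviewer, theme).
NOTES = [
    (1, "A", "delivery anxiety"),
    (2, "A", "price"),
    (3, "B", "delivery anxiety"),
    (4, "C", "delivery anxiety"),
    (5, "C", "price"),
]


def cross_interviewer_themes(notes, min_interviewers=2):
    """Return {theme: (interview_count, interviewer_count)} for themes
    that show up in interviews run by at least `min_interviewers` people."""
    interviews = defaultdict(set)
    interviewers = defaultdict(set)
    for interview_id, interviewer, theme in notes:
        interviews[theme].add(interview_id)
        interviewers[theme].add(interviewer)
    return {
        theme: (len(interviews[theme]), len(interviewers[theme]))
        for theme in interviews
        if len(interviewers[theme]) >= min_interviewers
    }


print(cross_interviewer_themes(NOTES, min_interviewers=3))
```

The output says where a theme recurs, not what it means. Deciding whether “delivery anxiety” matters is still the team’s call.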

Phase 3: “Generate from the model”

This is where LLM usage matures. Decision Tables are formal, complete, and unambiguous: exactly the kind of input LLMs handle well. Every condition combination and every outcome is explicitly stated. Give an LLM a complete decision table and it generates comprehensive test suites and implementation code with near-zero defects. The same is true for code generation scoped to bounded contexts: when the LLM knows the boundaries and the ubiquitous language, it stays within them.
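To make “complete” concrete, here is a minimal sketch of a decision table as plain data, with a check that every condition combination has a rule and a generator that yields one test case per row. The conditions, outcomes, and function names are hypothetical, invented for illustration rather than taken from the Greenbox tables.

```python
from itertools import product

# Hypothetical conditions and their possible values (illustration only).
CONDITIONS = {
    "is_subscriber": (True, False),
    "in_delivery_zone": (True, False),
}

# One rule per combination, keyed in the same order as CONDITIONS.
RULES = {
    (True, True): "schedule_delivery",
    (True, False): "offer_pickup",
    (False, True): "offer_subscription",
    (False, False): "reject",
}


def missing_combinations(conditions, rules):
    """Return every condition combination the table does not cover."""
    return [c for c in product(*conditions.values()) if c not in rules]


def decide(conditions, rules, **facts):
    """Look up the outcome for a set of named condition values."""
    key = tuple(facts[name] for name in conditions)
    return rules[key]


def generate_cases(conditions, rules):
    """Yield one (inputs, expected_outcome) pair per table row."""
    names = list(conditions)
    for combo, outcome in rules.items():
        yield dict(zip(names, combo)), outcome


# A complete table has no gaps, so every generated case has an answer.
assert missing_combinations(CONDITIONS, RULES) == []
for inputs, expected in generate_cases(CONDITIONS, RULES):
    assert decide(CONDITIONS, RULES, **inputs) == expected
```

If `missing_combinations` returns anything, the table is not ready to hand to an LLM: the gap is a question for the team, not something for the model to guess.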

ADRs reveal a different benefit. Teams defer documentation because writing is slow. An LLM can draft ADRs from conversation transcripts. The drafts aren’t perfect, but they’re good enough that the team edits rather than writes from scratch. Decisions that would have gone unrecorded get captured. The risk: the LLM overstates certainty and understates trade-offs, so every draft needs human review for hedging and nuance.

The lesson: formal, structured inputs produce the best LLM outputs. The shift from “write me code” to “implement this specification” is the difference between Phase 1 and Phase 3. Any team can make this shift; the prerequisite is not a better LLM but better discovery work before you open a terminal.

Phase 4: “Think with us”

Ensemble programming changes the relationship entirely. The LLM types while the team navigates: three or four people debate the correct approach, and the LLM implements their decisions in real time. The mechanical bottleneck (someone has to type) disappears. The team focuses on thinking. Solo LLM use produces code that works but misses cross-domain concerns; ensemble use catches those concerns because multiple perspectives are present.

Threat modelling with STRIDE shows the LLM as systematic first-pass analyst. It enumerates threats at every boundary (spoofing, tampering, repudiation, information disclosure, denial of service, elevation of privilege) without getting tired or bored. In practice it’ll cover roughly 70% of what a team finds important. The remaining 30% requires human knowledge of the specific deployment context, customer behaviour patterns, and regulatory environment. The ratio varies by domain, but the pattern is consistent: the LLM handles the systematic enumeration, humans handle the judgment.
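The systematic part of that first pass is mechanical: every trust boundary crossed with every STRIDE category. A minimal sketch, assuming three hypothetical boundaries (the boundary names, the `first_pass` helper, and the status field are made up for illustration):

```python
from itertools import product

# The six STRIDE categories.
STRIDE = (
    "Spoofing",
    "Tampering",
    "Repudiation",
    "Information disclosure",
    "Denial of service",
    "Elevation of privilege",
)

# Hypothetical trust boundaries; a real list comes from the team's own diagram.
BOUNDARIES = ("browser -> API", "API -> database", "API -> payment provider")


def first_pass(boundaries, categories=STRIDE):
    """Build one checklist item per (boundary, category) pair."""
    return [
        {"boundary": b, "category": c, "status": "needs human review"}
        for b, c in product(boundaries, categories)
    ]


checklist = first_pass(BOUNDARIES)
print(len(checklist))  # 3 boundaries x 6 categories = 18 items
```

Each item is a question to put to the LLM and then to the team. The enumeration guarantees coverage of the grid; it says nothing about the context-specific threats that sit outside it.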

By Continuous Discovery, the LLM becomes infrastructure. Transcription, sense-making, pattern-matching, and drafting are woven into the weekly cadence at every step. It is not a tool the team reaches for occasionally, but a layer underneath every practice. The human role shifts entirely to judgment: what matters, what to act on, what to ignore.

The principle

The LLM’s value is proportional to the quality of the thinking that goes into the prompt. Vague instructions produce plausible-looking wrong code. Concrete examples produce accurate implementations. Formal domain models produce comprehensive code. Discovery techniques aren’t just for humans; they produce the structured understanding that makes LLMs genuinely useful. This is true regardless of which LLM you use or what you’re building.

The anti-pattern

Using the LLM without discovery. “Give me a subscription system” versus “implement these 12 Example Map scenarios as Gherkin features.” The first produces fast, confident, wrong code. The second produces working software. The gap between the two isn’t a better prompt template or a more capable model. It’s the discovery work that happened before anyone opened a terminal.

If your team is frustrated with LLM output quality, the fix is almost never a better model or a cleverer prompt. It’s better understanding of the problem you’re asking the LLM to solve. Run an Example Mapping session. Build a decision table. Define the bounded context. Then ask the LLM again. The difference will be immediate.

Example Mapping: the single best technique for improving LLM input quality
Decision Tables: formal models that produce comprehensive, testable code
Which Workshop When: every discovery and delivery technique in one place
The Planning Onion: every planning layer in one place
Retrospectives at Every Scale: the feedback loop that catches whether the LLM helped
The Greenbox Story: narrative behind the worked examples