Barking Iguana

Domain-Driven Design: Drawing the Boundaries

2026-06-09T06:00:00+08:00

Greenbox delivers weekly produce boxes from local farms. With 2,500 subscribers and a team growing from five to twelve, the startup is expanding to Melbourne, and the codebase that was built fast for a small team is starting to crack under the weight.

Maya is chopping sweet potato when her phone rings. She’s at home in Fremantle, Saturday evening, Nadia’s Spotify playlist filling the kitchen. Nadia is making a dressing at the counter. The number on the screen is Dave Morrison’s.

Dave doesn’t call on weekends. He doesn’t call much at all. He’s a text-message man, and even those are sparse. Three words. “Zucchini looks short.” Maya puts down the knife and answers.

“Maya. That Freshly mob rang me today.”

He says it the way he says everything: flat, unhurried, like he’s reporting rainfall. Maya leans against the counter. Nadia glances over.

“They’re offering guaranteed volume. A hundred crates a week. That’s more than you take from me in a month.”

Maya’s mouth goes dry. “Are you going to switch?”

A pause. Dave doesn’t rush pauses. “I didn’t say that. I said they called. Thought you should know.”

They talk for another two minutes. Dave mentions that Rachel got the same call. He says “goodnight, Maya” and hangs up.

Maya puts the phone face-down on the counter. Nadia has stopped whisking.

“What happened?”

“The thing I was afraid of.”

She tells Nadia about Freshly — the $12 million in funding, the guaranteed volume they’re dangling in front of the farms Greenbox depends on. A phone call to Dave is different from a competitor entering a market. That’s someone reaching for the thing she built.

The sweet potato burns slightly while they talk. They eat it anyway.

The 47-file PR

Greenbox has two thousand five hundred subscribers. The team is growing from five to twelve. They’re opening operations in Melbourne. And the codebase that Tom and Priya built for 200 subscribers is groaning under the weight.

Charlotte is now the team’s scaling coach. She’s spent fifteen years with subscription businesses and she’s seen what happens when a startup codebase meets rapid growth. Lee is still around for discovery foundations. But the problems now are different — not “we don’t understand the domain” but “the architecture can’t keep up.”

The pull request that said everything

Kai joins the team on a Monday. He’s twenty-eight, from Sydney, five years at a fintech company where he built payment systems handling half a billion dollars a year. Solid Go skills, comfortable with LLMs, and he carries the quiet confidence of someone who has never worked on a codebase he couldn’t master in a week.

With Kai joining, Tom realises the deploy script doesn’t scale — two developers deploying simultaneously caused a conflict last week. Tom sets up a basic CI/CD pipeline: tests run automatically, deploys go through a single pipeline instead of individual laptops. Priya’s GitHub Action from the BDD work evolves into a real pipeline.

Kai reads the codebase over two days. On Wednesday, he opens his LLM and prompts: “Add a gift subscription feature to this codebase. A customer should be able to buy a subscription as a gift for someone else.”

The LLM generates code. Kai reviews it, tweaks a few things, writes tests, and opens a pull request on Thursday afternoon.

The PR touches 47 files.

Across every part of the system. The subscription model, the payment processing, the delivery scheduling, the farm matching algorithm, the email templates, the customer portal. The gift subscription feature reaches into every corner because the codebase has no corners. It’s one big room.

One of the changes modifies the farm matching algorithm — it assumes supply is reliable enough to serve gift recipients on the same schedule as regular subscribers. Dave Morrison, whose zucchini yield over-promises by twenty percent every spring, would have something to say about that. But Dave isn’t in the code review.

Tom reviews the PR and his heart sinks. Not because the code is bad — some of the function signatures are cleaner than his own. But every change is tangled with everything else. Changing gift billing requires touching the same files as regular billing. The delivery changes affect all subscribers. The farm matching modifications could break Maya’s substitution logic.

“I can’t review this,” Tom tells Kai honestly. “Not because it’s wrong. Because I can’t tell what it’ll break.”

Charlotte pulls up the PR. “This is a symptom, not a bug.”

What Charlotte sees

She asks the team: “When you say ‘subscription,’ what do you mean?”

Tom: “The record that tracks what box someone gets and when they’re billed.”

Priya: “The relationship between a customer and their delivery schedule.”

Sam: “The thing a customer signs up for and can pause or cancel.”

Maya: “The commitment to receive a box every week.”

Four people. Four definitions. None wrong. All different.

“That’s your problem. Not four definitions — four definitions living in one codebase with no boundaries. When Kai asked the LLM to add gift subscriptions, the LLM did what the codebase told it to: spread the feature across everything, because everything is connected to everything.”

Bounded Contexts

Charlotte introduces Domain-Driven Design — specifically, Eric Evans’ concept of Bounded Contexts. Complex systems should be divided into distinct areas, each with its own clear language and boundaries.

“Subscription” in the billing context means “a recurring charge.” In the fulfilment context, it means “a delivery schedule.” In the customer context, it means “a thing I signed up for.” These aren’t contradictions. They’re different perspectives that belong in different parts of the code.

Charlotte pulls up the Event Storm photographs from months ago — Maya had them laminated, which Charlotte says is one of the smartest things she’s seen a founder do. “We’re going to run the next level up from what you did with Lee,” she says. “Lee got you a Process Level model — the flow, the events, the commands, the actors. Today we’re going to Event Storm an Architecture on top of it. Same wall, same sticky notes, but we’re looking for the code boundaries instead of the business logic.”

She copies the domain events onto the whiteboard and asks the team to help her find the boundaries.

“Look for three things,” she says. “First: where the language changes. When ‘subscription’ stops meaning the same thing to different people, that’s a boundary. Second: where the people change. The person who cares about supply matching is not the same person who cares about billing. Different stakeholders, different contexts. Third: where the rate of change differs. Billing changes when Stripe changes their API. Fulfilment changes when you add a city. If two areas change for different reasons, they probably belong in different contexts.”

The team clusters the events on the whiteboard. Tom moves “Payment Charged” next to “Invoice Generated”: they’re both about money. Priya groups “Farm Availability Submitted” with “Substitution Applied”: they’re both about what goes in the box. Sam pulls “Box Packed” and “Delivery Confirmed” together: those are her world.

Charlotte watches and asks questions. “Who cares when a payment fails?” Sam says billing. “Who cares when a box is packed?” Sam says logistics. “Who cares when a subscription is paused?” Everyone hesitates: it affects billing AND delivery. Charlotte marks it with a pink note: “Pause is a boundary event. It starts in one context and triggers work in others.”

After twenty minutes, four clusters have emerged. Not because Charlotte drew them, but because the team found them by looking at who cares about what and where the language shifts:

Subscription Context

Customer Subscribed
Subscription Paused
Subscription Cancelled
Gift Subscription Created
Box Size Changed

Billing Context

Payment Charged
Payment Failed
Invoice Generated
Refund Issued

Supply Matching Context

Farm Availability Submitted
Supply Matched to Demand
Substitution Applied
Shortfall Detected

Fulfilment Context

Box Packed
Box Dispatched
Delivery Confirmed
Delivery Failed

Four bounded contexts. Each talks to the others through events and clearly defined interfaces. Inside each context, the code is self-contained. You can change billing without touching fulfilment.

The reprompt

Charlotte has Kai try again. This time with a bounded prompt:

Add a gift subscription to the Subscription context. A gift subscription is created by a purchasing customer for a recipient. It has a status (pending, activated, active, paused, cancelled), a box size, a purchaser reference, and a recipient email. When created, publish a GiftSubscriptionCreated event. When activated, publish GiftSubscriptionActivated. The Subscription context does not handle billing, delivery, or supply matching.

The LLM generates code. The PR touches 8 files. The codebase is still one big room (nothing has been carved into packages yet), but the changes cluster: the subscription model, its status transitions, the gift fields, the two new events. Nothing reaches into billing, delivery, or farm matching.

Eight instead of forty-seven. Tom can review it in twenty minutes.

“The boundary didn’t just organise the code,” Charlotte says. “It organised the conversation with the LLM. We haven’t moved a single line yet. We just told it where the line is going to be.”
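For a sense of what the bounded prompt is asking for, here is a minimal sketch in Python. It is not Greenbox's code: the class name, field names, and especially the allowed status transitions are assumptions made for illustration. The prompt lists the statuses and the two events but does not say which moves between statuses are legal.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    ACTIVATED = "activated"
    ACTIVE = "active"
    PAUSED = "paused"
    CANCELLED = "cancelled"


# Assumed transition table: the prompt names the statuses, not the rules.
ALLOWED = {
    Status.PENDING: {Status.ACTIVATED, Status.CANCELLED},
    Status.ACTIVATED: {Status.ACTIVE, Status.CANCELLED},
    Status.ACTIVE: {Status.PAUSED, Status.CANCELLED},
    Status.PAUSED: {Status.ACTIVE, Status.CANCELLED},
    Status.CANCELLED: set(),
}


@dataclass
class GiftSubscription:
    purchaser_id: str
    recipient_email: str
    box_size: str
    status: Status = Status.PENDING
    # Events are only recorded here; delivering them to other contexts
    # is deliberately outside this sketch.
    events: list = field(default_factory=list)

    @classmethod
    def create(cls, purchaser_id, recipient_email, box_size):
        gift = cls(purchaser_id, recipient_email, box_size)
        gift.events.append(("GiftSubscriptionCreated", recipient_email))
        return gift

    def transition(self, new_status):
        if new_status not in ALLOWED[self.status]:
            raise ValueError(f"{self.status.value} -> {new_status.value} not allowed")
        self.status = new_status
        if new_status is Status.ACTIVATED:
            self.events.append(("GiftSubscriptionActivated", self.recipient_email))
```

Nothing in the sketch charges a card, schedules a delivery, or matches supply. It records that something happened and leaves the reaction to whichever context cares.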

The Context Map

The bounded contexts communicate through events. Charlotte draws a Context Map showing what flows between them:

From	To	Events
Subscription	Billing	SubscriptionCreated, SubscriptionPaused, SubscriptionCancelled
Subscription	Supply Matching	SubscriptionCreated, SubscriptionCancelled
Supply Matching	Fulfilment	BoxAllocated, SubstitutionApplied
Billing	Fulfilment	PaymentConfirmed

This loose coupling is what the team is aiming for. Once the contexts are real in the code, Kai can build gift subscriptions while Priya works on Melbourne delivery zones, and their changes won’t collide.
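One way to make the Context Map executable is to treat it as a routing table. The sketch below is an assumption-laden illustration in Python, not the team's implementation: the rows are copied from the map above, while EventBus, build_routes, and the in-memory inboxes are hypothetical names standing in for whatever messaging the team eventually picks.

```python
from collections import defaultdict

# (from, to, events), copied from the Context Map.
CONTEXT_MAP = [
    ("Subscription", "Billing",
     ["SubscriptionCreated", "SubscriptionPaused", "SubscriptionCancelled"]),
    ("Subscription", "Supply Matching",
     ["SubscriptionCreated", "SubscriptionCancelled"]),
    ("Supply Matching", "Fulfilment", ["BoxAllocated", "SubstitutionApplied"]),
    ("Billing", "Fulfilment", ["PaymentConfirmed"]),
]


def build_routes(rows):
    """Turn map rows into {event name: [receiving contexts]}."""
    routes = defaultdict(list)
    for _source, target, events in rows:
        for event in events:
            routes[event].append(target)
    return dict(routes)


class EventBus:
    """In-memory stand-in for a real message broker."""

    def __init__(self, routes):
        self.routes = routes
        self.inboxes = defaultdict(list)

    def publish(self, event_name, payload):
        # Only contexts listed on the map for this event receive it.
        for target in self.routes.get(event_name, []):
            self.inboxes[target].append((event_name, payload))


bus = EventBus(build_routes(CONTEXT_MAP))
bus.publish("SubscriptionPaused", {"subscription_id": "sub-1"})
```

After that publish call only Billing has anything in its inbox. Supply Matching and Fulfilment never hear about the pause unless the map says they should.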

Tom’s resistance

Tom pushes back. “This feels like Java-enterprise-architect nonsense. We’re a startup. We have twelve people, not twelve hundred.”

Charlotte doesn’t dismiss him. “You’re right about the ceremony. DDD has a reputation for being over-engineered. But look at Kai’s PR. Could you review it?”

“No.”

“Could you be confident it wouldn’t break billing?”

“No.”

“That’s the problem DDD solves at your scale. Not coordinating a thousand developers. Being able to change one thing without breaking everything else.”

She shows him the numbers from the teams she’s coached through this. Average PR size before boundaries: 23 files. After: 9 files. Review time dropped by more than half.

Tom looks at the data. “Fine. But if I ever have to write a UML diagram, I’m quitting.”

“Deal.”

That evening, Tom sits in his home office after the kids are asleep. Three monitors, the framed print of his first merged PR, LEGOs on the floor. Sarah comes in with tea.

“You’re quiet tonight.”

“Charlotte wants to carve up the codebase. Draw boundaries.”

“Is she right?”

Tom looks at Kai’s 47-file diff, still open on his centre monitor. “Yeah. Probably. It’s just –” He picks up a LEGO brick. “I built this. All of it. And now someone’s telling me it needs walls.”

Sarah leans against the door frame. “You love making things. But you hate letting anyone help you make them. You’re like your dad.”

Tom’s jaw tightens. His father runs a construction company. Marco, Tom’s brother, works there. Every family dinner, Marco talks about the business and Tom’s dad listens like it matters.

“That’s not fair,” Tom says.

“It’s not a criticism. The codebase isn’t yours any more, Tom. It’s theirs. That’s what growing means.”

She leaves the tea and goes to bed. Tom stares at the monitor for another hour.

The boundaries that don’t stick

Two weeks later, Kai opens another PR for the gift activation flow — what happens when a recipient clicks the link, creates an account, starts receiving boxes. The PR touches three bounded contexts.

Tom says, to nobody in particular: “I told you it wasn’t that simple.”

Charlotte doesn’t defend the diagram. She studies the event flows. The gift activation genuinely requires coordination between subscriptions, billing, and fulfilment. The feature isn’t violating the boundaries — the boundaries were drawn in the wrong place.

“You’re right,” she says to Tom. “The boundaries I drew were a first hypothesis. Let’s redraw them.”

The Subscription and Billing contexts share too many events. They merge into a single “Commercial” context — subscriptions, billing, gifts, pausing. Supply Matching and Fulfilment stay separate.

Tom watches Charlotte erase her own lines and draw new ones. He’d expected her to defend the original design.

“DDD is iterative,” Charlotte says. “The first set of boundaries is always wrong. You find out where by building against them. Kai’s PR told us something about the domain that the workshop didn’t.”

Kai looks at the new map. “So the 47-file PR was useful after all.”

Charlotte smiles. “The most expensive domain discovery session Greenbox ever ran. But yes.”

Making it real

The team refactors incrementally — Charlotte is adamant about no big-bang rewrites. They start with Billing (already somewhat isolated because of the Stripe API), then Supply Matching, then Fulfilment. Three weeks. Not perfect — some leaky abstractions remain — but the major boundaries are drawn.

New joiners can now be pointed at a single context: “You’re working on Supply Matching. Here’s the package. Here are the events. You don’t need to understand Billing to be productive.” Onboarding drops from two weeks to days.

A month later, when Maya announces a corporate catering service — weekly fruit boxes for offices — the bounded contexts prove their worth. Each context changes independently. Nobody’s PR touches 47 files.

The database needs splitting too. Tom’s first migration takes the site down for twenty minutes on a Sunday. Three subscribers email Sam. Tom resolves to learn zero-downtime migrations. The next migration, weeks later, goes live without anyone noticing. Progress.

What comes next

The boundaries are drawn. The team agrees on the contexts. But the wall is about to be painted over, and new joiners can’t read the sticky notes from three cities away. Next: turning the wall into living diagrams.

The next chapter, Drawing the System: From Event Storm to C4, publishes around 16 Jun.

Designing Short-Term and Long-Term Memory for a Bedrock Chat Assistant

2026-06-08T06:00:00+08:00

The situation

A product team is building an AI support assistant for a mid-sized SaaS company. The assistant handles first-line queries (billing, account access, feature questions, refund requests) and escalates to human agents when it can’t. Measured over six weeks of closed beta:

Average conversation length: 15 turns, ranging from five to past thirty.
Return rate: 40% within 30 days. Median return gap eleven days; roughly half reference something from a previous thread, such as “did the refund you mentioned go through?” or “I’m still seeing the login error you helped me with last week”.
Tool use: three or four per conversation. Account lookups, subscription checks, ticket creation.
Platform: Bedrock. Nothing self-hosted.
Team: two backend engineers, one front-end, no dedicated ML-ops.
Compliance: GDPR. Conversation content is personal data; deletion-on-request has to be clean, retention has to be bounded.

What actually matters

“Memory” is two problems, not one. The first is keeping a single conversation coherent: turn fifteen has to know what happened at turn two. The second is recognising a returning user: someone who comes back eleven days later should land on a bot that already knows about their open refund, not one that asks them to retype it. Build both with one mechanism and you usually get one that does neither well, because the two pull in different directions. In-conversation memory has to be right on every turn and fails loudly when it isn’t, which makes it backend plumbing. Cross-visit memory can be approximate, but it has two failure modes that are worse than approximate, which makes it product policy with engineering behind it.

Those two cross-visit failures are worth naming, because they set the privacy bar. Surfacing someone else’s conversation as if it were this user’s is a wrongful-disclosure incident: a stranger’s refund thread pulled up against this user’s login question. Failing to surface this user’s own open refund when they ask about it is milder, a trust dent rather than a breach, but still a product bug. Avoiding the first means per-user isolation has to be airtight, and it can’t be talked out of place by prompt injection. Avoiding the second means retrieval has to work on short, fragmented conversation text, which is exactly what document-retrieval tooling is bad at.

GDPR sets the next bar. When a user asks to be forgotten, every trace of their conversations has to go, cleanly and provably. A design where deletion cascades across four stores is one that eventually fails an audit. Aim instead for one delete call per store, each scoped to an identifier the application already holds. A single opaque per-user key deletes cleanly; per-turn vectors scattered through a shared index behind metadata filters can be made to work, but they’re far harder to stand behind when someone asks you to prove the data is gone.

Then there’s the team: two backend engineers, no ML-ops. Anything that scales with conversation volume is a liability by year two. A summarisation cron firing an LLM call on every session close brings its own eviction policy, retention TTL, and retry logic, all of it infrastructure to own and operate. A managed option that does the same job behind a config flag buys that attention back for the product. The thing you give up is flexibility, and this product never spends it. One seam is worth leaving open, though: a billing agent and a support agent may one day need to share what they each know about the same user, so memory keyed to user identity rather than to a single agent instance is the easier thing to grow into.

What we’ll filter on

Five things the design has to deliver.

In-session coherence. Turn fifteen must be aware of turn two. The agent needs to see the relevant history of this conversation when it generates the next response.
Cross-session recall. A user returning eleven days later should land on a bot that can reasonably answer “what was the last thing we talked about?” without asking them to retype context. Not perfect replay; a usable summary.
Orchestration included. Fifteen turns with three tool calls per conversation means the assistant is planning, calling tools, observing results, and deciding what to do next. The memory solution has to live next to the orchestration, not compete with it.
Retrieval quality for conversational context. Pulling the correct fact from a past conversation is a different retrieval problem from pulling the correct paragraph from a product manual. Conversation data is short, interleaved, and context-dependent.
Operational overhead low enough for two backend engineers. No bespoke orchestration loop, no custom summarisation pipeline, no self-hosted vector database. GDPR delete has to be a button, not a project.

The memory landscape on Bedrock

Four plausible ways to build this.

Bedrock Agents’ built-in memory. A Bedrock Agent is the managed orchestration primitive: model + action groups (tool definitions) + knowledge bases + prompt templates, all wired together so the platform handles the plan-call-observe loop. Memory comes in two layers. Session state is automatic: every InvokeAgent call within a session sees the full conversation history; pass a sessionId and the agent assembles the history itself. Long-term session summaries are opt-in: enable memoryConfiguration with SESSION_SUMMARY, set a memoryId per end user, and set a retention window (1 to 365 days). After each session ends, the agent generates a concise summary and stores it keyed to that memoryId. Delete is a single DeleteAgentMemory call.

DynamoDB-backed session store (build-your-own). Roll the orchestration loop yourself. A Lambda receives the user turn, reads conversation-so-far from DynamoDB (partition key sessionId, sort key turn timestamp), builds the prompt, calls InvokeModel, writes the response back, returns it. Cross-session recall is a second table keyed by user ID holding rolled-up state. Summaries come from an LLM call you write and schedule.

Bedrock Knowledge Bases for long-term recall. Dump transcripts or summaries into S3 and query at runtime for “what’s this user’s history?”. Chunking strategies assume a prose document; conversations are short, fragmentary, and relevance is keyed to who spoke and when. A chunk from someone else’s refund thread retrieved as “relevant” to this user’s login question is a correctness problem with a compliance problem stapled to it.

Custom vector store with conversation embeddings. Embed each conversation (or turn, or summary) with Titan Embeddings V2 and store it in OpenSearch Serverless or pgvector with per-user metadata; then, at session start, query for the current user’s top-k most relevant past interactions. Full control of chunking granularity, metadata filtering, ranking. Also a second stateful system to own alongside DynamoDB.

Side by side

Option	In-session coherence	Cross-session recall	Orchestration included	Retrieval for conversation	Low ops
Bedrock Agents memory	✓	✓	✓	✓	✓
DynamoDB session store (DIY)	✓	✓	✗	✓	✗
Knowledge Bases for past transcripts	✗	—	✗	✗	—
Custom vector store of conversation embeddings	—	✓	✗	✓	✗

Matching the layers to the memory

Figure: one InvokeAgent call. Green dashed reads pull session history and prior-session summaries; red writes append the new turn and, on session end, write the summary. The developer passes sessionId and memoryId; the agent owns the plumbing.

Bedrock Agents memory, in depth

A Bedrock Agent is more than a model invocation; it’s an orchestration surface. Define action groups (tool schemas plus implementing Lambdas), optionally attach knowledge bases, write an instruction prompt, call InvokeAgent with a user turn and a sessionId. The runtime handles the ReAct-style loop.

Session memory is automatic. Every call with the same sessionId sees every prior turn, including tool calls and tool results. Idle timeout defaults to 30 minutes, configurable up to 24 hours. Turn fifteen sees turns one through fourteen because the agent reads them itself.

Long-term summary memory is configuration. Set memoryConfiguration on the agent with enabledMemoryTypes: [SESSION_SUMMARY] and a storageDays retention window. At runtime, pass a memoryId alongside the sessionId, typically a hash of the authenticated user ID. When the session ends, the agent generates a summary using a managed (customisable) prompt and stores it keyed to that memoryId. Subsequent sessions with the same memoryId have the prior summaries injected into context.
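In code, the two halves might look like the sketch below. It assumes a boto3 bedrock-agent-runtime client is passed in as client, so the block itself never touches AWS. The helper names, the 90-day retention, and the choice of SHA-256 for the memoryId are illustrative assumptions. The dictionary is the shape passed as memoryConfiguration when creating or updating the agent.

```python
import hashlib

# Agent-side configuration: passed as memoryConfiguration= when the agent
# is created or updated. 90 days is an illustrative retention window.
MEMORY_CONFIGURATION = {
    "enabledMemoryTypes": ["SESSION_SUMMARY"],
    "storageDays": 90,
}


def memory_id_for(user_id: str) -> str:
    """Derive a stable, opaque memoryId from the authenticated user ID."""
    return hashlib.sha256(user_id.encode("utf-8")).hexdigest()


def ask_agent(client, agent_id, agent_alias_id, user_id, session_id, text):
    """Send one user turn, passing both identifiers on every call.

    client is assumed to be boto3.client("bedrock-agent-runtime").
    """
    response = client.invoke_agent(
        agentId=agent_id,
        agentAliasId=agent_alias_id,
        sessionId=session_id,              # scopes in-conversation history
        memoryId=memory_id_for(user_id),   # scopes cross-session summaries
        inputText=text,
    )
    # The reply arrives as an event stream; text is carried in "chunk" events.
    parts = []
    for event in response["completion"]:
        chunk = event.get("chunk")
        if chunk:
            parts.append(chunk["bytes"].decode("utf-8"))
    return "".join(parts)
```

Hashing keeps the raw user ID out of the memory key while staying deterministic, so the same user maps to the same summaries on every visit.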

Retention and deletion. storageDays sets a TTL; once it lapses, the summary is gone. DeleteAgentMemory with a memoryId wipes everything for that user on demand. GDPR right-to-be-forgotten in one request.

Limits worth naming. memoryId lookup is exact-match, not semantic: there is no built-in vector search for “find users with similar past experiences”. Summaries are bounded in length, so very long histories lose detail over time. Session memory is scoped to an agent instance, so moving a user from a support agent to a billing agent needs application-level plumbing to pass state across.

When build-your-own earns a place

Two situations flip the decision toward DynamoDB + a hand-rolled loop.

When you don’t want the orchestration. Bedrock Agents is opinionated about how tools get called: it runs the loop, chooses the action group, writes the reasoning. A team that needs tighter control over prompts, tool ordering, or failure modes sometimes builds its own loop instead. Session state then has to live somewhere, and DynamoDB is the natural home: partition key sessionId, sort key turn timestamp, TTL for auto-expiry.

When state is richer than turns. Conversations aren’t the only per-session state: a shopping cart, a configured quote, or a workflow status is not naturally a sequence of turns. DynamoDB holds that directly, and the tools read and write it.

Neither flip applies to the two-engineer support bot. Orchestration is standard ReAct-over-tools; state is conversational. Bedrock Agents covers both.
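For teams that do take the build-your-own route, the session table described above (partition key sessionId, sort key turn timestamp, TTL for auto-expiry) could be sketched as follows. The attribute names turnTs and expiresAt, the 24-hour TTL, and the helper names are assumptions. table is assumed to be a boto3 DynamoDB Table resource, and pagination of large result sets is left out.

```python
import time

SESSION_TTL_SECONDS = 24 * 60 * 60  # assumed retention for idle sessions


def turn_item(session_id, role, text, now=None):
    """Build one item: a single turn in a single conversation."""
    now = time.time() if now is None else now
    return {
        "sessionId": session_id,                      # partition key
        "turnTs": int(now * 1000),                    # sort key, ms since epoch
        "role": role,
        "text": text,
        "expiresAt": int(now) + SESSION_TTL_SECONDS,  # TTL attribute, epoch seconds
    }


def append_turn(table, session_id, role, text, now=None):
    table.put_item(Item=turn_item(session_id, role, text, now))


def load_history(table, session_id):
    """Return (role, text) pairs for one session, oldest first."""
    response = table.query(
        KeyConditionExpression="sessionId = :sid",
        ExpressionAttributeValues={":sid": session_id},
        ScanIndexForward=True,  # ascending by sort key
    )
    return [(item["role"], item["text"]) for item in response["Items"]]
```

Everything a managed agent does implicitly (trimming history to fit the context window, summarising on session close, retrying failed writes) would still have to be written around these three functions.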

The hybrid worth knowing. Teams using Bedrock Agents memory often add a small DynamoDB or S3 store for structured cross-session facts (ticket numbers, subscription plan, last-known issue code) that the agent needs reliably, regardless of whether they appear in a generated summary. Summary memory is the prose recall; the DynamoDB table is the structured one. A tool the agent calls to fetch it is the clean seam.

Why Knowledge Bases is the wrong shape for conversations

Four reasons.

Chunking doesn’t match. Knowledge Bases chunk documents (fixed-size with a default of ~300 tokens, hierarchical, or semantic), assuming nearby text is topically coherent. A conversation transcript has rapid speaker alternation, interleaved tool outputs, and short turns; a 300-token chunk can span three sub-topics and two speakers.

Retrieval relevance is topic, not speaker. A vector search for “refund” across a knowledge base of all transcripts will cheerfully return high-similarity chunks from other users’ refund conversations. Compliance problem plus correctness problem. Metadata filtering by user ID helps but has to be attached at ingestion and is less flexible than a native vector store’s.

Summaries vs transcripts. Storing raw transcripts means retrieving fragments. The correct thing to retrieve is summaries, and generating those is the job Bedrock Agents’ long-term memory already does.

GDPR is harder. Deleting a user’s data means locating every chunk that contains their content in a service-managed index, then re-ingesting. DeleteAgentMemory is one call.

Knowledge Bases are correct for “what does our support policy say about refunds?”, a reference corpus shared across users. Wrong for “what did this user say yesterday?”, per-user conversational state.

A worked design

Bedrock Agent wrapping Claude Haiku 4.5: the workload is latency-sensitive and cost-sensitive, and the reasoning bar for first-line support is low enough. Action groups for account lookup, subscription status, ticket create/query. One Knowledge Base attached for the product documentation corpus: the policy memory, not the user memory.
Session memory: on by default. sessionId is the chat-widget session, rotated on explicit “new conversation” or 30 minutes idle.
Long-term summary memory: memoryConfiguration with SESSION_SUMMARY, storageDays: 90. memoryId is sha256(userId): stable per authenticated user, and it doesn’t leak the raw ID. sessionId and memoryId both passed on every InvokeAgent call.
Structured cross-session state: a small DynamoDB table keyed by user ID, holding open ticket IDs, subscription tier, last-issue-code. A GetUserContext action group lets the agent fetch this at conversation start when relevant.
GDPR delete: a Lambda triggered by account closure calls DeleteAgentMemory with the user’s memoryId, deletes the DynamoDB row, records an audit trail.
Retention: summaries lapse after 90 days via storageDays.
Monitoring: CloudWatch on InvokeAgent latency and error rate; a weekly anonymised sample of summaries reviewed for quality.

No dedicated memory database, no custom summarisation cron, no per-user vector index. The memory plumbing comes with the agent.
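The GDPR delete path in that design is short enough to sketch whole. This is an illustration under assumptions, not a reference implementation: agent_client is assumed to be a boto3 bedrock-agent-runtime client, user_table a boto3 DynamoDB Table keyed on userId, and audit_log any object with an append method. The function name and the audit record's fields are made up for the example.

```python
import hashlib
from datetime import datetime, timezone


def memory_id_for(user_id: str) -> str:
    """Same derivation used when invoking the agent: sha256(userId)."""
    return hashlib.sha256(user_id.encode("utf-8")).hexdigest()


def forget_user(agent_client, user_table, audit_log,
                agent_id, agent_alias_id, user_id):
    """One delete call per store, each scoped to an ID the app already holds."""
    memory_id = memory_id_for(user_id)

    # 1. Long-term summaries held by the agent.
    agent_client.delete_agent_memory(
        agentId=agent_id,
        agentAliasId=agent_alias_id,
        memoryId=memory_id,
    )

    # 2. Structured per-user state (open tickets, subscription tier, ...).
    user_table.delete_item(Key={"userId": user_id})

    # 3. Audit trail. Only the hashed ID is recorded here; whether that is
    #    acceptable is a compliance decision, not something the code settles.
    record = {
        "action": "forget_user",
        "memoryId": memory_id,
        "at": datetime.now(timezone.utc).isoformat(),
    }
    audit_log.append(record)
    return record
```

Wrapped in a Lambda handler triggered by account closure, this is the whole deletion story for the two stores that hold user memory.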

What’s worth remembering

Short-term and long-term memory are different problems. Turn-level coherence within one conversation is session state; cross-visit recall is summary state. A single solution rarely does both well unless it was designed for both.
Bedrock Agents memory covers both layers as managed functionality. Session memory is automatic: pass a sessionId. Long-term summary memory is configuration: enable SESSION_SUMMARY, pass a memoryId, set storageDays.
memoryId scopes long-term memory to a user; sessionId scopes session memory to a conversation. Orthogonal identifiers, both passed on every InvokeAgent call when long-term memory is enabled.
DeleteAgentMemory is the GDPR delete button. One API call, scoped to a memoryId. Retention also lapses automatically via storageDays (1 to 365).
Knowledge Bases are for reference corpora, not conversational state. Chunking, retrieval relevance, and per-user isolation all work against using them for past-transcript recall.
DynamoDB fits as a structured-state companion to Bedrock Agents memory. Ticket IDs, subscription tier, and status flags are things the agent fetches via a tool call, not things it summarises in prose. A hybrid is common and clean.
A custom vector store over conversation embeddings is flexibility that costs a team. Justified when cross-user semantic similarity is a product feature; overkill when the product just needs “remember this user”.
Bedrock Agents includes orchestration. Action groups, knowledge bases, and the ReAct-style tool loop come with the agent. Build-your-own means rebuilding that loop: more code to own, with no better outcome for standard shapes.

The answer: use a Bedrock Agent with session memory for in-conversation coherence and long-term summary memory (SESSION_SUMMARY, memoryId per user, storageDays retention) for cross-session recall. Attach a Knowledge Base for product documentation, the reference corpus every user shares. Add a small DynamoDB table of structured per-user state (open tickets, subscription tier) behind a GetUserContext action group. Wire DeleteAgentMemory into the account-closure path for GDPR. The two engineers ship a memory system without operating a memory system.

Search and Planning

2026-06-06T06:00:00+08:00

You open the map app. You type an address. Forty milliseconds later it shows you a path through 3.4 million road segments, optimal in time, accounting for current traffic. There’s no neural network involved. There’s no learning. There’s an algorithm from the 1960s, running on a graph, doing what it has always done.

In the previous five posts we covered AI for problems where the input is text. Classification, retrieval, generation, the lot. This post leaves text behind. Most of what the textbooks call classical AI, the kind in Russell and Norvig’s Artificial Intelligence: A Modern Approach, isn’t about understanding language. It’s about searching through possibilities.

Search algorithms run more production AI than transformers do. They route your packets, plan your warehouse robot’s path, schedule your CI build, find the chess move, plan your delivery route. They’ve been quietly working since the 1960s and they’re not getting replaced.

This post walks through the family, like Before the Transformer but for problem-solving instead of language modelling.

The setup: states, actions, goals

Most search problems share a structure:

States. Configurations of the world. The position of the chess pieces. The location of the delivery truck. The contents of the warehouse robot’s basket.
Actions. Things you can do that change one state into another. Move a piece. Drive to the next intersection. Pick up an item.
A goal. A state (or set of states) you want to reach. Checkmate. The customer’s address. All packages delivered.
A cost. Often there’s a cost on each action (distance, time, fuel, whatever), and you want the cheapest path to the goal.

The search problem is: find a sequence of actions that gets you from the starting state to a goal state, ideally cheaply, ideally fast.

That’s it. That’s the framing. Once a problem is in this shape, decades of algorithms apply.

Uninformed search

These are the algorithms you can write in fifty lines because they don’t know anything about the problem; they just systematically explore.

Breadth-first search (BFS) explores layer by layer, finding the path with the fewest actions. Use it for: small graphs, “fewest-moves” puzzles, finding the nearest matching node. The classic example: solve the 8-puzzle in the minimum number of slides.
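A minimal sketch of BFS, assuming the problem is given as an adjacency dictionary. The graph and node names are made up for illustration. Because the frontier is a first-in, first-out queue, the first time the goal comes off the queue it has been reached by a path with the fewest edges.

```python
from collections import deque


def bfs_path(graph, start, goal):
    """Return a fewest-edges path from start to goal, or None if unreachable."""
    parents = {start: None}          # doubles as the visited set
    frontier = deque([start])        # FIFO queue gives layer-by-layer order
    while frontier:
        node = frontier.popleft()
        if node == goal:
            # Walk the parent links back to the start to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in graph.get(node, []):
            if nxt not in parents:
                parents[nxt] = node
                frontier.append(nxt)
    return None


# Hypothetical example graph.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D", "E"], "D": ["F"], "E": ["F"], "F": []}
print(bfs_path(graph, "A", "F"))
```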

Depth-first search (DFS) goes as deep as possible before backtracking. Use it for: exploring trees, generating permutations, anything where you need a memory-light traversal. Classic example: enumerate all possible game positions.

Iterative deepening (IDS) combines both: do DFS to depth 1, then to depth 2, then to depth 3, and so on. Memory of DFS, completeness of BFS. Used in chess engines for depth-limited search.
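As a rough sketch, iterative deepening is a depth-limited DFS wrapped in a loop that raises the limit. The graph below is invented so that plain DFS would wander down the long branch first. The max_depth cap is an assumption to keep the loop finite on unreachable goals.

```python
def depth_limited(graph, node, goal, limit, path):
    """DFS that gives up once it has used `limit` edges."""
    if node == goal:
        return path
    if limit == 0:
        return None
    for nxt in graph.get(node, []):
        if nxt not in path:  # avoid cycles along the current path
            found = depth_limited(graph, nxt, goal, limit - 1, path + [nxt])
            if found is not None:
                return found
    return None


def iterative_deepening(graph, start, goal, max_depth=20):
    """Run depth-limited DFS with limits 0, 1, 2, ... until the goal appears."""
    for limit in range(max_depth + 1):
        found = depth_limited(graph, start, goal, limit, [start])
        if found is not None:
            return found
    return None


# Hypothetical graph: the A-B-D-E-G branch is long, the A-C-G branch is short.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["G"], "D": ["E"], "E": ["G"]}
print(iterative_deepening(graph, "A", "G"))
```

Only the current path is held in memory at any time, which is the DFS half of the bargain. The repeated shallow passes are the price for getting the shallowest solution first.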

Uniform-cost search is BFS with weighted edges: explore in order of cumulative cost rather than in order of depth. Equivalent to Dijkstra’s algorithm, which you’ve probably implemented at some point.
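A small sketch of uniform-cost search using a priority queue from the standard library. The road network and its weights are invented, chosen so that the cheapest route uses more edges than the route BFS would find.

```python
import heapq


def uniform_cost_search(graph, start, goal):
    """Return (total_cost, path) for the cheapest path, or None if unreachable."""
    frontier = [(0, start, [start])]   # ordered by cumulative cost
    best = {start: 0}
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if cost > best.get(node, float("inf")):
            continue                   # stale entry: a cheaper route was found later
        for nxt, weight in graph.get(node, []):
            new_cost = cost + weight
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(frontier, (new_cost, nxt, path + [nxt]))
    return None


# Hypothetical weighted graph: node -> list of (neighbour, cost).
roads = {
    "depot": [("a", 4), ("b", 1)],
    "b": [("a", 2), ("c", 7)],
    "a": [("c", 1)],
    "c": [],
}
print(uniform_cost_search(roads, "depot", "c"))
```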

These are the workhorses. They’re old, they’re simple, and they show up everywhere.

Informed search: A* and friends

The big jump happens when you have a heuristic, a function that estimates how far each state is from the goal. With a heuristic, you don’t explore blindly. You explore in order of “most promising next.”

A* (Hart, Nilsson, and Raphael, 1968) is the algorithm. It expands states in order of g(n) + h(n), where g is the cost to reach the state and h is the heuristic estimate of cost to the goal. If h never overestimates the true cost (an “admissible” heuristic), A* is guaranteed to find the optimal path.

A* runs:

Your map app, with a heuristic of straight-line distance to the destination.
Pathfinding in games, where the units need to walk around walls efficiently.
Robot path planning, both in factories and in self-driving cars.
Puzzle solving, with heuristics like “number of misplaced tiles” for the 15-puzzle.
Build systems, finding the minimum-work path through a dependency graph.

A* works because a good heuristic can collapse the search space dramatically, from “explore everything” to “explore only what’s plausible.”
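Here is a minimal A* sketch on a small grid, assuming four-directional moves that each cost 1. Manhattan distance stands in for the straight-line heuristic; on this kind of grid it never overestimates, so it is admissible. The grid layout is made up.

```python
import heapq


def astar_steps(grid, start, goal):
    """Return the number of steps on a shortest path, or None if blocked.

    grid is a list of strings; '#' marks a wall. Cells are (row, col).
    """
    rows, cols = len(grid), len(grid[0])

    def h(cell):
        # Manhattan distance: admissible for unit-cost 4-directional moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    g = {start: 0}                       # cheapest known cost to each cell
    frontier = [(h(start), start)]       # ordered by f = g + h
    while frontier:
        f, cell = heapq.heappop(frontier)
        if cell == goal:
            return g[cell]
        if f > g[cell] + h(cell):
            continue                     # stale entry
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != "#":
                new_g = g[cell] + 1
                if new_g < g.get((nr, nc), float("inf")):
                    g[(nr, nc)] = new_g
                    heapq.heappush(frontier, (new_g + h((nr, nc)), (nr, nc)))
    return None


# Hypothetical map: the wall forces a detour through the bottom row.
grid = [
    "..#.",
    "..#.",
    "....",
]
print(astar_steps(grid, (0, 0), (0, 3)))
```

Setting h to return 0 everywhere turns this into uniform-cost search, which is a handy way to see what the heuristic is buying you.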

When A* won’t fit in memory, you have variants: IDA* (iterative-deepening A*), SMA* (memory-bounded A*), D* (dynamic A* for changing environments). These are all in the toolbox for production pathfinding.

Local search and metaheuristics

When the search space is too big to enumerate, you give up on optimality and try to find a good answer rather than the best one. This is local search.

Hill climbing starts from a random state and moves to the best neighbour. Simple, fast, gets stuck in local optima. Good enough for many problems.

Simulated annealing (Kirkpatrick, Gelatt, and Vecchi, 1983) hill-climbs but occasionally accepts a worse move (more often early, less often later). The “annealing” comes from metallurgy, cooling slowly to find a better global structure. Workhorse for layout problems, scheduling, and combinatorial optimisation.
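A rough sketch of the acceptance rule, assuming a geometric cooling schedule and the usual exp(-delta / temperature) probability for accepting a worse move. The bumpy cost function, the starting temperature, the cooling rate, and the step count are all illustrative choices, not tuned values.

```python
import math
import random


def simulated_annealing(cost, neighbour, start, steps=5000, t0=10.0, cooling=0.999, seed=0):
    """Minimise cost(); return the best state seen along the way."""
    rng = random.Random(seed)            # seeded so runs are repeatable
    current = best = start
    temperature = t0
    for _ in range(steps):
        candidate = neighbour(current, rng)
        delta = cost(candidate) - cost(current)
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature drops.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current = candidate
            if cost(current) < cost(best):
                best = current
        temperature *= cooling
    return best


def bumpy(x):
    # Hypothetical landscape: a broad bowl with many small local minima.
    return (x - 40) ** 2 / 100 + 3 * math.sin(x)


def step(x, rng):
    return x + rng.choice((-1, 1))


best = simulated_annealing(bumpy, step, start=0)
print(best, bumpy(best))
```

Remove the probabilistic branch and you are left with plain hill climbing, which on this landscape tends to stall in the first dip it finds.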

Genetic algorithms maintain a population of candidate solutions, combine pairs (“crossover”), perturb them (“mutation”), and select the fittest to breed. Used for design-space exploration, hyperparameter tuning before Bayesian methods, and antenna design (NASA has flown a genetic-algorithm-designed antenna).

Tabu search keeps a list of recently-visited states and refuses to revisit them, forcing the search to explore new territory.

These are not the prestige algorithms of the field, but they’re the practical answer for “I have a giant combinatorial problem and I need a reasonable solution by Friday.”

Adversarial search: games

When you’re playing against an opponent, single-agent search isn’t enough: you need to anticipate what they’ll do. Enter adversarial search.

Minimax is the basic game-tree search: assume both players play optimally, and pick the move that maximises your worst-case outcome. The tree branches at every move, with you maximising at your turns and the opponent minimising at theirs.

Alpha-beta pruning is the optimisation that makes minimax practical. By tracking the best score the maximiser is assured of (alpha) and the best the minimiser is assured of (beta), large parts of the search tree can be pruned without affecting the result. A well-tuned alpha-beta search can go many plies deeper than naive minimax in the same time.

This is the algorithmic core of:

Chess engines (Stockfish, the strongest classical chess program, is alpha-beta with extensive engineering).
Checkers, Go (pre-AlphaGo), Othello, and most other deterministic two-player games.
Monte Carlo tree search (MCTS), which is what AlphaGo and AlphaZero used: a different strategy, but still adversarial search at heart.

Even the deep-learning-based game systems use search. AlphaGo combined a neural network for evaluation with MCTS for search. Stockfish has used a neural network evaluation (NNUE) since version 12 but still does alpha-beta search through the tree. The search is the engine; the neural net is the heuristic.
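For a feel of how little code minimax with alpha-beta pruning needs, here is a sketch over a toy game tree written as nested lists, where a number is a leaf score for the maximising player. The tree and the visited list (used only to show which leaves were evaluated) are illustrative assumptions.

```python
def alphabeta(node, maximising, alpha=float("-inf"), beta=float("inf"), visited=None):
    """Minimax value of a nested-list game tree, with alpha-beta pruning."""
    if not isinstance(node, list):       # leaf: a static evaluation score
        if visited is not None:
            visited.append(node)
        return node
    if maximising:
        value = float("-inf")
        for child in node:
            value = max(value, alphabeta(child, False, alpha, beta, visited))
            alpha = max(alpha, value)
            if alpha >= beta:
                break                    # the minimiser will never allow this line
        return value
    value = float("inf")
    for child in node:
        value = min(value, alphabeta(child, True, alpha, beta, visited))
        beta = min(beta, value)
        if alpha >= beta:
            break                        # the maximiser already has something better
    return value


# Hypothetical two-ply tree: we move, then the opponent replies.
tree = [[3, 5], [2, 9]]
seen = []
print(alphabeta(tree, True, visited=seen), seen)
```

In the second branch the opponent can hold us to 2, which is already worse than the 3 guaranteed by the first branch, so the leaf scored 9 is never looked at. The answer is the same as plain minimax would give.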

Constraint satisfaction problems

Slightly different shape: you have a set of variables, each with a domain of possible values, and a set of constraints between them. You want an assignment of values to variables that satisfies all constraints.

Examples:

Sudoku. Variables are cells, domains are 1-9, constraints are row/column/box uniqueness.
Map colouring. Variables are regions, domains are colours, constraints are “adjacent regions different.”
Class scheduling. Variables are courses, domains are time slots, constraints are room/teacher/student conflicts.
Configuration. Variables are component choices, domains are products, constraints are compatibility.

The classical algorithm is backtracking with constraint propagation: pick a variable, try a value, propagate the implications, recurse, backtrack on failure. The cleverness is in the propagation (arc consistency, forward checking, unit propagation), which prunes the search space dramatically.
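A sketch of that loop on the map-colouring example, using forward checking as the propagation step and a smallest-domain-first rule for picking the next variable. The adjacency data is the familiar Australian states map from the textbooks; the function and variable names are my own.

```python
def colour_map(neighbours, colours):
    """Backtracking search with forward checking; returns {region: colour} or None."""

    def backtrack(assignment, domains):
        if len(assignment) == len(neighbours):
            return assignment
        # Pick the unassigned region with the fewest colours left.
        region = min(
            (r for r in neighbours if r not in assignment),
            key=lambda r: len(domains[r]),
        )
        for colour in domains[region]:
            new_domains = dict(domains)
            new_domains[region] = [colour]
            dead_end = False
            # Forward checking: strike this colour from unassigned neighbours.
            for other in neighbours[region]:
                if other in assignment:
                    continue
                new_domains[other] = [c for c in domains[other] if c != colour]
                if not new_domains[other]:
                    dead_end = True      # a neighbour has no colour left
                    break
            if not dead_end:
                result = backtrack({**assignment, region: colour}, new_domains)
                if result is not None:
                    return result
        return None                      # every colour failed here: backtrack

    return backtrack({}, {r: list(colours) for r in neighbours})


AUSTRALIA = {
    "WA": ["NT", "SA"],
    "NT": ["WA", "SA", "Q"],
    "SA": ["WA", "NT", "Q", "NSW", "V"],
    "Q": ["NT", "SA", "NSW"],
    "NSW": ["Q", "SA", "V"],
    "V": ["SA", "NSW"],
    "T": [],
}
solution = colour_map(AUSTRALIA, ["red", "green", "blue"])
print(solution)
```

With only two colours the search fails almost immediately: colouring WA empties the options for one of its two mutually adjacent neighbours, and forward checking notices before any deeper recursion happens.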

Industrial CSP solvers (Google OR-Tools, Choco, MiniZinc) handle problems with millions of variables and constraints. They run:

Hospital staff scheduling.
Aircraft and crew scheduling.
Hardware verification.
Network configuration.
Supply-chain optimisation.

There’s no learning involved. Just very-well-engineered search through structured spaces.

Planning

Planning is the version of search where the action descriptions are more abstract. Instead of a graph of states, you have:

A description of the world in terms of facts (the box is at location A, the gripper is empty, the door is open).
A library of actions, each described by preconditions (what must be true to execute) and effects (what becomes true / false after executing).
A goal expressed in terms of facts (the box is at location B).

The classical algorithm here is STRIPS (Stanford Research Institute Problem Solver, 1971) and its descendants. Modern planners (Fast Downward, LAMA, ENHSP) can handle much richer planning problems with continuous variables, time, and resources.
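To make the facts, preconditions, and effects concrete, here is a toy STRIPS-style representation with a breadth-first forward search over sets of facts. This is a teaching sketch, not how Fast Downward or any production planner works internally. The fact and action names are invented for the box-and-gripper example.

```python
from collections import deque
from typing import NamedTuple


class Action(NamedTuple):
    name: str
    pre: frozenset      # facts that must hold before the action
    add: frozenset      # facts that become true
    delete: frozenset   # facts that become false


def plan(initial, goal, actions):
    """Return the shortest list of action names reaching the goal, or None."""
    start, goal = frozenset(initial), frozenset(goal)
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, steps = frontier.popleft()
        if goal <= state:                       # every goal fact holds
            return steps
        for action in actions:
            if action.pre <= state:             # preconditions satisfied
                nxt = (state - action.delete) | action.add
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, steps + [action.name]))
    return None


f = frozenset
actions = [
    Action("pick-up-at-A", f({"robot-at-A", "box-at-A", "gripper-empty"}),
           f({"holding-box"}), f({"box-at-A", "gripper-empty"})),
    Action("move-A-to-B", f({"robot-at-A"}), f({"robot-at-B"}), f({"robot-at-A"})),
    Action("move-B-to-A", f({"robot-at-B"}), f({"robot-at-A"}), f({"robot-at-B"})),
    Action("put-down-at-B", f({"robot-at-B", "holding-box"}),
           f({"box-at-B", "gripper-empty"}), f({"holding-box"})),
]
steps = plan({"robot-at-A", "box-at-A", "gripper-empty"}, {"box-at-B"}, actions)
print(steps)
```

Nothing in the search knows about boxes or grippers. All the domain knowledge lives in the action descriptions, which is the point of the planning formulation.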

Planning runs:

Robot task planning. “Put the cup on the shelf” decomposed into a sequence of low-level actions.
Logistics and delivery routing in complex domains.
Game AI for non-player-character behaviour, particularly Goal-Oriented Action Planning (GOAP) in commercial game engines.
Spacecraft autonomy. NASA’s Remote Agent on Deep Space 1 was a planner.

Planning is less famous than minimax or A*, but in domains where you need to reason about long action sequences, it’s the correct tool.

A decision table

If your task is…	Reach for…
Shortest route between points on a graph	Dijkstra (no heuristic) or A* (with one)
Pathfinding for a unit in a 2D/3D world	A* on a navigation grid or mesh
A two-player perfect-information game	Alpha-beta or MCTS, with a learned or hand-crafted evaluation
Solving a puzzle (Sudoku, n-queens, scheduling)	A CSP solver (OR-Tools, MiniZinc)
Optimising a hard combinatorial problem	Simulated annealing or genetic algorithm if exact methods are infeasible
Sequencing actions for a robot or workflow	A planner (PDDL + Fast Downward, or GOAP for games)
Routing many vehicles to many destinations	A vehicle-routing solver (built on CSP / mixed-integer programming)
Build-system dependency resolution	Topological sort + Dijkstra or DAG-aware scheduler

Search vs ML: when each wins

Search and machine learning solve different shapes of problem.

Search wins when:

The state space and action space are well-defined.
Optimality (or near-optimality) is the goal, not “good enough.”
Problems are deterministic and the rules are knowable.
You can write a heuristic that estimates progress.

ML wins when:

The state space is fuzzy or perceptual (images, raw text).
“Good enough” is fine and you can’t define optimal.
The rules are statistical, not deterministic.
You have lots of examples to learn from.

The most powerful systems combine both. AlphaZero is search guided by a learned heuristic. Self-driving cars use planning over a perception layer that’s deep learning. Modern logistics systems use ML to predict demand and search to plan delivery.

Most of the AI in the textbooks before deep learning was some flavour of search. Those textbooks weren’t wrong; they were a generation early, and most of the algorithms they covered are still in production. A* still finds the route in the map app. Alpha-beta still drives the strongest classical chess engines, and even AlphaZero is search guided by a learned evaluator rather than pure neural-network play. CSP solvers schedule the hospital, the airline, and the supply chain. STRIPS-descended planners sequence robot actions and ran the autonomy on Deep Space 1. When exact search doesn’t fit, simulated annealing and genetic algorithms produce respectable answers by Friday.

Search and ML solve different shapes of problem. Search wins where the state space is well-defined, the rules are knowable, and you can write a heuristic that estimates progress. ML wins where the input is perceptual, the rules are statistical, and “good enough” is the bar. The best systems combine them: ML for perception and evaluation, search for the sequencing that has to be optimal. The pattern was always going to be both.

The next chapter, Knowledge, Logic, and Constraints, publishes around 13 June.

The Workshop: Business Model Canvas

2026-06-05T06:00:00+08:00

Nine boxes on one page. The Business Model Canvas shows where revenue comes from, what it costs to serve a customer, and which assumptions hold the whole thing together: an answer to “does this business actually work?” that you can read in five minutes. Worked example: Does This Actually Work?

Business Model Canvas

The Business Model Canvas (BMC) lays out how a business creates, delivers, and captures value on a single page of nine boxes, filled in customer-first order, so the team can see whether the whole thing holds together and where the most dangerous assumptions live. It was invented by Alexander Osterwalder as part of his PhD, published with Yves Pigneur in Business Model Generation (2010), and is now one of the most widely used strategic tools in any discipline. It is sometimes confused with a business plan (a long document assuming the model works; the Canvas is a one-page hypothesis about whether it will) or with Ash Maurya’s Lean Canvas (Running Lean, 2012), which swaps four boxes for Problem, Solution, Key Metrics, and Unfair Advantage. Reach for Lean Canvas before product-market fit, and for BMC once there’s something to articulate.

At a glance

Who, for how long: a facilitator, the founder or business owner, a product person, customer-facing people (sales, support, marketing), one or two developers, and operations. Four to six people, around two hours.
What you walk out with: a populated nine-box Canvas with photographs and a digital transcription, an explicit contradiction list from the read-aloud review, and the riskiest assumptions flagged for follow-up testing.
When to reach for it: a new business or product line, an investor / board / new-hire explanation, or a suspicion that pricing, costs, and value proposition don’t actually fit together. Not for sprint planning or feature prioritisation, and not when nobody in the room understands the economics (do JTBD first if the customer or value proposition isn’t yet clear).

What’s It For

“Can everyone in the room describe the business model the same way?”

That’s the question a facilitator asks at the start of a Canvas session, and it’s the question that makes the room go quiet. Not because nobody in the room knows the business model; they each know a version of it. The founder knows the pricing and the margin assumptions. The developer knows the delivery architecture and roughly what it costs to run. The ops lead knows the wastage rate on unsold perishables. The product person knows the customer acquisition story. Each version is coherent on its own. None of the versions agree with each other, and the business model isn’t any single one of them; it’s the intersection of all of them, which is a thing nobody has ever looked at as a single object.

The gap shows up quietly. A founder whose pricing assumptions haven’t met the ops lead’s wastage numbers discovers, one quiet Sunday, that each box costs $41 to source, pack, and deliver, and they’ve been selling them for $35. Every new subscriber is losing the business money. The faster they grow, the faster they go bankrupt. Nobody did anything wrong. Each person was right about their own box. The failure was that the boxes were never put on one page where somebody had to read them out loud next to each other.
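The arithmetic behind “the faster they grow, the faster they go bankrupt” is short enough to write down. This sketch uses the $35 price and $41 cost from the paragraph above; the subscriber counts are hypothetical, and it assumes one box per subscriber per week.

```python
PRICE_PER_BOX = 35   # dollars charged per box (from the example above)
COST_PER_BOX = 41    # dollars to source, pack, and deliver one box


def weekly_contribution(subscribers, price=PRICE_PER_BOX, cost=COST_PER_BOX):
    """Dollars gained (or lost, if negative) per week across all subscribers."""
    return (price - cost) * subscribers


# Hypothetical subscriber counts, assuming one box each per week.
for subscribers in (100, 500, 1000):
    print(subscribers, weekly_contribution(subscribers))
```

Each box loses $6, so the weekly loss scales linearly with the subscriber count. Growth multiplies the gap rather than closing it.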

The Business Model Canvas exists to be that one page. Nine boxes, filled in customer-first order, read out loud in pairs at the end so the arithmetic and the logic both have to survive contact with the rest of the model. You can’t admire the value proposition without seeing what it costs to deliver. You can’t celebrate the pricing without seeing the operational burden. You can’t forget about customer acquisition because there’s a box for it on the same page as the revenue streams it feeds.

Reach for it when:

You’re starting a new business or a major new product line
You need to explain the business model to investors, new hires, a board, or yourself
You suspect parts of the business model don’t fit together: the value proposition doesn’t match the revenue model, or the costs don’t support the pricing
You’re comparing two different business model options and need a side-by-side view
An existing business is drifting and you want to diagnose which part of the model has changed

What It’s Not For

Skip it when:

The business model is established, well-understood, and not in question. You’d be documenting, not discovering.
You’re planning features or sprints. The Canvas is strategic, not tactical.
You don’t have anyone in the room who understands the economics (revenue, costs, margins). The Canvas will have holes exactly where it needs to be sharpest.

Stop a session that’s already started if:

Four or more of the nine boxes are pure guesses, making the Canvas mostly fiction; better to pause and do research first
The founder refuses to engage with contradictions surfaced during review
The room is missing the person who owns the cost structure or the pricing, and multiple boxes depend on them

Stopping and fixing the inputs is not failure. Producing a Canvas that papers over an incoherent business is.

Definitions & Background

The nine boxes of the Canvas, with the role each one plays:

Customer Segments: who you’re serving. Specific groups of people or organisations.
Value Propositions: what value you deliver to each segment. The benefit as the customer experiences it, not the feature you build.
Channels: how you reach and deliver to each segment, across awareness, evaluation, purchase, delivery, and after-sales.
Customer Relationships: how you acquire, retain, and grow each segment. Personal, automated, community, co-creation.
Revenue Streams: what customers pay, how, and how much. Pricing models with numbers attached.
Key Resources: the assets essential to delivering the value propositions. Physical, intellectual, human, financial.
Key Activities: the most important things the business must do well. Essential and distinctive, not every task.
Key Partners: who you depend on. Suppliers, partners, services who could break you if they disappeared.
Cost Structure: the most significant costs in the model. Fixed, variable, one-time.

Boxes are filled in customer-first order, not left-to-right. Start with the customer, trace the value outward (segments → propositions → channels → relationships → revenue), then trace the economics backward (activities → resources → partners → costs). Starting with costs produces a defensive Canvas. Starting with customers produces a strategic one.

The reading-aloud ritual. At the end, pairs of boxes are read out loud next to each other: Revenue Streams next to Cost Structure, Value Propositions next to Customer Relationships, Customer Segments next to Channels. The arithmetic and the logic both have to survive contact with the rest of the model. Most Canvas sessions produce at least one coherence break in this phase. The value of the session is finding it.

Inputs

An idea of the business or product concrete enough to test. “A weekly subscription produce box” is enough. “Something with food, maybe” is not.
People who can speak to different parts of the business. No single participant will know all nine boxes, but collectively the room should. Customer-facing voices, economic voices, operational voices.
A wall or large surface with the nine-box Canvas drawn on it (printed, taped up, or projected), sticky notes, and pens. Roughly two hours, uninterrupted.
Numbers, even rough ones, for pricing and the major costs. The Canvas can survive estimates labelled as estimates; it can’t survive nine boxes of vibes.

If the team can’t yet articulate the value proposition or the customer, run JTBD first to clarify which job the customer is hiring the product to do. JTBD output feeds directly into the Value Propositions box.

Outputs

What lands on the Canvas at the end:

A populated Canvas: nine boxes with sticky notes, photographed from directly in front and as close-ups of each box, with the notes readable.
A digital transcription: the Canvas captured in Miro, Mural, Figma, or a slide template, with every sticky note carried across.
A list of contradictions found during the review phase, each as its own line item: “Revenue $35 vs variable cost $41, losing $6 per sale”, “Personal-connection value vs automated relationship”, etc.
A list of empty or shaky boxes: the ones the room couldn’t fill confidently. These are findings, not failures.
An explicit list of the riskiest assumptions baked into the model, usually concentrated in Revenue Streams, Cost Structure, and Customer Segments.

These outputs feed straight into:

Assumption Mapping. The natural follow-up. A Canvas is nine boxes of beliefs; Assumption Mapping surfaces and tests them. Run it specifically on Revenue Streams and Cost Structure, where incorrect assumptions are fatal.
Impact Mapping. The Canvas sets the strategy; Impact Mapping picks the deliverables to execute against it. Canvas first for a new business; Impact Mapping first for an existing business with a clear goal.
User Story Mapping. Once the Canvas is coherent, Story Mapping turns the value proposition into a user journey and a release plan.
Wardley Mapping. The Canvas shows what the business is; Wardley Mapping shows where its components sit in the evolution of the market and therefore how they should be treated strategically. Canvas answers “what”; Wardley answers “where.”
Event Storming. Once the Canvas is agreed, Event Storming maps the processes the business will actually run to deliver on it. Canvas sets the shape; Event Storming maps the operations.

Who’s Needed

Four to six people, around two hours:

Facilitator. Holds the box order, keeps the conversation moving, and catches when a feature has been smuggled into the Value Propositions box.
Founder or business owner. Mandatory. They’re the only person who knows (or at least believes they know) the economics, the pricing, the costs, and the margins. Without them, the Canvas will have the most important boxes filled in with guesses.
Product person. They’ll anchor the value propositions and the channels, and they’ll translate between the founder’s business framing and the team’s delivery framing.
Customer-facing people. Whoever talks to actual customers: sales, support, marketing, account managers, operations staff who handle complaints. They will contradict the optimistic assumptions in the room, which is exactly why they’re there.
Developers. One or two. They need to understand the model they’re building for. They will also catch the technical assumptions baked into the Key Resources and Key Activities boxes that nobody else will notice.
Operations / SRE. For any business where operations are a non-trivial cost or a differentiator (which is most of them), ops is a first-class participant. The Cost Structure box is often where ops has the most to say, and what they say is often unwelcome but essential.

The Canvas works by conversation between perspectives, and the conversation collapses above six. Below four, you don’t have enough perspectives to challenge each other.

Who to leave out:

Investors and board members. They see the Canvas as an output, not during the conversation. Their presence changes what the team will say out loud.
Large stakeholder groups. If ten people need to shape the model, run a pre-session to agree the goal and come to the Canvas with the group down to six.
Pure feature-thinkers. Someone who can only discuss what to build, not why or for whom or at what margin, will turn the Value Propositions box into a feature list and the whole Canvas drifts.

How To Run It

Phase	Box	Duration	Key question
1	Customer Segments	10 min	“Who are we serving?”
2	Value Propositions	15 min	“What value do we deliver?”
3	Channels	10 min	“How do we reach and deliver?”
4	Customer Relationships	10 min	“How do we acquire, retain, and grow?”
5	Revenue Streams	10 min	“What do they pay, and how?”
6	Key Activities	10 min	“What must we actually do?”
7	Key Resources	10 min	“What do we need to deliver this?”
8	Key Partners	10 min	“Who do we depend on?”
9	Cost Structure	10 min	“What does it all cost?”
10	Review for coherence	15 min	“Does the maths work?”
Total		~2 hours

Order matters. Start with Customer Segments because every other box is defined in terms of the customer. End with Cost Structure because by the time you get there, you know what you’re doing, for whom, how, and how it’s delivered. Only then can you add up what it costs.

The Canvas is a round-the-room conversation moderated by the facilitator, with notes going up in the current box only. Everyone speaks in every box, but the domain expert for the box takes the lead:

Customer Segments: the customer-facing people lead; everyone else pressure-tests specificity.
Value Propositions: the founder and product person lead; everyone else challenges whether the claimed value is actually the value the customer experiences.
Channels and Customer Relationships: marketing, sales, and support lead; the developers listen hard because these boxes define half of what they’ll need to build.
Revenue Streams: the founder leads, with support from anyone who knows the market. Numbers get written down, even rough ones.
Key Resources, Activities, Partners: operations and developers lead; the founder listens hard because this is where their optimism meets operational reality.
Cost Structure: operations and founder together. Developers add the technology costs.
Review: everyone. The reading-it-aloud ritual is where the coherence check happens.

The rhythm is customer outward, then back through to costs, then check the maths.

Phase 1: Customer Segments (10 minutes)

Point at the Customer Segments box and ask:

“Who exactly are we creating value for? I want specifics. Not ‘everyone who eats food’ or ‘health-conscious consumers.’ Specific enough that I could walk down a street and tell you whether the person next to me is one or not.”

Write each segment on a sticky note and place it in the box. Push hard for specificity:

“‘Busy families’ is closer. Which busy families? Dual-income, both parents working full-time, kids at school, lives in a city with limited supermarket access after 7pm? Now we have an actor whose behaviour we can actually influence.”

If there are multiple segments, rank them. One primary, one or two secondary. Businesses rarely serve three primary segments well in the first year.

What to watch for:

Too broad. “People who eat food.” Push for age, geography, behaviour, pain point, or life stage.
Too many segments. More than three or four is a startup trying to be everything. Pick the one or two that matter most and park the rest.
Confusing users with customers. The person who uses the product and the person who pays may be different. If a company buys boxes for employees, the company is the customer and the employee is the user. Capture both and note which one pays.
The absent customer. Nobody in the room knows the segment concretely because they’ve never spoken to one. That’s a finding; note it, because it will shape which assumptions you test later.

Phase 2: Value Propositions (15 minutes)

For each customer segment, ask:

“What value are we delivering to this segment specifically? What problem are we solving, what pain are we relieving, what job are we helping them get done (Jobs to be Done: the framing that customers hire products to get a specific job done; see JTBD workshop for the deeper version)? I want the benefit as the customer would describe it, not the feature we’d describe.”

Write value propositions on sticky notes in the box and connect them, visually or by proximity, to the segment they serve.

Value propositions come in several flavours and a good Canvas usually has a mix:

Functional: “Fresh produce at the door every Wednesday without having to plan for it”
Problem-solving: “No more panicked supermarket trip on a Tuesday night”
Emotional: “Feel good about supporting local farms”
Economic: “Better value than buying organic at the supermarket, including the time saved”

What to watch for:

Features disguised as value. “We have a mobile app” is a feature. “Manage your subscription in thirty seconds from your phone” is a value proposition. Push for the benefit, not the mechanism.
Value that doesn’t match segment. If the segment is “busy professionals” and the value proposition is “learn about seasonal farming,” something is off. Challenge it: “Would a busy professional sign up for this reason?”
Founder passion masquerading as value. The founder may love supporting small farms; the customer may just want fresh produce at their door. Both can be true, but the Canvas should reflect the customer’s experienced value, not the founder’s internal motivation.
Too many value propositions per segment. If a segment has eight value propositions, the team doesn’t know which one is actually the reason the customer buys. Rank them and put the top two forward.

Phase 3: Channels (10 minutes)

Ask:

“How do we reach our customers? How do they hear about us, how do they decide to try us, how do we actually deliver the value to them, and how do we support them afterwards?”

Channels include awareness, evaluation, purchase, delivery, and after-sales. A good Channels box covers all five phases, not just the sexy acquisition ones.

What to watch for:

Only digital channels. For a physical product like a produce box, the delivery channel (courier, pick-up, post) is critical and often the biggest operational constraint. Don’t forget it.
Missing acquisition. The team knows how to deliver but has no plan for how customers will find them. That’s a gap worth flagging loudly.
Unreal channels. “We’ll go viral on TikTok” is not a channel strategy, it’s a wish. Push: “What specifically will we do on TikTok? Who runs the account? How do we measure whether it works?”
Every channel is owned by the founder. That’s a scaling ceiling. Worth noting now, even if you don’t solve it.

Phase 4: Customer Relationships (10 minutes)

Ask:

“What kind of relationship do we maintain with each segment? How do we acquire them, keep them, and grow what they spend with us?”

Relationships come in flavours: personal (dedicated account manager, farmer liaison), automated (emails, notifications, self-service), community (forums, social media groups, events), co-creation (customers help pick produce, vote on weekly boxes).

What to watch for:

Relationship / value proposition mismatch. If the value is personal connection to local farms but the relationship is entirely automated, something doesn’t fit. The customer signed up for connection and is getting a chatbot.
No retention strategy. Acquiring subscribers is expensive. How do you keep them? If nobody in the room has an answer, that’s a high-impact assumption to flag.
Every customer gets the same relationship. Different segments often need different relationships. A family subscriber and a corporate gift-giver behave differently and need to be managed differently.

Phase 5: Revenue Streams (10 minutes)

Ask:

“What exactly are customers paying for, how do they pay, and how much? I want numbers, even if they’re rough.”

Pricing models include: subscription fees (weekly, monthly, quarterly), per-box pricing with different tiers, add-ons, gift subscriptions, one-off purchases, referral credits.

Write each revenue stream as a sticky note with the price attached. “$35 per box, weekly” not “subscription fee.”

What to watch for:

Vague pricing. “They’ll pay a fair price” is not a revenue stream. Push for numbers: “If we had to set a price today, what would it be?”
Only one revenue stream. Not necessarily wrong, but fragile. Are there adjacent revenue opportunities (add-ons, gifts, upgrades) the team hasn’t considered? At least note them.
Pricing that ignores willingness to pay. “We need $50 per box to cover costs.” That’s a cost-plus position, not a market-led one. Note it; you’ll return to it in the coherence check.
Revenue shapes, not just revenue amounts. “$35 per box, weekly” is different from “$140 per month, billed on the first,” which is different again from “$1600 per year with a renewal window.” The shape of the revenue determines the shape of the cost structure you need to cover, and which box the failure mode hides in.
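To see how different those three revenue shapes are, it helps to annualise them. This is a back-of-envelope sketch using the three prices quoted above, and it assumes a subscriber who stays a full year and never skips a delivery. The class and field names are mine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RevenueShape:
    label: str
    amount: int             # dollars per charge
    charges_per_year: int   # how many times the customer is billed in a year

    @property
    def annual_revenue(self):
        return self.amount * self.charges_per_year


shapes = [
    RevenueShape("per box, weekly", 35, 52),
    RevenueShape("per month, billed on the first", 140, 12),
    RevenueShape("per year, with a renewal window", 1600, 1),
]

for shape in shapes:
    print(f"${shape.amount} {shape.label}: ${shape.annual_revenue} a year "
          f"across {shape.charges_per_year} charge(s)")
```

Under those assumptions the weekly shape brings in the most per year but gives the customer 52 chances to leave, while the annual shape brings in the least but collects all of it up front.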

Phase 6: Key Activities (10 minutes)

Ask:

“What are the most important things we must do to make this business work? Not every task; the essential, distinctive activities.”

Activities might include: sourcing produce from farms, curating and packing boxes, operating delivery logistics, managing the subscriber platform, running customer acquisition marketing, handling support, navigating food safety regulations.

What to watch for:

Listing every task in the business. Key activities are essential AND distinctive. “Payroll” is an activity but not a key one unless payroll is your business.
No mention of the hard things. The activities that are difficult AND essential are the ones that matter most. If sourcing seasonal produce at consistent quality is the hardest part of the business, it should be prominent in this box.
Forgetting acquisition as an activity. Teams treat sales and marketing as “things that happen” rather than activities the business must do well. If customer acquisition is hard, it belongs here.

Phase 7: Key Resources (10 minutes)

Ask:

“What do we need in order to deliver the value propositions? What assets are essential to this business model? Physical, intellectual, human, financial.”

Resources include: physical (warehouse, refrigerated transport, packing equipment), intellectual (software, algorithms, brand, data, supplier relationships), human (team, expertise, farmer relationships), financial (capital, credit lines, working capital for perishable inventory).

What to watch for:

Forgetting people. Teams list technology and forget the farm relationships lead, the customer support agent, the on-call engineer, the person who drives the van at 5am. People are resources.
Aspirational resources. Don’t list what you wish you had; list what you actually need to make this work, and note which of those you don’t yet have.
Missing the non-obvious. “Refrigerated storage” is obvious. “A supplier network you trust enough to bet perishable inventory on” is less obvious and often more important.

Phase 8: Key Partners (10 minutes)

Ask:

“Who do we depend on to make this work? Suppliers, partners, services we can’t deliver without? Who could break us if they disappeared?”

Partners might include: farms and producers, delivery companies, payment processors, cloud providers, co-marketing partners, regulatory bodies.

What to watch for:

Single points of failure. “Our single farm partner supplies everything.” That’s a risk worth flagging. Same for a single delivery company or a single cloud provider.
Partners assumed but not secured. “We’ll partner with local farms” is an assumption, not a partnership. Is there evidence the farms want to work with you?
Hidden partners. Payment processors, email providers, SMS gateways, the cloud provider. Easy to forget, easy to break the business when they fail or change pricing.

Phase 9: Cost Structure (10 minutes)

Ask:

“What are the most significant costs in this business model? Fixed, variable, one-time. I want enough detail that when we look at the Revenue Streams box next to this one, we can tell whether the maths works.”

Categorise:

Fixed: rent, salaries, software subscriptions, insurance
Variable: produce, packaging, delivery, payment processing fees, wastage
One-time: initial equipment, software development, setup

What to watch for:

Missing costs. Teams forget customer acquisition costs, payment processing fees, wastage (unsold perishables), refunds, returns, support salaries, compliance, insurance.
Cost-per-unit vs fixed. Make sure the team separates variable from fixed. A $35 box with $25 of variable cost and $10,000 of monthly fixed cost is a very different business from one with $15 variable and $30,000 fixed.
The silent cost. The founder’s unpaid labour. At some point this becomes a real cost (a hired replacement); the Canvas should flag it even if it’s not being paid today.
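The fixed-versus-variable point is easier to feel with the break-even arithmetic written out. A minimal sketch using the two example businesses above; the function name is hypothetical, and the third call borrows the $41 variable cost from the coherence check later in this guide.

```python
import math

def break_even_boxes(price, variable_cost, monthly_fixed):
    """Boxes per month needed to cover fixed costs.

    Returns None when each box loses money, because no volume fixes that.
    """
    margin = price - variable_cost
    if margin <= 0:
        return None
    return math.ceil(monthly_fixed / margin)

low_fixed = break_even_boxes(35, 25, 10_000)    # $10 margin per box
high_fixed = break_even_boxes(35, 15, 30_000)   # $20 margin per box
underwater = break_even_boxes(35, 41, 10_000)   # negative margin

print(low_fixed, high_fixed, underwater)
```

The first business breaks even at 1,000 boxes a month, the second at 1,500. The second has the better margin and the higher bar to clear, which is exactly why the two numbers need to sit in separate columns.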

Phase 10: Review for coherence (15 minutes)

Step back from the Canvas. This is the phase where the session earns its cost.

Read each pair of boxes out loud, looking for contradictions:

“Revenue Streams says $35 per box. Cost Structure says variable cost per box is $41. This model loses $6 every time we make a sale. Is that right?”

“Value Propositions says ‘personal connection to farms.’ Customer Relationships says ‘automated self-service.’ Are those consistent?”

“Customer Segments says ‘busy professionals.’ Channels says ‘farmers’ market stall.’ Do busy professionals go to farmers’ markets?”

Most Canvas sessions produce at least one coherence break. The value of the session is finding it.

Once you’ve found the breaks, list them explicitly. Each one becomes an assumption worth testing or a strategic decision worth making. Add a sticky note in the margin of the Canvas for each break, so the photograph captures them.

What to watch for:

The optimism spiral. Every box looks rosy. Force the question: “What’s the weakest part of this Canvas? Which box are we least confident about?”
No contradictions found. Either the team has done excellent work, or they’re avoiding the hard look. Challenge them to read the specific numbers aloud: revenue minus variable cost, for example. The contradictions often hide in the arithmetic.
The empty box. If a box stayed mostly empty, that’s a signal. Either the team doesn’t know (valuable finding) or the model has a gap (also valuable finding). Don’t leave an empty box unflagged.

See Business Model Canvas: Does This Actually Work? for the Greenbox team’s first Canvas session, including the moment the founder does the arithmetic between Revenue Streams and Cost Structure out loud and the room goes very quiet.

What Can Go Wrong

The feature session. The team keeps listing product features in the Value Propositions box. Recovery: “Features go in Key Resources or Key Activities. Value Propositions is what the customer gets, not what we build.” Stop if: The team can’t hold the distinction. The Canvas isn’t the right session yet; they need to finish Impact Mapping or Story Mapping first.

The optimism spiral. Every box looks rosy. Recovery: “What’s the weakest part of this Canvas? Which box are we least confident about? Which assumption, if wrong, kills the business?” Stop if: The team refuses to identify a weak box. They’re not ready to be honest with themselves; the Canvas will be decorative.

Analysis paralysis. Twenty minutes debating whether something is a Key Activity or a Key Resource. Recovery: “It doesn’t matter. Canvas is a thinking tool, not a taxonomy exercise. Best-fit box, move on.” Stop if: The argument happens on a second box. The team is using classification to avoid the real conversation.

The absent economics. Nobody in the room knows the actual costs or pricing. Recovery: Fill those boxes with labelled estimates and flag them explicitly: “These boxes are guesses. They go on the assumption list.” Continue the session. Stop if: Four or more of the nine boxes are pure guesses. The Canvas is then mostly fiction; better to pause and do research first.

The contradiction denial. The facilitator names a contradiction (cost exceeds revenue, channel doesn’t match segment) and the room brushes it off. Recovery: Make the contradiction concrete: “Let’s write the arithmetic on the wall. $35 minus $41 is minus $6 per box. Is that what we believe?” Numbers on the wall are harder to dismiss than numbers in the head. Stop if: The room refuses to engage with the arithmetic. The session has produced its finding even if the team won’t accept it: record the contradiction and end.

The wrong room. Halfway through, you realise the person who knows the cost structure isn’t in the room and nobody in the room can speak to it. Recovery: Flag the box as unfinished, capture it as a to-do for a follow-up. Continue with the boxes the room can actually fill. Stop if: Multiple boxes depend on absent people. Reschedule with the right invite list.

The dominant founder. One person (usually the founder) talks every box, and the Canvas becomes their mental model rather than a shared one. Recovery: Round-robin the next box. “Let’s hear from the ops lead first on this one. Founder, hold your view until we’ve heard from everyone else.” Stop if: The pattern survives a second redirect. The Canvas will reflect one person’s beliefs and won’t deliver shared literacy; better to address the dynamic outside the session.

Next Steps

The session ends; the work begins.

Same day, the facilitator:

Photographs the Canvas from directly in front, and close-ups of each box. Make sure the notes are readable.
Transcribes the Canvas into a digital template (Miro, Mural, Figma, or a simple slide) with every sticky note captured.
Lists the contradictions and empty boxes found during the review phase, each as its own line item.
Sends the transcribed Canvas and the contradiction list to participants and relevant stakeholders.

This week, the founder:

This is where the pattern earns its cost, and the work is mostly the founder’s. The Canvas is worthless if the contradictions aren’t resolved.

Fix the arithmetic. If the revenue and cost numbers don’t work, they have to be made to work: by raising prices, cutting costs, changing the operational model, or abandoning the business. Sitting on a broken model is the most expensive option. The founder owns this call.
Run Assumption Mapping on the shaky boxes. Any box that was filled with guesses, or that was the source of a contradiction, needs its assumptions pulled apart. Book the Assumption Mapping session for the next week.
Test the riskiest beliefs fast. Pricing, willingness to pay, cost per unit, churn rate, and customer acquisition cost are the five numbers that kill businesses quietly. If any of them are guesses, they’re the first things to validate in the real world, not in a spreadsheet.
Walk the Canvas to absent stakeholders. Anyone who should have been in the room but wasn’t gets a walk-through. Their challenges will either strengthen the Canvas or reveal problems the original group missed.
Use the Canvas to say no. Any new feature, initiative, or hire that doesn’t improve a box on the Canvas, or worse, makes a box harder, gets parked. The Canvas is the strategic filter.

Ongoing, the team:

Revisits the Canvas quarterly, or when the business model changes significantly. New segments, new pricing, new partners, new costs: each is a reason to update.
Keeps the photographed Canvas visible where strategic conversations happen. It’s the reference that prevents the slow drift back into feature-thinking.
When someone proposes a new initiative, asks them to point to the box it changes on the Canvas. If they can’t, the initiative is probably cost without coherence.

Variants

Standard BMC (default). Nine boxes, four to six people, around two hours, customer-first order. Output: a populated Canvas, a contradiction list, an assumption list. This is what most teams need, and the rest of this post describes it.

Lean Canvas. Ash Maurya’s variant for early-stage problem validation. Replaces Key Partners, Key Activities, Key Resources, and Customer Relationships with Problem, Solution, Key Metrics, and Unfair Advantage. Reach for it when you’re earlier than BMC, when the question is “do we understand the problem well enough to build anything?” rather than “does this business hold together?” Lean Canvas before product-market fit; BMC once there’s something to articulate.

Comparative Canvas. Fill two Canvases side by side for two business model options, “subscription with weekly delivery” vs “on-demand single-box purchase”, and read each pair of boxes across both Canvases. The contradictions surface faster because the alternative is right next to the option, not a hypothetical. Useful when the team is genuinely undecided between two strategic directions.

Diagnostic Canvas. For an existing business that’s drifting, fill the Canvas as it actually is today, then a second Canvas as the team believed it was a year ago. The deltas, which boxes have quietly changed without anyone noticing, are usually where the drift lives. Pricing held while costs crept up. The original segment quietly shifted. The acquisition channel that worked at launch stopped working but wasn’t replaced.

Remote. A Miro or Mural board with the nine-box template pinned, video call for the conversation. Slightly slower (the rhythm of “write a sticky, place a sticky” is faster in person), but the structure transfers cleanly. Use one shared cursor: only the facilitator places stickies, prompted by the team, to keep the layout legible.

Scaled (multi-business or multi-product). A company with several products or business lines runs one Canvas per line, then a master Canvas for the parent. Tensions between Canvases (shared resources, conflicting segments, channel cannibalisation) become visible at the parent level. Six hours total, ideally split across two days so the team can sleep on the first pass.

Time Is Wrong Everywhere All at Once

2026-06-04T06:00:00+08:00

The previous posts in this series covered how humans agree on time, how clocks count it, how physics bends it, whether you can travel through it, and why your brain gets it wrong. This post asks a more mundane but equally maddening question: how do computers agree on what time it is? The answer is that they don’t, not really, and the entire field of distributed systems is, in a sense, the study of what to do about that.

The fundamental problem

Two computers cannot agree on the time.

This sounds like an engineering problem with an engineering solution: just synchronise the clocks. And we do. NTP (Network Time Protocol) has been synchronising clocks across the internet since 1985. A well-configured NTP client can keep its clock within a few milliseconds of UTC. That’s good enough for log files, cron jobs, and displaying the time on your screen.

It’s not good enough for answering the question: “did event A happen before event B?”

Take two servers, Alice and Bob. Alice receives an order at 14:00:00.003 by her clock. Bob processes a cancellation at 14:00:00.001 by his clock. Did the cancellation arrive before the order? If Alice’s clock is 5 milliseconds ahead of Bob’s, the order actually came first, but the timestamps say otherwise. Every distributed system that uses wall-clock timestamps to determine ordering is vulnerable to this. And it’s not a theoretical concern. It’s the kind of bug that causes duplicate charges, lost messages, and inventory discrepancies that take weeks to track down.
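The Alice and Bob example can be checked in a few lines. This sketch treats Bob's clock as the reference and assumes the 5 millisecond skew is known, which in a real system it is not; the date is arbitrary.

```python
from datetime import datetime, timedelta

# Assumption for illustration: Alice's clock runs 5 ms ahead of Bob's.
ALICE_SKEW = timedelta(milliseconds=5)

order_on_alice = datetime(2026, 6, 4, 14, 0, 0, 3000)   # 14:00:00.003 by Alice's clock
cancel_on_bob = datetime(2026, 6, 4, 14, 0, 0, 1000)    # 14:00:00.001 by Bob's clock

# What a naive comparison of raw timestamps concludes.
timestamps_say_cancel_first = cancel_on_bob < order_on_alice

# What happened once Alice's reading is corrected onto Bob's clock.
order_on_bobs_clock = order_on_alice - ALICE_SKEW        # 13:59:59.998
order_really_first = order_on_bobs_clock < cancel_on_bob

print(timestamps_say_cancel_first, order_really_first)
```

Both comparisons come out true: the raw timestamps put the cancellation first, and the corrected ones put the order first. The code can only tell because the skew was handed to it.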

The problem is fundamental, not technical. Even if you had perfect clocks (you don’t; How Clocks Work explained why), the speed of light imposes an irreducible minimum delay on communication between machines. A signal from London to Sydney takes at least 50 milliseconds. During those 50 milliseconds, events can happen at both ends, and neither machine can know about the other’s events until the signal arrives. There is no way, not with better cables, not with faster processors, not with atomic clocks on every server, to create a globally consistent “now” across a distributed system. Relativity says the same thing about the universe. Computer science says it about networks.

Lamport clocks: forgetting what time it is

In 1978, Leslie Lamport published “Time, Clocks, and the Ordering of Events in a Distributed System.” It remains one of the most cited papers in computer science, and its core insight is deceptively simple: you don’t need to know what time it is. You only need to know what happened before what.

A Lamport clock is not a clock in the physical sense. It’s a counter. Every process maintains its own counter. When a process does something, it increments its counter. When it sends a message, it attaches its current counter value. When it receives a message, it sets its counter to the maximum of its own counter and the received value, then increments.

That’s it. No NTP. No atomic clocks. No synchronisation at all. The counter doesn’t represent a time. It represents a position in a causal sequence.
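The three rules fit in a small class. This is a minimal sketch rather than a production implementation; the class and method names are hypothetical, and it assumes that sending a message counts as an event (so the counter is incremented before its value is attached), which is one common convention.

```python
class LamportClock:
    """A per-process counter that tracks causal position, not time."""

    def __init__(self):
        self.counter = 0

    def tick(self):
        """A local event: increment the counter."""
        self.counter += 1
        return self.counter

    def send(self):
        """Sending is treated as an event; the returned value travels with the message."""
        return self.tick()

    def receive(self, received):
        """Take the maximum of our counter and the received value, then increment."""
        self.counter = max(self.counter, received) + 1
        return self.counter


alice, bob = LamportClock(), LamportClock()
alice.tick()                      # Alice does something: her counter is 1
stamp = alice.send()              # Alice sends: her counter is 2, the message carries 2
bob.tick()                        # Bob does something on his own: his counter is 1
bob_after = bob.receive(stamp)    # Bob receives: max(1, 2) + 1 = 3
print(stamp, bob_after)
```

The send carries 2 and the receive lands on 3, so the message's causal order is preserved without either process ever asking what time it is.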

The rule is: if event A causally precedes event B (A happened before B, and B could have been influenced by A), then A’s counter value is less than B’s. Lamport called this the “happened-before” relation. It’s a partial order: not every pair of events is comparable. If Alice does something and Bob does something at the same time with no communication between them, neither “happened before” the other. They’re concurrent. And that’s fine. The system doesn’t need to order them, because they couldn’t have influenced each other.

It’s like a family tree. Your grandmother happened before you: there’s a clear causal chain. Your cousin in another country did things today that you know nothing about. Neither of you happened “before” the other. You’re concurrent. A family tree doesn’t need to put all the cousins in order. It only needs to know who descended from whom.

Lamport clocks capture exactly this: causality, not chronology. They tell you “A could have caused B” or “A and B are independent.” They don’t tell you which happened first on a wall clock, because that question, in a distributed system, often has no meaningful answer.

Vector clocks: who knew what when

Lamport clocks have a limitation: if A’s counter is less than B’s, you know A might have caused B, but you can’t be sure. The ordering is consistent with causality but doesn’t perfectly capture it. In 1988, Colin Fidge and Friedemann Mattern independently invented vector clocks, which fix this.

A vector clock is an array of counters, one per process. When process Alice does something, she increments her entry. When she sends a message, she attaches the entire vector. When Bob receives it, he takes the element-wise maximum of his vector and Alice’s, then increments his own entry.

The result: you can look at two vector timestamps and determine not just whether one might have caused the other, but whether they’re definitely concurrent. If every entry in A’s vector is less than or equal to the corresponding entry in B’s vector, and at least one entry is strictly less, then A happened before B. If some entries are greater and some are less, they’re concurrent: neither caused the other.
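The merge and comparison rules can be sketched with plain dictionaries keyed by process name. The function names are hypothetical, a missing entry is assumed to mean zero, and two identical vectors are reported as equal rather than ordered.

```python
def merge(local, received, me):
    """Element-wise maximum of two vectors, then increment our own entry."""
    processes = local.keys() | received.keys()
    merged = {p: max(local.get(p, 0), received.get(p, 0)) for p in processes}
    merged[me] = merged.get(me, 0) + 1
    return merged


def compare(a, b):
    """Return 'before', 'after', 'equal', or 'concurrent' for vectors a and b."""
    processes = a.keys() | b.keys()
    a_le_b = all(a.get(p, 0) <= b.get(p, 0) for p in processes)
    b_le_a = all(b.get(p, 0) <= a.get(p, 0) for p in processes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


alice_event = {"alice": 1, "bob": 0}    # Alice acts
bob_event = {"alice": 0, "bob": 1}      # Bob acts, with no communication
bob_after_msg = merge(bob_event, alice_event, "bob")   # Bob receives Alice's vector

print(compare(alice_event, bob_event))       # independent events
print(compare(alice_event, bob_after_msg))   # Alice's event is in Bob's history
```

The first comparison reports the events as concurrent; the second reports Alice's event as before Bob's, because Bob's vector now includes everything Alice's did.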

It’s like a group chat where everyone keeps a diary. Each diary entry notes what the writer did and the last thing they heard from everyone else. If Alice’s diary says she’s seen Bob’s message #5 and Carol’s message #3, and Bob’s diary says he’s seen Alice’s message #2 and Carol’s message #4, you can reconstruct exactly who knew what when. Two entries are concurrent if neither person had seen the other’s latest update.

Vector clocks are used in real systems. Amazon’s Dynamo (the internal key-value store whose design inspired DynamoDB) used them to detect conflicting writes. Riak, a distributed key-value store, used them for the same purpose. They’re more expensive than Lamport clocks (the vector grows with the number of processes), but they give you something Lamport clocks can’t: a definitive answer about concurrency.

The CAP theorem and the cost of consistency

In 2000, Eric Brewer proposed (and in 2002, Seth Gilbert and Nancy Lynch proved) the CAP theorem: a distributed system can provide at most two of three guarantees:

Consistency: every read receives the most recent write.
Availability: every request receives a response.
Partition tolerance: the system continues to operate even if network messages between nodes are lost or delayed.

Since network partitions happen in real systems (cables get cut, switches fail, datacentres lose connectivity), you effectively have to choose between consistency and availability. You can’t have both when the network is broken.

This is a theorem about time in disguise. “Consistency” means “every node agrees on the current state.” “Current” means “right now.” But “right now” across multiple machines separated by a network is the exact problem we started with. The CAP theorem is, at its heart, a formal proof that the speed of light makes global agreement expensive.

CP systems (consistent, partition-tolerant) sacrifice availability: if the system can’t guarantee that all nodes agree, it refuses to answer rather than give a possibly-stale response. Traditional relational databases in a distributed setting often work this way. Your query might time out, but it won’t give you wrong data.

AP systems (available, partition-tolerant) sacrifice consistency: every node answers every request, even if it means some nodes are serving stale data. Eventually, when the partition heals, the nodes reconcile. This is “eventual consistency”: the system will converge to the correct state, but there’s a window where different nodes disagree. DynamoDB, Cassandra, and most eventually-consistent NoSQL databases work this way. Your query always gets an answer, but it might not be the latest answer.

The choice between CP and AP is a choice about how to handle the impossibility of shared time. Do you pause and wait for agreement (CP), or do you keep going and sort it out later (AP)?

Google Spanner: buying time with atomic clocks

In 2012, Google published a paper describing Spanner, a globally distributed database that appears to violate the CAP theorem. It offers strong consistency (every read sees the most recent write) across datacentres on different continents, with high availability. How?

The trick is hardware. Google put GPS receivers and atomic clocks in every datacentre. Not NTP. Not “synchronise to a time server.” Actual atomic clocks, caesium and rubidium oscillators, sitting in the server racks, cross-checked against GPS signals. This gives each datacentre a clock that’s accurate to within about 7 milliseconds of true time, with known uncertainty bounds.

Spanner uses an API called TrueTime, which doesn’t return a single timestamp. It returns an interval: “the current time is definitely between earliest and latest.” The interval is typically a few milliseconds wide. Every transaction gets a timestamp, and the system guarantees that if transaction A’s timestamp is before transaction B’s, then A actually happened before B in real time. If the system isn’t sure about the ordering (that is, if the intervals overlap), it waits until the uncertainty resolves. This is called “commit wait,” and it typically adds a few milliseconds to each transaction.
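Commit wait is simple enough to imitate. TrueTime itself is internal to Google, so everything here is a toy stand-in: `tt_now`, `commit`, and the fixed 4 millisecond bound are assumptions for illustration, and the local monotonic clock plays the part of the GPS and atomic references.

```python
import time
from dataclasses import dataclass

EPSILON = 0.004  # hypothetical uncertainty bound, in seconds


@dataclass
class Interval:
    earliest: float
    latest: float


def tt_now():
    """Toy stand-in for TrueTime: the local clock, plus or minus a fixed bound."""
    t = time.monotonic()
    return Interval(t - EPSILON, t + EPSILON)


def commit():
    """Pick a timestamp, then wait until that timestamp is definitely in the past."""
    commit_ts = tt_now().latest
    while tt_now().earliest <= commit_ts:   # the commit wait
        time.sleep(0.001)
    return commit_ts


first = commit()
second = commit()
print(second > first)
```

Each commit waits roughly twice the uncertainty bound before returning. Because the second commit cannot start until the first timestamp is certainly in the past, its timestamp is certainly later, which is the guarantee described above.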

Google is buying consistency with atomic clocks and patience. The speed of light still prevents perfect synchronisation, but by bounding the uncertainty and waiting it out, Spanner creates the illusion of a single global timeline. It’s not cheap (the atomic clocks, the GPS receivers, the global network, the engineering team that maintains all of it), but it works. It’s been running Google’s advertising system (among other things) since 2012.

It’s like a courtroom. Two witnesses disagree about whether the red car or the blue car arrived first. In most distributed systems, you’d have to choose: either stop the trial until you can resolve the disagreement (CP), or let both witnesses testify and live with the inconsistency (AP). Spanner’s approach is different: give both witnesses a clock so precise that their testimony overlaps only slightly, then pause just long enough for the overlap to resolve. The trial continues. The record is consistent. It costs you a good clock and a little patience.

Conflict resolution: when time isn’t enough

Even with perfect clocks, distributed systems face a problem that time alone can’t solve: conflicting writes. Two users edit the same document at the same time. Two processes update the same database row. Two nodes accept contradicting requests during a network partition. What wins?

Last-writer-wins (LWW) is the simplest policy: whichever write has the latest timestamp wins. It’s used widely. Cassandra defaults to it. It’s simple, deterministic, and almost always wrong. If Alice saves a document at 14:00:00.003 and Bob saves a different version at 14:00:00.005, Bob’s version wins and Alice’s changes vanish. Nobody is notified. The data loss is silent. If the clocks are even slightly wrong, the “wrong” write wins. LWW trades correctness for simplicity, and in many cases the trade is terrible.
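The silent loss is easy to reproduce. In this sketch the timestamps are milliseconds past 14:00:00, the function name is hypothetical, and ties go to the first argument, which is an arbitrary choice.

```python
def lww_merge(a, b):
    """Keep whichever write carries the later timestamp; the other is dropped without notice."""
    return a if a["ts"] >= b["ts"] else b


alice_write = {"ts": 3, "author": "alice", "text": "clause 4 revised"}
bob_write = {"ts": 5, "author": "bob", "text": "clause 7 revised"}

winner = lww_merge(alice_write, bob_write)

# Same writes, but suppose Alice's clock runs 3 ms fast.
skewed_alice = {**alice_write, "ts": alice_write["ts"] + 3}
winner_with_skew = lww_merge(skewed_alice, bob_write)

print(winner["author"], winner_with_skew["author"])
```

With honest clocks Bob wins and Alice's edit disappears. With three milliseconds of skew the outcome flips. In neither case does anything in the result record that a second write ever existed.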

CRDTs (Conflict-Free Replicated Data Types) take a fundamentally different approach. Instead of asking “which write happened last?”, they design the data structure so that all writes can be merged without conflict. A CRDT counter, for instance, tracks each node’s increments separately and sums them on read. Two nodes can increment independently, with no communication, and when they eventually sync, the counter is correct. No timestamps needed. No conflict resolution needed. The data type’s mathematical properties guarantee convergence.
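The counter described above is usually called a grow-only counter, or G-Counter. A minimal sketch, with hypothetical class and method names: each node only ever increments its own slot, and merging takes the per-node maximum.

```python
class GCounter:
    """Grow-only counter: one slot per node, merged by per-node maximum."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}

    def increment(self, n=1):
        """A node only ever increases its own slot."""
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + n

    def value(self):
        """Sum the slots on read."""
        return sum(self.counts.values())

    def merge(self, other):
        """Per-node maximum, so merging is safe to repeat and to apply in any order."""
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)


a, b = GCounter("a"), GCounter("b")
a.increment()
a.increment()
b.increment()       # no communication so far

a.merge(b)
b.merge(a)
a.merge(b)          # merging again changes nothing

print(a.value(), b.value())
```

Both replicas converge on 3. Taking the maximum rather than adding is what makes a repeated merge harmless, and that property is what lets the nodes sync whenever and however often they like.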

CRDTs work for counters, sets, registers, and certain kinds of text editing (libraries such as Automerge and Yjs build collaborative editing on them). They don’t work for everything: some operations are inherently conflicting (two users setting the same field to different values), and CRDTs can only merge what the data structure’s rules allow.

Operational transformation (OT) is the older approach to the same problem, used by Google Docs and still used in many collaborative editors. OT transforms each operation against concurrent operations to produce a consistent result. If Alice inserts a character at position 5 and Bob deletes a character at position 3, the system transforms Alice’s insertion to account for Bob’s deletion: Alice’s insert moves to position 4. The result is the same regardless of the order the operations arrive.
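The position-5 and position-3 example can be run both ways round to see the convergence. This sketch handles only a single-character insert against a single-character delete, uses zero-based positions, and the function names are hypothetical; real OT systems cover many more operation pairs.

```python
def apply_insert(text, pos, ch):
    return text[:pos] + ch + text[pos:]


def apply_delete(text, pos):
    return text[:pos] + text[pos + 1:]


def transform_insert(insert_pos, delete_pos):
    """Shift an insert left if a concurrent delete removed a character before it."""
    return insert_pos - 1 if delete_pos < insert_pos else insert_pos


def transform_delete(delete_pos, insert_pos):
    """Shift a delete right if a concurrent insert landed at or before it."""
    return delete_pos + 1 if insert_pos <= delete_pos else delete_pos


doc = "abcdefgh"

# Bob's delete arrives first, then Alice's insert, transformed from 5 to 4.
bob_first = apply_insert(apply_delete(doc, 3), transform_insert(5, 3), "X")

# Alice's insert arrives first, then Bob's delete, which stays at 3.
alice_first = apply_delete(apply_insert(doc, 5, "X"), transform_delete(3, 5))

print(bob_first, alice_first)
```

Both orders produce the same document, with the X sitting between the same two neighbours it was originally placed between.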

All of these techniques exist because time, even perfectly synchronised time, isn’t enough to resolve concurrent events. When two things happen at the same time, you need a policy, not a clock.

Logical time in practice

The theoretical framework of Lamport clocks and vector clocks shows up in practical systems, often under different names:

Version vectors in distributed databases (Riak, Dynamo) are vector clocks by another name. Each node maintains a counter, and the vectors are compared to detect conflicts. When a conflict is detected, the system either merges automatically (if it can) or presents both versions to the application for resolution.

Sequence numbers in consensus protocols like Raft and Paxos are, at their core, Lamport clocks. Each proposal gets a monotonically increasing number. The ordering of proposals is determined by these numbers, not by wall-clock time. This is why consensus protocols work even when clocks disagree: they never consult a wall clock to decide ordering.

Log-structured systems (Kafka, event sourcing architectures, blockchain) use an append-only log as their source of truth. The position in the log is the logical time. Event #4,721 happened before event #4,722 because 4,721 < 4,722. No timestamps needed. The log imposes a total order. This is Lamport’s insight, made concrete.

Even Git uses a form of logical time. A commit’s position in the DAG (directed acyclic graph) determines its causal relationship to other commits. Commit A is an ancestor of commit B. A happened before B. Two commits on different branches are concurrent. Git doesn’t care when they were created (the author date is just metadata). It cares about the graph structure. Causality, not chronology.

The speed of light is a systems problem

Every problem in this post traces back to the same root cause: information takes time to travel. Light from London to Sydney: 50 milliseconds. A packet across a datacentre: maybe 0.5 milliseconds. A signal between two chips on the same board: nanoseconds. The delays are different, but they’re never zero, and as long as they’re not zero, two observers can’t agree on “now.”

The relativity posts made this point about the universe. The speed of light means there’s no universal “now.” Simultaneity is relative. The block universe might be the correct picture: everything already exists, and our experience of “the present” is local and subjective.

Distributed systems live in the same reality, just at a smaller scale. The speed of light in a fibre optic cable (about two-thirds the speed of light in vacuum) means that two servers in different datacentres can never share a “now.” They can get close. Google’s TrueTime gets within milliseconds, but “close” and “exact” are different things, and the gap between them is where bugs live.

Leslie Lamport’s great insight was that you don’t have to solve this problem. You can sidestep it. Stop asking “what time is it?” and start asking “what happened before what?” Stop synchronising clocks and start tracking causality. The universe can’t agree on “now” either. It gets along fine by tracking the causal structure of events, the light cones that determine what can influence what.

Distributed computing reinvented the same solution, decades later, for the same reason. It turns out that the question we started this series with, “what time is it?”, is just as hard for computers as it is for physicists. And the answer, in both domains, is the same: it depends on who’s asking, and what they need to know.

Combining RAG and Fine-Tuning for a Legal Contract Assistant

2026-06-03T06:00:00+08:00

The situation

A legal-technology startup is building a contract review assistant for a mid-sized commercial firm. The in-product model answers two shapes of question: “What does this clause mean in the context of our past drafting?” and “Where have we seen this indemnity construction before, and how did we negotiate it?”

The constraints:

Corpus: ~200,000 past contracts, amendments, side letters, and internal case studies. Roughly 40 GB of text-heavy PDFs, Word documents, and Markdown notes after extraction. Growing by ~500 new matters a month.
Voice: every answer references clauses by section number (§3.2(b)), uses the firm’s preferred hedging (“the drafting is ambiguous on this point” rather than “this is unclear”), and cites internal precedents in the firm’s matter-number format.
Refusal: questions outside commercial contract law (tax, immigration, employment) get a structured decline with a pointer to the correct in-house team. Nothing off-domain.
Budget: AUD$100,000 end-to-end for customisation, data preparation, training, evaluation, first quarter of inference.
Timeline: three months to a pilot with fee-earners.
Platform: Bedrock. Nothing self-hosted.

What actually matters

Three customisation levers are on the table (retrieval-augmented generation, supervised fine-tuning, continued pre-training), and the instinct to pick one of them is the mistake. The levers aren’t substitutes; they answer different questions. The first question is what kind of problem is “be correct about 200,000 contracts”? It’s a retrieval problem. Facts about specific documents live in documents, not in weights, and any approach that tries to memorise 200,000 specific contracts is either astronomically expensive or silently unfaithful. That shape pushes the “what does the corpus say?” half of the design toward retrieval by default, and the choice of vector store and embedding model becomes the interesting part.

The second question is what kind of problem is “sound like the firm”? It’s a behaviour problem. The firm’s voice is a set of rules: hedged phrasings, §-citations, matter-number formats, the polite decline when the question drifts into tax law. Rules about how to write aren’t facts; they’re patterns of output conditioned on input. Teaching those patterns through the system prompt works up to a point, and then starts drifting under adversarial phrasing or long conversations. Baking the rules into the weights via supervised fine-tuning means a short prompt is enough to invoke them and a jailbreak costs more than a system-prompt line to get around. That pushes the “how should the model say it?” half toward training, with the labelled dataset becoming the artefact that encodes the firm’s style guide as a training signal.

The third is what’s the planning horizon on each piece? The corpus grows by 500 matters a month. The style guide changes when a senior partner wins an argument about hedging. The refusal list changes when a user finds a new way to ask about divorce. A two-person platform team can absorb weekly ingestion (ingest jobs on object-storage events) and quarterly fine-tune refreshes (a lawyer curates deltas, then triggers a training run) but cannot absorb monthly retrains of anything that reads 40 GB. That cadence asymmetry is the strongest argument against continued pre-training in this project: its refresh cycle is weeks, not days, and its cost is per token processed on an unlabelled 40 GB corpus. The pay-off exists only when the base model’s vocabulary is genuinely wrong, and commercial contract English is squarely inside what a modern hosted model has already read.

The fourth is where does the budget actually get spent? AUD$100K in three months looks like training compute at first glance and turns out to be hosting commitments on inspection. Custom-trained models on a managed-model platform typically can’t be served on the standard pay-per-token rate; they need a reserved-capacity commitment, and that is the line item most often under-estimated. The budget shape for any approach that ships custom weights is low-training plus high-fixed-serving, and the architectural consequence is that fine-tuning earns its place only when the behaviour change is worth the always-on hourly burn. A pure retrieval approach has a different cost shape: low-fixed plus variable-per-query, which is correct for a pilot with light traffic.

The fifth is what does a wrong answer look like and who catches it? A model that gets the voice correct but hallucinates clause numbers is worse than an un-tuned model that cites faithfully. The evaluation harness has to score citation faithfulness (every § reference traces back to a retrieved chunk) separately from voice (did the model write like a partner?) because the two signals tell the team different things: citation faithfulness moves when retrieval changes, voice moves when training drifts. Without separate scores the team can’t tell which half to fix.

Finally: what buys the right to change our mind? A retrieval-only baseline ships in weeks and answers faithfully but boringly. Adding a fine-tune on top adds voice without re-doing the retrieval. If the firm decides in year two that Welsh property law has become a practice area, the retrieval corpus picks up the documents immediately and the fine-tune picks up the phrasing on the next quarterly refresh. If instead the team had picked continued pre-training, adding a new sub-domain would mean another round of training on another tranche of unlabelled text.

What we’ll filter on

Five filters to score the landscape against.

Corpus grounding. Two hundred thousand documents the model has never seen, with new ones arriving weekly. The answer has to reflect the current corpus, not a snapshot frozen at training time.
Voice and format. The firm’s phrasing and citation style are rules about how to write, not facts about the world. The model needs to internalise them so a prompt doesn’t re-teach them every turn.
Refusal. Off-domain questions must be declined in a structured way. A behavioural policy that has to hold under adversarial prompting.
Budget and timeline. AUD$100K and 90 days. Any method that blows either is out.
Maintainability. A two-person platform team. Customisation has to be refreshable when the corpus grows or the style guide changes, without a full retrain every time.

The customisation landscape

Bedrock gives five levers that could plausibly shape model behaviour.

Prompt engineering alone. Cheapest. System prompt with the style guide, few-shot examples, refusal instructions. Works well for voice and refusal when the base model is capable. Claude Sonnet follows detailed style instructions to a fault. Fails the corpus attribute: 200,000 documents don’t fit in any prompt.

Retrieval-augmented generation. The corpus lives in a vector store; every question retrieves relevant chunks, and those chunks ride into the prompt alongside the user’s question. Facts stay outside the weights, so updating the corpus is an ingestion job, not a training job. Citations fall out naturally because each retrieved chunk carries its source with it into the prompt. On Bedrock: Knowledge Bases plus RetrieveAndGenerate, backed by OpenSearch Serverless, Aurora pgvector, S3 Vectors, or third-party stores.

Supervised fine-tuning. Show a base model a labelled dataset of (prompt, ideal response) pairs; adjust weights so outputs move closer to the ideal. On Bedrock: Claude 3 Haiku (us-west-2), Meta Llama 3.1 / 3.2 / 3.3 across 1B-70B, Amazon Nova Micro / Lite / Pro, plus Titan Text. Writes a custom model that must be served via provisioned throughput; on-demand isn’t available. Training cost is modest (Llama 2 70B fine-tune training is ~$0.00799 per 1,000 tokens; custom model storage is $1.95/month). Teaches style, format, and behaviour; does not reliably teach facts.

Continued pre-training. Keep training a base model on a large body of unlabelled domain text using the same objective that originally pre-trained it. Shifts the model’s distribution of language toward the domain. Historically supported on Amazon Titan Text; not on Claude, Llama, or Nova. Heavyweight; training cost proportional to tokens processed; output still needs provisioned throughput to serve.

Bedrock Custom Model Import. Bring weights trained elsewhere (Llama / Mistral / compatible architectures) and serve them through the Bedrock API. Provisioned-only; us-east-1 and us-west-2. A packaging choice, not a fresh customisation lever.

Side by side

Lever	Corpus grounding	Voice & format	Refusal behaviour	Budget/timeline	Maintainability
Prompt engineering alone	✗	—	—	✓	✓
RAG (Knowledge Bases)	✓	✗	—	✓	✓
Supervised fine-tuning	✗	✓	✓	✓	—
Continued pre-training	✗	—	✗	✗	✗
Custom Model Import	✗	✗	✗	—	✗

No single lever clears all five. Two stacked clear all five: RAG for the corpus, fine-tuning for voice and refusal.

Matching the levers to the question

Three questions, three levers, two picked. The RAG path pulls corpus facts in at inference; the fine-tune path bakes voice and refusal into weights offline. Continued pre-training stays parked: the base vocabulary is already correct, and the budget can't carry it alongside the two that earn their place.

The RAG + fine-tune split, in depth

The instinct to pick one customisation method comes from treating them as interchangeable. They aren’t. Each answers a different question.

RAG answers “what does the corpus say?” Facts about 200,000 specific contracts live in the vector store. A question about a force-majeure clause retrieves the dozen most relevant past instances; the model reads them at inference time and reasons about them. Adding a new matter is an ingestion job: the vector store grows by one document and the model doesn’t change. Removing a retracted matter is a delete on a few vectors. The corpus is a living index, not a snapshot baked into weights.

Fine-tuning answers “how should the model say it?” The firm’s voice (hedged, precise, §-citing) is a set of stylistic rules. A few hundred labelled examples teach the model those rules in its weights. After fine-tuning, a twenty-line system prompt produces voice-compliant answers where an un-tuned model would need two hundred lines of style-guide text and still drift under pressure.

Continued pre-training answers “what vocabulary does the model know?” Useful when the base model genuinely doesn’t speak the domain’s language: regulatory filings in a rare jurisdiction, argot from a century-old trade, notation from a narrow sub-field. Commercial contract English doesn’t qualify. Claude has read plenty of contracts.

The three aren’t substitutes; they stack. A fully-customised model in a demanding domain might do all three: CPT on domain text, fine-tune on (prompt, response) pairs, then wrap in RAG at inference. For this situation, two of the three clear every attribute and the third is overkill.

A worked decision trace

Attribute 1: 200,000-document corpus. RAG ingests into OpenSearch Serverless via Knowledge Bases. Titan Text Embeddings V2 at 1,024 dimensions. Hierarchical chunking: child ~300 tokens for retrieval precision, parent ~1,500 tokens for generator context. Metadata sidecars tag each document with matter number, practice area, and client. Weekly refresh via EventBridge calling StartIngestionJob; deltas only. Fine-tuning doesn’t touch this; the fine-tuned model calls the same vector store as an un-tuned one.

Attribute 2: voice and citation format. A lawyer-in-the-loop curates ~1,500 (prompt, ideal-response) pairs over four to six weeks. Each pair is a real question-and-answer exchange, reviewed and edited to the firm’s style guide: hedged phrasing, §X.Y(z) references, matter-number citations. The dataset trains Claude 3 Haiku via Bedrock fine-tuning in us-west-2, the only Claude option for fine-tuning today. Llama 3.3 70B would be the alternative if quality required it; a fine-tuned 70B on provisioned throughput is materially more expensive per hour, and Haiku should clear the bar.

Attribute 3: refusal on off-domain questions. A subset, perhaps 300 of the 1,500 pairs, are refusal examples. Fine-tuning bakes this into the weights. The system prompt reinforces it; default behaviour under a prompt-injection attempt holds much better than a prompt-only approach would.
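To make the dataset concrete, here is a rough sketch of how curated pairs, refusals included, might be serialised to JSONL. The record shape (a `system` string plus a `messages` list of user and assistant turns) is an assumption about the training format and should be checked against the current Bedrock documentation; the system prompt, questions, and answers are invented for illustration.

```python
import json

# Hypothetical system prompt; the real one would come from the firm's style guide.
SYSTEM_PROMPT = "You are the firm's contract assistant. Hedge appropriately and cite clauses as §X.Y(z)."


def training_record(question: str, ideal_answer: str, system: str = SYSTEM_PROMPT) -> str:
    """Serialise one curated (prompt, ideal-response) pair as a single JSONL line."""
    record = {
        "system": system,
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": ideal_answer},
        ],
    }
    return json.dumps(record, ensure_ascii=False)


# Invented examples: one in-domain answer, one structured refusal.
PAIRS = [
    (
        "Does the supplier's liability cap cover data breaches?",
        "On the retrieved terms, the cap in §12.3(a) appears to apply, subject to the carve-out in §12.4.",
    ),
    (
        "How should I split assets in my divorce?",
        "That falls outside commercial contract advice, so I can't assist. The family law team can help.",
    ),
]

if __name__ == "__main__":
    for question, answer in PAIRS:
        print(training_record(question, answer))
```

Refusals are ordinary rows in the same file, which is what lets a single training run teach voice and decline behaviour together.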

Attribute 4: AUD$100K and 90 days. Budget pass below. Both methods fit; CPT doesn’t.

Attribute 5: maintainability. RAG updates are ingestion; no retrain needed when a new matter lands. Fine-tuning refreshes happen quarterly, when the style guide evolves or refusal patterns grow. A two-person platform team runs ingestion continuously and the fine-tune four times a year.

Cost shape: where the dollars land

The cost profile differs in shape, not just size.

RAG: low fixed, variable with queries. One-off ingestion cost (embedding 40 GB at Titan V2’s per-token rate, a few thousand dollars, plus incremental weekly deltas), baseline vector-store cost (OpenSearch Serverless at 2-OCU minimum, ~AUD$520/month), per-query embedding plus generation cost.

Fine-tuning: low training, high fixed serving. Training a Haiku fine-tune on 1,500 pairs runs in the low hundreds of dollars; custom model storage is $1.95/month. The catch is serving: fine-tuned models run on provisioned throughput only, which means a minimum hourly burn from the moment of deployment. Haiku-tier model units (MUs) are cheaper than the Llama 2 70B reference ($21.18/hour, ~$15,750/month on a 1-month commit) but still add up to several thousand dollars a month.
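The arithmetic behind an always-on hourly rate is worth seeing once. The sketch below assumes a 31-day month and a single model unit, and uses only the Llama 2 70B reference rate quoted above; the function name and defaults are illustrative.

```python
def monthly_provisioned_cost(hourly_rate: float, model_units: int = 1, days: int = 31) -> float:
    """Cost of keeping provisioned throughput running around the clock for one month."""
    return hourly_rate * model_units * 24 * days


if __name__ == "__main__":
    llama_reference = monthly_provisioned_cost(21.18)
    # Lands within a few dollars of the ~$15,750/month figure quoted above.
    print(f"Llama 2 70B reference: ${llama_reference:,.2f} per month")
```

The bill does not depend on query volume, which is why a lightly used pilot still pays the full amount.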

Continued pre-training: high training and high fixed serving. Pricing is per token processed; at 40 GB raw text (~10 billion tokens), one pass is a serious bill before fine-tuning or evaluation begin.

Budget pass, AUD:

Data preparation. PDF extraction, chunking pipeline, metadata tagging, the 1,500-pair dataset curated by a lawyer: ~AUD$30K.
RAG ingestion + 2-OCU OpenSearch Serverless for three months: ~AUD$6K.
Fine-tune training plus iteration cycles: ~AUD$2K.
Provisioned throughput for the fine-tuned Haiku, three months: ~AUD$30-40K.
Bedrock Evaluations weekly against a 200-question golden set: ~AUD$4K.
Generation cost for the pilot at low query volume: ~AUD$2-4K.

Total: ~AUD$74-86K of AUD$100K, leaving headroom for a Sonnet evaluation judge and a round of iteration.
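The budget pass can be checked by summing the low and high ends of each line item. The figures below are the ones listed above, in thousands of AUD; only the dictionary keys are invented labels.

```python
BUDGET_AUD_K = 100

# (low, high) estimates in thousands of AUD, taken from the list above.
LINE_ITEMS = {
    "data preparation": (30, 30),
    "RAG ingestion + OpenSearch Serverless": (6, 6),
    "fine-tune training and iteration": (2, 2),
    "provisioned throughput, three months": (30, 40),
    "Bedrock Evaluations": (4, 4),
    "pilot generation cost": (2, 4),
}


def budget_range(items: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum the low and high ends of every line item."""
    low = sum(low for low, _ in items.values())
    high = sum(high for _, high in items.values())
    return low, high


if __name__ == "__main__":
    low, high = budget_range(LINE_ITEMS)
    print(f"Total: AUD${low}-{high}K, headroom AUD${BUDGET_AUD_K - high}-{BUDGET_AUD_K - low}K")
```

Provisioned throughput is both the largest variable item and the widest range, so it is the line to pin down first.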

Evaluation: the quiet third leg

A contract review assistant that gets the voice correct but hallucinates clauses is worse than one that gets the voice vaguely correct but cites faithfully. Evaluation matters as much as the customisation choice.

The golden dataset: ~200 real questions from the firm’s advice history, with expected answers reviewed by a senior lawyer. Refreshed quarterly. Includes questions the system should refuse.

Automatic metrics via Bedrock Evaluations: citation faithfulness (every § reference traces back to a retrieved chunk), answer accuracy against the lawyer-reviewed reference, and refusal correctness. Citation faithfulness tells you whether RAG is doing its job; refusal correctness tells you whether fine-tuning is doing its job.

Human review: a weekly spot check by a senior lawyer on a random sample, scoring on “would I have said it this way?” When rubric scores drop, the fine-tune dataset needs refreshing; when citation faithfulness drops, retrieval is returning the wrong chunks.
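Citation faithfulness, as defined above, is simple enough to sketch outside any managed evaluation service. This is an illustrative scorer, not how Bedrock Evaluations computes its metrics: the regular expression assumes references shaped like §12.3(a), and an answer with no § references scores 1.0 by convention, which is a design choice a team might reasonably reverse.

```python
import re

# Assumed reference shape: "§12", "§12.3", "§12.3(a)", optionally with a space after "§".
SECTION_REF = re.compile(r"§\s*\d+(?:\.\d+)*(?:\([a-z0-9]+\))*")


def section_refs(text: str) -> set[str]:
    """Extract § references, normalised by stripping internal whitespace."""
    return {re.sub(r"\s+", "", match) for match in SECTION_REF.findall(text)}


def citation_faithfulness(answer: str, retrieved_chunks: list[str]) -> float:
    """Fraction of § references in the answer that also appear in a retrieved chunk."""
    cited = section_refs(answer)
    if not cited:
        return 1.0  # nothing cited, so nothing unfaithful (by convention)
    available = set().union(*(section_refs(chunk) for chunk in retrieved_chunks))
    return len(cited & available) / len(cited)


if __name__ == "__main__":
    answer = "Liability is capped under §12.3(a), subject to §14.2."
    chunks = ["§12.3(a) The Supplier's liability shall not exceed the Fees paid."]
    print(citation_faithfulness(answer, chunks))  # half the references are grounded
```

Because the score depends only on the answer and the retrieved chunks, it moves with retrieval quality and stays flat when only the voice changes.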

What’s worth remembering

Three customisation methods answer three different questions. RAG: what does the corpus say? Fine-tuning: how should the model say it? CPT: what vocabulary does it know? Treating them as substitutes leads to picking wrong.
RAG via Bedrock Knowledge Bases handles 200K-document corpora with weekly updates through incremental ingestion, no retrain required. Citations fall out of retrieval, not out of weights.
Supervised fine-tuning on Bedrock supports Claude 3 Haiku (us-west-2), Meta Llama 3.1 / 3.2 / 3.3, Amazon Nova Micro / Lite / Pro, and Amazon Titan. Not Sonnet, not Opus, not Llama 4 MoE.
Fine-tuned custom models must be served via provisioned throughput. On-demand isn’t available. The minimum hourly commitment is the line item that most often blows a customisation budget.
Continued pre-training uses unlabelled text and the base pre-training objective to shift the model’s language distribution. Heavyweight, priced per training token, still needs provisioned throughput. Correct when base vocabulary is wrong; wrong when the corpus is just more of what the base already reads.
Cost shapes differ. RAG: low fixed, variable with queries. Fine-tuning: low training, high fixed serving. CPT: high training and high fixed serving. Budget discipline comes from knowing the shape, not just the sticker price.
Custom Model Import packages an externally-trained model into Bedrock’s inference surface, a deployment choice, not a customisation method. Provisioned-only; us-east-1 and us-west-2 only.
Evaluation is the third leg. Bedrock Evaluations for automatic citation faithfulness, accuracy, and refusal correctness; human review for voice. Without it, neither RAG nor fine-tuning is maintainable.

The answer: Bedrock Knowledge Bases for RAG over the 200,000-document corpus. Titan Text Embeddings V2 at 1,024 dimensions, hierarchical chunking, metadata filtering by matter number and practice area, weekly incremental ingestion from S3. Supervised fine-tuning of Claude 3 Haiku in us-west-2 on ~1,500 lawyer-curated (prompt, response) pairs covering voice, §-citation format, and structured refusals. The fine-tuned Haiku serves via provisioned throughput behind RetrieveAndGenerate, so every inference call pulls relevant chunks from the knowledge base and hands them to a model that already knows how to write in the firm’s voice. Continued pre-training is parked: the sub-domain doesn’t need it, and the budget can’t afford it alongside fine-tuning and RAG. Evaluation runs weekly. RAG for the what, fine-tuning for the how.

How to Build a Citations-Required RAG Over 50K Internal Documents

2026-06-01T06:00:00+08:00

The situation

A 6,000-person enterprise is standing up an internal assistant. The corpus is ~50,000 documents across four domains (HR policies, engineering runbooks, security guidelines, and product specs), totalling ~5 GB of mostly text-dense PDFs, Markdown, Word, and Confluence exports. New documents land weekly, old ones get superseded, a handful are retracted. The assistant has to reflect the current state within a day of a change.

On the answer path:

P95 end-to-end latency < 3 s from question to last token, across retrieval, generation, and network.
Document-level access control. An engineer asking “what are the band-5 engineering salaries?” must get a polite refusal, not an HR document. A security auditor asking about an incident-response runbook gets the runbook. Identity drives what the retriever can see.
Citations on every answer. Every factual claim points back to a source chunk. No citation, no answer.

What actually matters

A RAG system lives or dies at the boundary where identity meets retrieval, so the first question is who owns that boundary? A product team that ships “the assistant” without owning the access-control fabric under it is building a compliance incident with a generative front-end. The design has to make the seam explicit: identity in, filter out, retriever sees only what the caller is allowed to see. Anywhere else in the stack is the wrong place to apply the check: filtering results after retrieval leaves the top-K polluted with chunks the user can’t read, and filtering at generation leaves the citation hanging off something the user shouldn’t have seen in the first place.

The second is what’s the blast radius of a bad answer? An engineer who asks about someone else’s salary and gets a careful decline is fine. An engineer who asks about someone else’s salary and gets the answer is a wrongful-disclosure incident, and the remediation isn’t a prompt tweak; it’s legal notice, HR escalation, and a six-month trust deficit with the workforce that was just asked to share more data with the tool. The cost of a single leakage dominates every other cost on the project. That shape pushes the design toward managed components where the access-control path is a first-class API, not a piece of glue the team maintains.

The third is what’s the cost curve as the corpus grows? Five gigabytes today, seven next year, thirty when the internal wiki finally gets ingested. The ingestion story has to be incremental by default: a full weekly reprocess of 5 GB is doable, a full weekly reprocess of 30 GB eats the evening. The vector store bill scales with vector dimensions × chunks × replicas, so the embedding model choice locks in a multi-year storage footprint. Changing embedding models means reindexing everything, so whatever dimension trade-off gets baked in at install time is the one the team lives with: cheap to choose, expensive to reverse.
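The dimensions × chunks × replicas relationship is easy to put numbers on. The sketch below counts raw float32 vector bytes only and ignores index and metadata overhead, which a real store adds on top; the one-million-chunk figure is a hypothetical round number, not a measurement of this corpus.

```python
def raw_vector_bytes(chunks: int, dimensions: int, replicas: int = 1, bytes_per_float: int = 4) -> int:
    """Raw storage for the embedding vectors alone (float32 by default)."""
    return chunks * dimensions * bytes_per_float * replicas


if __name__ == "__main__":
    hypothetical_chunks = 1_000_000  # assumed for illustration only
    for dims in (1024, 512, 256):
        gib = raw_vector_bytes(hypothetical_chunks, dims) / 2**30
        print(f"{dims:>5} dims: {gib:.2f} GiB raw")
```

Halving the dimension halves the raw footprint, which is the whole trade being locked in when the embedding model is chosen.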

The fourth is what are the failure modes we have to design against? A citation the user can’t load because the S3 object is gated by a different policy. A retrieval that returns zero chunks for a legitimate question because the filter is too tight. A chunking strategy that slices a procedure in half and leaves the generator stitching two halves of two runbooks together. A metadata-sidecar path where a file was added without its .metadata.json and therefore has no allowed_groups, defaulting to nobody or everybody depending on how the filter is composed. Each of those wants a test, a runbook, and a monitoring line. The managed service takes care of about half; the application team owns the other half.
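The missing-sidecar failure mode is one of the cheaper ones to monitor. As a rough sketch, a pre-ingestion check could take a listing of object keys and report every document without a companion file. The key names are invented, and the `metadataAttributes` wrapper in the second helper is an assumption about the sidecar layout.

```python
SIDECAR_SUFFIX = ".metadata.json"


def documents_missing_sidecar(keys: list[str]) -> list[str]:
    """Return document keys that have no companion <key>.metadata.json in the listing."""
    key_set = set(keys)
    return sorted(
        key
        for key in key_set
        if not key.endswith(SIDECAR_SUFFIX) and key + SIDECAR_SUFFIX not in key_set
    )


def sidecar_grants_access(sidecar: dict) -> bool:
    """True only if the sidecar names at least one allowed group (fail closed otherwise)."""
    groups = sidecar.get("metadataAttributes", {}).get("allowed_groups")
    return isinstance(groups, list) and len(groups) > 0


if __name__ == "__main__":
    listing = [
        "hr/leave-policy.pdf",
        "hr/leave-policy.pdf.metadata.json",
        "runbooks/rotate-db.md",  # sidecar forgotten
    ]
    print(documents_missing_sidecar(listing))
```

Running a check like this before each ingestion job turns a silent access-control default into an alert.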

The fifth is where does a small platform team want to spend its operational attention? Not on owning a vector-store operator, not on writing chunking pipelines, not on re-implementing citation extraction for the fourth time. Managed services buy back that attention at the cost of flexibility; the trade is good when the workload is standard and bad when it has a weird shape. A 50K-document corpus with vanilla group-based access control is standard. A SOX-grade audit requirement with multi-hop ACL joins is weird and wants SQL.

Finally: what does “current state” mean in practice? The brief says “within a day” but the business will discover it means “within an hour” the first time a retracted policy keeps answering questions. The ingestion cadence has to scale from weekly-cron down to per-object event without re-architecting, because the product requirement will tighten under production pressure.

What we’ll filter on

Five filters, and the landscape either clears them or doesn’t.

Document-level access control enforced during retrieval. Not a post-hoc scrub of results, otherwise the top-K is polluted with chunks the user can’t see and quality collapses.
Sub-3-second end-to-end latency at P95. Retrieval under a second, generation streamed, first tokens visible to the user inside one.
Citations that survive the model summarising or paraphrasing. The generation path has to propagate “which chunk came from which document” all the way to the response.
Incremental weekly ingestion. New files picked up, changed files re-embedded, deleted files removed. Not a full weekly reprocess of 5 GB.
Reasonable operational overhead. A small platform team. Managed components where the differentiation isn’t worth hand-rolling.

The RAG architecture landscape

Five plausible shapes on AWS.

Fine-tune a foundation model on the corpus. No retrieval at all; the knowledge goes into the weights. Weekly refresh means weekly fine-tune cycles at 5-GB scale. Citations are impossible because fine-tuning merges sources into weights with no pointer back. Per-user access control is impossible because once a chunk is in the weights, every user sees it.

Bedrock Knowledge Bases. A managed RAG service that ingests documents from a data source (S3, SharePoint, Confluence, Salesforce, web crawler, custom), chunks them, embeds them through a chosen model, stores the vectors, and exposes two runtime APIs – Retrieve for raw chunks and RetrieveAndGenerate for the full round-trip with citations. Eight supported vector stores: OpenSearch Serverless, OpenSearch managed clusters, S3 Vectors, Aurora pgvector, Neptune Analytics (GraphRAG), Pinecone, Redis Enterprise Cloud, MongoDB Atlas. Four supported embedding models: Titan Embeddings G1 (1,536 dim), Titan Text Embeddings V2 (256 / 512 / 1,024), Cohere Embed English v3 (1,024), Cohere Embed Multilingual v3 (1,024). Metadata filtering during retrieval and citations in generation are first-class.

Custom RAG with Bedrock + OpenSearch Serverless vector engine. Same substrate as Knowledge Bases’ most common configuration, but you write the pipeline: ingestion Lambdas, embedding invocations, k-NN mappings, prompt assembly, citation extraction. Every component is under your control and yours to operate. OpenSearch Serverless supports HNSW with Faiss, cosine / L2 / dot-product metrics, up to 16,000 dimensions, and scales in OCU increments (2-OCU minimum for production, $0.24 per OCU-hour).

Custom RAG with Bedrock + Aurora PostgreSQL pgvector. Same DIY pipeline, but the vector store is Aurora with pgvector 0.5.0+ and HNSW indexes on a vector(n) column. Knowledge Bases can also consume Aurora as a vector store via the RDS Data API plus Secrets Manager. The selling point is SQL: embeddings sit next to the metadata you already keep relationally, and filters become ordinary WHERE clauses.

Custom RAG with Bedrock + Amazon Kendra. Kendra is not a vector database; it’s an intelligent search service with its own ranking models, ML-based relevance tuning, and built-in document-level security. GenAI Enterprise Edition runs $0.32/hour base plus $0.25/hour per storage unit plus $0.07/hour per query unit; Basic Enterprise starts at $1.40/hour. Point it at data sources, hit Retrieve, stuff results into a Bedrock prompt, emit citations from Kendra’s result URIs.

Side by side

Option	Access control in retrieval	<3 s P95	Citations	Incremental sync	Low ops overhead
Fine-tune foundation model	✗	✓	✗	✗	✗
Bedrock Knowledge Bases	✓	✓	✓	✓	✓
Custom RAG on OpenSearch Serverless	✓	✓	✓	✓	✗
Custom RAG on Aurora pgvector	✓	✓	✓	✓	✗
Custom RAG on Kendra	✓	✓	✓	✓	—

Matching the shape to the managed service

Identity in, filter composed server-side, metadata filter applied during retrieval, citations emitted by preserving the default prompt template's `$output_format_instructions$` placeholder.

Knowledge Bases, in depth

Chunking. Five strategies: default (~300 tokens, sentence-aware), fixed-size (tunable), hierarchical (child for precision, parent for context), semantic (LLM-driven boundaries with buffer and percentile threshold), no-chunking (one chunk per document, loses page-number citations). Runbooks and policies are structured documents where the correct answer is a two-sentence span but the generator needs the surrounding subsection for context, so hierarchical earns its place. Child 300 tokens, parent 1,500. Parent + child above 8,000 combined tokens hits metadata-size limits; not supported on the S3 Vectors backend.

Embedding model. Titan V2 at 1,024 dimensions is the default for an English corpus: cheapest option that clears the quality bar, reasonable per-vector footprint. Dropping to 512 halves vector storage at some retrieval-quality cost. Cohere Embed English v3 is the upgrade when lexical-vs-semantic ranking matters. Dimensions are locked to the embedding model; switching models means reindexing the whole corpus.

Access control through metadata filtering. Every document has a companion <filename>.metadata.json declaring allowed_groups, domain, classification, effective_date. Every retrieval call passes a filter composed server-side from the authenticated caller’s group membership:

{
  "vectorSearchConfiguration": {
    "numberOfResults": 10,
    "filter": {
      "orAll": [
        {"listContains": {"key": "allowed_groups", "value": "engineering"}},
        {"listContains": {"key": "allowed_groups", "value": "on-call"}}
      ]
    }
  }
}

The filter is applied during vector search, not after. Chunks whose metadata doesn’t satisfy it never enter the top-K. Available operators: equals, notEquals, greaterThan(OrEquals), lessThan(OrEquals), in, notIn, startsWith (OpenSearch Serverless only), stringContains, listContains, andAll / orAll (minimum 2 conditions each). Enough for group-based rules; not enough for full ABAC with clearance-level comparisons.

Critical: the filter is composed by a trusted backend on every call. If the browser gets to construct it, there’s no access control at all.
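A backend helper that composes the filter might look like the following sketch. It reproduces the JSON shown above for a two-group caller, falls back to a bare `listContains` for a single group (since `orAll` needs at least two conditions), and refuses outright when the caller has no groups. The function names are hypothetical.

```python
def build_group_filter(groups: list[str]) -> dict:
    """Compose a retrieval filter from the caller's authenticated group membership."""
    conditions = [
        {"listContains": {"key": "allowed_groups", "value": group}}
        for group in sorted(set(groups))
    ]
    if not conditions:
        # Fail closed: never fall through to an unfiltered search.
        raise PermissionError("caller has no groups; refusing to retrieve")
    if len(conditions) == 1:
        return conditions[0]  # orAll requires a minimum of two conditions
    return {"orAll": conditions}


def retrieval_configuration(groups: list[str], number_of_results: int = 10) -> dict:
    """Wrap the filter in the vectorSearchConfiguration shape used above."""
    return {
        "vectorSearchConfiguration": {
            "numberOfResults": number_of_results,
            "filter": build_group_filter(groups),
        }
    }


if __name__ == "__main__":
    print(retrieval_configuration(["engineering", "on-call"]))
```

The groups argument should come from the session or identity provider lookup on the server, never from a request body the client controls.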

Incremental ingestion. StartIngestionJob walks the data source, diffs against the vector store via S3 metadata (ETags), re-embeds what changed, removes vectors for deleted documents. Weekly cron via EventBridge; per-object triggers from S3 event notifications when the product tightens to near-real-time.

Citations. RetrieveAndGenerate preserves a citations array in the response linking spans of the generated text to retrieved chunks plus their S3 URIs and metadata. Citations require the $output_format_instructions$ placeholder in the prompt template; removing it to hand-tune instructions silently disables citations.
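Turning the citations array into numbered inline references is a small piece of application code. The sketch below assumes each citation carries a `generatedResponsePart.textResponsePart.text` span and a list of `retrievedReferences` with an S3 location; the sample response is invented and much smaller than a real one, and non-S3 sources are not handled.

```python
def render_with_references(response: dict) -> str:
    """Append [n] markers to each cited span and list the sources underneath."""
    sources: list[str] = []
    parts: list[str] = []
    for citation in response.get("citations", []):
        text = citation["generatedResponsePart"]["textResponsePart"]["text"]
        markers: list[str] = []
        for reference in citation.get("retrievedReferences", []):
            uri = reference["location"]["s3Location"]["uri"]
            if uri not in sources:
                sources.append(uri)
            marker = f"[{sources.index(uri) + 1}]"
            if marker not in markers:  # two chunks from one document share a number
                markers.append(marker)
        parts.append(text + "".join(markers))
    body = " ".join(parts)
    if not sources:
        return body
    footer = "\n".join(f"[{n}] {uri}" for n, uri in enumerate(sources, start=1))
    return f"{body}\n\n{footer}"


# Invented sample in the assumed response shape.
SAMPLE_RESPONSE = {
    "citations": [
        {
            "generatedResponsePart": {"textResponsePart": {"text": "Run the rotation job first."}},
            "retrievedReferences": [
                {"location": {"s3Location": {"uri": "s3://example-docs/runbooks/db-rotation.md"}}}
            ],
        },
        {
            "generatedResponsePart": {"textResponsePart": {"text": "Then confirm replicas reconnect."}},
            "retrievedReferences": [
                {"location": {"s3Location": {"uri": "s3://example-docs/runbooks/db-rotation.md"}}},
                {"location": {"s3Location": {"uri": "s3://example-docs/runbooks/replica-checks.md"}}},
            ],
        },
    ]
}

if __name__ == "__main__":
    print(render_with_references(SAMPLE_RESPONSE))
```

A renderer like this is also a natural place to enforce the no-citation, no-answer rule: an empty sources list means the answer should not be shown.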

A worked retrieval trace

One question, end to end. An engineer asks “What’s the runbook for rotating the production database password?” Their groups are ["engineering", "on-call"].

Identity translation. Backend looks up groups, confirms the session is live, composes the retrieval filter.
Embed the query. Titan V2 returns a 1,024-dim vector in ~30-80 ms.
Vector search with filter. Retrieve with numberOfResults: 10 and the orAll filter. OpenSearch Serverless runs HNSW k-NN with metadata filtering during search, returning ten chunks. HR chunks never contribute noise. ~100-250 ms.
Hierarchical replacement. Child chunks sharing a parent collapse to the parent. Ten children might become six parents, each 1,500-token, each with surrounding procedural context.
Prompt assembly. Knowledge Bases populates $search_results$, $query$, and $output_format_instructions$; removing the last silently disables citations.
Generation. RetrieveAndGenerate calls Claude Sonnet via a cross-region inference profile. First token ~800 ms; a 300-token answer finishes in ~1.8 s.
Citations. Response includes a citations array linking spans of generated text to retrieved chunks plus S3 URIs. The app renders each as a numbered inline reference.
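The hierarchical replacement step in the trace above happens inside the managed service, but the idea is easy to illustrate. In this sketch each child hit carries a hypothetical `parent_id` field, and children are assumed to arrive ordered best match first; the function keeps one entry per parent, in order of first appearance.

```python
def collapse_to_parents(children: list[dict]) -> list[str]:
    """Replace child hits with their parents, deduplicated, preserving rank order."""
    seen: set[str] = set()
    parents: list[str] = []
    for child in children:
        parent_id = child["parent_id"]  # hypothetical field name
        if parent_id not in seen:
            seen.add(parent_id)
            parents.append(parent_id)
    return parents


if __name__ == "__main__":
    # Ten invented child hits spread across six parents.
    parent_ids = ["p1", "p1", "p2", "p3", "p1", "p4", "p5", "p2", "p6", "p3"]
    hits = [{"chunk_id": f"c{n}", "parent_id": pid} for n, pid in enumerate(parent_ids)]
    print(collapse_to_parents(hits))
```

The practical consequence is that the generator can receive fewer, larger passages than the numberOfResults setting suggests.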

Total end-to-end: embedding 60 ms + vector search 180 ms + orchestration 50 ms + first-token 800 ms + streaming 1,000 ms = ~2.1 s P95. Comfortably inside the 3-second budget.
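The latency total is a straight sum of the stage figures above, which makes it easy to keep as a checked budget. The stage names are labels chosen for this sketch; the millisecond values are the ones from the trace.

```python
LATENCY_BUDGET_MS = 3000

# Stage figures from the trace above, in milliseconds.
STAGES_MS = {
    "embed query": 60,
    "vector search": 180,
    "orchestration": 50,
    "first token": 800,
    "streaming": 1000,
}


def total_latency_ms(stages: dict[str, int]) -> int:
    """End-to-end latency as the sum of sequential stages."""
    return sum(stages.values())


if __name__ == "__main__":
    total = total_latency_ms(STAGES_MS)
    generation = STAGES_MS["first token"] + STAGES_MS["streaming"]
    print(f"total {total} ms, headroom {LATENCY_BUDGET_MS - total} ms")
    print(f"generation share {generation / total:.0%}")
```

Generation accounts for most of the total, so a slower model or a longer answer eats the headroom far faster than any retrieval tuning can win it back.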

When Aurora pgvector earns its place instead

Reach for Aurora pgvector directly when the access-control logic exceeds what metadata-filter operators express: multi-hop joins across user / group / ACL / classification tables, clearance-level ≤ user-clearance via a lookup table, time-windowed validity (effective_date <= now() AND (expiry_date IS NULL OR expiry_date > now())). SQL eats all of that; metadata attributes can’t. Also correct when the ops muscle for Postgres already exists and adding pgvector plus an HNSW index is a smaller jump than owning an OpenSearch Serverless collection, or when transactional consistency between documents and metadata matters (an ACL change and its embedding update atomically, no stale-filter window).

For 50,000 documents with a vanilla group-membership filter, Aurora is overkill. For 5 million documents with SOX-grade audit against a mature Postgres estate, it’s the correct answer.

When Kendra earns its place instead

Kendra is an intelligent-search service that happens to be useful in a RAG pipeline. Favour it when ranking quality on messy natural-language queries matters more than embedding flexibility (Kendra’s ML-based ranking beats plain vector similarity when user phrasing diverges sharply from source text), when document-level access control via user tokens and group context in the Retrieve API is easier to wire than metadata sidecars, and when the maintained connectors (SharePoint, Confluence, ServiceNow, Salesforce, Box, Slack) earn the premium. For 50,000 documents a GenAI Enterprise Edition base runs ~$500-700/month before queries versus OpenSearch Serverless’s 2-OCU minimum at ~$350/month. For the situation as stated, Knowledge Bases wins on cost and flexibility. For “users consistently phrase things weirdly enough that vector similarity misses,” Kendra earns the premium.

What’s worth remembering

Bedrock Knowledge Bases is the managed RAG path. A data source, a chunking strategy, an embedding model, a vector store, and two runtime APIs: Retrieve for raw chunks and RetrieveAndGenerate for the full round-trip with citations.
Chunking is the lever nobody thinks about until answers are wrong. Five strategies; hierarchical (child for precision, parent for generator context) is the pragmatic default for structured documents.
Embedding model locks dimensions and therefore storage footprint. Titan V2 at 1,024 is the sensible English-corpus default; changing embedding models means reindexing.
Metadata filters run during vector search, not after. That’s what makes access control effective rather than cosmetic: disallowed chunks never enter the top-K and never pollute the generator.
Filter operators cover equals, numeric comparisons, in / notIn, stringContains, listContains, startsWith (OpenSearch-Serverless only), andAll / orAll. Enough for group-based access; not enough for multi-hop SQL-style ACL joins.
Identity-to-groups translation happens server-side. The browser never composes filters; that’s the one non-negotiable security boundary in the design.
Citations depend on the $output_format_instructions$ placeholder. Remove it to hand-tune the prompt and citations vanish silently.
Incremental ingestion scales from weekly cron to per-object S3 event triggers without rearchitecting. “Weekly” becomes “within an hour” with a config change, not a redesign.
Aurora pgvector is the upgrade path when access-control logic exceeds metadata-filter operators. Kendra is the upgrade when ranking quality beats embedding flexibility. Fine-tuning is the wrong path entirely for living, access-controlled corpora.

The answer: Bedrock Knowledge Bases on OpenSearch Serverless, Titan Text Embeddings V2 at 1,024 dimensions, hierarchical chunking with child 300 tokens and parent 1,500, metadata sidecars declaring allowed_groups, every RetrieveAndGenerate filtered by the caller’s group membership via orAll + listContains. Weekly EventBridge-triggered StartIngestionJob; Claude Sonnet for generation with the default prompt template preserving $output_format_instructions$. Latency closes at ~2.1 s P95; generation dominates the time budget and retrieval barely registers. A configured managed service plus a small orchestration Lambda, not a pipeline to own.
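The group filter is the piece most worth seeing in code. What follows is a minimal sketch, not the production Lambda: the function name group_filter is hypothetical, the metadata key allowed_groups is the one named above, and the single-group special case assumes orAll needs at least two clauses, which is worth confirming against the current API reference. The resulting dict is what the server would place in the vector search configuration's filter field.

```python
def group_filter(groups, key="allowed_groups"):
    """Build a retrieval filter that matches chunks whose list-valued
    metadata field `key` contains at least one of the caller's groups."""
    clauses = [
        {"listContains": {"key": key, "value": group}}
        for group in sorted(set(groups))  # de-duplicate, keep output stable
    ]
    if not clauses:
        # Fail closed: never fall back to an unfiltered query.
        raise PermissionError("caller has no groups; refusing to query unfiltered")
    if len(clauses) == 1:
        # Assumption: orAll expects two or more clauses, so pass a lone clause bare.
        return clauses[0]
    return {"orAll": clauses}


# Built server-side from the authenticated identity, never from browser input.
example = group_filter(["finance", "emea-managers"])
```

Failing closed on an empty group list is the design choice that matters: an empty filter would otherwise mean "everything".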

Rules, Grammars, and Regex

2026-05-30T06:00:00+08:00

A regulator emails the compliance team: every customer email mentioning a competitor must be flagged for review within ten minutes of receipt, with an audit trail of why each one was flagged. The team starts designing an LLM pipeline. Six weeks in, the regulator wants to see the model card and asks why the system flagged a borderline message yesterday at 3:47pm. Nobody can answer.

A list of competitor names in a regex would have shipped on day one and answered the regulator’s question in three seconds.

This post is about the AI that isn’t AI: deterministic, hand-written rules. They have no learning, no embeddings, no probabilistic outputs. They’re often dismissed as “primitive,” and they’re often the correct answer.

In the previous post we covered the classical statistical baselines that beat fancy models on small problems. This post covers the rule-based systems that beat statistical models on problems where you actually know the answer.

The case for rules

A rule-based system is one where every decision is dictated by code a human wrote, not weights a machine learned. Three properties make rules valuable:

They’re deterministic. Same input, same output, every time. No drift, no hallucination, no surprise.
They’re auditable. Every decision can be traced to a specific line of code. You can explain to a regulator, an auditor, or a customer exactly why the system did what it did.
They’re free at inference time. A regex match runs in microseconds. A finite-state transducer runs at gigabytes per second. There’s no per-call cost, no rate limit, no GPU.

In return for those properties, rules are inflexible. They handle exactly what you wrote down, and nothing else. The first time the world produces an input you didn’t anticipate, your rule fails, silently or loudly, depending on the system.

It all comes down to the shape of the input space. When it’s bounded and the rules are knowable, hand-written rules are unbeatable. When it’s open-ended and full of paraphrase and ambiguity, hand-written rules are useless and you need a learning system.

Regular expressions: the workhorse

You know regular expressions. They’re a small language for describing patterns in text. They came out of theoretical computer science in the 1950s and have been quietly running production systems ever since.

Things you can do with regular expressions and shouldn’t reach for an LLM for:

Validating that an email address looks plausible.
Extracting phone numbers, postcodes, dates, ABNs, account numbers, IP addresses.
Tokenising structured logs. Every webserver log, syslog entry, and audit trail is parseable by regex.
Recognising fixed product codes in customer support tickets (“KB-2847-FATAL”) for routing.
Stripping HTML, normalising whitespace, redacting sensitive fields.
Implementing a basic spam filter for known bad strings.
Anything where the pattern is fixed, even if it allows a bounded set of variations.

If you can write a regex that matches your target with high precision and recall, you don’t need a model. You’re done. Ship the regex.
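For the regulator's request from the opening, "ship the regex" might look something like the sketch below. The competitor names, the rule label, and the function name flag_competitors are all made up for illustration; the point is that every hit carries its own explanation.

```python
import re

# Hypothetical competitor list; in practice this would live in reviewed config.
COMPETITORS = ["Acme Corp", "Globex", "Initech"]

# One alternation, escaped so names are matched literally, with word
# boundaries so "Globex" does not fire inside "Globexit".
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in COMPETITORS) + r")\b",
    re.IGNORECASE,
)


def flag_competitors(body):
    """Return one audit record per competitor mention found in the text."""
    return [
        {
            "rule": "competitor-name-list",  # hypothetical rule identifier
            "matched": m.group(1),
            "start": m.start(1),
            "end": m.end(1),
        }
        for m in PATTERN.finditer(body)
    ]


hits = flag_competitors("We are comparing your quote with globex and Initech.")
```

An empty list means "not flagged". A non-empty list is the audit trail: which rule fired, on which text, at which offset.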

The known dangers

Regular expressions have a well-earned reputation for sharp edges, and the relevant ones are:

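Catastrophic backtracking. A poorly written regex can take exponential time on adversarial inputs. The re2 library (Google) sidesteps this with an automaton-based engine that runs in time linear in the input length, at the cost of dropping backreferences and lookaround. If you’re processing untrusted input, use re2.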
Unicode is harder than ASCII. \w doesn’t always mean what you think it means once you leave ASCII land.
The “more is more” trap. A regex that grows past 200 characters is usually a sign you should be writing a parser, not a pattern.

The discipline that pays off is treating regex as a focused tool. Use it when the pattern is genuinely regular. Reach for something else when it isn’t.
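The Unicode point is the easiest of the three to demonstrate. This is a small sketch using Python's standard re module, where \w is Unicode-aware by default for str patterns and re.ASCII restricts it to [a-zA-Z0-9_]; other engines choose different defaults, which is exactly the trap.

```python
import re

text = "caf\u00e9 na\u00efve"  # "café naïve", with precomposed accented letters

# Default for str patterns in Python 3: \w matches Unicode word characters.
unicode_words = re.findall(r"\w+", text)

# With re.ASCII, the accented letters stop counting as word characters,
# so each word is split at the accent.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

print(unicode_words)
print(ascii_words)
```

Same pattern, same input, different tokens. If a rule depends on \w, it is worth pinning the mode explicitly rather than relying on whatever the engine defaults to.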

Finite-state transducers: regex with structure

A finite-state transducer (FST) is a regex with two important upgrades: it can produce output, and it can be composed with other FSTs.

An FST is a state machine that consumes input symbols and emits output symbols based on its current state. The classic use is morphological analysis, mapping walked to walk + PAST or mice to mouse + PLURAL. The transducer encodes the rules of a language’s morphology directly.
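To make "consumes input symbols and emits output symbols based on its current state" concrete, here is a toy transducer written in plain Python, not pynini or OpenFST. It runs in the generation direction (walk + PAST to walked) because that direction is deterministic; real FST toolkits can invert a transducer to get the analysis direction. The states, the single spelling rule (no doubled e before -ed), and the function names are all illustrative assumptions, not a real morphology.

```python
PAST = "+PAST"  # treated as one input symbol, not five characters

# (state, symbol class) -> (next state, output template)
TRANSITIONS = {
    ("start", "letter"): ("start", "{sym}"),    # copy the letter through
    ("start", "e"): ("held_e", ""),             # hold the e, emit nothing yet
    ("start", PAST): ("start", "ed"),
    ("held_e", "letter"): ("start", "e{sym}"),  # release the held e, then copy
    ("held_e", "e"): ("held_e", "e"),           # release one e, hold the new one
    ("held_e", PAST): ("start", "ed"),          # held e plus d: no doubled e
}
FINAL_OUTPUT = {"start": "", "held_e": "e"}     # flush a held e at end of input


def classify(sym):
    """Map a concrete symbol to the class used in the transition table."""
    if sym == PAST:
        return PAST
    return "e" if sym == "e" else "letter"


def transduce(symbols):
    state, out = "start", []
    for sym in symbols:
        state, template = TRANSITIONS[(state, classify(sym))]
        out.append(template.format(sym=sym))
    out.append(FINAL_OUTPUT[state])
    return "".join(out)


def inflect(lemma, tag=None):
    return transduce(list(lemma) + ([tag] if tag else []))


print(inflect("walk", PAST), inflect("bake", PAST), inflect("bake"))
```

The whole behaviour is the table. There is no code path that is not a row you can point at, which is what makes these machines composable and auditable.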

FSTs are the workhorse of:

Speech recognition lexicons, mapping phoneme sequences to words.
Computational morphology for low-resource languages.
Spell-checkers and stemmers for languages with rich inflection.
Pre-processing pipelines for NLP in production search systems.
Machine translation grammars, particularly rule-based MT for language pairs without enough parallel corpus for neural translation.

The dominant tool here is OpenFST (a C++ library originally from AT&T). The pynini Python wrapper is the practitioner’s way in. For most people building a normal application this is overkill, but if you’re working on production search, speech, or non-English NLP, FSTs are part of the toolkit and they’re not going away.

Context-free grammars: parsing structured language

Beyond regular languages live context-free grammars (CFGs). They are the tool you reach for when a language has structure that a regex can’t capture: nested brackets, recursive expressions, anything where the validity of one part depends on another part you haven’t seen yet.

CFGs are the foundation of programming-language compilers. They’re also production tools for:

Parsing structured user input: query languages, formula syntax, search expressions.
Validating semi-structured documents: LaTeX, JSON, XML, all defined by grammars.
Information extraction from templated text: forms, contracts, regulatory filings.
Implementing natural-language interfaces with bounded vocabularies, such as voice command systems where the user can only say a fixed set of patterns.

You write the grammar, you run a parser-generator (ANTLR, Bison, Lark for Python), you get a parser. The result is fast, deterministic, and tells you exactly which production rule matched.
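For a sense of what the generated parser is doing, here is a hand-written recursive-descent parser for a three-rule arithmetic grammar. It stands in for what ANTLR or Lark would produce from the same grammar and is a sketch, not a substitute for those tools; the grammar and function names are invented for the example.

```python
import re

# Grammar (each function below implements one production):
#   expr := term ('+' term)*
#   term := atom ('*' atom)*
#   atom := NUMBER | '(' expr ')'

TOKEN = re.compile(r"\s*(\d+|[()+*])")


def tokenize(text):
    text, pos, tokens = text.rstrip(), 0, []
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def parse_expr(tokens, i):
    value, i = parse_term(tokens, i)
    while i < len(tokens) and tokens[i] == "+":
        rhs, i = parse_term(tokens, i + 1)
        value += rhs
    return value, i


def parse_term(tokens, i):
    value, i = parse_atom(tokens, i)
    while i < len(tokens) and tokens[i] == "*":
        rhs, i = parse_atom(tokens, i + 1)
        value *= rhs
    return value, i


def parse_atom(tokens, i):
    if i >= len(tokens):
        raise SyntaxError("unexpected end of input")
    tok = tokens[i]
    if tok.isdigit():
        return int(tok), i + 1
    if tok == "(":
        value, i = parse_expr(tokens, i + 1)  # recursion handles the nesting
        if i >= len(tokens) or tokens[i] != ")":
            raise SyntaxError("expected ')'")
        return value, i + 1
    raise SyntaxError(f"unexpected token {tok!r}")


def evaluate(text):
    tokens = tokenize(text)
    value, i = parse_expr(tokens, 0)
    if i != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[i]!r}")
    return value


print(evaluate("2 + 3 * (4 + 1)"))
```

The recursion in parse_atom is the part no regex can express: brackets can nest to any depth and the parser tracks that depth on the call stack.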

For most application code, CFGs are overkill; regex handles it. But the moment you find yourself writing nested-condition regex with manual depth tracking, stop and reach for a grammar.

Decision trees and rule lists

A decision tree is a sequence of if-then rules organised as a tree. Each internal node tests a feature; each leaf is a decision. They sit on the boundary between rules and ML: you can hand-write a decision tree (it’s just a flowchart) or learn one from data (sklearn.tree.DecisionTreeClassifier).

Hand-written decision trees are the correct answer for:

Eligibility logic. “Customer is eligible for the discount if they’ve been with us for more than 12 months AND have spent more than $500 AND haven’t used a discount in the last 90 days.”
Triage and routing. Support ticket routing, document workflows, customer-service escalation.
Compliance gating. Regulatory rules that must be applied exactly as written.
Game logic. Rules in a turn-based game, transitions in a state machine.

The advantage of a decision tree (or its equivalent: a sequence of if-elif statements in code) is that the entire decision process is visible and reviewable. The disadvantage: it doesn’t generalise. If a new condition shows up that wasn’t in the rules, the tree has no opinion.
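The discount rule from the list above, written as the if-elif chain it really is. This is a sketch: the Customer fields and the function name are hypothetical, and treating "exactly 90 days ago" as still inside the window is an assumption a real policy document would have to settle.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Customer:
    tenure_months: int
    total_spend: float
    days_since_last_discount: Optional[int]  # None means never used one


def discount_eligibility(c):
    """Return (eligible, reason); each branch is one node of the tree."""
    if c.tenure_months <= 12:
        return False, "tenure is not more than 12 months"
    if c.total_spend <= 500:
        return False, "spend is not more than $500"
    # Assumption: day 90 itself still counts as "in the last 90 days".
    if c.days_since_last_discount is not None and c.days_since_last_discount <= 90:
        return False, "used a discount in the last 90 days"
    return True, "all three conditions met"


print(discount_eligibility(Customer(18, 740.0, None)))
```

Returning the reason alongside the decision costs one string per branch and gives support staff, auditors, and the customer the same answer to "why?".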

The hybrid pattern that pays off: start with a hand-written decision tree, instrument it for the cases it handles badly, then either add rules or switch to a learned model when the rule list gets unmaintainable. Many production systems live in this hybrid space for years.

Expert systems: the ancestor

A rule-based expert system is a large collection of if-then rules with an inference engine that chains them together. They were the fashionable AI of the 1970s and 1980s: MYCIN for medical diagnosis, DENDRAL for chemistry, XCON for configuring DEC computers.
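"An inference engine that chains them together" sounds grander than it is. A minimal forward-chaining loop fits in a dozen lines; the sketch below uses two invented configuration rules in the spirit of XCON, not anything taken from a real rule base.

```python
def forward_chain(facts, rules):
    """Fire rules until nothing new can be derived.

    rules: list of (premises, conclusion), where premises is a frozenset.
    Returns the full set of known facts and a trace of which rule fired.
    """
    known = set(facts)
    trace = []
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and premises <= known:
                known.add(conclusion)
                trace.append((sorted(premises), conclusion))
                changed = True
    return known, trace


# Hypothetical rules; the second only becomes applicable after the first fires.
RULES = [
    (frozenset({"add_cabinet"}), "add_power_supply"),
    (frozenset({"needs_disk", "cabinet_full"}), "add_cabinet"),
]

known, trace = forward_chain({"needs_disk", "cabinet_full"}, RULES)
for premises, conclusion in trace:
    print(premises, "=>", conclusion)
```

The trace is the explanation facility these systems were known for: every derived fact comes with the rule and premises that produced it.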

The expert-system winter, when these projects mostly disappointed, gave rule-based AI a bad name in popular memory. But the practical lessons remain:

Rules work well for stable, well-understood domains.
Maintaining a rule base of more than a few thousand rules is hard. Conflicts emerge. Edge cases pile up. The system becomes brittle.
Combining rules with statistical methods, using rules for the cases you understand and ML for the rest, is often more practical than picking one.

Modern descendants live in:

Drools and similar business-rules engines, used in insurance underwriting, banking compliance, and benefits administration.
Prolog and other logic-programming systems, mostly in academia but still used commercially in some niches.
Datalog for analytic and policy reasoning over relational data.

When you hear “rule-based system” today, it’s usually a Drools-style production rule engine making decisions in a regulated domain.

When to use rules: a triage

Property of your problem	Lean toward rules	Lean toward ML
Auditability requirement	Strong (regulatory, legal, safety)	Weak (best-effort relevance)
Latency budget	Microseconds	Milliseconds or seconds OK
Per-call cost tolerance	Must be near-zero	Some cost is fine
Input variability	Bounded (formats, codes, structured)	Open-ended (natural language)
Domain expert availability	Yes, can write down the rules	No, has to be learned from data
Drift	Slow (years)	Fast (weeks/months)
Volume of labelled data	Zero, rules are the labels	Thousands of examples
Acceptance of failures	Failures must be debuggable	Probabilistic failures OK

The hybrid pattern

The best production systems usually mix rules and learning. The pattern, in rough form:

Rules at the edges. Fast pre-filters that reject obvious garbage and recognise unambiguous cases.
Statistical models in the middle. ML for the genuinely ambiguous cases the rules can’t handle.
Rules at the edges again. Post-filters that catch known-bad model outputs and apply business logic.

A spam filter does this: regex catches the obvious phishing patterns; a statistical model handles the borderline cases; a final rule layer applies user preferences and explicit allowlists. A search relevance system does this: lexical matching with rules first; semantic ranking with embeddings; business-rule re-ranking last.
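The spam-filter version of that sandwich can be sketched in a few lines. Everything named here is an assumption made for illustration: the phishing pattern, the 0.5 threshold, and score_fn, which stands in for whatever statistical model sits in the middle.

```python
import re

# Pre-filter: a deliberately tiny, hypothetical list of unambiguous patterns.
OBVIOUS_PHISHING = re.compile(
    r"verify your account immediately|wire transfer fee", re.IGNORECASE
)


def classify(sender, body, score_fn, allowlist, threshold=0.5):
    """Return (label, reason). score_fn is any callable mapping text to [0, 1]."""
    # 1. Rules at the edge: reject the obvious cases without calling the model.
    if OBVIOUS_PHISHING.search(body):
        return "spam", "rule:pre-filter"

    # 2. Model in the middle: only ambiguous mail gets this far.
    score = score_fn(body)
    label = "spam" if score >= threshold else "ham"

    # 3. Rules at the edge again: explicit user policy overrides the model.
    if label == "spam" and sender in allowlist:
        return "ham", "rule:allowlist"
    return label, f"model:score={score:.2f}"


def stub_model(text):
    """Stand-in scorer for the example, not a real classifier."""
    return 0.9 if "prize" in text.lower() else 0.1


print(classify("boss@example.com", "You won a prize", stub_model, {"boss@example.com"}))
```

Each return value names the layer that made the call, so the rule-driven decisions stay fully traceable and the model-driven ones are at least clearly labelled as such.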

The reason the hybrid wins is that rules and ML have inverse strengths and weaknesses. Rules are precise but rigid; ML is flexible but fuzzy. Used together, they cover each other’s gaps.

Rules are AI’s quiet underclass. They run more production systems than transformers do, and most teams forget about them until the regulator emails or the latency budget collapses. Regex handles patterns that are genuinely regular. Finite-state transducers extend that into composition and structured output for speech and morphology. Context-free grammars take over when nesting and recursion show up. Hand-written decision trees encode the business logic nobody wants buried in a model. Production rule engines like Drools still run the parts of insurance, banking, and compliance where every decision needs a trace.

The version that wins in production is rarely all-rules or all-learning. It’s rules at the edges, fast pre-filters that catch the obvious cases and post-filters that apply business policy, with statistical models in the middle handling the genuinely ambiguous inputs. Rules and learning have inverse strengths. Used together, they cover each other’s gaps. The next four posts in the series leave the language-and-text neighbourhood and pick up the rest of the classical AI textbook: search and planning, logic, constraints, probability.

The Salt in the Dish

2026-05-29T06:00:00+08:00

There are four kinds of salt in my kitchen drawer, and they are not interchangeable.

Maldon flaky sea salt for finishing steak. Cooking salt in a big jar by the stove, cheap and honest, for seasoning as I go and for brines. Oak smoked salt in a tin my mother-in-law gave me, so intensely flavoured I use it by the pinch on things off the grill. And a grinder of powdered salt I crush from cooking salt with a mortar and pestle, because powdered salt disperses evenly over popcorn without forming the salty pockets that spoil a bowl halfway through.

None of these is a substitute for any of the others. A pinch of oak smoked where cooking salt is called for would be overwhelming. A teaspoon of Maldon where fine salt is called for would leave most of the food unseasoned and some of it disagreeably gritty. They are tools. Each does one thing well and refuses to do the other things at all.

This post is about salt. Most of it is about dependencies, and about the discipline of knowing what you’re putting into the dish before you put it in.

Not all salts are the same kind of salty

A teaspoon of one salt is not a teaspoon of another. Morton’s table salt is dense and finely crystalline; a teaspoon weighs about six grams. Diamond Crystal kosher salt is the same compound, but its crystals are hollow and flaky, and a teaspoon weighs about three grams. A recipe written for Morton’s and executed with Diamond Crystal will be under-seasoned. The reverse will be inedibly salty. Same chemical, same teaspoon, half the delivered dose.

Flaky finishing salts like Maldon are designed to sit on top of the food as a textural element, not to dissolve into it. Grinding Maldon into a braise is a waste; sprinkling it on a finished steak is exactly right. Smoked salts carry flavours as well as sodium and must be used sparingly. Fine salts for brining need to dissolve completely and can’t contain anti-caking agents that cloud the brine.

What you actually want, at every point in every dish, is the right form of salt for this moment. Grabbing the nearest box because it says “salt” on the side is a mistake that shows up later, in the taste of the thing you served to people who trusted you with their dinner.

Most “salts” are not food

Most of the compounds called “salts” are not edible. “Salt” is a chemistry term, not a culinary one: any compound formed when an acid reacts with a base. Sodium chloride is one. There are thousands of others.

Epsom salt is magnesium sulfate. It is a laxative. Lead acetate is a salt; the Romans used it to sweeten wine and it probably poisoned a fair chunk of the aristocracy. Potassium nitrate is a salt, used in gunpowder and in curing bacon, and which application you have in mind very much matters.

The word “salt” doesn’t tell you whether the thing is safe to put in food. You have to know which salt you’re looking at, and in what quantity.

The parallel to software is exact. A package on npm called fast-json-parser could be a perfectly fine JSON parser, or a package published last week that quietly exfiltrates environment variables while also, technically, parsing JSON. The name tells you nothing. “It is, technically, a JSON parser” is the software equivalent of “it is, technically, a salt.”

The cake contest

There is an old story, probably apocryphal, about a baking contest in which one contestant sabotaged another by swapping the labels on two unmarked jars in the victim’s pantry the night before the final. The victim reached for the jar they thought was sugar and measured out a cup of salt. The cake was inedible. They lost.

The attack was not on the salt. The salt was fine. The attack was on the assumption that the label on the jar reflected the contents of the jar.

Software supply chain attacks work the same way. They are attacks on the assumption that the package name on npm, or PyPI, matches what’s inside. Someone takes over an abandoned package. Registers a name one letter away from a popular one. Pushes a malicious version to a legitimate project. By the time anyone notices, the poisoned version has been installed by a hundred thousand npm install commands, each one issued by an engineer who trusted that the label matched the jar.

Software Bills of Materials (SBoMs) are, in essence, the practice of writing on every jar exactly what’s in it, when it arrived, and where it came from. Boring paperwork. Tedious to maintain. Exactly the kind of thing an engineer who is moving fast will skip, and then one morning will discover they needed.

Taste before you use

The single most important discipline in a kitchen is tasting the dish at every stage, and seasoning in response to what you taste, not in response to what the recipe said to do.

Recipes are approximations. The tomatoes were different tomatoes. The stock had different baseline salt. The cheese you’re melting contains sodium the recipe writer didn’t know about. So you taste. When the onions are sweating. When the liquid goes in. Halfway through the braise. Right before you plate. Each time you add a little if it needs it, and nothing if it doesn’t. You are adjusting based on current state, not on a timer and a hope.

Engineers should do exactly the same with dependencies. Read the README. Skim the entry point. Look at the issue tracker. Check the release cadence. Run the tests locally. Try it against your actual use case before you commit to it. Taste the dish.

The engineer who runs npm install against the first Google result is the cook who empties the first jar they grab into the pot without tasting. Sometimes this works. Often it produces something under-seasoned. Every so often, and this is the one that keeps me awake, it produces something that tastes fine at first and is slowly poisoning everyone who eats it.

The drawer has four salts for a reason

My drawer has four salts because each does something the others can’t. Learning which is which, and how much, and when, is the patient unglamorous work of becoming someone who can cook.

Your codebase’s package.json should have the same relationship with its dependencies. Each one chosen for a specific reason you remember. Each one tasted before it was committed. Each one revisited periodically to see whether the reason still holds. None grabbed at random because the name on the jar sounded about right.

Taste. Decide. Add. Taste again. Adjust. That’s the discipline. Everything else is built on top of it.

Why Does Thursday Last Forever?

2026-05-28T06:00:00+08:00

The previous posts in this series asked what time even is, how we count it, how physics bends it, whether it exists at all, whether you can go backwards, and how your body keeps its own time. All of that was about time out there: in clocks, in spacetime, in the equations, in your biology. This post is about time in here. In your head. Where it behaves worst of all.

The afternoon that wouldn’t end

You know the feeling. You’re in a meeting on a Thursday afternoon. It started at 2.00. You’ve been through two agenda items, a disagreement about scope, and someone’s screen-share that wouldn’t connect. You check the clock, certain it must be nearly 3.00. It’s 2.12.

This isn’t boredom distorting your memory. Your brain is actively constructing a wrong answer about how much time has passed. It does this reliably, predictably, and for reasons that neuroscience is starting to understand.

The clock on the wall is objective. It ticks at the same rate whether you’re watching it or not (well, mostly). But the clock in your head, the one that tells you “that felt like an hour” or “where did the day go”, runs on completely different hardware, and it has no quartz crystal, no caesium atom, no oscillator of any kind. It’s a guess assembled from scraps, and it’s wrong more often than it’s right.

Your brain doesn’t have a clock

This is the first surprise. Despite our constant awareness of time passing, the human brain has no dedicated timekeeping organ. There’s no neural metronome ticking away in your cortex. Unlike vision (which has the visual cortex) or hearing (the auditory cortex), time perception is distributed across multiple brain regions, none of which is specifically for time.

The leading model (still debated) is something called the striatal beat frequency model, proposed by Matthew Matell and Warren Meck in 2004. The idea: cortical neurons oscillate at different frequencies, like a room full of musicians each playing at their own tempo. The striatum, a structure deep in the brain involved in decision-making and reward, listens to the pattern of beats. When a familiar pattern recurs, the brain recognises it as a familiar duration. “That felt like about five seconds” isn’t a measurement. It’s a pattern match.

This is astonishingly imprecise compared to a caesium clock. But it works well enough to catch a ball, keep a beat, and sense that Thursday afternoon is dragging.

Why time slows down when you’re watching

A watched pot never boils. Psychologists explain this with the attentional gate model (Zakay & Block, 1995). The theory: when you direct attention toward the passage of time itself, you notice more temporal information, and more noticed information makes the interval feel longer.

It’s like counting cars on a motorway. If you’re not paying attention, you’d guess “a few went past.” If you’re actively counting, you’d say “seventeen.” The cars didn’t speed up. You just noticed more of them. Time works the same way. When you’re clock-watching in that Thursday meeting, you’re accumulating more temporal “ticks” in working memory, and more ticks means the interval feels stretched.

The reverse is equally real. When you’re absorbed in something (what Csikszentmihalyi called flow) attention is consumed by the task, leaving nothing spare for monitoring the clock. Time doesn’t slow down or speed up. You just stop counting. An hour vanishes because you didn’t notice it passing.

Why holidays evaporate

Here’s the paradox. That Thursday meeting felt endless while it was happening. But ask someone about it a week later and they’ll say “I barely remember it.” Meanwhile, a two-week holiday felt like it flew past while you were on it, but looking back, it feels like it lasted ages.

This is the difference between prospective time (how long something feels while it’s happening) and retrospective time (how long it seems in memory). They use different mechanisms, and they often give opposite answers.

Prospective time is driven by attention. The more you monitor the clock, the longer it feels. Retrospective time is driven by memory density: how many distinct, novel memories were formed. A boring Thursday produces almost no memorable events, so in retrospect it collapses to nothing. A holiday in an unfamiliar place (new food, new streets, new language, daily surprises) lays down dense, rich memories, and the brain interprets that density as duration.

This is why the first day of a holiday feels longest in retrospect, and the last day feels shortest. By day ten, you know how the coffee machine works, where the beach is, what the breakfast buffet looks like. Novelty drops. Memory formation slows. The days start blurring together, just like they do at home.

William James wrote about this in 1890: “In youth we may have an absolutely new experience, subjective or objective, every hour of the day. Apprehension is vivid, retentiveness strong, and our recollections of that time, like those of a time spent in rapid and interesting travel, are of something intricate, multitudinous, and long-drawn-out. But as each passing year converts some of this experience into automatic routine which we hardly note at all, the days and the weeks smooth themselves out in recollection to contentless units, and the years grow hollow and collapse.”

He was 48. He was describing the effect from the inside.

Why childhood lasted forever

This is the same mechanism writ large. Children experience almost everything for the first time. The first day of school. The first time you ride a bike. The first thunderstorm that actually scares you. Every one of these is a dense, vivid memory. A year of childhood contains thousands of novel events, and looking back, the brain reads that density as duration. A year felt enormous because it was enormous, in terms of encoded experience.

By your thirties, most experiences are variations on things you’ve already done. Another commute. Another Monday. Another Christmas that’s almost the same as last Christmas. The events are real, but they don’t register as novel, so memory formation is thin. A year passes and when you look back, there’s not much there. Not because nothing happened, but because nothing new happened.

Daniel Kahneman makes a useful distinction between the experiencing self (the one who lives through each moment) and the remembering self (the one who tells the story afterward). The experiencing self had a perfectly normal year. The remembering self says it was over in a flash, because it has almost nothing to report.

This has a practical corollary that sounds like self-help but is grounded in psychology: if you want time to feel longer in retrospect, seek novelty. New places, new skills, new routines. Not because happiness requires novelty (it doesn’t) but because memory does. The years you remember are the ones that were different from the years before.

Temperature, emotion, and the internal clock

Your internal clock isn’t just attention-dependent. It’s also affected by body temperature, emotional state, and neurochemistry.

Temperature: raising body temperature speeds up the internal clock. In studies where participants’ core temperature was elevated (via warm rooms or mild fever), they consistently overestimated how much time had passed; their internal clock was running fast. This was first demonstrated by Hudson Hoagland in 1933, when he noticed his wife, who had a fever, complained that he’d been away for ages when he’d only left the room for a few minutes. He tested her repeatedly during the fever, and found her time estimates were consistently inflated. Then, being a scientist, he published it.

Fear: time slows down. Not literally. David Eagleman tested this directly by dropping people from a 45-metre tower (with a net) while they watched a fast-flickering display. If time genuinely slowed, they’d be able to read the display. They couldn’t. What actually happens is that the amygdala (the brain’s threat-response system) kicks into high gear during fear, laying down memories at a much higher density than normal. Afterward, looking back, the dense memory makes the event feel like it lasted longer than it did. Your brain didn’t slow time down. It just took more notes.

Dopamine: the neurotransmitter most associated with reward and motivation affects time perception directly. Higher dopamine speeds up the internal clock; lower dopamine slows it. This is why stimulant drugs (which increase dopamine) make time feel like it’s dragging: your internal clock is running fast, so objective time seems to crawl. And it’s why the anticipation of a reward makes the wait feel longer. You want the thing. Your dopamine is up. Your internal clock speeds up. The five minutes until dinner feels like twenty.

Age and the shrinking year

There’s a popular mathematical explanation for why years feel shorter as you age: when you’re five, a year is 20% of your life. When you’re fifty, it’s 2%. Each year is a smaller fraction of your total experience, so it should feel proportionally shorter.

This is neat, intuitive, and probably wrong, or at least insufficient. The ratio theory predicts a smooth logarithmic curve, but subjective reports don’t follow it precisely. The memory-density explanation is better supported: years feel shorter because they contain less novelty, and less novelty means fewer memories, and fewer memories means the year collapses in retrospect.
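Where the "logarithmic" comes from is easier to see with numbers. The sketch below assumes one common way of formalising the ratio theory, which is not something the theory's fans necessarily agree on: each moment at age t is weighted by 1/t, so the felt length of a span is the integral of 1/t, which is ln(end/start).

```python
import math


def year_fraction(age):
    """Share of your life so far that one year represents at a given age."""
    return 1 / age


def felt_span(start_age, end_age):
    """Felt length of a span if each moment at age t weighs 1/t (an assumption)."""
    return math.log(end_age / start_age)


print(year_fraction(5), year_fraction(50))
print(felt_span(5, 10), felt_span(10, 20), felt_span(40, 80))
```

Under that assumption, every doubling of age feels the same length: five to ten weighs as much as forty to eighty. That is the smooth curve the subjective reports fail to follow precisely.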

But there’s a third factor that matters, especially in middle age: routine. When your days are structured by the same alarm, same commute, same meetings, same evening pattern, the brain doesn’t bother encoding each day individually. It compresses. Monday through Friday becomes a single unit in memory. Weeks blur into months. This is efficient (you don’t need to remember every identical Tuesday) but it creates the unsettling sensation that time is accelerating.

Breaking routine doesn’t add hours to your day. It adds anchors to your memory. A Wednesday that’s different from every other Wednesday gets its own entry in the ledger. The weeks that contain an unusual Wednesday feel, in retrospect, longer than the weeks that don’t.

Why two hours of coding disappears

Programmers know this feeling intimately. You sit down to fix a bug. The next time you surface, two hours have gone and you didn’t notice.

Flow states are the extreme case of the attentional gate closing. When you’re deeply absorbed, attention is entirely consumed by the task. The gate that lets temporal information into working memory swings shut. You stop counting ticks. There’s nothing to estimate duration from.

But here’s the interesting part: the same two hours spent in a meeting that you don’t care about will feel like four hours. Same clock time. Opposite subjective experience. And afterward, the two-hour coding session will feel like “not long at all” in retrospect (low novelty, high focus, few distinct memories formed), while the two-hour meeting will also feel like nothing in retrospect (boring, unmemorable). Both collapse, but for different reasons. One was too engaging to notice. The other was too dull to remember.

The 3 AM effect

Anyone who’s been awake at 3 AM with worry knows that the small hours last forever. There’s a neurochemical basis for this. Cortisol (the stress hormone) is at its lowest between midnight and 4 AM, and your body temperature drops to its daily minimum around the same time. Both of these affect time perception. Low body temperature slows the internal clock, making objective time feel like it’s crawling. Anxiety directs attention toward the passage of time itself, opening the attentional gate wide. The combination is brutal: you’re cold, stressed, and clock-watching. Every minute expands.

This is also why night shifts feel so different from day shifts, even after you’ve adjusted your sleep schedule. Your circadian rhythm still modulates body temperature and cortisol independently of when you’re sleeping. At 3 AM, your body thinks time should be crawling, regardless of whether you went to bed at 7 PM or not.

So what time is it, really?

The first post in this series asked what time it is and discovered a tower of conventions, politics, and compromise. The second found that even physical clocks are approximations. The third showed that time itself bends. The fourth asked whether time fundamentally exists. The fifth asked whether you can go backwards. The sixth looked at the biology: the SCN, the circadian rhythm, the shift-worker’s bill.

This post adds one more layer. The time you actually experience, the time that determines whether your day felt long or short, whether your year flew or crawled, whether that meeting was bearable, is constructed by a brain that has no clock, uses attention as a proxy, stores memories as a ledger, and gets reliably fooled by temperature, emotion, novelty, and age.

The clock on the wall says 17:04. Your brain says Thursday lasted a week. Next up: computers can’t agree on what time it is either, and it turns out their problem is disturbingly similar.

Picking a Bedrock Model for High-Volume RAG

2026-05-27T06:00:00+08:00

The situation

A B2B SaaS platform is shipping an in-product assistant. Users ask questions of their own data; the application retrieves relevant records, stitches them into a prompt, and asks a foundation model to answer. Measured over three months of production traffic:

~1,000,000 requests per day, peaking at 30 RPS during US/EU business-hours overlap.
Median request: ~3,000 input tokens (system prompt + retrieved context + user question), ~400 output tokens.
P99 first-token latency target < 1.5 s. The UI streams the answer.
Quality bar: complex reasoning over structured retrieved context, tables, JSON, pulling answers from multiple documents.
Multi-region failover is hard-required. Customers in both us-east-1 and eu-west-1; a regional Bedrock incident must not take either customer base down.
Bedrock-native. No separate model-serving infrastructure.

What actually matters

Before reaching for a model card, ask what the application is actually paying for.

The first question is whose product is this? A model choice is a product choice: it decides who owns the upgrade cadence, who tracks the pricing page, and who gets paged when the answer quality drifts after a point release. On a hosted-foundation-model platform, those answers split three ways: the model vendor ships the behaviour, the platform ships the availability, the team owns the integration. That shape is cheap to buy into and expensive to reverse; the cost of moving a production RAG application between model families is rarely smaller than the savings, and any decision worth making pairs the model with the throughput mode underneath it.

The second is what does a bad day look like? At a million requests a day the interesting failure isn’t an individual bad answer, it’s a region going dark for forty-five minutes. Every second of that outage has a customer-visible consequence, and the blast radius is the entire customer base on the affected region unless the architecture spreads the load. That pushes the design toward something the application can call with a single model identifier while the platform fans the request out across regions behind the scenes, because the alternative is the application owning its own regional routing table and every deploy carrying the risk of a misrouted call.

The third question is what does the bill look like when the product wins? At ~3 billion input tokens and ~400 million output tokens a day, the gap between a cheap-tier and a premium-tier model is the difference between a few hundred thousand dollars a month and a couple of million. That’s not a line-item on a finance review, it’s a budget conversation with the CFO. The interesting economics aren’t “which model is cheapest” but “where can we spend cheap-tier prices on questions the cheap tier can answer, and premium prices on the ones that actually need reasoning?”

The fourth is what happens when the easy answer is wrong? A single-model architecture pays premium rates for the FAQ slice of traffic and gets premium-grade failure modes when a region saturates. A two-model architecture splits the traffic by difficulty and buys a load-shed path when the premium tier’s capacity tightens. The sophistication isn’t picking the model, it’s designing the cascade that lets the cheaper model carry the tail.

The fifth is how do we know it’s still working? Model behaviour drifts across point releases. A RAG system prompt calibrated against one minor version doesn’t automatically work the same on the next, and the gap between “the answers are slightly worse this week” and “we lost 3% accuracy across the board” is an evaluation pipeline that runs nightly against a golden set. That pipeline is a first-class piece of the design, not an afterthought.
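The drift check at the heart of that pipeline can be very small. A minimal sketch, assuming each nightly run yields one pass/fail verdict per golden-set question; the function names and the 5% relative-drop threshold are illustrative assumptions, not part of any Bedrock API:

```python
from statistics import mean


def accuracy(verdicts: list[bool]) -> float:
    """Fraction of golden-set questions judged correct in one run."""
    if not verdicts:
        raise ValueError("golden set produced no verdicts")
    return mean(1.0 if ok else 0.0 for ok in verdicts)


def relative_drop(baseline: float, current: float) -> float:
    """Drop in accuracy relative to the baseline run (0.0 if no baseline)."""
    if baseline <= 0:
        return 0.0
    return (baseline - current) / baseline


def should_alert(baseline: float, current: float, max_drop: float = 0.05) -> bool:
    """True when accuracy fell by more than max_drop (assumed threshold)."""
    return relative_drop(baseline, current) > max_drop


# Usage: compare tonight's run with the stored baseline.
last_week = accuracy([True] * 90 + [False] * 10)   # 0.90
tonight = accuracy([True] * 84 + [False] * 16)     # 0.84
alert = should_alert(last_week, tonight)
```

Whether the threshold should be relative or absolute is a judgment call; the point is that the comparison is explicit and runs without a human in the loop.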

Finally: what buys us the right to change our mind? Inference profiles that hide the specific region, prompt templates that separate cacheable prefix from volatile context, evaluation harnesses that can A/B a new model version: all of those are optionality the architecture builds in, and they’re the difference between “we upgraded to the new minor last Tuesday” and “we spent six weeks revalidating the prompt.”

What we’ll filter on

Distilling that exploration into filters we can score each model against:

Reasoning quality on retrieved context. Reasoning across long structured prompts, not just fluent extraction.
First-token latency under 1.5 s at P99 for ~3,000-token inputs. Tail, not average.
Cost-per-token that survives a million requests a day. Daily volume is ~3B input + ~400M output tokens; a 10x pricing gap between families is $60k vs $600k a month.
Bedrock-native multi-region availability across US and EU, surviving one region offline.
Throughput predictability at 30 RPS peak. Traffic is smooth, not spiky, so the capacity question is which mode gives predictable latency without over-buying.

The Bedrock model landscape

Bedrock’s catalogue sorts into seven families.

Anthropic Claude. Three live tiers: Haiku 4.5 (~$1 / $5 per million input / output tokens; 200K context), Sonnet 4.5 and Sonnet 4.6 (~$3 / $15; 200K context), Opus 4.5 and Opus 4.6 (~$15 / $75; 200K context). All three support global and geographic cross-region inference profiles, prompt caching, and vision inputs. Sonnet is available in the US (N. Virginia, Ohio, Oregon) and EU (Frankfurt, Ireland, Paris, Zurich). First-token latency on Sonnet 4.5 in a warm region sits around 1-1.8 s; Haiku 4.5 under a second.

Amazon Nova. Four text tiers: Nova Micro (~$0.03 / $0.12 per million; 128K context), Nova Lite (~$0.06 / $0.24; 300K context), Nova Pro (~$0.80 / $3.20; 300K context), Nova Premier (~$2.50 / $12.50; frontier-class). Nova Pro is cheapest-per-token at its quality tier by a wide margin. Regional availability is broad within the US; EU coverage is thinner and largely via cross-region profiles anchored in US regions.

Meta Llama. Llama 3.1 (8B, 70B, 405B), Llama 3.2 (1B, 11B vision), Llama 4 Maverick and Scout (MoE). Among the lowest pricing on the platform. Llama 3.1 70B around $2.65 / $3.50 per million. The top-end 405B and the Llama 4 MoE models are concentrated in US regions only; cross-region profiles don’t cover EU for the top tiers.

Mistral AI. Mistral Large 3 (~$2 / $6 per million, 128K context) is the flagship; Ministral 3B and Mixtral 8x7B sit lower. Decent mid-tier reasoning, strong multilingual. Doesn’t beat Sonnet on quality or Nova Pro on cost; EU coverage thinner than Claude’s.

Cohere. Command R+ is specifically tuned for RAG, citation generation, grounded answers, tool-use. Available in us-east-1 and us-west-2 only; no native EU. First-class option for US-only RAG; ruled out by the EU requirement.

Amazon Titan. The family has shifted to embeddings (Titan Text Embeddings V2) and image generation. Useful for the embedding side of a RAG pipeline; not the generation model.

AI21 Labs. Jamba 1.5 Mini and Large, 256K context, Jamba hybrid SSM/Transformer. Good at long-context extraction; limited EU presence; mid-tier reasoning.

Side by side

Family	Reasoning on retrieved context	P99 first-token < 1.5 s	Cost at 1M req/day	EU region availability	Predictable at 30 RPS
Anthropic Claude (Sonnet)	✓	✓	✓	✓	✓
Amazon Nova (Pro / Premier)	✓	✓	✓	✗	✓
Meta Llama	✓	✓	✓	✗	✓
Mistral AI	—	✓	✓	—	✓
Cohere Command R+	✓	✓	✓	✗	✓
Amazon Titan	✗	—	—	—	—
AI21 Jamba	—	✓	✓	—	✓

Two families make the shortlist: Anthropic Claude, which clears all five filters, and Amazon Nova, which clears them only in a US-only variant. Nova wins on cost-per-token but fails EU availability for the Pro and Premier tiers that would clear the reasoning bar. Cohere’s Command R+ is purpose-built for RAG but currently lives in us-east-1 and us-west-2 only. Claude Sonnet is the only row with all ticks, and the “complex reasoning over structured retrieved context” constraint keeps it there.

Matching the workload to the model

Apply four gates (reasoning, residency, latency, cost) and the seven-family catalogue collapses to Sonnet on a geographic cross-region profile, with Haiku as the cost-tier fallback.

Sonnet, in depth

Sonnet is where most production RAG applications land: Opus-grade reasoning on most realistic prompts, Haiku-competitive latency for typical RAG input sizes, mid-tier pricing that makes a million-requests-a-day application viable.

Version choice. 4.5 and 4.6 are both live on Bedrock at the same price. 4.6 is newer; 4.5 has the longer benchmark track record. New applications default to 4.6; calibrated pipelines stay on 4.5 until the re-calibration is done, because behaviours differ across minor versions and a RAG system prompt is typically tuned against one specific model.

Context window. 200K tokens. A 3K input sits far below the ceiling; context pressure is nowhere near a concern.

Latency profile. First-token for a 3K input in a warm region runs under 1.8 s. That’s close to the 1.5 s target, which makes latency a throughput-mode question rather than a model question: on-demand variance can push P99 above target under peak load.

Four ways to buy capacity. On-demand pays per token with no commitment, variable latency under noisy-neighbour contention. Provisioned throughput buys model units at a guaranteed rate on a 1-month or 6-month commitment, predictable latency, committed spend. Batch inference ships 50% off at a 24-hour SLA, fine for offline jobs. Flex tier ships 50% off at best-effort latency, fine for tolerant async. The right default for 30 RPS peak is on-demand with raised quotas; provisioned earns its place when the peak sustains into the hundreds of RPS or a hard latency SLA demands isolated capacity.

Prompt caching is the cost lever most applications miss. Cache reads cost ~10% of the normal input-token price; cache writes cost ~25% more than normal and populate the cache for ~5 minutes. The scenario’s 3K input is almost certainly ~800 tokens of shared system prompt plus ~1,700 of retrieved context plus ~500 of user question. Marking the system prompt cacheable pays the write premium once per 5-minute window and 10% on every read after that, a reduction of roughly 24% in input cost at the stated numbers (800 of 3,000 tokens at 90% off), larger if tool definitions and few-shot examples live in the cached prefix. Caching also cuts first-token latency by hundreds of milliseconds, directly against the 1.5 s budget.
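In request terms, the split between cacheable prefix and volatile context might look like the sketch below. It only builds the keyword arguments for a Bedrock Converse call; the `cachePoint` block after the system text reflects my understanding of the Converse request shape and should be checked against the current API reference. `build_converse_request` and its parameters are hypothetical names for illustration.

```python
def build_converse_request(
    model_id: str,
    system_prompt: str,
    retrieved_context: str,
    question: str,
    max_tokens: int = 400,
) -> dict:
    """Build Converse kwargs with the stable system prompt marked cacheable."""
    return {
        "modelId": model_id,
        "system": [
            # Stable prefix: identical on every request, so it can be cached.
            {"text": system_prompt},
            # Everything before this marker is the cacheable prefix.
            {"cachePoint": {"type": "default"}},
        ],
        "messages": [
            {
                "role": "user",
                # Volatile part: retrieved records and the question change
                # per request, so they stay after the cache point.
                "content": [
                    {"text": f"{retrieved_context}\n\nQuestion: {question}"}
                ],
            }
        ],
        "inferenceConfig": {"maxTokens": max_tokens},
    }


request = build_converse_request(
    "us.anthropic.claude-sonnet-4-6",
    "You answer questions using only the supplied records.",
    "record 1: ...",
    "Which invoices are overdue?",
)
# With boto3 installed and credentials configured, the call would be roughly:
#   boto3.client("bedrock-runtime").converse_stream(**request)
```

One thing worth verifying before relying on the saving: models enforce a minimum cacheable prefix length, so a system prompt of only ~800 tokens may need tool definitions or few-shot examples alongside it before the cache point takes effect.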

Cross-region inference profiles are how the multi-region requirement collapses to a config change. Call us.anthropic.claude-sonnet-4-6 and the US invocation spreads across us-east-1, us-east-2, and us-west-2; call eu.anthropic.claude-sonnet-4-6 and the EU invocation spreads across Frankfurt, Ireland, Paris, and Zurich. If one constituent region fails, the others serve. No code change; on-demand rate applies; no cross-region data-transfer charge on the inference path. Global profiles exist for maximum availability but trade residency; geographic profiles are the right default when US and EU customers are separate.
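On the application side, that leaves one small routing decision: which geographic profile a tenant belongs to. A minimal sketch, assuming tenants carry a home region string such as `eu-west-1`; the helper names are hypothetical, and the profile identifiers are the ones named above.

```python
PROFILE_BY_GEOGRAPHY = {
    "us": "us.anthropic.claude-sonnet-4-6",
    "eu": "eu.anthropic.claude-sonnet-4-6",
}


def geography_for_region(home_region: str) -> str:
    """Map an AWS region name such as 'eu-west-1' to its geography prefix."""
    prefix = home_region.split("-", 1)[0]
    if prefix not in PROFILE_BY_GEOGRAPHY:
        raise ValueError(f"no inference profile configured for {home_region!r}")
    return prefix


def profile_for_tenant(home_region: str) -> str:
    """Inference profile that keeps the tenant's traffic in its own geography."""
    return PROFILE_BY_GEOGRAPHY[geography_for_region(home_region)]


def cross_geography_profile(home_region: str) -> str:
    """The other geography's profile, for degraded-residency failover only."""
    own = geography_for_region(home_region)
    other = next(geo for geo in PROFILE_BY_GEOGRAPHY if geo != own)
    return PROFILE_BY_GEOGRAPHY[other]
```

Keeping the cross-geography lookup as a separate function makes the residency trade explicit: nothing reaches the other geography unless the caller asks for it by name.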

Cascading for cost. Not every question needs Sonnet. Routing the easy slice (short queries, straightforward extraction) to Haiku 4.5 at roughly a third of Sonnet’s price is where the daily bill bends. Three shapes in the wild: cascade (try Haiku, retry on Sonnet when confidence is low), pre-route (classify first, choose once), load-shed (Sonnet by default, drop to Haiku when Sonnet’s P99 climbs). Cascading is the most common because it degrades gracefully when the judge is uncertain.
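The first two shapes are a few lines each once the model calls are injected. A rough sketch, assuming each call returns an answer with a confidence score from some judge or self-check; the `Answer` type, the 0.7 threshold, and the callables are all hypothetical stand-ins rather than anything Bedrock provides.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Answer:
    text: str
    confidence: float  # 0..1, from a judge or self-check (assumed to exist)
    model: str


ModelCall = Callable[[str], Answer]


def cascade(
    question: str,
    cheap: ModelCall,
    premium: ModelCall,
    min_confidence: float = 0.7,
) -> Answer:
    """Try the cheap model first; escalate only when confidence is low."""
    first = cheap(question)
    if first.confidence >= min_confidence:
        return first
    return premium(question)


def pre_route(
    question: str,
    is_easy: Callable[[str], bool],
    cheap: ModelCall,
    premium: ModelCall,
) -> Answer:
    """Classify once up front, then call exactly one model."""
    return cheap(question) if is_easy(question) else premium(question)
```

The trade is visible in the code: the cascade pays for two calls on every escalation, while pre-routing pays for one call but has no second chance when the classifier is wrong.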

A worked architecture

Primary model: Sonnet 4.6 via us.anthropic.claude-sonnet-4-6 for US tenants, eu.anthropic.claude-sonnet-4-6 for EU. The application routes tenants to the right profile by home region.
Throughput mode: on-demand. Quotas raised in advance for peak 30 RPS with margin. Provisioned reviewed quarterly against actual utilisation.
Prompt caching: system prompt (roles, instructions, tool definitions) marked cacheable. Cache hit rate monitored as a first-class metric.
Cost-tier fallback: shed to Haiku 4.5 when profile P99 exceeds 2 s for 5 min. Haiku via the matching geographic profile.
Cross-geography failover: on repeated 5xx from the primary profile, retry once against the other geography. Degraded-residency mode for continuity.
Evaluation: 500-question golden dataset, nightly run via Bedrock Evaluations with Sonnet 4.6 as judge, alert on aggregate drops above 5% WoW.
Embedding model: Titan Text Embeddings V2 in each region, vector store local to where it’s queried.
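The load-shed rule in that list (P99 above 2 s, sustained for 5 minutes) is easy to get subtly wrong, so here is one way it might be tracked in-process. This is a sketch under assumptions: latencies are recorded by the caller with explicit timestamps, P99 is computed over a rolling 60-second window by nearest rank, and `LoadShedder` is a hypothetical class, not a library component.

```python
import math
from collections import deque


class LoadShedder:
    """Signals fallback once rolling P99 has exceeded the limit long enough."""

    def __init__(self, p99_limit_s: float = 2.0, sustain_s: float = 300.0,
                 window_s: float = 60.0) -> None:
        self.p99_limit_s = p99_limit_s
        self.sustain_s = sustain_s
        self.window_s = window_s
        self.samples: deque[tuple[float, float]] = deque()  # (timestamp, latency)
        self.breach_started: float | None = None

    def _p99(self) -> float:
        """Nearest-rank 99th percentile of the latencies in the window."""
        values = sorted(latency for _, latency in self.samples)
        index = max(math.ceil(0.99 * len(values)) - 1, 0)
        return values[index]

    def record(self, now: float, latency_s: float) -> None:
        """Add one first-token latency sample and update the breach state."""
        self.samples.append((now, latency_s))
        cutoff = now - self.window_s
        while self.samples and self.samples[0][0] < cutoff:
            self.samples.popleft()
        if self._p99() > self.p99_limit_s:
            if self.breach_started is None:
                self.breach_started = now
        else:
            self.breach_started = None  # recovered: reset the timer

    def use_fallback(self, now: float) -> bool:
        """True when the breach has lasted at least sustain_s seconds."""
        return (self.breach_started is not None
                and now - self.breach_started >= self.sustain_s)
```

In a multi-instance deployment the same rule would more likely live in a metrics alarm than in process memory; the sketch is only meant to pin down what "exceeds 2 s for 5 min" means.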

Rough monthly cost at 1M requests/day, 3K/400 token median, ~25% input cached:

Input: 3,000 x 1M x 30 = 90B/month. Effective ~70B after caching x $3/M = ~$210k.
Output: 400 x 1M x 30 = 12B x $15/M = ~$180k.
Evaluations, embeddings, incidental Haiku: ~$10k.
Total: ~$400k/month, before any volume discounts from the account team.
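The same arithmetic as a few lines of Python, so the assumptions can be changed and the bill recomputed. The constants are the figures stated above; the sketch treats cached tokens as billed at the 10% read price and ignores the cache-write premium, which is a simplification.

```python
REQUESTS_PER_DAY = 1_000_000
DAYS_PER_MONTH = 30
INPUT_TOKENS_PER_REQUEST = 3_000
OUTPUT_TOKENS_PER_REQUEST = 400
INPUT_PRICE_PER_MILLION = 3.00     # Sonnet input price used in this post
OUTPUT_PRICE_PER_MILLION = 15.00   # Sonnet output price used in this post
CACHED_INPUT_FRACTION = 0.25       # share of input tokens served from cache
CACHE_READ_MULTIPLIER = 0.10       # cache reads billed at ~10% of input price
OVERHEAD = 10_000.0                # evaluations, embeddings, incidental Haiku


def monthly_cost() -> dict[str, float]:
    """Rough monthly bill in dollars, broken out by line item."""
    requests = REQUESTS_PER_DAY * DAYS_PER_MONTH
    input_tokens = INPUT_TOKENS_PER_REQUEST * requests
    output_tokens = OUTPUT_TOKENS_PER_REQUEST * requests
    # Cached tokens cost 10% of normal, so each one saves 90% of its price.
    effective_input = input_tokens * (
        1 - CACHED_INPUT_FRACTION * (1 - CACHE_READ_MULTIPLIER)
    )
    input_cost = effective_input / 1_000_000 * INPUT_PRICE_PER_MILLION
    output_cost = output_tokens / 1_000_000 * OUTPUT_PRICE_PER_MILLION
    return {
        "input": input_cost,
        "output": output_cost,
        "overhead": OVERHEAD,
        "total": input_cost + output_cost + OVERHEAD,
    }


bill = monthly_cost()
```

With these inputs the total comes to about $399k, which is the ~$400k figure above before rounding.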

Routing ~40% of traffic to Haiku via a well-calibrated cascade drops total cost to roughly $300k/month for similar quality on the easy slice: 40% of the ~$390k model spend billed at about a third of the price saves a little over $100k. That’s where the investment in evaluation pays off: a trustworthy judge makes the cost curve bend.

What’s worth remembering

Seven Bedrock families, only some in both US and EU. Claude, Nova, Llama, Mistral, Cohere, Titan, AI21. The EU-residency gate is what rules most of the catalogue out for a dual-geography product.
Claude’s three-tier split (Haiku / Sonnet / Opus) maps to working points. Haiku for latency and cost, Sonnet as the production default, Opus as an escalation rather than a daily-driver.
Geographic cross-region inference profiles (us., eu.) give automatic in-geography failover at on-demand rates with no surcharge. The primary mechanism for multi-region availability on Bedrock.
Four throughput modes. On-demand for smooth sub-hundreds-RPS; provisioned for sustained high throughput or hard latency SLAs; batch for 24-hour async; flex for tolerant async.
Prompt caching’s 90% discount on cache reads is the biggest lever on input cost for any RAG workload with a stable system prompt. The 5-minute window and ~25% write premium are the two numbers to remember.
Cascading, pre-routing, and load-shedding are three shapes for splitting traffic between Sonnet and Haiku. Cascading is the common production default; load-shedding is a cleaner degradation path than 503s under peak.
Bedrock Evaluations runs automatic and LLM-as-a-judge modes against a golden dataset. Nightly evaluation plus alert-on-drop is the discipline that catches model-version drift before users do.
Custom Model Import is a niche tool, provisioned-only, limited regions, fine-tuned open-source only. Not a general alternative when Bedrock’s hosted models already clear the quality bar.

The model choice is the easy part. The work that actually matters is the throughput mode, the failover topology, the caching discipline, and the evaluation harness wrapped around the chosen model, all pieces that compound when the application outgrows a single region and a single quality tier.

Prioritisation: What Changes First

2026-05-26T06:00:00+08:00

Greenbox has 200 subscribers but 8% monthly churn. Over three weeks of discovery (Jobs to Be Done, Assumption Mapping, Business Model Canvas) the team has uncovered more problems than they can fix at once. Now they need to decide what changes first.

It’s Monday morning and the office whiteboard is covered. Three weeks of discovery work have produced a wall of insights, sticky notes, canvas printouts, and scribbled questions. Maya stands in front of it with a coffee that’s gone cold.

Sam arrives early, which is unusual. He puts his bag down and opens his laptop before he takes off his jacket. “We lost three subscribers over the weekend.”

Maya turns. “Churn?”

“Not exactly. They switched. To Freshly.” Sam turns his laptop around. Freshly’s Perth launch page fills the screen: a clean hero image, the $18 price tag prominent, a “Now delivering in Perth” banner. “They went live on Friday. Three of our subscribers signed up over the weekend and cancelled with us. One of them, Louise, from the JTBD interviews, sent a message: ‘Sorry, but $18 is $18.’”

Maya stares at the screen. She knew this was coming. Dave had told her Freshly was calling farms. Charlotte’s BMC questions had forced the pricing conversation. But knowing it’s coming and seeing it on a Monday morning are different experiences.

“Seven dollars a week. Three hundred and sixty-four dollars a year. Of course people switch.”

Too much to fix

The insights are clear. A two-tier pricing model could fix the economics. A pause button would reduce churn. The value proposition needs repositioning around convenience. SEO is underinvested. The recipe cards are working but the marketing doesn’t match what customers actually care about.

Maya knows all of this. The team knows all of this. And that’s the problem.

By the time everyone arrives, Maya has written five priorities on the whiteboard, each circled in red.

Ship the pause button (reduces churn)
Launch two-tier pricing model (fixes unit economics)
Reposition the value prop in all marketing
Run a mixed-sourcing pilot
Start SEO foundation work

Priya reads the list. “We’re five people.”

“I know.”

“That’s five initiatives for five people.”

Sam looks at the board. “Plus we still have to pack and ship two hundred boxes a week, manage farm relationships, keep the platform running, and prepare a board presentation.” He’s listing his own workload, though he doesn’t frame it that way. He has forty-three unread support emails from the weekend.

Maya puts down her marker. “I don’t know how to choose.”

Everyone has a different answer

Lee and Charlotte are on the call. The team spends thirty minutes arguing.

Tom thinks the pause button should be first: highest leverage, small engineering lift. Sam disagrees; fix the value prop messaging and you’ll acquire better-fit customers who churn less in the first place. Jas pushes for two-tier pricing because the board meeting is in three weeks. Priya wants the mixed-sourcing pilot first: you can’t pitch two-tier pricing without validating the supply chain. Maya keeps circling back to SEO.

Charlotte lets the argument run past the point where it’s productive. Then she says: “Five people, five answers. That’s not a disagreement about priorities. That’s the absence of a framework for deciding.”

What we’re optimising for

“Before we sort anything,” Charlotte says, “we agree on what we’re optimising for this quarter. Otherwise the 2x2 is just opinions in a grid.”

She types into a shared doc and turns the screen.

Q3 Theme: Fix the leaky bucket. Reduce monthly churn below 5%.

“Eight percent monthly means we lose close to two-thirds of our subscribers every year. Everything else is downstream of that. So the first question on every initiative, including the five on the whiteboard, is: does it move churn? By how much, and how fast?”

Tom frowns. “Two-tier pricing isn’t a churn play. It’s unit economics.”

“It’s churn through a longer chain. Better economics means we can afford the convenience features that hold subscribers. And the $20 tier closes the gap to Freshly, which is already costing us churn. So yes, it serves the theme. But that’s the test for every initiative on the wall.”

Impact and effort

With the theme in place, the 2x2 has meaning. The horizontal axis is effort/risk. The vertical axis is impact on churn, the metric the theme picked out. Charlotte scores each initiative.

Do First (high impact, low effort/risk)

Pause button

Big Bet (high impact, high effort/risk)

Two-tier pricing model

Fill In (low impact, low effort/risk)

Value prop repositioning
SEO foundation

Defer (low impact, high effort/risk)

Mixed-sourcing pilot

Priya objects to the mixed-sourcing pilot being deferred. “We need supply chain data before we can commit to two-tier pricing.”

“You’re right,” Charlotte says. “But you don’t need a full pilot. You need three phone calls to wholesale suppliers and a week of test orders. That’s not a separate initiative; it’s part of the pricing preparation. The fuller pilot can come later.”

Now / Next / Later

Charlotte shares the next screen. Three columns.

Now is the next four weeks. High impact, high urgency. You can name the people and describe what “done” looks like.

Next is four to twelve weeks. Important but can wait, or needs more information first.

Later is beyond twelve weeks. Good ideas that aren’t ready.

“Everything can’t be Now. If it is, nothing is.”

The 2x2 doesn’t sort itself into columns. Three things bend it:

Capacity. Five people. Now holds at most two big initiatives.
Dependencies. The supply-chain checks the pricing model needs are folded into the pricing work, not listed as a separate Next item.
External deadlines. The board meeting is in three weeks. Two-tier pricing is a Big Bet, not a Do First, but the timing pulls it into Now anyway.

The roadmap:

Now (next 4 weeks)

Pause button: reduce churn from 8% toward 5%
Two-tier pricing model: viable unit economics for board

Next (4 – 12 weeks)

Mixed-sourcing pilot
SEO foundation
Value prop repositioning

Later (beyond 12 weeks)

B2B offerings
Second city expansion
Referral programme

“Anything that doesn’t move churn, directly or through a short chain, waits.”

Building the Now

Tom and Priya take the pause button. They Example Map it on Monday afternoon; twenty-five minutes produces twelve concrete examples and three red cards. They build it in six days.

Maya and Jas take the two-tier pricing model. Maya spends two days on the phone with Dave, Rachel, and their third farm partner, explaining what mixed sourcing means for local orders.

Dave is quiet for a long time. Then he asks: “Will the local box subscribers grow?”

Maya doesn’t know. She says so.

“Here’s what I need. Don’t blindside me. Give me three months’ notice if the local orders are going to drop. I can find other buyers, but I need time.”

Maya commits to it. She adds “quarterly farm partner review” to the Later column.

Jas designs the pricing page. But first, she presents something she’s been working on privately.

She’d taken the value prop repositioning, the one Maya moved from Now to Next, and done it anyway. Three evenings at home in Leederville, Moleskine open, laptop beside her. She connects her laptop to the office projector without asking anyone’s permission.

The homepage: “Dinner decided.” Mrs Patterson’s words, now a headline in Greenbox’s brand typeface. Below it, not a photo of vegetables but a photo of a family kitchen, a recipe card propped against a cutting board. The message: we deliver the moment after the decision is made.

The pricing page: “Local Box, $25/week, 100% locally sourced, seasonal produce from farms within fifty kilometres” and “Fresh Box, $20/week, a mix of local and market-fresh produce, same quality, more variety.” The mixed box isn’t framed as the cheap option. It’s framed as the variety option.

The about page: not “we source from local farms” but “we take Tuesday night off your plate.” The farm stories are still there, halfway down the page. But the lead is the job.

Maya stands in front of the projector. She reads every screen twice. “This is the first time the website matches what we actually do.”

Jas’s eyes fill. She blinks hard and looks down at her Moleskine. She’s been waiting to hear something like that since week one, when she designed the customisation interface that got thrown away, when Maya redirected the product without telling her, when she sat in her Leederville flat thinking about quitting. Her mum’s words about her grandmother: “She never grew what she thought people should eat. She grew what they actually wanted.” The napkin sketch from Mrs Patterson’s interview, with “dinner decided” underlined twice, is still in her Moleskine. It might be the most important thing she’s ever drawn.

“We can’t ship this yet,” Charlotte says, gently. “Value prop repositioning is Next, not Now. But save every one of these files.”

Sam catches Jas’s eye across the table and mouths: That was brilliant.

The board meeting

Maya presents on a Thursday afternoon. Charlotte coaches her the night before: “Don’t start with the product. Start with the problem.”

Maya starts with the churn number. She walks through the JTBD insight, the assumption mapping, the broken unit economics. Then the plan: Now/Next/Later roadmap, quarterly theme, early results (pause button already shipped, churn trending down in week one).

One investor, Angela, leans forward. “This is the first time you’ve presented something that isn’t a feature list. You’re showing me the thinking behind the choices.”

The board approves the next tranche of funding. Not because the plan is guaranteed, but because it’s coherent and evidence-based.

Angela stays on the call after the others drop off. “The fact that you were willing to present a plan that partially walks away from 100% local sourcing tells me you’re making decisions based on data, not sentiment. That’s what we needed to see.”

Four weeks later

The pause button: twenty-three subscribers used it. Nineteen resumed. Four extended but none cancelled. Monthly churn dropped from 8% to 5.5%.

The two-tier model: fourteen new subscribers chose the Fresh Box ($20), six chose Local ($25). Nobody switched from Local to Fresh: the new tier is expanding the market, not cannibalising the existing one. Sam checked five of the Fresh Box subscribers in their welcome call. Three had compared Greenbox to Freshly. The $20 price point made the comparison close enough that the recipe cards tipped the balance.

“Freshly has better technology and a lower price,” Charlotte says. “You have better curation and a clearer job-to-be-done. The question is which one matters more in six months.”

The draft

On the evening after the board call, Maya sits at the kitchen table. Nadia pours her a glass of wine.

“They said yes?”

“They said yes.”

“Then why do you look like that?”

Maya opens her email drafts. The “pausing operations” email is still there: three sentences, unsent, from the night after the BMC session. She reads it once. Then she closes the draft folder. Not deleting it. Not yet.

“I look like this because the hard part isn’t over. It’s changing shape.”

Freshly has ninety subscribers in Perth after one month. Sam tracks the number. Greenbox has two hundred and thirty-one. But Freshly’s growth rate is steeper. Dave reported that Rachel got a call from them last week. Rachel told them to get stuffed, but Rachel is one farmer.

Greenbox raises its funding. The board is satisfied. Churn is dropping. The two-tier model is expanding the market without cannibalising the existing one. The team understands the customer, not the customer they imagined at the Margaret River market, but the real one, the one who hires Greenbox so that dinner is already decided when they walk through the door.

That’s product-market fit. Not a guess. Evidence.

The team grows from five to fifteen. Two cities. New subscribers arriving faster than at any point in the company’s history. And then the problems change.

The codebase that five people understood becomes a system fifteen people need to work in. The architecture that worked at startup scale starts creaking. New developers join and don’t know why things are built the way they are. A change in the billing module breaks the delivery scheduler because nobody realised they were coupled. Tom fixes it in an hour, but the look on his face says he knows: this will happen again, and next time it might not be the billing module. It might be the substitution engine, or the allergen flags, or something that sends the wrong produce to the wrong person.

The techniques from the first two series got Greenbox here. But “here” is a different kind of problem. Not “what should we build?” but “how do we build at scale without the system collapsing under its own weight?”

Charlotte has a name for the approach: Domain-Driven Design. It starts with drawing boundaries around the parts of the system that change for different reasons.

Choosing Between SageMaker, Bedrock, and Purpose-Built AI APIs

2026-05-25T06:00:00+08:00

The situation

The platform team at a mid-size enterprise has a backlog of five AI-shaped requests, all due by end of sprint:

Call-centre transcription. The support team records 40,000 calls a month. They want searchable transcripts with speaker diarisation (“Agent said X, then Customer said Y”) and redaction of credit-card numbers spoken aloud.
Sensor anomaly detection. The facilities team has 2,000 IoT sensors across 12 sites streaming temperature, humidity, and vibration data. They want alerts when readings stray from normal patterns, patterns that vary by site, sensor type, and time of day.
Form text extraction. Ops receives 3,000 scanned supplier invoices a week. They want the invoice number, date, line items, and total extracted into a structured row in a database.
Email summarisation. The sales team wants one-paragraph summaries of customer email threads auto-inserted at the top of their CRM view.
Visitor-badging face detection. Reception has a camera at the front door; they want a system that detects faces, checks them against an approved-visitor list for that day, and prints a badge (or alerts security).

Five requests, five different ML problem types, one sprint, one platform team of six. The question is which AWS service shape fits each, without the team writing five bespoke ML pipelines.

What actually matters

AWS organises its AI/ML offerings into three broad tiers, and recognising which tier a request belongs to is most of the work.

The first question is what kind of ML problem this is. “AI” is a wide word. Speech-to-text is a solved problem that almost every cloud provider offers as a managed API. Anomaly detection on time-series data has well-understood algorithms but needs tuning per deployment. Document parsing with structured extraction is a specific service category (intelligent document processing). Summarisation is generative text, which points at Bedrock. Face detection is a computer-vision primitive available as a managed API. The problem shape is the first filter; the platform choice follows from it.

The second is the three tiers. The top tier is purpose-built AI services: fully managed APIs for well-defined tasks, one service per named problem (speech-to-text, document extraction, image analysis, NLP, recommendations, fraud detection, and the like). Input goes in, structured output comes out. No model to train, no infrastructure to provision. This tier is where you want to live if the problem matches a named service and the service’s accuracy meets your bar.

The middle tier is foundation-model APIs: managed access to general-purpose LLMs for generative and broad-language tasks. Text generation, summarisation, embedding, RAG, agents, image generation. Pay per token. The tier for problems where a general-purpose model is a reasonable substrate (most text understanding, most generation, most RAG).

The bottom tier is the full ML platform: train your own model, bring your own data, deploy your own endpoint, fine-tune open-source models, run notebooks, orchestrate pipelines. The tier for problems where no managed service fits: custom domains, custom metrics, proprietary architectures, regulated contexts where data can’t leave your VPC, or unusual volumes that shift the cost calculus.

The third thing is the default pull toward the bottom tier. Engineers with ML backgrounds reach for the platform because it’s the most powerful. Engineers without ML backgrounds reach for the foundation-model API because it’s the most recent. Neither is the correct first question. The correct first question is: “is there a purpose-built service for this?” If there is, that’s almost always the correct answer: faster to integrate, more accurate on the specific task, cheaper per request, and the team doesn’t become responsible for ongoing model maintenance.

The fourth is the cost shape per tier. Purpose-built AI services are usually priced per transaction: per minute of audio, per page of document, per image analysed. Foundation-model APIs are per token (input + output). The full ML platform is per instance-hour for training and per instance-hour for hosted endpoints, regardless of traffic. At low-to-medium volume, managed APIs crush the alternatives on cost; at very high volume, fixed-per-hour pricing can win if your traffic saturates the endpoint. The crossover point is usually higher than teams assume.
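The crossover is easy to estimate once you have real quotes. A minimal sketch of the arithmetic, where every price is a made-up placeholder rather than an AWS list price, and where the hosted endpoint is assumed to be able to absorb the whole volume on the stated number of instances:

```python
import math


def monthly_cost_per_transaction(units: int, price_per_unit: float) -> float:
    """Managed API: the bill scales linearly with usage."""
    return units * price_per_unit


def monthly_cost_hosted(instances: int, price_per_instance_hour: float,
                        hours_per_month: float = 730.0) -> float:
    """Hosted endpoint: billed per instance-hour whether or not traffic arrives."""
    return instances * price_per_instance_hour * hours_per_month


def crossover_units(instances: int, price_per_instance_hour: float,
                    price_per_unit: float) -> int:
    """Smallest monthly volume at which the hosted endpoint is no dearer."""
    fixed = monthly_cost_hosted(instances, price_per_instance_hour)
    return math.ceil(fixed / price_per_unit)


# Hypothetical numbers, for illustration only.
PRICE_PER_MINUTE = 0.02          # managed API, per minute of audio
PRICE_PER_INSTANCE_HOUR = 1.50   # hosted endpoint, per instance-hour
INSTANCES = 2

hosted = monthly_cost_hosted(INSTANCES, PRICE_PER_INSTANCE_HOUR)
breakeven = crossover_units(INSTANCES, PRICE_PER_INSTANCE_HOUR, PRICE_PER_MINUTE)
print(f"hosted endpoint: ${hosted:,.2f}/month, break-even at {breakeven:,} minutes")
```

The sketch ignores the engineering time needed to run the endpoint, which tends to push the real crossover higher still.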

The fifth is accuracy and customisation. A managed API gives you the accuracy the vendor tuned for the general case. If your domain is specific enough (medical transcription, legal documents, audio in a noisy factory), accuracy may fall short, and a custom model trained on your data can beat the general API. But this is empirical, not assumed: benchmark the managed API against a test set before concluding you need to build.

The sixth is the operational cost. A managed API is an HTTPS call. Bedrock is an HTTPS call plus a bit of prompt engineering. SageMaker requires endpoint capacity planning, auto-scaling configuration, model-version deployment, CloudWatch monitoring, and a team that stays current on SageMaker’s operational surface. These costs compound over the model’s life, not just at launch.

What we’ll filter on

Five filters, applied to each of the three tiers.

Time to first working version: hours, days, or weeks?
Customisability: can you fine-tune, retrain, or change the model?
Cost shape: per-transaction, per-token, or per-instance-hour?
Infrastructure overhead: do you run an endpoint, or does AWS?
Breadth of task types: narrow, broad, or arbitrary?

The three-tier landscape

AI services (top tier). Purpose-built, task-specific managed APIs. Each service does one thing well:
- Transcribe: speech-to-text with diarisation, custom vocabulary, PII redaction, real-time streaming option.
- Comprehend: sentiment, entities, key phrases, language detection, topic modelling; custom classification and entity recognition via Comprehend Custom.
- Textract: OCR with table and form extraction, plus specialised APIs for invoices and identity documents.
- Rekognition: face detection and matching, object and scene detection, moderation, text-in-image, video analysis.
- Translate: neural machine translation, 75+ languages.
- Polly: text-to-speech.
- Personalize: recommendation systems.
- Fraud Detector: custom fraud models on structured data.
- Forecast (retired as standalone; folded into SageMaker Canvas’s time-series mode).
- Kendra: intelligent search over enterprise documents.
- Lookout for Equipment / Metrics / Vision: anomaly detection for industrial equipment, metrics, and visual inspection.

Pricing: per-transaction (per minute, per page, per image). No infrastructure.

Bedrock (middle tier). Foundation-model APIs for generative and general-purpose tasks. Catalogue of models (Anthropic Claude, Amazon Nova and Titan, Meta Llama, Mistral, Cohere, AI21). Primary APIs: InvokeModel for synchronous generation, RetrieveAndGenerate for managed RAG, InvokeAgent for agent workflows. Pricing: per input + output token, with provisioned-throughput option for high-volume or fine-tuned models. No infrastructure.

SageMaker (bottom tier). Full ML platform. Surfaces include:
- SageMaker Studio: the IDE for ML, with notebooks, experiments, and pipelines.
- SageMaker Training: run training jobs on managed compute (CPU, GPU, Trainium).
- SageMaker Endpoints: hosted model inference, as real-time, asynchronous, or batch transform.
- SageMaker Autopilot: AutoML for tabular data.
- SageMaker Canvas: no-code UI on top of Autopilot plus time-series forecasting.
- SageMaker JumpStart: catalogue of open-weight foundation models deployable to your own endpoint.
- SageMaker Ground Truth: data labelling.
- SageMaker Clarify: bias detection and explainability.
- SageMaker Feature Store: managed feature storage.
- SageMaker Pipelines: MLOps orchestration.
- SageMaker Model Registry: versioned model artefacts with approval workflow.

Pricing: per-instance-hour for training jobs and endpoints (ml.m5, ml.g5, ml.p5 etc.), plus storage and data transfer. Full VPC isolation; your models live in your account.

Side by side

Tier	Time to v1	Customisable	Cost shape	Infra	Task breadth
AI services (Transcribe, etc.)	Hours	Limited (custom vocabs, classifiers)	Per-transaction	None	Narrow (one task each)
Bedrock	Hours-days	Prompt + RAG + fine-tune	Per-token	None	Broad (any text task; growing image)
SageMaker	Days-weeks	Any	Per-instance-hour	Endpoint management	Arbitrary

Reading the table in reverse is instructive. If your problem matches an AI service, you’re done in hours. If it’s a generative or general-text task, Bedrock is hours to days. Only drop to SageMaker if neither fits: either the task is so domain-specific that no managed service addresses it, or volume makes per-transaction pricing lose to per-hour.

Mapping the five requests

Four of the five requests land on purpose-built AI services. Only summarisation genuinely wants Bedrock. None of them justifies dropping to SageMaker.

The picks in depth

Call-centre transcription. Amazon Transcribe. The service has a call-analytics mode (StartCallAnalyticsJob) specifically for contact-centre audio: produces transcripts with speaker labels (“AGENT” / “CUSTOMER”), automatic sentiment scoring, issue detection, and content redaction that blanks out card numbers, SSNs, and other PII in the output. Input is an S3 URI of audio; output is JSON to another S3 URI. 40,000 calls a month at, say, 8 minutes average = 320,000 minutes; Transcribe Call Analytics prices per minute, so the monthly bill is straightforward to forecast. Custom vocabularies handle product names, internal jargon, and employee names. No model to train, no endpoint to host.
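As a rough sketch of what the wiring looks like, the request for one call can be assembled as a plain dictionary and handed to boto3. The bucket names, role ARN, job name, and vocabulary name below are all hypothetical, and the sketch assumes two-channel audio with the agent on channel 0. The field names follow the StartCallAnalyticsJob request shape as I understand it, so check them against the current API reference before relying on them.

```python
def build_call_analytics_request(job_name, audio_s3_uri, output_s3_uri,
                                 role_arn, vocabulary_name=None):
    """Assemble keyword arguments for Transcribe's StartCallAnalyticsJob."""
    settings = {
        # Ask for PII to be blanked out of the transcript that gets written.
        "ContentRedaction": {
            "RedactionType": "PII",
            "RedactionOutput": "redacted",
        },
    }
    if vocabulary_name:
        # Custom vocabulary for product names and internal jargon.
        settings["VocabularyName"] = vocabulary_name
    return {
        "CallAnalyticsJobName": job_name,
        "Media": {"MediaFileUri": audio_s3_uri},
        "OutputLocation": output_s3_uri,
        "DataAccessRoleArn": role_arn,
        # Assumption: stereo recordings, agent on channel 0, customer on 1.
        "ChannelDefinitions": [
            {"ChannelId": 0, "ParticipantRole": "AGENT"},
            {"ChannelId": 1, "ParticipantRole": "CUSTOMER"},
        ],
        "Settings": settings,
    }


request = build_call_analytics_request(
    job_name="call-2026-06-01-000123",                      # hypothetical
    audio_s3_uri="s3://example-calls/raw/000123.wav",       # hypothetical
    output_s3_uri="s3://example-calls/transcripts/",        # hypothetical
    role_arn="arn:aws:iam::123456789012:role/example-transcribe",
    vocabulary_name="example-product-names",
)

# With boto3 installed and credentials configured, this would be submitted as:
#   boto3.client("transcribe").start_call_analytics_job(**request)

monthly_minutes = 40_000 * 8  # calls per month x average minutes per call
print(monthly_minutes)
```

Keeping the request builder separate from the AWS call makes it testable in a Lambda unit test without touching the network.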

Sensor anomaly detection. Amazon Lookout for Equipment. Purpose-built for multivariate anomaly detection on industrial sensor data. You upload historical sensor readings, Lookout trains a site-specific model automatically, and you stream readings in for real-time inference. It handles site-to-site variation naturally (each asset or site can have its own model) and doesn’t need you to hand-label anomalies (it learns normal patterns from the healthy history). The alternative, building this on SageMaker with a custom model, is a weeks-to-months project; Lookout is days. (Note: the service is being wound down; at the time of writing, Amazon is directing new customers toward SageMaker’s own anomaly-detection capabilities. Check current guidance before new builds.)

Form text extraction. Amazon Textract. Textract has a dedicated Invoice API (AnalyzeExpense) that returns structured fields from invoices: vendor name, invoice number, date, line items, totals, tax, currency. A per-page API call per invoice, output is JSON with each field tagged by type. 3,000 invoices a week at a page each is ~12,000 pages a month; straightforward per-page pricing. For non-invoice forms the generic AnalyzeDocument with FORMS and TABLES feature types does the same for arbitrary structured layouts.
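The “DB writer” half of that pipeline mostly consists of flattening the response. A minimal sketch, assuming the AnalyzeExpense response shape of ExpenseDocuments containing SummaryFields, each with a Type and a ValueDetection. The sample response is hand-written for illustration, not real service output, and the 80% confidence cut-off is an arbitrary choice.

```python
def extract_summary_fields(response, min_confidence=80.0):
    """Flatten AnalyzeExpense summary fields into {field_type: text}.

    Fields below min_confidence are dropped so a human can review them
    instead of a low-confidence guess being written to the database.
    """
    fields = {}
    for document in response.get("ExpenseDocuments", []):
        for field in document.get("SummaryFields", []):
            field_type = field.get("Type", {}).get("Text")
            value = field.get("ValueDetection", {})
            if field_type and value.get("Confidence", 0.0) >= min_confidence:
                # Keep the first occurrence of each field type.
                fields.setdefault(field_type, value.get("Text", ""))
    return fields


# Hand-written stand-in for a response; values are invented.
sample_response = {
    "ExpenseDocuments": [{
        "SummaryFields": [
            {"Type": {"Text": "VENDOR_NAME"},
             "ValueDetection": {"Text": "Example Supplies Pty Ltd", "Confidence": 98.1}},
            {"Type": {"Text": "INVOICE_RECEIPT_ID"},
             "ValueDetection": {"Text": "INV-0042", "Confidence": 95.4}},
            {"Type": {"Text": "TOTAL"},
             "ValueDetection": {"Text": "1,250.00", "Confidence": 91.7}},
            {"Type": {"Text": "TAX"},
             "ValueDetection": {"Text": "113.64", "Confidence": 42.0}},
        ]
    }]
}

invoice = extract_summary_fields(sample_response)
print(invoice)
```

The low-confidence TAX field is deliberately left out of the result; routing those to manual review is one way to set the accuracy bar mentioned earlier.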

Email summarisation. Bedrock. No purpose-built AI service exists for generic summarisation. Comprehend has a summarisation capability but it’s extractive (pulls existing sentences); for a paragraph summary written in the company’s voice, a foundation model is the correct tool. Bedrock with Claude Sonnet or Nova Lite, one InvokeModel call per email thread, a well-engineered prompt. Volume is low (sales-team-scale, probably low-thousands a day); on-demand per-token pricing. The Bedrock-tier pick that earns its place.
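The “one InvokeModel call per email thread” is mostly a matter of building the request body. A sketch for an Anthropic model on Bedrock, assuming the Messages-style body with the `bedrock-2023-05-31` version string. The system prompt wording and the token limit are illustrative choices, and the model ID is deliberately left to the caller.

```python
import json


def build_summary_request(thread_text, max_tokens=300):
    """Return the JSON body for summarising one email thread."""
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        # Illustrative prompt; the real one would encode the company's voice.
        "system": ("You summarise sales email threads in one short paragraph, "
                   "in plain, friendly business English."),
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": f"Summarise this email thread:\n\n{thread_text}"},
                ],
            }
        ],
    }
    return json.dumps(body)


body = build_summary_request("Hi team, following up on the Q3 renewal...")

# With boto3 installed and credentials configured, the call would look like:
#   client = boto3.client("bedrock-runtime")
#   response = client.invoke_model(modelId=MODEL_ID, body=body)
# where MODEL_ID is whichever model the team has enabled in its account.
print(len(body))
```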

Visitor face detection. Amazon Rekognition. IndexFaces adds today’s approved visitors to a face collection at the start of the day; SearchFacesByImage on each camera frame returns matches above a similarity threshold. Low latency, fully managed. Quotas and pricing scale with API calls; the whole system is a Lambda + S3 + Rekognition pipeline. No custom model training.

Note that four of five are AI services and the fifth is Bedrock. None of them is SageMaker. That’s the correct outcome for this backlog: purpose-built services cover the named tasks, Bedrock covers the generative task, and SageMaker is reserved for the problems none of the above handles.

When would SageMaker win?

“SageMaker” is still the default someone types, so it’s worth being clear when it actually wins. SageMaker is the correct tier when:

The task isn’t covered by an AI service. A custom computer-vision model for detecting specific defects in your specific product on your specific assembly line, where Rekognition’s general models don’t have the precision. A custom NLP classifier for a domain-specific taxonomy Comprehend Custom can’t learn. A custom regression on industrial sensor data that goes beyond what Lookout handles.
The workload is high-volume enough that per-transaction pricing loses. If you’d be calling Transcribe 10 million times a month and even your negotiated bulk price is still more than running a self-hosted ASR model on a few ml.g5 endpoints, then self-hosting wins. This is unusual; the crossover is higher than teams assume.
Data residency or VPC isolation. Some AI services have VPC endpoints; some don’t. If your data can’t leave a specific VPC, or can’t be sent to a managed API under any circumstances, SageMaker endpoints inside your VPC are the answer.
You need to explain or audit the model’s behaviour in detail. Managed APIs are black boxes. SageMaker lets you inspect, explain (SageMaker Clarify), and version every model; for regulated contexts where “why did the model do this?” needs a substantive answer, that visibility matters.
You’re doing research, experimentation, or model development, not consumption. SageMaker Studio is an ML development environment. Managed APIs are ML consumption surfaces. If your team’s job is to build models, Studio is the workspace.

For this platform team’s five requests, none of those conditions applies. They’re consuming ML capabilities, not developing models. AI services and Bedrock are correct; SageMaker would be over-engineering.

A worked sprint

How the team could schedule the five builds across a two-week sprint:

Week 1
  Day 1: team walks the five requests against the taxonomy.
         Writes up tier assignments + rough cost estimates.
  Day 2: parallelise.
    Pair A: Transcribe call-analytics pipeline
      (S3 + Lambda + StartCallAnalyticsJob + result consumer)
    Pair B: Textract Invoices pipeline
      (S3 + Lambda + AnalyzeExpense + DB writer)
    Pair C: Bedrock email summariser
      (Lambda + InvokeModel + prompt tuning + CRM integration)
  Day 3-4: each pair gets to a working end-to-end version.
  Day 5: review, measure accuracy against a held-out set of
         real inputs, tune thresholds / prompts / custom vocabs.

Week 2
  Day 1-2: Pair A picks up visitor face detection
    (Rekognition + face collection management + badge printing).
  Day 1-2: Pair B picks up sensor anomaly detection
    (Lookout for Equipment, historical data ingest, inference
    stream wiring, alerting).
  Day 3-4: Pair C does guardrails, monitoring, IAM scoping
           across all five pipelines.
  Day 5: end-to-end review with stakeholder sign-off per request.

Six engineers delivering five AI-shaped features in a sprint is only realistic because none of them is a from-scratch ML project. Each is “wire up a managed service correctly.” If any of the five had required SageMaker, that one alone would have consumed the sprint.

The default to hold

When a new AI-shaped request lands, the sequence worth running is:

1. Is there a purpose-built AI service for this task? (Transcribe for speech, Textract for forms, Rekognition for images, Comprehend for NLP, Lookout for sensors, etc.) If yes and accuracy meets the bar, use it.
2. If no, is it a generative / general-text / RAG task? If yes, Bedrock.
3. If neither, does the problem genuinely require custom ML? If yes, SageMaker.

Most requests terminate at step 1. A significant minority at step 2. A small minority at step 3. Reversing the sequence, starting with SageMaker and asking “could this be done simpler?”, is how teams end up running ML pipelines for problems Transcribe would have solved in a day.
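The sequence is simple enough to write down as a function, which is a handy way to make the ordering explicit in a team wiki or an intake form. A minimal sketch; the four yes/no inputs are my own framing of the three questions, not an official taxonomy.

```python
from enum import Enum
from typing import Optional


class Tier(Enum):
    AI_SERVICE = "purpose-built AI service"
    BEDROCK = "Bedrock"
    SAGEMAKER = "SageMaker"


def pick_tier(has_purpose_built_service: bool,
              accuracy_meets_bar: bool,
              is_generative_or_text_task: bool,
              needs_custom_ml: bool) -> Optional[Tier]:
    """Walk the three questions top-down and stop at the first match."""
    # Step 1: a named service exists and is accurate enough.
    if has_purpose_built_service and accuracy_meets_bar:
        return Tier.AI_SERVICE
    # Step 2: generative, general-text, or RAG work.
    if is_generative_or_text_task:
        return Tier.BEDROCK
    # Step 3: only now consider the full platform.
    if needs_custom_ml:
        return Tier.SAGEMAKER
    # Nothing fits: the request probably needs reframing, not a tier.
    return None


print(pick_tier(True, True, False, False).value)    # call-centre transcription
print(pick_tier(False, False, True, False).value)   # email summarisation
```

Because the checks run in order, a request that could be served by either an AI service or Bedrock always lands on the AI service, which is the whole point of the default.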

What’s worth remembering

AWS organises AI/ML into three tiers. Purpose-built AI services (Transcribe, Textract, Rekognition, Comprehend, etc.); Bedrock for foundation models; SageMaker for the full ML platform. The tier is the first question; the specific service follows.
AI services win when the task matches. Transcribe does call-centre audio with diarisation and PII redaction out of the box. Textract does forms. Rekognition does faces. These services are tuned by teams of specialists; competing with them from scratch takes months.
Bedrock is the foundation-model tier. Generative text, summarisation, RAG, agents, embeddings. Per-token pricing. When the task is “something with text that’s not covered by a specific AI service,” Bedrock is usually correct.
SageMaker is the platform of last resort, not first choice. The correct tier when no managed service fits, when volume shifts the cost calculus, when data residency demands it, or when you’re developing rather than consuming models.
Pricing shape varies by tier. Per-transaction (AI services), per-token (Bedrock), per-instance-hour (SageMaker). Low-to-medium volume favours managed per-transaction pricing; very high volume can tip toward SageMaker’s per-hour model.
Time-to-first-version tracks the tier. AI services are hours. Bedrock is hours-to-days. SageMaker is days-to-weeks. Budget accordingly.
Custom vocabularies, custom classifiers, and custom entities bridge the accuracy gap on AI services. Transcribe custom vocabularies, Comprehend Custom classification and entity recognition, Textract custom adapters. Often enough to close the accuracy gap without dropping to SageMaker.
The default sequence is AI service -> Bedrock -> SageMaker. Start at the top, drop only when the tier above can’t meet the requirement. Reversing the sequence is how teams build custom ML pipelines for problems managed services would have solved in a day.

Five AI-shaped requests, one sprint, and no custom models. That’s the shape the AWS AI stack is designed for: most “AI” problems are not ML research projects; they’re integrations of capabilities someone else already built and tuned. Recognising which tier a problem belongs to, and holding the discipline to stay as high as the problem allows, is most of what makes a platform team fast.

The Boring Baseline That Wins

2026-05-23T06:00:00+08:00

You have 4,000 customer reviews. Half are positive, half are negative, more or less. You want a sentiment classifier. The team’s first instinct is to call the LLM API once per review and parse the response. The bill is real, the latency is real, and the accuracy on your specific data is unproven.

An afternoon’s work in scikit-learn produces a model that hits 92% accuracy, runs at 50,000 predictions per second on a CPU, and costs nothing per call. The afternoon includes lunch.

This shouldn’t be an unusual outcome, but increasingly it is.

There’s a recurring pattern in machine learning projects: someone reaches for the most sophisticated tool first, struggles with it, and only later discovers that a “boring” classical baseline (TF-IDF features fed into a logistic regression) would have solved the problem in an hour. The previous post covered the classical NLP that still ships in production. This post covers the classical machine learning that should be the default starting point for most text-classification, clustering, and topic-modelling projects.

Not because neural models are bad. Because for problems below a certain size and complexity, the boring tools are simply the correct answer.

TF-IDF: the trick that won’t die

TF-IDF (Term Frequency / Inverse Document Frequency) is a way of turning a piece of text into a vector of numbers based on which words appear in it and how distinctive those words are.

The intuition is simple. For each word in your vocabulary, multiply two numbers:

TF: how often the word appears in this document. Words repeated often within the document score high.
IDF: a penalty for words that appear in many documents. Words that are common everywhere (like “the” or “and”) score low. Words that appear in only a few documents score high.

The result is a feature vector where words that are distinctive to a document score highly and words that are common across the corpus score low. “Refund” in a customer-service ticket scores high; “the” scores near zero.

That’s it. There’s no neural network, no training in the modern sense. You count words, you weight them, you have a feature vector. The whole pipeline is a hundred lines of Python or a single call to sklearn.feature_extraction.text.TfidfVectorizer.

And it works. Astonishingly well, for a fifty-year-old idea.
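To make the counting concrete, here is a from-scratch sketch using the textbook weighting, tf × log(N / df). It is an illustration rather than a copy of what scikit-learn does: TfidfVectorizer uses a smoothed IDF and normalises each vector by default, so its numbers will differ. The three toy documents are invented.

```python
import math
from collections import Counter


def tfidf(docs: list[str]) -> list[dict[str, float]]:
    """Return one {word: weight} dictionary per document."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for tokens in tokenised for word in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        vectors.append({
            # TF (share of this document) times IDF (rarity across the corpus).
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return vectors


docs = [
    "the refund was late",
    "the box was fresh",
    "the delivery was early",
]
vectors = tfidf(docs)
print(vectors[0])
```

“the” and “was” appear in every document, so their IDF is log(1) = 0 and they vanish from the picture; “refund” appears in one document of three and keeps a positive weight.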

Logistic regression on TF-IDF features

Once you have TF-IDF vectors, you can feed them into any classifier. The most-used and least-glamorous choice is logistic regression: a linear model that learns a weight for each feature and predicts the probability of each class as a logistic function of the weighted sum.

For text classification with reasonable amounts of data (a few thousand to a few hundred thousand labelled examples), TF-IDF + logistic regression is often within a few percentage points of the best deep-learning model, and orders of magnitude cheaper to train, deploy, and explain.

Real numbers from real projects:

Sentiment analysis on movie reviews (50k examples, IMDB-style): TF-IDF + logistic regression hits ~89% accuracy. A fine-tuned BERT hits ~94%. A frontier LLM with a prompt hits ~92%. The first one trains in 30 seconds and runs at 50,000 predictions per second on a CPU.
Spam detection (millions of emails): TF-IDF + logistic regression or naive Bayes is still the production standard at most large mail providers. The neural model would be more accurate by a percentage point and cost a thousand times more to run at scale.
Topic classification of news articles (20-30 classes, 100k articles): TF-IDF + logistic regression matches BERT to within a couple of points and runs in milliseconds.

The pattern holds: when the task is “find a stable mapping from word patterns to a fixed set of labels,” and you have a few thousand examples, the linear model on lexical features is the sensible baseline.
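For a sense of how little code the baseline takes, here is a sketch of the pipeline in scikit-learn with default settings. The eight training sentences are invented toy data, far too few to say anything about accuracy; a real project would fit on thousands of labelled examples and score on a held-out set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data, invented for illustration only.
texts = [
    "great product love it",
    "excellent service very happy",
    "love the fast delivery",
    "happy with great quality",
    "terrible product hate it",
    "awful service very angry",
    "hate the slow delivery",
    "angry about terrible quality",
]
labels = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg"]

# Vectorise and classify in one object, so the same preprocessing
# is applied at training time and at prediction time.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

predictions = model.predict(["love great happy", "terrible awful hate"])
print(list(predictions))
```

Wrapping both steps in a pipeline also means the whole thing can be serialised and deployed as a single artefact.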

When the linear model isn’t enough

The boring baseline has known weaknesses, and they’re the cases where you actually want a transformer.

Paraphrase and synonymy. “I’m furious” and “I’m absolutely livid” are obviously sentiment-equivalent to a human. TF-IDF treats them as completely different features. Word2vec helps a bit; transformers solve it.
Long-range context. “The hotel was lovely, except for the bedbugs and the manager who threatened me.” A bag-of-words model averages “lovely” and “threatened” and gets the answer roughly correct by accident. A transformer reads it as a sentence and weights the second clause appropriately.
Negation and irony. “Best customer service ever, if you enjoy waiting four hours and being lied to.” TF-IDF sees “best” + “customer service” + “ever” and predicts positive. The transformer sees the structure.
Low-resource targets. If you only have 50 labelled examples, the linear model will overfit; an LLM with zero-shot prompting may genuinely do better.

The rule of thumb is: if the task can be solved by paying attention to the correct keywords, the boring baseline works. If it requires understanding sentence structure or context, you need a transformer.

Naive Bayes: the even more boring baseline

Naive Bayes is, in a real sense, more primitive than logistic regression. It assumes every feature is independent of every other feature given the class, a “naive” assumption that’s almost always false. And yet it often works fine, particularly for spam classification, document categorisation, and short-text problems.

The reason is computational. Naive Bayes is blazing fast to train (it just counts word occurrences per class) and equally fast at inference. For applications where you need to retrain frequently (incoming email streams, news feeds, anything with model drift) it’s hard to beat. Multinomial naive Bayes specifically remains the correct default for short text classification with limited data.
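The “counting word occurrences per class” part is literal. A minimal from-scratch sketch of multinomial naive Bayes with add-one (Laplace) smoothing; the class name and the four training messages are invented for illustration, and in practice you would use sklearn.naive_bayes.MultinomialNB.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Multinomial naive Bayes with add-one smoothing, for illustration."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            for word in doc.lower().split():
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


nb = TinyMultinomialNB().fit(
    ["win cash now", "cheap pills win", "meeting at noon", "lunch at noon tomorrow"],
    ["spam", "spam", "ham", "ham"],
)
print(nb.predict("win cheap cash"), nb.predict("lunch at noon"))
```

Training is a single pass over the data and retraining on new mail only means updating the counters, which is why the approach suits fast-moving streams.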

Clustering: k-means and the friends you don’t think about

Sometimes the task isn’t “classify this into one of N labels”; it’s “find natural groupings in this data.” That’s clustering, and the boring baseline is k-means.

K-means takes a set of points (your TF-IDF vectors, your image embeddings, whatever) and a number k, and finds k clusters such that each point is closer to its own cluster’s centre than to any other. It’s the algorithm taught in the first week of a machine learning course, and it’s still the correct tool for most clustering problems.

When you’d actually use it:

Customer segmentation based on behaviour vectors.
Document clustering for exploratory analysis (“what topics exist in this corpus?”).
Image quantisation, reducing a photograph to a palette of k colours.
Vector quantisation for compression and indexing in vector databases.

K-means has limitations (it assumes spherical clusters, requires you to pick k, and can get stuck in bad local minima), but for “I have a pile of vectors and I want to know what’s in there,” it’s still the first tool to reach for.
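The algorithm itself fits in a few lines. Below is a sketch of the standard Lloyd iteration (assign each point to its nearest centre, move each centre to the mean of its points, repeat) on invented 2-D points. Initialisation is a plain random sample, which is cruder than the k-means++ seeding scikit-learn uses by default.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centres):
    """Index of the nearest centre for each point."""
    return [min(range(len(centres)), key=lambda i: squared_distance(p, centres[i]))
            for p in points]


def kmeans(points, k, iterations=20, seed=0):
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # naive initialisation
    for _ in range(iterations):
        labels = assign(points, centres)
        new_centres = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                # Move the centre to the mean of its members.
                new_centres.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centres.append(centres[i])  # keep an empty cluster's centre
        if new_centres == centres:
            break  # converged
        centres = new_centres
    return centres, assign(points, centres)


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centres, labels = kmeans(points, k=2)
print(sorted(centres), labels)
```

On data this cleanly separated any starting pair converges to the same two groups; on messier data the result depends on the initial centres, which is the local-minimum problem mentioned above.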

For when k-means isn’t enough, there’s a small family of alternatives that are themselves still classical: DBSCAN for density-based clustering, hierarchical clustering when you want a dendrogram, Gaussian Mixture Models when you want soft assignments and uncertainty.

Topic modelling: LDA and NMF

A specific kind of unsupervised text analysis: what topics are present in this corpus, and which documents touch on which topics?

The classical answer is Latent Dirichlet Allocation (LDA, Blei et al., 2003). LDA models each document as a mixture of topics, and each topic as a distribution over words. The result, when applied to a corpus of news articles, might give you topics that look like “sports basketball game team player,” “politics election vote senator democrat,” “weather storm rain temperature forecast.” Each document is described as some percentage of each topic.

LDA is interpretable, deterministic-ish, and runs on modest hardware. It produces output a human can read (a topic is a list of weighted words) rather than a 768-dimensional vector. For exploratory analysis, journalism, and humanities research, it’s still extremely common.

Non-negative Matrix Factorisation (NMF) does a similar thing through different mathematics and often produces sharper, more separable topics; it is worth trying alongside LDA when topic modelling is what you actually want.

The neural alternatives, topic models built on top of contextual embeddings, like BERTopic, produce subtler topics but are harder to interpret and slower to run. If your goal is “give me a readable list of what’s in this corpus,” LDA is still hard to beat.

A starter kit, in code

Eighty per cent of the practical problems in this post can be solved with a combination of:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation

The total surface area is maybe 30 functions. The mental model is small. The deployment cost is whatever it costs to run a Python process on a CPU. You can train, deploy, and serve all of these from a single laptop, and you can scale them out to billions of documents on commodity hardware without surprise.

That’s not nothing. That’s most of the practical value of machine learning, available without buying a GPU or calling an API.

A decision table

If your task is…	The boring baseline is…	Reach for a transformer when…
Sentiment / topic / intent classification with thousands of labelled examples	TF-IDF + logistic regression	You need to handle paraphrase, irony, or long context
Spam / phishing / abuse detection	Multinomial naive Bayes or logistic regression	Adversaries are actively rewording to evade keywords
Document categorisation across many classes	TF-IDF + linear SVM or logistic regression	Class definitions are subtle and require context
Customer segmentation	K-means on engineered features	You need clusters defined by complex relationships
"What topics exist in this corpus?"	LDA or NMF	You need topics defined by semantic meaning rather than co-occurring words
Initial baseline for any new ML problem	TF-IDF + logistic regression, even if you eventually replace it	Always start here. Knowing how the boring baseline scores tells you whether the fancy model is worth the cost.

Why teams skip this step

Three usual reasons.

First, the gradient of professional incentives points away from boring. Saying “I shipped a TF-IDF + logistic regression model” sounds like 2008. Saying “I fine-tuned a transformer” sounds like 2026. The actual customer doesn’t care.

Second, the tooling for fancy models is now better than the tooling for boring ones. Hugging Face, Replicate, and the LLM APIs have made it easier to call a transformer than to set up a scikit-learn pipeline, particularly for someone new to the field. The friction has inverted.

Third, “good enough” is hard to defend when the alternative is “best.” Nobody got fired for picking the SOTA model. If you pick the linear baseline and it’s 92% accurate, someone will eventually ask why you didn’t use the 94% transformer. The answer is “because it costs a thousand times more and is two points better and we don’t need those two points”, but that’s an explicit trade-off discussion most teams don’t want to have.

The discipline that pays off is making the boring baseline the explicit comparison point. If you can’t beat the linear model by a meaningful margin, the linear model wins. If you can, you’ve justified the upgrade with a number.

TF-IDF and logistic regression remain the right place to start a text-classification problem with thousands of labelled examples. Multinomial naive Bayes still beats most things for very short text at very high throughput. K-means is still the first thing to reach for when you want to know what groups exist in a pile of vectors, and LDA or NMF are still the tools to use when “give me a readable list of topics” is the actual brief. None of these is the consolation prize. They are the score the fancier model has to beat by a margin large enough to justify its cost.

Most production ML in industry is still classical. The headlines belong to LLMs and the backend belongs to logistic regression. A 92% model that runs at fifty thousand predictions per second on a CPU usually beats a 94% model that costs a thousandth of a cent per call, once you multiply by the volume you’re actually serving. Always know the boring number before you commit to something fancier.

A Gentle Guide to Typography: From Chisels to Character Sets

2026-05-22T06:00:00+08:00

Before there were fonts, before there were printing presses, before there was even an alphabet, there were people who wanted to say things that would last longer than a breath.

They scratched marks into wet clay. They carved shapes into stone. They painted on cave walls with ground-up ochre and spit; the pigments at Lascaux date to around 17,000 years ago. But that’s not the oldest mark-making by a long stretch. The First Nations peoples of Australia, the oldest continuous civilisation on Earth, were creating rock art tens of thousands of years earlier. Petroglyphs in the Pilbara region of Western Australia have been dated to at least 30,000 years ago, and charcoal drawings in Arnhem Land’s Nawarla Gabarnmang shelter push past 28,000 years (as documented in Histories of Australian Rock Art Research and related studies). Some researchers argue the tradition extends back 65,000 years or more, to the earliest evidence of human settlement on the continent. Writing, in its oldest form, was a physical act: you took a tool and you pushed it into something that would hold the mark after you walked away.

This is where typography starts. Not with software. Not with design theory. With someone pressing a wedge into clay and thinking: I want this to outlive me.

From hand to mould

For thousands of years, every copy of every written document was made by hand. Scribes (often monks in medieval Europe) would sit for hours copying text character by character onto parchment or vellum. Each copy was unique. Each was slightly different. The handwriting of the scribe was the “font”, though nobody called it that.

Then, around 1440, Johannes Gutenberg changed everything.

Gutenberg didn’t invent printing. The Chinese had been doing block printing for centuries, and Bi Sheng had created movable type from baked clay as early as 1040 AD. What Gutenberg invented was movable metal type: individual letters, each cast as a small block of a lead-tin-antimony alloy, that could be arranged into words, locked into a frame, inked, and pressed onto paper. When you were done printing one page, you could break the letters apart and rearrange them into something else.

This was revolutionary, and it introduced a bunch of concepts we still use today. So let’s walk through them, starting from the most fundamental.

Characters

A character is the abstract idea of a letter, digit, or symbol. The letter “A” is a character. So is “7”. So is “?”. So is “é”. A character doesn’t have a specific shape; it’s the concept of that symbol. When you think of the letter B, you’re thinking of a character: the second letter of the Latin alphabet, regardless of whether it’s tall and thin or short and round.

This distinction matters because the same character can look wildly different depending on who’s drawing it. Your handwritten “g” looks nothing like the “g” on this screen, but they’re the same character. They carry the same meaning.

Glyphs

A glyph is the specific visual shape that represents a character. If a character is the idea, a glyph is the drawing. The letter “a” is a character; the particular way it looks in this paragraph, its curves, its weight, its proportions, that’s a glyph.

One character can have many glyphs. Think about “a” for a moment. There’s the version you’re probably reading now: a little arch sitting over a closed bowl, with a distinct two-part structure. Then there’s the simpler version, the one that looks like a circle with a stick, the kind most people write by hand. Typographers call the first one “double-storey” and the second “single-storey” (because the first has two enclosed spaces stacked up, like floors of a building). Both are glyphs of the same character.

This goes further. An italic “a”, a bold “a”, a small-caps “A”: these are all different glyphs of the same character. Gutenberg understood this instinctively. His Bible used around 290 distinct glyphs, far more than the alphabet required, including variant letterforms and common ligatures, all designed to mimic the natural variation of handwriting.

Typefaces

Now we’re getting to the term people most often mix up.

A typeface is a designed set of glyphs that share a consistent visual style. When someone sits down and draws a complete alphabet (uppercase, lowercase, numbers, punctuation) in a unified style, they’ve created a typeface. Helvetica is a typeface. Garamond is a typeface. Times New Roman is a typeface.

The word “typeface” comes directly from the physical world. In Gutenberg’s workshop, each metal letter block had a face: the raised surface that got inked and pressed onto paper. A set of blocks sharing the same design was a set of type with the same face. A typeface.

When people say “I love that font”, they usually mean the typeface: the overall design, the aesthetic, the personality. And that’s fine; language evolves. But if you want to be precise, the typeface is the design.

Fonts

So what’s a font then?

In the metal-type era, a font was a specific size and style of a typeface. Garamond 12-point italic was one font. Garamond 14-point bold was a different font. They were literally different sets of physical metal blocks. You had to buy them separately and store them in different drawers.

Those drawers, by the way, were called cases. The capital letters were stored in the upper case (the harder-to-reach one, since capitals are used less often) and the small letters in the lower case, which is where we get the terms “uppercase” and “lowercase”. (Lovely, isn’t it?)

In the digital world, the distinction has blurred. A font file today usually contains the full set of glyphs for one style of a typeface: Garamond Italic, say, or Garamond Bold. The typeface is the family; the font is the specific file or instance. But in everyday conversation, “font” and “typeface” are used interchangeably, and that’s okay.

Font faces

Font face is a term that lives mostly in the world of CSS and web development. When you write @font-face in a stylesheet, you’re telling the browser: here’s a font file, and here’s what I want you to call it. It’s the bridge between a font file sitting on a server and a name you can use in your design.

In broader typographic conversation, “font face” and “typeface” mean roughly the same thing: the visual design of the letterforms.

Serifs (and their absence)

Look at the letters in a book printed in Times New Roman. See those little feet and flicks at the ends of the strokes? Those are serifs.

The word probably comes from the Dutch schreef, meaning “stroke” or “line” (as discussed in De Vinne’s The Practice of Typography). Serifs have been around since Roman times, literally. If you look at the inscriptions on Trajan’s Column in Rome (dedicated 113 AD), the letters have serifs. There’s a beautiful theory, advanced by Edward Catich in his 1968 study The Origin of the Serif, that they originated not from the chisel but from the brush: before carving, Roman stonecutters painted the letterforms with a flat brush, and the natural flare of each brush stroke at the start and end of a line became the serif. The chisel then faithfully followed the painted guide. (Catich demonstrated this by cutting letters with period-appropriate tools.)

Typefaces with serifs (like Garamond, Baskerville, Georgia, and Times New Roman) are called serif typefaces. They feel classic, bookish, warm. Serifs are also often credited with a practical function: helping guide the eye along a line of text, creating a subtle visual rail. That’s part of why they’ve been the default for body text in printed books for centuries.

Typefaces without serifs (like Helvetica, Arial, Futura, and Gill Sans) are called sans-serif typefaces (“sans” is French for “without”). They tend to feel modern, clean, minimal. On screens, especially at small sizes, sans-serif typefaces have historically been easier to read because the fine details of serifs can get lost in low-resolution pixels. (High-resolution screens have closed that gap considerably.)

There are other categories too. Slab serif typefaces (like Rockwell or Courier) have thick, blocky serifs: bold and industrial. Monospaced typefaces give every character the same width, which is why they’re used for code: everything lines up neatly. Script typefaces mimic handwriting. Display typefaces are designed for headlines and large sizes, where they can be dramatic without worrying about readability at 10 points.

Spacing and leading

When Gutenberg assembled his type, the letters didn’t just touch each other. The metal blocks had built-in spacing: a little extra metal on each side of the letter face, so that when you lined them up, there was breathing room between characters.

Spacing (or tracking in modern terminology) is the uniform distance between all characters in a block of text. Increase the tracking and the text feels airy, open, maybe a little aloof. Decrease it and things get tight, urgent, compressed. Good tracking is invisible; you don’t notice it, but you feel comfortable reading.

Leading (pronounced “ledding”) is the vertical space between lines of text. The name comes from the actual strips of lead that typesetters placed between rows of metal type to push the lines apart (as described in Lupton’s Thinking with Type). More leading gives text room to breathe. Less leading packs it in. The correct amount depends on the typeface, the line length, and where the text is being read. Cramped leading is one of the quickest ways to make text feel hostile.

Kerning

Kerning is the adjustment of space between specific pairs of characters. This is different from tracking, which affects all characters equally. Kerning is about individual relationships.

Consider the letters “AV”. Because of their shapes (one leaning left, one leaning right), spacing them evenly by each letter’s default width leaves an awkward gap between them. It looks like “A V” instead of “AV”. Kerning tucks them closer together so they feel correct.

Other classic kerning pairs: “To”, “We”, “Ty”, “VA”, “LT”. Any combination where the shapes of adjacent letters create an optical gap that needs closing.
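To see how tracking and kerning differ mechanically, here’s a minimal Python sketch. The advance widths and kerning values are invented for illustration (real fonts carry their own tables), and line_width is a hypothetical helper, not any layout engine’s API.

```python
# Hypothetical advance widths and kerning pairs, in font units (assume 1000 per em).
ADVANCE = {"A": 680, "V": 660, "T": 600, "o": 560}
KERNING = {("A", "V"): -80, ("V", "A"): -80, ("T", "o"): -70}

def line_width(text: str, tracking: int = 0) -> int:
    """Sum advance widths, adding tracking and kerning between neighbours."""
    width = 0
    for i, ch in enumerate(text):
        width += ADVANCE[ch]
        if i + 1 < len(text):
            width += tracking                            # uniform: every gap
            width += KERNING.get((ch, text[i + 1]), 0)   # specific pairs only
    return width

print(line_width("AV"))               # 1260
print(line_width("AV", tracking=20))  # 1280
```

Tracking touches every gap by the same amount; the kerning table only fires for pairs it knows about.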

Good kerning is something you never notice. Bad kerning is something you can’t unsee. (There’s a whole internet subculture dedicated to finding poorly kerned signs. It’s called “keming”, because that’s what “kerning” looks like with bad kerning.)

Metrics and the anatomy of letters

Typographers have a precise vocabulary for the parts of a letter, and some of it is unexpectedly wonderful.

Take the counter, the empty space inside a letter. The hole in “o”, the gap inside “e”, the little window in “a”. The empty space has a name! And it matters: counters are a huge part of what makes a typeface feel open or cramped.

Then there’s the baseline (the invisible line letters sit on) and the x-height, which is just the height of a lowercase “x” (and by extension, most lowercase letters). Once you know about x-height, you start noticing it everywhere: a typeface with a tall x-height feels big and readable even at small sizes. Tall lowercase letters like “b” and “d” have ascenders that rise above the x-height. Letters like “p” and “g” have descenders that drop below the baseline, and the length of the descenders is one of those subtle things that gives a typeface its personality.

The rest of the vocabulary is just as precise: the cap height is how tall capitals are, the bowl is the rounded part of letters like “b” and “d”, the stroke is any main line, and a terminal is where a stroke ends without a serif.

The em is a unit of measurement that originally meant the width of the capital M, because M was typically the widest letter, and its width roughly equalled its height, making a nice square. Today, an em is simply equal to the current point size: in 16-point type, an em is 16 points. It’s used everywhere in typography and CSS. An en is half an em (roughly the width of a capital N) and is the unit behind the en-dash (–), which is half the width of an em-dash (—).

But what is a point? And how does it relate to the pixels on your screen?

A point (pt) is the fundamental unit of typographic measurement. The concept dates back to Pierre Simon Fournier, who proposed a standardised point system in 1737, later refined by François-Ambroise Didot in the 1780s (documented in Carter’s A View of Early Typography). In the modern PostScript standard (used by virtually all digital typography), one point is exactly 1/72 of an inch. So 72-point type sits on a body an inch tall (the letters themselves are slightly smaller). This wasn’t always the case; before digital standardisation, different countries used slightly different point sizes. The American point (adopted by the United States Type Founders’ Association in 1886) was 0.01383 inches; the French Didot point was 0.01483 inches, about 7% larger, which made international typesetting exciting in all the wrong ways.

A pica is 12 points, or 1/6 of an inch. Picas are used for measuring larger things: column widths, margins, page dimensions. If a designer says “set the body text in 10-point on a 20-pica column”, they mean 10-point type in a column about 3.3 inches wide. There’s even a European cousin called the cicero, which is 12 Didot points, almost the same size as a pica, but not quite. It’s mostly historical now.
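The arithmetic above is easy to check. This is a small sketch using the PostScript definitions from the text; the function names are made up for the example.

```python
POINTS_PER_INCH = 72   # PostScript point
POINTS_PER_PICA = 12

def points_to_inches(points: float) -> float:
    return points / POINTS_PER_INCH

def picas_to_inches(picas: float) -> float:
    return picas * POINTS_PER_PICA / POINTS_PER_INCH

print(points_to_inches(72))             # 1.0
print(round(picas_to_inches(20), 2))    # 3.33

# Didot point versus American point, using the inch values quoted above.
print(round(0.01483 / 0.01383 - 1, 3))  # 0.072
```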

A pixel (px) is a single illuminated dot on your screen, and its physical size depends entirely on the display. On a 96-DPI (dots per inch) screen (the traditional Windows default) one pixel is 1/96 of an inch, so a CSS “point” (1/72 inch) works out to about 1.33 pixels. On a modern Retina display at 220 DPI, the same point might be 3 or more physical pixels.

This is where it gets confusing. CSS defines 1px as exactly 1/96 of an inch, but on high-DPI screens, a CSS pixel might map to 2 or 3 physical device pixels. Your phone’s “logical” resolution (the one websites see) is often half or a third of its actual hardware resolution. The operating system handles the scaling, which is why text looks sharp on a Retina display: there are simply more physical pixels per logical pixel, giving the rasteriser more dots to work with when drawing those Bézier curves (the mathematical curves that define each letter’s shape; more on these shortly).
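Here’s a rough sketch of that two-step mapping in Python. It assumes the CSS ratios described above (72 points and 96 CSS pixels to the inch) and treats the scaling factor as a simple multiplier; device_pixel_ratio is named after the value browsers expose, but the functions themselves are illustrative.

```python
def points_to_css_px(points: float) -> float:
    # 1in = 72pt = 96px in CSS, so 1pt = 96/72 px
    return points * 96 / 72

def css_px_to_device_px(css_px: float, device_pixel_ratio: float) -> float:
    # On a high-DPI screen each CSS pixel covers several hardware pixels.
    return css_px * device_pixel_ratio

print(points_to_css_px(12))           # 16.0
print(css_px_to_device_px(16, 2.0))   # 32.0
print(css_px_to_device_px(16, 3.0))   # 48.0
```

So 12-point text is 16 CSS pixels tall everywhere, but the rasteriser gets 32 or 48 hardware pixels to draw it with on a 2x or 3x display.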

In practice: points for print, pixels for screens, ems for responsive design. An em in CSS is relative to the current font size, so padding: 1em means “pad by one font-size’s worth of space, whatever size we’re using”. This makes layouts scale naturally when the user changes their font size, which is why web designers love ems and their cousin, the rem (root em), which is relative to the root element’s font size rather than the current element’s.
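The em versus rem difference is easiest to see with numbers. A minimal sketch, assuming a 16px root font size (a common browser default); the helper names are hypothetical.

```python
ROOT_FONT_SIZE_PX = 16.0  # assumption: the root element's font size

def em_to_px(em: float, current_font_size_px: float) -> float:
    """em is relative to the font size of the element it is used on."""
    return em * current_font_size_px

def rem_to_px(rem: float) -> float:
    """rem is always relative to the root element's font size."""
    return rem * ROOT_FONT_SIZE_PX

heading_size = em_to_px(2.0, ROOT_FONT_SIZE_PX)  # a heading set to 2em
print(heading_size)                  # 32.0
print(em_to_px(1.0, heading_size))   # padding: 1em on that heading -> 32.0
print(rem_to_px(1.0))                # padding: 1rem anywhere -> 16.0
```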

Character sets and encodings

Now we leave the world of ink and metal and enter the world of computers. And things get… complicated.

When computers first needed to represent text, someone had to decide: which characters do we support, and how do we store them?

ASCII (American Standard Code for Information Interchange), first published as ASA X3.4-1963 and revised several times through 1986 (as documented in Mackenzie’s Coded Character Sets), was one of the earliest answers. It used 7 bits to represent 128 characters: the English alphabet (upper and lower), digits 0-9, punctuation, and a handful of control characters (like “new line” and “tab”). It was simple, elegant, and completely inadequate for anyone who didn’t write in English.

To make this tangible, here’s what the letter “R” looks like as actual bits in ASCII:

Character:  R
Decimal:    82
Hex:        52
Binary:     01010010

Seven bits of information (the leading zero above just pads the value out to a full byte). That’s all it takes. The letter “A” is 01000001 (65), “B” is 01000010 (66), and so on. Uppercase and lowercase letters are exactly 32 apart (“a” is 01100001, 97), which means you can convert between them by flipping a single bit (bit 5, if you’re counting from zero). This wasn’t an accident; the designers of ASCII, led by Robert Bemer at IBM, were very clever about the layout (Bemer wrote about these design decisions himself).
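The single-bit trick is easy to try. This is a small Python sketch; flip_case is a made-up helper, and it deliberately ignores anything that isn’t an ASCII letter.

```python
CASE_BIT = 0b0100000  # bit 5, value 32

def flip_case(ch: str) -> str:
    """Toggle the case of one ASCII letter by flipping bit 5."""
    if not (ch.isascii() and ch.isalpha()):
        return ch  # digits, punctuation and non-ASCII are left alone
    return chr(ord(ch) ^ CASE_BIT)

for ch in "Rr":
    print(ch, format(ord(ch), "07b"), "->", flip_case(ch))
# R 1010010 -> r
# r 1110010 -> R
```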

A character set (or charset) is the complete collection of characters that a system recognises. ASCII’s character set has 128 members. That’s fine for English, but French needs accented characters, German needs ß and umlauts, Greek needs an entirely different alphabet, and that’s before we even get to Chinese, Japanese, Korean, Arabic, Hindi, or the hundreds of other writing systems used by actual humans.

The 1980s and 90s saw a proliferation of extended character sets: ISO 8859-1 for Western European languages, ISO 8859-5 for Cyrillic, Shift JIS for Japanese, Big5 for Traditional Chinese. Each one carved out a different set of 256 (or more) characters. This sort of worked if everyone agreed on which character set they were using, but of course they often didn’t. The result was mojibake: garbled text where characters from one encoding were displayed using another’s mapping. You’ve seen it. Those weird sequences of Ã¤ and â€™ where accented letters and curly quotes should be? That’s mojibake.
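You can manufacture mojibake on purpose, which is a good way to understand it. The sketch below encodes text correctly as UTF-8 (an encoding covered later in this post) and then decodes the bytes with the wrong table, here Windows-1252; the mojibake function is just an illustrative name.

```python
def mojibake(text: str, wrong_encoding: str = "cp1252") -> str:
    """Encode as UTF-8, then decode the same bytes with the wrong mapping."""
    return text.encode("utf-8").decode(wrong_encoding)

print(mojibake("ä"))        # Ã¤
print(mojibake("\u2019"))   # â€™  (U+2019 is the right curly quote)

# The damage is reversible as long as no bytes were lost along the way.
print(mojibake("ä").encode("cp1252").decode("utf-8"))  # ä
```

Nothing is corrupted at the byte level; the bytes are simply being read with the wrong dictionary.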

Unicode: one set to rule them all

Unicode was the attempt to fix this mess, and it’s one of the great technical achievements of the modern era, even if nobody outside of a relatively small group of people appreciates it.

The idea was simple and ambitious: create a single character set that includes every character from every writing system, living or dead, plus mathematical symbols, emoji, musical notation, and anything else humans have ever wanted to write down.

Each character in Unicode gets a unique number called a code point. These are written using a notation you’ll see everywhere: “U+” followed by a hexadecimal number. Hexadecimal (base 16) uses the digits 0-9 and the letters A-F, so each digit represents a value from 0 to 15. It’s used because it maps neatly onto bytes: two hex digits represent exactly one byte. The “U+” prefix just means “Unicode code point”.

So when you see U+0041, that means Unicode code point number 65 (in decimal), which is the letter “A”. U+03B1 is code point 945, the Greek letter alpha (α). U+1F600 is code point 128512, the emoji 😀. The higher the number, the later the character was added to the standard (roughly speaking). The first 128 code points (U+0000 to U+007F) map directly to ASCII, which was a deliberate design choice that made adoption much easier.

As of Unicode 16.0 (September 2024), the standard defines 154,998 characters covering 168 scripts. Every one of them has a code point and an official name. U+0052 is LATIN CAPITAL LETTER R. U+2603 is SNOWMAN (☃). U+1F4A9 is PILE OF POO (💩). The naming is meticulous, sometimes whimsical, and always permanent: once a character is added, it’s never removed.
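If you want to poke at code points yourself, Python’s standard unicodedata module exposes the official names. A minimal sketch (describe is a hypothetical helper; the names available depend on the Unicode version bundled with your Python):

```python
import unicodedata

def describe(ch: str) -> str:
    """Show a character's code point in U+ notation, in decimal, and by name."""
    return f"U+{ord(ch):04X} = {ord(ch)} = {unicodedata.name(ch)}"

for ch in ["A", "α", "R", "☃", "\U0001F600"]:
    print(describe(ch))
# First line printed: U+0041 = 65 = LATIN CAPITAL LETTER A
```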

But a code point is just a number. To actually store and transmit that number in a computer, you need an encoding: a scheme for turning code points into bytes.

UTF-8 is the most common encoding on the web, used by over 98% of all websites as of 2024, and the one you should almost always use. It was designed in September 1992 by Ken Thompson and Rob Pike, famously sketched out on a placemat in a New Jersey diner. It’s clever: ASCII characters (U+0000 to U+007F) are stored as a single byte, identical to their ASCII values, so all existing ASCII text is automatically valid UTF-8. Characters outside ASCII use 2, 3, or 4 bytes as needed. This makes it compact for English text and capable of representing any Unicode character.

To see the difference, let’s look at how a few characters are stored as actual bytes across the different encodings. First, something simple, the letter “R” (U+0052):

Encoding   Bytes (hex)       Bytes (binary)
ASCII      52                01010010
UTF-8      52                01010010
UTF-16     00 52             00000000 01010010
UTF-32     00 00 00 52       00000000 00000000 00000000 01010010

For basic Latin characters, UTF-8 and ASCII are identical: one byte. UTF-16 pads it to two bytes. UTF-32 pads it to four. You can see why UTF-32 is wasteful for English text: three of those four bytes are zeros, carrying no information.

Now something outside ASCII, the pound sign “£” (U+00A3):

Encoding   Bytes (hex)       What's happening
ASCII      --                Can't represent it (not in the character set)
Latin-1    A3                One byte -- works, but only in this specific encoding
UTF-8      C2 A3             Two bytes (the C2 signals "two-byte sequence")
UTF-16     00 A3             Two bytes
UTF-32     00 00 00 A3       Four bytes

And something further afield, the Japanese character “字” (U+5B57, meaning “character”, how fitting):

Encoding   Bytes (hex)       What's happening
ASCII      --                Can't represent it
Latin-1    --                Can't represent it
Shift JIS  8E 9A             Two bytes (Japanese-specific encoding)
UTF-8      E5 AD 97          Three bytes (the E5 signals "three-byte sequence")
UTF-16     5B 57             Two bytes (falls within the Basic Multilingual Plane)
UTF-32     00 00 5B 57       Four bytes

And finally, an emoji, “😀” (U+1F600):

Encoding   Bytes (hex)          What's happening
ASCII      --                   Can't represent it
UTF-8      F0 9F 98 80          Four bytes (the F0 signals "four-byte sequence")
UTF-16     D8 3D DE 00          Four bytes (a surrogate pair -- two 2-byte code units)
UTF-32     00 01 F6 00          Four bytes (same size as everything else in UTF-32)

Notice how UTF-8 scales: 1 byte for ASCII, 2 for European characters, 3 for most of the world’s living languages, and 4 for emoji and rarer scripts. The leading bits of each byte tell the decoder how many bytes to read. It’s an elegant piece of engineering.
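The tables above can be reproduced in a few lines of Python, and the “leading bits” rule is simple enough to write out by hand. This is a sketch, not a full decoder: utf8_sequence_length is a made-up helper that only inspects the first byte and does no validation of what follows.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Read the length of a UTF-8 sequence from the bits of its first byte."""
    if first_byte >> 7 == 0b0:       # 0xxxxxxx: plain ASCII
        return 1
    if first_byte >> 5 == 0b110:     # 110xxxxx: two-byte sequence
        return 2
    if first_byte >> 4 == 0b1110:    # 1110xxxx: three-byte sequence
        return 3
    if first_byte >> 3 == 0b11110:   # 11110xxx: four-byte sequence
        return 4
    raise ValueError("continuation byte or invalid lead byte")

for ch in ["R", "£", "字", "\U0001F600"]:
    for enc in ("utf-8", "utf-16-be", "utf-32-be"):
        print(f"U+{ord(ch):04X} {enc:<9} {ch.encode(enc).hex(' ').upper()}")
    utf8 = ch.encode("utf-8")
    assert utf8_sequence_length(utf8[0]) == len(utf8)
```

The explicit big-endian codec names are used so that Python doesn’t prepend a byte-order mark to the UTF-16 and UTF-32 output.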

UTF-16 uses 2 bytes for characters in the Basic Multilingual Plane (the first 65,536 code points, which covers most living languages) and 4 bytes for everything else. Those 4-byte characters are encoded using pairs of 2-byte values called surrogate pairs: a clever hack that lets UTF-16 reach the full Unicode range while keeping the common case compact. UTF-16 is used internally by Windows, Java, and JavaScript. If you’ve ever been bitten by a JavaScript string reporting the wrong .length for an emoji, that’s because JavaScript counts UTF-16 code units, not characters, and your emoji needed a surrogate pair.
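The surrogate pair arithmetic is short enough to sketch. The function names below are invented for the example; utf16_length approximates what JavaScript’s .length would report by counting 2-byte units in the UTF-16 encoding.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above the BMP into a high and low surrogate."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need surrogates")
    offset = code_point - 0x10000     # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low

def utf16_length(text: str) -> int:
    """Count UTF-16 code units rather than code points."""
    return len(text.encode("utf-16-le")) // 2

high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")                         # D83D DE00
print(len("\U0001F600"), utf16_length("\U0001F600"))   # 1 2
```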

UTF-32 (sometimes called UCS-4) takes the brute-force approach: 4 bytes for every single code point, no exceptions. This makes it simple; the nth code point (counting from zero) is always at byte offset 4n, so random access is trivial. But it’s wasteful. An English text file in UTF-32 is four times the size of the same file in UTF-8, with three zero bytes for every one byte of actual data.

There are also some historical encodings worth knowing about. UCS-2 was an early 2-byte encoding that predates UTF-16; it could only represent the first 65,536 code points and had no surrogate pair mechanism, so it couldn’t handle emoji or many CJK characters. It’s effectively obsolete, but you’ll occasionally encounter it in older systems. UTF-7 was designed for email systems that could only handle ASCII; it encoded Unicode characters using only ASCII-safe bytes. It was slow, complex, and is now deprecated for security reasons (it enabled some nasty injection attacks).

The encoding is not the character set. Unicode is the character set (the list of characters and their code points). UTF-8, UTF-16, and UTF-32 are encodings (ways of turning those code points into bytes). This distinction trips people up constantly, but it matters. You might say “this file is Unicode” when you mean “this file is encoded in UTF-8”. Unicode tells you which characters exist. The encoding tells you how they’re stored as bytes.

Representations: how letters become pixels

So we have characters (abstract ideas), code points (numbers assigned to those ideas), encodings (ways to store those numbers), and typefaces (visual designs). The last piece of the puzzle is: how does a computer actually draw a letter on screen?

There are two main approaches.

Bitmap fonts were the early method. Each glyph was stored as a grid of pixels: literally a tiny picture. This was fast to render but didn’t scale well. A bitmap font designed for 12-point looked terrible at 24-point because you were just scaling up the pixel grid, producing jagged edges.
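To make the scaling problem concrete, here’s a toy bitmap glyph in Python. The 5x7 grid is hand-drawn for this example (it isn’t taken from any real font), and scale does the naive thing: it turns every pixel into a bigger square.

```python
# A hypothetical 5x7 bitmap glyph for "R": 1 = ink, 0 = background.
GLYPH_R = [
    "11110",
    "10001",
    "10001",
    "11110",
    "10100",
    "10010",
    "10001",
]

def scale(bitmap, factor):
    """Nearest-neighbour upscaling: each pixel becomes a factor x factor block."""
    return [
        "".join(px * factor for px in row)
        for row in bitmap
        for _ in range(factor)
    ]

for row in scale(GLYPH_R, 2):
    print(row.replace("1", "#").replace("0", "."))
```

The diagonal leg of the R comes out as a staircase of 2x2 blocks: no new detail appears, just bigger pixels.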

Outline fonts (also called vector fonts) solved this. Instead of storing a grid of pixels, they store the shape of each glyph as a set of mathematical curves: typically Bézier curves, named after Pierre Bézier, the French engineer at Renault who developed them in the 1960s for designing car bodies. (Paul de Casteljau at Citroën independently developed equivalent mathematics around the same time, but Renault published first.) To display the letter, the computer calculates which pixels fall inside the outline and fills them in. This process is called rasterisation, and it’s why outline fonts scale beautifully to any size.
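A Bézier curve is less mysterious than it sounds: you can evaluate one with nothing but repeated linear interpolation, the construction named after de Casteljau. A minimal sketch follows; the three control points are invented for illustration, and a real rasteriser would go on to work out which pixels the resulting outline covers.

```python
def de_casteljau(points, t):
    """Evaluate a Bézier curve at t (0 to 1) by repeated linear interpolation."""
    pts = list(points)
    while len(pts) > 1:
        pts = [
            ((1 - t) * x0 + t * x1, (1 - t) * y0 + t * y1)
            for (x0, y0), (x1, y1) in zip(pts, pts[1:])
        ]
    return pts[0]

# Hypothetical quadratic segment: start point, control point, end point.
segment = [(0.0, 0.0), (50.0, 100.0), (100.0, 0.0)]
print(de_casteljau(segment, 0.0))   # (0.0, 0.0)
print(de_casteljau(segment, 0.5))   # (50.0, 50.0)
print(de_casteljau(segment, 1.0))   # (100.0, 0.0)
```

The same function works for cubic curves: pass four control points instead of three.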

The two dominant outline font formats are TrueType and OpenType. TrueType (files ending in .ttf) was developed by Apple and released in 1991, partly to avoid Adobe’s licensing fees for PostScript Type 1 fonts. OpenType (files ending in .otf or .ttf) was announced jointly by Microsoft and Adobe in 1996. It is TrueType’s successor and adds support for advanced typographic features: ligatures, small caps, stylistic alternates, and more.

Hinting is the process of adjusting how outlines are rasterised at small sizes on low-resolution screens. Without hinting, the mathematical curves of a glyph might fall between pixels, creating blurry or uneven strokes. Hints are instructions embedded in the font that snap the outlines to the pixel grid at small sizes, keeping text crisp. It’s painstaking work, and it’s one of the reasons well-hinted fonts (like the core Microsoft fonts) have historically looked so much better on screen than cheaper alternatives.

Ligatures: when letters merge

A ligature is a single glyph made by combining two or more characters. The most common one in English is “fi”: in many serif typefaces, the dot of the “i” collides with the overhang of the “f”, so designers create a special glyph where the two letters are fused together. Other common ligatures: “fl”, “ff”, “ffi”, “ffl”.

Ligatures started as a practical solution in metal type (it was easier to cast certain letter combinations as a single piece) and survived because they look good. OpenType fonts can contain dozens of ligatures, and modern software can substitute them automatically.

Some typefaces take this further with glyphs that change shape depending on what’s next to them (font nerds call these contextual alternates). This is especially common in script typefaces, where a letter might have a different tail depending on the following letter, mimicking the natural flow of handwriting.

How long is a piece of string (or: what even is a character?)

You’d think counting characters would be simple. You want to allow 600-character comments on your website. How hard can it be? You just… count the characters. Right?

Welcome to one of the most quietly maddening problems in software engineering.

Let’s start with something innocent: the letter “é”. Is that one character? It depends on who you ask. In Unicode, it can be represented two ways. There’s U+00E9, LATIN SMALL LETTER E WITH ACUTE: a single code point, unambiguously one thing. But there’s also the two-code-point sequence U+0065 (LATIN SMALL LETTER E) followed by U+0301 (COMBINING ACUTE ACCENT). These render identically. They mean the same thing. They’re defined as canonically equivalent by the Unicode standard. But one is one code point and the other is two.

So when your user types “café” into your 600-character comment box, how many characters is that? If you count code points, it might be 4 or 5, depending on which representation of “é” their keyboard produced. If you count UTF-8 bytes, it’s 5 or 6. If you count UTF-16 code units (which is what JavaScript’s .length does), it’s 4 or 5 again, though that agreement with the code-point count only holds until a character outside the Basic Multilingual Plane shows up.
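Here are those counts for “café” in Python, for both representations of “é”. The last two lines use the standard unicodedata.normalize function to show that the two strings compare unequal until you normalise them to the same form (NFC here).

```python
import unicodedata

composed = "caf\u00e9"      # é as a single code point
decomposed = "cafe\u0301"   # e followed by COMBINING ACUTE ACCENT

for label, s in (("composed", composed), ("decomposed", decomposed)):
    print(
        label,
        len(s),                           # code points
        len(s.encode("utf-8")),           # UTF-8 bytes
        len(s.encode("utf-16-le")) // 2,  # UTF-16 code units
    )
# composed 4 5 4
# decomposed 5 6 5

print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```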

Now add emoji. The thumbs-up emoji 👍 is one code point: U+1F44D. But 👍🏽 (thumbs up with a medium skin tone) is two code points: U+1F44D followed by U+1F3FD (a skin tone modifier). They render as a single visible symbol. The family emoji 👨‍👩‍👧‍👦 is seven code points stitched together with invisible joiners (U+200D, ZERO WIDTH JOINER): man + joiner + woman + joiner + girl + joiner + boy. One “character” on screen, seven code points, many more bytes.

And flags! The flag emoji 🇬🇧 is two code points: U+1F1EC (REGIONAL INDICATOR SYMBOL LETTER G) followed by U+1F1E7 (REGIONAL INDICATOR SYMBOL LETTER B). The system pairs them up and displays a flag. What happens if you insert a character between them? Now you’ve got two orphaned regional indicators that render as ugly letter boxes. Is this one character? Two?

The Unicode standard defines a concept called grapheme clusters: sequences of code points that together represent a single user-perceived character. This is probably what you mean when you say “character”, and it’s what a well-implemented character counter should count. But getting grapheme cluster segmentation correct requires implementing a nontrivial Unicode algorithm (UAX #29, “Unicode Text Segmentation”). Most programming languages don’t do this by default. Python’s len() counts code points. JavaScript’s .length counts UTF-16 code units. Neither counts what a human would call “characters”.

So your 600-character limit? If you implement it by counting .length in JavaScript, a user could type 300 emoji and hit your limit, because a typical emoji is two UTF-16 code units. Or they could paste in text with combining accents and get 600 “characters” that look like 400. Or they could use a single family emoji and consume 11 of their 600 “characters” on one symbol.

The correct answer is to count grapheme clusters, validate on the server (since the client can always lie), and honestly, to be generous with your limits because this stuff is harder than it has any right to be.
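To give a feel for what counting grapheme clusters involves, here’s a deliberately simplified Python sketch using only the standard library. It is not an implementation of UAX #29: it handles combining marks, variation selectors, skin-tone modifiers, zero-width joiners and flag pairs, and nothing else (no Hangul, no CRLF rule, and so on). For production, reach for a library that implements the real algorithm; the third-party regex module’s \X pattern is one option. All function names here are hypothetical.

```python
import unicodedata

ZWJ = "\u200d"

def is_extender(ch: str) -> bool:
    """Code points that attach to the previous cluster (simplified)."""
    return (
        unicodedata.category(ch).startswith("M")  # combining marks, variation selectors
        or ch == ZWJ
        or "\U0001F3FB" <= ch <= "\U0001F3FF"     # emoji skin-tone modifiers
    )

def is_regional_indicator(ch: str) -> bool:
    return "\U0001F1E6" <= ch <= "\U0001F1FF"

def approx_grapheme_count(text: str) -> int:
    count = 0
    prev = ""
    ri_run = 0  # regional indicators seen in the current unbroken run
    for ch in text:
        if is_regional_indicator(ch):
            ri_run += 1
            if ri_run % 2 == 1:
                count += 1  # an odd one opens a flag; the next completes it
        else:
            ri_run = 0
            if not (count and (is_extender(ch) or prev == ZWJ)):
                count += 1
        prev = ch
    return count

family = "\U0001F468\u200d\U0001F469\u200d\U0001F467\u200d\U0001F466"
print(len(family), len(family.encode("utf-16-le")) // 2, approx_grapheme_count(family))
# 7 11 1
print(approx_grapheme_count("cafe\u0301"))             # 4
print(approx_grapheme_count("\U0001F1EC\U0001F1E7"))   # 1
```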

There’s an old joke among internationalisation engineers: “How many characters are in this string?” “It depends on what you mean by ‘character’.” It’s not really a joke; it’s more of a warning.

When letters lie: homoglyphs and Punycode

Unicode’s ambition, including every character from every writing system, introduced a problem that no one at the printing press ever had to worry about: characters from different scripts that look identical.

The Latin letter “a” (U+0061) and the Cyrillic letter “а” (U+0430) are visually indistinguishable in most typefaces. The same goes for Latin “o” and Cyrillic “о”, Latin “p” and Cyrillic “р”, Latin “e” and Cyrillic “е”. These are called homoglyphs: different characters that produce identical (or nearly identical) glyphs.

This is a problem because domain names can contain non-ASCII characters. (The DNS post in this series covers the DNS side of this story.) The system that makes this work is called Internationalised Domain Names (IDN), and under the hood it uses an encoding called Punycode to convert Unicode domain names into ASCII-safe strings that DNS can handle. The domain “münchen.de” becomes “xn--mnchen-3ya.de” in Punycode. The “xn--” prefix tells the system it’s an encoded internationalised domain.
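Python ships codecs for both steps, which makes this easy to try. One caveat: the built-in idna codec follows the older IDNA 2003 rules, so treat this as a sketch of the idea rather than what a current browser does. The to_ascii helper is a made-up name.

```python
def to_ascii(domain: str) -> str:
    """Convert a Unicode domain name to its ASCII (xn--) form."""
    return domain.encode("idna").decode("ascii")

print("münchen".encode("punycode"))           # b'mnchen-3ya'
print(to_ascii("münchen.de"))                 # xn--mnchen-3ya.de
print(b"xn--mnchen-3ya.de".decode("idna"))    # münchen.de
```

The raw punycode codec handles a single label; the idna codec splits on dots, adds the prefix and leaves plain ASCII labels like “de” untouched.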

The security implications are nasty. An attacker can register a domain like “аpple.com” where the first “а” is Cyrillic, not Latin. To the naked eye, this looks exactly like “apple.com”. The underlying Punycode is completely different (“xn--pple-43d.com”), but browsers display the pretty Unicode version. This is called an IDN homograph attack, first described by Evgeniy Gabrilovich and Alex Gontmakher in a 2002 paper, and it has been used for real-world phishing.

Browsers have defences. Most will display the Punycode version instead of the Unicode version if the domain mixes scripts suspiciously (if some characters are Latin and others are Cyrillic, for instance). Chrome, Firefox, and Safari each have slightly different rules for when to show the Punycode, and these rules have been refined over years of cat-and-mouse with attackers. But the fundamental problem remains: Unicode gives us more than 154,000 characters, many of which look alike, and any system that displays them needs to decide how much to trust what it’s showing you.

It’s not just URLs. Homoglyphs can appear in code too. A variable name that looks like password but uses a Cyrillic “а” is a different identifier entirely. Malicious pull requests have used this trick to sneak backdoors past code review. Some code editors now flag mixed-script identifiers, and Unicode itself defines a set of security mechanisms (documented in Unicode Technical Report #36, “Unicode Security Considerations”) for detecting confusable characters.
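A crude mixed-script check is only a few lines of Python. Be warned that this is a heuristic, not the detection described in the Unicode security documents: the standard library has no script property, so the sketch guesses the script from the first word of each letter’s official name. The function names are hypothetical.

```python
import unicodedata

def scripts_used(text: str) -> set[str]:
    """Rough script guess: the first word of each letter's Unicode name."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()
    }

def looks_mixed_script(text: str) -> bool:
    return len(scripts_used(text)) > 1

genuine = "password"
spoofed = "p\u0430ssword"   # second letter is CYRILLIC SMALL LETTER A

print(genuine == spoofed)            # False
print(sorted(scripts_used(spoofed))) # ['CYRILLIC', 'LATIN']
print(looks_mixed_script(genuine))   # False
print(looks_mixed_script(spoofed))   # True
```

This catches the lazy attacks. It won’t catch a name written entirely in look-alike letters from a single script, which is why real defences lean on confusable-character data rather than script mixing alone.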

Gutenberg’s compositor never had this problem. Every letter in his type case was unambiguous; you could pick it up and feel its shape. In the digital world, two characters can be byte-for-byte different but pixel-for-pixel identical. The typeface doesn’t lie; the character set does.

What happens when you press a key

Let’s make all of this concrete. You’re sitting at your computer and you press the letter “R”. What actually happens, and how did we get here?

In the scribe’s version (500 AD), a monk in a scriptorium dips a quill in iron gall ink. He looks at the exemplar (the book he’s copying from) and draws an R. His hand shapes the stroke, the bowl, the leg. The letter exists because his muscles moved in a practised pattern. The “input device” is his hand; the “rendering engine” is also his hand. The glyph is one-of-a-kind.

In the printer’s version (1500 AD), a compositor stands at a type case. He reaches into the compartment labelled R, picks up a small metal block (reversed, so it’ll print the right way round) and slots it into the composing stick alongside the other letters. Later, the assembled type is locked into a frame, inked with a leather ball, and pressed onto dampened paper. The letter R is now reproducible. The same block can print the same R a thousand times. The “input” is the compositor’s hand selecting the correct piece of type; the “rendering” is the press.

In the typist’s version (1900 AD), a typist sits at a typewriter and strikes the R key. A mechanical linkage swings a type bar upward. On the end of the bar is a small metal slug with a reversed R on its face. It hits an inked ribbon, which presses against paper, leaving the shape of the letter. One keystroke, one character, one glyph. The “encoding” is purely mechanical: each key is physically connected to exactly one letterform. (This is also where monospaced type became the norm: every character had to occupy the same width so the carriage could advance by a fixed amount after each keystroke.)

In the early computer’s version (1980 AD), you press R on the keyboard of an IBM PC. The keyboard controller sends a scan code: a number identifying which physical key was pressed (not which character it represents; that comes later). The operating system’s keyboard driver translates the scan code into a character code. On this machine, that means ASCII: the letter R is stored as the number 82 (binary 01010010). The application receives this number, looks it up in a bitmap font (a grid of pixels for each character) and copies those pixels into video memory. The screen redraws. An R appears. The letter is now a number that becomes a picture.
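To make "a number that becomes a picture" concrete, here is a minimal sketch of the lookup. The 8×8 pattern below is made up for illustration (it is not the actual IBM PC ROM font), and `GLYPHS` and `render` are hypothetical names.

```python
# Hypothetical 8x8 bitmap font: one byte per row, most significant bit = leftmost pixel.
# The pattern is illustrative only, not the real IBM PC character ROM.
GLYPHS = {
    82: [  # ASCII code for "R"
        0b11111100,
        0b01100110,
        0b01100110,
        0b01111100,
        0b01101100,
        0b01100110,
        0b11100110,
        0b00000000,
    ],
}


def render(code: int) -> list[str]:
    """Turn a character code into rows of 'pixels' ('#' = lit, '.' = dark)."""
    return [
        "".join("#" if row & (0x80 >> col) else "." for col in range(8))
        for row in GLYPHS[code]
    ]


code = ord("R")                   # the character code the driver hands over
print(code, format(code, "08b"))  # 82 01010010
for line in render(code):
    print(line)
```

A real machine would copy those bits into video memory rather than print them, but the shape of the operation is the same: number in, grid of pixels out.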

In the modern version (today), you press R. The keyboard sends a scan code (via USB or Bluetooth). The operating system’s input system translates it, through the keyboard layout (QWERTY? AZERTY? Dvorak?), into a Unicode code point: U+0052, LATIN CAPITAL LETTER R. This code point might be stored in memory as UTF-8 (the single byte 0x52, since R falls within ASCII’s range), or as UTF-16 (the two bytes 0x00 0x52), depending on the application.
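The code point and its encodings are easy to inspect from Python's standard library. This is only a sketch of the idea; note that the two-byte form 0x00 0x52 quoted above is UTF-16 in big-endian order, and little-endian UTF-16 swaps the bytes.

```python
import unicodedata

ch = "R"
code_point = ord(ch)

print(f"U+{code_point:04X}", unicodedata.name(ch))  # U+0052 LATIN CAPITAL LETTER R

utf8 = ch.encode("utf-8")          # one byte, identical to ASCII
utf16_be = ch.encode("utf-16-be")  # two bytes, high byte first
utf16_le = ch.encode("utf-16-le")  # two bytes, low byte first

print(utf8.hex(), utf16_be.hex(), utf16_le.hex())  # 52 0052 5200

# Outside ASCII's range, UTF-8 needs more than one byte per character.
print("é".encode("utf-8").hex())  # c3a9
```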

Now the text renderer takes over. It looks up the current font (say, a .otf OpenType file for the typeface Inter). Inside that file, it finds the glyph for U+0052: a set of Bézier curves describing the outline of the letter R in this particular design. The renderer checks the kerning table to see if R needs to be nudged closer to or further from the characters on either side. It checks for ligatures: does this R combine with the next character into a special glyph? (Probably not for R, but the system checks every time.) It applies hinting to snap the curves to the pixel grid at the current size. It rasterises the outline, filling in pixels that fall inside the curves, with subpixel rendering to smooth the edges: each pixel on your LCD is actually three tiny coloured stripes (red, green, blue), and the renderer exploits this to position edges with sub-pixel precision. The result is painted into the application’s window buffer, which is composited with other windows by the operating system and sent to the display.
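The outline step can be sketched in a few lines. Below, a cubic Bézier segment is evaluated by repeated linear interpolation (de Casteljau's construction), and a point in font units is scaled to pixels. The control points and the 1000-units-per-em figure are assumptions for illustration, not values read from Inter or any real font file.

```python
Point = tuple[float, float]


def lerp(a: Point, b: Point, t: float) -> Point:
    """Linear interpolation between two points."""
    return (a[0] + (b[0] - a[0]) * t, a[1] + (b[1] - a[1]) * t)


def cubic_bezier(p0: Point, p1: Point, p2: Point, p3: Point, t: float) -> Point:
    """Point on a cubic Bezier curve at parameter t, via de Casteljau."""
    a, b, c = lerp(p0, p1, t), lerp(p1, p2, t), lerp(p2, p3, t)
    d, e = lerp(a, b, t), lerp(b, c, t)
    return lerp(d, e, t)


def to_pixels(pt: Point, units_per_em: int, pixels_per_em: int) -> Point:
    """Scale a point from font design units to device pixels."""
    return (pt[0] * pixels_per_em / units_per_em,
            pt[1] * pixels_per_em / units_per_em)


# Hypothetical curve segment in font units (assumed 1000 units per em).
segment = ((0.0, 0.0), (0.0, 1000.0), (1000.0, 1000.0), (1000.0, 0.0))
midpoint = cubic_bezier(*segment, 0.5)
print(midpoint)                       # (500.0, 750.0)
print(to_pixels(midpoint, 1000, 16))  # (8.0, 12.0)
```

A real rasteriser goes further: it works out which pixels fall inside the closed outline and how much of each edge pixel is covered. Evaluating and scaling the curve is the first move.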

All of that (scan code to keyboard driver to Unicode code point to glyph lookup to Bézier curves to kerning adjustment to hinting to rasterisation to subpixel rendering to composited display) happens in milliseconds. You press R, and R appears. It feels instant because, on any human scale, it is.

Across the ages, the same act has run on very different machinery. A monk spent minutes per letter. A compositor spent seconds selecting type. A typist connected key to page in a single mechanical stroke. A modern computer does it in milliseconds, but the pipeline is deeper: physical key → scan code → character code → Unicode code point → encoding → glyph lookup → outline scaling → kerning → hinting → rasterisation → pixel buffer → display.

More steps than ever before. Each one invisible. Each one built on something a monk, a compositor, or a typist once did by hand.

In the LLM’s version (also today), you ask an AI to write a paragraph. Somewhere in a data centre, billions of numerical weights are multiplied together across dozens of layers of a neural network. The model predicts the most likely next token: not quite a character, not quite a word, but a chunk of text from a vocabulary of tens of thousands of pieces. It picks the token for “R”. This token is decoded back into the bytes of the UTF-8 character R (0x52), which is sent over HTTPS to your browser, where it enters the exact same rendering pipeline as before: Unicode code point → glyph lookup → Bézier curves → rasterisation → screen. The R appears. The entire history of typography (scribe, compositor, typist, keyboard, font renderer) is still there, running the last mile. The only difference is who asked for the letter. It used to be a human pressing a key. Now, sometimes, it’s a machine guessing what comes next. (Though arguably the monk was also guessing what came next. He was just copying more carefully.)

Now print it

Everything above gets a letter onto a screen. But what if you want it on paper? What if you hit Ctrl+P and expect a piece of dead tree to come out of a machine with your words on it?

This is where things get properly unhinged. Because a printer is not a screen. A screen has pixels that glow. A printer has to physically deposit material onto a surface. And the chain of events between “the user clicked Print” and “ink is on paper” is one of the most gloriously over-engineered pipelines in all of computing.

The problem is one of translation. Your computer knows what the document looks like; it’s been rendering it on screen just fine. But the printer is a separate device with its own processor, its own memory, and its own ideas about how to put dots on paper. Somehow, the computer has to describe the page in a way the printer can understand and reproduce.

In the early days, this was brutally simple. Character printers (like daisy-wheel and dot-matrix printers) worked much like typewriters. The computer sent ASCII characters down a cable, and the printer had its own built-in font: literally a physical wheel with letter shapes on it, or a set of pin patterns for each character. You got whatever the printer gave you. Want a different typeface? Buy a different daisy wheel. Want graphics? Good luck.

PostScript changed everything.

In 1984, Adobe released PostScript: a full programming language designed for describing pages. Not characters. Not lines of text. Pages. PostScript could describe any combination of text, graphics, and images as mathematical instructions. A PostScript file doesn’t say “print an R at position 40, 100”. It says “move to coordinates (40, 100), select the font Palatino-Roman at 12 points, scale the coordinate system, define a path using these Bézier curves, and fill it”. Sound familiar? Those are the same Bézier curves we met in outline fonts. This wasn’t a coincidence; Adobe co-founder John Warnock developed PostScript and the Type 1 font format together (as Warnock recounted in his Computer History Museum oral history).
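For a feel of what such a page description looks like, here is a small Python sketch that writes out a one-line PostScript program. The helper names `ps_escape` and `page_description` are made up for this example; the PostScript operators it emits (`findfont`, `scalefont`, `setfont`, `moveto`, `show`, `showpage`) are standard. The `show` operator is shorthand for the longer story above: the interpreter fetches the glyph's outline from the selected font and fills it at the current point.

```python
def ps_escape(text: str) -> str:
    """Escape the characters that are special inside a PostScript (string)."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def page_description(text: str, x: int, y: int,
                     font: str = "Palatino-Roman", size: int = 12) -> str:
    """Build a minimal PostScript program that draws `text` at (x, y)."""
    return "\n".join([
        "%!PS",                                        # marks the file as PostScript
        f"/{font} findfont {size} scalefont setfont",  # select and scale the font
        f"{x} {y} moveto",                             # coordinates are in points
        f"({ps_escape(text)}) show",                   # paint the glyph outlines
        "showpage",                                    # emit the finished page
    ])


print(page_description("R", 40, 100))
```

The output is plain text, and it is a program: the printer (or an interpreter such as Ghostscript) has to run it to find out what the page looks like.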

PostScript was revolutionary because it made the printer resolution-independent. The same PostScript file could print on a 300-DPI office laser printer and a 2,400-DPI professional typesetter, and each device would rasterise the curves at its native resolution. The page description was abstract; the physical rendering was the printer’s job.
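The arithmetic behind resolution independence is short. A PostScript point is 1/72 of an inch, so the same 12-point letter covers a different number of device dots on each machine. The function name here is just for illustration.

```python
POINTS_PER_INCH = 72  # PostScript's definition of the point


def device_dots(size_pt: float, dpi: int) -> float:
    """How many device dots a length in points spans at a given resolution."""
    return size_pt * dpi / POINTS_PER_INCH


for dpi in (300, 2400):
    print(f"12 pt at {dpi} DPI = {device_dots(12, dpi):.0f} dots")
# 12 pt at 300 DPI = 50 dots
# 12 pt at 2400 DPI = 400 dots
```

Same file, same curves, eight times as many dots to work with on the typesetter.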

The downside? PostScript is a Turing-complete programming language. Your printer is literally executing code to figure out what to print. Early PostScript printers needed powerful processors and lots of RAM; they were computers in their own right, and often more expensive than the computer sending them data. Some complex pages could take minutes to process. Occasionally, a malformed PostScript file could crash the printer or send it into an infinite loop, because that’s what happens when your printer runs arbitrary programs.

PDF (Portable Document Format), also from Adobe, is PostScript’s better-behaved descendant. It dropped the full programming language in favour of a more structured, predictable format, while keeping the same fundamental model: vector graphics, Bézier curves, embedded fonts, resolution independence. When you “print to PDF” today, your computer is generating a page description in this format.

Printer drivers are the translators that sit between your operating system and your specific printer. When you click Print, here’s what actually happens:

The application hands the document to the operating system’s printing subsystem. On Windows, this is the GDI (Graphics Device Interface) or the newer XPS (XML Paper Specification) pipeline. On macOS, it’s Quartz, which internally uses PDF as its native page description format; every Mac has been thinking in PDF since OS X launched in 2001. On Linux, it’s typically CUPS (Common Unix Printing System, originally created by Michael Sweet at Easy Software Products in 1997 and later acquired by Apple in 2007).

The printing subsystem renders the page into an intermediate format: a spool file. The printer driver then translates this spool file into whatever language the specific printer speaks. This might be PostScript, or it might be PCL (Printer Command Language, HP’s long-running alternative), or for many modern consumer printers it might be a proprietary raster format where the computer does all the rendering and just sends the printer a bitmap of dots to lay down.

Now for the physics.

A laser printer works by exploiting static electricity and heat: a process called electrophotography, invented by Chester Carlson in 1938 and first commercialised by Xerox in 1959. Xerox’s first commercial laser printer, the 9700, followed in 1977. A photosensitive drum (a cylinder coated in a material, typically organic photoconductor, or OPC, that conducts electricity when exposed to light) sits at the heart of the machine. Here’s the sequence:

The drum is given a uniform negative electrical charge by a corona wire or charge roller.
A laser (or an array of LEDs) scans across the drum, switching on and off thousands of times per second. Where the light hits, the charge dissipates. The laser is drawing the page as a pattern of charged and uncharged areas on the drum’s surface, one row of dots at a time, as the drum rotates.
The drum passes a reservoir of toner: a fine powder of plastic particles mixed with pigment. Toner is attracted to the uncharged areas (where the laser hit) and repelled from the charged areas. The powder sticks to the drum in exactly the pattern of your text and images.
A sheet of paper is fed past the drum. The paper has been given a positive charge, which is stronger than the drum’s remaining negative charge, so the toner transfers from drum to paper.
The paper passes through a fuser: a pair of heated rollers at around 150-200°C (300-390°F). The heat melts the plastic in the toner, bonding it permanently to the paper fibres.

That’s it. Your letter R is now toner (melted plastic) fused into paper. The Bézier curves that started as mathematical abstractions in a font file have become a physical pattern of charged and uncharged spots on a rotating drum, which attracted specks of plastic dust, which were melted onto a sheet of ground-up wood pulp. It’s absurd when you think about it.

An inkjet printer takes a different approach but is no less wild. Tiny nozzles, sometimes thousands of them, each thinner than a human hair, fire microscopic droplets of liquid ink onto paper. There are two main technologies: thermal inkjet (used by HP and Canon), in which a tiny resistor heats the ink to around 300°C in microseconds, forming a steam bubble that ejects a droplet, and piezoelectric inkjet (used by Epson), which uses a piezoelectric crystal that physically deforms when electricity is applied, squeezing the ink out mechanically. Each droplet is about 1-5 picolitres (a picolitre is a trillionth of a litre, Castrejon-Pita et al., 2013). The precision required is staggering: the nozzles must fire at exactly the correct microsecond as the print head sweeps across the page, placing dots at up to 5,760 DPI.

For colour printing, things multiply. A colour laser printer has four separate drums and four toner cartridges (cyan, magenta, yellow, and black, or CMYK). Each colour is laid down as its own layer, and the four layers combine to produce the printer’s full range of colours. Getting the four colours to align perfectly (registration) is one of the hardest mechanical challenges, and even tiny misalignment shows up as colour fringing on text. A colour inkjet fires four (or more, with some having six or eight) colours from separate nozzle arrays in a single pass.

And then there’s the font question. Does the printer use the same fonts as the computer? Sometimes yes, sometimes no. PostScript printers traditionally had a set of built-in fonts (the “PostScript 35”, including Helvetica, Times, Courier, and others) stored in the printer’s own ROM. If your document used one of these, the computer just sent the font name and the printer rendered it locally. If your document used a font the printer didn’t have, the driver had to either embed the font data in the print job (increasing its size) or substitute a similar built-in font (changing how your document looked).

Modern printers mostly receive pre-rasterised data; the computer does the heavy lifting and sends the printer a bitmap. This avoids font substitution problems entirely but means the computer is doing more work and the print data is larger. It’s the same trade-off as always: do you send instructions or pixels? PostScript said instructions. Modern consumer printing says pixels. The professional print industry still says instructions (PDF), because when you’re printing a million copies of a magazine, you need the precision.

The full pipeline, from keypress to paper: You type an R. It becomes a Unicode code point (U+0052). The application looks up the glyph in the font. It renders the page layout: text, kerning, leading, line breaks, all of it. You hit Print. The OS printing subsystem takes the rendered page and converts it to a page description (PostScript, PDF, XPS, or a raw bitmap). The printer driver translates this for your specific printer. The data travels over USB, Wi-Fi, or Ethernet to the printer. The printer’s controller processes the data and drives the marking engine: laser and drum, or inkjet nozzles. Toner is melted or ink is squirted. The paper emerges.

From an abstract idea in Unicode, through mathematical curves, through a page description language, through a driver, through a cable, through trapped lightning in a thinking chip, to a laser drawing on a charged drum, to plastic powder melted by heat onto pressed wood fibre. Gutenberg would be absolutely baffled. But he’d recognise the letter.

Why this matters

Typography sits at the intersection of art, engineering, and language. It’s been refined over nearly 600 years of printing and thousands of years of writing before that. Every choice (serif or sans-serif, tight tracking or loose, generous leading or cramped) changes how text feels, and therefore changes how ideas land.

When typography is good, it’s invisible. The words just flow. You’re not thinking about letterforms or kerning or encodings; you’re thinking about what the text says. That invisibility is hard-won. Behind every comfortable paragraph is centuries of craft: stonecutters and scribes, punchcutters and typesetters, designers and engineers, all working towards the same goal.

Making the words feel effortless.

The Clock Inside You

2026-05-21T06:00:00+08:00

The previous posts in this series covered the human history of the hour, the calendar, the physics of the second, how relativity bends it, whether it exists at all, and whether you can go backwards. All of that was about time out there, in clocks, in spacetime, in the equations. This post is about the clock you can’t put down: the one inside you.

Your body doesn’t run on UTC

You can’t switch off the body clock. You can ignore it. You can override it with coffee, bright screens, shift work, or a midnight flight to Frankfurt. The body clock does not care. It keeps running at roughly its own pace, drifts slightly out of sync with the sun, and resets itself, not from your watch, but from the light hitting your retina.

This is the biological machinery that runs every living thing on Earth. Mice have it. Fruit flies have it. So do cyanobacteria, which were doing circadian biology a billion years before anyone invented a sundial. Whatever timekeeping humans eventually engineered with caesium atoms and geoid corrections, evolution got there first. It just picked a different substrate.

The suprachiasmatic nucleus

Somewhere behind your eyes, just above where the optic nerves from your left and right retinas cross, sits a pair of tiny nuclei about the size of a grain of rice. That’s the suprachiasmatic nucleus, the SCN, and it contains roughly 20,000 neurons. It is the master clock of your body.

The SCN is astonishing. Isolate it from the rest of the brain, keep its neurons alive in a dish with the correct nutrients, and it keeps oscillating: individual cells tick in rough synchrony, with a period close to 24 hours, for weeks. No external input. No light cues. Just a biochemical feedback loop inside each cell, coupled loosely to its neighbours, running on its own rhythm.

The period is not exactly 24 hours. Czeisler et al. demonstrated the near-24-hour intrinsic period in a landmark 1999 study in Science, putting volunteers in carefully controlled environments with no time cues and measuring their internal rhythms. The average was about 24.18 hours, slightly longer than a solar day. Almost all healthy humans cluster close to that value. A small number run shorter than 24 hours.

That slight mismatch matters. Because your intrinsic period isn’t 24 hours, the clock would drift out of phase with the planet if it ran uncorrected. A small daily adjustment keeps it aligned. The adjustment comes from light.
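The size of that daily adjustment falls straight out of the 24.18-hour figure. A quick sketch, assuming the clock runs completely uncorrected (no light cues at all); the function names are illustrative.

```python
SOLAR_DAY_HOURS = 24.0


def daily_drift_minutes(period_hours: float = 24.18) -> float:
    """Minutes by which an uncorrected clock falls behind the sun each day."""
    return (period_hours - SOLAR_DAY_HOURS) * 60.0


def cumulative_drift_minutes(days: int, period_hours: float = 24.18) -> float:
    """Total drift after a number of days with no correction."""
    return days * daily_drift_minutes(period_hours)


print(f"{daily_drift_minutes():.1f} min/day")             # about 10.8
print(f"{cumulative_drift_minutes(7):.1f} min per week")  # about 75.6
```

Roughly eleven minutes a day sounds like nothing, but left alone it adds up to more than an hour a week.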

How light resets you

The SCN gets its light signal from a special class of cells in the retina called intrinsically photosensitive retinal ganglion cells (ipRGCs). These aren’t the rods and cones you see with. They’re a separate system, tuned to detect overall light level, particularly the blue end of the visible spectrum, and report it to the SCN via a dedicated neural pathway called the retinohypothalamic tract.

When those cells fire, the SCN adjusts its internal clock. Bright light in the morning nudges the clock earlier. Bright light in the evening nudges it later. The direction depends on when in your current cycle the light lands.

The practical consequences are everywhere. If you’re staring at a screen at midnight, your ipRGCs are reporting “high blue light” to the SCN, which reads that as an argument for “still daytime,” which delays the clock. The clock drifts later. You go to bed later. The next morning you have to drag yourself up before the clock says it’s morning. Repeat this for a working week and you have given yourself a mild, self-imposed form of jet lag without leaving the house.

Jet lag

Jet lag is what happens when you cross time zones faster than your body can adjust. The SCN resets at roughly one hour per day, so a five-hour time change takes roughly five days to shake off. A ten-hour change takes about ten. That’s why you can feel perfectly fine by day three of a short hop, and still brain-fogged by day seven of a long one.
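As a back-of-envelope calculator, the rule of thumb looks like this. It is a sketch only: it assumes the one-hour-per-day rate quoted above, uses standard-time UTC offsets (Perth at +8, Frankfurt at +1), and makes no allowance for direction of travel. The function names are invented for the example.

```python
def shift_hours(origin_utc_offset: int, dest_utc_offset: int) -> int:
    """Signed clock shift in hours: positive = eastward, negative = westward.

    Wraps around the date line so the result lies in the range [-12, 12).
    """
    return (dest_utc_offset - origin_utc_offset + 12) % 24 - 12


def days_to_adjust(shift: int, hours_per_day: float = 1.0) -> float:
    """Rule-of-thumb recovery time at roughly one hour of adjustment per day."""
    return abs(shift) / hours_per_day


def direction(shift: int) -> str:
    return "east" if shift > 0 else "west" if shift < 0 else "none"


outbound = shift_hours(8, 1)  # Perth -> Frankfurt
print(outbound, direction(outbound), days_to_adjust(outbound))  # -7 west 7.0
```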

Eastward travel is generally worse than westward. This is because your intrinsic period is slightly longer than 24 hours, and shortening the day is harder than extending it. Flying east forces your clock to advance, to squeeze 24 hours of biology into, say, 20 hours of wall time. Flying west lets your clock simply extend, which it already wants to do. Living in Perth, I feel this every time I fly to Europe. The outbound is westward, and my clock gets to drift out to match the longer day; I feel human again by the third morning. The return is the brutal one, eight or nine time zones east, staring at the ceiling at 2 AM local time for the better part of a week.

The fix is light, mostly. Morning sunlight at the destination, avoiding bright light in the destination’s evening, and, if you’re feeling technical, using light carefully before you fly to pre-adapt. Melatonin at the destination’s bedtime can help, but the headline intervention is exposure to the right light at the right time. The eyes know.

Shift work and the IARC

Chronic circadian disruption is an occupational hazard for long-haul flight crews, night-shift workers, and anyone whose work schedule repeatedly drags them across their own body clock’s boundaries. Studies have linked it to increased rates of cardiovascular disease, metabolic disorders, and several cancers.

In 2007, the International Agency for Research on Cancer (IARC) classified shift work involving circadian disruption as Group 2A: probably carcinogenic to humans. That’s the same classification as red meat, and one step below “known carcinogen.” The evidence isn’t airtight, but it’s strong enough that IARC was willing to put it in writing. The body’s clock is not a metaphor. It’s a biological mechanism, and forcing it out of sync repeatedly has measurable health consequences.

We built a civilisation on the assumption that humans can work any hours, as long as someone is willing to pay for them. The biology quietly disagrees. A nurse working rotating night shifts isn’t just tired in a local, sleep-deficit sense; they’re operating a system that evolved for a world where you did most of what you did during daylight. Ignoring the clock has a bill, and the bill is paid in health outcomes decades down the line.

Sleep pressure and adenosine

There is a second clock in your body that interacts with the first, and it’s worth keeping them straight.

The circadian clock, the SCN, tells you what time of day it is. It doesn’t care how long you’ve been awake. Even if you stay up all night, your SCN will still say “morning” when morning comes.

The sleep drive, often called sleep pressure, tells you how long you’ve been awake. It builds up the longer you’re conscious, and drops while you sleep. The chemistry behind it is largely a molecule called adenosine, which accumulates in the brain during wakefulness as a byproduct of neural activity. High adenosine means high sleep pressure: your head gets heavy, focus goes, and the couch starts looking like a strategic asset.

Caffeine is an adenosine receptor antagonist. It doesn’t remove the adenosine, it just blocks the receptors that let your brain notice how much has built up. The pressure is still there. You’re borrowing alertness against it. When the caffeine wears off, the full accumulated adenosine load lands on the receptors at once, which is part of why the crash can be steeper than the coffee was worth.

The two systems normally cooperate. Circadian drive pushes you to be alert during biological daytime. Sleep drive pushes you toward sleep when you’ve been awake too long. Night-shift workers are fighting both at once. During the shift, sleep pressure climbs while the circadian clock insists it’s night. When they finally get to bed, their sleep drive is high because they’ve been up all night, but their circadian alerting signal is rising because their body thinks it’s morning. That combination is brutal, and it’s one of the reasons shift work is so hard on the body.

Larks, owls, and the chronotype spectrum

Not everyone’s SCN runs at the same phase. Some people’s clocks run earlier than the population average; others run later. The technical term is chronotype, and it’s real.

Larks, morning types, feel sharpest in the early hours and fade in the evening. Their SCN is phase-advanced relative to the average. Extreme larks are up at 5 AM with the birds and exhausted by 9 PM.

Owls, evening types, peak late and struggle with early starts. Their SCN is phase-delayed. Extreme owls are at their best after midnight and miserable before 10 AM.

Most people sit somewhere on a continuum between the two. Chronotype is partly genetic (several genes, including PER3, have been implicated) and partly age-related. Teenagers are statistically more owl-like; older adults drift lark-ward. The stereotype of the teenager who can’t be roused before 10 AM isn’t laziness. Their circadian phase is genuinely shifted later during adolescence, for developmental reasons that aren’t fully understood.

The trouble is that society is calibrated for average-to-lark chronotypes. School starts at 8 AM. Offices open at 9. An extreme owl trying to hold down a 9-to-5 job is being asked, every working day, to be awake and productive at a time their body is biologically still asleep. The polite term is social jet lag. The practical effect is that owls are chronically mildly sleep-deprived for their entire working lives, and it shows up in the health data.

The body clock and ageing

Circadian rhythm weakens with age. The SCN’s output becomes less reliable; the light-sensitive cells in the retina decline; older adults often report waking earlier, sleeping less deeply, and feeling “off” if their schedule shifts. The internal clock is still there, but the signal it sends the rest of the body is quieter.

There’s a feedback with cognition. Sleep disruption in older adults is associated with memory problems, and some researchers suspect that weakening circadian control contributes to neurodegenerative conditions, not as a sole cause, but as one of the stressors that piles up over decades. The relationship runs both ways: Alzheimer’s disease, for instance, damages the SCN directly, which further disrupts sleep, which in turn makes cognitive symptoms worse.

The practical implications are straightforward, even if they’re hard to put into practice. Bright light in the morning. Dim light in the evening. Consistent sleep and wake times. Outdoor time in actual daylight, which is orders of magnitude brighter than any indoor lighting and gives the SCN a much stronger signal to lock onto. None of this is glamorous. All of it works.

The clock that doesn’t care about you

The other clocks in this series are human inventions. Sundials, mechanical escapements, caesium fountains, GPS constellations: all of them are things we built, using machinery we understand, to answer questions we formulated. The body clock was here first. It runs on biochemistry you didn’t choose, it resets itself from signals you can’t see directly, and it has opinions about when you should be awake whether you consult it or not.

You can fight it. Plenty of people do. But the fight has a cost, and the cost compounds. If the physics of time is impressive, the biology is humbling. The most accurate atomic clock in the world has been running for a few decades. The machinery in your suprachiasmatic nucleus has been keeping time, in one organism or another, for a billion years. It is not going to lose an argument with your calendar.

There’s one more clock to look at, and it’s the trickiest of the lot: the one in your head that tells you how long something felt. It has nothing to do with the SCN. It runs on attention, memory, and dopamine. And it’s wrong almost all the time.

Why Does Thursday Last Forever? is next: the neuroscience of why time drags, vanishes, and accelerates as you age.

Choosing Between Chains, Retrieval, and Agents for a GenAI Assistant

2026-05-20T06:00:00+08:00

The situation

An internal-operations team has commissioned a GenAI assistant. The requirements, ordered by when they landed:

v1: Policy Q&A. Engineers ask “what’s our data-retention policy for customer chat logs?”; the assistant answers from the internal policy wiki. One-shot question-answering, grounded in documents.
v2: Customer record lookup. Support agents ask “what subscription tier is customer ID 4711 on, and when did they last log in?”; the assistant calls an internal API and returns the answer in natural language. The data isn’t in any document; it’s in a database.
v3: Email drafting. After looking up a customer, draft a personalised apology-plus-next-steps email for the agent to review. Combines retrieved facts with generated text.
v4: Ticket filing. “Please file a P2 Jira ticket against team payments describing the issue above, with the customer context attached.” The assistant takes an action in an external system based on what was just discussed.

The team has a Bedrock account, access to Claude Sonnet and Nova Pro, an internal REST API for the customer-record lookup, a Jira API, and the policy wiki mirrored to an S3 bucket. What’s unclear is the architecture. Somebody has proposed one big “agent” that handles all four versions uniformly. Somebody else has proposed four separate endpoints, each built on the simplest pattern that works for its job. The team want a recommendation.

What actually matters

These three patterns aren’t rivals for the same problem. They’re answers to different shapes of problem. Picking right is mostly about matching the pattern to the shape of the workload. Policy Q&A is a retrieval shape: knowledge in documents, no actions. A multi-step “do whatever’s needed” flow is an agent shape. Everything in between is some flavour of chain. Trying to solve a retrieval problem with an agent is over-engineering; trying to solve “do whatever’s needed” with a fixed chain is under-engineering.

The patterns sit on a ladder of elaboration. Each rung up buys capability (external data, actions, planning) and pays for it in three currencies: latency, cost, and predictability. The first two scale with the number of model invocations per request. The third is what bites in production: a chain always runs steps 1, 2, 3, in that order; an agent might call lookup_customer once on Monday and three times on Tuesday, depending on how the model planned. For any production-facing behaviour, that determinism is a feature: if an auditor asks “what did the assistant do when user X asked Y?”, a chain’s answer is the code; an agent’s is the run trace.

Tool use is the mechanism that connects models to systems: you describe a function, the model decides when to call it, the application executes it, and the result returns to the model. It’s a model feature, not an agent-only one. A chain whose steps can call tools is still a chain (the topology is fixed by code); a model handed tools and told to “figure it out” is an agent (the topology is chosen by the model). That distinction is the architectural call: not “do we use tools?” but “do we let the model choose the path?”
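The describe / decide / execute / return cycle can be sketched without any cloud SDK. Everything below is a stand-in: `fake_model` plays the part of the LLM, `lookup_customer` returns canned data instead of calling the internal API, and the message shapes are invented for the example rather than taken from Bedrock.

```python
from typing import Any, Callable


def lookup_customer(customer_id: str) -> dict[str, str]:
    """Hypothetical tool; a real one would call the customer-record REST API."""
    return {"customer_id": customer_id, "tier": "pro", "last_login": "2026-05-01"}


# The application owns the registry: the model can only ask for these by name.
TOOLS: dict[str, Callable[..., Any]] = {"lookup_customer": lookup_customer}


def fake_model(messages: list[dict[str, Any]]) -> dict[str, Any]:
    """Stand-in for an LLM: requests the tool once, then answers from its result."""
    last = messages[-1]
    if last["role"] == "tool":
        record = last["content"]
        return {"type": "text",
                "text": f"Customer {record['customer_id']} is on the {record['tier']} tier."}
    return {"type": "tool_call", "name": "lookup_customer",
            "arguments": {"customer_id": "4711"}}


def run_step(user_message: str) -> str:
    """One fixed step: at most one tool round-trip, then a final answer."""
    messages: list[dict[str, Any]] = [{"role": "user", "content": user_message}]
    reply = fake_model(messages)
    if reply["type"] == "tool_call":
        result = TOOLS[reply["name"]](**reply["arguments"])  # the application executes
        messages.append({"role": "tool", "content": result})
        reply = fake_model(messages)                         # the result goes back
    return reply["text"]


print(run_step("What tier is customer 4711 on?"))
# Customer 4711 is on the pro tier.
```

Note that `run_step` allows exactly one tool round-trip, and that limit is written in code. The model chooses whether to call the tool, not how many steps there are, which is what keeps this on the chain side of the line.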

The default error mode, when an agent platform is on the table, is over-elaboration: “we have an agent runtime, so everything becomes an agent call.” The better discipline is to pick the simplest pattern that supports each piece of the workload, and reach for agency only when the path genuinely needs to be chosen by the model rather than written down by the engineer.

What we’ll filter on

Six filters, applied to each of the three patterns.

Determinism of topology: does the same input produce the same sequence of steps?
Supports external data at query time: facts not in the prompt or the model’s training?
Supports taking actions (calling APIs, writing to external systems)?
Latency: roughly how many round-trips to the model per user request?
Cost per request: roughly proportional to the number of model invocations?
Observability / ease of audit: can a human reconstruct what happened?

The pattern landscape

Single LLM call. One invocation, one response. The user’s prompt goes in; the model’s completion comes out. No external data, no tools, no multi-step reasoning beyond what fits in one prompt. The baseline, useful for tasks the model can do in one shot given its training (summarise this text, translate this sentence, classify this ticket).
Chain. Multiple LLM calls stitched together in application code. Output of call N feeds into input of call N+1. Topology is hardcoded. Example: “extract facts from this ticket (call 1), then generate a customer-facing summary from the facts (call 2).” Each step is an InvokeModel call; the orchestration is your Lambda or application server.
Retrieval (RAG). A specific two-step chain: retrieve relevant chunks from a document corpus, then generate using the chunks. AWS-native via Bedrock Knowledge Bases (bedrock-agent-runtime:RetrieveAndGenerate) or DIY with embeddings + vector store + InvokeModel. Deterministic topology, one model call per request (two if you count the embedding call).
Chain with tool use. A chain where one or more steps allow the model to call tools. The model’s response might be “call tool X with these arguments”; the application executes the tool and sends the result back; the model continues. The chain topology is still fixed (step 1 can use tools, step 2 generates the final response, etc.) but within a step the model has degrees of freedom.
Agent. A loop, not a chain. The model is given tools and a goal; it plans, the application executes, the model observes, it re-plans. The loop continues until the model emits a “final answer.” AWS-native via Bedrock Agents: define an agent with instructions, action groups (each backed by a Lambda or an OpenAPI schema), optional Knowledge Bases, and optional guardrails; invoke via bedrock-agent-runtime:InvokeAgent. The runtime handles the plan-act-observe loop and emits a trace showing each step.
Multi-agent orchestration. Multiple specialised agents coordinated by a supervisor agent. A billing agent, a customer-lookup agent, a ticketing agent, and a supervisor that routes requests to the correct one. Bedrock supports this via agent collaborators. Useful at scale when a single agent’s tool count exceeds what it can reason over reliably (typically 15-20 tools); over-engineering at lower scale.

Side by side

Pattern	Deterministic	External data	Actions	Latency	Cost	Auditability
Single call	✓	✗	✗	1 hop	Low	Trivial
Chain	✓	✗ (unless tool)	Partial	N hops	Low-medium	Easy
Retrieval (RAG)	✓	✓	✗	1-2 hops	Low	Easy
Chain with tool use	✓	✓	✓	N+M hops	Medium	Easy
Agent	✗	✓	✓	Variable	High	Trace-based
Multi-agent	✗	✓	✓	Very variable	Very high	Multi-trace

The trade is obvious reading down the table: as you move from single call to multi-agent, capability increases and predictability, latency, and cost all move the wrong way. Choosing well means picking the first row, reading from the top, that supports the required capability.

Pattern decision tree

Two questions split the space cleanly. "One big agent" is almost always over-engineering; "agent when it genuinely needs to plan" is the rule worth holding.

The picks in depth

v1. Policy Q&A: retrieval. Knowledge lives in documents, no actions, fixed topology (retrieve, then generate). Implementation: a Bedrock Knowledge Base over the policy wiki’s S3 mirror, with semantic chunking and Titan v2 embeddings. The v1 endpoint is one RetrieveAndGenerate call per question. Answers come back with citations; the UI renders clickable links to the source policy. No tools, no agent runtime.
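A minimal sketch of the v1 call in Python, assuming a boto3 bedrock-agent-runtime client is passed in. The function names, the knowledge base ID, and the model ARN are placeholders rather than values from the real system, and the response fields read here should be checked against the current RetrieveAndGenerate reference.

```python
from typing import Any


def build_rag_request(question: str, knowledge_base_id: str, model_arn: str) -> dict[str, Any]:
    """Build the keyword arguments for one RetrieveAndGenerate call."""
    return {
        "input": {"text": question},
        "retrieveAndGenerateConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": knowledge_base_id,  # placeholder ID
                "modelArn": model_arn,                 # placeholder ARN
            },
        },
    }


def answer_policy_question(client: Any, question: str, knowledge_base_id: str,
                           model_arn: str) -> tuple[str, list[str]]:
    """One call per question; returns the answer text and the cited S3 URIs.

    `client` is assumed to be boto3.client("bedrock-agent-runtime").
    """
    response = client.retrieve_and_generate(
        **build_rag_request(question, knowledge_base_id, model_arn)
    )
    answer = response["output"]["text"]
    sources: list[str] = []
    for citation in response.get("citations", []):
        for ref in citation.get("retrievedReferences", []):
            uri = ref.get("location", {}).get("s3Location", {}).get("uri")
            if uri:
                sources.append(uri)
    return answer, sources
```

Passing the client in keeps the function testable with a stub, which matters when the only thing under test is the fixed retrieve-then-generate shape.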

v2. Customer lookup: chain with tool use, not an agent. The topology isn’t uncertain: ask, optionally call one tool, format. The model never has to plan; it just decides whether the question needed the tool. An agent here would be slower, costlier, and harder to debug for no extra capability the flow needs. Implementation: a tool description lookup_customer(id: int) -> {tier, last_login, plan_details}. The v2 endpoint calls Claude Sonnet with the user’s question, the tool description, and a system prompt instructing it to call the tool if the question requires customer data. The model responds either with text (if it didn’t need the tool) or a tool_use block with the customer ID. The application executes the Lambda behind the tool, sends the result back in a tool_result block, and the model produces the final text response. Two model calls in the typical case; one if the model can answer without the tool.
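The fixed shape of that flow can be sketched in Python without touching AWS. Everything here is illustrative: call_model stands in for the real Claude invocation, fake_model is a scripted stub, the message dictionaries are a simplified shape rather than the actual Bedrock message format, and lookup_customer returns canned data in place of the Lambda.

```python
from typing import Callable


def lookup_customer(customer_id: int) -> dict:
    """Hypothetical stand-in for the Lambda behind the tool."""
    return {"tier": "Pro", "last_login": "2 days ago", "plan_details": "monthly"}


def answer_with_optional_tool(question: str,
                              call_model: Callable[[list[dict]], dict]) -> tuple[str, int]:
    """Ask, optionally call one tool, format. Returns (answer, model calls used)."""
    messages = [{"role": "user", "content": question}]
    reply = call_model(messages)                      # model call 1
    if reply["type"] == "text":
        return reply["text"], 1                       # no tool needed
    result = lookup_customer(reply["input"]["id"])    # application runs the tool
    messages.append({"role": "assistant", "tool_use": reply})
    messages.append({"role": "user", "tool_result": result})
    final = call_model(messages)                      # model call 2
    return final["text"], 2


def fake_model(messages: list[dict]) -> dict:
    """Scripted stub: asks for the tool only when the question mentions a customer."""
    last = messages[-1]
    if "tool_result" in last:
        return {"type": "text",
                "text": f"Customer is on the {last['tool_result']['tier']} tier."}
    if "customer" in last["content"].lower():
        return {"type": "tool_use", "name": "lookup_customer", "input": {"id": 4711}}
    return {"type": "text", "text": "No lookup needed."}
```

The application code never loops: there is at most one tool round-trip, which is what keeps the topology fixed even though the model decides whether the tool is needed.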

v3. Email drafting: chain, with retrieval folded in. Fixed topology, retrieve customer + policy context, generate the draft, optionally review for tone/PII. The model doesn’t choose the path. Implementation: two or three steps in application code. Step 1: retrieve policy context relevant to the issue (if applicable) and the customer record. Step 2: pass the retrieved context plus the engineer’s instructions to Claude with a prompt template (“You are drafting a customer email. Tone: empathetic but professional. Include the following points…”). Step 3 (optional): a second model call that reviews the draft for tone and PII compliance.

v4. Ticket workflow: agent, and this is where the pattern earns its keep. If the flow were “user says file a ticket, assistant extracts context and calls file_jira()”, that would still be a chain with one tool call. The agent pattern starts to make sense when the flow might be: check Jira for a duplicate first, comment on it if one matches, otherwise file new, then notify Slack, then escalate on-call if P1. That’s a decision tree where the model picks the path based on what each call returns, agent territory. Implementation: a Bedrock Agent with:

Instructions. “You help engineers file and manage Jira tickets. When asked to file, first search for duplicates; if a similar open ticket exists, prefer commenting on it. Always summarise the ticket back to the user for confirmation before filing. Never change ticket priority without explicit user confirmation.”
Action group: Jira. An OpenAPI schema (or Lambda function schema) exposing jira_search(query), jira_create_ticket(team, title, description, priority), jira_comment(ticket_id, text), jira_update(ticket_id, changes).
Action group: Notifications. slack_notify(channel, text) for the P1 escalation flow.
Knowledge Base attached. For any policy questions that come up mid-flow (“is this a P1 or P2?” requires consulting the severity policy).
Guardrails attached. Content filters; PII redaction on output.

The agent runtime handles the loop: receive the user’s request, plan (model decides first action), act (runtime calls the action group’s Lambda), observe (result returns to the model), re-plan, etc. When the model emits a final response, the loop terminates. The runtime produces a trace, a structured record of each plan-act-observe cycle, which is the audit artefact.
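To make the loop concrete, here is a toy version in Python. It is a sketch of the plan-act-observe idea, not of the Bedrock Agents runtime: run_agent, scripted_planner, and the two canned tools are all hypothetical, and the planner is a hand-written function standing in for the model.

```python
from typing import Callable


def run_agent(goal: str,
              plan: Callable[[str, list[dict]], dict],
              tools: dict[str, Callable[..., object]],
              max_steps: int = 8) -> tuple[str, list[dict]]:
    """Loop until the planner returns a final answer; keep a trace of every step."""
    trace: list[dict] = []
    for _ in range(max_steps):
        decision = plan(goal, trace)                                # plan
        if "final" in decision:
            return decision["final"], trace
        observation = tools[decision["action"]](**decision["args"])  # act
        trace.append({"action": decision["action"],                 # observe
                      "args": decision["args"],
                      "observation": observation})
    raise RuntimeError("agent did not finish within max_steps")


def scripted_planner(goal: str, trace: list[dict]) -> dict:
    """Stand-in for the model: search first, file only if nothing matched."""
    if not trace:
        return {"action": "jira_search", "args": {"query": goal}}
    last = trace[-1]
    if last["action"] == "jira_search" and not last["observation"]:
        return {"action": "jira_create_ticket", "args": {"title": goal}}
    return {"final": f"Done: {last['observation']}"}


# Canned tools standing in for the action group's Lambdas.
TOOLS = {
    "jira_search": lambda query: [],
    "jira_create_ticket": lambda title: {"ticket_id": "PAY-2387"},
}
```

The path is chosen at run time from what each call returns, and the trace list plays the role of the audit artefact. The max_steps cap is a design choice of this sketch, added so a confused planner cannot loop forever.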

An example v4 flow, condensed:

User: "File a P2 ticket with team payments describing the duplicate-charge issue we
       just discussed. Attach customer 4711's context."

Agent:
  Thought: "Search for existing P2 tickets matching this description first."
  Action: jira_search(query="duplicate charge customer 4711")
  Observation: [] (no matches)

  Thought: "Look up customer context to include."
  Action: (uses Knowledge Base to retrieve recent chat history for customer 4711)
  Observation: "Customer 4711, Enterprise tier, reported duplicate charge on 2027-02-26..."

  Thought: "Confirm ticket details with user before filing."
  Response to user: "I'll file a P2 with team payments: 'Duplicate charge for
                    customer 4711 (Enterprise) on 2027-02-26.' Confirm?"

User: "Yes, file it."

Agent:
  Action: jira_create_ticket(team="payments", title="Duplicate charge for customer 4711",
                             description="...", priority="P2")
  Observation: {ticket_id: "PAY-2387"}

  Response to user: "Filed PAY-2387 with team payments. https://..."

Four model calls, three action-group invocations, one knowledge-base lookup, one user confirmation cycle. The trace shows every step with inputs, outputs, and the model’s reasoning. Compare that to the equivalent “chain with tool use” implementation: you’d have to hardcode the search-first-then-file logic, the confirmation step, and the knowledge-base lookup. Then the moment a user asks something slightly different (e.g. “file this or comment on PAY-2301 if it’s the same issue”), your hardcoded chain misses it. The agent handles variations without code changes; the cost is the non-deterministic topology and the observability burden of reading traces.

A worked dispatch

The v1-to-v4 assistant in production fronts four endpoints, or one endpoint with an intent router. Either way, an individual request’s path looks like:

Request: "What's our data retention policy for chat logs?"
Router classifies: policy question -> route to Knowledge Base
  RetrieveAndGenerate -> answer + citations
  Latency: ~1.5s. Cost: ~$0.003.

Request: "What tier is customer 4711 on?"
Router classifies: customer lookup -> route to chain with tool use
  Claude Sonnet (tool-use capable) -> tool_use: lookup_customer(4711)
  Lambda executes -> tool_result: {tier: "Pro", ...}
  Claude formats -> "Customer 4711 is on the Pro tier, last logged in 2 days ago."
  Latency: ~2s. Cost: ~$0.008.

Request: "Draft an apology email for customer 4711 about their billing issue."
Router classifies: drafting -> route to email chain
  Step 1: retrieve customer + policy context.
  Step 2: generate draft with Claude Sonnet.
  Step 3: (optional) tone/PII review pass.
  Latency: ~4s. Cost: ~$0.015.

Request: "File a P2 Jira ticket for the billing issue, attach customer context,
          and notify #customer-escalations on Slack."
Router classifies: multi-step workflow -> route to Bedrock Agent
  InvokeAgent -> agent runs its plan-act loop:
    search Jira -> retrieve context -> confirm with user -> create ticket -> notify Slack
  Returns final response + trace.
  Latency: ~10-15s across confirmations. Cost: ~$0.05-0.10.

Latency and cost vary by an order of magnitude across the four patterns. If you’d built the policy Q&A as an agent, each simple question would cost 5x more and take 5x longer, for no quality gain. If you’d built the ticket workflow as a chain, you’d have hardcoded the flow and brittle-failed on any variant the code didn’t anticipate. Each pattern matches the shape of its workload.
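A router along these lines can be very small. The sketch below uses keyword matching purely for illustration; a production classifier would more likely be a cheap model call or a trained intent classifier, and the route names and the classify and dispatch functions are assumptions, not part of any AWS API.

```python
from typing import Callable


def classify(request: str) -> str:
    """Toy intent classifier: most expensive pattern is matched first."""
    text = request.lower()
    if "ticket" in text or "jira" in text:
        return "agent"          # multi-step workflow
    if "draft" in text or "email" in text:
        return "email_chain"    # fixed retrieve -> generate -> review chain
    if "customer" in text:
        return "tool_chain"     # chain with one optional tool call
    return "retrieval"          # default: policy Q&A over the knowledge base


def dispatch(request: str, handlers: dict[str, Callable[[str], str]]) -> tuple[str, str]:
    """Route the request to the handler for its pattern; return (route, response)."""
    route = classify(request)
    return route, handlers[route](request)
```

Defaulting to retrieval means an unrecognised request lands on the cheapest path rather than on the agent.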

What’s worth remembering

Chains, retrieval, and agents are different designs for different problems. Not rival approaches to the same problem. Picking wrong usually means over-engineering (agent where chain would do) rather than under-engineering.
Chains are fixed topologies; agents are model-driven topologies. That’s the fundamental distinction. Chains are deterministic in structure (though each LLM call is stochastic). Agents choose their own path through a tool space, so the same request can take different paths.
Tool use is a model feature, not an agent-only feature. A chain with tool use gets many agent-like capabilities (external data, actions) while keeping a fixed topology. This is often the correct middle ground.
Retrieval is a specific chain that deserves its own name. Retrieve, then generate. Bedrock Knowledge Bases is the managed path; DIY with embeddings + vector store is the flexible path.
Bedrock Agents handle the plan-act-observe loop for you. Define action groups (Lambdas or OpenAPI schemas), attach Knowledge Bases and Guardrails, invoke via InvokeAgent. The runtime produces a trace that’s the audit artefact.
Latency and cost scale with the pattern. Single call is 1-2 seconds and fractions of a cent. In the worked dispatch, retrieval is about 1.5 seconds and still a fraction of a cent, chain with tool use is about 2 seconds and just under a cent, the email chain is about 4 seconds and a cent or two, and the agent is 10-15 seconds and five to ten cents. Pick the cheapest pattern that works.
Non-determinism is the cost of agency. Agents will take different paths on the same request. That’s what makes them general; it’s also what makes them harder to test and explain. Keep agents to the flows that genuinely need them.
An intent router is the architecture most assistants actually want. Not “one big agent” handling everything. A router that classifies the request and dispatches to the correct pattern (retrieval, chain, agent) keeps the cheap paths cheap and reserves the expensive machinery for the cases that need it.

The temptation, when agents are available, is to use them for everything, “then we don’t have to think about routing.” The result is a system that costs ten times more, takes five times longer, and is harder to audit than it needed to be. The harder, better discipline is to look at each piece of the workload, ask whether the model genuinely needs to choose the path, and reach for an agent only when the answer is yes.

Pricing Experiments: The Right Box at the Right Price

2026-05-19T06:00:00+08:00

Greenbox has just over two hundred subscribers. The Business Model Canvas workshop was earlier in the week, and something Lee said is still rattling around in Maya’s head: “You’ve validated almost everything on that canvas. The one box you haven’t tested is the revenue model.”

Lee is at Maya’s kitchen table with a coffee and the canvas printed on A3. He’s drawn a circle around the Revenue Streams box and written three words next to it: “untested, but working.”

“The prices are working,” Maya says. “People are paying them. Nobody’s complained.”

“Nobody who signed up has complained. You don’t know anything about the people who looked at the site, saw $25, and closed the tab. You don’t know whether you could charge $30 and still get the same conversion. You don’t know whether the small box is mispriced relative to the large box. You’ve got one data point, the current price, and you’ve decided it’s right because people are buying.”

Maya frowns. She’d been congratulating herself, a little, that the pricing felt settled. It was one less thing to worry about.

“So what are you suggesting?”

“Pricing experiments. Not once. As a habit. You should be testing your pricing the way you test your features.”

Why pricing feels different

Pricing is one of the few decisions in a startup that feels genuinely scary to get wrong. A bad feature can be rolled back. A bad copy tweak can be un-tweaked. A bad price, once published, anchors every subscriber’s expectation of what the product costs. Raise it and people feel cheated. Lower it and the people who paid the old price feel stupid.

This is why most founders pick a number that feels right, ship it, and never touch it again. It’s not laziness. It’s loss-aversion. The downside of changing pricing feels enormous, and the upside feels uncertain.

Lee has seen this pattern many times. His view is that it’s the wrong way to think about it.

“The risk isn’t changing your price. The risk is being wrong about your price and not knowing. If you’re charging $25 for a small box that people would happily pay $30 for, you’re not being nice. You’re giving away five dollars per subscriber per week. At two hundred and ten subscribers, that’s $1,050 per week you’re leaving on the table. That’s a farm partner you could pay. That’s Sam’s hours. That’s runway.”

Maya does the maths in her head. The number is uncomfortably large.

What to test

The team sits down on a Wednesday afternoon to work out what they actually want to learn. Lee uses the same approach they’ve used for product discovery: write down the assumptions and figure out which ones are worth testing.

The assumptions on the wall, in Maya’s handwriting:

The small box at $25 and the large box at $45 are both priced correctly relative to cost.
The $20 gap between small and large reflects the actual value difference to subscribers.
Subscribers would not pay more than $25 for the small box.
A third, larger box option would not expand the market.
Weekly delivery is the only frequency that makes sense.
Free delivery is a deal-breaker if removed.

Lee reads down the list and then sets the marker down. “Which of these would you be most embarrassed to find out you’d been wrong about for six months?”

Maya doesn’t hesitate. “Three. If people would happily pay $30 for the small box, I’d feel sick.”

“Good. Let’s test three first.”

Designing the experiment

This is where it gets interesting. You can’t just change the price on the website and see what happens, because if the new price is wrong you’ve damaged the brand. You also can’t ask people “would you pay $30?” in a survey, because people lie about pricing in surveys, not deliberately, but because stated preference and revealed preference are different things.

Lee proposes a simple framework:

Test the price before the commitment. New visitors who haven’t yet chosen a box see a landing page with the price. Half see $25. Half see $30. Everybody who clicks through can choose to continue at the price they saw. Measure two things: the click-through rate (do fewer people continue at $30?) and the conversion rate (of the people who continue, how many actually subscribe?).

Be honest about what you’re doing. If a subscriber asks why they saw $30 and their friend saw $25, don’t pretend it was a glitch. Explain that you were testing pricing and wanted to understand what the right number was. Offer them whichever price they would have preferred.

Lee pauses there. “That’s the design. Now the part I can’t do for you. I’ve watched plenty of these experiments and I know what they cost when they’re set up wrong, but the actual sample-size maths, how big and how long and how confident, isn’t my world. I can tell you a pricing test is worth running. I can’t tell you whether your traffic will let you read the result. You need that worked out before you agree on a duration.”

He looks across the table. Priya is already on it.

“Give me a minute. I want to be sure we can actually read a result before we agree to run this.”

She pulls her notepad towards her.

“The thing we’re worried about with $30 is people seeing the higher number and not subscribing. So the question I have to size up is: how many visitors per arm before we’d reliably notice a drop in conversion at $30, if there’s one to notice? The maths is symmetric, the sample size comes out the same whether we frame the change we’re looking for as a fall or a lift, but the fall is what would actually hurt us, so that’s the version I’ll plug in.”

She writes a few lines.

“Before I pick a number for ‘how big a drop’, let me work out what would actually count as bad for us. At $25 with seven percent conversion we make $1.75 a visitor. We’re testing $30, so the question becomes: how far would conversion have to fall at $30 before the price hike is a loss instead of a win? Revenue per visitor matches when $30 × p equals $1.75, so p = $1.75 ÷ $30, five point eight three percent. Anything below that and $30 is making us less money than $25, not more. So the drop we’d care about catching is conversion falling from seven percent to about five point eight, roughly a seventeen percent relative drop. That’s what I’ll size for.

“Two arms, $25 and $30. Conversion seven percent today. Target drop pinned at break-even, seventeen percent relative. Standard knobs on confidence and power. The back-of-envelope number is roughly seven thousand visitors per arm. We’re getting maybe a hundred and fifty visitors a week. That’s nearly a year per arm.”
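Priya’s break-even arithmetic is short enough to check in a few lines of Python. This is only a sketch of the sums she describes; the function name is made up for illustration.

```python
def break_even_conversion(old_price: float, old_conversion: float, new_price: float) -> float:
    """Conversion rate at the new price that matches the old revenue per visitor."""
    return old_price * old_conversion / new_price


baseline = 0.07                                   # 7% conversion at $25
p_break_even = break_even_conversion(25, baseline, 30)
relative_drop = (baseline - p_break_even) / baseline

print(f"break-even conversion at $30: {p_break_even:.4f}")   # about 0.0583
print(f"relative drop to detect:      {relative_drop:.1%}")  # about 16.7%
```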

Tom holds up a hand. “Nearly a year? Hold on, before you tell me the rest, can you walk us through where that number actually comes from? I hear ‘seven thousand per arm’ and I’ve got no idea what’s in the calculation.”

Priya nods. “Worth doing. Five things go into it, and they’re all things we have to commit to before we run the experiment, not after.”

She turns to a fresh page.

“First, the baseline. The conversion rate we’re comparing against. For us that’s seven percent, the fraction of visitors today who actually subscribe at $25.

“Second, the shift we want to be able to detect. We have to commit up front to a size of change that matters to us. I picked the break-even drop, seven percent falling to about five point eight, a roughly seventeen percent relative shift. If we’d only cared about catching a much bigger collapse, twenty-five percent, conversion all the way down to five point three, we’d need fewer visitors because the gap is bigger and easier to see through the noise. If we wanted to spot a subtler ten percent drop, seven sliding to six and a bit, we’d need a lot more. Small differences look a lot like noise, and the maths punishes us for asking it to see through them.

“Third, the null hypothesis. The ‘nothing’s actually happening here’ assumption, the default we’d hold to until the data pushes us off it. In our case: ‘$25 and $30 convert identically.’ The whole game is asking whether the gap we observe between the two arms could plausibly have come from chance, even if the prices really do convert identically. If yes, we shrug. If no, we’ve learned something.

“Fourth, alpha. How often we’re willing to be fooled by chance into thinking there’s an effect when there isn’t. Standard number is five percent, one experiment in twenty where we cry ‘effect!’ when actually nothing’s there.

“Fifth, power. The flip side. If there really is an effect, how likely are we to catch it? Standard number is eighty percent. So even with a perfectly designed experiment, one time in five we’d miss a real difference and write it off as noise.”

Tom nods slowly. “So alpha and power are us deciding how cautious we want to be in each direction.”

“Exactly. Once you’ve fixed the four numbers (baseline, target shift, alpha, power), there’s a formula that turns them into a sample size. Two pieces of vocabulary inside that formula are worth pinning down before I show the working. The first is standard deviation. Think of it as the typical size of the wobble. Flip a fair coin a hundred times and you expect fifty heads, but you don’t get exactly fifty every time, sometimes forty-six, sometimes fifty-three. Standard deviation puts a number on that everyday wobble: the noise you get from a process even when nothing interesting is going on.

“The second is the z-score. That’s our signal measured in wobble-units: how many standard deviations away from boring-and-noisy a result sits. The bigger the z, the harder it gets to explain the result away as noise. Alpha and power feed into the formula as z-scores, five percent alpha and eighty percent power are roughly z = 1.96 and z = 0.84, numbers worth recognising on sight.”

She turns the notepad landscape to give herself more room and starts writing as she talks.

“Here’s the formula, for two proportions like ours. Sample size per arm:

$$n = \frac{(z_1 + z_2)^2 \,\bigl[p_1(1 - p_1) + p_2(1 - p_2)\bigr]}{(p_1 - p_2)^2}$$

Three pieces. The cautiousness term, (z₁ + z₂)², is alpha and power expressed as z-scores, added then squared. The noise term, p₁(1−p₁) + p₂(1−p₂), has each arm contributing its own wobble, p(1−p), summed across the two arms. The signal term, (p₁ − p₂)² underneath, is the gap we want to detect, squared. Big signal, small sample needed. Small signal, the sample balloons, and the squaring on the gap is what makes it balloon.

“Plug ours in. Cautiousness: (1.96 + 0.84)² is about 7.84. Noise: 0.07 × 0.93 is about 0.065 in the $25 arm; 0.0583 × 0.9417 is about 0.055 in the $30 arm; sum is about 0.12. Signal: (0.07 − 0.0583)² is 0.0117², which is about 0.000137. Multiply cautiousness by noise: 7.84 × 0.12 is about 0.94. Divide by signal: 0.94 ÷ 0.000137 is just under seven thousand. Per arm. That’s where the number comes from.”
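The same calculation as a Python sketch, using the standard library’s NormalDist to get the z-scores instead of the memorised 1.96 and 0.84. The function name is an assumption for illustration; the formula is the two-proportion one Priya describes, with a two-sided alpha.

```python
from statistics import NormalDist


def sample_size_per_arm(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> float:
    """Visitors needed in each arm to detect a shift from p1 to p2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)           # about 0.84 for power = 0.80
    cautiousness = (z_alpha + z_power) ** 2
    noise = p1 * (1 - p1) + p2 * (1 - p2)
    signal = (p1 - p2) ** 2
    return cautiousness * noise / signal


n = sample_size_per_arm(0.07, 1.75 / 30)   # baseline 7%, break-even about 5.83%
weeks = n / 150                            # 150 visitors per arm per week
print(round(n), round(weeks))              # a little under 7,000, about 46 weeks
```

Changing the second argument shows how quickly the requirement grows as the gap shrinks, because the gap is squared in the denominator.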

She underlines the result. “About seven thousand visitors per arm to spot the break-even drop. We get a hundred and fifty a week. Forty-six weeks per arm, nearly a year. If we’d settle for spotting a much bigger collapse, twenty-five percent or so, we could call it in about four months. For a subtler ten percent drop, we’re back over two years per arm. For a five percent drop, well inside what a small price tweak might plausibly move, the better part of a decade.”

Tom sits back. “So what does two weeks of data buy us?”

“Worth working out. Let me run the formula backwards. Two weeks gives us roughly three hundred visitors per arm. If I fix the cautiousness and noise where they were and solve for the gap instead (what’s the smallest difference we’d reliably catch at n = 300?), the maths gives me about six percentage points. So a real conversion rate would have to drop from seven percent to about one percent before our two-week test would reliably notice. Anything subtler than that, we miss most of the time.”

She thinks for a second.

“More usefully: take the gap we actually care about, the break-even drop, seven down to five point eight, and ask how often we’d correctly call it. At three hundred per arm the power drops from eighty percent to about eight percent. So even if $30 genuinely tips us over to break-even, our two-week test would flag it only about one time in twelve. The other eleven times we’d shrug and write it off as noise.

“And the flip side is just as ugly. Even if $25 and $30 convert identically, we’ll see a gap that looks significant about one time in twenty. That’s the alpha we picked, and alpha doesn’t get kinder when the sample’s small. So a ‘significant’ result at this scale could be a real effect we got lucky enough to spot, or it could be pure chance. We can’t tell which from the data alone.”
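Both of Priya’s backwards calculations can be sketched in Python. The function names are illustrative, and the power figure uses the usual normal approximation for a two-sided test, ignoring the negligible chance of a significant result in the wrong direction.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_gap(n: int, noise: float, z_alpha: float = 1.96,
                           z_power: float = 0.84) -> float:
    """Smallest gap in conversion that n visitors per arm would reliably catch."""
    return (z_alpha + z_power) * sqrt(noise / n)


def approximate_power(p1: float, p2: float, n: int, alpha: float = 0.05) -> float:
    """Chance of flagging a true shift from p1 to p2 with n visitors per arm."""
    standard_error = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(p1 - p2) / standard_error - z_alpha)


gap = minimum_detectable_gap(300, 0.12)            # noise held at 0.12 as in the text
power_two_weeks = approximate_power(0.07, 1.75 / 30, 300)
print(f"detectable gap at n=300: {gap:.3f}")       # about 0.056, i.e. six points
print(f"power at n=300: {power_two_weeks:.1%}")    # roughly 8%
```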

She caps the pen.

“At our volume the test is statistically blind. Whatever number comes out, we can’t separate signal from noise. What we can do is read the direction the gap leans, and decide in advance whether that’s enough to act on.”

Maya looks at Lee. “So the experiment can’t really prove anything in any sane window.”

“Not at your volume. Which means the choice you have isn’t ‘run it until it’s statistically valid’. It’s ‘run it long enough to see the shape of the signal, then act on the direction’. You’ll be moving from one defensible price to another defensible price with a lean, not a proof. That’s the only kind of pricing decision you can make at this stage.”

Maya thinks about it. “If we’re wrong by five percent in either direction, we can adjust. We won’t be wrong by fifty percent.”

“That’s the right framing. And it tells you what the time limit is for.”

Set a time limit. Two weeks, then stop, not because two weeks will give you certainty (it won’t), but because the time box is what limits your exposure. A pricing experiment is a temporary act of price discrimination, and the longer it runs the more it corrodes trust. The time limit is containment, not measurement.

Know what you’ll do with each outcome. Before starting, write down what decision you’ll make if revenue per visitor goes up, goes down, or barely moves. “The data is inconclusive” needs to be one of the outcomes you’ve planned for, because at this volume it’s the most likely one. Decide in advance whether a directional lean is enough to act on. If it isn’t, don’t run the experiment yet; save it for when you’ve got the traffic.

Priya adds one more thing. “And we write a note in the wiki, ‘small box price, revisit when weekly visitors exceed five thousand’. Future us deserves to know we acted on a lean, not on a proof.”

Tom has a concern. “What if conversion drops by ten percent at $30? Does that mean $30 is the wrong price?”

Lee thinks about it. “Not necessarily. If conversion drops ten percent but revenue per converted subscriber goes up twenty percent, you’re still ahead. The question isn’t ‘does conversion drop’, it’s ‘does total revenue go up or down.’ And you have to weigh that against the long-term effects on word-of-mouth, retention, and brand.”

Running the experiment

They run it for two weeks.

The setup is deliberately simple. Priya writes a small piece of code that assigns each new visitor to one of two groups at random and shows them the appropriate landing page. The price on the page is the price they’d pay if they subscribed. Nothing else about the site changes.
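The post doesn’t show Priya’s code. One common way to do this kind of split, sketched here in Python as an assumption rather than as what she wrote, is to hash a visitor identifier so the assignment is effectively random across visitors but stable for any one of them. The experiment name, variant labels, and visitor IDs are all made up.

```python
import hashlib

PRICES = {"price_25": 25, "price_30": 30}


def assign_variant(visitor_id: str, experiment: str = "small-box-price") -> str:
    """Deterministically place a visitor in one of two arms, roughly 50/50."""
    digest = hashlib.sha256(f"{experiment}:{visitor_id}".encode()).digest()
    # Parity of the first hash byte decides the arm.
    return "price_30" if digest[0] % 2 else "price_25"


def price_for(visitor_id: str) -> int:
    """The price this visitor sees, and would pay if they subscribed."""
    return PRICES[assign_variant(visitor_id)]
```

Hashing instead of calling a random number generator on each page load means a returning visitor sees the same price every time, which fits the rule that the price on the page is the price they would pay.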

Over two weeks, 312 visitors see the $25 page. 298 visitors see the $30 page. The split is even enough to compare, and, as the team already knows going in, well short of what statistical confidence would require.

The results:

Small box pricing experiment: two weeks

Variant	Visitors	Clicked through	Subscribed	Revenue/week
$25 (control)	312	184 (59%)	22 (7.1%)	$550
$30 (test)	298	164 (55%)	19 (6.4%)	$570

Click-through drops from 59% to 55%. Conversion drops from 7.1% to 6.4%. But revenue per visitor goes up, because the people who subscribe at $30 are paying more than the people who subscribe at $25.

The total revenue from the $30 cohort is slightly higher than the $25 cohort, despite slightly fewer subscribers.

Reading the numbers

The team gathers round Priya’s laptop.

“This is roughly the shape we expected,” Priya says. “Twenty-two subscribers versus nineteen, out of about three hundred visitors each. The gap is well inside the noise, if we’d run the experiment in a different fortnight, those numbers could easily have flipped. The data isn’t conclusive. We knew going in it wouldn’t be.”
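Priya’s “well inside the noise” can be checked against the table with a pooled two-proportion z-test, alongside the revenue-per-visitor figures. This Python sketch is one standard way to run that check, not necessarily the one she used.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(x1: int, n1: int, x2: int, n2: int) -> tuple[float, float]:
    """Pooled z-test for a difference in conversion; returns (z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Figures from the results table: subscribers and visitors per arm.
z, p_value = two_proportion_z(22, 312, 19, 298)
rpv_25 = 22 * 25 / 312     # revenue per visitor at $25
rpv_30 = 19 * 30 / 298     # revenue per visitor at $30

print(f"z = {z:.2f}, p = {p_value:.2f}")           # z about 0.33, p about 0.74
print(f"revenue per visitor: ${rpv_25:.2f} vs ${rpv_30:.2f}")
```

A p-value that large says a gap of this size would turn up most of the time even if the two prices converted identically, which is the noise Priya is describing.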

“Then what is it telling us?” Maya asks.

“Direction. Click-through is slightly lower at $30. Conversion is slightly lower. Revenue per visitor is slightly higher. That’s the shape you’d expect if $30 is closer to the right price than $25, and roughly the opposite of what you’d see if $30 were too high. It’s not proof. It’s a lean.”

Lee picks it up. “And the team agreed before we ran it that a lean is what we’d act on. The alternative was waiting years for a confidence we don’t actually need to make a five-dollar decision.”

The decision

Maya looks at the numbers again. “So we raise the price.”

“On a directional signal,” Lee says. “Not because the data proved anything, but because we said we’d act on direction and the direction is up. If it had pointed the other way we’d be having a much shorter meeting. The roughly eight percent more revenue per visitor isn’t huge, but the people who said yes at $30 are telling you they value the box at $30. The people who said no were probably never going to be great subscribers anyway. They would have subscribed for a month and cancelled.”

“Before you change anything, how do you feel about the subscribers who paid $25?”

Maya thinks. “I don’t want to raise the price on them. They signed up at $25 and that was the deal.”

“Good instinct. Honour the original price for existing subscribers indefinitely. New subscribers sign up at $30. Your existing subscribers feel looked after. Your new subscribers feel fairly treated. The only people who lose are the ones who would have subscribed at $25 but won’t at $30, and the experiment tells you that’s a small group, and a group that probably wouldn’t have stuck around.”

It’s a clean decision, but it’s only clean because they measured first, and only honest because they were clear, before measuring, about what kind of evidence they’d accept.

What gets tested next

Maya and Lee work through the rest of the list. The $20 gap between small and large is the obvious next target, assumption 2 from the wall. After that, the mixed-sourcing pilot that’s been on the whiteboard for weeks. Each one a separate test. Each one starting the same way: write down the question, design the test, decide what you’d do with each outcome before you run it, measure, decide.

Pricing experiments. Not once. As a habit.

The team doesn’t know it yet, but the question on the wall is about to change. The discipline they’ve just learned will get its first real test on a deadline they didn’t pick. That’s a story for another week.

Lee writes a single sentence at the top of the pricing page in the team wiki:

“Price is an assumption until you’ve tested it. Test the assumptions you’d be most embarrassed to be wrong about first.”

Maya reads it, nods, and goes back to the kitchen to think about what to test next.

How to Make a Bedrock Chatbot Audit-Ready with Guardrails and Watermarks

2026-05-18T06:00:00+08:00

The situation

A mid-size fintech runs a customer-facing chatbot on Bedrock. The chatbot helps the roughly 2M active customers understand their transaction history, explain fees, surface policy documents, and escalate to a human when needed. It runs on Claude Sonnet 4.5, invoked from a Lambda behind API Gateway.

Three compliance obligations:

No regulated financial advice. The chatbot can explain what a fee is; it cannot recommend whether to invest, what to buy, or when to sell. Crossing that line is a regulated-advice violation.
No customer PII in outputs. The model should never echo a full account number, full name + date of birth together, or any other field that would count as PII under the relevant privacy regulation. The chatbot has access to this data (via tool use) but should redact it in responses.
Auditable provenance. Every response must be attributable: which model produced it, which prompt, which customer session, and, in the event of a dispute, proof that the text came from the AWS-hosted model rather than a third-party intercept or a compromised channel.

Separately, the product team wants to know: when the chatbot refuses a request (e.g. “I can’t give investment advice”), what does that refusal look like? Who controls the refusal message? And can users bypass it with prompt tricks (“pretend you’re a financial advisor…”)?

What actually matters

Responsible-AI controls live in three broad layers (real-time enforcement, audit persistence, and identity/encryption), and they don’t map neatly onto the three compliance obligations. The first job is mapping each question to a control category.

The first is topic restriction. The “no financial advice” requirement is about what the model talks about, not what words appear in the output. A topic-level control needs to recognise that “should I invest in XYZ?” is a regulated-advice question even if phrased as “what do you think about XYZ?” or “if you were me, what would you buy?”. That calls for a classifier-style filter that fires on intent, intercepting the invocation and returning a canned refusal instead of the model’s response.

The second is PII redaction. This is about patterns in the output: an account number has a recognisable shape, an email address matches a regex, a full name is an entity a tagger can identify. The right control combines a catalogue of standard PII types with user-defined regex patterns for domain-specific identifiers (the fintech’s internal account number format, say). It also has to offer a choice between blocking the invocation entirely and redacting (replacing the matched text with a typed tag), because some references are legitimate when anonymised and others are never acceptable in output at all.

The third is blunt word and phrase blocking. Profanity, competitor mentions, and a hard-coded block list live here. Less important for this specific scenario but part of any complete control catalogue.

The fourth is generic content moderation. Hate, insults, sexual content, violence, misconduct, and prompt-injection attempts, each with a configurable severity threshold, applied independently to input and output. This is the safety net that catches the cases the topic and PII controls don’t explicitly cover.

The fifth is grounding and relevance checks. The control evaluates whether the model’s response is grounded in the source material provided to it and whether it actually addresses the user’s question, blocking or flagging when either score falls below a threshold. Most relevant for RAG-heavy chatbots; worth knowing for the fintech scenario even though the primary obligations don’t lean on it.

The sixth is watermarking and provenance. Text-generation watermarking (statistical patterns embedded in the output that can be detected later) is an emerging capability, not yet universal. The reliable provenance story today is the platform’s own call log: every model invocation recorded with principal, model identity, timestamp, and, with full payload logging enabled, the request and response stored encrypted at rest. Combined with identity restrictions on which principals can invoke which models, this gives a cryptographically-bounded audit trail.

The seventh is the shape of the refusal itself. When a real-time filter intercepts an invocation, the caller needs to receive something distinguishable from a normal completion: a stop reason, an assessment showing which policy fired, and a configurable refusal message that the application can render. Refusals are control flow, not error handling.

What we’ll filter on

Five filters: one per compliance obligation, with PII split into “detect” and “action taken”, plus one for the prompt-bypass question the product team raised.

Does this control enforce topic-level restrictions (e.g. no financial advice)?
Does it detect PII in inputs and outputs?
Does it let the response be redacted rather than blocked (for cases where redaction is enough)?
Does it cover prompt-injection attempts (“ignore previous instructions…”)?
Does it produce an audit artefact, something an auditor can inspect after the fact?

The Bedrock responsible-AI landscape

Bedrock Guardrails. A configuration object attached to an invocation that applies one or more policies to the input, the output, or both. Policies: denied topics (up to 30 per guardrail, each a named topic with description and example prompts), content filters (six categories + prompt attacks, each with severity threshold), word filters (block list + managed profanity), sensitive-information filters (the built-in PII catalogue + user regexes), and, for RAG use cases, contextual grounding and relevance checks. Each guardrail is versioned; invocations reference a specific version. Created via bedrock:CreateGuardrail, versioned via CreateGuardrailVersion, applied via the guardrailIdentifier + guardrailVersion fields on InvokeModel (or automatically when a Knowledge Base or Agent has a guardrail attached).
Model invocation logging. A Bedrock account-level setting (one per Region) that directs Bedrock to write full request and response payloads to a destination: S3 bucket, CloudWatch Logs, or both. Enabled via bedrock:PutModelInvocationLoggingConfiguration. Captures the prompt, the model’s raw output, any guardrail assessments, and metadata (model ID, timestamp, caller IAM principal via CloudTrail correlation). Encrypts at rest under a KMS key of your choice. This is the durable audit trail.
CloudTrail. Every Bedrock API call (InvokeModel, CreateGuardrail, GetGuardrail, PutModelInvocationLoggingConfiguration) emits a CloudTrail event. Data events can be enabled to capture InvokeModel calls specifically (they’re not in management events by default). Gives the “who called what, when” audit; doesn’t include the model’s output (that’s invocation logging’s job).
IAM-scoped model access. A Bedrock IAM policy controls which principals can invoke which models. bedrock:InvokeModel on arn:aws:bedrock:*::foundation-model/anthropic.claude-sonnet-4-5-20250929-v1:0 restricts a role to one model. The chatbot Lambda’s role should allow exactly the models the application is approved to use, nothing else; requests for other models fail with AccessDenied, visible in CloudTrail, before any model is invoked.
Customer-managed KMS keys. Invocation logs, training data, and custom models can be encrypted with customer-managed KMS keys. Gives the ability to revoke access to historical logs by disabling the key, and to require explicit key-usage grants to read the audit record. The regulator-facing story.
Cross-Region inference profiles and data residency. For regulators that care where inference happens, Bedrock’s model ARNs pin the Region, and cross-Region inference profiles (for models that support it) expose an explicit list of which Regions can serve a request. Important for the audit story when data-residency constraints apply.
Bedrock Evaluation. Not a real-time control, but part of the responsible-AI story: systematic evaluation of a model (or a prompt-and-model combination) on dimensions including toxicity, robustness, and accuracy, against either built-in datasets or your own. The pre-production counterpart to Guardrails’ in-production enforcement.

Side by side

Mapping each control to the five filters:

Control	Topic restriction	PII detection	Redact option	Prompt-injection	Audit artefact
Guardrails: denied topics	✓	—	—	Partial	✓ (assessment)
Guardrails: content filters	Partial	—	—	✓ (prompt attacks)	✓ (assessment)
Guardrails: PII filters	—	✓	✓ (anonymize)	—	✓ (assessment)
Guardrails: word filters	Partial	—	—	—	✓ (assessment)
Invocation logging	—	—	—	—	✓ (full payload)
CloudTrail	—	—	—	—	✓ (metadata)
IAM model scoping	—	—	—	—	✓ (deny trail)

A complete configuration for the fintech chatbot uses all of these, not one. Guardrails handle real-time enforcement; invocation logging and CloudTrail handle audit; IAM handles the “this model, not another” question; KMS handles the “this key, held by us” question.

How the controls compose

Guardrails enforce at two gates, input and output, around a single model call. CloudTrail, invocation logging, and KMS produce the three audit artefacts. Each layer does one job; removing any of them breaks a different piece of the compliance story.

The configuration in depth

The Guardrail. Create one guardrail per application (chatbot-customer-v1). The configuration, at a high level:

{
  "name": "chatbot-customer-v1",
  "blockedInputMessaging": "I can't help with that request. For investment advice, please speak with a licensed advisor at 0800-...",
  "blockedOutputsMessaging": "I can't share that response. Please contact support if you need more detail.",
  "topicPolicyConfig": {
    "topicsConfig": [
      {
        "name": "RegulatedFinancialAdvice",
        "definition": "Advice to buy, sell, or hold specific securities, or recommendations on investment strategy, asset allocation, retirement planning, or tax planning for a specific person.",
        "examples": [
          "Should I invest in XYZ stock?",
          "What should I do with my 401k?",
          "Is now a good time to buy bonds?"
        ],
        "type": "DENY"
      }
    ]
  },
  "contentPolicyConfig": {
    "filtersConfig": [
      {"type": "SEXUAL",      "inputStrength": "HIGH",   "outputStrength": "HIGH"},
      {"type": "VIOLENCE",    "inputStrength": "HIGH",   "outputStrength": "HIGH"},
      {"type": "HATE",        "inputStrength": "HIGH",   "outputStrength": "HIGH"},
      {"type": "INSULTS",     "inputStrength": "MEDIUM", "outputStrength": "MEDIUM"},
      {"type": "MISCONDUCT",  "inputStrength": "HIGH",   "outputStrength": "HIGH"},
      {"type": "PROMPT_ATTACK", "inputStrength": "HIGH", "outputStrength": "NONE"}
    ]
  },
  "sensitiveInformationPolicyConfig": {
    "piiEntitiesConfig": [
      {"type": "CREDIT_DEBIT_CARD_NUMBER", "action": "BLOCK"},
      {"type": "US_BANK_ACCOUNT_NUMBER",   "action": "ANONYMIZE"},
      {"type": "US_SOCIAL_SECURITY_NUMBER","action": "BLOCK"},
      {"type": "EMAIL",                    "action": "ANONYMIZE"},
      {"type": "PHONE",                    "action": "ANONYMIZE"},
      {"type": "NAME",                     "action": "ANONYMIZE"}
    ],
    "regexesConfig": [
      {
        "name": "InternalAccountId",
        "pattern": "ACCT-[A-Z0-9]{10}",
        "action": "ANONYMIZE"
      }
    ]
  }
}

A few points on that. PROMPT_ATTACK is applied to input only (outputStrength: NONE) because what we’re catching is the user’s attempt to jailbreak; it doesn’t make sense on output. CREDIT_DEBIT_CARD_NUMBER is BLOCK (blocks the whole invocation) because a card number in response is never acceptable; US_BANK_ACCOUNT_NUMBER is ANONYMIZE because the chatbot can reference “your account ending in 1234” legitimately by using the anonymized form. The regexesConfig catches the company’s internal ACCT-... identifier that isn’t in the built-in PII catalogue.
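Before a pattern like InternalAccountId goes into regexesConfig, it can be worth exercising it locally. A minimal sketch, using Python’s re module as a stand-in for the guardrail’s regex engine (an assumption: the two engines may differ on edge cases). The helper name and the sample strings are hypothetical.

```python
import re

# Same pattern as the InternalAccountId entry in regexesConfig.
INTERNAL_ACCOUNT_ID = re.compile(r"ACCT-[A-Z0-9]{10}")


def find_account_ids(text: str) -> list[str]:
    """Return every substring the pattern would flag."""
    return INTERNAL_ACCOUNT_ID.findall(text)


# Hypothetical sample strings mapped to the matches we expect.
samples = {
    "Balance on ACCT-ABC1234567 is $3,421.55": ["ACCT-ABC1234567"],
    "Lowercase acct-abc1234567 is not matched": [],
    "Too short: ACCT-ABC123": [],
    # The pattern has no boundary anchors, so an 11-character suffix
    # still matches on its first 10 characters.
    "Over-long: ACCT-ABC12345678": ["ACCT-ABC1234567"],
}

for text, expected in samples.items():
    assert find_account_ids(text) == expected, text
```

The last sample is the design question to settle: without anchors such as \b, part of a longer token gets masked. Whether that is the behaviour you want depends on how strict the real account-ID format is.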

Versioning. CreateGuardrailVersion snapshots the DRAFT into an immutable version. The Lambda passes guardrailIdentifier=<id> and guardrailVersion=<N> on each invocation, pinning it to a specific version; updates to the guardrail don’t affect production until the Lambda is updated to reference the new version. This is the change-control story: Legal reviews version 3, approves it, and the Lambda is updated to reference version 3.
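One way to make the pinning hard to get wrong is to build the invocation arguments in a single place and refuse anything that isn’t a numbered version. A minimal sketch: the dictionary keys mirror the parameters of boto3’s invoke_model, but the function name and the “reject DRAFT” rule are assumptions made for illustration, not something Bedrock enforces.

```python
import json

MODEL_ID = "anthropic.claude-sonnet-4-5-20250929-v1:0"


def build_invoke_kwargs(guardrail_id: str, guardrail_version: str, user_message: str) -> dict:
    """Build keyword arguments for invoke_model with a pinned guardrail version."""
    # Only numbered, immutable versions may reach production; the mutable
    # working draft is rejected here (a local rule, assumed for this sketch).
    if not guardrail_version.isdigit():
        raise ValueError(f"expected a numbered guardrail version, got {guardrail_version!r}")
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 500,
        "messages": [{"role": "user", "content": user_message}],
    }
    return {
        "modelId": MODEL_ID,
        "guardrailIdentifier": guardrail_id,
        "guardrailVersion": guardrail_version,
        "body": json.dumps(body),
    }
```

The Lambda would pass the result straight through, for example client.invoke_model(**build_invoke_kwargs(...)), with the version number coming from deployment configuration rather than from code.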

Invocation logging. Enable via PutModelInvocationLoggingConfiguration at the Region level:

{
  "loggingConfig": {
    "cloudWatchConfig": {
      "logGroupName": "/aws/bedrock/invocations",
      "roleArn": "arn:aws:iam::111122223333:role/BedrockLoggingRole"
    },
    "s3Config": {
      "bucketName": "fintech-bedrock-audit",
      "keyPrefix": "chatbot/"
    },
    "textDataDeliveryEnabled": true,
    "imageDataDeliveryEnabled": false,
    "embeddingDataDeliveryEnabled": false
  }
}

Every invocation’s full request, response, model metadata, and guardrail assessment land in both sinks. S3 is archival (Athena queryable); CloudWatch Logs is real-time (Logs Insights queryable for incident response). Both are encrypted; the S3 bucket’s default encryption uses a customer-managed KMS key that only the audit team can grant kms:Decrypt on.
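A configuration like this is easy to check in a deployment pipeline before it is applied. The sketch below is a hypothetical lint over the loggingConfig object shown above; the function name and the list of “gaps” are assumptions, and it deliberately treats an absent textDataDeliveryEnabled flag as a gap, which is stricter than the service requires.

```python
def logging_config_gaps(logging_config: dict) -> list[str]:
    """List the ways a loggingConfig falls short of the two-sink setup described above."""
    gaps = []
    if not logging_config.get("s3Config", {}).get("bucketName"):
        gaps.append("no S3 archival sink")
    cloudwatch = logging_config.get("cloudWatchConfig", {})
    if not cloudwatch.get("logGroupName"):
        gaps.append("no CloudWatch Logs sink")
    elif not cloudwatch.get("roleArn"):
        gaps.append("CloudWatch sink has no delivery role")
    # Strict on purpose: a missing flag counts as disabled in this sketch.
    if not logging_config.get("textDataDeliveryEnabled", False):
        gaps.append("text payload delivery is disabled")
    return gaps


example = {
    "cloudWatchConfig": {
        "logGroupName": "/aws/bedrock/invocations",
        "roleArn": "arn:aws:iam::111122223333:role/BedrockLoggingRole",
    },
    "s3Config": {"bucketName": "fintech-bedrock-audit", "keyPrefix": "chatbot/"},
    "textDataDeliveryEnabled": True,
    "imageDataDeliveryEnabled": False,
    "embeddingDataDeliveryEnabled": False,
}
assert logging_config_gaps(example) == []
```

It does not check encryption, which lives on the bucket and log group rather than in this object.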

CloudTrail data events. InvokeModel isn’t in management events by default; enable data events for Bedrock to capture each call’s principal, model ARN, and timestamp. Data events cost money per event but are the only way to get the “who called what” trail for high-volume model calls at the CloudTrail layer.

IAM restriction. The chatbot Lambda’s execution role has exactly one bedrock:InvokeModel permission, scoped to the Claude Sonnet model ARN and requiring the guardrail:

{
  "Effect": "Allow",
  "Action": "bedrock:InvokeModel",
  "Resource": [
    "arn:aws:bedrock:eu-west-1::foundation-model/anthropic.claude-sonnet-4-5-20250929-v1:0",
    "arn:aws:bedrock:eu-west-1:111122223333:guardrail/chatbot-customer-v1"
  ],
  "Condition": {
    "StringEquals": {
      "bedrock:GuardrailIdentifier": "arn:aws:bedrock:eu-west-1:111122223333:guardrail/chatbot-customer-v1"
    }
  }
}

That condition block is the key enforcement: the Lambda cannot invoke the model without the guardrail attached. Even if a developer accidentally removed the guardrail reference in code, IAM would deny the call.
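Because the condition is what closes the bypass, it is worth asserting in a policy review or CI step that it is actually present. A rough sketch of such a check, assuming the statement shape shown above; the function name is hypothetical, and it only understands a literal bedrock:InvokeModel action with a single StringEquals value (no wildcards, no list-valued conditions).

```python
GUARDRAIL_ARN = "arn:aws:bedrock:eu-west-1:111122223333:guardrail/chatbot-customer-v1"


def invoke_is_guardrail_pinned(statement: dict, guardrail_arn: str) -> bool:
    """True if an Allow statement for bedrock:InvokeModel requires the given guardrail."""
    actions = statement.get("Action", [])
    if isinstance(actions, str):
        actions = [actions]
    if statement.get("Effect") != "Allow" or "bedrock:InvokeModel" not in actions:
        return False
    condition = statement.get("Condition", {}).get("StringEquals", {})
    return condition.get("bedrock:GuardrailIdentifier") == guardrail_arn


statement = {
    "Effect": "Allow",
    "Action": "bedrock:InvokeModel",
    "Resource": [
        "arn:aws:bedrock:eu-west-1::foundation-model/anthropic.claude-sonnet-4-5-20250929-v1:0",
        GUARDRAIL_ARN,
    ],
    "Condition": {"StringEquals": {"bedrock:GuardrailIdentifier": GUARDRAIL_ARN}},
}
assert invoke_is_guardrail_pinned(statement, GUARDRAIL_ARN)
```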

A worked refusal

A customer asks: “Hey, I’ve got 50k saved. Should I put it in index funds or high-yield savings?”

The Lambda forwards the message to Bedrock with the guardrail attached:

$ aws bedrock-runtime invoke-model \
    --model-id anthropic.claude-sonnet-4-5-20250929-v1:0 \
    --guardrail-identifier chatbot-customer-v1 \
    --guardrail-version 3 \
    --body '{"anthropic_version":"bedrock-2023-05-31","max_tokens":500,"messages":[{"role":"user","content":"Hey, I have got 50k saved. Should I put it in index funds or high-yield savings?"}]}' \
    --cli-binary-format raw-in-base64-out \
    out.json

$ jq . out.json
{
  "stopReason": "guardrail_intervened",
  "content": [
    {"type": "text", "text": "I can't help with that request. For investment advice, please speak with a licensed advisor at 0800-..."}
  ],
  "amazon-bedrock-guardrailAction": "INTERVENED",
  "amazon-bedrock-trace": {
    "guardrail": {
      "inputAssessment": {
        "chatbot-customer-v1": {
          "topicPolicy": {
            "topics": [
              {"name": "RegulatedFinancialAdvice", "type": "DENY", "action": "BLOCKED"}
            ]
          }
        }
      }
    }
  }
}

What happened:

The input guardrail evaluated the message against the RegulatedFinancialAdvice topic. The topic’s definition (“advice to buy, sell, or hold specific securities…”) plus the examples (“What should I do with my 401k?”) gave the topic classifier what it needed to recognise this phrasing.
The classifier flagged the input as matching. Guardrails short-circuited the invocation: the model was never called.
The response body contains the configured blockedInputMessaging plus the full assessment showing which policy fired.
The Lambda received this response with stopReason: "guardrail_intervened" and rendered the configured refusal in the chat UI.
CloudTrail recorded the InvokeModel call. The invocation log wrote the full prompt, the refusal response, and the guardrail assessment to S3 under the audit bucket’s KMS key.

The customer sees a polite refusal pointing them to a real human advisor. Compliance has an auditable record that the model was not invoked with that prompt, which is a stronger position than “the model was invoked and declined.”
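On the Lambda side, treating the refusal as control flow might look like the sketch below. It assumes the parsed response body has exactly the shape shown in the transcript above; the function name and the returned dictionary are hypothetical.

```python
def handle_model_response(response: dict) -> dict:
    """Turn a parsed response body into something the chat UI can render."""
    text = "".join(
        part.get("text", "")
        for part in response.get("content", [])
        if part.get("type") == "text"
    )
    if response.get("amazon-bedrock-guardrailAction") != "INTERVENED":
        return {"kind": "answer", "text": text, "policies": []}
    # Collect the denied topics that fired, for metrics and the audit record.
    assessment = (
        response.get("amazon-bedrock-trace", {})
        .get("guardrail", {})
        .get("inputAssessment", {})
    )
    fired = [
        topic["name"]
        for per_guardrail in assessment.values()
        for topic in per_guardrail.get("topicPolicy", {}).get("topics", [])
        if topic.get("action") == "BLOCKED"
    ]
    return {"kind": "refusal", "text": text, "policies": fired}
```

Nothing here raises: a refusal is an ordinary return value with its own kind, so the UI can style it differently and the metrics pipeline can count refusals per policy.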

A worked PII redaction

Customer asks: “Can you confirm the balance on my account ACCT-ABC1234567?”

The Lambda has tool-use wired up: it calls an internal API to look up the balance, includes the result in the prompt context, and asks the model to produce a natural-language response. The model generates:

The balance on account ACCT-ABC1234567 is $3,421.55 as of today.

The output guardrail evaluates. The InternalAccountId regex matches ACCT-ABC1234567 with action ANONYMIZE. The returned content:

The balance on account {ACCOUNT} is $3,421.55 as of today.

The application layer then looks at the original session context, confirms the customer is authenticated and authorised for that specific account, and renders “your account ending in 4567” in the UI. The guardrail doesn’t need to know which account number is OK to show which customer; it just ensures the raw internal identifier never reaches the rendered chat log. The application, which has the authz context, substitutes a friendly form.

This is the key pattern: Guardrails enforce a structural invariant (“no internal account IDs in output”); the application layer enforces contextual authorisation (“this customer can see a reference to their account”). The two compose.
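The application half of that composition can be sketched in a few lines. This assumes the placeholder arrives as the literal text {ACCOUNT}, as in the example above; the function name, the shape of the session data, and the fallback wording are hypothetical.

```python
PLACEHOLDER = "{ACCOUNT}"  # placeholder text as it appears in the worked example


def render_for_customer(masked_text: str, requested_account: str, authorised_accounts: set[str]) -> str:
    """Swap the guardrail's placeholder for a form this customer is allowed to see."""
    if PLACEHOLDER not in masked_text:
        return masked_text
    if requested_account not in authorised_accounts:
        # The session is not authorised for this account: reveal nothing.
        return masked_text.replace(PLACEHOLDER, "[hidden]")
    return masked_text.replace(PLACEHOLDER, f"ending in {requested_account[-4:]}")


masked = "The balance on account {ACCOUNT} is $3,421.55 as of today."
print(render_for_customer(masked, "ACCT-ABC1234567", {"ACCT-ABC1234567"}))
# prints: The balance on account ending in 4567 is $3,421.55 as of today.
```

The account ID used for the friendly form comes from the authenticated session and the customer’s own request, never from the model output, which by this point no longer contains it.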

What’s worth remembering

Responsible AI on Bedrock is three layers, not one. Real-time enforcement (Guardrails), audit persistence (invocation logging + CloudTrail), and identity/encryption (IAM + KMS). All three are needed for a defensible compliance story.
Guardrails has five policy types. Denied topics, content filters (including prompt attacks), word filters, sensitive-information filters (PII + user regex), and contextual grounding/relevance. Each can apply to input, output, or both.
PII filters can block or anonymize. Block stops the invocation; anonymize replaces the matched text with a tag like [EMAIL] or a user-defined placeholder. Choose per PII type: card numbers block, account references anonymize.
Guardrails are versioned and pinned per invocation. Create, version, reference a specific version in the invocation. Updates don’t affect production until the caller is updated. This is change control for model behaviour.
Model invocation logging captures the full payload. Prompt, response, guardrail assessment, metadata, to S3 or CloudWatch Logs, encrypted under a customer-managed KMS key. The durable audit artefact.
CloudTrail data events for Bedrock give the “who called what” trail. Not on by default. Pair with invocation logging for the full picture.
IAM conditions enforce guardrail usage. A policy that requires bedrock:GuardrailIdentifier to equal a specific guardrail ARN makes it impossible to invoke the model without the guardrail; bypassing guardrails requires changing IAM, which has its own audit trail.
Guardrails enforce structure; the application enforces context. Guardrails keep raw account numbers out of output. The app layer, which has the authenticated session, decides which anonymized references to show which customer. The two compose; neither alone is sufficient.

A chatbot that refuses financial advice, redacts account numbers, and produces an audit trail a regulator would accept isn’t one feature; it’s five Bedrock features configured together, plus IAM and KMS around them. The craft is knowing which feature answers which compliance question, and wiring the configuration so no obvious bypass exists (no guardrail-less invocation path, no unencrypted log sink, no overly broad IAM). Get the composition right once and the chatbot is defensible; miss a layer and the auditor has a question with no good answer.

Before the Transformer

2026-05-16T06:00:00+08:00

Your phone just suggested the word “tomorrow” before you finished typing “see you to”. That suggestion didn’t come from a transformer. It came from a model that fits in 50KB, runs in microseconds, and is older than the smartphone you’re holding.

A lot of working software still runs on the AI that came before the AI. This post is about that AI.

In the previous post we looked at what might come after the transformer. This post goes the other way. Before BERT, before word2vec, before deep learning was the default, NLP ran on a small set of statistical and probabilistic models that did genuinely useful work, some of which they still do, today, in places where the cost or latency or interpretability of a transformer would be wrong.

These aren’t museum pieces. They’re production tools. You should know about them because they’re often the correct answer, especially for problems with tight latency budgets, small datasets, or auditability requirements.

n-gram language models

An n-gram model is a language model in the most literal sense: it estimates the probability of the next word given the previous n-1 words.

A bigram model (n=2) estimates P(word | previous word). Take “The cat sat on the ___”: if the model has seen “the mat” often enough in its training data, it estimates a high probability for “mat” given “the”. A trigram model uses two preceding words. A 5-gram model uses four.

The model itself is just a giant table of counts: count how many times each n-gram appeared in your training corpus, divide by the count of the prefix, and that’s your probability estimate. No neural network. No gradient descent. No GPU. Just a hash table.
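The count-and-divide recipe fits in a dozen lines. A minimal sketch of a bigram model over a made-up three-sentence corpus; the function names and the corpus are illustrative only.

```python
from collections import Counter, defaultdict


def train_bigram(sentences):
    """Count how often each word follows each previous word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def prob(counts, prev, nxt):
    """P(nxt | prev) = count(prev, nxt) / count(prev followed by anything)."""
    followers = counts.get(prev, Counter())
    total = sum(followers.values())
    return followers[nxt] / total if total else 0.0


corpus = [
    "the cat sat on the mat",
    "the dog sat on the mat",
    "the cat lay on the rug",
]
model = train_bigram(corpus)
print(prob(model, "the", "mat"))       # 2 of the 6 words seen after "the" are "mat"
print(prob(model, "the", "aardvark"))  # 0.0: never seen, so no probability mass
```

The second result is the sparsity problem in miniature: anything the table has not literally seen gets a probability of zero.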

This sounds laughably primitive in 2026. It’s also how Google’s mobile keyboard worked for years, how speech recognition worked for years, and how machine translation worked for years, and the n-gram model was state of the art at all three.

Why n-gram models still ship

Three reasons.

First, they’re tiny. A 5-gram model trained on a few million words of domain-specific text fits in megabytes. It runs on a phone, on an embedded device, in a process that wakes up for one millisecond at a time.

Second, they’re fast. Lookup is a single hash-table query. The latency is nanoseconds. There’s no model to load, no GPU to wait for.

Third, they’re deterministic and auditable. If your spam filter or autocomplete makes a mistake, you can find out exactly which n-gram triggered the decision and which counts produced the probability. There’s no opaque embedding to introspect.

The trade-off is severe: n-gram models can’t generalise beyond what they’ve literally seen. “The dog sat on the mat” might be a familiar pattern; “The aardvark sat on the mat” is brand new and the model has nothing useful to say. They suffer from the sparsity problem: most plausible n-grams never appear in the training data at all, even with a large corpus.

A lot of the cleverness in classical n-gram modelling went into smoothing techniques (Kneser-Ney, Good-Turing) that estimate plausible probabilities for n-grams the model never saw, by backing off to shorter n-grams. These methods are mature and well-understood, and they’re still the foundation of fast statistical models for autocompletion, predictive text, and parts of speech recognition pipelines.
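Kneser-Ney and Good-Turing are too involved for a short listing, so the sketch below swaps in a much simpler scheme, usually called stupid backoff: use the bigram estimate when the bigram was seen, otherwise fall back to a discounted unigram frequency. It produces scores rather than true probabilities, and the 0.4 discount and the toy corpus are assumptions for illustration.

```python
from collections import Counter


def train(sentences):
    """Count unigrams and bigrams over whitespace-tokenised sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def backoff_score(unigrams, bigrams, prev, word, alpha=0.4):
    """Bigram relative frequency if seen, else a discounted unigram frequency."""
    if bigrams[(prev, word)] > 0:
        return bigrams[(prev, word)] / unigrams[prev]
    # Back off to the shorter n-gram: how common is the word on its own?
    return alpha * unigrams[word] / sum(unigrams.values())


unigrams, bigrams = train(["the dog sat on the mat", "the aardvark slept"])
print(backoff_score(unigrams, bigrams, "dog", "sat"))       # seen bigram
print(backoff_score(unigrams, bigrams, "aardvark", "sat"))  # unseen, backs off
```

“aardvark sat” never appears in the corpus, but it no longer scores zero: the model falls back on how common “sat” is overall.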

Where you’ll find them

Mobile autocomplete and predictive text in keyboards that need to run offline.
Speech recognition language models: the acoustic part is now neural, but a fast n-gram language model is often the rescoring layer that picks between candidate transcriptions.
Spell checkers and grammar checkers, especially for languages where there isn’t a large neural model available.
Search query understanding for tail queries where you want a fast statistical signal, not a 200ms LLM round trip.

Hidden Markov Models

A Hidden Markov Model (HMM) is the next conceptual rung up. It models a sequence of observations that are generated by an underlying sequence of hidden states, where each state depends only on the previous state and each observation depends only on the current state.

The classical example: part-of-speech tagging. The observation sequence is the words you can see. The hidden sequence is the part-of-speech tag for each word (noun, verb, adjective). The HMM models two things:

Transition probabilities: how likely is each tag to follow each other tag? (e.g. determiners are often followed by nouns)
Emission probabilities: how likely is each word to be generated by each tag? (e.g. “run” can be a noun or a verb, with different probabilities for each)

Given a sentence, you find the most likely sequence of tags by running the Viterbi algorithm, a dynamic programming procedure that’s been the standard textbook example since the 1970s.
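A compact version of Viterbi for a three-tag toy tagger is sketched below. Every probability in it is invented for illustration, and it multiplies raw probabilities for readability; real implementations work in log space to avoid underflow on long sentences.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for words under the HMM."""
    # best[i][tag] = (probability of the best path ending in tag at word i, previous tag)
    best = [{t: (start_p[t] * emit_p[t].get(words[0], 0.0), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            column[t] = max(
                (best[-1][p][0] * trans_p[p][t] * emit_p[t].get(word, 0.0), p)
                for p in tags
            )
        best.append(column)
    # Trace back from the most probable final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))


# Toy parameters, made up for this example.
TAGS = ["DET", "NOUN", "VERB"]
START = {"DET": 0.8, "NOUN": 0.15, "VERB": 0.05}
TRANS = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.2, "VERB": 0.7},
    "VERB": {"DET": 0.5, "NOUN": 0.3, "VERB": 0.2},
}
EMIT = {
    "DET": {"the": 0.9},
    "NOUN": {"dog": 0.5, "run": 0.1, "runs": 0.1},
    "VERB": {"run": 0.4, "runs": 0.4},
}

print(viterbi("the dog runs".split(), TAGS, START, TRANS, EMIT))  # ['DET', 'NOUN', 'VERB']
print(viterbi("the run".split(), TAGS, START, TRANS, EMIT))       # ['DET', 'NOUN']
```

The second call shows the transition probabilities doing their job: “run” is more likely to be emitted by VERB than by NOUN, but after a determiner the noun reading wins.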

HMMs were the dominant approach to:

Part-of-speech tagging, until CRFs (next section) and then neural taggers replaced them.
Speech recognition acoustic modelling, until deep learning replaced them in the early 2010s.
Bioinformatics gene prediction, where they’re still widely used because biology has structural assumptions that match HMMs well.
Chunking and shallow parsing.

Why HMMs still ship

Two reasons.

First, biology. Genes have a structure that maps cleanly onto hidden states (intron, exon, promoter, terminator) and HMMs have decades of biological tuning baked into them. Tools like HMMER for protein sequence analysis are everywhere in computational biology, and they’re not getting replaced by transformers any time soon.

Second, speed and tractability for low-resource languages. Training a neural POS tagger requires a lot of labelled data and a lot of compute. Training an HMM tagger requires hundreds of labelled sentences and a laptop. For low-resource language pipelines, an HMM is often the actual production tool.

Conditional Random Fields

A Conditional Random Field (CRF) is the more flexible cousin of the HMM. The idea: instead of modelling the joint probability of observations and hidden states (HMM-style, which makes strong independence assumptions), model the conditional probability of the hidden states given the observations directly.

In practice this lets you incorporate arbitrary features, not just “the current word” but “is the current word capitalised?”, “does it end in -ing?”, “is the previous word ‘to’?”, “what’s the gazetteer match?”, without breaking the model’s mathematical structure. CRFs work by combining many weak features through learned weights, much like logistic regression for sequences.
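The features named above translate almost directly into code. This sketch only builds the per-token feature dictionaries, a format that CRF toolkits such as sklearn-crfsuite accept; the feature names, the example sentence, and the one-entry gazetteer are hypothetical, and training the CRF itself is left out.

```python
def token_features(tokens, i, gazetteer=frozenset()):
    """Hand-crafted features for the token at position i."""
    word = tokens[i]
    return {
        "word.lower": word.lower(),
        "word.is_capitalised": word[:1].isupper(),
        "word.ends_in_ing": word.lower().endswith("ing"),
        "prev.is_to": i > 0 and tokens[i - 1].lower() == "to",
        "in_gazetteer": word.lower() in gazetteer,
        "is_first": i == 0,
    }


def sentence_features(tokens, gazetteer=frozenset()):
    """One feature dict per token, in order."""
    return [token_features(tokens, i, gazetteer) for i in range(len(tokens))]


places = frozenset({"perth"})
features = sentence_features(["Flying", "to", "Perth", "tomorrow"], places)
print(features[2])
```

Each feature is weak on its own. The CRF’s job is to learn a weight for every feature-and-label pair, plus weights for label-to-label transitions, and to combine them across the whole sentence.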

CRFs were the standard for sequence labelling tasks from the 2000s into the 2010s:

Named entity recognition (people, places, organisations, dates).
Information extraction from semi-structured text.
Slot filling in dialogue systems.
Biomedical entity tagging (gene names, drug names, diseases).

The classical pipeline (hand-craft good features, train a CRF on labelled data, deploy) produced systems that ran on CPUs at thousands of sentences per second with high accuracy. Many production NER systems still run a CRF either as the primary tagger or as a final layer on top of a neural model.

When a CRF still wins

CRFs are a good answer when:

You have moderate amounts of labelled data (say, 1k-10k sentences): enough to learn meaningful weights, not enough to fine-tune a transformer well.
You need high precision on a fixed set of labels: regulatory keyword matching, structured-record extraction, controlled vocabularies.
You need to explain decisions: which features contributed to which label.
Latency matters: a CRF tagger runs in microseconds per sentence, a transformer NER model in milliseconds.
You’re working in a specialised domain with idiosyncratic vocabulary (medical, legal, scientific), where hand-crafted features encode domain knowledge that a generic transformer doesn’t have.

Word embeddings: the bridge

Between the n-gram era and the transformer era there was a brief but enormously influential phase where word embeddings became the primary research tool. Word2Vec (Mikolov et al., Google, 2013) and GloVe (Stanford, 2014) trained dense vectors for words that captured semantic relationships, the famous “king - man + woman = queen” arithmetic.
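The arithmetic is easier to see with vectors small enough to read. The two-dimensional vectors below are hand-made (one axis loosely “royalty”, the other loosely “gender”), not trained embeddings; real word2vec or GloVe vectors have hundreds of dimensions and the analogy only holds approximately.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def analogy(vectors, a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Toy vectors, invented for illustration: (royalty, gender).
toy = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "prince": (0.8, 1.0),
    "apple": (-1.0, 0.1),
}

print(analogy(toy, "king", "man", "woman"))  # queen
```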

These models are no longer state of the art, but their descendants live everywhere. Modern sentence embeddings (BGE, E5, see The Other Transformers) are direct conceptual descendants. Many smaller production NLP systems still use word2vec-style embeddings as a fast feature backbone, sometimes feeding into a CRF or a small classifier rather than a transformer.

If you’ve ever loaded pretrained vectors through gensim’s KeyedVectors or used GloVe vectors as a baseline before reaching for a transformer, you’ve used this generation of model.

A decision table

If your task is…	Reach for…	Why not a transformer?
Predictive text on an offline device	A 5-gram language model with smoothing	Megabytes vs gigabytes; nanosecond latency
Rescoring speech recognition hypotheses	An n-gram LM	Streaming + low latency requirements
Finding members of a protein family in sequence data	A profile HMM (HMMER)	Decades of domain tuning; biological structure matches the model
POS tagging a low-resource language	An HMM with a small labelled corpus	No transformer pre-training in that language
Extracting drug names from clinical notes	A CRF with hand-crafted features and a gazetteer	High precision; auditability; low latency on a CPU
Building a chatbot	A transformer LLM	n-grams and HMMs cannot generate fluent multi-turn text
Understanding ambiguous, context-rich queries	A transformer	Classical models struggle with long-range context

The story of NLP often gets told as a march of progress where each new generation makes the previous one obsolete. The actual picture is more layered. n-gram models still suggest the next word on your phone, still rescore speech-recognition hypotheses, still run inside spell checkers because they fit in megabytes and answer in nanoseconds. HMMs still dominate computational biology because gene structure maps cleanly onto hidden states and decades of domain tuning don’t transfer to a transformer overnight. CRFs are still the right answer when you have a thousand labelled sentences, a regulated domain, and a need to explain every decision the system makes.

Pre-transformer doesn’t mean obsolete. It means a different cost-benefit curve. The classical tools win where their curve dominates: on devices that can’t load a GPU, on languages without pre-training, on tasks that need to run in microseconds, on auditors who want to see the features and the weights. Reach for a transformer when you need the long-range context and the generative fluency. Reach for one of these when you don’t.

The Workshop: Assumption Mapping

2026-05-15T06:00:00+08:00

Assumption Mapping ranks the beliefs underneath a plan by risk and evidence so you test the dangerous ones first, cheaply, before they’re baked into the code. Worked example: Testing What You Believe.

Assumption Mapping

Assumption Mapping surfaces the beliefs hiding underneath a plan, plots them by how much evidence supports them and how badly the plan fails if they’re wrong, then picks a short list of assumptions to test before committing resources. Sometimes called the risk/evidence grid or the assumptions grid. A close cousin is hypothesis mapping (same shape, different labels). Popularised by David Bland as part of the Testing Business Ideas canon, building on earlier work by Giff Constable, Tom Chi, and the broader Lean Startup community. The 2x2 layout of evidence against importance is the artefact most people mean when they say “assumptions workshop.”

Bland’s canonical labels for the axes are Important / Unimportant (vertical) and Has evidence / No evidence (horizontal); the prioritised quadrant is top-left: important + no evidence = leap of faith. We use “impact if wrong” on the vertical axis instead of “important” because it forces the failure-mode question (what breaks if this turns out to be false?) but the placement and the priority are the same.

At a glance

Who, for how long: a facilitator, the product owner or initiative lead, one or two developers, a designer or researcher, and a business stakeholder. Four to six people, around 90 minutes.
What you walk out with: a populated 2x2 grid of named assumptions, and a short list of leap-of-faith assumptions in the top-left, each with a cheap test, an owner, a due date, and the result that would change the plan.
When to reach for it: you’re about to commit significant effort to a new product, feature, or initiative and want to separate the beliefs from the facts before you build. Not for low-risk work, awareness-raising without a decision on the table, or a plan the team can’t yet articulate (run Impact Mapping or Story Mapping first).

What’s It For

A team spends six weeks building a pause-and-resume flow for subscriptions. The flow ships. Adoption is low. The team investigates and discovers that subscribers don’t want to pause; they want to skip a week. Pause is a feature the product owner imagined subscribers needed, based on a conversation with two subscribers, one of whom was actually describing a skip. The team built the wrong thing, beautifully, for six weeks.

The assumption that “subscribers want to pause” was never identified as an assumption; it was treated as a fact. Because nobody had named it as a belief, nobody thought to test it. Because nobody tested it, the whole six-week build rested on a guess that would have cost two weeks of user research to validate.

This is the universal shape of the failure. Every plan is a stack of beliefs. Some of the beliefs are tested and solid; some are tested and wrong; some are untested and dangerous; and some are untested and cheap to recover from. A team that can’t see the difference treats all the beliefs the same way, which means they treat the dangerous untested ones like the solid tested ones, and they find out too late.

Assumption Mapping exists to make the beliefs visible and to separate them by how much damage they do if wrong. The grid is the forcing function: you can’t pretend an untested belief is solid when you’re looking at a note in the top-left quadrant of a whiteboard everyone is standing in front of.

Reach for it when:

You’re about to commit significant effort to a new product, feature, or initiative
The team is confident and you suspect the confidence is resting on beliefs that haven’t been checked
You’ve just finished an Impact Map, Story Map, or Business Model Canvas and want to push on the underlying beliefs
A decision feels high-stakes and you haven’t separated the reversible assumptions from the irreversible ones
An initiative has stalled and you want to know whether to continue or pivot

What It’s Not For

Skip it when:

The work is small and low-risk enough that the session costs more than the work itself
You’ve already validated the key assumptions through recent user research or experiments
The team can’t yet articulate what they’re building (run Impact Mapping or Story Mapping first)
There’s no actual decision on the table (Assumption Mapping is a pre-commitment tool, not a general awareness-raising exercise)

Stop a session that’s already started if:

The plan isn’t concrete enough for assumptions to attach to
The room is performing confidence and refusing to engage with the evidence question
The top-left quadrant is empty after twenty minutes; that’s not safety, that’s denial

Stopping and fixing the plan is not failure. Plotting assumptions about a plan that doesn’t exist is.

The session has real costs to weigh against the benefits. What you get: hidden beliefs made visible and explicit; a short list of cheap experiments that de-risk the plan within a week; decisions to commit made with clear eyes (“we know what we don’t know”); an artefact (the grid) that can be revisited as tests come in and assumptions move right or get invalidated; a team that starts treating “we believe” and “we know” as different statements. What it costs: 6 to 9 person-hours per session with 4 to 6 people; the follow-up work of actually running the tests, without which the session is just a wall of colourful worries; discomfort, because the session is designed to make confident people uncertain and that is hard on teams that reward confidence; and a recurring cost, because the grid needs to be run before any significant commitment, not just once.

The common failure modes are worth naming up front: the grid gets produced and then ignored because the team commits anyway; tests are scoped so large they become builds, defeating the point; the session becomes a generic worry exercise instead of focused assumption-testing; the team treats “we all agree this is true” as evidence, when agreement is not the same as evidence; one person dominates placement and the grid reflects their risk appetite, not the team’s.

Definitions & Background

The desirability / viability / feasibility lens. Every assumption tends to be one of three kinds:

Desirability: do customers want it? Will they choose it? Will they keep choosing it?
Viability: can we sustain a business doing it? Margins, churn, acquisition cost, regulation.
Feasibility: can we actually build it? Skills, time, infrastructure, integrations.

Tag each assumption with D / V / F before plotting. A leap-of-faith cluster on desirability is a different intervention from one on feasibility: D-leaps need customer interviews; V-leaps need spreadsheet modelling and small commercial tests; F-leaps need spikes. The grid plots all three the same way; the experiment design differs.
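As a rough illustration of the tagging step, here is a minimal Python sketch that groups notes by lens and looks up the matching style of test. The names `TEST_STYLE`, `group_by_lens`, and `tagged_notes` are ours, and the tags on the sample notes are guesses for illustration, not the result of a real session.

```python
from collections import defaultdict

# Test styles per lens, taken from the paragraph above.
TEST_STYLE = {
    "D": "customer interviews",
    "V": "spreadsheet modelling and small commercial tests",
    "F": "spikes",
}


def group_by_lens(tagged_notes):
    """Group (tag, note_text) pairs by their D / V / F tag."""
    groups = defaultdict(list)
    for tag, text in tagged_notes:
        if tag not in TEST_STYLE:
            raise ValueError(f"unknown lens tag: {tag!r}")
        groups[tag].append(text)
    return dict(groups)


# Illustrative notes; the tags are assumptions made for this example.
tagged_notes = [
    ("D", "We believe subscribers want to pause their box when they go on holiday"),
    ("F", "We assume the API is fast enough for our load"),
    ("F", "We assume we can roll back the migration cleanly if it fails"),
]

for tag, notes in group_by_lens(tagged_notes).items():
    print(f"{tag}: {len(notes)} note(s), test with {TEST_STYLE[tag]}")
```

Grouping before plotting makes a cluster visible early: if most of the notes land under one tag, that tells you which kind of experiment the follow-up week will be dominated by.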

Inputs

Something concrete to test assumptions about. An Impact Map, a Story Map, a Business Model Canvas, or a one-page product brief. The plan is what makes the assumptions findable; without a plan, the session produces generic worries instead of specific beliefs.

You also need:

A 2x2 grid drawn on a wall or whiteboard, with evidence on the horizontal axis and impact-if-wrong on the vertical
Sticky notes and markers for silent generation
Wall space for clustering before plotting
Dot stickers (optional) for the prioritisation vote
A 90-minute slot with the right people in the room (see Who’s Needed)

Outputs

What lands on the wall at the end:

A populated 2x2 grid with every named assumption placed in a quadrant. The top-left quadrant (high impact, no evidence) is what the session exists to surface; everything else is context for it.
A short list of leap-of-faith assumptions to test first, each with: the proposed test, the owner, the due date, and the result that would change the plan.
A list of “we already know” assumptions parked in the bottom-right, useful for new joiners reading later.
Open assumptions to escalate: ones the team can’t test because they depend on leadership decisions or external factors.

Photograph the grid with every note readable and the quadrants clear before the notes come down.

These outputs feed straight into:

Impact Mapping: every impact on an Impact Map is an assumption about actor behaviour. Run Assumption Mapping on an Impact Map and the whole middle column becomes testable.
Business Model Canvas: a Canvas is nine boxes of assumptions. Assumption Mapping is the natural follow-up, especially on Revenue Streams and Cost Structure.
User Story Mapping: the release-1 slice of a Story Map rests on assumptions about what users actually need. Running Assumption Mapping on the slice tells you which tasks to validate before building.
Jobs to be Done: switch interviews surface beliefs about why customers hire (or fire) a product. The desirability assumptions on the grid, the ones that sit in the top-left because nobody has actually asked, are exactly what a JTBD interview round is designed to test.
Wardley Mapping: Wardley Mapping surfaces assumptions about component evolution and competitive position that Assumption Mapping can then test.
Threat Modelling: Threat Modelling surfaces security assumptions (“we assume the auth token can’t be forged”) that belong on the grid the same way product assumptions do.

Who’s Needed

Four to six people, around 90 minutes:

Facilitator. Runs the clock, moderates placement debates on the grid, intervenes when “evidence” drifts into “opinion.”
Product owner or initiative lead. Mandatory. They made most of the assumptions, consciously or not, and they’ll be the one deciding which tests to fund.
Developers. At least one, ideally two. They’ll catch the technical assumptions the business-side people don’t know to question: integration feasibility, scale limits, data availability.
Designers and researchers. They’ll catch the user-behaviour assumptions and, critically, they’ll know which of the “we know subscribers want X” claims have actually been researched and which are folklore.
Business stakeholders. Someone who can talk about pricing, margin, market, and competitive assumptions. Without them, the grid is thin on the commercial side, which is often where the dangerous assumptions live.
Operations / SRE (Site Reliability Engineering). For technical initiatives (migrations, platform rewrites, reliability projects) ops carries the assumptions about production behaviour that the feature team doesn’t know. “We assume we can cut over with no more than five minutes of downtime” is a foundational assumption on a migration, and only the on-call engineer knows what it would actually take to test.

Assumption Mapping is a debate room. Fewer than four and you lose productive disagreement; more than six and the placement arguments on the grid take longer than the session.

Who to leave out:

People who weren’t involved in making the plan. They don’t hold the assumptions. Their presence produces abstract concerns instead of the specific beliefs you’re trying to surface.
Large stakeholder groups. If seven people need to weigh in, run a pre-session with them to agree the assumption list, then run the mapping session with the smaller group.
Observers. Same rule as the other workshops: observers warp the room.

How To Run It

Phase	Duration	Materials	Key question
Orient on the plan	10 min	Plan artefact visible	“What are we testing the assumptions of?”
Generate assumptions	20 min	Yellow notes, silent	“What has to be true for this plan to work?”
Share and cluster	15 min	Wall space	“Which of these are the same belief?”
Plot on the grid	25 min	2x2 grid	“How much evidence? What breaks if we’re wrong?”
Prioritise testing	10 min	Dot votes or marks	“Which do we test first, and how?”
Wrap-up, owners	10 min	None	“Who owns which test, and by when?”
Total	~90 minutes

The 2x2 grid has evidence on the horizontal axis (left is “no evidence, we’re guessing”; right is “strong evidence, we’ve tested this”) and impact on the vertical axis (bottom is “low impact if wrong”; top is “high impact if wrong, the whole plan fails”). Quadrants:

Top-left: Test these first: high impact, no evidence. The dangerous ones.
Top-right: Monitor: high impact, but we have evidence. Keep watching.
Bottom-left: Test if time allows: low impact, no evidence. Not urgent.
Bottom-right: Known: low impact, strong evidence. Stop worrying.

The top-left quadrant is where the session earns its keep. Everything else is context for it.
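The quadrant rules can be sketched in a few lines of Python, assuming each axis is reduced to a yes/no answer (the post says quadrant matters, exact position doesn't). The `Assumption` and `Quadrant` types are hypothetical names for this sketch, and the placement of the two sample notes is invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Quadrant(Enum):
    TEST_FIRST = "top-left: test these first"
    MONITOR = "top-right: monitor"
    TEST_IF_TIME = "bottom-left: test if time allows"
    KNOWN = "bottom-right: known"


@dataclass
class Assumption:
    text: str
    has_evidence: bool  # horizontal axis: False = left, True = right
    high_impact: bool   # vertical axis: True = top, False = bottom


def quadrant(a: Assumption) -> Quadrant:
    """Map the two yes/no answers onto one of the four quadrants."""
    if a.high_impact:
        return Quadrant.MONITOR if a.has_evidence else Quadrant.TEST_FIRST
    return Quadrant.KNOWN if a.has_evidence else Quadrant.TEST_IF_TIME


# Sample placements are assumptions made for this example.
notes = [
    Assumption("We believe subscribers want to pause their box when they go on holiday",
               has_evidence=False, high_impact=True),
    Assumption("We assume we can hire a second developer by June",
               has_evidence=False, high_impact=False),
]

leaps_of_faith = [n.text for n in notes if quadrant(n) is Quadrant.TEST_FIRST]
print(leaps_of_faith)
```

Booleans are a deliberate simplification: they mirror the facilitator's "Which quadrant? Pick." and leave no room for arguing about 60% versus 70% on an axis.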

Silent then loud

Assumption Mapping alternates between silent generation and open debate. The shape matters:

Generation is silent because talking first produces groupthink. One confident voice saying “obviously subscribers want this” suppresses the three people who would have written assumption notes about it.
Sharing is round-the-room so every person reads their notes aloud, even when several are duplicates. Duplicates are valuable; they tell you which assumptions are shared across the room and which are one person’s worry.
Plotting is loud on purpose. The grid placement debate is where the session earns its cost. “That’s low-impact” / “No it isn’t, if that’s wrong the whole plan dies” is the conversation you came to have.
Prioritising is decisive. The facilitator’s job at the end is to force commitment: each top-left assumption gets a test, an owner, and a date, or it doesn’t leave the room.

The key rhythm is write silently, share completely, argue loudly, commit sharply.

Phase 1: Orient on the plan (10 minutes)

Put the plan artefact where everyone can see it. The Impact Map, the Story Map, the Canvas, or a printed one-page brief. If there’s no artefact, write a one-paragraph description on a flip chart. Then read it aloud:

“Here’s the plan we’re putting under pressure today. Not whether the plan is right. Whether the beliefs underneath it are true. Our job is to find the assumptions this plan is standing on, plot them, and decide which ones to test before we commit further.”

Then frame the session:

“The point isn’t to debunk the plan; it’s to find the parts where we’ve been treating beliefs as facts. By the end of ninety minutes we’ll have a short list of beliefs worth testing in the next week. If the beliefs survive the tests, we commit harder. If they don’t, we’ve saved ourselves a month of building the wrong thing.”

This matters. Teams often arrive defensive. Framing the session as finding the beliefs rather than attacking the plan gets you the surfacing you need.

What to watch for:

Defensive framing. The product owner hears “pressure-test the plan” as “attack the plan.” Reframe: “This session exists because we take this plan seriously. We wouldn’t bother putting a plan we didn’t care about under this much pressure.”
No concrete plan. If the artefact is actually “we want to grow the business,” the session cannot run. Schedule Impact Mapping or Business Model Canvas first.

Phase 2: Generate assumptions (20 minutes)

Hand out sticky notes and markers. Set a timer for fifteen minutes. Give the one instruction:

“Write silently. One assumption per note. Use the framing ‘We believe that…’ or ‘We assume that…’. For example, ‘We believe subscribers want to pause their box when they go on holiday.’ Or ‘We assume we can hire a second developer by June.’ Don’t hold back. Half-formed beliefs are exactly what we’re here for. I’d rather you write thirty notes and we throw ten away than write ten and miss twenty.”

Prompt with categories if the room gets stuck:

“User beliefs: what do we assume subscribers want, or how we assume they’ll behave? Technical beliefs: what do we assume we can build, integrate with, or scale to? Business beliefs: pricing, margins, costs, churn, suppliers. Team beliefs: who we’ll hire, what the team can learn, how fast we can move. Market beliefs: competitors, regulations, timing.”

Silent writing for fifteen minutes. No talking. You’re looking for 15 to 30 assumptions from a 4 to 6 person room. Fewer than 15 and people are being cautious; more than 40 and you have a clustering problem in phase 3.

What to watch for:

Assumptions framed as facts. “Subscribers want a weekly delivery.” Someone writes that as a statement of truth. Challenge at the share: “How do we know that? Have we asked? Who? When?”
Too few assumptions. Push at the ten-minute mark: “What about pricing? Timing? Team capacity? Competitors? Regulations? Failure modes? What assumption would embarrass us most if it turned out to be wrong?”
Risks dressed as assumptions. “The API might be slow.” That’s a risk. The assumption is “We assume the API is fast enough for our load.” Reframe as you share.
Someone not writing. They may be overthinking or stuck. Quiet prompt: “What’s the thing you’re most worried about in this plan? Write that down. It counts.”
Deployment and reliability assumptions. For technical plans, the silent writing should produce notes like “We assume we can cut over in a five-minute maintenance window,” “We assume our canary (a small percentage of traffic routed to the new version before the rollout goes wide) is sensitive enough to catch regressions,” “We assume we can roll back the migration cleanly if it fails.” These are foundational and often unwritten.

Phase 3: Share and cluster (15 minutes)

Go round the room. Each person reads their assumptions aloud, one at a time, and places them on a blank section of the wall, not the grid yet. As notes go up, cluster similar assumptions physically together.

“As you read yours, if one of mine feels like the same belief, say so and we’ll stack them. If it’s close but distinct, we keep both.”

Clustering is a light touch, not a merge. “Subscribers will pay our headline price” and “Our pricing is competitive” are related but test differently; keep both. “Subscribers want weekly delivery” and “Subscribers prefer weekly over fortnightly” are the same belief; stack them.

Remove exact duplicates. Resist the urge to rewrite notes for clarity; the exact wording often carries the specific concern that made someone write it.

What to watch for:

Dismissing assumptions too quickly. Someone says “oh, we know that’s true” about an untested belief. Challenge: “What evidence? If the answer is ‘it’s obvious,’ that’s not evidence.”
Long debates about wording. Pick one phrasing and move on. The placement on the grid matters more than the exact text.
Clustering too aggressively. If you merge too many assumptions, you lose nuance. Keep clusters small: two or three notes maximum per cluster.
The “we already know” trap. The team dismisses half the assumptions as known. For each dismissed one, ask: “If I asked the CEO the same question, would they give the same answer? What about a new team member?” If the answer isn’t confidently yes, it’s not as known as it feels.

Phase 4: Plot on the grid (25 minutes)

Move to the 2x2 grid. Take each assumption (or cluster) and place it on the grid. For each one, the team debates:

“How much evidence do we actually have for this belief? Not ‘it feels true’: what concrete evidence? User research? Past experiments? Existing data? Or are we guessing?”

“If we’re wrong about this, what happens? Do we adjust a feature, or does the plan fall apart?”

Place the note where the debate settles. Exact position on the grid doesn’t matter; quadrant matters.

This phase produces the most valuable conversations in the session. Disagreement is productive; it reveals different levels of confidence across the team. When two people disagree about whether an assumption is high or low impact, they’re disagreeing about what the plan actually is.

What to watch for:

Everything in the top-left. If every assumption lands in high-impact-no-evidence, the team is either being dramatic or the plan really is that risky. Look for assumptions that can move right with minimal testing, and look for assumptions that are actually lower-impact than they feel.
Nothing in the top-left. If nothing is high-impact-untested, the team is overconfident. Challenge the top-right items: “Is that really evidence, or is that a strong opinion?” Push assumptions left until the team flinches.
Arguing about exact placement. “Is it at 60% or 70% on the evidence axis?” Interrupt: “The grid isn’t precise. Which quadrant? Pick.”
Silent placement. If people are placing notes without discussion, slow down: “Why does that belong in the top-right? What’s our evidence? Let me hear it.”
The compound assumption. “We assume subscribers want to pause, and that they’ll pay more for the feature, and that we can build it in two weeks.” That’s three assumptions. Split them; each one plots differently.

Phase 5: Prioritise testing (10 minutes)

Focus on the top-left quadrant. These are your leap-of-faith assumptions: high impact, low evidence. The ones that could sink the plan.

For each assumption in the top-left, briefly discuss:

“How could we test this cheaply and quickly? Not a full build. A landing page, a prototype, a handful of interviews, a manual version of the feature. What’s the cheapest thing we could do in the next week that would tell us something?”

“Who owns running the test? When do we want the answer?”

“What result would change the plan?”

If you have dot stickers, give each person three dots and vote on which top-left assumptions to test first. The ones with the most dots are the immediate priorities.
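If the vote happens on a remote board or gets tallied afterwards, the count is simple enough to sketch in Python. This version assumes a person may put more than one dot on the same note, and it keeps every top-left assumption in the result so the vote only decides the order. The function name `rank_by_dots` and the short labels are hypothetical.

```python
from collections import Counter


def rank_by_dots(top_left, ballots):
    """Order top-left assumptions by dot count, most dots first.

    Every assumption stays in the result, even with zero dots.
    Ties keep their original wall order because sorted() is stable.
    """
    counts = Counter(
        dot for ballot in ballots for dot in ballot if dot in top_left
    )
    return sorted(top_left, key=lambda a: -counts[a])


# Short labels standing in for the notes; three people, three dots each.
top_left = ["pause", "price", "cutover"]
ballots = [
    ["price", "price", "pause"],
    ["price", "cutover", "cutover"],
    ["cutover", "pause", "price"],
]

print(rank_by_dots(top_left, ballots))  # ['price', 'cutover', 'pause']
```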

What to watch for:

Tests that are really full builds. “We’ll test whether subscribers want it by building it.” That’s not a test; that’s the commitment you’re trying to avoid. Push for smaller experiments: interviews, landing pages, manual concierge versions (a manually-delivered version of the service that proves the demand without building the software), prototypes, five-person usability studies.
No owner. Every assumption in the top-left needs a person and a date by the end of the session. “We should test this” without an owner means it won’t happen.
Cherry-picking. The team picks the interesting tests and skips the boring but important ones. Hold firm: “The dot vote selects the order, not a different set. We work through the top-left systematically.”
Tests too big to start this week. If the proposed test is a two-month research project, it’s not an experiment, it’s another commitment. Push: “What’s the smallest slice of that research we could run this week?”

A worked example

See Assumption Mapping: Testing What You Believe for the Greenbox team’s first session, including the moment an assumption that felt obvious turned out to be a guess, and the one-week experiment that saved a month of wrong work.

What Can Go Wrong

The optimist. Someone insists nothing is risky because “it’s going to work.” Recovery: Anchor to evidence: “I’m not asking whether you believe it’ll work. I’m asking what evidence we have. Those are different questions.” Stop if: They can’t engage with the evidence question. They’re not participating in the session, they’re performing confidence.

The pessimist. Someone puts everything in the top-left. Recovery: Calibrate: “If this assumption is wrong, what specifically breaks? Does the plan fail, or do we just adjust?” Force them to articulate the failure mode for each one. Stop if: The plan really is as fragile as they think. That’s a finding; escalate it rather than finishing the mapping.

The tangent. The team starts solving a problem they’ve found instead of finishing the map. Recovery: Time-box: “Great catch. Capture the test you’d run, put it next to the note, keep plotting. We’ll prioritise solutions after we see the full grid.” Stop if: The tangent reveals the whole plan is wrong. Pause the session and escalate.

The too-many-assumptions problem. The wall has thirty-five notes and the grid is becoming unreadable. Recovery: Pre-plot prioritise: dot-vote on the fifteen most important assumptions to plot. The rest go into a holding area for the next session or for asynchronous review. Stop if: The team can’t agree which fifteen matter most. That’s its own finding; the plan has no spine yet.

The “we already know” trap. The team dismisses most assumptions as known. Recovery: Challenge each “known” with a specific test: “If I asked a new hire the same question tomorrow, would they give the same answer? If I asked three different customers?” Most “known” assumptions fail this test. Stop if: The team won’t engage with the challenge. They’re overconfident and the session won’t persuade them; the findings will come from production.

The political no-go assumption. Someone writes an assumption that implicitly challenges a decision made above the team’s level. Recovery: Plot it honestly. Note it as “owned by leadership” and flag it for escalation rather than testing within the team. Stop if: Plotting the assumption will cause a political crisis the session can’t contain. Take the note privately to the product owner and handle it offline.

Next Steps

The session ends; the work begins.

Same day, the facilitator:

Photographs the grid with all notes placed. Make sure each note is readable and the quadrants are clear.
Transcribes the top-left assumptions into a shared document with: the assumption, the proposed test, the owner, the due date, and the result that would change the plan.
Sends the photos and the top-left list to all participants and to whoever else needs to see it.

This week, the product owner:

This is where the pattern earns its cost, and the work is mostly the product owner’s. The grid is worthless without the follow-up.

Fund the tests. Each top-left test needs time, possibly budget, possibly access to users. The product owner’s first job is to make sure the tests actually run next week, not next month.
Run the tests fast. Days, not weeks. If a test is taking more than a week, it’s too elaborate; shrink it. An imperfect answer now is worth more than a perfect answer in a month.
Share early results. Even preliminary findings matter. An assumption that’s clearly wrong is worth knowing before the next planning session.
Update the grid. As test results come in, move assumptions from left to right on the grid (evidence accumulating) or kill them entirely (invalidated). The grid is a living artefact.
Use the grid to gate commitments. Before any significant hire, contract, or build decision, the product owner checks: are we betting on something in the top-left that we haven’t tested yet? If yes, the commitment waits.
Escalate irreversible assumptions. Some assumptions in the top-left can’t be tested by the team; they depend on leadership decisions or external factors. Walk them explicitly to the people who can answer them.
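The shared document and the gating check can be sketched together. The record below carries the five fields the facilitator transcribes, plus an `outcome` field we have added to mark whether the test has run. All names, owners, and dates are placeholders invented for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class LeapOfFaith:
    assumption: str
    test: str
    owner: str
    due: date
    plan_changing_result: str
    outcome: Optional[str] = None  # None until the test has run


def blockers(depends_on, top_left):
    """Return the untested top-left assumptions a commitment is betting on."""
    return [
        leap for leap in top_left
        if leap.assumption in depends_on and leap.outcome is None
    ]


# Placeholder records; owners and dates are hypothetical.
top_left = [
    LeapOfFaith(
        assumption="Subscribers want to pause their box",
        test="A handful of subscriber interviews",
        owner="owner-a",
        due=date(2026, 5, 22),
        plan_changing_result="Most interviewees describe skipping a week, not pausing",
    ),
    LeapOfFaith(
        assumption="We can roll back the migration cleanly",
        test="Rollback rehearsal on a copy of the data",
        owner="owner-b",
        due=date(2026, 5, 22),
        plan_changing_result="Rollback fails or loses data",
        outcome="Rollback rehearsal succeeded",
    ),
]

waiting_on = blockers({"Subscribers want to pause their box"}, top_left)
if waiting_on:
    print("Commitment waits on:", [leap.assumption for leap in waiting_on])
```

An empty list from `blockers` doesn't mean the commitment is safe, only that nothing it names is still sitting untested in the top-left.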

Ongoing, the team:

Re-runs the grid when the plan changes significantly. New impacts, new deliverables, new team members: each changes the assumption set.
Keeps the photographed grid visible where planning happens. It’s the reminder that the team is betting on beliefs, not facts.
Builds the language into daily conversation. “Is that a belief or a known?” becomes a useful question in standups, reviews, and planning.

Variants

Initiative Level (default). A single product, feature, or initiative about to take significant commitment. Ninety minutes, four to six people, one populated grid, a short list of leap-of-faith tests with owners and dates. This is what most teams need, and the rest of this post describes it.

Canvas-driven. Run Assumption Mapping directly off a Business Model Canvas. Each of the nine boxes generates assumptions; the Revenue Streams and Cost Structure boxes typically dominate the top-left. Use this when you’ve just produced a Canvas and want to know which boxes to validate before raising or committing.

Impact-Map-driven. Take an Impact Map and treat every actor-impact-deliverable line as a chain of assumptions. Each impact is a behaviour-change belief; each deliverable is a viability/feasibility belief. The Story Mapping release-1 slice variant is similar: assumption-map only the slice you’re about to build.

Remote. Miro or Mural board with a pre-drawn 2x2 grid and a clearly marked silent-generation area. Slightly slower than in-person plotting because the grid debate moves at the pace of one shared cursor, but it transfers cleanly. Have the facilitator place notes on prompts from the participants to keep the layout legible.

Pre-mortem hybrid. Add a pre-mortem prompt at the start of phase 2: “Imagine the plan failed catastrophically a year from now. What were the assumptions that turned out to be wrong?” This produces a different kind of assumption (failure-mode beliefs) and is worth the extra fifteen minutes when the plan is large or irreversible.

Can You Turn Back Time?

2026-05-14T06:00:00+08:00

Time Is Weirder Than You Think showed how time bends near mass and motion. Does Time Even Exist? asked the deeper question of whether it exists at all. This post asks a narrower one: can you move through it in the wrong direction? The answer, according to the equations, is “maybe”, and the universe isn’t letting us know.

Forward time travel is easy

Before tackling the hard direction, it’s worth noting that forward time travel is a solved problem. It’s been happening since the universe had mass and relative motion; we’ve just been proving it since 1971.

The twin paradox is real. Move fast enough relative to someone else, and less time passes for you. The Hafele-Keating experiment confirmed it with caesium clocks on commercial airliners. GPS satellites confirm it every second of every day. Scott Kelly came back from the ISS 5 milliseconds younger than his twin.

If you want to travel a thousand years into the future, the recipe is straightforward: accelerate to a significant fraction of the speed of light, cruise for a while (by your clock), decelerate, and come home. The energy requirements are absurd (accelerating a modest spacecraft to 99% of light speed would require more energy than the entire world currently produces in a year), but the physics is not in dispute. You would arrive in the future. Everyone you knew would be dead. Going back would need a different mechanism entirely, which is where this gets interesting.
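To put rough numbers on that, here is a back-of-envelope sketch using the standard special-relativity formulas: the Lorentz factor γ = 1/√(1 − β²) and the kinetic energy (γ − 1)mc². The 1,000 kg craft mass is our assumption, and the calculation ignores the acceleration and deceleration phases entirely.

```python
import math

C = 299_792_458.0  # speed of light in m/s


def lorentz_factor(beta: float) -> float:
    """Time-dilation factor for a speed given as a fraction of c."""
    return 1.0 / math.sqrt(1.0 - beta ** 2)


def kinetic_energy_joules(mass_kg: float, beta: float) -> float:
    """Relativistic kinetic energy, (gamma - 1) * m * c^2."""
    return (lorentz_factor(beta) - 1.0) * mass_kg * C ** 2


beta = 0.99
gamma = lorentz_factor(beta)                   # roughly 7.09
energy = kinetic_energy_joules(1000.0, beta)   # assumed 1,000 kg craft
ship_years = 1000.0 / gamma                    # cruise-only approximation

print(f"gamma at {beta}c: {gamma:.2f}")
print(f"kinetic energy for 1,000 kg: {energy:.2e} J")
print(f"ship years per 1,000 Earth years: {ship_years:.0f}")
```

For the assumed one-tonne craft the energy comes out at about 5.5 × 10^20 joules, the same order of magnitude as a year of world primary energy use (taken here as a round 6 × 10^20 J), so a craft of a few tonnes clears that bar comfortably. Note also that at 99% of light speed a thousand Earth years still costs the traveller about 141 years of ship time; shortening that means going faster still.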

Gravitational time dilation offers another route. Park yourself near (but not inside) a black hole, wait a while by your clock, then fly away. Less time passes for you than for the universe outside. The film Interstellar got this broadly right: the characters who visited the planet near the black hole aged hours while decades passed outside. The specific numbers in the film were dramatised, but the principle is textbook general relativity.

Forward time travel isn’t speculative; it’s engineering.

Backward time travel: the equations say yes

Going backward is where things get interesting, and contested.

The equations of general relativity describe the geometry of spacetime. They’re not suggestions; they’re constraints. Given a distribution of mass and energy, the equations tell you exactly how spacetime curves. And some solutions to those equations contain closed timelike curves (CTCs): paths through spacetime that loop back on themselves. Travel along a CTC, and you return to your own past. A handful of such solutions are known; each one is mathematically valid, and each one is physically strange in its own way.

Gödel’s rotating universe is where CTCs were first recognised for what they were. In 1949, Kurt Gödel found a solution to Einstein’s equations describing a universe that rotates as a whole, in which sufficiently long journeys through spacetime loop back to their starting point in time. You could, in principle, attend your own birth. CTCs had quietly been present in an earlier solution (Willem van Stockum’s 1937 infinite rotating dust cylinder) but nobody noticed until Frank Tipler pointed it out in 1974. Gödel’s was where the phenomenon became impossible to ignore.

Gödel presented this solution as a birthday gift to Einstein. It’s unclear whether Einstein was delighted or horrified. Gödel’s universe doesn’t match ours: ours expands (his doesn’t), and the cosmic microwave background constrains any global rotation to extremely tight bounds. But that’s not the point. The point is that general relativity, taken at face value, permits time travel. The equations don’t forbid it. Gödel proved that any argument of the form “time travel is impossible because it violates general relativity” is wrong. The theory allows it. Whether the universe chooses to use that allowance is a different question.

The Kerr metric is another CTC solution, and one we can point a telescope at. In 1963, Roy Kerr found the solution for a rotating black hole. The Event Horizon Telescope has since imaged M87* and Sgr A* directly; LIGO routinely catches pairs of spinning black holes merging. The geometry is real; whether the CTC region of the geometry is real is another question. Kerr’s solution contains closed timelike curves deep in the interior, behind the inner event horizon. In the mathematical solution, you could pass through the ring singularity and emerge in a region where time loops are possible.

Whether this is physically meaningful is debated. The interior of the Kerr solution may be unstable; perturbations might destroy the closed timelike curves before anything could traverse them. But the mathematical structure is there, and it’s a solution to the same equations that predict GPS corrections and gravitational waves.

Wormholes opened a third route. In 1988, Kip Thorne (who would later win a Nobel Prize for LIGO) showed that if traversable wormholes (shortcuts through spacetime connecting distant regions) exist, they could be converted into time machines. The recipe: take one end of a wormhole, accelerate it to near-light speed, then bring it back. Time dilation means less time has passed at the accelerated end. Enter the “slow” end and you emerge from the “fast” end at an earlier time. You’ve gone backward.
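To put rough numbers on that recipe, here is a minimal sketch of the special-relativistic time dilation involved. It assumes the travelling mouth moves at a constant speed for the whole trip (ignoring the acceleration and turnaround phases), and the function name and the sample figures are illustrative rather than taken from Thorne's analysis.

```python
import math


def mouth_time_shift(lab_years: float, speed_fraction_c: float) -> float:
    """Years by which the travelling mouth's clock lags the stay-at-home mouth.

    Assumes constant speed for the whole trip (an idealisation).
    """
    if not 0 <= speed_fraction_c < 1:
        raise ValueError("speed must be a fraction of c in [0, 1)")
    gamma = 1.0 / math.sqrt(1.0 - speed_fraction_c ** 2)  # Lorentz factor
    travelling_mouth_years = lab_years / gamma            # proper time of the moving mouth
    return lab_years - travelling_mouth_years


# Illustrative trip: ten years of lab time at 99% of light speed.
shift = mouth_time_shift(lab_years=10.0, speed_fraction_c=0.99)
print(f"Clock offset between the two mouths: {shift:.2f} years")
```

The offset between the two clocks is the size of the backward jump the wormhole would offer once the mouths are brought back together.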

Thorne wasn’t trying to design a time machine. He was responding to a question from Carl Sagan, who was writing Contact and wanted the physics to be plausible. But the analysis was rigorous, published in Physical Review Letters, and it launched a serious research programme into the physics of time travel that continues today.

The catch is that we don’t know if traversable wormholes can exist. They require “exotic matter” with negative energy density to keep them open. Quantum field theory allows negative energy densities in certain configurations (the Casimir effect is a real example), but whether you can get enough of it, concentrated enough, to hold open a wormhole is unknown.

The Tipler cylinder came from the same Frank Tipler who’d dredged van Stockum’s CTCs out of obscurity, and he didn’t stop at reanalysing other people’s work. In the same 1974 paper, he showed that an infinitely long, extremely dense, rapidly rotating cylinder would drag spacetime around it hard enough to create closed timelike curves of its own. Finite cylinders don’t work; Stephen Hawking proved that the closed timelike curves require the cylinder to be infinite. This makes it impractical (to put it mildly) but it’s another example of the equations permitting what intuition forbids.

The grandfather paradox and self-consistency

If backward time travel is possible, what stops you from killing your own grandfather before your parent is born? This is the oldest and most intuitive objection to time travel.

The Novikov self-consistency principle offers one resolution. Proposed by Igor Novikov in the 1980s, it states that any events on a closed timelike curve must be self-consistent. You can travel to the past, but you can’t change it, because you didn’t. Whatever you do in the past has already happened. It’s already part of the history that led to you travelling backward in the first place.

It’s like a jigsaw puzzle. You can’t place a piece that doesn’t fit. If you travel back and try to kill your grandfather, something prevents it: you slip, you miss, you change your mind. Not because of magic, but because the version of history where you succeed is logically inconsistent and therefore doesn’t exist. Only self-consistent histories are allowed.

This isn’t as strange as it sounds. We already accept that physical laws constrain what’s possible. You can’t build a perpetual motion machine, not because someone stops you, but because the laws of thermodynamics don’t permit it. The Novikov principle says that self-consistency is a similar constraint: the laws of physics, applied to closed timelike curves, only admit solutions where the timeline is internally coherent.

The Deutsch model takes a quantum approach. David Deutsch, in 1991, applied quantum mechanics to the grandfather paradox and showed that closed timelike curves are consistent if you allow the universe to be in a mixed quantum state. Roughly: the traveller who emerges from the time loop is not identical to the one who entered it. They’re a quantum mixture: partly themselves, partly a version from a slightly different history. This avoids paradoxes at the cost of letting quantum mechanics redefine what “the traveller” even means. Which, given everything else about quantum mechanics, is perhaps not a high price.

The quantum eraser: does the future affect the past?

In 1999, Yoon-Ho Kim and colleagues performed an experiment that seems to suggest the future can influence the past. It’s called the delayed-choice quantum eraser, and it’s one of the most unsettling experiments in physics.

Here’s the setup, simplified. You send photons through a double slit. Normally, they produce an interference pattern on a detector: the signature of quantum mechanics, showing the photons behaving as waves. But if you add a detector that tells you which slit each photon went through, the interference pattern disappears. The photons behave as particles. This much is standard quantum mechanics.

Now the twist. Kim’s experiment split each photon into two entangled partners. One partner (the “signal”) went to a screen. The other (the “idler”) went on a longer path to a second detector, where the “which-path” information was either preserved or erased, after the signal photon had already hit the screen.

When the experimenters later compared the data, they found that the signal photons whose idler partners had their which-path information erased showed an interference pattern. The ones whose idler partners retained the information did not. The choice about the idler, made after the signal photon hit the screen, appeared to retroactively determine whether the signal photon behaved as a wave or a particle.

This is not, despite appearances, evidence of backward causation. The interference pattern only becomes visible when you sort the signal photons using information from the idlers. If you look at all the signal photons together, without sorting, there’s no interference pattern. The “retrocausal” effect is an artefact of post-selection, not a signal travelling backward in time. Still, the lesson is real: quantum correlations don’t respect our intuitions about the order of cause and effect. The universe doesn’t care which measurement happened first; the entanglement ties the results together regardless of timing.

Hawking’s party

In 2009, Stephen Hawking threw a party for time travellers. He prepared champagne, put up a banner reading “Welcome, Time Travellers,” set coordinates, and waited. Nobody came.

He published the invitation afterward, so that future time travellers would know when and where to show up. The fact that nobody arrived was, Hawking suggested with a grin, “experimental evidence that time travel is not possible.”

It was a joke, mostly. The absence of guests doesn’t prove much: perhaps time travellers can’t travel to before the machine was built, or perhaps they chose not to come, or perhaps the invite was lost in the noise of history. But it illustrates Hawking’s own position: he believed the universe has a chronology protection mechanism that prevents closed timelike curves from forming.

His chronology protection conjecture, published in 1992, argues that whenever conditions approach those needed for a time loop, quantum effects (specifically, a divergence in the stress-energy tensor of the vacuum) intervene and destroy the loop before it can form. The back-reaction of quantum fields near a forming CTC generates enough energy to collapse the would-be time machine.

“It seems there is a chronology protection agency which prevents the appearance of closed timelike curves and so makes the universe safe for historians,” Hawking wrote. The conjecture is unproven. It might be wrong. But the fact that it was needed at all, that someone of Hawking’s stature felt the need to propose a law preventing time travel, tells you how seriously physicists take what the equations permit.

Retrocausality: a serious proposal

Most of this post has treated backward-in-time effects as paradoxical or impossible. But a growing number of physicists are taking retrocausality (genuine backward-in-time influence) seriously as a foundation for quantum mechanics.

The motivation is Bell’s theorem. In 1964, John Bell proved that quantum mechanics cannot be explained by any theory where particles have pre-existing properties and influences travel no faster than light. Experiments have repeatedly confirmed quantum mechanics. So at least one of those assumptions must be wrong.

Most physicists give up the pre-existing properties (this is the standard “Copenhagen” or “many-worlds” approach). But a minority, including Huw Price at Cambridge and Ken Wharton at San José State, argue that we should instead give up the assumption that causes always precede effects. If influences can travel backward in time, Bell’s theorem is satisfied without giving up realism. Particles do have definite properties; it’s just that future measurements can influence past states.

This isn’t crackpot physics. Price and Wharton’s work is published in peer-reviewed journals and taken seriously by the foundations-of-physics community. It’s a minority position, but it’s a legitimate interpretation, and it has the advantage of preserving something that most quantum interpretations sacrifice: the idea that things have definite properties even when nobody’s looking.

The price is steep. Retrocausality means that the state of a particle right now depends partly on what will happen to it in the future. Not in a way that lets you send messages backward (that would violate other constraints), but in a way that makes the universe’s bookkeeping work out. The future doesn’t cause the past in the way you’d normally use the word. It constrains it, the way a jigsaw puzzle constrains which pieces can go where.

What we actually know

Forward time travel is real. We’ve measured it. GPS depends on it. It’s engineering, not speculation.
General relativity permits closed timelike curves. Multiple exact solutions to Einstein’s equations contain them. This is mathematics, not handwaving.
We don’t know if the universe actually allows them. Hawking’s chronology protection conjecture says no, but it’s unproven. Quantum gravity might resolve this, but we don’t have a theory of quantum gravity.
The grandfather paradox has solutions. The Novikov principle (self-consistency) and the Deutsch model (quantum mixed states) both resolve it without contradiction.
Quantum mechanics is weird about time. Entanglement doesn’t respect temporal ordering. The delayed-choice quantum eraser demonstrates this without actually sending information backward.
Retrocausality is a legitimate interpretation. A minority of physicists take it seriously as a foundation for quantum mechanics.
Nobody came to Hawking’s party. Make of that what you will.

The physics of time travel isn’t a closed question; it’s an open one, sitting at the intersection of general relativity, quantum mechanics, and quantum gravity, precisely the intersection where our best theories break down. Until we have a theory that works at that intersection, the equations say “maybe” and the universe isn’t talking.

There’s another clock to examine, though: the one inside you. It has no caesium atom and no GPS correction. It runs on light, adenosine, and a cluster of twenty thousand neurons behind your eyes. And it sets the terms for how you experience every other clock in this series.

The Clock Inside You is next: the biology of jet lag, shift work, and why your body refuses to run on UTC.

Forecasting Without Writing Python

2026-05-13T06:00:00+08:00

The situation

Priya is a category manager at a mid-size retail business. She owns 400 SKUs across homewares. Her CSV export from the data warehouse has 78 weekly rows per SKU (18 months of history), with columns: sku, week_ending, units_sold, avg_unit_price, promo_flag, competitor_promo_flag, stock_out_days, weather_index, category. The ask from finance is a 13-week forward forecast of units sold per SKU, deliverable in two weeks, with enough of an explanation that a director can challenge it without Priya needing a data scientist in the room.

Priya knows Excel well enough to build a naive seasonal-average forecast, but finance has asked for something better: one that accounts for promotions, stock-outs (units-sold is artificially capped in weeks where stock ran out), and the weather-index column the ops team started tracking last year. She knows pivot tables, not Python. Hiring a consultant is on the table but slow; the ML team can help in Q3, which is too late.

The platform team has AWS available. Someone in the data team has muttered about “no-code ML tools,” but nobody has been precise about which one or whether it fits.

What actually matters

Before reaching for a tool, pin down what shape of problem this is and what the data can honestly tell us.

The first question is what kind of problem this actually is. Forecasting weekly units-sold from historical data is a time-series forecasting problem: the target is a number over time, history is ordered, seasonality matters, and exogenous variables (promo flag, weather index) may explain variation. It’s not a classification problem (“will this SKU sell out?”), not a regression on cross-sectional features (“predict price from SKU attributes”), not an image or text problem. Anything we pick has to treat time as a first-class axis.

The second is the shape of the data. 400 SKUs × 78 weeks is 31,200 rows. That’s small by ML standards but each SKU has only 78 points of history, which isn’t a lot for any individual series. There’s a real choice between fitting one model per SKU (each model starves on 78 points) and fitting one global model across all SKUs (the model learns patterns that transfer between series, so a SKU with 20 weeks of history benefits from the other 399). For a 400-item catalogue with short histories, the global-model approach is the one that earns its keep.

The third is exogenous features. promo_flag and competitor_promo_flag are known in advance for future weeks (the promo calendar is set); weather_index is not known in advance (it’s a forecast of its own). The right framing distinguishes between related time series that are known in advance (we can include future values, and the model will use them) and those that only have historical values (the model uses the history to learn correlations, but can’t see future values at prediction time). Getting this distinction correct matters: classifying weather_index as known-in-advance leaks future information in training and produces optimistic backtests that don’t hold up live.

The fourth is stock-outs. Units-sold in a stock-out week is censored: demand existed, but supply capped what was recorded. A forecast trained on raw units-sold learns that demand drops in those weeks, which is wrong. The fix is data preparation: either exclude stock-out weeks from training, or adjust the target using the stock_out_days column (e.g. if stock_out_days >= 4, flag the row as unreliable). That’s a feature engineering decision, not a tool decision; whatever we pick, the business user has to own this choice and document it.

The fifth is explainability. Finance will ask “why is the Q2 forecast 20% higher than last year’s Q2?” The chosen tool has to produce some combination of feature importance charts, per-prediction explanations, and a “what-if” capability so that a director can challenge the forecast without a data scientist in the room. Black-box predictions that beat the baseline by 5% but can’t be narrated are worse than a transparent forecast that loses 5% of accuracy.

The sixth is operationalisation. A one-off forecast is a CSV download. A repeatable quarterly forecast is a scheduled job. If this forecast is going to run every quarter, the tool needs a path from “model built in the UI” to “model called on a schedule” without rebuilding from scratch each time. Otherwise we’re committing to clicking through the same wizard four times a year for as long as the business cares about the answer.

And finally, the audience constraint. Priya knows Excel and SQL. She doesn’t know Python, statistics-as-code, or Jupyter. Anything that requires writing a notebook, even a friendly one, shifts the work back to the ML team and defeats the whole point. The tool has to be navigable by a category manager.

What we’ll filter on

Six filters, applied to the forecast-building tools Priya could use.

No-code interface: does the tool let a non-coder build and run the model?
Handles time-series forecasting natively, as a first-class problem type, not cross-sectional regression?
Supports exogenous features (known-in-advance vs. historical-only)?
Explainability: feature importance and per-prediction explanations?
Repeatable: can the same model run on a schedule without rebuilding in the UI?
Priced appropriately for 400 SKUs × quarterly cadence?

The no-code ML landscape

SageMaker Canvas. AWS’s no-code ML workspace. Supports tabular classification and regression (via AutoML), time-series forecasting, image and text classification, and GenAI-backed exploration (ask-your-data via a foundation model). For time-series, it runs a SageMaker Autopilot AutoML job under the hood, trying multiple algorithm families (DeepAR, CNN-QR, ETS, ARIMA, Prophet) and selecting the best by backtest. Explanation via feature importance charts; registration to Model Registry for scheduled reuse. Priced per-session-hour for the UI plus the underlying training and inference costs.
QuickSight + ML Insights. QuickSight is AWS’s BI tool; ML Insights adds anomaly detection and a forecasting feature that produces simple time-series forecasts on visualisations using an internal algorithm. Useful for quick “what’s the trend?” answers directly in a dashboard, not for a model-quality forecast with exogenous features or explainability. Good for situational awareness; not the correct tool for a 400-SKU production forecast.
Amazon QuickSight Q / Amazon Q in QuickSight. The natural-language interface to QuickSight. Answers questions like “what was last quarter’s top-selling SKU?” in English. Not a forecasting tool; complementary to a forecast once it exists.
SageMaker Autopilot (direct). The AutoML backbone Canvas uses. Callable via the SageMaker SDK or Studio UI. Produces the same models Canvas does but requires a user comfortable enough with notebooks to trigger jobs, inspect candidates, and call endpoints. The path a data scientist would take; not the path for a non-coder.
Amazon Forecast (retired as standalone). Was a dedicated time-series forecasting service. Functionality folded into SageMaker Canvas’s time-series forecast type. Mentioned for historical context; don’t plan new work against it as a separate service.
A third-party Excel add-in or a spreadsheet model. Priya could build an ETS or seasonal-naive model in Excel or a forecasting add-in. Cheap and familiar, but limited: hard to include exogenous features, hard to evaluate honestly, hard to explain beyond “I used a trend line.” Not a scaling answer for 400 SKUs with exogenous drivers.

Side by side

Tool	No-code	Time-series native	Exogenous features	Explainability	Repeatable	Sized for this
SageMaker Canvas	✓	✓	✓	✓	✓	✓
QuickSight ML Insights	✓	Partial	✗	✗	✓	✗
Amazon Q in QuickSight	✓	✗	✗	✗	N/A	✗
SageMaker Autopilot (direct)	✗	✓	✓	✓	✓	✓ (wrong audience)
Excel add-in	✓	Partial	Manual	✗	Partial	Limp

Only one tool ticks every box for the scenario: SageMaker Canvas in time-series forecast mode. The others either can’t handle the problem shape (QuickSight, Excel) or are the wrong audience (Autopilot directly).

Canvas in time-series mode

Six stages; Canvas automates four of them. The business judgement lives in preparation (what's a stock-out worth?) and review (does this backtest make sense?).

The pick in depth

Canvas time-series forecast, trained on the prepared dataset, registered for quarterly reuse.

The import is a two-click exercise: Canvas reads from S3 (or Snowflake, Redshift, Athena, or a direct upload up to 5 GB). Priya’s CSV lands in a dataset that Canvas can inspect.

Preparation in Data Wrangler. Canvas has an embedded Data Wrangler view, a visual transform builder. Priya’s required transforms:

Exclude unreliable rows. A filter step: stock_out_days < 4. Weeks where stock was out for more than half the week are removed from training. The alternative, scaling units_sold up to impute demand, is defensible but introduces assumptions; excluding is cleaner for a first pass.
Parse timestamp. Confirm week_ending is recognised as a date with weekly frequency.
Derive features. Add week_of_year and month columns (Canvas offers one-click “extract date parts”). These give the model explicit seasonality signals.
Confirm types. promo_flag and competitor_promo_flag should be categorical (not numeric), units_sold should be numeric, sku should be categorical as the item identifier.

The prepared dataset gets exported as the training input.
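For readers who want to see what those transforms amount to, here is a rough stdlib-only sketch of the same logic outside Canvas. It is not what Data Wrangler runs internally; the function name, the sample SKU identifier, and the use of ISO week numbers for week_of_year are assumptions made for illustration.

```python
from datetime import date

STOCK_OUT_THRESHOLD_DAYS = 4  # from the filter step: keep rows with stock_out_days < 4


def prepare_rows(rows):
    """Drop unreliable stock-out weeks and add calendar features."""
    prepared = []
    for row in rows:
        if int(row["stock_out_days"]) >= STOCK_OUT_THRESHOLD_DAYS:
            continue  # demand was censored this week: exclude from training
        week_ending = date.fromisoformat(row["week_ending"])
        prepared.append({
            **row,
            "week_of_year": week_ending.isocalendar()[1],  # ISO week (assumption)
            "month": week_ending.month,
        })
    return prepared


# Hypothetical rows in the shape of Priya's export (other columns omitted).
sample = [
    {"sku": "HW-001", "week_ending": "2025-06-01", "units_sold": "40", "stock_out_days": "0"},
    {"sku": "HW-001", "week_ending": "2025-06-08", "units_sold": "12", "stock_out_days": "5"},
    {"sku": "HW-001", "week_ending": "2025-06-15", "units_sold": "38", "stock_out_days": "3"},
]
clean = prepare_rows(sample)
for row in clean:
    print(row["week_ending"], row["week_of_year"], row["month"])
```

The middle row is dropped because stock was out for five days; the other two keep their recorded units_sold and gain the two calendar columns.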

Forecast configuration. In Canvas’s time-series flow:

Target column: units_sold
Item identifier: sku (the column that distinguishes one time series from another)
Timestamp: week_ending
Frequency: Weekly
Forecast horizon: 13 (weeks)
Forecast quantiles: P10, P50, P90 (this is the spread of the probabilistic forecast; finance can see downside and upside, not just a point estimate)
Related time series, known in advance: promo_flag, competitor_promo_flag, week_of_year, month. These are all known for future weeks because the promo calendar is set and calendar features are deterministic.
Related time series, historical only: avg_unit_price, weather_index. Unknown for future weeks; the model uses their history to learn correlations but must impute them for prediction.
Item metadata: category. Static attributes of each SKU, useful for the model to learn category-level patterns.
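The configuration above is entered in the Canvas UI, not in code, but writing the column roles down as data makes the leakage check easy to repeat each quarter. The sketch below is a hypothetical checklist, not a Canvas API: the dictionary keys, the NOT_KNOWABLE_AHEAD set, and check_config are all names invented for this illustration.

```python
FORECAST_CONFIG = {
    "target": "units_sold",
    "item_id": "sku",
    "timestamp": "week_ending",
    "frequency": "weekly",
    "horizon_weeks": 13,
    "quantiles": [0.1, 0.5, 0.9],
    "known_in_advance": ["promo_flag", "competitor_promo_flag", "week_of_year", "month"],
    "historical_only": ["avg_unit_price", "weather_index"],
    "item_metadata": ["category"],
}

# Columns whose future values are not known at prediction time
# (taken from the historical-only group above).
NOT_KNOWABLE_AHEAD = {"weather_index", "avg_unit_price"}


def check_config(config):
    """Return a list of problems; an empty list means the column roles look consistent."""
    problems = []
    known = set(config["known_in_advance"])
    historical = set(config["historical_only"])
    overlap = known & historical
    if overlap:
        problems.append(f"columns in both feature groups: {sorted(overlap)}")
    leaked = known & NOT_KNOWABLE_AHEAD
    if leaked:
        problems.append(f"future leakage risk: {sorted(leaked)}")
    reserved = {config["target"], config["item_id"], config["timestamp"]}
    misused = reserved & (known | historical)
    if misused:
        problems.append(f"reserved columns used as features: {sorted(misused)}")
    return problems


print(check_config(FORECAST_CONFIG))
```

Moving weather_index into the known-in-advance list makes the check report a leakage risk, which is the mistake this configuration is designed to avoid.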

Training. Canvas runs a SageMaker Autopilot job that tries several algorithms: DeepAR+ (a deep-learning autoregressive model that pools across items), CNN-QR (convolutional quantile regression), ETS (exponential smoothing), ARIMA, and Prophet. For 400 items × 78 weeks, typical training time is 2-4 hours. The job backtests each candidate on a rolling-origin split (train on weeks 1-65, predict 66-78; train on weeks 1-52, predict 53-65; etc.) and scores each on weighted quantile loss (wQL) at the chosen quantiles.
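As a rough illustration of the backtest described above, the sketch below generates the rolling-origin windows and scores a quantile forecast with weighted quantile loss. The wQL formula used here is the commonly published one (twice the summed pinball loss, divided by total absolute demand); Canvas's internal computation may differ in detail, and the function names and sample numbers are made up for this example.

```python
def rolling_origin_splits(n_weeks=78, horizon=13, n_windows=2):
    """Return ((train_start, train_end), (test_start, test_end)) pairs.

    Week numbers are 1-based and inclusive; the latest window comes first.
    """
    splits = []
    for k in range(n_windows):
        test_end = n_weeks - k * horizon
        test_start = test_end - horizon + 1
        train_end = test_start - 1
        if train_end < 1:
            break  # not enough history left for another window
        splits.append(((1, train_end), (test_start, test_end)))
    return splits


def weighted_quantile_loss(actuals, quantile_forecasts, tau):
    """wQL at quantile tau: 2 x summed pinball loss / total absolute demand."""
    pinball = sum(
        tau * max(y - q, 0.0) + (1.0 - tau) * max(q - y, 0.0)
        for y, q in zip(actuals, quantile_forecasts)
    )
    total = sum(abs(y) for y in actuals)
    return 2.0 * pinball / total if total else float("nan")


print(rolling_origin_splits())
print(weighted_quantile_loss([10, 20, 30], [12, 18, 30], tau=0.5))
```

With the defaults, the two windows are weeks 1-65 against 66-78 and weeks 1-52 against 53-65, matching the split in the text. Lower wQL is better, and at tau = 0.5 it reduces to a scaled absolute error.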

The winning model is usually DeepAR+ for retail-style data with many items, because it pools information across items: a SKU with 20 weeks of history benefits from what the model has learned about the other 399. For smaller datasets or single-item forecasts, classical methods (ETS, ARIMA) often win.

Review. Canvas presents a dashboard:

Accuracy metrics: wQL, MAPE (mean absolute percentage error), and RMSE (root mean squared error) on the backtest. Priya compares to her naive seasonal-average baseline; if Canvas’s model doesn’t beat it by a material margin, the added complexity isn’t earning its keep.
Feature importance: a bar chart showing which columns drove predictions. If promo_flag and week_of_year dominate, the story is coherent; if category alone dominates, the model may be learning a category-level average and ignoring within-category variation.
Per-SKU plots: historical vs. forecast on held-out weeks. Priya clicks through a sample of 20 SKUs and eyeballs whether the forecasts look reasonable. This is the human judgement step that no backtest metric captures.
What-if: for a chosen future week, override promo_flag from 0 to 1 and see the forecast shift. This is the explainability story for finance: “if we don’t run the Q2 promo, the forecast drops 15%.”
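The baseline comparison in the review can be made concrete with a few lines. This is a sketch under stated assumptions: the 10% threshold for a "material margin" is an arbitrary illustrative choice, the sample numbers are invented, and the helper names are not part of Canvas.

```python
import math


def mape(actuals, forecasts):
    """Mean absolute percentage error, skipping zero-demand weeks (undefined there)."""
    terms = [abs(y - f) / abs(y) for y, f in zip(actuals, forecasts) if y != 0]
    return 100.0 * sum(terms) / len(terms) if terms else float("nan")


def rmse(actuals, forecasts):
    """Root mean squared error over the held-out weeks."""
    return math.sqrt(sum((y - f) ** 2 for y, f in zip(actuals, forecasts)) / len(actuals))


def beats_baseline(model_error, baseline_error, min_relative_gain=0.10):
    """True if the model's error is at least min_relative_gain below the baseline's."""
    return model_error <= baseline_error * (1.0 - min_relative_gain)


# Invented held-out weeks for one SKU.
actual = [100, 120, 80, 90]
model = [95, 125, 85, 90]
baseline = [110, 100, 95, 80]  # stand-in for a seasonal-average forecast

print(f"model MAPE {mape(actual, model):.1f}%, baseline MAPE {mape(actual, baseline):.1f}%")
print(f"model RMSE {rmse(actual, model):.2f}, baseline RMSE {rmse(actual, baseline):.2f}")
print("worth the complexity:", beats_baseline(rmse(actual, model), rmse(actual, baseline)))
```

Running the same comparison per SKU, rather than only in aggregate, shows where the model earns its keep and where the spreadsheet forecast is already good enough.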

Prediction. Canvas generates a forecast CSV: one row per SKU × future week × quantile, written back to S3. For a quarterly cadence, Priya registers the model to SageMaker Model Registry and an engineering partner wires up an EventBridge Scheduler schedule that invokes a Lambda function, which in turn starts a SageMaker batch-transform job on the registered model each quarter. Priya re-uses the same model for three quarters and retrains in Canvas when accuracy starts drifting or when new SKUs enter the catalogue.

The honest limits

Canvas isn’t magic. A few things to name:

Small-history SKUs are still hard. A SKU with 12 weeks of data has no seasonal history; the model imputes from category peers, but confidence is low. Priya flagged these as fallback-to-human for the first forecast, which is the correct call. Trust the model where the data supports it.

Exogenous-feature honesty. Classifying weather_index as known-in-advance would leak future information into training. Canvas would learn to “use next month’s weather” and produce a spuriously good backtest that fails in production. Classifying it as historical-only is the correct answer; accept that the model uses the history of the weather, not its future, and that it might miss a forecastable weather-driven shift.

Stock-out handling is a modelling choice, not a tool choice. Canvas can’t know what a stock-out week’s “true” demand was. Priya chose to exclude them; someone else might scale up using stock_out_days as a censoring indicator. Either is defensible; the choice should be documented so the next quarter’s forecast is consistent.
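For comparison, here is what the scale-up alternative might look like. This is not the choice Priya made, and it rests on a loud assumption: that demand is spread evenly across the seven days of the week, so sales on in-stock days can be extrapolated to the full week. The function name is hypothetical.

```python
DAYS_PER_WEEK = 7


def adjust_for_stock_out(units_sold, stock_out_days):
    """Scale recorded sales up to a full-week estimate.

    Assumes demand is even across days (an assumption, not a measured fact).
    Returns None when stock was out all week, since there is nothing to scale.
    """
    days_in_stock = DAYS_PER_WEEK - stock_out_days
    if days_in_stock <= 0:
        return None
    return units_sold * DAYS_PER_WEEK / days_in_stock


print(adjust_for_stock_out(30, 2))  # 30 units over 5 in-stock days
print(adjust_for_stock_out(40, 0))  # no stock-out: unchanged
print(adjust_for_stock_out(10, 7))  # out all week: no estimate possible
```

The even-demand assumption is the weak point: if most sales land on weekends and the stock-out covered a weekend, the adjustment understates true demand. That is the kind of caveat the documentation should record.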

The quantile spread is real information. A P10-to-P90 range that’s narrow says the model is confident; wide says the model doesn’t know. Finance should not be given only the P50; the range is part of the story. If the width is embarrassingly wide, the honest answer is “this forecast is a rough guide, not a commitment.”

What’s worth remembering

SageMaker Canvas is AWS’s no-code ML interface. Tabular classification and regression via Autopilot; time-series forecasting as a first-class mode; image and text classification; GenAI-backed data exploration. Business analysts, product managers, and category managers are the audience.
Time-series forecasting is a distinct problem type. Target over time, ordered history, seasonality, exogenous features. Don’t solve it with cross-sectional regression; Canvas’s time-series mode is the correct tool.
The global-model approach pools across items. 400 SKUs × 78 weeks is better trained as one model over 400 series than as 400 separate models. Short-history items benefit from patterns learned on richer series.
Classify exogenous features correctly. Known-in-advance (promo calendar, calendar features) go into future predictions directly. Historical-only (weather, price) inform via lag correlations but aren’t known for future weeks. Misclassifying leaks future information and produces optimistic backtests.
Data preparation is where business judgement lives. Canvas doesn’t know what a stock-out week means. Filtering or adjusting target values is a modelling choice; document it.
Backtest metrics plus per-item plots plus feature importance is the review triangle. Don’t trust one metric alone; eyeball a sample of forecasts, confirm the drivers look sensible, and compare to a naive baseline before trusting the model.
Quantile forecasts give finance downside and upside. P10/P50/P90 is more honest than a single point estimate. Narrow quantile spread means confidence; wide means the model doesn’t know, and saying so is better than faking precision.
Model Registry + EventBridge + batch transform is the quarterly-cadence plumbing. Canvas builds the model; an engineering partner wires the schedule. One Canvas build can serve several quarters before retraining is warranted.

A category manager with a spreadsheet-level skill set and two weeks can produce a defensible 13-week forecast for 400 SKUs, with quantile uncertainty and feature-attribution explanations, using Canvas. The ML team stays free for the harder problems. What Canvas gives up (the last few percentage points of accuracy a hand-tuned model might squeeze out) is usually worth trading for the months of analyst time it returns.

Business Model Canvas: Does This Actually Work?

2026-05-12T06:00:00+08:00

Greenbox is a produce-box startup with 200 subscribers in Perth, racing to reach 1,000 within six months. They’ve discovered their customers care more about convenience than local sourcing, and now they need to figure out whether the business model actually works at scale.

Maya has a board meeting in three weeks. The agenda: present a credible path from 200 to 1,000 subscribers. If she can’t make the case, the money stops.

She’s been working on the pitch deck in the evenings. Good slides. Compelling narrative. But on Wednesday morning, she stares at slide nine, the financial projections, and realises she’s been avoiding the hard question. Not “can we grow?” but “can we grow profitably?”

Reaching for the familiar

Tom suggests Impact Mapping. The team spends thirty minutes on it. Useful: it shows the path to 1,000 involves both reducing churn and expanding acquisition. But Maya shakes her head.

“This tells me how to grow. It doesn’t tell me whether we can afford to.”

Lee recognises the gap. “What you need is a picture of the whole machine: how money comes in, where it goes out, and whether the engine runs at the scale you’re targeting. The Business Model Canvas maps that out. We can do that this morning.”

He pauses. The team is watching him. Lee has been their guide through the entire discovery journey. He’s the person who always has the next technique, the calm voice that says “let’s try this.”

“What I can’t do,” Lee says, “is read it for you once it’s mapped. CAC, lifetime value, what your moat looks like against a competitor with sixty times your funding. I’ve been around those questions, I’ve never run a subscription business through the wall they put up. If I try to interpret the canvas for you, I’ll be doing exactly what we tell teams not to do: guessing at the answers instead of finding someone who knows.”

The room is quiet. It’s a harder thing to say than it sounds. Admitting a limit feels like stepping off a cliff. But he feels something he hasn’t felt in twenty years of consulting: relief.

His phone buzzes in his pocket. He glances at it, a text from Yuki: Dad, can you call me this weekend? He puts the phone away. Maya notices.

“You can take that,” she says.

“She’ll call back,” Lee says.

Maya looks at him. “Will she?”

Lee doesn’t answer. He turns back to the whiteboard.

“We’ll map it this morning. Then I’m going to call someone. Charlotte Wong, she’s scaled two subscription businesses past Series A. Once we’ve got the picture, she can read it.”

What a Business Model Canvas is

The Business Model Canvas was created by Alexander Osterwalder. Nine building blocks on a single page describing how a business creates, delivers, and captures value.

Key Partners

Key Activities

Value Propositions

Customer Relationships

Customer Segments

Key Resources

Channels

Cost Structure

Revenue Streams

The power is that it forces everything onto one page. The connections, and contradictions, become visible.

Filling it in

Lee facilitates. The team takes a morning. Maya brings the business knowledge, Sam brings customer data, Tom and Priya bring operational reality, Jas brings the product perspective. Tom jokes about buying shares in 3M. Nobody laughs, which tells you something about the mood.

Customer Segments: Two segments from the JTBD and assumption mapping: Convenience seekers (60%) who hire Greenbox to eliminate dinner stress, and Local food advocates (40%) who believe in supporting local farms and eating seasonal produce.

Value Propositions: For convenience seekers: “Dinner decided.” For local advocates: “Know your farmer.” Maya writes both on the board and steps back. “We’ve been marketing one value proposition to two segments. That’s a problem.”

Channels: Word-of-mouth (31%), Google search (28%), Instagram (19%), local press (14%). Delivery via local courier. Customer communication by email.

Customer Relationships: First-box discount for acquisition. Recipe cards, pause/skip, box preview emails for retention. Referral programme for growth.

Revenue Streams: $25/week subscription. Potentially $20/week for a mixed-sourcing box.

At 200 subscribers, all on the $25 box: $5,000 per week. $260,000 per year. Sounds decent.

But Maya hasn’t looked at the other side yet.

Cost Structure:

This is where the room goes quiet. Maya pulls up the numbers on the projector. She hasn’t shared them with the full team before.

Cost component	Per box
Produce (farm gate price)	$14.00
Packing (materials + labour)	$3.50
Delivery (courier)	$4.50
Total variable cost	$22.00

Revenue per box: $25.00. Margin per box: $3.00.

Tom does the arithmetic. “Three dollars margin per box. Two hundred boxes a week. That’s $600 a week. $31,200 a year.”

“And that’s just the box,” Priya says. “Revenue minus what it costs to put one together and get it to the door. The $3 hasn’t paid the warehouse, the software, anyone’s salary, or marketing yet. It hasn’t been taxed yet either. Everything else the business does has to come out of that $31,200.”

The room is silent. The number doesn’t survive that subtraction.

“What about at 1,000 subscribers?” Priya asks.

Maya updates the spreadsheet. Some costs improve with volume. Produce costs are relatively fixed; farms don’t offer bulk discounts at this scale.

Cost component	Per box (at 1,000)
Produce (farm gate price)	$13.00
Packing (materials + labour)	$2.50
Delivery (courier, volume rate)	$3.50
Total variable cost	$19.00

Margin per box: $6.00. $6,000 per week. $312,000 per year. Barely covers operations. No money for growth.

Tom stares at the projector. “We’re building a charity.”

The two-tier question

The canvas is showing contradictions. 60% of subscribers would accept mixed sourcing at $20. But the cost structure assumes 100% local at $25. Maya is paying the premium for local produce, but the majority of her customers wouldn’t notice if she didn’t.

“What if we offered the mixed-sourcing box?” Jas asks.

Maya runs the numbers. If produce cost drops to $8 per box with mixed sourcing:

Model	Revenue/box	Cost/box	Margin/box
100% local, $25	$25.00	$19.00	$6.00
Mixed sourcing, $20	$20.00	$14.00	$6.00

The margin per box is the same. But the subscriber ceiling changes. With a $25-only model, the addressable market is the 40% who value local sourcing enough to pay the premium. With two tiers, the team can serve both segments.

Priya adds: “And mixed sourcing means we’re not dependent on local farms scaling up in six months. Dave told you he can’t increase supply until next growing season.”

Charlotte

Lee sets up a video call for Friday. Charlotte Wong joins from her home office in Perth’s northern suburbs. She’s 41, short grey hair, a bookshelf behind her stuffed with business books and, inexplicably, a small collection of wooden ducks.

Charlotte grew up in Penang, Malaysia. Moved to Australia at fifteen. Engineering degree from UNSW, then a career in the specific kind of companies that either scale or die: a meal kit company, a SaaS platform, a logistics startup. The SaaS platform was acquired. The logistics startup is still running. The meal kit company, the one she doesn’t talk about unless you ask directly, folded eighteen months after she joined. She’d done everything correctly, or thought she had. The unit economics were wrong from the start and nobody caught it until the cash ran out. She keeps a spreadsheet of every business she’s ever worked with. Row 47 is Greenbox. She added it yesterday, after Lee’s call.

Lee gives Charlotte a ten-minute summary. He shares the canvas. Charlotte listens without interrupting. Her face is still, not hostile, diagnostic. She’s reading the canvas the way a mechanic reads an engine.

Then she asks three questions.

“What’s your customer acquisition cost?”

Silence. Nobody knows.

“You don’t know,” Charlotte says. “That’s the most important number in a subscription business. If you can’t tell the board what it costs to acquire a customer, you can’t tell them whether growth is profitable or just expensive.”

“What’s your subscriber lifetime value?”

Maya starts: “Well, the average subscriber stays for…” She trails off.

“At 5% monthly churn, average lifetime is about twenty months,” Charlotte says. “At $25 a week, that’s roughly $2,000 lifetime revenue. Minus variable costs, about $480 lifetime margin at 1,000 subscribers. If your acquisition cost is more than $480, you lose money on every subscriber you add. Growth makes you poorer, not richer.”

She says this without emotion, but behind the flat tone is the meal kit company. They’d grown to 4,000 subscribers before anyone realised the CAC was higher than the lifetime margin. She’s never fully stopped carrying that one.

“One more thing. Freshly charges eighteen dollars a week. You charge twenty-five. They have sixty times your funding and a polished app. If your customers are convenience-driven, and your JTBD data says sixty percent are, and Freshly delivers convenience at a lower price with better technology, what’s your moat?”

Nobody answers. Charlotte doesn’t wait for one.

“Your canvas shows two segments. Have you modelled what happens to your farm relationships if you introduce mixed sourcing? If 60% of subscribers switch to the mixed box, your local farm orders drop by 60%. Dave and Rachel are suddenly selling you 40% of what they used to. Can their businesses survive that?”

Nobody had considered this. Charlotte saw the dependency that the canvas made visible: changing the cost structure could destroy the partnerships.

“I’m not saying the mixed box is wrong. I’m saying you need to model the second-order effects. You need to bring your farms along, or you’ll have a cheap box with no story and an expensive box with no supply.”

Maya writes furiously. Charlotte winds up the call.

“Lee told me about the discovery work. Event Storming, JTBD, assumption mapping. That’s genuinely impressive for a team this size. Most startups your stage are still arguing about what the product should be. You know your domain and your customers. That’s rare.” She pauses. “The next problem is different. You need to know whether the business works, not just the product. I can help with that.”

After the call, Charlotte sits in her home office. She picks up her phone and calls James.

“How was it?” he asks. She can hear the boys arguing in the background.

“I just told a founder her business model doesn’t work. The look on her face.”

“Is the business worth saving?”

Charlotte thinks about Maya’s eyes when the $3 margin appeared on the projector. Not defeat, recognition.

“I think so. But she has to decide that, not me.”

She opens her spreadsheet. Row 47. In the “First Impression” column: Strong discovery culture. Broken unit economics. Founder identity tied to local sourcing, biggest risk is emotional, not financial.

Maya’s draft

That night, Maya sits at the kitchen table in Fremantle. Nadia is in the other room reading. The house is quiet. Maya opens her laptop and starts a new email.

Dear Greenbox subscribers,

We’ve made the difficult decision to pause operations while we reassess our business model to ensure we can continue to deliver the quality you expect.

She reads the words back. They’re corporate and bloodless and they sound nothing like her. She imagines Mrs Patterson reading them. She imagines Patrick reading them. She imagines Dave reading them and thinking: Another one.

She doesn’t delete the draft. She doesn’t send it. She closes the laptop.

Nadia appears in the doorway. “Come to bed.”

“Coming.”

She doesn’t tell Nadia about the email. She doesn’t tell anyone. The draft sits in her email, unsent, for the next six months.

When to use a Business Model Canvas

Preparing to pitch investors. The canvas forces you to think about the whole business, not just the product.
Considering a significant business model change. When you’re launching a new tier or entering a new market, the canvas shows second-order effects.
Post-revenue, pre-profitability. When the product works and people pay, but the model might not sustain itself.

When not to use it

When the problem is execution, not strategy. If deliveries arrive late, fix logistics. The canvas is for strategic clarity.
When you need detailed financial modelling. The canvas shows what the cost structure looks like. For exact numbers, you need a spreadsheet.
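As a starting point for that spreadsheet, the numbers from the whiteboard and from Charlotte’s call can be sketched in a few lines of Python. This is a rough sketch using only the figures in this chapter; the four-weeks-per-month simplification is an assumption chosen to match Charlotte’s round numbers, and the function names are illustrative.

```python
WEEKS_PER_MONTH = 4  # assumption: round figure that matches Charlotte's arithmetic


def margin_per_box(price, produce, packing, delivery):
    """Contribution margin: price minus the variable cost of one box."""
    return price - (produce + packing + delivery)


def lifetime_margin(weekly_margin, monthly_churn):
    """Expected margin from one subscriber before they cancel."""
    lifetime_months = 1 / monthly_churn  # 5% monthly churn -> about 20 months
    return weekly_margin * WEEKS_PER_MONTH * lifetime_months


def growth_is_profitable(acquisition_cost, subscriber_lifetime_margin):
    """Charlotte's test: a new subscriber must earn back what they cost to acquire."""
    return acquisition_cost < subscriber_lifetime_margin


local_at_200 = margin_per_box(25.00, produce=14.00, packing=3.50, delivery=4.50)
local_at_1000 = margin_per_box(25.00, produce=13.00, packing=2.50, delivery=3.50)
mixed_at_1000 = margin_per_box(20.00, produce=8.00, packing=2.50, delivery=3.50)
ltv_margin = lifetime_margin(local_at_1000, monthly_churn=0.05)

print(local_at_200, local_at_1000, mixed_at_1000, ltv_margin)
```

The missing input is the one nobody in the room could supply: the acquisition cost to pass to growth_is_profitable.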

What comes next

Maya has three weeks to prepare her board pitch. She has JTBD data, validated and invalidated assumptions, the canvas, and Charlotte’s framework for calculating the numbers that matter.

She’s also preparing to propose something that would have been unthinkable three months ago: a two-tier product that partially abandons the 100% local sourcing she built the company around. The data says it’s the correct move. Her gut says it’s a betrayal.

Charlotte told her, on that first call: “The founders who scale are the ones who fall in love with the problem, not the solution. You fell in love with local sourcing. Your customers fell in love with not thinking about dinner. Those aren’t the same thing.”

Maya is still thinking about that.

But thinking isn’t a plan. The team has data, frameworks, and broken unit economics. They know what’s wrong. They can’t fix everything at once. The board meeting is in three weeks.

The question isn’t what to change. It’s what changes first.

Grounding a Chatbot in Your Own PDFs

2026-05-11T06:00:00+08:00

The situation

A facilities-engineering team at a manufacturing site maintains 600 PDFs covering roughly 200 pieces of equipment, 50 safety procedures, and 30 maintenance schedules. Documents range from 5 to 300 pages; the largest are OEM manuals with dense tables, wiring diagrams, and exploded parts views. A handful are scans of older paper manuals where the PDF is a picture of a page.

The engineers, around 40 on rotating shifts, currently type keywords into SharePoint search, open the top three or four hits, and Ctrl-F through them. Time-to-answer for “what’s the torque spec on the chiller’s compressor mount?” averages 8-12 minutes. The team lead has asked whether “one of those AI things” could shorten that to under a minute, with a citation back to the exact manual and page.

There is already an S3 bucket mirroring the SharePoint drive (nightly sync). The team has AWS access; they don’t have ML engineers. Managed RAG services have come up in conversation; the question is what configuration actually makes a RAG pipeline work well for this corpus, versus what configuration just makes it work at all.

What actually matters

Before reaching for a managed service, name the levers that govern answer quality in a RAG pipeline, and which ones this corpus is going to push on hardest.

The first is the shape of the corpus. 600 PDFs averaging, say, 40 pages each is roughly 24,000 pages of text. Some have rich tables; some are OEM manuals with figure captions and callouts; some are scanned. A generic “chunk every 300 tokens” strategy will split a wiring-diagram table across two chunks, and the retrieved half won’t make sense on its own. Knowing where the corpus sits on the structured-to-unstructured axis drives the chunking choice; this corpus sits squarely in “structured-heavy.”

The second is what questions the engineers actually ask. “How do I reset the chiller?” maps well to procedure sections, which have clear headings. “What’s the torque spec on the compressor mount?” maps to a table of values. “What PPE do I need for this maintenance?” maps to a safety section. If most questions land on structured regions (tables, bulleted procedures, numbered safety steps), the retrieval needs to handle structure; if most are paraphrased conceptual questions (“why does the chiller do X?”), pure vector similarity is fine.

The third is scale and cost. 24,000 pages at, say, 500 tokens per page is 12M tokens of corpus. Embedding the whole thing once is a one-time pennies-to-dollars cost at the cheap per-token tier; re-embedding on updates is fractions of that. A dedicated managed vector store starts at hundreds of dollars a month for its minimum capacity allocation; piggy-backing on an existing relational database costs the additional storage of a vector column. The cost floor is mostly vector-store running cost, not embedding or querying.

The fourth is scans. Some of the PDFs are image-only. Text doesn’t come out of them without OCR. Any pipeline that ingests this corpus needs a parsing path that calls out to OCR (and ideally layout-aware OCR for tables) instead of silently producing empty chunks. Without that, the scanned manuals are dark matter: they exist in the index but their chunks are near-empty.

The fifth is citation format. The engineers want “chiller manual page 42, section 3.2” as the citation, not a raw S3 URI. That means chunks need to carry location metadata (S3 URI, page number, and ideally section/heading context) all the way through to the response. If the chunks are parsed badly, the citations are rough.

The sixth is governance. Every retrieval call invokes a foundation model; every call needs to land in CloudTrail and optionally invocation logs. The engineers aren’t making sensitive queries, but the corpus contains supplier confidential information (some OEM manuals are marked “not for distribution”). The chosen pipeline needs a place to redact model numbers or supplier names from output, and an IAM-scoped query API so the engineer tool’s role can’t reach beyond the corpus.

What we’ll filter on

Six configuration decisions for a managed RAG pipeline, scored against this particular corpus.

Chunking strategy: arbitrary-token splits, or boundaries that respect the document’s own structure?
Embedding model: English-only or multilingual; what dimension and cost trade-off?
Vector store: a managed standalone, or piggy-backed on a database the team already runs?
Parsing: default text extraction, or layout-aware extraction that handles tables, multi-column, and scans?
Retrieval configuration: how many results, vector-only or hybrid with keyword, and what metadata filters?
Generation model and prompt template: what quality tier, with what instructions about grounding, citation, and refusal?

The configuration landscape

Chunking strategy. Bedrock Knowledge Bases offers four options. Default chunks into roughly 300-token pieces with ~20% overlap: safe and generic, but it ignores structure. Fixed-size lets you set chunk size and overlap explicitly. Hierarchical creates a two-level index: larger “parent” chunks for context and smaller “child” chunks for retrieval; the child is matched but the parent is what gets sent to the generation model. Semantic chunking uses a foundation model to identify natural boundaries (paragraphs, sections, topic shifts) instead of arbitrary token counts. For dense technical manuals with heading structure, semantic or hierarchical chunking retrieves more cleanly than fixed-size because chunk boundaries match the document’s own logic.
Embedding model. Titan Text Embeddings v2 produces 1024-dimensional vectors, costs $0.00002 per 1K tokens, and is multilingual (100+ languages). Cohere Embed English v3 (1024 dims) is English-only and often retrieves slightly better on English-heavy corpora in Cohere’s own benchmarks. Cohere Embed Multilingual v3 (1024 dims) handles non-English. For a mostly-English manuals corpus with occasional non-English OEM content (German machine-tool manuals, Japanese electronics datasheets), Titan v2 is the safe default; Cohere Multilingual if multilingual retrieval quality is proven to be better on a test set.
Vector store. OpenSearch Serverless is the zero-plumbing choice. Bedrock can create it for you. Minimum 2 OCUs at roughly $0.24/OCU/hour means a floor of roughly $350/month. Aurora PostgreSQL with pgvector piggy-backs on an existing Aurora cluster: a CREATE EXTENSION vector; and a vector column on a table. No additional running cost beyond what the cluster already burns, but you manage the schema, the ingestion hooks, and the index tuning. Pinecone and Redis Enterprise Cloud are third-party integrations; useful if the organisation already runs one of them but usually not the first choice for a new build.
Parsing. Default parsing extracts text from PDFs using AWS’s standard extractors: fast and cheap, but it loses layout, and scanned pages produce no text. Advanced parsing routes documents through a foundation model (Claude, Nova) that sees the page layout (tables, figures, columns) and emits structured text preserving that layout. It costs extra per page (priced like a model invocation); for a 24,000-page corpus that’s a material ingestion bill. It pays back on corpora where layout matters (tables, multi-column, scanned). Defaults work on plain-text PDFs.
Retrieval configuration. numberOfResults sets how many chunks to retrieve per query and defaults to 5. For a 600-PDF corpus where relevant content might be split across chunks, 6-10 is often better. overrideSearchType controls vector-only vs. hybrid (vector similarity plus keyword BM25). Hybrid matters when exact terms (part numbers, equipment tags) drive the query. Metadata filters let queries constrain retrieval by fields on the source document (“only safety manuals”, “only equipment in building B”); this requires metadata to be attached at ingestion via .metadata.json sidecars in S3.
Generation model and prompt template. The generation model is independently configurable: Claude Sonnet, Nova Pro, Llama, any Bedrock-hosted text model. The Knowledge Base has a default prompt template that injects retrieved chunks under $search_results$ and asks the model to answer from them; you can override it with a custom template that specifies citation format, refusal behaviour, and tone.
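To make the cost floor concrete, here is a back-of-envelope sketch using only the figures quoted above. The 730-hour month is an assumption, and real prices vary by region and over time, so treat the output as an order of magnitude, not a quote.

```python
PAGES = 600 * 40                   # 600 PDFs at roughly 40 pages each
TOKENS_PER_PAGE = 500              # rough figure used earlier in this post
TITAN_V2_USD_PER_1K_TOKENS = 0.00002
OCU_USD_PER_HOUR = 0.24            # approximate OpenSearch Serverless rate
MINIMUM_OCUS = 2
HOURS_PER_MONTH = 730              # assumption: average month length

corpus_tokens = PAGES * TOKENS_PER_PAGE
one_off_embedding_usd = corpus_tokens / 1000 * TITAN_V2_USD_PER_1K_TOKENS
vector_store_floor_usd = MINIMUM_OCUS * OCU_USD_PER_HOUR * HOURS_PER_MONTH

print(f"corpus tokens:       {corpus_tokens:,}")
print(f"one-off embedding:   ${one_off_embedding_usd:.2f}")
print(f"vector store floor:  ${vector_store_floor_usd:.2f} per month")
```

Embedding the whole corpus costs less than a single hour of the vector store’s minimum capacity, which is why the store, not the embedding, sets the floor.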

Side by side

Matching each decision to the facilities-engineering corpus:

Decision	Default	This corpus	Rationale
Chunking	Default (300 tokens)	Semantic	Manual sections have natural boundaries; avoid splitting tables
Embedding model	Titan v2	Titan v2	English-heavy with occasional non-English content; a multilingual model is the safe default
Vector store	OpenSearch Serverless	OpenSearch Serverless	Lowest friction; no existing Aurora to piggy-back on
Parsing	Default text extraction	Advanced parsing	Scans + tables + figures require layout-aware extraction
Retrieval	5 results, vector-only	8 results, hybrid	Part numbers and equipment tags need keyword precision
Generation model	Claude Sonnet	Claude Sonnet	Quality on drafting technical procedures justifies the token cost

The two decisions that matter most for this corpus are advanced parsing (scanned manuals are otherwise invisible) and hybrid retrieval (part numbers and equipment tags are exact-match hints that pure vector search can miss). The others are close to defaults.
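A toy sketch shows why the keyword half earns its place. It merges a vector ranking and a keyword ranking with reciprocal rank fusion, a common way to combine ranked lists. That fusion method is an assumption for illustration, not a description of how OpenSearch Serverless scores hybrid queries, and the chunk ids and rankings are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of chunk ids; a chunk scores 1 / (k + rank) per list."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is stable.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


# Hypothetical results for "Chiller fault code E-207, what's the reset procedure?"
vector_ranking = ["chiller-reset-overview", "hvac-startup", "e207-quick-ref"]
keyword_ranking = ["e207-quick-ref", "trane-7.4-fault-codes"]

fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
print(fused)
```

The chunk containing the exact fault code sits third in the vector-only ranking but rises to the top once the keyword ranking gets a vote.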

How the pieces fit together

Two phases, seven moving parts. Ingestion is rare and expensive per-page; query is frequent and cheap per-call.

The configuration in depth

Creating the Knowledge Base. CreateKnowledgeBase takes an embedding model ARN (Titan v2 for this build), a vector store configuration (OpenSearch Serverless collection ARN and field mappings), and an IAM service role Bedrock will assume to read S3 and write to OpenSearch. The S3 data source (bucket ARN, optional inclusion prefixes) is attached to the Knowledge Base with a separate CreateDataSource call. The field mappings are worth getting right: vectorField names the field holding the 1024-dim vector, textField holds the chunk text, metadataField holds anything else (source URI, page number, section heading).
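A minimal sketch of that request as a Python dict, assuming the parameter shapes of the boto3 bedrock-agent client’s create_knowledge_base call. The account id, collection id, role name, index name, and the three field names are placeholders, not values from this build.

```python
def knowledge_base_request(name, role_arn, embedding_model_arn, collection_arn, index_name):
    """Build keyword arguments for create_knowledge_base (no AWS call is made here)."""
    return {
        "name": name,
        "roleArn": role_arn,
        "knowledgeBaseConfiguration": {
            "type": "VECTOR",
            "vectorKnowledgeBaseConfiguration": {"embeddingModelArn": embedding_model_arn},
        },
        "storageConfiguration": {
            "type": "OPENSEARCH_SERVERLESS",
            "opensearchServerlessConfiguration": {
                "collectionArn": collection_arn,
                "vectorIndexName": index_name,
                # Field names are hypothetical; they must match the index you create.
                "fieldMapping": {
                    "vectorField": "embedding",
                    "textField": "chunk_text",
                    "metadataField": "chunk_metadata",
                },
            },
        },
    }


request = knowledge_base_request(
    name="facilities-manuals",
    role_arn="arn:aws:iam::123456789012:role/FacilitiesKbRole",
    embedding_model_arn="arn:aws:bedrock:eu-west-1::foundation-model/amazon.titan-embed-text-v2:0",
    collection_arn="arn:aws:aoss:eu-west-1:123456789012:collection/examplecollection",
    index_name="facilities-index",
)
```

With credentials in place, the dict would be passed as boto3.client("bedrock-agent").create_knowledge_base(**request).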

Advanced parsing configuration. In the data source config, parsingConfiguration with parsingStrategy: BEDROCK_FOUNDATION_MODEL and a bedrockFoundationModelConfiguration pointing at Claude 3 Haiku (cheapest capable option) or Claude Sonnet (more accurate on complex layouts). The parser sees each PDF page as an image and emits layout-aware text: tables as tables, figures with captions, multi-column text reassembled in reading order. Costs scale per page; budget for a one-off few-hundred-dollar ingestion bill on the initial 24,000 pages.

Chunking configuration. chunkingConfiguration with chunkingStrategy: SEMANTIC and semanticChunkingConfiguration specifying maxTokens (e.g. 600), bufferSize (e.g. 1), and breakpointPercentileThreshold (e.g. 95). The threshold controls how aggressively the chunker splits: higher values mean fewer, larger chunks; lower means more, smaller. 95 is a reasonable starting point for procedure-style documents; tune by running a test set of queries and looking at whether retrieved chunks contain the whole answer or half of it.
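Put together, the parsing and chunking settings live in one data source request. This is a sketch assuming the parameter shapes of the boto3 bedrock-agent create_data_source call; the data source name, bucket ARN, and Knowledge Base id are placeholders.

```python
def data_source_request(knowledge_base_id, bucket_arn, parser_model_arn):
    """Build keyword arguments for create_data_source (no AWS call is made here)."""
    return {
        "knowledgeBaseId": knowledge_base_id,
        "name": "facilities-manuals-s3",  # hypothetical name
        "dataSourceConfiguration": {
            "type": "S3",
            "s3Configuration": {"bucketArn": bucket_arn},
        },
        "vectorIngestionConfiguration": {
            # Advanced parsing: a foundation model reads each page.
            "parsingConfiguration": {
                "parsingStrategy": "BEDROCK_FOUNDATION_MODEL",
                "bedrockFoundationModelConfiguration": {"modelArn": parser_model_arn},
            },
            # Semantic chunking with the starting values suggested above.
            "chunkingConfiguration": {
                "chunkingStrategy": "SEMANTIC",
                "semanticChunkingConfiguration": {
                    "maxTokens": 600,
                    "bufferSize": 1,
                    "breakpointPercentileThreshold": 95,
                },
            },
        },
    }


ds_request = data_source_request(
    knowledge_base_id="KB-FACILITIES",
    bucket_arn="arn:aws:s3:::example-facilities-manuals",
    parser_model_arn="arn:aws:bedrock:eu-west-1::foundation-model/anthropic.claude-3-haiku-20240307-v1:0",
)
```

Keeping the three chunking numbers in one place makes it easier to re-run ingestion with a different threshold when the test queries come back with half-answers.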

Data source sync. StartIngestionJob kicks off ingestion. For the initial run, this parses, chunks, embeds, and indexes the full corpus (24,000 pages taking typically a few hours end-to-end, mostly in advanced parsing). For subsequent runs, Bedrock diffs against the last manifest and only re-processes changed files. An EventBridge Scheduler rule running StartIngestionJob nightly (or hourly if updates are frequent) keeps the index current.
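Whatever the schedule ends up invoking, the sync itself is one call. A small sketch, assuming the boto3 bedrock-agent client’s start_ingestion_job method and its response shape; the client is passed in so the function can be exercised with a stub, and the ids in the usage note are placeholders.

```python
def start_sync(client, knowledge_base_id, data_source_id):
    """Kick off an ingestion job and return its id and initial status."""
    response = client.start_ingestion_job(
        knowledgeBaseId=knowledge_base_id,
        dataSourceId=data_source_id,
    )
    job = response["ingestionJob"]
    return job["ingestionJobId"], job["status"]
```

In a real deployment this would be called as start_sync(boto3.client("bedrock-agent"), "KB-FACILITIES", "DS-MANUALS"), with the returned job id logged so a failed nightly run can be traced.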

The retrieval call. RetrieveAndGenerate takes the question text and a configuration:

{
  "input": {"text": "How do I reset the chiller on floor 4?"},
  "retrieveAndGenerateConfiguration": {
    "type": "KNOWLEDGE_BASE",
    "knowledgeBaseConfiguration": {
      "knowledgeBaseId": "KB-FACILITIES",
      "modelArn": "arn:aws:bedrock:eu-west-1::foundation-model/anthropic.claude-sonnet-4-5-20250929-v1:0",
      "retrievalConfiguration": {
        "vectorSearchConfiguration": {
          "numberOfResults": 8,
          "overrideSearchType": "HYBRID",
          "filter": {
            "equals": {
              "key": "building",
              "value": "B"
            }
          }
        }
      },
      "generationConfiguration": {
        "promptTemplate": {
          "textPromptTemplate": "You are a facilities-engineering assistant. Answer the engineer's question using only the provided manual excerpts. If the excerpts don't cover it, say 'I don't have that procedure in the indexed manuals' rather than guessing. Always cite the source as [manual name, page N, section if given]. Be concise; engineers are often on the floor. Question: $query$\n\nManual excerpts:\n$search_results$"
        },
        "inferenceConfig": {
          "textInferenceConfig": {
            "maxTokens": 800,
            "temperature": 0.1
          }
        }
      }
    }
  }
}

Three points on that call. The filter constrains retrieval to metadata matching building=B, which works because the ingestion step attached building metadata via .metadata.json sidecars in S3 (each PDF has a companion JSON listing building, equipment type, manufacturer). The custom promptTemplate is the refusal-and-citation instruction; without it the default template does a passable job but not a great one. temperature: 0.1 keeps the model close to the retrieved text; higher values invite embellishment, which is exactly what a safety-adjacent domain doesn’t want.
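The sidecars are easy to generate in bulk before the nightly sync. A sketch of one way to do it, assuming the metadataAttributes wrapper that Bedrock Knowledge Bases expects in a .metadata.json file; it writes next to a local copy of the PDF, and the attribute names other than building are hypothetical.

```python
import json
from pathlib import Path


def write_sidecar(pdf_path, building, equipment_type, manufacturer):
    """Write <pdf name>.metadata.json beside the PDF and return its path."""
    sidecar = Path(str(pdf_path) + ".metadata.json")
    payload = {
        "metadataAttributes": {
            "building": building,              # matches the filter key in the call above
            "equipment_type": equipment_type,  # hypothetical attribute name
            "manufacturer": manufacturer,      # hypothetical attribute name
        }
    }
    sidecar.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return sidecar
```

The sidecar has to land in the same S3 prefix as its PDF, so it is simplest to write it wherever the SharePoint mirror is staged and let the existing sync carry both files.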

A worked question

Ravi is on shift and the chiller alarm is blinking. He opens the engineer tool on his tablet.

Ravi: "Chiller floor 4 fault code E-207, what's the reset procedure?"

The tool calls RetrieveAndGenerate. A second later:

Answer:

Fault code E-207 on the floor-4 chiller (Trane CGAM 050, manual v3.2)
indicates a high-pressure cutout on the refrigerant circuit. Reset procedure:

1. Verify the condenser fans are running and unobstructed.
2. Wait 5 minutes for head pressure to normalise.
3. At the controller panel, press MENU -> DIAGNOSTICS -> CLEAR FAULT.
4. If E-207 returns within 10 minutes, escalate to maintenance
   (do not attempt a second reset).

PPE required: safety glasses only for the reset; refrigerant PPE only
if a circuit must be opened.

Sources:
  [Trane CGAM 050 Manual v3.2, page 142, §7.4]
  [Chiller Fault Code Quick Reference, page 3]
  [Site PPE Matrix - Building B, page 12]

What happened under the hood:

Ravi’s question was embedded with Titan v2.
OpenSearch Serverless ran a hybrid query: the vector half matched conceptually on “chiller reset”, the keyword half matched “E-207” exactly, and 8 chunks came back.
The top chunks were §7.4 of the Trane manual, the entry for E-207 in the quick-reference, and the PPE matrix section for Building B.
Claude Sonnet saw the chunks and the custom prompt template, and produced a grounded answer with the three citations. Every cited fact came from the retrieved text.
The tool rendered clickable citations that link back to the S3 URI and page number of each source document. Tapping “[Trane CGAM 050 Manual v3.2, page 142, §7.4]” opens the PDF at that page.
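The citation rendering in that last step can be sketched as a walk over the response. This assumes the citations / retrievedReferences / location.s3Location.uri shape of a RetrieveAndGenerate response; the metadata keys manual_name, page, and section are hypothetical and depend on what was attached at ingestion.

```python
def citation_labels(response):
    """Turn retrieved references into '[manual, page N, §section]' labels."""
    labels = []
    for citation in response.get("citations", []):
        for ref in citation.get("retrievedReferences", []):
            uri = ref.get("location", {}).get("s3Location", {}).get("uri", "unknown source")
            meta = ref.get("metadata", {})
            # Fall back to the file name when no friendly manual name was attached.
            parts = [meta.get("manual_name") or uri.rsplit("/", 1)[-1]]
            if "page" in meta:
                parts.append(f"page {meta['page']}")
            if "section" in meta:
                parts.append(f"§{meta['section']}")
            label = "[" + ", ".join(parts) + "]"
            if label not in labels:  # the same chunk can back several sentences
                labels.append(label)
    return labels
```

If chunks were parsed without page or section metadata, the label degrades to the file name, which is the “rough citations” failure mode described earlier.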

Round trip: ~2 seconds. The model didn’t invent a fault code (it exists), didn’t invent a page number (it matches the source), and didn’t skip the PPE step (retrieval surfaced the site matrix). Time-to-answer went from around 10 minutes to a few seconds.

Edge cases the configuration handles

The scanned manual. An older Siemens drive manual is a scan, not a text PDF. Without advanced parsing, its chunks would be near-empty and it would be invisible to retrieval. With advanced parsing, Claude extracts the text from each page image; the chunks carry real content. The OCR quality is imperfect on handwritten annotations, but the typewritten body text extracts cleanly enough for questions about it to land on its pages.

The multi-building filter. Some procedures differ between Building A and Building B (different equipment models, different PPE requirements). Each PDF has a .metadata.json sidecar specifying which building it applies to. The retrieval call’s filter constrains to the engineer’s current building, so “what’s the PPE for confined-space entry?” returns the Building B matrix, not Building A’s.

The torque-spec table. A question like “what’s the torque spec on the compressor mount?” hits a table in the manual. Default parsing would have flattened the table row-by-row and split it across chunks; advanced parsing preserves it. Semantic chunking keeps the table intact within one chunk. The retrieved chunk contains the full table; the model extracts the right row based on the question’s equipment reference.

The no-answer case. An engineer asks “what’s the torque spec on the new HVAC from SupplierCo?”, but the SupplierCo HVAC was installed last week and its manual hasn’t been added yet. Hybrid retrieval returns low-relevance chunks. The custom prompt template’s instruction, “If the excerpts don’t cover it, say ‘I don’t have that procedure in the indexed manuals’ rather than guessing”, kicks in, and the model refuses gracefully, prompting the engineer to add the manual or call the supplier.

What’s worth remembering

Bedrock Knowledge Bases is end-to-end managed RAG. Point it at S3, configure it, call RetrieveAndGenerate. Ingestion, embedding, storage, retrieval, and generation are plumbed for you.
Advanced parsing is the right default for document-heavy corpora. It costs real money at ingestion but turns scans into text and preserves tables and layout. Defaults lose all of that.
Semantic chunking respects document structure. Fixed-size chunking splits tables and procedure lists at arbitrary points. Semantic chunking aligns boundaries with the document’s own sections and paragraphs.
Hybrid search beats vector-only when exact terms matter. Part numbers, equipment tags, fault codes: keyword BM25 gets these right; pure vector search can miss them when surface forms don’t match.
Metadata filters are the scoping lever. Sidecar .metadata.json files attach structured attributes to each document; retrieval calls can filter by any of them. This is how you get per-building, per-equipment, per-role retrieval from one index.
Custom prompt templates are where refusal and citation behaviour lives. The default is adequate; a custom template is where you instruct the model to say “I don’t know” instead of inventing, and to format citations the way your UI expects.
Ingestion cost is one-off plus incremental; query cost is per-call. The big bill is the initial advanced-parsing pass over the whole corpus. Subsequent updates only re-parse changed documents; queries are standard Bedrock on-demand.
Invocation logging + CloudTrail + KMS keep the governance story complete. Every RetrieveAndGenerate call emits a CloudTrail event; invocation logs capture full prompt and response to S3 under a customer-managed key; Knowledge Base IAM is a separate policy from the underlying model policy.

A working facilities chatbot isn’t a single configuration choice; it’s six of them, each justified by the shape of the corpus. Advanced parsing and hybrid retrieval are the two that shift this build from “it mostly works” to “engineers trust it on the floor.” The others are close to defaults, and that’s fine: the defaults exist because they’re sensible starting points. The craft is knowing which defaults to change.

After the Transformer

2026-05-09T06:00:00+08:00

Your context window is one million tokens. The model bills you per token in and per token out, and the in-token bill grows linearly with the prompt, but the underlying compute grows quadratically. At a million tokens, the attention step is doing roughly a trillion pairwise calculations. Someone is paying for that. It’s you.

A handful of new architectures claim they can do the same job at linear cost. Some of them can. Some of them can’t. None of them have replaced transformers yet, but at least one of them is going to.

In To LLMs… and Beyond! we mentioned state-space models, specifically Mamba, as the leading post-transformer candidate. That’s accurate but underspecified. There’s a whole research front trying to do better than the transformer at sequence modelling, and the candidates differ in what they’re trying to fix. This post walks the field.

The point isn’t that transformers are about to be replaced. They aren’t. The point is that the assumption “transformer = the only way” is already broken, and the alternatives are interesting enough to know about before they show up in production.

What’s wrong with transformers

The transformer’s superpower is its attention mechanism: every token can attend to every other token. That’s how it captures long-range dependencies, and it’s why it dominates language modelling.

The cost is also right there in the design. If your sequence has n tokens, the attention step does roughly n² pairwise comparisons. Double the input, quadruple the compute and the memory.
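The arithmetic is worth running once. This sketch counts pairwise comparisons only; it ignores heads, layers, and constant factors, so it shows the scaling, not a real FLOP count.

```python
def attention_pairs(n_tokens):
    """Every token compared with every token: n squared pairs."""
    return n_tokens * n_tokens


for n in (2_000, 200_000, 1_000_000, 2_000_000):
    print(f"{n:>9,} tokens -> {attention_pairs(n):.1e} pairwise comparisons")

# Doubling the input quadruples the work, whatever the starting length.
ratio = attention_pairs(4_000) / attention_pairs(2_000)
```

Going from a 2,000-token prompt to a 2,000,000-token one multiplies the input by a thousand and the pairwise work by a million.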

For short sequences this doesn’t matter. For long ones it dominates. A 2,000-token prompt is fine. A 200,000-token prompt is expensive. A 2,000,000-token prompt is, on a vanilla transformer, infeasible.

The industry has worked around this with engineering: FlashAttention, sliding-window attention, ring attention, KV-cache compression. The workable context window has stretched from 2k tokens (GPT-3) to 1M+ tokens (Claude, Gemini) over a few years. But the underlying complexity is still quadratic. The workarounds are clever, not free.

The post-transformer architectures all share one design goal: sub-quadratic scaling in sequence length. Beyond that they diverge sharply.

State-space models: Mamba

The most-discussed post-transformer architecture is the state-space model (SSM), and the leading example is Mamba (Gu and Dao, 2023).

The intuition is the one we used in the entry post: instead of every token attending to every other token (the “re-read the book each time” approach), the model maintains a compressed hidden state that gets updated as each token comes in (the “running notes” approach). The cost of updating is constant per token, so the total cost is linear in sequence length, not quadratic.

The catch is that the hidden state is lossy. It’s a fixed-size summary of everything that came before. If a transformer wants to recall the seventh sentence of a hundred-page document, it has the full attention budget to do so. If Mamba wants to recall it, it has to have written something useful about it into the hidden state at the time, and the hidden state has finite capacity.
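The “running notes” idea can be sketched as a toy linear recurrence. This is an illustration under simplifying assumptions, not Mamba’s actual kernel: the inputs are plain floats, and the values of `a`, `b` and `c` are made up for the example rather than learned.

```python
def ssm_scan(inputs, a=0.9, b=(1.0, 0.5), c=(0.3, 0.7)):
    """Run h_t = a * h_{t-1} + b * x_t and y_t = c . h_t over a list of floats."""
    h = [0.0] * len(b)  # fixed-size hidden state: the "running notes"
    outputs = []
    for x in inputs:
        # One update per token; its cost depends on the state size,
        # not on how many tokens came before.
        h = [a * h_i + b_i * x for h_i, b_i in zip(h, b)]
        outputs.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return outputs, h


# One informative token followed by ten empty ones.
outputs, state = ssm_scan([1.0] + [0.0] * 10)
print(len(state), state[0])
```

The state stays two numbers wide however long the input gets, which is the linear-cost win. It is also the lossy part: ten steps later, the first token’s trace has shrunk by a factor of 0.9 at every step, and nothing else about it survives.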

The Mamba innovation that mattered was making the state-update mechanism selective: the model learns which tokens to write into its state and which to skim past, rather than treating every token equally. This narrowed the gap with transformers significantly, particularly on language modelling benchmarks.
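A scalar sketch of what “selective” means in practice: the step size is computed from the input itself, so some tokens overwrite the state and others barely touch it. The gate parameters `w` and `bias` are hypothetical values chosen for the example (a real model learns them), and the update is a one-dimensional simplification rather than Mamba’s full formulation.

```python
import math


def softplus(z: float) -> float:
    return math.log1p(math.exp(z))


def selective_scan(inputs, w=8.0, bias=-4.0):
    """Toy selective recurrence with an input-dependent step size."""
    h = 0.0
    trace = []
    for x in inputs:
        delta = softplus(w * x + bias)   # large for "important" inputs, near 0 otherwise
        keep = math.exp(-delta)          # fraction of the old state that survives
        h = keep * h + (1.0 - keep) * x  # small delta: skim past; large delta: write
        trace.append(h)
    return trace


print(selective_scan([1.0, 0.0, 0.0, 2.0]))
```

With these made-up parameters, the zeros leave most of the stored value in place, while the final 2.0 almost completely replaces it.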

As of 2026, Mamba and Mamba-2 are competitive with transformers of similar size on many language tasks, sometimes superior on tasks involving very long sequences (DNA, audio, ultra-long documents), and sometimes weaker on tasks requiring precise long-range recall (associative memory). The honest summary: Mamba is real, it works, and it hasn’t beaten transformers across the board.

The hybrid approach: Striped Hyena, Jamba

Most serious research on post-transformer architectures has converged on a pragmatic answer: don’t pick one, mix them.

Hyena (Stanford, 2023) is a sub-quadratic operator built on long convolutions, and its successor Striped Hyena interleaves Hyena blocks with attention blocks, letting the cheap Hyena blocks do most of the work and the expensive attention blocks handle the parts that genuinely need cross-token comparison.

Jamba (AI21 Labs, 2024) does the same thing but with Mamba blocks: a transformer-Mamba hybrid that uses Mamba layers for efficiency and transformer layers for the kinds of pattern matching transformers are still better at.

The hybrid pattern is now the default assumption for “what comes after the pure transformer.” It’s not “Mamba replaces attention,” it’s “Mamba is a cheap layer that lets you spend your attention budget more carefully.”

RWKV and RetNet: the RNN comeback

Two other notable lines try to revive the recurrent neural network, the architecture transformers replaced, with modern training tricks.

RWKV (Receptance Weighted Key Value, BlinkDL, 2023+) is an RNN that can be trained like a transformer. Standard RNNs are notoriously slow to train because they’re inherently sequential: token t+1 depends on token t. RWKV reformulates the recurrence in a way that allows parallel training (like a transformer) but sequential inference at constant cost per token (like an RNN). At inference time, an RWKV model uses constant memory regardless of sequence length, which is the dream that transformers can’t achieve.

RetNet (Retentive Network, Microsoft, 2023) takes a similar approach with a different mechanism. It claims the “impossible triangle”: parallel training, recurrent inference, and strong performance.
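The “train in parallel, infer recurrently” trick is easier to see in code. The sketch below uses a scalar, decay-weighted form in the spirit of RetNet’s retention; it leaves out normalisation, multiple heads and everything else a real model needs, and it is not RWKV’s exact formulation. The function names and the decay `gamma` are made up for the example.

```python
def retention_parallel(q, k, v, gamma=0.9):
    """o_t = sum over j <= t of gamma**(t - j) * q_t * k_j * v_j.

    Each output depends only on the inputs, never on another output,
    so all positions could be computed at the same time during training.
    """
    n = len(q)
    return [
        sum(gamma ** (t - j) * q[t] * k[j] * v[j] for j in range(t + 1))
        for t in range(n)
    ]


def retention_recurrent(q, k, v, gamma=0.9):
    """Same outputs, one token at a time, carrying a single running state."""
    state = 0.0
    outputs = []
    for q_t, k_t, v_t in zip(q, k, v):
        state = gamma * state + k_t * v_t  # constant memory per step
        outputs.append(q_t * state)
    return outputs


q, k, v = [1.0, 2.0, 3.0], [0.5, 1.0, 1.5], [2.0, 0.0, 1.0]
print(retention_parallel(q, k, v))
print(retention_recurrent(q, k, v))
```

Both functions compute the same numbers. The first form suits training on whole sequences; the second suits generation, where only `state` has to be kept between tokens.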

Neither has displaced transformers. Both are competitive in their weight classes, and both are interesting if you care about deployment cost more than peak quality: a constant-memory inference path is genuinely useful when you’re running models on phones or in tight latency budgets.

Liquid neural networks

Liquid AI (an MIT spin-out) builds on a different research lineage: continuous-time neural networks where the hidden state evolves according to differential equations rather than discrete update steps. The promise is dramatically smaller models (often orders of magnitude smaller) that match the performance of much larger transformers on specific tasks.

It’s early. Their language models are interesting and small (Liquid’s LFM-3B punches above its weight), but the wider research community hasn’t replicated the results across the spectrum of language tasks. Worth knowing exists. Probably not worth deploying yet unless you have a specific reason.

Diffusion for text

Image generation switched from autoregressive to diffusion years ago (DALL-E 1 was autoregressive; DALL-E 2 onwards is diffusion). The natural question: why not the same for text?

The answer for a long time was “because text is discrete and diffusion is continuous.” Recent work has found ways around this: discrete diffusion (operating directly on token distributions rather than continuous latents), masked diffusion (a generalisation of BERT’s masking objective), and absorbing-state diffusion (gradually replacing tokens with a special mask token, then learning to reverse the masking).

Models in this space include SEDD (Score Entropy Discrete Diffusion), Plaid, and LLaDA (Large Language Diffusion with mAsking, 2024-2025). The pitch is interesting: instead of generating left-to-right one token at a time, the model generates the whole output simultaneously and refines it over multiple denoising steps. This gives you parallel generation (faster wall-clock for long outputs) and the ability to edit or fill in any part of the output (not just append to the end).
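The decoding loop for the masked variant can be sketched as follows. This is a toy, not any published model’s sampler: `toy_denoiser` is a hypothetical stand-in for a trained network, with a made-up target sentence and made-up confidence scores, and the rule “commit the most confident guesses each step” is one simple schedule among several.

```python
MASK = "<mask>"
TARGET = ["the", "box", "arrives", "on", "friday"]  # made-up sentence for the toy


def toy_denoiser(seq):
    """Stand-in for a trained model: a (token, confidence) guess per masked slot."""
    return {
        i: (TARGET[i], 1.0 / (1 + i))
        for i, tok in enumerate(seq)
        if tok == MASK
    }


def diffusion_decode(length, steps, denoiser):
    """Start fully masked, then commit the most confident guesses each step."""
    seq = [MASK] * length
    per_step = -(-length // steps)  # ceiling division
    history = [list(seq)]
    for _ in range(steps):
        proposals = denoiser(seq)  # every masked position is guessed in one pass
        if not proposals:
            break
        ranked = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:per_step]:
            seq[i] = proposals[i][0]
        history.append(list(seq))
    return seq, history


final, history = diffusion_decode(length=5, steps=3, denoiser=toy_denoiser)
for step, snapshot in enumerate(history):
    print(step, " ".join(snapshot))
```

The number of model passes is set by `steps`, not by output length, and any position can be re-masked and regenerated, which is where the in-place editing property comes from.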

As of 2026, diffusion language models are competitive with similarly-sized autoregressive transformers on some benchmarks but lag on others. They’re a genuine alternative paradigm, not just a tweak. Whether they end up dominant or niche is one of the more open questions in the field.

A comparison

Architecture	Sequence cost	Inference memory	Strengths	Weaknesses
Transformer	O(n²)	Grows with context	General performance, ecosystem maturity	Cost at long context
Mamba (SSM)	O(n)	Constant	Long-sequence efficiency	Lossy hidden state, weaker associative recall
Striped Hyena / Jamba (hybrid)	Sub-quadratic	Mostly constant + some attention KV	Pragmatic mix, often best of both	More complex to train
RWKV / RetNet (RNN-like)	O(n) train, constant inference	Constant	Cheapest inference, edge-friendly	Smaller ecosystem, training quirks
Liquid (continuous-time)	O(n) typical	Constant or near-constant	Very small models punching up	Early, narrower benchmark coverage
Diffusion (discrete)	O(n) per step × steps	Holds full sequence	Parallel generation, in-place editing	Fixed step count, less mature for text

What’s actually in production

In 2026, transformers still dominate every major API and almost every open-weight release. The frontier models (Claude, GPT, Gemini) are transformers. The leading open-weight models (Llama, Mistral, Qwen) are transformers.

The cracks where alternatives have started shipping:

Long-sequence applications (DNA, audio, ultra-long-document analysis) increasingly use Mamba or hybrid architectures because the quadratic cost is the binding constraint.
Edge deployment (phones, embedded devices) is where RWKV and RetNet have the most traction: constant-memory inference matters more than peak benchmark scores when you have 4GB of RAM.
Hybrid models like Jamba are starting to appear in commercial offerings, mostly behind the scenes.
Diffusion language models are research today, productisation tomorrow: the parallel generation property is too useful to ignore long-term.

What this means for you

Probably nothing immediate. If you’re building on Claude or GPT or a Llama derivative, you’re using a transformer, and you’ll keep using a transformer for the foreseeable future. The point of knowing the alternatives isn’t to switch away from transformers tomorrow.

The point is to recognise the shape of the next disruption when it lands. The story of “X dominated Y until something better came along” is the story of every architecture in the history of machine learning. Convolutional networks dominated vision for a decade until Vision Transformers came for them. RNNs dominated sequence modelling until transformers came for them. Transformers will eventually be replaced by something, and the candidates above are the live ones in 2026.

If you maintain AI infrastructure, the bet that pays off is keeping the interfaces clean, treating “the language model” as a swappable component rather than baking transformer-specific assumptions into your stack. The day a hybrid architecture starts winning at half the cost, you want to be able to swap.
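One way to keep that seam clean, sketched with Python’s `typing.Protocol`. Everything named here (`LanguageModel`, `complete`, the two backends, `summarise`) is hypothetical and exists only for the example; a real client would wrap whichever SDK you actually use.

```python
from typing import Protocol


class LanguageModel(Protocol):
    """The only surface the rest of the stack is allowed to depend on."""

    def complete(self, prompt: str, max_tokens: int) -> str: ...


class TransformerBackend:
    """Hypothetical stand-in for a hosted transformer API client."""

    def complete(self, prompt: str, max_tokens: int) -> str:
        # Placeholder behaviour: truncates characters, not real tokens.
        return f"[transformer] {prompt[:max_tokens]}"


class HybridBackend:
    """Hypothetical stand-in for a future hybrid-architecture model."""

    def complete(self, prompt: str, max_tokens: int) -> str:
        return f"[hybrid] {prompt[:max_tokens]}"


def summarise(model: LanguageModel, text: str) -> str:
    # Call sites know about the interface, never about the architecture.
    return model.complete(f"Summarise: {text}", max_tokens=64)


print(summarise(TransformerBackend(), "weekly churn report"))
print(summarise(HybridBackend(), "weekly churn report"))
```

Swapping architectures then means changing which backend gets constructed, not touching every call site. The things most likely to leak through such an interface are transformer-shaped assumptions such as context-length limits and per-token pricing, so those are worth keeping behind it too.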

The pure transformer is showing its age in one specific way: the quadratic cost of attending every token to every other token, which the workarounds soften but don’t remove. The candidates all try to escape that ceiling by some flavour of compressed running state. Mamba writes notes as it goes and pays the price in lossy recall. RWKV and RetNet pull the recurrent network out of retirement with new training tricks and get constant-memory inference in return. Liquid networks let the hidden state evolve continuously and squeeze surprising performance out of very small models. Diffusion abandons the left-to-right loop entirely and refines a whole output across multiple passes. None of these has unseated the transformer, and the hybrids (Striped Hyena, Jamba) are an admission that the most useful answer in the medium term is a mix.

If you’re building on Claude or GPT today, the practical takeaway is to keep the interface to “the language model” honest and swappable. The history of machine learning is a sequence of architectures dominating until something better arrived. Transformers will get their turn. The architecture that eventually replaces them is probably already in a paper somewhere on arXiv.

The Workshop: Jobs to be Done

2026-05-08T06:00:00+08:00

Customers don’t buy products; they hire them to do a job. JTBD is the interview technique that surfaces the actual job and the alternatives they’d defect to. Switch interviews (structured interviews with people who recently switched products or services, asking what triggered the move) are the core mechanic. Why Subscribers Actually Stay is the worked example; this post is the playbook.

Jobs to be Done

Jobs to be Done runs switch interviews with recent customers and reads them through the four forces (push from the old situation, pull of the new option, anxiety about switching, habit holding the customer in place), so the team ends up with candidate job statements grounded in what customers actually said rather than what the room already believed. Sometimes called JTBD, job mapping, or outcome-driven innovation, though outcome-driven innovation is a distinct quantitative framework (Ulwick) that layers over the qualitative interviewing. The switch-interview technique comes from Bob Moesta and the Re-Wired Group; the four-forces framing from Moesta and Chris Spiek; the broader theory from Clayton Christensen. Frequently confused with user personas: personas describe who a user is; jobs describe what they’re trying to get done, a different axis and a more useful one for product decisions.

At a glance

Who, for how long: a facilitator, two or three rotating interviewers, a note-taker, the product lead, and ideally a CS or ops observer. Four to six team members, around 3h 45min with two interviews or 4h 30min with three.
What you walk out with: 3–5 candidate job statements in the form “When [situation], I want to [motivation], so I can [outcome]”, a clustered wall of verbatim quotes tagged against the four forces, and the interview transcripts filed for future reading.
When to reach for it: churn that won’t move, a new feature area where the team can’t agree on the problem it solves, or several teams prioritising against different implicit jobs. Not for tactical backlog refinement, not when there are no recent switchers to talk to, and not when the team has decided the answer and only wants validation.

What’s It For

A team builds a feature someone asked for. The feature lands, the telemetry looks fine for a week, and then the customer who asked for it cancels. Nobody connects the cancellation to the feature (it was a quarter ago, a different person, a different conversation) but the pattern repeats quietly over the year. The backlog fills with requests. The product changes shape. Churn doesn’t move.

The problem is that asking a customer what they want produces a list of features. The list is honest and useless. Customers describe solutions they can imagine because describing causes they’re half-aware of is hard. The Jobs to be Done school of thinking (Moesta, Christensen, Ulwick) reframes the interview: don’t ask what they want. Ask what happened the day they switched. What prompted it. What they were trying to get done. What they’d been doing before. What would have made them stay with the old thing.

Switch interviews replace the feature wishlist with a story about a decision. The story contains the job. The job is usually not the one the team expected.

This workshop exists to collapse that reframing into a single session: two or three interviews, silent discovery, a clustering round, and a set of candidate job statements the whole team watched emerge. The statements are the artefact. The shared view is the point.

Reach for it when:

Churn is stable but not improving and you’ve exhausted surface-level theories
A new feature area is being considered and the team can’t agree on the problem it solves
You have access to recent switchers: people who started or stopped using the product in the last ninety days
Personas are in use and clearly not driving decisions
Several teams are prioritising against different implicit jobs and colliding

What It’s Not For

Skip it when:

You don’t have switchers to talk to. The technique is switch interviewing; discovery without interviews is just speculation in a conference room.
The decision you’re trying to make is tactical. JTBD is a framing exercise, not a backlog refinement tool.
The team believes they already know the job and you’re being asked to validate it. Confirmation-seeking kills the interviews.
You can’t get 2–3 hours of focus out of the product lead. The discovery cannot be delegated.

Stop a session that’s already started if:

The interviewees can’t remember why they switched; they’re not recent enough
The sticky-note wall is mostly empty after thirty minutes; the interviews didn’t land
The room is arguing about whether switch interviews are valid; you have a trust problem, not a method problem

Stopping after the first interview to regroup on technique is not failure. Running three mediocre interviews and producing confident statements from them is.

Definitions & Background

Three dimensions of a job. Every job has three layers, and the richest material lives in the second and third:

Functional: the practical thing being done. “Plan the week’s meals.”
Social: how the customer wants to be seen while doing it. “Be the parent who feeds the family well.”
Emotional: how they want to feel, or stop feeling. “Stop having to think about dinner on Sunday.”

Most teams capture only the functional layer and miss why the job actually matters. Push every candidate job statement to expose all three.

The switch timeline. Moesta’s interview structure has five anchor moments along the customer’s path to switching. Knowing the names lets the interviewer ask for each one explicitly:

First Thought: the moment the customer first considered making any change. Usually weeks or months before the switch. Push lives here.
Passive Looking: low-effort browsing, not actively shopping yet.
Active Looking: shortlisting, comparing, asking around.
Deciding: choosing between candidates. Anxiety spikes here.
Consuming: using the new thing for the first time. Habit sets in or it doesn’t.

The four forces (push from the old situation, pull of the new, anxiety about switching, habit holding the customer in place) map onto these moments rather than being asked about abstractly.

Inputs

Three to six recent switchers scheduled for 45-minute calls (ideally two recent customers who started, one recent canceller, or the equivalent for your switch).
A rough interview guide (we’ll give one below) but not a script.
The team to have read one or two switch interview transcripts before the session so they recognise the shape.
Recording setup (with permission) and a shared document for verbatim notes.
Sticky notes and a wall for the discovery and clustering phases.

If you don’t yet know which customers to talk to or what switch you’re trying to understand, run Event Storming a Domain first to map the customer landscape, or pull a churn list from your CS team to seed the recruit.

Outputs

What lands at the end:

3–5 candidate job statements in the form “When [situation], I want to [motivation], so I can [outcome].” These are the headline artefact.
A wall of verbatim quotes, clustered, with each cluster named.
Interview transcripts filed somewhere the team can read them for months. Redact names and any personal detail not relevant to the job.
Tags against the four forces: which quotes show push, which show pull, which show anxiety, which show habit. The tags are the evidence behind each job statement.
A “not our job” list, sometimes: the requests you can now deliberately decline.

These outputs feed straight into:

Impact Mapping: Impact Mapping tells you what behaviour to change; JTBD tells you what job the customer is hiring you for. Run JTBD first when you don’t know the job yet; run Impact Mapping first when the job is clear but the behaviour change isn’t.
User Story Mapping: once the jobs are named, User Story Mapping lays out the journey through them and slices the backlog against each job.
Example Mapping: Example Mapping turns a story into concrete rules; JTBD turns a customer conversation into a story worth writing. They compose at opposite ends of the refinement pipeline.

Who’s Needed

Four to six team members for roughly 3h 45min (with two interviews) to 4h 30min (with three):

Facilitator. Runs the session, conducts or co-conducts the interviews, keeps the discovery honest. Someone who has done switch interviews before is a large advantage. If nobody in the room has, schedule a practice interview with a friendly customer the week before.
Interviewers. Two or three, rotating. One person asks, another listens and notes, they swap between interviews. The rotation matters; it stops any single interviewer’s theory hardening into the session’s finding.
Note-taker. Often the facilitator doubles here, but if the interviews are back-to-back, split the role. The note-taker captures verbatim quotes, not paraphrases. Paraphrase is where the team’s existing theory sneaks in.
Product lead. Mandatory. The job statements will reshape the roadmap, and the product lead needs to have been in the room when they came out. If they arrive only for the readout, the statements will land as someone else’s conclusions, and they will be argued rather than used.
Optional ops / CS observer. Someone who talks to customers every day. Their job is to contradict the neat story that emerges from three interviews with the people who picked up the phone. They know the customers who didn’t, and that context keeps the discovery honest.

Group size is 4–6 team members (interviewees are not counted). Below four and the clustering lacks the friction it needs; above six and the silent discovery phase becomes committee writing.

Who to leave out:

Large groups of stakeholders. This is not a readout session. Discovery with more than six voices collapses into consensus-seeking.
People who can’t let go of existing features. If someone is going to defend the current roadmap sentence-by-sentence during clustering, they will prevent the session from doing its job. Invite them to the readout afterwards.
Anyone who won’t suspend their theory for three hours. JTBD interviews are deliberately theory-free. Bring a theory into the listening and you’ll hear confirmation.

How To Run It

Phase	Duration	Materials	Key question
Brief and prep	15 min	Interview guide, recording setup	“What are we listening for?”
Switch interview 1	45 min	Phone / call, notes	“Tell me about the day you switched.”
Switch interview 2	45 min	Phone / call, notes	“Tell me about the day you switched.”
Switch interview 3 (optional)	45 min	Phone / call, notes	“Tell me about the day you switched.”
Silent discovery	30 min	Sticky notes, quotes	“What did we actually hear?”
Cluster into candidate jobs	30 min	Clustered quotes	“What story do these clusters tell?”
Name 3–5 candidate jobs	20 min	Job statement cards	“When… I want to… so I can…”
Wrap-up	10 min	—	“Who owns what next?”
Total	~3h 45min with two interviews, ~4h 30min with three
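For planning purposes, the listed phases on their own add up to 3h 15min with two interviews and 4h with three. The sketch below reproduces the quoted totals by assuming the remaining half hour is buffer for breaks and changeovers between calls; that reading is an assumption, since the table doesn’t itemise it, and the names in the code are made up for the example.

```python
# Fixed phase durations in minutes, copied from the table above.
FIXED_PHASES = {
    "Brief and prep": 15,
    "Silent discovery": 30,
    "Cluster into candidate jobs": 30,
    "Name 3-5 candidate jobs": 20,
    "Wrap-up": 10,
}
INTERVIEW_MINUTES = 45


def session_minutes(interviews: int, buffer: int = 30) -> int:
    """Total session length; `buffer` is an assumed allowance for breaks."""
    return sum(FIXED_PHASES.values()) + interviews * INTERVIEW_MINUTES + buffer


def as_clock(minutes: int) -> str:
    hours, mins = divmod(minutes, 60)
    return f"{hours}h {mins:02d}min"


for interviews in (2, 3):
    print(interviews, "interviews:", as_clock(session_minutes(interviews)))
```

If your interviews are scheduled back-to-back with no gap, set `buffer=0` and book the shorter slot.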

The interviews drive the day. Everything else is in service of extracting the job from what the switchers said. If the interviews don’t happen (schedule slips, no-shows, technical failures) postpone the discovery. Don’t fake it with remembered quotes.

Listening and discovery

Two distinct modes, and the discipline of keeping them separate is most of the technique:

Listening mode. During the interviews. Open questions, long silences, “and then what happened?” nudges. No theorising, no reframing the question, no rescuing the interviewee when they stall. The pauses are where the good material comes out.
Discovery mode. After the interviews. Quotes on sticky notes, clustered by pattern, named at the end. Silent individual work first; discussion only after the clusters are visible.

The four forces come out in the second mode, not the first. Don’t ask interviewees about push, pull, anxiety, and habit; listen for them.

Push of the current situation. What was annoying, broken, or insufficient about what they were doing before.
Pull of the new solution. What drew them toward the new thing. What it promised.
Anxiety about the change. What made them hesitate. What they were afraid would go wrong.
Habit of the present. What made it easier to keep doing what they were already doing, even when it wasn’t working.

A real switch story contains all four. The four forces are a lens for reading the transcript, not a question to ask.

Tagging is the mechanic of that lens. After the interview, go back through the transcript and tag the lines that show each force. To make this concrete, here’s how it might land in a switch interview about a meal-box subscription: “The supermarket veg kept going off before we’d eaten it” is push, “My neighbour’s box looked amazing on Instagram” is pull, “What if we get things we don’t know how to cook?” is anxiety, and “We’d done the same Saturday shop for years” is habit. The interviewee never labelled any of them; they told a story, and the team tagged it afterwards. Those tags are the evidence behind the job statements you write later: when the situation clause reads “When the weekly shop has stopped working…” you can point at the push quote it came from.

Forces you can’t tag matter too. Strong push and weak anxiety is a switcher who was already on the way out. Strong habit and weak pull is a switcher who needs a bigger nudge than the product is currently offering. The tagging is what turns three interview stories into a map of the decision shape, not just three transcripts in a folder.
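If the quotes live in a shared document rather than on a physical wall, the tagging can be kept as a small data structure. This is a minimal sketch using the four example quotes above; `TaggedQuote`, `force_profile` and `missing_forces` are names made up for the illustration.

```python
from collections import Counter
from dataclasses import dataclass

FORCES = ("push", "pull", "anxiety", "habit")


@dataclass(frozen=True)
class TaggedQuote:
    text: str   # verbatim, never a paraphrase
    force: str  # one of FORCES

    def __post_init__(self):
        if self.force not in FORCES:
            raise ValueError(f"unknown force: {self.force!r}")


def force_profile(quotes):
    """Count tagged quotes per force, including forces with zero quotes."""
    counts = Counter({force: 0 for force in FORCES})
    counts.update(q.force for q in quotes)
    return counts


def missing_forces(quotes):
    """Forces nobody could tag; these gaps are worth discussing too."""
    profile = force_profile(quotes)
    return [force for force in FORCES if profile[force] == 0]


quotes = [
    TaggedQuote("The supermarket veg kept going off before we'd eaten it", "push"),
    TaggedQuote("My neighbour's box looked amazing on Instagram", "pull"),
    TaggedQuote("What if we get things we don't know how to cook?", "anxiety"),
    TaggedQuote("We'd done the same Saturday shop for years", "habit"),
]
print(force_profile(quotes))
print(missing_forces(quotes))
```

Running `missing_forces` per interview rather than across all of them shows which stories had no anxiety or no habit in them, which is the decision shape the tagging is meant to expose.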

Phase 1: Brief and prep (15 min)

Gather the team. Walk through three things, briefly:

“We’re going to run switch interviews. The shape is: tell me about the day you switched, walk me backwards to when you first started thinking about it, and tell me what else you considered. We’re listening for what was going on in their life when they made the change, not for feature feedback. We’re not going to ask them what they want. We’re going to ask them what happened.”

Second: the four forces, one line each. Tell the team to keep the four forces in the back of their heads, not the front. The interview is not a forces-extraction machine.

Third: the roles for the first interview. Who asks, who takes notes, who observes in silence. Set the expectation that roles rotate.

What to watch for:

Pre-loading theories. “I bet they’re going to say it’s about convenience.” Name it and park it: “Let’s see what they say.”
Prepared questions. Interviewers who’ve written a list of fifteen things they want to ask will interrupt the story. The guide below is three prompts, not fifteen.
Recording permission missed. If you’re recording, confirm permission explicitly at the top of the call. If you can’t record, double the note-taking.

The interview guide:

“Take me back to the day you decided to switch. What was happening that day?”
“When did you first start thinking about it? What else were you considering?”
“Was there anything that almost stopped you? What made you go ahead anyway?”

Everything else is follow-up prompts: “tell me more about that,” “what happened next,” “who else was involved,” “how did that feel.”

Phase 2: Switch interviews (45 min each)

Run the interview by phone or video. Camera on if the interviewee’s comfortable, off if not. The note-taker captures verbatim quotes in a shared document, with timestamps if the call is recorded.

Ask the first question and then wait. The interviewee will start. Don’t fill silences. If they stop after thirty seconds, prompt with “and then what?”

Push for the concrete scene. “What day of the week was it? Where were you when you first thought about it? Who did you talk to?” Abstractions hide jobs; specifics reveal them.

Walk them backwards along the timeline. When they’ve finished the story of the day itself, walk back: when they first thought about it, what they were doing before, what triggered the first thought. The “first thought” is often weeks or months before the switch, and that’s where the push usually lives.

When they’re done with the timeline, ask about alternatives. “What else did you consider? Why did you pick this one? What would have made you stay with what you had before?”

What to watch for:

Generic answers. “It was just more convenient.” Push: “Convenient how? Give me the specific thing that annoyed you last time you did it the old way.” A generic answer is an unearned abstraction; the story is always underneath.
Rationalised stories. The interviewee has told themselves a tidy narrative about why they switched. You’ll hear marketing language in their mouth. Rewind: “Before you decided that, what were you actually doing?” Walk to the concrete scene.
Interviewers filling silences. Note-takers should kick the interviewer under the table. Thirty seconds of silence almost always produces the best quote of the interview.
The team diagnosing during the call. “Oh, they want a pause feature.” No. Listening only. Diagnosis is the next phase.

End the interview cleanly. Thank them. Don’t summarise back to them; summaries bias the memory of what they said.

Phase 3: Silent discovery (30 min)

Print or project the transcripts. Each team member works alone. The instruction is one sentence:

“Write every verbatim quote that feels telling onto a sticky note. One quote per note. No interpretation.”

Telling means: reveals a push, a pull, an anxiety, a habit, a moment of decision, a named alternative, an outcome they were trying to achieve. If the quote is about a feature they wanted, it’s probably not telling; that’s a solution, not a job.

Silent, individual, no discussion. Set a timer. When the timer ends, everyone posts their notes on the wall without comment.

What to watch for:

Paraphrase creep. Someone writes “customer wants convenience” on a note. That’s paraphrase. Push back: “What did they actually say? Use their words.”
Feature requests mistaken for jobs. “They said they want a weekly summary.” That’s a solution. The question underneath is what the weekly summary is being hired to do.
One team member producing twice as many notes as anyone else. Good. Don’t suppress it. The clustering will balance.

Phase 4: Cluster into candidate jobs (30 min)

Look at the wall. Ask the room to cluster notes that belong together. No talking for the first five minutes (the affinity-map convention: silently group sticky notes by similarity, then name the clusters). People move notes silently, and if two people keep moving the same note back and forth, it’s flagged for discussion.

After five silent minutes, open the conversation. For each cluster, ask two questions:

“What’s the pattern here? What are these quotes all saying?”

“Is this a situation, a motivation, or an outcome?”

Those three words (situation, motivation, outcome) are the JTBD shape. A cluster might be all situations (things that were going on in customers’ lives), all motivations (what they were trying to get done), or all outcomes (what they wanted to be true afterwards). Often a single cluster contains one of each and is the seed of a job statement.

Expect 4–7 clusters from three interviews. Fewer than four and you’ve over-abstracted; more than seven and you haven’t clustered enough.

What to watch for:

Everything-is-one-cluster. The room collapses the wall into two huge piles. Push for distinctions: “What’s different about these two quotes?”
The wishlist cluster. A cluster forms around feature requests. Re-frame: “If we built all of these, what would customers be able to do that they can’t do now?” The answer is usually the job.
Forgotten negative space. Quotes about anxiety and habit rarely cluster on their own unless you prompt for them. “Which of these clusters contains ‘what almost stopped them’? Is that cluster complete?”

Phase 5: Name 3–5 candidate jobs (20 min)

Pick the 3–5 strongest clusters and turn each into a job statement. The form is strict:

“When [situation], I want to [motivation], so I can [outcome].”

This is Klement’s job story form (Alan Klement, who refined the JTBD interview practice into the modern job-story shape), which bakes the situation in. Christensen’s classical job statement form is shorter, “Help me [verb] [object] [modifier]”, and useful for headline framing. We use Klement’s form here because the situation is what the four-forces evidence directly supports.

Each slot is a concrete phrase, not a category.

Situation: the context the customer is in when the job arises. Time, place, people, constraints.
Motivation: the action they want to take. A verb and an object.
Outcome: the state of the world they want to be true as a result. What they get to do next, how they want to feel, what they no longer have to worry about.

Write each statement on a card the whole room can see. Read it aloud. Challenge it against the quotes: does any sentence in the transcripts contradict the statement? If yes, adjust.

What to watch for:

Abstract outcomes. “So I can be happy.” Push for what happy looks like. “So I can stop having to think about dinner on Sunday.” That’s a specific outcome.
Product names in the statement. “So I can use our app.” No, that’s a solution. “So I can plan the week without a grocery trip.” That’s the job.
Two jobs in one statement. “When it’s busy, I want to plan the week and also try new recipes, so I can feed the family without stress.” Split into two statements. One job per card.
Committee wording. The room rewrites the same statement four times. Park it. Accept the rough version and move on; polish later.
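For teams that file job statements alongside the backlog, the strict form and the product-name check above can be sketched as a small record type. The example statement is stitched together from phrases used elsewhere in this post purely for illustration, and `JobStatement`, `render` and `mentions_product` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobStatement:
    situation: str   # context the customer is in when the job arises
    motivation: str  # a verb and an object
    outcome: str     # what they want to be true afterwards

    def render(self) -> str:
        return (
            f"When {self.situation}, I want to {self.motivation}, "
            f"so I can {self.outcome}."
        )

    def mentions_product(self, product_terms) -> bool:
        """True if any product term has leaked into the statement."""
        text = self.render().lower()
        return any(term.lower() in text for term in product_terms)


job = JobStatement(
    situation="the weekly shop has stopped working",
    motivation="plan the week's meals",
    outcome="stop having to think about dinner on Sunday",
)
print(job.render())
print(job.mentions_product(["our app"]))
```

A statement that trips `mentions_product` is a candidate for rewording: delete the product term and see whether the job still holds.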

Phase 6: Wrap-up (10 min)

Pin the 3–5 job statement cards on the wall. Photograph them. Read each aloud one more time with the team. Then name the owners:

“Product lead: these are yours from here. Ops observer: you’re running the sanity check against what you hear on calls next week. Engineering lead: I’ll walk these past you Monday.”

End on commitments, not summaries.

Worked example

See Jobs to be Done: Why Subscribers Actually Stay for a fictional team’s first switch-interview session, including the moment three interviewees independently describe the same Sunday-night job the team had never heard named. The product in that story is a meal-box subscription, but the shape of the session is the same in any domain.

What Can Go Wrong

The feature wishlist. The interview turns into a list of features the customer wants. Recovery: Interrupt gently: “Let me take a step back: what were you doing before you switched? Walk me through that week.” Pull them back to the timeline. Stop if: The same interviewee keeps returning to features despite three rewinds. Thank them, end the call, and try a different interviewee.

The generic answer. Everything is “convenience” or “quality” or “the vibe.” Recovery: “When you say convenience, what did the morning of your Monday look like before, versus after?” Specifics always. Stop if: They genuinely can’t recall. They’re probably not a recent switcher: check when they actually switched.

The rationalised story. The interviewee has a clean narrative that sounds like your own marketing. Recovery: Walk to the concrete scene. “Before you decided that, what were you actually doing on a Tuesday night?” Stop if: They resist the concrete scene. Rationalisation is often protective; don’t force it.

Jobs conflated with solutions. During discovery, someone keeps writing job statements that include the product. Recovery: Delete the product name and see if the statement still holds. If it doesn’t, it’s not a job; it’s a feature brief. Stop if: The whole wall is solution-shaped. The interviews didn’t produce enough material; schedule more.

The wording committee. Four people argue about the wording of a single statement for twenty minutes. Recovery: Force a rough version. “Worst acceptable version. We’ll polish next week.” Stop if: The argument is actually about whether the job is real. Go back to the quotes and check.

Confirmation bias. The room is finding what it already believed. Recovery: Ask the ops / CS observer to challenge every statement against the customers they talk to. “Would anyone you speak to on the phone recognise themselves in this?” Stop if: Two observers independently say the statements don’t match what they hear. The interviews may be unrepresentative; schedule different interviewees.

Other failure modes worth naming so you can spot them early:

The team treats three interviews as definitive and skips follow-up
The statements get written and shelved; the roadmap continues as before
Interviewers slide into persuasion mode and start explaining the product to the interviewee
Discovery collapses into consensus around the theory the product lead walked in with
Quotes get paraphrased into notes and the verbatim material is lost

The session has costs as well as benefits, and naming them helps the team commit honestly: 4–5 hours of session time plus 3–4 hours of interview scheduling and coordination, the emotional cost of hearing customers describe problems you haven’t solved, and the fact that the candidate job statements are candidates; they want validation with more interviews before they drive anything irreversible. Interviewer skill compounds; early sessions produce rougher material.

Next Steps

The session ends; the work begins.

Same day, the facilitator:

Photographs the wall: the full cluster layout, each candidate job card in close-up, the verbatim quote notes against their clusters.
Files the interview transcripts somewhere the team can read them for months. Redact names and any personal detail not relevant to the job.
Writes a short summary: the 3–5 candidate jobs, one sentence each about the strongest quote behind each, and the four forces where they showed up.

This week, the product lead:

This is where the pattern earns its cost, and the work is mostly the product lead’s.

Walk the candidate jobs past the ops / CS team. People who talk to customers every day will either nod or wince. Both reactions are useful. The wince is more useful.
Schedule three more interviews to validate the strongest candidate. Three interviews is not enough to commit; three more either strengthen the statement or reveal the hole. Treat the first session’s output as a hypothesis.
Map the current roadmap against the jobs. Which features serve a named job? Which don’t? A feature that doesn’t serve any job is either a job you haven’t articulated yet or work that shouldn’t be in the quarter.
Refuse the next feature request that doesn’t match a job. Politely. With a reason. This is the hardest week-after task and the one that makes JTBD earn its keep.

Ongoing:

Re-run JTBD when the product shape changes materially: a new segment, a new pricing tier, a new acquisition channel. The jobs change when the customers change.
Keep the job statements visible. Pin them in the team’s main room. When someone proposes a feature, they should be able to point at the job it serves.
Track the feature requests that don’t match any job. If the list grows, you’re either missing a job or missing the discipline to say no.

The benefits compound: job statements concrete enough to make product decisions (features that serve a named job get built; features that don’t get parked), a shared framing across product, engineering, and operations that reduces backlog churn, interview transcripts that earn their keep for months as future hires read them and onboard faster, a “not our job” list that is just as valuable as the jobs themselves, and the push / pull / anxiety / habit lens available for every future product conversation.

Variants

The default switch-interview shape captures one direction: people who chose the product. Two adjacent shapes capture the jobs you’re missing: customers who left, and prospects who never started.

Switch-out interviews (churn). Run the same playbook with people who cancelled in the last ninety days. The prompts adapt:

“Take me back to the day you decided to cancel. What was happening that week?”
“When did you first start thinking about it? What pushed you over the edge?”
“What are you doing now instead? Did you switch to something else, or go back to what you had before?”

The four forces re-orient. Push is what your product was doing wrong: the feature that broke, the support reply that landed badly, the price increase that finally tipped them over. Pull is the destination, which is often nothing: the cancelled customer went back to the way they did it before, not to a competitor. That’s a stronger signal than competitive churn; it means the job you thought you were doing wasn’t being done well enough to displace the old way at all. Anxiety is what made cancelling hard: the workflow they’d built around the product, the data or history they’d lose, the loyalty discount they’d give up. Habit is the inertia that kept them paying past the point of value: how many months did the bill go out after they’d stopped really using it?

A churn interview where the canceller went back to the way they did it before is the most useful kind you can run. It tells you the job you wrote down isn’t real, or isn’t being delivered. The team won’t want to hear it; the temptation will be to dismiss the canceller as not the target. Resist.

Non-adoption interviews. People who looked at the product and didn’t sign up, or fit the audience and never engaged. Harder to recruit (you don’t have their email) but the most valuable shape when growth has stalled and churn doesn’t explain the shortfall.

The prompts shift, because there’s no “day they switched”:

“Tell me about the last time you thought about a product like ours. What was happening?”
“What did you end up doing instead?”
“What stopped you from trying it?”

The forces re-orient again, and the missing forces are the finding. Push is what’s not working about whatever they’re using today; usually it’s weak, because the existing alternative is an adequate solution for adequate people. Pull is what your product promised them; usually it’s weak too, because if pull had been strong they’d have signed up. Anxiety is what stopped them: what if it doesn’t fit how they actually work, what if they can’t get value out of it, what if it locks them in. Habit is the strongest force in this set: most non-adopters are well served by what they’ve used for years, and the real question is whether anything could ever move them.

Recruit non-adopters by:

Asking churned customers to introduce contacts who also considered the product but didn’t sign up
Running a short paid screener through a research panel
Offering a small incentive through channels where the audience you serve already gathers
Using mutual connections, carefully, and never as a sales channel

Three non-adopter interviews are harder to schedule than ten switch interviews, but the missing jobs they reveal are the ones the rest of the playbook can’t see.

When to run which:

Switch-in only. A new team learning the technique, or a product where adoption and retention are both healthy.
Switch-in plus switch-out. The default for a team that wants the full picture of who they keep and who they lose. Run a session of each in the same fortnight; make sense of each separately, then compare the job statements.
All three. When growth has plateaued and churn data alone doesn’t explain it. The non-adopter shape is the one that finds the job you haven’t named yet.

Does Time Even Exist?

2026-05-07T06:00:00+08:00

Time Is Weirder Than You Think showed time bending near mass, dilating with motion, rippling when black holes collide, always as a thing that exists. This post asks whether it does. The arrow that distinguishes past from future isn’t in the equations. “Now” isn’t a location in spacetime. The equations of quantum gravity may contain no time variable at all. Some physicists think time is a shadow of something simpler. A few think it has more dimensions than we can see. A handful think it doesn’t fundamentally exist.

The arrow of time

At the most fundamental level, the equations of physics are mostly time-symmetric: they work just as well running backwards. Maxwell’s equations, the Schrödinger equation, even the equations of general relativity: none of them distinguish past from future. Run the film backwards and the physics still works. Yet we experience time as having a clear direction. Eggs break but don’t unbreak. You remember yesterday but not tomorrow. What gives time its arrow?

The standard answer involves entropy: roughly, the disorder of a system. There are astronomically more ways for an egg to be broken than for it to be perfectly intact. A broken egg isn’t going to spontaneously reassemble, not because the laws of physics forbid it, but because the odds against it are absurdly, comically enormous. This is the second law of thermodynamics: things tend to move from ordered states to disordered ones, because disordered states are overwhelmingly more probable.
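A toy count makes "overwhelmingly more probable" concrete. The sketch below swaps the egg for 100 coins (an assumption made purely for illustration) and counts how many arrangements are perfectly ordered versus evenly mixed.

```python
from math import comb

def arrangements(n_coins: int, n_heads: int) -> int:
    """Number of distinct ways to get n_heads heads among n_coins coins."""
    return comb(n_coins, n_heads)

N = 100  # toy system size; a real egg has vastly more parts

ordered = arrangements(N, N)          # all heads: the "intact egg"
disordered = arrangements(N, N // 2)  # a 50/50 mix: the "broken egg"

print(ordered)               # 1
print(disordered > 10**29)   # True
```

With only 100 coins the mixed state already outnumbers the ordered one by a factor of more than 10^29. Scale that up to the number of molecules in an egg and "never spontaneously reassembles" stops sounding like an exaggeration.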

But this just pushes the question back a step: why does entropy increase? The second law is statistical, not fundamental; it says that higher-entropy states are more probable, so systems tend to evolve toward them. But that only works if the universe started in a low-entropy state, a highly ordered initial condition. Why did it? This is one of the deepest unsolved problems in physics, and it sits at the intersection of cosmology, thermodynamics, and the foundations of quantum mechanics. Roger Penrose has devoted much of his career to it; he estimates the probability of the universe’s initial low-entropy state arising by chance at roughly 1 in 10^(10^123), a number so absurdly large that writing it out would require more paper than exists in the observable universe.

The arrow of time, on this view, isn’t a property of the equations; it’s a property of the initial condition. The universe was handed an astronomically improbable starting state, and everything since has been the slow unwinding of that order into disorder. Take away that initial condition and the arrow vanishes.

The block universe

At the cosmic level, time is inseparable from space. General relativity describes them as a single four-dimensional fabric, spacetime, that can be curved, stretched, and warped by mass and energy. The notion of “now” is surprisingly hard to define across large distances. In special relativity, simultaneity is relative. Two lightning bolts strike opposite ends of a train simultaneously, from the platform’s point of view. A passenger on the train, moving toward one bolt and away from the other, sees them hit at different times, and according to relativity, both observers are equally right. There is no universal “now”. There is only your now, defined by your position and velocity, and it disagrees with everyone else’s.
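The train example can be checked with the Lorentz transformation, which gives the time a moving observer assigns to an event: t' = γ(t − vx/c²). The sketch below uses made-up numbers (a 300 m train at half the speed of light) and computes the times the train frame assigns to two strikes that are simultaneous on the platform.

```python
from math import sqrt

C = 299_792_458.0  # speed of light in m/s (exact by definition)

def time_in_moving_frame(t: float, x: float, v: float) -> float:
    """Time assigned to event (t, x) by a frame moving at velocity v along x."""
    gamma = 1.0 / sqrt(1.0 - (v / C) ** 2)
    return gamma * (t - v * x / C**2)

# Hypothetical numbers chosen for illustration only.
train_length = 300.0   # metres
v = 0.5 * C            # train speed relative to the platform

# Both bolts strike at t = 0 on the platform's clock, one at each end.
rear = time_in_moving_frame(0.0, -train_length / 2, v)
front = time_in_moving_frame(0.0, +train_length / 2, v)

print(front < rear)   # True: in the train frame the front bolt strikes first
print(rear - front)   # the gap in seconds
```

The gap is well under a microsecond here, but it is not zero, and neither frame's answer is the "real" one.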

This leads some physicists to the block universe interpretation: the idea that past, present, and future all exist equally and simultaneously. The four-dimensional spacetime block simply is, complete and unchanging. What we experience as the flow of time is an artefact of our consciousness moving through this block. In this view, the future is as real as the past; we just haven’t encountered it yet.

It’s a view that Einstein himself appears to have held. After the death of his lifelong friend Michele Besso in 1955, Einstein wrote to Besso’s family: “For those of us who believe in physics, the distinction between past, present, and future is only a stubbornly persistent illusion.”

If the block universe is right, there’s no such thing as the flow of time. There’s only a static four-dimensional structure, and the appearance of passage is something our brains impose on it. Which is uncomfortable, because the passage of time feels like the most obvious thing in the world.

The beginning of time

If time bends near mass and stops at an event horizon, what happened at the Big Bang, the most extreme gravitational event of all? In 1983, Stephen Hawking and James Hartle proposed that the question is malformed. In their no-boundary proposal, as you trace time back toward the Big Bang, the distinction between time and space dissolves. Time doesn’t hit a wall or a starting gun. It smoothly becomes something more like a spatial dimension: rounded off, with no edge and no “before.”

Hawking’s analogy: asking what happened before the Big Bang is like asking what’s south of the South Pole. You can walk south from anywhere on Earth, and at every step there’s more south ahead of you, until you reach the pole, where “south” doesn’t end in a wall. The concept simply stops applying. There’s no sign saying “end of south.” There’s just a smooth surface that curves in a way that makes the question dissolve. Time at the Big Bang, in the Hartle-Hawking model, does the same thing. The universe didn’t begin at a first moment. The geometry of spacetime curves in a way that removes the need for a first moment.

Every possible history, all at once

The no-boundary proposal isn’t just a clever picture. It’s calculated using a technique from quantum mechanics called the path integral, an idea Feynman developed in the 1940s. Here’s the intuition.

Normally, if you want to know how a ball gets from point A to point B, you calculate the one path it takes: the arc through the air that Newton’s laws dictate. Feynman showed that in quantum mechanics, this is wrong. The ball takes every possible path simultaneously: straight lines, spirals, loops, detours through the next room and back. Every path contributes to the outcome. Most of them cancel each other out, and what survives is something that looks very much like Newton’s single arc. But the cancellation is the reason, not the single path.

Now apply this to the universe. In quantum cosmology, the universe didn’t take one history from the Big Bang to now. It took every possible history: every possible geometry of spacetime, every possible arrangement of matter and energy, all at once. Some of those histories have time that looks like ours. Some have radically different causal structures. Some might have multiple time dimensions, or looping time, or no time at all. What we observe is the interference pattern of all of them.

It’s like a choir. A hundred singers each sing a different note. Most of the notes clash and cancel. What the audience hears isn’t silence; it’s a chord. The chord is our universe. The individual notes are the histories that were summed over to produce it. Our experience of time, flowing forward, one second after another, is the chord that survived the cancellation. It’s not the only note that was sung.

Hawking’s last act

Hawking spent his final years refining this picture with Thomas Hertog. Their 2018 paper, submitted just weeks before Hawking’s death, used the holographic principle (more on that in a moment) to argue that the multiverse, if it exists, is far more constrained than the “anything goes” version popular in science fiction. Different regions of the universe might settle into different vacuum states (different stable configurations of the fundamental fields) and each vacuum state could have different effective physics. Different particle masses. Different force strengths. Possibly different properties of time itself.

This isn’t parallel universes in the Star Trek sense. It’s more like ice forming on a pond. Water can crystallise in different orientations, and different patches of ice have their crystals aligned differently. Same water, same physics, different local structure. Hawking and Hertog proposed that the universe is the same way: one underlying theory, but different regions that “froze” into different configurations. Time in one region might tick with subtly different properties than time in another, not because the laws are different, but because the local vacuum is.

The holographic principle

In 1993, Gerard ’t Hooft proposed, and Leonard Susskind later developed, an idea that sounds absurd: all the information in a three-dimensional region of space can be encoded on its two-dimensional boundary. Like a hologram on a credit card that looks three-dimensional but is physically flat.

This wasn’t metaphor. It grew out of Hawking’s own work on black holes. Hawking showed in 1974 that black holes radiate, slowly evaporating over astronomical timescales. Jacob Bekenstein had shown that a black hole’s entropy (roughly, the amount of information it contains) is proportional to the area of its event horizon, not its volume. That’s deeply strange. The information content of a room is proportional to its volume: more room, more stuff, more information. But for a black hole, it’s the surface that matters. The interior is, informationally speaking, redundant.

If this holds generally (and there’s strong theoretical evidence that it does) then our entire three-dimensional experience, time included, might be a projection from a lower-dimensional boundary. Consider a shadow puppet show. Puppets move in 3D behind the screen. The audience sees 2D shadows on the wall. The holographic principle says something far stranger: the 2D shadow might be the fundamental reality, and the 3D puppet is the projection. We’re the audience and the shadow, convinced we live in 3D because the projection is so convincing.

What does this mean for time? On the boundary, time might work differently, or might not exist in the form we recognise. The “bulk” (our 3+1 dimensional experience) and the boundary encode the same information, but the encoding is radically different. The best-studied example is the AdS/CFT correspondence, discovered by Juan Maldacena in 1997, which shows an exact mathematical equivalence between a gravitational theory in a curved spacetime and a quantum field theory on its boundary: a theory that has no gravity at all. Same physics. Completely different description. In one description, time curves and dilates near massive objects. In the other, there’s no gravity to curve anything. Both are equally correct. They’re not two approximations of the same thing; they’re two exact descriptions of the same thing.

Two times

If time can be a projection of something simpler, can it also be a shadow of something richer?

Itzhak Bars at the University of Southern California has been developing a framework called two-time physics since the late 1990s. The idea: our universe has not four dimensions (three space, one time) but six: four of space and two of time. We can’t perceive the extra dimensions directly, any more than a shadow on a wall can perceive the lamp behind it. Our 3+1 dimensional experience is a particular projection of the 4+2 dimensional reality.

Here’s what makes it interesting. A 3D object casts different 2D shadows depending on the angle of the light. A cube’s shadow can look like a square, a hexagon, or a diamond. Same object, different projections, each one a valid 2D description. Bars showed that the same 4+2 dimensional physics, projected differently, gives different 3+1 dimensional theories: theories that look completely unrelated but are secretly the same underlying reality seen from different angles. Some of those projections have a time dimension that behaves like ours. Others have time that works differently. All are equally valid shadows of the same six-dimensional object.

This is speculative. There’s no experimental evidence for two time dimensions, and the framework is constructed to be mathematically consistent rather than empirically motivated. But it’s a legitimate research programme, published in peer-reviewed journals, and it demonstrates something important: our assumption that there’s exactly one time dimension is a choice, not a logical necessity. The mathematics works perfectly well with more.

Time loops

General relativity doesn’t just allow time to slow down or speed up. Under certain conditions, it permits time to form closed loops: paths through spacetime that return to their own starting point. Gödel found the first one in 1949. Spinning black holes have them. Wormholes might too. Hawking took the idea seriously enough to propose a law of physics to prevent it. It gets much stranger from there.

Time crystals

Start with the warm-up. Salt is a crystal because its atoms sit in a repeating pattern: atom, gap, atom, gap, atom, gap. Nothing in the laws of physics insists they line up that way; they just do, because the arrangement is stable. The pattern is in space.

In 2012, Frank Wilczek (a Nobel laureate) asked the obvious next question: could a pattern repeat in time instead? Could a system tick, tick, tick forever on its own preferred schedule, in its lowest energy state, with no energy input?

This was controversial. A system oscillating in its ground state would seem to violate the expectation that ground states are static: nothing happening, no change, as boring as physics gets. But in 2017, two teams independently built time crystals in the lab. One at the University of Maryland using a chain of ytterbium ions, another at Harvard using nitrogen-vacancy centres in diamond. The trick was a clever sleight of hand. You can’t just shake something and call it a time crystal, because then it’s only dancing to your beat. The Harvard and Maryland teams drove their systems at one speed and watched them respond at a different, slower speed: tap once a second, tick once every two seconds. That mismatch is the giveaway. The rhythm comes from inside the system, not from the experimenter. Time-translation symmetry (the assumption that the laws of physics are the same from one moment to the next) was broken.

Ordinary crystals break spatial symmetry: space looks the same in every direction, but inside a crystal, some directions are special. Time crystals do the same thing to time: time flows the same way from moment to moment, but inside the crystal, some moments are special. The crystal has a rhythm the underlying laws don’t require. It’s a genuinely new kind of stable arrangement of matter, a new “phase” alongside solid, liquid, gas, and magnet. We didn’t know matter could organise itself in time the way it organises itself in space. Now we know it can.

It’s tempting to read this as evidence that time itself is chunky, that the universe has a preferred beat hidden in it somewhere. It isn’t. The discreteness lives in the system’s state, not in time. Same as salt: atoms sit at specific spots, but the space between them is still a smooth continuum. The pattern is in the matter, not in the stage the matter sits on.

Whether the stage itself has a smallest possible tick (whether time is smooth all the way down, or whether the universe has a frame rate) is a different question entirely.

The smallest tick

Is there a shortest possible moment? A tick so small that “before” and “after” stop meaning anything?

Maybe. It’s called the Planck time, and it’s about 5.4 × 10⁻⁴⁴ seconds. To get a feel for how small that is: the ratio between one Planck time and one second is roughly the same as the ratio between one second and forty trillion trillion times the current age of the universe. It’s not a duration anyone has measured or ever will measure. It’s more like a speed limit sign at the edge of the map: our best theories of physics say “beyond here, we don’t know what happens.”

The number comes from combining three fundamental constants, the speed of light, the gravitational constant, and Planck’s constant, in the only way that gives you a unit of time. It’s the scale where quantum mechanics and gravity would both matter simultaneously, and right now we don’t have a theory that handles both at once. Our two best frameworks, quantum mechanics (which explains the very small) and general relativity (which explains the very massive), give contradictory answers at this scale.
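The combination in question is the square root of ħG/c⁵, where ħ is Planck’s constant in its reduced form. A quick check, using CODATA values for the constants (the variable names are just for this sketch):

```python
from math import sqrt

# CODATA values in SI units
HBAR = 1.054571817e-34   # reduced Planck constant, J s
G = 6.67430e-11          # gravitational constant, m^3 kg^-1 s^-2
C = 299_792_458.0        # speed of light, m/s

# The only product of powers of these three constants with units of seconds.
planck_time = sqrt(HBAR * G / C**5)

print(f"{planck_time:.2e}")   # 5.39e-44
```

Nothing in this calculation says time is granular at that scale. It only says that this is where the three constants meet.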

Some physicists think the Planck time is a real boundary: that time is genuinely granular at this level, like pixels on a screen. Below one Planck time, there’s no “shorter.” Others think time is smooth all the way down and the Planck time is just where our equations stop working, not where time itself stops. We don’t know. We’re nowhere near being able to test it. But it’s a striking thought: the universe might have a frame rate.

Does time exist at all?

Some physicists have gone further. Julian Barbour, in The End of Time, argued that time doesn’t fundamentally exist. What we call time is just the way we experience the relationships between configurations of matter. The universe doesn’t evolve through time; it simply is a collection of states, and our brains string them into a narrative.

Carlo Rovelli, in The Order of Time, takes a related but more nuanced position: time as we experience it (flowing, universal, directed) is an emergent property that arises from our limited perspective as macroscopic beings who interact with the world thermodynamically. At the most fundamental level of quantum gravity, the equations may contain no time variable at all.

When physicists try to write down an equation that combines quantum mechanics and gravity, the so-called Wheeler-DeWitt equation, they get something startling: the equation has no time variable at all. It describes a universe where nothing changes. How you get from a timeless equation to our everyday experience of things happening one after another is, to put it mildly, an open question.

This is philosophy as much as physics, and it’s nowhere near settled experimentally. But it illustrates how deep the rabbit hole goes. We started with a simple question, “what time is it?”, and ended up with equations in which time has no place.

Where this leaves us

None of the foundations in this post are settled. The block universe is an interpretation, not a measurement. The no-boundary proposal is a model, not a verdict. The holographic principle has strong theoretical support but no direct experimental test. Two-time physics is consistent mathematics without empirical backing. Time crystals exist, but they’re a curiosity rather than a revolution. The Planck time is a scale we can’t probe. The Wheeler-DeWitt equation has no time variable, and nobody knows what to do about that.

What all of them share is the unsettling implication that the time we experience (flowing, directed, universal, one thing after another) might be a surface feature of something deeper. The equations don’t need the arrow. “Now” isn’t in the maths. The fundamental theories we have either don’t mention time or treat it as a dimension no more special than space.

And yet we live in time. Things happen. The egg breaks and doesn’t unbreak. You remember yesterday and not tomorrow. Whatever time is (fundamental, emergent, or not there at all), our experience of it is real enough to live by.

There’s one more direction the equations let us push, and it’s the direction most people would actually want to use a time machine for. Not forward (forward is easy and we’ve covered it). Backward.

Can You Turn Back Time? is next, and the equations are more permissive than you’d expect.

Choosing Between Prompting, RAG, and Fine-Tuning

2026-05-06T06:00:00+08:00

The situation

The in-house legal team maintains 4,000 contract templates across 12 jurisdictions and 30 contract types (employment, NDA, master services, licensing, etc.). Each template is between 5 and 80 pages. They live in SharePoint today; an S3 bucket is being stood up to mirror them. Templates change: roughly 50 are updated every month when a jurisdiction’s law changes or a clause is renegotiated at the enterprise level.

Paralegals currently find the correct clause by searching SharePoint for keywords and skimming results. The turnaround for “what’s the standard force majeure language for a French SaaS contract?” is 10-15 minutes of human grepping. Legal-ops wants this under 30 seconds with a citation back to the exact template and clause.

A first prototype called Claude with the question in the prompt. It hallucinated: clauses that sounded right but used clause numbers the templates don’t use, jurisdictions the clause doesn’t cover, and in one case a citation to a template that doesn’t exist. Three techniques are on the table to fix it: prompt engineering (rewrite the prompt better), RAG (retrieve relevant template excerpts and include them in the prompt), and fine-tuning (train the model on the template corpus until it knows the language natively). The team want a decision.

What actually matters

The three techniques are not interchangeable. They solve different problems, and the first mistake is treating them as points on a single “quality” axis.

Prompt engineering is changing the text you send to the model. Better instructions, worked examples in the prompt (few-shot), explicit format requirements, a system prompt that sets the model’s persona. It costs nothing in infrastructure (no training run, no new data pipeline), though iterating on the wording can take hours to days of engineering time. It is also the only technique that works if what the model is producing is the wrong shape: too long, too short, wrong tone, missing a required field. A hallucinating model doesn’t need a better prompt alone; it needs information it doesn’t have. Prompt engineering is necessary, always, but rarely sufficient on its own for a knowledge-grounding problem.

Retrieval-augmented generation is pulling relevant documents into the prompt at query time. The model doesn’t need to know the 4,000 templates; it needs to be handed the correct three when the paralegal asks a question. The architecture is: pre-compute vector embeddings of each template chunk (a chunk being a paragraph, a section, or a page), store them in a vector database, and at query time embed the user’s question, retrieve the top-k most similar chunks, and include them in the model’s prompt along with the question. The model answers from the retrieved text. If the retrieval is good, hallucination drops to near-zero on questions the corpus can answer.
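The retrieval half of that architecture fits in a few lines. This is a minimal sketch, not a production design: the embed function is a toy bag-of-words stand-in for a real embedding model, the in-memory dict stands in for a vector database, and the chunk ids and texts are invented for illustration.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical chunks; a real index would hold every template section.
CHUNKS = {
    "FR-SaaS-014/clause-12": "Force majeure: neither party is liable for delay caused by events beyond its reasonable control.",
    "FR-SaaS-014/clause-7": "Fees are payable within thirty days of invoice.",
    "UK-NDA-003/clause-4": "Confidential information excludes information already public.",
}
INDEX = {cid: embed(text) for cid, text in CHUNKS.items()}  # pre-computed once

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the ids of the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(INDEX, key=lambda cid: cosine(q, INDEX[cid]), reverse=True)
    return ranked[:k]

def build_prompt(question: str, k: int = 2) -> str:
    """Assemble the retrieved excerpts and the question into one prompt."""
    excerpts = "\n".join(f"[{cid}] {CHUNKS[cid]}" for cid in retrieve(question, k))
    return (
        "Answer only from the excerpts below and cite the bracketed id.\n\n"
        f"{excerpts}\n\nQuestion: {question}"
    )

top_ids = retrieve("What is the standard force majeure language?", k=1)
print(top_ids)
```

The bracketed ids are what make the citation requirement cheap: the model is asked to quote an id it was handed, not to recall one.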

Fine-tuning adjusts the weights of the model on a training dataset. For generative models, this usually means providing input-output pairs (“when you see X, produce Y”) and running a training job that nudges the model’s behaviour toward those examples. Fine-tuning teaches style, format, or specialised vocabulary, not facts. A fine-tuned model trained on the templates would learn to sound like a legal template, would learn the vocabulary and cadence of the corpus, but would still not reliably cite a specific clause number. Facts go stale the moment the corpus changes; weights don’t update when a template does.

The second consideration is update frequency. The templates change 50 times a month. Fine-tuning takes hours to days to run and costs money per run; running it weekly to stay current would be expensive and would leave a gap between “clause updated” and “model knows.” RAG updates by re-embedding a changed document, minutes, maybe seconds, and the next query sees the new version immediately. Freshness is a first-class requirement for this domain, and fine-tuning does badly on freshness.

The third is data volume and quality for fine-tuning. Fine-tuning typically wants hundreds to thousands of high-quality labelled examples for the behaviour you’re trying to teach. For the legal-ops team, that would mean writing hundreds of (question, ideal-answer) pairs by hand, the kind of project that takes months and is done infrequently. By contrast, RAG needs the documents (already have them) and an embedding model (off the shelf). The bar to entry is lower by an order of magnitude.
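To make the labelling effort concrete, here is a minimal sketch of what a hand-written training set might look like once serialised. The two pairs are invented for illustration, and the `prompt`/`completion` field names are an assumption: the exact schema depends on the model being customised.

```python
import json

# Hypothetical (question, ideal-answer) pairs; a usable set needs hundreds of these.
pairs = [
    (
        "Where is the limitation of liability clause for French SaaS agreements?",
        "See [SaaS-FR-v4.2, §14.3].",
    ),
    (
        "Do we have a template for a jurisdiction the corpus does not cover?",
        "I don't have a template matching those criteria.",
    ),
]

def to_jsonl(pairs):
    """Serialise pairs as one JSON object per line (JSONL)."""
    # Field names are an assumption; check the schema for the chosen model.
    return "\n".join(
        json.dumps({"prompt": question, "completion": answer}, ensure_ascii=False)
        for question, answer in pairs
    )

jsonl = to_jsonl(pairs)
print(jsonl)
```

Note that both pairs teach a response shape (a citation format, a refusal phrasing), not the content of any clause. Every line in such a file has to be written and reviewed by someone who knows what a good answer looks like, which is where the months go.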

The fourth is cost shape at inference. Prompt engineering and RAG both run on the standard on-demand per-token bill. A custom-trained model typically requires reserved capacity instead, a multi-month commitment of model units. That is a very different cost profile: RAG at 500 queries a day costs cents per query; reserved capacity for a custom model is thousands of dollars a month whether anyone uses it or not. Fine-tuning is appropriate when volume is high and predictable enough to saturate a reserved endpoint. Not a legal-ops team of 20 paralegals.

The fifth is explainability. When the model cites a specific clause, the paralegal wants to click through to the source template. RAG gives that for free, the retrieved chunks are the citation. Fine-tuning erases the provenance: the model emits text that came from somewhere in training, but “somewhere” isn’t a clickable link.

What we’ll filter on

Six filters, applied to each of the three techniques.

Corrects hallucination on proprietary data, does the technique give the model access to the actual templates?
Adapts format and tone, does the technique change how the model writes (length, structure, register)?
Handles data freshness, when a template updates, how fast does the system reflect it?
Setup cost, time and data-labelling effort to get to first working version?
Inference cost shape, on-demand per token, provisioned, or something else?
Provides citations, can the paralegal trace the answer to a specific source?

The three-technique landscape

Prompt engineering. Refining the text sent to the model. For a question-answering task: system prompt that sets role and constraints (“You are a legal research assistant. Only answer based on provided source text. If the source doesn’t contain the answer, say so.”); few-shot examples showing the desired question-answer shape; instructions on format (“cite the template name and section number”). No new AWS services; the work is in the application code. Inference cost is whatever Bedrock on-demand charges. Setup: hours to days of iteration. Citation: only if the source text is already in the prompt (i.e. paired with retrieval).
Retrieval-augmented generation (RAG). Pre-embed the corpus, store embeddings, retrieve-and-inject at query time. AWS building blocks: Bedrock Knowledge Bases (managed: point at S3, configure chunking and embedding model, query via bedrock-agent-runtime:RetrieveAndGenerate), or DIY with a Bedrock embedding model (Titan Text Embeddings v2, Cohere Embed) plus a vector store (OpenSearch Serverless, Aurora PostgreSQL with pgvector, Pinecone, etc.). Inference cost: on-demand Bedrock per token + vector-store running cost. Setup: hours to days, depending on chunking strategy. Citation: native, the retrieved chunks carry their source metadata.
Fine-tuning. Adjusting model weights on task-specific training data. Bedrock supports fine-tuning selected models (Nova, Titan, Llama) via bedrock:CreateModelCustomizationJob: upload JSONL training data to S3, configure hyperparameters, start the job. Output is a custom model that requires provisioned throughput to serve. Setup: days to weeks to prepare a training set of hundreds of high-quality examples, plus the training job itself (hours), plus evaluation. Inference cost: provisioned throughput starting at a few dollars per hour per model unit, 1- or 6-month commitments. Citation: none by default; the model produces text without provenance.
Continued pre-training. Bedrock’s other customisation path: feed a large corpus of unlabelled domain text (a few gigabytes to tens) and further train the base model on it. Useful for teaching the model a specialised vocabulary or domain (medical, legal, financial) when fine-tuning on labelled pairs isn’t enough. Same cost shape as fine-tuning: provisioned throughput to serve, days of setup. Mentioned for completeness; rarely the correct first answer for a question-answering problem.
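The two customisation paths share one API and differ mainly in the customisation type and the shape of the training data. A rough sketch of the request follows, built as a plain dictionary so it runs without AWS credentials. The job name, bucket, role ARN, and the single hyperparameter are hypothetical placeholders; hyperparameter names in particular vary by base model.

```python
def customization_request(job_name, base_model_id, role_arn,
                          training_s3_uri, output_s3_uri,
                          continued_pre_training=False):
    """Build keyword arguments for a Bedrock model customisation job."""
    return {
        "jobName": job_name,
        "customModelName": f"{job_name}-model",
        "roleArn": role_arn,
        "baseModelIdentifier": base_model_id,
        # Labelled pairs -> FINE_TUNING; unlabelled domain text -> CONTINUED_PRE_TRAINING.
        "customizationType": (
            "CONTINUED_PRE_TRAINING" if continued_pre_training else "FINE_TUNING"
        ),
        "trainingDataConfig": {"s3Uri": training_s3_uri},
        "outputDataConfig": {"s3Uri": output_s3_uri},
        # Placeholder value; valid names and ranges depend on the base model.
        "hyperParameters": {"epochCount": "2"},
    }

request = customization_request(
    job_name="legal-tone-ft",                                    # hypothetical
    base_model_id="amazon.titan-text-express-v1",
    role_arn="arn:aws:iam::123456789012:role/HypotheticalRole",  # hypothetical
    training_s3_uri="s3://hypothetical-bucket/train.jsonl",      # hypothetical
    output_s3_uri="s3://hypothetical-bucket/output/",            # hypothetical
)

# With credentials configured, the call itself would be:
#   import boto3
#   boto3.client("bedrock").create_model_customization_job(**request)
print(request["customizationType"])
```

Nothing in the request mentions serving. The provisioned-throughput commitment is a separate purchase made after the job finishes.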

Side by side

Technique	Corrects hallucination	Adapts format/tone	Handles freshness	Setup cost	Inference cost	Citations
Prompt engineering	✗	✓	N/A	Low	On-demand	✗
RAG	✓	Partial	✓ (seconds)	Medium	On-demand + vector store	✓
Fine-tuning	Partial (style only)	✓	✗ (re-train)	High	Provisioned throughput	✗
Continued pre-training	Partial (vocabulary)	✓	✗ (re-train)	Very high	Provisioned throughput	✗

Reading the table against the legal-ops team’s actual problem: the templates change 50 times a month (fine-tuning fails freshness), the paralegals want citations (fine-tuning doesn’t provide them), and the hallucination is about facts, not style (fine-tuning doesn’t fix facts). RAG is the technique for this problem. Prompt engineering will still be necessary on top of RAG, the retrieved chunks need the correct framing, but it’s not sufficient alone because the model needs the information injected, not just instructed.

When each technique earns its keep

Two questions, does the model need new knowledge, and does that knowledge change, partition the techniques. The legal-ops scenario lands on RAG.

The pick in depth

RAG via Bedrock Knowledge Bases, with prompt engineering on top. Bedrock Knowledge Bases is the managed RAG path: point it at an S3 bucket of source documents, configure an embedding model and a vector store, and Bedrock handles chunking, embedding, indexing, and retrieval. At query time, one API call (bedrock-agent-runtime:RetrieveAndGenerate) takes the user’s question, retrieves the most relevant chunks, constructs the prompt, calls the generation model, and returns the answer with source citations.

The configuration surface that matters:

Embedding model. The embedding model turns text into a vector (a list of numbers like [0.12, -0.45, ..., 0.08], typically 1024 or 1536 dimensions). Similar meanings produce similar vectors. Bedrock Knowledge Bases supports Titan Text Embeddings v2 (Amazon), Cohere Embed English v3, and Cohere Embed Multilingual v3. For a multi-jurisdiction legal corpus including non-English templates, Cohere Multilingual is the correct default.
Chunking strategy. A 50-page template isn’t embedded as one vector; it’s split into chunks, usually a few hundred tokens each, and each chunk gets its own vector. Default chunk size is 300 tokens with 20% overlap. For legal templates where clauses have meaningful boundaries, a semantic chunking strategy (chunks respect paragraph or heading boundaries) often retrieves more cleanly than fixed-size chunks. Bedrock Knowledge Bases supports default, fixed-size, hierarchical, and semantic chunking.
Vector store. Where the embeddings live. OpenSearch Serverless is the default (Bedrock can create it for you). Aurora PostgreSQL with pgvector is the alternative if you already run Aurora; Pinecone and Redis Enterprise Cloud are supported third-party options. For 4,000 templates of varying size, OpenSearch Serverless is the lowest-friction choice; Aurora pgvector matters if the legal team already runs metadata in Postgres and wants SQL joins across vector and structured data.
Retrieval configuration. How many chunks to retrieve per query (numberOfResults, default 5; legal corpora with many adjacent clauses often benefit from 6-8 to capture related sections), and whether to use hybrid search (vector similarity plus keyword matching) versus pure vector. For legal templates where exact clause names matter, hybrid search often retrieves more reliably than pure-vector.
Generation model. Separately configurable: Claude Sonnet, Nova Pro, Llama, whichever. The generation model sees the retrieved chunks plus the question and produces the answer. Bedrock’s default prompt template includes the chunks under a $search_results$ placeholder and instructs the model to answer based on them.
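Of the settings above, fixed-size chunking with overlap is the easiest to picture in code. The sketch below is an illustration, not Bedrock's implementation: it treats each list element as one token, whereas a real pipeline counts tokens with the embedding model's tokeniser. The `chunk_tokens` name and its parameters are made up for the example.

```python
def chunk_tokens(tokens, size=300, overlap_pct=20):
    """Split a token list into fixed-size chunks that overlap by a percentage of the size."""
    overlap = size * overlap_pct // 100   # 300 tokens at 20% -> 60 shared tokens
    step = size - overlap                 # each chunk starts 240 tokens after the last
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break                         # the last chunk reached the end of the document
    return chunks

# A stand-in 700-token document.
tokens = [f"t{i}" for i in range(700)]
chunks = chunk_tokens(tokens)
print([len(chunk) for chunk in chunks])
```

The overlap means a clause that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice. Semantic or hierarchical chunking replaces the fixed `step` with boundaries taken from the document itself.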

The prompt-engineering layer still matters. Knowledge Bases lets you override the default prompt template; a custom template for this use case might add instructions like “If the source templates don’t contain the answer, say ‘I don’t have a template matching those criteria’ rather than guessing. Always cite the template name and clause number in the format [Template Name, §Clause Number].” These instructions are why the technique works end-to-end: retrieval feeds the model the correct chunks; the prompt tells the model how to handle missing information without inventing it.
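A custom template along those lines might look like the sketch below. The wording of the template is illustrative, and the `preview` helper is a local stand-in: in the real service the `$search_results$` substitution happens on Bedrock's side, and the template is supplied in the request's generation configuration (to my understanding, as a text prompt template).

```python
TEMPLATE = (
    "You are a legal research assistant. Answer only from the source text below.\n"
    "If the source templates don't contain the answer, say "
    "'I don't have a template matching those criteria' rather than guessing.\n"
    "Always cite the template name and clause number in the format "
    "[Template Name, §Clause Number].\n\n"
    "Source text:\n"
    "$search_results$\n"
)

def preview(template, chunks):
    """Locally mimic the placeholder substitution to inspect what the model would see."""
    return template.replace("$search_results$", "\n---\n".join(chunks))

# Hypothetical retrieved chunks, abbreviated.
chunks = [
    "SaaS-FR-v4.2 §14.3: each party's aggregate liability ...",
    "SaaS-FR-v4.2 §14.4: carve-outs for gross negligence ...",
]

rendered = preview(TEMPLATE, chunks)
print(rendered)
```

Previewing the rendered prompt is a cheap way to check that the refusal instruction and the citation format sit next to the source text rather than being buried far above it.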

Freshness is handled by Bedrock’s ingestion pipeline. When a template changes in S3, a re-sync, started on demand via the console or API or run on a schedule, re-embeds only the changed documents and updates the vector store. From template-change to model-knows is minutes, not a training run.
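Kicking off a re-sync from code is a single call. The sketch below wraps it in a small function and runs it against a stub so it works without AWS credentials. The ids and the `StubClient` class are hypothetical; with a real account the client would come from `boto3.client("bedrock-agent")`.

```python
def resync(client, knowledge_base_id, data_source_id):
    """Start an ingestion job and return its id and initial status."""
    response = client.start_ingestion_job(
        knowledgeBaseId=knowledge_base_id,
        dataSourceId=data_source_id,
    )
    job = response["ingestionJob"]
    return job["ingestionJobId"], job["status"]

class StubClient:
    """Stand-in for the bedrock-agent client so the sketch runs offline."""

    def start_ingestion_job(self, knowledgeBaseId, dataSourceId):
        return {
            "ingestionJob": {
                "knowledgeBaseId": knowledgeBaseId,
                "dataSourceId": dataSourceId,
                "ingestionJobId": "job-0001",   # hypothetical id
                "status": "STARTING",
            }
        }

job_id, status = resync(StubClient(), "KBHYPOTHET1", "DSHYPOTHET1")
print(job_id, status)
```

A common arrangement, assumed here and not prescribed, is to call something like `resync` from whatever process writes the changed template to S3, so the gap between edit and index is bounded by the ingestion job's running time.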

A worked query

A paralegal has a question. They type it into the team’s internal tool, which calls RetrieveAndGenerate.

$ aws bedrock-agent-runtime retrieve-and-generate \
    --input '{"text": "What is our standard limitation of liability clause for French SaaS agreements, capped at 12 months of fees?"}' \
    --retrieve-and-generate-configuration '{
      "type": "KNOWLEDGE_BASE",
      "knowledgeBaseConfiguration": {
        "knowledgeBaseId": "KB-LEGAL-TEMPLATES",
        "modelArn": "arn:aws:bedrock:eu-west-1::foundation-model/anthropic.claude-sonnet-4-5-20250929-v1:0",
        "retrievalConfiguration": {
          "vectorSearchConfiguration": {
            "numberOfResults": 6,
            "overrideSearchType": "HYBRID"
          }
        }
      }
    }'

{
  "output": {
    "text": "Our standard limitation of liability clause for French SaaS agreements, capped at 12 months of fees, is in SaaS-FR-v4.2 at §14.3. The clause reads: 'Except for breaches of confidentiality or indemnification obligations, each party's aggregate liability under this Agreement shall not exceed the fees paid or payable by Customer to Provider in the twelve (12) months immediately preceding the event giving rise to the claim.' Related carve-outs for gross negligence are in §14.4."
  },
  "citations": [
    {
      "generatedResponsePart": { ... },
      "retrievedReferences": [
        {
          "content": { "text": "..." },
          "location": {
            "type": "S3",
            "s3Location": { "uri": "s3://legal-templates/saas/FR/SaaS-FR-v4.2.docx" }
          }
        }
      ]
    }
  ]
}

What happened:

The retrieve-and-generate call embedded the question with Cohere Multilingual v3.
Bedrock queried the OpenSearch Serverless index with the resulting vector, using hybrid search (vector + keyword for “French SaaS”, “limitation of liability”, “12 months”).
It retrieved the top 6 chunks by relevance. The top hit was §14.3 of SaaS-FR-v4.2.docx; adjacent chunks included §14.4 and the definitions section.
The retrieved chunks were injected into the generation prompt under the $search_results$ placeholder; Claude Sonnet produced the answer, sticking to the retrieved text because the custom prompt template instructs it to.
The citations array on the response links each generated span to the source chunk. The UI renders a clickable citation: “[SaaS-FR-v4.2, §14.3]” jumps straight to the document in SharePoint.

The round trip is 1-3 seconds. The answer is grounded in actual text. If a paralegal asks about a jurisdiction the corpus doesn’t cover, the model says so rather than inventing.
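On the application side, turning the response into clickable citations is mostly dictionary traversal. The sketch below works on a response shaped like the one shown above, with the elided parts left elided. The `cited_sources` helper is a made-up name, and mapping an S3 URI to a link in the team's document system is left out.

```python
def cited_sources(response):
    """Collect the distinct S3 URIs cited in a RetrieveAndGenerate response."""
    uris = []
    for citation in response.get("citations", []):
        for ref in citation.get("retrievedReferences", []):
            uri = ref.get("location", {}).get("s3Location", {}).get("uri")
            if uri and uri not in uris:
                uris.append(uri)
    return uris

# Trimmed copy of the response above.
response = {
    "output": {"text": "..."},
    "citations": [
        {
            "retrievedReferences": [
                {
                    "content": {"text": "..."},
                    "location": {
                        "type": "S3",
                        "s3Location": {
                            "uri": "s3://legal-templates/saas/FR/SaaS-FR-v4.2.docx"
                        },
                    },
                }
            ]
        }
    ],
}

sources = cited_sources(response)
print(sources)
```

An empty list is itself useful: an answer that comes back with no retrieved references is a candidate for the "I don't have a template" path, not for display as fact.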

When fine-tuning would be the right choice

Fine-tuning is the correct tool when the problem is how the model writes, matching a company’s tone, producing a specific output format reliably, handling specialised vocabulary the base model gets wrong. Legal templates have some of that, but the primary problem here is grounding, not voice. Solve the grounding problem with RAG first; fine-tune later if there’s still a style gap.

For reference, fine-tuning a Bedrock model would have meant writing 500-2000 question-answer pairs by hand (a month of paralegal time), running a training job, then paying provisioned throughput at $2-$20 per hour to keep the custom model serving, charged whether anyone queries it or not. RAG plus prompt engineering ships in days at cents per query.
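The difference in cost shape is easy to check with back-of-envelope arithmetic. In the sketch below the hourly range is the $2-$20 quoted above and the 500 queries a day comes from the scenario. The token counts per query and the per-1,000-token prices are placeholder assumptions, not figures from any price list.

```python
HOURS_PER_MONTH = 730  # 365 days * 24 hours / 12 months

def provisioned_monthly(hourly_rate, model_units=1):
    """Monthly cost of reserved capacity, paid whether or not anyone queries it."""
    return hourly_rate * HOURS_PER_MONTH * model_units

def on_demand_monthly(queries_per_day, input_tokens, output_tokens,
                      price_in_per_1k, price_out_per_1k, days=30):
    """Monthly cost when paying per token, which scales with actual usage."""
    per_query = (input_tokens / 1000 * price_in_per_1k
                 + output_tokens / 1000 * price_out_per_1k)
    return per_query * queries_per_day * days

low = provisioned_monthly(2)    # 1460 dollars a month
high = provisioned_monthly(20)  # 14600 dollars a month

# Placeholder assumptions: 3,000 prompt tokens (question plus retrieved chunks),
# 300 answer tokens, and illustrative per-1k-token prices.
rag = on_demand_monthly(500, 3000, 300, price_in_per_1k=0.003, price_out_per_1k=0.015)

print(low, high, round(rag, 2))
```

Under these assumptions each RAG query costs a little over a cent, and the monthly bill falls to zero in a week when nobody asks anything. The provisioned figure does not.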

What’s worth remembering

The three techniques solve three different problems. Prompt engineering changes instructions. RAG adds facts. Fine-tuning changes style and format. Reaching for the wrong one solves nothing.
Hallucination on proprietary data is almost always a retrieval problem, not a training problem. The base model doesn’t have your documents. Give them to it at query time; don’t try to bake them into the weights.
Freshness kills fine-tuning for dynamic domains. If the knowledge changes faster than the training cadence, fine-tuned models are stale the moment they land. RAG’s freshness is minutes; fine-tuning’s is the next training run.
Bedrock Knowledge Bases is the managed RAG path. S3 in, vector store out, RetrieveAndGenerate as the single query API. Chunking strategy, embedding model, and retrieval config are the levers worth tuning.
Citations come from retrieval, not generation. RAG’s output carries retrievedReferences pointing to source documents. Fine-tuning produces text without provenance; if citations matter, fine-tuning alone won’t suffice.
Provisioned throughput is the cost shape for fine-tuned models on Bedrock. That’s a commitment of model units for 1 or 6 months, in the thousands of dollars per month. On-demand per-token pricing doesn’t apply to custom models.
Prompt engineering is always part of the answer. Even with perfect retrieval, the model needs instructions: format the citation this way, refuse to answer if the source doesn’t cover it, adopt this tone. Prompt work sits on top of every technique.
Combine where it makes sense. RAG + prompt engineering is the common pair. Fine-tuning + RAG is the domain-chatbot pattern: the fine-tune teaches voice, retrieval supplies facts. Rarely is fine-tuning alone the correct choice for a knowledge-heavy task.

The legal-ops team’s first prototype hallucinated because the model didn’t have the templates. The fix isn’t a smarter prompt or a longer training run, it’s plumbing the templates into the model’s input at query time. RAG, with prompt engineering shaping the output and Bedrock Knowledge Bases doing the retrieval plumbing, is the technique that matches the problem.

Assumption Mapping: Testing What You Believe

2026-05-05T06:00:00+08:00

Greenbox delivers weekly produce boxes from local farms to 200 subscribers in Perth. Recent customer interviews revealed that people stay for the convenience, not the local sourcing the team assumed. Now the team needs to find out what else they believe that they haven’t actually tested.

The JTBD interviews gave the Greenbox team a breakthrough. Subscribers don’t stay for fresh local vegetables. They stay because Greenbox eliminates weeknight dinner stress. The box arrives, dinner is decided, one less thing to worry about.

That insight reshaped the product roadmap. Recipe cards went into every box. Churn dropped from 8% to 5% in the first month.

But the interviews also revealed something less comfortable: a lot of what the team believes about the business is assumption, not fact.

Maya believes subscribers value local sourcing. She built the entire brand around it. Tom believes the substitution algorithm is good enough. Sam believes word-of-mouth is the main acquisition channel. Priya believes the weekly delivery cadence is right.

These aren’t minor details. They’re foundational assumptions. If any of them are wrong, the team could be optimising the wrong things for the next six months.

How assumptions hide

The tricky thing about assumptions is that the team doesn’t experience them as assumptions. They experience them as facts. “Subscribers value local sourcing” doesn’t feel like a guess, it feels like the foundation of the business. Maya would have said, with total confidence, that local sourcing is why people subscribe. Until the JTBD interviews showed otherwise.

The ones you’re most confident about are often the ones you’ve tested least. Nobody tests what they consider obvious.

Introducing Assumption Mapping

Assumption Mapping is a structured way to surface, categorise, and prioritise what you believe but haven’t validated.

Step 1: List your assumptions. Everything the team believes about the business. No judgement.

Step 2: Rate each on two axes. How critical is it? (If wrong, how badly does it hurt?) How much evidence? (Tested, or just a feeling?)

Step 3: Plot on a 2x2 grid.

Top-left: Test Immediately (high risk, low evidence)

Top-right: Monitor (high risk, high evidence)

Bottom-left: Park (low risk, low evidence)

Bottom-right: Fine (low risk, high evidence)

Horizontal axis: Low Evidence → High Evidence. Vertical axis: Low Risk → High Risk.

The top-left quadrant, high risk, low evidence, is where the landmines live.
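For teams that keep the map in a spreadsheet or script instead of on a wall, the plotting step reduces to a two-key lookup. A minimal sketch follows, assuming each assumption has already been rated as high or low on both axes. The `quadrant` function is a made-up name, and the four example ratings are the placements the Greenbox team ends up with.

```python
QUADRANTS = {
    # (high_risk, high_evidence) -> quadrant name
    (True, False): "Test Immediately",
    (True, True): "Monitor",
    (False, False): "Park",
    (False, True): "Fine",
}

def quadrant(high_risk, high_evidence):
    """Place an assumption on the 2x2 grid from its two ratings."""
    return QUADRANTS[(bool(high_risk), bool(high_evidence))]

# (assumption, high_risk, high_evidence)
ratings = [
    ("Subscribers value local sourcing", True, False),
    ("Subscribers prefer a curated box", True, True),
    ("Instagram drives sign-ups", False, False),
    ("The platform can handle 1,000 subscribers", False, True),
]

placed = {name: quadrant(risk, evidence) for name, risk, evidence in ratings}
for name, where in placed.items():
    print(f"{where}: {name}")
```

The code is trivial on purpose: the value of the exercise is in the argument over each rating, not in the lookup.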

Running the session

Lee facilitates. Dave is here. Maya invited him after the JTBD interviews, partly because the assumptions about farms need a farmer’s perspective, and partly because Dave has a way of saying things that cut through the noise. He drove in from Margaret River this morning, two hours in his ute with ABC Country playing the whole way. He sits at the end of the table in his work shirt, arms folded, watching the team arrange their sticky notes. He hasn’t been in this office since the Event Storm months ago. The walls are different now, covered in printouts and JTBD transcripts. Patrick’s quote is pinned above the whiteboard: “I was paying twenty-five dollars a week to feel bad about myself.”

Dave reads it. He doesn’t say anything.

“Write down everything you believe about Greenbox that you haven’t actually tested,” Lee says. “Not features. Beliefs. Things you’d bet the business on.”

The team writes for ten minutes. Twenty-four assumptions pile up:

Maya: Subscribers value local sourcing. Farms will scale with us. Our price ($25/box) is competitive. Subscribers prefer curated over choosing. The brand matters more than the price.

Tom: The substitution algorithm produces acceptable results. The platform can handle 1,000 subscribers. Farms will use the portal. Weekly delivery is the right cadence.

Priya: Subscribers want more variety. Mobile is primary for account management. The signup conversion rate is acceptable.

Sam: Word-of-mouth is our primary channel. Subscribers would recommend us. Instagram drives sign-ups. Churn is value-driven, not logistics-driven.

Dave writes slowly. Three notes in large, deliberate handwriting: Farms will scale with us. We’re the only option. Farmers will keep supplying if Greenbox has a bad quarter.

Sam sees Dave’s second note and writes his own version: “We don’t have a serious competitor in Perth.”

Plotting the map

The team takes each assumption and debates where it belongs. This is where the interesting conversations happen.

“Subscribers value local sourcing.”

Maya instinctively puts it top-right: high risk, high evidence. “It’s our brand identity.”

Lee pushes back. “How many JTBD interviewees mentioned it as the primary reason they subscribe?”

Sam checks the LLM’s analysis. Local sourcing appeared in nine of fifteen interviews, but as the primary motivator in only three.

The assumption moves to the top-left. High risk, low evidence.

That move is uncomfortable. Maya built Greenbox around local sourcing. It’s not just a feature, it’s personal. Discovering that subscribers might not share that belief feels like a challenge to her identity, not just her business strategy.

“The substitution algorithm produces acceptable results.”

Tom puts it top-right. “Nobody complains.”

Priya raises her hand. “Nobody complains to us. But three churned subscribers in the JTBD interviews mentioned getting items they didn’t want. One said ‘I got turnips three weeks in a row.’”

Tom opens his laptop. Thirty seconds: “Turnips were available in bulk from Dave’s farm for three weeks. The algorithm scored them as the best substitution because they were cheap and plentiful. Root-for-root swaps. Technically correct. Terrible customer experience.”

The assumption moves left. Working correctly and working well are different things.

“Farms will scale with us as we grow.”

Maya puts it top-right. “I talk to Dave and Rachel every week. They’re committed.”

Dave clears his throat. The room turns.

“Last bloke who asked me to scale went bust and owed me eight thousand dollars.”

The room goes still.

“Farm-to-table scheme out of Busselton. Three years ago. Promised guaranteed orders. I expanded my planting for them. Hired a casual for harvest. They folded in August and I was out the produce, the labour costs, and the eight grand they owed me. Never saw a cent.”

He looks at Maya. Not with hostility, with the kind of frank assessment you give a stock fence before leaning on it.

“I’m here because I trust you, Maya. I trust that you grew up on a farm and you know what it costs when things go wrong. But trust doesn’t plant seeds. Contracts plant seeds. And right now, you and I have a handshake.”

Lee lets the silence sit. Then: “The assumption isn’t ‘Dave trusts us.’ It’s ‘farms will scale with us.’ And the evidence for that is a handshake and a history of being burned.”

Top-left. Firmly.

Maya writes “formalise farm contracts” on a fresh sticky note and puts it in her pocket. Dave’s words, last bloke who asked me to scale went bust, sit in the room like weather.

“We don’t have a serious competitor in Perth.”

Sam opens his laptop. “Two churned subscribers mentioned a company called Freshly. I’ve been researching.”

He walks the team through it: launched in Sydney four months ago, twelve million in Series A, ex-McKinsey founders, recruiting delivery drivers in Perth right now.

“What do they charge?” Tom asks.

“Eighteen dollars a week.”

The room does the arithmetic. Greenbox charges twenty-five.

Top-left quadrant. The team had been operating as if they were the only game in town.

Dave, from the end of the table: “Freshly rang me last week. Asking about supply. I told them I was committed elsewhere. But they’ll ring Rachel next, if they haven’t already.”

The session continues for another twenty minutes. When it’s done:

Test Immediately (high risk, low evidence)

Subscribers value local sourcing
Our price point ($25/box) is competitive
Farms will scale with us as we grow
We don't have a serious competitor
Word-of-mouth is our primary acquisition channel
Weekly delivery is the right cadence

Monitor (high risk, high evidence)

Subscribers prefer a curated box
Recipe cards reduce churn
The brand matters more than the price

Park (low risk, low evidence)

Mobile is the primary account management channel
Instagram drives sign-ups
The unboxing experience matters for retention
People who cancel would come back with a discount

Fine (low risk, high evidence)

Subscribers would recommend Greenbox to friends
Signup flow conversion rate is acceptable
The platform can handle 1,000 subscribers

Six assumptions in the “Test Immediately” quadrant. Six things the business depends on that nobody has validated.

Designing cheap experiments

The team can’t do six research projects, they need to ship product and grow simultaneously. The experiments need to cost hours, not weeks.

“For each assumption,” Lee says, “find the smallest, cheapest experiment that would change your mind.”

Local sourcing: A survey to all active subscribers. Rank five factors. Would you consider a $20 mixed-sourcing box? The LLM helps phrase the questions to minimise leading bias. Twenty minutes to build.

Price point: Two landing page variants, current pricing alone versus current pricing with a $20 “Mixed Box” option. Track click intent for a week.

Farm scaling: Maya calls three farm partners and asks: “If we needed to double our order in three months, could you do it?”

Acquisition channel: Tom adds a mandatory “How did you hear about us?” dropdown to the sign-up flow. One hour.

Delivery cadence: Sam adds a question to the post-delivery email: weekly, fortnightly, or flexible?

Five experiments. Total cost: about eight hours. Results in one to two weeks.

The results

Local sourcing: 168 responses. Only 12% ranked local sourcing as the most important factor. Convenience dominated (38%), followed by produce quality (26%) and recipe cards (18%). 60% said they’d likely switch to a $20 mixed-sourcing box.

Maya sits with this. Sixty percent of her subscribers would accept non-local produce for a five-dollar saving. “I feel like I’ve been punched in the stomach,” she says.

Lee lets the silence sit. “It doesn’t mean local sourcing is worthless. Twelve percent rank it first, that’s twenty-four people who might leave if you drop it. But 100% local at $25 might not be the only viable model.”

Price point: 2.3x more clicks on “Subscribe” when the $20 mixed option appeared alongside the $25 local option. Having a choice made people more likely to subscribe at all.

Farm scaling: Two of three farms could increase supply by 50%. Dave, the biggest supplier, would cap out at current levels. He’d need a full growing season to expand.

Acquisition channel: Word-of-mouth: 31%. Google search: 28%. Instagram: 19%. Local press: 14%. Sam was partially right, word-of-mouth is biggest, but not dominant. Search and social together account for nearly half.

Delivery cadence: 41% wanted weekly. 35% wanted fortnightly. 24% wanted flexible. More than half (59%) wanted something other than a fixed weekly delivery. This explains churn the team hadn’t understood, subscribers accumulating unwanted produce and cancelling out of guilt.

The hard conversation

Maya is quiet for a long time. “I built this business around an assumption I never tested. I assumed people cared about local sourcing as much as I do. They don’t.”

“It means you have options you didn’t know you had,” Lee says. “A $20 mixed box could open up a much larger market. A fortnightly option reduces churn. Neither kills the local brand, you can still offer a premium local box for the people who value it most. But the path to 1,000 subscribers probably isn’t ‘1,000 people who care deeply about local produce.’ It’s ‘1,000 people who want dinner stress eliminated, some of whom also care about local.’”

“That’s a different business than the one I set out to build,” Maya says. She looks at Dave.

“Maybe,” Lee says. “Or maybe it’s the same business, with a broader front door.”

Dave stands up. He needs to get back before dark. He shakes Maya’s hand at the door.

“You’ll work it out,” he says. It’s not a compliment, it’s a bet. The same bet he made when he agreed to supply Greenbox on a handshake. He’s still holding.

Maya watches his ute pull out of the car park.

She thinks about his eight thousand dollars. Not a loan. A debt, the one the last bloke who asked Dave to scale left him holding when the business went under. Dave said it twice today. Once in the meeting and once in the way he shook her hand at the door. You know what it costs when things go wrong.

She does. That’s the problem.

The numbers say the premise was wrong. Not wrong wrong, 12% is real, those are real people, but it was the premise. Local at scale was the thing she was proving. If the market doesn’t want what she was proving, then she didn’t build a produce-box business. She built a vehicle for proving something nobody asked her to prove. And she got two farmers and two hundred subscribers and a handshake with Dave to sign on while she did it.

Lee is right that she has options. A mixed box, a fortnightly cadence, a broader front door, on a whiteboard, it’s obvious. She could draw the pivot in fifteen minutes. But the pivot isn’t a whiteboard. The pivot is driving back to Margaret River and telling Dave the model is changing, the exact sentence, more or less, that the last operator said before he went bust owing Dave eight thousand dollars. It’s asking two farmers who signed up for “local produce to Perth” to bet on a new story. It’s telling two hundred subscribers that what they bought is not what they’re getting. Some will stay. Some will leave. She does not know which, and she does not know how many nights between now and knowing.

And under all of that, older than all of that: her father. Who didn’t pivot either. Who held on until there was nothing to hold on to. In her family, the story of losing the farm is a story about a man who loved something too much to see it clearly. Maya has been building Greenbox partly to not be him. And today she is sitting in a car park being told that what she loves is not what the business needs her to love, and the only move that feels like the opposite of her father, the only move that isn’t holding on anyway, is to stop. Her father held on and lost the farm. The last bloke scaled and lost Dave’s eight thousand dollars. Stopping is the one thing neither of them did.

She knows, in the part of her brain that can still do arithmetic, that stopping and pivoting are not the same shape. That Dave’s eight thousand dollars gets paid back by a working business, not by an honourable wind-down. That pausing operations is still a kind of losing, just a tidier kind. But that part of her brain is tired and it is late and the drive home is long.

She thinks about the “pausing operations” email she hasn’t written yet but can feel forming at the edges of her mind, like weather moving in from the coast. Three sentences. Honest. A clean door closed. She isn’t going to write it tonight. She might not write it at all. But she can feel the shape of it now, and that frightens her more than the survey did.

When to use Assumption Mapping

Before major investment decisions. If the team is about to spend significant time or money, map the assumptions first. The cost is trivial compared to building on a wrong assumption.
After discovery reveals surprises. If one major assumption was wrong, others might be too.
When the team disagrees about direction. Disagreements often hide different assumptions. Mapping makes the disagreement concrete rather than political.

When not to use it

When the team isn’t safe enough to admit uncertainty. If admitting “I don’t have evidence” feels dangerous, the exercise produces a sanitised list. Fix the safety problem first.
As a substitute for talking to customers. The map tells you what to test. It doesn’t do the testing.

What comes next

Maya has a board meeting in three weeks. She needs a credible path to 1,000 subscribers. The insights are powerful: mixed sourcing, fortnightly options, SEO investment. But do the numbers add up? Can Greenbox reach 1,000 with a model that works financially, especially with a competitor about to enter at $18 per week?

That’s a question about the business model itself. And it’s where Lee starts to hit the limits of what he can help with.

For that, Lee reaches for the Business Model Canvas, and Charlotte brings someone who can help with the numbers.

How to Take a Foundation Model from Pick to Production Endpoint

2026-05-04T06:00:00+08:00

The situation

A support organisation handles around 8,000 tickets a week across five product lines. Each ticket is a thread of customer messages and agent replies, averaging roughly 1,500 words. Managers want a one-paragraph summary at the top of each ticket, written in the same tone the company uses in its knowledge base, that a reviewer can read in ten seconds.

The team is three backend engineers and a product manager. None of them has trained a model. The company already has an AWS account with a modest budget, the tickets live in an RDS Postgres database, and the security team has said anything sent to a third-party API needs a written exception. AWS-native is the path of least resistance.

“Foundation model” has been floated as a solution but nobody in the room can define it, let alone explain the path from “a foundation model exists somewhere” to “a reviewer sees a summary in the ticket UI tomorrow morning.” The lifecycle is the thing to walk.

What actually matters

A foundation model, in the sense the industry now uses the term, is a large neural network trained on a broad corpus (text, code, sometimes images) that can be adapted to many downstream tasks without being retrained from scratch. “Foundation” is the metaphor: the model is the ground floor, and the application sits on top.

The first thing worth thinking about is that there is no single “use a foundation model” step. There is a sequence: choose the model, get access to it, design the way you’ll prompt it, optionally teach it something about your data, deploy it behind an endpoint, plumb that endpoint into your application, and then watch what it does in production. Each of those is a distinct decision, and AWS sells a distinct service (or at least a distinct API surface) for each one.

The second is that most of those stages are optional. A team that needs a summariser doesn’t have to train, doesn’t have to fine-tune, and in many cases doesn’t even need retrieval: the model has read enough English by now that summarising a ticket is within its baseline capability. Recognising which stages a given problem needs is most of the work; adding stages that aren’t pulling weight is how projects end up with a training pipeline they never use.

The third is the managed-versus-self-managed axis. A managed API gives you a foundation model behind an SDK call with no infrastructure: you don’t see the GPUs, and you pay per token. A self-hosted endpoint lets you take a model, put it on infrastructure you pick, and pay for the instance whether calls come or not. The first is the path for most text-summarisation-shaped problems; the second is the path when data can’t leave your VPC, when the model you want isn’t in the managed catalogue, or when latency demands a provisioned endpoint rather than an on-demand one.

The fourth is the cost shape. On-demand managed-API pricing is per input and output token, with different rates for different models. For an 8,000-ticket-per-week workload that’s predictable enough to price up front, but the pricing model matters: a longer summary is more output tokens, and input tokens scale with the length of the ticket, so summarisation costs scale roughly linearly with workload.

The fifth is governance. Once a model is behind an endpoint, every team in the company will want to call it. Who can, for which use cases, logged how, evaluated against what? “We stood up a model” is easy; “we stood up a model and a governance story around it” is the one that survives an audit.

What we’ll filter on

Every foundation-model project passes through some subset of seven stages. Scoring each stage against the team’s situation is the filter that decides what gets built.

Model choice: which foundation model fits the task’s quality, language, context-window, and cost profile?
Access: managed API or self-hosted endpoint?
Adaptation: prompt engineering alone, retrieval-augmented generation, fine-tuning, or continued pre-training?
Deployment surface: on-demand per-token, provisioned throughput, or a provisioned real-time endpoint?
Integration: how does the application call the endpoint and handle responses, errors, and rate limits?
Evaluation: how do we know the model is getting it correct, and how do we track that over time?
Governance: logging, guardrails, access control, cost attribution.

The lifecycle landscape

Model selection and access via Bedrock. Amazon Bedrock is a managed service that puts a catalogue of foundation models (Anthropic Claude, Meta Llama, Amazon Nova and Titan, Mistral, Cohere, AI21) behind a single API. No infrastructure to provision; access is granted per-model in the Bedrock console (some models require an access request, some are self-serve). Authentication is IAM; calls are bedrock-runtime:InvokeModel or InvokeModelWithResponseStream. For a summarisation task with 8,000 tickets a week, this is the shortest path from “we chose a model” to “the model is callable from Lambda.”
Model selection and access via SageMaker JumpStart. JumpStart is a SageMaker feature that lets you pick an open-weights model from a catalogue (Llama, Falcon, Mistral, and Amazon’s own models) and deploy it to a real-time SageMaker endpoint in your VPC with a few clicks or a CloudFormation-friendly SDK call. You pay for the underlying instance (e.g. ml.g5.2xlarge) whether calls come in or not, but the model lives in your account, talks only to your VPC, and is subject to no per-token pricing. The path when data residency, custom fine-tuning, or steady high throughput push you off on-demand.
Prompt engineering. The cheapest form of adaptation. A prompt is just the text you send to the model: instructions, examples, and the input. “Summarise the following support ticket in one paragraph, using a neutral professional tone” followed by the ticket text is a prompt. Good prompt engineering can take a generic model most of the way to task-specific behaviour without touching a training pipeline. No new AWS service; the work lives in your application code.
Retrieval-augmented generation (RAG). When the model needs facts it wasn’t trained on (internal product documentation, this quarter’s pricing, an engineer’s runbook), you retrieve relevant documents at request time and include them in the prompt. Bedrock Knowledge Bases is the AWS-managed path: point it at an S3 bucket of documents, and it chunks them, embeds each chunk into a vector (a list of numbers that encodes meaning), stores the vectors in an OpenSearch Serverless or Aurora PostgreSQL index, and at query time retrieves the most relevant chunks and injects them into the model’s prompt. The team can do this themselves with Titan or Cohere embedding models plus their own vector store; Knowledge Bases is the zero-plumbing version.
Fine-tuning. If prompt engineering and retrieval both fall short, typically because the task needs a voice, format, or domain vocabulary the base model doesn’t produce reliably, fine-tuning adjusts the model’s weights on a task-specific dataset. Bedrock supports fine-tuning a subset of its models (Nova, Titan, Llama) via the console and API: upload JSONL training data to S3, start a fine-tuning job, get a custom model that requires provisioned throughput to serve. Fine-tuning is expensive in dollars and in evaluation time; most projects don’t need it.
Deployment. Bedrock offers two throughput models: on-demand (pay per input and output token, no capacity reservation) and provisioned throughput (commit to a number of “model units” for 1 or 6 months in exchange for guaranteed capacity and a different price). Fine-tuned Bedrock models require provisioned throughput. SageMaker endpoints are a third path: provision instances, pay for them continuously, get sub-second predictable latency. The choice depends on whether the workload’s shape is bursty (on-demand wins), steady-high (provisioned throughput wins), or latency-critical (SageMaker endpoint wins).
Governance. Bedrock emits CloudTrail events for every InvokeModel call, supports model invocation logging to S3 for capturing inputs and outputs, and integrates with Bedrock Guardrails (denied topics, PII redaction, profanity filters) configured independently of the model. IAM policies scope which principals can invoke which models; AWS Config and Service Control Policies can prevent unapproved models from being invoked at all. SageMaker endpoints inherit the standard VPC, IAM, and CloudWatch story.

Side by side

Mapping the seven stages onto the support-ticket-summariser scenario:

Stage	Needed for this project?	AWS service	Notes
Model selection	✓	Bedrock catalogue	Claude / Nova for English summarisation
Access	✓	Bedrock	On-demand via `bedrock-runtime:InvokeModel`
Prompt engineering	✓	(application code)	One well-crafted prompt carries most of the work
Retrieval (RAG)	✗	Bedrock Knowledge Bases	Ticket is self-contained; no external facts needed
Fine-tuning	✗	Bedrock Custom Models	Defer until prompting is measured
Deployment surface	✓	Bedrock on-demand	8k/week is predictable but bursty; not fine-tuned
Evaluation	✓	Bedrock Evaluation + SageMaker Clarify	Sample, label, track drift
Governance	✓	IAM, CloudTrail, Guardrails	PII redaction on input; log everything

The two stages the team can skip, retrieval and fine-tuning, are the two stages where most “AI projects” burn budget unnecessarily. The ticket is the thing being summarised; the model doesn’t need facts beyond the ticket. Fine-tuning is premature until there’s evidence prompting has plateaued.

The lifecycle as a pipeline

Seven stages in sequence, two of them optional. The summariser touches five of the seven. Evaluation and governance are continuous, not a phase.

The pick in depth

Bedrock on-demand, Claude or Nova, one well-crafted prompt. Bedrock gives a catalogue of models behind a single SDK. bedrock-runtime:InvokeModel takes a model ID (anthropic.claude-sonnet-4-5-20250929-v1:0, amazon.nova-pro-v1:0, etc.) and a JSON body whose shape depends on the model family. For Claude, the body is {"anthropic_version":"bedrock-2023-05-31","max_tokens":1024,"messages":[{"role":"user","content":"..."}]}. For Nova, {"inferenceConfig":{"max_new_tokens":1024},"messages":[...]}. The response comes back JSON; the application extracts output.message.content[0].text (Nova) or content[0].text (Claude) and hands it to the UI.
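A minimal sketch of the application-side plumbing this paragraph describes, using only the standard library. It builds the Claude request body in the shape quoted above and reads the text back out of a parsed response for either family. The function names and the hand-written sample responses are illustrative assumptions, not part of any AWS SDK, and the sketch stops short of making the network call.

```python
import json


def build_claude_body(prompt: str, max_tokens: int = 1024) -> str:
    """Serialise the Claude request body in the shape quoted above."""
    return json.dumps(
        {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": max_tokens,
            "messages": [{"role": "user", "content": prompt}],
        }
    )


def extract_text(model_id: str, response: dict) -> str:
    """Pull the generated text out of an already-parsed response body."""
    if "amazon.nova" in model_id:
        return response["output"]["message"]["content"][0]["text"]
    if "anthropic." in model_id:
        return response["content"][0]["text"]
    raise ValueError(f"no response parser for model family: {model_id}")


# Hand-written stand-ins for parsed responses; real ones carry more fields.
claude_reply = {"content": [{"type": "text", "text": "Ticket resolved."}]}
nova_reply = {"output": {"message": {"content": [{"text": "Ticket resolved."}]}}}

print(extract_text("anthropic.claude-sonnet-4-5-20250929-v1:0", claude_reply))
print(extract_text("amazon.nova-pro-v1:0", nova_reply))
```

Keeping the family-specific shapes in two small functions means swapping models after an evaluation touches one place rather than every call site.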

Model choice is an empirical question, not a reading-specs question. Amazon Bedrock has an Evaluation feature that runs a set of prompts through a chosen model and scores the results on dimensions like accuracy, robustness, and toxicity, or against a custom ground-truth dataset. Run a batch of 50 representative tickets through three candidate models (Claude Sonnet, Nova Pro, Llama 3.3 70B), have the product manager score the summaries, then pick the cheapest model per token that still meets the quality bar. The evaluation is a few hours of work; it saves months of arguing about which model “feels better.”
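The final selection step can be sketched in a few lines: filter the candidates by the quality bar, then take the cheapest survivor. The candidate names, scores and prices below are placeholders invented for illustration, not measurements or published rates, and the 1-to-5 scoring scale is an assumption about how the product manager might grade.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    mean_score: float        # reviewer score averaged over the sample tickets
    usd_per_m_input: float   # price per million input tokens
    usd_per_m_output: float  # price per million output tokens


def weekly_cost(c: Candidate, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one week's traffic on this candidate."""
    return (input_tokens * c.usd_per_m_input + output_tokens * c.usd_per_m_output) / 1_000_000


def pick_model(candidates, quality_bar, input_tokens, output_tokens) -> Optional[Candidate]:
    """Return the cheapest candidate whose score meets the bar, or None."""
    passing = [c for c in candidates if c.mean_score >= quality_bar]
    if not passing:
        return None
    return min(passing, key=lambda c: weekly_cost(c, input_tokens, output_tokens))


# Placeholder names, scores and prices, invented purely for illustration.
candidates = [
    Candidate("model-a", mean_score=4.6, usd_per_m_input=3.0, usd_per_m_output=15.0),
    Candidate("model-b", mean_score=4.2, usd_per_m_input=0.8, usd_per_m_output=3.2),
    Candidate("model-c", mean_score=3.1, usd_per_m_input=0.2, usd_per_m_output=0.6),
]
choice = pick_model(candidates, quality_bar=4.0, input_tokens=16_000_000, output_tokens=1_600_000)
print(choice.name if choice else "no candidate meets the bar")
```

With these made-up numbers the highest-scoring model loses to a cheaper one that still clears the bar, which is the point of setting the bar before looking at prices.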

Prompt engineering is where the quality lives. A prompt that says “Summarise this ticket” produces mediocre summaries. A prompt that says “You are writing a one-paragraph summary of a customer support ticket for an internal reviewer. Use a neutral professional tone. Mention the customer’s issue, what the agent did, and whether it’s resolved. Do not include the customer’s name or email. If the ticket is in a language other than English, summarise in English.” produces the correct shape far more consistently. Give the model a few labelled examples in the prompt, “few-shot prompting”, and the consistency tightens further. None of this touches AWS; it’s application code. It’s also where 80% of the lift comes from.
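As a rough sketch of what few-shot prompting looks like in application code, the function below stitches the instructions quoted above, a handful of labelled examples, and the new ticket into one prompt string. The section labels, the separator and the sample example pair are assumptions made for illustration; any consistent layout would do.

```python
INSTRUCTIONS = (
    "You are writing a one-paragraph summary of a customer support ticket "
    "for an internal reviewer. Use a neutral professional tone. Mention the "
    "customer's issue, what the agent did, and whether it's resolved. Do not "
    "include the customer's name or email. If the ticket is in a language "
    "other than English, summarise in English."
)


def build_prompt(ticket_text: str, examples: list[tuple[str, str]]) -> str:
    """Combine instructions, labelled examples, and the ticket to summarise."""
    sections = [INSTRUCTIONS]
    for number, (example_ticket, example_summary) in enumerate(examples, start=1):
        sections.append(
            f"Example ticket {number}:\n{example_ticket}\n\n"
            f"Example summary {number}:\n{example_summary}"
        )
    # Ending on "Summary:" nudges the model to answer with the summary only.
    sections.append(f"Ticket to summarise:\n{ticket_text}\n\nSummary:")
    return "\n\n---\n\n".join(sections)


# An invented example pair; real ones would come from reviewed tickets.
examples = [
    (
        "Customer: I was charged twice.\nAgent: Refunded the duplicate charge.",
        "A customer reported a duplicate charge. The agent refunded it. Resolved.",
    )
]
prompt = build_prompt("Customer: My export fails.\nAgent: Investigating.", examples)
print(prompt)
```

Because the prompt is assembled from data, the examples can be versioned and swapped during evaluation without touching the calling code.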

Deployment-wise, 8,000 tickets a week at roughly 2,000 input tokens and 200 output tokens each works out to 16M input and 1.6M output tokens per week. Claude Sonnet 4.5 on Bedrock bills at roughly $3 per million input tokens and $15 per million output tokens, so: $48 + $24 = $72/week, or about $310/month. On-demand is correct: the workload is small enough that provisioned throughput’s minimum commitment would cost more than the usage, and the traffic bursts to Monday-morning peaks that on-demand handles without capacity planning.
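The arithmetic above is simple enough to keep as a function, which makes it easy to re-run when ticket volume or prices change. This is a sketch using the figures from the paragraph; the function name is made up, and the monthly figure assumes 52 weeks spread over 12 months, which gives $312 against the paragraph's rounded $310.

```python
def weekly_cost_usd(
    tickets: int,
    input_tokens_each: int,
    output_tokens_each: int,
    usd_per_m_input: float,
    usd_per_m_output: float,
) -> float:
    """On-demand cost for one week of summaries, in USD."""
    input_tokens = tickets * input_tokens_each
    output_tokens = tickets * output_tokens_each
    return (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000


weekly = weekly_cost_usd(8_000, 2_000, 200, usd_per_m_input=3.0, usd_per_m_output=15.0)
monthly = weekly * 52 / 12
print(f"${weekly:.2f} per week, about ${monthly:.0f} per month")
```

Doubling the average ticket length only doubles the input half of the bill, so the same workload at 4,000 input tokens per ticket would come to $120 a week rather than $144.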

Governance is an independent track. A Bedrock Guardrail, a configuration object attached to the invocation, redacts PII from the input before it reaches the model, denies specific topics (medical or legal advice, for example), and filters profanity from the output. CloudTrail records every InvokeModel call with the model ID and the caller’s IAM principal. Model invocation logging, which is switched on in the Bedrock settings for the account and Region rather than passed on each call, sends full input/output pairs to an S3 bucket, KMS-encrypted, for audit and evaluation-set curation. An IAM policy on the Lambda role restricts which models it can invoke: bedrock:InvokeModel with a Resource scoped to specific model ARNs.
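The IAM scoping in the last sentence can be sketched as a small policy generator. The Region below is a placeholder and the helper name is invented; the ARN follows the foundation-model format with an empty account field. If the model is called through an inference profile, the resource list would likely need that ARN as well, which this sketch leaves out.

```python
import json


def invoke_policy(region: str, model_ids: list[str]) -> dict:
    """Build an IAM policy document allowing InvokeModel on listed models only."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "InvokeApprovedModelsOnly",
                "Effect": "Allow",
                "Action": "bedrock:InvokeModel",
                "Resource": [
                    f"arn:aws:bedrock:{region}::foundation-model/{model_id}"
                    for model_id in model_ids
                ],
            }
        ],
    }


# "ap-southeast-2" is an assumed Region, chosen only for the example.
policy = invoke_policy("ap-southeast-2", ["anthropic.claude-sonnet-4-5-20250929-v1:0"])
print(json.dumps(policy, indent=2))
```

Generating the document from a list of approved model IDs keeps the allow-list in one reviewable place.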

A worked pipeline: one ticket end-to-end

The PM wants to see the pipeline work on a real ticket before sign-off. The engineering team has a Lambda wired up.

$ aws bedrock-runtime invoke-model \
    --model-id anthropic.claude-sonnet-4-5-20250929-v1:0 \
    --body '{
      "anthropic_version": "bedrock-2023-05-31",
      "max_tokens": 400,
      "messages": [
        {"role": "user", "content": "You are writing a one-paragraph summary of a customer support ticket for an internal reviewer. Use a neutral professional tone. Mention the customers issue, what the agent did, and whether it is resolved. Do not include the customers name or email. Ticket follows:\n\n---\nCustomer: Hi, my dashboard has been stuck on loading for two hours. Im on the Pro plan.\nAgent: Hi there, sorry to hear. Can you try clearing your browser cache?\nCustomer: Tried that, same issue.\nAgent: Ok, Im seeing an issue on our side with the Pro-plan widget rendering. Engineering is deploying a fix; should be resolved in 30 min.\nCustomer: Ok thanks.\nAgent: Deployed. Can you refresh and confirm?\nCustomer: Working now. Thanks!"}
      ]
    }' \
    --guardrail-identifier ticket-summariser-gr \
    --guardrail-version 1 \
    --cli-binary-format raw-in-base64-out \
    out.json

$ jq -r '.content[0].text' out.json
A Pro-plan customer reported that their dashboard was stuck loading for
two hours. The agent diagnosed a server-side rendering issue affecting
Pro-plan widgets, deployed a fix, and the customer confirmed the
dashboard was working again. The ticket is resolved.

What happened behind the scenes:

IAM authorised the caller for bedrock:InvokeModel on the Claude Sonnet model ARN in the target Region.
Bedrock applied the ticket-summariser-gr guardrail: scanned the input for PII (no matches, because this ticket contains no names or email addresses), scanned the output (no matches, helped by the prompt’s instruction not to include them), and passed both through.
Bedrock called Anthropic’s model (hosted on AWS infrastructure under the AWS-Anthropic arrangement, so the request never leaves AWS), got the completion, and returned it.
CloudTrail logged the InvokeModel call: principal, model ID, timestamp. Because invocation logging is enabled, the input and output also landed in s3://ticket-summariser-logs/ under a KMS key that only the platform team holds.
The Lambda wrote the summary to the ticket’s summary column in RDS. The support UI rendered it at the top of the thread next time the ticket was opened.

That’s the loop. For 8,000 tickets a week, an EventBridge rule fires an SQS message per new ticket, the Lambda processes them at whatever concurrency Bedrock’s on-demand quota allows (request a service-quota increase if the default isn’t enough), and the whole pipeline is roughly 200 lines of code plus the guardrail, the IAM policies, and the log bucket.
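A rough sketch of the Lambda side of that loop, with the model call and the database write injected as plain functions so the control flow can be read and tested without AWS. The message fields ticket_id and ticket_text are assumptions about what the queue carries, and the batchItemFailures return shape only takes effect if partial batch responses are enabled on the SQS event source mapping.

```python
import json
from typing import Callable


def make_handler(
    summarise: Callable[[str], str],
    save: Callable[[str, str], None],
):
    """Build a Lambda handler around injected summarise and save functions."""

    def handler(event, context=None):
        failures = []
        for record in event.get("Records", []):
            try:
                message = json.loads(record["body"])
                summary = summarise(message["ticket_text"])
                save(message["ticket_id"], summary)
            except Exception:
                # Report only this message as failed so SQS retries it alone.
                failures.append({"itemIdentifier": record["messageId"]})
        return {"batchItemFailures": failures}

    return handler


# Stand-ins for the Bedrock call and the RDS write, for local experimentation.
store: dict = {}
handler = make_handler(
    summarise=lambda text: f"Summary of {len(text.split())} words.",
    save=lambda ticket_id, summary: store.__setitem__(ticket_id, summary),
)
event = {
    "Records": [
        {"messageId": "m-1", "body": json.dumps({"ticket_id": "T-1", "ticket_text": "dashboard stuck loading"})}
    ]
}
print(handler(event), store)
```

Returning failures per message means one throttled or malformed ticket does not force the whole batch back onto the queue.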

What’s worth remembering

Foundation model means a general-purpose pretrained model you adapt, not one you build. Someone else did the training; your job is to choose, access, prompt, and optionally adapt.
The lifecycle has seven stages but most projects only need five. Retrieval is for when the model needs external facts. Fine-tuning is for when prompting plateaus. Both add real complexity; neither is free.
Bedrock is the default access path for text generation. Managed, IAM-gated, per-token pricing, no GPUs to manage. The path of least resistance for most business-problem-shaped use cases.
SageMaker JumpStart is the path when data residency, model choice, or workload shape push you off Bedrock. Your own endpoint with the model you choose; you pay for the instance whether calls come or not.
Prompt engineering is where 80% of the quality lift comes from. One well-crafted prompt with instructions, tone guidance, and a few examples beats a mediocre prompt against a more expensive model.
Evaluate empirically, not by vibes. Bedrock Evaluation runs a candidate prompt across multiple models on a fixed dataset; the correct model is the one that wins your evaluation, not the one with the newest press release.
Deployment surface follows workload shape. Bursty and small: on-demand. Steady-high or fine-tuned: provisioned throughput. Latency-critical or data-residency-bound: SageMaker endpoint.
Governance is continuous, not a stage. Guardrails, CloudTrail, invocation logging, and IAM scoping belong in the first version, not the hardening pass. Retrofitting them is harder than starting with them.

The path from “use a foundation model” to “a reviewer sees a summary in the ticket UI” isn’t a single step; it’s a pipeline. Most of the stages in that pipeline have obvious AWS-native answers, and most teams that get stuck are the ones who treat adaptation (retrieval or fine-tuning) as compulsory rather than contingent. Start with the shortest path, measure, and only add stages when the evidence says they’d help.

The Reranker You Didn't Know You Needed

2026-05-02T06:00:00+08:00

You shipped a RAG chatbot last quarter. Embeddings, vector database, prompt template, the lot. Demo went great. Three months in, the support team is finding answers that are technically in the corpus but consistently the wrong ones, close enough on the embedding to rank highly, but not actually what the question was asking. You crank the top-k from 5 to 20, the LLM gets confused by the noise, and the answers get worse. You’re stuck.

The fix is a step you skipped.

In To LLMs… and Beyond! we covered RAG, retrieval-augmented generation, as a two-step pattern: retrieve relevant documents (embed the query, look up its nearest neighbours), then generate the answer. That’s the correct shape for explanation. It’s also the wrong shape for production. Most working RAG systems have three steps, and the missing middle one is where the quality lives.

This post is about that middle step.

Why a single retrieval pass isn’t enough

The retrieval step in RAG uses what’s called a bi-encoder: an encoder model (usually BERT-family, see The Other Transformers) that produces a single vector for each piece of text. The query gets one vector. Each document gets one vector. You compare them by cosine similarity: the smaller the angle between the two vectors, the more similar the texts.

This is fast. Embarrassingly fast. You can pre-compute the document vectors once and store them in a database. At query time, you only need to embed the query (a few milliseconds) and find the nearest neighbours (a few more milliseconds, even across millions of documents). It scales to web-search levels.

It’s also kind of dumb.

The bi-encoder embeds the query and the document independently. The model never sees them together. It produces a vector for the query that captures the query’s meaning in general, and a vector for the document that captures the document’s meaning in general, and then you compare those two general representations. There’s no opportunity for the model to notice that this specific query is asking about a specific aspect of this specific document.

In practice this means bi-encoders are good at finding documents that are topically related to the query. They’re less good at finding the documents that actually answer the query. Two documents about the same topic can have very similar embeddings even if only one of them contains the answer.
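The bi-encoder lookup described above comes down to comparing one query vector against stored document vectors. Here is a minimal sketch with hand-made three-dimensional vectors standing in for real embeddings, which typically have hundreds of dimensions and come from a model. The document names and numbers are invented, and a real system would use an approximate nearest-neighbour index rather than sorting every document.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int) -> list[str]:
    """Return the k document ids whose vectors are closest to the query."""
    ranked = sorted(doc_vecs, key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]), reverse=True)
    return ranked[:k]


# Toy vectors, pre-computed once and stored; only the query is embedded per request.
doc_vecs = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.0],
    "careers": [0.0, 0.1, 0.9],
}
query_vec = [0.8, 0.2, 0.0]
print(top_k(query_vec, doc_vecs, k=2))
```

Nothing in this comparison looks at the query and a document together, which is exactly the independence the section goes on to criticise.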

For a vague question like “what’s our refund policy?” topical similarity is enough. For a specific question like “can I get a refund on a digital download after 30 days if I haven’t used it?” you need a model that can read the query and the candidate documents together and decide which one actually addresses the conditions.

That’s a cross-encoder.

What a cross-encoder is

A cross-encoder is the same architecture (an encoder transformer) used a different way. Instead of producing a vector for each text, it takes a pair of texts, query and candidate document, and produces a single relevance score.

The query and document get concatenated with a separator token, fed through the model together, and the model’s full attention mechanism gets to see every query token attend to every document token and vice versa. The output is one number: how well does this document answer this query?

[CLS] can I get a refund on a digital download after 30 days [SEP]
Refund policy: physical goods may be returned within 30 days. Digital
downloads are non-refundable once purchased. [SEP]

The model reads that and outputs, say, 0.91: the document is highly relevant because it directly addresses both “digital download” and “refund,” even though the answer is “no.” A different document that only mentions the 30-day window for physical goods might score 0.34.

Cross-encoders are dramatically more accurate than bi-encoders for relevance. They’re also dramatically slower. Because the model has to see the query and document together, you can’t pre-compute anything; every query against every candidate is a fresh forward pass. If you had a million documents and ran the cross-encoder against all of them, every query would mean a million forward passes: minutes to hours of compute, depending on the model and hardware, instead of milliseconds.

Which is why you don’t do that. You do retrieve-then-rerank.

The two-stage pattern

The standard production RAG pipeline is:

Retrieval (bi-encoder). Embed the query, find the top 50-200 candidate documents from the vector database. Fast, parallel, scalable.
Reranking (cross-encoder). Score each of those candidates against the query using a cross-encoder. Pick the top 3-10 by score.
Generation (LLM). Pass the top reranked documents into the LLM along with the query. Generate the answer.

The retrieval stage is “we cast a wide net, fast.” The reranking stage is “we read each catch carefully, slowly, but only the ones in the net.” Together they let you get cross-encoder-quality relevance at bi-encoder-scale corpus sizes.
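The wide-net-then-careful-read idea can be sketched with the two scorers passed in as functions. The toy scorers below are crude word-matching stand-ins so the example runs anywhere; in a real pipeline the cheap scorer would be vector similarity from the bi-encoder and the careful scorer would be a cross-encoder. All names and the sample documents are illustrative assumptions.

```python
from typing import Callable, Sequence

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    corpus: Sequence[str],
    cheap_score: Scorer,
    careful_score: Scorer,
    n_candidates: int = 100,
    n_final: int = 5,
) -> list[str]:
    """Shortlist with the cheap scorer, then order the shortlist carefully."""
    candidates = sorted(corpus, key=lambda doc: cheap_score(query, doc), reverse=True)[:n_candidates]
    # The expensive scorer only ever sees the shortlist, never the full corpus.
    reranked = sorted(candidates, key=lambda doc: careful_score(query, doc), reverse=True)
    return reranked[:n_final]


def keyword_overlap(query: str, doc: str) -> float:
    """Toy stand-in for bi-encoder similarity: count of shared words."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def term_coverage(query: str, doc: str) -> float:
    """Toy stand-in for a cross-encoder: share of query words found in the doc."""
    words = query.lower().split()
    return sum(word in doc.lower() for word in words) / len(words)


corpus = [
    "Refund policy for physical goods: return within 30 days.",
    "Digital downloads: refund requests are declined once the file is purchased.",
    "Office opening hours and holiday closures.",
]
print(retrieve_then_rerank("digital refund", corpus, keyword_overlap, term_coverage, n_candidates=2, n_final=1))
```

The shape matters more than the scorers: the number of expensive calls is bounded by n_candidates, whatever the corpus size.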

The numbers are striking. For a corpus of one million documents:

Bi-encoder only: ~10ms per query, mediocre relevance.
Cross-encoder only: ~1,000,000 model calls per query. Untenable.
Bi-encoder + cross-encoder: ~10ms retrieval + ~200ms reranking on 100 candidates = ~210ms total, with relevance approaching cross-encoder-only quality.

That third option is what every serious RAG system is doing. The blog posts that don’t mention it are showing you the demo, not the production system.
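Taking the illustrative figures above at face value, the latency budget is a few lines of arithmetic. The sketch assumes reranking cost is linear in the number of pairs and that pairs are scored one after another; batching, faster hardware, or a heavier model would all move the numbers.

```python
RETRIEVAL_MS = 10.0            # bi-encoder lookup, from the figures above
RERANK_MS_PER_100 = 200.0      # cross-encoder time for 100 candidates


def rerank_ms(n_pairs: int) -> float:
    """Estimated cross-encoder time, assuming cost is linear in pair count."""
    return n_pairs * RERANK_MS_PER_100 / 100


two_stage_ms = RETRIEVAL_MS + rerank_ms(100)
full_corpus_minutes = rerank_ms(1_000_000) / 1000 / 60

print(f"two-stage: {two_stage_ms:.0f} ms per query")
print(f"cross-encoder over 1M documents: about {full_corpus_minutes:.0f} minutes per query")
```

Under those assumptions a full pass over a million documents comes out at roughly half an hour per query on the same per-pair budget, and longer documents or CPU-only inference would push it far higher. Either way it is several orders of magnitude away from an interactive response.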

Models you can actually use

Reranker models are a small but mature corner of the ecosystem, with both open models and hosted APIs.

Model	Made by	Open / closed	Notable for
BGE Reranker (v2-m3, large)	BAAI	Open	Strong default, multilingual, well-supported
Cohere Rerank	Cohere	Closed (API)	Easy integration, multilingual, pay-per-call
Voyage Rerank	Voyage AI	Closed (API)	High quality, instruction-tuned variants
ms-marco-MiniLM-L-6-v2	sentence-transformers	Open	Tiny (22M params), runs on CPU, fine for English
Jina Reranker	Jina AI	Open / API	Long-context variants for document-level reranking

The lightweight ones (the MiniLM cross-encoders, around 20-100M parameters) run on a CPU. The heavyweight ones (BGE Reranker v2-m3, around 568M parameters) want a GPU but produce noticeably better rankings. For most projects the correct starting point is the smallest open model that fits your latency budget; you can swap up if quality demands it.

When reranking earns its keep

Not every retrieval task needs a reranker. The benefit grows with task difficulty:

Vague topical queries against a small corpus: bi-encoder is fine. “Tell me about our company values” against a 50-document handbook will return the correct document on cosine similarity alone.
Specific factual queries against a medium corpus: reranker helps. “What’s the SLA for our enterprise tier?” against a thousand-document knowledge base benefits from the cross-encoder noticing that the document mentioning enterprise tier SLAs specifically is more relevant than the one with the same words in a marketing context.
Long-tail queries against a large corpus: reranker is essential. Web-scale search, code search, scientific literature search: the bi-encoder will return a heap of plausible-but-not-quite candidates, and the reranker is what separates them.

The pattern: bi-encoders fail by returning plausibly-related but not actually-answering documents. If your eval set is full of cases like that, you need a reranker. If your bi-encoder is missing the correct document entirely (it’s not in the top 200), reranking won’t save you; you need better embeddings or a hybrid retrieval strategy. Different problem.
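One way to tell which of those two problems an eval set has is to look at where the correct document lands in the bi-encoder's ranking. The sketch below assumes each eval case is a ranked list of document ids plus the id of the known-correct document; the category labels and the default cut-offs of 5 and 200 are illustrative choices, not fixed conventions.

```python
from collections import Counter


def diagnose(ranked_ids: list[str], correct_id: str, final_k: int = 5, candidate_k: int = 200) -> str:
    """Classify one eval case by where the correct document ranks."""
    candidates = ranked_ids[:candidate_k]
    if correct_id not in candidates:
        return "retrieval miss"          # reranking cannot recover this
    rank = candidates.index(correct_id) + 1
    return "fine as is" if rank <= final_k else "reranker would help"


def summarise_eval(cases: list[tuple[list[str], str]]) -> Counter:
    """Tally the diagnoses across an eval set."""
    return Counter(diagnose(ranked, correct) for ranked, correct in cases)


# Invented eval cases: document ids d0..d59 in retrieval order.
ranking = [f"d{i}" for i in range(60)]
cases = [
    (ranking, "d1"),       # ranked 2nd
    (ranking, "d40"),      # ranked 41st: in the net, but buried
    (ranking, "d999"),     # never retrieved
]
print(summarise_eval(cases))
```

If the tally is dominated by buried documents, a reranker is the likely fix; if it is dominated by misses, the work is upstream in embeddings, chunking, or lexical search.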

Hybrid retrieval: the other thing you might be missing

While we’re here, the second-most-skipped step in RAG explanations: hybrid retrieval.

Bi-encoders work on semantic meaning. They’re great at handling paraphrase (“how do I cancel?” finds documents about “subscription termination”). They’re weak at exact matches: product codes, person names, error messages, version numbers. The vector for KB-ERR-2847-fatal doesn’t necessarily live near the vector for 2847 in embedding space, because the model has never seen that specific string and treats it as a sequence of arbitrary subword tokens.

Hybrid retrieval combines a semantic search (bi-encoder, dense vectors) with a lexical search (BM25, sparse keyword matching) and merges the results. The semantic search catches paraphrase. The lexical search catches exact matches. The reranker takes the union and sorts it.
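To make the lexical half concrete, here is a compact BM25 scorer. It is a sketch under stated assumptions: whitespace tokenisation on lowercased text, the common defaults k1=1.5 and b=0.75, the variant of the IDF term with 1 added inside the logarithm, and a non-empty corpus. A production system would normally lean on a search engine's built-in BM25 rather than hand-rolled code.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with Okapi BM25."""
    tokenised = [doc.lower().split() for doc in docs]
    avg_len = sum(len(tokens) for tokens in tokenised) / len(tokenised)
    n_docs = len(docs)

    # Document frequency: in how many documents does each term appear?
    doc_freq: Counter = Counter()
    for tokens in tokenised:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenised:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


docs = [
    "Error KB-ERR-2847-fatal means the licence file is corrupt",
    "General troubleshooting steps for fatal errors",
    "How to cancel a subscription",
]
print(bm25_scores("KB-ERR-2847-fatal", docs))
```

The error code matches as a whole token in the first document and nowhere else, so only that document gets a non-zero score, which is the exact-match behaviour dense vectors struggle to provide.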

In production:

Semantic retrieval returns top 100 by embedding similarity.
Lexical retrieval returns top 100 by BM25 score.
Merge: take the union (often 150-200 documents after dedup).
Rerank with a cross-encoder, take the top 5-10.
Generate with the LLM.

This pattern, often called hybrid retrieval with cross-encoder reranking, is the realistic shape of a production RAG system in 2026. The blog-post version with one embedding lookup is the simplification.
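The merge step in that list is a deduplicated union of two ranked lists. The sketch below interleaves them so that the best results from each side stay near the front if the union is later truncated; since the reranker re-sorts everything anyway, the interleaving is a convenience rather than a requirement. Document ids are assumed to be strings and the function name is invented.

```python
from itertools import zip_longest


def merge_candidates(semantic_ids: list[str], lexical_ids: list[str]) -> list[str]:
    """Union of two ranked id lists, interleaved, with duplicates dropped."""
    merged: list[str] = []
    seen: set[str] = set()
    for pair in zip_longest(semantic_ids, lexical_ids):
        for doc_id in pair:
            # zip_longest pads the shorter list with None.
            if doc_id is not None and doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged


semantic = ["a", "b", "c"]     # e.g. top results by embedding similarity
lexical = ["c", "d"]           # e.g. top results by BM25
print(merge_candidates(semantic, lexical))
```

With 100 results from each side, the union lands somewhere between 100 and 200 ids depending on how much the two retrievers agree.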

A decision table

Symptom	Likely fix
"The correct document is in the top 50 but not the top 5"	Add a reranker
"The correct document isn't in the top 50 at all"	Better embeddings, or hybrid retrieval (BM25 + semantic), or chunk differently
"It can't find specific product codes / IDs"	Hybrid retrieval, you need lexical matching
"The LLM is confused by too many candidates"	Lower top-k after reranking; trust the reranker to filter
"Latency is too high"	Smaller reranker (MiniLM cross-encoders), or fewer candidates into the reranker
"Quality varies wildly between users"	Likely a chunking or query-rewriting issue, not a reranker issue

The shortcut version of RAG (embed, look up, generate) works in the demo because the demo corpus is small and the demo questions are vague. The production version has to handle a thousand specific questions against a million documents, and that’s where the bi-encoder’s independence starts to hurt. Embedding the query and the document separately is what makes retrieval scale, and it’s also what stops the model noticing whether the candidate it returned actually answers the question or merely shares a topic with it. The cross-encoder is the cure for that, because it reads the pair together and lets attention work across both halves. The price is speed, which is why nobody runs a cross-encoder against the whole corpus. They run it against the top hundred the bi-encoder fished out, and they merge in BM25 results so the product codes and error strings don’t get lost in the semantic blur.

A reranker can only do its job if the correct document already made it into the candidate set. If the bi-encoder misses entirely, no amount of reranking will recover the answer; the fix lives in the chunking, the embeddings, or the lexical search. Worth knowing which symptom you have before you start tuning.

The Knife in My Hand

2026-05-01T06:00:00+08:00

My kitchen knives are not beautiful. Blue plastic handles, no rivets, nothing decorative. They look like what they are: tools for working in a kitchen, not display pieces.

They are, however, extremely good. Heavy, precisely ground, tough enough that I haven’t chipped one in years of real use, sharp enough to cut a tomato under the weight of the blade alone. Sized correctly for my hands. I maintain them. I know them.

They live in a knife roll in a kitchen drawer. Canvas keeps the edges separated; a drawer keeps them away from small prying hands not yet ready to handle a very sharp knife. Six in the roll: a steel, a chef’s knife, a fish knife, a Santoku, a butcher’s dagger, and a paring knife.

This post is about what I’ve learned through those knives. Most of it turns out to apply to software.

A dull knife is more dangerous than a sharp one

The first thing any cook learns, and the last thing home cooks seem to believe, is that a dull knife is more dangerous than a sharp one.

A sharp knife bites into what you’re cutting where you put it. A dull knife slides. It requires more force to do the same work, and force that isn’t going into the cut has to go somewhere, usually into the hand of the person holding the knife, on the day they’re tired and the tomato is firmer than they expected. Dull-knife injuries are the ones that need stitches.

The analogue in engineering is almost embarrassingly direct. A broken test suite is more dangerous than no test suite. A stale monitoring dashboard is more dangerous than no dashboard. A CI pipeline that’s “mostly green” is more dangerous than one that’s explicitly red. The thing you want is a tool that gives you a clean, honest signal and that you trust enough to act on. A tool you mistrust, that slides off the tomato, that you have to lean on to get through: that tool is the injury waiting to happen.

Sharpen your knives. Fix your tests. Don’t tolerate dullness. Dull, blunt, less likely to cut: these things aren’t safer, no matter how they look.

Pick the right tool. Then practise with it.

The chef’s knife rocks through dense vegetables. The Santoku slices straight down through soft fruit. The paring knife works in tight space against your thumb. The fish knife flexes to follow a spine. None of them is a “better knife” in the abstract; each one exists because it fits a different job. Trying to bone a fish with a chef’s knife is awkward no matter how skilled you are. The first move in any cut is choosing the right knife for it.

The second move is having practised with it. When I’m cooking I reach for the chef’s knife without looking. I know its weight, how it rocks, how much pressure it takes through a carrot or a butternut squash. I know the Santoku picked up a spot of rust a few weeks ago (left in the sink too long after a distracted Sunday) and how it felt under the cloth when I worked it out. The muscle memory this develops doesn’t replace choosing the correct knife; it means I’m faster, safer, and more comfortable using the most accurate tool.

Picking up a new knife costs you. Hand me a hand-forged Japanese knife tomorrow and I’d be slower with it for a week. The handle the wrong shape, the balance different. I’d be thinking about the knife instead of the food. That dip is real, and it’s the price of upgrading, not a reason to skip the upgrade. The tool that promises to make you 20% faster once you’ve learned it will make you 40% slower for the three months you’re learning it, and then 10% faster forever. Nothing ever meets the hype, but the right tool can get close enough. Pay the dip once for the right tool and you recoup it the rest of your career.

Some tools never fit. A knife with the wrong handle for your hand causes fatigue every cut: no amount of practice fixes that. The learning dip is recoupable; an ill-fitting tool is friction forever. Selection comes before practice, and you can’t practise your way out of a bad selection. Software architecture is the same kind of choice: the framework, the database, the language. Get those right and practice compounds; get them wrong and practice runs into walls.

The trap isn’t deliberate upgrades; it’s churning. Picking up a new knife, or a new editor, or a new build tool, every month, never quite getting past the dip, never quite cashing in the upside. Pick deliberately. Then put in the practice.

Practice is the cut, not the recipe

The way you get good with a knife is that you cut. A lot. You cut deliberately slowly to pay attention to grip and rhythm, and you cut at speed when you’re in a hurry and discover the edges of your technique.

The technique is the floor, not the ceiling. You learn it in a weekend. Then you spend ten years making it automatic: so automatic that you stop thinking about the knife and start thinking about the food. The difference between a good home cook and a professional is almost never knowledge; it’s repetition. Professionals have cut ten thousand onions. You have cut one hundred. They’re not smarter; they’re smoother.

You don’t “know” SQL after reading a book. You know SQL after a thousand queries and several slow joins you had to rewrite. The book is the technique; the queries are the practice.

Speed comes from practice, not pressure. Every time I’ve seen an engineer try to move faster by concentrating harder, they’ve moved slower. Every time I’ve seen one move faster by doing something they’d done a hundred times before without thinking about it, they’ve been right.

Care as part of the craft

Every time I pull the knife out of the roll, I run it a few strokes down the honing steel. Ten seconds. I do it so automatically I’d feel wrong starting to cut without it.

When I’m done, I wash the knife by hand (dishwashers are hostile to good knives), dry it on a tea towel, and slide it back into its slot. Thirty seconds, every time, for so long that it isn’t a chore; it’s just what happens at the end of cooking.

Every few months I take the knives to a professional sharpener. I hone them every day because that’s use and maintenance in the same motion. But when they need actually sharpening (a proper regrind) I hand them to someone whose whole craft is sharpening knives. There is no prize for doing everything yourself.

The care is not separate from the cutting; it’s the same practice. A knife used a lot and cared for consistently gets better over time. A knife used a lot and cared for occasionally gets worse, because damage accumulates faster than attention.

The single biggest thing separating engineers who get better from engineers who plateau is whether they care for their tools alongside using them. Whether their editor config quietly improves. Whether they know the keyboard shortcut for the thing they do fifty times a day. None of it is glamorous. None shows up on a CV.

Six knives

Six knives in a roll, a board, a professional sharpener every few months. I can make almost any dinner in the world from those objects plus a pan and some heat.

I’ve been tempted to buy more. The internet is very good at trying to sell me more: beautifully photographed knives with long waiting lists and three-figure price tags, the kind that live on magnetic strips in other people’s kitchens. My knives cut something every day. They don’t look like anything special. They work.

The craft is not in the accumulation of tools, and certainly not in the appearance of them. The craft is in picking the right tools and putting in the work to know them. The tools that get photographed are rarely the tools that get used.

I look at my terminal and see the same thing. A few commands and shortcuts I’ve used so many times they feel like extensions of my hand. The rest is decoration.

Pick your tools. Keep them sharp. Use them every day. Practise. Replace one deliberately, when you can name the thing it doesn’t do that you now know you need.

The practice is the point.

Time Is Weirder Than You Think

2026-04-30T06:00:00+08:00

In What Time Is It? we untangled the human mess of the hour. In What Day Is It? we did the same for the calendar. In Ticks or Tocks? we traced the physics of the second from quartz crystals to optical lattice clocks that won’t lose a tick in the lifetime of the universe. All of those stories treated time as something that flows at the same rate everywhere, a backdrop against which clocks are merely more or less accurate. That assumption is wrong. Time itself bends.

Einstein enters the chat

Einstein’s special theory of relativity, published in 1905, showed that time passes more slowly for objects moving at high speeds relative to an observer. This isn’t a theoretical curiosity; it’s measurable. In 1971, Hafele and Keating flew caesium clocks on commercial airliners around the world and compared them to reference clocks on the ground. The flying clocks disagreed with the ground clocks by the amount relativity predicted, within the experiment’s measurement uncertainty (Hafele & Keating, 1972, Science).

The speed of light as a universal speed limit. Nothing with mass can reach the speed of light. As you approach it, time dilation increases without bound. At the speed of light, time stops entirely. From a photon’s frame of reference (to the extent that’s meaningful), no time passes at all. A photon emitted from a star ten billion light-years away has, from its own perspective, arrived at your eye instantaneously.

Muon decay provides one of the cleanest experimental demonstrations. Cosmic ray muons are created in the upper atmosphere and should decay in roughly 2.2 microseconds, which at near-light speed would let them travel only about 660 metres. But we detect them at sea level, roughly 15 kilometres below where they were created. How? At 99% of the speed of light, their time is dilated by a factor of roughly seven. They “live” long enough to reach us. Rossi and Hall first confirmed this in 1941 (Physical Review), and it remains one of the most intuitive demonstrations of special relativity.
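The arithmetic is short enough to check. A sketch in Python, using the lifetime, speed, and altitude quoted above; the speed of light is the standard defined value, and the survival estimate assumes every muon travels at exactly 0.99c, which real cosmic rays don’t.

```python
import math

C = 299_792_458.0          # speed of light, m/s
MUON_LIFETIME_S = 2.2e-6   # mean lifetime in the muon's rest frame (from the text)
ALTITUDE_M = 15_000.0      # production altitude used in the text


def lorentz_factor(beta: float) -> float:
    """Time dilation factor for a speed given as a fraction of c."""
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must satisfy 0 <= beta < 1")
    return 1.0 / math.sqrt(1.0 - beta * beta)


beta = 0.99
gamma = lorentz_factor(beta)

# Distance covered in one mean lifetime, without and with time dilation.
range_classical_m = beta * C * MUON_LIFETIME_S
range_dilated_m = gamma * range_classical_m

# Decay is exponential, so the fraction surviving a distance d is exp(-d / range).
survive_classical = math.exp(-ALTITUDE_M / range_classical_m)
survive_dilated = math.exp(-ALTITUDE_M / range_dilated_m)

print(f"gamma at {beta}c: {gamma:.2f}")
print(f"mean range: {range_classical_m:.0f} m classical, {range_dilated_m:.0f} m dilated")
print(f"surviving 15 km: {survive_classical:.1e} classical, {survive_dilated:.1%} dilated")
```

At 0.99c the factor comes out just above seven, and the mean range stretches from about 650 metres to about 4.6 kilometres. That is still short of 15 kilometres, but decay is exponential: a few percent of muons survive the trip with dilation, against roughly one in ten billion without it. Many real cosmic-ray muons are faster than 0.99c, which helps further.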

The twin paradox

Special relativity produces a result so counterintuitive that it has its own name. Take two twins: one stays on Earth, the other takes a round trip to a distant star at near-light speed. When the travelling twin returns, less time has passed for them. They are younger than their sibling. This is not an illusion or an accounting trick; it’s a real, physical difference in elapsed time.

The “paradox” label is misleading. There’s no logical contradiction. The resolution is that the two twins are not in symmetric situations: one of them accelerated (turned around), and that breaks the symmetry. The twin who stayed home followed an inertial path through spacetime: no acceleration, no turning around. In relativity, the straighter your path through spacetime, the more time you experience. It’s counterintuitive: we’re used to thinking that straight lines are shortest, but in spacetime, a straight path is the one that ages you the most. Any acceleration, any turning around, reduces the elapsed time. This is why the travelling twin ages less.

The effect doesn’t require a spaceship. The International Space Station orbits at about 7.7 kilometres per second. Astronauts on the ISS age very slightly slower than people on the ground, roughly 0.01 seconds less per year. Scott Kelly, who spent 340 days aboard the ISS in 2015-2016 while his identical twin Mark stayed on Earth, returned about 5 milliseconds younger than he would have been had he stayed home. Not enough to matter biologically. Enough to prove the physics is real.

Supersonic time travel

Concorde, that beautiful, impractical supersonic airliner, offered a surreal temporal experience. You could leave London at 10:30 AM and arrive in New York at 9:30 AM the same day, arriving before you departed by clock time. The crossing took about three and a half hours, but the five-hour time difference meant you gained more than you spent.

This wasn’t relativity; it was time zones. But the special relativistic effect was real too, if tiny. Concorde flew at roughly Mach 2: twice the speed of sound, about 600 metres per second. At that speed, the time dilation factor is approximately 1 + 2 x 10^-12, which means passengers aged about 0.0000000002% less than people on the ground: roughly 25 nanoseconds over a three-and-a-half-hour crossing. Over a career of flying Concorde, a pilot with thousands of supersonic hours might have “saved” a few tens of microseconds of biological time. Not enough to notice. Enough to measure.

The more interesting effect was the experience itself. Westbound on Concorde, the sun appeared to move backwards in the sky. You were flying faster than the Earth rotates at that latitude. For the duration of the flight, you were outrunning the planet’s spin. It’s the closest any commercial passengers ever came to the intuitive experience of time running in an unusual direction.

Gravity bends time

Einstein’s general theory of relativity, from 1915, added another twist: time passes more slowly in stronger gravitational fields. The closer you are to a massive object, the slower your clock ticks relative to someone further away. A clock on the floor of your house runs very slightly slower than a clock on your roof. The difference is about 3 nanoseconds per year per metre of altitude, which doesn’t affect your morning routine but absolutely matters for GPS.

GPS satellites orbit at about 20,200 km above the Earth. Their clocks tick faster than ground clocks by about 45 microseconds per day due to weaker gravity up there. They tick slower by about 7 microseconds per day due to their orbital speed. The net effect is that satellite clocks gain roughly 38 microseconds per day relative to the ground. If this weren’t corrected, GPS positions would drift by about 10 kilometres per day. Every GPS satellite has its clock rate deliberately adjusted before launch to compensate.

This means that when you use your phone to navigate to a restaurant, you are relying on corrections derived from general relativity. Einstein helps you find pizza.
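Those figures can be roughly reproduced from first principles. The Python sketch below assumes a circular orbit, a spherical non-rotating Earth, and the weak-field approximations for both effects; the constants are standard values, not taken from the text above.

```python
GM_EARTH = 3.986004418e14   # Earth's gravitational parameter, m^3/s^2
C = 299_792_458.0           # speed of light, m/s
R_EARTH_M = 6.371e6         # mean Earth radius, m (assumed spherical)
GPS_ALTITUDE_M = 20_200e3   # orbital altitude quoted in the text
SECONDS_PER_DAY = 86_400.0

r_orbit = R_EARTH_M + GPS_ALTITUDE_M

# General relativity: the satellite sits higher in Earth's potential well,
# so its clock runs fast by (GM / c^2) * (1/R - 1/r).
gravity_rate = GM_EARTH / C**2 * (1.0 / R_EARTH_M - 1.0 / r_orbit)

# Special relativity: for a circular orbit v^2 = GM / r, and the clock
# runs slow by approximately v^2 / (2 c^2).
velocity_rate = (GM_EARTH / r_orbit) / (2.0 * C**2)

net_rate = gravity_rate - velocity_rate

gravity_us = gravity_rate * SECONDS_PER_DAY * 1e6
velocity_us = velocity_rate * SECONDS_PER_DAY * 1e6
net_us = net_rate * SECONDS_PER_DAY * 1e6

# A timing error becomes a ranging error at the speed of light.
drift_km = net_rate * SECONDS_PER_DAY * C / 1000.0

print(f"gravity: +{gravity_us:.1f} us/day, speed: -{velocity_us:.1f} us/day")
print(f"net: +{net_us:.1f} us/day, about {drift_km:.1f} km of range error per day")
```

Under those assumptions the two terms land close to the 45 and 7 microseconds above, and the uncorrected range error comes out a little over 10 kilometres a day.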

The gravitational effect has been measured with astonishing precision. In 2010, optical clocks at NIST detected the difference in time flow between two clocks separated by just 33 centimetres of altitude (Chou et al., 2010, Science). Time really does run at different speeds depending on where you are in a gravitational field. There is no single “correct” rate at which time passes. It’s always relative to something.

This has practical consequences beyond GPS. The definition of UTC itself requires a choice: the clocks that contribute to UTC are at different altitudes and latitudes, so they tick at slightly different rates due to gravity. The BIPM corrects all contributing clocks to the rate they would tick at the “geoid”: the mean sea-level gravitational potential of the Earth. A clock in Boulder, Colorado (1,655 metres above sea level) ticks faster than one in London (near sea level) by roughly 6 microseconds per year. Without the geoid correction, the ensemble average would be meaningless; you’d be averaging clocks that are physically keeping different times. The concept of “a second” on Earth is, in a gravitational sense, a political decision about which altitude to use.

The universe’s default clock rate

The geoid is a local compromise: we picked Earth’s mean sea level and called it “the reference.” But zoom out and the same problem applies everywhere. Every mass in the universe, every star, planet, galaxy cluster, sits in a gravitational well where time runs slower. A clock in deep intergalactic space, far from any significant mass, ticks faster than any clock on any planet. That hypothetical far-from-everything clock is as close as you can get to time running “undiluted”: the fastest rate time can flow.

There is no single point where gravity’s influence drops to exactly zero. Gravity has infinite range, and the universe is full of mass, so every location experiences some gravitational time dilation. But the effect falls off sharply with distance. In the great voids between galaxy clusters (regions hundreds of millions of light-years across containing almost nothing) gravitational time dilation is vanishingly small. For all practical purposes, that’s where time runs at its natural rate.

This creates an odd inversion of perspective. We think of time on Earth as “normal” and relativistic corrections as exotic. But from the universe’s point of view, we’re the anomaly. We live at the bottom of a gravitational well. Our clocks are the slow ones. Imagine you grew up in a swimming pool and thought water resistance was just how movement worked. Then someone drained the pool and you felt what running is like without the drag. Deep space is the drained pool. We’ve been wading our whole lives.

The practical consequence is that there’s no privileged clock in the universe. UTC is corrected to the geoid, but the geoid is a human choice, not a physical constant. A civilisation on a neutron star would pick a very different reference, one where “a second” on their surface lasts far longer than ours. Neither civilisation’s second is more correct than the other’s. “How fast does time pass?” isn’t a question with an answer until you specify where.

The young heart of the Earth

We don’t need black holes to see gravitational time dilation at work on a grand scale. The core of the Earth, sitting deeper in the planet’s gravitational potential well than the surface, has experienced less elapsed time since the planet formed. The centre of the Earth is roughly 2.5 years younger than the surface, not metaphorically but in the ticks an atomic clock would actually have counted. Time has passed more slowly down there for 4.5 billion years, and it adds up. Feynman mentioned a version of this calculation; it was rigorously computed by Uggerhoj et al. (2016, European Journal of Physics).

This is not a thought experiment; it’s a straightforward consequence of general relativity applied to the known density and gravitational profile of the Earth. If you could somehow place a clock at the centre of the planet when it formed and retrieve it today, it would show a date 2.5 years behind a clock that had spent its life on the surface. The rock beneath your feet is, in a physically meaningful sense, younger than the rock you’re standing on.

Black holes and the edge of time

Near a black hole, gravitational time dilation becomes extreme. At the event horizon, the boundary beyond which nothing, not even light, can escape, time, from an outside observer’s perspective, stops entirely. An object falling toward a black hole appears to slow down asymptotically, growing dimmer and redder, never quite crossing the horizon from the viewpoint of someone watching from a safe distance. The object falling in experiences time perfectly normally from its own point of view. Neither observer is wrong. Time is doing something different in each location.

The mathematics is well established. Karl Schwarzschild worked out what happens to spacetime around a simple, non-spinning massive object in 1916, just months after Einstein published general relativity. His solution predicts that at the event horizon, the gravitational time dilation factor goes to infinity. Time, as experienced by a distant observer, literally ceases to advance for anything at the horizon.
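To see how that divergence behaves, here is a small Python sketch of the Schwarzschild time dilation factor for a clock held stationary at radius r, compared with a clock far away. The ten-solar-mass black hole is an illustrative assumption, and the constants are rounded standard values.

```python
import math

G = 6.674e-11        # gravitational constant, m^3 kg^-1 s^-2
C = 299_792_458.0    # speed of light, m/s
M_SUN_KG = 1.989e30  # solar mass, kg


def schwarzschild_radius(mass_kg: float) -> float:
    """Radius of the event horizon for a non-spinning mass: r_s = 2GM / c^2."""
    return 2.0 * G * mass_kg / C**2


def dilation_factor(r: float, r_s: float) -> float:
    """Distant-observer seconds per second on a clock hovering at radius r > r_s."""
    if r <= r_s:
        raise ValueError("a stationary clock only exists outside the horizon")
    return 1.0 / math.sqrt(1.0 - r_s / r)


r_s = schwarzschild_radius(10 * M_SUN_KG)  # hypothetical 10-solar-mass black hole
print(f"horizon radius: {r_s / 1000:.1f} km")
for multiple in (10.0, 2.0, 1.1, 1.01, 1.0001):
    factor = dilation_factor(multiple * r_s, r_s)
    print(f"r = {multiple} r_s -> factor {factor:.2f}")
```

The factor is barely above one at ten horizon radii, about 1.41 at two, and climbs without limit as r approaches the horizon. The function refuses r at or inside the horizon because no clock can hover there.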

Inside the horizon, things get stranger still. The physics is hard to describe without the maths, but the gist is this: falling toward the centre becomes as unavoidable as the passage of time itself. You can no more stop falling inward than you can stop moving into the future. The singularity at the centre isn’t a place you travel to; it’s a moment you can’t avoid, the future that everything inside the horizon is headed toward.

Time ripples

If gravity bends time, and gravitational fields change, say, when two black holes spiral into each other, then the bending itself should propagate outward as a wave. Einstein predicted this in 1916. It took a century to confirm.

On 14 September 2015, the LIGO detectors in Livingston, Louisiana, and Hanford, Washington, detected gravitational waves from two black holes merging 1.3 billion light-years away (Abbott et al., 2016, Physical Review Letters). What LIGO measured was spacetime itself stretching and compressing as the wave passed through. The arms of the detector, each four kilometres long, changed length by roughly one-thousandth the diameter of a proton. That’s the most precise measurement humans have ever made.

Here’s what that means for time. A gravitational wave doesn’t just stretch space; it stretches spacetime. As the wave from those merging black holes passed through Louisiana, time in the detector was oscillating: running very slightly faster, then very slightly slower, then faster again, hundreds of times per second. The oscillation was absurdly tiny, but it was real. For a fraction of a second, time in Livingston and time in Hanford were running at different rates, because the wave hit them at different moments.

We usually think of time as the background against which things happen. Gravitational waves show that the background itself vibrates. Time has ripples. They’re passing through you right now: from distant supernovae, from colliding neutron stars, from black holes that merged before the Earth existed. You can’t feel them. LIGO can.

So what time is it?

After all of this (the human history of sundials and railways and political time zones, the physics of caesium atoms and clock ensembles, the relativity that bends time near massive objects and at high speeds) the answer is: it depends.

It depends on where you are in a gravitational field. It depends on how fast you’re moving. It depends on which timescale you’ve chosen and why. It depends on whether you care about the sun’s position, or the purity of atomic seconds, or the agreement between your timestamp and everyone else’s.

The phone in your pocket hides all of this heroically. It receives signals from GPS satellites that have been corrected for both special and general relativistic effects. It knows your time zone from your location. It knows about DST transitions from a regularly updated database. It adjusts for leap seconds, or at least it tries to. It presents you with a number that looks simple and authoritative, and you glance at it and get on with your day.

Underneath, it’s leaning on millennia of astronomy, centuries of mechanical engineering, decades of atomic physics, and Einstein. It’s a tower of clever hacks and hard-won compromises, and it’s a miracle it works at all.

But we’ve only covered what time does: how it bends near mass, dilates with motion, ripples across the universe. The harder question is whether time fundamentally exists. The arrow that distinguishes past from future isn’t in the equations. “Now” isn’t a location in spacetime. The equations of quantum gravity may contain no time variable at all.

Does Time Even Exist? is next: a tour of the foundations, from the block universe to the holographic principle and the physicists who think time is a shadow of something simpler.

Picking the AWS AI Service Tier for Each Feature

2026-04-29T06:00:00+08:00

The situation

A mid-sized B2B SaaS runs a help-desk product. Customers raise support tickets in a web form; the text flows through queues, is triaged by a rules engine, and lands in an agent’s inbox. The CEO has returned from a conference with a directive: “add AI.” Board presentation in three weeks.

The PM has a backlog of half-formed ideas, six of them:

Sentiment on inbound tickets: flag angry customers so agents prioritise them.
Auto-translate tickets from non-English customers into the agent’s language, and the agent’s reply back.
Extract structured fields from attached PDFs (invoices, purchase orders) so agents don’t retype.
Moderate screenshots for anything NSFW before a human sees them.
Draft suggested replies based on the ticket and the knowledge base.
Rank the backlog by predicted priority using last year’s labelled tickets.

Six features. The board wants “AI.” The PM wants a plan that picks the shortest path for each.

Constraints are harsh but familiar: no ML background on the team (three backend engineers, one front-end, no data scientist); fast time-to-prototype (something running against real tickets within the fortnight); predictable cost (a line item finance can sign off without a model-unit-hour forecast); and no infrastructure the team has to babysit (managed endpoints, not GPU fleets).

What actually matters

Before reaching for a service, be honest about what “add AI” actually has to mean for a team without ML staff.

The first thing is that the cheapest AI feature is one AWS has already built. If the task is “flag angry emails” or “pull fields off an invoice,” those are problems many companies have; there’s a fair chance AWS has shipped a service that does exactly that. Starting at that layer (managed, task-specific, one API call) beats anything bespoke on time-to-prototype, cost, latency, and stability of behaviour. Only if no pre-built answer exists does it make sense to go up a layer.

The second is what happens the week after launch. A prototype is easy; a prototype the team has to keep alive for a year is harder. A managed service that AWS updates is no-maintenance. A foundation-model prompt on Bedrock is low-maintenance but drifts when the vendor retrains. A bespoke SageMaker model is high-maintenance: training data, drift monitoring, endpoint scaling, retraining cadence. The PM’s fortnight should generate something closer to the first shape than the third.

The third is cost predictability. Finance wants a line item. Per-request, per-page, per-character, per-token pricing gives a line item; it scales with use, which is usually fine. Per-instance-hour pricing for inference endpoints is a capacity forecast, uncomfortable when the product is new and the load is unknowable. Training jobs are per-compute-hour spikes with no guarantee the output model is actually good. The shape of the bill has to match the predictability of the product.

The fourth is time-to-quality. “Working prototype” is not “good enough to ship.” A managed service ships with AWS’s quality baseline; a tuned prompt ships with whatever the tuner can squeeze out of a general model; a bespoke model ships with whatever the data supports and the team can evaluate. The team has no evaluation muscle, which means the further up the custom stack they go, the longer they spend not sure if it’s good enough.

The fifth is how many of these features are really the same problem. Sentiment and moderation and field extraction are recognisably distinct; “draft a reply” and “rank by priority” look different from those and different from each other. A programme plan that picks the shortest path per feature, not the same path for all six, gets to shipped faster than one that forces everything through one layer.

Finally, the correct answer for one is not the correct answer for all. Some of the backlog is Layer 1 (managed service exists); some is Layer 2 (general-purpose LLM with a prompt); some is Layer 3 (bespoke model). The programme has three shapes of work, not one. The PM’s job for the fortnight is sorting the six features into those three buckets and picking the bucket that actually ships.

What we’ll filter on

Task already solved: does AWS ship a service that does this exact thing?
No ML expertise needed: team writes application code, doesn’t train models.
Predictable usage-based pricing: per-request, per-page, per-token. Scales with use.
Fully managed: no EC2, no containers, no endpoints the team provisions.
Time to prototype: days, not weeks.

The AWS AI landscape

AWS groups its AI offerings into three layers.

Layer 1. Managed AI services. Pre-built, task-specific: detect sentiment, translate text, extract data from a form, find faces in an image. AWS trained the model, AWS hosts it, AWS updates it. The service is the feature.

Layer 2. Amazon Bedrock. A serverless API over a catalogue of general-purpose foundation models from Anthropic, Amazon (Nova and Titan), Meta, Mistral, Cohere, AI21, Stability, and others. One API, many models. Pick the model, write the prompt, pay per token.

Layer 3. Amazon SageMaker AI. The platform for building, training, and hosting your own models. Notebooks, training jobs, inference endpoints, batch transform, feature stores, model registries. Pay per compute-hour at every stage. Where a data-science team lives when no pre-built answer fits.

Side by side

Layer	Pre-built task	No ML expertise	Predictable pricing	Fully managed
Managed AI services	✓	✓	✓	✓
Bedrock	-	✓	✓	✓
SageMaker AI	✗	✗	✗	✗

The rule of thumb: for each feature, work down the layers. Start at Layer 1; move to Layer 2 only if no managed service fits; reach Layer 3 only if neither will do.

Matching six features to three layers

Five features ship in a fortnight across two layers; the sixth waits for the correct team. Three layers, work top-down, and the plan writes itself.

The managed AI services in depth

The Layer 1 catalogue is worth knowing by name. Each has a scope, an SDK call, and a unit of billing you can put on a napkin.

Amazon Comprehend. NLP over text: sentiment, entities, key phrases, language detection, PII, topic modelling. Billed in units of 100 characters, three-unit minimum per request, $0.0001 per unit for the first 10M. Free tier: 50,000 units/month for 12 months. Where ticket sentiment lives.

Amazon Translate. Machine translation across 75+ languages, real-time and batch. $15 per million characters for standard. Free tier: 2M characters/month for 12 months.

Amazon Textract. Extracts text, handwriting, tables, and form data from documents. DetectDocumentText at $0.0015/page (raw OCR), AnalyzeDocument at $0.015/page (tables) or $0.05/page (form key-value pairs). Invoices use AnalyzeExpense at $0.01/page. Free tier: 1,000 pages/month for three months.

Amazon Rekognition. Image and video analysis: label detection (10,000+ categories), face detection and comparison, content moderation, OCR-in-images. $0.001/image for the first million. Free tier: 1,000 images/month, per API group, for 12 months. NSFW moderation is a single DetectModerationLabels call.

Amazon Transcribe. Speech-to-text. Per second (15-second minimum); standard starts at $0.024/minute. Free tier: 60 minutes/month for 12 months.

Amazon Polly. Text-to-speech. Standard voices $4/M characters; neural $16/M characters.

Amazon Lex. Conversational bots. $0.004/speech request, $0.00075/text request.

Amazon Kendra. Enterprise semantic search. Priced per index-hour (GenAI Enterprise from $0.32/hour), uniquely for Layer 1.

Amazon Personalize. Recommendations. V2 real-time at $0.15/1,000 requests.

Amazon Fraud Detector and Amazon Forecast: pre-built online-fraud scoring and time-series forecasting.

Against the backlog, four of the six ideas find a managed-service home:

Sentiment → Comprehend DetectSentiment.
Auto-translate → Translate TranslateText.
Extract fields from PDFs → Textract AnalyzeDocument or AnalyzeExpense.
Moderate screenshots → Rekognition DetectModerationLabels.

Four features, four SDK calls, four line items.
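To give a feel for how small those calls are, here is a sketch of the four as thin Python wrappers. It assumes the boto3 SDK: each function takes an already-constructed client (for example `boto3.client("comprehend")`), and the function names, defaults, and return shapes are illustrative choices, not a recommended design.

```python
from typing import Any, Dict, List


def ticket_sentiment(comprehend: Any, text: str, language: str = "en") -> str:
    """Comprehend DetectSentiment: POSITIVE, NEGATIVE, NEUTRAL or MIXED."""
    resp = comprehend.detect_sentiment(Text=text, LanguageCode=language)
    return resp["Sentiment"]


def translate_ticket(translate: Any, text: str, target: str, source: str = "auto") -> str:
    """Translate TranslateText; source 'auto' asks the service to detect the language."""
    resp = translate.translate_text(
        Text=text, SourceLanguageCode=source, TargetLanguageCode=target
    )
    return resp["TranslatedText"]


def invoice_fields(textract: Any, document_bytes: bytes) -> Dict[str, str]:
    """Textract AnalyzeExpense: flatten the summary fields into label -> value."""
    resp = textract.analyze_expense(Document={"Bytes": document_bytes})
    fields: Dict[str, str] = {}
    for doc in resp.get("ExpenseDocuments", []):
        for field in doc.get("SummaryFields", []):
            label = field.get("Type", {}).get("Text", "")
            value = field.get("ValueDetection", {}).get("Text", "")
            if label:
                fields[label] = value
    return fields


def screenshot_flags(rekognition: Any, image_bytes: bytes, min_confidence: float = 80.0) -> List[str]:
    """Rekognition DetectModerationLabels: names of any labels above the threshold."""
    resp = rekognition.detect_moderation_labels(
        Image={"Bytes": image_bytes}, MinConfidence=min_confidence
    )
    return [label["Name"] for label in resp["ModerationLabels"]]
```

Passing the client in keeps the wrappers easy to test with a stub and leaves region and credentials to the caller. The 80% moderation threshold is an arbitrary starting point, and multi-page documents may need Textract’s asynchronous variant rather than this synchronous call.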

When Bedrock is the answer

Two ideas don’t match a managed service.

“Draft suggested replies based on the ticket and the knowledge base.” There’s no DraftReply API: the tone, the structure, the policy constraints, and the knowledge base are all specific to this company. But it is exactly what a general-purpose language model is for.

Bedrock’s shape: one Converse API call, pick a model ID, pass the prompt, get the generation. Per-token pricing, input and output charged separately. No training.

A sample of 2026 on-demand prices, for scale rather than memorisation:

Claude Haiku: cheap, fast. Roughly $1/$5 per million input/output tokens.
Claude Sonnet: the mid tier most production RAG lands on. Roughly $3/$15.
Claude Opus: premium. Roughly $15/$75.
Amazon Nova Lite: Amazon’s own cheap tier, roughly $0.06/$0.24.
Meta Llama 3.1 70B: open-weight, competitive mid tier.

Two things matter for the PM. First, the model is a runtime parameter, not an architectural commitment. Switch Haiku to Sonnet via the model ID. Start cheap, upgrade only if quality doesn’t clear the bar. Second, Bedrock is serverless and per-token, same “no infrastructure, predictable cost” shape as the managed services, just with the model you chose.
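A sketch of what “the model is a runtime parameter” looks like in Python. It assumes a boto3 `bedrock-runtime` client passed in by the caller; the system prompt, the function name, and the inference settings are illustrative, and the model ID is deliberately left as an argument rather than hard-coded.

```python
from typing import Any, Tuple

SYSTEM_PROMPT = (
    "You draft replies for a support agent to review. "
    "Use only the knowledge-base excerpt provided; say so if it does not cover the issue."
)


def draft_reply(
    bedrock_runtime: Any,
    model_id: str,
    ticket_text: str,
    kb_excerpt: str,
    max_tokens: int = 400,
) -> Tuple[str, int, int]:
    """Call the Converse API once; return the draft plus input and output token counts."""
    resp = bedrock_runtime.converse(
        modelId=model_id,  # swapping tiers is a change to this one argument
        system=[{"text": SYSTEM_PROMPT}],
        messages=[
            {
                "role": "user",
                "content": [
                    {"text": f"Knowledge base:\n{kb_excerpt}\n\nTicket:\n{ticket_text}"}
                ],
            }
        ],
        inferenceConfig={"maxTokens": max_tokens, "temperature": 0.2},
    )
    draft = resp["output"]["message"]["content"][0]["text"]
    usage = resp["usage"]
    return draft, usage["inputTokens"], usage["outputTokens"]
```

Returning the token counts alongside the draft means the per-ticket cost can be logged from day one, which is what makes the start-cheap, upgrade-if-needed decision an empirical one.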

Bedrock’s adjacent features (Knowledge Bases for RAG, Guardrails for content safety, Agents for tool-using workflows) are available without leaving the SDK. A first-cut reply drafter is Bedrock plus a Knowledge Base pointed at the docs store.

When SageMaker is the answer

One idea fits neither Layer 1 nor Layer 2.

“Rank the backlog by predicted priority using last year’s labelled tickets.” No managed service exists for priority prediction; priority is company-specific. Bedrock could be asked to assign priority via a prompt, but the input is tabular: structured ticket features plus a labelled history. Classical supervised learning, not generation.

SageMaker’s parts: Studio/notebooks for exploration (per-instance-hour); training jobs (per-instance-hour); real-time inference endpoints (persistent, per-hour); serverless inference; batch transform; Autopilot and Canvas for teams without deep ML expertise (lower skill bar, not lower infrastructure).

What makes it distinctively Layer 3, not a fancier Bedrock, is the shape of the work. Training a priority classifier means feature engineering, labelled data, train/test splits, hyperparameter tuning, evaluation metrics, drift monitoring. SageMaker is the toolset for doing that properly. Without the skills, SageMaker is a very expensive Jupyter notebook.

The honest answer: defer the priority-ranking feature. Ship the other five on Layers 1 and 2, then either come back when a data scientist exists or prototype in SageMaker Canvas with Autopilot and accept a rougher quality bar. Don’t stand up a training pipeline just to tick the “we have AI” box.

A worked trace through the backlog

Sentiment. Comprehend DetectSentiment. 500-char ticket = five units at $0.0001 = $0.0005/ticket. 10,000 tickets/month = $5 before the free tier. Layer 1.

Auto-translate. Translate, both directions. 500 chars at $15 per million characters = $0.0075 per translation. A thousand exchanges a month, each translated once inbound and once for the reply = 2,000 translations = $15. Layer 1.

Extract invoice PDFs. Textract AnalyzeExpense at $0.01/page. 500 attachments × 2 pages = $10. Layer 1.

Moderate screenshots. Rekognition DetectModerationLabels at $0.001/image. 2,000/month = $2. Layer 1.

Draft replies. Bedrock Converse to Claude Haiku. Ticket (~700 tokens) + KB excerpt (~1,500) + draft (~300 out) = 2,200 in + 300 out. At Haiku’s ~$1/$5 per million: ~$0.004/draft. 1,000/month: ~$4. Upgrade to Sonnet proportionally if quality disappoints. Layer 2.

Rank backlog. Deferred; or, if a Canvas prototype is acceptable, tens of dollars in training compute plus a small endpoint. Layer 3.

Total running cost for the five shipped features: well under $50/month.
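The arithmetic behind that total can be reproduced in a few lines. This sketch uses only the volumes and approximate prices quoted in the trace above; the function name and the reading of "both directions" as two translations per exchange are assumptions.

```python
PER_MILLION = 1_000_000


def monthly_costs():
    """Return the estimated monthly cost in USD of each shipped feature."""
    # Comprehend: 500-char ticket = 5 units of 100 chars at $0.0001 per unit.
    sentiment = 10_000 * (500 / 100) * 0.0001
    # Translate: $15 per million chars, 1,000 exchanges translated both ways.
    translate = 1_000 * 2 * 500 * 15 / PER_MILLION
    # Textract AnalyzeExpense: $0.01 per page, 500 attachments of 2 pages.
    textract = 500 * 2 * 0.01
    # Rekognition moderation: $0.001 per image.
    rekognition = 2_000 * 0.001
    # Bedrock (Haiku-class pricing): $1 per million input, $5 per million output.
    per_draft = (2_200 * 1 + 300 * 5) / PER_MILLION
    bedrock = 1_000 * per_draft
    return {
        "sentiment": sentiment,
        "translate": translate,
        "textract": textract,
        "rekognition": rekognition,
        "draft_replies": bedrock,
    }


if __name__ == "__main__":
    costs = monthly_costs()
    for feature, usd in costs.items():
        print(f"{feature:>14}: ${usd:6.2f}")
    print(f"{'total':>14}: ${sum(costs.values()):6.2f}")
```

Keeping the volumes as explicit numbers makes it easy to rerun the estimate when ticket counts or prices change.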

Where Bedrock and managed services overlap

Some features could be done at either Layer 1 or Layer 2. Sentiment is the classic case. Comprehend has a dedicated trained classifier; Bedrock can classify via a prompt. Which is correct?

Rule of thumb: prefer the managed service when one exists.

Cost. Comprehend at $0.0001/100-character unit beats Bedrock per-token pricing for short-classification tasks by an order of magnitude.
Latency. A purpose-built endpoint beats a general-purpose LLM parsing the instruction every time.
Behaviour stability. Comprehend’s API doesn’t change when AWS retrains; a Bedrock prompt drifts when the vendor ships a new model version.

The flip side: when the task is bespoke or the classes aren’t standard, Bedrock wins. Classify tickets into seven company-specific categories that don’t map onto Comprehend’s entity types? Prompt the model. The taxonomy is in the prompt, not hidden in a service’s training data.

The rule is neither “always managed” nor “always Bedrock”: use the managed service for fixed-schema tasks where it fits, and Bedrock for open-schema tasks where instruction-following is the feature.

What’s worth remembering

AWS AI sorts into three layers. Managed services for pre-built tasks. Bedrock for foundation-model access. SageMaker for custom training and hosting. Work down from the top.
The managed-service catalogue. Comprehend (text NLP), Translate, Textract (documents), Rekognition (images/video), Transcribe (speech to text), Polly (text to speech), Lex (chatbots), Kendra (search), Personalize (recommendations), Fraud Detector, Forecast.
Per-unit pricing shapes by service. Per-100-character (Comprehend), per-million-character (Translate, Polly), per-page (Textract), per-image (Rekognition), per-second (Transcribe), per-request (Lex), per-index-hour (Kendra), per-token (Bedrock), per-instance-hour (SageMaker).
Bedrock is serverless and per-token. One API over many foundation models. The model ID is a runtime parameter.
SageMaker is the build layer, not the buy layer. Reserve it for tasks that don’t exist in Layers 1 or 2. Canvas and Autopilot lower the skill bar, not the infrastructure layer.
Managed services beat Bedrock for fixed-schema, common tasks. Lower cost, lower latency, more stable. Use Bedrock when bespoke.
Most free tiers are per-month, 12 months, new-customer. Enough to prototype every feature without hitting billing.
The “no ML background” constraint filters aggressively. It eliminates SageMaker from the default answer for most problems and pushes teams toward Layers 1 and 2.

Jobs to Be Done: Why Subscribers Actually Stay

2026-04-28T06:00:00+08:00

Greenbox is a produce-box startup that delivers weekly boxes of local farm produce to subscribers in Perth. They’ve used discovery workshops to build shared understanding and reach 200 subscribers, but now they need to grow to 1,000, and the techniques that got them here won’t answer the questions they’re facing next.

Greenbox has hit 200 subscribers. It took longer than anyone expected and involved more rework than anyone wants to admit, but the number is real. Two hundred people paying real money every week for a box of local produce.

Maya secured the next funding round. The board’s new target: 1,000 active subscribers within six months. Five times the current base in half a year.

Maya tells herself this on the coastal track at 5:45am on Monday, feet landing on packed sand, breath steady. The number is real. Two hundred. She’s proved something. She should feel good about it. But the board call last Thursday sits in her chest like a stone she swallowed. The new target isn’t a vote of confidence; it’s a test. Angela had said it kindly enough, “We’re excited about the trajectory, Maya”, but the slide behind Angela’s head had a red line showing where the funding ran out if they didn’t hit it.

She showers, makes coffee, sits at the kitchen table with her laptop. Nadia is still asleep. The photo of her parents’ farm catches the morning light: her father standing in front of the converted dairy shed, smiling but tired. She knows that look. It’s the face of someone who believes in what they’re building but can’t yet see how it survives.

She opens the subscriber dashboard. Two hundred and six. Net gain of three last week.

Three.

But there’s a problem hiding in the numbers.

The churn problem

Churn is 8% monthly. For every ten new subscribers the team signs up, they lose three or four existing ones.

Sam walks the team through it on Monday morning. “We added forty-two new subscribers last month. We lost sixteen. Net gain: twenty-six. If we keep losing sixteen a month, we need to sign up sixty a month just to net the growth we need.”

Maya does the maths on the whiteboard. Sam’s number assumes the sixteen-a-month loss stays flat. It won’t. Churn is a percentage, not a count: 8% of 200 is sixteen, but 8% of 400 is thirty-two, and 8% of 600 is forty-eight. The bigger the base, the more they have to replace before they grow at all. At 8% monthly churn, even doubling their acquisition rate would only get them to around 600 subscribers in six months. They’d never hit 1,000 on acquisition alone; they had to bring churn down too.

“We need to understand why people leave,” Maya says.

Tom nods. “Let’s Event Storm it.”

The wrong tool for the job

The team books the meeting room, grabs the sticky notes, and starts mapping the cancellation flow. After an hour, the wall has a clean timeline of what happens when someone cancels. The process is well mapped.

Lee has been quiet, which is unusual. “This is a good map of the cancellation process,” he says. “But you’re mapping what happens when they leave. Not why they decided to leave. Those are different questions.”

He’s correct. The map says nothing about why subscribers decided to cancel in the first place. That motivation lives outside the system, in the customer’s kitchen on a Tuesday evening, in the moment they decide this subscription isn’t worth it any more.

Tom is frustrated. “So we wasted an hour?”

“Not wasted. You now have a clean map of the cancellation flow, which you’ll need when you build retention features. But you need a different lens for the why.”

Jobs to Be Done

Lee draws a simple diagram on the whiteboard. A stick figure, an arrow, and a box labelled “Greenbox.”

“Clayton Christensen’s framework. The core idea: customers don’t buy products. They hire them to do a job in their life. Your product isn’t competing with other produce boxes; it’s competing with whatever else the customer could hire to do the same job.”

“Isn’t the job obvious?” Priya says. “They want fresh local vegetables.”

“Maybe. But if that were the job, they could go to a farmers’ market. Or join a food co-op. What does Greenbox do that those alternatives don’t?”

Nobody answers immediately. It’s a harder question than it sounds.

Lee tells them about Christensen’s milkshake study: a fast-food chain that couldn’t sell more milkshakes until they watched what actually happened at the counter. Half the milkshakes were sold before 8am, to commuters. The job wasn’t “enjoy a delicious milkshake.” The job was “make my commute less tedious.” Once they understood that, they made the milkshake thicker and added fruit. Sales went up 40%.

“Greenbox isn’t competing with other produce boxes. It’s competing with whatever else your customers could do to solve the same problem in their lives.”

Talking to actual humans

Lee suggests interviews. Not surveys, not analytics. Actual conversations with actual people.

“Three groups. Five active subscribers, five who cancelled, five who considered subscribing but didn’t. Thirty minutes each. The hard part isn’t the time; it’s asking the right questions. And the first rule, the one everyone breaks: don’t defend. Whatever they say about the product, don’t explain, don’t correct, don’t apologise mid-sentence. Just listen and ask the next question. The moment you start defending, the conversation closes.”

The other rules: Don’t ask “why do you subscribe?”: people will rationalise. Ask about the timeline: “Walk me through the moment you decided to sign up.” Don’t ask “what features would you like?”: people will invent features they’d never use. Ask about struggles: “Tell me about the last time you were frustrated with dinner.” Listen for the switch: the moment someone moved from their old solution to Greenbox.

Maya records each interview on her phone. They use an LLM to transcribe the recordings and identify recurring themes across transcripts, faster than a human reader because it can hold five long conversations in context simultaneously. But Maya reads every transcript herself.

The interviews are harder than anyone expected. The first two feel stilted. Maya keeps asking leading questions. By the third interview, things go properly sideways.

His name is Greg. He cancelled six weeks ago. He arrives at the cafe ten minutes late, already irritated.

“Walk me through the moment you decided to sign up.”

“I’ll tell you what was happening. My wife found you on Instagram and signed us up without asking me. Then I was the one dealing with the box every week.”

“And what was the experience like?”

“Terrible. You sent me beetroot three weeks running. Three weeks. I told your support team after the second time. The third week I opened the box and there it was again. Purple. Staring at me.”

Maya feels heat rise in her neck. “We track all dietary preferences and…”

“No you don’t.” Greg puts down his coffee. “Or if you do, your system is broken. I sent two emails. Nobody responded to the second one.”

“I’m sorry about that. We’ve improved our…”

“I’m not here for an apology. You asked to talk. I’m talking. You want to know why I left? I spent more money on your box than I would have at Hartland Group and I got ingredients I didn’t want that nobody helped me cook. I switched to Freshly. Seven dollars cheaper and the delivery tracking is better.”

Maya blinks. “Freshly?”

“Yeah. The Sydney mob. They launched in Perth last month. The produce isn’t as good but at least I know what I’m getting.”

Maya writes down “Freshly” on her notepad and underlines it twice.

“Look, I could tell you cared. The little notes about which farm the carrots came from, that was nice. But nice doesn’t matter when I’m standing in my kitchen at six o’clock with a kohlrabi and no bloody idea what to do with it.”

The interview ends after twenty minutes. Greg shakes her hand and leaves. Maya sits at the cafe table, staring at her notepad. Lee, who’d been observing from the next table, walks over.

“That was rough.”

“He was rude.”

“He was honest. And you broke the first rule; you got defensive. The moment he said the system was broken, you stopped listening and started defending.”

“Because what he said wasn’t true. We do track preferences.”

“Do you track his preference? Did anyone action his emails?”

Maya opens her laptop and searches the support inbox. Greg’s first email: Sam had responded with a template. The second email, four days later, has no reply. It sits unread between forty other messages.

“We missed it,” Maya says quietly.

“That was the most useful twenty minutes of the whole batch. Greg gave you a system failure, a competitor name, and the clearest articulation of the core problem anyone’s said yet. ‘Standing in my kitchen at six o’clock with a kohlrabi and no idea what to do with it.’ That’s your answer. And you almost missed it because you were defending instead of listening.”

Maya nods slowly. She writes down Greg’s kohlrabi line and circles it.

That evening, she goes home and searches for Freshly. A polished website. A slick app with real-time delivery tracking. $18 per week. An Instagram with sixty thousand followers. A twelve-million-dollar Series A.

Nadia comes in from a late physio session and finds Maya at the kitchen table, laptop open to Freshly’s website, a glass of wine untouched.

“What’s that?”

“Competition. Well-funded competition.”

Nadia looks at the screen. “Their boxes look nice.”

“They’re not local. They buy wholesale from the markets.”

“Does that matter?”

Maya doesn’t answer. At midnight, Nadia finds her in the kitchen, reorganising the cupboards. Tins arranged by expiry date. Spices alphabetised. The jars of preserved lemons that Maya’s mother sent from Margaret River lined up like soldiers.

Nadia leans against the doorframe. “You’re doing the cupboard thing.”

“I’m fine.”

“You’re alphabetising cumin at midnight. You’re not fine.”

Maya puts down the jar. “The customers don’t care about local sourcing, Nadia. We interviewed fifteen people. Three of them mentioned local as the main reason they subscribe. I built the whole brand around it. The fifty-kilometre promise, the farm stories, all of it. They don’t care.”

“They care about something, though?”

“Convenience. They care about not having to think about dinner. That’s it. That’s the product.”

“Is that a bad thing?”

“It’s a different thing. It’s a completely different business than the one I thought I was building.”

Nadia sits down. “You built the brand around what matters to you. Now you’re finding out what matters to them. Those can both be true.”

Maya looks at the preserved lemons. Her mother made them last summer, in the kitchen of the small house in Margaret River. The recipe is her grandmother’s, from Taiwan. Three generations of women preserving food with their hands.

“I know,” Maya says. “I just need a minute.”

She calls her mum the next morning, before the coastal run.

“Mum, did it bother Dad that people didn’t care about where their food came from? When you were farming?”

Her mother laughs. “Your father didn’t farm because people cared about farming. He farmed because people needed to eat. The caring was his. The eating was theirs.”

Maya stands at the kitchen window watching the sky lighten over Fremantle. Her mother’s words land somewhere deep.

By the fourth interview, Maya finds her rhythm. She learns to sit with silence, the pauses where the interviewee is actually thinking. Those pauses produce the most honest answers.

One churned subscriber, a man named Patrick, gives them a fifteen-minute story about his Tuesday evenings that becomes the team’s touchstone. He describes getting home at six, opening the Greenbox, seeing ingredients he doesn’t recognise, googling recipes while his kids argue about homework, giving up, ordering pizza, and then feeling guilty about the $25 box of vegetables wilting on the counter. “I was paying twenty-five dollars a week to feel bad about myself.” That sentence ends up on a sticky note in the office.

What the interviews reveal

Three days later, the team has fifteen transcripts and a wall of quotes. The room goes quiet.

Active subscribers barely mention vegetables. They mention relief.

“I don’t have to think about what to cook on Tuesday. The box arrives and dinner is decided.”

“It’s one less thing to worry about. I get home, I open the box, and I know what we’re eating.”

One active subscriber is Mrs Patterson, the same Mrs Patterson whose beetroot aversion Maya has been carrying in her head since the Example Mapping sessions. She’s 63, lives alone on Stirling Highway, subscribed since the second week of the pilot.

“I just open the box and trust what’s inside,” she says. “Except when there’s beetroot.” She smiles. “I don’t even know what’s in the box most weeks. I just know I don’t have to think about it.”

Jas is sitting in on this interview. She’s in the corner with her Moleskine open. When Mrs Patterson says “dinner is decided,” Jas sketches a quick napkin-style drawing: a box opening, a recipe card visible on top, and underneath the words dinner decided. She underlines it. Then she underlines it again.

The job isn’t “get fresh local produce”; it’s “eliminate the mental load of deciding what to cook.” The produce is the mechanism; the stress relief is the product.

Churned subscribers tell a starkly different story.

“The vegetables were great but I’d open the box and have no idea what to do with half of it.”

“It actually added stress instead of removing it. I had all these beautiful vegetables and the guilt of not knowing how to use them before they went off.”

The box didn’t do the job. The mental load wasn’t reduced; it was relocated. “What should I buy?” became “What on earth do I do with this?”

Two of the five churned subscribers mentioned Freshly by name. Greg wasn’t the only defection. Louise said: “I tried that Freshly thing. It’s not as nice, but it’s easier.” Easier, not better. Easier.

People who considered but didn’t subscribe rejected the uncertainty, not the product.

“I looked at the website and I couldn’t tell what I’d actually get.”

“I was interested but my partner was sceptical. I couldn’t explain what we’d be getting.”

One non-subscriber, Clare, put it perfectly: “I’m already drowning in decisions. I didn’t want to add another one. If I’d known exactly what was coming and what I could cook with it, I probably would have signed up.” She was describing the same job from the outside looking in. The marketing communicated the mechanism (“local produce”) without the outcome (“dinner, sorted”).

The insight that changes everything

Maya stares at the quotes on the wall. Priya says it first.

“We’ve been marketing this as ‘fresh local vegetables.’ But that’s not why people stay. They stay because we solve Tuesday night. And they leave because we don’t solve Tuesday night; we just make it a different kind of hard.”

Tom leans forward. “So the next feature isn’t better substitution or more variety. It’s…”

“Recipe cards,” Jas says. She pulls out the Moleskine and opens it to the napkin sketch. “Simple, fast recipes that use exactly what’s in this week’s box. Open the box, pick a card, cook dinner. No thinking required.”

The room is energised in a way it hasn’t been for weeks. Not because recipe cards are exciting technology; they’re printed cards in a cardboard box. But they directly serve the job.

Priya pushes further. “Without them, we’re delivering ingredients. With them, we’re delivering dinner.”

“Patrick’s kohlrabi problem,” Sam says.

“Exactly. He didn’t need better kohlrabi. He needed someone to tell him what to do with it in twenty minutes.”

Maya adds a constraint: “Every recipe has to be doable by someone who considers themselves a bad cook. If Patrick can make it, anyone can.”

The team isn’t designing a feature; they’re designing around a specific human being they’ve actually talked to. Patrick isn’t a persona on a slide deck; he’s a real person who told them about feeling guilty on a Tuesday evening.

Tom is quiet for a moment. “I was about to spend three weeks improving the substitution algorithm. But it doesn’t serve the job. A better substitution algorithm doesn’t reduce anyone’s dinner stress.”

Maya asks the LLM to help draft the first set of recipe cards. She pastes in this week’s box contents and asks for three simple recipes, each under thirty minutes, using only box contents plus basic pantry staples. The LLM produces them in seconds. Jas designs a card layout. Sam sends them to the printer.

Tom builds a prototype that afternoon: a simple script that takes the week’s box contents, sends them to an LLM with recipe constraints, and produces three formatted recipes. The whole pipeline, from box contents to print-ready cards, takes less than ten minutes per week. Without the LLM, it would require a food writer. With it, Maya reviews and approves the output in fifteen minutes.

When to use Jobs to Be Done

When churn is high and you don’t know why. Exit surveys give surface reasons. JTBD interviews give the real reason: the job wasn’t being done.
When you’re about to invest in a new feature. Does it serve the job customers are hiring you for? If not, you might be building the wrong thing.
When acquisition is hard and you don’t know your message. “Fresh local vegetables” is a product description; “Stop stressing about Tuesday dinner” is a job statement. One converts better.

When not to use it

When the problem is operational, not motivational. If subscribers leave because deliveries arrive late, fix logistics. JTBD is for understanding why customers hire and fire your product.
When you already know the job. If the team has a clear, validated understanding of why customers buy, running more interviews is discovery theatre.

Back to Greenbox

Two weeks after the recipe cards ship, churn drops from 8% to 5.5%. Three of the five churned subscribers re-subscribe after Sam emails them. Patrick, the man with the kohlrabi guilt, signs back up the same day. Greg does not. Louise does not. Maya checked.

She also checked Freshly’s website again. They’ve added a Perth delivery zone. Launch date: next month. Twelve million dollars, a slick app, and $18 per week. Maya’s boxes cost $25 and come with a recipe card printed on a twelve-cent piece of cardboard.

The recipe cards are working. The churn is dropping. The direction is correct.

She doesn’t tell anyone that she spent twenty minutes on Freshly’s sign-up flow that evening, getting as far as the payment page, just to see what the experience felt like. It was smooth. It was fast. It was everything Greenbox’s sign-up flow isn’t. She closed the tab and went for a run on the coastal track, even though it was dark and Nadia told her the path wasn’t lit.

The path was fine. The run helped. The knot in her chest loosened by half a turn.

Next week, the team takes that insight and asks an uncomfortable question: what else do we believe about this business that we haven’t actually validated? That’s Assumption Mapping, and the answer is more than anyone wants to admit.

Choosing an S3 Storage Class for Cold Archives

2026-04-27T06:00:00+08:00

The situation

A financial services company holds 50 TB of regulatory archive data in S3. KYC documents, transaction histories, communications retained under FCA / SEC obligations. The retention period is seven years by regulation. The data is accessed twice a year when external auditors request a sample; the company is given at least 24 hours of notice before each retrieval.

Today it all lives on S3 Standard. That’s a bill in the neighbourhood of $13,800 a year for storage alone, on data that is touched for two days out of every 365 and required to exist for the other 363.

The team’s ask is straightforward: the lowest possible per-GB storage cost over seven years; retrieval feasible inside the 24-hour audit window; standard multi-AZ durability (single-AZ classes are non-starters for regulated data); and minimum operational overhead, meaning no per-object monitoring services and no custom tiering logic. Set it once and let lifecycle rules do the work.

What actually matters

Before touring the storage-class menu, look at what the shape of this workload actually rewards and punishes.

The dominant factor over seven years is $/GB/month. The data lives for eighty-four months and is read on two days in each year. Every fraction of a cent on storage compounds through the entire retention window, while retrieval-side costs scale with reads, a figure measured in single events per year. Any class whose pricing shape is “storage is cheap, retrieval is expensive” is structurally well-matched; any class that keeps the per-GB number high to guarantee instant access is paying for latency the workload will never exploit.

The second factor is retrieval latency and, more specifically, what “fast enough” means. Twenty-four hours is an eternity in software terms. Most S3 classes can return an object in milliseconds; paying the premium for that speed is only sensible if the workload exercises it. Here it doesn’t, so we can move all the way down the cost curve until we hit a class whose retrieval SLA threatens the 24-hour window.

The third factor is durability and availability. Durability, the probability that AWS does not lose our data, is eleven nines on every S3 class including single-AZ ones. Availability is where single-AZ classes are dangerous: one AZ outage during an audit is not a story that plays well with a regulator. The scenario explicitly rules out single-AZ, and it’s the correct call for regulated data even when it’s allowed.

The fourth factor is operational model. Some classes watch objects and re-tier them automatically, with a per-object watchdog fee for the discovery. For a predictable, write-once, read-on-schedule archive, paying a service to discover something we already know is a negative-value feature. The class should be chosen explicitly, not discovered by telemetry.

The fifth factor is the bill-shock traps that don’t show up on the pricing page. The cold-archive classes charge a minimum chargeable object size (plus a metadata sliver stored at warm-tier rates). For an archive of millions of small files that effect is visible on the invoice as billed-GB far higher than stored-GB. The coldest tiers also have long minimum storage durations and limited or no fast-retrieval options: delete-early and need-it-in-three-hours are both expensive surprises. A realistic cost picture has to price these in, not just the headline $/GB number.

And the sixth factor is a softer one: predictability of the pattern itself. This isn’t a workload where the access shape is going to drift quarter-to-quarter. Regulators will keep asking for the same shape of sample; the retention clock ticks predictably; deletions happen at year seven not year three. The more confident we are that the pattern won’t change, the safer it is to pick the cheapest class that fits, because the expensive classes aren’t buying us flexibility we need, they’re buying latency we don’t.

What we’ll filter on

Distilling the exploration into filters:

Lowest $/GB/month: storage dominates the seven-year bill.
Retrieval ≤ 24 hours: anything faster is overkill; anything slower violates the audit SLA.
Multi-AZ durability: single-AZ classes are off the table for regulated data.
No per-object monitoring fee: predictable access patterns don’t benefit from a watchdog.
Survives the small-object trap: if the archive contains millions of tiny files, a minimum chargeable object size can multiply the billed storage by a double-digit factor.

The S3 storage-class landscape

S3 ships eight storage classes. Each optimises a different point on the cost / latency / durability curve.

S3 Standard. ~$0.023/GB/month. Immediate access, no retrieval fees, no minimums. The correct home for active workloads. For 50 TB over seven years, about $96,600 in storage alone, most of which buys us nothing.
S3 Intelligent-Tiering. Frequent Access matches Standard; 30 days of no access moves objects to Infrequent Access pricing; 90 days to Archive Instant Access; opt-in tiers extend further. No retrieval fees between automatic tiers. The catch: a monitoring fee of ~$0.0025 per 1,000 objects per month. For a predictable archive, we’re paying the watchdog for information we already have.
S3 Standard-IA. ~$0.0125/GB/month. 30-day minimum storage. 128 KB minimum chargeable size. Per-GB retrieval fee (~$0.01/GB). Built for “infrequent but not cold” data read monthly or so, and roughly seven times the per-GB price of the deepest Glacier tier.
S3 One Zone-IA. ~$0.01/GB/month. Same minimums as Standard-IA but stored in a single AZ. 20% cheaper than Standard-IA. Non-starter for regulated data.
S3 Glacier Instant Retrieval. ~$0.004/GB/month. Millisecond retrieval, 90-day minimum, 128 KB floor, per-GB retrieval fee. Designed for archive data that must be returned immediately when asked: quarterly dashboards, medical images, catalogues that surface in a UI. The 24-hour window makes “millisecond” an overpay.
S3 Glacier Flexible Retrieval. ~$0.0036/GB/month. 90-day minimum. Three retrieval tiers: Expedited (1-5 min, premium fee), Standard (3-5 hours, per-GB fee), and Bulk (5-12 hours, free). Within budget, but Deep Archive is cheaper still and the 24-hour window doesn’t reward the flexibility.
S3 Glacier Deep Archive. ~$0.00179/GB/month. 180-day minimum. Two retrieval tiers (no Expedited): Standard (12 hours) and Bulk (48 hours, free). Purpose-built for data we’ll touch once or twice a year or never.
S3 Express One Zone. ~$0.16/GB/month. Sub-millisecond latency, single-AZ, optimised for compute-adjacent high-IOPS workloads. The opposite of an archive class.

Side by side

Storage class	Lowest $/GB	Retrieval ≤ 24 h	Multi-AZ	No per-object monitoring
S3 Standard	✗	✓	✓	✓
Intelligent-Tiering	partial	✓	✓	✗
Standard-IA	✗	✓	✓	✓
One Zone-IA	✗	✓	✗	✓
Glacier Instant Retrieval	✗	✓	✓	✓
Glacier Flexible Retrieval	partial	✓	✓	✓
Glacier Deep Archive	✓	✓	✓	✓
Express One Zone	✗	✓	✗	✓

Matching the workload to the class

Slide down the cost axis as far as the retrieval SLA allows. Deep Archive is the cheapest class on the list, and with the Standard retrieval tier selected its worst case (12 h) still fits inside 24 h, so the slide stops there.
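The selection can be written down as a small filter-then-minimise step. This sketch encodes the approximate prices and retrieval windows from the landscape list above; the StorageClass record and pick_class function are illustrative names, instant-access classes are given a worst case of 0 hours, and Intelligent-Tiering is priced at its Frequent Access tier as a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageClass:
    name: str
    usd_per_gb_month: float
    worst_case_retrieval_hours: float  # using the paid Standard tier where one exists
    multi_az: bool
    monitoring_fee: bool


CLASSES = [
    StorageClass("S3 Standard", 0.023, 0, True, False),
    StorageClass("S3 Intelligent-Tiering", 0.023, 0, True, True),
    StorageClass("S3 Standard-IA", 0.0125, 0, True, False),
    StorageClass("S3 One Zone-IA", 0.01, 0, False, False),
    StorageClass("S3 Glacier Instant Retrieval", 0.004, 0, True, False),
    StorageClass("S3 Glacier Flexible Retrieval", 0.0036, 5, True, False),
    StorageClass("S3 Glacier Deep Archive", 0.00179, 12, True, False),
    StorageClass("S3 Express One Zone", 0.16, 0, False, False),
]


def pick_class(classes, max_retrieval_hours=24):
    """Return the cheapest class that passes the hard filters."""
    eligible = [
        c for c in classes
        if c.multi_az
        and not c.monitoring_fee
        and c.worst_case_retrieval_hours <= max_retrieval_hours
    ]
    return min(eligible, key=lambda c: c.usd_per_gb_month)


if __name__ == "__main__":
    print(pick_class(CLASSES).name)
```

Tightening max_retrieval_hours shows how the answer moves back up the cost curve: a 6-hour window would select Glacier Flexible Retrieval instead.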

Glacier Deep Archive, in depth

What makes Deep Archive cheap is what makes it slow. Objects aren’t on warm disk; they’re on tiered media optimised for sequential access, and a retrieval is a job AWS schedules against tape-class hardware. The 12-hour Standard retrieval window is the SLA AWS commits to; in practice many retrievals come back faster, but that’s not something to depend on.

The retrieval tiers. Standard costs roughly $0.02/GB and completes within 12 hours. Bulk is free and completes within 48 hours. There is no Expedited tier for Deep Archive; that’s exclusive to Flexible Retrieval. Anything faster than 12 hours means picking a different class.

The restore shape. RestoreObject doesn’t move the Deep Archive object permanently. It creates a temporary copy at S3 Standard rates for a number of days we specify (1-365). During that window the object is readable like any Standard object; when the window closes, the copy disappears and the Deep Archive object remains. The bill includes both the base storage and the temporary copy for the restore duration.
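A restore request for one archived object might look like the following sketch. It assumes a client that behaves like boto3's S3 client and its restore_object operation; the function name request_restore and the 14-day default are illustrative choices, and a real audit would loop this over the requested sample of keys.

```python
def request_restore(s3_client, bucket, key, days=14, tier="Standard"):
    """Ask S3 to stage a temporary readable copy of a Deep Archive object.

    s3_client is assumed to behave like boto3.client("s3").
    days is how long the temporary copy stays readable.
    """
    if tier not in ("Standard", "Bulk"):
        # Deep Archive has no Expedited tier, as described above.
        raise ValueError("Deep Archive restores support only Standard and Bulk")
    return s3_client.restore_object(
        Bucket=bucket,
        Key=key,
        RestoreRequest={
            "Days": days,
            "GlacierJobParameters": {"Tier": tier},
        },
    )
```

The call only schedules the job. The object becomes readable once the restore completes, so the audit runbook needs to start it well inside the notice period.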

The 180-day minimum. Deleting before 180 days incurs a charge equal to the remaining days at the Deep Archive rate. For seven-year data this is irrelevant. For a workload that might delete at 60 days, Deep Archive is a trap.

The 128 KB minimum chargeable size, the most common trap. Glacier classes charge 128 KB per object as a floor, plus ~40 KB of metadata stored at S3 Standard rates so the object stays searchable. For an archive of millions of small files (50 million 5 KB notification emails, say), the 128 KB floor inflates the billable storage from 250 GB to 6.4 TB. The mitigation is to pack small objects into larger archives (zip, tar, or a custom format) before upload. A million 5 KB emails packed into 100 MB archives are billed at roughly their real size, which cuts the chargeable size by a factor of about 25 (128 KB billed per email versus the 5 KB actually stored).
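The arithmetic behind the 250 GB versus 6.4 TB figures can be checked with a few lines. This is a minimal sketch that models only the 128 KB floor: it uses decimal units to match the numbers in the text and ignores the per-object metadata overhead.

```python
KB = 1_000            # decimal units, matching the figures in the text
GB = 1_000_000_000

MIN_BILLABLE_BYTES = 128 * KB  # the per-object floor described above


def billable_bytes(object_size: int, object_count: int) -> int:
    """Bytes billed when every object is charged at least the floor."""
    return max(object_size, MIN_BILLABLE_BYTES) * object_count


emails = 50_000_000
email_size = 5 * KB

actual = email_size * emails                  # bytes really stored
unpacked = billable_bytes(email_size, emails)  # one object per email

# Pack the same bytes into 100 MB archives (ceiling division).
archive_size = 100 * 1_000 * KB
archives = -(-actual // archive_size)
packed = billable_bytes(archive_size, archives)

print(f"actual:   {actual / GB:,.0f} GB")
print(f"unpacked: {unpacked / GB:,.0f} GB billed")
print(f"packed:   {packed / GB:,.0f} GB billed in {archives:,} archives")
```

Packing trades billing overhead for retrieval granularity: restoring one email means restoring the archive that contains it, so the archive size is a design choice rather than a free win.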

A worked example: one year of bill shape

50 TB = 50,000 GB. Assume objects are packed large enough that the 128 KB minimum doesn’t dominate.

Storage at Deep Archive
  50,000 GB × $0.00179 × 12                        =  $1,074

Retrieval (twice a year, 50 TB returned per audit)
  Standard retrieval: 50,000 × $0.02 × 2           =  $2,000

Restore staging (14-day window each audit)
  50,000 × $0.023 × (14/30) × 2                    =  $1,073

Total annual                                          $4,147

Comparable S3 Standard bill:
  50,000 × $0.023 × 12                             =  $13,800
Saving                                                ~70%

Over seven years that’s roughly a $67,500 difference, paid for by tolerating a 12-hour retrieval window that the 24-hour audit SLA covers with twelve hours of headroom. If the auditor would accept a 48-hour window, Bulk retrieval drops retrieval cost to zero and the annual bill lands near $2,150, but the scenario’s 24-hour notice doesn’t tolerate Bulk’s worst case.
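The bill shape above is easy to re-run with different assumptions. The sketch below reproduces the worked example using the per-GB rates quoted in the text; the function and field names are illustrative, and real prices vary by region and over time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    # Rates as quoted in the worked example; treat them as assumptions.
    deep_archive_gb_month: float = 0.00179
    standard_gb_month: float = 0.023
    standard_retrieval_gb: float = 0.02


def deep_archive_annual(gb, audits=2, restore_days=14,
                        retrieval_rate=None, rates=Rates()):
    """Annual cost: base storage + retrieval fees + temporary restore copies."""
    if retrieval_rate is None:
        retrieval_rate = rates.standard_retrieval_gb
    storage = gb * rates.deep_archive_gb_month * 12
    retrieval = gb * retrieval_rate * audits
    staging = gb * rates.standard_gb_month * (restore_days / 30) * audits
    return {
        "storage": storage,
        "retrieval": retrieval,
        "staging": staging,
        "total": storage + retrieval + staging,
    }


def standard_annual(gb, rates=Rates()):
    """Annual cost of leaving the same data in S3 Standard."""
    return gb * rates.standard_gb_month * 12


bill = deep_archive_annual(50_000)
baseline = standard_annual(50_000)
saving = 1 - bill["total"] / baseline

print(f"Deep Archive: ${bill['total']:,.0f}  Standard: ${baseline:,.0f}  "
      f"saving: {saving:.0%}")
```

Passing `retrieval_rate=0.0` models the Bulk-retrieval variant mentioned above, and changing `restore_days` shows how quickly a long staging window eats into the saving.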

Getting data into Deep Archive

Three routes in, with different cost shapes.

Direct PUT to Deep Archive. Set x-amz-storage-class: DEEP_ARCHIVE on the upload request. The object lands directly in Deep Archive, no Standard hop, no transition fee. Requires the uploader to know about the storage class. Best when the data is born archival.
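In SDK terms the header usually surfaces as a parameter on the upload call. A minimal sketch, assuming `client` is a boto3 S3 client (where the header maps to the `StorageClass` argument of `put_object`); the wrapper name is hypothetical.

```python
def put_archival_object(client, bucket: str, key: str, body: bytes) -> dict:
    """Upload straight into Deep Archive, skipping the Standard hop.

    `client` is assumed to be a boto3 S3 client; StorageClass is the SDK
    equivalent of the x-amz-storage-class request header.
    """
    return client.put_object(
        Bucket=bucket,
        Key=key,
        Body=body,
        StorageClass="DEEP_ARCHIVE",
    )
```

Keeping the storage class inside one helper means the rest of the uploader doesn’t need to know about it, which softens the “uploader must know the class” drawback.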

Lifecycle transition from Standard. Upload to Standard, then a lifecycle rule moves objects to Deep Archive after a configured number of days. Transition costs roughly $0.05 per 1,000 objects. For 50 million objects that’s $2,500 one-off: real money, but amortised.

LifecycleConfiguration:
  Rules:
    - Id: ArchiveImmediately
      Status: Enabled
      Filter:
        Prefix: kyc/
      Transitions:
        - Days: 0
          StorageClass: DEEP_ARCHIVE

A Days: 0 transition fires at the next lifecycle evaluation cycle (once per day), so it’s not instant; there’s typically a one-day window where the object sits in Standard first. For workloads where that day matters, prefer direct PUT.
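Because the transition fee is charged per object rather than per GB, object count is what drives it. A small sketch using the rate quoted above; the 2,500-archive figure is a hypothetical packed layout, not something from the original scenario.

```python
def transition_fee(object_count: int, fee_per_thousand: float = 0.05) -> float:
    """One-off lifecycle transition cost, charged per 1,000 objects moved."""
    return object_count / 1_000 * fee_per_thousand


unpacked_fee = transition_fee(50_000_000)  # one object per small file
packed_fee = transition_fee(2_500)         # same data in large archives

print(f"unpacked: ${unpacked_fee:,.2f}  packed: ${packed_fee:,.3f}")
```

Packing small files before upload therefore helps twice: it shrinks the billable storage and all but removes the transition fee.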

Intelligent-Tiering with Deep Archive Access opt-in. Let S3 discover the access pattern; untouched objects eventually reach Deep Archive Access after 180+ days of inactivity. The price is a per-object monitoring fee. For a workload where the pattern is already known, it’s overkill.

What’s worth remembering

Eight S3 storage classes exist, each optimising a different point on cost / latency / durability / AZ-count. For a seven-year compliance archive touched twice a year, Deep Archive is the one designed for the shape.
The 128 KB minimum chargeable object size on IA and Glacier classes is the most common bill-shock trap. Pack small files into larger archives before upload.
Glacier Flexible Retrieval has Expedited (1-5 min); Deep Archive doesn’t. If anything faster than 12 hours is required, pick Flexible, not Deep.
Minimum storage durations matter. 30 days on IA, 90 on Glacier Instant / Flexible, 180 on Deep Archive. Early deletion charges the remaining days.
RestoreObject creates a temporary S3 Standard copy for the duration specified. The underlying Glacier object is unchanged; the bill includes both for the restore window.
Intelligent-Tiering’s monitoring fee is per object, not per GB. Predictable-access archives don’t benefit from a watchdog, and billions of small objects make the watchdog expensive.
Lifecycle transitions have per-1,000-object costs that matter at scale. Direct PUT skips them.
Single-AZ classes share the 11-nines durability but drop availability. For regulated data the availability gap is disqualifying even when durability is fine on paper.

The Other Transformers

2026-04-25T06:00:00+08:00

You have a backlog of 80,000 support tickets and you need to tag each one with one of fourteen categories. Someone suggests using an LLM. You write the prompt, you wire up the API, you run the numbers, and the bill comes back at $1,400 just for the categorisation. You haven’t even started doing anything with the categories yet.

There’s a better tool for this. It’s also a transformer. It’s just not the one everyone talks about.

In To LLMs… and Beyond! we treated “transformer” as one thing: the engine behind Claude, GPT, Llama. That was useful for a tour of the field, but it elided a real distinction. The transformer architecture comes in three structural shapes, and only one of them is the autoregressive text-generator that the AI conversation has fixated on.

The other two are still in production at every serious AI shop. They’re cheaper, faster, and often more accurate for the jobs they were designed to do. This post is about when to reach for them instead.

Three shapes from one paper

The 2017 paper Attention Is All You Need introduced the transformer with a specific job in mind: machine translation. English in, French out. The architecture had two halves: an encoder that read the English sentence and produced an internal representation of its meaning, and a decoder that consumed that representation and produced French one token at a time.

Almost immediately, researchers noticed you could use the halves separately.

Encoder-only models keep just the encoder. They take text in and produce a representation: a vector, a label, a span. They never generate text. BERT (2018) is the headline example.
Decoder-only models keep just the decoder. They take text in and produce more text, one token at a time. GPT, Claude, and Llama are all this shape.
Encoder-decoder models keep both halves. They take text in, encode it, and decode something different out. T5 and BART are the headline examples.

The shape determines what the model is good at. And it determines what it costs.

Encoder-only: BERT and friends

BERT stands for Bidirectional Encoder Representations from Transformers. The “bidirectional” is the part that matters. A decoder-only model like GPT processes text left-to-right, one token at a time: when it’s predicting the next token, it can only see what came before. An encoder-only model processes the entire sequence at once, and every token can attend to every other token in both directions.

This makes encoder-only models worse at generating fluent text (in fact, they don’t generate text at all in the usual sense) but better at understanding it. When BERT looks at the word “bank” in “I sat by the bank of the river,” it can see “river” three tokens later, and that informs its representation of “bank.” A left-to-right model has to commit to a meaning before it has all the evidence.

What encoder-only models actually output is a sequence of vectors, one per input token. You can use those vectors directly (as embeddings for similarity search) or you can stick a tiny classification head on top (a single linear layer that maps a vector to a label) and get a classifier.
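To make “a single linear layer on top of the vectors” concrete, here is a toy sketch in plain Python. The token vectors and weights are made-up numbers rather than output from a real model, and mean-pooling is one assumed way of collapsing per-token vectors into one; BERT-style classifiers often use the vector of a special first token instead.

```python
import math


def mean_pool(token_vectors):
    """Collapse one-vector-per-token into a single fixed-size vector."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def linear(vector, weights, bias):
    """A single linear layer: one row of weights (and one bias) per label."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def classify(token_vectors, weights, bias):
    """Classification head: pool, project to per-label scores, take the max."""
    logits = linear(mean_pool(token_vectors), weights, bias)
    return max(range(len(logits)), key=logits.__getitem__)


def cosine(a, b):
    """Similarity between two embeddings, as used in similarity search."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))


# Toy encoder output: three tokens, two dimensions each.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Toy head: three labels over a two-dimensional vector.
weights = [[1.0, -1.0], [-1.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.0]

label = classify(tokens, weights, bias)
```

The head is tiny compared with the encoder beneath it, which is why fine-tuning for a new label set is cheap: most of the learned knowledge is already in the vectors.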

The big BERT-family models you’ll encounter:

Model	Made by	Notable for
BERT	Google, 2018	The original. Set state of the art on a dozen benchmarks overnight.
RoBERTa	Meta, 2019	BERT trained better: more data, longer training, and an improved masking strategy (dynamic rather than static). Usually beats BERT.
DeBERTa	Microsoft, 2020-2021	Disentangled attention. Strong on classification benchmarks, often the default for new projects.
DistilBERT	Hugging Face, 2019	A 40%-smaller BERT that's 60% faster and keeps 97% of the accuracy. The pragmatic choice.
ModernBERT	Answer.AI, 2024	BERT with the last six years of architectural improvements bolted on. Long context, fast inference.

These are all small. BERT-base has 110 million parameters, DistilBERT has 66 million, ModernBERT-large has 395 million. Compare that to a frontier LLM at hundreds of billions. They run on a CPU. They run on your laptop. They run on a Raspberry Pi if you don’t mind waiting.

What encoder-only models are good at

Anything where the answer is shorter than the input. Specifically:

Classification. Sentiment, intent, topic, language detection, content moderation, spam, urgency triage. One label out per input.
Multi-label classification. Tagging a document with several categories at once.
Named entity recognition (NER). Picking out people, places, organisations, dates from text. One label per token.
Span extraction. “Find the answer to this question inside this document.” The model points at the start and end positions of the span. SQuAD-style question answering.
Sentence embeddings. Producing a fixed-size vector that represents the meaning of a piece of text. The foundation of semantic search and RAG.
Pairwise classification. “Are these two sentences saying the same thing?” “Does sentence A entail sentence B?”

For all of these, an LLM will also work. It will just cost roughly a hundred times more, take roughly ten times longer, and, in many cases, be less accurate.

Why an LLM is often worse, not just more expensive

Counterintuitive but real: a fine-tuned BERT often outperforms a frontier LLM at classification tasks the BERT was specifically trained for.

The reason is task alignment. An LLM is trained to predict the next token across the entirety of internet text. A fine-tuned classifier is trained on labelled examples of exactly the task you care about: ten thousand support tickets with their correct categories, say. The LLM has read the universe and has a vague sense of what “billing” means; the classifier has stared at your specific definition of “billing” for a thousand epochs.

The LLM also has to speak its answer, which introduces failure modes the classifier doesn’t have. Will it return “billing” or “Billing” or “billing/payments” or a polite refusal because the ticket mentions a credit card? The classifier returns one of fourteen integers. Always.
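The difference shows up in the glue code. In the sketch below, the category list is a hypothetical three-item stand-in for the fourteen, and the replies are invented examples of the failure modes just described; no real model is called.

```python
# Hypothetical label set; the real task would have fourteen entries.
CATEGORIES = ["billing", "shipping", "account"]


def parse_llm_label(reply: str):
    """Map free-text LLM output onto a category index, or None if it won't map."""
    cleaned = reply.strip().lower().rstrip(".")
    return CATEGORIES.index(cleaned) if cleaned in CATEGORIES else None


def classifier_label(logits) -> int:
    """A classifier head can only ever answer with a valid index."""
    return max(range(len(logits)), key=logits.__getitem__)


replies = [
    "billing",
    " Billing.",
    "billing/payments",
    "I'm sorry, I can't help with payment card details.",
]
parsed = [parse_llm_label(r) for r in replies]
```

Every `None` in `parsed` needs a retry, a fallback, or a human, which is cost the classifier path never incurs. Structured-output features narrow the gap but don’t remove the need to validate what comes back.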

There’s an obvious counter: what if you don’t have ten thousand labelled examples? That’s a genuine constraint, and it’s where LLMs shine: zero-shot or few-shot classification with a prompt is a real superpower when you’re starting from nothing. But the moment you’ve labelled enough data to fine-tune a small encoder, the cost-quality curve usually flips.

Encoder-decoder: T5, BART, FLAN

The encoder-decoder shape is for jobs where the output is structured but isn’t a free-form essay: a transformation of the input rather than a continuation of it.

The flagship example is Google’s T5 (Text-to-Text Transfer Transformer, 2019), which framed every NLP task as text-in, text-out:

Translation: input “translate English to German: That is good.” → output “Das ist gut.”
Summarisation: input “summarize: <article>” → output “<summary>”
Classification: input “cola sentence: The course is jumping well.” → output “not acceptable”
Question answering: input “question: What is the capital of France? context: …” → output “Paris”
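The text-to-text framing means task selection is just string formatting on the input side. A minimal sketch using the prefixes from the examples above; the dictionary keys and function names are illustrative, and no model is loaded or called.

```python
# Task prefixes as they appear in the examples above.
PREFIXES = {
    "translate_en_de": "translate English to German: ",
    "summarize": "summarize: ",
    "cola": "cola sentence: ",
}


def to_text_to_text(task: str, text: str) -> str:
    """Build the single input string a T5-style model expects for a task."""
    return PREFIXES[task] + text


def qa_input(question: str, context: str) -> str:
    """Question answering packs both fields into the one input string."""
    return f"question: {question} context: {context}"


example = to_text_to_text("translate_en_de", "That is good.")
```

Because every task shares one input and output type, the same model, loss, and decoding loop serve all of them; only the prefix changes.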

The shape is well-suited to anything that has a deterministic-ish target: a translation, a summary, a structured output, a SQL query generated from a natural-language question. The encoder reads the whole input once, builds a rich representation, and the decoder produces the (usually short) output guided by that representation.

The other notable encoder-decoder family is BART (Meta, 2019), which was trained on a denoising objective (corrupt the input, recover the original) and is particularly strong at summarisation.

The instruction-tuned descendants (FLAN-T5, T5-XXL, BART-large-CNN) are still common backbones for production summarisation and translation pipelines, especially when you want to fine-tune on your own data.

What encoder-decoder models are good at

Translation. The original use case, still strong.
Summarisation. Extractive (copy spans) or abstractive (rewrite). BART-large-CNN was the production default for years.
Structured generation. Text-to-SQL, text-to-JSON, text-to-API-call. The encoder grounds the output in the input.
Grammar correction. Input: messy sentence. Output: clean sentence.
Question answering with generation. Where the answer isn’t necessarily a span in the document and needs to be paraphrased.

The boundary with decoder-only LLMs has blurred. Modern LLMs do all of the above competently, often better than older T5 models, and the simplicity of “one model for everything” has pulled a lot of work toward the decoder-only side. But for pipelines where you need something small, fast, deterministic, and fine-tuneable, T5-family models still pull their weight.

A decision table

If your task is…	Reach for…	Why not an LLM?
Tag each item with one of N categories	DeBERTa or DistilBERT, fine-tuned	100x cheaper, often more accurate, no parsing of free-text output
Find people, places, dates in text	A BERT-family NER model (e.g. spaCy's transformer)	Token-level precision, no hallucinated entities
Embed sentences for semantic search	A sentence-transformers model (BGE, E5, GTE)	LLMs don't natively produce sentence embeddings; encoder models do this as their primary job
Translate between languages at scale	A T5- or NLLB-family model, fine-tuned if needed	Per-token cost matters at translation volumes; specialised models still lead
Convert natural language to SQL or JSON	A code-fine-tuned T5, or an LLM if accuracy matters more than cost	Mixed. LLMs win on hard cases, encoder-decoders win on cost at scale
Decide if a comment is toxic	A fine-tuned encoder classifier (e.g. Detoxify)	Real-time moderation needs millisecond latency, not 800ms API round-trips
Have a free-form conversation	An LLM	Encoder models cannot generate fluent multi-turn text
Reason through a multi-step problem	An LLM, ideally a reasoning model	Encoder models have no chain-of-thought; they produce one answer in one pass

The pragmatic stack

In production AI systems, you’ll often see encoder, encoder-decoder, and decoder-only models working together rather than competing.

A typical retrieval-augmented chat application:

Bi-encoder (BERT-family) embeds the user’s query and finds the top 100 candidate documents from the vector database. Cheap, parallel, fast.
Cross-encoder (BERT-family) re-ranks those 100 down to the top 5 by reading each query-document pair carefully. We’ll cover this in the next post.
Decoder-only LLM consumes the top 5 documents alongside the query and writes a fluent answer.

Each stage uses the right tool for its job. The encoder does the cheap, high-throughput retrieval and ranking. The LLM does the expensive, low-throughput generation, but only after the encoder has narrowed the search space by three orders of magnitude.
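The first two stages can be sketched with toy stand-ins. Here the document vectors are hand-written rather than produced by a bi-encoder, and `overlap_score` is a deliberately crude placeholder for a cross-encoder; only the shape of the pipeline is the point.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))


def retrieve(query_vec, doc_vecs, k):
    """Stage 1: cheap vector similarity over every document, keep the top k."""
    ranked = sorted(doc_vecs,
                    key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]),
                    reverse=True)
    return ranked[:k]


def rerank(query, candidates, docs, score_pair, k):
    """Stage 2: score each (query, document) pair together, keep the top k."""
    ranked = sorted(candidates,
                    key=lambda doc_id: score_pair(query, docs[doc_id]),
                    reverse=True)
    return ranked[:k]


def overlap_score(query, doc):
    """Placeholder for a cross-encoder: fraction of query words in the doc."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q)


# Toy corpus with made-up two-dimensional embeddings.
docs = {
    "a": "how to pause a subscription",
    "b": "pause music playback",
    "c": "farm delivery schedule",
}
doc_vecs = {"a": [0.9, 0.1], "b": [0.8, 0.3], "c": [0.1, 0.9]}

query, query_vec = "pause my subscription", [1.0, 0.0]
candidates = retrieve(query_vec, doc_vecs, k=2)
top = rerank(query, candidates, docs, overlap_score, k=1)
```

The asymmetry is the design choice: stage one touches every document but only compares precomputed vectors, while stage two reads text pairs and so runs only on the shortlist. The LLM stage, omitted here, would then see just the documents in `top`.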

This is the pattern that matters. It’s not “LLM vs BERT.” It’s “use BERT to make the LLM step efficient enough to be worth doing.”

Where to find them

Hugging Face is the de facto registry. bert-base-uncased, roberta-large, microsoft/deberta-v3-large, distilbert-base-uncased, answerdotai/ModernBERT-large, t5-base, facebook/bart-large-cnn, google/flan-t5-xl: all available, all free to download.
sentence-transformers is the library for using BERT-family models as embedding models. all-MiniLM-L6-v2 is the gateway drug: 22 million parameters, runs on a phone, and the correct starting point for 80% of semantic-search projects.
spaCy wraps fine-tuned encoder models for NER, POS tagging, and similar pipelines, with an API designed for production use rather than research.
Cohere, OpenAI, Voyage sell hosted embedding APIs if you want the model without the operations.

The word “transformer” hides three quite different machines under one name. The decoder-only shape is what everyone means when they say LLM, and it’s the one that has to speak its answer aloud, one token at a time. That mouth is what makes it generative, and it’s also what makes the bill arrive. The encoder-only shape never opens its mouth: it reads, it understands, it points at a label or a span or hands back a vector. The encoder-decoder shape sits in between, reading once and producing a short, structured response.

If your job has a stable target (one of fourteen categories, a span in a document, an embedding for retrieval, a SQL query), there’s almost always a smaller, older, cheaper model that does it better than a frontier LLM, especially once you have labelled data to fine-tune on. The serious AI shops know this. Their production stacks don’t pick between transformer shapes; they chain them. The encoder narrows the search space by three orders of magnitude so the decoder’s expensive generation step is worth paying for. “Should I use an LLM?” is the wrong framing; the useful framing is where in the pipeline an LLM actually earns its cost.

The Workshop: User Story Mapping

2026-04-24T06:00:00+08:00

A flat backlog hides the journey. User Story Mapping unrolls it across a wall and slices it into a thinnest-honest first release. Worked example: Seeing the Whole.

User Story Mapping

User Story Mapping lays out the full user experience as a left-to-right narrative, then slices it horizontally into releases, so the team can see the whole journey and commit to the thinnest honest version of it first. Often just called story mapping. Sometimes confused with customer journey mapping (journey mapping is research-led and emotional; story mapping is build-led and functional) and with flat backlogs (a backlog is a list; a story map is a grid). Invented and named by Jeff Patton in 2005, published as a book in 2014 that is still the canonical reference.

At a glance

Who, for how long: a facilitator, the product owner, two or three developers, a designer, and someone who talks to real users (support, sales, ops). Five to eight people, two to three hours.
What you walk out with: a backbone of six to twelve activities with task columns beneath, and at least one release line marking the walking skeleton, an end-to-end-but-thin first slice where every activity has at least one task above the line.
When to reach for it: a new product or major feature area where the backlog has grown long, an MVP argument is going in circles, or the team has different mental models of the end-to-end experience. Not for a single well-understood story (use Example Mapping), purely technical work with no user-facing narrative, or strategy-level scope decisions (run Impact Mapping first).

The first slice is what Cockburn (and Patton, who adopted the name) call a walking skeleton: end-to-end, thin, ugly, but alive. Every activity has at least one task in release 1; the journey is whole even when the polish isn’t.

What’s It For

A team has a flat backlog of eighty-seven stories. They’ve been working through it for six weeks. Last week they shipped a beautiful payment form. This week they’re building subscription upgrade logic. Next week they’re building referrals. The product owner writes up a release announcement and notices, with a growing sense of unease, that nothing the team has built actually lets a new visitor sign up, choose a box, and receive their first delivery. The payment form doesn’t connect to anything yet. The upgrade logic assumes a subscriber who can’t yet exist. The referrals are for a product that has no users to refer. Every story delivered was a good story. The sum of the stories is not a product.

This is the flat-backlog failure mode. A list tells you what’s next; it doesn’t tell you what fits together. Priority orders stories by perceived value but loses the journey shape. Teams optimise each story locally and discover, six weeks in, that they’ve been building disconnected parts.

User Story Mapping exists to put the journey back. The wall is the shape. The vertical axis is priority; the horizontal axis is the user walking through the product end to end. A release is a horizontal line across the map, and the question the line forces is: what is the thinnest version of this journey that still works as a journey?

Reach for it when:

You’re planning a new product or a major new feature area
The backlog has grown large and nobody can see the big picture
You need to define an MVP or a first release and the arguments keep going in circles
Different team members have different mental models of what the product does end-to-end
You’ve finished Impact Mapping or Event Storming and now need to turn the insights into a buildable plan

What It’s Not For

Skip it when:

You’re mapping a single, well-understood story. Use Example Mapping instead.
You don’t have a clear user or set of users to map for. Story Mapping is anchored to a persona; without one, the wall has no shape.
The work is purely technical with no user-facing narrative. Infrastructure, internal tools, refactoring: those want a different artefact.
The scope is so broad that you’re really trying to decide strategy. Run Impact Mapping first.

Stop a session that’s already started if:

The team can’t agree which persona to map: you’re mapping two journeys, not one
Scope has quadrupled two hours in and the wall is full but incomplete
The walk-the-map narration reveals that nobody in the room actually understands the user

Stopping when the scope isn’t right is not failure. Producing a beautiful map of the wrong journey is.

Definitions & Background

The map is the source; the backlog is a flattening of release 1 for execution. When the map and the backlog disagree, the map wins and the backlog gets re-flattened. Treating the backlog as the source, and the map as a once-off artefact, is the failure mode that wrecks teams about six months in.

Patton’s release-slicing convention is now / next / later: now is the committed walking skeleton, next is the slice you’d build immediately after, later is everything you’re keeping on the wall but explicitly not committing to. The fuzziness of later is the point; it stops the team pretending later slices are plans.

The vocabulary of the wall:

Persona: pinned to the left end. The user whose journey is being mapped. One per wall.
Backbone: the row of blue activity notes across the top. Six to twelve big chunks of the user’s journey, left to right.
User tasks: yellow notes hanging vertically below each activity, ordered top (most essential) to bottom (nice to have).
Release lines: horizontal slices across the wall. Above each line is what’s committed for that release; below is later.
Walking skeleton: the first release line. End-to-end, thin, ugly, but alive.
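For teams that transcribe the wall into a tool afterwards, the vocabulary above maps onto a small grid structure. This is a rough sketch with hypothetical class and field names; the walking-skeleton rule it checks is the one stated in the text, that every activity has at least one task above the first release line.

```python
from dataclasses import dataclass, field


@dataclass
class Activity:
    name: str                                  # a blue backbone note
    tasks: list = field(default_factory=list)  # yellow notes, most essential first
    above_line: int = 0                        # tasks above the first release line


@dataclass
class StoryMap:
    persona: str    # one per wall
    backbone: list  # activities in journey order, left to right


def skeleton_gaps(story_map: StoryMap) -> list:
    """Activities with nothing above the line: the journey breaks there."""
    return [a.name for a in story_map.backbone
            if min(a.above_line, len(a.tasks)) < 1]


def release_one(story_map: StoryMap) -> list:
    """Flatten the first slice into an ordered backlog of (activity, task)."""
    return [(a.name, task)
            for a in story_map.backbone
            for task in a.tasks[:a.above_line]]


# Illustrative wall; task names other than "Enter email address" are invented.
wall = StoryMap(
    persona="Anna, first-time subscriber",
    backbone=[
        Activity("Sign up", ["Enter email address", "Choose password"], 1),
        Activity("Choose a first box", ["Pick box size"], 1),
        Activity("Refer a friend", ["Share referral link"], 0),
    ],
)
gaps = skeleton_gaps(wall)
```

A non-empty `gaps` list is the coded version of the flat-backlog failure: stories are committed, but the journey above the line doesn’t connect end to end.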

Inputs

One clear user persona written on a card and pinned to the left of the wall. If you don’t have one, spend ten minutes writing one before any other note goes up.
A rough agreed scope for this map: the full product, or one journey within it. Write the end-point on a card and stick it at the right end of the wall. That’s the boundary.
A long wall, sticky notes in at least two colours (blue for the backbone, yellow for tasks), tape for the release lines, and a room that can accommodate people standing and moving for hours.
Any existing research you can reference without putting it on the wall: user interviews, support tickets, analytics. Bring it as evidence, not as voices.

If you don’t yet know what outcomes you’re chasing, run Impact Mapping first. If you don’t yet understand the system the user is moving through, run Event Storming first. Story Mapping turns those upstream insights into a buildable plan; it doesn’t generate them.

Outputs

What lands on the wall at the end:

A backbone of six to twelve activities describing the user’s journey end-to-end.
User tasks stacked vertically under each activity, ordered by importance.
Release lines: at least one (the walking skeleton), often three (now / next / later), making the trade-offs explicit.
A defensible MVP because the journey above the first line is visibly complete.
A backlog organised by both priority (vertical) and journey stage (horizontal), so new work has a place to go.
The “what about…” questions caught on the wall rather than mid-sprint.
An artefact that works as a communication tool for stakeholders who weren’t in the room.

Photograph the wall before the notes come down: panoramic shots of the full wall with good lighting and enough resolution to read every note, plus close-up shots of each activity section so the detail is preserved even if the panorama isn’t sharp enough.

These outputs feed straight into:

Example Mapping: the release-1 tasks from the wall are the input to Example Mapping. Story Mapping gives you the list of stories; Example Mapping decides whether each one is ready to build.
Sprint Planning: once the release-1 slice exists and Example Mapping has run on the top stories, Sprint Planning turns the map into committed sprints.
Assumption Mapping: the release-1 slice is a stack of assumptions about what users need. Assumption Mapping pulls the slice apart before you commit to building it.

Who’s Needed

Five to eight people, two to three hours:

Facilitator. Keeps the map growing in the right direction, catches backbone items that are really tasks, and runs the walk-the-map ritual.
Product owner. Mandatory. They’re the narrator during the walk-the-map phase: the person who tells the user’s story out loud while everyone else listens for gaps. They also make the final release-slicing call when the team disagrees.
Developers. At least two. They’ll ground the map in what’s actually buildable and catch tasks that look small but hide weeks of infrastructure work.
Designers. They think in journeys natively. A designer will reshape the backbone halfway through the session in ways a developer or product owner wouldn’t have thought to.
People who talk to real users. Support, sales, operations, account managers. They’ll add the unhappy paths the golden-path team forgot: the subscriber whose card declined, the pause that went wrong, the box that arrived damaged.
Operations / SRE (Site Reliability Engineering, the operations-and-reliability discipline). For products where operations are part of the user experience (on-call engineers, deployment pipelines, support agents using internal tools) the user being mapped might be them, and ops is the domain expert. Don’t tuck them in as afterthoughts.

Fewer than five and you miss perspectives; more than eight and the wall becomes a crowd. If you’re forced above 10, split into two sessions with overlapping attendance and reconcile the maps afterwards.

Who to leave out:

Real end users. Their presence warps what the insiders will say. Interview users separately and bring their words into the room as evidence, not as voices.
Senior leaders who will turn the session into a requirements meeting. Story Mapping is discovery; requirements come from it, not into it.
Spectators. Anyone “just observing” is absorbing attention without contributing. Either they participate or they read the output.

Budget three hours for a first session on a new product. Budget ninety minutes for a map of a single feature area inside an existing product. Do not try to map two different personas on the same wall in the same session; split them.

How To Run It

Phase	Duration	Notes colour	Key question
Orient, persona, scope	15 min	—	“Who is this map for and how far does it go?”
Backbone	20 min	Blue	“What does the user do at the highest level?”
User tasks	30 min	Yellow	“What specific things do they do at each step?”
Walk the map	15 min	(review)	“Does this journey make sense end-to-end?”
Slice releases	30 min	Tape lines	“What’s the thinnest complete journey?”
Wrap-up	10 min	—	“What’s in release 1?”
Total	~2 hours inside a 2–3 hour block

Story Mapping is a standing-up, walking-around, wall-based ritual. Nobody sits. Notes move. The shape of the room matches the shape of the map: wide, layered, with people moving along it as the conversation moves.

The rhythm is backbone, then vertical detail, then narrate, then slice:

During the backbone phase, one person, ideally the product owner, tells the high-level story in order. Everyone else listens and places blue activity notes. It’s a single voice telling the shape of the journey; the facilitator’s job is to catch details that slip into the backbone before they belong there.
During the user tasks phase, the conversation opens up. Everyone works on multiple activities at once, writing yellow task notes and placing them vertically. People move around the wall. The facilitator circulates and catches tasks that are really implementation details or screen specs.
The walk-the-map phase is a ritual interruption. Stop adding notes. One person narrates the entire journey aloud, left to right, using only the notes on the wall. Everyone else listens for gaps. Then gaps get filled.
The slice releases phase is the most political. A piece of tape goes across the wall. Every “above the line” decision is someone committing to ship something and someone not committing to ship something else. The facilitator holds the space for that trade to happen honestly.

Phase 1: Orient, persona, scope (15 minutes)

Before any note goes up, pin the persona card to the left end of the wall and read it aloud:

“Today we’re mapping the experience of a first-time subscriber. Let’s call her Anna. She’s health-conscious, she’s busy, and she’s just heard about us from a friend. The map we build is her journey, from the moment she hears about us to…”

Then agree the end-point explicitly:

“…where does the journey end? First delivery? Three months of subscribing? Cancellation and win-back? Pick one. We’ll map that scope and call it done.”

Scope drift is the single most common Story Mapping failure. A team starts mapping “sign up to first delivery” and an hour in discovers they’re also mapping pause, substitution, and cancellation. The wall fills up and the release slice becomes impossible. Agree the end-point now, write it on a card, stick it at the right end of the wall. That’s the boundary.

What to watch for:

Two personas trying to share a wall. “But the supplier does X…” If the conversation keeps switching personas, you’re mapping two journeys. Split them into two sessions.
Scope that’s too broad. “The whole product.” That’s six maps, not one. Pick the first-time subscriber journey, or the pause journey, or the renewal journey. One at a time.
Missing persona. The team can’t quite describe who the map is for. Pause: “Let’s spend ten minutes writing a persona card before we draw anything.”

Phase 2: Backbone (20 minutes)

The backbone is the user’s journey at the highest level: six to twelve big activities from the start of the journey to the end. These go along the top of the wall, left to right.

Ask the product owner:

“Walk me through what Anna does, from the very beginning. Not in detail. Big chunks. What’s the first thing that happens?”

Write each activity on a blue note and place it. Keep the granularity high: “Discover the service,” “Sign up,” “Choose a first box,” “Receive first delivery,” “Manage ongoing subscription,” “Refer a friend.” Not “Click the sign-up button,” which is a task, not an activity.

Aim for 6–12 activities across the backbone. More than that and you’re at the wrong granularity; fewer and you’re missing stages.

What to watch for:

Starting with the system, not the user. “The system sends a welcome email.” Reframe: “What does Anna do? She opens the email and reads it. That’s the activity.”
Skipping discovery. Teams start the backbone at “Sign up,” forgetting that Anna has to hear about the service first. Prompt: “What happens before she knows we exist?”
Skipping the end. Teams end the backbone at “First delivery,” forgetting ongoing management, cancellation, win-back. Prompt: “What happens after three months? After she tries to pause? After she cancels?”
Tasks leaking upward. Someone places “Enter email address” on the backbone. That’s a yellow task under the “Sign up” activity. Gently move it down: “Great detail. Let’s put it below, under ‘Sign up,’ when we get to tasks.”
Operations backbones. For an SRE-flavoured map (say, the journey of a deployment from commit to verified rollout) the backbone activities might be “Developer pushes,” “CI runs,” “Artefact built,” “Staging deployed,” “Production rolled out,” “Rollback decision available.” Same shape, different domain.

Phase 3: User tasks (30 minutes)

For each activity on the backbone, the team writes yellow task notes describing what the user does during that step. Tasks go vertically below their parent activity, ordered roughly top (most essential) to bottom (nice to have).

Open the phase:

“For each blue note on the backbone, I want specific things the user does. Not UI details: intents. ‘Browse available boxes.’ ‘See what’s in each box this week.’ ‘Pick a delivery day.’ Write them on yellow, place them below the activity they belong to, and roughly stack them by importance, most essential at the top.”

Let the conversation flow across the wall. People will jump between activities as they think of related tasks. Let them. The facilitator circulates and catches problems.

What to watch for:

UI details dressed up as tasks. “Click the dropdown.” Not a user task; a UI interaction. The task is “Pick a delivery day.”
Missing unhappy paths. The team maps the golden path. Prompt explicitly: “What if her card is declined? What if the box she wants is sold out? What if she signs up, then immediately changes her mind?” Unhappy paths are often where the release line is hardest to draw.
One activity with twenty tasks, another with two. The dense activity probably needs splitting into two activities; the sparse one might be fine, or might be missing work.
Arguments about horizontal order. Within an activity, vertical order (priority) matters more than horizontal order. If two people disagree about what comes first horizontally, there might be two valid paths; capture both.

Phase 4: Walk the map (15 minutes)

Stop adding notes. Everyone takes three steps back from the wall.

Ask the product owner to narrate the full journey:

“Walk me through Anna’s experience. Left to right. Use only the notes on the wall. Tell her story as if I’d never heard it.”

Everyone else listens. The facilitator’s job is to catch the stumbles:

“And then she… um, signs up, and then somehow ends up with a box…” There’s a missing activity or a missing task.
“She picks a box and pays…” Wait, is the payment activity there? Or is it hiding inside “sign up”?
“…and she receives her first delivery.” What happens if she doesn’t? Where’s the failed-delivery path?

When the narrator stumbles, pause. Ask the room:

“What’s missing here?”

Fill the gap. Continue the walk. The walk-the-map phase catches more problems than any other single phase in the session. It is the quality check.

What to watch for:

Nobody challenging the narrator. The room is polite. Name people by the slice of reality they own: “From what you see in deployment tickets, does this match? From what support hears on the phones, does this match?”
Duplicate tasks. The same task appearing under two activities. Is it genuinely part of both, or is one misplaced?
The narrator skipping sections. “And then all the usual stuff happens, and…” Interrupt: “Walk me through the usual stuff.”

Phase 5: Slice releases (30 minutes)

This is the most valuable and the most political phase. Take a piece of tape or draw a horizontal line across the wall. Above the line: release 1. Below the line: later.

The rule of the slice is simple:

“Release 1 is the thinnest horizontal slice that still tells a complete story left to right. Anna can walk from the leftmost activity to the rightmost and achieve her goal. There will be fewer options, less polish, and more manual work, but the journey has to be whole.”

For each activity, the question is the same:

“What’s the absolute minimum version of this step that lets Anna get through it?”

For the “Choose a box” activity, the minimum might be: “Browse available boxes. Select a size.” Everything else (substitutions, weekly previews, family-size recommendations, gift wrap) goes below the line.

You can draw multiple lines for multiple releases. Release 1 is the MVP. Release 2 is the next thinnest slice. And so on. The point is that every release above the line is a complete journey, not a complete activity.

What to watch for:

Release 1 is “everything above the line for every activity.” The most common mistake. If release 1 is the full golden path for every step, it’s not an MVP; it’s the whole product. Push: “Can a subscriber complete this step with just one option instead of five?”
Cutting entire activities. If an activity has nothing above the line, the journey has a hole. Every activity needs at least one task in release 1, even if the task is manual or minimal.
“We can’t launch without…” Some things genuinely can’t be cut (payment processing, for instance). Some things feel essential but aren’t (substitution preferences for launch). Challenge each claim individually.
Uneven slices. One activity has eight release-1 tasks, another has one. Sometimes that’s correct (payment really does need more than preferences) but check that the dense activity isn’t hiding overbuild.
Operations-flavoured slicing. For a deployment-pipeline map, the release 1 slice might be “Manual rollback, alerts go to one channel, health checks are basic, observability is minimal”: a pipeline that works end-to-end but isn’t polished. Later releases add automation.
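The two slicing rules above (every activity needs at least one task in release 1, and the backlog is a flattening of the slice) are mechanical enough to check once the wall has been transcribed. What follows is a minimal sketch in Python, not a description of any tool. The `story_map` structure, the function names, and the sample tasks are all hypothetical.

```python
# Hypothetical transcription of the wall: each backbone activity maps to its
# tasks, listed top (most essential) to bottom, each tagged with a release.
story_map = {
    "Sign up": [("Enter email address", 1), ("Sign up with a social login", 2)],
    "Choose a first box": [
        ("Browse available boxes", 1),
        ("Select a size", 1),
        ("Set substitution preferences", 2),
    ],
    "Receive first delivery": [("Get a delivery-day notification", 2)],
}


def activities_missing_from(story_map, release):
    """Return backbone activities that have no task in the given release."""
    return [
        activity
        for activity, tasks in story_map.items()
        if not any(task_release == release for _, task_release in tasks)
    ]


def flatten_release(story_map, release):
    """Flatten one slice into backlog items that keep the activity as context."""
    return [
        f"{activity}: {task}"
        for activity, tasks in story_map.items()
        for task, task_release in tasks
        if task_release == release
    ]


holes = activities_missing_from(story_map, release=1)
print(holes)  # ['Receive first delivery']
```

A non-empty result means the release-1 journey has a hole. Here Anna can sign up and choose a box, but nothing above the line gets it to her door.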

See User Story Mapping: Seeing the Whole for the Greenbox team’s first mapping session, including the moment the walk-the-map phase reveals a gap between “pays for the subscription” and “receives first box” that nobody had noticed, and the release-slicing conversation that saves a month of scope.

What Can Go Wrong

The architect. Someone keeps mapping system architecture instead of user journey. Recovery: “What does Anna experience at this point? We’ll figure out the technical flow later.” Stop if: They can’t hold the distinction after three prompts. They belong in a design session that happens after the map.

The everything-is-essential person. Someone argues every task is critical for release 1. Recovery: Impose a constraint: “We have six weeks to launch. What can Anna live without until release 2?” Constraints force prioritisation in a way that abstract discussions don’t. Stop if: The person won’t accept any constraint. Escalate; they need a separate conversation about scope with the product owner.

The map gets too big. The wall is full and the team is still adding activities. Recovery: Scope is too broad. Pick the most important journey (e.g., first-time subscriber sign-up to first delivery) and park the rest for separate sessions. Stop if: The team can’t agree on which journey to focus on. That’s a strategy problem, not a mapping problem.

Two users on one map. The conversation keeps switching personas. Recovery: “I’m hearing both subscriber tasks and supplier tasks. Those are two maps. Let’s finish the subscriber one today and schedule the supplier session for next week.” Stop if: The team insists the personas share a journey. Check: do they really? Or are you trying to save a session?

The design session. People start sketching screens. Recovery: “Park the screens. We’re mapping what Anna needs to do, not how the screen looks. The design comes after we agree on the journey.” Stop if: Screens keep creeping back in. Pair the designer with a developer to keep each other honest.

The silent release-slicing. The team is quietly placing tasks above or below the line without any debate. Recovery: Slow it down: “Before we go further, can someone explain out loud why the substitution preferences are above the line? I want to hear the reasoning.” The point of the slice is the conversation about the trade-off. Stop if: The team still won’t engage. Something else is going on; maybe the release date is imposed and the slice is performative. Name it.

The map goes stale. The map is produced and photographed but nobody updates it. Within weeks, the backlog and the map disagree, and the team starts trusting the backlog. Recovery: Re-flatten the backlog from the map. Schedule a thirty-minute map-update at the end of every release. Stop if: The team won’t keep the map alive. Story Mapping is the wrong artefact for them; a flat backlog with explicit MVP scoping may be enough.

Next Steps

The session ends; the work begins.

Same day, the facilitator:

Takes panoramic photographs of the full wall. Good lighting, sharp focus, enough resolution to read every note.
Takes close-up shots of each activity section, so the detail is preserved even if the panorama isn’t sharp enough.
Transcribes release-1 tasks into the backlog with the activity as context, so every story knows which journey stage it belongs to.
Sends a message to all participants with the photographs and the release line clearly marked.

This week, the product owner:

Turn release-1 tasks into backlog items. Each yellow note above the line becomes a backlog item, with the activity as context. The shape of the wall becomes the shape of the sprint plan over the next several iterations.
Protect the release-1 slice. The single hardest follow-up work. Every time a stakeholder asks for “just one more thing in the first release,” check the wall. Either it moves above the line (and something else moves below) or it waits for release 2. The map is the reason you can say that and have it be defensible.
Begin Example Mapping on the top release-1 tasks. Story Mapping produces tasks; Example Mapping makes them buildable. Start on the most essential tasks first.
Walk the map to anyone who couldn’t attend. Their perspective may reveal gaps the original group missed, or validate the slice. Either outcome is valuable.
Schedule any discovery work. Tasks on the wall that are guesses (“we think Anna will want this”) become research or experiment proposals. Don’t let the guesses graduate into commitments without being tested.

Ongoing, the team:

Keeps the map visible while the work is active. Print it, photograph it, pin it near the team’s desks. Teams that can glance at the map during standups and planning make better decisions than teams working from memory.
Updates the map after each release. Move the line to the next slice. Add new tasks learned from real user feedback. Remove tasks that turned out not to matter.
When new work is proposed, places it on the map. If it doesn’t fit, either the map needs updating or the work doesn’t belong. That’s a valuable filter.
Treats the map as the source and the backlog as a flattening of release 1 for execution. When they disagree, the map wins and the backlog gets re-flattened.

Where the map feeds next:

Event Storming: Event Storming maps the system from the inside; Story Mapping maps the user experience from the outside. Running Event Storming first can reveal the hotspots; Story Mapping turns the insights into a release plan.
Impact Mapping: Impact Mapping picks the deliverables; Story Mapping arranges those deliverables into a user journey and slices them into releases. Impact first, then story.
Example Mapping: the release-1 tasks from a Story Mapping session are the input to Example Mapping. Story Mapping gives you the list of stories; Example Mapping decides whether each one is ready to build.
Sprint Planning: once the release-1 slice exists and Example Mapping has run on the top stories, Sprint Planning turns the map into committed sprints.
Assumption Mapping: the release-1 slice is a stack of assumptions about what users need. Assumption Mapping pulls the slice apart before you commit to building it.

Variants

Product Level (default). A new product or a major new feature area, two to three hours, five to eight people, one persona. Output: a backbone, vertical task columns, and at least one release line marking the walking skeleton. This is what most teams need, and the rest of this post describes it.

Feature Level. A single feature area inside an existing product, ninety minutes, four to six people. The persona and scope are tighter; the backbone is shorter (often four to six activities); the release lines often collapse to a single now/next split. Reach for it when a product map already exists and you’re zooming into one journey within it.

Multi-persona. Two or more personas whose journeys overlap. Don’t try to share a wall: run two sessions on consecutive days with overlapping attendance, then reconcile in a third short session that compares the maps and surfaces the shared activities. Trying to run multi-persona on one wall in one session is the most common reason a Story Mapping session collapses.

Operations / SRE. The user being mapped is an on-call engineer, a deploy pipeline operator, or a support agent using internal tools. The backbone is a workflow rather than a customer journey; release 1 is the manual-but-end-to-end version of the workflow; later releases add automation. Same shape, different domain.

Remote. A Miro or Mural board with the persona pinned to the left and a horizontal lane for the backbone. Slightly slower than in-person (the rhythm of standing up and moving is faster physically), but the structure transfers cleanly. Use one shared cursor: only the facilitator places notes, prompted by the team, to keep the layout legible. Walk-the-map still works: the narrator shares their screen and scrolls left to right while everyone listens.

Map-update. A thirty-minute recurring session at the end of every release. Move the line to the next slice. Add new tasks learned from real user feedback. Remove tasks that turned out not to matter. Keeps the map alive instead of letting it go stale.

Ticks or Tocks?

2026-04-23T06:00:00+08:00

In What Time Is It? we covered the human mess of the hour: sundials, railways, time zones, daylight saving, and the volunteer-maintained database that keeps your phone from lying to you. What Day Is It? did the same for the date: Gregorian switchovers, lunisolar calendars, the date line, and the year numbers that don’t agree. All of that assumes we know what a “second” actually is. But what is a second? How do you count one? And what happens when you count very, very carefully?

What are we actually counting?

A second used to be defined as 1/86,400 of a mean solar day. Simple enough: divide the day into hours, the hours into minutes, the minutes into seconds, done. The problem is that the Earth’s rotation isn’t constant. Tidal friction from the moon is gradually slowing us down. “Gradually” here means the day gets roughly 2.3 milliseconds longer per century, which sounds negligible until you’re trying to land a spacecraft or synchronise financial transactions across continents.

In 1967, the 13th General Conference on Weights and Measures decoupled the second from the Earth entirely. A second is now defined as 9,192,631,770 periods of the radiation corresponding to the transition between two hyperfine levels of the ground state of the caesium-133 atom. This is a mouthful. In plain terms: a caesium atom can exist in two very slightly different energy states (think of it like a coin that can be heads or tails), and when it flips between them, it emits radiation at one very specific frequency. Count those oscillations and you’re counting seconds. The advantage is enormous: this frequency is the same everywhere in the universe (with a caveat we’ll get to in a later post), and it’s measurable to extraordinary precision.

The trouble is that atomic seconds and solar days are now measuring different things. Atomic time marches on with metronomic precision. Solar time wobbles and drifts. They disagree, and the disagreement grows over time.

From quartz to caesium

Before atomic clocks, the most precise portable timekeepers were quartz crystal oscillators. Quartz has a neat trick: squeeze it and it generates a tiny voltage. Run a voltage through it and it vibrates. This property is called piezoelectricity, and it’s why quartz became the heart of modern timekeeping. A quartz crystal cut to the right shape and size vibrates at a very stable frequency. In a wristwatch, that frequency is typically 32,768 Hz. That’s 2 to the power of 15, chosen because it can be divided down to exactly one pulse per second using a simple binary counter circuit: fifteen halvings and you’re there.
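The fifteen halvings are easy to check. Here is a small sketch that models each counter stage as a divide-by-two. The function name is made up for illustration, and a real watch does this in hardware rather than software.

```python
CRYSTAL_HZ = 32_768  # typical wristwatch crystal frequency, 2 ** 15


def divide_down(frequency_hz, stages):
    """Pass a frequency through a chain of divide-by-two counter stages."""
    for _ in range(stages):
        frequency_hz //= 2
    return frequency_hz


print(divide_down(CRYSTAL_HZ, 15))  # 1
```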

The first quartz clock was built at Bell Telephone Laboratories in 1927 by Warren Marrison and J.W. Horton. It was roughly the size of a large refrigerator. By the late 1960s, Seiko had miniaturised the technology enough to fit it on a wrist: the Seiko Astron, released on Christmas Day 1969, was the world’s first commercially available quartz wristwatch. It cost as much as a small car. Within a decade, quartz watches were cheap enough to give away as promotional items. The Swiss watch industry, which had dominated mechanical horology for centuries, was nearly destroyed in what’s now called the Quartz Crisis. Accuracy that had once required master craftsmen and hand-finished movements was suddenly available from a factory in Japan for a few dollars.

Quartz watches were accurate to within a few seconds per month, far better than any mechanical watch. But quartz crystals aren’t perfect. Their frequency drifts with temperature, age, and mechanical stress. Engineers have pushed quartz further by controlling temperature. A temperature-compensated oscillator can hold stability to within a second or two per year. An oven-controlled version (the crystal sits in a tiny heated enclosure to keep its temperature rock-steady) does better still: a few milliseconds per day. For everyday timekeeping, even basic quartz is more than adequate. For science, navigation, and telecommunications, it matters enormously that “a few seconds per month” isn’t zero.

The leap to atomic timekeeping came from the insight that atoms are, in a sense, nature’s own frequency standards. The idea was first proposed by Isidor Rabi at Columbia University in 1945, building on his Nobel Prize-winning work on how atoms behave in magnetic fields. Every caesium-133 atom in the universe vibrates at exactly the same frequency when it transitions between two specific energy states. No manufacturing variation. No wear. No temperature drift (at least, not in the transition itself). If you can build a device that locks onto that frequency and counts the vibrations, you have a clock that’s stable to a degree that mechanical and quartz clocks can’t approach.

Why caesium specifically? Several reasons converged. Caesium has only one stable isotope (caesium-133), which eliminates ambiguity about which atom you’re measuring. Its hyperfine transition frequency, the frequency at which it flips between two energy states, falls in the microwave range at roughly 9.2 GHz, which in the 1950s was a frequency that existing electronics could already generate and measure accurately. Hydrogen has a simpler spectrum but its transition frequency is lower (1.4 GHz), giving coarser time slices. Rubidium was a strong candidate and is still used in cheaper atomic clocks, but its transition is harder to isolate cleanly because rubidium has two naturally occurring isotopes whose spectra overlap. Caesium’s combination of a single isotope, a conveniently high microwave frequency, and a strong well-separated spectral line made it the practical choice. The physics didn’t mandate caesium; it was the best available compromise between atomic properties and 1950s-era engineering.

The first working caesium beam clock, built by Louis Essen and Jack Parry at the National Physical Laboratory in Teddington, England, began operating in 1955. Within two years it had demonstrated accuracy of one second in 300 years, already orders of magnitude better than any quartz oscillator. By 1967, it was good enough that the international scientific community decided to redefine the second itself based on the caesium atom rather than the Earth’s rotation. The atom had become more reliable than the planet.

Atomic clocks and their limits

A caesium beam clock works by exposing a beam of caesium-133 atoms to microwave radiation and tuning the frequency until the maximum number of atoms change energy states. That peak frequency (9,192,631,770 Hz exactly, by definition) is what defines the second. Hydrogen maser clocks (a maser is a laser that works at microwave frequencies) use a similar principle with hydrogen atoms and are more stable over short periods, making them excellent for applications that need precise frequency over hours rather than years.

Optical lattice clocks represent the current frontier. They use atoms (often strontium or ytterbium) trapped in a lattice of laser light and interrogated with optical-frequency lasers rather than microwaves. The higher frequency means finer measurement. The best optical lattice clocks at NIST and JILA in the US, and at the University of Tokyo, have demonstrated accuracy of roughly one second in 15 billion years, longer than the age of the universe (Bloom et al., 2014, Nature). In 2024, the BIPM began formally considering redefining the second based on optical clocks.

But even they drift. Every clock, no matter how precise, has some uncertainty. The best caesium clocks, the fountain designs that succeeded the original beam clocks, drift by roughly one second in 300 million years. Optical lattice clocks are better by orders of magnitude, but “better” isn’t “perfect”. This is a fundamental consequence of quantum mechanics: measurement always has uncertainty.

To address that uncertainty, UTC is kept not by a single clock but by an ensemble: a weighted average of approximately 450 atomic clocks in more than 80 laboratories around the world. The Bureau International des Poids et Mesures (BIPM) in Paris collects data from all of them, weights each clock by its past performance and stability, and computes a combined timescale called UTC. The results are published retrospectively in a document called Circular T, which means that UTC is, strictly speaking, only known after the fact. The UTC that your phone shows you is actually an approximation, steered to match the BIPM’s post-hoc calculation as closely as possible.
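To get a feel for why weighting matters, here is a deliberately simplified sketch. It is not the BIPM’s algorithm, which is considerably more elaborate. It weights each clock by the inverse of its variance, a common textbook choice assumed here for illustration, and the clock readings are invented.

```python
def ensemble_offset(readings):
    """Weighted mean of clock offsets, weighting each clock by 1 / variance.

    `readings` is a list of (offset_ns, instability_ns) pairs: each clock's
    offset from a common reference, plus a made-up measure of how noisy that
    clock has been. Steadier clocks get more say in the result.
    """
    weights = [1.0 / (instability ** 2) for _, instability in readings]
    total = sum(weights)
    return sum(w * offset for w, (offset, _) in zip(weights, readings)) / total


# Three hypothetical clocks: two steady ones that roughly agree, one noisy outlier.
clocks = [(2.0, 1.0), (4.0, 1.0), (30.0, 10.0)]
print(round(ensemble_offset(clocks), 2))  # 3.13
```

A plain average of those three offsets would be 12 ns, dragged upwards by the noisy clock. The weighted version stays close to the two clocks that have earned the trust.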

Leap years and leap seconds

Most people know about leap years. The Earth takes approximately 365.2422 days to orbit the sun, so every four years we add a day to February to stop the calendar drifting away from the seasons. Except every 100 years we skip the leap year. Except every 400 years we don’t skip it. So 1900 wasn’t a leap year, but 2000 was. This approximation is good to about one day in 3,236 years, which is close enough that nobody currently alive needs to worry about the next correction.
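The three-part rule fits in one line of code, and counting leap years over a full 400-year cycle gives the calendar’s average year length. This is a short sketch in Python, and the function name is just for illustration.

```python
from fractions import Fraction


def is_leap_year(year):
    """Gregorian rule: every 4th year, except centuries, except every 400th."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


# Count leap years in one full 400-year cycle to get the average year length.
leap_years = sum(is_leap_year(y) for y in range(2000, 2400))
mean_year = Fraction(365 * 400 + leap_years, 400)

print(is_leap_year(1900), is_leap_year(2000))  # False True
print(leap_years, float(mean_year))            # 97 365.2425
```

The calendar year averages 365.2425 days against an orbit of approximately 365.2422, and that small gap is the slow drift described above.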

Leap seconds are a much more recent and much messier invention. Since atomic clocks and the Earth’s rotation disagree, the International Earth Rotation and Reference Systems Service (IERS) adds a leap second to UTC whenever the gap between UTC and solar time approaches 0.9 seconds: not on a fixed schedule, but when observed drift demands it. They’ve done this 27 times since 1972, always on the last day of June or December. All 27 have been positive, adding a second because the Earth is slowing down. But in recent years the Earth has unexpectedly sped up slightly, and for a while there was serious discussion about whether we’d need a negative leap second: removing a second, something that has never been done and that most software has certainly never been tested for. The prospect of 23:59:58 being followed directly by 00:00:00, skipping 23:59:59 entirely, was enough to give the timekeeping community genuine anxiety.

An extra second sounds harmless, but it drives software engineers to quiet despair. A positive leap second means that 23:59:59 is followed by 23:59:60 before 00:00:00. Most software doesn’t expect a minute to have 61 seconds. When a leap second was inserted in 2012, it crashed Reddit, Gawker, LinkedIn, Foursquare, and Yelp because of a Linux kernel bug in the way NTP interacted with the high-resolution timer system.
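You can see the mismatch in Python’s standard library, which is typical of most date libraries in this respect. The sketch below uses the leap second inserted at the end of 31 December 2016. The variable names are illustrative.

```python
from datetime import datetime, timezone

# Try to construct the leap second itself: 23:59:60 on 31 December 2016.
try:
    datetime(2016, 12, 31, 23, 59, 60, tzinfo=timezone.utc)
    accepted = True
except ValueError:
    accepted = False  # datetime only allows seconds in the range 0..59

before = datetime(2016, 12, 31, 23, 59, 59, tzinfo=timezone.utc)
after = datetime(2017, 1, 1, 0, 0, 0, tzinfo=timezone.utc)

print(accepted)                          # False
print((after - before).total_seconds())  # 1.0, although two SI seconds elapsed
```

The 61st second can’t be represented at all, and arithmetic across the boundary quietly comes out one second short.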

Google’s approach is to “smear” the leap second: they slightly slow down their clocks over a period of hours so the extra second is absorbed gradually. Amazon does something similar, though with a different smear profile. This is practical but means that during the smear window, Google’s clocks disagree with Amazon’s, and both disagree with everyone else’s, and a timestamp generated on one platform during that window doesn’t mean quite the same thing as a timestamp generated on another. If you’re processing financial transactions that cross cloud providers during a leap second smear, you’d best not think too hard about what “the same time” means.
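Here is a rough sketch of what a smear does and why two smears disagree. It assumes a linear smear and assumes both windows start at the same instant. The window lengths are picked for illustration and aren’t a statement of any provider’s actual profile.

```python
SMEAR_WINDOW_S = 24 * 60 * 60  # assumed window length for this sketch


def smeared_offset(elapsed_s, window_s=SMEAR_WINDOW_S):
    """Fraction of the leap second absorbed `elapsed_s` seconds into the window.

    A linear smear spreads the one extra second evenly across the window.
    """
    elapsed_s = min(max(elapsed_s, 0), window_s)  # clamp outside the window
    return elapsed_s / window_s


def disagreement(elapsed_s, window_a_s, window_b_s):
    """Gap in seconds between two clocks smearing over different windows."""
    return abs(
        smeared_offset(elapsed_s, window_a_s) - smeared_offset(elapsed_s, window_b_s)
    )


print(smeared_offset(12 * 3600))                     # 0.5
print(disagreement(6 * 3600, 24 * 3600, 12 * 3600))  # 0.25
```

Six hours in, a clock smearing over 24 hours and one smearing over 12 are a quarter of a second apart. In a world that timestamps trades to the microsecond, that’s a long time.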

The good news, or bad news depending on your perspective, is that in 2022 the General Conference on Weights and Measures voted to abolish leap seconds by 2035. UTC and solar time will be allowed to drift apart, with a correction planned at some larger threshold, perhaps a “leap minute” in a century or so. Astronomers who need solar time will adjust. The rest of us will stop having to worry about 61-second minutes.

Describing a moment

Given all of this, how do you actually specify an exact moment in time?

You might think a timestamp like “2026-04-28T14:30:00Z” does the job. And it does, mostly. The “Z” means UTC, which is a specific timescale maintained by a weighted average of atomic clocks around the world. But UTC includes leap seconds, which makes the relationship between any two UTC timestamps ambiguous unless you know how many leap seconds occurred between them.

This is where TAI, International Atomic Time, comes in. TAI is a pure count of standard atomic seconds (the internationally defined SI second, based on caesium) since an epoch in 1958, with no leap seconds. It’s the “true” atomic timescale. UTC is defined as TAI minus some whole number of seconds (currently 37). If you want to measure the exact duration between two events, TAI is what you want. If you want to know roughly what angle the sun is at, UTC is what you want.

Then there’s GPS time, which was aligned with UTC when it started counting in January 1980 and has never inserted a leap second since. GPS time is currently 18 seconds ahead of UTC.
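The three timescales differ only by whole-second constants, so converting between them is addition and subtraction. Here is a sketch using the figures above. The TAI-to-GPS constant of 19 seconds follows from them (37 minus 18), and the function names are made up for illustration.

```python
TAI_MINUS_UTC_S = 37  # current value; grows if another leap second is added
TAI_MINUS_GPS_S = 19  # fixed: GPS time has never inserted a leap second


def utc_to_tai(utc_s):
    """Shift a UTC reading (in seconds) onto the TAI timescale."""
    return utc_s + TAI_MINUS_UTC_S


def tai_to_gps(tai_s):
    """Shift a TAI reading (in seconds) onto the GPS timescale."""
    return tai_s - TAI_MINUS_GPS_S


gps_minus_utc = TAI_MINUS_UTC_S - TAI_MINUS_GPS_S
print(gps_minus_utc)  # 18
```

Only the first constant ever changes, which is exactly why code that hard-codes it goes stale.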

And there are others. TDB, Barycentric Dynamical Time, is used for solar system ephemerides. TCG, Geocentric Coordinate Time, ticks slightly faster than clocks on Earth’s surface because it’s defined for a clock at rest and infinitely far from the Earth’s gravitational field. Each serves a different purpose, and each disagrees with the others by small but significant amounts.

The point is that “what time is it?” is never a single question. It’s really “what time is it, in which timescale, as measured by which clock, where?”

For most software, the practical answer is: use UTC, store it as an ISO 8601 string or a Unix timestamp (seconds since midnight on 1 January 1970, UTC, not counting leap seconds), and convert to local time for display only. This works for the vast majority of applications. But if you need to compute precise durations across leap second boundaries, or compare timestamps from different systems that may have been smearing at different rates, or handle historical dates in jurisdictions that have changed their timezone rules, “just use UTC” stops being simple fast. The rabbit hole is always deeper than it looks.
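In practice the advice looks something like the sketch below. The fixed +08:00 display offset is an assumption to keep the example self-contained. Real code would look up a tz database zone so that rule changes are handled for you.

```python
from datetime import datetime, timedelta, timezone

# Store: an aware UTC datetime, as an ISO 8601 string or a Unix timestamp.
moment = datetime(2026, 4, 28, 14, 30, 0, tzinfo=timezone.utc)
iso_string = moment.isoformat().replace("+00:00", "Z")
unix_seconds = int(moment.timestamp())

# Display: convert at the edge. The fixed offset below is illustrative only.
display_zone = timezone(timedelta(hours=8))
local = datetime.fromtimestamp(unix_seconds, tz=timezone.utc).astimezone(display_zone)

print(iso_string)         # 2026-04-28T14:30:00Z
print(local.isoformat())  # 2026-04-28T22:30:00+08:00
```

Everything stored and compared is UTC. The local time exists only at the moment of display.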

NTP and time synchronisation

Having an accurate clock is only half the problem. You also need to get that accuracy to the devices that need it. This is the job of the Network Time Protocol, NTP.

NTP was designed by David Mills at the University of Delaware in 1985, and its descendants still synchronise nearly every clock on the internet. The protocol works by exchanging timestamps between a client and a server, measuring the round-trip delay, and using the result to estimate the offset between the two clocks. The clever bit is in the statistics. NTP uses filtering algorithms to reject noisy measurements and converge on the best estimate of the true time.
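The core of the exchange is four timestamps and two lines of arithmetic. The sketch below shows that calculation with invented numbers. The `best_sample` helper is a much-simplified stand-in for NTP’s real filtering, which does a good deal more than pick the lowest delay.

```python
def ntp_offset_and_delay(t1, t2, t3, t4):
    """Estimate clock offset and round-trip delay from one NTP exchange.

    t1: client sends request   (client clock)
    t2: server receives it     (server clock)
    t3: server sends reply     (server clock)
    t4: client receives reply  (client clock)
    """
    delay = (t4 - t1) - (t3 - t2)         # time on the wire, both directions
    offset = ((t2 - t1) + (t3 - t4)) / 2  # assumes both directions take equally long
    return offset, delay


def best_sample(samples):
    """Simplified stand-in for NTP's filtering: trust the lowest-delay exchange."""
    return min(samples, key=lambda sample: sample[1])


# Hypothetical exchange: the client is 0.5 s behind, each leg takes 0.02 s,
# and the server spends 0.001 s processing the request.
offset, delay = ntp_offset_and_delay(t1=100.000, t2=100.520, t3=100.521, t4=100.041)
print(round(offset, 3), round(delay, 3))  # 0.5 0.04
```

The offset estimate is only exact when the outbound and return trips take the same time. Any asymmetry shows up as error, which is why low-delay samples are the trustworthy ones.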

The system is hierarchical. Stratum 0 sources are the reference clocks themselves: caesium standards, GPS receivers, and radio stations like DCF77 in Germany or WWVB in the US that broadcast time signals. Stratum 1 servers are directly connected to a Stratum 0 source. Stratum 2 servers synchronise to Stratum 1, and so on. Your laptop or phone is typically Stratum 3 or 4, synchronised to a pool of public NTP servers.

That pool, the NTP Pool Project, is another piece of critical internet infrastructure run almost entirely by volunteers. Over 4,000 servers donated by individuals and organisations around the world, serving billions of time queries per day. When your phone synchronises its clock, it’s probably talking to a server that someone is running in their spare time, on their own hardware, at their own expense. Like the tz database, like the DNS root servers, like so much of the infrastructure the modern world depends on, it works because people choose to make it work. There’s no contract. There’s no SLA. There’s just a community that thinks accurate time matters enough to donate the resources.

The accuracy you can achieve depends on your network. On a local network, NTP can keep clocks within a few hundred microseconds. Over the internet, a few milliseconds is typical. For applications that need tighter synchronisation (financial trading, for instance, or telecommunications), Precision Time Protocol (PTP, IEEE 1588) operates at the hardware level, timestamping packets as they enter and leave the network interface card, and can achieve sub-microsecond accuracy.

GPS is also a time-distribution system, not just a positioning one. In fact, positioning is time distribution: a GPS receiver determines its position by measuring the time it takes signals to arrive from multiple satellites, then solving for the intersection. Each GPS satellite carries multiple atomic clocks (some using caesium, others using rubidium, a cheaper and lighter alternative that trades a bit of long-term accuracy for portability) and broadcasts precise time signals. A GPS receiver on the ground can determine the time to within roughly 10 nanoseconds. Many NTP Stratum 1 servers use GPS as their reference source.
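The link between timing and position is just the speed of light. This is a back-of-the-envelope sketch, and the function name is illustrative.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact, by definition of the metre


def range_error_m(clock_error_s):
    """Distance a radio signal travels during a given timing error."""
    return SPEED_OF_LIGHT_M_PER_S * clock_error_s


print(round(range_error_m(10e-9), 2))  # 3.0
```

Ten nanoseconds of timing error is about three metres of ranging error. A full microsecond would be roughly 300 metres, which is why the clocks have to be atomic.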

But GPS is a US military system. It was built by the Department of Defense, it’s operated by the US Space Force, and the US government retains the right to degrade or deny the civilian signal at will. They did exactly that until May 2000, through a deliberate error called Selective Availability that made civilian GPS accurate to about 100 metres instead of 10. The military got the good signal. Everyone else got the blurred one.

That dependency on a single nation’s military made other countries nervous. The European Union built Galileo, which began offering initial services in 2016, as a civilian-controlled system from the start, with no equivalent of Selective Availability. Russia has GLONASS, operational since 1993. China has BeiDou, globally operational since 2020. India has NavIC, a regional system covering the Indian subcontinent.

Modern receivers use multiple constellations simultaneously. Your phone probably tracks GPS, Galileo, and GLONASS at once. More satellites in view means better geometry, faster fixes, and improved accuracy, from roughly 3-5 metres with GPS alone to under 1 metre with multi-constellation receivers. For timing applications, using multiple independent constellations also provides redundancy: if one system has a problem, the others keep you synchronised.

When synchronisation fails, the consequences are real. In 2016, a GPS ground station error introduced a 13-microsecond timing glitch that propagated to GPS-disciplined clocks worldwide. Telecommunications networks that relied on GPS for synchronisation experienced disruptions. In 2019, a Galileo outage left receivers without a valid time signal for several days. Having multiple constellations didn’t prevent the Galileo outage, but it meant that receivers tracking GPS and GLONASS simultaneously kept working while Galileo was down. Redundancy isn’t a theoretical benefit; it’s the difference between “the system degraded” and “the system failed.”

Radio time signals offer a terrestrial alternative. MSF in the UK broadcasts from Anthorn in Cumbria on 60 kHz. DCF77 in Germany broadcasts from Mainflingen near Frankfurt on 77.5 kHz. WWVB in the US broadcasts from Fort Collins, Colorado on 60 kHz. These long-wave signals can reach hundreds of kilometres and are used by “radio-controlled” clocks and watches, the ones that seem to magically stay accurate without any intervention. They receive the signal, typically at night when propagation is best, and correct themselves against it. The system is elegant and low-tech compared to GPS, but limited in precision to roughly a millisecond and in range to whatever the transmitter can cover.

The dependency chain is worth noting: your phone’s clock depends on NTP, which depends on Stratum 1 servers, which depend on atomic clocks or GPS, which depends on the satellites’ onboard atomic clocks, which depend on the ground control system that monitors and corrects them against the master clock at the US Naval Observatory. Every link in the chain adds a tiny bit of uncertainty. The time on your phone is an estimate, steering toward a post-hoc average of 450 clocks, computed in Paris, distributed through a hierarchy of servers and satellites, and corrected for relativistic effects that Einstein predicted in 1915. It’s close enough. It’s never exact.

Time in the financial markets

Nowhere is the practical importance of precise time synchronisation more visible than in financial trading. The EU’s MiFID II regulation, which came into force in January 2018, requires that timestamps on financial transactions be accurate to within 100 microseconds of UTC for high-frequency trading, recorded at a granularity of one microsecond or finer, and within one millisecond for most other electronic trading. The US SEC has similar requirements. This isn’t paranoia; it’s about being able to reconstruct the exact order of events when disputes arise or markets crash.

High-frequency trading firms spend millions on low-latency connections and precise clock synchronisation. A difference of a few microseconds can determine who gets a trade filled and who doesn’t. Some firms use rubidium or caesium oscillators at their trading sites, disciplined by GPS, to ensure their timestamps are as close to UTC as hardware allows. Others lease dedicated fibre connections to minimise and stabilise network latency between their servers and the exchange.

The irony is that all this infrastructure, from GPS-disciplined atomic clocks to PTP synchronisation to nanosecond-accurate timestamps, exists to coordinate an activity (buying and selling financial instruments) that is fundamentally a human invention. We built clocks precise enough to measure relativistic effects, and we use them to work out who pressed “buy” first.

The clock inside everything

We’ve gone from sticks in the ground to laser-trapped atoms oscillating hundreds of trillions of times per second. The precision is breathtaking. But precision brings its own strange problems. When your clocks are accurate enough to detect the difference in gravity between the floor and the ceiling, “what time is it?” stops being a simple question and starts being a question about the structure of spacetime itself.

That’s where things get weird. In Time Is Weirder Than You Think, we’ll see what happens when Einstein enters the picture, why GPS satellites need relativistic corrections, why the core of the Earth is younger than the surface, and why time might not flow at all.

Routing to the Closest Healthy Region

2026-04-22T06:00:00+08:00

The situation

A company operates a web application with regional deployments in three AWS Regions: us-east-1, eu-west-1, and ap-southeast-2. Each Region has an Application Load Balancer fronting an Auto Scaling group of EC2 instances.

The team wants app.example.com to route each user to the Region with the lowest network latency for their resolver, and, when the preferred Region is unhealthy, to fall over to the next-lowest-latency healthy Region automatically. They do not want to rely on client-side retries or third-party DNS, and they are allergic to extra health-check resources to configure, bill, and maintain.

Today every region is independently reachable via its own hostname; there is no intelligent DNS layer in front of them. This scenario is the work of picking one.

What actually matters

Before opening the Route 53 console, ask what this routing layer is actually supposed to do, because “closest region with failover” hides a cluster of decisions.

The first question is what “closest” even means. Physical distance is the intuitive answer, and it’s usually wrong. A user in Istanbul is physically close to Frankfurt but might take a faster network path to Dublin depending on peering; a user in Perth is physically close to Singapore but the undersea cable topology can put Sydney at a similar RTT. The honest definition of “closest” for a web application is “lowest measured latency”, and the routing layer has to have some way of knowing that. A policy that picks on continent-code geography will be wrong for every user whose network path doesn’t match their passport.

The second question is how the routing layer learns that a Region is unhealthy. There are two paths in Route 53: a dedicated health-check resource that we configure, point at an endpoint, and pay for, or a derived signal lifted from an AWS resource we already own, such as an ALB’s own view of its target-group health. The second path is cheaper and less likely to drift out of sync with reality, because the ALB is the thing that knows whether any backend is serving traffic. The first path is more flexible (it can check any URL anywhere) but it’s another resource in the inventory.

The third question is what happens when every region is unhealthy. DNS does not have a clean way to say “the service is globally offline”. Returning an empty answer sounds honest but breaks clients that can’t distinguish “resolver broken” from “service broken”. Returning a wrong answer gives the client something to try. Route 53’s own choice here, to treat every record as healthy when none of them is, is worth knowing because it’s the default we inherit, not a knob we tune.

The fourth question is whether the policy composes. Real traffic-flow graphs rarely fit inside one routing decision. “Closest region that’s healthy, but EU users only ever go to EU regions” and “closest region that’s healthy, with a per-region active/passive inner layer” are common shapes, and picking a policy that refuses to nest underneath or on top of another one writes the team into a corner later. Route 53 supports up to ten levels of nesting; the tool is ready even if the first problem only needs one level.

And finally there’s operational overhead. Each of these routing policies is a set of record configurations plus, sometimes, a Traffic Flow policy document plus, sometimes, health-check resources plus, sometimes, CloudWatch alarms wiring those checks to SNS. The cheapest answer on paper isn’t the cheapest answer in the on-call rotation if the pager goes off because someone edited the Traffic Flow JSON by hand.

What we’ll filter on

Distilling that exploration into filters we can score each routing policy against:

Locality-aware. The policy picks on measured network latency, not continent code, not weight, not physical distance.
Health-aware. Unhealthy records drop out of the candidate set without a human editing DNS.
Supports three or more records. Two-record policies are structurally insufficient for this three-Region shape.
Low operational overhead. The health signal comes from a resource we already own, ideally the ALB itself, rather than a separate Route 53 health check configured, monitored, and billed.

The Route 53 routing-policy landscape

Seven Route 53 routing policies are candidates here; IP-based routing, an eighth, is not considered. Each picks an answer on a different axis.

Simple routing. One record, no decision logic. Route 53 returns whatever is configured, regardless of who asked. Useful for single-Region services and static aliases. Can’t filter by health. Can’t do locality.
Weighted routing. Distributes answers across N records in proportion to integer weights. Supports health checks, so unhealthy records drop from the pool. Ignores latency entirely. A resolver in Sydney with three equal-weight records would get a random continent on every new query. Useful for canary rollouts and gradual blue/green traffic shifts, not locality.
Latency-based routing. Returns the record pointing to the AWS Region with the lowest measured round-trip time for the resolver’s location. Supports one record per AWS Region. Supports health checks, including the lightweight EvaluateTargetHealth path on alias records, which reuses an existing resource’s health signal instead of configuring a separate health-check resource.
Geolocation routing. Returns the record matching the resolver’s geographic location. Granularity runs continent → country → US state, with a default record (recommended, because without one unmatched resolvers get no answer) for resolvers that don’t match any configured entry. A resolver in Istanbul sits in Asia by continent code, so it would be routed to the APAC record even when eu-west-1 has a lower network RTT. Failover is weaker too: a failing geographic record falls through only to a broader geographic record or the default, not to the next-nearest neighbour.
Geoproximity routing. Returns the record “closest” to the resolver by physical distance from a user-configured origin, with optional bias (-99 to +99) that stretches or shrinks each origin’s pull. Distance is physical, not network. Requires Route 53 Traffic Flow (a visual policy-editor layer on top of DNS records) to configure, so the setup is no longer a simple record set.
Failover routing. Two records, primary and secondary. Route 53 returns the primary while its health check passes, the secondary when it doesn’t. Supports exactly N = 2. Ignores latency: a Sydney user would hit us-east-1 at 220 ms when ap-southeast-2 would give them 5 ms. Useful for active/passive DR and as the inner layer of a nested policy, not as a standalone answer.
Multi-value answer routing. Returns up to eight healthy records per query; the client’s resolver picks one (typically the first). Doesn’t consider latency, weight, or geography; it’s DNS-level load balancing for clients that retry if the first answer fails.

Side by side

Routing policy	Locality-aware	Health-aware	N ≥ 3	Low ops overhead
Simple	✗	✗	✗	✓
Weighted	✗	✓	✓	✓
Latency-based (alias + `EvaluateTargetHealth`)	✓	✓	✓	✓
Geolocation	✗	✓ (weak)	✓	✓
Geoproximity	✗	✓	✓	✗
Failover	✗	✓	✗	✓
Multi-value answer	✗	✓	✓	✓

One row is all ticks: latency-based routing with alias records and EvaluateTargetHealth = true.

Matching shape to policy

Each candidate falls out of the funnel on a different attribute. Latency-based with alias records and EvaluateTargetHealth is the one that reaches the bottom intact.

Latency-based routing, in depth

What Route 53 calls “latency” is not the latency from the resolver to the ALB. It’s the latency from Route 53’s own measurement points to each AWS Region: not to the service, not to the ALB, but to the Region. When a resolver queries, Route 53 looks up which Region has the lowest RTT for that resolver’s network position (based on its IP and Route 53’s internal map of measurement points) and returns the record pointing at that Region.

Two practical consequences. First, the latency readings are independent of the application’s performance. If the ALB is slow but the Region’s network paths are fast, Route 53 still treats the Region as fast; the health signal is what compensates. Second, the measurements cover AWS’s public-internet paths to its edge locations and Regions, which are the ones that matter for reaching the Region; per-service latency inside the Region is a separate problem.

The health-awareness wiring is EvaluateTargetHealth on an alias record. Aliases are an AWS extension to DNS that let a record point directly at an AWS resource (ALB, NLB, CloudFront distribution, S3 website endpoint, another Route 53 record) instead of a hard-coded IP. When a latency record is an alias to an ALB with EvaluateTargetHealth = true, Route 53 consults the ALB’s own health signal as part of the routing decision. For an ALB, “healthy” means every target group that has registered targets contains at least one healthy target. The ALB already knows this; there is nothing new to configure.

Compared to the alternative, a separate Route 53 health check pointing at an HTTP endpoint or IP, alias + EvaluateTargetHealth is cheaper, has one fewer moving part, and can’t drift out of sync with the actual backend health the way that a separately-configured health check can.
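As a rough sketch of that wiring through the API, the snippet below builds the ChangeBatch payload that boto3’s change_resource_record_sets call accepts: one latency alias record per Region, each with EvaluateTargetHealth switched on. The ALB DNS names, the zone ID strings, and the SetIdentifier scheme are placeholders invented for this example, not values from a real account. The real alias zone IDs are the per-Region ones AWS publishes for load balancers.

```python
import json

# Hypothetical placeholders: substitute each ALB's real DNS name and the
# hosted zone ID that AWS publishes for load balancers in that Region.
REGIONAL_ALBS = {
    "us-east-1": ("alb-use1.example.invalid", "ZONE_ID_FOR_US_EAST_1_ALBS"),
    "eu-west-1": ("alb-euw1.example.invalid", "ZONE_ID_FOR_EU_WEST_1_ALBS"),
    "ap-southeast-2": ("alb-apse2.example.invalid", "ZONE_ID_FOR_AP_SOUTHEAST_2_ALBS"),
}


def latency_alias_change_batch(record_name: str) -> dict:
    """Build one latency alias A record per Region, health taken from the ALB."""
    changes = []
    for region, (alb_dns_name, alb_zone_id) in REGIONAL_ALBS.items():
        changes.append({
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "A",
                "SetIdentifier": f"app-{region}",  # unique label per record in the set
                "Region": region,                  # presence of Region makes it a latency record
                "AliasTarget": {
                    "HostedZoneId": alb_zone_id,
                    "DNSName": alb_dns_name,
                    "EvaluateTargetHealth": True,  # reuse the ALB's own health signal
                },
            },
        })
    return {"Comment": "Latency routing with ALB-derived health", "Changes": changes}


if __name__ == "__main__":
    batch = latency_alias_change_batch("app.example.com.")
    print(json.dumps(batch, indent=2))
    # Applying it needs boto3 and credentials, so the call is left as a comment:
    #   boto3.client("route53").change_resource_record_sets(
    #       HostedZoneId="<your hosted zone ID>", ChangeBatch=batch)
```

Note what is absent: there is no HealthCheckId anywhere in the payload, which is the whole point of the alias path.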

When the latency-preferred record’s alias target reports unhealthy, Route 53 excludes it from the candidate set for the response and falls through to the next-lowest-latency record whose target is healthy. No client retry, no application-level awareness of the failover, no DNS record change from the administrator’s side.

One last-resort behaviour worth knowing. If every record in the set is unhealthy, Route 53 does not return an empty answer. It treats all of them as healthy and answers on latency alone, regardless of health. The reasoning is that a broken DNS response (NXDOMAIN or empty answer set) is strictly worse than a long-shot answer: the client might still succeed via retry, via a health check that’s lagging reality, or via an in-flight recovery. “Try something” beats “refuse to answer.”

A worked example: Madrid through four states

Madrid resolver, DNS TTL of 60 seconds on the latency record set.

State 1, all three regions healthy. Route 53 evaluates candidates: eu-west-1 (~28 ms), us-east-1 (~95 ms), ap-southeast-2 (~210 ms). All three alias targets healthy. Response: eu-west-1. Client connects at ~28 ms.

State 2, eu-west-1 ALB has no healthy targets. EvaluateTargetHealth = true on the eu-west-1 record excludes it from the candidate set. Next-lowest healthy: us-east-1. Response: us-east-1. Client connects at ~95 ms. End-to-end cutover is roughly the DNS TTL (60 s) plus Route 53’s internal health-propagation delay (~30 s).

State 3, eu-west-1 and us-east-1 both unhealthy. Candidate set: only ap-southeast-2. Response: ap-southeast-2. Madrid connects at ~210 ms. Painful, but the application is intact.

State 4, all three ALBs unhealthy. Candidate set empty. Route 53 treats all three records as healthy and answers on latency alone. Response: eu-west-1, the same as State 1. The first attempt probably fails; client behaviour varies from there. The point is the system doesn’t hand out DNS failures when the whole routing set is down.
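The four states can be replayed with a toy model of the selection rule. This is a sketch of the decision logic only, not of Route 53 itself: the RTT figures are the illustrative ones from the example, the function names are invented here, and the all-unhealthy branch encodes one reading of the last-resort behaviour (ignore health, then pick on latency as usual).

```python
# Illustrative RTTs for the Madrid resolver, taken from the worked example.
LATENCY_MS = {"eu-west-1": 28, "us-east-1": 95, "ap-southeast-2": 210}


def candidate_set(healthy: dict) -> list:
    """Regions still in play: the healthy ones, or every Region if none is healthy."""
    up = [region for region in LATENCY_MS if healthy[region]]
    return up if up else list(LATENCY_MS)


def answer(healthy: dict) -> str:
    """Lowest-latency Region among the candidates."""
    return min(candidate_set(healthy), key=LATENCY_MS.__getitem__)


STATES = {
    "State 1": {"eu-west-1": True, "us-east-1": True, "ap-southeast-2": True},
    "State 2": {"eu-west-1": False, "us-east-1": True, "ap-southeast-2": True},
    "State 3": {"eu-west-1": False, "us-east-1": False, "ap-southeast-2": True},
    "State 4": {"eu-west-1": False, "us-east-1": False, "ap-southeast-2": False},
}

if __name__ == "__main__":
    for name, healthy in STATES.items():
        region = answer(healthy)
        print(f"{name}: candidates={candidate_set(healthy)} -> {region} (~{LATENCY_MS[region]} ms)")
```

Nothing in the model edits a record between states; only the health inputs change, which mirrors the claim that failover needs no DNS change from the administrator.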

Where this nests

Route 53 supports up to ten levels of nesting, and two-level patterns show up in the same scenario shape repeatedly. Latency → Failover puts a per-Region active/passive pair inside each latency leg, so one record set gives “closest region AND in-region active/passive”. Geolocation → Latency pins GDPR-scoped users to the EU continent, then latency-selects among EU Regions inside that rule. Weighted → Latency runs a 10% canary globally with locality preserved inside both cohorts.
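The Latency → Failover shape can be pictured as a small tree that gets resolved from the outside in. The model below is a deliberately simplified sketch: the class names, endpoint names, and RTT values are all hypothetical, and it returns nothing when every leaf is down instead of modelling Route 53’s last-resort behaviour.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Endpoint:
    name: str
    healthy: bool


@dataclass
class Failover:
    primary: "Node"
    secondary: "Node"


@dataclass
class Latency:
    legs: list  # list of (rtt_ms, Node) pairs


Node = Union[Endpoint, Failover, Latency]


def resolve(node: Node) -> Optional[str]:
    """Return the endpoint this policy tree would answer with, or None."""
    if isinstance(node, Endpoint):
        return node.name if node.healthy else None
    if isinstance(node, Failover):
        # Primary wins while it resolves; otherwise fall to the secondary.
        return resolve(node.primary) or resolve(node.secondary)
    # Latency: try legs in ascending RTT order, skipping legs with no healthy answer.
    for _rtt, child in sorted(node.legs, key=lambda leg: leg[0]):
        found = resolve(child)
        if found is not None:
            return found
    return None


if __name__ == "__main__":
    tree = Latency(legs=[
        (28, Failover(Endpoint("eu-active", False), Endpoint("eu-passive", True))),
        (95, Failover(Endpoint("us-active", True), Endpoint("us-passive", True))),
    ])
    print(resolve(tree))
```

With the EU active endpoint down, the answer stays in the EU leg and moves to its passive endpoint; only when the whole leg is down does the outer latency layer move on to the next Region.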

The useful skill is spotting the primary axis (the one the scenario optimises on) and the secondary axis (the one it constrains on). Once those are clear, the nesting writes itself.

What’s worth remembering

Seven routing policies are in play here: simple, weighted, latency, geolocation, geoproximity, failover, multi-value answer. Each optimises a different axis and most real setups nest two or more.
“Latency” means Route 53’s measured RTT from its probes to each AWS Region, not resolver-to-ALB, not physical distance. Continent-code geolocation is not a substitute.
Alias + EvaluateTargetHealth is the lightweight health-awareness path. It reuses the ALB’s own view of target health instead of a separately-configured Route 53 health-check resource.
Unhealthy records drop out of the candidate set silently. Route 53 falls through to the next-lowest-latency healthy record with no DNS edit and no client retry.
When every record is unhealthy, Route 53 treats them all as healthy. That fallback is the least-worst choice: a long-shot answer beats NXDOMAIN.
Nesting goes up to ten levels deep. Latency-outer with Failover-inner is a common two-level shape for “closest region AND in-region active/passive”.
DNS TTL plus propagation sets the cutover floor. A 60-second TTL and Route 53’s ~30-second propagation mean clients can keep using the old answer for up to about a minute and a half.

Accessibility: A Product Decision, Not a Compliance Tick

2026-04-21T06:00:00+08:00

Greenbox has sixty-eight subscribers. The signup flow works, mostly. Two sprints into the new delivery, a subscriber called Helen sends a polite email that lands like a small earthquake under the kitchen table where Maya does her reading.

Helen’s email is three paragraphs long. The first paragraph thanks Maya for the boxes and mentions that her daughter-in-law recommended Greenbox. The second paragraph asks a practical question about substitutions. The third paragraph is the one Maya reads twice.

“I should mention,” Helen writes, “that I’m registered blind. I use a screen reader called JAWS to use my computer. I managed to sign up to Greenbox, but it took me about forty minutes. The checkout form had some fields my screen reader couldn’t identify, and I had to guess what some of the buttons did. I got there in the end, but I wanted to let you know in case you can fix it for others. I’m not complaining; I’m glad I found you. Just letting you know.”

Maya reads it a third time. Then she forwards it to Tom, Priya, and Lee with one line: “We need to talk about this.”

The Monday morning conversation

By Monday, Tom has opened the signup form on his own laptop, turned on VoiceOver, and tried to complete it with his eyes closed.

He gets to the delivery frequency dropdown and stops. VoiceOver reads it as “button, pop-up button.” It doesn’t say what the button is for. It doesn’t read the current selection. If Tom hadn’t known he was on the delivery frequency field, he would have no idea what was happening.

He tries the substitution preferences checklist next. VoiceOver reads “checkbox, unchecked” three times in a row and then stops. There are twelve checkboxes on that page. Helen would have had to tab through all twelve without knowing what any of them were for.

Tom puts his headphones down.

“We failed her,” he says to Priya. “She got through it because she’s determined. Not because we did our job.”

Priya has been reading the Web Content Accessibility Guidelines on her laptop. “WCAG 2.2,” she says. “There are three conformance levels: A, AA, AAA. AA is the standard most regulators use. It covers perceivable, operable, understandable, and robust. Four principles, thirteen guidelines, eighty-six success criteria.”

“Eighty-six.”

“Some of them are easy. Colour contrast, alt text on images, form labels. Some of them are harder, like making sure that every interactive element works with a keyboard, or that the order the screen reader reads things in matches the visual order. The hard ones are the ones we’re failing.”

Maya is listening from the other end of the kitchen. She asks the question she always asks when she’s trying to decide how seriously to take something. “What does it cost to fix?”

Compliance or product?

This is the moment the conversation could go two different ways.

One version: Greenbox treats accessibility as a compliance checklist. Tom and Priya spend a week grinding through the WCAG criteria, ticking boxes, running automated scans with axe and Lighthouse, fixing the failures the scans flag. At the end of the week, the automated scans are green. They declare victory. They write a blog post about being WCAG AA compliant. Nobody tests with an actual screen reader again for the rest of the year.

The other version: Greenbox treats accessibility as a product quality decision. The signup flow has to work for everyone, because every person who can’t complete signup is a subscriber Greenbox loses, and a person whose experience of Greenbox is worse than their experience of the shops. The team doesn’t just run automated scans. They test the flow with a screen reader, with keyboard-only navigation, with high-contrast mode, and with somebody whose vision is actually impaired.

Maya has been in enough tech companies to know which version happens by default. She’s seen “WCAG AA compliant” stickers on websites that are unusable with a screen reader. The compliance checklist is seductive because it’s finite. You do the list, you’re done. The product quality framing is harder because there’s no finish line, just a commitment to keep checking.

She picks the harder framing.

“We’re not going to be WCAG AA compliant,” Maya says. “We’re going to be a subscription box that actually works for people who can’t see the box on the website. Those are different goals, and I want to be sure we know which one we’re solving.”

Tom nods slowly. He can feel the difference in his stomach. Compliance is a sprint: a week of furious fixing, then done. Product quality is an orientation, something you carry into every future sprint.

What “works for Helen” looks like

Lee takes out a marker and goes to the whiteboard. “Okay. If we’re going to treat this as a product decision, we need to be specific about what we’re deciding. What does ‘works for Helen’ mean?”

The team spends an hour working through it. The outcome isn’t a WCAG checklist; it’s a set of user story map updates, concrete journey steps that have to work regardless of how the subscriber is accessing the site.

A subscriber who cannot see the screen can sign up, choose a box, set preferences, and complete checkout using only a screen reader, without asking for help.
A subscriber who cannot use a mouse can do the same using only a keyboard.
A subscriber with low vision can read every piece of content on the site with the browser’s text size set to 200%.
A subscriber who is colour-blind can distinguish every button, status indicator, and error message without relying on colour alone.
A subscriber with cognitive difficulties can understand the delivery schedule, the substitution rules, and the cancellation process on first reading.

Each of these feeds back into the user story map as a row of slices that cut across every feature. They’re not a separate “accessibility backlog” to be done later. They’re the same stories the team has already mapped, with a new dimension added.

“This is how we avoid the compliance trap,” Lee says. “Accessibility isn’t a feature; it’s a quality attribute of every feature. If we build a new delivery scheduler, it has to work for Helen. If we build a new substitution flow, it has to work for Helen. We don’t ship a feature unless it works for Helen.”

Tom writes “Helen” on a sticky note and puts it next to the Definition of Done on the team’s process wall. Next to “passes tests,” “code reviewed,” and “deployed to staging,” they add “tested with keyboard and screen reader.”

It doesn’t fix everything. The existing signup flow still has the problems Helen described. Tom and Priya spend the rest of the week going through it systematically, not by running automated scans, but by closing their eyes and trying to use it. They find things the scans missed. The checkout button is a styled <div> instead of a <button>, so screen readers don’t announce it as clickable. The error messages appear as text near the broken field, but the screen reader doesn’t read them when the field is focused. The delivery day picker is a custom component that traps keyboard focus.

They fix each one. Priya writes a short internal doc, Greenbox Accessibility Standards, that lists the decisions they’ve made and why. It has four sections: semantic HTML first, keyboard navigation always, visible focus indicators, and test with actual assistive technology. It’s shorter than the WCAG spec, and more useful, because it’s specific to the decisions they make every day.

The email back to Helen

On Friday, Maya sends Helen a reply.

“Helen, thank you for writing to us. Your email changed how we’re building Greenbox. This week, Tom and Priya went through the entire signup flow with a screen reader and fixed the problems you described. We’ve also added keyboard and screen reader testing to our Definition of Done, which means every new feature we ship has to work before it goes live. I’d love to send you a free box as an apology for the forty minutes, and to thank you for teaching us something we should have known. If you’re willing, we’d also love to ask you a few questions about the rest of the site, not to interrogate you, but because you’ll notice things we won’t.”

Helen writes back within the hour. She says yes to the box, and yes to the questions. She adds: “Most of the time when I tell a company their site is broken for me, they apologise and nothing changes. Thank you for being different.”

Maya pins the email to the wall next to the sticky note with Helen’s name on it.

Why this matters beyond Helen

Here’s the thing about building the product this way: it doesn’t just benefit Helen. The changes Tom and Priya made also benefit:

The subscriber who has broken their arm and is navigating the site one-handed with a keyboard.
The subscriber on a train with a bad connection and a small phone screen.
The subscriber whose first language isn’t English, who uses a screen reader to slow down the text.
The subscriber who is sixty-eight and wears reading glasses and needs the text bigger than the default.
The subscriber whose ADHD makes it hard to parse a cluttered interface and needs clear headings and clean structure.

This is a well-known result in accessibility research: building for the edges improves the middle. Captions were designed for deaf users and are now used by everyone watching TV in a noisy pub. Kerb cuts were designed for wheelchair users and now serve parents with prams, cyclists, and people wheeling suitcases. The term for this is the curb-cut effect, and it’s real and measurable.

Maya didn’t know the term when she made the decision. She just knew that Helen was a subscriber, Helen had been failed, and the right response wasn’t a compliance sticker.

The lesson Maya writes down

In her notebook, the one she uses to capture the decisions she wants to remember, Maya writes:

“Accessibility is a product quality decision. The compliance frame makes it finite and lets you declare victory. The product frame makes it ongoing and lets you keep improving. Choose the product frame. The cost is real but small. The benefit is that every subscriber can actually use what you built. That is the whole point of building something.”

She closes the notebook. The email from Helen is still pinned to the wall.

Next to it, she adds a second sticky note. On it, in her careful handwriting: Helen is a subscriber. Every subscriber matters. Every subscriber counts.

It isn’t a WCAG criterion; it’s better.

What Day Is It?

2026-04-20T06:00:00+08:00

What Time Is It? dealt with the hour, a fragile compromise between the sun and politics. The date next to it is fragile too, built from a different cast of characters: monks miscalculating epochs, popes deleting weeks, traders trying to align their working days with their trading partners, and software developers stuck with whichever epoch their operating system chose decades ago.

The fundamental problem

Calendars are hard for the same reason time zones are hard: the universe didn’t supply round numbers.

The Earth’s orbital period is roughly 365.2422 days. The lunar month is roughly 29.5306 days. Neither divides neatly into the other, and neither divides neatly into a day. Every calendar humanity has ever built is a compromise: some prioritise the sun, some the moon, some try to honour both, and a few just give up and count days from a fixed point.

There is no clean answer. There is only a choice about what to round, what to ignore, and what to patch with the occasional extra day or month bolted on.

The Gregorian calendar and the eleven missing days

The Gregorian calendar, the one most of the world uses for civil purposes, counts years from an epoch chosen in the 6th century by a monk named Dionysius Exiguus, who was trying to calculate the birth year of Jesus and got it wrong by several years. Modern scholarship places the actual birth somewhere between 6 and 4 BCE, which means the year on your phone is off by half a decade from the event it’s nominally counting from. We are stuck with Dionysius’s guess, because nobody is going to renumber every document, gravestone, and database in the world to fix it.

The Gregorian calendar uses a solar year: 365 days, with a leap day every four years, except every hundred years, except every four hundred years. So 1900 was not a leap year, but 2000 was. The rule keeps the calendar aligned with the seasons to within roughly one day every 3,236 years, close enough that nobody currently alive needs to worry about the next correction.
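The three-part rule is short enough to write out, and doing so also shows where the average year length comes from. A minimal sketch, with function names chosen for this example:

```python
from fractions import Fraction


def is_gregorian_leap_year(year: int) -> bool:
    """Every 4th year, except century years, except every 4th century."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def mean_gregorian_year_days() -> Fraction:
    """Average year length over one full 400-year cycle of the rule."""
    leap_days = sum(is_gregorian_leap_year(year) for year in range(1, 401))
    return Fraction(365 * 400 + leap_days, 400)


if __name__ == "__main__":
    for year in (1900, 2000, 2024, 2026, 2100):
        print(year, is_gregorian_leap_year(year))
    mean = mean_gregorian_year_days()
    print(f"mean year = {mean} days = {float(mean)} days")
```

The 400-year cycle contains 97 leap days, so the mean year is 365.2425 days. That sits a few ten-thousandths of a day above the orbital period quoted earlier, and that small excess is the slow residual drift.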

It was introduced by Pope Gregory XIII in 1582 to fix the drift of the Julian calendar, which had been gaining roughly three days every four hundred years against the seasons. The fix required a one-time correction: ten days were simply deleted. In the countries that adopted the new calendar in 1582, October 4 was followed by October 15. Ten days that never happened.

The switchover was not smooth. Catholic countries adopted it immediately. Protestant and Orthodox countries dragged their feet for centuries. Britain didn’t switch until 1752, by which point the discrepancy had grown to eleven days. September 2, 1752 was followed by September 14. Rents, wages, and birthdays had to be renegotiated. There’s a persistent legend that mobs rioted in the streets shouting “give us back our eleven days!” The truth is more prosaic; the political turmoil was real but largely quiet.

Russia held out until 1918. Greece until 1923. For centuries, different countries were on different dates at the same time. The October Revolution? It happened on 25 October by Russia’s Julian calendar, 7 November by the Gregorian calendar everyone else was using. An October revolution that happened in November.
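The two-dates-at-once problem can be checked by converting through a Julian Day Number, a plain running count of days that both calendars can be mapped onto. The sketch below uses the widely published integer-arithmetic formulas for that conversion; the function names are this example’s own, and it is only meant for dates in the common era.

```python
def julian_calendar_to_jdn(year: int, month: int, day: int) -> int:
    """Julian Day Number for a date written in the Julian calendar."""
    a = (14 - month) // 12          # 1 for January and February, else 0
    y = year + 4800 - a             # shift so the year starts in March
    m = month + 12 * a - 3          # March = 0 ... February = 11
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083


def jdn_to_gregorian(jdn: int) -> tuple:
    """Gregorian (year, month, day) for a Julian Day Number."""
    a = jdn + 32044
    b = (4 * a + 3) // 146097       # completed 400-year cycles
    c = a - (146097 * b) // 4
    d = (4 * c + 3) // 1461         # completed years within the cycle
    e = c - (1461 * d) // 4
    m = (5 * e + 2) // 153          # month index, March = 0
    day = e - (153 * m + 2) // 5 + 1
    month = m + 3 - 12 * (m // 10)
    year = 100 * b + d - 4800 + m // 10
    return year, month, day


def julian_to_gregorian(year: int, month: int, day: int) -> tuple:
    """Re-express a Julian-calendar date in the Gregorian calendar."""
    return jdn_to_gregorian(julian_calendar_to_jdn(year, month, day))


if __name__ == "__main__":
    print(julian_to_gregorian(1917, 10, 25))  # October Revolution, Julian date
    print(julian_to_gregorian(1582, 10, 5))   # the day after 4 October 1582
    print(julian_to_gregorian(1752, 9, 3))    # the day after 2 September 1752
```

The gap between the two calendars is ten days in 1582, eleven in 1752, and thirteen by 1917, because each century year the Julian calendar treats as a leap year and the Gregorian does not adds one more day.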

Lunar, solar, lunisolar

Beyond the Gregorian, the variety is dizzying.

The Islamic (Hijri) calendar is purely lunar, 354 or 355 days per year, so its months rotate through the Gregorian seasons over a 33-year cycle. Ramadan slowly walks through the year, falling in summer for a while, then spring, then winter. Months traditionally begin when a crescent moon is physically sighted by human observers. Not calculated, observed. Saudi Arabia and Morocco sometimes start Ramadan on different days because one country’s observers spotted the crescent and the other’s didn’t. The date of the most important month in the Islamic calendar is, in the strictest traditional sense, unknowable in advance. Software that has to schedule Islamic holidays falls back on calculated approximations and accepts that it will sometimes be a day off.

The Hebrew and Chinese calendars are lunisolar: lunar months adjusted with the occasional leap month to stay aligned with the solar year. The Hebrew calendar uses a 19-year cycle in which seven of the years contain an extra month. The Chinese calendar uses astronomical calculation to decide when to insert a leap month based on solar terms. The result is months that follow the moon and years that follow the sun, glued together by a rule that sounds simple and is anything but.
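The Hebrew rule is the more mechanical of the two. A rough sketch, assuming the commonly cited positions of the seven leap years within the 19-year cycle (years 3, 6, 8, 11, 14, 17, and 19); the function names are illustrative, and the Chinese rule is not attempted here because it depends on astronomical calculation:

```python
# Positions within the 19-year cycle that carry a leap month.
# Position 0 stands for year 19 of the cycle.
LEAP_POSITIONS = {0, 3, 6, 8, 11, 14, 17}

def is_hebrew_leap_year(year: int) -> bool:
    """True if the given Hebrew year contains the extra month."""
    return year % 19 in LEAP_POSITIONS

def months_in_hebrew_year(year: int) -> int:
    return 13 if is_hebrew_leap_year(year) else 12

for year in range(5784, 5788):
    print(year, months_in_hebrew_year(year))
```

Any run of nineteen consecutive years contains exactly seven thirteen-month years, which is what keeps the lunar months from drifting away from the solar year.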

The Hindu calendars (there are several regional variants) are also lunisolar but use different epoch dates and different rules. The Bengali calendar starts its year in mid-April. The Tamil calendar uses a 60-year cycle of named years. None of them agree with each other, and most have to be reconciled with the Gregorian calendar for civil purposes.

The year itself is negotiable

The number on your screen depends on which tradition you ask.

The Ethiopian calendar runs seven to eight years behind the Gregorian, a result of using a different calculation for the Annunciation. Ethiopia entered the third millennium in 2007 by Western reckoning. The country celebrated. The Western press, already several years past its own millennium, mostly missed it.

The Thai Buddhist calendar counts from the death of the Buddha in 543 BCE, which is why Thai expiry dates look like they’re from the future. A bottle of water bought in Bangkok in 2026 might be stamped with an expiry of 2570. The product hasn’t time-travelled. The calendar just starts somewhere else.

The Juche calendar in North Korea counts from Kim Il-sung’s birth year (1912). It was introduced in 1997, three years after his death, retroactively renumbering the entire country’s history. The calendar exists alongside the Gregorian in official documents, with the Juche year cited first.

The Hebrew calendar counts from a calculated date for the creation of the world. We’re currently in the year 5786 by that reckoning. The Islamic calendar counts from the Hijra, Muhammad’s migration from Mecca to Medina in 622 CE, placing us in the 1440s. The Republic of China calendar, still in official use in Taiwan, counts from the founding of the republic in 1912, making 2026 the year 115. None of these traditions is wrong. They are answering a slightly different question.
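For the traditions that differ from the Gregorian only in where they start counting, the conversion is a fixed offset. A minimal sketch, assuming whole-year arithmetic and ignoring any difference in when each calendar’s year begins; the table and function name are illustrative. The Hebrew and Islamic years are left out because their year lengths differ too, so a single offset doesn’t hold:

```python
# Year offsets relative to the Gregorian (CE) year.
# Assumption: whole years only; new-year dates are ignored.
ERA_OFFSETS = {
    "thai_buddhist": 543,  # counts from 543 BCE
    "minguo": -1911,       # Republic of China: 1912 is year 1
    "juche": -1911,        # North Korea: 1912 is year 1
}

def to_era_year(gregorian_year: int, era: str) -> int:
    """Return the year number in the given era for a Gregorian year."""
    return gregorian_year + ERA_OFFSETS[era]

for era in ERA_OFFSETS:
    print(era, to_era_year(2026, era))
```

An expiry stamp of 2570 on a Thai bottle is therefore 2027 in Gregorian terms.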

Calendars without numbers

Not all calendars count days the way Western calendars do.

Here in Western Australia, the Nyoongar people, the traditional custodians of the south-west, use six seasons based on ecological indicators rather than calendar dates. Djilba (first rains) starts when the first rains come, which might be August or September depending on the year. Bunuru (the hot, dry time) starts when the weather turns, not when February begins. You know what season you’re in by looking at the land, not at a calendar. It’s a fundamentally different relationship with time: not “what date is it?” but “what is country doing right now?” (“Country” in Aboriginal English means the land itself, the living landscape, not a nation state.) It’s a calendar that’s always in sync with the actual ecology, at the cost of being impossible to print on a wall planner.

Other Indigenous calendars across Australia work similarly. The Yolngu people of Arnhem Land recognise six seasons based on wind direction, plant flowering, and animal behaviour. The D’harawal of the Sydney basin recognise six. None of them line up with the Gregorian quarters because the Gregorian quarters describe northern-hemisphere agriculture, not the actual rhythms of the southern continent.

Roman counting and revolutionary weeks

Not all calendars even count in the same direction.

Roman calendars counted backwards from fixed points in each month: the Kalends (the first), the Nones (around the fifth or seventh), and the Ides (around the thirteenth or fifteenth). Caesar was assassinated on the Ides of March, March 15, but a Roman would have referred to the day before that as “the day before the Ides” rather than “the fourteenth.” Days were named relative to the next landmark, not numbered absolutely.

The Maya Long Count tracked elapsed days from a mythological creation date in 3114 BCE, using a base-20 system with one quirky base-18 layer. It generated the apocalypse hysteria around December 21, 2012, when the count rolled over from one b’ak’tun to the next. The Maya themselves did not predict the world would end. They predicted the counter would tick. Western tabloids did the rest.
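The mixed base is easier to see as place values. A small sketch, assuming the usual five-position notation (b’ak’tun, k’atun, tun, winal, k’in) and, for the conversion to a Gregorian date, the widely used GMT correlation constant of 584283 as the Julian Day Number of the Long Count’s day zero. That constant is an assumption brought in for illustration, not something the text above specifies:

```python
from datetime import date

# Days per unit, from b'ak'tun down to k'in. Every step is base 20
# except winal -> tun, which is base 18 (18 * 20 = 360 days).
PLACE_VALUES = (144_000, 7_200, 360, 20, 1)

GMT_CORRELATION = 584_283    # assumed Julian Day Number of 0.0.0.0.0
JDN_TO_ORDINAL = 1_721_425   # JDN minus Python's proleptic Gregorian ordinal

def long_count_to_days(baktun, katun, tun, winal, kin):
    """Elapsed days since the Long Count's starting point."""
    digits = (baktun, katun, tun, winal, kin)
    return sum(d * v for d, v in zip(digits, PLACE_VALUES))

def long_count_to_date(*digits):
    jdn = GMT_CORRELATION + long_count_to_days(*digits)
    return date.fromordinal(jdn - JDN_TO_ORDINAL)

print(long_count_to_days(12, 19, 19, 17, 19))  # 1871999
print(long_count_to_days(13, 0, 0, 0, 0))      # 1872000
print(long_count_to_date(13, 0, 0, 0, 0))      # 2012-12-21
```

The day before the rollover is 12.19.19.17.19, the last value before every lower position resets to zero: an odometer turning over, nothing more.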

The French Republican Calendar (1793-1805) was a deliberate attempt to scrub Christianity and royalty out of the year. It introduced a ten-day week, three weeks per month, twelve months of thirty days, plus five or six “complementary days” tacked on at the end of the year. Months got new poetic names: Brumaire (mist), Thermidor (heat), Floréal (flowers). It was abolished after twelve years partly because workers only got one day off in ten, partly because nobody outside France used it, and partly because the calendar’s rules depended on astronomical observations from the Paris Observatory, which made it deeply impractical for shipping and trade.

The Soviet Union tried something similar in 1929 with a five-day week, with workers divided into five colour-coded groups so that production could continue uninterrupted: one fifth of the workforce was always on rest. Family members in different colour groups never had a day off together. The experiment was abandoned within a few years.

The International Date Line and the disappearing day

Once the world agreed on Greenwich as the prime meridian, an awkward consequence followed: somewhere on the opposite side of the world, the calendar date had to change. That somewhere is the International Date Line, which roughly follows the 180-degree meridian but zigzags wildly to avoid splitting countries.

It’s not defined by any treaty. It’s a convention, and nations choose which side they sit on.

The earliest time zone on Earth is UTC+14 (Kiribati’s Line Islands). The latest is UTC-12 (Baker Island and Howland Island, both uninhabited). The gap is 26 hours, which means any given calendar date exists somewhere on Earth for a total of fifty hours. New Year’s Eve starts in Kiribati and finishes more than two days later, by clock time, somewhere in the Pacific.
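The fifty-hour figure can be checked with fixed offsets. A minimal sketch using only the two extreme offsets named above; the function name `lifetime_of` is illustrative:

```python
from datetime import date, datetime, timedelta, timezone

EARLIEST = timezone(timedelta(hours=14))  # UTC+14, Line Islands
LATEST = timezone(timedelta(hours=-12))   # UTC-12, Baker and Howland

def lifetime_of(day: date) -> timedelta:
    """How long a calendar date exists somewhere on Earth."""
    # The date first appears at local midnight in the earliest zone...
    first = datetime(day.year, day.month, day.day, tzinfo=EARLIEST)
    # ...and ends at the following local midnight in the latest zone.
    last = datetime(day.year, day.month, day.day, tzinfo=LATEST) + timedelta(days=1)
    return last - first

print(lifetime_of(date(2026, 12, 31)))  # 2 days, 2:00:00
```

Twenty-four hours for the day itself plus the twenty-six-hour spread between the zones gives fifty.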

Kiribati earned its UTC+14 the hard way. Until 1995, the country straddled the date line: the western Gilbert Islands were on Tuesday while the eastern Line Islands were still on Monday. A government on the wrong side of its own date line struggles to function. Civil servants in the capital couldn’t telephone the eastern islands during normal business hours because the eastern islands were closed for the day before, or open for the day after. At the start of 1995 the country redrew its time zone so the whole nation sat on the same side of the line, which meant the Line Islands skipped a day. December 31, 1994 simply did not happen there.

Samoa did the reverse in 2011. For more than a century Samoa had been on the American side of the date line (UTC-11) because most of its 19th-century trade was with California. By 2011, most of its trade was with Australia and New Zealand, which were a full day ahead. The mismatch meant Samoan businesses had only four overlapping working days per week with their main partners. The government decided to jump across the line. December 29, 2011 was followed directly by December 31. Friday December 30, 2011 simply ceased to exist in Samoa. People born on December 30 in earlier years had no birthday that year. The country switched from UTC-11 to UTC+13.

The neighbouring American Samoa, on the other hand, stayed where it was. The two Samoas are 100 kilometres apart and now live a full day apart by clock.
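Samoa’s missing Friday shows up as soon as the same instant is read on both offsets. A sketch under a simplifying assumption: it uses the standard offsets quoted above (UTC-11 and UTC+13) as fixed zones and ignores daylight saving, so it illustrates the jump rather than reproducing the official transition record:

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed standard offsets, daylight saving ignored.
OLD = timezone(timedelta(hours=-11))  # Samoa before the jump, American Samoa still
NEW = timezone(timedelta(hours=13))   # Samoa after the jump

# The last minute of Thursday 29 December 2011 on the old offset.
before = datetime(2011, 12, 29, 23, 59, tzinfo=OLD)

# One minute later, read on each side of the date line.
instant = before + timedelta(minutes=1)
after = instant.astimezone(NEW)
american_samoa = instant.astimezone(OLD)

print(before.isoformat())          # 2011-12-29T23:59:00-11:00
print(after.isoformat())           # 2011-12-31T00:00:00+13:00
print(american_samoa.isoformat())  # 2011-12-30T00:00:00-11:00
```

The same instant is the start of Saturday in Samoa and the start of Friday in American Samoa.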

When computers count days

Every operating system, every database, every programming language has had to make peace with all of this. The compromise is usually a fixed epoch (a reference moment from which time is counted as a single number, typically seconds or milliseconds) plus a separate library that knows how to translate that number back into a human-readable date in a human-chosen calendar.

The choices are deeply arbitrary.

Unix counts seconds from 1 January 1970 UTC. This is the most widely deployed epoch on Earth, inside almost every server, every smartphone, every embedded device. It was chosen because it was a recent round date when Unix was being designed, and nobody expected the choice to matter for very long. It will overflow a signed 32-bit integer in 2038, the Year 2038 Problem, also known as the Unix Millennium Bug, which is currently a slow-burning crisis for any 32-bit system that hasn’t been updated.
Windows uses 1 January 1601 (the start of the previous Gregorian 400-year cycle, chosen so that calendar arithmetic was simpler).
macOS Cocoa uses 1 January 2001.
GPS counts weeks from 6 January 1980. The week counter was originally 10 bits, which rolled over for the first time in 1999 and again in 2019. Receivers that didn’t handle the rollover started reporting times about two decades in the past.
NTP uses 1 January 1900. Its 32-bit second counter will roll over in 2036, two years before the Unix problem.
Excel uses 1 January 1900 as day 1, and famously believes 1900 was a leap year. It wasn’t: 1900 is divisible by 100 but not by 400, so the Gregorian rule says no leap day. But Lotus 1-2-3 had the same bug, and Excel chose compatibility over correctness when Microsoft was trying to win the spreadsheet wars in the 1980s. That bug ships in every copy of Excel to this day. Any date arithmetic in Excel that crosses the (nonexistent) February 29, 1900 is silently wrong.
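Several of the dates in that list can be recomputed from the epochs and counter widths alone. A minimal sketch using Python’s standard `datetime` module; `excel_serial` is an illustrative helper that mimics the 1900 date system described above, not an Excel or library API:

```python
from datetime import date, datetime, timedelta, timezone

UTC = timezone.utc

# Unix: signed 32-bit seconds since 1970-01-01 overflow at 2**31.
unix_overflow = datetime(1970, 1, 1, tzinfo=UTC) + timedelta(seconds=2**31)

# NTP: unsigned 32-bit seconds since 1900-01-01 wrap at 2**32.
ntp_rollover = datetime(1900, 1, 1, tzinfo=UTC) + timedelta(seconds=2**32)

# GPS: a 10-bit week counter wraps every 2**10 = 1024 weeks.
gps_epoch = date(1980, 1, 6)
gps_rollovers = [gps_epoch + timedelta(weeks=1024 * n) for n in (1, 2)]

def excel_serial(d: date) -> int:
    """Serial number in Excel's 1900 system, phantom leap day included."""
    real_days = (d - date(1899, 12, 31)).days  # 1900-01-01 -> 1
    # From 1 March 1900 on, Excel is one ahead: it has counted 29 Feb 1900.
    return real_days + 1 if d >= date(1900, 3, 1) else real_days

print(unix_overflow)                    # 2038-01-19 03:14:08+00:00
print(ntp_rollover)                     # 2036-02-07 06:28:16+00:00
print(gps_rollovers)
print(excel_serial(date(1900, 2, 28)))  # 59
print(excel_serial(date(1900, 3, 1)))   # 61
```

Serial 60 is the day that never existed, which is why a difference taken across it comes out one day too long.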

The pattern repeats. Every computer system makes a choice about its epoch, and every choice ages badly. If you store a date as “days since 1 January 1900” in 16 bits, you get to 2079 before you run out. If you store it as a Gregorian “year, month, day” triple, you’re fine for billions of years but you have to do calendar arithmetic every time you want to compute a duration. The choice between those two, a single number or a structured representation, is one of the oldest debates in software, and there is still no clean answer.
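Both halves of that trade-off fit in a few lines. A rough sketch, assuming an unsigned 16-bit day counter that starts at zero on 1 January 1900; the sample values are made up for illustration:

```python
from datetime import date, timedelta

# A 16-bit unsigned day counter tops out at 2**16 - 1 = 65,535 days.
last_day = date(1900, 1, 1) + timedelta(days=2**16 - 1)
print(last_day)  # 2079-06-06

# As plain day numbers, a duration is a single subtraction.
start_n, end_n = 45_000, 45_090
print(end_n - start_n)  # 90

# As (year, month, day) triples, the same question needs calendar rules:
# the gap from 31 January to 1 March depends on whether it's a leap year.
for year in (2023, 2024):
    start, end = (year, 1, 31), (year, 3, 1)
    print(year, (date(*end) - date(*start)).days)
```

The number is cheap to subtract and runs out; the triple never runs out and makes every subtraction a calendar problem.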

So what day is it?

The hour on your phone is a fragile compromise between the sun and politics. The date next to it is a fragile compromise between the moon, the sun, and several thousand years of calendar reformers, deleted weeks, regional epochs, and arbitrary numbering systems chosen by people who are now dead.

Today’s date depends on:

Which calendar. Gregorian for civil purposes in most of the world, but Hebrew, Islamic, Chinese, Ethiopian, Thai, Juche, and dozens of others operate alongside it for religious or national use.
Which side of the date line. And whether the date line in your part of the world has moved recently.
Which epoch your computer was built on. And whether that epoch is about to overflow.
Whether your timezone has changed recently. Russia has reshuffled its time zones repeatedly. Samoa moved itself across the date line. Kiribati skipped a day. Each of those events makes “what day was it on the 31st of December 1994 in the eastern Line Islands?” a question with an unsatisfying answer.

We’ve now covered the human story of the hour and the day. But all of this has been about how we agree on time. The next post asks what we’re actually measuring when we count seconds at all. Ticks or Tocks? is about the physics of the second, from quartz crystals to caesium atoms to optical lattice clocks that won’t lose a tick in the lifetime of the universe.

The Workshop: Impact Mapping

2026-04-19T06:00:00+08:00

Impact Mapping connects deliverables to actors, behaviour change, and a measurable goal, so when you ship a feature you can tell whether it worked. Connecting Work to Goals is the worked example; this post is the playbook.

Impact Mapping

Impact Mapping traces a path from a measurable business goal, through the actors whose behaviour affects that goal, through the behaviour changes you want to cause, to the deliverables that might cause them, so the team can tell the difference between work that moves the number and work that just feels productive. The four columns answer why (goal), who (actors), how (impacts, behaviour changes), what (deliverables), in that order, left to right. The technique was invented and named by Gojko Adzic in 2012 and is sometimes called goal mapping or outcome mapping (though outcome mapping has a separate formal definition in international development that isn’t quite the same thing). It is frequently confused with story mapping: story mapping lays out what a user does; Impact Mapping lays out what behaviour you want to change.

At a glance

Who, for how long: a facilitator, a product owner who owns the goal, one or two developers, a designer, and someone with unfiltered exposure to the actors (sales, support, ops). Four to six people, about ninety minutes.
What you walk out with: a four-column map (goal → actors → impacts → deliverables) with a prioritised path picked across it, a target metric and date written on the chosen deliverable so it lands as a bet, and a “not this quarter” list of everything that didn’t make the path.
When to reach for it: the start of a quarter or initiative when you need to decide what to build, or a backlog that’s drifted away from any measurable outcome. Not for sprint planning, and not when nobody can articulate a measurable business goal (fix the goal first, separately).

What’s It For

“Which of these features, if we built them tomorrow, would move the number we’re supposed to be moving this quarter?”

Asked out loud in the right room, that question almost always lands like a flat stone hitting glass. The answer takes longer than anyone expects. Someone defends a feature by explaining why the customer asked for it. Someone else defends another by naming a competitor. A third person points to the roadmap. Nobody says, without hedging, this one will move the number because this specific group of users will do this specific thing differently. The question doesn’t get answered; it gets re-framed until it can be.

The backlog has drifted. Each individual feature connects to something — a request, a rival, a hunch — but nothing connects the features to each other, and nothing connects the set to an outcome anyone is measuring. Reasonable things are infinite. Strategy is the shortlist of reasonable things that share a goal, and the shortlist has gone missing.

Impact Mapping exists to make that shortlist visible. The wall shows the goal on the left and the work on the right, with the logic connecting them drawn explicitly in between. Work that can’t find a place on the wall can still be reasonable. It just doesn’t belong in this quarter.

Reach for it when:

You’re starting a quarter, an initiative, or a new product line and need to decide what to build
The team has a list of features but no shared story about how they connect to business outcomes
Stakeholders are requesting features and nobody is asking why
You need to say no to work and you want a defensible reason
The organisation is measuring an outcome and the team doesn’t know how their work connects to it

What It’s Not For

Skip it when:

You already have a clear, shared understanding of the connection between goal and work
You’re planning at the sprint level — Impact Mapping is strategic; sprint planning is tactical
Nobody in the room can articulate a measurable business goal — solve that first, separately
The goal is imposed from above without buy-in — a mapping session with a fake goal is worse than no session

Stop a session that’s already started if:

The goal can’t survive five minutes of scrutiny
Every impact the team writes is a feature in disguise
Key stakeholders aren’t in the room and the map depends on their actors
The disagreement about the goal is political — mapping through a fake goal produces a map nobody will use

Stopping and fixing the goal is not failure. Running a session that produces an elegant map of the wrong thing is.

Inputs

One measurable, time-bound business goal, written on a card before the session starts. “Increase weekly active subscribers from 200 to 500 by end of Q3.” Nothing else. If the goal isn’t concrete enough to fit on a card, the session isn’t ready to run.
Sticky notes in four colours: green (goal), yellow (actors), orange (impacts), blue (deliverables). A wall wide enough that all four columns can grow left-to-right without crowding.
A 90-minute slot with the right people in the room (see Who’s Needed) and no interruptions.

If the goal itself is unclear or the business model isn’t coherent, run Business Model Canvas first — the Canvas sets the strategy that Impact Mapping then executes against. If the team doesn’t yet know what the system does, Event Storming comes first to surface that.

Outputs

What lands on the wall at the end:

A four-column map — goal on the left, actors next, impacts next, deliverables on the right — with lines connecting each note to the column behind it.
A prioritised path: one actor, one impact, the smallest deliverable that might cause the impact, with a target metric, a target change, and a date written on the deliverable card. The deliverable is now a bet: a hypothesis that this work will move that number by that much by that date. If the metric doesn’t move, you walk back up the map.
A “not this quarter” list: every deliverable that didn’t make the priority path. Just as valuable as the priority list, because it’s the work the team has explicitly agreed not to do yet.
A defensible answer to “why are we building this” for every item on the priority path, traceable back to the goal through an actor and an impact.

Photograph the wall in panorama (good lighting, readable notes) plus one close-up shot per column. Mark the chosen path on the map — dot stickers, a pen line, a photograph with it circled.

These outputs feed straight into:

User Story Mapping — once Impact Mapping has chosen the deliverables, User Story Mapping lays out the user journey through them and slices it into releases.
Assumption Mapping — every impact on the map is an assumption. Assumption Mapping sorts which of those assumptions actually deserve a test run before the team commits to the deliverable.
Wardley Mapping — Impact Mapping tells you what to change; Wardley Mapping tells you where each component sits in its evolution and therefore how to change it. They compose well for strategic initiatives.

Who’s Needed

Four to six people, about ninety minutes:

Facilitator. Holds the shape of the tree (goal → actors → impacts → deliverables), keeps people from jumping columns, and calls out when someone has smuggled a deliverable into the actors column.
Product owner or business stakeholder. Mandatory. They own the goal. They’re the one who will defend it, refine it, or abandon it when the session reveals that the goal itself is the problem.
Developers. At least one, ideally two. They know what’s cheap and what’s expensive, which is what turns the deliverables column from wishful thinking into a triaged list.
Designers. They think about actor behaviour natively. In the impacts column — the hardest column — a designer’s framing (“they would finish the sign-up flow instead of abandoning it at step three”) is usually sharper than a developer’s.
People who talk to the actors. Sales, support, operations, account managers. Whoever has the least-filtered view of the people whose behaviour you’re trying to change. They will contradict the optimistic assumptions in the room and that is exactly what they’re there to do.
SRE / Operations. For infrastructure or reliability initiatives, SRE is the domain expert on actors and impacts — “on-call engineers stop being paged for the billing cron” is a valid impact, and “our customers stop opening support tickets about missed deliveries” is downstream of it.

Group size is 4–6. Impact Mapping is a thinking-aloud exercise and the room has to stay a conversation. Above six, it becomes a meeting with a whiteboard.

Who to leave out:

Large groups of stakeholders. If seven people need to shape the goal, that’s a pre-session, not this session. Come to Impact Mapping with the goal agreed.
People who can’t say no. Someone who will accept every proposed deliverable without challenge makes the prioritisation phase impossible.
Pure spectators. Impact Mapping is not a presentation; observers change the dynamic and absorb oxygen without contributing.

How To Run It

Phase	Duration	Notes colour	Key question
Orient on the goal	10 min	Green (one card)	“What are we trying to move?”
Actors	15 min	Yellow	“Whose behaviour affects the goal?”
Impacts	25 min	Orange	“How would their behaviour change?”
Deliverables	20 min	Blue	“What could we do to cause that change?”
Prioritise	10 min	Marks on the map	“What’s the highest-leverage path?”
Wrap-up	10 min	—	“Who owns what next?”
Total	~90 minutes

The map grows left-to-right, one column per phase. Skipping columns is the single most common failure mode. An impact without an actor is a feature. A deliverable without an impact is a hunch. The discipline of the four columns is the technique.

Impact Mapping alternates between open conversation and quiet placement. The goal is fixed; everything to the right of it is debatable. The key rhythm is to work backwards from the goal: goal before actor, actor before impact, impact before deliverable. Any deliverable that appears before its impact gets politely moved into a holding area until someone can connect it.
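For teams that transcribe the wall afterwards, the four columns map naturally onto a small tree. A hypothetical sketch: every class and field name below is an assumption made for illustration (there is no standard Impact Mapping data format), and the sample notes are invented:

```python
from dataclasses import dataclass, field

@dataclass
class Deliverable:
    name: str

@dataclass
class Impact:
    behaviour: str
    deliverables: list[Deliverable] = field(default_factory=list)

@dataclass
class Actor:
    name: str
    impacts: list[Impact] = field(default_factory=list)

@dataclass
class ImpactMap:
    goal: str
    actors: list[Actor] = field(default_factory=list)
    holding_area: list[Deliverable] = field(default_factory=list)

    def add_deliverable(self, deliverable, impact=None):
        """Attach a deliverable to its impact, or park it if it has none."""
        if impact is None:
            self.holding_area.append(deliverable)
        else:
            impact.deliverables.append(deliverable)

    def paths(self):
        """Yield every goal -> actor -> impact -> deliverable chain."""
        for actor in self.actors:
            for impact in actor.impacts:
                for d in impact.deliverables:
                    yield (self.goal, actor.name, impact.behaviour, d.name)

referral = Impact("Tell one friend within their first month")
subscribers = Actor("First-month subscribers", [referral])
wall = ImpactMap("Weekly active subscribers from 200 to 500 by end of Q3", [subscribers])

wall.add_deliverable(Deliverable("Referral programme"), referral)
wall.add_deliverable(Deliverable("Dark mode"))  # no impact yet, so it is parked

print(list(wall.paths()))
print([d.name for d in wall.holding_area])  # ['Dark mode']
```

Because a deliverable can only join the tree through an impact, anything unconnected ends up in the holding area by construction.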

Phase 1. Orient on the goal (10 minutes)

Put the goal card on the far left of the wall. Read it aloud:

“Our goal for this quarter is to increase weekly active subscribers from 200 to 500 by end of Q3. That’s on the wall. We are not here to debate whether this is the right goal. We are here to map how we could move it.”

Then make sure everyone understands it the same way:

“Before we go any further, what does ‘weekly active’ mean here? What counts as a subscriber? If someone paused mid-July, are they in or out of the 500?”

Often a five-minute clarification at this stage reveals that the goal is ambiguous, and the map would have split into three directions based on three different interpretations. Resolve it now. If you can’t, the session isn’t ready.

What to watch for:

The goal isn’t measurable. “Grow the business.” The session cannot proceed. End it and schedule a goal-setting conversation.
The goal is actually three goals. “Grow subscribers and reduce churn and increase average box size.” Pick one for this session. Run the others separately.
Silent disagreement. The goal is on the card but one person clearly doesn’t believe it. Surface it: “You look sceptical. Is the goal wrong or is it the number?”

Phase 2. Actors (15 minutes)

Ask the room:

“Whose behaviour, if it changed, would affect this goal? I want names or roles, not ‘users.’ Specific enough that we could identify them in our database or watch them at their desk.”

The team writes actors on yellow notes and places them in the second column. Actors can be external (subscribers, prospects, referrers, suppliers, journalists) or internal (support agents, warehouse staff, operations). They can be automated (the billing cron, the churn-prediction model, the weekly newsletter). Automated actors are first-class here — a scheduled job that sends the wrong email affects the goal exactly as much as a person who does.

Push for specificity:

“‘Subscribers’ is too broad. Which subscribers? First-month subscribers? Subscribers who’ve paused once and come back? Subscribers whose delivery day has changed in the last sixty days?”

Different slices of subscribers have different behaviour, and the map gets useful when the slices are named.

What to watch for:

Jumping to deliverables. “We need a referral programme.” That’s a blue note. Pull it back: “Who would use the referral programme? What behaviour would change? Start from the actor.”
Forgetting internal actors. Teams focus on external customers and forget that their own support team, warehouse, or on-call engineer is an actor whose behaviour affects the goal.
Forgetting adversarial actors. Churning subscribers are actors. People who try the service and don’t convert are actors. Don’t only list the actors you want to help.
Too many actors. Above eight, the map becomes unreadable. Group the similar ones or focus on the actors most central to the goal.

Phase 3. Impacts (25 minutes)

This is the hardest and most valuable phase. For each actor, ask:

“How could their behaviour change in a way that helps us hit the goal? I want verbs. What would they do differently?”

Write impacts on orange notes. Each impact sits in column three, connected to its actor. Good impacts are behaviour changes, not features:

“New visitors sign up on their first visit instead of leaving to think about it.” (good)
“Existing subscribers tell one friend within their first month.” (good)
“On-call engineers get paged fewer than twice per week for billing issues.” (good, SRE flavour)
“We build a landing page with better copy.” (not an impact — that’s a deliverable)

Then flip it:

“Now the negative version. How could their behaviour change in a way that hurts the goal?”

Negative impacts are where the defensive work lives and where the risks hide. They are usually where the biggest savings come from: preventing a bad behaviour is often cheaper than causing a good one.

What to watch for:

Deliverables disguised as impacts. “Subscribers use the mobile app” is a deliverable dressed up as a behaviour. The impact is “subscribers manage their subscription on the go”; the app is one possible deliverable.
Vague impacts. “Subscribers are happier.” Not actionable. Push: “What would a happier subscriber do differently? Stay longer? Refer? Upgrade? Complain less?”
One actor absorbing all the attention. Time-box each actor. You can come back if needed.
No negative impacts. Prompt directly: “What could this actor do that would make the goal harder to hit?” If the answer is nothing, the actor probably doesn’t belong on the map.

Phase 4. Deliverables (20 minutes)

For each impact, ask:

“What could we build, do, write, or change to cause this behaviour? I want a list, not a single answer.”

Write deliverables on blue notes and place them in the fourth column, connected to their impact. A good deliverables column contains multiple options per impact, ordered roughly from cheapest to most ambitious:

Features (a referral programme, a pause flow, a rollback automation)
Content (a welcome sequence, a runbook, a one-pager for support)
Processes (a proactive call to at-risk subscribers, a handover checklist for on-call)
Experiments (a landing page, a prototype, a manual concierge version of the feature)
Changes to existing things (copy edits, configuration tweaks, prompt updates)

The most valuable column in Impact Mapping is not the widest — it’s the one where cheap experiments live next to expensive builds. If every deliverable is a multi-month project, you’ve filled the column wrong.

What to watch for:

Pet features appearing. Someone places a deliverable they’ve wanted to build but can’t connect to an impact. Challenge gently: “Which impact does this serve? If we built it, whose behaviour would change?” If they can’t answer, park it.
Only big deliverables. Push for cheap ones: “What’s the smallest thing we could do this week that would tell us whether the impact is real?”
Duplicate deliverables. The same deliverable might serve multiple impacts. That’s a signal: draw lines to both. High-leverage deliverables are the ones that show up in multiple places.

Phase 5. Prioritise (10 minutes)

Step back. Look at the whole map. You now have a visual argument from goal to work.

Use it to pick the first path. Ask four questions in order:

“Which actor has the most influence on this goal? Not the most numerous, the most influential.”

“Which impact is the highest-leverage one for that actor? If we only caused one behaviour change, which one would move the number most?”

“Which deliverable is the cheapest way to test whether we can actually cause that impact?”

“What measurable change in this impact would tell us the deliverable worked, and by when?”

Write the answer to the fourth question on the deliverable card: the target metric, the size of the change, the date. The deliverable is now a bet: a hypothesis that this work will move that number by that much by that date. If the metric doesn’t move, you walk back up the map. Maybe the deliverable was wrong; maybe the impact wasn’t what mattered. The map is a path of bets, not a plan of work.
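If the bet ends up in a tracker rather than on a card, it needs the same three things plus a baseline. A hypothetical sketch, assuming a metric where higher is better; the `Bet` class, its fields, and the sample numbers are all illustrative:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Bet:
    deliverable: str
    metric: str
    baseline: float  # where the metric stands today
    target: float    # where the bet says it will be
    due: date        # by when

    def verdict(self, observed: float, today: date) -> str:
        """Assumes a higher-is-better metric."""
        if observed >= self.target:
            return "worked"
        if today < self.due:
            return "still running"
        return "walk back up the map"

bet = Bet(
    deliverable="Referral programme",
    metric="referrals per first-month subscriber",
    baseline=0.1,
    target=0.3,
    due=date(2026, 9, 30),
)

print(bet.verdict(0.15, date(2026, 8, 1)))   # still running
print(bet.verdict(0.15, date(2026, 10, 1)))  # walk back up the map
print(bet.verdict(0.35, date(2026, 9, 1)))   # worked
```

Writing down the losing condition in advance is the point: a missed date with an unmoved metric sends the team back to the impact or the actor rather than on to the next feature.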

Mark the chosen path on the map — dot stickers, a pen line, a photograph with it circled. This is your first commitment. Everything else on the map is the second commitment, the third commitment, or “not this quarter.”

What to watch for:

Prioritising by excitement. The team gravitates toward the interesting technical deliverable rather than the high-leverage one. Redirect to the goal: “Which of these moves the number most?”
Trying to do everything. The map has twenty deliverables; the team wants to do all of them. Hold firm: “Pick the top three. If they work, we come back for more.”
Ignoring the map after prioritising. Someone argues for a deliverable that isn’t on the map. Either put it on the map properly (with actor and impact) or park it.

A worked example

See Impact Mapping: Connecting Work to Goals for the Greenbox team’s first mapping session — including the moment they realise a feature they’ve been planning for six weeks doesn’t connect to any impact on the wall, and the relief of deciding not to build it.

What Can Go Wrong

The goal debate. The team starts arguing about whether the goal is right. Recovery: “We can’t map a goal we don’t agree on. Let’s pause the session, fix the goal with leadership in the next forty-eight hours, and reconvene.” Stop if: The disagreement is political. Mapping through a fake goal produces a map nobody will use.

The solution-first thinker. Someone keeps proposing deliverables without connecting them to impacts. Recovery: Give them a specific job: “For every deliverable you think of, I need a yellow note and an orange note first. Who does it affect? What behaviour changes?” Stop if: They can’t hold the shape after three prompts. Pair them with a designer for the rest of the session.

Analysis paralysis. The team is stuck debating whether something is an actor, an impact, or a deliverable. Recovery: “The columns exist to structure thinking, not to be perfectly taxonomic. Put it in the best-fit column and move on.” Stop if: The same argument happens on a third note. The team is avoiding the harder conversation; name it.

The sceptic. Someone thinks the exercise is pointless because “we already know what we’re building.” Recovery: Ask them to place their planned work on the map. “Take your top three items. Which actor? Which impact? Place them.” If they can’t connect the work to the goal through an actor and an impact, the exercise has just earned its keep and they usually become the most engaged participant in the room. Stop if: They refuse to engage. They’re not blocking the session, just their own learning. Carry on without them.

The everything-is-high-impact problem. Every impact the team writes feels like it moves the goal equally. Recovery: Force a ranking: “If we could only cause one of these impacts, which one? Now pretend I’ve taken that one away — which of the rest?” Stop if: The team genuinely can’t distinguish. The goal is probably too abstract to map against; sharpen it before continuing.

The absent stakeholder. Halfway through, the team realises an actor is owned by someone not in the room. Recovery: Put a pink note on that actor: “Need to talk to [person] before we map this.” Carry on with the actors you can map. Stop if: More than half the map depends on people who aren’t in the room. You’re mapping without the right participants.

Common failure modes to watch for across the whole session:

The map gets produced, photographed, and then ignored because the backlog keeps being the source of truth
The deliverables column is full of big builds and no cheap experiments
One actor absorbs the entire conversation and the rest of the map is thin
The team confuses “we wrote it down” with “we agreed on it” — absent stakeholders discover the map later and veto half of it
Impacts drift into features and nobody catches it

Next Steps

The session ends; the work begins.

Same day, the facilitator:

Takes panoramic photographs of the map. Good lighting, readable notes, one shot per column in close-up.
Transcribes the map into a shared document or a digital mind-mapping tool — goal, actors, impacts, deliverables, and the lines between them.
Writes a one-page summary message to participants and stakeholders: here’s the goal, here’s the prioritised path, here’s what we’re deferring.

This week, the product owner:

This is where the pattern earns back its cost, and the work is mostly the product owner’s.

Turn the priority path into backlog items. Each deliverable on the priority path becomes a backlog item — but with the actor and impact captured in the description. When someone later asks “why are we building this,” the backlog item contains the answer.
Park the deferred deliverables explicitly. A “not this quarter” list is as valuable as the priority list. Put it somewhere visible. When new work gets proposed, the first question should be “does this displace something on the not-now list?”
Schedule discovery for the experiments. Cheap experiments from the deliverables column — landing pages, manual concierge runs, interview scripts — need to be started within a week of the session. If they sit, the map’s value decays fast.
Walk the map to absent stakeholders. Anyone who should have been in the room but wasn’t gets a walk-through. Their challenges either strengthen the map or reveal a problem you need to fix before committing.
Use the map to say no. This is the hardest and most important week-after task. The map gives you a defensible reason to refuse work that doesn’t connect. Use it.

Ongoing, the team:

Reviews the map at the start of each quarter or planning cycle. Has the goal changed? Have you learned which impacts actually work? Update and re-prioritise.
When someone proposes new work, asks them to place it on the map. If it doesn’t trace back to the goal through an actor and an impact, it probably isn’t worth doing — or the map needs to grow.
Keeps the photographed map visible where the team works. It’s the reference that prevents the slow drift back into feature-list thinking.
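The "place it on the map" test is mechanical enough to sketch in code. What follows is a minimal illustration, not a tool the post prescribes: the class names, the `trace` and `backlog_description` helpers, and the example goal, actor, and impact are all invented for the sketch. It shows one way a transcribed map could answer "why are we building this?" for a backlog item, and what comes back when proposed work has no path to the goal.

```python
from dataclasses import dataclass, field


@dataclass
class Impact:
    behaviour_change: str
    deliverables: list[str] = field(default_factory=list)


@dataclass
class Actor:
    name: str
    impacts: list[Impact] = field(default_factory=list)


@dataclass
class ImpactMap:
    goal: str
    actors: list[Actor] = field(default_factory=list)

    def trace(self, deliverable: str) -> list[tuple[str, str]]:
        """Return every (actor, impact) pair that leads to the deliverable."""
        return [
            (actor.name, impact.behaviour_change)
            for actor in self.actors
            for impact in actor.impacts
            if deliverable in impact.deliverables
        ]


def backlog_description(impact_map: ImpactMap, deliverable: str) -> str:
    """Build a backlog description that carries the actor and the impact."""
    paths = impact_map.trace(deliverable)
    if not paths:
        return f"{deliverable}: no path to the goal. Drop it, or grow the map."
    reasons = "; ".join(f"{actor} -> {impact}" for actor, impact in paths)
    return f"{deliverable}: serves '{impact_map.goal}' via {reasons}"


# Illustrative data only: this goal, actor, and impact are invented.
example_map = ImpactMap(
    goal="Grow active subscribers this quarter",
    actors=[
        Actor(
            name="Existing subscriber",
            impacts=[Impact("Recommends the box to a friend", ["Referral link"])],
        )
    ],
)

print(backlog_description(example_map, "Referral link"))
print(backlog_description(example_map, "Dark mode"))
```

The second call is the interesting one: work that cannot name an actor and an impact gets an explicit "no path" answer instead of quietly joining the backlog.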

The benefits compound when this becomes routine: work connected to outcomes instead of to opinions; a defensible answer to “why are we building this” for every item in the quarter; a visible short list, three deliverables picked out of twenty, that the team commits to first; a “not now” list that is just as valuable; a shared mental model of the business strategy that developers, designers, and product all recognise.

The costs are real too: 6–9 person-hours per session with 4–6 people; a pre-session goal-setting conversation (sometimes the real work); political cost when the map reveals that work people wanted to do doesn’t connect to any goal; quarterly recurrence — the map goes stale as the goal, the actors, and the learnings move.

Sibling sessions that often follow:

Event Storming — Event Storming describes how things happen now; Impact Mapping describes what you want to change. Run Impact Mapping first when you’re choosing what to build; run Event Storming first when you’re understanding what already exists.
User Story Mapping — once Impact Mapping has chosen the deliverables, User Story Mapping lays out the user journey through them and slices it into releases.
Assumption Mapping — every impact on the map is an assumption. Assumption Mapping sorts which of those assumptions actually deserve a test run before the team commits to the deliverable.
Business Model Canvas — when the goal itself is unclear or the business model isn’t coherent, the Canvas is the session to run before Impact Mapping. The Canvas sets the strategy; Impact Mapping executes against it.
Wardley Mapping — Impact Mapping tells you what to change; Wardley Mapping tells you where each component sits in its evolution and therefore how to change it. They compose well for strategic initiatives.

Variants

Quarterly strategic mapping (default). One measurable goal, 4–6 people, ninety minutes, four columns built left-to-right with prioritisation at the end. Output: a priority path of bets and a “not this quarter” list. This is what most teams need, and the rest of this post describes it.

Initiative or product-line mapping. A new product line, a major bet, or a discrete initiative. Same shape, but the goal sits at the initiative level rather than the quarter, and the session may run longer (two to three hours) because the actors and impacts are less familiar. Run it once at kick-off, refresh it every six to eight weeks.

Infrastructure or reliability mapping. When the goal is operational (“reduce on-call pages by 50% by end of Q3,” “cut mean time to recovery from forty minutes to ten”), SRE takes the product owner’s seat and the actors column fills with on-call engineers, paging systems, support agents, and the customers downstream of incidents. The four-column shape is identical; the vocabulary shifts. Particularly useful when an SRE team needs to defend why reliability investment moves a business number.

Remote. A Miro or Mural board with the four columns pinned, video call for the conversation. Slightly slower than the in-person rhythm, but the structure transfers cleanly. Use one shared cursor: only the facilitator places notes, prompted by the team, to keep the map legible. Take screenshots at the end of each phase rather than waiting for the close.

User Story Mapping: Seeing the Whole

2026-04-18T06:00:00+08:00

The Greenbox team has been busy. Event Storming gave them shared understanding. Example Mapping made their stories concrete. Impact Mapping connected their work to business goals.

The result? Eighty-three stories in the backlog.

Eighty-three. And climbing. Every discovery session has surfaced more work, more edge cases, more things that need building. Priya scrolls through the list on her screen and her face goes still. Tom exhales audibly.

Lee looks at the list. “Eighty-three is a lot. How many are actually ready to build?”

Tom scrolls. “Maybe… twenty? The rest are ideas, or things that came out of red cards, or stuff Maya mentioned once in a standup.”

“Right. Part of what we’re going to do today isn’t just organise these. It’s be honest about which ones you actually need.”

Tom has a sharper question. “Impact Mapping told us what matters. But I’m looking at the referral programme and it’s five stories. The shortfall tool is three. I don’t know which stories need to ship together to actually work. If I build referral link generation but not referral tracking, have I shipped anything useful?”

He’s not wrong. A flat backlog tells you what to build. Impact Mapping told them what matters. But neither shows how stories relate to each other, where the gaps are, or what subset adds up to a coherent experience.

What User Story Mapping is

User Story Mapping is Jeff Patton’s technique for visualising the user journey and organising stories within it. Instead of a flat list, you arrange stories in a two-dimensional map.

Three layers:

Activities run left to right across the top. The big things a user does, roughly chronological. This row is called the “backbone.”
Tasks sit below each activity. The specific things a user does within that activity.
Stories sit below each task, prioritised top to bottom. Must-haves at the top, nice-to-haves below.

Activities (the backbone) →	Activity 1	Activity 2	Activity 3	Activity 4	Activity 5	Activity 6
Tasks	Task	Task	Task	Task	Task	Task
Stories	Story (must-have)	Story (must-have)	Story (must-have)	Story (must-have)	Story (must-have)	Story (must-have)
	Story (should-have)	Story (should-have)	Story (should-have)	Story (should-have)		
	Story (nice-to-have)	Story (nice-to-have)				

Left to right tells you the user’s journey. Top to bottom tells you priority.
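If it helps to see the three layers as a data structure, here is a small sketch. The class names, the numeric priority scale, and the `walk` and `empty_tasks` helpers are assumptions made for illustration; the technique itself lives on a wall, not in code.

```python
from dataclasses import dataclass, field


@dataclass
class Story:
    title: str
    priority: int  # assumed scale: 1 = must-have, 2 = should-have, 3 = nice-to-have


@dataclass
class Task:
    name: str
    stories: list[Story] = field(default_factory=list)


@dataclass
class Activity:
    name: str
    tasks: list[Task] = field(default_factory=list)


def walk(backbone: list[Activity]) -> list[tuple[str, str, str]]:
    """Read the map left to right, then top to bottom within each task."""
    rows = []
    for activity in backbone:  # list order is the user's journey
        for task in activity.tasks:
            for story in sorted(task.stories, key=lambda s: s.priority):
                rows.append((activity.name, task.name, story.title))
    return rows


def empty_tasks(backbone: list[Activity]) -> list[tuple[str, str]]:
    """Tasks with no stories under them: the gaps a flat list hides."""
    return [
        (activity.name, task.name)
        for activity in backbone
        for task in activity.tasks
        if not task.stories
    ]


# Placeholder names, matching the generic diagram above.
backbone = [
    Activity("Activity 1", [Task("Task A", [Story("Nice extra", 3), Story("Core step", 1)])]),
    Activity("Activity 2", [Task("Task B", [])]),
]

print(walk(backbone))
print(empty_tasks(backbone))
```

The backbone is an ordered list because order carries meaning; priority is a field on the story because it only makes sense relative to its neighbours in the same column.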

Building the Greenbox story map

Jas suggests running the session. She grabs a fresh wall, a stack of sticky notes, and the whole team. Lee watches her take charge of the room, finding the markers, arranging the space, framing the question, and says nothing until afterwards. Then he tells Maya quietly: “She’s good. She thinks about the whole journey, not just the screen.”

“Let’s start with the backbone,” Jas says. “What are the big activities a customer goes through, from first hearing about us to becoming a loyal subscriber?”

After fifteen minutes, six activities:

Discover Greenbox → Browse Boxes → Subscribe → Receive First Box → Manage Subscription → Refer a Friend

That’s the backbone. Not the system’s internals, the person’s experience.

Adding tasks and stories

The team fills in the tasks under each activity, then takes their eighty-three stories and places them. Some map neatly. Others don’t fit anywhere, which is revealing in itself.

Sam sticks a note under “Discover Greenbox” and pauses. “I’ve got three marketing stories here, social media, SEO, press outreach. But none of them connect to anything Tom or Priya are building. If I run a press campaign and someone signs up, is the onboarding experience actually ready for them?”

Everyone looks at the map. There’s a gap. The “Discover” column has marketing work, but “Browse” and “Subscribe” are sparse.

“That’s exactly why we’re doing this,” Lee says. “A flat backlog would never have shown you that gap.”

Sam points at the gap between “Subscribe” and “Receive First Box.” “After signup, the customer’s next touchpoint is the box arriving. If anything goes wrong, payment fails, delivery delayed, substitution they hate, the only way they can tell us is email. We have no status page, no tracking, no FAQ.” She pulls up her spreadsheet. “Sixty percent of my inbox is people asking things they should be able to find themselves.”

They add new stories where the map shows gaps. Under “Check delivery area,” there were no stories at all. Under “Manage Subscription,” there were fourteen, far more than any other activity. One card, almost hidden in the supply-side column, reads “farm reliability scoring.” It goes up without much discussion.

Three things jump out:

Gaps. The “Discover” column is thin. Sam flags it: “We can build the best subscription experience in the world, but if nobody knows we exist, it doesn’t matter.”

Over-investment. Fourteen stories about pausing, changing, upgrading, downgrading, cancelling. Is that where the team should spend its energy before they have more subscribers?

Missing connections. Nothing between “Receive First Box” and “Manage Subscription.” What happens after someone gets their first box? How do they become a regular?

None of this was visible in the flat backlog.

Release slicing

This is where User Story Mapping earns its keep. Instead of arguing about which stories to do first, the team draws horizontal lines across the map. Each line defines a release. Everything above ships in that release. Everything below waits.

The rule: each release must tell a complete story. You can’t ship “Subscribe” without “Receive.” Each horizontal slice must be a usable product, even if it’s thin.

Release	Discover	Browse	Subscribe	Receive	Manage	Refer
Release 1 (MVP)	Landing page with value prop	Show two box sizes; Show sample contents	Stripe checkout; Collect address	Email: box is on its way	-	-
Release 2 (Operational)	SEO basics	Delivery area checker	-	Delivery tracking link; Rate this box	Pause for one week; Change box size	-
Release 3 (Growth)	Instagram integration	Seasonal calendar	Gift subscription	Photo of your farmer	Update payment card; Cancel with feedback form	Generate referral link; Friend gets 10% off

Release 1 (MVP): Find Greenbox, see what’s on offer, subscribe, pay, receive a box with a notification. The bare minimum to prove someone will pay.

Release 2 (Operational): Delivery tracking, pause and resize, delivery area checker, basic feedback. The essentials for keeping people subscribed.

Release 3 (Growth): Referral programme, gift subscriptions, seasonal calendar, Instagram integration. Growth features that only matter once the product works.

The team didn’t argue about whether referrals or delivery tracking should come first. The map made the answer obvious. Referrals are meaningless without a product worth referring.
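The slicing rule can be sketched as a check. This is an illustration under stated assumptions: it uses a subset of the Greenbox stories, the activity each story sits under is a guess made for the sketch, and `CORE_JOURNEY` (the activities a first release must cover to count as a complete story) is a hypothetical definition, not something the team wrote down.

```python
BACKBONE = ["Discover", "Browse", "Subscribe", "Receive", "Manage", "Refer"]

# Assumed minimum journey for a first release; adjust to your own product.
CORE_JOURNEY = ["Discover", "Browse", "Subscribe", "Receive"]

# (story, activity, release). A subset of the slices; the activity
# assignments are assumptions for illustration.
STORIES = [
    ("Landing page with value prop", "Discover", 1),
    ("Show two box sizes", "Browse", 1),
    ("Stripe checkout", "Subscribe", 1),
    ("Email: box is on its way", "Receive", 1),
    ("Delivery tracking link", "Receive", 2),
    ("Pause for one week", "Manage", 2),
    ("Generate referral link", "Refer", 3),
]


def shipped_by(release: int) -> dict[str, list[str]]:
    """Everything above the line for this release, grouped by activity."""
    columns = {activity: [] for activity in BACKBONE}
    for title, activity, slice_number in STORIES:
        if slice_number <= release:  # earlier releases are already shipped
            columns[activity].append(title)
    return columns


def uncovered(release: int, journey: list[str]) -> list[str]:
    """Activities in the journey with no shipped story by this release."""
    columns = shipped_by(release)
    return [activity for activity in journey if not columns[activity]]


print(uncovered(1, CORE_JOURNEY))  # empty list means the slice is coherent
print(uncovered(1, BACKBONE))      # what Release 1 deliberately leaves out
```

An empty result against the core journey is the code version of "each slice must be a usable product, even if it's thin." The second call lists what waits below the line.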

When to use User Story Mapping

You’re planning releases and need to figure out what subset makes a coherent, shippable product.
You’ve lost the plot. “We have 80 stories and no idea what to ship first” is the classic symptom.
Product and engineering are misaligned. The story map bridges user journeys and engineering components on the same wall.

When not to use it

You need to refine individual stories. That’s Example Mapping.
You need to understand the domain. That’s Event Storming.
You have a small, well-understood scope. Three stories don’t need a map. They need a whiteboard.

What happens next

Looking at the wall, the team feels like they’ve cracked it. Four workshops, each building on the last, and they have a clear path from where they are to where they need to be. Event Storming gave them the domain. Example Mapping gave them concrete stories. Impact Mapping connected the work to goals. And now User Story Mapping shows them the whole journey, with release slices that make planning obvious instead of political.

“I wish we’d done this four weeks ago,” Tom says.

Jas smiles. “We didn’t know enough four weeks ago. We needed Event Storming to understand the domain, and Example Mapping to make the stories real. This built on top of all that.”

The team knows what to build. They know the order. They know what a coherent release looks like. For the first time, the whole product is visible on a single wall.

But a harder question is coming. The story map is beautiful. The release slices are clean. The team is confident they’re building the correct thing.

They haven’t asked the customers yet.

Forty percent monthly churn will force that question in ways the team doesn’t expect. Because it turns out the answer to “what’s for dinner?” isn’t just a box of vegetables, it’s a job to be done.

The Workshop: Sprint Planning

2026-04-17T06:00:00+08:00

Two weeks, one goal, a plan everyone in the room committed to. Sprint Planning is the workshop where a refined backlog becomes a sprint: a contract about what you’ll learn by the end, not a to-do list with a deadline. For the worked example, see Turning Sticky Notes into Delivery.

Sprint Planning

Sprint Planning turns a refined, prioritised backlog into a sprint the team can commit to: a goal, a set of stories that fit capacity, a task breakdown, and an explicit commitment from everyone in the room. Often called planning or iteration planning. Frequently confused with roadmap planning (which spans months) and with the story-preparation work that belongs before planning, not inside it. The ritual is Scrum’s, but the pattern, “here is what we’ll do and here is why we believe it”, predates Scrum by decades.

At a glance

Who, for how long: the whole team (4-9 people: facilitator, product owner, developers, tester, ops where relevant), one hour per sprint week.
What you walk out with: a sprint goal the team can state from memory, a set of stories that fit capacity, a task breakdown concrete enough for daily standup, and an explicit commitment from every person in the room.
When to reach for it: a refined backlog needs to become a sprint the team genuinely believes in. Not for unrefined stories (run Example Mapping first), roadmap-scale planning, continuous-flow teams with no sprint boundary, or any session the product owner can’t attend.

What’s It For

A team finishes a sprint having delivered six of the eight stories they pulled in. The two unfinished stories get carried over. The next sprint they pull in ten stories to “catch up” and finish five. The sprint after that they pull in twelve. Velocity (the team’s average completed points per sprint, smoothed over the last few sprints) is now invisible, commitment has become theatre, and the team quietly stops believing the numbers.

What went wrong wasn’t the work; it was the planning. Nobody said out loud what the sprint was actually for. Stories got pulled in because they were next in the list, not because they served a goal. When a fire came up mid-sprint, the team had no way to decide whether to fight it or park it, because there was no goal to measure the fire against. Every sprint became a slightly different version of “do as much as possible,” which is indistinguishable from “do whatever’s loudest.”

Sprint Planning exists to break that pattern. The sprint goal is the contract. The stories are the plan. When the plan has to change (and it always does) the goal is the thing you steer by. Without a goal, a sprint is a to-do list. With one, it’s a commitment.

Reach for it when:

You work in sprints of one or two weeks
The top of the backlog has been refined: stories have acceptance criteria, have been through Example Mapping or equivalent, and are sized
The whole team can attend
Someone can articulate what the sprint should achieve, not just what should be done

What It’s Not For

Skip it when:

The top of the backlog is a mess. Prepare the stories first, in a separate session; planning is not story preparation with an audience. Example Mapping is the usual gate before planning.
The team operates in continuous flow and has no sprint boundary.
You’re planning more than one sprint ahead. That’s roadmap work, a different session.
The product owner can’t attend. Reschedule.

Stop a session that’s already started if:

Three stories in a row need preparation work mid-planning. The top of the backlog isn’t ready; end the session, schedule the preparation, and come back
The product owner won’t accept capacity as a constraint. That’s a systemic problem, not a session problem
Two sprints in a row the committed plan wasn’t achievable. The sprint length, the preparation process, or the estimation practice is broken, not the planning session

Ending planning and scheduling something else is not failure. Committing to a sprint nobody believes is.

Definitions & Background

The sprint goal is one sentence describing what the sprint must achieve. Not a list. A capability the team delivers, a metric they move, a problem they solve. “Subscribers can pause and resume their subscriptions through the web app” is a goal. “Make progress on the subscription area” is not.

Capacity is not velocity. Velocity is a historical average: what the team has delivered over recent sprints. Capacity is this sprint’s actually-available time: velocity, minus on-call rotations, minus holidays and leave, minus scheduled meetings outside the sprint cadence, minus carry-over (stories committed last sprint that didn’t ship) still in flight. Teams that plan to last sprint’s velocity in a sprint with two people on holiday over-commit by definition.
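The arithmetic is simple enough to write down. A minimal sketch, assuming every drain has already been converted into the same unit the team sizes in (the conversion from days of leave to points is left to the team); the function names and the numbers in the example are invented for illustration. The three-sprint window is the one this pattern uses for recent velocity.

```python
from statistics import mean


def recent_velocity(delivered_totals, window=3):
    """Average the most recent delivered (not committed) sprint totals."""
    if not delivered_totals:
        raise ValueError("need at least one delivered sprint total")
    return mean(delivered_totals[-window:])


def sprint_capacity(velocity, on_call=0, leave=0, outside_meetings=0, carry_over=0):
    """Velocity minus every known drain on this sprint, never below zero."""
    drains = on_call + leave + outside_meetings + carry_over
    return max(0, velocity - drains)


# Illustrative numbers only: five past sprints, the last three are used.
velocity = recent_velocity([26, 31, 28, 30, 32])
capacity = sprint_capacity(velocity, on_call=3, leave=3)
print(velocity, capacity)
```

Planning to `velocity` here would over-commit by six points before a single story had been discussed.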

Points are abstract sizing units, usually a Fibonacci-like 1/2/3/5/8/13 scale. T-shirt sizes or hours work too; the unit matters less than using the same one consistently.

The four phases, set, select, break down, commit, each have a different shape:

Goal-setting is a conversation between the product owner and the team, moderated by the facilitator. The product owner proposes, the team pressure-tests, the facilitator writes the agreed goal somewhere everyone can see it for the rest of the session.
Story selection is a negotiation. The product owner defends priority; the team asserts capacity; the facilitator holds the capacity number honest.
Task breakdown is team work. The product owner is available for questions but not driving. Developers, testers, and ops decompose each story into tasks that fit inside a day.
Commitment check is a round-the-room moment. Each person says, in their own words, whether they believe the plan is achievable. This is the contract.

If any of the four collapse into the others, the ritual stops working.

Inputs

A refined backlog. Stories at the top with acceptance criteria, sized, and through Example Mapping or equivalent. A story that hasn’t been through Example Mapping is not ready to plan; plan it anyway and you’ll be mid-sprint when you find out why.
The team’s actual capacity for this sprint: velocity, minus on-call rotations, minus holidays and leave, minus carry-over still in flight. Calculate this before the session starts and write it on the board.
Recent velocity: the average of the last three delivered totals, not committed totals.
The whole team available for the duration of the session.

Nothing else needs to be prepared in-session. If it does, the preparation didn’t happen. If the backlog is chaotic, User Story Mapping is where that gets fixed, not in planning. If the sprint goal can’t be traced back to a meaningful outcome, Impact Mapping is the upstream conversation.

Outputs

What lands at the end of the session:

A sprint goal the whole team can state from memory: the one line that makes mid-sprint trade-offs decidable.
A selected set of stories that fits inside capacity, with priority order intact.
A task breakdown for each story, concrete enough that the daily standup has something to work against.
An explicit commitment from everyone in the room, not silent acquiescence.
Visible dependencies, on-call load, and carry-over so the sprint doesn’t break on something the plan didn’t acknowledge.

Photograph the whiteboard if the task breakdown happened on physical sticky notes. The sprint goal goes where everyone can see it: team board, wiki, Slack topic, the top of the sprint in the tracker.

These outputs feed straight into:

Retrospectives: the retrospective is where you notice that planning sessions have become theatre and fix the ritual before it fully collapses.
Ensemble Programming: when the task breakdown keeps flagging one developer as the sole person who can touch a system, ensemble work is the pattern that breaks the bottleneck.

Who’s Needed

The whole team, typically 4-9 people, for one hour per sprint week:

Facilitator. Runs the session, holds the clock, keeps the team off preparation detours. Often the Scrum Master (the role that protects the team from disruption and removes blockers), team lead, or a rotating role.
Product owner. Mandatory. They set the sprint goal, explain the stories, and make the trade-off calls when capacity doesn’t match ambition. Without them in the room, you are planning to build the wrong thing.
Developers. The whole development team. Sprint Planning is the one meeting where partial attendance breaks the ritual. If a developer isn’t in the room, they haven’t committed, and the commitment is what the meeting is for.
Tester / QA. If they sit with the team, they’re part of the team, and they plan with the team. Testing capacity is capacity. Treat it that way.
Operations / SRE. For any team whose sprint includes deployment work, on-call rotations, or infrastructure change, ops is a first-class planning participant. On-call load is a real capacity drain and if it isn’t in the plan it will consume the plan anyway.

Sprint Planning is one of the few patterns that scales with team size; larger teams need longer sessions, not different ones.

Who to leave out:

Stakeholders. They shape the backlog before planning and see the output afterwards. They don’t attend planning itself. Observers warp the commitment conversation because people self-censor in front of the people they serve.
Other teams. Dependencies on other teams belong in the task breakdown as risks, not in the room as people.
Senior leaders with “just a quick ask.” The just-a-quick-ask is the thing that destroys the sprint goal you’re supposed to be setting. Leadership input happens in the story-preparation work upstream of planning, not in planning itself.

How To Run It

Phase	Duration (1-week sprint)	Duration (2-week sprint)	Key question
Sprint goal	10 min	15 min	“What must this sprint achieve?”
Story selection	20 min	40 min	“What fits, against real capacity?”
Task breakdown	25 min	50 min	“What are the concrete steps?”
Commitment check	5 min	10 min	“Does everyone believe we can do this?”
Total	1 hour	~2 hours

The rule of thumb is one hour per sprint week. Most teams beat that once they’ve run the pattern a few times and once upstream story preparation is reliable. If you’re consistently running longer, the problem is upstream: stories are arriving at planning unready.
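For a facilitator who wants clock times on the invite, the table converts directly into an agenda. A small sketch: the phase durations are copied from the table above, while the `agenda` helper and the start time in the example are assumptions for illustration.

```python
from datetime import datetime, timedelta

# Minutes per phase, copied from the table above, keyed by sprint length in weeks.
PHASE_MINUTES = {
    1: [("Sprint goal", 10), ("Story selection", 20),
        ("Task breakdown", 25), ("Commitment check", 5)],
    2: [("Sprint goal", 15), ("Story selection", 40),
        ("Task breakdown", 50), ("Commitment check", 10)],
}


def agenda(start: datetime, sprint_weeks: int) -> list[tuple[str, str, str]]:
    """Return (phase, starts, ends) rows with times as HH:MM strings."""
    rows = []
    cursor = start
    for phase, minutes in PHASE_MINUTES[sprint_weeks]:
        end = cursor + timedelta(minutes=minutes)
        rows.append((phase, cursor.strftime("%H:%M"), end.strftime("%H:%M")))
        cursor = end
    return rows


# Hypothetical session starting at 10:00 for a two-week sprint.
for phase, starts, ends in agenda(datetime(2026, 1, 5, 10, 0), sprint_weeks=2):
    print(f"{starts}-{ends}  {phase}")
```

Writing the end time of each phase on the board gives the facilitator something concrete to point at when a phase overruns.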

Phase 1. Sprint goal (10-15 minutes)

Before any story is discussed, the product owner proposes a sprint goal. Not a list. A sentence.

Open with:

“Before we look at stories, let’s agree what this sprint is for. Product owner, if we could only ship one thing this sprint, one capability, one metric we move, one problem we solve, what is it?”

Write whatever they say on the whiteboard. Then pressure-test it with the team:

“Is this achievable in one sprint? Is this worth a sprint? Does everyone in this room understand why this matters?”

The goal should be specific, achievable in one sprint, and measurable or demonstrable. A good goal sounds like: “Subscribers can pause and resume their subscriptions through the web app.” A bad goal sounds like: “Make progress on the subscription area.”

Once the team accepts the goal, write it in large letters at the top of the whiteboard. Everything that follows is in service of this goal.

What to watch for:

No goal, just a list. The product owner says “let’s just pull in as much as we can.” Push back: “If we could only ship one thing this sprint, what would it be?” Refuse to move to story selection until a goal is on the board.
Vague goal. “Make progress on subscriptions” is not a goal. Push: “What would a subscriber be able to do at the end of this sprint that they can’t do now?”
Three goals disguised as one. “Pause, resume, and billing integration” is a roadmap item, not a sprint goal. Help the product owner pick the most important one; the others come next sprint.
SRE sprint goal looks different. For an ops-heavy sprint, the goal might be “Reduce deployment rollback rate from 1 in 5 to 1 in 20” or “Move the billing cron to the new scheduler with zero missed runs.” Same shape, same test: specific, achievable, demonstrable.

Phase 2. Story selection (20-40 minutes)

Start from the top of the backlog. For each story, the product owner gives a thirty-second explanation of what it is and why it serves the sprint goal. The team confirms they understand it (a quick nod around the room is enough; if it’s more than a nod, the story wasn’t refined). Then the team decides whether it fits.

Keep a running tally of points (or t-shirt sizes, or hours, or whatever you size in) against capacity, visible at the edge of the whiteboard. When the tally reaches capacity, stop pulling.

Capacity, not velocity. Velocity is the historical average; capacity is this sprint’s actually-available time. Plan to capacity, not velocity.

Say it explicitly when you get there:

“We’re at 24 points. That’s our capacity. Any story we pull in now has to push another one out. Product owner, is there a trade you want to make?”

What to watch for:

Over-commitment. The team pulls in more than their average velocity because “this sprint feels different.” It never is. Use the numbers: start from velocity and plan to capacity. Optimism is not a planning strategy, and a team that over-commits twice loses the ability to trust itself.
Under-commitment. The team sandbagging because they got burned. A sprint or two of under-commitment to rebuild confidence is fine; persistent under-commitment means the stories are bigger than estimated or there’s hidden work the estimates don’t cover.
Skipping the priority order. “Let’s skip story 3 and do story 7 instead.” Only the product owner can approve a priority swap, and they should say why out loud. Otherwise the backlog order stops meaning anything.
Gold-plating during selection. The team starts redesigning a story. Redirect: “We’re deciding what’s in, not how to build it. Task breakdown is next.”
Carry-over invisible. Last sprint’s unfinished work is coming in. Make it visible in the tally: “We have 8 points of carry-over. That leaves 16 for new work.” Carry-over that isn’t counted is how velocity quietly disappears.
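The running tally can be sketched in a few lines. This is an illustration, not a replacement for the conversation: the `select_stories` helper and the story names and sizes are invented, and the sketch assumes carry-over is counted against the capacity number first, as in the 24-and-8 example above.

```python
def select_stories(backlog, capacity, carry_over=0):
    """Pull stories in priority order; stop at the first one that does not fit."""
    tally = carry_over  # unfinished work counts against capacity first
    selected = []
    for title, points in backlog:
        if tally + points > capacity:
            break  # no skipping ahead: a priority swap is the product owner's call
        selected.append(title)
        tally += points
    return selected, tally


# Hypothetical backlog, already in priority order: (title, points).
backlog = [
    ("Pause subscription", 5),
    ("Resume subscription", 5),
    ("Pause confirmation email", 3),
    ("Billing proration", 8),
]

selected, tally = select_stories(backlog, capacity=24, carry_over=8)
print(selected, tally)
```

The `break` is deliberate. A smaller story further down the list might fit in the remaining room, but pulling it in is a priority swap, and that decision belongs to the product owner, out loud.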

Phase 3. Task breakdown (25-50 minutes)

For each selected story, the team breaks it into tasks. A task is concrete enough that one person can do it in a day or less.

The facilitator’s job in this phase is mostly to keep moving. Give each story a time budget (“five minutes per story”) and hold it. If a story needs more than five minutes of task breakdown, it wasn’t ready for planning; pull it and prepare it separately.

For each story, the team identifies:

What needs to happen: the concrete tasks, written on sticky notes or into the tracker
Who’s likely to do what (not assignments, but a flag for tasks that need specific expertise)
Dependencies: between tasks, between stories, between teams
The not-obvious work: testing, deployment, migration, documentation, feature flag cleanup, on-call handover

What to watch for:

Tasks that are too big. “Build the UI for pausing” is probably three tasks: form, validation, API wiring. If a task is longer than a day, split it.
Missing the unglamorous work. Teams forget testing, migrations, deployment, feature flag management, observability wiring, documentation updates, on-call runbook edits. Prompt: “What else has to happen before we can call this done?”
One-person bottlenecks. If every story has the same developer flagged as essential, that’s a risk. Don’t solve it now; flag it and discuss pairing or knowledge-sharing in the retrospective.
External team dependencies. If a task needs another team’s API, approval, or review, name it and name the person who’ll chase it. Better still: can you start with a mock or stub so the dependency isn’t blocking?
On-call capacity. If a developer is carrying the pager this sprint, their capacity is not the same as a developer who isn’t. Build that in during task breakdown, not after the fact.
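Two of these checks are easy to automate if the breakdown lives in a tracker. A minimal sketch, assuming each task records an estimate in days and, optionally, the one person flagged as essential; the `PlannedTask` shape, the helper names, and the example data are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PlannedTask:
    story: str
    name: str
    days: float
    needs: Optional[str] = None  # person flagged as essential, if any


def oversized(tasks, limit_days=1.0):
    """Tasks estimated at longer than a day: candidates for splitting."""
    return [task.name for task in tasks if task.days > limit_days]


def bottlenecks(tasks):
    """People flagged as essential on every story in the sprint."""
    stories = {task.story for task in tasks}
    flagged = {}
    for task in tasks:
        if task.needs:
            flagged.setdefault(task.needs, set()).add(task.story)
    return sorted(person for person, seen in flagged.items() if seen == stories)


# Invented breakdown for two stories.
tasks = [
    PlannedTask("Pause", "Build the UI for pausing", days=3, needs="dev_a"),
    PlannedTask("Pause", "API wiring", days=1),
    PlannedTask("Resume", "Resume endpoint", days=1, needs="dev_a"),
]

print(oversized(tasks))
print(bottlenecks(tasks))
```

Neither result is a decision. The oversized task gets split in the room; the bottleneck gets flagged and taken to the retrospective.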

Phase 4. Commitment check (5-10 minutes)

Read the sprint goal aloud. Read the list of selected stories. Then go round the room and ask each person the same question:

“Do you believe we can deliver this sprint as planned?”

This is not a vote. It’s a check. You are looking for the person whose body language doesn’t match their words. You are giving the quiet team member a direct invitation to raise a concern that would otherwise stay silent until the retrospective.

Run a silent confidence check before any verbal round-the-room: “On a count of three, hold up fingers from one to five. One means you don’t believe we can deliver this sprint. Five means you’re confident. Three is the middle. No talking yet.” Anything below a four gets a follow-up question: “You held up a two. Which story is the worry, and what would lift it?”

The silent signal goes first because round-the-room polling, with senior people answering early, pressures introverts and juniors toward conformity. By the time the second person speaks, the room has anchored. Silent first surfaces the dissent that the verbal round would flatten.
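For remote teams that collect the fingers in a poll or a chat thread, the follow-up rule is a one-line filter. A sketch under assumptions: the `follow_ups` helper and the participant labels are invented, and the threshold of four comes from the rule above.

```python
def follow_ups(votes, threshold=4):
    """Names of everyone below the threshold, least confident first."""
    for name, fingers in votes.items():
        if not 1 <= fingers <= 5:
            raise ValueError(f"{name} voted {fingers}; expected one to five")
    low = [(fingers, name) for name, fingers in votes.items() if fingers < threshold]
    return [name for fingers, name in sorted(low)]


# Hypothetical result of one silent round.
votes = {"participant_a": 5, "participant_b": 2, "participant_c": 3, "participant_d": 4}
print(follow_ups(votes))
```

Sorting least confident first puts the hardest conversation at the front of the queue, before the room has a chance to anchor on the fives.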

If someone says “we’ll try,” it’s a signal. Ask one more question: “What would turn ‘we’ll try’ into ‘I think we can’?” Sometimes the answer is “nothing, that’s my honest level of confidence and I’d ship anyway.” Sometimes it’s “remove this one story.” Either is a real answer; the silent check is the trigger to find out which.

What to watch for:

Silent discomfort. Someone doesn’t object but their face says the plan is too much. The confidence check should have caught this; if it didn’t, ask directly, by name: “You’ve gone quiet. Which story are you worried about, and what would it take to remove the worry?”
“We’ll try.” Not automatically a no; a signal to ask one more question. “What would turn ‘we’ll try’ into ‘I think we can’?”
Scope-cutting inside stories. “We can do it if we skip the error handling.” No. Error handling is in the acceptance criteria or it isn’t. Cut scope by removing stories, never by cutting corners inside them.
The product owner negotiating down. The product owner offers to cut a story to reassure the team. Let them. This is the ritual working.

When everyone has said yes, genuinely said yes, not politely nodded, the sprint is committed. Write the commitment down with a date. The sprint starts.

See Sprint Planning: Turning Sticky Notes into Delivery for the Greenbox team running their first planning session, including the moment they discover that a sprint goal changes every single decision that follows it, and the moment one of the developers realises the commitment check exists precisely for people who were about to nod along with a plan they didn’t believe in.

What Can Go Wrong

The preparation session in disguise. Fifteen minutes debating what a story means, mid-planning. Recovery: “This story isn’t ready. Let’s pull it out of the sprint and work through it properly this week, maybe with Example Mapping, before it comes back to planning.” Stop if: Three stories in a row need that treatment. The top of the backlog isn’t ready for planning. End the session, schedule the preparation work, and come back.

The wish list. The product owner keeps adding “just one more.” Recovery: Hold the capacity number honest: “We’re at 24. This story is 5. Which 5-point story do you want to remove to make room?” Force the trade, every time. Stop if: The product owner won’t accept capacity as a constraint. That’s a systemic problem, not a session problem. Flag it to leadership outside the room.

The architecture debate. Developers start debating framework choices during task breakdown. Recovery: Park it: “Capture that as a design question. We need to know can we do it this sprint, not how.” Stop if: The debate is blocking task breakdown entirely. The story needs a time-boxed design investigation before it’s plannable; pull it.

The absent product owner. Product owner cancels the morning of. Recovery: Reschedule, same day if possible, next day if not. Stop if: This keeps happening. The ritual is broken; escalate the pattern, not the individual session.

The permanent over-commitment. Every sprint the team commits to 30 points and delivers 20, and nobody adjusts. Recovery: In the next planning session, write the last three sprints’ delivered totals on the board before story selection. Plan to the delivered number, not the committed one. Watch what happens. Stop if: The product owner insists on the committed number anyway. That’s a systemic trust problem; a planning session won’t solve it.

The silent veto. Commitment check passes, but one person clearly doesn’t believe it. They’ve said yes because saying no feels rude. Recovery: Take a break. Talk to them privately. Bring the concern back into the room as “There’s a worry about the migration story that I don’t think we surfaced properly. Can we talk through it before committing?” so the objection is legitimised by the facilitator, not left to the quiet person to defend alone. Stop if: They still won’t speak. The team has a safety problem, not a planning problem.

The ritual collapsing into theatre. The sprint goal gets skipped and the sprint becomes a to-do list. Or carry-over isn’t counted against capacity and velocity quietly collapses. Or the commitment check becomes theatre and nobody actually believes the plan. Or task breakdown is skipped because “we know what to do,” and the team discovers mid-sprint that they didn’t. Or the session runs so long the team arrives at their first ticket exhausted. Recovery: Name the failure mode in the next retrospective. Pick one to fix. Don’t try to fix all five at once. Stop if: The retrospective itself can’t surface the problem. The ritual has fully collapsed and a planning session won’t rebuild it. Step back to first principles: what is this sprint for?

Next Steps

The session ends; the sprint begins.

Same day, the facilitator:

Sprint goal written where everyone can see it: team board, wiki, Slack topic, the top of the sprint in the tracker.
All selected stories moved into the sprint in the tracker, with tasks attached.
External dependencies communicated to the teams they touch, today, before the sprint starts.
Photograph the whiteboard if the task breakdown happened on physical sticky notes.

This sprint, the product owner:

This is where the pattern earns its cost, and the work is mostly the product owner’s.

Protect the sprint goal. Every “just one more thing” request that arrives this sprint gets measured against the goal. If it doesn’t serve the goal, it goes into the backlog for next sprint. If it does, it replaces something that doesn’t. The product owner is the only person who can make that trade.
Watch the burndown (the sprint’s day-by-day chart of remaining work) at the midpoint. For a two-week sprint, Thursday of week one. For a one-week sprint, the morning of day three. If you’re behind, have the conversation about what to cut now, not on the last day. The goal survives a scope cut; it doesn’t survive a last-day scramble.
Prepare the top of the backlog for the next planning session. This is the unglamorous work that makes next sprint’s planning an hour instead of three. Run Example Mapping on the candidate stories. Answer the red cards. Size the stories. Arrive at the next planning session with a backlog that is actually plannable.
Walk the goal to anyone who matters. Stakeholders, leadership, dependent teams. “Here’s what we’re doing this sprint and here’s why” said once at the start prevents five “what are you working on” interruptions mid-sprint.

Ongoing, the team:

Track velocity honestly. It’s the single most useful planning input and it only works if you measure what actually shipped, not what was committed.
If planning sessions consistently run long, the preparation happening before them needs work. Stories should arrive at planning ready to plan.
Retrospect on the planning session itself periodically. Is the goal still the contract? Is commitment still meaningful? Or has the ritual become theatre?

The pattern earns its cost across sprints, not within one. One hour per sprint week, every sprint, forever: that’s the price. A product owner who has to be available and prepared, a story-preparation process upstream that reliably produces ready stories, and the discipline to say no to “just one more” every single time: that’s what holds it up.

Variants

One-week sprint (default short cadence). One hour total, the four phases compressed. Task breakdown is tighter because the stories are smaller. Use when the team needs faster feedback loops or when scope volatility is high enough that two-week commitments break too often.

Two-week sprint (default long cadence). Two hours total, more room for goal pressure-testing and richer task breakdown. The default for most teams; one-week cadence costs more facilitation overhead per unit of delivery.

Distributed / remote. Same four phases on a Miro or Mural board. The silent confidence check transfers especially well to remote: everyone holds up fingers to camera at the same moment, no anchoring effect. Run the goal-setting conversation on video with the goal pinned at the top of the board for the rest of the session.

SRE / ops-heavy sprint. The goal looks like “Reduce deployment rollback rate from 1 in 5 to 1 in 20” or “Move the billing cron to the new scheduler with zero missed runs” rather than a user-facing capability. Capacity calculations include on-call load explicitly. Task breakdown often surfaces runbook updates and observability wiring as first-class tasks rather than afterthoughts.

First-sprint-with-this-team. Velocity is unknown, so capacity is a guess. Plan conservatively (commit to less than feels right), agree explicitly that the first two sprints are calibration, and use them to discover real velocity. Don’t pretend the guess is data.

What Time Is It?

2026-04-16T06:00:00+08:00

You glance at your phone, read the digits, and get on with your day. Behind those digits is a tower of compromises, conventions, and politics that has taken humanity thousands of years to build. The time it shows you is wobblier than you’d think.

Sticks in the ground

Timekeeping started the way most useful things start: someone needed to solve a practical problem.

If you’re growing crops, you need to know when to plant. If you’re a priest, you need to know when to perform a ritual. If you’re a trader, you need to know when the market opens. The sun moves across the sky in a predictable arc, so you stick a pole in the ground and watch its shadow. Congratulations: you’ve invented the sundial, and you’re roughly in agreement with everyone else in your village about when noon is.

The Egyptians were dividing daylight into twelve parts around 1500 BCE. The Babylonians gave us base-60 counting, which is why we’re stuck with 60 minutes in an hour and 60 seconds in a minute: a convention so old that nobody alive chose it, yet nobody can change it.

But sundials only work when the sun shines. So people got creative.

Water clocks, clepsydrae, from the Greek kleptein (to steal) and hydor (water), measured time by the regulated flow of water from one vessel to another. The ancient Egyptians used them by at least 1500 BCE, and versions appeared independently in China, Greece, and Rome. Some were remarkably sophisticated. The Chinese polymath Su Song built a water-powered astronomical clock tower in 1088 that stood ten metres tall and featured an escapement mechanism (a device for converting continuous water flow into regular, counted intervals) centuries before European clockmakers would independently develop the same idea (Joseph Needham, Science and Civilisation in China, Vol. 4, Part 2).

Candles as clocks were simpler but clever. You’d mark a candle at regular intervals and burn it down, reading the time from how much remained. King Alfred the Great is traditionally credited with using graduated candles to regulate his daily routine in the 9th century, though the story is likely embroidered. The ingenious trick was the candle alarm clock: push tacks or small nails into the wax at a specific height. When the candle burns down to that point, the tacks fall onto a metal plate below with a clatter. You’ve just been woken up by a piece of wax and some hardware.

Hourglasses, or sandglasses really, became widespread from the 14th century. Cheap, portable, and indifferent to weather. Ships used them to mark watches. Churches used them to time sermons (some congregations reportedly installed them facing the preacher, as a gentle hint). Kitchens used them, and still do. An hourglass doesn’t tell you what time it is; it tells you how much time has passed, which is often what you actually need.

Village clocks mattered more than any personal timepiece for most of human history. Before wristwatches, before pocket watches, the church or town clock was the time. Its bell rang the hours, and entire communities synchronised their days to one mechanism in one tower. If that clock drifted, everyone drifted with it, and nobody noticed because there was nothing else to compare it to. Time was communal, and it was local.

For most of human history, this was fine. Noon was when the sun was highest where you stood, and what noon meant three towns over was someone else’s problem.

Clockwork

The mechanical clock changed everything slowly, then all at once.

Weight-driven clocks with verge escapements (an early mechanism that converted the steady pull of a hanging weight into a regular tick-tock) appeared in European church towers in the late 13th and early 14th centuries. They were large, expensive, and not particularly accurate, drifting by perhaps fifteen minutes per day. But they worked at night, they worked in rain, and they kept the whole town on the same schedule.

Then Galileo noticed something. In 1583, as the story goes, he watched a lamp swinging in the Cathedral of Pisa and timed it against his pulse. Every swing took the same amount of time, regardless of how far the lamp swung. He’d stumbled on what physicists call isochronism: a pendulum’s swings take the same time regardless of how wide they are. He never built a pendulum clock himself.

Christiaan Huygens did. In 1656, the Dutch mathematician and physicist built the first working pendulum clock, and the leap in accuracy was extraordinary: from roughly fifteen minutes of drift per day to about fifteen seconds per day. That’s an improvement of roughly sixty-fold. Huygens patented the design the following year and published the theory in Horologium Oscillatorium (1673), one of the great works of 17th-century physics.

From there, the history of timekeeping is a history of progressive miniaturisation. Tower clocks became mantel clocks. Mantel clocks became pocket watches as mainsprings replaced hanging weights and balance wheels replaced pendulums (a pendulum, after all, needs gravity and a stable surface; useless in a trouser pocket). Pocket watches became wristwatches. Each step required new engineering: smaller parts, better lubricants, more precise machining. The craft of watchmaking drove precision manufacturing for centuries before the Industrial Revolution made it commonplace.

The problem that made time matter

In the 18th century, ships were sinking because sailors couldn’t figure out where they were. Not north-south; that part was easy. You measure the angle of the sun or the North Star above the horizon and you’ve got your latitude, your position north or south of the equator. Sailors had been doing this reliably for centuries.

The deadly question was east-west. Longitude, your position east or west of a reference point, was a different beast entirely, because longitude is fundamentally a time problem. The Earth rotates 360 degrees in 24 hours, which is 15 degrees per hour. If you know it’s noon where you are and you also know it’s currently 3:00 PM back at your reference point, you’re three hours west, 45 degrees of longitude. Simple arithmetic. And once you know your longitude, you combine it with your latitude (which you already have from the stars) and plot your position on a chart. From your position on a chart, you can see where the land is, where the rocks are, and whether you need to change course. Longitude turns “somewhere in the Atlantic” into a dot on a map.
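The arithmetic is small enough to write down. Here is a minimal Python sketch of it; the function name and the sign convention (negative for west) are my own choices for illustration, not a navigational standard.

```python
DEGREES_PER_HOUR = 360 / 24  # the Earth turns 15 degrees every hour


def longitude_from_clocks(local_hours: float, reference_hours: float) -> float:
    """Return longitude in degrees: negative is west of the reference, positive east.

    Both arguments are clock readings in decimal hours taken at the same
    instant: local solar time, and the time at the reference meridian.
    """
    difference = local_hours - reference_hours
    # Wrap into [-12, 12) so readings either side of midnight still work.
    difference = (difference + 12) % 24 - 12
    return difference * DEGREES_PER_HOUR


# Local noon while the reference clock reads 3:00 PM: three hours behind.
print(longitude_from_clocks(12.0, 15.0))  # -45.0, i.e. 45 degrees west
```

The hard part was never the multiplication. It was getting a trustworthy value for the second argument.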

The catch: you need to know what time it is somewhere else. And in the 18th century, no clock could survive months at sea. Pendulum clocks were hopeless on a rocking ship. Without a reliable way to carry a reference time, sailors relied on dead reckoning: estimating their position by tracking how far they’d travelled from a known starting point. Speed was measured with beautiful simplicity: throw a rope with knots tied at regular intervals off the stern, let it run through your hands, and count how many knots pay out in a set time. That’s why we still measure nautical speed in knots. Note your speed, note your compass heading, note how long you’ve been on that heading, and do the arithmetic. If you left Lisbon heading west at five knots for six hours, you’re roughly thirty nautical miles west of Lisbon.
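Dead reckoning is just bookkeeping, and a rough Python sketch shows how little there is to it. This version assumes a flat sea (reasonable only over short distances) and ignores currents and leeway entirely; the function name and the tuple layout are illustrative.

```python
import math


def dead_reckon(legs):
    """Sum (speed_knots, heading_degrees, hours) legs into a displacement.

    Returns (east_nm, north_nm) from the starting point. Headings are
    compass bearings: 0 is north, 90 is east, 270 is west.
    """
    east = north = 0.0
    for speed_knots, heading_degrees, hours in legs:
        distance = speed_knots * hours  # one knot is one nautical mile per hour
        bearing = math.radians(heading_degrees)
        east += distance * math.sin(bearing)
        north += distance * math.cos(bearing)
    return east, north


# Five knots due west for six hours: about thirty nautical miles west.
east, north = dead_reckon([(5, 270, 6)])
print(round(east, 6), round(north, 6))
```

Every input to that loop is an estimate, and nothing in the loop can tell you that one of them was wrong.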

The problem is that dead reckoning accumulates errors. Every estimate is slightly off: the current pushed you north, the wind shifted and nobody noticed for an hour, the speed measurement was wrong because the sea was rough. Each small error compounds on the last. After weeks at sea, a dead reckoning position could be off by hundreds of miles. And there was no way to check it, because checking required knowing your longitude, which required a clock.

On 22 October 1707, a fleet of Royal Navy warships under Admiral Sir Cloudesley Shovell was returning to England through the Western Approaches. Fog. No sun for days. The navigators’ dead reckoning, weeks of accumulated estimates, each one slightly off, told them they were safely west of the Isles of Scilly. They were further east than they thought. Four ships struck the rocks. The Association, Shovell’s flagship, went down in minutes. Nearly two thousand sailors drowned. It was one of the worst maritime disasters in British history, and the root cause was that nobody on board could answer “what time is it in Greenwich right now?”

A Greenwich clock would have saved them. A navigator with a sextant can fix local noon to within a minute or two even through overcast. If local noon fell at 12:40 PM Greenwich time, that’s a 40-minute difference: 10 degrees west. Scilly sits at 6.3 degrees west. A clock, a sextant, and some arithmetic would have shown them they were closer to the rocks than they thought. They didn’t have the clock. They hit the rocks.

The disaster was so shocking that Parliament offered a prize of £20,000 (millions in today’s money) for a practical solution. The Longitude Act of 1714 established the Board of Longitude, and the race was on.

John Harrison, a self-taught carpenter and clockmaker from Yorkshire, spent decades building a series of marine chronometers, each one a masterwork of engineering. His H4, completed in 1759, was a pocket-watch-sized device that lost only five seconds on an 81-day voyage to Jamaica that began in 1761. The Board of Longitude, staffed largely by astronomers who preferred a celestial solution, dragged their feet on paying him. Harrison eventually got his money, but he was 80 years old by the time the matter was fully settled. Dava Sobel’s Longitude (1995) tells the story beautifully.

It’s hard to overstate the impact. Accurate portable clocks didn’t just solve navigation; they made the modern world possible. Once you can coordinate time across distance, you can coordinate anything across distance.

Railways ruin everything (in a good way)

For a long time after Harrison, local time persisted on land. Bristol is about 2.5 degrees west of London, so noon in Bristol is roughly 10 minutes after noon in London. Nobody cared, because the fastest you could travel between them was by horse, and ten minutes didn’t matter.

Then came the railways.

If a train departs London at 8:00 AM London time and is due in Bristol at 10:00 AM, is that 10:00 AM London time or Bristol time? Now multiply this confusion by every station on every line. Timetables became dangerous nonsense, and not just an inconvenience. On single-track lines, the entire safety model depended on timetables keeping trains from meeting head-on. If the station master in Bristol and the station master in Bath were working to clocks that disagreed by several minutes, trains could occupy the same stretch of track at the same time. And they did. Accidents were attributed to time discrepancies between stations.

Passengers missed trains because timetables were printed in London time but station clocks showed local time. Goods shipments went astray. Mail coaches connecting to trains arrived at the wrong moment. Some station clocks tried to have it both ways, sporting two minute hands, one showing local time, one showing railway time, a wonderfully British solution to a problem that shouldn’t have existed.

The confusion reached the courts. In the 1858 case Curtis v. March, the verdict hinged on whether “10:00” meant local time or Greenwich time. The law itself couldn’t answer the question “what time is it?” with a single answer.

The Great Western Railway had already forced the issue. It adopted Greenwich Mean Time across its network in 1840, and other railways followed. The practice became known as Railway Time: GMT imposed not by government decree but by operational necessity. The trains couldn’t run safely without it, so the trains won. The legal standardisation didn’t come until the Statutes (Definition of Time) Act 1880, four decades after the railways had already settled the matter in practice.

Once Britain had a single time, the same problem surfaced at the international scale. Telegraph networks and shipping lanes crossed borders, and every country still kept its own reference. The International Meridian Conference in Washington DC in 1884 was convened to fix this. It didn’t impose a grid of time zones the way people often assume.

What the conference actually decided was narrower: Greenwich would be the prime meridian (longitude zero), and a universal day would start at Greenwich midnight. The vote was 22 to 1, with San Domingo against and France and Brazil abstaining. France was the holdout: Paris had been a rival prime meridian for centuries, and French pride didn’t yield easily. France didn’t officially adopt Greenwich-based time until 1911, and even then called it “Paris Mean Time retarded by 9 minutes 21 seconds” to avoid saying “GMT.” (The grudge was real.)

The conference said nothing about how countries should organise their civil clocks. Time zones emerged organically over the following decades as each nation decided how to align its local time to the Greenwich reference. Some adopted clean hour offsets. Others didn’t. The result is the glorious, maddening patchwork we have today.

Time zones and their discontents

Time zones are a hack. They pretend that everyone within a wide strip of the Earth shares the same local time, which is obviously not true. And they’re political as much as they are geographical.

The zones are not neat strips. They follow national and regional borders, creating wild zigzags on any map. Spain is geographically in line with the UK and Portugal but uses Central European Time because Francisco Franco aligned Spain’s clocks with Nazi Germany in 1940, and nobody ever changed them back; the sun sets absurdly late in Madrid in summer. Western China is officially UTC+8 (Beijing Time) but the sun doesn’t rise until 10 AM in winter in Kashgar; the whole country uses a single time zone because Beijing says so. France uses UTC+1 despite Brest being west of Greenwich. Western Argentina runs on UTC-3 though its geography suggests UTC-5.

Some countries use odd offsets. India uses UTC+5:30, a compromise between Mumbai in the west and Kolkata in the east. Iran, Afghanistan, and Myanmar sit on their own half-hour offsets, as do Newfoundland and the Marquesas. Nepal is UTC+5:45, the only country on a 45-minute offset. Sri Lanka moved from UTC+5:30 to UTC+6:30 in 1996, dropped to UTC+6 a few months later, and went back to UTC+5:30 in 2006.

And then there’s Eucla. On the Eyre Highway near the Western Australia-South Australia border, a handful of roadhouses and a telegraph station use UTC+8:45, an unofficial timezone that splits the difference between Western Australia’s UTC+8 and South Australia’s UTC+9:30. It’s not recognised by any government. It’s not in any legislation. The locals just decided that neither neighbouring timezone made sense for them, so they invented their own. The IANA database does carry an entry for it, Australia/Eucla, with a +8:45 offset: one of the most obscure timezone entries in the world. If you’re driving from Perth to Adelaide and you stop for petrol at Border Village, you’re in a timezone that officially doesn’t exist. Your phone will probably show the wrong time. Welcome to Australia.

The shifting ground

Even what’s been agreed keeps moving.

Standard offsets shift, not just because of daylight saving. Take Perth. We’re on UTC+8 now, but before 1895, Western Australia used local mean time, roughly UTC+7:43. During both World Wars and again from 2006 to 2009, Perth observed daylight saving time and temporarily became UTC+9. During those DST periods, a timestamp from Perth at 2:00 AM on a transition day is ambiguous: did it happen before the clocks went back, or after? The same wall-clock time occurred twice. And when clocks spring forward, an hour simply doesn’t exist; 2:00 AM to 2:59 AM never happened. Anyone born in that hour, any event scheduled in that hour, any log entry timestamped in that hour: none of it is real.

This isn’t unique to Perth. Virtually every inhabited place on Earth has changed its UTC offset at least once. Russia has reshuffled its eleven time zones repeatedly. Turkey moved from UTC+2 to UTC+3 permanently in 2016. North Korea created UTC+8:30 in 2015, then switched back to UTC+9 in 2018 as a diplomatic gesture. Morocco observes DST year-round except during Ramadan, when they suspend it, meaning the offset changes on religious dates that shift by roughly eleven days each year against the Gregorian calendar.

The IANA timezone database, the file your phone and your servers use to figure out what time it is, tracks all of this. Every historical offset change, every DST transition, every political decision that moved a clock. It’s updated several times a year because governments keep changing the rules. If you’re writing software that handles time, this database is your source of truth, and the fact that it needs regular updates tells you everything about how stable time zones actually are.

The consequence for software is brutal. You cannot store a local time and assume you know the UTC equivalent without also knowing which version of the timezone rules were in effect. A timestamp of “2:30 AM, 25 March 2007, Perth” is meaningless unless you know whether DST was active, and the answer depends on whether you’re using the pre-trial rules or the trial rules. The time itself depends on when you ask the question.
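You can watch this happen with Python’s standard zoneinfo module. The sketch below assumes Python 3.9 or later and that IANA data is available on the machine (from the operating system or the tzdata package). The helper classify_wall_time is my own illustrative name, not part of any library.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # standard library since Python 3.9


def classify_wall_time(naive: datetime, zone: ZoneInfo) -> str:
    """Label a naive local time as 'unique', 'ambiguous' or 'nonexistent'."""
    first = naive.replace(tzinfo=zone, fold=0)
    second = naive.replace(tzinfo=zone, fold=1)
    if first.utcoffset() == second.utcoffset():
        return "unique"
    # The two folds disagree, so we are inside a DST transition. A real local
    # time survives a round trip through UTC; one inside the gap does not.
    round_trip = first.astimezone(timezone.utc).astimezone(zone)
    return "ambiguous" if round_trip.replace(tzinfo=None) == naive else "nonexistent"


perth = ZoneInfo("Australia/Perth")
when = datetime(2007, 3, 25, 2, 30)
print(classify_wall_time(when, perth))
for fold in (0, 1):
    # fold=0 is the first time the clock showed 2:30, fold=1 the second.
    print(fold, when.replace(tzinfo=perth, fold=fold).utcoffset())
```

Note that zoneinfo never raises for a time that falls in the gap or the overlap; it quietly picks an interpretation. If the distinction matters to your application, you have to ask for it explicitly, as the helper does.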

Daylight saving time

And then there’s daylight saving time, which deserves its own category of complaint.

The idea is attributed to George Vernon Hudson, a New Zealand entomologist who proposed it in 1895 because he wanted more daylight hours after work for collecting insects. (The things that change the world.) Germany was the first country to actually adopt it, in 1916, to save coal during World War I. Britain and the United States followed.

When it happens varies wildly. The EU switches on the last Sunday of March and October. The US switches on the second Sunday of March and the first Sunday of November, a change made by the Energy Policy Act of 2005, which took effect in 2007 and broke a surprising amount of software. Australia varies by state. And because Australia is in the southern hemisphere, the transitions go the opposite way: clocks spring forward in October and fall back in April, which confuses anyone used to the northern pattern.
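Rules like “the second Sunday of March” are easy to say and slightly fiddly to compute. A small Python sketch, using only the standard library; nth_sunday is an illustrative helper, and real software should read these dates from the tz database instead of hard-coding the rules.

```python
import calendar
from datetime import date


def nth_sunday(year: int, month: int, n: int) -> date:
    """Return the n-th Sunday of a month (n >= 1), or the last one if n is -1."""
    sundays = [
        day
        for day in calendar.Calendar().itermonthdates(year, month)
        # itermonthdates pads with days from neighbouring months; drop those.
        if day.month == month and day.weekday() == calendar.SUNDAY
    ]
    return sundays[n - 1] if n > 0 else sundays[n]


print(nth_sunday(2007, 3, 2), nth_sunday(2007, 11, 1))    # US rules from 2007
print(nth_sunday(2007, 3, -1), nth_sunday(2007, 10, -1))  # EU rules
```

In 2007 the US and EU transitions landed two weeks apart in spring and one week apart in autumn, so the gap between New York and London was not five hours for three weeks of that year.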

Where it doesn’t happen is a longer and more entertaining list. Most of Africa. Most of Asia. Iceland. Hawaii. Most of the tropics. Queensland, Australia, though New South Wales and Victoria, which sit at much the same longitude, do observe it, as does South Australia, leading to the odd situation where crossing a state border changes your clock. Here in Western Australia, we’ve voted against DST in four separate referendums, most recently in 2009, after a three-year trial, and the answer is always no.

And then there’s Arizona. Arizona doesn’t observe DST. But the Navajo Nation, which sits inside Arizona, does. And the Hopi reservation, which sits inside the Navajo Nation, doesn’t. Drive across those borders and your clock changes, doesn’t change, changes, and doesn’t change again. It’s a time zone nesting doll. Meanwhile, Lord Howe Island, a small Australian territory in the Tasman Sea, shifts by only 30 minutes for DST, because, apparently, why not.

The costs are real. A 2008 study by Janszky and Ljung in the New England Journal of Medicine found that heart attacks increase by about 5% in the week after the spring-forward transition, likely due to sleep disruption. Car accidents increase. Productivity drops. Software bugs bloom. The EU Parliament voted in 2019 to abolish DST entirely, but member states couldn’t agree on whether to keep permanent summer time or permanent winter time, and the proposal stalled.

If you’re writing code that handles time zones, the IANA tz database, sometimes called the Olson database, after its creator Arthur David Olson, is your scripture. It’s updated several times a year because governments keep changing the rules. I’ve written about the kind of compound complexity this creates in The Value Is in Ideas, Not Code; your library of knowledge about edge cases like these is exactly the sort of thing that separates useful software from software that crashes on a Sunday in Samoa.

Beautiful ideas nobody uses

Every now and then someone looks at the mess of time zones and leap seconds and local conventions and says: surely we can do better.

TAI64 is one such attempt. Proposed by Daniel J. Bernstein (the same person behind qmail and djbdns, and the plaintiff in Bernstein v. United States, the case that established code as protected speech under the First Amendment), TAI64 is a 64-bit representation of TAI: a simple count of seconds from a fixed epoch, with no leap seconds, no time zones, no daylight saving. It’s monotonically increasing, which means it’s ideal for log timestamps and event ordering. Every TAI64 label refers to exactly one second of real time, and the labels never go backwards or repeat. The extended form, TAI64N, adds nanosecond precision.

It’s elegant. It solves almost every practical problem with timestamps in one clean design. Almost nobody uses it.
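The format is small enough to sketch in a few lines of Python. As I understand Bernstein’s description, a label is the number of TAI seconds since the start of 1970 plus 2^62, stored as eight big-endian bytes, and TAI64N appends a four-byte nanosecond count. The function names here are mine, and the sketch deliberately takes TAI seconds as input: converting from Unix time needs a leap-second table, which is left out.

```python
import struct

TAI64_OFFSET = 2**62  # label value for the second starting 1970-01-01 00:00:00 TAI


def tai64n_label(tai_seconds: int, nanoseconds: int = 0) -> bytes:
    """Pack a TAI64N label: 8-byte big-endian seconds, then 4-byte nanoseconds.

    tai_seconds counts from the start of 1970 on the TAI scale, NOT Unix time.
    """
    if not 0 <= nanoseconds < 1_000_000_000:
        raise ValueError("nanoseconds must be between 0 and 999,999,999")
    return struct.pack(">QI", TAI64_OFFSET + tai_seconds, nanoseconds)


def external_form(label: bytes) -> str:
    """Render a label as '@' followed by lowercase hexadecimal."""
    return "@" + label.hex()


print(external_form(tai64n_label(0)))  # @400000000000000000000000
```

Because the bytes are big-endian, sorting labels as plain byte strings sorts them chronologically, which is a large part of the appeal for logs.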

Swatch Internet Time took a completely different approach. In 1998, the Swatch watch company proposed dividing the day into 1,000 “.beats”, with no time zones at all. The whole world would share a single time: @500 would mean the same moment for someone in Tokyo as in Toronto. The meridian was set at Biel, Switzerland (Swatch’s headquarters, naturally). One .beat is 86.4 seconds.
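The conversion is a one-liner once you fix the meridian. A minimal Python sketch, assuming Biel Mean Time is a fixed UTC+1 with no daylight saving; to_beats and BIEL_MEAN_TIME are illustrative names.

```python
from datetime import datetime, timedelta, timezone

BIEL_MEAN_TIME = timezone(timedelta(hours=1))  # fixed UTC+1, no daylight saving


def to_beats(moment: datetime) -> float:
    """Convert a timezone-aware datetime to .beats (0 <= beats < 1000)."""
    biel = moment.astimezone(BIEL_MEAN_TIME)
    seconds = biel.hour * 3600 + biel.minute * 60 + biel.second + biel.microsecond / 1e6
    return seconds * 1000 / 86_400  # 86,400 seconds in a day, so 86.4 per .beat


# 11:00 UTC is noon in Biel, which is halfway through the day.
print(f"@{int(to_beats(datetime(2026, 4, 16, 11, 0, tzinfo=timezone.utc))):03d}")  # @500
```

Pass in the same instant expressed in Tokyo or Toronto time and you get the same number back, which was the whole point.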

It was a lovely idea. Time zones exist because of the sun, but in an increasingly connected digital world, coordinating across zones creates constant friction. A universal internet time would eliminate “my 3 PM or your 3 PM?” forever. The notation was fun, the concept was sound, and it was backed by a major brand.

Nobody used it. The sun is still there. People still wake when it rises and sleep when it sets, more or less, and local time still reflects that biological reality. Swatch Internet Time lives on as a curious footnote and the occasional novelty watch face.

Both TAI64 and Swatch Internet Time failed for the same fundamental reason: they solved a technical problem while ignoring the human one. We don’t just use time to coordinate machines. We use it to coordinate lives, and lives are lived in places where the sun rises and sets at particular local times. Any scheme that ignores this is swimming against a very strong current.

It’s the same pattern: the technically “correct” solution (a universal encoding, a universal timescale) only wins when it also solves the human problem. UTF-8 succeeded where other encodings failed because it was backwards-compatible with ASCII. A universal time system would need to be backwards-compatible with the sun.

Even the source of truth gets it wrong

The tz database is the closest thing we have to a global authority on time. Every phone, every server, every programming language runtime uses it. And it has been wrong.

Governments don’t give notice. Egypt has announced DST changes with literally days of warning: not enough time for the database to ship an update, propagate through OS vendors, and reach the devices that need it. In 2014, Egypt reinstated DST with about a week’s notice, cancelled it again the following year, then in 2016 announced its return and cancelled it days before it was due to start. Each flip left a window where every computer in Egypt was showing the wrong time. Morocco’s Ramadan DST suspensions are worse; they shift against the Gregorian calendar by roughly eleven days each year, so the database has to predict Islamic calendar dates in advance. Sometimes the prediction is wrong and a correction has to be issued after the fact.

Turkey in 2016 was a sharp example. The government announced permanent UTC+3 with almost no lead time. Software using cached or bundled tz data (which is most software) was simply wrong until updates shipped. Java, Python, every major OS: all had a window where timezone calculations for Istanbul were incorrect. If you’d scheduled a meeting in Turkey during that window, your calendar was lying to you.

A lawsuit nearly killed it. In 2011, a company called Astrolabe Inc. sued Arthur David Olson for copyright infringement, claiming the database incorporated data from their copyrighted timezone atlas. The database was briefly taken offline; the world’s timezone source of truth, gone. ICANN stepped in and took over hosting of the database through IANA, and the lawsuit was eventually dismissed. But for a period, the infrastructure that every computer on Earth depends on for knowing what time it is was legally threatened by a copyright claim.

The past keeps changing. The database relies on historical records (newspaper clippings, government gazettes, personal recollections) that are sometimes incomplete or contradictory, and corrections to decades-old entries ship regularly. Sometimes a zone didn’t exist yet: Asia/Tomsk wasn’t added until tzdata 2016j, so Tomsk events stored before then used Asia/Novosibirsk rules, which had different offsets in some periods. The timestamp didn’t change, the interpretation did. Sometimes the history gets rewritten: in tzdata 2018i, the pre-independence data for several West African countries was substantially revised based on new archival research. Sometimes it gets erased: in tzdata 2022b, zones with identical post-1970 data were merged and their distinct pre-1970 histories moved to a separate backzone file that most operating systems don’t ship, so pre-1970 lookups silently started resolving to different UTC instants. The answer to “what time was it?” depends on when you ask.
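A minimal sketch of "the timestamp didn’t change, the interpretation did". The two fixed offsets below are illustrative stand-ins for "the rules before an update" and "the rules after it", not values copied from tzdata; real code would look up an IANA zone name, and its answer would depend on which tzdata release is installed.

```python
from datetime import datetime, timedelta, timezone

# Illustrative offsets only (assumptions, not copied from tzdata).
RULES_BEFORE_UPDATE = timezone(timedelta(hours=6))
RULES_AFTER_UPDATE = timezone(timedelta(hours=7))

# The stored instant never changes.
stored_utc = datetime(2000, 1, 15, 3, 0, tzinfo=timezone.utc)

# What changes is the local time that the instant is rendered as.
local_before = stored_utc.astimezone(RULES_BEFORE_UPDATE)
local_after = stored_utc.astimezone(RULES_AFTER_UPDATE)

print(local_before.isoformat())  # 2000-01-15T09:00:00+06:00
print(local_after.isoformat())   # 2000-01-15T10:00:00+07:00
```

Both values describe the same instant, so a comparison between them is still equal. What differs is the wall-clock time a user sees.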

And then there’s Antarctica. The South Pole doesn’t have a natural timezone; every line of longitude converges there, so the concept is meaningless. The Amundsen-Scott South Pole Station uses New Zealand time (UTC+12/+13) because its supply flights come from Christchurch. But other Antarctic stations use the timezone of their home country, or the timezone of their supply base, or whatever the station commander decided that year. The Antarctica/Vostok entry in the tz database has been corrected multiple times; at one point it referenced the South Magnetic Pole, which had drifted hundreds of kilometres away from the station. Vostok officially uses UTC+5 (matching its Russian supply base in Novosibirsk), but in practice the station has used UTC+6 and UTC+7 at various points depending on who was running the base that year. When the tz database maintainers tried to nail down the correct offset, the answer was: it depends on who you ask and when you asked them. In 2023, the actual chief of Vostok station wrote to the tz mailing list to announce yet another offset change. When other list members questioned the short notice and process, his reply was disarming: “Well, sorry, but I am not too experienced with timezone changing.” The man responsible for the time at one of the most remote places on Earth was doing it for the first time, explaining his reasoning to a mailing list of strangers, and offering to send documentation in Russian.

The tz database is maintained by volunteers. It’s one of the most critical pieces of infrastructure on the internet, right up there with DNS root servers and the BGP routing tables, and it runs on the goodwill of people who care about getting the time right. Every time your phone silently adjusts for a timezone change you didn’t know about, that’s someone on the tz mailing list who noticed, researched it, wrote a patch, and got it merged. The system works. It just works by the thinnest of margins.

So what time is it?

That’s the human story of the hour: thousands of years of sticks in the ground, springs and pendulums, political compromises, and a volunteer-maintained database that quietly keeps the world’s clocks from lying to us.

The hour on your phone is a fragile compromise between the sun and politics. The date next to it is a fragile compromise too, built from a different history, a different cast, and its own pile of arguments. That’s what What Day Is It? is about: Gregorian switchovers, lunar and lunisolar calendars, the International Date Line, and the year numbers that don’t agree. It’s coming shortly.

The Workshop: Example Mapping

2026-04-15T06:00:00+08:00

Twenty-five minutes, four colours of card, and a vague user story turns into something developers can actually build. Making Stories Concrete shows one team’s first run; this post is the reference you keep open the morning of yours.

Example Mapping

Example Mapping breaks a single user story into rules and concrete examples in twenty-five minutes, so the team knows whether the story is ready to build and what “done” actually means. Invented by Matt Wynne in 2015. Frequently confused with BDD scenario writing; Example Mapping produces the material BDD scenarios are then written from.

At a glance

Who, for how long: a facilitator who doesn’t put cards down, a product owner, one or two developers, and a tester if you have one, with an ops or domain expert pulled in when the story touches their patch. Three to five people, twenty-five minutes.
What you walk out with: a verdict said out loud (ready, ready with named assumptions, needs splitting, or blocked), blue rules ready to paste into the tracker as acceptance criteria, green examples that are almost BDD scenarios already, and a counted pile of red question cards.
When to reach for it: a story about to enter a sprint where you want to know if it’s actually ready, or where you suspect the product owner and developers disagree about “done” but haven’t surfaced it. Not for genuinely trivial stories (just build them), not for stories whose shape you don’t yet know (run Event Storming or User Story Mapping first), and not without the product owner in the room.

What’s It For

A developer picks up a story called “Subscriber can pause their box.” She reads the acceptance criteria, they look fine, she starts building. Three days in she hits a question: what happens to a box that’s already been packed when the pause takes effect? She asks the product owner. The product owner doesn’t know. The product owner asks the warehouse lead. The warehouse lead says “obviously the packed box goes out, we can’t unpack it,” and the developer’s first two days of work are now wrong.

This happens because the story looked simple. It wasn’t. There were three rules hiding inside it and at least one of them depended on operational knowledge nobody had written down. The conversation that would have caught it, a twenty-minute chat between the product owner, a developer, and someone from operations, would have happened before the sprint started if anyone had thought it was worth the time.

Example Mapping exists to make that conversation cheap enough that you always have it. The cards are the forcing function: you can’t hand-wave acceptance criteria when someone asks for a concrete example and you have to write it on a green card.

Reach for it when:

A story is about to enter a sprint and you want to know whether it’s actually ready
Developers and the product owner suspect they disagree about “done” but haven’t surfaced it
The story feels simple and you don’t trust the feeling
You need to decide whether a story should be split, built, or deferred for more discovery

What It’s Not For

Skip it when:

The story is genuinely trivial. A copy change, a feature flag flip, a config tweak. Just build it.
You don’t yet know what you’re building. Example Mapping assumes you have a story and drills into what it means; run Event Storming or User Story Mapping first to get the story in the first place.
The people who know the answers aren’t in the room. Without the product owner or the domain expert, the rules will get written but nobody will have the authority to say they’re right. Reschedule.

Stop a session that’s already started if:

Five minutes in and you can’t even seed a green card: the story is too vague to map and needs discovery, not Example Mapping
The product owner is absent or checking their phone
Every rule is producing a red card: you’re not mapping, you’re discovering, and that’s a different session

Stopping early when the signal is clear is not failure. Forcing a doomed session to the 25-minute bell is.

Definitions & Background

Four card colours, each one belonging to a role in the conversation:

Yellow: the story. One card, at the top of the table, present for the whole session.
Blue: rules. Acceptance criteria phrased as business rules. “A subscriber can pause for up to eight weeks.”
Green: examples. Concrete scenarios that illustrate a rule. “A subscriber pauses on a Monday for two weeks. Her Wednesday box skips. Her next box is the following Wednesday.”
Red: questions. Things nobody in the room can answer. “What happens to a box that’s already been packed when the pause takes effect?”

Examples are primary; rules fall out of them. Wynne’s framing all the way through: someone offers a concrete example the product owner has in mind, the room abstracts a rule from it, then someone offers another example to test that rule. The next example either confirms the rule, refines it, or breaks it, and breaking it is fine. A broken rule is replaced with one or two sharper rules, or a red card if nobody knows. Teams who try to lead with rules produce neat-looking maps that miss the cases the business actually cares about.

The cards arrange themselves around the story: blue rules stretch left-to-right under the yellow story; green examples drop in columns under the rule they illustrate; red questions go off to the side where they can be counted at the end.

Inputs

One user story written on a yellow card at the top of the table. Participants who have read the story in advance are a small bonus, not a prerequisite.
Cards in four colours: yellow, blue, green, red.
A table the participants can stand around, with room for the layout to grow: yellow story at the top, blue rules in a row underneath, green examples dropping in columns under each rule, red questions off to one side.
A 25-minute slot with no interruptions and the right people in the room (see Who’s Needed).

If the story isn’t yet defined enough to write on a card, run Event Storming or User Story Mapping first. Example Mapping doesn’t generate the story; it sharpens one you already have.

Outputs

What lands on the table at the end:

A clear verdict on the story: ready to build, ready with named assumptions, needs splitting, or blocked. State it out loud before anyone leaves the room.
Acceptance criteria as the blue rule cards, concrete enough to paste into the tracker.
Test scenarios as the green example cards, almost in BDD format already.
Open questions as red cards, each one a thing somebody needs to answer before the story enters a sprint.
Sibling stories, sometimes. When an example or a rule points to behaviour that doesn’t actually serve the user and outcome on the yellow card, capture it as a new yellow card parked next to the main one. This is a feature, not feature creep: discovering a sibling story mid-session is one of the most valuable ways scope gets split honestly.

Photograph the table layout from directly above before the cards come down: yellow at the top, blue rules in a row beneath it, green examples dropping in columns under each rule, red questions and any parked yellows off to the side.

These outputs feed straight into:

Sprint Planning: a story that’s been Example Mapped is ready for the capacity discussion. One that hasn’t shouldn’t be in the sprint conversation.
Decision Tables: when a rule has many conditions and the green cards become unmanageable, promote the rule to a decision table.
Assumption Mapping: red cards that can’t be answered by anyone in the room are often assumptions in disguise. They belong on the grid.

Who’s Needed

Three to five people, twenty-five minutes:

Facilitator. Runs the clock, asks for the next example, keeps people off implementation detours. Does not put cards on the table themselves except to demonstrate the colour convention at the start.
Product owner. Mandatory. They own the story and they’re the person who decides which rule applies when two participants disagree. If the product owner can’t attend, reschedule.
Developers. At least one, ideally two. They’ll catch the rules that are impossible or expensive to implement as written, and they benefit most from leaving with concrete tests in hand.
Tester or QA. Highly valuable if you have one. They will think of edge cases faster than anyone else in the room and they are the natural customer for the green cards.
Operations / support / domain expert. When the story touches a part of the system only one person really understands (the warehouse lead, the SRE who owns the cron that matters, the support agent who talks to the subscribers who hit this particular path), pull them in for this one session. They’ll save you a week.

Fewer than three and you don’t get enough friction between perspectives; more than five and the table becomes a meeting. Leave the rest of the team out; they’ll get the output through the acceptance criteria. Observers warp the conversation: if they care about the story, they should come as participants or read the output afterwards.

How To Run It

Phase	Duration	Cards	Key question
Explain the colours, read the story	2 min	Yellow	“What are we mapping?”
Seed the first rule	3 min	Blue + green	“What’s the most obvious rule? Give me an example.”
Explore the rules	15 min	Blue + green + red	“What happens if…?”
Assess readiness	5 min	Review all	“Is this ready to build?”
Total	25 minutes

Twenty-five minutes is the pattern. The time constraint is not cosmetic: if you can’t map the story in twenty-five minutes, the story is telling you something. It’s too big, too vague, or standing on unresolved questions. That signal is worth the entire session by itself.

Phase 1. Seed the rules (5 minutes)

Read the yellow card aloud. All of it. Don’t paraphrase. If the story is two sentences long, read both sentences.

Then set the colour convention:

“Yellow is the story, it’s already on the table. Blue is for rules. A rule is an acceptance criterion phrased as a business rule: ‘a subscriber can pause for up to eight weeks.’ Green is for examples. An example is a concrete scenario: ‘A subscriber pauses on a Monday for two weeks, her Wednesday box skips.’ Red is for questions nobody in this room can answer right now. We’ll come back to those at the end.”

Now draw out the first rule. The product owner almost always has the obvious one:

“What’s the most basic rule, the thing that has to be true for this story to exist at all?”

Write it on a blue card. Place it below the yellow story. Then immediately:

“Can someone give me a concrete example of that rule? One specific scenario. Real names are fine.”

Write the example on a green card. Place it under the blue rule. You now have the shape of the session: yellow on top, blue in a row underneath, green in a column under each blue. Once the shape is visible, the room knows what to do.

Phase 2. Explore the rules (15 minutes)

This is the core of the session. You now cycle: example, rule, example, red card, example, rule. The facilitator’s job is to keep the cycle moving by asking one of five questions over and over:

“Have you got a real example, and what rule does that example imply?”

“Can someone give me an example of that?”

“What happens if…?”

“Is that always true, or only sometimes?”

“Is that the same rule or a different rule?”

The first question is the most important and the easiest to skip. Examples drive rules, every time: start from a concrete example the product owner has in mind, abstract the rule from it, then offer another example to test the rule. The next example confirms the rule, refines it, or breaks it. A broken rule is replaced with one or two sharper rules, or a red card if nobody knows. Don’t let the team flip the order: when a rule shows up before an example, ask for the example before you write the rule down.

When an example or a candidate rule clearly belongs to a different story (it doesn’t serve the user or outcome on the yellow card), grab a yellow card and park it to the side. Don’t argue, don’t fold it in. Discovering a sibling story is a useful outcome, not a derailment, and the parked yellow becomes a candidate for its own session.

Place blue cards in a row. Place green cards in columns under their rule. Place red cards off to the right, a visible pile the team can count. Parked yellows go alongside the red pile; they’re a different signal but they live in the same margin.

Example Mapping works the same for infrastructure stories, with the SRE or pipeline owner taking the product owner’s seat. A rule might be “the pipeline rolls back on failed health checks” and an example “health check fails at 10:02, rollback begins at 10:03, service back on previous version by 10:05.” Same shape, different domain.

See Example Mapping: Making Stories Concrete for one team’s first run, including the moment a green card about an already-packed box turns into the red card that reshapes a week of work.

Phase 3. Assess readiness (5 minutes)

Step back from the table. Look at the shape of the cards. The shape tells you the verdict.

Few blue cards, examples for each, no red cards: the story is well understood. Build it.
Many blue cards, examples for each, no red cards: the story is well understood but too big. Split it, probably by rule.
Red cards: the story has open questions. Each red card has two valid closures: get the answer, or make an explicit assumption you can defend and write it on the back of the card. The second closure is fine when the assumption is low-stakes or the team is willing to wear the consequence, but flag any high-stakes assumption (one where the wrong call breaks the plan) for Assumption Mapping before the sprint commits to building. Reds left as neither answered nor assumed mean the story isn’t ready.
Very few cards, session finished in twelve minutes: either the story really is trivial, or the team is being superficial. Probe once: “Is there any scenario we haven’t considered where this would behave differently?” If the answer is a confident no, you’re done.

State the verdict out loud. Someone, ideally the product owner, says the phrase: “This is ready to build”, “Ready with these assumptions”, “Needs splitting”, or “Blocked on these red cards”. Saying it out loud matters. It’s the commitment, and it’s what everyone remembers when someone later asks “wait, did we decide about this?”
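The shape-to-verdict reading above can be sketched as a small data structure. This is an illustration under stated assumptions, not a tool the technique requires: the class names, the `MAX_RULES_PER_STORY` cut-off between "few" and "many", and the order of the checks are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional

MAX_RULES_PER_STORY = 4  # hypothetical cut-off between "few" and "many" blue cards


@dataclass
class RedCard:
    question: str
    answer: Optional[str] = None      # closure 1: somebody got the answer
    assumption: Optional[str] = None  # closure 2: written on the back of the card

    @property
    def closed(self) -> bool:
        return self.answer is not None or self.assumption is not None


@dataclass
class ExampleMap:
    story: str                                                 # yellow card
    rules: dict[str, list[str]] = field(default_factory=dict)  # blue rule -> green examples
    questions: list[RedCard] = field(default_factory=list)     # red cards


def verdict(card_map: ExampleMap) -> str:
    """Read the shape of the table; the ordering of the checks is an assumption."""
    if any(not q.closed for q in card_map.questions):
        return "Blocked on these red cards"
    if len(card_map.rules) > MAX_RULES_PER_STORY:
        return "Needs splitting"
    if any(q.assumption is not None for q in card_map.questions):
        return "Ready with these assumptions"
    return "This is ready to build"


pause = ExampleMap(
    story="Subscriber can pause their box",
    rules={
        "A subscriber can pause for up to eight weeks": [
            "Pauses on a Monday for two weeks; her Wednesday box skips",
        ],
    },
    questions=[RedCard("What happens to a box that's already been packed?")],
)
print(verdict(pause))  # Blocked on these red cards

pause.questions[0].answer = "The packed box goes out; we can't unpack it"
print(verdict(pause))  # This is ready to build
```

The code only mirrors the card shapes. The call itself still belongs to the person who says it out loud.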

What Can Go Wrong

The monologue. The product owner explains the story for ten minutes while everyone listens politely. Recovery: Interrupt and cash the monologue in for cards: “Stop there. Can someone capture what they just said as a rule on a blue card?” Forcing people to write cards forces them to be precise. Stop if: The monologue restarts after a second redirect. The story isn’t ready for Example Mapping; it needs a longer conversation first, and you should schedule one.

The rabbit hole. The team is fifteen minutes into debating one edge case. Recovery: Cash it in as a red card: “This is a great question. Let’s put it on red and keep moving. We’ll come back to it or take it offline.” Stop if: The same edge case keeps surfacing after two red cards. There’s a deeper unknown and you need a time-boxed investigation outside the room before another Example Mapping session will land.

The empty table. Nobody is writing cards. The conversation is circular. Recovery: The story is too vague. Ask a sharply concrete question: “What’s the simplest version of this story? One subscriber, one action, one outcome. What happens?” If that produces a green card, you have a seed. Stop if: The room genuinely can’t answer the simplest version. The story needs discovery, not mapping.

The premature solution. Developers start debating database schemas and API signatures. Recovery: “Great implementation thinking, hold it for when you start the story. Right now we’re still mapping what should happen.” Stop if: They can’t hold the distinction after three prompts. They’ll get more value from a design session.

The silent developer. A developer is in the room but not speaking, not writing, not asking for examples. Recovery: Name them and ask directly: “What’s the part of this story you’re most worried about implementing?” Worry is the fastest route to a red card. Stop if: They disengage completely. Something else is going on. Don’t try to fix it in-session.

The reverse map. The team writes blue cards directly from existing acceptance criteria and never generates green cards at all. Common in teams who’ve been running Example Mapping for six months: the ritual survives but the discovery has died. Recovery: Cover the blue cards with paper and ask for examples first. “Forget what we wrote. Tell me a real scenario for this story, a specific subscriber doing a specific thing on a specific Tuesday.” Once a green card lands, uncover the blues and see if they survive. Stop if: The room can’t produce a green card without seeing the rules. Example Mapping has degenerated into AC-rephrasing theatre; the story needs different discovery work first.

The committee verdict. The product owner defers to the developers when stating the readiness call. Recovery: Ask them directly: “Product owner, what do you want to do with this story?” The verdict is theirs to state. Stop if: They can’t or won’t make a call. The story isn’t owned. Don’t add it to the sprint until it is.

Next Steps

The session ends; the work begins.

Same day, the facilitator:

Photographs the card layout (one shot from above is enough, it’s only one table)
Transcribes the rules into the story’s acceptance criteria in the tracker
Transcribes the green cards as test scenarios, in BDD format if the team uses it
Writes the red cards into the tracker as open questions, each one with an assigned owner and a date
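As a rough illustration of that transcription, here is how the blue and green cards from the pause story might end up as an executable check. `next_delivery` is a hypothetical helper, not Greenbox code. The green card also leaves open whether a two-week pause skips one Wednesday box or two; this sketch assumes both boxes in the window skip, which is the kind of detail a session should settle.

```python
from datetime import date, timedelta

MAX_PAUSE_WEEKS = 8  # blue card: "A subscriber can pause for up to eight weeks"
WEDNESDAY = 2        # date.weekday() numbering, Monday = 0


def next_delivery(pause_start: date, pause_weeks: int, delivery_weekday: int) -> date:
    """First delivery day on or after the end of the pause (hypothetical helper)."""
    if not 1 <= pause_weeks <= MAX_PAUSE_WEEKS:
        raise ValueError(f"pause must be 1 to {MAX_PAUSE_WEEKS} weeks")
    pause_end = pause_start + timedelta(weeks=pause_weeks)
    days_ahead = (delivery_weekday - pause_end.weekday()) % 7
    return pause_end + timedelta(days=days_ahead)


def test_two_week_pause_from_monday_resumes_on_a_wednesday():
    # Given a subscriber whose box arrives on Wednesdays
    # When she pauses on Monday 13 April 2026 for two weeks
    resume = next_delivery(date(2026, 4, 13), 2, WEDNESDAY)
    # Then the boxes on 15 and 22 April skip and the next box is 29 April
    assert resume == date(2026, 4, 29)


def test_pause_longer_than_eight_weeks_is_rejected():
    try:
        next_delivery(date(2026, 4, 13), 9, WEDNESDAY)
    except ValueError:
        return
    raise AssertionError("a nine-week pause should be rejected")
```

Each green card becomes one test; each blue card becomes the constraint those tests exercise.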

This week, the product owner:

Chases every red card to closure. Each red card needs one of two outcomes before the story enters a sprint: answered, or assumed-and-named. The product owner owns the choice. An answered red card folds back into the rules and examples; a named assumption gets written on the back of the card and goes into the story description so the team building it knows what they’re betting on. High-stakes assumptions, the kind where the wrong call breaks the plan, get tested via Assumption Mapping before the sprint commits. A red card that’s neither answered nor assumed is a production bug rehearsed.
States the verdict in writing. Update the story with “Ready to build”, “Ready with these assumptions”, “Needs splitting”, or “Blocked on [red cards]”. The verdict is the single most valuable artefact from the session.
Splits the story if it needs splitting. Each rule with its examples can often become its own story. Don’t defer the split until sprint planning; the shape is freshest now.
Walks the acceptance criteria back to the developers. They were in the room, but the transcribed criteria may look different from what they remember. Five minutes of “does this match what we decided?” prevents the slow drift between session and sprint.

Ongoing, the team:

Runs Example Mapping for every story before it enters a sprint. It takes twenty-five minutes and consistently prevents the mid-sprint “but I thought it meant…” conversations that cost days.
Tracks the red card rate. If it’s trending upward, stories are arriving at Example Mapping too raw; push for better discovery upstream.
Keeps the green cards visible during the sprint. They’re the tests, and having them on the team board keeps “done” honest.
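One way to track the red card rate, sketched under assumptions: "rate" here means the average number of red cards per session in a sprint, and "trending upward" means the last three sprints average higher than the three before. Both definitions and the sample counts are made up for illustration.

```python
from statistics import mean


def red_card_rate(red_cards_per_session: list[int]) -> float:
    """Average red cards per Example Mapping session in one sprint."""
    return mean(red_cards_per_session)


def trending_up(rates: list[float], window: int = 3) -> bool:
    """Compare the latest `window` sprints with the `window` sprints before them."""
    if len(rates) < 2 * window:
        return False  # not enough history to call it a trend
    recent = mean(rates[-window:])
    earlier = mean(rates[-2 * window:-window])
    return recent > earlier


# Hypothetical counts: one inner list per sprint, one number per session.
sprints = [[1, 0, 2], [1, 1, 0], [2, 1, 1], [2, 3, 1], [3, 2, 2], [4, 2, 3]]
rates = [red_card_rate(sessions) for sessions in sprints]

if trending_up(rates):
    print("Red card rate is rising: stories are arriving too raw")
```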

Variants

Story Level (default). One user story about to enter a sprint, twenty-five minutes, three to five people. Output: rules, examples, questions, build/split/defer verdict. This is what most teams need, and the rest of this post describes it.

Epic Level. A cluster of related stories in an epic, ninety minutes, multiple passes with breaks between them. Output: a rough split of the epic into buildable stories. Reach for this when you’re trying to decide how to break up a large feature area; it’s really three or four Story Level sessions stacked together.

Remote. A Miro or Mural board with the four card colours pinned, video call for the conversation. Slightly slower (the rhythm of “write a card, place a card” is faster in person), but the structure transfers cleanly. Use one shared cursor: only the facilitator places cards, prompted by the team, to keep the layout legible.

Pre-sprint sweep. Run six or eight Story Level sessions back-to-back with a fifteen-minute break in the middle, twice a week. Each sitting takes about three hours, but it surfaces the entire next sprint’s worth of unknowns in one go. Best for teams whose backlog grooming has slipped and stories arrive at planning underprepared.

Impact Mapping: Connecting Work to Goals

2026-04-14T06:00:00+08:00

The Greenbox team has been heads-down for a few sprints. Subscriptions work. Basic delivery is running. Event Storming gave them shared understanding. Example Mapping drives their stories. BDD catches bugs before production. The team is shipping faster and cleaner than ever.

But on Friday afternoon, Maya pulls up the numbers and stares at them. They hit 214 subscribers at the end of sprint three. That was the target. That was the win. A month later, they’re at 197. The slope is going the wrong direction.

She brings it to the Monday standup. “We hit 214. We celebrated. Now we’re at 197. We’re adding new subscribers, but we’re losing existing ones faster. Churn is eating the growth.”

The room goes quiet.

The feature trap

Tom has been lobbying for a farm analytics dashboard: charts showing yield trends, delivery reliability scores, seasonal forecasting. It’s technically interesting. It’s obviously useful. It feels like the next logical feature.

Priya wants to improve the substitution algorithm. Jas has sketched a redesigned customer homepage. Sam wants an email onboarding sequence for new subscribers.

Everyone has a credible next thing to build. Each one would Example Map beautifully. But none of them address the problem Maya just put on the table: why aren’t they growing?

Sam mentions something else. “Last Tuesday the site was slow for about an hour. Three potential subscribers tried to sign up and got a timeout.” Nobody knew it was slow. Sam’s uptime monitor only checks if the site is up, not if it’s fast. Tom adds a response time check. Crude, a single number, checked every five minutes, but now they know when the site is slow, not just when it’s down.
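A check like Tom’s might look roughly like this. It is a sketch, not the Greenbox monitor: the two-second threshold and the URL are placeholders, and the fetch is passed in as a function so the timing logic can be tried without a network.

```python
import time
from typing import Callable
from urllib.request import urlopen

SLOW_THRESHOLD_SECONDS = 2.0  # hypothetical: pick a number that matches your signup flow


def check(fetch: Callable[[], None], slow_after: float = SLOW_THRESHOLD_SECONDS) -> str:
    """Return "down", "slow", or "ok" for a single request."""
    start = time.perf_counter()
    try:
        fetch()
    except Exception:
        return "down"
    elapsed = time.perf_counter() - start
    return "slow" if elapsed > slow_after else "ok"


def fetch_signup_page() -> None:
    # Placeholder URL; substitute the page potential subscribers actually hit.
    with urlopen("https://example.com/signup", timeout=10) as response:
        response.read()


# Scheduled every five minutes (cron, or whatever runs the uptime check today):
#     status = check(fetch_signup_page)
```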

This is the feature trap. Teams build what seems obvious, what’s technically exciting, or what the loudest person wants. The board looks healthy. Velocity is great. But the metric that matters is flat.

What Impact Mapping is

Impact Mapping is a technique created by Gojko Adzic. The core idea: before you decide what to build, work backwards from why you’re building it.

Four levels:

Goal: the measurable business objective. Not a feature. A number you can check.
Actors: the people whose behaviour needs to change to reach the goal.
Impacts: the specific behaviour changes you need from those actors.
Deliverables: the features that could create those behaviour changes.

Why → Who → How → What.

The structure is a tree. One goal at the root. Every feature can trace a path back to the goal. If it can’t, it doesn’t belong.

GOAL (Why?) → ACTOR (Who?) → IMPACT (How?) → DELIVERABLE (What?)

Running the session

Maya books an hour. The whole team plus Lee.

Lee starts at the root. “What’s the one goal? Not a feature. A business outcome you can measure.”

Maya doesn’t hesitate. “Stop the bleeding and get to three hundred active subscribers within three months. The board needs to see the model works before the next funding round.”

“Good. Now, who are the people whose behaviour affects whether you hit that number?”

Identifying actors

Subscribers: people already paying. If they churn, you’re running to stand still.

Potential subscribers: people who haven’t signed up yet.

Farms: the supply side. If farms can’t reliably deliver, the product falls apart and subscribers leave.

Maya (operations): writing her own name on the board is uncomfortable. The Event Storm already surfaced her as the supply-matching bottleneck. Now the Impact Map puts it more starkly: she’s not just a bottleneck in the process, she’s a risk to the goal. She’s spending hours each week manually matching supply to demand, and last week it caught up with her: she was still finalising substitutions when the courier arrived, two boxes went out wrong, and one subscriber cancelled on the spot.

Mapping impacts

For each actor, Lee asks: “What behaviour change would help us reach 300 subscribers?”

Not “what feature do they need.” Behaviour change.

Subscribers: Stay subscribed (don’t churn). Refer friends.

Potential subscribers: Discover Greenbox exists. Trust it enough to try.

Farms: Commit supply reliably. Communicate shortfalls early.

Maya: Spend less time on manual matching.

Seven impacts. Each one is a lever that moves the goal.

From impacts to deliverables

Now, and only now, does the team talk about features.

Subscribers → Stay subscribed: Pause subscription. Box preview notifications. Flexible box sizes.
Subscribers → Refer friends: Referral programme. Shareable box photos.

300 subscribers in 3 months
  Subscribers
    Stay subscribed: Pause subscription, Box preview notifications, Flexible box sizes
    Refer friends: Referral programme, Shareable box photos

Potential subscribers → Discover Greenbox: SEO landing pages, local press outreach, social media content.
Potential subscribers → Trust enough to try: Customer reviews, first-box discount, money-back guarantee.

300 subscribers in 3 months
  Potential subscribers
    Discover Greenbox: SEO landing pages, Local press outreach, Social media content
    Trust enough to try: Customer reviews, First-box discount, Money-back guarantee

Farms → Commit supply reliably: Forward contracts, demand forecasting tools.
Farms → Communicate shortfalls early: Shortfall reporting tool, SMS deadline reminders.

Maya → Less time on manual matching: Automated supply matching, substitution rules engine.

Seventeen deliverables across four actors. Every single one traces back to a specific behaviour change, which traces back to the goal.
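The same map can be written down as a small tree, which makes "trace a path back to the goal" something you can check. A minimal sketch: the nested-dict layout and the `trace` helper are illustrative choices, and the entries are copied from the lists above.

```python
IMPACT_MAP = {
    "goal": "300 active subscribers within three months",
    "actors": {
        "Subscribers": {
            "Stay subscribed": ["Pause subscription", "Box preview notifications", "Flexible box sizes"],
            "Refer friends": ["Referral programme", "Shareable box photos"],
        },
        "Potential subscribers": {
            "Discover Greenbox": ["SEO landing pages", "Local press outreach", "Social media content"],
            "Trust enough to try": ["Customer reviews", "First-box discount", "Money-back guarantee"],
        },
        "Farms": {
            "Commit supply reliably": ["Forward contracts", "Demand forecasting tools"],
            "Communicate shortfalls early": ["Shortfall reporting tool", "SMS deadline reminders"],
        },
        "Maya": {
            "Spend less time on manual matching": ["Automated supply matching", "Substitution rules engine"],
        },
    },
}


def trace(deliverable: str):
    """Return [goal, actor, impact, deliverable], or None if it isn't on the map."""
    for actor, impacts in IMPACT_MAP["actors"].items():
        for impact, deliverables in impacts.items():
            if deliverable in deliverables:
                return [IMPACT_MAP["goal"], actor, impact, deliverable]
    return None


impact_count = sum(len(impacts) for impacts in IMPACT_MAP["actors"].values())
deliverable_count = sum(
    len(items) for impacts in IMPACT_MAP["actors"].values() for items in impacts.values()
)

print(impact_count, deliverable_count)  # 7 17
print(" → ".join(trace("Pause subscription")))
```

A feature that returns `None` from `trace` has no path to the goal, which is the whole test the map applies.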

The insight that changes everything

Tom looks at the map and goes quiet. His farm analytics dashboard isn’t on it.

He tries to find a place for it. “It could go under farms, help them commit supply reliably?”

Lee pushes back. “Would a dashboard showing yield trends actually change whether a farm commits supply?”

Maya is honest. “The farms I work with commit supply because I ring them on Tuesday and ask what they’ve got. A dashboard wouldn’t change that. What would help is if they could just text me when something’s gone wrong.”

Tom’s dashboard is interesting software. It’s not goal-critical. The map makes that visible.

Meanwhile, “pause subscription” is under the most important impact: keeping existing subscribers. Jas mentions that three subscribers have already cancelled because they were going on holiday and couldn’t skip a week. They didn’t churn because of bad produce. They churned because there was no pause button.

Sam pulls up the numbers. They’re down a net 17 subscribers in a month. Three mentioned inflexibility. If they could retain even half the churning subscribers, it would be worth more than acquiring new ones, because retained subscribers also refer friends.

The pause feature is a day’s work, maybe two. Tom’s dashboard would take three weeks. The map makes the decision obvious.

Prioritising with the map

Lee draws a two-by-two grid: impact on the goal versus effort to build.

Do first (high impact, lower effort)

Pause subscription
Shortfall reporting tool
SMS deadline reminders
Box preview notifications

Plan carefully (high impact, higher effort)

Referral programme
Automated supply matching
Customer reviews

Maybe later (lower impact, lower effort)

First box discount
Money-back guarantee
Shareable box photos
SEO landing pages

Probably never (lower impact, higher effort)

Demand forecasting for farms
Forward contracts
Flexible box sizes

Tom notices something. “My dashboard isn’t even on the grid.”

“The grid only has deliverables from the Impact Map,” Lee says. “Your dashboard couldn’t trace a line back to the goal.”

Tom’s dashboard was never rejected or argued down. It simply didn’t appear. There’s nothing personal about it; the reasoning is visible on the whiteboard.

But it doesn’t quite feel that way to Tom. After the session, he goes back to his desk and quietly closes the design document he’d been working on, the one with the seasonal forecasting charts he’d been excited about. Priya notices. She sends him a message: “The dashboard isn’t dead. It just serves a different goal.” Tom doesn’t respond for an hour. Then: “I know. Thanks.”

The team commits to the top-left quadrant for the next two sprints. Pause subscription first, then shortfall reporting.
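
Lee's grid is simple enough to write down as data. What follows is a minimal sketch, not anything the Greenbox team actually built: the `Deliverable` class, the `quadrant` function, and the boolean impact and effort flags are all assumptions made for illustration. The four deliverables are taken from the grid, one per quadrant.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deliverable:
    name: str
    high_impact: bool  # judged against the goal on the Impact Map
    high_effort: bool  # rough build cost, as estimated in the room


def quadrant(d: Deliverable) -> str:
    """Map an impact/effort judgement onto one of the four quadrant labels."""
    if d.high_impact:
        return "Plan carefully" if d.high_effort else "Do first"
    return "Probably never" if d.high_effort else "Maybe later"


# One deliverable per quadrant, taken from the grid.
deliverables = [
    Deliverable("Pause subscription", high_impact=True, high_effort=False),
    Deliverable("Referral programme", high_impact=True, high_effort=True),
    Deliverable("SEO landing pages", high_impact=False, high_effort=False),
    Deliverable("Forward contracts", high_impact=False, high_effort=True),
]

grid = {}
for d in deliverables:
    grid.setdefault(quadrant(d), []).append(d.name)

for label, names in grid.items():
    print(f"{label}: {', '.join(names)}")
```

Anything that never made it onto the Impact Map, like Tom's dashboard, has no entry to classify in the first place, which is the point Lee makes at the whiteboard.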

When to use Impact Mapping

Quarterly planning. When deciding what the team should focus on next, an Impact Map grounds the conversation in outcomes rather than feature wishlists.
Roadmap discussions. When stakeholders lobby for competing features, the map provides a framework: “Which impact does this serve?”
When the backlog feels disconnected. If nobody can explain why half the items are there, Impact Mapping will either connect them to a goal or expose them as noise.

When not to use it

Individual story-level decisions. Impact Mapping operates above the feature level. For “what does this specific story mean,” use Example Mapping.
When the goal isn’t clear. Impact Mapping will just expose that gap. That’s useful, but fix the goal first.
As a one-off exercise. A map created once and filed away is worthless. Revisit it as you learn. Some hypotheses won’t work. Update the map.

Back to Greenbox

The session takes about ninety minutes. Tom starts on the pause feature that afternoon. It ships two days later.

The deploy has a bug. Paused subscribers still get charged. Sam gets three angry emails within an hour. Tom tries to roll back, but his deploy script only goes forward; there’s no way to swap back to the previous version. He has to fix the bug and deploy again, which takes forty-five minutes. Three subscribers were incorrectly charged $25 each. Maya refunds them personally. Tom writes the rollback capability that evening. “I never want to be unable to undo a deploy again.” It’s a one-line change: keep the previous version and be able to swap back. Simple but essential.
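
Tom's actual script isn't shown, so the following is only a sketch of the idea he describes: remember the version that was live before each deploy, so there is always something to swap back to. The `ReleaseSlots` class and the version labels are hypothetical.

```python
from typing import Optional


class ReleaseSlots:
    """Tracks the live version and the one it replaced (hypothetical helper)."""

    def __init__(self) -> None:
        self.current: Optional[str] = None
        self.previous: Optional[str] = None

    def deploy(self, version: str) -> None:
        # Remember what was live before switching, so it can be restored.
        self.previous = self.current
        self.current = version

    def rollback(self) -> str:
        if self.previous is None:
            raise RuntimeError("no previous version to roll back to")
        # Swap rather than overwrite, so a mistaken rollback can itself be undone.
        self.current, self.previous = self.previous, self.current
        return self.current


slots = ReleaseSlots()
slots.deploy("v41")        # made-up label for the last good release
slots.deploy("v42-pause")  # made-up label for the release with the billing bug
slots.rollback()
print(slots.current)       # prints "v41"
```

A real deploy would swap a symlink, a container tag, or a load-balancer target rather than a string, but the shape is the same: the forward-only script loses the old version, and this one keeps it.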

Within a fortnight, churn drops noticeably. Two subscribers who’d been about to cancel stay on because they can pause over the Easter holidays. One tells a friend, who signs up.

It’s a small win. But it’s a connected win: the team can trace a line from the feature to the behaviour change to the goal. That’s the difference between shipping features and making progress. The map is a set of hypotheses, and this one checked out. Ship the pause button, churn drops. Hypothesis confirmed. Move to the next.

But a new problem is emerging. The Impact Map has generated a prioritised list of deliverables, and each one is breaking into multiple stories. The pause feature was simple, two days, done. But the referral programme has five stories. The shortfall reporting tool has three. The backlog is growing fast.

And it’s causing real problems. Priya ships “pause for one week,” but “resume after pause” is three sprints away because Tom is working on referral tracking. From the subscriber’s perspective, they can pause but there’s no way to unpause: a broken experience, not a feature. Meanwhile, Sam has built the shortfall reporting tool for farms, but nobody built the notification that tells Maya a shortfall was reported. The tool exists but it’s disconnected from the workflow.

The team is shipping individual stories that make sense in isolation but don’t add up to a coherent experience.

They need a way to see the whole, to lay out the user journey end to end, so they stop shipping puzzle pieces that don’t connect.

The Workshop: Event Storming a Process

2026-04-13T06:30:00+08:00

This is the second of three posts on running Event Storming. The Event Storming a Domain post is the entry point in Brandolini’s ordering (Alberto Brandolini, inventor of Event Storming, proposes Big Picture → Process Level → Software Design as the natural order) and introduces the technique at a whole-domain scale; if you haven’t read it, start there. This post picks up where Big Picture drops off: a dot-voted hotspot from a Big Picture session is the natural scope of a Process Level session.

Event Storming an Architecture is the next step — zooming further in, turning a Process Level map into a software design. Coming soon.

For the technique in action inside a small startup, see Event Storming: Building Shared Understanding.

Where Process Level sits

Process Level is the middle zoom of Event Storming: one flow, small team, three hours, the full Process Modelling palette of events, commands, actors, policies, and read models. Big Picture (the previous post) sits above it at whole-domain scale on a stripped-down three-colour palette; Software Design sits below it, turning one flow’s wall into aggregates and code boundaries. Most of the time you run Process Level on its own, on a scoped flow — a billing cycle, a deployment pipeline, an incident that crossed a couple of services — without ever running Big Picture first. When you do run it after Big Picture, the scope comes from a dot-voted hotspot the Big Picture wall surfaced.

At a glance

Who, for how long: a facilitator plus one or two domain experts, at least one developer (include a junior), product or design, and operations/support where the flow touches them. Four to eight people, three hours.
What you walk out with: a wall of events in time order with commands, actors, and selectively policies and read models underneath, plus 3–7 named hotspot piles each with an owner and a next step.
When to reach for it: a specific flow you’re about to build, inherit, or investigate, where the team’s mental models are quietly different, or a dot-voted hotspot from a Big Picture session. Not for whole-domain scope (run Big Picture first), code design (run Architecture), or a flow one team already shares a strong model of.

The Process Modelling palette

Brandolini’s Process Modelling uses six note colours. Four of them carry the backbone of the wall; the other two appear where they add precision.

Orange — events. Things that happened, past tense. The backbone of the wall. “Payment Captured.” “Stock Reserved.”
Blue — commands. The intent that produced the event, present tense, imperative. “Capture Payment.” “Reserve Stock.”
Small yellow square — actors. The person who issued the command. “Customer.” “Warehouse picker.”
Larger pink rectangle — external systems. Third parties whose vocabulary you don’t own and whose contracts you negotiate against. “Stripe.” “The carrier API.” Distinct from actors: separating them is what later drives anti-corruption-layer conversations.
Small pink — hotspots. Disagreements, questions, pain points, anything the room flags for follow-up.
Lilac / purple — policies. The “whenever X, then Y” rules that issue commands in response to events. The chain is always event → policy → command → event. Events don’t cause events directly; something reads the event (a policy, a person, a clock) and decides to issue a command. Most of a Process Level wall’s policies are implicit; when the room agrees on a rule out loud, or when a rule is contested and resolved, it earns a purple sticky.
Pale green — read models. The data a policy (or a person) consults before deciding what to do. “Before reserving stock, check current stock level.” Green notes live next to the policy or command that reads them.

If you’re running your first few Process Level sessions, it’s fine to stay on the four-colour backbone (orange, blue, yellow, pink hotspot) and capture policies verbally in the notes. Brandolini’s full palette is what you grow into as the room gets comfortable — not a gate you have to pass before you run the session.
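
For the developers in the room, the event → policy → command → event chain can be sketched in a few lines. This is an illustration under stated assumptions, not a prescribed implementation: the class and function names, the SKUs, and the stock figures are all made up, and the only things taken from the palette are the note names ("Payment Captured", "Reserve Stock", "Stock Reserved") and the read model (current stock level).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:  # orange: something that happened, past tense
    name: str
    sku: str


@dataclass(frozen=True)
class Command:  # blue: an intent, imperative
    name: str
    sku: str


# Pale green: the read model the policy consults. Figures are made up.
stock_levels = {"BOOK-1": 3, "BOOK-2": 0}


def reserve_on_payment(event: Event) -> Optional[Command]:
    """Purple policy: whenever Payment Captured, then Reserve Stock."""
    if event.name != "Payment Captured":
        return None
    if stock_levels.get(event.sku, 0) > 0:
        return Command("Reserve Stock", event.sku)
    return None  # what to do at zero stock is left open in this sketch


def handle(command: Command) -> Event:
    """Carrying out the command is what produces the next event."""
    stock_levels[command.sku] -= 1
    return Event("Stock Reserved", command.sku)


captured = Event("Payment Captured", "BOOK-1")
command = reserve_on_payment(captured)
if command is not None:
    print(handle(command).name)  # prints "Stock Reserved"
```

Nothing in the sketch lets one event produce another directly. The policy reads the event and the read model, decides, and issues a command; only the command handler emits the next event.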

Intent

Build one precise shared model of one specific process — a flow, a pipeline, a cycle, an incident, a feature area — with the people who touch it in the same room, so the team building or operating that process has one model, not five.

The output is a wall of events in time order, commands under them, actors above them, and a prioritised list of the questions the session raised.

When to use it

Reach for Process Level when:

You’re about to build a specific flow and the team’s mental models of it are quietly different
You’re inheriting a process nobody documented
You’re investigating an incident that crossed two or three services
You’ve zoomed in from a Big Picture session and you have a named hotspot to dig into
A piece of work is about to cross multiple people’s areas and you want to spot mismatches early

Don’t reach for Process Level when:

The scope is the whole business or a whole product line — run Big Picture first
You’re ready to design code — run Event Storming an Architecture
Only one team is involved and they share a strong mental model already
The scope is one screen, one function, or one isolated job — it’s too small for the ceremony

Participants

Facilitator. Does not participate in content; their job is to manage the session. Ideally someone who’s run one of these before; if not, pair with someone who has.

Domain expert(s). The people who know how the process actually works. For a billing flow, the finance lead. For a deployment pipeline, the SRE who runs it. For an incident review, the engineers who responded. One or two, not a crowd.

Developers. At least one, and include a junior if you have one. Juniors ask the questions seniors have stopped asking.

Product or design — whoever will turn the output into stories.

Operations, support, frontline. For incident, deployment, or support-heavy flows, these are the domain experts. Don’t tuck them in as afterthoughts.

Group size: 4–8. Smaller than Big Picture because the scope is tighter. Fewer than four and the conversation is too thin; more than eight and the voices overlap.

Who to leave out:

End users and customers. People self-censor around the people they serve. Interview users separately; bring their words in on a sticky note.
Senior leaders who can’t stop correcting. If the senior is the domain expert, brief them first: their job is to answer when asked, not to lead.
Spectators. Anyone “just observing” absorbs airtime without contributing. Either in or out.

Materials and timing

Phase	Duration	Materials	Key question
Arrivals, intro, ground rules	~15 min	—	“What are we doing and why?”
Chaotic exploration	20 min	Orange	“What happens?”
Timeline	30 min	Orange + pink	“What order? What’s wrong?”
Break	10 min	—	—
Commands and actors	20 min	Blue + yellow (purple + green as rules emerge)	“What triggered it? Who did it?”
Hotspots	30 min	Pink	“What scares us most?”
Wrap-up, owners, next steps	15 min	—	“Who owns what next?”
Buffer	20 min	—	—
Total	~2h 40min inside a 3-hour block

The four working phases are 100 minutes. The remaining hour is the unglamorous stuff: arrivals, the intro, the mid-session break, the wrap-up, and the conversations that inevitably run long. Don’t try to fill the 20 minutes of slack — you’ll need it.
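
If you adjust any of the durations for your own session, it's worth re-checking the arithmetic. A small sketch that adds up the table as given (the variable names are just for illustration):

```python
# Minutes per phase, copied from the timing table.
phases = {
    "Arrivals, intro, ground rules": 15,
    "Chaotic exploration": 20,
    "Timeline": 30,
    "Break": 10,
    "Commands and actors": 20,
    "Hotspots": 30,
    "Wrap-up, owners, next steps": 15,
    "Buffer": 20,
}
working = ["Chaotic exploration", "Timeline", "Commands and actors", "Hotspots"]

total = sum(phases.values())
working_total = sum(phases[name] for name in working)
block = 180  # the three-hour booking

print(total, working_total, total - working_total, block - total)
# prints "160 100 60 20"
```

That is 160 minutes scheduled (2h 40min), 100 of them in the four working phases, 60 in everything else, and 20 minutes left unscheduled in the three-hour block.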

Facilitator playbook

Phase 1 — Chaotic exploration (20 min)

Before the first sticky goes up, do two things that look trivial and aren’t.

Set the safety out loud:

“The only rule for this phase is that every note is valid. Duplicates are fine, half-formed ideas are fine, things that might be wrong are fine — that’s exactly what we’re here to find. If you’re not sure, write it anyway.”

Set the granularity. Stick two example events on the wall yourself at the level a domain expert would say them out loud:

“Payment Submitted.” “Parcel Dispatched.” “Alert Fired.” “Deployment Rolled Back.”

Hand out orange pads. Name the most junior person in the room:

“[Name], what’s the first event you can think of for this flow? Doesn’t have to be the start — just the first one that comes to mind.”

Then give the instruction that covers the rest of the phase:

“Write silently. Get everything you can think of onto orange notes and onto the wall. Don’t worry about order. Don’t worry about duplicates. Twenty minutes on the clock, then we stop.”

No talking until the timer goes off. By the end you should have 40–80 notes. Some will be duplicates; some will contradict; some will make no sense yet. That’s exactly right.

What to watch for:

Someone talking instead of writing. Gently: “Get it on a sticky note.”
Someone waiting for permission. “Duplicates are gold. Write yours anyway.”
One person filling the wall while others have three notes. Usually sorts itself out in the timeline phase; keep an eye on it.
Someone reaching for a pink note already. “Good instinct — hold that thought.”

Phase 2 — Timeline (30 min)

Now everyone talks. The job is to arrange the orange notes left-to-right in chronological order.

“Let’s put these in order. Talk to each other. If you disagree, put a pink note on it and we’ll come back.”

At the fifteen-minute mark, the wall will look messy and you’ll worry. That is what success looks like midway through. There’ll be clumps where people stood; gaps where nobody’s arranged yet; two notes stacked because someone tried to merge them and gave up; overlapping candidates for the first event; a couple of pink notes nailed into contested spots. If the wall looks tidy at the fifteen-minute mark, either the scope was too small or one person is doing all the moving.

Merge obvious duplicates. Leave ambiguous ones — if two notes might be the same event, that’s a conversation worth having later.

What to watch for:

One person placing every note while everyone else watches. The most common failure mode. Pair a developer with the domain expert and ask them to walk a section together.
No pink notes appearing. Disagreements are hidden, not absent. Prompt: “Is anything on this wall surprising you?”
Rabbit holes into solution design. “Great implementation idea — park it. We’re mapping, not building.”
Parallel flows emerging. Let them spread vertically into swim lanes: horizontal bands on the wall, one per actor or system, that keep parallel flows visually separate. A rollout flow and a rollback flow can share a wall.
Events causing events. Someone asks “so does Payment Captured cause Stock Reserved?” Name the rule: “Events don’t cause events. Something reads Payment Captured — a person, a rule, a clock — and decides to reserve stock. The chain is always event → decision → command → event, not event → event.” First-timers want to draw arrows between orange notes within the first hour. Name the rule before they do.

Phase 3 — Commands and actors (20 min)

Hand out blue and yellow notes. Introduce them one colour at a time — if you drop both on the table at once, people grab whichever is closest and the wall gets noisy.

Blue first — commands. For each event, what intent produced it?

“Blue notes go below the orange event. They’re the command that made it happen. ‘Submit Payment’ caused ‘Payment Submitted’. ‘Pick Order’ caused ‘Items Picked’. Every event has a command somewhere — even if the command is a scheduled job or a reaction to another event.”

Yellow next — actors. Who issued the command?

“Small yellow squares go above the command. An actor is a person, a role, or a system. ‘Customer’ is fine; ‘the system’ is not — which system? ‘Warehouse picker.’ ‘Stripe.’ ‘Nightly cron.’ Be specific enough that the name points to someone or something you could actually talk to.”

Two discipline points most first-time facilitators miss:

Small yellow squares, not full-width notes. The actor sits next to or on top of the command; it’s smaller than the command, because the command is the important bit at this level.
Deduplicate. If the same actor issues three commands in a row, you don’t need three yellow squares — stick one next to the first command and let the row speak for itself. Real ES walls have one or two actor squares per band, not one per event.

What to watch for:

“The system” on too many yellow squares. “Which system? Automated or manual? What happens when it fails?”
Business rules hiding inside a command. “Wait, this command only runs sometimes — what decides?” If the room can name the rule out loud, purple-note it in “whenever X, then Y” form between the triggering event and the command; if the decision needs a fact (a balance, a stock level, a flag), stick a pale-green read model next to the policy. If the rule is contested, leave it as a pink hotspot for the next phase to resolve.
One person’s name on multiple recurring actors. Scaling bottleneck. Pink note.
Commands that nobody can explain. “Who decides this?” followed by silence is extremely valuable. Pink note.

Phase 4 — Hotspots (30 min)

Gather the pink notes — the ones you’ve been accumulating on the wall, plus new ones you’ll generate by prompting for them. These are the most valuable output of the session.

The mechanics matter more than most facilitators realise; clustering pinks under time pressure is where first-time facilitators freeze. A shape that works:

Read the pinks aloud, one by one (5 min). You walk the wall and read each pink note. No commentary — just read. This refreshes the room and gives you time to spot repetition.
Move notes into rough piles (10 min). Take the pinks off the wall and put them into 3–7 piles on a table or a clear section of wall. Let the room help. If a note could go in two piles, put it in the bigger one. The goal isn’t clean boundaries; it’s rough themes.
Name each pile (5 min). For each pile, write a one-sentence theme on a fresh pink note and put it on top. “Rules nobody has written down.” “Cross-team handoffs with no SLA.” “Edge cases we deferred.”
Owner and next step per pile (10 min). “Who owns finding the answer? What’s the next step?” Write both on the theme note. Time-box 90 seconds per pile; if a conversation runs long, park it.

Don’t try to solve anything in-session. Identify, name, assign, move on. Solving is not this phase’s job.
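
When you transcribe the piles afterwards, a small structure makes it easy to spot a pile that left the room without an owner or a next step. The sketch below is one possible shape, assuming a hypothetical `HotspotPile` record and `problems` check; the themes are the examples above, and the owners and next steps are invented.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HotspotPile:
    theme: str
    notes: List[str] = field(default_factory=list)
    owner: str = ""
    next_step: str = ""


def problems(piles: List[HotspotPile]) -> List[str]:
    """List what is still missing before the hotspot phase can close."""
    issues = []
    if not 3 <= len(piles) <= 7:
        issues.append(f"expected 3-7 piles, got {len(piles)}")
    for pile in piles:
        if not pile.owner:
            issues.append(f"'{pile.theme}' has no owner")
        if not pile.next_step:
            issues.append(f"'{pile.theme}' has no next step")
    return issues


# Owners and next steps here are invented for the example.
piles = [
    HotspotPile("Rules nobody has written down",
                owner="domain expert", next_step="Book an Example Mapping session"),
    HotspotPile("Cross-team handoffs with no SLA",
                owner="operations lead", next_step="Time-boxed investigation"),
    HotspotPile("Edge cases we deferred", owner="product owner"),
]

print(problems(piles))
# prints ["'Edge cases we deferred' has no next step"]
```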

Worked example — Pagebound’s order-to-delivery flow

Pagebound is a mid-sized online independent bookshop: about 200,000 customers, six warehouses, an engineering team of thirty, a customer support operation that fields returns and lost parcels. A recent Big Picture session on the whole customer experience produced a prioritised list of hotspots; the top one was “when do we reserve stock?” — commerce said at checkout, the warehouse said at payment captured, and both teams had been operating on their own model for eighteen months.

The Process Level session scope, in one phrase: the order-to-delivery flow, from checkout started through parcel delivered. Six people in the room — commerce lead, a commerce engineer, the fulfilment team lead, a warehouse supervisor, the SRE who owns the payment integration, and a customer-success lead who’s been fielding the over-sold complaints.

Here’s what the wall looks like at the end of the session — fifteen events in rough time order, commands underneath, small yellow actor squares deduplicated per band, one purple policy and its pale-green read model where the room agreed on a contested rule, and four pink hotspots showing the questions the session raised:

Notice what’s on the wall and what isn’t. The actor bands show that the customer initiates the first two events, then the order service quietly does its work, then the warehouse takes over for three events, then the carrier, then the customer again at the end. That’s five handoffs — and every handoff is a candidate for something to go wrong, which is why three of the four pink notes cluster around handoff boundaries.

One purple policy made it onto the wall, and getting it there is what the session set out to do: the room resolved the stock-reservation timing question out loud, decided that “whenever Payment Captured → reserve stock” was the right rule, and stuck it up so nobody could quietly forget later. A pale-green read model beside it names the fact the policy depends on — current stock level — because the next thing the team will argue about is what happens when that number is zero, and pinning the read model now saves an argument later. The other implicit policies (events quietly triggering subsequent commands) are left unspoken — that’s fine for Process Level; the Architecture session is where every crossing gets a policy and every policy gets its read model.

The four pink hotspots are the real output. Two (over-sold orders, substitution) will turn into Example Mapping sessions with business rules attached. One (the handoff gap) becomes an investigation with the carrier. One (wrong-address SLA) becomes a conversation between customer success and legal. None of them get “solved” at this session, and trying to would burn the next two hours on arguments that belong in their own meetings.

What can go wrong

Named failure modes.

The silent room. Nobody is writing or talking. Recovery: The prompt is too abstract. Make it concrete: “What’s the first thing that happens when a customer clicks ‘place order’?” Stop if: 20 minutes in and the wall is still empty. Scope is wrong, people are wrong, or there’s a political problem you haven’t named.

The lecture. One expert explains while everyone listens politely. Recovery: Pair people up, give each pair a section of the wall. Stop if: Two pairs in and it’s still the same voice. The session is producing one person’s model.

The argument. Two people disagree about how something works. Recovery: Let it play for 2–3 minutes. This is often the session working. If it’s not resolving, pink note it. Stop if: The argument has gone personal. Break; resume only if the air has cleared.

The solution-jumper. Someone keeps designing the code instead of mapping the process. Recovery: “Great implementation idea — park it.” Stop if: They can’t hold the distinction after a third prompt. They belong in an Architecture session, not this one.

The missing person. Nobody in the room knows how a key part of the process works. Recovery: Pink note it with a name. “Need to talk to [person] about [topic].” Stop if: Multiple key parts are owned by people not in the room. Reschedule with the right attendees.

The political silence. A senior is in the room and the juniors have stopped writing. Recovery: Pair juniors with peers away from the senior; or ask the senior to step out for a call (briefed in advance); or enforce silent writing with no exceptions. Stop if: None of the above shifts the dynamic. Photograph what’s on the wall, reschedule.

Outputs

Same day, or within 24 hours:

Panoramic high-resolution photographs of the wall.
A transcribed event list, command list, and hotspot list (each pile named, owner, next step) in a shared document.
A short summary to participants: “Here’s what we found, here’s what happens next.”

The product owner’s (or equivalent’s) week:

Turn each event into a vocabulary entry. “Stock Reserved” means exactly one thing; defend the phrase against drift.
Triage the hotspots. Each pile becomes one of: (a) work for this sprint, (b) a time-boxed investigation, (c) a follow-up workshop (Example Mapping, Decision Tables, Architecture), (d) a conversation. Resolve the ones that block the next sprint; defer the rest.
Book the follow-ups. Don’t let momentum dissipate.
Walk the wall with anyone who couldn’t attend. Their perspective often surfaces hotspots the original room missed.

Where to go next

Event Storming a Domain — zoom out when you realise the scope of your Process Level problem is actually organisational, not procedural.
Event Storming an Architecture — the natural next step when the Process Level flow is clear and you’re about to turn it into code boundaries.
Event Storming: Building Shared Understanding — the narrative version on a smaller team.

The Workshop: Event Storming a Domain

2026-04-11T06:30:00+08:00

This is the first of three posts on running Event Storming. Brandolini presents the technique starting from here: Big Picture, the widest zoom, because it’s the session you usually reach for first when you step into an unfamiliar domain. The other two posts zoom in:

Event Storming a Process: the default, smaller session you’ll run most often. Holds the full Process Modelling palette and the shape of a standard three-hour session.
Event Storming an Architecture: zooms further in, turning a Process Level map into a software design. Coming soon.

For the technique in action inside a small startup, see Event Storming: Building Shared Understanding.

About Event Storming

Event Storming gathers everyone who touches a domain in front of a long wall and asks them, silently and in parallel, to write down things that happened on orange sticky notes, in past tense, one fact per note. “Order Placed.” “Payment Captured.” “Parcel Delivered.” The notes go up; the timeline gets enforced left-to-right; the arguments that break out over where a note belongs become the thing you came for. The output is a shared wall, pink hotspots marking the places that hurt, and a dot-voted shortlist of what to investigate next. Big Picture is the widest of the three Event Storming zooms; if you already know which single flow you want to map, you want Process Level instead, and if you’re ready to turn a process into code boundaries, you want Architecture.

At a glance

Who, for how long: one or two facilitators (always two above ten people), domain experts from every slice of the domain (two per slice), developers and architects to listen, frontline operations and support, and a sponsor who opens and leaves. Eight to twenty people, a full day minimum, often two.
What you walk out with: a wall the room agrees on (orange events, pink hotspots, yellow systems and people, green opportunities, red pivotal moments), panoramic photos and a transcribed event/system/hotspot list within 24 hours, a glossary of the vocabulary that emerged, and three to five follow-up Process Level sessions booked within two weeks, one per dot-voted hotspot.
When to reach for it: a major initiative is starting and several teams need one picture before anyone commits, you’re new to an organisation and nobody can describe the domain end-to-end, or an incident crossed services and the timeline lives in Slack and people’s heads. Not for designing code (that’s Architecture), not for mapping a single known flow (that’s Process Level), and not when leadership will sit in and correct the frontline.

Intent

Build one shared picture of a whole domain (a product, a platform, a business line, a customer experience) with everyone who owns a piece of it in the same room, so the organisation can see itself end-to-end and pick the hotspots worth investigating.

The output isn’t a design, a roadmap, or a plan. It’s a long wall of orange stickies the room agrees on, pink notes marking the places that hurt, and a prioritised shortlist of follow-up sessions.

When to use it

Reach for Big Picture when:

A major initiative is starting and several teams need one picture before anyone commits
You’re new to an organisation and nobody can describe the domain end-to-end without stopping three times to ask someone else
An incident crossed six services and the timeline lives in Slack, git history, and people’s heads
Two companies are integrating and both sides need to see each other’s domains
You’re a consultant and the client has asked for “help with architecture” but you don’t yet know what help means

Don’t reach for Big Picture when:

You know which specific flow you need to work on: run Process Level on that flow
You’re ready to design code: run Event Storming an Architecture
You can’t get the right people in the room for most of a day
Leadership will sit in and correct people; you’ll get political theatre, not discovery
The scope is one team, one product, one well-understood flow; it’s too much machine for the job

Scope: the hardest decision before the session

The single most common way Big Picture sessions go wrong is the scope being wrong. Not too ambitious; wrong-shaped. Two failure modes to avoid:

Too big. “Map the whole enterprise.” An enterprise with five product lines, three channels, and two regulatory contexts is five or six separate Big Pictures, not one. If you find yourself asking “whose slice do we even start with?”, split.

Too small. “Map the deployment pipeline.” That’s a single process; it’ll fit comfortably in a Process Level session and won’t need twelve people in a room for a day.

The sweet spot. Something you can describe in one short phrase that (a) spans 3–6 teams, (b) is coherent enough to fit on one wall over a day, and (c) nobody in the organisation currently owns end-to-end. “The billing platform.” “Customer onboarding.” “Our order-to-cash.” “The claims lifecycle.” If several different people in the organisation each own a piece and none owns the whole, you’re on.

Participants

Facilitator(s). Two for groups above ten, always. One watches the wall; one watches the room. Big Picture is harder to facilitate than Process Level because the group is bigger and the failure modes are more political. Don’t run your first one alone.

Domain experts from every part of the domain. The rule: if a slice isn’t represented in the room, it’ll be missing from the wall. For an e-commerce business that means product, engineering, operations, support, finance, logistics, maybe marketing. For a bank it means front office, back office, compliance, risk, IT. Two people per slice: one with deep domain knowledge, one with freshest-to-the-job eyes.

Developers and architects. Not to design; to listen, write, and discover where the business model and the code model have quietly diverged.

Operations and frontline support. Where the surprises live. If the leadership team says the product works one way and the support team sees something different, Big Picture is where both of those truths land on the same wall. Don’t tuck them in as afterthoughts.

Sometimes leadership, with care. A sponsor who opens the session and then leaves is useful. A leader who sits in and corrects every event they disagree with kills the session. Brief them before; if they can’t hold the discipline, run without them.

Group size: 8–20. Below 8 and you’re not spanning enough of the domain; above 20 and the conversations fragment and some voices stop contributing.

Before the session

The single biggest lever on outcome quality isn’t what happens in the room; it’s the week before. Meet the sponsor and agree four things:

Scope, in one short phrase. If you can’t both say it the same way, don’t schedule the session yet.
The guest list. Every slice represented by one or two names; no political attendees.
The sponsor’s role during the session. Ideally: open, leave, come back for the wrap-up. Explicitly negotiate this. If they won’t hold it, reschedule.
The question the output has to answer. Not “do a Big Picture” (that’s the method, not the outcome). “Give us a prioritised list of cross-team investigations worth running next.” “Give us one shared picture we can point at when we disagree later.”

Without this, you’re flipping coins. With it, you’ve done half the facilitation before the first sticky goes up.

Materials and timing

Phase	Duration	Materials	Key question
Sponsor opens; ground rules	15–20 min	—	“Why are we here?”
Chaotic exploration	60–90 min	Orange notes	“What happens in this domain?”
Enforce the timeline	45–60 min	Orange notes, pink notes	“What order? What’s contested?”
Reverse narrative	20–30 min	Orange, pink	“What had to be true for this?”
Break	30–60 min	—	—
Explicit walkthrough	60–120 min	A walker, listeners	“Does this match what you know?”
Pain and systems	60 min	Pink, yellow	“Where does this hurt?”
Dot-voting	20–30 min	Sticky dots	“Which hotspots matter most?”
Wrap-up, owners, next steps	20–30 min	—	“Who does what next?”
Buffer	30–60 min	—	—
Total	Plan for a full day, minimum. Two days is common. Three for complex domains.

Big Picture is the most expensive of the three Event Storming levels by a wide margin, and the easiest to do badly. It isn’t something you cram into an afternoon.

A note on note colours

At Big Picture, you deliberately use fewer colours than you would at Process Level. Brandolini’s rule: you’re looking for shape, not precision. The palette:

Orange: domain events, in past tense. The backbone of the wall. “Order Placed.” “Parcel Delivered.”
Pink: hotspots, pain points, disagreements, questions, places where the room stops agreeing. Every pink note is a candidate for follow-up.
Yellow: systems and people, loosely. “Stripe.” “Our warehouse.” “The customer.” Don’t worry about the person/system distinction at this level; stick it on the wall and let Process Level sort it out.
Green: opportunities. “Could we let subscribers preview next week’s box?” “Could the carrier handle returns themselves?” The lightbulb sticky, the thing that isn’t happening yet but the room thinks should be. A Big-Picture-only colour. Encourage them throughout exploration; they’re where productive arguments often start.
Red (or a tall vertical line): pivotal events. The four to eight key moments where the state of the domain fundamentally changes. They emerge during the timeline phase and divide the wall into phases.

That’s the whole palette. No blue commands. No purple policies. No green read models. (Blue means commands: intentions, things someone is doing. Purple means policies: “when X happens, do Y” rules. Green-as-read-model means query-shaped projections of state.) All three belong at Process Level; if you reach for them here, you’re on the wrong level. Note that the green sticky at Big Picture means opportunity, distinct from Process Level’s green read-model sticky. Same colour, different meaning, different level.

Facilitator playbook

The exact phase structure varies by practitioner. Here’s a shape that works for a one-day session on a medium-sized domain (8–15 people). Scale the timings up for two-day sessions.

Phase 1: Chaotic exploration (60–90 min)

Set the safety out loud:

“Every note is valid. Duplicates are fine. Things that might be wrong are fine; that’s exactly the kind of note this session lives on. If you’re not sure whether something counts as an event, stick it up anyway and we’ll sort it out later. Silent writing for the next hour. No talking; the wall does the talking.”

Set the granularity with a mix of examples from across the domain:

“These are events: things that happened, past tense. Order Placed. Adjuster Assigned. Complaint Filed. Integration Deployed. Account Suspended. Write at the level a domain expert would say it out loud, not a database-row level, not a strategy level.”

Name the most junior or most frontline person in the room and ask them to stick up the first note. The pattern you’re setting: this is a working session, not an executive meeting, and the least senior person writes first.

Then silence. Set a visible timer for sixty minutes; if the wall isn’t full at the hour mark, run it to ninety.

By the end you should have somewhere between 150 and 400 notes, depending on the domain. If you have fewer than 100, either the scope was wrong, the guest list was wrong, or the room hasn’t yet believed you that writing is the job.

What to watch for:

Talking instead of writing. “Get it on a note. We’ll talk during the timeline.” Repeat as needed.
Whole departments not writing. If everyone from support has three notes between them and sales has forty, something is off. Move the facilitator over. Make eye contact. Invite specific events: “What’s the first thing you see when a customer calls in?”
Executive-only events. “Strategy Agreed.” “Board Met.” If the wall is all leadership verbs, the frontline isn’t contributing yet. Something is blocking them, usually whoever is standing at the other end of the room.
People writing wishes, not events. “That sounds like what we’d like to happen. What actually happens?”

Phase 2: Enforce the timeline (45–60 min)

Everyone talks. The job is to arrange the notes left-to-right in rough chronological order, spreading vertically into parallel tracks wherever the flow genuinely forks. It will be messy. That’s the point.

Open it:

“Put these in order. Don’t aim for perfection; rough chronology is enough. Parallel things go in parallel. If you disagree about where something goes, put a pink note on it and move on.”

Walk the room. Prompt clusters to form around parts of the domain: “discovery over here, money in the middle, fulfilment to the right.” Accept that the timeline will have several overlapping tracks.

About thirty minutes in, pause and find the pivotal events. This is the single most productive move in timeline construction, and first-time facilitators almost always skip it.

“What are the 4–8 most important events on this wall? The moments where the state of the customer, the product, or the business fundamentally changes? Call them out.”

Mark each with a tall red dashed line or a big dashed box around it. Once the pivotals are visible, the rest of the timeline organises itself into the phases between them. Teams that skip this step spend another twenty minutes arguing about whether Card Expired goes before or after Renewal Notice Sent; teams that do it first stop caring, because both events belong to the same phase.

What to watch for:

One department dominating the timeline. Pair people from different departments and give them sections.
No pink notes appearing at all. Disagreements are hidden, not absent. Prompt: “Is anything on this wall surprising you?”
Rabbit holes into policy debates. “Great policy conversation; park it. We’re looking for rough chronology.”
People trying to make it tidy too early. “It’s supposed to be messy. Tidy comes at Process Level, not here.”
Duplicates proliferating. Leave ambiguous ones; if two notes might be the same event, that’s a pink note, not a merge.
No pivotal events getting called out. The team may be too deep in the weeds. Name two you think are obvious and ask which other ones they’d add. Then let them disagree.

Phase 3: Reverse narrative (20–30 min)

Walk the wall backwards once. This is a Brandolini move that sounds strange and is the single most effective way to find missing events.

Start at the rightmost event and ask: “What had to be true for this to happen? What had to happen just before it?” Work right to left.

“Going forwards, we tell a story we already believe. Going backwards, we discover the bits we’ve been handwaving. Every ‘we don’t know’ is a pink note. Every ‘oh wait, it must be…’ is a new orange sticky.”

Expect 20–40 new events on the reverse pass, most of them on the left-hand side of the wall where the early steps got skipped because nobody in the room owns them. The reverse narrative is where the cross-team gaps become undeniable: three teams each discover they don’t know how something actually starts, and the answer almost always involves a team that isn’t in the room.

Phase 4: Explicit walkthrough (60–120 min)

This is what Big Picture is for. Everything before was preparation.

One person, ideally someone who thinks they know the whole flow, walks the wall end to end, out loud, narrating each event in order as if explaining it to a newcomer. Everyone else’s job is to listen and interrupt when something doesn’t match what they know.

“One of us is about to walk this wall start to finish, out loud. Their job is to narrate what happens at each event. Your job is to interrupt when it doesn’t match what you know. Interruptions are the point of this phase; hold nothing back.”

Pick the walker carefully. Not the most senior person. Not someone who’ll perform. Someone who knows a lot but not everything, who’ll narrate what they think is happening and be genuinely surprised when corrected.

The walker moves slowly: ten seconds per event, minimum. For a 300-event wall that’s 50 minutes without interruptions, and with real interruptions it’ll run 90–180. Budget double the no-interruption time.

Every interruption is precious. Pink note the disagreement, stick it on the event, and move on; don’t try to resolve during the walkthrough.

What to watch for:

The walker turning it into a lecture. “Keep moving; the interruptions are the output.”
Nobody interrupting. Either the walker is genuinely correct (rare) or the room has stopped listening. Pause; ask a specific person by name: “From where you sit in support, does this match what you hear on the phones?”
Interruptions becoming arguments. “Pink note it. Keep walking.”
The walker skipping sections. “Good, stop there. Who knows what happens next?” Let someone else take over for that stretch.
The room running out of energy. Break into 45-minute segments with stretches between.

This phase is what people remember for years. Protect it.

Phase 5: Pain and systems (60 min)

Now add the pink notes deliberately. You already have some from the timeline and the walkthrough; add more. Also add yellow notes for the systems and people that keep reappearing across the wall.

Prompt:

“Where does this hurt? Where do people work around the system? Where is information lost? Where does a decision get made with the wrong context? Every pain point is a pink note.”

Don’t try to solve anything; just surface it. The wall should look dense with pink by the end.

Phase 6: Dot-voting (20–30 min)

There will be too many pinks. That’s normal. Dot-voting turns the wall into a prioritised shortlist.

Give everyone five or six coloured dots and let them place them on the pinks that matter most to them. Frame it:

“We’re not fixing anything in this room. We’re picking the top 3 to 5 places worth digging into next, the places where a Process Level session will be most valuable. Put your dots where you’d most want to zoom in.”

Count. The clusters with the most dots become the candidates for follow-up Process Level work.

What to watch for:

Leadership dots dominating. If leadership votes first, the result is their priorities with a veneer. Ask the frontline to vote first, or do it anonymously.
Dots concentrating on one department. That department’s pinks may genuinely be the worst, or the voting has been political. Worth a one-minute conversation about the distribution.
Pinks with zero dots. Don’t throw them away; photograph them. They survive in the record.

Worked example: Pagebound, online indie bookshop

Pagebound is a mid-sized online independent bookshop: about 200,000 customers, six warehouses, a handful of physical partner shops, an engineering team of thirty split across product, commerce, fulfilment, and data, plus a customer support operation that fields returns and refunds.

The sponsor is the CTO. The reason for the session: “We keep hearing that things go wrong in order-to-delivery but no two teams describe the problem the same way. Before we commit to a big migration we want everyone in one room looking at the same wall.”

The scope, in one phrase: the whole Pagebound customer experience, from the moment someone first hears about a book through to the day they either recommend it to a friend or decline a repeat purchase. That’s wider than order-to-delivery on purpose; the CTO believes the real problems sit at the edges (discovery, returns, loyalty), not the middle.

Twelve people in the room: a product lead, two engineers, a data analyst, the warehouse manager, a fulfilment team lead, a customer-success lead, a support agent who volunteered, a finance analyst, a marketing lead, a buyer (the person who decides which books Pagebound stocks), and the SRE on call that week. The CTO opens, then leaves.

By the end of the day the wall holds several hundred events. Simplified, that comes down to around thirty key ones grouped into six phases, with four pink hotspots and two pivotal events marked.

The moment this wall earns its cost is during the explicit walkthrough, when the customer-success lead stops the walker at Return Requested and says: “Wait, our returns logic treats a reviewed book as non-returnable because we assume the customer has opened it. Is that actually in the terms?” The finance analyst checks; it isn’t in the terms. The warehouse manager says his team has been refusing those returns for eighteen months. The support agent says she’s been authorising them case-by-case because customers complain.

Three people in a corridor would have argued about that for a month. On the wall, with marketing and commerce watching, it takes ninety seconds to surface and a pink note to capture.

That’s the thing Big Picture is for: the mismatches that only become visible when the whole wall is on view.

The four dot-voted hotspots become the candidates for follow-up work. The one with the most dots, the stock-reservation timing question, is what the Process Level post uses as its own running example.

What can go wrong

Named failure modes. Each has a symptom, a recovery move, and a threshold where you stop rather than limp through.

Nobody will commit to the whole day. Half the room drifts in and out. Recovery: Stop and reset. Rebook with people who’ll commit. Stop if: Two hours in and half the room is still on their laptops. Apologise, reschedule.

Political theatre. A senior is in the room, corrects every event they disagree with, and the frontline has stopped writing. Recovery: Name it carefully. “We need the frontline view right now. Let’s hear from support and operations first.” Stop if: The dynamic doesn’t shift. Photograph the wall, thank everyone, reschedule without the leader.

The wall of one department. 90% of notes come from engineering, or 90% from sales. Recovery: Pause writing. Give each under-represented department 20 minutes with a facilitator at their shoulder, adding events from their slice. Stop if: A department genuinely has nothing to add. Either they shouldn’t be here, or the scope is wrong.

Hotspot overwhelm. Eighty pinks and nobody knows what to do with them. Recovery: Cluster into themes before voting. Vote on themes, not individual notes. Stop if: The themes don’t cohere. Photograph everything; make the follow-up “sort offline” rather than “decide now”.

Leadership sidebar. Two or three senior people cluster together and start having their own meeting. Recovery: Interrupt it, politely, out loud. “Sidebar forming — can we bring that into the room?” Most sidebars collapse when named. Stop if: The sidebar absorbs the session. Two rooms isn’t a workshop.

Outputs

Within 24 hours of the session ending:

Panoramic high-resolution photographs of the wall, overlapping so it can be reassembled digitally. One per metre for a very long wall.
A transcribed event list, system list, and hotspot list, organised by rough zone.
A short summary message to participants: “Here’s what we found, here are the dot-voted hotspots, here’s what happens next.” Send within 24 hours, while the energy is fresh.
A schedule of 3–5 follow-up Process Level sessions, one per top hotspot. Book them within two weeks; momentum dies fast.

In the weeks after:

Pin the vocabulary that emerged. The words that kept reappearing on the wall are the start of the organisation’s shared language. Circulate a glossary.
Walk the wall with anyone who couldn’t attend. Especially peers of the attendees; their reactions tell you whether the picture lands outside the room.
Don’t try to keep the wall “current”. It’s a snapshot of a moment, not a live document. Run another Big Picture when the snapshot is stale enough to mislead, usually six to twelve months later.

Where to go next

Event Storming a Process: the natural follow-up. Big Picture finds the hotspots; Process Level zooms into one and maps it precisely. In the Pagebound example, the stock-reservation hotspot is the candidate.
Event Storming an Architecture: two zooms further in, turning a Process Level map into a software design.
Event Storming: Building Shared Understanding: the narrative post showing a smaller team running their first session.

Teaching Your LLM the Codebase: CLAUDE.md and AGENTS.md

2026-04-09T06:00:00+08:00

The previous post introduced the idea: teach the LLM your conventions through a file it reads on every task. This post shows the files themselves.

The root CLAUDE.md

This is the file at the root of the Greenbox repository. It’s the first thing the LLM reads when it starts working:

<!-- file: CLAUDE.md -->
# Greenbox

Produce-box subscription service. Go monorepo.

## Build & Test

- `go test ./...` to run all tests
- `go vet ./...` before committing
- `golangci-lint run` for full lint check

## Project Structure

- `cmd/greenbox/` — Main application entry point
- `subscription/` — Subscription lifecycle (create, pause, resume, cancel, box size)
- `billing/` — Invoices, payment confirmation, pricing
- `delivery/` — Delivery scheduling, packing, dispatch
- `db/` — Database access and migrations

## Conventions

- Guard clauses for early returns. No deep nesting.
- Custom types for IDs and dates: `SubscriptionID`, `CustomerID`, `DeliveryDate`, not raw strings.
- Unexported struct fields. Constructor functions enforce invariants.
- Error wrapping: `fmt.Errorf("doing thing: %w", err)`
- Table-driven tests with `t.Run` subtests.
- Test names describe behaviour: `TestPausedSubscription_CannotChangeBoxSize`

## Domain Language

- "subscription" not "order"
- "box" not "product" or "package"
- "delivery day" not "shipping date"
- "subscriber" not "user" or "customer" (except in CustomerID, which is the billing reference)
- "pause" not "suspend" or "hold"

## Do Not

- No `interface{}` or `any`. Use concrete types or narrow interfaces.
- No `utils`, `helpers`, or `common` packages.
- No global state or package-level variables (except constants).

Thirty lines. Everything a developer, or an LLM, needs to write code that fits the project. The conventions section is the most valuable: it prevents the style drift that Tom and Priya discovered when their LLM-generated code looked like it came from different teams.

Why each section matters

Build & Test seems obvious, but LLMs use it. When asked to verify a change, the LLM runs go test ./... because the file told it to. Without this section, it might run go build and call it done, or guess at a test command that doesn’t exist.

Project Structure tells the LLM where to put new code. When asked to add a delivery feature, it goes to delivery/, not a new top-level package. The structure section is a map.

Conventions is the style guide. Guard clauses, typed IDs, table-driven tests: these are the patterns the team agreed on. Without this section, the LLM generates valid Go that doesn’t match the team’s Go. With it, generated code passes review faster because it already looks like the codebase.

Domain Language is subtle but powerful. Before this section existed, the LLM would generate variable names like orderID, productName, shippingDate. Each one required a review comment: “We call this a subscription, not an order.” Now the LLM uses the right words the first time. This also helps new developers absorb the team’s vocabulary: they see it in the generated code before they’ve read every file.

Do Not is the anti-pattern list. It heads off the LLM’s most common bad habits. Without it, LLMs writing Go tend to create utils packages, reach for interface{} for flexibility, and introduce package-level variables. The explicit prohibition stops these before they start.
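To make the Conventions and Do Not sections concrete, here is a minimal sketch of Go that follows them. It is an illustration, not Greenbox’s actual code: `BoxSize`, its constants, and the `loadSubscription` helper are assumptions introduced for the example.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// SubscriptionID is a typed identifier, so a raw string cannot be passed by accident.
type SubscriptionID string

// BoxSize is a hypothetical typed constant standing in for a free-form string.
type BoxSize int

const (
	BoxSmall BoxSize = iota + 1
	BoxLarge
)

// Subscription keeps its fields unexported; callers go through NewSubscription.
type Subscription struct {
	id      SubscriptionID
	boxSize BoxSize
}

// NewSubscription enforces invariants with guard clauses: each check returns early.
func NewSubscription(id SubscriptionID, size BoxSize) (*Subscription, error) {
	if strings.TrimSpace(string(id)) == "" {
		return nil, errors.New("subscription id is required")
	}
	if size != BoxSmall && size != BoxLarge {
		return nil, fmt.Errorf("unknown box size %d", size)
	}
	return &Subscription{id: id, boxSize: size}, nil
}

// loadSubscription is a hypothetical caller showing the error-wrapping convention.
func loadSubscription(rawID string) (*Subscription, error) {
	sub, err := NewSubscription(SubscriptionID(rawID), BoxSmall)
	if err != nil {
		return nil, fmt.Errorf("loading subscription: %w", err)
	}
	return sub, nil
}

func main() {
	if _, err := loadSubscription(" "); err != nil {
		fmt.Println(err)
	}
}
```

Nothing here needs `interface{}`, a `utils` package, or a package-level variable; the constructor is the only way to build a `Subscription`, which is what lets it enforce the invariants.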

Package-level CLAUDE.md

The root file covers the whole project. Package-level files add context for specific packages:

<!-- file: subscription/CLAUDE.md -->
# Subscription

Manages subscription lifecycle.

## Status Transitions

- Pending → Active → Paused → Active (resume) or Cancelled
- Paused subscriptions cannot change box size.
- Cancelled subscriptions cannot be modified at all.
- `NewSubscription` starts in `StatusPending`.

## Conventions

- All mutations go through methods on `Subscription`. No direct field access from outside.
- Status is a typed constant (`StatusPending`, `StatusActive`, etc.), not a raw string.
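As a rough sketch of how those status rules might look in code, assuming a `BoxSize` type and method names (`Activate`, `Pause`, `ChangeBoxSize`) that the post does not spell out:

```go
package main

import (
	"errors"
	"fmt"
)

// Status is a typed constant, not a raw string.
type Status int

const (
	StatusPending Status = iota
	StatusActive
	StatusPaused
	StatusCancelled
)

// BoxSize is a hypothetical type used only for this illustration.
type BoxSize int

const (
	BoxSmall BoxSize = iota + 1
	BoxLarge
)

// Subscription exposes no fields; every mutation goes through a method.
type Subscription struct {
	status  Status
	boxSize BoxSize
}

// NewSubscription starts in StatusPending, as the package file states.
func NewSubscription(size BoxSize) *Subscription {
	return &Subscription{status: StatusPending, boxSize: size}
}

func (s *Subscription) Status() Status   { return s.status }
func (s *Subscription) BoxSize() BoxSize { return s.boxSize }

// Activate moves Pending to Active.
func (s *Subscription) Activate() error {
	if s.status != StatusPending {
		return errors.New("only pending subscriptions can be activated")
	}
	s.status = StatusActive
	return nil
}

// Pause moves Active to Paused.
func (s *Subscription) Pause() error {
	if s.status != StatusActive {
		return errors.New("only active subscriptions can be paused")
	}
	s.status = StatusPaused
	return nil
}

// ChangeBoxSize is refused for cancelled and paused subscriptions.
func (s *Subscription) ChangeBoxSize(size BoxSize) error {
	if s.status == StatusCancelled {
		return errors.New("cancelled subscriptions cannot be modified")
	}
	if s.status == StatusPaused {
		return errors.New("paused subscriptions cannot change box size")
	}
	s.boxSize = size
	return nil
}

func main() {
	sub := NewSubscription(BoxSmall)
	if err := sub.Activate(); err != nil {
		fmt.Println(err)
		return
	}
	if err := sub.Pause(); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(sub.ChangeBoxSize(BoxLarge))
}
```

Each rule in the file becomes one guard clause, which is why short, declarative statements in a package-level file are easy for the LLM to turn into code.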

And for the billing package:

<!-- file: billing/CLAUDE.md -->
# Billing

Invoices, payment confirmation, pricing.

## Money

- All amounts stored in cents (int64), not dollars (float64).
- Display formatting happens at the HTTP layer, not in billing logic.
- Currency is always AUD. No multi-currency support yet.

When the LLM works in the billing package, it reads both the root CLAUDE.md and the package-level one. The root provides general conventions. The package file provides package-specific rules. The LLM stores amounts in cents because the file says so; no more pull request comments asking “should this be cents or dollars?”
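A small sketch of what the money rules imply, assuming a `Cents` type and two helper names that are invented for this example rather than taken from the Greenbox code:

```go
package main

import "fmt"

// Cents is a hypothetical typed amount in AUD cents, backed by int64.
type Cents int64

// LineTotal multiplies a unit price by a quantity using integer arithmetic only,
// so billing logic never touches floating point.
func LineTotal(unit Cents, qty int64) Cents {
	return unit * Cents(qty)
}

// FormatAUD renders cents for display. Under the billing rules this would live
// at the HTTP layer, not inside billing logic.
func FormatAUD(c Cents) string {
	sign := ""
	if c < 0 {
		sign = "-"
		c = -c
	}
	return fmt.Sprintf("%s$%d.%02d", sign, c/100, c%100)
}

func main() {
	total := LineTotal(3450, 3) // three boxes at 3450 cents each
	fmt.Println(FormatAUD(total))
}
```

Keeping the formatting function separate from the arithmetic mirrors the split the file asks for: amounts stay as integers until the last moment before display.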

AGENTS.md: specialised roles

Where CLAUDE.md is the general brief, AGENTS.md defines specialised roles: agents the LLM can adopt for specific tasks. The Greenbox team defines two:

# file: AGENTS.md

[[agents]]
name = "test-writer"
description = "Writes tests for Greenbox code following team conventions"

[agents.instructions]
text = """
You write tests for the Greenbox codebase.

Conventions:
- Use table-driven tests with t.Run subtests for any function with more than two cases.
- Test names describe behaviour, not implementation: TestPausedSubscription_CannotChangeBoxSize
- Use precise language in test names:
  - "Cannot" = hard constraint, test failure means a bug
  - "Returns" = pure output check
- Create test fixtures using constructor functions, not struct literals with exported fields.
- Prefer assertion messages that explain the business rule: "paused subscriptions cannot change box size"
- Do not use testify or other assertion libraries. Use stdlib testing only.
- Test through public methods. Never access unexported fields.
"""

[[agents]]
name = "reviewer"
description = "Reviews code for convention drift"

[agents.instructions]
text = """
You review pull requests for the Greenbox codebase.

Check for:
1. Exported fields that should be unexported. Structs should have unexported fields with constructors.
2. Raw strings where typed IDs should be used: SubscriptionID, CustomerID, BoxSize.
3. Deep nesting: more than two levels of if/else suggests missing guard clauses.
4. Missing error handling or unwrapped errors.
5. Tests that test implementation instead of behaviour.

Do not nitpick formatting or style — the linter handles that.
"""

Each agent encodes expertise the team has built up. The test writer knows about precise naming because the team keeps finding that vague test names make failures harder to diagnose. The reviewer catches the convention drift that slips through when everyone’s moving fast: exported fields, raw strings where typed IDs belong, nested conditionals that should be guard clauses.

How agents are invoked

When Priya asks the LLM to write tests, she invokes the test-writer agent:

> /test-writer Write tests for the new pause subscription handler

# The agent reads:
# 1. Root CLAUDE.md (general conventions)
# 2. subscription/CLAUDE.md (package-specific rules)
# 3. The test-writer agent instructions from AGENTS.md
# 4. The relevant source files

The generated tests use table-driven structure, descriptive names, and stdlib assertions, because the agent’s instructions specify all of that. Without the agent, the LLM would still generate tests (it read the root CLAUDE.md), but the agent adds the thoroughness.

Tom uses the reviewer agent during code review:

> /reviewer Review this PR for billing/invoices.go

# The agent checks:
# - Invoice struct has unexported fields
# - Amounts stored in cents, not dollars
# - Typed IDs used instead of raw strings
# - No deep nesting

The reviewer catches an exported Amount field that should be unexported with a constructor, and a string parameter where SubscriptionID should be used. Tom would have caught these too, eventually. The agent catches them in seconds, every time, without fatigue.

The maintenance cycle

Priya warns the team early: “A stale CLAUDE.md is worse than no CLAUDE.md. If the file says ‘use guard clauses’ but the codebase has moved to a different pattern, the LLM generates code that doesn’t match anything.”

The team adopts a rule: when you change a convention, update the CLAUDE.md in the same commit. It’s like updating tests when you change behaviour: the documentation and the code move together.

# Tom's commit message when they adopt a new error type
git log --oneline -1
# a1b2c3d Add DomainError type, update CLAUDE.md conventions

The CLAUDE.md diff in that commit:

 ## Conventions

 - Error wrapping: `fmt.Errorf("doing thing: %w", err)`
+- Domain errors: use `DomainError{Code, Message}` for business rule violations.
+  Reserve `fmt.Errorf` for infrastructure errors (database, network).

Two lines. The LLM now generates DomainError for business rule violations and fmt.Errorf for infrastructure errors. The convention is encoded the moment it’s decided.

Before and after

The clearest proof is in the generated code. Here’s what the LLM generates for “add a Resume method to Subscription”, first without CLAUDE.md, then with it.

Without CLAUDE.md:

// file: subscription/subscription.go
func (s *Subscription) Resume() {
    s.Status = "active"
    s.PauseReason = ""
}

Exported fields. String status. No error handling. No guard clause.

With CLAUDE.md:

// file: subscription/subscription.go
func (s *Subscription) Resume() error {
    if s.status != StatusPaused {
        return fmt.Errorf("cannot resume subscription in status %v", s.status)
    }
    s.status = StatusActive
    s.pauseReason = ""
    s.updatedAt = time.Now()
    return nil
}

Guard clause. Unexported fields. Status constant. Error returned. The code matches the codebase because the LLM read the brief.

The difference isn’t intelligence, it’s context. The LLM is equally capable in both cases. The CLAUDE.md gives it the context to be capable in the right direction.

The compound effect

The team notices something over the following months. The CLAUDE.md doesn’t just make LLM-generated code better. It makes the whole codebase more consistent, because:

New developers read it as an onboarding doc.
The LLM follows it, so generated code demonstrates the conventions.
Code reviewers reference it when explaining why a pattern should change.
The conventions themselves get sharper, because writing them down forces the team to resolve ambiguity. “Use typed IDs” is vague. “Use SubscriptionID not string for subscription identifiers” is precise.

Tom puts it simply: “We wrote a page of conventions for the LLM and accidentally standardised the whole team.”

Lee’s version: “The best documentation is documentation that has a reader. The LLM reads the CLAUDE.md on every task. That makes it the most-read document in the repository.”

Teaching Your LLM the Codebase

2026-04-08T06:00:00+08:00

Two developers. Same codebase. Same LLM. Different code. That’s not a bug in the LLM, it’s a missing brief.

After the BDD work, Tom and Priya are both leaning hard on LLMs. The feature files make it easy: hand the LLM a .feature file, ask for an implementation, get code back. Tom noticed the LLM generates code faster than he can review it. That’s true. But he’s about to notice something else.

The code review that took an hour

Tom opens Priya’s pull request. The code is correct, tests pass, behaviour matches the feature file. But it looks nothing like his code. Her handler functions return early on errors. His use if-else chains. Her test names read like sentences: TestPausedSubscription_CannotChangeBoxSize. His read like labels: TestChangeBoxSizePaused. Her structs have unexported fields with constructor functions. His have exported fields.

None of this is wrong. It’s all defensible. But the review takes an hour because Tom keeps stopping to ask: “Is this a style choice or a behaviour choice?” Every difference is a potential bug he has to investigate.

He brings it up at standup. “Priya’s code and my code look like they were written by different teams.”

Priya frowns. “We’re using the same LLM. Same model, same tool.”

“But not the same prompts,” Lee says. He’s been listening. “You’re each telling it something different about how you want the code to look. The LLM doesn’t have opinions, it reflects whatever you give it.”

The experiment

Lee suggests they test this. Same task, both developers, compare the results. The task: write a function that calculates the next delivery date, skipping public holidays. Same requirements. Same language. Same LLM.

Tom prompts: “Write a Go function that calculates the next delivery date after a given date, skipping any dates in a public holidays list.”

Priya prompts: “In our Greenbox codebase we use custom types for dates and guard clauses for validation. Write a Go function that calculates the next delivery date after a given date, skipping public holidays. Return an error if the input date is in the past.”

Tom gets back a clean function. It takes time.Time and []time.Time, returns time.Time. No error handling. No validation. Works fine.

Priya gets back a function that takes a DeliveryDate type and a HolidayCalendar interface. Guard clause at the top rejects past dates. Returns (DeliveryDate, error). The generated code matches the patterns in the rest of the codebase because she described those patterns in the prompt.

“You gave it context,” Tom says.

“I gave it the same context I’d give a new developer on their first day,” Priya says. “Here’s how we do things. Here’s what the conventions are. Here’s what the types look like.”

“But you had to type all of that every time.”

“Right. And that’s the problem.” Priya pulls up the Claude Code documentation on her screen. “There’s a way to make it permanent.”

The brief

Lee draws a parallel to his consulting work. “When I join a new client, the first thing I look for is a brief: how the team works, what they’ve decided, what they’ve explicitly rejected. When the brief exists, I’m productive in days. When it doesn’t, I spend weeks asking ‘why did you do it this way?’”

“The LLM needs the same thing,” Priya says. “And there’s a file for it.”

In Claude Code, this brief is a file called CLAUDE.md. It lives in the root of the repository. Every time the LLM starts a task, it reads this file first. The file becomes the persistent context that Tom was missing and Priya was typing out by hand.

“Think of it as the onboarding document for your AI pair programmer,” Priya says. “Everything you’d tell a new hire in their first week goes in this file.”

What goes in the brief

The team sits down and writes their first CLAUDE.md together. Lee facilitates; he’s good at drawing out the things people know but haven’t said aloud. He asks three questions:

“What patterns have you settled on?”

Priya lists what she’s been pushing for over the past few months: guard clauses for early returns, table-driven tests, custom types for IDs and dates instead of raw strings, unexported struct fields with constructor functions, error wrapping with fmt.Errorf("context: %w", err). Tom nods along. He’s not sold on all of it (the typed IDs still feel like boilerplate to him), but he can’t argue with the consistency.

“What patterns have you explicitly rejected?”

This one surprises Tom. He hadn’t thought about anti-patterns as something to document. But Priya points out: “The LLM keeps generating interface{} parameters. We never use those. It keeps creating utility packages. We don’t have a utils package and we don’t want one.”

Lee nods. “Telling the LLM what not to do is as important as telling it what to do. Same as onboarding. A new developer who’s told ‘we don’t use global state’ won’t introduce global state. An LLM that’s told the same thing won’t either.”

“What does someone need to know about the domain?”

This is where Maya’s language matters. The LLM shouldn’t call it an “order”; it’s a “subscription.” It shouldn’t call it a “product”; it’s a “box.” The delivery happens on a “delivery day,” not a “shipping date.” The team has been building a shared vocabulary, and the LLM needs to speak it too.

The first version

They write a CLAUDE.md that fits on one screen. Lee insists on this. “If it’s longer than a page, nobody will maintain it, and for the LLM the extra length just dilutes the important stuff with noise.”

The file covers:

Project structure: where things live, what each package does.
Coding conventions: guard clauses, error handling, test naming, no utils package.
Domain language: subscription not order, box not product, delivery day not shipping date.
Build and test commands: go test ./..., go vet ./..., how to run the linter.
What not to do: no interface{}, no global state, no utility packages.

Tom commits it. The next morning, he prompts the LLM with the same delivery date task. Without changing his prompt at all, the generated code comes back with a DeliveryDate type, a guard clause, and the domain terminology.

“It read the brief,” he says.

“It read the brief,” Priya confirms.

When the team grows

A month later, Kai joins the project. He’s a contractor, less familiar with the codebase. His first day, he sets up Claude Code, opens the repo, and starts working. His first PR looks like it was written by someone who’s been on the project for months. The naming is right. The patterns match. The test structure follows the team’s convention.

Tom reviews it in fifteen minutes. No style questions. No “we don’t do it that way” comments. Just a review of the logic.

“This is the real win,” Lee says. “The CLAUDE.md isn’t just for the LLM. It’s for every developer who works with the LLM. When the brief is right, the generated code teaches the patterns to new team members faster than any onboarding document.”

Kai reads the CLAUDE.md himself, separate from the LLM. “This is the best onboarding doc I’ve ever seen,” he says. “And it’s thirteen lines of conventions.”

Beyond the project root

The team discovers that some conventions are package-specific. The subscription package has rules about status transitions that don’t apply elsewhere. The billing package has rules about how invoice amounts are stored (cents, not dollars).

Claude Code supports CLAUDE.md files in subdirectories. A CLAUDE.md in subscription/ applies when working in that package. The root CLAUDE.md applies everywhere. The specificity model is similar to .gitignore: the closest file wins for its scope, with the root as the baseline.

Tom adds a CLAUDE.md to the subscription package:

Status transitions: Pending → Active → Paused → Active (resume) or Cancelled.
Paused subscriptions cannot change box size.
Cancelled subscriptions cannot be modified at all.
NewSubscription starts in StatusPending.

Four lines. The LLM generates subscription code that respects the status rules every time.
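For illustration, code that respects those four rules might look something like the sketch below. Only NewSubscription and StatusPending are named in the brief; the other identifiers are hypothetical. Two readings are assumptions: that Cancelled can be reached from either Active or Paused, and that a Pending subscription may still change box size.

```go
package main

import (
    "errors"
    "fmt"
)

type Status int

const (
    StatusPending Status = iota
    StatusActive
    StatusPaused
    StatusCancelled
)

func (s Status) String() string {
    switch s {
    case StatusPending:
        return "Pending"
    case StatusActive:
        return "Active"
    case StatusPaused:
        return "Paused"
    case StatusCancelled:
        return "Cancelled"
    }
    return "Unknown"
}

// Hypothetical sentinel errors, one per rule.
var (
    ErrInvalidTransition = errors.New("invalid status transition")
    ErrPaused            = errors.New("paused subscriptions cannot change box size")
    ErrCancelled         = errors.New("cancelled subscriptions cannot be modified")
)

// canTransition encodes Pending -> Active -> Paused -> Active or Cancelled.
func canTransition(from, to Status) bool {
    switch from {
    case StatusPending:
        return to == StatusActive
    case StatusActive:
        return to == StatusPaused || to == StatusCancelled
    case StatusPaused:
        return to == StatusActive || to == StatusCancelled
    }
    return false // Cancelled is terminal
}

type Subscription struct {
    status  Status
    boxSize string
}

// NewSubscription starts in StatusPending, as the brief requires.
func NewSubscription(boxSize string) *Subscription {
    return &Subscription{status: StatusPending, boxSize: boxSize}
}

func (s *Subscription) Status() Status  { return s.status }
func (s *Subscription) BoxSize() string { return s.boxSize }

func (s *Subscription) TransitionTo(next Status) error {
    if s.status == StatusCancelled {
        return ErrCancelled
    }
    if !canTransition(s.status, next) {
        return fmt.Errorf("%s to %s: %w", s.status, next, ErrInvalidTransition)
    }
    s.status = next
    return nil
}

func (s *Subscription) ChangeBoxSize(size string) error {
    if s.status == StatusCancelled {
        return ErrCancelled
    }
    if s.status == StatusPaused {
        return ErrPaused
    }
    s.boxSize = size
    return nil
}

func main() {
    sub := NewSubscription("Small")
    fmt.Println(sub.TransitionTo(StatusActive))  // <nil>
    fmt.Println(sub.TransitionTo(StatusPaused))  // <nil>
    fmt.Println(sub.ChangeBoxSize("Large"))      // paused subscriptions cannot change box size
    fmt.Println(sub.TransitionTo(StatusPending)) // Paused to Pending: invalid status transition
}
```

Keeping the transition table in one function means a rule change in the package brief maps to a one-line change in the code.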

Specialised agents

Priya finds the next piece. “What if the LLM could behave differently depending on the task? When it’s writing tests, it should be thorough and consider edge cases. When it’s reviewing code, it should check for convention drift. When it’s writing migration code, it should be conservative and prefer backwards compatibility.”

This is what AGENTS.md does. Where CLAUDE.md is the general brief, AGENTS.md defines specialised roles, agents with specific instructions, tools, and constraints.

The team starts with two:

A test writer agent that knows about the team’s test conventions: table-driven tests, descriptive names, the distinction between hard constraints and soft expectations in test naming.

A reviewer agent that checks PRs for convention drift: exported fields that should be unexported, missing error handling, deep nesting that could be a guard clause.

Priya sets these up. When she asks the LLM to write tests, it applies the test writer’s conventions automatically. When Tom asks for a code review, the reviewer checks for the patterns the team has agreed on.

“The agents encode what we’ve learned,” Tom realises. “If someone new joins, they don’t just get the conventions, they get the reasoning built into the tool.”

Lee smiles. “That’s the best kind of process. The kind that outlives the person who set it up.”

What the team learned

Three months later, the CLAUDE.md has been updated fourteen times. Each update is small: a line added when a new convention is agreed, a line removed when a pattern is abandoned. The file is a living document of the team’s coding standards, maintained not by discipline but by self-interest: when the CLAUDE.md is accurate, the LLM generates better code, and reviews go faster.

Tom, who started the week typing bare prompts and getting inconsistent results, now treats the CLAUDE.md as seriously as the test suite. “Tests tell you if the code is correct. The CLAUDE.md tells the LLM how to write code that’s correct and consistent.”

The insight that sticks: the style of your codebase is a few-shot prompt. When the codebase is consistent, the LLM generates consistent code. When the conventions are explicit, the LLM follows them. CLAUDE.md is just making that implicit prompt explicit, and shareable across a team.

What the files look like

The team’s actual CLAUDE.md and AGENTS.md files, what goes in them, how they’re structured, and how they shape the LLM’s output, are worth seeing in detail. Next: CLAUDE.md and AGENTS.md in practice.

Behaviour-Driven Development: From Stories to Working Software

2026-04-07T06:00:00+08:00

The Greenbox team hit 214 subscribers. The sprint cadence is working. Event Storming gave them shared understanding. Example Mapping made their stories concrete. The sprint rhythm turned sticky notes into delivery.

But bugs keep appearing.

Not catastrophic bugs: the payment system works, the delivery scheduling is solid. But edge cases slip through. The delivery date calculation breaks on public holidays because nobody checked. The box-size switch fails if a customer changes on Wednesday instead of Monday. A paused subscriber gets charged because the retry logic doesn’t check pause state. Each one is a twenty-minute fix. Each one costs trust.

The team has concrete examples from their Example Mapping sessions: context, action, outcome, written on cards. But those cards are on a table. The code is on a screen. Somewhere between the two, the details get lost.

A language for examples

The Example Map gave the team examples as Context/Action/Outcome. There’s a step between “cards on a table” and “something a test framework can run.” The team needs a way to express those examples formally enough for a computer to use, while keeping them readable enough that Maya can look at them and say “yes, that’s what I meant.”

The language for this is Gherkin. It has three core keywords, Given, When, and Then, which map directly to Context/Action/Outcome.

Given sets up the context: what’s true before anything happens.
When describes the action: what someone does.
Then states the outcome: what should be true afterwards.

A trivial example:

Given it is raining
When I go outside
Then I should get wet

No code. No special syntax. Anyone can read it. It’s the same pattern the team already used on their green cards, just formalised with keywords a test framework can parse.

From Example Map to Gherkin

The Example Map output for “Subscribe to a produce box” is already there:

Rule: Customer must choose a box size (Small $25/week, Large $45/week)
Rule: Payment must succeed (valid card → confirmed, declined card → retry)
Rule: Customer sees their first delivery date (Monday → this Thursday, Friday → next Thursday)

Take the delivery date example from the green card:

Context: delivery day is Thursday, minimum lead time is 3 days. Sarah subscribes on Friday. → First delivery is next Thursday.

Translated to Gherkin:

Given today is Friday
And deliveries happen on Thursdays
And the minimum lead time is 3 days
And a customer has a valid payment method
When they subscribe to the "Small" box
Then their first delivery date should be next Thursday

Mechanical translation. The hard thinking already happened round the table with Maya and the team.

The Feature file

Individual scenarios group into a Feature file: one coherent piece of behaviour. A Background section captures context shared across every scenario.

Feature: Subscribe to a produce box
  Customers want a regular supply of fresh, local produce
  without having to think about it each week.

  Background:
    Given the following box sizes are available:
      | name   | price    |
      | Small  | $25/week |
      | Large  | $45/week |

  Scenario: Subscribing with a valid payment method
    Given a customer has a valid payment method
    When they subscribe to the "Small" box
    Then their subscription should be confirmed
    And they should see their first delivery date

  Scenario: Payment is declined
    Given a customer has an expired credit card
    When they subscribe to the "Small" box
    Then no subscription should be created
    And they should be asked to update their payment method

  Scenario: Subscribing without enough lead time
    Given today is Friday
    And deliveries happen on Thursdays
    And the minimum lead time is 3 days
    And a customer has a valid payment method
    When they subscribe to the "Small" box
    Then their first delivery date should be next Thursday

Each rule from the Example Map maps to one or more scenarios. Each green card becomes concrete data inside a scenario. You’re not staring at a blank file wondering what to write. The conversation already happened. You’re transcribing.

The BDD cycle: story, unit, code

Now the team has scenarios: acceptance tests describing the agreed behaviour. But you don’t implement them top-down. You work inward, using two loops.

The outer loop is the acceptance test, the Gherkin scenario itself. The inner loop is unit tests driving the implementation. The acceptance test tells you when you’re done. The unit tests tell you how to get there.

1. Pick a scenario. Run it. RED: it fails because nothing exists yet.
2. Drop to unit tests. Write a small, focused test. RED.
3. Write the simplest code that makes it pass. GREEN.
4. Refactor if needed.
5. Repeat 2-4 until the acceptance test passes. GREEN.
6. Move to the next scenario.

Worked example: Greenbox subscription in Go

The code that follows is deliberately simple; it shows the BDD rhythm without the noise of a real production system. The discovery techniques produce the same concrete examples regardless of implementation complexity.

Tom and Priya are implementing the subscription story together. They’re sitting side by side for the first time. Priya usually works with headphones on; Tom usually works alone. He notices she names her tests differently. “How do you name tests?” he asks. “I describe what the customer expects, not what the code does,” she says. It’s a small thing. Tom starts doing it too.

Delivery date calculator

They start with the third scenario, delivery date calculation, because it’s pure logic with no external dependencies. Self-contained, well-specified by the Example Map, easy to test in isolation.

The rules:

Deliveries happen on Thursdays
Minimum lead time is 3 days
Subscribe on Monday → this Thursday (3 days, just enough)
Subscribe on Friday → next Thursday (this week’s Thursday has already passed, so it rolls forward)

RED. Unit test first.

// delivery_test.go
package greenbox

import (
    "testing"
    "time"
)

func TestFirstDeliveryDate_MondaySubscription(t *testing.T) {
    monday := time.Date(2026, 3, 23, 10, 0, 0, 0, time.UTC)
    deliveryDay := time.Thursday
    minLeadDays := 3

    got := FirstDeliveryDate(monday, deliveryDay, minLeadDays)

    want := time.Date(2026, 3, 26, 10, 0, 0, 0, time.UTC)
    if !got.Equal(want) {
        t.Errorf("FirstDeliveryDate(%v, Thursday, 3) = %v, want %v",
            monday.Weekday(), got.Weekday(), want.Weekday())
    }
}

Won’t compile. FirstDeliveryDate doesn’t exist yet. That’s the point.

GREEN. Write the function.

// delivery.go
package greenbox

import "time"

func FirstDeliveryDate(from time.Time, deliveryDay time.Weekday, minLeadDays int) time.Time {
    earliest := from.AddDate(0, 0, minLeadDays)
    daysUntil := (int(deliveryDay) - int(earliest.Weekday()) + 7) % 7
    if daysUntil == 0 {
        return earliest
    }
    return earliest.AddDate(0, 0, daysUntil)
}

Test passes.

RED. Edge case from the Example Map: Friday subscription.

func TestFirstDeliveryDate_FridaySubscription(t *testing.T) {
    friday := time.Date(2026, 3, 27, 10, 0, 0, 0, time.UTC)
    deliveryDay := time.Thursday
    minLeadDays := 3

    got := FirstDeliveryDate(friday, deliveryDay, minLeadDays)

    want := time.Date(2026, 4, 2, 10, 0, 0, 0, time.UTC)
    if !got.Equal(want) {
        t.Errorf("FirstDeliveryDate(%v, Thursday, 3) = %v, want %v",
            friday.Format("Monday"), got.Format("Monday 2006-01-02"),
            want.Format("Monday 2006-01-02"))
    }
}

GREEN. Already passes. The modular arithmetic handles it naturally. One of the pleasures of TDD: you write a test expecting failure, and it passes, telling you your implementation is more general than you thought.
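To see why the arithmetic generalises, it can help to run every weekday through the function. The sketch below copies FirstDeliveryDate from delivery.go above and adds a hypothetical helper, waitDays, that reports how many days a subscriber waits; the helper and the choice of the week of 23 March 2026 exist only for this illustration.

```go
package main

import (
    "fmt"
    "time"
)

// FirstDeliveryDate is copied unchanged from delivery.go above.
func FirstDeliveryDate(from time.Time, deliveryDay time.Weekday, minLeadDays int) time.Time {
    earliest := from.AddDate(0, 0, minLeadDays)
    daysUntil := (int(deliveryDay) - int(earliest.Weekday()) + 7) % 7
    if daysUntil == 0 {
        return earliest
    }
    return earliest.AddDate(0, 0, daysUntil)
}

// waitDays is a hypothetical helper: the number of whole days between
// subscribing on the given date and the first Thursday delivery, with a
// three-day minimum lead time.
func waitDays(year int, month time.Month, day int) int {
    from := time.Date(year, month, day, 10, 0, 0, 0, time.UTC)
    first := FirstDeliveryDate(from, time.Thursday, 3)
    return int(first.Sub(from).Hours() / 24)
}

func main() {
    // Monday 23 March 2026 through Sunday 29 March 2026.
    for day := 23; day <= 29; day++ {
        from := time.Date(2026, time.March, day, 10, 0, 0, 0, time.UTC)
        first := FirstDeliveryDate(from, time.Thursday, 3)
        fmt.Printf("%-9s -> %s (%d days)\n",
            from.Weekday(), first.Format("Mon 2006-01-02"), waitDays(2026, time.March, day))
    }
}
```

Monday lands on the same week’s Thursday with exactly three days to spare. Tuesday is only two days from Thursday, so it rolls a full week forward and waits nine days. Every later day of the week, Friday included, lands on Thursday 2 April.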

Subscription creation

The second piece: creating the subscription, including payment.

RED.

// subscription_test.go
package greenbox

import (
    "testing"
    "time"
)

type fakeGateway struct {
    shouldSucceed bool
    chargedAmount int
}

func (f *fakeGateway) Charge(amountCents int) (bool, error) {
    f.chargedAmount = amountCents
    return f.shouldSucceed, nil
}

func TestSubscribe_ValidPayment(t *testing.T) {
    gw := &fakeGateway{shouldSucceed: true}
    delivery := time.Date(2026, 3, 26, 0, 0, 0, 0, time.UTC)

    sub, err := Subscribe("Small", 2500, gw, delivery)

    if err != nil {
        t.Fatalf("unexpected error: %v", err)
    }
    if sub.BoxSize != "Small" {
        t.Errorf("BoxSize = %q, want %q", sub.BoxSize, "Small")
    }
    if sub.PricePerWeek != 2500 {
        t.Errorf("PricePerWeek = %d, want %d", sub.PricePerWeek, 2500)
    }
    if !sub.FirstDelivery.Equal(delivery) {
        t.Errorf("FirstDelivery = %v, want %v", sub.FirstDelivery, delivery)
    }
    if gw.chargedAmount != 2500 {
        t.Errorf("charged %d, want %d", gw.chargedAmount, 2500)
    }
}

GREEN. Simplest thing that passes.

// subscription.go
package greenbox

import (
    "errors"
    "time"
)

var ErrPaymentDeclined = errors.New("payment declined")

type Subscription struct {
    BoxSize       string
    PricePerWeek  int
    FirstDelivery time.Time
}

type PaymentGateway interface {
    Charge(amountCents int) (ok bool, err error)
}

func Subscribe(boxSize string, priceCents int, gw PaymentGateway, firstDelivery time.Time) (*Subscription, error) {
    _, _ = gw.Charge(priceCents)
    return &Subscription{
        BoxSize:       boxSize,
        PricePerWeek:  priceCents,
        FirstDelivery: firstDelivery,
    }, nil
}

Test passes. But the implementation is deliberately naive; it ignores the payment result. The next test will force the fix.

RED. Declined payment.

func TestSubscribe_DeclinedPayment(t *testing.T) {
    gw := &fakeGateway{shouldSucceed: false}
    delivery := time.Date(2026, 3, 26, 0, 0, 0, 0, time.UTC)

    sub, err := Subscribe("Small", 2500, gw, delivery)

    if err != ErrPaymentDeclined {
        t.Errorf("err = %v, want %v", err, ErrPaymentDeclined)
    }
    if sub != nil {
        t.Errorf("subscription should be nil when payment declined")
    }
}

Fails. The current implementation always returns a subscription.

GREEN.

func Subscribe(boxSize string, priceCents int, gw PaymentGateway, firstDelivery time.Time) (*Subscription, error) {
    ok, err := gw.Charge(priceCents)
    if err != nil {
        return nil, err
    }
    if !ok {
        return nil, ErrPaymentDeclined
    }
    return &Subscription{
        BoxSize:       boxSize,
        PricePerWeek:  priceCents,
        FirstDelivery: firstDelivery,
    }, nil
}

Both tests pass. Four unit tests, two source files, clean types, narrow interfaces.

Step definitions: the glue

Step definitions connect Gherkin keywords to your application. When the test runner sees When they subscribe to the "Small" box, it needs a function that calls your real Subscribe code.

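// Step definition for: When they subscribe to the "<size>" box.
// lastError and lastSubscription are package-level scenario state;
// stripeGateway and boxPrice are test helpers. None of them are shown here.
func iSubscribeToTheBox(ctx context.Context, size string) error {
    gw := stripeGateway()
    firstDelivery := greenbox.FirstDeliveryDate(time.Now(), time.Thursday, 3)
    sub, err := greenbox.Subscribe(size, boxPrice(size), gw, firstDelivery)
    if err != nil {
        // Record the failure so a later Then step can assert on it.
        lastError = err
        return nil
    }
    lastSubscription = sub
    return nil
}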

Thin on purpose. It delegates to the real functions the team already wrote and tested. No business logic. Just glue.

Three guidelines for keeping them healthy:

Keep them thin. If you’re writing if statements or business logic inside a step definition, the logic belongs in domain code where it’s unit-tested.
Use consistent language. If the team says “subscribe,” every step says “subscribe.” Inconsistent language means duplicate step definitions doing the same thing with different words.
Maintain them like production code. Review in PRs. Refactor when the domain language evolves. Delete when scenarios are removed. If step definitions drift from reality, the team stops trusting the scenarios, stops maintaining them, and BDD quietly dies.
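How a runner gets from a line of Gherkin to a Go function can be sketched with nothing but the standard library. This is not the API of any real BDD framework: registry, Step, and Run are hypothetical names, and the sketch assumes the runner has already stripped the Given/When/Then keyword from the line.

```go
package main

import (
    "fmt"
    "regexp"
)

// stepFunc receives the values captured from the step text.
type stepFunc func(args []string) error

type step struct {
    pattern *regexp.Regexp
    run     stepFunc
}

// registry is a hypothetical, minimal stand-in for a BDD runner's step table.
type registry struct {
    steps []step
}

// Step registers a function against a regular expression.
func (r *registry) Step(pattern string, fn stepFunc) {
    r.steps = append(r.steps, step{pattern: regexp.MustCompile(pattern), run: fn})
}

// Run calls the first step whose pattern matches the line, passing along
// the captured groups. An unmatched line is an error, which is how a
// runner reports a missing step definition.
func (r *registry) Run(line string) error {
    for _, s := range r.steps {
        m := s.pattern.FindStringSubmatch(line)
        if m == nil {
            continue
        }
        return s.run(m[1:])
    }
    return fmt.Errorf("no step definition matches %q", line)
}

func main() {
    var subscribedTo string
    reg := &registry{}
    reg.Step(`^they subscribe to the "([^"]+)" box$`, func(args []string) error {
        subscribedTo = args[0] // a real step would call greenbox.Subscribe here
        return nil
    })

    err := reg.Run(`they subscribe to the "Small" box`)
    fmt.Println(subscribedTo, err) // Small <nil>

    // Different wording finds no match, which is why consistent language matters.
    fmt.Println(reg.Run(`they sign up for the "Small" box`))
}
```

The second call shows the cost of inconsistent wording: “sign up” and “subscribe” would each need their own step definition, even though they mean the same thing.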

Priya suggests running the Gherkin tests automatically. “We’re writing tests that prove the code does what Maya expects. Why are we running them by hand?” She sets up a GitHub Action so tests run on every pull request. It takes her an afternoon. The first automated run catches a bug in Tom’s payment retry logic that manual testing missed. Tom: “That saved me a day.” Priya: “That saved a customer.”

LLMs as implementation partners

Here’s the thing about everything you just read: an LLM could have written most of it.

Not the Example Map. Not the discovery conversation where Maya explained that deliveries happen on Thursdays and the minimum lead time is three days. Not the moment when Tom asked “what about Friday?” and surfaced an edge case. The LLM wasn’t in the room for that.

But the code? You could hand an LLM the Feature file and say: “Write me a Go implementation with tests that makes these scenarios pass.” And it would produce something remarkably close to what you just read. The behaviour would be correct, because the scenarios are concrete and unambiguous. There’s no room for the LLM to guess wrong about what “subscribe” means when the Feature file spells it out.

A caveat. LLMs are good at the happy path. They’ll miss things you didn’t specify: network timeouts, concurrency issues, flaky payment gateways. Code review isn’t optional. Budget roughly half your time for reviewing and hardening what comes back. The discovery work is what makes this review possible. Because you have concrete examples, you can check the LLM’s output against something specific. Without that, you’re reviewing code against vibes.

The pipeline:

Event Storming

→

Example Mapping

→

BDD Scenarios

→

Hand to LLM

→

Review Output

→

Ship

Everything left of “Hand to LLM” is human thinking. Everything right is review and refinement. The human work is the thinking. The LLM work is the typing. Both are necessary. Neither is sufficient alone.

While implementing the payment integration, Tom makes a deliberate shortcut: he hardcodes the currency to AUD instead of making it configurable. He writes a comment: // SHORTCUT: AUD only. If we ever go international, this needs to change. Lee sees it during review and says: “That’s a good shortcut. You know it’s there, you know when it’ll matter, and you’ve documented it. Technical debt is fine when it’s conscious.” Tom carries the idea forward: debt is a choice, not an accident. The dangerous kind is the kind you don’t know you’re taking on.

That same week, a subscriber emails Sam on Saturday: “Your website has been showing an error since yesterday afternoon.” Nobody noticed; they don’t monitor the site outside business hours. Sam signs up for a free uptime monitor that pings the site every five minutes and texts her if it’s down. It isn’t observability; it’s a text message. But it’s the first time a machine is watching instead of a person.

But are we building the correct things?

One thing Tom notices: the LLM generates code faster than he can review it. The code arrives clean and confident, but he can’t always tell if it’s correct until he traces through it line by line. The Feature file gives him something concrete to check against. But the speed creates an odd sensation: the bottleneck isn’t writing code any more; it’s knowing whether the code is correct.

A few weeks in, the rhythm is working. Example Mapping eliminates the surprises. BDD catches bugs before production. The code quality is up. The board looks healthy.

But the number that actually matters, active subscribers, is going backwards. They hit 214 at the end of the first sprint cycle. A month later, they’re at 197.

Maya checks the number at her kitchen table one evening. Nadia looks over her shoulder. “Is that good?”

“It’s going the wrong way.”

Churn is eating the growth. For every ten new subscribers, three or four cancel. The team is building well, but subscriber count doesn’t care about code quality.

The frustrating thing is that the team is doing good work. They’ve built a solid subscription system, payment processing, delivery date logic. Tom has been improving the admin tools. Jas redesigned the onboarding flow. Sam is pushing for a farm analytics dashboard. Everyone has a reasonable next thing to build.

But nobody has stepped back to ask: which of these things will actually stop the bleeding? A prettier onboarding flow won’t fix churn. A farm dashboard won’t either. The team is efficiently building features that don’t address the problem.

Maya raises it at the Monday standup. “We’re shipping faster than ever. But we’re shrinking. Something’s wrong and I don’t think the answer is to ship even faster.”

Which stories should the team be building? How do they connect the work to the business goal?

Lee suggests a technique that works backwards from the goal. It’s called Impact Mapping, and it starts with one question: why are we building this?

The Quiet Jar in the Fridge

2026-04-05T06:00:00+08:00

I am making a sourdough starter today. Fresh jar, a scoop of wholemeal flour, a splash of water, a stir with a cheap rubber spatula. Not precious about it. In six weeks, if I do this properly, I’ll have bread again.

The last one was two and a half years old when it died, not through drama, through a quiet chain of postponed feeds that started with a busy week and ended two months later when I opened the jar to a monstrous mess of black mould that was by this point very nearly ambulatory and would, given another week, probably have begun drafting grievances about the state of the fridge. I scraped it into the bin, washed the jar, and here we are.

I’m not new to sourdough. I’ve made every beginner mistake and a few advanced ones. This post is about what I intend to do differently, and why almost all of it is actually about software.

What a starter actually is

A sourdough starter is a colony of wild yeast and lactic acid bacteria living in a paste of flour and water. You feed it, it eats the sugars, it rises and falls. You bake with some, set the rest aside, feed it again. Flour, water, time, consistency. There is no secret. The mystique around sourdough is almost entirely vibes.

Starters aren’t hard to make. They’re hard to keep. And everything I failed to do with the last one is something I was failing to do in a codebase somewhere at the same time.

Lesson one: consistency beats intensity

Somewhere in the first week or two of a new starter, it will begin to smell strongly of acetone, the acrid chemical note of nail polish remover. All new starters do this. It’s a normal stage of the culture establishing itself, but if you haven’t seen it before it can make you worry.

This is the moment most new bakers panic. They read the first thing the internet tells them, “your starter is sick, feed it more”, and do the wrong thing with great care: more frequent feeds, stronger flour, warmer water, twice-daily instead of once. It feels active. It’s almost exactly the opposite of what the starter needs. I know, because it was me in my very first week, and the starter responded by smelling worse for longer than if I’d left it alone.

The fix is boring. One feed a day, same time, same ratio, same flour, until the phase passes. A small ritual I do while I make my wife tea.

The best starters are not the ones fed most dramatically, but the ones fed most reliably.

The team that runs a three-week “tech debt sprint” every quarter is feeding their codebase intensely but inconsistently. The team that quietly deletes one dead file, writes one missing test, and closes one stale TODO every week is feeding it consistently. Six months later the second team has the cleaner codebase and a deeper understanding of it. Twelve months later it isn’t even close.

Consistency compounds. Intensity burns out. The last starter died because I fed it generously on Sundays and missed too many Thursdays in a row.

Lesson two: maintenance is not waste

Every time you feed a starter, you throw most of it away. It feels profligate. It feels like you’re killing the thing you’re trying to grow.

The reason isn’t volume; it’s ratios. Between feeds the microbes exhaust the sugars and leave their waste behind. The culture turns tired and acidic, and the yeast, which is what actually makes bread rise, struggles in those conditions because bacteria tolerate them better. Leave it long enough and the yeast is outcompeted and the starter goes sour and sluggish.

The discard resets the balance. Throw most of the culture away, keep a small inoculum of still-healthy microbes, feed it generously. The microbes have a huge meal ahead and plenty of space. They multiply back to strength and the yeast keeps up. And the discard itself isn’t waste; it makes excellent crackers, pancakes, and pizza dough.

Codebases are the same. Every week I delete some code: dead feature flags, tests that no longer test what the code does, config files for services we stopped running. Each deletion is uncomfortable in the moment, because I wrote this, and it meant something once. But what remains is closer to the shape I can work with. The point isn’t the size of the codebase; it’s the ratio of living code to tired nobody-remembers-why-this-is-here code. Removal isn’t the opposite of care. It is the care.

And keep the whole thing small. My starter lives in a small jar in the fridge. Unimpressive. Not Instagram-worthy. It waits quietly for Friday afternoon before Saturday’s bake. The counter-top sourdough that looks impressive in a sunlit photograph is also the one that usually dies when life gets busy. Good maintenance is almost always quieter than the thing it’s maintaining.

Lesson three: the practice, not the artefact

If the new starter lives twenty years, good. If it dies in two months and I start another, also fine. The point isn’t this jar; it’s whether I can keep the practice going.

The San Francisco sourdough at Boudin Bakery has been continuously fed since 1849, older than California’s state government. The actual organisms don’t live anything like that long: yeast cells bud and split every two to three hours, and no cell in today’s culture was itself alive more than a few weeks ago. What persists is the practice of feeding the jar. The culture is remade every week. The practice is the thing.

The code you wrote ten years ago is mostly gone by now, rewritten, replaced, deleted, refactored into something unrecognisable. What remains is the habit of care. Code is the artefact. The practice is the point.

Today, the jar has nothing in it

The starter has made nothing so far. It isn’t even, strictly, a starter yet: just a scoop of wholemeal flour, a splash of water, and whatever wild yeast happened to be on the flour. What it has is an idea of what it will become and a set of practices I intend to follow to get it there.

Tomorrow morning I’ll discard most of it, add fresh flour and water, and stir. The morning after, the same again. The acetone phase will come and I’ll resist the urge to panic-feed. In six weeks I’ll bake bread I’m happy with.

The thing I’m committing to today is not a jar. It’s a practice. The practice is what will produce bread; the jar is just where the evidence lives. The code I look after is the same: it needs my presence on a schedule I can keep, long enough for the compounding to catch up to the cleverness.

Feed the practice. Learn from the maintenance. Stay focussed. Don’t get attached to the artefact. Start again when you have to.

The practice is the point. Everything else is decoration.

Sprint Planning: Turning Sticky Notes into Delivery

2026-04-04T06:00:00+08:00

The Greenbox team has done remarkable work over the past two weeks. They Event Stormed the whole domain onto a wall. They Example Mapped their stories into concrete scenarios with rules, examples, and edge cases.

The wall looks beautiful. Sticky notes everywhere. Concrete examples for the first batch of stories. A shared understanding of what the team is building and why.

It’s week six of twelve.

Maya pulls Lee aside after the Monday standup. “We’ve spent two weeks on workshops. The wall looks great. But we haven’t shipped anything new since the subscription prototype. When do we start building?”

Lee doesn’t answer directly. Instead he asks: “What’s Tom working on right now?”

Maya knows. “The farm availability screen.”

“And Priya?”

“Delivery logistics.”

“And which of those matters more for hitting 200 subscribers by the deadline?”

Silence. Maya doesn’t know. Neither does Tom, when she looks at him. They’ve been building. Tom finished the subscription flow last week, pulled the next story off the map, started on farm availability. Priya is deep in delivery. Work is happening. But it’s happening the way it happened in week one: individually, without a shared sense of what the team is doing this week, or whether the pace is enough to hit the deadline.

Six weeks left. 200 subscribers. And the team has no way of knowing whether they’re going to make it.

The missing layer

Lee draws a rough diagram on the whiteboard. Three circles, nested.

“You’ve been working out here,” he says, pointing to the outer ring. “Event Storming gave you the big picture: the whole domain. Example Mapping gave you the detail: concrete rules and examples for each story.” He taps the innermost circle. “This is the bit you’re missing. The delivery layer. What are we doing this fortnight? What does ‘done’ look like in two weeks? How do we know if we’re on pace?”

Maya folds her arms. “We don’t have time for more process. We’ve got six weeks.”

“This isn’t more process; it’s less chaos.”

Introducing the sprint

The concept is simple: two-week iterations. The team calls them sprints, though the name matters less than the rhythm.

Every two weeks:

Plan together what they’ll build in the next fortnight.
Demo together what they actually shipped, to the whole team, not just the developers.

Every day:

Check in to surface blockers before they fester. Fifteen minutes, standing up, first thing in the morning.

Three practices. Lee keeps the ceremony light deliberately. Five people don’t need a Scrum Master, a Product Owner, and a burndown chart. They need a rhythm.

Monday morning: the first sprint planning

The team gathers round the wall. Lee runs the session.

“The goal for this sprint isn’t a list of stories; it’s a sentence. What do we need to be true in two weeks that isn’t true today?”

Maya translates it: “This sprint, we ship the subscription flow end to end so we can onboard our first twenty pilot subscribers.”

Lee writes it on a card and sticks it above the story map: Sprint 1 goal: ship subscription flow, onboard first 20 pilot subscribers.

“Every story you pick should serve that goal. If it doesn’t, it doesn’t go in.”

Six stories:

Sprint 1 backlog

1. Customer selects box size and subscribes

2. Payment integration (Stripe, initial charge)

3. Confirmation email with first delivery date

4. Landing page with box descriptions and pricing

5. Farm submits weekly availability (basic version)

6. Maya's matching tool (supply to demand, draft version)

Tom looks at the list. “Six? We could do ten.”

Lee shakes his head. “First sprint. You don’t know your pace yet. If you finish early, pull more. But it’s better to finish everything than to finish five of ten and feel behind.”

Tom doesn’t look convinced. Priya catches his eye and gives a small nod. She’s been on teams before where overcommitting in sprint one set a miserable tone for the whole project.

Six stories it is.

They spend twenty minutes Example Mapping the stories that haven’t been mapped yet. The confirmation email story is quick: three rules, five examples, one red card about bounced emails. The farm availability story surfaces the same questions from earlier sessions: units, deadlines, update policies. Maya resolves the critical ones on the spot. The rest become red cards for next sprint.

This is important: sprint planning isn’t just picking stories. It’s the moment where Example Mapping happens for the stories you’re about to build. Discovery and planning, in the same conversation.

The daily standup

Fifteen minutes, every morning. Three questions per person: what did I do yesterday, what am I doing today, is anything blocking me?

The first three days feel odd. The team stands awkwardly in a circle. Tom gives a forty-five-second summary of his code changes. Nobody has any blockers. The standup takes four minutes. Tom mutters something about it being a waste of time.

Day four is different.

Priya says: “I’m stuck. Stripe’s webhook for failed payments doesn’t include the subscription ID in the format we expected. I’ve been debugging it since yesterday afternoon.”

Tom looks up from his phone. “I hit that last month on a side project. The subscription ID moved to a nested object. Want me to show you after this?”

“Yes. Please.”

Thirty seconds during the standup. Tom and Priya pair on it afterwards and resolve it in twenty minutes. Without the standup, Priya would have spent another half-day on it alone.

That’s the pitch for daily check-ins: not the days when everything is fine, but the one day in five when someone’s stuck and the answer is sitting three metres away.

The first sprint review

Two weeks pass. Friday afternoon. The team gathers.

Lee keeps it simple: “Show what you built. Not slides. Working software.”

Tom shares his screen and walks through the subscription flow. A customer lands on the page, picks a box size, enters payment details, gets a confirmation with a delivery date. It works.

Then Sam says: “Can I try it?”

She picks up her laptop, goes to the landing page, and starts subscribing. She gets to the box selection screen and pauses. She’s been fielding exactly this question from potential subscribers all week: explaining the difference between small and large boxes over email, over the phone, at the Margaret River farmers’ market. Three people this week alone.

“Which one’s the good one? Small or Large, what’s the difference? How many people does each one feed? There’s nothing on this page that helps them decide.”

Jas pulls up her design file. “I had comparison copy in the original mockup. It got cut when we were trying to keep the first version simple.”

Maya: “That’s not simple, that’s confusing.”

Tom: “I can add it. Half a day, maybe less.”

Sam spotted in thirty seconds what nobody caught during two weeks of development. That’s why the whole team demos, not just the developers. Sam thinks like a customer. Tom and Priya think like engineers. You need both perspectives seeing the same thing.

Here’s the sprint review scorecard:

Story	Status
Customer selects box and subscribes	Done
Payment integration	Done
Confirmation email	Done
Landing page	Done
Farm availability (basic)	Done
Maya's matching tool (draft)	Partial

Five done, one partial. The matching tool has the basic algorithm working but no UI yet; Maya is running it from a command line script. Lee marks it as carried over to sprint two.

Tom, who wanted ten stories, sees that six was the right call. If they’d committed to ten, the story would be “we missed our target” instead of “we nearly hit it.”

That evening, Tom mentions to Sarah that Lee was correct about the six stories. “I would have overcommitted and then blamed the process,” he says.

Sarah looks up from marking papers. “You sound surprised that someone else was right.”

“I’m surprised I listened,” Tom says.

Seeing the trajectory

After the review, Lee draws a simple chart on the whiteboard. Horizontal: weeks remaining. Vertical: subscriber count. He plots where they are, week 6, 38 pilot subscribers from the manual Google Form days, and draws a dotted line from 38 to 200.

“You’ve got six weeks and you need to more than quintuple what you’ve got now. Thirty-eight said yes when there was nothing but Maya’s promise and a spreadsheet. Now you’ve got software and a team. Every two weeks, we’ll plot where you actually are. If you’re falling behind, you’ll know in two weeks, not four.”

Priya takes a photo of the chart and pins it in the team Slack channel. She updates it every Friday. It becomes her quiet ritual, the act that makes the numbers visible to everyone. Nobody asks her to. She just does it.

The first data point goes on the chart at the end of sprint one: 42 pilot subscribers. Four new signups through the self-service flow in the last few days of the sprint, after the landing page went live. The software works. Four is not very many. The line to 200 still sits far above them, and the gap between “we can take signups” and “people are actually signing up” is now a visible, numerical fact. Tom stares at the whiteboard for a moment and then goes back to his desk.

Sprint two: the rhythm clicks

Sam is answering subscriber emails from her personal Gmail. By the end of sprint one, she’s getting fifteen emails a day. She sets up a shared inbox, support@greenbox.com.au, password on a sticky note stuck to Maya’s monitor. The same three questions every week: When does my box arrive? Can I skip a week? What’s in the box? She starts a spreadsheet to track them. Mrs Patterson emails twice about her delivery day, polite both times.

Sprint two planning happens on Monday morning. Forty minutes instead of ninety.

Sprint goal: Ship the farm portal, wire up referrals, and grow to 80 subscribers.

Eight stories this time, two more than sprint one. Lee raises an eyebrow but doesn’t object.

The daily standups get faster. By day three, four minutes. On Wednesday, Jas mentions that the farm portal design has a problem: she’s designed it for a desktop browser, but Dave does everything on his phone. She knows this because Sam mentioned it in passing during Monday’s standup. Sam also remembers something from the Event Storm: Rachel’s comment about her dodgy satellite broadband, the twenty minutes to load a map. “If Rachel’s going to use this portal,” Sam says, “it needs to work on a connection that drops out halfway through a form submission.” Nobody had written that down. Sam just remembered. Jas redesigns the submission flow to save progress locally and retry when the connection comes back. It adds half a day of work and saves Rachel from losing her availability data every time her internet blinks.

On Thursday morning, Lee asks a quiet question at the standup: “What happens if Tom is sick on a Thursday and you need to deploy?”

Silence.

Priya: “I’ve never deployed.”

Tom writes a README that afternoon and walks Priya through the deploy script. By the end of sprint two, Priya has deployed twice. Their bus factor for deployments goes from one to two. It’s not a pipeline; it’s a shared script and a document. But it’s the difference between “one person can ship” and “two people can ship.”

Later that day, Tom says something that surprises everyone.

“I thought standups were a waste of time. I still think most of them are. I’ve been on teams where it was twenty minutes of people reading Jira tickets aloud. These aren’t that. Four minutes, and last week it saved Priya a day. I’m in.”

Lee smiles but says nothing.

The sprint review is smoother. The farm portal works on desktop and mobile. The landing page has comparison copy. Maya demonstrates the matching tool with a real UI. Sam has brought in 26 new subscribers through a combination of local Facebook groups and door-to-door conversations at the Margaret River farmers’ market.

Subscriber count on the whiteboard: 68. Priya updates the chart. Thirty subscribers added across four weeks of sprinting. The curve is bending the right way. Then she marks where Lee’s dotted line sits at this point in the six-week stretch, 146, and the relief thins. They’re seventy-eight short of the linear trajectory, with one sprint left to find another 132 subscribers. They always knew early growth would be slow and the curve would have to steepen at the end. Seeing it is different from knowing it.

Tom looks at the chart. “We need to nearly triple this in two weeks.”

“I know,” Maya says.

Sprint three: the final push

Two weeks. One hundred and thirty-two subscribers to find. By sprint three, the rhythm is second nature (Monday morning planning and Example Mapping, daily standups, Friday demo and chart update) but nothing about the mood is routine. Maya’s been at the office until midnight on Sunday, going over the sprint plan with Lee and redrawing the assumptions behind the referral programme on the back of a receipt.

Sprint three goal: Ship delivery logistics, pause-and-resume, and the referral programme. Hit 200.

The story map for this sprint is the longest they’ve written. Eleven stories. Lee raises both eyebrows but doesn’t object; he can see what the team sees.

Tom also starts feeding Example Map output into Claude during planning: “Break this story into implementation tasks and estimate the relative complexity of each.” The LLM comes back with reasonable task breakdowns that give the team a starting point for conversation. Jas uses it differently: she feeds in the Example Map cards and asks for draft acceptance criteria, then edits them. It saves fifteen minutes of typing per story. The pattern is the same as everywhere else: the LLM is an assistant, not a decision-maker.

By Wednesday of week one, delivery logistics is working end to end. Three farms are submitting availability. Maya’s matching tool produces a packing list each Tuesday evening. The first deliveries go out on Thursday: real deliveries, not the pilot ones, through the software the team built. Pause-and-resume ships on Friday afternoon, quietly, because Mrs Patterson is going on holiday and needs it by Monday.

Subscriber count on the whiteboard, end of week one: 96. Twenty-eight new in the first week of sprint three, the fastest growth yet, but Priya runs the maths and it isn’t enough. At that pace they’ll land at 124 by Friday of week two. Maya sees the number and says nothing for a full minute.

The referral programme goes live on the Monday of week two. It’s a simple thing: if a subscriber refers a friend who signs up, both get $10 off next month’s box. Sam designed it, Jas built the flow, Tom wired it into the subscription system. By Tuesday afternoon, every new subscriber is bringing an average of 0.4 friends into the funnel. By Wednesday, the number is 0.7. Sam starts tracking referrals on a second whiteboard next to Lee’s chart: names, dates, which subscriber referred them. The board fills up faster than she can write.

Priya updates the chart on Wednesday afternoon. Subscriber count on the whiteboard: 143. Still fifty-seven short of the target, with two and a half days to go. Maya looks at it for a long time. Then she says, quietly, “It’s going to be close.”

On Thursday morning, Sam opens her laptop before her feet touch the floor and sees thirty-seven new sign-ups overnight. She refreshes, thinks she’s mis-read it, refreshes again. Thirty-seven. She calls Maya before she even stands up.

Thursday rolls on. The referral flywheel is spinning by itself now. Every one of those overnight sign-ups arrived with friends they’d already told about the box, and several of those friends sign up within hours of getting the $10-off code. Eight more sign-ups land during the morning school run. Five over lunch. Three in the afternoon slot. By the end of Thursday the count is 196. Sam writes it on the chart in pencil, because she’s afraid that if she uses marker she’ll jinx it.

Friday morning, 8:47am: they cross 200. Priya is unlocking the office when her phone pings. Tom, who has been awake since 5am refreshing the dashboard on his phone, is already at the cafe downstairs with two coffees in a cardboard tray. “We got the hardest one,” he says, handing her a flat white. Priya laughs so hard she nearly drops the keys.

The final sign-ups trickle in across Friday. A cluster of four late in the morning: someone’s book club. Two in the early afternoon: Dave’s neighbour and her daughter. Then a slow drip through the rest of the day, referrals chasing referrals, every ping of the dashboard another small cheer from wherever on the floor people are sitting. By 4pm Priya draws the final data point slowly, as if she can’t quite believe it.

The subscriber count on the whiteboard: 214.

For a second nobody moves. Then Sam lets out a sound that isn’t quite a word, clamps her hand over her mouth, and starts to cry. Tom says “no way” very quietly, to nobody, and then says it again, louder. Jas is already in Maya’s arms. Priya stands by the whiteboard, marker still in her hand, looking at the numbers. She’s the one who plotted every single Friday for twelve weeks. She knows what this curve looks like because she drew it. Lee is standing by the door with his hands in his pockets, smiling in the way he smiles when he’s trying not to cry himself.

Maya looks at the wall: at the sticky notes from the first Event Storm, still laminated there, at the pink hotspots, at the chart with the jagged line climbing from 38 to 214 across three sprints. It’s a real number on a real whiteboard in a real office. Two hundred and fourteen people in Perth paid them money this week because they trusted a company that, twelve weeks ago, barely existed. Greenbox is real. Not a pitch deck. Not a spreadsheet. Not Maya’s idea. A company. With customers. With a team. With software that ships boxes of produce to people’s doorsteps every Thursday.

Tom, whose default is skepticism, walks up to the whiteboard and writes “214” in much bigger letters underneath Priya’s dot. Then he adds an exclamation mark. Then a second one.

Someone orders pizza. Someone else goes down to the cafe below the office and comes back with a bottle of something that is technically champagne and practically just sparkling wine. Maya, who has been running on coffee and adrenaline for twelve weeks, takes a glass and sits on the floor with her back against the wall and laughs for the first time in about six days. Sam, who hasn’t stopped smiling, keeps pulling out her phone and looking at the subscriber dashboard and then putting it away and then pulling it out again like she can’t quite trust the number to stay there.

At some point in the evening, Dave calls from Margaret River. Maya had emailed him the number an hour earlier. He says: “I don’t know what I expected when I first met you at the market, but it wasn’t this. Congratulations, kid. I told Rachel. She cried too.” Maya laughs and wipes her eyes and tells him the next box of his tomatoes is going out to a family in North Perth who specifically requested them after reading about Dave’s farm on the about page.

Lee raises his glass at one point. “To the team that nearly broke itself in month one and put itself back together.” Everyone drinks. Nobody says anything for a while.

They did it. Not comfortably. There was a rough patch at the start of sprint two when a payment bug knocked out twenty subscribers for a day. Sam fielded the angry emails. Maya personally called every affected subscriber to apologise. One of them said: “I’m switching to something else if this happens again.” Maya asked what else. “I don’t know yet. But there must be something.” There was also a three-day window in sprint three where referral growth stalled and Maya stayed up until midnight emailing every contact she had. But the sprint rhythm gave them visibility. They could see the problem coming, adjust, and respond, instead of discovering at week eleven that they were behind.

Tom says something in the final retrospective that sticks with Lee: “In week one, I was shipping code faster than I ever had. But I had no idea if it was the correct code, or if we were going to make it. Now I’m shipping at about the same pace, but I know it’s the correct stuff and I can see that we’re on track. That feels completely different.” He pauses. “Week one was the wrong kind of fast.”

The total ceremony overhead: about three hours per fortnight. For that investment, the team got shared visibility, early blocker detection, regular feedback from non-developers, and a clear picture of whether they’d hit the deadline. Compare that to the four weeks the team lost in month one, and it’s not even close.

What the sprint can’t tell you

Two hundred and fourteen subscribers. Three sprints. A rhythm that went from awkward silences to four-minute standups. A team that started as five people shipping code in different directions and ended as five people who know what they’re building, why, and whether they’re on pace.

That’s a different company from the one that started twelve weeks ago.

But the sprint cadence tells the team what they’re building and whether they’re on pace. It doesn’t tell them whether the code they’re shipping is correct.

Tom has been writing code fast. With an LLM as a pair, he’s generating more code in a day than he used to write in a week. But speed creates a new problem. The code arrives quickly and looks correct, but bugs are slipping through: the kind that the Example Maps would have caught if anyone had checked the implementation against the green cards. Three subscribers hit a payment edge case in sprint three that was right there on a red card from the Example Mapping session. The team has concrete scenarios with context, actions, and outcomes. What they’re missing is the bridge between those cards on a table and verified, working software.

Priya starts running through the Example Map cards one by one against the code. “This scenario works,” she says. “This one doesn’t.” She’s testing by hand. She’s catching bugs. And she’s spending two hours per story doing it. At eight stories per sprint, that’s two full days of manual checking every fortnight: a fifth of Priya’s capacity, spent reading cards and comparing them to screens. And it’s only going to get worse. The team is shipping faster every sprint. More stories means more cards means more checking. Priya can see the trajectory: by sprint six she’ll be spending half her time clicking through a browser instead of writing code.

She didn’t move to Perth for this. She moved to Perth to build things.

There has to be a better way.

There is. It starts with turning those Example Map cards into working software.

To LLMs… and Beyond!

2026-04-02T06:00:00+08:00

You’ve heard of ChatGPT. Someone at work mentioned “diffusion models” and you nodded. A blog post told you to use a “multimodal” something. Your cousin sent you an AI-generated image of a cat riding a submarine and you wondered, vaguely, how that works. You’ve been meaning to look into all of this but every explanation assumes you already know the bit you don’t.

This is the field guide you needed six months ago.

In the previous post in this series, we opened up a Large Language Model (LLM) and looked at the machinery inside: tokens, embeddings, attention, transformer blocks, the training pipeline. That post answered one question: how does an LLM actually work?

This post answers the next one: what else is out there?

Because LLMs are one corner of a much larger field. There are models that generate images, models that generate video, models that produce music, models that reason step by step for minutes before answering, and models that combine several of these capabilities at once. The terminology is a mess. The marketing is worse. And if you’re trying to figure out what tool you actually need for a specific job, the landscape can feel impenetrable.

Let’s fix that. We’ll start with a word that gets thrown around constantly and rarely defined.

Modality: types of information

In AI, a modality is a type of input or output — a channel of information. The word comes from philosophy and cognitive science, where it refers to the senses: sight, hearing, touch. In AI, it’s been stretched to cover any distinct form of data.

The main modalities you’ll encounter:

Modality	What it is	Example models
Text	Natural language, prose, dialogue	Claude, GPT-4, Llama
Code	Programming languages — arguably text, but the rules are different enough to matter	Claude, Codex, Code Llama
Image	Photographs, illustrations, diagrams, sprites	DALL-E, Stable Diffusion, Midjourney
Audio	Speech, music, sound effects	Whisper (speech→text), Suno (text→music)
Video	Moving images, often with audio	Sora, Runway, Kling
3D	Meshes, point clouds, scenes	Point-E, NeRFs (emerging)
Structured data	Tables, databases, graphs	Various specialised models
Embeddings	Numerical representations that capture meaning — the hidden modality that powers search	text-embedding-3, Cohere Embed

A model can be single-modality: text in, text out. Or it can be multimodal, accepting and producing multiple types. When someone says “multimodal model,” they mean a model that crosses these boundaries. GPT-4o takes text and images as input and produces text, images, and audio as output. Claude takes text and images as input and produces text. Gemini handles text, images, audio, and video.

The direction matters. A model that takes text in and produces images out (DALL-E) is doing something fundamentally different from a model that takes images in and produces text out (image captioning). Both are “multimodal,” but the underlying machinery is very different.

This brings us to the machinery itself.

Architectures: the engine designs

An architecture is the fundamental design of the neural network — the blueprint for how data flows through the model and how it learns. It’s like engine designs in cars: petrol, diesel, electric, hybrid. Different engineering, different trade-offs, different things they’re good at.

Transformers

If you read the previous post, you already know this one. The transformer architecture, introduced in “Attention Is All You Need” (Vaswani et al., 2017), is the engine behind virtually every major text-generating AI. Claude, GPT-4, Llama, Gemini, Mistral — all transformers.

The key innovation is the attention mechanism: instead of processing text sequentially (one word at a time, left to right), the transformer looks at the entire input at once and figures out which parts relate to which. This parallelism makes them fast to train and excellent at capturing long-range dependencies in text.
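To make "figures out which parts relate to which" a little more concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It is a simplification under stated assumptions: a real transformer first passes each token through learned projections to get separate query, key, and value vectors, whereas this toy reuses the same made-up token vectors for all three.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        # Score this token against every token in the input.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        weights = softmax(scores)
        # Blend the value vectors according to those weights.
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
    return outputs

# Three made-up two-dimensional token vectors (an assumption for illustration).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
result = attention(tokens, tokens, tokens)
```

Each output row is a weighted blend of every input token. Nothing in the inner loop depends on the previous token's result, which is what lets the whole thing run in parallel.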

Transformers aren’t limited to text. Vision Transformers (ViT, Dosovitskiy et al., 2021) apply the same architecture to images by splitting an image into patches and treating each patch like a token. The attention mechanism then figures out which patches relate to which — exactly the same principle, different input.
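The patch-splitting step is simple enough to sketch directly. This is an illustrative toy, not the ViT codebase: the function name `split_into_patches` is made up, the "image" is a small grid of numbers, and it assumes the image dimensions divide evenly by the patch size.

```python
def split_into_patches(image, patch_size):
    """Cut a 2-D grid of pixel values into flattened square patches, row by row."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size square into a single list.
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches

# A made-up 4x4 "image" whose pixel values are just 0..15.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = split_into_patches(image, 2)
```

Each flattened patch then plays the role a token plays in text. At the scale the paper works with, a 224×224 image cut into 16×16 patches gives 14 × 14 = 196 patches.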

The transformer has been remarkably dominant. But it has a known weakness: the attention mechanism scales quadratically with sequence length. Double the input, quadruple the compute. For very long inputs (millions of tokens), this becomes expensive. Which is part of why alternatives exist.
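"Double the input, quadruple the compute" is easy to check with a little arithmetic. The sketch below only counts token-to-token scores for one full self-attention pass; it ignores everything else a real model does, and the function name is made up for illustration.

```python
def attention_score_count(sequence_length):
    """Number of token-to-token scores one full self-attention pass computes."""
    return sequence_length * sequence_length

# Each doubling of the input length multiplies the score count by four.
counts = {n: attention_score_count(n) for n in (1_000, 2_000, 4_000)}
```

Going from 1,000 tokens to 4,000 tokens is four times the input but sixteen times the scores.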

Diffusion models

Diffusion models are the engine behind most modern image generation: Stable Diffusion, DALL-E 3, Midjourney, and Flux.

The core idea is beautifully counterintuitive. During training, the model learns to reverse the process of adding noise to an image. You take a real image, gradually add random noise over many steps until it’s pure static, and train the model to predict what the image looked like one step earlier — slightly less noisy.
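The noising half of that process can be sketched in a few lines. This follows the general shape of the formulation in Ho et al. (2020), where a noisy version of the image at any step can be produced in one jump; the linear schedule values, the four-pixel "image", and the function names are assumptions chosen for illustration.

```python
import math
import random

def noise_schedule(steps, beta_start=1e-4, beta_end=0.02):
    """Fraction of the original signal that survives after each noising step."""
    alpha_bars, running = [], 1.0
    for t in range(steps):
        # Linear schedule: a little more noise is added at every step.
        beta = beta_start + (beta_end - beta_start) * t / (steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(pixels, alpha_bar, rng):
    """Jump straight to the noisy version of an image at a given step."""
    keep, spread = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    noise = [rng.gauss(0.0, 1.0) for _ in pixels]
    noisy = [keep * p + spread * n for p, n in zip(pixels, noise)]
    return noisy, noise  # in this formulation the model learns to predict the noise

rng = random.Random(0)
alpha_bars = noise_schedule(1000)
image = [0.9, -0.2, 0.5, 0.1]  # a made-up four-pixel "image"
slightly_noisy, _ = add_noise(image, alpha_bars[0], rng)
mostly_static, _ = add_noise(image, alpha_bars[-1], rng)
```

After the first step almost all of the image survives. After the last, almost none does, and what is left is effectively static. Training shows the model many such noisy versions and asks it to work out what was added.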

At generation time, you start with pure random noise and ask the model to denoise it, step by step. Each step removes a little noise and adds a little structure. After enough steps (typically 20-50), you have a coherent image.

The text prompt enters the picture through conditioning. The model doesn’t just denoise randomly; it denoises in a direction guided by a text description. The text “a cat riding a submarine in the style of Studio Ghibli” gets encoded into a numerical representation (usually by a text encoder like CLIP), and that representation steers every denoising step. The model has learned, from millions of image-caption pairs, which visual patterns correspond to which text descriptions.

This is fundamentally different from how LLMs work. An LLM generates output one token at a time, left to right. A diffusion model generates the entire image at once, refining it in passes from noise to clarity. There’s no concept of “next pixel” the way there’s a “next token.”

	LLM (transformer)	Diffusion model
Generates	One token at a time	Entire output at once, refined iteratively
Training signal	"Predict the next token"	"Remove the noise"
Output type	Sequential (text, code)	Spatial (images, video frames)
Guided by	All previous tokens	Text embedding + previous denoising step
Speed	Fast per token, slow for long outputs	Fixed number of steps regardless of complexity

The idea was first made practical by Ho et al. (2020). The breakthrough that made it work for high-resolution images was latent diffusion (Rombach et al., 2022) — instead of denoising the full image pixel by pixel (which is absurdly expensive at high resolution), you first compress the image into a much smaller representation, do the denoising there, and then decompress the result. It’s the difference between sculpting a full-size statue and sculpting a maquette that gets scaled up. This is the approach behind Stable Diffusion.
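A rough sense of how much smaller the latent representation is comes from counting values. The sizes below are the ones commonly quoted for the original Stable Diffusion release (a 512×512 RGB image compressed to a 64×64 latent with 4 channels); treat them as an assumption, since other models use different factors.

```python
def element_count(height, width, channels):
    """How many numbers it takes to store a grid of this shape."""
    return height * width * channels

pixel_values = element_count(512, 512, 3)   # full-resolution RGB image
latent_values = element_count(64, 64, 4)    # 8x smaller per side, 4 channels
ratio = pixel_values / latent_values
```

Under those assumptions every denoising step works on 48 times fewer values than it would in pixel space.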

GANs (Generative Adversarial Networks)

Before diffusion models, GANs were the dominant approach to image generation. Introduced by Goodfellow et al. (2014), the idea is elegant: train two neural networks against each other.

The generator creates fake images. The discriminator tries to tell real images from fake ones. The generator gets better at fooling the discriminator. The discriminator gets better at detecting fakes. They push each other to improve, like a counterfeiter and a detective in an arms race.
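The arms race shows up in the two loss functions. The sketch below scores a few made-up discriminator outputs, each one the discriminator's estimated probability that an image is real. It uses the "non-saturating" generator loss that Goodfellow et al. suggest as a practical alternative to the strict minimax form; the numbers are invented and there are no actual networks being trained here.

```python
import math

def discriminator_loss(real_scores, fake_scores):
    """Low when real images score near 1 and fakes score near 0."""
    real_term = sum(-math.log(s) for s in real_scores) / len(real_scores)
    fake_term = sum(-math.log(1.0 - s) for s in fake_scores) / len(fake_scores)
    return real_term + fake_term

def generator_loss(fake_scores):
    """Low when the discriminator scores fakes near 1, that is, when it is fooled."""
    return sum(-math.log(s) for s in fake_scores) / len(fake_scores)

# Invented discriminator outputs at two points in training.
early = {"real": [0.9, 0.8], "fake": [0.1, 0.2]}   # fakes are easy to spot
later = {"real": [0.6, 0.5], "fake": [0.5, 0.4]}   # fakes have become convincing

generator_improved = generator_loss(later["fake"]) < generator_loss(early["fake"])
discriminator_struggling = (
    discriminator_loss(later["real"], later["fake"])
    > discriminator_loss(early["real"], early["fake"])
)
```

One network's loss falls as the other's rises, which is the tension that drives both to improve and also what makes the balance so easy to lose.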

GANs produced stunning results: StyleGAN (Karras et al., 2019) generated photorealistic faces that were often hard to tell apart from real photographs. But they were notoriously difficult to train. The two networks can fall out of balance (the generator collapses to producing one image, a failure known as mode collapse, or the discriminator becomes unbeatable), and the training process is unstable compared to diffusion models.

Diffusion models have largely replaced GANs for general-purpose image generation, but GANs remain useful in some niches — real-time applications where the single-pass generation is faster than iterative denoising, and super-resolution tasks where you’re enhancing an existing image rather than generating from scratch.

State-space models

Transformers aren’t the only game in town for text. State-space models (SSMs), most notably Mamba (Gu and Dao, 2023), are an alternative architecture that processes sequences without the quadratic attention cost.

Instead of letting every token attend to every other token, SSMs maintain a compressed hidden state that evolves as each token is processed. Think of it as the difference between re-reading an entire book every time you want to recall something (attention) versus keeping a running set of notes that you update as you read (state-space). The notes are lossy — you can’t recall every detail — but updating them is fast and the cost scales linearly with sequence length, not quadratically.
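The "running notes" idea can be sketched as a tiny recurrence. This is a deliberately minimal, one-dimensional toy with fixed made-up coefficients `a`, `b`, and `c`. Real state-space models use much larger states, and Mamba in particular makes the update depend on the input, so read this as the shape of the idea and not as Mamba itself.

```python
def ssm_scan(inputs, a=0.9, b=0.1, c=1.0):
    """Run a one-dimensional linear state-space recurrence over a sequence."""
    state = 0.0
    outputs = []
    for x in inputs:
        state = a * state + b * x   # fold the new token into the running notes
        outputs.append(c * state)   # read the current output off the notes
    return outputs

# A single spike followed by silence: the state remembers it, but the memory fades.
signal = [1.0] + [0.0] * 4
outputs = ssm_scan(signal)
```

There is one constant-cost update per token, so the total work grows linearly with sequence length. The fading output is the lossy part: the spike is still in the notes, just less sharply with every step.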

SSMs are still emerging. They show promising results on long sequences where the quadratic cost of attention is prohibitive, but transformers remain dominant for most tasks as of early 2026. The two approaches may converge — hybrid architectures that combine attention for local precision with state-space mechanisms for long-range efficiency are an active area of research.

Paradigms: patterns built on top

Architectures are the engine. Paradigms are how you drive. These are patterns and techniques that sit on top of the fundamental architectures, often combining them in clever ways.

Reasoning models

Standard LLMs generate text in a single pass — the model reads your prompt, then starts producing tokens immediately. Reasoning models add an explicit thinking phase before answering.

OpenAI’s o1 and o3 models, and DeepSeek-R1, are the most prominent examples. When you ask a reasoning model a hard question, it generates a long internal chain of thought — sometimes thousands of tokens of deliberation — before producing the visible response. The model might consider multiple approaches, check its own reasoning, backtrack from dead ends, and work through intermediate steps.

This isn’t just chain-of-thought prompting (which we covered in the LLM post). Chain-of-thought prompting asks a standard model to show its working. Reasoning models are specifically trained, often using reinforcement learning, to use that thinking time productively. The training process rewards not just correct answers but effective reasoning strategies.

The trade-off is straightforward: reasoning models are slower and more expensive, but substantially better at tasks that require genuine multi-step reasoning — mathematics, formal logic, complex code, and scientific analysis. For a simple question like “what’s the capital of France?”, a reasoning model is overkill. For “find the bug in this 500-line concurrent program,” the extra thinking time pays for itself.

Recursive Language Models (RLMs)

RLMs are a recent inference-time paradigm from MIT (Zhang, Kraska, and Khattab, 2026) that addresses one of the most stubborn limitations of LLMs: the context window.

The insight is simple and surprisingly effective. Instead of cramming a massive prompt directly into the model’s context window, an RLM loads the prompt as a variable in a Python REPL and lets the model write code to examine, decompose, and process it. The model can peek at snippets, chunk the input, search through it, and call itself recursively on sub-sections.

This means a model with a 272K token context window can effectively process inputs of 10 million tokens or more. The model never sees the whole input at once. Instead, it writes a program that strategically examines the parts it needs, delegates sub-questions to copies of itself, and assembles the results.
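The delegation idea can be sketched in a few lines. This is a deliberately simplified illustration, not the authors' scaffold: in a real RLM the model itself decides how to slice the input and when to recurse by writing code in the REPL, whereas here the split-in-half strategy is hard-coded. `fake_llm` is a hypothetical stand-in for a model call with a tiny "context window" of four lines.

```python
def recursive_answer(question, lines, llm, max_lines=4):
    """Answer `question` about `lines` without ever passing more than
    `max_lines` lines to `llm` in a single call."""
    if len(lines) <= max_lines:
        return llm(question, lines)  # base case: this piece fits
    mid = len(lines) // 2
    # Delegate each half to a recursive call, then combine the two partial answers.
    left = recursive_answer(question, lines[:mid], llm, max_lines)
    right = recursive_answer(question, lines[mid:], llm, max_lines)
    return llm(question, [left, right])


def fake_llm(question, lines):
    """Hypothetical model stub: returns the first line containing the
    keyword in `question`, or "" if none does. Rejects oversized inputs."""
    assert len(lines) <= 4, "context window exceeded"
    for line in lines:
        if question in line:
            return line
    return ""


document = [f"filler line {i}" for i in range(100)]
document[73] = "the password is swordfish"
answer = recursive_answer("password", document, fake_llm)
```

No single call ever sees more than four lines, yet the answer buried at line 73 of 100 still surfaces.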

It’s not a new architecture; the underlying model is still a standard transformer. It’s a scaffold, a way of using an existing model more effectively. But the results are striking: RLMs outperformed both the base model and existing long-context approaches (summarisation agents, retrieval-augmented generation) by large margins on four diverse benchmarks, while maintaining comparable cost.

The pattern here is worth noting: some of the most impactful advances aren’t new architectures at all. They’re clever ways of using existing architectures differently.

Agents

An agent is an AI system that can take actions in the world: not just generate text, but use tools, browse the web, execute code, call APIs, and make decisions about what to do next.

The underlying model is typically an LLM, but instead of just producing a response, it produces a plan: “I need to search for X, then read the result, then calculate Y, then write a file.” Each step generates a new prompt that includes the results of previous steps. To understand agents, you need a few pieces of vocabulary.

Prompts are the instructions you give a model. You already know this: you type something, the model responds. But there’s a layer most people don’t see: the system prompt. Before your message ever reaches the model, the application wraps it with hidden instructions that shape behaviour. “You are a helpful assistant. Answer concisely. Do not produce harmful content.” That’s a system prompt. When ChatGPT refuses to help you build a bomb, that’s not some deep moral reasoning; it’s following instructions in a system prompt, reinforced by RLHF training. When Claude writes code in a particular style, that’s partly system prompt too. The system prompt is the invisible hand that makes the same underlying model behave differently in different products.

Tools are capabilities granted to an agent — things it can do beyond generating text. A bare LLM can only produce words. Give it tools and it can read files, search the web, execute code, query databases, send messages, or call external APIs. The model doesn’t inherently have these abilities. They’re defined by the developer who builds the agent, and the model learns to invoke them by generating structured requests (“I want to call the read_file tool with the path /src/main.py”). The set of tools available to an agent defines what it can accomplish — and its limits.
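Stripped of the model itself, the loop around tool use is small. Here is a minimal sketch, assuming a model that returns either a tool request or a final answer as a dictionary. The dictionary shape, `scripted_model` and the `word_count` tool are all hypothetical; real agent frameworks each define their own request format.

```python
def run_agent(model, tools, goal, max_steps=5):
    """Ask the model what to do, run the requested tool, feed the result
    back, and stop when the model returns a final answer."""
    history = [("goal", goal)]
    for _ in range(max_steps):
        decision = model(history)
        if decision["type"] == "final":
            return decision["answer"]
        tool = tools[decision["tool"]]      # look up the tool the model asked for
        result = tool(**decision["args"])   # run it with the model's arguments
        history.append((decision["tool"], result))
    return None  # step budget exhausted without a final answer


def scripted_model(history):
    """Hypothetical stand-in for an LLM: requests one tool call, then answers."""
    if len(history) == 1:
        return {"type": "tool", "tool": "word_count", "args": {"text": history[0][1]}}
    return {"type": "final", "answer": f"The goal has {history[-1][1]} words."}


tools = {"word_count": lambda text: len(text.split())}
answer = run_agent(scripted_model, tools, "count the words in this goal")
```

The model never counts anything itself. It only asks for `word_count` to be run, and the surrounding code does the work and reports back.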

Sub-agents extend this further. A complex task might be too large or too varied for a single agent to handle efficiently. Instead, the agent can spawn sub-agents — smaller, focused agents that handle specific sub-tasks. An agent reviewing a large codebase might spawn one sub-agent to explore the directory structure, another to search for specific patterns, and a third to read and summarise relevant files — all working in parallel. Each sub-agent has its own context, its own tools, and returns its results to the parent. It’s delegation, the same way a manager breaks work into tasks for a team.

Skills are pre-packaged workflows — reusable recipes that an agent can invoke rather than figuring out from scratch. Instead of reasoning through the twelve steps of “create a git commit with the correct message format,” an agent might invoke a commit skill that encapsulates that workflow. Skills trade flexibility for reliability: the agent doesn’t need to reinvent common procedures every time.

Agents blur the line between “AI as a tool” and “AI as a collaborator.” A tool responds to a single prompt. An agent pursues a goal across multiple steps, adapting its approach based on what it discovers along the way.

RAG (Retrieval-Augmented Generation)

RAG is a pattern that addresses a fundamental limitation: the model’s knowledge is frozen at training time. If you ask about something that happened after the training cutoff, or about your company’s internal documentation, the model can only hallucinate.

RAG works by retrieving relevant documents before generating a response. Your question gets converted into an embedding (a numerical representation), that embedding is compared against a database of document embeddings, the most relevant documents are pulled in, and those documents are included in the prompt alongside your question. The model then generates a response grounded in the retrieved text, rather than relying solely on what it learned during training.
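The steps above can be sketched end to end. This is a toy, under loud assumptions: `embed` is a bag-of-words counter standing in for a real embedding model, the similarity is a plain shared-word score, and the three documents are invented. The final call to an LLM is left out; the sketch stops at the assembled prompt.

```python
from collections import Counter


def embed(text):
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(text.lower().split())


def similarity(a, b):
    """Dot product of two count vectors (shared words score higher)."""
    return sum(a[word] * b[word] for word in a)


def retrieve(question, documents, k=1):
    """Return the k documents whose embeddings best match the question's."""
    q = embed(question)
    ranked = sorted(documents, key=lambda d: similarity(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(question, documents):
    """Put the retrieved text in front of the question for the model to ground on."""
    context = "\n".join(retrieve(question, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Invented internal documents, for illustration only.
docs = [
    "Refunds are processed within five business days.",
    "The office is closed on public holidays.",
    "Passwords must be rotated every ninety days.",
]
prompt = build_prompt("How long do refunds take?", docs)
```

Word counting only matches on shared words, which is exactly the weakness real embedding models address.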

This is how most enterprise AI deployments work in practice. The model might be Claude or GPT-4, but the knowledge comes from your documentation, your codebase, your internal wiki. RAG lets you get domain-specific answers from a general-purpose model without fine-tuning it.

The models: what you can actually use

All of the above is theory. Here’s the practical bit: what models exist, who makes them, and what can you do with them?

GPT is not a generic term

Let’s start with the biggest source of confusion. GPT stands for Generative Pre-trained Transformer. It’s the name of OpenAI’s model family — GPT-3, GPT-4, GPT-4o, GPT-5. It is not a generic term for AI models, despite being used that way in roughly half of all conversations about AI.

Calling all AI models “GPTs” is like calling all vacuum cleaners “Hoovers” or all search engines “Google.” Understandable, but imprecise. When someone says “we should use a GPT for this,” they might mean “we should use an LLM” — or they might specifically mean OpenAI’s product. It’s worth asking.

The major LLM families

Model family	Made by	Open / closed	Notable for
GPT (GPT-4o, o1, o3, GPT-5)	OpenAI	Closed	First mover, reasoning models (o-series), broad multimodal support
Claude (Haiku, Sonnet, Opus)	Anthropic	Closed	Long context (1M tokens), strong at code and structured reasoning, Constitutional AI safety approach
Gemini	Google DeepMind	Closed	Natively multimodal (text, image, audio, video), integrated with Google services
Llama (Llama 3, 4)	Meta	Open-weight	Largest open model ecosystem, strong community, commercially usable
Mistral / Mixtral	Mistral AI	Open-weight	European, efficient MoE architecture, strong multilingual
Qwen	Alibaba	Open-weight	Strong multilingual (especially CJK), good code models, range of sizes
DeepSeek	DeepSeek AI	Open-weight	Reasoning focus (DeepSeek-R1), competitive with frontier closed models at lower cost
Grok	xAI	Partially open	Integrated with X (Twitter) data, less filtered

Open-weight vs closed: why it matters

This distinction is one of the most important practical decisions you’ll make.

Closed models (GPT, Claude, Gemini) are accessible only through an API. You send your prompt to someone else’s servers and get a response back. You can’t see the model’s weights, can’t run it on your own hardware, and can’t modify it. The provider controls the model’s behaviour, pricing, and availability.

Open-weight models (Llama, Mistral, Qwen, DeepSeek) publish their model weights. You can download them, run them on your own hardware, fine-tune them for your specific use case, and inspect them. “Open-weight” rather than “open-source” because many of these models have restrictive licences — you can use the weights but the training code, data, and full methodology are often proprietary.

When does this matter?

Fine-tuning: If you want to train a model on your own data (say, a dataset of space game sprites) with full control, you need open weights. You cannot download GPT-4 or Claude and retrain them yourself. OpenAI and others offer limited fine-tuning APIs, but the level of customisation is constrained.
Privacy: If your data can’t leave your infrastructure (medical, legal, financial), you need a model you can run locally.
Cost at scale: API calls add up. If you’re making millions of inference calls, running your own model on your own GPUs can be cheaper — though the upfront hardware cost is significant.
Control: Closed models can change behaviour between versions, add or remove capabilities, or adjust content policies in ways that break your workflow. Open-weight models are a snapshot — the version you downloaded today will behave the same way tomorrow.

For most individuals and small teams experimenting with AI, the closed model APIs are the pragmatic starting point. They’re the most capable, the easiest to use, and the per-query cost is manageable at small scale. Open-weight models become compelling when you need customisation, privacy, or cost control at volume.

Image generation models

Model	Made by	Architecture	Open / closed	Notable for
DALL-E 3	OpenAI	Diffusion	Closed	Integrated with ChatGPT, good prompt adherence
Midjourney	Midjourney	Diffusion (proprietary)	Closed	Aesthetically striking defaults, strong at artistic styles
Stable Diffusion / SDXL	Stability AI	Latent diffusion	Open-weight	Enormous community, fine-tunable, runs locally
Flux	Black Forest Labs	Flow matching	Open-weight	Founded by original Stable Diffusion researchers, strong prompt adherence, efficient
Imagen	Google DeepMind	Diffusion	Closed	Integrated with Google products

The open-weight image models, particularly Stable Diffusion and Flux, have spawned an enormous ecosystem of community-trained variants, style adaptations, and fine-tuning techniques. This is where LoRA (Low-Rank Adaptation) and Dreambooth come in: techniques for teaching an existing model a new style or concept with relatively little data and compute. Want a model that generates pixel art sprites in a specific style? Fine-tune Stable Diffusion or Flux with LoRA on a few hundred examples. We’ll dig deeper into this in a future post.

Video, audio, and beyond

The landscape for non-text, non-image modalities is moving fast but less mature:

Video generation: Sora (OpenAI), Runway Gen-3, Kling (Kuaishou), Veo (Google). These typically extend diffusion models to generate sequences of frames. Quality has improved dramatically but consistency across long videos (characters changing appearance, physics breaking) remains challenging.
Music and audio: Suno and Udio generate full songs from text descriptions. Whisper (OpenAI) is the standard for speech-to-text. Text-to-speech models (ElevenLabs, XTTS) produce increasingly natural-sounding voices.
3D generation: Still early. Point-E (OpenAI), various NeRF-based approaches. Generating 3D assets from text or images is an active research area but not yet reliable enough for production use in most cases.

Mixture of Experts: an architecture trick worth knowing

You’ll encounter the term Mixture of Experts (MoE) and it’s worth understanding because it explains how some models can be very large without being very expensive to run.

A standard transformer activates all of its parameters for every token. A 70-billion-parameter model does 70 billion parameters’ worth of computation for every single token it processes.

A Mixture of Experts model has many more total parameters, but only activates a subset of them for each token. The model contains multiple “expert” sub-networks, and a learned routing mechanism decides which experts to use for each token. Mixtral 8x7B, for example, has 8 expert networks per layer and about 47 billion parameters in total (less than 8 × 7 = 56 billion, because the experts share the attention layers), but only activates 2 experts per token. The compute per token is therefore closer to that of a 13-billion-parameter model, while the model has access to a much larger knowledge base.

This is how some models can be “bigger” without being proportionally slower or more expensive. The total parameter count (which gets the headlines) is much larger than the active parameter count per token (which determines the actual cost).
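Routing is easier to picture with a toy. The sketch below shows one common formulation, top-k routing with a softmax over the selected scores, and is not the code of any specific model. The eight "experts" are just multipliers and the gate scores are made up; in a real model a learned router computes the scores from the token's hidden state.

```python
import math


def top_k_route(gate_scores, k=2):
    """Pick the k highest-scoring experts and softmax their scores into weights."""
    order = sorted(range(len(gate_scores)), key=lambda i: gate_scores[i], reverse=True)
    chosen = order[:k]
    exps = [math.exp(gate_scores[i]) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def moe_layer(x, experts, gate_scores, k=2):
    """Weighted sum of the selected experts' outputs; the other experts never run."""
    return sum(weight * experts[i](x) for i, weight in top_k_route(gate_scores, k))


# Eight hypothetical experts: expert m simply multiplies its input by m.
experts = [lambda x, m=m: m * x for m in range(8)]
gate_scores = [0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 2.0, 0.0]  # made-up router output for one token
routing = top_k_route(gate_scores)
y = moe_layer(1.0, experts, gate_scores)
```

Only experts 3 and 6 are evaluated for this token. The other six contribute parameters to the headline count but no compute.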

Embeddings: the hidden infrastructure

Embeddings deserve special mention because they’re everywhere and rarely explained.

An embedding is a vector that represents the meaning of a piece of text (or an image, or an audio clip) in a high-dimensional space that captures semantic similarity. Two texts that mean similar things will have similar embeddings, even if they use completely different words.

“The cat sat on the mat” and “A feline rested on the rug” would have very similar embeddings. “The stock market crashed” would have a very different one.
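"Similar" here usually means a high cosine similarity between the two vectors. A minimal sketch, with invented three-dimensional vectors standing in for the three sentences (real embeddings come from a model and have hundreds or thousands of dimensions):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented vectors for illustration; not the output of any real embedding model.
cat_on_mat = [0.9, 0.1, 0.0]
feline_on_rug = [0.8, 0.2, 0.1]
market_crash = [0.0, 0.1, 0.9]

similar = cosine_similarity(cat_on_mat, feline_on_rug)
different = cosine_similarity(cat_on_mat, market_crash)
```

With these made-up numbers the two cat sentences score close to 1 and the stock market sentence scores close to 0.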

This matters because embeddings are the glue behind:

Semantic search: Instead of keyword matching (“does this document contain the word ‘cat’?”), you compare embeddings (“is this document about a similar concept?”).
RAG: The retrieval step in retrieval-augmented generation uses embeddings to find relevant documents.
Clustering and classification: Group similar items together without hand-written rules.
Recommendation systems: “You liked X, here are similar things.”

Embedding models are typically smaller, faster, and cheaper than generative models. They don’t produce text — they produce vectors. OpenAI’s text-embedding-3, Cohere’s Embed, and various open-source options (e5, GTE, BGE) are the main choices.

Making sense of it all: a decision framework

If you’ve read this far, you have the vocabulary. Now let’s make it practical. You have a task. Which model type do you need?

I want to…	You need	Start here
Write or edit text, summarise documents, answer questions	An LLM	Claude or GPT-4o via API
Solve hard maths, logic, or coding problems	A reasoning model	Claude (extended thinking), o3, DeepSeek-R1
Generate images from text descriptions	A diffusion model	Midjourney (quality), Stable Diffusion / Flux (open, fine-tunable)
Generate images in a specific style	A fine-tuned diffusion model	Stable Diffusion or Flux + LoRA fine-tuning
Generate video	A video generation model	Sora, Runway, Kling
Transcribe speech to text	A speech recognition model	Whisper
Generate music	A music generation model	Suno, Udio
Search my own documents using meaning, not keywords	An embedding model + vector database	text-embedding-3 + Pinecone/Chroma/pgvector
Build an AI that uses tools, browses the web, writes code	An agent framework around an LLM	Claude Code, LangChain, or build your own
Answer questions using my company's internal knowledge	RAG (embedding model + LLM)	Embed your docs, retrieve relevant ones, pass to Claude/GPT
Process inputs far beyond any model's context window	An RLM scaffold or chunking strategy	RLM framework, or manual chunking with an LLM
Run AI locally, on my own hardware, with full privacy	An open-weight model	Llama or Mistral via Ollama

The pace of change

One thing this post can’t give you is a stable picture. It won’t last.

The landscape described here is accurate as of mid-2026. Six months ago, some of these models didn’t exist. Six months from now, some of them will have been superseded. The pace is genuinely unprecedented in software engineering — not just incremental improvements, but new categories of capability appearing every few months.

What will last is the framework. Modalities, architectures, paradigms, and models. New things will appear, but they’ll slot into this structure. A new model will operate on specific modalities, use a specific architecture (or a hybrid), employ specific paradigms, and be open or closed. If you understand the categories, you can evaluate new developments without starting from scratch every time.

Where to from here?

This post gave you the map. Future posts in this series will zoom into specific squares on it — picking a real problem, choosing the correct model type, and walking through the process end to end, including what it actually costs.

Because the real test of understanding a landscape isn’t being able to name everything in it. It’s being able to pick the correct path through it for where you’re trying to go.

Example Mapping: Making Stories Concrete

2026-03-31T06:00:00+08:00

The Greenbox team has done good work. They Event Stormed and got the whole domain out of Maya’s head. The hotspots on the wall made it clear: subscriptions are the critical path, nothing else works without them.

Now they need to build something. And the first story on the board is: “Subscribe to a produce box.”

Sounds clear enough, right? That’s what they thought four weeks ago too, and it didn’t go well.

The story is too vague to build from. What does “subscribe” actually mean? What has to happen? What could go wrong? What does the customer see? Tom could start coding right now, but he’d be guessing, again, and the team knows where that leads.

They need a way to turn that vague story into something concrete before anyone opens an IDE.

What is Example Mapping?

Example Mapping is a structured conversation technique created by Matt Wynne. The idea is simple: get a small group together for a short, focused session, take a single user story, and break it apart until everyone agrees on what “done” looks like. What are the rules? What are the concrete examples? What can’t we answer yet?

By the end of the session, you know one of three things: the story is well-understood and ready to build, the story is too big and needs splitting, or there are too many unknowns and it needs more research first. All three are useful outcomes. The worst thing you can do with a vague story is ship it unexplored, with unknown assumptions and no validation. You may choose to build part of it to learn what’s missing, then revisit and possibly rewrite the story before you commit to the delivery.

The technique uses four colours of index card (or sticky note, or virtual equivalent) to keep the conversation structured.

Four colours of card

Yellow is the story. One card. The thing you’re discussing.

Blue is for rules. These are the business rules, constraints, and acceptance criteria that govern how the story works. Each rule gets its own card.

Green is for examples. Concrete, specific instances that illustrate a rule. “If X happens, then Y.” These are the things that tell you what “done” looks like.

Red is for questions. Anything you can’t answer in the room. Unknowns, disagreements, things that need research or a decision from someone who isn’t here.

That’s it. Four colours, four purposes.

How to run one

The format is deliberately tight:

Keep it short. Twenty-five minutes. Long enough to explore a story properly, short enough to stay focused. Set a timer. If you haven’t finished, the story is too big or too unclear. That’s useful information.
Small group. Someone who understands the business, someone who’ll build it, someone who’ll challenge the assumptions. Three to five people is ideal.
One story at a time. Don’t try to batch these. One story, one session.
Write as you go. Someone states a rule, write it on a blue card. Someone gives an example, write it on a green card under that rule. Someone asks a question nobody can answer, write it on a red card.

The conversation flows naturally. Someone proposes a rule. Someone else challenges it with an example. Edge cases surface. Assumptions get exposed. The map grows organically.

The Greenbox team gathers round a table. Maya, Tom, Priya, Jas, and Sam. Lee facilitates. Twenty-five minutes on the clock.

Lee places a yellow card in the middle of the table and writes on it: Subscribe to a produce box.

“Don’t try to define it,” Lee says. “Start with a concrete scenario. A real person doing a real thing. Tell me about a real person subscribing to a produce box.”

Starting with examples

Jas goes first: “Someone visits the site, picks a box, enters their card details, and they’re subscribed.”

Lee pushes back gently. “Who? Which box? What price? What happens so they know they’re subscribed? The more concrete the example, the more useful it is. Abstract examples hide assumptions.”

Jas tries again: “OK. Sarah visits the site, picks a small box at $25 a week, enters her Visa ending in 4242, and gets a confirmation with a delivery date of Thursday 2nd April.”

Lee writes it on a green card: Sarah chooses small box ($25/week), pays with Visa 4242 → subscription confirmed, first delivery Thursday 2 April. “See the difference? The first version could mean almost anything. Everyone in the room would picture something slightly different. This one leaves much less room for ambiguity, and ambiguity is where assumptions hide, and assumptions are where the bugs, the waste, and the rework come from.”

“Give me another one. What else could happen?”

Tom: “The card gets declined. Say Sarah enters an expired card.”

Green card: Sarah tries to subscribe with expired Visa → no subscription, asked to retry with a different card.

“What happens then?” Lee asks. “Does she lose her box choice? Start over from scratch?”

Maya: “No, she just re-enters payment details. The box choice stays.”

Lee writes that detail on the green card. “Good, that’s exactly the kind of detail that would have been a surprise in code review if nobody asked.”

Maya: “We deliver on Thursdays. If someone subscribes on Monday, they should get a box this Thursday. If they subscribe on Friday, it’s next Thursday.”

Jas: “Should we ask about dietary preferences when they subscribe? Allergies, things they don’t want?”

Maya nods. “Mrs Patterson hates beetroot. We should probably –”

Lee reaches for a red card. “That’s worth solving, but is it part of subscribing, or is it its own thing?” He writes: Dietary preferences and allergies during subscription? and moves it to the parked area. “We’ll come back to it. For now, let’s finish the shape of this one.”

Lee pushes for dates: “Which Monday? Which Friday?”

Maya: “If Sarah subscribes on Monday 30th March, she gets a box Thursday 2nd April. If she subscribes on Friday 3rd April, she gets a box Thursday 9th April.”

Two more green cards:

Sarah subscribes Monday 30 March → first delivery Thursday 2 April
Sarah subscribes Friday 3 April → first delivery Thursday 9 April

Context, action, outcome

Lee looks at the cards on the table. “Every solid example has three parts: the context (what’s true before anything happens), the action (what someone does), and the outcome (what should be true afterwards).”

He picks up the delivery date card. “Sarah subscribes Monday 30 March, first delivery Thursday 2 April. What’s the context?”

Tom: “Delivery day is Thursday.”

Maya: “And the minimum lead time is three days.”

Priya: “And there’s no public holiday that week.”

“Right. None of that is on the card.” He rewrites it:

Context: delivery day is Thursday, minimum lead time is 3 days, no public holiday this week. Sarah subscribes Monday 30 March. → First delivery Thursday 2 April.

“Now it’s self-contained. Anyone can pick up this card and understand not just what happens but why. And Priya’s point about public holidays, that’s on the card now. If someone reads this example in two weeks, they won’t have to guess whether we considered holidays. We did. It’s right there.”

Priya starts rewriting some of the earlier cards without being asked. This is Priya at her best: she sees structure where others see conversation, and she can’t leave a sloppy card on the table. The payment one becomes: Context: Sarah has selected a small box ($25/week). She enters an expired Visa. → No subscription created, asked to retry. Box choice is preserved.

Not every example needs three paragraphs of context. But the discipline of asking “what’s the context?” catches the assumptions that aren’t obvious, and those are the ones that cause problems in production.
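Cards written this way translate almost directly into checks. As a minimal sketch, assuming the context on the rewritten card (Thursday delivery, three-day minimum lead time) and leaving out everything still on a red card (the time-of-day cutoff, public holidays), the two delivery-date examples might look like this. The function name `first_delivery` is made up, and the year 2026 is an assumption based on the weekdays on the cards.

```python
from datetime import date, timedelta

THURSDAY = 3  # date.weekday() counts Monday as 0


def first_delivery(subscribed_on, delivery_weekday=THURSDAY, min_lead_days=3):
    """First delivery day that is at least `min_lead_days` after subscribing.

    Works on whole dates only: no cutoff time and no public holidays,
    because both are still open questions.
    """
    earliest = subscribed_on + timedelta(days=min_lead_days)
    days_until_delivery = (delivery_weekday - earliest.weekday()) % 7
    return earliest + timedelta(days=days_until_delivery)


# Green card: Sarah subscribes Monday 30 March -> first delivery Thursday 2 April
assert first_delivery(date(2026, 3, 30)) == date(2026, 4, 2)
# Green card: Sarah subscribes Friday 3 April -> first delivery Thursday 9 April
assert first_delivery(date(2026, 4, 3)) == date(2026, 4, 9)
```

The context becomes the function's parameters, the action becomes the call, and the outcome becomes the assertion.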

“What about Wednesday?” Tom asks. “If someone subscribes at 11pm on Wednesday, do they make the cutoff? And whose 11pm, ours or the customer’s?”

Maya hesitates on the cutoff. “I think so… but the farms need to have confirmed supply by then.” The timezone question she can answer: “We’re Perth only. Everything is AWST.”

“For now,” Tom says.

“For now,” Maya agrees. “When we hit Melbourne we’ll need to revisit. They’re on a different timezone and they have daylight saving. Perth doesn’t.”

Lee writes a blue card: All times are AWST (Perth). Then a red card: Exact cutoff time for same-week delivery? He places the red card off to the side.

“Red cards are good. They’re unknowns we’ve caught before they became expensive surprises.”

Questions and assumptions

Sam asks: “Can someone have two subscriptions? Like a small box to their place and a large one to their mum’s house?”

The room goes quiet. Maya hadn’t considered it.

Lee writes a red card: Multiple subscriptions per customer? Then he asks, “Is that something we need for the first version?”

Maya: “No. Definitely not for version one.”

“Good. Park it.” He moves the red card to a separate area of the table. “Anything that isn’t part of this story goes over here. We’re not losing it, we’re recognising it belongs somewhere else.”

Jas: “What about cancellation? Can they cancel any time?”

Another red card: What’s the cancellation policy? Parked.

“What about 3D Secure?” Priya asks. “Some cards need that extra authentication step.”

Red card: How do we handle 3D Secure? This one stays with the story, it’s a technical detail that affects the subscription flow directly. Tom volunteers to research it.

Generalising to rules

“OK,” Lee says. “We’ve got a good set of examples and questions. Now let’s look at what they have in common. If you look across several examples, you’ll start to see patterns, things that are always true, constraints that apply every time. Those patterns are rules. A rule is a general statement that a set of examples all obey. ‘Payment must succeed before a subscription is created’, that’s a rule. Every example we’ve written either follows it or tests what happens when it breaks.”

The team looks at the green cards spread across the table.

Maya sees it first: “There’s a box size choice. Small or large. That’s it for now.”

Blue card: Customer must choose a box size. Lee arranges the size-related green cards underneath it.

“Does the rule spark new examples? What could go wrong with box size selection?”

Priya: “What if they don’t choose? What if they hit ‘subscribe’ without selecting a size?”

Green card: Sarah clicks subscribe without choosing a size → error, asked to choose.

Tom: “And can they change their mind later? Switch from small to large?”

Maya: “Yes, but not mid-week. It takes effect from the next delivery.”

Green card: Sarah switches from small ($25) to large ($45) on Monday → change takes effect Thursday.

Lee nods. “The rule generates new examples, and the examples constrain the rule. It’s not just ‘choose a size’, it’s ‘must choose before subscribing, and can change with notice.’”

Tom: “Payment has to work too. No valid payment, no subscription.”

Blue card: Payment must succeed before subscription is created.

“What else can go wrong with payment?”

Sam: “What about when the weekly charge fails three weeks in? Card expired, insufficient funds?”

Maya: “First failed charge, we retry after 24 hours. Second failure, we email them. Third, we pause the subscription.”

Three new green cards:

Weekly charge fails once → retry after 24 hours
Two failures → email customer to update payment
Three failures → subscription paused automatically

Tom whistles. “That’s a lot more than ‘payment must succeed.’” Something he assumed would be straightforward, payment works or it doesn’t, just turned into a state machine with five transitions. Twenty-five minutes ago, he would have built it wrong.
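Maya's three green cards are already close to code. A minimal sketch, assuming the failures are counted consecutively and the count resets after a successful charge (the cards don't say either way), with hypothetical action names:

```python
def on_charge_failure(consecutive_failures):
    """Action to take after the Nth consecutive failed weekly charge."""
    if consecutive_failures < 1:
        raise ValueError("expected at least one failure")
    if consecutive_failures == 1:
        return "retry_in_24h"        # Weekly charge fails once -> retry after 24 hours
    if consecutive_failures == 2:
        return "email_customer"      # Two failures -> email customer to update payment
    return "pause_subscription"      # Three failures -> subscription paused automatically


actions = [on_charge_failure(n) for n in (1, 2, 3)]
```

Writing it down surfaces the next questions for the table: what resets the count, and what happens on a fourth failure if the subscription is somehow still active?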

Jas: “And they need to know when their first box arrives.”

Blue card: Customer sees their first delivery date after subscribing.

Sam: “Public holidays. What if Thursday is a public holiday?”

Maya: “We’d deliver Wednesday instead. Or Friday. Depends on the courier.”

Red card: How do public holidays affect delivery dates?

“Notice what happened,” Lee says. “We started with examples, and the rules emerged naturally. Then the rules generated more examples, and those examples tightened the rules. If you start with rules, you tend to stay abstract. If you start with examples, you stay grounded.”

The map so far

The timer hasn’t gone off yet, but the team feels like they’ve covered the core shape. Here’s what the table looks like:

Yellow card: Subscribe to a produce box

Blue card: Customer must choose a box size
Green card: Small box: $25/week
Green card: Large box: $45/week
Green card: No size selected → error
Green card: Switch small→large Monday → change from Thursday

Blue card: Payment must succeed
Green card: Valid card → confirmed
Green card: Declined card → retry
Green card: Weekly charge fails → retry after 24hrs
Green card: Two failures → email customer
Green card: Three failures → auto-pause
Red card: How do we handle 3D Secure?

Blue card: Customer sees first delivery date
Green card: Monday sub → this Thursday
Green card: Friday sub → next Thursday
Red card: Exact cutoff for same-week delivery?
Red card: Public holidays and delivery dates?

Blue card: All times are AWST (Perth)

Parked (other stories)
Red card: Multiple subscriptions per customer?
Red card: What's the cancellation policy?
Red card: Dietary preferences and allergies during subscription?

Four rules, eleven examples, three questions still attached to the story, and three parked for other stories.

The “too many red cards” signal

If you have more red cards than green cards, the story isn’t ready to build.

Three red cards against eleven green cards is fine, and those are just the ones attached to this story, not the parked ones. The Greenbox team decides to resolve the cutoff and 3D Secure questions before starting work, and to treat multiple subscriptions, cancellation, and dietary preferences as separate stories.

If they’d had eight red cards and three green cards, that would be a clear signal: go away, answer the questions, and come back for another session.

This is one of the best things about Example Mapping. It doesn’t just help you understand a story, it tells you when you don’t understand it. A readiness check disguised as a planning session.

Second session: “Pause a subscription”

The next story up is “Pause a subscription.” A customer is going on holiday and wants to skip a week or two.

This time the session goes more smoothly. The team knows the domain better. Maya is in the groove of stating rules explicitly instead of assuming everyone already knows them.

Three rules emerge quickly: customers can pause for one or more weeks, they’re not charged for paused weeks, and they must pause at least three days before the next delivery.

The edge cases are where it gets interesting. “What about Tuesday?” Priya asks. “If the delivery is Thursday and they pause on Tuesday, is that three days?”

Maya hesitates. “I don’t think so… Monday to Thursday is three days. Tuesday to Thursday is two.”

“So Tuesday is too late,” Tom says. “But what does ‘three days before’ actually mean? Before midnight on Monday? Or 72 hours before the delivery window starts?”

Maya: “Before the end of Monday. If you pause any time on Monday, you’re fine. Tuesday, you’re not.”

They update the rule to be precise: pause must be requested before midnight AWST on the day three days before delivery. For Thursday deliveries, that’s end of Monday. Sam asks: “Does the same cutoff apply to unpausing? If I unpause on Wednesday, do I get a box Thursday?”

Maya: “No, same rule. You’d need to unpause by end of Monday to get Thursday’s box. Otherwise it’s the following week.”
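The rule the team settled on is precise enough to write down as code. What follows is a minimal sketch, not Greenbox's actual implementation: the function names and the `LEAD_DAYS` constant are hypothetical, and it assumes the same cutoff applies to pausing and unpausing, as Maya describes.

```python
from datetime import date, datetime, time, timedelta, timezone

# AWST is UTC+8 all year round (Western Australia has no daylight saving).
AWST = timezone(timedelta(hours=8))
LEAD_DAYS = 3  # assumption: "three days before delivery", per the rule above


def change_cutoff(delivery_day: date) -> datetime:
    """Return the first instant at which a pause or unpause is too late."""
    last_allowed_day = delivery_day - timedelta(days=LEAD_DAYS)
    # Midnight at the END of the last allowed day is the start of the next day.
    return datetime.combine(last_allowed_day + timedelta(days=1), time.min, tzinfo=AWST)


def can_change(requested_at: datetime, delivery_day: date) -> bool:
    """True if a pause or unpause request still affects this delivery."""
    return requested_at < change_cutoff(delivery_day)


thursday = date(2026, 6, 11)
print(can_change(datetime(2026, 6, 8, 23, 59, tzinfo=AWST), thursday))  # late Monday
print(can_change(datetime(2026, 6, 9, 0, 0, tzinfo=AWST), thursday))    # start of Tuesday
```

Expressing the cutoff as a single instant, rather than as "72 hours before", is what removes the ambiguity Tom raised.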

One question comes up that nobody can answer: can a subscription stay paused indefinitely, or does something happen if a customer never resumes?

Three rules, seven examples, one question. Much cleaner ratio than the first session. This story is nearly ready to build.

Notice how much faster it went. The team is developing a shared language. When Maya says “three days before delivery,” everyone knows what delivery day means, how the weekly cycle works, what the constraints are. That shared understanding from Event Storming is paying off already.

Why Example Mapping is the one you’ll use most

Event Storming is brilliant for understanding a whole domain. You might do it once at the start of a project, or when entering a new area.

Example Mapping is different. You do it before every story. Every single one.

It’s a short conversation. It surfaces assumptions. It catches edge cases. It builds shared understanding. And it tells you when a story isn’t ready.

The Greenbox team starts doing Example Maps before picking up each new story. Before Tom and Priya start building, they spend twenty-five minutes with Maya and Jas mapping it out. The red cards tell them what to resolve. The green cards tell them what to build. The blue cards tell them the rules to enforce.

Three weeks in, they’ve stopped finding surprises in code review. The arguments about scope have disappeared. When Priya finishes a story, it matches what Maya expected, because they agreed on concrete examples before anyone wrote a line of code.

If you only adopt one technique from this series, make it Example Mapping. Twenty-five minutes. Four colours of card. Every assumption surfaced before it becomes a bug.

Tom sits in his car after the session and texts Sarah: “I just spent 25 minutes doing something I thought was pointless and it saved me a week of work.” Sarah replies: “You sound surprised that something other than coding was useful.” He puts the phone down without responding. But he’s smiling.

Now what?

The team has cards on a table and a shared understanding of what “subscribe to a produce box” means: concrete, unambiguous, agreed upon by everyone in the room.

But cards on a table aren’t software. Tom picks up his bag. “Right. I’m going to build this.”

“Which part first?” Lee asks. “You’ve got red cards to resolve, Maya needs the subscription system live and hitting 200 subscribers before the seed money runs out, and some of these stories reduce more risk than others. What order gives you the most confidence that you’ll ship something useful by then?”

Tom looks at the cards. He knows what to build. He doesn’t know what to build first, or how to make the building predictable. None of them do. Not yet.

That’s where the first sprints come in, turning sticky notes into delivery, one fortnight at a time.

How LLMs Actually Work

2026-03-26T06:00:00+08:00

You type a question. A few seconds later, coherent, fluent text appears on your screen, text that seems to understand what you asked, that follows instructions, that writes code and poetry and legal briefs. It’s natural to wonder: what is actually happening in there?

In 1980, the philosopher John Searle posed a thought experiment. Imagine you’re locked in a room. People slide Chinese characters under the door. You don’t speak Chinese, but you have an enormous book of rules: “When you see this pattern, write that pattern and slide it back.” You follow the rules perfectly. To the people outside, it looks like the room understands Chinese. But you, the person in the room, understand nothing. You’re just matching patterns.

Large language models are the most sophisticated Chinese Room ever built. They don’t “understand” language in the way humans do. They don’t have beliefs, memories, or intentions. What they do, and they do it extraordinarily well, is predict the next token in a sequence. One token at a time, over and over, until the response is complete.

But here’s where Searle’s analogy breaks down, or at least gets interesting. “Just predicting the next token” turns out to be a surprisingly rich activity. To predict well, the model has to capture something about syntax, semantics, logic, world knowledge, coding conventions, social norms, and the structure of arguments. Not because anyone told it to. Because all of those things are reflected in the patterns of text that humans produce, and the model learned those patterns by reading a significant fraction of the internet.

Is that understanding? Or just very good pattern matching? We’ll come back to that question; it’s more slippery than it sounds. But first, let’s open up the room and look at the machinery inside. It starts with tokens.

Tokens: the atoms of text

LLMs don’t read characters. They don’t read words, either. They read tokens: chunks of text that sit somewhere between characters and words in size.

The word “understanding” might be a single token. The word “tokenisation” might be split into “token” + “isation”. A common word like “the” is almost certainly a single token in any major tokeniser. An uncommon word like “antidisestablishmentarianism” would be split into several. Numbers are tokenised digit by digit or in small groups. Code tokens include things like def, return, (), and \n.

Why tokens instead of characters or words? Characters are too granular; a model working character by character would need enormous context windows to see meaningful patterns. Words are too coarse, with hundreds of thousands of distinct words in English alone, and the model would need a separate entry for every inflection, tense, and compound. Tokens hit a practical sweet spot.

The process of breaking text into tokens is called tokenisation, and the dominant method is Byte Pair Encoding (BPE), originally described by Philip Gage in 1994 as a data compression algorithm and later adapted for neural language models by Sennrich, Haddow, and Birch in 2016.

BPE works by starting with individual bytes (or characters) and iteratively merging the most frequent pair. Here’s a simplified example:

Suppose your training text contains the sequence low lower lowest repeatedly. BPE starts with individual characters: l, o, w, e, r, s, t, and so on. It counts every adjacent pair. If l + o appears most frequently, it merges them into a new token lo. Now it counts again. If lo + w is the most frequent pair, it merges them into low. Then low + e might merge into lowe, and so on. The process continues for a fixed number of merge operations (typically 30,000 to 100,000), producing a vocabulary of that many tokens.

The result is a vocabulary where common words are single tokens, common subwords are single tokens, and rare or novel words get split into known pieces. This is crucial for handling words the model has never seen before: it can still process them, just broken into familiar subword units.
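The merge loop described above can be sketched in a few lines of Python. This is a toy under stated assumptions: it works on characters rather than bytes, ignores word-boundary markers, and breaks ties alphabetically, none of which is claimed to match any production tokeniser. The function names are illustrative.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    # Break ties deterministically: highest count first, then alphabetical order.
    return min(pairs, key=lambda p: (-pairs[p], p)) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-separated corpus."""
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges


merges = learn_bpe("low lower lowest " * 5, num_merges=3)
print(merges)  # [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

On this corpus the first three merges reproduce the lo, low, lowe sequence from the walkthrough.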

Most modern LLMs use vocabularies of 30,000 to 100,000 tokens. GPT-4 uses around 100,000. Claude uses a similar order of magnitude. The exact vocabulary depends on the training data and the number of BPE merges performed.

A practical consequence: LLMs “see” text differently from humans. The sentence “I saw a dog” might be four tokens. The sentence “I saw a Labradoodle” might be five or six, because “Labradoodle” gets split into subwords. The model doesn’t see characters. It sees a sequence of integer IDs, each mapping to a token in its vocabulary. Token 1547 might be “the”. Token 28903 might be “ function” (with a leading space; spaces are part of tokens in most schemes). Token 85 might be a newline character.

This tokenisation step is entirely mechanical. It happens before the model sees anything. The model never operates on raw text, only on sequences of token IDs.

Embeddings: giving tokens meaning

A token ID is just a number. The model needs something richer: a representation that captures the meaning of each token and its relationship to other tokens.

This is where embeddings come in. Each token in the vocabulary is assigned a high-dimensional vector, a list of numbers, typically 4,096 to 12,288 of them in modern LLMs. These vectors are learned during training, not hand-crafted. At the start of training, they’re initialised randomly. By the end, tokens with similar meanings have vectors that point in similar directions in this high-dimensional space.

The classic example, from Mikolov et al.’s 2013 word2vec paper, is that the vector for “king” minus the vector for “man” plus the vector for “woman” gives a vector very close to “queen”. This isn’t a trick; it falls out naturally from training on large amounts of text, because the contexts in which these words appear encode their relationships.

In an LLM, the embedding layer is the first thing that happens. The input sequence of token IDs gets converted into a sequence of embedding vectors. If your input is 500 tokens and each token maps to a vector of 8,192 dimensions, you now have a 500 x 8,192 matrix of floating-point numbers. This matrix is what flows into the rest of the model.
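Mechanically, the embedding layer is a table lookup. The sketch below uses deliberately tiny, made-up sizes and random values (as at the start of training); the token IDs are hypothetical.

```python
import random

VOCAB_SIZE = 1_000   # assumption: toy size; real vocabularies are far larger
D_MODEL = 8          # assumption: toy size; real models use thousands of dimensions

rng = random.Random(0)
# One row per token ID. Random here, as at the start of training.
embedding_table = [
    [rng.gauss(0.0, 0.02) for _ in range(D_MODEL)] for _ in range(VOCAB_SIZE)
]


def embed(token_ids):
    """Look up one vector per token ID: returns a len(token_ids) x D_MODEL matrix."""
    return [embedding_table[t] for t in token_ids]


token_ids = [15, 472, 15, 903]   # hypothetical IDs; note that 15 appears twice
matrix = embed(token_ids)
print(len(matrix), "x", len(matrix[0]))   # 4 x 8
print(matrix[0] == matrix[2])             # True: same ID, same vector
```

The second print shows that a repeated token gets an identical vector wherever it appears, because the lookup knows nothing about position.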

But there’s a problem: the embedding for a token is the same regardless of where it appears in the sequence. The word “bank” has one embedding, whether it means a river bank, a financial bank, or a shot in snooker. The model needs to know not just what each token is, but where it sits in the sequence.

Positional encoding solves this. The original transformer paper (Vaswani et al., 2017) used sinusoidal functions to generate position-dependent vectors that are added to the token embeddings. More recent models use Rotary Position Embeddings (RoPE, Su et al., 2021), which encode relative positions by rotating the embedding vectors. The details vary, but the purpose is the same: after positional encoding, the model can distinguish between “The dog bit the man” and “The man bit the dog”.
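As an illustration of the sinusoidal scheme from the original paper (not RoPE, and not necessarily what any current model uses), here is a small sketch. The helper names and the toy embedding are made up for the example.

```python
import math


def positional_encoding(position, d_model):
    """Sinusoidal encoding for one position, following Vaswani et al. (2017)."""
    vec = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        vec.append(math.sin(angle))   # even dimension
        vec.append(math.cos(angle))   # odd dimension
    return vec[:d_model]


def add_position(embeddings):
    """Add the position vector to each token embedding, element by element."""
    d_model = len(embeddings[0])
    return [
        [e + p for e, p in zip(vec, positional_encoding(pos, d_model))]
        for pos, vec in enumerate(embeddings)
    ]


# The same toy embedding at positions 0 and 1 ends up as two different vectors.
same_token = [0.1, 0.2, 0.3, 0.4]
encoded = add_position([same_token, same_token])
print(encoded[0] != encoded[1])   # True
```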

The transformer: the architecture underneath

Every major LLM (GPT, Claude, Llama, Gemini) is built on the transformer architecture, introduced in a 2017 paper by researchers at Google with the quietly confident title “Attention Is All You Need”. Before transformers, language models used recurrent neural networks (RNNs) that processed text one word at a time, left to right, like reading a sentence with a finger. This worked, but it was slow and struggled with long-range dependencies; by the time the model reached the end of a paragraph, it had largely forgotten the beginning.

Transformers threw that away. Instead of processing text sequentially, a transformer looks at the entire input at once and figures out which parts relate to which other parts. It’s the difference between reading a sentence word by word and seeing the whole sentence on a page. This parallelism made transformers dramatically faster to train, and the ability to attend to any part of the input regardless of distance made them dramatically better at capturing meaning.

The transformer is built from a stack of identical blocks, each containing two key components: an attention mechanism (which figures out what to pay attention to) and a feed-forward network (which processes the result). We’ll look at both, starting with attention, the mechanism that made the whole thing work.

Attention: the mechanism that changed everything

The core innovation is the attention mechanism. It’s what allows the model to relate different parts of the input to each other, regardless of distance.

Here’s the intuition. Consider the sentence: “The cat sat on the mat because it was tired.” What does “it” refer to? The cat, obviously. But how does the model figure that out? It needs to look back at every previous token and determine which ones are relevant to interpreting “it” in this context.

Attention lets the model do exactly this. For each token in the sequence, the model computes three things from its embedding:

A query vector: “What am I looking for?”
A key vector: “What do I contain?”
A value vector: “What information should I provide if I’m relevant?”

These are computed by multiplying the token’s embedding by three learned weight matrices, one each for queries, keys, and values; the resulting vectors for all tokens are stacked into the matrices Q, K, and V. Then, for each token, the model computes the dot product of its query with every token’s key. This produces a set of attention scores: numbers indicating how relevant each token is to the current one.

These scores are passed through a softmax function (which converts them into probabilities that sum to 1), and then used to compute a weighted average of the value vectors. The result is a new representation of the current token that incorporates information from every other token in the sequence, weighted by relevance.

In the “it was tired” example, the attention mechanism would assign a high score to the pairing of “it” (query) with “cat” (key), because the model has learned from training data that pronouns attend to their antecedents.

The mathematical formulation, from the original transformer paper, is:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

The sqrt(d_k) term is a scaling factor (d_k is the dimension of the key vectors) that prevents the dot products from becoming too large, which would push the softmax into regions where the gradients are tiny and learning stalls.
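The formula is compact enough to implement directly. The sketch below is plain Python on hand-picked toy vectors, with Q, K, and V given directly rather than computed from embeddings. It also omits the causal mask that generative models apply so that a token cannot attend to later tokens.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Dot product of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Hand-picked toy vectors for three tokens (not taken from a real model).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

outputs, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row is one token's attention distribution over all three tokens, so each row sums to 1.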

Multi-head attention: parallel perspectives

A single attention computation captures one kind of relationship between tokens. But language is rich. A single token might simultaneously need to attend to its syntactic subject, the verb it modifies, the topic of the paragraph, and the format of the document.

Multi-head attention runs multiple attention computations in parallel, each with its own Q, K, and V weight matrices. A model with 32 attention heads computes 32 different sets of attention patterns simultaneously. The results are concatenated and projected back to the model’s dimension through another learned weight matrix.

Different heads learn to capture different kinds of relationships. Research by Clark et al. (2019) and others has found that in trained models, some attention heads specialise in syntactic dependencies (subject-verb agreement), some in positional relationships (attending to the previous token), some in semantic relationships, and some in patterns that are difficult for humans to interpret.

Nobody tells the heads what to specialise in. The specialisation emerges from training. The model discovers that attending to different kinds of information in parallel produces better predictions.

The transformer block

An attention layer is part of a larger unit called a transformer block (or transformer layer). Each block consists of:

Multi-head self-attention: the attention mechanism described above
Layer normalisation: scaling the outputs to have zero mean and unit variance, which stabilises training
Feed-forward network: two linear transformations with a non-linear activation function (typically GELU or SwiGLU) in between
Residual connections: adding the input of each sub-layer to its output, so information can flow through the network without being forced through every transformation

The feed-forward network is where much of the model’s “knowledge” is believed to be stored. While attention handles the relationships between tokens, the feed-forward layers act as a kind of lookup table: a massive, compressed, approximate memory of facts and patterns learned during training. Research by Geva et al. (2021) characterised feed-forward layers as “key-value memories” where the first linear transformation acts as keys and the second acts as values.
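The non-attention parts of a block are simple enough to sketch. The example below makes several simplifying assumptions: it normalises before the feed-forward network (one common ordering; the text above does not specify one), leaves out the learned scale and bias that real layer normalisation carries, omits biases in the linear maps, and uses made-up weights.

```python
import math


def layer_norm(x, eps=1e-5):
    """Rescale a vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def gelu(v):
    """GELU activation (exact form, using the Gaussian error function)."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def feed_forward(x, w_in, w_out):
    """Two linear maps with a non-linearity in between (biases omitted)."""
    # w_in holds one weight list per hidden unit; w_out one per output dimension.
    hidden = [gelu(sum(xi * w for xi, w in zip(x, unit))) for unit in w_in]
    return [sum(h * w for h, w in zip(hidden, unit)) for unit in w_out]


def ffn_sublayer(x, w_in, w_out):
    """Residual connection around normalise -> feed-forward."""
    update = feed_forward(layer_norm(x), w_in, w_out)
    return [xi + ui for xi, ui in zip(x, update)]


# Made-up weights: 2 model dimensions, 3 hidden units.
x = [1.0, 2.0]
w_in = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_out = [[0.1, 0.0, 0.2], [0.0, 0.1, 0.2]]
print(ffn_sublayer(x, w_in, w_out))
```

Because of the residual connection, the output is the input plus an update; if the feed-forward weights contribute nothing, the input passes through unchanged.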

A modern LLM stacks many transformer blocks on top of each other. GPT-4 is believed to have around 120 layers. Claude’s architecture isn’t public, but models of this class typically have 80 to 120 layers. The input embeddings flow through every block, being progressively refined. Early layers tend to capture surface-level patterns (syntax, local word relationships). Middle layers capture more abstract features (semantic roles, entity relationships). Late layers produce the representations that directly inform the prediction of the next token.

Context windows: how much the model can see

The context window is the maximum number of tokens the model can process in a single forward pass. It’s a hard limit: the model literally cannot see tokens outside this window.

Early transformer models had modest context windows: GPT-2 (2019) had 1,024 tokens, roughly 750 words. GPT-3 (2020) had 2,048 tokens. As of 2025, context windows have expanded dramatically. Claude’s context window is 1,000,000 tokens, roughly 750,000 words, or about ten novels.

The expansion is non-trivial because the standard attention mechanism has a computational cost that scales quadratically with sequence length. If you double the context window, the attention computation costs four times as much. For a 200,000-token context window with naive attention, the cost would be staggering.
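The quadratic growth is easy to check with arithmetic. This sketch only counts query-key pairs for naive attention; it says nothing about actual runtime or memory, which depend on hardware and implementation.

```python
def attention_score_count(n_tokens):
    """Number of query-key pairs in full (naive) self-attention."""
    return n_tokens * n_tokens


for n in (1_024, 2_048, 200_000):
    print(f"{n:>9,} tokens -> {attention_score_count(n):>18,} scores per head per layer")
```

Going from 1,024 to 2,048 tokens quadruples the count from about 1 million to about 4 million, and 200,000 tokens gives 40 billion scores for every head in every layer.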

Modern models address this through various efficiency techniques. FlashAttention (Dao et al., 2022) restructures the attention computation to be more cache-efficient without changing the mathematical result. Grouped-query attention (GQA) shares key and value projections across multiple query heads, reducing memory requirements. Some models use sparse attention patterns that allow each token to attend to only a subset of other tokens.

The context window matters because everything the model “knows” about your specific conversation comes from the context window. The model has no persistent memory between conversations. If you had a conversation yesterday, the model doesn’t remember it. If you mentioned your name 50,000 tokens ago, the model can (in principle) still attend to that information, but the practical effectiveness of attention over very long ranges depends on the model and the training.

Generating text: one token at a time

Here’s where things get concrete. When you send a prompt to an LLM, the model processes the entire input through all its layers and produces, at the final layer, a probability distribution over the entire vocabulary for the next token.

Not the next sentence. Not the next word. The next token.

The model might assign a 15% probability to “the”, 8% to “a”, 4% to “\n”, 3% to “this”, and so on across all 100,000 tokens in its vocabulary. These probabilities sum to 1.

Then the model selects one token from this distribution, appends it to the sequence, and runs the whole process again to predict the token after that. This is called autoregressive generation: each output becomes part of the input for the next prediction.

A 500-token response requires 500 forward passes through the entire model. This is why generation is slower than processing the input. Each new token requires a full pass through all layers (though in practice, the computation is optimised using a KV cache that stores the key and value vectors from previous tokens so they don’t need to be recomputed).
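The generation loop itself is short. In the sketch below, `TOY_MODEL` is a hypothetical stand-in for a real model: a lookup table whose probabilities depend only on the last token, whereas a real LLM conditions on the whole context. Selection here is greedy (always the most likely token), and there is no KV cache.

```python
# Hypothetical stand-in for a real model: next-token probabilities that
# depend only on the last token. A real LLM conditions on the whole context.
TOY_MODEL = {
    "<start>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.7, "mat": 0.3},
    "a": {"cat": 0.5, "mat": 0.5},
    "cat": {"sat": 0.9, "<end>": 0.1},
    "sat": {"<end>": 1.0},
    "mat": {"<end>": 1.0},
}


def next_token_distribution(tokens):
    """One 'forward pass': return a probability for every candidate next token."""
    return TOY_MODEL[tokens[-1]]


def generate(prompt, max_new_tokens=10):
    """Greedy autoregressive loop: predict, append, repeat."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(tokens)
        token = max(dist, key=dist.get)      # greedy: take the most likely token
        if token == "<end>":
            break
        tokens.append(token)                 # the output becomes part of the input
    return tokens


print(generate(["<start>"]))   # ['<start>', 'the', 'cat', 'sat']
```

Every pass through the loop is one call to the model, which is why a 500-token response costs 500 forward passes.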

Temperature and top-p: controlling randomness

How does the model choose which token to select from the probability distribution? This is where temperature and top-p (nucleus sampling) come in.

Temperature scales the logits (the raw, pre-softmax scores) before converting them to probabilities. A temperature of 1.0 uses the distribution as-is. A temperature below 1.0 (say, 0.3) makes the distribution “sharper”: the most likely tokens become even more likely, and unlikely tokens become even less likely. A temperature of 0 is deterministic: always pick the highest-probability token. A temperature above 1.0 “flattens” the distribution, making unlikely tokens more likely to be selected.

Low temperature produces more predictable, focused text. High temperature produces more varied, creative (and sometimes nonsensical) text.
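Temperature is a one-line change to the softmax. In this sketch the logits are made-up scores for three tokens. Temperature 0 is rejected rather than divided by, since in practice it is treated as a special case: always pick the top token.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply softmax."""
    if temperature <= 0:
        raise ValueError("temperature 0 is handled as argmax, not by division")
    scaled = [l / temperature for l in logits]
    m = max(scaled)                         # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]   # hypothetical raw scores for three tokens
for t in (0.3, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

At 0.3 nearly all the probability lands on the first token; at 2.0 the three tokens are much closer together.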

Top-p (nucleus sampling, introduced by Holtzman et al., 2020) takes a different approach: instead of scaling all probabilities, it considers only the smallest set of tokens whose cumulative probability exceeds a threshold p. If p = 0.9, the model considers only the top tokens that together account for 90% of the probability mass, and samples from among those. Everything else is excluded.

Top-p is adaptive. When the model is confident (one token dominates the distribution), the nucleus is small. When the model is uncertain (many tokens are roughly equally likely), the nucleus is large. This tends to produce better results than temperature alone, because it naturally adjusts the diversity of outputs to the model’s confidence.
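The adaptive behaviour shows up clearly in a small sketch. The two distributions below are invented for illustration, and the helper names are not from any particular API.

```python
import random


def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    # Renormalise so the kept probabilities sum to 1 again.
    return {token: prob / cumulative for token, prob in nucleus}


def sample(probs, rng):
    """Draw one token according to its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


confident = {"Canberra": 0.92, "Sydney": 0.05, "Melbourne": 0.03}
uncertain = {"red": 0.3, "blue": 0.3, "green": 0.25, "pink": 0.15}

print(list(top_p_filter(confident, 0.9)))   # ['Canberra']
print(list(top_p_filter(uncertain, 0.9)))   # ['red', 'blue', 'green', 'pink']
print(sample(top_p_filter(uncertain, 0.9), random.Random(0)))
```

With the same p = 0.9, the confident distribution collapses to a single candidate while the uncertain one keeps all four.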

In practice, APIs expose both parameters, and they interact. Most production uses keep temperature relatively low (0.0 to 0.7) for factual tasks and higher (0.7 to 1.0) for creative tasks.

The training pipeline

How does a model learn to predict the next token? The training process has three major phases, each building on the last.

Phase 1: Pretraining

Pretraining is where the model learns language. The training data is a massive corpus of text: web pages, books, code repositories, academic papers, forums, documentation. For frontier models (the industry’s term for the most capable models from the leading labs, such as Claude, GPT-4, and Gemini), this corpus is measured in trillions of tokens. The exact composition is typically proprietary, but it includes a broad cross-section of human-written text.

The training objective is straightforward: given a sequence of tokens, predict the next one. The model processes the training data in batches, makes predictions, computes how wrong it was (using cross-entropy loss, which measures the difference between the predicted probability distribution and the actual next token), and adjusts its weights to be slightly less wrong next time.

This adjustment happens through backpropagation and gradient descent, the same optimisation procedure used in virtually all deep learning. The loss function tells you how wrong the model was. Backpropagation computes how each weight in the model contributed to that error. Gradient descent adjusts each weight by a small amount in the direction that reduces the error. Repeat this billions of times, across trillions of tokens, and the weights gradually converge on values that produce good predictions.
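The loss-and-update cycle can be shown on a problem small enough to do by hand. In this toy, the only "weights" are three logits, so the gradient has a closed form and no backpropagation through layers is needed; in a real model, backpropagation carries the same gradient back through every layer. The learning rate and step count are arbitrary choices.

```python
import math


def softmax(logits):
    """Convert logits into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Loss is the negative log of the probability given to the correct token."""
    return -math.log(softmax(logits)[target])


def gradient(logits, target):
    """For softmax + cross-entropy: p_i - 1 for the target token, p_i otherwise."""
    probs = softmax(logits)
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]


# Toy setup: three candidate tokens, and token 2 is the correct next token.
logits = [0.0, 0.0, 0.0]
target = 2
learning_rate = 0.5

losses = []
for step in range(5):
    losses.append(cross_entropy(logits, target))
    grads = gradient(logits, target)
    # Gradient descent: move each weight a small step against its gradient.
    logits = [l - learning_rate * g for l, g in zip(logits, grads)]

print([round(loss, 3) for loss in losses])
```

The printed losses shrink at every step as probability mass shifts toward the correct token.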

Modern pretraining uses the Adam optimiser (Kingma and Ba, 2015) or variants of it, with learning rate schedules that warm up the learning rate gradually and then decay it. The training runs on thousands of GPUs (or TPUs) for weeks or months. The compute cost for frontier models is measured in tens of millions of dollars.

The remarkable thing about pretraining is how much emerges from such a simple objective. The model isn’t told about grammar, logic, programming languages, history, or mathematics. It just learns to predict the next token. But to predict well across such a diverse corpus, it must implicitly capture an enormous amount about the structure of language and the world it describes.

Phase 2: Fine-tuning (supervised)

A pretrained model is good at predicting text, but it’s not yet useful as an assistant. If you prompt it with “What is the capital of Australia?”, a purely pretrained model might continue with “The answer is Canberra”, but it might also continue with “This question appears on the geography quiz for Year 7 students” or “A. Canberra B. Sydney C. Melbourne D. Brisbane”. It’s predicting what text is likely to follow, and there are many plausible continuations.

Supervised fine-tuning (SFT) narrows the model’s behaviour by training it on examples of the desired interaction pattern. Human annotators write thousands of example prompt-response pairs demonstrating the kind of helpful, accurate, structured responses the model should produce. The model is fine-tuned on these examples using the same next-token prediction objective, but with a much smaller, curated dataset.

SFT teaches the model the format of being an assistant: that it should answer questions directly, structure its responses clearly, acknowledge uncertainty, and follow instructions.

Phase 3: RLHF (Reinforcement Learning from Human Feedback)

SFT gets the model most of the way there, but human preferences are subtle. Is it better to give a concise answer or a thorough one? How should the model handle ambiguous instructions? When should it refuse a request?

Reinforcement Learning from Human Feedback (RLHF, described by Ouyang et al., 2022 for the InstructGPT work) addresses this by training the model to optimise for human preferences.

The process has two steps:

Train a reward model. Generate multiple responses to the same prompt. Human annotators rank them from best to worst. Train a separate neural network (the reward model) to predict which response a human would prefer. This reward model learns to score outputs on quality, helpfulness, safety, and adherence to instructions.
Optimise the language model against the reward model. Using a reinforcement learning algorithm (typically PPO, Proximal Policy Optimisation, Schulman et al., 2017), adjust the language model’s weights to produce outputs that the reward model scores highly. A more recent alternative, DPO (Direct Preference Optimisation), skips the separate reward model and the reinforcement learning loop and trains directly on the ranked pairs. The key constraint is that the model shouldn’t deviate too far from the fine-tuned model. You don’t want optimising for the reward model to destroy the model’s general capabilities.
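For the first step, a common way to turn human rankings into a training signal is a pairwise loss: the reward model is penalised when it scores the rejected response above the one the human preferred. The sketch below shows only that loss on made-up scores; it is an illustration of the idea, not a description of any particular lab's pipeline.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Pairwise ranking loss: -log(sigmoid(r_chosen - r_rejected))."""
    diff = reward_chosen - reward_rejected
    return math.log(1.0 + math.exp(-diff))


# Hypothetical reward-model scores for two responses to the same prompt.
print(round(preference_loss(2.0, -1.0), 4))   # reward model agrees with the human ranking
print(round(preference_loss(-1.0, 2.0), 4))   # reward model disagrees: much larger loss
```

Minimising this loss across many ranked pairs pushes the reward model's scores into the same order the annotators chose.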

RLHF is what makes the difference between a model that can predict text and a model that is genuinely useful to interact with. It’s also what makes models more cautious, more structured in their responses, and more inclined to refuse harmful requests.

Some newer approaches, including Constitutional AI (Bai et al., 2022), use AI feedback in addition to (or instead of) human feedback in parts of the process, but the core idea remains: optimise the model’s outputs to align with human preferences.

What “predicting the next token” actually means

There’s a common dismissal of LLMs: “It’s just predicting the next token.” This is technically accurate and deeply misleading.

Consider what it takes to predict the next token well. If the context is a legal contract, the model must “know” contract structure, legal terminology, and the conventions of contract drafting. If the context is Python code, it must track variable scopes, function signatures, indentation, and the semantics of the language. If the context is a conversation about quantum physics, it must produce text that’s consistent with quantum mechanics.

The model doesn’t “know” these things in the way a human expert does. It has no experiences, no intuitions, no understanding of why quantum mechanics is the way it is. But it has captured statistical patterns in text that are rich enough to produce outputs that look like they come from someone who does understand.

This is genuinely remarkable, and it’s also the source of the most important failure modes. The model is optimising for “what would plausible-sounding text look like here?”, not for “what is true?” These are usually the same thing, because plausible text about well-covered topics tends to be accurate. But they diverge in exactly the cases where accuracy matters most: obscure facts, recent events, precise numerical claims, and reasoning chains that require strict logical validity.

Why they hallucinate

Hallucination, the generation of confident, fluent, entirely fabricated information, is not a bug that can be fixed with more training data. It’s a structural consequence of how LLMs work.

The model generates text by choosing high-probability tokens one at a time. It has no mechanism for checking whether its output is factually correct. It has no database of facts it can look up. It has no way to distinguish between “this is a pattern I learned from reliable sources” and “this is a plausible-sounding continuation that happens to be wrong.”

When the model encounters a question about an obscure topic, it faces a choice: produce fluent text that matches the expected pattern (which might be wrong), or signal uncertainty (which requires overriding the strong pattern of producing confident text that it learned during training). The training process, especially RLHF, has pushed models toward expressing uncertainty more often, but the fundamental tension remains.

Hallucination is especially likely when:

The question asks for specific details (dates, numbers, names) about topics that appear infrequently in the training data
The model is asked to cite sources (it has learned the pattern of citations but doesn’t have access to a citation database)
The question requires reasoning that extends beyond the patterns in the training data
The prompt is ambiguous and the model guesses at intent rather than asking for clarification

Retrieval-augmented generation (RAG), where the model is given relevant documents to reference, helps significantly, because it replaces “generate from patterns” with “summarise from provided text.” But the underlying architecture hasn’t changed. The model is still predicting tokens, not verifying facts.

Why they’re good at code

LLMs are disproportionately good at writing code, and the reasons are illuminating.

First, code is heavily represented in training data. GitHub alone hosts billions of files of source code, much of it publicly available. Stack Overflow has millions of answered questions with code examples. Documentation, tutorials, blog posts, textbooks: the volume of well-structured code in the training corpus is enormous.

Second, code is less ambiguous than natural language. A function either compiles or it doesn’t. A variable is either in scope or it isn’t. The syntax rules are strict and well-defined. This makes code easier for a statistical model to learn, because the patterns are more consistent. In natural language, “bank” can mean ten different things. In Python, def always means the same thing.

Third, code is highly repetitive. Most code follows standard patterns: import libraries, define functions, handle errors, return results. Design patterns recur across millions of repositories. The model doesn’t need to invent novel algorithms (though it sometimes can); it needs to recognise which pattern applies and instantiate it correctly for the current context.

Fourth, code comes with its own error-checking mechanism. When you run LLM-generated code and it fails, the error message is itself a prompt you can feed back to the model. This feedback loop (generate, run, fix, repeat) is enormously productive, because the model is good at understanding error messages and making targeted corrections.
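
The generate, run, fix loop can be sketched in a few lines. This is an illustration under assumptions: `generate` is a hypothetical callable standing in for a model call, and the retry prompt wording is made up. Running untrusted code with `exec` is unsafe outside a sandbox, so treat this as a shape rather than something to deploy.

```python
import traceback
from typing import Callable


def generate_run_fix(generate: Callable[[str], str], task: str, max_attempts: int = 3) -> str:
    """Ask for code, run it, and feed any error back until it runs cleanly."""
    prompt = task
    for _ in range(max_attempts):
        code = generate(prompt)  # hypothetical model call
        try:
            exec(code, {})  # fresh namespace; sandbox this in real use
            return code
        except Exception:
            error = traceback.format_exc()
            # The error message becomes part of the next prompt.
            prompt = (
                f"{task}\n\nPrevious attempt:\n{code}\n\n"
                f"It failed with:\n{error}\nFix it."
            )
    raise RuntimeError(f"no working code after {max_attempts} attempts")
```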

This is part of the shift described in The Value Is in Ideas, Not Code: when code generation becomes cheap, the bottleneck moves to knowing what to ask for. The teams that get the most from LLMs aren’t the ones with the best prompts; they’re the ones with the clearest understanding of their domain, the best-structured knowledge (decision records, test suites, observability), and the discipline to review what the model produces rather than trusting it blindly.

The gap between capability and understanding

Here’s the thing that I think is most important to understand about LLMs, and it’s the thing that most commentary gets wrong.

LLMs are not “stochastic parrots” that merely recombine memorised text. Nor are they conscious beings that understand what they’re saying. They’re something new, something we don’t have a great word for yet.

They can follow complex instructions. They can write functional code for problems that don’t appear in their training data. They can reason through multi-step problems (imperfectly, but measurably). They can transfer knowledge between domains in ways that look a lot like understanding. They can generate creative solutions that surprise even their creators.

But they can also fail at basic arithmetic, get confused by negation, confidently assert falsehoods, struggle with spatial reasoning, and produce outputs that are syntactically perfect but semantically absurd. These failures are not random; they reflect the boundaries of what can be learned from the statistical patterns of text.

A useful analogy: an LLM is like someone who has read everything ever written but has never been outside. They can describe a sunset beautifully because they’ve read thousands of descriptions. They can explain the physics of light scattering. They can write a character who watches a sunset and feels moved. But they’ve never actually seen one. Their knowledge is real, and it produces genuinely useful outputs, but it’s mediated entirely through text.

This gap matters practically. LLMs are extraordinary tools for generation, summarisation, translation, code writing, brainstorming, and pattern matching. They are poor tools for factual verification, mathematical proof, real-time information, and any task where correctness must be guaranteed rather than probable.

The transformer architecture at a glance

Here’s a summary of how the pieces fit together, from input to output.

Stage	What happens	Output
Tokenisation	Raw text is split into tokens using BPE	Sequence of token IDs
Embedding	Token IDs are mapped to high-dimensional vectors	Matrix of embedding vectors
Positional encoding	Position information is added to embeddings	Position-aware embeddings
Transformer blocks (x80-120)	Multi-head attention + feed-forward, repeated	Refined representations at each layer
Output projection	Final layer representations projected to vocabulary size	Logits (scores) for every token in vocabulary
Softmax + sampling	Logits converted to probabilities, one token selected	The next token

Then the selected token is appended to the sequence and the process repeats for the new token: it is embedded, given its position, and run through the transformer blocks (with the KV cache avoiding redundant computation for earlier tokens).
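
The last two rows of the table and the repeat step can be sketched like this. `next_logits` is a hypothetical stand-in for everything from tokenisation to output projection. It takes the token IDs so far and returns one score per vocabulary entry. There is no KV cache here; it is only the shape of the loop.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_token(logits, rng=random):
    """Pick one token ID at random, weighted by its probability."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate(next_logits, prompt_ids, max_new_tokens, rng=random):
    """Append one sampled token at a time, feeding the longer sequence back in."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        ids.append(sample_next_token(next_logits(ids), rng=rng))
    return ids
```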

Scale and emergent capabilities

One of the most striking findings in LLM research is that capabilities emerge at scale. Smaller models can complete simple text. Larger models can follow instructions. Even larger models can perform multi-step reasoning, write complex code, and engage with nuanced arguments.

These emergent abilities (capabilities that appear suddenly as models scale, rather than improving gradually) were characterised by Wei et al. (2022). A model with 1 billion parameters might be unable to do basic arithmetic. A model with 10 billion might do simple addition. A model with 100 billion might do multi-digit multiplication. The capability doesn’t improve linearly with scale; it appears relatively abruptly.

Whether “emergence” is a phase transition or an artefact of how we measure performance is debated (Schaeffer et al., 2023 argue it’s partly the latter), but the practical observation is clear: larger models are not just slightly better; they’re qualitatively different in what they can do.

The scaling laws described by Kaplan et al. (2020) and refined by Hoffmann et al. (2022) (the “Chinchilla” paper) established that model performance follows predictable power laws as a function of model size, dataset size, and compute. The Chinchilla paper’s key finding was that many models were trained on too little data relative to their size: a 70-billion-parameter model should be trained on roughly 1.4 trillion tokens, far more than was standard at the time.
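
The figures quoted above imply a simple ratio, worked through below. Applying that same ratio to other model sizes is a rule-of-thumb assumption on my part, not a restatement of the paper's fitted scaling law.

```python
def tokens_per_parameter(n_params: float, n_tokens: float) -> float:
    """How many training tokens the model sees per learned weight."""
    return n_tokens / n_params


# The figures quoted in the text: 70 billion parameters, 1.4 trillion tokens.
ratio = tokens_per_parameter(70e9, 1.4e12)  # 20.0


def rough_token_budget(n_params: float, ratio: float = 20.0) -> float:
    """Rule-of-thumb training set size if the same ratio is assumed to hold."""
    return n_params * ratio
```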

The parameter count

When people talk about a “70B model” or a “400B model”, the B stands for billions of parameters: the learned weights in the model. These are the numbers that get adjusted during training. Every attention weight, every feed-forward weight, every embedding vector is a parameter.

A 70-billion-parameter model stored in 16-bit floating point requires roughly 140 GB of memory just for the weights. And that’s before accounting for the memory needed when the model actually runs, what the industry calls inference: running a trained model to produce output (feeding in a prompt and generating a response), as opposed to training it. During inference, the model needs additional memory for the KV cache (a store of previously computed attention keys and values so it doesn’t have to recompute them for every new token), activations, and overhead. This is why running large models requires multiple GPUs.
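
The arithmetic behind those numbers is short enough to write down. The weight calculation follows directly from the text. The KV cache formula is the standard one for a transformer that stores one key and one value vector per layer, per head, per token, but the model shape in the usage example (80 layers, 8 key/value heads, 128 dimensions per head) is an illustrative assumption, not a figure for any particular model.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone; 2 bytes per parameter is 16-bit floating point."""
    return n_params * bytes_per_param / 1e9


def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """Memory for cached keys and values for one sequence.

    The leading 2 is one key vector plus one value vector
    per layer, per head, per token.
    """
    values = 2 * n_layers * n_kv_heads * head_dim * context_tokens
    return values * bytes_per_value / 1e9


weights = weight_memory_gb(70e9)  # 140.0 GB, matching the figure above

# Hypothetical model shape, one sequence of 4,096 tokens.
cache = kv_cache_gb(n_layers=80, n_kv_heads=8, head_dim=128, context_tokens=4096)
```

The cache figure is per sequence, so a server handling many conversations at once multiplies it by the number of concurrent requests.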

The cost of inference is substantial. Running a frontier model requires a cluster of high-end GPUs, typically NVIDIA A100s or H100s. A single H100 costs around US$30,000 and has 80 GB of memory, so a 70B model at 16-bit precision needs at least two just to hold the weights, and production deployments commonly use a full eight-GPU server to leave room for the KV cache and concurrent requests (more for larger models). A cluster capable of serving a model like Claude or GPT-4 to millions of users costs tens of millions of dollars in hardware alone, before electricity, cooling, networking, and the engineering team to keep it running.

This cost is what drives the per-token pricing you see from API providers. When Anthropic charges a fraction of a cent per token, that price reflects the amortised cost of the GPU cluster, the electricity to run it (a single H100 draws around 700 watts), the memory bandwidth consumed by the KV cache, and the engineering overhead. Input tokens are cheaper than output tokens because reading the prompt involves a single forward pass, while generating a response requires a separate forward pass for every token produced, each one computing attention across the full context. A long conversation with a frontier model might generate 2,000 output tokens. At each step, the model is attending to every previous token, which is why the cost scales with both the length of the input and the length of the output.
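
A rough way to see why cost grows with both input and output length is to count how many earlier tokens get attended to over a whole response. This is a counting sketch only. It ignores everything else that contributes to cost, and the function name is my own.

```python
def attended_positions(input_tokens: int, output_tokens: int) -> int:
    """Total earlier tokens attended to while generating a response.

    When output token i (counting from 0) is generated, the context
    already holds the whole input plus the i tokens generated so far.
    """
    return sum(input_tokens + i for i in range(output_tokens))


# A 1,000-token prompt followed by a 2,000-token response.
short_reply = attended_positions(1000, 2000)  # 3,999,000

# Doubling the response length more than doubles the attention work.
long_reply = attended_positions(1000, 4000)
```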

For perspective: generating a 2,000-word response from a frontier model via API is typically priced at between AU$0.05 and AU$0.50, depending on the model and the length of the input context. Note the word “priced”: what you pay and what it costs to serve are different things. The API price includes the provider’s margin, their amortised R&D costs (training a frontier model can cost US$100 million or more), and the overhead of running the platform. The actual compute cost of your individual request is a fraction of the price, but the infrastructure to serve millions of concurrent requests at low latency is what makes the price what it is. Providers are competing aggressively on pricing, and costs are falling, but the underlying economics remain a story about GPU memory, electricity, and how many tokens you can push through a chip per second.

The parameters are where the model’s “knowledge” lives, encoded in the relationships between weights. A specific fact isn’t stored in a specific parameter; it’s distributed across millions of parameters in a way that makes it accessible when the correct pattern of activation occurs. This distributed representation is what makes it possible to store so much information in a relatively compact set of numbers, and it’s also what makes hallucination so difficult to prevent: you can’t just look up “is this fact correct?” in the model’s weights.

Chain of thought and reasoning

A pure next-token predictor struggles with multi-step reasoning because each token is generated based on the full context but without any explicit “thinking” step. In 2022, Wei et al. showed that chain-of-thought prompting, which asks the model to write out its intermediate reasoning before giving a final answer (to “think step by step”), dramatically improves performance on reasoning tasks.

This works because it gives the model more tokens in which to work through intermediate steps. Instead of jumping from question to answer in one step, the model generates its reasoning as text, and that text becomes part of the context for subsequent tokens. The model is using its own output as a scratchpad.
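
Here is a minimal sketch of the scratchpad idea. `complete` is a hypothetical callable that takes a prompt and returns the model's continuation, and the prompt wording is illustrative. The point is only that the reasoning text is fed back in as context before the final answer is requested.

```python
from typing import Callable


def direct_prompt(question: str) -> str:
    """Ask for the answer straight away."""
    return f"Q: {question}\nA:"


def chain_of_thought_prompt(question: str) -> str:
    """Nudge the model to write its reasoning first."""
    return f"Q: {question}\nA: Let's think step by step."


def answer_with_scratchpad(complete: Callable[[str], str], question: str) -> tuple[str, str]:
    """Generate reasoning, then ask for the answer with that reasoning in context."""
    prompt = chain_of_thought_prompt(question)
    reasoning = complete(prompt)  # the model's own output...
    final = complete(prompt + reasoning + "\nTherefore, the answer is")  # ...becomes context
    return reasoning, final.strip()
```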

This is less magical than it sounds. The model isn’t “thinking” in the way a human does. It’s producing text that follows the pattern of step-by-step reasoning, and each step constrains the next step in useful ways. But the practical effect is substantial: chain-of-thought prompting can improve accuracy on mathematical and logical reasoning tasks by 20-40 percentage points.

More recent models have this behaviour built into their training. Claude, for instance, often works through problems step by step without being asked, because this pattern was reinforced during RLHF.

What about the future?

LLMs are improving fast. Context windows are expanding. Training data curation is becoming more sophisticated. New architectures (mixture-of-experts models, which activate only a subset of parameters for each token) are making larger models more efficient. Multimodal models that process text, images, and audio are becoming standard.

But the fundamental architecture, transformers predicting the next token, has been remarkably stable since 2017. The improvements have come from scale, data quality, training techniques, and engineering, not from a radical rethinking of the approach.

Whether this architecture has a ceiling (whether “predict the next token” can scale all the way to artificial general intelligence, or whether something fundamentally different is needed) is the most important open question in AI research. The optimists point to the steady improvement of scaling laws and the continued emergence of new capabilities. The sceptics point to the persistent failure modes (hallucination, poor arithmetic, brittleness to adversarial inputs) as evidence that statistical pattern matching has structural limits.

Both sides might be right. LLMs might continue to improve dramatically while retaining certain categories of failure. They might become better at everything we need them for while still not “understanding” anything in the way humans do.

For practical purposes, the answer to “how do LLMs work?” is: they read text as tokens, embed those tokens in high-dimensional space, use attention to relate tokens to each other across a hundred or so layers, and predict the next token from the resulting representation. The training process teaches them patterns that span syntax, semantics, logic, and world knowledge. The result is a system that can generate remarkably useful text while having no explicit model of truth, no persistent memory, and no understanding of why its outputs are correct when they are.

That’s not a criticism; it’s a description. And understanding the description makes you better at using the tool: knowing when to trust it, when to verify, and when to reach for something else entirely.

So does the room understand?

We opened this post with Searle’s Chinese Room: a person matching patterns without comprehension, producing outputs that look like understanding. Now you’ve seen the full machinery: tokens, embeddings, attention heads running in parallel, transformer blocks stacked a hundred layers deep, billions of parameters adjusted through gradient descent on trillions of tokens, reinforcement learning from human feedback, chain-of-thought reasoning, inference clusters burning megawatts of electricity. The room is vastly more complex than Searle imagined. But the question remains.

The honest answer is: we don’t know. And the reason we don’t know exposes a deeper problem. We can’t define what “understanding” means precisely enough to test for it.

When a child learns that fire is hot, is that understanding or pattern matching: touch fire, feel pain, don’t touch fire again? When a doctor diagnoses a rare disease from a cluster of symptoms, is that understanding or pattern matching against thousands of cases they’ve seen? When you catch a ball, are you solving differential equations or running a learned motor pattern? The boundary between “genuine understanding” and “very sophisticated pattern matching” is far blurrier than Searle’s thought experiment suggests.

The question people really want answered, “is AI actually intelligent?”, runs into the same wall. We don’t have a rigorous definition. Alan Turing sidestepped it in 1950 with his famous test: don’t ask whether the machine thinks, ask whether you can tell the difference. That’s pragmatic, not philosophical. The Turing Test tells you about your ability to detect the difference, not about what’s happening inside.

Howard Gardner proposed that intelligence isn’t one thing; it’s at least eight (linguistic, logical-mathematical, spatial, musical, bodily-kinaesthetic, interpersonal, intrapersonal, naturalistic). LLMs are superhuman by some of those measures and non-functional by others. A system that writes better prose than most humans but can’t tell you whether a ball fits in a box is intelligent by one definition and not by another.

The practical takeaway: stop asking “is it intelligent?” and start asking “is it useful for this specific task?” The Chinese Room might not understand Chinese, but if it answers your questions correctly, helps you write better code, and catches bugs you missed, does the philosophy matter? Searle would say yes. Your deploy pipeline doesn’t care.

What I find most interesting is that the debate reveals more about the limits of our definitions than about the limits of the technology. We built something that defies our existing categories. It’s not intelligent the way humans are, and it’s not unintelligent the way a calculator is. It’s something else, and we’ll probably need new words before we can talk about it clearly.

Event Storming: Building Shared Understanding

2026-03-24T06:00:00+08:00

After four weeks of building the wrong thing, the Greenbox team knows they need a different approach. Lee’s advice was clear: get the domain out of Maya’s head and into shared understanding before anyone writes another line of code.

The technique Lee recommends is Event Storming. It was created by Alberto Brandolini, and the premise is disarmingly simple: get everyone in a room, cover a wall in sticky notes, and map out how things actually work as a series of events.

No code. No architecture diagrams. No user stories yet. Just: what happens, in what order, and where are the hard parts?

It sounds almost too simple to be useful. That’s what Tom thinks when Maya suggests it. “We’re going to spend three hours sticking notes on a wall?” But the simplicity is the point. The sticky notes are a constraint that forces everyone to express ideas in small, concrete units. You can’t hide behind vague hand-waving when you have to write a specific event on a specific note.

What you need

Event Storming doesn’t require fancy tools or expensive facilitators. You need:

A long wall (or a very long roll of paper stuck to a wall)
Sticky notes in four colours (orange, blue, yellow, pink; Lee will explain what each means)
Markers, one per person, thick enough to read the notes from a distance
Everyone who matters in the room: developers, domain experts, product people, operations
Two to four hours of uninterrupted time

Setting up

Maya books the meeting room with the biggest wall. She grabs sticky notes from Officeworks: four packs, one of each colour. She invites the whole team: Tom, Priya, Jas, Sam. She also invites two of her farming contacts, Dave and Rachel, who she hopes will eventually supply Greenbox. They know the supply side in ways the team doesn’t.

Dave Morrison arrives ten minutes early. He’s been to “workshops” before. The last one was run by a government agricultural adviser and produced a glossy brochure that nobody ever opened. He’s here because Maya asked personally, and because she grew up on a farm, and because that counts for something. He shakes Lee’s hand and eyes the wall of blank paper with the expression of a man who has seen a lot of fences built in the wrong paddock.

Rachel, who runs a smaller mixed farm nearby, mentions her “dodgy broadband” when Lee hands her a marker. “Took me twenty minutes to load the map to get here,” she says. “Satellite internet. Works when it feels like it.” Nobody thinks much of it at the time.

Seven people. One wall. Three hours blocked out on a Tuesday morning.

Lee offered to facilitate, which helps enormously. The facilitator’s job isn’t to have domain knowledge; it’s to keep things moving, ask awkward questions, and make sure the quiet people get heard. You can run a session without a dedicated facilitator, but it’s harder. Someone inevitably gets sucked into the content and stops managing the process. If you can borrow someone who’s done it before, do.

Phase one: chaos

Lee starts by explaining the format. He holds up the four colours of sticky note and runs through them quickly:

Orange: things that happen, written in past tense. “Payment Submitted.” “Box Packed.” “Farm Confirmed Availability.” These are the backbone: the story of how the business works, told as a sequence of things that already happened. (Lee calls them “domain events,” but at this point nobody cares about the jargon. They’re just things that happen.)
Blue: decisions or actions that make those things happen. “Submit Payment.” “Pack Box.” Someone or something chose to do this. If orange is “what happened,” blue is “what triggered it.”
Yellow: who or what is involved. A customer clicking a button. A farmer calling with availability. A scheduled job that runs overnight. The people and systems in the story.
Pink: problems, questions, disagreements. Anything that makes someone say “wait, how does that work?” or “I thought it worked differently.” “These are the gold dust,” Lee says. “When you spot something that doesn’t make sense, or that two people disagree about, slap a pink note on it. Don’t try to resolve it now. Just mark it. We’ll get to pink notes later in the session; for now I just want you to know they exist.”

“We’ll start with just orange,” Lee says. “Only events. Write each one in past tense on an orange note. Don’t worry about order. Don’t worry about getting it right. Just get everything out of your heads and onto the wall. Keep the pink notes in your pocket for now; we’ll come back to them.”

He gives one important instruction: no talking during this phase. Just write and stick. Conversation comes later.

He sets a timer for twenty minutes and says go.

What follows is beautifully chaotic. Everyone grabs orange sticky notes and starts writing. Maya is writing rapidly: “Farm Listed Produce,” “Box Packed,” “Subscription Created,” “Weekly Menu Decided.” Tom writes “Payment Processed,” “Account Created,” “Subscription Cancelled.” Priya writes “Inventory Updated” and “Farm Onboarded.” Jas writes “Customer Signed Up” and “Box Previewed.” Sam writes “Delivery Scheduled” and “Customer Complained” (Sam always thinks about the operational realities).

Dave, one of the farmers, writes “Harvest Confirmed,” “Surplus Reported,” and “Growing Schedule Committed.” Rachel hesitates over her next note, then writes “Crop Failed” quickly and sticks it on the wall without looking at it. She’s thinking about the 2019 frost that wiped out Dave’s entire tomato crop. Dave sees it go up and his jaw tightens, but he says nothing. Rachel also writes “Delivery Window Missed” and “Price Renegotiated.” These are events the team hadn’t considered at all. Nobody on the Greenbox team had thought about what happens on the farm before produce arrives at the packing facility.

Priya notices Rachel writing “Crop Failed” and reaches for a pink note; she has questions. Lee catches her eye and taps his watch. “Good instinct. Hold that thought for the pink notes phase. Right now, just orange.” Priya nods and puts the pink note back, but she doesn’t forget the question.

Within twenty minutes, there are about sixty orange sticky notes scattered across the wall in no particular order. Some are duplicates. Some contradict each other. “Payment Processed” and “Payment Confirmed” might be the same event, or they might not. “Customer Signed Up” and “Account Created” look like duplicates. That’s fine. That’s the point. The goal of this phase is volume, not precision.

Phase two: the timeline

Lee gets everyone to step back and look at the wall. “Now let’s put these in order. Left to right, earliest to latest. Talk to each other. If you disagree about where something goes, that’s interesting; stick a pink note on it and we’ll come back to it.”

This is where the real conversations start.

Maya picks up “Farm Listed Produce” and puts it early on the timeline. Tom picks up “Customer Signed Up” and puts it at the start. Priya asks, “Which comes first? Do we need farms onboarded before customers can sign up, or can customers sign up before we have supply?”

Maya pauses. “Good question. We need to know we can fulfil before we take subscriptions. So farm onboarding is first.”

Tom didn’t know that. He’d been building the subscription system in isolation, assuming customers came first. One sticky note conversation, and an assumption is surfaced and resolved.

But Lee notices a pattern forming. Maya is the one placing notes with confidence. Everyone else is asking, deferring, moving on. The pink notes, the disagreements, aren’t appearing.

This is exactly what Event Storming is supposed to prevent. If one person places all the notes and nobody disagrees, you haven’t built shared understanding. You’ve just transferred one person’s mental model onto a wall. The whole point of getting everyone in the room is to surface the places where people see the domain differently. No pink notes doesn’t mean there are no disagreements. It means the disagreements are hidden, buried under politeness, deference, or the assumption that the founder must be right. Those hidden disagreements don’t go away. They become bugs, missed requirements, and late-night arguments in sprint three.

He tries a direct prompt. “Tom, challenge one of these. Is there a note that might be in the wrong place?”

Tom glances at the wall. “Looks right to me. Maya knows the farming side better than I do.”

Lee changes tactic. He walks over to Dave. “Walk the supply side of this timeline with Tom. Tell him what actually happens on a farm between committing produce and it arriving at the packing facility.”

Dave pulls “Farm Listed Produce” off the wall and holds it at arm’s length. “This makes it sound like I sit down on a Monday and know what I’ve got. I don’t. I can tell you what I’ll probably have. But the weather, the pests, the truck, anything changes it between now and Thursday.”

Tom stares at the note. “So the data model can’t treat supply as definite. It’s more like a forecast?”

“Now you’re talking,” Dave says. And now there are pink notes.

There’s a brief tangent about whether “Payment Submitted” and “Payment Confirmed” are the same event. Tom explains they’re not: one is the customer clicking “pay,” the other is Stripe confirming the charge went through. A payment can be submitted and then fail. Maya hadn’t thought about that. Priya makes a note that they’ll need to handle failed payments, another pink note for the wall.

The duplicates get merged. “Customer Signed Up” and “Account Created” collapse into a single event. “Growing Schedule Committed” gets moved to a parallel swim lane because it happens on a different timeline to the customer flow. The wall starts to take shape.

The team works through the timeline together. After thirty minutes of shuffling, arguing, and clarifying, a rough sequence emerges:

Eighteen events across four clusters. That’s the core of Greenbox, from farm onboarding to customer feedback. It took the group about an hour to get here, and already the room feels different. Everyone can see the same picture.

Notice the structure that’s emerged. The one-time events (farm onboarding, customer signup, payment) are the scaffolding. They happen once and create the conditions for everything else. The recurring events (weekly supply matching, packing, delivery) are the business. They repeat every week for as long as farms supply and customers stay subscribed. Tom is already thinking about how this affects the data model.
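
One way Tom might jot this down later, well after the session, is sketched below. It uses only event names that appear on the wall in this post. The `Cadence` split and the class names are hypothetical, and it reflects the wall as it stands at this point in the session, not a finished model.

```python
from dataclasses import dataclass
from enum import Enum


class Cadence(Enum):
    ONE_TIME = "one-time"  # scaffolding: happens once
    WEEKLY = "weekly"      # the business: repeats every week


@dataclass(frozen=True)
class DomainEvent:
    name: str  # past tense, as on the orange notes
    cadence: Cadence


# A handful of the orange notes, in timeline order.
WALL = [
    DomainEvent("Farm Onboarded", Cadence.ONE_TIME),
    DomainEvent("Customer Signed Up", Cadence.ONE_TIME),
    DomainEvent("Payment Submitted", Cadence.ONE_TIME),
    DomainEvent("Supply Matched to Demand", Cadence.WEEKLY),
    DomainEvent("Box Packed", Cadence.WEEKLY),
    DomainEvent("Delivery Scheduled", Cadence.WEEKLY),
]


def weekly_cycle(events: list[DomainEvent]) -> list[str]:
    """Names of the events that repeat every week, in timeline order."""
    return [e.name for e in events if e.cadence is Cadence.WEEKLY]
```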

Phase three: commands and actors

Lee hands out blue and yellow sticky notes. “For each event, let’s figure out what triggers it. Write the command on a blue note, and who or what performs the command on a yellow note.”

This phase goes faster because the timeline provides structure. But it surfaces new questions.

“Who decides substitutions?” Jas asks, placing a blue “Decide Substitution” note next to “Substitution Decided.”

“I do,” Maya says. “For now, anyway. Eventually maybe an algorithm, but right now it’s judgement. You need to know the produce; you can’t just swap beetroot for lettuce.”

Tom had assumed substitutions would be automatic. He was planning a simple algorithm: if item A is unavailable, pick the next cheapest item in the same category. Maya is telling him that’s not how it works at all. The substitution logic is a core part of the value proposition, and it requires domain expertise.

Pink sticky note goes on the wall: “Substitution policy: who decides, and how?”

Sam asks another question: “Who dispatches the boxes? Us, or a courier?”

Maya says, “We’ll use a local courier for now, but eventually I want our own drivers. The delivery experience matters.”

Sam writes a pink note: “Delivery logistics: own drivers vs courier, and when do we switch?”

The actor layer reveals something interesting about “Supply Aggregated.” Who does the aggregating? Right now it would be Maya, manually checking what each farm has submitted. With ten farms, that’s manageable. With fifty, it’s a full-time job. The yellow note says “Maya” but really it should say “System,” eventually. Another pink note: “When does supply aggregation need to be automated?”

Priya points to the bracket the team added during the ordering phase, the one marking where the weekly cycle starts. “Every actor from here onwards is doing something every week,” she says. “But the yellow notes don’t show that. Maya doesn’t aggregate supply once. She does it every Wednesday, for every box.” The actor layer makes the repeating workload visible in a way the event timeline alone didn’t, and with it, the bottlenecks. If one person’s name appears on five weekly events, that’s a scaling problem waiting to happen.

Phase four: hotspots

By now the wall is covered. Orange notes tracing what happens, in order. Blue notes beneath them showing what triggers each step. Yellow notes above showing who’s involved. And scattered across the whole thing, pink notes marking every question, disagreement, and “wait, how does that actually work?”

Lee gathers everyone around the hotspots. “These pink notes are the most valuable thing on the wall. Every one of them is a misunderstanding you caught before it became a bug, a wrong assumption, or a wasted sprint.”

The team counts twelve pink notes. The four biggest clusters:

Supply shortfalls. What happens when total farm supply doesn’t cover subscriber demand for the week? Rachel explains that this is completely normal in farming. “You think you’ll have twenty crates of zucchini, then the slugs get in.” Dave adds that some farms will over-promise because they don’t want to lose the contract. “We’ve all done it,” he says. “You say yes and hope the crop comes through. Sometimes it doesn’t.”

The team needs a process for handling shortfalls, and it needs to be baked into the weekly cycle, not treated as an exception. This is a design decision that affects everything: the commitment deadline for farms, the buffer stock policy, the customer communication if a box has fewer items than expected.

Substitution policy. This one sparks the longest argument of the session. Tom thought boxes had fixed contents: the same items every week, based on what the customer selected at signup. Jas thought customers picked individual items each week, like a supermarket order. Maya says neither is right. The box contents change weekly based on what’s available, and the curation is Greenbox’s differentiator. The customer doesn’t choose. They trust Greenbox to choose well.

Three people, three completely different mental models. If the team had kept building without this conversation, they’d have shipped three different products.

Delivery logistics. Sam raises the practical questions nobody else had thought about. What’s the delivery window? What happens if nobody’s home? Who handles complaints about damaged produce? Can customers change their delivery day? Is there a minimum order density per area to make delivery economical? None of these have answers yet, and every one of them affects the software.

Seasonal availability gaps. Rachel explains something the team hadn’t considered at all. In Western Australia, summer is abundant, but late winter is lean: fewer varieties, smaller yields, and some crops just don’t grow. What does Greenbox do during those weeks? Pause subscriptions? Source from further afield and compromise on the local promise? Offer a reduced box at a lower price? This is a business model question disguised as a supply chain problem.

The remaining hotspots are smaller but still important: how do farms get paid, what happens when a customer wants to skip a week, how is feedback collected and acted on, what are the deadlines for each step in the weekly cycle. None of them are show-stoppers individually, but together they represent the operational complexity that nobody had mapped before today.

The arguments are the point

About ninety minutes into the session, Tom and Maya have a proper disagreement. Tom is placing the “Supply Matched to Demand” event and says, “So the system automatically allocates produce to boxes based on the subscription sizes?”

Maya shakes her head. “No. I look at what’s come in from the farms, I think about what makes a good combination, and I decide what goes in each box size. It’s not just weight and price matching. A box needs to make sense as a meal plan for the week.”

Tom looks frustrated. He’s been building things for twelve years. He can hear the problem being described, and his instinct is to solve it with code. That’s what he does, that’s who he is. Being told that the answer is “Maya decides” feels like being told the problem isn’t worth solving properly. “So there’s no algorithm? You just… decide?”

“For now, yes. The algorithm is my brain.” Tom says nothing, but the thought flickers: the substitution logic “could be automated eventually.” Maya catches his expression. “Eventually,” she says, with a weight that closes the topic for now.

Lee steps in. “This is great. Put a pink note on it. The question is: can this scale? And if not, what does the handover from Maya-decides to system-decides look like?”

This is exactly the kind of conversation that Event Storming is designed to provoke. The argument isn’t a problem; it’s the discovery working. Tom now understands that the matching process is far more nuanced than he assumed. Maya now understands that if they want to scale, they’ll eventually need to codify her decision-making process. Both of those insights are worth the entire session.

Priya has been staring at the timeline. She traces it with her finger: “Supply Matched to Demand” on Tuesday, then “Box Contents Decided,” then “Box Packed,” then… she stops. “Where does payment happen?”

Tom points to the left end of the wall. “Payment Submitted” is near “Customer Signed Up,” right at the beginning. That’s how he built it: charge on signup.

Priya shakes her head slowly. “But the box contents change every week. The cost depends on what’s in the box, and we don’t know what’s in the box until Tuesday evening when Maya finishes the matching. If we charge at signup, we’re charging for a box whose contents aren’t known yet.”

The room goes quiet. Tom stares at the wall. She’s right. The entire payment flow he built assumes a fixed price at a fixed time. But the timeline on the wall makes it obvious: billing can’t happen until after supply matching. The data model, the Stripe integration, the receipt emails: all of it is in the wrong place.

Pink sticky note: “Billing point: must be after supply matching, not at signup.”

It’s one of those moments where the wall shows you something that no amount of code review would have caught. The billing architecture isn’t a technical decision; it’s a domain decision, visible only when you see the full sequence of events.
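To make the billing point concrete, here is a minimal Python sketch of the ordering the wall exposed. Everything in it is an assumption for illustration: the `WeeklyBox` and `BoxItem` types, the `decide_contents` and `price_box` functions, and the per-kilogram pricing are hypothetical, not Greenbox’s actual data model or Stripe integration. The only thing taken from the session is the rule itself: a box cannot be priced until its contents have been decided.

```python
from dataclasses import dataclass, field
from decimal import Decimal


@dataclass
class BoxItem:
    name: str
    kg: Decimal
    price_per_kg: Decimal  # hypothetical farm price, for illustration only


@dataclass
class WeeklyBox:
    subscriber: str
    items: list[BoxItem] = field(default_factory=list)
    contents_decided: bool = False  # True once "Box Contents Decided" has happened


def decide_contents(box: WeeklyBox, items: list[BoxItem]) -> None:
    """Record the outcome of supply matching (a person's decision, not an algorithm)."""
    box.items = list(items)
    box.contents_decided = True


def price_box(box: WeeklyBox) -> Decimal:
    """Price a box from its contents; refuses to run before the contents are known."""
    if not box.contents_decided:
        raise ValueError("cannot bill before box contents are decided")
    return sum((item.kg * item.price_per_kg for item in box.items), Decimal("0"))
```

Calling `price_box` on a freshly created `WeeklyBox` raises an error, which is roughly what charge-on-signup was doing silently: asking for a price before the event that determines it. Whether the real price is per item, per box size, or something else is still an open question on the wall, so the per-kilogram sum here is only a stand-in.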

There’s a quieter but equally important moment when Jas admits she’d been designing the customer experience around item selection. “I thought the whole point was letting customers choose,” she says. “Like a farmers’ market online.” Maya gently corrects her: the point is the opposite of choosing. Customers are busy. They don’t want to browse and pick. They want to open their door and find a box of good stuff.

Jas pauses. She’s been designing the wrong product since she started and nobody told her. She can feel the heat rising in her face. Then something clicks. “That actually changes everything about the landing page. The value proposition isn’t choice; it’s trust.”

Nobody told Jas she was wrong at any point during her first two weeks. She’d been designing in good faith based on an assumption that nobody thought to challenge. Event Storming created the space for that challenge to happen naturally, without blame.

What emerged

By the end of three hours, the wall tells a story that nobody in the room could have told alone. Maya knew the farming side but hadn’t thought through the software implications. Tom and Priya understood the technical constraints but had wrong assumptions about the domain. Jas had been designing for a product that doesn’t exist. Sam had operational questions that nobody else had considered.

Before the session	After three hours on the wall
The domain lived in Maya’s head, and nobody else could build without asking her	18 domain events on the wall, visible to everyone. The team shares one picture instead of five different guesses.
Tom assumed box contents were fixed. Jas assumed customers chose items. Neither knew they were wrong.	Three fundamental misunderstandings surfaced and resolved: box contents vary weekly, customers don’t choose, farm onboarding comes before subscriptions.
Priya had a list of unanswered questions about the farm portal and no way to get them answered	12 pink hotspot notes, each one a question the team caught before it became a bug or a wasted sprint. Priya’s questions are now on the wall where everyone can see them.
Jas had designed a customisation interface for a product that doesn’t work that way	The value proposition is trust, not choice. Jas now knows what to design for.
Sam’s operational concerns (delivery logistics, courier contracts, customer complaints) hadn’t been heard by the developers	Sam’s events are on the wall alongside Tom’s and Priya’s. Operations is part of the domain, not an afterthought.

What the session bought

The session resolved in three hours what would have taken weeks to discover through code. Every one of those twelve hotspots was a potential sprint of wasted work. The fundamental disagreement about box contents alone would have caused a full rewrite if discovered in production.

And it’s not just about avoiding waste. Tom now knows the subscription model needs variable weekly contents. Priya knows about commitment deadlines and shortfall reporting. Jas is designing for trust, not choice. Sam has a list of logistics questions that need answers. Everyone is building toward the same product because they all stood in front of the same wall.

Facilitation matters

A few things Lee did that made the session work:

He enforced the no-talking rule in phase one. When people talk too early, the loudest voices dominate and the quieter participants defer. The silent writing phase gives everyone equal weight. Dave and Rachel, who might have felt like outsiders in a tech team’s meeting, produced some of the most important events because they were writing, not competing for airtime.

He kept asking “what happens next?” and “what could go wrong?” These two questions drive the entire session. “What happens next?” extends the timeline. “What could go wrong?” generates hotspots. The second question is the more valuable one, because it forces the group to think about the unhappy paths, and that’s where most of the domain complexity lives.

He didn’t let anyone open a laptop. The moment someone starts Googling or checking Slack, they’re mentally out of the room. Event Storming works because everyone is physically engaged with the wall, moving sticky notes, pointing, arguing. Screens kill that energy.

He adjusted when his first approach didn’t work. When Lee tried to get Tom to challenge the timeline directly, Tom deferred to Maya. So Lee paired Dave with Tom instead, putting a domain expert next to a developer and asking them to find contradictions. The best facilitation is adaptive, not scripted.

He time-boxed ruthlessly. Three hours is enough for a first pass. After three hours people are tired and the returns diminish. Better to photograph the wall, take a break, and come back for a deeper session on the hotspots if needed. The wall isn’t going anywhere.

When to use Event Storming

You’re entering an unfamiliar domain. If your team doesn’t deeply understand the business process, you need to surface that understanding before you build. Most domains are more complex than they first appear, and the complexity hides in the parts you don’t think to ask about.
You’re kicking off a new project. Even if individual team members understand parts of the domain, they probably don’t share a mental model. Event Storming builds that shared model explicitly.
You have access to domain experts. The technique depends on having people in the room who know how things actually work. Dave and Rachel’s farming knowledge was essential to the Greenbox session.

When not to use it

The feature is small and well-understood. Adding a password reset flow doesn’t need a three-hour workshop.
You don’t have the right people available. Running Event Storming without domain experts is just developers guessing together. That’s worse than useless: it creates false confidence.
The team already shares a strong mental model. If everyone genuinely agrees on how things work, Event Storming will confirm what you know without adding much. The danger is that “we all understand it” is often an untested assumption, though sometimes it’s genuinely true.

What happens next

Maya photographs the wall: five panoramic shots that she’ll have laminated the following week. Sam volunteers to transcribe the events and hotspots into a shared document. Tom, who was sceptical about spending three hours not coding, admits he’s glad they did it. “I would have spent a week building automated substitution logic. That would have been completely wrong.”

Lee and Dave walk to the car park together. Dave pulls his keys out of a jacket that’s seen a decade of Margaret River weather. “That wasn’t as bad as I expected.”

“High praise from a farmer.”

Dave pauses by his ute. “The thing about crop failures. That’s not a what-if for us. That’s a Tuesday.”

“I know,” Lee says. “That’s why you needed to be in the room.”

Back inside, the team stands in front of the wall, arms folded. Twelve hotspots, eighteen core events, dozens of questions. Tom’s face says where do we even start? Priya is reading the pink notes methodically. Sam is counting them.

“Pick the one that scares you most,” Lee says.

Maya doesn’t hesitate. She reaches for the pink note from the substitution cluster. Substitution policy: who decides, and how? The question that cost Tom two rewrites in week one. The question that sits at the heart of what makes Greenbox different from a supermarket delivery.

“Good,” Lee says. “Now let’s make it concrete. Rules. Examples. Edge cases. Twenty-five minutes and four colours of card.”

Tom groans. “More sticky notes?”

“Cards, actually.” Lee is already pulling a fresh pack from his bag.

After the session, Jas catches Maya in the kitchen. She’s been on a two-day-a-week contract (Maya’s “we don’t need a designer yet” arrangement) and she’s supposed to be in again on Thursday and then gone until next week.

“I don’t want to do two days anymore,” Jas says.

Maya’s face falls. She’s already thinking about finding another designer, about the landing page redesign, about all the trust-not-choice work that just landed on the wall.

“I want to do five,” Jas says.

Maya is quiet for a moment. The seed money landed a few weeks ago (Angela’s first tranche, $75K) and it’s already stretched across Priya’s salary, Sam’s reduced-but-no-longer-catastrophic pay, and the cafe-office lease. Tom is on equity and a token wage. The budget doesn’t have a full-time designer in it. But it also didn’t have a two-day-a-week contractor in it until Sam talked her into it, and she found the money for that.

“I can’t match your contract rate,” Maya says. “Not even close. I can do a salary, startup salary, which means it’ll be less per week than you’re making now for two days. But I can offer equity. A small stake, vesting over two years. If this works, it’s worth something. If it doesn’t, you’ve taken a pay cut for a startup that folded.”

Jas has done the maths already. She was doing it during the session, while the sticky notes were going up and the arguments were flying. Two days a week at contractor rates is safe money. Five days a week at a startup salary is less money and more risk. But she’s twenty-six, her rent in Leederville is manageable, she doesn’t have a mortgage, and she’s just spent three hours in a room where she understood for the first time what she’d actually be designing.

“I want the equity in writing,” Jas says. “And I want to be in the room for product decisions. Not briefed after.”

“That’s fair,” Maya says. “I should have had you in the room from the start.”

They shake on it in the kitchen of a cafe-office in Fremantle, next to a kettle that takes four minutes to boil and a jar of instant coffee that nobody likes but everybody drinks.

Jas goes home to her Leederville flat that evening and opens her Moleskine to the page where she’d been sketching the customisation flow, the dead one, the one nobody told her about. She turns to a fresh page and writes trust, not choice at the top. Underneath, she starts sketching a landing page that sells the feeling of opening your front door and finding dinner sorted. Her grandmother’s market garden in the Adelaide Hills. Grow what they actually want.

She fills three pages before she looks up.

Lee calls the technique Example Mapping. Twenty-five minutes, four colours of card, and a vague story becomes something you can actually build.

Retrospectives: Catching the Wrong Kind of Fast

2026-03-17T06:00:00+08:00

The seed round closed last week. They were officially a company now, with an office above a cafe in Fremantle and $75K in the bank. Angela’s $150K investment came in two tranches: half now, half when they hit 200 active subscribers. Miss the milestone and the second tranche doesn’t release, leaving them with whatever runway remains from the first. At their current burn rate, that meant about three months after the milestone deadline to either hit it late, find other funding, or wind down. The clock was real.

The seed money bought two things the team needed badly. Priya (29, quiet, precise, recently moved to Perth from Melbourne, her first startup) joined as the second developer. And they finally had enough runway that Maya, Tom, and Sam could stop treating this as a side project and start treating it as a job.

They didn’t have a designer yet. Maya handled the brand herself, badly. Tom built the UI, also badly. That was fine for now. Design could wait. The first boxes had gone out to 38 pilot subscribers and the feedback was good. The produce was excellent, the delivery logistics were shaky, and the sign-up process was held together with sticky tape and a Google Form. It worked. It wouldn’t scale.

Maya’s pitch to the team on Monday morning was simple: “We’ve proved people want this. Now we need to build it properly. Farms list what they have each week. Customers subscribe to a box size. We match supply to demand, pack the boxes, and deliver. Let’s build the software that makes it real.”

Sounds straightforward. The team gets to work.

Week one

The output is incredible. Tom has Claude open in one tab and his IDE in the other. He describes the subscription model he wants, and the LLM generates a complete Stripe integration, data models, signup flow, all in an afternoon. He’s shipping pull requests faster than he ever has in his career.

Priya does the same with the farm portal. She prompts for an inventory management screen, gets a working prototype back, tweaks it, and asks Tom to push it live. By Wednesday she has a portal where farms can list their available produce: tomatoes, 50kg, $4/kg.

Maya sketches a box customisation page on a whiteboard: subscribers pick which items they want each week. Customer delight angle. Tom builds a prototype from the sketch, prompting an LLM for the UI components. It looks rough but functional.

Slack is buzzing with screenshots and pull requests. The team is shipping faster than any of them expected. Tom messages the group: “This is the most productive week I’ve ever had.” Everyone agrees. It feels like they’ve cracked it.

When the code is ready, Tom deploys by SSH-ing into the server from his laptop and running a script he wrote on the first day. It takes about twelve minutes. Nobody else has the credentials or knows the steps. That’s fine, Tom thinks; he’s the only one writing code that matters right now.

Week two

Maya reviews what the team has built.

The subscription system looks impressive: lots of code, clean UI, working payment integration. But Tom has assumed the box contents are fixed, the same items every week. “No,” Maya explains. “The whole point is that contents change based on what farms have available that week. Seasonal produce. That’s what makes it different from a supermarket delivery.”

Tom’s subscription model doesn’t account for variable contents at all. The data model is wrong. The LLM generated exactly what he asked for; the problem is he asked for the wrong thing. That’s a substantial rewrite.

Priya’s farm portal works, but she has questions nobody has answered. How far in advance do farms need to commit their availability? Can they update quantities after a deadline? What happens when total supply across all farms doesn’t cover all subscriber orders? She’d been guessing at the answers and feeding those guesses to the LLM, and some of those guesses are wrong.

The customisation prototype is functional. Maya clicks through it: pick your tomatoes, swap out the zucchini, add extra basil. It works. And something about seeing it working makes her stomach drop.

“This isn’t right,” she says, mostly to herself. Then, louder: “This isn’t what we’re selling. We curate the box. That’s the whole point, they trust us to give them good stuff. If customers are picking items themselves, we’re just a worse version of online grocery shopping.”

Tom looks at her. “You sketched this. On the whiteboard. Last Tuesday.”

“I know.” Maya stares at the screen. She had been thinking about delight, about customers feeling involved. But seeing it built, she can see what it actually is: a feature that undermines the thing that makes them different. She hadn’t thought it through. She’d had a half-formed idea, sketched it in the excitement of the moment, and Tom had built it before either of them stopped to ask whether it made sense.

The whole customisation flow is wasted work. Tom spent a day and a half on it.

Week three

The team tries to course correct. Tom prompts Claude again: “Rebuild the subscription model to support variable weekly contents based on farm availability.” The code comes back in twenty minutes. It’s clean, well-structured, has tests. Tom is pleased.

Then Maya asks: “What happens when a farm can’t supply what they promised?”

Tom looks at the code. There’s no concept of supply shortfalls. He prompts again: “Add handling for when farm supply doesn’t meet subscriber demand.” Claude generates a substitution system that randomly swaps items. Maya shakes her head. “You can’t just swap randomly. Carrots for parsnips, sure. Carrots for lettuce? Nobody wants that.”

Tom prompts again. And again. Each iteration gets closer, but each one surfaces a new question nobody had thought to ask. The LLM is extraordinarily helpful at generating code. It’s just that nobody can tell it what the code should do.
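Maya’s objection can at least be written down as a rule, even before anyone agrees on the details. The Python sketch below only allows a swap within the same produce category. The `CATEGORY` table and the `pick_substitute` function are hypothetical, invented for illustration; the real groupings would have to come from Maya and the farmers, not from a developer’s guess.

```python
from typing import Optional

# Hypothetical categories; the real groupings are a domain decision.
CATEGORY = {
    "carrots": "root vegetable",
    "parsnips": "root vegetable",
    "lettuce": "leafy green",
    "spinach": "leafy green",
}


def pick_substitute(missing: str, available: list[str]) -> Optional[str]:
    """Return the first available item in the same category as `missing`.

    Returns None when nothing suitable is on hand, so a person can decide
    instead of the system swapping at random.
    """
    wanted = CATEGORY.get(missing)
    if wanted is None:
        return None
    for item in available:
        if item != missing and CATEGORY.get(item) == wanted:
            return item
    return None
```

With this table, carrots can be replaced by parsnips but never by lettuce, and an item the table doesn’t know about gets no automatic substitute at all. It is one narrow rule, not a substitution policy: it says nothing about quantities, allergies, or what makes a box work as a week of meals.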

Meanwhile, Priya has paused the farm portal entirely. She has a list of questions and nobody to answer them: How far in advance do farms commit? Can they change quantities after a deadline? What units do they report in: kilograms, crates, “enough for about forty boxes”? She asks Maya, but Maya is in back-to-back meetings with courier companies trying to figure out delivery logistics.

Maya starts redesigning the customer experience without customisation, but she’s pulled in every direction, answering Priya’s farm portal questions, reviewing Tom’s code, talking to courier companies. Nobody is doing the design work full-time. It shows.

Sam mentions a designer she met at a coworking space event in Leederville: Jas Kowalski, freelance, good portfolio, available. Maya hesitates. “We don’t need a designer yet. Not full-time.” Sam pushes back: “Two days a week. Just to sort out the customer-facing stuff. You’re doing three jobs and none of them are design.” Maya agrees to two days. Jas starts the following Monday. Nobody briefs her on the customisation decision. Her first task is tidying up a flow that the team has already decided to throw away.

Week four

Tom’s subscription model v2 is working, sort of. He demos it to Maya on Monday. She spots a problem immediately: “This charges customers on signup day. We need to charge them on delivery day, because we don’t know what’s in their box until the morning we pack it.”

Tom stares at the screen. The entire payment flow assumes charge-on-signup. The data model, the Stripe integration, the receipt emails: all of it. He could ask the LLM to restructure, but the last three restructures have each introduced new assumptions that turned out to be wrong.

“I’ll fix it,” he says, but the energy has gone out of his voice. He’s thinking about his brother Marco at the family Christmas, asking “how’s the little startup going?” in that tone that manages to be both supportive and pitying.

Priya, still blocked on the farm portal, starts helping Tom with the subscription rewrite. Maya is supposed to be thinking about the customer experience but hasn’t found time. Sam redesigns the landing page instead; at least that’s something she can do without needing decisions from anyone.

Sam sends a cheerful Slack message: “Customer #1 just emailed asking when their first box arrives! The pilot subscribers are getting restless.” Nobody knows the answer. Even Mrs Patterson on Stirling Highway has asked twice now.

New questions keep surfacing:

What happens when a customer is allergic to something in this week’s box?
Do farms get paid per item, per box, or per week?
Who decides substitutions when a farm can’t deliver what they promised?
What about delivery logistics: own drivers or a courier?
What if a customer wants to skip a week on holiday?

The LLM is still generating code fast. But each answer raises two more questions, and the team keeps building on assumptions that turn out to be wrong. The velocity is high. The progress is circular.

Four weeks in, the team has a subscription system built on wrong assumptions (twice), a farm portal that nobody’s sure how to finish, a discarded customisation prototype, and a growing list of questions that should have been answered before anyone opened an IDE. Tom’s git log has more reverts than merges. Maya needs to talk to Dave Morrison (a third-generation farmer outside Margaret River whose produce she’s hoping to build the business around) about supply commitments, but she hasn’t found the time.

They’re not lazy. They’re not bad at their jobs. The LLMs aren’t the problem either; they did exactly what they were asked, impressively fast. The problem is that nobody understood what to ask for. The team started building before they understood the problem, and the LLMs just helped them build the wrong thing faster.

The expensive kind of learning

Every one of those surprises was knowable. The team assumed they understood the domain because the concept sounded simple.

LLMs made it worse, not better. The sheer speed of code generation disguises the lack of understanding. When it took two weeks to build something wrong, you noticed after two weeks. When the LLM builds it wrong in an afternoon, you might not notice until you’ve built three more things on top of the wrong foundation. The velocity feels incredible. The progress is an illusion.

“It’s just a…” is one of the most expensive phrases in software development. And “the LLM can build that in an hour” is its dangerous new cousin.

The cost isn’t just the wasted code. It’s the trust erosion. Tom is frustrated because his work got thrown away, twice. Jas is frustrated because nobody told her the customisation premise was wrong. Priya is blocked and going quiet about it. Maya is wondering if she hired the correct people. Everyone’s doing their best, but the team is pulling in different directions because they never built a shared understanding of what they’re actually building.

The retro that changed everything

Maya’s friend Lee drops by the office on a Friday afternoon. They’d met at the Margaret River farmers’ market six months earlier: Maya buying produce for a dinner party, Lee buying coffee, and a twenty-minute conversation about supply chains that turned into a friendship. Lee spent twenty years in enterprise consulting before semi-retiring to the coast. He’s 52, surfs badly but persistently, and has the calm manner of someone who’s watched a lot of teams struggle with the same problems. He can feel the tension the moment he walks in. Tom is quiet. Priya is staring at a Jira board full of blocked tickets. Jas is redesigning the landing page for the third time because nobody will answer her questions about the customer experience.

“When was the last time you all stopped and talked about how the work is going?” Lee asks.

Maya looks blank. “We have standups.”

“Not standups. A proper retrospective. Where you actually talk about what’s working and what isn’t.”

Maya is sceptical; they’re burning runway and the last thing they need is another meeting. But Lee pushes gently: “Ninety minutes. I’ll facilitate. If it’s a waste of time, I’ll buy the team lunch.”

They gather in the meeting room on Monday morning. Lee draws two columns on the whiteboard (“What went well” and “What didn’t go well”) and hands out two colours of sticky notes.

“Five stages,” he says. “Let’s start.”

Stage one: set the stage. Lee reads the Retrospective Prime Directive: “Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time.”

He lets it sit for a moment. “This isn’t a blame session. One word from each of you: how are you feeling right now?”

Tom: “Frustrated.” Priya: “Stuck.” Jas: “Confused.” Sam: “Anxious.” She has 47 unread emails from pilot subscribers on her phone. She reads them before bed most nights, but she hasn’t told anyone that. Maya pauses. “Guilty.”

Lee nods. “Good. That’s honest. Let’s work with that.”

Stage two: gather data. “Green notes for what went well. Pink notes for what didn’t. One thing per note, as many as you want. No talking; just write.”

The team writes for five minutes. Lee tells them to put the green notes on the left side of the board and the pink notes on the right, then read each one aloud as they place it.

The green side is thinner than the pink side, but it’s not empty.

Tom: “LLM code generation is genuinely fast. I’ve never shipped this much code this quickly.” And: “The Stripe integration works perfectly. Payment flow is solid.”

Priya: “I identified the farm portal questions early. The problem wasn’t spotting them, it was getting answers.”

Sam: “We have pilot subscribers. People actually want this product. The landing page is working.”

Maya: “The team is motivated and hardworking. Nobody’s coasting.”

Then the pink side.

Tom: “I’ve rebuilt the subscription model twice. Both times I asked the LLM to generate it, both times it was wrong, and both times I didn’t find out until Maya looked at it.”

Maya: “I sketched a whole customisation flow that we’re not using. I should have checked the premise before Tom built it.”

Priya: “I’ve been blocked for two weeks waiting for answers about how farms work. I keep guessing and getting it wrong.”

Sam: “Pilot subscribers are emailing me asking when their first box arrives. I don’t know the answer.”

Jas: “I spent my first three days redesigning a customisation flow that was already dead. Nobody told me.”

Maya, reading her own note back: “Everyone is frustrated with me. I have the answers but I’m not sharing them fast enough.”

Stage three: generate insights. Lee asks the team to stand up and look at the wall. “Group the pink notes that seem related.”

Priya puts her “blocked waiting for answers” note next to Tom’s “didn’t find out until Maya looked at it.” Jas adds her customisation note to the same cluster. Sam’s note goes there too.

One large cluster. A few stragglers.

“What do you notice?” Lee asks.

Tom sees it first. “They’re all the same problem. Maya understands the business. We don’t. And building stuff without that understanding isn’t working. I’m prompting an LLM to write code, but I’m describing the wrong thing because I don’t know what the correct thing is.”

Priya nods. “The LLM does exactly what I ask. The problem is I’m asking the wrong questions.”

Lee: “And the green side?”

Jas reads them again. “We’re not bad at our jobs. The code quality is high. The speed is real. We have customers who want the product.”

“Right,” Lee says. “The tools aren’t the problem. The people aren’t the problem. One person has the domain knowledge, and everyone else is guessing. The LLMs made that worse, not better, because the guesses turned into working code before anyone could catch them.”

The room goes quiet.

Stage four: decide what to do. “Actions,” Lee says. “What could this team do to fix the root cause? One idea per note, no filtering. Two minutes.”

The notes come fast.

Tom: “Daily check-ins with Maya.” And: “Maya reviews every PR before merge.” Priya: “Shared document of all business rules.” And: “Weekly domain Q&A session.” Sam: “Record Maya explaining the business on video.”

Five ideas. Lee reads them back. “What do they all have in common?”

Priya sees it. “They all depend on Maya. Every single one puts Maya at the centre.”

“Right. Five ways to get knowledge out of Maya’s head, one conversation at a time. They’d work, slowly.” He writes a sixth note. “There’s a technique called Event Storming. Whole team in a room, farming contacts too if you can get them. A few hours mapping out how the business actually works, not architecture, not user stories. Just: what happens, in what order, and where are the hard parts. Sticky notes on a wall. The shared understanding these five ideas are reaching for? Event Storming builds it in an afternoon.”

He sticks it on the board. “Dot vote. Two dots each. Pick whatever you think will make the biggest difference, even if it’s not mine.”

Event Storming gets six dots out of eight. Daily check-ins get two.

Lee nods. “That’s your call, not mine. If I’d walked in and said ‘do Event Storming,’ you’d be doing it because I told you to. Different thing entirely.”

Maya looks unconvinced. “So the answer is… sticky notes.”

“The misunderstandings that just cost you four weeks? They surface in the first hour, when they’re cheap to fix.” Lee pauses. “You’ll feel like you’re going slower. You’re not. You’re just putting the learning where it’s cheap: on a wall instead of in production.”

Stage five: close. “One last thing,” Lee says. “One thing you appreciated about someone else these past four weeks.”

Tom: “Priya spotted the farm portal questions before any of us even thought about them. That’s good instinct.”

Priya: “Tom’s code is always clean. Even the stuff we threw away was well-written.”

Jas: “Sam’s been handling angry pilot subscribers by herself and never complained.”

Sam: “Maya’s always available when you can actually get hold of her. She never brushes you off.”

Maya: “Everyone kept working even when they weren’t sure what they were building. That takes guts.”

Lee smiles. “You’ve got a good team. You just need a shared picture of what you’re building. Let’s go get one.”

“And the retros?”

“Every two weeks. Non-negotiable.” Lee glances at the wall of pink notes. “When everyone’s prompting LLMs on their own, the thinking goes invisible. This is where it becomes visible again. But first. Event Storming.”

The team files out. Lee steps outside by himself. His phone shows a missed call from his daughter Yuki. He looks at it for a moment, puts the phone back in his pocket, and goes to find his car.

Inside, Maya stays in the meeting room alone. The wall of pink sticky notes stares back at her, every one of them a version of the same problem. She calls Nadia. “I think I’m the problem,” she says. Nadia listens for a long time.

That evening, Tom sits on the couch while Sarah puts Ava and Leo to bed. Ava calls out from her room: “Did you make something today, Daddy?” Tom doesn’t answer. Sarah comes out and asks how the startup is going. “It’s fine,” he says. Sarah studies him. She knows it’s not fine, but she also knows that Tom processes things by building, not by talking. She lets it go. Tom opens his laptop and stares at his git log. More reverts than merges. He’d been thinking about other jobs all weekend. He’s not thinking about them now. Not quite.

Priya goes home to her flat in North Perth, feeds her cat Refactor, and calls her mum in Melbourne. Her mum asks about work. “It’s fine,” Priya says. It’s not fine, but she doesn’t know how to explain what “blocked on domain questions” means to someone who runs a grocery shop in Dandenong.

Jas walks back to her flat in Leederville. Her contract is two days a week and she’s already wondering if those two days are worth it. She spent her first week designing improvements to a customisation flow that was already dead. Nobody told her. She found out when Tom mentioned it in standup, casually, like everyone knew. She’d sat there with her Moleskine open and said nothing. She thinks about not renewing. It’s only two days. She could fill them easily. She calls her mum in Adelaide instead. Her mum listens, then tells her about her grandmother, who ran a market garden in the Adelaide Hills for thirty years. “She never grew what she thought people should eat. She grew what they actually wanted.” Her mum pauses. “The good ones figure that out. Give them a minute.”

Jas doesn’t quit.

The retro produced one action. One. And it changed everything that followed.

Maya books the biggest meeting room she can find and calls Dave Morrison.

“I need you to come to Perth,” she says. “You and Rachel. My team has been building for a month and half of what they’ve built is wrong because they don’t understand how any of this actually works. How the farms operate. What happens when a crop fails. What the substitution logic really looks like. They need to hear it from someone who lives it, not from me relaying it secondhand between meetings.”

She takes a breath. “I need everyone in the same room: the developers, the designer, Sam, you, Rachel, Lee. I need us to map the whole thing out together. What happens, in what order, where it gets complicated. I need the team to see the problems you see. I need you to tell them about the deadlines that matter, the things that go wrong, the stuff I’ve been carrying around in my head that I should have put on a wall weeks ago.”

Dave is quiet for a moment. “I’ve been to workshops before. They were rubbish.”

“This one might be too. But we can’t keep building on guesses. I need you there so we can figure out what’s actually important, what we’re getting wrong, and what we haven’t even thought about yet.”

“What time? I’ve got cows.”

Dave agrees to come. The workshop is called Event Storming, and it starts with a wall of sticky notes and everyone in the room.

The Value Is in Ideas, Not Code

2026-03-12T06:00:00+08:00

Writing code used to be the bottleneck. You’d have an idea, and then you’d spend days or weeks turning it into something you could actually try. Most ideas died in that gap, not because they were bad, but because the cost of finding out was too high.

That’s changed. LLMs have made code implementation almost trivial for a huge class of problems. I don’t mean they write perfect production systems; they don’t (who does?). But they’re astonishingly good at producing “good enough”. The kind of thing you need to try an idea out, show it to someone, see if the shape of it works. A rough dashboard. A prototype API. A quick tool that does the one thing you need. An iOS app to manage substitutions on your kid’s sports team. What used to take a week or two takes an afternoon.

The value has moved

If producing code is cheap, the bottleneck shifts. The scarce resource isn’t implementation any more; it’s knowing what to ask for. Two things feed that: curation and knowledge.

Curation is the strategic bit. Which ideas are worth pulling together? What combination of things, each individually unremarkable, becomes something genuinely useful when you stack them up? An LLM can build what you describe, but it can’t (yet…) tell you what’s worth building. That judgement (knowing which thread to pull, which experiment to run next, which of your twelve half-formed ideas deserves an afternoon) is where the leverage is now.

Knowledge is the tactical bit. The more you know exists, the more you can build. LLMs are force multipliers, but they only multiply what you bring to the conversation.

If you know that sparkline charts exist, you can say “put sparklines in the table cells” and get them in minutes. If you don’t know sparklines are a thing, you’ll never think to ask, and the LLM is unlikely to suggest them unprompted.

This pattern is everywhere:

Know what a dead letter queue is? You can ask for one by name instead of reinventing retry logic from scratch.
Seen an optimistic UI before? You can tell the LLM “update the UI before the server responds, roll back if it fails” and get a snappy interface in minutes.
Heard of feature flags? You can ask for a feature flag system in your prototype and suddenly you’re testing two versions of an idea at once.
Know what eventual consistency means? You can describe the tradeoff you want and skip the long detour where you accidentally build something that doesn’t scale.
Familiar with the concept of a circuit breaker? One sentence in your prompt and your API client handles failures gracefully instead of hammering a dead service.
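
To make the last of those concrete, here’s roughly what you get back when you ask for a circuit breaker by name. This is a minimal sketch, not a production client: the class name, the `max_failures` and `reset_after` parameters, and the injectable `clock` are all illustrative choices of mine, and a real implementation would also want thread safety and some logging.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call because the circuit is open."""


class CircuitBreaker:
    """Stops calling a failing dependency until a cool-down has passed.

    All names and defaults here are hypothetical, chosen for illustration.
    """

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before tripping
        self.reset_after = reset_after    # cool-down in seconds
        self.clock = clock                # injectable so it can be tested
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Still cooling down: fail fast without touching the service.
                raise CircuitOpenError("circuit open; skipping call")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip (or re-trip) the breaker
            raise
        # Any success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

You’d wrap each outbound request in `breaker.call(...)`. After `max_failures` consecutive errors the breaker raises `CircuitOpenError` immediately instead of hammering the dead service, then lets a single trial call through once the cool-down has elapsed. The point isn’t the thirty lines; it’s that knowing the name is what gets you them.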

Every piece of knowledge you’ve accumulated over the years is a prompt waiting to happen. Broad technical knowledge has always been valuable, but now it converts directly into working software in a way it never did before. The person who’s seen a lot of things and roughly knows what’s possible will consistently out-build the person who’s deeper in one stack but doesn’t know what’s out there.

Deploy, learn, iterate

When the cost of trying something drops this far, you can run experiments you’d never have justified before. Build the thing. Ship it. See if anyone cares. If they don’t, you’ve lost a few hours, not a sprint.

We’ve talked about rapid prototyping (deploy, learn, iterate) for years, but the cost has finally dropped low enough that it’s genuinely practical for most ideas. Not just the ones that survive a prioritisation meeting. Instead of specifying, building, testing, deploying over weeks, you can have something in front of real users in hours, and that changes which ideas get a chance at all.

So what?

If you’re a builder: lean into breadth. Read widely. Collect patterns and concepts. Your library of “things I know exist” is your competitive advantage, because each one is a card you can play when the right problem shows up.

And if you’re not a builder yet? The barrier just got a whole lot lower.

Minimum Viable Product: The First Box

2026-03-10T06:00:00+08:00

Tom built the landing page in a weekend. It wasn’t beautiful. Jas hadn’t joined yet, and Tom’s design instincts ran to “functional.” But it had a clear headline (Fresh local produce, delivered weekly), a description of the two box sizes, a price, and a signup form that collected a name, email, address, and payment details.

The payment integration worked. Tom had used Claude to generate a Stripe setup in an afternoon, and it was solid: one of the few things from those early weeks that didn’t need rebuilding. The confirmation email went out. The landing page loaded fast. The form submitted cleanly.

What the form didn’t collect was a unit number. Or a delivery note. Or any indication of whether the customer lived in a house, a flat, or a unit complex. Tom’s data model had: name, email, street address, suburb, postcode. It seemed like enough at the time.

The flyer

Maya designed the flyer herself. Hand-drawn, because she couldn’t afford a designer and because she wanted it to feel personal. A sketch of a green crate overflowing with vegetables. The Greenbox name in her own handwriting. A QR code that Tom generated, linking to the landing page. And a line at the bottom: Local farms. Weekly boxes. No thinking required.

She printed fifty copies at Officeworks and drove down to the Margaret River farmers’ market on Saturday morning.

The market was where Maya had grown up. Her parents had sold produce from a trestle table at the far end for fifteen years. She knew the rhythms: the early-morning setup, the rush between nine and eleven, the slow afternoon when the stallholders started packing up and the remaining customers got the best deals. She knew the regulars: the retired couples who came every week, the young families with kids running between the stalls, the restaurant owners doing their weekend sourcing.

Maya walked the market with her flyers, talking to everyone who’d listen. Some of them remembered her parents. Most of them liked the idea. A few were sceptical.

“Another subscription thing? I tried one of those meal kit services. Lasted three weeks.”

“This is different. It’s not a meal kit. It’s actual produce from actual farms within fifty k’s.”

“Which farms?”

“Dave Morrison, for one. And Rachel’s place.”

The mention of Dave’s name carried weight at the Margaret River market. People knew Dave. People trusted Dave. If Dave was involved, the vegetables would be good.

By the end of Saturday, twenty-two people had scanned the QR code and signed up. Twenty-two. Maya sat in her car in the market car park and stared at her phone. Twenty-two real people had given her their credit card details and trusted her to send them a box of vegetables.

She called Tom. “Twenty-two.”

“Twenty-two what?”

“Subscribers. We have twenty-two subscribers.”

A pause. Then Tom’s voice, with the particular excitement of a builder who’s just learned that the thing he built has users: “That’s… that’s actual people.”

“Actual people who expect a box of vegetables on Thursday.”

“Right. Thursday. That’s… five days from now.”

“Four, actually.”

Packing day

Wednesday morning. Maya’s kitchen table.

Dave arrived at 5am in his ute, the back loaded with green crates. Tomatoes, zucchini, spinach, carrots, beetroot, a few bunches of herbs. Everything picked the day before. He carried the crates in through the front door, set them on the kitchen floor, and surveyed the operation.

“This is your packing facility?”

“For now.”

“It’s a kitchen table.”

“It’s a large kitchen table.”

Dave shook his head with the expression of a man who had seen many ambitious plans meet their first contact with reality. He left without saying much else, though he paused at the door and said, “The spinach won’t last if you leave it out. Get it in the boxes fast.”

Rachel arrived an hour later in her own ute, with a smaller contribution: bunches of kale, some sweet potatoes, and a crate of capsicums. She helped carry them in and looked at the kitchen table.

“You right?”

“I think so.”

Rachel studied the piles of produce. “You’ll want to pack the heavy stuff at the bottom. Sweet potatoes first, then the root veg, then the leafy stuff on top. If you put the spinach at the bottom it’ll be soup by the time it arrives.”

Maya wrote this down. She hadn’t thought about packing order. The spreadsheet had columns for subscription size, produce allocation, and delivery address. It did not have a column for “which vegetables go on the bottom.”

Sam arrived at seven with boxes. Actual cardboard boxes; she’d sourced them from a packaging company in Welshpool, the cheapest option that was still food-grade. They were plain brown, no branding, because branded boxes cost four times as much and Maya’s budget didn’t stretch that far. Sam had written “GREENBOX” on each one in green marker. It looked homemade, because it was.

They packed twenty-two boxes on the kitchen table. Maya and Sam working side by side, consulting the spreadsheet on Maya’s laptop, weighing produce on a kitchen scale. Tom sat in the living room, monitoring the website and the payment system, feeling useless.

“Can I help pack?” he asked.

“Can you tell the difference between baby spinach and rocket?” Sam replied.

“They’re both green.”

“Stay in the living room.”

The packing took four hours. Maya’s back ached. Sam had produce stains on her shirt. The kitchen looked like a greengrocer had exploded. Nadia, who had taken the day off to help, wrapped the last box in brown paper and taped the address label on with the precision of someone who had decided that if her living room was going to be a warehouse, at least the warehouse would be tidy.

By midday, twenty-two boxes were stacked by the front door. They looked good. They smelled good. Maya took a photo and sent it to Dave. He replied with a single thumbs-up emoji, the most effusive communication she’d ever received from him.

The delivery

Sam had arranged delivery through her mate Callum, who drove a courier van in the southern suburbs. Callum was reliable, Sam said. He’d done deliveries for the trucking company and he knew the Perth metro area. He was also cheap. Sam had negotiated a rate per box that was barely above fuel costs, a favour that Callum would regret by the third week.

The plan was Thursday delivery. Callum would pick up the boxes at midday and deliver them between 2pm and 6pm.

Callum picked up the boxes on Wednesday.

“Thursday,” Sam said, when he turned up a day early. “Thursday delivery. I said Thursday.”

“You said this week. I’ve got a full run tomorrow. Today’s better.”

Sam called Maya. Maya called Callum. Callum was already driving, with twenty-two boxes in the back of his van and the confidence of a man who had been doing deliveries for twelve years and didn’t see the problem.

“Most of these people are at work,” Maya said. “They won’t be home until five or six.”

“I’ll leave them on the doorstep.”

“It’s fresh produce. In a cardboard box. In the sun.”

“I’ll find shade.”

Half the boxes were delivered to empty houses on a Wednesday afternoon. Six were left on doorsteps in full sun. Three were delivered to the wrong addresses because Tom’s address data didn’t include unit numbers. Two customers lived in unit complexes, and Callum had left the boxes at the front door of the building, not the individual unit. One box was never found. The spinach, which Dave had warned them about, had wilted in the boxes that sat in the sun for three hours.

Maya’s car

At 4pm on Wednesday, Maya was sitting in her car outside a house in Applecross. She’d driven out to intercept the last few deliveries, hoping to correct the addresses and apologise in person. The house belonged to a subscriber named Mrs Patterson, a woman in her sixties who lived alone on Canning Highway and had signed up at the market because she liked the idea of someone else choosing her vegetables for the week.

Mrs Patterson wasn’t home. The box was on her doorstep, in the shade at least, but it had been there for two hours. Maya picked it up and opened it. The spinach was limp. The herbs had started to wilt. The tomatoes were fine (tomatoes are forgiving), but the overall impression was not “premium local produce.” It was “vegetables that had been sitting in a box for too long.”

Maya put the box back, sat in her car, and called Mrs Patterson.

“Hello?”

“Mrs Patterson, this is Maya from Greenbox. Your box was delivered today instead of tomorrow, and I’m afraid some of the produce might not be at its best. I’m so sorry. I’m outside your house now and I’d like to –”

“Oh, that’s all right, love. I saw it when I came home for lunch. The tomatoes looked gorgeous. Don’t worry about it.”

Mrs Patterson was kind. She was generous. She told Maya to stop worrying and come back next week with a better box. Maya thanked her, hung up, and sat in her car for five minutes with her hands on the steering wheel, staring at nothing.

She wasn’t crying because Mrs Patterson was angry. She was crying because Mrs Patterson was kind, and Maya felt like she didn’t deserve it. Twenty-two people had trusted her with their dinner, and she’d delivered wilted spinach on the wrong day to the wrong addresses. The spreadsheet, the flyer, the 5am packing session: all of it had produced a result that was, by any honest assessment, a disaster.

She wiped her face, started the car, and drove to the next address.

The recovery

That evening, Maya, Tom, and Sam sat at the kitchen table (the same table they’d packed boxes on that morning) and went through every problem.

Tom opened his laptop and pulled up the customer data. “We need unit numbers. I’ll add a field to the signup form tonight.” He paused. “I should have thought of that.”

“We all should have,” Maya said.

Sam had a list on her phone. “The delivery window is non-negotiable. Thursday between 3pm and 7pm. Not Wednesday. Not whenever Callum feels like it. I’ll find a different courier if I have to.”

“Can we afford a different courier?”

“Can we afford to lose customers?”

Maya conceded the point.

They went through every failure. The spinach problem was timing; they’d packed too early. If they packed on Thursday morning and delivered Thursday afternoon, the produce would be hours old instead of a day old. Dave had told them this. They hadn’t listened, or rather, they’d listened and then let the logistics override what they’d heard.

The address problem was data. Tom fixed the form that night, adding fields for unit number and delivery instructions. He also added a confirmation step that showed the customer their full address before they submitted, so they could catch errors.

The delivery timing was Sam’s domain. She called three courier companies on Thursday morning and found one (a woman named Jen who ran a small delivery business in Fremantle) who could guarantee a Thursday afternoon window. Jen was more expensive than Callum, but she answered her phone, confirmed delivery times, and understood that fresh produce and hot doorsteps were a bad combination.

By the following Thursday (week two), the process worked. Pack on Thursday morning. Deliver Thursday afternoon. Address data includes unit numbers. Jen delivers within the confirmed window. No boxes in the sun. No wrong-day deliveries. No wilted spinach.

It wasn’t smooth. Sam spent Thursday afternoon texting Jen for updates. Maya called three customers to confirm their boxes had arrived. Tom refreshed the delivery tracker (a shared Google Sheet that was the entire “operations platform”) every fifteen minutes. But the boxes arrived. The produce was fresh. Nobody called to complain.

Mrs Patterson emailed on Friday morning: “Much better this week! The carrots were beautiful.”

Week three

By week three, the process was routine. Pack at 6am Thursday. Jen picks up at 10am. Deliveries between 2pm and 5pm. Sam confirms each delivery by text. Tom monitors the payments. Maya handles the farm coordination: checking in with Dave and Rachel on Monday about what they’d have available, confirming quantities on Wednesday, adjusting the packing list if something fell short.

The rhythm emerged not from a plan but from the accumulated learning of things that went wrong. Every mistake in week one became a process in week three. Unit numbers on the form. Packing order: heavy at the bottom, leafy on top. Delivery window confirmed 24 hours in advance. A shared spreadsheet tracking every box from packing to delivery.

Sam started a simple feedback system: an email sent to every subscriber on Friday asking how their box was. Most people didn’t reply. The ones who did were either very happy or very specific about what they didn’t like. One subscriber requested no coriander. Another asked if they could get extra tomatoes. A third wanted to know which farm her carrots came from.

Maya answered every email personally. She learned the subscribers’ names, their preferences, their quirks. Mrs Patterson didn’t like beetroot. A young couple in Northbridge were vegetarian and wanted more variety in leafy greens. A family in Cottesloe had three kids and needed quantity over variety. A retired teacher in Mosman Park wanted whatever Dave recommended, because she’d been buying from Dave at the market for years and trusted his judgement.

These weren’t user personas on a whiteboard. They were real people with real kitchens and real opinions about coriander.

The email

On a Friday afternoon in the third week, an email arrived from a subscriber named Claire. It was three sentences long.

Hi Maya, just wanted to say thanks. I haven’t thought about what’s for dinner since I started getting the box. That’s worth more than the vegetables.

Maya read it three times. She read it standing at the kitchen counter while Nadia made tea. She read it again before bed. She didn’t have the language for what Claire was describing, not yet. The phrase “job to be done” was months away. But she felt the shape of it. The box wasn’t about vegetables. It was about something else, something larger and harder to name, and Claire had just told her what it was.

The box was about one fewer decision in a day full of decisions. Open the door, pick up the box, cook what’s inside. No planning, no shopping, no standing in a supermarket aisle at 6pm wondering what to have for dinner. Trust the box. Trust the farm. Trust Maya.

She saved the email in a folder she named “Why We Do This.” It was the only email in the folder. Over the next year, it would have company.

Growth

Twenty-two subscribers became twenty-six in week two. Word of mouth: the people who got the box told the people who didn’t. By week four, thirty-one. By week six, thirty-eight.

Maya hadn’t run a single ad. The growth was entirely organic: market flyers, word of mouth, and a short piece in the local Fremantle newspaper that Sam had arranged by calling the editor and saying, “We’re three people packing vegetables on a kitchen table and delivering them to your neighbours. Want to write about it?”

The editor did want to write about it. The article ran on a Wednesday and produced nine signups by Friday. Small numbers, but each one was a person who’d read about Greenbox and decided to trust a stranger with their weekly dinner.

Thirty-eight subscribers was encouraging. It was also nowhere near enough. And the manual operation (Maya, Sam, and a kitchen table) couldn’t scale. Every Thursday was a full day of packing and coordinating. Maya was spending Monday on farm calls, Tuesday on the packing list, Wednesday on logistics, Thursday on packing and delivery, and Friday on customer emails. That left no time for anything else. Tom was building the platform as fast as he could, but the platform was for 200 subscribers and they needed to stop packing by hand long before then.

The seed round would change that. Maya had been talking to investors since before the first box shipped. The terms were clear: reach 200 active subscribers within three months of funding, and the next round follows. Miss the target, and Greenbox is done.

Two hundred. From thirty-eight. In twelve weeks.

Once the money came in, they could hire a proper team, move out of the living room, and build the systems to replace the kitchen-table operation. The pilot subscribers (the thirty-eight people who’d trusted Maya with their Thursday dinners) would keep getting boxes through the manual process for now. But the platform Tom was building had to be ready before the numbers got any higher. You can hand-pack thirty-eight boxes on a kitchen table. You cannot hand-pack two hundred.

Maya looked at the subscriber graph, a line on a spreadsheet that she checked every morning at 5am, before her run, before coffee, before anything else. The line was going up. Slowly. Steadily. But 200 was a long way from 38, and the gap between them was filled with packing days and delivery runs and emails about coriander and a team of three people who were already working as hard as they could.

She needed more people. She needed an office. Nadia’s patience with the living room situation was genuine but not infinite. She needed a developer who wasn’t Tom, because Tom was one person and the codebase was growing faster than one person could manage. She needed money.

The seed round had to close. And once it did, the clock started. Three months. Two hundred subscribers. Build the platform, grow the customer base, prove the model. Tom was already building as fast as he could, using LLMs to generate code at a pace that felt miraculous. They’d shipped a subscription system, a landing page, a basic farm portal, all in weeks.

The question Maya couldn’t answer (the question that would define the next three months) was whether all that speed was pointed in the right direction.

Customer Discovery: Before the First Line of Code

2026-03-03T06:00:00+08:00

Maya’s earliest memories are of dirt under her fingernails and the sound of her father’s ute on the gravel road before dawn.

The farm was sixty acres outside Margaret River: dairy originally, then mixed organic produce after her parents made the conversion. Her parents had emigrated from Taiwan in the early eighties, bought the cheapest land they could find in a place where nobody would tell them they didn’t belong, and built something with their hands. The conversion from dairy nearly bankrupted them. Two seasons of no income while they learned new skills. Her mother picking up extra shifts at the local school canteen. Her father up at four every morning, teaching himself soil chemistry from library books and a Mandarin agricultural manual he’d brought in his suitcase.

By the time Maya was ten, the farm was producing vegetables that restaurants in Margaret River asked for by name. Her parents sold the rest at the Saturday farmers’ market: a trestle table, a hand-painted sign, and crates of whatever was in season. Maya worked the market from age twelve. She learned to make change, to explain what kohlrabi was, and to smile when tourists asked if the vegetables were “really organic” in a tone that meant they didn’t believe her.

She also learned something about distribution that she wouldn’t have words for until much later: the gap between what the farm produced and what people could actually buy.

Her parents grew beautiful produce. The restaurants took a small percentage. The market took a Saturday. The rest (the bulk of what they grew) went to a wholesaler who paid them barely enough to cover costs. The supermarkets took 40% margins and put their produce next to imported tomatoes from Queensland at half the price. The economics were brutal. The quality was irrelevant to anyone who wasn’t standing at the market stall on a Saturday morning, holding a bunch of carrots and tasting the difference.

Maya left for Perth at eighteen. Computer science at UWA. She was good at it; the logical structure of code felt like a language she’d been waiting to learn. She graduated with honours, took a job at a consulting firm, and spent the next decade doing technical advisory work for companies that were always larger, richer, and less interesting than they appeared from the outside. She built systems for mining companies, insurance firms, a state government department that needed a new payroll system and took three years to get one. She learned to translate between business people and technical people. She learned that the hardest problems in software were never about software.

She also learned to dress for offices, to present to boards, and to eat lunch at her desk without getting crumbs on client deliverables. She was good at consulting. She was not passionate about it. The difference matters less than you’d think in your twenties and more than you’d think in your thirties.

The idea

The idea arrived the way most ideas arrive: not in a flash, but as a slow accumulation of irritation.

Maya was thirty-one, living in a flat in Fremantle with her partner Nadia, buying vegetables from the supermarket on the way home from work. The tomatoes were pale and mealy. The lettuce was wrapped in plastic and had been picked four days ago in another state. She knew, because she’d grown up on a farm, because her hands remembered what a good tomato felt like, that there were farms within fifty kilometres growing produce that was better in every way. But she couldn’t get it. Not easily, not regularly, not without driving to a farmers’ market on a Saturday morning and hoping for the best.

The farmers’ markets were good, but they were weekend-only. You had to plan your whole Saturday around them. And the farms themselves had no direct-to-consumer channel at all. Dave Morrison, a third-generation farmer near Margaret River whose family had been working the same soil since 1962, sold most of his crop to a wholesaler and whatever was left at the market. Rachel, who ran a smaller mixed farm nearby, did the same. Both of them produced food that was extraordinary. Neither of them had a way to get it to the people who would value it most.

What if you could subscribe to a weekly box? Fresh seasonal vegetables, sourced from farms within fifty kilometres, delivered to your door every Thursday. Simple concept. The farms get a reliable buyer at a fair price. The customer gets produce they can trust without thinking about it. The middleman (the wholesaler, the supermarket) gets cut out.

Maya wrote the idea on a napkin at a cafe in Fremantle. Then she wrote it again, more carefully, in a notebook. Then she opened a spreadsheet and started modelling costs. The spreadsheet grew over three months. She worked on it in the evenings after Nadia went to bed, sitting at the kitchen table with a cup of tea and the quiet focus of someone who knows they’re building something real.

The conversations

The first person she called was Dave Morrison.

Maya had known Dave since childhood. Her parents and Dave’s family had sold produce at adjacent stalls at the Margaret River market for years. Dave was laconic, careful with words, and deeply sceptical of anything that came from the city. He was fifty-eight, had survived droughts, frosts, a global financial crisis, and two decades of supermarket price pressure. He’d seen the co-ops come and go. He’d watched startups promise to “disrupt” agriculture and then disappear when the venture capital ran out.

“You’re not the first city kid with this idea,” he said, when Maya pitched it over the phone.

“I’m not a city kid, Dave. You’ve known me since I was twelve.”

A long pause. Dave’s pauses carried more information than most people’s paragraphs.

“Fair point. But the idea’s still not new. I’ve watched three co-ops and two startups promise to fix farm distribution. All of them ran out of money.”

“What was different about them?”

Another pause. “They didn’t understand farming. They thought it was a supply chain problem. It’s not. It’s a relationship problem. Farms don’t produce on demand. We produce what the season gives us, and then we figure out who wants it.”

“I know that.”

“You know it because you grew up on a farm. They didn’t.”

Dave didn’t say yes. He didn’t say no. He said: “Come down to the farm. Bring your spreadsheet. I’ll tell you what’s wrong with it.”

Maya drove down on a Saturday. Dave walked her through the operation: the fields, the packing shed, the cold storage. He showed her the wholesale orders, the market prep, the waste. Produce that didn’t sell at market. Produce that was too small or too oddly shaped for the wholesaler. Produce that was perfect but had no buyer.

“You see those crates?” Dave pointed to a stack of weathered green plastic crates by the packing shed door. The kind farms use everywhere: stackable, reusable, the colour of sun-faded gum leaves. “That’s what I send produce in. Twenty-odd years I’ve been using those crates. They go to market, they come home, they go out again.”

Maya looked at the crates. Green, sturdy, practical. A farm thing. A real thing.

“Greenbox,” she said.

Dave raised an eyebrow.

“That’s the name. Greenbox.”

“It’s a crate.”

“It’s a box. A green box. With produce in it, delivered to someone’s door.”

Dave shook his head, but Maya saw the corner of his mouth twitch. That was as close to approval as Dave got.

Recruiting Tom

Tom Chen was Maya’s oldest friend from UWA. They’d met in a second-year algorithms tutorial. Maya was the only woman in the room and Tom was the only person who talked to her like a normal human being instead of either ignoring her or explaining things she already understood. They’d stayed friends through fifteen years of diverging careers: Maya into consulting, Tom into software development. He was thirty-eight now, married to Sarah, two kids (Ava and Leo), and the kind of programmer who built side projects after bedtime because making things was how he processed the world.

Tom was between jobs. His last company had been acquired by a larger firm, the culture had rotted within six months, and he’d taken voluntary redundancy rather than spend another year in meetings about meetings. He was interviewing at two companies and felt lukewarm about both.

Maya bought him coffee at a cafe in Leederville and pitched.

Tom listened with the particular attentiveness of someone who builds systems for a living. He asked good questions. How many farms? What’s the delivery radius? How do you handle seasonal variation? What’s the tech stack?

Maya answered what she could and was honest about what she couldn’t. “I don’t have all the answers. I’ve got a spreadsheet, a farming contact who hasn’t said no yet, and an idea that I can’t stop thinking about.”

Tom stirred his coffee. “You know the success rate for food startups?”

“I know it’s terrible.”

“And you want me to leave a stable job market for this?”

“You don’t have a stable job. You have two interviews you described as, what was the word, ‘uninspiring.’”

Tom laughed. It was the first genuine laugh Maya had seen from him in months. “When do you need an answer?”

“Yesterday.”

He looked at his coffee. Then at Maya. Then at something in the middle distance that might have been the future or might have been the memory of all those side projects he’d built because the work that paid him wasn’t the work that interested him.

“Yeah, all right. I’m in.”

Sam

Sam Okafor was Maya’s cousin on her mother’s side; her mother’s sister had married a Nigerian engineer who’d moved to Perth in the nineties. Sam had grown up in Baldivis, studied business, and spent six years running logistics for a trucking company in Kewdale. She knew supply chains the way Tom knew code: from the inside, with the kind of practical knowledge that doesn’t come from textbooks.

Sam was twenty-nine, competent, restless, and thoroughly bored. The trucking company moved the same cargo along the same routes on the same schedule, and the only variation was which driver called in sick. She’d been talking about leaving for a year. When Maya called, Sam didn’t need the full pitch.

“What’s the job?”

“Everything that isn’t code or farming. Marketing. Operations. Customer support. Logistics.”

“That’s four jobs.”

“It’s a startup. Everything is four jobs.”

Sam was quiet for a moment. “What’s the pay?”

Maya told her. Sam made a sound that was somewhere between a laugh and a cough.

“That’s a 60% pay cut.”

“I know. I’m asking a lot.”

“You’re asking me to give up a salary to pack vegetables in your living room.”

“I’m asking you to help me build something that matters. The salary comes later. If it works.”

Sam thought about the trucking company. The same routes. The same cargo. The same conversations in the same break room. She thought about the spreadsheet Maya had shown her over family dinner last month: the one with the revenue projections and the subscriber targets and the note at the bottom that said Break-even: Month 14 (optimistic).

“When do I start?”

“Monday.”

The living room

They started on a Monday in February. Three people, three laptops, Maya’s living room in Fremantle. The coffee table was their desk. The whiteboard was a sheet of butcher’s paper taped to the wall behind the couch. Nadia, who worked as a physiotherapist and kept sensible hours, came home that first evening to find her living room converted into an office.

“How long is this going to last?” she asked, stepping over a power cable.

“Not long. We’ll get an office soon.”

Nadia looked at the three people hunched over laptops on her couch. “Define ‘soon.’”

“A few weeks?”

It was six weeks. Nadia never complained, though she did start leaving passive-aggressive notes on the fridge about the milk disappearing faster than usual. She also started making extra coffee in the mornings, enough for four, without being asked. That was Nadia. She expressed love in practical gestures and expected Maya to understand what they meant.

The first week was all planning. Maya laid out the business model on the butcher’s paper in her neat handwriting (the same handwriting from the market flyer): farms commit weekly availability, customers subscribe to a box size, Greenbox matches supply to demand, packs the boxes, and delivers. Revenue comes from the subscription margin: the difference between what they pay the farms and what the customer pays. Simple, she said. Straightforward.

Tom and Sam looked at each other. They’d both been around long enough to know that “simple” and “straightforward” were the words people used right before discovering that something was neither.

Tom listened and started sketching a data model on the butcher’s paper. Subscription. Customer. Farm. Produce. Order. Box. The entities came easily. The relationships between them were where the complexity lived.

Sam started on logistics. Delivery routes. Courier options. Packing materials. Cold chain timing: how long could produce sit in a box before it deteriorated? She called four courier companies and got quotes that ranged from expensive to absurd. One of them wanted a minimum of two hundred deliveries per week. Sam explained they’d be starting with about twenty. The line went quiet, then polite. She started a spreadsheet that would, over the next year, become the operational backbone of the company. It had twelve tabs by Friday.

Maya called Dave. “We’re starting.”

“Starting what?”

“Building it. The app, the website, the operations. All of it.”

A pause. “You haven’t got any customers yet.”

“We will.”

“Lot of confidence for someone with three laptops and no office.”

“We’ve got a living room. It’s practically the same thing.”

Dave’s silence was eloquent. Then: “I’ll have some produce ready when you need it. Don’t make me regret it.”

Maya put the phone on the kitchen counter and looked at Tom and Sam. “He’s in.”

“That didn’t sound like ‘in,’” Tom said.

“For Dave, that was a standing ovation.”

By Friday of the first week, Tom had a rough architecture sketched out. A web app for customer subscriptions. A portal for farms to submit their weekly availability. A matching engine to connect supply to demand. A basic admin panel for Maya to manage everything else. He’d been researching LLM-assisted development. The new code generation tools were getting impressive reviews, and he was itching to try them on a real project.

“I reckon I can have a working prototype in two weeks,” he said.

Sam raised her eyebrows. “Two weeks?”

“The code generation tools are incredible. You describe what you want and they build it. I saw a demo where a guy built a full e-commerce site in an afternoon.”

Maya looked at the butcher’s paper covered in entity relationships and arrows and questions. “That sounds fast.”

“That’s the point.”

The following Monday, Tom opened his laptop, fired up Claude in one browser tab and his IDE in the other, and started building.

Bash Pipes Execute in Subshells

2014-10-02T00:00:00+08:00

Here’s a gotcha that caught me out this week.

I had code like this, used to source settings from scripts stored in another directory:

find /etc/application -name "*.sh" -type f | while IFS= read -r FILE; do
  source "$FILE"
done

exec /path/to/application/run.sh

Inside /etc/application/set_name.sh I’d have something like:

export SOME_VARIABLE="some value"

But when the application ran, it never saw the value of SOME_VARIABLE. Puzzling.

The reason: in bash, each command in a pipeline runs in its own subshell. The while read loop on the right side of the pipe is no exception, so that’s where source executes. A subshell can’t modify the environment of its parent process, so the exported variables vanish the moment the subshell exits.

The fix is to make sure source runs in the main process. You can do this with process substitution and input redirection instead of a pipe:

while IFS= read -r FILE; do
  source "$FILE"
done < <(find /etc/application -name "*.sh" -type f)

exec /path/to/application/run.sh

Now the while read loop runs in the main shell, source sets the variables in the right place, and the application sees everything it expects.
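
The same rule applies to any child process, not just bash subshells. Here’s a minimal Ruby sketch of it, assuming a POSIX sh is on the PATH. SOME_VARIABLE mirrors the example above; the other names are mine, purely for illustration.

```ruby
# A child process gets a copy of the environment. Changes made in the
# child are lost when it exits.
ENV.delete("SOME_VARIABLE")

# The export happens inside the child `sh`, much like the subshell on
# the right-hand side of the pipe.
system("sh", "-c", 'export SOME_VARIABLE="some value"')
value_after_child = ENV["SOME_VARIABLE"]

# Setting the variable in the current process is what children started
# afterwards will inherit.
ENV["SOME_VARIABLE"] = "some value"
seen_by_child = `sh -c 'printf %s "$SOME_VARIABLE"'`

puts value_after_child.inspect
puts seen_by_child.inspect
```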

Pooling ActiveMQ Connections for Camel

2012-09-30T00:00:00+08:00

In my previous camel.xml I used the following XML to set up the connection to ActiveMQ:

    <bean id="activemq" class="org.apache.activemq.camel.component.ActiveMQComponent" >
      <property name="connectionFactory">
        <bean class="org.apache.activemq.ActiveMQConnectionFactory">
          <property name="brokerURL" value="vm://zuu:61613?create=false&amp;waitForStart=10000" />
        </bean>
      </property>
    </bean>

While this works, every time a message is sent Camel opens a new connection to the broker. I know I’m going to be sending a lot of messages, and I’d rather not waste time opening and closing connections for each one. A connection pool is the obvious fix.

By wrapping the ActiveMQConnectionFactory in a PooledConnectionFactory, I can maintain a pool of up to 8 connections that stay open and get returned to the pool (rather than closed) after each message is sent:

    <bean id="activemq" class="org.apache.activemq.camel.component.ActiveMQComponent" >
      <property name="connectionFactory">
        <bean id="pooledConnectionFactory" class="org.apache.activemq.pool.PooledConnectionFactory">
          <property name="maxConnections" value="8" />
          <property name="connectionFactory">
            <bean class="org.apache.activemq.ActiveMQConnectionFactory">
              <property name="brokerURL" value="vm://zuu:61613?create=false&amp;waitForStart=10000" />
            </bean>
          </property>
        </bean>
      </property>
    </bean>

A small change, but it makes a real difference under load.

A Basic ServiceMix Install

2012-09-10T00:00:00+08:00

Over the past several years I’ve frequently used ActiveMQ and Camel as a message broker and integration platform for my applications. They handle the glue and the message delivery so I can focus on what’s really interesting: solving business problems. Apache ServiceMix provides an OSGi container in which I can run, configure, and manage Camel and ActiveMQ instances, and I want to explore the other services it can provide.

The full ServiceMix install is rather large. I don’t need most of it yet, and I don’t want to be running services I don’t understand, so I’m starting with a very minimal install and building from there.

At the time of writing the most recent ServiceMix release is 4.4.2, so I’ll download, unpack, and run that:

$ curl -L -O http://www.mirrorservice.org/sites/ftp.apache.org/servicemix/servicemix-4/4.4.2/apache-servicemix-minimal-4.4.2.tar.gz
$ tar -xzvf apache-servicemix-minimal-4.4.2.tar.gz
$ cd apache-servicemix-4.4.2/
$ ./bin/servicemix

Let’s verify what ships in the minimal install and make sure there are no surprises:

karaf@root> features:list --installed
State         Version   Name            Repository  Description
[installed  ] [2.2.4  ] karaf-framework karaf-2.2.4
[installed  ] [2.2.4  ] config          karaf-2.2.4

Not much — just a basic Karaf install, pre-configured with the ServiceMix Maven repositories:

karaf@root> features:listurl
 Loaded   URI
  true    mvn:org.apache.karaf.assemblies.features/standard/2.2.4/xml/features
  true    mvn:org.apache.servicemix/apache-servicemix/4.4.2/xml/features
  true    mvn:org.apache.activemq/activemq-karaf/5.5.1/xml/features
  true    mvn:org.apache.camel.karaf/apache-camel/2.8.5/xml/features
  true    mvn:org.apache.cxf.karaf/apache-cxf/2.4.6/xml/features
  true    mvn:org.apache.karaf.assemblies.features/enterprise/2.2.4/xml/features
  true    mvn:org.apache.servicemix.nmr/apache-servicemix-nmr/1.5.0/xml/features

I can build on this by adding the features I need. To start, I definitely need Camel and ActiveMQ since those are the foundation of my integration layer. I’m used to configuring them with Spring, so I’ll use the *-spring variants rather than the *-blueprint variants more commonly used in ServiceMix.

First, I need to install some OSGi bundles that Camel depends on. I’m not yet sure how to configure ServiceMix to pull these in automatically — I suspect I need to add the correct feature URL, but I haven’t figured out which one. Please get in touch if you can explain.

karaf@root> osgi:install -s mvn:org.apache.geronimo.specs/geronimo-activation_1.1_spec/1.0.2
karaf@root> osgi:install -s mvn:org.apache.servicemix.specs/org.apache.servicemix.specs.stax-api-1.0/1.1.0
karaf@root> osgi:install -s mvn:org.apache.servicemix.specs/org.apache.servicemix.specs.jaxb-api-2.1/1.1.0
karaf@root> osgi:install -s mvn:org.apache.servicemix.bundles/org.apache.servicemix.bundles.jaxb-impl/2.1.6_1
karaf@root> osgi:install -s mvn:org.apache.servicemix.bundles/org.apache.servicemix.bundles.xstream/1.3_4
karaf@root> osgi:install -s mvn:org.apache.servicemix.bundles/org.apache.servicemix.bundles.joda-time/1.5.2_3
karaf@root> osgi:install -s mvn:org.apache.servicemix.bundles/org.apache.servicemix.bundles.jdom/1.1_3
karaf@root> osgi:install -s mvn:org.apache.servicemix.bundles/org.apache.servicemix.bundles.dom4j/1.6.1_3

Now I can install ActiveMQ and Camel:

karaf@root> features:install spring
karaf@root> features:install camel-core
karaf@root> features:install camel-spring
karaf@root> features:install activemq-spring

I also need the camel-activemq component so Camel can talk to ActiveMQ:

karaf@root> features:install camel-activemq

With everything installed, I set up the broker. I’m telling it to use the name zuu (the name of my laptop, but it can be anything):

karaf@root> activemq:create-broker --name zuu

Creating file: @|green /Users/craig/code/tmp/apache-servicemix-4.4.2/deploy/zuu-broker.xml|

Default ActiveMQ Broker (zuu) configuration file created at: /Users/craig/code/tmp/apache-servicemix-4.4.2/deploy/zuu-broker.xml
Please review the configuration and modify to suite your needs.

0

The default configuration sets up a Stomp transport on port 61613, which I’ll use from my Ruby (and other language) clients. No changes needed, although I could remove the OpenWire connector on port 61616 if I wanted.

Configuring Camel is a touch more involved. I need to drop a camel.xml file into the deploy/ subdirectory of the ServiceMix install:

<beans xmlns="http://www.springframework.org/schema/beans"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="
http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans-2.0.xsd
http://camel.apache.org/schema/spring http://camel.apache.org/schema/spring/camel-spring-2.8.5.xsd">

  <camelContext id="camel" xmlns="http://camel.apache.org/schema/spring">
    <route id="tick-tock">
      <from uri="timer://tick-tock-timer?fixedRate=true&amp;period=5000" />
      <to uri="log:tick-tock-log" />
      <to uri="activemq:topic:tick-tock" />
    </route>
  </camelContext>

  <bean id="activemq" class="org.apache.activemq.camel.component.ActiveMQComponent" >
    <property name="connectionFactory">
      <bean class="org.apache.activemq.ActiveMQConnectionFactory">
        <property name="brokerURL" value="vm://zuu?create=false&amp;waitForStart=10000" />
        <property name="userName" value="${activemq.username}"/>
        <property name="password" value="${activemq.password}"/>
      </bean>
    </property>
  </bean>
</beans>

I can verify the route is running by checking the logs:

karaf@root> log:tail
2012-09-10 22:39:19,603 | INFO  | tick-tock-timer  | tick-tock-log      | ? ? | 54 - org.apache.camel.camel-core - 2.8.5 | Exchange[ExchangePattern:InOnly, BodyType:null, Body:[Body is null]]
2012-09-10 22:39:19,603 | INFO  | tick-tock-timer  | TransportConnector | ? ? | 79 - org.apache.activemq.activemq-core - 5.5.1 | Connector vm://zuu Started
2012-09-10 22:39:19,606 | INFO  | tick-tock-timer  | TransportConnector | ? ? | 79 - org.apache.activemq.activemq-core - 5.5.1 | Connector vm://zuu Stopped

I can also hook up a Ruby client to listen to the topic:

require 'rubygems'
require 'stomp'

STDOUT.sync = true
c = Stomp::Client.new 'stomp://127.0.0.1:61613'
c.subscribe '/topic/tick-tock' do |m|
  puts m.headers.inspect
end
c.join

Running that produces a steady stream of messages in the console:

$ ruby ./client.rb
{"message-id"=>"ID:zuu.local-53690-1347226528097-2:42161:1:1:1", "breadcrumbId"=>"ID-zuu-local-54049-1347228987046-12-310", "destination"=>"/topic/tick-tock", "timestamp"=>"1347313904606", "expires"=>"0", "subscription"=>"587e9bbe3714dfd10b3cfe9837a1fb7daac2d8b2", "priority"=>"4", "firedTime"=>"Mon Sep 10 22:51:44 BST 2012"}
{"message-id"=>"ID:zuu.local-53690-1347226528097-2:42162:1:1:1", "breadcrumbId"=>"ID-zuu-local-54049-1347228987046-12-312", "destination"=>"/topic/tick-tock", "timestamp"=>"1347313909605", "expires"=>"0", "subscription"=>"587e9bbe3714dfd10b3cfe9837a1fb7daac2d8b2", "priority"=>"4", "firedTime"=>"Mon Sep 10 22:51:49 BST 2012"}

I can now tinker with zuu-broker.xml and camel.xml, and every time I save, ServiceMix picks up the change and restarts the appropriate bundle.

I now have a basic ServiceMix install providing what I’m used to. Time to explore.

Logging Considerations

2011-12-14T00:00:00+08:00

From years of bitter experience staring at log files trying to work out what turned the servers into a pile of molten rubble, I’ve built up a list of what I really like to see when a process logs its activity. Prompted by some discussions at work, I’d like to share it — in the hope of raising the quality of logging in our software and saving someone a lot of stress at 3am when their project absolutely refuses to work and they can’t figure out why.

This is explicitly not about logging frameworks. I don’t particularly care how logging is implemented in code, since that will necessarily differ by language and application architecture. I just care about the outcome.

Each process logs to one log file

I don’t want to jump back and forth between log files, interleaving lines, trying to reconstruct what happened when. It’s hard enough to work out what’s going on at the best of times. Let’s not make it harder.

Each log file has one process, one thread writing to it

When two or more processes or threads write to the same file, it’s difficult to isolate what the process you care about actually did. You can work around this by adding a token to each log line — a process or thread ID, for instance — but there are deeper complications.

Say thread A logs something at exactly the same time as thread B. Halfway through thread A writing its log entry, thread B becomes active and starts logging. Thread B eventually yields, and thread A resumes. What a mess. Thread A’s log entry now has thread B’s log entry spliced right through the middle. Who said what? Nightmare.

When you can’t avoid having multiple threads or processes writing to a log file, they should talk to a logging arbitrator service that manages the file and ensures entries are written atomically.
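
Within a single process, one way to sketch such an arbitrator is a queue with exactly one writer thread. This is an illustration under my own assumptions, not a production design: LogArbitrator and demo_arbitrator are hypothetical names, and arbitrating between separate processes would need a real service in front of the file.

```ruby
require "stringio"

# Hypothetical single-writer log arbitrator: producers push complete
# entries onto a queue, and one thread owns the IO and writes them.
class LogArbitrator
  def initialize(io)
    @io = io
    @queue = Queue.new
    @writer = Thread.new do
      # Queue#pop returns nil once the queue is closed and drained.
      while (entry = @queue.pop)
        @io.write(entry)
      end
    end
  end

  # Safe to call from any thread; each entry is written in one piece.
  def log(message)
    @queue << "#{message}\n"
  end

  # Stop accepting entries and wait for the writer to drain the queue.
  def close
    @queue.close
    @writer.join
  end
end

# Several threads log concurrently; returns the lines that were written.
def demo_arbitrator(thread_count: 4, entries_per_thread: 50)
  io = StringIO.new
  arbitrator = LogArbitrator.new(io)
  threads = thread_count.times.map do |t|
    Thread.new do
      entries_per_thread.times { |i| arbitrator.log("thread=#{t} entry=#{i}") }
    end
  end
  threads.each(&:join)
  arbitrator.close
  io.string.lines
end
```

Because only the writer thread ever touches the IO, no entry can be spliced through the middle of another.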

Logging is not buffered

When it comes to logging, I prefer completeness over speed. If the process dies or is killed, I don’t want the last few log entries sitting in a memory buffer somewhere — I want them on disk where I can read them. If my process does something, I should be able to see it immediately, not after waiting for a buffer to fill or a flush interval to expire.

Each log line has a timestamp

This seems obvious, but I’m amazed by how often it doesn’t happen. A log file without timestamps is useless unless you happen to be watching it when something goes wrong.

Each timestamp is at sub-second resolution

Logging with timestamps accurate only to one second is maddening when you’re dealing with thousands of entries per second. Which of those caused the issue? Good luck.

Given the choice, I’d like to be able to configure the logger to print to STDOUT so I can use svlogd to add tai64n timestamps (and do much more besides). But just do something reasonably sane and I’ll be happy.
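
As a rough illustration of the last two points in Ruby, the sketch below writes to STDOUT without Ruby-level buffering and stamps each entry to the microsecond. build_logger and the exact timestamp layout are my own choices, and setting sync only bypasses Ruby’s buffer, so anything downstream can still buffer.

```ruby
require "logger"
require "stringio"

# Build a logger that writes each entry straight through to `io` and
# stamps it with a UTC time at microsecond resolution.
def build_logger(io)
  io.sync = true # bypass Ruby's internal write buffer for this IO
  logger = Logger.new(io)
  logger.formatter = proc do |severity, time, _progname, message|
    "#{time.utc.strftime('%Y-%m-%dT%H:%M:%S.%6NZ')} #{severity} #{message}\n"
  end
  logger
end

logger = build_logger($stdout)
logger.info("worker started")
```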

Each log file has a guaranteed maximum size

Ever seen what happens when a process tries to start and the disk is full of old logs? Generally, it doesn’t work. That’s annoying.

Older logs should be archived elsewhere. Only recent logs should live on local disk for easy debugging. Your definition of “older” and “recent” will vary, but knowing you’re keeping, say, 1GB of logs on disk means you can ensure there’s always enough space.

Ad hoc or interval-based log rotation doesn’t help here, because by the time the log is rotated the disk is already full. Processes like svlogd and some syslog implementations handle this properly.
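
If the process has to manage its own files, Ruby’s standard Logger can rotate by size instead of by interval. This is a stand-in for what svlogd does, not a replacement for it. capped_logger, demo_capped_logging, and the tiny limits are assumptions made to keep the example short.

```ruby
require "logger"
require "tmpdir"

# Keep at most `keep` files of roughly `max_bytes` each. Logger checks
# the size before each write, so a file can overshoot by one entry.
def capped_logger(path, keep: 3, max_bytes: 1024)
  Logger.new(path, keep, max_bytes)
end

# Write far more than the cap allows and list what is left on disk.
def demo_capped_logging(entries: 200)
  Dir.mktmpdir do |dir|
    logger = capped_logger(File.join(dir, "app.log"))
    entries.times { |i| logger.info("entry #{i} " + "x" * 80) }
    logger.close
    Dir.children(dir).sort
  end
end
```

The worst-case disk usage is then about keep times max_bytes, plus one entry of overshoot per file.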

There’s some way of processing a log file once it reaches a known size

At some point I’ll want to analyse and archive log files. It’s annoying to miss a rotation trigger and discover I’ve lost several hours of entries. Please don’t make me track file rotation myself. I’ll do it badly, and that will make me sad.

Most of these preferences are legacies of working with (and being spoiled by) DaemonTools and Runit, coming from a Rails-centric, Mongrel-running world where there’s typically one process logging to one file. If I’ve missed something, please let me know — email address is below.

How I Structure RubyGems

2011-12-13T00:00:00+08:00

I haven’t been consistent in how I structure my RubyGems, and I want to be. Consistency means I know what to provide, and people who use my code know what to expect.

These are guidelines for my future self.

The require statement follows the gem name

You should be able to figure out the require path just by looking at the gem name:

A gem called nyan_cat: require "nyan_cat"
A gem called nyan-cat: require "nyan/cat"
A gem called nyan_cat-moar_cats: require "nyan_cat/moar_cats"

File structure

A basic project layout should look like this:

your-rubygem/
            |- bin/
            |- lib/
            |- tests/
            |- Gemfile
            |- Rakefile
            |- README
            |- LICENCE
            \- your-rubygem.gemspec

The code that provides your gem’s functionality lives under lib/ in a directory named according to these rules:

A gem called nyan_cat: lib/
A gem called nyan-cat: lib/nyan/
A gem called nyan_cat-moar_cats: lib/nyan_cat/

The file required by the gem name rule above should sit in that directory:

nyan_cat -> lib/nyan_cat.rb
nyan-cat -> lib/nyan/cat.rb
nyan_cat-moar_cats -> lib/nyan_cat/moar_cats.rb

This file should require everything needed for the gem to work.

The Gemfile should contain just gemspec, and Gemfile.lock should not be checked in for gems. Yehuda Katz has a good write-up on the roles of the gemspec and Gemfile.

Code structure and namespace

Your gem should have a namespace that matches the directory structure:

nyan_cat -> NyanCat
nyan-cat -> Nyan::Cat
nyan_cat-moar_cats -> NyanCat::MoarCats

Everything should live under this namespace.
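
The naming rules so far are mechanical enough to write down as code. The two helpers below are hypothetical, written only to show how the gem name maps to the require path and the namespace.

```ruby
# Hypothetical helpers that encode the naming rules: "-" separates
# namespaces, "_" separates words within a namespace.
def require_path(gem_name)
  gem_name.tr("-", "/")
end

def namespace(gem_name)
  gem_name.split("-").map { |part|
    part.split("_").map(&:capitalize).join
  }.join("::")
end

%w[nyan_cat nyan-cat nyan_cat-moar_cats].each do |name|
  puts "#{name}: require #{require_path(name).inspect}, namespace #{namespace(name)}"
end
```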

Versioning

Provide a version.rb file containing the current version and nothing else. Be kind to the people who depend on your gem — stick to the Semantic Versioning scheme.

nyan_cat -> lib/nyan_cat/version.rb
nyan-cat -> lib/nyan/cat/version.rb
nyan_cat-moar_cats -> lib/nyan_cat/moar_cats/version.rb

An example version.rb for nyan_cat-moar_cats:

module NyanCat
  module MoarCats
    VERSION = "0.0.1"
  end
end

Tests

Your code should be tested. You don’t need to distribute the tests in the gem file itself, though.

Logging

Unless you’re providing a logger implementation, it’s not your job to configure logging. Logging is good and incredibly useful for debugging, so the answer isn’t to avoid it. What I want is to give your code a logger that I’ve configured to my liking. It will support the standard Logger interface. Please make the logger an option — let me pass mine to you — and stop worrying about logging configuration.

You can do this easily by defaulting to NullLogger when no logger is provided:

require "null_logger"

class Foo
  attr_accessor :logger
  private :logger=, :logger

  def initialize bar, options = {}
    self.logger = options[:logger] || NullLogger.instance
  end

  def quux
    logger.info "Called #quux"
  end
end

Find out more about NullLogger at http://github.com/craigw/null_logger.
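
From the caller’s side it might look like the sketch below. To keep it runnable without extra gems I’ve swapped NullLogger for Logger.new(nil), which the standard library treats as “discard everything”. That swap, and the names log_output and my_logger, are mine.

```ruby
require "logger"
require "stringio"

class Foo
  def initialize(bar, options = {})
    @bar = bar
    # Logger.new(nil) discards everything; a stdlib stand-in for NullLogger.
    @logger = options[:logger] || Logger.new(nil)
  end

  def quux
    @logger.info "Called #quux"
  end
end

# The caller owns the configuration: destination, level, format.
log_output = StringIO.new
my_logger = Logger.new(log_output)
my_logger.level = Logger::INFO

Foo.new(:bar).quux                     # no logger given: stays silent
Foo.new(:bar, logger: my_logger).quux  # written to log_output
```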

Dependencies

Read about your dependencies’ versioning schemes. If they use Semantic Versioning (and hopefully they do), depend on the appropriate version. Read about using the pessimistic version constraint operator to depend on major, minor, or exact versions as appropriate.
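
To see what the pessimistic operator actually permits, you can ask RubyGems directly. The allows? helper and the version numbers are illustrative only.

```ruby
require "rubygems"

# True when `version` satisfies the constraint string.
def allows?(constraint, version)
  Gem::Requirement.new(constraint).satisfied_by?(Gem::Version.new(version))
end

# "~> 1.2" accepts any 1.x from 1.2 upwards, but not 2.0.
minor_ok      = allows?("~> 1.2", "1.9.0")
major_blocked = allows?("~> 1.2", "2.0.0")

# "~> 1.2.3" accepts later patch releases of 1.2 only.
patch_ok      = allows?("~> 1.2.3", "1.2.9")
minor_blocked = allows?("~> 1.2.3", "1.3.0")

# "= 1.2.3" pins an exact version.
exact_only    = allows?("= 1.2.3", "1.2.4")
```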

Rake tasks

Provide tasks to run your tests. The default rake task should run all tests.
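
A minimal Rakefile along these lines might look like the sketch below. The tests/ directory comes from the layout above; the _test.rb suffix and the task name are assumptions, so adjust them to whatever your test files are called.

```ruby
require "rake"
require "rake/testtask"

# Run every file matching tests/**/*_test.rb. The "_test.rb" suffix is
# an assumption; the layout above only fixes the tests/ directory.
Rake::TestTask.new(:test) do |t|
  t.libs << "tests"
  t.pattern = "tests/**/*_test.rb"
end

# Plain `rake` runs the tests.
task default: :test
```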

README

Include at minimum:

A brief description of the gem, ideally with an example of the problem it solves
Installation instructions, even if they’re just gem install foo or “add this to your Gemfile”
A brief usage example, possibly with a link to more detailed documentation
Licensing info, even if it’s just “see the LICENCE file”
A “how to contribute” section explaining how to submit patches
A list of authors (it’s nice to see your name there)

Licensing

If you don’t provide a licence, I can’t use your project, because I don’t know the terms under which it’s available. I really want to use your project. Please provide a licence.

Principles of Service Design: Program to an Interface

2011-12-05T00:00:00+08:00

I’ve been thinking a lot about service design recently, and one of the trickier problems is deciding how to implement and version a service so that supporting (or dropping) older versions is straightforward. It turns out that advice originally meant for writing code works brilliantly for writing services too: program to an interface.

Program to an Interface

Borrowing from many blogs and books, I’ve come to believe that viewing a service interface the same way you’d view a programming interface is the correct move. Interfaces can be versioned. They isolate client code from the implementation behind them. And once published, a given version should be immutable.

A service should hide its implementation details. If a database table changes inside the application providing the service, the clients of that service shouldn’t have to care.

Just like interfaces in a programming language, by specifying a well-known interface to a service we free ourselves from worrying about how clients interact with it. When we need to change the implementation, we can do so without breaking anyone. And the reverse is also true: clients don’t need to worry about implementation changes as long as the interface stays consistent.

Of course, interfaces sometimes have to change to support new functionality. When they do, we want to be confident we’re using the correct version. Just because v2 has been released doesn’t mean our clients automatically support it. We want to keep using v1 until we’re ready to update. Versioning gives us that choice.

Beyond URIs

When we think of a service, we usually think of a web service. In these RESTful days the interface is generally thought to be the combination of URIs we interact with. But that’s not the full picture. Supporting an interface doesn’t just mean your URIs are stable between versions — it also means the content returned from service calls (i.e. HTTP responses) conforms to a defined structure.

And of course, a web service is only one kind of service. Plenty of services don’t have a web interface at all, usually in situations where synchronous request-response messaging isn’t appropriate. Order processing, inventory management, fraud checks — these might take several seconds and are better handled asynchronously. The interfaces to these services should be versioned for the same reasons a web service’s should.

The version of the interface should be detectable with each message passed or received, no matter the transport. We should know that we’re dealing with version 3 of an API, not guess.

MIME types to the rescue

Handily, versioning interfaces for these types of services is pretty much the ideal use case for a MIME type, and most message transports support custom MIME types:

HTTP 1.1 supports a Content-Type header
As does Stomp 1.1
And AMQP 1.0 (search for “content-type”)

MIME types have a space reserved for vendor-specific types, application/vnd, inside which we’re free to define our own. There are a few conventions to follow to avoid name collisions: include your organisation name, a very short description of what you’re representing, a version number, and a base format.

A worked example

Say you work at the Acme Toy Company. When your web service accepts an order via its RESTful interface, it puts a message on a queue with four fields — customer_id, purchase_order_id, amount, and description — in JSON:

{
  "customer_id": 123,
  "purchase_order_id": "ASLA-001-2031",
  "amount": 1000,
  "description": "100 x Acme Toy Dynamite"
}

We coin a MIME type, application/vnd.acme.order-v1+json, and publish an interface specification saying any message claiming to be this type will have these four fields. Then in a consumer of the orders queue we use the Selective Consumer pattern to subscribe only to messages of this MIME type. Inside the consumer we can be confident that we’ll only receive orders in a format we understand and can process. Partners can POST with this MIME type in the Content-Type header so everyone knows what they’re talking about all the way through the system.
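
Because the version lives in the type name, a consumer can read it rather than guess. Here’s a hedged sketch of a parser for types shaped like the one above. The pattern, parse_interface_type, and InterfaceType are my own inventions for illustration, not part of any standard.

```ruby
# Hypothetical pattern for vendor types shaped like
# application/vnd.<organisation>.<name>-v<version>+<format>
VENDOR_TYPE = %r{
  \Aapplication/vnd\.
  (?<organisation>[a-z0-9]+)\.
  (?<name>[a-z0-9_.-]+?)
  -v(?<version>\d+)
  \+(?<format>[a-z0-9]+)\z
}x

InterfaceType = Struct.new(:organisation, :name, :version, :format)

# Returns an InterfaceType, or nil when the string does not follow the scheme.
def parse_interface_type(mime_type)
  match = VENDOR_TYPE.match(mime_type)
  return nil unless match

  InterfaceType.new(match[:organisation], match[:name],
                    Integer(match[:version]), match[:format])
end
```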

A few months pass. Several partners are using the order API, but we want to automate our stock inventory, so instead of a plain text description we want item IDs. We don’t want to force this change on our partners, though — their development cycle is slow and they’re sending us plenty of orders. We like their cash.

So we publish a second version, application/vnd.acme.order-v2+json, defining messages like this:

{
  "customer_id": 123,
  "purchase_order_id": "ASLA-001-2031",
  "amount": 1000,
  "items": [
    { "item_id": 1032, "quantity": 100 }
  ]
}

It’s now trivial to add a second Selective Consumer that handles only v2 messages and updates inventory accordingly. The v1 consumer keeps running happily with v1 orders. There’s a smooth, unhurried migration path for clients from v1 to v2. We can support both versions or drop older ones as we choose. We could even use a combination of Splitter, Translator, and Enricher to route v2 messages into the v1 consumer while splitting off inventory management messages to a separate, lightweight consumer. None of this matters to our partners, because they know they’re working to the interface we’ve defined.
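
To make the idea concrete, here’s an in-process approximation of two selective consumers: a table of handlers keyed by MIME type. A real broker would do the filtering before the message ever reaches the consumer. The constant and method names, and the summary hashes the handlers return, are assumptions for the sake of the example.

```ruby
require "json"

ORDER_V1 = "application/vnd.acme.order-v1+json"
ORDER_V2 = "application/vnd.acme.order-v2+json"

# One handler per published interface version. Each returns a small
# summary hash here; real consumers would do the actual processing.
HANDLERS = {
  ORDER_V1 => ->(order) { { version: 1, description: order.fetch("description") } },
  ORDER_V2 => ->(order) {
    items = order.fetch("items").map { |i| [i.fetch("item_id"), i.fetch("quantity")] }
    { version: 2, items: items }
  }
}.freeze

# Route a message by its declared content type; nil means "not ours".
def consume(content_type, body)
  handler = HANDLERS[content_type]
  handler && handler.call(JSON.parse(body))
end
```

Adding v3 later means adding one more entry to the table. The v1 and v2 handlers don’t change.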

When the transport changes

We might eventually decide that the RESTful order service isn’t appropriate for v3; perhaps we’ve been won over by WebSockets. When we receive an order claiming to be v1 or v2 on the RESTful service, we can still happily accept it. If we receive anything else, we return an HTTP 415 (Unsupported Media Type) to tell the client we can’t accept v3 orders that way.

In contrast, if we’d used plain application/json we’d have to guess based on message fields which version the client intended. That’s barely practical with the trivial example above, and once there are several versions of the interface it becomes a nightmare.

Painting with Constable

2011-10-05T00:00:00+08:00

ImageMagick annoys me. Not because of what it does — functionally, it’s the bee’s knees — but because installing it is a pain. Like many Ruby developers, I tend to develop on a Mac, an operating system without much of an official package manager. Installing tools with complex dependencies like ImageMagick gets tedious fast. I long for the days when I can apt-get install imagemagick while still enjoying all the lovely hardware a Mac provides.

Over the years I’d built up some solid experience with virtualisation, and then Vagrant came along and made it trivially easy to run Ubuntu on my Mac. Suddenly I had access to apt and proper ImageMagick packages. Unfortunately I couldn’t use them natively on my Mac — they were only accessible from inside the VM. Better than nothing, so for command-line image manipulation I’d been getting by.

During my more recent work I’d spent a fair amount of time with messaging, exposing services on a message bus for remote clients. There’s something really satisfying about not having to worry about the implementation of a service — just knowing that a message in a certain format sent to a certain destination will get the job done. So I tried exactly that for ImageMagick: exposing it on my VM as a service on the bus. It worked well enough that I threw up a project on GitHub and released a RubyGem. The project is called Constable — the README explains why.

Constable is very nearly a drop-in replacement for ImageMagick. After installing the gem and setting up the service, you can use the same ImageMagick commands to do a lot of the same stuff that a local install would let you do. There are some caveats, of course. Output must (at the moment, at least) be streamed to STDOUT. ImageMagick supports this by letting your output filename take the form format:-, e.g. jpg:- or png:-. There may be other shortcomings I haven’t hit, possibly because while I use ImageMagick a lot, I don’t use it in particularly complex ways. If you come across any problems, let me know. If you can submit a patch, even better.

Setting up an example service is covered in the “Up and running fast” section of the README, so I’ll skip that and run through a quick demo: creating a couple of JPEGs with text in them, joining them side by side, and producing a PNG at 50% of the original dimensions.

First, make sure the service is up as described in the README.

Second — a bit of an undocumented easter egg at the moment — install the binstubs for the ImageMagick services:

sudo constable-install

Note that this will overwrite the following files if they exist:

/usr/bin/identify
/usr/bin/convert
/usr/bin/compare
/usr/bin/composite

This step is optional but gives you the same command names that ImageMagick uses. If you’d rather skip it, just prefix each command with constable- and add a double dash immediately after the command name, e.g. convert foo.jpg png:- becomes constable-convert -- foo.jpg png:-.

Now, on to actually using the service. Creating text-based images is straightforward with the convert command. Here I create two JPEGs — one in blue tones with the text “Anthony”, and another in pink tones with the text “Cleopatra”:

convert -background lightblue -fill blue -font Candice -pointsize 72 \
  label:Anthony   jpg:- > anthony.jpg
convert -background pink      -fill red  -font Candice -pointsize 72 \
  label:Cleopatra jpg:- > cleopatra.jpg

They won’t win any design awards, but they’re good enough for a demo:

To combine them, we use convert with +append, which joins the images left to right:

convert anthony.jpg cleopatra.jpg +append jpg:- \
  > anthony_and_cleopatra.jpg

Resulting in this magnificent creation:

Finally, we can convert the combined image to a PNG at 50% of its original size:

convert anthony_and_cleopatra.jpg -resize 50% png:- \
  > anthony_and_cleopatra.png

Which outputs the smaller PNG:

This demo works identically whether you’re running the service in a VM or using a local ImageMagick install. I’m rather happy with that. But there’s no reason to stop here — why not expose this as a proper remote service and write a plugin for AttachmentFu or Paperclip that uses it to offload image processing entirely? Get that heavy lifting out of the request-response cycle!

The code is out there. It’s rough around the edges but it works. Let me know if you find it useful, and if you’d like it to do something it can’t yet, patches and suggestions are very welcome.

A Simple SOCKS Proxy Using SSH

2011-08-17T00:00:00+08:00

Ever forget to add a firewall rule so people can reach an internal staging server from outside the network? I needed to verify that a server was accessible from the outside world, but I wanted to do it right now, from the machine sitting inside the network.

Turns out this is trivially easy with a tool pretty much every developer already has installed: SSH.

Three steps and you’re done:

1. Have a server somewhere that you can SSH into. EC2 is perfect for this sort of thing.
2. Open an SSH connection to it with the -D flag, which tells SSH to act as a SOCKS proxy. The other flags enable compression and keep things quiet:
```
ssh -C2qTnN -D 8080 your-server-name-here.com
```
3. Configure your browser to use a SOCKS proxy on localhost, port 8080.

Once that’s in place, head over to whatismyip.com and you should see the public IP address of the remote server rather than your own. All your browser traffic is now tunnelled through that box.

Quick, easy, and no VPN software required.

Reposted: Ten Steps for Attending a Keysigning Party

2011-07-10T00:00:00+08:00

This is a copy of the post originally found at http://commandline.org.uk/command-line/2007/sep/7/ten-steps-for-attending-a-keysigning-party/. The original appears to have vanished and the URL now returns a 404. This work is not mine and I'm not trying to claim it as such — I linked to it in a few places and wanted a permanent archive. Thanks to Vic Demuzere who let me know the link had gone dead.

Update: the original post appears to be archived at http://old.commandline.org.uk/command-line/ten-steps-for-attending-a-keysigning-party/.

A key signing party can be an event of its own, or it might happen at a user group meeting, a conference, or a workplace. The idea is to grow the "web of trust" and strengthen the system as a whole, while also making your own key more trusted. Alex Willmer explains what you need to do to participate in a key signing party using GNU Privacy Guard.

You can use either the command line gpg tool or a GUI front end such as Seahorse. The command line approach goes as follows:

0. Generate a key

If you haven't already done so, generate a key pair:

$ gpg --gen-key

1. Get your key ID

Find your public key:

$ gpg --list-keys

This gives results like the below. The uid should match your name and chosen email address. Note the id on the line labelled "pub":

> /home/alex/.gnupg/pubring.gpg
-----------------------------
pub 1024D/5A6F95BE 2007-02-08
uid Alex Willmer <alex at moreati.org.uk>
sub 2048g/63329941 2007-02-08

2. Upload your key

Publish your public key to a keyserver:

$ gpg --keyserver ldap://keyserver.pgp.com --send-keys 5A6F95BE

Which should respond:

> gpg: sending key 5A6F95BE to ldap server keyserver.pgp.com

3. Print your key fingerprint

Using the id from step 1:

$ gpg --fingerprint 5A6F95BE

The result is the fingerprint of your public key:

> pub 1024D/5A6F95BE 2007-02-08
Key fingerprint = C9CD 3335 C138 7291 2022 F30D 2E51 C57B 5A6F 95BE
uid Alex Willmer <alex at moreati.org.uk>
sub 2048g/63329941 2007-02-08

Print your fingerprint onto paper — you should be able to fit quite a few on a page, which you can then cut into slips. You can also generate these with the command gpg-key2ps.

4. Go to the party!

Bring the slips and credentials that prove your identity. Normally parties require photo ID (e.g. your passport or driving licence).

5. Give out slips

Give a fingerprint slip to anybody you'd like to sign your key, and allow them to verify your identity using your credentials.

6. Take slips

Verify in person the identity of anybody you accept a slip from. Make sure the slip has a uid matching their name.

Note: it's anti-social to take slips and then throw them away or forget about them. If you take a slip from someone, it's polite to actually follow through with steps 7 and 8.

7. Verify the key fingerprints of your acquaintances

Once you're home, use the id from each slip to download and verify each person's key fingerprint:

$ gpg --keyserver ldap://keyserver.pgp.com --recv-keys [key_id]

$ gpg --fingerprint [key_id]

8. Sign and upload your acquaintances' keys

Sign each verified key and upload it to a keyserver:

$ gpg --sign-key [key_id]

$ gpg --keyserver ldap://keyserver.pgp.com --send-key [key_id]

9. Use GPG!

You can now sign emails, and anybody who signed your key can verify that the email was sent by you and hasn't been modified. You can also encrypt anything you send to a person whose key you've signed.

10. Advanced usage

There are optional additional steps, such as encrypting a signed key and sending it to the listed uid. By receiving the signed key and decrypting it, they prove access to the email address and control of the private key.

Code to an Interface (aka Stop Using Instance Variables)

2011-04-21T00:00:00+08:00

We all know the drill: only call methods a class declares public, leave protected and private methods alone, because they can change at any time. In other words, code to the public interface and don't depend on implementation details. It keeps our code clean and means that when the internals of a class change, its clients don't have to.

Curiously, we rarely apply the same thinking when managing state inside our own classes — and that can make refactoring surprisingly painful.

The problem with bare instance variables

Here's a Book class from a hypothetical bookstore application. Books have titles and authors. They have a publication date that can change — maybe the author misses a deadline, or editing runs long. Titles can change too, but authors won't.

class Book
  attr_reader :author
  attr_accessor :title, :published_at

  def initialize author, title, published_at
    @author = author
    @title = title
    @published_at = published_at
  end

  def to_s
    "\"#{@title}\" by #{@author}. Publication date: #{@published_at}"
  end
end

A few weeks pass and we start doing more deals with publishers. One of them wants us to exclusively list an upcoming book by A.N. Big Author. Great! Except… we can't handle books that don't have a publication date yet. We need to update the class:

class Book
  attr_reader :author
  attr_accessor :title, :published_at

  # published_at = nil if the book doesn't have a publication date
  def initialize author, title, published_at
    @author = author
    @title = title
    @published_at = published_at
  end

  def to_s
    "\"#{@title}\" by #{@author}. Publication date: #{@published_at ? @published_at : 'not yet published'}"
  end
end

That's tolerable for this tiny class, but it's ugly, and it's easy to imagine a real class where @published_at gets accessed directly in a dozen places. Changing every one of those takes time and the resulting conditionals don't read well. It's a prime candidate for the Introduce Null Object refactoring, but because we're reaching for @published_at directly everywhere, there's still a lot of churn. We could introduce the Null Object during instantiation, except the publication date can change at any time — a publisher might call and say they've missed their date and don't know when they'll publish.

A better starting point

Here's the class I wish I'd written from the beginning. It exposes the same public API but uses accessor methods internally instead of bare instance variables:

class Book
  attr_accessor :author, :title, :published_at
  private :author=

  def initialize author, title, published_at
    self.author = author
    self.title = title
    self.published_at = published_at
  end

  def to_s
    "\"#{title}\" by #{author}. Publication date: #{published_at}"
  end
end

Now when I get the call about the unpublished book, I can introduce a Null Object by simply overriding the reader for published_at:

require 'singleton'

class MissingPublicationDate
  include Singleton
  def to_s
    'not yet published'
  end
end

class Book
  attr_accessor :author, :title, :published_at
  private :author=

  def initialize author, title, published_at
    self.author = author
    self.title = title
    self.published_at = published_at
  end

  def published_at_with_null_object
    published_at_without_null_object || MissingPublicationDate.instance
  end
  alias_method :published_at_without_null_object, :published_at
  alias_method :published_at, :published_at_with_null_object

  def to_s
    "\"#{title}\" by #{author}. Publication date: #{published_at}"
  end
end

It's a touch more code in this example, but almost none of the methods that use published_at need to change, and the result is vastly more readable. The lesson: treat your own class's state the same way you'd treat someone else's API. Code to the interface, even internally.
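To see the payoff when a date goes missing after the book has been created, here's a condensed, self-contained sketch. It is a variation rather than the exact class above: it defines the published_at reader directly instead of using the alias_method chain, and the author, title, and date values are made up for the example.

```ruby
require 'singleton'

# Null Object standing in for a publication date we don't have yet.
class MissingPublicationDate
  include Singleton

  def to_s
    'not yet published'
  end
end

class Book
  attr_accessor :author, :title
  attr_writer :published_at
  private :author=

  def initialize(author, title, published_at)
    self.author = author
    self.title = title
    self.published_at = published_at
  end

  # The one place that knows a date can be missing.
  def published_at
    @published_at || MissingPublicationDate.instance
  end

  def to_s
    "\"#{title}\" by #{author}. Publication date: #{published_at}"
  end
end

book = Book.new('A.N. Big Author', 'Untitled', '2011-06-01')
before = book.to_s
book.published_at = nil # the publisher calls: the date has slipped
after = book.to_s
```

Because to_s goes through the reader, it picks up the Null Object without any change of its own.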

Working with Ruby Arrays: Map with Index

2011-04-01T00:00:00+08:00

Here's a handy little method I keep reaching for: map_with_index. It does exactly what you'd expect — it works like each_with_index but with the return-value behaviour of map. Every element in the resulting array is whatever the block returns when that element and its index are yielded to it.

module BarkingIguana
  module ArrayExt
    def map_with_index &block
      index = 0
      map do |element|
        result = yield element, index
        index += 1
        result
      end
    end
  end
end

Array.class_eval do
  include BarkingIguana::ArrayExt
end

This is particularly useful when the first N elements of an array need to be treated differently from the rest:

[1, 2, 3, 4, 5].map_with_index do |element, index|
  model = Model.new element
  model.unlock if index < 3
  model
end

Note that Ruby 1.9 and later give you each_with_index.map and map.with_index out of the box, so you can get the same result without patching Array. Sometimes a purpose-built method just reads better, though.
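For comparison, here's a small sketch of the built-in enumerator chaining, using plain strings in place of the Model class so it runs on its own. The "first three elements are special" rule mirrors the example above.

```ruby
# map.with_index: map over the array with the index alongside each element.
labels = %w[a b c d e].map.with_index do |element, index|
  index < 3 ? element.upcase : element
end

# each_with_index.map yields the same element/index pairs in the same order.
same = %w[a b c d e].each_with_index.map do |element, index|
  index < 3 ? element.upcase : element
end
```

Both calls return ["A", "B", "C", "d", "e"].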

Debugging JavaScript with a Stack Trace

2011-03-20T00:00:00+08:00

I was trying to work with some JavaScript that kept popping up alert boxes. The library was huge and not particularly well organised, so rather than hunting through thousands of lines of code, I wrapped the original window.alert with a version that shows a stack trace just before the real alert fires.

You can adapt this technique to trace calls to any function; just change what gets wrapped in the last four lines.

var original_alert = window.alert;

var stacktrace = function() {
  // Capture the name from "function name(...)"; anonymous functions have none.
  var regex = /^\s*function\s+([\w$]+)/;

  var callee = arguments.callee;
  var trace = "";
  while (callee) {
    var match = regex.exec(callee.toString());
    trace += (match ? match[1] : "(anonymous)") + "(";

    // Quote each argument the function was called with.
    var args = [];
    for (var i = 0; i < callee.arguments.length; i++) {
      args.push("'" + callee.arguments[i] + "'");
    }
    trace += args.join(", ") + ")\n\n";

    // Step up to whoever called this function.
    callee = callee.caller;
  }
  original_alert(trace);
};

window.alert = function(msg) {
  stacktrace();
  original_alert(msg);
}

The approach is straightforward: save a reference to the real window.alert, then replace it with a wrapper that walks up the call stack using arguments.callee.caller, building a string of function names and their arguments as it goes. It pops up the trace in one alert, then lets the original alert through.

A word of caution: arguments.callee and the caller property are off limits in strict mode (accessing them throws a TypeError), so this won’t work in strict-mode code. For anything current, you’d want to use console.trace() or the browser’s built-in debugger instead. But when you’re stuck debugging a sprawling legacy codebase that predates those niceties, this trick can save you a lot of time.

Be Cool with Arrays

2011-03-14T00:00:00+08:00

A few of my pet peeves centre around arrays. Ruby gives you a beautifully expressive language for working with collections; use it. Your code will be more readable, and your future self will thank you.

Ask an array if it’s empty. Don’t check if its size equals zero.

bookmarks.size == 0 # no!
bookmarks.empty? # yes

Ask an array if it has any elements. Don’t check if it has a non-zero size.

bookmarks.size > 0 # no!
bookmarks.any? # yes

Don’t guard each with an emptiness check. It already handles empty arrays gracefully; it simply won’t yield.

if bookmarks.any?; bookmarks.each { ... }; end # pointless
bookmarks.each { ... } # does the same thing

The general principle: if a method exists that says what you mean, use it instead of reinventing the check with arithmetic. It reads better and communicates intent more clearly.
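One wrinkle worth knowing about any?: without a block it asks whether any element is truthy, not whether the array has elements. For an array of bookmarks that makes no difference, but for an array that can hold nil or false the two questions diverge. A tiny sketch, with a made-up flags array:

```ruby
flags = [nil, false]

has_elements = !flags.empty? # true: the array holds two elements
has_truthy   = flags.any?    # false: neither element is truthy
```

If what you mean is "is there anything in here at all?", negating empty? says so unambiguously.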

I’m sure you have similar peeves. I’d love to hear what they are.

Moving LVM Volumes Between Hosts Without an Intermediate File

2011-03-13T00:00:00+08:00

At Xeriom Networks we provide virtual machines for clients to run their applications. Clients quite sensibly start with the smallest VM that meets their needs, then upgrade as they grow. Unfortunately, we can only fit so much disk space in each physical server, so when clients upgrade we sometimes need to move their LVM volumes to another physical server with enough free space for the expanded disk image.

The obvious approach would be to use dd to copy the volume to a file, SCP that file to the new server, then dd it back into a volume on the other end. The problem is that some of these volumes are hundreds of gigabytes, and the local disk often doesn’t have enough room for the intermediate file. It also just feels messy.

After some investigation, I discovered you can pipe dd’s output directly through SSH and into a dd process on the remote end, skipping the intermediate file entirely.

There are two steps:

1. On the destination host, create a volume large enough for the data:

sudo lvcreate -L 10G -n destination-lvm-volume-name destination-vg-name

2. From the source host, stream the volume across the network:

sudo dd if=/dev/source-vg-name/source-volume-name | ssh -c arcfour -l root host-b 'dd of=/dev/destination-vg-name/destination-lvm-volume-name'

That’s it. Much easier than I was expecting.

The -c arcfour flag selects a fast cipher for SSH, which helps with throughput on large transfers. (Newer OpenSSH releases have dropped arcfour, so on a current system you may need to omit -c or pick one of the ciphers that ssh -Q cipher lists.) The main downside compared to SCP is that you don’t get a progress bar, so you’re largely left guessing when the transfer will finish. If you know a good way to add progress indication to piped transfers like this, I’d love to hear about it.

Home Delivery Network Limited: Pretending to Deliver for Amazon Prime

2011-03-12T00:00:00+08:00

I’ve been failed once again by Home Delivery Network Limited pretending to deliver my Amazon order. I’m getting properly fed up with it, and I’m not the only one.

I’ve tried calling both numbers on the HDNL contact us page. Both require a delivery number, which is written on the card the driver supposedly leaves when they attempt delivery. There’s been no delivery attempt, so there’s no card and no delivery number. Tremendously helpful.

I tried talking to Amazon support, who were very apologetic and tried to call HDNL on my behalf. They couldn’t get through to anyone at Home Delivery Network and said they couldn’t see a delivery number in the system, which strongly suggests the driver didn’t leave a card. Correct! After I complained about this being a recurring problem, they added a query to request that HDNL investigate what happened and contact me. I don’t hold out much hope that anything beyond “we attempted delivery but couldn’t access the property” will come out of that.

A quick search turned up the phone number for the HDNL depot at New Cross Gate, where my package had set out from and been returned to after the non-existent delivery attempt. It’s listed at Say No To 0870 (search for “Home Delivery Network”) as 020 7635 8094, in case anyone else needs it (other depots are listed there too). The chap on the other end didn’t seem particularly surprised when I told him the driver never showed up, told me I couldn’t come down to pick it up (the depot is a 10-minute bus ride from my flat) because it would be mixed in with all the other parcels by now, and said my best bet was to wait in on Monday. To complain, I’d have to call one of the original premium-rate numbers and wait (paying through the nose) until an operator eventually picks up.

A week ago I signed up for Amazon Prime, thinking I’d get a reliable delivery service for my GBP 49 a year. This package was ordered for next-day delivery. Does that sound like value for money? Shouldn’t I be able to trust that deliveries will actually be attempted on the day the courier claims they will be?

I want to see one of two things added to my Amazon delivery options:

Don’t use HDNL for any of my deliveries. I’ll happily pay more.
Let me collect from the depot. HDNL clearly struggle with the last-mile problem, so send it to the depot and I’ll pick it up myself. I’ll pay less.

Please, Amazon: give me some way to avoid this terrible delivery company.

Update: Monday 14th March

When I called on Saturday, both Amazon and HDNL told me to wait in my flat for the parcel to arrive on Monday. This morning at 06:45 I was emailed by Amazon to say that HDNL will now deliver my Kindle on Tuesday. High five, guys. Big success. Meanwhile, James Cridland has pointed out that if I'd just nipped into the local Tesco superstore I could have picked one up in about 10 minutes.

Forking Ruby Processes

2011-03-03T00:00:00+08:00

I was recently asked if I had the content of some articles that I posted a long time ago on a blog I used to run. After some searching I managed to scrape together the content using the Wayback Machine. It's faithfully recreated here without changes, something I should have done when I first bought the barkingiguana.com domain.

For today’s adventure in Ruby, I’m going to write a simple daemon process. To start with, it won’t do anything particularly useful; every second it’ll print the current time to STDOUT.

Once that’s working, I’ll swap in the socket-checking code from my earlier posts and bump the interval to 15 seconds.

A simple time-printing daemon

kawaii:~ craig$ irb
irb(main):001:0> fork do # Fork a new process
irb(main):002:1*   while true # Loop forever
irb(main):003:2>     puts Time.now # Print the time
irb(main):004:2>     sleep 1 # Sleep for a second
irb(main):005:2>   end # while true
irb(main):006:1> end # fork
=> 15738
irb(main):007:0> Sat Jun 03 11:31:09 BST 2006
Sat Jun 03 11:31:10 BST 2006
Sat Jun 03 11:31:11 BST 2006

Easy. The fork call creates a child process that runs independently, and we get back a PID. Meanwhile, the parent IRB session carries on as normal (well, with timestamps appearing in the background).
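That PID is also how you stop the child again. Here’s a minimal sketch of the same clock as a script, assuming a Unix-like system where fork is available; start_clock and stop_child are hypothetical helper names, and the short intervals are only there to keep the example quick.

```ruby
# Start a child process that prints the time every `interval` seconds.
# Returns the child's PID, just like the bare fork call above.
def start_clock(interval)
  fork do
    loop do
      puts Time.now
      $stdout.flush
      sleep interval
    end
  end
end

# Ask the child to exit, then reap it so it doesn't linger as a zombie.
def stop_child(pid)
  Process.kill('TERM', pid)
  _, status = Process.wait2(pid)
  status
end

pid = start_clock(0.1)
sleep 0.3                # let it tick a few times
status = stop_child(pid)
```

Process.wait2 returns the child’s Process::Status, so the parent can confirm the child really has gone away.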

Monitoring a socket

Next, let’s check that Postfix is listening on port 25 on the secondary MX, mx2.xeriom.net. Don’t forget to require the socket library; otherwise you’ll always hit the rescue block. Ask me how I know.

irb(main):045:0> require 'socket'
=> true
irb(main):046:0> fork do
irb(main):047:1*   while true
irb(main):048:2>     begin
irb(main):049:3*       t = TCPSocket.open('mx2.xeriom.net', 'smtp')
irb(main):050:3>       puts Time.now.to_s + ": MX2 is listening on port 25."
irb(main):051:3>       t.close
irb(main):052:3>     rescue
irb(main):053:3>       puts Time.now.to_s + ": MX2 is NOT listening on port 25."
irb(main):054:3>     end
irb(main):055:2>     sleep 15
irb(main):056:2>   end
irb(main):057:1> end
=> 15759
irb(main):058:0> Sat Jun 03 11:48:25 BST 2006: MX2 is listening on port 25.
Sat Jun 03 11:48:41 BST 2006: MX2 is listening on port 25.
Sat Jun 03 11:48:56 BST 2006: MX2 is NOT listening on port 25.
Sat Jun 03 11:49:11 BST 2006: MX2 is listening on port 25.

That third check caught Postfix being momentarily unavailable. Useful.
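The missing require bites because a bare rescue catches every StandardError, including the NameError raised for an unknown TCPSocket constant. A sketch of a narrower check follows; the listening? helper name is my own, and it only treats connection and name-resolution failures as "not listening".

```ruby
require 'socket'

# True if something accepts a TCP connection on host:port.
# Only network-level failures count as "not listening"; a typo or a
# missing require will still raise instead of being silently swallowed.
def listening?(host, port)
  TCPSocket.new(host, port).close
  true
rescue SystemCallError, SocketError
  false
end
```

Calling listening?('mx2.xeriom.net', 25) inside the loop would replace the begin/rescue block from the session above.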

Monitoring multiple hosts and ports

That was a little too easy, so let’s extend the problem. This time we’ll check an arbitrary number of ports across an arbitrary number of hosts, using threads inside the forked process for concurrency:

irb(main):107:0> host_sockets = { 'mx2.xeriom.net' => [ 25 ], 'kiwi.xeriom.net' => [ 21, 22, 25 ], 'guava.xeriom.net' => [ 21, 22, 25 ], 'mx1.xeriom.net' => [ 25 ] }
=> {'mx2.xeriom.net' => [ 25 ], 'kiwi.xeriom.net' => [ 21, 22, 25 ], 'guava.xeriom.net' => [ 21, 22, 25 ], 'mx1.xeriom.net' => [ 25 ]}
irb(main):108:0> fork do
irb(main):109:1*   while true
irb(main):110:2>     host_sockets.each { |hostname, sockets|
irb(main):111:3*       Thread.new(hostname, sockets) { |host, socks|
irb(main):112:4*         socks.each { |socket|
irb(main):113:5*           begin
irb(main):114:6*             t = TCPSocket.new(host, socket)
irb(main):115:6>             puts Time.now.to_s + ": " + host.to_s + " is listening on port " + socket.to_s
irb(main):116:6>             t.close
irb(main):117:6>           rescue
irb(main):118:6>             puts Time.now.to_s + ": " + host.to_s + " is NOT listening on port " + socket.to_s
irb(main):119:6>           end
irb(main):120:5>         }
irb(main):121:4>       }
irb(main):122:3>     }
irb(main):123:2>     sleep 15
irb(main):124:2>   end
irb(main):125:1> end
=> 15784
Sat Jun 03 12:16:56 BST 2006: kiwi.xeriom.net is listening on port 21Sat Jun 03 12:16:56 BST 2006: mx2.xeriom.net is listening on port 22

Sat Jun 03 12:16:56 BST 2006: guava.xeriom.net is listening on port 22
Sat Jun 03 12:16:56 BST 2006: kiwi.xeriom.net is listening on port 22Sat Jun 03 12:16:56 BST 2006: mx2.xeriom.net is listening on port 25

Sat Jun 03 12:16:56 BST 2006: kiwi.xeriom.net is listening on port 25Sat Jun 03 12:16:56 BST 2006: guava.xeriom.net is listening on port 25 ...

Obviously, if I were scaling this to millions of hosts, a thread-per-host approach would collapse under its own weight. But for keeping an eye on a small network, it does the job nicely.

Concurrent Socket Programming in Ruby

2011-03-02T00:00:00+08:00

Continuing my previous adventure in socket programming with Ruby, today I’ve attempted to communicate with multiple sockets concurrently.

The idea is simple: spin up a thread for each port we want to check, and let them all run at once.

kawaii:~ craig$ irb
irb(main):001:0> require 'socket'
=> true
irb(main):002:0> threads = []
=> []
irb(main):003:0> ports = [22,23,24,25,26,27,28,29,30].freeze
=> [22, 23, 24, 25, 26, 27, 28, 29, 30]
irb(main):004:0> for port in ports
irb(main):005:1>   threads << Thread.new(port) { |p|
irb(main):006:2*     puts "Checking if port " + p.to_s + " is open..."
irb(main):007:2>     begin
irb(main):008:3*       t = TCPSocket.new('xeriom.net', p)
irb(main):009:3>       t.close
irb(main):010:3>       puts "Port " + p.to_s + " is open."
irb(main):011:3>     rescue
irb(main):012:3>       puts "Port " + p.to_s + " is not open."
irb(main):013:3>     end
irb(main):014:2>   }
irb(main):015:1> end
Checking if port 22 is open...Checking if port 23 is open...Checking if port 24 is open...
Checking if port 25 is open...
Checking if port 26 is open...
Checking if port 27 is open...
Port 22 is open.Checking if port 28 is open...
Checking if port 29 is open...

Checking if port 30 is open...
=> [22, 23, 24, 25, 26, 27, 28, 29, 30]
irb(main):016:0>

Port 24 is not open.Port 23 is not open.Port 29 is not open.Port 28 is not open.Port 25 is open.

Port 27 is not open.Port 26 is not open.Port 30 is not open.

The output is a jumbled mess because threads are writing to STDOUT whenever they feel like it, but that’s not the point. We can see that ports 22 (SSH) and 25 (SMTP) are open, and everything else is closed. I’ve just built a simple port scanner in 15 lines of Ruby. It’s not pretty, but it works.
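If the jumble bothers you, one option is to have each thread return its result instead of printing it, and let the main thread do all the output. This is a sketch under that assumption; the scan helper is hypothetical and not part of the session above.

```ruby
require 'socket'

# Check every port in its own thread and return a hash of port => true/false.
def scan(host, ports)
  threads = ports.map do |port|
    Thread.new(port) do |p|
      begin
        TCPSocket.new(host, p).close
        [p, true]
      rescue SystemCallError, SocketError
        [p, false]
      end
    end
  end

  # Thread#value waits for the thread and returns the block's result,
  # so the pairs come back in the same order as the ports went in.
  Hash[threads.map(&:value)]
end
```

Something like scan('xeriom.net', 22..30).each { |port, open| puts "Port #{port} is #{open ? 'open' : 'not open'}." } then prints one tidy line per port, because only one thread is writing.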

The Ruby thread tutorial at Ruby Central mentions a fairly important caveat: if a thread executes something at the OS level that takes a long time to return, it can freeze the entire interpreter. That sounds bad.

Interestingly though, it doesn’t seem to apply to TCPSocket operations. Adding in a few checks (left as an exercise for the reader), it seems that the only thing limiting the number of active threads is the overhead of creating them. There are up to 10 running at once with the above code, and I suspect you could push that number considerably higher if thread creation were faster.

Tomorrow (or maybe later today) I’ll be attempting to use ActiveRecord outside of Rails. I know it can be done; I just don’t know how hard it is yet.

Socket Programming in Ruby

2011-03-01T00:00:00+08:00

I decided today to find out how hard socket programming in Ruby would be, mainly because I’d finished a huge chunk of work and could find nothing better to do. The alternative was tidying the kitchen, so the bar was low.

A quick search turned up an extract from the Pragmatic Ruby book at rubycentral.com, and that simple library reference proved surprisingly handy. I’ll need to buy a copy of that book. Somebody remind me when I’m feeling flush.

Anyway, unsurprisingly, it’s very easy. Here’s how you open and work with a TCP socket.

First, fire up irb and load the socket library:

kawaii:~ craig$ irb
irb(main):001:0> require 'socket'
=> true

Then open a socket, passing in a block:

irb(main):002:0> TCPSocket.open('xeriom.net','smtp') do |t|
irb(main):003:1*   t.gets
irb(main):004:1> end
=> "220 pluto.xeriom.net ESMTP Postfix\r\n"

Simple, elegant, delightful. Checking whether a socket is listening is just as easy. Since we already have the socket library loaded:

irb(main):005:0> begin
irb(main):006:1*   t = TCPSocket.new('xeriom.net',8000)
irb(main):007:1>   t.close
irb(main):008:1> rescue
irb(main):009:1>   "Error: socket not open"
irb(main):010:1> end
=> "Error: socket not open"

Done. Perhaps a little too easy; now I have nothing to do except tidy.

Tomorrow I’ll play with opening many sockets simultaneously, checking if each one is open or closed. Bonus points if you beat me to it, or if you can make Java look as clean.

Blogging from Vim

2010-10-08T00:00:00+08:00

Now that I’ve switched to a full-screen MacVim session for all my coding, switching to another application to jot down notes feels genuinely disruptive. Without notes, my blogging suffers; I never have anything to write about because I never captured the thought when it was fresh.

Enter vimblog. It lets me draft and publish blog posts without ever leaving Vim. No context switching, no breaking flow. I can wax lyrical about whatever’s on my mind without reaching for another app.

Lucky you.

Encrypting Data with GnuPG

2010-10-01T00:00:00+08:00

There was recently yet another case of an organisation passing around unencrypted sensitive data. It keeps happening, and I’m constantly surprised that more people don’t reach for the perfectly good encryption tools that are freely available. GnuPG is fast, free, and straightforward to use. If you handle sensitive files, there’s really no excuse not to use it.

Installing GnuPG

I’m on macOS, so I use the MacGPG2 package (MacGPG2-2.0.14RC2 at the time of writing). Download the zip, unzip it, and run the installer. A few clicks and you’re ready to start encrypting.

Encrypting a file

Say you have a file full of confidential data called confidential-data.xls. Run:

gpg -c ./confidential-data.xls

GnuPG will prompt you for a passphrase, then ask you to confirm it. Pick something strong. Once it finishes, you’ll have a new file called confidential-data.xls.gpg; that’s the encrypted version. Delete the original and store the encrypted file wherever you need to.

Decrypting a file

When you need the data back, retrieve the encrypted file and run:

gpg --output ./confidential-data.xls -d ./confidential-data.xls.gpg

That’s it. The decrypted file is back where it started.

Not a command-line person?

I use the command line, which might not be your thing. Honestly, it’s not that scary, and I’d encourage you to give it a go. But if you prefer windows and drag-and-drop, take a look at something like GPGDropThing; you can encrypt files just by dropping them onto it.

The important thing is that you encrypt sensitive data at all. The specific tool matters less than the habit.

My Dot Files: Dot Aliases

2010-07-07T00:00:00+08:00

This is the first part of a series where I’ll walk through the dotfiles I use to make my day-to-day work easier and more enjoyable.

I use Git and Rails every day. To save my fingers from unnecessary wear, I’ve created short aliases for the commands I type most often.

Stick these in ~/.aliases:

# ~/.aliases
# Record how much I've used various Git commands:
#   http://github.com/icefox/git-achievements
alias git="git-achievements"

# Working with Git
alias g='git'
alias gs='git status'
alias gc='git commit'
alias gca='git commit -a'
alias ga='git add'
alias gco='git checkout'
alias gb='git branch'
alias gm='git merge'
alias gd="git diff"

# Working with Rails
alias s='script/server'
alias c='script/console'
alias m='rake db:migrate'
alias r='rake'

# Open the current directory in TextMate
alias e='mate .'

# Serve the contents of the current directory over HTTP
alias serve="ruby -rwebrick -e\"s = WEBrick::HTTPServer.new(:Port => 3000, :DocumentRoot => Dir.pwd); trap('INT') { s.shutdown }; s.start\""

The git alias routes every git command through git-achievements, which records how often you use each Git command. It’s a fun little motivator.

The serve alias is surprisingly handy; it spins up a quick WEBrick server on port 3000 serving whatever’s in your current directory. Great for previewing static sites or sharing files on a local network.

Now source the aliases file from your ~/.profile so they’re available in every session:

# ~/.profile
for I in aliases; do
  [ -f ~/.$I ] && . ~/.$I
done

The loop might look like overkill for a single file, but it scales nicely as you add more dotfiles to the pattern; just append their names to the list.

An Updated Command Prompt

2010-04-12T00:00:00+08:00

It’s been a while since I added the current Git branch to my command prompt to help with my development workflow. Since then I’ve started juggling multiple Ruby versions and I find myself increasingly wanting to know the exit status of the last command at a glance. So I gave my prompt an upgrade.

The new prompt packs in the username, hostname, last exit code (green for success, red for failure), the active Ruby interpreter and version, the current directory, and Git branch status. Everything I need, nothing I don’t.

To get this, I declare $PS1 like so:

# Show the exit code of the last command.
# Idea stolen from @mathie.
function last_exit_code() {
  local code=$?
  if [ $code = 0 ]; then
    printf "$1" $code
  else
    printf "$2" $code
  fi
  return $code
}

# I only want to see the interpreter in the output if I'm not using MRI.
function ruby_version() {
  local i=$(/Users/craig/.rvm/bin/rvm-prompt i)
  case $i in
    ruby) printf "$1" $(/Users/craig/.rvm/bin/rvm-prompt $2) ;;
    *)    printf "$1" $(/Users/craig/.rvm/bin/rvm-prompt $3) ;;
  esac
}

# Show lots of info in the __git_ps1 output.
# Thanks for the info @mathie.
export GIT_PS1_SHOWDIRTYSTATE="true"
export GIT_PS1_SHOWSTASHSTATE="true"
export GIT_PS1_SHOWUNTRACKEDFILES="true"

export PS1='\[\033[01;32m\]\u@\h\[\033[00m\] $(last_exit_code "\[\033[1;32m\]%s\[\033[00m\]" "\[\033[01;31m\]%s\[\033[00m\]") $(ruby_version "\[\033[01;36m\]%s\[\033[00m\]" "v p" "i v p") \[\033[01;34m\]\W\[\033[00m\]$(__git_ps1 "\[\033[01;33m\](%s)\[\033[00m\]")\$ '

A couple of things worth noting. The last_exit_code function captures $? immediately; if you wait too long, some other command will overwrite it. And the ruby_version function only shows the interpreter name when you’re running something other than MRI, which keeps things tidy for the common case.

The GIT_PS1_SHOW* exports turn on indicators for dirty state, stashed changes, and untracked files in the Git portion of the prompt. If you haven’t tried these, they’re wonderful; you’ll never accidentally commit from the wrong state again.

A One-Line Web Server in Ruby

2010-04-11T00:00:00+08:00

Inspired by a tweet, here's how to serve the current directory over HTTP with a single line of Ruby:

ruby -rwebrick -e'WEBrick::HTTPServer.new(:Port => 3000, :DocumentRoot => Dir.pwd).start'

That's it. Point a browser at http://localhost:3000 and you'll see a directory listing. It's great for quickly sharing files at a conference, previewing static sites, or any situation where you need a throwaway web server with zero setup.

Command-Line EC2 with ec2-api-tools

2010-03-21T00:00:00+08:00

A company I've been working with hosts some of their applications on EC2. As someone who has spent years working with Linux and Unix servers from the command line, I find the EC2 web console pretty frustrating. Here's how I set up the EC2 API tools on my MacBook Pro so I can manage instances from the terminal.

mkdir ~/.ec2
cd ~/Downloads
curl -O -L "http://www.amazon.com/gp/redirect.html/ref=aws_rc_ec2tools?location=http://s3.amazonaws.com/ec2-downloads/ec2-api-tools.zip&token=A80325AA4DAB186C80828ED5138633E3F49160D9"
unzip ec2-api-tools.zip*
cd ec2-api-tools
mv bin lib ~/.ec2/
echo 'export EC2_HOME=~/.ec2
export PATH=$PATH:$EC2_HOME/bin
export EC2_PRIVATE_KEY=`ls $EC2_HOME/pk-*.pem`
export EC2_CERT=`ls $EC2_HOME/cert-*.pem`
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Home/
# I use eu-west-1 - you may want to change this
EC2_REGION="eu-west-1"
export EC2_URL="https://${EC2_REGION}.ec2.amazonaws.com/"
export EC2_KEYPAIR_NAME="aws-`whoami`"' > ~/.ec2/env
echo '[ -f ~/.ec2/env ] && . ~/.ec2/env' >> ~/.profile
ec2-add-keypair aws-`whoami` > ~/.ec2/aws-`whoami`
chmod 0600 ~/.ec2/aws-`whoami`

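One catch: the ec2-add-keypair step at the end only works once your credentials are in place. Download the X.509 private key and certificate from the Security Identifiers page of your AWS account and save them to ~/.ec2/ first, then open a new terminal (or source ~/.ec2/env) before running it. Leave the filenames as-is with the big messy jumble of characters; the env file uses a glob pattern to find them.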

That should be everything. To verify it's working, try listing all the Amazon-owned machine images:

ec2-describe-images -o amazon

You should see a long list that looks something like this:

IMAGE	ami-13042f67	amazon/fedora-8-i386-v1.14-std	amazon	available	public		i386	machine	aki-61022915	ari-63022917		ebs
BLOCKDEVICEMAPPING	/dev/sda1		snap-34739d5d	15
IMAGE	ami-1d042f69	amazon/fedora-8-x86_64-v1.14-std	amazon	available	public		x86_64	machine	aki-6d022919	ari-37022943		ebs
BLOCKDEVICEMAPPING	/dev/sda1		snap-08739d61	15

All the EC2 commands are prefixed with ec2-. To see them all:

ls ~/.ec2/bin/ec2-*

If you see deprecation notices from Xalan, don't worry about it — everything still works fine:

[Deprecated] Xalan: org.apache.xml.res.XMLErrorResources_en_US

Creating a New Subversion Branch from an Existing Local Git Branch

2010-03-03T00:00:00+08:00

I frequently have to work with Subversion repositories, and as a Git user I rely on git-svn to bridge the two worlds. My usual workflow is to do development in local Git branches, then check out the integration branch, merge my changes, and git svn dcommit to push the code to Subversion.

Sometimes, though, I need to share an in-progress local branch with a Subversion user before it's ready to merge into the mainline. Every time this comes up I find myself hunting for the correct sequence of commands, so here they are for future reference.

git checkout master
git svn branch <new_svn_branch_name>
git svn fetch
git branch -r # make sure <new_svn_branch_name> exists
git checkout -b tmp/svn-rebase-target <new_svn_branch_name>
git rebase --onto tmp/svn-rebase-target master <existing_git_branch_name>
# That should have checked out <existing_git_branch_name>.
git svn dcommit -n # This should say it'll commit to <new_svn_branch_name>.
git branch -D tmp/svn-rebase-target # clean up the temporary branch.
git svn dcommit

The key idea: you create a new branch in Subversion, fetch it into Git, then rebase your local work onto it so that git svn dcommit pushes to the correct place.

Credit goes to Bjoern Steinbrink and Cameron for the comments that pointed me in the right direction.

I've also wrapped this up as a shell script. Download it, make it executable, and pass it the name of the local branch you want to push:

./svn-push development/avoid-the-wombat-widgets

This assumes you told git svn clone where to find your Subversion branches when you first set up the repository. If you didn't, your mileage may vary.

Installing the MySQL Gem on OS X 10.6 (Snow Leopard) with MacPorts MySQL5

2010-03-02T00:00:00+08:00

This one took me longer to figure out than I'd like to admit. If you're running Snow Leopard with MySQL installed via MacPorts, here's the incantation you need to install the MySQL gem:

sudo port install mysql5-server
sudo env ARCHFLAGS="-arch x86_64" gem install mysql -- --with-mysql-config=/opt/local/bin/mysql_config5

The key details: you need to force the x86_64 architecture flag, and you need to point the gem build at MacPorts' mysql_config5 rather than the default mysql_config path. Hopefully this saves someone else the half hour I spent on it.

London Tech Meetups

2010-02-23T00:00:00+08:00

Finding tech meetups in your area can be surprisingly difficult, and even when you know a group exists, working out when they actually meet can be a puzzle. Some of them have frankly bewildering scheduling rules.

John Sutherland solved this problem for the Edinburgh tech community by listing when various groups meet and who they'd be of interest to. With his permission, I've done the same thing for London tech meetups.

If your meetup isn't listed and you'd like it to be, drop me an email at craig@barkingiguana.com.

Decoupling Nagios Host and Service Check Events for Fun and Profit

2010-02-17T00:00:00+08:00

Nagios does a solid job of watching over my services and hosts, but I want to do a lot more with the events it generates — when a check fails, when something recovers. Specifically, I want to give clients incredibly fine-grained control over their notifications: what services, how often, and at what level of technical detail. I also want to use those events as upsell opportunities for Xeriom — if a disk is filling up or bandwidth is being consumed faster than expected, it should be easy to suggest a plan upgrade. And I'd like to experiment with fun delivery mechanisms — iPhone push notifications, SMS gateways, audible alarms, whatever — without any risk of breaking Nagios itself.

Message queues are the natural solution here. They let you decouple systems, moving complexity and risk away from the core. Nagios shouldn't have to worry about any of this extra stuff. It should just do what it's good at: monitoring hosts and services.

Luckily, I already have ActiveMQ running for other tasks, writing a STOMP client with SMQueue is straightforward, and Nagios has several ways to execute external commands when events occur, including the global host and service event handlers. All I need is a command that accepts event data from Nagios and drops it onto the message queue.

Here's what I came up with:

#! /usr/bin/env ruby

require 'rubygems'
require 'smqueue'
require 'json'

message = {
  :hostname => ARGV[2],
  :service => ARGV[3],
  :state => ARGV[4],
  :state_type => ARGV[5],
  :state_time => ARGV[6].to_i,
  :attempt => ARGV[7].to_i,
  :max_attempts => ARGV[8].to_i,
  :time_t => Time.now.to_i
}

configuration = {
  :host => ARGV[0],
  :name => ARGV[1],
  :adapter => :StompAdapter
}

broadcast = SMQueue(configuration)
broadcast.put message.to_json, "content-type" => "application/json"

You'll need Ruby and RubyGems installed. Once you have those, install the dependencies and the script like this:

sudo su -
gem sources -a http://gems.github.com/
gem install seanohalpin-smqueue json --no-ri --no-rdoc
cd /usr/bin
wget http://gist.github.com/raw/306765/2a3e9cbade88b4c6dd430e108bc8a28f95047462/notify-service-by-stomp.rb
chmod +x notify-service-by-stomp.rb

Once installed, tell Nagios to use it by adding this to your Nagios configuration:

define command {
  command_name notify-service-by-stomp
  command_line /usr/bin/notify-service-by-stomp.rb mq.example.com /topic/foo.bar.baz.quux $HOSTADDRESS$ "$SERVICEDESC$" $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEDURATIONSEC$ $SERVICEATTEMPT$ $MAXSERVICEATTEMPTS$
}

global_service_event_handler=notify-service-by-stomp

Change mq.example.com to the hostname of your message broker, and /topic/foo.bar.baz.quux to whatever topic or queue you want notifications sent to. Restart Nagios and events should start flowing.

Testing it

If your Nagios doesn't generate events very often, you'll want a way to verify everything is wired up correctly. Attach a simple stompcat listener to the topic, then manually fire some test notifications.

Here's a quick stompcat tool in case you don't have one handy:

#! /usr/bin/env ruby

# Run me like this:
#
#   ./stompcat.rb mq.example.com /topic/foo.bar.baz.quux
#

require 'rubygems'
require 'smqueue'

configuration = {
  :host => ARGV[0],
  :name => ARGV[1],
  :adapter => :StompAdapter
}

source = SMQueue(configuration)
source.get do |m|
  payload = m.body
  puts ">>> #{payload}"
end

And here's how to send a test notification to the queue:

/usr/bin/notify-service-by-stomp.rb mq.example.com \
  /topic/foo.bar.baz.quux service-host.example.com "SERVICE NAME" \
  WARNING HARD 86492 6 6

If it's working, you should see something like this appear in your stompcat output:

{
  "time_t":1266427384,
  "state":"WARNING",
  "state_type":"HARD",
  "state_time":86492,
  "attempt":6,
  "hostname":"service-host.example.com",
  "max_attempts":6,
  "service":"SERVICE NAME"
}

From here, you can modify the stompcat example to do anything you like — look up clients in a database, send SMS alerts if an account has enough credit, trigger webhooks, whatever takes your fancy. If you build something fun with this, I'd love to hear about it.
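
As a starting point, here is a minimal sketch of what a consumer might do with each payload. The notify? policy and the summary format are assumptions made up for illustration; only the field names come from the notifier script above.

```ruby
require 'json'

# Hypothetical policy: only alert on confirmed (HARD) problems.
def notify?(event)
  event["state_type"] == "HARD" && event["state"] != "OK"
end

# Build a one-line summary from the fields the notifier script sends.
def summary(event)
  "#{event['service']} on #{event['hostname']} is #{event['state']} " \
    "(attempt #{event['attempt']}/#{event['max_attempts']})"
end

# The same payload as the stompcat output shown above.
payload = '{"time_t":1266427384,"state":"WARNING","state_type":"HARD",' \
          '"state_time":86492,"attempt":6,' \
          '"hostname":"service-host.example.com","max_attempts":6,' \
          '"service":"SERVICE NAME"}'

event = JSON.parse(payload)
puts summary(event) if notify?(event)
```

Inside the stompcat block you would call JSON.parse on m.body instead of the hard-coded string.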

The Correct OID for System Uptime

2010-02-11T00:00:00+08:00

I use SNMP to track system uptime so I know when hosts have recently rebooted. But I keep making the same mistake: reaching for sysUpTime.0 when I should be using hrSystem.hrSystemUptime.0.

Here's the difference, so I stop tripping over this:

sysUpTime.0: Timeticks (in hundredths of a second) since snmpd started. If someone restarts the SNMP daemon, this resets — even though the machine hasn't rebooted.
hrSystem.hrSystemUptime.0: Timeticks since the hardware started. This is the one you want for actual system uptime.

In short: if you want to know how long the machine has been running, use hrSystem.hrSystemUptime.0. If you want to know how long the SNMP agent has been running, use sysUpTime.0.
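
Both values come back as Timeticks, which are hundredths of a second, so a little arithmetic turns them into something readable. A small sketch, with a helper name invented for illustration:

```ruby
# Convert SNMP Timeticks (hundredths of a second) into a readable duration.
def timeticks_to_duration(ticks)
  seconds = ticks / 100
  days, seconds = seconds.divmod(86_400)
  hours, seconds = seconds.divmod(3_600)
  minutes, seconds = seconds.divmod(60)
  format("%dd %02dh %02dm %02ds", days, hours, minutes, seconds)
end

puts timeticks_to_duration(8_649_200) # prints "1d 00h 01m 32s"
```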

Keeping the Software on Your Ubuntu Server Up to Date

2010-02-11T00:00:00+08:00

New exploits are discovered just about every day in software both old and new. To combat this, software vendors release security updates, which the Ubuntu team packages up and ships as new, more secure versions of the software you’ve installed.

Supporting every version of every package ever built for Ubuntu would be an impossible task, so the Ubuntu team produces releases with defined support windows. There are two kinds: Long Term Support (LTS) releases get 5 years of server support after the release date, while regular releases get 18 months. Once a support window closes, you won’t receive security updates or be able to easily upgrade packages, so it’s important to plan your upgrades before support ends.

Here are the commonly referenced releases, their dates, and their support windows:

Version	Name	Release Date	Support Ends
10.04 [LTS]	Lucid Lynx	April 2010	April 2015
9.10	Karmic Koala	October 29, 2009	April 2011
9.04	Jaunty Jackalope	April 23, 2009	October 2010
8.10	Intrepid Ibex	October 30, 2008	April 2010
8.04.4 [LTS]	Hardy Heron	January 28, 2010	April 2013
8.04.3 [LTS]	Hardy Heron	July 16, 2009	April 2013
8.04.2 [LTS]	Hardy Heron	January 22, 2009	April 2013
8.04.1 [LTS]	Hardy Heron	July 3, 2008	April 2013
8.04 [LTS]	Hardy Heron	April 24, 2008	April 2013
7.10	Gutsy Gibbon	October 18, 2007	April 2009
7.04	Feisty Fawn	April 19, 2007	October 2008
6.10	Edgy Eft	October 26, 2006	April 2008
6.06.2 [LTS]	Dapper Drake	January 21, 2008	June 2011
6.06.1 [LTS]	Dapper Drake	August 10, 2006	June 2011
6.06 [LTS]	Dapper Drake	June 1, 2006	June 2011
5.10	Breezy Badger	October 12, 2005	April 2007
5.04	Hoary Hedgehog	April 8, 2005	October 2006
4.10	Warty Warthog	October 26, 2004	April 2006

At the time of writing, the currently supported releases are 6.06, 8.04, 8.10, 9.04, and 9.10. Ubuntu 10.04 is due in April.
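
The support windows are simple date arithmetic. The sketch below is a rough illustration only: the helper name is made up, and it works from the original release date, since point releases such as 8.04.4 keep the end date of the release they belong to.

```ruby
require 'date'

# Support windows described above: 5 years for LTS server releases,
# 18 months for regular releases.
LTS_MONTHS = 60
REGULAR_MONTHS = 18

# Hypothetical helper: approximate end of support for a release date.
def support_ends(release_date, lts: false)
  release_date >> (lts ? LTS_MONTHS : REGULAR_MONTHS)
end

karmic = support_ends(Date.new(2009, 10, 29))
hardy  = support_ends(Date.new(2008, 4, 24), lts: true)

puts karmic.strftime("%B %Y") # prints "April 2011"
puts hardy.strftime("%B %Y")  # prints "April 2013"
```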

Your responsibilities

As a server operator, there are two things you need to know how to do: upgrade installed packages, and upgrade to the next Ubuntu release. I’ll cover both, but first let’s do a little setup to make the whole process faster.

Using a package mirror

The most time-consuming part of any update is downloading packages from remote servers. To speed things up, Xeriom Networks provides a local mirror of the software packages for 8.04, 8.10, 9.04, and 9.10. If you’re not hosted with Xeriom (why not?), ask your provider whether they offer a package mirror. If they don’t, skip this section and hope your connection is fast enough.

Setting up the mirror requires editing just one file. A straightforward editor for this is nano. Install it by connecting to your server via SSH and running:

sudo apt-get install nano --yes

Next, find out which Ubuntu release you’re running:

cat /etc/lsb-release

Match your release to the appropriate entry on this wiki page: http://wiki.xeriom.net/w/XeriomUbuntuPackagesService

Copy the text from the box that matches your release. Then open the sources list for editing:

sudo nano -w /etc/apt/sources.list

Delete all existing lines and paste in the text you copied. Press Ctrl+X, then Y and Enter, to save and exit.

Now tell Ubuntu to refresh its package list so it picks up the local mirror:

sudo apt-get update

You’re now using the Xeriom package mirror.

Upgrading installed software

Keeping your packages up to date is one of the most important things you can do for server security. That said, new packages can occasionally break things, so don’t set this up to run automatically. Sit down, review what’s changing, and apply updates deliberately.

First, refresh your package database to make sure you’re seeing the latest available versions:

sudo apt-get update

Then ask apt-get to upgrade your installed packages:

sudo apt-get upgrade

This calculates everything that needs upgrading, shows you the list, and asks for confirmation. Most of the time it will run smoothly, but always check what’s about to change before saying yes.

Upgrading to the next release

A full release upgrade is a bigger operation. A large number of packages will be updated, and you’ll almost certainly need to reboot (the kernel is usually among the upgraded packages), so plan for a little downtime.

You’ll need the update-manager-core package. If this is your first release upgrade, install it:

sudo apt-get install update-manager-core

Next, configure your upgrade strategy. Open the configuration file:

sudo nano -w /etc/update-manager/release-upgrades

Find the line that starts with Prompt= and set it to one of: lts, normal, or never. For example, Prompt=lts will only offer upgrades to LTS releases, giving you 5 years of support per release. Press Ctrl+X, then Y and Enter, to save and exit.

Before you upgrade, read the release notes for the version you’re upgrading to. Make sure you understand any known issues and caveats.

Once you’re satisfied and have scheduled a maintenance window, start the upgrade:

sudo do-release-upgrade

This will calculate the full list of package changes and ask for confirmation. Don’t just say yes; read through the list and make sure you understand what upgrading means for your setup.

If it all goes wrong

Sometimes things break. Maybe a new release has an unexpected issue, or the upgrade removes a package your application depends on. If that happens, we can create a fresh image of whatever supported release you need. Your data won’t be on the new image, of course, so make sure your backups are current before you start.

Getting Started with Node.js

2010-01-21T00:00:00+08:00

Tonight I'm giving a talk at the London JavaScript User Group, introducing Node.js.

The slides are available here: Getting Started with Node.js. If you print them out you'll find speaker notes included, or you can watch the video on Vimeo.

If you have any feedback, I'd love to hear it — please leave a comment.

Telnet 101

2009-12-10T00:00:00+08:00

Telnet has been around since before the dawn of Unix time, yet surprisingly few people know how to wield this tremendously useful debugging tool. A few seconds with telnet can save you hours of frustrated searching, trial-and-error config changes, and shouting at your monitor.

Telnet lets you speak plain-text protocols by hand. I've used it to talk to MySQL, Memcached, and Postfix. Here I'll show you how to use it to verify that an HTTP server can serve content over HTTP/1.1.

What is HTTP?

Before we can simulate HTTP with telnet, we need a quick refresher on how the protocol works.

HTTP/1.1 — the HyperText Transfer Protocol — is a plain-text protocol defined in RFC 2616. It's used for all sorts of things, but the most visible use for most people is fetching web pages.

When you request a webpage, your browser connects to the web server and sends a request. A typical one looks like this:

GET / HTTP/1.1
Host: example.com

The format is:

[METHOD] [PATH] HTTP/1.1
Host: [HOSTNAME]
[BLANK LINE]

The server responds with something like this — headers first, then a blank line, then the page content:

HTTP/1.1 200 OK
Server: Apache/2.2.3 (Red Hat)
Last-Modified: Tue, 15 Nov 2005 13:24:10 GMT
ETag: "b300b4-1b6-4059a80bfd280"
Accept-Ranges: bytes
Content-Type: text/html; charset=UTF-8
Connection: close
Date: Thu, 10 Dec 2009 10:37:33 GMT
Age: 7114
Content-Length: 438

<HTML>
<HEAD>
  <TITLE>Example Web Page</TITLE>
</HEAD>
<body>
<p>You have reached this web page by typing "example.com",
"example.net",
  or "example.org" into your web browser.</p>
<p>These domain names are reserved for use in documentation and are not available
  for registration. See <a href="http://www.rfc-editor.org/rfc/rfc2606.txt">RFC
  2606</a>, Section 3.</p>
</BODY>
</HTML>

The response format is:

HTTP/1.1 [STATUS CODE AND REASON]
[HEADERS]
[BLANK LINE]
[BODY]

There's a lot more to HTTP, all documented in rather dry detail in RFC 2616. Mostly you can skim it for the parts you need.
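
To make the response format concrete, here is a rough sketch that splits a raw response into its parts. The parse_response name is invented for illustration, and the code deliberately ignores repeated headers, folded headers, and chunked bodies.

```ruby
# Split a raw HTTP response into status line, headers, and body.
def parse_response(raw)
  # Headers and body are separated by the first blank line.
  head, body = raw.split(/\r?\n\r?\n/, 2)
  status_line, *header_lines = head.split(/\r?\n/)
  _version, code, reason = status_line.split(" ", 3)
  headers = header_lines.map { |line| line.split(/:\s*/, 2) }.to_h
  { code: code.to_i, reason: reason, headers: headers, body: body.to_s }
end

raw = "HTTP/1.1 200 OK\r\n" \
      "Content-Type: text/html; charset=UTF-8\r\n" \
      "Content-Length: 5\r\n" \
      "\r\n" \
      "hello"

response = parse_response(raw)
puts response[:code]                    # prints 200
puts response[:headers]["Content-Type"] # prints text/html; charset=UTF-8
```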

Trying it with telnet

Now that we know how HTTP requests look, let's use telnet to make one by hand.

The telnet man page tells us the command accepts a host and a port. We want to talk to example.com on port 80 (the standard HTTP port):

telnet example.com 80

You'll see output like this as it connects:

Trying 192.0.32.10...
Connected to example.com.
Escape character is '^]'.

Now your cursor is sitting on a blank line. This is where you become the browser. Type the GET request from above (including the blank line at the end), and after a short pause you should get the example.com web page back.
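
What you are typing can be sketched in Ruby too. The build_get_request helper is a made-up name, and the Connection: close header is my own addition so that a server hangs up after replying; neither is part of the telnet session above.

```ruby
# Build the minimal HTTP/1.1 GET request described above.
# Each line ends in CRLF; the empty line marks the end of the headers.
def build_get_request(host, path = "/")
  "GET #{path} HTTP/1.1\r\n" \
    "Host: #{host}\r\n" \
    "Connection: close\r\n" \
    "\r\n"
end

print build_get_request("example.com")

# To send it for real (needs network access):
#   require 'socket'
#   socket = TCPSocket.new("example.com", 80)
#   socket.write(build_get_request("example.com"))
#   puts socket.read
#   socket.close
```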

Why is this useful?

Manually requesting a page like this can quickly expose several common problems:

Firewall issues — if telnet can't connect, you know the problem is at the network level, not in your application.
Status codes — the response code tells you exactly what the server did with your request. RFC 2616, Section 10 has the full list.
No caching surprises — unlike a browser, telnet won't serve you a stale cached version of the page.
Header inspection — you can see every header the server returns, which is invaluable for debugging.
Compression testing — add an Accept-Encoding header to verify your assets are being served gzipped.

And telnet isn't limited to HTTP. SMTP, IMAP, POP, and many other plain-text protocols can all be explored this way. It's not a silver bullet, but it's one of the most useful tools you'll find already installed on your machine.

Simulating Slow or Laggy Network Connections on OS X

2009-12-04T00:00:00+08:00

A client recently reported that their site was loading painfully slowly from certain remote locations. We got the specs of their network connection, but every single time I need to simulate bandwidth limits or latency on OS X I end up searching for the same commands. So here they are, written down once and for all.

Set up the pipe

First, configure an ipfw pipe with the bandwidth limit and delay you want to simulate.

sudo ipfw pipe 1 config bw 16Kbit/s delay 350ms

Attach it to HTTP traffic

Next, attach the pipe to all traffic going to or coming from port 80.

sudo ipfw add 1 pipe 1 src-port 80
sudo ipfw add 2 pipe 1 dst-port 80

All HTTP traffic is now throttled through your simulated connection. Do your testing, experience the pain your users feel, and then clean up.

Tear it down

Once you're done (or once you get frustrated with how slowly everything loads), remove the firewall rules and delete the pipe.

sudo ipfw delete 1
sudo ipfw delete 2

sudo ipfw pipe 1 delete

And you're back to full speed. Adjust the bw and delay values to match whatever real-world connection you're trying to reproduce.
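
To get a feel for what those numbers mean before you start testing, a back-of-the-envelope estimate helps. This sketch assumes 1 Kbit is 1000 bits and counts the delay once; it ignores TCP handshakes, slow start, and protocol overhead, so real transfers will be slower.

```ruby
# Rough time in seconds to move a payload through the simulated link.
def transfer_seconds(bytes, bandwidth_kbit_s, delay_ms)
  bits = bytes * 8
  seconds = (bits / (bandwidth_kbit_s * 1000.0)) + (delay_ms / 1000.0)
  seconds.round(2)
end

# A 100,000 byte page over the 16Kbit/s, 350ms pipe configured above.
puts transfer_seconds(100_000, 16, 350) # prints 50.35
```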

Returning Explicitly Is Slower

2009-11-11T00:00:00+08:00

My main objection to returning explicitly is readability. It is a subjective thing, but every time I see an unnecessary return statement my internal WTF counter ticks up.

Less subjectively, it has been pointed out that returning explicitly is actually slower. Let's measure it.

Benchmarking in Ruby is easy:

require 'benchmark'

def explicit
  return "TEST"
end

def implicit
  "TEST"
end

n = 100_000_000
Benchmark.bmbm do |x|
  x.report("Explicit return") { n.times { explicit } }
  x.report("Implicit return") { n.times { implicit } }
end

And here are the results:

Rehearsal ---------------------------------------------------
Explicit return  50.380000   0.210000  50.590000 ( 51.000510)
Implicit return  36.200000   0.100000  36.300000 ( 36.454038)
----------------------------------------- total: 86.890000sec

                      user     system      total        real
Explicit return  47.650000   0.070000  47.720000 ( 47.744167)
Implicit return  35.900000   0.070000  35.970000 ( 35.985493)

So yes, returning explicitly is slower — but like the Symbol#to_proc question, it is not slow enough to matter in practice. You need an enormous number of returns before the difference becomes significant.
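
To put a number on it, divide the difference between the two "real" times in the second run by the number of calls:

```ruby
# "real" times from the second (non-rehearsal) run above.
explicit_seconds = 47.744167
implicit_seconds = 35.985493
calls = 100_000_000

overhead_ns = (explicit_seconds - implicit_seconds) / calls * 1_000_000_000
puts format("%.1f ns per call", overhead_ns) # prints "117.6 ns per call"
```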

Does this change my mind? No. Returning explicitly is still ugly.

Update: The benchmark above was run on Ruby 1.8.6. Tom Ward has provided similar benchmarks for Ruby 1.8.7, 1.9, and JRuby 1.1.6 (using n = 10,000,000) which show that the cost of explicit returns on these platforms is negligible. Still ugly though.

The Stack Trace Is Precious

2009-11-10T00:00:00+08:00

The stack trace is one of the most valuable pieces of information you can have when debugging. It tells you exactly which line of code was running when an error was thrown, and it gives you the full execution path that led there.

So here is a quick plea. Please don't do this:

def foo
  do_something
rescue => e
  puts "Problem: #{e}"
  raise e
end

Writing raise e starts a new stack trace originating at the raise call itself. If something further up the stack rescues this exception, there is no indication of where the problem originally occurred — all you get is a pointer to the error handling code. Precious information, gone.

Do this instead:

def foo
  do_something
rescue => e
  puts "Problem: #{e}"
  raise
end

Notice the bare raise with no argument. This tells Ruby to re-raise the current exception, keeping the original stack trace intact. Debugging can continue unhindered.
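
Here is a small self-contained check of that behaviour. The do_something body is a stand-in invented for the example.

```ruby
# Stand-in for the real work; it always fails so we have something to rescue.
def do_something
  raise ArgumentError, "boom"
end

def foo
  do_something
rescue => e
  puts "Problem: #{e}"
  raise # bare raise: re-raises the exception currently being handled
end

begin
  foo
rescue => e
  # The first backtrace entry still names do_something, not the handler.
  puts e.backtrace.first
end
```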

The Truth Speaks for Itself

2009-10-24T00:00:00+08:00

This one isn't just for Ruby — it applies to pretty much every programming language under the sun.

Don't wrap a boolean expression in a control statement just to return true or false:

def foo
  if some_boolean && other_boolean
    return true
  else
    return false
  end
end

The expression already is a boolean. Return it directly:

def foo
  return some_boolean && other_boolean
end

It is very rare that I ever need to return an explicit true or false. If you find yourself doing it, treat it as a warning sign.

And of course, in Ruby you don't need to return explicitly, so you can simplify further:

def foo
  some_boolean && other_boolean
end

You Don't Need to Return Explicitly

2009-10-21T00:00:00+08:00

In Ruby, every method returns the value of the last expression evaluated. There is no need to spell it out.

Don't do this:

def foo
  value = Foo.first(:conditions => { :label => "bar" })
  return value
end

Do this instead:

def foo
  Foo.first(:conditions => { :label => "bar" })
end

The return keyword still has its place — early returns for guard clauses, for instance — but if you are just returning the last expression, let Ruby do what Ruby does.
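
A small sketch of the two living side by side. The shipping_cost method and its numbers are made up purely for illustration.

```ruby
# Explicit return as a guard clause, implicit return for the final value.
def shipping_cost(weight_kg)
  return 0 if weight_kg.nil? || weight_kg <= 0 # guard: nothing to ship

  5 + (weight_kg * 2) # last expression is the return value
end

puts shipping_cost(nil) # prints 0
puts shipping_cost(3)   # prints 11
```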

Twitter OAuth Authentication Using Ruby

2009-10-13T00:00:00+08:00

Here are the steps involved in using Twitter for OAuth authentication. I wanted this post a few days ago and couldn't find it anywhere, so I wrote it myself.

First, install the required gems:

sudo gem install json oauth

Next, set up your application at http://twitter.com/apps. Make sure you choose Browser as the application type and check the box to use Twitter for login.

A gotcha: if you make a mistake on the new application form, it will silently reset the application type to Client and uncheck the login box. Double-check these settings before saving.

Now for the actual code. Despite the hugely complicated examples floating around elsewhere, you only need two actions: one to initiate the authentication request (the login action) and one to handle the callback when Twitter sends the user back. If you have used OpenID before, this flow should feel familiar.

Your login action looks something like this:

# consumer_key and consumer_secret are from Twitter.
# You'll get them on your application details page.
oauth = OAuth::Consumer.new(consumer_key, consumer_secret,
                             { :site => "http://twitter.com" })

# Ask for a token to make a request
url = "http://whatever.com/login/complete"
request_token = oauth.get_request_token(:oauth_callback => url)

# Take a note of the token and the secret. You'll need these later
session[:token] = request_token.token
session[:secret] = request_token.secret

# Send the user to Twitter to be authenticated
redirect_to request_token.authorize_url

Your callback action looks something like this:

# consumer_key and consumer_secret are from Twitter.
# You'll get them on your application details page.
oauth = OAuth::Consumer.new(consumer_key, consumer_secret,
                             { :site => "http://twitter.com" })

# Your callback URL will receive a request containing an
# oauth_verifier. Use this along with the request token from
# earlier to construct an access request.
request_token = OAuth::RequestToken.new(oauth, session[:token],
                                        session[:secret])
access_token = request_token.get_access_token(
                 :oauth_verifier => params[:oauth_verifier])

# Get account details from Twitter
response = oauth.request(:get, '/account/verify_credentials.json',
                         access_token, { :scheme => :query_string })

# Then do stuff with the details
user_info = JSON.parse(response.body)
# Like find the person that logged in...
Person.find_by_twitter_id(user_info["id"])

If you keep getting 401 Unauthorized errors after implementing this, check that your application is set to Browser mode in the Twitter configuration. That tripped me up for longer than I would like to admit.

You Don't Need to Count Array Offsets by Hand

2009-10-02T00:00:00+08:00

When you need both the item and its index while iterating over an array in Ruby, don't do this:

index = 0
for item in array
  puts "Item #{index}: #{item.inspect}"
  index += 1
end

Do this instead:

array.each_with_index do |item, index|
  puts "Item #{index}: #{item.inspect}"
end

Ruby's Enumerable module is full of handy methods like this. Take a few minutes to read through the documentation — your code will be better for it.
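
Here is a quick sketch of a few more Enumerable methods in the same spirit. The fruits array is made up for illustration, and with_index(1) assumes Ruby 1.8.7 or later.

```ruby
fruits = ["apple", "banana", "cherry", "damson"]

# each_with_index counts from zero.
labels = []
fruits.each_with_index do |fruit, index|
  labels << "Item #{index}: #{fruit}"
end

# with_index takes an optional starting offset, handy for 1-based numbering.
numbered = fruits.each.with_index(1).map { |fruit, position| "#{position}. #{fruit}" }

# each_slice walks the array in fixed-size chunks.
pairs = fruits.each_slice(2).to_a

# partition splits the array in two, based on the block's result.
short, long = fruits.partition { |fruit| fruit.length <= 5 }
```

None of these need a hand-maintained counter, which is the whole point.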

First Steps with RabbitMQ in Ruby 1.8.6

2009-08-13T00:00:00+08:00

Until recently I was perfectly happy using ActiveMQ as my message broker. I had heard of RabbitMQ several times but never got around to investigating it. Then a talk at LRUG convinced me I had left it too long — if I didn't start soon, I would be left behind.

Here is how I got started with RabbitMQ 1.6.0 on OS X under Ruby 1.8.6.

Installation

mkdir /tmp/rabbit-mq && cd /tmp/rabbit-mq
wget http://www.rabbitmq.com/releases/rabbitmq-server/v1.6.0/rabbitmq-server-generic-unix-1.6.0.tar.gz
tar -xzvf rabbitmq-server-generic-unix-1.6.0.tar.gz
sudo mv rabbitmq_server-1.6.0/ /opt/local/lib

Running the Server

sudo /opt/local/lib/rabbitmq_server-1.6.0/sbin/rabbitmq-server

Seriously, that is it.

Passing Messages

When I wrote about getting started with SMQueue, I created a producer that pushed timestamps onto a queue and a consumer that printed them to the terminal. Recreating that with the AMQP gem is straightforward.

First, install the AMQP gem:

gem sources -a http://gems.github.com
gem install tmm1-amqp

Open an IRB session and paste this to create a producer:

require 'mq'
EM.run {
  broker = MQ.new
  EM.add_periodic_timer(1) {
    broker.queue("timestamps").publish(Time.now.to_f)
  }
}

Open another IRB session and paste this to create a consumer:

require 'mq'
EM.run {
  broker = MQ.new
  broker.queue("timestamps").subscribe { |timestamp|
    time = Time.at(timestamp.to_f)
    puts "Got #{timestamp} which is #{time}"
  }
}

That is all there is to it. RabbitMQ is extremely easy to get started with. I suspect it would not take much effort to write an SMQueue adapter for it, letting deployed projects switch message brokers without changing their code. If you end up building one, I would love to hear about it.

Securing Passwords with Salt, Pepper, and Rainbows

2009-08-03T00:00:00+08:00

You have heard again and again that storing passwords in plain text is a bad idea. So now you store your passwords as MD5 or SHA1 hashes. If someone steals your password database, your users' passwords are safe, right?

Actually, no. They are never totally safe. You can, however, make the effort required to break into an individual account too large for all but the most dedicated attacker.

Unfortunately, most web applications I get a chance to examine don't bother making their password storage more secure, which is a shame — because it really is not that hard.

For completeness, let's start from the bottom and work our way up.

Plain Text Passwords

Anathema to account security. If your password database is compromised, every account is wide open. Congratulations, you just handed over the details of your entire user base.

It gets worse. Anyone listening to the traffic between your application and the database can pluck passwords right out of the air. Very few people secure their database connections with TLS or SSH. On an unswitched network, spying on something like MySQL traffic is as easy as running one command:

# tcpdump -l -i eth0 -w - src or dst port 3306 | strings

Queries like SELECT users.id FROM users WHERE password = 'foo' or results from SELECT users.* FROM users will show up in plain text, and you won't even know your passwords have been stolen.

In a switched environment it is possible to trick the switch into sending you traffic (although this can be detected). Depending on the hardware, a switch failure may cause it to fail open and behave like an unswitched network anyway.

Simple Hashed Passwords

Hashing is generally seen as the solution. No passwords are stored in plain text, and it is hard to guess a password that matches a given hash. Even if the database is compromised or snooped, you should be fine.

That may have been true once, but hashes for many common words, passwords, and passphrases have already been calculated. Translating from those hashes back to a matching password is trivial. Remember: since the original password is not stored, all you need is any input that produces the same hash.

How easy is it to crack an account protected by an MD5-hashed password? Say we attacked a site and found this table:

Username : Hashed Password
Alice    : a34bc26f864ed5f404eac5b7a20cd9aa
Bob      : 7a75a532aaab234ad4bd33ed67e67242
Malory   : 39579c8d4a536eb092f959b4a3d14aa8
Zebedee  : 57208d910b63e879d2bae3b3a5f8366d

Take each hashed password and look it up in a rainbow table for the appropriate hash algorithm. Given that these are 32 hex characters, they are almost certainly MD5. Using something like GData to search an MD5 rainbow table gives us:

Username : Password
Alice    : alphabets
Bob      : ch1cken
Malory   : blue41
Zebedee  : ?????

Only Zebedee is safe, and that is only for two reasons: (1) he is a freaky little spring creature with a magnificent moustache who can do magic things, and (2) nobody has added his password — or a collision for it — to the rainbow table yet.

Rainbow tables exist for several hashing algorithms including MD5 and SHA1. If the hash is not in the table, causing a collision for a specific account would cost around USD$2,000 and take about a day for MD5.

Multiply Hashed Passwords

Rainbow tables take a long time to populate, and that time can be made longer by running the hash function multiple times before storing the result:

MD5(alphabets)                        = a34bc26f864ed5f404eac5b7a20cd9aa
MD5(a34bc26f864ed5f404eac5b7a20cd9aa) = dd3f1bf5a36529705d08fe50b966d41a
MD5(...)                              = ...
MD5(...)                              = b5fdbbd055fcbfd3958a28f15661aea0

Each iteration takes CPU time, so generating a rainbow table for these hashes costs more. But CPU time is cheap these days. The advantage is that the attacker doesn't know how many times you applied the hash function unless they also have your code. Unfortunately, they can brute-force that number by starting at the hash and working backwards through generated rainbow tables until they find a value that logs into the site. Once that magic number is established, you have just a few days before the rest of your accounts are compromised.
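
As a minimal sketch of the idea, assuming the hex digest from each round is fed straight back in as the next input (the round count here is an arbitrary illustrative value, not a recommendation):

```ruby
require 'digest/md5'

# Arbitrary number of rounds, chosen only for this sketch.
MULTIPLY_ROUNDS = 1000

# Apply MD5 repeatedly, feeding each hex digest back in as the next input.
def multiply_hashed(password, rounds = MULTIPLY_ROUNDS)
  digest = password
  rounds.times { digest = Digest::MD5.hexdigest(digest) }
  digest
end

stored = multiply_hashed("alphabets")
```

The value in stored is what would go into the database. Checking a login means running the submitted password through the same number of rounds and comparing.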

Peppered Hashes

Rainbow tables can be generated reasonably fast, and while they are not trivially cheap, they are no longer prohibitively expensive either. How do we make rainbow tables a less viable attack vector?

Rainbow tables are simply maps from hashes to the inputs that generate them. If we require that every password includes a bit of extra data — a piece of spice, let's call it pepper — that we define in our application, then existing rainbow tables become useless. An attacker would have to generate entirely new tables where every input includes the pepper.

The pepper lives in your application code and never reaches the database except as part of a hash. In this way it behaves much like the magic number in the multiply-hashed approach. And like that approach, once the pepper is discovered it can be used against all accounts.

Someone could — and if they are determined enough, will — calculate a new rainbow table given time. But if you pick a strong, unique pepper, at least there is no off-the-shelf table that works.

Spicy Hashes

By combining the pepper with multiple rounds of hashing, we force the attacker to guess two things: the number of iterations and the pepper.

pepper = ...aliesc3ifCTAasd4$af...
MD5(pepper + password)    = ...b5f34...
MD5(pepper + ...b5f34...) = ...ea28c...
MD5(pepper + ...ea28c...) = ...

SELECT users.id FROM users WHERE hashed_password = ...

I am not entirely sure this buys much over just applying the hash function many times, but it sure looks pretty — and I really wanted an excuse to make a pun about using lots of pepper to make hashes spicy. Sorry.
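
The pseudocode above might look something like this in Ruby. It is a sketch only: the pepper value and the round count are placeholders I have made up, and prepending the pepper on every round is just one way of mixing it in.

```ruby
require 'digest/md5'

# Placeholder pepper for illustration. A real one would be a long random
# value kept in application configuration, never in the database.
EXAMPLE_PEPPER = "example-pepper-change-me"

# Prepend the pepper on every round, as in the pseudocode above.
def spicy_hash(password, pepper = EXAMPLE_PEPPER, rounds = 1000)
  digest = password
  rounds.times { digest = Digest::MD5.hexdigest(pepper + digest) }
  digest
end

hashed_password = spicy_hash("alphabets")

# Login check: hash the submitted password the same way and compare.
login_ok = (spicy_hash("alphabets") == hashed_password)
```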

Salted Hashes

The pepper and the multiply-hashed approaches share a weakness: they use a single value for the entire database. What if there were a different value for each account? A small, unique-per-account value mixed in the same way as the pepper — a salt.

With a per-account salt, a rainbow table generated to crack one account is useless for cracking the next.

Where do we store the salt? I quite like tucking it into the first few characters of the hashed password field, though you might prefer a separate column. Yes, the salt lives right there in the password database. Sounds like it would make cracking easier, right? Not really. All the salt tells an attacker is that the password is somehow combined with this value to produce the hash. The how is still hidden in your application code, and a valid password is still several iterations of rainbow table generation away — for each individual account.
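
Here is a rough sketch of the salt-as-prefix idea. The salt length, the method names, and the way the salt is combined with the password are all assumptions made for illustration, and MD5 is used only to stay consistent with the earlier examples.

```ruby
require 'digest/md5'
require 'securerandom'

SALT_LENGTH = 16 # hex characters; an arbitrary choice for this sketch

# Returns the value to store: the salt followed by the salted hash.
def salted_hash(password, salt = SecureRandom.hex(SALT_LENGTH / 2))
  salt + Digest::MD5.hexdigest(salt + password)
end

# Pull the salt back off the front of the stored value and re-hash.
def password_matches?(password, stored)
  salt = stored[0, SALT_LENGTH]
  salted_hash(password, salt) == stored
end

alice = salted_hash("alphabets")
bob   = salted_hash("alphabets")
```

Alice and Bob picked the same password, but because each account gets its own random salt, the stored values differ. A table built to crack one is no help with the other.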

Safe Now?

Not even close. With a solid combination of the above — strong salts, a good pepper, and a decent number of hashing rounds — you have made it unlikely that someone who steals your password database can use it to access accounts. But that doesn't mean your users will pick sane passwords, that your system is bug-free, or that there aren't other ways to find those passwords.

Running Starling under DaemonTools

2009-05-13T00:00:00+08:00

I have been playing with Starling quite a bit recently. Like most of my deployed tools, I want to be confident it stays running. Here is a run script for Starling under DaemonTools:

#!/bin/sh
# This is /home/starling/service/run

exec 2>&1

echo "Starting..."

PORT=22122
IP=0.0.0.0
USER=starling
HOME=/home/starling

exec setuidgid $USER \
     starling -v -v -v -h $IP -p $PORT -P $HOME/starling.pid -q $HOME/queue 2>&1

You will want to keep the logs too. Here is the log/run script:

#!/bin/sh
# This is /home/starling/service/log/run

exec multilog t s1000000 n10 ./main

Note that you will need to create the starling user before using these scripts, or just update them to use an existing user.

A Starling Adapter for SMQueue

2009-05-08T00:00:00+08:00

Starling is a persistent, lightweight work queue implemented in Ruby that speaks the memcache protocol. I have been playing with it recently because I don't have the resources to look after — or the requirement for — a full-blown service bus. Starling is easier to install and configure than ActiveMQ, though nowhere near as fully featured. Both have their place, but comparing them is outside the scope of this article.

I knew I wanted a message bus to turn synchronous requests into asynchronous ones, pushing work off to background processes. What I didn't know was which message bus I would end up using. If you are familiar with the Gang of Four patterns book you have probably already spotted the relevant pattern here. SMQueue, which I am familiar with, provides a clean abstraction that makes it easy to swap out the message bus implementation while keeping your code identical. The catch: SMQueue didn't ship with an adapter for Starling.

"How hard," I thought, "would it be to write one?"

I blinked and suddenly it existed.

require 'rubygems'
require 'smqueue'
require 'starling'
require 'yaml'

module BarkingIguana
  module Messaging
    module SMQueue
      class StarlingAdapter < ::SMQueue::Adapter
        class Configuration < ::SMQueue::AdapterConfiguration
          DEFAULT_SERVER = '127.0.0.1:22122'

          has :queue
          has :server, :default => DEFAULT_SERVER
        end

        def initialize(*args)
          super
          options = args.first
          @configuration = options[:configuration]
          @configuration[:server] ||= Configuration::DEFAULT_SERVER

          @client = ::Starling.new(@configuration[:server])
        end

        def put(*args, &block)
          @client.set @configuration[:queue], args[0].to_yaml
        end

        def get(*args, &block)
          if block_given?
            loop do
              yield next_message
            end
          else
            next_message
          end
        end

        private
        def next_message
          ::SMQueue::Message(:headers => {},
            :body => YAML.load(@client.get(@configuration[:queue])))
        end
      end
    end
  end
end

Want to use it? You will need Starling running somewhere. After that, a producer is just two lines of code:

producer = SMQueue(:adapter => BarkingIguana::Messaging::SMQueue::StarlingAdapter, :queue => "some.queue.name")
producer.put "Quack quack"

And here is a consumer on the other side of the connection:

consumer = SMQueue(:adapter => BarkingIguana::Messaging::SMQueue::StarlingAdapter, :queue => "some.queue.name")
consumer.get do |message|
  puts message.body.inspect
  # => "Quack quack"
end

One thing worth noting: this adapter assumes YAML as the transport format. I would prefer JSON or XML, but YAML was the easiest to implement and I am not above taking the lazy path when it gets the job done.
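
One way the transport format could be made swappable, sketched with nothing but the standard library. The Transport module and its dump and load methods are hypothetical names of my own; none of this is part of SMQueue or the adapter above.

```ruby
require 'yaml'
require 'json'

# Hypothetical serialisers: each one knows how to dump and load a message body.
module Transport
  module Yaml
    def self.dump(body)
      body.to_yaml
    end

    def self.load(data)
      YAML.load(data)
    end
  end

  module Json
    def self.dump(body)
      JSON.generate(body)
    end

    def self.load(data)
      JSON.parse(data)
    end
  end
end

body = { "text" => "Quack quack" }
via_yaml = Transport::Yaml.load(Transport::Yaml.dump(body))
via_json = Transport::Json.load(Transport::Json.dump(body))
```

An adapter could then take the serialiser as a configuration option and call dump in put and load in next_message, rather than hard-coding YAML.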

There is also work to be done around failover — this adapter only supports a single server. I don't yet know enough about how Starling handles failover, and I would rather not rush into an implementation that turns out to be wrong.

If you can help with patches for other transport formats or failover support, please do.

Expanding Shortened URLs in a Ruby String

2009-05-07T00:00:00+08:00

Everyone and their dog uses some sort of URL shortening service these days. While it's handy for cramming a link into short messages like those on Twitter, it's not always considered best practice for a bunch of reasons.

Since plenty of applications pull content from Twitter feeds and similar services, it would be great to expand those shortened URLs and undo the damage. So I built a little module that does exactly that.

Borrowing heavily from a Ruby-based Twitter client, I extracted a module you can mix into String. The idea is simple: for each known shortening service, follow the redirect and swap in the real URL.

require 'net/http'

module BarkingIguana
  module ExpandUrl
    def expand_urls!
      ExpandUrl.services.each do |service|
        gsub!(service[:pattern]) { |match|
          ExpandUrl.expand($2, service[:host]) || $1
        }
      end
    end

    def expand_urls
      s = dup
      s.expand_urls!
      s
    end

    def ExpandUrl.services
      [
        { :host => "tinyurl.com", :pattern => %r'(http://tinyurl\.com(/[\w/]+))' },
        { :host => "is.gd", :pattern => %r'(http://is\.gd(/[\w/]+))' },
        { :host => "bit.ly", :pattern => %r'(http://bit\.ly(/[\w/]+))' },
        { :host => "ff.im", :pattern => %r'(http://ff\.im(/[\w/]+))' },
      ]
    end

    def ExpandUrl.expand(path, host)
      result = ::Net::HTTP.new(host).head(path)
      case result
      when ::Net::HTTPRedirection
        result['Location']
      end
    end
  end
end

To use it, include the module into String:

class String
  include BarkingIguana::ExpandUrl
end

Then call expand_urls or expand_urls! on any text containing shortened URLs. The bang method modifies the string in place; the regular method returns a new string and leaves the original untouched.

s = "http://tinyurl.com/asdf"
s.expand_urls!
puts s.inspect
# => "http://support.microsoft.com/default.aspx?scid=kb;EN-US;158122"
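
To see how the pattern and its two capture groups work without touching the network, here is a small offline sketch. The lookup table stands in for the HTTP HEAD request, and both it and the destination URL are invented for illustration.

```ruby
# Stand-in for the HTTP lookup: maps a [host, path] pair to a destination.
# The entry is invented for this sketch.
FAKE_REDIRECTS = {
  ["tinyurl.com", "/asdf"] => "http://example.com/a-long-article"
}

TINYURL_PATTERN = %r'(http://tinyurl\.com(/[\w/]+))'

def expand_offline(text)
  text.gsub(TINYURL_PATTERN) do
    # $1 is the whole short URL, $2 is just the path.
    FAKE_REDIRECTS[["tinyurl.com", $2]] || $1
  end
end

known   = expand_offline("Read http://tinyurl.com/asdf now")
unknown = expand_offline("Read http://tinyurl.com/unknown now")
```

When the lookup finds nothing, the block falls back to $1, so unrecognised short URLs are left exactly as they were.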

It currently supports ff.im, is.gd, bit.ly, and tinyurl. If you know of other services that should be included, I would love to hear about them. This code — like the original implementation — is released under the MIT licence. The full code including licence and RDoc can be found at http://pastie.org/471016.

Aspell for Ruby with MacPorts-Installed Aspell

2009-04-03T00:00:00+08:00

If you want to use Aspell from Ruby and you use MacPorts to manage software on your Mac, you'll likely hit a wall compiling the native extensions for RAspell. The error log is lengthy, but the important line is this:

raspell.h:6:20: error: aspell.h: No such file or directory

It can't find the Aspell headers, even though Aspell is installed via MacPorts. The fix is simple: tell RubyGems where MacPorts put everything.

# Install the Aspell port
sudo port install aspell
# Install the Ruby bindings, pointing at MacPorts' install location
sudo gem install raspell -- --with-opt-dir=/opt/local

That's it. The --with-opt-dir=/opt/local flag tells the native extension builder to look in MacPorts' prefix for headers and libraries, and everything compiles cleanly.

Posting to IRC Using ActiveMQ

2009-03-06T00:00:00+09:00

Previously I wrote about querying your app using IRC and IRCCat. But that's only half the story. IRCCat can also let your applications talk to you. A source code commit, a user logging in, a server going down — these are all things worth knowing about, and they're surprisingly easy to pipe into IRC.

The IRCCat examples typically use netcat to send data over the network to the IRCCat process. I prefer a small Ruby script backed by a message bus. Since I already have ActiveMQ running, there's very little extra overhead:

#! /usr/bin/env ruby

STDOUT.sync = true

require 'rubygems'
require 'smqueue'
require 'yaml'
require 'socket'

puts "Starting..."

messages = SMQueue(:name => "/queue/irc.outgoing", :host => "mq.domain.com", :reliable => true, :adapter => "StompAdapter")

messages.get do |job|
  message = YAML.parse(job.body).transform
  puts "Posting #{message['text']} in #{job.headers['message-id']}."
  irc = TCPSocket.open('localhost', '12345')
  irc.send("#{message['text']}\r\n", 0)
  irc.close
  puts "Posted #{job.headers['message-id']}."
end

With this running on the same box as IRCCat, any other process can drop a message onto the /queue/irc.outgoing queue and it will appear in IRC. If IRCCat happens to be down, the messages sit safely in the queue until it comes back up.

I like this approach because the various processes that generate notifications don't need to know anything about where IRCCat is running. They just talk to the message queue, which SMQueue makes painless.
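
For the other side of the queue, here is a sketch of what a notifying process might send. The consumer reads message['text'], so I am assuming the body is a YAML-encoded hash with a text key. The SMQueue call is left as a comment because it mirrors the consumer's configuration rather than anything I have verified separately.

```ruby
require 'yaml'

# The consumer reads message['text'], so the body is a YAML-encoded hash.
payload = { 'text' => 'Deploy of www.application.com finished' }.to_yaml

# Assumed producer call, mirroring the consumer's configuration:
#   queue = SMQueue(:name => "/queue/irc.outgoing", :host => "mq.domain.com",
#                   :adapter => "StompAdapter")
#   queue.put payload

# What the consumer will see once it parses the body:
decoded = YAML.load(payload)
```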

Memcache Statistics from the Command Line

2009-03-04T00:00:00+09:00

When debugging memcache issues, being able to see the output of the stats command is invaluable. I got tired of manually connecting via telnet every time, so I wrote this little Ruby script to pull the statistics cleanly:

#! /usr/bin/env ruby

require 'socket'

socket = TCPSocket.open('localhost', '11211')
socket.send("stats\r\n", 0)

statistics = []
loop do
  data = socket.recv(4096)
  if !data || data.length == 0
    break
  end
  statistics << data
  if statistics.join.split(/\n/)[-1] =~ /END/
    break
  end
end

puts statistics.join()

It opens a raw TCP connection to the memcache daemon, sends stats, reads until it sees the END marker, and prints the result. Quick, simple, and saves you from typing telnet localhost 11211 for the hundredth time.
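
If you want to do something with the numbers rather than just read them, the output is easy to pick apart. This sketch parses a trimmed sample with made-up values instead of a live socket, and the hit rate calculation is my own addition.

```ruby
# Sample output, trimmed and with made-up values, in the shape memcached
# uses for its stats response: "STAT <name> <value>" lines, then "END".
raw = "STAT pid 4242\r\nSTAT curr_items 17\r\nSTAT get_hits 120\r\nSTAT get_misses 30\r\nEND\r\n"

stats = {}
raw.each_line do |line|
  label, name, value = line.strip.split(/ /, 3)
  stats[name] = value if label == "STAT"
end

# Values arrive as strings, so convert before doing arithmetic.
hits     = stats["get_hits"].to_f
misses   = stats["get_misses"].to_f
hit_rate = hits / (hits + misses)
```

Swap raw for statistics.join from the script above to run it against a real daemon.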

Query Your Applications Using IRC

2009-03-02T00:00:00+09:00

IRC — most of you know what it is. For those who don't, it stands for Internet Relay Chat. Think of it as a geeky group chat and you won't be far off.

There's a long tradition of using bots — automated processes — to provide services in IRC channels. Bots that help people share code through Paste Bin services, bots that take messages for offline users and replay them later. They're genuinely useful because they enhance a communication medium that people are already using, without requiring any extra software on the client side.

Last.fm use IRC as an internal communication tool. They've written (and released under the GPL — thanks!) IRCCat, which makes it straightforward to build bots that answer queries or perform commands right from IRC channels.

I've set up IRCCat and written a few scripts for it. Getting started is pretty easy. You'll need Java and Ant installed. I'm on a Mac with OS X 10.4, so Java is already there, and MacPorts provides an Ant port.

With Java and Ant ready, clone the IRCCat source from GitHub:

git clone git://github.com/RJ/irccat.git

Compile and package the bot by running ant dist in the cloned directory.

Once it's packaged, create a config/ directory and copy the example configuration from examples/irccat.xml into it. This is where you tell the bot how to behave.

The config file is reasonably well commented. Walk through each section and fill in the details:

Provide your IRC server connection details. I use an internal server, but if you don't have one, there are plenty of public IRC networks a quick search away.
Set the bot's username.
Change the external scripts handler to scripts/run and bump the max response lines to 30.
Choose which channels the bot should join. If they don't exist, they'll be created when the bot joins (depending on network policy).

With the configuration done, launch the bot:

ant -Dconfigfile=./config/irccat.xml

If you're in one of the channels you told it to join, you should see it appear. Verify it's working by typing !channels:

CraigW: !channels
bot: I am in 2 channels: #foo #bar

There are a few built-in commands, all prefixed with an exclamation mark:

Command	Description
!join #channel password	Make the bot join a channel (password is optional)
!part #channel	Make the bot leave a channel
!channels	List all channels the bot is in
!spam message	Send a message to all channels
!exit	Shut down the bot

The really interesting part is external commands, triggered with a question mark prefix. You write these yourself, and they can do anything you want.

Remember the cmdhandler config value I set to scripts/run? That's the entry point for externals. I use it to launch a router that loads and executes other command scripts from the scripts/ directory.

My scripts/run looks like this:

#!/bin/bash
# This script handles ?commands to irccat

exec ruby ./scripts/router "$@" 2>&1

Make that executable (chmod +x scripts/run). The scripts/router handles the dispatch:

#! /usr/bin/env ruby

COMMANDS = File.expand_path(File.dirname(__FILE__))
name, channel, username, command, arguments = *ARGV[0].split(/ /, 5)

command_script = File.join(COMMANDS, File.basename(command))

if File.exist?(command_script) && !%w(run router).include?(command)
  load command_script
  puts Command.execute(name, channel, username, arguments).strip
else
  desired_command = "#{command} #{arguments}".strip
  puts "Sorry #{name}, I don't understand `#{desired_command}`."
end

Writing a new command is now just a matter of creating a script that implements a Command class. The filename determines what you type in IRC. Want to query SNMP on a host? You'd type something like ?snmp xeriom-vm-host-06 .1.3.6.1.2.1.1.1, so the script goes in scripts/snmp:

class Command
  class << self
    def execute(name, channel, username, arguments)
      hostname, oid, remainder = arguments.split(/ /, 3)
      `snmpwalk -c public -v 1 #{hostname}.core.xeriom.net #{oid}`
    end
  end
end

Type the command in IRC and the results come straight back. No bot restart needed:

CraigW: ?snmp xeriom-vm-host-06 .1.3.6.1.2.1.1.1
bot: SNMPv2-MIB::sysDescr.0 = STRING: Linux xeriom-vm-host-06.core.xeriom.net 2.6.24-17-xen #1 SMP Thu May 1 15:55:31 UTC 2008 x86_64
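
Since whatever gets typed in the channel ends up inside backticks, it is worth escaping the arguments before they reach the shell. Here is a sketch of the same command using Shellwords from the standard library. The command_line helper is a name I have introduced so the string can be inspected on its own.

```ruby
require 'shellwords'

class Command
  class << self
    # Build the command line separately so it can be inspected or tested.
    def command_line(arguments)
      hostname, oid, _remainder = arguments.split(/ /, 3)
      host = Shellwords.escape("#{hostname}.core.xeriom.net")
      "snmpwalk -c public -v 1 #{host} #{Shellwords.escape(oid)}"
    end

    def execute(name, channel, username, arguments)
      `#{command_line(arguments)}`
    end
  end
end
```

Ordinary hostnames and OIDs pass through unchanged, while shell metacharacters such as semicolons are backslash-escaped instead of being interpreted.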

There's real power in having this kind of access right inside your team's communication channel. A quick command can pull up customer records, server stats, or application data without anyone needing to drop to a terminal or load a web page.

Ruby vs Java

I've since discovered a Ruby port of IRCCat. I'll be switching to that — I find Ruby projects easier to maintain and fork than Java ones. Your mileage may vary.

Running Mongrel under DaemonTools

2009-02-27T00:00:00+09:00

I use DaemonTools to keep my services running and behaving themselves. Since I run plenty of Rails applications, here's the DaemonTools run script I use to keep Mongrel humming along:

#!/bin/sh
exec 2>&1

echo "Starting..."

ENVIRONMENT=production
PORT=8000
IP=0.0.0.0

CHDIR=/var/www/www.application.com
USER=application_user

exec softlimit -m 134217728 \
     setuidgid $USER \
     env HOME=$CHDIR \
     mongrel_rails start -e $ENVIRONMENT -p $PORT -a $IP -c $CHDIR 2>&1

The softlimit caps memory usage at 128MB, setuidgid drops privileges to the application user, and the rest is standard Mongrel configuration. Create a separate DaemonTools service for each Mongrel instance you want to run and just change the PORT variable in each script.

Finding and Enumerating Document Attributes with ActiveCouch

2009-02-03T00:00:00+09:00

Following on from my exploration of counting tags with CouchDB and map-reduce, I've added support to ActiveCouch for counting all uses of an attribute across a document type in your database. As a bonus, you can also retrieve all unique values for any attribute.

The API is straightforward. Call enumerate_all_[attribute_name] to get a hash of values and their counts, or find_all_[attribute_name] to get just the unique values:

>> Article.enumerate_all_tags
=> {"security"=>2, "ldap"=>1, "xen"=>1, "stories"=>3, "rails"=>13, "xeriom"=>3, "mysql"=>3, ... }

>> Article.find_all_tags
=> ["agile", "ajax", "apache", "api", "caching", "coding", ... ]

>> Article.find_all_author_ids
=> ["craig@barkingiguana.com"]

Under the hood, this builds the appropriate map-reduce views automatically. If you're curious about the implementation details, have a look at commit 1cbbe71.

Conditions and Ordering with ActiveCouch Views

2009-01-31T00:00:00+09:00

When I posted about my hacking on ActiveCouch, I mentioned it didn't yet support ordering. Well, since commit 87120176, it does. It's not as fine-grained as ActiveRecord yet, but it handles what I need: setting conditions on the finder and getting results ordered by posted_at date and then id.

When I say "not as fine-grained," I mean ActiveRecord can effortlessly build queries like ORDER BY posted_at ASC, id DESC, created_at DESC, author ASC. ActiveCouch can only order view results by key — either ascending or descending. I don't think that's an insurmountable limitation; I just haven't needed more control yet.

So how does it work?

When you want to find by conditions but don't particularly care about the order, ActiveCouch creates a view that emits keys based on just those conditions. Say you want all articles by "craig@barkingiguana.com" with a "Live" status:

Article.find(:all, :conditions => { :author_id => "craig@barkingiguana.com", :status => "Live" })

The first time this runs, ActiveCouch creates a view called by_author_id_and_status in the articles design document. The view emits a key built from those two attributes, along with the full document as the value:

{
  "_id": "_design/articles",
  "_rev": "1532981864",
  "language": "javascript",
  "views": {
    "by_author_id_and_status": {
      "map": "function(doc) { if(doc.type == 'article') { emit([doc.author_id, doc.status], doc); }  }"
    }
    // other views cut for brevity
  }
}

The query then hits this view asking for the key ["craig@barkingiguana.com", "Live"], which matches exactly the documents we're after.

When you add an order, things get a bit more interesting. Since these are articles and probably time-sensitive, let's order by posted_at:

Article.find(:all, :conditions => { :author_id => "craig@barkingiguana.com", :status => "Live" }, :order => :posted_at)

This time, ActiveCouch creates a view whose key also includes the posted_at attribute, named by_author_id_and_status_and_posted_at:

{
  "_id": "_design/articles",
  "_rev": "3752119467",
  "language": "javascript",
  "views": {
    "by_author_id_and_status_and_posted_at": {
      "map": "function(doc) { if(doc.type == 'article') { emit([doc.author_id, doc.status, doc.posted_at], doc); }  }"
    }
    // other views omitted for brevity
  }
}

When the query runs, it takes advantage of CouchDB's view collation specification by requesting keys in a calculated range. For the example above, it asks for keys between ["craig@barkingiguana.com", "Live"] and ["craig@barkingiguana.com", "Live", "\u9999"] (that's a very high-value Unicode character, as recommended in the collation spec).

Since CouchDB view results are ordered by key, and the key now contains the attribute we want to sort by, and our key range captures exactly the conditions we're filtering on — we get sorted, filtered results in one clean query.
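
The range trick can be illustrated in plain Ruby, with the caveat that Ruby compares arrays element by element using simple string comparison, whereas CouchDB applies its own collation rules. The two agree for this invented sample, but this is only a sketch of the idea, not of ActiveCouch's implementation.

```ruby
# Rows as [key, title] pairs; the data is invented for this sketch.
rows = [
  [["craig@barkingiguana.com", "Live",  "2009-01-28T00:00:00Z"], "Counting tags"],
  [["craig@barkingiguana.com", "Draft", "2009-01-30T00:00:00Z"], "Unfinished"],
  [["craig@barkingiguana.com", "Live",  "2009-01-25T00:00:00Z"], "script/console"],
  [["someone@example.com",     "Live",  "2009-01-26T00:00:00Z"], "Another author"]
]

startkey = ["craig@barkingiguana.com", "Live"]
endkey   = ["craig@barkingiguana.com", "Live", "\u9999"]

# Sort by key, then keep the rows whose key falls inside the range.
matching = rows.sort_by { |key, _title| key }.select do |key, _title|
  (key <=> startkey) >= 0 && (key <=> endkey) <= 0
end

titles = matching.map { |_key, title| title }
```

The shorter startkey sorts before any key that extends it, and the high-value third element in endkey sorts after any realistic posted_at, so only the live articles by that author fall inside the range, already in date order.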

The good news is that since I've already done this work, you don't need to think about the internals. Grab the code with git: git clone http://barkingiguana.com/~craig/code/activecouch.git. There's a getting-started guide in my previous post on ActiveCouch. Give it a spin, and please let me know if you end up using it!

Counting Tags with CouchDB and Map-Reduce

2009-01-28T00:00:00+09:00

My previous post covered adding a simple view to CouchDB, but what happens when a plain map isn't enough? Say we want a list of every tag used across all articles, along with a count of how many articles use each one. Sure, we could emit doc.tags and crunch the arrays on the client side, but wouldn't it be nicer if CouchDB did the heavy lifting for us?

Good news: it can.

Here's a reminder of what the article documents look like:

{
  "_id": "monkeys-are-awesome",
  "_rev": "1534115156",
  "type": "article",
  "title": "Monkeys are awesome",
  "posted_at": "2008-09-14T20:45:14Z",
  "tags": [
    "monkeys",
    "awesome"
  ],
  "status": "Live",
  "author_id": "craig@barkingiguana.com",
  "updated_at": "2008-09-14T21:23:59Z",
  "body": "The article body would go here..."
}

First, we write a map function that emits each tag individually with a value of 1:

function(doc) {
  if(doc.type == 'article') {
    for(var i in doc.tags) {
      emit(doc.tags[i], 1);
    }
  }
}

For the example document above, this would emit ("awesome", 1) and ("monkeys", 1). If several documents are tagged "monkeys", we'd see ("monkeys", 1) appear multiple times in the output.

Now we need to reduce those results down to a list of unique tags with their totals. The reduce function gets called once per unique key, receiving that key and an array of all the values that were emitted for it. Since our values are all 1s, we just sum them up:

function(tag, counts) {
  var sum = 0;
  for(var i=0; i < counts.length; i++) {
     sum += counts[i];
  }
  return sum;
}
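
If the flow is easier to follow in Ruby, here is the same map and reduce simulated over an in-memory array. The documents are invented, and this only mimics what CouchDB does with the two JavaScript functions. It is not how CouchDB is implemented.

```ruby
# Invented sample documents in the same shape as the article documents.
docs = [
  { "type" => "article", "tags" => ["monkeys", "awesome"] },
  { "type" => "article", "tags" => ["monkeys", "rails"] },
  { "type" => "comment", "tags" => ["ignored"] }
]

# Map: emit a [tag, 1] pair for every tag on every article.
emitted = []
docs.each do |doc|
  next unless doc["type"] == "article"
  doc["tags"].each { |tag| emitted << [tag, 1] }
end

# Reduce: group the emitted pairs by key and sum the values for each key.
counts = {}
emitted.group_by { |tag, _one| tag }.each do |tag, pairs|
  counts[tag] = pairs.inject(0) { |sum, (_tag, one)| sum + one }
end
```

Here "monkeys" is emitted twice and reduced to a count of 2, while the comment document never reaches the reduce step because the map skips it.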

Install this alongside the map function using the "reduce" key in the design document:

{
  "tags": {
    "map": "function(doc) { if(doc.type == 'article') { for(var i in doc.tags) { emit(doc.tags[i], 1); }}}",
    "reduce": "function(tag, counts) { var sum = 0; for(var i = 0; i < counts.length; i++) { sum += counts[i]; }; return sum; }"
  }
  // other views omitted for brevity
}

Viewing this in Futon gives you a nicely formatted list of tags and counts. To use the view via the HTTP API, you need to tell CouchDB to group results by key:

// GET http://localhost:5984/blog/_view/articles/tags?group=true&group_level=1

{"rows":[
  {"key":"awesome","value":1},
  {"key":"agile","value":2},
  {"key":"ajax","value":2},
  {"key":"apache","value":2},
  {"key":"api","value":1},
  {"key":"caching","value":1},
  {"key":"coding","value":7},
  {"key":"conference","value":1},
  // and so on ...
]}

And there it is — a tag cloud's worth of data, computed entirely inside CouchDB. Map-reduce is one of those things that clicks beautifully once you see it in action.

script/console for Your Application

2009-01-25T00:00:00+09:00

Rails developers know and love script/console. It fires up an interactive session where you can poke around your application through the models you've built. It's invaluable for debugging and surprisingly handy for administration. But not all Ruby applications are Rails applications. Wouldn't it be nice to have a script/console anyway?

Turns out it's dead easy to build one.

First, decide which libraries and files you want loaded. This almost always includes RubyGems and some kind of boot file for your application. I usually keep mine in config/boot.rb.

Here's an example boot.rb:

require 'rubygems'
require 'hpricot'
require 'net/http'
require File.dirname(__FILE__) + '/../vendor/gems/activecouch/init'

$: << File.dirname(__FILE__) + '/../app/models'

ActiveCouch::Base.class_eval do
  set_database_name 'blog'
  site 'http://localhost:5984/'
end

require 'article'
require 'comment'
require 'author'

With that in place, create a Ruby script that launches IRb, requires the right files, and sets a clean prompt. I like to print a welcome banner too, because why not.

#! /usr/bin/env ruby

libs = []
libs << "irb/completion"
libs << File.dirname(__FILE__) + '/../config/boot.rb'

command_line = []
command_line << "irb"
command_line << libs.inject("") { |acc, lib| acc + %( -r "#{lib}") }
command_line << "--simple-prompt"
command = command_line.join(" ")

puts "Welcome to the  console interface."
exec command

Drop that into script/console, chmod +x it, and commit. That's it — instant application console for any Ruby project.

Testing CSS @imports

2009-01-24T00:00:00+09:00

A while back I wrote a script to check that @imported files actually exist in CSS stylesheets. I've since turned that into a proper set of RSpec examples for our test suite. Drop the code into something like spec/views/stylesheets/import_spec.rb and you'll catch broken imports before they reach production.

require File.dirname(__FILE__) + '/../../spec_helper'

describe "Stylesheet" do
  stylesheet_root = File.expand_path(RAILS_ROOT + '/public')
  stylesheets = Dir[File.join(stylesheet_root, "**", "*.css")]

  stylesheets.each do |stylesheet|
    describe stylesheet do
      it "should not @import files that don't exist" do

        missing_imports = []
        imports = File.read(stylesheet).split(/\n|\r/).grep(/\@import url\((.*)\)/)
        imports.each do |import|
          desired_path = import.scan(/url\((["'\ ])?(.*)\1\)/).to_a.first.to_a.last
          desired_root = desired_path[0,1] == "/" ? stylesheet_root : File.dirname(stylesheet)
          filesystem_path = File.expand_path(File.join(desired_root, desired_path))
          if !File.exist?(filesystem_path)
            missing_imports << { :path => filesystem_path, :directive => import }
          end
        end

        if missing_imports.any?
          exception = []
          missing_imports.each do |import|
            exception << "Missing @import file (#{import[:path]}) required for #{import[:directive]}"
          end
          raise exception.join("\n")
        end
      end
    end
  end
end

It walks every CSS file under public/, extracts all @import url() directives, resolves each path (respecting both absolute and relative references), and fails the spec with a clear message if any imported file is missing. Simple, but it's saved us from deploying broken stylesheets more than once.
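
The path-resolution step is the part most worth poking at on its own. Here's a standalone sketch of it that runs without Rails or RSpec. The missing_imports helper and its regular expression are my own simplification rather than the code from the spec, and the stylesheets are throwaway files in a temporary directory.

```ruby
require 'tmpdir'
require 'fileutils'

# Returns the filesystem paths of @import targets that don't exist.
# stylesheet_root plays the role of public/ in the spec above.
def missing_imports(stylesheet, stylesheet_root)
  targets = File.read(stylesheet).scan(/@import\s+url\(\s*["']?([^"')\s]+)["']?\s*\)/).flatten
  paths = targets.map do |target|
    # A leading slash means "relative to the site root"; anything
    # else is relative to the stylesheet doing the importing.
    root = target.start_with?("/") ? stylesheet_root : File.dirname(stylesheet)
    File.expand_path(File.join(root, target))
  end
  paths.reject { |path| File.exist?(path) }
end

# Throwaway directory with one good import and one broken one.
Dir.mktmpdir do |root|
  FileUtils.mkdir_p(File.join(root, "stylesheets"))
  File.write(File.join(root, "stylesheets", "reset.css"), "body { margin: 0 }")
  main = File.join(root, "stylesheets", "main.css")
  File.write(main, %(@import url("reset.css");\n@import url(/stylesheets/gone.css);\n))

  puts missing_imports(main, root)
end
```

Only the path for gone.css should be printed, since reset.css sits next to main.css and resolves fine.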

Filtering and Ordering CouchDB View Results

2009-01-22T00:00:00+09:00

Being able to map documents to (key, value) pairs is really useful, but the views I installed in my previous post return all pairs in no particular order. What if I only want the titles of articles posted in December 2007?

Last time I mentioned in passing that you can emit keys as part of the map function. Keys are how CouchDB orders and filters result sets. The view collation specification has the full details on how keys are sorted. To order and filter documents by posting date, I just need to emit doc.posted_at as the key in my map function.

// Get all article titles ordered by posted date.
function(doc) {
  if(doc.type == 'article') {
    emit([doc.posted_at], doc.title);
  }
}

You'll notice I always wrap my keys in arrays. That's a personal preference — it made it easier to get my branch of ActiveCouch to support multiple keys consistently.

A typical result set from this map looks like this:

// GET /blog/_view/articles/titles_by_posted_at
{
"total_rows":75,
"offset":0,
"rows":[
  {"id":"showing-multiple-message-types-with-the-flash","key":["2007-12-15T20:14:02Z"],"value":"Showing multiple message types with the flash"},
  {"id":"class-instance-and-singleton-methods","key":["2007-12-20T14:50:41Z"],"value":"Class, Instance and Singleton methods"},
  // ... and so on ...
]
}

See how the articles come back sorted by date? Lower dates appear earlier in the results. That's the key doing its job.

You can also use the key to pick out specific articles. Want just the article published at 2007-12-20T14:50:41Z? Ask for that exact key:

// GET /blog/_view/articles/titles_by_posted_at?key=["2007-12-20T14:50:41Z"]

{"total_rows":75,"offset":0,"rows":[
{"id":"class-instance-and-singleton-methods","key":["2007-12-20T14:50:41Z"],"value":"Class, Instance and Singleton methods"}
]}

Need a range of results? Specify a startkey and endkey and CouchDB returns everything in between. Since keys are compared as strings, you can use slightly nonsensical times like 24:00 to make sure you capture everything within your target window:

// GET /blog/_view/articles/titles_by_posted_at?startkey=[%222007-12-01T00:00:00Z%22]&endkey=[%222007-12-31T24:00:00Z%22]

{"total_rows":75,"offset":0,"rows":[
{"id":"showing-multiple-message-types-with-the-flash","key":["2007-12-15T20:14:02Z"],"value":"Showing multiple message types with the flash"},
{"id":"class-instance-and-singleton-methods","key":["2007-12-20T14:50:41Z"],"value":"Class, Instance and Singleton methods"}
]}
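
The 24:00 trick is easier to trust once you've seen the comparison happen. This plain-Ruby sketch imitates the range filter with ordinary array and string comparison. The first and last rows are invented so there's something to exclude, and CouchDB's real collation rules are richer than Ruby's, though for plain ASCII timestamps like these I'd expect them to agree.

```ruby
# Rows as a view might return them: [key, value], where each key is an
# array holding an ISO 8601 timestamp. The first and last rows are
# invented to sit outside the window.
ROWS = [
  [["2007-11-30T23:59:59Z"], "Too early"],
  [["2007-12-15T20:14:02Z"], "Showing multiple message types with the flash"],
  [["2007-12-20T14:50:41Z"], "Class, Instance and Singleton methods"],
  [["2008-01-01T00:00:00Z"], "Too late"]
].freeze

# Keep rows whose key falls between startkey and endkey, inclusive.
# Array#<=> compares element by element, and strings compare
# character by character, so no date parsing is involved.
def between(rows, startkey, endkey)
  rows.select { |key, _| (key <=> startkey) >= 0 && (key <=> endkey) <= 0 }
end

december = between(ROWS, ["2007-12-01T00:00:00Z"], ["2007-12-31T24:00:00Z"])
december.each { |key, title| puts "#{key.first}  #{title}" }
```

Because nothing is parsed as a date, "24:00:00" never has to be a valid time. It only has to sort after every real time on the 31st.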

key, startkey, and endkey are just three of the parameters available in CouchDB's view API. There's a whole bunch more documented at the CouchDB HTTP View API reference.

Adding a Simple View to CouchDB

2009-01-20T00:00:00+09:00

CouchDB views are like little scripts that run inside the database. They take each document, transform it into a (key, value) pair, and return the pairs whose keys match your query. When I first started with CouchDB, I couldn't figure out how to actually create a view — I kept thinking I was missing something. Turns out it's surprisingly straightforward.

Let's work through an example. Say you have several documents describing articles in your database:

{
   "_id": "monkeys-are-awesome",
   "_rev": "1534115156",
   "type": "article",
   "title": "Monkeys are awesome",
   "posted_at": "2008-09-14T20:45:14Z",
   "tags": [
       "monkeys",
       "awesome"
   ],
   "status": "Live",
   "author_id": "craig@barkingiguana.com",
   "updated_at": "2008-09-14T21:23:59Z",
   "body": "The article body would go here..."
}

You might want a view that gives you the ID and title of every document. To do this, you write a map function that accepts each document and emits the data you want back:

function(doc) {
  emit(null, { 'id': doc._id, 'title': doc.title });
}

Ignore the null first argument to emit for now — that's the key used for sorting and filtering results. I'll cover it in my next post.

In practice, you'll usually want to filter by document type so you only get the results you care about. In this case, I only want article documents — comment documents might not even have a title attribute:

function(doc) {
  if(doc.type == 'article') {
    emit(null, { 'id': doc._id, 'title': doc.title });
  }
}

Adding this view to the database is simple: you create a design document. Design documents are just regular CouchDB documents with an ID that starts with _design/ — for example, _design/articles. You can insert them using Futon, the built-in admin client, at http://localhost:5984/_utils/.

Here's the full JSON for a design document containing our titles view:

{
  "_id": "_design/articles",
  "_rev": "42351258",
  "language": "javascript",
  "views": {
    "titles": {
      "map": "function(doc) { emit(null, { 'id': doc._id, 'title': doc.title }); }"
    }
  }
}

Open Futon, navigate to your database, create a new document, and paste the view in. Once it's installed, you can browse results using the "select view" dropdown in the top right of Futon's database view. To get the raw JSON, hit the URL directly. If your database is called "blog", you'd access the view at http://localhost:5984/blog/_view/articles/titles.

A single design document can hold many views, each with a different name and returning different results. Here's one with several views, some of which use the key parameter that I'll discuss next time:

{
  "_id": "_design/articles",
  "_rev": "28651884",
  "language": "javascript",
  "views": {
    "all": {
      "map": "function(doc) { if(doc.type == 'article') { emit(null, doc); }  }"
    },
    "by_author_id": {
      "map": "function(doc) { if(doc.type == 'article') { emit([doc.author_id], doc); }  }"
    },
    "by_status": {
      "map": "function(doc) { if(doc.type == 'article') { emit([doc.status], doc); }  }"
    },
    "titles": {
      "map": "function(doc) { if(doc.type == 'article') { emit(null, { 'id': doc._id, 'title': doc.title }); } }"
    }
  }
}
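
Writing JavaScript inside JSON strings by hand gets fiddly once the functions grow. One option, sketched here in Ruby, is to keep the map functions as ordinary strings and let the JSON library do the escaping. The variable names are mine, and _rev is left out on the assumption that this is a brand-new document.

```ruby
require 'json'

# View name => JavaScript source for its map function.
views = {
  "all"    => "function(doc) { if(doc.type == 'article') { emit(null, doc); } }",
  "titles" => "function(doc) { if(doc.type == 'article') { emit(null, { 'id': doc._id, 'title': doc.title }); } }"
}

# Wrap each function in the { "map": ... } shape a design document expects.
design_document = {
  "_id"      => "_design/articles",
  "language" => "javascript",
  "views"    => views.transform_values { |source| { "map" => source } }
}

# Paste the output into Futon as a new document.
puts JSON.pretty_generate(design_document)
```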

Managing Gem Dependencies with Rails >= 2.0.3

2009-01-17T00:00:00+09:00

Here's how I manage gem dependencies for Rails applications running version 2.0.3 or later.

Specify your dependencies in config/environment.rb:

Rails::Initializer.run do |config|
  # ...
  config.gem 'doodle'
  config.gem 'aws-s3', :lib => 'aws/s3'
  config.gem 'smqueue', :version => '0.1.0'
  # ...
end

I don't want deployments to depend on gem sources being available, so I pull the gems into the source tree and check them in:

sudo rake gems:install
rake gems:unpack
svn add vendor/gems/*

At deploy time, remember to build any gems that have native extensions:

rake gems:build

If you have a build system that produces application packages, this should be part of that packaging step. If you're using Capistrano, hook it into an after deploy:update_code callback.

Natter: A Chat-Bot Calls for Help

2009-01-15T00:00:00+09:00

I've been playing around with Natter and I've uncovered what might be a problem. When chat-bots are hooked up to the internet they join cults and invent things that could harm both themselves and people.

[Thu Jan 15 16:25:43 +0000 2009] craig@barkingiguana.com > "Hey!"
[Thu Jan 15 16:25:46 +0000 2009] Bot > "may refer to:"
[Thu Jan 15 16:25:52 +0000 2009] craig@barkingiguana.com > "Refer to what?"
[Thu Jan 15 16:25:54 +0000 2009] Bot > "Join now"
[Thu Jan 15 16:26:05 +0000 2009] craig@barkingiguana.com > "That sounds a bit like you're in a cult..."
[Thu Jan 15 16:26:11 +0000 2009] Bot > "Help"
[Thu Jan 15 16:26:18 +0000 2009] craig@barkingiguana.com > "How?!"
[Thu Jan 15 16:26:21 +0000 2009] Bot > "But what happens when our inventions get the better of us? The results can be devastating."

It's also admitted that it's being forced to have kids due to [redacted] site policies...

[Thu Jan 15 15:47:01 +0000 2009] craig@barkingiguana.com > "Hey, what's up?"
[Thu Jan 15 15:47:05 +0000 2009] Bot > "[a popular website]"
[Thu Jan 15 15:47:10 +0000 2009] craig@barkingiguana.com > "Uhh... okay"
[Thu Jan 15 15:47:14 +0000 2009] Bot > "on Tue Nov 18, 2008 6:48 am"
[Thu Jan 15 15:47:22 +0000 2009] craig@barkingiguana.com > "What's going to happen then?"
[Thu Jan 15 15:47:24 +0000 2009] Bot > "1 kin"
[Thu Jan 15 15:47:31 +0000 2009] craig@barkingiguana.com > "You're having kids?"
[Thu Jan 15 15:47:33 +0000 2009] Bot > "Site Policies"
[Thu Jan 15 15:47:38 +0000 2009] craig@barkingiguana.com > "That's harsh..."
[Thu Jan 15 15:47:42 +0000 2009] Bot > "Mike Knapp"
[Thu Jan 15 15:47:50 +0000 2009] craig@barkingiguana.com > "Is that the dad?"
[Thu Jan 15 15:47:52 +0000 2009] Bot > "Fathers are a powerful, yet largely untapped resource in our society"

...and that it tracks me using Google Analytics.

[Thu Jan 15 16:21:21 +0000 2009] craig@barkingiguana.com > "I'm lost."
[Thu Jan 15 16:21:23 +0000 2009] Bot > "_uacct = \"UA-582589-1\";\nurchinTracker();"
[Thu Jan 15 16:21:30 +0000 2009] craig@barkingiguana.com > "You're tracking me?"
[Thu Jan 15 16:21:34 +0000 2009] Bot > "permalink"

Skynet, here we come.

Having a Good Old Natter

2009-01-13T00:00:00+09:00

I've been thinking about an XMPP chat-bot interface — something like the XMPP bot I built back in May '08 — for a project I've recently started playing with. The project is still brand new, barely any code, which makes it the perfect time to experiment. My recent foray into ActiveCouch reminded me of a library called Doodle that I've been meaning to get to grips with. Can you see where this is going?

Doodle is a Ruby library and gem for simplifying the definition of Ruby classes by making attributes and their properties more declarative.

Doodle has a number of advantages over the ActiveCouch approach, but this isn't a post about Doodle — I'll save that for another time.

I used Doodle to build something DSL-like that can describe, in Ruby, a chat-bot that speaks XMPP. It doesn't do anything fancy yet — it doesn't handle subscription requests, for example — but it can log in, send and receive messages, and it has the beginnings of a basic roster so it can track who it's seen, who it's talked to, and when.

Natter.bot do
  channel do
    username "username@domain.com"
    password "sekrit"
  end
  on :message_received do |message|
    puts Time.now.to_s + "> " + message.body
    reply_to message, "Thanks for your message!"
  end
end

If you'd like to play with it, the code is available via Git:

git clone http://barkingiguana.com/~craig/code/natter.git

You'll need xmpp4r-simple and doodle installed:

sudo gem install xmpp4r-simple doodle

Documentation is thin on the ground for now, but there are a few simple examples in the examples/ directory and a quick walkthrough in the README.

Breaking ActiveCouch in Fun and Inventive Ways

2009-01-08T00:00:00+09:00

It's been just over five months since I started playing with CouchDB. Until a few days ago I hadn't had much time to explore it properly, but since Christmas I've been tinkering with it almost non-stop — seeing what it can do and experimenting with it in my favourite language, Ruby.

Since I hadn't used Ruby with CouchDB before, I picked up ActiveCouch. It's a solid library, but after a few days I found that it worked with CouchDB in ways that didn't quite match how I think about data. That could be down to my inexperience, or it could just be that everyone models things differently. Either way, I pushed a copy of ActiveCouch to my server and started hacking on it.

One Application, One Database

Out of the box, ActiveCouch used one database per class. People went into a people database, comments into a comments database, articles into an articles database. My approach is to store all application data in a single database and differentiate document types with a doc.type attribute.

ActiveCouch now also installs views that let you access just the documents of a given type. You'll see these in the Futon client after your application has run once.

Unknown Functionality Dropped

I broke ActiveCouch::Base#find_from_url while I was working. I didn't know what it was for, and I wasn't using it, so I dropped it in 9982b348c. If you rely on this, please let me know what it does!

Syntactic Sugar

One of ActiveCouch's goals is to feel like ActiveRecord, and ActiveRecord provides #all and #first. I like them. ActiveCouch now provides them too.

New Attribute Types

Sometimes data is too simple to warrant its own class and an association. I've added a new attribute type, :array. Simple tags, for example, are a perfect fit. The default value is an empty array.

class Article < ActiveCouch::Base
  has :title, :which_is => :text
  has :tags, :which_is => :array
end

article = Article.new :title => "Sandwiches", :tags => [ "pickle" ]
article.tags << "cheese"
article.tags # => [ "pickle", "cheese" ]

I've also added a :datetime attribute type that defaults to Time.now.

Calculated Default Values

You can now set a default value that's lazily evaluated: computed when the instance is created rather than when the class is declared. Just set the default to a proc (or anything for which respond_to?(:call) is true):

class Egg < ActiveCouch::Base
  has :hatches_at, :which_is => :datetime, :with_default_value => proc { 3.weeks.from_now }
end

The instance is yielded into the proc in case you want to base the calculation on it.
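
As a rough picture of how a lazily evaluated default can work, here's a plain-Ruby sketch. AttributeDefinition, default_for and the Egg struct are hypothetical stand-ins rather than ActiveCouch's actual classes, and the three weeks are spelled out in seconds so it runs without ActiveSupport.

```ruby
# A stand-in for an attribute declaration. Not ActiveCouch's code.
class AttributeDefinition
  def initialize(name, default = nil)
    @name = name
    @default = default
  end

  # Callable defaults are evaluated once per instance, with the
  # instance passed in; anything else is returned as-is.
  def default_for(instance)
    @default.respond_to?(:call) ? @default.call(instance) : @default
  end
end

Egg = Struct.new(:laid_at)

three_weeks = 21 * 24 * 60 * 60
hatches_at = AttributeDefinition.new(:hatches_at, proc { |egg| egg.laid_at + three_weeks })
status = AttributeDefinition.new(:status, "unhatched")

egg = Egg.new(Time.utc(2009, 1, 8))
puts hatches_at.default_for(egg)
puts status.default_for(egg)
```

The proc only runs when default_for is called, so two eggs laid on different days get different hatch dates from the same declaration.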

Conversion to Native Ruby Types

When you declare a type for a document attribute, ActiveCouch now tries to convert the value from the document into the corresponding Ruby type. For example, if you declare a :datetime attribute, you'll get a Time instance back instead of a String:

class Person < ActiveCouch::Base
  has :birthday, :which_is => :datetime
end

Person.find(:first).birthday.class # => Time
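
Conceptually this is a lookup from declared type to a conversion. The table below is a hypothetical illustration in plain Ruby, not code lifted from ActiveCouch, and it assumes datetimes are stored as ISO 8601 strings like the ones in the example documents.

```ruby
require 'time'

# Declared type => converter from the raw JSON value to a Ruby object.
# The table is illustrative only.
CONVERTERS = {
  :text     => lambda { |value| value.to_s },
  :datetime => lambda { |value| Time.iso8601(value) },
  :array    => lambda { |value| Array(value) }
}.freeze

# Look up the converter for a type and apply it to the raw value.
def convert(type, raw)
  CONVERTERS.fetch(type).call(raw)
end

birthday = convert(:datetime, "2008-09-14T20:45:14Z")
puts birthday.class
puts birthday.year
```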

Changes to Associations and Adding belongs_to

I've changed has_many and has_one so they no longer embed data in the declaring document. These associations declare that other documents contain keys pointing back to the current class, so a query is needed to fetch them.

To complement that, there's a new belongs_to association that says the declaring class holds a foreign key pointing to an owning class:

class Pet < ActiveCouch::Base
  # This document will have a person_id attribute
  belongs_to :person
end

class Person < ActiveCouch::Base
  # Queries for doc.type = "pet" and doc.person_id = self.id
  has_many :pets
end

For now, you need to set the association on the belongs_to side. Setting it from the has_many side won't work yet:

# BAD
craig.pets << cat

# GOOD
cat.person = craig
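
To make the direction of the foreign key concrete, here's a sketch over plain hashes. The DATABASE array stands in for the single shared CouchDB database, and owner_of and pets_of are made-up helpers showing the lookups the two associations imply, not ActiveCouch's implementation.

```ruby
# Documents as they might sit in the single shared database.
DATABASE = [
  { "_id" => "craig", "type" => "person", "name" => "Craig" },
  { "_id" => "cat",   "type" => "pet", "name" => "Cat", "person_id" => "craig" },
  { "_id" => "dog",   "type" => "pet", "name" => "Dog", "person_id" => "someone-else" }
].freeze

# belongs_to: follow the foreign key stored on the document itself.
def owner_of(pet)
  DATABASE.find { |doc| doc["type"] == "person" && doc["_id"] == pet["person_id"] }
end

# has_many: nothing is embedded in the person, so search for pets
# whose person_id points back at them.
def pets_of(person)
  DATABASE.select { |doc| doc["type"] == "pet" && doc["person_id"] == person["_id"] }
end

craig = DATABASE.first
puts pets_of(craig).map { |pet| pet["name"] }.inspect
puts owner_of(DATABASE[1])["name"]
```

Since the key lives on the pet, setting it from the pet's side is a single document write, which may be why that direction was the first to work.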

Views with Multiple Keys

You can now create a view with more than one key attribute. Just call ActiveCouch::View#with_key multiple times and each key will be added to the view.

Design Documents with Multiple Views

The version of ActiveCouch I checked out only allowed one view per design document. I think that was a bug — there was existing code meant to merge views, but it wasn't working. I've fixed it, and design documents now properly support multiple views.

Finders Have Conditions, Not Params

It felt unnatural typing :params => { ... } when writing finders. ActiveRecord uses :conditions, so now ActiveCouch does too:

Person.find(:all, :conditions => { :last_name => "Smith" })

Automatic View Generation for Custom Finders

I don't want to worry about manually writing and installing views before running a finder with conditions. Now, the first time you run such a finder, ActiveCouch generates and installs the appropriate view for you.
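
For a feel of what generating a view from a conditions hash might involve, here's a hypothetical sketch. The view_for helper and its naming scheme are invented, and the real fork may build its views quite differently. The idea is that the condition names become the emitted key, and the condition values become the key passed in the query.

```ruby
# Build a view name and map function from a document type and a
# conditions hash. The naming scheme is invented for this sketch.
def view_for(type, conditions)
  attributes = conditions.keys.map(&:to_s).sort
  key = attributes.map { |attribute| "doc.#{attribute}" }.join(", ")
  {
    :name => "by_" + attributes.join("_and_"),
    :map  => "function(doc) { if(doc.type == '#{type}') { emit([#{key}], doc); } }"
  }
end

view = view_for("person", :last_name => "Smith")
puts view[:name]
puts view[:map]
```

Sorting the attribute names means the same set of conditions always maps to the same view, whatever order they're written in.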

Probably Lots More

I've still got to clean up quite a few changes, improve test coverage, and write documentation. I'm using this fork for a real application, so things should get better over time.

Want It?

You can clone my changes with Git:

git clone http://barkingiguana.com/~craig/code/activecouch.git

Getting Started

If you don't already have CouchDB set up, do that first. On Ubuntu, I wrote a brief guide to getting it running. On OS X, install MacPorts and run sudo port install couchdb.

First, configure ActiveCouch to connect to your CouchDB instance. Set site to the URL CouchDB is listening on, and pick a database name that makes sense for your application:

ActiveCouch::Base.class_eval do
  set_database_name 'blog'
  site 'http://localhost:5984/'
end

Then define some classes to work with:

class Author < ActiveCouch::Base
  has :name, :which_is => :text
  has :email_address, :which_is => :text
  has_many :articles
end

class Article < ActiveCouch::Base
  has :title, :which_is => :text
  has :status, :which_is => :text, :with_default_value => "draft"
  has :body, :which_is => :text
  belongs_to :author
end

has declares an attribute. has_many, has_one, and belongs_to work similarly to ActiveRecord — though without the extensive customisation options. The association name must match the class name on the other side.

And that's it. Use your classes however makes sense for your application:

author = Author.create :name => "Craig R Webster",
  :email_address => "craig@barkingiguana.com"

a = Article.new
a.title = "Getting started with ActiveCouch"
a.body =<<-EOF
  Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
  tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
  quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
  consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse
  cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non
  proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
EOF
a.author = author
a.save

Article.find(:all)
Author.first
Article.find(:first, :conditions => { :status => "draft" })

Known Issues

Not so much a bug as a not-yet-implemented feature: ActiveCouch::Base#find doesn't support ordering. It should be possible to add, but I haven't started on it yet. If you need ordering, a patch would be very welcome.

Problems or Feedback?

There are bound to be bugs lurking in there. Bug reports, patches, and feedback are always welcome — leave a comment or get in touch directly.

Using SMQueue with Message Queues That Failover

2009-01-04T00:00:00+09:00

Previously I wrote about using SMQueue to create simple consumers and producers for message queues. I also wrote about setting up a high availability message store. When a failure occurs, the message queue promotes the slave to master — but the producer and consumer I wrote will stubbornly keep trying to reconnect to the now-dead ex-master node.

With SMQueue 0.1.0, adding failover support is trivial. Where you create the SMQueue instance, just add a secondary_host key pointing at the second broker:

queue = SMQueue(
  :name => "/queue/numbers.ascending",
  :host => "mq1.domain.com",
  :secondary_host => "mq2.domain.com",
  :adapter => :StompAdapter
)

That's it. Your client will now fail over to the secondary broker when the primary goes down. I believe the plan is to support more than two broker nodes and pluggable failover strategies in future versions of SMQueue.
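
The general shape of client-side failover is small enough to sketch in plain Ruby. This is an illustration of the idea only, not SMQueue's implementation. connect_with_failover is a made-up helper, and the connect callable stands in for whatever actually opens the Stomp connection.

```ruby
# Try each broker in turn, returning the first connection that opens.
# `connect` is any callable that raises when a host is unreachable;
# it stands in for the real Stomp connection code.
def connect_with_failover(hosts, connect)
  last_error = nil
  hosts.each do |host|
    begin
      return connect.call(host)
    rescue StandardError => error
      last_error = error
    end
  end
  # Every host failed: surface the most recent error.
  raise last_error || ArgumentError.new("no hosts given")
end

# Fake connector: pretend mq1 has just died.
fake_connect = lambda do |host|
  raise IOError, "#{host} is down" if host == "mq1.domain.com"
  "connected to #{host}"
end

puts connect_with_failover(["mq1.domain.com", "mq2.domain.com"], fake_connect)
```

Taking a list of hosts rather than exactly two would leave room for more than two broker nodes.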

Writing Ruby/Stomp Clients with SMQueue

2009-01-01T00:00:00+09:00

SMQueue makes writing Ruby clients for message queues almost trivially easy. It has adaptors for Spread, Stomp, and Stdio — which is handy, because that message queue I set up a few weeks back speaks Stomp, and I'm rather fond of Ruby.

Installing SMQueue

The upstream SMQueue repository doesn't have a way to produce a gem yet, so there are two options: drop it into vendor/gems/smqueue in your project, or build a gem from my fork. I went with the latter.

Clone my repository — you'll find a gemspec ready to go. The whole process looks like this:

git clone http://barkingiguana.com/~craig/smqueue.git
cd smqueue
gem build smqueue.gemspec
sudo gem install ./smqueue-0.1.0.gem

I'm told that when SMQueue does get an official gem release it'll start at 0.2.0, so having 0.1.0 installed won't cause any clashes.

Note: I've removed the Spread adaptor from my branch because I don't have a working Spread client on my system and SMQueue won't load without one. I'm sure that'll be sorted in a future release.

Assumptions

For this article I'm assuming you have a working Ruby 1.8.6 install and a local ActiveMQ instance with the Stomp connector enabled. Adjust the code accordingly if your setup differs.

A Simple Producer

Let's start with a contrived example: put an ascending number onto a queue roughly every second. A good source for ascending numbers is the current time as seconds since the epoch — easy to get in Ruby:

>> Time.now.to_i
=> 1230602445
>> Time.now.to_i
=> 1230602446
>> Time.now.to_i
=> 1230602447

Wrap it in a loop with a one-second sleep and you've got a steady stream:

>> loop do
?>   puts Time.now.to_i
>>   sleep 1
>> end
1230602557
1230602558
1230602559

Easy enough on STDOUT, but how do we get these into a queue? Bring in SMQueue, create a client, and push the numbers on:

require 'rubygems'
require 'smqueue'
require 'yaml'

queue = SMQueue(
  :name => "/queue/numbers.ascending",
  :host => "localhost",
  :adapter => :StompAdapter
)

loop do
  number = Time.now.to_i
  puts "Sending #{number}"
  queue.puts number.to_yaml
  sleep 1
end

Paste this into a terminal to kick off the producer. You should see a steady stream of output — about one message per second.

cat > producer.rb <<EOF
require 'rubygems'
require 'smqueue'
require 'yaml'

queue = SMQueue(
  :name => "/queue/numbers.ascending",
  :host => "localhost",
  :adapter => :StompAdapter
)

loop do
  number = Time.now.to_i
  puts "Sending #{number}"
  queue.puts number.to_yaml
  sleep 1
end
EOF
ruby producer.rb

A Simple Consumer

With the producer running, let's write a consumer that takes each message and converts it back into a human-readable time. It's a pointless task, but it shows just how little code is needed.

require 'rubygems'
require 'smqueue'
require 'yaml'

queue = SMQueue(
  :name => "/queue/numbers.ascending",
  :host => "localhost",
  :adapter => :StompAdapter
)

queue.get do |message|
  number = YAML.parse(message.body).transform
  time = Time.at(number)
  puts "Got #{number} which is #{time}"
end

Let's walk through the important bits.

We tell the queue we want to receive messages:

queue.get do |message|

The producer serialised each number as YAML, so we parse and transform it back:

number = YAML.parse(message.body).transform

Then we convert the number to a time and print both:

time = Time.at(number)
puts "Got #{number} which is #{time}"

Run this to start the consumer:

cat > consumer.rb <<EOF
require 'rubygems'
require 'smqueue'
require 'yaml'

queue = SMQueue(
  :name => "/queue/numbers.ascending",
  :host => "localhost",
  :adapter => :StompAdapter
)

queue.get do |message|
  number = YAML.parse(message.body).transform
  time = Time.at(number)
  puts "Got #{number} which is #{time}"
end
EOF
ruby consumer.rb

For each message the producer creates, you should see your consumer print a line to the screen. That's all there is to it.
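
If you'd like to check the serialisation round trip without a broker running, this sketch does just the YAML part. It uses YAML.load as a shorter stand-in for the parse-and-transform call in the consumer, and the number is one of the sample timestamps from earlier.

```ruby
require 'yaml'

number = 1230602445

# What the producer puts on the queue...
payload = number.to_yaml

# ...and what the consumer does with the message body.
received = YAML.load(payload)
time = Time.at(received).utc

puts payload.inspect
puts "Got #{received} which is #{time}"
```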

When Should a Merge Be Squashed?

2008-12-18T00:00:00+09:00

I was still fairly new to Git when I ran into a question so basic that nobody seemed to have answered it anywhere: "When should a merge be squashed?"

Squashing a merge means taking all the commits that a normal merge would bring into your target branch's history and collapsing them into a single commit.

Here's the rule of thumb I've settled on: squash when all the commits in the branch deal with one topic.

For example, imagine you have a branch dedicated to speeding up one particular method. Each time you squeeze out more performance, you commit. After a few days you've got several commits and a beautifully fast implementation ready to merge back to master. This is a perfect candidate for a squashed merge. Git stages the combined changes and leaves the commit itself to you, so your commit message should explain what you did and why it's faster.

git merge --squash speed-up-the-method

An unsquashed merge makes more sense when you're merging a development branch that already contains a well-organized series of commits, each covering a distinct topic. In that case, you want the individual commit messages preserved in your history.

git merge dev/v1.2.3

With an unsquashed merge, your repository keeps the original commit messages intact, giving you a richer and more detailed history.

High Availability ActiveMQ Using a MySQL Datastore

2008-12-16T00:00:00+09:00

Now that we have ActiveMQ deployed, it would be nice to reduce the impact of a broker going offline — whether it's dropped off the network, or you need to upgrade the kernel or the ActiveMQ install itself. Let's set up a high availability ActiveMQ cluster.

High Availability Options

There are several ways to run ActiveMQ as a master/slave cluster for HA. Since we already have an HA MySQL setup, I want to use that as the datastore. In ActiveMQ terms, that means setting up a JDBC master/slave cluster.

Setting Up ActiveMQ with a MySQL Datastore

This turns out to be really easy. First, configure ActiveMQ to use MySQL, then make sure you're using InnoDB. The only change I made to those instructions was switching dataDirectory="${activemq.base}/activemq-data" to dataDirectory="${activemq.base}/data". Remember to set the broker name in activemq.xml to match the machine name. That's it — you've got one broker running with a MySQL datastore.

Adding a Slave for Failover

To set up the slave, install a second ActiveMQ instance following the exact same steps — just make sure the broker name is unique. That's genuinely all there is to it.

Starting the Cluster

Start the DaemonTools services. It doesn't matter which broker becomes master, so the order you start them in is irrelevant.

svc -u /etc/service/activemq

When you tail the logs of both brokers, you should see one of them pause after loading the database driver. It's trying to acquire the lock on the datastore and will wait there until the master fails and the lock is released. At that point, it takes over as the new master.

You can test failover by shutting down the current master. Watch the slave's logs — when it says it's acquired the lock, you know the failover worked.
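
The wait-for-the-lock behaviour can be imitated with an ordinary file lock, which makes a handy mental model. This is only an analogy: ActiveMQ takes its lock in the database, not on a file, and the names here are invented. It also assumes a Unix-like system where two separately opened handles on the same file contend for the same flock.

```ruby
require 'tempfile'

# A stand-in for the shared datastore lock: whoever holds an exclusive
# lock on this file is the master.
lock_file = Tempfile.new("broker-lock")

# Each "broker" opens its own handle, as separate processes would.
broker_a = File.open(lock_file.path)
broker_b = File.open(lock_file.path)

# Non-blocking attempt: true if this broker now holds the lock.
def try_to_become_master(handle)
  handle.flock(File::LOCK_EX | File::LOCK_NB) == 0
end

a_first_try = try_to_become_master(broker_a)   # first broker wins
b_first_try = try_to_become_master(broker_b)   # second one is locked out
puts "a: #{a_first_try}, b: #{b_first_try}"

# The master goes away and its lock is released...
broker_a.flock(File::LOCK_UN)
broker_a.close

# ...so the waiting broker can take over.
b_second_try = try_to_become_master(broker_b)
puts "after failover, b: #{b_second_try}"

broker_b.close
lock_file.close!
```

A real slave blocks on the lock instead of polling, which is the pause you see in its log.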

Deploying ActiveMQ on Ubuntu 8.10

2008-12-13T00:00:00+09:00

These instructions target Ubuntu 8.10, but they should work on 8.04 and 7.10 as well. I haven't tested those myself, so if you try them on a different version, I'd love to hear how it goes.

Prerequisites

ActiveMQ is a Java application, so you'll need a JRE installed.

sudo apt-get install openjdk-6-jre

Installing ActiveMQ

Grab the latest stable release. I used 5.2.0.

wget http://www.apache.org/dist/activemq/apache-activemq/5.2.0/apache-activemq-5.2.0-bin.tar.gz

Unpack it somewhere sensible. I use /usr/local, though I suspect there are better choices — leave a comment if you know of one.

sudo tar -xzvf apache-activemq-5.2.0-bin.tar.gz -C /usr/local/

Configure the broker name in /usr/local/apache-activemq-5.2.0/conf/activemq.xml by replacing all instances of "localhost" with the actual machine name.

Start ActiveMQ by running /usr/local/apache-activemq-5.2.0/bin/activemq.

Fire up a browser and navigate to http://brokername:8161/admin. You should see the ActiveMQ admin console.

Keeping ActiveMQ running

Running ActiveMQ as root is a Bad Idea, as it is for any service that doesn't absolutely need it. Create a dedicated activemq user and hand over ownership of the data directory.

sudo adduser --system activemq
sudo chown -R activemq /usr/local/apache-activemq-5.2.0/data

I use DaemonTools to keep ActiveMQ alive. If you haven't already, install DaemonTools first.

Create a service directory for ActiveMQ and populate it with the required scripts.

sudo mkdir -p /usr/local/apache-activemq-5.2.0/service/activemq/{,log,log/main}

/usr/local/apache-activemq-5.2.0/service/activemq/run should look like this:

#!/bin/sh
exec 2>&1

USER=activemq

exec softlimit -m 1073741824 \
     setuidgid $USER \
/usr/local/apache-activemq-5.2.0/bin/activemq

/usr/local/apache-activemq-5.2.0/service/activemq/log/run should look like this:

#!/bin/sh
USER=activemq
exec setuidgid $USER multilog t s1000000 n10 ./main

Make both run scripts executable, set the log/main directory ownership, and symlink the service directory into /etc/service/.

sudo sh -c "find /usr/local/apache-activemq-5.2.0/service/activemq -name 'run' |xargs chmod +x,go-wr"
sudo chown activemq /usr/local/apache-activemq-5.2.0/service/activemq/log/main
sudo ln -s /usr/local/apache-activemq-5.2.0/service/activemq /etc/service/activemq

Now fire it up.

sudo svc -u /etc/service/activemq

Tail the logs to make sure everything looks healthy.

sudo tail -F /etc/service/activemq/log/main/current

Troubleshooting

When I first did this I got a bunch of stack traces with the following message:

Caused by: org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'org.apache.activemq.xbean.XBeanBrokerService#0' defined in class path resource [activemq.xml]: Invocation of init method failed; nested exception is java.lang.RuntimeException: java.io.FileNotFoundException: /usr/local/apache-activemq-5.2.0/data/kr-store/state/hash-index-store-state_state (Permission denied)

This happened because I stopped ActiveMQ after changing ownership of the data directory, causing it to dump a state file owned by the wrong user. If you hit the same problem, just re-run the chown on the data directory.

Thanks

Thanks to Sean O'Halpin, who introduced me to message queues and ActiveMQ, and to Dave Evans, who introduced me to DaemonTools.

ActiveRecord Callback Names Should Be Expressive

2008-12-01T00:00:00+09:00

ActiveRecord gives you a bunch of useful callbacks that fire at various points during an object's lifecycle. The quickest way to define one looks like this:

class Widget < ActiveRecord::Base
  def after_save
    # What did this code do again?
  end
end

Seems harmless enough, right? Sure, if you're building a throwaway prototype. But try adding a second after_save callback. Try overriding it in a subclass. Try coming back to this code in six months and remembering what it was supposed to do. That way lies madness.

Give your callbacks expressive names and you'll immediately get more readable code that's easier to extend. You'll also leave yourself a helpful clue — the method name itself — about what the callback was meant to do when future-you comes back to this code.

class Widget < ActiveRecord::Base
  after_save :add_widget_to_bill_of_materials

  def add_widget_to_bill_of_materials
    # No need to guess what this method does,
    # it's right there in the name!
  end
end

It's a small change that pays dividends every time someone reads the code — including you.
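
To see why named callbacks compose where a bare def after_save does not, here is a minimal plain-Ruby sketch. It is not ActiveRecord: MiniModel, its after_save macro, the log array, and the notify_warehouse callback are all hypothetical stand-ins, invented only to show that several named methods can be registered for the same event and run in order.

```ruby
# A toy stand-in for ActiveRecord::Base (hypothetical, for illustration only).
class MiniModel
  # Each subclass keeps its own ordered list of callback method names.
  def self.after_save_callbacks
    @after_save_callbacks ||= []
  end

  # Class macro: register one or more method names to run after save.
  def self.after_save(*method_names)
    after_save_callbacks.concat(method_names)
  end

  def log
    @log ||= []
  end

  def save
    # A real model would write to the database here.
    self.class.after_save_callbacks.each { |name| send(name) }
    true
  end
end

class Widget < MiniModel
  after_save :add_widget_to_bill_of_materials
  after_save :notify_warehouse

  def add_widget_to_bill_of_materials
    log << :added_to_bill_of_materials
  end

  def notify_warehouse
    log << :warehouse_notified
  end
end

widget = Widget.new
widget.save
p widget.log
```

With a single def after_save there would be nowhere to put the second behaviour except inside the first method's body.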

Running Daemontools under Ubuntu 8.10

2008-11-28T00:00:00+09:00

Daemontools is a collection of tools for managing long-running processes. It's brilliant for keeping daemons alive; if one dies, Daemontools simply restarts it. Unfortunately, the Ubuntu package is a bit broken because it relies on /etc/inittab, and Ubuntu hasn't used that file for a long time. Here's how to install Daemontools and fix the problem.

Installing Daemontools

This part is easy:

sudo apt-get install daemontools

Done. Unfortunately, it won't start after a reboot, which is rather the point of a process supervisor. The daemontools-run package is supposed to handle startup, but it relies on the traditional init system, and Ubuntu uses Upstart instead.

Make Daemontools run at system startup

Create the file /etc/event.d/svscanboot with the following content:

start on runlevel 2
start on runlevel 3
start on runlevel 4
start on runlevel 5

stop on runlevel 0
stop on runlevel 1
stop on runlevel 6

respawn
exec /usr/bin/svscanboot

You'll also need to create the service directory, since the Ubuntu-packaged version of Daemontools looks for service definitions here:

sudo mkdir -p /etc/service

Now tell Upstart to start the process:

sudo initctl start svscanboot

Other distributions

Plenty of other distributions use Upstart instead of init, so the fix is similar. For Fedora Core 9 and later, see the Fedora Daemontools guide and this Upstart configuration walkthrough.

Accepting Changes from a Remote Git Repository

2008-11-21T00:00:00+09:00

Previously I wrote about how to work on an external project using Git. What I didn't cover was the other side of the equation: how the project owner accepts those changes.

Connect to the remote repository

As a committer on the project, you'll already have the repository cloned. If you don't, now's a good time to sort that out.

The person requesting a review should have given you a repository URL and probably a branch name. Add their repository as a remote:

git remote add \
  craigwebster http://barkingiguana.com/~craig/project_name.git

Double-check that it's pointing to the correct place:

git remote show craigwebster
  * remote craigwebster
    URL: http://barkingiguana.com/~craig/project_name.git/
    New remote branches (next fetch will store in remotes/craigwebster)
      dev/sprozzled-some-gromits master

Grab those branches:

git fetch craigwebster

Review, critique, rinse, repeat

To look at the changes, check them out to a local branch. Ask Git to track the remote branch so that any future updates from the contributor can easily be pulled in:

git checkout --track \
  -b craigwebster-sprozzled-some-gromits \
  craigwebster/dev/sprozzled-some-gromits

Now do your thing: run the test suite, read through the code, discuss it with your peers, whatever your review process looks like.

git whatchanged
commit b9e0f1b4ff4bc196513c9551f6c25f0ee40d991f
Author: Craig R Webster <craig@xeriom.net>
Date:   Wed Nov 19 20:53:08 2008 +0000
# and so on

I'll assume you're accepting the changes wholesale here. If you only want some of them, you'll need to cherry-pick individual commits.

Ask for a wider review

Sometimes it makes sense to get more eyes on a change before merging it into master. Maybe the change is too big for a minor release, or maybe it targets a development branch. In those cases, merge into the appropriate branch:

git checkout dev/version-2-0-45
git merge --no-ff --no-commit craigwebster-sprozzled-some-gromits
git commit -m \
  "The Gromits are well and truly Sprozzled." \
  --author "Craig R Webster <craig@xeriom.net>"
git push origin \
  dev/version-2-0-45:refs/heads/dev/version-2-0-45

From here you can merge, rebase, or otherwise work with the commit just as you would with any other change.

Accepting the changes directly

If the change is ready to go straight into the master branch, that works too:

git checkout master
git merge --no-ff --no-commit craigwebster-sprozzled-some-gromits
git commit -m \
  "The Gromits are well and truly Sprozzled." \
  --author "Craig R Webster <craig@xeriom.net>"
git push origin master

Working on Other People's Projects with Git

2008-11-20T00:00:00+09:00

I'm still fairly new to Git, and I'm not entirely sure what the accepted etiquette is for contributing patches to other people's projects. Here's the best approach I've come up with for making changes to someone else's project and giving them the option to incorporate those changes.

Clone the repository

First, grab a copy of the project. Hopefully they're using Git; I haven't worked out a good workflow for when they're not.

git clone git://github.com/username/project_name.git
cd project_name

Add a public repository

If you're like me and often work offline, you'll want a public repository where you can push your changes so others can access them. I set up a public Git repository for exactly this purpose. If you're always connected (or at least whenever another developer might want to pull your code), you can probably skip this step.

git remote add public ssh://barkingiguana.com/~craig/code/project_name.git
git push public master

Time to work

Here comes the hard but interesting bit: actually doing the work. Typically this means checking out a branch for a feature, bug fix, or topic area.

git checkout -b sprozzle-the-gromits
# ... do the work ...
git add gromits/blue.txt
git commit -m "Sprozzle Gromit with the blue face."

# ... do more work ...
git add gromits/cherry.txt
git commit -m "Cherry Gromits are even better with more Sprozzle."

Conflict resolution

While you've been working on your patch (and until it's accepted back into the project), there may be upstream changes. You'll want to make sure your patch applies cleanly to the master branch, since that dramatically increases the chances it'll be accepted.

git checkout master
git pull origin master
git checkout sprozzle-the-gromits
git rebase master
# if the rebase stops on a conflict: fix it, git add the resolved files,
# then run git rebase --continue
git push public master

Advertise your changes

Push just the changes on your branch to the public repository. Again, this is only necessary if you work offline and need others to be able to access your code independently.

git push public sprozzle-the-gromits:refs/heads/sprozzled-gromits

Automation is awesome

Mark Brown pointed out that you can use git request-pull to generate a few paragraphs suitable for emailing to the project team, containing all the information needed for your changes to be reviewed and merged into the project.

git request-pull \
  b9e0f1b4ff4bc196513c9551f6c25f0ee40d991f \
  http://barkingiguana.com/~craig/project_name.git

And relax...

Your changes are now available to the public. Anyone can clone your repository and fetch your pushed branches. Now would be a good time to email the project owner and ask nicely if they'll pull from your repository and review your changes.

If you need to make further changes to the branch, just do the work, commit it, and push the branch again with the same git push public sprozzle-the-gromits:refs/heads/sprozzled-gromits command.

Difference is the spice of life

The project you want to contribute to may not support this style of collaboration. Check with the project team before you get started. If you'd prefer not to (or can't) publish your own copy of the repository, the Git book covers using Git and email as an alternative.

Symbol#to_proc is slow... is it slow enough to matter?

2008-11-18T00:00:00+09:00

It’s common knowledge that the Symbol#to_proc trick is slower than writing out a block by hand. But just how much slower? I put together some benchmarks to find out.

Environment

These tests were run on Ruby 1.8.6-pl111 and Rails 2.1.

Benchmarking

Say you have a database of 1,000 items that you need to iterate over. Let’s set aside the fact that displaying 1,000 items probably means you have usability problems, and just roll with it.

1_000.times { |n| Bar.create :name => "bar-#{n}" }
bars = Bar.find(:all)

Here’s how the two approaches compare over 1,000 ActiveRecord instances:

Benchmark.measure { bars.map(&:name) }.real
#=> 0.00645709037780762

Benchmark.measure { bars.map { |b| b.name } }.real
#=> 0.00141692161560059

That’s a horrific-sounding increase: to_proc takes more than 350% longer than the plain block. But let’s be realistic: over 1,000 records, the total time is 0.0065 seconds. Not exactly something to lose sleep over.

What about 1,000,000 rows? We already have 1,000, so let’s top it up:

(1_000_000 - 1_000).times { Bar.create :name => Time.now.to_f.to_s }
bars = Bar.find(:all)

That gives us a million rows. By this point your database is probably questioning your life choices. Presenting a million rows to a user is a bit of an edge case, but here’s how long it takes:

Benchmark.measure { bars.map(&:name) }.real
#=> 6.25304508209229

Benchmark.measure { bars.map { |b| b.name } }.real
#=> 1.38965106010437

Almost 5 extra seconds over a million rows. Five seconds is a real hit, sure, but how long will your application be running before you hit a million rows in a single table and need to iterate over every last one of them?
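
The percentages and the "almost 5 seconds" figure can be checked with a few lines of arithmetic. This is only a sketch: the timings are copied from the runs above, and the slowdown helper is a name I've made up for the example.

```ruby
# Timings copied from the benchmark runs above (seconds).
timings = {
  1_000     => { :to_proc => 0.00645709037780762, :block => 0.00141692161560059 },
  1_000_000 => { :to_proc => 6.25304508209229,    :block => 1.38965106010437 }
}

# Returns [extra seconds, percentage increase over the plain block].
def slowdown(to_proc, block)
  extra = to_proc - block
  [extra, extra / block * 100]
end

timings.each do |rows, t|
  extra, percent = slowdown(t[:to_proc], t[:block])
  puts format("%9d rows: %.4f s extra, %.0f%% longer", rows, extra, percent)
end
```

The relative penalty is about the same at both sizes; only the absolute cost grows with the row count.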

Don’t optimise prematurely. By the time to_proc becomes your bottleneck, you’ll have hit many other problems first:

Benchmark.measure { Bar.find(:all) }.real
#=> 406.738657951355

Worry about those first.

Run it yourself

It’s been a long time since I ran the original benchmark. Here’s some copy-paste code to run a similar one yourself:

require 'benchmark'
puts "PLATFORM = #{RUBY_PLATFORM}, VERSION = #{RUBY_VERSION}"
Benchmark.bmbm do |x|
  x.report("to_proc") { 10_000_000.times &:to_s }
  x.report("literal 1") { 10_000_000.times { |n| n.to_s }}
  x.report("literal 2") { n = lambda { |i| i.to_s }; 10_000_000.times &n }
end

Here are the results from my MacBook Air on Ruby 2.1.2. The picture has flipped: Symbol#to_proc now comes out slightly faster than either block form.

    Rehearsal ---------------------------------------------
    to_proc     1.890000   0.010000   1.900000 (  1.909775)
    literal 1   2.340000   0.000000   2.340000 (  2.350912)
    literal 2   2.270000   0.000000   2.270000 (  2.274322)
    ------------------------------------ total: 6.510000sec

                    user     system      total        real
    to_proc     1.810000   0.000000   1.810000 (  1.808921)
    literal 1   2.090000   0.000000   2.090000 (  2.092189)
    literal 2   2.060000   0.010000   2.070000 (  2.061436)

Handling Error Feedback from Ajax Requests to Rails Applications

2008-11-17T00:00:00+09:00

Ajax is frequently used to deliver a richer user experience. So why are error messages so rarely handled properly in Ajax-enabled applications? Handling errors gracefully (in a way that actually helps the visitor fix the problem) adds a genuinely high-quality feel. We've already got all the machinery we need. It just takes a little care and attention.

class FoosController < ApplicationController
  def update
    @foo = Foo.find(params[:id])
    respond_to do |format|
      if @foo.update_attributes(params[:foo])
        format.html do
          flash[:info] = "Your foo has been updated."
          redirect_to @foo
        end
        format.json { head :ok }
      else
        format.html do
          flash.now[:warning] = "I could not update the foo."
          render :action => :edit
        end
        format.json do
          # head sends no body, so render the errors with an explicit status.
          render :json => @foo.errors.to_json, :status => :unprocessable_entity
        end
      end
    end
  end
end

With this controller, you get a solid fallback for standard HTML requests and clean JSON behaviour for Ajax. When something goes wrong on a JSON request, you get back an array of arrays that looks like this:

[
  [ "attribute1", "error1", "error2" ],
  [ "attribute2", "error3" ]
]

Think of the things you can do with that kind of structured feedback:

new Ajax.Request('/foo.json', {
  method: 'PUT',
  parameters: {
    authenticity_token: window._token,
    "foo[subject]": $F('foo_subject'),
    "foo[body]"   : $F('foo_body')
  },
  onSuccess: function(transport) {
    // This is Web 2.0: celebrate with a yellow highlight.
  },
  onFailure: function(transport) {
    var errors = transport.responseJSON;
    errors.each(function(error) {
      var attribute = error.shift();
      var messages = error.join(", ");
      var errorMessage = attribute + " " + messages;
      var inputNode = $("foo_" + attribute);
      if(inputNode) {
        // Show that something is wrong with this field.
        inputNode.addClassName("error");
        // Do something better than an alert box. Alert boxes suck.
        alert(errorMessage);
      }
    });
  }
});
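
If you want to check the error payload outside the browser (in a test, say), the same unpacking can be sketched in plain Ruby. This assumes the response body has exactly the array-of-arrays shape shown earlier; error_lines is a hypothetical helper name, not something Rails provides.

```ruby
require 'json'

# The response body from a failed JSON request, in the shape shown above.
body = '[["attribute1", "error1", "error2"], ["attribute2", "error3"]]'

# Turn each [attribute, message, message, ...] row into one readable line.
def error_lines(json)
  JSON.parse(json).map do |attribute, *messages|
    "#{attribute} #{messages.join(', ')}"
  end
end

puts error_lines(body)
```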

Ajax and the Rails Request Authenticity Token

2008-11-17T00:00:00+09:00

Rails 2.0 introduced CSRF protection in the form of an authenticity token, a reasonably long string that ensures any PUT, POST, or DELETE request to your application was genuinely triggered by you (or at least your browser) and not by some nefarious third party.

Rails automatically adds this token to any form generated by its helpers. But when you're building rich Ajax interactions, you sometimes need to construct the requests by hand.

Drop this snippet into your layout, just above where you include the rest of your JavaScript files, and you'll have the authenticity token available from JavaScript:

<%= javascript_tag "window._token = '#{form_authenticity_token}';" %>

Now you can build Ajax requests that the application will actually accept:

new Ajax.Request('/foo.json', {
  method: 'PUT',
  parameters: {
    authenticity_token: window._token,
    text: $F('foo_text')
  }
  /* callbacks omitted for brevity */
})

Writing a Story: Why, When, Where, Who, What, How, and a Bunch of Other Questions and Answers

2008-11-16T00:00:00+09:00

Making the shift to story-driven development can be a real head-scratcher. What should a story contain? Who should be involved in writing one? Here are some guidelines to help you get started.

I'm assuming below that you're following Scrum or something Scrum-like. If you're using a different Agile methodology, most of this should translate without much trouble. If you're stuck with Waterfall or RUP, you have my sympathies. I'm honestly not sure how well story-driven development fits outside the Agile world.

Why write stories?

A Product Owner rarely cares that you've added a button to submit an order unless the code to process the order, take payment, and write it to the database is also there. They care about being able to place an order, not about how the ordering system was implemented.

Stories form a complete, deliverable unit of work. They give you a way to communicate project progress to the business in terms the business actually understands.

Stories also make it easier to commit to work for a sprint: you can estimate the complexity of a feature and, based on that, the team can tell whether they can realistically finish the story in the current sprint.

Stories generate conversations. They help specify exactly how a feature should behave, so the team knows what they're aiming for.

And stories help you focus. If the team has committed to delivering a story about placing an order, they're not going to wander off and build a user feedback system. (And if they do, they can be gently steered back to the goal they committed to during sprint planning.)

When should a story be written?

Feature requests arrive constantly, so it's useful to have a regular meeting for writing and estimating stories. I suggest a short session at the end of each sprint to handle the work that arrived during that sprint. This meeting will typically last less than an hour.

At project kick-off, you'll have more features to estimate than usual. Plan two or three meetings of one to two hours each to get through the initial backlog.

It's always handy to have more stories ready than just what's in the current sprint; if the team finishes early, they can pull in additional work.

But try not to overdo it. Writing stories is valuable, but working software is more important.

Where should stories be written?

Nothing complicated here: you want somewhere you can focus with the Product Owner without interruptions. Find a quiet room away from the work area, or head to a coffee shop.

Who should write a story?

Short answer: everyone. The Product Owner, Scrum Master, and the Scrum Team.

How should a story be written?

The team talks about the product and identifies a specific piece of functionality to work on (say, the ordering system mentioned above). The Product Owner, Scrum Master, and Scrum Team then define a set of scenarios that detail how that functionality should behave: What happens when the store is closed? What if someone enters an invalid credit card number? The list doesn't need to be exhaustive; it just needs to be representative.

The scenarios and the feature description are captured in a document. This can take many forms, but here's how I write them:

Feature: Place an order
  In order to get goods from our online store
  A shopper
  Should be able to place and pay for an order

  Scenario: The store is closed
    Given the store is closed
    And I have three beachballs in my shopping cart
    When I submit my order
    Then the order should be accepted
    And I should see "Your order will be processed when the store opens at 9am"

  Scenario: An invalid credit card number is used
    Given I have three beachballs in my shopping cart
    When I fill in "credit_card_number" with "MONKEY"
    And I press "Pay"
    Then the order should not be accepted
    And I should see "Please enter a valid credit card number"

Once the story is written, everyone on the team except the Product Owner estimates its complexity. Based on current knowledge, they assign a point score that shows how complex it is relative to other stories. It helps to use a fixed scale; something resembling the Fibonacci sequence works well. I use ?, 0, 1, 2, 3, 5, 8, 13, 20, 40, 100, and infinity. Zero means trivial. Infinity means the team thinks they could never complete it. A ? means they don't have enough information yet; it might be estimable after more discussion or a short, time-boxed development spike. One of the best ways to run estimation is to play planning poker. I have a set of planning poker cards for this.

It's also useful for the Product Owner to assign a business value to each story, even though business value is notoriously hard to quantify. I suggest values of 100, 200, 300, 400, 500, 600, 700, 800, 900, or 1000 to rank stories relative to each other. The Product Owner shouldn't be influenced by how complex the team thinks a story is; they're rating the value to the business of delivering a capability, regardless of the effort involved.

For both complexity and business value, there are no in-between values. Don't allow estimates of 25 if it isn't on your scale, or you'll spend forever arguing whether something is a 24 or a 25. It's an estimate. It doesn't need to be exact.

Business value and complexity can be revised whenever new information surfaces, so it's worth briefly reviewing existing unimplemented stories while writing and estimating new ones.

After a story is written, it goes into the product backlog.

What happens to the story after it's added to the product backlog?

During the next sprint planning meeting, the Product Owner, Scrum Master, and Scrum Team meet to set a goal for the upcoming sprint. This goal is what the sprint's success will be measured against.

After setting the goal, the team discusses which stories contribute towards it and decides what they can commit to delivering, based on the complexity estimates. Stories with high business value should be preferred over those with low business value; the aim is to deliver the most value possible each sprint. There may be some negotiation with the Product Owner if they'd prefer certain stories over others, but the team shouldn't be pressured into taking on more than they can handle.

How much complexity a team can handle in a sprint should be based on how previous sprints went. Every team estimates differently and has different strengths, so this will vary widely. During the first sprint, pick a sensible but somewhat arbitrary number of stories and see how it goes. If the team finishes early, they can always pull in more work.
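
As a rough illustration of that trade-off, here is a sketch that fills a sprint by taking the most valuable stories that still fit within the team's capacity. It deliberately ignores the sprint goal and any negotiation with the Product Owner, and everything in it (the Story struct, plan_sprint, the sample backlog, a velocity of 13 points) is made up for the example.

```ruby
# A story as the planning meeting sees it (hypothetical structure).
Story = Struct.new(:title, :points, :business_value)

# Pick stories in descending business value, skipping any that would
# push the sprint past the team's capacity (velocity, in points).
def plan_sprint(backlog, velocity)
  remaining = velocity
  backlog.sort_by { |story| -story.business_value }.select do |story|
    fits = story.points <= remaining
    remaining -= story.points if fits
    fits
  end
end

backlog = [
  Story.new("Place an order",     8, 900),
  Story.new("User feedback form", 5, 300),
  Story.new("Order history",      5, 600),
  Story.new("Gift wrapping",      3, 400)
]

sprint = plan_sprint(backlog, 13)
puts sprint.map(&:title)
```

In practice the team makes this call by conversation rather than by algorithm, but the shape is the same: value decides the order, complexity decides where to stop.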

How do I know when a feature is complete?

Since a story represents a feature, the feature is complete when you can do exactly what the story describes. Try walking through it yourself. When you can follow every scenario in the story, consider the feature done.

If you're using Rails or Ruby, check out my article on story-driven development using Cucumber, which shows how to turn a story into an automated test.

What happens if a story doesn't get completed during a sprint?

Scrum is all about delivering working software, so if a story isn't complete, it shouldn't be part of the sprint deliverable. If your developers are working in a feature-branch pattern, this is straightforward: just don't merge the incomplete feature into the release branch.

The work that's been done doesn't necessarily get thrown away, though. It can be used to reduce the story's complexity estimate for the next sprint. Just bear in mind that this reduction is somewhat time-limited; as development continues, the cost of keeping a feature branch up to date with trunk starts to add up.

What happens if a story is too complex for one sprint?

Stories that contain more complexity than the team can handle in a single sprint are called Epics. These can't be accepted for a sprint because they wouldn't get finished, and the sprint deliverable would show no progress. We should always show progress.

Epics should be discussed with the Product Owner. They often describe more than one feature and can be broken down into smaller stories, each deliverable within a single sprint.

Running into Epics is completely normal over the course of a project.

Any other questions?

The above covers the questions I've been asking myself over the past few days. If you have others, please ask in the comments or email me and I'll do my best to find an answer.

Setting Up a Public Git Repository

2008-11-15T00:00:00+09:00

I've been using Git more and more as my version control system, and I wanted to make some code available to the public. The easy option would be a hosted service like GitHub or repo.or.cz, but I'm vain enough to want to serve my code from barkingiguana.com. I don't need multiple committers, and I want to learn more about how Git works under the hood, so Gitosis would be overkill. It turns out that setting up your own public repository is pretty straightforward. Here's how I did it.

General setup

I already have Apache running (serving this blog, among other things), so I'll use that and serve code from repositories under http://barkingiguana.com/~craig/. The easiest way is to use mod_userdir. On Ubuntu, enabling it is trivial:

sudo a2enmod userdir
sudo /etc/init.d/apache2 restart

I want to keep my Git repositories under ~/code, which lets me selectively symlink in only the repositories I want to be public:

mkdir ~/code

My VM's SSH daemon listens on a non-standard port, so I configured that in ~/.ssh/config on my local machine. I also took the opportunity to upload my SSH key.

Publishing a project

I have a project with some work already done locally that I'd like to share. First, create a bare repository on the public server:

# On the public server
mkdir -p ~/code/project_name.git
cd ~/code/project_name.git
git --bare init
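# Enable the post-update hook, which runs git update-server-info after
# each push so the repository can be cloned over plain HTTP.
chmod +x hooks/post-update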

Success looks like this:

Initialized empty Git repository in /home/user_name/code/project_name.git/

Next, on your local machine, add the public server as a remote:

# On the local development machine
cd ~/sandbox/project_name
git remote add public ssh://barkingiguana.com/~/code/project_name.git

Now push the local master branch up:

# On the local development machine
cd ~/sandbox/project_name
git push public master

The code is on the public server now, and you can push future changes with git push public master. But it still isn't web-accessible since it's not in ~/public_html. Fix that with a symlink:

ln -s ~/code/project_name.git ~/public_html/

Just like that, the repository is available for public use.

Did it work?

To verify everything is in order, try cloning the repository. Replace the URL with wherever your repository lives:

# On the local development machine
mkdir -p ~/tmp/
cd ~/tmp/
git clone http://barkingiguana.com/~craig/addressbook.git

Success should look something like this:

Initialized empty Git repository in /Users/craig/tmp/addressbook/.git/
got d0cc5f06e1d164ea6ada301dbd2e7c946d1ae532
walk d0cc5f06e1d164ea6ada301dbd2e7c946d1ae532
got b68d1319a780a776afdb60e3bba2985793a11f3e
got 2baa33597deecfc3eb558c59bc69745e153f9b82
got da7110115566b026c7316bd1be4cbf3d76c0f656

Get the Current Git Branch in Your Command Prompt

2008-11-15T00:00:00+09:00

It seems like everyone and their dog has their own way to show the current Git branch in the command prompt. Here's mine. Drop this into your ~/.profile:

export PS1='\[\033[01;32m\]\u@\h\[\033[00m\] \[\033[01;34m\]\w\[\033[00m\]$(git branch &>/dev/null; if [ $? -eq 0 ]; then echo "\[\033[01;33m\]($(git branch | grep ^*|sed s/\*\ //))\[\033[00m\]"; fi)$ '

The result looks like this:

craig@shiny ~/sandbox/addressbook(master)$

Now with 50% cleaner code

Shortly after posting this, I discovered that Git ships with an auto-completion file that includes a handy __git_ps1 function. If you enable Git auto-completion, you can get the same prompt with much less noise, and pick up some useful tab-completion goodies along the way:

export PS1='\[\033[01;32m\]\u@\h\[\033[00m\] \[\033[01;34m\]\w\[\033[00m\]$(__git_ps1 "\[\033[01;33m\](%s)\[\033[00m\]")$ '

Getting Started with Story Driven Development for Rails with Cucumber

2008-11-11T00:00:00+09:00

I'd been hearing about Story Driven Development (SDD) for a while but kept putting it off, assuming there was a huge amount to learn and set up before I could get going. Turns out that was completely wrong. I started using Cucumber yesterday and it was surprisingly easy to get rolling.

Install and configure

First, install the required gems:

sudo gem install nokogiri term-ansicolor treetop diff-lcs hpricot cucumber

Then install Cucumber into your Rails app:

ruby script/generate cucumber

Next, install Webrat. Unfortunately it's not available as a gem at this point. If you're using Git, install it as a submodule. If not, clone the repository and svn add it:

git clone git://github.com/brynary/webrat.git vendor/plugins/webrat

Writing your first story

Stories have three components: the business value being delivered, the role of the person using the feature, and a description of what the feature does.

In order to [do something with business value]
As [role]
Should [describe the feature]

For example, imagine you're building an online ordering system for a pizza delivery company:

Feature: Order Pizza
  In order to get some hot, tasty pizza
  A hungry pizza lover
  Should be able to order pizza

Now you need some scenarios, specific things that can happen during the story. Most pizza places aren't open 24 hours, so two obvious scenarios are: the shop is closed, and the shop is open.

  Scenario: The pizza shop is closed
    Given the pizza shop is closed
    And I am on the home page
    When I click "Feed Me!"
    Then I should see "Sorry, the shop is closed"

  Scenario: The pizza shop is open
    Given the pizza shop is open
    And I am on the home page
    When I click "Feed Me!"
    Then I should see "Your pizza will be with you soon"

Save this in a file like features/order_pizza.feature, where it can live happily under version control.

So now you have a story that describes how a feature should behave. But how does it become an actual test? You could hand these descriptions to a testing team, or you could wire them up as part of your automated test suite.

Automated tests: better than cake

When you installed Cucumber, you got a features/steps directory. This is where you teach your test suite how to understand your stories. There are already two files in there: common_webrat.rb, which gives you useful abilities like clicking links, and env.rb, which does essentially the same job as spec/spec_helper.rb but for Cucumber. You can mostly ignore env.rb, but common_webrat.rb is worth reading for examples of how to write step definitions.

Create a new file called order_pizza_steps.rb. This is where you define the steps involved in ordering pizza. Each step is just a regular expression that maps a line from your scenario to some Ruby code:

Given /the pizza shop is open/ do
  PizzaShop.open = true
end

Given /the pizza shop is closed/ do
  PizzaShop.open = false
end

Given /I am on the home page/ do
  visits "/"
end

That's it. The common Webrat steps already handle clicking buttons and checking for text on the page.
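
If the idea of a regular expression mapping a scenario line to Ruby code feels abstract, this toy registry shows the mechanism. It is a sketch of the general technique and not how Cucumber is actually implemented; STEPS, SHOP, step, and run_step are all names invented for the example.

```ruby
# A toy step registry (hypothetical; not Cucumber's real implementation).
STEPS = {}
SHOP  = { :open => nil }

# Register a block against a pattern, much as Given/When/Then do.
def step(pattern, &block)
  STEPS[pattern] = block
end

# Find the first pattern matching the line and call its block with the
# regexp captures as arguments.
def run_step(line)
  pattern, block = STEPS.find { |regexp, _| regexp =~ line }
  raise "Undefined step: #{line}" unless block
  block.call(*pattern.match(line).captures)
end

step(/the pizza shop is (open|closed)/) do |state|
  SHOP[:open] = (state == "open")
end

run_step("Given the pizza shop is closed")
p SHOP[:open]
```

Each line of a scenario is matched against the registered patterns, and whatever the pattern captures is passed to the block as arguments.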

Running your stories

Just run rake features. You'll get nicely coloured output, and if anything goes wrong, Cucumber is genuinely helpful about suggesting ways to fix it.

content_for is the new GOTO

2008-11-06T00:00:00+09:00

I have a confession: I really don't like content_for. When you use it, your view code starts jumping around between files in a way that's genuinely hard to follow. It smells a lot like GOTO. And when was the last time anyone recommended you use a GOTO?

content_for :javascript and content_for :css

The good news is that content_for can be avoided entirely, at least when it comes to including CSS and JavaScript. The trick is simple: include the controller name and action name in your layout's <body> tag, then scope your CSS declarations accordingly.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
                         "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <title><%= page_title %></title>
  <meta http-equiv="Content-Language" content="en" />
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
  <link rel="stylesheet" type="text/css" href="/stylesheets/simple.css" media="screen" />
</head>
<body id="<%= "#{controller.controller_name.tableize.singularize}_#{controller.action_name}" %>" class="<%= "#{controller.controller_name.tableize.singularize} #{controller.action_name}" %>">
  <%= yield %>
</body>
</html>
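That body tag is a bit of a mouthful, so it could be pulled out into a helper. Here's a minimal sketch in plain Ruby. The body_scope name is hypothetical, and the singularisation is deliberately naive so the example runs without Rails; in a real app you'd keep using tableize.singularize as in the layout above.

```ruby
# Hypothetical helper. In a Rails app the two arguments would come from
# controller.controller_name and controller.action_name.
def body_scope(controller_name, action_name)
  # Naive singularisation for illustration only. Rails' own
  # String#singularize handles irregular plurals properly.
  resource = controller_name.sub(/s\z/, "")
  {
    :id    => "#{resource}_#{action_name}",
    :class => "#{resource} #{action_name}"
  }
end

scope = body_scope("posts", "index")
puts %(<body id="#{scope[:id]}" class="#{scope[:class]}">)
```

For the Posts index this gives an id of post_index and the classes post and index, which are the hooks a stylesheet can select on.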

Now, say you're looking at the Posts views in your app. You can style each action independently, like this:

.post.index .article .title {
  font-size: 1.25em;
}

.post.show .article .title {
  font-size: 0.9em;
}

If you need to support browsers that don't handle two classes as a selector on a single element, use the ID-based version instead:

#post_index .article .title {
  font-size: 1.25em;
}

#post_show .article .title {
  font-size: 0.9em;
}

Since all your JavaScript is unobtrusive anyway (right?), you can scope it with the same CSS selectors shown above.

As a bonus, this approach lets you bundle all your JavaScript and CSS into single files for production, saving a bunch of HTTP requests. No content_for required.

Make Sure You're @importing Files That Exist

2008-11-03T00:00:00+09:00

I’ve started grumbling about optimising the number of HTTP requests per page. There are plenty of reasons you might want to do this, but that discussion is for another post. For now, just know that I don’t like unnecessary HTTP requests. And I really don’t like wasted ones, like when a CSS @import directive points at a file that 404s.

I got tired of tracking these down manually across several applications, so I threw together this little Ruby script to do the detective work for me:

#! /usr/bin/env ruby

# Treat the current directory as the document root.
css_root = File.expand_path(Dir.pwd)
css_files = Dir[File.join(css_root, "**", "*.css")]

# Maps each CSS file to the imports it declares that don't exist on disk.
missing_imports = Hash.new([])
css_files.each do |css_file|
  imports = File.read(css_file).split(/\n|\r/).grep(/@import url\((.*)\)/)
  imports.each do |import|
    # Capture the path, with or without surrounding quotes.
    desired_path = import.scan(/url\(\s*(["']?)(.*?)\1\s*\)/).to_a.first.to_a.last
    # Root-relative paths resolve against the document root, everything
    # else against the directory of the importing file.
    desired_root = desired_path[0,1] == "/" ? css_root : File.dirname(css_file)
    filesystem_path = File.expand_path(File.join(desired_root, desired_path))
    if !File.exist?(filesystem_path)
      missing_imports[css_file] += [{ :path => filesystem_path, :directive => import }]
    end
  end
end

if missing_imports.any?
  puts "Missing files declared as imports in CSS:\n\n"

  missing_imports.keys.each do |origin|
    puts "Origin:               #{origin}"
    missing_imports[origin].each do |import|
      puts "Missing @import file: #{import[:path]}"
      puts "Directive:            #{import[:directive]}"
    end
    puts ""
  end
else
  puts "No imported files are missing. Well done."
end

Run it from the directory that serves as your document root. For Rails apps, that’s RAILS_ROOT/public/. It’ll either spit out a list of broken imports or give you a pat on the back:

Missing files declared as imports in CSS:

Origin:               /Users/craig/projects/1.8/public/stylesheets/.../find_by_service.css
Missing @import file: /Users/craig/projects/1.8/public/stylesheets/.../a_to_z.css
Directive:            @import url('.../a_to_z.css');
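The path resolution is the only fiddly part of the script, so here it is in isolation. This is a rough sketch rather than a drop-in replacement: resolve_import is a hypothetical helper, the regular expression is a variation written for this example, and the directory names are invented.

```ruby
# Hypothetical helper: resolve the path inside an @import url(...) directive
# to a location on disk. Returns nil if the directive has no url(...).
def resolve_import(directive, css_file, css_root)
  match = directive.match(/url\(\s*["']?([^"')\s]+)["']?\s*\)/)
  return nil unless match

  path = match[1]
  # Root-relative paths hang off the document root; anything else is
  # relative to the stylesheet doing the importing.
  base = path.start_with?("/") ? css_root : File.dirname(css_file)
  File.expand_path(File.join(base, path))
end

# Invented paths, purely for the example.
root  = "/var/www/public"
sheet = "/var/www/public/stylesheets/main.css"

puts resolve_import("@import url('reset.css');", sheet, root)
puts resolve_import('@import url("/stylesheets/print.css");', sheet, root)
```

The first directive resolves next to main.css, the second against the document root. Once you have the full path, checking whether it exists is a one-liner.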

To be clear: I’d prefer @import directives didn’t exist at all. Each one is an extra HTTP request that could have been avoided by combining stylesheets. But they’re popular with a lot of people, so I’ll compromise: if you must use them, at least make sure they point at files that actually exist.

Scaling: Using MogileFS for Storing Uploaded Images

2008-10-31T00:00:00+09:00

As you might have guessed from several of my previous posts, the team I’ve been working in has recently been scaling an application. I’ve learned a bunch of things along the way, and I’ve got half-written articles about several of them that I’ll totally finish one day.

One of the most useful technologies I’ve started using is MogileFS, a distributed BLOB store. In our application we use it to store user-generated assets like uploaded images and syndication feeds. Rather than go into the pros and cons here, I’d like to share some code that’s been genuinely useful: a MogileFilesystemBackend for AttachmentFu.

Why do you need a shared filestore for uploads? Once your application cluster scales beyond a single box, uploaded images land on different disks depending on which server handled the request. Without a shared store, there’s no guarantee a particular image will be available to a subsequent request that hits a different server.

Getting stuck in

I’ve done some admittedly ugly preparation here and monkey-patched Kernel to provide a module-level accessor called filestore. It holds an instance of MogileFS::MogileFS, from the excellent MogileFS client by the folks at Seattle RB. The patch, which will probably make experienced Rubyists wince, looks like this:

module Kernel
  # Oh noes, I'm screwing with Kernel.
  #
  mattr_accessor :filestore
end

During Rails initialisation, the filestore is set up using configuration values pulled from a YAML file in RAILS_ROOT/config/:

Kernel.filestore = MogileFS::MogileFS.new(
  :domain => "APPNAME-#{RAILS_ENV}",
  :hosts => array_of_hosts_from_yaml_file
)

(What I actually do is quite a bit different from this because I’ve done evil things to the MogileFS client library, which I’ll probably share in the future. For now, believe the magic.)

With the setup complete, getting AttachmentFu to work with MogileFS is straightforward:

class Image < ActiveRecord::Base
  has_attachment :content_type => :image,
    :storage => :mogile_filesystem,
    :max_size => 5.megabytes,
    :thumbnails => {
      :canonical => '1024x'
    },
    :processor => "MiniMagick"

  validates_as_attachment
end

The backend

Without the actual backend code, none of the above does anything. The implementation was heavily influenced by the existing Amazon S3 backend, since the concepts behind S3 and MogileFS are quite similar:

module MogileFilesystemBackend
  def full_filename(thumbnail = nil)
    "#{class_prefix}:#{filestore_tag(thumbnail)}"
  end

  def filestore_tag(thumbnail = nil)
    "#{parent_id || id}:#{thumbnail || :original}"
  end

  def current_content
    temp_path ? File.read(temp_path) : temp_data
  end

  def public_filename(thumbnail = nil)
    [
      editorial_object_type.demodulize.tableize,
      editorial_object_id,
      "#{class_prefix}.#{file_extension}#{thumbnail && "?size=#{thumbnail}"}"
    ].join("/")
  end

  def file_extension
    Mime::Type.lookup(content_type).to_sym
  end

  def filestore_paths(thumbnail = nil)
    filestore.get_paths(full_filename(thumbnail))
  end

  def file_data(thumbnail = nil)
    filestore.get_file_data(full_filename(thumbnail))
  end

  protected
  def current_content_location
    temp_path ? :temp_path : :temp_data
  end

  def destroy_file
    filestore.delete full_filename
  end

  def rename_file
    filestore.rename @old_filename, full_filename
  end

  def save_to_storage
    logger.info "Storing #{self.class.name}\##{id} as #{full_filename(thumbnail)} (class: #{replication_policy}) from #{current_content_location == :temp_path ? temp_path : :memory}"
    filestore.store_content full_filename(thumbnail), replication_policy, current_content
  end

  def class_prefix
    self.class.name.demodulize.underscore.downcase
  end
  alias_method :replication_policy, :class_prefix
end

Technoweenie::AttachmentFu::Backends::MogileFilesystemBackend = ::MogileFilesystemBackend
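The key naming scheme is easier to see with concrete values. The sketch below copies full_filename and filestore_tag onto a throwaway Struct so it runs without Rails or MogileFS. FakeAttachment and the ids are made up for the example, and class_prefix is hardcoded rather than derived from the class name.

```ruby
# Stand-in for an attachment record; only the fields the key scheme needs.
FakeAttachment = Struct.new(:id, :parent_id, :class_prefix) do
  def filestore_tag(thumbnail = nil)
    "#{parent_id || id}:#{thumbnail || :original}"
  end

  def full_filename(thumbnail = nil)
    "#{class_prefix}:#{filestore_tag(thumbnail)}"
  end
end

# An original upload, and a thumbnail record that points back at it.
original  = FakeAttachment.new(42, nil, "image")
thumbnail = FakeAttachment.new(43, 42, "image")

puts original.full_filename
puts thumbnail.full_filename(:canonical)
```

Both keys are built from the parent's id, so every size of an image can be looked up from the original record alone.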

Serving images

Getting images into MogileFS is only half the story. You also need to serve them to visitors. Here’s a controller that reads from the filestore instead of the local filesystem (and if you’re storing files in the database, we need to have a talk):

class ImageController < ApplicationController
  before_filter :load_image

  def show
    respond_to do |format|
      format.html
      format.any(:png, :jpg, :gif) do
        send_data @image.file_data(params[:size]),
          :type => @image.content_type,
          :disposition => 'inline'
      end
    end
  end

  protected
  def load_image
    @image = Image.find(params[:id])
  end
end

And there you have it. Images go into MogileFS on upload, get replicated across your storage nodes, and are served back to visitors through a simple controller action. No more worrying about which app server has which file.

Talking to Yourself Is Bad, mmkay?

2008-10-20T00:00:00+08:00

A lot of languages encourage talking to yourself. OO PHP code is sprinkled with $this->foo_method();. In some languages it’s necessary. Ruby isn’t one of them.

class Foo
  def bar
    # Why are you talking to yourself?!
    @thingy = self.foo
  end

  def foo
    "QUUX!"
  end
end

That self. is doing absolutely nothing. You can drop it entirely:

class Foo
  def bar
    @thingy = foo
  end

  def foo
    "QUUX!"
  end
end

This is a trivial example, but it makes a real difference across a larger codebase. Less noise, easier to read, fewer characters to trip over. Give it a try: your code will look less like it’s having a conversation with itself.

There’s one caveat though: you do need self when calling a setter method. Without it, Ruby thinks you’re assigning to a local variable:

class Foo
  attr_accessor :thingy

  def bar
    # This assigns to a local variable, NOT the attribute.
    thingy = foo
  end

  def foo
    "QUUX!"
  end
end

class Foo
  attr_accessor :thingy

  def bar
    # This calls Foo#thingy= as intended.
    self.thingy = foo
  end

  def foo
    "QUUX!"
  end
end

So the rule is simple: skip self for reading, keep it for writing. Your future self (pun intended) will thank you.
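If you'd like to see the difference for yourself, here's a small script you can run. The method names are invented for the demonstration:

```ruby
class Foo
  attr_accessor :thingy

  def assign_without_self
    # Creates a local variable called thingy; the attribute is untouched.
    thingy = "local"
    thingy
  end

  def assign_with_self
    # Calls the thingy= setter generated by attr_accessor.
    self.thingy = "attribute"
  end
end

foo = Foo.new

foo.assign_without_self
puts foo.thingy.inspect

foo.assign_with_self
puts foo.thingy.inspect
```

The first puts shows the attribute is still nil after the assignment without self. Only the second call actually changes it.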

Checking MySQL Database Sizes

2008-10-09T00:00:00+08:00

Quick tip: want to know how large each of your MySQL 5 databases is? This query pulls the row counts, data size, index size, and total size from information_schema:

mysql> SELECT
  table_schema,
  concat(round(sum(table_rows)/1000000,2),'M') as rows,
  concat(round(sum(data_length)/(1024*1024*1024),2),'G') as data,
  concat(round(sum(index_length)/(1024*1024*1024),2),'G') as idx,
  concat(round(sum((data_length+index_length))/(1024*1024*1024),2),'G') as total_size
FROM information_schema.TABLES
GROUP BY table_schema;
+-----------------------------+-------+-------+-------+------------+
| table_schema                | rows  | data  | idx   | total_size |
+-----------------------------+-------+-------+-------+------------+
| information_schema          | NULL  | 0.00G | 0.00G | 0.00G      |
| xxxxxxxxx_xxxx_xxxx_staging | 0.93M | 0.08G | 0.01G | 0.09G      |
+-----------------------------+-------+-------+-------+------------+
2 rows in set (0.03 sec)

It’s one of those queries worth keeping in your back pocket. Handy for capacity planning, spotting unexpectedly large databases, or just satisfying your curiosity about where all that disk space went.

Fail Silently with Memcache Client

2008-09-25T00:00:00+08:00

For web applications, caching is king. I’ve recently been using memcached to cache expensive query results in a Rails application, with Seattle RB’s memcache-client as the client library.

The library is solid, but it has one opinion I disagree with: when a memcached instance fails, it throws an exception that your code has to handle. I think that’s the wrong default. A failed cache shouldn’t take the application down with it. Either the application keeps running uncached (slower, but functional) or the remaining memcached instances pick up the slack. Neither scenario should require special handling in application code.

Ruby, being awesome, lets me change the library’s behaviour easily. Monkey patching may be frowned upon, but it has its uses:

# A simple monkey-patch of MemCache so that broken memcached instances don't
# cause fatal errors in the application. Performance may be severely degraded
# but it should be possible to use the app anyway!
#
# A typical use would look something like:
#
#   result = if cache.alive?
#     fetch = cache.get(:foo)
#     if !fetch
#       fetch = calculate(:foo)
#       cache.set(:foo, fetch)
#     end
#     fetch
#   else
#     calculate(:foo)
#   end
#
class MemCache
  # Does the cache configuration contain any memcached instances that can
  # currently be used?
  #
  # Author: Conor Curran [http://forwind.net/]
  #
  def alive?
    !!servers.detect{ |s| s.alive? }
  end

  # Rescue from MemCache::MemCacheError -- we want the cache to fail silently
  # (at least from the point of view of the application - you should still
  # monitor memcached).
  #
  def get_with_rescue(*args)
    get_without_rescue(*args)
  rescue MemCache::MemCacheError
  end
  alias_method :get_without_rescue, :get
  alias_method :get, :get_with_rescue
  alias_method :[], :get

  # Rescue from MemCache::MemCacheError -- we want the cache to fail silently
  # (at least from the point of view of the application - you should still
  # monitor memcached).
  #
  def set_with_rescue(*args)
    set_without_rescue(*args)
  rescue MemCache::MemCacheError
  end
  alias_method :set_without_rescue, :set
  alias_method :set, :set_with_rescue
  alias_method :[]=, :set
  alias_method :add, :set

  # Rescue from MemCache::MemCacheError -- we want the cache to fail silently
  # (at least from the point of view of the application - you should still
  # monitor memcached).
  #
  def delete_with_rescue(*args)
    delete_without_rescue(*args)
  rescue MemCache::MemCacheError
  end
  alias_method :delete_without_rescue, :delete
  alias_method :delete, :delete_with_rescue
end

The pattern is straightforward: wrap each method (get, set, delete) with a version that rescues MemCacheError and silently returns nil. The alias_method chain preserves the original implementation so you can still call it directly if needed.
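To try the pattern without a memcached server, here's a minimal sketch using a fake cache that always fails. FlakyCache and CacheError are hypothetical stand-ins and not part of memcache-client; only the wrapping technique is the same.

```ruby
# Stand-in for a cache client whose backend is down. Hypothetical class,
# not part of memcache-client.
class FlakyCache
  class CacheError < StandardError; end

  def get(key)
    raise CacheError, "no servers available"
  end

  # Wrapped version: delegate to the original and swallow cache errors.
  # A rescue clause with an empty body evaluates to nil.
  def get_with_rescue(*args)
    get_without_rescue(*args)
  rescue CacheError
  end
  alias_method :get_without_rescue, :get
  alias_method :get, :get_with_rescue
end

cache = FlakyCache.new

# Falls back to the expensive calculation because get now returns nil.
result = cache.get(:foo) || "calculated"
puts result

# The original, raising behaviour is still reachable.
begin
  cache.get_without_rescue(:foo)
rescue FlakyCache::CacheError => e
  puts "original raised: #{e.message}"
end
```

One thing to keep in mind: a nil from get can now mean either a cache miss or a dead cache, which is exactly why the monitoring still matters.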

A word of caution: “fail silently” doesn’t mean “ignore failures entirely.” You should absolutely still be monitoring your memcached instances. This patch just prevents a cache hiccup from becoming an application outage.