The Transformer Attention Budget

July 02, 2026 · 10 min read

Part of Under the Hood — deep dives into the technology we use every day.

Three numbers govern every interactive interface ever built. They are fifty years old, they predate the personal computer, and they have survived every revolution in how software gets to a human, because they describe the human and not the software. LLMs do not break them. LLMs bend them, in ways the original numbers never anticipated. A response can start fast and complete slowly. A wait can be filled with motion or with silence. A spinner can buy you ten seconds; a progress message can buy you sixty. Building well in this era means knowing which budget you are spending, and why.

Three numbers, fifty years old

Robert B. Miller wrote the first careful paper on this subject in 1968. It was called “Response Time in Man-Computer Conversational Transactions,” and it was concerned with how long a user could wait at a teletype before something went wrong inside their head. Twenty-five years later, Jakob Nielsen popularised the same three numbers in Usability Engineering (1993). The numbers have not moved since. They are not numbers about computers. They are numbers about people.

0.1 seconds is the limit of “instantaneous.” Below this threshold, the user experiences cause and effect: they pressed a key, the system reacted. Above it, they perceive lag, even if they cannot articulate it. The 0.1s budget is the budget for acknowledgment. The cursor blinks where you put it, the button highlights when you tap it, the input field shows what you typed. Below 100 milliseconds, the system feels like an extension of you. Above it, like a thing you are operating.

1 second is the limit for uninterrupted flow of thought. Within a second, the user notices the delay but does not lose context. They are waiting, but they are still in the task. The 1s budget is the budget for response: the system did the thing you asked, and it came back before you wandered off. Most well-designed interactive systems pre-LLM tried hard to keep the bulk of operations under one second.

10 seconds is the limit of attention. Past this, the user mentally disengages. They look at their phone. They open another tab. Whatever flow state they were in is gone, and even if the answer arrives, you have lost them. The 10s budget is how long an operation can run before you have to handle the wait differently: by progress narration, by context switch, or by handing the task off to the background.

These numbers have not moved because human cognition has not. The thresholds are about working memory, attention spans, and the cost of context switching. They were measured on terminals connected to mainframes. They apply equally to mobile apps, voice assistants, and chatbots. They are properties of the user, not the system.

What “response time” used to mean, and what it means now

For most of computing history, the budgets were straightforward to apply because the response was atomic. You sent a request, the server computed an answer, the answer came back. “Response started” and “response complete” were effectively the same moment. If your form submission returned in 500 milliseconds, the user got the entire result in 500 milliseconds. The 1-second budget was a single number to hit.

Streaming LLMs split that moment into two. The model produces tokens (the units of text an LLM actually sees, usually short character sequences rather than whole words) one at a time, and the response arrives as a sequence. When the first token reaches the user matters; when the last token reaches them is a separate question entirely. Time to first token (TTFT) inherits the spirit of the old 1-second metric: it tells the user the system is alive and working. Total generation time is a different beast, often closer to ten seconds than one, governed by output length, model size, and the network path between the model and the user.

A six-second response with sub-second TTFT and visible tokens streaming the whole way feels in flow. A three-second non-streamed response feels broken. The total time is not the only number that matters, and arguably is not even the main one.

Streaming buys you the 1-second budget, even when your generation costs you closer to ten. But it only works if the path supports it, and if you have measured TTFT in particular, not “average response time.”
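A minimal sketch of measuring TTFT separately from total time, assuming the response is exposed as an ordinary Python iterable of tokens. The name `measure_stream` and the injectable `clock` parameter are illustrative choices, not part of any particular client library.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, str]:
    """Consume a token stream and return (ttft_s, total_s, text)."""
    start = clock()
    first_token_at = None
    parts = []
    for token in tokens:
        if first_token_at is None:
            first_token_at = clock()  # the moment the user first sees output
        parts.append(token)
    end = clock()
    # An empty stream has no first token; report TTFT as the total in that case.
    ttft = (first_token_at if first_token_at is not None else end) - start
    return ttft, end - start, "".join(parts)
```

Recording the two numbers per request, rather than one averaged figure, is what lets a dashboard tell a fast-starting six-second answer apart from a silent three-second one.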

Where the budgets go in a RAG pipeline

A typical RAG request (retrieval-augmented generation: retrieve relevant documents at query time and put them into the prompt so the model can ground its answer on them) looks something like this, in milliseconds:

  • Input acknowledgment (cursor change, “thinking” indicator): ~50ms
  • Query embedding: ~50ms
  • Vector search across the index: ~100-200ms
  • Reranker pass on the top-k results: ~300-500ms
  • Context assembly + prompt construction: ~50ms
  • Model TTFT: ~500ms-2s
  • Tokens streaming to the user: ongoing

Add the steps before the first token reaches the user: 1.05 to 2.85 seconds, depending on the rerank cost and the model’s first-token latency. The 1-second budget is gone before the model has produced a single token, and that is with a tightly engineered pipeline. A loose one easily takes four or five seconds.
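The arithmetic behind that range can be checked in a few lines. The stage figures are the rough estimates from the list above; the structure and names are only an illustration.

```python
# (stage, low_ms, high_ms), using the rough figures from the list above.
STAGES_MS = [
    ("input acknowledgment", 50, 50),
    ("query embedding", 50, 50),
    ("vector search", 100, 200),
    ("reranker", 300, 500),
    ("context assembly", 50, 50),
    ("model TTFT", 500, 2000),
]


def time_to_first_token_ms(stages):
    """Sum the best-case and worst-case cost of every pre-token stage."""
    low = sum(lo for _, lo, _ in stages)
    high = sum(hi for _, _, hi in stages)
    return low, high


low, high = time_to_first_token_ms(STAGES_MS)
print(f"{low / 1000:.2f}s to {high / 1000:.2f}s")  # prints 1.05s to 2.85s
```

Keeping a table like this next to the pipeline makes it obvious which stage to attack first: here the reranker and the model's own first-token latency dominate.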

This is why a chatbot whose underlying model streams beautifully still feels slow if you bolted RAG on top without budget thinking. The user typed their question, watched a spinner for two seconds, then watched tokens appear. Two seconds is past the 1-second budget. The streaming does not help, because the first tokens arrive too late to answer the question the user has been asking the whole time: is the system even working?

The fix is not to remove RAG. It is to get something on the screen during the retrieval. Show the rephrased query the model is going to receive. Show the document titles it is looking at. Show anything that signals the pipeline is alive. The 1-second budget is for “I see that something happened,” not necessarily “I have my answer.”
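One way to sketch this is a generator that emits status events between pipeline stages and only then starts yielding tokens. The `rewrite`, `retrieve`, and `generate` callables are hypothetical stand-ins for whatever the real pipeline uses, and the event shape is an assumption for illustration.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Event = Tuple[str, str]  # (kind, payload), where kind is "status" or "token"


def answer_with_progress(
    query: str,
    rewrite: Callable[[str], str],
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], Iterable[str]],
) -> Iterator[Event]:
    """Yield something visible before each slow stage, then the answer tokens."""
    yield ("status", "Working on it")
    rewritten = rewrite(query)
    yield ("status", f"Searching for: {rewritten}")
    titles = retrieve(rewritten)
    yield ("status", "Reading: " + ", ".join(titles))
    for token in generate(rewritten, titles):
        yield ("token", token)
```

The first status event costs nothing to produce, so it can reach the screen well inside the 1-second budget even when retrieval itself takes two.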

Cache hits and the choppy-UX problem

Now consider what happens when caching enters the picture. With semantic caching, a hit on a sufficiently similar prior question returns in maybe 50 milliseconds. A miss takes the full RAG pipeline path; call it two seconds.

If 30% of your queries hit cache and 70% miss, your average response time is about 1.4 seconds (0.3 × 0.05 s + 0.7 × 2 s ≈ 1.4 s). That sounds fine on a metrics dashboard. To the user, it is awful, because the variance is what they feel.
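A quick check of those numbers, using a hundred synthetic requests at the 30/70 split. The figures are the ones from the text; the variable names are illustrative.

```python
import statistics

HIT_S, MISS_S = 0.05, 2.0                      # cache hit and full-pipeline miss
latencies = [HIT_S] * 30 + [MISS_S] * 70       # 100 requests, 30% hit rate

mean_s = statistics.mean(latencies)
spread_s = statistics.pstdev(latencies)        # population standard deviation
ordered = sorted(latencies)
p10_s = ordered[9]                             # 10th fastest request
p90_s = ordered[89]                            # 90th fastest request

print(f"mean {mean_s:.3f}s  stdev {spread_s:.2f}s  p10 {p10_s:.2f}s  p90 {p90_s:.2f}s")
```

The mean lands near 1.4 seconds, yet no single request takes anything like that: every user sees either 50 milliseconds or 2 seconds, which is why the mean alone is a misleading thing to put on the dashboard.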

The 50-millisecond hits feel snappy. The 2-second misses feel painful, especially in contrast. The user starts to expect the snappy version. Every miss feels like a regression. They start wondering if they typed the question wrong, or if the system is broken today.

There are two responses. The first is to chase down the slow path until it is also fast: a worthy effort, but often expensive and sometimes impossible. The second is to deliberately slow the fast path. If the typical experience is closer to “always 1.5 seconds” than to “sometimes 50 milliseconds, sometimes 2 seconds,” the perceived UX improves. The 50-millisecond cache hit gets a 1-second artificial delay, with a streaming progress indicator during the wait.
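A minimal sketch of slowing the fast path, assuming an asyncio-based service. `with_latency_floor` and `floor_s` are hypothetical names, and the sketch covers only the delay, not the progress indicator shown while it elapses.

```python
import asyncio
import time
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def with_latency_floor(work: Callable[[], Awaitable[T]], floor_s: float) -> T:
    """Run work(), then wait out whatever remains of the minimum latency."""
    start = time.monotonic()
    result = await work()
    remaining = floor_s - (time.monotonic() - start)
    if remaining > 0:
        # Fast path (for example a cache hit): pad up to the floor.
        await asyncio.sleep(remaining)
    # Slow path: already past the floor, so no extra delay is added.
    return result
```

Padding to a floor, rather than adding a fixed delay to every request, means the slow path never gets slower: only responses that would have arrived early are held back.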

This feels wrong the first time you do it. You spent engineering effort getting the cache hit fast, and now you are adding latency back? But the budget you are managing is attention, not milliseconds. Smooth is more important than fast.

Tool calls and the stutter

Agent loops add another wrinkle. (An agent is a system that wraps an LLM with tools, memory, and a loop, so it can take multi-step actions toward a goal rather than just answering one prompt.) When the model calls a tool (a database query, a web fetch, another LLM, anything), generation pauses. The user, who was watching tokens arrive, sees the typing stop. There is a visible gap. Then maybe a “thinking…” indicator. Then the tool returns, generation resumes, more tokens arrive.

Each tool call spends attention budget. A single tool call adds maybe one to three seconds of dead air. Three tool calls in a row add three to nine seconds, and on top of retrieval and generation time you are past the 10-second wall. The user has gone to make coffee.

The “thinking…” indicator does real work here. It is not just decoration; it is a deliberate spend of the 1-second budget, used to keep the user on the right side of the 10-second wall. If you replace the spinner with text that describes what is happening (“Looking up your account… Reading the latest transactions… Cross-referencing with the support ticket…”), you are moving from a 10-second attention budget to something more like a 30-second one, because the user is now part of the operation rather than waiting on it.

The tools that get the most attention budget back are the ones that explain themselves while they run.
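One possible shape for self-explaining tools is a thin wrapper that announces a message before the tool runs. The `narrated` helper, the stand-in tools, and the `emit` callback are all hypothetical; a real `emit` would push the text into whatever stream the UI is reading.

```python
from typing import Callable, List, TypeVar

A = TypeVar("A")
R = TypeVar("R")


def narrated(
    tool: Callable[[A], R],
    message: str,
    emit: Callable[[str], None],
) -> Callable[[A], R]:
    """Wrap a tool so it announces what it is doing before it runs."""
    def wrapper(arg: A) -> R:
        emit(message)  # shown to the user in place of a bare spinner
        return tool(arg)
    return wrapper


# Usage with stand-in tools; here emit() just collects the messages.
shown: List[str] = []
lookup_account = narrated(
    lambda user_id: {"id": user_id}, "Looking up your account…", shown.append
)
read_transactions = narrated(
    lambda account: [12.5, 40.0], "Reading the latest transactions…", shown.append
)

account = lookup_account("u-123")
transactions = read_transactions(account)
```

Attaching the message at the point where the tool is registered keeps the narration next to the thing it describes, so a new tool cannot be added without someone deciding what the user sees while it runs.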

When 10 seconds is not enough

Some operations genuinely need longer than ten seconds. Multi-document research. Multi-step plans with many tool calls. Long-form analysis where the model is writing a report.

There are two patterns for surviving past the 10-second wall, and they work for different reasons.

The first is progress narration. Stream not just the final answer but the intermediate work. Search query → first result → second result → analysis → conclusion. The user sees the system working. Each visible step resets the attention timer. Done well, this can hold attention for 30 seconds, sometimes more.

The second is going async. Accept that the operation is too long for synchronous attention, and re-engage the user when it is done. This is the “start a research task and we will notify you” pattern, the longer-running research modes that have become common across assistants in the last two years. The user is freed from the attention budget entirely; the system commits to coming back to them.

Both patterns are about giving the user back their attention. The first redirects it; the second releases it. The wrong move is to do neither: to silently process for forty-five seconds and hope the user is still there. They are not.
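The choice between the two patterns can be sketched as a function of the estimated duration. The 10-second wall is from the text; the 30-second cutoff for narration is an assumption based on the rough figure given above, and the strategy names are illustrative.

```python
ATTENTION_WALL_S = 10.0    # the 10-second limit of attention
NARRATION_LIMIT_S = 30.0   # assumption: roughly how long narration holds attention


def wait_strategy(estimated_s: float) -> str:
    """Pick how to handle a wait of the estimated length."""
    if estimated_s <= ATTENTION_WALL_S:
        return "stream"    # synchronous, with an indicator if needed
    if estimated_s <= NARRATION_LIMIT_S:
        return "narrate"   # stream the intermediate steps as they happen
    return "async"         # hand off and notify the user when done
```

For example, `wait_strategy(45)` returns `"async"`: the forty-five-second job is exactly the case where silence loses the user and a hand-off does not.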

Designed waits

Spinners are content-free. They tell the user only that something is happening, which they already knew. After a few seconds, a spinner stops conveying anything; it just emphasises that they are waiting.

Capability messages are different. “Searching across 50,000 documents” tells the user what they are getting for the wait. “Analysing your last 12 months of transactions” justifies the patience. “Comparing against industry benchmarks” makes the wait feel like value being created on their behalf.

The same physical wait of 30 seconds feels much shorter when filled with capability messages than with a spinner. This is not exactly a perception trick: the user is genuinely getting more value, and they are being told so. The wait is the price; the message is the receipt.

The implication is that the more your system does that is hard to do, the longer the wait you can charge for it. A trivial chatbot has a small attention budget; a system that genuinely searches a corpus and reasons over it has a much larger one, but only if the user knows what is happening.

The thresholds haven’t moved because the human hasn’t. A hundred milliseconds is still acknowledgment, a second is still response, ten seconds is still the point past which attention has wandered off. Streaming LLMs split the old single moment of “response” into two: time to first token belongs to the one-second budget, total generation time belongs to the ten-second one. A RAG pipeline routinely eats the whole first-token budget before the model has produced anything, which is why the retrieval step needs to put something on the screen rather than sit behind a silent spinner. Caching makes the picture worse, not better, because variance hits users harder than means; sometimes the right move really is to slow the fast path so the experience smooths out.

Every tool call in an agent loop is another withdrawal from the same attention budget. Three calls and the ten-second wall arrives. The “thinking…” indicator does real work, especially when it’s filled with text describing what the system is actually doing; capability messages let the wait feel like value being created rather than time being wasted. Past ten seconds the choice is narrate or go async. Silence loses the user every time.

Where this goes next

The patterns above are the what. The how (server-sent event progress streams, polling endpoints, webhooks back to the client, the choice of synchronous-with-progress versus fully-asynchronous) is its own subject, with code. Past the Ten-Second Wall covers the implementations.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.