The previous post, The Transformer Attention Budget, argued that past the ten-second wall the user is gone unless you do something about it. There are two patterns for keeping them: narrate the work as it happens, or release them and come back. This post is about implementing both: the wire formats, the server shapes, the client shapes, and the trade-offs each one makes. Server code is Python (FastAPI); client code is plain JavaScript so it can run in a browser unmodified.
Four shapes a long response can take
A long-running operation, one that exceeds the 10-second budget, has four reasonable shapes in HTTP. They differ in who holds the connection open, who decides when the work is complete, and how the user finds out.
Synchronous-with-spinner is the simplest. The client makes a request; the server holds the connection open until the work is done; the response is the answer. Simple, but the user gets no signal that anything is happening. Past ten seconds, the user assumes the system is broken.
Streaming with progress events uses the same connection model (the client makes a request and the server holds it open), but the server sends events down the wire as work progresses. Server-sent events (SSE) are the natural fit. The user sees the operation moving. This works well up to about thirty to sixty seconds. Beyond that, network reliability and connection limits start to bite.
Polling flips the direction of the conversation. The client submits a job, gets back a job ID immediately, then polls a status endpoint until the job is done. The connection between client and server is short-lived. The work runs to completion regardless of whether the client is listening. Survives indefinitely from the network’s perspective, but the user has to be connected to see status.
Webhooks (or push notifications) take it further. The client submits the job and disconnects entirely. The server runs the work asynchronously and notifies the client some other way: a webhook URL the client provided, an email, a push notification, or an item in a notifications panel. The user is freed from the attention budget completely; the work runs even if they close the browser.
Most production systems mix these. A typical chatbot uses streaming for normal queries and switches to job-plus-notification for “deep research” or “long analysis” modes.
Streaming with progress events
Server-sent events are HTTP responses with Content-Type: text/event-stream and a body of text frames. Each frame is one or more field lines (here, a single data: line) terminated by a blank line:
data: {"type": "progress", "step": "embedding"}

data: {"type": "progress", "step": "search", "results": 12}

data: {"type": "token", "text": "The"}

data: {"type": "token", "text": " answer"}

data: {"type": "done"}

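To make the framing concrete, here is a minimal sketch of how a consumer might split such a body back into events. It is not the browser's implementation: the function name parse_sse is made up for illustration, and it assumes every payload is JSON carried in data: fields, ignoring the event:, id:, and retry: fields that real SSE also allows.

```python
import json
from typing import Iterator


def parse_sse(body: str) -> Iterator[dict]:
    """Yield the JSON payload of each SSE frame in `body`.

    Simplified: only `data:` fields are handled. Comment lines (starting
    with ":") and other fields are ignored.
    """
    data_lines: list[str] = []
    for line in body.split("\n"):
        if line == "":
            # A blank line terminates the current frame.
            if data_lines:
                yield json.loads("\n".join(data_lines))
                data_lines = []
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # One leading space after the colon is not part of the value.
            data_lines.append(value[1:] if value.startswith(" ") else value)
    # A trailing frame without its closing blank line is incomplete
    # and is deliberately dropped.


stream = (
    'data: {"type": "progress", "step": "embedding"}\n\n'
    ": keepalive\n\n"
    'data: {"type": "token", "text": "The"}\n\n'
    'data: {"type": "done"}\n\n'
)
events = list(parse_sse(stream))
```

The comment frame in the middle of the sample is skipped, which is what makes comment lines usable as keepalives.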
The browser consumes this with the native EventSource API:
const events = new EventSource("/chat/stream?q=...");

events.onmessage = (e) => {
  const event = JSON.parse(e.data);
  if (event.type === "progress") {
    setStatusText(`Working: ${event.step}...`);
  } else if (event.type === "token") {
    appendToken(event.text);
  } else if (event.type === "done") {
    events.close();
  }
};

events.onerror = () => {
  setStatusText("Connection lost");
  events.close();
};
The Python server side, with FastAPI, looks like this:
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import json

app = FastAPI()


def sse_event(payload: dict) -> str:
    # One SSE frame: a data: line followed by a blank line.
    return f"data: {json.dumps(payload)}\n\n"


# embed(), search(), rerank(), and generate() are application-specific helpers.
async def event_stream(query: str):
    yield sse_event({"type": "progress", "step": "embedding"})
    embedding = await embed(query)
    yield sse_event({"type": "progress", "step": "search"})
    docs = await search(embedding)
    yield sse_event({"type": "progress", "step": "rerank"})
    ranked = await rerank(query, docs)
    async for token in generate(query, ranked):
        yield sse_event({"type": "token", "text": token})
    yield sse_event({"type": "done"})


@app.get("/chat/stream")
async def stream(q: str):
    return StreamingResponse(
        event_stream(q),
        media_type="text/event-stream",
    )
Three things are worth noticing. First, every step that takes meaningful time gets its own progress event before it starts: the user sees the system thinking before the work begins, not after. Second, the same connection that delivers progress events later delivers tokens; the client does not have to switch transports halfway through. Third, the done event is explicit. The HTTP response does end when the server closes it, but EventSource reports any closed connection through onerror and then tries to reconnect, so the application-level done event is what lets the client distinguish “stream finished cleanly” from “connection dropped mid-response.”
The connection limit to be aware of: over HTTP/1.1, browsers cap concurrent connections to a single origin at about six, and every open SSE stream occupies one of them. The cap is shared across tabs, so a user with three open tabs of your app can saturate it. If you need many simultaneous streams, multiplex them over a single connection with a routing field, or move to HTTP/2 (which carries many streams over one connection, with a negotiated limit that is typically around 100) or WebSockets.
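Multiplexing with a routing field can be sketched on the server side as follows. This is one possible shape under stated assumptions, not a recommended implementation: the multiplex function, the "stream" field name, and the fake_stream helper are all hypothetical, and the client is assumed to dispatch on that field.

```python
import asyncio
import json
from typing import AsyncIterator


async def multiplex(streams: dict[str, AsyncIterator[dict]]) -> AsyncIterator[str]:
    """Merge several event streams into one SSE body, tagging each event
    with the ID of the logical stream it belongs to."""
    queue: asyncio.Queue = asyncio.Queue()
    finished = object()  # sentinel: one source stream has ended

    async def pump(stream_id: str, source: AsyncIterator[dict]) -> None:
        try:
            async for event in source:
                queue.put_nowait({"stream": stream_id, **event})
        finally:
            queue.put_nowait(finished)

    tasks = [asyncio.create_task(pump(sid, src)) for sid, src in streams.items()]
    remaining = len(tasks)
    try:
        while remaining:
            item = await queue.get()
            if item is finished:
                remaining -= 1
            else:
                yield f"data: {json.dumps(item)}\n\n"
    finally:
        # Stop any sources still running if the consumer goes away early.
        for task in tasks:
            task.cancel()


async def fake_stream(tokens: list[str]) -> AsyncIterator[dict]:
    # Stand-in for a real token generator.
    for text in tokens:
        await asyncio.sleep(0)
        yield {"type": "token", "text": text}


async def demo() -> list[dict]:
    merged = multiplex({"a": fake_stream(["x", "y"]), "b": fake_stream(["z"])})
    return [json.loads(frame[len("data: "):]) async for frame in merged]


frames = asyncio.run(demo())
```

Events from different streams may interleave in any order, but events within one stream keep their order, so the client only needs a lookup from the routing field to the right handler.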
For load balancers and proxies in the path, SSE needs three things: response buffering disabled, idle timeouts longer than your longest expected event gap, and HTTP/1.1 or HTTP/2 (not HTTP/3 over a proxy that does not know how to forward streams cleanly). Misconfigured proxies are the most common cause of “the stream works locally but not in production.”
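One way to stay under a proxy's idle timeout is to send an SSE comment frame whenever the stream has been quiet for a while; clients ignore lines that start with a colon. The wrapper below is a sketch under assumptions: the name with_heartbeat and the 15-second default are illustrative, and the right interval depends on the timeouts actually configured in your path.

```python
import asyncio
from typing import AsyncIterator


async def with_heartbeat(
    source: AsyncIterator[str], interval: float = 15.0
) -> AsyncIterator[str]:
    """Pass SSE frames through, inserting a comment frame when idle."""
    queue: asyncio.Queue = asyncio.Queue()
    finished = object()  # sentinel: the source stream has ended

    async def pump() -> None:
        try:
            async for frame in source:
                queue.put_nowait(frame)
        finally:
            queue.put_nowait(finished)

    task = asyncio.create_task(pump())
    try:
        while True:
            try:
                frame = await asyncio.wait_for(queue.get(), timeout=interval)
            except asyncio.TimeoutError:
                # Nothing arrived in time: keep the connection warm.
                yield ": keepalive\n\n"
                continue
            if frame is finished:
                await task  # re-raise any exception from the source
                return
            yield frame
    finally:
        task.cancel()


async def slow_source() -> AsyncIterator[str]:
    # Stand-in for a stream with a long gap between events.
    yield 'data: {"type": "progress", "step": "search"}\n\n'
    await asyncio.sleep(0.2)
    yield 'data: {"type": "done"}\n\n'


async def demo() -> list[str]:
    return [f async for f in with_heartbeat(slow_source(), interval=0.05)]


frames = asyncio.run(demo())
```

Buffering is a separate problem from idle timeouts. With FastAPI, passing headers such as Cache-Control: no-cache and X-Accel-Buffering: no to StreamingResponse is a common starting point (nginx honours the latter), though other proxies need their own settings.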
Polling endpoints
When the work might take minutes, holding the connection open is fragile and expensive. The pattern is to submit a job, get an ID, and poll for status.
from fastapi import BackgroundTasks, FastAPI, HTTPException
from uuid import uuid4

app = FastAPI()

# In-memory for the example. In production this is Redis,
# DynamoDB, or your job queue's own state store.
jobs: dict[str, dict] = {}


@app.post("/research")
async def submit(query: str, background_tasks: BackgroundTasks):
    job_id = str(uuid4())
    jobs[job_id] = {"status": "queued", "query": query}
    background_tasks.add_task(run_research, job_id, query)
    return {"job_id": job_id, "status_url": f"/research/{job_id}"}


@app.get("/research/{job_id}")
async def status(job_id: str):
    job = jobs.get(job_id)
    if not job:
        # A Flask-style (body, status) tuple would be serialized as JSON
        # with a 200, so raise to get a real 404.
        raise HTTPException(status_code=404, detail="not_found")
    return job


# deep_search() and answer_from() are application-specific helpers.
async def run_research(job_id: str, query: str):
    try:
        jobs[job_id]["status"] = "searching"
        docs = await deep_search(query)
        jobs[job_id]["status"] = "answering"
        answer = await answer_from(query, docs)
        jobs[job_id] = {"status": "done", "answer": answer}
    except Exception as exc:
        # Record the failure so a polling client can stop waiting.
        jobs[job_id] = {"status": "error", "error": str(exc)}
The client polls until the job is done:
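const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function pollUntilDone(statusUrl) {
  while (true) {
    const res = await fetch(statusUrl);
    // Stop on HTTP errors (for example, a 404 for an unknown job ID).
    if (!res.ok) throw new Error(`Status check failed: ${res.status}`);
    const job = await res.json();
    if (job.status === "done") return job;
    if (job.status === "error") throw new Error(job.error);
    setStatusText(`Working: ${job.status}...`);
    await sleep(2000);
  }
}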
Polling intervals are a balance. Every two seconds is friendly to the server and human-comfortable for the user. Every 200 milliseconds turns a single user into a small denial-of-service attack on your status endpoint. Every thirty seconds makes the UI feel dead. Two to five seconds is a safe default. Aggressive backoff (start at 500ms, grow to 5s) is appropriate when the client expects the job to finish quickly but is not sure when.
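The backoff schedule described above can be sketched as a small generator plus a polling loop. The doubling factor, the function names, and the max_polls cap are assumptions made for illustration; the text only fixes the 500ms start and the 5s ceiling.

```python
import time
from typing import Callable, Iterator


def poll_intervals(
    start: float = 0.5, factor: float = 2.0, cap: float = 5.0
) -> Iterator[float]:
    """Yield poll delays in seconds: start small, grow, then hold at the cap."""
    delay = start
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


def poll_until_done(
    fetch_status: Callable[[], dict],
    sleep: Callable[[float], None] = time.sleep,
    max_polls: int = 100,
) -> dict:
    """Call fetch_status() until the job reports done or error."""
    for _, delay in zip(range(max_polls), poll_intervals()):
        job = fetch_status()
        if job["status"] in ("done", "error"):
            return job
        sleep(delay)
    raise TimeoutError("job did not finish within max_polls")
```

Passing sleep in as a parameter keeps the loop testable without real waiting; in a client, fetch_status would wrap the HTTP call to the status URL.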
A polling endpoint can be cached at the edge. If your status response includes a Cache-Control: max-age=2 header, a CDN will collapse a hundred near-simultaneous polls from the same client (or different clients sharing the same URL) into one origin hit. This is mostly relevant when you are polling a job ID that many clients can see, less so for per-user private jobs.
The work runs to completion regardless of whether the client is polling. That is the property polling buys you over SSE: the connection no longer carries the work. The client can crash, the user can switch networks, the laptop can sleep, the job keeps going. When the user comes back and polls again, the answer is waiting.
Webhooks for completion
When you genuinely do not want the client connected at all, the model is: client submits a job with a callback URL, server runs the job, server POSTs the result to the callback when done.
import httpx

# Continues the polling example: app, uuid4, and BackgroundTasks
# are imported there.


@app.post("/research-async")
async def submit_async(
    query: str,
    callback_url: str,
    background_tasks: BackgroundTasks,
):
    job_id = str(uuid4())
    background_tasks.add_task(run_and_callback, job_id, query, callback_url)
    return {"job_id": job_id, "status": "accepted"}


async def run_and_callback(job_id: str, query: str, callback_url: str):
    docs = await deep_search(query)
    answer = await answer_from(query, docs)
    async with httpx.AsyncClient() as client:
        await client.post(
            callback_url,
            json={
                "job_id": job_id,
                "status": "done",
                "answer": answer,
            },
            timeout=10.0,
        )
The client receives the callback at its own endpoint:
app.post("/webhooks/research-complete", async (req, res) => {
  const { job_id, answer } = req.body;
  await notifyUser(job_id, answer);
  res.status(200).send("ok");
});
The example skips three production-grade concerns. Authentication: the receiver of the callback needs to verify that the request actually came from your service, typically with an HMAC signature in a header that the receiver re-computes and compares. Idempotency: callbacks can fire more than once; the receiver should de-duplicate by job ID. Retry policy: if the callback fails (the client is down, rate-limited, or returns a 5xx), the server should retry with exponential backoff and eventually give up to a dead-letter queue. The major SaaS providers (Stripe, GitHub, AWS EventBridge) have converged on similar patterns; their public docs are good references.
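The three concerns can be sketched with the standard library alone. Everything named here is an assumption for illustration: the shared secret, the function names, the in-memory set standing in for a persistent store, and the doubling retry schedule. The signature would travel in a request header whose name sender and receiver agree on.

```python
import hashlib
import hmac
import json

# Hypothetical shared secret; in practice it is provisioned per client.
SECRET = b"shared-secret-for-illustration"


def sign(body: bytes, secret: bytes = SECRET) -> str:
    """Sender side: HMAC-SHA256 of the raw request body, hex-encoded."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(body: bytes, signature: str, secret: bytes = SECRET) -> bool:
    """Receiver side: recompute the signature and compare in constant time."""
    return hmac.compare_digest(sign(body, secret), signature)


# In-memory for the sketch; a real receiver needs a persistent store.
seen_job_ids: set[str] = set()


def handle_callback(body: bytes, signature: str) -> str:
    """Authenticate first, then de-duplicate by job ID."""
    if not verify(body, signature):
        return "rejected"
    job_id = json.loads(body)["job_id"]
    if job_id in seen_job_ids:
        return "duplicate"  # already processed: acknowledge, do nothing
    seen_job_ids.add(job_id)
    return "processed"


def retry_delays(attempts: int = 5, base: float = 1.0, cap: float = 60.0) -> list[float]:
    """Exponential backoff schedule in seconds for failed deliveries.

    After the last attempt the sender would hand the callback to a
    dead-letter queue instead of retrying again.
    """
    return [min(base * 2**i, cap) for i in range(attempts)]
```

Signing the raw bytes rather than the parsed JSON matters: re-serializing on the receiver can change whitespace or key order and break the comparison.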
For the user-facing side, “webhook” usually means something less literal: a notifications panel in the app, a push notification, an email. The work has been freed from any connection; the user finds out about it through a separate channel. Slack’s “we will let you know when your export is ready” is the familiar consumer-facing version.
Picking among them
The four shapes form a rough ladder by job duration:
| Pattern | Connection | Good for |
|---|---|---|
| Sync + spinner | Long-lived | Sub-second endpoints |
| SSE + progress | Long-lived | 2-30 seconds: RAG, agent chat, most LLM calls |
| Polling | Short-lived | 30 seconds to ~10 minutes: jobs the user is monitoring |
| Webhook / async notify | None | 10 minutes and up: deep research, batch analysis |
Pick by how long the work takes and how much you trust the connection. When in doubt, pick the pattern one rung further up the ladder. The cost of an over-async pattern is a small amount of orchestration overhead. The cost of an under-async pattern is a client timeout, a user who thinks it failed, work that may or may not have completed, and a confusing support ticket.
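As a rough illustration, the ladder can be written down as a lookup on expected duration. The thresholds come from the table; the function name is hypothetical, and sending the unlisted one-to-two-second range to SSE is an assumption that follows the "pick up the ladder" rule.

```python
def choose_pattern(expected_seconds: float) -> str:
    """Map an expected job duration to one of the four response shapes."""
    if expected_seconds < 1:
        return "sync"      # sub-second: a plain request/response is fine
    if expected_seconds <= 30:
        return "sse"       # stream progress events and tokens
    if expected_seconds <= 600:
        return "polling"   # job ID plus status endpoint
    return "webhook"       # notify through a separate channel


for seconds in (0.3, 10, 120, 3600):
    print(f"{seconds}s -> {choose_pattern(seconds)}")
```

Duration is only one input. An unreliable connection (mobile clients, flaky proxies) is a reason to move up a rung even when the work itself is short.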
Real systems combine. A chat endpoint streams progress events for in-flow questions and falls back to job-plus-notification when the user requests a deep-research mode. A document analysis endpoint might stream the first page of results and fall back to a notification when the full analysis is ready an hour later.
The shape of the response is part of the product’s UX, not just an implementation detail. Designing it deliberately is the difference between “feels modern” and “feels like waiting at a bus stop.”
Sync, streaming, polling, async-notify: four shapes a long response can take, differing mostly in who holds the connection open and how the client finds out the work is done. Server-sent events are the right default for the two-to-thirty-second range, which covers most RAG calls, agent chats, and ordinary LLM work, provided the proxies in the path know not to buffer and the six-connections-per-origin limit isn’t going to bite. Polling takes over when the work might stretch into minutes; two to five seconds is a polite interval and the work runs whether or not the client is listening. Webhooks and notification panels are the right answer when the user shouldn’t have to stay connected at all, and authentication, idempotency, and a retry policy are not optional once the callback is the only way the result gets home.
When in doubt, pick higher up the ladder. The cost of an over-async pattern is a little orchestration overhead. The cost of an under-async pattern is a timeout, a user who thinks it failed, and a support ticket where nobody can tell whether the work actually completed. The shape of the response is part of the product’s UX. Decide it deliberately.