Exam Room · Advanced GenAI

Streaming Responses to Cut First-Token Latency

July 13, 2026 · 23 min read

The situation

The support assistant has grown up. Responses are often 300-500 tokensTokenThe unit of text an LLM actually sees – usually a short character sequence, not a whole word. now, more context, more careful reasoning, better answers. The user-facing latency has grown with them: 4.2 seconds median to first visible response, 5.8 seconds at p95. Product shows a dashboard: abandonment during the wait period is up 11% over the last six weeks, and the biggest contributor is sessions where the user never sees the assistant’s reply because they closed the tab first.

The technical baseline today is one synchronous call per turn: browser → API Gateway → Lambda → Converse (non-streaming) → Lambda response → API Gateway → browser. The entire reply has to generate before anything reaches the user. No progress indicator beyond a spinner.

Product wants the typing-animation pattern, tokens arriving as they’re generated, first token visible within one second. Engineering needs to understand how the plumbing changes and where the sharp edges live: ConverseStream, Lambda response streaming, API Gateway streaming support, the WebSocket vs Server-Sent-Events choice on the browser, and how error handling changes when the response is in flight.

What actually matters

Streaming flips the response shape. Instead of a single request-reply, the server emits a sequence of events as the model produces them: a start event, then token/text events, then tool-use events if any, then a stop event, sometimes accompanied by a usage-and-reason event. The client consumes events and renders them as they arrive.

The first decision is how the model emits the stream. The streaming variant of the inference API returns a server-side event stream rather than a single buffered reply, and each event carries a block of content, a text delta, a tool-use delta, a content-block-start marker, or a stop-reason. The SDK presents this as an iterator the server-side handler consumes.

The second is how the service surfaces that stream to the client. Every hop between the model and the browser has to be capable of forwarding bytes as they arrive rather than buffering the whole reply before sending. The serverless and gateway layers each have their own switches for that, and each one that doesn’t get flipped is a hop that re-buffers the stream into a single response.

The third is the wire protocol to the browser. The two real options are a unidirectional text-event stream over plain HTTP and a bidirectional persistent connection. Unidirectional event streams are simpler, a text protocol over an existing HTTP request, native browser consumer support, no connection-lifecycle code. Bidirectional connections earn their keep when the browser also needs to push structured data mid-stream (rare for chat; common for collaborative tools).

The fourth is error handling mid-stream. A streamed response that fails halfway has half-delivered state. The client needs to know the stream failed (not just that it ended cleanly), the partial response might still be displayed, and retrying means resuming from where we stopped, which is not trivial because the model doesn’t have a resume-from-token primitive.

The fifth is end-to-end timeouts. A synchronous call has one timeout; a streamed call has multiple: time to first byte, time between bytes, total connection time. Each needs a sensible value, each fails differently.

The tool-call case deserves its own thought. If the assistant calls tools (the Bedrock Agents / function-calling pattern), streaming interleaves with tool dispatch. The stream pauses for a tool call, resumes with the tool result, then continues generation. The client UI has to handle “thinking” states during tool dispatch, which is richer than simple typing animation.

What we’ll filter on

Time to first token, how fast the first byte reaches the user?
Wire-protocol overhead, how many hops between model and browser, and how much each adds?
Error recovery, what happens when the stream breaks?
Tool-call handling, can the stream pause cleanly for a tool call?
Infrastructure cost, does streaming change the per-request bill?

The streaming landscape

ConverseStream + Lambda response streaming + API Gateway streaming + SSE to browser. The canonical path. Lambda calls ConverseStream, iterates events, writes them to awslambda.streamifyResponse handler, which flushes bytes out through API Gateway’s streaming integration, which sends SSE events to the browser. Browser consumes with EventSource or fetch + a reader. Full-stack streaming, AWS-native, works well.
ConverseStream + Lambda Function URL streaming (no API Gateway). Lambda Function URLs support response streaming natively with fewer moving parts than API Gateway. The function URL is public-facing; can be combined with CloudFront in front for custom domain, caching (irrelevant for streams), and WAF. Slightly simpler than API Gateway; loses some API Gateway features (rate limiting by API key, usage plans, etc.).
ConverseStream + API Gateway WebSocket API. WebSocket connection established per session; Lambda pushes events to the connection via @connections API. Bidirectional but more complex: connection lifecycle management, message routing, per-message charging. Worth it when the browser needs to push structured mid-conversation messages (cancel current generation, switch tool results) or when many-client broadcasts are involved.
ConverseStream + a long-running Fargate service with SSE. No Lambda cold starts (relevant when cold is ~300ms and time-to-first-token is ~500ms), no Lambda max duration limits. Higher fixed cost but predictable latency. Correct for high-volume services where cold-start variance matters.
Non-streaming Converse with a “chunked delivery” illusion. The naive workaround: generate the full response non-streaming, then dribble it to the browser one word at a time to simulate typing. Looks like streaming, isn’t. Still has the multi-second wait for the full generation before the first visible byte; abandonment metric doesn’t improve. Not a real option.
ConverseStream + AppSync Events. AppSync’s event-based subscriptions can carry streamed model output to connected clients. Adds GraphQL machinery but integrates cleanly with AppSync-backed front-ends. Correct for AppSync shops; overkill otherwise.

Side by side

Option	TTFT	Wire	Error recovery	Tool calls	Infra cost
CS + Lambda stream + APIGW + SSE	~800 ms	SSE (text)	Half-delivered	Pause mid-stream	Same per-request
CS + Lambda Function URL + SSE	~800 ms	SSE (text)	Half-delivered	Pause mid-stream	Same
CS + APIGW WebSocket API	~900 ms	WebSocket	Connection reset	Bidirectional	Per-message + connection-hour
CS + Fargate + SSE	~500 ms	SSE (text)	Half-delivered	Pause mid-stream	Fargate-hour floor
Non-streaming “illusion”	Full generation	JSON	N/A	N/A	Same
CS + AppSync Events	~900 ms	GraphQL subs	Subscription drop	Pause mid-stream	AppSync per-request

For a chat interface on a Lambda-centric stack with one-way streaming (server → browser) and no broadcast requirements, option 1 or 2 is the correct shape. Option 2 (Lambda Function URL) removes one hop and is arguably simpler; option 1 keeps the API Gateway features the rest of the stack uses.

The streaming path, end to end

Forward request on top, streamed events flowing back in the middle, three distinct timeout classes at the bottom. Every hop needs its own error handling.

The pick in depth

Lambda response streaming. The handler uses awslambda.streamifyResponse (Node.js) or the equivalent pattern in Python with a custom runtime or Lambda Web Adapter. The handler receives the event, opens a writable stream, calls ConverseStream, iterates the Bedrock event iterator, and writes each event to the stream as an SSE line. Writes flush immediately when followed by a blank line.

# Simplified Python handler using an ASGI-adjacent streaming runtime
def handler(event, response_stream):
    response_stream.set_content_type("text/event-stream")
    response_stream.set_header("Cache-Control", "no-cache")
    response_stream.set_header("Connection", "keep-alive")

    bedrock = boto3.client("bedrock-runtime")
    resp = bedrock.converse_stream(
        modelId=MODEL_ID,
        messages=build_messages(event),
        system=SYSTEM_PROMPT,
        inferenceConfig={"maxTokens": 800, "temperature": 0.2},
    )

    try:
        for item in resp["stream"]:
            if "contentBlockDelta" in item:
                delta = item["contentBlockDelta"]["delta"]
                if "text" in delta:
                    response_stream.write(
                        f"data: {json.dumps({'type':'text_delta','text':delta['text']})}\n\n".encode()
                    )
            elif "messageStop" in item:
                response_stream.write(
                    f"data: {json.dumps({'type':'stop','reason':item['messageStop']['stopReason']})}\n\n".encode()
                )
        response_stream.write(b"data: [DONE]\n\n")
    except ClientError as e:
        response_stream.write(
            f"data: {json.dumps({'type':'error','message':str(e)})}\n\n".encode()
        )
    finally:
        response_stream.end()

API Gateway configuration. HTTP API with an AWS_PROXY integration pointing at the Lambda, payloadFormatVersion: 2.0, responseMode: STREAM_RESPONSE. CORS headers set on the route. Default timeouts are 30 seconds on API Gateway HTTP APIs; a streaming response has its own rules (the connection stays open as long as bytes flow), but no single pause can exceed the timeout budget.

Browser consumption. EventSource is the easiest path when a GET works; for POST + streaming, fetch with response.body.getReader() is the modern approach. The client assembles the streamed tokens into the visible message as they arrive, shows a typing indicator between bytes, and handles the [DONE] sentinel or error events. State: the current assistant message is partial until stop; on error, show what arrived plus an “(generation interrupted)” note.

Tool-call handling. When ConverseStream emits a contentBlockStart with a toolUse block, the Lambda stops forwarding text, pauses, dispatches the tool (another API call, could take seconds), gets the result, and continues the stream with the tool result fed back. The client sees a tool_start event (“Looking up your subscription…”), then nothing for a few seconds, then the generation continues. UI shows a “thinking” placeholder during the tool dispatch. The actual pattern involves Bedrock’s ConverseStream with tool-call support baked in; the Lambda orchestrates the pause-resume.

Error handling. Three failure classes. First-byte timeout: the client aborts after 3 seconds of nothing and shows “The assistant is thinking…”; CloudWatch gets a metric. Mid-stream error (Bedrock ThrottlingException, model refusal): emit an error event over the stream, close it cleanly, show the partial response with a reason. Stream cleanly ends but abbreviated (e.g., maxTokens hit): the stop reason tells the client, which shows a “(response truncated)” affordance. Client-side disconnect: Lambda keeps running until it hits its own timeout or detects the closed stream; cost-wise, the Bedrock call is still charged for tokens produced.

Cost shape. Streaming and non-streaming cost the same at Bedrock, per-token pricing is per-token pricing. Lambda billing slightly different because the function runs for the duration of the stream (longer than non-streaming would) but with lower memory pressure. API Gateway billing unchanged. Net neutral to slightly higher (Lambda duration).

A worked example: the latency improvement

Same support-assistant query, 500-token response. Measurements from before and after the streaming rollout:

Baseline (non-streaming)
  Time to first byte:   4,200 ms (full generation)
  Total response time:  4,200 ms
  User-perceived wait:  4,200 ms

Streaming (ConverseStream + SSE)
  Time to first byte:     800 ms (model producing)
  Total response time:  4,400 ms (slightly slower total)
  User-perceived wait:    800 ms

The total time is actually slightly worse with streaming (Lambda duration cost, stream setup overhead), but perceived wait falls from 4.2s to 0.8s. Tokens keep arriving at ~80 per second after the first byte. The typing animation matches the generation pace, which is exactly the UX product wanted.

Abandonment during wait drops from 11% to 2.3% over two weeks post-rollout. The feature shipped because the plumbing worked, not because the generation got faster; generation is roughly the same, but the user’s experience of it is transformed.

What’s worth remembering

Streaming moves the cost of latency from elapsed time to perceived time. The model hasn’t sped up; the user sees progress instead of a spinner.
ConverseStream is the Bedrock API; the rest is transport. Lambda response streaming + API Gateway streaming mode + SSE to the browser is the default AWS-native path.
Lambda Function URLs simplify the path. One less hop than API Gateway; loses some API Gateway features.
SSE beats WebSocket for one-way streaming. Simpler wire protocol, native browser support, no connection-lifecycle code. WebSocket earns its keep on bidirectional traffic.
Error handling changes shape in streams. Three timeouts (first-byte, between-byte, total), mid-stream error events, partial-response display. None of this exists in request-reply.
Tool calls interrupt the stream, not end it. The Lambda pauses on a tool-use event, dispatches the tool, feeds back the result, resumes the stream. UI shows a “thinking” state during the dispatch.
Cost is roughly neutral, slightly higher. Lambda runs longer because the function stays alive for the duration of the stream; Bedrock per-token is the same; API Gateway unchanged.
Time to first byte is the metric. Optimise it, measure it, monitor it. It’s what the user experiences as responsiveness.

The same assistant, the same model, the same promptPromptThe input you hand to an LLM – system instructions, user message, examples, retrieved documents, tool descriptions, the lot. , the same tokens, but the user sees typing instead of waiting, and the abandonment metric proves that makes a difference. The plumbing changes to support it are concrete, AWS-native, and worth the plumbing.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.