How to Trace Distributed Requests with X-Ray

June 09, 2027 · 19 min read

Developer · DVA-C02 · part of The Exam Room

The situation

A logistics company runs a parcel-tracking API at around 3,000 requests per second. A single GET /parcels/:id enters at API Gateway (REST, custom domain, WAF in front), invokes a Lambda that validates the parcel ID and pulls cached details from DynamoDB, and on cache miss starts a Step Functions express execution that calls a second Lambda to query a PostgreSQL warehouse via RDS Proxy, calls a third-party tracking vendor over HTTPS, merges the two results, and writes them back to DynamoDB before returning to the caller.

Five compute surfaces, three data stores, one external HTTP dependency, one API front door. The p99 has climbed from 200ms to 2s over the last week and the team’s best guess is “probably the vendor,” but nobody can prove it. The logs across the stack are all there; the correlation isn’t: fourteen log groups, fourteen different request-ID schemes, and no way to stitch one request’s path from front door to vendor.

What we care about is a single trace, per request, that shows every segment (API Gateway, both Lambdas, Step Functions, DynamoDB, RDS Proxy, the external HTTPS call) with timings, errors, and enough metadata to answer “what’s slow, right now, and why.”

What actually matters

Before reaching for a product name, it’s worth asking what a useful answer to “where is the time going?” actually looks like, because that shapes which pieces matter.

The first property is a trace identifier that survives every hop. The request crosses five compute surfaces and one external boundary on the way through. If each surface invents its own request ID, correlation becomes a log-mining exercise with grep and luck. If one identifier is stamped at the front door and carried forward on every outbound call, AWS-to-AWS and AWS-to-vendor, then a trace is a database query, not a weekend. What we need to pin down is who starts the identifier and whether every downstream surface knows how to read it.

The second is the blast radius of instrumentation failure. A tracing system that adds 50ms to every request, or that loses availability when the tracing backend is slow, becomes part of the problem it was meant to solve. We want emission to be asynchronous and fire-and-forget: spans batched off-box, sent over a cheap transport, and dropped on the floor rather than back-pressuring the request path when the backend is sick. The architecture of how spans are emitted matters as much as what’s in them.
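A minimal sketch of what fire-and-forget emission can look like, using only Node's dgram module. It assumes the X-Ray daemon's default listener (UDP on 127.0.0.1:2000) and its one-line JSON header ahead of each segment document; the function names and the sample segment values are hypothetical, and in practice the X-Ray SDK handles this step for you.

```javascript
const dgram = require('dgram');

// Header line the X-Ray daemon expects before each segment document.
const DAEMON_HEADER = '{"format": "json", "version": 1}';

// Serialise one segment document into a single datagram payload.
function encodeForDaemon(segment) {
  return Buffer.from(`${DAEMON_HEADER}\n${JSON.stringify(segment)}`, 'utf8');
}

// Send and forget: no await, no retry, errors swallowed.
function emitSegment(segment, { host = '127.0.0.1', port = 2000 } = {}) {
  const socket = dgram.createSocket('udp4');
  socket.unref();               // telemetry never keeps the process alive
  socket.on('error', () => {}); // a sick backend must not surface in the request path
  socket.send(encodeForDaemon(segment), port, host, () => socket.close());
}

// Hypothetical segment values, for illustration only.
const exampleSegment = {
  name: 'parcel-api',
  id: '70de5b6f19ff9a0a',
  trace_id: '1-581cf771-a006649127e371903a2de979',
  start_time: 1478293361.271,
  end_time: 1478293361.449,
};
```

The design point is that emitSegment returns immediately and has no failure mode the caller can observe: a lost datagram costs one span, not one request.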

The third is ownership of the instrumentation. Some pieces are toggles: flip a flag on API Gateway, on a Lambda function, on a state machine, and segments appear without application code. Other pieces require the application to import a library, wrap a client, and manage span lifecycles. The split between platform-flip and code-change determines how much work the team has to do to get full coverage, and how much risk a new service carries of being a black hole in the trace.

The fourth is the cost shape. At 3,000 RPS, tracing every request is expensive: storage, indexing, and egress all scale linearly with trace volume. But tracing none is useless, and tracing a fixed percentage misses the incidents that matter, because incidents are where the request rate is different. The control we want is sampling that captures all errors and a manageable proportion of success, tuneable without a redeploy, with the sampling decision made once at the edge and honoured by every downstream service so we get whole traces rather than half-sampled ones.

The fifth is observability of third-party calls. The hypothesis the team keeps landing on is “the vendor is slow,” and the fastest path to proving or disproving that hypothesis is seeing the vendor call as its own subsegment with its own status code and duration. That requires the outbound HTTP client on the Lambda to cooperate with the tracing system, either through a wrapper the SDK provides, or through an automatic instrumentation that patches the client at load time. Without it, the vendor call is invisible and every conversation with them starts from “we think.”

The sixth is softer but real: portability. The rest of the company is slowly moving to OpenTelemetry because the rest of the industry is. A solution that locks the parcel-tracking team onto an AWS-only span format would age badly. We’d prefer something that either is OpenTelemetry, or that cleanly interoperates with it, so next year’s consolidation doesn’t require a rewrite of everything instrumented this quarter.

What we’ll filter on

Distilling that exploration into filters we can score the X-Ray pieces against:

  1. End-to-end propagation: does this piece read and write the trace identifier so the tree stays connected across hops?
  2. Per-segment timing and errors: does it emit a timed span with status and exception information?
  3. Automatic AWS-service instrumentation: does coverage appear without the application having to hand-wire every service boundary?
  4. Outbound HTTP capture: does the third-party vendor call show up as its own subsegment?
  5. Sampling control: does the piece honour or produce sampling decisions at a rate the team can tune?

The tracing landscape

  1. X-Ray SDK inside the application. A library the application imports (aws-xray-sdk for Node, with equivalents for Python, Java, .NET, Go, and Ruby). It patches the AWS SDK and the HTTP client, and on Lambda it picks up the trace context the runtime provides, so every outbound AWS call and every outbound HTTP call made through a wrapped client produces a subsegment automatically. Inside the handler, the application can open custom subsegments for specific pieces of business logic. The SDK reads the incoming trace header (X-Amzn-Trace-Id) and propagates it to outbound calls via the same header.

  2. The X-Ray daemon (or the service itself on Lambda). A small process that receives segment documents from the SDK over UDP, batches them, and relays them to the X-Ray service. On Lambda there’s no daemon to install; the service runs it for you and publishes the segments. On ECS/EC2 the daemon runs as a sidecar or host process; on Fargate it’s a sidecar container in the task definition. The asynchronous, UDP-based hand-off is what keeps emission off the request path.

  3. Active tracing on Lambda. A per-function setting (TracingConfig.Mode = Active) that makes the runtime emit a segment for the invocation itself (cold-start duration, handler duration, any exceptions) without the application having to do anything. When the function’s code also uses the SDK, subsegments inside the handler nest under the Lambda-emitted segment.

  4. API Gateway tracing. A stage-level setting that enables X-Ray for the API Gateway stage. When on, API Gateway emits a segment for the request path through the gateway and propagates the trace header to the integration. This is the front-door piece that starts the trace for every downstream hop.

  5. Step Functions tracing. A state-machine-level setting. When enabled, Step Functions emits a segment for the execution and subsegments for each state; downstream services called from states continue the trace. Without it, a state machine is an opaque box in the middle of the call graph.

  6. AWS SDK instrumentation. The X-Ray SDK patches the AWS SDK so every PutItem, GetObject, StartExecution, Invoke, and friend produces a subsegment automatically. Calls to DynamoDB, S3, SNS, SQS, Step Functions, Lambda, and RDS all appear in the trace tree without application code beyond the wrap.

  7. The X-Amzn-Trace-Id header. The wire format for propagation. Carries Root=1-...-..., optionally Parent=..., optionally Sampled=0/1. Services that understand X-Ray read it, write it to their emitted segments, and pass it forward. HTTP clients outbound from instrumented services add it automatically.

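  8. Sampling rules. Server-side rules that decide which requests are traced. Default rule: 1 trace per second plus 5% of subsequent requests. Custom rules match by service name, host, URL path, or HTTP method and set different rates. Rules are evaluated when a request arrives, before a response status exists, so they cannot match on status code: a rule can say “trace a higher share of /parcels/*” but not “trace 100% of 5xx.” A higher rate on the endpoints that matter keeps failures well represented; a low-rate catch-all for everything else keeps the sample manageable.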

  9. ServiceLens. The cross-signal view. Ties CloudWatch metrics, logs, and X-Ray traces together: click a slow trace, jump to the Lambda’s log lines for that invocation; click a metric anomaly, see the trace that produced it. Not a piece to toggle so much as the lens the others are viewed through.

  10. OpenTelemetry and the AWS Distro for OpenTelemetry (ADOT). The industry standard. An ADOT Collector sidecar (or Lambda layer) receives OTLP spans from the application and forwards them to X-Ray or other backends. Useful when the rest of the company has standardised on OpenTelemetry and the AWS services still need to participate.
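The header format described in item 7 can be sketched as a small parser and formatter. This is an illustrative sketch, not the SDK's own implementation: the function names are hypothetical, only the Root, Parent, and Sampled fields named above are handled, and the IDs in the usage note are made up.

```javascript
// Parse "Root=...;Parent=...;Sampled=1" into a plain object, or null if no Root.
function parseTraceHeader(value) {
  const fields = {};
  for (const part of String(value).split(';')) {
    const eq = part.indexOf('=');
    if (eq === -1) continue; // ignore fragments that are not key=value
    fields[part.slice(0, eq).trim()] = part.slice(eq + 1).trim();
  }
  if (!fields.Root) return null;
  const sampled =
    fields.Sampled === '1' ? true : fields.Sampled === '0' ? false : null;
  return { root: fields.Root, parent: fields.Parent || null, sampled };
}

// Inverse of parseTraceHeader: optional fields are omitted when absent.
function formatTraceHeader({ root, parent, sampled }) {
  const parts = [`Root=${root}`];
  if (parent) parts.push(`Parent=${parent}`);
  if (sampled === true || sampled === false) parts.push(`Sampled=${sampled ? 1 : 0}`);
  return parts.join(';');
}

// Header for an outbound call: same root and sampling decision,
// with the current segment as the new parent.
function childHeader(incoming, currentSegmentId) {
  const ctx = parseTraceHeader(incoming);
  return ctx ? formatTraceHeader({ ...ctx, parent: currentSegmentId }) : null;
}
```

For example, childHeader('Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1', 'aaaabbbbccccdddd') keeps the root and the Sampled flag and swaps only the parent, which is the property that keeps the tree connected and the sampling decision consistent across hops.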

Side by side

Piece | End-to-end propagation | Per-segment timing | Auto AWS instrumentation | Outbound HTTP | Sampling control
X-Ray SDK (app) | | | | | honours rules
Lambda active tracing | | | | | honours rules
API Gateway tracing | | | | | first-hop sampler
Step Functions tracing | | | | | honours rules
AWS SDK instrumentation | | | | | inherits from parent
ADOT Collector | | | | | sampler per pipeline
Sampling rules | | | | |

Reading the table across the scenario rather than by piece: no single row covers every column. End-to-end propagation needs the front-door piece plus the Lambda runtime flag plus the SDK wrap. Outbound HTTP only appears via the application-side SDK or ADOT. Sampling is only a control when rules exist to apply. The picks aren’t one piece; they’re a stack of pieces switched on in the right order.

Matching pieces to hops

Front door (API Gateway → Lambda → DDB). GET /parcels/:id comes through the REST API (custom domain, WAF) into the parcel-api Lambda and its DynamoDB cache. Question: who stamps X-Amzn-Trace-Id? Pick: API Gateway tracing (the stage-level flag) + Lambda active tracing + captureAWS() for DynamoDB subsegments. This starts the trace at the edge.

Fan-out (Step Functions → Lambda → RDS Proxy). The parcel-fanout express execution and the warehouse-query Lambda with RDS Proxy; merges results and writes back to DDB; currently a black box in logs. Question: does the state machine emit segments? Pick: Step Functions tracing (TracingConfiguration on the state machine) + active tracing on each Lambda + SDK wrap for RDS Proxy calls. States become segments.

External hop (outbound HTTPS to the vendor). tracking-vendor.io, called from Lambda; the suspected source of the p99 climb, invisible without a wrapper. Question: is the HTTP client wrapped? Pick: captureHTTPs(https) wraps node https, adds the trace header to the outbound call, and records a vendor subsegment with URL, method, status, and duration. This proves or rules out the vendor.

Sampling rules (cross-cutting). A rule on parcel-api, decided at the edge and honoured by every downstream service, so trace volume stays manageable and sampled traces are whole.

Each hop answers a propagation question and an emission question. The pick that falls out the bottom is a layered stack, not a single switch, with sampling sitting across all of it.

The layered stack, in depth

API Gateway tracing starts the trace. Enable X-Ray on the stage: TracingEnabled: true in the CloudFormation stage resource, or tick the box in the console. The gateway now emits a segment for every request that matches the sampling rule and propagates the X-Amzn-Trace-Id header to the integration. That header, and nothing else, is what makes the rest of the trace tree possible; without it, every downstream segment becomes the root of its own disconnected tree.

Each Lambda gets active tracing. TracingConfig: { Mode: Active } on the function resource. The runtime emits a segment for every invocation, with cold-start duration called out separately, and reads the incoming trace header so the function segment lands under the API Gateway segment in the tree. Active tracing alone captures function boundaries; the details inside the function need the SDK.

Inside each Lambda, the X-Ray SDK. Three lines at the top of the handler file:

const AWSXRay = require('aws-xray-sdk-core');
const AWS = AWSXRay.captureAWS(require('aws-sdk'));
const https = AWSXRay.captureHTTPs(require('https'));

Every AWS call through the wrapped SDK creates a subsegment: DynamoDB.GetItem with the table name and key (sanitised), StepFunctions.StartExecution with the state machine ARN. SQL sent to PostgreSQL through RDS Proxy is not an AWS SDK call, so captureAWS() does not see it; it needs the SDK’s database-driver wrapper (capturePostgres for the pg client) to show up as a subsegment. Every HTTPS call through the wrapped client creates a subsegment with the URL, method, status code, and duration. Application-specific work worth measuring opens its own subsegment:

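// Inside the handler: the segment for the current invocation.
const segment = AWSXRay.getSegment();
const sub = segment.addNewSubsegment('merge-results');
try {
  // ... work ...
} catch (e) {
  sub.addError(e); // record the exception on the subsegment
  throw e;
} finally {
  sub.close(); // close on success and on failure
}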
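The open, record-error, close pattern above repeats for every piece of business logic worth timing, so it is a natural candidate for a helper. The sketch below is one way to write it; withSubsegment and fakeSegment are hypothetical names, and the fake stands in for AWSXRay.getSegment() so the example runs without the SDK installed. The SDK also ships its own helpers (captureAsyncFunc, for instance), which may be the better choice in real code.

```javascript
// Run `work` inside a named subsegment. The subsegment is always closed,
// and any thrown error is recorded on it before being rethrown.
async function withSubsegment(parent, name, work) {
  const sub = parent.addNewSubsegment(name);
  try {
    return await work(sub);
  } catch (err) {
    sub.addError(err);
    throw err;
  } finally {
    sub.close();
  }
}

// Stand-in for a real segment: records the calls made against it.
function fakeSegment() {
  const log = [];
  return {
    log,
    addNewSubsegment(name) {
      log.push(`open:${name}`);
      return {
        addError: (e) => log.push(`error:${e.message}`),
        close: () => log.push(`close:${name}`),
      };
    },
  };
}
```

In a handler this would read as await withSubsegment(AWSXRay.getSegment(), 'merge-results', () => mergeResults(a, b)), where mergeResults is whatever function does the work.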

Step Functions tracing. On the state machine, TracingConfiguration: { Enabled: true }. The service emits a segment for the execution and subsegments for each task; integrated-service tasks (like an SDK DynamoDB:PutItem task) appear as their own subsegments; Lambda tasks chain to the Lambda’s own segment when the function has active tracing. Without this flag, the fan-out inside Step Functions is a single opaque box in the trace.
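The three platform toggles named so far sit on three different resource types, which makes them easy to miss on a new service. As a sketch, here they are side by side as a CloudFormation fragment written as a JavaScript object, with a small check for resources that have tracing switched off. The logical IDs and the untracedResources helper are hypothetical, and the fragment omits every other required property (code, role, state machine definition, and so on).

```javascript
// Fragment only: just the tracing properties, nothing else a real template needs.
const tracingFragment = {
  Resources: {
    ParcelApiStage: {
      Type: 'AWS::ApiGateway::Stage',
      Properties: { TracingEnabled: true },
    },
    ParcelApiFunction: {
      Type: 'AWS::Lambda::Function',
      Properties: { TracingConfig: { Mode: 'Active' } },
    },
    ParcelFanoutStateMachine: {
      Type: 'AWS::StepFunctions::StateMachine',
      Properties: { TracingConfiguration: { Enabled: true } },
    },
  },
};

// Return the logical IDs of traceable resources whose tracing flag is not on.
function untracedResources(template) {
  const isTraced = {
    'AWS::ApiGateway::Stage': (p) => p.TracingEnabled === true,
    'AWS::Lambda::Function': (p) => p.TracingConfig?.Mode === 'Active',
    'AWS::StepFunctions::StateMachine': (p) => p.TracingConfiguration?.Enabled === true,
  };
  return Object.entries(template.Resources)
    .filter(([, r]) => isTraced[r.Type] && !isTraced[r.Type](r.Properties || {}))
    .map(([id]) => id);
}
```

A check like this could run in CI against the synthesised template, so a new function or state machine does not quietly become a hole in the trace.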

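Sampling rules. The default (1 trace per second plus 5% of the rest) is often fine. The scenario’s wish is “100% of 5xx responses plus 5% of everything else,” but sampling is decided at the start of the request, before any status code exists, so a rule cannot select on 5xx. What a rule can do is match by service name parcel-api (and by URL path or method) and set a reservoir and rate high enough that failing requests are reliably present in the sample. Rules are created via aws xray create-sampling-rule or the console, applied account-wide, and the decision made at the edge of the trace is honoured by every X-Ray-aware service downstream, so a sampled-in request is sampled at every hop.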
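To make the rule and its cost concrete, here is a sketch that builds a rule document in the shape the create-sampling-rule API accepts and estimates the resulting trace volume. The rule name, priority, reservoir, and rate are hypothetical values chosen for illustration, and the estimate is rough arithmetic (reservoir first, fixed rate on the remainder), not a model of how the service apportions the reservoir across instances. Note that every matching field describes the request; none describes the response.

```javascript
// Build a rule document, e.g. for `aws xray create-sampling-rule --cli-input-json`.
function buildSamplingRule({ name, serviceName, urlPath, fixedRate, reservoirSize, priority }) {
  return {
    SamplingRule: {
      RuleName: name,
      Priority: priority,           // lower numbers are evaluated first
      FixedRate: fixedRate,         // share of requests traced beyond the reservoir
      ReservoirSize: reservoirSize, // traces per second taken before the rate applies
      ServiceName: serviceName,
      ServiceType: '*',
      Host: '*',
      HTTPMethod: '*',
      URLPath: urlPath,
      ResourceARN: '*',
      Version: 1,
    },
  };
}

// Rough expected traces per second under one rule at a given request rate.
function expectedTracesPerSecond(rps, { fixedRate, reservoirSize }) {
  const reserved = Math.min(rps, reservoirSize);
  return reserved + (rps - reserved) * fixedRate;
}

// Hypothetical rule for the scenario's service.
const parcelApiRule = buildSamplingRule({
  name: 'parcel-api-parcels',
  serviceName: 'parcel-api',
  urlPath: '/parcels/*',
  fixedRate: 0.05,
  reservoirSize: 10,
  priority: 100,
});
```

With the default rule (reservoir 1, rate 0.05) at 3,000 RPS, expectedTracesPerSecond gives roughly 151 traces per second, which is the number to hold against storage and indexing cost before raising any rate.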

Second-order concerns worth naming. The SDK is CPU-light but not free; on Lambda it adds a few milliseconds to cold-start for the SDK patch, which is usually lost in noise but shows up if handler durations are already microsecond-tight. Segment documents travel to the daemon over UDP and can be dropped under extreme load, so losing a subsegment is always possible and the trace should be read as best-effort. And field limits are real: annotations are indexed and searchable (X-Ray indexes up to 50 per trace); metadata is arbitrary but not searchable. Putting a customer ID in annotations lets ServiceLens filter by it; putting it in metadata doesn’t.
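The annotation-versus-metadata split is easy to get wrong by habit, so a small routing helper can make the choice explicit. This is a sketch under stated assumptions: tagSubsegment and isAnnotatable are hypothetical names, and the rules encoded here (annotation keys limited to letters, digits, and underscore; values limited to strings, numbers, and booleans) reflect X-Ray's documented constraints as best I understand them. addAnnotation and addMetadata are the SDK's subsegment methods.

```javascript
// True when a key/value pair is usable as an indexed, searchable annotation.
function isAnnotatable(key, value) {
  return /^[A-Za-z0-9_]+$/.test(key) &&
    ['string', 'number', 'boolean'].includes(typeof value);
}

// Send the fields the team wants to search on to annotations;
// everything else (or anything that cannot be annotated) goes to metadata.
function tagSubsegment(sub, fields, searchableKeys) {
  for (const [key, value] of Object.entries(fields)) {
    if (searchableKeys.includes(key) && isAnnotatable(key, value)) {
      sub.addAnnotation(key, value);
    } else {
      sub.addMetadata(key, value);
    }
  }
}
```

Keeping the searchable list short and deliberate (a customer ID, a parcel ID) also helps stay inside the annotation limit.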

A worked example: one trace, one incident

At 09:14 the p99 alarm fires. The on-call engineer opens ServiceLens, sees the parcel-api service map with a node glowing red, and clicks through to the traces.

The first slow trace (1.84s): eight segments, one of which is a 1,190ms HTTPS subsegment to tracking-vendor.io. The Lambda that made the call has an execution time of 1,200ms of its own. Every other segment in the trace (API Gateway, the first Lambda, DynamoDB, Step Functions’ state transitions, RDS Proxy) is normal.

The engineer filters the traces to the last hour, groups by subsegment name, and sees https://tracking-vendor.io/v2/parcels with a p99 of 1.1s and a median of 110ms. Same vendor, same endpoint, 10x latency on a subset of requests. Second trace confirms; third trace confirms; the hypothesis has evidence.

The engineer opens an incident with the vendor and, in parallel, ships a change to the Lambda: a 400ms timeout on the HTTPS client, a fallback to a stale cache read from DynamoDB with a response header indicating the stale read. The next deploy cuts p99 from 1.8s to 450ms even while the vendor is still slow, because the timeout fires before the vendor does. A vendor_stale counter metric tells the team how often they’re falling back.
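The timeout-plus-fallback change can be sketched in a few lines. This is an illustration of the pattern, not the team's actual code: fetchWithStaleFallback, fetchVendor, and readStaleCache are hypothetical names, the two callbacks stand in for the wrapped HTTPS call and the DynamoDB read, and this version falls back on any vendor failure, not only on a timeout.

```javascript
// Resolve with the vendor result if it arrives within timeoutMs;
// otherwise (timeout or vendor error) resolve with the stale cached copy.
async function fetchWithStaleFallback(fetchVendor, readStaleCache, timeoutMs = 400) {
  const controller = new AbortController();
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => {
      controller.abort(); // let the vendor call cancel itself if it honours the signal
      reject(new Error('vendor timeout'));
    }, timeoutMs);
  });
  try {
    const data = await Promise.race([fetchVendor(controller.signal), timeout]);
    return { data, stale: false };
  } catch (err) {
    return { data: await readStaleCache(), stale: true };
  } finally {
    clearTimeout(timer);
  }
}
```

The caller would use the stale flag to set the response header and increment the vendor_stale counter, so the fallback rate stays visible while the vendor incident is open.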

Without the trace, the conversation would have been “our latency is up, we think it’s the vendor, but we’re not sure.” With the trace, it was “this subsegment is slow in 60% of requests, and here are twelve example trace IDs to forward to the vendor’s support team.” Ten minutes versus three hours.

What’s worth remembering

  1. X-Ray is propagation plus emission. Propagation is the X-Amzn-Trace-Id header flowing from front to back; emission is each service writing its segments. Both have to be on for a trace to be whole.
  2. API Gateway tracing starts the trace. Without the front-door flag, every downstream segment is the root of its own disconnected tree and nothing correlates.
  3. Lambda active tracing emits the function segment. Cold-start is called out separately. Active tracing plus the SDK gives the function boundary and its internals.
  4. aws-xray-sdk patches the AWS SDK and HTTP clients. captureAWS() for AWS calls, captureHTTPs() or captureHTTP() for outbound HTTP. Every external call becomes a subsegment automatically.
  5. Step Functions tracing is a state-machine flag. With it on, executions and states appear in the trace; without it, the state machine is a black box in the middle of the graph.
  6. Custom subsegments measure business logic. getSegment().addNewSubsegment('name') opens a span; close it with success or attach an error on failure; the subsegment shows up nested under the function segment.
  7. Sampling rules keep cost sane. 1 per second plus 5% is the default; add rules to trace proportionally more of high-priority endpoints. The decision is made once at the edge, before the response status is known, and honoured everywhere.
  8. ServiceLens is the cross-signal view. Metrics, logs, and traces in one place. Click a slow trace, jump to the Lambda log line for that invocation, then to the CloudWatch metric anomaly that fired the alarm.
  9. ADOT is the OpenTelemetry neighbour. When the rest of the company is on OTel, ADOT runs alongside X-Ray. OTLP spans in, X-Ray backend out, same traces with a portable wire format.
  10. The trace answers “where is the time?” not “why.” The why comes from logs, metrics, vendor documentation, and the change history. The trace just points at the segment to investigate.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.