The situation
A checkout request today traverses:
- ALB in the edge account -> API Gateway.
- Lambda orchestrator in the checkout account.
- Call to the pricing service (Fargate in the pricing account, via PrivateLink).
- Call to the inventory service (ECS in the inventory account).
- Call to the payments service (Fargate in the payments account) which calls Stripe.
- Back to the orchestrator, which writes to orders (DynamoDB in the orders account) and emits an SNS event.
- Notifications picks up the SNS event and emails the customer.
Five accounts, seven steps, p99 latency 2 seconds, but a handful of 10-second tails a day with no obvious culprit. Every team’s logs show normal response times in their service. Somewhere, a lock is contending or a retry storm is firing, and no one can see it end-to-end.
Traces exist per service (each team has some tracing), but the traces don’t link up. The orchestrator’s trace ends at “called pricing, got a response”; pricing’s trace shows a request with a different trace ID. No single pane of glass.
What actually matters
The core trade in distributed tracing is instrumentation cost in exchange for visibility. Every service that wants to participate in a trace must emit spans, with a shared trace ID passed through request headers, to a backend that can stitch them together. Across accounts, this has extra friction: each account’s tracing data tends to land locally, and without cross-account sharing, no single view exists.
The first thing to ask is: does a single trace ID propagate end-to-end? If a downstream service drops the incoming trace context and starts a new one, the link is broken. The fix is HTTP-header-based context propagation, with every service passing the incoming trace header to its outbound calls. Which header format to use (the AWS-native one, W3C Trace Context, or both) is a per-SDK decision, but consistency matters more than the choice.
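To illustrate why the two header formats can coexist as long as every hop treats them consistently, here is a minimal sketch that converts between the AWS-native X-Amzn-Trace-Id value and a W3C traceparent value. It assumes the trace ID was generated in X-Ray’s format (8 hex digits of epoch seconds followed by 24 random hex digits), which lines up with the 32 hex digits of a W3C trace ID. The function names are illustrative and not part of any SDK.

```python
import re

# X-Ray root ID: version "1", 8 hex digits of epoch seconds, 24 random hex digits.
_XRAY_ROOT = re.compile(r"^1-([0-9a-f]{8})-([0-9a-f]{24})$")
# W3C traceparent, version 00: trace-id, parent-id, trace-flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def xray_to_traceparent(root: str, parent_id: str, sampled: bool) -> str:
    """Build a W3C traceparent value from X-Ray header fields."""
    match = _XRAY_ROOT.match(root)
    if match is None:
        raise ValueError(f"not an X-Ray root ID: {root!r}")
    trace_id = match.group(1) + match.group(2)
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{parent_id}-{flags}"


def traceparent_to_xray(traceparent: str) -> str:
    """Build an X-Amzn-Trace-Id value from a W3C traceparent value."""
    match = _TRACEPARENT.match(traceparent)
    if match is None:
        raise ValueError(f"not a version-00 traceparent: {traceparent!r}")
    trace_id, parent_id, flags = match.groups()
    sampled = int(flags, 16) & 1  # lowest flag bit is the "sampled" bit
    return f"Root=1-{trace_id[:8]}-{trace_id[8:]};Parent={parent_id};Sampled={sampled}"
```

The conversion is lossless in both directions only under the stated assumption about how the ID was generated, which is one reason to settle on a single ID generator across all services.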
The second is: which tracing backend? An AWS-native managed service is one direction; OpenTelemetry feeding a vendor or self-hosted backend is the open-standard alternative. The two aren’t mutually exclusive, since the AWS-native service can ingest OTel data via a vendor-neutral collector, but committing to one as the “system of record” is the decision that shapes everything else.
The third is: how do traces cross account boundaries? Either the backend has a native cross-account observability feature, in which case a central monitoring account sees every linked source account without copying data, or traces get exported through a pipeline to a central destination. Native is cheaper to operate; the pipeline gives more control.
The fourth is: sampling. Tracing every request is expensive. Defaults are typically a small fixed rate plus a percentage of the remainder, configurable per service. For debugging specific problems, temporary rules can sample 100% of requests matching a pattern.
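The “fixed rate plus a percentage of the remainder” behaviour can be sketched as a per-second reservoir followed by a probabilistic draw. This is a simplified illustration of the idea, not the implementation used by any particular SDK; the class name and the injectable clock and random source are assumptions made so the behaviour is easy to test.

```python
import random
import time


class ReservoirSampler:
    """Sample the first `fixed_target` requests each second, then `rate` of the rest."""

    def __init__(self, fixed_target, rate, clock=time.monotonic, rng=random.random):
        self.fixed_target = fixed_target
        self.rate = rate
        self._clock = clock
        self._rng = rng
        self._window = None  # the whole second currently being counted
        self._used = 0       # reservoir slots consumed in that second

    def should_sample(self) -> bool:
        current_second = int(self._clock())
        if current_second != self._window:
            # A new second has started: refill the reservoir.
            self._window = current_second
            self._used = 0
        if self._used < self.fixed_target:
            self._used += 1
            return True
        # Reservoir exhausted: fall back to the fixed percentage.
        return self._rng() < self.rate
```

With fixed_target=1 and rate=0.05, a quiet service has roughly one request per second traced, while a busy one has one per second plus about 5% of everything else.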
The fifth is: the visualisation layer on top. Raw traces are useful; traces plus metrics plus logs with a service-map overlay is better. Whether that overlay comes from the same vendor as the trace backend or from a third-party APM is a separate question from where the spans live.
What we’ll filter on
- Trace-ID propagation: does one ID follow the request end-to-end?
- Cross-account visibility: can a single dashboard show spans from five accounts?
- Service-map completeness: does the tool render a complete dependency graph?
- Sampling flexibility: can we dial up sampling for specific endpoints temporarily?
- Cost profile: per-trace, per-GB, per-service?
The tracing landscape
- AWS X-Ray with CloudWatch cross-account observability. Every service instruments with the X-Ray SDK or ADOT agent. Traces are sent to the local account’s X-Ray service. CloudWatch cross-account observability makes the central observability account a monitoring “root” that sees X-Ray, CloudWatch metrics, logs, and synthetics from every linked source account. One account to log into, one service map, one trace search.
- OpenTelemetry + self-hosted Jaeger/Tempo. ADOT Collector in each account ships traces to a central OpenSearch or Tempo backend. Open-standard instrumentation; vendor-neutral. More operational overhead (the team runs the backend and scales the OpenSearch clusters).
- OpenTelemetry + third-party APM. ADOT Collector ships to Datadog/New Relic/Honeycomb/Lightstep. Commercial product handles the backend. Cleanest experience, largest monthly bill.
- Custom log-based tracing. Request ID in every log line, Athena queries to stitch logs together. Cheap; works at small scale; falls over when services are in five accounts and logs are in different destinations.
- AWS ServiceLens. Sits on top of X-Ray + CloudWatch to render a service map with metric overlays (error rate, latency, throughput per node). Not a backend in its own right; a visualisation layer.
- AWS CloudWatch Synthetics + X-Ray. Synthetic canaries generate traces for known customer journeys. Good for SLO monitoring on the critical paths. Complementary to service-generated traces, not a replacement.
Side by side
| Option | Trace propagation | Cross-account | Service map | Sampling | Cost |
|---|---|---|---|---|---|
| X-Ray + CWO | Header-based | Native via linking | ServiceLens | Sampling rules | Per trace + CWO free |
| OTel + Jaeger/Tempo | W3C Trace Context | Collector-based | Via Tempo/Jaeger UI | Per-SDK | Backend + storage |
| OTel + third-party APM | W3C Trace Context | Collector-based | Vendor | Per-SDK | Vendor |
| Log-based | Request ID in logs | Log aggregation | None | Every request logs | Log costs |
| ServiceLens | Relies on X-Ray | Via CWO | ✓ | Via X-Ray | X-Ray costs |
| Synthetics + X-Ray | Canary-generated | Via CWO | ServiceLens | Per canary schedule | Per canary |
For an organisation already deep in AWS with five accounts and a preference for native tooling, X-Ray + CloudWatch cross-account observability + ServiceLens is the fit. For an open-source-leaning team, OTel + a backend is the alternative. The rest of the post goes deep on the AWS-native story.
The cross-account tracing architecture
The picks in depth
CloudWatch cross-account observability. Configure once: the observability account holds a monitoring sink; each of the five source accounts creates a link to that sink. The sink is created in the observability account and has a policy naming the source accounts allowed to link; each source account creates a link resource referencing the sink and selecting which telemetry types to share (metrics, logs, traces).
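# In the observability account: create the sink, then attach a policy naming
# the source accounts allowed to link (sink-policy.json is a placeholder file).
aws oam create-sink --name primary-sink --tags Team=observability
aws oam put-sink-policy \
  --sink-identifier arn:aws:oam:eu-west-1:OBS:sink/aaaa... \
  --policy file://sink-policy.json
# In each source account: link to the sink and choose what to share.
aws oam create-link \
  --label-template '$AccountName' \
  --resource-types AWS::CloudWatch::Metric AWS::Logs::LogGroup AWS::XRay::Trace \
  --sink-identifier arn:aws:oam:eu-west-1:OBS:sink/aaaa...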
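The sink policy that names the permitted source accounts is an ordinary IAM-style JSON document. The sketch below generates one from a list of account IDs. The statement shape follows the pattern shown in the CloudWatch cross-account observability documentation as best I recall it, so check it against the current docs before use; the account IDs and the function name are placeholders.

```python
import json


def build_sink_policy(source_account_ids, resource_types):
    """Return a sink policy allowing the given accounts to link the given telemetry types."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": sorted(source_account_ids)},
                "Action": ["oam:CreateLink", "oam:UpdateLink"],
                "Resource": "*",
                "Condition": {
                    # Restrict links to the telemetry types we intend to share.
                    "ForAllValues:StringEquals": {
                        "oam:ResourceTypes": list(resource_types)
                    }
                },
            }
        ],
    }


if __name__ == "__main__":
    policy = build_sink_policy(
        ["111111111111", "222222222222"],  # placeholder account IDs
        ["AWS::CloudWatch::Metric", "AWS::Logs::LogGroup", "AWS::XRay::Trace"],
    )
    print(json.dumps(policy, indent=2))
```

Generating the policy from a list keeps the set of allowed source accounts in one reviewable place when a new account joins.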
After the links are up, engineers logged into the observability account see:
- ServiceLens shows a single service map spanning all five accounts.
- X-Ray Trace Search can query any trace ID and return the full trace regardless of which account’s services emitted the spans.
- CloudWatch Metrics can query across accounts (AWS/ApplicationELB/RequestCount summed across all account ALBs).
- Logs Insights can query log groups across accounts.
X-Ray SDK or ADOT instrumentation. Each service imports the X-Ray SDK (for direct X-Ray) or ADOT (for OpenTelemetry with X-Ray backend). For Lambda, active tracing is a one-click toggle: the Lambda service then records a segment for every invocation, while calls made from the handler still need the SDK or ADOT layer to show up as subsegments. For ECS/Fargate, the task definition adds an X-Ray daemon or ADOT Collector as a sidecar; the service’s code wraps HTTP clients and AWS SDK clients with X-Ray middleware.
Key integration: AWS SDK calls are auto-traced when the SDK is patched (aws-xray-sdk wraps the SDK client). The same goes for common HTTP libraries (requests, httpx, aiohttp). Custom code adds manual subsegments for the parts that matter: the expensive computation, the database query, the external API call.
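To make “manual subsegment” concrete, here is a standard-library sketch of what such a wrapper records: a name, an ID, the parent it hangs under, start and end times, and a fault flag if the wrapped code raises. It is not the X-Ray SDK; the field names loosely follow X-Ray’s segment document conventions, and the `sink` list stands in for whatever ships documents to the daemon or collector.

```python
import contextlib
import secrets
import time


@contextlib.contextmanager
def subsegment(name, trace_id, parent_id, sink):
    """Time a block of code and append a subsegment-like document to `sink`."""
    doc = {
        "name": name,
        "id": secrets.token_hex(8),  # 16 hex digits
        "trace_id": trace_id,
        "parent_id": parent_id,
        "type": "subsegment",
        "start_time": time.time(),
    }
    try:
        yield doc
    except Exception:
        doc["fault"] = True  # mark the failure, then let it propagate
        raise
    finally:
        doc["end_time"] = time.time()
        sink.append(doc)


if __name__ == "__main__":
    emitted = []
    with subsegment("price-calculation", "1-5759e988-bd862e3fe1be46a994272793",
                    "53995c3f42cd8ad8", emitted):
        sum(i * i for i in range(100_000))  # stand-in for the expensive work
    print(emitted[0]["name"], emitted[0]["end_time"] - emitted[0]["start_time"])
```

In real services the SDK’s own subsegment helper does this bookkeeping; the point of the sketch is only that each wrapped block becomes one timed, named child in the trace.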
Header propagation. X-Ray uses X-Amzn-Trace-Id. Every inbound handler reads the header; every outbound call sets it. The SDKs do this automatically, but wherever custom networking bypasses the SDK (raw sockets, non-standard HTTP clients), the team has to pass the header manually. A common failure mode is “the trace ID ends at the gRPC boundary” because the gRPC client wasn’t patched.
Checkout’s call to pricing uses PrivateLink; the header propagates transparently because PrivateLink operates at the TCP layer and never touches HTTP. SNS-delivered messages carry the trace ID as a message attribute; SQS queues pass it through as a system attribute (AWSTraceHeader). Lambda, when triggered by SQS or SNS, reads the attribute and continues the trace.
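Where a client bypasses the SDK, the manual work amounts to reading the inbound header, keeping its Root, and replacing Parent with the ID of the segment making the outbound call. A minimal sketch, with illustrative function names:

```python
def parse_trace_header(value: str) -> dict:
    """Split 'Root=...;Parent=...;Sampled=...' into a dict of fields."""
    fields = {}
    for part in value.split(";"):
        key, sep, val = part.strip().partition("=")
        if sep:  # ignore fragments without '='
            fields[key] = val
    return fields


def outbound_trace_header(inbound: str, current_segment_id: str) -> str:
    """Keep the inbound Root and Sampled flag; point Parent at the calling segment."""
    fields = parse_trace_header(inbound)
    root = fields.get("Root")
    if root is None:
        raise ValueError("inbound header has no Root; the trace would restart here")
    parts = [f"Root={root}", f"Parent={current_segment_id}"]
    if "Sampled" in fields:
        parts.append(f"Sampled={fields['Sampled']}")
    return ";".join(parts)


if __name__ == "__main__":
    inbound = "Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1"
    # The result goes into the X-Amzn-Trace-Id header of the outbound request.
    print(outbound_trace_header(inbound, "70de5b6f19ff9a0a"))
```

The same value can be attached as metadata on a gRPC call or as an attribute on a custom queue message; what matters is that Root is never regenerated mid-request.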
Sampling rules. X-Ray’s default sampler is 1 req/sec plus 5% of the remainder. Custom rules overlay on top. In the version 2 local rules format shown here, a rule matches on host (the hostname from the request), HTTP method, and URL path; service_name is the version 1 name for the host key. The sampling decision is made when the request arrives, before any status code exists, so a rule cannot select on 5xx responses; errors are captured at whatever rate their path is sampled:
{
  "version": 2,
  "rules": [
    {
      "description": "Sample 100% of checkout critical path",
      "host": "checkout-orchestrator",
      "http_method": "POST",
      "url_path": "/v1/place-order",
      "fixed_target": 100,
      "rate": 1.0
    }
  ],
  "default": { "fixed_target": 1, "rate": 0.05 }
}
Temporary rules for debugging (sample 100% of requests with X-Debug: true) let the team force high-sampling for specific test traffic without paying for high-sampling in production.
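How a rules document gets applied can be sketched as “first matching rule wins, otherwise the default”. The code below is an approximation for illustration: it uses Python’s fnmatch for the wildcard matching, which is slightly more permissive than X-Ray’s `*` and `?` patterns, and it accepts either the version 2 `host` key or the version 1 `service_name` key. The function name is an assumption.

```python
from fnmatch import fnmatchcase


def pick_rule(rules_doc, host, method, path):
    """Return the first rule matching the request, or the document's default."""
    for rule in rules_doc.get("rules", []):
        host_pattern = rule.get("host", rule.get("service_name", "*"))
        if (
            fnmatchcase(host, host_pattern)
            and fnmatchcase(method.upper(), rule.get("http_method", "*").upper())
            and fnmatchcase(path, rule.get("url_path", "*"))
        ):
            return rule
    return rules_doc["default"]


if __name__ == "__main__":
    rules_doc = {
        "version": 2,
        "rules": [
            {
                "description": "Sample 100% of checkout critical path",
                "host": "checkout-orchestrator",
                "http_method": "POST",
                "url_path": "/v1/place-order",
                "fixed_target": 100,
                "rate": 1.0,
            }
        ],
        "default": {"fixed_target": 1, "rate": 0.05},
    }
    print(pick_rule(rules_doc, "checkout-orchestrator", "POST", "/v1/place-order")["rate"])
    print(pick_rule(rules_doc, "checkout-orchestrator", "GET", "/health")["rate"])
```

Because order decides the outcome, a temporary debugging rule is simply an extra entry placed ahead of the others and removed once the investigation is over.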
ServiceLens for the service map. ServiceLens uses X-Ray’s dependency data plus CloudWatch metrics to render the service map, coloured by health (green/yellow/red) based on error rate and latency. Each node shows request rate, p50/p95/p99 latency, error count. Click a node to drill into its X-Ray traces; click an edge to see the call volume between two services.
The workflow for the 10-second tail. Engineer opens the observability account, loads ServiceLens, filters to checkout service, duration > 8 seconds. ServiceLens shows a dozen matching traces in the last hour. Clicks one: the trace waterfall shows 200ms in the orchestrator, 150ms in pricing, 150ms in inventory, 8500ms in payments in a single subsegment labelled “Stripe API call”. Root cause: Stripe’s p99 goes long when their downstream bank rail is slow, and our code doesn’t have a timeout short enough to fail fast.
Ten minutes from pager to diagnosis; before cross-account tracing, the same investigation was two days of cross-team emails.
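The reading of the waterfall is simple arithmetic once the spans sit side by side. A small sketch using the durations from the trace above; the dictionary layout is an assumption, and the share calculation assumes the spans ran sequentially rather than overlapping.

```python
def slowest_span(spans):
    """Return the span with the largest duration."""
    return max(spans, key=lambda span: span["duration_ms"])


def share_of_total(span, spans):
    """Fraction of the summed span time spent in `span` (assumes no overlap)."""
    total = sum(s["duration_ms"] for s in spans)
    return span["duration_ms"] / total if total else 0.0


# Durations taken from the slow trace described above.
WATERFALL = [
    {"service": "orchestrator", "name": "orchestrate order", "duration_ms": 200},
    {"service": "pricing", "name": "price", "duration_ms": 150},
    {"service": "inventory", "name": "reserve", "duration_ms": 150},
    {"service": "payments", "name": "Stripe API call", "duration_ms": 8500},
]

if __name__ == "__main__":
    worst = slowest_span(WATERFALL)
    print(worst["service"], worst["name"], f"{share_of_total(worst, WATERFALL):.0%}")
```

With these numbers the payments subsegment accounts for 8,500 of 9,000 ms, which is why the diagnosis takes minutes once every span appears in one view.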
A worked cross-account trace
A customer places an order; the orchestrator’s POST /v1/place-order begins:
- edge account: API Gateway receives the request, X-Ray generates trace ID 1-abc..., forwards to Lambda orchestrator.
- checkout account: Lambda orchestrator runs with the trace context. Starts a subsegment “orchestrate order”. Makes HTTP call to pricing with X-Amzn-Trace-Id: Root=1-abc...;Parent=<checkout-subseg-id>.
- pricing account: Fargate receives the call, X-Ray SDK creates a new segment under the same trace. Computes price, returns.
- Same sequence for inventory and payments.
- payments account: Fargate calls Stripe. X-Ray SDK wraps the Stripe client, creates a subsegment with http.url=stripe.com/v1/charges. Stripe returns in 800ms.
- checkout account: orchestrator writes to DynamoDB in orders account (cross-account role assumption) and publishes to SNS.
- notifications account: Lambda subscribes to SNS; SNS propagates the trace context via message attributes; Lambda extends the trace with a send-email subsegment.
In the observability account, one X-Ray trace shows all seven segments, all seven subsegments, and the total wall-clock time with each service’s contribution visible.
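What “one trace” means mechanically is that segments from different accounts share a trace ID and point at each other through parent IDs. The sketch below stitches a flat list of segment records into an indented tree. The record layout (including the `account` field) is hypothetical and much simpler than real X-Ray segment documents.

```python
from collections import defaultdict


def render_trace(segments, trace_id):
    """Return indented 'account: name' lines for one trace, parents before children."""
    children = defaultdict(list)
    for seg in segments:
        if seg["trace_id"] == trace_id:
            children[seg.get("parent_id")].append(seg)

    lines = []

    def walk(parent_id, depth):
        # Visit siblings in start order so the output reads like a waterfall.
        for seg in sorted(children.get(parent_id, []), key=lambda s: s["start_time"]):
            lines.append("  " * depth + f"{seg['account']}: {seg['name']}")
            walk(seg["id"], depth + 1)

    walk(None, 0)  # roots are the segments with no parent
    return lines


if __name__ == "__main__":
    segments = [
        {"trace_id": "1-abc", "id": "a", "parent_id": None, "account": "edge",
         "name": "api-gateway", "start_time": 0.00},
        {"trace_id": "1-abc", "id": "b", "parent_id": "a", "account": "checkout",
         "name": "orchestrator", "start_time": 0.01},
        {"trace_id": "1-abc", "id": "c", "parent_id": "b", "account": "pricing",
         "name": "pricing", "start_time": 0.02},
    ]
    print("\n".join(render_trace(segments, "1-abc")))
```

A segment whose parent never arrived is unreachable from the root in this walk, which is what a dropped trace header looks like from the backend’s side.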
A worked SLO query
SLO: checkout p99 under 2 seconds.
- ServiceLens shows checkout p99 at 1.8s, green.
- Dashboard widget: X-Ray’s “service latency by service” across last 7 days, all accounts.
- Logs Insights over centralised logs, filtered by checkout service, for the last hour, answers “why did these specific requests exceed 2s?” by joining with trace IDs.
- CloudWatch alarm on the p99 metric, targets the observability account’s SNS topic which fans out to the on-call.
One account to look at; five accounts’ data flows in.
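The SLO check itself reduces to a percentile over request durations plus a list of the trace IDs that exceeded the threshold. A minimal sketch using the nearest-rank percentile definition; CloudWatch computes its percentile statistics in its own way, so treat this as an illustration of the logic rather than a reproduction of the metric. Function names and the input layout are assumptions.

```python
def percentile(values, pct):
    """Nearest-rank percentile; `pct` is an integer from 1 to 100."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # ceiling of pct/100 * n, in integers
    return ordered[max(rank, 1) - 1]


def slo_breaches(duration_by_trace_id, threshold_s=2.0):
    """Trace IDs whose duration exceeded the SLO threshold, sorted for stable output."""
    return sorted(tid for tid, secs in duration_by_trace_id.items() if secs > threshold_s)


if __name__ == "__main__":
    durations = {"1-aaa": 0.4, "1-bbb": 1.8, "1-ccc": 9.0, "1-ddd": 0.6}
    p99 = percentile(list(durations.values()), 99)
    print("p99:", p99, "SLO met:", p99 < 2.0)
    print("look up these traces:", slo_breaches(durations))
```

The breaching trace IDs are the join key between the alarm and the trace search: each one opens a full cross-account waterfall.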
What’s worth remembering
- Trace-ID propagation is the foundation. If the ID doesn’t survive across every boundary, the trace breaks. Patch every SDK client; propagate the header at every inbound/outbound point.
- CloudWatch cross-account observability is free. The linking itself adds no charge for metrics and logs; you pay for the underlying X-Ray, CloudWatch Logs, and metrics in the source accounts. Enable it everywhere; the link carries no ongoing cost of its own.
- ServiceLens renders the unified service map. Auto-generated from X-Ray dependencies; coloured by health; click-through to traces. The visualisation the on-call actually uses.
- Sampling rules are dial-able. Default modest sampling; elevate for critical paths; elevate temporarily for debugging. Don’t trace 100% of everything; the bill is real.
- ADOT is the OpenTelemetry path on AWS. Same instrumentation story, OTel-compatible. Ships to X-Ray or a third party. If the team prefers OTel standards, ADOT doesn’t lock them out of AWS tooling.
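- Lambda active tracing is a checkbox. No SDK to import for the invocation segments; the Lambda service emits them. Calls made from the handler still need the SDK or ADOT layer to appear in the trace. Free telemetry for the effort of a CLI flag.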
- Async paths need help. SNS, SQS, EventBridge each have their own trace-context propagation. X-Ray handles the common ones natively; custom queues need explicit header passing.
- Aggregate in an account nobody deploys to. The observability account holds dashboards and sinks, not workloads. Clean boundary; easy IAM.
Five accounts, one trace, one dashboard. The 10-second tail stops being a mystery; the team stops arguing about whose service is slow. The request lives a life across the organisation; distributed tracing is the only thing that lets us follow it.