How to Catch Customer-Facing Outages With Synthetics Canaries

September 08, 2027 · 17 min read

CloudOps Engineer · SOA-C03 · part of The Exam Room

The situation

Acme’s checkout flow is five HTTP hops and two asynchronous waits:

  1. GET /cart: render the cart from the cart service.
  2. POST /checkout: submit the order, which writes a pending order to Postgres and publishes an order.created event.
  3. A Lambda consumer picks up the event, calls Stripe, and waits for the charge webhook.
  4. On webhook success, a second Lambda writes the order as paid.
  5. GET /orders/:id: the browser polls for status until it sees paid.

Three outages in twelve months, each of a different shape:

  • Outage A: a database migration changed a column from nullable to NOT NULL. The cart service kept working; the order-writer crashed on every event. Events backed up in SQS. The order-writer Lambda looked healthy from the outside: invocations were happening, but every single one was throwing. The dead-letter queue was filling, and nobody was looking at it.
  • Outage B: a Stripe configuration change on their side caused the webhook to arrive with a new header shape, which the signature-verifying Lambda rejected. Charges happened; webhooks were ignored; orders sat in pending forever. The cart-service metrics were fine; the Lambda metrics were fine; the Stripe dashboard was fine; the customer’s browser polled /orders/:id for five minutes and gave up.
  • Outage C: the new JavaScript bundle had a race condition that only appeared on a cold page load on Safari. The page loaded, the “Pay” button rendered, but the click handler wasn’t bound for 3-4 seconds. No server-side metric could see this.

Three outages, one pattern: the component-level metrics were fine, and the user-visible journey was broken. The existing alarms were evaluating the wrong contract.

What actually matters

A synthetic monitor is a scripted, scheduled, automated agent that pretends to be a user. It runs the journey on a schedule (for the browser case, literally, in a headless browser), records whether it succeeded, how long it took, and what it saw along the way, and emits a metric, plus an optional screenshot and trace file when it fails.

The shift in thinking matters. Component monitoring says “is the CPU healthy, is the database up, is the error rate below threshold?” Synthetic monitoring says “does the contract the user cares about still hold end-to-end?” The two are complementary, not competitive: you want CPU alarms because they help diagnose; you want synthetic alarms because they tell you there’s something to diagnose in the first place.

A few design decisions follow.

What contract is the canary actually verifying? A bad canary clicks through a page and asserts “the DOM has a button.” A good canary asserts “add-to-cart succeeded, checkout completed, order eventually showed as paid.” The canary’s assertions are the service-level indicator. Be specific about what failure means.

How often should it run? More frequently means lower detection latency and higher cost. Per-minute cadence catches outages within a minute and pays for it in invocations; five- or fifteen-minute cadence is a sensible default for journeys whose business value doesn’t warrant the more aggressive shape. Pick the cadence per journey, not estate-wide.

What inputs does it need? Credentials (a canary-specific test user, payment-provider sandbox keys), a test environment or production data that’s safe to transact against, and an idempotency key so re-runs don’t pile up test orders. Secrets belong in a managed secret store; the canary’s IAM role is granted read access only to its own secrets.
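
One way to keep re-runs from piling up test orders is to derive the idempotency key from the canary name and the schedule window instead of generating a random one. This is a minimal sketch, assuming a five-minute schedule; the function name and key format are hypothetical and not part of the Synthetics library.

```javascript
// Hypothetical helper: derive a stable idempotency key per schedule window.
// Two runs inside the same window (for example a retry) produce the same key,
// so the backend can de-duplicate the test order; the next window gets a new key.
function idempotencyKey(canaryName, nowMs, intervalMinutes) {
  const windowMs = intervalMinutes * 60 * 1000;
  // Floor the timestamp to the start of its schedule window.
  const windowStart = Math.floor(nowMs / windowMs) * windowMs;
  return `${canaryName}-${new Date(windowStart).toISOString()}`;
}

// Two calls 90 seconds apart, inside the same 5-minute window.
const first = idempotencyKey('checkout-ui', Date.UTC(2027, 4, 25, 14, 17, 30), 5);
const retry = idempotencyKey('checkout-ui', Date.UTC(2027, 4, 25, 14, 19, 0), 5);
console.log(first === retry); // true
```

The trade-off is that a canary which legitimately needs two distinct orders inside one window would need something extra in the key, such as a step name.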

What does failure output? Screenshots, network traces, console logs, source maps for the specific bundle version, the network request that failed and its response. The artefacts are the difference between “the canary went red” and “the canary went red because /api/pay returned 502 at step 4, 1.8 seconds in, and here’s the request.”

Where does the canary run from? For a journey where geographic diversity matters (a global CDN, a payment provider with regional endpoints), running the same canary from multiple Regions catches Region-local failures that a single-Region canary misses.

How does it become a pager alert? The canary emits a success metric. A metric alarm with hysteresis (“two consecutive failures, not just one”) wired to the existing notification pipeline is the usual shape. One failed run is noise; two failed in a row is a reason to wake someone up.

What we’ll filter on

  1. Full-journey vs single-endpoint: does the test exercise the whole flow or one HTTP call?
  2. Browser-aware: does it render JavaScript, handle cookies, and see client-side race conditions?
  3. Scheduled and managed: does AWS run the scheduler, or is that our problem?
  4. Artefact capture on failure: screenshots, HAR, console logs?
  5. Multi-Region runs: can the same canary execute from several Regions?
  6. Native metric emission: does it emit to CloudWatch without custom plumbing?

The synthetic monitoring landscape

  1. CloudWatch Synthetics canaries (heartbeat). A simple HTTPS GET with optional assertions on status code, body, or headers. Scheduled, managed, metric-emitting. Node.js or Python runtime. Runs from the Region you configure.

  2. CloudWatch Synthetics canaries (API). A structured multi-step test hitting REST or GraphQL endpoints, handling auth, and asserting on response shape. Same scheduling and artefact model as heartbeat.

  3. CloudWatch Synthetics canaries (UI / Puppeteer). A full browser-driven journey in headless Chromium. Navigates pages, fills forms, clicks buttons, waits for elements, asserts on the DOM. Screenshots and HAR capture built in. Heaviest per-invocation cost but the only shape that catches client-side failures.

  4. CloudWatch Synthetics canaries (visual monitoring). Pixel-level comparison against a baseline screenshot. Catches layout regressions that functional assertions miss. Runs as part of UI canary scripts.

  5. Route 53 health checks. HTTPS probes from AWS edge locations, global by default, with failover integration to Route 53 DNS. Lightweight single-endpoint checks. No scripting, no journeys, no artefacts.

  6. Lambda invoked on a schedule (EventBridge rule + Lambda). A handwritten synthetic test as a Lambda function, triggered by EventBridge. All scheduling, retries, metric emission, and artefact storage are your problem. Gives total control at the cost of everything.

  7. Third-party SaaS (Pingdom, Datadog Synthetics, Checkly, StatusCake). Fully managed, global probe networks, richer browser runtimes, but outside the AWS billing and IAM surface. Legitimate choice; not what this scenario is asking about.

Side by side

Option Full journey Browser-aware Managed schedule Failure artefacts Multi-Region Native CloudWatch metrics
Heartbeat canary ✓ (limited)
API canary Partial
UI / Puppeteer canary ✓ (screenshots + HAR)
Visual monitoring canary
Route 53 health check ✓ (edge)
Lambda + EventBridge ✓ (hand-written) With Chromium layer Custom Custom Custom
Third-party Via integration

The three outages map cleanly. Outage A (order-writer crashing after the migration) is caught by an API canary that verifies the /checkout flow all the way through to paid. Outage B (Stripe webhook shape change) is caught by the same API canary asserting that the order actually reaches paid within N seconds. Outage C (Safari click-handler race) is only caught by a UI canary in a real headless browser.

So: one API canary for the server-side contract, one UI canary for the client-side contract, both every one-to-five minutes, both with alarms feeding the pager.
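
The “reaches paid within N seconds” assertion is the part both canaries share. Below is a rough sketch of it as a standalone helper, with the status lookup injected so it could wrap an HTTP call in an API canary or a DOM read in a UI canary. The names pollUntilPaid and fetchStatus and the option names are illustrative, not part of any AWS library.

```javascript
// Hypothetical helper: poll an order's status until it is 'paid' or the
// deadline passes. fetchStatus is any async function returning a status string.
async function pollUntilPaid(fetchStatus, { timeoutMs = 30000, intervalMs = 1000 } = {}) {
  const started = Date.now();
  let last;
  while (Date.now() - started < timeoutMs) {
    last = await fetchStatus();
    if (last === 'paid') {
      return { status: last, elapsedMs: Date.now() - started };
    }
    // Wait before the next poll so the canary does not hammer the endpoint.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  // Report the last status seen: "stuck in pending" points at the webhook
  // path, which is a different diagnosis from "order never created".
  throw new Error(`order did not reach paid within ${timeoutMs}ms (last status: ${last})`);
}
```

Injecting the lookup also makes the timeout logic testable without a live checkout stack: a stub that returns pending forever should make the helper throw.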

The canary lifecycle from page load to pager

Diagram: the canary lifecycle, stage by stage.

  • Synthetics scheduler: every 5 minutes, rate(5 minutes).
  • Canary handler: syn-nodejs-puppeteer-9.0, IAM role CloudWatchSyntheticsRole.
  • Headless Chromium: GET /cart → POST /checkout, then poll /orders/:id until paid.
  • Checkout stack: ALB + ECS + Lambda, RDS + Stripe (test mode).
  • Success path: every assertion passed; emit SuccessPercent = 100.
  • Failure path: capture screenshots, HAR, and console logs; write to s3://canary-artefacts; emit SuccessPercent = 0.
  • CloudWatch metric: CloudWatchSynthetics/SuccessPercent, metric dimension CanaryName = checkout-ui.
  • Metric alarm: SuccessPercent < 80 for 2 × 5-min periods.
  • SNS → PagerDuty: on-call notified in ~6 min.

Each scheduled invocation either emits a 100 or a 0 on the success metric; the alarm's job is to translate "two zeros in a row" into a pager incident with the failure artefacts one click away.

The canary script in depth

Canaries execute in a small Lambda-like runtime, here syn-nodejs-puppeteer-9.0 (or another version, each pinned to a specific Chromium + Node pair). The script pulls in the Synthetics helper library and the SyntheticsLogger module, and exports a handler function.

A UI canary for the checkout journey:

const synthetics = require('Synthetics');
const log = require('SyntheticsLogger');

const checkoutJourney = async function () {
  const page = await synthetics.getPage();

  // Step 1: cart loads
  await synthetics.executeStep('Load cart', async () => {
    await page.goto('https://shop.acme.com/cart', {
      waitUntil: 'networkidle0', timeout: 10000
    });
    await page.waitForSelector('[data-testid="cart-total"]', { timeout: 5000 });
  });

  // Step 2: pay button binds and is clickable
  await synthetics.executeStep('Pay button ready', async () => {
    await page.waitForSelector('[data-testid="pay-button"]:not([disabled])', { timeout: 5000 });
  });

  // Step 3: submit checkout (test-mode Stripe token injected)
  await synthetics.executeStep('Submit checkout', async () => {
    await page.click('[data-testid="pay-button"]');
    await page.waitForSelector('[data-testid="order-status"]', { timeout: 15000 });
  });

  // Step 4: order reaches paid within SLA
  await synthetics.executeStep('Order paid', async () => {
    const maxWait = 30000;
    const started = Date.now();
    while (Date.now() - started < maxWait) {
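      const status = await page.$eval('[data-testid="order-status"]', el => el.textContent.trim());
      if (status === 'paid') return;
      // Plain timer rather than page.waitForTimeout, which newer Puppeteer releases no longer provide.
      await new Promise(resolve => setTimeout(resolve, 1000));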
    }
    throw new Error('order did not reach paid within 30s');
  });
};

// UI canaries call the journey directly; executeHttpStep is for API canaries
// and expects HTTP request options, not a function.
exports.handler = async () => {
  return await checkoutJourney();
};

Four pieces earn their keep. Each executeStep is a named, individually timed section; on failure, the artefacts and the emitted metrics record which step failed. waitForSelector with a timeout is the assertion vehicle: if the pay button never binds (outage C), the step fails specifically. The explicit poll in step 4 encodes the SLA (“order should reach paid within 30 seconds”) and fails clearly rather than hanging indefinitely. And the Stripe test-mode token (injected via a Secrets-Manager-backed environment variable, not shown) is the thing that makes the canary safe to run 288 times a day in production.

The canary’s IAM role needs three things: s3:PutObject on the artefact bucket, cloudwatch:PutMetricData in the CloudWatchSynthetics namespace, and secretsmanager:GetSecretValue on the specific secret. The managed policy CloudWatchSyntheticsExecutionRolePolicy covers the first two; the secrets grant is added explicitly.
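
Spelled out as a policy document, the three grants might look like the sketch below. It is built as a plain JavaScript object so it can be serialised to JSON. The builder name, the Sid values, and the secret ARN are placeholders, and the sketch leaves out anything else the runtime may need (for example, log-writing permissions).

```javascript
// Hypothetical builder for the canary role's policy document.
// Only the three actions named in the text are included.
function buildCanaryPolicy({ artefactBucket, secretArn }) {
  return {
    Version: '2012-10-17',
    Statement: [
      {
        Sid: 'WriteArtefacts',
        Effect: 'Allow',
        Action: 's3:PutObject',
        Resource: `arn:aws:s3:::${artefactBucket}/*`,
      },
      {
        Sid: 'EmitCanaryMetrics',
        Effect: 'Allow',
        Action: 'cloudwatch:PutMetricData',
        Resource: '*',
        // PutMetricData cannot be scoped by resource; the namespace
        // condition key is how the grant is narrowed.
        Condition: { StringEquals: { 'cloudwatch:namespace': 'CloudWatchSynthetics' } },
      },
      {
        Sid: 'ReadOwnSecret',
        Effect: 'Allow',
        Action: 'secretsmanager:GetSecretValue',
        Resource: secretArn,
      },
    ],
  };
}

const policy = buildCanaryPolicy({
  artefactBucket: 'canary-artefacts',
  // Placeholder ARN for illustration only.
  secretArn: 'arn:aws:secretsmanager:eu-west-1:111122223333:secret:checkout-canary-EXAMPLE',
});
console.log(JSON.stringify(policy, null, 2));
```

Scoping the secrets statement to a single ARN is what makes “read access only to its own” literal rather than aspirational.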

Alarm shape in depth

Canaries emit SuccessPercent at the end of each invocation. For a per-invocation outcome (success = 100, failure = 0), the alarm shape that works is:

aws cloudwatch put-metric-alarm \
  --alarm-name checkout-ui-canary-failing \
  --namespace CloudWatchSynthetics \
  --metric-name SuccessPercent \
  --dimensions Name=CanaryName,Value=checkout-ui \
  --statistic Average \
  --period 300 \
  --evaluation-periods 2 \
  --datapoints-to-alarm 2 \
  --threshold 80 \
  --comparison-operator LessThanThreshold \
  --treat-missing-data breaching \
  --alarm-actions arn:aws:sns:eu-west-1:111122223333:pager-critical

Two settings matter. evaluation-periods 2 with datapoints-to-alarm 2 means “two consecutive 5-minute periods below 80%”, so one flaky run doesn’t page. treat-missing-data breaching means that if the canary itself fails to execute (Lambda crash, runtime issue), that counts as a failure; the alternative (notBreaching) silently hides broken canaries.
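
To see how those two settings interact, here is a deliberately simplified model of the evaluation. It is not CloudWatch’s actual algorithm, which handles missing data with more nuance; alarmState and its option names are made up for this sketch, and null stands for a period in which the canary reported nothing.

```javascript
// Simplified model of an M-of-N alarm: look at the N most recent periods and
// go to ALARM when at least M of them breach. Not CloudWatch's real evaluator.
function alarmState(datapoints, {
  threshold = 80,
  evaluationPeriods = 2,
  datapointsToAlarm = 2,
  treatMissingData = 'breaching',
} = {}) {
  const recent = datapoints.slice(-evaluationPeriods);
  if (recent.length < evaluationPeriods) return 'INSUFFICIENT_DATA';
  const breaching = recent.filter((value) => {
    // A missing datapoint counts as a failure only under 'breaching'.
    if (value === null) return treatMissingData === 'breaching';
    return value < threshold; // LessThanThreshold
  }).length;
  return breaching >= datapointsToAlarm ? 'ALARM' : 'OK';
}

console.log(alarmState([100, 0]));   // OK: one failed run does not page
console.log(alarmState([0, 0]));     // ALARM: two consecutive failures
console.log(alarmState([0, null]));  // ALARM: failure, then the canary went silent
console.log(alarmState([0, null], { treatMissingData: 'notBreaching' })); // OK: the broken canary is hidden
```

The last two lines are the argument for breaching in miniature: the same sequence pages under one setting and stays quiet under the other.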

For the UI canary running every 5 minutes, a single failure moves that 5-minute period’s average to 0, and the alarm fires once the next run, 5 minutes later, also fails. Worst-case detection latency is around 11 minutes, including canary execution time. Running every minute, with the alarm period shortened to 60 seconds to match, tightens that to about 3 minutes at the cost of five times the invocation count.
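
The arithmetic behind those figures, as a back-of-envelope sketch. It assumes the alarm period matches the canary interval and that a run takes about 60 seconds; the 60-second figure is an assumption for illustration, and both function names are hypothetical.

```javascript
// Worst case: the outage starts just after a run, so the first failing run is
// one full interval away; the alarm needs `evaluationPeriods` consecutive
// failing runs; the last of them takes `runSeconds` to finish and report.
function worstCaseDetectionMinutes(intervalMinutes, evaluationPeriods, runSeconds) {
  return intervalMinutes * evaluationPeriods + runSeconds / 60;
}

// Invocation count scales inversely with the interval.
function runsPerDay(intervalMinutes) {
  return (24 * 60) / intervalMinutes;
}

console.log(worstCaseDetectionMinutes(5, 2, 60)); // 11
console.log(worstCaseDetectionMinutes(1, 2, 60)); // 3
console.log(runsPerDay(5)); // 288
console.log(runsPerDay(1)); // 1440
```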

A worked incident

A new deploy of the JavaScript bundle goes out at 14:17. The Safari race condition from outage C is re-introduced.

14:17:30. UI canary runs from eu-west-1. Step “Pay button ready” fails: waitForSelector('[data-testid="pay-button"]:not([disabled])') times out at 5000ms. Canary emits SuccessPercent = 0. Screenshot uploaded to s3://canary-artefacts/checkout-ui/2027-05-25T14-17-30Z/passedAndFailed/failed-SCREENSHOT-step-2.png. HAR captured. Console log shows a JavaScript error from the bundle.

14:22:30. UI canary runs again. Same failure. Second consecutive zero datapoint.

14:22:45. Alarm transitions from OK to ALARM. SNS publishes to the pager-critical topic. PagerDuty pages on-call.

14:23:10. On-call acknowledges, opens the alarm in the console. One click on “View canary run” shows the failed screenshot (pay button greyed out, the page otherwise rendered). HAR shows the last request succeeded; no server-side symptom.

14:25:00. On-call correlates: deploy at 14:17, canary fails from 14:17:30, no server-side alarms have fired. Rolls back the deploy.

14:32:30. Canary runs and succeeds. SuccessPercent = 100.

14:37:00. Alarm transitions back to OK. Post-mortem starts, with the canary’s screenshot and HAR as Exhibit A.

The human in the loop was still the human. The canary’s job was to reduce detection latency from “customer tweets” to “11 minutes,” and to hand the on-call a screenshot that localised the failure to the client-side bundle before they had even opened a debugger.

What’s worth remembering

  1. Canaries monitor the contract the user cares about. Component metrics tell you a service is healthy; canaries tell you the journey works. Both are necessary; neither is sufficient.
  2. Three canary shapes. Heartbeat for single endpoints, API for multi-step REST flows, UI (Puppeteer) for anything with client-side logic. Pick per journey.
  3. Each executeStep becomes a named artefact. Failures identify which step failed, with screenshots, HAR, and console logs. This is 80% of the on-call value.
  4. Schedule + alarm shape is the detection latency. One-minute canaries with evaluation-periods 2 page in 2-3 minutes; five-minute canaries page in 10-11. Cost scales linearly with frequency.
  5. treat-missing-data breaching catches broken canaries. A canary runtime that crashes silently is worse than no canary; treat missing data as a failure unless you have a reason not to.
  6. Secrets and test data need a real plan. Test-mode credentials, dedicated canary user, idempotency keys, and ensure-cleanup logic. Running 288 canaries a day against production needs to be safe by construction.
  7. Multi-Region canaries catch Region-local failures. The same script deployed from two or three Regions covers CDN, DNS, and infrastructure failures that single-Region canaries miss.
  8. SuccessPercent is the standard metric. Duration is the other one worth alarming on (latency regressions). Both emit to the CloudWatchSynthetics namespace with CanaryName as the dimension.

A canary is a user who never gets tired and always writes a clear bug report. Three outages at Acme, three server-side clean bills of health, and three Twitter-driven incident responses would have been replaced by three pages inside the 11-minute detection window, each with failure artefacts attached. The next outage will probably be of a shape we haven’t thought of yet, but the pattern of running scripted users against production journeys is the one that makes each new outage shorter than the last.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.