Aligning ALB and Route 53 Health Checks for Regional Failover

October 28, 2026 · 16 min read

The situation

A payments API runs in two regions for DR: eu-west-1 primary, us-east-1 secondary. Each region has an ALB fronting 12 ECS Fargate tasks across three AZs. In front of both regions, a Route 53 hosted zone with a failover routing policy: api.payments.example.com points to eu-west-1 when healthy, fails over to us-east-1 when not.

Two layers of health checks are running:

ALB target group health checks: HTTP GET /health on each task, 200 = healthy, 5xx or timeout = unhealthy. Interval 30 seconds, threshold 2 consecutive failures, 3 consecutive successes. A task that fails lands in “unhealthy” and stops receiving traffic until it recovers.
Route 53 health checks: HTTPS GET on eu-west-1-api.payments.example.com/healthz (the regional endpoint), probed from 3 health checker Regions. Interval 30 seconds, threshold 3 consecutive failures. If the check fails, the primary record is withdrawn and traffic moves to us-east-1.
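As a rough worked example of what those settings imply, the time each layer needs before it declares a failure is about the probe interval times the failure threshold. The sketch below is plain arithmetic, not an AWS API; the function name detection_seconds is made up for illustration, and the figure ignores probe timeouts, jitter, and DNS TTL.

```shell
# Rough lower bound on time-to-unhealthy: probe interval multiplied by the
# number of consecutive failures required. Ignores timeouts and jitter.
detection_seconds() {
    interval=$1
    failures=$2
    echo $(( interval * failures ))
}

echo "ALB target: $(detection_seconds 30 2)s"   # 30s interval, 2 failures
echo "Route 53:   $(detection_seconds 30 3)s"   # 30s interval, 3 failures
```

On these numbers the ALB needs about 60 seconds and Route 53 about 90, so timing alone did not cause the gap: the two probes were asking different questions.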

Last Tuesday: a bad deploy shipped to eu-west-1 that returned 503 from /health but 200 from /healthz. The ALB marked every task unhealthy and returned 503 to every request. Route 53’s health check, hitting /healthz which was unaffected, still considered the region healthy. Traffic never failed over. Users saw 503s for eleven minutes until the deploy was rolled back.

The postmortem: the two health checks measure different things, by accident, and agree on nothing. The team wants a design where either both layers agree, or they disagree for good reasons that match different failure modes.

What actually matters

Before drawing the failure matrix, it’s worth asking what each health check is actually for.

The first thing to understand is that ALB target health and Route 53 health checks answer different questions. An ALB target health check answers “should this specific target receive traffic right now?” The ALB uses it to take individual tasks out of rotation when they’re sick: the deployment that leaked memory, the task that wedged on a deadlocked thread, the instance that lost its EBS volume. A Route 53 health check answers “should this regional endpoint receive traffic at all?” It’s used to decide whether a whole region, or more precisely the DNS record pointing at a regional endpoint, is healthy.

The second thing: the two layers see different failures. An ALB target health check cannot tell you the ALB itself is broken (if the ALB is down, the health check is too). A Route 53 health check cannot tell you an individual target is broken (it sees the aggregate). They are complementary, not redundant. Using one to replace the other loses a category of failure detection.

The third thing: the health check has to exercise what matters. The outage above happened because /health and /healthz tested different things. If /healthz returns 200 whenever the web server process is alive, and /health returns 200 only when the database connection pool is up, then one checks “server is running” and the other checks “server can do its job.” Both might be useful, but Route 53 wants to know “can this region do its job?”, not “is the web server process up?” If the region can’t serve requests, Route 53 should fail.
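One way to make that disagreement visible is to classify a pair of probe results side by side. This is a minimal sketch, assuming the ALB matcher accepts only 200 (as in the setup above) and that the DNS-level check treats any 2xx or 3xx as healthy; the function name compare_probes and its labels are illustrative, not part of any AWS tooling.

```shell
# Classify one pair of probe results.
# Assumption: ALB matcher = 200 only; DNS-level check = any 2xx/3xx.
compare_probes() {
    alb_code=$1   # status returned on the ALB's path, e.g. /health
    dns_code=$2   # status returned on the Route 53 path, e.g. /healthz
    alb_ok=no
    dns_ok=no
    [ "$alb_code" = "200" ] && alb_ok=yes
    case "$dns_code" in 2??|3??) dns_ok=yes ;; esac
    case "$alb_ok/$dns_ok" in
        yes/yes) echo "agree: healthy" ;;
        no/no)   echo "agree: unhealthy" ;;
        no/yes)  echo "disagree: targets evicted, region still advertised" ;;
        yes/no)  echo "disagree: failover with healthy targets" ;;
    esac
}

# Usage against live endpoints (not run here):
# compare_probes \
#   "$(curl -s -o /dev/null -w '%{http_code}' https://eu-west-1-api.payments.example.com/health)" \
#   "$(curl -s -o /dev/null -w '%{http_code}' https://eu-west-1-api.payments.example.com/healthz)"
```

Feeding it the Tuesday values, 503 from /health and 200 from /healthz, lands in the third branch: targets evicted while the region is still advertised.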

The fourth thing: cascading effects of each health check. When an ALB marks a target unhealthy, it stops sending traffic to that one target. If the auto-scaler is watching, it may launch a replacement. When Route 53 marks a health check unhealthy, the entire region’s traffic shifts to another region. The blast radius of a Route 53 failure decision is orders of magnitude larger than an ALB failure decision. A flappy health check at the Route 53 layer is a regional failover event; at the ALB layer, it’s one sick task.

The fifth thing: how the two layers should relate. Either the DNS-level check shares a fate with the load-balancer-level check (same probe shape, same definition of healthy), or it reads the load balancer’s own verdict (aggregate target health as a signal), or it deliberately probes deeper than the load balancer does (so “region healthy” means “region can serve traffic end to end”). The correct answer depends on how much you trust the load balancer to evict its own sick targets before they pollute the regional endpoint.

What we’ll filter on

Filters for each health-check arrangement:

Detects individual target failures: memory leak, deadlock, bad deploy on one task.
Detects regional failures: ALB down, whole region unreachable, every target sick.
Checks exercise the critical path: if the region can’t serve a real request, the check knows.
Blast radius proportional to failure: one sick task doesn’t trigger regional failover.
Both layers agree in common outages: if one says unhealthy, the other does too, for the correct reasons.

The health-check landscape

1. ALB target health only (no Route 53 check). The ALB removes sick targets from rotation; there’s no DNS-level failover. A regional ALB outage takes the whole service down. Fine for a single-region service with no DR story; not the scenario.
2. Route 53 health check only (no ALB target check). DNS fails over on regional failure; individual sick targets keep serving traffic until they crash. If the Route 53 check is a shallow /healthz, this has the same blind spot that sat behind the Tuesday outage. Not a serious option, but worth naming to explain why not.
3. Both layers, different endpoints (status quo, broken). ALB hits /health (deep), Route 53 hits /healthz (shallow). The two disagree by design because they test different things. When the deep check fails but the shallow one passes, the ALB removes all targets but Route 53 keeps sending traffic to the region. Classic silent failure.
4. Both layers, same endpoint. ALB hits /health; Route 53 hits the same path, on the regional endpoint URL, with matching success criteria. If targets are sick, the ALB evicts them; if all targets are sick, the ALB returns 5xx, and Route 53’s health check sees the 5xx and fails over. The ALB’s target eviction is the first line of defence; Route 53 is the second.
5. Both layers, Route 53 calculated from CloudWatch alarms. ALB does its own target health; Route 53’s health check is a calculated health check that aggregates CloudWatch alarms, typically HealthyHostCount < threshold on the target group. When the target group has zero healthy hosts, the alarm fires, the calculated health check flips to unhealthy, and Route 53 fails over. Doesn’t require a public probe endpoint; fails over precisely when the ALB runs out of healthy targets.
6. Both layers, Route 53 on deep-probe endpoint. ALB hits /health (lightweight: process up, thread pool healthy). Route 53 hits /deep-health on the regional ALB, which exercises database, cache, dependencies. The two layers detect different things on purpose: the ALB evicts broken tasks quickly on a cheap check, and Route 53 fails over when deep dependencies are broken across the region.

Side by side

Option	Detects target failures	Detects regional failures	Checks critical path	Blast radius proportional	Layers agree
ALB target only	✓	✗	✓	✓	n/a
Route 53 only	✗	✓	✓	✗	n/a
Both, different endpoints	✓	—	—	✓	✗
Both, same endpoint	✓	✓	✓	✓	✓
Calculated from CW alarms	✓	✓	✓	✓	✓
Deep-probe at Route 53	✓	✓	✓✓	✓	✓ (by design)

Reading the table: options 4, 5, and 6 all work; they differ on whether Route 53’s signal is “aggregate target health” (5), “same-shape-as-ALB check” (4), or “deeper-than-ALB check” (6). The one to avoid is option 3, which is exactly where the team is.

How target health and DNS failover relate

The ALB probes targets directly and publishes HealthyHostCount to CloudWatch; a calculated Route 53 health check reads the CloudWatch alarm on that metric. When every target in the target group fails, the alarm fires and Route 53 fails over to us-east-1 without a second probing path.

The pick in depth

Option 5, calculated Route 53 health check driven by CloudWatch alarms, is the cleanest for this team because it makes the layers agree by construction: Route 53 doesn’t probe the region independently, it reads the ALB’s own verdict. If the ALB says zero healthy hosts, Route 53 fails over.

The CloudWatch alarm. On the target group metric HealthyHostCount in the AWS/ApplicationELB namespace, dimensioned by TargetGroup and LoadBalancer. Alarm condition: HealthyHostCount < 1 for 2 consecutive evaluation periods of 60 seconds. Two periods to avoid a single missed datapoint causing a regional failover.

aws cloudwatch put-metric-alarm \
    --alarm-name payments-eu-west-1-no-healthy-targets \
    --metric-name HealthyHostCount \
    --namespace AWS/ApplicationELB \
    --dimensions Name=TargetGroup,Value=targetgroup/payments-prod/a1b2c3d4 \
                 Name=LoadBalancer,Value=app/payments-prod-alb/abcdef123456 \
    --statistic Minimum \
    --period 60 \
    --evaluation-periods 2 \
    --threshold 1 \
    --comparison-operator LessThanThreshold \
    --treat-missing-data breaching

treat-missing-data breaching matters: if the ALB stops reporting the metric because it’s dead, the alarm should fire, not hold its last good value.
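To see how the two evaluation periods and the missing-data setting interact, here is a toy model of the alarm condition. It is a deliberate simplification and not how CloudWatch evaluates alarms internally: it looks only at the trailing run of datapoints, and the function name alarm_state and the "-" marker for a missing datapoint are assumptions made for this sketch.

```shell
# Toy model of the alarm above: ALARM when the last two datapoints both
# breach (HealthyHostCount < 1). A "-" stands for a missing datapoint and,
# mirroring --treat-missing-data breaching, counts as a breach.
alarm_state() {
    breaches=0
    for dp in "$@"; do
        if [ "$dp" = "-" ] || [ "$dp" -lt 1 ]; then
            breaches=$(( breaches + 1 ))   # extend the breaching run
        else
            breaches=0                     # a good datapoint resets the run
        fi
    done
    if [ "$breaches" -ge 2 ]; then echo ALARM; else echo OK; fi
}

alarm_state 12 0 12   # one bad minute, then recovery
alarm_state 12 0 0    # two consecutive bad minutes
alarm_state 12 - -    # metric stops arriving
```

A single bad datapoint leaves the alarm in OK, while two in a row, or two missing in a row, trip it.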

The Route 53 calculated health check. A health check of type CLOUDWATCH_METRIC watches the alarm; a parent of type CALCULATED can then list it as a child. The command below creates the CLOUDWATCH_METRIC check. When the alarm condition is breached, that check reports unhealthy, and any parent that depends on it follows. Route 53 polls the state roughly every 30 seconds.

aws route53 create-health-check \
    --caller-reference hc-payments-eu-west-1-$(date +%s) \
    --health-check-config '{
      "Type": "CLOUDWATCH_METRIC",
      "AlarmIdentifier": {
        "Region": "eu-west-1",
        "Name": "payments-eu-west-1-no-healthy-targets"
      },
      "InsufficientDataHealthStatus": "Unhealthy"
    }'

InsufficientDataHealthStatus: Unhealthy keeps the fail-safe bias: if CloudWatch itself is broken and the alarm data is missing, Route 53 treats the region as unhealthy.
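The command above creates only the CLOUDWATCH_METRIC check. If a CALCULATED parent is wanted on top of it, a sketch of its configuration could look like the following. The helper name calculated_config and the CHILD_ID placeholder are assumptions; the child ID is whatever create-health-check returned for the check above.

```shell
# Print a HealthCheckConfig for a CALCULATED parent that is healthy only
# while its single child health check is healthy.
calculated_config() {
    child_id=$1
    cat <<EOF
{
  "Type": "CALCULATED",
  "ChildHealthChecks": ["$child_id"],
  "HealthThreshold": 1
}
EOF
}

# Usage (not run here; CHILD_ID is a placeholder):
# aws route53 create-health-check \
#     --caller-reference hc-payments-eu-west-1-parent-$(date +%s) \
#     --health-check-config "$(calculated_config "$CHILD_ID")"
```

With one child and HealthThreshold 1 the parent simply mirrors the child; the parent earns its keep only when more alarms or probes are added as further children.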

The failover record set. Two failover records for api.payments.example.com, both type A-alias:

Primary:   api.payments.example.com. → eu-west-1 ALB (health check: hc-eu-west-1)
Secondary: api.payments.example.com. → us-east-1 ALB (health check: hc-us-east-1)

Each has Failover: PRIMARY or SECONDARY and EvaluateTargetHealth: true. When the primary’s health check is unhealthy, Route 53 returns the secondary. TTL on the record is 60 seconds (alias records pointing at a load balancer take the load balancer’s TTL rather than one of their own), low enough that client DNS caches expire quickly, high enough that the resolver load is reasonable. Clients that cache DNS aggressively (some Java stacks, cough) pay the cost regardless; the TTL bounds staleness only for clients that honor it.
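The two records can be expressed as a change batch for aws route53 change-resource-record-sets. This is a sketch under assumptions: the helper names, the SetIdentifier values, and every ALB DNS name, hosted zone ID, and health check ID passed in are placeholders to be replaced with real values.

```shell
# Print one failover alias record as a change-batch entry.
# Arguments: role (PRIMARY|SECONDARY), ALB DNS name, ALB hosted zone ID,
# health check ID. Alias records carry no TTL field of their own.
failover_record() {
    role=$1; alb_dns=$2; alb_zone=$3; hc_id=$4
    cat <<EOF
{
  "Action": "UPSERT",
  "ResourceRecordSet": {
    "Name": "api.payments.example.com.",
    "Type": "A",
    "SetIdentifier": "payments-$role",
    "Failover": "$role",
    "HealthCheckId": "$hc_id",
    "AliasTarget": {
      "HostedZoneId": "$alb_zone",
      "DNSName": "$alb_dns",
      "EvaluateTargetHealth": true
    }
  }
}
EOF
}

# Wrap two entries into a complete change batch document.
change_batch() {
    printf '{"Changes": [\n%s,\n%s\n]}\n' "$1" "$2"
}

# Usage (not run here; all upper-case variables are placeholders):
# aws route53 change-resource-record-sets \
#     --hosted-zone-id "$ZONE_ID" \
#     --change-batch "$(change_batch \
#         "$(failover_record PRIMARY   "$EU_ALB_DNS" "$EU_ALB_ZONE" "$EU_HC_ID")" \
#         "$(failover_record SECONDARY "$US_ALB_DNS" "$US_ALB_ZONE" "$US_HC_ID")")"
```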

The ALB target health check itself. With Route 53 driven by HealthyHostCount, the ALB’s own /health probe is free to be as deep as the team wants: check the database connection pool, check the downstream service’s circuit breaker, check anything that means “this task can do its job.” False positives here evict a single task, not a region.

A worked failure

The SRE on-call sees the bad deploy ship at 14:45 UTC on a Thursday.

45:03  Deploy pipeline completes; new task definition v142 rolls out
45:10  Task payments-7fj3k starts; /health returns 503 (DB pool exhausted)
45:12  ALB target group marks payments-7fj3k unhealthy
45:40  Task payments-9ab2m starts; /health returns 503
45:42  ALB target group marks payments-9ab2m unhealthy
...
47:18  HealthyHostCount drops to 0 as last v141 task is replaced
47:19  CloudWatch metric HealthyHostCount=0 published
48:19  Second evaluation period: alarm ALARM state
48:50  Route 53 calculated health check flips to Unhealthy
48:51  Route 53 primary record withdrawn; secondary answers queries
49:21  Clients with 60s DNS TTL start resolving to us-east-1

Three minutes forty from first bad task to failover; most of that is the deployment rolling old tasks out and new ones in. The on-call engineer sees the alarm, rolls back the deployment, watches the failover unwind in reverse as HealthyHostCount in eu-west-1 climbs back to 12 and the CloudWatch alarm clears. Clients shift back to eu-west-1 over the next minute as TTLs expire.
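The intervals in that timeline can be checked with a few lines of arithmetic. The helper names to_seconds and elapsed are illustrative; the stamps are the mm:ss values from the log above.

```shell
# Convert a mm:ss stamp to seconds. awk is used so that stamps such as
# 48:08 are read as decimal numbers rather than tripping octal parsing.
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 60 + $2 }'
}

# Seconds between two mm:ss stamps within the same hour.
elapsed() {
    echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}

elapsed 45:10 48:50   # first bad task -> health check flips: 220
elapsed 47:18 48:50   # zero healthy hosts -> health check flips: 92
```

The first figure is the three minutes forty quoted above; the second shows that once the last healthy target is gone, the alarm and Route 53 take about a minute and a half to react.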

The postmortem is short: the health check layers caught the failure, the failover worked, the deploy safety net fired. The action items are tuning (do we want 90-second alarm evaluation instead of 120?) rather than architecture.

Notes on the other options

Option 4 (both layers, same endpoint) also works, with one trap: if the Route 53 health checkers come from IP ranges the ALB security group doesn’t allow, the probe fails and the region gets marked unhealthy for the wrong reason. Route 53 publishes its checker IP ranges; either allow them in the SG or use a calculated check as above.
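For completeness, a sketch of what the option 4 health check configuration might look like, probing the same /health path the ALB uses. The helper name https_check_config is hypothetical, and the interval and threshold simply reuse the values from the existing setup. It only works if the ALB security group admits the Route 53 checker ranges.

```shell
# Print a HealthCheckConfig for an HTTPS probe against the regional
# endpoint. Interval and threshold mirror the existing Route 53 check.
https_check_config() {
    fqdn=$1
    path=$2
    cat <<EOF
{
  "Type": "HTTPS",
  "FullyQualifiedDomainName": "$fqdn",
  "Port": 443,
  "ResourcePath": "$path",
  "RequestInterval": 30,
  "FailureThreshold": 3
}
EOF
}

# Usage (not run here):
# aws route53 create-health-check \
#     --caller-reference hc-payments-eu-west-1-https-$(date +%s) \
#     --health-check-config \
#         "$(https_check_config eu-west-1-api.payments.example.com /health)"
```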

Option 6 (deep-probe at Route 53) is the correct call when the ALB’s /health is shallow and there’s a meaningful difference between “ALB has targets” and “region can actually serve.” Picture a scenario where DynamoDB in eu-west-1 is degraded but the ALB’s targets are up and passing /health: option 5 wouldn’t fail over (targets are healthy), option 6 would (deep probe hits DynamoDB and fails). The cost is maintaining a deeper endpoint that exercises every critical dependency.

What’s worth remembering

ALB target health and Route 53 health check answer different questions. Target health says “should this task receive traffic?”, regional health says “should this region receive traffic?” Different scopes, different blast radii.
Health checks that disagree about what “healthy” means are the bug. /health and /healthz returning different answers from the same process is how silent failovers-that-don’t-happen happen. Either align the endpoints or be explicit about why they differ.
Calculated Route 53 health checks read CloudWatch alarms. Tie the alarm to HealthyHostCount on the target group and the regional health check reflects the ALB’s own verdict. No separate probe path to keep in sync.
InsufficientDataHealthStatus: Unhealthy is the safer bias. If the alarm data stops arriving, assume the region is sick rather than hold the last good state.
EvaluateTargetHealth: true on alias records. Without it, Route 53 alias records ignore the target’s health and return the record regardless.
DNS TTL bounds failover time only for clients that honor it. Route 53 will answer with the new record immediately once the health check flips; clients with aggressive DNS caches (some JVMs, some connection pools) pay the cache cost regardless.
The correct /health depth depends on the layer. Target-level checks can be deep and evict a single task cheaply. Regional-level checks either share a fate with target health (calculated) or exercise dependencies the ALB’s check doesn’t.
Failover of regions is a nuclear option compared to target eviction. Design health checks so that “flap at the target” is cheap (one task out, one task in) and “flap at the region” is rare and significant.

Two layers of health check, each watching a different scope, agreeing on what healthy means: that’s the system that would have failed Tuesday’s outage over within about two minutes of the last healthy target dropping out, instead of leaving users on 503s for eleven. The ALB knows when its targets are sick. CloudWatch aggregates that into a metric. Route 53 reads the metric through a calculated health check. No parallel probes, no divergent definitions, no “healthy region, 503 to every user” gap.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.