The situation
AWS has a service called AWS Health that publishes events about things happening to your account and to AWS itself. Three event categories:
- issues: operational problems AWS is aware of, scoped to a Region, service, or (for account-specific issues) to your resources. Examples: “increased error rates for EC2 in eu-west-1,” “degraded performance on volume vol-0abc1234.”
- scheduledChanges: planned maintenance, rotations, or changes AWS will apply. “This RDS instance will be restarted between 02:00 and 04:00 UTC on 2027-06-19 for patching.” “This SSL certificate for CloudFront distribution X will auto-rotate on Y.”
- accountNotifications: things AWS wants you to know about your account. Approaching quotas, deprecations, security bulletins, certificate expirations that AWS isn’t auto-rotating.
For most of Acme’s existence, the team’s relationship with Health has been the Personal Health Dashboard: a console page someone occasionally remembers to check. Useful after the fact, as a history browser. Not useful as an operational signal.
The goal this quarter: turn every Health event into something actionable. Specifically:
- Every resource-specific issue (degraded EBS, impaired instance) should page the team that owns that resource.
- Every scheduled change should route to the team that owns the affected resource at least 48 hours ahead, with enough context to decide whether to do something.
- Every account-level notification (quota warning, certificate expiration, security bulletin) should land in a low-urgency triage channel.
- Everything should be captured for audit.
And importantly, this should work across the organisation. Acme has 22 accounts, and the current pattern of “each account’s team checks their own Health Dashboard” is already failing.
What actually matters
Before reaching for a specific integration, it’s worth being precise about what “turn Health into an operational signal” actually wants from a mechanism.
The first thing is push vs pull. A console anyone has to remember to open is a pull-based signal, and pull-based signals don’t survive contact with on-call rotations. A push-based mechanism that fires the moment AWS publishes the event is the only shape that gets the team the heads-up before the incident, rather than after. The chosen integration has to be event-driven, not “ask AWS every fifteen minutes.”
The second is routing by what the event affects. A degraded volume affects one team; a Region-wide RDS notice affects another; a quota warning affects the platform team. The mechanism needs enough metadata on the event (the service, the affected resource, the event category) for a routing layer to pick a destination. Without affected-resource identifiers, the only routing available is “send everything to one channel,” which is the modern equivalent of the console page nobody checks.
The third is org-wide aggregation. Twenty-two accounts is twenty-two event sources. Either each account routes its own events to its own team (which scales the routing-rule configuration by account count), or one place aggregates events from every member account and decides routing centrally. The second shape lets the platform team own the routing logic once; the first scatters it across every account-owning team and assumes they all do the work.
The fourth is category-aware urgency. Resource issues page; scheduled changes route to a planning queue with enough lead time to do something; account notifications land in a low-urgency triage channel. The same delivery surface for all three is the wrong shape because the urgency is different. The mechanism either honours the event’s category natively in the routing logic, or the routing layer has to inspect the payload and decide.
The fifth is Support-tier availability. Some Health interfaces are paywalled to higher Support tiers; the event-driven push surface is available on every tier. If a design pins critical alarms behind a paid tier, the team’s ability to react becomes a billing decision, so it is worth knowing which interfaces are gated before committing.
And sixth, audit. Every event the team acts on should be captured somewhere durable, queryable later, and indexed by account, service, and time. The audit half is non-negotiable for the post-incident question of “did AWS warn us?”; the answer has to come from a log the team owns, not a console page that’s a rolling 90-day window.
What we’ll filter on
- Coverage across services: does it emit events for every AWS service?
- Resource identification: can routing decisions use the affected resource?
- Account vs org-wide: does a single rule see events from every member account?
- Latency to target: how many seconds from AWS to the pager?
- Integrates with existing notifications: does it fit SNS, Slack, PagerDuty, ticketing?
- Support-plan gating: is it available on all tiers?
The Health-signal landscape
- Personal Health Dashboard (console). A page you open when you think to. Not a signal.
- Health API (DescribeEvents). Pull current or historical events programmatically. Useful for audits and “open issues right now” dashboards. The API requires Business Support or higher, for single-account calls as well as for the organisational view. Pull-based, not push.
- EventBridge rule on aws.health. Push-based. Every event arriving on the account’s event bus can trigger any supported target. Available to every account regardless of Support tier.
- Organizational View + EventBridge in org admin. Org-wide aggregation. Every member account’s Health events appear in the admin account, where a single rule handles routing. Requires trusted-service enablement and delegated-admin configuration.
- SNS topic subscription via Health. Legacy shape: some Health event types could subscribe to an SNS topic directly. Superseded by EventBridge; not worth setting up fresh.
- Third-party status-page aggregators. Consume both AWS Health and the public status page. Adjacent; outside the AWS integration story.
Side by side
| Option | Cross-service | Resource-aware | Org-wide | Latency | Integrates with notifications | Support-tier gating |
|---|---|---|---|---|---|---|
| Personal Health Dashboard | ✓ | ✓ | ✗ | N/A | Console only | None |
| Health API (pull) | ✓ | ✓ | Org View: Business+ | Depends on polling | ✓ (custom) | Business+ for all API access |
| EventBridge (per account) | ✓ | ✓ | ✗ | Seconds | ✓ | None |
| Org View + EventBridge | ✓ | ✓ | ✓ | Seconds | ✓ | None for events, Business+ for API |
| Direct SNS (legacy) | Subset | Subset | ✗ | Seconds | ✓ | None |
| Third-party aggregator | ✓ | Via integration | Varies | Varies | ✓ | Varies |
Acme’s answer is the fourth row: Organizational View with a single EventBridge rule in the delegated admin account, routing events to team-owned SNS topics based on resource tags.
From Health event to on-call page
Setup in depth
Three pieces go together.
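Organizational View. From the management account (the aws health call goes through the Health API, so it needs Business Support or higher; the Health console has an equivalent setting for enabling organizational view):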
aws organizations enable-aws-service-access --service-principal health.amazonaws.com
aws health enable-health-service-access-for-organization
Then register the security account as delegated administrator:
aws organizations register-delegated-administrator \
  --account-id 555555555555 \
  --service-principal health.amazonaws.com
After this, the delegated admin account’s default EventBridge bus receives Health events from every member account.
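One way to confirm the registration took effect is to ask Organizations which accounts are delegated administrators for the Health service principal. The following is a minimal sketch, not part of the original setup: the helper name is hypothetical, and the client is passed in by the caller (in practice something like boto3.client("organizations")) so the check can be exercised against a stub.

```python
HEALTH_PRINCIPAL = "health.amazonaws.com"


def is_health_delegated_admin(org_client, account_id):
    """Return True if account_id is a delegated admin for AWS Health.

    org_client is expected to behave like boto3.client("organizations").
    """
    kwargs = {"ServicePrincipal": HEALTH_PRINCIPAL}
    while True:
        page = org_client.list_delegated_administrators(**kwargs)
        for admin in page.get("DelegatedAdministrators", []):
            if admin.get("Id") == account_id:
                return True
        # Follow pagination until the account is found or pages run out
        token = page.get("NextToken")
        if not token:
            return False
        kwargs["NextToken"] = token
```

Called with the placeholder account ID used above, it should return True once registration has completed.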
The EventBridge rule. In the delegated admin account:
aws events put-rule \
  --name aws-health-all \
  --event-pattern '{
    "source": ["aws.health"],
    "detail-type": ["AWS Health Event"]
  }' \
  --state ENABLED
aws events put-targets \
  --rule aws-health-all \
  --targets 'Id=1,Arn=arn:aws:lambda:eu-west-1:555555555555:function:health-classifier'
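The rule matches every standard Health event; abuse events arrive under a separate detail-type and need their own rule or an extra entry in the pattern. Further categorisation happens in the Lambda, because the classification needs resource-tag lookup that event patterns can’t do. EventBridge also needs permission to invoke the function: put-targets does not grant it, so add a resource-based policy statement for events.amazonaws.com with aws lambda add-permission.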
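Before deploying, the pattern can be sanity-checked locally against a sample payload. The sketch below implements only the exact-value subset of EventBridge pattern matching (no prefix, numeric, or anything-but operators), and the sample event is a hand-written approximation of a Health event, not a captured one.

```python
import json

RULE_PATTERN = json.loads("""
{
  "source": ["aws.health"],
  "detail-type": ["AWS Health Event"]
}
""")


def matches(pattern, event):
    """Exact-value subset of EventBridge matching.

    Every pattern key must exist in the event. A list means "the event value
    equals any of these"; a nested dict recurses into the event's sub-object.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        else:
            # An array in the event matches if any element is listed
            values = actual if isinstance(actual, list) else [actual]
            if not any(v in expected for v in values):
                return False
    return True


# Hand-written approximation of a Health event (assumption, not a capture)
SAMPLE_EVENT = {
    "source": "aws.health",
    "detail-type": "AWS Health Event",
    "account": "111122223333",
    "detail": {"eventTypeCategory": "issue", "service": "EBS"},
}

# A narrower rule, for example issues only, adds a "detail" block
issues_only = dict(RULE_PATTERN, detail={"eventTypeCategory": ["issue"]})

print(matches(RULE_PATTERN, SAMPLE_EVENT))
print(matches(issues_only, SAMPLE_EVENT))
```

The same helper is handy in unit tests for the classifier, since it lets a test assert that a fixture would have reached the Lambda at all.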
The classifier Lambda. The interesting code:
import json
import os

import boto3

tag_api = boto3.client('resourcegroupstaggingapi')
sns = boto3.client('sns')
ddb = boto3.resource('dynamodb').Table(os.environ['EVENTS_TABLE'])

# Health event category -> suffix of the team's SNS topic name
ESCALATION = {
    'issue': 'issue',
    'scheduledChange': 'change',
    'accountNotification': 'notice'
}


def find_owner(arns, default='platform'):
    """Return the first Team tag found on the given resources, else the default."""
    if not arns:
        return default
    # This client only sees resources in the Lambda's own account and Region
    resp = tag_api.get_resources(ResourceARNList=arns[:10])
    for mapping in resp['ResourceTagMappingList']:
        for tag in mapping.get('Tags', []):
            if tag['Key'] == 'Team':
                return tag['Value']
    return default


def handler(event, context):
    detail = event['detail']
    event_type_code = detail['eventTypeCode']
    category = detail['eventTypeCategory']
    service = detail['service']
    affected = detail.get('affectedEntities', [])

    # The tagging API needs ARNs. Accept an explicit entityArn, or an
    # entityValue that is itself an ARN; bare IDs such as vol-0abc1234 are skipped.
    arns = []
    for e in affected:
        ref = e.get('entityArn') or e.get('entityValue') or ''
        if ref.startswith('arn:'):
            arns.append(ref)

    # Falls back to the platform team if no resource carries a Team tag
    owner = find_owner(arns)
    urgency = ESCALATION.get(category, 'notice')
    topic = f"arn:aws:sns:eu-west-1:555555555555:{owner}-{urgency}"

    sns.publish(
        TopicArn=topic,
        # SNS limits Subject to 100 characters
        Subject=f"[Health/{category}] {event_type_code}"[:100],
        Message=json.dumps(detail, default=str, indent=2)
    )

    ddb.put_item(Item={
        # detail.eventArn identifies the Health event itself; the top-level
        # resources array lists affected entities, not the event
        'eventArn': detail['eventArn'],
        'receivedAt': event['time'],
        'category': category,
        'service': service,
        'eventTypeCode': event_type_code,
        'owner': owner,
        'status': 'open',
        'affectedEntities': [e.get('entityValue') for e in affected]
    })
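The Lambda’s IAM role needs tag:GetResources (the Resource Groups Tagging API, which covers most resource types), sns:Publish on the team SNS topic ARNs, and dynamodb:PutItem on the tracking table. If the affected resource is in a different account than the admin account, the tag lookup needs a cross-account role assumption. Delegated-admin registration does not grant that access: each member account needs a role that trusts the Lambda’s role, and the Lambda’s role needs sts:AssumeRole on it. The tagging API is also regional, so the lookup has to be made in the resource’s Region.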
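The cross-account lookup could be sketched as below. This is an illustration under assumptions, not the classifier’s actual code: the role name health-tag-reader is hypothetical, the STS client is injected (in practice boto3.client("sts")), and the wiring comment assumes the event detail carries affectedAccount and eventRegion.

```python
def tag_reader_role_arn(account_id, role_name="health-tag-reader"):
    """Build the ARN of the (hypothetical) read-only role in a member account."""
    return f"arn:aws:iam::{account_id}:role/{role_name}"


def member_credentials(sts_client, account_id, role_name="health-tag-reader"):
    """Assume the member-account role and return boto3-style credential kwargs.

    sts_client is expected to behave like boto3.client("sts").
    """
    resp = sts_client.assume_role(
        RoleArn=tag_reader_role_arn(account_id, role_name),
        RoleSessionName="health-classifier",
    )
    creds = resp["Credentials"]
    return {
        "aws_access_key_id": creds["AccessKeyId"],
        "aws_secret_access_key": creds["SecretAccessKey"],
        "aws_session_token": creds["SessionToken"],
    }


# Inside the Lambda this might be wired up as follows (assumes boto3 is
# imported and the detail fields named above are present):
#
#   creds = member_credentials(boto3.client("sts"), detail["affectedAccount"])
#   tag_api = boto3.client("resourcegroupstaggingapi",
#                          region_name=detail["eventRegion"], **creds)
```

Returning plain keyword arguments keeps the helper independent of which client is built with them, so the same credentials can back a tagging client in whichever Region the event names.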
Event shapes worth knowing
The eventTypeCode is the specific identifier. Some examples:
- AWS_EC2_INSTANCE_STOP_SCHEDULED: a named instance will be stopped by AWS on a date. Affected entity: the instance.
- AWS_EBS_VOLUME_PERFORMANCE_DEGRADED: a specific volume is degraded. Often paired with recovery recommendations in the event body.
- AWS_RDS_MAINTENANCE_SCHEDULED: an RDS instance has a maintenance window coming.
- AWS_EC2_OPERATIONAL_ISSUE: regional-level incident affecting EC2; often no specific affected entity beyond the Region.
- AWS_ACM_CERTIFICATE_APPROACHING_EXPIRATION: a certificate ACM isn’t managing is approaching its expiry.
- AWS_LAMBDA_ATHENA_DEPRECATION / service-named deprecation codes: planned deprecations or API retirements.
- AWS_IAM_ACCESS_KEY_EXPOSURE: an IAM access key appears to have been exposed publicly. Issue category; always pages.
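Because eventTypeCode is the precise identifier, it is a natural key for picking a runbook to link in the notification. A minimal sketch, assuming a hand-maintained index: the URLs and the prefix fallback scheme are hypothetical, and the codes are the ones listed above.

```python
# Hypothetical runbook index: exact codes first, then service-level prefixes
RUNBOOKS = {
    "AWS_EBS_VOLUME_PERFORMANCE_DEGRADED": "https://runbooks.example.com/ebs-degraded",
    "AWS_RDS_MAINTENANCE_SCHEDULED": "https://runbooks.example.com/rds-maintenance",
}
PREFIX_RUNBOOKS = {
    "AWS_EC2_": "https://runbooks.example.com/ec2-general",
    "AWS_": "https://runbooks.example.com/aws-general",
}
DEFAULT_RUNBOOK = "https://runbooks.example.com/health-triage"


def runbook_for(event_type_code):
    """Pick the most specific runbook for a Health eventTypeCode."""
    if event_type_code in RUNBOOKS:
        return RUNBOOKS[event_type_code]
    # Longest matching prefix wins, so AWS_EC2_ beats the generic AWS_
    for prefix in sorted(PREFIX_RUNBOOKS, key=len, reverse=True):
        if event_type_code.startswith(prefix):
            return PREFIX_RUNBOOKS[prefix]
    return DEFAULT_RUNBOOK
```

The fallback chain means a code nobody has written a playbook for still lands the on-call on a page that says what to check first.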
The detail-type in EventBridge is either "AWS Health Event" (common) or, for specific abuse / compromise categories, "AWS Health Abuse Event". If the response plan differs, the event pattern can split: separate rules for abuse events versus all others.
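The split could look like the sketch below, which builds one event pattern per detail-type. The rule names are hypothetical; each JSON string is the kind of value that would be passed as EventPattern to the EventBridge put_rule call.

```python
import json


def health_rule_patterns():
    """Return {rule_name: event_pattern_json} for standard and abuse events."""
    base = {"source": ["aws.health"]}
    return {
        # Rule names are illustrative, not from the original setup
        "aws-health-standard": json.dumps(
            {**base, "detail-type": ["AWS Health Event"]}
        ),
        "aws-health-abuse": json.dumps(
            {**base, "detail-type": ["AWS Health Abuse Event"]}
        ),
    }


for name, pattern in health_rule_patterns().items():
    print(name, pattern)
```

Keeping the two patterns disjoint means each event triggers exactly one rule, so the abuse path can target a different Lambda or topic without double delivery.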
For scheduledChange events, startTime and endTime in the detail give the window; the classifier can add a second scheduling layer (EventBridge Scheduler at startTime - 2h) to re-notify just before the window opens.
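The “startTime - 2h” arithmetic can be sketched as follows. Two assumptions are worth checking against real events: the timestamp format (the parser tries ISO 8601 first, then the RFC 1123 style such as "Sat, 19 Jun 2027 02:00:00 GMT"), and that the schedule is created in UTC. The function names are hypothetical; the at(...) string follows EventBridge Scheduler’s one-time schedule expression format.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def parse_health_time(value):
    """Parse a Health timestamp into an aware UTC datetime."""
    try:
        parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    except ValueError:
        # Fall back to RFC 1123 style dates
        parsed = parsedate_to_datetime(value)
    if parsed.tzinfo is None:
        # Treat naive timestamps as UTC (assumption)
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def renotify_expression(start_time, lead=timedelta(hours=2), now=None):
    """Return a Scheduler at() expression, or None if the moment has passed."""
    now = now or datetime.now(timezone.utc)
    fire_at = parse_health_time(start_time) - lead
    if fire_at <= now:
        return None
    return f"at({fire_at.strftime('%Y-%m-%dT%H:%M:%S')})"
```

Returning None for a window that is already less than two hours away lets the classifier skip the second notification instead of scheduling something in the past.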
A worked event
03:47. An AWS_EBS_VOLUME_PERFORMANCE_DEGRADED event fires for vol-0abc1234 in the payments-prod account.
03:47:04. Organizational View surfaces the event in the delegated admin account’s default bus.
03:47:05. The aws-health-all rule fires, invoking the classifier Lambda.
03:47:06. The Lambda reads affectedEntities: [{ entityArn: "arn:aws:ec2:eu-west-1:111122223333:volume/vol-0abc1234" }]. It calls tag:GetResources with that ARN; the response includes Tags: [{Key: "Team", Value: "payments"}, {Key: "Env", Value: "prod"}].
03:47:07. payments team, issue category, publish to arn:aws:sns:eu-west-1:...:payments-issue. SNS fans out to PagerDuty integration. DynamoDB row written as status: open. Archive S3 object written.
03:47:08. PagerDuty pages the payments on-call. Alert reaches phone within 15 seconds of the SNS publish.
03:47:20-03:48:00. The on-call acknowledges, opens the runbook (linked from the SNS message’s detail), checks the volume’s CloudWatch metrics, and begins the “failover the pod that’s using this volume” play.
04:10. AWS publishes a follow-up event AWS_EBS_VOLUME_PERFORMANCE_DEGRADED_RESOLVED. The classifier routes the same way; the DynamoDB row is updated to status: resolved. The on-call receives an “incident resolved by AWS” Slack message.
The on-call found out from AWS Health before the I/O errors propagated into application-level alarms. The previous flow (volume errors surfacing as HTTP 500s from the payments service at 03:52) was replaced by a targeted page with the affected resource ID pre-identified.
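The archive write in the timeline is not shown in the classifier code. One possible shape for the object key is sketched below, assuming a Hive-style partition layout (account, service, date) so that later queries can prune by partition. The prefix, the layout, and the use of detail.affectedAccount are assumptions for illustration.

```python
import hashlib
from datetime import datetime, timezone


def archive_key(event, prefix="health-events"):
    """Build a partitioned S3 key for one EventBridge Health event."""
    detail = event["detail"]
    received = datetime.fromisoformat(
        event["time"].replace("Z", "+00:00")
    ).astimezone(timezone.utc)
    # Prefer the account the event is about over the account that received it
    account = detail.get("affectedAccount") or event["account"]
    # Event ARN plus timestamp keeps follow-up updates as separate objects
    digest = hashlib.sha256(
        f"{detail['eventArn']}|{event['time']}".encode()
    ).hexdigest()[:16]
    return (
        f"{prefix}/account={account}/service={detail['service']}/"
        f"dt={received:%Y-%m-%d}/{digest}.json"
    )
```

The key is deterministic for a given event and timestamp, so a retried Lambda invocation overwrites the same object rather than creating a duplicate.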
What’s worth remembering
- AWS Health is the operational signal already being emitted. Issues, scheduled changes, and account notifications are published; the question is whether the team is listening.
- EventBridge is the primary integration. Every account’s default bus sees aws.health events; a rule + Lambda/SNS target is the standard shape.
- Organizational View + delegated admin is the org-wide answer. One rule in a dedicated account routes events from 22 member accounts, each event carrying the originating account ID.
- Routing by affected resource tag. Read the Team tag on each affected resource and route to that team’s notification lane. Fall back to a platform-owned lane when no tag is available.
- Category picks urgency. issue = page. scheduledChange = Slack + ticket, plus a re-notify 2 hours before start. accountNotification = low-urgency triage.
- eventTypeCode is the precise identifier. Runbook lookup maps code to playbook; detail-type splits abuse events if the response differs.
- Persist events for audit and pattern-finding. DynamoDB for open state, S3 for archival. A monthly Athena query over the archive surfaces recurring issues (e.g. “AZ eu-west-1b has seen 4 PERFORMANCE_DEGRADED events this quarter”).
- Health API is Business+, including for Organizational View; EventBridge events are free to all. The push path works without a Support upgrade; the pull path requires one.
Acme’s operational signal goes from “check the console” to “push to the right pager, with the right context, in seconds.” The Personal Health Dashboard stays as the historical browser; EventBridge owns the live signal. The outcome the team was chasing, “know about AWS’s view of our infrastructure before our monitoring rediscovers it”, is exactly what Health was built to give them, and the integration is three AWS primitives in a small Lambda.