The situation
Acme’s observability data is in three main places:
- CloudWatch Metrics: 80% of the data by volume. Every AWS-managed service’s metrics, custom metrics from Lambda and application code, metric filter derivations from log groups.
- Prometheus (self-hosted on EKS): the container platform’s metrics, cluster health, application exposition endpoints, node-level metrics. About 6,000 scraped series.
- X-Ray traces: a smaller but growing dataset, used for specific latency-breakdown questions. A few critical services are fully instrumented.
Three dashboarding tools have been adopted for reasons-of-history rather than by choice:
- CloudWatch Dashboards: live in the AWS console, one per service. Free-ish (first 3 dashboards per account are free; additional ones at $3/month each). Can graph CloudWatch metrics; cannot graph Prometheus directly.
- Amazon Managed Grafana (AMG): a single workspace. Can graph CloudWatch, Prometheus, X-Ray, CloudWatch Logs (via Logs Insights), Timestream, Athena, and Redshift through managed data source plugins. Priced per-user.
- Datadog: inherited from three years ago. Own agent, own metrics ingest, own dashboarding. Richer UX; meaningful monthly bill.
Three operational questions pushed this review onto the agenda:
- The SRE team wants a single “service health” dashboard per service, showing CloudWatch metrics alongside Prometheus metrics alongside X-Ray latency percentiles. Today this requires flipping between three tools.
- The on-call rotation wants an auth boundary: people who aren’t on-call shouldn’t need AWS console access to see the dashboards. Today, CloudWatch dashboards require it.
- Finance wants the Datadog bill down. Not to zero (some teams have deep Datadog investment), but the “everyone defaults to Datadog” behaviour has to end.
What actually matters
The first question is who is looking at the dashboard? A dashboard surface tied to the cloud console’s identity plane means every viewer needs onboarding to that identity plane, which is a meaningful tax once the audience extends past the people who already live in the console. A surface with its own auth plane (SAML, OIDC, dedicated workspace identity) can grant dashboard access without granting infrastructure access, which is what the on-call rotation is asking for. The auth boundary is the boundary between “an internal observability tool” and “a console plugin”.
The second is what data sources need to be on the same graph? A single-source dashboarding tool is fine for the world where the answer is on one graph and that graph is one source. The moment a single service’s health view wants metrics from a managed-AWS source, an open-source scrape pipeline, and a tracing system on the same panel, single-source tools force users to flip between three windows and mentally align timestamps. Multi-source panels are an architectural property of the dashboard tool, not a feature you can bolt on.
The third is what’s the cost shape? Dashboard tools price along several different axes: per-dashboard, per-user, per-host, per-ingest, per-query. None of those is inherently right; what matters is that the cost shape matches the usage shape. A surface priced per user scales with audience; a surface priced per ingest scales with data volume; a surface priced per dashboard scales with how many views the team keeps around. Three surfaces with three cost shapes is the natural endgame for “everyone defaults to the same tool”; the finance bill comes from the mismatch between cost shape and usage shape.
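To make the "cost shape" idea concrete, here is a minimal sketch of three pricing axes as functions. The per-dashboard and per-user defaults use the prices quoted elsewhere in this article; the per-host price is a purely hypothetical parameter, and none of this is a real billing model.

```python
def per_dashboard_cost(dashboards, free=3, price=3.0):
    """Per-dashboard shape: pay for each dashboard beyond a free allowance."""
    return max(dashboards - free, 0) * price


def per_user_cost(viewers, editors, viewer_price=5.0, editor_price=9.0):
    """Per-user shape: pay per active user, priced by role."""
    return viewers * viewer_price + editors * editor_price


def per_host_cost(hosts, price_per_host):
    """Per-host shape: pay per monitored host (price is a hypothetical input)."""
    return hosts * price_per_host


# Doubling the audience moves only the per-user bill; the per-dashboard
# bill depends on how many views are kept around, not on who reads them.
for viewers in (50, 100):
    print(viewers, per_dashboard_cost(40), per_user_cost(viewers, 10))
```

Feeding the same usage profile through each function shows which variable a given bill is actually sensitive to.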
The fourth is who owns the definition? Dashboards that live as JSON in a repo are auditable, reviewable, and reproducible across environments. Dashboards that live in a UI drift, accumulate one-off tweaks, and lose their history. The choice between “configured in code” and “configured in UI” decides whether dashboards behave like infrastructure or like spreadsheets.
The fifth is how much visualisation sophistication does the workload actually need? The plain visualisations (line, area, number, gauge) cover most operational monitoring. The richer set (heatmaps, state timelines, threshold-coloured gauges, computed-column tables) starts to matter when the question is “what’s the latency distribution doing” rather than “is this number above the line”. Pricing a richer tool against the simpler questions wastes money; pinning a richer audience to the simpler tool wastes their time.
The sixth is what’s the maintenance surface? A console-native dashboard tool has no operating cost beyond authoring. A self-hosted dashboard tool means owning the servers, the upgrade cadence, and the on-call. A managed third-party tool means vendor agent configuration on every host. Each comes with a different ongoing tax, and the tax doesn’t show up on day one; it shows up when the team that stood it up moves on.
What we’ll filter on
- Multi-source panels: can a single panel combine data from different sources?
- No AWS console auth required: can non-AWS-authenticated users view dashboards?
- Cross-account CloudWatch: can a dashboard pull metrics from multiple AWS accounts?
- Rich visualisation: heatmaps, state timelines, threshold-coloured gauges?
- IaC-friendly: can dashboards be defined in version control?
- Cost shape: free, per dashboard, per user, per ingest, or per host?
The dashboarding landscape
- CloudWatch Dashboards. JSON-defined dashboards with several visualisation types (line, stacked area, number, text, etc.) and cross-account, cross-Region CloudWatch metric support via cross-account observability (the Observability Access Manager). Requires AWS console or API access to view. Cheap at small scale; $3/month per dashboard past 3.
- Amazon Managed Grafana (AMG). Managed Grafana workspace with IAM Identity Center or SAML auth. Built-in data sources for CloudWatch, Prometheus, X-Ray, Logs Insights, Timestream, Athena, Redshift. User-based pricing. No ops overhead beyond the workspace itself. Dashboards are Grafana JSON, version-controllable.
- Self-hosted Grafana (OSS or Enterprise). Same tool, you run it. Free software; real operational cost. Useful when you need plugins or features not available in AMG, or you already have Grafana investment.
- Datadog (or New Relic / Honeycomb). Third-party observability platforms with their own metrics, dashboards, APM, and logs. Rich UX; significant cost. Legitimate for shops already committed.
- Managed Prometheus + Grafana. AMP (Amazon Managed Service for Prometheus) + AMG. The two AWS-managed open-source services together cover the Prometheus-plus-CloudWatch case without any self-hosting.
- QuickSight or Athena-based BI. For long-range operational analytics (weekly trends, month-over-month), a BI tool is sometimes the correct answer rather than a dashboard tool. Adjacent.
Side by side
| Option | Multi-source panels | No AWS auth needed | Cross-account CloudWatch | Rich visualisations | IaC-friendly | Cost shape |
|---|---|---|---|---|---|---|
| CloudWatch Dashboards | ✗ (CW only) | ✗ | ✓ (with OAM) | Limited | ✓ (JSON / CFN) | Per-dashboard |
| AMG | ✓ | ✓ | ✓ | ✓ | ✓ (JSON) | Per-user |
| Self-hosted Grafana | ✓ | ✓ | ✓ | ✓ | ✓ | Compute + ops |
| Datadog | ✓ | ✓ | ✓ (via integration) | ✓ | Varies | Per-host + per-ingest |
| AMP + AMG | ✓ | ✓ | ✓ | ✓ | ✓ | Per-user + ingest |
| QuickSight / Athena BI | ✓ | ✓ | ✓ | ✓ (different genre) | ✓ | Per-user + query |
Acme’s likely outcome: keep CloudWatch Dashboards for the “quick operational view inside AWS” case (free, zero setup, good for infrastructure teams already in the console); promote AMG to the primary observability surface for everyone else, with CloudWatch and Prometheus data sources; let Datadog usage decline organically as teams migrate, keeping it only for teams that genuinely use its APM features.
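The comparison table can also be treated as data and filtered against a team's hard requirements. The sketch below is illustrative only: the `Surface` class and `shortlist` helper are hypothetical names, "Limited" rich visualisation is flattened to `False`, and the IaC column is left out because one entry is "Varies".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Surface:
    name: str
    multi_source: bool
    no_aws_auth: bool
    cross_account_cw: bool
    rich_viz: bool
    cost_shape: str


# One row per option, transcribed from the side-by-side table.
SURFACES = [
    Surface("CloudWatch Dashboards", False, False, True, False, "per-dashboard"),
    Surface("AMG", True, True, True, True, "per-user"),
    Surface("Self-hosted Grafana", True, True, True, True, "compute + ops"),
    Surface("Datadog", True, True, True, True, "per-host + per-ingest"),
    Surface("AMP + AMG", True, True, True, True, "per-user + ingest"),
    Surface("QuickSight / Athena BI", True, True, True, True, "per-user + query"),
]


def shortlist(surfaces, **required):
    """Return the names of surfaces matching every required attribute value."""
    return [
        s.name
        for s in surfaces
        if all(getattr(s, key) == value for key, value in required.items())
    ]


# The SRE and on-call requirements: multi-source panels, no AWS console auth.
print(shortlist(SURFACES, multi_source=True, no_aws_auth=True))
```

The boolean filters only narrow the field; the remaining choice among the survivors comes down to cost shape and maintenance surface, which is why those stay as free-text attributes here.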
[Figure: How the data sources fan into each surface]
CloudWatch Dashboards in depth
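A dashboard is a JSON document. Widget types include metric, log, text, alarm, explorer, and custom; a single-number display is a metric widget with its view set to singleValue rather than a separate widget type. A concise example: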
{
  "widgets": [
    {
      "type": "metric",
      "x": 0, "y": 0, "width": 12, "height": 6,
      "properties": {
        "metrics": [
          [ "AWS/ApplicationELB", "RequestCount", "LoadBalancer", "app/checkout/abc", { "id": "m1" } ],
          [ ".", "HTTPCode_Target_5XX_Count", ".", ".", { "id": "m2" } ],
          [ { "expression": "m2/m1*100", "label": "5xx rate %", "id": "e1" } ]
        ],
        "view": "timeSeries",
        "stat": "Sum",
        "region": "eu-west-1",
        "period": 60,
        "title": "Checkout ALB"
      }
    },
    {
      "type": "log",
      "x": 12, "y": 0, "width": 12, "height": 6,
      "properties": {
        "query": "SOURCE '/aws/ecs/checkout' | fields @timestamp, @message | filter level = 'ERROR' | stats count() by bin(1m)",
        "region": "eu-west-1",
        "view": "timeSeries"
      }
    }
  ]
}
Cross-account support uses Observability Access Manager (OAM): a source account shares its CloudWatch metrics and logs with a monitoring account, which can then build dashboards that pull data from both. Configure sinks in the monitoring account, attach links in each source account. One place for dashboards across 22 accounts.
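The sink side of that setup carries a resource policy saying which source accounts may attach links and which telemetry types they may share. Below is a sketch of generating that policy document; the `sink_policy` helper and the account IDs are hypothetical, and the statement shape follows the pattern in AWS's OAM documentation as best recalled, so check it against the current docs before use.

```python
import json


def sink_policy(source_account_ids,
                resource_types=("AWS::CloudWatch::Metric", "AWS::Logs::LogGroup")):
    """Build an OAM sink policy allowing the given accounts to create links."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": sorted(source_account_ids)},
                "Action": ["oam:CreateLink", "oam:UpdateLink"],
                "Resource": "*",
                # Restrict what the source accounts are allowed to share.
                "Condition": {
                    "ForAllValues:StringEquals": {
                        "oam:ResourceTypes": list(resource_types)
                    }
                },
            }
        ],
    }


# Hypothetical source account IDs, for illustration only.
policy = sink_policy(["222233334444", "111122223333"])
print(json.dumps(policy, indent=2))
```

The printed JSON is the kind of document that would be supplied as the sink's policy in the monitoring account; generating it from a list keeps the set of source accounts reviewable in one place.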
Dashboards can be deployed via CloudFormation (AWS::CloudWatch::Dashboard) or Terraform (aws_cloudwatch_dashboard), which keeps them in version control. The JSON is human-readable and diff-friendly.
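One way to keep such dashboards honest in version control is to generate the body from code and lint it before deploying. The sketch below is an assumption-laden example, not an AWS tool: `metric_widget` and `undefined_ids` are hypothetical helpers, and the check relies on the convention that metric ids start with a lowercase letter while metric-math function names are uppercase.

```python
import json
import re


def metric_widget(title, region, metrics, x=0, y=0, width=12, height=6,
                  period=60, stat="Sum"):
    """Build one metric widget in the CloudWatch dashboard body format."""
    return {
        "type": "metric", "x": x, "y": y, "width": width, "height": height,
        "properties": {"metrics": metrics, "view": "timeSeries", "stat": stat,
                       "region": region, "period": period, "title": title},
    }


def undefined_ids(widget):
    """Return ids that expressions reference but no metric row declares."""
    declared, used = set(), set()
    for row in widget["properties"]["metrics"]:
        opts = row[-1] if isinstance(row[-1], dict) else {}
        if "id" in opts:
            declared.add(opts["id"])
        if "expression" in opts:
            # Tokens starting with a lowercase letter are treated as ids.
            used.update(re.findall(r"\b[a-z][A-Za-z0-9_]*\b", opts["expression"]))
    return used - declared


checkout = metric_widget("Checkout ALB", "eu-west-1", [
    ["AWS/ApplicationELB", "RequestCount", "LoadBalancer", "app/checkout/abc", {"id": "m1"}],
    [".", "HTTPCode_Target_5XX_Count", ".", ".", {"id": "m2"}],
    [{"expression": "m2/m1*100", "label": "5xx rate %", "id": "e1"}],
])
print(undefined_ids(checkout))
body = json.dumps({"widgets": [checkout]}, indent=2, sort_keys=True)
```

The resulting `body` string is what would be handed to the CloudFormation `DashboardBody` property or Terraform's `dashboard_body` argument; sorting keys keeps diffs stable between runs.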
AMG in depth
An AMG workspace is a managed Grafana cluster with data sources and auth configured from AWS. Setup:
aws grafana create-workspace \
  --workspace-name acme-observability \
  --account-access-type CURRENT_ACCOUNT \
  --authentication-providers AWS_SSO \
  --permission-type CUSTOMER_MANAGED \
  --workspace-role-arn arn:aws:iam::111122223333:role/AmgWorkspaceRole \
  --workspace-data-sources CLOUDWATCH PROMETHEUS XRAY \
  --grafana-version 10.4
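The workspace role needs permissions against each data source: cloudwatch:Get*/List* for CloudWatch; aps:QueryMetrics together with aps:GetLabels, aps:GetSeries, and aps:GetMetricMetadata on AMP workspaces (a bare aps:Query* does not cover the label and series lookups Grafana’s query editor makes); and xray:Get*/BatchGet* for X-Ray. IAM Identity Center handles user onboarding; users get Viewer, Editor, or Admin via group assignment.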
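As a rough sketch, the customer-managed role's policy could be generated like this. The `workspace_role_policy` helper and the AMP workspace ARN are hypothetical; the CloudWatch and X-Ray wildcards are the ones named in the text, and the AMP actions are spelled out individually as an assumption about what the Prometheus data source needs in order to query and browse labels.

```python
import json


def workspace_role_policy(amp_workspace_arns):
    """Build a read-only IAM policy for the AMG workspace role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Sid": "CloudWatchRead", "Effect": "Allow",
             "Action": ["cloudwatch:Get*", "cloudwatch:List*"],
             "Resource": "*"},
            # Scoped to specific AMP workspaces rather than "*".
            {"Sid": "PrometheusQuery", "Effect": "Allow",
             "Action": ["aps:QueryMetrics", "aps:GetLabels",
                        "aps:GetSeries", "aps:GetMetricMetadata"],
             "Resource": list(amp_workspace_arns)},
            {"Sid": "XRayRead", "Effect": "Allow",
             "Action": ["xray:Get*", "xray:BatchGet*"],
             "Resource": "*"},
        ],
    }


# Hypothetical AMP workspace ARN, for illustration only.
role_policy = workspace_role_policy(
    ["arn:aws:aps:eu-west-1:111122223333:workspace/ws-EXAMPLE"]
)
print(json.dumps(role_policy, indent=2))
```

Every statement is read-only, which matches the role's job: the workspace queries data on behalf of users and never needs to write metrics or traces.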
Data sources get configured inside Grafana. A CloudWatch data source that crosses accounts uses OAM sinks; an AMP data source points at an AMP workspace ID; X-Ray is built-in.
A cross-source panel query: in Grafana’s panel editor, add a CloudWatch query for AWS/ApplicationELB RequestCount and a second query for sum(rate(http_requests_total[1m])) from the AMP data source. Both appear on the same graph, with a shared time selector. This is the single most valuable capability compared to CloudWatch Dashboards.
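In dashboard JSON terms, a cross-source panel is a panel on the mixed data source whose individual queries each name their own data source. The sketch below builds such a panel; the field names approximate what Grafana exports and the data source UIDs are hypothetical, so export a real panel from the workspace and compare before relying on this shape.

```python
import json


def mixed_panel(title, cloudwatch_uid, prometheus_uid):
    """Build a time-series panel with one CloudWatch and one PromQL query."""
    return {
        "type": "timeseries",
        "title": title,
        # The panel-level data source is "mixed"; each target picks its own.
        "datasource": {"type": "datasource", "uid": "-- Mixed --"},
        "targets": [
            {
                "refId": "A",
                "datasource": {"type": "cloudwatch", "uid": cloudwatch_uid},
                "region": "eu-west-1",
                "namespace": "AWS/ApplicationELB",
                "metricName": "RequestCount",
                "dimensions": {"LoadBalancer": "app/checkout/abc"},
                "statistic": "Sum",
                "period": "60",
            },
            {
                "refId": "B",
                "datasource": {"type": "prometheus", "uid": prometheus_uid},
                "expr": "sum(rate(http_requests_total[1m]))",
            },
        ],
    }


# Hypothetical data source UIDs, for illustration only.
panel = mixed_panel("Checkout requests: ALB vs app", "cw-main", "amp-main")
print(json.dumps(panel, indent=2))
```

Because both targets live in one panel, they share the dashboard's time range and refresh interval, which is what removes the need to align timestamps by eye.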
Dashboards are Grafana JSON. They can be exported, version-controlled, and imported into other workspaces. The Terraform grafana provider (maintained by Grafana Labs) can manage them as IaC.
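Exported dashboard JSON tends to carry fields that change on every save, which makes diffs noisy. A minimal sketch of normalising an export before committing it follows; `normalise` is a hypothetical helper, and treating the top-level `id` and `version` as instance-specific is an assumption about the export format that should be checked against a real export.

```python
import json

# Top-level keys assumed to be instance-specific rather than part of the
# dashboard's definition.
VOLATILE_KEYS = ("id", "version")


def normalise(dashboard):
    """Return a stable, diff-friendly JSON string for a dashboard export."""
    cleaned = {k: v for k, v in dashboard.items() if k not in VOLATILE_KEYS}
    return json.dumps(cleaned, indent=2, sort_keys=True) + "\n"


# Two exports of the same dashboard from different workspaces.
staging = {"id": 7, "version": 3, "uid": "svc-health", "title": "Service health"}
prod = {"title": "Service health", "uid": "svc-health", "id": 99, "version": 12}
print(normalise(staging) == normalise(prod))
```

Keeping the `uid` while dropping the numeric `id` lets the same file be imported into several workspaces and still be recognised as the same dashboard.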
When to stay on two tools
The pull to consolidate is real; the cost of consolidating is also real. Some cases where two tools is the correct answer:
- Teams with deep CloudWatch Dashboard expertise and zero cross-source needs. A data-platform team whose entire world is RDS and S3, with dashboards already wired into their runbooks, gains little from moving to AMG. Leave them alone.
- APM-intensive teams on Datadog. Datadog’s APM (distributed tracing, error analytics, profiling) is richer than AMG’s X-Ray integration for heavy-APM use cases. If the team’s workflow depends on Datadog APM, the migration cost is higher than the savings.
- Executive dashboards read by non-engineers. A finance or product dashboard read by people who don’t log into either AWS or Grafana is often best served by a BI tool (QuickSight, Metabase, Looker) rather than forcing everyone into an observability platform.
Three tools is justifiable when each one serves a distinct audience and cost centre; three tools is wasteful when they’re all being used for the same question.
A worked consolidation
Over six months, Acme migrates team-by-team rather than big-bang.
Month 1: AMG workspace provisioned. CloudWatch, AMP, X-Ray data sources configured. Identity Center integration done, user groups mapped to Grafana roles.
Month 2: SRE team migrates their service-health dashboards. Ten dashboards, each combining CloudWatch metrics and Prometheus metrics, previously two tabs. Team stops using Datadog for this workflow.
Month 3: On-call rotation migrates. Grafana viewer role granted to the on-call group; CloudWatch console access no longer required to see dashboards during an incident.
Month 4: Audit of Datadog usage. Five teams identified as APM-dependent (keep). Twelve teams identified as using only dashboards (migrate). Datadog dashboard-only seats are cancelled at the next billing cycle.
Month 6: Datadog bill roughly halved. CloudWatch Dashboards continue for the infra team’s in-console work. AMG is the primary observability surface; the SRE team owns the gold-standard dashboards.
Cost math (illustrative): Datadog bill $18k/month pre-consolidation; $9k/month post. AMG bill $780/month (120 viewers at $5 = $600, plus 20 editors at $9 = $180). CloudWatch Dashboards remain basically free. Net savings of roughly $8.2k/month without losing operational capability.
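The arithmetic can be re-derived from the per-seat prices and seat counts, which is a useful check on any summary figure. The sketch below uses only the illustrative numbers quoted above and ignores CloudWatch Dashboards, which the text treats as effectively free.

```python
# Illustrative inputs from the worked consolidation (USD per month).
DATADOG_BEFORE = 18_000
DATADOG_AFTER = 9_000
VIEWERS, VIEWER_PRICE = 120, 5
EDITORS, EDITOR_PRICE = 20, 9

# Per-user pricing: each role is billed at its own rate.
amg_bill = VIEWERS * VIEWER_PRICE + EDITORS * EDITOR_PRICE

# Net saving is the Datadog reduction minus the new AMG spend.
net_saving = (DATADOG_BEFORE - DATADOG_AFTER) - amg_bill

print(f"AMG: ${amg_bill}/month, net saving: ${net_saving}/month")
```

Keeping the seat counts as named inputs makes it easy to rerun the calculation as the audience grows, since per-user pricing scales with viewers rather than with data volume.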
What’s worth remembering
- Pick the surface by the question, not by the brand. CloudWatch Dashboards for in-console single-source; AMG for cross-source observability; third-party for what they’re genuinely better at (APM, RUM).
- Multi-source panels are AMG’s killer feature. CloudWatch metric math and Prometheus PromQL on the same graph, same time range, is worth the per-user cost for SRE-level users.
- Cross-account CloudWatch requires OAM. Observability Access Manager sinks and links turn per-account CloudWatch into a centralised view. Free, a half-hour to set up.
- CloudWatch Dashboards are IaC-native. JSON documents deployable via CloudFormation or Terraform. No manual UI work needed to produce a good dashboard.
- AMG pricing is per-user. Editor $9/user/month, Viewer $5. Workspace-level controls; IAM Identity Center or SAML for identity.
- AMP closes the Prometheus-on-AWS loop. Managed Prometheus, scraped from EKS, exposed to AMG as a data source. Together with AMG, that’s an AWS-hosted “Prometheus + Grafana” without servers.
- Don’t consolidate what’s working. Datadog and CloudWatch Dashboards can coexist with AMG as long as each is earning its keep. The consolidation effort is paid back only when the duplicated work is genuinely duplicated.
- Dashboards deserve version control. CloudWatch JSON, AMG JSON, and even Datadog dashboards (via their API) can be managed as code. Dashboards-as-code is the discipline that prevents “the one dashboard that nobody knows who owns.”
Two tools, not three, for most Acme teams: CloudWatch Dashboards stays for the in-console cases it’s free and good for; AMG becomes the primary observability surface because of its multi-source panels and non-AWS auth. Datadog shrinks to the teams using its APM. The rule isn’t “one tool to rule them all”, it’s “one tool per question the team is asking.” Three questions, three tools, no duplication. The bill follows.