How to Migrate EC2 Fleets to the Unified CloudWatch Agent

October 18, 2027 · 17 min read

CloudOps Engineer · SOA-C03 · part of The Exam Room

The situation

Acme has about 400 EC2 instances, running a mix of agents that accumulated organically:

  • Legacy CloudWatch Logs agent (awslogs): on 130 older instances. Ships files from disk to CloudWatch Log Groups. Configured via /etc/awslogs/awslogs.conf. Metrics? None; it’s logs-only.
  • Unified CloudWatch agent (amazon-cloudwatch-agent): on 220 instances. Does both logs and metrics. Configured via /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json. Newer deployments default to this.
  • Prometheus node exporter: on ~50 hosts, scraped by the in-cluster Prometheus. Fills gaps the legacy CloudWatch agent left: memory pressure, disk queue depth, process-level metrics.
  • A handful of hosts running no agent at all, usually ephemeral CI runners that nobody noticed.

Problems that have accumulated:

  1. Three agents to install, configure, and patch. Configuration management drift is constant.
  2. Memory and disk usage metrics from the legacy fleet are missing. Basic EC2 CloudWatch metrics (CPU, network, disk IO) are collected by the hypervisor, no agent required, but memory, per-disk free space, and process counts need an in-guest agent, and the legacy agent collects none of them.
  3. Prometheus node exporter duplicates what the unified agent could do. The unified agent can emit the same system metrics that node_exporter does, and can also scrape Prometheus endpoints if configured.
  4. Agent config lives in two formats. awslogs.conf (ini-ish) and amazon-cloudwatch-agent.json (JSON). Two parsers, two templates, two different sets of gotchas.

The migration target: one agent across the fleet. The unified CloudWatch agent covers logs, metrics, and Prometheus scraping. Legacy agent retires. Node exporter retires on hosts where the unified agent is equally good (most cases).

What actually matters

The first question is: what is one agent allowed to be responsible for? Logs, host system metrics, application-emitted custom metrics, scraped exposition endpoints, and trace forwarding are five different jobs. Splitting them across separate daemons made sense when each daemon was a separate project; consolidating them into one process means one install path, one config file, one set of credentials, one upgrade cadence. The trade-off is that the consolidated agent does each job slightly less natively than a specialist, but the operational saving from “one agent per host” is large enough that it usually wins on a fleet of any size.

The second is: how does config reach the host? Per-host files on disk drift the moment configuration management isn’t running. A central config store (parameter store, secrets manager, or configuration-as-data behind an instance role) means the source of truth is one place and the rollout is a restart, not a config push. Whatever agent gets chosen, the config plane has to be central; otherwise the drift returns within a release.

The third is: where do system metrics actually come from? The cloud platform already collects the basic per-VM metrics (CPU, network, disk IO) at the hypervisor for free. The interesting metrics (memory pressure, per-disk free space, process counts, file descriptor usage) require an in-guest agent because they’re not visible from outside the guest operating system. The agent’s job at that layer is to fill the gap the hypervisor cannot see; if it doesn’t, those metrics simply don’t exist.
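To make the in-guest point concrete, here is a minimal sketch of the kind of reading only a process inside the guest can take: parsing /proc/meminfo and deriving an available-memory percentage. The function names and the sample text are illustrative assumptions, and the formula is one simple definition, not necessarily the one the unified agent uses for mem_available_percent.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field: value}; most values are in kB."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


def mem_available_percent(fields: dict[str, int]) -> float:
    """Share of total memory the kernel reports as available, in percent."""
    return 100.0 * fields["MemAvailable"] / fields["MemTotal"]


# Illustrative sample; on a real host, pass open("/proc/meminfo").read() instead.
SAMPLE = """\
MemTotal:        8000000 kB
MemFree:          500000 kB
MemAvailable:    2000000 kB
"""

print(mem_available_percent(parse_meminfo(SAMPLE)))  # 25.0
```

None of these fields are exposed to the hypervisor, which is why the basic EC2 metrics cannot include them.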

The fourth is: what’s the cost of letting custom-metric paths fragment? Applications that already emit StatsD, CollectD, or Prometheus exposition pages don’t want to be rewritten because the host agent changed. The agent that wins should accept the existing emission shapes, not force every application to re-instrument. That keeps the migration to one agent independent of any application work.
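As an illustration of an "existing emission shape", the sketch below encodes a metric in the plain StatsD text format (name:value|type) and sends it over UDP. The metric name is hypothetical, and the default port is taken from the UDP :8125 listener shown later in this post; treat it as a sketch of what an application already does, not as agent code.

```python
import socket


def statsd_line(name: str, value: float, kind: str = "g") -> bytes:
    """Encode one metric in the plain StatsD text format: name:value|type."""
    if kind not in {"c", "g", "ms", "s"}:  # counter, gauge, timer, set
        raise ValueError(f"unknown StatsD type: {kind}")
    return f"{name}:{value}|{kind}".encode("ascii")


def send_statsd(payload: bytes, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget UDP send; StatsD clients do not wait for a reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


# Hypothetical application metric.
print(statsd_line("checkout.queue_depth", 12))  # b'checkout.queue_depth:12|g'
```

An application that already calls something like send_statsd needs no change when the listener on that port becomes the unified agent.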

The fifth is: who owns the resulting bill? High-resolution intervals, high-cardinality dimensions, and per-metric publish counts all show up in the metrics bill. The agent’s defaults set the cost shape: collection interval, dimensioning, and which sections are enabled. A sensible default config saves the team from a six-figure surprise the first time a misconfigured fleet starts publishing per-process metrics at ten-second resolution.
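A rough way to see the cost shape is to count datapoints rather than dollars. The sketch below assumes a 400-host fleet publishing 10 metrics per host (both figures are illustrative) and compares a 60-second interval with a 10-second one; it deliberately models volume only and contains no pricing.

```python
def samples_per_month(interval_seconds: int, days: int = 30) -> int:
    """Number of collection intervals in a month of the given length."""
    return days * 24 * 3600 // interval_seconds


def fleet_datapoints(hosts: int, metrics_per_host: int, interval_seconds: int) -> int:
    """Datapoints published per month across the fleet."""
    return hosts * metrics_per_host * samples_per_month(interval_seconds)


baseline = fleet_datapoints(400, 10, 60)   # one-minute resolution
high_res = fleet_datapoints(400, 10, 10)   # ten-second resolution
print(baseline, high_res, high_res // baseline)  # 172800000 1036800000 6
```

Dropping the interval from 60 to 10 seconds multiplies the volume by six before any extra dimensions or per-process metrics are added.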

The sixth is: what’s the migration path off the legacy agent? Two formats on disk, two systemd units, and two sets of log destinations make “flip a switch” impossible. The migration has to be incremental: parallel-run on a canary subset, prove the new emitter against the old, decommission only once the dashboards agree. Anything else risks a metric or log going dark during cutover.
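"Prove the new emitter against the old" can be as simple as comparing per-window values from both paths during the parallel run. The sketch below is one hypothetical way to do that; the 5% tolerance and the sample counts are assumptions, and the inputs could be log event counts or metric values per time window.

```python
import math


def disagreeing_windows(old: list[float], new: list[float], rel_tol: float = 0.05) -> list[int]:
    """Return indexes of time windows where the two emitters differ by more than rel_tol."""
    if len(old) != len(new):
        raise ValueError("series cover different numbers of windows")
    return [
        i for i, (a, b) in enumerate(zip(old, new))
        if not math.isclose(a, b, rel_tol=rel_tol)
    ]


# Illustrative per-window event counts from a canary host.
legacy = [120, 118, 0, 131]
unified = [121, 118, 0, 96]
print(disagreeing_windows(legacy, unified))  # [3]
```

An empty result over a long enough window is the signal that the old emitter can be decommissioned on that host.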

What we’ll filter on

  1. Metrics and logs in one binary: single agent vs separate processes?
  2. Memory / disk utilisation: does the agent collect host system metrics?
  3. StatsD / CollectD receivers: can applications publish custom metrics?
  4. Prometheus scraping: can it scrape /metrics endpoints and emit to CloudWatch?
  5. Config-as-parameter-store: can the config be managed centrally?
  6. X-Ray trace forwarding: does it carry traces as well?

The agent landscape

  1. Legacy CloudWatch Logs agent (awslogs). Logs only. Deprecated. Still works. No metrics, no StatsD, no Prometheus.

  2. Unified CloudWatch agent. One binary, does logs + metrics + StatsD + CollectD + Prometheus + X-Ray. Current recommended agent for EC2 and on-prem.

  3. CloudWatch agent on ECS / EKS. Same binary packaged as a container, typically deployed as a sidecar or as a DaemonSet in EKS. Same config format.

  4. FireLens (for ECS logs). A log-routing sidecar based on Fluent Bit/Fluentd, specifically for ECS task stdout/stderr. Not a general-purpose agent; it solves the “ECS task logs need to go somewhere” problem.

  5. Prometheus node_exporter + Prometheus server. The open-source alternative. Node exporter on hosts; Prometheus scrapes; Grafana or alertmanager consumes. Unrelated to CloudWatch unless bridged by the unified agent or a remote-write gateway.

  6. OpenTelemetry collector (ADOT). Vendor-neutral telemetry collector. Can receive OTLP traces, metrics, logs, and export to CloudWatch (via the awsemf exporter), X-Ray, Prometheus, and a long list of backends. More flexible than the unified agent; more operational overhead.

Side by side

Option               | Metrics + logs  | Host utilisation | StatsD / CollectD | Prometheus scraping | Central config | X-Ray traces
Legacy awslogs       | Logs only       | ✗                | ✗                 | ✗                   | ✗              | ✗
Unified CW agent     | ✓               | ✓                | ✓                 | ✓                   | ✓ (SSM)        | ✓
CW agent (ECS/EKS)   | ✓               | ✓ (host)         | ✓                 | ✓                   |                | ✓
FireLens             | Logs (ECS-only) | ✗                | ✗                 | ✗                   | ECS task def   | ✗
node_exporter + Prom | Metrics only    | ✓                |                   | ✓ (native)          | Prom config    | ✗
ADOT collector       | ✓               |                  | via OTLP          |                     |                | ✓

Acme’s consolidation target: the unified agent on every EC2 host; FireLens for ECS log routing where a separate log-routing layer is valuable; ADOT considered only for teams going deep on OpenTelemetry. Legacy agent retires; node exporter retires except in places where the in-cluster Prometheus is the primary metrics destination.

What the unified agent does end-to-end

Figure: data flow on one EC2 host (i-0abc1234).

  • Host sources: /var/log/* (app + system logs); /proc + /sys (mem, disk, cpu, net); localhost:9100/metrics (Prometheus exposition); UDP :8125 (StatsD from app); UDP :2000 (X-Ray trace data).
  • amazon-cloudwatch-agent (single Go binary, systemd service): logs input (file tails with parsing); metrics input (procstat, disk, cpu, mem); prometheus input (scrape intervals); statsd / collectd listeners; xray input (UDP receiver); buffers, batches, flushes on interval; config reload via amazon-cloudwatch-agent-ctl.
  • Destinations: CloudWatch Logs (one log group per source; structured, retention-tagged); CloudWatch Metrics (namespace: CWAgent; dims: InstanceId, ASG, Env); X-Ray (segments, subsegments, service map).
  • Config: SSM Parameter Store, /AmazonCloudWatch-Agent/acme-base (JSON config, versioned).

Host sources fan into one agent process; the agent writes to three AWS destinations; config is fetched from Parameter Store on start / reload.

One binary replaces what used to be three or four agents on a host. The config comes from Parameter Store so updating every agent in the fleet is a parameter update plus a systemd restart.

The config file in depth

The JSON config has four top-level sections: agent, metrics, logs, and traces. A concise example covering all four:

{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "cwagent"
  },
  "metrics": {
    "namespace": "acme/hosts",
    "aggregation_dimensions": [["InstanceId"], ["AutoScalingGroupName"]],
    "append_dimensions": {
      "InstanceId": "${aws:InstanceId}",
      "AutoScalingGroupName": "${aws:AutoScalingGroupName}",
      "ImageId": "${aws:ImageId}",
      "InstanceType": "${aws:InstanceType}"
    },
    "metrics_collected": {
      "mem":  { "measurement": ["mem_used_percent", "mem_available_percent"],
                "append_dimensions": { "Environment": "prod" } },
      "disk": { "measurement": ["used_percent", "inodes_free"], "resources": ["*"],
                "drop_device": true,
                "append_dimensions": { "Environment": "prod" } },
      "diskio": { "measurement": ["io_time", "read_bytes", "write_bytes"], "resources": ["*"],
                  "append_dimensions": { "Environment": "prod" } },
      "swap": { "measurement": ["swap_used_percent"],
                "append_dimensions": { "Environment": "prod" } },
      "net":  { "measurement": ["bytes_sent", "bytes_recv"], "resources": ["*"],
                "append_dimensions": { "Environment": "prod" } },
      "procstat": [
        { "exe": "nginx",
          "measurement": ["cpu_usage", "memory_rss", "num_threads"],
          "append_dimensions": { "Environment": "prod" } }
      ]
    }
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          { "file_path": "/var/log/nginx/access.log",
            "log_group_name": "/acme/nginx/access",
            "log_stream_name": "{instance_id}",
            "timezone": "UTC",
            "retention_in_days": 30 },
          { "file_path": "/var/log/messages",
            "log_group_name": "/acme/system/messages",
            "log_stream_name": "{instance_id}" }
        ]
      }
    }
  },
  "traces": {
    "traces_collected": {
      "xray": { "bind_address": "0.0.0.0:2000" }
    }
  }
}

Five patterns matter. append_dimensions adds EC2-metadata-derived dimensions to every metric, which is how instance IDs and ASG names get attached without hand-writing them; at the top level of metrics it honours only the four EC2 keys (InstanceId, AutoScalingGroupName, ImageId, InstanceType), so a static dimension such as Environment has to go in a plugin-level append_dimensions instead. aggregation_dimensions tells the agent which dimension combinations to emit as separate metrics (e.g. [["InstanceId"], ["AutoScalingGroupName"]] means emit metrics per instance and per ASG). namespace isolates the agent’s metrics from AWS-managed ones, so keep your custom namespace distinct. log_stream_name: "{instance_id}" uses the agent’s built-in placeholder to give each instance its own log stream. And retention_in_days is the one field that most greenfield configs forget and then regret six months later when the bill arrives.
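To see what aggregation_dimensions means for the number of series published, here is a simplified model. It assumes the agent publishes each datapoint once with its full dimension set and once more for every listed combination whose keys are all present; that is a reading of the behaviour described above, not the agent's actual code, and the dimension values are placeholders.

```python
def published_dimension_sets(
    dimensions: dict[str, str],
    aggregation_dimensions: list[list[str]],
) -> list[dict[str, str]]:
    """Model the dimension sets one datapoint is published under."""
    published = [dict(dimensions)]  # the full, un-aggregated set
    for keys in aggregation_dimensions:
        if all(key in dimensions for key in keys):
            rollup = {key: dimensions[key] for key in keys}
            if rollup not in published:
                published.append(rollup)
    return published


# Placeholder values for one host.
dims = {
    "InstanceId": "i-0abc1234",
    "AutoScalingGroupName": "web-asg",
    "InstanceType": "m5.large",
}
for entry in published_dimension_sets(dims, [["InstanceId"], ["AutoScalingGroupName"]]):
    print(sorted(entry))
```

Under this model every extra combination in aggregation_dimensions adds one more series per metric, which is why the list is worth keeping short.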

Adding Prometheus scraping:

"logs": {
  "metrics_collected": {
    "prometheus": {
      "prometheus_config_path": "/etc/amazon-cloudwatch-agent/prometheus.yaml",
      "emf_processor": {
        "metric_declaration": [
          { "source_labels": ["job"],
            "label_matcher": "node_exporter",
            "dimensions": [["InstanceId"]],
            "metric_selectors": ["^node_memory_Active_bytes$",
                                 "^node_filesystem_avail_bytes$"] }
        ]
      }
    }
  }
}

The prometheus block sits under logs rather than metrics because the agent publishes scraped metrics as embedded metric format (EMF) log events, which CloudWatch then turns into metrics. The prometheus_config_path points at a Prometheus-format scrape config (same as prometheus.yml). emf_processor is what tells the agent which scraped metrics to emit to CloudWatch; you rarely want every scraped metric there, because the cost adds up. metric_declaration lists the metrics to keep and their CloudWatch dimensions.
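The effect of metric_selectors can be previewed offline. The sketch below applies the two selectors from the config to a list of scraped metric names using Python's re module; the agent uses its own regex engine, so this is an approximation that holds for simple anchored patterns like these, and the helper name is an assumption.

```python
import re

SELECTORS = [r"^node_memory_Active_bytes$", r"^node_filesystem_avail_bytes$"]


def select_metrics(names: list[str], selectors: list[str]) -> list[str]:
    """Keep only the metric names matched by at least one selector regex."""
    patterns = [re.compile(selector) for selector in selectors]
    return [name for name in names if any(p.search(name) for p in patterns)]


scraped = [
    "node_memory_Active_bytes",
    "node_memory_MemFree_bytes",
    "node_filesystem_avail_bytes",
    "node_cpu_seconds_total",
]
print(select_metrics(scraped, SELECTORS))
# ['node_memory_Active_bytes', 'node_filesystem_avail_bytes']
```

Everything the selectors do not match is still scraped but never reaches CloudWatch, which is where the cost control comes from.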

The migration in depth

Migrating from legacy awslogs to the unified agent, on an instance that’s already running:

  1. Install the unified agent alongside. On Amazon Linux 2023: sudo dnf install amazon-cloudwatch-agent (on Amazon Linux 2, use yum instead of dnf). Both agents can coexist while you switch over, but they’ll try to tail the same files, so don’t leave both running for long.
  2. Convert the awslogs.conf to the unified agent’s JSON. The agent ships an amazon-cloudwatch-agent-config-wizard that imports the legacy config; failing that, awslogs.conf’s [logstream] sections map cleanly to logs.logs_collected.files.collect_list entries.
  3. Put the new config in Parameter Store:

    aws ssm put-parameter \
      --name /AmazonCloudWatch-Agent/acme-base \
      --type String \
      --value file://amazon-cloudwatch-agent.json \
      --overwrite
    
  4. On each instance, fetch and start:

    sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
      -a fetch-config \
      -c ssm:/AmazonCloudWatch-Agent/acme-base \
      -s
    
  5. Verify that metrics and logs are arriving in CloudWatch, then stop and remove the legacy agent: sudo systemctl stop awslogsd && sudo systemctl disable awslogsd && sudo yum remove -y awslogs.
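For step 2, when the wizard is not an option, the section-to-entry mapping can be sketched with the standard library. This is a hypothetical helper, not a replacement for the wizard: it copies only file, log_group_name, and log_stream_name, and it ignores every other awslogs.conf option (datetime_format, buffer settings, and so on). The sample config is illustrative.

```python
import configparser
import json


def awslogs_to_collect_list(conf_text: str) -> list[dict]:
    """Map awslogs.conf log sections to unified-agent collect_list entries."""
    parser = configparser.ConfigParser(interpolation=None)  # values contain '%'
    parser.read_string(conf_text)
    entries = []
    for section in parser.sections():
        opts = parser[section]
        if section == "general" or "file" not in opts or "log_group_name" not in opts:
            continue  # not a log-stream section
        entry = {"file_path": opts["file"], "log_group_name": opts["log_group_name"]}
        if "log_stream_name" in opts:
            entry["log_stream_name"] = opts["log_stream_name"]
        entries.append(entry)
    return entries


SAMPLE_CONF = """\
[general]
state_file = /var/lib/awslogs/agent-state

[/var/log/messages]
file = /var/log/messages
log_group_name = /acme/system/messages
log_stream_name = {instance_id}
datetime_format = %b %d %H:%M:%S
"""

print(json.dumps(awslogs_to_collect_list(SAMPLE_CONF), indent=2))
```

The output still needs review before it goes into Parameter Store, particularly timestamp handling and retention, which this sketch does not carry over.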

For a fleet, Run Command (with an SSM Automation runbook wrapping the steps) handles the rollout. 400 instances in batches of 20 is about two hours wall-clock.
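The two-hour figure follows from simple batch arithmetic: 400 instances in batches of 20 is 20 sequential batches, which comes to two hours if each batch takes about six minutes. The six-minute figure is inferred from the totals above rather than measured, and the instance IDs below are placeholders.

```python
import math
from typing import Iterator


def batches(instance_ids: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` instance IDs."""
    for start in range(0, len(instance_ids), size):
        yield instance_ids[start:start + size]


def rollout_minutes(instances: int, batch_size: int, minutes_per_batch: int) -> int:
    """Wall-clock minutes when batches run one after another."""
    return math.ceil(instances / batch_size) * minutes_per_batch


fleet = [f"i-{n:04d}" for n in range(400)]  # placeholder IDs, not real instance IDs
print(len(list(batches(fleet, 20))))  # 20
print(rollout_minutes(400, 20, 6))    # 120
```

A smaller batch size lowers the blast radius of a bad config at the cost of a proportionally longer rollout.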

Retiring node_exporter on hosts where Prometheus is no longer the primary metrics destination: enable the prometheus scraping block in the unified agent config, point it at localhost:9100, deploy, then uninstall node_exporter. On hosts where the in-cluster Prometheus still needs node_exporter (because node_exporter metrics feed Prometheus-based alerts), leave it in place; the two systems can coexist.

A worked consolidation

Week 1: audit. Inventory reveals 130 hosts on legacy awslogs, 220 on unified, 40 running no agent, 50 with node exporter. Of the 50 with node exporter, 30 also run the in-cluster Prometheus side by side for historical reasons.

Week 2: build the canonical unified config in Parameter Store. Dimensions: InstanceId, AutoScalingGroupName, Environment, Team. Metrics: mem, disk, swap, net, procstat for the three key processes across services. Logs: standard /var/log/messages, plus per-service log groups. Retention: 30 days for operational, 90 for audit.
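Because the retention decision is made in this config, it is worth checking the canonical file before it goes into Parameter Store. The sketch below is a hypothetical lint that lists log groups whose collect_list entry sets no retention_in_days; the sample mirrors the two entries from the example config earlier in this post.

```python
def groups_without_retention(config: dict) -> list[str]:
    """Return log group names whose collect_list entry sets no retention_in_days."""
    entries = (
        config.get("logs", {})
        .get("logs_collected", {})
        .get("files", {})
        .get("collect_list", [])
    )
    return [e["log_group_name"] for e in entries if "retention_in_days" not in e]


CONFIG = {
    "logs": {"logs_collected": {"files": {"collect_list": [
        {"file_path": "/var/log/nginx/access.log",
         "log_group_name": "/acme/nginx/access",
         "retention_in_days": 30},
        {"file_path": "/var/log/messages",
         "log_group_name": "/acme/system/messages"},
    ]}}}
}

print(groups_without_retention(CONFIG))  # ['/acme/system/messages']
```

Running a check like this in the pipeline that publishes the parameter keeps a forgotten retention setting from reaching the fleet.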

Week 3-4: rollout to the 130 legacy-agent hosts via SSM Automation. Run Command installs the unified agent, fetches config, verifies, removes legacy. Two hosts fail (older AL2 release without the package); fixed by adding a bootstrap step to install from the RPM.

Week 5: nothing-installed hosts. Same runbook, different branch. All 40 hosts now on the unified agent.

Week 6: node_exporter retirement. The 20 hosts where Prometheus is no longer primary get node_exporter removed after verifying the unified agent is publishing the needed metrics to CloudWatch. The 30 where Prometheus is still primary keep node_exporter.

End state: the unified agent on every host, one config in Parameter Store, consistent dimensions across every metric, logs routed to per-service log groups. node_exporter survives only on the 30 hosts where Prometheus is still primary. The bill for custom metrics climbs slightly (node_exporter’s metrics weren’t in CloudWatch before); the operational overhead drops substantially (one agent instead of three).

What’s worth remembering

  1. Legacy awslogs agent is deprecated; the unified agent replaces it. One binary, does logs + metrics + StatsD + Prometheus + X-Ray.
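  2. CloudWatchAgentServerPolicy on the instance profile is the standard IAM. It already allows ssm:GetParameter on parameters whose names begin with AmazonCloudWatch-; a config stored under any other name needs an explicit ssm:GetParameter grant.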
  3. Config in Parameter Store is the scalable pattern. One parameter per fleet shape; amazon-cloudwatch-agent-ctl -a fetch-config -c ssm:<name> -s pulls and reloads.
  4. append_dimensions attaches EC2-metadata-derived dimensions automatically. InstanceId, AutoScalingGroupName, ImageId, InstanceType without hand-wiring.
  5. aggregation_dimensions controls which dimension combinations get emitted as separate metrics. Per-instance and per-ASG is a reasonable default; avoid blowing up the dimension cartesian product.
  6. Prometheus scraping is the node_exporter-to-CloudWatch bridge. prometheus_config_path + emf_processor scrapes the endpoint and emits selected metrics to CloudWatch. Not every scraped metric needs to go; cost discipline matters.
  7. log_stream_name: "{instance_id}" gives each host its own stream. Standard pattern; works for EC2 and on-prem with {hostname} as the alternate.
  8. FireLens is the ECS log answer; the unified agent is the EC2 log-and-metric answer. Don’t cross them; each is optimised for its environment.

One agent across the fleet is worth the migration even when the thing being replaced works. Three agents to patch, three configs to template, and two formats to diff is a steady-state operational tax. The unified agent is the default now; moving to it is moving toward the line of least resistance, not away from a burning platform. Three hundred and eighty instances in six weeks, one config in Parameter Store, and a fleet that can be reasoned about with a single systemctl status amazon-cloudwatch-agent.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.