Application Insights Without Dashboards

September 13, 2028 · 14 min read

DevOps Engineer Pro · DOP-C02 · part of The Exam Room

The situation

A team runs a customer-facing .NET application:

  • 20 × Windows Server 2022 EC2 instances running IIS, behind an ALB in eu-west-1.
  • A SQL Server 2022 RDS Multi-AZ instance (db.m6i.2xlarge).
  • An SQS queue for background job processing, polled by 5 × Windows Server EC2 instances running a Windows service.
  • A Redis ElastiCache cluster (2 shards, 3 replicas each) for session state.
  • An S3 bucket for user uploads.

Existing monitoring:

  • CloudWatch metrics per service (EC2 CPU, ALB target response time, RDS DatabaseConnections, etc.).
  • Custom dashboards per team, mostly copy-paste of a template from 2022.
  • Alarms on CPU > 80% and latency > 2s. Alarms fire often for transient blips; the team has started ignoring them.

Incidents:

  • Last month: a SQL Server long-running query caused ALB target response time to spike. Team spent 40 minutes finding the query via CloudWatch Logs Insights on the application logs.
  • Month before: an ALB 5xx spike turned out to be IIS worker process recycling. Found in the IIS event log after 30 minutes.
  • Month before that: a Redis eviction storm caused session loss. Found by noticing “ReplicationBytes” was unusually high on the Redis shard.

The asks:

  • Automated anomaly detection that correlates metrics across components rather than per-metric alarms.
  • Problem narratives, such as “elevated ALB latency correlates with SQL Server CPU spike starting at 14:32.”
  • Coverage for common workloads (.NET, SQL Server, IIS) out of the box.
  • Integration with SSM OpsCenter to open tickets for actionable problems.
  • Reduced dashboard maintenance: the team shouldn’t need to rebuild dashboards as the architecture evolves.

What actually matters

Before reaching for a tool, it’s worth pinning down what makes the current monitoring fail and what an answer has to do differently.

The first thing worth thinking about is what the team’s incidents have in common. The three recent incidents (a long-running SQL Server query, an IIS worker recycle, a Redis eviction storm) have the same shape: the symptom appeared on one component (ALB latency, ALB 5xx, missing sessions) but the cause was on another. Every minute of triage went on hopping between graphs trying to locate the source of the spike. The win isn’t “more metrics,” it’s correlation: one signal that says “these three things moved together at 14:32” instead of three independent alarms a human stitches together by hand.

The second is what counts as anomalous. A static threshold (“CPU > 80%, latency > 2s”) fires on every traffic ramp at 09:00 Monday and stays quiet through a slow degradation that never crosses the line. What the team needs is a baseline that knows the workload’s shape (different at 09:30 Monday than at 04:00 Sunday) and flags deviations from that shape, not deviations from an arbitrary number. The cost of the current threshold-only approach is the alarm fatigue the team already has: they’ve started ignoring alerts because most are noise.
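To make the difference between a static threshold and a shape-aware baseline concrete, here is a minimal sketch. It assumes a deliberately simple model (one mean and standard deviation per weekday-and-hour slot, with a z-score cut-off); the function names and sample values are illustrative, and this is not a description of how any AWS service computes its baselines.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, stdev


def build_baseline(samples):
    """Group (timestamp, value) samples by (weekday, hour) and summarise each slot."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[(ts.weekday(), ts.hour)].append(value)
    # Keep only slots with enough history to estimate a spread.
    return {
        key: (mean(values), stdev(values))
        for key, values in buckets.items()
        if len(values) >= 2
    }


def is_anomalous(baseline, ts, value, z_threshold=3.0):
    """Flag a value more than z_threshold deviations away from its slot's mean."""
    key = (ts.weekday(), ts.hour)
    if key not in baseline:
        return False  # no history for this slot yet, so no opinion
    mu, sigma = baseline[key]
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_threshold


# Four weeks of illustrative CPU history: busy Monday mornings, quiet Sunday nights.
monday_0900 = datetime(2024, 1, 1, 9, 0)  # a Monday
sunday_0400 = datetime(2024, 1, 7, 4, 0)  # a Sunday
history = []
for week, (busy, quiet) in enumerate([(83, 19), (85, 20), (87, 21), (85, 20)]):
    history.append((monday_0900 + timedelta(weeks=week), busy))
    history.append((sunday_0400 + timedelta(weeks=week), quiet))

baseline = build_baseline(history)
next_monday = monday_0900 + timedelta(weeks=4)
next_sunday = sunday_0400 + timedelta(weeks=4)
monday_ramp = is_anomalous(baseline, next_monday, 85)  # normal for this slot
sunday_creep = is_anomalous(baseline, next_sunday, 40)  # under 80%, but far off shape
```

With this data, 85% CPU on Monday morning is ordinary and stays quiet, while 40% on Sunday night is flagged even though a “CPU > 80%” alarm would never fire on it.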

The third is stack depth. Generic metrics (CPU, network, disk) catch generic problems. The team’s actual incidents needed stack-specific signals: IIS worker process recycle events, .NET CLR exception rates per host, SQL Server blocked-process counts, Redis eviction rate. A monitoring layer that only understands the AWS service surface (EC2 CPU, RDS connections) misses the layer where the actual application lives. Anything chosen has to read inside the OS and the application stack, not just outside the host.

The fourth is configuration upkeep. The team’s existing dashboards are copy-pastes of a 2022 template; nobody maintains them as the architecture evolves, and new services get added without monitoring until something breaks. Whatever’s chosen has to be zero-config (component types drive monitoring rules automatically), or it’ll rot the same way. “Add a new EC2 instance to the fleet” should produce monitoring without a JIRA ticket.

The fifth is where the operational output lands. Detecting a problem isn’t the same as someone acting on it. The team needs detected problems to become tickets in a queue (with runbook links, historical context, correlation evidence inline) rather than another tab in another console nobody opens. If the output is a dashboard, the team won’t watch it; if the output is a ticket with the narrative pre-written, they’ll work it.

The sixth is what the team has to do versus what they get for free. Some signal types are application-specific (orders-per-minute, checkout success rate) and have to be published manually. The monitoring layer should be able to incorporate those custom signals into the same correlation model, so that a drop in orders-per-minute correlates with the SQL Server spike that caused it rather than sitting in a separate dashboard.

And finally, the boundary of what this layer should try to do. Problem detection, tracing, log search, and synthetic checks are different jobs. A layer that tries to do all of them does each badly. The honest framing is: pick a layer that does problem detection well, accept that distributed tracing (X-Ray), log search (Logs Insights), and user-experience monitoring (RUM/Synthetics) sit beside it rather than under it, and route between them when triage demands it.

What we’ll filter on

Filtering:

  1. Cross-component correlation, not per-metric alarms.
  2. Automatic component detection, no manual metric configuration.
  3. Stack-specific depth: .NET, SQL Server, IIS, etc.
  4. Integration with OpsCenter / Incident Manager.
  5. Low maintenance as architecture evolves.

The application-monitoring landscape

1. Per-metric CloudWatch alarms + custom dashboards. Status quo. Fails correlation; dashboards drift from reality.

2. CloudWatch Application Insights (basic tier). Automatic component detection, generic metrics, basic anomaly detection. Works but under-utilises the stack the team runs.

3. CloudWatch Application Insights + Enhanced tier. Deep .NET/SQL Server integration: SQL query-level diagnostics, IIS worker process monitoring, .NET CLR exceptions, W3C log correlation. Typical recommended setup for .NET estates.

4. CloudWatch Container Insights. Equivalent service for ECS, EKS, Fargate. Different observability surface (pods, tasks, node pressure). Complementary to Application Insights, not a substitute.

5. CloudWatch Synthetics + Evidently + RUM. End-user-facing monitoring: synthetic canaries, A/B experiment outcomes, real user metrics. Complementary; fills the “what is the customer experiencing” gap.

6. Third-party APM (Datadog, New Relic, AppDynamics). Mature application-performance monitoring with deeper code-level tracing. Licence cost. Strong for distributed tracing; overkill for simple 3-tier apps.

Side by side

Option · Correlation · Auto-detect · Stack-specific · OpsCenter integration · Low maintenance
Per-metric CW alarms Manual
Application Insights basic Partial
Application Insights Enhanced
Container Insights ✓ (container-scoped) Container-specific
Third-party APM Partial Partial

The pick: Application Insights Enhanced for the .NET estate; Container Insights for any ECS/EKS workloads the team runs alongside; RUM/Synthetics for the customer-facing edge.

How Application Insights builds the model

Resource group (tag Application=checkout):

  • ALB (checkout-alb): TargetResponseTime, 5xx
  • EC2 Windows × 20: IIS, .NET CLR, Windows EventLog
  • RDS SQL Server: CPU, DatabaseConnections, SQL errors
  • ElastiCache Redis: CPUUtilization, Evictions
  • SQS (jobs-queue): ApproximateNumberOfMessages
  • S3 (uploads): 4xx, 5xx response rates

Application Insights:

  • Auto component detection: matches resource types → component templates; configures CloudWatch agent per host
  • Metric & log collection: component-specific metrics (IIS, .NET CLR); logs: W3C IIS, Windows EventLog, SQL errorlog
  • ML anomaly engine: learns baselines per metric per component; time-of-day + day-of-week patterns; detects anomalies vs baseline
  • Problem correlation: groups temporally correlated anomalies; infers root cause from component graph; severity: Low / Medium / High
  • Enhanced tier adds: SQL query-level, IIS worker process, .NET CLR exception details, pattern match

Outputs:

  • Problem dashboard: list of active + resolved problems; root cause narrative per problem
  • OpsCenter OpsItems: one item per problem; linked runbooks + history
  • EventBridge / SNS: problem-detected events → Incident Manager for Sev 1/2
  • Config, telemetry: CloudWatch logs per component; Logs Insights queries queued
  • Retention: problems stored 6 months

Tag-based resource group feeds Application Insights; component types drive metric collection; ML baselines catch anomalies; correlated anomalies become problems; problems flow to OpsCenter.
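The “groups temporally correlated anomalies” step can be illustrated with a small sketch. It assumes the simplest possible rule (anomalies whose start times fall within a fixed window of each other belong to one problem) and ignores the component graph entirely; the `Anomaly` type, the five-minute window, and the per-anomaly timestamps are all hypothetical, not the service's actual algorithm or output.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Anomaly:
    component: str
    metric: str
    started_at: datetime


def group_into_problems(anomalies, window=timedelta(minutes=5)):
    """Cluster anomalies that start within `window` of the previous anomaly."""
    problems = []
    for anomaly in sorted(anomalies, key=lambda a: a.started_at):
        if problems and anomaly.started_at - problems[-1][-1].started_at <= window:
            problems[-1].append(anomaly)  # close in time: same problem
        else:
            problems.append([anomaly])  # gap too large: start a new problem
    return problems


anomalies = [
    Anomaly("ALB/checkout-alb", "TargetResponseTime", datetime(2028, 9, 13, 14, 32, 15)),
    Anomaly("RDS/checkout-sql", "CPUUtilization", datetime(2028, 9, 13, 14, 31, 50)),
    Anomaly("EC2/i-abc123", "CLRExceptions", datetime(2028, 9, 13, 14, 32, 40)),
    Anomaly("S3/uploads", "5xxErrors", datetime(2028, 9, 13, 16, 5, 0)),
]
problems = group_into_problems(anomalies)
```

Here the three anomalies around 14:32 collapse into one problem and the unrelated S3 anomaly at 16:05 becomes a second one, which is the “one problem per incident” behaviour the team is asking for.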

The pick in depth

Resource group setup. Tag every resource belonging to the checkout application with Application=checkout, Environment=prod. A tag-based resource group checkout-prod selects them all. Application Insights monitors the group; adding a new resource is a tag away.
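A sketch of that setup with boto3 follows. It assumes the group name checkout-prod and the two tags from the paragraph above; the helper names are illustrative, and the AWS calls are wrapped in a function that is defined but not invoked here, so it should be read as a starting point rather than a tested deployment script.

```python
import json


def tag_query(tags):
    """Build a Resource Groups TAG_FILTERS_1_0 query matching all of the given tags."""
    return {
        "Type": "TAG_FILTERS_1_0",
        "Query": json.dumps({
            "ResourceTypeFilters": ["AWS::AllSupported"],
            "TagFilters": [{"Key": k, "Values": [v]} for k, v in tags.items()],
        }),
    }


def create_monitored_application(group_name, tags, region="eu-west-1"):
    """Create the tag-based group, then register it with Application Insights."""
    import boto3  # imported lazily so the query builder works without AWS access

    boto3.client("resource-groups", region_name=region).create_group(
        Name=group_name, ResourceQuery=tag_query(tags)
    )
    boto3.client("application-insights", region_name=region).create_application(
        ResourceGroupName=group_name,
        AutoConfigEnabled=True,  # let component detection drive the monitoring config
        OpsCenterEnabled=True,   # raise OpsItems for detected problems
    )


query = tag_query({"Application": "checkout", "Environment": "prod"})
# create_monitored_application("checkout-prod", {"Application": "checkout", "Environment": "prod"})
```

Keeping the query builder separate from the API calls makes it easy to review exactly which tags define the application before anything is created.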

Enhanced tier enabled. The tier is a per-application setting. Enhanced unlocks:

  • SQL Server query-level metrics: top queries by duration, by CPU, by read I/O. Links query-plan anomalies to problem reports.
  • IIS worker process (w3wp.exe) metrics: recycle events, CPU per worker, request queue length. Correlates worker recycling to ALB 5xx.
  • .NET CLR exceptions: exception rate, type, correlation to specific component instances.
  • Windows Service monitoring: status, start/stop events, crash recovery.

Problem narratives in practice. After two weeks of baselining, the first significant problem lands:

Problem P-0421: High severity
Duration: 14:32:15 - 14:41:08 UTC (8m 53s)

Correlated anomalies:

  • ALB/checkout-alb: TargetResponseTime spiked 3.2x baseline (baseline 180ms → peak 580ms)
  • EC2/i-abc123 (1 of 20): .NET CLR exceptions increased 40x baseline
  • RDS/checkout-sql: CPU spiked to 85% (baseline 35%), blocked processes > 12 (baseline 0)

Root cause inference: SQL Server CPU saturation with blocked processes suggests a long-running query. Correlated .NET CLR exceptions on one application instance may be the query caller or a downstream effect of database timeout.

Suggested runbook: sql-server-blocked-processes.runbook

The team opens the OpsItem, follows the runbook, identifies the long-running query via SQL Server’s Dynamic Management Views, and kills it. The problem closes automatically when the anomalies resolve.

Compared to the pre-Application Insights state: the 40 minutes spent finding the query becomes 5 minutes. The team saw the correlation immediately, not after manually stacking graphs.

OpsCenter integration. Every detected problem (medium severity and above, configurable) creates an OpsItem in Systems Manager OpsCenter. The OpsItem includes the problem narrative, the correlated anomalies, links to CloudWatch dashboards pre-filtered to the problem’s time window, and any matching runbooks the team has authored. The OpsItem queue becomes the operational board; working tickets out of OpsCenter instead of Slack threads centralises the response history.
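The service raises these OpsItems itself, so nothing below is required. The sketch only illustrates the “medium and above” gate and what such a ticket carries, for a team that wanted to reproduce the shape by hand. The problem record is a hypothetical dict layout; the returned keys follow the keyword arguments of the Systems Manager `create_ops_item` call.

```python
SEVERITY_ORDER = {"Low": 0, "Medium": 1, "High": 2}


def should_open_ops_item(severity, minimum="Medium"):
    """Return True when a problem's severity meets the configured floor."""
    return SEVERITY_ORDER[severity] >= SEVERITY_ORDER[minimum]


def ops_item_request(problem):
    """Shape a problem record (hypothetical layout) as create_ops_item arguments."""
    return {
        "Title": f"{problem['id']}: {problem['summary']}",
        "Description": problem["narrative"],
        "Source": "application-insights",
        "OperationalData": {
            "correlatedAnomalies": {
                "Value": "\n".join(problem["anomalies"]),
                "Type": "String",
            },
            "runbook": {"Value": problem["runbook"], "Type": "SearchableString"},
        },
    }


problem = {
    "id": "P-0421",
    "severity": "High",
    "summary": "SQL Server CPU saturation with blocked processes",
    "narrative": "SQL Server CPU saturation with blocked processes suggests a long-running query.",
    "anomalies": [
        "ALB/checkout-alb: TargetResponseTime 3.2x baseline",
        "RDS/checkout-sql: CPU 85% (baseline 35%)",
    ],
    "runbook": "sql-server-blocked-processes.runbook",
}
request = ops_item_request(problem) if should_open_ops_item(problem["severity"]) else None
```

Putting the runbook in a searchable field is a design choice: it lets the team later find every OpsItem that pointed at the same runbook.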

Custom metric inclusion. Business-level metrics (checkout-success-rate, orders-per-minute) get included by publishing them to CloudWatch with the Application=checkout dimension. Application Insights picks them up as part of the application and learns their baselines. A correlation between dropping orders-per-minute and rising SQL Server CPU produces a much more useful problem narrative than either metric alone.
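Publishing such a metric might look like the sketch below. The namespace Checkout/Business and the helper names are assumptions made for illustration, and the code covers only the publishing side with `put_metric_data`; how the metric is then picked up is as described above, not something this snippet demonstrates.

```python
from datetime import datetime, timezone


def business_metric(name, value, application="checkout", unit="Count"):
    """Build one CloudWatch datum carrying the Application dimension."""
    return {
        "MetricName": name,
        "Dimensions": [{"Name": "Application", "Value": application}],
        "Timestamp": datetime.now(timezone.utc),
        "Value": float(value),
        "Unit": unit,
    }


def publish(metric_data, namespace="Checkout/Business", region="eu-west-1"):
    """Send the data to CloudWatch; the namespace is a hypothetical choice."""
    import boto3  # imported lazily so the datum builder works without AWS access

    boto3.client("cloudwatch", region_name=region).put_metric_data(
        Namespace=namespace, MetricData=metric_data
    )


datum = business_metric("orders-per-minute", 412)
# publish([datum])
```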

What Application Insights doesn’t catch

  • Silent failures. If a component stops emitting metrics entirely (SQS queue empty because nothing is publishing), Application Insights sees “metric baseline is at floor” which may or may not be anomalous. Missing-data alarms still matter for these cases.
  • Long-term trends. Anomaly detection catches deviations from baseline; a metric that slowly worsens over months (disk filling up, memory leak) may stay within the drifting baseline. Traditional threshold alarms remain useful for these.
  • Multi-application root causes. A shared resource (Aurora cluster used by two apps) has its baselines learned per application; cross-application correlations are not modelled.
  • Deep code-level tracing. Application Insights sees .NET CLR exception counts and types; it does not see stack traces or individual request spans. X-Ray fills that gap.
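For the silent-failure case, a conventional alarm that treats missing data as breaching is one way to cover the gap. The sketch below builds the parameters for CloudWatch's `put_metric_alarm`; the alarm name, the 15-minute evaluation span, and the choice of the SQS NumberOfMessagesSent metric are assumptions for illustration and would need tuning to the real publishing cadence.

```python
def missing_data_alarm(queue_name, periods=3, period_seconds=300):
    """Alarm when a queue receives no messages, or reports no data at all."""
    return {
        "AlarmName": f"{queue_name}-no-messages-sent",
        "Namespace": "AWS/SQS",
        "MetricName": "NumberOfMessagesSent",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Sum",
        "Period": period_seconds,
        "EvaluationPeriods": periods,
        "Threshold": 0,
        "ComparisonOperator": "LessThanOrEqualToThreshold",
        # A gap in the metric counts as a breach instead of being ignored.
        "TreatMissingData": "breaching",
    }


def create_alarm(params, region="eu-west-1"):
    """Create the alarm; defined here for completeness, not invoked."""
    import boto3  # imported lazily so the parameter builder works without AWS access

    boto3.client("cloudwatch", region_name=region).put_metric_alarm(**params)


alarm = missing_data_alarm("jobs-queue")
```

The point of `TreatMissingData` here is that silence becomes a signal in its own right, which is exactly the case a baseline model can misread.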

The correct frame: Application Insights is a problem-detection layer. X-Ray is a tracing layer. CloudWatch Logs Insights is a log-query layer. Synthetics/RUM is a user-experience layer. They compose; none replaces the others.

What’s worth remembering

  1. Application Insights groups correlated anomalies into problems. One problem per incident rather than twenty alarms. Root cause narrative included.
  2. Resource group = application. Tag-based groups are how Application Insights knows which resources belong together. One Application Insights application per business application.
  3. Enhanced tier is the stack-aware depth. SQL Server query-level, IIS worker process, .NET CLR exceptions. Required for .NET estates; basic tier is under-powered.
  4. Component detection is automatic. AWS resource types → component templates → metric + log configuration. New resources in the group get monitored as soon as the tag is applied.
  5. Baselines are time-of-day and day-of-week aware. “Elevated at 09:30 Monday” is not the same signal as “elevated at 04:00 Sunday”; Application Insights knows the difference.
  6. OpsCenter integration gives the operational queue. Problems become OpsItems; runbooks link inline; response history centralises.
  7. Custom CloudWatch metrics join the model. Publish business metrics with the application dimension; Application Insights learns them and correlates them with infrastructure events.
  8. Complementary services fill the gaps. X-Ray for tracing, Logs Insights for log queries, Synthetics/RUM for user experience, Container Insights for container platforms. Application Insights is the problem-detection layer.

The team’s “open CloudWatch and stare” instinct was working as well as it could without correlation. With Application Insights, the correlations are the service’s job; the team’s job is working the OpsCenter queue, following the runbook, and improving the runbooks as they discover the patterns the ML catches but they didn’t previously know to look for. Dashboards stop being the primary tool; they become the drill-down surface for problems the service already found.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.