Picking the Right Tool to Debug VPC Reachability

March 05, 2029 · 12 min read

Advanced Networking · ANS-C01 · part of The Exam Room

The situation

The problem landed in the ticket queue at 09:12:

  • An ECS service in the prod VPC (10.10.0.0/16) says its connection to the RDS cluster in the data VPC (10.30.0.0/16) is timing out.
  • The RDS cluster endpoint resolves correctly.
  • The two VPCs are attached to a Transit Gateway; the TGW route tables and VPC route tables look correct at a glance.
  • Security groups on both the ECS service and the RDS cluster look correct at a glance.
  • The on-call engineer has spent 30 minutes running nc -zv from inside the container and staring at route tables.

The engineer has five tools within reach:

  • VPC Flow Logs. After-the-fact evidence of what did or didn’t happen.
  • VPC Reachability Analyzer. A managed path analysis that answers “can A reach B?” with a detailed hop-by-hop explanation.
  • VPC Network Access Analyzer. A scope-based analysis that answers “what paths in my VPC match this description?”
  • AWS Network Insights (not a fourth analyzer). The umbrella name, and the API namespace (create-network-insights-path and friends), under which Reachability Analyzer and Network Access Analyzer live. It looks like a separate tool in the CLI; it isn’t one.
  • tcpdump / ping / telnet. The old ways.

What actually matters

Before reaching for a tool, it’s worth naming what we actually need to know.

A reachability question has three parts:

  1. Is there a path, according to the configuration? Route tables, VPC attachments, TGW route tables, peering, gateway endpoints, NAT gateways, PrivateLink endpoints, all the layer-3 plumbing that has to cooperate for a packet to get from source IP to destination IP.
  2. Do the filters let it through? Security groups (stateful, per-ENI) and NACLs (stateless, per-subnet) both need to permit the 5-tuple.
  3. Does the application at the far end answer? Process listening, socket bound, TLS certificate valid, authentication not rejecting.
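The stateful/stateless distinction in part 2 is the one that most often surprises people, so here is a minimal sketch of it. It is a toy model, not how AWS evaluates rules: `AllowRule`, `sg_permits`, and `nacl_permits` are hypothetical names, rules are reduced to allow-only CIDR plus TCP port range, and real NACLs also have numbered deny rules with first-match-wins ordering.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class AllowRule:
    """Simplified allow rule: peer CIDR plus an inclusive TCP port range."""
    cidr: str
    port_from: int
    port_to: int

    def matches(self, peer_ip: str, port: int) -> bool:
        in_cidr = ip_address(peer_ip) in ip_network(self.cidr)
        return in_cidr and self.port_from <= port <= self.port_to


def sg_permits(src_egress, dst_ingress, src_ip, dst_ip, dst_port):
    """Stateful: only the initiating direction is checked; replies ride the tracked flow."""
    out_ok = any(r.matches(dst_ip, dst_port) for r in src_egress)
    in_ok = any(r.matches(src_ip, dst_port) for r in dst_ingress)
    return out_ok and in_ok


def nacl_permits(inbound, outbound, client_ip, client_port, dst_port):
    """Stateless, seen from the destination subnet: request in AND reply out must match."""
    request_ok = any(r.matches(client_ip, dst_port) for r in inbound)
    # The reply travels back to the client's ephemeral port, so it needs its own rule.
    reply_ok = any(r.matches(client_ip, client_port) for r in outbound)
    return request_ok and reply_ok


any_port = AllowRule("0.0.0.0/0", 0, 65535)
pg_from_prod = AllowRule("10.10.0.0/16", 5432, 5432)

print(sg_permits([any_port], [pg_from_prod], "10.10.5.22", "10.30.8.10", 5432))  # True
print(nacl_permits([pg_from_prod], [], "10.10.5.22", 49152, 5432))               # False
```

The second call fails even though the inbound rule matches: with no outbound rule covering the client’s ephemeral port, the reply never leaves the subnet. That is the stateless half of the 5-tuple check.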

Two different shapes of analysis show up in the AWS networking toolbox:

  • Point-to-point reachability: “given this source and this destination, can a packet get there, and if not, which named component is blocking it?” This is the shape that matches a ticket.
  • Scope-based path enumeration: “across the estate, find every path that matches a description like ‘any internet-facing ENI that can reach a resource tagged env=prod’.” This is the shape that matches a compliance question.

The two answer different questions and are not substitutes for each other. The ticket above wants the first; a monthly audit wants the second.
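To make the two shapes concrete, here is a small sketch over a made-up topology. Everything in it is hypothetical (the `EDGES` graph, the `TAGS` map, both function names); it only illustrates that one question takes two endpoints and returns one path, while the other takes a description and returns every path that fits.

```python
from collections import deque

# Hypothetical toy topology: node -> nodes it can forward traffic to.
EDGES = {
    "igw": ["eni-web"],
    "eni-web": ["eni-app"],
    "eni-app": ["tgw"],
    "tgw": ["eni-rds"],
    "eni-rds": [],
    "eni-batch": [],
}
# Hypothetical tags used by the scope-based query.
TAGS = {
    "eni-web": {"internet-facing"},
    "eni-rds": {"env=prod"},
    "eni-app": {"env=prod"},
}


def find_path(src, dst):
    """Point-to-point: breadth-first search for one path from src to dst, or None."""
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in EDGES.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


def enumerate_paths(src_tag, dst_tag):
    """Scope-based: every path whose source and destination match the tag description."""
    findings = []
    for src, src_tags in TAGS.items():
        if src_tag not in src_tags:
            continue
        for dst, dst_tags in TAGS.items():
            if dst != src and dst_tag in dst_tags:
                path = find_path(src, dst)
                if path is not None:
                    findings.append(path)
    return findings


print(find_path("eni-app", "eni-rds"))                     # the ticket-shaped question
print(enumerate_paths("internet-facing", "env=prod"))      # the audit-shaped question
```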

The deeper question: when should the engineer open tcpdump or Flow Logs instead? The answer is “when you’ve already confirmed the configuration permits the path, but traffic still isn’t flowing.” A configuration-level check tells you whether the network would let a packet through; flow evidence tells you whether packets actually went. If the config says reachable and the connection still fails, the problem sits at the endpoints rather than in the network path: process not listening, certificate, auth, or the application itself.

What we’ll filter on

  1. Answers “can A reach B?” definitively: an end-to-end path trace through the configuration.
  2. Points at the blocking component by ID: a named resource rather than “something, somewhere”.
  3. Covers filters (SG + NACL) as well as routes, not just layer 3.
  4. Supports scope-based compliance queries: find all paths matching X.
  5. Provides evidence that traffic actually flowed: after-the-fact observation.

The investigation-tool landscape

1. VPC Flow Logs. Per-ENI log of flows with source, destination, ports, action (accept/reject), bytes. Retrospective: tells you what happened, not what would happen. Useful for “did a packet arrive?” Useless for “why can’t a packet arrive?” beyond telling you it got rejected by an SG or NACL at a specific ENI.
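For reference, this is roughly what reading that evidence looks like. The sketch below parses the default (version 2) Flow Logs record format, whose fields are space-separated in the order listed in `FIELDS`. The function names and the sample record, including its account ID and timestamps, are made up for illustration; custom log formats would need a different field list.

```python
# Field order of the default (version 2) VPC Flow Logs record format.
FIELDS = (
    "version account_id interface_id srcaddr dstaddr srcport dstport "
    "protocol packets bytes start end action log_status"
).split()
INT_FIELDS = {"version", "srcport", "dstport", "protocol", "packets", "bytes", "start", "end"}


def parse_flow_record(line: str) -> dict:
    """Turn one default-format flow log line into a dict with typed fields."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    record = dict(zip(FIELDS, values))
    for name in INT_FIELDS:
        # NODATA / SKIPDATA records carry "-" in place of missing values.
        if record[name] != "-":
            record[name] = int(record[name])
    return record


def rejected_at(records, interface_id, dstport):
    """Records showing a REJECT at one ENI for one destination port."""
    return [
        r for r in records
        if r["interface_id"] == interface_id
        and r["dstport"] == dstport
        and r["action"] == "REJECT"
    ]


# Hypothetical sample record for the ticket's flow (protocol 6 = TCP).
sample = "2 123456789012 eni-rds-456 10.10.5.22 10.30.8.10 49152 5432 6 3 180 1736930000 1736930060 REJECT OK"
records = [parse_flow_record(sample)]
print(rejected_at(records, "eni-rds-456", 5432))
```

Note what the record does and does not say: it shows a REJECT at eni-rds-456, but not whether a security group or a NACL produced it, and nothing about which rule was missing.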

2. VPC Reachability Analyzer. One-shot “A → B?” analysis. Inputs: source, destination, protocol, and optionally a destination port and source/destination IP addresses. Output: reachable / not reachable, with a hop-by-hop explanation walking through every ENI, route table, SG, NACL, and gateway in the path. Ticks “answers A→B”, “points at blocking component”, “covers filters”.

3. VPC Network Access Analyzer. Scope-based analysis. Inputs: JSON scope describing source and destination criteria (CIDR, resource tags, ENI attributes). Output: all paths matching the scope. Ticks “scope-based compliance queries”.

4. Network Insights (umbrella). The name, and the network-insights-* API namespace, under which the two above live; not a separate tool.

5. Manual tcpdump / ping / telnet. The traditional toolbox. Comprehensive, but unhelpful without a hypothesis.

Side by side

Tool                    | A → B definitive          | Points at blocker    | Covers filters | Scope-based | Actual-traffic evidence
Flow Logs               | ✗ (retrospective)         | Partial (shows deny) |                |             | ✓
Reachability Analyzer   | ✓                         | ✓                    | ✓              |             |
Network Access Analyzer | ✓ (for all scope matches) | Partial              |                | ✓           |
tcpdump / manual        | With effort               | With effort          | With effort    |             |

Reading the table by use case:

  • “My ticket says A can’t reach B, why?” → Reachability Analyzer. First.
  • “Prove no production RDS has a path to the internet.” → Network Access Analyzer with a scope.
  • “Did traffic actually flow yesterday at 09:00?” → Flow Logs.
  • “Is the application even answering?” → After Reachability says reachable, it’s app-level. tcpdump or telnet.

A Reachability Analyzer path trace

Reachability Analyzer insight: eni-app-123 → eni-rds-456, TCP/5432
status: not reachable, explanations[].explanation-code = ENI_SG_RULES_MISMATCH (no security group rule applies)
forward-path-components[] below, in order:

  1. Source ENI. eni-app-123, 10.10.5.22:*, subnet-prod-a, VPC prod. Egress permitted.
  2. Egress filters. SG sg-app-789: egress allow all. NACL acl-prod-a: egress rule 100 allow. Pass.
  3. VPC routing. rtb-prod-a: 10.30/16 → tgw-0abc via tgw-att-prod. Routed to TGW.
  4. Transit Gateway. tgw-0abc, assoc RT: shared, route 10.30/16 via tgw-att-data. Forwarded.
  5. Data VPC ingress. VPC data 10.30/16, subnet-data-a. NACL acl-data: ingress 100 allow. Pass.
  6. Destination ENI + security group. eni-rds-456, 10.30.8.10, subnet-data-a, VPC data. SG sg-rds-987: ingress 5432 from sg-app-999 (old, unused); NO rule for sg-app-789. Blocked.

explanation-code: ENI_SG_RULES_MISMATCH
suggested fix: add sg-rds-987 ingress 5432 from sg-app-789, or from 10.10.0.0/16
(the ECS service was re-platformed yesterday; new SG, old rule never updated)

Hop-by-hop explanation ending at the specific security group rule (or missing rule). The tool tells you which resource is blocking and why, by ID.

The pick(s) in depth

For the 09:12 ticket: Reachability Analyzer. For the monthly compliance report: Network Access Analyzer.

Reachability Analyzer in practice.

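  1. Create the insight path. Source: the ECS task’s ENI. Destination: the ENI behind the RDS writer endpoint. Reachability Analyzer takes EC2 networking resources (network interfaces, instances, gateways, and the like), so look up the two ENIs rather than passing a task or cluster ID. Protocol: TCP, port 5432.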
    aws ec2 create-network-insights-path \
      --source eni-app-123 \
      --destination eni-rds-456 \
      --protocol tcp \
      --destination-port 5432
    
  2. Start the analysis.
    aws ec2 start-network-insights-analysis \
      --network-insights-path-id <path-id>
    

    Returns a NetworkInsightsAnalysisId; the status moves from running to succeeded (or failed), typically within seconds to a minute.

  3. Retrieve the result.
    aws ec2 describe-network-insights-analyses \
      --network-insights-analysis-ids <analysis-id>
    

    Output includes NetworkPathFound: true/false, a ForwardPathComponents[] array with every hop, and (crucially) an Explanations[] array that names the specific ENIs, route tables, SGs, NACLs, gateways, or rules involved. If the path is blocked, each explanation carries an ExplanationCode such as ENI_SG_RULES_MISMATCH (no security group rule applies), SUBNET_ACL_RESTRICTION (a NACL blocks the traffic), or NO_ROUTE_TO_DESTINATION (no applicable route).

  4. Act on the named resource. If it says “SG sg-rds-987 has no inbound rule allowing sg-app-789 on port 5432,” that is both the diagnosis and the fix. Add the rule, re-run the analysis, confirm green.
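The same four steps can be scripted. The sketch below assumes a boto3 EC2 client is passed in as `ec2` (for example `boto3.client("ec2")`) and uses the three API calls that correspond to the CLI commands above. The function name `analyze_path`, the polling interval, and the timeout are my own choices, not anything the service prescribes.

```python
import time


def analyze_path(ec2, source, destination, port, poll_seconds=2.0, timeout_seconds=120.0):
    """Create a path, run one analysis, and return (reachable, explanation_codes).

    `ec2` is assumed to be a boto3 EC2 client, e.g. boto3.client("ec2").
    """
    path = ec2.create_network_insights_path(
        Source=source,
        Destination=destination,
        Protocol="tcp",
        DestinationPort=port,
    )["NetworkInsightsPath"]

    analysis_id = ec2.start_network_insights_analysis(
        NetworkInsightsPathId=path["NetworkInsightsPathId"]
    )["NetworkInsightsAnalysis"]["NetworkInsightsAnalysisId"]

    # Poll until the analysis leaves the "running" state or we give up.
    deadline = time.monotonic() + timeout_seconds
    while True:
        analysis = ec2.describe_network_insights_analyses(
            NetworkInsightsAnalysisIds=[analysis_id]
        )["NetworkInsightsAnalyses"][0]
        if analysis["Status"] != "running":
            break
        if time.monotonic() > deadline:
            raise TimeoutError(f"analysis {analysis_id} still running")
        time.sleep(poll_seconds)

    if analysis["Status"] != "succeeded":
        raise RuntimeError(analysis.get("StatusMessage", "analysis failed"))

    codes = [e.get("ExplanationCode") for e in analysis.get("Explanations", [])]
    return analysis["NetworkPathFound"], codes
```

Called as `analyze_path(ec2, "eni-app-123", "eni-rds-456", 5432)`, it returns the verdict plus whatever explanation codes the analysis reported. Passing the client in, rather than creating it inside the function, keeps the sketch easy to exercise against a stub.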

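Network Access Analyzer, for compliance. The scope below sketches the idea; the EC2 API (create-network-insights-access-scope) expresses the same structure as MatchPaths entries with PascalCase keys (Source, Destination, ThroughResources, ResourceStatement).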

{
  "accessScope": {
    "matchPaths": [
      {
        "source": {
          "resourceStatement": {
            "resources": ["arn:aws:ec2:*:*:vpc/*"]
          }
        },
        "destination": {
          "resourceStatement": {
            "resources": ["arn:aws:rds:*:*:cluster:*"],
            "resourceTypes": ["AWS::RDS::DBCluster"]
          }
        },
        "throughResources": [
          {
            "resourceTypes": ["AWS::EC2::InternetGateway"]
          }
        ]
      }
    ]
  }
}

Feed this scope to the analyzer and it finds every path, if any, from anywhere to any RDS cluster that traverses an Internet Gateway. For compliance (“no RDS cluster is internet-reachable”), a clean run is the evidence; any non-empty match is a finding.
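Turning “a clean run is the evidence” into a pass/fail gate might look like the sketch below. It assumes a boto3 EC2 client passed in as `ec2` and relies on my understanding of `get_network_insights_access_scope_analysis_findings` (paged via `NextToken`, findings under `AnalysisFindings`); check those names against the current boto3 documentation before depending on them. The function names are hypothetical.

```python
def collect_findings(ec2, scope_analysis_id):
    """Page through all findings of one completed access-scope analysis.

    `ec2` is assumed to be a boto3 EC2 client; response keys are as I understand
    the API and should be verified against the boto3 reference.
    """
    findings = []
    token = None
    while True:
        kwargs = {"NetworkInsightsAccessScopeAnalysisId": scope_analysis_id}
        if token:
            kwargs["NextToken"] = token
        page = ec2.get_network_insights_access_scope_analysis_findings(**kwargs)
        findings.extend(page.get("AnalysisFindings", []))
        token = page.get("NextToken")
        if not token:
            return findings


def compliance_verdict(findings):
    """An empty result is the evidence; any match is a finding to investigate."""
    if not findings:
        return "PASS: no matching paths"
    return f"FAIL: {len(findings)} matching path(s)"
```

The verdict function is deliberately separate from the API call so the same rule can be applied to findings exported by other means.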

When to skip both tools.

After Reachability Analyzer says “reachable” and the app still can’t connect, the network configuration is no longer the suspect; the issue is on the host or in the application. Check:

  • Is the process bound to 0.0.0.0 or just 127.0.0.1?
  • Is the RDS cluster in available state, not modifying?
  • Is the client’s IAM authentication correct, if using IAM auth?
  • Is the TLS certificate chain trusted?
  • Is there a retry-after-throttle on the RDS side?

Flow Logs at that stage confirm whether packets are arriving at the destination ENI at all (action ACCEPT) versus being rejected (action REJECT) or never showing up. If ACCEPTs are logged but the application doesn’t respond, it’s app-layer.
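That decision order can be written down as a small triage function. This is only a sketch of the reasoning in this section: the function name, the return labels, and the idea of passing in the list of Flow Logs `action` values seen at the destination ENI are all illustrative choices.

```python
def triage(config_reachable, actions_at_destination):
    """Suggest where to look next.

    config_reachable: verdict of the configuration analysis for the exact flow.
    actions_at_destination: Flow Logs `action` values ("ACCEPT" / "REJECT")
        observed at the destination ENI for that flow; empty if none were logged.
    """
    if not config_reachable:
        # The analysis already named a blocker; fix that first.
        return "network-config"
    if "ACCEPT" in actions_at_destination:
        # Packets got through the filters, so the endpoint is not answering.
        return "application"
    if "REJECT" in actions_at_destination:
        # Config says yes, live filters say no: the analysis probably used a
        # different source, port, or a stale snapshot. Re-run it with the real tuple.
        return "recheck-analysis-inputs"
    # Nothing logged at all: the client may not be sending to this ENI.
    return "no-traffic-seen"


print(triage(True, ["ACCEPT", "ACCEPT"]))  # application
print(triage(True, []))                    # no-traffic-seen
```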

A worked incident

09:12, ticket arrives.

09:13, on-call engineer runs Reachability Analyzer: source ENI of one ECS task, destination ENI of RDS writer, TCP/5432.

09:14, result: “not reachable. Security group sg-rds-987 has no inbound rule allowing sg-app-789 on port 5432.” Hop-by-hop shows routes and NACLs pass fine; the RDS security group is the blocker.

09:15, engineer checks git blame on the infrastructure-as-code. Yesterday, the ECS task’s security group was replaced (sg-app-999 retired, sg-app-789 created) as part of a re-platforming change. The RDS security group still references sg-app-999 on its ingress rule.

09:17, fix the IaC: add sg-app-789 to the RDS security group’s ingress allowlist on port 5432.

09:22, deploy reviewed and applied.

09:23, re-run Reachability Analyzer: “reachable.” ECS task’s retries succeed; the service dashboard goes green.

Total time: 11 minutes, roughly half of it spent getting the IaC change reviewed and applied. The diagnosis was a 60-second tool invocation.

What’s worth remembering

  1. Reachability Analyzer is the first tool for “A can’t reach B.” Source, destination, protocol, port. It walks the whole path and names the blocker.
  2. Explanation codes are specific. ENI_SG_RULES_MISMATCH, SUBNET_ACL_RESTRICTION, NO_ROUTE_TO_DESTINATION, etc. They tell you both the layer and the component.
  3. Network Access Analyzer is for scope-based compliance. “Find all paths matching X.” Use it for proofs, not for troubleshooting individual flows.
  4. Flow Logs confirm whether packets actually flowed. Use them after Reachability says “reachable” and the app still isn’t happy.
  5. Neither tool sees above layer 4. Application-level failures (TLS cert issues, auth failures, process not listening) are invisible to both. tcpdump and application logs catch those.
  6. Both tools are one-shot / batch. Not continuous monitoring. For continuous, Flow Logs + a SIEM rule is the shape.
  7. Run Reachability periodically, not only on incidents. Scheduled analyses for critical paths (e.g., prod → prod-db) catch drift before it becomes an outage. They’re cheap.
  8. Cross-account paths need trust. Reachability Analyzer can trace paths across accounts in the same AWS Organization once trusted access for it is enabled (optionally with a delegated administrator account); without that, an analysis only sees resources in the account that runs it.
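For point 7, the drift check itself is just a comparison between what each critical path is supposed to be and what the latest scheduled run reported. A minimal sketch, with hypothetical path names and a hypothetical `detect_drift` helper; how the verdicts are collected (a cron job, a Lambda, a pipeline step) is left open.

```python
def detect_drift(expected, latest):
    """Compare expected reachability per named path with the latest analysis verdicts.

    Both arguments map a path name to True (reachable) or False (not reachable).
    Returns human-readable drift messages, empty when everything matches.
    """
    drift = []
    for name, want in sorted(expected.items()):
        got = latest.get(name)
        if got is None:
            drift.append(f"{name}: no result in latest run")
        elif got != want:
            drift.append(f"{name}: expected reachable={want}, got reachable={got}")
    return drift


# Hypothetical critical paths: one that must work, one that must never work.
expected = {"prod -> prod-db": True, "internet -> prod-db": False}
latest = {"prod -> prod-db": False, "internet -> prod-db": False}
print(detect_drift(expected, latest))
```

Listing paths that must stay unreachable alongside those that must stay reachable lets the same check cover both outage drift and exposure drift.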

“Can this reach that?” is one of the most common questions in AWS networking, and AWS shipped a purpose-built tool to answer it. Reachability Analyzer often returns an answer in about 60 seconds, well before a tcpdump session would, and it tells you why, not just what. Network Access Analyzer is the compliance-shaped sibling. The work isn’t opening twelve console tabs and eyeballing route tables; it’s asking the tool that’s designed for the question, and acting on the named component it returns.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.