How to Replace SSH Bastions With Session Manager

August 18, 2027 · 16 min read

CloudOps Engineer · SOA-C03 · part of The Exam Room

The situation

A team runs 200 EC2 instances across three VPCs, all in private subnets with no public IPs. Today, shell access goes through a bastion fleet of three hosts in a public subnet, each accepting inbound SSH on port 22 from the office egress ranges. Engineers authenticate with a shared SSH key pair, rotated every quarter. Audit consists of the CloudTrail AssumeRole that preceded the bastion jump, plus whatever the bastion host logs locally, which is “somebody connected” and not much more.

The security team wants the inbound port 22 closed at the edge, the shared keys retired, and every shell session tied to a named engineer with full input and output captured. The on-call still needs the occasional port tunnel for psql into RDS or for hitting an internal admin UI on localhost:8080. And the three bastion hosts, along with their patching rota and their security-group exceptions, should go away with the migration.

What actually matters

Before reaching for a service, it’s worth weighing what is actually wrong with the current setup, because it’s not that bastions are obsolete. It’s that the current arrangement couples three separable concerns into one badly-fitting shape.

The authentication layer is the first concern. Today, a shared SSH key authenticates “somebody on the team” rather than “Ravi.” The key is in a password manager, which means anyone with access to the vault is indistinguishable on the audit trail from anyone else. A fix needs each shell session to begin with a named human identity, ideally one managed by the SSO system the company already operates, not a second account fleet.

The transport layer is the second concern, and it’s the one the bastion is solving. SSH traverses an inbound TCP port, which needs a security group, which needs a public IP, which needs a host to terminate on, which needs patching and monitoring and scaling. Every one of those surfaces is a legitimate thing to worry about, and none of them is essential to the actual goal of getting a shell on a private instance. A fix that eliminates the transport story (no inbound ports, no public host) collapses a large amount of operational work.

The audit layer is the third concern, and it’s where the current setup fails most expensively. CloudTrail sees the IAM principal, the bastion sees the inbound connection, and neither sees what was typed after the prompt appeared. A serious audit story captures every keystroke and every byte of output, attributable to a named engineer, stored somewhere the engineer can’t edit after the fact. That is not something a bastion host can provide without elaborate extra tooling (sshrc, script, auditd) that each engineer can disable if they want to.

The blast radius of a compromise matters too. The shared SSH key is a single credential with access to 200 production instances, rotated quarterly. If it leaks, the rotation story is “tell everyone to stop using it and deploy a new one.” A per-engineer credential with IAM-scoped access can be revoked individually, in seconds, with no coordination.

And finally, the operational complexity the team has to carry. Three bastion hosts is a fleet; fleets have upgrade windows, hardening baselines, CVE response timelines, and, on a bad day, their own incidents. The right fix should delete a fleet, not add one.

What we’ll filter on

  1. No inbound SSH from the internet. Whatever the answer is, it cannot require a public shell endpoint.
  2. Per-engineer identity. The access grant is IAM-scoped to a human or named role, not to a shared secret.
  3. Full-session audit. Input and output, every keystroke typed, every byte returned.
  4. Port forwarding. The ability to tunnel a TCP port from an engineer’s laptop to a port on the instance (or, via the instance’s network, to an RDS endpoint).
  5. Low operational overhead. No new fleet to run. Ideally, “turn something on” rather than “build, secure, scale, and patch more infrastructure.”

The remote-access landscape

SSH bastion fleet (status quo). Three EC2 instances in a public subnet with security groups allowing inbound 22 from the office CIDR; engineers SSH through them using -J. Cheap and familiar. Fails on attribute 1 outright; fails on attribute 2 with shared keys; fails on attribute 3 because CloudTrail doesn’t see keystrokes and the bastion’s own logs don’t capture remote-shell traffic; fails on attribute 5 because the three hosts need patching, monitoring, and replacing.

AWS Systems Manager Session Manager. SSM Agent runs on each instance and makes an outbound HTTPS call to the Systems Manager endpoints. Engineers call ssm:StartSession via the console or CLI, and a TLS 1.2-encrypted channel opens between the engineer’s Session Manager plugin and the agent. No inbound ports required. IAM on the user controls who can start a session; IAM on the instance authorises the agent to talk back. A session preferences document, SSM-SessionManagerRunShell, configures KMS-encrypted session logging to S3 or CloudWatch Logs, RunAs identity, and idle-session timeouts, once per Region. Port forwarding happens via AWS-StartPortForwardingSession.

EC2 Instance Connect. Still SSH under the covers. The engineer pushes a short-lived (60-second) SSH public key onto the instance via the ec2-instance-connect:SendSSHPublicKey API, then SSHes normally. Keys are per-user and short-lived. But the transport is still SSH, so some path to port 22 has to exist, either inbound from the public internet, or via Instance Connect Endpoint (a managed tunnel in the VPC). Even with the endpoint, audit stops at “Instance Connect pushed a key for user X”; the shell transcript is not captured.

VPN + direct SSH. Engineers VPN into the VPC, then SSH from their laptop’s assigned VPN IP. The public SSH surface shrinks. Port 22 is still open inside the VPC, and identity and audit still depend on the SSH keys on the instances. Substitutes one box for another.

AWS Verified Access. Zero-trust access to internal HTTP applications, keyed by identity-provider claims and device posture. Useful for Jenkins, internal dashboards, the kind of thing that used to sit behind a VPN. Not a shell tool.

Side by side

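Option | No inbound SSH | Per-engineer identity | Full-session audit | Port forwarding | Low ops overhead
SSH bastion fleet | No | No | No | Yes | No
Session Manager | Yes | Yes | Yes | Yes | Yes
EC2 Instance Connect | Partial | Yes | No | Yes | Yes
VPN + direct SSH | Partial | No | No | Yes | No
AWS Verified Access | Yes | Yes | No | No | Yes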

One survives: Session Manager. It is the AWS-native story for shell access without inbound ports, with IAM-scoped identity, KMS-encrypted session logging, and port forwarding in the same service.

The two shapes, side by side

[Diagram, two panels]
Bastion path (status quo): Engineer (shared SSH key) -> inbound :22 -> Bastion (public subnet; SG allows 22 from office; fleet of 3) -> inbound :22 -> EC2 (private; sshd listening; shared key authorised). CloudTrail: login events only, no keystrokes. Result: shared keys, two inbound-22 hops, shell transcript unavailable.
Session Manager path: Engineer (IAM identity) -> ssm:StartSession over TLS 1.2 -> Systems Manager (SSM-SessionManagerRunShell, KMS key) <- outbound TLS only <- EC2 (private; SSM Agent; AmazonSSMManagedInstanceCore). CloudWatch Logs: stdin + stdout, KMS encrypted. Result: no inbound ports, IAM identity per engineer, full session transcript captured.
Bastion path depends on inbound port 22 and shared keys; Session Manager replaces both with outbound-only TLS from the SSM Agent and IAM-scoped identity per engineer.

Session Manager, in depth

Session Manager is a collection of pieces that fit together.

The SSM Agent. A small process on every managed node that maintains an outbound-only HTTPS connection to the Systems Manager and ssmmessages endpoints. Preinstalled on current Amazon Linux 2 and Amazon Linux 2023 AMIs, on Ubuntu Server AMIs, and on recent Windows Server AMIs. AMIs can ship an older version, so a nightly AWS-UpdateSSMAgent Run Command is the usual practice. No agent, no session.

The instance IAM role. The agent needs permission to register with Systems Manager and open the control and data channels it uses for sessions. AWS ships a managed policy for this, AmazonSSMManagedInstanceCore, which grants the ssm:UpdateInstanceInformation heartbeat plus the four ssmmessages:* actions (CreateControlChannel, CreateDataChannel, OpenControlChannel, OpenDataChannel). Attach it to the role in the instance profile. Without ssmmessages:OpenControlChannel in particular, the agent loses connectivity to Systems Manager and the instance disappears from the managed-node list.
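When a team uses a custom instance policy instead of the managed one, a quick static check can catch a missing channel action before the instance silently drops out of the managed-node list. The sketch below is a simplification, not an IAM evaluator: it only understands exact action names and the `*` and `service:*` wildcards, and the function name is hypothetical.

```python
# Actions the agent needs, as listed in the paragraph above.
REQUIRED_AGENT_ACTIONS = {
    "ssm:UpdateInstanceInformation",
    "ssmmessages:CreateControlChannel",
    "ssmmessages:CreateDataChannel",
    "ssmmessages:OpenControlChannel",
    "ssmmessages:OpenDataChannel",
}


def missing_agent_actions(policy: dict) -> set:
    """Return the required actions that no Allow statement grants.

    Assumption: only exact names, "*" and "service:*" are recognised;
    conditions, Deny statements and partial wildcards are ignored.
    """
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement is also legal JSON
        statements = [statements]

    granted = set()
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        granted.update(actions)

    def is_granted(action: str) -> bool:
        service = action.split(":")[0]
        return action in granted or "*" in granted or f"{service}:*" in granted

    return {a for a in REQUIRED_AGENT_ACTIONS if not is_granted(a)}


if __name__ == "__main__":
    custom = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": [
                "ssm:UpdateInstanceInformation",
                "ssmmessages:CreateControlChannel",
                "ssmmessages:CreateDataChannel",
                "ssmmessages:OpenDataChannel",
            ],
            "Resource": "*",
        }],
    }
    print(missing_agent_actions(custom))
```

Run against the example policy, the check reports the one action the paragraph singles out, ssmmessages:OpenControlChannel.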

The user IAM policy. The engineer’s identity needs ssm:StartSession plus the plumbing: ssm:DescribeSessions, ssm:GetConnectionStatus, ssm:ResumeSession, ssm:TerminateSession, and ssmmessages:OpenDataChannel on arn:aws:ssm:*:*:session/${aws:userid}-* so each engineer can only resume and terminate their own sessions. The grant scopes through two resources: the instance ARNs the policy allows, and the document ARN for the session (SSM-SessionManagerRunShell for interactive shell, AWS-StartPortForwardingSession for tunnels).

Tag-based scoping is the lever that makes 200 instances manageable. An engineer’s policy that grants ssm:StartSession with a condition "StringEquals": {"ssm:resourceTag/Team": "payments"} can only target instances tagged Team=payments. The instance tags are the source of truth.
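To make the two scoping mechanisms concrete, here is a minimal sketch that assembles an engineer policy as a Python dict and prints it as JSON. The statement IDs, the function name, and the wildcarded Region and account fields in the ARNs are assumptions for illustration; a real policy would usually pin them.

```python
import json


def engineer_session_policy(team: str) -> dict:
    """Build a tag-scoped Session Manager policy for one team (illustrative)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Instances: only those carrying the matching Team tag.
                "Sid": "StartSessionOnTeamInstances",
                "Effect": "Allow",
                "Action": "ssm:StartSession",
                "Resource": "arn:aws:ec2:*:*:instance/*",
                "Condition": {"StringEquals": {"ssm:resourceTag/Team": team}},
            },
            {
                # Documents: interactive shell and port forwarding.
                # Assumption: AWS-owned documents have an empty account field.
                "Sid": "StartSessionDocuments",
                "Effect": "Allow",
                "Action": "ssm:StartSession",
                "Resource": [
                    "arn:aws:ssm:*:*:document/SSM-SessionManagerRunShell",
                    "arn:aws:ssm:*::document/AWS-StartPortForwardingSession",
                ],
            },
            {
                "Sid": "SessionPlumbing",
                "Effect": "Allow",
                "Action": ["ssm:DescribeSessions", "ssm:GetConnectionStatus"],
                "Resource": "*",
            },
            {
                # Engineers can only resume and terminate their own sessions.
                "Sid": "OwnSessionsOnly",
                "Effect": "Allow",
                "Action": ["ssm:ResumeSession", "ssm:TerminateSession"],
                "Resource": "arn:aws:ssm:*:*:session/${aws:userid}-*",
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(engineer_session_policy("payments"), indent=2))
```

Keeping the team name as the only parameter reflects the point above: the policy stays generic, and the instance tags decide what it reaches.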

The session preferences document. A single SSM document per Region, named SSM-SessionManagerRunShell, holds the preferences every session inherits: s3BucketName and s3KeyPrefix for archival; cloudWatchLogGroupName for the CloudWatch Logs destination; cloudWatchStreamingEnabled: true so lines arrive in near-real-time rather than at session end; s3EncryptionEnabled and cloudWatchEncryptionEnabled for server-side encryption; kmsKeyId for the customer-managed KMS key; runAsEnabled + runAsDefaultUser; idleSessionTimeout and maxSessionDuration; shellProfile.linux for a shell snippet at session start.
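As a rough illustration of how those fields sit together, the sketch below builds the document content as a Python dict and serialises it. The field names are the ones listed above; the bucket name, key alias, timeout values, and default user are placeholders, not recommendations, and the helper name is hypothetical.

```python
import json


def session_preferences(bucket: str, log_group: str, kms_key_id: str,
                        default_user: str) -> dict:
    """Content for the SSM-SessionManagerRunShell document (illustrative values)."""
    return {
        "schemaVersion": "1.0",
        "description": "Regional Session Manager preferences",
        "sessionType": "Standard_Stream",
        "inputs": {
            "s3BucketName": bucket,
            "s3KeyPrefix": "sessions/",          # placeholder prefix
            "s3EncryptionEnabled": True,
            "cloudWatchLogGroupName": log_group,
            "cloudWatchEncryptionEnabled": True,
            "cloudWatchStreamingEnabled": True,  # stream lines during the session
            "kmsKeyId": kms_key_id,
            "runAsEnabled": True,
            "runAsDefaultUser": default_user,
            "idleSessionTimeout": "20",          # minutes, assumed value
            "maxSessionDuration": "60",          # minutes, assumed value
            "shellProfile": {"windows": "", "linux": ""},
        },
    }


if __name__ == "__main__":
    prefs = session_preferences(
        bucket="example-session-archive",        # hypothetical bucket
        log_group="/aws/ssm/session-audit",
        kms_key_id="alias/example-session-logs", # hypothetical key alias
        default_user="ssm-fallback",             # hypothetical OS user
    )
    print(json.dumps(prefs, indent=2))
```

The printed JSON is the kind of content a pipeline could pass to an update of the document in each Region, which keeps the preferences in version control rather than in console clicks.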

Port forwarding. A separate document, AWS-StartPortForwardingSession, handles TCP tunnels. Parameters are portNumber (on the instance) and localPortNumber (on the laptop). For database tunnelling through the instance to an RDS endpoint, the sibling document AWS-StartPortForwardingSessionToRemoteHost takes an extra host parameter. Either way, the engineer sees localhost:5432 on their laptop and psql reaches Postgres through the agent’s outbound-only tunnel. Session logging does not capture the bytes inside a port-forwarding session; CloudTrail still records the StartSession call with the document name, so you know who opened which tunnel to which instance, just not the traffic inside it.

SCP enforcement org-wide. An SCP on the organisational unit denies the two kinds of action that would weaken the configuration: ssm:UpdateDocument, ssm:UpdateDocumentDefaultVersion, and ssm:DeleteDocument on arn:aws:ssm:*:*:document/SSM-SessionManagerRunShell, and ssm:StartSession when the session-document-access check (the ssm:SessionDocumentAccessCheck condition key) would be false. Pair that with a delegated administrator account that can update the document through a deployment pipeline and the preferences stay canonical across every member account.
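The document-protection half of that SCP might look like the sketch below. It covers only the first of the two denies described above; the exemption for the pipeline role via aws:PrincipalArn is one common pattern and is an assumption here, as are the role name and the function name.

```python
import json


def protect_preferences_scp(pipeline_role_arn_pattern: str) -> dict:
    """Deny edits to the preferences document except by the pipeline role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenySessionPreferencesChanges",
                "Effect": "Deny",
                "Action": [
                    "ssm:UpdateDocument",
                    "ssm:UpdateDocumentDefaultVersion",
                    "ssm:DeleteDocument",
                ],
                "Resource": "arn:aws:ssm:*:*:document/SSM-SessionManagerRunShell",
                # Assumed exemption: the deployment pipeline's role may still
                # update the document; everyone else is denied.
                "Condition": {
                    "ArnNotLike": {"aws:PrincipalArn": pipeline_role_arn_pattern}
                },
            }
        ],
    }


if __name__ == "__main__":
    scp = protect_preferences_scp(
        "arn:aws:iam::*:role/example-session-prefs-pipeline"  # hypothetical role
    )
    print(json.dumps(scp, indent=2))
```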

A worked session trace

Ravi needs to debug a payments service on i-0abc1234 in eu-west-1. His IAM role has ssm:StartSession on instances tagged Team=payments, is tagged SSMSessionRunAs=ravi, and has kms:GenerateDataKey on the session-logs key.

$ aws ssm start-session --target i-0abc1234 --region eu-west-1
Starting session with SessionId: ravi-0f2e3a9b7c4d5e6f
[ravi@ip-10-0-12-87 ~]$ sudo systemctl status payments-api
...
[ravi@ip-10-0-12-87 ~]$ exit

What happened in the background:

  1. The CLI called ssm:StartSession with DocumentName=SSM-SessionManagerRunShell. IAM evaluated Ravi’s policy: allowed because the target is tagged Team=payments; KMS usage allowed because his role is in the key policy.
  2. Systems Manager loaded the Region’s preferences. runAsEnabled: true, so it checked Ravi’s SSMSessionRunAs tag (ravi), confirmed the OS user exists, and picked up the KMS key, CloudWatch log group, and streaming settings.
  3. The agent opened an ssmmessages data channel on its existing outbound TLS connection. No security-group change.
  4. Every byte in and out streamed to /aws/ssm/session-audit, encrypted under the customer-managed KMS key, near-real-time.
  5. On exit the agent closed the session. A transcript object landed in S3. CloudTrail recorded StartSession and TerminateSession with Ravi’s principal, the instance, and the document name.

The audit artefacts from one session: the engineer’s IAM principal, the instance touched, the exact commands typed, the exact output returned, encrypted at rest, and queryable in Logs Insights.

Port forwarding for the database tunnel

Same engineer, half an hour later, needs a SELECT against the payments Postgres:

$ aws ssm start-session --target i-0abc1234 \
    --document-name AWS-StartPortForwardingSessionToRemoteHost \
    --parameters '{"host":["payments-prod.abcd.eu-west-1.rds.amazonaws.com"],"portNumber":["5432"],"localPortNumber":["15432"]}'

psql "host=localhost port=15432 ..." in another terminal hits RDS through the instance. CloudTrail records the StartSession call with the document name, target, and full parameter set, enough to answer “who tunnelled to prod?” Tunnel payload is opaque to Session Manager; when the bytes themselves need auditing, VPC Flow Logs on the instance’s ENI plus RDS audit logging catch the connection at both ends.
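The --parameters JSON is easy to get wrong by hand, since every value must be a list of strings. A small helper can build the argument list instead; this is a sketch with a hypothetical function name, and it only constructs the command shown above rather than running it.

```python
import json
import shlex


def remote_port_forward_command(target: str, host: str,
                                remote_port: int, local_port: int) -> list:
    """Build the aws ssm start-session argv for a tunnel through `target`."""
    parameters = {
        "host": [host],
        "portNumber": [str(remote_port)],       # port on the remote host
        "localPortNumber": [str(local_port)],   # port on the laptop
    }
    return [
        "aws", "ssm", "start-session",
        "--target", target,
        "--document-name", "AWS-StartPortForwardingSessionToRemoteHost",
        "--parameters", json.dumps(parameters),
    ]


if __name__ == "__main__":
    argv = remote_port_forward_command(
        "i-0abc1234",
        "payments-prod.abcd.eu-west-1.rds.amazonaws.com",
        5432,
        15432,
    )
    # shlex.join quotes the JSON so the line can be pasted into a shell.
    print(shlex.join(argv))
```

Passing the list to subprocess.run avoids shell quoting entirely; printing it with shlex.join is handy for runbooks.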

RunAs and the audit story

The default Session Manager identity is ssm-user, a local account the agent creates with passwordless sudo. Useful for getting started; useless for per-engineer audit, because every session lands as ssm-user and the OS’s own logs show the same username for everyone.

Run As fixes that. With runAsEnabled: true, Session Manager consults the IAM entity’s tags: if Ravi’s role has SSMSessionRunAs=ravi, the session starts as the OS user ravi. One IAM entity, one OS user, resolved at session start. If the tag is missing, Session Manager falls back to runAsDefaultUser in the preferences document; if that user doesn’t exist, the session fails. root is not a valid Run As target.
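The resolution order can be modelled in a few lines. This is a toy model of the rules as described in this post, not Session Manager’s actual implementation; the function and exception names are hypothetical, and it treats a missing OS user as a failure whether the name came from the tag or from the default.

```python
class RunAsError(Exception):
    """Raised when no valid OS user can be resolved for the session."""


def resolve_run_as_user(run_as_enabled: bool, entity_tags: dict,
                        default_user: str, os_users: set) -> str:
    """Pick the OS user for a session, following the order described above."""
    if not run_as_enabled:
        return "ssm-user"  # the default identity when Run As is off

    # 1. Tag on the IAM entity wins; 2. otherwise the preferences default.
    candidate = entity_tags.get("SSMSessionRunAs") or default_user
    if not candidate:
        raise RunAsError("no SSMSessionRunAs tag and no runAsDefaultUser")
    if candidate == "root":
        raise RunAsError("root is not a valid Run As target")
    if candidate not in os_users:
        raise RunAsError(f"OS user {candidate!r} does not exist on the instance")
    return candidate


if __name__ == "__main__":
    users = {"ravi", "oncall"}
    print(resolve_run_as_user(True, {"SSMSessionRunAs": "ravi"}, "oncall", users))
    print(resolve_run_as_user(True, {}, "oncall", users))
```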

The compounding effect on audit: Session Manager’s own transcripts capture every byte; sudo logs show the elevation with the correct caller; auditd ties every execve back to Ravi’s UID. Three independent trails, all naming the same engineer. The OS user has to exist on every instance; a ten-line systemd unit that reads the instance’s IAM role tags and creates matching local accounts with useradd keeps 200 hosts in sync.
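The account-sync step reduces to a set difference. The sketch below computes which useradd invocations a sync job would need; how the desired user list is obtained (tags, a config file, something else) is left out, and the function name and the choice of useradd flags are assumptions.

```python
def useradd_commands(desired_users, existing_users) -> list:
    """Return useradd argv lists for users that should exist but do not.

    root is skipped because it is not a valid Run As target.
    """
    missing = sorted(set(desired_users) - set(existing_users) - {"root"})
    return [
        ["useradd", "--create-home", "--shell", "/bin/bash", name]
        for name in missing
    ]


if __name__ == "__main__":
    for argv in useradd_commands(["ravi", "mei", "root"], ["ravi", "ssm-user"]):
        print(" ".join(argv))
```

Because the function is idempotent (users already present produce no command), it can run on every boot or on a timer without side effects.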

What’s worth remembering

  1. Session Manager’s outbound-only model. The SSM Agent opens a TLS connection out to Systems Manager; the engineer connects in through Systems Manager, not directly to the instance. No inbound ports, no bastion, no VPN.
  2. The two IAM policies. AmazonSSMManagedInstanceCore on the instance profile authorises the agent’s heartbeat and the four ssmmessages:* channel actions. On the engineer’s principal, ssm:StartSession scoped by ssm:resourceTag/<key> conditions constrains which instances they can reach.
  3. SSM-SessionManagerRunShell. One SSM document per Region holds the session preferences: S3 and CloudWatch Logs destinations, KMS key, encryption booleans, streaming flag, idle and max timeouts, Run As settings, shell profile.
  4. Session audit covers shell sessions but not port-forwarding content. Shell sessions get full stdin/stdout to CloudWatch Logs and S3 under KMS. Port-forwarding sessions get the StartSession metadata in CloudTrail but not the traffic inside the tunnel.
  5. RunAs maps IAM entity to OS user via the SSMSessionRunAs tag. Enables per-engineer accountability in the OS’s own logs. root is not valid; missing-user means session fails.
  6. Port forwarding is its own document. AWS-StartPortForwardingSession for a port on the node itself, AWS-StartPortForwardingSessionToRemoteHost for tunnelling through the node to an RDS endpoint or internal service.
  7. SCPs keep the preferences canonical. Deny ssm:UpdateDocument on SSM-SessionManagerRunShell from member accounts, enforce via delegated admin through a pipeline.
  8. EC2 Instance Connect, VPN + SSH, and Verified Access are adjacent tools. Instance Connect replaces shared SSH keys but keeps SSH and doesn’t log the shell transcript; VPN + SSH substitutes one public endpoint for another without solving audit; Verified Access is for HTTP apps, not shells.
  9. No agent, no session. The SSM Agent is the linchpin; a nightly AWS-UpdateSSMAgent Run Command keeps the fleet current.

Migrate to Session Manager. Ensure AmazonSSMManagedInstanceCore is attached to every EC2 instance profile and the SSM Agent is current. Configure SSM-SessionManagerRunShell in each Region to stream sessions to a KMS-encrypted CloudWatch Logs group with cloudWatchStreamingEnabled: true, archive to a KMS-encrypted S3 bucket, and enable RunAs with a per-engineer mapping via IAM-entity tags. Grant engineers ssm:StartSession scoped by instance tag to the teams they own, on the shell and port-forwarding documents. Enforce the preferences org-wide via an SCP that denies modifying SSM-SessionManagerRunShell outside the delegated-admin account. Decommission the three bastion hosts, close the inbound-22 security-group rules, delete the shared key. Two hundred instances, no inbound SSH, one named engineer per session, full transcript in CloudWatch Logs before the session even ends.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.