The situation
We’re the new platform team at a mid-sized company that’s been on AWS for eighteen months. The security team has a spreadsheet of “things that worry them” and wants a half-day walkthrough before they’ll sign off on a production launch. Three items on the spreadsheet are keeping them up at night, and all three look like security questions but turn out to be ownership questions.
- Incident one: A critical CVE in glibc was disclosed yesterday. We run around 80 EC2 instances on Amazon Linux 2023 plus a dozen containers on ECS Fargate. The security lead wants to know who patches what, and by when.
- Incident two: A developer published a new S3 bucket for marketing assets. Block Public Access got disabled “just for the launch” and the bucket ended up indexed by a search engine. Customer-uploaded testimonial videos were public for three weeks. The CISO’s question: “AWS is secure, how did this happen?”
- Incident three: AWS sent a scheduled-maintenance email for an EC2 instance in eu-west-1. Underlying hardware needs attention. The instance will restart on a new host inside a seven-day window. Operations wants to know if they need to do anything and whether this counts as an outage.
Three incidents. One framework. The work is figuring out which side of the line each one sits on.
What actually matters
Before pulling up the shared-responsibility diagram, it’s worth being clear about what the model is actually for.
The shared responsibility model isn’t a legal document or a compliance artefact; it’s a map. When something goes wrong, the map tells us who picks up the pager. When something needs doing proactively (patching, encryption, access reviews), the map tells us whose backlog it lands in. Most arguments about “is AWS secure?” are really arguments about which team on our side should have noticed something, and the model gives us language for having that argument cleanly.
The second thing the model does is set realistic expectations about what AWS provides. AWS secures the physical data centres, the hypervisors, the networking fabric, and the managed service code. We don’t get audit access to those things, but we get compliance reports (SOC 2, ISO 27001, PCI DSS) that say an independent auditor did. AWS is not going to notice that our S3 bucket is public, that our IAM policy is over-permissive, or that our application is vulnerable to SQL injection. Those are on our side of the line, and AWS’s job is to make sure the tools to fix them exist and work.
The third thing, and the one that confuses people the most, is that the line moves depending on which service we pick. Running Postgres on an EC2 instance puts OS patching, database patching, backups, and high-availability on us. Running the same database on RDS moves OS patching, database minor-version patching, and managed backups to AWS. Running Aurora Serverless moves capacity planning and scaling to AWS as well. Same workload, three different contracts, three different sets of things we’re responsible for. Picking a service isn’t just a technical decision; it’s a responsibility decision.
The fourth thing worth calling out is that some responsibilities are always ours regardless of service. Customer data is always our responsibility. AWS cannot and will not inspect it. Identity and access management is always our responsibility. AWS provides IAM, we decide who can do what. Client-side encryption, if we choose to do it, is always ours. Network traffic protection at the application layer is always ours. These are the things that move with us no matter how far up the stack we climb.
And finally, the model says nothing about outcomes. AWS being responsible for hypervisor patching doesn’t mean our instance can’t be restarted when that patching happens; it means AWS will do the patching and give us reasonable warning. We’re still responsible for designing our application to survive a single-instance restart. The division of responsibility doesn’t excuse either side from engineering.
What we’ll filter on
Distilling that into filters we can apply to each incident:
- Physical vs logical: is the concern about hardware, facilities, and hypervisors, or about configuration, code, and data?
- Service layer: IaaS (EC2), container service (ECS on EC2, ECS Fargate, EKS), managed service (RDS, DynamoDB), or serverless (Lambda, S3, API Gateway)?
- Who can see it: can AWS inspect the thing (a configuration flag in their control plane), or can only we see it (the contents of our bucket, the code in our function)?
- Who has the tools: even if AWS can see it, are the levers to change it on our side or theirs?
- Data, identity, or infrastructure: customer data and IAM are almost always ours; infrastructure moves.
The responsibility landscape
- What AWS is responsible for: “security of the cloud”. The physical data centres, including building security, power, cooling, and environmental controls. The physical hardware: servers, storage, networking gear. The hypervisor layer that isolates one customer’s instances from another’s. The AWS global network between Regions and Availability Zones. The managed service code itself: the software that implements S3, DynamoDB, RDS, Lambda. When AWS says Lambda scales automatically, the “automatically” part is AWS’s problem to make true. Patching of the hypervisor, of network gear, of the physical switches: all AWS. We don’t get to audit it directly; we get SOC 2 Type II, ISO 27001, PCI DSS, HIPAA BAA, FedRAMP, and a few dozen other compliance artefacts that say auditors did.
- What we’re responsible for: “security in the cloud”. Customer data, always. Identity and access management (IAM users, roles, policies, federation, MFA). Operating-system patching on anything we run the OS on (EC2 instances, including those running on Outposts). Application code, including libraries, dependencies, and whatever vulnerabilities they drag in. Network traffic protection at the instance level (security groups, NACLs, VPC routing choices). Client-side encryption and data integrity. Server-side encryption configuration: we pick the KMS key, we set the bucket policy. And service configuration: whether S3 Block Public Access is on, whether CloudTrail is enabled, whether RDS is in a public subnet.
- Where the line moves: service-dependent responsibilities. On EC2, OS patching, middleware patching, and database patching are ours. On RDS, AWS patches the OS and the database minor versions during maintenance windows; we still pick the engine version, the parameter group, and the major-version upgrade cadence. On ECS on EC2, we patch the container-host OS; on ECS Fargate, AWS does. On Lambda with a managed runtime, AWS patches the runtime; we manage the code and its dependencies (a vulnerable library in our deployment package is still our vulnerability, even though the runtime is AWS’s). The pattern: the higher up the abstraction stack we go, the more AWS takes on, and the less control and visibility we have into how they take it on.
- What’s always AWS, regardless of service. Hypervisor security. Physical facility security. The global network backbone. Hardware sourcing and disposal (secure hardware decommissioning, wiped-drive policies). The underlying compute hardware for any instance we launch. When the Dublin hypervisor email arrives, that maintenance is in this column no matter what EC2 instance type we chose.
- What’s always ours, regardless of service. The contents of our data. Who has access to it (IAM). Customer-specific configuration: bucket policies, security groups, parameter groups, function environment variables. Client-side encryption if we want it. Whether we turn on logging. Compliance of our application code with whatever regulatory regime applies to us (GDPR, HIPAA, PCI). When the S3 bucket leaked, every step in that story was in this column.
- The compliance shared-responsibility view. For certifications like PCI DSS, AWS covers some controls (physical access, hardware security), we cover some (application-level access control, data classification), and some are shared (encryption key management: AWS provides the service, we configure it). AWS publishes a “customer responsibility matrix” per compliance regime; the auditor asks us about the rows attributed to us.
Side by side
Mapping the three incidents, plus a reference row for each service tier, against who owns what:
| Concern | Physical / hardware | OS patching | Runtime / service code | Service configuration | Data contents | IAM / access |
|---|---|---|---|---|---|---|
| EC2 instance (any) | AWS | Us | Us (we own the OS and everything on it) | Us | Us | Us |
| ECS Fargate task | AWS | AWS | AWS | Us (task def, IAM) | Us | Us |
| RDS database | AWS | AWS | AWS (minor ver) | Us (param group, SG) | Us | Us |
| S3 bucket | AWS | N/A | AWS | Us (policy, BPA) | Us | Us |
| Lambda function | AWS | AWS | AWS (runtime) | Us (IAM, env, VPC) | Us | Us |
| glibc CVE (incident 1) | - | Us on EC2; AWS on Fargate | - | - | - | - |
| Leaked S3 bucket (incident 2) | - | - | - | Us | Us | Us |
| Hypervisor maintenance (incident 3) | AWS | - | - | - | - | - |
Reading across: incident three is entirely on AWS’s side of the line; incident two is entirely on ours; incident one splits. It’s ours for the 80 EC2 hosts and AWS’s for the infrastructure underneath the Fargate tasks (AWS patches the Fargate host OS and the underlying runtime; we still have to patch glibc in our container image if we baked our own).
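To make the table easier to query during a walkthrough, it can be encoded as a small lookup. This is only a sketch: the layer and service keys (`os_patching`, `fargate`, and so on) are names invented here for illustration, and the values are copied from the reference rows above rather than from any AWS API.

```python
from typing import Dict, List

# Columns of the ownership table, in order. The key names are illustrative.
LAYERS = ("physical", "os_patching", "runtime", "service_config", "data", "iam")

# Who owns each layer per service tier; values copied from the reference rows.
# "n/a" marks a layer that does not exist for that service.
OWNERSHIP: Dict[str, Dict[str, str]] = {
    "ec2":     dict(zip(LAYERS, ("aws", "us", "us", "us", "us", "us"))),
    "fargate": dict(zip(LAYERS, ("aws", "aws", "aws", "us", "us", "us"))),
    "rds":     dict(zip(LAYERS, ("aws", "aws", "aws", "us", "us", "us"))),
    "s3":      dict(zip(LAYERS, ("aws", "n/a", "aws", "us", "us", "us"))),
    "lambda":  dict(zip(LAYERS, ("aws", "aws", "aws", "us", "us", "us"))),
}


def owner(service: str, layer: str) -> str:
    """Return 'aws', 'us', or 'n/a' for a service tier and layer."""
    return OWNERSHIP[service][layer]


def our_layers(service: str) -> List[str]:
    """List the layers that land in our backlog for a given service tier."""
    return [layer for layer in LAYERS if OWNERSHIP[service][layer] == "us"]
```

Calling `our_layers("ec2")` returns every layer except `physical`, while `our_layers("s3")` returns only `service_config`, `data`, and `iam`, which is the "line moves with the service" point in miniature.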
Mapping the line, service by service
The picks in depth
Incident one: the glibc CVE. For the 80 Amazon Linux 2023 EC2 instances, glibc is operating-system code, and the OS is ours. The fix is ours: ensure each instance picks up the patched package. AWS Systems Manager Patch Manager is the usual lever: a patch baseline that approves security updates, a patch group targeted at the fleet, and a maintenance-window association that runs AWS-RunPatchBaseline on a schedule. AWS doesn’t run dnf upgrade for us; AWS ships the patched package in the Amazon Linux repos, notifies us via the security bulletin, and gives us Patch Manager to roll it out. The operational cadence is ours to set: a two-hour maintenance window Saturday morning for non-critical; a rolling update via an Auto Scaling Group instance-refresh for anything production.
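An out-of-band patch run can be sketched as a single Run Command invocation of AWS-RunPatchBaseline. The helper names below are hypothetical, the tag key and value are whatever the fleet actually uses, and the concurrency and error thresholds are illustrative defaults rather than recommendations. The client is passed in, so the sketch assumes a boto3 SSM client without importing boto3 itself.

```python
from typing import Any, Dict


def build_patch_command(tag_key: str, tag_value: str,
                        operation: str = "Scan") -> Dict[str, Any]:
    """Build send_command arguments that run AWS-RunPatchBaseline on tagged instances."""
    if operation not in ("Scan", "Install"):
        raise ValueError("operation must be 'Scan' or 'Install'")
    return {
        "DocumentName": "AWS-RunPatchBaseline",
        # Target every managed instance carrying the given tag.
        "Targets": [{"Key": f"tag:{tag_key}", "Values": [tag_value]}],
        "Parameters": {"Operation": [operation]},
        # Illustrative rollout limits: patch 10% at a time, stop at 5% failures.
        "MaxConcurrency": "10%",
        "MaxErrors": "5%",
        "Comment": f"Out-of-band {operation} for {tag_key}={tag_value}",
    }


def run_patch_baseline(ssm_client: Any, tag_key: str, tag_value: str,
                       operation: str = "Scan") -> str:
    """Send the command and return its CommandId for later status checks."""
    response = ssm_client.send_command(
        **build_patch_command(tag_key, tag_value, operation)
    )
    return response["Command"]["CommandId"]
```

With boto3 installed, `run_patch_baseline(boto3.client("ssm"), "Environment", "staging", "Install")` would patch a staging fleet first; defaulting to `Scan` keeps an accidental call read-only.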
For the Fargate containers, the responsibility is split. The Fargate host layer, the AWS-managed infrastructure the Fargate platform runs our tasks on, is AWS’s. AWS patches it, and they’ve published the Fargate platform version that includes the fix. But the container image that we built and pushed to ECR includes its own glibc (copied in from whatever base image we used). If our base image is public.ecr.aws/amazonlinux/amazonlinux:2023 with a vulnerable glibc, we still need to rebuild the image with the patched package and redeploy the service. AWS is not going to reach inside our container image and swap out libraries. The split here, “AWS patched the host, we rebuilt the image”, is the most common source of confusion with Fargate: people assume “serverless” means “AWS patches everything”, and it doesn’t.
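Once the image has been rebuilt and pushed, the redeploy half of that split might look like the sketch below. The function name is hypothetical, and it assumes the task definition references a tag that now points at the rebuilt image; if the task definition pins an image digest, a new task definition revision is needed instead. The client is passed in, so a boto3 ECS client is assumed but not imported.

```python
from typing import Any, Optional


def redeploy_service(ecs_client: Any, cluster: str, service: str) -> Optional[str]:
    """Force a new deployment so tasks restart and pull the rebuilt image.

    Returns the id of the new primary deployment, or None if none is reported.
    """
    response = ecs_client.update_service(
        cluster=cluster,
        service=service,
        forceNewDeployment=True,  # replace running tasks even if nothing else changed
    )
    deployments = response["service"].get("deployments", [])
    # The deployment being rolled out is reported with status PRIMARY.
    return next(
        (d["id"] for d in deployments if d.get("status") == "PRIMARY"),
        None,
    )
```

Forcing a deployment only replaces tasks; it does not rebuild anything, so the image pipeline still has to run first.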
Incident two: the leaked S3 bucket. S3 is AWS-managed storage; the service itself is secure. AWS patches S3’s internal software, encrypts data at rest by default, stores objects redundantly across at least three Availability Zones in most storage classes, and enforces object durability. The failure wasn’t in S3; it was in how we configured S3. Service configuration is on our side of the line, every time.
The fix has two layers. First, the immediate remedy: re-enable Block Public Access at the bucket level (and, ideally, at the account level so no bucket can be made public without a deliberate override), audit the object ACLs, rotate anything genuinely sensitive that was exposed, and revoke the misused developer credential if the change was made outside a sanctioned workflow. Second, the structural fix: account-level Block Public Access via an AWS Organizations policy, an SCP that denies s3:PutBucketPublicAccessBlock and s3:PutAccountPublicAccessBlock to everyone except a designated admin role (an SCP cannot judge whether a given bucket policy or ACL is “looser”, so the guard goes on the setting itself), AWS Config rules (s3-bucket-public-read-prohibited, s3-bucket-public-write-prohibited) that alert on drift, and CloudTrail data events on S3 to see object-level reads and writes. IAM Access Analyzer external-access findings would have flagged the bucket shortly after it became publicly accessible.
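Both layers can be sketched in a few lines. The first function re-enables all four Block Public Access settings on one bucket; the second builds an SCP document that denies changes to those settings for everyone except an admin role. The function names, the statement `Sid`, and the role ARN pattern are assumptions for illustration, and the S3 client is passed in rather than created here.

```python
import json
from typing import Any, Dict

# All four Block Public Access settings switched on.
BPA_ALL_ON = {
    "BlockPublicAcls": True,
    "IgnorePublicAcls": True,
    "BlockPublicPolicy": True,
    "RestrictPublicBuckets": True,
}


def reenable_bucket_bpa(s3_client: Any, bucket: str) -> None:
    """Immediate remedy: turn every BPA setting back on for one bucket."""
    s3_client.put_public_access_block(
        Bucket=bucket,
        PublicAccessBlockConfiguration=dict(BPA_ALL_ON),
    )


def build_bpa_guard_scp(admin_role_arn_pattern: str) -> Dict[str, Any]:
    """Structural fix: an SCP that denies BPA changes unless the caller
    matches the given role ARN pattern (hypothetical, supplied by us)."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyBpaChangesExceptAdmin",
            "Effect": "Deny",
            "Action": [
                "s3:PutBucketPublicAccessBlock",
                "s3:PutAccountPublicAccessBlock",
            ],
            "Resource": "*",
            "Condition": {
                "ArnNotLike": {"aws:PrincipalArn": admin_role_arn_pattern}
            },
        }],
    }


if __name__ == "__main__":
    # Print the policy JSON so it can be reviewed before attaching it to an OU.
    print(json.dumps(build_bpa_guard_scp("arn:aws:iam::*:role/SecurityAdmin"), indent=2))
```

Guarding the setting rather than individual bucket policies keeps the SCP simple: with Block Public Access on, a permissive policy or ACL is overridden instead of needing to be detected.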
Incident three: the hypervisor maintenance. This one is entirely on AWS’s side of the line: physical hardware, hypervisor, facility. AWS will live-migrate the instance if possible, or restart it on new hardware if not, within the maintenance window. We get an email because the event is visible to us: a scheduled stop/start from our perspective. Our responsibility is that we’ve architected the workload to survive a single-instance restart: at least two instances across two Availability Zones behind a load balancer, or an Auto Scaling Group that will replace the instance if the stop/start fails. AWS handled the hardware; we handled the blast radius. The instance being briefly unavailable during the restart is not an “AWS outage” unless the workload can’t tolerate it, and if the workload can’t tolerate a single-instance restart, that is our design problem, not AWS’s.
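Rather than waiting for the email, scheduled events can also be read from the EC2 API. The sketch below is a pure function over a `describe_instance_status` response, so it makes no AWS calls itself; the function name and the output keys are illustrative choices, and the response would come from a boto3 EC2 client called with `IncludeAllInstances=True`.

```python
from typing import Any, Dict, List

# EC2 prefixes the description of finished or withdrawn events with these markers.
_DONE_PREFIXES = ("[Completed]", "[Canceled]")


def pending_scheduled_events(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Extract scheduled events that are still upcoming from a
    describe_instance_status response."""
    pending = []
    for status in response.get("InstanceStatuses", []):
        for event in status.get("Events", []):
            if event.get("Description", "").startswith(_DONE_PREFIXES):
                continue  # already handled, nothing left to plan for
            pending.append({
                "instance_id": status["InstanceId"],
                "availability_zone": status.get("AvailabilityZone"),
                "code": event.get("Code"),  # e.g. instance-stop, system-reboot
                "not_before": event.get("NotBefore"),
            })
    return pending
```

A non-empty result is a prompt to check the design, not to escalate: the question for each instance listed is whether the workload survives that one instance going away.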
A worked walkthrough: one Tuesday afternoon
We run through the three items with the security team. Thirty minutes on the whiteboard, the shared-responsibility diagram in hand.
On the glibc CVE, the consultant opens the AWS Security Bulletins page and confirms the advisory is logged with a link to the patched Amazon Linux package. Patch Manager is already configured; she shows them the association targeting the Environment=prod tag, running AWS-RunPatchBaseline with Operation=Install against a baseline that auto-approves critical and important security updates after seven days. Because that delay would hold back a fix released yesterday, she adds the patched package to the baseline’s explicitly approved patches. She then kicks off an out-of-band run against the staging fleet first, watches Patch Manager’s compliance view go from 0% patched to 100% over forty minutes, and enables the production window for tonight. For Fargate, she shows the deployment pipeline: a nightly image rebuild from the tagged base image triggers automatically; the vulnerable containers will be replaced on the next deploy regardless, and a manual pipeline run will bring that forward to this evening.
On the leaked bucket, the consultant pulls up the Config timeline. The bucket was created with Block Public Access enabled on the 14th; a developer disabled BPA at the bucket level on the 17th at 14:22 UTC; the bucket remained public until BPA was re-enabled three weeks later at 09:14 UTC. CloudTrail tells the team which IAM role made the PutBucketPublicAccessBlock call with BlockPublicAcls=false. The fix proposed: an SCP on the OU denying s3:PutBucketPublicAccessBlock and s3:PutAccountPublicAccessBlock unless the calling principal is a specific admin role; a Config rule that alerts the security team’s Slack within five minutes of any BPA change; an IAM Access Analyzer external-access analyzer that would have caught the bucket becoming externally reachable. The security team asks whether AWS should have prevented this. The consultant points at the diagram: service configuration, customer responsibility; AWS supplies BPA, we supply the discipline to leave it on. The failure was ours, the tools to prevent it were AWS’s, and we didn’t use them.
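The CloudTrail step in that timeline can be sketched as a filter over already-fetched CloudTrail records. The field layout assumed here (`eventName` of `PutBucketPublicAccessBlock`, settings under `requestParameters.PublicAccessBlockConfiguration`, caller under `userIdentity.arn`) reflects how such records typically look and should be checked against a real record; the function name and output keys are illustrative.

```python
from typing import Any, Dict, List

BPA_SETTINGS = (
    "BlockPublicAcls",
    "IgnorePublicAcls",
    "BlockPublicPolicy",
    "RestrictPublicBuckets",
)


def _is_off(value: Any) -> bool:
    """Treat boolean False and the string 'false' as a disabled setting."""
    return value is False or str(value).lower() == "false"


def find_bpa_loosening(records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return who switched off which BPA settings, on which bucket, and when."""
    findings = []
    for record in records:
        if record.get("eventName") != "PutBucketPublicAccessBlock":
            continue
        params = record.get("requestParameters") or {}
        config = params.get("PublicAccessBlockConfiguration") or {}
        disabled = [name for name in BPA_SETTINGS
                    if name in config and _is_off(config[name])]
        if disabled:
            identity = record.get("userIdentity") or {}
            findings.append({
                "time": record.get("eventTime"),
                "bucket": params.get("bucketName"),
                "principal": identity.get("arn"),
                "disabled": disabled,
            })
    return findings
```

Calls that leave every setting on are ignored, so re-enabling BPA does not show up as a finding.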
On the hypervisor maintenance, she pulls up the event in AWS Health Dashboard: a scheduled-stop event for i-0f3b8c1d2e4f56789 with a seven-day window. The instance is part of an Auto Scaling Group with a minimum of three, spread across three Availability Zones. She shows the ASG’s health check config, the ALB target group, and the deregistration delay. If AWS stops the instance, the ASG will detect it as unhealthy, the ALB will stop sending traffic, a replacement instance will launch in the same AZ, and the maintenance completes without a customer-visible outage. She drafts a one-line response to operations: “No action required; the ASG handles it. Confirm post-maintenance that the replacement is healthy.”
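The "no action required" answer rests on a few properties of the Auto Scaling Group, which can be checked mechanically. The sketch below inspects one group entry as returned by `describe_auto_scaling_groups`; the function name and the thresholds (two instances, two Availability Zones, ELB health checks) are assumptions about what "survives a single-instance restart" means here, not AWS rules.

```python
from typing import Any, Dict, List


def restart_tolerance_issues(group: Dict[str, Any]) -> List[str]:
    """Return reasons an Auto Scaling Group may not ride out a
    single-instance restart; an empty list means no issue was found."""
    issues = []
    if group.get("MinSize", 0) < 2:
        issues.append("MinSize is below 2, so one stopped instance can mean zero capacity")
    if len(set(group.get("AvailabilityZones", []))) < 2:
        issues.append("group spans fewer than 2 Availability Zones")
    if group.get("HealthCheckType") != "ELB":
        # Advisory only: EC2 status checks still catch a stopped instance,
        # but ELB checks also catch an instance that is up yet not serving.
        issues.append("health checks do not use the load balancer")
    return issues
```

For the group in the walkthrough (minimum of three, three Availability Zones, behind an ALB) the first two checks pass; the third depends on how the health check was configured.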
The security team signs off. None of the three incidents were anything AWS could have fixed for us, and none of them were anything we couldn’t fix ourselves given the tools AWS provides.
What’s worth remembering
- Security of the cloud is AWS; security in the cloud is us. AWS’s side is the physical facility, the hardware, the hypervisor, the network fabric, and the managed service code. Our side is everything we can configure, everything we can read, and everything we put into the service.
- The line moves with the service tier. EC2 leaves the OS and everything above with us. Fargate takes the OS and runtime. RDS takes the database engine. S3 and Lambda take the whole platform. Higher abstractions, smaller customer surface, less control and less visibility.
- Three responsibilities never leave our side. Customer data, identity and access management, and service configuration. AWS cannot see our data, cannot decide who should access it, and cannot know what a “correct” bucket policy looks like for our business.
- Service configuration is where most leaks happen. Not broken AWS services, but services configured incorrectly. Block Public Access, IAM least-privilege, encryption on/off, logging on/off. AWS provides the knobs; we turn them.
- “AWS-managed” doesn’t mean “no responsibility”. A vulnerable library in a Lambda deployment package is still our vulnerability. A misconfigured RDS security group is still our problem. AWS manages the platform, not our use of it.
- Hypervisor and hardware maintenance are AWS’s job but our concern. They will handle the patching; we have to design for the restart. Multiple instances across multiple AZs behind a load balancer is the shape that turns AWS’s maintenance into our non-event.
- Compliance inherits the split. AWS’s SOC 2, ISO, PCI DSS, HIPAA BAA, and FedRAMP cover their side. We still have to pass the customer-responsibility rows on the shared-responsibility matrix for our chosen regime.
- The answer to “is AWS secure?” is almost always “which part?” Most incidents people blame on AWS are configuration errors on the customer side. The model gives us the language to identify which side, quickly, without the conversation turning into a finger-pointing exercise.
Shared responsibility isn’t a vendor contract footnote; it’s the map that tells each side what their job is, and what tools the other side provides for the job. AWS builds the secure foundation; we build on top of it without undoing that security. Three incidents, three corners of the map, one framework that makes all three make sense.