Series
The Exam Room
Exploring AWS, one service or situation at a time.
Exam Room · SAA-C03
The Closest Healthy Region
A multi-region application needs to route requests to the closest healthy region, failing over automatically when the preferred one drops out -- with no client-side retries and no extra health-check plumbing to maintain. Route 53 can do all of that in a single record set. Finding the right combination means touring all seven routing policies and the attributes that separate them.
Read article
The Archive Nobody Reads
Some data exists for compliance, not for use. Tens of terabytes of records sitting untouched until an auditor wants them. S3 has eight storage classes; only one of them is built for that pattern, and getting it wrong can cost an order of magnitude in a year you weren't paying attention to the bill.
Solutions Architect Associate · SAA-C03
Coming soon
When Inference Takes Minutes
SageMaker offers four ways to serve a model. Pick by latency and you'll get the answer wrong. The right axis here is payload size, processing time, and arrival pattern -- and three of the four options exclude themselves before you've measured anything.
ML Engineer Associate · MLA-C01
Coming soon
Hub or Tangle
Two VPCs that need to talk to each other is one peering connection. Fifty VPCs that need to talk to each other is one thousand two hundred and twenty-five. The arithmetic is what makes Transit Gateway worth its per-hour fee -- and at multi-region scale, even Transit Gateway is just one rung on the ladder.
Solutions Architect Professional · SAP-C02
Coming soon
Ten Percent and Watch
AWS ships nine predefined CodeDeploy configurations for Lambda. Three families -- canary, linear, and all-at-once -- and most of the variation is just timing. The interesting part isn't the percentages; it's the alarm gate that turns any of them into a self-rolling-back deploy.
DevOps Engineer Professional · DOP-C02
Coming soon
The First Request Nobody's Waiting For
A customer-facing API responds in 80 milliseconds once it's warm, and four full seconds on the first request after a quiet spell. The second trace is the one every user sees first. Hitting a P99 under 500 ms means touring every cold-start lever Lambda exposes, pricing the ones that survive, and accepting that the cheapest answer isn't always the right one.
Developer Associate · DVA-C02
Coming soon
Three Locks, One Door
A BI team in one account needs to read encrypted objects in a data platform account. One bucket. One customer-managed key. Two accounts. Three policies, and every one of them has to say yes. Miss any single lock and the request dies in a different place with a different error -- which is the whole story.
Security Specialty · SCS-C03
Coming soon
The 5xx at 03:00
The pager goes at 03:00 for elevated 5xx on a public ALB. Logs are already in CloudWatch. The question is which path, which target group, which clients, and whether the spike correlates with anything — answered in minutes, from the console, without standing anything new up. CloudWatch Logs Insights is the tool the scenario is pointing at; getting the query right is the skill it's testing.
CloudOps Engineer · SOA-C03
Coming soon
Five Terabytes, Nobody on Call
A product analytics team with five terabytes of Parquet in S3, a handful of ad-hoc questions a week, and zero appetite for a Spark cluster. Three AWS services could answer them -- Athena, Glue, EMR -- and only one of them charges nothing between query bursts. Finding the right fit means walking the whole analytics-on-S3 landscape and understanding where each service is cheap, where each is wasteful, and where each is flat-out wrong.
Data Engineer Associate · DEA-C01
Coming soon
The Model Behind the Product
A million LLM requests a day, peaking at thirty per second, split across US and EU customers, with a P99 first-token target under 1.5 seconds and real reasoning over retrieved context. Bedrock has seven model families and four ways to buy capacity. Most of the landscape falls away once you name what actually decides it -- and the real trick is what you do *after* you've picked the model.
Generative AI Developer Professional · AIP-C01
Coming soon
Eight Transit Gateways and a Terraform Problem
Eight Transit Gateways, five regions, two hundred attachments, and three engineers spending half a release window arguing about route table entries. The way out isn't a cleverer Terraform module -- it's a higher layer that takes declarative segment policy and materialises the routes itself. Finding the right layer means touring what AWS actually offers above raw TGW route tables.
Advanced Networking Specialty · ANS-C01
Coming soon
Buy, Borrow, Build
A product manager with no ML background has been told to add AI to a SaaS product, and has heard of Bedrock, SageMaker, Comprehend, Translate, Textract, Rekognition. AWS has three different shapes of AI offering, and the shortest path depends entirely on whether a ready-made service already does the job.
AI Practitioner · AIF-C01
Coming soon
The Spike You Expect and the One You Don't
An e-commerce site has two kinds of spike to survive: Black Friday, which you can see coming months in advance, and the viral moment at 14:07 on a Tuesday, which you cannot. No single auto-scaling feature solves both without overspending the rest of the year. The right answer layers four of them, and the interesting part is how they hand off.
Solutions Architect Associate · SAA-C03
Coming soon
Four Searches, One Budget
An XGBoost fraud model, twenty hyperparameters, a two-hundred-job tuning budget, and twenty-five minutes per run. SageMaker Automatic Model Tuning ships four search strategies for exactly this shape of problem -- and picking the right one is worth roughly a 4x cut in wall-clock time for the same final AUC.
ML Engineer Associate · MLA-C01
Coming soon
One Second, One Minute
A 2 TB Aurora PostgreSQL fleet needs a cross-region copy with RPO under a minute and RTO under an hour, managed failover, and the secondary pulling low-priority reads during normal operations. Four AWS services replicate across regions. Only one was purpose-built for this shape, and the numbers are tighter than the requirement by an order of magnitude.
Solutions Architect Professional · SAP-C02
Coming soon
Fifty Accounts, One Template
A four-person platform team owns a security baseline that has to live identically in fifty AWS accounts, auto-apply to any new account the Org picks up, and shout when someone meddles with it. Logging into fifty consoles is not a plan. CloudFormation StackSets with service-managed permissions is the one answer that ticks every box without a control plane of its own.
DevOps Engineer Professional · DOP-C02
Coming soon
The Leaderboard Nobody Can Write To
A viral game sends every write for every player at one DynamoDB partition key. The table is provisioned for 30,000 WCU, only a fraction of that actually reaches the hot partition, and the application sees ProvisionedThroughputExceededException on writes it is entitled to. The answer is not a bigger provisioned number. It is a partition key that is no longer a single point.
Developer Associate · DVA-C02
Coming soon
Fifty Thousand Documents and a Citation on Every Answer
Fifty thousand internal documents, five gigabytes of text, weekly churn, a three-second latency budget, per-user access control, and a citation in every single answer. The RAG landscape on Bedrock is bigger than one product and the interesting part of the design is what falls away once you name the five things that actually decide it.
Generative AI Developer Professional · AIP-C01
Coming soon
One Okta, Thirty-Five Accounts
Thirty-five AWS accounts, four hundred engineers, one corporate Okta, and a pile of long-lived IAM access keys that nobody can fully account for. The brief is to collapse all of it into one Okta change per joiner and leaver, with MFA everywhere and no static credentials anywhere. Getting there means touring what federated access on AWS actually looks like, and what each service will and will not do.
Security Specialty · SCS-C03
Coming soon
Nine Hundred and Fifty Hosts, Three Engineers
800 EC2s across four regions, 150 on-prem VMs in two data centres, three operating systems, three engineers, and a SOC2 auditor. SSH-ing into 950 hosts isn't a plan. Systems Manager Patch Manager is -- knowing which pieces do which job is the skill.
CloudOps Engineer · SOA-C03
Coming soon
Fifty Thousand Sensors at One Hertz
A fleet of 50,000 IoT sensors pushing one-kilobyte events once a second needs a real-time Lambda consumer, a seven-day replay window, external Java consumers, and Parquet-in-S3 for analytics. No single AWS streaming service ticks all four boxes cleanly -- picking well means knowing where Kinesis Data Streams ends and Firehose begins, and which combination of the two earns its keep.
Data Engineer Associate · DEA-C01
Coming soon
Eighty Accounts and No Governance
A ten-year AWS customer with roughly eighty accounts has decided to formalise the multi-account foundation they should have built in 2018. Four engineers, six months, a prescriptive OU structure, centralised security tooling, guardrails, self-service provisioning, and SSO. Four foundations compete for the job, and only one of them is actually a product.
Solutions Architect Professional · SAP-C02
Coming soon
Five Accounts, Two Consumers
Five AWS accounts emit operational events. Two accounts want to consume them -- the observability account for dashboards, the security account for a SOAR pipeline. The events need versioned schemas, replay for the SOAR pipeline, and a dead-letter path. EventBridge custom buses with cross-account targets are the one answer that ticks every box without a control plane of their own.
DevOps Engineer Professional · DOP-C02
Coming soon
One Grant Per Event
A Lambda fleet across twelve spoke accounts needs to call kms:Decrypt on a shared customer-managed key -- but for one execution, on one event, with a scoped permission that vanishes when the handler returns. Key policies are too coarse. Blanket IAM grants are too coarse. The instrument that matches the shape of the problem is a KMS grant minted per execution, and it comes with its own small ceremony of condition keys, tokens, and CloudTrail events.
Security Specialty · SCS-C03
Coming soon
One Feature Definition, Two Stores
A fraud model trains on historical data overnight and scores transactions in real time under a 50 ms budget. The features it needs are identical in both paths -- and if the two paths compute them differently, the model gets train/serve skew and quietly loses accuracy. SageMaker Feature Store solves this by backing one feature definition with two stores: one for training, one for serving.
ML Engineer Associate · MLA-C01
Coming soon
The Free Tier Isn't One Thing
A student wants to host a small web app on AWS for under $5/month and keeps reading stories of people getting surprise bills. 'AWS Free Tier' names three different programmes with three different expiry rules. Telling them apart is how you avoid the bill.
Cloud Practitioner · CLF-C02
Coming soon
Three Zones and a Spare Region
A SaaS team runs RDS PostgreSQL in a single AZ and wants two things at once: sub-minute failover when an AZ dies, and a recovery option in another region with under five minutes of data loss if the whole region goes. Neither problem has one clean answer on its own. The combination that does is a three-AZ cluster plus a replica waiting on the other side of the continent.
Solutions Architect Associate · SAA-C03
Coming soon
The Pipeline That Retrains Itself
A nightly XGBoost retrain, a conditional push to the Model Registry if AUC clears a bar, a deploy Lambda downstream, and a hard requirement to know which dataset produced which model. Five orchestrators can run that DAG. Only one of them tracks the lineage for free.
ML Engineer Associate · MLA-C01
Coming soon
Six Gates and a Default Deny
AdministratorAccess that can't create an access key, a Permissions Boundary that looks restrictive but isn't, and a cross-account Lambda write that dies despite both sides seeming to grant it. Three AccessDenied mysteries with one answer: AWS evaluates every request against up to six policy types, in a fixed order, starting from a default deny. Learn the order and the mysteries disappear.
Solutions Architect Professional · SAP-C02
Coming soon
Two Target Groups and a Test Listener
An ECS Fargate platform took eight minutes of elevated 5xx from a rolling deploy that left old and new tasks serving the same path with incompatible contracts. The fix is blue/green via CodeDeploy: two target groups, a test listener for smoke tests, an alarm gate for rollback, and a task definition revision the new task set is pinned to. The rolling controller can't give that shape; the CodeDeploy controller can.
DevOps Engineer Professional · DOP-C02
Coming soon
The Order That Shipped Twice
A fulfilment Lambda reads an order off SQS, calls a slow partner API, and most of the time everything works. A few times a week the same order ships twice. The queue is on default settings, the function timeout is generous, and nothing in the logs looks unusual. The bug is the arithmetic between three numbers that almost never come up together in isolation -- visibility timeout, function timeout, and how long processing actually takes.
Developer Associate · DVA-C02
Coming soon
One Hundred and Fifty Secrets, Two Stores
A platform team has 150 secrets scattered across Parameter Store, Secrets Manager, and env vars committed to S3. They want one canonical home with database rotation, hierarchical naming, cross-region replication, and a reasonable bill. The cheapest correct answer is two stores, and the deep end is the four-step rotation Lambda.
Security Specialty · SCS-C03
Coming soon
No Inbound SSH
Three bastion hosts, shared SSH keys rotated quarterly, CloudTrail that records a login but not a keystroke. Security wants all of it gone: no inbound SSH, no shared secrets, per-engineer shell audit with input and output captured. Session Manager is the AWS-native answer -- once you know which IAM role sits on the instance, which on the engineer, which document shapes the session, and which SCP keeps the whole org honest.
CloudOps Engineer · SOA-C03
Coming soon
Forty Dollars a Query
A finance team's Athena queries are scanning eight terabytes of CSV every time and costing forty dollars a pop -- fifty times a day. The data they actually want is one day's worth, about thirty gigabytes. Closing the gap means understanding the Athena cost model, the difference between a registered partition and a projected one, and what columnar encoding does on top of partition pruning.
Data Engineer Associate · DEA-C01
Coming soon
The Corpus in the Prompt and the Voice in the Weights
A legal-tech team wants a contract review assistant that understands two hundred thousand past matters, speaks in the firm's voice with clause-by-section citations, and refuses anything off-domain. Fifty thousand pounds, three months. RAG, fine-tuning, and continued pre-training each solve a different half of that sentence — and the interesting answer is which two to pick, not which one.
Generative AI Developer Professional · AIP-C01
Coming soon
Two Writes, One Winner
A SaaS platform runs DynamoDB Global Tables across three regions. Users in all three read and write concurrently. Then a subscriber's update made in eu-west-1 vanishes — another write landed on the same item in us-east-1 a few milliseconds later, and replication's tie-breaker picked the other one. Multi-region active replication in DynamoDB is last-writer-wins by item timestamp, and the lost write is silent. This scenario is really about knowing when that's acceptable, when it isn't, and what patterns keep both writes alive.
Solutions Architect Professional · SAP-C02
Coming soon
Four Gigabytes Over Three Hours
An EC2 instance allegedly pushed 4 GB of data to an unusual external IP over three hours. The security team has VPC Flow Logs in CloudWatch, a GuardDuty finding that flagged the behaviour, and Detective switched on across the organisation. Each of the three services does a different job: Flow Logs hold the raw per-flow evidence, GuardDuty raised the alarm with a classification, Detective pivots across the evidence to build a timeline. Pick the wrong one and you either drown in records or stare at a single finding with no context.
Security Specialty · SCS-C03
Coming soon
Two Buildings for Four Nines
Five gigabits of replication traffic over a single 10 Gbps Direct Connect, a compliance team asking for four nines, and five plausible topologies on the table. The one that lives up to the paperwork isn't the cheapest and isn't the most expensive -- it's the one where a whole building can vanish and the traffic keeps moving. Picking it means knowing what each Direct Connect resilience model survives, and which tier of SLA each one actually signs you up for.
Advanced Networking Specialty · ANS-C01
Coming soon
Who You Are, Then What You Can Do
A mobile-first social app needs email-and-password sign-up, Google and Apple login, direct-to-S3 photo uploads without a server in the middle, and different bucket prefixes for free-tier and paid-tier users. The team has heard of Cognito User Pools and Cognito Identity Pools and has been using the two names interchangeably. They aren't the same thing, they don't do the same job, and getting them confused is the difference between a working upload and a 403. One is about proving who you are. The other is about turning that proof into the AWS credentials that let you touch a bucket.
Developer Associate · DVA-C02
Coming soon
The Model Labels the Easy Ones
Half a million product images, a hundred hours of human annotation budget, and a requirement for bounding boxes around every product. Labelled by hand that's roughly ten seconds per image -- less than a third of what's needed. The trick is to let a model label the easy ones and save the humans for the uncertain edges, then work out whether the numbers survive contact with reality.
ML Engineer Associate · MLA-C01
Coming soon
Forty Tables and a Forbidden Column
A 12 TB S3 data lake, 40 Glue tables, and an IAM policy that says yes or no to whole prefixes. The data platform team wants analysts to read the sales tables -- but not the commission column. Regional analysts in EMEA must see only EMEA rows. Every query must land in an audit log that the risk team can subpoena. S3 prefix policies can't carry that weight. Closing the gap means moving the policy surface into the Glue Data Catalog and letting Lake Formation apply row and column filters that the engine honours before a single byte is returned.
Data Engineer Associate · DEA-C01
Coming soon
Forty Terabytes, No Profiler
Forty terabytes of S3 across hundreds of buckets. Some datasets are read every morning; some are read once a quarter; some were written once and nobody has touched them since. The ops team has never run a per-bucket access profiler and isn't about to start. They want the bill down from $950 a month without hand-engineering a lifecycle rule per prefix. One S3 feature is designed for exactly that -- and two others combine with it in the cases where it would waste money.
Solutions Architect Associate · SAA-C03
Coming soon
Business Hours and a 70B Model
A fine-tuned Llama 3 70B, a traffic pattern that sits idle overnight and peaks at lunch, and three AWS hosting paths that look similar in the console. One is provisioned-only and wrong the second the day ends. One is full BYO and a quarter's worth of work. The middle path is the one that reads the traffic shape correctly.
ML Engineer Associate · MLA-C01
Coming soon
One Hub, Five Ways to Share
A platform with twenty-five accounts carves one out as shared-services and wants to put a Transit Gateway, Route 53 Resolver rules, customer-managed KMS keys, a Glue Data Catalog, and a tree of SSM parameters in it -- visible to the other twenty-four. Five resources, one central account, and no single sharing mechanism that works for all of them. Pick the wrong primitive per resource and you end up either duplicating everything per spoke, or writing custom replication glue, or federating via a pile of resource policies that nobody audits. Pick the right primitive per resource and the hub works.
Solutions Architect Professional · SAP-C02
Coming soon
Three Stages, Two Regions, One Pipeline
A platform team wants one pipeline for a microservice that flows dev → staging → prod across us-east-1 and eu-west-1, knows about every target account, gates prod on a human, and updates itself when the pipeline definition changes. CodePipeline with hand-wired CloudFormation actions gets close but has to be rebuilt every time the shape changes. CDK Pipelines synthesises the whole thing from code, self-mutates on each run, and treats accounts and regions as first-class arguments.
DevOps Engineer Professional · DOP-C02
Coming soon
Two Memories for a Fifteen-Turn Chat
A customer-support assistant where the average conversation runs fifteen turns before it resolves, and returning users pick up two weeks later expecting the bot to remember they've been waiting on a refund. Two memory problems in one product — what's live in the current conversation and what persists across visits — and four plausible ways to build it. Bedrock Agents' built-in memory handles one half cleanly; the other half is where teams reach for DynamoDB or a knowledge base and get it wrong.
Generative AI Developer Professional · AIP-C01
Coming soon
Reachable, Exploitable, Critical
Four hundred EC2 instances, eighteen hundred Lambda functions, two hundred ECR images -- and no vulnerability scanner in the account. Compliance wants evidence of coverage and remediation timelines; the security team wants CVEs ranked by CVSS, by whether an exploit is known, and by whether the thing is even reachable from the internet. One managed service covers all three surfaces with a prioritisation signal built from the attributes that matter, enabled once across the whole organisation.
Security Specialty · SCS-C03
Coming soon
Four Thousand Buckets, No Surprises
A healthcare SaaS has 4,000 S3 buckets across 30 accounts and an auditor who wants proof that no PHI has turned up anywhere it shouldn't. The ask is continuous scanning for SSNs, medical record numbers, dates of birth, and card numbers; automated alerting when a match appears; and a dashboard that ranks buckets by sensitivity rather than by filename. One managed service covers all three, enabled once at the organisation level.
Security Specialty · SCS-C03
Coming soon
Three Destinations, One Truth
A deposit is recorded in DynamoDB. The application then writes an audit entry to S3, pushes a document to OpenSearch, and emits an event to EventBridge for the fraud team. Most of the time, all four places agree. Occasionally one of the three followers silently fails, the balance stands alone, and three days later a reconciliation job reports a ledger that says one thing and an audit log that says another. The fix is to stop doing the fan-out in application code and let DynamoDB itself be the source of the downstream stream.
Developer Associate · DVA-C02
Coming soon
Three Signals, One Page
CPU spikes at 02:00 every night because the nightly backup does what nightly backups do. Latency spikes when the downstream payments API has a wobble. Error rate occasionally prints a one-off 5xx for reasons nobody can ever reproduce. Each signal alone pages the on-call. The on-call stops answering. A real incident -- CPU climbing, latency climbing, errors climbing -- arrives and looks like yet another cry-wolf. The fix is two layers: adaptive per-metric alarms that learn the service's daily pattern, and a composite alarm that only pages when all three agree.
CloudOps Engineer · SOA-C03
Coming soon
Approved, or Nothing Ships
A regulated-industry ML team promotes models by copying S3 files and hoping. Legal wants a signed human approval, auditors want the holdout metrics on file, engineering wants to know which dataset trained the thing, and when production catches fire somebody wants a way back to whatever was running yesterday. Four requirements, one promotion gate, and a clear picture of which corner of AWS is built for this and which corners are almost-but-not-quite.
ML Engineer Associate · MLA-C01
Coming soon
Two Megabytes, Eight Consumers
A 50-shard Kinesis Data Stream sprouted an eighth downstream consumer and immediately broke the seven already reading it. Every GetRecords started throwing ProvisionedThroughputExceededException. Every consumer thought it had the stream to itself; Kinesis knew otherwise. The fix means understanding what shard-level read throughput really caps, what Enhanced Fan-Out buys, and why the cheapest answer is not the best one for every consumer.
Data Engineer Associate · DEA-C01
Coming soon
Twenty Lambdas, Four Stages, One Pipe
A platform team maintains roughly twenty tiny Lambda functions whose only job is to pull events from SQS, drop the ones that don't matter, reshape the rest, and hand them to SNS, DynamoDB, Step Functions, or Kinesis. Each one is forty lines of boilerplate around three lines of value, and each one bills for every invocation whether the event survived the filter or not. EventBridge Pipes collapses the boilerplate into a managed source-filter-enrichment-target path and charges only for events that pass the filter.
DevOps Engineer Professional · DOP-C02
Coming soon
Private by Default, Customers by ARN
A SaaS vendor sells an observability product to enterprises who won't let production traffic touch the public internet — not even the vendor's HTTPS endpoint behind a WAF. The vendor runs the service from their own AWS account. Customers consume it from their own VPCs, in their own accounts, with overlapping RFC1918 space. Peer the VPCs? Transit Gateway in the middle? Lock down a public endpoint with IAM? None fits the shape of a SaaS relationship. PrivateLink does: the vendor publishes one endpoint service, each customer account is an ARN on an allow-list, and each customer VPC consumes it as if the service were local.
Solutions Architect Professional · SAP-C02
Coming soon
Five Filters Around One Prompt
A consumer-facing chatbot on Bedrock has passed every red-team round on the obvious harms — no weapons, no hate, no CSAM — and is still shipping embarrassments: a card number pasted by one user echoing back in a reply, the bot cheerfully comparing the company's product with a named competitor, and a hallucinated policy line that nobody in the building wrote. Five different filter jobs wrap the same Bedrock invocation, and Guardrails is the one surface that does all five without five Lambdas.
Generative AI Developer Professional · AIP-C01
Coming soon
What's Shared, What's Stale, What's Wrong
A security team at a fintech wants three things: every resource shared outside the AWS Organization in one list; a way to catch an accidental 'Principal: *' before a change merges; and a report of IAM roles and permissions nobody has used for ninety days. Three asks that sound like three products turn out to be three analyzer types of one service, enabled from one delegated administrator and reading across every account.
Security Specialty · SCS-C03
Coming soon
Scaling on the Wrong Signal
A SageMaker real-time endpoint serves an image classifier on a GPU instance. Auto-scaling is wired to CPU utilisation at a 70% target. When traffic spikes, latency spikes with it -- and the scale-out fires several minutes after the queue has already blown out, because the GPU saturates long before the CPU does. The fix is choosing a signal that actually tracks demand.
ML Engineer Associate · MLA-C01
Coming soon
Twelve Producers, One Contract
Twelve producer services feed a Kinesis Data Stream read by eight downstream pipelines. Last week a producer quietly added a required field; three consumers crashed, one silently dropped the records, four carried on. The fix isn't a wiki page telling producers to be careful -- it's a schema the producer has to register before it can serialise, a version ID embedded in every record, and a compatibility rule the registry enforces the moment the schema is published. Move the contract out of the heads of twelve teams and into a system that rejects a breaking change before a byte lands on the stream.
Data Engineer Associate · DEA-C01
Coming soon
Sixty Accounts, One Assessment
A compliance team is six weeks out from a SOC 2 Type II audit across sixty AWS accounts. Today their evidence lives in a shared drive full of screenshots and a spreadsheet tracking which control each screenshot is meant to prove. They need the evidence pulled automatically, mapped to the controls the auditor will actually ask about, packaged into a report a human can sign off, and monitored continuously so drift between audits is caught the day it happens.
DevOps Engineer Professional · DOP-C02
Coming soon
Five Minutes or a Year
A fintech runs two Step Functions workloads on the same flavour and wonders why the bill is what it is. One is an IoT data-processing flow firing about 800 times a second -- short, idempotent, nobody looks at it after 24 hours. The other is an account-opening workflow that waits days for human approvals, survives weeks of KYC back-and-forth, and must be inspectable by auditors long after it finishes. One wants Express. One wants Standard. Picking the right flavour per workload is a two-order-of-magnitude cost decision and a very different durability contract.
Developer Associate · DVA-C02
Coming soon
The Backup Admins Can't Delete
A financial services firm has a ransomware-resilience mandate with six absolutes: backups immutable against anyone with production credentials, stored in a separate account under separate IAM control, copied to a second region, retained for seven years, centrally managed across forty accounts, and auditable through a named reporting service. Each absolute, alone, has two or three plausible AWS answers. Taken together, only one arrangement survives contact with all six.
Solutions Architect Professional · SAP-C02
Coming soon
Two Hundred Cameras and a Retired Service
Two hundred industrial cameras run defect detection on factory floors -- ships with steel hulls, remote mining sites, patchy satellite uplinks. Inference has to happen locally and sync to the cloud when the network cooperates. The textbook answer for this shape used to be SageMaker Edge Manager. That service reached end-of-life on 26 April 2024. Working out what replaces it is the real scenario.
ML Engineer Associate · MLA-C01
Coming soon
One Prefix, Three Regions
A Bedrock-backed SaaS serving US, EU and APAC customers is hitting regional quota in us-east-1 during peak while the same model sits idle in eu-west-1. The team wants to spread load without fracturing the product into three regional deployments. Three letters on the front of the model ID do the job — if the model supports it and the geography fits the customer.
Generative AI Developer Professional · AIP-C01
Coming soon
Same Key Material, Both Regions
A SaaS company runs active-active in us-east-1 and eu-west-1. Every ciphertext is wrapped in us-east-1 and replicated to eu-west-1, where the decrypt fails because the customer-managed KMS key is regional. They refuse to stand up a decrypt proxy and refuse to route every read back across the Atlantic. This scenario asks for a solution that puts the same key material in both places -- one key id, two regional arms, symmetric decrypt local to whichever side the reader is on.
Security Specialty · SCS-C03
Coming soon
Steady, Bursty, and Throwaway
A 24/7 web tier whose load is boringly predictable. A nightly ETL that can survive being killed and restarted. A dev fleet nobody can forecast. Three workloads, three rhythms, and five different ways AWS will sell us a compute hour. The interesting question isn't which is cheapest -- every option is cheapest for something -- but which rhythm pairs with which pricing model, and what we'd have to believe about the future to pick each one.
Cloud Practitioner · CLF-C02
Coming soonTwo Pauses, One Lifecycle
A fleet behind an ALB scales on CPU. When new instances launch they take traffic instantly -- before their JVM has settled, before their caches are warm, before Consul has a chance to register them -- and users see 2-3 minutes of elevated errors. When instances terminate they lose whatever is in-flight and drop whatever is queued. Two symptoms, one missing mechanism: a pause on the way into service and a pause on the way out, held open by Auto Scaling lifecycle hooks while the application does the work the schedule didn't allow for.
CloudOps Engineer · SOA-C03
Coming soonWhose Problem Is The Patch
A CVE drops in glibc. A misconfigured S3 bucket leaks customer data. A hypervisor in Dublin needs firmware. Three incidents, three very different pagers going off, and a shared responsibility model that draws a clean line between what AWS owns and what we own. The interesting question isn't 'is AWS secure', it's 'which half of the contract are we holding up, and does the service we picked move the line?'
Cloud Practitioner · CLF-C02
Coming soonSix Ways A Design Can Fail
An architecture review on a Friday afternoon. One service, six people around the whiteboard, and the quiet realisation that the question 'is this a good design?' has at least six separate answers. AWS's Well-Architected Framework gives those answers names -- Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, and Sustainability -- and turns a vibes-based review into a systematic one. The interesting question isn't 'is the design good', it's 'good at which things, and what are we trading to get there?'
Cloud Practitioner · CLF-C02
Coming soonWhen The Pager Fires At 2 AM
Three teams, three different relationships with AWS support. A developer with a quota question on a Monday morning. A payments platform that cannot be down for more than fifteen minutes. A bank-backed acquisition where the architect wants a phone number, not a web form. Same vendor, three different support plans, and a pricing model that rewards picking the right one. The interesting question isn't 'which plan is best', it's 'what do we need when the pager fires, and what are we paying for the silence between fires?'
Cloud Practitioner · CLF-C02
Coming soonFrom Raw Model to Production Endpoint
A product team wants a chatbot that summarises support tickets. They have the tickets, a cloud account, and no ML background. Somebody says 'use a foundation model'. Between that sentence and a working endpoint sit roughly seven distinct stages, each with its own AWS service and its own decisions. The interesting question isn't which model to use -- it's which stages this team can skip, which they absolutely cannot, and what AWS gives them at each step.
AI Practitioner · AIF-C01
Coming soonPrompt, Retrieve, or Fine-Tune
A legal-ops team wants a model that answers questions about their 4,000 in-house contract templates. The first prototype, a plain Claude call with the question in the prompt, hallucinates clause numbers. Someone suggests fine-tuning; someone else suggests RAG. The interesting question isn't which is better -- they solve different problems -- it's which problem the team actually has, and what each adaptation technique costs in time, data, and recurring spend.
AI Practitioner · AIF-C01
Coming soonGrounding a Chatbot in Your Own PDFs
A facilities team has 600 PDFs -- equipment manuals, safety procedures, maintenance schedules -- sitting on a SharePoint drive. Engineers want a chatbot that answers 'how do I reset the chiller on floor 4?' in seconds instead of a ten-minute PDF hunt. Bedrock Knowledge Bases is the AWS-native answer, and from a diagram it looks like two boxes. The interesting question isn't 'does it work' -- it works -- it's which of its six configuration decisions actually matter for this corpus.
AI Practitioner · AIF-C01
Coming soonForecasting Without Writing Python
A category manager has 18 months of weekly sales data for 400 SKUs and a deadline to forecast next quarter. She doesn't code. The ML team is booked until Q3. The ask is a tool that lets her build a forecast herself -- importable, reviewable, explainable -- without waiting for engineering. The interesting question isn't whether SageMaker Canvas can do it (it can) -- it's which of its modes fits the shape of the problem, and what the business user has to understand for the answer to be defensible.
AI Practitioner · AIF-C01
Coming soonGuardrails, Watermarks, and Refusals
A fintech ships a customer-facing chatbot on Bedrock. Legal asks: can it give financial advice? Risk asks: can it leak customer account numbers? Compliance asks: if an auditor requests proof a response came from our model, can we demonstrate it? Three questions, three different controls, all of them Bedrock-native. The interesting question isn't whether these controls exist -- they do -- it's which one answers which question, and what the shape of a 'responsible AI' configuration actually looks like when the auditor arrives.
AI Practitioner · AIF-C01
Coming soonAgents, Chains, and Retrieval
A product manager wants a 'GenAI assistant' for internal operations -- something that can answer questions, look up customer records, draft emails, and file Jira tickets. Three architectural patterns keep coming up: chains, retrieval, and agents. They sound similar, they all use foundation models, and teams routinely reach for the most elaborate one when a simpler pattern would do. The interesting question isn't which is 'best' -- it's which one fits each piece of the assistant's workload, and when elaboration costs more than it earns.
AI Practitioner · AIF-C01
Coming soonSageMaker, Bedrock, or a Managed API
A platform team has five AI-shaped requests landing in a single sprint: transcribe call centre audio, detect anomalies in sensor data, extract text from scanned forms, summarise customer emails, and detect faces in CCTV. Someone has already typed 'use SageMaker' into three design docs. Someone else insists Bedrock is the answer. A third voice mutters about purpose-built services. The interesting question isn't which platform wins -- AWS has at least three answers to every AI problem -- it's how to tell which layer of the stack each request lands on, and what that choice costs in time, money, and flexibility.
AI Practitioner · AIF-C01
Coming soonStatic Fast, Dynamic Personal
A single domain serves a marketing shell that barely changes, a product catalogue that changes hourly, and a logged-in account area that's different for every visitor. One CloudFront distribution in front of all of it. The interesting question isn't whether to cache -- it's how to let CloudFront cache aggressively where it can while telling it, precisely, where it must not.
Solutions Architect · SAA-C03
Coming soonThree Load Balancers, Three Layers
A public HTTPS API that needs path routing and WAF. A TCP service that must preserve the client IP down to the instance. A fleet of third-party firewall appliances that every packet leaving the VPC has to pass through. Three problems, three load balancers, and they are not interchangeable. The interesting question isn't which is best; it's which OSI layer each sits at, and what that layer lets you do that the others can't.
Solutions Architect · SAA-C03
Coming soonShared Storage, Many Shapes
A Linux web farm that needs a shared filesystem for user uploads. A Windows application tier that expects SMB and Active Directory. A genomics pipeline that wants to read the same 80 TB dataset from thousands of Lambda invocations. And a small team on macOS with hand-edited NFS mounts. One problem, four protocols, and a catalogue of services that can each solve some but not all of it. The interesting question isn't 'EFS or FSx' but which filesystem protocol each workload actually speaks, and which AWS service speaks it natively.
Solutions Architect · SAA-C03
Coming soonWhen Aurora Should Scale Itself
A SaaS tenant database with long-flat daytime load and sharp weekend spikes. A reporting replica that's idle 22 hours a day and pinned to 100% for two. A dev cluster that's on for an hour and off for twelve. Three Aurora clusters, three very different load profiles, and two ways Aurora can be sized: capacity you set, or capacity Aurora sets for you. The interesting question isn't which is cheaper in the abstract; it's which load profile each mode is shaped for, and where the cost crossover actually lands.
Solutions Architect · SAA-C03
Coming soonThree Ways Out of the VPC
A fleet of private instances that needs outbound internet for package updates. An IPv6-only subnet that wants outbound egress without accepting any inbound connections from the world. And an accountant asking why a NAT Gateway bill has become one of the top line items. Three problems, three AWS components -- NAT Gateway, NAT Instance, and egress-only Internet Gateway -- and they are not the same thing despite sounding like they should be.
Solutions Architect · SAA-C03
Coming soonCache That Remembers, Cache That Forgets
Three workloads on the same platform each need a cache: a session store that has to survive a node reboot, a leaderboard that needs sorted sets, and a read-through cache for a recommendation API where any node can die without consequence. Redis and Memcached both show up as engine choices in ElastiCache. The interesting question isn't which is faster in a microbenchmark; it's which data model, durability story, and replication shape each workload actually needs.
Solutions Architect · SAA-C03
Coming soonOrder Matters, Order Waits
A click-tracking firehose that can drop duplicates at the app layer but must scale to 50,000 messages a second. A payment-state machine that must never process the same transaction twice and must apply updates in the order they were issued, per customer. Two workloads, one service called SQS, two flavours that are not interchangeable. The interesting question isn't 'which is better' -- it's what each flavour actually guarantees, what it costs, and what each workload can tolerate if the guarantee slips.
Solutions Architect · SAA-C03
Coming soonRead Scale, Three Ways
An Aurora writer pinned at 90% CPU because the reporting team's dashboards read live from production. A classic RDS MySQL with nightly batch jobs that read-lock the master. And a new multi-Region requirement for a PostgreSQL database that has to serve analytics in Sydney off a writer in Frankfurt. Three flavours of 'offload the reads' -- Aurora replicas, RDS read replicas, and DMS -- and they are not the same mechanism. The interesting question isn't which is fastest; it's which replication technology matches which isolation and freshness requirement.
Solutions Architect · SAA-C03
Coming soonCommitment Across a Portfolio
Fifteen workloads across four accounts and two Regions. A portfolio bill of three-quarters of a million dollars a year on compute. Finance has asked for a commitment plan. The interesting question isn't whether Reserved Instances or Savings Plans or Spot is cheapest for a single workload -- that's a workload decision. It's how to compose commitment shapes across a portfolio so the baseline is covered, the Spot-tolerant hours are cheap, and the flexibility budget hasn't been spent on something that can't bend.
Solutions Architect · SAA-C03
Coming soonBackups That Survive the Region
Backups scattered across six services, each with its own schedule and retention. An audit question -- 'prove you can recover any production resource, cross-Region, within 24 hours' -- that nobody can currently answer in fewer than three meetings. The interesting question isn't which service has the best native backup; it's how to centralise backup policy across an organisation, copy to a second Region for durability, and end up with one place to answer the audit question.
Solutions Architect · SAA-C03
Coming soonPicking a Block Volume
A PostgreSQL database whose bottleneck is a steady 15,000 IOPS at 1 ms latency. A Cassandra cluster whose bottleneck is throughput, not IOPS. And a warehouse of CloudTrail logs that gets written once and read almost never. Three volumes, three workload profiles, and EBS has a family of volume types that each match one of them. The interesting question isn't which is fastest; it's which dimension -- IOPS, throughput, latency, or dollars per gigabyte -- each volume type optimises for.
Solutions Architect · SAA-C03
Coming soonObjects That Cannot Be Deleted
Compliance wants: write-once, read-many storage for seven years, across every environment, with immutability guaranteed against any credential -- including the root account's -- for the full retention. S3 has exactly one feature designed for this. The interesting question isn't whether to enable S3 Object Lock; it's which mode, which retention shape, and how to deploy it in a way that stays compliant when someone accidentally rotates a bucket policy in year three.
Solutions Architect · SAA-C03
Coming soonThe Link That Must Not Fall
A 10 Gbps Direct Connect into eu-west-1 carries the bulk of our on-prem-to-AWS traffic; a Site-to-Site VPN sits idle as a backup. The last outage proved the backup was aspirational -- BGP didn't converge, half the traffic black-holed for four minutes, and the postmortem action item read 'make the failover actually work.' The interesting question isn't 'should we have a backup,' it's which of four hybrid-connectivity shapes gives us sub-minute failover without doubling the line-item cost.
Solutions Architect · SAA-C03
Coming soonTwo Ways to Know a Server Is Alive
Route 53 says the region is healthy and sends traffic there. The ALB in that region thinks every target is failing and returns 503. Users see errors. Both health-check systems were configured by different teams and nobody realised they disagree about what 'healthy' means. The interesting question isn't 'which health check is right,' it's which layer of health check protects you from which failure -- and how to make them agree.
Solutions Architect · SAA-C03
Coming soonThe Front Door for a Function
A Lambda function needs an HTTP front door. API Gateway is the textbook answer. An Application Load Balancer with a Lambda target is the less-textbook answer. They price differently, scale differently, authenticate differently, and they integrate with adjacent AWS services in different ways. The interesting question isn't 'which is better,' it's which match of front-door behaviour to workload shape actually makes sense for this service.
Solutions Architect · SAA-C03
Coming soonServerless Containers or the Cluster We Own
A team runs ECS today on EC2 instances they patch, size, and scale themselves. Fargate promises to take all of that away. But Fargate's per-hour pricing is higher, its networking model is different, and not every workload fits. The interesting question isn't 'Fargate or EC2,' it's which launch type matches which workload and where the crossover lines actually sit.
Solutions Architect · SAA-C03
Coming soonThe Disk That Dies with the Instance
Instance Store is fast, local, and free with the instance -- and it evaporates the moment the instance stops. EBS survives, snapshots, encrypts, and charges per GB-month. Picking between them for ephemeral data isn't about cost alone; it's about what 'ephemeral' actually means for the workload. The interesting question isn't 'which is faster,' it's which durability contract matches the data you're writing.
Solutions Architect · SAA-C03
Coming soonMoving Three Petabytes to AWS
Three petabytes of historical genomic data sit on a SAN in a leased datacentre. The lease expires in nine months. The pipe out of the building is 1 Gbps shared with everybody else. Snow Family, DataSync, and Storage Gateway each solve a different piece of this -- but only when you're precise about what kind of movement you need. The interesting question isn't 'which transfer service,' it's which service matches the shape of the data, the shape of the deadline, and the shape of the ongoing read pattern.
Solutions Architect · SAA-C03
Coming soonWhen the Origin Does Not Answer
CloudFront caches the hot stuff, but cache misses still have to reach an origin. When the origin is down, CloudFront returns 5xx to viewers. Origin failover is the setting that says 'if the primary origin fails, try the backup origin.' The interesting question isn't 'should we have a backup origin,' it's which failure conditions trigger failover, how quickly, and what 'the backup origin' should actually be.
Solutions Architect · SAA-C03
Coming soonCatching the Resource That Drifted
Somebody opened a security group to 0.0.0.0/0 on port 22 at 02:14 on a Saturday. The infrastructure pipeline would have rejected that change. The console did not. AWS Config sees the change, evaluates it against a rule, and can remediate it -- if it's wired up. The interesting question isn't 'should we have Config,' it's which rule shape catches which drift and how quickly a bad change gets reversed.
Solutions Architect · SAA-C03
Coming soonThe Shared Drive That Moved to AWS
An on-prem Windows file server holds 40 TB of departmental shares used by 600 staff. The hardware is end-of-life, the office is moving, and finance wants the capital refresh gone. AWS has four file-storage services that might replace it, and they're not interchangeable. The interesting question isn't 'which is the cloud file server,' it's which of FSx for Windows, FSx for NetApp ONTAP, EFS, or File Gateway matches the access pattern, the protocol expectations, and the Active Directory footprint this shared drive actually has.
Solutions Architect · SAA-C03
Coming soonOne Load Balancer, Many Sites
Four micro-frontends, three domains, and one shared ALB. Routing traffic to the right target group is either a host-based decision, a path-based decision, or both. The right rule order matters; the wrong one sends 'admin' requests to the marketing team. The interesting question isn't 'host or path,' it's how the two rule types compose, where priority comes in, and how to keep the rule table readable as the service catalog grows.
Solutions Architect · SAA-C03
Coming soonSpreading the Bid Across the Pool
Spot Fleet gives up to 90% off on-demand, but the capacity comes and goes on AWS's terms. A fleet that requests only c6i.2xlarge in us-east-1a is a fleet that goes offline together. The whole point of a Spot Fleet is diversification: enough instance types in enough pools that when one pool gets reclaimed, the others absorb the load. The interesting question isn't 'how cheap can Spot get,' it's how wide the diversification has to be before the fleet genuinely survives capacity events.
Solutions Architect · SAA-C03
Coming soonPaying by the Request or Paying for the Seat
DynamoDB can bill per-request or by provisioned capacity. Per-request scales to zero; provisioned locks in a discount in exchange for commitment. At the switching point, one month the team pays more on-demand than they would have on provisioned, and the question gets asked. The interesting question isn't 'which is cheaper,' it's which matches the traffic shape -- and what the knobs look like if you get it wrong.
Solutions Architect · SAA-C03
Coming soonRun Command or Runbook
A one-shot patch across 400 instances. A four-step release pipeline with approvals. A nightly log-rotation job that needs to run everywhere. Three operational jobs, two Systems Manager features, and a lot of blog posts that treat them as interchangeable. The interesting question isn't which one is 'better' -- they solve different problems -- it's which shape of job belongs in Run Command, which belongs in an Automation runbook, and what we give up when we pick the wrong one.
CloudOps Engineer · SOA-C03
Coming soonOne Pack, Many Accounts
Twenty-two member accounts, forty compliance rules, and a security team that wants one report across all of it. Bolting Config rules on per-account is how the previous regime got to a spreadsheet that nobody trusted. The interesting question isn't whether to use Config -- that part is obvious -- it's what a conformance pack buys over the raw rules, where it lives in the org, and how remediation fits into something that was designed to only observe.
CloudOps Engineer · SOA-C03
Coming soonTwo Advisors, One Bill
A new FinOps lead asks 'what's Trusted Advisor telling us?' Two hours later they ask the same question about Compute Optimizer. Both services sit in the console with 'save money' in their pitch, both read usage data, and both produce recommendations. They are not the same thing. The interesting question isn't which to enable -- probably both -- it's which class of advice each one gives, and how to stop recommendations from the two services contradicting each other on the same instance.
CloudOps Engineer · SOA-C03
Coming soonSynthetic Users on a Schedule
A checkout flow has gone down three times this year. Each time, the first page to notice has been Twitter. ELB health checks were green; CloudWatch metrics looked normal; the alarm that finally fired was disk-space on an unrelated host. The interesting question isn't whether to add more alarms -- we have plenty -- it's what kind of test would have caught each failure before the customers did, and how we run that test forever without paying someone to watch it.
CloudOps Engineer · SOA-C03
Coming soonThree Ways to Run Something Later
A nightly report that needs to fire at 02:00 local to each Region. A billing batch whose steps must run in sequence, with retries and branches. A per-customer reminder that should fire exactly 24 hours after signup. Three scheduling shapes, three very different tools, and a lot of code that reaches for cron on an EC2 instance because that's what the team knows. The interesting question isn't 'which scheduler' -- it's which scheduling shape each tool was actually built for, and what happens when you use the wrong one.
CloudOps Engineer · SOA-C03
Coming soonQuotas Before They Bite
A deploy fails at 03:14 because the account hit its `RunInstances` quota. A Lambda concurrency limit triggers throttles on a spike the team saw coming. An ops engineer filed a quota increase in the console and the ticket sat unassigned for three days. Quotas are soft limits the organisation should be steering; today they are hard walls discovered at 3 AM. The interesting question isn't 'how do we raise limits' -- AWS has an API for that -- it's which limits we should be tracking, how we automate the request path, and where the responsibility for 'we're about to hit a ceiling' sits.
CloudOps Engineer · SOA-C03
Coming soonLog Line Fan-Out
A security team wants every `AccessDenied` line copied into a SIEM within two minutes. A product team wants a near-real-time stream of structured business events into Kinesis. An ops team wants stack traces routed to an incident bot. Three consumers, one source: CloudWatch Logs, already holding the data. The interesting question isn't 'how do we get logs out of CloudWatch' -- there are five ways -- it's which shape fits a line-at-a-time fan-out, which fits bulk archival, and where subscription filters earn their keep.
CloudOps Engineer · SOA-C03
Coming soonBaking AMIs That Stay Fresh
A golden AMI rebuilt by a shell script on a bastion host, last updated seven months ago. A CVE announcement this morning that requires a patched kernel across the fleet. A team that wants to adopt Packer 'because HashiCorp.' The interesting question isn't 'Image Builder or Packer' in the abstract -- both build images -- it's which one owns the pipeline, which owns the recipe, and what it costs when the choice is wrong.
CloudOps Engineer · SOA-C03
Coming soonProve the Backup Ran
An auditor arrives with a simple question: show me that every RDS instance in production was backed up last night. The on-call engineer opens the console, picks the first instance, clicks Backups, takes a screenshot, and moves to the second. Four hours later they're still going. The interesting question isn't whether the backups ran -- they almost certainly did -- it's how the evidence gets from 'true but unprovable' to 'exportable in a minute' without an engineer ever touching a console.
CloudOps Engineer · SOA-C03
Coming soonWho Can Publish and Who Can Hear
An SNS topic that half the org has permission to publish to. Eighteen subscribers, each receiving every message, each filtering on their own side with hand-written if-statements. A service that joined last month and is getting flooded with events it doesn't care about. The interesting question isn't 'is SNS the wrong tool' -- it's the right tool -- it's which policy answers 'who can publish' and which answers 'who actually receives this specific message.'
CloudOps Engineer · SOA-C03
Coming soonLog Lines into Metrics
A service that logs structured JSON but publishes no CloudWatch metrics. A dashboard of graphs that each pull from Logs Insights queries, taking 30 seconds to load. An on-call rotation that correlates 'error rate' with 'deploy time' by squinting. The interesting question isn't whether to add metrics -- obviously -- it's where the metric should come from: a new instrumentation layer in the code, a CloudWatch agent, or the log lines that are already being written.
CloudOps Engineer · SOA-C03
Coming soonThe Call That Comes Before the Page
An EBS volume is degraded in one availability zone. A TLS certificate on an older API expires in two weeks. A planned maintenance window will reboot an RDS instance tomorrow night. AWS knows all three; the team finds out from the volume's first I/O error, the certificate warning in Slack, and a post-mortem. The interesting question isn't how to 'get better monitoring' in general -- it's how to turn AWS's own health signal into a first-class input that reaches the right people before the incident does.
CloudOps Engineer · SOA-C03
Coming soonFinding the Resource
'Find me every S3 bucket tagged Env=prod across our accounts and Regions.' Thirty tabs later, a half-finished spreadsheet, and a nagging feeling one Region got missed. An estate of 22 accounts and 15 Regions has made the Describe-and-diff approach untenable. The interesting question isn't 'how do we search AWS' -- there are five ways -- it's which one gives an answer in seconds without running a scheduled crawler or building a search index.
CloudOps Engineer · SOA-C03
Coming soonDashboards or Grafana
CloudWatch dashboards in production, tweaked from three tutorials; Grafana in staging, stood up by a team that wanted to plot alongside Prometheus data; a Datadog subscription for the parts nobody wants to move off. Three dashboarding surfaces, three bills, three user populations. The interesting question isn't 'which tool is better' -- they all draw graphs -- it's which question a surface is best for answering, and where the decision to stay on two surfaces is cheaper than consolidating.
CloudOps Engineer · SOA-C03
Coming soonOne Agent, Many Endpoints
An old EC2 fleet still running the legacy CloudWatch Logs agent. A newer fleet running the unified CloudWatch agent with metrics and logs. A Prometheus exporter sidecar on some hosts. Three agents, two configuration files per host, and documentation that contradicts itself on which is preferred. The interesting question isn't 'is the unified agent better' -- it plainly is -- it's what each legacy piece was doing, what migrating feels like, and where the unified agent's knobs sit once it's installed.
CloudOps Engineer · SOA-C03
Coming soonTask Role or Secret
A container image rebuilt from the Dockerfile in a git branch, shipped yesterday, with the database password in an environment variable. A task definition whose IAM role grants too much because nobody's audited it. A third-party API key passed via ECS environment variables in plain text. Three ways ECS workloads hold secrets, two of them wrong. The interesting question isn't 'should we use secrets management' -- it's how the task role and secrets injection fit together, which one carries which credentials, and what an audit should be able to see.
CloudOps Engineer · SOA-C03
Coming soonSAM, CDK, or Serverless Framework
A Lambda-heavy microservice that needs to ship today, a platform team deploying a hundred services next quarter, and a multi-cloud greenfield product trying to stay portable. Three ownership shapes, three different definitions of 'infrastructure as code', and three tools that each solve a real problem. The interesting question isn't which tool is best -- each one is best for something -- but which tool matches which ownership shape, and what the team would have to believe about the future to pick each one.
Developer · DVA-C02
Coming soonCache-Aside or Write-Through
A product-catalogue API returning the same items millions of times, an inventory counter the warehouse is constantly adjusting, and a leaderboard that must be strictly consistent on read. Three workloads, three read-write shapes, and four different caching patterns to pair them with. The interesting question isn't 'which pattern is best' -- each one is best for something -- but which pattern matches which read-write shape, and what the application would have to tolerate to pick each one.
Developer · DVA-C02
Coming soonThrottles, Quotas, and Plans
A public API whose downstream can't survive a burst. A free tier that should top out at 1,000 calls per day without turning paying customers away. A partner integration whose contract says a thousand requests per second. Three traffic-shaping jobs, three different knobs in API Gateway, and enough overlap that it's easy to configure the wrong one. The interesting question isn't which knob is 'the' throttle -- there are at least four -- but which knob protects what, and what the traffic would have to look like to reach each one.
Developer · DVA-C02
Coming soonFollowing a Request Through
A p99 latency that climbed from 200ms to 2 seconds last week, nobody knows where. Logs in fourteen different CloudWatch log groups, each with its own request ID scheme. An API call that passes through a Lambda, a Step Functions state machine, two DynamoDB tables, and a third-party HTTP vendor before it returns. The problem is not a lack of data -- the problem is that the data is scattered. X-Ray is AWS's answer, and it's worth walking the pieces slowly to see which ones carry the problem forward and which ones stop it.
Developer · DVA-C02
Coming soonBus, Queue, or Topic
A payment-captured event that needs to update four internal services. An order-placed event that kicks off five independent downstream workflows. A deeply integrated SaaS that emits events we want to route by schema. Three fan-out shapes, three services with overlapping jobs, and enough similarities that it's easy to pick the wrong one. The interesting question isn't 'which service is best' -- each one is best for something -- but which service matches which fan-out shape, and what the consumers would have to look like to pick each one.
Developer · DVA-C02
Coming soonGraphQL, Managed or Hand-Rolled
A mobile client team that wants one endpoint and a strong schema. A backend team whose data is spread across DynamoDB, Aurora, and three internal HTTP services. A requirement for real-time subscriptions when a record changes. The shape calls for GraphQL; the question is whether to lean on AppSync's managed resolvers or a Lambda running Apollo behind API Gateway. Both can serve the same query; they pay very different tolls.
Developer · DVA-C02
Coming soonRotating the Database Password
A Postgres master password that's been the same since the project started. A compliance deadline that says 'rotate every 30 days, no exceptions.' An application fleet of eighteen services that all read the credential at startup. The interesting question isn't whether to rotate -- it has to happen -- but how to rotate without taking the fleet down, and why Secrets Manager's rotation Lambda works the way it does.
Developer · DVA-C02
Coming soonS3 Events, Two Ways
An image-processing pipeline that needs to kick off a thumbnail Lambda on every upload. A compliance service that cares about every object mutation -- creates, deletes, replication events, lifecycle expirations -- across fourteen buckets. S3 has two event delivery mechanisms that look almost identical at first and differ in ways that matter. The interesting question isn't which is 'the right way' -- both are -- but which shape each fits, and what happens when you accidentally enable both.
Developer · DVA-C02
Coming soonEnv Vars, Parameters, or Secrets
A Lambda that needs to know the name of its DynamoDB table. A Lambda that reads a feature-flag threshold that changes weekly. A Lambda that must authenticate to a third-party API with a key that rotates monthly. Three pieces of configuration, three different lifetimes, three different secrecy requirements. AWS gives us three places to put them and enough overlap that they get used interchangeably. The interesting question isn't which is 'the right one' -- each is right for something -- but which matches which shape of configuration, and what it costs to pick each one.
Developer · DVA-C02
Coming soonCustom Resources or Constructs
A CloudFormation stack that needs to provision a third-party SaaS resource on every deploy. A CDK team that wants 'our standard service' to become a single line in every new stack. A resource AWS doesn't model natively but we need to manage with the rest of our infrastructure. Three problems, two mechanisms, and a lot of confusion about which is which. The interesting question isn't whether to extend CloudFormation -- we'll have to -- but which tool is doing which job, because they solve genuinely different things.
Developer · DVA-C02
Coming soonPrivate Packages in CodeArtifact
An internal TypeScript library that four services already depend on. A Python package the data team wants to share. Every build pulling open-source dependencies straight from npmjs.com and PyPI, trusting whatever got published last. A compliance deadline says nothing goes through unvetted public registries. CodeArtifact is AWS's answer, and the clean pattern -- upstream proxy plus internal-publish repository -- covers more than most teams first realise.
Developer · DVA-C02
Coming soonSigned URLs or Signed Cookies
An e-book download link that should work once, for one customer, for ten minutes. A premium video library where the whole archive should open up to paying subscribers but stay closed to everyone else. A firmware update a device fetches directly by serial number over HTTPS. Three access-control shapes on private content served through CloudFront, and two different mechanisms for authorising them. The interesting question isn't which is more secure -- both are -- but which shape each fits, and what the consumer would have to look like to pick each one.
Developer · DVA-C02
Coming soonCatching Step Functions Errors
A state machine that calls a flaky third-party API. A workflow that must roll back partial work if a later step fails. A long-running process whose failures should reach a human reviewer rather than vanish into logs. Three failure shapes, one state machine language, and a set of error-handling primitives worth walking through before relying on them. The interesting question isn't whether Step Functions can catch errors -- it can -- but which primitive handles which shape, and what the workflow would have to look like to stop thinking about errors.
Developer · DVA-C02
Coming soonFiltering SNS Subscriptions
A single SNS topic carrying every event the platform emits. A subscriber that cares only about high-priority customer events. Another subscriber that needs events for one specific region. A third that wants everything except test traffic. The naive answer is a topic per filter; the better answer is one topic with SNS message filter policies, and the details of attribute-based matching, message-body filtering, and the 100-subscription limit repay a careful read before you commit.
Developer · DVA-C02
Coming soonSidecars in an ECS Task
An application container that needs structured logs shipped somewhere other than stdout. A service-mesh proxy that must be co-located with the app to handle mTLS. A secrets-injection helper that reads from Parameter Store and writes to a shared volume. Three common sidecars, one ECS task definition, and a handful of primitives -- container definitions, dependsOn, essential, shared volumes, link modes -- that decide whether the task behaves the way you expect. The interesting question isn't whether sidecars work on ECS -- they do, well -- but which primitives hold the pattern together, and which corner cases bite.
Developer · DVA-C02
Coming soonWho Can Call This API
A public API open to the internet for a consumer-facing mobile app. A partner API behind an API key with usage plans. An internal API reachable only by other AWS services in the same account. A machine-to-machine API where a bank's back-office calls ours with a signed JWT. Four authorisation shapes, a handful of API Gateway authoriser types, and surprisingly different trade-offs depending on which mix lands on the same gateway. The interesting question isn't which authoriser is right -- different ones are right for different callers -- but which matches which caller profile, and why IAM-auth isn't quite what the phrase suggests.
Developer · DVA-C02
Coming soonLambda URL or HTTP API
A single webhook receiver that needs an HTTPS URL, nothing more. A small service with three routes that could do without API Gateway's feature list. A pair of endpoints with different authorisers, rate limits, and Lambda targets. Three different service shapes, two very similar AWS surfaces, and enough overlap that the wrong choice costs a little money today and a lot of migration work next year. The interesting question isn't 'which is cheaper' -- both are cheap -- but which fits the shape, and when 'just a URL for a Lambda' is genuinely enough.
Developer · DVA-C02
Coming soonThe Model Is Drifting, Which Part
A fraud model that's been quiet for nine months starts throwing a different shape of alerts. Recall is down, precision is down, and the feature distributions look off in places that don't match any product change we shipped. 'The model is drifting' is a sentence that means four different things depending on which part of the pipeline moved. SageMaker Model Monitor names the four and gives each one a baseline, a schedule, and a metric -- which is the part that matters when it's 3am and a dashboard is red.
ML Engineer · MLA-C01
Coming soonSeventy Models, One Endpoint
Seventy per-merchant fraud models, each trained independently, each needed on demand but almost never all at once. A separate endpoint per model would be an eye-watering monthly bill; squashing them into one model would waste months of per-merchant tuning. SageMaker offers at least three endpoint shapes for hosting many models together, and they look superficially similar until you ask which ones share containers, which ones share hardware, and what happens when a model isn't hot in memory.
ML Engineer · MLA-C01
Coming soonThe Notebook That Outgrew Itself
A data science team that started with one `ml.t3.medium` SageMaker notebook instance now has fourteen of them, each bookmarked by a different person, each holding state nobody else can see. When a new hire joins, the onboarding doc says 'ask Priya which notebook to use.' SageMaker Studio is sold as the replacement -- but the transition isn't notebook-for-notebook; it's a different model of who owns what, where the files live, and how the team shares kernels.
ML Engineer · MLA-C01
Coming soonTraining Too Big For One GPU
A model that was comfortably fitting on one `p4d.24xlarge` six months ago won't fit on eight A100s today. The dataset has grown, the parameter count has grown, and a single training run takes thirty-six hours if it completes at all. SageMaker supports at least four distributed-training strategies and they scale along different axes -- one for when the data doesn't fit, one for when the model doesn't fit, one for when neither fits, and one for jobs that don't really need any of this but would like to be faster.
ML Engineer · MLA-C01
Coming soonThe GPU Is Only Forty Percent Busy
A training run on a `p4d.24xlarge` is taking twice as long as it did a month ago. The throughput metric says ~50% of expected; nvidia-smi shows GPUs sitting at 40% utilisation instead of the 90%+ they used to hold. Something is starving them, and it could be anything -- the data loader, the network, a CPU pre-processing step, IO latency, a single layer allocating too much. SageMaker Debugger's profiler is designed for this exact hunt, and it breaks the training step into parts that let you answer 'where did the time go' rather than 'is it slow.'
ML Engineer · MLA-C01
Coming soonOur Framework Isn't On The List
A research team has been running a JAX + Flax model on their own EC2 instances and wants to migrate onto SageMaker for training, hyperparameter tuning, and inference. SageMaker ships first-party containers for PyTorch, TensorFlow, MXNet, Hugging Face, and XGBoost -- and nothing for JAX. Building a bring-your-own-container image turns out to be almost the entire migration, and the shape of that container is more constrained than it looks from outside.
ML Engineer · MLA-C01
Coming soonFeatures, Glue Or Processing
A feature pipeline reads a few terabytes of raw transaction events, joins to customer and merchant reference data, computes rolling windows, and writes a parquet dataset ready for training. AWS has at least two very different ways to run that pipeline: Glue (ETL-shaped, Spark-backed, Data Catalog aware) and SageMaker Processing (ML-shaped, script-backed, in the same job graph as the training). Same work, two tools -- the pick is about where the pipeline lives in the team's mental model as much as what it does.
ML Engineer · MLA-C01
Coming soonKnowledge We Don't Own
An internal-support team wants a chatbot that answers questions using the company's own product documentation, changelogs, and a decade of support tickets. A general-purpose foundation model knows none of it. The team has three levers to make the model look like it knows the domain: retrieval-augmented generation, fine-tuning, and careful prompting. The levers trade against each other on latency, cost, freshness, and how much the model actually internalises -- picking between them is more about the nature of the knowledge than the quality of the model.
ML Engineer · MLA-C01
Coming soonThe Agent Or The Function
A customer-facing assistant needs to look up order status, create a return, check inventory, and talk to the user in a coherent thread across all of it. Two AWS options for the orchestration: Bedrock Agents, which gives the LLM the keys to call tools and plan multi-step flows; or a Lambda-based state machine where the LLM is one step among many. The trade isn't about capability -- both can do the work -- but about where the intelligence sits, where the bugs can hide, and who's on call when a user gets a strange answer.
ML Engineer · MLA-C01
Coming soonWhere The Vectors Live
A team building a Bedrock Knowledge Base for 8,000 product documents is being asked the first infrastructure question of any RAG system: where do the embeddings live? Bedrock offers four backing stores -- OpenSearch Serverless, Aurora PostgreSQL with pgvector, Pinecone, and Redis Enterprise Cloud -- plus the Neptune-based graph store for graph-RAG. They look interchangeable on a diagram and diverge sharply once you ask about scale, cost shape, and operational model.
ML Engineer · MLA-C01
Coming soonThe Model Without The Code
A business analyst with ten years of domain expertise and zero Python experience wants to build a customer-churn model. They have the data in Redshift, they have a reasonable understanding of which features matter, and they have no one on the data team with cycles to help for the next quarter. SageMaker Canvas is the no-code option that turns that into something they can actually ship; understanding what Canvas does (and what it hides) is the real story.
ML Engineer · MLA-C01
Coming soonLet The Service Pick The Model
A data scientist is staring at a new tabular classification problem: 50 features, 150k rows, balanced classes. Could be XGBoost. Could be CatBoost. Could be a small neural net. Iterating through five model families with hyperparameter tuning by hand is several engineer-weeks. SageMaker Autopilot is the automated alternative -- it trains and tunes a set of candidate pipelines, picks the best, and hands back a model plus a notebook showing everything it tried. The interesting question is what it does automatically, what it doesn't, and when it's the right first reach.
ML Engineer · MLA-C01
Coming soonShifting Traffic Without Holding Our Breath
A fraud-detection model retrained on six more months of data is ready for production. The last time the team deployed a new version by flipping the endpoint over, it took three days to notice the new model was regressing on a specific merchant segment. They want the next rollout to be observable, reversible, and automated to roll back on its own. SageMaker endpoint variants with shadow testing and traffic shifting are the tool for that, and the shape of the safety is in the deployment-config choices.
ML Engineer · MLA-C01
Coming soonTraining On Somebody Else's Schedule
A nightly training pipeline burns 8 × A100 for four hours per run, which at on-demand prices is enough to have finance asking pointed questions. The training is idempotent, checkpointed every 30 minutes, and has no deadline tighter than 'done by the morning standup.' SageMaker training jobs can run on Spot capacity for roughly a 70% discount -- provided the pipeline tolerates interruption, the instance choice is broad enough, and the checkpoint-and-resume logic actually works when AWS reclaims the instance halfway through epoch 7.
ML Engineer · MLA-C01
Coming soonThree AWS AI Services, One Document Pipeline
A stack of scanned vendor invoices in six languages lands in an S3 bucket every hour. Somebody needs to pull out the supplier name, line-item totals, and the tax rate, translate the freeform notes into English, and route the whole thing to accounts payable. Three AWS services have their names on that job, and each does a slice of it well and the other slices badly. The interesting work is lining up which service answers which question, rather than asking any one of them to do the whole pipeline.
ML Engineer · MLA-C01
Coming soonPicking a Foundation Model Without Picking a Fight
A product team wants a customer-facing chatbot that answers questions over our own documentation, cites its sources, and doesn't cost a fortune per conversation. Two AWS services have foundation-model in their marketing copy: Bedrock and SageMaker JumpStart. They sound interchangeable; they are not. The right question isn't which is better -- it's which shape of problem each is built for, and what we'd have to be willing to operate to pick one.
ML Engineer · MLA-C01
Coming soonA Model in Three Regions
A fraud-scoring model that runs in a single eu-west-1 endpoint earns its first non-European customer in Sydney, and then Tokyo, and then São Paulo. Latency goes from 40 ms to 340 ms and the product manager wants to know what we're going to do about it. Multi-region deployment sounds like a single decision; it's actually four decisions stacked -- where the model lives, where the endpoints run, how traffic finds the nearest one, and how the weights stay in sync. Each one has options, and the wrong pairing turns a latency win into a disaster-recovery problem.
ML Engineer · MLA-C01
Coming soonThree Trees and a Time Machine
Three forecasting problems on the same desk. Next-quarter fraud rate by product line. Daily SKU-level demand across 4,000 products. Ad-spend to conversion attribution where explanations matter as much as numbers. SageMaker's catalogue has a built-in algorithm for each; the names -- XGBoost, LightGBM, DeepAR -- are not interchangeable. The interesting question is which shape of problem each one was built for, and what we'd give up picking the wrong one.
ML Engineer · MLA-C01
Coming soonPredictions That Take Their Time
A contract-review model takes 45 seconds to score a 30-page document, payload is 8 MB, and the front-end team's XHR times out at 30 seconds every time. Throwing a bigger instance at it makes each call slightly faster but leaves 90% of the expensive GPU idle between requests. SageMaker has a third endpoint type that fits exactly this shape -- not synchronous, not a batch job -- and its trade-offs explain themselves once the workload is laid out properly.
ML Engineer · MLA-C01
Coming soonTwo Shapes of Scoring
A customer-churn model is re-scored nightly across 20 million accounts, and a lead-scoring model is called the moment someone submits a form. Same model family, same training data, same accuracy target -- and two completely different right answers about how to deploy it. Real-time endpoints and batch transform jobs have overlapping names and very different bills. Getting the match right turns out to be less about latency and more about what the downstream consumer actually does with the prediction.
ML Engineer · MLA-C01
Coming soonTeaching Model Monitor Your Metric
A recommender-system endpoint is drifting in a way the built-in Model Monitor reports don't see. Feature distributions look fine, ground-truth labels arrive too late to catch the regression, and the data quality metrics are all green. The signal that matters is a domain-specific one -- the top-10 hit rate against a held-out validation set replayed hourly -- and the way to get Model Monitor to watch for it is to teach it your own metric. The interesting work is in what counts as a metric, how the built-in jobs evaluate it, and where the boundary between 'just report the number' and 'also alarm on it' actually sits.
ML Engineer · MLA-C01
Coming soonWhere Did This Prediction Come From?
A regulator asks a compliance team to prove which training dataset produced the model that scored a specific loan application six months ago. The answer lives across S3 buckets, Git commits, training-job logs, and somebody's Jupyter notebook. SageMaker has a service specifically for this question -- Lineage Tracking -- but it's not useful as a report after the fact. It's useful as a graph built automatically while the training pipeline runs, and the interesting work is knowing which pieces it captures for free, which we have to add ourselves, and how the graph gets queried when the regulator actually asks.
ML Engineer · MLA-C01
Coming soonThe Compiler That Reads Your Model
A Hugging Face transformer is trained for 6 hours on 8 GPUs, three times a week, and the data scientists would like it to be faster. They've already tuned the hyperparameters; the bottleneck is the model itself running through PyTorch. SageMaker Training Compiler promises a 30-50% speedup on exactly this shape of workload by rewriting the training graph before it runs -- but the docs hide when it helps and when it doesn't, and a naive drop-in makes some jobs slower. The interesting work is knowing which models and which training shapes it actually accelerates.
ML Engineer · MLA-C01
Coming soonSize the Endpoint Before You Pay For It
A team is about to ship a fine-tuned language model to production. They have a trained artifact, a sample payload, a latency target, and no idea which of the 25 inference instance types is the right one. The cheapest fit is probably six times cheaper than the most-expensive-that-works. SageMaker Inference Recommender runs the benchmark so we don't have to, but the interesting work is knowing which question to ask it, reading the results it gives back, and knowing when its answer is wrong for our actual workload.
ML Engineer · MLA-C01
Coming soonWhen One Class Hides From the Other
A fraud detection model trained on a year of transaction data reports 99.3% accuracy and catches almost no fraud. The training set is 0.4% fraudulent and 99.6% legitimate, and the model has learned the safest possible strategy: predict 'not fraud' every time. The right answer isn't 'try harder' -- it's knowing that imbalanced classification has a specific set of techniques, each with a different cost, and picking the one that matches both the data shape and the business cost of each error type.
ML Engineer · MLA-C01
Coming soonMany Models, One Instance
A personalisation team has trained 4,000 per-tenant churn models -- one per customer account -- and deploying each to its own endpoint would cost $24,000 a month in minimum-instance floors before a single inference is served. SageMaker Multi-Model Endpoints are the pattern for this: one endpoint hosts all 4,000 models, loading each on demand. The interesting work isn't turning it on; it's sizing the underlying instance so the working set fits in memory, the cold-start tax stays small, and nobody's tenant notices their model had to be paged in.
ML Engineer · MLA-C01
Coming soonThree Ways to Swap a Model
A recommendation model has a new version ready to ship. The old one serves 200 RPS of production traffic; the new one has passed offline evaluation but nobody's willing to point the live firehose at it on faith alone. There are three deployment patterns built into SageMaker for exactly this moment -- shadow, blue-green with canary, and linear traffic shifting -- and they're not interchangeable. Which one fits depends on what kind of confidence you need to build, and what you're willing to spend to build it.
ML Engineer · MLA-C01
Coming soonThe Agent That Learns to Bid
An energy-storage operator wants to train a policy that decides when to charge and discharge a battery to maximise arbitrage profit across a volatile electricity market. The business problem is straightforward; the ML framing is anything but. Supervised learning needs labelled 'correct decisions' that don't exist; it's a sequential-decision problem under uncertainty, which is what reinforcement learning is for. SageMaker RL exists for exactly this shape of problem, and the interesting work is knowing what it provides, what it doesn't, and how the pieces fit together into something that trains and deploys.
ML Engineer · MLA-C01
Coming soonThree Lakehouse Table Formats
A pile of Parquet in S3, an Athena catalog that knows about partitions but nothing about versions, and an analyst who keeps asking 'what did the table look like yesterday?'. Three open table formats have answers -- Iceberg, Delta Lake, and Hudi -- and each answers a subtly different question. The work is figuring out which question we're actually asking before picking.
Data Engineer · DEA-C01
Coming soonPipes Without Glue Code
A DynamoDB stream that needs to feed an EventBridge bus, with a filter that drops 80% of the events and an enrichment that adds customer-segment data. The classic answer is Lambda, and the Lambda is 40 lines of boilerplate around two meaningful transformations. EventBridge Pipes is the newer answer: source, filter, enrichment, target, as configuration. Whether it wins depends on what the enrichment actually needs to do.
Data Engineer · DEA-C01
Coming soonRules That Fail Loudly
A Glue job that silently loaded a malformed CSV last Tuesday and nobody noticed until an analyst asked why Q3 revenue looked odd. The fix isn't a bigger dashboard; it's an upstream gate that refuses to promote data to the warehouse unless a stated set of expectations holds. Glue Data Quality is AWS's answer, built on the Deequ library, expressed as a compact rule language and integrated with the catalog. The interesting question is what to assert and what to do when an assertion fails.
Data Engineer · DEA-C01
Coming soonOne Catalog, Many Producers
Four Glue catalogs across four accounts, an Excel spreadsheet of dataset owners that nobody keeps current, and a finance analyst who's joined her company's customer table against a sales table she wasn't supposed to read. The catalog is everywhere and the governance is nowhere. AWS DataZone pulls catalog metadata, access policies, and producer/consumer workflows into one place. Whether that's what we need depends on how the catalog is being used today and how far that is from a single governed catalog.
Data Engineer · DEA-C01
Coming soonCleaning Without Code
A CSV of lead records from a trade show, a finance analyst who needs to normalise phone numbers and deduplicate contact emails before Monday, and a data engineering team that's three weeks deep in a warehouse migration. AWS Glue DataBrew is the visual data-prep tool built for exactly this shape: the analyst clicks recipes, the engineer doesn't have to write Spark, and the output is a Parquet file in S3. Whether it earns its place depends on how often the 'one-off' ends up repeated.
Data Engineer · DEA-C01
Coming soonChange Data Capture, Two Ways
A Postgres order-book that needs to feed the warehouse, with every insert, update, and delete captured in near-real-time. DMS has been doing this for seven years; MSK Connect is the newer answer if the downstream consumers are already on Kafka. The interesting question isn't 'which service is better' but which shape of pipeline matches each, and what happens when the CDC stream has to be read by more than one thing.
Data Engineer · DEA-C01
Coming soonCache Once, Query Fast
A QuickSight dashboard with three hundred daily viewers, backed by a Redshift cluster that groans under the concurrent load every Monday morning. SPICE is QuickSight's in-memory columnar engine that caches data so dashboards don't hit the warehouse for every view. Direct query bypasses SPICE and reads live. The decision isn't 'which is faster' -- SPICE usually is -- but which trade-offs each makes and which matches the dashboard's actual freshness and scale.
Data Engineer · DEA-C01
Coming soonHot, Warm, and Cold Shards
Two years of application logs in OpenSearch, most of them never queried, all of them costing the same per GB per month. OpenSearch Service tiered storage -- hot, UltraWarm, cold -- is how that bill gets sensible without losing the ability to search yesterday's and last year's data when it matters. The work is matching the access pattern of each age of data to the tier that serves it.
Data Engineer · DEA-C01
Coming soonThe Replication Task Is Slow
A DMS full-load task that's been running for thirty-six hours with sixteen hours to go, on tables that shouldn't take half that. The replication instance is showing CPU headroom and memory to spare, but the throughput graph is a flat line. DMS tuning is a set of knobs that affect different parts of the pipeline; the work is figuring out which part is the bottleneck and which knob moves it.
Data Engineer · DEA-C01
Coming soonThe SaaS Ingestion Problem
A Salesforce tenant with thirty objects that need to land in the warehouse daily, a Zendesk instance adding another fifteen, and a team that's been writing a bespoke Python script per connector for each new source. AppFlow is AWS's managed connector fabric for SaaS sources -- configure the source, pick the target, schedule, done. Whether it earns its place depends on whether the shape of ingestion we need is the shape AppFlow serves.
Data Engineer · DEA-C01
Coming soonOrchestrating Eleven Jobs
A nightly ETL with eleven Glue jobs, dependencies between them, three of which can run in parallel, one that has to retry on transient failures, and one that requires human approval before it writes to production. Step Functions and Glue Workflows both orchestrate Glue jobs; the shape of the work determines which is the right fit and which tries to fit but won't.
Data Engineer · DEA-C01
Coming soonThree Kinds of Glue Job
A Glue bill that's become the biggest line item in the data platform. Half the jobs are transforming hundred-gigabyte datasets; the other half are bouncing three hundred rows through a five-minute script. They're all Glue Spark jobs because that's what got written first. Python shell and streaming jobs are two cheaper, better-fitted shapes for specific workloads -- the trick is knowing which job shape is which.
Data Engineer · DEA-C01
Coming soonQueues, Slots, and Memory
A Redshift cluster where short dashboard queries wait behind five-hour ETL jobs every morning, a WLM queue configuration that's been default for eighteen months, and a DBA who wants to make the ETL stop starving the dashboards without making the ETL slower. WLM is Redshift's workload-management layer: queues, slots, memory, and the rules that route queries between them.
Data Engineer · DEA-C01
Coming soonThree Shapes of Redshift
A Redshift cluster that's been running DC2 nodes since 2019, a growing data lake in S3 the warehouse wants to reach into, and a new team that wants Redshift Serverless without any of the node-sizing conversation. RA3 with managed storage, DC2 with local storage, and Spectrum reading S3 directly are three different answers to 'how are compute and storage related in Redshift', and each makes a different trade.
Data Engineer · DEA-C01
Coming soonTwo Regions, One Service
A payments API that has to keep taking traffic through a whole-Region outage, with a recovery-time objective of under a minute and zero planned data loss. That sentence sounds like a shopping list until you start pricing what each word costs. The real work isn't choosing active-active over active-passive; it's picking which of six data-replication shapes you can live with, and which latency, conflict, and cost you'll accept to get there.
Solutions Architect Pro · SAP-C02
Coming soonOne Door Out
Forty VPCs across twelve accounts, each with its own NAT gateway and its own outbound firewall rules. The bill for NAT alone is eye-watering, and every time a new SaaS domain needs allow-listing there are forty places to update it. Centralised egress promises one place to log, inspect, and pay -- but it also promises a single choke point, a new hub account to run, and a Transit Gateway in the middle of every packet's life. The question is which boxes belong on the path and which don't.
Solutions Architect Pro · SAP-C02
Coming soonFour Shapes of Recovery
A regulator asks for a disaster-recovery plan. The CFO asks what it will cost. The CTO asks how often we'll test it. Three questions, one answer -- and the answer depends on which of four DR strategies we pick. Backup and restore at one extreme; multi-site active-active at the other. Between them, pilot light and warm standby, each with a different daily cost, a different recovery time, and a different amount of infrastructure sitting idle.
Solutions Architect Pro · SAP-C02
Coming soonObjects in Two Places
The regulator has told us customer records must be stored in two jurisdictions, the auditor has told us deletes must be irrecoverable for seven years, and finance has told us the Glacier bill is already too high. S3 Cross-Region Replication sounds like the answer to the first problem. It usually is -- once you understand what it does and doesn't replicate, what happens to existing objects, and what it costs to undo a mistake.
Solutions Architect Pro · SAP-C02
Coming soonThree Layers of Deny
A security auditor sits down with the VPC diagram and asks what stops an instance in the web tier from talking to the payroll database. Three answers are on the table: the security group attached to the database, the NACL on its subnet, and the Network Firewall in the egress VPC. They all say 'no'. They say it differently, they fail differently, and picking which one carries the real policy -- rather than being the belt-and-braces on something else -- is the whole job.
Solutions Architect Pro · SAP-C02
Coming soonBackups the Organisation Cannot Forget
Twenty member accounts. Each team owns its own data, its own retention policy, and its own idea of what 'backed up' means. The auditor wants one number for every resource: when was it last backed up, where does the backup live, and who can delete it. AWS Backup policies through Organizations are the answer -- if you understand what they enforce, what they don't, and where the SCPs have to pick up the slack.
Solutions Architect Pro · SAP-C02
Coming soonThe Portal the Team Will Actually Use
Product teams want new RDS databases on demand. Security wants every database encrypted, tagged, in the right subnet, with the right backup plan. Platform wants the request to stop coming through Slack. A self-service portal with vetted building blocks sits between all three. AWS Service Catalog is the shape -- once you know what's a Product, what's a Portfolio, what a launch constraint does, and why CloudFormation is the deployment engine underneath.
Solutions Architect Pro · SAP-C02
Coming soonOne Template, Many Accounts
A security baseline stack -- GuardDuty, Config, CloudTrail, a handful of IAM roles -- has to exist in every account in the organisation. Twenty accounts today, one per week for the foreseeable future. Clicking through twenty CloudFormation deploys is how you get drift; CloudFormation StackSets promises to do it once and have the organisation maintain it. The promise is real; the details -- service-managed vs self-managed, stack instance targets, drift detection, the dreaded failed-deployment rollback -- are where it earns the Pro.
Solutions Architect Pro · SAP-C02
Coming soonPrivate by Default
Two teams in the same organisation want to call each other's services. Today they do it over the public internet, through NAT, TLS, and each other's ALBs. It works. It also shows up in Flow Logs as 'traffic to a well-known cloud-provider IP range' and the security team has views. PrivateLink promises a path that never touches the internet: the consumer never sees the provider's VPC, the provider never exposes anything publicly, and the permission to connect is IAM. Once you understand NLBs, endpoint services, and who-allow-lists-whom, the picture clicks into place.
Solutions Architect Pro · SAP-C02
Coming soonEvery Log in One Place
The auditor wants a single place to query 'who accessed the payroll database' across twenty accounts. The SOC wants alerts on a failed login from any account in under a minute. The data team wants the same logs for anomaly detection without paying for them twice. Aggregating logs across an organisation is less about shipping bytes than about deciding which log goes through CloudTrail Organization Trail, which through central CloudWatch subscriptions, which through S3 Athena queries, and which through an observability platform sitting on top.
Solutions Architect Pro · SAP-C02
Coming soonLift and Failover
A data-centre full of VMware VMs -- the on-prem half of the estate nobody has refactored yet -- needs a disaster recovery plan that covers actual data-centre fires, not just cloud-Region outages. The traditional answer is 'replicate to another data-centre, hope the failover runbook works.' AWS Elastic Disaster Recovery is the cloud-native answer: block-level replication to AWS, minutes of RTO, tested by actually failing over. Understanding it means understanding the replication agent, the staging subnet, the launch template, and what separates DRS from Backup.
Solutions Architect Pro · SAP-C02
Coming soonFollowing the Request Everywhere
One customer request touches seven services in five accounts. When it times out, three teams look at three dashboards and disagree about whose fault it is. Distributed tracing is the answer -- if the trace follows the request across account boundaries and the backend can aggregate spans from all five accounts without each team reinventing the wheel. AWS X-Ray with cross-account sharing, plus a ServiceLens-aware observability layer, is how this stops being a lunchtime-every-day conversation.
Solutions Architect Pro · SAP-C02
Coming soonNames That Work Both Sides
A Fargate service in AWS needs to resolve hostnames from the on-prem AD domain; an on-prem application needs to resolve internal AWS service endpoints. Today each side has its own DNS and they don't talk. The clean answer is Route 53 Resolver with inbound and outbound endpoints plus forwarding rules -- one piece gets AWS-side queries to on-prem, the other gets on-prem queries to AWS. Knowing which endpoint points which way is the whole job.
Solutions Architect Pro · SAP-C02
Coming soonAWS on the Warehouse Floor
A fulfilment centre in a town with bad internet has to run warehouse management software with sub-10ms latency to the robots on the floor. Cloud isn't an option -- the fibre link drops four times a month. On-prem hardware works but the team doesn't want another fleet to patch. AWS Outposts is the third answer: AWS-managed hardware sitting in the warehouse, running the same services as the cloud, connected back to the parent Region. Knowing what Outposts can and cannot do is how we avoid a million-pound rack collecting dust.
Solutions Architect Pro · SAP-C02
Coming soonMeshes Under Meshes
Forty microservices in an EKS cluster. Three teams have added their own Istio sidecars; one team has installed Linkerd; the platform team has a half-working AWS App Mesh deployment from two years ago. The CTO wants one mesh or none. Choosing is less about 'which mesh is best' than about what problems each solves, which of them the team is willing to run, and whether the answer should be a service mesh at all or something thinner.
Solutions Architect Pro · SAP-C02
Coming soonKeys That Never Leave
A payments regulator asks for the cryptographic keys that sign card transactions to live in hardware rated FIPS 140-2 Level 3. KMS is 'hardware-backed'; is that enough? CloudHSM is 'single-tenant hardware'; is that necessary? The answer sits in the gap between what each service lets you do, what FIPS certifies, and what a specific regulator will accept. Getting it right means knowing when KMS is sufficient and when CloudHSM earns its weight.
Solutions Architect Pro · SAP-C02
Coming soonCompute at the Tower
A factory-floor AR application needs sub-10ms round-trip latency between a tablet and its inference backend. The parent Region is 50ms away; the factory's uplink is 5G. AWS Wavelength places compute inside the carrier's 5G network, one hop from the tablet's radio. Knowing when it earns its weight -- and when a nearby AWS Local Zone or Outpost does the same job for less -- is the difference between ultra-low-latency and ultra-pricey-but-not-faster.
Solutions Architect Pro · SAP-C02
Coming soonThe Data Stays Here
A financial regulator in a small country requires that all customer data physically reside inside the country's borders. AWS doesn't operate a Region there. Outposts would work but is capex-scale. AWS Local Zones sit in a specific metro without running a full Region -- single-digit-ms latency to same-city clients, data stays in the Local Zone's jurisdiction. When the compliance boundary is 'data inside this city,' Local Zones is the tool; knowing what it can and can't do saves a failed audit.
Solutions Architect Pro · SAP-C02
Coming soonChange Feeds Across Regions
A DynamoDB table holds order state in eu-west-1. A warehouse-management service in us-east-1 and an analytics pipeline in ap-southeast-2 both need to react to every write. Global Tables solve 'the data exists in both Regions' but not 'Region B reacts to a change.' DynamoDB Streams plus Kinesis Data Streams plus cross-region delivery is the native answer -- and knowing the shape lets us decide when Global Tables suffices and when we need the streaming overlay.
Solutions Architect Pro · SAP-C02
Coming soonOutbound That You Trust
Forty thousand EC2 instances across two hundred VPCs in forty accounts. Today each workload speaks to the internet through local NAT; tomorrow a compromised container could exfiltrate data to anywhere on the web and nobody would notice until the invoice. Egress filtering at scale isn't about turning on a firewall -- it's about building one organisation-wide policy, routing every packet through it, scaling it past the single-TGW limit, and convincing forty teams the blast radius is worth it.
Solutions Architect Pro · SAP-C02
Coming soonCross-Account Pipeline Keys
A CodePipeline in a tooling account, artefacts in an S3 bucket in the tooling account, deploy stages that run CloudFormation in staging and prod accounts. The pipeline is green in the console and red in every deploy account: 'AccessDenied' on the artefact download. The fix isn't another role -- it's the customer-managed KMS key that encrypts the bucket contents, and who is allowed to decrypt with it.
DevOps Engineer Pro · DOP-C02
Coming soonPatching With Proof
Eight hundred EC2 instances, five operating-system flavours, a compliance team that wants to know which hosts are behind on security patches and a change-management team that wants a record of exactly what landed where. SSM Patch Manager has the moving parts -- baselines, patch groups, maintenance windows, compliance state -- but wiring them into a report someone outside the team will read takes a specific set of decisions.
DevOps Engineer Pro · DOP-C02
Coming soonPrivate Packages, Hot Caches
A monorepo of forty Node services, npm install eating eight minutes of every CodeBuild run, and a security finding that half the internal packages are installed straight from a public registry. The two problems are related: a private artefact registry that CodeBuild doesn't cache against is the same wound as a cache that doesn't know about private packages. CodeArtifact upstreams plus the right kind of CodeBuild cache turn both into one story.
DevOps Engineer Pro · DOP-C02
Coming soonTwo Views of the Same Rule
AWS Config already says every S3 bucket in the Organization has default encryption enabled. The auditor wants the same statement in a different shape: a signed PDF with a framework reference, an assessor sign-off, and evidence attached to each control. Config and Audit Manager both see the rule, but only one produces the artefact the auditor will accept -- and the two services together produce the record neither does alone.
DevOps Engineer Pro · DOP-C02
Coming soonPulling and Pushing Kubernetes
Three EKS clusters, sixty teams deploying into them, a CodePipeline stage that runs kubectl apply against a kubeconfig file that nobody wants to rotate. Moving to GitOps means one decision -- ArgoCD or Flux -- and a dozen smaller ones about repository layout, RBAC, secrets, and how the cluster gets told what to run. The two tools pull from the same kinds of places, but their mental models and the failure modes that matter on audit differ enough to be worth thinking through.
DevOps Engineer Pro · DOP-C02
Coming soonThree Ways to Run a Container
A small service mesh of ten applications, each a container image in ECR. One handles a sustained 2,000 requests per second; one is invoked 800 times a day by a webhook; one is the team's favourite long-running worker. App Runner, ECS Fargate, and Lambda can all run the same image, and the right answer changes per application. The interesting work is mapping request shape, cold-start tolerance, and networking constraints to the runtime that fits.
DevOps Engineer Pro · DOP-C02
Coming soonBackup That Reports Itself
Three accounts, eight resource types, a backup policy that lives in AWS Backup and a compliance ask that nobody wants to answer with screenshots. Audit reports in AWS Backup generate the artefacts directly, once the backup plans, vaults, and reporting jobs are wired up to talk to each other. The trick is understanding which report answers which question and where the output lands.
DevOps Engineer Pro · DOP-C02
Coming soonFleets, Reserved and On-Demand
Two thousand CodeBuild runs a day across forty services, queue depth that spikes on merge storms, nine-minute builds when build capacity is cold and ninety-second builds when it's warm. CodeBuild's fleet types -- on-demand, reserved capacity, Lambda compute -- each put the compute somewhere different. The decision isn't about which is fastest but which matches the build profile, and reserved fleets change what 'warm cache' means.
DevOps Engineer Pro · DOP-C02
Coming soonStacks That Drifted
Forty CloudFormation stacks, an auditor's drift report showing eighty modified resources, and a team that can't remember which of those were incident fixes and which were accidents. CloudFormation's drift detection names the resources; the remediation is where the work lives. Detect-and-overwrite, update-stack, import-and-adopt, and change-set-first all produce different outcomes, and picking the right one depends on what caused the drift in the first place.
DevOps Engineer Pro · DOP-C02
Coming soonConfigure Once, Deploy Many
Twelve teams wanting to self-serve an internal platform -- VPC, IAM roles, CI pipelines, logging baseline -- without asking platform engineering every time. Feature flags and configuration that should update in seconds, without a redeploy. Service Catalog and AppConfig are the two AWS services built for those different asks; they're often confused for each other but answer different questions about different scopes.
DevOps Engineer Pro · DOP-C02
Coming soonRunbooks That Run Themselves
A GuardDuty finding fires; someone on call looks at the runbook in Confluence; twelve steps later the EC2 instance is quarantined, a snapshot has been taken, the on-call has typed the AWS console password three times. Step Functions Standard workflows with Lambda tasks, Wait states, approval tokens, and EventBridge rules turn that twelve-step runbook into code -- auditable, testable, and repeatable at 03:00 without the typos.
DevOps Engineer Pro · DOP-C02
Coming soonAlways-On Vulnerability Scanning
EC2 instances, Lambda functions, and container images sitting in ECR -- each a potential surface for a known CVE, each with a different way to find out. Weekly Nessus scans catch things after they land. Amazon Inspector scans continuously at the boundary: image push, instance launch, function update, plus periodic rescans. Understanding what Inspector covers natively versus what it defers to other services is the difference between assuming coverage and having it.
DevOps Engineer Pro · DOP-C02
Coming soonPatching On-Premises Alongside Cloud
Four hundred EC2 instances, two hundred on-prem servers in a data centre connected via Direct Connect, and a single compliance baseline that says 'patch cadence is the same everywhere.' Patch baselines in SSM treat both fleets as managed nodes once the hybrid activation is in place. The decisions worth making are about patch group hierarchies, non-prod vs prod soak periods, and what happens when a hybrid node loses connectivity mid-maintenance window.
DevOps Engineer Pro · DOP-C02
Coming soonRecovery Objectives Made Measurable
Every service the team runs claims a 4-hour RTO and 1-hour RPO in the design doc. Nobody has actually measured whether the numbers hold. AWS Resilience Hub takes the claim, inspects the architecture, scores it against the target, and suggests what to change. The interesting part is what it measures, what it can't measure, and how the resulting score becomes a reliable input to the audit.
DevOps Engineer Pro · DOP-C02
Coming soonBreaking Things On Purpose
Resilience reviews sign off the architecture; game days sign off the application. AWS Fault Injection Simulator sits between them: declarative experiments that stop an EC2 instance, throttle a DynamoDB table, or drop network traffic to an AZ, with stop conditions tied to CloudWatch alarms so a test that starts hurting production aborts itself. Understanding experiment templates, actions, targets, and safety rails is the difference between a controlled test and an incident.
DevOps Engineer Pro · DOP-C02
Coming soonApplication Insights Without Dashboards
A .NET service on EC2 behind an ALB with a SQL Server database, a team that has drilled into CloudWatch Metrics three hundred times looking for 'why is this slow,' and an aggregate noise floor that makes a real regression hard to spot. CloudWatch Application Insights picks components by tag, learns the normal metric patterns, and surfaces the deviations that actually correlate with incidents. Understanding what it monitors natively, what it ignores, and how it fits alongside Container Insights and custom dashboards is the interesting work.
DevOps Engineer Pro · DOP-C02
Coming soonThe Knowledge That Changes Daily
A support assistant that has to answer from a product manual which the product team edits weekly, a pricing sheet that changes at month-end, and an operational runbook that mutates hourly. The base model doesn't know any of it, and fine-tuning won't keep up. Retrieval is the answer; the question is how much of the retrieval plumbing we want to own, and Bedrock Knowledge Bases, a LangChain stack, and a hand-rolled pipeline each put the lines in different places.
Generative AI Developer · AIP-C01
Coming soonThe Assistant That Does Things
An assistant that has to look up a customer's subscription, pause it, refund a charge, and email confirmation. Not just answer -- act. The glue between a language model and the rest of our systems is a solved problem three different ways: Bedrock Agents, a LangChain agent loop, or a hand-written tool router. Each of them handles tool definition, invocation, and error recovery, but they put the guardrails in very different places.
Generative AI Developer · AIP-C01
Coming soonWhere the Vectors Live
Twelve million embedding vectors, a 50ms retrieval budget, hybrid queries that mix keyword and semantic, and a bill that should not double the Bedrock spend on its own. OpenSearch Serverless, Aurora with pgvector, and Pinecone Serverless all serve the same shape of query, but their pricing curves, operational shapes, and query surfaces diverge the moment the corpus grows beyond demo scale.
Generative AI Developer · AIP-C01
Coming soonOne Prompt, A Hundred Callers
One prompt scattered across thirty services, no versioning, no tests, drift between the copy in the code and the copy in the docs, a silent regression when somebody changed 'concise' to 'brief' and retention on one response tanked. Prompt engineering at a hundred callers isn't prose discipline, it's configuration management. Bedrock Prompt Management, Git-backed templates, and parameterised prompts each solve a slice of the same problem.
Generative AI Developer · AIP-C01
Coming soonThe Judges In the Loop
Two thousand historical support tickets, a summarisation prompt, a new model candidate, and a product manager asking whether switching would hurt quality. Bedrock evaluation jobs offer automated scoring, human review through Ground Truth workflows, and model comparison side by side -- but they answer different questions, and getting the right number out of the right job matters more than running more of them.
Generative AI Developer · AIP-C01
Coming soonThree Modalities, One Answer
A claims-processing assistant that reads a scanned invoice, listens to a voicemail, answers the customer's question in plain text, and -- if asked -- reads it back. Four modalities, one conversation. The model choice, the orchestration shape, and the ways different inputs fail each push the architecture in different directions, and the naive 'just use a multi-modal model' misses half of where the real work is.
Generative AI Developer · AIP-C01
Coming soonWhere the Bedrock Bill Goes
A Bedrock bill that doubled in two months, a product roadmap that blames the retrieval service, a finance partner who'd like a straight answer. The cheapest token is the one you don't send; the next cheapest is the one you send to the right model. Model routing, prompt compression, cached retrieval, and provisioned throughput each solve a different slice of the cost problem, and none of them is the silver bullet.
Generative AI Developer · AIP-C01
Coming soonThe Weights We Trained Elsewhere
A research team has fine-tuned an open-weights model for medical-notes summarisation on a private SageMaker cluster. The resulting weights live in S3; the production runtime wants Bedrock's ergonomics. Custom model import bridges that gap -- but it only works for certain base architectures, comes with throughput minimums, and quietly changes the cost model compared to on-demand foundation models.
Generative AI Developer · AIP-C01
Coming soonThe First Token First
A chat interface where users wait four seconds for any response on long generations, abandon rate creeping up, product asking why we can't do the typing-animation thing that every other assistant does. Streaming isn't just a UX polish -- it changes how the entire response path has to work, from the SDK call through API Gateway to the browser, and each hop has its own way of getting it wrong.
Generative AI Developer · AIP-C01
Coming soonToo Big For One Prompt
A 400-page contract, a 200-page policy manual, and a legal team asking 'what clauses govern refund disputes across both?' A 200,000-token context window sounds like enough until you realise what goes in with the documents. Chunking, map-reduce, hierarchical summarisation, and sliding context windows each answer a different question, and getting the boundaries right is most of the battle.
Generative AI Developer · AIP-C01
Coming soonThe Numbers We Turn Text Into
An index with 20 million chunks, queries that need to work in English, Spanish, Portuguese, and Japanese, and a budget that won't bear re-embedding every six months when someone decides the new model is better. The embedding model quietly caps what a retrieval system can ever do -- Titan, Cohere, and a self-hosted model all trade different things, and the comparison is messier than the marketing suggests.
Generative AI Developer · AIP-C01
Coming soonThe Price of a Predictable Second
On-demand Bedrock is priced per token, throttled per minute, and latency-variable in a way that product hates. Provisioned Throughput buys predictability with a monthly commitment. The break-even between the two isn't obvious, and the answer changes with the shape of the workload -- a 24/7 assistant and a burst-heavy report generator land on opposite sides of the same line.
Generative AI Developer · AIP-C01
Coming soonThe Names That Can't Leave
A claims assistant that has to answer questions about a customer's claim while keeping the customer's name, address, policy number, and medical details out of training data, out of logs, and out of anything a subpoena could touch later. PII redaction isn't one knob -- it's four or five, in different places, each covering a different leak path. Comprehend, Bedrock Guardrails, custom redaction, and the prompt itself each handle a different shape of the problem.
Generative AI Developer · AIP-C01
Coming soonTwo Doors to the Model
A team that wants Llama 3.3 70B in production can take it through SageMaker JumpStart, pushing a button to deploy onto an endpoint they own, or through Bedrock's model catalog with per-token pricing and no infrastructure. Same model, sometimes the same base weights, two quite different operational shapes. The right choice depends on how much of the serving layer you actually want to touch.
Generative AI Developer · AIP-C01
Coming soonThe Answer We Already Gave
Thirty percent of the support assistant's queries are paraphrases of each other -- 'how do I cancel?' 'can I cancel?' 'where's the cancel button?' -- and every one pays full model price. Caching LLM responses isn't as simple as hashing a prompt: exact-match, semantic, and prefix caching answer different questions, and getting the boundary wrong serves yesterday's answer to today's question.
Generative AI Developer · AIP-C01
Coming soonThe Hands the Model Has
An assistant that knows the answers but can't act on them is half a tool. Function calling -- letting the model invoke tools we define, with arguments it chooses -- turns understanding into action. Bedrock's Converse API has native tool-use support; so does Anthropic's Messages API via Bedrock; so do Bedrock Agents as a higher-level wrapper. Each exposes function calling through a different surface, and picking wrong makes the simple case hard.
Generative AI Developer · AIP-C01
Coming soonThree Logs About the Same Packet
VPC Flow Logs tell us a packet moved; GuardDuty tells us it looks suspicious; Detective tells us what else that actor has been doing. Same investigation, three stages, three services that talk to each other so precisely that people often assume they are one thing. They aren't -- and the difference matters when a finding fires at 03:00 and the on-call has to decide which console to open first.
Security · SCS-C03
Coming soonThree Kinds of WAF Rule
A login endpoint under a credential-stuffing run, a news site whose comment form is getting SQL-injected, and a tier-one PCI application that has to prove to an auditor that it blocks the OWASP Top Ten. Three problems, three kinds of WAF rule, three different ways the rule decides whether a request is hostile. Picking the wrong kind wastes effort at best and lets the attacker through at worst; picking the right kind is almost the whole job.
Security · SCS-C03
Coming soonThe Day the Shield Upgrade Paid for Itself
Shield Standard is on by default, free, and genuinely useful. Shield Advanced is $3,000 a month per organisation, a lot of paperwork, and -- on the quarter when a volumetric attack lands on a revenue-bearing endpoint -- probably the cheapest insurance AWS sells. Picking between them is less about the attack surface and more about the budget, the team, and whether the DDoS response team on speed-dial is worth more than the subscription.
Security · SCS-C03
Coming soonSeven Years of CloudTrail
An auditor asking for every AssumeRole call into the production account for the last seven years. A SIEM bill dominated by CloudTrail forwarding. An incident responder grepping gzipped JSON in an S3 bucket at 2am. Three jobs that look like they want the same thing -- CloudTrail data -- and three jobs that pull in opposite directions when the storage format is just gzipped JSON in S3. CloudTrail Lake is what happens when AWS stops making customers build that pipeline themselves.
Security · SCS-C03
Coming soonThe CVE Scanner That's Already On
A Lambda function written three years ago, still in production, still using a library with a known RCE. A container image that passed review in 2024 and has quietly rotted since. An EC2 AMI that was hardened at launch and hasn't been looked at since. Three workloads, one vulnerability story -- and Amazon Inspector is the scanner that was already on without anyone quite deciding to turn it on, whose findings pile up in Security Hub until somebody asks who owns remediation.
Security · SCS-C03
Coming soonThe Graph That Answers 'What Else'
A GuardDuty finding on an EC2 instance that leads to questions no single log source answers: what else was that role used for, which accounts did the session touch, is this user's behaviour unusual for a Tuesday? Those are graph questions. Detective is the service that already drew the graph; the work is knowing when to use it, what it can't tell us, and how to walk from an entity profile to a containment decision in the time the incident gives us.
Security · SCS-C03
Coming soonOne Place for Every Finding
GuardDuty in twelve accounts. Inspector finding vulnerabilities across EC2, ECR, and Lambda. Macie flagging sensitive S3 objects. IAM Access Analyzer catching public exposure. Four different consoles, four different finding formats, one security team that doesn't have time to open four tabs. Security Hub is the aggregator that turns finding streams into a single normalised queue, and using it well is mostly about what you route, how you score, and who you route it to.
Security · SCS-C03
Coming soonTwo Ways to Tell the Auditor
An auditor wants evidence the org is meeting PCI DSS 4.0.1. One team reaches for AWS Config conformance packs; another reaches for Audit Manager. Both produce dashboards, both produce reports, and both cost money -- but they're doing very different work. Config is the continuous compliance engine; Audit Manager is the evidence-collection and assessment workflow on top. Picking the wrong one leaves the other half of the job undone.
Security · SCS-C03
Coming soonThree Firewalls in Different Clothes
Network Firewall, WAF, and Shield sound like overlapping answers to the same question. They aren't. Each sits at a different layer of the network, each sees a different slice of traffic, each is good at a different kind of threat. Picking the right one -- or the right combination -- starts with what we're trying to protect and works outward to what the traffic looks like when it arrives.
Security · SCS-C03
Coming soonWho Gets Into Which Account
Forty accounts, two hundred engineers, and the perennial question: how do they get into the accounts they need and nothing else? IAM Identity Center permission sets and cross-account IAM roles both answer it, but in different shapes -- one is a scalable, identity-centric pattern that works for humans, the other is the building block that's older, lower-level, and still necessary for programmatic access. Picking the right one is less about which is better and more about what the caller is.
Security · SCS-C03
Coming soonWhere the Key Actually Lives
KMS is the default answer for "we need a key." CloudHSM is the answer for "we need a key AND we need to be the only ones who can touch it, AND we need FIPS 140-3 Level 3, AND we need a PKCS#11 interface for that legacy Java client." The difference between them is less about capability and more about where the raw key material actually lives, who can get to it, and how much of the operational work is yours versus AWS's.
Security · SCS-C03
Coming soonA Private CA That Doesn't Own the World
A root CA with a 20-year key, offline in an HSM, that signs two subordinate CAs -- one for production workloads, one for developer environments. Each subordinate signs its own leaf certificates with short lifetimes. ACM Private CA builds this hierarchy in AWS-native shape, and the design choices -- how many tiers, what key types, what templates, what revocation model -- are less about cryptography and more about blast radius.
Security · SCS-C03
Coming soonReading a GraphQL Body Through WAF
Every GraphQL request is a POST to the same URL with the query inside the JSON body. WAF's pattern rules that trigger on URI path don't help; rate-based rules by endpoint don't distinguish a cheap query from an expensive one. The answer is JSON body inspection: WAF rule statements that parse the POST body, navigate to the query field, and pattern-match or rate-limit on what's actually happening at the GraphQL layer.
Security · SCS-C03
Coming soonEncrypting a Single Form Field
TLS protects every field in a form in transit; it doesn't protect any of them at the origin. A credit-card number posted through CloudFront arrives at the ALB in plaintext, at the application in plaintext, in application logs and error traces and APM spans in plaintext. Field-level encryption is CloudFront's answer: encrypt specific POST fields at the edge with a public key whose private counterpart lives only with the payments processor, so the middle of the stack never sees the plaintext.
Security · SCS-C03
Coming soonThe Scan That Reads the Disk
GuardDuty sees the packets. Inspector sees the installed packages. Neither looks at what's actually on disk -- whether a cryptominer has been written to /tmp, whether a reverse shell is sitting in /usr/local/bin. GuardDuty Malware Protection is the extra tier that takes a snapshot, spins up a dedicated scanner, and tells us what's in the filesystem. Knowing when it fires and what it can't see is the difference between a real defence and a tick-box.
Security · SCS-C03
Coming soonWho Can Do What to This Document
Cognito handles who is signed in. What they can do inside the app -- edit this document, view that folder, approve that workflow -- is a different question, usually answered by code scattered across services. Verified Permissions is the managed policy engine AWS built for exactly that: Cedar policies, hierarchical resources, batched evaluation, and a clean separation between authentication and authorisation.
Security · SCS-C03
Coming soon
Direct Connect That Survives a Fibre Cut
A single 10 Gbps Direct Connect link carrying every byte between the on-prem data centre and AWS. Finance is happy, the network team is nervous, and the business is about to depend on this path for a quarterly close that cannot slip. Four resilience shapes exist in the Direct Connect catalogue, each with a different blast radius, a different price tag, and a different SLA. The interesting question isn't 'which is most resilient' -- every shape is most resilient at something -- but which failure modes we're actually buying insurance against.
Advanced Networking · ANS-C01
Coming soon
One WAN Across Thirty Regions
A company with 30 AWS Regions, a handful of branch offices on every continent, and a Transit Gateway tangle that nobody can draw on a whiteboard anymore. Peering meshes across Regions, route tables copy-pasted between accounts, and the network team has finally said 'no more' to any new Region request. AWS Cloud WAN promises a global network built from policy rather than peering. The interesting question isn't whether it's newer -- it is -- but whether a segment-based, policy-defined WAN actually solves the problems the TGW mesh created, and what we give up to get there.
Advanced Networking · ANS-C01
Coming soon
Expose One Service, Not the Whole VPC
A payments service in one VPC. A dozen consumer teams across half the company who need to call it. The first instinct is to attach everything to a Transit Gateway and let the route tables sort it out. The better question is whether the consumers need a whole network path into the producer's VPC, or just a TCP socket against the one service. PrivateLink exposes the socket without the network; TGW connects the networks and trusts the firewalls. Which one is right depends on whether we're running a service or sharing a subnet.
Advanced Networking · ANS-C01
Coming soon
Resolving corp.example.com From Both Sides
A hybrid estate where Windows-based corporate services still live on-prem under corp.example.com, and new AWS workloads resolve cloud.example.com via Route 53 Private Hosted Zones. An EC2 instance needs to look up both. Today, it works for cloud names and silently fails for on-prem ones. Route 53 Resolver has two endpoint types, one per direction, and a small vocabulary of rules that decide which queries go where. Understanding the endpoints and rules is the whole job.
Advanced Networking · ANS-C01
Coming soon
Stateful Rules for an Egress Firewall
Compliance has a new requirement: egress from every production workload must be allowlisted by destination hostname, not IP address, and every denied connection must leave an audit trail. Security groups and NACLs are the wrong abstraction; they work on IPs and ports. AWS Network Firewall is the managed Suricata-compatible service that does this work. Understanding the stateful rule order, domain lists, and TLS SNI inspection is what turns 'block what isn't on the list' from a slogan into a deployed configuration.
Advanced Networking · ANS-C01
Coming soon
Steering Traffic With BGP Attributes
Two Direct Connect links, one primary and one backup, and a business requirement that says 'backup is backup -- outbound traffic goes out the primary, return traffic comes back the primary, and everything fails over cleanly.' BGP has five attributes that decide which route wins, and only two of them are levers we can pull with AWS. Understanding local preference, AS-path prepending, MED, and AWS's Direct Connect BGP community tags is the whole game.
Advanced Networking · ANS-C01
Coming soon
Many Services, Fewer Subnets
Dozens of microservices across half a dozen VPCs, each one exposed via its own NLB or ALB, each one reachable via PrivateLink endpoints in every consumer VPC, each one tied into a per-service IAM policy for who can call it. The network team is managing load balancers; the platform team is managing endpoints; nobody owns the overall shape. VPC Lattice bundles service discovery, IAM-based authorisation, and cross-VPC routing into one thing. Whether that consolidation is worth the new abstraction depends on how many services we're actually running.
Advanced Networking · ANS-C01
Coming soon
Getting a Packet There Fast
A global application whose users complain about latency from Sydney, intermittent connection failures from Mumbai, and a jittery experience everywhere that isn't Europe. The origin is in eu-west-1. Two AWS services put the user closer to AWS: CloudFront caches at the edge, and Global Accelerator provides anycast IPs that route onto AWS's private backbone. Both live at 'the edge,' both improve latency, and they do very different jobs. Picking between them is a question of protocol, cacheability, and whether we need a static IP.
Advanced Networking · ANS-C01
Coming soon
Plugging SD-WAN Into a Transit Gateway
An SD-WAN vendor's appliance running in the VPC, carrying branch-office traffic. Today it's stitched to the Transit Gateway via IPsec over a VPN attachment -- which means encapsulating twice and living under the throughput ceiling AWS imposes on VPN. Transit Gateway Connect attachments exist precisely for this shape: GRE-based tunnels between the appliance and the TGW, higher throughput, dynamic routing. Understanding Connect attachments, Connect peers, and BGP-over-GRE turns a clunky integration into something that behaves like the TGW learned to speak SD-WAN natively.
Advanced Networking · ANS-C01
Coming soon
Market Data Multicast Into the Cloud
A market data pipeline that ingests multicast feeds from an exchange's colocation facility. On-prem, this is routine: IGMP on the switch, multicast routing in the core, consumers join groups and receive. In AWS, multicast has historically been absent from the VPC fabric. Transit Gateway Multicast is the managed feature that changes that. Understanding multicast domains, sources, and IGMPv2 members turns 'can we even do this in a VPC?' into a standard architecture diagram.
Advanced Networking · ANS-C01
Coming soon
Why Can't This Packet Reach That Instance?
A ticket comes in: 'the app in the prod VPC can't reach the RDS instance in the data VPC, worked yesterday, nothing changed.' Nothing ever changed. Two AWS tools both answer 'can this thing reach that thing' questions: VPC Reachability Analyzer is a static path-existence check between one source and one destination; VPC Network Access Analyzer identifies every path that matches an access scope we define. They look similar at a glance; they solve different problems. Knowing which one to reach for is the difference between a one-minute answer and an afternoon of tcpdump.
Advanced Networking · ANS-C01
Coming soon
One Firewall for the Whole Estate
An organisation has decided every packet leaving or entering a VPC must pass through a vendor firewall for deep inspection. Running firewall appliances per VPC is a licensing nightmare; having each VPC route traffic through a central firewall VPC is the plan. Gateway Load Balancer is the AWS primitive that makes centralised inspection work. Understanding GENEVE tunnelling, GWLB endpoints, and route-table plumbing is the difference between 'packets get inspected' and 'packets get inspected twice, once in each direction, with symmetric routing.'
Advanced Networking · ANS-C01
Coming soon
Stopping Egress at the Layer That Sees It
A requirement to constrain outbound traffic to an allowlist of destinations. Two AWS shapes answer it with different trade-offs. Network Firewall inspects packets at layer 3 and 4 with some TLS-SNI awareness at layer 5. A forward proxy terminates HTTP and HTTPS sessions at layer 7, sees headers, the full URL path, and the TLS certificate chain. Which layer we stop the traffic at decides how much we see, how much we can control, and what we pay in complexity.
Advanced Networking · ANS-C01
Coming soon
A VPC With No IPv4 At All
IPv4 space has become a cost line item. Public IPs bill by the hour. RFC1918 is exhausted in the M&A portfolio. The question of whether to run an IPv6-only VPC has moved from 'interesting experiment' to 'concrete ask.' The dataplane supports it. The services mostly support it. The edge cases are where the work lives: legacy clients that only speak IPv4, third-party services that haven't caught up, instance types that do or don't support IPv6. Understanding what an IPv6-only VPC looks like in practice is the difference between a clean launch and a post-migration support queue.
Advanced Networking · ANS-C01
Coming soon