The situation
We’re reviewing the proposed architecture for a new internal API, a loyalty-points service that will read from and write to a Postgres database, expose a REST API to internal consumers, and run in eu-west-1. The first draft on the whiteboard is tidy:
- Application Load Balancer fronting an Auto Scaling Group of EC2 instances in a single subnet.
- Amazon RDS for PostgreSQL, single-AZ, db.m6i.large.
- Shared VPC with the rest of the company; security groups scoped to the calling services.
- Deploys via a CodePipeline that runs on merge to main; tests run in the pipeline.
- No encryption configured beyond the defaults.
- No specific monitoring beyond CloudWatch metrics.
It works. It will handle the expected traffic. It’s also, depending on who you ask, either fine or a disaster. Operations thinks the single subnet will bite us. Security thinks the “no encryption configured” line is missing three conversations. The SRE is already drafting a post-mortem template for the eventual outage. Finance hasn’t priced it yet. Nobody has asked about the sustainability implications, but they will.
Rather than argue pillar-by-pillar by intuition, we walk it through a framework that names the dimensions, asks specific questions against each one, and ends with a list of trade-offs rather than a verdict.
What actually matters
An architecture review is really six conversations that people keep conflating. The Well-Architected Framework names and separates them so they can be had in parallel without everyone shouting past each other.
The first thing the framework does is surface the questions that aren’t being asked. Most first-draft architectures are good at one or two things their author thinks about a lot, adequate at two or three things the reviewer happens to care about, and silent on the rest. The pillars act as a checklist: “did we think about sustainability at all?” isn’t a judgment, it’s a prompt. Often the answer is “no, and that’s fine for now, but let’s record the decision.” Sometimes the answer is “no, and now we’re thinking about it, that Savings Plan we were going to buy should probably be on Graviton instead.”
The second thing the framework does is force explicit trade-offs. Every pillar has costs and every pillar has tensions with the others. Multi-AZ RDS roughly doubles the database instance and storage cost in exchange for reliability. Encryption at rest with a customer-managed KMS key adds a key-management responsibility in exchange for security and auditability. Aggressive cost optimisation often means running close to capacity limits, which trades reliability for cost. The pillars give us a vocabulary for saying “we spent reliability to buy cost here, deliberately, and here’s our plan if the bet goes wrong.”
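One way to make a named trade-off durable is a small structured record per decision. The sketch below is illustrative only: the `TradeOff` type, its field names, and the example fallback text are assumptions, not part of the Well-Architected Framework or Tool.

```python
from dataclasses import dataclass

PILLARS = (
    "Operational Excellence",
    "Security",
    "Reliability",
    "Performance Efficiency",
    "Cost Optimization",
    "Sustainability",
)


@dataclass(frozen=True)
class TradeOff:
    """One deliberate trade between two pillars (hypothetical record shape)."""
    decision: str
    spent: str     # pillar we knowingly gave ground on
    bought: str    # pillar we gained on
    fallback: str  # what we do if the bet goes wrong

    def __post_init__(self):
        # Reject typos so every record maps onto a real pillar.
        for pillar in (self.spent, self.bought):
            if pillar not in PILLARS:
                raise ValueError(f"unknown pillar: {pillar}")

    def summary(self) -> str:
        return (f"{self.decision}: spent {self.spent} to buy {self.bought}; "
                f"fallback: {self.fallback}")


# Example record; the fallback wording is invented for illustration.
multi_az = TradeOff(
    decision="Run RDS Multi-AZ",
    spent="Cost Optimization",
    bought="Reliability",
    fallback="Revisit if the database bill outgrows the value of the uptime",
)
print(multi_az.summary())
```

Keeping records like this next to the infrastructure code means the next review can start from the decisions already made rather than rediscovering them.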
The third thing it does is normalise the idea that good design isn’t a point, it’s a region. A design tuned for a bank is different from a design tuned for a marketing microsite, and both can be well-architected if the pillars are balanced against the actual requirements. The framework doesn’t prescribe the balance; it prescribes that the balance be chosen deliberately. “We know we’re weak on cost optimisation because this is a prototype” is fine. “We didn’t notice we were weak on cost optimisation” is not.
The fourth thing worth noting is that the framework is not a certification gate. AWS publishes a Well-Architected Tool that walks teams through a questionnaire against each pillar and generates a report of high and medium risk findings. The output is a list, not a score, and the list changes what the team does in the following weeks, not what the design is allowed to do today. The ambition is cumulative improvement over time, not perfection on day one.
And finally, the framework is version-controlled. AWS updates the pillars periodically; sustainability joined the set in late 2021, bringing the count from five to six. A review done against the 2019 version might look fine until someone notices the whole sustainability conversation was never had. The current framework (as of the time of writing) has six pillars, and they are the six we will walk through.
What we’ll filter on
Six pillars, one lens each. These are the questions each pillar asks the design.
- Operational Excellence: can we run this thing? Can we deploy it safely, observe it clearly, respond to incidents quickly, and learn from the failures?
- Security: can the people who should access this do so, and can the people who shouldn’t not? Is the data protected at every layer?
- Reliability: when something fails, does the system stay up or recover fast? Are we designed for the failures we know will happen?
- Performance Efficiency: is the system using the correct resources, at the correct size, for the workload? Will it scale when load grows?
- Cost Optimization: are we paying only for what we use and value, or are we paying for capacity that isn’t earning its keep?
- Sustainability: are we minimising the environmental impact of running this workload, in terms of energy used and resources provisioned?
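The "checklist, not judgment" idea can be sketched as a lookup that reports which pillars have no recorded answer yet. This is a minimal illustration: the shortened question strings, the `unanswered` helper, and the example answers are assumptions made for the sketch.

```python
# Pillars in framework order, each with a shortened form of its question.
PILLAR_QUESTIONS = {
    "Operational Excellence": "Can we run this thing?",
    "Security": "Can the right people get in, and the wrong people not?",
    "Reliability": "When something fails, do we stay up or recover fast?",
    "Performance Efficiency": "Are resources the right kind and size?",
    "Cost Optimization": "Are we paying only for what we use and value?",
    "Sustainability": "Are we minimising environmental impact?",
}


def unanswered(answers: dict[str, str]) -> list[str]:
    """Return pillars with no recorded answer, in framework order."""
    return [p for p in PILLAR_QUESTIONS if not answers.get(p, "").strip()]


# Hypothetical notes from a first-draft design: two pillars addressed.
draft_answers = {
    "Performance Efficiency": "Sized for the expected traffic.",
    "Operational Excellence": "Deploys via pipeline on merge to main.",
}

for pillar in unanswered(draft_answers):
    print(f"No answer recorded for {pillar}: {PILLAR_QUESTIONS[pillar]}")
```

An empty answer is a prompt to have the conversation, not a failure; "not now, and here is why" counts as an answer once it is written down.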
The pillar landscape
- Operational Excellence. The ability to run and evolve the workload without drama. Design principles: perform operations as code (infrastructure, deployments, runbooks, all in Git); make frequent, small, reversible changes; refine operations procedures frequently; anticipate failure; learn from operational events. Practical questions: is the deployment automated and rollback-able? Do we have runbooks for the five most likely incidents? Are logs, metrics, and traces available and queryable? Do we conduct post-incident reviews blamelessly, and do the outcomes actually feed back into the system? AWS services that tend to show up: CloudFormation or CDK for IaC, CodePipeline for delivery, CloudWatch and X-Ray for observability, Systems Manager for runbook automation.
- Security. Protecting information, systems, and assets; detecting security events; responding appropriately. Design principles: implement a strong identity foundation (least privilege, IAM per-role, centralised access management); enable traceability (log everything, review the logs); apply security at all layers (network, compute, application, data); automate security best practices; protect data in transit and at rest; keep people away from data (access through APIs, not SSH); prepare for security events. Practical questions: is data encrypted at rest with a key we control? Is data encrypted in transit? Do we have IAM roles (not users) for compute, and policies scoped to just what’s needed? Is CloudTrail on? Is GuardDuty on? Is Security Hub aggregating findings? If the answer to any of those is “default”, it probably isn’t enough for production. Services: IAM, KMS, CloudTrail, GuardDuty, Security Hub, AWS WAF, Secrets Manager.
- Reliability. The workload performs its intended function correctly and consistently when expected. Design principles: automatically recover from failure (detect, repair, don’t wait for a human); test recovery procedures (simulate failures, check we know what happens); scale horizontally to increase aggregate availability (many small things, not one big thing); stop guessing capacity (autoscale or provision on measured demand); manage change through automation. Practical questions: what’s our target availability, and does the design meet it? What are the RPO and RTO for the data? What happens when an AZ is unavailable? Do we have backups, are they tested, and can we restore them? Are dependencies (external APIs, databases, queues) designed with retry, timeout, and circuit-breaker patterns? Services: Route 53 health checks, Auto Scaling, multi-AZ RDS, S3 cross-region replication, Backup, Fault Injection Service.
- Performance Efficiency. Using computing resources efficiently to meet requirements and maintaining efficiency as demand changes and technology evolves. Design principles: democratise advanced technologies (use managed services so the team doesn’t have to become experts in running Kafka); go global in minutes (deploy to Regions near users); use serverless architectures (skip the server-management overhead where possible); experiment more often (cheap experiments via infrastructure-as-code); consider mechanical sympathy (use the instance type, storage type, and memory configuration that fits). Practical questions: is the instance type right-sized for the workload? Are the database engine and size appropriate? Are we caching where caching pays off (CloudFront, ElastiCache)? Are we using the correct storage class (gp3 vs io2, S3 Standard vs Intelligent-Tiering)? Services: CloudFront, Route 53 latency-based routing, ElastiCache, Compute Optimizer, EC2 instance families sized to workload.
- Cost Optimization. Running systems to deliver business value at the lowest price point. Design principles: implement cloud financial management (a FinOps function, not just “whoever spots the bill”); adopt a consumption model (pay for what we use); measure overall efficiency (value per dollar, not dollars saved); stop spending money on undifferentiated heavy lifting (managed services over DIY); analyse and attribute expenditure (tags, cost allocation). Practical questions: are we on-demand where we could be on a Savings Plan or Spot? Is anything running that shouldn’t be (dev instances over the weekend, orphaned EBS volumes, un-deleted snapshots)? Are we using the correct storage class for the access pattern? Are the instance sizes correct, or did someone pick m6i.large because it felt like a sensible default? Services: Savings Plans, Spot, Cost Explorer, Budgets, Compute Optimizer, S3 Intelligent-Tiering, Trusted Advisor.
- Sustainability. Minimising the environmental impact of running workloads: energy consumption, carbon footprint, and resource utilisation over time. Design principles: understand your impact (measure it); establish sustainability goals (reduce energy consumption per transaction, per user); maximise utilisation (an instance averaging 20% CPU leaves most of its provisioned capacity, and much of the energy behind it, idle); anticipate and adopt new, more-efficient hardware and software offerings (Graviton instances use less energy per unit of work than x86 equivalents for most workloads); use managed services (AWS can operate shared infrastructure at higher utilisation than we can); reduce the downstream impact of cloud workloads (smaller payloads, cached responses, efficient formats). Practical questions: are we on Graviton where it fits? Are idle resources running? Are we storing data in the cheapest (and usually greenest) storage tier appropriate to the access pattern? Is the Region choice sensible? Grid carbon intensity varies by Region, and the closest Region isn’t always the lowest-carbon one. Services: Customer Carbon Footprint Tool, Graviton instances, Compute Optimizer for right-sizing, S3 lifecycle policies.
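The Reliability pillar's question about retry and circuit-breaker patterns is easier to discuss with something concrete on the table. The following is a minimal, single-threaded sketch rather than a production implementation: the class and function names, thresholds, and delays are all assumptions, and per-call timeouts are assumed to be set on whatever client `func` wraps.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and calls are rejected immediately."""


class CircuitBreaker:
    """Fail fast after repeated failures, then allow a trial call later."""

    def __init__(self, failure_threshold=3, reset_after=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock          # injectable so tests can control time
        self._failures = 0
        self._opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: half-open. One more failure reopens it.
            self._opened_at = None
            self._failures = self.failure_threshold - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0           # success closes the circuit fully
        return result


def retry(func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call func up to `attempts` times with exponential backoff."""
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            raise                    # never retry while failing fast
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Usage: wrap a dependency call in the breaker, and the breaker in retry.
breaker = CircuitBreaker(failure_threshold=3, reset_after=30.0)
print(retry(lambda: breaker.call(lambda: "points balance: 120")))
```

The design choice worth reviewing is the interaction between the two: retries multiply load on a struggling dependency, and the breaker is what caps that multiplication.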
Side by side
Scoring the first-draft architecture (single-AZ, unencrypted, on-demand, manually observed) against each pillar:
| Pillar | First-draft design | Concrete weakness | First improvement to make |
|---|---|---|---|
| Operational Excellence | Partial ✓ | No runbooks, no tracing, deploys untested in staging | Add X-Ray, require staging run before prod, write three runbooks |
| Security | ✗ | No explicit encryption, KMS keys not defined, IAM roles broad | Enable encryption at rest with CMK; scope IAM; enable CloudTrail, GuardDuty |
| Reliability | ✗ | Single-AZ RDS, single subnet for EC2, no tested restore | Multi-AZ RDS; ASG across three AZs; scheduled restore drill |
| Performance Efficiency | Partial ✓ | m6i.large chosen without measurement; no caching layer | Run Compute Optimizer; add ElastiCache if reads dominate |
| Cost Optimization | ✗ | On-demand everything; no tagging; nothing turns off | Savings Plan on baseline; tag-based budgets; idle-shutdown on dev |
| Sustainability | ✗ | x86 instances without trial; Region picked by habit | Test Graviton; check Carbon Footprint Tool for Region carbon intensity |
Four fails, two partials, zero strong passes. That doesn’t mean the design is bad; it means the first draft has only done one job (work for the expected traffic), and the review’s job is to make the other five explicit before the thing goes live.
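If the table is kept as data rather than on a whiteboard, the tally becomes something the next review can diff against. A small sketch, assuming a three-value status vocabulary (`pass`, `partial`, `fail`) that is this example's own convention rather than anything the Well-Architected Tool emits:

```python
from collections import Counter

# Status per pillar, transcribed from the table above.
first_draft = {
    "Operational Excellence": "partial",
    "Security": "fail",
    "Reliability": "fail",
    "Performance Efficiency": "partial",
    "Cost Optimization": "fail",
    "Sustainability": "fail",
}

tally = Counter(first_draft.values())  # missing keys count as zero
print(f"{tally['fail']} fails, {tally['partial']} partials, "
      f"{tally['pass']} strong passes")
```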
[Figure: the six-pillar radar]
The pillars in depth: what the questions actually are
The Well-Architected Tool formalises each pillar as a set of questions, each with a handful of best-practice options the team checks off. A few examples from each, to show the flavour.
Operational Excellence asks things like: How do you determine what your priorities are? How do you structure your organization to support business outcomes? How do you reduce defects, ease remediation, and improve flow into production? How do you mitigate deployment risks? How do you monitor workload resources? The answers are procedural as much as technical: a team that deploys via CloudFormation with automated testing and automated rollback scores higher than one that shells into a box and edits config. A team that keeps runbooks in a wiki nobody reads scores lower than one that keeps them as Systems Manager Automation documents that run themselves.
Security asks: How do you manage identities for people and machines? How do you manage permissions for people and machines? How do you detect and investigate security events? How do you protect your network resources? How do you classify your data? How do you protect your data at rest? How do you protect your data in transit? How do you anticipate, respond to, and recover from incidents? The detail matters: “we encrypt at rest” and “we encrypt at rest with a customer-managed KMS key, with key rotation, and an IAM policy that restricts who can decrypt” are very different answers to the same question.
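To show what the more detailed answer looks like on paper, here is a sketch of the decrypt-restricting part of a KMS key policy, built as a Python dictionary and printed as JSON. The account ID is a placeholder and the role name is hypothetical. This is only one statement: a usable key policy also needs a statement for the key's administrators, which is omitted here.

```python
import json

ACCOUNT_ID = "111122223333"          # placeholder account ID
SERVICE_ROLE = "loyalty-points-api"  # hypothetical IAM role name

# Only the service role may use the key to decrypt.
decrypt_statement = {
    "Sid": "AllowServiceRoleToDecrypt",
    "Effect": "Allow",
    "Principal": {"AWS": f"arn:aws:iam::{ACCOUNT_ID}:role/{SERVICE_ROLE}"},
    "Action": ["kms:Decrypt", "kms:DescribeKey"],
    "Resource": "*",  # inside a key policy, "*" refers to this key only
}

# Incomplete on purpose: key-administrator statement not shown.
key_policy = {"Version": "2012-10-17", "Statement": [decrypt_statement]}

print(json.dumps(key_policy, indent=2))
```

Automatic key rotation is a separate setting on the key rather than part of the policy, so a complete answer to the pillar question covers both.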
Reliability asks: How do you manage service quotas and constraints? How do you plan your network topology? How do you design your workload service architecture? How do you design interactions in a distributed system to prevent failures? How do you design interactions in a distributed system to mitigate or withstand failures? How do you monitor workload resources? How do you design your workload to adapt to changes in demand? How do you implement change? How do you back up data? How do you use fault isolation? How do you design your workload to withstand component failures? How do you test reliability? How do you plan for disaster recovery? The testing question tends to find the most gaps: most teams back up, but few have actually restored from backup in the last quarter.
Performance Efficiency asks about selection (compute, storage, database, network), review (are we still using the correct choice as newer options appear?), monitoring (are we measuring, not guessing?), and trade-offs (where are we compromising throughput for latency or consistency for availability, and is it deliberate?). Compute Optimizer is the go-to service here: by default it analyses the last two weeks of CloudWatch metrics for resources such as EC2 instances, EBS volumes, Lambda functions, and ECS services on Fargate, and recommends smaller (or sometimes larger, sometimes different-family) resources.
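The "measuring, not guessing" point can be illustrated with a toy right-sizing check over CPU utilisation samples. This is not how Compute Optimizer works internally; it is a deliberately simple heuristic, and the 40% and 80% thresholds are arbitrary assumptions chosen for the example.

```python
def p95(samples):
    """Nearest-rank 95th percentile, using integer arithmetic for the rank."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def suggest(cpu_samples, low=40.0, high=80.0):
    """Suggest a sizing action from CPU utilisation percentages."""
    if not cpu_samples:
        raise ValueError("need at least one sample")
    peak = p95(cpu_samples)
    if peak < low:
        return "downsize"   # even busy periods leave most capacity idle
    if peak > high:
        return "upsize"     # little headroom left at the 95th percentile
    return "keep"


# Hypothetical utilisation samples: mostly idle with a modest daily peak.
samples = [12.0] * 90 + [35.0] * 10
print(p95(samples), suggest(samples))
```

Using a high percentile rather than the mean keeps short busy periods from being averaged away, which is the usual failure mode of sizing by intuition.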
Cost Optimization asks about practising cloud financial management, expenditure awareness (tagging, allocation, reporting), cost-effective resources (Savings Plans, Spot, right-sizing), demand matching (autoscaling, idle shutdown), and optimisation over time (reviewing old decisions as new services launch). The quick wins live in Trusted Advisor’s cost-optimisation checks: idle load balancers, underutilised instances, unassociated elastic IPs, orphaned snapshots.
Sustainability asks: How do you select Regions to support your sustainability goals? How do you align cloud resources to your demand? How do you take advantage of software and architecture patterns to support your sustainability goals? How do you take advantage of data access and usage patterns? How do you select and use cloud hardware and services to support your sustainability goals? How do your organisational processes support your sustainability goals? Region selection is the lever most people miss: the Ireland Region (eu-west-1) runs on a different grid mix than the Stockholm Region (eu-north-1), and the Customer Carbon Footprint Tool will show the difference.
A worked review: the loyalty-points API, end-to-end
Two weeks after the Friday whiteboard session, the team runs the design through the Well-Architected Tool in the console. Three hours with the team. Twenty-six questions across the six pillars. Thirty-one findings: four high-risk, nine medium, eighteen low or already-addressed.
High-risk findings. (1) Single-AZ RDS with no tested restore procedure (Reliability). (2) IAM role on the EC2 fleet grants * on a shared S3 bucket containing customer data (Security). (3) RDS storage encryption is not explicitly configured; it relies on defaults rather than a customer-managed key (Security). (4) No tagging strategy, so cost can’t be attributed to the team or the product (Cost Optimization). All four have a concrete remediation and a named owner; three of them are one-sprint changes, and one (the tagging strategy) is a quarter-long cross-team project with an initial quick-win phase.
The two-week plan. Flip RDS to Multi-AZ. Create a customer-managed KMS key for the service, with a key policy allowing only the service roles to decrypt. Move the RDS storage onto that key during the Multi-AZ switch: take a snapshot, copy it with encryption enabled under the new key, and restore from the encrypted copy. Split the IAM role into one that can read the application-config prefix and one that can read the customer-data prefix, with the latter granted only to the instances that actually need it. Enable GuardDuty at the account level. Enable Security Hub and subscribe to the AWS Foundational Security Best Practices standard. Add X-Ray SDK to the application. Write three runbooks: RDS failover, AZ outage, credential rotation. Schedule a restore drill for the following month.
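The IAM split in the plan can be sketched as two read-only policy documents, one per prefix. The bucket name and the exact prefix strings below are hypothetical, and the helper is an illustration of the shape of the change rather than the team's actual policies.

```python
import json

BUCKET = "example-shared-bucket"  # hypothetical bucket name


def read_only_policy(prefix: str) -> dict:
    """IAM policy allowing reads under a single prefix of BUCKET."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadObjectsUnderPrefix",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{BUCKET}/{prefix}/*",
            },
            {
                "Sid": "ListOnlyThisPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{BUCKET}",
                # Restrict listing to keys under the same prefix.
                "Condition": {"StringLike": {"s3:prefix": f"{prefix}/*"}},
            },
        ],
    }


# One policy per role; prefix names are assumed for the example.
config_policy = read_only_policy("application-config")
customer_policy = read_only_policy("customer-data")

print(json.dumps(customer_policy, indent=2))
```

Attaching the customer-data policy only to the role used by instances that need it is what turns the original `*` grant into least privilege.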
The quarter plan. Switch the EC2 baseline from m6i.large to Graviton m7g.large after a week of shadow-traffic testing; Compute Optimizer is already flagging the change. Buy a one-year Compute Savings Plan covering 80% of steady-state usage. Introduce cost-allocation tags (Team, Service, Environment), with a Config rule that flags resources missing them. Review the Region choice against the Customer Carbon Footprint Tool: eu-west-1 is fine on the carbon story but Stockholm would be lower; the decision is “stay in Ireland for data-locality reasons, revisit annually.”
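The tagging requirement reduces to a simple check that can run in a pipeline long before any account-level rule exists. A minimal sketch, assuming resources are available as a mapping from ID to tag dictionary; the resource IDs and tag values are made up for the example.

```python
REQUIRED_TAGS = ("Team", "Service", "Environment")


def missing_tags(tags: dict[str, str]) -> list[str]:
    """Return required tag keys that are absent or blank."""
    return [key for key in REQUIRED_TAGS if not tags.get(key, "").strip()]


# Hypothetical inventory: resource ID mapped to its tags.
resources = {
    "i-0example1": {"Team": "loyalty", "Service": "points-api",
                    "Environment": "prod"},
    "vol-0example2": {"Team": "loyalty"},
    "i-0example3": {"Team": "loyalty", "Service": "points-api",
                    "Environment": ""},
}

for resource_id, tags in resources.items():
    gaps = missing_tags(tags)
    if gaps:
        print(f"{resource_id} is missing: {', '.join(gaps)}")
```

Treating a blank value as missing matters for cost allocation, because an empty `Environment` tag attributes spend no better than an absent one.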
The after-radar. Four across all six pillars. No fives: fives cost money and effort that this workload does not earn. Four is “well-designed for the business requirements and explicitly chosen to be no more than that.” That is the goal.
What’s worth remembering
- Six pillars, not five. Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, Sustainability. Sustainability joined in late 2021; reviews done without it are missing a column.
- The framework is a checklist, not a grade. The Well-Architected Tool generates findings, not scores. “High-risk” and “medium-risk” items are prioritised actions; the goal is steady reduction, not zero.
- Balance beats maximum. A design that’s a 5 on security and a 2 on reliability is not well-architected. A design that’s a 4 across all six is. The radar shape matters more than the score on any one axis.
- Trade-offs are the point. Multi-AZ buys reliability and costs money; Graviton buys cost and sustainability and sometimes costs a code rebuild; encryption with a CMK buys security and costs a key-management role. Name the trade, pick deliberately, record the decision.
- The pillars tension each other. Cost optimisation often tensions reliability (cheaper means less capacity buffer). Security tensions operational excellence (stricter IAM means slower incident response unless runbooks are pre-baked). Performance tensions sustainability (faster usually means more compute). These tensions are healthy; the framework makes them visible.
- AWS provides services for each pillar. CloudFormation, CodePipeline, and CloudWatch for OpEx. IAM, KMS, CloudTrail, GuardDuty for Security. Multi-AZ, ASG, Backup, FIS for Reliability. Compute Optimizer, CloudFront, ElastiCache for Performance. Savings Plans, Cost Explorer, Budgets for Cost. Graviton, Carbon Footprint Tool for Sustainability. The framework points at the service; the service does the work.
- Reviews are a practice, not an event. A one-off Well-Architected Review before go-live catches the obvious gaps. Quarterly reviews catch the drift. The framework is designed to be used repeatedly against the same workload over time.
- Requirements set the target balance. A prototype doesn’t need a 4 on Cost Optimization; a regulated workload needs a 5 on Security. The framework doesn’t prescribe the balance, it prescribes that the balance be chosen deliberately and reviewed periodically.
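The "balance against a chosen target" idea from the list above can be sketched as a comparison between scores and a per-workload target profile. The 1-to-5 scale here is the radar scale used in this article, not something the Well-Architected Tool produces, and the example profiles are assumptions.

```python
PILLARS = (
    "Operational Excellence", "Security", "Reliability",
    "Performance Efficiency", "Cost Optimization", "Sustainability",
)


def gaps(scores: dict[str, int], targets: dict[str, int]) -> dict[str, int]:
    """Map each pillar scoring below its target to the size of the shortfall."""
    return {p: targets[p] - scores[p] for p in PILLARS if scores[p] < targets[p]}


balanced_target = {p: 4 for p in PILLARS}

# Strong on one axis, weak on another (other values assumed).
lopsided = {**balanced_target, "Security": 5, "Reliability": 2}
# The after-review state: four across all six pillars.
after_review = {p: 4 for p in PILLARS}
# A regulated workload raises the bar on Security only.
regulated_target = {**balanced_target, "Security": 5}

print(gaps(lopsided, balanced_target))
print(gaps(after_review, balanced_target))
print(gaps(after_review, regulated_target))
```

The same scores pass against one target and fall short against another, which is the point: the requirements pick the target, and the review measures the distance to it.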
Six pillars, six questions, one framework. The first-draft design in the whiteboard session answered two of them implicitly and four of them not at all. The review’s job wasn’t to grade the design but to make all six questions explicit, to surface the trade-offs each answer represented, and to end with a list of actions the team actually plans to take. A well-architected design isn’t one that scores perfectly; it’s one that scores deliberately across all six pillars, with the balance chosen to match what the workload actually needs.