The situation
A platform team supports twelve product teams and a shared reliability function:
- Product teams want to self-serve three patterns: “new service” (VPC subnet, ALB target group, ECS service, IAM role, log group), “new pipeline” (CodePipeline, CodeBuild project, CodeArtifact access, ECR repo), “new S3 data store” (encrypted bucket, lifecycle policy, replication target). Today, each request goes through a Jira ticket and takes 3-5 days.
- Reliability team wants feature flags for circuit-breaker thresholds, request timeouts, canary percentages. Changing a timeout from 2s to 5s today requires a code change, a PR review, a CI build, and a canary deploy: 30 minutes minimum. They want 30 seconds.
The asks:
- Self-service for the three infrastructure patterns with guardrails (tags required, encryption enforced, logging baseline mandatory).
- Approval workflow for certain patterns (the S3 data store needs compliance approval before provisioning).
- Feature flags and runtime config that application code reads and that can be changed live without redeploy.
- Configuration validation so a bad value (e.g. timeout=-1) is rejected before deployment.
- Staged rollouts. Configuration changes go to 10% of targets first, then 50%, then 100%, with automatic rollback on metric breach.
What actually matters
The two asks are fundamentally different operations on different timelines, and the confusion usually comes from treating them as alternatives.
Infrastructure self-service is “create these resources once per request, with guardrails that the platform team controls.” The platform team authors patterns; product teams launch instances of those patterns through a parameter form; the result is a stack of resources the team then owns. Provisioning takes minutes; the resources live for months. The platform team’s lever is the pattern catalogue: what’s in it, what parameters it accepts, what IAM role provisions it.
Runtime configuration delivery is “update values inside running applications, repeatedly, without redeploy.” An application reads configuration during execution; the delivery service validates each change before deploying, rolls it out in stages, and rolls back automatically if a linked alarm trips. Each change takes seconds; the application carries on. The reliability team’s lever is the deployment strategy: how fast a change reaches each percent of targets and what aborts the rollout.
They don’t overlap. One creates resources once and hands them to a team; the other updates values in running applications many times. Trying to use a provisioning tool for feature flags is wrong because provisioning takes minutes per resource. Trying to use a runtime-config delivery service for infrastructure is wrong because it delivers values to running code, not new resources. Both jobs exist; the work is keeping them separate.
Two decisions per side are worth making deliberately.
For self-service infrastructure: what guardrails attach to each pattern? Things to fix at the pattern boundary include parameter-value restrictions, the IAM role provisioning runs under (so launchers get capabilities they don’t have directly), required tags, and notification hooks for approval-gated patterns. Anything left to convention will drift.
For runtime configuration: how does the application read configuration, and how does a rollout unwind? Direct API polling on the request hot path is the simplest option but adds a network round trip to every read; a sidecar or runtime extension that caches the configuration locally and polls in the background gives the application a network-free read at the cost of a small extra component. Rollout strategies define the deployment shape (canary, linear, all-at-once), and a linked alarm provides the automatic rollback when a metric breaches.
What we’ll filter on
For self-service infrastructure:
- Pre-approved patterns with parameter-driven customisation.
- Guardrails enforced at launch (tags, encryption, IAM constraints).
- Approval workflow for patterns that need it.
- Per-team access control via IAM and portfolio sharing.
For runtime configuration:
- Live updates without redeploy.
- Validation of configuration before deployment.
- Staged rollout with metric-gated progression.
- Automatic rollback on alarm breach.
The self-service and configuration landscape
For infrastructure self-service:
1. Jira tickets + platform team manual work. Status quo. Fails time-to-service.
2. A shared Git repo of templates, teams clone-and-apply. Works for teams with platform skills; offers no enforcement and no approval gate. Teams can modify templates freely; guardrails exist as a review process, not technical controls.
3. AWS Service Catalog with portfolios per team. Canonical AWS pattern. Platform team maintains a portfolio of products; each product is a CloudFormation template with constraints; teams launch products via a console/API call that runs as a launch-constrained IAM role. Guardrails enforced at launch.
4. CDK constructs published as an internal library. Teams write CDK code against platform-provided constructs. Strong typing, better developer experience for CDK-native teams; weaker for teams that don’t use CDK. No approval workflow.
5. Backstage or similar internal developer portal. A front-end to many patterns, backed by any number of AWS services. Appropriate for large platforms; significant investment to stand up.
For runtime configuration:
6. Environment variables at deploy time. Status quo; change = redeploy.
7. SSM Parameter Store polled by the application. Near-real-time, simple, ubiquitous. No validation, no staged rollout, no rollback.
8. AWS AppConfig with an application, environment, configuration profile. Validation via JSON schema or Lambda validator; staged rollout via deployment strategy; automatic rollback via CloudWatch alarm.
9. LaunchDarkly or similar third-party feature-flag service. Mature feature-flag platforms; licence cost; out of scope when “AWS-native” is a constraint.
Side by side
For infrastructure self-service:
| Option | Pre-approved patterns | Guardrails | Approval workflow | Per-team access |
|---|---|---|---|---|
| Jira + manual | ✗ | Partial (review) | ✓ | Account-scoped |
| Shared Git templates | ✓ | ✗ | Partial | None |
| Service Catalog + portfolios | ✓ | ✓ | ✓ (via constraints) | ✓ |
| CDK construct library | ✓ | Partial | ✗ | IAM |
| Backstage | ✓ | Varies | ✓ | ✓ |
For runtime configuration:
| Option | Live update | Validation | Staged rollout | Auto-rollback |
|---|---|---|---|---|
| Env vars (redeploy) | ✗ | ✗ | ✗ | ✗ |
| SSM Parameter Store | ✓ | ✗ | ✗ | ✗ |
| AppConfig | ✓ | ✓ | ✓ | ✓ |
| Third-party (LaunchDarkly) | ✓ | ✓ | ✓ | ✓ |
The AWS-native picks are Service Catalog for self-service infrastructure and AppConfig for runtime configuration. They run side by side; neither is a substitute for the other.
How they sit side by side
The picks in depth
Service Catalog: three products, one portfolio, three constraints each.
The new-service product is a CloudFormation template that creates an ECS task definition, a service, an ALB target group and listener rule, an IAM task role, and a CloudWatch log group, all tagged with the product team’s CostCenter and TeamName parameters. Product metadata includes the version, documentation URL, and support channel.
Launch constraint: the product launches as PlatformLaunchRole, an IAM role in the platform account with permissions to create the listed resources. The product team member launching the product has servicecatalog:ProvisionProduct but not (for example) ec2:CreateVpc directly; the launch role has the creation permissions. This is the capability-elevation pattern: teams get infrastructure without holding the raw creation permissions.
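As a rough illustration of the launcher side of this split, the sketch below builds an identity policy for a product-team engineer. The policy contents, the Sid, and the helper function are illustrative assumptions, not the platform team's actual policy; the point is only that the launcher's policy names Service Catalog actions and no resource-creation actions.

```python
import json
from typing import Set

# Hypothetical identity policy for a product-team engineer: enough to browse
# and launch catalogue products, with no direct resource-creation actions.
LAUNCHER_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "BrowseAndLaunchCatalogueProducts",
            "Effect": "Allow",
            "Action": [
                "servicecatalog:SearchProducts",
                "servicecatalog:DescribeProduct",
                "servicecatalog:ListLaunchPaths",
                "servicecatalog:DescribeProvisioningParameters",
                "servicecatalog:ProvisionProduct",
                "servicecatalog:DescribeRecord",
            ],
            "Resource": "*",
        }
    ],
}


def allowed_actions(policy: dict) -> Set[str]:
    """Collect every action named in an Allow statement (no wildcard expansion)."""
    actions: Set[str] = set()
    for statement in policy["Statement"]:
        if statement["Effect"] != "Allow":
            continue
        listed = statement["Action"]
        actions.update([listed] if isinstance(listed, str) else listed)
    return actions


if __name__ == "__main__":
    print(json.dumps(LAUNCHER_POLICY, indent=2))
```

The creation permissions (ECS, ELB, IAM, CloudWatch Logs) would live only on the launch role, which is why keeping that role's scope tight matters.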
Template constraint: Environment must match /^(dev|staging|prod)$/; TeamName must be one of the registered team names; InstanceSize has a list of allowed values per environment (prod gets larger sizes). Template constraints are CloudFormation rules expressed in Service Catalog’s constraint language; invalid parameter combinations fail at the parameter-form stage before any resource creation.
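A template constraint of this shape could be generated along the following lines. This is a minimal sketch: the rule names and the per-environment size lists are hypothetical (the text only says prod gets larger sizes), and the output follows the Rules / RuleCondition / Assertions structure used by CloudFormation rules.

```python
import json

ALLOWED_ENVIRONMENTS = ["dev", "staging", "prod"]

# Hypothetical size lists, chosen only to illustrate per-environment limits.
ALLOWED_SIZES = {
    "dev": ["small"],
    "staging": ["small", "medium"],
    "prod": ["medium", "large"],
}


def build_template_constraint() -> dict:
    """Build a rules document restricting Environment and InstanceSize."""
    rules = {
        "EnvironmentIsKnown": {
            "Assertions": [
                {
                    "Assert": {
                        "Fn::Contains": [ALLOWED_ENVIRONMENTS, {"Ref": "Environment"}]
                    },
                    "AssertDescription": "Environment must be dev, staging, or prod",
                }
            ]
        }
    }
    # One conditional rule per environment: it only applies when the
    # Environment parameter equals that environment.
    for env, sizes in ALLOWED_SIZES.items():
        rules[f"InstanceSizeFor{env.capitalize()}"] = {
            "RuleCondition": {"Fn::Equals": [{"Ref": "Environment"}, env]},
            "Assertions": [
                {
                    "Assert": {"Fn::Contains": [sizes, {"Ref": "InstanceSize"}]},
                    "AssertDescription": (
                        f"InstanceSize for {env} must be one of {', '.join(sizes)}"
                    ),
                }
            ],
        }
    return {"Rules": rules}


if __name__ == "__main__":
    print(json.dumps(build_template_constraint(), indent=2))
```

Generating the rules from one table keeps the allowed values in a single place, so adding an environment does not mean hand-editing several JSON fragments.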
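Tag constraint: enforces CostCenter, TeamName, Environment, and ManagedBy=ServiceCatalog tags on all resources created by the product (in Service Catalog this is typically done with TagOptions associated with the portfolio or product). Combined with an SCP requiring these tags on resource creation account-wide, tag discipline is enforced technically and needs no manual review.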
Notification constraint: on provisioning events (Product launch started, Product launch completed, Product launch failed), an SNS topic receives a message. For the new-s3-datastore product, the subscription is compliance’s approval-workflow Lambda which, before provisioning completes, requires a human approval in Slack. Approval-gated Service Catalog products are a valid pattern when provisioning a pre-built template isn’t quite enough for regulated workloads.
AppConfig: application → environment → configuration profile.
The payments team creates one AppConfig application; inside it, three environments (dev, staging, prod). Each environment has one or more deployments of a configuration profile. The feature-flags profile is a JSON document with keys like circuit-breaker-threshold: 0.05, request-timeout-ms: 2000, canary-percentage: 10.
Validators on the profile: a JSON schema validates structure (circuit-breaker-threshold is a number between 0 and 1); a Lambda validator runs additional checks (e.g. canary-percentage must decrease or match the previous value during staging-to-prod promotion, to enforce rollout discipline). Both validators run during StartDeployment; a validator failure blocks the deployment before any target sees the new configuration.
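A Lambda validator for this profile might look like the sketch below. It assumes the validator event carries the configuration as a base64-encoded string under a content key and that raising an exception is how the function signals rejection; the exact bounds are illustrative. The cross-version check on canary-percentage is left out because it would need to fetch the previously deployed version.

```python
import base64
import json


def validate_flags(flags: dict) -> None:
    """Raise ValueError when a value falls outside its allowed range."""

    def number(key: str):
        value = flags.get(key)
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{key} must be a number, got {value!r}")
        return value

    if not 0 <= number("circuit-breaker-threshold") <= 1:
        raise ValueError("circuit-breaker-threshold must be between 0 and 1")
    if number("request-timeout-ms") <= 0:
        raise ValueError("request-timeout-ms must be positive")
    if not 0 <= number("canary-percentage") <= 100:
        raise ValueError("canary-percentage must be between 0 and 100")


def handler(event, context):
    """Lambda entry point: decode the proposed configuration and validate it."""
    # Assumption: the configuration arrives base64-encoded under "content".
    flags = json.loads(base64.b64decode(event["content"]))
    if not isinstance(flags, dict):
        raise ValueError("configuration must be a JSON object")
    validate_flags(flags)
    # Returning without raising means the configuration is accepted.
```

With this in place, a proposed request-timeout-ms of -1 raises before any target sees the value.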
Deployment strategy: Canary10Percent20Minutes. 10% of targets get the new config immediately, the deployment bakes for 20 minutes, then 100% gets it. CloudWatch alarms tied to the deployment (e.g. payments-5xx-rate > 1%) cause automatic rollback if they trip during the deployment window. Environment-linked alarms apply globally to the environment; deployment-linked alarms apply only during the deployment.
On the client side, the payments service (running on Fargate) has the AppConfig agent as a sidecar container. The agent polls AppConfig every 45 seconds, caches the configuration locally, and exposes it at http://localhost:2772/applications/payments/environments/prod/configurations/feature-flags. The application code reads from this endpoint; there’s no network latency on hot paths, and the agent absorbs the rollout timing (a canary’s 10% is reflected in the 10% of targets whose agents have polled since the deployment started). Lambda functions use the AppConfig Lambda extension; same mechanism, packaged as an extension instead of a sidecar.
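On the application side, reading from the agent can be as small as the sketch below. The endpoint is the one given above; the default values, the timeout, and the fall-back-to-defaults behaviour are assumptions chosen for illustration, not something the agent prescribes.

```python
import json
import urllib.error
import urllib.request

AGENT_URL = (
    "http://localhost:2772/applications/payments"
    "/environments/prod/configurations/feature-flags"
)

# Hypothetical safe defaults, used when the agent is unreachable.
DEFAULTS = {
    "circuit-breaker-threshold": 0.05,
    "request-timeout-ms": 2000,
    "canary-percentage": 10,
}


def merge_flags(raw: bytes, defaults: dict) -> dict:
    """Overlay agent-provided values on defaults; use defaults alone on bad input."""
    try:
        loaded = json.loads(raw)
    except ValueError:
        return dict(defaults)
    if not isinstance(loaded, dict):
        return dict(defaults)
    return {**defaults, **loaded}


def read_flags(url: str = AGENT_URL, defaults: dict = DEFAULTS, timeout_s: float = 0.2) -> dict:
    """Read the cached configuration from the local agent, never raising."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return merge_flags(response.read(), defaults)
    except (urllib.error.URLError, OSError):
        # Agent not started yet or not reachable: keep serving with defaults.
        return dict(defaults)
```

Falling back to defaults keeps a missing sidecar from taking the service down, at the cost of silently running on stale values; whether that trade is acceptable depends on the flag.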
A worked self-service request
Team analytics wants a new service called report-generator. With Service Catalog in place:
- An engineer opens the Service Catalog console in their account, sees the shared acme-platform portfolio, clicks new-service.
- Parameter form asks: service name, container image URI, CPU, memory, port, environment (dev|staging|prod), team name (analytics), cost centre.
- Engineer fills in values. Template constraints validate server-side; tag constraints require CostCenter=analytics-east to be filled.
- Engineer clicks Launch. Service Catalog invokes CloudFormation under the launch constraint’s role (PlatformLaunchRole), creates the stack.
- 5 minutes later the stack completes. Engineer sees a provisioned product in the console with outputs (ALB DNS name, task role ARN).
- The stack is the team’s to maintain; updates go via Service Catalog’s UpdateProvisionedProduct (which essentially runs UpdateStack under the launch role).
No platform-team ticket. No 3-5 day wait. The platform team still controls the template and constraints; they don’t control which team launches it.
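The same launch can be scripted instead of clicked through. The sketch below builds the request for the Service Catalog ProvisionProduct API and, optionally, sends it with boto3. The parameter key names, the version name v1, the provisioned-product naming scheme, and the image URI in the usage note are all hypothetical; the real keys are whatever the new-service template declares.

```python
def build_provision_request(
    service_name: str,
    image_uri: str,
    environment: str,
    team: str,
    cost_centre: str,
) -> dict:
    """Build keyword arguments for a ProvisionProduct call."""
    parameters = {
        # Hypothetical parameter names; they must match the template's own.
        "ServiceName": service_name,
        "ContainerImage": image_uri,
        "Environment": environment,
        "TeamName": team,
        "CostCenter": cost_centre,
    }
    return {
        "ProductName": "new-service",
        "ProvisioningArtifactName": "v1",  # hypothetical version name
        "ProvisionedProductName": f"{team}-{service_name}-{environment}",
        "ProvisioningParameters": [
            {"Key": key, "Value": value} for key, value in parameters.items()
        ],
    }


def launch(request: dict) -> dict:
    """Send the request; requires boto3 and credentials allowed to provision."""
    import boto3  # imported lazily so the builder works without AWS access

    client = boto3.client("servicecatalog")
    return client.provision_product(**request)
```

For the report-generator example, build_provision_request("report-generator", "<image URI>", "prod", "analytics", "analytics-east") produces a request named analytics-report-generator-prod. If the product is reachable through more than one portfolio, the call may also need a launch path (PathId or PathName).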
A worked configuration rollout
Reliability engineer wants to lower the request-timeout-ms from 2000 to 1500 on the payments service:
- Engineer edits the feature-flags profile in AppConfig, commits the change as a new version, starts a deployment to the prod environment with strategy Canary10Percent20Minutes.
- Validators run: JSON schema passes, Lambda validator passes (the change is within bounds).
- Deployment begins. 10% of payments service replicas get the new value within their next poll cycle (~45 seconds). The deployment enters the 20-minute bake period.
- CloudWatch alarm payments-p99-latency > 3s is linked to the deployment. During bake, latency stays nominal; no alarm.
- 20 minutes in, the remaining 90% get the new value. Deployment complete.
If the alarm had tripped at any point during the bake period or the final step, AppConfig would have rolled back automatically, reverting the configuration to the previous version, notifying the SNS topic, and marking the deployment as ROLLED_BACK. The engineer investigates afterwards, with production already back on the known-good value.
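The control flow of a staged rollout with alarm-driven rollback can be modelled in a few lines. This is a toy simulation for reasoning about the behaviour, not how AppConfig is implemented; the function name, the stage list, and the alarm callback are all invented for the example.

```python
from typing import Callable, List, Tuple


def run_staged_rollout(
    stages: List[int],
    alarm_breached: Callable[[int], bool],
    old_value: int,
    new_value: int,
) -> Tuple[int, str, List[int]]:
    """Walk through rollout percentages, reverting if the alarm trips.

    Returns the value in effect at the end, the final state, and the
    percentages that were actually reached.
    """
    reached: List[int] = []
    for percent in stages:
        reached.append(percent)
        # Stand-in for the bake period: check the alarm after each stage.
        if alarm_breached(percent):
            return old_value, "ROLLED_BACK", reached
    return new_value, "COMPLETE", reached


if __name__ == "__main__":
    # Healthy rollout: the timeout moves from 2000 ms to 1500 ms.
    print(run_staged_rollout([10, 100], lambda percent: False, 2000, 1500))
    # Alarm trips during the 10% stage: the old value stays in effect.
    print(run_staged_rollout([10, 100], lambda percent: percent == 10, 2000, 1500))
```

The useful property to notice is that a breach at the first stage means most targets never see the new value at all.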
What’s worth remembering
- Service Catalog and AppConfig answer different questions. Service Catalog is infrastructure self-service (CloudFormation templates as catalogue products); AppConfig is runtime configuration (live updates to running applications). They coexist; they don’t overlap.
- Service Catalog constraints are how guardrails attach. Launch constraint (IAM role for provisioning), template constraint (parameter value restrictions), tag constraint (required tags), notification constraint (SNS on events). All four are commonly used.
- Launch constraint is the capability-elevation pattern. Users without direct resource-creation permissions can launch products that run as a privileged role. Powerful; important to keep launch-role scopes tight.
- Portfolio sharing distributes products across accounts. AWS Organizations sharing (or account-level sharing) gets products to consumer accounts without duplicating CloudFormation templates.
- AppConfig validators run before deployment. JSON schema for structure, Lambda validator for cross-field or cross-version logic. Both block bad configuration before any target sees it.
- AppConfig deployment strategies include canary, linear, all-at-once. Pick based on risk tolerance; canary and linear strategies add bake periods that let CloudWatch alarms catch problems before full rollout.
- AppConfig agent and Lambda extension are the production delivery path. Direct API polling works but costs latency; the agent or extension caches locally and absorbs the rollout rhythm.
- Configuration changes are CloudTrail events. StartDeployment and StopDeployment appear in CloudTrail with principal, timestamp, and request parameters; audit has the change history without extra plumbing.
Twelve product teams self-serve their infrastructure through Service Catalog; the reliability team pushes configuration changes to running services through AppConfig. Both reduce the platform team’s interrupt rate; neither is a substitute for the other. Picking the correct tool for the correct scope is the DevOps pro’s job, and confusing “infrastructure as catalogue” with “configuration as deployment” is the mistake that takes weeks to recover from.