The situation
An organisation runs a single Aurora PostgreSQL cluster shared by eighteen internal services. Today the master password lives in an environment variable that was set on deploy a year ago and has not changed since. The security review has flagged two things: first, the password hasn’t rotated; second, nobody quite knows all the places it’s been pasted (CI secrets, a few developer laptops, a Slack thread from 2024).
The team needs to move the credential out of environment variables, rotate it automatically on a 30-day cadence, grant each service only the access it needs, and have every access recorded in CloudTrail. The interesting engineering is the rotation itself: eighteen services, one shared credential, no maintenance window, and no tolerance for an authentication failed error mid-request.
Everything before rotation (where the credential lives, who reads it, how CloudTrail sees it) has one answer: Secrets Manager. The design space opens up around the rotation mechanism: what kind of rotation, how the application fleet cooperates with it, and whether anything sits between the app and the database to absorb the change.
What actually matters
Before picking a rotation template, it’s worth asking what the fleet actually needs from rotation day.
The first thing to think about is blast radius. A rotation that flips a single credential at a single moment in time will, by definition, have a window where applications holding the old password can’t open new connections. If that window is nonzero, the rotation takes the fleet down, briefly, maybe, but during a request somebody cares about. The shape of the answer hinges on whether we can avoid that window at all, or only shorten it, and what either option costs.
The second is ownership of the refresh. Somebody has to notice that the password changed: the application (by re-reading the secret on a schedule or on a failed connect), the connection pool (by reconnecting when the pool drains), or a proxy sitting in front of the database (by re-reading the secret and re-authenticating on the applications’ behalf). Each of those moves the coupling to a different place: into the code, into the runtime, or into the network path. Eighteen services means whatever we pick has to be implementable in eighteen codebases, or in zero of them.
The third is recovery when rotation itself fails. If the Lambda throws halfway through, the old credential had better still work, the new credential had better not be silently live, and the next scheduled rotation had better retry cleanly. A rotation design that can leave the database in an indeterminate state is worse than no rotation, because now the only way to recover is manual.
The fourth is cost shape and operational load. Secrets Manager charges per secret per month plus per API call; rotation Lambdas cost Lambda-invocation pennies. These numbers are small. The real cost is engineering attention: a rotation design that requires each team to implement bespoke cache-invalidation logic is expensive in meetings, code reviews, and bugs, even if the AWS bill is trivial.
The fifth is coupling between the database and the application lifecycle. If rotation forces every application to restart or redeploy to pick up the new credential, rotation becomes a scheduled outage. If rotation can be absorbed by a running application without a restart, it becomes invisible. That’s the bar worth clearing.
And finally, a softer one: what kind of user the rotated credential is. Rotating the RDS master user is a different proposition from rotating an application-level user: one is singular and DBA-facing, the other is duplicable and app-facing. Conflating them leads to either an awkward rotation or a dangerous one.
What we’ll filter on
Distilling that exploration into filters we can score rotation options against:
- Zero-downtime during rotation: is there a window where a valid cached credential stops working?
- Automatic application refresh: do applications pick up the new password without code changes or restarts?
- Recovers cleanly on rotation failure: if a step throws, does the old credential keep working while the next schedule retries?
- Works for shared application credentials at fleet scale: does the mechanism scale to eighteen services reading the same secret?
- Keeps the RDS master account out of the hot path: is the rotated credential one we can duplicate, rather than the singular master?
The rotation landscape
- Manual rotation, no Secrets Manager. Someone runs ALTER USER once a month and updates environment variables across eighteen services. Every rotation is a coordinated redeploy or restart. This is the starting point, and the thing the security review is objecting to. Not a serious option, but worth naming as the baseline.
- Secrets Manager with single-user rotation. The managed template AWS ships for RDS-family engines. The rotation Lambda creates a new password, runs ALTER USER on the one account whose password is being rotated, tests the new credential, and moves the AWSCURRENT staging label. During and after setSecret, anything holding the old password fails to open new connections until it refetches. Fine for a single worker that restarts cheaply; awkward for a fleet that caches.
- Secrets Manager with alternating-users rotation. The managed template for workloads that can’t tolerate a single-password moment. Two application users (e.g. app_a and app_b) exist on the database with the same privileges. The secret holds whichever user currently owns AWSCURRENT; each rotation resets the other user’s password and flips the label. At any instant both users exist, both passwords are valid, and applications using the previous user keep working until their connections expire naturally. The RDS master account is rotated separately, or not at all on the 30-day cycle.
- Secrets Manager plus RDS Proxy. RDS Proxy sits between applications and the database, authenticates to Postgres using a secret it reads from Secrets Manager, and offers the application an IAM-authenticated endpoint. When the secret rotates, the proxy re-reads it and re-authenticates; applications use IAM to reach the proxy and never see the password change at all. Adds a component (and its cost), removes the rotation concern from the application entirely.
- Parameter Store with a custom rotation workflow. SSM Parameter Store can hold the credential at lower cost than Secrets Manager, but there’s no managed rotation and no staging-label concept. Anything rotation-like has to be built: EventBridge schedule, custom Lambda, custom versioning, custom refresh discipline. Cheaper per month; much more expensive to build and maintain.
- IAM database authentication, no password at all. Aurora PostgreSQL supports IAM database authentication: the database trusts AWS to vouch for the caller, and the application fetches a short-lived token instead of a password. Removes rotation as a concept (there is no long-lived credential to rotate), but imposes a 15-minute token lifetime and a practical ceiling on how many new IAM-authenticated connections the database can accept per second, which doesn’t match every workload. Excellent when it fits; not always available for legacy drivers or high-frequency connection patterns.
Side by side
| Option | Zero-downtime | Auto refresh | Clean failure recovery | Fleet-scale | Keeps master clear |
|---|---|---|---|---|---|
| Manual rotation | ✗ | ✗ | ✗ | ✗ | ✗ |
| SM single-user | ✗ | ✗ (needs TTL/catch) | ✓ | ✓ | ✗ (rotates master) |
| SM alternating-users | ✓ | ✓ (natural) | ✓ | ✓ | ✓ |
| SM + RDS Proxy | ✓ | ✓ (transparent) | ✓ | ✓ | ✓ |
| Parameter Store + custom | depends | ✗ | depends | ✓ | depends |
| IAM DB auth | ✓ | N/A | N/A | depends | ✓ |
Reading across the rows, alternating-users rotation is the default pick for a shared application credential at fleet scale. RDS Proxy is the upgrade when the architecture already includes (or could justify) a proxy for other reasons: connection pooling, failover smoothing, IAM auth. IAM database auth is strictly better when it fits the workload’s token-refresh cadence; often it doesn’t.
Matching patterns to workloads
Alternating-users rotation, in depth
The core move is that there are always two valid credentials. Create app_a and app_b on the database with identical GRANTs. A single secret stores the credentials for whichever user holds AWSCURRENT right now; the other user exists and has a password but isn’t in the secret. On rotation day, the Lambda resets the password for the off user, tests it, and flips AWSCURRENT onto it. Old connections opened against the previous user keep working until they drop naturally, because Postgres authenticates at connect time, not per statement, and new connections opened after the flip authenticate against the now-current user. No moment exists where a cached-but-valid credential stops working.
The rotation itself is still the four-step Secrets Manager pattern (createSecret, setSecret, testSecret, finishSecret), but setSecret runs ALTER USER against the other user rather than the current one. That’s the only structural change. Staging labels move exactly the same way: AWSPENDING appears in createSecret, stays through setSecret and testSecret, and the version that carries it becomes AWSCURRENT in finishSecret.
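The one structural change can be sketched as a small change to what createSecret writes into the pending version. This is a minimal sketch, not the managed template’s code: the helper names `other_user` and `build_pending_payload` are hypothetical, and the fixed pair app_a/app_b is assumed from the example above.

```python
def other_user(current_username: str, pair=("app_a", "app_b")) -> str:
    """Return the user that is NOT currently stored in the secret."""
    if current_username not in pair:
        raise ValueError(f"{current_username!r} is not one of {pair}")
    return pair[1] if current_username == pair[0] else pair[0]


def build_pending_payload(current_payload: dict, new_password: str) -> dict:
    """Build the AWSPENDING payload for an alternating-users rotation.

    Connection details (host, port, ...) are copied from the current version;
    only the username and password change. The input dict is not modified.
    """
    pending = dict(current_payload)
    pending["username"] = other_user(current_payload["username"])
    pending["password"] = new_password
    return pending
```

setSecret then runs ALTER USER for the username in the pending payload, which by construction is never the user the fleet is reading right now.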
Three staging labels matter. AWSCURRENT is what every reader gets by default. AWSPENDING is the in-flight new version, visible only to the rotation Lambda. AWSPREVIOUS is the version that used to be current, kept as a safety net for one cycle. Rotation is, mechanically, the movement of those labels between versions.
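To make the label bookkeeping concrete, here is a toy in-memory model of the AWSCURRENT move. It is a sketch for reasoning only: `move_current` is a hypothetical function, the version ids are made up, and the model deliberately leaves AWSPENDING alone rather than guessing at how the service cleans it up.

```python
def move_current(stages: dict, new_version: str) -> dict:
    """Model the finishSecret label move on a copy of `stages`.

    `stages` maps a version id to the set of staging labels attached to it.
    AWSCURRENT moves to `new_version`; the version that held it receives
    AWSPREVIOUS, and any older AWSPREVIOUS holder loses that label.
    """
    result = {version: set(labels) for version, labels in stages.items()}
    old_current = next(v for v, labels in result.items() if "AWSCURRENT" in labels)
    if old_current == new_version:
        return result  # already finished; the step is idempotent
    for labels in result.values():
        labels.discard("AWSPREVIOUS")
    result[old_current].discard("AWSCURRENT")
    result[old_current].add("AWSPREVIOUS")
    result[new_version].add("AWSCURRENT")
    return result


if __name__ == "__main__":
    before = {"v0": {"AWSPREVIOUS"}, "v1": {"AWSCURRENT"}, "v2": {"AWSPENDING"}}
    print(move_current(before, "v2"))
```

Readers that ask for the default stage see only the result of this move; nothing earlier in the rotation changes what they get back.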
Application services still have to cope with the fact that AWSCURRENT changed. Three patterns work, and the fleet should standardise on one:
- Short in-memory TTL. Cache for 5 minutes; on miss, re-read. Costs one GetSecretValue per process per 5 minutes. Rotation is picked up within a TTL.
- Catch-and-refresh. On authentication failed, re-read the secret and retry the connect once. Cheap in the happy path; requires the driver to surface the error type cleanly.
- AWS’s client-side cache library. The Secrets Manager caching library (Java, Python, .NET, Go) handles TTL and refresh internally. Opinionated, well-tested, adds a dependency.
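The first two patterns combine naturally. The sketch below assumes a hypothetical `CachedSecret` wrapper and an `AuthFailed` exception standing in for whatever error the driver raises on a bad password; `fetch` is any callable that returns the parsed secret, which in a real service would wrap a GetSecretValue call.

```python
import time
from typing import Callable, Optional


class AuthFailed(Exception):
    """Stand-in for the driver's 'authentication failed' error (hypothetical)."""


class CachedSecret:
    """Hold the parsed secret in memory and re-read it after `ttl_seconds`."""

    def __init__(self, fetch: Callable[[], dict], ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Optional[dict] = None
        self._fetched_at = 0.0

    def get(self, force: bool = False) -> dict:
        now = self._clock()
        expired = self._value is None or now - self._fetched_at >= self._ttl
        if force or expired:
            self._value = self._fetch()
            self._fetched_at = now
        return self._value


def connect_with_refresh(secret: CachedSecret, connect: Callable[[dict], object]):
    """Try the cached credential; on an auth failure, refetch and retry once."""
    try:
        return connect(secret.get())
    except AuthFailed:
        # A second failure propagates to the caller: retry exactly once.
        return connect(secret.get(force=True))
```

Injecting `fetch` and `clock` keeps the wrapper testable without AWS access, and the single retry bounds the extra GetSecretValue calls a misconfigured service can generate.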
With alternating-users rotation, even a sloppy refresh story is forgiving: the old password keeps working until the cache expires, and the new password works when it arrives. Without alternating users (that is, with single-user rotation only), the cache expiry window is a window of failed connects.
Recovery on failure. If setSecret throws, the off user’s password is whatever the Lambda last managed to set: either the prior value (if the ALTER USER didn’t run) or a value the Lambda knows about (captured in AWSPENDING). AWSCURRENT hasn’t moved, so the fleet is still using the current user’s current password; nothing is broken. Secrets Manager marks the rotation failed and retries on the next schedule. The old credential keeps working through the whole incident.
Keeping the master account clear. Rotate the master infrequently (yearly, or on a known DBA schedule) using the single-user template. It’s not on any application’s hot path, so the reconnect window doesn’t matter; there’s nothing to reconnect. Use the master to provision app_a and app_b, then leave it alone on the 30-day cadence.
RDS Proxy as the upgrade path. If the fleet already has or can justify RDS Proxy (for connection pooling, multi-AZ failover smoothing, IAM auth), let the proxy hold the Secrets Manager secret and present applications with an IAM-authenticated endpoint. Rotation becomes transparent to the applications: no cache TTL to tune, no catch-and-refresh to wrap, no driver type-sniffing. The proxy picks up the new secret and re-authenticates to the database on its own.
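The recovery argument rests on one property: only the last step changes what readers see. A toy driver makes that visible. Everything here is hypothetical scaffolding (the `state` dict, the handler names, the in-process loop); in the real service each step is a separate Lambda invocation.

```python
STEPS = ("createSecret", "setSecret", "testSecret", "finishSecret")


def run_rotation(state: dict, handlers: dict) -> bool:
    """Run the four steps in order and stop at the first failure.

    Only the finishSecret handler reassigns state["current"], so an exception
    in any earlier step leaves readers on the previous credential.
    """
    for step in STEPS:
        try:
            handlers[step](state)
        except Exception as exc:
            state["last_error"] = f"{step}: {exc}"
            return False
    state.pop("last_error", None)
    return True


def create_pending(state: dict) -> None:
    # Idempotent: a retry reuses the pending version left by a failed attempt.
    if "pending" not in state:
        state["pending"] = {"username": "app_b", "password": "new-password"}


def failing_set(state: dict) -> None:
    raise ConnectionError("database unreachable")


def noop(state: dict) -> None:
    pass


def promote_pending(state: dict) -> None:
    state["previous"] = state["current"]
    state["current"] = state.pop("pending")
```

Running the driver once with `failing_set` and once with `noop` in the setSecret slot shows the first attempt leaving `state["current"]` untouched and the retry completing from the same pending version.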
The rotation Lambda
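A simplified Python shape for the single-user template. The alternating-users template is structurally identical, but it tracks two usernames and runs ALTER USER on the off one; because an application user normally cannot change another role’s password, that variant connects in setSecret with a privileged credential held in a separate secret.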
import json

import boto3
import pg8000


def lambda_handler(event, context):
    """Entry point: Secrets Manager invokes this once per rotation step."""
    arn = event['SecretId']
    token = event['ClientRequestToken']
    step = event['Step']
    sm = boto3.client('secretsmanager')
    desc = sm.describe_secret(SecretId=arn)
    if step == 'createSecret':
        create_secret(sm, arn, token, desc)
    elif step == 'setSecret':
        set_secret(sm, arn, token)
    elif step == 'testSecret':
        test_secret(sm, arn, token)
    elif step == 'finishSecret':
        finish_secret(sm, arn, token, desc)
    else:
        raise ValueError(f'Unknown rotation step: {step}')


def create_secret(sm, arn, token, desc):
    """Generate a new password and stage it as AWSPENDING."""
    try:
        sm.get_secret_value(SecretId=arn, VersionId=token, VersionStage='AWSPENDING')
        return  # pending already exists; idempotent
    except sm.exceptions.ResourceNotFoundException:
        pass
    current = sm.get_secret_value(SecretId=arn, VersionStage='AWSCURRENT')
    payload = json.loads(current['SecretString'])
    payload['password'] = sm.get_random_password(
        PasswordLength=32, ExcludePunctuation=True)['RandomPassword']
    sm.put_secret_value(
        SecretId=arn, ClientRequestToken=token,
        SecretString=json.dumps(payload), VersionStages=['AWSPENDING'])


def set_secret(sm, arn, token):
    """Apply the pending password on the database (the only step that writes to it)."""
    pending = json.loads(sm.get_secret_value(
        SecretId=arn, VersionId=token, VersionStage='AWSPENDING')['SecretString'])
    current = json.loads(sm.get_secret_value(
        SecretId=arn, VersionStage='AWSCURRENT')['SecretString'])
    conn = pg8000.connect(
        user=current['username'], password=current['password'],
        host=current['host'], port=int(current['port']), database='postgres')
    try:
        # ALTER USER cannot take bind parameters. Interpolation is acceptable here
        # only because the password is alphanumeric (ExcludePunctuation=True) and
        # the username comes from the secret itself, never from user input.
        conn.run(f"ALTER USER {pending['username']} WITH PASSWORD '{pending['password']}'")
        conn.commit()
    finally:
        conn.close()


def test_secret(sm, arn, token):
    """Prove the pending credential can log in before any reader sees it."""
    pending = json.loads(sm.get_secret_value(
        SecretId=arn, VersionId=token, VersionStage='AWSPENDING')['SecretString'])
    conn = pg8000.connect(
        user=pending['username'], password=pending['password'],
        host=pending['host'], port=int(pending['port']), database='postgres')
    try:
        conn.run('SELECT 1')
    finally:
        conn.close()


def finish_secret(sm, arn, token, desc):
    """Move AWSCURRENT to the pending version; readers switch from here on."""
    current_version = next(
        v for v, stages in desc['VersionIdsToStages'].items() if 'AWSCURRENT' in stages)
    if current_version == token:
        return  # already finished; idempotent
    sm.update_secret_version_stage(
        SecretId=arn, VersionStage='AWSCURRENT',
        MoveToVersionId=token, RemoveFromVersionId=current_version)
Two policies govern the Lambda’s access: the identity policy on its execution role (permission to call Secrets Manager, use the KMS key for decryption, and attach to the VPC), and the secret’s resource policy (which says which principals can read or rotate it). The Lambda runs in a VPC that can reach the database.
IAM and resource policies
The rotation Lambda’s execution role needs:
- secretsmanager:DescribeSecret, GetSecretValue, PutSecretValue, UpdateSecretVersionStage, on the secret ARN only.
- secretsmanager:GetRandomPassword, unscoped (the API takes no resource).
- kms:Decrypt, kms:GenerateDataKey, on the KMS key encrypting the secret.
- VPC execution (ec2:CreateNetworkInterface, ec2:DeleteNetworkInterface, ec2:DescribeNetworkInterfaces) if the Lambda is in a VPC.
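The list above translates into an identity policy document fairly directly. The builder below is a sketch: the function name and the Sid values are made up for illustration, and the ARNs are supplied by the caller.

```python
import json


def rotation_role_policy(secret_arn: str, kms_key_arn: str) -> dict:
    """Build an identity policy for the rotation Lambda's execution role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # Rotation calls, scoped to the one secret being rotated.
                "Sid": "RotateThisSecretOnly",
                "Effect": "Allow",
                "Action": [
                    "secretsmanager:DescribeSecret",
                    "secretsmanager:GetSecretValue",
                    "secretsmanager:PutSecretValue",
                    "secretsmanager:UpdateSecretVersionStage",
                ],
                "Resource": secret_arn,
            },
            {   # GetRandomPassword takes no resource, so it cannot be scoped.
                "Sid": "GenerateRandomPassword",
                "Effect": "Allow",
                "Action": "secretsmanager:GetRandomPassword",
                "Resource": "*",
            },
            {   # Decrypt and re-encrypt secret versions with the secret's key.
                "Sid": "UseSecretKey",
                "Effect": "Allow",
                "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
                "Resource": kms_key_arn,
            },
            {   # Needed only when the Lambda runs inside a VPC.
                "Sid": "VpcNetworkInterfaces",
                "Effect": "Allow",
                "Action": [
                    "ec2:CreateNetworkInterface",
                    "ec2:DeleteNetworkInterface",
                    "ec2:DescribeNetworkInterfaces",
                ],
                "Resource": "*",
            },
        ],
    }


if __name__ == "__main__":
    # Placeholder ARNs for illustration only.
    print(json.dumps(rotation_role_policy(
        "arn:aws:secretsmanager:us-east-1:111122223333:secret:example",
        "arn:aws:kms:us-east-1:111122223333:key/example"), indent=2))
```

Generating the document from code keeps the secret ARN and key ARN as the only inputs, which makes it harder for a wildcard to creep into the statements that should be scoped.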
The secret’s resource policy optionally restricts who can read it:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowAppRoleRead",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::111122223333:role/payments-app-role" },
      "Action": "secretsmanager:GetSecretValue",
      "Resource": "*"
    },
    {
      "Sid": "AllowRotationLambdaWrite",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::111122223333:role/rotation-lambda-role" },
      "Action": [
        "secretsmanager:DescribeSecret",
        "secretsmanager:GetSecretValue",
        "secretsmanager:PutSecretValue",
        "secretsmanager:UpdateSecretVersionStage"
      ],
      "Resource": "*"
    }
  ]
}
Each application service gets its own IAM role with secretsmanager:GetSecretValue restricted to the secret ARN; the rotation Lambda has the full rotation set. CloudTrail records every call.
A worked example: rotation day for the eighteen-service fleet
Rotation fires at 00:00 UTC on the 30th. app_a is currently in the secret; app_b is the off user.
00:00:01 Secrets Manager rotation schedule invokes rotation Lambda (Step=createSecret)
00:00:02 Lambda: PutSecretValue → new version v2 with app_b's new password,
staged AWSPENDING (v1 with app_a stays AWSCURRENT)
00:00:04 Secrets Manager invokes Lambda (Step=setSecret)
00:00:05 Lambda connects with a privileged credential kept in a separate secret
(app_a normally cannot change app_b's password),
ALTER USER app_b WITH PASSWORD '<new>'
00:00:06 Commit; connection closes
00:00:08 Secrets Manager invokes Lambda (Step=testSecret)
00:00:09 Lambda connects as app_b with new password, SELECT 1, disconnect
00:00:11 Secrets Manager invokes Lambda (Step=finishSecret)
00:00:12 Lambda: UpdateSecretVersionStage moves AWSCURRENT to v2 (app_b);
v1 (app_a) automatically receives AWSPREVIOUS
00:00:15 Application services with 5-minute TTL caches still have v1 (app_a)
-- their existing Postgres connections keep working as app_a
00:05:00 First cache expiry; GetSecretValue returns v2 (app_b);
services open new connections as app_b as the pool turns over
-- no authentication failure, because app_a is still a valid user
00:05:00+ Any still-open app_a connections keep working until they drop;
pool eventually refills with app_b connections
Total rotation time: under fifteen seconds. Total fleet refresh: bounded by each service’s cache TTL and connection pool turnover. Zero authentication failed events, because app_a remains a valid database user the whole time; it just stops being what the secret returns.
On the next rotation (day 60), the Lambda resets app_a’s password, tests it, and moves AWSCURRENT to a new version, v3, which holds app_a again. app_b keeps working through that rotation the same way app_a did through this one.
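The difference between the two templates on rotation day can be checked with a small simulation of one service holding a TTL cache. This is a toy model under stated assumptions: `failed_connects` is a hypothetical function, the service attempts one new connection every 30 seconds, rotation completes at 00:00:12, and passwords are placeholder strings.

```python
def failed_connects(mode: str, ttl: int = 300, rotate_at: int = 12,
                    horizon: int = 600, step: int = 30) -> int:
    """Count failed new connections for one service with a TTL-cached secret.

    mode is "single" (one user, password replaced in place) or "alternating"
    (two users, label flipped to the other one). Times are seconds after
    00:00:00; the service last read the secret at t=0.
    """
    if mode not in ("single", "alternating"):
        raise ValueError(mode)
    # What the database will accept: username -> valid password.
    db = {"app_a": "pw1"} if mode == "single" else {"app_a": "pw1", "app_b": "pw0"}
    secret = {"username": "app_a", "password": "pw1"}  # what AWSCURRENT returns
    cached, cached_at = dict(secret), 0
    failures = 0
    rotated = False
    for t in range(0, horizon + 1, step):
        if not rotated and t >= rotate_at:
            user = "app_a" if mode == "single" else "app_b"
            db[user] = "pw2"
            secret = {"username": user, "password": "pw2"}
            rotated = True
        if t - cached_at >= ttl:
            cached, cached_at = dict(secret), t  # cache expiry: re-read the secret
        if db.get(cached["username"]) != cached["password"]:
            failures += 1  # new connection rejected with the cached credential
    return failures


if __name__ == "__main__":
    for mode in ("single", "alternating"):
        print(mode, failed_connects(mode))
```

In this model the single-user run fails every new connection between the rotation and the next cache expiry, while the alternating-users run never fails, because the cached app_a credential stays valid until the following rotation.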
What’s worth remembering
- Rotation is four steps: createSecret, setSecret, testSecret, finishSecret. Each step is a separate Lambda invocation; staging labels move between them.
- The database is touched only in setSecret. Everywhere else, rotation is Secrets Manager bookkeeping on versions.
- Single-user rotation flips one password at a moment. Applications holding the old password fail to open new connections until they refetch: fine for a single worker, awkward for a cached fleet.
- Alternating-users rotation keeps two valid credentials at all times. Zero-downtime by design; the right default for shared application credentials at scale.
- Keep the RDS master account off the 30-day cadence. Rotate the master infrequently, with a maintenance window; create app users for the hot path.
- Applications need a refresh story: short TTL, catch-and-refresh, or the caching library. Pick one and standardise across the fleet so rotation behaviour is predictable.
- RDS Proxy makes rotation invisible to applications. Proxy re-reads the secret; apps use IAM auth; no cache tuning required.
- Recovery is automatic on rotation failure. AWSCURRENT only moves in finishSecret; any earlier failure leaves the fleet on the previous credential and the next schedule retries.
- IAM DB authentication removes the rotation problem entirely when it fits. The 15-minute token lifetime and limits on new IAM-authenticated connections per second mean it doesn’t always fit.
- CloudTrail records every GetSecretValue and every rotation event. The audit story comes from enabling CloudTrail, not from writing code.
Eighteen services, one Postgres, thirty-day cadence, zero downtime. Two application users, alternating-users rotation, short-TTL caches, master account kept on a different clock, and, for the long-term architecture, RDS Proxy waiting as the upgrade that makes all of this invisible to the applications in the first place.