The situation
The platform team owns Bedrock access for the whole company. Roughly thirty services (a support assistant, a ticket classifier, a marketing-copy drafter, a translation pipeline, a meeting summariser, and twenty-five others) call Bedrock in production, each with its own prompt. The prompts were drafted separately by product teams, copied between codebases, and embedded as string literals; some are templated with f-strings, some are loaded from a Markdown file.
What happened last quarter is going to happen again. A product engineer tweaked the meeting summariser’s system prompt, changing “concise” to “brief” in what looked like a clean-up, and deployed to production. Retention on the daily summary email dropped 12% for eight days before anyone correlated the drop with the code change. The prompt had no version history the product team could see. The A/B infrastructure didn’t know prompts were a thing to vary. The monitoring dashboard reported Bedrock latency and error rate; it didn’t report whether the output was any good.
Platform’s ask: a prompt management story for the whole company. Version prompts, test them before release, roll them out alongside the code that calls them (or independently, if that’s better), measure their impact, and stop letting string literals in thirty repos be the authoritative copy.
What actually matters
A prompt is text the model sees before it does anything else. It is, functionally, configuration: it changes behaviour, it’s smaller than code, it wants review, and it wants versioning. The failure modes are the ones every config-management practice was invented to address: drift, shadow copies, untested change, silent regression, no rollback.
The first decision is where prompts live as source of truth. Checked into the service repo? Central repo? A managed Bedrock resource? A database?
The second is how they’re versioned. Git commit hashes? Semantic versions? Bedrock prompt versions? All of the above, coordinated?
The third is how they’re released. Deployed with the code that uses them, or independently? Is a prompt change a deployment, a feature flag flip, or a config push?
The fourth is how they’re tested. Before the change hits production, somebody runs the new prompt against a bank of examples and checks the outputs. Is that bank owned by the prompt author? The product team? Platform?
The fifth is how they’re parameterised. A prompt usually has slots: user input, retrieved context, session state. The templating language matters: f-strings lose their context when you refactor; Jinja gains power but adds a dependency; a simple {variable} substitution is predictable. Managed registries usually have their own template syntax to learn.
The sixth is how they’re attributed. When one prompt feeds thirty services, the bill, the latency, and the quality signal have to be broken down by caller; otherwise the platform team can’t tell which service is driving which problem.
And a softer one: the team that owns the prompt. Who edits it, who approves the edit, who rolls it back? Without a clear answer, every service’s prompt is owned by the last engineer who touched it, which is to say, owned by no one.
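To make the parameterisation decision concrete, here is a minimal sketch of the "simple {variable} substitution" option using only the Python standard library. The function names and the example template are hypothetical; the point is that a strict renderer refuses to run when a slot is missing or an unexpected variable is passed, which is the predictability the plain-placeholder approach is buying.

```python
from string import Formatter


def placeholders(template: str) -> set[str]:
    """Return the named {placeholders} that appear in a template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def render(template: str, variables: dict[str, str]) -> str:
    """Substitute variables, failing loudly on missing or unexpected names."""
    expected = placeholders(template)
    provided = set(variables)
    missing = expected - provided
    unexpected = provided - expected
    if missing or unexpected:
        raise ValueError(
            f"missing={sorted(missing)} unexpected={sorted(unexpected)}"
        )
    # Values are inserted verbatim; braces inside user input are not re-parsed.
    return template.format(**variables)


# Hypothetical template, for illustration only.
TEMPLATE = "Summarise the meeting for {audience}.\n\nTranscript:\n{transcript}"
```

A caller that forgets `transcript` gets a `ValueError` at render time instead of a prompt with a hole in it reaching the model.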
What we’ll filter on
1. Source-of-truth clarity: one place, many places, or a registry?
2. Versioning and rollback: named versions, diffs, easy revert?
3. Deployment shape: bundled with code, pushed independently, feature-flagged?
4. Evaluation coverage: tests run before a prompt ships?
5. Per-caller attribution: cost, latency, quality broken down by service?
The prompt-management landscape
- Bedrock Prompt Management. AWS-native prompt registry. Create a prompt with a template, variables, a selected foundation model, default inference config (temperature, top-p, max tokens), and an optional system message. Each prompt has versions; each version is immutable. Aliases (`$LATEST`, `production`, `staging`) point at versions. The `InvokeModel` or `Converse` API takes a prompt ARN and a variables dict; Bedrock resolves the template, runs the model, returns the response. IAM scopes who can create, update, and invoke prompts. CloudWatch carries invocation metrics per prompt. Ticks 1, 2, 5; moderate on 3, 4.
- Git-backed templates in a shared repo. A `prompts/` directory in a shared repo, one file per prompt, Jinja2 or Handlebars or a plain-text template with named placeholders. A small library in each service loads the prompt, substitutes variables, calls Bedrock. Versioning is Git commits; releases are tags; tests sit next to the templates in CI. Nothing AWS-specific; works identically if Bedrock moves to a different model. Ticks 1, 2, 3, 4 cleanly; 5 depends on observability we add.
- Parameterised prompts in each service’s config. Prompts live in each service’s config file (YAML, JSON), deployed with the service, versioned with the service. Cheapest to set up; the baseline thirty-services-each-doing-their-own-thing pattern, formalised. Ticks 3 cleanly; fails 1 and 5.
- LangChain’s PromptTemplate + LangSmith. Prompts as code in a shared Python package; LangSmith as the evaluation and observability surface. Prompts versioned in the package, evaluated with LangSmith datasets, observed per-invocation. Strong on 4 and 5; separate SaaS; tied to LangChain’s abstractions.
- A prompt database. A DynamoDB or Postgres table with prompts, versions, and metadata. Services fetch the active prompt at call time. Flexible, but puts prompt changes one write away from production: fast, and dangerous without a deployment gate.
- Hybrid: Git + Bedrock Prompt Management. The pattern most platform teams land on. Prompts authored in Git, reviewed in PRs, evaluated in CI. On merge, a pipeline pushes the new version to Bedrock Prompt Management; the `production` alias moves when the canary passes. Git is the source; Bedrock is the runtime registry; services look up prompts by alias.
Side by side
| Option | Source of truth | Versioning | Deployment | Evaluation | Attribution |
|---|---|---|---|---|---|
| Bedrock Prompt Management | Bedrock resource | Named versions + aliases | API call | Manual / custom | CloudWatch per prompt |
| Git-backed templates | Repo | Commits, tags | With service | CI-driven | Build it ourselves |
| Per-service config | Each service | With service | With service | Each team’s job | None central |
| LangChain + LangSmith | Python package | Package versions | With service | LangSmith datasets | LangSmith traces |
| Prompt database | DB rows | Row versions | DB write | Optional | Depends |
| Git + Bedrock (hybrid) | Git, with mirror | Git commits → Bedrock versions | Pipeline | CI + Bedrock evals | CloudWatch per prompt |
The hybrid is the honest answer for a platform team with 30 callers. Git carries the authoring workflow; Bedrock carries the runtime registry and per-prompt metrics. Nothing else hits all five attributes cleanly.
The prompt lifecycle, end to end
The pick in depth: Git + Bedrock Prompt Management
Authoring. Prompts live in prompts/ in a shared repo. Each prompt is a Jinja2 template plus a YAML sidecar with the inference config (temperature, top-p, max tokens, stop sequences), the intended foundation model, and ownership metadata (team, primary contact, service list). PRs require review by the prompt owner; SMEs are added as reviewers via CODEOWNERS based on domain.
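One way to picture the sidecar is as a typed record that CI validates before anything else runs. The sketch below is an assumption about shape, not the team's actual schema: `PromptSpec`, its field names, and the validation rules are all hypothetical, and the model identifier is a placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptSpec:
    """Hypothetical sidecar schema; field names are illustrative."""
    name: str
    model_id: str
    owner_team: str
    primary_contact: str
    services: tuple[str, ...]
    temperature: float
    top_p: float
    max_tokens: int
    stop_sequences: tuple[str, ...] = ()


def validate(spec: PromptSpec) -> list[str]:
    """Return human-readable problems; an empty list means the spec passes."""
    problems = []
    if not spec.owner_team or not spec.primary_contact:
        problems.append("ownership metadata is incomplete")
    if not spec.services:
        problems.append("service list is empty")
    if spec.temperature < 0:
        problems.append("temperature must be non-negative")
    if not 0 < spec.top_p <= 1:
        problems.append("top_p must be in (0, 1]")
    if spec.max_tokens <= 0:
        problems.append("max_tokens must be positive")
    return problems


# Illustrative values only; "example-model-id" is a placeholder.
SUMMARISER = PromptSpec(
    name="summariser",
    model_id="example-model-id",
    owner_team="meetings",
    primary_contact="meetings-oncall",
    services=("meeting-summariser",),
    temperature=0.2,
    top_p=0.9,
    max_tokens=512,
)
```

Rejecting a prompt with no owner at PR time is a cheap way to enforce the ownership rule mechanically rather than by convention.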
Evaluation in CI. A GitHub Action runs on every PR. For each changed prompt, it loads a 50-example smoke set (small, fast, runs in under two minutes), invokes the model with the new template, and scores with a mix of reference metrics (BLEU, ROUGE, exact-match for structured outputs) and LLM-as-judge (another Claude call scoring each output 1-5 on defined rubrics). Thresholds are per-prompt: the summariser has different quality criteria from the ticket classifier. A regression blocks merge.
Release pipeline. On merge to main, a pipeline loops through changed prompts and calls CreatePromptVersion on Bedrock. The Bedrock prompt version is immutable; the version number increments. The pipeline then runs a Bedrock Evaluation Job, a managed job that runs the new version against a larger golden set (500-2000 examples) and produces a scored report. If the job’s metrics are within tolerance of the previous production version, the staging alias moves to the new version and the production canary starts: 5% of traffic, routed via an alias-aware wrapper in the platform SDK. The canary runs for 24 hours of CloudWatch metrics; if error rate and user-facing quality signals hold, the pipeline moves the production alias.
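The 5% canary split in the SDK wrapper could be done with deterministic hashing, so the same request or user always lands on the same side. This is a sketch of one possible approach, not a description of the platform SDK: `in_canary` and `pick_version` are hypothetical names, and hashing a stable identifier is an assumed design choice.

```python
import hashlib


def in_canary(stable_key: str, percent: float = 5.0) -> bool:
    """Deterministically place a key in the canary for `percent` of the keyspace."""
    digest = hashlib.sha256(stable_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # buckets 0..9999
    return bucket < percent * 100


def pick_version(stable_key: str, stable: str, candidate: str,
                 percent: float = 5.0) -> str:
    """Return the prompt version label this key should be served."""
    return candidate if in_canary(stable_key, percent) else stable
```

Keying on a user ID rather than a request ID keeps each user on one prompt version for the whole canary window, which matters when the gate is a per-user signal such as retention.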
Runtime. Services call Bedrock InvokeModel (or Converse) with promptIdentifier: arn:aws:bedrock:eu-west-1:…:prompt/summariser and promptVersion: production. Bedrock resolves the alias, applies the template with the variables dict, runs the model, returns the response. Each call is tagged with callerId in the request, which flows to CloudWatch metrics.
Rollback. One API call (UpdatePrompt on the alias) points production back at the previous version. No service redeploy. The change propagates in seconds. This is the single most important operational property of the whole setup: the “oh no” button exists and is fast.
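The reason rollback is cheap is structural: versions are never edited, and an alias is just a pointer. The toy in-memory model below illustrates that idea only; it is not the Bedrock API, and `PromptRegistry` and its methods are hypothetical.

```python
class PromptRegistry:
    """Toy model of immutable versions plus movable aliases (not a Bedrock client)."""

    def __init__(self) -> None:
        self._versions: list[str] = []       # index i holds version i + 1
        self._aliases: dict[str, int] = {}   # alias name -> version number

    def publish(self, template: str) -> int:
        """Append a new immutable version and return its number."""
        self._versions.append(template)
        return len(self._versions)

    def point(self, alias: str, version: int) -> None:
        """Move an alias; this is both the release and the rollback operation."""
        if not 1 <= version <= len(self._versions):
            raise KeyError(f"unknown version {version}")
        self._aliases[alias] = version

    def resolve(self, alias: str) -> tuple[int, str]:
        """Return the (version, template) an alias currently points at."""
        version = self._aliases[alias]
        return version, self._versions[version - 1]


registry = PromptRegistry()
v1 = registry.publish("Write a concise summary of {transcript}")
registry.point("production", v1)
v2 = registry.publish("Write a brief summary of {transcript}")
registry.point("production", v2)   # release
registry.point("production", v1)   # rollback: one pointer move, no redeploy
```

Because callers resolve the alias at call time, nothing on the service side changes when the pointer moves.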
Per-caller attribution. The CloudWatch metric PromptInvocations is dimensioned by PromptArn, PromptVersion, and a custom dimension we populate with callerId. A dashboard breaks down each prompt by which service is calling it, how much they’re spending, and how their latency compares. When one service complains that “the summariser is slow,” platform can see whether it’s slow for everyone or only for them, and if only for them, which argument shape is triggering the slowness.
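The dashboard breakdown amounts to grouping invocation records by prompt and caller. Here is a minimal sketch of that aggregation over already-exported records; the `Invocation` record, its fields, and the sample rows are hypothetical and do not reflect a CloudWatch export format.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Invocation:
    """Hypothetical exported record for one prompt invocation."""
    prompt: str
    caller_id: str
    latency_ms: float
    cost_usd: float


def by_caller(invocations: list[Invocation]) -> dict[tuple[str, str], dict[str, float]]:
    """Aggregate call count, total cost, and median latency per (prompt, caller)."""
    groups: dict[tuple[str, str], list[Invocation]] = defaultdict(list)
    for inv in invocations:
        groups[(inv.prompt, inv.caller_id)].append(inv)
    return {
        key: {
            "calls": len(rows),
            "cost_usd": sum(r.cost_usd for r in rows),
            "median_latency_ms": median(r.latency_ms for r in rows),
        }
        for key, rows in groups.items()
    }


# Made-up sample rows, for illustration only.
SAMPLE = [
    Invocation("summariser", "daily-email", 900.0, 0.5),
    Invocation("summariser", "daily-email", 1100.0, 0.25),
    Invocation("summariser", "slack-bot", 4000.0, 0.5),
]
```

With rows like these, "the summariser is slow" resolves to "slow for one caller", which is the distinction the text above relies on.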
A worked example: the concise-vs-brief incident, prevented
Someone opens a PR changing “concise” to “brief” in the summariser prompt.
- CI runs the 50-example smoke set. The LLM-as-judge rubric includes a “length appropriateness” criterion. The new prompt scores 3.2/5 on that criterion against the baseline’s 4.1/5: outputs are now shorter than the ideal. CI posts the regression; the reviewer asks “was that intentional?”
- Author decides the intent was wording cleanup, not behaviour change. They revert. Incident prevented in three minutes.
Alternative reality: author insists. Reviewer approves. PR merges.
- Pipeline pushes `summariser:v17`. Bedrock benchmark job runs the 500-example golden set. Overall quality score holds, but the length-appropriateness sub-metric is down. Platform’s quality dashboard flags the change for human review before the staging alias moves.
- Product team decides “shorter is fine” and approves. Staging moves; canary starts at 5%.
- User-facing retention metric in Datadog is wired into the canary gate. 24 hours in, retention on the summariser’s daily email has dipped 4% on the canary users; p<0.01. Pipeline aborts the canary. `production` alias stays on v16.
- Rollback is automatic; no action required. Author sees the abort notification and has data to work with.
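The "p<0.01" in the canary gate implies a significance test on retention between canary and control users. One common choice is a two-proportion z-test, sketched below with the standard library; the source does not say which test the pipeline uses, so treat this as an assumed stand-in. The function names and the user counts are hypothetical.

```python
import math


def two_proportion_p_value(retained_a: int, total_a: int,
                           retained_b: int, total_b: int) -> float:
    """Two-sided p-value for the hypothesis that both groups share one retention rate."""
    p_a = retained_a / total_a
    p_b = retained_b / total_b
    pooled = (retained_a + retained_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (p_a - p_b) / se
    return math.erfc(abs(z) / math.sqrt(2))


def should_abort(control: tuple[int, int], canary: tuple[int, int],
                 alpha: float = 0.01) -> bool:
    """Abort when canary retention is lower than control and the gap is significant."""
    control_rate = control[0] / control[1]
    canary_rate = canary[0] / canary[1]
    p_value = two_proportion_p_value(*control, *canary)
    return canary_rate < control_rate and p_value < alpha


# Made-up counts as (retained, total): 60% control retention vs 56% on the canary.
CONTROL = (57_000, 95_000)
CANARY = (2_800, 5_000)
```

With a 5% canary the canary group is small, so the gate needs enough traffic in 24 hours for a real dip to reach significance; otherwise the window has to be longer.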
What didn’t happen: eight days of silent regression, a confused postmortem, and a product team blaming engineering.
What’s worth remembering
- Prompts are configuration. Treat them like it: version control, review, CI, release pipeline, rollback. The damage of ignoring this is cheap to inflict and expensive to detect.
- Git is the source of truth; Bedrock is the runtime registry. Git gives PRs, diffs, code review, and CI; Bedrock gives immutable versions, aliases, and per-prompt metrics.
- Aliases are the rollback button. Move the alias and rollback happens in seconds, without redeploying thirty services. Design for this from day one.
- Evaluation is non-optional. A 50-example smoke set in CI catches the obvious; a 500-example golden set in the pipeline catches the subtle; a production canary catches the rest.
- Per-caller attribution matters when one prompt feeds many services. Tag invocations with a caller ID; dimension CloudWatch metrics on it.
- Prompt ownership is explicit. Each prompt has a named owner team and SME reviewers. Without it, the owner becomes the last person who touched it.
- Bedrock Prompt Management handles inference config too. Temperature, top-p, max tokens, stop sequences travel with the prompt version; changing them is a prompt change, with the same review and rollback.
- Don’t let prompts live as string literals in thirty repos. That’s not a style point; that’s the root cause of every prompt regression that ever ships.
One prompt, a hundred callers, a versioned registry, a release pipeline, and a rollback that’s faster than the Slack thread asking “did we change something?” The service owners still own their prompts; the platform team just stopped letting them own them badly.