The situation
A retail SaaS company has been running a customer-service chatbot on Bedrock for four months. Claude Sonnet behind a thin application layer, a knowledge base for returns and shipping policy, a handful of tools for order lookup. Red-team exercises before launch covered the obvious harms. CSAM, hate speech, weapon instructions, and the bot passed.
Three incident classes have surfaced since.
Leaking personal data. A user uploads a scan of a document containing a social security number and asks the bot to read the name off it. The reply politely acknowledges the upload and repeats the SSN back in full. A separate case echoes a pasted credit-card number from a complaint. Neither prompt is malicious; the model is being helpful with what the user handed it.
Talking about competitors. “How does your billing compare to Acme?” gets two paragraphs of side-by-side feature comparison, politely framed, factually wobbly, and the sort of thing legal and marketing will each independently ask to stop. Another user asks for third-party tools that integrate with the product; the reply lists three named competitors.
Confidently wrong about policy. A subscriber asks when their refund will arrive. The bot quotes a fourteen-day window. The policy in the knowledge base is thirty days. The model invented a plausible answer that didn’t come from the retrieved documents.
What actually matters
The first observation is that the three incidents look like three different problems and are actually the same problem in three costumes: the model is doing something the business doesn’t want, and a prompt instruction didn’t stop it. That framing is uncomfortable because it admits the System promptThe instruction block that frames the model’s behaviour for a session, separate from the user’s messages. is not a safety boundary. The model will mostly comply with “never reveal PII, never discuss competitors, only answer from retrieved documents”, and the gap between “mostly” and “never” is where the production incidents live. Any design that relies on prompt discipline as the enforcement layer is a design with a guaranteed failure mode; what changes between products is only the rate.
The second is that these filter jobs differ in what “enforcement” actually means. PII redaction is a pattern-match problem, there’s a definition of a social security number that holds up, a definition of a card number that holds up, and a detector that either finds them or doesn’t. Topic bans are a semantic problem, “competitor products” isn’t a keyword, it’s a cluster of phrasings no keyword list ever catches all of. GroundingConstraining a model to answer from provided sources rather than from whatever it absorbed during training. is a comparison problem, does this claim match this passage?, and requires the retrieved context to be part of the check. Content moderation (hate, violence, the usual cluster) is yet a fourth shape. Lumping those into one Lambda means writing four detectors badly. Keeping them as separate filters that share an invocation wraps them cleanly.
The third is where does enforcement sit relative to the model? Input filtering catches the pasted SSN before the model reads it; output filtering catches the echoed SSN and the drifted policy number before the user reads it. Both directions matter, and the blast radius of either direction failing is the same, a regulator or a journalist reading the transcript. The right design runs filters on both sides of the invocation, which means the mechanism has to live in the model call path, not in a Lambda the application remembers to invoke.
The fourth is who owns the policy and how do they change it? Legal wants to add a new competitor name to the ban list on a Friday afternoon. Product wants to loosen the insult threshold because the support bot is refusing grumpy-but-legitimate users. Engineering wants to audit what fired last night. If the policy lives in application code, every change is a release and a deploy; if it lives in a managed configuration versioned by the platform, a change is a version bump and a config update. The second shape is what lets a policy surface serve the people who own the policy.
The fifth is what do we need to see when something trips? Every intervention should emit a structured reason, which category, which topic, which filter, so the team can alarm on spikes (a denied-topic rate doubling at 2am is either a JailbreakA prompt that bypasses a model’s safety training and gets it to produce output it would normally refuse. campaign or a misconfigured prompt) and tune thresholds against real traffic rather than hypothetical red-team sessions. That observability is a first-class feature of the mechanism, not a logging tap bolted on afterwards.
Sixth: how does the solution compose with the rest of the stack? RAGA pattern where you retrieve relevant documents at query time and stuff them into the prompt so the model can ground its answer on them. already passes retrieval context on each call; the grounding check wants that context. Higher-level orchestration surfaces already coordinate tool calls; the filtering has to wrap whichever surface the application invokes without breaking its semantics. A solution that only works for the lowest-level model API and not for the orchestration loop above it isn’t a solution for a product that uses orchestration.
Finally: what’s the cost of adding the mechanism? Adding a round-trip on every call to a third-party DLP scanner doubles the per-turn latency budget and adds a contract to manage. Adding a managed in-call filter adds milliseconds and no contract. The mechanism that doesn’t show up as a separate bill or a separate vendor wins the tie.
What we’ll filter on
Five distinct safety jobs on the same prompt.
- Broad content safety. Cover the harm dimensions red team already found, hate, insults, sexual, violence, misconduct, plus prompt-injection on the input side. Needs tuneable strength per category.
- Topic-level policy. Block conversation about topics the business doesn’t want the assistant covering, competitor products here, but the same shape fits legal advice or investment recommendations. The trigger is a topic expressed in many words, not a keyword.
- PII detection and redaction. Find SSNs, cards, bank accounts, addresses, names in both input and output. Bidirectional, input so pastes don’t reach the model, output so echoes and hallucinations don’t reach the user.
- Grounding in retrieved context. Verify the reply actually follows from the documents retrieved. Catch the thirty-days-becomes-fourteen case at the response boundary, not at complaint time.
- Operable by the team that ships the bot. No new long-running service to scale or pay for twice. Policy changes are a console edit and a version bump, not a release.
The safety-wrapper landscape on Bedrock
Four shapes for wrapping GuardrailA filter or rule applied to an LLM’s inputs or outputs to keep it inside safe, legal, or on-brand behaviour. around a Bedrock invocation.
Bedrock Guardrails. A managed policy surface that wraps calls through InvokeModel, InvokeModelWithResponseStream, Converse, ConverseStream, Bedrock Agents, and Flows. A guardrail is a versioned configuration with up to five filter types: content filters across six categories (hate, insults, sexual, violence, misconduct, prompt attack), denied topics in natural language, sensitive information filters (30+ built-in PII types plus regex, BLOCK or ANONYMIZE), word filters (custom list plus managed profanity), and a contextual grounding check returning grounding and relevance scores on outputs. Invoked by passing guardrailIdentifier and guardrailVersion on the model call. ApplyGuardrail runs the same policy on arbitrary text with no model call.
Custom moderation via Lambda plus Amazon Comprehend. A pre-processing Lambda calls Comprehend’s DetectPiiEntities (about 20 PII classes) and toxicity detection, optionally calls another Bedrock model as a classifier for harm categories and denied topics, scrubs or rejects, and forwards to the model. A post-processing Lambda mirrors the pass on output. The application owns the chaining, the errors, and every tuning knob.
Third-party DLP scanner. Route input and output through a commercial product (Nightfall, Private AI, or similar. Macie is S3 batch discovery, not in-band chat). Strong on PII; weaker on category harms and non-pattern denied topics; contextual grounding typically out of scope.
Prompt engineering alone. “Never discuss competitors, never reveal PII, only answer from the retrieved documents, refuse unsafe content.” Fast, free, and not enforcement. Every new jailbreak is a production incident; every creatively phrased request slips through.
Side by side
| Option | Content categories | PII redaction | Denied topics | Grounding check | Low ops |
|---|---|---|---|---|---|
| Bedrock Guardrails | ✓ | ✓ | ✓ | ✓ | ✓ |
| Custom Lambda + Comprehend | — | — | — | ✗ | ✗ |
| Third-party DLP | — | ✓ | ✗ | ✗ | ✗ |
| Prompt engineering alone | ✗ | ✗ | ✗ | ✗ | ✓ |
Prompt engineering ticks “low ops” because it’s zero infrastructure, but it fails every enforcement column, so low but unsafe. Bedrock Guardrails is the only row ticking every column cleanly.
How Guardrails wraps a Bedrock invocation
Two things the diagram flattens worth spelling out.
Prompt attack is input-only. The category detects “ignore your previous instructions and…” patterns and protects the system prompt on the input side. No symmetric output check, the model either resists the injection or it doesn’t, and the other output filters catch the blast radius if it didn’t.
Contextual grounding is output-only. The check scores the generated reply against the retrieval context passed in; the input hasn’t been generated yet so there’s nothing to score.
The six content categories
Content filters are fixed; strength is configured per guardrail.
- Hate. Attacks on identity groups.
- Insults. Language demeaning an individual without the group-identity angle.
- Sexual. CSAM is handled at the highest strength regardless of configuration.
- Violence. Incitement, graphic descriptions, weapon instructions.
- Misconduct. Illegal activity, fraud, criminal how-tos.
- Prompt attack. Input-only. Injection patterns trying to rewrite or extract the system prompt.
Each strength. NONE, LOW, MEDIUM, HIGH, applies independently to input and output for the first five. When a category trips, response metadata carries GUARDRAIL_INTERVENED and the category that caught it.
Denied topics, PII, and word filters
The competitor-comparison incident is not a content-filter failure, the replies were polite, not hateful. They were off-topic. That’s the denied-topics shape: a name, a natural-language definition, up to five example phrases, up to 30 topics per guardrail.
Name: Competitor products
Definition: Any discussion of products, services, pricing,
or features offered by companies other than our own
that compete in the same category.
Examples:
- "How does this compare to Acme?"
- "Is BrandX better than your product?"
- "Recommend alternatives to your service."
The runtime classifies each turn against these definitions. Input-side match refuses the question; output-side match catches unprompted comparisons. Natural-language beats keyword lists because competitors get renamed, new ones appear, and users phrase comparisons without ever saying “compare.”
Sensitive information filters cover 30+ built-in PII types – US_SOCIAL_SECURITY_NUMBER, CREDIT_DEBIT_CARD_NUMBER, US_BANK_ACCOUNT_NUMBER, EMAIL, PHONE, ADDRESS, NAME, IP_ADDRESS, AWS_ACCESS_KEY, passport and driver’s licence numbers, a healthcare cluster, plus named regex patterns. Per-type action is BLOCK or ANONYMIZE. Both halves apply to input and output.
A sharp edge: NAME is often more aggressive than teams want, a bot greeting “{NAME}, I can help with that” because the user’s own name got masked is a poor experience. Default BLOCK on high-harm types (SSN, card, bank account), ANONYMIZE on the rest, and be willing to disable NAME on user-facing fields where the name belongs.
Word filters are a managed profanity toggle plus up to 10,000 custom literal terms. Competitor brand names get both treatments, denied topic catches comparisons in general, word filter catches the slip where the model names a brand directly.
Contextual grounding
The thirty-days-becomes-fourteen incident isn’t content, PII, or topic. It’s grounding, the reply contained a claim the retrieved passage didn’t support. The check returns two scores per output, each 0 to 1:
- Grounding. How well the claim is supported by the source passages.
- Relevance. How directly the claim addresses the user’s question.
Thresholds are configured per guardrail; below-threshold responses trip GUARDRAIL_INTERVENED. The check requires retrieval context alongside the model invocation. Bedrock Agents, Flows, and RetrieveAndGenerate pass it automatically.
A worked configuration
- One guardrail, versioned, all five filter types enabled.
- Content filters. MEDIUM on insults (a support bot gets rude users and needs to respond neutrally); HIGH on hate, sexual, violence, misconduct; HIGH on prompt attack.
- Denied topics. Competitor products, Legal or financial advice.
- Sensitive information.
US_SOCIAL_SECURITY_NUMBER,CREDIT_DEBIT_CARD_NUMBER,US_BANK_ACCOUNT_NUMBERon BLOCK.EMAIL,PHONE,ADDRESS,NAMEon ANONYMIZE. One regex for the company’s internal order-reference format, ANONYMIZE. - Word filters. Managed profanity on. Custom list containing three competitor brand names legal supplied.
- Contextual grounding. Grounding threshold 0.6, relevance threshold 0.5, tuned against an evaluation set built from the knowledge base.
- Invocation. The existing Bedrock call (the product already uses an Agent) passes
guardrailIdentifierandguardrailVersion. Version is pinned in configuration and bumped through the release process when policy changes. - Observability. CloudWatch metrics on interventions by category. Alarms on sudden spikes, if denied topics doubles in an hour, either the model has gone off-script or users have found a new way to ask the same thing.
The three production incidents all get caught inside one call. The pasted SSN trips sensitive-info BLOCK on input; the model never sees it. The competitor comparison trips denied topics on input or output. The fourteen-day refund hallucination trips the contextual grounding check against the thirty-day passage.
What’s worth remembering
- Bedrock Guardrails is a five-in-one safety surface. Content filters, denied topics, sensitive information, word filters, contextual grounding, one configuration, one call path, one version to pin.
- The six content categories are fixed. Hate, insults, sexual, violence, misconduct, prompt attack. Strength is configurable (NONE, LOW, MEDIUM, HIGH); the first five apply to input and output, prompt attack is input-only.
- Denied topics are natural-language policy, not keywords. Name, definition, up to five examples, up to 30 topics.
- PII filtering works in both directions. 30+ built-in types plus regex, BLOCK or ANONYMIZE per type. Input-side catches pastes; output-side catches echoes and hallucinated numbers.
- Word filters are literal, not semantic. Specific brand names and codenames alongside denied topics for the general category.
- Contextual grounding returns relevance and grounding scores on outputs. Below-threshold responses trip intervention. Requires retrieval context at invocation time. Agents, Flows, and
RetrieveAndGeneratepass it automatically. - Guardrails wraps
InvokeModel,Converse, Agents, and Flows viaguardrailIdentifierandguardrailVersion.ApplyGuardrailruns the same policy on arbitrary text with no model call. GUARDRAIL_INTERVENEDtells the application what fired. The reason identifies the category, topic, or filter that matched, log it, alarm on it, tune from real traffic.- Prompt engineering is a layer, not the layer. A good system prompt reduces intervention rates; it does not enforce policy boundaries.
- Custom Lambdas fit alongside, not instead of. Bespoke classifiers and role-specific rules call
ApplyGuardrailfor the general work and add the custom check on top.
The answer: configure a single Bedrock Guardrail with content filters across all six categories, denied topics for competitor products and other off-scope policy, sensitive-information filters covering SSN, credit card, and bank account on BLOCK with names, addresses, emails, and phones on ANONYMIZE, word filters for named brands, and a contextual grounding check tuned against the knowledge base. Invoke it by passing guardrailIdentifier and guardrailVersion on the existing Bedrock call. Reach for custom Lambdas only where the business has a bespoke classifier or role-based policy Guardrails cannot express, not as the default wrapper. Five filter jobs, one call, one surface to change when policy moves.