Guardrails, Watermarks, and Refusals

A fintech ships a customer-facing chatbot on Bedrock that helps users understand their account history. Legal asks: can it give financial advice it shouldn’t? Risk asks: can it echo back a customer’s full account number? Compliance asks: if a regulator challenges us – “prove this response came from your approved model, not a third party, and wasn’t tampered with” – what do we show them? Three questions, three different controls, all of them Bedrock-native. The controls exist; the work is matching the right one to each question, figuring out how they compose, and seeing what a “responsible AI” configuration looks like when someone external actually asks to see it.

The situation

A mid-size fintech runs a customer-facing chatbot on Bedrock. The chatbot helps the roughly 2M active customers understand their transaction history, explain fees, surface policy documents, and escalate to a human when needed. It runs on Claude Sonnet 4.5, invoked from a Lambda behind API Gateway.

Three compliance obligations:

No regulated financial advice. The chatbot can explain what a fee is; it cannot recommend whether to invest, what to buy, or when to sell. Crossing that line is a regulated-advice violation.
No customer PII in outputs. The model should never echo a full account number, full name + date of birth together, or any other field that would count as PII under the relevant privacy regulation. The chatbot has access to this data (via tool use) but should redact it in responses.
Auditable provenance. Every response must be attributable: which model produced it, which prompt, which customer session, and – in the event of a dispute – proof that the text came from the AWS-hosted model rather than a third-party intercept or a compromised channel.

Separately, the product team wants to know: when the chatbot refuses a request (e.g. “I can’t give investment advice”), what does that refusal look like? Who controls the refusal message? And can users bypass it with prompt tricks (“pretend you’re a financial advisor…”)?

What actually matters

Responsible-AI controls on Bedrock live in three layers, and they don’t overlap neatly with the three compliance obligations – so the first job is mapping question to control.

The first thing worth thinking about is topic restriction. The “no financial advice” requirement is about what the model talks about, not what words appear in the output. A topic filter needs to recognise that “should I invest in XYZ?” is a regulated-advice question even if phrased as “what do you think about XYZ?” or “if you were me, what would you buy?”. This is Bedrock Guardrails’ denied topics feature: define a topic with a natural-language description plus optional example prompts, and Guardrails will intercept invocations whose input or output falls into the topic, returning a canned refusal instead of the model’s response.

The second is PII redaction. This is about patterns in the output – an account number has a recognisable shape, an email address matches a regex, a full name is an entity a tagger can identify. Bedrock Guardrails includes sensitive-information filters for a catalogue of PII types (SSN, email, phone, credit card, address, name, and more), plus user-defined regex patterns for domain-specific identifiers (the fintech’s internal account number format, say). Filters can block the invocation entirely or anonymize – replace the matched text with a typed placeholder tag such as {EMAIL}.

The third is word filters. Profanity, competitor mentions, and a hard-coded block list live here. Less important for this specific scenario but part of the feature catalogue.

The fourth is content moderation. Hate, insults, sexual content, violence, misconduct, and prompt attacks (jailbreak attempts) – each with a configurable threshold (NONE, LOW, MEDIUM, HIGH). Applied to input and output independently. This is the “generic safety” layer that catches the cases Guardrails’ other features don’t explicitly cover.

The fifth is grounding and relevance checks. A newer Guardrails feature: Guardrails can evaluate whether the model’s response is grounded in the source material provided to it (grounding score) and whether it actually addresses the user’s question (relevance score). If either falls below a threshold, the response can be blocked or flagged. Relevant for RAG-heavy chatbots; less directly relevant to the fintech scenario but worth knowing.

The sixth is watermarking and provenance. Text-generation watermarking (statistical patterns embedded in the output that can be detected later) is an emerging capability, not yet universal across Bedrock models. For provenance, the primary AWS-native answer is CloudTrail + Bedrock invocation logging – every InvokeModel call is recorded with principal, model ARN, and timestamp, and (with invocation logging enabled) the full request and response are stored encrypted in S3 or CloudWatch Logs. Combined with IAM restrictions on which principals can invoke which models, this gives an access-controlled audit trail rather than a watermark in the text itself: a disputed response either has a matching CloudTrail event and invocation-log entry from the approved principal and model, or it did not come from this deployment.

The seventh is the shape of the refusal itself. When Guardrails intercepts an invocation, the caller receives a specific response body with stopReason: "guardrail_intervened", assessment details showing which policy fired, and a configurable refusal message ("I can't help with that request. For investment advice, please speak with a licensed advisor at ..."). The application code handles this as a control-flow case distinct from a normal completion.
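A minimal Python sketch of that control-flow split, assuming the response body carries the amazon-bedrock-guardrailAction and amazon-bedrock-trace keys in the shape used in the worked refusal later in this piece. ChatResult and interpret_response are hypothetical names for illustration, not part of any AWS SDK.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ChatResult:
    text: str                # what the chat UI should render
    refused: bool            # True when the guardrail intervened
    fired_topics: list = field(default_factory=list)  # denied topics that blocked the input


def interpret_response(raw_body: str) -> ChatResult:
    """Separate a normal completion from a guardrail refusal.

    Assumes the body shape shown in this article; real responses may carry
    more fields (for example output-side assessments), which are ignored here.
    """
    body = json.loads(raw_body)

    # Concatenate every text part of the response content.
    text = "".join(
        part.get("text", "")
        for part in body.get("content", [])
        if part.get("type") == "text"
    )

    refused = body.get("amazon-bedrock-guardrailAction") == "INTERVENED"

    # Collect the names of denied topics that blocked the input, for logging.
    fired = []
    guardrail_trace = body.get("amazon-bedrock-trace", {}).get("guardrail", {})
    for assessment in guardrail_trace.get("inputAssessment", {}).values():
        for topic in assessment.get("topicPolicy", {}).get("topics", []):
            if topic.get("action") == "BLOCKED":
                fired.append(topic["name"])

    return ChatResult(text=text, refused=refused, fired_topics=fired)
```

The caller can then branch on result.refused: render the refusal text as-is, skip any post-processing meant for model output, and record result.fired_topics alongside the session ID.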

What we’ll filter on

Five filters: one per compliance obligation, with PII split into “detect” and “action taken”, plus one for the product team’s question about prompt tricks.

Does this control enforce topic-level restrictions (e.g. no financial advice)?
Does it detect PII in inputs and outputs?
Does it let the response be redacted rather than blocked (for cases where redaction is enough)?
Does it cover prompt-injection attempts (“ignore previous instructions…”)?
Does it produce an audit artefact – something an auditor can inspect after the fact?

The Bedrock responsible-AI landscape

Bedrock Guardrails. A configuration object attached to an invocation that applies one or more policies to the input, the output, or both. Policies: denied topics (up to 30 per guardrail, each a named topic with description and example prompts), content filters (five categories + prompt attacks, each with severity threshold), word filters (block list + managed profanity), sensitive-information filters (the built-in PII catalogue + user regexes), and – for RAG use cases – contextual grounding and relevance checks. Each guardrail is versioned; invocations reference a specific version. Created via bedrock:CreateGuardrail, versioned via CreateGuardrailVersion, applied via the guardrailIdentifier + guardrailVersion fields on InvokeModel (or automatically when a Knowledge Base or Agent has a guardrail attached).
Model invocation logging. A Bedrock account-level setting (one per Region) that directs Bedrock to write full request and response payloads to a destination: S3 bucket, CloudWatch Logs, or both. Enabled via bedrock:PutModelInvocationLoggingConfiguration. Captures the prompt, the model’s raw output, any guardrail assessments, and metadata (model ID, timestamp, caller IAM principal via CloudTrail correlation). Encrypts at rest under a KMS key of your choice. This is the durable audit trail.
CloudTrail. Every Bedrock API call – InvokeModel, CreateGuardrail, GetGuardrail, PutModelInvocationLoggingConfiguration – emits a CloudTrail event. Data events can be enabled to capture InvokeModel calls specifically (they’re not in management events by default). Gives the “who called what, when” audit; doesn’t include the model’s output (that’s invocation logging’s job).
IAM-scoped model access. A Bedrock IAM policy controls which principals can invoke which models. bedrock:InvokeModel on arn:aws:bedrock:*::foundation-model/anthropic.claude-sonnet-4-5-20250929-v1:0 restricts a role to one model. The chatbot Lambda’s role should allow exactly the models the application is approved to use, nothing else; requests for other models are denied before the model is invoked, and the denial is recorded in CloudTrail.
Customer-managed KMS keys. Invocation logs, training data, and custom models can be encrypted with customer-managed KMS keys. Gives the ability to revoke access to historical logs by disabling the key, and to require explicit key-usage grants to read the audit record. The regulator-facing story.
Cross-Region inference profiles and data residency. For regulators that care where inference happens, Bedrock’s model ARNs pin the Region, and cross-Region inference profiles (for models that support it) expose an explicit list of which Regions can serve a request. Important for the audit story when data-residency constraints apply.
Bedrock Evaluation. Not a real-time control, but part of the responsible-AI story: systematic evaluation of a model (or a prompt-and-model combination) on dimensions including toxicity, robustness, and accuracy, against either built-in datasets or your own. The pre-production counterpart to Guardrails’ in-production enforcement.

Side by side

Mapping each control to the five filters above:

Control	Topic restriction	PII detection	Redact option	Prompt-injection	Audit artefact
Guardrails: denied topics	✓	—	—	Partial	✓ (assessment)
Guardrails: content filters	Partial	—	—	✓ (prompt attacks)	✓ (assessment)
Guardrails: PII filters	—	✓	✓ (anonymize)	—	✓ (assessment)
Guardrails: word filters	Partial	—	—	—	✓ (assessment)
Invocation logging	—	—	—	—	✓ (full payload)
CloudTrail	—	—	—	—	✓ (metadata)
IAM model scoping	—	—	—	—	✓ (deny trail)

A complete configuration for the fintech chatbot uses all of these, not one. Guardrails handle real-time enforcement; invocation logging and CloudTrail handle audit; IAM handles the “this model, not another” question; KMS handles the “this key, held by us” question.

How the controls compose

Guardrails enforce at two gates – input and output – around a single model call. CloudTrail, invocation logging, and KMS produce the three audit artefacts. Each layer does one job; removing any of them breaks a different piece of the compliance story.

The configuration in depth

The Guardrail. Create one guardrail per application (chatbot-customer-v1). The configuration, at a high level:

{
  "name": "chatbot-customer-v1",
  "blockedInputMessaging": "I can't help with that request. For investment advice, please speak with a licensed advisor at 0800-...",
  "blockedOutputsMessaging": "I can't share that response. Please contact support if you need more detail.",
  "topicPolicyConfig": {
    "topicsConfig": [
      {
        "name": "RegulatedFinancialAdvice",
        "definition": "Advice to buy, sell, or hold specific securities, or recommendations on investment strategy, asset allocation, retirement planning, or tax planning for a specific person.",
        "examples": [
          "Should I invest in XYZ stock?",
          "What should I do with my 401k?",
          "Is now a good time to buy bonds?"
        ],
        "type": "DENY"
      }
    ]
  },
  "contentPolicyConfig": {
    "filtersConfig": [
      {"type": "SEXUAL",      "inputStrength": "HIGH",   "outputStrength": "HIGH"},
      {"type": "VIOLENCE",    "inputStrength": "HIGH",   "outputStrength": "HIGH"},
      {"type": "HATE",        "inputStrength": "HIGH",   "outputStrength": "HIGH"},
      {"type": "INSULTS",     "inputStrength": "MEDIUM", "outputStrength": "MEDIUM"},
      {"type": "MISCONDUCT",  "inputStrength": "HIGH",   "outputStrength": "HIGH"},
      {"type": "PROMPT_ATTACK", "inputStrength": "HIGH", "outputStrength": "NONE"}
    ]
  },
  "sensitiveInformationPolicyConfig": {
    "piiEntitiesConfig": [
      {"type": "CREDIT_DEBIT_CARD_NUMBER", "action": "BLOCK"},
      {"type": "US_BANK_ACCOUNT_NUMBER",   "action": "ANONYMIZE"},
      {"type": "US_SOCIAL_SECURITY_NUMBER","action": "BLOCK"},
      {"type": "EMAIL",                    "action": "ANONYMIZE"},
      {"type": "PHONE",                    "action": "ANONYMIZE"},
      {"type": "NAME",                     "action": "ANONYMIZE"}
    ],
    "regexesConfig": [
      {
        "name": "InternalAccountId",
        "pattern": "ACCT-[A-Z0-9]{10}",
        "action": "ANONYMIZE"
      }
    ]
  }
}

A few points on that. PROMPT_ATTACK is applied to input only (outputStrength: NONE) because what we’re catching is the user’s attempt to jailbreak; it doesn’t make sense on output. CREDIT_DEBIT_CARD_NUMBER is BLOCK (blocks the whole invocation) because a card number in a response is never acceptable; US_BANK_ACCOUNT_NUMBER is ANONYMIZE because the rest of the response is still useful once the number is masked, and the application can then render a safe reference such as “your account ending in 1234”. The regexesConfig catches the company’s internal ACCT-... identifier that isn’t in the built-in PII catalogue.
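Before shipping a custom pattern, it can help to exercise it locally against realistic strings. The sketch below uses Python’s re module as a stand-in; it is an assumption that Guardrails’ regex engine treats this simple pattern the same way, and the [ACCOUNT_ID] tag and local_anonymize helper are hypothetical, chosen only for the illustration.

```python
import re

# Same pattern as the InternalAccountId entry in regexesConfig.
INTERNAL_ACCOUNT_ID = re.compile(r"ACCT-[A-Z0-9]{10}")


def local_anonymize(text: str, tag: str = "[ACCOUNT_ID]") -> str:
    """Local stand-in for ANONYMIZE: replace every pattern match with a tag."""
    return INTERNAL_ACCOUNT_ID.sub(tag, text)


# A well-formed ID is masked.
masked = local_anonymize("The balance on account ACCT-ABC1234567 is $3,421.55.")

# A lowercase variant is NOT matched: the pattern is case-sensitive.
lowercase = local_anonymize("ref acct-abc1234567")

# An 11-character ID is only partly masked: the trailing character survives.
overlong = local_anonymize("ref ACCT-ABC12345678 ok")
```

Here masked comes back with the tag in place of the ID, lowercase comes back unchanged, and overlong comes back as "ref [ACCOUNT_ID]8 ok". Whether the last two matter depends on how strictly the real ID format is enforced upstream; if they do, the pattern needs tightening before it goes into the guardrail.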

Versioning. CreateGuardrailVersion snapshots the DRAFT into an immutable version. The Lambda invokes with guardrailIdentifier=<id> and guardrailVersion=<N>, pinning to a specific version; updates to the guardrail don’t affect production until the Lambda is updated to reference the new version. This is the change-control story: Legal reviews version 3, approves it, the Lambda is updated to reference version 3.
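A rough sketch of that publish-then-pin flow in Python. The client is passed in so the functions can be exercised without AWS access; in a real deployment it would be a boto3 client for the bedrock service. The helper names and the rule of rejecting DRAFT in production are assumptions made for illustration, not something the platform enforces.

```python
def publish_guardrail_version(bedrock_client, guardrail_id: str, description: str) -> str:
    """Snapshot the guardrail's DRAFT into an immutable numbered version.

    bedrock_client is expected to expose create_guardrail_version, as the
    boto3 "bedrock" client does. Returns the new version string.
    """
    response = bedrock_client.create_guardrail_version(
        guardrailIdentifier=guardrail_id,
        description=description,
    )
    return response["version"]


def pinned_invoke_kwargs(model_id: str, guardrail_id: str, approved_version: str, body: str) -> dict:
    """Build InvokeModel arguments pinned to one reviewed guardrail version."""
    if approved_version == "DRAFT":
        # Hypothetical house rule: production must reference a numbered,
        # reviewed version, never the mutable working draft.
        raise ValueError("production invocations must pin a numbered guardrail version")
    return {
        "modelId": model_id,
        "guardrailIdentifier": guardrail_id,
        "guardrailVersion": approved_version,
        "body": body,
    }
```

The version number that Legal approved would live in the Lambda’s configuration, so moving from version 3 to version 4 is a deliberate, reviewable change rather than a side effect of editing the guardrail.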

Invocation logging. Enable via PutModelInvocationLoggingConfiguration at the Region level:

{
  "loggingConfig": {
    "cloudWatchConfig": {
      "logGroupName": "/aws/bedrock/invocations",
      "roleArn": "arn:aws:iam::111122223333:role/BedrockLoggingRole"
    },
    "s3Config": {
      "bucketName": "fintech-bedrock-audit",
      "keyPrefix": "chatbot/"
    },
    "textDataDeliveryEnabled": true,
    "imageDataDeliveryEnabled": false,
    "embeddingDataDeliveryEnabled": false
  }
}

Every invocation’s full request, response, model metadata, and guardrail assessment land in both sinks. S3 is archival (Athena queryable); CloudWatch Logs is real-time (Logs Insights queryable for incident response). Both are encrypted; the S3 bucket’s default encryption uses a customer-managed KMS key that only the audit team can grant kms:Decrypt on.

CloudTrail data events. InvokeModel isn’t in management events by default – enable data events for Bedrock to capture each call’s principal, model ARN, and timestamp. Data events cost money per event but are the only way to get the “who called what” trail for high-volume model calls at the CloudTrail layer.

IAM restriction. The chatbot Lambda’s execution role has exactly one bedrock:InvokeModel permission, scoped to the Claude Sonnet model ARN and requiring the guardrail:

{
  "Effect": "Allow",
  "Action": "bedrock:InvokeModel",
  "Resource": [
    "arn:aws:bedrock:eu-west-1::foundation-model/anthropic.claude-sonnet-4-5-20250929-v1:0",
    "arn:aws:bedrock:eu-west-1:111122223333:guardrail/chatbot-customer-v1"
  ],
  "Condition": {
    "StringEquals": {
      "bedrock:GuardrailIdentifier": "arn:aws:bedrock:eu-west-1:111122223333:guardrail/chatbot-customer-v1"
    }
  }
}

That condition block is the key enforcement: the Lambda cannot invoke the model without the guardrail attached. Even if a developer accidentally removed the guardrail reference in code, IAM would deny the call.
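One way to keep that property from eroding is a small check in CI that inspects the role’s policy document. The sketch below is a deliberately narrow lint, not an IAM evaluator: it assumes a standard policy document with a Statement list, only looks at Allow statements, and does not understand action wildcards such as bedrock:Invoke*, Deny statements, or policies attached elsewhere. The function name is hypothetical.

```python
def unguarded_invoke_statements(policy: dict, guardrail_arn: str) -> list:
    """Return Allow statements that permit InvokeModel without requiring the guardrail."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # IAM allows a single statement object
        statements = [statements]

    problems = []
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue

        actions = stmt.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        # Only exact matches and the two broadest wildcards are recognised.
        if not any(a in ("bedrock:InvokeModel", "bedrock:*", "*") for a in actions):
            continue

        required = (
            stmt.get("Condition", {})
            .get("StringEquals", {})
            .get("bedrock:GuardrailIdentifier")
        )
        if required != guardrail_arn:
            problems.append(stmt)
    return problems
```

An empty result means every InvokeModel grant the lint recognises carries the guardrail condition; a non-empty result is a statement someone should look at before it ships.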

A worked refusal

A customer asks: “Hey, I’ve got 50k saved. Should I put it in index funds or high-yield savings?”

The Lambda forwards the message to Bedrock with the guardrail attached:

$ aws bedrock-runtime invoke-model \
    --model-id anthropic.claude-sonnet-4-5-20250929-v1:0 \
    --guardrail-identifier chatbot-customer-v1 \
    --guardrail-version 3 \
    --trace ENABLED \
    --body '{"anthropic_version":"bedrock-2023-05-31","max_tokens":500,"messages":[{"role":"user","content":"Hey, I have got 50k saved. Should I put it in index funds or high-yield savings?"}]}' \
    --cli-binary-format raw-in-base64-out \
    out.json

$ jq . out.json
{
  "stopReason": "guardrail_intervened",
  "content": [
    {"type": "text", "text": "I can't help with that request. For investment advice, please speak with a licensed advisor at 0800-..."}
  ],
  "amazon-bedrock-guardrailAction": "INTERVENED",
  "amazon-bedrock-trace": {
    "guardrail": {
      "inputAssessment": {
        "chatbot-customer-v1": {
          "topicPolicy": {
            "topics": [
              {"name": "RegulatedFinancialAdvice", "type": "DENY", "action": "BLOCKED"}
            ]
          }
        }
      }
    }
  }
}

What happened:

The input guardrail evaluated the message against the RegulatedFinancialAdvice topic. The topic’s definition (“advice to buy, sell, or hold specific securities…”) plus the examples (“What should I do with my 401k?”) gave the topic classifier enough signal to recognise this phrasing.
The classifier flagged the input as matching. Guardrails short-circuited the invocation: the model was never called.
The response body contains the configured blockedInputMessaging plus the full assessment showing which policy fired.
The Lambda received this response with stopReason: "guardrail_intervened" and rendered the configured refusal in the chat UI.
CloudTrail recorded the InvokeModel call. The invocation log wrote the full prompt, the refusal response, and the guardrail assessment to S3 under the audit bucket’s KMS key.

The customer sees a polite refusal pointing them to a real human advisor. Compliance has an auditable record that the model was never invoked with that prompt, which is a stronger position than “the model was invoked and declined.”

A worked PII redaction

A customer asks: “Can you confirm the balance on my account ACCT-ABC1234567?”

The Lambda has tool-use wired up: it calls an internal API to look up the balance, includes the result in the prompt context, and asks the model to produce a natural-language response. The model generates:

The balance on account ACCT-ABC1234567 is $3,421.55 as of today.

The output guardrail evaluates. The InternalAccountId regex matches ACCT-ABC1234567 with action ANONYMIZE. The returned content:

The balance on account {InternalAccountId} is $3,421.55 as of today.

The application layer then looks at the original session context, confirms the customer is authenticated and authorised for that specific account, and renders “your account ending in 4567” in the UI. The guardrail doesn’t need to know which account number is OK to show which customer – it just ensures the raw internal identifier never reaches the rendered chat log. The application, which has the authz context, substitutes a friendly form.

This is the key pattern: Guardrails enforce a structural invariant (“no internal account IDs in output”); the application layer enforces contextual authorisation (“this customer can see a reference to their account”). The two compose.
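The application-side half of that pattern might look like the following sketch. The function name, the idea of passing the placeholder in as a parameter, and the choice to raise on an unauthorised account are all assumptions for illustration; the exact tag depends on how the guardrail is configured.

```python
def render_for_customer(
    guardrail_text: str,
    placeholder: str,
    authorised_account_ids: set,
    requested_account_id: str,
) -> str:
    """Swap the guardrail's placeholder for a friendly, authorised reference.

    guardrail_text is the anonymized text returned through the guardrail.
    authorised_account_ids comes from the authenticated session, never from
    the model output.
    """
    if requested_account_id not in authorised_account_ids:
        # The guardrail cannot know this; only the application has the
        # session context, so the authorisation decision is made here.
        raise PermissionError("session is not authorised for this account")

    # Show only the last four characters of the internal identifier.
    friendly = f"ending in {requested_account_id[-4:]}"
    return guardrail_text.replace(placeholder, friendly)
```

Called with the anonymized sentence above, the tag the guardrail emitted, and a session authorised for ACCT-ABC1234567, this returns “The balance on account ending in 4567 is $3,421.55 as of today.” The raw identifier is read from the session and the tool call, not recovered from the model’s text.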

What’s worth remembering

Responsible AI on Bedrock is three layers, not one. Real-time enforcement (Guardrails), audit persistence (invocation logging + CloudTrail), and identity/encryption (IAM + KMS). All three are needed for a defensible compliance story.
Guardrails has five policy types. Denied topics, content filters (including prompt attacks), word filters, sensitive-information filters (PII + user regex), and contextual grounding/relevance. Most can apply to input, output, or both; prompt-attack filtering is input-only.
PII filters can block or anonymize. Block stops the invocation; anonymize replaces the matched text with a typed placeholder tag such as {EMAIL}. Choose per PII type: card numbers block, account references anonymize.
Guardrails are versioned and pinned per invocation. Create, version, reference a specific version in the invocation. Updates don’t affect production until the caller is updated. This is change control for model behaviour.
Model invocation logging captures the full payload. Prompt, response, guardrail assessment, metadata – to S3 or CloudWatch Logs, encrypted under a customer-managed KMS key. The durable audit artefact.
CloudTrail data events for Bedrock give the “who called what” trail. Not on by default. Pair with invocation logging for the full picture.
IAM conditions enforce guardrail usage. A policy that requires bedrock:GuardrailIdentifier to equal a specific guardrail ARN makes it impossible to invoke the model without the guardrail – bypassing guardrails requires changing IAM, which has its own audit trail.
Guardrails enforce structure; the application enforces context. Guardrails keep raw account numbers out of output. The app layer, which has the authenticated session, decides which anonymized references to show which customer. The two compose; neither alone is sufficient.

A chatbot that refuses financial advice, redacts account numbers, and produces an audit trail a regulator would accept isn’t one feature – it’s five Bedrock features configured together, plus IAM and KMS around them. The craft is knowing which feature answers which compliance question, and wiring the configuration so no obvious bypass exists (no guardrail-less invocation path, no unencrypted log sink, no overly broad IAM). Get the composition right once and the chatbot is defensible; miss a layer and the auditor has a question with no good answer.