How to Build a Multi-Modal Bedrock Assistant for Insurance Claims

July 13, 2026 · 15 min read

Generative AI Developer · AIP-C01 · part of The Exam Room

The situation

An insurance company is modernising the first-line claims workflow. A customer submits a claim through any combination of channels: a photo of a damaged laptop, a PDF of the purchase invoice, a voicemail explaining what happened, and a follow-up text message asking when the decision will be made. Today, those artefacts land in separate queues and separate humans stitch them together. The target is a single assistant that accepts any subset of these inputs, understands them, asks clarifying questions where needed, and either resolves the claim or routes it to a human with a clean summary and a recommendation.

Concretely, the assistant needs to:

  • Read images. Photos of damaged goods, whiteboard notes from adjusters, screenshots of error messages, identity documents.
  • Read PDFs and scanned documents. Invoices, receipts, policy documents, medical notes, a mix of text-over-image and structured PDF.
  • Transcribe and understand audio. Voicemails up to three minutes, often with background noise and accents.
  • Produce text. Customer-facing explanations, internal summaries, structured decisions for the claims system.
  • Optionally produce speech. Accessibility mode reads responses back; some channels (IVR) are audio-only.
  • Keep a single conversation. Across modalities, across turns, without losing context.

Five SLAs matter: claim acknowledgement within 30 seconds, first substantive response within 2 minutes, decision or routing within 10 minutes, accessibility for audio-first users, and audit trail for every AI-produced decision.

What actually matters

The phrase “multi-modal” collapses several distinct capabilities that a real system has to handle separately. Understanding an image is not the same as understanding audio, and neither is the same as generating speech. The models that are good at each are different; the failure modes are different; the latency and cost profiles are different.

The first decision is which input modalities go through a single multi-modal model, and which get transcoded to text first. A vision-capable LLMA neural network trained to predict the next token in a sequence, large enough that it generalises to tasks it wasn’t explicitly trained for. can ingest images directly and reason about them in the same PromptThe input you hand to an LLM – system instructions, user message, examples, retrieved documents, tool descriptions, the lot. as text. Audio generally doesn’t work that way, audio understanding is a separate service that produces text, which then flows to the LLM. PDFs sit in between: native PDFs have text layers; scanned PDFs need OCR first (a dedicated document service, or the LLM’s vision capability if the document is page-image-sized).

The second is the orchestration shape. One prompt with several input blocks, text, image, text, is the simplest case. Several steps (transcribe audio → extract PDF text → combine → send to LLM) is the common case. A stateful agent that decides which tools to call is the most flexible case. Each shape has different latency characteristics.

The third is output modality. Generating text is native to every LLM. Generating speech is a separate service call. Generating images is a separate service call. Whether to bundle these into the model’s response or chain them as a post-step changes what the user experiences.

The fourth is failure modes, per modality. An image might be blurry; an audio file might be inaudible; a PDF might be password-protected; a voicemail might be in a language the model wasn’t trained on. Each needs a graceful fallback, “I can’t quite make out the invoice; could you describe the damaged item?”, instead of a hard error.

The fifth is cost shape across modalities. A single image input costs a few thousand TokenThe unit of text an LLM actually sees – usually a short character sequence, not a whole word. ’ worth of processing, depending on resolution; a minute of transcribed audio costs a small fraction of a cent; a minute of generated speech costs about the same. The bill shape depends on which modalities dominate usage.

And a softer one: user expectations differ by channel. An accessibility user reading via screen reader expects a different response shape from a claims adjuster reviewing a summary, not a different model, but a different prompt and response length.

What we’ll filter on

  1. Modality coverage, which inputs and outputs does this architecture natively support?
  2. Latency, first-response and full-response timing for each input shape?
  3. Robustness, graceful handling of bad-quality inputs?
  4. Operational surface, how many services, SDKs, tools to integrate?
  5. Per-modality cost, does the cost model make sense for the expected input mix?

The multi-modal landscape

  1. Claude Sonnet 4.5 (with vision) + Amazon Transcribe + Amazon Polly. The best-of-breed stack. Claude handles text, images, and PDFs-as-images in a single prompt. Transcribe handles audio in, Polly handles speech out. Orchestration is explicit code: a Lambda that receives the request, dispatches to Transcribe if audio, sends everything to Claude, optionally sends Claude’s response to Polly. Each service is best-in-class; glue is ours. Claude Sonnet 4.5’s vision is strong on documents, receipts, and natural images; Transcribe handles 30+ languages and speaker diarisation; Polly has dozens of voices including neural-quality options.

  2. Amazon Nova family. Nova Lite and Nova Pro handle text, image, and video input natively through Bedrock. Nova Canvas generates images; Nova Reel generates video. Nova Micro is text-only. Audio input is handled by pre-processing through Transcribe. Competitive pricing against Claude. Same orchestration pattern; different vendor.

  3. Amazon Q Business. The higher-level managed product. Q Business handles documents (via Q Business connectors), answers questions about them, integrates with enterprise systems, and includes built-in citations. For a well-scoped enterprise-documents use case it can skip a lot of the plumbing. Limited control over the underlying model, and narrower for claims-specific workflows that mix images, audio, and structured decisions.

  4. SageMaker-hosted multi-modal models. Custom or open-source multi-modal models. LLaVA, Kosmos, Idefics, hosted on SageMaker endpoints. Full control, higher operational cost, worth it when the commercial models don’t fit (specialised domains, privacy requirements, custom fine-tuning). Not the default.

  5. A “everything through OCR to text” approach. Run every input through a text transcription (Textract for documents, Transcribe for audio, a vision-to-description step for images), concatenate the text, feed to a text-only model. Simple; loses information (an image described in words loses visual detail the model could have used directly); cheaper in some cases; wrong when the visual detail matters.

Side by side

Option Modality coverage Latency Robustness Ops surface Cost shape
Claude + Transcribe + Polly Text, image, PDF, audio, speech Moderate Per-service fallbacks 3 services + glue Per-token + per-minute
Nova + Transcribe + Polly Text, image, video, audio (via), speech Moderate Per-service fallbacks 3 services + glue Per-token + per-minute
Q Business Documents + text Low Managed Minimal Per-user subscription
SageMaker hosted Anything we deploy Variable Ours High Endpoint-hours
Everything-to-text Text only (post-transcription) Low Lossy conversion Moderate Cheapest per input

For a claims assistant with four distinct modalities and a need for high visual fidelity (a photo of a damaged laptop carries information that a description loses), Option 1 is the honest answer. Claude’s vision is strong, Transcribe handles the audio leg, Polly the speech out. The orchestration is ours to own, but it’s manageable, one Lambda with clean branches per modality.

The orchestration, in shape

Multi-modal orchestration for the claims assistant Inputs Damaged item photo JPEG / PNG Scanned invoice PDF (scanned pages) Voicemail MP3 / WAV, 3 min max Text message SMS / chat Pre-processing Resize + encode base64 image block Split to page images one image block / page Amazon Transcribe audio → text + confidence Pass through text block as-is Converse request multiple content blocks image · image · text · text + history from DynamoDB + system prompt Claude Sonnet 4.5 (vision) reasons across blocks tool_use if claim routing Text response cited, structured, for customer Amazon Polly (accessibility) neural voice, SSML tags bypass when text-only Claims system (tool call) lookup policy · record decision · route Conversation + audit state DynamoDB: conversation history per session, TTL 30 days S3: raw inputs encrypted, evidence for audit CloudWatch: audit log per decision, prompt version
Four inputs, four pre-processing paths, one Converse request, one text response with optional Polly pass. State in DynamoDB, evidence in S3, audit in CloudWatch.

The pick in depth

Image and PDF inputs go direct to Claude. The Converse API accepts image content blocks up to a few megabytes per image, and Claude Sonnet 4.5’s vision is strong on photos, screenshots, and document scans. A scanned invoice PDF splits into page-image blocks; a natural photo of damage goes in as-is. For very large documents (30+ pages) the split + embed route through a Knowledge Base becomes worthwhile, but a single five-page invoice is fine inline.

Audio goes through Transcribe first. No production foundation model on Bedrock ingests audio natively today. Amazon Transcribe handles the audio → text conversion, with features that matter for voicemail: noise reduction, speaker diarisation (when there are multiple voices), custom vocabulary (the product’s brand names, policy jargon), and a confidence score per segment. Low-confidence segments get flagged in the prompt, “[transcript, confidence 0.4: mumbling about a laptop]”, so the model knows it’s working from approximate text. Latency for a three-minute voicemail is typically 10-30 seconds for a synchronous batch job, or near-real-time with Transcribe Streaming if the channel is live.

Text messages pass through. No pre-processing needed; text block as-is.

Orchestration as a Lambda. The Lambda receives the claim package (some combination of S3 keys for image/PDF/audio, plus any inline text), dispatches to Transcribe for audio, splits PDFs to page images, assembles a Converse request with all blocks in a stable order (text first, then images, then PDF pages, then transcribed audio as text with confidence annotations), adds the System promptThe instruction block that frames the model’s behaviour for a session, separate from the user’s messages. and session history from DynamoDB, and calls Bedrock.

Output. The model returns text, optionally with tool calls to the claims system (lookup policy by ID, record a provisional decision, route to human). Text goes to the customer’s channel. If accessibility mode is enabled (session attribute on the conversation), the text also routes through Polly. Polly’s neural voices produce natural-sounding speech; SSML tags in the LLM’s output (<break time="500ms"/>, <emphasis>) improve prosody for long responses.

State and audit. DynamoDB holds conversation history keyed by session ID; each turn records the input modalities, the prompt version (from Prompt Management), the model’s text output, and any tool calls made. S3 holds the raw inputs with server-side encryption, evidence the claims system can reference later. CloudWatch records each decision with the session ID, prompt version, and a flag indicating human oversight was required.

A worked example: one claim, four modalities

A customer submits a claim with a photo of a laptop with a cracked screen, a scanned PDF of the original purchase invoice (2 pages), a 45-second voicemail explaining they dropped it, and a text message asking “can this be fixed under my policy?”

  1. Lambda receives the package. Kicks off Transcribe async on the voicemail; splits the PDF into two page images; resizes the photo.
  2. Transcribe returns after ~8 seconds: “Hi, it’s Sarah, I dropped my MacBook yesterday and the screen is cracked, but it still turns on. Is this covered?” Confidence: 0.95 throughout.
  3. Converse request assembled: system prompt (claims assistant, empathetic, always name the policy clause), session history (empty, new conversation), user message with blocks: text of text message; image of damage photo; image of invoice page 1; image of invoice page 2; text of transcript.
  4. Claude runs (~2.5 seconds). Response: reasoning about accidental damage coverage; tool call lookupPolicy(policyId from session attributes).
  5. Lambda dispatches the tool call, gets the policy, feeds result back.
  6. Claude’s next turn: “Hi Sarah, I can see the cracked screen in your photo and I have your invoice from 2027-03-15. Under clause 4.2 of your policy, accidental damage is covered up to $1,500 with a $100 excess. Based on the damage shown, a screen replacement typically costs $400-600. I’ve flagged this for an adjuster to confirm the quote; you should hear back within 48 hours.”
  7. Session attribute accessibility: true. Lambda routes the response to Polly (neural voice, en-GB), returns MP3 URL + text.
  8. CloudWatch records: session ID, prompt version claims-assistant:v42, input modalities [image, pdf, audio, text], tool calls [lookupPolicy], decision route_to_adjuster, latency 14.3 seconds end to end.

What’s worth remembering

  1. “Multi-modal” is a marketing word; real systems route per modality. Images and documents go direct to a vision LLM; audio goes through Transcribe first; speech out goes through Polly. One prompt, several pre-processing paths.
  2. Claude Sonnet 4.5 and Nova models handle image + text + PDF-as-image in the Converse API. Stack multiple content blocks in the user message; the model reasons across them.
  3. Audio in requires Transcribe. Bedrock foundation models don’t ingest audio directly. Include Transcribe’s confidence metadata in the prompt so the model knows which parts of the transcript are shaky.
  4. Polly handles text-to-speech out; SSML tags improve prosody. Route the LLM’s text through Polly only when the channel needs speech; don’t pay for it otherwise.
  5. Orchestration is a straight-line Lambda, not an agent. Four pre-processing branches, one Converse call, optional TTS. Deterministic flow; debuggable; no agent loop required unless tools come into play.
  6. State in DynamoDB, evidence in S3, audit in CloudWatch. Three stores, three jobs. Conversation continues; raw inputs stay for audit; decisions are traceable.
  7. Robustness is per-modality fallbacks. Low-quality image? Ask for another. Low-confidence transcript? Summarise what was heard and ask for confirmation. Unreadable PDF? Fall back to a text description prompt.
  8. Cost is per-token-for-text + per-minute-for-audio. Images count against token budgets proportional to resolution; Transcribe is per audio minute; Polly is per character of output. Model the bill by modality mix, not by user count.

Four modalities, one Converse request, one customer response, one audit trail. The model isn’t doing all of it, it’s doing the reasoning, with Transcribe handling the ears and Polly handling the voice. The pattern is the same whenever modalities stack: keep the transcoders at the edges, keep the reasoning in one prompt.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.