The situation
A finance operations team receives around 3,000 supplier invoices a day, roughly evenly split between English, French, German, Spanish, Portuguese, and Japanese. They arrive as PDFs and scanned images, some native-digital, some photographed on a warehouse floor at an angle. Today the team keys the totals into the accounts-payable system by hand. Somebody upstairs has heard that AWS can do this and the ask lands on our desk: extract structured fields, surface any free-text notes in English, and flag invoices that look unusual.
Three constraints worth naming before reaching for a service:
- Structured extraction is the money job. The business outcome is a clean row per invoice with supplier, date, currency, line items, subtotal, tax, and total. Everything else is nice-to-have.
- Six languages, mixed quality of scan. Whatever OCR we use has to handle non-Latin scripts (Japanese) and pages that are tilted, faxed, or photographed in bad light.
- The free-text notes matter, but only a little. Vendors sometimes write “delivered early, please pay on net-15” in the notes field. Finance wants that surfaced in English, not a full translation of every word on the page.
What actually matters
Before listing services, it’s worth asking what the pipeline is actually doing.
At the top of the funnel is optical character recognition, turning pixels into characters. A generic OCR will extract words and their bounding boxes, but that’s not what finance wants. Finance wants the total, not every number on the page. A subtly different problem: structured field extraction, sometimes called key-value or form extraction, where the service understands that “Total:” is a label and the number next to it is the value. An invoice-aware extractor returns typed fields (vendor, date, total, line items) without any rules on our end. Plain OCR plus hand-written regexes works, badly, for a month before the first French invoice with a different layout breaks it.
Next is language. OCR is mostly orthogonal to language (the recogniser doesn’t need to understand the text, only recognise the glyphs), but downstream everything does. Once we have text, we have to decide whether to translate it, to summarise it, to extract entities from it, or all three. Translating every word of every invoice is wasteful; the structured fields are numbers and dates that don’t need translating, and the line-item descriptions only need translating if a human is going to read them. The notes field is the narrow slice where translation earns its keep.
Third is what counts as “unusual”. An invoice that’s ten times the supplier’s average is unusual. An invoice with a supplier name that matches nothing in the vendor master is unusual. An invoice written in an angry tone is unusual. Three different detection problems. Statistical outlier detection is a custom-classifier or SQL-over-extracted-fields question; tone is sentiment analysis, which managed text-analysis services do out of the box.
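The first two kinds of "unusual" need nothing more than the extracted fields and a lookup. A minimal sketch, assuming past totals per supplier and the vendor master are already available in memory; the function name, the flag names, and the data structures are illustrative, and the factor of ten simply mirrors the example above.

```python
from statistics import mean


def unusual_flags(vendor, total, history, vendor_master, factor=10.0):
    """Return the reasons an invoice looks unusual (empty list if none).

    history: past invoice totals for this vendor (hypothetical lookup result).
    vendor_master: set of known supplier names (hypothetical).
    """
    flags = []
    if vendor not in vendor_master:
        flags.append("unknown_vendor")
    # Only compare against an average when there is history to average.
    if history and total >= factor * mean(history):
        flags.append("amount_outlier")
    return flags


known = {"Les Primeurs de Provence"}
print(unusual_flags("Les Primeurs de Provence", 4217.50, [4000.0, 4400.0], known))
print(unusual_flags("Les Primeurs de Provence", 50000.0, [4000.0, 6000.0], known))
```

Tone is the one check this sketch leaves out, because it needs a text-analysis call rather than arithmetic.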
Fourth is confidence and human-in-the-loop. No service returns 100% on scanned documents. Every field comes with a confidence score, and the business has to decide the threshold below which a human reviews. That belongs to the pipeline, not to any single service, but it shapes which service we pick: structured extractors return per-field confidence; entity extractors return per-entity confidence; machine-translation generally does not return per-word confidence, only a translated string.
Fifth is cost shape. Per-page pricing, per-100-characters pricing, and per-million-characters pricing all show up in the document-processing area. A 3,000-invoice-per-day volume at two pages per invoice is 6,000 per-page calls; real money. Translating only the notes field rather than the whole invoice can be the difference between dollars and tens of dollars a day on the translation step. The pricing pushes toward doing each job with the service that’s priced for it.
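The arithmetic is easy to keep honest in a few lines. This is a sketch only: the per-million-character rate and both character counts are placeholders chosen for illustration, not quoted AWS prices or measured invoice sizes.

```python
def daily_page_calls(invoices_per_day, pages_per_invoice):
    """Number of per-page extraction calls in a day."""
    return invoices_per_day * pages_per_invoice


def daily_translation_cost(invoices_per_day, chars_per_invoice, price_per_million_chars):
    """Translation spend per day for a given amount of text per invoice."""
    chars = invoices_per_day * chars_per_invoice
    return chars / 1_000_000 * price_per_million_chars


pages = daily_page_calls(3000, 2)  # 6,000 per-page calls, as in the text

RATE = 15.0         # placeholder price per million characters
NOTES_CHARS = 60    # assumed length of a typical notes field
FULL_CHARS = 1200   # assumed text on a whole two-page invoice

notes_only = daily_translation_cost(3000, NOTES_CHARS, RATE)
whole_page = daily_translation_cost(3000, FULL_CHARS, RATE)
print(pages, round(notes_only, 2), round(whole_page, 2))
```

Whatever the real rate turns out to be, the ratio between the two translation figures is just the ratio of characters sent, which is the point of translating the notes field alone.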
What we’ll filter on
Distilling the exploration into filters we can score each candidate service against:
- Structured invoice extraction: does it return typed fields and line-item tables, not just raw text?
- Multilingual input: does it handle six languages including Japanese?
- Per-field confidence: can we set a threshold for human review?
- Text translation: can it translate arbitrary text into a target language?
- Text understanding: can it classify, detect entities, extract key phrases, or score sentiment?
The document-processing landscape
1. Amazon Textract. OCR plus structured extraction. Three API families: DetectDocumentText for plain OCR (words and lines with bounding boxes), AnalyzeDocument for forms, tables, queries, signatures, and layout, and AnalyzeExpense for invoice-specific fields (vendor, date, totals, line items) with confidence scores per field. Handles English, French, German, Italian, Portuguese, Spanish for AnalyzeExpense; broader support for raw OCR. Japanese is supported in DetectDocumentText but not in AnalyzeExpense. Async API for multi-page PDFs, sync for single pages.
2. Amazon Comprehend. Text understanding, not OCR, not translation. Given a string, returns detected language, entities (people, places, organisations, dates, commercial items), key phrases, sentiment (positive/negative/neutral/mixed), and optionally PII detection. Custom classifiers and custom entity recognisers can be trained on domain data for $3 an hour of training plus inference cost. Does nothing with pixels; expects text.
3. Amazon Translate. Neural machine translation. Input a string in one of 75 languages, output in another. Supports auto-detection of source language (internally calls Comprehend’s DetectDominantLanguage). Supports custom terminology (our company’s product names, our suppliers’ proper nouns) uploaded as a CSV to keep “Boîte Verte” translating to “Greenbox” rather than “Green Box”. Real-time and batch APIs; batch reads from and writes to S3.
4. Amazon Rekognition. Image analysis: labels, faces, moderation, text detection. DetectText does exist but is built for scene text (“STOP” on a stop sign), not document OCR, and gives up quickly on paragraphs. Wrong shape for a document pipeline; worth naming only to rule out.
5. Amazon Bedrock with a multimodal model. A general-purpose LLM (Claude, Nova) can read an image and return structured JSON in one call. It will handle Japanese and bad scans. It is priced per token, not per page, and the variance per document is real. The accuracy on structured-invoice extraction is often comparable to Textract AnalyzeExpense, but the failure modes are different: Textract returns an empty field with low confidence; an LLM hallucinates a total. Useful as a fallback for documents Textract rejects, less useful as the primary extractor for a high-volume structured pipeline.
Side by side
| Service | Structured invoice extraction | Multilingual input | Per-field confidence | Text translation | Text understanding |
|---|---|---|---|---|---|
| Textract AnalyzeExpense | ✓ | ✓ (EN/FR/DE/IT/PT/ES) | ✓ | ✗ | ✗ |
| Textract DetectDocumentText | ✗ (raw OCR only) | ✓ (incl. JA) | ✓ (per word) | ✗ | ✗ |
| Comprehend | ✗ | ✓ | ✓ (per entity) | ✗ | ✓ |
| Translate | ✗ | ✓ (75 langs) | ✗ | ✓ | ✗ |
| Rekognition DetectText | ✗ | partial | ✓ (per detection) | ✗ | ✗ |
| Bedrock multimodal | partial (no native schema) | ✓ | ✗ | ✓ (via prompt) | ✓ (via prompt) |
Reading the table by job rather than by service:
- Extraction of vendor, date, totals, line items: Textract AnalyzeExpense for the five Latin-script languages; DetectDocumentText plus a Bedrock multimodal fallback for Japanese, where AnalyzeExpense doesn’t reach.
- Translation of the notes field to English: Translate, called only on the notes text, not the whole page.
- Sentiment on the notes field: Comprehend DetectSentiment, because finance wants “angry vendor” flagged.
- Unusual-supplier detection: Comprehend DetectEntities on the extracted vendor name, matched against the vendor master table; or a Comprehend custom classifier if the rules get complex.
Each service does one job. Chaining them beats asking any one to do the whole pipeline.
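The chaining starts with a routing decision on the detected language. A minimal sketch of that decision, assuming the language code has already come back from language detection; the branch names are hypothetical labels for state-machine branches, and sending any other language to a human is an assumption of this sketch rather than something the team has specified.

```python
# The five Latin-script languages in this workload, as ISO 639-1 codes.
EXPENSE_LANGUAGES = {"en", "fr", "de", "es", "pt"}


def choose_extractor(language_code):
    """Pick the extraction branch for an invoice from its detected language."""
    if language_code in EXPENSE_LANGUAGES:
        return "textract_analyze_expense"
    if language_code == "ja":
        return "ocr_plus_bedrock"
    # Assumption: anything outside the six expected languages goes to a person.
    return "human_review"


for code in ("fr", "ja", "zh"):
    print(code, "->", choose_extractor(code))
```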
The pipeline shape
The pick in depth
Textract AnalyzeExpense as the primary extractor. The API is shaped for this problem: the response contains SummaryFields (vendor, invoice date, subtotal, tax, total, currency) each with a typed Type, a ValueDetection and a LabelDetection, and LineItemGroups with per-row breakdowns. Each field carries a Confidence score between 0 and 100. We set a threshold (80 is a sensible start) and route anything below it to Amazon Augmented AI (A2I), where a human reviewer either corrects the field or confirms the model was right. A2I feeds the corrections back for tracking, and over time the threshold can be tuned against the actual error rate.
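The threshold logic is small enough to sketch. The function below works on a dict shaped like an AnalyzeExpense response and makes no AWS calls; it assumes the confidence that matters is the one on ValueDetection, and the low-confidence TAX value in the sample is invented to show the review path.

```python
REVIEW_THRESHOLD = 80.0  # the starting threshold suggested above


def split_by_confidence(expense_response, threshold=REVIEW_THRESHOLD):
    """Return (accepted, needs_review), each mapping field type to (text, confidence).

    In this sketch a later field of the same type overwrites an earlier one.
    """
    accepted, needs_review = {}, {}
    for doc in expense_response.get("ExpenseDocuments", []):
        for field in doc.get("SummaryFields", []):
            name = field.get("Type", {}).get("Text", "OTHER")
            value = field.get("ValueDetection", {})
            text = value.get("Text", "")
            confidence = value.get("Confidence", 0.0)
            target = accepted if confidence >= threshold else needs_review
            target[name] = (text, confidence)
    return accepted, needs_review


sample = {
    "ExpenseDocuments": [{
        "SummaryFields": [
            {"Type": {"Text": "TOTAL"},
             "ValueDetection": {"Text": "4217.50", "Confidence": 94.6}},
            {"Type": {"Text": "TAX"},
             "ValueDetection": {"Text": "702.92", "Confidence": 61.0}},
        ]
    }]
}
accepted, needs_review = split_by_confidence(sample)
print("accepted:", accepted)
print("needs review:", needs_review)
```

Anything that lands in needs_review is what would be handed to the A2I review step; keeping the threshold as a parameter makes it easy to tune against the measured error rate later.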
For Japanese invoices, AnalyzeExpense returns an error about unsupported language. The Japanese branch uses DetectDocumentText to get the raw text with its bounding boxes, then a Bedrock prompt that says “here is the OCR output of a Japanese invoice, return JSON with these fields”. The Bedrock call costs more per document than AnalyzeExpense, but 500 Japanese invoices a day is a rounding error next to the 2,500 in Latin-script languages.
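The glue around the Bedrock call on the Japanese branch can be sketched without calling either service. The code below assumes a DetectDocumentText-shaped response with LINE blocks; the field list, the exact prompt wording, and the instruction to return null for missing values are illustrative choices, and the model invocation itself is left out.

```python
import json

# Illustrative subset of the target row; line items are omitted for brevity.
FIELDS = ["supplier", "date", "currency", "subtotal", "tax", "total"]


def ocr_lines(detect_response):
    """Collect LINE block text in the order the OCR response lists it."""
    return [block["Text"]
            for block in detect_response.get("Blocks", [])
            if block.get("BlockType") == "LINE" and "Text" in block]


def build_prompt(lines, fields=FIELDS):
    """Build the extraction prompt from OCR lines and the wanted field names."""
    schema = json.dumps({name: None for name in fields})
    return ("Here is the OCR output of a Japanese invoice, one line per row.\n"
            "Return only JSON with these fields, using null for anything "
            "you cannot find: " + schema + "\n\n" + "\n".join(lines))


def parse_model_json(reply, fields=FIELDS):
    """Parse the model reply; return None unless it is a JSON object."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    return {name: data.get(name) for name in fields}


ocr = {"Blocks": [
    {"BlockType": "PAGE"},
    {"BlockType": "LINE", "Text": "請求書"},
    {"BlockType": "LINE", "Text": "合計 ¥110,000"},
]}
print(build_prompt(ocr_lines(ocr)))
print(parse_model_json('{"total": "110000", "currency": "JPY"}'))
```

Rejecting any reply that is not a JSON object is a cheap guard, but it does not catch a hallucinated total, so a human check on this branch is still worth considering.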
Translate on the notes field only. The AnalyzeExpense response doesn’t have a dedicated “notes” field, so the team’s heuristic is: any SummaryField with a label containing “notes” or “memo” or “comment”, or any line item whose description is longer than 80 characters, gets passed to Translate. Source language is auto-detected; target is en. A custom terminology CSV (fewer than a thousand rows) holds the supplier names and product terms that should translate as-is. The translated string is stored alongside the original in the invoice record, so auditors can check either.
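The heuristic reads naturally as a filter over one expense document. A sketch under two assumptions: the function takes a single entry from ExpenseDocuments, and line-item descriptions arrive under the field type ITEM. The function name and constants are illustrative.

```python
NOTE_LABELS = ("notes", "memo", "comment")
MAX_DESCRIPTION_CHARS = 80


def texts_to_translate(expense_document):
    """Return the strings the team's heuristic would send to translation."""
    picked = []
    for field in expense_document.get("SummaryFields", []):
        label = field.get("LabelDetection", {}).get("Text", "").lower()
        value = field.get("ValueDetection", {}).get("Text", "")
        if value and any(word in label for word in NOTE_LABELS):
            picked.append(value)
    for group in expense_document.get("LineItemGroups", []):
        for item in group.get("LineItems", []):
            for field in item.get("LineItemExpenseFields", []):
                # Assumption: the description is the field typed ITEM.
                if field.get("Type", {}).get("Text") != "ITEM":
                    continue
                value = field.get("ValueDetection", {}).get("Text", "")
                if len(value) > MAX_DESCRIPTION_CHARS:
                    picked.append(value)
    return picked


long_description = "Cagettes de tomates anciennes, " * 3  # 93 characters
document = {
    "SummaryFields": [
        {"LabelDetection": {"Text": "Notes:"},
         "ValueDetection": {"Text": "Livraison avant vendredi"}},
        {"LabelDetection": {"Text": "Total"},
         "ValueDetection": {"Text": "4217.50"}},
    ],
    "LineItemGroups": [{"LineItems": [
        {"LineItemExpenseFields": [
            {"Type": {"Text": "ITEM"}, "ValueDetection": {"Text": "Courgettes"}}]},
        {"LineItemExpenseFields": [
            {"Type": {"Text": "ITEM"}, "ValueDetection": {"Text": long_description}}]},
    ]}],
}
print(texts_to_translate(document))
```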
Comprehend for the narrow understanding jobs. Two calls, both on the already-extracted text:
- DetectSentiment on the translated notes. If the dominant sentiment is NEGATIVE with confidence above 0.8, the invoice gets a vendor_sentiment=negative tag and surfaces on the ops dashboard. This catches “this is the third time I’m chasing payment” without reading 3,000 notes fields a day.
- DetectEntities on the translated notes to pick up ORGANIZATION, DATE, and QUANTITY mentions that aren’t already in the structured fields. Matched against the vendor master; any ORGANIZATION entity that isn’t a known supplier becomes a flag.
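Both checks reduce to a few lines once the responses are in hand. The sketch below takes dicts shaped like DetectSentiment and DetectEntities responses and makes no AWS calls; the function names, the case-insensitive vendor match, and the sample organisation are illustrative.

```python
def sentiment_flag(sentiment_response, threshold=0.8):
    """True when the dominant sentiment is NEGATIVE with a score above threshold."""
    label = sentiment_response.get("Sentiment")
    score = sentiment_response.get("SentimentScore", {}).get("Negative", 0.0)
    return label == "NEGATIVE" and score > threshold


def unknown_organizations(entities_response, vendor_master):
    """Return ORGANIZATION mentions that match no known supplier name."""
    known = {name.casefold() for name in vendor_master}
    return [entity.get("Text", "")
            for entity in entities_response.get("Entities", [])
            if entity.get("Type") == "ORGANIZATION"
            and entity.get("Text", "").casefold() not in known]


neutral = {"Sentiment": "NEUTRAL",
           "SentimentScore": {"Neutral": 0.76, "Negative": 0.10}}
angry = {"Sentiment": "NEGATIVE",
         "SentimentScore": {"Neutral": 0.05, "Negative": 0.91}}
entities = {"Entities": [
    {"Type": "DATE", "Text": "Friday", "Score": 0.98},
    {"Type": "ORGANIZATION", "Text": "Transports Durand", "Score": 0.93},  # invented
]}

print(sentiment_flag(neutral), sentiment_flag(angry))
print(unknown_organizations(entities, {"Les Primeurs de Provence"}))
```

Exact string matching against the vendor master is the simplest possible rule; real supplier names usually need some normalisation before the comparison is trustworthy.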
Custom classifiers are on the roadmap for later: a classifier trained to tag invoices as routine, urgent, or disputed based on historical payment outcomes. Not worth the training cost until the extraction pipeline is stable.
A worked invoice trace
A French vendor sends an invoice for €4,217.50. The PDF lands in s3://inbox/2027-10-13/INV-8823.pdf.
- S3 PutObject fires an EventBridge rule. A Step Functions state machine starts.
- The first state downloads the first 200 characters of the PDF’s embedded text (if any) or calls Textract DetectDocumentText on the first page, then calls Comprehend:DetectDominantLanguage. Result: fr, confidence 0.99.
- Because fr is in the Latin-script set, the state machine calls Textract:StartExpenseAnalysis (async, because PDFs are multi-page) with the S3 object reference. Textract writes the result to the job’s output location.
- A polling state waits for the job to complete. The response contains SummaryFields:
  - VENDOR_NAME: “Les Primeurs de Provence” (confidence 98.4)
  - INVOICE_RECEIPT_DATE: “2027-10-11” (99.1)
  - TOTAL: “4217.50” (94.6)
  - SUBTOTAL: “3514.58” (92.3)
  - TAX: “702.92” (91.8)
  - CURRENCY: “EUR” (99.9)
  - NOTES: “Livraison avant vendredi, merci de régler sous 15 jours” (87.2)
- All SummaryFields are above the 80 threshold. No A2I review triggered.
- The NOTES value is passed to Translate:TranslateText, source fr, target en: “Delivery before Friday, please pay within 15 days.”
- The translated notes go to Comprehend:DetectSentiment: NEUTRAL at 0.76. No sentiment flag.
- Comprehend:DetectEntities on the translated notes returns DATE: "Friday", QUANTITY: "15 days". No ORGANIZATION entities; vendor-master lookup happens on the extracted VENDOR_NAME instead, matches a known supplier.
- The record is written to DynamoDB with partition key INV-8823. An InvoiceExtracted event is published to EventBridge; the accounts-payable consumer picks it up and creates the AP entry.
End-to-end latency: roughly 35 seconds, dominated by Textract’s async job. Cost: about $0.05 per invoice (Textract $0.048, Comprehend and Translate together under $0.002).
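The numbers in the trace can be replayed in a few lines. This sketch copies the field values from the trace into a plain dict; the subtotal-plus-tax check is an extra sanity test added here for illustration, not a step the pipeline above describes.

```python
from decimal import Decimal

# Field values and confidences copied from the trace above.
trace = {
    "TOTAL": ("4217.50", 94.6),
    "SUBTOTAL": ("3514.58", 92.3),
    "TAX": ("702.92", 91.8),
    "NOTES": ("Livraison avant vendredi, merci de régler sous 15 jours", 87.2),
}


def totals_consistent(fields, tolerance=Decimal("0.01")):
    """Check that subtotal plus tax equals the total, within a rounding tolerance."""
    subtotal = Decimal(fields["SUBTOTAL"][0])
    tax = Decimal(fields["TAX"][0])
    total = Decimal(fields["TOTAL"][0])
    return abs(subtotal + tax - total) <= tolerance


lowest = min(confidence for _, confidence in trace.values())
print("lowest confidence:", lowest, "| above threshold:", lowest >= 80)
print("subtotal + tax matches total:", totals_consistent(trace))
```

Decimal is used rather than float so that currency amounts compare exactly; a mismatch here would be a cheap extra reason to send an invoice to review.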
What’s worth remembering
- Textract, Comprehend, and Translate each do one job. Textract turns pixels into structured fields; Comprehend understands text; Translate converts between languages. None of them does the other two well.
- AnalyzeExpense is the invoice-specific Textract API. It returns typed SummaryFields and LineItemGroups with per-field confidence. DetectDocumentText is raw OCR; AnalyzeDocument is general forms, tables, queries, and layout. Pick the narrowest one that answers the question.
- Textract AnalyzeExpense language support is narrower than DetectDocumentText. English, French, German, Italian, Portuguese, Spanish for expense; broader, including Japanese, for raw OCR. Design a fallback for the languages expense doesn’t cover.
- Comprehend does sentiment, entities, key phrases, language detection, and PII. Custom classifiers and custom entity recognisers extend it to domain-specific problems. All of it is text-in, JSON-out; Comprehend never reads pixels.
- Translate handles 75 languages with custom terminology. Domain glossaries (supplier names, product names) belong in a terminology CSV uploaded once and referenced per call.
- Confidence scores drive the human-in-the-loop boundary. Textract and Comprehend return per-result confidence; Translate does not. Below a threshold, route to A2I. Tune the threshold against measured error rate, not intuition.
- Translate only the text that benefits from translation. Numbers and dates don’t need translating; the notes field often does. Translating every word of the page is paying for information finance doesn’t read.
- Bedrock multimodal is a fallback, not the default. When structured extraction is the goal and volumes are high, per-page pricing and schema-shaped responses beat per-token pricing and prompt engineering. An LLM earns its place on the documents the purpose-built service can’t handle.
The pipeline isn’t a single service; it’s three services lined up by what they’re good at. Textract owns the pages-to-fields step, Translate owns the language-to-language step on the narrow slice that needs it, and Comprehend owns the text-to-signal step after that. The plumbing that holds them together (Step Functions, EventBridge, DynamoDB, A2I) is where the engineering lives; the service choices are just making each one do what it was shaped to do.