How to Build a Multilingual Invoice Pipeline With Textract, Translate, and Comprehend

April 26, 2028 · 16 min read

ML Engineer · MLA-C01 · part of The Exam Room

The situation

A finance operations team receives around 3,000 supplier invoices a day, roughly evenly split between English, French, German, Spanish, Portuguese, and Japanese. They arrive as PDFs and scanned images, some native-digital, some photographed on a warehouse floor at an angle. Today the team keys the totals into the accounts-payable system by hand. Somebody upstairs has heard that AWS can do this and the ask lands on our desk: extract structured fields, surface any free-text notes in English, and flag invoices that look unusual.

Three constraints worth naming before reaching for a service:

  • Structured extraction is the money job. The business outcome is a clean row per invoice with supplier, date, currency, line items, subtotal, tax, and total. Everything else is nice-to-have.
  • Six languages, mixed quality of scan. Whatever OCR we use has to handle non-Latin scripts (Japanese) and pages that are tilted, faxed, or photographed in bad light.
  • The free-text notes matter, but only a little. Vendors sometimes write “delivered early, please pay on net-15” in the notes field. Finance wants that surfaced in English, not a full translation of every word on the page.

What actually matters

Before listing services, it’s worth asking what the pipeline is actually doing.

At the top of the funnel is optical character recognition, turning pixels into characters. A generic OCR will extract words and their bounding boxes, but that’s not what finance wants. Finance wants the total, not every number on the page. A subtly different problem: structured field extraction, sometimes called key-value or form extraction, where the service understands that “Total:” is a label and the number next to it is the value. An invoice-aware extractor returns typed fields (vendor, date, total, line items) without any rules on our end. Plain OCR plus hand-written regexes works, badly, for a month before the first French invoice with a different layout breaks it.

Next is language. OCR is mostly orthogonal to language (the recogniser doesn’t need to understand the text, only recognise the glyphs), but downstream everything does. Once we have text, we have to decide whether to translate it, to summarise it, to extract entities from it, or all three. Translating every word of every invoice is wasteful; the structured fields are numbers and dates that don’t need translating, and the line-item descriptions only need translating if a human is going to read them. The notes field is the narrow slice where translation earns its keep.

Third is what counts as “unusual”. An invoice that’s ten times the supplier’s average is unusual. An invoice with a supplier name that matches nothing in the vendor master is unusual. An invoice written in an angry tone is unusual. Three different detection problems. Statistical outlier detection is a custom-classifier or SQL-over-extracted-fields question; tone is sentiment analysis, which managed text-analysis services do out of the box.

Fourth is confidence and human-in-the-loop. No service returns 100% on scanned documents. Every field comes with a confidence score, and the business has to decide the threshold below which a human reviews. That belongs to the pipeline, not to any single service, but it shapes which service we pick: structured extractors return per-field confidence; entity extractors return per-entity confidence; machine-translation generally does not return per-word confidence, only a translated string.

Fifth is cost shape. Per-page pricing, per-100-characters pricing, and per-million-characters pricing all show up in the document-processing area. A 3,000-invoice-per-day volume at two pages per invoice is 6,000 per-page calls; real money. Translating only the notes field rather than the whole invoice can be the difference between dollars and tens of dollars a day on the translation step. The pricing pushes toward doing each job with the service that’s priced for it.

What we’ll filter on

Distilling the exploration into filters we can score each candidate service against:

  1. Structured invoice extraction: does it return typed fields and line-item tables, not just raw text?
  2. Multilingual input: does it handle six languages including Japanese?
  3. Per-field confidence: can we set a threshold for human review?
  4. Text translation: can it translate arbitrary text into a target language?
  5. Text understanding: can it classify, detect entities, extract key phrases, or score sentiment?

The document-processing landscape

1. Amazon Textract. OCR plus structured extraction. Three API families: DetectDocumentText for plain OCR (words and lines with bounding boxes), AnalyzeDocument for forms, tables, queries, signatures, and layout, and AnalyzeExpense for invoice-specific fields (vendor, date, totals, line items) with confidence scores per field. Handles English, French, German, Italian, Portuguese, Spanish for AnalyzeExpense; broader support for raw OCR. Japanese is supported in DetectDocumentText but not in AnalyzeExpense. Async API for multi-page PDFs, sync for single pages.

2. Amazon Comprehend. Text understanding, not OCR, not translation. Given a string, returns detected language, entities (people, places, organisations, dates, commercial items), key phrases, sentiment (positive/negative/neutral/mixed), and optionally PII detection. Custom classifiers and custom entity recognisers can be trained on domain data for $3 an hour of training plus inference cost. Does nothing with pixels; expects text.

3. Amazon Translate. Neural machine translation. Input a string in one of 75 languages, output in another. Supports auto-detection of source language (internally calls Comprehend’s DetectDominantLanguage). Supports custom terminology (our company’s product names, our suppliers’ proper nouns) uploaded as a CSV to keep “Boîte Verte” translating to “Greenbox” rather than “Green Box”. Real-time and batch APIs; batch reads from and writes to S3.

4. Amazon Rekognition. Image analysis: labels, faces, moderation, text detection. DetectText does exist but is built for scene text (“STOP” on a stop sign), not document OCR, and gives up quickly on paragraphs. Wrong shape for a document pipeline; worth naming only to rule out.

5. Amazon Bedrock with a multimodal ModelA trained set of weights plus the architecture that makes them useful – the thing you load up and run inference against. . A general-purpose LLMA neural network trained to predict the next token in a sequence, large enough that it generalises to tasks it wasn’t explicitly trained for. (Claude, Nova) can read an image and return structured JSON in one call. It will handle Japanese and bad scans. It is priced per TokenThe unit of text an LLM actually sees – usually a short character sequence, not a whole word. , not per page, and the variance per document is real. The accuracy on structured-invoice extraction is often comparable to Textract AnalyzeExpense, but the failure modes are different: Textract returns an empty field with low confidence; an LLM hallucinates a total. Useful as a fallback for documents Textract rejects, less useful as the primary extractor for a high-volume structured pipeline.

Side by side

Service Structured invoice extraction Multilingual input Per-field confidence Text translation Text understanding
Textract AnalyzeExpense ✓ (EN/FR/DE/IT/PT/ES)
Textract DetectDocumentText ✗ (raw OCR only) ✓ (incl. JA) ✓ (per word)
Comprehend ✓ (per entity)
Translate ✓ (75 langs)
Rekognition DetectText partial ✓ (per detection)
Bedrock multimodal partial (no native schema) ✓ (via PromptThe input you hand to an LLM – system instructions, user message, examples, retrieved documents, tool descriptions, the lot.  
</span>) ✓ (via prompt)        

Reading the table by job rather than by service:

  • Extraction of vendor, date, totals, line items: Textract AnalyzeExpense for the five Latin-script languages; DetectDocumentText plus a Bedrock multimodal fallback for Japanese, where AnalyzeExpense doesn’t reach.
  • Translation of the notes field to English: Translate, called only on the notes text, not the whole page.
  • Sentiment on the notes field: Comprehend DetectSentiment, because finance wants “angry vendor” flagged.
  • Unusual-supplier detection: Comprehend DetectEntities on the extracted vendor name, matched against the vendor master table; or a Comprehend custom classifier if the rules get complex.

Each service does one job. Chaining them beats asking any one to do the whole pipeline.

The pipeline shape

Ingest S3 landing scanned PDFs, 6 langs ~3,000/day EventBridge on PutObject Language router Comprehend DetectDominantLanguage Extract Textract AnalyzeExpense EN/FR/DE/IT/PT/ES vendor, date, totals, items per-field confidence Textract DetectDocumentText Japanese branch raw lines + bounding boxes → Bedrock for fields Bedrock multimodal fallback: low-confidence pages structured JSON via prompt priced per token Confidence gate fields < 0.8 → A2I review fields ≥ 0.8 → auto-accept document average → KPI Enrich (notes field only) Translate notes → en custom terminology Greenbox vendor glossary Comprehend DetectSentiment translated notes NEGATIVE → flag per-sentence confidence Comprehend DetectEntities ORGANIZATION, DATE, QUANTITY vendor vs master list mismatch → review queue Sink DynamoDB invoice record PK = invoice_id EventBridge InvoiceExtracted fan-out to AP A2I workflow human review low-confidence fields
Ingest routes by language; extract does the expensive per-page work; enrich runs only on the small notes field; sink writes the record and fans out.

The pick in depth

Textract AnalyzeExpense as the primary extractor. The API is shaped for this problem: the response contains SummaryFields (vendor, invoice date, subtotal, tax, total, currency) each with a typed Type, a ValueDetection and a LabelDetection, and LineItemGroups with per-row breakdowns. Each field carries a Confidence score between 0 and 100. We set a threshold (80 is a sensible start) and route anything below it to Amazon Augmented AI (A2I), where a human reviewer either corrects the field or confirms the model was right. A2I feeds the corrections back for tracking, and over time the threshold can be tuned against the actual error rate.

For Japanese invoices, AnalyzeExpense returns an error about unsupported language. The Japanese branch uses DetectDocumentText to get the raw text with its bounding boxes, then a Bedrock prompt that says “here is the OCR output of a Japanese invoice, return JSON with these fields”. The Bedrock call costs more per document than AnalyzeExpense, but 500 Japanese invoices a day is a rounding error next to the 2,500 in Latin-script languages.

Translate on the notes field only. The AnalyzeExpense response doesn’t have a dedicated “notes” field, so the team’s heuristic is: any SummaryField with a label containing “notes” or “memo” or “comment”, or any line item whose description is longer than 80 characters, gets passed to Translate. Source language is auto-detected; target is en. A custom terminology CSV (fewer than a thousand rows) holds the supplier names and product terms that should translate as-is. The translated string is stored alongside the original in the invoice record, so auditors can check either.

Comprehend for the narrow understanding jobs. Two calls, both on the already-extracted text:

  • DetectSentiment on the translated notes. If the dominant sentiment is NEGATIVE with confidence above 0.8, the invoice gets a vendor_sentiment=negative tag and surfaces on the ops dashboard. This catches “this is the third time I’m chasing payment” without reading 3,000 notes fields a day.
  • DetectEntities on the translated notes to pick up ORGANIZATION, DATE, and QUANTITY mentions that aren’t already in the structured fields. Matched against the vendor master; any ORGANIZATION entity that isn’t a known supplier becomes a flag.

Custom classifiers are on the roadmap for later: a classifier trained to tag invoices as routine, urgent, or disputed based on historical payment outcomes. Not worth the training cost until the extraction pipeline is stable.

A worked invoice trace

A French vendor sends an invoice for €4,217.50. The PDF lands in s3://inbox/2027-10-13/INV-8823.pdf.

  1. S3 PutObject fires an EventBridge rule. A Step Functions state machine starts.
  2. The first state downloads the first 200 characters of the PDF’s embedded text (if any) or calls Textract DetectDocumentText on the first page, then calls Comprehend:DetectDominantLanguage. Result: fr, confidence 0.99.
  3. Because fr is in the Latin-script set, the state machine calls Textract:StartExpenseAnalysis (async, because PDFs are multi-page) with the S3 object reference. Textract writes the result to the job’s output location.
  4. A polling state waits for the job to complete. The response contains SummaryFields:
    • VENDOR_NAME: “Les Primeurs de Provence” (confidence 98.4)
    • INVOICE_RECEIPT_DATE: “2027-10-11” (99.1)
    • TOTAL: “4217.50” (94.6)
    • SUBTOTAL: “3514.58” (92.3)
    • TAX: “702.92” (91.8)
    • CURRENCY: “EUR” (99.9)
    • NOTES: “Livraison avant vendredi, merci de régler sous 15 jours” (87.2)
  5. All SummaryFields are above the 80 threshold. No A2I review triggered.
  6. The NOTES value is passed to Translate:TranslateText, source fr, target en: “Delivery before Friday, please pay within 15 days.”
  7. The translated notes go to Comprehend:DetectSentiment: NEUTRAL at 0.76. No sentiment flag.
  8. Comprehend:DetectEntities on the translated notes returns DATE: "Friday", QUANTITY: "15 days". No ORGANIZATION entities; vendor-master lookup happens on the extracted VENDOR_NAME instead, matches a known supplier.
  9. The record is written to DynamoDB with partition key INV-8823. An InvoiceExtracted event is published to EventBridge; the accounts-payable consumer picks it up and creates the AP entry.

End-to-end latency: roughly 35 seconds, dominated by Textract’s async job. Cost: about $0.05 per invoice (Textract $0.048, Comprehend and Translate together under $0.002).

What’s worth remembering

  1. Textract, Comprehend, and Translate each do one job. Textract turns pixels into structured fields; Comprehend understands text; Translate converts between languages. None of them does the other two well.
  2. AnalyzeExpense is the invoice-specific Textract API. It returns typed SummaryFields and LineItemGroups with per-field confidence. DetectDocumentText is raw OCR; AnalyzeDocument is general forms, tables, queries, and layout. Pick the narrowest one that answers the question.
  3. Textract AnalyzeExpense language support is narrower than DetectDocumentText. English, French, German, Italian, Portuguese, Spanish for expense; broader, including Japanese, for raw OCR. Design a fallback for the languages expense doesn’t cover.
  4. Comprehend does sentiment, entities, key phrases, language detection, and PII. Custom classifiers and custom entity recognisers extend it to domain-specific problems. All of it is text-in, JSON-out; Comprehend never reads pixels.
  5. Translate handles 75 languages with custom terminology. Domain glossaries (supplier names, product names) belong in a terminology CSV uploaded once and referenced per call.
  6. Confidence scores drive the human-in-the-loop boundary. Textract and Comprehend return per-result confidence; Translate does not. Below a threshold, route to A2I. Tune the threshold against measured error rate, not intuition.
  7. Translate only the text that benefits from translation. Numbers and dates don’t need translating; the notes field often does. Translating every word of the page is paying for information finance doesn’t read.
  8. Bedrock multimodal is a fallback, not the default. When structured extraction is the goal and volumes are high, per-page pricing and schema-shaped responses beat per-token pricing and prompt engineering. An LLM earns its place on the documents the purpose-built service can’t handle.

The pipeline isn’t a single service; it’s three services lined up by what they’re good at. Textract owns the pages-to-fields step, Translate owns the language-to-language step on the narrow slice that needs it, and Comprehend owns the text-to-signal step after that. The plumbing that holds them together (Step Functions, EventBridge, DynamoDB, A2I) is where the engineering lives; the service choices are just making each one do what it was shaped to do.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.