The situation
A global content platform has rolled out a retrieval layer across its customer knowledge base. The corpus is 20 million chunks across four locales: English is 60%, Spanish 20%, Portuguese 10%, Japanese 10%. Queries come in the user’s locale and the retriever needs to find content in any locale: a Spanish-speaking customer asking about a product feature wants results from English documentation if the Spanish version is missing or out of date.
The current index was built with Titan Text Embeddings v1 when the platform launched 18 months ago. Retrieval quality is acceptable in English, mediocre in Spanish and Portuguese, visibly bad in Japanese. Product has three asks: improve retrieval for the non-English locales, future-proof against needing to re-index again in a year, and don’t double the monthly embedding spend.
The embedding spend today is roughly $12,000/month, split between nightly incremental embedding of new content and on-demand embedding of user queries (20 million queries/month). The index itself lives on OpenSearch Serverless, which charges by OCU; the embedding dimension and the vector count drive storage cost.
What actually matters
An embedding model turns text into a fixed-length numeric vector. Two pieces of text with similar meaning should produce vectors that are close (in cosine or Euclidean terms); two with different meanings should produce vectors that are far apart. That’s the whole job. The differences between embedding models are about how well that’s done on different text shapes and how much it costs to run.
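To make "close" concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are invented for illustration; real embeddings have hundreds of dimensions or more, and the variable names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy "embeddings" with made-up values: two texts about refunds, one about shipping.
refund_en = [0.9, 0.1, 0.0]
refund_es = [0.8, 0.2, 0.1]
shipping = [0.0, 0.2, 0.9]

print(cosine_similarity(refund_en, refund_es))  # high: similar meaning
print(cosine_similarity(refund_en, shipping))   # low: different meaning
```

In a multilingual embedding space, the English and Spanish refund texts should land near each other in the way the toy vectors do here.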
The first decision is dimension size. More dimensions mean finer distinctions but more storage and slower ANN search. The headline number on most current models sits around 1024, with smaller variants available either as a parameter or as a separate model. The trade is roughly linear on the storage side: halving the dimensions halves vector storage, at a measurable but small quality cost.
The second is language coverage. Older embedding models were English-dominant; newer multilingual models cover dozens of languages in a single embedding space. For non-English locales, the gap between an English-first model and a true multilingual model is dramatic: not a few benchmark points, but a step change in whether the retriever works at all.
The third is quality on benchmark and domain data. MTEB (Massive Text Embedding Benchmark) is the standard comparison: retrieval precision, classification accuracy, clustering quality, across 50+ datasets. The current top models score within a few points of each other, but on domain-specific data (legal, medical, technical documentation) the ranking can shift: a model that wins on general web text might lose on dense legal prose. The benchmark is a starting point; the corpus is the real test.
The fourth is cost shape. Managed embedding endpoints price per token: predictable, and scaling with usage. Self-hosting on a GPU endpoint trades per-token pricing for instance-hours, which makes sense at very high throughput where the instance is saturated and costs more at lower throughput.
The fifth is re-embedding friction. Switching embedding models means re-embedding every chunk in the index, which at 20M chunks isn’t free even at sub-cent-per-thousand pricing, and it isn’t just the embedding calls. The write throughput into the vector store and the doubled storage during the canary period dominate the real bill.
And a softer one: vendor lock-in at the embedding boundary. Once a corpus is embedded with model X, switching to model Y isn’t a config change, it’s a re-embedding project. This makes embedding-model choice surprisingly sticky. Choosing the newest benchmark-leader without weighing the switching cost is a recipe for re-embedding every 12 months.
What we’ll filter on
- Retrieval quality per locale, how well does the model work in each language we care about?
- Dimension cost, how storage and query latency scale with dimensions.
- Per-token embedding cost, what the nightly and on-demand bills look like.
- Re-embedding friction, what’s the all-in cost to switch?
- Future-proofing, is this model family likely to improve in place, or force a re-embed when it evolves?
The embedding model landscape
- Amazon Titan Text Embeddings v2 (1024 dim). The current default on Bedrock. 1024 dimensions by default, configurable down to 512 or 256 via the dimensions parameter. 100+ languages. MTEB scores competitive with Cohere Embed v3 on general text. Per-token pricing ~$0.02/M. The dimension flexibility means we can re-index at 512 dims later without switching models. Strong default for AWS-native stacks.
- Amazon Titan Text Embeddings v1 (legacy). Older model, English-dominant, 1536 dimensions. Being phased out; not recommended for new work. The current situation’s baseline.
- Cohere Embed Multilingual v3. Cohere’s flagship multilingual embedding model, available on Bedrock. 1024 dimensions. Top-tier MTEB scores, particularly strong on multilingual retrieval. Per-token pricing ~$0.10/M, roughly 5× Titan v2 at current rates. Comparable retrieval quality in English; often slightly better on non-English locales in practice.
- Cohere Embed English v3. Same family, English-only, slightly better than multilingual on English benchmarks. Same per-token cost as the multilingual variant. Useful when the corpus is monolingual.
- Self-hosted on SageMaker (BGE-M3 or similar). Open-source multilingual embedding models (BGE-M3, E5-multilingual, MPNet) hosted on a SageMaker endpoint. 1024 dims for BGE-M3. Competitive MTEB scores. Cost: a g5.xlarge endpoint runs ~$1.20/hour = ~$860/month, which amortises well at high throughput. No per-token charge. Requires endpoint ops.
- Titan Text Embeddings v2 (256 dim). Same model as the first entry, with lower-dimensional output. 75% storage reduction, faster queries, slight quality drop (a few MTEB points). Useful for very large indices where storage dominates cost.
- Hybrid: locale-specific models. Cohere English for English chunks, a Japanese-tuned model for Japanese chunks, etc. Maximum quality per locale; impossible to do cross-locale retrieval cleanly (vectors from different models are not comparable). Not the correct shape for a cross-locale retrieval system.
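As a rough sketch of what the Titan v2 dimensions parameter looks like in practice, the snippet below builds the request body and calls the model through a Bedrock runtime client. The helper names (`build_titan_v2_body`, `embed`) are hypothetical, and the caller is assumed to supply a boto3 `bedrock-runtime` client with credentials and model access already configured.

```python
import json

TITAN_V2_MODEL_ID = "amazon.titan-embed-text-v2:0"
ALLOWED_DIMENSIONS = (256, 512, 1024)


def build_titan_v2_body(text, dimensions=1024, normalize=True):
    """Serialise a Titan Text Embeddings v2 request with the chosen output size."""
    if dimensions not in ALLOWED_DIMENSIONS:
        raise ValueError(f"dimensions must be one of {ALLOWED_DIMENSIONS}")
    return json.dumps(
        {"inputText": text, "dimensions": dimensions, "normalize": normalize}
    )


def embed(client, text, dimensions=1024):
    """Embed one string. `client` is a boto3 'bedrock-runtime' client (assumed)."""
    response = client.invoke_model(
        modelId=TITAN_V2_MODEL_ID,
        body=build_titan_v2_body(text, dimensions),
        contentType="application/json",
        accept="application/json",
    )
    payload = json.loads(response["body"].read())
    return payload["embedding"]
```

With boto3 installed, usage would look something like `embed(boto3.client("bedrock-runtime"), "¿Cómo restablezco mi contraseña?", dimensions=512)`. Every chunk and every query must use the same dimension setting, since vectors of different lengths cannot be compared.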
Side by side
| Model | Quality EN | Quality ES/PT | Quality JA | Dimensions | Per-token cost | Re-embed friction |
|---|---|---|---|---|---|---|
| Titan v2 (1024) | High | High | High | 1024 (or 512/256) | ~$0.02/M | Medium |
| Titan v1 | Medium | Low-medium | Low | 1536 | ~$0.01/M (legacy) | N/A (baseline) |
| Cohere Multilingual v3 | High | Very high | Very high | 1024 | ~$0.10/M | Medium |
| Cohere English v3 | Very high | N/A | N/A | 1024 | ~$0.10/M | Would drop multi-lang |
| Self-hosted BGE-M3 | High | Very high | High | 1024 | Endpoint-hours | High (ops) |
| Titan v2 (256) | Medium-high | Medium-high | Medium-high | 256 | ~$0.02/M | Low (same family) |
For 20M chunks across four locales, two realistic finalists emerge: Titan v2 at 1024 dim for AWS-native simplicity and cost, or Cohere Embed Multilingual v3 for slightly better non-English retrieval at 5× the per-token cost. Self-hosted wins on pure cost at this scale but asks for endpoint ops.
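One way to sanity-check the self-hosted option is to compute the monthly token volume at which a flat endpoint bill equals a per-token bill. The sketch below uses only the prices quoted above (~$860/month for the endpoint, ~$0.02/M and ~$0.10/M per token); it assumes a single always-on endpoint that can actually sustain the throughput and ignores ops time entirely. The function name is hypothetical.

```python
def breakeven_tokens_per_month(endpoint_cost_per_month, price_per_million_tokens):
    """Monthly token volume at which a flat endpoint costs the same as per-token pricing."""
    if price_per_million_tokens <= 0:
        raise ValueError("per-token price must be positive")
    return endpoint_cost_per_month / price_per_million_tokens * 1_000_000


# Prices are the approximate figures quoted in the article.
vs_titan_v2 = breakeven_tokens_per_month(860, 0.02)
vs_cohere_v3 = breakeven_tokens_per_month(860, 0.10)

print(f"vs Titan v2:  {vs_titan_v2 / 1e9:.1f}B tokens/month")
print(f"vs Cohere v3: {vs_cohere_v3 / 1e9:.1f}B tokens/month")
```

Below the break-even volume the managed endpoint is cheaper; above it, the self-hosted instance starts to pay for itself, provided it stays saturated.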
The pick in depth
Titan v2 at 1024 dimensions is the correct default for this situation. It’s within 2-3 retrieval points of Cohere v3 on English, Spanish, and Portuguese; the gap widens to 6-8 points on Japanese, which is the locale under most pressure today. But Cohere at 5× the cost means $36k/year of extra spend to close that gap, which is expensive for 10% of the corpus.
The compromise worth considering: Cohere Multilingual v3 for Japanese chunks only, Titan v2 for everything else. It sounds appealing; it doesn’t work. Cross-locale retrieval needs vectors in the same space; mixing models means a Spanish query can’t find Japanese results (the vectors are incomparable). The workaround (two indices, two queries, merged results) adds plumbing and doesn’t actually help cross-locale discovery. Pick one model for the whole corpus.
Dimension choice. Start at 1024. 20M vectors × 1024 dims × 4 bytes comes to roughly 82 GB of raw vector storage in OpenSearch Serverless. Dropping to 512 halves that storage and eases OCU scaling, at a measurable (but not huge) quality cost, typically 1-2 MTEB points. For non-critical applications, 512 is a fine default; for retrieval where every point matters, stick at 1024 and budget for the storage.
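The storage arithmetic is easy to reproduce for each dimension setting. This sketch counts raw float32 vector bytes only; it does not model HNSW graph overhead, replicas, or how OpenSearch Serverless maps storage to OCUs, so treat the results as a lower bound. The function name is hypothetical.

```python
def raw_vector_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Raw storage for dense vectors (float32 by default), excluding index overhead."""
    return num_vectors * dimensions * bytes_per_value


NUM_CHUNKS = 20_000_000  # corpus size from the scenario

for dims in (1024, 512, 256):
    gigabytes = raw_vector_bytes(NUM_CHUNKS, dims) / 1e9
    print(f"{dims:>4} dims: {gigabytes:.1f} GB")
# 1024 dims: 81.9 GB
#  512 dims: 41.0 GB
#  256 dims: 20.5 GB
```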
Migration plan. The switch from Titan v1 to Titan v2 is a re-embedding project. Approach: create a new OpenSearch collection for the Titan v2 index, run a batch embedding job (AWS Batch or Step Functions Map over the 20M chunks), write new vectors, dual-read during a 2-week canary (query both, compare retrieval quality on a held-out evaluation set), cut over when the new index demonstrates parity or improvement. Budget: ~$120 for the embedding calls, plus OpenSearch write throughput scaling during the ingestion phase, plus ~$2000 of double-storage during the canary. Total: under $3000, over a 3-4 week project.
Query-time embedding. 20M queries/month × average 30 tokens × $0.02/M = $12/month for query embedding. Basically free compared to the rest of the stack. No optimisation needed.
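The per-token bills above come down to one multiplication. The sketch below reproduces the $12/month query figure and the ~$120 one-off re-embedding figure; the 300-tokens-per-chunk average is an assumption back-calculated from that ~$120 budget, not a number stated in the scenario.

```python
def embedding_cost_usd(num_items, avg_tokens_per_item, price_per_million_tokens):
    """Per-token embedding bill for a batch of texts."""
    total_tokens = num_items * avg_tokens_per_item
    return total_tokens / 1_000_000 * price_per_million_tokens


TITAN_V2_PRICE = 0.02  # USD per million tokens, approximate figure from the article
COHERE_V3_PRICE = 0.10  # USD per million tokens, approximate figure from the article

# Monthly query embedding: 20M queries at ~30 tokens each.
monthly_queries = embedding_cost_usd(20_000_000, 30, TITAN_V2_PRICE)

# One-off re-embedding: 20M chunks at an ASSUMED ~300 tokens per chunk.
reembed_titan = embedding_cost_usd(20_000_000, 300, TITAN_V2_PRICE)
reembed_cohere = embedding_cost_usd(20_000_000, 300, COHERE_V3_PRICE)

print(f"queries/month: ${monthly_queries:,.0f}")
print(f"re-embed, Titan v2: ${reembed_titan:,.0f}")
print(f"re-embed, Cohere v3: ${reembed_cohere:,.0f}")
```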
Future-proofing. Titan is AWS’s own model family; v2 will receive point-release improvements over time that are typically backward-compatible (same dimensions, same output space). When Titan v3 launches, it’ll likely require re-embedding, but the AWS-native ergonomics of going from v2 to v3 will be similar to the v1-to-v2 migration: a batch job, a canary, a cutover.
A worked example: the migration week by week
Week 1. Create new OpenSearch Serverless collection kb-v2. Deploy the Titan v2 embedding pipeline (Step Functions Map over chunks from S3). Start the batch job; it completes in ~18 hours at a few hundred dollars.
Week 2. Dual-write enabled: new content is embedded into both v1 and v2. The retrieval service is updated to query both, with metrics comparing retrieval quality on a held-out 500-query evaluation set. Results come in by Friday: Recall@10 up 4 points on English, 6 points on Spanish, 8 points on Portuguese, 12 points on Japanese. Latency 20% lower (1024-dim HNSW vs 1536-dim).
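The Week 2 comparison rests on Recall@10 over the evaluation set. A minimal sketch of that metric follows, assuming each evaluation query comes with a set of chunk IDs judged relevant; the data structures, IDs, and function names are hypothetical.

```python
def recall_at_k(retrieved_ids, relevant_ids, k=10):
    """Fraction of the relevant items that appear in the top-k retrieved list."""
    if not relevant_ids:
        raise ValueError("each query needs at least one relevant item")
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


def mean_recall_at_k(results, judgments, k=10):
    """Average recall@k over all judged queries; a missing query counts as zero."""
    scores = [
        recall_at_k(results.get(query_id, []), relevant, k)
        for query_id, relevant in judgments.items()
    ]
    return sum(scores) / len(scores)


# Toy evaluation set: two queries with hand-labelled relevant chunk IDs.
judgments = {"q1": {"a", "b"}, "q2": {"c"}}
v1_results = {"q1": ["a", "x", "y"], "q2": ["z", "w"]}
v2_results = {"q1": ["a", "b", "x"], "q2": ["c"]}

print(mean_recall_at_k(v1_results, judgments))  # 0.25
print(mean_recall_at_k(v2_results, judgments))  # 1.0
```

Computing the mean separately per locale, rather than once over the whole set, is what surfaces per-language differences like the ones reported above.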
Week 3. Canary to 10% traffic. User-facing engagement metrics (click-through, session satisfaction) hold steady or improve on the canary cohort. No regressions.
Week 4. Ramp to 100%. v1 index kept warm for another week as rollback safety; deleted at end of week 5.
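The 10% canary and the later ramp need a stable way to decide which users hit the new index. One common approach, sketched here as an assumption rather than the platform's actual mechanism, is to hash the user ID into one of 100 buckets so the same user always lands on the same side. The collection name kb-v1 is hypothetical; only kb-v2 is named above.

```python
import hashlib


def in_canary(user_id, percent):
    """Deterministically place a user in the canary cohort for a given rollout percent."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return bucket < percent


def choose_index(user_id, percent):
    """Route to the new collection for canary users, the old one otherwise."""
    return "kb-v2" if in_canary(user_id, percent) else "kb-v1"


# Week 3 uses percent=10; Week 4 raises it to 100 without reshuffling earlier users.
print(choose_index("user-12345", 10))
```

Because the bucket depends only on the user ID, raising the percentage only ever adds users to the canary cohort, which keeps engagement comparisons between cohorts clean.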
Five weeks, under $3,000 in migration costs, quality improved across all locales, future-proof to Titan updates, spend unchanged month-over-month.
What’s worth remembering
- The embedding model caps what the retriever can do. A better retriever can’t make up for vectors that are wrong about meaning.
- Multilingual is solved by multilingual models. Titan v2 and Cohere Multilingual v3 both handle 100+ languages in one embedding space. The old “English-only” defaults are legacy.
- Dimension size trades quality for cost. 1024 vs 256 is roughly 4× the storage for a few MTEB points; whether that trade is worth it is workload-dependent.
- Cohere v3 wins on Japanese and other non-Western languages by meaningful margins. Whether that gap is worth 5× the per-token cost is a product question.
- Switching embedding models is a re-embedding project. Budget for double-storage during canary, one-time batch cost, and operational overhead. Don’t switch for 2-point MTEB improvements.
- Self-hosted BGE-M3 is attractive at scale. Amortises endpoint cost across high throughput; adds ops surface. Correct for teams with ML-ops maturity.
- One model for the whole corpus. Mixing embedding models breaks cross-corpus retrieval. If the corpus is mixed-language, pick a multilingual model.
- Titan v2 is the AWS-native default. Good quality, low cost, dimension-flexible, future-proof within the family. Upgrade from v1 whenever retrieval quality matters.
20M chunks, four locales, one embedding model, a migration that improves every locale, a budget that doesn’t move, and a retriever that can keep up as the content grows. The interesting work was picking the model; the rest is batch jobs and a canary.