The situation
The retrieval assistant from earlier in the year outgrew its starter index. What began as 20,000 chunks across three sources is now 12 million vectors spanning product documentation, customer knowledge base articles, internal runbooks, historical support tickets, and a growing archive of community forum posts. Every document carries metadata (source, product line, language, published date, access level), and the queries that matter mix semantic similarity with hard filters: “articles about billing, in English, not marked internal, embedded in the last 90 days.”
The retrieval service has a 50ms budget at p99. The Bedrock generation step dominates cost at roughly $0.003 per query; the vector store must not push that number over $0.006 at peak. Peak is 200 queries per second during US business hours and 20 queries per second overnight. The index grows at roughly 500k new vectors per month, and existing chunks get re-embedded when documents change.
The quick-create OpenSearch Serverless collection the Knowledge Base spun up on day one has been the default. Finance is now looking at the bill.
What actually matters
A vector store does three things: stores high-dimensional vectors next to their source text and metadata, runs approximate-nearest-neighbour (ANN) search against them quickly, and supports metadata filters alongside the vector search. That’s the job. The differentiation is in how well each of those three is done, and what it costs.
The first decision worth naming is the index algorithm. HNSW (hierarchical navigable small world) is the de-facto standard for ANN: fast, accurate, memory-hungry. IVF (inverted file) is an alternative: slower to query at a given recall, but cheaper at scale because it holds less in memory. Most managed stores use HNSW; some let us tune the parameters (M, ef_construction, ef_search) to trade recall for speed and memory.
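To make "memory-hungry" concrete, here is a minimal sketch of the rule-of-thumb footprint estimate published in the OpenSearch k-NN documentation, 1.1 × (4 × dimension + 8 × M) bytes per vector. The 1024-dimension figure comes from the embedding model used later in this article; the M values are assumptions chosen for illustration, and the result is an estimate, not a measurement.

```python
def hnsw_memory_bytes(num_vectors: int, dimension: int, m: int = 16) -> float:
    """Rule-of-thumb HNSW footprint: 1.1 * (4 * d + 8 * M) bytes per vector.

    4 * d covers the float32 vector, 8 * M the graph links, and the 1.1
    factor is headroom. Treat the result as an estimate only.
    """
    return 1.1 * (4 * dimension + 8 * m) * num_vectors


if __name__ == "__main__":
    # M values below are illustrative assumptions, not recommendations.
    for m in (8, 16, 32):
        gb = hnsw_memory_bytes(12_000_000, 1024, m) / 1e9
        print(f"M={m:>2}: ~{gb:.1f} GB")
```

At this dimension the vectors themselves dominate; raising M adds graph links on top, so it moves the total by a few percent rather than doubling it.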
The second is query shape. Pure vector ANN is the baseline. Hybrid search, combining keyword (BM25) and vector scores, handles queries where the exact match matters (product codes, version numbers, error strings). Metadata filters are the other axis; they can be applied before the vector search (pre-filter, sometimes slower but more accurate) or after (post-filter, faster but can return empty when the filter is restrictive). Different stores default to different strategies; some let us choose.
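The difference between the two filter strategies is easy to see on a toy corpus. The sketch below uses exact brute-force search as a stand-in for ANN, with made-up documents and a hypothetical `lang` field; it only illustrates why post-filtering can come back short when the filter is selective.

```python
import math
import random


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def top_k(query, docs, k):
    """Exact nearest neighbours by cosine distance (stand-in for ANN)."""
    return sorted(docs, key=lambda d: cosine_distance(query, d["vec"]))[:k]


def pre_filter_search(query, docs, predicate, k):
    # Filter first, then rank only the survivors.
    return top_k(query, [d for d in docs if predicate(d)], k)


def post_filter_search(query, docs, predicate, k):
    # Rank everything, keep the top k, then drop rows that fail the filter.
    return [d for d in top_k(query, docs, k) if predicate(d)]


def is_english(doc):
    return doc["lang"] == "en"


# Synthetic corpus: 1 in 20 documents is English (a selective filter).
random.seed(0)
docs = [
    {"id": i, "lang": "en" if i % 20 == 0 else "de",
     "vec": [random.gauss(0, 1) for _ in range(8)]}
    for i in range(1000)
]
query = [random.gauss(0, 1) for _ in range(8)]

pre = pre_filter_search(query, docs, is_english, k=10)
post = post_filter_search(query, docs, is_english, k=10)
print(f"pre-filter returned {len(pre)} rows, post-filter returned {len(post)}")
```

Pre-filtering always fills the top 10 here because 50 documents pass the filter. Post-filtering returns only the English rows that happened to land in the unfiltered top 10, which with a 5% pass rate is usually fewer than 10 and may be none.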
The third is pricing shape. Dedicated vector services usually price by compute units, with a minimum floor and usage-based scaling. Adding vectors to a relational database prices by instance hours plus storage: predictable, and it scales with the database. Pure-managed third-party stores price per operation, with reads, writes, and storage all metered. The curves cross at different corpus sizes; the cheapest option at 100k vectors is often not the cheapest at 10M.
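A toy model shows how the curves can cross. In this sketch the $175 per unit-month follows from the $350 floor for 2 OCUs quoted elsewhere in this article; the metered rates and the two workload sizes are placeholders invented for illustration, not any vendor's price list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FloorPriced:
    """Capacity-unit pricing: pay for max(minimum, needed) units."""
    min_units: int
    dollars_per_unit_month: float

    def monthly_cost(self, units_needed: int) -> float:
        return max(self.min_units, units_needed) * self.dollars_per_unit_month


@dataclass(frozen=True)
class Metered:
    """Per-operation pricing: pay for queries and stored gigabytes."""
    dollars_per_million_queries: float
    dollars_per_gb_month: float

    def monthly_cost(self, queries: int, storage_gb: float) -> float:
        query_cost = (queries / 1_000_000) * self.dollars_per_million_queries
        return query_cost + storage_gb * self.dollars_per_gb_month


floor_store = FloorPriced(min_units=2, dollars_per_unit_month=175.0)
# PLACEHOLDER rates, chosen only to make the shape of the curves visible.
metered_store = Metered(dollars_per_million_queries=10.0, dollars_per_gb_month=0.30)

small = metered_store.monthly_cost(queries=1_000_000, storage_gb=0.5)
large = metered_store.monthly_cost(queries=100_000_000, storage_gb=50.0)
print(f"small workload: metered ${small:.2f} vs floor ${floor_store.monthly_cost(2):.2f}")
print(f"large workload: metered ${large:.2f} vs floor ${floor_store.monthly_cost(4):.2f}")
```

With these placeholder rates the metered store wins comfortably on the small workload and loses on the large one. The useful habit is plugging real rates and a realistic growth forecast into the same two functions.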
The fourth is operational shape. Is the store a managed service that we point at, or does it want capacity planning, index tuning, and reindexing procedures? The answer isn’t binary: some managed offerings still have compute-unit ceilings to think about, and databases are managed but their vector-index builds need planning.
And a softer one: what else we’re already running. An organisation with Aurora everywhere has ops maturity on Postgres that tips the scales toward pgvector; an organisation with OpenSearch for logs already knows the query language.
What we’ll filter on
- Query latency at scale: p99 under 50ms at 12M vectors?
- Hybrid search: keyword + vector scoring in one query?
- Metadata filtering: pre-filter, post-filter, or both?
- Cost shape: how does the bill grow with corpus size and query volume?
- Operational surface: capacity planning, index rebuilds, tuning?
The vector-store landscape
- OpenSearch Serverless (vector collection). Purpose-built vector collection type; k-NN plugin using HNSW under the hood with FAISS or Lucene as the engine. Minimum 2 OCUs (1 indexing + 1 search) per collection; scales to more as load rises. Hybrid search is a first-class feature via the search_pipeline with a normalization-processor. Metadata filters are pre-filters or post-filters; pre-filtering (with filter in the k-NN query) works cleanly. At current pricing, 2 OCUs runs around $350/month floor before any data, which for small corpora is overkill but for 12M vectors is reasonable. Fast: sub-20ms queries on well-tuned indexes.
- Aurora PostgreSQL with pgvector. The relational option. The pgvector extension stores vector(n) columns, builds HNSW or IVFFlat indexes, and supports distance operators (<-> for L2, <#> for negative inner product, <=> for cosine). Queries look like SELECT ... ORDER BY embedding <=> $1 LIMIT 10. Hybrid search requires combining with Postgres full-text search (tsvector, tsquery) and ranking manually: flexible, verbose. Metadata filters are just WHERE clauses, which the planner can push down. Pricing is instance-hours; an r7g.xlarge runs ~$250/month, plus storage. On Aurora Serverless v2, ACUs scale with load. HNSW index builds can take hours on 12M rows and need careful maintenance_work_mem tuning. Rewarding when the team already lives in Postgres.
- Pinecone Serverless. Managed vector database, usage-metered. Separates storage from compute; reads charged per read unit (roughly 1 RU per query returning up to 16 KB of results), writes per write unit, storage per GB-month. Hybrid search supported via sparse-dense vectors (we supply both a dense embedding and a sparse BM25 representation; Pinecone scores and combines). Metadata filters are pre-filter, baked in. Latency claims <100ms at scale. Cost is attractive at low query volumes (storage-only for a quiet index) but rises linearly with query rate. At 200 qps, 12M vectors, typical metadata, expect low hundreds of dollars a month, scaling with traffic.
- DynamoDB + OpenSearch. Hybrid pattern: DynamoDB as the source-of-truth for the documents, OpenSearch for the vectors. Adds plumbing (DynamoDB Streams → Lambda → OpenSearch) but separates write and read concerns, which some teams like. For a pure retrieval workload, it’s OpenSearch doing the vector work, so the analysis reduces to OpenSearch’s.
- ElastiCache for Redis with RediSearch. Redis Stack has vector search via the FT module. Sub-millisecond at small scales. Capped by memory: 12M × 1024-dim × 4 bytes = ~49 GB of vector data alone, plus index overhead; needs an r7g.4xlarge or larger Redis cluster. Fast; expensive at scale. Works best for hot caches or small, frequently-queried corpora.
- S3 Vectors. The newest option: vectors stored directly in S3 with a vector API on top, designed for massive-scale archival retrieval where per-query latency is more forgiving (seconds, not tens of milliseconds). Not the correct shape for a 50ms interactive assistant; mentioned to place it on the landscape.
Side by side
| Option | p99 at 12M | Hybrid search | Metadata filters | Cost shape | Ops surface |
|---|---|---|---|---|---|
| OpenSearch Serverless | ~20 ms | ✓ (pipeline) | Pre + post | OCU-based, $350 floor | Managed, OCU ceilings |
| Aurora + pgvector | ~30-60 ms | ✓ (manual FTS + vec) | WHERE clauses | Instance-hours | Index builds, vacuum |
| Pinecone Serverless | <100 ms | ✓ (sparse-dense) | Pre-filter | Per-op, scales with traffic | Minimal |
| ElastiCache + RediSearch | <5 ms | Limited | Prefix filters | Memory-bound | Cluster management |
| S3 Vectors | Seconds | None | ✓ | Per-request + storage | Managed, archival fit |
Reading it for this situation (12M vectors, 50ms budget, hybrid queries, metadata-rich filtering, 200 qps peak), three viable options remain: OpenSearch Serverless, Aurora + pgvector, and Pinecone Serverless. The choice is about what else we’re running.
How the three finalists compare
The picks in depth
OpenSearch Serverless. The correct answer when retrieval quality is paramount and the corpus is large enough to justify the floor. Hybrid search is a one-query affair via the search pipeline; HNSW parameters (m, ef_construction, ef_search) are tunable through the k-NN mapping. Pre-filtering metadata works cleanly with the filter parameter in the k-NN query: the engine applies the filter during graph traversal rather than post-filtering, which keeps recall high when filters are selective. The operational tax is watching OCU usage. The minimum is 2 (1 indexing + 1 search) and capacity scales up automatically, but a ceiling can be configured per collection to prevent runaway costs. At 12M vectors with 200 qps peak, 4 OCUs is a reasonable expectation and costs roughly $700/month at current rates; treat that as the upper bound, not the typical bill.
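As a sketch of where those HNSW parameters live, the function below builds an index body with a knn_vector field. The field names (content, source, language, published, embedding) and the parameter values are assumptions for illustration; the engine and space type are one plausible choice, not a statement of what this collection uses.

```python
import json


def knn_index_body(dimension: int = 1024, m: int = 16,
                   ef_construction: int = 128) -> dict:
    """Index body with one HNSW-backed vector field plus filterable metadata.

    All field names and default values here are illustrative assumptions.
    """
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "content": {"type": "text"},
                "source": {"type": "keyword"},
                "language": {"type": "keyword"},
                "published": {"type": "date"},
                "embedding": {
                    "type": "knn_vector",
                    "dimension": dimension,
                    "method": {
                        "name": "hnsw",
                        "engine": "faiss",
                        "space_type": "l2",
                        "parameters": {"m": m, "ef_construction": ef_construction},
                    },
                },
            }
        },
    }


if __name__ == "__main__":
    # The body would be sent when creating the index; here it is only printed.
    print(json.dumps(knn_index_body(), indent=2))
```

m and ef_construction are fixed at index creation, so changing them means reindexing; ef_search is a query-time knob and is deliberately left out of the mapping.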
Aurora PostgreSQL with pgvector. The correct answer when the team already runs Aurora, the metadata lives in relational tables, and queries can lean on SQL. A document’s row has id, content, metadata jsonb, and embedding vector(1024); the query SELECT ... WHERE metadata->>'source' = 'pricing' AND metadata->>'language' = 'en' ORDER BY embedding <=> $1 LIMIT 10 combines filtering and vector search in one plan. With an HNSW index, pgvector applies the WHERE conditions after the index scan, so a selective filter can leave fewer than 10 rows; pgvector 0.8.0 added iterative index scans (hnsw.iterative_scan), which keep scanning until enough rows pass, so selective filters don’t wreck recall. Memory needs checking at this size: 12M float32 vectors at 1024 dimensions come to roughly 49 GB before graph overhead, which is more than the 32 GB on an r7g.xlarge, so keeping the HNSW index warm means either storing half-precision vectors (halfvec, available from pgvector 0.7.0) or moving to a larger instance. On the xlarge, the bill runs roughly $250-400/month depending on IOPS. The sharp edge: building the HNSW index on 12M rows can take several hours and needs maintenance_work_mem bumped to a few gigabytes; plan reindexes during low-traffic windows.
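The setup described above can be sketched as a short list of SQL statements, generated here from Python so the parameters are explicit. The table name chunks, the index names, the GIN index on metadata, and the 4GB setting are assumptions for illustration; m = 16 and ef_construction = 64 are pgvector's documented defaults.

```python
def pgvector_setup_statements(dimension: int = 1024, m: int = 16,
                              ef_construction: int = 64,
                              work_mem: str = "4GB") -> list[str]:
    """Return the DDL for a chunks table with an HNSW cosine index.

    Table, index names, and work_mem are hypothetical; run the statements
    in order, ideally during a low-traffic window.
    """
    return [
        "CREATE EXTENSION IF NOT EXISTS vector;",
        f"""CREATE TABLE IF NOT EXISTS chunks (
    id bigserial PRIMARY KEY,
    content text NOT NULL,
    metadata jsonb NOT NULL DEFAULT '{{}}',
    embedding vector({dimension})
);""",
        # More build memory lets the HNSW graph be built without spilling.
        f"SET maintenance_work_mem = '{work_mem}';",
        f"""CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw
    ON chunks USING hnsw (embedding vector_cosine_ops)
    WITH (m = {m}, ef_construction = {ef_construction});""",
        # GIN index so jsonb containment filters (metadata @> ...) stay cheap.
        "CREATE INDEX IF NOT EXISTS chunks_metadata_gin ON chunks USING gin (metadata);",
    ]


if __name__ == "__main__":
    print("\n\n".join(pgvector_setup_statements()))
```

The operator class has to match the query operator: vector_cosine_ops pairs with <=>, so a query ordering by <-> would not use this index.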
Pinecone Serverless. The correct answer when the team wants to stop thinking about vector stores entirely. Upload vectors via the SDK, query them, ignore the rest. Sparse-dense hybrid handles keyword + semantic in one call; the alpha weighting that mixes the two is applied on the client, by scaling the dense and sparse vectors before the query is sent. Metadata filters pre-filter. Operational surface is near-zero; the trade is a separate vendor relationship, a separate bill, and cross-AZ latency that puts queries toward the upper end of the 50ms budget rather than the middle. At 12M vectors, 200 qps peak, expect low hundreds of dollars per month.
A worked example: one query, three stacks
User asks: “Why does my billing show a prorated charge on the 15th?”
The front-end embeds the query with Titan v2 (a vector of 1024 floats). It also generates a sparse BM25 representation for hybrid stores. The metadata filter is source IN ('pricing', 'billing-kb', 'manual') AND language = 'en' AND published_after = '2028-01-01'.
OpenSearch Serverless. One POST to the collection’s _search endpoint with a hybrid pipeline: { "query": { "hybrid": { "queries": [ { "match": { "content": "prorated charge 15th" }}, { "knn": { "embedding": { "vector": [...], "k": 50, "filter": { "bool": { "must": [...metadata...] } } }}} ] } } }. Response in 18ms. Top 5 chunks. Score normalisation handled by the pipeline.
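The request above can be assembled from two small builders: one for the search pipeline that normalises and combines scores, one for the hybrid query itself. This is a sketch; the field names (content, embedding, source, language, published) and the 0.4/0.6 weights are assumptions, and nothing here sends a request.

```python
def hybrid_pipeline(keyword_weight: float = 0.4, vector_weight: float = 0.6) -> dict:
    """Search pipeline body: min-max normalise each sub-query, then take a
    weighted arithmetic mean. Weights follow the order of the sub-queries."""
    return {
        "phase_results_processors": [{
            "normalization-processor": {
                "normalization": {"technique": "min_max"},
                "combination": {
                    "technique": "arithmetic_mean",
                    "parameters": {"weights": [keyword_weight, vector_weight]},
                },
            }
        }]
    }


def hybrid_query(text: str, vector: list[float], k: int = 50, size: int = 5) -> dict:
    """Hybrid query body: a keyword match plus a filtered k-NN clause."""
    # Hypothetical field names; the values mirror the worked example's filter.
    metadata_filter = {"bool": {"must": [
        {"terms": {"source": ["pricing", "billing-kb", "manual"]}},
        {"term": {"language": "en"}},
        {"range": {"published": {"gte": "2028-01-01"}}},
    ]}}
    return {
        "size": size,
        "query": {"hybrid": {"queries": [
            {"match": {"content": text}},
            {"knn": {"embedding": {"vector": vector, "k": k,
                                   "filter": metadata_filter}}},
        ]}},
    }
```

Note that the filter sits inside the k-NN clause only. If the keyword leg must respect the same metadata constraints, wrap the match clause in a bool query that carries the same filter.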
Aurora + pgvector. One SQL query: WITH semantic AS (SELECT id, content, embedding <=> $1 AS dist FROM chunks WHERE metadata @> $2 ORDER BY dist LIMIT 50), keyword AS (SELECT id, ts_rank(ts, plainto_tsquery('prorated charge 15th')) AS r FROM chunks WHERE metadata @> $2 ORDER BY r DESC LIMIT 50) SELECT ... FROM semantic FULL JOIN keyword USING (id) ORDER BY (0.6 * (1 - COALESCE(semantic.dist, 1)) + 0.4 * COALESCE(keyword.r, 0)) DESC LIMIT 5;. The cosine distance is flipped to a similarity and the sort is descending so that both terms reward better matches, and COALESCE covers rows that only one side of the FULL JOIN returned. Response in 45ms. Explicit weighting; auditable plan.
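The manual ranking step is where this approach most often goes wrong, because distances (lower is better) and text ranks (higher is better) point in opposite directions and live on different scales. The sketch below shows one way to fuse them: convert distance to similarity, min-max normalise each side, then take a weighted sum. The function names and the sample scores are made up for illustration.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1]; a constant list maps to all 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 1.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}


def fuse(semantic_dist: dict[str, float], keyword_rank: dict[str, float],
         alpha: float = 0.6, top_n: int = 5) -> list[tuple[str, float]]:
    """Weighted fusion of cosine distances and keyword ranks.

    alpha weights the semantic side; ids missing from one side score 0 there.
    """
    # Flip distance into similarity so higher is better on both sides.
    similarity = min_max({key: 1.0 - dist for key, dist in semantic_dist.items()})
    keyword = min_max(keyword_rank)
    ids = set(similarity) | set(keyword)
    combined = {
        i: alpha * similarity.get(i, 0.0) + (1.0 - alpha) * keyword.get(i, 0.0)
        for i in ids
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)[:top_n]


# Made-up scores: "b" appears on both sides, "a" only semantically.
ranked = fuse({"a": 0.10, "b": 0.30, "c": 0.50}, {"b": 0.8, "d": 0.4})
print(ranked[:2])
```

In this sample, "b" overtakes "a" even though "a" is the closest vector, because "b" also has the best keyword rank. That is the behaviour hybrid search is meant to deliver.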
Pinecone Serverless. One query call with vector, sparse_vector, filter, and top_k: 5, with the 0.6 alpha weighting applied to the two vectors on the client beforehand. Response in 80ms. Transport cost dominates; the query itself is ~10ms on Pinecone’s side.
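One common way to apply the alpha weighting for a sparse-dense query is a convex combination on the client: scale the dense vector by alpha and the sparse values by 1 - alpha before sending them. The sketch below assumes a dot-product index and a sparse payload shaped as parallel indices and values lists; the helper name is hypothetical and no SDK call is made.

```python
def hybrid_scale(dense: list[float], sparse: dict, alpha: float):
    """Weight a dense/sparse pair with a convex combination before querying.

    alpha = 1.0 is pure semantic, alpha = 0.0 is pure keyword.
    `sparse` is assumed to be {"indices": [...], "values": [...]}.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    scaled_dense = [value * alpha for value in dense]
    scaled_sparse = {
        "indices": list(sparse["indices"]),
        "values": [value * (1.0 - alpha) for value in sparse["values"]],
    }
    return scaled_dense, scaled_sparse


# Toy vectors; real ones would come from the embedding and BM25 encoders.
dense_q, sparse_q = hybrid_scale(
    [1.0, 2.0], {"indices": [3, 7], "values": [0.5, 1.0]}, alpha=0.6
)
print(dense_q, sparse_q)
```

Because a dot product is linear in the query, scaling the inputs this way scales each side's contribution to the final score by the same factor.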
All three return a comparable top-5. The differentiator is the 20-50ms you get back into the budget, or the 20 minutes of SQL-writing you avoid.
What’s worth remembering
- The vector store is retrieval’s engine, not its brain. A better store doesn’t fix a bad chunking strategy; a worse store caps what a great retriever can do.
- HNSW is the de-facto algorithm. Every serious store uses it. Tuning m and ef_search trades recall against latency and memory.
- Hybrid search handles the queries vector search fumbles: product codes, version strings, error messages, proper nouns, the exact-match cases. Enable it unless the corpus is purely conceptual.
- Pre-filtering beats post-filtering for selective metadata. When the filter eliminates 90% of the corpus, pre-filter or the top-k returns nothing.
- OpenSearch Serverless is the default for AWS-native hybrid retrieval at scale. One collection, one query, managed index. Watch the OCU floor; set a ceiling.
- Aurora + pgvector earns its keep when metadata is relational. SQL joins, transactions, and familiar operational tools. Plan the HNSW index build; tune maintenance_work_mem.
- Pinecone Serverless is the “don’t think about it” option. Lowest ops surface, separate vendor, linear cost with traffic. Good fit at low volumes; reconsider at scale.
- The cost curves cross. What’s cheapest at 100k is rarely cheapest at 10M. Model the bill for realistic growth, not current usage.
The 12M-vector corpus with 200 qps peak lands on OpenSearch Serverless for retrieval quality and hybrid query simplicity. The answer isn’t universal (a heavily relational corpus tips toward pgvector, a team allergic to ops tips toward Pinecone), but it is defensible, and the axes above are the ones to defend it on.