Vector Databases Explained for Engineers Shipping RAG
What vector databases actually do, how ANN indexes work, and how to choose one without falling for benchmark theater.
What you'll learn
- What a vector index actually stores
- How HNSW and IVF differ in trade-offs
- Why recall is not the only metric that matters
- When pgvector beats a dedicated DB
- How filtering changes the math
Prerequisites
- Familiar with how APIs work
- Basic embedding concepts
What and Why
A vector database stores high-dimensional vectors and lets you find the nearest neighbors to a query vector. That is the whole product. The reason it matters is that modern AI represents meaning as vectors (embeddings), and similarity in vector space approximates semantic similarity. A search for “how do I cancel my subscription?” should match a help article titled “Ending your membership” even though the two share no meaningful words.
The vector DB has become the storage layer of RAG, semantic search, recommendations, deduplication, and image retrieval. Like any new category, it is over-marketed. You do not always need one. When you do, the choice matters.
Mental Model
A nearest-neighbor query is conceptually simple: compute the distance between the query and every stored vector, then return the top K. Exact search costs O(N × D) per query, where N is the corpus size and D the dimensionality. At 100M vectors of 1024 dimensions, that is roughly 100 billion floating-point operations per query. Unworkable.
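To make the cost concrete, here is a minimal sketch of exact search in plain Python. The function names and the toy corpus are illustrative, not taken from any particular engine, and Euclidean distance is an assumption (cosine or dot product work the same way).

```python
import heapq
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def exact_top_k(query, vectors, k):
    """Score every stored vector (N * D work) and keep the k closest.

    Returns a list of (distance, index) pairs, nearest first.
    """
    scored = ((l2_distance(query, v), i) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored)


# Toy corpus: four 2-dimensional vectors (hypothetical data).
corpus = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]]
nearest = exact_top_k([1.0, 1.0], corpus, k=2)

# Cost estimate for the corpus size quoted above.
n_vectors, dims = 100_000_000, 1024
ops_per_query = n_vectors * dims  # about 1.0e11 multiply-adds per query
```

Every query touches every vector, which is exactly the work an ANN index exists to avoid.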
Approximate nearest neighbor (ANN) algorithms trade exactness for speed. They build an index that lets you find “close enough” neighbors in milliseconds. The two dominant families are graph-based (HNSW) and partitioning-based (IVF).
Architecture
                  Query vector
                       |
                       v
Layer 2 (sparse)  o-----o-----o
                  |     |     |
                  v     v     v
Layer 1           o-o-o-o-o-o-o
                  |     |     |
                  v     v     v
Layer 0 (dense)   o-o-o-o-o-o-o-o-o-o-o-o
                       ^
                  candidate neighborhood -> top-K
HNSW (Hierarchical Navigable Small World) builds a layered graph. You enter at the top sparse layer, greedily walk toward the query, then descend to denser layers. Excellent recall and latency. Memory-hungry: indexes are often 1.5x to 2x the raw vector size.
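The enter, walk, descend loop can be sketched on a hand-built graph. This is a deliberately simplified illustration, not an HNSW implementation: it skips index construction, returns a single nearest node instead of a top-K list, and omits the candidate queue that ef_search controls in real engines. The three layers and the nine points on a line are hypothetical data.

```python
import math


def greedy_walk(graph, points, entry, query):
    """Move to the neighbor closest to the query until no neighbor is closer."""
    current = entry
    while True:
        best = min(graph[current],
                   key=lambda n: math.dist(points[n], query),
                   default=current)
        if math.dist(points[best], query) >= math.dist(points[current], query):
            return current  # local minimum on this layer
        current = best


def layered_search(layers, points, entry, query):
    """Walk each layer from sparsest to densest, reusing the stop node as the next entry."""
    node = entry
    for graph in layers:  # layers[0] is the sparsest layer
        node = greedy_walk(graph, points, node, query)
    return node


# Nine points on a line; higher layers keep every 4th and every 2nd point.
points = {i: (float(i), 0.0) for i in range(9)}
layer2 = {0: [4], 4: [0, 8], 8: [4]}
layer1 = {0: [2], 2: [0, 4], 4: [2, 6], 6: [4, 8], 8: [6]}
layer0 = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 8] for i in range(9)}

found = layered_search([layer2, layer1, layer0], points, 0, (6.6, 0.0))
```

The sparse layers cover most of the distance in a couple of long hops, and the dense layer only refines the last step. That is where the latency advantage comes from.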
IVF (Inverted File) clusters vectors into Voronoi cells using k-means. At query time, find the nearest cells and scan them. Lower memory, lower recall unless you scan many cells. Often paired with PQ (Product Quantization) which compresses vectors to bytes at the cost of some accuracy.
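A minimal IVF sketch, assuming plain Lloyd's k-means seeded with the first k vectors (real engines use better seeding and far more data). The helper names and the six-vector corpus are illustrative; nprobe is the number of cells scanned per query.

```python
import math


def nearest_centroid(vec, centroids):
    """Index of the centroid closest to vec."""
    return min(range(len(centroids)), key=lambda c: math.dist(vec, centroids[c]))


def kmeans(vectors, k, iterations=10):
    """Plain Lloyd's k-means, seeded with the first k vectors for determinism."""
    centroids = [list(v) for v in vectors[:k]]
    for _ in range(iterations):
        cells = [[] for _ in range(k)]
        for v in vectors:
            cells[nearest_centroid(v, centroids)].append(v)
        for c, members in enumerate(cells):
            if members:  # keep the old centroid if a cell went empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def build_ivf(vectors, k):
    """Return centroids plus inverted lists mapping each cell to vector ids."""
    centroids = kmeans(vectors, k)
    lists = [[] for _ in range(k)]
    for i, v in enumerate(vectors):
        lists[nearest_centroid(v, centroids)].append(i)
    return centroids, lists


def ivf_search(query, vectors, centroids, lists, top_k, nprobe):
    """Scan only the nprobe cells whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: math.dist(query, centroids[c]))
    candidates = [i for c in order[:nprobe] for i in lists[c]]
    candidates.sort(key=lambda i: math.dist(query, vectors[i]))
    return candidates[:top_k]


# Two obvious clusters (hypothetical data) and a query near the cell boundary.
vectors = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
centroids, lists = build_ivf(vectors, k=2)
one_probe = ivf_search((5.2, 5.2), vectors, centroids, lists, top_k=3, nprobe=1)
two_probes = ivf_search((5.2, 5.2), vectors, centroids, lists, top_k=3, nprobe=2)
```

With one probe, the search never looks inside the second cell, so it misses vector 1 even though that vector is among the three true nearest neighbors. Scanning both cells recovers it at the cost of scanning everything, which is the recall versus work trade-off in miniature.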
Real systems often combine techniques: IVF + PQ for memory efficiency at billion-vector scale, HNSW for sub-100M corpora where memory is available.
A vector DB sits in a pipeline:
documents -> embedding model -> vectors + metadata -> index
queries -> embedding model -> vector -> ANN search -> top-K + metadata -> LLM
The metadata layer is more important than people realize. Filtering by tenant, language, document type, or date is non-negotiable for production systems.
Trade-offs
Recall vs latency vs memory. These are the three corners of the ANN triangle. HNSW with a high ef_search can reach around 99% recall at the cost of higher latency. IVF with many probes reaches similar recall but is slower. PQ-compressed indexes trade 1-3% recall for a 4-16x memory reduction.
Filtered search is hard. Naive ANN finds K neighbors then filters. If your filter is selective (e.g., tenant_id matches 0.1% of corpus), you may get zero results after filtering. Real engines support pre-filtering (scan only matching candidates) or hybrid filter-aware indexes. This is where Pinecone, Weaviate, Qdrant, and pgvector diverge sharply.
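The failure mode is easy to reproduce. The sketch below uses exact search as a stand-in for ANN (an assumption made to keep it short); the record layout, the tenant names, and the 0.3% selectivity are hypothetical.

```python
import math


def top_k(query, items, k):
    """Exact top-k over (id, vector, metadata) records, nearest first."""
    return sorted(items, key=lambda rec: math.dist(query, rec[1]))[:k]


def post_filter(query, items, k, predicate):
    """Search first, filter second: can return fewer than k results."""
    return [rec for rec in top_k(query, items, k) if predicate(rec[2])]


def pre_filter(query, items, k, predicate):
    """Filter first, then search only the records that match."""
    return top_k(query, [rec for rec in items if predicate(rec[2])], k)


def is_acme(meta):
    return meta["tenant"] == "acme"


# 1,000 records; only the 3 farthest from the query belong to tenant "acme".
records = [(i, (float(i), 0.0), {"tenant": "acme" if i >= 997 else "other"})
           for i in range(1000)]

after = post_filter((0.0, 0.0), records, 3, is_acme)   # empty list
before = pre_filter((0.0, 0.0), records, 3, is_acme)   # the 3 acme records
```

Post-filtering returns nothing because none of the three global nearest neighbors belong to the tenant. Pre-filtering returns all three matching records, but in a real engine it has to do so without falling back to a full scan, which is the hard part.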
Hybrid search. Pure vector search is bad at exact-match queries: “model XJ-47B” can be lost in semantic noise. BM25 + vector with score fusion (RRF or weighted) consistently outperforms either alone.
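Reciprocal rank fusion is small enough to write out. A minimal sketch, assuming each retriever returns document ids ordered best first; the constant k=60 is the commonly used default rather than something tuned for your corpus, and the document ids are made up.

```python
from collections import defaultdict


def rrf(*rankings, k=60):
    """Fuse ranked lists of document ids; each list is ordered best first.

    A document's fused score is the sum of 1 / (k + rank) over every list
    it appears in, with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


bm25_ids = ["xj-47b-manual", "returns-policy", "warranty"]
vec_ids = ["warranty", "xj-47b-manual", "shipping"]
fused = rrf(bm25_ids, vec_ids)
```

RRF uses only ranks, never raw scores, so you do not have to normalize BM25 scores against cosine similarities. Documents that both retrievers agree on rise to the top.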
Operational complexity. A dedicated vector DB is another service to run, monitor, back up, and pay for. pgvector inside an existing Postgres instance often handles 1-10M vectors comfortably and integrates with the rest of your data. Move off it only when you hit real limits.
Index build time. HNSW takes minutes to hours to build for tens of millions of vectors. Plan for incremental upserts and periodic rebuilds. Some engines support online updates well (Qdrant, Weaviate); others rebuild segments in the background (Lucene-based engines).
Vector dimension matters. 1536-dim OpenAI embeddings are 6KB per vector. 100M of those is 600GB just for the data. Smaller embeddings (256, 384) are cheaper to store and faster to search; with reranking, quality is often similar.
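The storage arithmetic, as a quick sketch. It assumes float32 values (4 bytes each) and counts raw vectors only, with no index overhead or metadata; the function name is illustrative.

```python
def raw_vector_bytes(num_vectors, dims, bytes_per_value=4):
    """Storage for the raw vectors alone, assuming float32 by default."""
    return num_vectors * dims * bytes_per_value


per_vector = raw_vector_bytes(1, 1536)              # 6,144 bytes, about 6 KB
full_corpus = raw_vector_bytes(100_000_000, 1536)   # 614.4 GB (decimal)
small_corpus = raw_vector_bytes(100_000_000, 384)   # 153.6 GB (decimal)
```

Dropping from 1536 to 384 dimensions cuts raw storage by 4x before any quantization is applied.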
Practical Tips
- Start with pgvector. If you have Postgres, add HNSW indexes. You may outgrow it eventually; you may not. Avoid prematurely operating another stateful system.
- Measure recall@K against ground truth. Build a small labeled set. Benchmark your engine config (ef, nprobe, etc.) and pick the cheapest config that hits your recall target.
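- Pre-filter, don’t post-filter. Engines with filter-aware search (Qdrant, Weaviate) outperform ones that filter after the fact when selectivity is high. pgvector applies filters after the approximate index scan, so for selective filters lean on its iterative index scans (0.8.0 and later) or partial indexes.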
- Use metadata indexes aggressively. tenant_id, language, source, date. Make them filterable in the vector engine, not just stored.
- Re-embed when the model changes. Vectors are model-specific. Upgrading from embedding-3-small to embedding-3-large requires re-embedding the corpus. Plan capacity.
- Hybrid search by default. BM25 plus vector with reciprocal rank fusion is the modern default. Pure vector is rarely the best baseline.
- Rerank with a cross-encoder. Pull 50 candidates, rerank with a small model, keep top 5. Cheap quality boost.
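# Hybrid with reranking (sketch: bm25, vec_db, embed, rrf, and cross_encoder stand in for your own components)
bm25_hits = bm25.search(q, k=25)                                # lexical candidates
vec_hits = vec_db.search(embed(q), k=25, filter={"tenant": t})  # semantic candidates, scoped to one tenant
fused = rrf(bm25_hits, vec_hits)                                # merge up to 50 candidates by rank
reranked = cross_encoder.rerank(q, fused, k=5)                  # keep the 5 best after reranking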
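For the recall@K measurement recommended in the tips, a minimal sketch follows. It assumes the ground truth for each query is a list of ids (for example, the exact top-K from a brute-force scan) and defines recall as the fraction of those ids found in the engine's top K. The query ids and result lists are hypothetical.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of ground-truth ids that appear in the top k retrieved ids."""
    if not relevant:
        raise ValueError("relevant must contain at least one id")
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


def mean_recall_at_k(results, ground_truth, k):
    """Average recall@k over a labeled query set keyed by query id."""
    total = sum(recall_at_k(results[q], ground_truth[q], k) for q in ground_truth)
    return total / len(ground_truth)


# Engine output vs. ground truth for two labeled queries.
results = {"q1": [1, 2, 3, 4], "q2": [9, 8, 7, 6]}
ground_truth = {"q1": [1, 2, 5], "q2": [9, 8, 7]}
score = mean_recall_at_k(results, ground_truth, k=3)  # (2/3 + 3/3) / 2
```

Run this once per candidate config (ef, nprobe, and so on) and keep the cheapest one that clears your target.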
Wrap-up
Vector databases are a real category solving a real problem: approximate nearest-neighbor search at scale with metadata filtering. They are not magic and they are not always the right choice. Start with pgvector, measure recall against ground truth, and add hybrid search and reranking before you blame the engine. When you do outgrow Postgres, pick the dedicated engine whose filtering and operational story fits your workload, not the one with the best benchmark chart. The chart is not your workload.