AI Vector Search with FAISS
A practical introduction to vector search with FAISS: how indexes work, which index to pick, and how to wire it into a real retrieval pipeline for embeddings.
What you'll learn
- What vector search is and when to use it
- How FAISS index families work
- A working end-to-end example
- Trade-offs between speed, memory, and recall
- Tips for keeping a vector index healthy in production
Prerequisites
- Basic Python and NumPy familiarity
Vector search is the workhorse behind semantic search, retrieval-augmented generation, and recommendation candidate generation. FAISS, from Facebook AI Research, is one of the most widely used open-source libraries for it. This post explains how it works and how to use it without surprises.
What and Why
A vector is just a list of numbers, typically produced by an embedding model. Semantically similar pieces of text or images land near each other in that high-dimensional space. Vector search finds the nearest vectors to a query vector, fast.
You need this whenever exact keyword matching is not enough. Searching documents by meaning, finding similar images, matching products to user intent, or pulling context for an LLM all reduce to the same problem: nearest neighbors over millions of vectors with tight latency. FAISS gives you the data structures and the GPU acceleration to do that at scale.
Mental Model
Brute force search compares the query against every vector. It is exact but slow once you have more than a few hundred thousand items. Approximate nearest neighbor (ANN) indexes trade a small amount of recall for orders of magnitude in speed.
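To make the baseline concrete, here is a minimal NumPy sketch of exact search. The function name brute_force_top_k and the random data are illustrative assumptions, not part of FAISS.

```python
import numpy as np

def brute_force_top_k(corpus, query, k):
    """Exact top-k by inner product: score every row, keep the k best."""
    scores = corpus @ query                    # one dot product per corpus vector
    top = np.argpartition(-scores, k - 1)[:k]  # the k best rows, in no particular order
    top = top[np.argsort(-scores[top])]        # order only those k, best first
    return scores[top], top

# Illustrative data: 10,000 random unit vectors (an assumption, not a real corpus).
rng = np.random.default_rng(0)
corpus = rng.standard_normal((10_000, 64)).astype("float32")
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

# A query that is a slightly perturbed copy of row 123.
query = corpus[123] + 0.01 * rng.standard_normal(64).astype("float32")
scores, ids = brute_force_top_k(corpus, query, k=5)
print(ids, scores)
```

Every query touches every row, so the cost grows linearly with corpus size. That linear scan is the work an ANN index tries to avoid.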
FAISS organizes indexes as building blocks you compose. An IVF index partitions the space into clusters and only searches a few of them per query. A PQ index compresses vectors so more fit in RAM. An HNSW index builds a navigable graph. You combine these to fit your dataset size, memory budget, and recall target.
Hands-on Example
Here is a minimal pipeline. Take a corpus, embed it, build an index, and query it.
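import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Embed the corpus. FAISS expects float32 arrays of shape (n, dim).
model = SentenceTransformer("all-MiniLM-L6-v2")
docs = ["cats love boxes", "dogs love walks", "the moon is bright tonight"]
vecs = model.encode(docs).astype("float32")

# Exact inner-product index. On unit-length vectors, inner product equals cosine similarity.
dim = vecs.shape[1]
index = faiss.IndexFlatIP(dim)
faiss.normalize_L2(vecs)  # normalizes each row in place
index.add(vecs)

# Embed and normalize the query the same way, then fetch the 2 nearest documents.
query = model.encode(["pets and cardboard"]).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, k=2)
print(scores, ids)  # ids are row positions in docs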
For larger corpora you switch the flat index for something like IndexIVFPQ and train it on a sample. The query interface stays the same.
text or image
|
v
[embedding model]
|
v
vector (e.g. 384-dim)
|
v
[FAISS index]
IVF buckets -> PQ compression
|
v
top-k vector ids
|
v
[lookup metadata in primary store]
|
v
results to caller
The FAISS index stores ids and vectors only. Keep the actual documents in a primary database and join on id after the search.
Trade-offs
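IndexFlat gives perfect recall, but it stores every vector uncompressed (4 bytes per dimension) and scans all of them on every query, so memory and latency grow linearly with corpus size. It is the right choice up to maybe a million vectors on a beefy machine.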
IndexIVFFlat partitions the space with k-means and searches a few partitions per query. It is fast and exact within those partitions but can miss neighbors near partition boundaries. The nprobe parameter trades recall for latency.
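IndexIVFPQ compresses each vector to a short code: it splits the vector into m sub-vectors and stores an nbits-bit codebook entry for each, so the dimension must be divisible by m. Memory drops by 10x or more, but recall drops too. Tune m and nbits carefully and measure recall on a held-out set.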
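The memory arithmetic can be worked through directly. This sketch assumes float32 storage for uncompressed vectors and a PQ code of m * nbits bits per vector; the 384-dim, m=48, nbits=8 configuration and the helper names are illustrative.

```python
import math

def flat_bytes_per_vector(dim):
    """Uncompressed float32 storage: 4 bytes per dimension."""
    return 4 * dim

def pq_bytes_per_vector(m, nbits):
    """PQ code size: m sub-vector codes of nbits bits each, rounded up to whole bytes."""
    return math.ceil(m * nbits / 8)

dim, m, nbits = 384, 48, 8
raw = flat_bytes_per_vector(dim)       # 1536
code = pq_bytes_per_vector(m, nbits)   # 48
ratio = raw / code                     # 32.0

n_vectors = 10_000_000
print(f"flat: {raw * n_vectors / 2**30:.1f} GiB")
print(f"pq:   {code * n_vectors / 2**30:.2f} GiB ({ratio:.0f}x smaller)")
```

These figures cover the vector payload only. An IVF index also stores an 8-byte id per vector plus the centroids and codebooks, so the end-to-end saving is somewhat smaller.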
IndexHNSWFlat builds a graph. It is excellent for medium datasets and supports incremental adds, but uses more RAM than IVF variants.
Practical Tips
Normalize your vectors and use inner product if your embedding model was trained with cosine similarity. Most modern embedding models were.
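The reason this works is that inner product on unit-length vectors is exactly cosine similarity, and squared L2 distance becomes 2 - 2 * cosine, so both metrics rank neighbors identically. A small NumPy check on two random vectors (illustrative data only):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(384)
b = rng.standard_normal(384)

# Cosine similarity on the raw vectors.
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Normalize to unit length, then take the plain inner product.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
inner_product = a_unit @ b_unit
squared_l2 = np.sum((a_unit - b_unit) ** 2)

print(np.isclose(cosine, inner_product))        # True
print(np.isclose(squared_l2, 2 - 2 * cosine))   # True
```

With FAISS, faiss.normalize_L2 performs the same row normalization in place on a float32 array.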
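Train IVF indexes on a representative sample of at least 30 times nlist vectors. FAISS itself prints a warning when k-means gets fewer than 39 training points per centroid. Skimping on training data leads to skewed clusters and bad recall.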
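Drawing the sample can be as simple as a uniform random choice of rows. The helper below is a hypothetical sketch (training_sample is not a FAISS function), and uniform sampling is only representative if the corpus itself is not heavily skewed.

```python
import numpy as np

def training_sample(vecs, nlist, points_per_centroid=30, seed=0):
    """Uniform random sample of rows, capped at the corpus size."""
    wanted = min(len(vecs), points_per_centroid * nlist)
    rng = np.random.default_rng(seed)
    rows = rng.choice(len(vecs), size=wanted, replace=False)
    return vecs[rows]

# Illustrative corpus of 50,000 random vectors.
vecs = np.random.default_rng(1).standard_normal((50_000, 16)).astype("float32")
sample = training_sample(vecs, nlist=256)
print(sample.shape)   # (7680, 16), i.e. 30 * 256 rows
```

The result is what you would pass to the index's train() method before calling add() on the full corpus.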
Measure recall at K against a brute force baseline before you ship. Pick a small evaluation set, run both, and compute overlap. If recall is below your target, raise nprobe or switch to a less compressed index.
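The overlap computation itself is small. Here is a sketch of a recall-at-k helper that takes the id arrays returned by the approximate and the exact search; the function name and the toy arrays are illustrative.

```python
import numpy as np

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-k neighbors that the approximate search returned."""
    hits = 0
    for approx_row, exact_row in zip(approx_ids, exact_ids):
        found = set(approx_row.tolist()) - {-1}   # FAISS pads missing results with -1
        hits += len(found & set(exact_row.tolist()))
    return hits / exact_ids.size

# Toy example: 2 queries, k=3.
exact = np.array([[1, 2, 3], [4, 5, 6]])
approx = np.array([[1, 3, 9], [6, 5, 4]])
print(recall_at_k(approx, exact))   # 5 of the 6 true neighbors were found
```

Order within the top k is ignored here, which matches the usual definition when a reranker or the caller reorders results anyway.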
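Plan for updates. FAISS is not a database. If your corpus changes constantly, rebuild on a schedule or pick an index type that matches your update pattern. HNSW indexes accept incremental adds but do not support removing vectors, while flat and IVF indexes support remove_ids.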
For very large corpora, shard by hash or by tenant and fan out queries in parallel. A single FAISS index above a few hundred million vectors gets unwieldy.
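The fan-out step ends with a merge of the per-shard top-k lists. Below is a sketch of that merge in NumPy, assuming inner-product scores (higher is better) and globally unique ids; shard_for and merge_top_k are hypothetical helper names, and the modulo routing is just one simple hashing choice.

```python
import numpy as np

def shard_for(doc_id, n_shards):
    """Hash-style routing: here simply the id modulo the shard count."""
    return doc_id % n_shards

def merge_top_k(shard_results, k):
    """Merge per-shard (scores, ids) arrays of shape (n_queries, k_shard).

    Assumes higher scores are better and ids are unique across shards.
    """
    scores = np.concatenate([s for s, _ in shard_results], axis=1)
    ids = np.concatenate([i for _, i in shard_results], axis=1)
    order = np.argsort(-scores, axis=1)[:, :k]   # best k columns per query
    return (np.take_along_axis(scores, order, axis=1),
            np.take_along_axis(ids, order, axis=1))

# Toy results for one query from two shards (even ids on shard 0, odd on shard 1).
shard_a = (np.array([[0.9, 0.2]], dtype="float32"), np.array([[10, 12]]))
shard_b = (np.array([[0.8, 0.7]], dtype="float32"), np.array([[21, 23]]))
scores, ids = merge_top_k([shard_a, shard_b], k=3)
print(ids)   # [[10 21 23]]
```

For L2 indexes, where lower distances are better, the sort direction would need to be flipped.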
Wrap-up
FAISS is a sharp tool, not a managed service. You pick the index, you tune the knobs, you handle persistence and updates. The reward is fast nearest neighbor search at scale, often in the low-millisecond range or better depending on index type and hardware, with full control. Start with IndexFlatIP, graduate to IVF when memory or latency hurts, and always validate recall against a brute force baseline.