AI Vector Search with FAISS

A practical introduction to vector search with FAISS: how indexes work, which index to pick, and how to wire it into a real retrieval pipeline for embeddings.

·4 min read · By Codeloom
Beginner 10 min read

What you'll learn

  • What vector search is and when to use it
  • How FAISS index families work
  • A working end-to-end example
  • Trade-offs between speed, memory, and recall
  • Tips for keeping a vector index healthy in production

Prerequisites

  • Basic Python and NumPy familiarity

Vector search is the workhorse behind semantic search, retrieval-augmented generation, and recommendation candidate generation. FAISS, from Facebook AI Research, is the most widely used open-source library for it. This post explains how it works and how to use it without surprises.

What and Why

A vector is just a list of numbers, typically produced by an embedding model. Semantically similar pieces of text or images land near each other in that high-dimensional space. Vector search finds the nearest vectors to a query vector, fast.

You need this whenever exact keyword matching is not enough. Searching documents by meaning, finding similar images, matching products to user intent, or pulling context for an LLM all reduce to the same problem: nearest neighbors over millions of vectors with tight latency. FAISS gives you the data structures and the GPU acceleration to do that at scale.

Mental Model

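Brute-force search compares the query against every vector, so its cost grows linearly with the size of the corpus. It is exact, but it gets slow somewhere between a few hundred thousand and a million items, depending on vector dimension and hardware. Approximate nearest neighbor (ANN) indexes trade a small amount of recall for orders-of-magnitude faster queries.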
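To make the brute-force baseline concrete, here is a minimal NumPy sketch with no FAISS involved. The function name `brute_force_search` and the random stand-in data are assumptions for illustration only.

```python
import numpy as np

def brute_force_search(corpus, query, k):
    """Exact top-k by inner product: score every corpus vector, then sort."""
    scores = corpus @ query            # one dot product per corpus vector
    top = np.argsort(-scores)[:k]      # indices of the k highest scores
    return top, scores[top]

rng = np.random.default_rng(0)
corpus = rng.standard_normal((1000, 64)).astype("float32")  # stand-in for embeddings
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)     # unit length: dot product = cosine

query = corpus[42]   # query with a vector that is already in the corpus
ids, scores = brute_force_search(corpus, query, k=3)
print(ids, scores)
```

Every query touches all 1000 rows, which is exactly the work an ANN index tries to avoid.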

FAISS organizes indexes as building blocks you compose. An IVF index partitions the space into clusters and only searches a few of them per query. A PQ index compresses vectors so more fit in RAM. An HNSW index builds a navigable graph. You combine these to fit your dataset size, memory budget, and recall target.

Hands-on Example

Here is a minimal pipeline. Take a corpus, embed it, build an index, and query it.

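import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Embed the corpus. FAISS expects float32 arrays of shape (n, dim).
model = SentenceTransformer("all-MiniLM-L6-v2")
docs = ["cats love boxes", "dogs love walks", "the moon is bright tonight"]
vecs = model.encode(docs).astype("float32")

# Exact inner-product index. On unit-length vectors, inner product equals cosine similarity.
dim = vecs.shape[1]
index = faiss.IndexFlatIP(dim)
faiss.normalize_L2(vecs)  # normalizes in place
index.add(vecs)

# Embed and normalize the query the same way, then fetch the 2 nearest documents.
query = model.encode(["pets and cardboard"]).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, k=2)
print(scores, ids)  # each has shape (1, 2); ids are positions in docs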

For larger corpora you switch the flat index for something like IndexIVFPQ and train it on a sample. The query interface stays the same.

text or image
   |
   v
[embedding model]
   |
   v
 vector (e.g. 384-dim)
   |
   v
[FAISS index]
IVF buckets -> PQ compression
   |
   v
top-k vector ids
   |
   v
[lookup metadata in primary store]
   |
   v
 results to caller
FAISS retrieval pipeline

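The FAISS index stores ids and vectors only. By default the id is simply the position at which a vector was added; to supply your own 64-bit ids, wrap the index in IndexIDMap or call add_with_ids on an IVF index. Keep the actual documents in a primary database and join on id after the search.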
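The join step can be sketched in a few lines. The `doc_store` dictionary and the `hydrate` helper are hypothetical stand-ins for a real primary store, and the score and id arrays are made-up values shaped like the output of index.search for one query.

```python
import numpy as np

# Hypothetical primary store: id -> document. In practice this is a database table.
doc_store = {
    0: {"title": "cats love boxes"},
    1: {"title": "dogs love walks"},
    2: {"title": "the moon is bright tonight"},
}

def hydrate(scores, ids, store):
    """Join one query's search results with their documents, skipping empty slots."""
    results = []
    for score, doc_id in zip(scores, ids):
        if doc_id == -1:  # FAISS pads with -1 when it finds fewer than k results
            continue
        results.append({"id": int(doc_id), "score": float(score), **store[int(doc_id)]})
    return results

# Made-up search output for a single query with k=3; the last slot is empty.
scores = np.array([[0.71, 0.32, 0.0]], dtype="float32")
ids = np.array([[0, 1, -1]], dtype="int64")

hits = hydrate(scores[0], ids[0], doc_store)
print(hits)
```

Checking for -1 matters most with IVF indexes, where a low nprobe can leave fewer than k candidates.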

Trade-offs

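IndexFlat gives perfect recall, but it stores every vector uncompressed (4 bytes per dimension in float32) and scans all of them on every query, so memory and latency both grow linearly with the corpus. It is the right choice up to maybe a million vectors on a beefy machine.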

IndexIVFFlat partitions the space with k-means into nlist clusters and searches only the nprobe closest clusters per query. It is fast and exact within those clusters, but it can miss neighbors that sit just across a cluster boundary. Raising nprobe recovers recall at the cost of latency.

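IndexIVFPQ compresses each vector to a few bytes by splitting it into m sub-vectors and storing an nbits-wide code for each. Memory drops by 10x or more, but recall drops too. Tune m (which must divide the vector dimension) and nbits carefully, and measure recall on a held-out set.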
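A quick back-of-the-envelope calculation shows where the savings come from. The corpus size, dimension, and m below are assumed values, and the estimate covers vector storage only, not ids, centroids, or codebooks.

```python
def flat_bytes(n, dim):
    """Uncompressed float32 storage: 4 bytes per component."""
    return n * dim * 4

def pq_code_bytes(n, m, nbits=8):
    """PQ storage: m codes of nbits each per vector."""
    return n * m * nbits // 8

n, dim, m = 10_000_000, 384, 48   # assumed: 10M vectors, 384-dim, 48 sub-vectors

print(flat_bytes(n, dim) / 1e9)    # 15.36 (GB)
print(pq_code_bytes(n, m) / 1e9)   # 0.48 (GB)
print(flat_bytes(n, dim) // pq_code_bytes(n, m))  # 32 (compression ratio)
```

Each 1536-byte vector shrinks to a 48-byte code under these settings, which is why the recall check against a held-out set is not optional.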

IndexHNSWFlat builds a graph. It is excellent for medium datasets and supports incremental adds, but uses more RAM than IVF variants.

Practical Tips

Normalize your vectors and use inner product if your embedding model was trained with cosine similarity, as many modern embedding models are. On unit-length vectors the inner product is the cosine similarity.
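A tiny NumPy check illustrates why normalization matters. The two vectors are made up and point in the same direction with different lengths.

```python
import numpy as np

a = np.array([3.0, 4.0], dtype="float32")
b = np.array([6.0, 8.0], dtype="float32")  # same direction as a, twice the length

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

print(a @ b)            # raw inner product, inflated by vector length
print(cosine)           # cosine similarity
print(a_unit @ b_unit)  # inner product of unit vectors, same as the cosine
```

Without normalization, an inner-product index would rank long vectors higher regardless of direction.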

Train IVF indexes on a representative sample of at least 30 × nlist vectors. Skimping on training data leads to skewed clusters and bad recall.
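One simple way to pick such a sample is uniform random selection, sketched below. The helper name `training_sample` is hypothetical, and uniform sampling is an assumption: it is only representative if the corpus itself is not ordered or skewed in a way the sample would miss.

```python
import numpy as np

def training_sample(vecs, nlist, points_per_centroid=30, seed=0):
    """Pick points_per_centroid * nlist rows uniformly at random, without replacement."""
    needed = points_per_centroid * nlist
    if len(vecs) < needed:
        raise ValueError(f"need {needed} vectors for nlist={nlist}, have {len(vecs)}")
    rng = np.random.default_rng(seed)
    rows = rng.choice(len(vecs), size=needed, replace=False)
    return vecs[rows]

vecs = np.zeros((5000, 8), dtype="float32")  # placeholder corpus
sample = training_sample(vecs, nlist=100)
print(sample.shape)  # (3000, 8)
```

The resulting array is what you would pass to index.train before adding the full corpus.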

Measure recall at K against a brute force baseline before you ship. Pick a small evaluation set, run both, and compute overlap. If recall is below your target, raise nprobe or switch to a less compressed index.
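The overlap computation is short enough to write by hand. In this sketch, `recall_at_k` is a hypothetical helper and the id arrays are made up; in practice `exact` would come from a flat index and `approx` from the index under test, both queried with the same k.

```python
import numpy as np

def recall_at_k(approx_ids, exact_ids):
    """Mean fraction of the exact top-k neighbors that the approximate search also returned."""
    hits = [len(set(a) & set(e)) for a, e in zip(approx_ids, exact_ids)]
    return sum(hits) / (len(exact_ids) * len(exact_ids[0]))

exact = np.array([[1, 2, 3], [4, 5, 6]])   # brute-force results for two queries
approx = np.array([[1, 2, 9], [6, 5, 4]])  # ANN results: one miss, order ignored

recall = recall_at_k(approx, exact)
print(recall)
```

The first query finds 2 of 3 true neighbors and the second finds all 3, so recall is 5/6. Order within the top k does not affect the score.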

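Plan for updates. FAISS is not a database. If your corpus changes constantly, rebuild on a schedule or pick an index type that matches your update pattern. HNSW indexes accept incremental adds but do not support removing vectors; IVF and flat indexes support removal through remove_ids.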

For very large corpora, shard by hash or by tenant and fan out queries in parallel. A single FAISS index above a few hundred million vectors gets unwieldy.
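The fan-out pattern looks roughly like this. To keep the sketch self-contained, each shard is a plain NumPy array searched exactly; in a real deployment each shard would be its own FAISS index, possibly on its own machine. The helper names and the modulo sharding rule are assumptions.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def search_shard(shard, query, k):
    """Exact top-k inside one shard; returns (score, global_id) pairs."""
    vecs, global_ids = shard
    scores = vecs @ query
    top = np.argsort(-scores)[:k]
    return [(float(scores[i]), int(global_ids[i])) for i in top]

def fan_out(shards, query, k):
    """Query every shard in parallel, then keep the k best hits overall."""
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda s: search_shard(s, query, k), shards)
        candidates = [hit for part in partials for hit in part]
    return heapq.nlargest(k, candidates)

rng = np.random.default_rng(0)
corpus = rng.standard_normal((900, 16)).astype("float32")
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
all_ids = np.arange(900)
# Shard by id modulo 3, a stand-in for hashing a real document key.
shards = [(corpus[all_ids % 3 == s], all_ids[all_ids % 3 == s]) for s in range(3)]

query = corpus[123]
result = fan_out(shards, query, k=5)
print(result)
```

Each shard must return its own top k, not k divided by the shard count, because all of the best hits may live in a single shard.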

Wrap-up

FAISS is a sharp tool, not a managed service. You pick the index, you tune the knobs, you handle persistence and updates. The reward is nearest neighbor search at scale, often in a millisecond or less per query with a well-tuned index, and full control. Start with IndexFlatIP, graduate to IVF when memory or latency hurts, and always validate recall against a brute-force baseline.