Embeddings Explained for Developers
What embeddings are, why they work, how to use them for search and clustering, how to pick a model, and the practical pitfalls that bite first-time users.
What you'll learn
- What an embedding actually represents
- Why cosine similarity is the usual metric
- How to embed and search with a few lines of code
- How to pick an embedding model
- Common pitfalls and how to avoid them
Prerequisites
- Basic Python familiarity
An embedding is a list of numbers that represents the meaning of a piece of text (or an image, or audio) in a way a computer can compare. That sentence sounds abstract, but the payoff is concrete: with embeddings you can search by meaning, cluster similar items, deduplicate, recommend, and route messages, all with a few lines of code and a fast nearest-neighbor index.
What the numbers actually mean
An embedding model is trained so that pieces of text with similar meanings produce vectors that point in similar directions. The individual numbers do not have human-readable labels; what matters is the geometry of the space. Two sentences about cooking will land near each other, two sentences about astronomy will land near each other, and the two clusters will sit far apart.
The dimensionality varies by model; common sizes are 384, 768, 1024, 1536, and 3072. Higher dimensionality can capture more nuance but costs more storage and more compute per query. For most applications, a mid-range model with 768 to 1536 dimensions is a sensible default.
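To make the storage cost concrete, here is a back-of-the-envelope sketch. It assumes vectors are stored as 32-bit floats (4 bytes per number) with no index overhead or compression; the function name `index_size_mb` and the one-million-document corpus are illustrative, not taken from any library.

```python
def index_size_mb(num_vectors: int, dim: int, bytes_per_value: int = 4) -> float:
    """Raw storage for the vectors alone, in megabytes (10**6 bytes)."""
    return num_vectors * dim * bytes_per_value / 1e6


# Hypothetical corpus of one million documents, one vector each.
for dim in (384, 768, 1536, 3072):
    print(f"{dim:>4} dims: {index_size_mb(1_000_000, dim):,.0f} MB")
```

Storage grows linearly with dimension, so doubling the dimension doubles the raw footprint before any index structures are added.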
Cosine similarity, and why
To compare two embeddings, the standard measure is cosine similarity: the cosine of the angle between them. It ranges from -1 to 1, where 1 means the vectors point in the same direction, 0 means they are orthogonal, and -1 means they point in opposite directions. Direction matters because most embedding models do not put meaningful information into the vector’s length, only into where it points. Euclidean distance also works; when vectors are normalized to unit length, cosine similarity and Euclidean distance produce the same ranking.
```python
import numpy as np

def cosine(a, b):
    a = np.array(a)
    b = np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine([1, 0, 0], [0.9, 0.1, 0]))  # close to 1
print(cosine([1, 0, 0], [0, 1, 0]))      # 0
```
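The claim that cosine and Euclidean rankings agree on normalized vectors follows from the identity distance² = 2 − 2·cosine, which holds for unit-length vectors. The sketch below checks it numerically on random data; the vectors are made-up stand-ins, not real embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 5 random "documents" and 1 "query" in 8 dimensions.
docs = rng.normal(size=(5, 8))
query = rng.normal(size=8)

# Scale every vector to unit length.
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query /= np.linalg.norm(query)

cos = docs @ query                           # cosine similarity (unit vectors)
dist = np.linalg.norm(docs - query, axis=1)  # Euclidean distance

# For unit vectors: distance^2 = 2 - 2 * cosine.
identity_holds = np.allclose(dist**2, 2 - 2 * cos)

# Sorting by highest cosine gives the same order as sorting by smallest distance.
same_ranking = np.array_equal(np.argsort(-cos), np.argsort(dist))

print(identity_holds, same_ranking)
```

Because distance shrinks exactly as cosine grows, either metric can be used for nearest-neighbor search once vectors are normalized.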
A working example
Embedding a few sentences and finding the nearest neighbor takes very little code. The snippet below uses the sentence-transformers library because it runs locally, but the same shape works with any hosted embedding API.
```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "How do I reset my password?",
    "Our refund policy lasts 30 days.",
    "The dashboard loads slowly on mobile.",
    "I forgot my login credentials.",
]

doc_vecs = model.encode(docs, normalize_embeddings=True)

query = "I can't sign in to my account"
q = model.encode([query], normalize_embeddings=True)[0]

scores = doc_vecs @ q  # since vectors are normalized, dot product = cosine
order = np.argsort(-scores)
for i in order[:2]:
    print(f"{scores[i]:.3f} {docs[i]}")
```
The top two matches should be the password and login-credentials sentences, even though they share almost no words with the query. That is the point: embeddings match meaning, not keywords.
Picking a model
Three factors matter: quality, dimension, and cost. For quality, the MTEB benchmark gives a useful ranking, but treat it as a hint, not a verdict; evaluate on your own data. For dimension, smaller vectors are faster and cheaper but tend to plateau on hard queries. For cost, hosted models charge per million tokens, while local models cost you memory and CPU/GPU time instead.
Practical defaults: if you want a hosted model that just works, OpenAI’s text-embedding-3-small or Cohere’s Embed v3 models (embed-english-v3.0 or embed-multilingual-v3.0) are solid choices. If you want to run locally, bge-large-en and nomic-embed-text-v1.5 are strong open options. For multilingual content, pick a model trained on the languages you care about.
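Whichever models you shortlist, evaluating on your own data can be as simple as measuring recall@k over a handful of queries whose correct document you already know. The sketch below is one way to do that; `recall_at_k` is a hypothetical helper, and the toy vectors stand in for embeddings you would produce with each candidate model.

```python
import numpy as np

def recall_at_k(doc_vecs, query_vecs, relevant_idx, k=3):
    """Fraction of queries whose known-relevant document appears in the top k.

    doc_vecs:     (n_docs, dim) array of normalized document embeddings
    query_vecs:   (n_queries, dim) array of normalized query embeddings
    relevant_idx: for each query, the index of the document that answers it
    """
    scores = query_vecs @ doc_vecs.T            # cosine, since inputs are normalized
    top_k = np.argsort(-scores, axis=1)[:, :k]  # best k document indices per query
    hits = [rel in row for rel, row in zip(relevant_idx, top_k)]
    return float(np.mean(hits))


# Toy stand-in data: 4 documents, 2 queries. The second query's best match
# is document 2, but its labeled answer is document 3, so it counts as a miss.
toy_docs = np.eye(4)
toy_queries = np.array([[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]])
print(recall_at_k(toy_docs, toy_queries, relevant_idx=[0, 3], k=1))
```

Run the same labeled queries through each candidate model and compare the numbers; a few dozen real queries usually say more about your use case than a leaderboard position.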
How to use embeddings beyond search
Embeddings power more than retrieval-augmented generation (RAG). Clustering with k-means or HDBSCAN lets you discover topics in a pile of support tickets. Classification with a tiny linear head on top of embeddings can match or beat much bigger models on narrow tasks at a fraction of the cost. Deduplication is as simple as flagging pairs above a similarity threshold. Routing in agent systems often uses embeddings to pick which tool or sub-agent should handle a query.
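As an illustration of the deduplication case, the sketch below flags every pair of vectors whose cosine similarity meets a threshold. The function name `near_duplicates` and the 0.9 threshold are assumptions for the example; the right threshold depends on your model and data.

```python
import numpy as np

def near_duplicates(vecs, threshold=0.9):
    """Return (i, j, score) for every pair whose cosine similarity >= threshold."""
    vecs = np.asarray(vecs, dtype=float)
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)  # normalize rows
    sims = vecs @ vecs.T                                        # all-pairs cosine
    i_idx, j_idx = np.triu_indices(len(vecs), k=1)              # each pair once, i < j
    return [
        (int(i), int(j), float(sims[i, j]))
        for i, j in zip(i_idx, j_idx)
        if sims[i, j] >= threshold
    ]


# Toy vectors: the first two point in almost the same direction.
toy = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
print(near_duplicates(toy))
```

Comparing all pairs is quadratic in the number of items, so this form only suits small collections; for larger ones, query a nearest-neighbor index for each item instead.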
Pitfalls that bite first-timers
Mixing models is the most common mistake. Embeddings from one model are not comparable to embeddings from another; even small version bumps can change the geometry. Pick one model, version it, and re-embed everything if you switch.
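One way to enforce this is to store the model identifier next to every vector and refuse to compare across models. The sketch below is a minimal version of that idea; `StoredEmbedding` and `check_same_model` are hypothetical names, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredEmbedding:
    doc_id: str
    vector: list[float]
    model_id: str  # e.g. "all-MiniLM-L6-v2"; add a version or revision if you have one


def check_same_model(query_model_id: str, stored: StoredEmbedding) -> None:
    """Refuse to compare vectors that came from different models."""
    if stored.model_id != query_model_id:
        raise ValueError(
            f"{stored.doc_id} was embedded with {stored.model_id!r}, "
            f"but the query uses {query_model_id!r}; re-embed before comparing"
        )
```

Calling the check at query time turns a silent quality drop into a loud error, which is much easier to debug.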
Chunking too aggressively hurts retrieval. Ten-token snippets often carry too little context to match real queries. Aim for chunks of a few hundred tokens and overlap them slightly so important spans are not split across a boundary.
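A sliding window with overlap can be sketched in a few lines. The defaults of 200 tokens per chunk and 20 tokens of overlap are illustrative assumptions, and whitespace splitting is a crude stand-in for the tokenizer your embedding model actually uses.

```python
def chunk_tokens(tokens, chunk_size=200, overlap=20):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# Whitespace "tokens" for demonstration only.
words = ("word " * 450).split()
print([len(c) for c in chunk_tokens(words)])
```

Each chunk starts `chunk_size - overlap` tokens after the previous one, so a sentence that straddles a boundary appears whole in at least one chunk as long as it is shorter than the overlap.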
Treating cosine similarity as truth is another trap. A high similarity score does not mean the retrieved chunk answers the question; it only means the chunk is about something similar. Use a reranker or an LLM check on the top-k results for high-stakes queries.
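The two-stage pattern looks roughly like the sketch below: a cheap cosine shortlist, then a more careful score on only those candidates. Here `rerank_fn` is a placeholder you would supply yourself (for example a cross-encoder or an LLM relevance check); the function name and the `k` and `keep` defaults are assumptions for illustration.

```python
import numpy as np

def retrieve_then_rerank(query, q_vec, docs, doc_vecs, rerank_fn, k=10, keep=3):
    """Stage 1: cosine top-k. Stage 2: re-score only those k with rerank_fn.

    rerank_fn(query, doc) -> float is a placeholder for a cross-encoder
    or an LLM relevance check; higher means more relevant.
    """
    scores = doc_vecs @ q_vec             # assumes normalized vectors
    candidates = np.argsort(-scores)[:k]  # cheap shortlist by cosine
    reranked = sorted(
        candidates,
        key=lambda i: rerank_fn(query, docs[i]),
        reverse=True,
    )
    return [docs[i] for i in reranked[:keep]]
```

The reranker only ever sees the shortlist, so `k` needs to be large enough that the right answer usually makes it through the first stage.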
Ignoring normalization causes silent bugs. Some libraries return normalized vectors; some do not. If you use dot product as a shortcut for cosine, you need normalized vectors. Pick one convention in your codebase and stick with it.
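One simple convention is to normalize every vector at the boundary where it enters your code, regardless of what the library returned. The helper below is a minimal sketch of that; `normalize_rows` is a hypothetical name, and the small `eps` is an assumption to avoid dividing by zero on all-zero rows.

```python
import numpy as np

def normalize_rows(vecs, eps=1e-12):
    """Scale each row to unit length; eps guards against all-zero rows."""
    vecs = np.asarray(vecs, dtype=float)
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.maximum(norms, eps)


raw = np.array([[3.0, 4.0], [0.5, 0.0]])
unit = normalize_rows(raw)

print(raw @ raw[0])    # raw dot products are inflated by vector length
print(unit @ unit[0])  # after normalizing, dot product equals cosine
```

Normalizing an already-normalized vector is harmless, so applying the helper everywhere costs little and removes the guesswork.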
A small mental model
Picture a high-dimensional space where every meaning has a location. The embedding model is a function that turns text into a point in that space. Cosine similarity asks how close two points are in direction. Vector databases store millions of these points and answer “what is closest to this one?” quickly. Build that picture and embeddings stop feeling mysterious; they become a normal data type that happens to be very useful.