
Embeddings Explained for Developers

What embeddings are, why they work, how to use them for search and clustering, how to pick a model, and the practical pitfalls that bite first-time users.

5 min read · By Codeloom
Beginner 9 min read

What you'll learn

  • What an embedding actually represents
  • Why cosine similarity is the usual metric
  • How to embed and search with a few lines of code
  • How to pick an embedding model
  • Common pitfalls and how to avoid them

Prerequisites

  • Basic Python familiarity

An embedding is a list of numbers that represents the meaning of a piece of text (or an image, or audio) in a way a computer can compare. That sentence sounds abstract, but the payoff is concrete: with embeddings you can search by meaning, cluster similar items, deduplicate, recommend, and route messages, all with a few lines of code and a fast nearest-neighbor index.

What the numbers actually mean

An embedding model is trained so that pieces of text with similar meanings produce vectors that point in similar directions. The individual numbers do not have human-readable labels; what matters is the geometry of the space. Two sentences about cooking will land near each other, two sentences about astronomy will land near each other, and the two clusters will sit far apart.

The dimensionality varies by model; common sizes are 384, 768, 1024, 1536, and 3072. More dimensions can capture more nuance but cost more storage and more compute per query. For most applications, a model with 768 to 1536 dimensions is a sensible default.
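To make the storage cost concrete, here is a rough back-of-the-envelope sketch. It assumes vectors are stored as dense 32-bit floats (4 bytes per value) and ignores any overhead a real index adds; the helper name index_size_bytes is made up for this example.

```python
def index_size_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw size of a dense vector store, ignoring index overhead.

    Assumes float32 storage (4 bytes per value) unless told otherwise.
    """
    return num_vectors * dim * bytes_per_value


# Compare the common dimensionalities for one million stored vectors.
for dim in (384, 768, 1536, 3072):
    gib = index_size_bytes(1_000_000, dim) / 2**30
    print(f"{dim:>5} dims: {gib:.2f} GiB per million vectors")
```

Storage grows linearly with dimension, so doubling the dimension doubles the raw footprint.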

Cosine similarity, and why

To compare two embeddings, the standard measure is cosine similarity: the cosine of the angle between them. It ranges from -1 to 1, with 1 meaning identical direction. Direction matters because most embedding models do not put meaningful information into the vector’s length, only into where it points. Using Euclidean distance can also work, especially if vectors are normalized, in which case cosine and Euclidean give the same ranking.

import numpy as np

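def cosine(a, b):
    """Cosine of the angle between a and b (undefined if either is all zeros)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Dot product divided by the product of the two vector lengths.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))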

print(cosine([1, 0, 0], [0.9, 0.1, 0]))  # close to 1
print(cosine([1, 0, 0], [0, 1, 0]))      # 0
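The claim that cosine and Euclidean distance rank normalized vectors identically can be checked directly. The sketch below uses random stand-in vectors rather than real embeddings, and relies on the identity ||a - b||^2 = 2 - 2 cos(a, b) for unit-length vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-in vectors (not real embeddings), scaled to unit length.
docs = rng.normal(size=(100, 8))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query = rng.normal(size=8)
query /= np.linalg.norm(query)

cos = docs @ query                           # cosine, since every norm is 1
dist = np.linalg.norm(docs - query, axis=1)  # Euclidean distance to the query

# For unit vectors: squared distance = 2 - 2 * cosine.
identity_holds = np.allclose(dist**2, 2 - 2 * cos)

# Highest cosine first and smallest distance first give the same order.
by_cosine = np.argsort(-cos)
by_distance = np.argsort(dist)
same_ranking = np.array_equal(by_cosine, by_distance)

print(identity_holds, same_ranking)
```

Both checks should print True. If the vectors were not normalized first, the two rankings could disagree.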

A working example

Embedding a few sentences and finding the nearest neighbor takes very little code. The snippet below uses the sentence-transformers library because it runs locally, but the same shape works with any hosted embedding API.

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "How do I reset my password?",
    "Our refund policy lasts 30 days.",
    "The dashboard loads slowly on mobile.",
    "I forgot my login credentials.",
]
doc_vecs = model.encode(docs, normalize_embeddings=True)

query = "I can't sign in to my account"
q = model.encode([query], normalize_embeddings=True)[0]

scores = doc_vecs @ q  # since vectors are normalized, dot product = cosine
order = np.argsort(-scores)
for i in order[:2]:
    print(f"{scores[i]:.3f}  {docs[i]}")

The top two matches should be the password and login questions, even though neither shares a meaningful keyword with the query. Exact scores, and which of the two ranks first, depend on the model. That is the point: embeddings match meaning, not keywords.

Picking a model

Three factors matter: quality, dimension, and cost. For quality, the MTEB benchmark gives a useful ranking, but treat it as a hint, not a verdict; evaluate on your own data. For dimension, smaller vectors are faster and cheaper, but quality tends to plateau on hard queries. For cost, hosted models charge by usage, typically priced per million tokens; local models cost memory and CPU/GPU time instead.

Practical defaults: if you want a hosted model that just works, OpenAI’s text-embedding-3-small or Cohere’s embed-v3 are solid choices. If you want to run locally, bge-large-en and nomic-embed-text-v1.5 are strong open options. For multilingual content, pick a model trained on the languages you care about.

Embeddings power more than retrieval-augmented generation (RAG). Clustering with k-means or HDBSCAN lets you discover topics in a pile of support tickets. Classification with a tiny linear head on top of embeddings can outperform much bigger models for a fraction of the cost. Deduplication is as simple as flagging pairs above a similarity threshold. Routing in agent systems often uses embeddings to pick which tool or sub-agent should handle a query.
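As one illustration of the uses above, threshold-based deduplication might look like the sketch below. The function name near_duplicates, the toy 3-dimensional vectors, and the 0.9 threshold are all assumptions made for the example; a sensible threshold depends on your model and data.

```python
import numpy as np


def near_duplicates(vecs: np.ndarray, threshold: float = 0.9) -> list[tuple[int, int]]:
    """Return index pairs (i, j) with i < j whose cosine similarity is >= threshold."""
    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = unit @ unit.T                    # all pairwise cosine similarities
    i, j = np.triu_indices(len(unit), k=1)  # each unordered pair once, no self-pairs
    keep = sims[i, j] >= threshold
    return [(int(a), int(b)) for a, b in zip(i[keep], j[keep])]


# Toy vectors standing in for real embeddings.
toy = np.array([
    [1.0, 0.0, 0.0],
    [0.98, 0.05, 0.0],  # nearly the same direction as row 0
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
print(near_duplicates(toy))  # [(0, 1)]
```

This compares every pair, so the work grows quadratically with the number of items. For large collections, a nearest-neighbor index that only checks each item's closest neighbors is the usual substitute.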

Pitfalls that bite first-timers

Mixing models is the most common mistake. Embeddings from one model are not comparable to embeddings from another; even small version bumps can change the geometry. Pick one model, version it, and re-embed everything if you switch.
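One way to make this mistake hard to commit is to store a model identifier next to every vector and refuse to compare across identifiers. The sketch below is only an illustration: TaggedVector, safe_cosine, and the identifier strings are hypothetical names, not part of any library.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class TaggedVector:
    model_id: str       # hypothetical convention: model name plus version
    values: np.ndarray


def safe_cosine(a: TaggedVector, b: TaggedVector) -> float:
    """Cosine similarity that refuses to mix vectors from different models."""
    if a.model_id != b.model_id:
        raise ValueError(f"cannot compare {a.model_id!r} with {b.model_id!r}")
    denom = np.linalg.norm(a.values) * np.linalg.norm(b.values)
    return float(a.values @ b.values / denom)


x = TaggedVector("model-a@v1", np.array([1.0, 0.0]))
y = TaggedVector("model-a@v1", np.array([0.0, 1.0]))
z = TaggedVector("model-b@v1", np.array([1.0, 0.0]))

print(safe_cosine(x, y))  # same model, so the comparison is allowed
try:
    safe_cosine(x, z)
except ValueError as err:
    print("refused:", err)
```

The same idea carries over to a database: keep the model identifier as a column or as part of the index name, so a model switch forces an explicit re-embed.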

Chunking too aggressively hurts retrieval. Embedding ten-token snippets often produces vectors that carry too little context to match real queries. Aim for chunks of a few hundred tokens and overlap them slightly so important spans are not split.
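A minimal sketch of overlapping chunks follows. It counts whitespace-separated words as a rough stand-in for tokens, which is an assumption: real token counts come from the model's tokenizer. The name chunk_words and its default sizes are illustrative.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into chunks of up to `size` words.

    Each chunk after the first repeats the last `overlap` words of the
    previous chunk, so a span that straddles a boundary stays intact in
    at least one chunk.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


sample = " ".join(str(n) for n in range(10))
print(chunk_words(sample, size=4, overlap=1))
# ['0 1 2 3', '3 4 5 6', '6 7 8 9']
```

Splitting on sentence or paragraph boundaries instead of a fixed word count usually produces cleaner chunks, at the cost of uneven sizes.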

Treating cosine similarity as truth is another trap. A high similarity score does not mean the retrieved chunk answers the question; it only means the two texts are about similar things. Use a reranker or an LLM check on the top-k for high-stakes queries.

Ignoring normalization causes silent bugs. Some libraries return normalized vectors; some do not. If you use dot product as a shortcut for cosine, you need normalized vectors. Pick one convention in your codebase and stick with it.
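The bug is easy to reproduce with two toy vectors. In this sketch, normalize_rows is a made-up helper name and the numbers are chosen by hand so that vector length and direction disagree.

```python
import numpy as np


def normalize_rows(vecs: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit length; eps guards against all-zero rows."""
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.maximum(norms, eps)


raw = np.array([
    [3.0, 4.0],  # long vector, pointing away from the query
    [0.5, 0.0],  # short vector, pointing exactly along the query
])
query = np.array([1.0, 0.0])

print(raw @ query)                  # [3.  0.5] -> row 0 wins only because it is longer
print(normalize_rows(raw) @ query)  # [0.6 1. ] -> row 1 wins, matching cosine similarity
```

Normalizing once at write time, and asserting unit length when vectors are read back, keeps the dot-product shortcut safe.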

A small mental model

Picture a high-dimensional space where every meaning has a location. The embedding model is a function that turns text into a point in that space. Cosine similarity asks how close two points are in direction. Vector databases store millions of these points and answer “what is closest to this one?” quickly. Build that picture and embeddings stop feeling mysterious; they become a normal data type that happens to be very useful.