Text Embeddings: The Foundation of Semantic Search

What an embedding is, why cosine similarity works, how dimensionality and chunking choices affect retrieval, and a tiny numpy example you can run in your head.

7 min read · By Yash Kesharwani
Intermediate 11 min read

What you'll learn

  • What an embedding actually is, geometrically
  • Why cosine similarity is the standard similarity measure
  • How embedding models differ and what dimensionality buys you
  • Chunking strategies that retrieve better
  • Normalization and the small numpy example tying it together

Prerequisites

  • Some Python and array intuition
  • You have read What Is RAG? or know roughly what retrieval is

Search used to mean keywords. If the user typed “refund,” you found documents containing the word “refund.” That breaks the moment they type “money back” — same intent, no match.

Embeddings fix this. They turn text into vectors so that pieces of text with similar meaning sit close together in space. That single trick powers semantic search, retrieval-augmented generation, recommendation, deduplication, and a long list of other features.

This post explains what an embedding is, the math you actually need, and the design choices that make retrieval work in practice.

What an embedding is

An embedding is a fixed-length list of numbers — a vector — that represents a piece of text.

# Conceptual: each text maps to a vector of the same length
embed("dog") = [0.12, -0.04, 0.88, ..., 0.07]   # length 1536
embed("puppy") = [0.14, -0.02, 0.85, ..., 0.10] # similar to "dog"
embed("invoice") = [-0.31, 0.42, 0.05, ..., -0.6] # very different

The model that produces these vectors was trained so that texts with similar meaning produce vectors with similar direction. The numbers themselves are not interpretable — no dimension means “is animal” — but the geometry between vectors is.

Two consequences:

  • Search becomes a distance problem. Find the document vectors closest to the query vector.
  • Language barriers blur. Many modern embedding models are multilingual; “perro” and “dog” land near each other.

Cosine similarity

The standard way to compare embeddings is cosine similarity: the cosine of the angle between two vectors. It is a similarity rather than a distance, so a higher value means a closer match.

# cosine similarity, mathematically
# cos(θ) = (a · b) / (|a| * |b|)

import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

Values range from -1 (opposite directions) to 1 (identical direction). As a rough guide for text (the cutoffs shift from model to model, so calibrate them on your own data):

  • 0.85+ — nearly the same meaning
  • 0.6–0.85 — related topic
  • Below 0.5 — probably unrelated

Why cosine and not Euclidean distance? Because what we care about is direction, not magnitude. Two vectors pointing the same way but at different lengths are still “about the same thing.” Cosine ignores length and focuses on angle, which is what aligns with semantic meaning.
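
To make the direction-versus-magnitude point concrete, here is a minimal sketch with made-up vectors (the values are illustrative, not real embeddings). Scaling a vector by ten leaves its cosine similarity with the original at 1, while the Euclidean distance between the two becomes large.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = 10.0 * a                             # same direction, ten times the length

cos_ab = cosine(a, b)                    # the angle is unchanged by scaling
dist_ab = float(np.linalg.norm(a - b))   # Euclidean distance grows with the scale

print(f"cosine:    {cos_ab:.3f}")
print(f"euclidean: {dist_ab:.3f}")
```

Euclidean distance treats these two vectors as far apart even though they point the same way, which is why it is a poor fit when vector length carries no meaning.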

When vectors are normalized (rescaled to length 1), cosine similarity reduces to a plain dot product — fast and simple. Most production systems normalize once at insert time and use dot product at query time.
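
A small sketch of that pattern, using toy 2-dimensional vectors and a hypothetical helper named normalize_rows: the document matrix is normalized once, the query is normalized when it arrives, and a single matrix-vector product yields every score. The full cosine formula is computed alongside only to confirm that the two agree.

```python
import numpy as np

def normalize_rows(m):
    # Divide each row by its own L2 norm so every row has length 1.
    return m / np.linalg.norm(m, axis=1, keepdims=True)

raw_docs = np.array([[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]])
raw_query = np.array([6.0, 8.0])

index = normalize_rows(raw_docs)                 # done once, at insert time
query = raw_query / np.linalg.norm(raw_query)    # done per query

dot_scores = index @ query                       # one matrix-vector product

# Reference: the full cosine formula on the unnormalized vectors.
cos_scores = raw_docs @ raw_query / (
    np.linalg.norm(raw_docs, axis=1) * np.linalg.norm(raw_query)
)

print(np.allclose(dot_scores, cos_scores))
```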

Embedding models

A few practical points about choosing one.

Hosted vs open. Hosted models — OpenAI text-embedding-3-small, Voyage, Cohere — are dead simple: call an API, get a vector. Open models like all-MiniLM-L6-v2, bge-large, nomic-embed run locally for free but need more setup.

Domain matters. A model trained on web text may not embed legal documents well. Some providers offer domain-tuned variants; for serious applications, evaluate on your data.

You must use the same model for query and corpus. Vectors from different models are not comparable. Pick one and stick with it; switching means re-embedding everything.
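
One way to enforce this is to store the model name next to the vectors and refuse anything that does not match. The sketch below is illustrative only: VectorIndex, its methods, and the model names "model-a" and "model-b" are all hypothetical, not part of any real library.

```python
from dataclasses import dataclass, field

import numpy as np

@dataclass
class VectorIndex:
    model_name: str                      # the model every vector must come from
    dim: int                             # expected vector length for that model
    vectors: list = field(default_factory=list)

    def _check(self, vector, model_name):
        if model_name != self.model_name:
            raise ValueError(f"index built with {self.model_name!r}, got {model_name!r}")
        if len(vector) != self.dim:
            raise ValueError(f"expected {self.dim} dimensions, got {len(vector)}")

    def add(self, vector, model_name):
        self._check(vector, model_name)
        self.vectors.append(np.asarray(vector, dtype=float))

    def search(self, query, model_name):
        # Returns the position of the best-scoring stored vector.
        self._check(query, model_name)
        q = np.asarray(query, dtype=float)
        scores = [float(np.dot(q, v)) for v in self.vectors]
        return int(np.argmax(scores))

index = VectorIndex(model_name="model-a", dim=3)
index.add([1.0, 0.0, 0.0], "model-a")
index.add([0.0, 1.0, 0.0], "model-a")
best = index.search([0.9, 0.1, 0.0], "model-a")
```

Querying this index with a vector tagged "model-b" raises a ValueError instead of silently returning scores that mean nothing.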

Dimensionality

Embeddings come in different lengths — 384, 768, 1024, 1536, 3072. Higher dimensions usually mean more nuance, but also more storage, slower search, and diminishing returns.

A practical lens:

  • 384–768 — fine for many small projects, cheap and fast
  • 1024–1536 — the modern default for production RAG
  • 3072+ — useful for very large or nuanced corpora; storage cost is real
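
The storage side of that trade-off is simple arithmetic. The sketch below assumes vectors are stored as float32 (4 bytes per value) and counts only the raw vectors, ignoring index structures and metadata; index_size_gb is a hypothetical helper name.

```python
def index_size_gb(num_vectors, dims, bytes_per_value=4):
    # Raw vector storage only; float32 (4 bytes per value) is an assumption.
    return num_vectors * dims * bytes_per_value / 1e9

for dims in (384, 1536, 3072):
    size = index_size_gb(1_000_000, dims)
    print(f"{dims:>4} dims: {size:.2f} GB per million vectors")
```

Under those assumptions a million vectors take roughly 1.5 GB at 384 dimensions, 6.1 GB at 1536, and 12.3 GB at 3072.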

Some newer models support Matryoshka embeddings — you can truncate the vector and lose surprisingly little quality. Useful when you want to store full-fidelity vectors but query against a shorter prefix for speed.
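
Mechanically, truncation is a slice followed by a re-normalization, because the shortened prefix is no longer unit length. The sketch below shows only those mechanics on a made-up vector; the quality claim holds only for models actually trained in the Matryoshka style, and truncate_and_renormalize is a hypothetical helper name.

```python
import numpy as np

def truncate_and_renormalize(v, dims):
    # Keep the first `dims` values, then rescale back to unit length
    # so that dot product still equals cosine similarity.
    prefix = np.asarray(v, dtype=float)[:dims]
    return prefix / np.linalg.norm(prefix)

full = np.array([0.5, 0.5, 0.5, 0.5])        # unit length: sqrt(4 * 0.25) = 1
short = truncate_and_renormalize(full, 2)    # 2-dimensional, unit length again

print(short.shape, float(np.linalg.norm(short)))
```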

Chunking strategy

Embeddings are computed per chunk of text. How you split a document drastically affects what gets retrieved.

The naive approach — embed the whole document — fails because:

  • A long document covers many topics; the average vector represents none of them well
  • Even if it matches, you cannot tell the model which part answered the question

Common strategies:

Fixed-size chunks. Split by tokens or characters, with overlap.

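# Simple character-based chunking with overlap
def chunk(text, size=500, overlap=50):
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    chunks = []
    step = size - overlap   # overlap so boundary ideas survive
    for i in range(0, len(text), step):
        chunks.append(text[i:i + size])
        if i + size >= len(text):
            break   # the text is fully covered; skip a redundant tail chunk
    return chunks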

Easy, fast, decent baseline.

Structure-aware chunks. Split on headings, paragraphs, or markdown sections. Preserves logical units. Better for documentation and articles.
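
A minimal sketch of the heading-based variant, assuming the input is markdown whose sections start with "## " (H2) lines. The function name split_on_headings and the sample page are made up for illustration.

```python
import re

# Zero-width match at the start of any line that begins with "## ".
H2_BOUNDARY = re.compile(r"^(?=## )", re.MULTILINE)

def split_on_headings(markdown_text):
    # Split just before each H2 heading so the heading stays with its body.
    parts = H2_BOUNDARY.split(markdown_text)
    return [p.strip() for p in parts if p.strip()]

page = """Intro paragraph.

## Install
Run the installer.

## Configure
Edit the config file.
"""

sections = split_on_headings(page)
for s in sections:
    print(repr(s))
```

Each chunk now carries its own heading, which gives the embedding some context about what the section covers. Sections that run long can still be passed through a fixed-size splitter afterwards.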

Semantic chunks. Compute embeddings sentence by sentence, group adjacent sentences with similar embeddings. Highest quality, most complex.
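
A rough sketch of the grouping step, assuming the sentence embeddings already exist. The 2-dimensional vectors and the 0.7 threshold are invented for illustration; a real threshold would need tuning for the model in use.

```python
import numpy as np

def semantic_chunks(sentences, vectors, threshold=0.7):
    # vectors: one embedding per sentence, in the same order as `sentences`.
    if not sentences:
        return []
    unit = [v / np.linalg.norm(v) for v in vectors]
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        similarity = float(np.dot(unit[i - 1], unit[i]))
        if similarity >= threshold:
            chunks[-1].append(sentences[i])   # same topic: extend the current chunk
        else:
            chunks.append([sentences[i]])     # topic shift: start a new chunk
    return [" ".join(c) for c in chunks]

sentences = ["Dogs are loyal.", "Puppies need training.", "Taxes are due in April."]
vectors = [np.array([0.9, 0.1]), np.array([0.8, 0.2]), np.array([0.1, 0.9])]

chunks = semantic_chunks(sentences, vectors)
print(chunks)
```

The two dog sentences end up in one chunk and the tax sentence starts a new one, because the similarity between neighbors drops below the threshold at that boundary.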

Whichever you pick, keep chunks small enough to be coherent (a few hundred tokens) and large enough to carry context (not single sentences). Test on real queries — the right chunk size is the one that retrieves the chunks your users actually need.

Experiment. Take a documentation page and chunk it three ways: every 200 tokens, every 800 tokens, and split by H2 headings. Embed each set. Search for a question you know the answer to and look at what gets returned. You will see immediately why chunking is half the battle in RAG.

A tiny numpy example

Putting the pieces together with mock vectors.

import numpy as np

# Pretend we already have embeddings for four documents
docs = {
    "doc1": np.array([0.9, 0.1, 0.0]),    # "dogs are pets"
    "doc2": np.array([0.8, 0.2, 0.1]),    # "cats and dogs"
    "doc3": np.array([0.0, 0.9, 0.1]),    # "tax filing"
    "doc4": np.array([0.1, 0.0, 0.95]),   # "ocean currents"
}

# Normalize so dot product == cosine similarity
def normalize(v):
    return v / np.linalg.norm(v)

docs = {k: normalize(v) for k, v in docs.items()}

# Query: "puppies" — semantically near doc1 and doc2
query = normalize(np.array([0.85, 0.15, 0.05]))

# Rank by similarity
scores = {k: float(np.dot(query, v)) for k, v in docs.items()}
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

for name, score in ranked:
    print(f"{name}: {score:.3f}")
# output:
# doc1: 0.996
# doc2: 0.996
# doc3: 0.179
# doc4: 0.160
# (doc1 and doc2 tie at three decimals; doc1 is slightly higher before rounding)

Three lessons from a short script:

  1. The math is plain linear algebra. No magic.
  2. Normalization makes the dot product equal cosine — exploit that.
  3. The ranking is what you actually return; the raw scores are a sanity check.

Real systems do this with 1536-dimensional vectors over millions of documents, usually behind an approximate nearest-neighbor index rather than a full scan, but the core computation is identical.
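
At larger sizes the per-document loop is usually replaced by one matrix product. The sketch below is a brute-force version that reuses the mock vectors from the example; top_k is a hypothetical helper, and it scores every row exhaustively rather than using any index structure.

```python
import numpy as np

def top_k(query, matrix, k=2):
    # matrix: one unit-length document vector per row; query: unit length.
    scores = matrix @ query
    order = np.argsort(scores)[::-1][:k]     # indices of the k highest scores
    return [(int(i), float(scores[i])) for i in order]

rows = np.array([
    [0.9, 0.1, 0.0],    # "dogs are pets"
    [0.8, 0.2, 0.1],    # "cats and dogs"
    [0.0, 0.9, 0.1],    # "tax filing"
    [0.1, 0.0, 0.95],   # "ocean currents"
])
matrix = rows / np.linalg.norm(rows, axis=1, keepdims=True)

q = np.array([0.85, 0.15, 0.05])
q = q / np.linalg.norm(q)

hits = top_k(q, matrix, k=2)
print(hits)
```

The two dog-related rows (indices 0 and 1) come back first, matching the ranking from the dictionary version.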

What can go wrong

Stale embeddings. You change the embedding model but forget to re-embed the corpus. The new query vectors live in a different vector space from the old document vectors, so the similarity scores between them are meaningless. Always re-embed when you switch models.

Mismatched preprocessing. You lowercased and stripped punctuation at index time but not at query time. Small inconsistencies show up as missed matches.
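
The usual guard is to route both paths through a single function. In the sketch below the specific steps (lowercasing, stripping punctuation, collapsing whitespace) simply mirror the scenario above and are not a recommendation; what matters is that indexing and querying call the same preprocess function.

```python
import string

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def preprocess(text):
    # One function, called at both index time and query time.
    lowered = text.lower()
    no_punct = lowered.translate(_PUNCT_TABLE)
    return " ".join(no_punct.split())   # collapse repeated whitespace

indexed_text = preprocess("Refund Policy: 30-day money-back guarantee!")
query_text = preprocess("  REFUND policy, 30-day money-back guarantee ")

print(indexed_text)
print(indexed_text == query_text)
```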

Junk in your corpus. Headers, footers, navigation menus get embedded as if they were content. They pollute retrieval. Clean before chunking.
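
One simple heuristic, sketched below under the assumption that you have several pages from the same site: any line that appears on most pages is probably navigation or a footer. The function name strip_repeated_lines, the 0.8 cutoff, and the sample pages are all invented for illustration.

```python
from collections import Counter

def strip_repeated_lines(pages, min_fraction=0.8):
    # With fewer than two pages there is nothing to compare against.
    if len(pages) < 2:
        return list(pages)
    # Count on how many pages each distinct non-empty line appears.
    counts = Counter()
    for page in pages:
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    cutoff = min_fraction * len(pages)
    boilerplate = {line for line, n in counts.items() if n >= cutoff}
    cleaned = []
    for page in pages:
        kept = [
            line for line in page.splitlines()
            if line.strip() and line.strip() not in boilerplate
        ]
        cleaned.append("\n".join(kept))
    return cleaned

pages = [
    "Home | Docs | Login\nInstall with pip.\n(c) Example Corp",
    "Home | Docs | Login\nConfigure the API key.\n(c) Example Corp",
    "Home | Docs | Login\nRun the test suite.\n(c) Example Corp",
]

cleaned = strip_repeated_lines(pages)
print(cleaned)
```

Only the one content line per page survives. A heuristic like this can also remove legitimate text that happens to repeat, so it is worth spot-checking the output before embedding.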

Over-trusting cosine. A 0.78 similarity is not “the model thinks this is the right answer.” It is “this is the nearest vector I have.” Pair retrieval with a re-ranker or LLM filter for anything that matters.

Reflection. Pick any product you use that has search. Try a query in your own words, then try a perfect keyword match. If your own-words query still surfaces the right results, the system is probably using embeddings somewhere. If only the keyword match works, it is probably plain text search.

Recap

  • An embedding is a vector that captures meaning; similar texts have similar directions
  • Cosine similarity measures angle, which aligns with semantic similarity
  • Normalize vectors so dot product == cosine — faster and simpler
  • Dimensionality trades quality for storage and speed
  • Chunking is half the retrieval problem; structure-aware chunks are a strong choice for documentation and articles
  • Pin one embedding model and re-embed everything if you switch

Next steps

Embeddings are the math. Vector databases are the infrastructure that makes them queryable at scale.

→ Next: Vector Databases: A Practical Overview

Related: What Is RAG?, LLM Evaluation Basics, What Is an LLM?.

Questions or feedback? Email codeloomdevv@gmail.com.