RAG HyDE: Hypothetical Document Embeddings

Intermediate 9 min read

What you'll learn

✓ Why short queries embed poorly
✓ The HyDE idea in one paragraph
✓ How to wire HyDE into an existing RAG pipeline
✓ When HyDE helps and when it hurts
✓ Cheap evaluation tricks for HyDE

Prerequisites

• Comfortable with basic RAG and embeddings

What and Why

Standard RAG embeds the user query directly and searches the vector store for similar chunks. The problem is that queries are short, often vague, and written in a different register from the source documents. A three-word question rarely sits close to a paragraph of technical prose in embedding space, even when the paragraph contains the answer.

Hypothetical Document Embeddings, or HyDE, fixes this by asking an LLM to write a fake answer to the question first, then embedding that fake answer and using it as the search vector. The hypothetical document looks like the real documents, so it lands near them in embedding space.

Mental Model

Embedding models cluster documents by what they say, not by what someone might ask about them. A query and the matching passage often live in different regions of the vector space because they differ in vocabulary and length.

HyDE bridges that gap by translating the query into the shape of a document before search. Even if the generated text is partly wrong, its style, vocabulary, and length match the corpus, which is what the embedding cares about.
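The gap can be illustrated with a deliberately crude stand-in for an embedding model. The sketch below uses bag-of-words counts and cosine similarity, which only captures vocabulary overlap; real dense embedders capture much more, so treat this as an illustration of the idea rather than a measurement. The `toy_embed` function and the `real_chunk` text are assumptions made up for this example.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Toy 'embedding': lowercase word counts. NOT a real embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


query = "why is my pod crashlooping?"
# What an LLM might write as a hypothetical answer (may be partly wrong).
hypothetical = (
    "A pod enters CrashLoopBackOff when its container repeatedly exits. "
    "Common causes include OOM kills, failing probes"
)
# A made-up chunk standing in for a real document in the corpus.
real_chunk = (
    "When a container repeatedly exits, the pod enters CrashLoopBackOff. "
    "Inspect logs for OOM kills or failing liveness probes."
)

query_score = cosine(toy_embed(query), toy_embed(real_chunk))
hyde_score = cosine(toy_embed(hypothetical), toy_embed(real_chunk))
print(f"query vs chunk:        {query_score:.3f}")
print(f"hypothetical vs chunk: {hyde_score:.3f}")
```

With this toy scorer the hypothetical passage scores well above the raw query, because it shares the chunk's vocabulary while the query shares a single word.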

Hands-on Example

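A minimal HyDE pipeline makes two LLM calls (one to write the hypothetical passage, one to generate the final answer), plus one embedding call and one vector search. The function below covers the retrieval half: the first LLM call, the embedding, and the search.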

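def hyde_retrieve(query, llm, embedder, vector_db, k=5):
    """Retrieve the top-k chunks using a hypothetical answer as the search text."""
    # Ask the LLM for a passage that reads like a document answering the query.
    prompt = f"Write a short passage that answers: {query}"
    hypothetical = llm.generate(prompt)
    # Embed the hypothetical passage, not the raw query.
    vector = embedder.embed(hypothetical)
    # Search the real corpus with that vector.
    return vector_db.search(vector, k=k)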

The retrieved chunks then go into your normal answer generation prompt. Note that the hypothetical document is thrown away after embedding; it is never shown to the user.
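To show where the second LLM call fits, here is a minimal sketch of the full path from query to answer. It reuses the `llm.generate`, `embedder.embed`, and `vector_db.search` interfaces from the function above, and it assumes that `search` returns a list of chunk strings. The wording of the answer prompt and the `build_answer_prompt` helper are illustrative assumptions, not a prescribed template.

```python
def build_answer_prompt(query, chunks):
    """Assemble the final prompt from the real retrieved chunks only."""
    context = "\n\n".join(chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


def hyde_answer(query, llm, embedder, vector_db, k=5):
    """Sketch of HyDE end to end; assumes vector_db.search returns strings."""
    # LLM call 1: draft the hypothetical passage (used for search only).
    hypothetical = llm.generate(f"Write a short passage that answers: {query}")
    # Search with the embedding of the hypothetical, not of the query.
    chunks = vector_db.search(embedder.embed(hypothetical), k=k)
    # LLM call 2: answer from the real chunks; the hypothetical is dropped here.
    return llm.generate(build_answer_prompt(query, chunks))
```

The hypothetical text never reaches the answer prompt, so anything the first call got wrong cannot be quoted back to the user; it can only affect which chunks are retrieved.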

query: "why is my pod crashlooping?"
 |
 v
LLM generates hypothetical answer
 |
 v
"A pod enters CrashLoopBackOff when its
container repeatedly exits. Common causes
include OOM kills, failing probes..."
 |
 v
embedder
 |
 v
search vector ----> vector DB ----> top-k real chunks
                                        |
                                        v
                                  answer LLM

HyDE inserts a generation step before retrieval to bridge the query and document spaces

For zero-shot HyDE you can use a generic instruction. For domain HyDE you can prime the LLM with the style of your corpus, for example “Write a passage in the style of internal runbook documentation answering…”. The closer the style match, the better the retrieval tends to be.
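One way to keep both prompt variants in a single place is a small builder function. This is a sketch: the exact instruction wording, the `max_words` limit, and the function name `build_hyde_prompt` are assumptions for illustration, and you would tune them against your own corpus.

```python
def build_hyde_prompt(query, style=None, max_words=120):
    """Build a zero-shot or domain-primed HyDE prompt.

    style: optional description of the corpus register, for example
           "internal runbook documentation". None gives the generic prompt.
    max_words: assumed soft length cap so the passage stays chunk-sized.
    """
    query = query.strip()
    if style:
        instruction = (
            f"Write a passage in the style of {style} "
            "answering the question below."
        )
    else:
        instruction = "Write a short passage that answers the question below."
    return (
        f"{instruction}\n"
        f"Keep it under {max_words} words.\n\n"
        f"Question: {query}\n\n"
        "Passage:"
    )


generic = build_hyde_prompt("why is my pod crashlooping?")
domain = build_hyde_prompt(
    "why is my pod crashlooping?", style="internal runbook documentation"
)
```

Keeping the length cap close to your typical chunk size is a reasonable starting point, since the goal is for the hypothetical to resemble a chunk, not a full article.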

Trade-offs

HyDE shines on short, ambiguous, or natural-language queries against technical or formal documents. It is among the lowest-effort fixes for the “query too short” problem, since it needs no retraining and no re-indexing.

The cost is one extra LLM call per query, which adds latency and cost. On a real-time chat system that round trip can be roughly 200 to 800 milliseconds depending on the model, which users feel.

HyDE can also amplify hallucination during retrieval. If the LLM invents the wrong jargon, you retrieve chunks about the wrong topic with high confidence. This is most painful for queries about entities the model has never seen, like product codes or internal acronyms.

Compared to query expansion, where you append synonyms and rerun the query, HyDE produces a single richer vector instead of multiple searches. That can mean fewer searches and better matches, but each query pays for an LLM call, and failures are harder to debug because the intermediate text is normally discarded.

Practical Tips

Cache hypothetical documents by query. Identical questions should not pay the generation cost twice.
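A minimal in-memory version of that cache might look like the sketch below. The class name `HypotheticalCache` and the normalization rule (lowercase, collapsed whitespace) are assumptions; if your queries contain case-sensitive identifiers, you may want a gentler normalization. A production system would likely use a shared store with expiry instead of a dict.

```python
import hashlib


class HypotheticalCache:
    """In-memory cache of hypothetical documents, keyed by normalized query."""

    def __init__(self, generate):
        # generate: callable taking a query string and returning the
        # hypothetical passage (for example, a wrapper around the LLM call).
        self._generate = generate
        self._store = {}

    @staticmethod
    def _key(query):
        # Assumed normalization: lowercase and collapse runs of whitespace.
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, query):
        key = self._key(query)
        if key not in self._store:
            # Cache miss: pay the generation cost once.
            self._store[key] = self._generate(query)
        return self._store[key]
```

You can cache the embedding of the hypothetical in the same way, which also saves the embedding call on repeated queries.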

Use a small, fast model for the hypothetical step. The hypothetical is throwaway, so a cheap model is usually enough; save the expensive model for final answer generation.

Combine HyDE with a normal query embedding by averaging the two vectors, or by running both searches and merging results. This hedges against hallucinated jargon in the hypothetical.
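Both hedging options can be sketched in a few lines of plain Python. The first function averages the two vectors after normalizing them; the `hyde_weight` parameter is an assumption added so the blend can be tuned. The second merges two ranked result lists with reciprocal rank fusion, which is one common merging method chosen here for illustration; the text above does not prescribe a particular one. It assumes each search returns chunk IDs in rank order.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def blend_vectors(query_vec, hyde_vec, hyde_weight=0.5):
    """Weighted average of the query and hypothetical embeddings."""
    if len(query_vec) != len(hyde_vec):
        raise ValueError("vectors must have the same dimension")
    q = l2_normalize(query_vec)
    h = l2_normalize(hyde_vec)
    mixed = [(1 - hyde_weight) * a + hyde_weight * b for a, b in zip(q, h)]
    # Re-normalize so the result suits cosine or dot-product search.
    return l2_normalize(mixed)


def rrf_merge(ranked_lists, k=60, top_n=5):
    """Merge ranked lists of chunk IDs with reciprocal rank fusion.

    Each chunk scores 1 / (k + rank) per list it appears in; k=60 is a
    commonly used constant, treated here as a tunable assumption.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: -scores[cid])[:top_n]
```

Averaging costs one search but blurs the two signals together; merging costs two searches but keeps a chunk that only the plain query found, which is the case you care about when the hypothetical has drifted.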

Log the hypothetical document during development. Reading a few hundred of them is the fastest way to spot prompt issues, like the LLM consistently veering into the wrong sub-topic.
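A simple way to do this is to append one JSON record per query to a local file. The sketch below uses only the standard library; the record fields and the function names `log_hypothetical` and `load_hypotheticals` are assumptions for illustration.

```python
import json
import time


def log_hypothetical(path, query, hypothetical, retrieved_ids):
    """Append one JSON Lines record for a HyDE retrieval (development only)."""
    record = {
        "ts": time.time(),
        "query": query,
        "hypothetical": hypothetical,
        "retrieved_ids": list(retrieved_ids),
    }
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_hypotheticals(path):
    """Read the log back as a list of dicts for manual review."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Storing the retrieved chunk IDs next to the hypothetical makes it easier to tell whether a bad result came from the generated text or from the index.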

Skip HyDE when the query is already long, structured, or contains rare entities. A user pasting a stack trace does not need a hypothetical version of itself.
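A routing check like this can start as a crude heuristic. The sketch below is an assumption-heavy example, not a tested rule: the 30-word threshold, the idea of treating multi-line input as a pasted log, and the test for code-like tokens (letters mixed with digits, or all-caps words of three or more letters) are all illustrative choices. The `known_terms` parameter is a hypothetical allowlist for acronyms the model does know.

```python
def looks_like_rare_entity(token, known_terms=frozenset()):
    """Guess whether a token is a product code or an unfamiliar acronym."""
    core = token.strip(".,;:!?()[]\"'")
    if not core or core in known_terms:
        return False
    has_digit = any(c.isdigit() for c in core)
    has_alpha = any(c.isalpha() for c in core)
    is_acronym = len(core) >= 3 and core.isalpha() and core.isupper()
    return (has_digit and has_alpha) or is_acronym


def should_use_hyde(query, max_words=30, known_terms=frozenset()):
    """Return False for queries that are long, structured, or entity-heavy."""
    if "\n" in query.strip():
        # Multi-line input is probably a pasted log or stack trace.
        return False
    words = query.split()
    if len(words) > max_words:
        return False
    if any(looks_like_rare_entity(w, known_terms) for w in words):
        return False
    return True
```

The acronym rule is deliberately aggressive, so common terms such as "API" will skip HyDE unless you add them to `known_terms`. Check the routing decisions against logged queries before relying on it.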

Evaluate with the same recall test you would use for plain RAG: a fixed set of queries with known correct chunks, then measure top-k hit rate with and without HyDE. The lift is often visible within the first ten queries, but confirm it on a larger set, since a handful of queries is too few to rule out noise.
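The comparison can be sketched as a small harness. It assumes each retriever is a callable that takes a query string and returns chunk IDs in rank order, and that each labeled query comes with a set of known-correct chunk IDs. The function names are illustrative.

```python
def hit_rate_at_k(retrieve, labeled_queries, k=5):
    """Fraction of queries with at least one correct chunk in the top k.

    retrieve: callable(query) -> list of chunk IDs in rank order.
    labeled_queries: list of (query, set_of_correct_chunk_ids) pairs.
    """
    if not labeled_queries:
        raise ValueError("need at least one labeled query")
    hits = 0
    for query, relevant_ids in labeled_queries:
        top_k = retrieve(query)[:k]
        if set(top_k) & set(relevant_ids):
            hits += 1
    return hits / len(labeled_queries)


def compare_retrievers(plain_retrieve, hyde_retrieve, labeled_queries, k=5):
    """Run the same labeled set through both retrievers and report the lift."""
    plain = hit_rate_at_k(plain_retrieve, labeled_queries, k)
    hyde = hit_rate_at_k(hyde_retrieve, labeled_queries, k)
    return {"plain": plain, "hyde": hyde, "lift": hyde - plain}
```

It is also worth listing the individual queries where HyDE loses to the plain search, since those are usually the hallucinated-jargon cases described under Trade-offs.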

Wrap-up

HyDE is a small idea with a big payoff for RAG systems suffering from short, fuzzy queries. By generating a hypothetical answer and embedding that instead, you align the search vector with the shape of your documents. Pay attention to latency, hallucinated jargon, and caching, and treat HyDE as one tool in a retrieval toolkit rather than a default for every query.