HyDE — Hypothetical Document Embeddings
Build HyDE pipelines that use LLMs to generate hypothetical answers before retrieval. Bridge the query-document semantic gap for better RAG results.
Vector search works by comparing the embedding of a query with the embeddings of documents. However, queries and documents are fundamentally different: queries are typically short questions, while documents are longer, informative passages. This asymmetry can lead to poor matches. HyDE (Hypothetical Document Embeddings) bridges the gap by using an LLM to generate a "fake" (hypothetical) answer to the user's question and then using that answer, rather than the question itself, as the query for the vector database.
By searching with an informative answer instead of a short question, HyDE can improve semantic matching, because the search text now resembles the documents being retrieved.
Learning Goals
- Explain the semantic gap between queries and documents in RAG.
- Define the two-step HyDE process (Generation → Retrieval).
- Implement a HyDE pipeline using LangChain.
Core Concepts
1. The Asymmetry Problem
- Query: "How do I fix a leaky faucet?" (Vector represents a need).
- Document: "To repair a dripping tap, first turn off the main water valve..." (Vector represents a solution).
In high-dimensional space, the "need" vector may be far from the "solution" vector, even though the document answers the question.
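The gap can be illustrated with a toy similarity check. The sketch below is not a real embedding model: `toy_embed` is a hypothetical bag-of-words stand-in, and the answer-style sentence is invented for illustration. It only shows the general effect that text shaped like an answer tends to sit closer to the document than the question does.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Bag-of-words counts: a crude stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


question = "How do I fix a leaky faucet?"
document = "To repair a dripping tap, first turn off the main water valve."
# An answer-shaped text, invented for this illustration.
answer_like = "To repair a dripping tap, turn off the water valve and replace the worn washer."

question_sim = cosine(toy_embed(question), toy_embed(document))
answer_sim = cosine(toy_embed(answer_like), toy_embed(document))

print(f"question vs document:    {question_sim:.2f}")
print(f"answer-like vs document: {answer_sim:.2f}")
```

A real embedding model captures meaning far better than word counts, so the gap is usually smaller in practice, but the direction of the effect is the one HyDE relies on.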
2. The HyDE Solution
HyDE transforms the "Need" into a "Solution" before searching.
- Generate: The LLM receives the query and generates a plausible (but potentially hallucinated) answer.
- Embed: The hallucinated answer is embedded into a vector.
- Retrieve: The database returns real documents that are semantically similar to the hypothetical answer.
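The three steps above can be sketched in plain Python. Everything here is a stand-in chosen for illustration: `canned_llm` replaces a real LLM call, `toy_embed` replaces a real embedding model, and a sorted list replaces the vector database. The corpus is deliberately contrived so that the raw question ranks an unrelated document first.

```python
import math
import re
from collections import Counter
from typing import Callable


def toy_embed(text: str) -> Counter:
    """Bag-of-words counts; a placeholder for a real embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank(text: str, corpus: list[str], k: int = 1) -> list[str]:
    """Embed `text` and return the k most similar documents."""
    vector = toy_embed(text)
    ordered = sorted(corpus, key=lambda doc: cosine(vector, toy_embed(doc)), reverse=True)
    return ordered[:k]


def canned_llm(question: str) -> str:
    """Placeholder for an LLM call; returns a fixed, plausible-sounding answer."""
    return "To repair a dripping tap, shut the water valve and replace the worn washer."


def hyde_search(question: str, corpus: list[str],
                generate: Callable[[str], str], k: int = 1) -> list[str]:
    hypothetical = generate(question)   # 1. Generate a hypothetical answer
    return rank(hypothetical, corpus, k)  # 2. Embed it, 3. Retrieve real documents


corpus = [
    "To repair a dripping tap, first turn off the main water valve.",
    "A quick guide to how I do a weekly grocery shop on a budget.",
    "Leaky abstractions are a common problem in software design.",
]
question = "How do I fix a leaky faucet?"

baseline = rank(question, corpus)                   # search with the raw question
results = hyde_search(question, corpus, canned_llm)  # search with the hypothetical answer

print("raw question:", baseline[0])
print("HyDE:        ", results[0])
```

Note that `hyde_search` returns only documents from `corpus`; the hypothetical answer itself is discarded after it has been used as the search text.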
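HyDE Workflow
Query → LLM generates a hypothetical answer → the hypothetical answer is embedded → the vector database returns real documents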
Implementing HyDE with LangChain
Step 1
Initialize the LLM and the embedding model:

    from langchain_openai import ChatOpenAI, OpenAIEmbeddings

    # temperature=0 keeps the hypothetical answer as repeatable as possible
    llm = ChatOpenAI(temperature=0)
    embeddings = OpenAIEmbeddings()

Step 2
Build a chain that generates the fake document:

    from langchain_core.prompts import ChatPromptTemplate
    from langchain_core.output_parsers import StrOutputParser

    hyde_prompt = ChatPromptTemplate.from_template(
        "Please write a detailed technical paragraph answering this question: {question}"
    )
    # Prompt -> LLM -> plain string
    hyde_chain = hyde_prompt | llm | StrOutputParser()

Step 3
Use a custom function or a chain to perform the two-step search:

    # vector_store is assumed to be an existing LangChain vector store,
    # already populated with your documents and built with `embeddings`.
    def hyde_retriever(query):
        # 1. Generate hypothetical doc
        fake_doc = hyde_chain.invoke({"question": query})
        # 2. Search using the fake doc as the embedding query
        return vector_store.similarity_search(fake_doc, k=3)
Example: Zero-Shot Domain Adaptation
If you ask a RAG system about a niche topic that its retriever was not specifically trained or tuned for, a basic search might fail. HyDE works well here because the LLM can use its general knowledge to "imagine" what a relevant document would look like, which is often enough to guide the vector search to the correct technical manual in your database.
Common Mistakes
- Trusting the Hypothetical Doc: Never show the hypothetical document to the user. It is purely an internal map used for retrieval and may contain hallucinated details. The final answer must always be grounded in the real documents found.
- Using Weak LLMs: If the LLM generating the hypothetical answer is poor (e.g., it produces off-topic or random text), the resulting vector will point the search at the wrong documents. Use a capable model for the generation step.
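Both mistakes can be guarded against in code. The sketch below is one possible approach, not part of HyDE itself: `retrieval_text`, `build_grounded_prompt`, and the `MIN_HYPOTHETICAL_WORDS` threshold are hypothetical names and values introduced here. The first function falls back to the raw question when generation fails or returns very little text; the second builds the final answer prompt from retrieved documents only, so the hypothetical document never reaches the user.

```python
from typing import Callable

# Hypothetical threshold: below this, the generated text is treated as unusable.
MIN_HYPOTHETICAL_WORDS = 8


def retrieval_text(question: str, generate: Callable[[str], str]) -> str:
    """Return the text to embed: the hypothetical doc, or the raw question as a fallback."""
    try:
        hypothetical = generate(question).strip()
    except Exception:
        # Generation failed; searching with the question is better than nothing.
        return question
    if len(hypothetical.split()) < MIN_HYPOTHETICAL_WORDS:
        return question
    return hypothetical


def build_grounded_prompt(question: str, retrieved_docs: list[str]) -> str:
    """Build the final answer prompt from retrieved documents only."""
    context = "\n\n".join(
        f"[{i}] {doc}" for i, doc in enumerate(retrieved_docs, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

A word-count check is only a rough proxy for quality; it catches empty or truncated generations but not fluent, off-topic ones.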
Recap
- HyDE aligns queries with documents by creating a "bridge" hypothetical answer.
- It is well suited to broad or abstract questions where keyword overlap with the documents is low.
- The pattern relies on the vector proximity between two informative texts: the hypothetical answer and the real document.
Knowledge Check
What is the primary purpose of generating a 'fake' document in HyDE?