Hybrid Search with BM25 and EnsembleRetriever
Build hybrid search combining dense vector similarity with sparse BM25 keyword matching. Use LangChain's EnsembleRetriever to weight and combine both approaches.
Learning Goals
- Build hybrid search using dense vectors and BM25
- Implement EnsembleRetriever with configurable weights
In the previous lessons, we focused on Dense Retrieval (using vector embeddings). Dense retrieval is great at capturing semantic meaning, but it can sometimes fail on specific keyword searches (like technical IDs, product names, or rare acronyms).
Hybrid Search combines the strengths of Dense Retrieval with Sparse Retrieval (like BM25, which is based on keyword frequency). LangChain's EnsembleRetriever makes it easy to fuse these two approaches together to get the best of both worlds.
Learning Goals
- Define Sparse vs. Dense retrieval and why they are complementary.
- Implement a BM25Retriever for keyword-based search.
- Combine multiple retrievers using LangChain's EnsembleRetriever.
Core Concepts
1. Dense vs. Sparse
- Dense (Vector Search): Finds "Meaning." Query: "How to fix a vehicle" matches "Automobile repair guide."
- Sparse (Keyword Search): Finds "Exact Words." Query: "XJ-9000 motherboard" matches documents containing that exact part number.
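To make the contrast concrete, here is a minimal pure-Python sketch of BM25-style scoring. It is an illustration, not the code that rank_bm25 or LangChain runs: the whitespace tokenizer, the parameter values k1=1.5 and b=0.75, the non-negative IDF variant, and the three sample documents are all assumptions chosen for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive tokenizer (an assumption): lowercase and split on whitespace.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with a basic BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: how many documents contain each term.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            tf = term_freq[term]
            if tf == 0:
                continue  # terms absent from the document contribute nothing
            n = doc_freq[term]
            # Non-negative IDF variant: rare terms get a larger weight.
            idf = math.log((n_docs - n + 0.5) / (n + 0.5) + 1)
            # Term-frequency saturation with document-length normalization.
            norm = tf + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical sample documents for illustration.
docs = [
    "Automobile repair guide",
    "Replacing the XJ-9000 motherboard",
    "Network timeout troubleshooting",
]

keyword_scores = bm25_scores("XJ-9000 motherboard", docs)
paraphrase_scores = bm25_scores("How to fix a vehicle", docs)
print(keyword_scores)     # only the second document scores above zero
print(paraphrase_scores)  # every score is zero: no words are shared
```

The part-number query hits only the document that contains those exact tokens, while the paraphrased query scores zero everywhere because "vehicle" and "automobile" share no characters as far as a keyword index is concerned. That second case is the gap dense retrieval fills.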
2. Reciprocal Rank Fusion (RRF)
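How do you combine a vector score (e.g., 0.82) with a BM25 score (e.g., 14.5)? The two scales are not comparable, so you can't just add them. EnsembleRetriever uses weighted Reciprocal Rank Fusion (RRF), which ignores the raw scores and looks only at the rank of each document in each list. Every retriever that returns a document contributes weight / (rank + c) to that document's fused score, where c is a smoothing constant, and the final list is sorted by the summed score.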
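A small worked example may help here. The sketch below fuses two ranked lists by giving each document weight / (rank + c) from every list it appears in and summing the contributions. The function name weighted_rrf and the document IDs are hypothetical, and c = 60 is the value commonly used with RRF (assumed here to match LangChain's default); this is not a copy of LangChain's internal code.

```python
from collections import defaultdict


def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse ranked lists of document IDs with weighted Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        # Ranks start at 1, so the top hit contributes weight / (1 + c).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical rankings returned by each retriever (best match first).
dense_ranking = ["A", "B", "C"]
sparse_ranking = ["C", "A", "D"]

fused = weighted_rrf([dense_ranking, sparse_ranking], weights=[0.7, 0.3])
print(fused)  # ['A', 'C', 'B', 'D']
```

Document A wins because both retrievers rank it highly. C overtakes B even though the dense retriever placed it last, because the sparse retriever put it first. D, found only by the lower-weighted retriever, comes last.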
Visualizing Hybrid Fusion
Implementing Hybrid Search
Step 1
Install the rank_bm25 package with pip install rank_bm25. Then initialize the retriever with your document chunks:

```python
from langchain_community.retrievers import BM25Retriever

# BM25 needs the raw text chunks
sparse_retriever = BM25Retriever.from_texts(all_text_chunks)
sparse_retriever.k = 3  # return the top 3 keyword matches
```

Step 2
Initialize your standard vector store retriever:

```python
# vector_store is the vector store built in the previous lessons
dense_retriever = vector_store.as_retriever(search_kwargs={"k": 3})
```

Step 3
Combine them and assign weights (e.g., 70% Dense, 30% Sparse):

```python
from langchain.retrievers import EnsembleRetriever

# weights follow the same order as the retrievers list
ensemble_retriever = EnsembleRetriever(
    retrievers=[dense_retriever, sparse_retriever],
    weights=[0.7, 0.3],
)
```

Step 4
Query the ensemble retriever as you would any other retriever:

```python
docs = ensemble_retriever.invoke("How to reset XJ-9000?")
```
Example: Technical Support Bot
Imagine a user asks: "My server is throwing a 502 error."
- Dense Search might find articles about general "connection issues" or "network timeouts."
- Sparse Search (BM25) will specifically find the troubleshooting guide for "502 error."
- Hybrid Search ensures that the user gets the guide for the specific error code while also considering the broader context of server connectivity.
Common Mistakes
- Equal Weights for all use cases: If your data is highly technical (full of part numbers and codes), increase the weight of the Sparse retriever. If it's conversational, favor the Dense retriever.
- Forgetting to update BM25: Unlike vector stores, which are dynamic, BM25Retriever.from_texts creates a static index in memory. If your knowledge base changes, you must recreate the BM25 retriever.
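To see why the weight advice above matters, the following sketch fuses the same two rankings under two different weight settings. It is a toy illustration with hypothetical document IDs and a hand-written fusion helper (weight / (rank + c) summed per document, with c = 60 assumed), not a call into LangChain.

```python
from collections import defaultdict


def fuse(ranked_lists, weights, c=60):
    """Weighted Reciprocal Rank Fusion over lists of document IDs."""
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical results for the query "My server is throwing a 502 error."
dense = ["network-timeouts", "connection-issues", "502-guide"]
sparse = ["502-guide", "connection-issues"]

dense_heavy = fuse([dense, sparse], weights=[0.7, 0.3])
sparse_heavy = fuse([dense, sparse], weights=[0.3, 0.7])

print(dense_heavy[0])   # connection-issues
print(sparse_heavy[0])  # 502-guide
```

With the dense retriever favored, the general connectivity article comes out on top. Shifting weight to the sparse retriever moves the exact "502" guide to first place, which is usually what a corpus full of error codes and part numbers needs.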
Recap
- Hybrid search combines semantic (Dense) and keyword (Sparse) retrieval.
- BM25 is the standard algorithm for keyword-based retrieval in RAG.
- EnsembleRetriever uses RRF to merge results from multiple sources into a single, high-quality list.
Knowledge Check
Which retrieval method is better for finding a specific product code like 'SKU-4021'?