Cross-Encoder Re-Ranking
Add cross-encoder re-ranking to improve retrieval precision. Learn how to combine bi-encoder retrieval (fast, efficient) with cross-encoder re-ranking (accurate, expensive) for the best of both worlds.
Learning Goals
- Explain the difference between Bi-Encoders and Cross-Encoders.
- Combine bi-encoder retrieval with cross-encoder re-ranking in a two-stage pipeline.
- Implement re-ranking by integrating the Cohere Re-ranker into a LangChain pipeline.
- Evaluate the impact of re-ranking on retrieval precision.
A standard bi-encoder (vector) retrieval system is fast but sometimes lacks precision. It embeds queries and documents independently into a shared vector space, so the distance between two vectors is only an approximation of how relevant a document is to a query. Cross-Encoders take a different approach: they process the query and a document together and output a single relevance score.
Cross-Encoders are too slow to search millions of documents, but they are highly accurate when re-ranking a small subset of results (e.g., the top 25) returned by a vector store.
Core Concepts
1. Bi-Encoder (Retrieval)
In Modules 3 and 4, we used Bi-Encoders. They embed queries and documents independently, so document embeddings can be computed once and indexed ahead of time. Relevance is measured by the distance (or similarity) between two fixed vectors.
- Pros: Fast (millisecond search), scalable.
- Cons: Less precise; misses nuanced relationships.
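To make the independence concrete, here is a toy sketch of bi-encoder search in pure Python. The `embed` function is a hypothetical bag-of-words stand-in for a real embedding model; what carries over is the structure: documents are encoded once ahead of time, only the query is encoded at search time, and relevance is the cosine similarity between two fixed vectors.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Hypothetical toy "embedding": a bag-of-words count vector.
    # Like a real bi-encoder, it encodes one text without seeing the other.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = [
    "liquid hydrogen has a low density",
    "the boiling point of liquid nitrogen",
    "stock prices rose today",
]
# Offline step: document vectors are computed once and stored (the "index").
doc_vectors = [embed(d) for d in docs]


def search(query: str, k: int = 2) -> list[str]:
    q = embed(query)  # only the query is encoded at search time
    ranked = sorted(range(len(docs)), key=lambda i: cosine(q, doc_vectors[i]), reverse=True)
    return [docs[i] for i in ranked[:k]]


results = search("density of liquid hydrogen")
```

Here `results` starts with the liquid-hydrogen document. Note that the score never looks at word order or at how query terms relate to each other inside the document; that is the precision gap a Cross-Encoder closes.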
2. Cross-Encoder (Re-ranking)
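A Cross-Encoder takes the query and the document together as a single input to the transformer. It "sees" the interaction between words in the query and words in the document simultaneously.
- Pros: Highly accurate; captures fine-grained relevance that independently computed embeddings miss.
- Cons: Slow and computationally expensive. Every query-document pair needs its own forward pass at query time, and nothing can be precomputed.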
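The sketch below illustrates the cross-encoder's interface and cost profile with a hypothetical toy scorer (`ToyCrossEncoder` is not a real model). Its `predict` method loosely mirrors sentence-transformers' `CrossEncoder.predict`, which also takes a list of (query, document) pairs; the scoring logic is only a placeholder for a transformer forward pass over the concatenated pair.

```python
class ToyCrossEncoder:
    """Hypothetical stand-in for a cross-encoder (not a real model)."""

    def __init__(self) -> None:
        self.calls = 0  # number of (query, document) pairs scored so far

    def score(self, query: str, doc: str) -> float:
        self.calls += 1
        q_tokens = query.lower().split()
        d_tokens = doc.lower().split()
        # Terms shared by the query and the document.
        overlap = len(set(q_tokens) & set(d_tokens))
        # Query word pairs that also occur in the same order in the document.
        d_bigrams = set(zip(d_tokens, d_tokens[1:]))
        phrase_hits = sum(1 for pair in zip(q_tokens, q_tokens[1:]) if pair in d_bigrams)
        return overlap + 2.0 * phrase_hits

    def predict(self, pairs: list[tuple[str, str]]) -> list[float]:
        # One scoring call per (query, document) pair; nothing is precomputed.
        return [self.score(q, d) for q, d in pairs]


ce = ToyCrossEncoder()
scores = ce.predict([
    ("liquid hydrogen density", "the density of liquid hydrogen"),
    ("liquid hydrogen density", "hydrogen and liquid density"),
])
# scores == [5.0, 3.0]: both documents share all three terms,
# but only the first contains the phrase "liquid hydrogen".
```

The `calls` counter is the point: scoring 1,000 candidates for one query means 1,000 model calls, while a bi-encoder only has to encode the query.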
3. The Two-Stage Architecture
To get both speed and accuracy, we use a two-stage approach:
- Stage 1 (Retrieval): Use a fast vector store to find a candidate pool (typically the top 25-50).
- Stage 2 (Re-ranking): Use a Cross-Encoder to re-score those candidates and keep the final top 5.
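The two stages can be sketched as a single library-agnostic function. This is an illustration under simplifying assumptions: in a real system, Stage 1 is a nearest-neighbour lookup against precomputed vectors rather than a full scan, so here a cheap pairwise scoring function stands in for it. The names `retrieve_and_rerank`, `fast_score`, and `precise_score` are hypothetical.

```python
from typing import Callable, List, Sequence

Scorer = Callable[[str, str], float]


def retrieve_and_rerank(
    query: str,
    corpus: Sequence[str],
    fast_score: Scorer,
    precise_score: Scorer,
    k: int = 50,
    top_n: int = 5,
) -> List[str]:
    # Stage 1 (retrieval): rank everything with the cheap scorer, keep k candidates.
    candidates = sorted(corpus, key=lambda d: fast_score(query, d), reverse=True)[:k]
    # Stage 2 (re-ranking): the expensive scorer only sees the k candidates.
    reranked = sorted(candidates, key=lambda d: precise_score(query, d), reverse=True)
    return reranked[:top_n]


precise_calls: List[str] = []


def fast(q: str, d: str) -> float:
    return float(len(d))  # toy cheap scorer: longer is "better"


def precise(q: str, d: str) -> float:
    precise_calls.append(d)  # record how often the expensive scorer runs
    return -float(len(d))  # toy precise scorer that disagrees with the cheap one


result = retrieve_and_rerank("q", ["a", "bb", "ccc", "dddd"], fast, precise, k=3, top_n=2)
# result == ["bb", "ccc"]; the precise scorer ran only 3 times (once per candidate)
```

The toy run also shows the main design constraint: Stage 2 can only re-order what Stage 1 found. "a" would have won Stage 2, but it never made the shortlist, which is why the candidate pool must be large enough to preserve recall.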
Re-ranking Pipeline
Implementing Re-ranking with Cohere
Step 1
You'll need the Cohere partner package:
```shell
pip install langchain-cohere
```
Step 2
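Set up the CohereRerank compressor. Note: you need a COHERE_API_KEY, which the compressor reads from the environment by default.
```python
from langchain_cohere import CohereRerank

# Initialize the re-ranker model (reads COHERE_API_KEY from the environment).
# top_n sets how many re-ranked documents are returned: the final top 5.
compressor = CohereRerank(model="rerank-english-v3.0", top_n=5)
```
Step 3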
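Wrap your base retriever with the ContextualCompressionRetriever:
```python
from langchain.retrievers import ContextualCompressionRetriever

# `vector_store` is assumed to be the vector store built in earlier modules.
# Stage 1: fetch the top 25 candidates by vector similarity.
base_retriever = vector_store.as_retriever(search_kwargs={"k": 25})

# Stage 2: re-rank those candidates with the Cohere cross-encoder.
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever,
)
```
Step 4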
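The output will now be the most relevant chunks as judged by the Cross-Encoder:
```python
query = "What is the specific gravity of liquid hydrogen?"
docs = compression_retriever.invoke(query)

# CohereRerank stores each document's score in metadata["relevance_score"].
for doc in docs:
    print(doc.metadata["relevance_score"], doc.page_content[:80])
```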
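Conceptually, `ContextualCompressionRetriever` just chains the two stages: it calls the base retriever, then passes the results and the query to the compressor's `compress_documents` method. The class below is a simplified, hypothetical re-implementation for illustration only (the real class also handles callbacks and async); `StubRetriever` and `StubCompressor` are toy stand-ins so the sketch runs without a vector store or API key.

```python
class CompressionRetrieverSketch:
    """Simplified, hypothetical version of ContextualCompressionRetriever.invoke."""

    def __init__(self, base_retriever, base_compressor):
        self.base_retriever = base_retriever
        self.base_compressor = base_compressor

    def invoke(self, query: str) -> list:
        # Stage 1: the base retriever returns the candidate pool (k=25 in Step 3).
        docs = self.base_retriever.invoke(query)
        if not docs:
            return []
        # Stage 2: the compressor (CohereRerank in Step 2) re-scores the pool
        # against the query and returns only the top documents.
        return list(self.base_compressor.compress_documents(docs, query))


# Toy stand-ins (hypothetical, not LangChain classes).
class StubRetriever:
    def __init__(self, docs):
        self.docs = docs

    def invoke(self, query):
        return list(self.docs)


class StubCompressor:
    def __init__(self, top_n):
        self.top_n = top_n

    def compress_documents(self, documents, query):
        # Toy relevance: how many query words occur (as substrings) in each document.
        words = query.lower().split()

        def score(doc):
            return sum(w in doc.lower() for w in words)

        return sorted(documents, key=score, reverse=True)[: self.top_n]


retriever = CompressionRetrieverSketch(
    StubRetriever(["boiling point of nitrogen", "specific gravity of liquid hydrogen", "rocket fuel"]),
    StubCompressor(top_n=1),
)
top = retriever.invoke("specific gravity of liquid hydrogen")
```

Because the wrapper only relies on `invoke` and `compress_documents`, you can swap in a different re-ranker without touching the base retriever.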
Example: Fact-Checking Complex Data
If a user asks for a specific number buried in a technical table, a vector search might retrieve 10 similar-looking tables. A Cross-Encoder scores each of those tables jointly with the query, weighing its column headers and values against the question, so the table that actually contains the number is much more likely to be ranked first.
Common Mistakes
- Re-ranking too many documents: Don't try to re-rank 500 documents. Cost and latency grow roughly linearly with the number of documents sent to the Cross-Encoder. A pool of 25-50 candidates is usually a good balance between recall and speed.
- Ignoring the score filter: Some re-rankers return a relevance score with each document. Combine re-ranking with a threshold filter to discard documents that even the Cross-Encoder judges irrelevant.
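A score filter can be a one-line post-processing step on the re-ranked output. This sketch assumes the score lives in `metadata["relevance_score"]`, which is where `CohereRerank` stores it; the `Doc` dataclass is a minimal stand-in for LangChain's `Document` so the example runs on its own, and the 0.5 threshold and example scores are placeholders, not recommended values.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for langchain_core.documents.Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def filter_by_relevance(docs, threshold: float = 0.5):
    # Keep documents whose re-ranker score clears the threshold.
    # A document without a score is treated as scoring 0.0.
    return [d for d in docs if d.metadata.get("relevance_score", 0.0) >= threshold]


# Illustrative scores only.
reranked = [
    Doc("table with liquid hydrogen data", {"relevance_score": 0.91}),
    Doc("unrelated table", {"relevance_score": 0.08}),
]
kept = filter_by_relevance(reranked, threshold=0.5)
```

The same function works unchanged on the output of `compression_retriever.invoke(query)`, since LangChain documents expose the same `metadata` dict. Tune the threshold on your own queries, because score scales differ between re-ranking models.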
Recap
- Cross-Encoders judge relevance significantly more accurately than standard vector search, but cost far more per document.
- The two-stage "Retrieve and Re-rank" pattern is a standard choice for production RAG.
- Cohere's Re-ranker is a powerful, ready-to-use implementation for LangChain.
Knowledge Check
Why don't we use Cross-Encoders for the initial search across millions of documents?