Retrieval-Augmented Generation (RAG) — From Fundamentals to Production-Ready Agentic RAG Systems

Cross-Encoder Re-Ranking

Add cross-encoder re-ranking to improve retrieval precision. Learn how to combine bi-encoder retrieval (fast, efficient) with cross-encoder re-ranking (accurate, expensive) for the best of both worlds.

Learning Goals

Implement cross-encoder re-ranking for precision
Combine bi-encoder retrieval and cross-encoder re-ranking in a single pipeline

Cross-Encoder Re-Ranking

A standard bi-encoder (vector) retrieval system is fast but sometimes lacks precision. It maps queries and documents independently into a shared vector space, so the distance between two vectors only approximates how relevant a document is to a query. Cross-Encoders take a different approach: they process the query and a document together and output a single relevance score for the pair.

Cross-Encoders are too slow to search a million documents, because every query-document pair needs its own pass through the model. They are, however, well suited to re-ranking a small subset of results (e.g., the top 25) returned by a vector store.

Learning Goals

Explain the difference between Bi-Encoders and Cross-Encoders.
Integrate the Cohere Re-ranker into a LangChain pipeline.
Evaluate the impact of re-ranking on retrieval precision.

Core Concepts

1. Bi-Encoder (Retrieval)

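In Modules 3 and 4, we used Bi-Encoders. They embed queries and documents independently, so document vectors can be computed once and indexed ahead of time. Relevance is then estimated from the distance (for example, cosine similarity) between the query vector and each document vector.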

Pros: Fast (millisecond search), scalable.
Cons: Less precise; misses nuanced relationships.
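
To make the "independent embedding" idea concrete, here is a minimal sketch. It assumes a toy bag-of-words `embed` function as a stand-in for a real bi-encoder model; the function names and the sample texts are illustrative and not part of the course code.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a bi-encoder: a bag-of-words count vector.

    A real bi-encoder would return a dense vector from a neural model.
    """
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


documents = [
    "liquid hydrogen is stored at very low temperature",
    "the stock market closed higher today",
]

# Documents are embedded once, ahead of time, without seeing any query.
doc_vectors = [embed(d) for d in documents]

# At query time only the query is embedded; scoring is a cheap vector comparison.
query_vector = embed("density of liquid hydrogen")
scores = [cosine(query_vector, v) for v in doc_vectors]
best_index = max(range(len(documents)), key=scores.__getitem__)
```

Because the query never interacts with the document text directly, the score can only reflect what survived in each vector, which is where the loss of nuance comes from.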

2. Cross-Encoder (Re-ranking)

A Cross-Encoder takes both the Query and the Document as a single input to the transformer. It "sees" the interaction between words in the query and words in the document simultaneously.

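Pros: Typically more accurate at judging relevance than comparing independent embeddings.
Cons: Slow and computationally expensive, since scores cannot be precomputed and each query-document pair requires a full pass through the model.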

3. The Two-Stage Architecture

To get both speed and accuracy, we use a two-stage approach:

Stage 1 (Retrieval): Use a fast vector store to find the Top 25 candidates.
Stage 2 (Re-ranking): Use a Cross-Encoder to re-order those 25 candidates and pick the final Top 5.
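
The two stages can be sketched independently of any library. The example below is a rough illustration under stated assumptions: `word_overlap` and `phrase_match` are hypothetical stand-ins for a vector search and a Cross-Encoder, and the tiny corpus exists only for the demo.

```python
from typing import Callable, List, Tuple


def retrieve_then_rerank(
    query: str,
    corpus: List[str],
    cheap_score: Callable[[str, str], float],
    pair_score: Callable[[str, str], float],
    retrieve_k: int = 25,
    final_k: int = 5,
) -> List[Tuple[str, float]]:
    """Two-stage search: cheap scoring over everything, costly scoring over a few."""
    # Stage 1: rank the whole corpus with the cheap scorer and keep retrieve_k.
    candidates = sorted(
        corpus, key=lambda doc: cheap_score(query, doc), reverse=True
    )[:retrieve_k]
    # Stage 2: run the expensive pair scorer only on the surviving candidates.
    rescored = [(doc, pair_score(query, doc)) for doc in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:final_k]


def word_overlap(query: str, doc: str) -> float:
    """Hypothetical stand-in for vector search: count of shared words."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def phrase_match(query: str, doc: str) -> float:
    """Hypothetical stand-in for a Cross-Encoder: also rewards exact phrase order."""
    bonus = 10.0 if query.lower() in doc.lower() else 0.0
    return word_overlap(query, doc) + bonus


corpus = [
    "hydrogen liquid fuel tanks are heavy",
    "liquid hydrogen boils at 20 K",
    "weather report for tomorrow",
]
results = retrieve_then_rerank(
    "liquid hydrogen", corpus, word_overlap, phrase_match, retrieve_k=2, final_k=1
)
```

In this toy run the cheap scorer cannot tell the first two documents apart, while the pair scorer, which looks at query and document together, promotes the one containing the exact phrase.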

Figure: Re-ranking Pipeline

Implementing Re-ranking with Cohere

Step 1

You'll need the Cohere partner package:

pip install langchain-cohere

Step 2

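Set up the CohereRerank compressor. Note: You need a COHERE_API_KEY, which the client reads from the environment.

from langchain_cohere import CohereRerank

# Initialize the re-ranker model.
# top_n sets how many documents survive re-ranking; 5 matches the
# "final Top 5" in the two-stage description above.
compressor = CohereRerank(model="rerank-english-v3.0", top_n=5)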

Step 3

Wrap your base retriever with the ContextualCompressionRetriever:

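from langchain.retrievers import ContextualCompressionRetriever

# vector_store is assumed to be the vector store built in the earlier modules.
# Stage 1: fetch a pool of 25 candidates for the re-ranker to score.
base_retriever = vector_store.as_retriever(search_kwargs={"k": 25})

# Stage 2: the compressor (re-ranker) re-orders and trims the pool.
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever,
)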

Step 4

Query the wrapped retriever. The output will now be the most relevant chunks as judged by the Cross-Encoder, ordered from most to least relevant:

query = "What is the specific gravity of liquid hydrogen?"
docs = compression_retriever.invoke(query)
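
To see what the re-ranker actually returned, it can help to print each chunk with its score. The helper below is a small sketch: `summarize_results` is a hypothetical name, and it assumes each returned document exposes `page_content` and a `metadata` dict with a `relevance_score` entry, which is how the Cohere integration appears to attach scores. If your version stores the score elsewhere, adjust the key. The demo objects and their scores are placeholders, not real output.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List


def summarize_results(docs: Iterable[Any], preview_chars: int = 80) -> List[str]:
    """Build one line per document: rank, score (if present), and a text preview."""
    lines = []
    for rank, doc in enumerate(docs, start=1):
        # Assumption: the re-ranker stores its score under this metadata key.
        score = doc.metadata.get("relevance_score")
        score_text = f"{score:.3f}" if isinstance(score, (int, float)) else "n/a"
        preview = doc.page_content[:preview_chars].replace("\n", " ")
        lines.append(f"{rank}. score={score_text} | {preview}")
    return lines


# Placeholder documents standing in for the output of compression_retriever.invoke().
demo_docs = [
    SimpleNamespace(
        page_content="Table 4: physical properties of liquid hydrogen",
        metadata={"relevance_score": 0.98},
    ),
    SimpleNamespace(page_content="Appendix: unit conversions", metadata={}),
]
summary = summarize_results(demo_docs)
```

With the real pipeline you would call `summarize_results(docs)` and print each line.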

Example: Fact-Checking Complex Data

If a user asks for a specific number buried in a technical table, a vector search might retrieve 10 similar-looking tables. A Cross-Encoder scores the text of each table, including its column names and headers, jointly with the query, which makes it much more likely that the table actually containing the number is ranked first.
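
One way to check whether re-ranking really moved the right table to the top is to compare precision@k before and after. The sketch below assumes you have labeled which chunk IDs are relevant for a query; the IDs and orderings shown are made up for illustration.

```python
from typing import Sequence, Set


def precision_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the top-k results that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k


# Hypothetical rankings for one query: same candidates, different order.
vector_order = ["t3", "t7", "t1", "t9", "t2"]
reranked_order = ["t1", "t3", "t7", "t9", "t2"]
relevant = {"t1"}  # the one table that contains the requested number

before = precision_at_k(vector_order, relevant, k=1)
after = precision_at_k(reranked_order, relevant, k=1)
```

Averaging this metric over a set of labeled queries gives a simple before/after comparison for the re-ranking stage.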

Common Mistakes

Re-ranking too many documents: Don't try to re-rank 500 documents. Each candidate adds another query-document pair to score, so latency and cost grow with the size of the pool. 25-50 candidates is usually a good starting range for the initial pool.
Ignoring the Score Filter: Some re-rankers return a relevance score for each document. You can combine re-ranking with a threshold filter to discard documents that even the Cross-Encoder finds irrelevant.
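
A score filter can be a few lines of post-processing. This is a minimal sketch, assuming each re-ranked document carries its score in `metadata["relevance_score"]`; the function name `filter_by_score` and the 0.5 threshold are illustrative choices, and a sensible threshold has to be tuned on your own data.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List


def filter_by_score(
    docs: Iterable[Any], min_score: float, score_key: str = "relevance_score"
) -> List[Any]:
    """Keep only documents whose re-ranker score is at least min_score."""
    kept = []
    for doc in docs:
        score = doc.metadata.get(score_key)
        # Documents without a score are dropped rather than trusted blindly.
        if score is not None and score >= min_score:
            kept.append(doc)
    return kept


# Placeholder documents with made-up scores, standing in for re-ranker output.
reranked = [
    SimpleNamespace(page_content="chunk A", metadata={"relevance_score": 0.91}),
    SimpleNamespace(page_content="chunk B", metadata={"relevance_score": 0.12}),
    SimpleNamespace(page_content="chunk C", metadata={}),
]
confident = filter_by_score(reranked, min_score=0.5)
```

If the filter leaves nothing, that is useful information too: the pipeline can tell the user it found no relevant context instead of passing weak chunks to the LLM.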

Recap

Cross-Encoders are typically more accurate at ranking than standard vector search, but too slow to run over a whole corpus.
The two-stage "Retrieve and Re-rank" pattern is a widely used default for production RAG.
Cohere's Re-ranker is a hosted, ready-to-use option that integrates with LangChain.

Knowledge Check

Why don't we use Cross-Encoders for the initial search across millions of documents?

They are not accurate enough

They are too slow and computationally expensive

They don't work with vectors
