Coursify

Retrieval-Augmented Generation (RAG) — From Fundamentals to Production-Ready Agentic RAG Systems

Cross-Encoder Re-Ranking

Add cross-encoder re-ranking to improve retrieval precision. Learn how to combine bi-encoder retrieval (fast, efficient) with cross-encoder re-ranking (accurate, expensive) for the best of both worlds.

Learning Goals

  • Implement cross-encoder re-ranking for precision
  • Combine bi-encoder retrieval and cross-encoder re-ranking in a single pipeline

A standard bi-encoder (vector) retrieval system is fast but sometimes lacks precision. It embeds queries and documents separately into a shared vector space, so the distance between two vectors only approximates how well a document answers the query. Cross-Encoders take a different approach: they process the query and a document together and output a single relevance score.

Cross-Encoders are too slow to score a million documents per query, but they are highly accurate when re-ranking a small subset of results (e.g., the top 25) found by a vector store.

Learning Goals

  • Explain the difference between Bi-Encoders and Cross-Encoders.
  • Integrate the Cohere Re-ranker into a LangChain pipeline.
  • Evaluate the impact of re-ranking on retrieval precision.

Core Concepts

1. Bi-Encoder (Retrieval)

In Modules 3 and 4, we used Bi-Encoders. They embed queries and documents independently, so each text becomes a fixed vector. Relevance is then estimated from the distance between the query vector and each document vector.

  • Pros: Fast (millisecond search), scalable.
  • Cons: Less precise; misses nuanced relationships.

2. Cross-Encoder (Re-ranking)

A Cross-Encoder takes both the Query and the Document as a single input to the transformer. It "sees" the interaction between words in the query and words in the document simultaneously.

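  • Pros: More accurate relevance scores than vector distance, because the model reads the query and the document together.
  • Cons: Slow and computationally expensive; every query-document pair needs its own model pass, so nothing can be precomputed.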
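The structural difference between the two patterns can be sketched in plain Python. This is only an illustration: the scoring functions are toy stand-ins based on word counts, not real transformer models, and the names `embed`, `cosine`, and `cross_encoder_score` are made up for this example. What it shows is the shape of the computation: the bi-encoder side compares vectors that were built independently, while the cross-encoder side needs the query and the document in the same call.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': bag-of-words counts (stand-in for a real bi-encoder)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cross_encoder_score(query: str, doc: str) -> float:
    """Toy joint scorer: share of the query's adjacent word pairs found in the doc."""
    q, d = query.lower().split(), doc.lower().split()
    query_pairs = list(zip(q, q[1:]))
    doc_pairs = set(zip(d, d[1:]))
    if not query_pairs:
        return 0.0
    return sum(pair in doc_pairs for pair in query_pairs) / len(query_pairs)


docs = ["man bites dog", "dog bites man today"]
query = "dog bites man"

# Bi-encoder pattern: document vectors are built once, ahead of time...
doc_vectors = [embed(doc) for doc in docs]
# ...and at query time only the query is embedded, then compared by similarity.
query_vector = embed(query)
bi_scores = [cosine(query_vector, vec) for vec in doc_vectors]

# Cross-encoder pattern: every (query, document) pair is scored jointly.
cross_scores = [cross_encoder_score(query, doc) for doc in docs]

print("bi-encoder stand-in:   ", [round(s, 3) for s in bi_scores])
print("cross-encoder stand-in:", cross_scores)
```

In this toy setup the word-count vectors cannot tell "man bites dog" from "dog bites man", so the first document wins on similarity, while the joint scorer prefers the second. Real bi-encoders are far better than word counts, but the same trade-off applies: independent vectors are cheap to compare and can be precomputed, whereas joint scoring has to be run for every pair.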

3. The Two-Stage Architecture

To get both speed and accuracy, we use a two-stage approach:

  1. Stage 1 (Retrieval): Use a fast vector store to find the top-k candidates (for example, k = 50).
  2. Stage 2 (Re-ranking): Use a Cross-Encoder to re-order those candidates and keep the final Top 5.
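The two stages above can be sketched as one small function. This is a minimal sketch under stated assumptions, not a production implementation: `retrieve_then_rerank`, `word_overlap`, and `phrase_match` are hypothetical names, the scorers are toy functions, and stage 1 is a plain sort over a list where a real system would query a vector index.

```python
from typing import Callable, Dict, List, Tuple


def retrieve_then_rerank(
    query: str,
    corpus: List[str],
    cheap_score: Callable[[str, str], float],
    expensive_score: Callable[[str, str], float],
    k: int = 50,
    top_n: int = 5,
) -> List[Tuple[str, float]]:
    """Stage 1 keeps the k best docs by the cheap score; stage 2 re-orders them."""
    # Stage 1 (retrieval): cheap scorer over the whole corpus.
    candidates = sorted(corpus, key=lambda d: cheap_score(query, d), reverse=True)[:k]
    # Stage 2 (re-ranking): expensive scorer over the k candidates only.
    rescored = [(doc, expensive_score(query, doc)) for doc in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:top_n]


def word_overlap(query: str, doc: str) -> float:
    """Toy cheap scorer: number of distinct words shared by query and doc."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


calls: Dict[str, int] = {"expensive": 0}


def phrase_match(query: str, doc: str) -> float:
    """Toy expensive scorer: 1.0 if the exact query phrase appears in the doc."""
    calls["expensive"] += 1  # count calls to show how rarely stage 2 runs
    return 1.0 if query.lower() in doc.lower() else 0.0


corpus = [
    "liquid hydrogen has a very low density",
    "hydrogen liquid storage tanks",
    "the density of liquid nitrogen",
    "a history of airships",
]

results = retrieve_then_rerank(
    "liquid hydrogen", corpus, word_overlap, phrase_match, k=3, top_n=1
)
print(results)
print("expensive scorer calls:", calls["expensive"])
```

The call counter makes the design choice visible: the expensive scorer runs once per candidate (3 times here), never once per document in the corpus. That is why the size of the candidate pool, not the size of the corpus, drives re-ranking latency.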

[Figure: Re-ranking Pipeline]

Implementing Re-ranking with Cohere

  1. Step 1

    You'll need the Cohere partner package:

    pip install langchain-cohere
  2. Step 2

    Set up the CohereRerank compressor. Note: You need a COHERE_API_KEY, which the client reads from the environment.

    from langchain_cohere import CohereRerank

    # Initialize the re-ranker model; top_n is the number of chunks kept after re-ranking
    compressor = CohereRerank(model="rerank-english-v3.0", top_n=5)
  3. Step 3

    Wrap your base retriever with the ContextualCompressionRetriever:

    from langchain.retrievers import ContextualCompressionRetriever

    # Stage 1: the vector store returns 25 candidates per query
    base_retriever = vector_store.as_retriever(search_kwargs={"k": 25})

    # Stage 2: the compressor (re-ranker) re-orders and trims those candidates
    compression_retriever = ContextualCompressionRetriever(
        base_compressor=compressor,
        base_retriever=base_retriever,
    )
  4. Step 4

    The output will now be the most relevant chunks as judged by the Cross-Encoder:

    query = "What is the specific gravity of liquid hydrogen?"
    docs = compression_retriever.invoke(query)
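One way to approach the third learning goal, evaluating the impact of re-ranking, is to compare precision@k before and after the re-ranker. The sketch below is an assumption-laden illustration: `precision_at_k` is a hypothetical helper, and the document IDs and relevance labels are invented for the example rather than measured on any real system.

```python
from typing import List, Set


def precision_at_k(retrieved_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the first k retrieved items that are labelled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / k


# Hand-made labels for one query (illustrative only).
relevant = {"doc_a", "doc_c"}

# Hypothetical orderings: vector search alone vs. after re-ranking.
vector_only = ["doc_b", "doc_a", "doc_d", "doc_c", "doc_e"]
reranked = ["doc_a", "doc_c", "doc_b", "doc_d", "doc_e"]

p_before = precision_at_k(vector_only, relevant, k=2)  # 1 of the top 2 is relevant: 0.5
p_after = precision_at_k(reranked, relevant, k=2)      # 2 of the top 2 are relevant: 1.0
print(f"precision@2 before: {p_before}, after: {p_after}")
```

In practice, the two ID lists would come from running the same set of labelled queries through `base_retriever` and `compression_retriever`, then averaging the scores across queries. How you identify a chunk (for example, an ID stored in its metadata) depends on how your documents were indexed.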

Example: Fact-Checking Complex Data

If a user asks for a specific number buried in a technical table, a vector search might retrieve 10 similar-looking tables. A Cross-Encoder scores each of those 10 tables against the query, reading the headers and cell text together with the question, which makes it much more likely that the table containing the number is ranked first.

Common Mistakes

  • Re-ranking too many documents: Don't try to re-rank 500 documents. Every candidate needs its own Cross-Encoder pass, so latency and cost grow with the size of the pool. 25-50 candidates is usually a good starting range for the initial pool.
  • Ignoring the Score Filter: Some re-rankers return a confidence score. You can combine re-ranking with a threshold filter to discard documents that even the Cross-Encoder finds irrelevant.
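The threshold filter mentioned above can be sketched in a few lines. The assumptions are stated loudly here: `Doc` is a minimal stand-in for a LangChain-style document, `filter_by_score` is a hypothetical helper, the score is assumed to live in the metadata under a key named `relevance_score`, and the 0.3 cutoff is arbitrary. Check what your re-ranker actually returns and tune the threshold on your own data.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Doc:
    """Minimal stand-in for a document object (page_content + metadata)."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def filter_by_score(
    docs: List[Doc], min_score: float, score_key: str = "relevance_score"
) -> List[Doc]:
    """Keep docs whose re-ranker score is >= min_score; unscored docs are dropped."""
    return [
        doc for doc in docs
        if doc.metadata.get(score_key, float("-inf")) >= min_score
    ]


# Illustrative re-ranker output (scores are invented for the example).
reranked_docs = [
    Doc("Table 4: density of cryogenic liquids", {"relevance_score": 0.92}),
    Doc("Overview of hydrogen storage", {"relevance_score": 0.41}),
    Doc("Company history", {"relevance_score": 0.03}),
    Doc("Chunk with no score attached"),
]

kept = filter_by_score(reranked_docs, min_score=0.3)
for doc in kept:
    print(doc.metadata["relevance_score"], doc.page_content)
```

With a filter like this, a query that has no good match in the corpus can return fewer chunks, or none, instead of padding the prompt with irrelevant text.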

Recap

  • Cross-Encoders are typically more accurate than standard vector search, but too slow to run over an entire corpus.
  • The two-stage "Retrieve and Re-rank" pattern is a widely used design for production RAG.
  • Cohere's Re-ranker, available through the langchain-cohere package, is a ready-to-use option for LangChain pipelines.

Knowledge Check

Why don't we use Cross-Encoders for the initial search across millions of documents?