Retrieval-Augmented Generation (RAG) — From Fundamentals to Production-Ready Agentic RAG Systems

Contextual Compression

Use contextual compression to extract the most relevant parts of retrieved documents. Implement LLMChainExtractor and EmbeddingsFilter for quality control.

A common issue in RAG is that retrieved chunks contain a lot of irrelevant text surrounding the actual answer. Passing all of it to the LLM increases token costs and adds noise that can distract the model. Contextual compression addresses this with a secondary step (usually a smaller LLM or an embedding filter) that keeps only the most relevant snippets from each retrieved document.

By compressing your context before sending it to the final generator, you provide the LLM with a highly distilled, high-signal prompt.

Learning Goals

Define Contextual Compression and its benefits for cost and accuracy.
Implement the LLMChainExtractor for intelligent text distillation.
Apply EmbeddingsFilter for efficient, non-LLM based compression.

Core Concepts

1. The "Signal vs. Noise" Problem

Imagine you retrieve a 1,000-word page about "Tesla Batteries." The actual answer to the user's question about "Anode materials" is only 2 sentences long. Contextual compression finds those 2 sentences and discards the other 980 words.
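To make the idea concrete, here is a minimal sketch that keeps only the sentences mentioning a query term and reports how much of the page survives. It uses plain keyword matching as a stand-in for a real compressor, and the page text, `split_sentences`, and `keep_relevant` are invented for illustration.

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def keep_relevant(text: str, query_terms: set[str]) -> str:
    """Keep only sentences that mention at least one query term (case-insensitive)."""
    kept = [
        s for s in split_sentences(text)
        if any(term.lower() in s.lower() for term in query_terms)
    ]
    return " ".join(kept)

# Invented toy text, not real product data
page = (
    "Tesla builds battery packs at several factories. "
    "The anode is primarily graphite. "
    "Some cells blend silicon into the anode to raise capacity. "
    "Packs are cooled with a liquid glycol loop."
)

compressed = keep_relevant(page, {"anode"})
ratio = len(compressed.split()) / len(page.split())

print(compressed)
print(f"kept {ratio:.0%} of the words")
```

Real compressors replace the keyword test with an LLM or an embedding similarity score, but the shape is the same: score each piece against the query, keep the signal, drop the rest.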

2. LLM-based Compression (Extractor)

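A small, fast LLM reads each retrieved document alongside the query and returns only the passages relevant to that query. Documents with nothing relevant are dropped.

Pros: Precise, because the model judges each passage in light of the actual question.
Cons: Adds one LLM call per retrieved document, which increases latency and API costs.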
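The extraction loop can be pictured roughly as follows. This is a simplified sketch, not LangChain's implementation: `build_prompt`, `compress_documents`, the `NO_OUTPUT` sentinel, and the `fake_llm` stand-in are all assumptions made so the example runs offline. In practice `llm` would be a call to a small hosted model.

```python
from typing import Callable

NO_OUTPUT = "NO_OUTPUT"  # sentinel this sketch asks the model to return when nothing is relevant

def build_prompt(question: str, context: str) -> str:
    """Assemble the extraction prompt for one document."""
    return (
        "Extract, word for word, any part of the context that helps answer the question. "
        f"If nothing is relevant, reply with {NO_OUTPUT}.\n\n"
        f"Question: {question}\n\nContext:\n{context}"
    )

def compress_documents(docs: list[str], question: str, llm: Callable[[str], str]) -> list[str]:
    """Run one LLM call per document and drop documents with nothing relevant."""
    compressed = []
    for doc in docs:
        answer = llm(build_prompt(question, doc)).strip()
        if answer and answer != NO_OUTPUT:
            compressed.append(answer)
    return compressed

def fake_llm(prompt: str) -> str:
    """Offline stand-in for a model: returns context lines that mention 'anode'."""
    context = prompt.split("Context:\n", 1)[1]
    hits = [line for line in context.splitlines() if "anode" in line.lower()]
    return "\n".join(hits) if hits else NO_OUTPUT

# Invented toy documents
docs = [
    "Pack cooling uses a liquid loop.\nThe anode is mostly graphite.",
    "Shipping takes five business days.",
]

result = compress_documents(docs, "What are the anode materials?", fake_llm)
print(result)
```

The loop makes the cost trade-off visible: the number of LLM calls grows with the number of retrieved documents, which is why a small model is the usual choice here.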

3. Embedding-based Compression (Filter)

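The compressor embeds the query and each retrieved document, then keeps only the documents whose similarity to the query clears a threshold. In LangChain, EmbeddingsFilter scores whole documents this way; to filter at sentence or passage level, split the retrieved documents into smaller pieces first and filter those.

Pros: Fast and cheap, because it needs only embedding calls and no LLM generation.
Cons: May miss nuanced context, because each piece is scored in isolation from its neighbors.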
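The filtering step itself is just a similarity comparison against a threshold. The sketch below uses a toy bag-of-words vector in place of a real embedding model so it runs without any external service; `embed`, `cosine`, `embeddings_filter`, the sample chunks, and the 0.3 threshold are all illustrative assumptions.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector. A real system would call an embedding model."""
    return Counter(text.lower().replace("?", "").replace(".", "").split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def embeddings_filter(chunks: list[str], query: str, threshold: float) -> list[str]:
    """Keep only the chunks whose similarity to the query meets the threshold."""
    q = embed(query)
    return [c for c in chunks if cosine(embed(c), q) >= threshold]

# Invented toy chunks
chunks = [
    "The anode materials are graphite and silicon.",
    "Shipping takes five business days.",
]

kept = embeddings_filter(chunks, "What are the anode materials?", threshold=0.3)
print(kept)
```

The threshold is the main tuning knob: set it too high and relevant chunks are discarded, too low and little is filtered out.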

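Compression Workflow

Query -> base retriever -> retrieved documents -> compressor -> compressed documents -> generator LLM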

Implementing Contextual Compression

Step 1

Start with your standard vector store retriever:

# vector_store is the vector store you have already built and populated
base_retriever = vector_store.as_retriever()

Step 2

Use an LLM to extract only the relevant parts of the text:

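from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import ChatOpenAI

# A small model at temperature=0 keeps extraction cheap and as repeatable as possible
llm = ChatOpenAI(temperature=0, model="gpt-4o-mini")
# The extractor asks the LLM to pull out only the query-relevant passages
compressor = LLMChainExtractor.from_llm(llm)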

Step 3

Wrap the base retriever with the compressor:

from langchain.retrievers import ContextualCompressionRetriever

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever
)

Step 4

Observe how the documents are now much shorter and more focused:

docs = compression_retriever.invoke("What are the anode materials?")
# Each document now holds only the extracted passages; the list may be empty if nothing was relevant
for doc in docs:
    print(doc.page_content)
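The steps above use LLMChainExtractor. The EmbeddingsFilter plugs into the same ContextualCompressionRetriever wrapper, so swapping compressors is a small change. The sketch below assumes the same `langchain` import paths used in this lesson, the `langchain_openai` package, and an `OPENAI_API_KEY` in the environment. The helper name `build_embeddings_filter_retriever` and the 0.76 threshold are illustrative choices, not recommendations.

```python
def build_embeddings_filter_retriever(base_retriever, similarity_threshold: float = 0.76):
    """Wrap base_retriever so that only documents similar enough to the query are kept."""
    # Imports live inside the function so this sketch can be loaded
    # even where the LangChain packages are not installed.
    from langchain.retrievers import ContextualCompressionRetriever
    from langchain.retrievers.document_compressors import EmbeddingsFilter
    from langchain_openai import OpenAIEmbeddings

    embeddings_filter = EmbeddingsFilter(
        embeddings=OpenAIEmbeddings(),
        similarity_threshold=similarity_threshold,  # assumed starting value; tune on your own data
    )
    return ContextualCompressionRetriever(
        base_compressor=embeddings_filter,
        base_retriever=base_retriever,
    )

# Usage, given the base_retriever from Step 1:
# retriever = build_embeddings_filter_retriever(base_retriever)
# docs = retriever.invoke("What are the anode materials?")
```

Unlike the LLM extractor, this filter keeps or drops whole documents rather than trimming text inside them, so it works best when the retrieved chunks are already small.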

Example: Summarizing Research Papers

Suppose you are building a tool to chat with arXiv papers, which often run 20+ pages. With contextual compression, a question about methodology sends the model only the passages that describe the methodology, not the full introduction, references, and appendices.

Common Mistakes

Using gpt-4 for Compression: Don't use your most expensive model for compression. The compressor runs once per retrieved document, so a large model multiplies cost and latency. Use a small, fast model such as gpt-4o-mini or Claude Haiku instead.
Over-compression: If your compressor is too aggressive, it may strip context the generator needs, such as definitions. Start with a conservative prompt (or a low similarity threshold), then compare the compressed output against the originals to check for information loss.
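One lightweight way to check for information loss is to compare each document before and after compression. The helper below is a hypothetical sketch: `retention_report`, the `must_keep` list, and the sample strings are assumptions for illustration, and a real evaluation would also look at answer quality downstream.

```python
def retention_report(original: str, compressed: str, must_keep: list[str]) -> dict:
    """Report how much text survived compression and which required terms were lost."""
    orig_words = len(original.split())
    kept_words = len(compressed.split())
    # A term counts as missing only if it was present before compression and is absent after
    missing = [
        term for term in must_keep
        if term.lower() in original.lower() and term.lower() not in compressed.lower()
    ]
    return {
        "kept_ratio": kept_words / orig_words if orig_words else 0.0,
        "missing_terms": missing,
    }

# Invented toy example: the definition sentence was dropped by the compressor
original = "An anode is the negative electrode. Graphite is the usual anode material."
compressed = "Graphite is the usual anode material."

report = retention_report(original, compressed, must_keep=["negative electrode", "graphite"])
print(report)
```

A very low kept ratio combined with missing terms is a hint that the compressor prompt or threshold is too aggressive for the questions being asked.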

Recap

Contextual compression distills retrieved documents into query-specific snippets.
It reduces token usage and improves LLM focus.
LangChain's ContextualCompressionRetriever allows you to swap between LLM-based and Embedding-based compressors easily.

Knowledge Check

Q1 (single choice)

What is the primary advantage of compressing context before sending it to the LLM?

It makes the database smaller

It reduces token costs and improves the signal-to-noise ratio in the prompt

It makes the embedding process faster
