Contextual Compression
Use contextual compression to extract the most relevant parts of retrieved documents. Implement LLMChainExtractor and EmbeddingsFilter for quality control.
A common issue in RAG is that retrieved chunks often contain a lot of "irrelevant" text surrounding the actual answer. Passing this bulk to the LLM increases token costs and adds "noise" that can distract the model. Contextual Compression solves this by using a secondary process (usually a smaller LLM or an embedding filter) to extract only the most relevant snippets from each retrieved document.
By compressing your context before sending it to the final generator, you provide the LLM with a highly distilled, high-signal prompt.
Learning Goals
- Define Contextual Compression and its benefits for cost and accuracy.
- Implement the LLMChainExtractor for intelligent text distillation.
- Apply EmbeddingsFilter for efficient, non-LLM-based compression.
Core Concepts
1. The "Signal vs. Noise" Problem
Imagine you retrieve a 1,000-word page about "Tesla Batteries." The actual answer to the user's question about "Anode materials" is only 2 sentences long. Contextual compression finds those 2 sentences and discards the other 980 words.
2. LLM-based Compression (Extractor)
A small, fast LLM scans each retrieved document and rewrites it, keeping only the parts relevant to the query.
- Pros: Extremely precise.
- Cons: Increases latency and API costs.
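To make the extractor idea concrete, here is a minimal offline sketch. It is not LangChain's implementation: the prompt wording, the NO_OUTPUT sentinel, and the names extract_relevant, compress, and fake_llm are all assumptions made for illustration, and fake_llm is a keyword-matching stand-in for a real model call.

```python
from typing import Callable, Optional

NO_OUTPUT = "NO_OUTPUT"  # hypothetical sentinel meaning "nothing relevant here"

# Hypothetical extraction prompt; a real prompt would need tuning.
PROMPT = (
    "Extract, word for word, only the parts of the context that help answer "
    "the question. If nothing is relevant, reply {sentinel}.\n\n"
    "Question: {question}\n\nContext:\n{context}"
)


def extract_relevant(llm: Callable[[str], str], question: str, document: str) -> Optional[str]:
    """Ask the model for the relevant excerpt; return None if it finds nothing."""
    prompt = PROMPT.format(sentinel=NO_OUTPUT, question=question, context=document)
    reply = llm(prompt).strip()
    return None if not reply or reply == NO_OUTPUT else reply


def compress(llm: Callable[[str], str], question: str, documents: list[str]) -> list[str]:
    """Run the extractor over every document and drop the empty results."""
    excerpts = (extract_relevant(llm, question, d) for d in documents)
    return [e for e in excerpts if e is not None]


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs without an API key."""
    context = prompt.split("Context:\n", 1)[1]
    hits = [s for s in context.split(". ") if "anode" in s.lower()]
    return ". ".join(hits) if hits else NO_OUTPUT


# Made-up documents, for illustration only.
docs = [
    "The pack weighs 480 kg. The anode is mostly graphite. Charging takes 30 minutes.",
    "The factory opened last year.",
]
result = compress(fake_llm, "What are the anode materials?", docs)
print(result)
```

Each document costs one model call, which is where the extra latency and API cost listed above come from.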
3. Embedding-based Compression (Filter)
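The system splits the retrieved documents into smaller pieces (for example, sentences), embeds them, and keeps only the pieces whose embeddings are sufficiently similar to the query. Note that LangChain's EmbeddingsFilter on its own scores each retrieved document as a whole; to filter at a finer grain, split the documents first, for example with a text splitter inside a DocumentCompressorPipeline.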
- Pros: Very fast and cheap.
- Cons: May miss nuanced context if sentences are too disconnected.
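The filtering idea can be sketched without any external service. This example swaps a real embedding model for a toy bag-of-words vector, so the scores only reflect word overlap. The function names, the 0.3 threshold, and the sample text are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def embeddings_filter(query: str, document: str, threshold: float = 0.3) -> list[str]:
    """Split a document into sentences and keep those similar enough to the query."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    query_vec = embed(query)
    return [s for s in sentences if cosine(query_vec, embed(s)) >= threshold]


# Made-up document text, for illustration only.
doc = (
    "This page gives an overview of battery packs. "
    "The anode materials are graphite with a small share of silicon. "
    "The company also sells solar panels."
)
kept = embeddings_filter("What are the anode materials?", doc)
print(kept)
```

No model call is made per sentence, which is why this approach is fast and cheap. The trade-off is that a sentence which shares little vocabulary (or, with real embeddings, little meaning) with the query is dropped even if it carries needed context.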
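Compression Workflow
The query goes to the base retriever; the retrieved documents pass through the compressor (an LLM extractor or an embeddings filter); only the compressed snippets reach the final LLM.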
Implementing Contextual Compression
Step 1: Start with your standard vector store retriever:

    base_retriever = vector_store.as_retriever()

Step 2: Use an LLM to extract only the relevant parts of the text:

    from langchain.retrievers.document_compressors import LLMChainExtractor
    from langchain_openai import ChatOpenAI

    llm = ChatOpenAI(temperature=0, model="gpt-4o-mini")
    compressor = LLMChainExtractor.from_llm(llm)

Step 3: Wrap the base retriever with the compressor:

    from langchain.retrievers import ContextualCompressionRetriever

    compression_retriever = ContextualCompressionRetriever(
        base_compressor=compressor,
        base_retriever=base_retriever
    )

Step 4: Observe how the documents are now much shorter and more focused:

    docs = compression_retriever.invoke("What are the anode materials?")
    print(docs[0].page_content)  # Contains ONLY relevant snippets
Example: Summarizing Research Papers
If you are building a tool to "Chat with arXiv Papers," papers are often 20+ pages long. With contextual compression, when you ask about "Methodology," the model receives mainly the passages that describe the methodology, rather than the full introduction, references, and appendices.
Common Mistakes
- Using gpt-4 for Compression: Don't use your most expensive model for compression. Use a "small" model like gpt-4o-mini or Claude Haiku, which are optimized for fast extraction tasks.
- Over-compression: If your compressor is too aggressive, it might remove necessary context (like definitions). Start with a conservative prompt and check for "Information Loss."
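One rough way to check for information loss is to compare each compressed snippet against its source and confirm that the terms you care about survived. The sketch below is only an assumption about how such a check might look: compression_report, the must_keep list, and the sample strings are hypothetical.

```python
def compression_report(original: str, compressed: str, must_keep: list[str]) -> dict:
    """Summarize how much text was kept and which required terms were lost."""
    orig_words = len(original.split())
    kept_words = len(compressed.split())
    # Case-insensitive substring check for each term that must survive.
    lost = [t for t in must_keep if t.lower() not in compressed.lower()]
    return {
        "kept_ratio": kept_words / orig_words if orig_words else 0.0,
        "lost_terms": lost,
    }


# Made-up text, for illustration only.
original = "An anode is the negative electrode. The anode here is graphite. Prices rose in May."
compressed = "The anode here is graphite."

report = compression_report(original, compressed, must_keep=["graphite", "negative electrode"])
print(report)
```

Here the compressor kept the answer but dropped the definition of "anode", so "negative electrode" shows up under lost_terms. A very low kept_ratio together with lost terms suggests the compressor is too aggressive.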
Recap
- Contextual compression distills retrieved documents into query-specific snippets.
- It reduces token usage and improves LLM focus.
- LangChain's ContextualCompressionRetriever allows you to swap between LLM-based and Embedding-based compressors easily.
Knowledge Check
What is the primary advantage of compressing context before sending it to the LLM?