Parent Document Retriever
Build a parent-child document system: retrieve small chunks for precise matching but return full parent documents for rich context. Ideal for Q&A systems requiring detailed answers.
Learning Goals
- Build a parent-child document retriever
- Implement small-match, large-context retrieval pattern
In Module 2, we learned that small chunks (e.g., 200 tokens) are better for retrieval precision because they produce "tighter" embeddings. However, small chunks often lack enough surrounding context for the LLM to generate a complete and high-quality answer. Parent Document Retriever solves this conflict by indexing small "Child" chunks but returning the larger "Parent" document (or a much larger parent chunk) to the LLM during generation.
This "Best of Both Worlds" strategy ensures precise search while providing the LLM with rich, full-context information.
Learning Goals
- Explain the Parent-Child document relationship in RAG.
- Implement the ParentDocumentRetriever using LangChain.
- Configure an InMemoryStore for parent document storage.
Core Concepts
1. Small vs. Large (The Dilemma)
- Small Chunks: Great for embedding similarity; bad for LLM understanding.
- Large Chunks: Great for LLM understanding; bad for embedding similarity (meaning gets "diluted").
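The "dilution" effect can be illustrated with a toy example. The sketch below is not a real embedding model: it assumes a simple bag-of-words vector and cosine similarity as a stand-in, and the handbook sentences are made up for illustration. It only shows the general tendency that extra unrelated text lowers the similarity score for the same query.

```python
import math
import re
from collections import Counter


def bow(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (an assumption, not a real model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


query = "dental coverage limit"

# A small chunk that is entirely about the topic of the query.
small_chunk = "Dental coverage has an annual limit of 1500 dollars."

# A large chunk: the same sentence plus unrelated neighbouring sentences.
large_chunk = (
    small_chunk
    + " Vision exams are covered once per year."
    + " Prescription drugs require a copay."
    + " Mental health visits are covered after the deductible is met."
)

small_score = cosine(bow(query), bow(small_chunk))
large_score = cosine(bow(query), bow(large_chunk))

# The matching words are the same in both chunks, but the large chunk's
# vector is longer, so its similarity to the query is lower.
print(f"small chunk: {small_score:.3f}, large chunk: {large_score:.3f}")
```

Real embedding models behave less mechanically than this, but the same intuition applies: the more topics a chunk covers, the less any single topic dominates its vector.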
2. The Linkage Strategy
The system maintains a mapping between child IDs and their parent ID.
- Search: The query embedding is compared against the small Child vectors in the vector store.
- Lookup: For each top-matching Child, the system reads its Parent ID.
- Return: The full Parent text is fetched from a key-value store and passed to the LLM.
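The three steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not LangChain's implementation: `parent_store` and `child_index` are hypothetical in-memory stand-ins for the key-value store and the vector index, and word overlap (Jaccard) replaces real embedding similarity.

```python
import re

# Hypothetical key-value store: parent ID -> full parent text.
parent_store = {
    "benefits": (
        "Medical Benefits. Dental coverage has an annual limit. "
        "Vision exams are covered once per year. "
        "Claims must be filed within 90 days."
    ),
    "leave": (
        "Leave Policy. Employees accrue vacation days monthly. "
        "Sick leave requires manager notice."
    ),
}

# Hypothetical child index: (child text, parent ID) pairs.
child_index = [
    ("Dental coverage has an annual limit.", "benefits"),
    ("Vision exams are covered once per year.", "benefits"),
    ("Employees accrue vacation days monthly.", "leave"),
    ("Sick leave requires manager notice.", "leave"),
]


def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))


def score(query: str, chunk: str) -> float:
    """Jaccard word overlap, used here as a toy similarity measure."""
    q, c = tokens(query), tokens(chunk)
    return len(q & c) / len(q | c) if q | c else 0.0


def retrieve_parents(query: str, k: int = 2) -> list[str]:
    # Search: rank the child chunks against the query and keep the top k.
    ranked = sorted(child_index, key=lambda item: score(query, item[0]), reverse=True)[:k]
    # Lookup: collect parent IDs in rank order, dropping repeats.
    parent_ids: list[str] = []
    for _, parent_id in ranked:
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
    # Return: fetch the full parent texts from the key-value store.
    return [parent_store[parent_id] for parent_id in parent_ids]


results = retrieve_parents("What is the dental coverage limit per year?")
print(len(results), "parent(s) returned")
```

Both top-ranked children here belong to the same parent, so only one parent is returned. Deduplicating at the lookup step keeps the LLM from receiving the same long passage twice.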
Parent-Child Architecture
Implementing Parent Document Retrieval
Step 1
You need a vector store for the child chunks and a docstore for the parents:
```python
from langchain.storage import InMemoryStore
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

# Vector store that indexes the small child chunks
vectorstore = Chroma(collection_name="split_parents", embedding_function=OpenAIEmbeddings())
# Key-value store that holds the large parent chunks
store = InMemoryStore()
```
Step 2
Define a large splitter for parents and a small one for children:
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# chunk_size is measured in characters by default
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
```
Step 3
Initialize the ParentDocumentRetriever:
```python
from langchain.retrievers import ParentDocumentRetriever

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
```
Step 4
LangChain handles the multi-level splitting and indexing automatically:
```python
# docs is a list of LangChain Document objects loaded beforehand
retriever.add_documents(docs)

# Search returns the large parent chunks
results = retriever.invoke("What are the specific safety protocols?")
print(len(results[0].page_content))  # up to ~2000 characters
```
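To make the "multi-level splitting" less of a black box, here is a rough, library-free sketch of the idea. It is an approximation under stated assumptions, not LangChain's actual code: `split_fixed` is a hypothetical fixed-width splitter standing in for a real text splitter, the two dictionaries and lists stand in for the docstore and vector store, and the `doc_id` metadata key is an illustrative name for the child-to-parent link.

```python
import uuid


def split_fixed(text: str, size: int) -> list[str]:
    """Naive fixed-width character splitter (a stand-in for a real text splitter)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def index_document(
    text: str, parent_size: int = 2000, child_size: int = 400
) -> tuple[dict[str, str], list[dict[str, str]]]:
    """Split text into parents, then split each parent into children tagged with its ID."""
    docstore: dict[str, str] = {}             # parent ID -> parent text
    child_records: list[dict[str, str]] = []  # what would be embedded and indexed

    for parent_text in split_fixed(text, parent_size):
        parent_id = str(uuid.uuid4())
        docstore[parent_id] = parent_text
        for child_text in split_fixed(parent_text, child_size):
            # Each child carries the ID of the parent it was cut from.
            child_records.append({"text": child_text, "doc_id": parent_id})

    return docstore, child_records


# A 4500-character placeholder document.
docstore, children = index_document("x" * 4500)
print(len(docstore), "parents,", len(children), "children")
```

Only the child records would be embedded and searched. The parent texts are never embedded; they are looked up by ID after a child matches.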
Example: Employee Handbooks
Company handbooks often have long sections on "Medical Benefits." If a user asks about "Dental coverage," a small chunk might match the specific line, but the LLM needs the entire "Medical Benefits" section to explain how that coverage fits into the overall policy. Parent-Child retrieval ensures the model gets the whole policy.
Common Mistakes
- Memory Loss: If you use InMemoryStore for parents, your documents are lost when the script restarts. For production, use a persistent store such as RedisStore or a PostgreSQL-backed store.
- Inconsistent Splitters: If your child splitter is as large as or larger than your parent splitter, each parent yields a single child of the same size and the pattern gives no benefit. Always ensure child size << parent size.
Recap
- Parent Document Retrieval separates the "search units" from the "generation units."
- It improves grounding by providing the LLM with the full surrounding context of a match.
- LangChain's ParentDocumentRetriever automates the complex ID mapping and multi-index orchestration.
Knowledge Check
What is the main problem solved by the Parent Document Retriever?