Text Splitting Strategies
Explore the science of Chunking. Learn how to break documents into smaller, semantically meaningful pieces using the Recursive Character Splitter and custom separators.
Learning Goals
- Define "Chunking" and explain its impact on retrieval precision.
- Implement a Recursive Character Text Splitter with custom boundaries.
- Understand the critical role of "Chunk Overlap" in context preservation.
The Geometry of Information
You usually cannot feed a 300-page book into an LLM's context window in one piece, and retrieval over whole documents is imprecise anyway. To get high-precision results, we must chop documents into smaller pieces called Chunks.
The challenge is Splitting at the Right Spot. If you split a sentence in half, you lose the meaning. The goal of a Text Splitter is to keep related pieces of information together while staying within the limits of the model's context window.
The Recursive Character Splitter
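This is a widely used default for generic text. Instead of just cutting text every 500 characters, it tries to split on logical boundaries in a specific order, falling back to the next separator only when a piece is still larger than the chunk size: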
- Paragraphs (`\n\n`)
- Lines (`\n`)
- Words (` `, a single space)
- Characters (`""`, the empty string)
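The fallback order above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the library's implementation: the function name `recursive_split` is hypothetical, overlap is ignored, and pieces are merged greedily up to `chunk_size`.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters,
    trying coarse separators first and finer ones only when needed.
    Hypothetical helper for illustration; overlap is not handled."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            # Keep merging neighboring parts while they still fit.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # This part alone is too large: retry with finer separators.
            chunks.extend(recursive_split(part, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = (
    "First paragraph here.\n\n"
    "Second paragraph is a bit longer than the first one.\n\n"
    "Third."
)
for chunk in recursive_split(sample, chunk_size=30):
    print(repr(chunk))
```

Short paragraphs survive intact, while the oversized middle paragraph is broken at word boundaries rather than mid-word.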
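```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,       # maximum characters per chunk
    chunk_overlap=200,     # characters shared between neighboring chunks
    add_start_index=True,  # record each chunk's offset in metadata["start_index"]
)

# `docs` is a list of Document objects loaded earlier
chunks = splitter.split_documents(docs)
```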
Chunk Overlap (e.g., 200 characters) creates a "sliding window." It repeats the end of one chunk at the beginning of the next, ensuring that if a fact was mentioned at the boundary, the model sees enough surrounding context to understand it.
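To make the sliding window concrete, here is a minimal sketch using fixed-width chunks. It is deliberately simpler than a recursive splitter (it ignores separators entirely), and the helper name `sliding_chunks` and the sample text are assumptions made for illustration.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Return (start_index, chunk) pairs for fixed-width, overlapping chunks.
    Hypothetical helper; it cuts purely by character count."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    pairs = []
    for start in range(0, len(text), step):
        pairs.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return pairs


text = "The API key rotates every 90 days. Expired keys return HTTP 401."
pairs = sliding_chunks(text, chunk_size=40, chunk_overlap=10)

for start, chunk in pairs:
    # The stored start index maps each chunk back to the original text.
    assert text[start:start + len(chunk)] == chunk
    print(start, repr(chunk))

# The last 10 characters of one chunk reappear at the start of the next.
assert pairs[0][1][-10:] == pairs[1][1][:10]
```

The window advances by `chunk_size - chunk_overlap` characters each step, so a larger overlap means more chunks and more duplicated text to embed and store.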
Configuring Your Splitter
- Step 1: Decide on your chunk size budget. For technical text, 800-1200 characters is usually the 'Sweet Spot' between context and precision.
- Step 2: Set a 10-20% overlap. This acts as a 'Safety Buffer' to prevent semantic context from being sliced in half.
- Step 3: If your document contains Python or JavaScript, use a language-specific splitter like `RecursiveCharacterTextSplitter.from_language(Language.PYTHON)` to split on function and class boundaries.
- Step 4: Set `add_start_index=True`. This allows you to map the chunk back to its exact character position in the original file.
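The idea behind Step 3, splitting code on function and class boundaries instead of arbitrary character counts, can be illustrated with the standard library alone. This is a rough sketch, not how the language-specific splitter is implemented; `split_python_source` is a hypothetical helper that only recognizes top-level `def` and `class` lines.

```python
import re


def split_python_source(source):
    """Split Python source before each top-level `def` or `class` line.
    Hypothetical helper for illustration only."""
    # The zero-width lookahead keeps the keyword at the start of its own piece.
    pieces = re.split(r"^(?=(?:def|class)\s)", source, flags=re.MULTILINE)
    return [piece for piece in pieces if piece.strip()]


source = (
    "import os\n"
    "\n"
    "def load(path):\n"
    "    return open(path).read()\n"
    "\n"
    "class Parser:\n"
    "    def parse(self, text):\n"
    "        return text.split()\n"
)

for piece in split_python_source(source):
    print(repr(piece))
```

The indented `parse` method stays inside the `Parser` piece because only unindented `def` and `class` lines count as boundaries, so each chunk holds a complete top-level definition.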
Knowledge Check
Why is a 'Recursive' splitter preferred over a 'Character' splitter?