Coursify

Retrieval-Augmented Generation (RAG) — From Fundamentals to Production-Ready Agentic RAG Systems

Text Splitting Strategies

30 mins

Explore the science of Chunking. Learn how to break documents into smaller, semantically meaningful pieces using the Recursive Character Splitter and custom separators.

Learning Goals

  • Define "Chunking" and explain its impact on retrieval precision.
  • Implement a Recursive Character Text Splitter with custom boundaries.
  • Understand the critical role of "Chunk Overlap" in context preservation.

The Geometry of Information

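You cannot usefully index a 300-page book as a single unit: it exceeds the input limit of most embedding models, and a retriever that can only return "the whole book" is not precise. To get high-precision results, we must chop documents into smaller pieces called chunks.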

The challenge is Splitting at the Right Spot. If you split a sentence in half, you lose the meaning. The goal of a Text Splitter is to keep related pieces of information together while staying within the limits of the model's context window.
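
To see why the split point matters, here is a minimal sketch in plain Python that compares a blind fixed-width cut with a sentence-aware one. The helper names (`fixed_width_chunks`, `sentence_chunks`) and the sample text are illustrative assumptions, and the regex-based sentence detection is deliberately naive.

```python
import re


def fixed_width_chunks(text: str, size: int) -> list[str]:
    """Cut the text every `size` characters, ignoring sentence boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def sentence_chunks(text: str, size: int) -> list[str]:
    """Greedily pack whole sentences into chunks of at most `size` characters.

    A single sentence longer than `size` is kept whole rather than cut.
    """
    # Naive sentence detection: split after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


# Hypothetical sample text, used only for illustration.
text = (
    "Refunds are issued within 30 days. "
    "Shipping fees are not refundable. "
    "Contact support to start a claim."
)

naive = fixed_width_chunks(text, 40)  # first chunk ends mid-word ("...days. Shipp")
aware = sentence_chunks(text, 40)     # every chunk is a complete sentence
```

With the blind cut, the second sentence is spread across two chunks, so neither chunk alone states the shipping rule. The sentence-aware version keeps each fact intact.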

[Figure: Text Splitting and Chunking Strategies]

The Recursive Character Splitter

This is the splitter LangChain recommends for generic text. Instead of blindly cutting every 500 characters, it tries a list of separators in order, falling back to the next one only when a piece is still larger than the chunk size:

  1. Paragraphs (\n\n)
  2. Lines (\n)
  3. Words (" ")
  4. Characters ("", the empty string)
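
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,       # maximum characters per chunk
        chunk_overlap=200,     # characters shared between neighbouring chunks
        add_start_index=True,  # record each chunk's offset in its metadata
    )

    # docs is a list of Document objects loaded in an earlier step
    chunks = splitter.split_documents(docs)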
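
To make the fallback order concrete, the following is a simplified sketch of the recursive idea in plain Python. It is not LangChain's implementation: the function name `recursive_split` is an assumption, overlap is left out, and separators are dropped at chunk boundaries for brevity.

```python
def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Split on the coarsest separator first, recursing with finer ones as needed."""
    if len(text) <= chunk_size:
        return [text] if text else []

    # Last resort: no separators left (or the empty-string separator), so hard-cut.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        if not piece:
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # Keep merging neighbouring pieces while they fit in one chunk.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # This piece alone is too big: retry it with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


DEFAULT_SEPARATORS = ["\n\n", "\n", " ", ""]

sample = "Intro paragraph.\n\nSecond paragraph line one.\nSecond paragraph line two."
parts = recursive_split(sample, chunk_size=30, separators=DEFAULT_SEPARATORS)
```

Here the first paragraph fits as is, while the second is too long and falls back to the line separator. Passing a different `separators` list (for example, one that starts with a Markdown heading marker) is how custom boundaries would plug into this sketch.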

Chunk Overlap (e.g., 200 characters) creates a "sliding window." It repeats the end of one chunk at the beginning of the next, ensuring that if a fact was mentioned at the boundary, the model sees enough surrounding context to understand it.
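
The sliding-window effect can be sketched with a purely character-based splitter. This is an illustration of the idea only: the function name `sliding_window_chunks` and the sample text are assumptions, and a separator-aware splitter will usually produce overlaps that follow word or line boundaries instead of exact character counts.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Return fixed-size chunks where each one repeats the tail of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# Hypothetical sample text, used only for illustration.
text = "The warranty lasts two years. It covers battery defects only."
windows = sliding_window_chunks(text, chunk_size=30, chunk_overlap=10)

# The last 10 characters of each chunk reappear at the start of the next one.
overlaps_match = all(
    prev[-10:] == nxt[:10] for prev, nxt in zip(windows, windows[1:])
)
```

Because the window advances by `chunk_size - chunk_overlap` characters, a larger overlap means more chunks and more duplicated text in the index, which is the cost paid for the extra boundary context.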

Configuring Your Splitter

  1. Decide on your chunk-size budget. For technical text, 800-1200 characters is usually the "sweet spot" between context and precision.
  2. Set a 10-20% overlap (for example, 100-200 characters for a 1,000-character chunk). This acts as a safety buffer that keeps facts near a boundary from being sliced in half.
  3. If your document contains Python or JavaScript source code, use a language-specific splitter such as RecursiveCharacterTextSplitter.from_language(language=Language.PYTHON), which tries function and class boundaries before falling back to lines and words.
  4. Set add_start_index=True. This stores each chunk's starting character offset in its metadata (under start_index), so you can map the chunk back to its exact position in the original document.
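
Steps 1, 2, and 4 can be sketched together in plain Python: derive the overlap from a ratio, then record a start offset for every chunk and use it to locate the chunk in the source. The names `SplitterConfig`, `make_config`, and `chunks_with_start_index` are hypothetical helpers for this illustration, and the splitting is character-based for simplicity. The `start_index` key mirrors the metadata key used when add_start_index=True.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitterConfig:
    chunk_size: int
    chunk_overlap: int


def make_config(chunk_size: int = 1000, overlap_ratio: float = 0.15) -> SplitterConfig:
    """Derive the overlap from a ratio of the chunk size (10-20% is the suggested range)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0.0 <= overlap_ratio < 1.0:
        raise ValueError("overlap_ratio must be in [0, 1)")
    # Clamp so the window always advances by at least one character.
    overlap = min(round(chunk_size * overlap_ratio), chunk_size - 1)
    return SplitterConfig(chunk_size=chunk_size, chunk_overlap=overlap)


def chunks_with_start_index(text: str, config: SplitterConfig) -> list[dict]:
    """Character-based chunks, each tagged with its offset in the original text."""
    step = config.chunk_size - config.chunk_overlap
    records: list[dict] = []
    for start in range(0, len(text), step):
        records.append({"text": text[start:start + config.chunk_size], "start_index": start})
        if start + config.chunk_size >= len(text):
            break
    return records


# Tiny numbers so the result is easy to inspect; real settings would be far larger.
document = "0123456789" * 5
config = make_config(chunk_size=20, overlap_ratio=0.2)
records = chunks_with_start_index(document, config)

# The stored offset is enough to find each chunk again in the source document.
round_trip_ok = all(
    document[r["start_index"]:r["start_index"] + len(r["text"])] == r["text"]
    for r in records
)
```

Storing the offset is what makes citations and highlighting possible later: given a retrieved chunk, the application can jump to the exact span in the source file.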

Knowledge Check

Question 1 of 3 (single choice)

Why is a 'Recursive' splitter preferred over a 'Character' splitter?