Coursify

Retrieval-Augmented Generation (RAG) — From Fundamentals to Production-Ready Agentic RAG Systems

Building the Multi-Document Ingestion Layer

Build a multi-format document ingestion system supporting PDFs, web pages, and code repositories. Implement advanced chunking and metadata enrichment pipelines.

Learning Goals

  • Build multi-format document ingestion
  • Implement advanced chunking with metadata enrichment

The foundation of our Tech Support Agent is its knowledge base. This project draws on two kinds of sources, so we need an ingestion pipeline that handles both static PDFs (legacy manuals) and dynamic web content (new releases). We will use LangChain's document loaders together with recursive chunking, cleaning, and metadata tagging, so that the text we embed carries as little noise as possible.

In this lesson, we will implement the ETL (Extract, Transform, Load) layer of our capstone system.

Learning Goals

  • Build a multi-source ingestion script for local PDFs and web URLs.
  • Implement hierarchical chunking using the RecursiveCharacterTextSplitter.
  • Enrich documents with searchable metadata (source, date, category).

Core Concepts

1. Multi-Format Loading

Our agent draws on two kinds of sources, each handled by its own LangChain loader:

  • PyPDFLoader: For the existing local knowledge base.
  • WebBaseLoader: For grabbing live documentation from URLs.
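
Once the source list grows, it helps to decide per source which loader applies. The sketch below is one possible way to do that with the standard library only; `classify_source` and `LOADER_FOR_KIND` are hypothetical names introduced for illustration, and the mapping holds loader names as strings so the example runs without LangChain installed.

```python
from pathlib import Path
from urllib.parse import urlparse


def classify_source(source: str) -> str:
    """Return "web" for http(s) URLs, "pdf" for .pdf paths, else "unsupported"."""
    scheme = urlparse(source).scheme.lower()
    if scheme in ("http", "https"):
        return "web"
    if Path(source).suffix.lower() == ".pdf":
        return "pdf"
    return "unsupported"


# Hypothetical mapping from source kind to the loader named in this lesson.
LOADER_FOR_KIND = {"pdf": "PyPDFLoader", "web": "WebBaseLoader"}

sources = ["./manuals/v5_guide.pdf", "https://docs.sdk.com/changelog", "notes.txt"]
for src in sources:
    kind = classify_source(src)
    print(src, "->", LOADER_FOR_KIND.get(kind, "skip"))
```

Checking the URL scheme first means a PDF served over HTTP is still treated as a web source; whether that is the right choice depends on how the pipeline wants to fetch remote files.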

2. Strategic Metadata Enrichment

During ingestion, we tag each document so that later searches can pre-filter on metadata before ranking results by similarity:

  • doc_type: "internal" or "external".
  • last_updated: To help the agent prioritize newer information.
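
To make the two tags concrete, here is a minimal sketch that works on plain metadata dictionaries (a LangChain document's `metadata` attribute is also a dict). The helper names `enrich` and `prefilter`, the dates, and the assumption that PDFs count as "internal" and web pages as "external" are illustrative, not taken from the lesson.

```python
from datetime import date


def enrich(metadata: dict, doc_type: str, last_updated: date) -> dict:
    """Return a copy of the metadata with doc_type and last_updated added."""
    if doc_type not in ("internal", "external"):
        raise ValueError(f"unexpected doc_type: {doc_type!r}")
    # ISO date strings such as "2024-05-01" sort in chronological order.
    return {**metadata, "doc_type": doc_type, "last_updated": last_updated.isoformat()}


def prefilter(records: list[dict], doc_type: str, updated_since: date) -> list[dict]:
    """Keep records of one doc_type updated on or after the cutoff date."""
    cutoff = updated_since.isoformat()
    return [
        r for r in records
        if r["doc_type"] == doc_type and r["last_updated"] >= cutoff
    ]


# Example values are invented for illustration.
records = [
    enrich({"source": "./manuals/v5_guide.pdf"}, "internal", date(2022, 3, 1)),
    enrich({"source": "https://docs.sdk.com/changelog"}, "external", date(2024, 5, 1)),
]
recent_external = prefilter(records, "external", date(2024, 1, 1))
print([r["source"] for r in recent_external])
```

In a real vector store the same filter would usually be expressed through the store's own metadata filter syntax rather than in Python.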

3. The "Clean" Indexing Rule

Never index documents without cleaning. We will strip redundant headers and footer boilerplate from the web content to prevent "semantic dilution."
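
One simple way to approximate this cleaning step is to drop lines that repeat across most pages, since navigation bars and footers tend to be identical everywhere. The function below is a sketch under that assumption; `strip_repeated_lines`, the `min_share` threshold, and the sample pages are hypothetical and would need tuning on real content.

```python
import math
import re
from collections import Counter


def strip_repeated_lines(pages: list[str], min_share: float = 0.6) -> list[str]:
    """Drop lines found on at least `min_share` of the pages (and on 2+ pages)."""
    # Count each distinct line once per page.
    per_page = [{ln.strip() for ln in page.splitlines() if ln.strip()} for page in pages]
    counts = Counter(ln for lines in per_page for ln in lines)
    threshold = max(2, math.ceil(min_share * len(pages)))
    boilerplate = {ln for ln, n in counts.items() if n >= threshold}

    cleaned = []
    for page in pages:
        kept = [ln.strip() for ln in page.splitlines()
                if ln.strip() and ln.strip() not in boilerplate]
        # Collapse runs of spaces and tabs left over from HTML layout.
        cleaned.append(re.sub(r"[ \t]+", " ", "\n".join(kept)))
    return cleaned


pages = [
    "Home | Docs | Login\nInstall the SDK with pip.\n(c) SDK Inc.",
    "Home | Docs | Login\nVersion 5 adds   streaming.\n(c) SDK Inc.",
    "Home | Docs | Login\nVersion 4 is deprecated.\n(c) SDK Inc.",
]
cleaned = strip_repeated_lines(pages)
print(cleaned)
```

The floor of two pages keeps a single-page input untouched, because with one page there is no way to tell boilerplate from content by repetition alone.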

[Figure: Ingestion Flow]

Implementing the Ingestion Layer

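  1. Step 1

    Load the local PDF manual and the live changelog page, then combine them into one list:

        from langchain_community.document_loaders import PyPDFLoader, WebBaseLoader

        # PyPDFLoader returns one Document per PDF page.
        pdf_docs = PyPDFLoader("./manuals/v5_guide.pdf").load()
        # WebBaseLoader fetches the URL and extracts the page text.
        web_docs = WebBaseLoader("https://docs.sdk.com/changelog").load()

        # Both loaders return lists of Document objects, so they concatenate directly.
        all_docs = pdf_docs + web_docs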
  2. Step 2

    Use a 1000-character chunk with a 150-character overlap for balanced context:

        from langchain_text_splitters import RecursiveCharacterTextSplitter

        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,       # maximum characters per chunk
            chunk_overlap=150,     # characters shared between neighboring chunks
            add_start_index=True,  # store each chunk's offset as metadata["start_index"]
        )
        chunks = text_splitter.split_documents(all_docs)
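  3. Step 3

    Add searchable tags to each chunk:

        for chunk in chunks:
            chunk.metadata["project"] = "SDK_Support"
            # Match the URL scheme rather than any "http" substring, so a local
            # file such as "./manuals/http_guide.pdf" is not tagged as web.
            source = str(chunk.metadata.get("source", ""))
            if source.startswith(("http://", "https://")):
                chunk.metadata["source_type"] = "web"
            else:
                chunk.metadata["source_type"] = "internal"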
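  4. Step 4

    Embed the chunks and persist them to a local Chroma index:

        from langchain_chroma import Chroma
        from langchain_openai import OpenAIEmbeddings

        # OpenAIEmbeddings reads the API key from the OPENAI_API_KEY environment variable.
        vector_store = Chroma.from_documents(
            documents=chunks,
            embedding=OpenAIEmbeddings(),
            persist_directory="./db/capstone_index",  # the index is saved to this folder
        )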
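
To see what the chunk size and overlap from Step 2 mean in practice, the sketch below cuts a text into fixed-width windows. This is a deliberate simplification and not the algorithm RecursiveCharacterTextSplitter uses (that splitter first tries to break on separators such as paragraphs and lines); the hypothetical `window_chunks` helper only shows how the two numbers interact and what a start index records.

```python
def window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[dict]:
    """Fixed-width windows; each starts chunk_size - chunk_overlap after the last."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


text = "x" * 2500  # stand-in for a 2,500-character document
chunks = window_chunks(text, chunk_size=1000, chunk_overlap=150)
print([c["start_index"] for c in chunks])  # [0, 850, 1700]
print(f"overlap share: {150 / 1000:.0%}")  # overlap share: 15%
```

With these settings each new chunk advances 850 characters, so the last 150 characters of one chunk reappear at the start of the next.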

Common Mistakes

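  • Ignoring loader errors: Web loaders often fail on 404s or on JavaScript-heavy sites that return little static text. Wrap each loader call in a try/except block and log failures to LangSmith, so that one bad source neither aborts the run nor silently shrinks the knowledge base.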
  • Redundant overlap: A 50% overlap is too much. It produces near-duplicate chunks that waste storage and LLM context. Stick to 10-15% of the chunk size; the 150-character overlap on 1000-character chunks in Step 2 is 15%.
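
The loader advice in the first bullet could look roughly like the sketch below. It uses Python's standard `logging` module as a stand-in for whatever tracing backend the project uses, and takes loaders as plain callables so it runs without LangChain; `load_all` and `broken_loader` are hypothetical names.

```python
import logging
from typing import Callable

logger = logging.getLogger("ingestion")


def load_all(loaders: dict[str, Callable[[], list]]) -> tuple[list, dict[str, str]]:
    """Run every loader; collect documents and record failures instead of aborting."""
    docs: list = []
    failures: dict[str, str] = {}
    for name, load in loaders.items():
        try:
            loaded = load()
        except Exception as exc:  # broad on purpose: loaders raise many error types
            logger.warning("Loader %s failed: %s", name, exc)
            failures[name] = repr(exc)
            continue
        if not loaded:
            # A JavaScript-rendered page can "succeed" yet yield no documents.
            logger.warning("Loader %s returned no documents", name)
            failures[name] = "empty result"
            continue
        docs.extend(loaded)
    return docs, failures


def broken_loader() -> list:
    """Stand-in for a web loader that hits a missing page."""
    raise ConnectionError("404 Not Found")


# With LangChain installed, the callables could be bound methods such as
# PyPDFLoader("./manuals/v5_guide.pdf").load instead of these stand-ins.
docs, failures = load_all({
    "manual": lambda: ["page 1", "page 2"],
    "changelog": broken_loader,
})
print(len(docs), failures)
```

Returning the failures alongside the documents lets the caller decide whether a partial load is acceptable before anything is embedded.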

Recap

  • We built a unified ingestion pipeline for multiple formats.
  • We used Recursive Splitting to preserve semantic boundaries.
  • We enriched each chunk with metadata to enable targeted search in the next phase.

Knowledge Check

Question 1 of 3 (single choice)

Why do we combine multiple document types into a single 'all_docs' list before splitting?