Retrieval-Augmented Generation (RAG): A Comprehensive Course

Jun 26, 2026

Retrieval-Augmented Generation (RAG) is a hybrid AI framework that enhances Large Language Models by grounding their responses in external, up-to-date data sources. Instead of relying solely on static pre-trained knowledge, RAG retrieves relevant documents at query time and feeds them into the model as additional context, enabling more accurate, current, and domain-specific outputs.

Traditional LLMs suffer from several critical limitations:

  • Hallucinations — generating plausible but factually incorrect information
  • Stale knowledge — training data has a cutoff date; models cannot know about recent events
  • Lack of domain specificity — general knowledge may not cover proprietary or niche domains
  • Limited context windows — models can only process a restricted number of tokens at once

RAG addresses these issues by dynamically injecting retrieved evidence into the generation process. The result is a system that produces responses grounded in verifiable sources, rather than relying on parametric memory alone.

According to recent surveys, over 60% of organizations are developing AI-powered retrieval tools to improve reliability, reduce hallucinations, and personalize outputs using internal data.

Footnotes

  1. What is Retrieval Augmented Generation (RAG)? - Databricks - RAG definition, enterprise adoption statistics (>60% orgs), and hybrid architecture trends.

What is Retrieval-Augmented Generation (RAG)?

Core Architecture: The Two-Stage Pipeline

At its heart, RAG operates in two fundamental stages:

  1. Retrieval — Fetch the most relevant documents from an external knowledge base based on the user query.
  2. Generation — The LLM processes the retrieved data along with the original prompt to produce a coherent, fact-grounded response.

The key insight is that RAG decouples knowledge from generation. The LLM no longer needs to memorize every fact; instead, it acts as a reasoning engine that synthesizes retrieved evidence into natural language.
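
The decoupling described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a production design: the word-overlap scorer merely stands in for a real retriever, and `llm` is a hypothetical callable (any function mapping a prompt string to a response string). Neither comes from the source.

```python
from typing import Callable


def retrieve(query: str, knowledge_base: list[str], k: int = 2) -> list[str]:
    """Stage 1: rank documents by naive word overlap with the query (toy scorer)."""
    query_words = set(query.lower().split())
    scored = [
        (len(query_words & set(doc.lower().split())), doc)
        for doc in knowledge_base
    ]
    # Highest overlap first; documents with no overlap are dropped.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]


def generate(query: str, context: list[str], llm: Callable[[str], str]) -> str:
    """Stage 2: hand the retrieved evidence plus the question to the model."""
    prompt = "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}\nAnswer:"
    return llm(prompt)


def rag_answer(
    query: str, knowledge_base: list[str], llm: Callable[[str], str], k: int = 2
) -> str:
    """Run both stages: retrieval first, then generation over what was found."""
    return generate(query, retrieve(query, knowledge_base, k), llm)
```

Because the facts live in `knowledge_base` rather than in model weights, updating the system's knowledge means editing a document, not retraining the model.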

Footnotes

  1. 8 RAG Architectures You Should Know - Humanloop - Comprehensive overview of RAG architecture variants: naive, branched, HyDE, and more.

Evolution of RAG

RAG Introduced

2020

Facebook AI Research (Lewis et al.) publishes the original RAG paper, combining parametric memory (pre-trained seq2seq model) with non-parametric memory (Wikipedia dense vector index).

Adoption in Enterprise

2021–2022

RAG gains traction in enterprise search, customer support, and knowledge management. Open-source frameworks like Haystack and LangChain emerge.

Vector Database Boom

2023

Pinecone, Qdrant, Weaviate, and Chroma popularize purpose-built vector databases. RAG becomes the de facto pattern for LLM apps with private data.

Advanced RAG Techniques

2024

Hybrid search, re-ranking, query transformation, Self-RAG, and GraphRAG push beyond naive retrieval. RAGAS becomes the standard evaluation framework.

Agentic & Modular RAG

2025

RAG evolves into agentic workflows with multi-step reasoning. Modular RAG decomposes the pipeline into swappable components. Retriever-generator co-training emerges.

The RAG Pipeline: End-to-End Deep Dive

A production RAG system consists of two major phases: an offline ingestion pipeline and a runtime retrieval-generation pipeline.

Ingestion Phase (Offline)

Step | Description | Key Decisions
Document Loading | Ingest PDFs, HTML, Markdown, databases, APIs | Format parsers, OCR for scanned docs
Chunking | Split documents into smaller pieces (sentences/paragraphs) | Chunk size, overlap %, semantic vs. fixed
Embedding | Convert chunks into dense vector representations | Model choice (OpenAI, BGE, Cohere), dimensionality
Indexing | Store embeddings in a vector database with metadata | HNSW vs. IVF index, metadata schema

Retrieval Phase (Runtime)

Step | Description | Key Decisions
Query Embedding | Encode the user's question with the same embedding model | Same model as ingestion; critical for alignment
Similarity Search | Find top-K nearest vectors by cosine similarity | K value, similarity threshold, metadata filters
Context Assembly | Combine retrieved chunks into the augmented prompt | Prompt template, chunk ordering, deduplication
LLM Generation | Produce the final grounded response | Model choice, temperature, max tokens, citation format
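
The similarity-search step, with its K value, similarity threshold, and metadata filters, can be sketched as a brute-force scan. This is a minimal illustration assuming a small in-memory index; the entry layout (`id`, `vector`, `metadata`) and the names `top_k`, `min_score`, and `where` are hypothetical. A real vector database would replace the linear scan with an ANN index.

```python
import math
from typing import Any, Optional


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(
    query_vec: list[float],
    index: list[dict[str, Any]],
    k: int = 3,
    min_score: float = 0.0,
    where: Optional[dict[str, Any]] = None,
) -> list[tuple[float, str]]:
    """Return up to k (score, id) pairs, best first, after filtering."""
    hits = []
    for entry in index:
        # Metadata filter: skip entries whose metadata does not match.
        if where and any(entry["metadata"].get(key) != value
                         for key, value in where.items()):
            continue
        score = cosine(query_vec, entry["vector"])
        # Similarity threshold: drop weak matches before ranking.
        if score >= min_score:
            hits.append((score, entry["id"]))
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return hits[:k]
```

Filtering before scoring keeps irrelevant entries out of the candidate set entirely, which is usually cheaper than ranking everything and discarding results afterwards.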

Footnotes

  1. RAG Pipeline Deep Dive: Ingestion, Chunking, Embedding, and Vector Search - Dev.to - Detailed walkthrough of all RAG pipeline phases including HNSW indexing and GPU acceleration.

Building a Naive RAG Pipeline

  1. Load your source documents using a document loader (e.g., PyPDF, Unstructured, or DocumentLoader). Extract raw text from PDFs, HTML pages, databases, or APIs. Attach useful metadata such as source filename, page number, author, and creation date.

  2. Split the raw documents into smaller chunks that fit the embedding model's context window. A common starting point is 512–1024 tokens with 10–20% overlap between adjacent chunks to preserve context at boundaries. Choose between fixed-size splitting, sentence-based splitting, or semantic chunking depending on document structure.

  3. Pass each chunk through an embedding model (e.g., text-embedding-3-small, BAAI/bge-base-en, Cohere embed v3) to produce a dense vector, typically 384 to 3072 dimensions. Batch requests for efficiency.

  4. Insert the embedding vectors and their associated metadata into a vector database. Create an index (HNSW for low-latency, IVF for large-scale) to enable fast approximate nearest-neighbor (ANN) search at query time.

  5. When a user submits a query, encode it using the exact same embedding model used during ingestion. Mismatched models will produce misaligned vector spaces and poor retrieval quality.

  6. Search the vector database for the top-K most similar document chunks. Use cosine similarity or dot product as the similarity metric. Apply metadata filters when possible (e.g., restrict to recent documents or a specific author).

  7. Construct an augmented prompt that combines the user query with the retrieved context. A typical template:

    Answer the question based on the context below.
    
    Context:
    {retrieved_chunks}
    
    Question: {user_query}
    
    Answer:
    
  8. Send the augmented prompt to the LLM. Configure generation parameters (temperature ≈ 0 for factual tasks). Optionally request citations or source references in the output for traceability.
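
The prompt-construction step can be sketched as a small helper built around the template shown in the steps above. The function name `build_prompt`, the deduplication, and the `[n]` numbering of chunks are illustrative assumptions; the numbering is one simple way to let the model cite its sources.

```python
PROMPT_TEMPLATE = (
    "Answer the question based on the context below.\n\n"
    "Context:\n{retrieved_chunks}\n\n"
    "Question: {user_query}\n\n"
    "Answer:"
)


def build_prompt(user_query: str, chunks: list[str]) -> str:
    """Fill the template with deduplicated, numbered chunks."""
    seen = set()
    unique = []
    for chunk in chunks:
        text = chunk.strip()
        # Skip empty chunks and exact duplicates to save context tokens.
        if text and text not in seen:
            seen.add(text)
            unique.append(text)
    numbered = "\n".join(f"[{i}] {text}" for i, text in enumerate(unique, start=1))
    return PROMPT_TEMPLATE.format(retrieved_chunks=numbered, user_query=user_query)
```

Chunks are kept in the order the retriever returned them, so the highest-ranked evidence appears first in the context.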

Critical: Embedding Model Consistency

You MUST use the same embedding model for both ingestion and query-time encoding. If documents are embedded with model-A and queries are embedded with model-B, the vectors occupy different mathematical spaces, and similarity search will return irrelevant or random results. This is the single most common deployment mistake in RAG systems.
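
One way to guard against this mistake is to record the embedding model alongside the index and reject any vector produced by a different one. The sketch below is a hypothetical in-memory wrapper (`VectorIndex` and its methods are not from any library) intended only to show the check; many vector databases let you store similar information as collection metadata.

```python
from dataclasses import dataclass, field


@dataclass
class VectorIndex:
    """Toy index that remembers which embedding model produced its vectors."""
    embedding_model: str
    dimensions: int
    vectors: list[list[float]] = field(default_factory=list)

    def _check(self, vector: list[float], model: str) -> None:
        # A model-name mismatch means the vectors live in different spaces.
        if model != self.embedding_model:
            raise ValueError(
                f"index was built with {self.embedding_model!r}, got {model!r}"
            )
        # A dimension mismatch is a cheap second signal of the same problem.
        if len(vector) != self.dimensions:
            raise ValueError(
                f"expected {self.dimensions} dimensions, got {len(vector)}"
            )

    def add(self, vector: list[float], model: str) -> None:
        """Ingestion path: validate, then store."""
        self._check(vector, model)
        self.vectors.append(vector)

    def validate_query(self, vector: list[float], model: str) -> None:
        """Query path: apply the same validation before searching."""
        self._check(vector, model)
```

Note that matching dimensions alone prove nothing: two different models can emit vectors of the same length that are still incompatible, which is why the model name is checked explicitly.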

Advanced RAG Techniques

Naive RAG (chunk, embed, retrieve top-K, generate) answers only ~63% of factual questions correctly. Advanced RAG adds quality-control layers at multiple stages to push that ceiling higher.

Advanced techniques fall into four categories:

Footnotes

  1. RAG Techniques Compared: Best Practices Guide - Starmorph - Architecture decision tree, advanced vs. naive RAG comparison, and re-ranking impact benchmarks.

Query Transformation addresses the biggest bottleneck: the user. User queries are often ambiguous, incomplete, or poorly phrased, causing the query embedding to misalign with document embeddings.

Key techniques:

  • Query Rewriting: Use an LLM to rephrase the query for clarity and specificity before embedding.
  • HyDE (Hypothetical Document Embedding): Generate a hypothetical answer, embed that instead of the query, and search for similar real documents.
  • Query Decomposition: Break complex multi-part questions into sub-questions, retrieve for each, then merge results.
  • Step-back Prompting: Ask the LLM to generate a broader, more abstract version of the question to improve retrieval recall.
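
Two of the techniques above, HyDE and query decomposition, can be sketched with the model calls injected as plain functions. Everything named here is a hypothetical stand-in: `generate` and `decompose` would be LLM calls in practice, `embed` an embedding model, and `retrieve` a vector search returning `(chunk_id, score)` pairs. Merging by best score per chunk is one simple choice among several.

```python
from typing import Callable


def hyde_query_vector(
    query: str,
    generate: Callable[[str], str],
    embed: Callable[[str], list[float]],
) -> list[float]:
    """HyDE: embed a hypothetical answer instead of the raw query."""
    hypothetical = generate(f"Write a short passage that answers: {query}")
    return embed(hypothetical)


def decompose_and_retrieve(
    question: str,
    decompose: Callable[[str], list[str]],
    retrieve: Callable[[str], list[tuple[str, float]]],
    k: int = 3,
) -> list[tuple[str, float]]:
    """Query decomposition: retrieve per sub-question, then merge results."""
    best: dict[str, float] = {}
    for sub_question in decompose(question):
        for chunk_id, score in retrieve(sub_question):
            # Keep each chunk once, with the best score seen for it.
            if score > best.get(chunk_id, float("-inf")):
                best[chunk_id] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]
```

Passing the model calls in as arguments keeps the sketch testable with stubs and makes it easy to swap providers later.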

RAG Architecture Comparison

Estimated accuracy and complexity of different RAG architectures (illustrative)

Chunking Strategies: The Foundation of Retrieval Quality

Chunking is often the single most impactful design decision in a RAG system. Poor chunking leads to fragmented context, missed information, and noisy retrieval.

Strategy | Description | Best For
Fixed-size | Split at every N characters/tokens | Simple, uniform documents
Sentence-based | Split on sentence boundaries | Short documents, QA pairs
Paragraph-based | Split on paragraph or section breaks | Structured documents with headers
Semantic chunking | Group sentences by embedding similarity | Unstructured, topic-shifting text
Parent-child | Small chunks for retrieval, linked to larger parent chunks for generation | When you need fine-grained search but broad generation context
Sentence window | Retrieve a small chunk but return surrounding window of N sentences | When local context around a match matters

The recommended overlap is 10–20% between adjacent chunks to prevent loss of information at boundaries.
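
Fixed-size chunking with overlap can be sketched as a sliding window over a token list. This is a minimal illustration: `chunk_tokens` is a hypothetical helper, and it assumes the text has already been tokenized (splitting on whitespace is a rough stand-in for a real tokenizer).

```python
def chunk_tokens(tokens: list, chunk_size: int = 512, overlap: int = 64) -> list[list]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the window has reached the end of the input, so the
        # final chunk is not followed by a redundant tail.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

With `chunk_size=512` and `overlap=64`, each chunk repeats the last 64 tokens of its predecessor, which is 12.5% overlap and sits inside the 10–20% range recommended above.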

Footnotes

  1. RAG Pipeline Deep Dive: Ingestion, Chunking, Embedding, and Vector Search - Dev.to - Detailed walkthrough of all RAG pipeline phases including HNSW indexing and GPU acceleration.

Pro Tip: Chunk Size is a Hyperparameter

There is no universally optimal chunk size. Start with 512 tokens and 10% overlap, then measure retrieval quality on your specific data using context precision and faithfulness metrics. Domain-specific documents (legal contracts, medical records) often benefit from larger chunks (1024+), while FAQ-style content works better with smaller chunks (128–256). Always benchmark before committing to production.

Vector Databases: The Retrieval Engine

A Vector Database is the core retrieval layer in a RAG system. It stores document embeddings and performs semantic similarity search to find the most relevant information for a query.

Key Vector Databases in 2024–2025

Database | Type | Key Feature
Pinecone | Managed cloud | Zero-ops, production-ready
Qdrant | Open-source | Rust-based, high performance
Weaviate | Open-source | Built-in hybrid (BM25 + vector) search
Chroma | Open-source | Lightweight, great for prototyping
pgvector | PostgreSQL extension | Leverages existing Postgres infrastructure
Milvus | Open-source | Scales to billions of vectors

Index Types

  • HNSW (Hierarchical Navigable Small World): Graph-based ANN index. Excellent for low-latency, high-recall retrieval. Default in most modern vector DBs.
  • IVF (Inverted File Index): Partition-based index. Scales well to very large datasets but requires training on the dataset.
  • Hybrid Indexes: Combine HNSW with sparse indexes for hybrid dense + keyword search.
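
Hybrid search needs some way to merge the dense and keyword result lists. The text above does not say how, so the sketch below uses reciprocal rank fusion (RRF), a common technique chosen here purely for illustration; individual databases may fuse results differently. The constant `k=60` is a conventional default, not a value from the source.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists: each list contributes 1 / (k + rank) per ID."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # IDs that rank well in several lists accumulate the highest totals.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
```

RRF works on ranks rather than raw scores, which sidesteps the problem that BM25 scores and cosine similarities are on incomparable scales.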

Footnotes

  1. What is RAG: Understanding Retrieval-Augmented Generation - Qdrant - RAG architecture deep dive including retriever components, indexing, and query vectorization.

Embedding Models: Representing Meaning as Vectors

Embeddings are the mathematical bridge between human language and vector search. The quality of your embedding model directly determines retrieval quality.

$$\text{similarity}(\mathbf{q}, \mathbf{d}) = \frac{\mathbf{q} \cdot \mathbf{d}}{\|\mathbf{q}\| \cdot \|\mathbf{d}\|} = \cos(\theta)$$

where $\mathbf{q}$ is the query vector, $\mathbf{d}$ is the document vector, and $\theta$ is the angle between them.
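
As a worked instance of the formula, the sketch below computes cosine similarity directly and then shows that, for vectors scaled to unit length, a plain dot product gives the same number. The helper names are illustrative only.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(q: list[float], d: list[float]) -> float:
    """q . d divided by the product of the two vector lengths."""
    norm_q = math.sqrt(dot(q, q))
    norm_d = math.sqrt(dot(d, d))
    if norm_q == 0 or norm_d == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot(q, d) / (norm_q * norm_d)


def normalize(v: list[float]) -> list[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


q = [3.0, 4.0]
d = [4.0, 3.0]
# q . d = 12 + 12 = 24, and both lengths are 5, so cos(theta) = 24 / 25 = 0.96
direct = cosine_similarity(q, d)
# After normalization the denominator is 1, so the dot product alone suffices.
via_unit_vectors = dot(normalize(q), normalize(d))
```

This is why many systems normalize embeddings once at ingestion: dot product and cosine similarity then rank results identically, and the dot product is cheaper to compute.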

Model | Dimensions | Notes
OpenAI text-embedding-3-small | 1536 | Default for many, good balance
OpenAI text-embedding-3-large | 3072 | Higher quality, more storage
BAAI/bge-base-en | 768 | Strong open-source baseline
BAAI/bge-large-en-v1.5 | 1024 | Top MTEB benchmark performer
Cohere embed-v3 | 1024 | Multi-language, search-optimized
nomic-embed-text | 768 | Open-source, 8192 token context

RAG Evaluation: Measuring What Matters

You cannot improve what you cannot measure. The RAGAS framework has become the de facto standard for evaluating RAG systems, offering programmatic metrics across both the retrieval and generation stages.

Core RAG Evaluation Metrics

Metric | Component Evaluated | What It Measures
Context Precision | Retriever | Are the relevant items ranked at the top?
Context Recall | Retriever | Were all relevant documents retrieved?
Faithfulness | Generator | Is the answer supported by the retrieved context?
Answer Relevancy | Generator | Is the answer relevant to the query?
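
The two retriever metrics can be approximated when you have ground-truth relevant document IDs. The sketch below is a simplified, ID-based illustration of the ideas in the table and is not the RAGAS implementation, which typically relies on LLM judgments rather than labeled IDs. The function names and inputs are assumptions.

```python
def context_precision(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Average of precision@rank over the ranks where a relevant item appears.

    Rewards rankings that place relevant items near the top.
    """
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0


def context_recall(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of the relevant items that were retrieved at all."""
    if not relevant_ids:
        return 0.0
    return len(relevant_ids & set(retrieved_ids)) / len(relevant_ids)
```

The same retrieved set scores differently depending on order: moving a relevant chunk from rank 1 to rank 2 lowers precision while leaving recall unchanged, which is the distinction the table draws between the two metrics.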

Hallucination Detection

RAGAS Faithfulness had an average precision of 0.762 for detecting incorrect answers, making it moderately effective for simple queries but less reliable for complex ones. Other methods include:

  • DeepEval Hallucination Metric: Estimates likelihood of contradictions between response and context.
  • Self-Evaluation: The LLM assesses its own factual correctness. Simple but less reliable.
  • TLM (Trustworthy Language Model): Reported to outperform the other methods in the Cleanlab benchmark cited below.

Footnotes

  1. Benchmarking Hallucination Detection Methods in RAG - Cleanlab - Comparative evaluation of RAGAS Faithfulness, DeepEval, and other hallucination detection methods with precision statistics.

RAG Evaluation Metrics Mapping

Which metrics evaluate which pipeline components

RAG Architecture Decision Guide

RAG Key Concepts

What is RAG?

Retrieval-Augmented Generation: A hybrid AI framework that enhances LLM outputs by first retrieving relevant documents from an external knowledge base, then using those documents as context for generation. It reduces hallucinations and enables up-to-date, domain-specific responses.

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_openai import ChatOpenAI
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

# 1. Load documents
loader = PyPDFLoader("knowledge_base.pdf")
docs = loader.load()

# 2. Chunk
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,   # counted in characters by default, not tokens
    chunk_overlap=64  # ~12.5% overlap
)
chunks = splitter.split_documents(docs)

# 3. Embed and store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    collection_name="rag_collection"
)

# 4. Create retriever
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5}
)

# 5. Build RAG chain
llm = ChatOpenAI(model="gpt-4o", temperature=0)
prompt = ChatPromptTemplate.from_template(
    """Answer based on the context only.

    Context:
    {context}

    Question: {input}

    Answer:"""
)
doc_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, doc_chain)

# 6. Query
response = rag_chain.invoke({"input": "What is RAG?"})
print(response["answer"])

Common RAG Anti-Patterns

  1. Using different embedding models for ingestion and queries — vectors will be in incompatible spaces.
  2. Chunking too small (64-128 tokens) — loses context and coherence.
  3. Ignoring metadata — filtering by date, source, or domain dramatically improves precision.
  4. Skipping re-ranking — naive cosine similarity often returns marginally relevant results that waste context window tokens.
  5. No evaluation loop — deploying without measuring faithfulness or context precision means you cannot detect hallucinations systematically.
  6. Over-chunking — too many small chunks dilutes retrieval signal and increases noise.

RAG Architecture Decision Tree

Use this decision framework to select the right RAG architecture for your use case:

Is the answer in a single document chunk?
├─ Yes → Naive RAG (add reranker for precision)
└─ No → Does it need facts from 2–3 documents?
    ├─ Yes → Advanced RAG (hybrid + rerank + query transform)
    └─ No → Does it need to reason across many documents?
        ├─ Relationships? → GraphRAG
        ├─ Multi-step reasoning? → Agentic RAG
        └─ Mixed workload? → Adaptive RAG
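
The tree above can also be written as a small function, which is convenient if the choice is made programmatically (for example, by a router in front of several pipelines). The function name, parameter names, and the string keys for the last branch are illustrative assumptions; the returned labels are taken from the tree.

```python
def choose_architecture(
    single_chunk: bool,
    needs_few_documents: bool = False,
    many_document_need: str = "mixed",
) -> str:
    """Walk the decision tree from top to bottom and return the suggestion."""
    if single_chunk:
        return "Naive RAG (add reranker for precision)"
    if needs_few_documents:
        return "Advanced RAG (hybrid + rerank + query transform)"
    # Remaining case: reasoning across many documents.
    options = {
        "relationships": "GraphRAG",
        "multi_step": "Agentic RAG",
        "mixed": "Adaptive RAG",
    }
    if many_document_need not in options:
        raise ValueError(f"unknown need: {many_document_need!r}")
    return options[many_document_need]
```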

Quick Component Recommendations

Use Case | Chunking | Retrieval | Post-Retrieval | Evaluation
FAQ Bot | Sentence (256 tok) | Dense only | Cross-encoder rerank | Faithfulness
Enterprise Search | Paragraph (512 tok) | Hybrid (dense + BM25) | MMR + Rerank | Context Precision
Legal Analysis | Semantic (1024 tok) | Hybrid + Metadata | Compression + Self-reflection | Context Recall
Research Assistant | Parent-child (128/1024) | Agentic multi-hop | Self-RAG verification | All RAGAS metrics

Footnotes

  1. RAG Techniques Compared: Best Practices Guide - Starmorph - Architecture decision tree, advanced vs. naive RAG comparison, and re-ranking impact benchmarks.

Knowledge Check

What are the two fundamental stages of a RAG system?
