Retrieval-Augmented Generation RAG: A Comprehensive Course | AI Research

Retrieval-Augmented Generation (RAG): A Comprehensive Course

Jun 26, 2026

Retrieval-Augmented Generation (RAG) is a hybrid AI framework that enhances Large Language Models by grounding their responses in external, up-to-date data sources. Instead of relying solely on static pre-trained knowledge, RAG retrieves relevant documents at query time and feeds them into the model as additional context, enabling more accurate, current, and domain-specific outputs.

Traditional LLMs suffer from several critical limitations:

Hallucinations — generating plausible but factually incorrect information
Stale knowledge — training data has a cutoff date; models cannot know about recent events
Lack of domain specificity — general knowledge may not cover proprietary or niche domains
Limited context windows — models can only process a restricted number of tokens at once

RAG addresses these issues by dynamically injecting retrieved evidence into the generation process. The result is a system that produces responses grounded in verifiable sources, rather than relying on parametric memory alone.

According to survey figures cited by Databricks, over 60% of organizations are developing AI-powered retrieval tools to improve reliability, reduce hallucinations, and personalize outputs using internal data.

What is Retrieval Augmented Generation (RAG)? - Databricks - RAG definition, enterprise adoption statistics (>60% orgs), and hybrid architecture trends.

What is Retrieval-Augmented Generation (RAG)?

Core Architecture: The Two-Stage Pipeline

At its heart, RAG operates in two fundamental stages:

Retrieval — Fetch the most relevant documents from an external knowledge base based on the user query.
Generation — The LLM processes the retrieved data along with the original prompt to produce a coherent, fact-grounded response.

The key insight is that RAG decouples knowledge from generation. The LLM no longer needs to memorize every fact; instead, it acts as a reasoning engine that synthesizes retrieved evidence into natural language.

8 RAG Architectures You Should Know - Humanloop - Comprehensive overview of RAG architecture variants: naive, branched, HyDE, and more.

Evolution of RAG

RAG Introduced

2020

Facebook AI Research (Lewis et al.) publishes the original RAG paper, combining parametric memory (a pre-trained seq2seq model) with non-parametric memory (a dense vector index of Wikipedia).

Adoption in Enterprise

2021–2022

RAG gains traction in enterprise search, customer support, and knowledge management. Open-source frameworks like Haystack and LangChain emerge.

Vector Database Boom

2023

Pinecone, Qdrant, Weaviate, and Chroma popularize purpose-built vector databases. RAG becomes the de facto pattern for LLM apps with private data.

Advanced RAG Techniques

2024

Hybrid search, re-ranking, query transformation, Self-RAG, and GraphRAG push beyond naive retrieval. RAGAS becomes the standard evaluation framework.

Agentic & Modular RAG

2025

RAG evolves into agentic workflows with multi-step reasoning. Modular RAG decomposes the pipeline into swappable components. Retriever-generator co-training emerges.

The RAG Pipeline: End-to-End Deep Dive

A production RAG system consists of two major phases: an offline ingestion pipeline and a runtime retrieval-generation pipeline.

Ingestion Phase (Offline)

Step	Description	Key Decisions
Document Loading	Ingest PDFs, HTML, Markdown, databases, APIs	Format parsers, OCR for scanned docs
Chunking	Split documents into smaller pieces (sentences/paragraphs)	Chunk size, overlap %, semantic vs. fixed
Embedding	Convert chunks into dense vector representations	Model choice (OpenAI, BGE, Cohere), dimensionality
Indexing	Store embeddings in a vector database with metadata	HNSW vs. IVF index, metadata schema

Retrieval Phase (Runtime)

Step	Description	Key Decisions
Query Embedding	Encode the user's question with the same embedding model	Same model as ingestion — critical for alignment
Similarity Search	Find top-K nearest vectors by cosine similarity	K value, similarity threshold, metadata filters
Context Assembly	Combine retrieved chunks into the augmented prompt	Prompt template, chunk ordering, deduplication
LLM Generation	Produce the final grounded response	Model choice, temperature, max tokens, citation format
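The similarity-search step in the table above can be sketched in a few lines of NumPy. This is a brute-force illustration rather than how a vector database implements it; the function name `top_k_search`, the `where` predicate, and the sample vectors and metadata are assumptions made up for the example.

```python
import numpy as np

def top_k_search(query_vec, doc_vecs, metadata, k=3, min_score=0.0, where=None):
    """Return up to k (index, score) pairs ranked by cosine similarity.

    `where` is an optional predicate over a metadata dict, a hypothetical
    stand-in for a vector database's metadata filter.
    """
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q  # cosine similarity, since both sides are unit length

    results = []
    for idx in np.argsort(-scores):  # best score first
        if scores[idx] < min_score:
            break  # every remaining score is lower still
        if where is not None and not where(metadata[idx]):
            continue  # skip chunks that fail the metadata filter
        results.append((int(idx), float(scores[idx])))
        if len(results) == k:
            break
    return results

# Toy 3-dimensional "embeddings" with made-up metadata.
doc_vecs = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
])
metadata = [
    {"source": "handbook.pdf", "year": 2021},
    {"source": "handbook.pdf", "year": 2024},
    {"source": "faq.html", "year": 2024},
    {"source": "faq.html", "year": 2023},
]
query = np.array([1.0, 0.05, 0.0])

hits = top_k_search(query, doc_vecs, metadata, k=2, min_score=0.5,
                    where=lambda m: m["year"] >= 2023)
print(hits)
```

Here the closest vector overall (index 0) is dropped by the year filter, so the search returns indices 1 and 3. K, the score threshold, and the filter each change which chunks reach the prompt.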

RAG Pipeline Deep Dive: Ingestion, Chunking, Embedding, and Vector Search - Dev.to - Detailed walkthrough of all RAG pipeline phases including HNSW indexing and GPU acceleration.

Building a Naive RAG Pipeline

Step 1

Load your source documents using a document loader (e.g., PyPDF, Unstructured, or DocumentLoader). Extract raw text from PDFs, HTML pages, databases, or APIs. Attach useful metadata such as source filename, page number, author, and creation date.

Step 2

Split the raw documents into smaller chunks that fit the embedding model's context window. A common starting point is 512–1024 tokens with 10–20% overlap between adjacent chunks to preserve context at boundaries. Choose between fixed-size splitting, sentence-based splitting, or semantic chunking depending on document structure.

Step 3

Pass each chunk through an embedding model (e.g., text-embedding-3-small, BAAI/bge-base-en, Cohere embed v3) to produce a dense vector, typically 384 to 3072 dimensions. Batch requests for efficiency.

Step 4

Insert the embedding vectors and their associated metadata into a vector database. Create an index (HNSW for low-latency, IVF for large-scale) to enable fast approximate nearest-neighbor (ANN) search at query time.

Step 5

When a user submits a query, encode it using the exact same embedding model used during ingestion. Mismatched models will produce misaligned vector spaces and poor retrieval quality.

Step 6

Search the vector database for the top-K most similar document chunks. Use cosine similarity or dot product as the distance metric. Apply metadata filters when possible (e.g., restrict to recent documents or a specific author).

Step 7

Construct an augmented prompt that combines the user query with the retrieved context. A typical template:

Answer the question based on the context below.

Context:
{retrieved_chunks}

Question: {user_query}

Answer:

Step 8

Send the augmented prompt to the LLM. Configure generation parameters (temperature ≈ 0 for factual tasks). Optionally request citations or source references in the output for traceability.
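As a minimal sketch of the prompt-assembly step (Step 7), the function below fills the template shown above with retrieved chunks. The chunk shape (dicts with `text` and `source` keys), the numbered source tags, and the `max_chars` budget are assumptions for illustration, not something any framework prescribes.

```python
def build_prompt(user_query, chunks, max_chars=2000):
    """Fill the prompt template with deduplicated, source-tagged chunks.

    `chunks` is a best-first list of dicts with "text" and "source" keys
    (an assumed shape for this sketch).
    """
    seen = set()
    parts = []
    used = 0
    for chunk in chunks:
        text = chunk["text"].strip()
        if text in seen:
            continue  # drop exact duplicates
        entry = f"[{len(parts) + 1}] ({chunk['source']}) {text}"
        if used + len(entry) > max_chars:
            break  # stay within a rough context budget
        seen.add(text)
        parts.append(entry)
        used += len(entry)
    context = "\n".join(parts)
    return (
        "Answer the question based on the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {user_query}\n\n"
        "Answer:"
    )

chunks = [
    {"text": "RAG retrieves documents at query time.", "source": "intro.md"},
    {"text": "RAG retrieves documents at query time.", "source": "copy.md"},
    {"text": "Retrieved chunks are added to the prompt.", "source": "pipeline.md"},
]
prompt = build_prompt("How does RAG use documents?", chunks)
print(prompt)
```

Numbering each chunk gives the model something concrete to cite, which helps when the generation step asks for source references.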

Critical: Embedding Model Consistency

You must use the same embedding model for both ingestion and query-time encoding. If documents are embedded with model-A and queries with model-B, the vectors occupy different vector spaces, so similarity scores between them are meaningless and search returns irrelevant results. Models with different output dimensions typically fail outright at query time; models that share a dimension fail silently, which is harder to notice. This is a common and easily avoided deployment mistake in RAG systems.
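One way to guard against this mistake is to record the embedding model alongside the index and check it on every write and query. The sketch below is a toy in-memory index, not the API of any real vector database; the class name `GuardedIndex` and the model names are hypothetical.

```python
import numpy as np

class GuardedIndex:
    """Toy in-memory index that remembers which embedding model built it."""

    def __init__(self, model_name, dim):
        self.model_name = model_name
        self.dim = dim
        self.vectors = []
        self.payloads = []

    def add(self, vector, payload, model_name):
        self._check(vector, model_name)
        self.vectors.append(np.asarray(vector, dtype=float))
        self.payloads.append(payload)

    def search(self, query_vector, model_name, k=3):
        self._check(query_vector, model_name)
        q = np.asarray(query_vector, dtype=float)
        mat = np.vstack(self.vectors)
        scores = (mat @ q) / (np.linalg.norm(mat, axis=1) * np.linalg.norm(q))
        order = np.argsort(-scores)[:k]
        return [(self.payloads[i], float(scores[i])) for i in order]

    def _check(self, vector, model_name):
        # Refuse vectors from another model, even if the dimension matches.
        if model_name != self.model_name:
            raise ValueError(
                f"index built with {self.model_name!r}, got {model_name!r}")
        if len(vector) != self.dim:
            raise ValueError(f"expected {self.dim} dimensions, got {len(vector)}")

index = GuardedIndex("model-A", dim=3)
index.add([1.0, 0.0, 0.0], "doc-1", model_name="model-A")
index.add([0.0, 1.0, 0.0], "doc-2", model_name="model-A")

best = index.search([0.9, 0.1, 0.0], model_name="model-A", k=1)

try:
    index.search([0.9, 0.1, 0.0], model_name="model-B", k=1)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

The name check matters because two models with the same output dimension would otherwise pass a dimension check and quietly return poor results.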

Advanced RAG Techniques

Naive RAG (chunk, embed, retrieve top-K, generate) answers only ~63% of factual questions correctly, according to the Starmorph comparison. Advanced RAG adds quality-control layers at multiple stages to push that ceiling higher.

Advanced techniques fall into four categories:

RAG Techniques Compared: Best Practices Guide - Starmorph - Architecture decision tree, advanced vs. naive RAG comparison, and re-ranking impact benchmarks.

Query Transformation addresses the biggest bottleneck: the user. User queries are often ambiguous, incomplete, or poorly phrased, causing the query embedding to misalign with document embeddings.

Key techniques:

Query Rewriting: Use an LLM to rephrase the query for clarity and specificity before embedding.
HyDE (Hypothetical Document Embeddings): Generate a hypothetical answer, embed that instead of the query, and search for similar real documents.
Query Decomposition: Break complex multi-part questions into sub-questions, retrieve for each, then merge results.
Step-back Prompting: Ask the LLM to generate a broader, more abstract version of the question to improve retrieval recall.
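To make HyDE concrete, here is a minimal sketch under heavy simplifications: `embed` is a toy bag-of-words counter standing in for a real embedding model, and `fake_llm` is a hard-coded stand-in for an LLM call. Both are assumptions for illustration only. The control flow, which is the part that matters, generates a hypothetical answer, embeds that, and searches with it.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: bag-of-words counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query_vec, docs, k=1):
    ranked = sorted(docs, key=lambda d: cosine(query_vec, embed(d)), reverse=True)
    return ranked[:k]

def hyde_retrieve(query, docs, generate, k=1):
    """HyDE: embed a generated hypothetical answer instead of the raw query."""
    hypothetical = generate(query)
    return retrieve(embed(hypothetical), docs, k)

docs = [
    "Invoices are emailed on the first day of each month.",
    "Restart the router by holding the reset button for ten seconds.",
]

def fake_llm(query):
    # Hypothetical stand-in for an LLM call; a real system would prompt a
    # model with something like "Write a passage that answers: <query>".
    return "To fix the connection, restart the router using the reset button."

query = "My wifi keeps dropping, what should I do?"
raw_score = cosine(embed(query), embed(docs[1]))  # no shared words with the fix
via_hyde = hyde_retrieve(query, docs, fake_llm)
print(raw_score, via_hyde)
```

In this toy setup the raw query shares no vocabulary with the relevant document, while the hypothetical answer does. With real dense embeddings the gap is less stark, but the mechanism is the same: the search vector comes from answer-like text rather than question-like text.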

[Figure: RAG Architecture Comparison. Estimated accuracy and complexity of different RAG architectures (illustrative).]

Chunking Strategies: The Foundation of Retrieval Quality

Chunking is often the single most impactful design decision in a RAG system. Poor chunking leads to fragmented context, missed information, and noisy retrieval.

Strategy	Description	Best For
Fixed-size	Split at every N characters/tokens	Simple, uniform documents
Sentence-based	Split on sentence boundaries	Short documents, QA pairs
Paragraph-based	Split on paragraph or section breaks	Structured documents with headers
Semantic chunking	Group sentences by embedding similarity	Unstructured, topic-shifting text
Parent-child	Small chunks for retrieval, linked to larger parent chunks for generation	When you need fine-grained search but broad generation context
Sentence window	Retrieve a small chunk but return surrounding window of N sentences	When local context around a match matters

The recommended overlap is 10–20% between adjacent chunks to prevent loss of information at boundaries.
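Fixed-size chunking with overlap is simple enough to sketch directly. The function below slides a window over a list of tokens; the name `chunk_tokens` is made up for this example, and the tokens are placeholder strings rather than the output of a real tokenizer.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks

# Placeholder tokens t0..t19; a real pipeline would use a tokenizer here.
tokens = [f"t{i}" for i in range(20)]
chunks = chunk_tokens(tokens, chunk_size=8, overlap=2)
for c in chunks:
    print(c[0], "...", c[-1], f"({len(c)} tokens)")
```

With a chunk size of 8 and an overlap of 2 (25%), the 20 tokens yield three windows starting at positions 0, 6, and 12, and the last two tokens of each window reappear at the start of the next.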

RAG Pipeline Deep Dive: Ingestion, Chunking, Embedding, and Vector Search - Dev.to - Detailed walkthrough of all RAG pipeline phases including HNSW indexing and GPU acceleration.

Pro Tip: Chunk Size is a Hyperparameter

There is no universally optimal chunk size. Start with 512 tokens and 10% overlap, then measure retrieval quality on your specific data using context precision and faithfulness metrics. Domain-specific documents (legal contracts, medical records) often benefit from larger chunks (1024+), while FAQ-style content works better with smaller chunks (128–256). Always benchmark before committing to production.

Vector Databases: The Retrieval Engine

A vector database is the core retrieval layer in a RAG system. It stores document embeddings and performs semantic similarity search to find the most relevant information for a query.

Key Vector Databases in 2024–2025

Database	Type	Key Feature
Pinecone	Managed cloud	Zero-ops, production-ready
Qdrant	Open-source	Rust-based, high performance
Weaviate	Open-source	Built-in hybrid (BM25 + vector) search
Chroma	Open-source	Lightweight, great for prototyping
pgvector	PostgreSQL extension	Leverages existing Postgres infrastructure
Milvus	Open-source	Scales to billions of vectors

Index Types

HNSW (Hierarchical Navigable Small World): Graph-based ANN index. Excellent for low-latency, high-recall retrieval. Default in most modern vector DBs.
IVF (Inverted File Index): Partition-based index. Scales well to very large datasets but requires training on the dataset.
Hybrid Indexes: Combine HNSW with sparse indexes for hybrid dense + keyword search.
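The partition idea behind IVF can be illustrated with a small NumPy sketch. It is deliberately simplified: centroids are sampled at random instead of being trained with k-means, distances are squared L2, and the function names and the `nprobe` parameter are assumptions chosen for this example rather than any library's API. HNSW is graph-based and does not fit in a short sketch.

```python
import numpy as np

def build_ivf(vectors, n_lists, seed=0):
    rng = np.random.default_rng(seed)
    # Simplification: sample data points as centroids (real IVF trains them).
    centroids = vectors[rng.choice(len(vectors), size=n_lists, replace=False)]
    # Assign every vector to its nearest centroid by squared L2 distance.
    dists = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    assignment = dists.argmin(axis=1)
    lists = [np.flatnonzero(assignment == c) for c in range(n_lists)]
    return centroids, lists

def ivf_search(query, vectors, centroids, lists, k=5, nprobe=2):
    # Probe only the nprobe partitions whose centroids are closest to the query.
    centroid_d = ((centroids - query) ** 2).sum(axis=1)
    probe = np.argsort(centroid_d)[:nprobe]
    candidates = np.concatenate([lists[c] for c in probe])
    d = ((vectors[candidates] - query) ** 2).sum(axis=1)
    return candidates[np.argsort(d)[:k]]

def brute_force(query, vectors, k=5):
    d = ((vectors - query) ** 2).sum(axis=1)
    return np.argsort(d)[:k]

rng = np.random.default_rng(42)
data = rng.normal(size=(1000, 16))
centroids, lists = build_ivf(data, n_lists=10)
query = rng.normal(size=16)

exact = brute_force(query, data, k=5)
approx = ivf_search(query, data, centroids, lists, k=5, nprobe=2)
full = ivf_search(query, data, centroids, lists, k=5, nprobe=10)
recall = len(set(approx.tolist()) & set(exact.tolist())) / 5
print(f"recall@5 with nprobe=2: {recall:.2f}")
```

Probing 2 of 10 partitions scans only a fraction of the vectors and may miss some true neighbors. Probing all 10 scans everything and reproduces the brute-force result, which is the speed versus recall trade-off that `nprobe` controls.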

What is RAG: Understanding Retrieval-Augmented Generation - Qdrant - RAG architecture deep dive including retriever components, indexing, and query vectorization.

Embedding Models: Representing Meaning as Vectors

Embeddings are the mathematical bridge between human language and vector search. The quality of your embedding model directly determines retrieval quality.

$\text{similarity}(\mathbf{q}, \mathbf{d}) = \frac{\mathbf{q} \cdot \mathbf{d}}{\|\mathbf{q}\| \cdot \|\mathbf{d}\|} = \cos(\theta)$

where $\mathbf{q}$ is the query vector, $\mathbf{d}$ is the document vector, and $\theta$ is the angle between them.
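A small worked example may help ground the formula. The vectors below are made up, chosen so the arithmetic is easy to follow by hand: the dot product is 8 and both norms are 3.

```python
import math

def cosine_similarity(q, d):
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    return dot / (norm_q * norm_d)

q = [1.0, 2.0, 2.0]
d = [2.0, 1.0, 2.0]

# dot = 1*2 + 2*1 + 2*2 = 8; both norms are sqrt(9) = 3; so cos(theta) = 8/9.
score = cosine_similarity(q, d)
angle = math.degrees(math.acos(score))

# Scaling a vector changes its length but not its direction, so the score holds.
scaled = cosine_similarity(q, [2 * x for x in d])
print(round(score, 4), round(angle, 1), round(scaled, 4))
```

Because cosine similarity ignores vector length, a long document chunk and a short one pointing in the same direction score the same against a query.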

Popular Embedding Models

Model	Dimensions	Notes
OpenAI `text-embedding-3-small`	1536	Default for many, good balance
OpenAI `text-embedding-3-large`	3072	Higher quality, more storage
`BAAI/bge-base-en`	768	Strong open-source baseline
`BAAI/bge-large-en-v1.5`	1024	Strong MTEB benchmark performer among open-source models
Cohere `embed-v3`	1024	Multi-language, search-optimized
`nomic-embed-text`	768	Open-source, 8192 token context
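The "more storage" note in the table can be turned into rough arithmetic. The sketch below assumes vectors are stored as uncompressed float32 values (4 bytes each) and ignores index overhead and metadata, so real deployments will differ; the function name `index_size_gb` is made up for this example.

```python
def index_size_gb(n_vectors, dims, bytes_per_value=4):
    """Raw vector storage in GB (10**9 bytes), assuming float32 values."""
    return n_vectors * dims * bytes_per_value / 1e9

for name, dims in [
    ("text-embedding-3-small", 1536),
    ("text-embedding-3-large", 3072),
    ("BAAI/bge-base-en", 768),
]:
    size = index_size_gb(1_000_000, dims)
    print(f"{name}: {size:.2f} GB per million chunks")
```

Under these assumptions, one million chunks take about 6.14 GB at 1536 dimensions, 12.29 GB at 3072, and 3.07 GB at 768, so doubling the dimensionality doubles raw storage.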

RAG Evaluation: Measuring What Matters

You cannot improve what you cannot measure. The RAGAS framework has become the de facto standard for evaluating RAG systems, offering programmatic metrics across both the retrieval and generation stages.

Core RAG Evaluation Metrics

Metric	Component Evaluated	What It Measures
Context Precision	Retriever	Are the relevant items ranked at the top?
Context Recall	Retriever	Were all relevant documents retrieved?
Faithfulness	Generator	Is the answer supported by the retrieved context?
Answer Relevancy	Generator	Is the answer relevant to the query?
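The two retriever metrics can be approximated from plain relevance labels. This is a simplified, label-based sketch of the ideas in the table, not the RAGAS implementation, which relies on an LLM judge to decide relevance. The function names and sample labels are assumptions for illustration.

```python
def context_precision(relevance):
    """Mean of precision@k over the ranks k that hold a relevant chunk.

    `relevance` is a best-first list of 0/1 labels for the retrieved chunks.
    """
    hits = 0
    total = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0

def context_recall(retrieved_ids, relevant_ids):
    """Fraction of the known relevant items that were retrieved."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved_ids)) / len(relevant)

# Same two relevant chunks, ranked well versus ranked poorly.
good_ranking = context_precision([1, 0, 1, 0])  # (1/1 + 2/3) / 2
poor_ranking = context_precision([0, 1, 0, 1])  # (1/2 + 2/4) / 2
recall = context_recall(["a", "b", "c"], ["a", "c", "d"])  # 2 of 3 found
print(round(good_ranking, 3), poor_ranking, round(recall, 3))
```

Both rankings contain the same two relevant chunks, but placing them higher raises context precision from 0.5 to about 0.83. Context recall instead depends only on whether the relevant items appear at all.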

Hallucination Detection

In Cleanlab's benchmark, RAGAS Faithfulness reached an average precision of 0.762 for detecting incorrect answers, making it moderately effective for simple queries but less reliable for complex ones. Other methods include:

DeepEval Hallucination Metric: Estimates likelihood of contradictions between response and context.
Self-Evaluation: The LLM assesses its own factual correctness — simple but less reliable.
TLM (Trustworthy Language Model): Cleanlab's own detector, which outperformed the other methods in Cleanlab's benchmark.

Benchmarking Hallucination Detection Methods in RAG - Cleanlab - Comparative evaluation of RAGAS Faithfulness, DeepEval, and other hallucination detection methods with precision statistics.

[Figure: RAG Evaluation Metrics Mapping. Which metrics evaluate which pipeline components.]

RAG Architecture Decision Guide

RAG Key Concepts

What is RAG?

Retrieval-Augmented Generation: A hybrid AI framework that enhances LLM outputs by first retrieving relevant documents from an external knowledge base, then using those documents as context for generation. It reduces hallucinations and enables up-to-date, domain-specific responses.

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_openai import ChatOpenAI
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

# 1. Load documents
loader = PyPDFLoader("knowledge_base.pdf")
docs = loader.load()

# 2. Chunk
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64  # ~12.5% overlap
)
chunks = splitter.split_documents(docs)

# 3. Embed and store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    collection_name="rag_collection"
)

# 4. Create retriever
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5}
)

# 5. Build RAG chain
llm = ChatOpenAI(model="gpt-4o", temperature=0)
prompt = ChatPromptTemplate.from_template(
    """Answer based on the context only.

    Context:
    {context}

    Question: {input}

    Answer:"""
)
doc_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, doc_chain)

# 6. Query
response = rag_chain.invoke({"input": "What is RAG?"})
print(response["answer"])

Common RAG Anti-Patterns

Using different embedding models for ingestion and queries: the vectors will be in incompatible spaces.
Chunking too small (64–128 tokens): individual chunks lose context and coherence.
Ignoring metadata: filtering by date, source, or domain can substantially improve precision.
Skipping re-ranking: naive cosine similarity often returns marginally relevant results that waste context window tokens.
No evaluation loop: deploying without measuring faithfulness or context precision means you cannot detect hallucinations systematically.
Over-chunking: too many small chunks dilute the retrieval signal and increase noise.

RAG Architecture Decision Tree

Use this decision framework to select the right RAG architecture for your use case:

Is the answer in a single document chunk?
├─ Yes → Naive RAG (add reranker for precision)
└─ No → Does it need facts from 2–3 documents?
    ├─ Yes → Advanced RAG (hybrid + rerank + query transform)
    └─ No → Does it need to reason across many documents?
        ├─ Relationships? → GraphRAG
        ├─ Multi-step reasoning? → Agentic RAG
        └─ Mixed workload? → Adaptive RAG
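The tree above can also be written as a small function, which is handy for routing or for documenting a design choice in code. The function name, parameter names, and the string keys for `need` are hypothetical; the returned labels are taken from the tree.

```python
def choose_architecture(single_chunk, few_documents=False, need=None):
    """Walk the decision tree and return the suggested architecture."""
    if single_chunk:
        return "Naive RAG (add reranker for precision)"
    if few_documents:  # the answer needs facts from 2-3 documents
        return "Advanced RAG (hybrid + rerank + query transform)"
    # Otherwise the answer needs reasoning across many documents.
    options = {
        "relationships": "GraphRAG",
        "multi_step": "Agentic RAG",
        "mixed": "Adaptive RAG",
    }
    if need not in options:
        raise ValueError(f"need must be one of {sorted(options)}")
    return options[need]

print(choose_architecture(single_chunk=True))
print(choose_architecture(single_chunk=False, few_documents=True))
print(choose_architecture(single_chunk=False, need="relationships"))
```

Raising an error on an unknown `need` keeps the function from silently returning a default when the workload has not been characterized.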

Quick Component Recommendations

Use Case	Chunking	Retrieval	Post-Retrieval	Evaluation
FAQ Bot	Sentence (256 tok)	Dense only	Cross-encoder rerank	Faithfulness
Enterprise Search	Paragraph (512 tok)	Hybrid (dense + BM25)	MMR + Rerank	Context Precision
Legal Analysis	Semantic (1024 tok)	Hybrid + Metadata	Compression + Self-reflection	Context Recall
Research Assistant	Parent-child (128/1024)	Agentic multi-hop	Self-RAG verification	All RAGAS metrics

RAG Techniques Compared: Best Practices Guide - Starmorph - Architecture decision tree, advanced vs. naive RAG comparison, and re-ranking impact benchmarks.

Knowledge Check

What are the two fundamental stages of a RAG system?

Training and Inference

Retrieval and Generation

Encoding and Decoding

Chunking and Embedding
