Retrieval-Augmented Generation (RAG): A Comprehensive Course
Retrieval-Augmented Generation (RAG) is a hybrid AI framework that enhances Large Language Models by grounding their responses in external, up-to-date data sources. Instead of relying solely on static pre-trained knowledge, RAG retrieves relevant documents at query time and feeds them into the model as additional context, enabling more accurate, current, and domain-specific outputs.
Traditional LLMs suffer from several critical limitations:
- Hallucinations — generating plausible but factually incorrect information
- Stale knowledge — training data has a cutoff date; models cannot know about recent events
- Lack of domain specificity — general knowledge may not cover proprietary or niche domains
- Limited context windows — models can only process a restricted number of tokens at once
RAG addresses these issues by dynamically injecting retrieved evidence into the generation process. The result is a system that produces responses grounded in verifiable sources, rather than relying on parametric memory alone.
According to figures cited by Databricks, over 60% of organizations are developing AI-powered retrieval tools to improve reliability, reduce hallucinations, and personalize outputs using internal data.
Footnotes
- What is Retrieval Augmented Generation (RAG)? - Databricks - RAG definition, enterprise adoption statistics (>60% orgs), and hybrid architecture trends.
What is Retrieval-Augmented Generation (RAG)?
Core Architecture: The Two-Stage Pipeline
At its heart, RAG operates in two fundamental stages:
- Retrieval — Fetch the most relevant documents from an external knowledge base based on the user query.
- Generation — The LLM processes the retrieved data along with the original prompt to produce a coherent, fact-grounded response.
The key insight is that RAG decouples knowledge from generation. The LLM no longer needs to memorize every fact; instead, it acts as a reasoning engine that synthesizes retrieved evidence into natural language.
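The decoupling described above can be sketched as a function that takes the retriever and the generator as interchangeable parts. This is a minimal illustration, not a production design: `rag_answer`, `keyword_retriever`, `echo_generator`, and `KNOWLEDGE_BASE` are hypothetical names, the retriever is a toy keyword matcher, and the generator is a placeholder standing in for a real LLM call.

```python
import re
from typing import Callable, List

Retriever = Callable[[str, int], List[str]]
Generator = Callable[[str], str]


def rag_answer(query: str, retrieve: Retriever, generate: Generator, k: int = 3) -> str:
    """Stage 1: fetch evidence. Stage 2: generate from the query plus evidence."""
    chunks = retrieve(query, k)
    context = "\n".join(f"- {chunk}" for chunk in chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return generate(prompt)


# Hypothetical stand-ins so the sketch runs without any external service.
KNOWLEDGE_BASE = [
    "RAG retrieves documents at query time.",
    "HNSW is a graph-based ANN index.",
    "Chunk overlap preserves context at boundaries.",
]


def keyword_retriever(query: str, k: int) -> List[str]:
    """Toy retriever: rank documents by the number of words shared with the query."""
    words = set(re.findall(r"\w+", query.lower()))
    ranked = sorted(
        KNOWLEDGE_BASE,
        key=lambda doc: len(words & set(re.findall(r"\w+", doc.lower()))),
        reverse=True,
    )
    return ranked[:k]


def echo_generator(prompt: str) -> str:
    """Placeholder for an LLM call: returns the first retrieved context line."""
    for line in prompt.splitlines():
        if line.startswith("- "):
            return line[2:]
    return ""


answer = rag_answer("What is HNSW?", keyword_retriever, echo_generator, k=1)
print(answer)
```

Because the knowledge lives in `KNOWLEDGE_BASE` rather than in the generator, updating the facts means editing the data, not retraining or replacing the model.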
Footnotes
- 8 RAG Architectures You Should Know - Humanloop - Comprehensive overview of RAG architecture variants: naive, branched, HyDE, and more.
Evolution of RAG
RAG Introduced
2020: Facebook AI Research (Lewis et al.) publishes the original RAG paper, combining parametric memory (a pre-trained seq2seq model) with non-parametric memory (a dense vector index of Wikipedia).
Adoption in Enterprise
2021–2022: RAG gains traction in enterprise search, customer support, and knowledge management. Open-source frameworks like Haystack and LangChain emerge.
Vector Database Boom
2023: Pinecone, Qdrant, Weaviate, and Chroma popularize purpose-built vector databases. RAG becomes the de facto pattern for LLM apps with private data.
Advanced RAG Techniques
2024: Hybrid search, re-ranking, query transformation, Self-RAG, and GraphRAG push beyond naive retrieval. RAGAS becomes the standard evaluation framework.
Agentic & Modular RAG
2025: RAG evolves into agentic workflows with multi-step reasoning. Modular RAG decomposes the pipeline into swappable components. Retriever-generator co-training emerges.
The RAG Pipeline: End-to-End Deep Dive
A production RAG system consists of two major phases: an offline ingestion pipeline and a runtime retrieval-generation pipeline.
Ingestion Phase (Offline)
| Step | Description | Key Decisions |
|---|---|---|
| Document Loading | Ingest PDFs, HTML, Markdown, databases, APIs | Format parsers, OCR for scanned docs |
| Chunking | Split documents into smaller pieces (sentences/paragraphs) | Chunk size, overlap %, semantic vs. fixed |
| Embedding | Convert chunks into dense vector representations | Model choice (OpenAI, BGE, Cohere), dimensionality |
| Indexing | Store embeddings in a vector database with metadata | HNSW vs. IVF index, metadata schema |
Retrieval Phase (Runtime)
| Step | Description | Key Decisions |
|---|---|---|
| Query Embedding | Encode the user's question with the same embedding model | Same model as ingestion — critical for alignment |
| Similarity Search | Find top-K nearest vectors by cosine similarity | K value, similarity threshold, metadata filters |
| Context Assembly | Combine retrieved chunks into the augmented prompt | Prompt template, chunk ordering, deduplication |
| LLM Generation | Produce the final grounded response | Model choice, temperature, max tokens, citation format |
Footnotes
- RAG Pipeline Deep Dive: Ingestion, Chunking, Embedding, and Vector Search - Dev.to - Detailed walkthrough of all RAG pipeline phases including HNSW indexing and GPU acceleration.
Building a Naive RAG Pipeline
- Step 1: Load your source documents using a document loader (e.g., PyPDF, Unstructured, or DocumentLoader). Extract raw text from PDFs, HTML pages, databases, or APIs. Attach useful metadata such as source filename, page number, author, and creation date.
- Step 2: Split the raw documents into smaller chunks that fit the embedding model's context window. A common starting point is 512–1024 tokens with 10–20% overlap between adjacent chunks to preserve context at boundaries. Choose between fixed-size splitting, sentence-based splitting, or semantic chunking depending on document structure.
- Step 3: Pass each chunk through an embedding model (e.g., text-embedding-3-small, BAAI/bge-base-en, Cohere embed v3) to produce a dense vector, typically 384 to 3072 dimensions. Batch requests for efficiency.
- Step 4: Insert the embedding vectors and their associated metadata into a vector database. Create an index (HNSW for low-latency, IVF for large-scale) to enable fast approximate nearest-neighbor (ANN) search at query time.
- Step 5: When a user submits a query, encode it using the exact same embedding model used during ingestion. Mismatched models will produce misaligned vector spaces and poor retrieval quality.
- Step 6: Search the vector database for the top-K most similar document chunks. Use cosine similarity or dot product as the similarity metric. Apply metadata filters when possible (e.g., restrict to recent documents or a specific author).
- Step 7: Construct an augmented prompt that combines the user query with the retrieved context. A typical template:

      Answer the question based on the context below.
      Context: {retrieved_chunks}
      Question: {user_query}
      Answer:

- Step 8: Send the augmented prompt to the LLM. Configure generation parameters (temperature ≈ 0 for factual tasks). Optionally request citations or source references in the output for traceability.
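The eight steps can be sketched end to end in a few dozen lines. This is a toy under stated assumptions: `embed` is a word-count stand-in for a real embedding model, the "vector database" is a plain Python list, each document is treated as a single chunk, and the final LLM call is left out. All function and variable names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Counter:
    """Toy 'embedding': word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Steps 1-4: load documents with metadata, chunk (one chunk per document), embed, index.
documents: List[Dict[str, str]] = [
    {"text": "HNSW is a graph-based index for approximate nearest-neighbor search.", "source": "indexes.md"},
    {"text": "Chunk overlap of 10-20% preserves context at chunk boundaries.", "source": "chunking.md"},
    {"text": "Temperature near zero makes generation more deterministic.", "source": "generation.md"},
]
index = [(embed(doc["text"]), doc) for doc in documents]


def retrieve(query: str, k: int = 2) -> List[Dict[str, str]]:
    """Steps 5-6: embed the query with the same function, then take the top-K chunks."""
    query_vector = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vector, item[0]), reverse=True)
    return [doc for _, doc in ranked[:k]]


def build_prompt(query: str, chunks: List[Dict[str, str]]) -> str:
    """Step 7: assemble the augmented prompt, tagging each chunk with its source."""
    context = "\n".join(f"[{chunk['source']}] {chunk['text']}" for chunk in chunks)
    return (
        "Answer the question based on the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}\n"
        "Answer:"
    )


question = "How much chunk overlap should I use?"
top_chunks = retrieve(question, k=1)
prompt = build_prompt(question, top_chunks)
print(prompt)  # Step 8 would send this string to an LLM.
```

Tagging each chunk with its source in the prompt is one simple way to make the citations mentioned in Step 8 possible.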
Critical: Embedding Model Consistency
You MUST use the same embedding model for both ingestion and query-time encoding. If documents are embedded with model-A and queries are embedded with model-B, the vectors occupy different mathematical spaces, and similarity search will return irrelevant or random results. This is the single most common deployment mistake in RAG systems.
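One way to guard against this mistake is to store the embedding model's name and dimensionality alongside the index and check them on every write and query. The sketch below assumes a hypothetical in-memory `VectorIndex` class; real vector databases expose their own metadata mechanisms, which are not shown here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VectorIndex:
    """Hypothetical in-memory index that remembers which model produced its vectors."""

    embedding_model: str
    dimensions: int
    vectors: List[List[float]] = field(default_factory=list)

    def _check(self, vector: List[float], model: str) -> None:
        # Reject vectors from another model: they live in a different vector space.
        if model != self.embedding_model:
            raise ValueError(
                f"index was built with {self.embedding_model!r}, got vector from {model!r}"
            )
        # A dimension mismatch is a second, cheaper signal of the same problem.
        if len(vector) != self.dimensions:
            raise ValueError(f"expected {self.dimensions} dimensions, got {len(vector)}")

    def add(self, vector: List[float], model: str) -> None:
        self._check(vector, model)
        self.vectors.append(vector)

    def check_query(self, vector: List[float], model: str) -> None:
        self._check(vector, model)


idx = VectorIndex(embedding_model="model-A", dimensions=3)
idx.add([0.1, 0.2, 0.3], model="model-A")

try:
    idx.check_query([0.3, 0.2, 0.1], model="model-B")
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
print(mismatch_detected)
```

Note that a dimension check alone is not enough, since two different models can share the same output size; the model name is what actually identifies the space.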
Advanced RAG Techniques
Naive RAG (chunk, embed, retrieve top-K, generate) answers only ~63% of factual questions correctly in the cited comparison. Advanced RAG adds quality-control layers at multiple stages to push that ceiling higher.
Advanced techniques fall into four categories:
Footnotes
- RAG Techniques Compared: Best Practices Guide - Starmorph - Architecture decision tree, advanced vs. naive RAG comparison, and re-ranking impact benchmarks.
Query Transformation addresses the biggest bottleneck: the user. User queries are often ambiguous, incomplete, or poorly phrased, causing the query embedding to misalign with document embeddings.
Key techniques:
- Query Rewriting: Use an LLM to rephrase the query for clarity and specificity before embedding.
- HyDE (Hypothetical Document Embedding): Generate a hypothetical answer, embed that instead of the query, and search for similar real documents.
- Query Decomposition: Break complex multi-part questions into sub-questions, retrieve for each, then merge results.
- Step-back Prompting: Ask the LLM to generate a broader, more abstract version of the question to improve retrieval recall.
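As one illustration, query decomposition can be sketched as: split the question, retrieve for each sub-question, then merge the results without duplicates. In this sketch `split_on_and` is a deliberately crude, hypothetical stand-in for the LLM call that would normally produce sub-questions, and `keyword_retrieve` stands in for a real vector search.

```python
import re
from typing import Callable, List


def decompose_and_retrieve(
    question: str,
    decompose: Callable[[str], List[str]],
    retrieve: Callable[[str, int], List[str]],
    k: int = 2,
) -> List[str]:
    """Retrieve for each sub-question and merge results, keeping first-seen order."""
    merged: List[str] = []
    seen = set()
    for sub_question in decompose(question):
        for chunk in retrieve(sub_question, k):
            if chunk not in seen:
                seen.add(chunk)
                merged.append(chunk)
    return merged


def split_on_and(question: str) -> List[str]:
    """Hypothetical stand-in for an LLM decomposition call."""
    parts = [part.strip(" ?") for part in question.split(" and ")]
    return [part + "?" for part in parts if part]


CORPUS = [
    "HNSW is a graph-based ANN index.",
    "IVF is a partition-based index.",
    "Faithfulness checks that answers are supported by context.",
]


def keyword_retrieve(query: str, k: int) -> List[str]:
    """Toy retriever: rank corpus entries by words shared with the query."""
    words = set(re.findall(r"\w+", query.lower()))
    scored = [(len(words & set(re.findall(r"\w+", doc.lower()))), doc) for doc in CORPUS]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in ranked if score > 0][:k]


results = decompose_and_retrieve(
    "What is HNSW and what is IVF?", split_on_and, keyword_retrieve, k=1
)
print(results)
```

With `k=1`, a single retrieval for the combined question could only return one of the two relevant chunks; retrieving per sub-question surfaces both.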
[Figure: RAG Architecture Comparison. Estimated accuracy and complexity of different RAG architectures (illustrative).]
Chunking Strategies: The Foundation of Retrieval Quality
Chunking is often the single most impactful design decision in a RAG system. Poor chunking leads to fragmented context, missed information, and noisy retrieval.
| Strategy | Description | Best For |
|---|---|---|
| Fixed-size | Split at every N characters/tokens | Simple, uniform documents |
| Sentence-based | Split on sentence boundaries | Short documents, QA pairs |
| Paragraph-based | Split on paragraph or section breaks | Structured documents with headers |
| Semantic chunking | Group sentences by embedding similarity | Unstructured, topic-shifting text |
| Parent-child | Small chunks for retrieval, linked to larger parent chunks for generation | When you need fine-grained search but broad generation context |
| Sentence window | Retrieve a small chunk but return surrounding window of N sentences | When local context around a match matters |
The recommended overlap is 10–20% between adjacent chunks to prevent loss of information at boundaries.
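Fixed-size chunking with overlap can be sketched as a sliding window whose step is the chunk size minus the overlap. This is a minimal illustration: `chunk_tokens` is a hypothetical helper, and the "tokens" are simple placeholder strings rather than the output of a real tokenizer.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window reaches the end, to avoid a trailing fragment
        # that would consist only of already-covered tokens.
        if start + chunk_size >= len(tokens):
            break
    return chunks


words = [f"w{i}" for i in range(10)]
chunks = chunk_tokens(words, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Here each chunk repeats the last token of the previous one, which is a 25% overlap; at a 512-token chunk size, the 10–20% guideline corresponds to roughly 51–102 overlapping tokens.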
Pro Tip: Chunk Size is a Hyperparameter
There is no universally optimal chunk size. Start with 512 tokens and 10% overlap, then measure retrieval quality on your specific data using context precision and faithfulness metrics. Domain-specific documents (legal contracts, medical records) often benefit from larger chunks (1024+), while FAQ-style content works better with smaller chunks (128–256). Always benchmark before committing to production.
Vector Databases: The Retrieval Engine
A Vector Database is the core retrieval layer in a RAG system. It stores document embeddings and performs semantic similarity search to find the most relevant information for a query.
Key Vector Databases in 2024–2025
| Database | Type | Key Feature |
|---|---|---|
| Pinecone | Managed cloud | Zero-ops, production-ready |
| Qdrant | Open-source | Rust-based, high performance |
| Weaviate | Open-source | Built-in hybrid (BM25 + vector) search |
| Chroma | Open-source | Lightweight, great for prototyping |
| pgvector | PostgreSQL extension | Leverages existing Postgres infrastructure |
| Milvus | Open-source | Scales to billions of vectors |
Index Types
- HNSW (Hierarchical Navigable Small World): Graph-based ANN index. Excellent for low-latency, high-recall retrieval. Default in most modern vector DBs.
- IVF (Inverted File Index): Partition-based index. Scales well to very large datasets but requires training on the dataset.
- Hybrid Indexes: Combine HNSW with sparse indexes for hybrid dense + keyword search.
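To make the partition-based idea behind IVF concrete, here is a toy sketch: vectors are assigned to the nearest centroid, and a query scans only the closest partitions instead of the whole dataset. The assumptions are significant: `ToyIVF` is a hypothetical class, the centroids are supplied by hand (a real IVF index learns them from the data, which is the training step mentioned above), distance is plain Euclidean, and `nprobe` is simply this sketch's name for the number of partitions scanned.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def l2(a: Vector, b: Vector) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class ToyIVF:
    """Toy inverted-file index with hand-supplied centroids (assumption, not trained)."""

    def __init__(self, centroids: List[Vector]) -> None:
        self.centroids = list(centroids)
        self.partitions: Dict[int, List[Tuple[str, Vector]]] = {
            i: [] for i in range(len(self.centroids))
        }

    def _nearest_centroids(self, vector: Vector, n: int) -> List[int]:
        order = sorted(range(len(self.centroids)), key=lambda i: l2(vector, self.centroids[i]))
        return order[:n]

    def add(self, doc_id: str, vector: Vector) -> None:
        # Each vector is stored only in the partition of its nearest centroid.
        cell = self._nearest_centroids(vector, 1)[0]
        self.partitions[cell].append((doc_id, vector))

    def search(self, query: Vector, k: int, nprobe: int = 1) -> List[str]:
        # Scan only the nprobe closest partitions, then rank those candidates exactly.
        candidates: List[Tuple[str, Vector]] = []
        for cell in self._nearest_centroids(query, nprobe):
            candidates.extend(self.partitions[cell])
        candidates.sort(key=lambda item: l2(query, item[1]))
        return [doc_id for doc_id, _ in candidates[:k]]


ivf = ToyIVF(centroids=[(0.0, 0.0), (10.0, 10.0)])
ivf.add("a", (0.5, 0.2))
ivf.add("b", (1.0, 1.5))
ivf.add("c", (9.0, 9.5))

print(ivf.search((0.4, 0.4), k=2, nprobe=1))
print(ivf.search((0.4, 0.4), k=3, nprobe=2))
```

Scanning fewer partitions is faster but can miss true neighbors that fall just across a partition boundary, which is the recall-versus-latency trade-off of approximate search.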
Footnotes
- What is RAG: Understanding Retrieval-Augmented Generation - Qdrant - RAG architecture deep dive including retriever components, indexing, and query vectorization.
Embedding Models: Representing Meaning as Vectors
Embeddings are the mathematical bridge between human language and vector search. The quality of your embedding model directly determines retrieval quality.
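$$\text{sim}(\mathbf{q}, \mathbf{d}) = \cos\theta = \frac{\mathbf{q} \cdot \mathbf{d}}{\lVert \mathbf{q} \rVert \, \lVert \mathbf{d} \rVert}$$

where $\mathbf{q}$ is the query vector, $\mathbf{d}$ is the document vector, and $\theta$ is the angle between them.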
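Cosine similarity, the dot product of two vectors divided by the product of their norms, is short enough to write out directly. The helper below is a minimal sketch using plain Python lists; `cosine_similarity` is a hypothetical name, and production systems would normally rely on a vectorized library or the vector database itself.

```python
import math
from typing import Sequence


def cosine_similarity(q: Sequence[float], d: Sequence[float]) -> float:
    """Return cos(theta) between query vector q and document vector d."""
    if len(q) != len(d):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(q, d))
    norm_q = math.sqrt(sum(x * x for x in q))
    norm_d = math.sqrt(sum(y * y for y in d))
    if norm_q == 0 or norm_d == 0:
        # The angle is undefined for a zero vector; treat it as no similarity.
        return 0.0
    return dot / (norm_q * norm_d)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)
```

The score depends only on direction, not magnitude, which is why a vector and its scaled copy are treated as identical in meaning.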
Popular Embedding Models
| Model | Dimensions | Notes |
|---|---|---|
| OpenAI text-embedding-3-small | 1536 | Default for many, good balance |
| OpenAI text-embedding-3-large | 3072 | Higher quality, more storage |
| BAAI/bge-base-en | 768 | Strong open-source baseline |
| BAAI/bge-large-en-v1.5 | 1024 | Top MTEB benchmark performer |
| Cohere embed-v3 | 1024 | Multi-language, search-optimized |
| nomic-embed-text | 768 | Open-source, 8192 token context |
RAG Evaluation: Measuring What Matters
You cannot improve what you cannot measure. The RAGAS framework has become the de facto standard for evaluating RAG systems, offering programmatic metrics across both the retrieval and generation stages.
Core RAG Evaluation Metrics
| Metric | Component Evaluated | What It Measures |
|---|---|---|
| Context Precision | Retriever | Are the relevant items ranked at the top? |
| Context Recall | Retriever | Were all relevant documents retrieved? |
| Faithfulness | Generator | Is the answer supported by the retrieved context? |
| Answer Relevancy | Generator | Is the answer relevant to the query? |
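The two retriever metrics can be illustrated with a simplified, label-based sketch. It assumes you already have ground-truth relevance labels for each query; RAGAS itself derives these judgments with an LLM, so the functions below are hypothetical approximations of the idea rather than the library's implementation.

```python
from typing import Iterable, List


def context_recall(retrieved_ids: List[str], relevant_ids: Iterable[str]) -> float:
    """Fraction of the relevant documents that appear anywhere in the retrieved list."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved_ids)) / len(relevant)


def context_precision(retrieved_ids: List[str], relevant_ids: Iterable[str]) -> float:
    """Rank-aware precision: average of precision@rank at each rank holding a relevant item."""
    relevant = set(relevant_ids)
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


retrieved = ["d1", "d7", "d3"]   # ranked retriever output
relevant = {"d1", "d3", "d9"}    # ground-truth labels (assumed available)

recall = context_recall(retrieved, relevant)
precision = context_precision(retrieved, relevant)
print(round(recall, 3), round(precision, 3))
```

In this example recall is 2/3 because `d9` was never retrieved, while precision is (1/1 + 2/3) / 2 ≈ 0.833 because the irrelevant `d7` sits above the relevant `d3`.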
Hallucination Detection
In the cited benchmark, RAGAS Faithfulness had an average precision of 0.762 for detecting incorrect answers, making it moderately effective for simple queries but less reliable for complex ones. Other methods include:
- DeepEval Hallucination Metric: Estimates likelihood of contradictions between response and context.
- Self-Evaluation: The LLM assesses its own factual correctness — simple but less reliable.
- TLM (Trustworthy Language Model): Reported to outperform the other methods in the cited benchmark.
Footnotes
- Benchmarking Hallucination Detection Methods in RAG - Cleanlab - Comparative evaluation of RAGAS Faithfulness, DeepEval, and other hallucination detection methods with precision statistics.
[Figure: RAG Evaluation Metrics Mapping. Which metrics evaluate which pipeline components.]
RAG Architecture Decision Guide
Common RAG Anti-Patterns
- Using different embedding models for ingestion and queries — vectors will be in incompatible spaces.
- Chunking too small (64-128 tokens) — loses context and coherence.
- Ignoring metadata — filtering by date, source, or domain dramatically improves precision.
- Skipping re-ranking — naive cosine similarity often returns marginally relevant results that waste context window tokens.
- No evaluation loop — deploying without measuring faithfulness or context precision means you cannot detect hallucinations systematically.
- Over-chunking — too many small chunks dilutes retrieval signal and increases noise.
RAG Architecture Decision Tree
Use this decision framework to select the right RAG architecture for your use case:

    Is the answer in a single document chunk?
    ├─ Yes → Naive RAG (add reranker for precision)
    └─ No → Does it need facts from 2–3 documents?
       ├─ Yes → Advanced RAG (hybrid + rerank + query transform)
       └─ No → Does it need to reason across many documents?
          ├─ Relationships? → GraphRAG
          ├─ Multi-step reasoning? → Agentic RAG
          └─ Mixed workload? → Adaptive RAG
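The same decision tree can be encoded as a small function, which is handy for documenting the choice in code or routing requests in an adaptive setup. The function name, parameter names, and the string keys for the final branch are hypothetical; only the branch outcomes come from the tree itself.

```python
def choose_rag_architecture(
    answer_in_single_chunk: bool,
    needs_two_to_three_docs: bool = False,
    cross_document_need: str = "mixed",
) -> str:
    """Walk the decision tree top to bottom and return the suggested architecture."""
    if answer_in_single_chunk:
        return "Naive RAG"  # optionally add a reranker for precision
    if needs_two_to_three_docs:
        return "Advanced RAG"  # hybrid search + rerank + query transformation

    # Remaining case: reasoning across many documents.
    options = {
        "relationships": "GraphRAG",
        "multi_step": "Agentic RAG",
        "mixed": "Adaptive RAG",
    }
    if cross_document_need not in options:
        raise ValueError(f"unknown cross_document_need: {cross_document_need!r}")
    return options[cross_document_need]


print(choose_rag_architecture(True))
print(choose_rag_architecture(False, needs_two_to_three_docs=True))
print(choose_rag_architecture(False, cross_document_need="relationships"))
```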
Quick Component Recommendations
| Use Case | Chunking | Retrieval | Post-Retrieval | Evaluation |
|---|---|---|---|---|
| FAQ Bot | Sentence (256 tok) | Dense only | Cross-encoder rerank | Faithfulness |
| Enterprise Search | Paragraph (512 tok) | Hybrid (dense + BM25) | MMR + Rerank | Context Precision |
| Legal Analysis | Semantic (1024 tok) | Hybrid + Metadata | Compression + Self-reflection | Context Recall |
| Research Assistant | Parent-child (128/1024) | Agentic multi-hop | Self-RAG verification | All RAGAS metrics |