Retrieval-Augmented Generation (RAG) — From Fundamentals to Production-Ready Agentic RAG Systems

Self-RAG — Self-Reflective Retrieval

Understand the Self-RAG reflection tokens that gate retrieval and verify generation quality. Learn how the Retrieve, IsRel, IsSup, and IsUse tokens drive self-correction.

Learning Goals

  • Understand Self-RAG reflection tokens
  • Classify query complexity for adaptive retrieval

While CRAG uses a separate "Grader" model, Self-RAG (Self-Reflective Retrieval-Augmented Generation) trains or prompts an LLM to generate special Reflection Tokens during the generation process itself. These tokens act as internal "quality control" flags that indicate whether retrieval is necessary, whether the retrieved documents are relevant, and whether the final answer is actually supported by those documents.

This pattern lets the model critique its own reasoning and "think twice" before it returns an answer.

Learning Goals

  • Define the 4 types of Reflection Tokens in Self-RAG.
  • Understand the decision logic for adaptive retrieval.
  • Implement a self-reflection prompt using LangChain.

Core Concepts

1. The Reflection Tokens

Self-RAG uses four primary internal flags:

  1. Retrieve: Does the user query need external context? (e.g., [yes], [no]).
  2. IsRel (Relevance): Are the retrieved documents relevant to the query?
  3. IsSup (Support): Is the generated answer supported by the documents? (Helps catch hallucinations.)
  4. IsUse (Utility): Is the answer actually useful to the user?
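
To make the four flags concrete, here is a minimal Python sketch that stores them as one record per answer attempt. The class name `ReflectionTokens`, the field names, and the `failed_checks` helper are hypothetical and chosen for illustration. Modelling every flag as a yes/no boolean follows the simplified description above and is an assumption, not a statement about how any particular Self-RAG implementation encodes its tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReflectionTokens:
    """One record of the four Self-RAG flags for a single answer attempt."""
    retrieve: bool   # Retrieve: does the query need external context?
    is_rel: bool     # IsRel: are the retrieved documents relevant?
    is_sup: bool     # IsSup: is the answer supported by the documents?
    is_use: bool     # IsUse: is the answer useful to the user?

    def failed_checks(self) -> list[str]:
        """Return the names of the quality flags that did not pass.

        IsRel and IsSup are only checked when retrieval actually happened.
        """
        failed = []
        if self.retrieve and not self.is_rel:
            failed.append("IsRel")
        if self.retrieve and not self.is_sup:
            failed.append("IsSup")
        if not self.is_use:
            failed.append("IsUse")
        return failed


tokens = ReflectionTokens(retrieve=True, is_rel=True, is_sup=False, is_use=True)
print(tokens.failed_checks())  # ['IsSup']
```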

2. Adaptive Retrieval

Unlike standard RAG, which always retrieves, Self-RAG first asks: "Do I already know the answer?" If the Retrieve token is [no], the model answers from its own parametric memory and skips the retrieval call, saving latency and cost.
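
The gate described above can be sketched as a small function that only calls the retriever when the Retrieve flag is positive. All names here (`answer_adaptively`, `toy_gate`, `toy_retrieve`, `toy_generate`) are hypothetical, and the gate is a keyword check standing in for a model decision. In a real system the gate would be the model's own Retrieve token.

```python
from typing import Callable


def answer_adaptively(
    query: str,
    needs_retrieval: Callable[[str], bool],
    retrieve: Callable[[str], list[str]],
    generate: Callable[[str, list[str]], str],
) -> str:
    """Retrieve only when the Retrieve flag says external context is needed."""
    if needs_retrieval(query):
        docs = retrieve(query)   # Retrieve = [yes]: pay for the search
    else:
        docs = []                # Retrieve = [no]: rely on parametric memory
    return generate(query, docs)


# Toy stand-ins so the sketch runs without a model or a vector store.
calls = []  # records every query that actually hit the retriever


def toy_gate(query: str) -> bool:
    return "contract" in query.lower()


def toy_retrieve(query: str) -> list[str]:
    calls.append(query)
    return ["Clause 4: termination requires 30 days notice."]


def toy_generate(query: str, docs: list[str]) -> str:
    return f"answered with {len(docs)} doc(s)"


print(answer_adaptively("What is 2 + 2?", toy_gate, toy_retrieve, toy_generate))
print(answer_adaptively("What does the contract say about termination?",
                        toy_gate, toy_retrieve, toy_generate))
```

The first call never touches the retriever, which is where the latency and cost saving comes from.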

3. The Self-Correction Loop

If the IsSup token indicates that the answer is not supported by the context, the system can automatically trigger a "re-generation" or a "re-retrieval" pass until the quality bar is met.

Self-RAG Decision Logic
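
One way to write the decision logic down is as a pure function from token values to the next step. This is a sketch under assumptions: the function name `next_action` and the action labels are hypothetical, and sending an irrelevant-documents result to re-retrieval while sending an unsupported or unhelpful answer to re-generation is one plausible mapping, not the only one.

```python
def next_action(retrieve: str, is_rel: str = "yes", is_sup: str = "yes",
                is_use: str = "yes") -> str:
    """Map reflection-token values ('yes' or 'no') to the next step."""
    if retrieve == "no":
        # No documents were fetched, so IsRel and IsSup do not apply.
        return "return_answer" if is_use == "yes" else "regenerate"
    if is_rel == "no":
        return "re_retrieve"      # the documents themselves are the problem
    if is_sup == "no":
        return "regenerate"       # documents are fine, the answer is not grounded
    if is_use == "no":
        return "regenerate"       # grounded but not helpful
    return "return_answer"


print(next_action("no"))                  # return_answer
print(next_action("yes", is_rel="no"))    # re_retrieve
print(next_action("yes", is_sup="no"))    # regenerate
print(next_action("yes"))                 # return_answer
```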

Implementing Self-Reflection

  1. Step 1

    Use structured output to force the LLM to report its internal grades alongside the answer:

        from pydantic import BaseModel, Field

        class Reflection(BaseModel):
            # Self-assessed grades returned together with the answer
            relevance: str = Field(description="Are chunks relevant? 'yes'|'no'")
            supported: str = Field(description="Is answer grounded? 'yes'|'no'")
            answer: str = Field(description="The actual response")
  2. Step 2

    Instruct the model to act as its own critic:

        from langchain_core.prompts import ChatPromptTemplate

        prompt = ChatPromptTemplate.from_template("""
        Use these docs to answer: {context}
        Question: {question}

        Critique your own answer. Is it fully supported by the docs?
        If not, say 'no' in the supported field and try to refine.
        """)
  3. Step 3

    Use a while loop or a LangGraph edge to retry generation while supported == 'no', with a cap on attempts so the loop always terminates.
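
The retry step can be sketched as a bounded while loop. To keep the example runnable without a model, it uses a plain dataclass as a stand-in for the structured output from Step 1 and a toy generator in place of the prompt and LLM. The names `generate_until_supported`, `toy_generate`, and `max_attempts` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Reflection:
    """Plain-Python stand-in for the structured output from Step 1."""
    relevance: str
    supported: str
    answer: str


def generate_until_supported(
    generate: Callable[[int], Reflection],
    max_attempts: int = 3,
) -> tuple[Reflection, int]:
    """Call generate(attempt) until supported == 'yes' or attempts run out."""
    attempt = 1
    result = generate(attempt)
    while result.supported != "yes" and attempt < max_attempts:
        attempt += 1
        result = generate(attempt)
    return result, attempt


# Toy generator: the first draft is unsupported, the second is grounded.
def toy_generate(attempt: int) -> Reflection:
    if attempt == 1:
        return Reflection("yes", "no", "draft with an unsupported claim")
    return Reflection("yes", "yes", "refined, grounded answer")


result, attempts = generate_until_supported(toy_generate)
print(attempts, result.supported)  # 2 yes
```

Because the loop can also stop when the cap is reached, the caller should still check `result.supported` afterwards and fall back (for example, to "I could not verify this") when it is still 'no'.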

In legal RAG, a "mostly correct" answer is a failure. Self-RAG is designed so that every sentence in a legal summary is explicitly flagged as [Supported]. If the model finds itself "guessing" a clause that isn't in the provided contract, the IsSup token can trigger a fallback instead of letting a dangerous hallucination through.

Common Mistakes

  • Reflection Overhead: Generating reflection tokens for every query can be expensive. Only use the full Self-RAG loop for complex, high-stakes reasoning tasks.
  • Biased Self-Critique: LLMs are often over-confident. If the same model generates and critiques the answer, it might miss its own mistakes. Best Practice: Use a stronger model (e.g., GPT-4o) to critique a smaller model (e.g., GPT-4o-mini).

Recap

  • Self-RAG enables internal quality control via reflection tokens.
  • It supports Adaptive Retrieval to save costs on simple questions.
  • It is the foundation for "Self-Healing" RAG systems that correct their own hallucinations.

Knowledge Check

Which Reflection Token is used to determine if a query even needs a vector search?