Coursify

Retrieval-Augmented Generation (RAG) — From Fundamentals to Production-Ready Agentic RAG Systems

Iterative Improvement with Evaluation Feedback

Learn to close the feedback loop: evaluate RAG pipeline → identify weak points → improve chunking, retrieval, or generation → re-evaluate. Build automated evaluation pipelines.

Learning Goals

  • Close the feedback loop between evaluation and improvement
  • Build automated evaluation pipelines for continuous monitoring

Evaluation is not a one-time event; it is a continuous loop. The goal of using RAGAS is to identify the "weakest link" in your architecture and fix it. Once you apply a fix (such as changing your chunk size), you must re-run the evaluation to confirm that the target metric improved and that no other metric regressed.

In this final lesson of Module 8, we will learn how to close the feedback loop and build a data-driven RAG optimization strategy.

Learning Goals

  • Map low RAGAS scores to specific architectural fixes.
  • Implement an automated evaluation pipeline (CI/CD for RAG).
  • Balance the trade-offs between different metrics during optimization.

Core Concepts

1. The Optimization Map

Low Metric        | Primary Suspect         | Suggested Fix
Context Recall    | Chunking / Embeddings   | Increase chunk size, add overlap, or use a better embedding model.
Context Precision | Retrieval / Ranking     | Add a Re-ranker (e.g., Cohere) or implement Hybrid Search.
Faithfulness      | System Prompt / LLM     | Tighten the prompt ("Answer ONLY using context") or use a stronger model.
Answer Relevance  | Prompting / Query logic | Use Query Rewriting or improve the system's instruction set.
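
The map above can be encoded as a small lookup so that a score report turns directly into a to-do list. This is a minimal sketch, not part of RAGAS: the `OPTIMIZATION_MAP` dictionary, the `suggest_fixes` helper, the 0.7 threshold, and the example scores are all illustrative assumptions. The metric keys follow the RAGAS naming (`answer_relevancy` for Answer Relevance).

```python
# Hypothetical lookup table mirroring the Optimization Map.
OPTIMIZATION_MAP = {
    "context_recall": ("Chunking / Embeddings",
                       "Increase chunk size, add overlap, or use a better embedding model."),
    "context_precision": ("Retrieval / Ranking",
                          "Add a re-ranker or implement hybrid search."),
    "faithfulness": ("System Prompt / LLM",
                     "Tighten the prompt or use a stronger model."),
    "answer_relevancy": ("Prompting / Query logic",
                         "Use query rewriting or improve the system instructions."),
}


def suggest_fixes(scores, threshold=0.7):
    """Return (metric, score, suspect, fix) for each metric below threshold, worst first."""
    low = sorted(
        (score, metric)
        for metric, score in scores.items()
        if metric in OPTIMIZATION_MAP and score < threshold
    )
    return [(metric, score, *OPTIMIZATION_MAP[metric]) for score, metric in low]


# Made-up average scores, for illustration only.
example_scores = {
    "context_recall": 0.82,
    "context_precision": 0.78,
    "faithfulness": 0.55,
    "answer_relevancy": 0.66,
}

for metric, score, suspect, fix in suggest_fixes(example_scores):
    print(f"{metric} = {score:.2f} -> suspect: {suspect}; try: {fix}")
```

The threshold is a judgment call; pick a value that matches how much risk your application can tolerate.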

2. Regression Testing

When you "fix" retrieval by increasing chunk size, your Context Recall might go up, but your Faithfulness might go down because the LLM is now overwhelmed with irrelevant noise. Iterative improvement requires watching all metrics simultaneously to find the optimal balance.
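
One way to watch all metrics at once is to diff the new scores against a saved baseline. The sketch below assumes scores are available as plain dictionaries; the `find_regressions` helper, the 0.02 tolerance, and the numbers are hypothetical.

```python
def find_regressions(baseline, candidate, tolerance=0.02):
    """Return {metric: (old, new)} for metrics that dropped by more than `tolerance`."""
    return {
        metric: (baseline[metric], candidate[metric])
        for metric in baseline
        if metric in candidate and baseline[metric] - candidate[metric] > tolerance
    }


# Illustrative scores before and after increasing the chunk size.
baseline = {"context_recall": 0.60, "faithfulness": 0.90, "context_precision": 0.80}
candidate = {"context_recall": 0.80, "faithfulness": 0.70, "context_precision": 0.79}

regressions = find_regressions(baseline, candidate)
for metric, (old, new) in regressions.items():
    print(f"REGRESSION {metric}: {old:.2f} -> {new:.2f}")
```

The tolerance absorbs small run-to-run variation in LLM-judged scores, so that a 0.01 wobble is not reported as a regression.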

3. Automated Evaluation (Eval-Ops)

In a production system, you should run your RAGAS test set every time you change a line of code or a configuration parameter. This "Eval-Ops" approach catches quality regressions before they reach your users.
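
A CI job can enforce this with a simple quality gate that fails the build when any metric falls below a minimum. The following is a sketch under stated assumptions: the `THRESHOLDS` values and the `check_quality_gate` helper are hypothetical, and the scores are hard-coded here where a real job would take them from an evaluation run.

```python
# Hypothetical minimum acceptable scores; tune these for your application.
THRESHOLDS = {
    "faithfulness": 0.85,
    "context_recall": 0.75,
    "context_precision": 0.75,
    "answer_relevancy": 0.80,
}


def check_quality_gate(scores, thresholds):
    """Return a list of human-readable failures; an empty list means the gate passes."""
    failures = []
    for metric, minimum in thresholds.items():
        score = scores.get(metric)
        if score is None:
            failures.append(f"{metric}: missing from evaluation results")
        elif score < minimum:
            failures.append(f"{metric}: {score:.2f} is below the minimum {minimum:.2f}")
    return failures


# Illustrative scores; in CI these would come from the latest evaluation run.
scores = {"faithfulness": 0.91, "context_recall": 0.72, "context_precision": 0.80}

failures = check_quality_gate(scores, THRESHOLDS)
for failure in failures:
    print("FAIL", failure)

# A CI step would pass this value to sys.exit() so the build fails on any failure.
exit_code = 1 if failures else 0
```

Treating a missing metric as a failure prevents a silently skipped evaluation from passing the gate.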

The Optimization Loop

Closing the Feedback Loop

  1. Analyze your average scores and focus on the lowest one first. For example, a Faithfulness of 0.5 means that roughly half of the claims in your answers are not supported by the retrieved context, so the system is not yet reliable.

  2. Apply a targeted fix. If Faithfulness is low, try adding a 'Grader' node (from Module 8) or improving your system prompt to strictly enforce grounding.

  3. Run the same RAGAS test set (the same dataset object as in your first evaluation) on your updated system:

         from ragas import evaluate
         from ragas.metrics import faithfulness

         new_results = evaluate(dataset, metrics=[faithfulness])
         print(f"New Faithfulness: {new_results['faithfulness']}")

  4. Check that your fix didn't accidentally lower your Context Precision or Answer Relevance by re-running the evaluation with those metrics included.
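
The first step of this loop (average the scores, then pick the weakest metric) can be sketched in plain Python. It assumes the per-question scores have been exported as a list of dictionaries; the helper names and the numbers are hypothetical. The `worst_samples` helper additionally points at the individual questions worth inspecting by hand.

```python
from statistics import mean


def average_scores(per_sample):
    """Average each metric over all test questions."""
    metrics = per_sample[0].keys()
    return {m: mean(row[m] for row in per_sample) for m in metrics}


def weakest_metric(averages):
    """Name of the metric with the lowest average score."""
    return min(averages, key=averages.get)


def worst_samples(per_sample, metric, n=2):
    """Indices of the n questions that score lowest on `metric`."""
    order = sorted(range(len(per_sample)), key=lambda i: per_sample[i][metric])
    return order[:n]


# Illustrative per-question scores, one dictionary per test question.
per_sample = [
    {"faithfulness": 0.4, "context_precision": 0.9, "answer_relevancy": 0.8},
    {"faithfulness": 0.6, "context_precision": 0.7, "answer_relevancy": 0.9},
    {"faithfulness": 0.5, "context_precision": 0.8, "answer_relevancy": 0.7},
]

averages = average_scores(per_sample)
target = weakest_metric(averages)
print(f"Fix first: {target} (avg {averages[target]:.2f})")
print(f"Inspect questions: {worst_samples(per_sample, target)}")
```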

Example: The Chunking Pivot

A developer finds that their bot is missing key details (Recall: 0.4). They increase chunk size from 500 to 2000 characters.

  • Recall improves to 0.8.
  • Faithfulness drops from 0.9 to 0.6 because the LLM is now hallucinating from the extra "fluff" in the larger chunks.
  • Solution: Instead of huge chunks, the developer implements Parent Document Retrieval (Module 6). This restores Faithfulness to 0.9 while keeping Recall at 0.8.
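
The idea behind Parent Document Retrieval is to search over small child chunks but hand the larger parent chunk to the LLM. The sketch below is a toy illustration under loud assumptions: word overlap stands in for embedding similarity, chunks are split by character count, and every function name is hypothetical. It is not the implementation from Module 6 or from any library.

```python
import re


def tokenize(text):
    """Lowercased set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(parents, child_size=40):
    """Split each parent into small child chunks; return (child_text, parent_id) pairs."""
    index = []
    for parent_id, text in enumerate(parents):
        for start in range(0, len(text), child_size):
            index.append((text[start:start + child_size], parent_id))
    return index


def retrieve_parents(query, index, parents, top_k=2):
    """Rank the small children, then return their deduplicated parent chunks."""
    query_words = tokenize(query)
    # ASSUMPTION: word overlap is a stand-in for vector similarity.
    scored = sorted(
        ((len(query_words & tokenize(child)), parent_id) for child, parent_id in index),
        key=lambda pair: pair[0],
        reverse=True,
    )
    seen, results = set(), []
    for score, parent_id in scored:
        if score == 0 or len(results) == top_k:
            break  # no more matching children, or enough parents collected
        if parent_id not in seen:
            seen.add(parent_id)
            results.append(parents[parent_id])
    return results


# Made-up documents for illustration.
parents = [
    "Refunds are processed within five business days. Contact support for help.",
    "Shipping takes two weeks for international orders. Tracking is emailed.",
]
index = build_index(parents)
print(retrieve_parents("how long do refunds take", index, parents, top_k=1))
```

Matching happens on the small chunks, which keeps retrieval precise, while the LLM still receives the full parent for context.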

Common Mistakes

  • Fixing the Test Set instead of the System: If a question is hard, don't delete it from the test set. Fix the system to handle it.
  • Optimizing for a Single Metric: A system with 1.0 Recall but 0.1 Faithfulness is a dangerous system. Always optimize for the "RAG Triad" as a whole.

Recap

  • Use the Optimization Map to match low scores to architectural fixes.
  • Always check for regressions in other metrics after a "fix."
  • Automated benchmarks, re-run on every change, are the most dependable way to maintain quality at scale.

Knowledge Check

Question 1 of 3 (single choice)

If your system is retrieving the right documents but the LLM is ignoring them and giving generic answers, which metric will be low?