Iterative Improvement with Evaluation Feedback
Learn to close the feedback loop: evaluate the RAG pipeline → identify weak points → improve chunking, retrieval, or generation → re-evaluate. Build automated evaluation pipelines.
Learning Goals
- Close the feedback loop between evaluation and improvement
- Build automated evaluation pipelines for continuous monitoring
Evaluation is not a one-time event; it is a continuous loop. The goal of using RAGAS is to identify the "weakest link" in your architecture and fix it. Once you apply a fix (like changing your chunk size), you must re-run the evaluation to confirm that the target metric improved and that no other metric regressed.
In this final lesson of Module 8, we will learn how to close the feedback loop and build a data-driven RAG optimization strategy.
Learning Goals
- Map low RAGAS scores to specific architectural fixes.
- Implement an automated evaluation pipeline (CI/CD for RAG).
- Balance the trade-offs between different metrics during optimization.
Core Concepts
1. The Optimization Map
| Low Metric | Primary Suspect | Suggested Fix |
|---|---|---|
| Context Recall | Chunking / Embeddings | Increase chunk size, add overlap, or use a better embedding model. |
| Context Precision | Retrieval / Ranking | Add a re-ranker (e.g., Cohere) or implement hybrid search. |
| Faithfulness | System Prompt / LLM | Tighten the prompt ("Answer ONLY using context") or use a stronger model. |
| Answer Relevance | Prompting / Query logic | Use Query Rewriting or improve the system's instruction set. |
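The table above can be read as a lookup: find the lowest score, then look up the suspect and the fix. The sketch below is one minimal way to encode that, not part of RAGAS itself. The dictionary keys, the `weakest_link` helper, and the example scores are all assumptions made for illustration; adapt the keys to whatever names your evaluation results actually use.

```python
# Hypothetical lookup built from the Optimization Map table.
# Keys are assumed metric names; rename them to match your results.
OPTIMIZATION_MAP = {
    "context_recall": (
        "Chunking / Embeddings",
        "Increase chunk size, add overlap, or use a better embedding model.",
    ),
    "context_precision": (
        "Retrieval / Ranking",
        "Add a re-ranker or implement hybrid search.",
    ),
    "faithfulness": (
        "System Prompt / LLM",
        "Tighten the prompt or use a stronger model.",
    ),
    "answer_relevancy": (
        "Prompting / Query logic",
        "Use query rewriting or improve the system's instruction set.",
    ),
}


def weakest_link(scores):
    """Return (metric, suspect, fix) for the lowest-scoring metric."""
    metric = min(scores, key=scores.get)
    suspect, fix = OPTIMIZATION_MAP[metric]
    return metric, suspect, fix


# Illustrative averages only, not real measurements.
scores = {
    "context_recall": 0.82,
    "context_precision": 0.74,
    "faithfulness": 0.50,
    "answer_relevancy": 0.79,
}

metric, suspect, fix = weakest_link(scores)
print(f"Lowest metric: {metric}")
print(f"Primary suspect: {suspect}")
print(f"Suggested fix: {fix}")
```

Treat the result as a starting hypothesis: the map names the primary suspect, not the only possible cause.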
2. Regression Testing
When you "fix" retrieval by increasing chunk size, your Context Recall might go up, but your Faithfulness might go down because the LLM is now overwhelmed with irrelevant noise. Iterative improvement requires watching all metrics simultaneously to find the optimal balance.
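Watching all metrics at once is easier with a small comparison step between the baseline run and the run after the change. The following is a sketch under assumptions: `find_regressions`, the `tolerance` default, and the scores are hypothetical and only illustrate the idea of flagging any metric that dropped noticeably.

```python
def find_regressions(baseline, candidate, tolerance=0.02):
    """Return {metric: (old, new)} for metrics that dropped by more than tolerance.

    The tolerance absorbs small run-to-run noise from LLM-based scoring;
    0.02 is an arbitrary placeholder, not a recommended value.
    """
    return {
        metric: (baseline[metric], candidate[metric])
        for metric in baseline
        if metric in candidate
        and baseline[metric] - candidate[metric] > tolerance
    }


# Illustrative scores before and after increasing the chunk size.
baseline = {"context_recall": 0.55, "context_precision": 0.70, "faithfulness": 0.88}
candidate = {"context_recall": 0.75, "context_precision": 0.69, "faithfulness": 0.71}

regressions = find_regressions(baseline, candidate)
for metric, (old, new) in regressions.items():
    print(f"REGRESSION {metric}: {old:.2f} -> {new:.2f}")
```

In this made-up run, Context Recall improves and the small dip in Context Precision stays within tolerance, but Faithfulness is flagged, which is the trade-off described above.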
3. Automated Evaluation (Eval-Ops)
In a production system, you should run your RAGAS test set whenever you change code or a configuration parameter that affects the pipeline (prompts, chunking, retrieval settings, or the model). This "Eval-Ops" approach catches quality regressions before they reach users.
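One way to wire this into CI is a small gate script that reads the latest scores and fails the build when any metric falls below a minimum. This is a sketch, not a prescribed setup: the `THRESHOLDS` values, the `gate` and `main` helpers, and the idea that an earlier CI step has written the averaged scores to a JSON file are all assumptions.

```python
import json
import sys

# Hypothetical minimum acceptable scores; derive real ones from your baseline.
THRESHOLDS = {"faithfulness": 0.85, "context_recall": 0.75}


def gate(scores, thresholds=THRESHOLDS):
    """Return a list of failure messages; an empty list means the gate passes."""
    failures = []
    for metric, minimum in thresholds.items():
        value = scores.get(metric)
        if value is None:
            failures.append(f"{metric}: missing from results")
        elif value < minimum:
            failures.append(f"{metric}: {value:.2f} < {minimum:.2f}")
    return failures


def main(path):
    """Load a JSON file of averaged scores and return a process exit code."""
    with open(path) as f:
        scores = json.load(f)
    failures = gate(scores)
    for message in failures:
        print("FAIL", message)
    return 1 if failures else 0


# Only runs when invoked as: python eval_gate.py scores.json
if __name__ == "__main__" and len(sys.argv) == 2:
    sys.exit(main(sys.argv[1]))
```

A non-zero exit code is what makes most CI systems mark the job as failed, so the evaluation step blocks the merge rather than just logging a number.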
The Optimization Loop
Closing the Feedback Loop
- Step 1: Analyze your average scores and focus on the lowest one first. For example, an average Faithfulness of 0.5 means that roughly half of the claims in your answers are not supported by the retrieved context, so the system is not reliable.
- Step 2: If Faithfulness is low, try adding a 'Grader' node (from Module 8) or tightening your system prompt to strictly enforce grounding.
- Step 3: Run the same RAGAS test set on your updated system:

```python
from ragas import evaluate
from ragas.metrics import faithfulness

new_results = evaluate(dataset, metrics=[faithfulness])  # same test set as the baseline run
print(f"New Faithfulness: {new_results['faithfulness']}")
```

- Step 4: Check that your fix didn't accidentally lower your Context Precision or Answer Relevance.
Example: The Chunking Pivot
A developer finds that their bot is missing key details (Recall: 0.4). They increase chunk size from 500 to 2000 characters.
- Recall improves to 0.8.
- Faithfulness drops from 0.9 to 0.6 because the LLM is now hallucinating from the extra "fluff" in the larger chunks.
- Solution: Instead of huge chunks, the developer implements Parent Document Retrieval (Module 6). This restores Faithfulness to 0.9 while keeping Recall at 0.8.
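The three configurations in this example can be compared side by side. The sketch below uses the scores quoted above and picks the configuration whose weakest metric is highest. That "max-min" rule and the `best_balanced` helper are assumptions chosen for illustration; they are one simple way to express "balanced", not a RAGAS feature.

```python
# Scores taken from the example; configuration labels are descriptive only.
runs = {
    "500-char chunks": {"context_recall": 0.4, "faithfulness": 0.9},
    "2000-char chunks": {"context_recall": 0.8, "faithfulness": 0.6},
    "parent document retrieval": {"context_recall": 0.8, "faithfulness": 0.9},
}


def best_balanced(runs):
    """Pick the run whose weakest metric is highest (max-min rule)."""
    return max(runs, key=lambda name: min(runs[name].values()))


for name, scores in runs.items():
    weakest = min(scores, key=scores.get)
    print(f"{name}: weakest metric is {weakest} at {scores[weakest]}")

print("Best balanced:", best_balanced(runs))
```

Under this rule the larger chunks are already an improvement over the baseline (weakest metric 0.6 versus 0.4), but Parent Document Retrieval wins because neither metric is sacrificed.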
Common Mistakes
- Fixing the Test Set instead of the System: If a question is hard, don't delete it from the test set. Fix the system to handle it.
- Optimizing for a Single Metric: A system with 1.0 Recall but 0.1 Faithfulness is a dangerous system. Always optimize for the "RAG Triad" as a whole.
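To see why a single headline number can mislead, compare two ways of summarizing the Recall 1.0 / Faithfulness 0.1 system mentioned above. This is a small illustration, assuming you want one summary figure at all; the choice of the harmonic mean is an assumption made here, not something RAGAS prescribes.

```python
from statistics import harmonic_mean, mean

# The lopsided system from the bullet above.
scores = {"context_recall": 1.0, "faithfulness": 0.1}
values = list(scores.values())

arithmetic = mean(values)          # a plain average hides the weak metric
harmonic = harmonic_mean(values)   # dominated by the smallest value
worst = min(values)                # the bluntest summary: the weakest link

print(f"Arithmetic mean: {arithmetic:.2f}")
print(f"Harmonic mean:   {harmonic:.2f}")
print(f"Worst metric:    {worst:.2f}")
```

The arithmetic mean lands near the middle of the scale, while the harmonic mean and the minimum both stay close to the failing Faithfulness score, which is closer to how a user would experience the system.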
Recap
- Use the Optimization Map to match low scores to architectural fixes.
- Always check for regressions in other metrics after a "fix."
- Automated benchmarks that run on every change are the most dependable way to maintain quality at scale.
Knowledge Check
If your system is retrieving the right documents but the LLM is ignoring them and giving generic answers, which metric will be low?