Running and Interpreting RAGAS Metrics
Run RAGAS evaluation on your RAG pipelines. Learn to interpret faithfulness, answer relevancy, context precision, and context recall scores to identify improvement areas.
Learning Goals
- Run RAGAS evaluation and interpret scores
- Use evaluation results to identify RAG pipeline improvements
Once you have your evaluation dataset ready, it's time to run the actual evaluation. RAGAS provides a high-level evaluate function that takes your dataset and a list of metrics, and returns a Result object. This result contains the overall average scores as well as granular scores for every single row in your dataset.
In this lesson, we will implement the evaluation loop and learn how to interpret the numbers to find weaknesses in our RAG pipeline.
Learning Goals
- Format evaluation data for the RAGAS evaluate function.
- Execute a multi-metric evaluation pass.
- Export results to Pandas for detailed analysis.
Core Concepts
1. Data Formatting
RAGAS expects a specific schema, usually in the form of a Dataset object from the Hugging Face datasets library.
- question: list of strings
- contexts: list of list of strings
- answer: list of strings
- ground_truth: list of strings (optional)
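Before building the Dataset, it can help to check that a plain dictionary matches the schema above. The following is a minimal sketch using only the standard library; `validate_samples` and the two column constants are hypothetical helpers written for this lesson, not part of RAGAS or the datasets library.

```python
from typing import Any

# Column names taken from the schema listed above.
REQUIRED_COLUMNS = ("question", "contexts", "answer")
OPTIONAL_COLUMNS = ("ground_truth",)


def validate_samples(data: dict[str, list[Any]]) -> list[str]:
    """Return a list of problems; an empty list means the dict looks fine."""
    problems = []

    # Every required column must be present.
    missing = [col for col in REQUIRED_COLUMNS if col not in data]
    if missing:
        problems.append(f"missing columns: {missing}")
        return problems

    # All columns must describe the same number of rows.
    lengths = {col: len(values) for col, values in data.items()}
    if len(set(lengths.values())) > 1:
        problems.append(f"column lengths differ: {lengths}")

    # contexts holds one list of retrieved chunks (strings) per question.
    for i, chunks in enumerate(data["contexts"]):
        if not isinstance(chunks, list) or not all(isinstance(c, str) for c in chunks):
            problems.append(f"row {i}: contexts must be a list of strings")

    # The remaining columns hold one plain string per row.
    for col in ("question", "answer", *OPTIONAL_COLUMNS):
        for i, value in enumerate(data.get(col, [])):
            if not isinstance(value, str):
                problems.append(f"row {i}: {col} must be a string")

    return problems
```

A common slip this catches is passing contexts as a flat list of strings instead of one list per question.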
2. The Result Object
The output of evaluate() is more than just a table.
- Aggregated Scores: The "Bird's-eye view" of your system.
- Per-row Scores: The "Magnifying glass" that tells you exactly which queries failed.
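To make the relationship between the two views concrete, here is a small sketch that derives the aggregated scores from per-row scores by averaging. The rows and their values are made up for illustration, and `aggregate` is a hypothetical helper, not a RAGAS function.

```python
from statistics import mean

# Hypothetical per-row scores, shaped like the rows of a results table.
rows = [
    {"question": "How do I reset my password?", "faithfulness": 1.0, "answer_relevancy": 1.0},
    {"question": "What is RAG?", "faithfulness": 0.5, "answer_relevancy": 0.75},
]
METRICS = ("faithfulness", "answer_relevancy")


def aggregate(rows, metrics):
    """Average each metric over all rows (the bird's-eye view)."""
    return {m: mean(r[m] for r in rows) for m in metrics}


summary = aggregate(rows, METRICS)
print(summary)
```

The summary hides which question pulled faithfulness down; only the per-row values show that it was the second one.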
3. Interpreting the Scores (The Guide)
| Score Range | Interpretation |
|---|---|
| 0.9 - 1.0 | Production Ready. High quality and grounding. |
| 0.7 - 0.89 | Acceptable but needs monitoring. Check edge cases. |
| 0.4 - 0.69 | Sub-optimal. Significant retrieval or reasoning gaps. |
| < 0.4 | Failing. Likely severe hallucinations or irrelevant data. |
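The bands in this table can be encoded as a small lookup, which is handy when annotating many rows at once. This is a sketch only: `interpret_score` is a hypothetical helper, the thresholds are the rule-of-thumb values from the table rather than anything defined by RAGAS, and the bands are treated as contiguous (so 0.895 falls into the second band).

```python
def interpret_score(score: float) -> str:
    """Map a metric score in [0, 1] to the interpretation bands from the table."""
    # NaN also fails this range check, so unscored rows are rejected explicitly.
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score!r}")
    if score >= 0.9:
        return "Production ready"
    if score >= 0.7:
        return "Acceptable, needs monitoring"
    if score >= 0.4:
        return "Sub-optimal"
    return "Failing"


for value in (0.95, 0.72, 0.55, 0.1):
    print(value, "->", interpret_score(value))
```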
Executing the Evaluation
Step 1
Convert your results into a Hugging Face Dataset format:

```python
from datasets import Dataset

# One entry per evaluated question; contexts holds a list of retrieved chunks per question
data_samples = {
    'question': ['How do I reset my password?', 'What is RAG?'],
    'answer': ['Go to settings...', 'RAG is a technique...'],
    'contexts': [['Admin guide page 1'], ['RAG whitepaper section 2']],
    'ground_truth': ['Navigate to settings > security...', 'Retrieval Augmented Generation...']
}

dataset = Dataset.from_dict(data_samples)
```

Step 2
Execute the evaluation using your chosen metrics:

```python
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# The metrics rely on an LLM judge, so credentials for your LLM provider must be configured
result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy],
)
```

Step 3
Convert to a dataframe to find the worst-performing rows:

```python
df = result.to_pandas()

# Sort by faithfulness to find hallucinations
worst_hallucinations = df.sort_values("faithfulness").head(3)
print(worst_hallucinations)
```
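The same "lowest rows first" idea extends to every metric in the run, not just faithfulness. The sketch below works on a plain list of dictionaries, which is the shape pandas produces with `df.to_dict("records")`; the records and scores here are invented for illustration, and `worst_rows` and `worst_by_metric` are hypothetical helpers.

```python
def worst_rows(records, metric, n=3):
    """Return the n lowest-scoring records for one metric, lowest first."""
    return sorted(records, key=lambda r: r[metric])[:n]


def worst_by_metric(records, metrics, n=3):
    """Collect the n worst records for each metric in turn."""
    return {m: worst_rows(records, m, n) for m in metrics}


# Hypothetical per-row scores standing in for the exported results.
records = [
    {"question": "How do I reset my password?", "faithfulness": 0.9, "answer_relevancy": 0.4},
    {"question": "What is RAG?", "faithfulness": 0.2, "answer_relevancy": 0.8},
    {"question": "Who can approve refunds?", "faithfulness": 0.7, "answer_relevancy": 0.95},
]

report = worst_by_metric(records, ["faithfulness", "answer_relevancy"], n=1)
for metric, rows in report.items():
    print(metric, "->", [r["question"] for r in rows])
```

Looking at each metric separately matters because the question with the weakest grounding is often not the one with the least relevant answer.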
Example: The "Precision-Recall" Mismatch
Imagine you run an evaluation and find:
- Context Precision: 0.95 (Excellent ranking)
- Context Recall: 0.30 (Poor coverage)

This tells you that while your retriever is good at ranking the most relevant chunks at the top, the retrieved contexts are missing roughly 70% of the information needed to answer the questions. A better ranking algorithm will not fix this; the likely fix is more data in the knowledge base or better chunking.
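This kind of reading can be turned into a rough triage rule. The sketch below is an assumption-laden illustration: `diagnose_retrieval` is a hypothetical helper, and the `high` and `low` cut-offs are arbitrary defaults chosen for the example, not values defined by RAGAS.

```python
def diagnose_retrieval(context_precision, context_recall, high=0.8, low=0.5):
    """Suggest where to look first, given average precision and recall scores."""
    if context_precision >= high and context_recall < low:
        # Good ranking, poor coverage: the scenario described above.
        return "coverage gap: add data or revisit chunking"
    if context_precision < low and context_recall >= high:
        # The needed information is retrieved but buried among irrelevant chunks.
        return "ranking problem: improve retrieval ordering or filtering"
    if context_precision < low and context_recall < low:
        return "both weak: review the retriever and the knowledge base"
    return "no clear mismatch"


print(diagnose_retrieval(0.95, 0.30))
```

A rule like this only points at a starting place; the per-row scores still decide which queries to inspect.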
Common Mistakes
- Assuming 1.0 is Perfect: LLM judges are not infallible. An answer scored 1.0 might still contain minor errors. Always spot-check a few rows manually to verify the judge's reasoning.
- Ignoring Granular Data: Don't just look at the average. One "0.0" score in a sea of "0.9" scores could represent a critical safety failure that needs immediate attention.
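Both mistakes can be guarded against with a small review queue: every row that falls below a floor on any metric, plus a random sample of rows that passed, for manual spot-checking. This is a minimal sketch; `pick_rows_for_review`, the `floor` of 0.4, and the sample size are assumptions made for illustration, and the records are invented.

```python
import random


def pick_rows_for_review(records, metrics, floor=0.4, sample_size=2, seed=0):
    """Return (failing, spot_check): two lists of row indices to inspect by hand."""
    # Any single metric below the floor is enough to flag a row.
    failing = [i for i, r in enumerate(records) if any(r[m] < floor for m in metrics)]
    passing = [i for i in range(len(records)) if i not in failing]

    # Sample some passing rows too, since a high score is not proof of correctness.
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    spot_check = sorted(rng.sample(passing, min(sample_size, len(passing))))
    return failing, spot_check


# Hypothetical per-row scores: one 0.0 hidden among otherwise high values.
records = [
    {"faithfulness": 0.95, "answer_relevancy": 0.9},
    {"faithfulness": 0.0, "answer_relevancy": 0.9},
    {"faithfulness": 0.9, "answer_relevancy": 0.92},
    {"faithfulness": 1.0, "answer_relevancy": 0.97},
]

failing, spot_check = pick_rows_for_review(records, ["faithfulness", "answer_relevancy"])
print("failing rows:", failing)
print("spot-check rows:", spot_check)
```

In this sample the average faithfulness is still above 0.7, yet row 1 is flagged because its individual score is 0.0.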
Recap
- RAGAS requires data in the Hugging Face Dataset format.
- The evaluate function provides both summary and detailed metrics.
- Dataframes are the best tool for identifying and debugging failing queries.
Knowledge Check
Which Python library is used to format the data for RAGAS?