Retrieval-Augmented Generation (RAG) — From Fundamentals to Production-Ready Agentic RAG Systems

Running and Interpreting RAGAS Metrics

Run RAGAS evaluation on your RAG pipelines. Learn to interpret faithfulness, answer relevance, context precision, and context recall scores to identify improvement areas.

Learning Goals

Run RAGAS evaluation and interpret scores
Use evaluation results to identify RAG pipeline improvements

Once you have your evaluation dataset ready, it's time to run the actual evaluation. RAGAS provides a high-level evaluate function that takes your dataset and a list of metrics, and returns a Result object. This result contains the overall average scores as well as granular scores for every single row in your dataset.

In this lesson, we will implement the evaluation loop and learn how to interpret the numbers to find weaknesses in our RAG pipeline.

Learning Goals

Format evaluation data for the RAGAS evaluate function.
Execute a multi-metric evaluation pass.
Export results to Pandas for detailed analysis.

Core Concepts

1. Data Formatting

RAGAS expects a specific schema, usually in the form of a Dataset object from the Hugging Face datasets library.

question: list of strings
contexts: list of list of strings
answer: list of strings
ground_truth: list of strings (optional for faithfulness and answer relevance; required by reference-based metrics such as context recall)
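A small pre-flight check can catch schema problems before an evaluation run spends any LLM calls. The sketch below is plain Python and not part of RAGAS: the helper name validate_samples is hypothetical, and it assumes the four column names listed above, with ground_truth treated as optional.

```python
REQUIRED_COLUMNS = ("question", "contexts", "answer")
OPTIONAL_COLUMNS = ("ground_truth",)


def validate_samples(samples: dict) -> list[str]:
    """Return a list of schema problems; an empty list means the dict looks valid."""
    problems = []

    for column in REQUIRED_COLUMNS:
        if column not in samples:
            problems.append(f"missing required column: {column}")
    if problems:
        return problems

    # Every column must describe the same number of rows.
    n_rows = len(samples["question"])
    for column in REQUIRED_COLUMNS + OPTIONAL_COLUMNS:
        if column in samples and len(samples[column]) != n_rows:
            problems.append(f"{column} has {len(samples[column])} rows, expected {n_rows}")

    # contexts must be a list of lists of strings, one inner list per question.
    for i, ctx in enumerate(samples["contexts"]):
        if not isinstance(ctx, list) or not all(isinstance(c, str) for c in ctx):
            problems.append(f"contexts[{i}] must be a list of strings")

    # The remaining columns are flat lists of strings.
    for column in ("question", "answer") + OPTIONAL_COLUMNS:
        for i, value in enumerate(samples.get(column, [])):
            if not isinstance(value, str):
                problems.append(f"{column}[{i}] must be a string")

    return problems


good = {
    "question": ["What is RAG?"],
    "contexts": [["RAG whitepaper section 2"]],
    "answer": ["RAG is a technique..."],
}
bad = {
    "question": ["What is RAG?"],
    "contexts": ["RAG whitepaper section 2"],  # flat string instead of a list
    "answer": ["RAG is a technique..."],
}
print(validate_samples(good))
print(validate_samples(bad))
```

The most common mistake this catches is passing contexts as a flat list of strings rather than one inner list per question.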

2. The Result Object

The output of evaluate() is more than just a table.

Aggregated Scores: The "Bird's-eye view" of your system.
Per-row Scores: The "Magnifying glass" that tells you exactly which queries failed.
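To make the two views concrete, here is a minimal sketch in plain Python. It does not call RAGAS; the rows and the 0.5 cut-off are made up for illustration and only mimic the shape of a results table.

```python
from statistics import mean

# Hypothetical per-row scores, shaped like the rows of a results table.
rows = [
    {"question": "How do I reset my password?", "faithfulness": 1.0},
    {"question": "What is RAG?", "faithfulness": 0.9},
    {"question": "Who approves refunds?", "faithfulness": 0.2},
]

# Bird's-eye view: one number for the whole dataset.
aggregate = mean(row["faithfulness"] for row in rows)

# Magnifying glass: the individual rows that fall below a chosen threshold.
THRESHOLD = 0.5  # assumed cut-off for illustration
failing = [row["question"] for row in rows if row["faithfulness"] < THRESHOLD]

print(f"mean faithfulness: {aggregate:.2f}")
print("failing rows:", failing)
```

The mean of 0.70 looks acceptable on its own, yet one of the three queries scored 0.2. Only the per-row view shows which one.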

3. Interpreting the Scores (The Guide)

Score Range	Interpretation
0.90 - 1.00	Production ready. Answers are high quality and well grounded.
0.70 - 0.89	Acceptable, but needs monitoring. Check edge cases.
0.40 - 0.69	Sub-optimal. Significant retrieval or reasoning gaps.
< 0.40	Failing. Likely severe hallucinations or irrelevant data.
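The guide above can be encoded as a small helper so that every report labels scores the same way. This is a sketch: the function name interpret_score is hypothetical, and the band boundaries are this lesson's rules of thumb, not thresholds defined by RAGAS.

```python
def interpret_score(score: float) -> str:
    """Map a metric score in [0, 1] to the band from the guide above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score}")
    if score >= 0.9:
        return "Production ready"
    if score >= 0.7:
        return "Acceptable, needs monitoring"
    if score >= 0.4:
        return "Sub-optimal"
    return "Failing"


for name, score in {"faithfulness": 0.93, "context_recall": 0.30}.items():
    print(f"{name}: {score:.2f} -> {interpret_score(score)}")
```

Using lower bounds only (>= 0.9, >= 0.7, >= 0.4) avoids gaps between bands, so a score such as 0.895 still gets a label.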

Executing the Evaluation

Step 1

Convert your results into a Hugging Face Dataset format:

from datasets import Dataset

# Column-oriented samples: index i across all four lists describes one evaluated query.
data_samples = {
    'question': ['How do I reset my password?', 'What is RAG?'],
    'answer': ['Go to settings...', 'RAG is a technique...'],
    'contexts': [['Admin guide page 1'], ['RAG whitepaper section 2']],
    'ground_truth': ['Navigate to settings > security...', 'Retrieval Augmented Generation...']
}

dataset = Dataset.from_dict(data_samples)
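In practice a pipeline usually produces one record per query, while Step 1 needs one list per column. The sketch below shows one way to pivot between the two shapes in plain Python. The helper name to_columns and the record layout are assumptions for illustration, not part of RAGAS or the datasets library.

```python
def to_columns(records: list[dict]) -> dict:
    """Turn per-query records into the column-oriented dict used in Step 1."""
    columns = {"question": [], "answer": [], "contexts": [], "ground_truth": []}
    for record in records:
        columns["question"].append(record["question"])
        columns["answer"].append(record["answer"])
        # Copy the retrieved chunks so later edits to the record do not leak in.
        columns["contexts"].append(list(record["contexts"]))
        columns["ground_truth"].append(record["ground_truth"])
    return columns


# Hypothetical pipeline output: one dict per evaluated query.
records = [
    {
        "question": "What is RAG?",
        "answer": "RAG is a technique...",
        "contexts": ["RAG whitepaper section 2"],
        "ground_truth": "Retrieval Augmented Generation...",
    },
]
data_samples = to_columns(records)
print(data_samples["contexts"])
```

The resulting dict has the same shape as data_samples in Step 1, so it can be passed to Dataset.from_dict in the same way.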

Step 2

Execute the evaluation using your chosen metrics:

from ragas import evaluate
# Note: the RAGAS metric object is named answer_relevancy, not answer_relevance.
from ragas.metrics import faithfulness, answer_relevancy

result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy],
)

Step 3

Convert to a dataframe to find the worst-performing rows:

df = result.to_pandas()

# Sort by faithfulness (ascending) to find hallucinations
worst_hallucinations = df.sort_values("faithfulness").head(3)
print(worst_hallucinations)

Example: The "Precision-Recall" Mismatch

Imagine you run an evaluation and find:

Context Precision: 0.95 (Excellent ranking)
Context Recall: 0.30 (Poor coverage)

This tells you that your retriever ranks relevant chunks at the top, but the retrieved context covers only about 30% of the information in the ground-truth answers. A better ranking algorithm will not fix this. Look at coverage instead: retrieve more chunks per query, revisit your chunking strategy, or add the missing documents to the knowledge base.
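The same reasoning extends to the other precision and recall combinations. The sketch below turns it into a small triage helper. The function name diagnose_retrieval, the 0.7 threshold, and the wording of the hints are assumptions for illustration; treat the output as a starting point for investigation, not a verdict.

```python
def diagnose_retrieval(precision: float, recall: float, threshold: float = 0.7) -> str:
    """Suggest where to look first, given context precision and context recall."""
    high_precision = precision >= threshold
    high_recall = recall >= threshold
    if high_precision and high_recall:
        return "Retrieval looks healthy; inspect generation metrics next."
    if high_precision and not high_recall:
        return "Ranking is fine but coverage is poor; review top-k, chunking, and missing documents."
    if not high_precision and high_recall:
        return "Needed content is retrieved but ranked poorly; review ranking and filter noisy chunks."
    return "Both ranking and coverage are weak; revisit the retriever and the knowledge base."


print(diagnose_retrieval(precision=0.95, recall=0.30))
```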

Common Mistakes

Assuming 1.0 is Perfect: LLM judges are not infallible. A response scored 1.0 can still contain minor errors. Always spot-check a few rows manually to verify the judge's reasoning.
Ignoring Granular Data: Don't just look at the average. One "0.0" score in a sea of "0.9" scores could represent a critical safety failure that needs immediate attention.
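Both mistakes can be guarded against with a simple review queue: read every low-scoring row, and also read a random sample of the rows the judge rated highly. The following is a minimal sketch in plain Python; the helper name pick_rows_for_review, the 0.5 floor, and the example scores are assumptions for illustration.

```python
import random


def pick_rows_for_review(scores: list[float], floor: float = 0.5,
                         n_spot_checks: int = 2, seed: int = 0) -> dict:
    """Return row indices to read by hand: all low scorers plus a random sample of the rest."""
    low = [i for i, s in enumerate(scores) if s < floor]
    rest = [i for i, s in enumerate(scores) if s >= floor]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    spot = sorted(rng.sample(rest, min(n_spot_checks, len(rest))))
    return {"low_scores": low, "spot_checks": spot}


# Hypothetical faithfulness scores: the mean is about 0.79, which hides the 0.0 row.
scores = [0.95, 0.9, 1.0, 0.0, 0.92, 0.97]
review = pick_rows_for_review(scores)
print(review["low_scores"])
print(len(review["spot_checks"]))
```

The indices returned can be used to look up the matching questions, answers, and contexts for manual reading.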

Recap

RAGAS requires data in the Hugging Face Dataset format.
The evaluate function provides both summary and detailed metrics.
Converting the results to a dataframe makes it easy to find and debug failing queries.

Knowledge Check

Which Python library is used to format the data for RAGAS?

Pandas

Hugging Face 'datasets'

NumPy
