Retrieval-Augmented Generation (RAG) — From Fundamentals to Production-Ready Agentic RAG Systems

RAGAS Framework and Metrics

Implement the RAGAS evaluation framework. Learn the core metrics: faithfulness, answer relevance, context precision, context recall, and noise sensitivity.

RAGAS (RAG Assessment) is a widely used open-source library for automated RAG evaluation. It provides a suite of metrics that score each stage of your pipeline without requiring a human to review every response. RAGAS works by using an LLM to break the answer and the ground truth into individual claims and statements, checking each one against the retrieved context, and turning those verdicts into numeric scores.

In this lesson, we will dive into the four core metrics that define the quality of a RAG system.

Learning Goals

Master the 4 core RAGAS metrics: Faithfulness, Answer Relevance, Context Precision, and Context Recall.
Understand the mathematical logic behind each metric.
Configure RAGAS with an LLM evaluator.

Core Concepts

1. Faithfulness (Generation Quality)

Measures how many claims in the Answer are supported by the Context.

Goal: Zero hallucinations.
Calculation: (Number of supported claims) / (Total number of claims in the answer).
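The ratio above can be sketched in a few lines of Python. This is a minimal illustration rather than RAGAS's internal code: the faithfulness_score helper and the hand-written verdict list are hypothetical, standing in for the per-claim judgments an LLM evaluator would produce.

```python
def faithfulness_score(claim_supported: list[bool]) -> float:
    """Fraction of answer claims that the context supports.

    claim_supported holds one verdict per claim extracted from the answer
    (hypothetical input; in practice an LLM judge produces these verdicts).
    """
    if not claim_supported:
        raise ValueError("the answer contains no claims to score")
    return sum(claim_supported) / len(claim_supported)


# Illustrative verdicts for an answer with five claims, two of them unsupported.
verdicts = [True, True, False, True, False]
score = faithfulness_score(verdicts)
print(score)  # 3 supported / 5 total = 0.6
```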

2. Answer Relevance (End-to-End Quality)

Measures how well the Answer addresses the user's Question.

Goal: Direct and useful answers.
Calculation: The LLM judge generates several "potential questions" from the Answer. These are embedded, and the score is the mean cosine similarity between each generated question and the original user Question.
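The similarity step can be sketched as follows. The toy 2-D vectors are made up for illustration, and the helper names are hypothetical; a real evaluator would obtain the vectors from an embedding model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def answer_relevance_score(question_vec: list[float],
                           generated_vecs: list[list[float]]) -> float:
    """Mean similarity between the original question and each generated question."""
    sims = [cosine_similarity(question_vec, g) for g in generated_vecs]
    return sum(sims) / len(sims)


# Toy embeddings: one generated question matches the original, one is unrelated.
original = [1.0, 0.0]
generated = [[1.0, 0.0], [0.0, 1.0]]
relevance = answer_relevance_score(original, generated)
print(relevance)  # (1.0 + 0.0) / 2 = 0.5
```

An answer that drifts off topic yields generated questions that point away from the original one, which pulls the mean down.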

3. Context Precision (Retrieval Quality)

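Measures whether the most relevant documents are ranked at the top of the retrieved list.

Goal: Optimized ranking.
Calculation: Each retrieved chunk is judged relevant or not against the ground truth, and the score rewards rankings that place the relevant chunks early in the retrieved list.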
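One common way to turn "relevant chunks should come first" into a number is an average-precision style score: take precision@k at every rank that holds a relevant chunk, then average. The sketch below assumes that formulation and uses hand-written relevance flags; the helper name is hypothetical, and the exact computation inside RAGAS may differ by version.

```python
def context_precision_score(relevant_flags: list[bool]) -> float:
    """Average of precision@k taken at each rank that holds a relevant chunk.

    relevant_flags[i] is True if the chunk at rank i + 1 is relevant
    (hypothetical input; an LLM judge would normally decide this).
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevant_flags, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision@rank
    return precision_sum / hits if hits else 0.0


# Same two relevant chunks, ranked first versus ranked last.
good_ranking = context_precision_score([True, True, False, False])
poor_ranking = context_precision_score([False, False, True, True])
print(good_ranking)            # 1.0
print(round(poor_ranking, 2))  # 0.42
```

Both rankings retrieve the same chunks, so the difference in score comes purely from their order.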

4. Context Recall (Retrieval Quality)

Measures if all the information needed to answer the question is present in the context.

Goal: Complete knowledge retrieval.
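Calculation: Breaks the "Ground Truth" answer into individual statements and checks whether each one can be attributed to the retrieved Context. The score is (Number of ground-truth statements supported by the context) / (Total number of ground-truth statements).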
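The sketch below illustrates the idea with a deliberately naive attribution check: a case-insensitive substring match stands in for the LLM judgment a real evaluator would use. The function name and the sample sentences are assumptions made for this example.

```python
def context_recall_score(reference_statements: list[str],
                         contexts: list[str]) -> float:
    """Share of ground-truth statements that can be found in the context.

    Attribution is approximated by substring matching, which is far cruder
    than an LLM judge and is used here only to keep the example runnable.
    """
    if not reference_statements:
        raise ValueError("the ground truth contains no statements")
    joined = " ".join(contexts).lower()
    attributed = [s for s in reference_statements if s.lower() in joined]
    return len(attributed) / len(reference_statements)


reference = ["paris is the capital of france", "paris lies on the seine"]
retrieved = ["Paris is the capital of France and its largest city."]
recall = context_recall_score(reference, retrieved)
print(recall)  # 0.5: only the first statement is covered by the context
```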

[Figure: The RAGAS Metric Map]

Setting up RAGAS

Step 1

Install the library via pip:

pip install ragas

Step 2

Import the core metrics. Note that the library spells the answer relevance metric answer_relevancy:

from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

Step 3

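By default, RAGAS uses OpenAI models as the evaluator. You can substitute any LangChain chat model by passing it to evaluate():

from ragas import evaluate
from langchain_openai import ChatOpenAI

# Evaluator ("judge") model that scores the metrics
eval_llm = ChatOpenAI(model="gpt-4o")
# Pass it through the llm argument, e.g. evaluate(dataset, metrics=[faithfulness], llm=eval_llm),
# where dataset is your own evaluation dataset.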

Example: Decoding a Low Score

A Faithfulness score of 0.4 means that only 40% of the claims in the answer are supported by the retrieved context; the other 60% are unsupported and may be hallucinated or drawn from the model's general knowledge. Start by reviewing your System Prompt, which may be giving the LLM too much creative freedom. A Context Recall of 0.2 means the retrieved context covers only 20% of the facts in the ground-truth answer; in that case, improve your Embedding Model or Chunking Strategy.
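This kind of triage can be automated with a small lookup. The sketch below is illustrative only: the diagnose helper, the 0.7 threshold, and the hint wording are assumptions chosen for this example, not values prescribed by RAGAS.

```python
def diagnose(scores: dict[str, float], threshold: float = 0.7) -> list[str]:
    """Return a hint for every known metric that falls below the threshold.

    The threshold is an arbitrary illustrative cut-off; pick one that
    matches your own quality bar.
    """
    hints = {
        "faithfulness": "generator: tighten the system prompt so answers stay within the context",
        "context_recall": "retriever: revisit the embedding model or chunking strategy",
    }
    return [hints[name] for name, value in scores.items()
            if name in hints and value < threshold]


# The two low scores discussed above.
report = diagnose({"faithfulness": 0.4, "context_recall": 0.2})
for line in report:
    print(line)
```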

Common Mistakes

Evaluating on a Single Sample: Per-sample scores fluctuate, partly because the LLM judge itself is not deterministic. Evaluate on a dataset of at least 20-50 examples so that the average is reasonably stable.
Ignoring Context Recall: Many developers only check if the answer "looks right" (Answer Relevance), but forget to check if the retrieval was complete. This leads to answers that are "half-true."
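To see why a single sample is not enough, it helps to report the spread of the per-sample scores alongside their mean. The sketch below uses made-up scores and a hypothetical summarize helper built on Python's statistics module.

```python
import statistics


def summarize(scores: list[float]) -> tuple[float, float]:
    """Return (mean, standard error of the mean) for per-sample metric scores."""
    if len(scores) < 2:
        raise ValueError("need at least two samples to estimate the spread")
    mean = statistics.mean(scores)
    sem = statistics.stdev(scores) / len(scores) ** 0.5
    return mean, sem


# Made-up per-sample faithfulness scores for eight questions.
sample_scores = [1.0, 0.5, 0.75, 1.0, 0.25, 0.5, 1.0, 0.75]
mean, sem = summarize(sample_scores)
print(f"mean={mean:.2f} +/- {sem:.2f} (n={len(sample_scores)})")
```

The standard error shrinks as the number of samples grows, which is the practical reason to prefer a larger evaluation set.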

Recap

RAGAS provides a complete suite for "measuring what matters."
Faithfulness and Answer Relevance score the generator.
Context Precision and Context Recall score the retriever.
These metrics move your AI development from "Vibe-based" to "Data-driven."

Knowledge Check

Q1 (single choice)

Which metric would help you detect if your LLM is answering based on its own general knowledge instead of the provided documents?

Context Precision

Faithfulness

Latency
