Retrieval-Augmented Generation (RAG) — From Fundamentals to Production-Ready Agentic RAG Systems

Creating Evaluation Datasets

Create evaluation datasets for RAG pipelines. Learn to build question-answer-context triplets, generate synthetic evaluation data, and use human-annotated datasets.

Learning Goals

Create evaluation datasets with QA-context triplets
Generate synthetic evaluation data using LLMs

To calculate RAGAS metrics, you need data. Specifically, you need a set of "Evaluation Triplets," each consisting of a Question, the Retrieved Context, and the Generated Answer. Some metrics (such as Context Recall) also need a Ground Truth answer: the "gold standard" correct response.

Manually writing 50 questions and ground-truth answers for a technical domain is exhausting. In this lesson, we will learn how to use RAGAS to automatically generate a Synthetic Test Set directly from your knowledge base.

Learning Goals

Define the components of an Evaluation Dataset.
Generate synthetic Question-Context-Answer-GroundTruth sets using RAGAS.
Filter and curate synthetic data for high-quality benchmarks.

Core Concepts

1. The Evaluation Triplet (+ Ground Truth)

Question: The user's input.
Context: The specific chunks retrieved by your system.
Answer: The text generated by your system.
Ground Truth (Optional but Recommended): The known correct answer to the question.
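As a minimal sketch, the four components above can be held in one small record type. The class name `EvalRecord`, its field names, and the sample contract text are illustrative assumptions, not part of the RAGAS API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EvalRecord:
    """One evaluation example: the triplet plus an optional ground truth."""
    question: str                       # the user's input
    contexts: List[str]                 # chunks retrieved by the system
    answer: str                         # text generated by the system
    ground_truth: Optional[str] = None  # known correct answer, if available

    def supports_reference_metrics(self) -> bool:
        """True when metrics that need a reference answer (e.g. Context Recall) can run."""
        return bool(self.ground_truth and self.ground_truth.strip())


# Made-up sample data, for illustration only.
record = EvalRecord(
    question="What is the notice period for termination?",
    contexts=["Either party may terminate with 30 days written notice."],
    answer="The notice period is 30 days.",
    ground_truth="30 days written notice.",
)
```

Keeping `ground_truth` optional mirrors the text: a record without it is still usable for metrics that only need the question, context, and answer.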

2. Synthetic Data Generation (SDG)

RAGAS uses an LLM to scan your document chunks and "reverse engineer" questions. It identifies interesting facts and generates different types of questions:

Simple Questions: Direct lookup.
Reasoning Questions: Requires connecting multiple facts.
Multi-context Questions: Requires information from different documents.

3. The Power of Diversity

A good test set shouldn't contain only easy questions. RAGAS lets you control the Evolution of questions (how a simple seed question is rewritten into a harder variant), so your benchmark covers complex edge cases that a human might forget to test.
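To make the idea of a controlled mix concrete, the sketch below turns a set of weights into a per-type question count. It only illustrates what a distribution implies for a given test size; it is not how RAGAS allocates questions internally, and `allocate_questions` is a hypothetical helper.

```python
from typing import Dict


def allocate_questions(test_size: int, distribution: Dict[str, float]) -> Dict[str, int]:
    """Split test_size across question types in proportion to the given weights."""
    total = sum(distribution.values())
    if test_size < 0 or total <= 0:
        raise ValueError("test_size must be >= 0 and weights must sum to a positive number")
    # Floor each share first, then hand leftover questions to the largest remainders.
    exact = {name: test_size * weight / total for name, weight in distribution.items()}
    counts = {name: int(share) for name, share in exact.items()}
    leftover = test_size - sum(counts.values())
    by_remainder = sorted(exact, key=lambda name: exact[name] - counts[name], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


plan = allocate_questions(10, {"simple": 0.5, "reasoning": 0.25, "multi_context": 0.25})
print(plan)
```

With a small test size, the harder question types may end up with only two or three examples each, which is one reason to generate a larger set before drawing conclusions.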

Synthetic Generation Workflow

Generating a Test Set

Step 1

Use standard LangChain loaders to provide the source material:

from langchain_community.document_loaders import DirectoryLoader

# Load the files under ./data as LangChain Document objects
loader = DirectoryLoader("./data")
docs = loader.load()

Step 2

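Set up the RAGAS TestsetGenerator:

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

# Written against the ragas 0.1.x API; later releases reorganized the testset module.
# generator_llm writes the questions, critic_llm filters them.
generator = TestsetGenerator.from_langchain(
    generator_llm=ChatOpenAI(model="gpt-4o"),
    critic_llm=ChatOpenAI(model="gpt-4o"),
    embeddings=OpenAIEmbeddings(),
)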

Step 3

Specify the number of questions and the distribution of complexity:

testset = generator.generate_with_langchain_docs(
    docs,
    test_size=10,
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
)

# Convert to a pandas DataFrame for inspection
df = testset.to_pandas()
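The generated test set supplies questions and reference answers, but the Context and Answer in each evaluation triplet must come from your own system. The sketch below shows one way to combine the two. The `pipeline` callable, the stub `fake_pipeline`, the sample row, and the column names `question` and `ground_truth` are assumptions for illustration; check the columns of your own DataFrame before relying on them.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical signature: question -> (answer, retrieved_contexts).
RagPipeline = Callable[[str], Tuple[str, List[str]]]


def build_eval_rows(testset_rows: List[Dict], pipeline: RagPipeline) -> List[Dict]:
    """Run each synthetic question through the pipeline and collect evaluation rows."""
    rows = []
    for item in testset_rows:
        answer, contexts = pipeline(item["question"])
        rows.append({
            "question": item["question"],
            "contexts": contexts,                  # what your retriever returned
            "answer": answer,                      # what your generator produced
            "ground_truth": item["ground_truth"],  # reference from the synthetic set
        })
    return rows


def fake_pipeline(question: str) -> Tuple[str, List[str]]:
    """Stand-in for a real RAG system so the sketch runs without API calls."""
    return f"Stub answer to: {question}", ["stub context chunk"]


# Made-up sample row, for illustration only.
synthetic = [{"question": "What triggers the force majeure clause?",
              "ground_truth": "Events beyond the parties' reasonable control."}]
eval_rows = build_eval_rows(synthetic, fake_pipeline)
```

With the DataFrame from Step 3, `df.to_dict("records")` produces a list of dictionaries that could be passed in place of `synthetic`.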

Example: The "Blind Spot" Discovery

Imagine you are building a bot for a complex legal contract. You generate 20 "Reasoning" questions synthetically. You might find that the generator asks: "What happens if the force majeure clause is triggered but the notification period is missed?" If you didn't think of that question yourself, the synthetic set just helped you find a potential high-stakes failure point in your RAG logic.

Common Mistakes

Generating too few questions: A test set of 5 questions is a "smoke test," not a benchmark. Aim for at least 30-50 questions for a production release.
Low-Quality Source Chunks: If your documents are full of OCR errors or noisy boilerplate, the generated questions will be nonsensical. Clean your data (Module 2) before generating a test set.
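Even with clean sources, a synthetic set usually benefits from a curation pass before it becomes a benchmark. The following is a rough sketch of such a filter, assuming each row is a dictionary with `question` and `ground_truth` keys. The function name `curate`, the length threshold, and the sample rows are arbitrary illustrative choices, and a filter like this supplements manual review rather than replacing it.

```python
from typing import Dict, List


def curate(rows: List[Dict], min_question_chars: int = 15) -> List[Dict]:
    """Drop duplicate, very short, or reference-less questions from a synthetic set."""
    seen = set()
    kept = []
    for row in rows:
        # Collapse repeated whitespace so near-identical questions compare equal.
        question = " ".join(str(row.get("question", "")).split())
        ground_truth = str(row.get("ground_truth") or "").strip()
        key = question.lower()
        if len(question) < min_question_chars:
            continue  # too short to be a meaningful question
        if not ground_truth:
            continue  # no reference answer to score against
        if key in seen:
            continue  # same question already kept (case- and whitespace-insensitive)
        seen.add(key)
        kept.append(row)
    return kept


# Made-up sample rows, for illustration only.
raw = [
    {"question": "What is the notice period for termination?", "ground_truth": "30 days."},
    {"question": "what is the notice  period for termination?", "ground_truth": "30 days."},
    {"question": "Fig 3?", "ground_truth": "n/a"},
    {"question": "Who signs the amendment to the agreement?", "ground_truth": ""},
]
curated = curate(raw)
```

After filtering, check that the remaining set still meets the size target above; if it does not, generate more questions rather than lowering the bar.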

Recap

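Evaluation requires triplets of Question, Context, and Answer, plus a Ground Truth answer for metrics such as Context Recall.
Synthetic Data Generation (SDG) replaces much of the manual work of writing test questions and reference answers.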
Curate your distribution (Simple vs. Reasoning) to match your real-world user intent.

Knowledge Check

Q1 (single choice)

Which component is required if you want to measure 'Context Recall'?

System Prompt

Ground Truth Answer

Vector Metadata
