Creating Evaluation Datasets
Create evaluation datasets for RAG pipelines. Learn to build question-answer-context triplets, generate synthetic evaluation data, and use human-annotated datasets.
To calculate RAGAS metrics, you need data. Specifically, you need a set of "Evaluation Triplets", each consisting of a Question, the Retrieved Context, and the Generated Answer. For some metrics (like Context Recall), you also need a Ground Truth answer: the "gold standard" correct response.
Manually writing 50 questions and ground-truth answers for a technical domain is exhausting. In this lesson, we will learn how to use RAGAS to automatically generate a Synthetic Test Set directly from your knowledge base.
Learning Goals
- Define the components of an Evaluation Dataset.
- Generate synthetic Question-Context-Answer-GroundTruth sets using RAGAS.
- Filter and curate synthetic data for high-quality benchmarks.
Core Concepts
1. The Evaluation Triplet (+ Ground Truth)
- Question: The user's input.
- Context: The specific chunks retrieved by your system.
- Answer: The text generated by your system.
- Ground Truth (Optional but Recommended): The known correct answer to the question.
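The four components above can be sketched as a small Python record. This is only an illustration: the class name `EvalRecord`, its field names, the helper `can_score_context_recall`, and the sample contract text are assumptions made for this example, not part of the RAGAS API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EvalRecord:
    """One evaluation row (hypothetical structure, not a RAGAS class)."""
    question: str                        # the user's input
    contexts: List[str]                  # chunks retrieved by your system
    answer: str                          # text generated by your system
    ground_truth: Optional[str] = None   # known correct answer, if available

    def can_score_context_recall(self) -> bool:
        # Context Recall needs a ground truth to compare against,
        # so it can only be computed when one is present.
        return bool(self.ground_truth)


# Made-up sample data for illustration only
record = EvalRecord(
    question="What is the notice period for termination?",
    contexts=["Either party may terminate with 30 days written notice."],
    answer="The notice period is 30 days.",
)
print(record.can_score_context_recall())

record.ground_truth = "30 days written notice."
print(record.can_score_context_recall())
```

Keeping the ground truth optional mirrors the list above: a record without it is still usable for metrics that only need the question, context, and answer.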
2. Synthetic Data Generation (SDG)
RAGAS uses an LLM to scan your document chunks and "reverse engineer" questions. It identifies interesting facts and generates different types of questions:
- Simple Questions: Direct lookup.
- Reasoning Questions: Requires connecting multiple facts.
- Multi-context Questions: Requires information from different documents.
3. The Power of Diversity
A good test set shouldn't contain only easy questions. RAGAS lets you control the mix of question types (which it calls evolutions), so your benchmark can include complex edge cases that a human might forget to test.
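To make the idea of a question mix concrete, here is a minimal sketch that turns a set of weights into per-type question counts. The function `question_counts` is a hypothetical planning helper written for this lesson; it does not call RAGAS and does not claim to reproduce how RAGAS itself rounds the counts.

```python
from typing import Dict


def question_counts(test_size: int, distribution: Dict[str, float]) -> Dict[str, int]:
    """Split test_size across question types according to their weights."""
    total = sum(distribution.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"weights must sum to 1.0, got {total}")

    # Round every share down first...
    raw = {name: test_size * weight for name, weight in distribution.items()}
    counts = {name: int(value) for name, value in raw.items()}

    # ...then give the leftover questions to the types with the
    # largest fractional remainders, so the total stays exact.
    leftover = test_size - sum(counts.values())
    by_remainder = sorted(raw, key=lambda name: raw[name] - counts[name], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


mix = {"simple": 0.5, "reasoning": 0.25, "multi_context": 0.25}
print(question_counts(20, mix))
```

A quick calculation like this shows whether a small test set leaves any question type with too few examples to be informative.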
Synthetic Generation Workflow
Generating a Test Set
- Step 1: Use standard LangChain loaders to provide the source material:

    from langchain_community.document_loaders import DirectoryLoader

    # Load every file under ./data as LangChain Document objects
    loader = DirectoryLoader("./data")
    docs = loader.load()

- Step 2: Set up the RAGAS TestsetGenerator:

    from langchain_openai import ChatOpenAI, OpenAIEmbeddings
    from ragas.testset.generator import TestsetGenerator
    from ragas.testset.evolutions import simple, reasoning, multi_context

    # These import paths and from_langchain follow the RAGAS 0.1.x API;
    # later releases reorganized the testset module.
    generator = TestsetGenerator.from_langchain(
        generator_llm=ChatOpenAI(model="gpt-4o"),  # writes the questions
        critic_llm=ChatOpenAI(model="gpt-4o"),     # reviews and filters them
        embeddings=OpenAIEmbeddings(),
    )

- Step 3: Specify the number of questions and the distribution of complexity:

    testset = generator.generate_with_langchain_docs(
        docs,
        test_size=10,
        distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
    )

    # Convert to a pandas DataFrame for inspection
    df = testset.to_pandas()
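Once the test set is in a DataFrame, it is worth curating before treating it as a benchmark. The sketch below is one possible filter, not a RAGAS feature. It assumes the rows have been exported as plain dictionaries (for example with `df.to_dict("records")`) and that the columns are named `question` and `ground_truth`; check the column names your RAGAS version actually produces. The function name `curate`, the length threshold, and the sample rows are all illustrative.

```python
from typing import Any, Dict, List


def _text(value: Any) -> str:
    """Return a stripped string, treating non-strings (None, NaN) as empty."""
    return value.strip() if isinstance(value, str) else ""


def curate(rows: List[Dict[str, Any]], min_question_chars: int = 15) -> List[Dict[str, Any]]:
    """Drop rows that are too short, lack a ground truth, or repeat a question."""
    seen = set()
    kept = []
    for row in rows:
        question = _text(row.get("question"))
        ground_truth = _text(row.get("ground_truth"))
        if len(question) < min_question_chars:
            continue  # probably a fragment rather than a real question
        if not ground_truth:
            continue  # unusable for ground-truth metrics such as Context Recall
        key = question.lower()
        if key in seen:
            continue  # case-insensitive duplicate
        seen.add(key)
        kept.append(row)
    return kept


# Made-up rows for illustration only
rows = [
    {"question": "What is the notice period for termination?", "ground_truth": "30 days."},
    {"question": "what is the notice period for termination?", "ground_truth": "30 days."},
    {"question": "Why?", "ground_truth": "Because."},
    {"question": "Which clause covers force majeure events?", "ground_truth": None},
]
print(len(curate(rows)))
```

Automatic filters like this only catch mechanical problems; reading a sample of the surviving questions by hand is still the best check on quality.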
Example: The "Blind Spot" Discovery
Imagine you are building a bot for a complex legal contract. You generate 20 "Reasoning" questions synthetically. You might find that the generator asks: "What happens if the force majeure clause is triggered but the notification period is missed?" If you didn't manually think of that question, the synthetic set just helped you find a potential high-stakes failure point in your RAG logic.
Common Mistakes
- Generating too few questions: A test set of 5 questions is a "smoke test," not a benchmark. Aim for at least 30-50 questions for a production release.
- Low-Quality Source Chunks: If your documents are full of OCR errors or noisy boilerplate, the generated questions will be nonsensical. Clean your data (Module 2) before generating a test set.
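Both mistakes can be caught with simple pre-flight checks. The following is a rough heuristic sketch, not a RAGAS utility: the function names, the 0.8 "clean character" threshold, and the sample strings are assumptions chosen for illustration, and the minimum size of 30 simply reuses the lower bound suggested above.

```python
MIN_BENCHMARK_SIZE = 30  # lower bound suggested in this lesson


def is_benchmark_sized(num_questions: int) -> bool:
    """True when the test set is large enough to count as a benchmark."""
    return num_questions >= MIN_BENCHMARK_SIZE


def looks_noisy(chunk: str, min_clean_ratio: float = 0.8) -> bool:
    """Flag chunks where too few characters are letters, digits, or whitespace.

    A crude proxy for OCR debris; the threshold is an arbitrary assumption.
    """
    if not chunk.strip():
        return True  # empty or whitespace-only chunks carry no content
    clean = sum(ch.isalnum() or ch.isspace() for ch in chunk)
    return clean / len(chunk) < min_clean_ratio


# Made-up chunks for illustration only
chunks = [
    "Either party may terminate with 30 days written notice.",
    "#@@|| %% ^^ ~~ a",
]
usable = [c for c in chunks if not looks_noisy(c)]
print(len(usable), is_benchmark_sized(5))
```

A character-ratio check will not catch fluent but irrelevant boilerplate, so it complements rather than replaces the data cleaning from Module 2.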
Recap
- Evaluation requires triplets of Question, Context, and Answer, plus a Ground Truth for metrics such as Context Recall.
- Synthetic Data Generation (SDG) replaces much of the manual work of writing questions and ground-truth answers.
- Curate your distribution (Simple vs. Reasoning) to match your real-world user intent.
Knowledge Check
Which component is required if you want to measure 'Context Recall'?