"RAG Core Study (14/26) — Evaluation Sets with RAGAS & DeepEval"
Without an evaluation set, every RAG improvement is a story, not evidence.
RAG systems fail in more than one place: retrieval, context selection, answer generation, and grounding. That means "looks better" is not a metric. Part 14 explains how to build a golden dataset, how RAGAS and DeepEval fit into the workflow, and why teams must separate retrieval quality from answer quality if they want tuning decisions to hold up.
0. Prerequisites
- Part 13 reranking — system quality now depends on multi-stage ranking.
- Part 1 grounding — a correct-looking answer is not enough.
- Part 12 Hybrid — multiple retrieval settings need comparison on fixed data.
1. Learning Objectives
- Build a small but useful golden dataset for RAG.
- Distinguish retrieval metrics from answer metrics.
- Use RAGAS and DeepEval in the right roles.
- Avoid the common traps in judge-model-based evaluation.
2. Key Summary
An evaluation set for RAG is usually a table of:
query + reference answer + supporting context IDs + optional metadata slice
Use it to answer two different questions:
- Did retrieval find the right evidence?
- Did generation answer faithfully from that evidence?
RAGAS is strong for RAG-specific metrics such as Faithfulness, Context Precision, and Context Recall. DeepEval is strong as a testing harness for assertions, regression checks, and CI-like workflows. The core rule is simple: keep one human-trusted holdout set, do not rely on synthetic data alone, and record which changes improve which layer of the stack.
3. Intuition — Why "The Answer Looks Good" Is Not Enough
Suppose a RAG system answers:
"The report used the ECB end-of-quarter FX rate."
That answer may still be wrong in four different ways:
- the retrieved context came from the wrong report version,
- the answer copied a nearby but incorrect sentence,
- the wording sounds confident but the evidence is weak,
- the question type was never represented in your internal test set.
One final answer can look acceptable while the retrieval layer is already failing. Evaluation must split the pipeline.
4. Definitions — Evaluation Terms
| Term | Definition |
|---|---|
| Golden dataset | Human-trusted evaluation set with fixed queries and expected evidence/answers |
| Synthetic dataset | Evaluation examples generated automatically, often by an LLM |
| Faithfulness | Whether the answer is supported by the retrieved context |
| Context Precision | How much of the retrieved context is actually relevant |
| Context Recall | Whether the retrieved context contains the evidence needed |
| Answer Correctness | Whether the final answer matches the expected answer |
| Judge model | An LLM used to score other outputs |
| Slice | A subset of queries, such as code lookups or policy questions |
The practical split is:
- retrieval-first metrics: did we fetch the right evidence?
- generation-first metrics: did we answer from that evidence correctly?
5. Math — The Small Set of Metrics That Matter First
5.1 Context precision
If \(R\) is the retrieved context set and \(G\) is the gold relevant context set:
$$\text{Context Precision} = \frac{|R \cap G|}{|R|}$$
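High precision means little irrelevant context was retrieved. This set-based form is a simplification: the RAGAS Context Precision metric is rank-aware and also rewards placing relevant chunks near the top of the results.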
5.2 Context recall
$$\text{Context Recall} = \frac{|R \cap G|}{|G|}$$
High recall means the retrieved context contains most of the necessary evidence.
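The two set-based formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the RAGAS implementation; the function names, the extra chunk IDs `c007` and `c090`, and the convention of returning 0.0 for an empty set are assumptions made for the example.

```python
def context_precision(retrieved_ids, gold_ids):
    """Set-based precision: share of retrieved chunks that are gold."""
    retrieved, gold = set(retrieved_ids), set(gold_ids)
    if not retrieved:
        return 0.0  # assumed convention when nothing was retrieved
    return len(retrieved & gold) / len(retrieved)


def context_recall(retrieved_ids, gold_ids):
    """Set-based recall: share of gold chunks that were retrieved."""
    retrieved, gold = set(retrieved_ids), set(gold_ids)
    if not gold:
        return 0.0  # assumed convention when no gold chunks are annotated
    return len(retrieved & gold) / len(gold)


# Hypothetical retrieval result for one query.
retrieved = ["c031", "c007", "c090", "c032"]
gold = ["c031", "c032"]

print(context_precision(retrieved, gold))  # 0.5
print(context_recall(retrieved, gold))     # 1.0
```

Here the retriever found all the evidence (recall 1.0) but half of what it returned was noise (precision 0.5), which is the kind of difference a single answer-level score would hide.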
5.3 Faithfulness
Faithfulness is usually judged, not computed from exact string overlap:
$$\text{Faithfulness} \approx \Pr(\text{answer claims are supported by context})$$
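In practice, RAGAS estimates this with LLM-based claim checking: the answer is broken into individual claims, each claim is checked against the retrieved context, and the score is the fraction of claims that are supported.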
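One way to make the faithfulness idea concrete is to treat it as the fraction of answer claims that a judge marks as supported. The sketch below assumes the claims have already been extracted, and it uses a deliberately crude substring check (`toy_judge`, a hypothetical stand-in) where a real pipeline would call an LLM judge. It is an illustration of the scoring shape only, not how RAGAS is implemented.

```python
from typing import Callable, Sequence


def faithfulness_score(
    claims: Sequence[str],
    contexts: Sequence[str],
    is_supported: Callable[[str, Sequence[str]], bool],
) -> float:
    """Fraction of claims that the judge marks as supported by the contexts."""
    if not claims:
        return 0.0  # assumed convention for an answer with no claims
    supported = sum(1 for claim in claims if is_supported(claim, contexts))
    return supported / len(claims)


def toy_judge(claim: str, contexts: Sequence[str]) -> bool:
    """Stand-in for an LLM judge: case-insensitive substring match only."""
    return any(claim.lower() in ctx.lower() for ctx in contexts)


# Hypothetical context and pre-extracted claims.
contexts = ["The Q3 report used the ECB end-of-quarter rate for all FX conversions."]
claims = ["the ECB end-of-quarter rate", "the Fed daily rate"]

print(faithfulness_score(claims, contexts, toy_judge))  # 0.5
```

Passing the judge in as a function keeps the scoring logic testable without network calls, and makes it easy to swap judges when checking for judge-model bias.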
5.4 Answer correctness
For exact tasks:
$$\text{Accuracy} = \frac{\#\text{correct answers}}{\#\text{questions}}$$
For free-form answers, LLM-judged correctness or reference-based similarity is often used instead of exact match alone.
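For the exact-task case, accuracy is easy to compute directly. The sketch below assumes a simple normalisation step (lowercasing, stripping punctuation, collapsing whitespace) before comparison; that normalisation is an illustrative choice, not something the formula requires, and the sample predictions are hypothetical.

```python
import string


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def exact_match_accuracy(predictions, references) -> float:
    """Share of predictions that equal their reference after normalisation."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0  # assumed convention for an empty evaluation set
    correct = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return correct / len(references)


# Hypothetical model outputs and reference answers.
predictions = ["ECB end-of-quarter rate.", "USD"]
references = ["ecb end-of-quarter rate", "EUR"]

print(exact_match_accuracy(predictions, references))  # 0.5
```

Exact match is strict by design: a correct paraphrase scores zero, which is why free-form answers usually need a judged or similarity-based metric instead.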
6. Walkthrough — From Dataset to Evaluation Run
6.1 Minimal golden dataset shape
{
  "query": "What FX basis did the Q3 report use?",
  "reference_answer": "The report used the ECB end-of-quarter rate.",
  "gold_context_ids": ["c031", "c032"],
  "metadata": {"slice": "finance", "query_type": "proper_noun"}
}
Even a 50-100 query set is useful if it is carefully sliced.
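Before running any metric, it helps to check that every record actually has the fields shown above and to see how the set is distributed across slices. The following is a minimal sketch using only the standard library; the helper names and the second, deliberately incomplete record are hypothetical.

```python
from collections import Counter

REQUIRED_FIELDS = ("query", "reference_answer", "gold_context_ids", "metadata")


def validate_record(record: dict) -> list:
    """Return a list of problems; an empty list means the record is usable."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]
    if "gold_context_ids" in record and not record["gold_context_ids"]:
        problems.append("gold_context_ids is empty")
    if "metadata" in record and "slice" not in record["metadata"]:
        problems.append("metadata.slice is missing")
    return problems


def slice_counts(records) -> Counter:
    """Count valid records per slice so thin slices are visible early."""
    return Counter(r["metadata"]["slice"] for r in records if not validate_record(r))


good = {
    "query": "What FX basis did the Q3 report use?",
    "reference_answer": "The report used the ECB end-of-quarter rate.",
    "gold_context_ids": ["c031", "c032"],
    "metadata": {"slice": "finance", "query_type": "proper_noun"},
}
# Hypothetical record with no gold context IDs and no slice label.
bad = {
    "query": "Which policy covers travel refunds?",
    "reference_answer": "Hypothetical reference answer.",
    "gold_context_ids": [],
    "metadata": {},
}

print(validate_record(bad))
print(slice_counts([good, bad]))
```

Records that fail validation are excluded from the slice counts, so a slice that looks well covered on paper but lacks gold context IDs shows up as thin.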
6.2 RAGAS-style evaluation inputs
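# Assumes the `ragas` and `datasets` packages are installed and a judge LLM is configured.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

# q1, a1, ctx11, gt1, ... are placeholders for your own strings.
# Column names follow the older RAGAS schema; newer releases may expect
# different names, so check the docs for the version you have installed.
dataset = Dataset.from_dict({
    "question": [q1, q2],                          # user queries
    "answer": [a1, a2],                            # generated answers
    "contexts": [[ctx11, ctx12], [ctx21, ctx22]],  # retrieved chunks per query
    "ground_truth": [gt1, gt2],                    # reference answers
})

result = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)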
This is good for batch comparison across prompt versions, retrievers, or rerankers.
6.3 DeepEval-style regression test
from deepeval import assert_test
from deepeval.metrics import FaithfulnessMetric, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# `answer` (a string) and `retrieved_chunks` (a list of strings) are
# placeholders for the output of your own RAG pipeline.
test_case = LLMTestCase(
    input="What FX basis did the Q3 report use?",
    actual_output=answer,
    retrieval_context=retrieved_chunks,
    expected_output="The report used the ECB end-of-quarter rate.",
)

# Fails if either metric scores below its threshold.
assert_test(
    test_case,
    [
        FaithfulnessMetric(threshold=0.8),
        AnswerRelevancyMetric(threshold=0.8),
    ],
)
This is useful when you want pass/fail style regression gates.
6.4 Slice your dataset before you trust the average
- Slice A: exact codes / proper nouns
- Slice B: conceptual "why/how" questions
- Slice C: multi-hop internal-policy questions
- Slice D: stale-version traps
One global average can hide the fact that a new retriever improved slice B and broke slice A.
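The effect is easy to reproduce with toy numbers. The sketch below compares a baseline and a candidate retriever on two slices; all scores are invented for illustration, and `mean_by_slice` is a hypothetical helper rather than part of any library.

```python
from collections import defaultdict
from statistics import mean


def mean_by_slice(rows):
    """Average the score separately for each slice label."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row["slice"]].append(row["score"])
    return {name: mean(scores) for name, scores in buckets.items()}


# Invented per-query scores for two retriever versions.
baseline = [
    {"slice": "A", "score": 1.0}, {"slice": "A", "score": 1.0},
    {"slice": "B", "score": 0.5}, {"slice": "B", "score": 0.5},
]
candidate = [
    {"slice": "A", "score": 0.75}, {"slice": "A", "score": 0.75},
    {"slice": "B", "score": 1.0}, {"slice": "B", "score": 1.0},
]

print(mean(r["score"] for r in baseline))   # 0.75
print(mean(r["score"] for r in candidate))  # 0.875
print(mean_by_slice(baseline))              # {'A': 1.0, 'B': 0.5}
print(mean_by_slice(candidate))             # {'A': 0.75, 'B': 1.0}
```

The global average rises from 0.75 to 0.875, yet slice A drops from 1.0 to 0.75. A report that shows only the first two numbers would approve a change that hurts exact-code lookups.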
7. Variants
7.1 Human-curated holdout sets
- What changes: humans write or review queries, answers, and evidence links.
- Why use it: highest trust.
- What becomes possible: stable model comparison over time.
- Where it fits: any production RAG team.
- Limits: expensive to build and maintain.
7.2 Synthetic data generation
- What changes: LLMs generate questions and references from documents.
- Why use it: fast coverage expansion.
- What becomes possible: larger evaluation sets for nightly runs.
- Where it fits: bootstrapping or low-resource teams.
- Limits: distribution drift and synthetic bias.
7.3 Retrieval-only evaluation
- What changes: score only whether gold contexts were retrieved.
- Why use it: isolates the search layer.
- What becomes possible: cleaner debugging of retriever changes.
- Where it fits: Parts 10-13 tuning work.
- Limits: says nothing about final answer quality.
7.4 Answer-only evaluation
- What changes: judge just the final answer.
- Why use it: quick product-level checks.
- What becomes possible: high-level acceptance tests.
- Where it fits: demos or early prototypes.
- Limits: hides whether failure came from retrieval or generation.
7.5 Live shadow evaluation
- What changes: run offline judges on anonymised real traffic samples.
- Why use it: catches drift early.
- What becomes possible: ongoing regression detection.
- Where it fits: mature production systems.
- Limits: privacy, sampling, and judge cost.
8. Limits and Failure Modes
8.1 Judge-model bias
- Why intrinsic: LLM judges are themselves imperfect models.
- Diagnosis: metric swings when you change the judge model.
- Mitigation: pin the judge version and spot-check with humans.
- Later part: Part 16 experiment logging.
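The human spot-check can be as simple as comparing judge verdicts against human labels on a small sample and reviewing the disagreements. The following is a minimal sketch; the record IDs, the labels, and both helper names are hypothetical.

```python
def agreement_rate(judge_labels, human_labels) -> float:
    """Share of items where the judge verdict matches the human label."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not human_labels:
        raise ValueError("need at least one human-labelled item")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def disagreements(ids, judge_labels, human_labels):
    """IDs of items the judge and the human scored differently."""
    return [i for i, j, h in zip(ids, judge_labels, human_labels) if j != h]


# Hypothetical spot-check sample: True means "answer is faithful".
ids = ["q01", "q02", "q03", "q04"]
judge = [True, True, False, True]
human = [True, False, False, True]

print(agreement_rate(judge, human))      # 0.75
print(disagreements(ids, judge, human))  # ['q02']
```

Re-running the same check after a judge model upgrade gives a quick signal of whether a metric swing comes from the system under test or from the judge itself.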
8.2 Synthetic-set overconfidence
- Why intrinsic: generated queries often resemble the generator's own style.
- Diagnosis: excellent offline scores but weak live-user performance.
- Mitigation: keep a human holdout slice and production samples.
- Later part: Part 16.
8.3 Missing gold context annotations
- Why intrinsic: many teams label answers but not supporting chunk IDs.
- Diagnosis: retrieval metrics cannot be computed cleanly.
- Mitigation: annotate at least a small subset with gold context IDs.
- Later part: Part 15 search metrics.
8.4 Metric collapse into one number
- Why intrinsic: dashboards tempt teams to optimise the average only.
- Diagnosis: one composite score improves while a key slice degrades.
- Mitigation: report by slice and by pipeline stage.
- Later part: Parts 15, 16, 17.
8.5 Threshold drift
- Why intrinsic: a threshold that passed last month may fail after corpus or prompt changes.
- Diagnosis: unstable CI-like outcomes.
- Mitigation: revisit thresholds when the system or dataset version changes.
- Later part: Part 16.
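One way to make threshold drift visible is to store each threshold together with the dataset version it was calibrated on, and to refuse a pass/fail verdict when the versions differ. This is a sketch of that idea under assumed names (`Gate`, `check_gate`, and the version strings are all hypothetical), not a feature of DeepEval or RAGAS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    metric: str
    threshold: float
    calibrated_on: str  # dataset version the threshold was tuned against


def check_gate(gate: Gate, score: float, dataset_version: str) -> str:
    """Return 'pass', 'fail', or 'recalibrate' when the dataset has changed."""
    if dataset_version != gate.calibrated_on:
        return "recalibrate"
    return "pass" if score >= gate.threshold else "fail"


# Hypothetical gate and evaluation runs.
gate = Gate(metric="faithfulness", threshold=0.8, calibrated_on="golden-v3")

print(check_gate(gate, 0.85, "golden-v3"))  # pass
print(check_gate(gate, 0.70, "golden-v3"))  # fail
print(check_gate(gate, 0.85, "golden-v4"))  # recalibrate
```

Returning a third state instead of a silent pass or fail forces someone to revisit the threshold whenever the golden set is revised.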
8.6 Common Pitfalls
- "RAG evaluation is just answer accuracy." Retrieval and grounding need separate checks.
- "Synthetic data is enough." It is useful, but not trustworthy enough on its own.
- "One metric tells the whole story." Faithfulness, context precision, and correctness answer different questions.
- "The judge model is ground truth." It is a tool, not a final authority.
- "If the average improved, the system improved." Slice-level regressions can still hurt users badly.
9. Settled Conclusions
Q1. What belongs in a minimal RAG golden dataset?
At least a query, a reference answer, and gold supporting context IDs for some slice of the set. Chapter: §4, §6.1.
Q2. What is the difference between context recall and faithfulness?
Context recall asks whether the evidence was retrieved; faithfulness asks whether the answer stayed grounded in that evidence. Chapter: §5.2, §5.3.
Q3. Where does RAGAS fit best?
Batch-style RAG-specific scoring, especially faithfulness and context metrics. Chapter: §2, §6.2.
Q4. Where does DeepEval fit best?
Regression tests, assertions, and pass/fail evaluation loops around LLM systems. Chapter: §2, §6.3.
Q5. Why should evaluation be sliced?
Because different query types fail in different ways, and averages hide those differences. Chapter: §6.4, §8.4.
References
Primary
- Es, S. et al. RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv:2309.15217.
- Saad-Falcon, J. et al. ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems. 2024.
- Lewis, P. et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020.
Official docs
- RAGAS docs: https://docs.ragas.io/
- DeepEval docs: https://docs.confident-ai.com/
- Arize Phoenix docs: https://arize.com/docs/phoenix/
Supporting
- Author note Chapter 13 — evaluation sets and RAGAS.
- Internal retrieval logs and manually checked user queries are high-value seed material for a golden set.
Cheat Sheet
| Knob | Default / Recommendation |
|---|---|
| Human holdout size | start with 50-100 queries |
| Dataset fields | query, reference answer, gold context IDs, slice metadata |
| Core metrics | context recall, context precision, faithfulness, correctness |
| Synthetic data | use to expand, not to replace human holdout |
| Judge model | pin version and spot-check |
| Reporting | by stage and by slice |
| Regression use | DeepEval-style thresholds on stable datasets |
One-liner: evaluate retrieval and answer quality separately, or you will optimise the wrong layer.
Bridge — What's Next
Next — RAG Core Study (15/26) — Search Quality Metrics: Recall@K, MRR, NDCG, Hit Rate.
To build or compare evaluation sets properly, we need the retrieval metrics themselves. The next part unpacks Recall@K, Precision@K, MRR, NDCG, and Hit Rate with formulas and code.