"RAG Core Study (14/26) — Evaluation Sets with RAGAS & DeepEval"

May 18, 2026

Without an evaluation set, every RAG improvement is a story, not evidence.

RAG systems fail in more than one place: retrieval, context selection, answer generation, and grounding. That means "looks better" is not a metric. Part 14 explains how to build a golden dataset, how RAGAS and DeepEval fit into the workflow, and why teams must separate retrieval quality from answer quality if they want tuning decisions to hold up.

0. Prerequisites

Part 13 reranking — system quality now depends on multi-stage ranking.
Part 1 grounding — a correct-looking answer is not enough.
Part 12 Hybrid — multiple retrieval settings need comparison on fixed data.

1. Learning Objectives

Build a small but useful golden dataset for RAG.
Distinguish retrieval metrics from answer metrics.
Use RAGAS and DeepEval in the right roles.
Avoid the common traps in judge-model-based evaluation.

2. Key Summary

An evaluation set for RAG is usually a table of:

query + reference answer + supporting context IDs + optional metadata slice

Use it to answer two different questions:

Did retrieval find the right evidence?
Did generation answer faithfully from that evidence?

RAGAS is strong for RAG-specific metrics such as Faithfulness, Context Precision, and Context Recall. DeepEval is strong as a testing harness for assertions, regression checks, and CI-like workflows. The core rule is simple: keep one human-trusted holdout set, do not rely on synthetic data alone, and record which changes improve which layer of the stack.

3. Intuition — Why "The Answer Looks Good" Is Not Enough

Suppose a RAG system answers:

"The report used the ECB end-of-quarter FX rate."

That answer may still be wrong, or impossible to trust, in four different ways:

the retrieved context came from the wrong report version,
the answer copied a nearby but incorrect sentence,
the wording sounds confident but the evidence is weak,
the question type was never represented in your internal test set.

One final answer can look acceptable while the retrieval layer is already failing. Evaluation must split the pipeline.

4. Definitions — Evaluation Terms

Term	Definition
Golden dataset	Human-trusted evaluation set with fixed queries and expected evidence/answers
Synthetic dataset	Evaluation examples generated automatically, often by an LLM
Faithfulness	Whether the answer is supported by the retrieved context
Context Precision	How much of the retrieved context is actually relevant
Context Recall	Whether the retrieved context contains the evidence needed
Answer Correctness	Whether the final answer matches the expected answer
Judge model	An LLM used to score other outputs
Slice	A subset of queries, such as code lookups or policy questions

The practical split is:

retrieval-first metrics: did we fetch the right evidence?
generation-first metrics: did we answer from that evidence correctly?

5. Math — The Small Set of Metrics That Matter First

5.1 Context precision

If $R$ is the retrieved context set and $G$ is the gold relevant context set:

$$\text{Context Precision} = \frac{|R \cap G|}{|R|}$$

High precision means little irrelevant context was retrieved.

5.2 Context recall

$$\text{Context Recall} = \frac{|R \cap G|}{|G|}$$

High recall means the retrieved context contains most of the necessary evidence.
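
The two set-based formulas in 5.1 and 5.2 can be sketched in a few lines of Python. This is a minimal illustration, assuming contexts are identified by chunk IDs; the function names and the IDs `c007` and `c019` are hypothetical. Library implementations may differ, for example by judging relevance with an LLM or by weighting results by rank.

```python
def context_precision(retrieved_ids, gold_ids):
    """|R ∩ G| / |R|: share of retrieved chunks that are gold-relevant."""
    retrieved, gold = set(retrieved_ids), set(gold_ids)
    if not retrieved:
        return 0.0  # nothing retrieved: define precision as 0
    return len(retrieved & gold) / len(retrieved)


def context_recall(retrieved_ids, gold_ids):
    """|R ∩ G| / |G|: share of gold chunks that were retrieved."""
    retrieved, gold = set(retrieved_ids), set(gold_ids)
    if not gold:
        return 0.0  # no gold labels: recall is undefined, report 0
    return len(retrieved & gold) / len(gold)


# Hypothetical retriever output for one query.
retrieved = ["c031", "c007", "c019", "c032"]
gold = ["c031", "c032"]

print(context_precision(retrieved, gold))  # 0.5
print(context_recall(retrieved, gold))     # 1.0
```

In this example the retriever found all the evidence (recall 1.0) but half of what it returned was noise (precision 0.5), which is the kind of split a single answer-level score cannot show.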

5.3 Faithfulness

Faithfulness is usually judged, not computed from exact string overlap:

$$\text{Faithfulness} \approx \Pr(\text{answer claims are supported by context})$$

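In practice, RAGAS estimates this through LLM-based claim checking: the answer is broken into individual claims, each claim is checked against the retrieved context, and the score is the fraction of claims that are supported.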
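
The claim-checking idea can be sketched as a ratio of supported claims to total claims. This is not the RAGAS implementation: the function names are hypothetical, the claims are assumed to be extracted already, and `naive_is_supported` is a deliberately crude substring check standing in for an LLM judge. The context sentence is an invented example.

```python
from typing import Callable, Iterable, List


def faithfulness_score(
    claims: Iterable[str],
    contexts: List[str],
    is_supported: Callable[[str, List[str]], bool],
) -> float:
    """Fraction of answer claims that the judge marks as supported."""
    claims = list(claims)
    if not claims:
        return 0.0  # no claims to check; treat as unscored
    supported = sum(1 for claim in claims if is_supported(claim, contexts))
    return supported / len(claims)


def naive_is_supported(claim: str, contexts: List[str]) -> bool:
    """Stand-in judge (assumption): case-insensitive substring match only."""
    needle = claim.lower()
    return any(needle in ctx.lower() for ctx in contexts)


contexts = ["The Q3 report used the ECB end-of-quarter rate for FX conversion."]
claims = ["the ECB end-of-quarter rate", "rates were audited externally"]

print(faithfulness_score(claims, contexts, naive_is_supported))  # 0.5
```

Passing the judge in as a function keeps the arithmetic separate from the judging, so a pinned LLM judge can replace the substring check without touching the scoring.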

5.4 Answer correctness

For exact tasks:

$$\text{Accuracy} = \frac{\#\text{correct answers}}{\#\text{questions}}$$

For free-form answers, LLM-judged correctness or reference-based similarity is often used instead of exact match alone.
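
For the exact-task case, a minimal sketch of the accuracy formula with light normalization might look like this. The normalization rules (lowercasing, stripping punctuation, collapsing whitespace) are an assumption, not a standard; choose them to match how strict the task really is. The sample predictions are invented.

```python
import string

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    return " ".join(text.lower().translate(_PUNCT_TABLE).split())


def exact_match_accuracy(predictions, references) -> float:
    """#correct answers / #questions after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    correct = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return correct / len(references)


preds = ["ECB end-of-quarter rate.", "Q2 2025"]
refs = ["ecb end-of-quarter rate", "Q3 2025"]

print(exact_match_accuracy(preds, refs))  # 0.5
```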

6. Walkthrough — From Dataset to Evaluation Run

6.1 Minimal golden dataset shape

{
  "query": "What FX basis did the Q3 report use?",
  "reference_answer": "The report used the ECB end-of-quarter rate.",
  "gold_context_ids": ["c031", "c032"],
  "metadata": {"slice": "finance", "query_type": "proper_noun"}
}

Even a 50-100 query set is useful if it is carefully sliced.
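
A record in this shape is easy to validate at load time, which catches missing gold context IDs before they silently break retrieval metrics. The sketch below is one possible loader; `GoldenExample` and `parse_example` are hypothetical names, and the rule that all three core fields must be non-empty is an assumption you may want to relax for answer-only slices.

```python
import json
from dataclasses import dataclass, field

REQUIRED_FIELDS = ("query", "reference_answer", "gold_context_ids")


@dataclass
class GoldenExample:
    query: str
    reference_answer: str
    gold_context_ids: tuple
    metadata: dict = field(default_factory=dict)


def parse_example(raw: dict) -> GoldenExample:
    """Build a GoldenExample, rejecting records with missing or empty core fields."""
    missing = [key for key in REQUIRED_FIELDS if not raw.get(key)]
    if missing:
        raise ValueError(f"missing or empty fields: {missing}")
    return GoldenExample(
        query=raw["query"],
        reference_answer=raw["reference_answer"],
        gold_context_ids=tuple(raw["gold_context_ids"]),
        metadata=dict(raw.get("metadata", {})),
    )


record = json.loads("""
{
  "query": "What FX basis did the Q3 report use?",
  "reference_answer": "The report used the ECB end-of-quarter rate.",
  "gold_context_ids": ["c031", "c032"],
  "metadata": {"slice": "finance", "query_type": "proper_noun"}
}
""")

example = parse_example(record)
print(example.gold_context_ids)   # ('c031', 'c032')
print(example.metadata["slice"])  # finance
```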

6.2 RAGAS-style evaluation inputs

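from datasets import Dataset  # Hugging Face datasets: ragas evaluates a Dataset, not a plain dict
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

# Column names follow the ragas 0.1.x schema; newer releases rename them,
# so check the version you have installed.
# q1, a1, ctx11, gt1, ... are placeholders for your own strings.
dataset = Dataset.from_dict({
    "question": [q1, q2],
    "answer": [a1, a2],
    "contexts": [[ctx11, ctx12], [ctx21, ctx22]],  # one list of retrieved chunks per question
    "ground_truth": [gt1, gt2],
})

# These metrics call a judge LLM, so model credentials must be configured first.
result = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)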

This is good for batch comparison across prompt versions, retrievers, or rerankers.

6.3 DeepEval-style regression test

from deepeval import assert_test
from deepeval.metrics import FaithfulnessMetric, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# answer (str) and retrieved_chunks (list of str) come from running your own
# RAG pipeline on the query; they are placeholders here.
test_case = LLMTestCase(
    input="What FX basis did the Q3 report use?",
    actual_output=answer,
    retrieval_context=retrieved_chunks,
    expected_output="The report used the ECB end-of-quarter rate.",
)

# Raises an assertion error if any judged score falls below its threshold.
assert_test(
    test_case,
    [
        FaithfulnessMetric(threshold=0.8),
        AnswerRelevancyMetric(threshold=0.8),
    ],
)

This is useful when you want pass/fail style regression gates.

6.4 Slice your dataset before you trust the average

slice A: exact codes / proper nouns
slice B: conceptual "why/how" questions
slice C: multi-hop internal-policy questions
slice D: stale-version traps

One global average can hide the fact that a new retriever improved slice B and broke slice A.
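
The effect is easy to reproduce with made-up numbers. In the sketch below, the scores, the `mean_by_slice` helper, and the row layout are all hypothetical; the point is only that the global mean rises while slice A regresses.

```python
from collections import defaultdict
from statistics import mean


def mean_by_slice(rows, metric):
    """Average one metric per slice label."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row["slice"]].append(row[metric])
    return {name: mean(values) for name, values in buckets.items()}


# Invented per-query context recall for two retriever versions.
baseline = [
    {"slice": "A", "context_recall": 1.0},
    {"slice": "A", "context_recall": 1.0},
    {"slice": "B", "context_recall": 0.5},
    {"slice": "B", "context_recall": 0.5},
]
candidate = [
    {"slice": "A", "context_recall": 0.75},
    {"slice": "A", "context_recall": 0.75},
    {"slice": "B", "context_recall": 1.0},
    {"slice": "B", "context_recall": 1.0},
]

for name, rows in (("baseline", baseline), ("candidate", candidate)):
    overall = mean(row["context_recall"] for row in rows)
    print(name, overall, mean_by_slice(rows, "context_recall"))
# baseline 0.75 {'A': 1.0, 'B': 0.5}
# candidate 0.875 {'A': 0.75, 'B': 1.0}
```

The candidate wins on the global average (0.875 vs 0.75) even though exact-code lookups in slice A got worse, which is the regression a per-slice report would flag.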

7. Variants

7.1 Human-curated holdout sets

What changes: humans write or review queries, answers, and evidence links.
Why use it: highest trust.
What becomes possible: stable model comparison over time.
Where it fits: any production RAG team.
Limits: expensive to build and maintain.

7.2 Synthetic data generation

What changes: LLMs generate questions and references from documents.
Why use it: fast coverage expansion.
What becomes possible: larger evaluation sets for nightly runs.
Where it fits: bootstrapping or low-resource teams.
Limits: distribution drift and synthetic bias.

7.3 Retrieval-only evaluation

What changes: score only whether gold contexts were retrieved.
Why use it: isolates the search layer.
What becomes possible: cleaner debugging of retriever changes.
Where it fits: Parts 10-13 tuning work.
Limits: says nothing about final answer quality.

7.4 Answer-only evaluation

What changes: judge just the final answer.
Why use it: quick product-level checks.
What becomes possible: high-level acceptance tests.
Where it fits: demos or early prototypes.
Limits: hides whether failure came from retrieval or generation.

7.5 Live shadow evaluation

What changes: run offline judges on anonymised real traffic samples.
Why use it: catches drift early.
What becomes possible: ongoing regression detection.
Where it fits: mature production systems.
Limits: privacy, sampling, and judge cost.

8. Limits and Failure Modes

8.1 Judge-model bias

Why intrinsic: LLM judges are themselves imperfect models.
Diagnosis: metric swings when you change the judge model.
Mitigation: pin the judge version and spot-check with humans.
Later part: Part 16 experiment logging.
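
One way to make the human spot-check concrete is to track how often the judge agrees with human labels on a small sample. The sketch below assumes binary pass/fail labels and uses a hypothetical `agreement_rate` helper with invented labels; it is a simple raw agreement rate, not a chance-corrected statistic.

```python
def agreement_rate(judge_labels, human_labels) -> float:
    """Share of sampled items where the judge and the human gave the same label."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have equal length")
    if not human_labels:
        return 0.0
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


# Invented spot-check sample: True means "answer is faithful".
judge = [True, True, False, True, False]
human = [True, False, False, True, False]

print(agreement_rate(judge, human))  # 0.8
```

Recomputing this number whenever the judge model or its prompt changes gives an early signal that a metric swing came from the judge rather than from the system under test.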

8.2 Synthetic-set overconfidence

Why intrinsic: generated queries often resemble the generator's own style.
Diagnosis: excellent offline scores but weak live-user performance.
Mitigation: keep a human holdout slice and production samples.
Later part: Part 16.

8.3 Missing gold context annotations

Why intrinsic: many teams label answers but not supporting chunk IDs.
Diagnosis: retrieval metrics cannot be computed cleanly.
Mitigation: annotate at least a small subset with gold context IDs.
Later part: Part 15 search metrics.

8.4 Metric collapse into one number

Why intrinsic: dashboards tempt teams to optimise the average only.
Diagnosis: one composite score improves while a key slice degrades.
Mitigation: report by slice and by pipeline stage.
Later part: Parts 15, 16, 17.

8.5 Threshold drift

Why intrinsic: a threshold that passed last month may fail after corpus or prompt changes.
Diagnosis: unstable CI-like outcomes.
Mitigation: revisit thresholds when the system or dataset version changes.
Later part: Part 16.
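
A lightweight guard against threshold drift is to store each threshold together with the dataset version it was calibrated on, and refuse to gate when the versions differ. The sketch below is one possible shape; `Gate`, `check_gate`, and the version strings are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    metric: str
    threshold: float
    dataset_version: str  # dataset version the threshold was calibrated on


def check_gate(gate: Gate, score: float, dataset_version: str) -> str:
    """Return 'pass', 'fail', or 'recalibrate' when the dataset has changed."""
    if dataset_version != gate.dataset_version:
        return "recalibrate"  # threshold no longer applies to this dataset
    return "pass" if score >= gate.threshold else "fail"


gate = Gate(metric="faithfulness", threshold=0.8, dataset_version="golden-v1")

print(check_gate(gate, 0.85, "golden-v1"))  # pass
print(check_gate(gate, 0.70, "golden-v1"))  # fail
print(check_gate(gate, 0.85, "golden-v2"))  # recalibrate
```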

8.6 Common Pitfalls

"RAG evaluation is just answer accuracy." Retrieval and grounding need separate checks.
"Synthetic data is enough." It is useful, but not trustworthy enough on its own.
"One metric tells the whole story." Faithfulness, context precision, and correctness answer different questions.
"The judge model is ground truth." It is a tool, not a final authority.
"If the average improved, the system improved." Slice-level regressions can still hurt users badly.

9. Settled Conclusions

Q1. What belongs in a minimal RAG golden dataset?

At least a query, a reference answer, and gold supporting context IDs for some slice of the set. Chapter: §4, §6.1.

Q2. What is the difference between context recall and faithfulness?

Context recall asks whether the evidence was retrieved; faithfulness asks whether the answer stayed grounded in that evidence. Chapter: §5.2, §5.3.

Q3. Where does RAGAS fit best?

Batch-style RAG-specific scoring, especially faithfulness and context metrics. Chapter: §2, §6.2.

Q4. Where does DeepEval fit best?

Regression tests, assertions, and pass/fail evaluation loops around LLM systems. Chapter: §2, §6.3.

Q5. Why should evaluation be sliced?

Because different query types fail in different ways, and averages hide those differences. Chapter: §6.4, §8.4.

References

Primary

Es, S. et al. RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv:2309.15217.
Saad-Falcon, J. et al. ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems. 2024.
Lewis, P. et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020.

Official docs

RAGAS docs: https://docs.ragas.io/
DeepEval docs: https://docs.confident-ai.com/
Arize Phoenix docs: https://arize.com/docs/phoenix/

Supporting

Author note Chapter 13 — evaluation sets and RAGAS.
Internal retrieval logs and manually checked user queries are high-value seed material for a golden set.

Cheat Sheet

Knob	Default / Recommendation
Human holdout size	start with 50-100 queries
Dataset fields	query, reference answer, gold context IDs, slice metadata
Core metrics	context recall, context precision, faithfulness, correctness
Synthetic data	use to expand, not to replace human holdout
Judge model	pin version and spot-check
Reporting	by stage and by slice
Regression use	DeepEval-style thresholds on stable datasets

One-liner: evaluate retrieval and answer quality separately, or you will optimise the wrong layer.

Bridge — What's Next

Next — RAG Core Study (15/26) — Search Quality Metrics: Recall@K, MRR, NDCG, Hit Rate.

To build or compare evaluation sets properly, we need the retrieval metrics themselves. The next part unpacks Recall@K, Precision@K, MRR, NDCG, and Hit Rate with formulas and code.