"RAG Core Study (15/26) — Search Quality Metrics: Recall@K, MRR, NDCG, Hit Rate"

Retrieval quality is never one number. Each metric tells you what kind of failure you are actually dealing with.

After you build an eval set, the next question is obvious: how exactly do we measure whether retrieval improved? RAG systems usually care about multiple layers at once. Did the retriever fetch the right document at all? Did it place that document near the top? Did it return too much noise? Part 15 explains the core retrieval metrics you will keep seeing in RAG evaluation: Recall@K, Hit Rate, MRR, NDCG, and Precision@K.


0. Prerequisites

  • Part 14 evaluation sets — the metrics only matter if the labels are trustworthy.
  • Part 12 Hybrid Search — fused pipelines often change recall and ranking in different ways.
  • Part 13 reranker — rerankers usually improve ranking metrics more than raw recall.

1. Learning Objectives

  1. Distinguish Recall@K, Hit Rate, MRR, and NDCG.
  2. Know which metric to inspect first for each retrieval problem.
  3. Explain why retrieval metrics and answer metrics are not interchangeable.
  4. Avoid the most common metric-reading mistakes in RAG.

2. Key Summary

Recall@K asks what share of the relevant documents made it into the top-K. Hit Rate@K asks whether each query got at least one relevant result in the top-K. MRR focuses on how early the first relevant result appears. NDCG@K measures ranking quality when there are multiple relevant documents with different relevance levels. In practice, RAG teams usually stabilise retriever recall first, then improve ranking quality with rerankers, fusion, and routing. No single metric can represent the whole retrieval system.


3. Intuition — The Same System Can Look Good or Bad Depending on the Metric

Imagine two retrievers:

  • Retriever A almost always places the correct document somewhere in the top-20, but often not near rank 1.
  • Retriever B often puts the correct document in the top-3, but occasionally misses it entirely.

Then:

  • Recall@20 may favour A
  • MRR may favour B

Neither result is contradictory. The metrics are answering different questions.
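
The split can be reproduced with a few lines of arithmetic. This is a minimal sketch using made-up ranks for five queries, each with exactly one correct document; the rank lists are hypothetical illustration data, not measurements, and `None` marks a query where the retriever missed the document entirely.

```python
# Hypothetical rank of the correct document for five queries (1 = top).
# None means the document was not retrieved at all.
ranks_a = [8, 12, 5, 15, 9]     # Retriever A: always found, rarely near the top
ranks_b = [1, 2, None, 1, 3]    # Retriever B: usually top-3, sometimes missing


def recall_at_k(ranks, k):
    """Share of queries whose single correct document is within top-k."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


def mrr(ranks):
    """Mean reciprocal rank; a missed query contributes 0."""
    return sum(1 / r if r is not None else 0.0 for r in ranks) / len(ranks)


recall_a, recall_b = recall_at_k(ranks_a, 20), recall_at_k(ranks_b, 20)
mrr_a, mrr_b = mrr(ranks_a), mrr(ranks_b)

print(f"A: Recall@20={recall_a:.2f}  MRR={mrr_a:.2f}")
print(f"B: Recall@20={recall_b:.2f}  MRR={mrr_b:.2f}")
```

With this toy data A wins on Recall@20 and B wins on MRR, which is the pattern described above. Because each query has one correct document here, Recall@20 and Hit Rate@20 coincide.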


4. Definitions — Core Retrieval Metrics

| Metric | What It Measures |
|---|---|
| Recall@K | What fraction of all relevant documents appear within top-K |
| Hit Rate@K | How often top-K contains at least one relevant result |
| MRR | How early the first relevant result appears |
| NDCG@K | How close the ranked list is to an ideal ranking with graded relevance |
| Precision@K | How much of top-K is actually relevant |
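
The three set-based metrics in the table differ only in what they divide by. The sketch below shows that for a single query; the function names are illustrative, the chunk ids are hypothetical, and dividing Precision@K by `k` even when fewer than `k` results come back is an assumption (conventions vary).

```python
def top_k_hits(relevant, ranked, k):
    """Number of relevant ids among the first k results."""
    return len(set(relevant) & set(ranked[:k]))


def hit_at_k(relevant, ranked, k):
    """1.0 if the top-k holds at least one relevant id, else 0.0."""
    return 1.0 if top_k_hits(relevant, ranked, k) > 0 else 0.0


def recall_at_k(relevant, ranked, k):
    """Fraction of all relevant ids that appear in the top-k."""
    return top_k_hits(relevant, ranked, k) / len(relevant)


def precision_at_k(relevant, ranked, k):
    """Fraction of the k slots filled by relevant ids (divides by k by assumption)."""
    return top_k_hits(relevant, ranked, k) / k


# Hypothetical labels and retrieval output for one query.
relevant = {"c03", "c08", "c11"}
ranked = ["c17", "c03", "c21", "c08", "c05"]

hit = hit_at_k(relevant, ranked, 5)              # 1.0
recall = recall_at_k(relevant, ranked, 5)        # 2 / 3
precision = precision_at_k(relevant, ranked, 5)  # 2 / 5
```

Hit Rate@K over an eval set is then the mean of `hit_at_k` across queries, and the reported Recall@K and Precision@K are usually per-query means as well.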

5. Math — What the Metrics Actually Compute

5.1 Recall@K

$$\text{Recall@K} = \frac{\text{relevant documents found in top-K}}{\text{all relevant documents}}$$

If a query has three relevant chunks and the top-5 contains two of them, Recall@5 is:

$$\frac{2}{3}$$

5.2 Mean Reciprocal Rank

$$\text{MRR} = \frac{1}{|Q|}\sum_{q \in Q} \frac{1}{\text{rank of first relevant result}}$$

If the first relevant result is rank 1, the contribution is 1.0.
If it is rank 2, the contribution is 0.5.
If it is rank 10, the contribution is 0.1.
If no relevant result is retrieved, the contribution is usually counted as 0.
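
A minimal sketch of the same computation over a small query set follows. The document ids and the `runs` structure are hypothetical, and scoring a missed query as 0 is a common convention rather than something the formula itself dictates.

```python
def reciprocal_rank(relevant, ranked):
    """1 / rank of the first relevant id, or 0.0 if none was retrieved."""
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (relevant_ids, ranked_ids) pairs."""
    return sum(reciprocal_rank(rel, ranked) for rel, ranked in runs) / len(runs)


# Hypothetical eval data: one (relevant set, ranked list) pair per query.
runs = [
    ({"d1"}, ["d1", "d7", "d9"]),  # first relevant at rank 1 -> 1.0
    ({"d2"}, ["d5", "d2", "d8"]),  # first relevant at rank 2 -> 0.5
    ({"d3"}, ["d4", "d6", "d0"]),  # never retrieved          -> 0.0
]

mrr = mean_reciprocal_rank(runs)  # (1.0 + 0.5 + 0.0) / 3 = 0.5
```

Keeping missed queries in the denominator matters: dropping them would make a retriever look better the more often it fails outright.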

5.3 NDCG@K

$$\text{DCG@K} = \sum_{i=1}^{K} \frac{2^{rel_i}-1}{\log_2(i+1)}$$

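Here $rel_i$ is the graded relevance label of the result at rank $i$, and IDCG@K is the DCG@K of the ideal ordering, in which results are sorted from most to least relevant. Then: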

$$\text{NDCG@K} = \frac{\text{DCG@K}}{\text{IDCG@K}}$$

This is useful when not all relevant documents are equally valuable.
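
As a worked instance of the two formulas, assume a hypothetical top-3 list whose graded labels are 1, 3, 2 in ranked order (the labels are made up for illustration):

$$\text{DCG@3} = \frac{2^{1}-1}{\log_2 2} + \frac{2^{3}-1}{\log_2 3} + \frac{2^{2}-1}{\log_2 4} \approx 1 + 4.417 + 1.5 = 6.917$$

The ideal ordering of the same labels is 3, 2, 1:

$$\text{IDCG@3} = \frac{2^{3}-1}{\log_2 2} + \frac{2^{2}-1}{\log_2 3} + \frac{2^{1}-1}{\log_2 4} \approx 7 + 1.893 + 0.5 = 9.393$$

$$\text{NDCG@3} = \frac{6.917}{9.393} \approx 0.736$$

All three relevant documents were retrieved, so Recall@3 would be a perfect 1.0, yet NDCG@3 is well below 1 because the strongest document sits at rank 2 instead of rank 1.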


6. Walkthrough — Reading Metrics Through Small Examples

6.1 MRR for a single query

rank_first_relevant = 3
mrr_for_query = 1 / rank_first_relevant   # 0.333...

MRR looks only at the first relevant result, so it tells you whether at least one piece of supporting context appears early enough. It ignores any relevant results further down the list.

6.2 Recall@K for multiple relevant documents

relevant_docs = {"c03", "c08", "c11"}
top5 = ["c17", "c03", "c21", "c08", "c05"]

found = len(relevant_docs.intersection(top5))  # 2
recall_at_5 = found / len(relevant_docs)       # 2 / 3

6.3 Why NDCG matters

Suppose a policy PDF is the strongest source, a committee note is useful but weaker, and a blog explanation is only loosely helpful. NDCG lets you grade those differences instead of treating them as equal.
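
The sketch below turns that scenario into numbers using the DCG formula from section 5.3. The grades (3, 2, 1), the document ids, and the retrieved order are all assumptions chosen for illustration; the ideal ranking is built from every labelled document for the query, not only the retrieved ones.

```python
import math


def dcg_at_k(grades, k):
    """DCG@K with exponential gain: sum of (2^rel - 1) / log2(rank + 1)."""
    return sum(
        (2 ** grade - 1) / math.log2(rank + 1)
        for rank, grade in enumerate(grades[:k], start=1)
    )


def ndcg_at_k(ranked_grades, all_grades, k):
    """DCG@K of the ranking divided by DCG@K of the ideal ordering."""
    ideal = dcg_at_k(sorted(all_grades, reverse=True), k)
    return dcg_at_k(ranked_grades, k) / ideal if ideal > 0 else 0.0


# Hypothetical graded labels for one query.
grades = {"policy_pdf": 3, "committee_note": 2, "blog_post": 1}

# Hypothetical retriever output: the weakest source came back first.
ranked = ["blog_post", "policy_pdf", "committee_note"]
ranked_grades = [grades.get(doc_id, 0) for doc_id in ranked]  # unlabelled -> 0

score = ndcg_at_k(ranked_grades, list(grades.values()), k=3)
perfect = ndcg_at_k([3, 2, 1], list(grades.values()), k=3)

print(f"NDCG@3 as retrieved: {score:.3f}")
print(f"NDCG@3 ideal order:  {perfect:.3f}")
```

A binary metric would score both orderings identically, since all three documents are present in the top-3 either way.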

Self-explanation: Why can a retriever with strong Recall@K still produce weak final RAG answers?


7. Variants and Use Cases

7.1 Recall@K — Best for retrieval coverage

What changes
The metric only asks whether the relevant documents were recovered, not where they rank.

Why it matters
If the answer document never appears in top-K, generation cannot rescue the failure.

What it enables
You can diagnose the raw coverage of the retriever before reranking.

Limit and next step
It says little about ranking quality. That is where MRR and NDCG take over.

7.2 MRR — Best for single-answer QA retrieval

When one main document matters most, MRR is intuitive and operationally useful.

7.3 NDCG — Best for graded relevance

When multiple documents matter but not equally, NDCG reflects that structure better than simple hit-based metrics.


8. Limits and Failure Modes

8.1 One metric can hide another failure

You can raise Recall@K simply by increasing K (for example, from 10 to 50), because returning more documents can only add relevant ones. The same change usually makes the final context noisier and less useful for generation.
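
The trade-off shows up when recall and precision are read side by side across K. This is a minimal sketch on one hypothetical query; the chunk ids and the ranking are made up.

```python
# Hypothetical labels and a ranked list of ten results for one query.
relevant = {"c03", "c08", "c11"}
ranked = ["c03", "c17", "c08", "c21", "c05", "c40", "c41", "c42", "c43", "c11"]


def recall_and_precision(relevant, ranked, k):
    """Return (Recall@k, Precision@k) for one query."""
    found = len(relevant & set(ranked[:k]))
    return found / len(relevant), found / k


by_k = {k: recall_and_precision(relevant, ranked, k) for k in (3, 5, 10)}

for k, (recall, precision) in by_k.items():
    print(f"K={k:>2}  recall={recall:.2f}  precision={precision:.2f}")
```

In this toy case recall climbs from 0.67 to 1.00 as K grows from 3 to 10, while precision falls from 0.67 to 0.30: the extra recall is bought with seven irrelevant chunks.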

8.2 K must match the pipeline

Recall@50 is not the right focus if your RAG pipeline only ever passes top-5 into the model.

8.3 Label quality controls metric quality

If the eval set does not define relevance consistently, even mathematically correct metrics become operationally misleading.

8.4 Next step — The metrics must become a repeatable experiment loop

Metrics matter most when they are logged, compared, and tied to trace data. That leads to Part 16.


8.5 Common Pitfalls

| # | Pitfall | Symptom | Fast Check |
|---|---|---|---|
| 1 | Optimising one metric only | Hidden regressions elsewhere | Read at least recall and ranking metrics together |
| 2 | Using the wrong K | Mismatch with production | Align K with actual pipeline depth |
| 3 | Mixing retrieval and answer metrics | Root cause stays unclear | Split retrieval from generation dashboards |
| 4 | Ignoring per-query failures | Average looks fine | Inspect hard queries individually |
| 5 | Treating all relevant docs as equal | Metric loses nuance | Add graded labels where needed |
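
Pitfall 4 is easy to check mechanically. The sketch below assumes you already have a per-query reciprocal rank; the query ids and scores are hypothetical.

```python
# Hypothetical per-query reciprocal ranks from one eval run.
reciprocal_ranks = {
    "q1": 1.0,
    "q2": 1.0,
    "q3": 0.5,
    "q4": 1.0,
    "q5": 0.0,  # the relevant document was never retrieved
    "q6": 1.0,
}

# The aggregate looks healthy on its own.
mean_rr = sum(reciprocal_ranks.values()) / len(reciprocal_ranks)

# Queries that missed entirely, and the two weakest queries overall.
misses = sorted(q for q, rr in reciprocal_ranks.items() if rr == 0.0)
worst = sorted(reciprocal_ranks, key=reciprocal_ranks.get)[:2]

print(f"MRR = {mean_rr:.2f}")
print(f"complete misses: {misses}")
print(f"inspect first:   {worst}")
```

An MRR of 0.75 hides the fact that one query in six gets no usable context at all, which is the failure a user is most likely to notice.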

9. Self-check — Answer Before Looking

Q1. What does Recall@K measure?

Answer: Whether relevant documents were included within top-K.
Why: It measures retrieval coverage rather than ranking finesse.

Q2. Why is MRR useful for QA retrieval?

Answer: Because it rewards placing the first relevant result early.
Why: QA often depends most on the highest-ranked relevant chunk.

Q3. When is NDCG better than simple hit-based metrics?

Answer: When multiple relevant documents have different importance levels.
Why: NDCG handles graded relevance instead of binary relevance only.

Q4. Why can Recall@K improve while answer quality gets worse?

Answer: Because higher recall can come from a larger K, which also lets more irrelevant documents into the context.
Why: Generation still depends on ranking, context quality, and token budget.


Cheat Sheet — One-page Summary

Formulas

  • Recall@K = relevant docs retrieved in top-K / all relevant docs
  • MRR = average reciprocal rank of first relevant result
  • NDCG = DCG / IDCG

Definitions

  • Hit Rate@K: at least one relevant result in top-K
  • Precision@K: proportion of top-K that is relevant

Minimal code

mrr = sum(1 / first_rank[q] for q in queries) / len(queries)

When to use what

| Situation | Metric |
|---|---|
| Retrieval coverage | Recall@K |
| First-result quality | MRR |
| Graded ranking quality | NDCG |
| Noise level in top-K | Precision@K |


References

Primary sources

  • Manning, C., Raghavan, P., and Schütze, H. Introduction to Information Retrieval.
  • Järvelin, K. and Kekäläinen, J. Cumulated Gain-based Evaluation of IR Techniques. 2002.

Official docs

  • Elasticsearch ranking evaluation docs
  • Weaviate evaluation docs

Supporting notes

  • User notes, chapter 12 evaluation
  • User notes, chapter 13 experiments

Bridge to the Next Part

Once the metrics are clear, the next challenge is operational: you need to collect those metrics consistently and compare runs without relying on memory. Part 16 covers experiment automation with LangSmith, Phoenix, and MLflow.