"RAG Core Study (12/26) — Hybrid Search & Score Fusion"


Dense catches semantics. BM25 catches exact tokens. Production RAG usually needs both.

Hybrid Search combines Dense Retrieval and Sparse Retrieval so each covers the other's blind spots. Simple in principle, but score scales differ, candidate pools collide, and weight tuning can overfit fast. Part 12 explains why Reciprocal Rank Fusion (RRF) became the practical default, when Weighted Fusion is still useful, and how to run a small, repeatable alpha-tuning experiment.


0. Prerequisites

  • Part 10 Dense Retrieval — semantic similarity and top-K gaps.
  • Part 11 BM25 — exact matching, inverted indices, analyzer choices.
  • Part 7 metadata — filters should apply before fusion whenever possible.

1. Learning Objectives

  1. Explain why Dense + BM25 is the standard first-stage retrieval stack.
  2. Compute RRF and compare it with score-based weighted fusion.
  3. Design candidate pools for Hybrid -> Reranker pipelines.
  4. Diagnose the most common Hybrid failure modes.

2. Key Summary

Hybrid Search runs Dense and BM25 separately, then fuses the results. The production default is RRF:

$$\text{RRF}(d) = \sum_i \frac{1}{k + \text{rank}_i(d)}$$

It works well because it ignores raw score scales and only trusts rank order. Weighted Fusion can outperform RRF, but only after score normalisation and per-corpus tuning. The standard pipeline is:

Dense top-50/100 + BM25 top-50/100 -> RRF fuse -> optional Reranker -> final top-K

Use Hybrid when queries mix semantic paraphrase and exact identifiers. Skip it only when one signal clearly dominates, such as a pure code lookup or a purely semantic FAQ corpus.


3. Intuition — Why Two Weaknesses Make One Stronger System

Query: "What FX basis did the PR-2024-Q3 report use?"

  • Dense pulls chunks about currency conversion and FX rates, even if wording differs.
  • BM25 locks onto the exact token PR-2024-Q3.
  • Hybrid surfaces chunks that satisfy both signals near the top.

Dense alone tends to blur the report code. BM25 alone may miss "FX basis" if the chunk says "conversion rate applied". Hybrid exists for exactly this kind of mixed query.


4. Definitions — Core Fusion Terms

Term | Definition
Hybrid Search | Combine two or more retrieval signals, usually Dense + BM25
Candidate pool | The first-stage top-N documents from each retriever before fusion
RRF | Rank-only fusion: add \(\frac{1}{k+r}\) for each rank position
Weighted Fusion | Combine normalised scores with weights such as \(\alpha s_d + (1-\alpha)s_b\)
Score normalisation | Map Dense and BM25 scores into comparable ranges
Agreement | The same document appears high in multiple retrievers
Fusion constant \(k\) | RRF dampening term, usually 60

RRF is strong because it does not care whether Dense scores are cosine values around 0.6 and BM25 scores are raw values around 18.4. Weighted fusion does care.


5. Math — RRF vs Weighted Fusion

5.1 Reciprocal Rank Fusion

For document \(d\), across ranked lists \(L_1, ..., L_m\):

$$\text{RRF}(d) = \sum_{i=1}^{m} \frac{1}{k + \text{rank}_{L_i}(d)}$$

  • If \(d\) is rank 1 in Dense and rank 3 in BM25, with \(k=60\):

$$\text{RRF}(d) = \frac{1}{61} + \frac{1}{63} \approx 0.0323$$

  • Documents that appear in both lists rise naturally.
  • Missing from one list simply means no contribution from that list.
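
To make the role of \(k\) concrete, here is a minimal sketch comparing two hypothetical documents: `solo`, ranked 1 by one retriever and absent from the other, and `agreed`, ranked 3 by both. The helper name `rrf_score` and both documents are illustrative assumptions, not part of any library.

```python
def rrf_score(ranks, k=60):
    """Sum 1 / (k + rank) over the lists in which a document appears."""
    return sum(1.0 / (k + r) for r in ranks)

# Hypothetical documents: `solo` is rank 1 in one list and missing from the other;
# `agreed` is rank 3 in both lists.
solo_ranks = [1]
agreed_ranks = [3, 3]

for k in (0, 60):
    solo = rrf_score(solo_ranks, k)
    agreed = rrf_score(agreed_ranks, k)
    winner = "agreed" if agreed > solo else "solo"
    print(f"k={k:>2}  solo={solo:.4f}  agreed={agreed:.4f}  winner={winner}")
```

With \(k=0\) a single first place outweighs two third places. With \(k=60\) the reciprocal curve is flat enough that agreement across lists wins, which is the behaviour the bullets above describe.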

5.2 Weighted score fusion

After score normalisation:

$$\text{HybridScore}(d) = \alpha \cdot \tilde{s}_{dense}(d) + (1-\alpha) \cdot \tilde{s}_{bm25}(d)$$

Where \(\tilde{s}\) is a normalised score, for example min-max:

$$\tilde{s}(d) = \frac{s(d) - s_{min}}{s_{max} - s_{min}}$$

Weighted fusion is more expressive, but only if:

  1. the normalisation is sensible,
  2. \(\alpha\) is tuned on an evaluation set,
  3. both retrievers expose scores stable enough to compare.

5.3 Candidate-depth rule

If a reranker follows, retrieve deeper:

$$N_{dense}, N_{bm25} \in [30, 100] \quad \Rightarrow \quad N_{fused} \in [30, 80]$$

Hybrid is usually a recall stage. Precision comes later in Part 13.


6. Walkthrough — Dense + BM25 + Fusion

6.1 Minimal RRF implementation

from collections import defaultdict

dense_ranking = ["c014", "c022", "c031", "c005"]
bm25_ranking  = ["c031", "c014", "c099", "c022"]

def rrf(rankings, k=60):
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1 / (k + rank)
    return sorted(scores.items(), key=lambda x: -x[1])

print(rrf([dense_ranking, bm25_ranking]))

Expected order (scores rounded to four decimals):

[('c014', 0.0325), ('c031', 0.0323), ('c022', 0.0318), ('c099', 0.0159), ('c005', 0.0156)]

c014 wins because it appears near the top in both lists.

6.2 Dense + BM25 in two systems

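# Assumed setup: `es` is an Elasticsearch client, `vectordb` is a LangChain-style
# vector store, and `query` is the user's query string.
bm25_hits = es.search(
    index="rag",
    body={"query": {"match": {"chunk_text": query}}, "size": 50}
)

dense_hits = vectordb.similarity_search_with_score(query, k=50)

# Cast both sides to str so "42" and 42 are not treated as different documents.
bm25_rank  = [str(hit["_id"]) for hit in bm25_hits["hits"]["hits"]]
dense_rank = [str(doc.metadata["chunk_id"]) for doc, _ in dense_hits]

# rrf() is the function from §6.1.
fused = rrf([bm25_rank, dense_rank], k=60)
top_ids = [doc_id for doc_id, _ in fused[:20]]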

The only hard requirement is stable document IDs across both systems. If IDs drift, fusion breaks silently.
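
Because the failure is silent, a cheap guard before fusion can help. The sketch below is one possible check, assuming the set of chunk IDs written at ingestion time is available; `check_ids` and `known_ids` are hypothetical names introduced for illustration.

```python
def check_ids(rankings, known_ids):
    """Return a list of problems found in the ranked ID lists (empty means OK).

    rankings:  dict mapping a retriever name to its ranked list of IDs
    known_ids: set of chunk IDs written at ingestion time (assumed available)
    """
    problems = []
    for name, ranking in rankings.items():
        for doc_id in ranking:
            if not isinstance(doc_id, str):
                problems.append(f"{name}: non-string ID {doc_id!r}")
            elif doc_id not in known_ids:
                problems.append(f"{name}: unknown ID {doc_id!r}")
    return problems

known_ids = {"c014", "c022", "c031"}
rankings = {
    "bm25": ["c031", "c014"],
    "dense": ["c014", 22, "chunk-31"],  # 22 and "chunk-31" simulate drifted IDs
}
for problem in check_ids(rankings, known_ids):
    print(problem)
```

Running a check like this on a sample of queries after each re-index is usually enough to catch drift before it reaches fusion.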

6.3 Weighted fusion with score normalisation

def minmax(scores):
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}

dense_scores = {"c014": 0.81, "c022": 0.79, "c031": 0.77}
bm25_scores  = {"c031": 14.2, "c014": 12.7, "c099": 11.5}

dn = minmax(dense_scores)
bn = minmax(bm25_scores)
alpha = 0.6

all_ids = set(dn) | set(bn)
hybrid = {
    doc_id: alpha * dn.get(doc_id, 0.0) + (1 - alpha) * bn.get(doc_id, 0.0)
    for doc_id in all_ids
}

Without minmax, BM25's raw values dominate numerically and the Dense signal effectively disappears.
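
The effect is easy to reproduce with the same toy scores. The following sketch ranks the union of both lists twice, once on raw scores and once after min-max normalisation; `weighted_order` is a helper name introduced here for illustration.

```python
def minmax(scores):
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def weighted_order(dense, bm25, alpha):
    """Rank the union of both score dicts by alpha*dense + (1-alpha)*bm25."""
    ids = set(dense) | set(bm25)
    fused = {d: alpha * dense.get(d, 0.0) + (1 - alpha) * bm25.get(d, 0.0) for d in ids}
    return sorted(fused, key=lambda d: -fused[d])

dense_scores = {"c014": 0.81, "c022": 0.79, "c031": 0.77}
bm25_scores = {"c031": 14.2, "c014": 12.7, "c099": 11.5}

raw_order = weighted_order(dense_scores, bm25_scores, alpha=0.6)
norm_order = weighted_order(minmax(dense_scores), minmax(bm25_scores), alpha=0.6)

print("raw:       ", raw_order)
print("normalised:", norm_order)
```

On raw scores the result is simply the BM25 order with the Dense-only chunk appended last, even though alpha favours Dense. After normalisation c014 moves to the top. One side effect worth knowing: min-max maps the lowest score in each list to 0.0, so c031 gets the same Dense contribution as a chunk Dense never returned.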

6.4 Small alpha-tuning loop

alphas = [0.0, 0.25, 0.5, 0.75, 1.0]
results = []

for alpha in alphas:
    metric = evaluate_hybrid(eval_queries, alpha=alpha)   # e.g. Recall@20 or NDCG@10
    results.append((alpha, metric))

best_alpha, best_metric = max(results, key=lambda x: x[1])
print(best_alpha, best_metric)

This only makes sense if evaluate_hybrid() runs on a fixed evaluation set. Part 14 formalises that.
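
The loop above leaves evaluate_hybrid() undefined. One way it might look is sketched below, assuming each evaluation query carries precomputed Dense scores, BM25 scores, and a set of relevant chunk IDs. The data layout, the Recall@k metric with k=2, and the toy queries are all assumptions made for illustration.

```python
def minmax(scores):
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def evaluate_hybrid(eval_queries, alpha, k=2):
    """Mean Recall@k of weighted fusion over a fixed evaluation set."""
    recalls = []
    for q in eval_queries:
        dn, bn = minmax(q["dense"]), minmax(q["bm25"])
        ids = set(dn) | set(bn)
        fused = {d: alpha * dn.get(d, 0.0) + (1 - alpha) * bn.get(d, 0.0) for d in ids}
        # Sort by score, then by ID, so ties break deterministically.
        top_k = sorted(ids, key=lambda d: (-fused[d], d))[:k]
        hits = len(set(top_k) & q["relevant"])
        recalls.append(hits / len(q["relevant"]))
    return sum(recalls) / len(recalls)

eval_queries = [
    {   # identifier-style query: BM25 finds the relevant chunk, Dense ranks it last
        "dense": {"a1": 0.82, "a2": 0.80, "a3": 0.60},
        "bm25": {"a3": 15.0, "a1": 9.0, "a4": 8.0},
        "relevant": {"a3"},
    },
    {   # paraphrase-style query: Dense finds the relevant chunk, BM25 misses it
        "dense": {"b1": 0.90, "b2": 0.70, "b3": 0.65},
        "bm25": {"b2": 11.0, "b4": 10.0, "b3": 9.0},
        "relevant": {"b1"},
    },
]

for alpha in (0.0, 0.5, 1.0):
    print(alpha, evaluate_hybrid(eval_queries, alpha=alpha))
```

On this toy set each single retriever recovers only one of the two relevant chunks, while the mixed weight recovers both. Two queries prove nothing about a real corpus; the point is only the shape of the function the tuning loop expects.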


7. Variants

7.1 RRF as the default production baseline

  • What changes: rank-only fusion, no score calibration.
  • Why use it: robust across tools and corpora.
  • What becomes possible: reliable Dense + BM25 combination with minimal tuning.
  • Where it fits: most first Hybrid deployments.
  • Limits: cannot exploit cases where one score scale is meaningfully calibrated.

7.2 Weighted fusion after calibration

  • What changes: use score normalisation plus \(\alpha\).
  • Why use it: lets you bias toward semantic or exact matching.
  • What becomes possible: small quality gains on stable corpora.
  • Where it fits: mature pipelines with evaluation coverage.
  • Limits: overfits easily; scores drift when the corpus changes.

7.3 Query-adaptive weighting

  • What changes: make \(\alpha\) depend on query type.
  • Why use it: proper-noun queries should lean BM25; conceptual queries should lean Dense.
  • What becomes possible: better average quality than one global weight.
  • Where it fits: after Part 17 query classification.
  • Limits: classification error cascades into fusion error.
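
As a rough illustration of query-adaptive weighting, the sketch below lowers the Dense weight when the query contains an identifier-like token. The heuristic (a token mixing letters and digits) and the two alpha values are placeholders chosen for this example, not tuned recommendations; a real system would use a proper query classifier.

```python
def looks_like_identifier(token):
    """True for tokens mixing letters and digits, e.g. 'PR-2024-Q3'."""
    return any(c.isdigit() for c in token) and any(c.isalpha() for c in token)

def choose_alpha(query, semantic_alpha=0.7, exact_alpha=0.3):
    """Return the Dense weight: lower when the query carries an identifier."""
    tokens = [t.strip(".,?!\"'()") for t in query.split()]
    if any(looks_like_identifier(t) for t in tokens):
        return exact_alpha
    return semantic_alpha

print(choose_alpha("What FX basis did the PR-2024-Q3 report use?"))
print(choose_alpha("How do we handle currency conversion?"))
```

The first query contains PR-2024-Q3 and gets the BM25-leaning weight; the second has no identifier and stays Dense-leaning. A misfire in `looks_like_identifier` shifts the weight the wrong way, which is the cascade risk noted in the limits.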

7.4 Single-system Hybrid in Elasticsearch/OpenSearch

  • What changes: BM25 and vector search live in one engine.
  • Why use it: operational simplicity.
  • What becomes possible: one index, one filter layer, one deployment path.
  • Where it fits: existing ES shops.
  • Limits: less flexible than a best-of-breed BM25 + vector DB stack.

7.5 Hybrid -> Reranker cascade

  • What changes: Hybrid expands recall, reranker restores precision.
  • Why use it: first-stage retrieval and final ordering solve different problems.
  • What becomes possible: higher top-3 and top-5 accuracy.
  • Where it fits: production QA systems.
  • Limits: more latency and model cost. Part 13 covers the trade-off.
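
The cascade can be sketched in a few lines. In the example below, `rerank_score` stands in for a cross-encoder call and `toy_scores` is made-up data; both are assumptions for illustration, and the function name `hybrid_then_rerank` is not from any library.

```python
from collections import defaultdict

def rrf(rankings, k=60):
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1 / (k + rank)
    return sorted(scores.items(), key=lambda x: -x[1])

def hybrid_then_rerank(dense_ranking, bm25_ranking, rerank_score, fuse_n=50, final_k=3):
    """Fuse two ranked lists, keep fuse_n candidates, re-sort them with rerank_score."""
    candidates = [doc_id for doc_id, _ in rrf([dense_ranking, bm25_ranking])[:fuse_n]]
    return sorted(candidates, key=rerank_score, reverse=True)[:final_k]

# Stand-in relevance scores; a real pipeline would call a cross-encoder here.
toy_scores = {"c014": 0.35, "c022": 0.10, "c031": 0.92, "c005": 0.05, "c099": 0.60}

final = hybrid_then_rerank(
    ["c014", "c022", "c031", "c005"],
    ["c031", "c014", "c099", "c022"],
    rerank_score=toy_scores.get,
)
print(final)
```

Note that c099 appears in only one list and sits low after fusion, yet the reranker can still promote it. That is only possible because the fused pool was deep enough to include it.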

8. Limits and Failure Modes

8.1 Score-scale mismatch

  • Why intrinsic: Dense and BM25 scores live on unrelated numeric ranges.
  • Diagnosis: a raw weighted sum ranks almost identically to the retriever with the larger score range, usually BM25.
  • Mitigation: use RRF first, or normalise scores before weighting.
  • Later part: Part 16 experiment tracking for stable tuning.

8.2 ID mismatch across systems

  • Why intrinsic: Dense DB and BM25 index may store different chunk IDs for the same text.
  • Diagnosis: obviously relevant chunks never merge, even when both retrievers find them.
  • Mitigation: enforce shared chunk_id from ingestion time.
  • Related part: Part 3 ingestion design.

8.3 Candidate pools too shallow

  • Why intrinsic: if each retriever only returns top-5, fusion has too little recall to work with.
  • Diagnosis: Hybrid barely differs from the better single retriever.
  • Mitigation: retrieve top-30 to top-100 before fusion.
  • Later part: Part 13 reranking depth.
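
A toy case shows why depth matters. In the sketch below, the hypothetical chunk x7 is ranked third by both retrievers; the other IDs are filler invented for the example.

```python
from collections import defaultdict

def rrf(rankings, k=60):
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1 / (k + rank)
    return sorted(scores.items(), key=lambda x: -x[1])

# Full ranked lists; x7 is rank 3 in both.
dense_full = ["d1", "d2", "x7", "d4"]
bm25_full = ["b1", "b2", "x7", "b4"]

for depth in (2, 4):
    fused = rrf([dense_full[:depth], bm25_full[:depth]])
    ids = [doc_id for doc_id, _ in fused]
    print(f"depth={depth}: top={ids[0]}, x7 present={'x7' in ids}")
```

At depth 2 the chunk never enters fusion, so agreement between retrievers cannot help it. At depth 4 it becomes the top fused result, because two third places outscore any single first place when k=60.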

8.4 Filters applied after fusion

  • Why intrinsic: forbidden or stale documents may rise during fusion and only be removed later.
  • Diagnosis: top ranks collapse after filtering.
  • Mitigation: pre-filter each retriever when possible.
  • Related parts: Part 7 metadata, Part 25 security ops.
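
The collapse is easy to simulate. The sketch below assumes a hypothetical allow-list (`allowed`) and filters Python lists to imitate pre-filtering; a real system would push the filter into each retriever's query instead.

```python
from collections import defaultdict

def rrf(rankings, k=60):
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1 / (k + rank)
    return sorted(scores.items(), key=lambda x: -x[1])

def top_ids(rankings, n):
    return [doc_id for doc_id, _ in rrf(rankings)[:n]]

allowed = {"c014", "c099", "c120", "c131"}  # hypothetical access list for this user

# Unfiltered retriever output: restricted chunks rank high in both lists.
dense_all = ["c022", "c031", "c014", "c120"]
bm25_all = ["c031", "c022", "c099", "c131"]

# Post-filter: fuse depth-2 pools, then drop forbidden IDs.
post = [d for d in top_ids([dense_all[:2], bm25_all[:2]], n=2) if d in allowed]

# Pre-filter: each retriever only sees allowed chunks, then takes its top 2.
dense_ok = [d for d in dense_all if d in allowed][:2]
bm25_ok = [d for d in bm25_all if d in allowed][:2]
pre = top_ids([dense_ok, bm25_ok], n=2)

print("post-filter:", post)
print("pre-filter: ", pre)
```

Post-filtering leaves nothing to return because both fused slots were taken by restricted chunks. Pre-filtering still yields two usable results from the same retrievers.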

8.5 Weight overfitting

  • Why intrinsic: one alpha can look best on a small dev slice and fail on new queries.
  • Diagnosis: offline gains vanish in live traffic.
  • Mitigation: validate by query slice, not only global average.
  • Later part: Parts 14, 16, 17.
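
Slice-level validation needs very little machinery. The sketch below groups per-query metrics by a slice label; the labels and the Recall@20 values are invented for illustration, and `mean_by_slice` is a hypothetical helper name.

```python
from collections import defaultdict

def mean_by_slice(per_query):
    """per_query: list of (slice_label, metric_value) pairs, one per eval query."""
    buckets = defaultdict(list)
    for label, value in per_query:
        buckets[label].append(value)
    return {label: sum(vals) / len(vals) for label, vals in buckets.items()}

# Hypothetical Recall@20 per query for one alpha setting.
per_query = [
    ("identifier", 0.2), ("identifier", 0.4),
    ("conceptual", 1.0), ("conceptual", 1.0), ("conceptual", 0.9), ("conceptual", 1.0),
]

overall = sum(v for _, v in per_query) / len(per_query)
print(f"overall={overall:.2f}")
for label, mean in mean_by_slice(per_query).items():
    print(f"{label}: {mean:.3f}")
```

Here the global average looks acceptable while the identifier slice is clearly failing, which is the pattern a single global alpha tends to hide.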

8.6 Common Pitfalls

  • "Hybrid means add raw Dense and BM25 scores." Wrong unless scores are normalised and validated.
  • "RRF is too simple to matter." Simplicity is why it survives tool changes and corpus drift.
  • "Hybrid replaces reranking." Hybrid improves recall; rerankers improve final order.
  • "Top-10 + top-10 is enough." Too shallow for production-quality recall.
  • "If both retrievers disagree, one of them is broken." Often the query is mixed and both are partially right.

9. Settled Conclusions

Q1. Why is RRF the default fusion method?

Because it uses rank only, so Dense and BM25 score scales do not need calibration. Chapter: §5.1, §7.1.

Q2. When does weighted fusion beat RRF?

When scores are normalised, the corpus is stable, and alpha is tuned on a real evaluation set. Chapter: §5.2, §6.4.

Q3. Why should candidate pools be deeper than final top-K?

Fusion is a recall stage. It needs enough candidates for overlap and recovery before reranking. Chapter: §5.3, §8.3.

Q4. What is the most common Hybrid implementation bug?

Mismatched document IDs between the Dense index and the BM25 index. Chapter: §6.2, §8.2.

Q5. In one line, when should Hybrid be used?

When queries combine semantic paraphrase and exact token constraints. Chapter: §2, §3.


References

Primary

  • Cormack, G. V., Clarke, C. L. A., Buettcher, S. Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods. SIGIR 2009.
  • Karpukhin, V. et al. Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020. arXiv:2004.04906.
  • Robertson, S., Zaragoza, H. The Probabilistic Relevance Framework: BM25 and Beyond. 2009.
  • Chen, X. et al. Hybrid Retrieval-Augmented Generation for Real-World Applications. Survey-style industry writeups are useful here, but the core logic is already in DPR + BM25 + RRF.

Official docs

  • Elasticsearch hybrid search docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/semantic-search.html
  • OpenSearch hybrid search docs: https://opensearch.org/docs/latest/search-plugins/hybrid-search/
  • Qdrant hybrid search docs: https://qdrant.tech/documentation/

Supporting

  • Author note Chapter 11 — Hybrid Search.
  • Author note Chapter 35 §6 — retriever routing and filter-first retrieval.

Cheat Sheet

Knob | Default / Recommendation
Dense candidate depth | 50
BM25 candidate depth | 50
RRF constant | 60
Weighted fusion | only after score normalisation
Alpha tuning | on a fixed evaluation set only
Filters | pre-filter before fusion
With reranker | fuse to 20-80 candidates, then rerank

One-liner: Hybrid = Dense for meaning, BM25 for exactness, RRF for stable fusion.


Bridge — What's Next

Next — RAG Core Study (13/26) — Reranker: The Role of Cross-encoders.

Hybrid fixes first-stage recall, but the final top order is still noisy. The next layer is the reranker: a slower but far more precise cross-encoder that re-sorts the fused candidate pool.
