"RAG Core Study (13/26) — Reranker: The Role of Cross-encoders"

Retrievers are built for recall. Rerankers are built for the final order.

The first-stage retriever usually places the right answer somewhere in the top 50, but not necessarily at rank 1. A reranker fixes that. Instead of embedding query and document separately, it reads them together and scores the pair directly. Part 13 explains why cross-encoders are slower but more precise, how top-N -> top-K reranking works, and when tools like BGE-reranker, Cohere Rerank, or Jina rerankers are worth the added latency.


0. Prerequisites

  • Part 12 Hybrid Search — candidate pools and first-stage recall.
  • Part 10 Dense Retrieval — bi-encoder assumptions.
  • Part 11 BM25 — exact-match candidate generation.

1. Learning Objectives

  1. Explain the structural difference between a bi-encoder and a cross-encoder.
  2. Decide how many candidates should be reranked.
  3. Implement a simple reranking step after Hybrid retrieval.
  4. Diagnose the main latency and truncation risks.

2. Key Summary

A reranker scores the full pair \((q, d)\), not separate embeddings. That gives it two advantages:

  1. it can model token-level interactions,
  2. it can understand negation, ordering, and exact phrasing better.

The trade-off is speed. Query-time cost is roughly:

first-stage retrieval once + reranker over N candidates

So rerankers are used only on a small candidate pool, typically top-20 to top-100. The standard pipeline is:

BM25/Dense/Hybrid -> top-N candidates -> reranker -> final top-K

If first-stage retrieval gives recall, rerankers give precision at the top.
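
The pipeline above can be sketched as one small function. This is a minimal illustration, not a production implementation: `rerank_cascade`, `toy_retrieve`, and `toy_score` are hypothetical names, and the word-overlap scorer only stands in for a real cross-encoder.

```python
from typing import Callable, List, Tuple


def rerank_cascade(
    query: str,
    retrieve: Callable[[str, int], List[str]],
    score_pair: Callable[[str, str], float],
    n: int = 30,
    k: int = 5,
) -> List[Tuple[str, float]]:
    """Fetch top-n candidates, score each (query, doc) pair, keep the best k."""
    candidates = retrieve(query, n)  # recall stage
    scored = [(doc, score_pair(query, doc)) for doc in candidates]  # precision stage
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy stand-ins so the sketch runs without a model or an index.
CORPUS = [
    "exceptions and appeal paths",
    "director approval for external disclosures",
    "exceptions require director approval",
]


def toy_retrieve(query: str, n: int) -> List[str]:
    return CORPUS[:n]  # pretend this is Hybrid search output


def toy_score(query: str, doc: str) -> float:
    # Word overlap is only a placeholder for a cross-encoder score.
    q_words = set(query.lower().split())
    return float(len(q_words & set(doc.lower().split())))


top = rerank_cascade(
    "exceptions without director approval", toy_retrieve, toy_score, n=3, k=1
)
print(top)  # [('exceptions require director approval', 3.0)]
```

Note that the output is always a reordered subset of what `retrieve` returned. The second stage never adds documents.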


3. Intuition — Why the Right Chunk Is Often Not Rank 1 Yet

Query: "Does policy 3.2 allow exceptions without director approval?"

First-stage Hybrid may return:

  1. a chunk about exceptions,
  2. a chunk about director approval,
  3. the exact chunk stating exceptions require director approval.

All three overlap with the query in both terms and meaning, so first-stage scores barely separate them. The reranker reads the full pair and can see that candidate 3 answers the actual question best.

The reranker does not find new candidates. It makes better use of the candidates already found.


4. Definitions — Reranking Terms

| Term | Definition |
|---|---|
| Cross-encoder | Encodes query and document jointly and predicts one relevance score |
| Reranker | A model or service that reorders first-stage candidates |
| Candidate depth \(N\) | Number of documents sent into reranking |
| Final top-K | Number of documents kept after reranking |
| Pointwise reranking | Scores each \((q, d)\) pair independently |
| Cascade | Multi-stage retrieval pipeline: recall stage -> precision stage |
| Truncation | Long candidates may be cut to fit the reranker's token limit |

The key split is:

  • Retriever: fast, approximate, scalable.
  • Reranker: slow, precise, limited to small \(N\).

5. Math — Why Cross-encoders Cost More

5.1 Retriever score

A bi-encoder computes:

$$s_{retr}(q, d) = E_q(q) \cdot E_d(d)$$

Document embeddings are precomputed, so query-time cost is low.
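
A minimal sketch of why this is cheap at query time: the document vectors below are computed once and stored, so each query needs one encoding plus dot products. The three-dimensional vectors and the ids are made-up values for illustration, and `retrieve` is a hypothetical helper, not a library call.

```python
from typing import Dict, List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


# Hypothetical 3-dimensional embeddings, computed once at indexing time.
doc_index: Dict[str, List[float]] = {
    "c001": [0.9, 0.1, 0.0],
    "c002": [0.2, 0.8, 0.1],
    "c003": [0.7, 0.6, 0.2],
}


def retrieve(query_vec: List[float], top_n: int) -> List[str]:
    """Rank documents by E_q(q) . E_d(d); only the query is encoded at query time."""
    ranked = sorted(
        doc_index,
        key=lambda doc_id: dot(query_vec, doc_index[doc_id]),
        reverse=True,
    )
    return ranked[:top_n]


query_vec = [0.6, 0.7, 0.1]  # stands in for E_q(q)
print(retrieve(query_vec, top_n=2))  # ['c003', 'c002']
```

A real system replaces the linear scan with an approximate nearest-neighbor index, but the scoring function stays the same.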

5.2 Reranker score

A cross-encoder computes:

$$s_{rerank}(q, d) = f_{\theta}([CLS]\ q\ [SEP]\ d)$$

The full query-document pair must pass through the model for every candidate.

5.3 Cost model

If first-stage retrieval returns \(N\) candidates:

$$L_{total} \approx L_{retrieval} + N \cdot L_{pair}$$

With batching:

$$L_{total} \approx L_{retrieval} + \left\lceil \frac{N}{B} \right\rceil \cdot L_{batch}$$

So the practical problem is not whether rerankers help. It is how deep you can rerank under your latency budget.
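
The batched formula can be turned into a quick budget check. The sketch below assumes batches run one after another and that \(L_{batch}\) stays constant even for a partly filled batch. The 40 ms and 60 ms figures are hypothetical, so measure your own before relying on the result.

```python
import math


def rerank_latency_ms(n: int, retrieval_ms: float, batch_ms: float, batch_size: int) -> float:
    """L_total ~ L_retrieval + ceil(N / B) * L_batch."""
    return retrieval_ms + math.ceil(n / batch_size) * batch_ms


def max_depth(budget_ms: float, retrieval_ms: float, batch_ms: float, batch_size: int) -> int:
    """Largest N whose estimated latency still fits the budget."""
    batches = int((budget_ms - retrieval_ms) // batch_ms)
    return max(batches, 0) * batch_size


# Hypothetical numbers: 40 ms retrieval, 60 ms per batch of 16 pairs.
for n in (20, 50, 100):
    # 20 -> 160 ms, 50 -> 280 ms, 100 -> 460 ms
    print(n, rerank_latency_ms(n, retrieval_ms=40, batch_ms=60, batch_size=16))

# With a 300 ms budget, four batches fit, so depth tops out at 64.
print(max_depth(budget_ms=300, retrieval_ms=40, batch_ms=60, batch_size=16))
```

Because of the ceiling, latency moves in steps: under these numbers, reranking 49 candidates costs the same as reranking 64.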


6. Walkthrough — Hybrid -> Reranker

6.1 Local cross-encoder with sentence-transformers

from sentence_transformers import CrossEncoder

# Downloads the model weights on first use.
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

query = "Does policy 3.2 allow exceptions without director approval?"
candidates = [
    "Policy 3.2 defines exceptions and appeal paths.",
    "Director approval is required for all external disclosures.",
    "Under policy 3.2, exceptions require director approval before filing.",
]

# One (query, document) pair per candidate; each pair is scored jointly.
pairs = [[query, doc] for doc in candidates]
scores = reranker.predict(pairs)

# Sort by score, highest first.
ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
for doc, score in ranked:
    print(f"{score:.3f} | {doc}")

The third candidate should rise to the top even if it was lower in the retrieval stage.

6.2 Service-based reranking

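import os

import cohere

# Reuses `query` and `candidates` from 6.1; the API key is read from the environment.
co = cohere.ClientV2(api_key=os.environ["CO_API_KEY"])

# Give every candidate its own id so results can be mapped back.
docs = [{"id": f"c{i:03d}", "text": t} for i, t in enumerate(candidates, start=1)]

response = co.rerank(
    model="rerank-v3.5",
    query=query,
    documents=[d["text"] for d in docs],
    top_n=3,
)

# Each result points back to its input position through `index`.
for result in response.results:
    print(f"{result.relevance_score:.3f} | {docs[result.index]['id']}")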

Managed rerankers reduce operational burden but add per-query cost and vendor dependency.

6.3 Hybrid + reranker pipeline

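# `es`, `vectordb`, `chunk_store`, and `rrf` are assumed to be set up already;
# `reranker` is the CrossEncoder from 6.1.
bm25_hits = es.search(...)
dense_hits = vectordb.similarity_search_with_score(query, k=50)

# RRF needs ranked ID lists, best first. The "_id" and "chunk_id" fields
# are assumptions about how the chunks were indexed.
bm25_rank = [hit["_id"] for hit in bm25_hits["hits"]["hits"]]
dense_rank = [doc.metadata["chunk_id"] for doc, _ in dense_hits]
fused = rrf([bm25_rank, dense_rank], k=60)  # [(doc_id, fused_score), ...]

# Only the top 30 fused candidates reach the slow model.
candidate_ids = [doc_id for doc_id, _ in fused[:30]]
candidate_docs = [chunk_store[doc_id] for doc_id in candidate_ids]

pairs = [[query, doc["chunk_text"]] for doc in candidate_docs]
scores = reranker.predict(pairs)

# Keep the five best candidates by reranker score.
final = sorted(
    zip(candidate_ids, candidate_docs, scores),
    key=lambda x: x[2],
    reverse=True,
)[:5]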

This is the standard fast recall -> slow precision cascade.
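
The `rrf` helper used in the pipeline is not defined in this part. One plausible shape for it, assuming standard reciprocal rank fusion over ranked ID lists, is sketched below. The return format (a list of `(doc_id, score)` pairs, best first) is inferred from how `fused` is consumed above, and the chunk ids are made up.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def rrf(rankings: List[List[str]], k: int = 60) -> List[Tuple[str, float]]:
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank(d))."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical ranked ID lists from the two first-stage retrievers.
bm25_rank = ["c003", "c001", "c007"]
dense_rank = ["c002", "c003", "c001"]

fused = rrf([bm25_rank, dense_rank], k=60)
print([doc_id for doc_id, _ in fused])  # ['c003', 'c001', 'c002', 'c007']
```

Documents that appear in both lists rise above documents that rank first in only one, which is the behavior the reranker stage relies on for a broad candidate pool.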

6.4 Practical depth choices

no reranker:            Dense/BM25/Hybrid -> top-5
small latency budget:   Hybrid top-20  -> rerank -> top-5
quality-first setup:    Hybrid top-50  -> rerank -> top-5
analysis / offline:     Hybrid top-100 -> rerank -> top-10

If rerank depth is too shallow, the best answer may never enter the reranker.
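
One way to pick the depth is to measure how often the gold chunk is inside the first N candidates on a labeled query set. The sketch below assumes one gold chunk per query; the query ids, chunk ids, and the `recall_at_n` helper are hypothetical.

```python
from typing import Dict, List


def recall_at_n(results: Dict[str, List[str]], gold: Dict[str, str], n: int) -> float:
    """Share of queries whose gold chunk appears in the first n candidates."""
    hits = sum(1 for qid, ranked in results.items() if gold[qid] in ranked[:n])
    return hits / len(results)


# Hypothetical first-stage output: query id -> ranked chunk ids.
results = {
    "q1": ["c9", "c4", "c1", "c7"],
    "q2": ["c2", "c8", "c5", "c3"],
    "q3": ["c6", "c0", "c5", "c4"],
    "q4": ["c1", "c3", "c8", "c2"],
}
gold = {"q1": "c1", "q2": "c2", "q3": "c4", "q4": "c5"}

for n in (1, 2, 4):
    # 1 -> 0.25, 2 -> 0.25, 4 -> 0.75
    print(n, recall_at_n(results, gold, n))
```

In this toy set, going from depth 2 to depth 4 triples the share of queries the reranker can fix. The remaining query (`q4`) is a first-stage recall problem that no rerank depth within these lists can solve.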


7. Variants

7.1 Local BGE rerankers

  • What changes: run the cross-encoder on your own hardware.
  • Why use it: low marginal cost and on-prem control.
  • What becomes possible: private-data reranking without external APIs.
  • Where it fits: internal RAG or privacy-sensitive corpora.
  • Limits: GPU/CPU latency and deployment burden.

7.2 Managed reranking APIs

  • What changes: outsource inference to a hosted provider.
  • Why use it: fastest path to strong quality.
  • What becomes possible: high-quality reranking without model ops.
  • Where it fits: small teams or fast-moving products.
  • Limits: cost, rate limits, external dependency.

7.3 Multilingual rerankers

  • What changes: train or select models with stronger cross-lingual coverage.
  • Why use it: English-focused rerankers often underperform on Korean or mixed corpora.
  • What becomes possible: better multilingual top-3 accuracy.
  • Where it fits: global knowledge bases.
  • Limits: quality still depends on language coverage in training.

7.4 Selective reranking

  • What changes: only rerank when the first-stage confidence is low.
  • Why use it: save latency and cost on obvious queries.
  • What becomes possible: better quality-cost trade-offs.
  • Where it fits: production systems with large traffic.
  • Limits: needs a reliable confidence signal. Part 22 revisits this.
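
A minimal sketch of a selective gate, assuming the confidence signal is the score margin between the first-stage top-1 and top-2. Both the signal and the `min_margin` value are assumptions for illustration and need tuning against your own score scale (fused RRF scores, for example, differ by far less than 0.1).

```python
from typing import Callable, List, Tuple


def maybe_rerank(
    query: str,
    hits: List[Tuple[str, float]],
    score_pair: Callable[[str, str], float],
    min_margin: float = 0.1,
) -> Tuple[List[str], bool]:
    """Rerank only when the first-stage top-1 does not clearly beat top-2.

    `hits` must be sorted best first. Returns (ordered docs, reranked flag).
    """
    docs = [doc for doc, _ in hits]
    if len(hits) < 2 or hits[0][1] - hits[1][1] >= min_margin:
        return docs, False  # confident enough: keep first-stage order
    reranked = sorted(docs, key=lambda doc: score_pair(query, doc), reverse=True)
    return reranked, True


def toy_score(query: str, doc: str) -> float:
    return float(len(doc))  # placeholder for a cross-encoder call


clear = [("a", 0.9), ("bb", 0.5)]
close = [("a", 0.52), ("bbb", 0.50), ("cc", 0.40)]

print(maybe_rerank("q", clear, toy_score))  # (['a', 'bb'], False)
print(maybe_rerank("q", close, toy_score))  # (['bbb', 'cc', 'a'], True)
```

Logging the returned flag makes it easy to check how often the gate fires and whether skipped queries lose quality.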

7.5 Listwise or generative reranking

  • What changes: rank candidates jointly or generate the best selection.
  • Why use it: may exploit interactions across candidates.
  • What becomes possible: better ordering in niche tasks.
  • Where it fits: advanced retrieval research.
  • Limits: harder to operate and usually slower than pointwise rerankers.

8. Limits and Failure Modes

8.1 Reranking cannot recover missing recall

  • Why intrinsic: it only sees candidates already retrieved.
  • Diagnosis: the gold chunk is absent from the rerank input set.
  • Mitigation: improve first-stage retrieval depth or quality.
  • Related parts: Parts 12, 15.

8.2 Long-document truncation

  • Why intrinsic: query + document pair must fit the reranker's max input.
  • Diagnosis: relevant evidence appears in the second half of long chunks and is ignored.
  • Mitigation: use shorter chunks or passage windows before reranking.
  • Related part: Part 5 chunking.
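
The passage-window mitigation can be sketched as follows: split a long chunk into overlapping windows, score each window against the query, and keep the best score. Window sizes are counted in words here as a rough proxy for tokens, which is an assumption; `windows`, `score_long_chunk`, and the truncating toy scorer are hypothetical names.

```python
from typing import Callable, List


def windows(text: str, size: int = 200, stride: int = 100) -> List[str]:
    """Split text into overlapping word windows of at most `size` words."""
    words = text.split()
    if len(words) <= size:
        return [" ".join(words)]
    out = []
    for start in range(0, len(words), stride):
        out.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return out


def score_long_chunk(
    query: str,
    chunk: str,
    score_pair: Callable[[str, str], float],
    size: int = 200,
    stride: int = 100,
) -> float:
    """Score each window separately and keep the best one."""
    return max(score_pair(query, w) for w in windows(chunk, size, stride))


def truncating_score(query: str, doc: str) -> float:
    # Toy scorer that only "reads" the first 4 words, like a tight token limit.
    return 1.0 if "w9" in doc.split()[:4] else 0.0


chunk = " ".join(f"w{i}" for i in range(10))  # the evidence "w9" sits at the end

print(truncating_score("q", chunk))  # 0.0: evidence is cut off
print(score_long_chunk("q", chunk, truncating_score, size=4, stride=2))  # 1.0
```

The cost is more pairs per candidate, so windowing interacts directly with the latency budget from §5.3.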

8.3 Latency grows with candidate depth

  • Why intrinsic: scoring is per pair.
  • Diagnosis: p95 latency rises roughly linearly as rerank depth grows.
  • Mitigation: batch, distill, or rerank selectively.
  • Later part: Parts 16, 21.

8.4 Domain mismatch

  • Why intrinsic: generic rerankers may not understand domain terms or document style.
  • Diagnosis: top reorderings look reasonable overall but are systematically wrong in one domain slice.
  • Mitigation: evaluate by slice; fine-tune only if the gains justify it.
  • Later part: Parts 14, 16.

8.5 Score calibration confusion

  • Why intrinsic: reranker scores are not always comparable across queries or models.
  • Diagnosis: teams try to use one absolute cutoff everywhere.
  • Mitigation: rely on rank order first, absolute thresholds second.
  • Later part: Part 22 confidence handling.
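
"Rank order first, absolute thresholds second" can be sketched as a selection helper that always cuts by rank and only applies a score floor when one is explicitly given. `select_final` is a hypothetical helper, and falling back to the single best candidate when nothing clears the floor is a design assumption; a system that prefers to abstain would return an empty list instead.

```python
from typing import List, Optional, Tuple


def select_final(
    scored: List[Tuple[str, float]],
    k: int = 5,
    min_score: Optional[float] = None,
) -> List[Tuple[str, float]]:
    """Keep the top-k by rank; apply an absolute floor only if one is given."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
    if min_score is None:
        return ranked
    kept = [pair for pair in ranked if pair[1] >= min_score]
    # Assumption: never return nothing just because this query's scores run low.
    return kept or ranked[:1]


scored = [("a", 0.2), ("b", 0.9), ("c", 0.5)]

print(select_final(scored, k=2))                  # [('b', 0.9), ('c', 0.5)]
print(select_final(scored, k=2, min_score=0.6))   # [('b', 0.9)]
print(select_final(scored, k=2, min_score=0.95))  # [('b', 0.9)]
```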

8.6 Common Pitfalls

  • "Rerank top-5 only." Too shallow to help much.
  • "Rerankers replace retrievers." They cannot find candidates that were never retrieved.
  • "A stronger retriever makes reranking unnecessary." First-stage and second-stage ranking solve different problems.
  • "One reranker score threshold works for every query." Score distributions shift by query and corpus.
  • "Longer chunks are safer for reranking." They often get truncated instead.

9. Settled Conclusions

Q1. In one sentence, what is a reranker?

A model that reads query and candidate jointly and reorders first-stage retrieval results by query-document relevance. Chapter: §4, §5.2.

Q2. Why is a cross-encoder more accurate than a bi-encoder?

Because it can model direct token interactions between query and document instead of comparing two separately compressed vectors. Chapter: §5.1, §5.2.

Q3. Why is reranking expensive?

Because the full model runs once per query-document pair, so latency scales with candidate depth. Chapter: §5.3, §8.3.

Q4. What is the main design rule for rerank depth?

Deep enough to preserve recall, shallow enough to fit the latency budget. Chapter: §6.4.

Q5. When does reranking help most?

When first-stage retrieval already contains the right answer but does not rank it near the top. Chapter: §3, §8.1.


References

Primary

  • Nogueira, R., Cho, K. Passage Re-ranking with BERT. arXiv:1901.04085.
  • Nogueira, R. et al. Document Ranking with a Pretrained Sequence-to-Sequence Model. 2020.
  • Karpukhin, V. et al. Dense Passage Retrieval. EMNLP 2020.
  • Khattab, O., Zaharia, M. ColBERT. SIGIR 2020.

Official docs

  • Sentence-Transformers CrossEncoder docs: https://www.sbert.net/docs/package_reference/cross_encoder/
  • Cohere Rerank docs: https://docs.cohere.com/
  • Jina AI docs: https://jina.ai/

Supporting

  • Author note Chapter 12 — Reranker.
  • Author note Chapter 35 §6 — retriever routing and filter-first design.

Cheat Sheet

| Knob | Default / Recommendation |
|---|---|
| Candidate depth | 20-50 |
| Final top-K | 3-10 |
| First stage | Hybrid preferred |
| Batch reranking | yes, if local |
| Long chunks | avoid or window them |
| Score use | trust rank order first |
| Selective rerank | use after confidence signals mature |

One-liner: retrievers find candidates, rerankers fix the final order.


Bridge — What's Next

Next — RAG Core Study (14/26) — Evaluation Sets with RAGAS & DeepEval.

Once reranking enters the system, intuition is no longer enough. The next question is how to measure whether Hybrid, rerankers, or prompt changes actually improved RAG.
