"RAG Core Study (13/26) — Reranker: The Role of Cross-encoders"
Retrievers are built for recall. Rerankers are built for the final order.
The first-stage retriever usually places the right answer somewhere in the top-50, not necessarily at rank 1. A reranker fixes that. Instead of embedding query and document separately, it reads them together and scores the pair directly. Part 13 explains why cross-encoders are slower but more precise, how top-N -> top-K reranking works, and when tools like BGE-reranker, Cohere Rerank, or Jina rerankers are worth the added latency.
0. Prerequisites
- Part 12 Hybrid Search — candidate pools and first-stage recall.
- Part 10 Dense Retrieval — bi-encoder assumptions.
- Part 11 BM25 — exact-match candidate generation.
1. Learning Objectives
- Explain the structural difference between a bi-encoder and a cross-encoder.
- Decide how many candidates should be reranked.
- Implement a simple reranking step after Hybrid retrieval.
- Diagnose the main latency and truncation risks.
2. Key Summary
A reranker scores the full pair \((q, d)\), not separate embeddings. That gives it two advantages:
- it can model token-level interactions,
- it can understand negation, ordering, and exact phrasing better.
The trade-off is speed. Retrieval cost is roughly:
first-stage retrieval once + reranker over N candidates
So rerankers are used only on a small candidate pool, typically top-20 to top-100. The standard pipeline is:
BM25/Dense/Hybrid -> top-N candidates -> reranker -> final top-K
If first-stage retrieval gives recall, rerankers give precision at the top.
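The top-N -> top-K step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production implementation: `rerank_top_k` and `toy_score` are hypothetical names, and the toy scorer only counts shared words as a stand-in for a real cross-encoder.

```python
from typing import Callable, List, Tuple


def rerank_top_k(
    query: str,
    candidates: List[str],
    score_pair: Callable[[str, str], float],
    top_k: int,
) -> List[Tuple[str, float]]:
    """Score every (query, candidate) pair and keep the top_k best."""
    scored = [(doc, score_pair(query, doc)) for doc in candidates]
    # Highest score first; the sort is stable, so ties keep their first-stage order.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


def toy_score(query: str, doc: str) -> float:
    """Hypothetical stand-in scorer: number of distinct query words found in the document."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return float(len(query_words & doc_words))


# Hypothetical first-stage output (top-N), in retrieval order.
top_n = [
    "refund policy for annual plans",
    "shipping times for international orders",
    "how to request a refund for annual plans",
]
final = rerank_top_k("request refund annual plans", top_n, toy_score, top_k=2)
```

Note that `final` can only contain items from `top_n`: the function reorders and trims, it never adds candidates.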
3. Intuition — Why the Right Chunk Is Often Not Rank 1 Yet
Query: "Does policy 3.2 allow exceptions without director approval?"
First-stage Hybrid may return:
- a chunk about exceptions,
- a chunk about director approval,
- the exact chunk stating exceptions require director approval.
All three share terms and semantics. The reranker reads the full pair and can see that candidate 3 answers the actual question best.
The reranker does not find new candidates. It makes better use of the candidates already found.
4. Definitions — Reranking Terms
| Term | Definition |
|---|---|
| Cross-encoder | A model that encodes query and document jointly and predicts one relevance score |
| Reranker | A model or service that reorders first-stage candidates |
| Candidate depth \(N\) | Number of documents sent into reranking |
| Final top-K | Number of documents kept after reranking |
| Pointwise reranking | Score each \((q, d)\) pair independently |
| Cascade | Multi-stage retrieval pipeline: recall stage -> precision stage |
| Truncation | Long candidates may be cut to fit the reranker's token limit |
The key split is:
- Retriever: fast, approximate, scalable.
- Reranker: slow, precise, limited to small \(N\).
5. Math — Why Cross-encoders Cost More
5.1 Retriever score
A bi-encoder computes:
$$s_{retr}(q, d) = E_q(q) \cdot E_d(d)$$
Document embeddings are precomputed, so query-time cost is low.
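As a rough sketch of why this is cheap at query time, the example below scores a query against stored document vectors with plain dot products. The vectors and ids are made up for illustration; a real system would get them from an embedding model and a vector index.

```python
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical document embeddings, computed offline and stored in an index.
doc_embeddings = {
    "d1": [0.1, 0.9, 0.0],
    "d2": [0.8, 0.1, 0.1],
    "d3": [0.4, 0.4, 0.2],
}

# Hypothetical query embedding: the only encoder call needed at query time.
query_embedding = [0.9, 0.1, 0.0]

# s_retr(q, d) = E_q(q) . E_d(d) for every stored document.
scores = {doc_id: dot(query_embedding, emb) for doc_id, emb in doc_embeddings.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

The document encoder never runs at query time here; only the dot products do.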
5.2 Reranker score
A cross-encoder computes:
$$s_{rerank}(q, d) = f_{\theta}([CLS]\ q\ [SEP]\ d)$$
The full query-document pair must pass through the model for every candidate.
5.3 Cost model
If first-stage retrieval returns \(N\) candidates:
$$L_{total} \approx L_{retrieval} + N \cdot L_{pair}$$
With batching:
$$L_{total} \approx L_{retrieval} + \left\lceil \frac{N}{B} \right\rceil \cdot L_{batch}$$
So the practical problem is not whether rerankers help. It is how deep you can rerank under your latency budget.
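The batched cost model can be turned into a small calculator. This is a sketch that takes the formula at face value (every batch costs the same, full or not); the function names and the millisecond figures are illustrative assumptions, not measurements.

```python
import math


def estimate_latency_ms(retrieval_ms: float, n: int, batch_size: int, batch_ms: float) -> float:
    """L_total ~ L_retrieval + ceil(N / B) * L_batch."""
    num_batches = math.ceil(n / batch_size)
    return retrieval_ms + num_batches * batch_ms


def max_depth_within_budget(
    budget_ms: float, retrieval_ms: float, batch_size: int, batch_ms: float
) -> int:
    """Largest N whose estimated latency fits the budget (0 if not even one batch fits)."""
    remaining = budget_ms - retrieval_ms
    if remaining < batch_ms:
        return 0
    return int(remaining // batch_ms) * batch_size


# Hypothetical numbers: 40 ms retrieval, batches of 16 pairs, 60 ms per batch.
latency_at_50 = estimate_latency_ms(retrieval_ms=40, n=50, batch_size=16, batch_ms=60)
depth_for_200ms = max_depth_within_budget(budget_ms=200, retrieval_ms=40, batch_size=16, batch_ms=60)
```

With these assumed numbers, reranking 50 candidates needs four batches, while a 200 ms budget leaves room for only two.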
6. Walkthrough — Hybrid -> Reranker
6.1 Local cross-encoder with sentence-transformers
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")
query = "Does policy 3.2 allow exceptions without director approval?"
candidates = [
    "Policy 3.2 defines exceptions and appeal paths.",
    "Director approval is required for all external disclosures.",
    "Under policy 3.2, exceptions require director approval before filing.",
]
pairs = [[query, doc] for doc in candidates]
scores = reranker.predict(pairs)
ranked = sorted(zip(candidates, scores), key=lambda x: -x[1])
for doc, score in ranked:
    print(f"{score:.3f} | {doc}")
The third candidate should rise to the top even if it was lower in the retrieval stage.
6.2 Service-based reranking
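import cohere

co = cohere.ClientV2()  # assumes the API key is available in the environment (CO_API_KEY)
# Give each candidate its own id so results can be mapped back to chunks.
docs = [{"id": f"c{i:03d}", "text": t} for i, t in enumerate(candidates, start=1)]
response = co.rerank(
    model="rerank-v3.5",
    query=query,
    documents=[d["text"] for d in docs],
    top_n=3,
)
# Each result carries the index of the input document and its relevance score.
for r in response.results:
    print(f"{r.relevance_score:.3f} | {docs[r.index]['id']}")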
Managed rerankers reduce operational burden but add per-query cost and vendor dependency.
6.3 Hybrid + reranker pipeline
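bm25_hits = es.search(...)
dense_hits = vectordb.similarity_search_with_score(query, k=50)
# Turn each hit list into ranked chunk ids. The field names below are assumptions
# about how ids are stored; adjust them to your index and vector store.
bm25_rank = [hit["_id"] for hit in bm25_hits["hits"]["hits"]]
dense_rank = [doc.metadata["chunk_id"] for doc, _ in dense_hits]
# rrf is assumed to return (doc_id, fused_score) pairs sorted best first.
fused = rrf([bm25_rank, dense_rank], k=60)
# Send only the top 30 fused candidates to the reranker.
candidate_ids = [doc_id for doc_id, _ in fused[:30]]
candidate_docs = [chunk_store[doc_id] for doc_id in candidate_ids]
pairs = [[query, doc["chunk_text"]] for doc in candidate_docs]
scores = reranker.predict(pairs)
# Keep the five highest-scoring chunks.
final = sorted(
    zip(candidate_ids, candidate_docs, scores),
    key=lambda x: -x[2],
)[:5]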
This is the standard fast recall -> slow precision cascade.
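The pipeline calls an `rrf` helper that is not defined in this part. A minimal sketch of what such a helper might look like, assuming the usual Reciprocal Rank Fusion formula with 1-based ranks and a return value of `(doc_id, score)` pairs sorted best first (the chunk ids below are made up):

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def rrf(rankings: Sequence[Sequence[str]], k: int = 60) -> List[Tuple[str, float]]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Best fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical ranked id lists from the two first-stage retrievers.
bm25_rank = ["c3", "c1", "c7"]
dense_rank = ["c1", "c9", "c3"]
fused = rrf([bm25_rank, dense_rank], k=60)
fused_ids = [doc_id for doc_id, _ in fused]
```

Documents that appear in both lists (`c1`, `c3`) collect two contributions and end up ahead of documents found by only one retriever.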
6.4 Practical depth choices
no reranker: Dense/BM25/Hybrid -> top-5
small latency budget: Hybrid top-20 -> rerank -> top-5
quality-first setup: Hybrid top-50 -> rerank -> top-5
analysis / offline: Hybrid top-100 -> rerank -> top-10
If rerank depth is too shallow, the best answer may never enter the reranker.
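One way to check whether a given depth is deep enough is to measure first-stage recall at several values of N on a labeled set. The sketch below assumes one gold chunk per query; the query ids, chunk ids, and the helper name `recall_at_n` are illustrative.

```python
from typing import Dict, List


def recall_at_n(first_stage: Dict[str, List[str]], gold: Dict[str, str], n: int) -> float:
    """Fraction of queries whose gold chunk id appears in the first-stage top-n."""
    hits = sum(1 for qid, ranked in first_stage.items() if gold[qid] in ranked[:n])
    return hits / len(first_stage)


# Hypothetical first-stage rankings (query id -> ranked chunk ids) and gold labels.
first_stage = {
    "q1": ["a", "b", "c", "d"],
    "q2": ["e", "f", "g", "h"],
    "q3": ["i", "j", "k", "l"],
    "q4": ["m", "n", "o", "p"],
}
gold = {"q1": "a", "q2": "g", "q3": "l", "q4": "z"}

recall_by_depth = {n: recall_at_n(first_stage, gold, n) for n in (1, 3, 4)}
```

If recall stops improving beyond some N, reranking deeper than that mostly adds latency. Queries like `q4`, where the gold chunk is never retrieved, cannot be fixed by any rerank depth.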
7. Variants
7.1 Local BGE rerankers
- What changes: run the cross-encoder on your own hardware.
- Why use it: low marginal cost and on-prem control.
- What becomes possible: private-data reranking without external APIs.
- Where it fits: internal RAG or privacy-sensitive corpora.
- Limits: GPU/CPU latency and deployment burden.
7.2 Managed reranking APIs
- What changes: outsource inference to a hosted provider.
- Why use it: fastest path to strong quality.
- What becomes possible: high-quality reranking without model ops.
- Where it fits: small teams or fast-moving products.
- Limits: cost, rate limits, external dependency.
7.3 Multilingual rerankers
- What changes: train or select models with stronger cross-lingual coverage.
- Why use it: English-focused rerankers often underperform on Korean or mixed corpora.
- What becomes possible: better multilingual top-3 accuracy.
- Where it fits: global knowledge bases.
- Limits: quality still depends on language coverage in training.
7.4 Selective reranking
- What changes: only rerank when the first-stage confidence is low.
- Why use it: save latency and cost on obvious queries.
- What becomes possible: better quality-cost trade-offs.
- Where it fits: production systems with large traffic.
- Limits: needs a reliable confidence signal. Part 22 revisits this.
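A sketch of one possible gate for selective reranking (7.4). It uses the score margin between the top two first-stage results as the confidence signal; both that choice and the `min_margin` value are assumptions for illustration, since the right signal depends on the retriever and has to be validated on real traffic.

```python
from typing import List, Tuple


def should_rerank(first_stage: List[Tuple[str, float]], min_margin: float) -> bool:
    """Rerank when the top-1 result does not clearly beat the runner-up.

    first_stage holds (doc_id, score) pairs sorted best first.
    """
    if len(first_stage) < 2:
        return False  # nothing to reorder
    margin = first_stage[0][1] - first_stage[1][1]
    return margin < min_margin


# Hypothetical first-stage outputs for two queries.
obvious = [("c1", 0.92), ("c2", 0.55), ("c3", 0.40)]
ambiguous = [("c4", 0.61), ("c5", 0.60), ("c6", 0.58)]

rerank_obvious = should_rerank(obvious, min_margin=0.1)
rerank_ambiguous = should_rerank(ambiguous, min_margin=0.1)
```

Only the ambiguous query pays the reranking cost in this toy setup.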
7.5 Listwise or generative reranking
- What changes: rank candidates jointly or generate the best selection.
- Why use it: may exploit interactions across candidates.
- What becomes possible: better ordering in niche tasks.
- Where it fits: advanced retrieval research.
- Limits: harder to operate and usually slower than pointwise rerankers.
8. Limits and Failure Modes
8.1 Reranking cannot recover missing recall
- Why intrinsic: it only sees candidates already retrieved.
- Diagnosis: the gold chunk is absent from the rerank input set.
- Mitigation: improve first-stage retrieval depth or quality.
- Related parts: Parts 12, 15.
8.2 Long-document truncation
- Why intrinsic: query + document pair must fit the reranker's max input.
- Diagnosis: relevant evidence appears in the second half of long chunks and is ignored.
- Mitigation: use shorter chunks or passage windows before reranking.
- Related part: Part 5 chunking.
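The passage-window mitigation can be sketched as follows. The helper names are hypothetical, words stand in for tokens (a real reranker counts subword tokens, so actual limits differ), and the scorer is a toy that only checks for one keyword.

```python
from typing import Callable, List


def split_into_windows(text: str, window_size: int, stride: int) -> List[str]:
    """Split text into overlapping windows of whitespace-separated words."""
    words = text.split()
    if len(words) <= window_size:
        return [" ".join(words)]
    windows = []
    for start in range(0, len(words), stride):
        windows.append(" ".join(words[start:start + window_size]))
        if start + window_size >= len(words):
            break  # this window already reaches the end of the text
    return windows


def score_long_chunk(
    query: str,
    chunk: str,
    score_pair: Callable[[str, str], float],
    window_size: int = 200,
    stride: int = 100,
) -> float:
    """Score each window separately and keep the best one (max pooling)."""
    return max(score_pair(query, w) for w in split_into_windows(chunk, window_size, stride))


def mentions_approval(query: str, passage: str) -> float:
    """Toy scorer: 1.0 if the passage contains the word 'approval', else 0.0."""
    return 1.0 if "approval" in passage.split() else 0.0


# The relevant word sits at the very end of a long chunk.
long_chunk = " ".join(["filler"] * 9 + ["approval"])
truncated_score = mentions_approval("q", " ".join(long_chunk.split()[:4]))
windowed_score = score_long_chunk("q", long_chunk, mentions_approval, window_size=4, stride=2)
```

Windowing multiplies the number of pairs to score, so it trades extra latency for coverage of the whole chunk.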
8.3 Latency grows with candidate depth
- Why intrinsic: scoring is per pair.
- Diagnosis: p95 latency rises linearly as rerank depth grows.
- Mitigation: batch, distill, or rerank selectively.
- Later part: Parts 16, 21.
8.4 Domain mismatch
- Why intrinsic: generic rerankers may not understand domain terms or document style.
- Diagnosis: top reorderings look reasonable but are systematically wrong in one domain slice.
- Mitigation: evaluate by slice; fine-tune only if the gains justify it.
- Later part: Parts 14, 16.
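Evaluating by slice might look like the sketch below. Mean reciprocal rank is used here as one possible metric, and the slice names and ids are made up; any per-query metric can be grouped the same way.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def mrr_by_slice(results: List[Tuple[str, List[str], str]]) -> Dict[str, float]:
    """Mean reciprocal rank per slice.

    Each result is (slice_name, reranked_ids, gold_id); a missing gold id contributes 0.
    """
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for slice_name, reranked_ids, gold_id in results:
        counts[slice_name] += 1
        if gold_id in reranked_ids:
            totals[slice_name] += 1.0 / (reranked_ids.index(gold_id) + 1)
    return {name: totals[name] / counts[name] for name in counts}


# Hypothetical reranked outputs, tagged by domain slice.
results = [
    ("hr", ["a", "b"], "a"),
    ("hr", ["c", "d"], "d"),
    ("legal", ["e", "f", "g"], "g"),
    ("legal", ["h", "i"], "z"),
]
per_slice = mrr_by_slice(results)
```

A large gap between slices, as between `hr` and `legal` in this toy data, is the kind of signal that an aggregate score would hide.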
8.5 Score calibration confusion
- Why intrinsic: reranker scores are not always comparable across queries or models.
- Diagnosis: teams try to use one absolute cutoff everywhere.
- Mitigation: rely on rank order first, absolute thresholds second.
- Later part: Part 22 confidence handling.
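A toy comparison of the two selection rules. The scores are invented to show the effect, and the helper names are hypothetical: with per-query score scales this different, a single absolute cutoff keeps everything for one query and nothing for the other, while rank-based selection behaves the same way for both.

```python
from typing import Dict, List, Tuple

# Hypothetical reranker scores for two queries; the scales differ per query.
scores_by_query: Dict[str, List[Tuple[str, float]]] = {
    "q_easy": [("a", 0.97), ("b", 0.91), ("c", 0.88)],
    "q_hard": [("d", 0.31), ("e", 0.12), ("f", 0.05)],
}


def keep_above(scored: List[Tuple[str, float]], cutoff: float) -> List[str]:
    """Absolute threshold: keep every document scoring at or above the cutoff."""
    return [doc_id for doc_id, score in scored if score >= cutoff]


def keep_top_k(scored: List[Tuple[str, float]], k: int) -> List[str]:
    """Rank-based selection: keep the k best documents regardless of score scale."""
    best_first = sorted(scored, key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, _ in best_first[:k]]


by_cutoff = {q: keep_above(s, cutoff=0.5) for q, s in scores_by_query.items()}
by_rank = {q: keep_top_k(s, k=1) for q, s in scores_by_query.items()}
```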
8.6 Common Pitfalls
- "Rerank top-5 only." Too shallow to help much.
- "Rerankers replace retrievers." They cannot find candidates that were never retrieved.
- "A stronger retriever makes reranking unnecessary." First-stage and second-stage ranking solve different problems.
- "One reranker score threshold works for every query." Score distributions shift by query and corpus.
- "Longer chunks are safer for reranking." They often get truncated instead.
9. Settled Conclusions
Q1. In one sentence, what is a reranker?
A model that reads the query and a candidate jointly and reorders first-stage retrieval results by the relevance of each query-document pair. Chapter: §4, §5.2.
Q2. Why is a cross-encoder more accurate than a bi-encoder?
Because it can model direct token interactions between query and document instead of comparing two separately compressed vectors. Chapter: §5.1, §5.2.
Q3. Why is reranking expensive?
Because the full model runs once per query-document pair, so latency scales with candidate depth. Chapter: §5.3, §8.3.
Q4. What is the main design rule for rerank depth?
Deep enough to preserve recall, shallow enough to fit the latency budget. Chapter: §6.4.
Q5. When does reranking help most?
When first-stage retrieval already contains the right answer but does not rank it near the top. Chapter: §3, §8.1.
References
Primary
- Nogueira, R., Cho, K. Passage Re-ranking with BERT. arXiv:1901.04085.
- Nogueira, R. et al. Document Ranking with a Pretrained Sequence-to-Sequence Model. 2020.
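- Karpukhin, V. et al. Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020.
- Khattab, O., Zaharia, M. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. SIGIR 2020.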
Official docs
- Sentence-Transformers CrossEncoder docs: https://www.sbert.net/docs/package_reference/cross_encoder/
- Cohere Rerank docs: https://docs.cohere.com/
- Jina AI docs: https://jina.ai/
Supporting
- Author note Chapter 12 — Reranker.
- Author note Chapter 35 §6 — retriever routing and filter-first design.
Cheat Sheet
| Knob | Default / Recommendation |
|---|---|
| Candidate depth | 20-50 |
| Final top-K | 3-10 |
| First stage | Hybrid preferred |
| Batch reranking | yes, if local |
| Long chunks | avoid or window them |
| Score use | trust rank order first |
| Selective rerank | use after confidence signals mature |
One-liner: retrievers find candidates, rerankers fix the final order.
Bridge — What's Next
Next — RAG Core Study (14/26) — Evaluation Sets with RAGAS & DeepEval.
Once reranking enters the system, intuition is no longer enough. The next question is how to measure whether Hybrid, rerankers, or prompt changes actually improved RAG.