"RAG Core Study (13/26) — Reranker: The Role of Cross-encoders"

May 18, 2026

Retrievers are built for recall. Rerankers are built for the final order.

The first-stage retriever usually places the right answer somewhere in the top 50, but not necessarily at rank 1. A reranker fixes that. Instead of embedding query and document separately, it reads them together and scores the pair directly. Part 13 explains why cross-encoders are slower but more precise, how top-N -> top-K reranking works, and when tools like BGE-reranker, Cohere Rerank, or Jina rerankers are worth the added latency.

0. Prerequisites

Part 12 Hybrid Search — candidate pools and first-stage recall.
Part 10 Dense Retrieval — bi-encoder assumptions.
Part 11 BM25 — exact-match candidate generation.

1. Learning Objectives

Explain the structural difference between a bi-encoder and a cross-encoder.
Decide how many candidates should be reranked.
Implement a simple reranking step after Hybrid retrieval.
Diagnose the main latency and truncation risks.

2. Key Summary

A reranker scores the full pair $(q, d)$, not separate embeddings. That gives it two advantages:

it can model token-level interactions,
it can understand negation, ordering, and exact phrasing better.

The trade-off is speed. Query-time cost is roughly:

one first-stage retrieval + one reranker forward pass for each of N candidates

So rerankers are used only on a small candidate pool, typically top-20 to top-100. The standard pipeline is:

BM25/Dense/Hybrid -> top-N candidates -> reranker -> final top-K

If first-stage retrieval gives recall, rerankers give precision at the top.

3. Intuition — Why the Right Chunk Is Often Not Rank 1 Yet

Query: "Does policy 3.2 allow exceptions without director approval?"

First-stage Hybrid may return:

a chunk about exceptions,
a chunk about director approval,
the exact chunk stating exceptions require director approval.

All three share terms and semantics. The reranker reads the full pair and can see that candidate 3 answers the actual question best.

The reranker does not find new candidates. It makes better use of the candidates already found.

4. Definitions — Reranking Terms

Term	Definition
Cross-encoder	A model that encodes query and document jointly and predicts one relevance score
Reranker	A model or service that reorders first-stage candidates
Candidate depth $N$	Number of documents sent into reranking
Final top-K	Number of documents kept after reranking
Pointwise reranking	Scoring each $(q, d)$ pair independently of the other candidates
Cascade	Multi-stage retrieval pipeline: recall stage -> precision stage
Truncation	Cutting a long candidate so that query + document fit the reranker's token limit

The key split is:

Retriever: fast, approximate, scalable.
Reranker: slow, precise, limited to small $N$.

5. Math — Why Cross-encoders Cost More

5.1 Retriever score

A bi-encoder computes:

$$s_{retr}(q, d) = E_q(q) \cdot E_d(d)$$

Document embeddings are precomputed, so query-time cost is low.
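The split between offline and online work can be sketched as follows. This is a toy illustration only: `toy_embed` is a hypothetical bag-of-words stand-in for $E_q$ and $E_d$, not a real encoder, and the two-document corpus is made up.

```python
import math

corpus = [
    "exceptions require director approval",
    "quarterly budget planning calendar",
]

# Toy vocabulary built from the corpus; a real encoder is a neural network.
vocab = {tok: i for i, tok in enumerate(sorted({t for d in corpus for t in d.split()}))}

def toy_embed(text):
    """Stand-in for E_q / E_d: an L2-normalized bag-of-words vector."""
    vec = [0.0] * len(vocab)
    for tok in text.lower().split():
        if tok in vocab:
            vec[vocab[tok]] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Offline: embed every document once and store the vectors.
doc_vectors = [toy_embed(d) for d in corpus]

def retrieve(query, k=2):
    """Online: one query embedding, then one dot product per stored vector."""
    q = toy_embed(query)
    scored = [(i, dot(q, dv)) for i, dv in enumerate(doc_vectors)]
    return sorted(scored, key=lambda x: -x[1])[:k]

print(retrieve("director approval exceptions"))
```

The document side never runs at query time, which is what lets this stage scale to large corpora.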

5.2 Reranker score

A cross-encoder computes:

$$s_{rerank}(q, d) = f_{\theta}([CLS]\ q\ [SEP]\ d)$$

The full query-document pair must pass through the model for every candidate.

5.3 Cost model

If first-stage retrieval returns $N$ candidates:

$$L_{total} \approx L_{retrieval} + N \cdot L_{pair}$$

With batching:

$$L_{total} \approx L_{retrieval} + \left\lceil \frac{N}{B} \right\rceil \cdot L_{batch}$$

So the practical problem is not whether rerankers help. It is how deep you can rerank under your latency budget.
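The batched cost model can be turned into a small planning helper. The numbers used here (40 ms retrieval, 60 ms per batch, batch size 16) are illustrative assumptions, not measurements; substitute values profiled on your own hardware.

```python
import math

def rerank_latency_ms(n_candidates, retrieval_ms, batch_ms, batch_size):
    """L_total ~ L_retrieval + ceil(N / B) * L_batch."""
    n_batches = math.ceil(n_candidates / batch_size)
    return retrieval_ms + n_batches * batch_ms

def max_depth_within_budget(budget_ms, retrieval_ms, batch_ms, batch_size):
    """Largest N whose estimated latency stays within budget_ms."""
    remaining = budget_ms - retrieval_ms
    if remaining < batch_ms:
        return 0  # not even one batch fits
    return int(remaining // batch_ms) * batch_size

# Assumed, illustrative timings.
print(rerank_latency_ms(50, retrieval_ms=40, batch_ms=60, batch_size=16))         # 4 batches
print(max_depth_within_budget(200, retrieval_ms=40, batch_ms=60, batch_size=16))  # 2 batches fit
```

Under these assumptions, reranking 50 candidates costs about 280 ms, while a 200 ms budget allows a depth of 32.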

6. Walkthrough — Hybrid -> Reranker

6.1 Local cross-encoder with `sentence-transformers`

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

query = "Does policy 3.2 allow exceptions without director approval?"
candidates = [
    "Policy 3.2 defines exceptions and appeal paths.",
    "Director approval is required for all external disclosures.",
    "Under policy 3.2, exceptions require director approval before filing.",
]

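# One (query, document) pair per candidate; each pair is a full forward pass.
pairs = [[query, doc] for doc in candidates]
scores = reranker.predict(pairs)

# Higher score means more relevant, so sort in descending order.
ranked = sorted(zip(candidates, scores), key=lambda x: -x[1])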
for doc, score in ranked:
    print(f"{score:.3f} | {doc}")

The third candidate should rise to the top even if it was lower in the retrieval stage.

6.2 Service-based reranking

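import cohere

# Assumes an API key is already configured in the environment.
co = cohere.ClientV2()

# Give each candidate its own id so results can be mapped back.
docs = [{"id": f"c{i:03d}", "text": t} for i, t in enumerate(candidates, start=1)]

response = co.rerank(
    model="rerank-v3.5",
    query=query,
    documents=[d["text"] for d in docs],
    top_n=3,
)

# Each result carries the index of the input document and a relevance score.
for r in response.results:
    print(docs[r.index]["id"], f"{r.relevance_score:.3f}")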

Managed rerankers reduce operational burden but add per-query cost and vendor dependency.

6.3 Hybrid + reranker pipeline

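# Assumes `es`, `vectordb`, `rrf`, and `chunk_store` from Part 12 and `reranker` from 6.1.
bm25_hits = es.search(...)
dense_hits = vectordb.similarity_search_with_score(query, k=50)

# rrf expects ranked lists of chunk ids, best first.
# The "id" metadata key is an assumption; use whatever key your store exposes.
bm25_rank = [hit["_id"] for hit in bm25_hits["hits"]["hits"]]
dense_rank = [doc.metadata["id"] for doc, _ in dense_hits]
fused = rrf([bm25_rank, dense_rank], k=60)

# Send only the top 30 fused candidates to the reranker.
candidate_ids = [doc_id for doc_id, _ in fused[:30]]
candidate_docs = [chunk_store[doc_id] for doc_id in candidate_ids]

pairs = [[query, doc["chunk_text"]] for doc in candidate_docs]
scores = reranker.predict(pairs)

# Keep the 5 best candidates by reranker score.
final = sorted(
    zip(candidate_ids, candidate_docs, scores),
    key=lambda x: -x[2]
)[:5]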

This is the standard fast recall -> slow precision cascade.

6.4 Practical depth choices

no reranker:            Dense/BM25/Hybrid -> top-5
small latency budget:   Hybrid top-20  -> rerank -> top-5
quality-first setup:    Hybrid top-50  -> rerank -> top-5
analysis / offline:     Hybrid top-100 -> rerank -> top-10

If rerank depth is too shallow, the best answer may never enter the reranker.
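One way to choose a depth is to measure how often the gold chunk falls inside the first N first-stage candidates on a labeled set. A minimal sketch, assuming each labeled query is stored as a (ranked ids, gold id) pair; the ids and the tiny dataset are made up for illustration.

```python
def recall_at_depth(runs, depth):
    """Fraction of queries whose gold chunk appears in the first `depth` candidates."""
    hits = sum(1 for ranked_ids, gold_id in runs if gold_id in ranked_ids[:depth])
    return hits / len(runs)

# Hypothetical first-stage results: (ranked candidate ids, gold chunk id).
runs = [
    (["c7", "c2", "c9", "c4"], "c7"),  # gold at rank 1
    (["c1", "c3", "c8", "c5"], "c8"),  # gold at rank 3
    (["c6", "c2", "c1", "c9"], "c9"),  # gold at rank 4
    (["c4", "c5", "c6", "c7"], "c0"),  # gold never retrieved
]

for depth in (1, 3, 4):
    print(depth, recall_at_depth(runs, depth))
```

Recall at depth N is the ceiling on what the reranker can deliver, so the curve usually shows where extra depth stops paying for its latency.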

7. Variants

7.1 Local BGE rerankers

What changes: run the cross-encoder on your own hardware.
Why use it: low marginal cost and on-prem control.
What becomes possible: private-data reranking without external APIs.
Where it fits: internal RAG or privacy-sensitive corpora.
Limits: GPU/CPU latency and deployment burden.

7.2 Managed reranking APIs

What changes: outsource inference to a hosted provider.
Why use it: fastest path to strong quality.
What becomes possible: high-quality reranking without model ops.
Where it fits: small teams or fast-moving products.
Limits: cost, rate limits, external dependency.

7.3 Multilingual rerankers

What changes: train or select models with stronger cross-lingual coverage.
Why use it: English-focused rerankers often underperform on Korean or mixed corpora.
What becomes possible: better multilingual top-3 accuracy.
Where it fits: global knowledge bases.
Limits: quality still depends on language coverage in training.

7.4 Selective reranking

What changes: only rerank when the first-stage confidence is low.
Why use it: save latency and cost on obvious queries.
What becomes possible: better quality-cost trade-offs.
Where it fits: production systems with large traffic.
Limits: needs a reliable confidence signal. Part 22 revisits this.
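A simple gate for selective reranking might look like the sketch below. The relative-margin heuristic and the 0.1 threshold are assumptions chosen for illustration, not a recommended confidence signal; any production gate should be validated against labeled queries.

```python
def should_rerank(first_stage_scores, min_margin=0.1):
    """Return True unless top-1 leads top-2 by at least `min_margin` (relative)."""
    if len(first_stage_scores) < 2:
        return False  # nothing to reorder
    top1, top2 = sorted(first_stage_scores, reverse=True)[:2]
    if top1 <= 0:
        return True  # no usable confidence signal, so rerank to be safe
    return (top1 - top2) / top1 < min_margin

print(should_rerank([0.90, 0.50, 0.40]))  # clear winner
print(should_rerank([0.62, 0.61, 0.60]))  # near tie
```

A clear first-stage winner skips the reranker, while a near tie is sent through it.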

7.5 Listwise or generative reranking

What changes: rank candidates jointly or generate the best selection.
Why use it: may exploit interactions across candidates.
What becomes possible: better ordering in niche tasks.
Where it fits: advanced retrieval research.
Limits: harder to operate and usually slower than pointwise rerankers.

8. Limits and Failure Modes

8.1 Reranking cannot recover missing recall

Why intrinsic: it only sees candidates already retrieved.
Diagnosis: the gold chunk is absent from the rerank input set.
Mitigation: improve first-stage retrieval depth or quality.
Later part: Parts 12, 15.
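The diagnosis above can be automated by attributing each failed query to a stage. A minimal sketch; the label names `recall_miss` and `rerank_miss` are hypothetical.

```python
def classify_failure(gold_id, rerank_input_ids, final_ids):
    """Attribute a query outcome to the stage that lost the gold chunk."""
    if gold_id in final_ids:
        return "ok"
    if gold_id in rerank_input_ids:
        return "rerank_miss"  # the reranker saw it but ranked it too low
    return "recall_miss"      # the reranker never saw it

print(classify_failure("c3", ["c1", "c2", "c3"], ["c3", "c1"]))
print(classify_failure("c3", ["c1", "c2", "c3"], ["c1", "c2"]))
print(classify_failure("c9", ["c1", "c2", "c3"], ["c1", "c2"]))
```

Only `recall_miss` cases call for first-stage changes; `rerank_miss` cases point at the reranker or the final top-K.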

8.2 Long-document truncation

Why intrinsic: query + document pair must fit the reranker's max input.
Diagnosis: relevant evidence appears in the second half of long chunks and is ignored.
Mitigation: use shorter chunks or passage windows before reranking.
Later part: Part 5 chunking.
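Passage windowing can be sketched as splitting a long chunk into overlapping windows, scoring each window, and keeping the best score. Assumptions: whitespace tokens stand in for model tokens, `overlap_score` is a toy scorer standing in for a real cross-encoder call, and max-pooling over windows is one aggregation choice among several.

```python
def split_windows(tokens, size, stride):
    """Overlapping windows of at most `size` tokens that cover the whole sequence."""
    if len(tokens) <= size:
        return [tokens]
    windows, start = [], 0
    while True:
        windows.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
        start += stride
    return windows

def score_long_chunk(query, chunk, score_pair, size=200, stride=100):
    """Score every window and keep the best one (max-pooling)."""
    windows = [" ".join(w) for w in split_windows(chunk.split(), size, stride)]
    return max(score_pair(query, w) for w in windows)

def overlap_score(query, passage):
    """Toy scorer: number of query words that also appear in the passage."""
    return len(set(query.lower().split()) & set(passage.lower().split()))

query = "director approval exceptions"
chunk = "filler " * 300 + "exceptions require director approval"

truncated = " ".join(chunk.split()[:200])  # what a hard 200-token limit would keep
print(overlap_score(query, truncated))
print(score_long_chunk(query, chunk, overlap_score))
```

In this example the evidence sits at the end of the chunk, so the truncated score is 0 while the windowed score finds all three query terms.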

8.3 Latency grows with candidate depth

Why intrinsic: scoring is per pair.
Diagnosis: p95 latency rises roughly linearly (in batch-sized steps) as rerank depth grows.
Mitigation: batch, distill, or rerank selectively.
Later part: Parts 16, 21.
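Batching in its simplest form groups pairs so the model runs once per batch instead of once per pair. In this sketch `score_batch` is a hypothetical callable standing in for a real model; with sentence-transformers, `CrossEncoder.predict` already accepts a `batch_size` argument, so a manual loop like this is mainly useful for custom backends.

```python
def score_in_batches(pairs, score_batch, batch_size=16):
    """Call `score_batch` once per group of at most `batch_size` pairs."""
    scores = []
    for start in range(0, len(pairs), batch_size):
        scores.extend(score_batch(pairs[start:start + batch_size]))
    return scores

# Toy batch scorer that records how many pairs each call received.
batch_sizes_seen = []

def toy_score_batch(batch):
    batch_sizes_seen.append(len(batch))
    return [float(len(q) + len(d)) for q, d in batch]

pairs = [("q", "d" * i) for i in range(50)]
scores = score_in_batches(pairs, toy_score_batch, batch_size=16)
print(len(scores), batch_sizes_seen)
```

Fifty pairs at batch size 16 produce four model calls, matching the $\lceil N/B \rceil$ term in the cost model.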

8.4 Domain mismatch

Why intrinsic: generic rerankers may not understand domain terms or document style.
Diagnosis: top reorderings look reasonable but systematically wrong in one domain slice.
Mitigation: evaluate by slice; fine-tune only if the gains justify it.
Later part: Parts 14, 16.
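Evaluating by slice can be as simple as grouping a ranking metric by a domain label. A minimal sketch using mean reciprocal rank (MRR); the slice names and records are made up.

```python
from collections import defaultdict

def mrr_by_slice(records):
    """records: (slice_name, ranked_ids, gold_id) tuples. Returns {slice: MRR}."""
    per_slice = defaultdict(list)
    for slice_name, ranked_ids, gold_id in records:
        # Reciprocal rank is 0 when the gold chunk is missing from the ranking.
        rr = 1.0 / (ranked_ids.index(gold_id) + 1) if gold_id in ranked_ids else 0.0
        per_slice[slice_name].append(rr)
    return {name: sum(vals) / len(vals) for name, vals in per_slice.items()}

# Hypothetical post-rerank results, labeled by domain slice.
records = [
    ("hr",    ["a", "b"],      "a"),  # rank 1
    ("hr",    ["b", "a"],      "a"),  # rank 2
    ("legal", ["x", "y", "z"], "z"),  # rank 3
    ("legal", ["x", "y"],      "q"),  # missing
]
print(mrr_by_slice(records))
```

A healthy aggregate can hide a weak slice; here "hr" averages 0.75 while "legal" averages about 0.17.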

8.5 Score calibration confusion

Why intrinsic: reranker scores are not always comparable across queries or models.
Diagnosis: teams try to use one absolute cutoff everywhere.
Mitigation: rely on rank order first, absolute thresholds second.
Later part: Part 22 confidence handling.
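The "rank order first" rule can be sketched as selecting by rank and treating any score floor as optional. The scores below are invented to show how one global cutoff behaves across two queries.

```python
def select_top_k(scored, k, min_score=None):
    """Keep the top-k by rank; optionally drop items below a score floor."""
    ranked = sorted(scored, key=lambda x: -x[1])[:k]
    if min_score is not None:
        ranked = [(doc_id, s) for doc_id, s in ranked if s >= min_score]
    return ranked

# Hypothetical reranker outputs for two queries with different score ranges.
query_a = [("a1", 0.92), ("a2", 0.81), ("a3", 0.40)]
query_b = [("b1", 0.31), ("b2", 0.12), ("b3", 0.05)]

print(select_top_k(query_a, k=2))
print(select_top_k(query_b, k=2))
print(select_top_k(query_b, k=2, min_score=0.5))  # the global cutoff empties this result
```

If a floor is needed, tuning it per model and corpus is safer than reusing one number everywhere.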

8.6 Common Pitfalls

"Rerank top-5 only." With only five candidates, the reranker can reorder them but cannot surface an answer the first stage ranked lower.
"Rerankers replace retrievers." They cannot find candidates that were never retrieved.
"A stronger retriever makes reranking unnecessary." First-stage and second-stage ranking solve different problems.
"One reranker score threshold works for every query." Score distributions shift by query and corpus.
"Longer chunks are safer for reranking." They often get truncated instead.

9. Settled Conclusions

Q1. In one sentence, what is a reranker?

A model that reads query and candidate jointly and reorders first-stage retrieval results by pairwise relevance. Chapter: §4, §5.2.

Q2. Why is a cross-encoder more accurate than a bi-encoder?

Because it can model direct token interactions between query and document instead of comparing two separately compressed vectors. Chapter: §5.1, §5.2.

Q3. Why is reranking expensive?

Because the full model runs once per query-document pair, so latency scales with candidate depth. Chapter: §5.3, §8.3.

Q4. What is the main design rule for rerank depth?

Deep enough to preserve recall, shallow enough to fit the latency budget. Chapter: §6.4.

Q5. When does reranking help most?

When first-stage retrieval already contains the right answer but does not rank it near the top. Chapter: §3, §8.1.

References

Primary

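Nogueira, R., Cho, K. Passage Re-ranking with BERT. arXiv:1901.04085.
Nogueira, R. et al. Document Ranking with a Pretrained Sequence-to-Sequence Model. Findings of EMNLP 2020.
Karpukhin, V. et al. Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020.
Khattab, O., Zaharia, M. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. SIGIR 2020.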

Official docs

Sentence-Transformers CrossEncoder docs: https://www.sbert.net/docs/package_reference/cross_encoder/
Cohere Rerank docs: https://docs.cohere.com/
Jina AI docs: https://jina.ai/

Supporting

Author note Chapter 12 — Reranker.
Author note Chapter 35 §6 — retriever routing and filter-first design.

Cheat Sheet

Knob	Default / Recommendation
Candidate depth	20-50
Final top-K	3-10
First stage	Hybrid preferred
Batch reranking	yes, if local
Long chunks	avoid or window them
Score use	trust rank order first
Selective rerank	use after confidence signals mature

One-liner: retrievers find candidates, rerankers fix the final order.

Bridge — What's Next

Next — RAG Core Study (14/26) — Evaluation Sets with RAGAS & DeepEval.

Once reranking enters the system, intuition is no longer enough. The next question is how to measure whether Hybrid, rerankers, or prompt changes actually improved RAG.