"RAG Core Study (10/26) — Dense Retrieval Deep Dive"
The embedding and vector DB are ready. Now we look at retrieval itself — the principles of Dense Retrieval.
Dense Retrieval finds answers by vector distance, not keyword overlap. It looks simple, but without understanding bi-encoder asymmetry, negative sampling, and the top-K distance distribution you cannot explain why a given answer surfaced. Part 10 starts from DPR (Karpukhin 2020) and works through asymmetric query/doc embedding, what the top-K distances mean, and Dense's hard limits, with formulas and code.
0. Prerequisites
- Part 8 (embeddings) — normalisation, similarity.
- Part 9 (vector DBs) — HNSW top-K search.
- Part 7 (metadata) — combining with pre-filters.
1. Learning Objectives
- Express the bi-encoder structure and query/doc asymmetry as a formula.
- Explain DPR's in-batch negative sampling.
- Diagnose retrieval quality from the top-K distance distribution.
- Name Dense's five hard limits and when to complement with BM25, Hybrid, or a Reranker.
2. Core Summary
Dense Retrieval maps query and document to the same vector space and ranks top-K by similarity. A bi-encoder (query encoder + doc encoder; identical or distinct) splits indexing from query — document vectors are precomputed, query vectors are computed at runtime. DPR (Karpukhin 2020) standardised training with in-batch negatives, pulling true pairs together and pushing others apart. At search time cosine or dot product produces top-K. Dense's strength is semantic similarity; its weakness is rare vocabulary and exact matching — pairing with BM25 (Part 11) via Hybrid (Part 12) is the standard remedy.
3. Intuition — What Dense Catches and BM25 Misses
Query: "basis for currency conversion in the quarterly sales report?"
In the same corpus:
- BM25: strongly matches chunks that literally contain "currency conversion". If the body says "FX rate applied" instead, it misses.
- Dense: ties "currency conversion" and "FX rate applied" together by semantic similarity and pulls both into top-K.
The opposite case: queries hinging on proper nouns or codes like "SKU-2024-04 inventory" — BM25 wins. Dense's embeddings blur on rare tokens.
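To make the BM25 side of this contrast concrete, the sketch below checks plain term overlap, which is the signal a lexical scorer depends on. The two chunk texts and the naive tokenizer are hypothetical, chosen only to mirror the example above; a real BM25 implementation adds term weighting on top of this overlap.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercase word tokens; a deliberately naive tokenizer for illustration."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

# The terms that carry the intent of the example query.
content_terms = tokens("currency conversion")

# Hypothetical chunks: one repeats the wording, one paraphrases it.
chunk_literal = "Currency conversion uses the month-end rate."
chunk_paraphrase = "FX rate applied at month end."

overlap_literal = content_terms & tokens(chunk_literal)
overlap_paraphrase = content_terms & tokens(chunk_paraphrase)

print(sorted(overlap_literal))     # shared terms with the literal chunk
print(sorted(overlap_paraphrase))  # shared terms with the paraphrased chunk
```

With no shared term, a lexical scorer has nothing to score for the paraphrased chunk, whereas a dense model can still place the two phrasings close together.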
4. Definitions — Bi-encoder and Core Terms
| Term | Definition |
|---|---|
| Bi-encoder | Query encoder \(E_q\) and doc encoder \(E_d\) produce vectors independently; similarity is computed afterwards |
| Symmetric | \(E_q = E_d\) (same model). BGE-M3, OpenAI default |
| Asymmetric | \(E_q \ne E_d\) (different weights). DPR, Upstage Solar |
| Top-K | Top K candidates by similarity. Typically K = 5–50 |
| In-batch negative | Use other pairs in the same training batch as negatives |
| Hard negative | Semantically similar but not the answer — yields the strongest training signal |
| Bi-encoder vs Cross-encoder | Bi: precomputable, fast / Cross: scores query+doc jointly, slow. Cross is Part 13 (Reranker) |
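The precompute/runtime split that defines a bi-encoder can be sketched without any model. In the block below, `toy_encode` is a hypothetical stand-in (hashed bag-of-words, so it is lexical, not semantic); only the structure matters: document vectors are built once at indexing time, and a query costs one encode plus dot products.

```python
import hashlib
import math

DIM = 64  # toy dimensionality; an assumption for this sketch

def toy_encode(text: str) -> list[float]:
    """Hypothetical stand-in encoder: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

docs = [
    "The quarterly sales report consolidates in USD after applying FX rates.",
    "Inventory SKU-2024-04: 30 units in the Incheon warehouse.",
    "Exception filings require department-head approval and form SEC-EX-04.",
]

# Indexing time: every document vector is computed once and stored.
doc_index = [toy_encode(d) for d in docs]

def search(query: str, k: int = 2) -> list[tuple[float, int]]:
    """Query time: encode only the query, then score against stored vectors."""
    q = toy_encode(query)
    scored = sorted(((dot(q, v), i) for i, v in enumerate(doc_index)), reverse=True)
    return scored[:k]

hits = search(docs[1], k=2)
```

A cross-encoder cannot be split this way, because it needs the query and the document together in a single forward pass.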
5. Math — DPR and Training Loss
Similarity (assuming normalised vectors):
$$\text{sim}(q, d) = E_q(q) \cdot E_d(d)$$
InfoNCE loss (DPR's standard objective):
$$\mathcal{L} = -\log \frac{\exp(\text{sim}(q, d^+))}{\exp(\text{sim}(q, d^+)) + \sum_{d^- \in N} \exp(\text{sim}(q, d^-))}$$
- \(d^+\) = the correct document
- \(N\) = negative set (in-batch + hard)
- Training pulls the correct similarity up and pushes negatives down.
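As a minimal sketch of the loss above with in-batch negatives: assume a batch of B (query, positive) pairs and a B×B similarity matrix whose diagonal holds the positives, so every off-diagonal entry in a row serves as a negative. The function name and the toy matrices are illustrative, not taken from the DPR codebase.

```python
import math

def info_nce(sim_matrix: list[list[float]]) -> float:
    """Mean InfoNCE loss over a batch.

    sim_matrix[i][j] = sim(q_i, d_j). The diagonal holds the positive pairs;
    every other entry in row i is an in-batch negative for q_i.
    """
    losses = []
    for i, row in enumerate(sim_matrix):
        m = max(row)  # subtract the row max for numerical stability
        log_denominator = m + math.log(sum(math.exp(s - m) for s in row))
        losses.append(log_denominator - row[i])  # -log(exp(pos) / sum(exp(all)))
    return sum(losses) / len(losses)

# Well-separated batch: positives score far above the negatives.
separated = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 5.0]]
# Undifferentiated batch: the encoder cannot tell positives from negatives.
undifferentiated = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]

loss_separated = info_nce(separated)
loss_undifferentiated = info_nce(undifferentiated)  # equals log(3)
```

When every score in a row is equal, the loss is log(B), which is the value an untrained encoder starts near; training drives it toward zero by widening the gap between the diagonal and the rest.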
Top-K retrieval:
$$\text{Top-K}(q) = \arg\max_{d \in \mathcal{D}}^{(K)} \text{sim}(q, d)$$
ANN (Part 9 HNSW) approximates this argmax. Exact KNN costs \(\mathcal{O}(|\mathcal{D}|)\) per query; HNSW is roughly \(\mathcal{O}(\log |\mathcal{D}|)\).
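For reference, the exact search that ANN approximates is just a linear scan plus a K-best selection. The sketch below uses toy 2-dimensional vectors (an assumption for readability) and the standard-library `heapq.nlargest`.

```python
import heapq

def exact_top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int) -> list[tuple[float, int]]:
    """Score every document (linear scan) and keep the K best as (score, index)."""
    scored = (
        (sum(q * d for q, d in zip(query_vec, vec)), idx)
        for idx, vec in enumerate(doc_vecs)
    )
    return heapq.nlargest(k, scored)

doc_vecs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]  # toy unit vectors
top = exact_top_k([1.0, 0.0], doc_vecs, k=2)
top_indices = [idx for _, idx in top]
```

This scan touches every vector on every query, which is why it serves as a correctness baseline for small corpora and for measuring ANN recall, not as a production search path.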
Distance-distribution diagnostic:
- The gap \(s_1 - s_{10}\) between top-1 and top-10 similarity expresses confidence; larger is more confident.
- A small gap means a flat distribution: either Part 5 §8.3 (chunk too large) or this part's §8.2 (ambiguous query).
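The gap check can be sketched in a few lines. The 0.1 threshold is the one §8.2 gives for normalised cosine scores; the function names and the two score lists are hypothetical.

```python
def top_k_gap(scores: list[float], k: int = 10) -> float:
    """Gap between the best score and the k-th best (or the last one available)."""
    ranked = sorted(scores, reverse=True)[:k]
    return ranked[0] - ranked[-1]

def is_flat(scores: list[float], k: int = 10, threshold: float = 0.1) -> bool:
    """True when the top-K scores are too close together to trust the ranking."""
    return top_k_gap(scores, k) < threshold

confident = [0.82, 0.61, 0.55, 0.50, 0.48, 0.45, 0.44, 0.41, 0.40, 0.38]
flat = [0.58, 0.57, 0.57, 0.56, 0.56, 0.55, 0.55, 0.54, 0.54, 0.53]

gap_confident = top_k_gap(confident)  # about 0.44
gap_flat = top_k_gap(flat)            # about 0.05
```

Because absolute scores shift with the model and the corpus, a threshold like this is best calibrated on a sample of real queries before it is used to trigger a rewrite or a low-confidence answer.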
6. Walkthrough — Dense Retrieval From Scratch
6.1 Embedding + in-memory top-K (concept)
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")
docs = [
    "The quarterly sales report consolidates in USD after applying FX rates.",
    "Inventory SKU-2024-04: 30 units in the Incheon warehouse.",
    "Exception filings require department-head approval and form SEC-EX-04.",
]
doc_embs = model.encode(docs, normalize_embeddings=True)  # (3, 1024)
query_emb = model.encode("currency conversion basis?", normalize_embeddings=True)  # (1024,)
scores = doc_embs @ query_emb  # dot product = cosine when normalised
top_k = np.argsort(-scores)[:2]  # indices of the two highest scores
for i in top_k:
    print(f" {scores[i]:.3f} | {docs[i]}")
Example output:
0.612 | The quarterly sales report consolidates in USD after applying FX rates.
0.184 | Exception filings require department-head approval and form SEC-EX-04.
The top-1 (0.612) to top-2 (0.184) gap is large → confident retrieval. The winning chunk says "FX rates" and still matches "currency conversion" — that is Dense's core capability.
6.2 LangChain + vector DB top-K
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma

embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-m3",
    encode_kwargs={"normalize_embeddings": True},
)
vectordb = Chroma(persist_directory="./chroma_db", embedding_function=embeddings, collection_name="rag")
vectordb.add_texts(docs, metadatas=[{"version": "3.2"}] * len(docs))  # docs: the list from 6.1
results = vectordb.similarity_search_with_score(
    "currency conversion basis?",
    k=5,
    filter={"version": "3.2"},  # Part 7 pre-filter
)
# Note: Chroma returns a distance here (lower = closer), not a cosine similarity.
for doc, score in results:
    print(f" {score:.3f} | {doc.page_content[:50]}...")
LangChain splits embed_query and embed_documents automatically — protecting against the Part 8 §8.2 asymmetry pitfall.
6.3 DPR style — two asymmetrically-trained encoders
from transformers import DPRQuestionEncoder, DPRContextEncoder, DPRQuestionEncoderTokenizer, DPRContextEncoderTokenizer
q_enc = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
d_enc = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
q_tok = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
d_tok = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
q_inputs = q_tok("currency conversion basis?", return_tensors="pt")
q_emb = q_enc(**q_inputs).pooler_output # (1, 768)
d_inputs = d_tok("The quarterly sales report consolidates in USD after applying FX rates.", return_tensors="pt")
d_emb = d_enc(**d_inputs).pooler_output # (1, 768)
sim = (q_emb * d_emb).sum(dim=-1).item()  # raw dot product; DPR vectors are not L2-normalised
Key: the two encoders have different weights. Feeding the same text to both yields different vectors. To avoid the Part 8 §8.2 trap, the query must go through q_enc and the doc through d_enc.
7. Variants
7.1 Symmetric vs Asymmetric — read the model card
- What changes: are \(E_q\) and \(E_d\) the same?
- Why use it: asymmetric helps when query and doc distributions differ (short query vs long doc).
- What becomes possible: precise matching across length asymmetry.
- Where it fits: domains with large query-doc style gaps (user questions vs formal documents).
- Limits: must split function calls; missing the split causes the Part 8 §8.2 pitfall.
7.2 In-batch + hard negatives
- What changes: use other batch pairs as negatives, plus pre-mined hard negatives.
- Why use it: in-batch alone fails to distinguish similar documents. Hard negatives sharpen the decision boundary.
- What becomes possible: retrieval that separates near-identical documents.
- Where it fits: domain fine-tuning (legal, medical, in-house).
- Limits: hard-negative mining is expensive; needs a self-training loop.
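One common way to obtain hard negatives is to take the current retriever's highest-ranked documents that are not labelled as answers. The sketch below assumes the scores for one query are already available; the document ids and values are hypothetical.

```python
def mine_hard_negatives(scores: dict[str, float], positive_ids: set[str], n: int = 2) -> list[str]:
    """Return the n highest-scoring documents that are NOT labelled positive."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [doc_id for doc_id in ranked if doc_id not in positive_ids][:n]

# Hypothetical retriever scores for a single training query.
scores = {
    "fx_policy_v3": 0.71,   # the labelled answer
    "fx_policy_v2": 0.69,   # near-duplicate, but outdated
    "fx_faq": 0.64,
    "sku_inventory": 0.12,  # easy negative, little training signal
}
hard = mine_hard_negatives(scores, positive_ids={"fx_policy_v3"})
```

Mined candidates can include unlabelled true answers, so some filtering or spot-checking is usually worthwhile before they go into a training batch.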
7.3 Multi-vector retrieval — ColBERT
- What changes: each document has many vectors (one per token); each query token matches its nearest doc token (late interaction).
- Why use it: avoids the single-vector compression loss.
- What becomes possible: better partial-match accuracy in long documents.
- Where it fits: long-doc corpora, accuracy-first.
- Limits: index size grows tens of times over, since every token gets its own vector. BGE-M3 includes a ColBERT mode.
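The late-interaction score reduces to a sum of per-query-token maxima (MaxSim in the ColBERT paper). The sketch below uses hand-written 2-dimensional token vectors, which are an assumption for readability; real token embeddings come from the encoder.

```python
def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def maxsim_score(query_tokens: list[list[float]], doc_tokens: list[list[float]]) -> float:
    """For each query token vector, take its best-matching doc token; sum the maxima."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

query_tokens = [[1.0, 0.0], [0.0, 1.0]]          # two query token vectors
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]     # covers both query tokens
doc_b = [[1.0, 0.0], [0.7, 0.1]]                 # covers only the first

score_a = maxsim_score(query_tokens, doc_a)  # 1.0 + 1.0
score_b = maxsim_score(query_tokens, doc_b)  # 1.0 + 0.1
```

Each query token is matched independently, so a document is rewarded for covering every part of the query; the cost is storing one vector per document token.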
7.4 Query expansion — paraphrase fallback
- What changes: expand a user query into 3–5 paraphrases, retrieve each, merge.
- Why use it: boosts recall for short or ambiguous queries.
- What becomes possible: more stable retrieval via query rewriting.
- Where it fits: chatbot RAG where queries are short and vague.
- Limits: more calls per query (LLM + retrieval). Part 18 (Query Rewrite) covers this in depth.
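The merge step can be sketched as follows, assuming each paraphrase has already been retrieved separately. Keeping each document's best score is one simple merge rule chosen for this sketch; rank-based fusion such as RRF (Part 12) is an alternative. All ids and scores are hypothetical.

```python
def merge_results(per_query_hits: list[list[tuple[str, float]]], k: int = 3) -> list[tuple[str, float]]:
    """Union the hits from every paraphrase, keep each doc's best score, return top k."""
    best: dict[str, float] = {}
    for hits in per_query_hits:
        for doc_id, score in hits:
            if score > best.get(doc_id, float("-inf")):
                best[doc_id] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)[:k]

# Hypothetical hits for three paraphrases of the same user question.
per_query_hits = [
    [("doc_fx", 0.61), ("doc_report", 0.40)],
    [("doc_fx", 0.66), ("doc_rates", 0.52)],
    [("doc_rates", 0.49), ("doc_misc", 0.20)],
]
merged = merge_results(per_query_hits)
```

Max-score merging only makes sense when all paraphrases are scored by the same model on the same scale, which holds here because every paraphrase goes through the same retriever.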
7.5 Embedding cache — repeat queries
- What changes: cache query embeddings keyed by hash.
- Why use it: zero re-compute cost for FAQ-like repeated queries.
- What becomes possible: lower latency and cost.
- Where it fits: any RAG with Zipf-like user-query patterns.
- Limits: model swap invalidates the entire cache; manage TTL.
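A minimal in-memory sketch of such a cache follows. The class name, the whitespace/case normalisation, and the default TTL are assumptions for illustration; `fake_embed` stands in for a real embedding call. Putting the model name into the hash is one way to make a model swap miss the old entries automatically.

```python
import hashlib
import time
from typing import Callable

class QueryEmbeddingCache:
    """Hypothetical cache: hash(model name + normalised query) -> embedding."""

    def __init__(self, model_name: str, ttl_seconds: float = 3600.0):
        self.model_name = model_name
        self.ttl_seconds = ttl_seconds
        self._store: dict[str, tuple[float, list[float]]] = {}

    def key(self, query: str) -> str:
        # Assumption: case and extra whitespace do not change the intended query.
        normalised = " ".join(query.lower().split())
        raw = f"{self.model_name}|{normalised}".encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    def get_or_compute(self, query: str, embed_fn: Callable[[str], list[float]]) -> list[float]:
        k = self.key(query)
        now = time.monotonic()
        cached = self._store.get(k)
        if cached is not None and now - cached[0] < self.ttl_seconds:
            return cached[1]  # cache hit: no embedding call
        vector = embed_fn(query)
        self._store[k] = (now, vector)
        return vector

calls: list[str] = []

def fake_embed(text: str) -> list[float]:
    """Stand-in for a real embedding call; records how often it is invoked."""
    calls.append(text)
    return [float(len(text))]

cache = QueryEmbeddingCache("model-a")
v1 = cache.get_or_compute("Currency conversion basis?", fake_embed)
v2 = cache.get_or_compute("  currency   conversion basis? ", fake_embed)  # served from cache
```

If the embedding model is case-sensitive, the lowercasing step should be dropped so that the cached vector always matches what the model would have produced.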
8. Limits and Failure Modes
8.1 Weak on rare vocabulary and proper nouns
- Why intrinsic: embeddings reward high-frequency words; rare tokens like SKU-2024-04 are blurred.
- Diagnosis: recall on queries containing proper nouns or codes is meaningfully lower.
- Mitigation: Hybrid with Part 11 BM25; Part 12 RRF.
- Later part: Parts 11, 12.
8.2 Flat top-K — low confidence
- Why intrinsic: ambiguous queries or corpora with many similar chunks produce uniformly close top-K scores.
- Diagnosis: top-1 minus top-10 gap < 0.1 (cosine normalised).
- Mitigation: query rewrite (Part 18), reranker (Part 13), signal low confidence in the answer.
- Later part: Parts 13, 18, 19.
8.3 Weak handling of negation and antonyms
- Why intrinsic: "not an exception" and "is an exception" are similar in embedding space — vector distance struggles with meaning reversal.
- Diagnosis: negation queries returning the opposite answer in top-K.
- Mitigation: explicit semantic verification in answer generation; Hybrid's BM25 signal partially helps (high-frequency negation tokens).
- Later part: Part 22 (answer verification).
8.4 Language drift — outside training distribution
- Why intrinsic: English-heavy models blur on Korean queries.
- Diagnosis: compare retrieval quality across different models on the same corpus.
- Mitigation: Part 8's decision table — Korean share ≥ 30% → BGE-M3 or Upstage.
- Later part: Part 8 cross-ref.
8.5 No notion of recency
- Why intrinsic: Dense scores semantic similarity only. Time-sensitive queries ("this quarter's policy") may surface old chunks at the top.
- Diagnosis: answers citing outdated versions.
- Mitigation: Part 7 version=latest pre-filter, or created_at boost (post-filter sigmoid weighting).
- Later part: Parts 22, 7 cross-ref.
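One possible shape for the created_at boost is sketched below: a sigmoid over chunk age, blended with the similarity score after retrieval. The midpoint, steepness, blend weight `alpha`, and the example hits are all assumptions to be tuned per corpus, not values from this series.

```python
import math

def recency_weight(age_days: float, midpoint_days: float = 180.0, steepness: float = 0.02) -> float:
    """Sigmoid weight in (0, 1): near 1 for fresh chunks, 0.5 at the midpoint, near 0 for old ones."""
    return 1.0 / (1.0 + math.exp(steepness * (age_days - midpoint_days)))

def rerank_with_recency(hits: list[tuple[str, float, float]], alpha: float = 0.3) -> list[tuple[str, float]]:
    """hits: (doc_id, similarity, age_days). Final score = (1 - alpha) * sim + alpha * recency."""
    rescored = [
        (doc_id, (1.0 - alpha) * sim + alpha * recency_weight(age_days))
        for doc_id, sim, age_days in hits
    ]
    return sorted(rescored, key=lambda hit: hit[1], reverse=True)

# Hypothetical hits: the older chunk is slightly more similar to the query.
hits = [("policy_2022", 0.70, 900.0), ("policy_2025", 0.66, 30.0)]
reranked = rerank_with_recency(hits)
```

With these assumed settings the fresh chunk overtakes the older one despite its lower similarity; a hard version=latest pre-filter is the stricter option when old versions must never appear.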
8.6 Common Pitfalls
- "Dense replaces BM25." §8.1. Proper nouns and codes still need BM25; Hybrid is the production standard.
- "Use top-K absolute scores as thresholds." Scores vary by model and corpus; read the gap and distribution.
- "Embed query and doc with the same function." Part 8 §8.2. The classic asymmetric-model mistake.
- "A trained model generalises across domains." Domain fine-tuning makes a large difference. §7.2.
- "Top-K = 5 is enough." A reranker (Part 13) needs 30–100 candidates to choose from.
9. Settled Conclusions
Q1. What is the bi-encoder asymmetry and why does it matter?
\(E_q \ne E_d\). Trained to reflect query/doc distribution differences. Embedding both with the same function decouples the vector spaces — recall collapses. Chapter: §4, §7.1, Part 8 §8.2.
Q2. State the DPR InfoNCE loss in one line.
A negative log-likelihood with the correct-pair similarity in the numerator and the sum over correct + all negatives in the denominator. Pull correct, push negatives. Chapter: §5.
Q3. What does the top-K gap tell you?
Retrieval confidence. Large gap → clear winner; small gap → flat (ambiguous query or oversized chunks). Chapter: §5, §8.2.
Q4. Summarise Dense's five hard limits in two words each.
Rare vocabulary, flat top-K, negation handling, language drift, no recency. Chapter: §8.
Q5. Why is Dense + BM25 the standard combination?
Dense is strong on semantics; BM25 is strong on exact and proper-noun matches. Their weaknesses are complementary. Chapter: §8.1, Part 12 cross-ref.
10. Further Reading
Primary
- Karpukhin, V. et al. Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020. arXiv:2004.04906.
- Khattab, O., Zaharia, M. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. SIGIR 2020. arXiv:2004.12832.
- Reimers, N., Gurevych, I. Sentence-BERT. EMNLP 2019. arXiv:1908.10084.
- Xiong, L. et al. Approximate Nearest Neighbor Negative Contrastive Learning (ANCE). ICLR 2021. arXiv:2007.00808.
- Lewis, P. et al. Retrieval-Augmented Generation. NeurIPS 2020. arXiv:2005.11401.
Official docs
- DPR (HuggingFace): https://huggingface.co/docs/transformers/model_doc/dpr
- LangChain Retrievers: https://python.langchain.com/docs/concepts/retrievers/
- ColBERT v2: https://github.com/stanford-futuredata/ColBERT
- Sentence-Transformers: https://sbert.net/
Supporting
- Author note Chapter 9 — Dense Retrieval.
- Author note Chapter 35 §2 — Model-aware Ingestion (asymmetric implications).
Cheat Sheet
| Knob | Default / Recommendation |
|---|---|
| Top-K | 5–50 (30–100 if a reranker follows) |
| Similarity | Normalise + cosine (or normalised dot) |
| Symmetric / Asymmetric | Check model card — split calls accordingly |
| Negatives | In-batch + hard (for fine-tuning) |
| Gap diagnostic | \(s_1 - s_{10} \ge 0.1\) (top-1 minus top-10) |
| Cache | Hash query → embedding; invalidate on model swap |
| BM25 combination | Production default — Hybrid (Part 12) |
One-liner: semantic similarity → Dense, exact matching → BM25, final ranking → Reranker. The three together form RAG's standard search.
Bridge — What's Next
Next — RAG Core Study (11/26) — Sparse Retrieval & BM25 Deep Dive.
The retrieval that fills Dense's proper-noun and exact-match gap is Sparse — TF-IDF, BM25 (Robertson 2009), inverted indices, and Korean morphological analyzers. We will also cover Elasticsearch/OpenSearch integration patterns for RAG.