"RAG Core Study (10/26) — Dense Retrieval Deep Dive"

May 17, 2026

The embedding and vector DB are ready. Now we look at retrieval itself — the principles of Dense Retrieval.

Dense Retrieval finds answers by vector distance, not keyword overlap. Simple in appearance, but without the bi-encoder asymmetry, negative sampling, and top-K distance distribution you cannot explain why a given answer surfaced. Part 10 starts from DPR (Karpukhin 2020) and works through asymmetric query/doc embedding, what the top-K distances mean, and Dense's hard limits — with formulas and code.

0. Prerequisites

Part 8 (embeddings) — normalisation, similarity.
Part 9 (vector DBs) — HNSW top-K search.
Part 7 (metadata) — combining with pre-filters.

1. Learning Objectives

Express the bi-encoder structure and query/doc asymmetry as a formula.
Explain DPR's in-batch negative sampling.
Diagnose retrieval quality from the top-K distance distribution.
Name Dense's five hard limits and when to complement with BM25, Hybrid, or a Reranker.

2. Key Summary

Dense Retrieval maps query and document to the same vector space and ranks top-K by similarity. A bi-encoder (query encoder + doc encoder; identical or distinct) splits indexing from query — document vectors are precomputed, query vectors are computed at runtime. DPR (Karpukhin 2020) standardised training with in-batch negatives, pulling true pairs together and pushing others apart. At search time cosine or dot product produces top-K. Dense's strength is semantic similarity; its weakness is rare vocabulary and exact matching — pairing with BM25 (Part 11) via Hybrid (Part 12) is the standard remedy.

3. Intuition — What Dense Catches and BM25 Misses

Query: "basis for currency conversion in the quarterly sales report?"

In the same corpus:

BM25: strongly matches chunks that literally contain "currency conversion". If the body says "FX rate applied" instead, it misses.
Dense: ties "currency conversion" and "FX rate applied" together by semantic similarity and pulls both into top-K.

The opposite case: queries hinging on proper nouns or codes like "SKU-2024-04 inventory" — BM25 wins. Dense's embeddings blur on rare tokens.

4. Definitions — Bi-encoder and Core Terms

Term	Definition
Bi-encoder	Query encoder $E_q$ and doc encoder $E_d$ produce vectors independently; similarity is computed afterwards
Symmetric	$E_q = E_d$ (same model). BGE-M3, OpenAI default
Asymmetric	$E_q \ne E_d$ (different weights). DPR, Upstage Solar
Top-K	Top K candidates by similarity. Typically K = 5–50
In-batch negative	Use other pairs in the same training batch as negatives
Hard negative	Semantically similar but not the answer — yields the strongest training signal
Bi-encoder vs Cross-encoder	Bi: precomputable, fast / Cross: scores query+doc jointly, slow. Cross is Part 13 (Reranker)

5. Math — DPR and Training Loss

Similarity (dot product; with L2-normalised vectors this equals cosine similarity):

$$\text{sim}(q, d) = E_q(q) \cdot E_d(d)$$

InfoNCE loss (DPR's standard objective):

$$\mathcal{L} = -\log \frac{\exp(\text{sim}(q, d^+))}{\exp(\text{sim}(q, d^+)) + \sum_{d^- \in N} \exp(\text{sim}(q, d^-))}$$

$d^+$ = the correct document
$N$ = negative set (in-batch + hard)
Training pulls the correct similarity up and pushes negatives down.
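
As a rough illustration of the loss above, the following sketch computes InfoNCE with in-batch negatives in NumPy. It assumes that row i of `q` and row i of `d` form the true pair, so every other row in the batch serves as a negative. The function name `info_nce_in_batch` and the toy vectors are illustrative and do not come from DPR's code.

```python
import numpy as np

def info_nce_in_batch(q: np.ndarray, d: np.ndarray) -> float:
    """Mean InfoNCE loss where (q[i], d[i]) is the positive pair.

    q, d: arrays of shape (B, dim). Every d[j] with j != i is treated
    as an in-batch negative for q[i] (an assumption of this sketch).
    """
    sim = q @ d.T                                   # (B, B) similarity matrix
    sim = sim - sim.max(axis=1, keepdims=True)      # stabilise exp() numerically
    log_softmax = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_softmax)))    # -log p(d+ | q), averaged

# Toy batch: aligned pairs should give a lower loss than shuffled pairs.
q = np.array([[1.0, 0.0], [0.0, 1.0]])
d_aligned = np.array([[1.0, 0.0], [0.0, 1.0]])
d_shuffled = d_aligned[::-1]

loss_aligned = info_nce_in_batch(q, d_aligned)
loss_shuffled = info_nce_in_batch(q, d_shuffled)
print(f"aligned={loss_aligned:.4f} shuffled={loss_shuffled:.4f}")
```

The diagonal of the similarity matrix holds the positive pairs, which is why a single matrix product gives both the numerator and the in-batch denominator.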

Top-K retrieval:

$$\text{Top-K}(q) = \arg\max_{d \in \mathcal{D}}^{(K)} \text{sim}(q, d)$$

ANN (Part 9 HNSW) approximates this argmax. Exact KNN costs $\mathcal{O}(N)$ per query; HNSW search is roughly $\mathcal{O}(\log N)$ in practice, in exchange for a small loss of recall.
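
For reference, the exact (brute-force) search that ANN approximates can be sketched as follows. The function name `exact_top_k` and the two-dimensional toy vectors are assumptions made for illustration; a real corpus would use the embedding dimension of the chosen model.

```python
import numpy as np

def exact_top_k(query_emb: np.ndarray, doc_embs: np.ndarray, k: int) -> list[tuple[int, float]]:
    """Brute-force top-K by dot product; scans all N documents."""
    scores = doc_embs @ query_emb                        # (N,)
    k = min(k, scores.shape[0])
    candidates = np.argpartition(-scores, k - 1)[:k]     # unordered top-k, O(N)
    ordered = candidates[np.argsort(-scores[candidates])]  # sort only the k winners
    return [(int(i), float(scores[i])) for i in ordered]

doc_embs = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [-1.0, 0.0]])
query_emb = np.array([0.8, 0.6])
top2 = exact_top_k(query_emb, doc_embs, k=2)
print(top2)
```

Using `np.argpartition` avoids sorting the full score vector, but the scan over every document is still linear in N, which is the cost ANN indexes are built to avoid.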

Distance-distribution diagnostic:

The gap $s_1 - s_{10}$ between top-1 and top-10 similarity expresses confidence; larger is more confident.
A small gap means a flat distribution: the usual causes are Part 5 §8.3 (chunk too large) or §8.2 of this part (ambiguous query).
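
The gap diagnostic can be sketched in a few lines of plain Python. The helper names `top_k_gap` and `is_flat` are hypothetical, and the default threshold of 0.1 is taken from the flat top-K rule of thumb in §8.2; it should be tuned per model and corpus.

```python
def top_k_gap(scores: list[float], k: int = 10) -> float:
    """Gap between the best score and the k-th best score (s_1 - s_k)."""
    if not scores:
        raise ValueError("scores must not be empty")
    ranked = sorted(scores, reverse=True)
    kth = ranked[min(k, len(ranked)) - 1]   # use the last score if fewer than k
    return ranked[0] - kth

def is_flat(scores: list[float], k: int = 10, min_gap: float = 0.1) -> bool:
    """True when the top-K scores are too close together to trust the winner."""
    return top_k_gap(scores, k) < min_gap

confident = [0.612, 0.184]
flat = [0.52, 0.51, 0.50, 0.50, 0.49, 0.49, 0.48, 0.48, 0.47, 0.47]
print(is_flat(confident), is_flat(flat))
```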

6. Walkthrough — Dense Retrieval From Scratch

6.1 Embedding + in-memory top-K (concept)

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")

docs = ["The quarterly sales report consolidates in USD after applying FX rates.",
        "Inventory SKU-2024-04: 30 units in the Incheon warehouse.",
        "Exception filings require department-head approval and form SEC-EX-04."]

doc_embs = model.encode(docs, normalize_embeddings=True)   # (3, 1024)
query_emb = model.encode("currency conversion basis?", normalize_embeddings=True)  # (1024,)

scores = doc_embs @ query_emb     # dot product = cosine when normalised
top_k = np.argsort(-scores)[:2]
for i in top_k:
    print(f"  {scores[i]:.3f} | {docs[i]}")

Example output:

  0.612 | The quarterly sales report consolidates in USD after applying FX rates.
  0.184 | Exception filings require department-head approval and form SEC-EX-04.

The top-1 (0.612) to top-2 (0.184) gap is large → confident retrieval. The winning chunk says "FX rates" and still matches "currency conversion" — that is Dense's core capability.

6.2 LangChain + vector DB top-K

from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma

embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-m3", encode_kwargs={"normalize_embeddings": True})
vectordb = Chroma(persist_directory="./chroma_db", embedding_function=embeddings, collection_name="rag")

vectordb.add_texts(docs, metadatas=[{"version": "3.2"}] * len(docs))   # docs: the list from §6.1

results = vectordb.similarity_search_with_score(
    "currency conversion basis?",
    k=5,
    filter={"version": "3.2"},   # Part 7 pre-filter
)
for doc, score in results:
    # Chroma reports a distance here (lower = closer), not a cosine similarity.
    print(f"  {score:.3f} | {doc.page_content[:50]}...")

LangChain splits embed_query and embed_documents automatically — protecting against the Part 8 §8.2 asymmetry pitfall.

6.3 DPR style — two asymmetrically-trained encoders

from transformers import DPRQuestionEncoder, DPRContextEncoder, DPRQuestionEncoderTokenizer, DPRContextEncoderTokenizer

q_enc   = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
d_enc   = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
q_tok   = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
d_tok   = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

q_inputs = q_tok("currency conversion basis?", return_tensors="pt")
q_emb = q_enc(**q_inputs).pooler_output     # (1, 768)

d_inputs = d_tok("The quarterly sales report consolidates in USD after applying FX rates.", return_tensors="pt")
d_emb = d_enc(**d_inputs).pooler_output     # (1, 768)

sim = (q_emb * d_emb).sum(dim=-1).item()

Key: the two encoders have different weights. Feeding the same text to both yields different vectors. To avoid the Part 8 §8.2 trap, the query must go through q_enc and the doc through d_enc.

7. Variants

7.1 Symmetric vs Asymmetric — read the model card

What changes: are $E_q$ and $E_d$ the same?
Why use it: asymmetric helps when query and doc distributions differ (short query vs long doc).
What becomes possible: precise matching across length asymmetry.
Where it fits: domains with large query-doc style gaps (user questions vs formal documents).
Limits: query and document calls must be split; missing the split causes the Part 8 §8.2 pitfall.

7.2 In-batch + hard negatives

What changes: use other batch pairs as negatives, plus pre-mined hard negatives.
Why use it: in-batch alone fails to distinguish similar documents. Hard negatives sharpen the decision boundary.
What becomes possible: retrieval that separates near-identical documents.
Where it fits: domain fine-tuning (legal, medical, in-house).
Limits: hard-negative mining is expensive; needs a self-training loop.
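
A single round of hard-negative mining might look like the sketch below: score every document with the current encoder and keep the highest-ranked ones that are not the labelled positive. The function name `mine_hard_negatives` and the toy embeddings are assumptions. The sketch also assumes one positive per query, so in a real corpus some mined items may be unlabelled positives and need filtering.

```python
import numpy as np

def mine_hard_negatives(
    q_embs: np.ndarray, d_embs: np.ndarray, positive_ids: list[int], n_neg: int = 2
) -> list[list[int]]:
    """For each query, return the top-scoring doc ids that are not its positive."""
    scores = q_embs @ d_embs.T                      # (Q, N)
    hard = []
    for qi, pos in enumerate(positive_ids):
        ranked = np.argsort(-scores[qi])            # best match first
        hard.append([int(j) for j in ranked if j != pos][:n_neg])
    return hard

d_embs = np.array([[1.0, 0.0], [0.9, 0.436], [0.0, 1.0], [-1.0, 0.0]])
q_embs = np.array([[1.0, 0.0]])
hard_ids = mine_hard_negatives(q_embs, d_embs, positive_ids=[0], n_neg=2)
print(hard_ids)
```

Document 1 is nearly identical to the positive and ranks first among the negatives, which is exactly the kind of example that in-batch sampling rarely supplies.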

7.3 Multi-vector retrieval — ColBERT

What changes: each document has many vectors (one per token); each query token matches its nearest doc token (late interaction).
Why use it: avoids the single-vector compression loss.
What becomes possible: better partial-match accuracy in long documents.
Where it fits: long-doc corpora, accuracy-first.
Limits: the index can grow to tens of times the single-vector size. BGE-M3 includes a multi-vector (ColBERT-style) mode.
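
The late-interaction score can be sketched as a "MaxSim" sum: for each query token take its best-matching document token, then add those maxima. The function name `maxsim_score` and the toy token vectors are illustrative; real token embeddings would come from a ColBERT-style model.

```python
import numpy as np

def maxsim_score(q_tokens: np.ndarray, d_tokens: np.ndarray) -> float:
    """Sum over query tokens of the maximum similarity to any doc token."""
    sim = q_tokens @ d_tokens.T            # (n_query_tokens, n_doc_tokens)
    return float(sim.max(axis=1).sum())    # best doc token per query token

q_tokens = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_a = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])   # covers both query tokens
doc_b = np.array([[0.6, 0.8]])                            # one blended vector

score_a = maxsim_score(q_tokens, doc_a)
score_b = maxsim_score(q_tokens, doc_b)
print(score_a, score_b)
```

`doc_b` behaves like a single-vector summary: it partially matches both query tokens but fully matches neither, so it scores lower than the document that keeps one vector per token.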

7.4 Query expansion — paraphrase fallback

What changes: expand a user query into 3–5 paraphrases, retrieve each, merge.
Why use it: boosts recall for short or ambiguous queries.
What becomes possible: more stable retrieval via query rewriting.
Where it fits: chatbot RAG where queries are short and vague.
Limits: more calls per query (LLM + retrieval). Part 18 (Query Rewrite) covers this in depth.
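
The retrieve-and-merge step can be sketched as below. The paraphrases are assumed to be already embedded (one row per paraphrase), and merging by each document's best score is one possible choice among several. The function name `merge_expanded` and the toy vectors are hypothetical.

```python
import numpy as np

def merge_expanded(query_embs: np.ndarray, doc_embs: np.ndarray, k: int) -> list[tuple[int, float]]:
    """Retrieve top-k per paraphrase, then merge by each doc's best score."""
    best: dict[int, float] = {}
    for q in query_embs:                                 # one row per paraphrase
        scores = doc_embs @ q
        for i in np.argsort(-scores)[:k]:
            doc_id = int(i)
            best[doc_id] = max(best.get(doc_id, float("-inf")), float(scores[i]))
    return sorted(best.items(), key=lambda kv: -kv[1])[:k]

doc_embs = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
query_embs = np.array([[1.0, 0.0], [0.0, 1.0]])          # two paraphrase embeddings
merged = merge_expanded(query_embs, doc_embs, k=2)
print(merged)
```

Each paraphrase alone would miss one of the two strongest documents; the merged list contains both.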

7.5 Embedding cache — repeat queries

What changes: cache query embeddings keyed by hash.
Why use it: zero re-compute cost for FAQ-like repeated queries.
What becomes possible: lower latency and cost.
Where it fits: any RAG with Zipf-like user-query patterns.
Limits: model swap invalidates the entire cache; manage TTL.
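
A minimal in-process version of such a cache is sketched below. The class name `QueryEmbeddingCache`, the one-hour default TTL, and the `fake_embed` stand-in are assumptions for illustration. Including the model name in the hashed key means entries written under one model are never returned after a swap to another.

```python
import hashlib
import time
from typing import Callable, Sequence

class QueryEmbeddingCache:
    """Cache query embeddings keyed by a hash of (model name, query text)."""

    def __init__(self, embed_fn: Callable[[str], Sequence[float]], model_name: str,
                 ttl_seconds: float = 3600.0, clock: Callable[[], float] = time.monotonic):
        self._embed_fn = embed_fn
        self._model_name = model_name
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: dict[str, tuple[float, Sequence[float]]] = {}

    def _key(self, query: str) -> str:
        raw = f"{self._model_name}\x00{query.strip()}"   # model name is part of the key
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def get(self, query: str) -> Sequence[float]:
        key, now = self._key(query), self._clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]                                # fresh entry: no re-compute
        emb = self._embed_fn(query)                      # miss or expired: embed again
        self._store[key] = (now, emb)
        return emb

calls: list[str] = []

def fake_embed(text: str) -> list[float]:
    """Hypothetical stand-in for a real embedding call."""
    calls.append(text)
    return [float(len(text)), 0.0]

cache = QueryEmbeddingCache(fake_embed, model_name="BAAI/bge-m3")
first = cache.get("currency conversion basis?")
second = cache.get("currency conversion basis?")
print(len(calls))
```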

8. Limits and Failure Modes

8.1 Weak on rare vocabulary and proper nouns

Why intrinsic: embedding models learn mostly from frequent vocabulary; rare tokens like SKU-2024-04 are typically split into subword pieces with weak representations, so their vectors blur.
Diagnosis: recall on queries containing proper nouns or codes is meaningfully lower.
Mitigation: Hybrid with Part 11 BM25; Part 12 RRF.
Later part: Parts 11, 12.

8.2 Flat top-K — low confidence

Why intrinsic: ambiguous queries or corpora with many similar chunks produce uniformly close top-K scores.
Diagnosis: top-1 minus top-10 gap < 0.1 (cosine similarity on normalised vectors).
Mitigation: query rewrite (Part 18), reranker (Part 13), signal low confidence in the answer.
Later part: Parts 13, 18, 19.

8.3 Weak handling of negation and antonyms

Why intrinsic: "not an exception" and "is an exception" are similar in embedding space — vector distance struggles with meaning reversal.
Diagnosis: negation queries returning the opposite answer in top-K.
Mitigation: explicit semantic verification in answer generation; Hybrid's BM25 signal partially helps (high-frequency negation tokens).
Later part: Part 22 (answer verification).

8.4 Language drift — outside training distribution

Why intrinsic: English-heavy models blur on Korean queries.
Diagnosis: compare retrieval quality across different models on the same corpus.
Mitigation: Part 8's decision table — Korean share ≥ 30% → BGE-M3 or Upstage.
Later part: Part 8 cross-ref.

8.5 No notion of recency

Why intrinsic: Dense scores semantic similarity only. Time-sensitive queries ("this quarter's policy") may surface old chunks at the top.
Diagnosis: answers citing outdated versions.
Mitigation: Part 7 version=latest pre-filter, or created_at boost (post-filter sigmoid weighting).
Later part: Parts 22, 7 cross-ref.
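
One way to read the created_at boost is as a sigmoid weight on chunk age applied to the dense score after retrieval. The sketch below follows that reading. The function names and the parameters `pivot_days`, `scale_days`, and `alpha` are hypothetical and would need tuning against real queries.

```python
import math
from datetime import datetime, timezone

def recency_weight(created_at: datetime, now: datetime,
                   pivot_days: float = 180.0, scale_days: float = 30.0) -> float:
    """Sigmoid in (0, 1): near 1 for fresh chunks, near 0 well past the pivot age."""
    age_days = (now - created_at).total_seconds() / 86400.0
    z = min((age_days - pivot_days) / scale_days, 700.0)   # clamp to avoid exp overflow
    return 1.0 / (1.0 + math.exp(z))

def boosted_score(sim: float, created_at: datetime, now: datetime, alpha: float = 0.3) -> float:
    """Blend: keep (1 - alpha) of the dense score, scale the rest by recency."""
    return sim * (1.0 - alpha + alpha * recency_weight(created_at, now))

now = datetime(2026, 5, 17, tzinfo=timezone.utc)
old_sim, new_sim = 0.70, 0.66                              # old chunk wins on similarity alone
old_score = boosted_score(old_sim, datetime(2024, 1, 1, tzinfo=timezone.utc), now)
new_score = boosted_score(new_sim, datetime(2026, 4, 1, tzinfo=timezone.utc), now)
print(f"old={old_score:.3f} new={new_score:.3f}")
```

With `alpha = 0.3` an old chunk loses at most 30% of its score, so a much stronger semantic match can still outrank a recent but weaker one.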

8.6 Common Pitfalls

"Dense replaces BM25." §8.1. Proper nouns and codes still need BM25; Hybrid is the production standard.
"Use top-K absolute scores as thresholds." Scores vary by model and corpus; read the gap and distribution.
"Embed query and doc with the same function." Part 8 §8.2. The classic asymmetric-model mistake.
"A trained model generalises across domains." Domain fine-tuning makes a large difference. §7.2.
"Top-K = 5 is enough." A reranker (Part 13) needs 30–100 candidates to choose from.

9. Settled Conclusions

Q1. What is the bi-encoder asymmetry and why does it matter?

$E_q \ne E_d$: the two encoders are trained separately to reflect query/doc distribution differences. Embedding both sides with the same encoder breaks the alignment between the two vector spaces, and recall collapses. Chapter: §4, §7.1, Part 8 §8.2.

Q2. State the DPR InfoNCE loss in one line.

The negative log of a softmax: the correct-pair similarity in the numerator, the sum over the correct pair plus all negatives in the denominator. Pull the correct pair together, push negatives apart. Chapter: §5.

Q3. What does the top-K gap tell you?

Retrieval confidence. Large gap → clear winner; small gap → flat (ambiguous query or oversized chunks). Chapter: §5, §8.2.

Q4. Summarise Dense's five hard limits in two words each.

Rare vocabulary, flat top-K, negation handling, language drift, no recency. Chapter: §8.

Q5. Why is Dense + BM25 the standard combination?

Dense is strong on semantics; BM25 is strong on exact and proper-noun matches. Their weaknesses are complementary. Chapter: §8.1, Part 12 cross-ref.

10. Further Reading

Primary

Karpukhin, V. et al. Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020. arXiv:2004.04906.
Khattab, O., Zaharia, M. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. SIGIR 2020. arXiv:2004.12832.
Reimers, N., Gurevych, I. Sentence-BERT. EMNLP 2019. arXiv:1908.10084.
Xiong, L. et al. Approximate Nearest Neighbor Negative Contrastive Learning (ANCE). ICLR 2021. arXiv:2007.00808.
Lewis, P. et al. Retrieval-Augmented Generation. NeurIPS 2020. arXiv:2005.11401.

Official docs

DPR (HuggingFace): https://huggingface.co/docs/transformers/model_doc/dpr
LangChain Retrievers: https://python.langchain.com/docs/concepts/retrievers/
ColBERT v2: https://github.com/stanford-futuredata/ColBERT
Sentence-Transformers: https://sbert.net/

Supporting

Author note Chapter 9 — Dense Retrieval.
Author note Chapter 35 §2 — Model-aware Ingestion (asymmetric implications).

Cheat Sheet

Knob	Default / Recommendation
Top-K	5–50 (30–100 if a reranker follows)
Similarity	Normalise + cosine (or normalised dot)
Symmetric / Asymmetric	Check model card — split calls accordingly
Negatives	In-batch + hard (for fine-tuning)
Gap diagnostic	top-1 $-$ top-10 $\ge$ 0.1 signals confidence; below 0.1, treat as flat (§8.2)
Cache	Hash query → embedding; invalidate on model swap
BM25 combination	Production default — Hybrid (Part 12)

One-liner: semantic similarity → Dense, exact matching → BM25, final ranking → Reranker. The three together form RAG's standard search.

Bridge — What's Next

Next — RAG Core Study (11/26) — Sparse Retrieval & BM25 Deep Dive.

The retrieval that fills Dense's proper-noun and exact-match gap is Sparse — TF-IDF, BM25 (Robertson 2009), inverted indices, and Korean morphological analyzers. We will also cover Elasticsearch/OpenSearch integration patterns for RAG.

