"RAG Core Study (10/26) — Dense Retrieval Deep Dive"

The embedding and vector DB are ready. Now we look at retrieval itself — the principles of Dense Retrieval.

Dense Retrieval finds answers by vector distance, not keyword overlap. It looks simple, but without understanding bi-encoder asymmetry, negative sampling, and the top-K distance distribution you cannot explain why a given answer surfaced. Part 10 starts from DPR (Karpukhin et al., 2020) and works through asymmetric query/doc embedding, what the top-K distances mean, and Dense's hard limits, with formulas and code.


0. Prerequisites

  • Part 8 (embeddings) — normalisation, similarity.
  • Part 9 (vector DBs) — HNSW top-K search.
  • Part 7 (metadata) — combining with pre-filters.

1. Learning Objectives

  1. Express the bi-encoder structure and query/doc asymmetry as a formula.
  2. Explain DPR's in-batch negative sampling.
  3. Diagnose retrieval quality from the top-K distance distribution.
  4. Name Dense's five hard limits and when to complement with BM25, Hybrid, or a Reranker.

2. Core Summary

Dense Retrieval maps query and document to the same vector space and ranks top-K by similarity. A bi-encoder (query encoder + doc encoder; identical or distinct) splits indexing from query — document vectors are precomputed, query vectors are computed at runtime. DPR (Karpukhin 2020) standardised training with in-batch negatives, pulling true pairs together and pushing others apart. At search time cosine or dot product produces top-K. Dense's strength is semantic similarity; its weakness is rare vocabulary and exact matching — pairing with BM25 (Part 11) via Hybrid (Part 12) is the standard remedy.


3. Intuition — What Dense Catches and BM25 Misses

Query: "basis for currency conversion in the quarterly sales report?"

In the same corpus:

  • BM25: strongly matches chunks that literally contain "currency conversion". If the body says "FX rate applied" instead, it misses.
  • Dense: ties "currency conversion" and "FX rate applied" together by semantic similarity and pulls both into top-K.

The opposite case: for queries hinging on proper nouns or codes, like "SKU-2024-04 inventory", BM25 wins. Dense's embeddings blur on rare tokens.


4. Definitions — Bi-encoder and Core Terms

| Term | Definition |
|---|---|
| Bi-encoder | Query encoder \(E_q\) and doc encoder \(E_d\) produce vectors independently; similarity is computed afterwards |
| Symmetric | \(E_q = E_d\) (same model). BGE-M3, OpenAI default |
| Asymmetric | \(E_q \ne E_d\) (different weights). DPR, Upstage Solar |
| Top-K | Top K candidates by similarity. Typically K = 5–50 |
| In-batch negative | Use other pairs in the same training batch as negatives |
| Hard negative | Semantically similar but not the answer; yields the strongest training signal |
| Bi-encoder vs Cross-encoder | Bi: precomputable, fast / Cross: scores query+doc jointly, slow. Cross is Part 13 (Reranker) |

5. Math — DPR and Training Loss

Similarity (assuming normalised vectors):

$$\text{sim}(q, d) = E_q(q) \cdot E_d(d)$$

InfoNCE loss (the contrastive form of DPR's negative log-likelihood objective):

$$\mathcal{L} = -\log \frac{\exp(\text{sim}(q, d^+))}{\exp(\text{sim}(q, d^+)) + \sum_{d^- \in N} \exp(\text{sim}(q, d^-))}$$

  • \(d^+\) = the correct document
  • \(N\) = negative set (in-batch + hard)
  • Training pulls the correct similarity up and pushes negatives down.
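
To make the loss concrete, here is a minimal NumPy sketch of the in-batch case. It assumes row i of the document matrix is the positive for row i of the query matrix, so every other row in the batch acts as a negative. The function name `in_batch_info_nce` and the toy vectors are illustrative only; real training would use an autograd framework rather than NumPy.

```python
import numpy as np

def in_batch_info_nce(q_embs: np.ndarray, d_embs: np.ndarray) -> float:
    """Mean InfoNCE loss; row i of d_embs is the positive for row i of q_embs.

    Every other document in the batch serves as an in-batch negative.
    """
    sims = q_embs @ d_embs.T                        # (B, B) similarity matrix
    sims = sims - sims.max(axis=1, keepdims=True)   # stabilise the softmax
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))      # positives sit on the diagonal

# Toy batch (made-up data): three orthogonal unit vectors.
aligned = np.eye(3)
shuffled = np.roll(aligned, 1, axis=0)   # every query is now paired with a wrong doc

loss_aligned = in_batch_info_nce(aligned, aligned)
loss_shuffled = in_batch_info_nce(aligned, shuffled)
print(loss_aligned < loss_shuffled)
```

Correctly paired batches give the lower loss, which is the "pull the correct pair up, push negatives down" behaviour in numeric form.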

Top-K retrieval:

$$\text{Top-K}(q) = \arg\max_{d \in \mathcal{D}}^{(K)} \text{sim}(q, d)$$

ANN (Part 9 HNSW) approximates this argmax. Exact KNN costs \(\mathcal{O}(N)\) per query; HNSW is roughly \(\mathcal{O}(\log N)\) in practice, trading a small amount of recall for speed.
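
For reference, the exact version that ANN approximates fits in a few lines. This is a sketch assuming normalised embeddings held in memory; `exact_top_k` is a hypothetical helper name and the 2-D vectors are toy data.

```python
import numpy as np

def exact_top_k(query_emb: np.ndarray, doc_embs: np.ndarray, k: int) -> list[tuple[int, float]]:
    """Score every document (O(N) work) and return the k best as (index, score)."""
    scores = doc_embs @ query_emb
    k = min(k, len(scores))
    candidates = np.argpartition(-scores, k - 1)[:k]        # the k best, unordered
    ordered = candidates[np.argsort(-scores[candidates])]   # sort only those k
    return [(int(i), float(scores[i])) for i in ordered]

# Toy 2-D unit vectors (made-up data).
doc_embs = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [-1.0, 0.0]])
query_emb = np.array([1.0, 0.0])

top2 = exact_top_k(query_emb, doc_embs, k=2)
print(top2)
```

`np.argpartition` avoids sorting the full score vector, but every document is still scored once per query, which is the linear cost HNSW sidesteps.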

Distance-distribution diagnostic:

  • The gap \(s_1 - s_{10}\) between top-1 and top-10 similarity expresses confidence; larger is more confident.
  • A small gap means a flat distribution: either Part 5 §8.3 (chunk too large) or §8.2 of this part (ambiguous query).
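
The gap diagnostic is easy to compute from any list of similarity scores. The sketch below assumes cosine scores on normalised vectors; `top_k_gap` is a hypothetical helper and both score lists are invented for illustration.

```python
import numpy as np

def top_k_gap(scores: np.ndarray, k: int = 10) -> float:
    """Return s_1 - s_k over the k highest scores (uses the last one if fewer than k)."""
    ranked = np.sort(scores)[::-1][:k]
    return float(ranked[0] - ranked[-1])

# Made-up score lists: one with a clear winner, one flat.
confident = np.array([0.82, 0.41, 0.39, 0.35, 0.33, 0.31, 0.30, 0.28, 0.27, 0.25])
flat = np.array([0.52, 0.51, 0.51, 0.50, 0.50, 0.49, 0.49, 0.48, 0.48, 0.47])

gap_confident = top_k_gap(confident)
gap_flat = top_k_gap(flat)
print(round(gap_confident, 2), round(gap_flat, 2))
```

Logging this number per query is a cheap way to spot the flat-distribution cases before they show up as bad answers.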

6. Walkthrough — Dense Retrieval From Scratch

6.1 Embedding + in-memory top-K (concept)

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")

docs = ["The quarterly sales report consolidates in USD after applying FX rates.",
        "Inventory SKU-2024-04: 30 units in the Incheon warehouse.",
        "Exception filings require department-head approval and form SEC-EX-04."]

doc_embs = model.encode(docs, normalize_embeddings=True)   # (3, 1024)
query_emb = model.encode("currency conversion basis?", normalize_embeddings=True)  # (1024,)

scores = doc_embs @ query_emb     # dot product = cosine when normalised
top_k = np.argsort(-scores)[:2]
for i in top_k:
    print(f"  {scores[i]:.3f} | {docs[i]}")

Example output:

  0.612 | The quarterly sales report consolidates in USD after applying FX rates.
  0.184 | Exception filings require department-head approval and form SEC-EX-04.

The top-1 (0.612) to top-2 (0.184) gap is large → confident retrieval. The winning chunk says "FX rates" and still matches "currency conversion" — that is Dense's core capability.

6.2 LangChain + vector DB top-K

from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma

embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-m3", encode_kwargs={"normalize_embeddings": True})
vectordb = Chroma(persist_directory="./chroma_db", embedding_function=embeddings, collection_name="rag")

vectordb.add_texts(docs, metadatas=[{"version": "3.2"}] * len(docs))

results = vectordb.similarity_search_with_score(
    "currency conversion basis?",
    k=5,
    filter={"version": "3.2"},   # Part 7 pre-filter
)
# Note: Chroma returns a distance here (lower = closer), not a similarity.
for doc, score in results:
    print(f"  {score:.3f} | {doc.page_content[:50]}...")

LangChain splits embed_query and embed_documents automatically — protecting against the Part 8 §8.2 asymmetry pitfall.

6.3 DPR style — two asymmetrically-trained encoders

from transformers import DPRQuestionEncoder, DPRContextEncoder, DPRQuestionEncoderTokenizer, DPRContextEncoderTokenizer

q_enc   = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
d_enc   = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
q_tok   = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
d_tok   = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

q_inputs = q_tok("currency conversion basis?", return_tensors="pt")
q_emb = q_enc(**q_inputs).pooler_output     # (1, 768)

d_inputs = d_tok("The quarterly sales report consolidates in USD after applying FX rates.", return_tensors="pt")
d_emb = d_enc(**d_inputs).pooler_output     # (1, 768)

sim = (q_emb * d_emb).sum(dim=-1).item()

Key: the two encoders have different weights. Feeding the same text to both yields different vectors. To avoid the Part 8 §8.2 trap, the query must go through q_enc and the doc through d_enc.


7. Variants

7.1 Symmetric vs Asymmetric — read the model card

  • What changes: are \(E_q\) and \(E_d\) the same?
  • Why use it: asymmetric helps when query and doc distributions differ (short query vs long doc).
  • What becomes possible: precise matching across length asymmetry.
  • Where it fits: domains with large query-doc style gaps (user questions vs formal documents).
  • Limits: must split function calls; missing the split causes the Part 8 §8.2 pitfall.

7.2 In-batch + hard negatives

  • What changes: use other batch pairs as negatives, plus pre-mined hard negatives.
  • Why use it: in-batch alone fails to distinguish similar documents. Hard negatives sharpen the decision boundary.
  • What becomes possible: retrieval that separates near-identical documents.
  • Where it fits: domain fine-tuning (legal, medical, in-house).
  • Limits: hard-negative mining is expensive; needs a self-training loop.
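
One common way to mine hard negatives is to retrieve with the current model and keep the top-ranked documents that are not labelled as answers. The sketch below assumes that approach; `mine_hard_negatives`, the label set, and the toy vectors are all hypothetical.

```python
import numpy as np

def mine_hard_negatives(query_emb, doc_embs, positive_ids, n_negatives=2):
    """Return indices of the highest-scoring documents that are not labelled positive."""
    scores = doc_embs @ query_emb
    ranked = np.argsort(-scores)          # best match first
    return [int(i) for i in ranked if int(i) not in positive_ids][:n_negatives]

# Toy corpus (made-up data).
doc_embs = np.array([
    [1.0, 0.0],      # 0: labelled positive
    [0.9, 0.436],    # 1: near-duplicate, not the answer
    [0.0, 1.0],      # 2: unrelated
    [0.7, 0.714],    # 3: related, not the answer
])
query_emb = np.array([1.0, 0.0])

hard = mine_hard_negatives(query_emb, doc_embs, positive_ids={0}, n_negatives=2)
print(hard)
```

The unrelated document never makes the list, which is the point: only the confusable ones are worth the extra training signal. Unlabelled true answers can slip in as false negatives, so mined sets usually need a sanity check.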

7.3 Multi-vector retrieval — ColBERT

  • What changes: each document has many vectors (one per token); each query token matches its nearest doc token (late interaction).
  • Why use it: avoids the single-vector compression loss.
  • What becomes possible: better partial-match accuracy in long documents.
  • Where it fits: long-doc corpora, accuracy-first.
  • Limits: the index grows tens of times larger, because every token stores its own vector. BGE-M3 includes a ColBERT mode.
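
Late interaction is simpler than it sounds: for each query token take the best-matching document token, then sum. A minimal NumPy sketch of that MaxSim scoring, assuming normalised per-token vectors (toy 2-D data, hypothetical function name):

```python
import numpy as np

def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """Late interaction: each query token takes its best doc-token match, then sum."""
    sim = query_tokens @ doc_tokens.T     # (n_query_tokens, n_doc_tokens)
    return float(sim.max(axis=1).sum())

# Toy token embeddings (made-up data).
query_tokens = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_a = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])   # covers both query tokens
doc_b = np.array([[1.0, 0.0], [0.8, 0.6]])               # covers only the first

score_a = maxsim_score(query_tokens, doc_a)
score_b = maxsim_score(query_tokens, doc_b)
print(score_a, score_b)
```

The document that matches both query tokens wins, even though a single pooled vector might have blurred the difference.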

7.4 Query expansion — paraphrase fallback

  • What changes: expand a user query into 3–5 paraphrases, retrieve each, merge.
  • Why use it: boosts recall for short or ambiguous queries.
  • What becomes possible: more stable retrieval via query rewriting.
  • Where it fits: chatbot RAG where queries are short and vague.
  • Limits: more calls per query (LLM + retrieval). Part 18 (Query Rewrite) covers this in depth.
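
The merge step can be sketched independently of how the paraphrases are generated. The example below assumes the paraphrases already exist and that a `retrieve` callable returns (doc_id, score) pairs. It keeps each document's best score, which is one possible merge rule among several; all names and the canned results are hypothetical.

```python
from typing import Callable

def retrieve_with_expansion(
    queries: list[str],
    retrieve: Callable[[str], list[tuple[str, float]]],
    k: int = 5,
) -> list[tuple[str, float]]:
    """Run every paraphrase through `retrieve` and keep each doc's best score."""
    best: dict[str, float] = {}
    for q in queries:
        for doc_id, score in retrieve(q):
            if score > best.get(doc_id, float("-inf")):
                best[doc_id] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)[:k]

# Stand-in retriever with canned results (made-up data).
canned = {
    "currency conversion basis?": [("doc_fx", 0.61), ("doc_policy", 0.20)],
    "which FX rate is used?": [("doc_fx", 0.70), ("doc_rates", 0.45)],
}

merged = retrieve_with_expansion(list(canned), canned.__getitem__, k=3)
print(merged)
```

Max-score merging assumes all paraphrases are scored by the same model, so the scores are comparable; rank-based fusion is the safer choice when they are not.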

7.5 Embedding cache — repeat queries

  • What changes: cache query embeddings keyed by hash.
  • Why use it: zero re-compute cost for FAQ-like repeated queries.
  • What becomes possible: lower latency and cost.
  • Where it fits: any RAG with Zipf-like user-query patterns.
  • Limits: model swap invalidates the entire cache; manage TTL.
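
A minimal in-process version of such a cache might look like the sketch below. It assumes the model name is folded into the hash key, so a model swap simply misses old entries, and it uses a plain dict with a TTL. The class name, the `embed_fn` callable, and the fake embedder are hypothetical; a production setup would more likely sit on Redis or similar.

```python
import hashlib
import time

class QueryEmbeddingCache:
    """Cache query embeddings keyed by a hash of (model name, query text)."""

    def __init__(self, embed_fn, model_name: str, ttl_seconds: float = 3600.0):
        self.embed_fn = embed_fn
        self.model_name = model_name
        self.ttl_seconds = ttl_seconds
        self._store = {}   # key -> (expires_at, embedding)

    def _key(self, query: str) -> str:
        raw = f"{self.model_name}\x00{query}".encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    def get(self, query: str):
        key = self._key(query)
        now = time.monotonic()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]                      # fresh cache hit
        emb = self.embed_fn(query)             # miss or expired: recompute
        self._store[key] = (now + self.ttl_seconds, emb)
        return emb

# Fake embedder that records how often it is called (illustration only).
calls = []
def fake_embed(text):
    calls.append(text)
    return [float(len(text))]

cache = QueryEmbeddingCache(fake_embed, model_name="BAAI/bge-m3")
first = cache.get("currency conversion basis?")
second = cache.get("currency conversion basis?")
print(len(calls))
```

The second lookup never reaches the embedder, which is where the latency and cost savings come from.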

8. Limits and Failure Modes

8.1 Weak on rare vocabulary and proper nouns

  • Why intrinsic: embeddings reward high-frequency words; rare tokens like SKU-2024-04 are blurred.
  • Diagnosis: recall on queries containing proper nouns or codes is meaningfully lower.
  • Mitigation: Hybrid with Part 11 BM25; Part 12 RRF.
  • Later part: Parts 11, 12.
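
As a preview of the RRF mitigation, here is a sketch of Reciprocal Rank Fusion over a dense ranking and a BM25 ranking. It assumes the commonly used constant k = 60 and works on ranks only, so the two score scales never need to be reconciled. The document ids and rankings are invented.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists: each doc scores sum(1 / (k + rank)) over the lists it appears in."""
    fused: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# Made-up rankings for one query.
dense_ranking = ["doc_fx", "doc_policy", "doc_sku"]
bm25_ranking = ["doc_sku", "doc_fx", "doc_policy"]

fused = reciprocal_rank_fusion([dense_ranking, bm25_ranking])
print([doc_id for doc_id, _ in fused])
```

A document that Dense ranks last but BM25 ranks first still ends up near the top, which is exactly the rescue that code-like queries need.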

8.2 Flat top-K — low confidence

  • Why intrinsic: ambiguous queries or corpora with many similar chunks produce uniformly close top-K scores.
  • Diagnosis: top-1 minus top-10 gap < 0.1 (cosine normalised).
  • Mitigation: query rewrite (Part 18), reranker (Part 13), signal low confidence in the answer.
  • Later part: Parts 13, 18, 19.

8.3 Weak handling of negation and antonyms

  • Why intrinsic: "not an exception" and "is an exception" are similar in embedding space — vector distance struggles with meaning reversal.
  • Diagnosis: negation queries returning the opposite answer in top-K.
  • Mitigation: explicit semantic verification in answer generation; Hybrid's BM25 signal partially helps (high-frequency negation tokens).
  • Later part: Part 22 (answer verification).

8.4 Language drift — outside training distribution

  • Why intrinsic: English-heavy models blur on Korean queries.
  • Diagnosis: compare retrieval quality across different models on the same corpus.
  • Mitigation: Part 8's decision table — Korean share ≥ 30% → BGE-M3 or Upstage.
  • Later part: Part 8 cross-ref.

8.5 No notion of recency

  • Why intrinsic: Dense scores semantic similarity only. Time-sensitive queries ("this quarter's policy") may surface old chunks at the top.
  • Diagnosis: answers citing outdated versions.
  • Mitigation: Part 7 version=latest pre-filter, or created_at boost (post-filter sigmoid weighting).
  • Later part: Parts 22, 7 cross-ref.
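
The created_at boost can be sketched as a post-retrieval rescoring step. The version below is one possible shape, not a prescribed formula: a sigmoid over document age, blended into the similarity with a weight `alpha`. The parameter names (`midpoint_days`, `steepness_days`, `alpha`), their defaults, and the sample hits are all assumptions for illustration.

```python
import math
from datetime import datetime, timezone

def recency_weight(created_at, now, midpoint_days=180.0, steepness_days=30.0):
    """Sigmoid weight: close to 1 for fresh docs, close to 0 well past the midpoint."""
    age_days = (now - created_at).total_seconds() / 86400.0
    z = min((age_days - midpoint_days) / steepness_days, 700.0)   # avoid exp overflow
    return 1.0 / (1.0 + math.exp(z))

def rerank_by_recency(hits, now, alpha=0.3):
    """hits: list of (doc_id, similarity, created_at). Blend similarity with recency."""
    rescored = [
        (doc_id, sim * ((1.0 - alpha) + alpha * recency_weight(created_at, now)))
        for doc_id, sim, created_at in hits
    ]
    return sorted(rescored, key=lambda item: item[1], reverse=True)

# Made-up hits: the older chunk is slightly more similar to the query.
now = datetime(2025, 1, 1, tzinfo=timezone.utc)
hits = [
    ("policy_v1", 0.62, datetime(2022, 1, 1, tzinfo=timezone.utc)),
    ("policy_v3", 0.60, datetime(2024, 12, 1, tzinfo=timezone.utc)),
]

reranked = rerank_by_recency(hits, now)
print([doc_id for doc_id, _ in reranked])
```

With `alpha` at 0.3 a stale document keeps at least 70% of its similarity, so recency nudges the ranking rather than overriding relevance; a hard version=latest pre-filter is the stricter alternative.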

8.6 Common Pitfalls

  • "Dense replaces BM25." §8.1. Proper nouns and codes still need BM25; Hybrid is the production standard.
  • "Use top-K absolute scores as thresholds." Scores vary by model and corpus; read the gap and distribution.
  • "Embed query and doc with the same function." Part 8 §8.2. The classic asymmetric-model mistake.
  • "A trained model generalises across domains." Domain fine-tuning makes a large difference. §7.2.
  • "Top-K = 5 is enough." A reranker (Part 13) needs 30–100 candidates to choose from.

9. Settled Conclusions

Q1. What is the bi-encoder asymmetry and why does it matter?

\(E_q \ne E_d\). Trained to reflect query/doc distribution differences. Embedding both with the same function decouples the vector spaces — recall collapses. Chapter: §4, §7.1, Part 8 §8.2.

Q2. State the DPR InfoNCE loss in one line.

The negative log of a softmax: the correct-pair similarity in the numerator, and the sum over the correct pair plus all negatives in the denominator. Pull correct, push negatives. Chapter: §5.

Q3. What does the top-K gap tell you?

Retrieval confidence. Large gap → clear winner; small gap → flat (ambiguous query or oversized chunks). Chapter: §5, §8.2.

Q4. Summarise Dense's five hard limits in two words each.

Rare vocabulary, flat top-K, negation handling, language drift, no recency. Chapter: §8.

Q5. Why is Dense + BM25 the standard combination?

Dense is strong on semantics; BM25 is strong on exact and proper-noun matches. Their weaknesses are complementary. Chapter: §8.1, Part 12 cross-ref.


10. Further Reading

Primary

  • Karpukhin, V. et al. Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020. arXiv:2004.04906.
  • Khattab, O., Zaharia, M. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. SIGIR 2020. arXiv:2004.12832.
  • Reimers, N., Gurevych, I. Sentence-BERT. EMNLP 2019. arXiv:1908.10084.
  • Xiong, L. et al. Approximate Nearest Neighbor Negative Contrastive Learning (ANCE). ICLR 2021. arXiv:2007.00808.
  • Lewis, P. et al. Retrieval-Augmented Generation. NeurIPS 2020. arXiv:2005.11401.

Official docs

  • DPR (HuggingFace): https://huggingface.co/docs/transformers/model_doc/dpr
  • LangChain Retrievers: https://python.langchain.com/docs/concepts/retrievers/
  • ColBERT v2: https://github.com/stanford-futuredata/ColBERT
  • Sentence-Transformers: https://sbert.net/

Supporting

  • Author note Chapter 9 — Dense Retrieval.
  • Author note Chapter 35 §2 — Model-aware Ingestion (asymmetric implications).

Cheat Sheet

| Knob | Default / Recommendation |
|---|---|
| Top-K | 5–50 (30–100 if a reranker follows) |
| Similarity | Normalise + cosine (or normalised dot) |
| Symmetric / Asymmetric | Check model card and split calls accordingly |
| Negatives | In-batch + hard (for fine-tuning) |
| Gap diagnostic | \(s_1 - s_{10} \ge 0.1\) |
| Cache | Hash query → embedding; invalidate on model swap |
| BM25 combination | Production default: Hybrid (Part 12) |

One-liner: semantic similarity → Dense, exact matching → BM25, final ranking → Reranker. The three together form RAG's standard search.


Bridge — What's Next

Next — RAG Core Study (11/26) — Sparse Retrieval & BM25 Deep Dive.

The retrieval that fills Dense's proper-noun and exact-match gap is Sparse — TF-IDF, BM25 (Robertson 2009), inverted indices, and Korean morphological analyzers. We will also cover Elasticsearch/OpenSearch integration patterns for RAG.
