"RAG Core Study (11/26) — Sparse Retrieval & BM25 Deep Dive"
Where Dense is weak — proper nouns, codes, exact strings — Sparse takes over. BM25 is its standard.
Sparse Retrieval represents documents as term-frequency vectors (mostly zeros) and finds answers by keyword overlap. Plain in appearance, but TF-IDF's information weighting, BM25's length normalisation, and the Inverted Index's speed combine to beat Dense on proper nouns, codes, and abbreviations. Part 11 unpacks Robertson 2009's BM25 formula, the Elasticsearch/OpenSearch integration patterns for RAG, and why morphological analyzers are decisive in Korean.
0. Prerequisites
- Part 10 Dense Retrieval — the complementarity of the two methods.
- Part 5 chunking — how chunks meet the tokeniser unit.
- Part 7 metadata — pre-filter integration.
1. Learning Objectives
- Read TF, IDF, and BM25 from formula and intuition.
- Explain why an Inverted Index is fast.
- Explain why a morphological analyzer is decisive for Korean BM25.
- Know BM25's limits and where Part 12's Hybrid fits.
2. Key Summary
Sparse Retrieval represents text as sparse term-frequency vectors. TF-IDF = frequency × informativeness. BM25 (Robertson & Zaragoza 2009) adds document-length normalisation and TF saturation to TF-IDF and has become the practical standard. An Inverted Index (token → \([(doc\_id, tf), ...]\)) lets a query read only the posting lists of its own tokens, so cost follows the number of query tokens and the length of their posting lists rather than a scan of the whole corpus. Elasticsearch and OpenSearch are the operational standard; RAG typically uses BM25 candidates → Reranker, or a Dense Hybrid. For Korean, a morphological analyzer (Mecab, Kiwi, Nori) is required; without it, particles and endings stay glued to the stem and matching breaks. Particle removal plus stemming is the minimum hygiene for Korean BM25.
3. Intuition — BM25 catches what Dense misses
Query: "Section 5 conclusions of the PR-2024-Q3 report?"
In the same corpus:
- Dense: the embedding for "PR-2024-Q3" blurs (rare token); top-K scatters across other reports with similar themes.
- BM25: chunks containing the exact token "PR-2024-Q3" score very high; top-1 is the target chunk itself.
Opposite case: "currency conversion basis" — BM25 misses chunks that say "FX rates" instead (Part 10 §3). The two methods are strong on different query types.
4. Definitions — Sparse Core Terms
| Term | Definition |
|---|---|
| TF (Term Frequency) | Token count inside a document |
| IDF (Inverse Document Frequency) | The rarer a token in the corpus, the higher its weight |
| TF-IDF | \(\text{TF} \times \text{IDF}\). High for tokens that are frequent in the document but rare in the corpus |
| BM25 | TF-IDF + document-length normalisation + TF saturation. The standard since the late 1990s |
| Inverted Index | Token → \([(doc\_id, tf), ...]\) dictionary. Query touches only query-token lookups |
| Stemming / Lemmatisation | Reduce surface forms to a stem (English: running → run; Korean: 갑니다 → 가다) |
| Morphological Analyzer | Splits text into morphemes and tags each with its part of speech. Mandatory for Korean and Japanese BM25 |
| Stopwords | Low-information tokens ("the", "of", "는", "이"). Usually removed |
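The Inverted Index row can be made concrete with a minimal in-memory sketch. It assumes plain whitespace tokenisation, and the names build_inverted_index and candidates are illustrative rather than taken from any library.

```python
from collections import Counter, defaultdict

def build_inverted_index(docs):
    """Map each token to a posting list of (doc_id, tf) pairs."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for token, tf in Counter(text.lower().split()).items():
            index[token].append((doc_id, tf))
    return index

def candidates(index, query):
    """Collect doc ids by reading only the posting lists of the query tokens."""
    hits = set()
    for token in query.lower().split():
        hits.update(doc_id for doc_id, _ in index.get(token, []))
    return hits

docs = [
    "fx rates applied to the sales report",
    "inventory report for the incheon warehouse",
    "exception filings require approval",
]
index = build_inverted_index(docs)
print(index["report"])                  # posting list for "report"
print(candidates(index, "fx report"))   # docs sharing at least one query token
```

The third document is never touched by the query: none of its tokens appear in the query, so its postings are never read.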
5. Math — BM25 unpacked
For document \(D\) and query \(Q = \{q_1, ..., q_n\}\):
$$\text{BM25}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}$$
Each piece:
- \(f(q_i, D)\) = TF of token \(q_i\) in document \(D\)
- \(\text{IDF}(q_i) = \log\left(\frac{N - n_i + 0.5}{n_i + 0.5} + 1\right)\), \(N\) = total documents, \(n_i\) = docs containing \(q_i\)
- \(|D|\) = document length, \(\text{avgdl}\) = corpus average length
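- \(k_1\) = TF-saturation parameter (1.2–2.0 typical). Higher → saturation sets in later, so repeated occurrences keep adding score for longer.
- \(b\) = length-normalisation parameter (0.75 typical; 0 disables normalisation, 1 applies it fully). Higher → longer-than-average documents are penalised more, so short documents are favoured.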
TF saturation intuition: 10 → 11 occurrences add far less than 1 → 2. Prevents repeat-spam from dominating.
Length-normalisation intuition: a token appearing once in a short document matters more than once in a long document.
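The formula and both intuitions can be checked with a small from-scratch scorer. This is a sketch under stated assumptions: the function name bm25_scores and the three-document corpus are made up for illustration, and the IDF is the variant given above.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.2, b=0.75):
    """Score every tokenised document in corpus_tokens against query_tokens."""
    n_docs = len(corpus_tokens)
    avgdl = sum(len(d) for d in corpus_tokens) / n_docs
    # n_i: number of documents that contain each token
    doc_freq = Counter(tok for d in corpus_tokens for tok in set(d))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        norm = k1 * (1 - b + b * len(doc) / avgdl)   # length-normalised k1
        score = 0.0
        for q in query_tokens:
            n_i = doc_freq.get(q, 0)
            idf = math.log((n_docs - n_i + 0.5) / (n_i + 0.5) + 1)
            score += idf * tf[q] * (k1 + 1) / (tf[q] + norm)
        scores.append(score)
    return scores

corpus = [
    "fx rates fx rates fx rates".split(),
    "fx rates applied to the quarterly sales report".split(),
    "inventory count for the incheon warehouse".split(),
]
scores = bm25_scores("fx rates".split(), corpus)
print([round(s, 3) for s in scores])
```

The first document repeats each query token three times, yet its score stays well under three times that of the second document. That is TF saturation at work.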
6. Walkthrough — From rank_bm25 to Elasticsearch
6.1 rank_bm25 one-liner (Python, English)
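import re

from rank_bm25 import BM25Okapi

docs = [
    "The quarterly sales report consolidates in USD after applying FX rates.",
    "Inventory SKU-2024-04: 30 units in the Incheon warehouse.",
    "Exception filings require department-head approval and form SEC-EX-04.",
]

def tokenize(text):
    # Lowercase and keep hyphenated codes whole ("sku-2024-04"); a bare
    # .split() would leave the colon attached ("sku-2024-04:") and miss the match.
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

tokenized = [tokenize(d) for d in docs]
bm25 = BM25Okapi(tokenized)
query = tokenize("SKU-2024-04 inventory")
scores = bm25.get_scores(query)  # one score per document, in docs order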
6.2 Elasticsearch for the same search
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Keyword arguments (mappings=, document=, query=) follow the 8.x Python client.
es.indices.create(index="rag", mappings={
    "properties": {
        # English standard analyzer. It splits "SKU-2024-04" into sku / 2024 / 04;
        # see §7.5 for keeping codes as a single token.
        "chunk_text": {"type": "text", "analyzer": "standard"},
        "version": {"type": "keyword"},
        "security_level": {"type": "keyword"},
    }
})

for i, d in enumerate(docs):
    es.index(index="rag", id=f"c{i:03d}", document={
        "chunk_text": d,
        "version": "3.2",
        "security_level": "internal",
    })

# Make the freshly indexed documents visible to search.
es.indices.refresh(index="rag")

results = es.search(
    index="rag",
    query={
        "bool": {
            # BM25-scored full-text match
            "must": [{"match": {"chunk_text": "SKU-2024-04 inventory"}}],
            # Metadata pre-filters: exact keyword matches, not scored
            "filter": [
                {"term": {"version": "3.2"}},
                {"terms": {"security_level": ["public", "internal"]}},
            ],
        }
    },
    size=5,
)
6.3 Korean — Nori analyzer is decisive
es.indices.create(index="rag_ko", body={
"settings": {"analysis": {"analyzer": {
"ko_analyzer": {
"type": "custom",
"tokenizer": "nori_tokenizer",
"filter": ["nori_part_of_speech", "lowercase"],
}
}}},
"mappings": {"properties": {
"chunk_text": {"type": "text", "analyzer": "ko_analyzer"},
"version": {"type": "keyword"},
}}
})
es.index(index="rag_ko", id="c001", document={
"chunk_text": "๋ถ๊ธฐ ๋งค์ถ ๋ณด๊ณ ๋ ํ์จ ์ ์ฉ ํ USD๋ก ํตํฉํ๋ค.",
"version": "3.2",
})
results = es.search(index="rag_ko", body={
"query": {"match": {"chunk_text": "ํ์จ ์ ์ฉ ํตํ ํ์ฐ"}}
})
Nori is Elastic's official Korean morphological analyzer, shipped as the analysis-nori plugin. The nori_part_of_speech filter strips particles and endings by default; in practice that single line accounts for a large share of Korean BM25 quality.
6.4 Combining BM25 with Dense (Hybrid preview)
bm25_hits = es.search(...)
dense_hits = vectordb.similarity_search(...)

def rrf(rankings, k=60):
    # rankings: one ranked list of doc ids per retriever, best first
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # each list contributes 1 / (k + rank) to the document's fused score
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    return sorted(scores.items(), key=lambda x: -x[1])
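A usage sketch for the fusion step, repeating the rrf function so the block runs on its own. The doc ids and both input rankings are hypothetical; in practice they would be extracted from the BM25 and dense search results.

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: sum 1 / (k + rank) over every ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    return sorted(scores.items(), key=lambda x: -x[1])

# Hypothetical ranked doc ids, best first.
bm25_ranking = ["c002", "c001", "c003"]
dense_ranking = ["c001", "c003", "c002"]

fused = rrf([bm25_ranking, dense_ranking])
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

Only ranks enter the fusion, so the BM25 and dense score scales never have to be reconciled.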
7. Variants
7.1 BM25F — Multi-field weighting
- What changes: one document with title, body, tags — weight per field.
- Why use it: a title match is a much stronger signal than a body match.
- What becomes possible: precise field-weighted search.
- Where it fits: structured documents (products, blog posts, papers).
- Limits: weight tuning needed; pair with an evaluation set (Part 14).
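A simplified sketch of the BM25F idea for one query term: per-field term frequencies are length-normalised, weighted, and summed into one pseudo-frequency, and saturation is applied once to that sum. The field weights, lengths, and the function name bm25f_term_score are assumptions for illustration, and a single b is shared across fields where a full implementation would normally tune it per field.

```python
import math

def bm25f_term_score(field_tf, field_len, avg_field_len, weights, idf, k1=1.2, b=0.75):
    """Score one query term over several fields, BM25F style."""
    pseudo_tf = 0.0
    for field, tf in field_tf.items():
        length_norm = 1 - b + b * field_len[field] / avg_field_len[field]
        pseudo_tf += weights[field] * tf / length_norm
    # Saturation is applied once, after the fields are combined.
    return idf * pseudo_tf / (k1 + pseudo_tf)

weights = {"title": 3.0, "body": 1.0}      # hypothetical field weights
avg_len = {"title": 8, "body": 200}        # hypothetical average field lengths
lengths = {"title": 8, "body": 200}        # this document sits at the average
idf = math.log((1000 - 10 + 0.5) / (10 + 0.5) + 1)   # term in 10 of 1,000 docs

title_hit = bm25f_term_score({"title": 1, "body": 0}, lengths, avg_len, weights, idf)
body_hit = bm25f_term_score({"title": 0, "body": 1}, lengths, avg_len, weights, idf)
print(round(title_hit, 3), round(body_hit, 3))
```

With these weights a single title match outscores a single body match. Whether 3.0 is the right ratio is what the evaluation set has to decide.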
7.2 SPLADE — Learned sparse retrieval
- What changes: learned token weights replace BM25's statistical weights.
- Why use it: captures some semantic signal while preserving sparse interpretability.
- What becomes possible: a middle ground between BM25 and Dense.
- Where it fits: integrated tooling (BGE-M3's sparse mode, SPLADE++).
- Limits: learned weights mean a variable inverted index — more complex than plain ES.
7.3 Elasticsearch + vector field — single-system integration
- What changes: ES 8.x supports dense_vector fields alongside BM25 in the same index.
- Why use it: avoids running two systems.
- What becomes possible: BM25 and Dense in one query DSL.
- Where it fits: existing ES shops adding RAG.
- Limits: ES HNSW is slightly weaker than dedicated vector DBs (Qdrant); split at scale (> 100M).
7.4 Korean analyzers — Mecab vs Kiwi vs Nori
- What changes: accuracy, speed, and ops burden.
- Why use it: BM25's token unit depends on the analyzer.
- What becomes possible: decides Korean RAG retrieval quality.
- Where it fits:
- Mecab: fastest (C++); user-dictionary maintenance burden.
- Kiwi: Python-friendly, strong on Korean neologisms and proper nouns.
- Nori: official Elasticsearch plugin (analysis-nori), simple ops.
- Limits: changing analyzers means full re-index.
7.5 BM25 + domain dictionary — proper-noun pre-registration
- What changes: register SKU codes, abbreviations, in-house terms in the analyzer's user dictionary.
- Why use it: prevents default analyzers from mis-splitting proper nouns.
- What becomes possible: codes like PR-2024-Q3 stay as a single token.
- Where it fits: internal-corpus RAG.
- Limits: dictionary maintenance; manual updates for new terms.
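The effect of a user dictionary can be sketched with a toy tokeniser. This is not how Nori or Mecab dictionaries are implemented; the tokenize function and its user_dictionary argument are hypothetical and only show protected terms surviving as single tokens.

```python
import re

def tokenize(text, user_dictionary=()):
    """Lowercase and split on non-alphanumerics, keeping dictionary terms whole."""
    # Longest terms first, so a longer code wins over a shorter prefix.
    protected = sorted((t.lower() for t in user_dictionary), key=len, reverse=True)
    pattern = "|".join(re.escape(t) for t in protected)
    pattern = f"{pattern}|[a-z0-9]+" if pattern else "[a-z0-9]+"
    return re.findall(pattern, text.lower())

text = "Section 5 conclusions of the PR-2024-Q3 report"
print(tokenize(text))                                  # code split into pr / 2024 / q3
print(tokenize(text, user_dictionary=["PR-2024-Q3"]))  # code kept as one token
```

Once the code is a single token it is rare in the corpus, so its IDF is high and an exact match dominates the score.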
8. Limits and Failure Modes
8.1 Weak on synonyms and semantic matches
- Why intrinsic: BM25 is string-level. It does not know "FX rates" ↔ "currency conversion".
- Diagnosis: synonym-driven queries show recall significantly lower than Dense.
- Mitigation: Hybrid with Dense (Part 12); or register a synonym dictionary in the analyzer.
- Later part: Part 12.
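The synonym-dictionary mitigation amounts to query expansion before the BM25 lookup. A minimal sketch, assuming a hand-maintained token-level table; the entries and the name expand_query are illustrative. In Elasticsearch the same effect is normally configured with a synonym token filter in the analyzer.

```python
def expand_query(tokens, synonyms):
    """Append the synonyms of every query token, keeping order and dropping repeats."""
    expanded = []
    for token in tokens:
        for candidate in [token, *synonyms.get(token, [])]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded

# Hypothetical synonym table.
synonyms = {
    "currency": ["fx"],
    "conversion": ["rates", "exchange"],
}

query = "currency conversion basis".split()
expanded_query = expand_query(query, synonyms)
print(expanded_query)
```

After expansion, a chunk that says "FX rates" shares tokens with the query "currency conversion basis" and becomes retrievable by BM25.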
8.2 Korean particles and endings break tokens
- Why intrinsic: without an analyzer, "환율은", "환율이", "환율을" (the noun 환율, "exchange rate", followed by three different particles) are three different tokens; BM25 treats them as unrelated.
- Diagnosis: the same noun in different inflected forms fragments top-K.
- Mitigation: morphological analyzer with stem extraction (Mecab/Kiwi/Nori).
- Later part: Extra D (Korean RAG).
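The fragmentation is easy to reproduce with a toy suffix stripper. This is deliberately not a morphological analyzer: the PARTICLES list and the strip_particle helper are hypothetical and cover only a handful of particles.

```python
# Toy illustration only: real analyzers (Mecab, Kiwi, Nori) use dictionaries and
# part-of-speech tagging, not a suffix list like this.
PARTICLES = ("은", "는", "이", "가", "을", "를")   # a few common particles

def strip_particle(token):
    """Drop one trailing particle if the token is longer than the particle."""
    for particle in PARTICLES:
        if token.endswith(particle) and len(token) > len(particle):
            return token[: -len(particle)]
    return token

surface_forms = ["환율은", "환율이", "환율을"]
stems = {strip_particle(t) for t in surface_forms}
print(set(surface_forms))   # three distinct tokens without analysis
print(stems)                # a single shared stem after stripping
```

A suffix list like this also mangles nouns that happen to end in one of these syllables (고양이, "cat", would lose its last syllable). That is why a dictionary-based analyzer is needed in practice.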
8.3 High-frequency stopword pollution
- Why intrinsic: even with IDF, stopwords like "the", "of", "는", "이" are not fully zeroed and add noise to short queries.
- Diagnosis: top-K scatters on stopword matches.
- Mitigation: register stopword lists in the analyzer (ES stop filter).
- Later part: Part 16 ops hygiene.
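The "not fully zeroed" point can be checked numerically with the IDF variant from §5. The corpus size, document frequencies, and the two-word stopword list are assumptions chosen for illustration.

```python
import math

def idf(n_docs, doc_freq):
    """IDF variant from §5: log((N - n + 0.5) / (n + 0.5) + 1)."""
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1)

N = 10_000
idf_stopword = idf(N, 9_500)   # token present in 95% of documents
idf_rare = idf(N, 5)           # token present in 5 documents
print(f"stopword: {idf_stopword:.4f}   rare: {idf_rare:.4f}")

# Removing stopwords before scoring takes the residual weight out entirely.
STOPWORDS = {"the", "of"}      # illustrative list, not a complete one
query = "the basis of conversion".split()
filtered = [t for t in query if t not in STOPWORDS]
print(filtered)
```

The stopword weight is small but positive. In a two- or three-token query, it can still reorder documents that match nothing else.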
8.4 Extreme document-length variance
- Why intrinsic: BM25's \(b\) normalises length, but extreme variance (10 tokens vs 10K tokens) leaves the curve unsatisfying for both ends.
- Diagnosis: short chunks dominate (or sink) top-K consistently.
- Mitigation: normalise chunk size (Part 5); tune BM25 per corpus.
- Later part: Part 5 cross-ref; Part 16 (tuning).
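How much \(b\) stretches the two extremes can be seen from the length term alone. A small sketch using the 10-token and 10K-token lengths from the text; the corpus average of 500 tokens and the alternative b values are assumptions.

```python
def length_factor(doc_len, avgdl, b):
    """The length term inside BM25's denominator: 1 - b + b * |D| / avgdl."""
    return 1 - b + b * doc_len / avgdl

avgdl = 500   # assumed corpus average length in tokens
for b in (0.75, 0.3, 0.0):
    short = length_factor(10, avgdl, b)
    long_ = length_factor(10_000, avgdl, b)
    print(f"b={b}: short={short:.3f}  long={long_:.3f}  ratio={long_ / short:.1f}")
```

Lowering b narrows the gap between the two extremes, at the cost of weaker normalisation for ordinary documents. Normalising chunk size upstream avoids that trade-off.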
8.5 Typos and surface-form variance
- Why intrinsic: BM25 matches surface form. "AI" and "에이아이" (its Korean transliteration) are distinct tokens; "Pinecone" and "pinecone" match only after lowercasing.
- Diagnosis: same-meaning, different-spelling queries miss.
- Mitigation: folding/normalisation filters in the analyzer; or n-gram matching.
- Later part: Part 12 Hybrid.
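Both mitigations can be sketched with the standard library: a folding step that removes case, width, and accent differences, and character n-grams that let a misspelt query still overlap with the indexed form. The function names and the trigram size are illustrative choices, not an Elasticsearch API.

```python
import unicodedata

def fold(text):
    """Normalise compatibility forms, strip accents, and case-fold."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return no_marks.casefold()

def char_ngrams(text, n=3):
    """Character n-grams of the folded text."""
    folded = fold(text)
    return {folded[i:i + n] for i in range(len(folded) - n + 1)}

def ngram_overlap(a, b, n=3):
    """Jaccard overlap between the n-gram sets of two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)

print(fold("Pinecone") == fold("ＰＩＮＥＣＯＮＥ"))   # full-width vs ASCII
print(ngram_overlap("Pinecone", "Pincone"))        # a typo still overlaps
print(ngram_overlap("Pinecone", "Qdrant"))         # unrelated strings do not
```

Folding and n-grams do not help with transliterations such as "AI" versus "에이아이". Those need a synonym entry or the Dense side of a hybrid.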
8.6 Common Pitfalls
- "BM25 is old, skip it." — Part 10 §3, this part §3. Decisive for proper nouns and codes.
- "Use the default ES analyzer on Korean." §8.2. Nori/Mecab/Kiwi are required.
- "BM25 needs no tuning." \(k_1\), \(b\), stopwords, analyzer — all need corpus tuning.
- "Hybrid = weighted sum of Dense and BM25." — Part 12. RRF is the standard; raw weighted sums break because of score-scale mismatch.
- "In-house abbreviations work without a user dictionary." — §7.5. Dictionary entries decide hit rate.
9. Settled Conclusions
Q1. What two improvements does BM25 add to TF-IDF?
Document-length normalisation (\(b\)) and TF saturation (\(k_1\)). Chapter: §5.
Q2. On which queries is BM25 stronger than Dense?
Proper nouns, codes, abbreviations, exact phrases, rare vocabulary — tokens whose embeddings blur. Chapter: §3, §8.1; Part 10 §8.1.
Q3. Why is an inverted index fast?
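Token → posting list of (doc_id, tf). A query reads only the posting lists of its own tokens, so cost follows the number of query tokens and the length of their posting lists, not a scan of the corpus. Chapter: §4, §5.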
Q4. The two minimum-hygiene steps for Korean BM25?
Morphological analyzer with particle removal + stemming; user dictionary with in-house proper nouns. Chapter: §6.3, §7.5, §8.2.
Q5. The standard way to combine BM25 and Dense?
Take top-K' from each, fuse with Reciprocal Rank Fusion. Raw weighted sums fail due to score-scale mismatch. Chapter: §6.4; Part 12 cross-ref.
10. Further Reading
Primary
- Robertson, S., Zaragoza, H. The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in IR 2009.
- Robertson, S. et al. Okapi at TREC-3. NIST 1994 (the original BM25 proposal).
- Formal, T. et al. SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking. SIGIR 2021. arXiv:2107.05720.
- Lin, J. A Few Brief Notes on DeepImpact, COIL, and a Conceptual Framework for Information Retrieval Techniques. arXiv:2106.14807 (2021).
- Park, E. Mecab-ko: the Korean fork of MeCab (de facto Korean morphology standard).
Official docs
- Elasticsearch BM25: https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html
- OpenSearch BM25: https://opensearch.org/docs/latest/query-dsl/full-text/match/
- Elasticsearch Nori: https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-nori.html
- rank-bm25 (Python): https://github.com/dorianbrown/rank_bm25
- Kiwi: https://github.com/bab2min/Kiwi
Supporting
- Author note Chapter 10 — BM25.
- Author note Chapter 35 §6 — Retriever Routing (separating and combining BM25 with Dense).
Cheat Sheet
| Knob | Default / Recommendation |
|---|---|
| \(k_1\) | 1.2–2.0 (Elasticsearch default 1.2) |
| \(b\) | 0.75 (stable across most corpora) |
| Stopwords | English standard; Korean uses analyzer dictionaries |
| Analyzer (Korean) | Nori (official ES plugin) > Kiwi (Python-friendly) > Mecab (fastest) |
| User dictionary | Register in-house abbreviations and proper nouns always |
| Candidate count | top 50–200 before Hybrid / Reranker |
| Hybrid combination | RRF (k=60 default) — Part 12 |
One-liner: BM25 = exact match for proper nouns and codes; Dense = semantic similarity. For Korean, the analyzer + user dictionary deliver half of BM25's quality.
Bridge — What's Next
Next — RAG Core Study (12/26) — Hybrid Search & Score Fusion.
How do we combine Dense and BM25? Part 12 covers Reciprocal Rank Fusion (Cormack 2009), Weighted Fusion, Score Normalisation, and the standard weight-tuning experiment. Their division of labour lifts RAG accuracy a clear step.