"RAG Core Study (11/26) — Sparse Retrieval & BM25 Deep Dive"

May 17, 2026

Where Dense is weak — proper nouns, codes, exact strings — Sparse takes over. BM25 is its standard.

Sparse Retrieval represents documents as term-frequency vectors (mostly zeros) and finds answers by keyword overlap. Plain in appearance, but TF-IDF's information weighting, BM25's length normalisation, and the Inverted Index's speed combine to beat Dense on proper nouns, codes, and abbreviations. Part 11 unpacks Robertson 2009's BM25 formula, the Elasticsearch/OpenSearch integration patterns for RAG, and why morphological analyzers are decisive in Korean.

0. Prerequisites

Part 10 Dense Retrieval — the complementarity of the two methods.
Part 5 chunking — how chunks meet the tokeniser unit.
Part 7 metadata — pre-filter integration.

1. Learning Objectives

Read TF, IDF, and BM25 from formula and intuition.
Explain why an Inverted Index is fast.
Explain why a morphological analyzer is decisive for Korean BM25.
Know BM25's limits and where Part 12's Hybrid fits.

2. Key Summary

Sparse Retrieval uses sparse term-frequency vectors. TF-IDF = frequency × information. BM25 (Robertson & Zaragoza 2009) adds document-length normalisation and TF saturation to TF-IDF and has become the practical standard. An Inverted Index (token → $[(doc\_id, tf), ...]$) lets a query read only the postings lists of its own tokens, so cost tracks the number of query tokens and the length of their postings rather than the full corpus size. Elasticsearch and OpenSearch are the operational standard; in RAG, BM25 supplies candidates that go to a Reranker or into a Dense Hybrid. For Korean, a morphological analyzer (Mecab, Kiwi, Nori) is required; without it, particles and endings stay attached to the stem and matching breaks. Particle removal plus stemming is the minimum hygiene for Korean BM25.

3. Intuition — BM25 catches what Dense misses

Query: "Section 5 conclusions of the PR-2024-Q3 report?"

In the same corpus:

Dense: the embedding for "PR-2024-Q3" blurs (rare token); top-K scatters across other reports with similar themes.
BM25: chunks containing the exact token "PR-2024-Q3" score very high; top-1 is the target chunk itself.

Opposite case: "currency conversion basis" — BM25 misses chunks that say "FX rates" instead (Part 10 §3). The two methods are strong on different query types.

4. Definitions — Sparse Core Terms

Term	Definition
TF (Term Frequency)	Token count inside a document
IDF (Inverse Document Frequency)	The rarer a token in the corpus, the higher its weight
TF-IDF	$\text{TF} \times \text{IDF}$. Tokens that are common in the doc but rare in the corpus
BM25	TF-IDF + document-length normalisation + TF saturation. The standard since the mid-1990s
Inverted Index	Token → $[(doc\_id, tf), ...]$ dictionary. Query touches only query-token lookups
Stemming / Lemmatisation	Reduce surface forms to a stem (English: running → run; Korean: 갑니다 → 가다)
Morphological Analyzer	Morphology pass. Mandatory for Korean and Japanese BM25
Stopwords	Low-information tokens ("the", "of", "는", "이"). Usually removed
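
The Inverted Index row can be made concrete with a minimal sketch. The helper names `build_inverted_index` and `candidate_docs`, the sample sentences, and the whitespace tokeniser are assumptions for illustration; this is not how Lucene stores postings internally.

```python
from collections import Counter, defaultdict

def build_inverted_index(docs):
    """Map each token to a postings list of (doc_id, tf) pairs."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for token, tf in Counter(text.lower().split()).items():
            index[token].append((doc_id, tf))
    return index

def candidate_docs(index, query):
    """Collect doc ids by reading only the postings of the query tokens."""
    hits = set()
    for token in query.lower().split():
        hits.update(doc_id for doc_id, _ in index.get(token, []))
    return hits

docs = [
    "fx rates applied to quarterly sales",
    "inventory count for the incheon warehouse",
    "quarterly inventory audit",
]
index = build_inverted_index(docs)
print(index["quarterly"])
print(candidate_docs(index, "inventory audit"))
```

The lookup never scans documents that share no token with the query, which is where the speed comes from.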

5. Math — BM25 unpacked

For document $D$ and query $Q = \{q_1, ..., q_n\}$:

$$\text{BM25}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}$$

Each piece:

$f(q_i, D)$ = TF of token $q_i$ in document $D$
$\text{IDF}(q_i) = \log\left(\frac{N - n_i + 0.5}{n_i + 0.5} + 1\right)$, $N$ = total documents, $n_i$ = docs containing $q_i$
$|D|$ = document length, $\text{avgdl}$ = corpus average length
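$k_1$ = TF-saturation parameter (1.2–2.0 typical). Higher → saturation sets in later, so repeated occurrences keep adding score for longer.
$b$ = length-normalisation parameter (0.75 typical; 0 disables it, 1 applies it fully). Higher → long documents are penalised more and short ones favoured.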

TF saturation intuition: 10 → 11 occurrences add far less than 1 → 2. Prevents repeat-spam from dominating.

Length-normalisation intuition: a token appearing once in a short document matters more than once in a long document.
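
Both intuitions can be checked numerically. The sketch below evaluates only the fraction inside the BM25 sum for a single token; the function name `tf_component` and the document lengths are illustrative assumptions.

```python
def tf_component(f, doc_len, avgdl, k1=1.2, b=0.75):
    """The fraction in the BM25 sum for one query token (IDF left out)."""
    norm = 1 - b + b * doc_len / avgdl
    return f * (k1 + 1) / (f + k1 * norm)

# TF saturation: at average length the component approaches k1 + 1.
for f in (1, 2, 10, 11):
    print(f, round(tf_component(f, doc_len=100, avgdl=100), 3))

# Length normalisation: one occurrence in a short vs a long document.
short = tf_component(1, doc_len=50, avgdl=100)
long_ = tf_component(1, doc_len=400, avgdl=100)
print(round(short, 3), round(long_, 3))
```

With these values, going from 1 to 2 occurrences adds 0.375 to the component, while going from 10 to 11 adds about 0.02, and the single occurrence in the short document scores well above the same occurrence in the long one.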

6. Walkthrough — From rank_bm25 to Elasticsearch

6.1 rank_bm25 one-liner (Python, English)

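import re
from rank_bm25 import BM25Okapi

docs = [
    "The quarterly sales report consolidates in USD after applying FX rates.",
    "Inventory SKU-2024-04: 30 units in the Incheon warehouse.",
    "Exception filings require department-head approval and form SEC-EX-04.",
]

def tokenize(text):
    # Lowercase and keep hyphenated codes whole. A plain split() would leave
    # "sku-2024-04:" with its trailing colon, so the query token would not match.
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

tokenized = [tokenize(d) for d in docs]
bm25 = BM25Okapi(tokenized)

query = tokenize("SKU-2024-04 inventory")
scores = bm25.get_scores(query)  # one score per document; docs[1] ranks first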

6.2 Elasticsearch for the same search

from elasticsearch import Elasticsearch
es = Elasticsearch("http://localhost:9200")

es.indices.create(index="rag", body={
    "mappings": {
        "properties": {
            # The standard analyzer lowercases and splits on punctuation,
            # so "SKU-2024-04" is indexed as the tokens sku / 2024 / 04.
            "chunk_text": {"type": "text", "analyzer": "standard"},
            "version": {"type": "keyword"},
            "security_level": {"type": "keyword"},
        }
    }
})

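# `docs` is the list defined in §6.1.
for i, d in enumerate(docs):
    es.index(index="rag", id=f"c{i:03d}", document={
        "chunk_text": d,
        "version": "3.2",
        "security_level": "internal",
    })
es.indices.refresh(index="rag")  # make the new documents searchable right away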

results = es.search(index="rag", body={
    "query": {
        "bool": {
            "must": [{"match": {"chunk_text": "SKU-2024-04 inventory"}}],
            "filter": [
                {"term": {"version": "3.2"}},
                {"terms": {"security_level": ["public", "internal"]}},
            ]
        }
    },
    "size": 5
})

6.3 Korean — Nori analyzer is decisive

es.indices.create(index="rag_ko", body={
    "settings": {"analysis": {"analyzer": {
        "ko_analyzer": {
            "type": "custom",
            "tokenizer": "nori_tokenizer",
            "filter": ["nori_part_of_speech", "lowercase"],
        }
    }}},
    "mappings": {"properties": {
        "chunk_text": {"type": "text", "analyzer": "ko_analyzer"},
        "version": {"type": "keyword"},
    }}
})

es.index(index="rag_ko", id="c001", document={
    "chunk_text": "분기 매출 보고는 환율 적용 후 USD로 통합한다.",
    "version": "3.2",
})
es.indices.refresh(index="rag_ko")  # make the new document searchable right away

results = es.search(index="rag_ko", body={
    "query": {"match": {"chunk_text": "환율 적용 통화 환산"}}
})

Nori is Elastic's official Korean morphological analyzer, shipped as the analysis-nori plugin (self-managed clusters install it separately). The nori_part_of_speech filter drops particles and endings by default, and that single line accounts for a large share of Korean BM25 quality.

6.4 Combining BM25 with Dense (Hybrid preview)

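# Placeholders: the search calls are elided here. Reduce each result to a
# list of doc ids (best first) before passing it to rrf().
bm25_hits = es.search(...)
dense_hits = vectordb.similarity_search(...)

def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: rankings is a list of doc-id lists, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    return sorted(scores.items(), key=lambda x: -x[1])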
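
A short usage sketch for Reciprocal Rank Fusion follows. The doc ids are hypothetical stand-ins for real BM25 and Dense hits, and the function is repeated so the block runs on its own.

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion over lists of doc ids, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (k + rank)
    return sorted(scores.items(), key=lambda x: -x[1])

bm25_ids = ["c001", "c002", "c003"]   # hypothetical BM25 ranking
dense_ids = ["c003", "c001", "c004"]  # hypothetical Dense ranking
fused = rrf([bm25_ids, dense_ids])
print([doc_id for doc_id, _ in fused])
```

Documents that appear in both rankings (c001, c003) rise above those found by only one retriever, and no raw scores are compared, only ranks.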

7. Variants

7.1 BM25F — Multi-field weighting

What changes: one document with title, body, tags — weight per field.
Why use it: a title match is a much stronger signal than a body match.
What becomes possible: precise field-weighted search.
Where it fits: structured documents (products, blog posts, papers).
Limits: weight tuning needed; pair with an evaluation set (Part 14).
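
One way to try field weighting in Elasticsearch is the `combined_fields` query, which the Elasticsearch documentation describes as based on BM25F. The sketch below only builds the query body; the helper name `field_weighted_query` and the boost values are illustrative assumptions, and the fields are assumed to be text fields that share one analyzer.

```python
def field_weighted_query(text, boosts):
    """Build a `combined_fields` query body with per-field boosts."""
    fields = [f"{name}^{boost}" for name, boost in boosts.items()]
    return {"query": {"combined_fields": {"query": text, "fields": fields}}}

# Weight a title match three times as heavily as a body match (illustrative).
body = field_weighted_query("bm25 length normalisation", {"title": 3, "body": 1})
print(body)
```

The boosts are the knobs that need an evaluation set: start from a guess like the one above and adjust against measured retrieval quality.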

7.2 SPLADE — Learned sparse retrieval

What changes: learned token weights replace BM25's statistical weights.
Why use it: captures some semantic signal while preserving sparse interpretability.
What becomes possible: a middle ground between BM25 and Dense.
Where it fits: integrated tooling (BGE-M3's sparse mode, SPLADE++).
Limits: token weights and expanded terms come from a model, so indexing needs an inference pass and the index stores learned weights instead of raw counts; more complex than plain ES.

7.3 Elasticsearch + vector field — single-system integration

What changes: ES 8.x supports dense_vector fields alongside BM25 in the same index.
Why use it: avoids running two systems.
What becomes possible: BM25 and Dense in one query DSL.
Where it fits: existing ES shops adding RAG.
Limits: ES's HNSW search tends to trail dedicated vector DBs (e.g. Qdrant); consider splitting the systems at scale (> 100M vectors).

7.4 Korean analyzers — Mecab vs Kiwi vs Nori

What changes: accuracy, speed, and ops burden.
Why use it: BM25's token unit depends on the analyzer.
What becomes possible: decides Korean RAG retrieval quality.
Where it fits:
Mecab: fastest (C++); user-dictionary maintenance burden.
Kiwi: Python-friendly, strong on Korean neologisms and proper nouns.
Nori: official Elasticsearch plugin (analysis-nori), simple ops.
Limits: changing analyzers means full re-index.

7.5 BM25 + domain dictionary — proper-noun pre-registration

What changes: register SKU codes, abbreviations, in-house terms in the analyzer's user dictionary.
Why use it: prevents default analyzers from mis-splitting proper nouns.
What becomes possible: codes like PR-2024-Q3 stay as a single token.
Where it fits: internal-corpus RAG.
Limits: dictionary maintenance; manual updates for new terms.
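
The effect of a user dictionary can be illustrated with a toy tokeniser. This is a pure-Python sketch of the idea, not the dictionary format of Nori, Mecab, or Kiwi; `PROTECTED_TERMS` and both function names are hypothetical.

```python
import re

PROTECTED_TERMS = {"pr-2024-q3", "sec-ex-04"}  # hypothetical in-house codes

def default_tokenize(text):
    """Split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())

def dictionary_tokenize(text, protected=PROTECTED_TERMS):
    """Keep registered terms whole; fall back to default splitting elsewhere."""
    if not protected:
        return default_tokenize(text)
    # Longest terms first so a longer code wins over a shorter prefix.
    ordered = sorted(protected, key=len, reverse=True)
    pattern = "|".join(re.escape(term) for term in ordered)
    tokens = []
    for piece in re.split(f"({pattern})", text.lower()):
        if piece in protected:
            tokens.append(piece)
        else:
            tokens.extend(default_tokenize(piece))
    return tokens

text = "Section 5 conclusions of the PR-2024-Q3 report"
print(default_tokenize(text))
print(dictionary_tokenize(text))
```

Without the dictionary the code fragments into pr / 2024 / q3, and each fragment can match unrelated reports; with it, the code survives as one rare, high-IDF token.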

8. Limits and Failure Modes

8.1 Weak on synonyms and semantic matches

Why intrinsic: BM25 is string-level. It does not know "FX rates" ↔ "currency conversion".
Diagnosis: synonym-driven queries show recall significantly lower than Dense.
Mitigation: Hybrid with Dense (Part 12); or register a synonym dictionary in the analyzer.
Later part: Part 12.

8.2 Korean particles and endings break tokens

Why intrinsic: without an analyzer, "환율을", "환율이", "환율은" are three different tokens; BM25 treats them as unrelated.
Diagnosis: the same noun in different inflected forms fragments top-K.
Mitigation: morphological analyzer with stem extraction (Mecab/Kiwi/Nori).
Later part: Extra D (Korean RAG).
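
The fragmentation is easy to reproduce. The sketch below uses a deliberately naive suffix stripper as a stand-in for a morphological analyzer; the particle list and function names are assumptions, and a rule like this would mangle real words (for example, nouns that happen to end in 이), which is exactly why a real analyzer is needed.

```python
from collections import Counter

PARTICLES = ("을", "를", "이", "가", "은", "는")  # tiny illustrative list

def naive_tokens(text):
    """Whitespace split, as a non-Korean-aware analyzer would do."""
    return text.split()

def strip_particle(token):
    """Toy stand-in for morphological analysis: drop one trailing particle."""
    for particle in PARTICLES:
        if len(token) > len(particle) and token.endswith(particle):
            return token[: -len(particle)]
    return token

text = "환율을 적용한다 환율이 오른다 환율은 변한다"
raw_counts = Counter(naive_tokens(text))
stem_counts = Counter(strip_particle(t) for t in naive_tokens(text))
print(raw_counts)
print(stem_counts)
```

In the raw counts 환율 never appears as a token, so a query for it matches nothing; after stripping, the three surface forms collapse into one token with a count of 3.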

8.3 High-frequency stopword pollution

Why intrinsic: even with IDF, stopwords like "the", "of", "는", "이" are not fully zeroed and add noise to short queries.
Diagnosis: top-K scatters on stopword matches.
Mitigation: register stopword lists in the analyzer (ES stop filter).
Later part: Part 16 ops hygiene.
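
The "not fully zeroed" point follows directly from the IDF formula in §5. A minimal check, with the corpus size and document counts chosen purely for illustration:

```python
import math

def idf(N, n_i):
    """IDF from §5: log((N - n_i + 0.5) / (n_i + 0.5) + 1)."""
    return math.log((N - n_i + 0.5) / (n_i + 0.5) + 1)

N = 10_000
stopword_idf = idf(N, 9_900)  # token present in 99% of documents
rare_idf = idf(N, 5)          # token present in 5 documents
print(round(stopword_idf, 4), round(rare_idf, 4))
```

The `+ 1` inside the logarithm keeps the stopword weight small but strictly positive, so in a two- or three-token query a stopword match can still reorder documents that miss the informative token.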

8.4 Extreme document-length variance

Why intrinsic: BM25's $b$ normalises length, but under extreme variance (10 tokens vs 10K tokens) no single $b$ serves both ends well.
Diagnosis: short chunks dominate (or sink) top-K consistently.
Mitigation: normalise chunk size (Part 5); tune BM25 per corpus.
Later part: Part 5 cross-ref; Part 16 (tuning).
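
Per-corpus tuning in Elasticsearch goes through the index similarity settings. The sketch below only builds the settings dictionary; the helper name `bm25_index_settings` and the value `b=0.5` are illustrative assumptions, not recommendations.

```python
def bm25_index_settings(k1=1.2, b=0.75):
    """Index settings that override the default BM25 similarity parameters."""
    return {"index": {"similarity": {"default": {"type": "BM25", "k1": k1, "b": b}}}}

# Softer length normalisation for a corpus with very uneven chunk sizes.
settings = bm25_index_settings(b=0.5)
print(settings)
```

The dictionary would be passed as the settings of a newly created index; any such change should be compared against the defaults on an evaluation set before it ships.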

8.5 Typos and surface-form variance

Why intrinsic: BM25 matches surface form. "AI" and "에이아이" are distinct; "Pinecone" and "pinecone" align only via lowercase.
Diagnosis: same-meaning, different-spelling queries miss.
Mitigation: folding/normalisation filters in the analyzer; or n-gram matching.
Later part: Part 12 Hybrid.

8.6 Common Pitfalls

"BM25 is old, skip it." — Part 10 §3, this part §3. Decisive for proper nouns and codes.
"Use the default ES analyzer on Korean." §8.2. Nori/Mecab/Kiwi are required.
"BM25 needs no tuning." $k_1$, $b$, stopwords, analyzer — all need corpus tuning.
"Hybrid = weighted sum of Dense and BM25." — Part 12. RRF is the standard; raw weighted sums break because of score-scale mismatch.
"In-house abbreviations work without a user dictionary." — §7.5. Dictionary entries decide hit rate.

9. Settled Conclusions

Q1. What two improvements does BM25 add to TF-IDF?

Document-length normalisation ($b$) and TF saturation ($k_1$). Chapter: §5.

Q2. On which queries is BM25 stronger than Dense?

Proper nouns, codes, abbreviations, exact phrases, rare vocabulary — tokens whose embeddings blur. Chapter: §3, §8.1; Part 10 §8.1.

Q3. Why is an inverted index fast?

Token → doc-id list structure. A query reads only the postings lists of its own tokens, so cost scales with query tokens and postings length, not with the full corpus. Chapter: §2, §4.

Q4. The two minimum-hygiene steps for Korean BM25?

Morphological analyzer with particle removal + stemming; user dictionary with in-house proper nouns. Chapter: §6.3, §7.5, §8.2.

Q5. The standard way to combine BM25 and Dense?

Take top-K' from each, fuse with Reciprocal Rank Fusion. Raw weighted sums fail due to score-scale mismatch. Chapter: §6.4; Part 12 cross-ref.

10. Further Reading

Primary

Robertson, S., Zaragoza, H. The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in IR 2009.
Robertson, S. et al. Okapi at TREC-3. NIST 1994 (the original BM25 proposal).
Formal, T. et al. SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking. SIGIR 2021. arXiv:2107.05720.
Lin, J. A Few Brief Notes on DeepImpact, COIL, and a Conceptual Framework for Information Retrieval Techniques. arXiv:2106.14807 (2021).
Park, E. Mecab-ko: the Korean fork of MeCab (de facto Korean morphology standard).

Official docs

Elasticsearch BM25: https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html
OpenSearch BM25: https://opensearch.org/docs/latest/query-dsl/full-text/match/
Elasticsearch Nori: https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-nori.html
rank-bm25 (Python): https://github.com/dorianbrown/rank_bm25
Kiwi: https://github.com/bab2min/Kiwi

Supporting

Author note Chapter 10 — BM25.
Author note Chapter 35 §6 — Retriever Routing (separating and combining BM25 with Dense).

Cheat Sheet

Knob	Default / Recommendation
$k_1$	1.2–2.0 (Elasticsearch default 1.2)
$b$	0.75 (stable across most corpora)
Stopwords	English: `stop` filter or `english` analyzer (`standard` removes none by default); Korean: analyzer-level filtering (e.g. nori_part_of_speech)
Analyzer (Korean)	Nori (official ES plugin) > Kiwi (Python-friendly) > Mecab (fastest)
User dictionary	Register in-house abbreviations and proper nouns always
Candidate count	top 50–200 before Hybrid / Reranker
Hybrid combination	RRF (k=60 default) — Part 12

One-liner: BM25 = exact match for proper nouns and codes; Dense = semantic similarity. For Korean, the analyzer + user dictionary deliver half of BM25's quality.

Bridge — What's Next

Next — RAG Core Study (12/26) — Hybrid Search & Score Fusion.

How do we combine Dense and BM25? Part 12 covers Reciprocal Rank Fusion (Cormack 2009), Weighted Fusion, Score Normalisation, and the standard weight-tuning experiment. Their division of labour lifts RAG accuracy a clear step.
