"RAG Core Study (11/26) — Sparse Retrieval & BM25 Deep Dive"

Where Dense is weak — proper nouns, codes, exact strings — Sparse takes over. BM25 is its standard.

Sparse Retrieval represents documents as term-frequency vectors (mostly zeros) and finds answers by keyword overlap. Plain in appearance, but TF-IDF's information weighting, BM25's length normalisation, and the Inverted Index's speed combine to beat Dense on proper nouns, codes, and abbreviations. Part 11 unpacks Robertson 2009's BM25 formula, the Elasticsearch/OpenSearch integration patterns for RAG, and why morphological analyzers are decisive in Korean.


0. Prerequisites

  • Part 10 Dense Retrieval — the complementarity of the two methods.
  • Part 5 chunking — how chunks meet the tokeniser unit.
  • Part 7 metadata — pre-filter integration.

1. Learning Objectives

  1. Read TF, IDF, and BM25 from formula and intuition.
  2. Explain why an Inverted Index is fast.
  3. Explain why a morphological analyzer is decisive for Korean BM25.
  4. Know BM25's limits and where Part 12's Hybrid fits.

2. Key Summary

Sparse Retrieval uses sparse term-frequency vectors. TF-IDF = frequency × information. BM25 (Robertson & Zaragoza 2009) adds document-length normalisation and TF saturation to TF-IDF and has become the practical standard. An Inverted Index (token → \([(doc\_id, tf), ...]\)) lets a query read only the postings lists of its own tokens, so cost tracks the number of matching postings rather than the full corpus size. Elasticsearch and OpenSearch are the operational standard; in RAG, BM25 produces candidates that feed a Reranker or a Dense Hybrid. For Korean, a morphological analyzer (Mecab, Kiwi, Nori) is required; without it, particles and endings stay attached to the stem, the same noun becomes several distinct tokens, and matching breaks. Particle removal + stemming is Korean BM25's minimum hygiene.


3. Intuition — BM25 catches what Dense misses

Query: "Section 5 conclusions of the PR-2024-Q3 report?"

In the same corpus:

  • Dense: the embedding for "PR-2024-Q3" blurs (rare token); top-K scatters across other reports with similar themes.
  • BM25: chunks containing the exact token "PR-2024-Q3" score very high; top-1 is the target chunk itself.

Opposite case: "currency conversion basis" — BM25 misses chunks that say "FX rates" instead (Part 10 §3). The two methods are strong on different query types.


4. Definitions — Sparse Core Terms

| Term | Definition |
| --- | --- |
| TF (Term Frequency) | Token count inside a document |
| IDF (Inverse Document Frequency) | The rarer a token in the corpus, the higher its weight |
| TF-IDF | \(\text{TF} \times \text{IDF}\). Tokens that are common in the document but rare in the corpus score highest |
| BM25 | TF-IDF + document-length normalisation + TF saturation. The standard since the late 1990s |
| Inverted Index | Token → \([(doc\_id, tf), ...]\) dictionary. A query touches only the postings of its own tokens |
| Stemming / Lemmatisation | Reduce surface forms to a stem (English: running → run; Korean: 갑니다 → 가다) |
| Morphological Analyzer | Splits text into morphemes and tags them. Mandatory for Korean and Japanese BM25 |
| Stopwords | Low-information tokens ("the", "of", "는", "이"). Usually removed |
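
The inverted-index entry can be made concrete with a minimal sketch. It assumes whitespace tokenisation and an in-memory dict; `build_inverted_index` and `candidates` are illustrative names, not part of any library.

```python
from collections import Counter, defaultdict

def build_inverted_index(docs):
    """Map each token to a postings list of (doc_id, tf) pairs."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for token, tf in Counter(text.lower().split()).items():
            index[token].append((doc_id, tf))
    return index

def candidates(index, query):
    """Return ids of documents sharing at least one token with the query."""
    hits = set()
    for token in query.lower().split():
        # Only the postings of query tokens are read; other tokens are never touched.
        hits.update(doc_id for doc_id, _ in index.get(token, []))
    return hits

docs = [
    "fx rates applied to quarterly sales",
    "inventory count for the incheon warehouse",
    "quarterly inventory review",
]
index = build_inverted_index(docs)
print(index["quarterly"])                      # postings for one token
print(candidates(index, "inventory review"))   # documents worth scoring
```

Only the candidate set is then scored, which is why lookup cost follows the length of the touched postings lists rather than the number of documents.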

5. Math — BM25 unpacked

For document \(D\) and query \(Q = \{q_1, ..., q_n\}\):

$$\text{BM25}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}$$

Each piece:

  • \(f(q_i, D)\) = TF of token \(q_i\) in document \(D\)
  • \(\text{IDF}(q_i) = \log\left(\frac{N - n_i + 0.5}{n_i + 0.5} + 1\right)\), \(N\) = total documents, \(n_i\) = docs containing \(q_i\)
  • \(|D|\) = document length, \(\text{avgdl}\) = corpus average length
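  • \(k_1\) = TF-saturation parameter (1.2–2.0 typical). Higher values delay saturation, so repeated occurrences keep adding to the score; \(k_1 = 0\) reduces TF to a presence check.
  • \(b\) = length-normalisation parameter (0.75 typical). Higher values penalise long documents more and favour short ones; \(b = 0\) switches length normalisation off.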
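
The pieces above can be assembled into a small from-scratch scorer. This is a sketch that follows the formula and the IDF variant given in this section; the function name `bm25_scores` and the toy corpus are assumptions for illustration, and library implementations may differ in IDF details.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_docs, k1=1.2, b=0.75):
    """Score every document against the query with the formula above."""
    n_docs = len(tokenized_docs)
    avgdl = sum(len(d) for d in tokenized_docs) / n_docs
    # n_i: number of documents that contain each token at least once
    doc_freq = Counter(tok for d in tokenized_docs for tok in set(d))
    scores = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        length_norm = 1 - b + b * len(doc) / avgdl
        score = 0.0
        for q in query_tokens:
            f = tf[q]
            if f == 0:
                continue  # a query token absent from the document adds nothing
            n_i = doc_freq[q]
            idf = math.log((n_docs - n_i + 0.5) / (n_i + 0.5) + 1)
            score += idf * f * (k1 + 1) / (f + k1 * length_norm)
        scores.append(score)
    return scores

docs = [
    "fx rates applied to quarterly sales".split(),
    "inventory sku-2024-04 stored in incheon".split(),
    "exception filings need approval".split(),
]
scores = bm25_scores(["sku-2024-04", "inventory"], docs)
print(scores)
```

Only the second document contains either query token, so it is the only one with a non-zero score.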

TF saturation intuition: 10 → 11 occurrences add far less than 1 → 2. Prevents repeat-spam from dominating.
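
The saturation claim can be checked numerically. The sketch below isolates the TF part of the BM25 formula for a document of average length (so the length term equals 1); `tf_component` is an illustrative helper name.

```python
def tf_component(f, k1=1.2, length_norm=1.0):
    """TF part of BM25: f * (k1 + 1) / (f + k1 * length_norm)."""
    return f * (k1 + 1) / (f + k1 * length_norm)

gain_1_to_2 = tf_component(2) - tf_component(1)      # about 0.375
gain_10_to_11 = tf_component(11) - tf_component(10)  # about 0.019
ceiling = 1.2 + 1  # the component approaches k1 + 1 as f grows

print(gain_1_to_2, gain_10_to_11, ceiling)
```

With \(k_1 = 1.2\), the second occurrence adds roughly twenty times more than the eleventh, and no amount of repetition pushes the component past \(k_1 + 1\).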

Length-normalisation intuition: a token appearing once in a short document matters more than once in a long document.
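
The same kind of check works for length normalisation. The sketch assumes an average length of 100 tokens and compares one occurrence of a token in a 20-token and a 500-token document; the helper names and lengths are illustrative.

```python
def length_norm(doc_len, avgdl, b=0.75):
    """The (1 - b + b * |D| / avgdl) term of the BM25 denominator."""
    return 1 - b + b * doc_len / avgdl

def tf_component(f, doc_len, avgdl, k1=1.2, b=0.75):
    return f * (k1 + 1) / (f + k1 * length_norm(doc_len, avgdl, b))

avgdl = 100
short = tf_component(1, doc_len=20, avgdl=avgdl)           # about 1.49
long_ = tf_component(1, doc_len=500, avgdl=avgdl)          # about 0.38
no_norm = tf_component(1, doc_len=500, avgdl=avgdl, b=0.0) # exactly 1.0

print(short, long_, no_norm)
```

Setting `b=0.0` removes the length effect entirely, which is a quick way to see how much of a ranking is driven by chunk length.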


6. Walkthrough — From rank_bm25 to Elasticsearch

6.1 rank_bm25 one-liner (Python, English)

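import re

from rank_bm25 import BM25Okapi

def tokenize(text):
    # Lowercase and keep hyphenated codes such as "sku-2024-04" whole.
    # A bare .split() would leave "sku-2024-04:" with its trailing colon,
    # so the query token would never match the document token.
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

docs = [
    "The quarterly sales report consolidates in USD after applying FX rates.",
    "Inventory SKU-2024-04: 30 units in the Incheon warehouse.",
    "Exception filings require department-head approval and form SEC-EX-04.",
]
tokenized = [tokenize(d) for d in docs]
bm25 = BM25Okapi(tokenized)

query = tokenize("SKU-2024-04 inventory")
scores = bm25.get_scores(query)  # one score per document; docs[1] ranks first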

6.2 Elasticsearch for the same search

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Keyword-argument style of the 8.x Python client (body= is deprecated there).
es.indices.create(index="rag", mappings={
    "properties": {
        # English standard analyzer; it splits "SKU-2024-04" into sku / 2024 / 04.
        "chunk_text": {"type": "text", "analyzer": "standard"},
        "version": {"type": "keyword"},
        "security_level": {"type": "keyword"},
    }
})

for i, d in enumerate(docs):  # docs from §6.1
    es.index(index="rag", id=f"c{i:03d}", document={
        "chunk_text": d,
        "version": "3.2",
        "security_level": "internal",
    })

# Newly indexed documents are not searchable until the index refreshes.
es.indices.refresh(index="rag")

results = es.search(index="rag", query={
    "bool": {
        # must: scored with BM25; filter: metadata pre-filter, no effect on score
        "must": [{"match": {"chunk_text": "SKU-2024-04 inventory"}}],
        "filter": [
            {"term": {"version": "3.2"}},
            {"terms": {"security_level": ["public", "internal"]}},
        ],
    }
}, size=5)
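
In a RAG service the version and security filters usually come from the request context, so the query body tends to be built by a helper. A minimal sketch, assuming the field names from the mapping above; `build_filtered_bm25_query` is a hypothetical helper, not a client API.

```python
def build_filtered_bm25_query(text, version, security_levels, size=5):
    """Build a search body: a BM25 `match` clause plus non-scoring metadata filters."""
    return {
        "query": {
            "bool": {
                # Scored clause: BM25 over the analyzed chunk text.
                "must": [{"match": {"chunk_text": text}}],
                # Filter clauses restrict the candidate set without changing scores.
                "filter": [
                    {"term": {"version": version}},
                    {"terms": {"security_level": list(security_levels)}},
                ],
            }
        },
        "size": size,
    }

body = build_filtered_bm25_query("SKU-2024-04 inventory", "3.2", ["public", "internal"])
print(body["query"]["bool"]["filter"])
```

Keeping access control in `filter` rather than `must` means a user's permissions never shift relevance scores; they only decide which chunks are eligible.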

6.3 Korean — Nori analyzer is decisive

es.indices.create(index="rag_ko", body={
    "settings": {"analysis": {"analyzer": {
        "ko_analyzer": {
            "type": "custom",
            "tokenizer": "nori_tokenizer",
            "filter": ["nori_part_of_speech", "lowercase"],
        }
    }}},
    "mappings": {"properties": {
        "chunk_text": {"type": "text", "analyzer": "ko_analyzer"},
        "version": {"type": "keyword"},
    }}
})

es.index(index="rag_ko", id="c001", document={
    "chunk_text": "๋ถ„๊ธฐ ๋งค์ถœ ๋ณด๊ณ ๋Š” ํ™˜์œจ ์ ์šฉ ํ›„ USD๋กœ ํ†ตํ•ฉํ•œ๋‹ค.",
    "version": "3.2",
})

results = es.search(index="rag_ko", body={
    "query": {"match": {"chunk_text": "ํ™˜์œจ ์ ์šฉ ํ†ตํ™” ํ™˜์‚ฐ"}}
})

Nori is Elasticsearch's official Korean morphological analyzer, shipped as the analysis-nori plugin. With its default settings the nori_part_of_speech filter drops particles and endings by part-of-speech tag; that single line accounts for a large share of Korean BM25 quality.

6.4 Combining BM25 with Dense (Hybrid preview)

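# In practice the rankings come from the two retrievers, e.g.
#   bm25_hits = es.search(...)
#   dense_hits = vectordb.similarity_search(...)
# Illustrative ranked chunk ids stand in for them here.
bm25_ranking = ["c001", "c007", "c003"]
dense_ranking = ["c007", "c002", "c001"]

def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: sum 1 / (k + rank) over each ranked list."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    return sorted(scores.items(), key=lambda x: -x[1])

fused = rrf([bm25_ranking, dense_ranking])  # [(doc_id, score), ...], best first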
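
As a worked example of the fusion rule coded above, assume (hypothetically) that chunk A sits at rank 1 in the BM25 list and rank 3 in the Dense list, while chunk B appears only in the BM25 list at rank 2. With \(k = 60\):

$$\text{RRF}(A) = \frac{1}{60 + 1} + \frac{1}{60 + 3} \approx 0.0164 + 0.0159 = 0.0323, \qquad \text{RRF}(B) = \frac{1}{60 + 2} \approx 0.0161$$

A wins because both retrievers agree on it, even though neither raw score was used. Only ranks enter the formula, which is what lets RRF sidestep the score-scale mismatch between BM25 and Dense.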

7. Variants

7.1 BM25F — Multi-field weighting

  • What changes: one document with title, body, tags — weight per field.
  • Why use it: a title match is a much stronger signal than a body match.
  • What becomes possible: precise field-weighted search.
  • Where it fits: structured documents (products, blog posts, papers).
  • Limits: weight tuning needed; pair with an evaluation set (Part 14).
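
A minimal sketch of field-weighted scoring for a single query term, in one common BM25F-style formulation: per-field term frequencies are length-normalised and weighted, then saturation is applied once to the combined value. The field weights, average lengths, and the helper name `bm25f_term_score` are assumptions for illustration.

```python
import math

def bm25f_term_score(field_tf, field_len, avg_field_len, weights, idf,
                     k1=1.2, b=0.75):
    """Score one query term over several fields, BM25F-style."""
    combined_tf = 0.0
    for field, tf in field_tf.items():
        # Normalise each field by its own average length before weighting.
        norm = 1 - b + b * field_len[field] / avg_field_len[field]
        combined_tf += weights[field] * tf / norm
    # Saturation is applied once, to the weighted sum across fields.
    return idf * combined_tf / (k1 + combined_tf)

weights = {"title": 3.0, "body": 1.0}   # hypothetical field weights
avg_len = {"title": 8, "body": 200}     # hypothetical corpus averages
doc_len = {"title": 8, "body": 200}     # a document of average length
idf = math.log((1000 - 10 + 0.5) / (10 + 0.5) + 1)  # N = 1000, n_i = 10

title_hit = bm25f_term_score({"title": 1, "body": 0}, doc_len, avg_len, weights, idf)
body_hit = bm25f_term_score({"title": 0, "body": 1}, doc_len, avg_len, weights, idf)
print(title_hit, body_hit)
```

With these weights a single title match outscores a single body match, which is the behaviour the weights are meant to encode; the actual values should come from an evaluation set.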

7.2 SPLADE — Learned sparse retrieval

  • What changes: learned token weights replace BM25's statistical weights.
  • Why use it: captures some semantic signal while preserving sparse interpretability.
  • What becomes possible: a middle ground between BM25 and Dense.
  • Where it fits: integrated tooling (BGE-M3's sparse mode, SPLADE++).
  • Limits: the weights come from a neural encoder, so indexing needs model inference, and term expansion makes postings lists denser; more complex to run than plain ES BM25.
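
Scoring with learned sparse vectors is still a sparse dot product; what changes is where the weights come from. The sketch below uses hand-written weights of the kind a learned sparse encoder might emit. They are hypothetical values, not output from SPLADE or BGE-M3.

```python
def sparse_dot(query_weights, doc_weights):
    """Dot product of two sparse vectors stored as {token: weight} dicts."""
    # Iterate over the smaller dict so cost tracks the shorter vector.
    if len(query_weights) > len(doc_weights):
        query_weights, doc_weights = doc_weights, query_weights
    return sum(w * doc_weights[t] for t, w in query_weights.items()
               if t in doc_weights)

# Hypothetical learned weights: the document text says "FX rates", but
# expansion has also assigned weight to "currency" and "conversion".
doc = {"fx": 1.8, "rates": 1.5, "currency": 0.9, "conversion": 0.7}
query = {"currency": 1.2, "conversion": 1.1}

score = sparse_dot(query, doc)
print(score)
```

Because the document vector carries expansion terms, the query matches even though it shares no surface token with the original text, while every contributing token remains inspectable.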

7.3 Elasticsearch + vector field — single-system integration

  • What changes: ES 8.x supports dense_vector fields alongside BM25 in the same index.
  • Why use it: avoids running two systems.
  • What becomes possible: BM25 and Dense in one query DSL.
  • Where it fits: existing ES shops adding RAG.
  • Limits: ES's HNSW search can trail dedicated vector DBs (e.g. Qdrant); consider splitting into separate systems at large scale (> 100M).

7.4 Korean analyzers — Mecab vs Kiwi vs Nori

  • What changes: accuracy, speed, and ops burden.
  • Why use it: BM25's token unit depends on the analyzer.
  • What becomes possible: decides Korean RAG retrieval quality.
  • Where it fits:
      • Mecab: fastest (C++); user-dictionary maintenance burden.
      • Kiwi: Python-friendly, strong on Korean neologisms and proper nouns.
      • Nori: official Elasticsearch plugin (analysis-nori), simple ops.
  • Limits: changing analyzers means full re-index.

7.5 BM25 + domain dictionary — proper-noun pre-registration

  • What changes: register SKU codes, abbreviations, in-house terms in the analyzer's user dictionary.
  • Why use it: prevents default analyzers from mis-splitting proper nouns.
  • What becomes possible: codes like PR-2024-Q3 stay as a single token.
  • Where it fits: internal-corpus RAG.
  • Limits: dictionary maintenance; manual updates for new terms.
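
The effect of a user dictionary can be imitated in a few lines. This is a toy regex tokenizer, not how Nori or Mecab dictionaries work internally; `make_tokenizer` and the dictionary entries are illustrative.

```python
import re

def make_tokenizer(user_dictionary):
    """Build a tokenizer that emits dictionary entries as single tokens."""
    # Longest entries first so a full code wins over any shorter overlap.
    protected = sorted((t.lower() for t in user_dictionary), key=len, reverse=True)
    alternatives = [re.escape(t) for t in protected] + [r"[a-z0-9]+"]
    pattern = re.compile("|".join(alternatives))
    return lambda text: pattern.findall(text.lower())

plain = make_tokenizer([])
with_dict = make_tokenizer(["PR-2024-Q3", "SEC-EX-04"])

text = "Section 5 of the PR-2024-Q3 report"
print(plain(text))      # the code is split into pr / 2024 / q3
print(with_dict(text))  # the code survives as one token
```

Without the dictionary, the fragments "2024" and "q3" also match unrelated reports; with it, the code becomes a single rare token with a high IDF.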

8. Limits and Failure Modes

8.1 Weak on synonyms and semantic matches

  • Why intrinsic: BM25 is string-level. It does not know "FX rates" ↔ "currency conversion".
  • Diagnosis: synonym-driven queries show recall significantly lower than Dense.
  • Mitigation: Hybrid with Dense (Part 12); or register a synonym dictionary in the analyzer.
  • Later part: Part 12.
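
The synonym-dictionary mitigation can be sketched as query-time expansion. The token-level table below is a crude, hypothetical stand-in; analyzer synonym filters can handle multi-word phrases, which this sketch does not attempt.

```python
def expand_query(tokens, synonyms):
    """Append registered synonyms for each query token, without duplicates."""
    expanded = list(tokens)
    for token in tokens:
        for alt in synonyms.get(token, []):
            if alt not in expanded:
                expanded.append(alt)
    return expanded

# Hypothetical single-token synonym table, for illustration only.
synonyms = {"currency": ["fx"], "conversion": ["rates"]}
doc_tokens = "quarterly report consolidates after applying fx rates".split()

query = "currency conversion basis".split()
overlap_before = set(query) & set(doc_tokens)
overlap_after = set(expand_query(query, synonyms)) & set(doc_tokens)
print(overlap_before, overlap_after)
```

Before expansion the query and the chunk share no token, so BM25 scores the chunk at zero; after expansion two tokens overlap. The cost is that every synonym pair has to be curated by hand.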

8.2 Korean particles and endings break tokens

  • Why intrinsic: without an analyzer, "환율을", "환율이", "환율은" (the noun 환율, "exchange rate", with three different particles) are three different tokens; BM25 treats them as unrelated.
  • Diagnosis: the same noun in different inflected forms fragments top-K.
  • Mitigation: morphological analyzer with stem extraction (Mecab/Kiwi/Nori).
  • Later part: Extra D (Korean RAG).
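
The fragmentation is easy to reproduce. The sketch below uses a toy suffix stripper purely to show the effect; it is not a morphological analyzer, and a real one (Mecab, Kiwi, Nori) relies on a dictionary and part-of-speech tagging rather than a suffix list.

```python
def whitespace_tokens(text):
    return text.split()

# Toy particle list, for illustration only. Blind suffix stripping would
# also damage nouns that happen to end in one of these syllables.
PARTICLES = ("을", "를", "이", "가", "은", "는")

def strip_particle(token):
    for p in PARTICLES:
        if len(token) > 1 and token.endswith(p):
            return token[: -len(p)]
    return token

phrases = ("환율을 적용", "환율이 변동", "환율은 고정")
surface = [whitespace_tokens(s)[0] for s in phrases]
stems = [strip_particle(t) for t in surface]
print(surface)  # three distinct tokens for the same noun
print(stems)    # one shared token after stripping
```

With whitespace tokens, a query containing one form matches only a third of these chunks; after stripping, all three share the token 환율.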

8.3 High-frequency stopword pollution

  • Why intrinsic: even with IDF, stopwords like "the", "of", "는", "이" are not fully zeroed and add noise to short queries.
  • Diagnosis: top-K scatters on stopword matches.
  • Mitigation: register stopword lists in the analyzer (ES stop filter).
  • Later part: Part 16 ops hygiene.

8.4 Extreme document-length variance

  • Why intrinsic: BM25's \(b\) normalises length, but under extreme variance (10 tokens vs 10K tokens) no single \(b\) fits both ends of the range.
  • Diagnosis: short chunks dominate (or sink) top-K consistently.
  • Mitigation: normalise chunk size (Part 5); tune BM25 per corpus.
  • Later part: Part 5 cross-ref; Part 16 (tuning).

8.5 Typos and surface-form variance

  • Why intrinsic: BM25 matches surface form. "AI" and "에이아이" (its Korean transliteration) are distinct tokens; "Pinecone" and "pinecone" align only because of a lowercase filter.
  • Diagnosis: same-meaning, different-spelling queries miss.
  • Mitigation: folding/normalisation filters in the analyzer; or n-gram matching.
  • Later part: Part 12 Hybrid.
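
Both mitigations can be sketched with the standard library. The example assumes NFKC normalisation plus case folding as the "folding" step and character trigrams as the n-gram unit; the helper names are illustrative.

```python
import unicodedata

def fold(text):
    """Unicode-normalise and case-fold so surface variants compare equal."""
    return unicodedata.normalize("NFKC", text).casefold()

def char_ngrams(token, n=3):
    """Character n-grams; tolerant of small typos at the cost of index size."""
    return {token[i:i + n] for i in range(len(token) - n + 1)}

def ngram_overlap(a, b, n=3):
    """Jaccard overlap between the n-gram sets of two folded tokens."""
    ga, gb = char_ngrams(fold(a), n), char_ngrams(fold(b), n)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

same = fold("Pinecone") == fold("Ｐｉｎｅｃｏｎｅ")  # full-width variant
typo = ngram_overlap("Pinecone", "Pinecne")       # one dropped letter
transliterated = fold("AI") == fold("에이아이")
print(same, typo, transliterated)
```

Folding fixes case and width variants, and n-grams give partial credit to a typo, but neither bridges "AI" and its transliteration; that gap needs a synonym entry or the Dense side of a Hybrid.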

8.6 Common Pitfalls

  • "BM25 is old, skip it." It stays decisive for proper nouns and codes (Part 10 §3; §3 here).
  • "Use the default ES analyzer on Korean." Nori/Mecab/Kiwi are required (§8.2).
  • "BM25 needs no tuning." \(k_1\), \(b\), stopwords, and the analyzer all need per-corpus tuning.
  • "Hybrid = weighted sum of Dense and BM25." RRF is the standard; raw weighted sums break because of score-scale mismatch (Part 12).
  • "In-house abbreviations work without a user dictionary." Dictionary entries decide hit rate (§7.5).
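
The score-scale pitfall shows up even on four documents. The raw scores below are hypothetical: BM25 is unbounded while cosine similarity sits in [0, 1]. Both fusion functions are minimal sketches.

```python
def weighted_sum(bm25, dense, alpha=0.5):
    """Naive fusion on raw scores; the larger-scale score dominates."""
    ids = set(bm25) | set(dense)
    fused = {d: alpha * bm25.get(d, 0.0) + (1 - alpha) * dense.get(d, 0.0)
             for d in ids}
    return sorted(fused, key=fused.get, reverse=True)

def rrf(rankings, k=60):
    """Reciprocal Rank Fusion over ranked id lists (ranks only, no scores)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical raw scores for four chunks.
bm25 = {"a": 14.2, "b": 12.0, "c": 9.0, "d": 5.0}
dense = {"b": 0.92, "c": 0.90, "d": 0.88, "a": 0.10}

by_sum = weighted_sum(bm25, dense)
by_rrf = rrf([sorted(bm25, key=bm25.get, reverse=True),
              sorted(dense, key=dense.get, reverse=True)])
print(by_sum)  # identical to the BM25 order; Dense has no say
print(by_rrf)  # "b" rises because both retrievers rank it highly
```

The weighted sum simply reproduces the BM25 ranking, whereas RRF promotes the chunk both retrievers agree on. Normalising scores before summing is the other way out; Part 12 covers it.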

9. Settled Conclusions

Q1. What two improvements does BM25 add to TF-IDF?

Document-length normalisation (\(b\)) and TF saturation (\(k_1\)). Chapter: §5.

Q2. On which queries is BM25 stronger than Dense?

Proper nouns, codes, abbreviations, exact phrases, rare vocabulary — tokens whose embeddings blur. Chapter: §3, §8.1; Part 10 §8.1.

Q3. Why is an inverted index fast?

Token → doc-id list structure. A query reads only the postings lists of its own tokens, so the work tracks the number of matching postings rather than the size of the corpus. Chapter: §2, §4.

Q4. The two minimum-hygiene steps for Korean BM25?

Morphological analyzer with particle removal + stemming; user dictionary with in-house proper nouns. Chapter: §6.3, §7.5, §8.2.

Q5. The standard way to combine BM25 and Dense?

Take top-K' from each, fuse with Reciprocal Rank Fusion. Raw weighted sums fail due to score-scale mismatch. Chapter: §6.4; Part 12 cross-ref.


10. Further Reading

Primary

  • Robertson, S., Zaragoza, H. The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in IR 2009.
  • Robertson, S. et al. Okapi at TREC-3. NIST 1994 (the original BM25 proposal).
  • Formal, T. et al. SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking. SIGIR 2021. arXiv:2107.05720.
  • Lin, J. A Few Brief Notes on DeepImpact, COIL, and a Conceptual Framework for Information Retrieval Techniques. arXiv:2106.14807 (2021).
  • Park, E. Mecab-ko: the Korean fork of MeCab (de facto Korean morphology standard).

Official docs

  • Elasticsearch BM25: https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html
  • OpenSearch BM25: https://opensearch.org/docs/latest/query-dsl/full-text/match/
  • Elasticsearch Nori: https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-nori.html
  • rank-bm25 (Python): https://github.com/dorianbrown/rank_bm25
  • Kiwi: https://github.com/bab2min/Kiwi

Supporting

  • Author note Chapter 10 — BM25.
  • Author note Chapter 35 §6 — Retriever Routing (separating and combining BM25 with Dense).

Cheat Sheet

| Knob | Default / Recommendation |
| --- | --- |
| \(k_1\) | 1.2–2.0 (Elasticsearch default 1.2) |
| \(b\) | 0.75 (stable across most corpora) |
| Stopwords | English standard; Korean uses analyzer dictionaries |
| Analyzer (Korean) | Nori (official ES plugin) > Kiwi (Python-friendly) > Mecab (fastest) |
| User dictionary | Always register in-house abbreviations and proper nouns |
| Candidate count | top 50–200 before Hybrid / Reranker |
| Hybrid combination | RRF (k=60 default); see Part 12 |

One-liner: BM25 = exact match for proper nouns and codes; Dense = semantic similarity. For Korean, the analyzer and the user dictionary account for a large share of BM25's quality.


Bridge — What's Next

Next — RAG Core Study (12/26) — Hybrid Search & Score Fusion.

How do we combine Dense and BM25? Part 12 covers Reciprocal Rank Fusion (Cormack 2009), Weighted Fusion, Score Normalisation, and the standard weight-tuning experiment. Their division of labour lifts RAG accuracy a clear step.
