"RAG Core Study (5/26) — Five Paths of Chunking"
Documents are markdown; the search space is partitioned. Now we decide: what unit does a single embedding represent?
Chunking looks like the shortest decision in RAG but produces the widest downstream consequences. Same corpus, same embedding, same LLM — retrieval accuracy can swing 10 to 30 percentage points based purely on chunking strategy (Anthropic Contextual Retrieval, RAGAS benchmarks). Part 5 unpacks five paths — Fixed / Paragraph / Heading / Semantic / Sliding Window — and the Whole-document decision table left from Part 3.
0. Prerequisites
- Parts 2 (markdown) and 3 (Ingestion Design).
- Familiarity with max sequence length (commonly 512–8192 tokens).
- Tokens ≠ characters — a Korean character averages 1.3–2 tokens.
1. Learning Objectives
- State the one-line difference between the five chunking strategies.
- Express the trade-off of chunk_size and overlap in a single formula.
- Read the Whole-document adoption criteria from a table.
2. Key Summary
A chunk is the unit of embedding and the unit of retrieval. Too small fragments meaning; too large invites Lost in the Middle and max-length overflow. Fixed (token count) is simple but breaks semantic boundaries. Paragraph suits prose, but lengths vary widely. Heading shines on structured documents. Semantic cuts where the text changes topic. Sliding Window overlaps neighbours to reduce boundary loss. RecursiveTextSplitter is the generic default: it tries a list of separators in priority order, coarsest first. And for short, tight documents, Whole-document is often the right answer, which is the point of §35-1 from Part 3.
3. Intuition — One Paragraph Cut Five Ways
"Security policy §5.2 Permissions — All employees follow the principle of least privilege …" A one-page policy.
- Fixed (512 tokens): cuts mid-sentence; an awkward boundary like "the principle of least…" appears.
- Paragraph: variable lengths from 80 to 500 tokens.
- Heading: cuts at `## 5.2 Permissions`; one chunk equals one section.
- Semantic: cuts where consecutive-sentence embedding distance spikes.
- Sliding Window: Fixed plus 50–100-token overlap with the neighbour.
The same text places sentences at different positions inside a chunk, and embeddings see different signals.
4. Definitions — Five Strategies + Whole-document
| Strategy | Split Rule | Strength | Weakness |
|---|---|---|---|
| Fixed | Every N tokens | Simple, reproducible | Breaks semantic boundary |
| Paragraph | On `\n\n` | Preserves meaning | Length variance |
| Heading | On `#` / `##` / `###` | Strong on structured docs | Useless without headings |
| Semantic | Embedding distance threshold | Sensitive to topic change | Embedding-call cost |
| Sliding Window | Fixed + overlap | Reduces boundary loss | Larger index |
| Whole-document | No split | Best for short tight docs | Wrong for long ones |
RecursiveTextSplitter (LangChain's RecursiveCharacterTextSplitter) tries a priority list of separators (e.g. ["\n\n", "\n", ". ", " "]) and cuts at the coarsest boundary that keeps each piece within chunk_size.
5. Math — chunk_size and overlap
- \(L\) = chunk_size (tokens)
- \(o\) = overlap (tokens)
- \(N_{\text{doc}}\) = document tokens
- \(N_{\text{chunks}}\) = chunk count
$$N_{\text{chunks}} \approx \frac{N_{\text{doc}}}{L - o}$$
- Small \(L\) → many chunks; index bloats, search costs rise, but matching is fine-grained.
- Large \(L\) → fewer chunks; cheaper, but too many topics per chunk blur the embedding.
- Large \(o\) → less boundary loss but more duplication in the index and top-K.
Starting values: \(L\) between 256 and 1024 tokens, \(o\) at 10–20% of \(L\) (e.g. \(L = 512\), \(o = 50\)–\(100\)). Keep \(L\) under the embedding model's max sequence length. Tune via Parts 14–15.
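As a quick check on the formula above, here is a minimal sketch that counts windows exactly and compares the result with the approximation. The function names and the 10,000-token document are illustrative assumptions, not values from the series.

```python
import math


def chunk_count(n_doc: int, chunk_size: int, overlap: int) -> int:
    """Exact number of windows of `chunk_size` tokens that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if n_doc <= 0:
        return 0
    if n_doc <= chunk_size:
        return 1
    stride = chunk_size - overlap  # new tokens contributed by each extra chunk
    return 1 + math.ceil((n_doc - chunk_size) / stride)


def approx_chunk_count(n_doc: int, chunk_size: int, overlap: int) -> float:
    """The approximation N_doc / (L - o)."""
    return n_doc / (chunk_size - overlap)


if __name__ == "__main__":
    for overlap in (0, 50, 100):
        exact = chunk_count(10_000, 512, overlap)
        approx = approx_chunk_count(10_000, 512, overlap)
        print(f"L=512 o={overlap}: exact={exact} approx={approx:.1f}")
```

The two numbers stay close for long documents; the approximation drifts only when the document is just a few chunks long.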
6. Walkthrough — RecursiveTextSplitter
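from langchain.text_splitter import RecursiveCharacterTextSplitter
# Sizes are measured by length_function, which defaults to len (characters, not tokens).
splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=120,
    separators=["\n\n", "\n", ". ", " "],  # tried in order, coarsest first
)
chunks = splitter.split_text(markdown_text)  # markdown_text: the document as one string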
Internally:
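- Try `\n\n` (paragraph). Keep if under `chunk_size`.
- Otherwise fall back to `\n` (line).
- Then `. ` (sentence).
- Then whitespace.
- Apply overlap by carrying the tail of the previous chunk (up to `chunk_overlap`) into the start of the next one. Sizes are counted in characters unless a token-based length function is supplied.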
That priority approximates Paragraph → Line → Sentence → Word. Heading and Semantic splitting live in separate splitters; Sliding-style overlap comes from chunk_overlap.
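The fallback order can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: the function name `recursive_split` is hypothetical, lengths are counted in characters, overlap is left out, and separators are dropped at the points where a cut is made.

```python
def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Split on the coarsest separator; recurse with finer ones for oversized parts."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # Nothing left to try: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for part in text.split(sep):
        if len(part) > chunk_size:
            # Flush what has been merged so far, then split the big part more finely.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(part, chunk_size, finer))
            continue
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging neighbours while they fit
        else:
            chunks.append(current)
            current = part
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = (
        "Alpha one. Alpha two.\n\n"
        "Beta one is a much longer sentence here. Beta two."
    )
    for chunk in recursive_split(sample, 30, ["\n\n", "\n", ". ", " "]):
        print(repr(chunk))
```

The first paragraph fits and stays whole; the second is too long, so it falls through to sentence and then word boundaries.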
7. Variants
7.1 Fixed
- What changes: cut every N tokens.
- Why use it: reproducible, uniform; easy cost analysis.
- What becomes possible: fast index build, easy budgeting.
- Where it fits: even-flow English corpora, fast POCs.
- Limits: breaks semantic boundaries; severe in Korean where particles and endings carry meaning.
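A minimal sketch of Fixed chunking, assuming whitespace-separated words as a stand-in for model tokens (a real pipeline would use the embedding model's tokenizer). The function name `fixed_chunks` and the completed example sentence are illustrative.

```python
def fixed_chunks(tokens: list[str], n: int) -> list[list[str]]:
    """Cut the token list every n tokens, ignoring sentence boundaries."""
    if n <= 0:
        raise ValueError("n must be positive")
    return [tokens[i:i + n] for i in range(0, len(tokens), n)]


if __name__ == "__main__":
    # Hypothetical sentence; whitespace words approximate tokens here.
    sentence = "All employees follow the principle of least privilege when requesting access."
    for chunk in fixed_chunks(sentence.split(), 6):
        print(" ".join(chunk))
```

With n = 6 the first chunk ends at "the principle of" and the second starts at "least", which is the mid-sentence boundary described in §3.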
7.2 Paragraph
- What changes: split on paragraph markers.
- Why use it: human-written text treats paragraphs as semantic units.
- What becomes possible: better embeddings, stronger top-K.
- Where it fits: blogs, essays, general reports.
- Limits: paragraph length variance — short paragraphs cause excess chunk count.
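A minimal sketch of Paragraph chunking with one possible answer to the short-paragraph problem: glue a paragraph onto the previous chunk while that chunk is still below a minimum size. The `min_tokens` parameter and the merge rule are assumptions for illustration, and word counts stand in for token counts.

```python
import re


def paragraph_chunks(text: str, min_tokens: int = 0) -> list[str]:
    """Split on blank lines; merge forward while the previous chunk is too short."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    for para in paragraphs:
        if chunks and len(chunks[-1].split()) < min_tokens:
            chunks[-1] = chunks[-1] + "\n\n" + para  # previous chunk was too short
        else:
            chunks.append(para)
    return chunks


if __name__ == "__main__":
    text = (
        "Short note.\n\n"
        "This second paragraph has quite a few more words in it.\n\n"
        "End."
    )
    print(len(paragraph_chunks(text)))                # plain paragraph split
    print(len(paragraph_chunks(text, min_tokens=5)))  # short opener merged forward
```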
7.3 Heading
- What changes: split on markdown headings.
- Why use it: technical documents are section-shaped.
- What becomes possible: one chunk equals one section; answers arrive section-grained.
- Where it fits: technical docs, policies, API references.
- Limits: useless when there are no headings; a single huge section overflows.
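A minimal sketch of Heading chunking with a regular expression over `#` to `###` headings. The function name `heading_chunks` and the sample policy text are illustrative; the sketch does not skip `#` lines inside fenced code blocks and does not cap section length, so the overflow limit above still applies.

```python
import re

HEADING = re.compile(r"^(#{1,3})\s+(.*)$")


def heading_chunks(markdown: str) -> list[dict]:
    """Return one chunk per heading-delimited section, keeping the heading line."""
    sections: list[dict] = []
    current = {"heading": None, "lines": []}
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            # Close the running section unless it is an empty preamble.
            if current["heading"] is not None or any(l.strip() for l in current["lines"]):
                sections.append(current)
            current = {"heading": match.group(2).strip(), "lines": [line]}
        else:
            current["lines"].append(line)
    if current["heading"] is not None or any(l.strip() for l in current["lines"]):
        sections.append(current)
    return [
        {"heading": s["heading"], "text": "\n".join(s["lines"]).strip()}
        for s in sections
    ]


if __name__ == "__main__":
    doc = (
        "# Security Policy\nIntro.\n\n"
        "## 5.1 Scope\nApplies to all staff.\n\n"
        "## 5.2 Permissions\nAll employees follow the principle of least privilege.\n"
    )
    for chunk in heading_chunks(doc):
        print(chunk["heading"], "->", len(chunk["text"]), "chars")
```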
7.4 Semantic
- What changes: split where consecutive-sentence embedding distance exceeds a threshold (`SemanticChunker`).
- Why use it: when topic boundaries do not align with formatting.
- What becomes possible: meaning-preserving cuts in unstructured text.
- Where it fits: meeting minutes, interview transcripts, news.
- Limits: one embedding call per sentence; chunks must be rebuilt whenever the embedding model changes.
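A minimal sketch of the distance-threshold idea. `toy_embed` is a bag-of-words stand-in so the example runs without a model; it is an assumption for illustration and not how `SemanticChunker` embeds text. A real pipeline would pass its own embedding function and tune the threshold on its own data.

```python
import math
import re
from collections import Counter


def toy_embed(sentence: str) -> Counter:
    """Stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine_distance(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / norm if norm else 0.0)


def semantic_chunks(sentences: list[str], threshold: float, embed=toy_embed) -> list[list[str]]:
    """Start a new chunk wherever consecutive-sentence distance exceeds the threshold."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]  # one embedding per sentence
    chunks = [[sentences[0]]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine_distance(prev, cur) > threshold:
            chunks.append([sentence])
        else:
            chunks[-1].append(sentence)
    return chunks


if __name__ == "__main__":
    minutes = [
        "The budget review covers the cloud budget.",
        "The cloud budget grew because of storage.",
        "Hiring plans focus on two backend engineers.",
        "Backend engineers join the hiring pipeline in May.",
    ]
    for chunk in semantic_chunks(minutes, threshold=0.7):
        print(chunk)
```

The per-sentence `embed` call in the list comprehension is where the cost noted under Limits accumulates.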
7.5 Sliding Window
- What changes: Fixed plus overlap with the neighbour.
- Why use it: to keep boundary information from being halved.
- What becomes possible: answers near a boundary appear in both chunks.
- Where it fits: almost every corpus by default.
- Limits: index inflation and duplicate top-K entries when overlap is too large.
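A minimal sketch of Sliding Window over a token list, where each window starts `chunk_size - overlap` tokens after the previous one. The function name `sliding_chunks` is illustrative and integers stand in for tokens.

```python
def sliding_chunks(tokens: list, chunk_size: int, overlap: int) -> list[list]:
    """Fixed-size windows whose first `overlap` tokens repeat the previous window's tail."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


if __name__ == "__main__":
    for window in sliding_chunks(list(range(10)), chunk_size=4, overlap=1):
        print(window)
```

Every boundary token appears in two windows, which is also why a large overlap inflates the index and can return near-duplicate entries in top-K.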
7.6 Whole-document
- What changes: no chunking — the document becomes one embedding.
- Why use it: short, tight documents (FAQ, README, one-page policy, single meeting minutes).
- What becomes possible: document-level retrieval; full document as context.
- Where it fits: §35-1 criteria.
- Limits: long documents overflow; mix document-level and chunk-level when scales differ.
Whole-document decision table (cross-ref Part 3 §8.5):
| Document Size | Meaning Unit | Recommended |
|---|---|---|
| < 1K tokens | Tight (FAQ/README/one-page policy) | Whole-document |
| 1–4K tokens | Tight (single meeting minutes, abstracts) | Whole-document or Heading |
| 4–16K tokens | Clear section structure | Heading + Sliding |
| > 16K tokens | Multi-topic | RecursiveTextSplitter + Sliding |
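The size column of the table can be encoded as a small helper. This is a sketch under stated assumptions: the function name `recommend_strategy` is hypothetical, the tie-break in the 1–4K band (Heading when headings exist) is a guess at the table's intent, and the fallback for 4–16K documents without headings is not specified by the table.

```python
def recommend_strategy(n_tokens: int, has_headings: bool) -> str:
    """Map document size (and heading presence) to a starting strategy."""
    if n_tokens < 1_000:
        return "Whole-document"
    if n_tokens <= 4_000:
        # Table says "Whole-document or Heading"; choosing by headings is an assumption.
        return "Heading" if has_headings else "Whole-document"
    if n_tokens <= 16_000 and has_headings:
        return "Heading + Sliding"
    # Covers > 16K tokens, plus mid-sized documents without headings (assumption).
    return "RecursiveTextSplitter + Sliding"


if __name__ == "__main__":
    for size, headings in [(800, False), (3_000, True), (10_000, True), (50_000, True)]:
        print(size, headings, "->", recommend_strategy(size, headings))
```

The table's Meaning Unit column still needs human judgment: a 900-token document that mixes unrelated topics is not "tight" even though it is short.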
8. Limits and Failure Modes
8.1 A Meaning Unit Splits Across Two Chunks
- Why intrinsic: a boundary in the middle of a sentence separates subject and predicate; both chunks blur.
- Diagnosis: the gold-citation spans two chunks, but top-K returns only one.
- Mitigation: increase Sliding overlap; switch to Heading.
- Later part: Part 6 (Contextual Chunking).
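The diagnosis can be automated when chunks and gold citations carry token offsets. A minimal sketch, assuming half-open `(start, end)` spans; the function names and the offsets are illustrative.

```python
Span = tuple[int, int]  # half-open token range [start, end)


def chunks_touching(gold: Span, chunk_spans: list[Span]) -> list[int]:
    """Indices of chunks that intersect the gold citation."""
    g_start, g_end = gold
    return [i for i, (s, e) in enumerate(chunk_spans) if s < g_end and g_start < e]


def is_split(gold: Span, chunk_spans: list[Span]) -> bool:
    """True when no single chunk contains the whole gold citation."""
    g_start, g_end = gold
    return not any(s <= g_start and g_end <= e for s, e in chunk_spans)


if __name__ == "__main__":
    gold = (500, 530)
    no_overlap = [(0, 512), (512, 1024)]
    with_overlap = [(0, 512), (462, 974)]  # chunk_size 512, overlap 50
    print(chunks_touching(gold, no_overlap), is_split(gold, no_overlap))
    print(chunks_touching(gold, with_overlap), is_split(gold, with_overlap))
```

Running this over an evaluation set gives the share of gold citations that no chunk fully contains, which is a direct signal for raising overlap.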
8.2 Chunk Too Small — Topic Disperses
- Why intrinsic: tiny chunks lack context; embeddings carry weak signals.
- Diagnosis: top-K matches related keywords but cannot assemble an answer.
- Mitigation: raise chunk_size; switch to Heading or Paragraph.
- Later part: Part 6 (Parent-Child Retrieval).
8.3 Chunk Too Large — Embedding Blurs
- Why intrinsic: many topics inside one chunk average out into a generic embedding.
- Diagnosis: top-K returns similarly-scored items with no clear winner.
- Mitigation: shrink chunk_size; Semantic chunking.
- Later part: Part 8 (Embedding models).
8.4 Korean Particle Loss
- Why intrinsic: Korean particles ("에서", "으로") carry meaning. Token-level cuts can detach them from the stem.
- Diagnosis: chunk ends with truncated endings.
- Mitigation: enforce sentence-tokenizer boundaries; morphological analyzers (Mecab, Kiwi).
- Later part: Part 11 (BM25 with Korean morphology); Extra D (Korean RAG).
8.5 Metadata Lost After Chunking
- Why intrinsic: if `document_id`, `section`, `page`, `version` are not copied per chunk, filtering breaks.
- Diagnosis: chunk metadata missing key fields.
- Mitigation: propagate all relevant metadata at chunk creation.
- Later part: Part 7.
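A minimal sketch of propagating metadata at chunk creation, failing early when a required field is absent. The `Chunk` dataclass, the `attach_metadata` helper, and the extra `chunk_index` field are illustrative assumptions; the four required keys are the ones named above.

```python
from dataclasses import dataclass

REQUIRED_FIELDS = ("document_id", "section", "page", "version")


@dataclass
class Chunk:
    text: str
    metadata: dict


def attach_metadata(texts: list[str], doc_meta: dict) -> list[Chunk]:
    """Copy document-level metadata onto every chunk, plus its position."""
    missing = [key for key in REQUIRED_FIELDS if key not in doc_meta]
    if missing:
        raise KeyError(f"missing metadata fields: {missing}")
    return [
        Chunk(text=text, metadata={**doc_meta, "chunk_index": i})  # fresh dict per chunk
        for i, text in enumerate(texts)
    ]


if __name__ == "__main__":
    meta = {"document_id": "doc-1", "section": "5.2", "page": 3, "version": "v1"}
    for chunk in attach_metadata(["first part", "second part"], meta):
        print(chunk.metadata)
```

Copying into a fresh dict per chunk keeps later per-chunk edits from leaking into sibling chunks.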
8.6 Common Pitfalls
- "Smaller chunks are better." — §8.2.
- "Larger chunks carry more info." — §8.3.
- "Zero overlap is cleaner." — §8.1. Sliding is nearly the default.
- "Chunk once, keep forever." — Embedding changes and corpus rebuilds require re-chunking.
- "Korean works with English splitters." — Particle loss. §8.4.
9. Settled Conclusions
Q1. What ordering does RecursiveTextSplitter mimic?
Paragraph (\n\n) → Line (\n) → Sentence (. ) → Whitespace: biggest semantic boundary first, under chunk_size. Chapter: §6.
Q2. With chunk_size 512 and overlap 50, approximate the chunk count.
\(N_{\text{chunks}} \approx N_{\text{doc}} / (512 - 50) = N_{\text{doc}} / 462\). Chapter: §5.
Q3. State the Whole-document criteria in one line.
Short, tightly-bounded, single-topic documents. Chapter: §7.6, §35-1.
Q4. When does Semantic Chunking get expensive?
It calls the embedding model once per sentence; large corpora accumulate embedding cost. Chapter: §7.4.
Q5. Most common fix when a meaning unit splits across two chunks?
Raise Sliding overlap (e.g. 10% → 20%) or move to Heading-based chunking. Chapter: §8.1.
10. Further Reading
Primary
- LangChain Recursive Character Text Splitter documentation and source.
- Anthropic. Introducing Contextual Retrieval (2024-09 blog).
- Kamradt, G. 5 Levels of Text Splitting (2024 LangChain talk).
- Microsoft. Markdown Header Text Splitter (LangChain integration).
Official docs
- LangChain Text Splitters: https://python.langchain.com/docs/concepts/text_splitters/
- LlamaIndex NodeParsers: https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/
- Pinecone Chunking Strategies: https://www.pinecone.io/learn/chunking-strategies/
Supporting
- Author notes §3, §4 — chunks and overlap.
- Author note §35-1 — Whole-document retrieval.
Cheat Sheet
| Strategy | Starting Values | Best Fit |
|---|---|---|
| Fixed | 512 / overlap 0 | POC, uniform English |
| Paragraph | natural paragraphs | Blogs / essays |
| Heading | markdown headers | Technical docs / manuals |
| Semantic | threshold 0.5–0.7 cosine | Minutes / transcripts |
| Sliding | 512 / overlap 50–100 | Default for almost any corpus |
| Whole-document | — | < 1K-token tight docs |
Bridge — What's Next
Next — RAG Core Study (6/26) — Contextual Chunking: Parent-Child & Contextual Retrieval.
When a chunk alone is insufficient, augment with the parent document or with an LLM-generated contextual prefix. Part 6 unpacks Anthropic Contextual Retrieval (2024-09, roughly 35% fewer failed retrievals in Anthropic's report), Parent-Child Retrieval, Chunk Header Injection, and Document Summary Prefix.
Series overview: Series index