"RAG Core Study (5/26) — Five Paths of Chunking"

May 14, 2026

Documents are markdown; the search space is partitioned. Now we decide: what unit does a single embedding represent?

Chunking looks like the shortest decision in RAG but produces the widest downstream consequences. Same corpus, same embedding, same LLM — retrieval accuracy can swing 10 to 30 percentage points based purely on chunking strategy (Anthropic Contextual Retrieval, RAGAS benchmarks). Part 5 unpacks five paths — Fixed / Paragraph / Heading / Semantic / Sliding Window — and the Whole-document decision table left from Part 3.

0. Prerequisites

Parts 2 (markdown) and 3 (Ingestion Design).
Familiarity with max sequence length (commonly 512–8192 tokens).
Tokens ≠ characters: a Korean character averages roughly 1.3–2 tokens, depending on the tokenizer.

1. Learning Objectives

State the one-line difference between the five chunking strategies.
Express the trade-off of chunk_size and overlap in a single formula.
Read the Whole-document adoption criteria from a table.

2. Key Summary

A chunk is the unit of embedding and the unit of retrieval. Chunks that are too small fragment meaning; chunks that are too large invite Lost in the Middle and max-length overflow. Fixed (token count) is simple but breaks semantic boundaries. Paragraph suits prose, but paragraph lengths vary widely. Heading shines on structured documents. Semantic cuts where the text changes topic. Sliding Window overlaps neighbours to reduce boundary loss. RecursiveTextSplitter is the generic default: it tries a list of separators in priority order, from coarse to fine. And for short, tight documents, Whole-document is often the right answer, which is the point of §35-1 from Part 3.

3. Intuition — One Paragraph Cut Five Ways

"Security policy §5.2 Permissions — All employees follow the principle of least privilege …" A one-page policy.

Fixed (512 tokens): cuts mid-sentence; an awkward boundary like "the principle of least…" appears.
Paragraph: variable lengths from 80 to 500 tokens.
Heading: cuts at ## 5.2 Permissions; one chunk equals one section.
Semantic: cuts where consecutive-sentence embedding distance spikes.
Sliding Window: Fixed plus 50–100-token overlap with the neighbour.

Each strategy places the same sentences at different positions inside a chunk, so the embeddings see different signals.

4. Definitions — Five Strategies + Whole-document

Strategy	Split Rule	Strength	Weakness
Fixed	Every N tokens	Simple, reproducible	Breaks semantic boundary
Paragraph	On `\n\n`	Preserves meaning	Length variance
Heading	On `# / ## / ###`	Strong on structured docs	Useless without headings
Semantic	Embedding distance threshold	Sensitive to topic change	Embedding-call cost
Sliding Window	Fixed + overlap	Reduces boundary loss	Larger index
Whole-document	No split	Best for short tight docs	Wrong for long ones

RecursiveTextSplitter (shorthand here for LangChain's RecursiveCharacterTextSplitter) tries a priority list of separators (e.g. ["\n\n", "\n", ". ", " "]) and cuts at the coarsest boundary that keeps each chunk under chunk_size.

5. Math — chunk_size and overlap

$L$ = chunk_size (tokens)
$o$ = overlap (tokens)
$N_{\text{doc}}$ = document tokens
$N_{\text{chunks}}$ = chunk count

$$N_{\text{chunks}} \approx \frac{N_{\text{doc}}}{L - o}$$

Small $L$ → many chunks; index bloats, search costs rise, but matching is fine-grained.
Large $L$ → fewer chunks; cheaper, but too many topics per chunk blur the embedding.
Large $o$ → less boundary loss but more duplication in the index and top-K.

Working starts: $L$ = 256–1024 tokens, $o$ = 10–20% of $L$ (e.g. $L = 512$, $o$ = 50–100). Always stay under the embedding model's max length. Tune via Parts 14–15.
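
As a rough check on the formula above, the sketch below compares the approximation with an exact count for windows that start every $L - o$ tokens. The function names and the 10,000-token document are illustrative assumptions, not values from the series.

```python
import math

def approx_chunks(n_doc: int, chunk_size: int, overlap: int) -> float:
    """Approximation from the formula above: N_doc / (L - o)."""
    return n_doc / (chunk_size - overlap)

def exact_chunks(n_doc: int, chunk_size: int, overlap: int) -> int:
    """Exact count when windows of chunk_size start every (chunk_size - overlap) tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("require 0 <= overlap < chunk_size")
    if n_doc <= 0:
        return 0
    if n_doc <= chunk_size:
        return 1
    stride = chunk_size - overlap
    # The last window must reach the end of the document.
    return math.ceil((n_doc - overlap) / stride)

n_doc = 10_000  # hypothetical document length in tokens
for overlap in (0, 50, 100):
    approx = round(approx_chunks(n_doc, 512, overlap), 1)
    print(overlap, approx, exact_chunks(n_doc, 512, overlap))
```

For this example the exact counts are 20, 22, and 25 chunks, so going from zero overlap to about 20% overlap grows the index by roughly a quarter.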

6. Walkthrough — RecursiveTextSplitter

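# Recent LangChain releases ship the splitters in the langchain-text-splitters package;
# older releases exposed the same class as langchain.text_splitter.
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Placeholder input so the snippet runs on its own; substitute the real document.
markdown_text = "## 5.2 Permissions\n\nAll employees follow the principle of least privilege."

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,      # counted in characters by default (length_function=len), not tokens
    chunk_overlap=120,   # also characters
    separators=["\n\n", "\n", ". ", " "],  # coarse to fine
)
chunks = splitter.split_text(markdown_text)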

Internally:

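Split on \n\n (paragraph) and merge adjacent pieces while the result stays under chunk_size.
For any piece that is still too long, fall back to \n (line).
Then ". " (sentence).
Then whitespace.
Apply overlap by starting each chunk with the trailing pieces of the previous chunk, up to chunk_overlap (characters by default, not tokens).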

That priority approximates Paragraph → Line → Sentence → Word. Heading, Semantic, and Sliding live in separate splitters.
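
The steps above can be sketched in plain Python. This is a simplified illustration and not LangChain's implementation: it measures length in characters, drops the separator at each cut point, and leaves out overlap. `recursive_split` is a hypothetical helper name.

```python
def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Split on the coarsest separator first; recurse into pieces that are still too long."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard character cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    buffer = ""
    for piece in text.split(sep):
        candidate = piece if not buffer else buffer + sep + piece
        if len(candidate) <= chunk_size:
            buffer = candidate            # keep merging small pieces
            continue
        if buffer:
            chunks.append(buffer)         # flush what already fits
            buffer = ""
        if len(piece) <= chunk_size:
            buffer = piece
        else:
            # Still too long: retry this piece with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if buffer:
        chunks.append(buffer)
    return chunks

sample = (
    "First paragraph. It has two sentences.\n\n"
    "Second paragraph is short.\n\n" + "word " * 30
)
pieces = recursive_split(sample, chunk_size=60, separators=["\n\n", "\n", ". ", " "])
for p in pieces:
    print(len(p), repr(p))
```

The two short paragraphs survive as whole chunks, while the long run of words falls through to the whitespace separator and is packed into pieces of at most 60 characters.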

7. Variants

7.1 Fixed

What changes: cut every N tokens.
Why use it: reproducible, uniform; easy cost analysis.
What becomes possible: fast index build, easy budgeting.
Where it fits: even-flow English corpora, fast POCs.
Limits: breaks semantic boundaries; severe in Korean where particles and endings carry meaning.
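
A minimal sketch of Fixed chunking follows. Whitespace-separated words stand in for real tokenizer output, which is an assumption made only to keep the example self-contained; `fixed_chunks` is a hypothetical helper.

```python
def fixed_chunks(tokens: list[str], n: int) -> list[list[str]]:
    """Cut every n tokens with no regard for sentence boundaries."""
    if n <= 0:
        raise ValueError("n must be positive")
    return [tokens[i:i + n] for i in range(0, len(tokens), n)]

# Whitespace tokens are a stand-in for a real tokenizer (assumption for illustration).
text = "All employees follow the principle of least privilege. Access is reviewed every quarter."
tokens = text.split()
chunks = fixed_chunks(tokens, n=7)
for chunk in chunks:
    print(" ".join(chunk))
```

With n=7 the first chunk ends at "the principle of least" and the second starts with "privilege.", which is the awkward boundary described in §3.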

7.2 Paragraph

What changes: split on paragraph markers.
Why use it: human-written text treats paragraphs as semantic units.
What becomes possible: better embeddings, stronger top-K.
Where it fits: blogs, essays, general reports.
Limits: paragraph length variance — short paragraphs cause excess chunk count.
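
One way to soften the short-paragraph problem is to merge a paragraph forward until it reaches a minimum length. The sketch below assumes blank lines mark paragraphs and measures length in characters; `paragraph_chunks` and `min_chars` are hypothetical names.

```python
import re

def paragraph_chunks(text: str, min_chars: int = 0) -> list[str]:
    """Split on blank lines; merge a short paragraph into the one that follows it."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    carry = ""
    for para in paragraphs:
        current = f"{carry}\n\n{para}" if carry else para
        if len(current) < min_chars:
            carry = current             # still too short: hold and merge forward
        else:
            chunks.append(current)
            carry = ""
    if carry:
        chunks.append(carry)            # trailing short remainder
    return chunks

doc = (
    "Intro.\n\n"
    "All employees follow the principle of least privilege.\n\n"
    "Access is reviewed every quarter."
)
plain = paragraph_chunks(doc)
merged = paragraph_chunks(doc, min_chars=20)
print(len(plain), len(merged))
```

Without a minimum the one-word "Intro." becomes its own chunk; with `min_chars=20` it is attached to the paragraph that follows.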

7.3 Heading

What changes: split on markdown headings.
Why use it: technical documents are section-shaped.
What becomes possible: one chunk equals one section; answers arrive section-grained.
Where it fits: technical docs, policies, API references.
Limits: useless when there are no headings; a single huge section overflows.
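
A minimal sketch of Heading chunking with a regular expression follows. It is not LangChain's MarkdownHeaderTextSplitter; `heading_chunks` is a hypothetical helper, and it does not guard against `#` lines inside code blocks or against a single oversized section.

```python
import re

# Levels 1-3 only, matching the "# / ## / ###" rule in the definitions table.
HEADING = re.compile(r"^(#{1,3})[ \t]+(.*)$", re.MULTILINE)

def heading_chunks(markdown: str) -> list[dict]:
    """Return one chunk per section, keeping the heading text and level as metadata."""
    matches = list(HEADING.finditer(markdown))
    chunks = []
    # Text before the first heading, if any, becomes its own chunk.
    preamble_end = matches[0].start() if matches else len(markdown)
    preamble = markdown[:preamble_end].strip()
    if preamble:
        chunks.append({"heading": None, "level": 0, "text": preamble})
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown)
        chunks.append({
            "heading": m.group(2).strip(),
            "level": len(m.group(1)),
            "text": markdown[m.start():end].strip(),
        })
    return chunks

md = (
    "# Security policy\n\nScope.\n\n"
    "## 5.1 Accounts\n\nOne account per person.\n\n"
    "## 5.2 Permissions\n\nLeast privilege applies.\n"
)
sections = heading_chunks(md)
for s in sections:
    print(s["level"], s["heading"], len(s["text"]))
```

Keeping the heading inside the chunk text as well as in the metadata lets the embedding see the section title.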

7.4 Semantic

What changes: split where consecutive-sentence embedding distance exceeds a threshold (SemanticChunker).
Why use it: when topic boundaries do not align with formatting.
What becomes possible: meaning-preserving cuts in unstructured text.
Where it fits: meeting minutes, interview transcripts, news.
Limits: one embedding call per sentence; chunks must be rebuilt whenever the embedding model changes.
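
The threshold rule can be sketched without any embedding service. The code below uses bag-of-words counts as a toy stand-in for a real embedding model, so the distances and the 0.8 threshold are only meaningful for this example. It is not the SemanticChunker implementation, and all names are hypothetical.

```python
import math
import re
from collections import Counter

def toy_embed(sentence: str) -> Counter:
    """Bag-of-words counts; a stand-in for a real embedding model (assumption)."""
    return Counter(re.findall(r"\w+", sentence.lower()))

def cosine_distance(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / norm if norm else 0.0)

def semantic_chunks(sentences: list[str], threshold: float) -> list[list[str]]:
    """Start a new chunk whenever the distance to the previous sentence exceeds threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    prev = toy_embed(sentences[0])      # one "embedding call" per sentence
    for sent in sentences[1:]:
        cur = toy_embed(sent)
        if cosine_distance(prev, cur) > threshold:
            chunks.append([sent])       # distance spike: topic boundary
        else:
            chunks[-1].append(sent)
        prev = cur
    return chunks

sentences = [
    "The budget review covers the cloud budget.",
    "The cloud budget grew this quarter.",
    "Hiring plans start with two engineers.",
    "Two engineers join the platform team.",
]
result = semantic_chunks(sentences, threshold=0.8)
for chunk in result:
    print(chunk)
```

The budget sentences and the hiring sentences share no words, so the distance between the second and third sentence is 1.0 and the cut lands there.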

7.5 Sliding Window

What changes: Fixed plus overlap with the neighbour.
Why use it: to keep boundary information from being halved.
What becomes possible: answers near a boundary appear in both chunks.
Where it fits: almost every corpus by default.
Limits: index inflation and duplicate top-K entries when overlap is too large.
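
A minimal sketch of Sliding Window chunking follows, again with whitespace-separated words standing in for real tokens (an assumption). `sliding_chunks` is a hypothetical helper.

```python
def sliding_chunks(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Windows of `size` tokens whose start positions are `size - overlap` apart."""
    if not 0 <= overlap < size:
        raise ValueError("require 0 <= overlap < size")
    stride = size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks

text = "All employees follow the principle of least privilege. Access is reviewed every quarter."
tokens = text.split()
windows = sliding_chunks(tokens, size=7, overlap=3)
for w in windows:
    print(" ".join(w))
```

The first window still ends at "the principle of least", but the second window starts three tokens earlier and contains "principle of least privilege." intact, at the cost of one extra chunk.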

7.6 Whole-document

What changes: no chunking — the document becomes one embedding.
Why use it: short, tight documents (FAQ, README, one-page policy, single meeting minutes).
What becomes possible: document-level retrieval; full document as context.
Where it fits: §35-1 criteria.
Limits: long documents overflow; mix document-level and chunk-level when scales differ.

Whole-document decision table (cross-ref Part 3 §8.5):

Document Size	Meaning Unit	Recommended
< 1K tokens	Tight (FAQ/README/one-page policy)	Whole-document
1–4K tokens	Tight (single meeting minutes, abstracts)	Whole-document or Heading
4–16K tokens	Clear section structure	Heading + Sliding
> 16K tokens	Multi-topic	RecursiveTextSplitter + Sliding
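
The table can be encoded as a small routing function. This is a sketch under assumptions: the table does not say which row owns the exact boundaries or what to do with combinations it does not list (for example a short document that is not tight), so those cases fall through to the next rule here. `recommend_strategy` and its parameters are hypothetical.

```python
def recommend_strategy(n_tokens: int, tight: bool, has_sections: bool) -> str:
    """Encode the decision table above; unlisted combinations fall through to later rules."""
    if n_tokens < 1_000 and tight:
        return "Whole-document"
    if n_tokens <= 4_000 and tight:
        return "Whole-document or Heading"
    if n_tokens <= 16_000 and has_sections:
        return "Heading + Sliding"
    # Everything else, including long multi-topic documents.
    return "RecursiveTextSplitter + Sliding"

examples = [
    (800, True, False),      # one-page policy
    (3_000, True, True),     # single meeting minutes
    (10_000, False, True),   # sectioned technical document
    (50_000, False, True),   # long multi-topic document
]
for n_tokens, tight, has_sections in examples:
    print(n_tokens, recommend_strategy(n_tokens, tight, has_sections))
```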

8. Limits and Failure Modes

8.1 A Meaning Unit Splits Across Two Chunks

Why intrinsic: a boundary in the middle of a sentence separates subject and predicate; both chunks blur.
Diagnosis: the gold-citation spans two chunks, but top-K returns only one.
Mitigation: increase Sliding overlap; switch to Heading.
Later part: Part 6 (Contextual Chunking).

8.2 Chunk Too Small — Topic Disperses

Why intrinsic: tiny chunks lack context; embeddings carry weak signals.
Diagnosis: top-K matches related keywords but cannot assemble an answer.
Mitigation: raise chunk_size; switch to Heading or Paragraph.
Later part: Part 6 (Parent-Child Retrieval).

8.3 Chunk Too Large — Embedding Blurs

Why intrinsic: many topics inside one chunk average out into a generic embedding.
Diagnosis: top-K returns similarly-scored items with no clear winner.
Mitigation: shrink chunk_size; Semantic chunking.
Later part: Part 8 (Embedding models).

8.4 Korean Particle Loss

Why intrinsic: Korean particles ("에서", "으로") carry meaning. Token-level cuts can detach them from the stem.
Diagnosis: chunk ends with truncated endings.
Mitigation: enforce sentence-tokenizer boundaries; morphological analyzers (Mecab, Kiwi).
Later part: Part 11 (BM25 with Korean morphology); Extra D (Korean RAG).
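
Enforcing sentence boundaries can be sketched as packing whole sentences into a chunk until a budget is reached. The splitter below is a naive regular expression, not a morphological analyzer; the budget is in characters and the Korean sample sentences are made up for illustration. `sentence_packed_chunks` is a hypothetical helper.

```python
import re

# Naive rule: a sentence ends at ., !, or ? followed by whitespace.
# A real sentence tokenizer or morphological analyzer would replace this in practice.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def sentence_packed_chunks(text: str, max_chars: int) -> list[str]:
    """Pack whole sentences into chunks so no cut lands inside a word or particle."""
    chunks: list[str] = []
    current = ""
    for sentence in SENTENCE_END.split(text.strip()):
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            chunks.append(current)      # budget reached: close the chunk at a sentence end
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

text = "회의는 서울에서 열렸다. 자료는 메일로 보냈다. 다음 회의는 부산에서 연다."
packed = sentence_packed_chunks(text, max_chars=30)
for chunk in packed:
    print(len(chunk), chunk)
```

A sentence longer than the budget is kept whole rather than cut, so chunk length becomes a soft limit; that trade-off is a design choice of this sketch.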

8.5 Metadata Lost After Chunking

Why intrinsic: if document_id, section, page, version are not copied per chunk, filtering breaks.
Diagnosis: chunk metadata missing key fields.
Mitigation: propagate all relevant metadata at chunk creation.
Later part: Part 7.
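
A minimal sketch of metadata propagation and the matching diagnosis check follows. The `Chunk` class, `make_chunks`, and the sample field values are hypothetical; only the field names document_id, section, page, and version come from the text above.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)

def make_chunks(pieces: list[str], doc_metadata: dict) -> list[Chunk]:
    """Copy document-level metadata onto every chunk and add a per-chunk index."""
    return [
        Chunk(text=piece, metadata={**doc_metadata, "chunk_index": i})
        for i, piece in enumerate(pieces)
    ]

# Hypothetical document-level metadata.
doc_metadata = {"document_id": "policy-001", "section": "5.2", "page": 3, "version": "v2"}
chunks = make_chunks(["Least privilege applies.", "Access is reviewed quarterly."], doc_metadata)

# Diagnosis: list chunks that are missing any field the filters rely on.
required = {"document_id", "section", "page", "version"}
missing = [c for c in chunks if not required.issubset(c.metadata)]
print(len(chunks), len(missing))
```

Each chunk gets its own copy of the dictionary, so a later per-chunk edit (for example a corrected page number) does not leak into its neighbours.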

8.6 Common Pitfalls

"Smaller chunks are better." — §8.2.
"Larger chunks carry more info." — §8.3.
"Zero overlap is cleaner." — §8.1. Sliding is nearly the default.
"Chunk once, keep forever." — Embedding changes and corpus rebuilds require re-chunking.
"Korean works with English splitters." — Particle loss. §8.4.

9. Settled Conclusions

Q1. What ordering does RecursiveTextSplitter mimic?

Paragraph (\n\n) → Line (\n) → Sentence (.) → Whitespace — biggest semantic boundary first, under chunk_size. Chapter: §6.

Q2. With chunk_size 512 and overlap 50, approximate the chunk count.

$N_{\text{chunks}} \approx N_{\text{doc}} / (512 - 50) = N_{\text{doc}} / 462$. Chapter: §5.

Q3. State the Whole-document criteria in one line.

Short, tightly-bounded, single-topic documents. Chapter: §7.6, §35-1.

Q4. When does Semantic Chunking get expensive?

It calls the embedding model once per sentence; large corpora accumulate embedding cost. Chapter: §7.4.

Q5. Most common fix when a meaning unit splits across two chunks?

Raise Sliding overlap (e.g. 10% → 20%) or move to Heading-based chunking. Chapter: §8.1.

10. Further Reading

Primary

LangChain Recursive Character Text Splitter documentation and source.
Anthropic. Introducing Contextual Retrieval (2024-09 blog).
Kamradt, G. 5 Levels of Text Splitting (2024 LangChain talk).
LangChain. MarkdownHeaderTextSplitter documentation.

Official docs

LangChain Text Splitters: https://python.langchain.com/docs/concepts/text_splitters/
LlamaIndex NodeParsers: https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/
Pinecone Chunking Strategies: https://www.pinecone.io/learn/chunking-strategies/

Supporting

Author notes §3, §4 — chunks and overlap.
Author note §35-1 — Whole-document retrieval.

Cheat Sheet

Strategy	Starting Values	Best Fit
Fixed	512 / overlap 0	POC, uniform English
Paragraph	natural paragraphs	Blogs / essays
Heading	markdown headers	Technical docs / manuals
Semantic	threshold 0.5–0.7 cosine	Minutes / transcripts
Sliding	512 / overlap 50–100	Default for almost any corpus
Whole-document	—	< 1K-token tight docs

Bridge — What's Next

Next — RAG Core Study (6/26) — Contextual Chunking: Parent-Child & Contextual Retrieval.

When a chunk alone is insufficient, augment it with the parent document or with an LLM-generated contextual prefix. Part 6 unpacks Anthropic Contextual Retrieval (2024-09; Anthropic reports a roughly 35% reduction in top-20 retrieval failures from contextual embeddings alone), Parent-Child Retrieval, Chunk Header Injection, and Document Summary Prefix.
