"RAG Core Study (6/26) — Contextual Chunking: Parent-Child & Contextual Retrieval"

Part 5 covered five ways to cut a chunk. Part 6 covers what to add back when a chunk alone is not enough.

A single chunk is often insufficient. For "Where does this policy apply?" a chunk may carry the sentence that answers but not which policy and which clause that sentence belongs to. Four augmentation techniques — Parent-Child Retrieval, Chunk Header Injection, Document Summary Prefix, and Anthropic Contextual Retrieval (2024-09) — share this problem space. Anthropic reports the last technique cuts retrieval failure by 49% (67% when combined with BM25).


0. Prerequisites

  • Part 5 (chunking strategies). Part 6 is about augmentation after chunks exist.
  • Part 3 (ingestion design) metadata fields — document_id, section, version.
  • Awareness that LLM calls happen (especially in technique 4). Cost and caching matter.

1. Learning Objectives

  1. Explain in two scenarios why a chunk alone is insufficient.
  2. State the one-line difference between Parent-Child, Header Injection, Summary Prefix, and Contextual Retrieval.
  3. Quote Anthropic Contextual Retrieval's cost, cache, and impact numbers.
  4. Read a decision table for when to use which technique.

2. ํ•ต์‹ฌ ์š”์•ฝ

A chunk is the unit of retrieval but not the unit of answering. Retrieve on small units, answer on larger context — that separation is what the four techniques have in common. Parent-Child Retrieval indexes small chunks and returns the parent document or section as context (LangChain ParentDocumentRetriever). Chunk Header Injection statically prefixes each chunk with document title and section path, giving the embedding a location signal. Document Summary Prefix prepends a document-level summary to every chunk. Anthropic Contextual Retrieval (2024-09) generates a per-chunk, 50–100-token tailored context with an LLM and uses prompt caching to push cost down to roughly $1.02 per million chunks. Reported retrieval-failure reductions: 35–49%.


3. Intuition — Same Chunk, Same Query, Different Result

Take the question "How do I file an exception to the security policy?" The chunk that contains the answer:

"The applicant submits form SEC-EX-04 to InfoSec after department-head approval."

The sentence contains neither "security policy" nor "exception." Embedding search is semantic similarity, not keyword match, but missing context weakens the signal.

Apply the four augmentations to the same chunk:

diagram-1

All four leave the chunk body intact while adding extra signal to either the embedding input or the answer context. The differences: where, what, and at what cost.


4. Definitions — Four Augmentation Techniques

Technique What is added Where it goes Generation cost
Parent-Child Retrieval Parent document or section Answer context (search is unchanged) 0 (restructure only)
Chunk Header Injection Document title + section path Embedding input + answer context 0 (static metadata)
Document Summary Prefix 1–2 paragraph document summary Embedding input for every chunk 1 LLM call per document
Contextual Retrieval (Anthropic 2024-09) Per-chunk tailored context, 50–100 tokens Embedding input + (optional) BM25 1 LLM call per chunk (cache-heavy)

Common principle: "Retrieve short, answer rich." The techniques split on whether they augment the retrieval unit itself (2, 3, 4) or leave retrieval alone and grow only the answer unit (1).


5. Math — Cost Model of Augmentation

Let \(T_{\text{aug}}\) be the augmentation tokens added per chunk.

  • \(N\) = number of chunks
  • \(T_c\) = average chunk tokens
  • \(T_{\text{aug}}\) = augmentation tokens (per technique)
  • \(C_{\text{embed}}\) = embedding price ($/1K tokens)
  • \(C_{\text{llm}}\) = LLM price ($/1K input tokens, for augmentation generation)

Embedding cost (indexing the augmented chunks):

$$\text{Cost}_{\text{embed}} = N \cdot \frac{T_c + T_{\text{aug}}}{1000} \cdot C_{\text{embed}}$$

Generation cost (Contextual Retrieval; one LLM call per chunk):

$$\text{Cost}_{\text{gen}} = N \cdot \frac{T_{\text{doc}} + T_{\text{prompt}} + T_{\text{aug}}}{1000} \cdot C_{\text{llm}}$$

Here \(T_{\text{doc}}\) dominates — the same document is re-fed for every chunk. Anthropic's prompt caching (1-hour cache, 2× write, 0.1× hit) collapses that cost. Concrete numbers in §7.4.


6. Walkthrough — Four Techniques in Code

6.1 Parent-Child Retrieval (LangChain)

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
child_splitter  = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50)

retriever = ParentDocumentRetriever(
    vectorstore=vectordb,        # only child chunks are embedded
    docstore=InMemoryStore(),    # parent bodies live here
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
retriever.add_documents(docs)
results = retriever.invoke("how do I file an exception to security policy?")

Key idea: small children go into the embedding index, large parents go into the answer context.

6.2 Chunk Header Injection

def inject_header(chunk: str, doc_title: str, section_path: list[str]) -> str:
    header = f"Document: {doc_title}\nPath: {' > '.join(section_path)}\n\n"
    return header + chunk

embedded_text = inject_header(
    chunk="The applicant submits form SEC-EX-04 to InfoSec after department-head approval.",
    doc_title="Information Security Policy v3.2",
    section_path=["5. Exceptions", "5.2 Filing Procedure"],
)

LangChain's MarkdownHeaderTextSplitter extracts the header path into metadata automatically. The one line that matters is joining metadata back into the embedding input string.

6.3 Document Summary Prefix

summary = llm.invoke(
    f"Summarise the document in five sentences or fewer:\n\n{full_document}"
).content

embedded_texts = [
    f"{summary}\n\n---\n\n{chunk.page_content}"
    for chunk in chunks
]

Cheap because the LLM runs once per document, and every chunk shares the same summary. Document-level context is injected into the embedding signal.

6.4 Anthropic Contextual Retrieval

CONTEXT_PROMPT = """<document>{doc}</document>
The following chunk is part of the above document:
<chunk>{chunk}</chunk>

Generate a short (50–100 tokens) context that situates this chunk for retrieval.
Respond with only the context, no explanation."""

def contextualize(doc: str, chunk: str) -> str:
    return llm.invoke(CONTEXT_PROMPT.format(doc=doc, chunk=chunk)).content

contextual_chunks = [
    f"{contextualize(full_doc, c.page_content)}\n\n{c.page_content}"
    for c in chunks
]

The LLM writes a chunk-specific context every chunk. Prompt caching crushes the cost — the same <document> is cached, so subsequent reads are billed at 0.1×. Anthropic's recommended setup lands near $1.02 per 1M chunks at Claude 3 Haiku prices.


7. Variants

7.1 Parent-Child — Small-to-Big

  • What changes: child = sentence-level; parent = paragraph- or page-level. Retrieve small, answer big.
  • Why use it: covers §8.2's topic disperse failure at the answer stage.
  • What becomes possible: separating retrieval precision from answer richness — light index, rich answer.
  • Where it fits: long policies, statutes, technical manuals — where chunks fragment but answers must re-assemble.
  • Limits: a separate docstore is needed; the same parent can match through several children, requiring deduplication.

7.2 Header Injection — Breadcrumb Path

  • What changes: prefix each chunk with the full Doc > Ch.1 > §1.2 > 1.2.3 breadcrumb.
  • Why use it: preserves the chance that section names themselves match the query.
  • What becomes possible: the most cost-effective augmentation — zero LLM calls, often several percentage points of recall gain.
  • Where it fits: markdown, technical docs, manuals — recommended as a default.
  • Limits: useless without headings; very long paths can crowd the body signal.

7.3 Summary Prefix — Section Summary

  • What changes: replace the document-level summary with a section-level summary attached to that section's chunks.
  • Why use it: a single summary blurs on large or multi-topic documents.
  • What becomes possible: more local augmentation.
  • Where it fits: books, annual reports, omnibus manuals.
  • Limits: LLM calls scale with section count; needs accurate section boundaries (use Heading chunking together).

7.4 Contextual Retrieval — Key Numbers (Anthropic 2024-09)

Metric Baseline + Contextual Embeddings + Contextual BM25 + Reranker
Top-20 retrieval failure 5.7% 3.7% 2.9% 1.9%
Relative reduction -35% -49% -67%

Cost per million chunks, Claude 3 Haiku, prompt caching enabled:

  • Cache write: \( \approx 0.30 \text{ USD} \)
  • Cache hit on subsequent calls: \( \approx 0.03 \text{ USD per 100K tokens} \)
  • Output (generated context): \( \approx 0.72 \text{ USD} \)
  • Total: roughly $1.02 per 1M chunks

A 5-minute standard cache will not hold long enough. The 1-hour cache (2× write price) is what makes the cost model work.

7.5 Contextual Retrieval — Neighbour-aware Variant

  • What changes: include neighbour chunks in the context-generation prompt to keep continuity.
  • Why use it: adjacent chunks inside one section often share premises.
  • What becomes possible: smoother handling of long sections where boundaries break context.
  • Where it fits: papers, technical white papers.
  • Limits: longer input → noticeably higher per-call cost.

8. Limits and Failure Modes

8.1 Parent-Child — Answer Context Bloat

  • Why intrinsic: even with top-K = 5, parents of 2–3K tokens push the answer context to 10–15K — into Lost in the Middle territory (Liu 2023).
  • Diagnosis: measure answer context length; if the 95th percentile exceeds the model's recommended window (8–16K), you are exposed.
  • Mitigation: cap parents at paragraph- or page-level; let a reranker compress the children first.
  • Later part: Part 13 (Reranker).

8.2 Header Injection — Path Noise

  • Why intrinsic: if headers repeat the same vocabulary ("policy", "chapter"), every chunk gets a near-identical prefix and embeddings blur.
  • Diagnosis: same-document chunks show abnormally high mutual cosine similarity (> 0.95).
  • Mitigation: drop repeated tokens from the path; keep only document title + nearest one-level heading.
  • Later part: Part 8 (embedding models).

8.3 Summary Prefix — Summary Drift

  • Why intrinsic: summaries capture document themes but skip special items inside; every chunk gets the same prefix, weakening specific-item matches.
  • Diagnosis: thematic questions score high, but detail questions show narrow margins between top-K candidates.
  • Mitigation: switch to section-level summaries (§7.3); keep summaries short (≤ 2 sentences).
  • Later part: Part 7 (metadata to reinforce detail-level signal).

8.4 Contextual Retrieval — Hallucinated Context

  • Why intrinsic: the LLM can guess a chunk's position and invent facts not in the document; retrieval then learns a hallucinated signal.
  • Diagnosis: sample generated contexts and measure overlap with source vocabulary; low overlap = risk.
  • Mitigation: prompt with "only facts stated in the document"; cap context length (e.g. 100 tokens); low temperature; post-filter named entities not in source.
  • Later part: Extra C (hallucination defence).

8.5 Cache Miss — Cost Explosion

  • Why intrinsic: without prompt caching, every chunk re-bills the full document at list price. A 1M-chunk build can multiply by 10×.
  • Diagnosis: check the cache write/hit/miss ratios in API billing; alert when hit rate < 90%.
  • Mitigation: chunk all calls for a document contiguously to keep the cache warm; use the 1-hour cache (2× write) to extend TTL.
  • Later part: Part 23 (RAG operations cost).

8.5 Common Pitfalls

  • "Small chunks always need Parent-Child." Often Header Injection alone is enough — try the zero-cost option first.
  • "Contextual Retrieval is always best." Under ~10K chunks the cost doesn't pay back. Header + Sliding often suffices.
  • "One summary fits every chunk." §8.3. Multi-topic documents need section-level summaries.
  • "Shorter augmentation is better." Too short = weak signal; too long = body dilution. 50–150 tokens is the working sweet spot.
  • "Augment embeddings only, leave answer context raw." Passing the augmentation to the answer context too lifts LLM generation quality alongside retrieval.

9. Settled Conclusions

Q1. Order the four techniques by cost, cheapest to most expensive.

Chunk Header Injection \(<\) Parent-Child Retrieval \(<\) Document Summary Prefix \(<\) Contextual Retrieval. The first two have no LLM calls; Summary is one per document; Contextual is one per chunk. Chapter: §4.

Q2. Estimate Parent-Child's Lost in the Middle risk in one formula.

Answer context ≈ top-K × average parent tokens. e.g. K=5, parent=2K → 10K tokens; risk rises past the model's recommended 8K window. Chapter: §8.1.

Q3. What retrieval-failure reductions does Anthropic Contextual Retrieval report?

Contextual Embeddings alone -35%, plus BM25 -49%, plus Reranker -67% (Top-20, 2024-09). Chapter: §7.4.

Q4. Why is prompt caching decisive for Contextual Retrieval?

Because the same document is re-fed for every chunk. A 5-minute cache is too short; the 1-hour cache (2× write) is the precondition for the cost model. Chapter: §7.4, §8.5.

Q5. Which augmentation can hallucinate, and why?

Contextual Retrieval — the LLM may guess a chunk's position and assert facts not in the source. Mitigate with low temperature, length caps, and post-filtering. Chapter: §8.4.


10. Further Reading

Primary

  • Anthropic. Introducing Contextual Retrieval (2024-09 blog). Source for §7.4 numbers.
  • LangChain. ParentDocumentRetriever documentation and source.
  • Sarthi, P. et al. RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval. ICLR 2024. arXiv:2401.18059. (recursive generalisation of summary prefix)
  • Liu, N. F. et al. Lost in the Middle: How Language Models Use Long Contexts. TACL 2024. arXiv:2307.03172. (basis for the answer-bloat risk)

Official docs

  • LangChain ParentDocumentRetriever: https://python.langchain.com/docs/how_to/parent_document_retriever/
  • LangChain MarkdownHeaderTextSplitter: https://python.langchain.com/docs/how_to/markdown_header_metadata_splitter/
  • Anthropic Prompt Caching: https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching
  • LlamaIndex AutoMergingRetriever (Parent-Child variant): https://docs.llamaindex.ai/en/stable/examples/retrievers/auto_merging_retriever/

Supporting

  • Author notes §5 — chunk augmentation.
  • Author note §35-1 — link between Parent Document Retrieval and Whole-document retrieval.

Cheat Sheet

Technique LLM calls Embedding tokens added Retrieval-failure reduction (approx.) Recommended use
Header Injection 0 20–80 5–10 pp Default; markdown/structured docs
Parent-Child 0 0 (structural) 10–20 pp Long policies, statutes, manuals
Summary Prefix 1 per document 100–200 10–15 pp Single-topic mid- to long-form
Contextual Retrieval 1 per chunk (cached) 50–100 35–49 pp (with BM25) Large indices, accuracy-first

Decision rule of thumb: Lay down Header Injection (free) first; if that isn't enough, add Parent-Child → Summary → Contextual in that order. The four techniques stack; they are not substitutes.


Bridge — What's Next

Next — RAG Core Study (7/26) — Metadata Design: Filters, Permissions, Provenance.

What must travel alongside each chunk so retrieval can filter by user, time, or document type? Part 7 unpacks the seven core fields — document_id, chunk_id, version, page, section, security_level, namespace — together with permission models and provenance tracking.

Series overview: Series index

๋Œ“๊ธ€

์ด ๋ธ”๋กœ๊ทธ์˜ ์ธ๊ธฐ ๊ฒŒ์‹œ๋ฌผ

Agent Memory Engine (2/10) — Building an AI Agent Memory System with SQLite Alone

"ML Foundations (9/9) — PyTorch vs TensorFlow, and the Road to Local LLMs"

"RAG Core Study (14/26) — Evaluation Sets with RAGAS & DeepEval"

"ML Foundations (8/9) — Deep Learning Architectures: CNN, RNN, Attention"

"ML Foundations (7/9) — Deep Learning Training: Optimizers, Regularization, Initialization"

OpenClaw to Hermes Migration (2/13) — What to Preserve, Partially Port, or Discard

AI Agents I Built (5/7) — Building an Automated Blogger API Publishing System