"RAG Core Study (3/26) — Ingestion Design: Document Boundary, Model-aware Schema, Filter-first Retrieval"

Core sentence (author note §35-9): "In RAG, retrieval quality is decided before 'which document to find' by 'which group of documents we are searching.'"

In Part 2 we turned PDFs into markdown. The reflex move is "embed everything → load into one vector DB → top-K search." In production, that reflex repeats the same accidents: meeting minutes leak into policy answers, training brochures impersonate security policy, last year's decisions arrive as today's operating guide. This part fills the gap between Part 1, Part 2, and Part 7, namely ingestion design, in one place.


0. Prerequisites

  • Part 1's five elements (especially Knowledge Base vs Context).
  • Part 2's markdown output.
  • Comfort with the term vector embedding (full comparison in Part 8).

This part is mostly decision tables. Code is intentionally minimal.


1. Learning Objectives

  1. Explain the six-stage ingestion workflow from §35 in your own words.
  2. Decide when to use Whole-document retrieval vs Chunk-level retrieval.
  3. Explain why Document Boundary and Filter-first Retrieval have more impact on accuracy than chunk-size tuning.

These decisions happen at the ingestion stage. Getting them wrong forces rebuilding the chunks, embeddings, and index from scratch.


2. Key Summary

RAG ingestion is not "load the text." It is six design steps — (1) understand document structure, (2) check the embedding model's training format, (3) preprocess title/body/query forms, (4) define document purpose and boundary, (5) design metadata filtering, (6) shrink the search space before retrieving. Step 4 is the most commonly missed: when policies, training material, meeting minutes, past decisions, and external assets share one vector space, semantic similarity masks purpose differences and retrieval breaks. On top of that, Filter-first Retrieval replaces "all → top-K" with "classify → filter → narrowed space → top-K."


3. Intuition — Sorted Stacks in a Library

Ask a librarian for "the latest academic policy." A good librarian does not search every shelf at once. They first narrow which section — administration / educational material / student board / academic calendar / minutes / regulations. Within that section they pull the most relevant book. A naïve librarian who scans every shelf for the nearest match to "operational policy" may hand back the student-club minutes as a regulation. Purpose differs.

RAG retrieval is the same. When purposes differ but the vector space is shared, semantic similarity hides purpose distinction. This part answers: how to partition the space ahead of time, and how to narrow it at query time.

4. Definition — Six Terms from §35

Term | One-line Definition | §35 Reference
Document Boundary | The line between document groups that must not share a vector space | §35-4
Document Purpose | Functional class: policy / education / minutes / decision / manual | §35-4
Multi-collection | Multiple indexes inside one vector DB, partitioned by purpose | §35-5
Namespace | A logical partition inside a single collection | §35-5
Pre-filter / Filtered Vector Search | Narrowing candidates with metadata before vector search | §35-6
Filter-first Retrieval | "Classify → pick filter → narrowed space → dense/sparse/hybrid → rerank" | §35-6

Document Boundary covers both physical separation (collection · namespace) and logical separation (metadata field). §5 (Stage 4) introduces the choice between them; §7 compares the options.

A second axis is Model-aware Ingestion (§35-1, §35-2). The documents are loaded in the form the embedding model was trained on. Common forms:

  • Title / Body split (many models lean strongly on titles)
  • Query / Passage distinction (some embedding models, such as the E5 family, expect query: and passage: prefixes)
  • Document Type / Purpose fields (custom schema)

5. The Six-Stage Workflow (§35 Introduction Expanded)

This is the centre of the article.

Stage 1: Understand the document structure

Inspect the section structure. Decide the question each document answers. The unit may be a single page, a single section, or the whole document.

Stage 2: Check the embedding model's training format

Read the model card:

  • Are query prefixes required (passage:, query:)?
  • Does it accept a title field separately?
  • Multilingual or English-centric?
  • What is the max sequence length (this caps your chunk size)?

Load the documents in the form the model saw at training time. Anything else costs embedding quality.
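
As a minimal sketch of what "the form the model saw" can mean for prefix-style models, the helper below prepends a role prefix before embedding. The query: / passage: strings follow the convention documented for the E5 family; the names format_for_embedding and PREFIXES are hypothetical, and the real prefixes must come from the model card of the model you use.

```python
# Hypothetical prefix table; the real values come from your model's card.
PREFIXES = {
    "e5-style": {"query": "query: ", "passage": "passage: "},
    "no-prefix": {"query": "", "passage": ""},
}


def format_for_embedding(text: str, role: str, style: str = "e5-style") -> str:
    """Return text in the form the embedding model expects for this role."""
    if role not in ("query", "passage"):
        raise ValueError(f"role must be 'query' or 'passage', got {role!r}")
    prefix = PREFIXES[style][role]
    # Collapse runs of whitespace so the prefix is followed by clean text.
    return prefix + " ".join(text.split())


print(format_for_embedding("Who approves access  requests?", "query"))
print(format_for_embedding("Access requests are approved by ...", "passage"))
```

The point of routing every string through one function is that documents at ingestion time and queries at search time cannot drift apart in format.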

Stage 3: Preprocess title/body/query forms

Normalise to the schema the model expects:

{
  "document_type": "policy",
  "title": "Information Security Policy",
  "purpose": "Employee conduct",
  "body": "...",
  "section": "5.2 Permissions",
  "valid_from": "2025-01-01"
}

Titles are strong signals — give them their own field. The body is the chunking unit. document_type and purpose drive filtering.
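
One way to keep this schema honest is to validate each record before it is embedded. The sketch below is an assumption-laden illustration: the class name IngestRecord and the ALLOWED_TYPES set (taken from the Document Purpose classes in §4) are hypothetical, not part of any library.

```python
from dataclasses import dataclass, asdict
from datetime import date

# Assumed set of purposes, mirroring the Document Purpose row in section 4.
ALLOWED_TYPES = {"policy", "education", "minutes", "decision", "manual"}


@dataclass(frozen=True)
class IngestRecord:
    document_type: str
    title: str
    purpose: str
    body: str
    section: str
    valid_from: str  # ISO date string, e.g. "2025-01-01"

    def __post_init__(self):
        if self.document_type not in ALLOWED_TYPES:
            raise ValueError(f"unknown document_type: {self.document_type!r}")
        if not self.title.strip():
            raise ValueError("title must not be empty")
        # Raises ValueError if the string is not a valid ISO date.
        date.fromisoformat(self.valid_from)


record = IngestRecord(
    document_type="policy",
    title="Information Security Policy",
    purpose="Employee conduct",
    body="...",
    section="5.2 Permissions",
    valid_from="2025-01-01",
)
print(asdict(record)["document_type"])
```

Rejecting a record at construction time is cheaper than discovering an unfilterable chunk after the index is built.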

Stage 4: Define document purpose and boundary — the key

§35-4 emphasises this stage. Make the boundary explicit.

Separation (§35-5) | Reason
Policy vs Educational material | Different intent: regulation vs guidance
Recent vs historical | Same subject, different validity time
Internal vs external | Different authority
Regulation vs commentary | Verbatim citation vs paraphrase
Specification vs sales material | Fact vs marketing
Minutes vs final decision | Process vs outcome
Developer docs vs user manuals | Different level of abstraction

Each separation is expressed as (a) a collection, (b) a namespace, or (c) a metadata field.
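
The boundary table can also be written down as data, so that every incoming document is forced into exactly one cell. The sketch below is one possible encoding: the BOUNDARIES mapping and the place function are hypothetical, and which axis goes to a collection, a namespace, or a metadata field is an illustrative choice, not a recommendation from §35.

```python
# Hypothetical assignment of each separation axis to a mechanism.
BOUNDARIES = {
    "document_type": "collection",    # policy vs education, minutes vs decision
    "audience": "namespace",          # developer docs vs user manuals
    "valid_from": "metadata",         # recent vs historical
    "source_authority": "metadata",   # internal vs external
}


def place(doc: dict) -> dict:
    """Resolve where a document lands; fail if any boundary axis is missing."""
    target = {"collection": None, "namespace": None, "metadata": {}}
    for axis, mechanism in BOUNDARIES.items():
        if axis not in doc:
            raise ValueError(f"document is missing boundary axis {axis!r}")
        if mechanism == "metadata":
            target["metadata"][axis] = doc[axis]
        else:
            target[mechanism] = doc[axis]
    return target


doc = {
    "document_type": "policy",
    "audience": "developer",
    "valid_from": "2025-01-01",
    "source_authority": "internal",
}
print(place(doc))
```

A document that cannot be placed raises an error at ingestion, which is the cheap moment to notice a missing boundary.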

Stage 5: Design metadata filtering

Decide which fields will be filterable at query time. Common fields:

  • document_id, chunk_id, version, page, section
  • document_type (policy / education / minutes / decision)
  • purpose, valid_from, valid_to
  • security_level, language, region
  • source_authority (internal / official / vendor / public)
  • audience (developer / user / executive)

Part 7 revisits these. The role of this part is to derive the field set from Stage 4's boundary table.
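
To make "filterable at query time" concrete, here is a small in-memory matcher. The filter syntax (bare value for equality, plus gte / lte / in operators) is a hypothetical mini-language for illustration; each vector DB has its own filter grammar, so treat this only as a model of the semantics.

```python
def matches(metadata: dict, flt: dict) -> bool:
    """Return True if metadata satisfies every condition in flt."""
    for field, cond in flt.items():
        if field not in metadata:
            return False
        value = metadata[field]
        if not isinstance(cond, dict):
            # A bare value means equality.
            if value != cond:
                return False
            continue
        for op, bound in cond.items():
            if op == "gte":
                ok = value >= bound
            elif op == "lte":
                ok = value <= bound
            elif op == "in":
                ok = value in bound
            else:
                raise ValueError(f"unsupported operator: {op!r}")
            if not ok:
                return False
    return True


chunk_meta = {
    "document_type": "policy",
    "valid_from": "2025-01-01",  # ISO dates compare correctly as strings
    "language": "en",
    "source_authority": "internal",
}
flt = {"document_type": "policy", "valid_from": {"gte": "2024-01-01"}}
print(matches(chunk_meta, flt))
print(matches(chunk_meta, {"language": {"in": ["ko", "ja"]}}))
```

A field that is absent from the metadata fails the filter, which is why the field set has to be decided before ingestion rather than at query time.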

Stage 6: Shrink the search space — Filter-first

Retrieval becomes two steps:

  1. Classify → pick a filter: classify the query type to choose which filters apply.
  2. Narrowed space → top-K: run dense / sparse / hybrid search inside the filtered space.

That is Filter-first Retrieval. While Naive RAG is "all → top-K", Filter-first is "classify → filter → narrowed space → top-K → rerank."
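
The difference between the two flows can be seen on a toy corpus. In the sketch below the two-dimensional vectors are made-up values chosen so that a minutes chunk sits closest to the query; CORPUS, cosine, and top_k are hypothetical names, and a real system would delegate both the filter and the search to the vector DB.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy corpus: vectors are illustrative, not real embeddings.
CORPUS = [
    {"id": "minutes-07", "document_type": "minutes", "vec": [0.9, 0.1]},
    {"id": "policy-02", "document_type": "policy", "vec": [0.8, 0.3]},
    {"id": "edu-11", "document_type": "education", "vec": [0.2, 0.9]},
]


def top_k(query_vec, k, document_type=None):
    """Rank by cosine similarity, optionally inside one document_type only."""
    candidates = [
        d for d in CORPUS
        if document_type is None or d["document_type"] == document_type
    ]
    ranked = sorted(candidates, key=lambda d: cosine(query_vec, d["vec"]), reverse=True)
    return [d["id"] for d in ranked[:k]]


query_vec = [1.0, 0.0]
print(top_k(query_vec, 1))                           # naive: ['minutes-07']
print(top_k(query_vec, 1, document_type="policy"))   # filter-first: ['policy-02']
```

The naive search returns the minutes chunk because it is nearest in the shared space; the filtered search cannot, because that chunk was never a candidate.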


6. Walkthrough — One Filter-first Query

Two lines suffice:

filter_dict = classify_and_build_filter(query)  # e.g. {"document_type": "policy", "valid_from": ">2024-01-01"}

hits = vector_index.search(query_vec, top_k=5, filter=filter_dict)

Six steps unfold:

  1. Classify the question by keywords and style (policy vs education vs minutes).
  2. Map the class to a metadata filter.
  3. Pass the filter straight to the vector DB (most DBs support metadata pre-filter).
  4. Run dense or hybrid search inside the narrowed space.
  5. Rerank top-K (Part 12).
  6. Assemble context and generate.

The dual effect: better accuracy (semantic similarity no longer masks purpose) and lower cost (a narrowed space means fewer candidates to score and rerank). The trade-off is misclassification; Part 17 (Query Classification) covers confidence scoring and fallback retrievers.
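
For steps 1 and 2, a keyword classifier is the simplest thing that could work. The sketch below is one hypothetical body for classify_and_build_filter; the KEYWORDS table is invented for illustration, and a production classifier would more likely be a trained model or an LLM call with a confidence score.

```python
import re

# Hypothetical keyword lists per document purpose.
KEYWORDS = {
    "policy": ("policy", "regulation", "allowed", "must"),
    "education": ("training", "course", "tutorial"),
    "minutes": ("meeting", "minutes", "discussed"),
}


def classify_and_build_filter(query: str) -> dict:
    """Map a query to a metadata filter; return {} when no class matches."""
    tokens = set(re.findall(r"[a-z]+", query.lower()))
    scores = {label: len(tokens & set(words)) for label, words in KEYWORDS.items()}
    best = max(scores, key=scores.get)
    if scores[best] == 0:
        # No keyword evidence: do not filter rather than guess.
        return {}
    return {"document_type": best}


print(classify_and_build_filter("What is the password policy?"))
print(classify_and_build_filter("What was discussed in the last meeting?"))
print(classify_and_build_filter("Hello there"))
```

Returning an empty filter when nothing matches keeps the unfiltered search available as the default, instead of silently excluding the right documents.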


7. Variants — Separation Patterns (Five-block)

7.1 Multi-collection — Per-purpose Collections

  • What changes: one vector DB with multiple collections, one per document purpose.
  • Why it appeared: the clearest physical separation; choosing a collection is itself a filter.
  • What becomes possible: different embedding models per collection (precise-match for regulations, multilingual for minutes).
  • Where it fits: clearly partitioned domains with independent reindexing.
  • Limits: routing complexity grows with collection count. → Part 19 (Query Routing).

7.2 Namespace — Logical Slice in One Collection

  • What changes: a namespace key inside a single collection. Pinecone offers namespaces natively; Weaviate's multi-tenancy plays a similar role.
  • Why it appeared: department, language, or tenant separation that does not justify a physical split.
  • What becomes possible: shared embedding model and index with logical isolation.
  • Where it fits: language partitioning at multinationals, tenant isolation in B2B SaaS.
  • Limits: shared embedding model — using an English-centric model for Korean data drops quality. → Part 8.

7.3 Metadata Field — Dynamic Axes

  • What changes: separation expressed as metadata conditions (where document_type = 'policy').
  • Why it appeared: when separation axes are dynamic or crossing (policy × department × language).
  • What becomes possible: arbitrary axis combinations inside one collection.
  • Where it fits: corpora with frequently changing axes.
  • Limits: filters on unindexed fields can fall back to a full scan; configure field indexes proactively.

7.4 Versioning — Time-Axis Partition

  • What changes: version, valid_from, valid_to fields.
  • Why it appeared: policy answers depend on time. Current vs last year cannot share a vector space without ambiguity.
  • What becomes possible: "only currently valid policy", time-window comparisons.
  • Where it fits: regulation, policy, contracts where validity matters.
  • Limits: every chunk inherits version metadata; reindexing churn grows. → Parts 7, 25.
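
The "only currently valid policy" condition reduces to a date-range check on each chunk's metadata. The sketch below assumes ISO date strings, an inclusive valid_to, and None meaning open-ended; is_valid_on and the sample chunk ids are hypothetical.

```python
from datetime import date


def is_valid_on(meta: dict, day: date) -> bool:
    """True if day falls inside [valid_from, valid_to]; valid_to=None is open-ended."""
    start = date.fromisoformat(meta["valid_from"])
    end_raw = meta.get("valid_to")
    end = date.fromisoformat(end_raw) if end_raw else None
    return start <= day and (end is None or day <= end)


chunks = [
    {"id": "leave-policy-v1", "valid_from": "2023-01-01", "valid_to": "2024-12-31"},
    {"id": "leave-policy-v2", "valid_from": "2025-01-01", "valid_to": None},
]

current = [c["id"] for c in chunks if is_valid_on(c, date(2025, 6, 1))]
print(current)  # ['leave-policy-v2']
```

Passing a past date instead of today gives the time-window comparison: the same query can be answered "as of" any day.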

7.5 Source Authority — Authority Ranking

  • What changes: source_authority field — internal-official / external-standard / vendor-marketing / public — weights ranking.
  • Why it appeared: same subject from different authority levels has different trust.
  • What becomes possible: "official answer first; fall back if absent."
  • Where it fits: compliance-heavy domains — medical, legal, finance.
  • Limits: authority scores need human maintenance; trust may drift.
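
"Official answer first; fall back if absent" can be sketched as a tiered selection over retrieved hits. The tier order below simply follows the order in which the levels are listed above, which is an assumption; pick_by_authority and the hit dictionaries are hypothetical.

```python
# Assumed ranking, highest trust first.
AUTHORITY_ORDER = [
    "internal-official",
    "external-standard",
    "vendor-marketing",
    "public",
]


def pick_by_authority(hits: list, k: int = 3) -> list:
    """Return hits from the highest tier that has any, best score first."""
    for tier in AUTHORITY_ORDER:
        tier_hits = [h for h in hits if h["source_authority"] == tier]
        if tier_hits:
            return sorted(tier_hits, key=lambda h: h["score"], reverse=True)[:k]
    return []


hits = [
    {"id": "vendor-brochure", "source_authority": "vendor-marketing", "score": 0.95},
    {"id": "security-policy", "source_authority": "internal-official", "score": 0.70},
]
print([h["id"] for h in pick_by_authority(hits)])  # ['security-policy']
```

This is strict tiering: a lower-trust hit never outranks a higher-trust one, whatever its similarity score. A softer variant would multiply the score by a per-tier weight instead.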

8. Limits and Failure Modes

8.1 Missing Separation — "We accidentally mixed everything"

  • Why intrinsic: at small corpus size, separation feels premature. As the corpus grows, retrofit costs the full reindex and embedding budget.
  • Diagnosis: slow accuracy decay over months as new documents arrive.
  • Mitigation: build the boundary table day one; every new document must land in one cell of it.
  • Later part: Part 7.

8.2 Embedding Format Mismatch

  • Why intrinsic: models are sharpest on the form they trained on. Embedding queries without the prefix or instruction the model card requires (for example, the E5 family expects query: and passage:) misaligns query and passage spaces.
  • Diagnosis: compare top-K accuracy across embedding formats.
  • Mitigation: follow the model card. Use title/body separately if the model supports it.
  • Later part: Parts 8, 18.
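
The diagnosis step needs a single number per embedding format. A minimal sketch is hit rate at k over a small labelled query set; the function name hit_rate_at_k is hypothetical, and the ranked lists below are placeholder values for illustration, not measurements.

```python
def hit_rate_at_k(results: dict, gold: dict, k: int) -> float:
    """Fraction of queries whose gold document id appears in the top-k results."""
    if not results:
        raise ValueError("results must not be empty")
    hits = sum(1 for query, ranked in results.items() if gold[query] in ranked[:k])
    return hits / len(results)


# Placeholder data: query id -> ranked document ids under each format.
gold = {"q1": "d1", "q2": "d2"}
with_prefix = {"q1": ["d1", "d9"], "q2": ["d2", "d4"]}
without_prefix = {"q1": ["d9", "d1"], "q2": ["d7", "d4"]}

print(hit_rate_at_k(with_prefix, gold, k=1))     # 1.0
print(hit_rate_at_k(without_prefix, gold, k=1))  # 0.0
print(hit_rate_at_k(without_prefix, gold, k=2))  # 0.5
```

Running the same labelled queries through each candidate format and comparing these numbers turns "follow the model card" into something you can verify on your own corpus.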

8.3 Filter Key Not Indexed

  • Why intrinsic: vector DBs index vectors by default, but metadata indexing is often opt-in. Filtering on unindexed fields can fall back to a full scan.
  • Diagnosis: same query with and without filter shows similar latency.
  • Mitigation: index every filterable field chosen in Stage 5 (derived from the Stage 4 boundary table). Pinecone metadata index, Qdrant payload index, Weaviate property index.
  • Later part: Part 9.

8.4 Query Classification Errors

  • Why intrinsic: a wrong filter rules out the correct documents. The answer cannot surface.
  • Diagnosis: separate empty answers from wrong answers. High empty rate suggests classifier issues.
  • Mitigation: low-confidence classification triggers a fallback retriever with no filter, or multi-collection search.
  • Later part: Parts 17, 19.
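
The mitigation above can be sketched as a thin wrapper around the classifier and the search call. Everything here is hypothetical: retrieve_with_fallback, the 0.6 threshold, and the two stub functions stand in for a real classifier and a real vector DB client.

```python
def retrieve_with_fallback(query, classify, search, threshold=0.6):
    """Use the filtered search only when the classifier is confident and it finds hits."""
    label, confidence = classify(query)
    if confidence >= threshold:
        hits = search(query, {"document_type": label})
        if hits:
            return hits, "filtered"
    # Low confidence or empty result: search the whole space instead.
    return search(query, None), "fallback"


def fake_classify(query):
    """Stub classifier returning (label, confidence)."""
    return ("policy", 0.9) if "policy" in query.lower() else ("minutes", 0.3)


def fake_search(query, flt):
    """Stub search over two documents; ignores the query text."""
    docs = [("policy-02", "policy"), ("minutes-07", "minutes")]
    return [
        doc_id for doc_id, doc_type in docs
        if flt is None or flt["document_type"] == doc_type
    ]


print(retrieve_with_fallback("password policy", fake_classify, fake_search))
print(retrieve_with_fallback("what happened last week", fake_classify, fake_search))
```

Returning the mode alongside the hits makes the fallback rate observable, which is the signal needed to separate classifier problems from retrieval problems.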

8.5 Whole-document vs Chunk Not Decided

  • Why intrinsic: short documents (FAQ, one-page policy, README) may benefit from whole-document embedding rather than chunking; chunking can splinter meaning.
  • Diagnosis: compare top-K accuracy on the same short documents in whole-doc vs chunked form.
  • Mitigation: allow both retrieval modes side by side. Part 5 returns to the decision table.
  • Later part: Parts 5, 6.
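
A first-pass rule for choosing the unit is to check whether the whole document fits the embedding model's input limit. The sketch below is a rough heuristic under stated assumptions: choose_unit is a hypothetical name, whitespace splitting only approximates the model's tokenizer, and 512 is a placeholder for the max sequence length from the model card. The accuracy comparison described above remains the real test.

```python
def choose_unit(text: str, max_tokens: int = 512) -> str:
    """Suggest 'whole-document' when the text fits the limit, else 'chunk'."""
    # Whitespace split is a crude stand-in for the model's tokenizer.
    approx_tokens = len(text.split())
    return "whole-document" if approx_tokens <= max_tokens else "chunk"


faq = "Q: How do I reset my password? A: Use the self-service portal."
long_manual = "word " * 600

print(choose_unit(faq))          # whole-document
print(choose_unit(long_manual))  # chunk
```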

8.6 Common Pitfalls

  • "One collection for everything." — §8.1.
  • "Ignore the embedding model's recommended format." — §8.2.
  • "Metadata only follows chunks." — Without indexes the filter is slow. §8.3.
  • "The classifier will handle it." — Without confidence monitoring, empty-answer cases pile up. §8.4.
  • "512-token chunk is the standard." — Short documents may belong to whole-doc retrieval. §8.5, Part 5.

9. Settled Conclusions

Q1. List the six §35 stages.

Understand structure → check embedding model format → preprocess title/body/query → set document purpose and boundary → design metadata filters → shrink the space before search. Chapter: §5.

Q2. Filter-first vs Naive — what differs?

Naive: "all → top-K." Filter-first: "classify → filter → narrowed space → top-K → rerank." Improves both accuracy and cost; trade-off is classification error. Chapter: §5, §6.

Q3. Name seven document groups that must not share a vector space.

Policy vs education / recent vs historical / internal vs external / regulation vs commentary / specification vs sales / minutes vs decisions / developer vs user manuals. Chapter: §5 Stage 4.

Q4. Multi-collection vs Namespace vs Metadata Field — what differs?

Multi-collection is physical (embedding model may differ). Namespace is logical (shared embedding). Metadata Field expresses dynamic axis combinations. Chapter: §7.

Q5. When is Whole-document Retrieval the right choice?

Short documents with clear meaning units (FAQ, one-page policy, README, individual meeting minutes), where chunking would splinter meaning. Chapter: §35-1, §8.5.


10. Further Reading

Primary sources

  • Author notes §35 (full) — Embedding Model-Aware Ingestion, 9 subsections.
  • Pinecone Blog — Multi-namespace Architecture for Multi-tenant Apps (2024).
  • Weaviate Docs — Multi-tenancy (2024).
  • Qdrant Docs — Payload Index and Filtering (2024).
  • Sarthi, P. et al. RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval. ICLR 2024. arXiv:2401.18059

Official docs

  • Pinecone Namespaces: https://docs.pinecone.io/guides/indexes/use-namespaces
  • Weaviate Multi-tenancy: https://weaviate.io/developers/weaviate/concepts/data#multi-tenancy
  • Qdrant Payload: https://qdrant.tech/documentation/concepts/payload/

Supporting

  • Author note §35-9 — "Which group of documents are we searching."
  • Author note §35-7 — Learning sequence (reflected in §5).

Cheat Sheet

Decision | Key Question
Unit of separation | Should these documents share a vector space? If not: collection, namespace, or metadata field?
Embedding format | Did you follow the model card's prefix, title/body split, max length?
Schema | Do you need document_type, purpose, valid_from, security_level, source_authority?
Filter indexes | Are frequently filtered fields indexed in the vector DB?
Whole-doc vs Chunk | If the document is short with clear units, consider whole-doc retrieval
Filter-first flow | Classify → filter → narrowed space → top-K → rerank

Bridge — What's Next

Next — RAG Core Study (4/26) — OCR & Layout Analysis.

Part 3 settled where to search. Now back to what to load — the parts Part 2's markdown conversion missed: scanned PDFs, text inside images, complex tables, multi-column layouts. Part 4 compares PaddleOCR / Tesseract / AWS Textract / Azure DI / LayoutLMv3 on five axes.
