"RAG Core Study (7/26) — Metadata Design: Filters, Permissions, Provenance"

Chunks are built and augmented. Now we decide what lives beside a chunk — the fields that let retrieval slice by user, time, and document type.

Metadata is the quiet half of RAG. Without the fields stored alongside the chunk body, the questions "only this quarter's security policies", "only what this user is allowed to see", "only the latest version" cannot even be asked. Good metadata design lifts retrieval accuracy by 5–30 percentage points; poor design causes permission incidents and unverifiable answers. Part 7 unpacks seven core fields — document_id, chunk_id, version, page, section, security_level, namespace — alongside permission models, provenance, and filter cost.


0. Prerequisites

  • Part 3's Document Boundary and Filter-first Retrieval.
  • Chunk-creation flow from Parts 5–6.
  • Awareness that vector DBs support metadata filtering — Part 9 covers it in depth.

1. Learning Objectives

  1. State the role of each of the seven core metadata fields in one line.
  2. Express permission, version, and provenance filters in a single formula set.
  3. Explain how cardinality affects search performance.
  4. Name the five failure modes triggered by missing metadata.

2. Key Summary

If the chunk body answers "what does this know?", metadata answers "where, when, from whom?" Seven core fields — document_id (which doc), chunk_id (which slice), version (lineage), page/section (origin), security_level (access), namespace (logical search space) — must travel with every chunk. Retrieval runs as pre-filter → ANN → post-filter. Pre-filtering is for low-cardinality fields (e.g. tenant_id); post-filtering for high-cardinality ones (e.g. date ranges). Permissions belong in pre-filter — post-filtering them produces empty results and risks leaks.


3. Intuition — Same Corpus, Different Users

The vector DB contains both all-staff documents and executives-only documents. A regular employee's query that returns executive chunks in top-K is a permission incident. Plain dense retrieval cannot distinguish them — metadata filters do.


Why pre-filter first: if the ANN ever places a disallowed chunk in top-K, that chunk leaks to the user. Filtering after ANN can leave candidates empty (e.g. top-10 all disallowed → zero answers). Permissions must shrink the search space itself.
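The difference can be seen in a toy simulation. This is a minimal sketch, not a real vector DB: the `Chunk` class, the hard-coded scores, and the `top_k` stand-in for ANN are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    security_level: str
    score: float  # similarity to the query; higher is closer


# Toy index: the executives-only chunks happen to be the closest matches.
INDEX = [
    Chunk("exec#c0001", "confidential", 0.95),
    Chunk("exec#c0002", "confidential", 0.93),
    Chunk("exec#c0003", "confidential", 0.91),
    Chunk("staff#c0001", "internal", 0.80),
    Chunk("staff#c0002", "public", 0.75),
]


def top_k(chunks, k):
    """Stand-in for ANN search: return the k highest-scoring chunks."""
    return sorted(chunks, key=lambda c: c.score, reverse=True)[:k]


def post_filter_search(allowed, k):
    # Search the whole index first, then drop disallowed hits.
    return [c for c in top_k(INDEX, k) if c.security_level in allowed]


def pre_filter_search(allowed, k):
    # Shrink the search space first, then search only what is allowed.
    return top_k([c for c in INDEX if c.security_level in allowed], k)


employee = {"public", "internal"}
post_hits = post_filter_search(employee, k=3)
pre_hits = pre_filter_search(employee, k=3)
print(len(post_hits), [c.chunk_id for c in pre_hits])
```

With k=3, the post-filter path returns nothing because all three nearest chunks are confidential, while the pre-filter path returns the two staff chunks.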


4. Definitions — Seven Core Fields

| Field | What it carries | Example | Who uses it |
| --- | --- | --- | --- |
| document_id | Document identifier | policy-sec-v3.2 | Provenance, dedup |
| chunk_id | Position within doc | policy-sec-v3.2#c014 | Chunk-level citation |
| version | Document version | 3.2, 2026-05-15 | Latest-only / specific version |
| page | Source page | 42 | Citation location |
| section | Heading path | ["5", "5.2", "5.2.1"] | Reuses Part 6 Header Injection |
| security_level | Access tier | public, internal, confidential, secret | Permission pre-filter (required) |
| namespace | Logical search space | policy, manual, meeting_notes | Reuses Part 3 Filter-first |

Optional supporting fields (add as needed):

  • source_url — original link.
  • author, owner_team — responsibility trail.
  • created_at, updated_at — time filters.
  • language: ko, en. Splits multilingual corpora.
  • tenant_id — the top pre-filter in multi-tenant SaaS.

Core rule: when a chunk is created, every parent-doc metadata field is copied onto it. Same point as Part 5 §8.5.


5. Math — Filter Cost and Cardinality

Filter efficiency depends on cardinality \(C\) and selectivity \(s\).

  • \(N\) = total chunks in the index
  • \(C\) = number of unique values for the filter field (cardinality)
  • \(s\) = fraction of chunks passing the filter (e.g. tenant_id=A is 10% of the index → \(s=0.1\))
  • \(k\) = top-K

Pre-filter cost (ANN search restricted to chunks that pass the filter):

$$\text{Cost}_{\text{pre}} \approx \mathcal{O}(\log N) + \mathcal{O}(s \cdot N)$$

Smaller \(s\) is better: fewer chunks remain to be searched. Pre-filtering is strongest on low-cardinality fields where each value passes only a small share of the index (tenant_id, security_level).

Post-filter cost (ANN returns top-K' candidates, then filter):

$$\text{Cost}_{\text{post}} \approx \mathcal{O}(\log N) + \mathcal{O}(k')$$

\(k'\) is the over-fetch needed to keep enough answers after filtering. When \(s\) is small, \(k'\) explodes (\(k' \approx k/s\)). Unsuitable for tight filters.
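The \(k' \approx k/s\) relationship is easy to check with arithmetic. The helper below is a rough sketch: the function name `overfetch` is hypothetical, and it assumes passing chunks are spread evenly through the ANN ranking, which real data rarely guarantees.

```python
import math


def overfetch(k: int, s: float) -> int:
    """Estimate how many candidates k' to fetch so that about k survive
    a post-filter whose pass fraction is s (0 < s <= 1)."""
    if not 0 < s <= 1:
        raise ValueError("s must be in (0, 1]")
    # round() guards against float noise before taking the ceiling.
    return math.ceil(round(k / s, 9))


for s in (0.5, 0.1, 0.01):
    print(f"s={s}: fetch about {overfetch(5, s)} candidates for k=5")
```

For k=5, a filter that passes half the index needs about 10 candidates, while one that passes 10% already needs about 50. That growth is why tight filters are a poor fit for post-filtering.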

Decision table:

| Field kind | Cardinality | Selectivity | Recommended |
| --- | --- | --- | --- |
| security_level, tenant_id | Low (4–10) | Low (5–30%) | Pre-filter (mandatory) |
| namespace, language | Low (3–10) | Medium (10–50%) | Pre-filter |
| version=latest | Low (2) | High (80–95%) | Pre or Post |
| created_at > 2026-01-01 | High (dates) | Variable | Post-filter |
| page >= 10 | High (pages) | Variable | Post-filter |

Support varies widely by vector DB — Part 9 covers tooling.


6. Walkthrough — Attaching Metadata to Chunks

6.1 Copy metadata at chunk creation

from langchain_core.documents import Document

doc_meta = {
    "document_id": "policy-sec-v3.2",
    "version": "3.2",
    "security_level": "internal",
    "namespace": "policy",
    "language": "ko",
    "tenant_id": "acme",
    "source_url": "https://intranet.acme.com/docs/sec-3.2.pdf",
}

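# split_results, page_map, section_map and vectordb are assumed to exist already
# (chunking output from Parts 5–6 and an initialised vector store).
chunks = []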
for i, chunk_text in enumerate(split_results):
    chunks.append(Document(
        page_content=chunk_text,
        metadata={
            **doc_meta,
            "chunk_id": f"{doc_meta['document_id']}#c{i:04d}",
            "page": page_map[i],
            "section": section_map[i],
        },
    ))

vectordb.add_documents(chunks)

Key idea: all parent-doc fields are copied to every chunk, because chunks are searched independently in the index.

6.2 Query with permission + version + namespace

# Named metadata_filter to avoid shadowing Python's built-in filter().
metadata_filter = {
    "$and": [
        {"tenant_id": "acme"},                                # tenant isolation (outermost)
        {"security_level": {"$in": ["public", "internal"]}},  # user clearance
        {"namespace": "policy"},                              # shrink search space
        {"version": "3.2"},                                   # pin version
    ]
}

results = vectordb.similarity_search(
    "exception filing procedure?",
    k=5,
    filter=metadata_filter,  # pre-filter, provided the backing store applies it before ANN
)

# Post-filter: the page range is checked on the returned candidates only.
final = [r for r in results if 10 <= r.metadata["page"] <= 50]

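Filter syntax differs by store. Pinecone and Chroma accept MongoDB-style operators such as $and and $in, although Chroma's dict filter is somewhat less expressive. Qdrant and Weaviate express the same logic in their own syntax (must/should clauses and operator/operand trees respectively). FAISS has no built-in metadata filter, so precompute the list of allowed IDs.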
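For a store without metadata filtering, the "precompute an ID list" idea can be sketched in plain Python. Nothing here calls FAISS itself: the `METADATA` side store, the `VECTORS` dict, and the exact cosine search are illustrative assumptions standing in for a real index restricted to a set of IDs.

```python
import math

# Hypothetical side store: the vector index holds only vectors,
# so metadata lives in a separate mapping keyed by vector ID.
METADATA = {
    0: {"tenant_id": "acme", "security_level": "internal"},
    1: {"tenant_id": "acme", "security_level": "secret"},
    2: {"tenant_id": "globex", "security_level": "public"},
    3: {"tenant_id": "acme", "security_level": "public"},
}
VECTORS = {0: [1.0, 0.0], 1: [0.9, 0.1], 2: [0.8, 0.2], 3: [0.0, 1.0]}


def allowed_ids(tenant_id, levels):
    """Pre-filter step: resolve metadata conditions to a list of vector IDs."""
    return sorted(
        i for i, m in METADATA.items()
        if m["tenant_id"] == tenant_id and m["security_level"] in levels
    )


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def search(query, ids, k):
    """Exact search restricted to the precomputed IDs."""
    ranked = sorted(ids, key=lambda i: cosine(query, VECTORS[i]), reverse=True)
    return ranked[:k]


ids = allowed_ids("acme", {"public", "internal"})
hits = search([1.0, 0.0], ids, k=2)
print(ids, hits)
```

Vector 1 is the second-closest match to the query but never enters the candidate set, because the ID list is built before any similarity is computed.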

6.3 Inject provenance into the answer

def cite_answer(answer: str, results: list[Document]) -> str:
    refs = []
    for r in results:
        m = r.metadata
        refs.append(
            f"- [{m['document_id']} §{m['section'][-1]} p.{m['page']}]({m['source_url']}) (v{m['version']})"
        )
    return f"{answer}\n\n## Sources\n" + "\n".join(refs)

source_url + section + page + version together make a re-visitable citation. Without them, a reader cannot check the answer against its source, so hallucinations go undetected.


7. Variants

7.1 namespace vs collection — same concept, different names

  • What changes: some tools implement this as a physical split (separate index); others as a logical field.
  • Why use it: splitting the search space by purpose (policy vs manual vs minutes) avoids recall blur (Part 3 §35-4).
  • What becomes possible: same embedding model still produces purpose-clean retrieval.
  • Where it fits: nearly every multi-domain RAG.
  • Limits: too narrow a split blocks cross-domain answers. 4–10 large units is the practical sweet spot.

7.2 security_level model — four tiers as the default

  • What changes: a total order — public < internal < confidential < secret.
  • Why use it: user clearance \(u\) vs chunk level \(c\) reduces to \(c \le u\), a single inequality.
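  • What becomes possible: one $lte clause for permissions, provided tiers are stored as ordered integers (e.g. public=0 through secret=3). With string labels, pass the allowed tiers to $in as in §6.2.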
  • Where it fits: corporations, government, finance — wherever total-order classification exists.
  • Limits: a single field fails when permissions intersect (department + secrecy). Use ABAC instead — §7.3.
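The total order can be turned into a filter clause in a few lines. This is a sketch under assumptions: the helper names are hypothetical, and the `$in` dict follows the MongoDB-style shape used in §6.2, which not every vector DB accepts.

```python
# Tiers in ascending order; the list position is the rank.
LEVELS = ["public", "internal", "confidential", "secret"]
RANK = {name: i for i, name in enumerate(LEVELS)}


def can_read(user_clearance: str, chunk_level: str) -> bool:
    """The single inequality c <= u, evaluated on ranks."""
    return RANK[chunk_level] <= RANK[user_clearance]


def allowed_levels(user_clearance: str) -> list[str]:
    """Every tier at or below the user's clearance."""
    return LEVELS[: RANK[user_clearance] + 1]


def permission_clause(user_clearance: str) -> dict:
    """Filter clause for stores that keep security_level as a string label."""
    return {"security_level": {"$in": allowed_levels(user_clearance)}}


print(permission_clause("internal"))
```

Storing the rank as an integer field alongside the label would allow a single `$lte` comparison instead of enumerating tiers.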

7.3 ABAC-style permissions — multi-attribute

  • What changes: instead of one tier, an attribute set (dept, project, region, nda_signed).
  • Why use it: real-world permissions intersect. "Marketing dept + APAC region + NDA-signed" only.
  • What becomes possible: fine-grained authorisation per user.
  • Where it fits: enterprise, multi-project, regulated industries.
  • Limits: complex filters hit the expressiveness limit of some vector DBs (Chroma); frequent permission changes mean re-indexing.
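A multi-attribute check can be sketched as a conjunction over required attributes. The policy shape below (a dict of required values stored with the chunk) is an assumption for illustration; real ABAC engines, as described in NIST SP 800-162, support far richer rules.

```python
def abac_allows(user_attrs: dict, chunk_policy: dict) -> bool:
    """Grant access only if the user matches every attribute the chunk requires."""
    return all(
        user_attrs.get(key) == required
        for key, required in chunk_policy.items()
    )


# Hypothetical policy: Marketing dept + APAC region + NDA signed.
policy = {"dept": "marketing", "region": "APAC", "nda_signed": True}

alice = {"dept": "marketing", "region": "APAC", "nda_signed": True, "project": "x1"}
bob = {"dept": "marketing", "region": "EMEA", "nda_signed": True}

print(abac_allows(alice, policy), abac_allows(bob, policy))
```

Extra user attributes are ignored, and a single mismatch (here the region) denies access. Pushing the same logic into a vector-DB pre-filter requires one filter clause per attribute.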

7.4 version policy — latest-only vs all versions

  • What changes: search version=latest only vs all versions with time-weighting.
  • Why use it: policy docs use latest. Statutes and papers need every version.
  • What becomes possible: answers contain only the currently valid clause, or cite the full change history.
  • Where it fits: legal/compliance (all versions), internal policy (latest only).
  • Limits: all-version search multiplies the index by \(N_{\text{versions}}\); cost scales proportionally.
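A latest-only policy needs a reliable way to decide which version is newest. The sketch below assumes dotted numeric versions and a `document_id` that stays the same across versions (unlike the `policy-sec-v3.2` example, which embeds the version); both are assumptions, and the helper names are hypothetical.

```python
def version_key(version: str) -> tuple[int, ...]:
    """Parse '3.10' into (3, 10) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in version.split("."))


def latest_only(chunks: list[dict]) -> list[dict]:
    """Keep only chunks whose version is the highest seen for their document."""
    newest: dict[str, tuple[int, ...]] = {}
    for c in chunks:
        key = version_key(c["version"])
        if key > newest.get(c["document_id"], ()):
            newest[c["document_id"]] = key
    return [
        c for c in chunks
        if version_key(c["version"]) == newest[c["document_id"]]
    ]


chunks = [
    {"document_id": "policy-sec", "version": "3.2", "text": "old clause"},
    {"document_id": "policy-sec", "version": "3.10", "text": "new clause"},
    {"document_id": "manual-ops", "version": "1.0", "text": "ops step"},
]
current = latest_only(chunks)
print([(c["document_id"], c["version"]) for c in current])
```

Comparing the raw strings would rank "3.2" above "3.10", which is why the version is parsed into a tuple first.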

7.5 source_url deep linking — to page and section

  • What changes: replace a bare URL with a deep link, e.g. https://...pdf#page=42.
  • Why use it: clicking a citation jumps the user to the exact page.
  • What becomes possible: verifiable RAG answers.
  • Where it fits: every internal-document RAG.
  • Limits: PDF viewer must respect #page= (most do); HTML needs explicit anchor ids.
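Building the deep link is a small string operation. This sketch assumes the `#page=` fragment for PDFs mentioned above and an explicit anchor id for HTML; the function name and the example URL are hypothetical.

```python
from typing import Optional


def deep_link(source_url: str, page: Optional[int] = None,
              anchor: Optional[str] = None) -> str:
    """Attach a page or anchor fragment to a source URL, replacing any old fragment."""
    base = source_url.split("#", 1)[0]
    if page is not None:
        return f"{base}#page={page}"  # honoured by most PDF viewers
    if anchor:
        return f"{base}#{anchor}"     # HTML needs a matching id in the page
    return base


pdf_link = deep_link("https://example.com/docs/sec.pdf", page=42)
html_link = deep_link("https://example.com/docs/sec.html#top", anchor="sec-5-2-1")
print(pdf_link)
print(html_link)
```

Storing the bare URL in metadata and composing the fragment at citation time keeps one `source_url` value per document while still linking each chunk to its own page.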

8. Limits and Failure Modes

8.1 Post-filter permissions → empty answers

  • Why intrinsic: ANN may fill top-K with disallowed chunks; after permission filtering the candidate set may be empty.
  • Diagnosis: per-user empty rate — significantly higher for less-privileged users is a red flag.
  • Mitigation: move permissions to pre-filter. Raising candidate count (\(k'=50\)) is a stopgap.
  • Later part: Part 9 (vector-DB pre-filter support).
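The per-user empty rate from the diagnosis bullet can be computed from a query log. The log shape below, a list of (clearance, result count) pairs, is an assumption; adapt it to whatever your logging actually records.

```python
from collections import defaultdict


def empty_rate_by_clearance(query_log):
    """Return the fraction of queries with zero results, grouped by user clearance."""
    totals = defaultdict(int)
    empties = defaultdict(int)
    for clearance, n_results in query_log:
        totals[clearance] += 1
        if n_results == 0:
            empties[clearance] += 1
    return {c: empties[c] / totals[c] for c in totals}


# Hypothetical log entries: (user clearance, number of results returned).
log = [
    ("internal", 0), ("internal", 3), ("internal", 0), ("internal", 5),
    ("secret", 5), ("secret", 4),
]
rates = empty_rate_by_clearance(log)
print(rates)
```

In this made-up log, internal users get empty answers half the time while secret-cleared users never do, which is the asymmetry that points to post-filtered permissions.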

8.2 Missing metadata → filtering impossible

  • Why intrinsic: if version or security_level is omitted at chunk creation, the key itself is missing; behaviour varies by tool (some pass, some exclude).
  • Diagnosis: per-field null counts in the index — anything above zero is dangerous.
  • Mitigation: schema validation at chunk creation (Pydantic, JSON Schema); re-index on detection.
  • Later part: Part 9 (vector-DB operational hygiene).
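Both the mitigation and the diagnosis can be sketched without extra dependencies. The text suggests Pydantic or JSON Schema; the plain-Python version below is a stand-in with hypothetical helper names, checking only presence and the allowed security tiers.

```python
REQUIRED_FIELDS = (
    "document_id", "chunk_id", "version", "page",
    "section", "security_level", "namespace",
)
ALLOWED_LEVELS = {"public", "internal", "confidential", "secret"}


def validate_metadata(meta: dict) -> list[str]:
    """Return a list of problems; an empty list means the chunk may be indexed."""
    problems = [f"missing: {f}" for f in REQUIRED_FIELDS if meta.get(f) is None]
    level = meta.get("security_level")
    if level is not None and level not in ALLOWED_LEVELS:
        problems.append(f"invalid security_level: {level}")
    return problems


def null_counts(all_meta: list[dict]) -> dict[str, int]:
    """Diagnosis: count chunks where each required field is absent or None."""
    return {f: sum(1 for m in all_meta if m.get(f) is None) for f in REQUIRED_FIELDS}


good = {
    "document_id": "policy-sec-v3.2", "chunk_id": "policy-sec-v3.2#c0014",
    "version": "3.2", "page": 42, "section": ["5", "5.2", "5.2.1"],
    "security_level": "internal", "namespace": "policy",
}
bad = {k: v for k, v in good.items() if k not in ("version", "security_level")}
print(validate_metadata(good), validate_metadata(bad))
```

Running `validate_metadata` before the chunk is added to the index rejects the problem at creation time, and `null_counts` over an existing index shows whether a re-index is needed.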

8.3 Cardinality blow-up — slow filters

  • Why intrinsic: fields with high uniqueness (e.g. chunk_id) make poor pre-filters; the engine must index every value, hurting memory and latency.
  • Diagnosis: abnormal index-build time; rising p95 query latency.
  • Mitigation: keep high-cardinality fields as post-filter; bucket via hashing (chunk_hash_bucket = hash(chunk_id) % 256).
  • Later part: Part 9 (recommended cardinality per vector DB).
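One caution when implementing the hash bucket in Python: the built-in `hash()` is salted per process for strings, so bucket values would change between runs. The sketch below uses `hashlib` instead for a stable result; the function name mirrors the field name above, and taking the first 8 bytes of the digest is an arbitrary choice.

```python
import hashlib


def chunk_hash_bucket(chunk_id: str, buckets: int = 256) -> int:
    """Map a high-cardinality id to one of `buckets` stable, low-cardinality values."""
    digest = hashlib.sha256(chunk_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an integer; stable across processes and machines.
    return int.from_bytes(digest[:8], "big") % buckets


bucket = chunk_hash_bucket("policy-sec-v3.2#c0014")
print(bucket)
```

The bucket value is stored as an extra metadata field at chunk creation, giving the engine at most 256 distinct values to index instead of one per chunk.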

8.4 version=latest inconsistency

  • Why intrinsic: indexing a new version without deleting the old chunks leaves both visible to version=latest; answers cite two conflicting policies.
  • Diagnosis: same document_id, different version, both in top-K.
  • Mitigation: atomic publish — BEGIN → insert new → delete old → COMMIT; or an explicit is_active=true field.
  • Later part: Part 9 (transactional vector-DB updates).
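The diagnosis and the publish step can be sketched on an in-memory list. This is an illustration only: it assumes a `document_id` that is stable across versions, the helper names are hypothetical, and whether a real vector DB offers an equivalent atomic swap depends on the product.

```python
from collections import defaultdict


def conflicting_versions(results: list[dict]) -> dict[str, list[str]]:
    """Diagnosis: documents that appear with more than one version in a result list."""
    versions = defaultdict(set)
    for c in results:
        versions[c["document_id"]].add(c["version"])
    return {d: sorted(v) for d, v in versions.items() if len(v) > 1}


def publish_version(index: list[dict], document_id: str,
                    version: str, texts: list[str]) -> list[dict]:
    """Build the complete next index state: old chunks of the document are dropped
    and new ones added. The caller swaps it in with one assignment, so readers
    never see both versions together."""
    new_chunks = [
        {"document_id": document_id, "version": version,
         "chunk_id": f"{document_id}-v{version}#c{i:04d}", "text": t}
        for i, t in enumerate(texts)
    ]
    return [c for c in index if c["document_id"] != document_id] + new_chunks


index = [{"document_id": "policy-sec", "version": "3.1",
          "chunk_id": "policy-sec-v3.1#c0000", "text": "old rule"}]

# Naive publish: append without deleting, which leaves both versions searchable.
naive = index + [{"document_id": "policy-sec", "version": "3.2",
                  "chunk_id": "policy-sec-v3.2#c0000", "text": "new rule"}]
index = publish_version(index, "policy-sec", "3.2", ["new rule"])
print(conflicting_versions(naive), conflicting_versions(index))
```

The naive append is flagged with both versions for `policy-sec`, while the swapped index reports no conflicts.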

8.5 Broken provenance → cannot re-visit

  • Why intrinsic: showing only document_id without source_url, section, page leaves the user unable to find the source; hallucination cannot be verified.
  • Diagnosis: zero click-through on citations means there are no real citations.
  • Mitigation: enforce a complete citation format (see §6.3); have the answer prompt require [source] per claim.
  • Later part: Chapter 22 (answer verification with mandatory sources).

8.6 Common Pitfalls

  • "Filter permissions after retrieval." §8.1. The classic leak pattern.
  • "Add metadata later." Once chunks are already indexed, re-indexing is the only path. Do it at creation.
  • "Pre-filter everything." §8.3. High-cardinality fields bloat the index.
  • "version=latest is enough." §8.4. Must come with atomic deletion of the old.
  • "Cite just document_id." §8.5. URL + section + page + version are all needed.

9. Settled Conclusions

Q1. Name each of the seven core metadata fields in one line.

document_id (doc), chunk_id (position), version (lineage), page (page), section (heading path), security_level (access), namespace (search space). Chapter: §4.

Q2. Why must permissions be pre-filtered?

Post-filtering lets ANN populate candidates with disallowed chunks; if all candidates are out-of-bounds, the answer is empty. Pre-filtering shrinks the search space itself. Chapter: §3, §8.1.

Q3. Why does cardinality flip the recommended filter position?

Low-cardinality fields (security_level, tenant_id) have predictable selectivity and are cheap to pre-filter. High-cardinality and range fields (created_at, page) are costly to index for pre-filtering but cheap to check on the handful of candidates ANN returns. Chapter: §5.

Q4. Why must new-version publishing be atomic?

Failing to delete old-version chunks leaves both visible to version=latest; the answer cites contradictory policies simultaneously. Chapter: §8.4.

Q5. What four pieces must every citation carry?

source_url, section (or heading path), page, version. Together they let the user re-visit the source and check the answer for hallucination. Chapter: §6.3, §8.5.


10. Further Reading

Primary

  • Pinecone. Filtering with metadata official doc — diagrams of internal pre/post-filter behaviour.
  • Qdrant. Payload-based filtering doc — reference for filter-first retrieval.
  • Weaviate. Filtered Vector Search doc — ABAC-with-RAG examples.
  • Anthropic. Building RAG with permissions (2024 cookbook). security_level-based multi-tenant pattern.
  • NIST. Attribute-Based Access Control (ABAC) SP 800-162 — the standard underlying §7.3.

Official docs

  • LangChain Vectorstores filter API: https://python.langchain.com/docs/concepts/vectorstores/#metadata-filtering
  • Pinecone Filtering: https://docs.pinecone.io/guides/data/filtering-with-metadata
  • Qdrant Filtering: https://qdrant.tech/documentation/concepts/filtering/
  • Weaviate Filters: https://weaviate.io/developers/weaviate/api/graphql/filters

Supporting

  • Author note Chapter 6 — metadata design.
  • Author note Chapter 35 §4–§6 — Document Boundary, Multi-collection, Filter-first.

Cheat Sheet

| Field | Cardinality | Filter position | Note |
| --- | --- | --- | --- |
| tenant_id | Low | Pre | Multi-tenant top priority |
| security_level | Low (4) | Pre (required) | Prevents permission leaks |
| namespace | Low (3–10) | Pre | Shrinks search space |
| version | Low (2–10) | Pre | Demands atomic publish |
| language | Low (2–5) | Pre | Multilingual split |
| page | High | Post | Citation display |
| section | High | Post | Citation + Header Injection reuse |
| chunk_id | Very high | Post or index key | Unique id |
| created_at | Very high | Post | Time ranges |

Design rule of thumb: permission and isolation fields → pre-filter; display and range fields → post-filter; copy every field at chunk-creation time.


Bridge — What's Next

Next — RAG Core Study (8/26) — Embedding Models: BGE-M3 / OpenAI / Upstage / E5 / Jina / Voyage.

With chunks and metadata ready, the next decision is which embedding model to index with. Part 8 compares six mainstream models across vector dimension, multilingual quality, cost, local-execution feasibility, and unpacks the model-card-to-ingestion-schema mapping (author note Chapter 35 §2).

Series overview: Series index
