"RAG Core Study (7/26) — Metadata Design: Filters, Permissions, Provenance"

May 17, 2026

Chunks are built and augmented. Now we decide what lives beside a chunk — the fields that let retrieval slice by user, time, and document type.

Metadata is the quiet half of RAG. Without the fields stored alongside the chunk body, requests such as "only this quarter's security policies", "only what this user is allowed to see", or "only the latest version" cannot even be expressed. Good metadata design can lift retrieval accuracy by roughly 5–30 percentage points, depending on the corpus and query mix; poor design causes permission incidents and unverifiable answers. Part 7 unpacks seven core fields (document_id, chunk_id, version, page, section, security_level, namespace) alongside permission models, provenance, and filter cost.

0. Prerequisites

Part 3's Document Boundary and Filter-first Retrieval.
Chunk-creation flow from Parts 5–6.
Awareness that vector DBs support metadata filtering — Part 9 covers it in depth.

1. Learning Objectives

State the role of each of the seven core metadata fields in one line.
Express permission, version, and provenance filters in a single formula set.
Explain how cardinality affects search performance.
Name the five failure modes triggered by missing metadata.

2. Key Summary

If the chunk body answers "what does this know?", metadata answers "where, when, from whom?" Seven core fields — document_id (which doc), chunk_id (which slice), version (lineage), page/section (origin), security_level (access), namespace (logical search space) — must travel with every chunk. Retrieval runs as pre-filter → ANN → post-filter. Pre-filtering is for low-cardinality fields (e.g. tenant_id); post-filtering for high-cardinality ones (e.g. date ranges). Permissions belong in pre-filter — post-filtering them produces empty results and risks leaks.

3. Intuition — Same Corpus, Different Users

The vector DB contains both all-staff documents and executives-only documents. A regular employee's query that returns executive chunks in top-K is a permission incident. Plain dense retrieval cannot distinguish them — metadata filters do.

Why pre-filter first: if the ANN ever places a disallowed chunk in top-K, that chunk leaks to the user. Filtering after ANN can leave candidates empty (e.g. top-10 all disallowed → zero answers). Permissions must shrink the search space itself.
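The difference can be seen in a toy sketch. It assumes a brute-force dot-product ranking in place of a real ANN index; the corpus values and the `search_post_filter` / `search_pre_filter` helper names are made up for illustration and do not correspond to any vector-DB API.

```python
# Toy corpus: (chunk_id, security_level, embedding). Values are illustrative only.
CORPUS = [
    ("exec-1", "secret", (0.9, 0.1)),
    ("exec-2", "secret", (0.8, 0.2)),
    ("staff-1", "internal", (0.3, 0.7)),
    ("staff-2", "public", (0.2, 0.8)),
]


def score(query, vec):
    """Dot product as a stand-in for ANN similarity."""
    return sum(q * v for q, v in zip(query, vec))


def search_post_filter(query, allowed, k):
    """Rank everything, cut to top-k, then drop disallowed chunks."""
    ranked = sorted(CORPUS, key=lambda c: score(query, c[2]), reverse=True)[:k]
    return [cid for cid, level, _ in ranked if level in allowed]


def search_pre_filter(query, allowed, k):
    """Shrink the search space first, then rank only what the user may see."""
    visible = [c for c in CORPUS if c[1] in allowed]
    ranked = sorted(visible, key=lambda c: score(query, c[2]), reverse=True)[:k]
    return [cid for cid, _, _ in ranked]


query = (1.0, 0.0)
allowed = {"public", "internal"}
print(search_post_filter(query, allowed, k=2))  # both top-2 hits are secret, so nothing survives
print(search_pre_filter(query, allowed, k=2))   # ranks only the two visible chunks
```

In the post-filter variant the two secret chunks still pass through application memory before being dropped, which is where a forgotten filter turns into a leak.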

4. Definitions — Seven Core Fields

Field	What it carries	Example	Who uses it
`document_id`	Document identifier	`policy-sec-v3.2`	Provenance, dedup
`chunk_id`	Position within doc	`policy-sec-v3.2#c014`	Chunk-level citation
`version`	Document version	`3.2`, `2026-05-15`	Latest-only / specific version
`page`	Source page	`42`	Citation location
`section`	Heading path	`["5", "5.2", "5.2.1"]`	Reuses Part 6 Header Injection
`security_level`	Access tier	`public`, `internal`, `confidential`, `secret`	Permission pre-filter (required)
`namespace`	Logical search space	`policy`, `manual`, `meeting_notes`	Reuses Part 3 Filter-first

Optional supporting fields (add as needed):

source_url — original link.
author, owner_team — responsibility trail.
created_at, updated_at — time filters.
language — ko, en. Splits multilingual corpora.
tenant_id — the top pre-filter in multi-tenant SaaS.

Core rule: when a chunk is created, every parent-doc metadata field is copied onto it. Same point as Part 5 §8.5.

5. Math — Filter Cost and Cardinality

Filter efficiency depends on cardinality $C$ and selectivity $s$.

$N$ = total chunks in the index
$C$ = number of unique values for the filter field (cardinality)
$s$ = fraction of chunks passing the filter (e.g. tenant_id=A is 10% of the index → $s=0.1$)
$k$ = top-K

Pre-filter cost (ANN search restricted to chunks that pass the filter):

$$\text{Cost}_{\text{pre}} \approx \mathcal{O}(\log N) + \mathcal{O}(s \cdot N)$$

Smaller $s$ is better, because only the $s \cdot N$ chunks that pass the filter are scanned. This suits fields with low cardinality and a small passing fraction $s$ (tenant_id, security_level).

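Post-filter cost (ANN returns the top-$k'$ candidates, then the filter is applied):

$$\text{Cost}_{\text{post}} \approx \mathcal{O}(\log N) + \mathcal{O}(k')$$

$k'$ is the over-fetch needed to keep enough answers after filtering. Only a fraction $s$ of the candidates survive, so keeping $k$ results takes about $k' \approx k/s$ candidates, and $k'$ explodes when $s$ is small. Post-filtering is therefore unsuitable for tight filters.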
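To make $k' \approx k/s$ concrete, here is a minimal sketch. The function name `overfetch_k` is hypothetical, and the estimate assumes passing chunks are spread evenly through the ranking, which real data rarely guarantees.

```python
import math


def overfetch_k(k: int, s: float) -> int:
    """Candidates to request from ANN so that about k survive a post-filter.

    Assumes passing chunks are spread evenly through the ranking, the same
    simplification behind k' = k / s.
    """
    if not 0 < s <= 1:
        raise ValueError("s must be in (0, 1]")
    # The small epsilon guards against floating-point noise in k / s.
    return math.ceil(k / s - 1e-9)


for s in (0.5, 0.1, 0.01):
    print(f"s={s}: request k'={overfetch_k(5, s)} candidates to keep k=5")
```

The over-fetch grows a hundredfold between $s=1$ and $s=0.01$, which is why tight filters belong in the pre-filter stage.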

Decision table:

Field kind	Cardinality	Selectivity	Recommended
`security_level`, `tenant_id`	Low (4–10)	Low (5–30%)	Pre-filter (mandatory)
`namespace`, `language`	Low (3–10)	Medium (10–50%)	Pre-filter
`version=latest`	Low (2)	High (80–95%)	Pre or Post
`created_at > 2026-01-01`	High (dates)	Variable	Post-filter
`page >= 10`	High (pages)	Variable	Post-filter

Support varies widely by vector DB — Part 9 covers tooling.

6. Walkthrough — Attaching Metadata to Chunks

6.1 Copy metadata at chunk creation

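from langchain_core.documents import Document

# Assumed to exist from Parts 5-6: split_results (list of chunk strings),
# page_map and section_map (chunk index -> page number / heading path),
# and vectordb (any LangChain vector store).

# Document-level metadata, defined once per source document.
doc_meta = {
    "document_id": "policy-sec-v3.2",
    "version": "3.2",
    "security_level": "internal",
    "namespace": "policy",
    "language": "ko",
    "tenant_id": "acme",
    "source_url": "https://intranet.acme.com/docs/sec-3.2.pdf",
}

chunks = []
for i, chunk_text in enumerate(split_results):
    chunks.append(Document(
        page_content=chunk_text,
        metadata={
            **doc_meta,  # copy every parent-doc field onto the chunk
            "chunk_id": f"{doc_meta['document_id']}#c{i:04d}",
            "page": page_map[i],
            "section": section_map[i],
        },
    ))

vectordb.add_documents(chunks)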

Key idea: all parent-doc fields are copied to every chunk, because chunks are searched independently in the index.

6.2 Query with permission + version + namespace

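# MongoDB-style filter shown for illustration; the exact syntax depends on the
# vector store behind vectordb.
metadata_filter = {
    "$and": [
        {"tenant_id": "acme"},                                # outermost: tenant isolation
        {"security_level": {"$in": ["public", "internal"]}},  # user clearance
        {"namespace": "policy"},                              # shrink search space
        {"version": "3.2"},                                   # pin version
    ]
}

results = vectordb.similarity_search(
    "exception filing procedure?",
    k=5,
    filter=metadata_filter,  # pre-filter, if the store applies it before ANN
)

# Post-filter on a high-cardinality field; this can leave fewer than k results.
final = [r for r in results if 10 <= r.metadata["page"] <= 50]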

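Filter syntax differs by engine. Pinecone uses MongoDB-style operators such as $and and $in, and Chroma's where filter supports a similar but smaller operator set. Qdrant and Weaviate express the same conditions through their own filter objects (Qdrant, for instance, uses must / should / must_not clauses). FAISS itself has no metadata filtering, so precompute the list of allowed IDs and restrict the search to it.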
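For an index without native filtering, the "precompute an ID list" step can be sketched in plain Python. This is an illustration under assumptions: `matches` and `allowed_ids` are hypothetical helpers, the evaluator covers only `$and`, `$in`, and equality, and the metadata is assumed to live in a side store keyed by integer vector ID. How the resulting list is handed to the index is engine-specific and not shown.

```python
def matches(meta: dict, flt: dict) -> bool:
    """Evaluate a small subset of MongoDB-style filters: $and, $in, equality.

    Missing keys and unsupported operators fail closed (the chunk is excluded).
    """
    for key, cond in flt.items():
        if key == "$and":
            if not all(matches(meta, sub) for sub in cond):
                return False
        elif isinstance(cond, dict) and "$in" in cond:
            if meta.get(key) not in cond["$in"]:
                return False
        elif meta.get(key) != cond:
            return False
    return True


def allowed_ids(meta_by_id: dict, flt: dict) -> list:
    """Precompute the vector IDs that the search is allowed to consider."""
    return [vid for vid, meta in meta_by_id.items() if matches(meta, flt)]


meta_by_id = {
    0: {"tenant_id": "acme", "security_level": "internal", "namespace": "policy"},
    1: {"tenant_id": "acme", "security_level": "secret", "namespace": "policy"},
    2: {"tenant_id": "other", "security_level": "public", "namespace": "policy"},
    3: {"tenant_id": "acme", "namespace": "policy"},  # security_level missing
}
flt = {"$and": [
    {"tenant_id": "acme"},
    {"security_level": {"$in": ["public", "internal"]}},
]}
print(allowed_ids(meta_by_id, flt))
```

Chunk 3 is excluded because its security_level is missing, which is the safer default for a permission filter.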

6.3 Inject provenance into the answer

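def cite_answer(answer: str, results: list[Document]) -> str:
    """Append a Sources section built from each retrieved chunk's metadata."""
    refs = []
    for r in results:
        m = r.metadata
        # Innermost heading, page, source URL, and version make the citation re-visitable.
        refs.append(
            f"- [{m['document_id']} §{m['section'][-1]} p.{m['page']}]({m['source_url']}) (v{m['version']})"
        )
    return f"{answer}\n\n## Sources\n" + "\n".join(refs)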

source_url + section + page + version together make a re-visitable citation. Without them, readers cannot check an answer against its source, so hallucinations go undetected.

7. Variants

7.1 `namespace` vs `collection` — same concept, different names

What changes: some tools implement this as a physical split (separate index); others as a logical field.
Why use it: splitting the search space by purpose (policy vs manual vs minutes) avoids recall blur (Part 3 §35-4).
What becomes possible: a single embedding model can still serve purpose-specific retrieval.
Where it fits: nearly every multi-domain RAG.
Limits: too narrow a split blocks cross-domain answers. 4–10 large units is the practical sweet spot.

7.2 `security_level` model — four tiers as the default

What changes: a total order — public < internal < confidential < secret.
Why use it: user clearance $u$ vs chunk level $c$ reduces to $c \le u$, a single inequality.
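What becomes possible: one $lte clause for permissions, provided the tiers are stored as ordered integers (for example public=0 through secret=3); string labels need an $in list instead, as in §6.2.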
Where it fits: corporations, government, finance — wherever total-order classification exists.
Limits: a single field fails when permissions intersect (department + secrecy). Use ABAC instead — §7.3.
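The total order can be sketched as follows. The integer ranks and the `security_rank` field name are assumptions for illustration; the source only fixes the ordering public < internal < confidential < secret.

```python
# Assumed integer encoding of the four tiers; only the ordering comes from the text.
LEVELS = {"public": 0, "internal": 1, "confidential": 2, "secret": 3}


def readable_levels(user_clearance: str) -> list[str]:
    """All tiers c with c <= u, for engines that filter on string labels with $in."""
    u = LEVELS[user_clearance]
    return [name for name, rank in LEVELS.items() if rank <= u]


def lte_filter(user_clearance: str) -> dict:
    """Single-inequality form; requires chunks to store the integer rank
    under a (hypothetical) security_rank field."""
    return {"security_rank": {"$lte": LEVELS[user_clearance]}}


print(readable_levels("internal"))
print(lte_filter("confidential"))
```

Either form expresses $c \le u$; which one applies depends on whether the chunk stores the label or the rank.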

7.3 ABAC-style permissions — multi-attribute

What changes: instead of one tier, an attribute set (dept, project, region, nda_signed).
Why use it: real-world permissions intersect. "Marketing dept + APAC region + NDA-signed" only.
What becomes possible: fine-grained authorisation per user.
Where it fits: enterprise, multi-project, regulated industries.
Limits: complex filters hit the expressiveness limit of some vector DBs (Chroma); frequent permission changes mean re-indexing.
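The intersection semantics of ABAC can be sketched as a plain predicate. The `can_read` helper and the policy layout (attribute name mapped to allowed values) are assumptions for illustration; translating such a policy into a given engine's filter syntax is engine-specific and not shown here.

```python
def can_read(user_attrs: dict, chunk_policy: dict) -> bool:
    """A chunk is readable only if the user satisfies every attribute it names."""
    for attr, allowed in chunk_policy.items():
        value = user_attrs.get(attr)
        if isinstance(allowed, (list, set, tuple)):
            # Multi-valued requirement: the user's value must be one of them.
            if value not in allowed:
                return False
        elif value != allowed:
            # Single-valued requirement, e.g. nda_signed must be True.
            return False
    return True


# "Marketing dept + APAC region + NDA-signed" only.
chunk_policy = {"dept": ["marketing"], "region": ["APAC"], "nda_signed": True}
alice = {"dept": "marketing", "region": "APAC", "nda_signed": True}
bob = {"dept": "marketing", "region": "EMEA", "nda_signed": True}

print(can_read(alice, chunk_policy))
print(can_read(bob, chunk_policy))
```

A user with no attributes fails every requirement, so the check fails closed.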

7.4 `version` policy — latest-only vs all versions

What changes: search version=latest only vs all versions with time-weighting.
Why use it: policy docs use latest. Statutes and papers need every version.
What becomes possible: answers contain only the currently valid clause, or cite the full change history.
Where it fits: legal/compliance (all versions), internal policy (latest only).
Limits: all-version search multiplies the index by $N_{\text{versions}}$; cost scales proportionally.
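A latest-only policy can be sketched as below. Two assumptions are made loudly: document_id is stable across versions (so `policy-sec` rather than `policy-sec-v3.2`), and versions are dotted numbers. Date-style versions such as `2026-05-15` would need a different key function. The helper names are hypothetical.

```python
def version_key(version: str) -> tuple:
    """Turn '3.10' into (3, 10) so that 3.10 sorts after 3.2."""
    return tuple(int(part) for part in version.split("."))


def latest_versions(chunks: list[dict]) -> dict:
    """Map each document_id to its highest version string."""
    latest = {}
    for c in chunks:
        doc, ver = c["document_id"], c["version"]
        if doc not in latest or version_key(ver) > version_key(latest[doc]):
            latest[doc] = ver
    return latest


def keep_latest(chunks: list[dict]) -> list[dict]:
    """Drop every chunk that does not belong to its document's latest version."""
    latest = latest_versions(chunks)
    return [c for c in chunks if c["version"] == latest[c["document_id"]]]


chunks = [
    {"document_id": "policy-sec", "version": "3.2", "chunk_id": "a"},
    {"document_id": "policy-sec", "version": "3.10", "chunk_id": "b"},
    {"document_id": "manual", "version": "1.0", "chunk_id": "c"},
]
print([c["chunk_id"] for c in keep_latest(chunks)])
```

Comparing versions as tuples rather than strings matters: as plain strings, "3.2" would sort after "3.10".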

7.5 `source_url` deep linking — to page and section

What changes: replace a bare URL with a deep link, e.g. https://...pdf#page=42.
Why use it: clicking a citation jumps the user to the exact page.
What becomes possible: verifiable RAG answers.
Where it fits: every internal-document RAG.
Limits: PDF viewer must respect #page= (most do); HTML needs explicit anchor ids.
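Building the deep link is a one-liner once page is stored on the chunk. A minimal sketch, assuming PDF sources and a hypothetical `deep_link` helper; HTML sources would need an anchor id instead of a page number.

```python
from urllib.parse import urldefrag


def deep_link(source_url: str, page: int) -> str:
    """Replace any existing fragment with a #page=N anchor for PDF viewers."""
    base, _old_fragment = urldefrag(source_url)
    return f"{base}#page={page}"


print(deep_link("https://intranet.acme.com/docs/sec-3.2.pdf", 42))
```

Stripping the old fragment first avoids producing URLs with two `#` parts when the stored source_url already carries one.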

8. Limits and Failure Modes

8.1 Post-filter permissions → empty answers

Why intrinsic: ANN may fill top-K with disallowed chunks; after permission filtering the candidate set may be empty.
Diagnosis: track the empty-result rate per user; a significantly higher rate for less-privileged users is a red flag.
Mitigation: move permissions to pre-filter. Raising candidate count ($k'=50$) is a stopgap.
Later part: Part 9 (vector-DB pre-filter support).
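The empty-rate diagnosis can be sketched from a query log. The log shape, a sequence of (clearance group, number of results) pairs, and the `empty_rate_by_group` name are assumptions for illustration.

```python
from collections import defaultdict


def empty_rate_by_group(query_log):
    """query_log: iterable of (group, n_results) pairs. Returns group -> fraction empty."""
    totals, empties = defaultdict(int), defaultdict(int)
    for group, n_results in query_log:
        totals[group] += 1
        if n_results == 0:
            empties[group] += 1
    return {g: empties[g] / totals[g] for g in totals}


# Illustrative log only: low-clearance users hit empty results far more often.
log = [("public", 0), ("public", 0), ("public", 3), ("public", 5),
       ("secret", 5), ("secret", 4)]
print(empty_rate_by_group(log))
```

A large gap between groups points at permission filtering happening after ANN rather than before it.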

8.2 Missing metadata → filtering impossible

Why intrinsic: if version or security_level is omitted at chunk creation, the key itself is missing; behaviour varies by tool (some pass, some exclude).
Diagnosis: per-field null counts in the index — anything above zero is dangerous.
Mitigation: schema validation at chunk creation (Pydantic, JSON Schema); re-index on detection.
Later part: Part 9 (vector-DB operational hygiene).
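Both the mitigation and the diagnosis can be sketched with the standard library. This is a plain-Python stand-in for the Pydantic or JSON Schema validation mentioned above, not an example of either library; the helper names are hypothetical, and the required-field list is the seven core fields from §4.

```python
REQUIRED = ("document_id", "chunk_id", "version", "page",
            "section", "security_level", "namespace")


def missing_fields(meta: dict) -> list[str]:
    """Required fields that are absent or None."""
    return [f for f in REQUIRED if meta.get(f) is None]


def validate_chunk(meta: dict) -> dict:
    """Raise before indexing instead of discovering the gap at query time."""
    missing = missing_fields(meta)
    if missing:
        raise ValueError(f"chunk metadata missing: {missing}")
    return meta


def null_counts(all_meta: list[dict]) -> dict:
    """Per-field null counts across already-indexed chunks."""
    return {f: sum(1 for m in all_meta if m.get(f) is None) for f in REQUIRED}


good = {"document_id": "policy-sec-v3.2", "chunk_id": "policy-sec-v3.2#c0014",
        "version": "3.2", "page": 42, "section": ["5", "5.2"],
        "security_level": "internal", "namespace": "policy"}
bad = {k: v for k, v in good.items() if k != "security_level"}

print(missing_fields(bad))
print(null_counts([good, bad]))
```

Calling `validate_chunk` in the ingestion loop turns a silent permission gap into an immediate ingestion error.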

8.3 Cardinality blow-up — slow filters

Why intrinsic: fields with high uniqueness (e.g. chunk_id) make poor pre-filters; the engine must index every value, hurting memory and latency.
Diagnosis: abnormal index-build time; rising p95 query latency.
Mitigation: keep high-cardinality fields as post-filter; bucket via hashing (chunk_hash_bucket = hash(chunk_id) % 256).
Later part: Part 9 (recommended cardinality per vector DB).
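One caveat when implementing the bucketing in Python: the built-in `hash()` is randomized per process for strings, so the same chunk_id can land in different buckets on different runs. A minimal sketch using a stable digest instead; the choice of SHA-256 and the first 8 bytes is an assumption, and any stable hash would serve.

```python
import hashlib


def chunk_hash_bucket(chunk_id: str, buckets: int = 256) -> int:
    """Stable bucket in [0, buckets); the same value in every process."""
    digest = hashlib.sha256(chunk_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then reduce to a bucket.
    return int.from_bytes(digest[:8], "big") % buckets


print(chunk_hash_bucket("policy-sec-v3.2#c0014"))
```

The bucket value can then be stored as a low-cardinality field at chunk creation, alongside the original chunk_id.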

8.4 `version=latest` inconsistency

Why intrinsic: indexing a new version without deleting the old chunks leaves both visible to version=latest; answers cite two conflicting policies.
Diagnosis: same document_id, different version, both in top-K.
Mitigation: atomic publish — BEGIN → insert new → delete old → COMMIT; or an explicit is_active=true field.
Later part: Part 9 (transactional vector-DB updates).
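The diagnosis above can be automated over a batch of retrieved results. A minimal sketch, assuming each result is a metadata dict and that document_id is stable across versions; the `conflicting_versions` name is hypothetical.

```python
from collections import defaultdict


def conflicting_versions(results: list[dict]) -> dict:
    """Return document_id -> sorted versions for documents seen with more than one version."""
    seen = defaultdict(set)
    for meta in results:
        seen[meta["document_id"]].add(meta["version"])
    return {doc: sorted(versions) for doc, versions in seen.items() if len(versions) > 1}


top_k = [
    {"document_id": "policy-sec", "version": "3.1"},
    {"document_id": "policy-sec", "version": "3.2"},
    {"document_id": "manual", "version": "1.0"},
]
print(conflicting_versions(top_k))
```

A non-empty result on a version=latest query means old chunks were not retired when the new version was published.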

8.5 Broken provenance → cannot re-visit

Why intrinsic: showing only document_id without source_url, section, page leaves the user unable to find the source; hallucination cannot be verified.
Diagnosis: zero click-through on citations suggests the citations cannot actually be followed back to a source.
Mitigation: enforce a complete citation format (see §6.3); have the answer prompt require [source] per claim.
Later part: Chapter 22 (answer verification with mandatory sources).

8.6 Common Pitfalls

"Filter permissions after retrieval." §8.1. The classic leak pattern.
"Add metadata later." Once chunks are already indexed, re-indexing is the only path. Do it at creation.
"Pre-filter everything." §8.3. High-cardinality fields bloat the index.
"version=latest is enough." §8.4. Must come with atomic deletion of the old.
"Cite just document_id." §8.5. URL + section + page + version are all needed.

9. Settled Conclusions

Q1. Name each of the seven core metadata fields in one line.

document_id (doc), chunk_id (position), version (lineage), page (page), section (heading path), security_level (access), namespace (search space). Chapter: §4.

Q2. Why must permissions be pre-filtered?

Post-filtering lets ANN populate candidates with disallowed chunks; if all candidates are out-of-bounds, the answer is empty. Pre-filtering shrinks the search space itself. Chapter: §3, §8.1.

Q3. Why does cardinality flip the recommended filter position?

Low-cardinality fields (security_level, tenant_id) have predictable selectivity and are cheap to pre-filter. High-cardinality fields (timestamp, page) are costly to index for pre-filtering but cheap to check on a handful of returned candidates. Chapter: §5.

Q4. Why must new-version publishing be atomic?

Failing to delete old-version chunks leaves both visible to version=latest; the answer cites contradictory policies simultaneously. Chapter: §8.4.

Q5. What four pieces must every citation carry?

source_url, section (or heading path), page, version. Together they enable the user to re-visit the source and verify hallucination. Chapter: §6.3, §8.5.

10. Further Reading

Primary

Pinecone. Filtering with metadata official doc — diagrams of internal pre/post-filter behaviour.
Qdrant. Payload-based filtering doc — reference for filter-first retrieval.
Weaviate. Filtered Vector Search doc — ABAC-with-RAG examples.
Anthropic. Building RAG with permissions (2024 cookbook). security_level-based multi-tenant pattern.
NIST. Attribute-Based Access Control (ABAC) SP 800-162 — the standard underlying §7.3.

Official docs

LangChain Vectorstores filter API: https://python.langchain.com/docs/concepts/vectorstores/#metadata-filtering
Pinecone Filtering: https://docs.pinecone.io/guides/data/filtering-with-metadata
Qdrant Filtering: https://qdrant.tech/documentation/concepts/filtering/
Weaviate Filters: https://weaviate.io/developers/weaviate/api/graphql/filters

Supporting

Author note Chapter 6 — metadata design.
Author note Chapter 35 §4–§6 — Document Boundary, Multi-collection, Filter-first.

Cheat Sheet

Field	Cardinality	Filter position	Note
`tenant_id`	Low	Pre	Multi-tenant top priority
`security_level`	Low (4)	Pre (required)	Prevents permission leaks
`namespace`	Low (3–10)	Pre	Shrinks search space
`version`	Low (2–10)	Pre	Demands atomic publish
`language`	Low (2–5)	Pre	Multilingual split
`page`	High	Post	Citation display
`section`	High	Post	Citation + Header Injection reuse
`chunk_id`	Very high	Post or index key	Unique id
`created_at`	Very high	Post	Time ranges

Design rule of thumb: permission and isolation fields → pre-filter; display and range fields → post-filter; copy every field at chunk-creation time.

Bridge — What's Next

Next — RAG Core Study (8/26) — Embedding Models: BGE-M3 / OpenAI / Upstage / E5 / Jina / Voyage.

With chunks and metadata ready, the next decision is which embedding model to index with. Part 8 compares six mainstream models across vector dimension, multilingual quality, cost, local-execution feasibility, and unpacks the model-card-to-ingestion-schema mapping (author note Chapter 35 §2).
