"RAG Core Study (7/26) — Metadata Design: Filters, Permissions, Provenance"
Chunks are built and augmented. Now we decide what lives beside a chunk — the fields that let retrieval slice by user, time, and document type.
Metadata is the quiet half of RAG. Without the fields stored alongside the chunk body, the questions "only this quarter's security policies", "only what this user is allowed to see", "only the latest version" cannot even be asked. Good metadata design lifts retrieval accuracy by 5–30 percentage points; poor design causes permission incidents and unverifiable answers. Part 7 unpacks seven core fields — document_id, chunk_id, version, page, section, security_level, namespace — alongside permission models, provenance, and filter cost.
0. Prerequisites
- Part 3's Document Boundary and Filter-first Retrieval.
- Chunk-creation flow from Parts 5–6.
- Awareness that vector DBs support metadata filtering — Part 9 covers it in depth.
1. Learning Objectives
- State the role of each of the seven core metadata fields in one line.
- Express permission, version, and provenance filters in a single formula set.
- Explain how cardinality affects search performance.
- Name the five failure modes triggered by missing metadata.
2. Key Summary
If the chunk body answers "what does this know?", metadata answers "where, when, from whom?" Seven core fields — document_id (which doc), chunk_id (which slice), version (lineage), page/section (origin), security_level (access), namespace (logical search space) — must travel with every chunk. Retrieval runs as pre-filter → ANN → post-filter. Pre-filtering is for low-cardinality fields (e.g. tenant_id); post-filtering for high-cardinality ones (e.g. date ranges). Permissions belong in pre-filter — post-filtering them produces empty results and risks leaks.
3. Intuition — Same Corpus, Different Users
The vector DB contains both all-staff documents and executives-only documents. A regular employee's query that returns executive chunks in top-K is a permission incident. Plain dense retrieval cannot distinguish them — metadata filters do.
Why pre-filter first: if the ANN ever places a disallowed chunk in top-K, that chunk leaks to the user. Filtering after ANN can leave candidates empty (e.g. top-10 all disallowed → zero answers). Permissions must shrink the search space itself.
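A toy illustration of the point above, as a minimal sketch rather than a real retriever: brute-force cosine scoring stands in for ANN, and the three chunks, their 2-d vectors, and the helper names (`top_k`, `allowed`) are invented for this example.

```python
import math

# Hypothetical toy corpus: (chunk_id, security_level, embedding).
CORPUS = [
    ("exec-001", "confidential", (1.0, 0.0)),
    ("exec-002", "confidential", (0.9, 0.1)),
    ("staff-001", "internal", (0.6, 0.8)),
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def top_k(query, rows, k):
    """Brute-force stand-in for ANN: the k rows most similar to the query."""
    return sorted(rows, key=lambda r: cosine(query, r[2]), reverse=True)[:k]

def allowed(row, clearance):
    return row[1] in clearance

query = (1.0, 0.0)
clearance = {"public", "internal"}  # a regular employee

# Post-filter: search everything, then drop what the user may not see.
post = [r for r in top_k(query, CORPUS, k=2) if allowed(r, clearance)]

# Pre-filter: shrink the search space first, then search.
pre = top_k(query, [r for r in CORPUS if allowed(r, clearance)], k=2)

print([r[0] for r in post])  # []
print([r[0] for r in pre])   # ['staff-001']
```

Both executive chunks outrank the staff chunk, so the post-filtered list comes back empty while the pre-filtered search still finds the one chunk the employee may read.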
4. Definitions — Seven Core Fields
| Field | What it carries | Example | Who uses it |
|---|---|---|---|
| `document_id` | Document identifier | `policy-sec-v3.2` | Provenance, dedup |
| `chunk_id` | Position within doc | `policy-sec-v3.2#c014` | Chunk-level citation |
| `version` | Document version | `3.2`, `2026-05-15` | Latest-only / specific version |
| `page` | Source page | `42` | Citation location |
| `section` | Heading path | `["5", "5.2", "5.2.1"]` | Reuses Part 6 Header Injection |
| `security_level` | Access tier | `public`, `internal`, `confidential`, `secret` | Permission pre-filter (required) |
| `namespace` | Logical search space | `policy`, `manual`, `meeting_notes` | Reuses Part 3 Filter-first |
Optional supporting fields (add as needed):
- `source_url`: original link.
- `author`, `owner_team`: responsibility trail.
- `created_at`, `updated_at`: time filters.
- `language`: `ko`, `en`. Splits multilingual corpora.
- `tenant_id`: the top pre-filter in multi-tenant SaaS.
Core rule: when a chunk is created, every parent-doc metadata field is copied onto it. Same point as Part 5 §8.5.
5. Math — Filter Cost and Cardinality
Filter efficiency depends on cardinality \(C\) and selectivity \(s\).
- \(N\) = total chunks in the index
- \(C\) = number of unique values for the filter field (cardinality)
- \(s\) = fraction of chunks passing the filter (e.g. tenant_id=A is 10% of the index → \(s=0.1\))
- \(k\) = top-K
Pre-filter cost (ANN search restricted to chunks that pass the filter):
$$\text{Cost}_{\text{pre}} \approx \mathcal{O}(\log N) + \mathcal{O}(s \cdot N)$$
Smaller \(s\) is better. Strong for fields with low cardinality and low selectivity (tenant_id, security_level).
Post-filter cost (ANN returns the top-\(k'\) candidates, then filters):
$$\text{Cost}_{\text{post}} \approx \mathcal{O}(\log N) + \mathcal{O}(k')$$
\(k'\) is the over-fetch needed to keep \(k\) answers after filtering. Since only a fraction \(s\) of the candidates survives on average, \(k' \approx k/s\), which explodes when \(s\) is small. Unsuitable for tight filters.
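To make the \(k' \approx k/s\) estimate concrete, here is a small worked calculation. The function name `overfetch` is hypothetical, and the result is an expectation under the assumption that passing chunks are spread evenly through the ranking, not a guarantee.

```python
import math

def overfetch(k: int, s: float) -> int:
    """Candidates to request so that about k survive a filter passing fraction s."""
    if not 0 < s <= 1:
        raise ValueError("s must be in (0, 1]")
    return math.ceil(k / s)

for s in (0.5, 0.1, 0.01):
    print(s, overfetch(5, s))
```

With \(k = 5\), a filter that passes 10% of chunks needs roughly 50 candidates, and one that passes 1% needs roughly 500, which is why tight filters belong in the pre-filter.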
Decision table:
| Field kind | Cardinality | Selectivity | Recommended |
|---|---|---|---|
| `security_level`, `tenant_id` | Low (4–10) | Low (5–30%) | Pre-filter (mandatory) |
| `namespace`, `language` | Low (3–10) | Medium (10–50%) | Pre-filter |
| `version=latest` | Low (2) | High (80–95%) | Pre or Post |
| `created_at > 2026-01-01` | High (dates) | Variable | Post-filter |
| `page >= 10` | High (pages) | Variable | Post-filter |
Support varies widely by vector DB — Part 9 covers tooling.
6. Walkthrough — Attaching Metadata to Chunks
6.1 Copy metadata at chunk creation
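```python
from langchain_core.documents import Document

# Assumed to exist from the chunking steps in Parts 5-6:
#   split_results            list[str], chunk texts in document order
#   page_map, section_map    per-chunk page number and heading path, indexed by position
#   vectordb                 any LangChain vector store

# Document-level metadata, shared by every chunk of this document.
doc_meta = {
    "document_id": "policy-sec-v3.2",
    "version": "3.2",
    "security_level": "internal",
    "namespace": "policy",
    "language": "ko",
    "tenant_id": "acme",
    "source_url": "https://intranet.acme.com/docs/sec-3.2.pdf",
}

chunks = []
for i, chunk_text in enumerate(split_results):
    chunks.append(Document(
        page_content=chunk_text,
        metadata={
            **doc_meta,  # copy every parent-doc field onto the chunk
            "chunk_id": f"{doc_meta['document_id']}#c{i:03d}",  # e.g. policy-sec-v3.2#c014
            "page": page_map[i],
            "section": section_map[i],
        },
    ))

vectordb.add_documents(chunks)
```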
Key idea: all parent-doc fields are copied to every chunk, because chunks are searched independently in the index.
6.2 Query with permission + version + namespace
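```python
# Filter syntax is store-specific; this dict uses MongoDB-style operators.
metadata_filter = {
    "$and": [
        {"tenant_id": "acme"},                                # tenant isolation (outermost)
        {"security_level": {"$in": ["public", "internal"]}},  # user clearance
        {"namespace": "policy"},                              # shrink search space
        {"version": "3.2"},                                   # pin version
    ]
}

results = vectordb.similarity_search(
    "exception filing procedure?",
    k=5,
    filter=metadata_filter,  # pre-filter: applied before ANN where the store supports it
)

# Post-filter on a high-cardinality field; this can leave fewer than k results.
final = [r for r in results if 10 <= r.metadata["page"] <= 50]
```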
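Filter syntax differs by store. Pinecone and Chroma accept MongoDB-style operators (`$and`, `$in`) like the dict above, with Chroma's `where` dict offering a smaller operator set. Qdrant (`must` / `should` / `must_not` conditions) and Weaviate (operator/operand trees) express the same logic in their own filter objects, so the dict has to be translated. FAISS has no built-in metadata support; precompute an ID list.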
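For a store without native filtering, one way to precompute an ID list is to evaluate the same filter dict against metadata kept outside the index. The sketch below handles only the operators used in §6.2 (`$and`, `$in`, and plain equality). `matches`, `allowed_ids`, and the `metadata_by_id` mapping are hypothetical names for this example, not part of FAISS or LangChain.

```python
def matches(meta: dict, flt: dict) -> bool:
    """Evaluate a small subset of MongoDB-style filters against one metadata dict."""
    for key, cond in flt.items():
        if key == "$and":
            if not all(matches(meta, sub) for sub in cond):
                return False
        elif isinstance(cond, dict) and "$in" in cond:
            if meta.get(key) not in cond["$in"]:
                return False
        elif meta.get(key) != cond:  # plain equality
            return False
    return True

def allowed_ids(metadata_by_id: dict, flt: dict) -> list:
    """IDs whose metadata passes the filter; the search is then restricted to these."""
    return [cid for cid, meta in metadata_by_id.items() if matches(meta, flt)]

# Hypothetical side table of chunk metadata, keyed by chunk ID.
metadata_by_id = {
    "c1": {"tenant_id": "acme", "security_level": "internal", "namespace": "policy"},
    "c2": {"tenant_id": "acme", "security_level": "secret", "namespace": "policy"},
    "c3": {"tenant_id": "other", "security_level": "public", "namespace": "policy"},
}
flt = {"$and": [
    {"tenant_id": "acme"},
    {"security_level": {"$in": ["public", "internal"]}},
]}
print(allowed_ids(metadata_by_id, flt))  # ['c1']
```

A chunk with a missing key fails the check, so the evaluator excludes it rather than letting it through, which is the safer default for permission fields.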
6.3 Inject provenance into the answer
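```python
from langchain_core.documents import Document

def cite_answer(answer: str, results: list[Document]) -> str:
    """Append a Sources list built from each retrieved chunk's metadata."""
    refs = []
    for r in results:
        m = r.metadata
        # section is a heading path such as ["5", "5.2", "5.2.1"]; cite the deepest heading.
        refs.append(
            f"- [{m['document_id']} §{m['section'][-1]} p.{m['page']}]({m['source_url']}) (v{m['version']})"
        )
    return f"{answer}\n\n## Sources\n" + "\n".join(refs)
```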
`source_url` + `section` + `page` + `version` together make a re-visitable citation. Without them, a reader cannot check an answer against its source, so hallucinations go undetected.
7. Variants
7.1 namespace vs collection — same concept, different names
- What changes: some tools implement this as a physical split (separate index); others as a logical field.
- Why use it: splitting the search space by purpose (policy vs manual vs minutes) avoids recall blur (Part 3 §35-4).
- What becomes possible: same embedding model still produces purpose-clean retrieval.
- Where it fits: nearly every multi-domain RAG.
- Limits: too narrow a split blocks cross-domain answers. 4–10 large units is the practical sweet spot.
7.2 security_level model — four tiers as the default
- What changes: a total order, `public < internal < confidential < secret`.
- Why use it: user clearance \(u\) vs chunk level \(c\) reduces to \(c \le u\), a single inequality.
- What becomes possible: one `$lte` clause for permissions.
- Where it fits: corporations, government, finance; wherever total-order classification exists.
- Limits: a single field fails when permissions intersect (department + secrecy). Use ABAC instead (§7.3).
7.3 ABAC-style permissions — multi-attribute
- What changes: instead of one tier, an attribute set (`dept`, `project`, `region`, `nda_signed`).
- Why use it: real-world permissions intersect. "Marketing dept + APAC region + NDA-signed" only.
- What becomes possible: fine-grained authorisation per user.
- Where it fits: enterprise, multi-project, regulated industries.
- Limits: complex filters hit the expressiveness limit of some vector DBs (Chroma); frequent permission changes mean re-indexing.
7.4 version policy — latest-only vs all versions
- What changes: search `version=latest` only vs all versions with time-weighting.
- Why use it: policy docs use latest. Statutes and papers need every version.
- What becomes possible: answers contain only the currently valid clause, or cite the full change history.
- Where it fits: legal/compliance (all versions), internal policy (latest only).
- Limits: all-version search multiplies the index by \(N_{\text{versions}}\); cost scales proportionally.
7.5 source_url deep linking — to page and section
- What changes: replace a bare URL with a deep link, e.g. `https://...pdf#page=42`.
- Why use it: clicking a citation jumps the user to the exact page.
- What becomes possible: verifiable RAG answers.
- Where it fits: every internal-document RAG.
- Limits: PDF viewer must respect `#page=` (most do); HTML needs explicit anchor ids.
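A minimal sketch of building such a deep link from the `source_url` and `page` fields. The helper name `deep_link` is hypothetical, and it only handles URLs whose path ends in `.pdf`; HTML sources would need their own anchor scheme.

```python
def deep_link(source_url: str, page: int) -> str:
    """Append a #page=N fragment to a PDF URL; leave other URLs unchanged."""
    base = source_url.split("#", 1)[0]  # drop any existing fragment
    if base.lower().endswith(".pdf"):
        return f"{base}#page={page}"
    return source_url

print(deep_link("https://intranet.acme.com/docs/sec-3.2.pdf", 42))
# https://intranet.acme.com/docs/sec-3.2.pdf#page=42
```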
8. Limits and Failure Modes
8.1 Post-filter permissions → empty answers
- Why intrinsic: ANN may fill top-K with disallowed chunks; after permission filtering the candidate set may be empty.
- Diagnosis: per-user empty rate — significantly higher for less-privileged users is a red flag.
- Mitigation: move permissions to pre-filter. Raising candidate count (\(k'=50\)) is a stopgap.
- Later part: Part 9 (vector-DB pre-filter support).
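The per-user empty rate from the diagnosis bullet can be computed from a query log. This sketch assumes a hypothetical log of `(user clearance, number of chunks returned)` pairs; real logs will have a different shape.

```python
from collections import defaultdict

def empty_rate_by_clearance(log):
    """Fraction of queries that returned zero chunks, grouped by user clearance."""
    total = defaultdict(int)
    empty = defaultdict(int)
    for clearance, n_results in log:
        total[clearance] += 1
        if n_results == 0:
            empty[clearance] += 1
    return {level: empty[level] / total[level] for level in total}

# Hypothetical query log: (user clearance, number of chunks returned).
log = [("internal", 0), ("internal", 3), ("internal", 0), ("secret", 5), ("secret", 4)]
rates = empty_rate_by_clearance(log)
print(rates)
```

In this made-up log, two of three `internal` queries come back empty while no `secret` query does, which is the skew the diagnosis describes.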
8.2 Missing metadata → filtering impossible
- Why intrinsic: if `version` or `security_level` is omitted at chunk creation, the key itself is missing; behaviour varies by tool (some pass, some exclude).
- Diagnosis: per-field null counts in the index; anything above zero is dangerous.
- Mitigation: schema validation at chunk creation (Pydantic, JSON Schema); re-index on detection.
- Later part: Part 9 (vector-DB operational hygiene).
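The mitigation names Pydantic and JSON Schema; to stay dependency-free, the sketch below does the same check in plain Python. The function name `validate_metadata` and the exact rule set are assumptions for illustration, built from the seven core fields in §4.

```python
REQUIRED_FIELDS = (
    "document_id", "chunk_id", "version", "page",
    "section", "security_level", "namespace",
)
ALLOWED_LEVELS = {"public", "internal", "confidential", "secret"}

def validate_metadata(meta: dict) -> list:
    """Return a list of problems; an empty list means the chunk may be indexed."""
    problems = [f"missing {name}" for name in REQUIRED_FIELDS if meta.get(name) is None]
    level = meta.get("security_level")
    if level is not None and level not in ALLOWED_LEVELS:
        problems.append(f"unknown security_level {level!r}")
    return problems

bad = {"document_id": "policy-sec-v3.2", "chunk_id": "policy-sec-v3.2#c014", "page": 42}
print(validate_metadata(bad))
# ['missing version', 'missing section', 'missing security_level', 'missing namespace']
```

Running this before anything is added to the index turns a silent null into a rejected chunk, so the per-field null count stays at zero.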
8.3 Cardinality blow-up — slow filters
- Why intrinsic: fields with high uniqueness (e.g. `chunk_id`) make poor pre-filters; the engine must index every value, hurting memory and latency.
- Diagnosis: abnormal index-build time; rising p95 query latency.
- Mitigation: keep high-cardinality fields as post-filter; bucket via hashing (`chunk_hash_bucket = hash(chunk_id) % 256`).
- Later part: Part 9 (recommended cardinality per vector DB).
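One caveat when implementing the bucket formula in Python: the built-in `hash()` is salted per process for strings unless `PYTHONHASHSEED` is fixed, so an indexing job and a query service could disagree on the bucket. A sketch using a stable digest instead (the choice of SHA-256 and of the first 8 bytes is an assumption; any deterministic hash would do):

```python
import hashlib

def chunk_hash_bucket(chunk_id: str, buckets: int = 256) -> int:
    """Deterministic bucket in [0, buckets) derived from a SHA-256 digest."""
    digest = hashlib.sha256(chunk_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets

b = chunk_hash_bucket("policy-sec-v3.2#c014")
print(b)
```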
8.4 version=latest inconsistency
- Why intrinsic: indexing a new version without deleting the old chunks leaves both visible to `version=latest`; answers cite two conflicting policies.
- Diagnosis: same `document_id`, different `version`, both in top-K.
- Mitigation: atomic publish (`BEGIN → insert new → delete old → COMMIT`), or an explicit `is_active=true` field.
- Later part: Part 9 (transactional vector-DB updates).
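The diagnosis above can be automated over the metadata of a top-K result list. This sketch assumes `document_id` names the document independently of its version (unlike the `policy-sec-v3.2` example, which embeds it); `conflicting_versions` is a hypothetical helper name.

```python
from collections import defaultdict

def conflicting_versions(results: list) -> dict:
    """Map document_id -> versions, for documents that appear with more than one version."""
    seen = defaultdict(set)
    for meta in results:
        seen[meta["document_id"]].add(meta["version"])
    return {doc: sorted(vs) for doc, vs in seen.items() if len(vs) > 1}

# Hypothetical top-K metadata; document_id here carries no version suffix.
top_k_meta = [
    {"document_id": "policy-sec", "version": "3.1"},
    {"document_id": "policy-sec", "version": "3.2"},
    {"document_id": "manual-ops", "version": "1.0"},
]
print(conflicting_versions(top_k_meta))  # {'policy-sec': ['3.1', '3.2']}
```

A non-empty result on a latest-only query is a signal that an old version was never retired.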
8.5 Broken provenance → cannot re-visit
- Why intrinsic: showing only `document_id` without `source_url`, `section`, `page` leaves the user unable to find the source; hallucination cannot be verified.
- Diagnosis: zero click-through on citations means there are no real citations.
- Mitigation: enforce a complete citation format (see §6.3); have the answer prompt require `[source]` per claim.
- Later part: Chapter 22 (answer verification with mandatory sources).
8.6 Common Pitfalls
- "Filter permissions after answering." §8.1. The classic leak pattern.
- "Add metadata later." Once chunks are already indexed, re-indexing is the only path. Do it at creation.
- "Pre-filter everything." §8.3. High-cardinality fields bloat the index.
- "`version=latest` is enough." §8.4. Must come with atomic deletion of the old.
- "Cite just `document_id`." §8.5. URL + section + page + version are all needed.
9. Settled Conclusions
Q1. Name each of the seven core metadata fields in one line.
document_id (doc), chunk_id (position), version (lineage), page (page), section (heading path), security_level (access), namespace (search space).
Chapter: §4.
Q2. Why must permissions be pre-filtered?
Post-filtering lets ANN populate candidates with disallowed chunks; if all candidates are out-of-bounds, the answer is empty. Pre-filtering shrinks the search space itself. Chapter: §3, §8.1.
Q3. Why does cardinality flip the recommended filter position?
Low-cardinality fields (security_level, tenant_id) have predictable selectivity and cheap pre-filtering. High-cardinality fields (timestamp, page) are cheap to post-process. Chapter: §5.
Q4. Why must new-version publishing be atomic?
Failing to delete old-version chunks leaves both visible to version=latest; the answer cites contradictory policies simultaneously.
Chapter: §8.4.
Q5. What four pieces must every citation carry?
source_url, section (or heading path), page, version. Together they enable the user to re-visit the source and verify hallucination.
Chapter: §6.3, §8.5.
10. Further Reading
Primary
- Pinecone. Filtering with metadata official doc — diagrams of internal pre/post-filter behaviour.
- Qdrant. Payload-based filtering doc — reference for filter-first retrieval.
- Weaviate. Filtered Vector Search doc — ABAC-with-RAG examples.
- Anthropic. Building RAG with permissions (2024 cookbook). security_level-based multi-tenant pattern.
- NIST. Attribute-Based Access Control (ABAC) SP 800-162 — the standard underlying §7.3.
Official docs
- LangChain Vectorstores filter API: https://python.langchain.com/docs/concepts/vectorstores/#metadata-filtering
- Pinecone Filtering: https://docs.pinecone.io/guides/data/filtering-with-metadata
- Qdrant Filtering: https://qdrant.tech/documentation/concepts/filtering/
- Weaviate Filters: https://weaviate.io/developers/weaviate/api/graphql/filters
Supporting
- Author note Chapter 6 — metadata design.
- Author note Chapter 35 §4–§6 — Document Boundary, Multi-collection, Filter-first.
Cheat Sheet
| Field | Cardinality | Filter position | Note |
|---|---|---|---|
| `tenant_id` | Low | Pre | Multi-tenant top priority |
| `security_level` | Low (4) | Pre (required) | Prevents permission leaks |
| `namespace` | Low (3–10) | Pre | Shrinks search space |
| `version` | Low (2–10) | Pre | Demands atomic publish |
| `language` | Low (2–5) | Pre | Multilingual split |
| `page` | High | Post | Citation display |
| `section` | High | Post | Citation + Header Injection reuse |
| `chunk_id` | Very high | Post or index key | Unique id |
| `created_at` | Very high | Post | Time ranges |
Design rule of thumb: permission and isolation fields → pre-filter; display and range fields → post-filter; copy every field at chunk-creation time.
Bridge — What's Next
Next — RAG Core Study (8/26) — Embedding Models: BGE-M3 / OpenAI / Upstage / E5 / Jina / Voyage.
With chunks and metadata ready, the next decision is which embedding model to index with. Part 8 compares six mainstream models across vector dimension, multilingual quality, cost, local-execution feasibility, and unpacks the model-card-to-ingestion-schema mapping (author note Chapter 35 §2).