AI Operations Economics (4/4) — Context Management Patterns: auto-compact, Memory, RAG Cost Comparison

May 5, 2026


Context is cost. There are three ways to shrink it — compress, externalize, or retrieve.

Key takeaways

Three patterns: auto-compact (automatic summarization) / Memory (external files + index) / RAG (vector retrieval + injection)
Primary sources: Claude Code compact docs, RAG paper (Lewis et al., 2020), first-person operational measurements
Cost order (typical, long tasks): RAG ≤ Memory < auto-compact. For short tasks the order can flip
Decision rule: task length × freshness × accuracy requirement
Pitfalls: auto-compact can lose reasoning behind decisions; RAG suffers cost and accuracy losses on retrieval failure

1. The nature of context cost

LLM APIs bill the full input context on every call. A 200K-token context costs 200K × the input rate each time it is sent, so shrinking the context is the most direct cost-reduction lever.
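The per-call arithmetic can be sketched in a few lines. This is a minimal illustration, not a billing tool: the $3 per million input tokens rate is an assumption chosen to match the Sonnet figures used later in this post, the function names are hypothetical, and output tokens and caching discounts are ignored.

```python
# Assumed input price in USD per million tokens (illustrative; check current pricing).
INPUT_RATE_PER_MTOK = 3.00


def call_cost(context_tokens: int, rate_per_mtok: float = INPUT_RATE_PER_MTOK) -> float:
    """Input cost of one call that sends `context_tokens` tokens."""
    return context_tokens / 1_000_000 * rate_per_mtok


def session_cost(context_tokens: int, calls: int) -> float:
    """Input cost of `calls` calls that each resend the same context."""
    return call_cost(context_tokens) * calls


if __name__ == "__main__":
    print(f"one 200K call:    ${call_cost(200_000):.2f}")
    print(f"fifty 200K calls: ${session_cost(200_000, 50):.2f}")
    print(f"fifty 20K calls:  ${session_cost(20_000, 50):.2f}")
```

Because the context is resent on every call, a tenfold reduction in context length gives a tenfold reduction in input cost for the whole session under these assumptions.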

The three strategies differ in how they shrink it.

Strategy	Mechanism	Context length	Note
auto-compact	Model summarizes its own context	~30–50% of original after compaction	Loses decision reasoning
Memory	Rarely used info kept in external files	Index only (a few KB)	Requires explicit reference
RAG	Search + inject documents at query time	Just the top-K (tens of KB)	Needs retrieval infrastructure

2. auto-compact — Automatic summarization

Mechanism: when context approaches its limit, the LLM compresses its own past tokens into a summary form. Claude Code triggers this automatically.

Cost structure:
- The compaction itself is another LLM call: up to 200K tokens of input plus the summary output.
- Post-compaction context is shorter, but the compaction-time bill is large and one-shot.
- After compaction the model continues from the shorter summary, and cost begins accumulating again.
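The cost structure above can be explored with a toy simulation. This is a simplified model under stated assumptions, not a description of how Claude Code actually schedules compaction: every turn adds a fixed number of tokens, the whole context is billed as input each turn, compaction bills the full context once and keeps a fixed fraction of it, and output tokens are ignored. The rate, the threshold, and the `simulate` function are all hypothetical.

```python
INPUT_RATE = 3.00 / 1_000_000  # assumed USD per input token


def simulate(turns: int, tokens_per_turn: int, limit: int, keep_ratio: float):
    """Return (regular_cost, compaction_cost, compactions) for one session.

    Each turn grows the context by `tokens_per_turn` and bills the whole
    context as input. When the context reaches `limit`, one extra call bills
    the full context for summarization, then the context shrinks to
    `keep_ratio` of its size.
    """
    context = 0
    regular = 0.0
    compaction = 0.0
    compactions = 0
    for _ in range(turns):
        context += tokens_per_turn
        if context >= limit:
            compaction += context * INPUT_RATE  # the one-shot compaction bill
            context = round(context * keep_ratio)
            compactions += 1
        regular += context * INPUT_RATE  # the normal call for this turn
    return regular, compaction, compactions


if __name__ == "__main__":
    regular, compaction, n = simulate(
        turns=40, tokens_per_turn=10_000, limit=200_000, keep_ratio=0.4
    )
    print(f"regular calls: ${regular:.2f}")
    print(f"compaction calls: ${compaction:.2f} across {n} compactions")
```

Lowering `limit` in this model trades more frequent compaction bills for a shorter average context, which is the tension the bullets describe.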

Strengths:
- Automatic: no operator action needed.
- Natural fit when work is continuous.

Weaknesses / pitfalls:
- Decision reasoning can vanish during compression. Risky for tasks where "why did we decide that?" affects later steps.
- Compression is a lossy transform, so it is a poor fit for tasks requiring exact citations (legal, medical, financial).
- Frequent compaction is expensive in itself. Compacting a 200K context is essentially one large call.

When to use: default option for general coding / writing / chat. Avoid when accuracy and traceability are critical.

3. Memory — External files + index

Mechanism: store rarely-referenced information in markdown files on disk. Keep only the index (MEMORY.md) in context, and load individual files on demand.

Cost structure:
- Index: about 1–3K tokens. Cacheable.
- Per-file load: 1–10K tokens added to context only when relevant.
- Net effect: even with hundreds of KB of memory, average context stays at a few KB.
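A minimal sketch of the index-plus-on-demand-load idea follows. It assumes each memory file is a markdown file whose first line describes it, and it uses a crude four-characters-per-token estimate. The helper names (`build_index`, `load_memory`, `approx_tokens`) and the file layout are hypothetical, not Claude Code's actual implementation.

```python
import tempfile
from pathlib import Path

INDEX_NAME = "MEMORY.md"


def build_index(memory_dir: Path) -> str:
    """One line per memory file: file name plus its first line as a description."""
    lines = []
    for path in sorted(memory_dir.glob("*.md")):
        if path.name == INDEX_NAME:
            continue  # the index does not list itself
        text = path.read_text(encoding="utf-8")
        first = text.splitlines()[0] if text.strip() else ""
        lines.append(f"- {path.name}: {first.lstrip('# ').strip()}")
    return "\n".join(lines)


def load_memory(memory_dir: Path, name: str) -> str:
    """Load one memory file on demand; only then do its tokens enter the context."""
    return (memory_dir / name).read_text(encoding="utf-8")


def approx_tokens(text: str) -> int:
    """Crude estimate: about 4 characters per token (an assumption)."""
    return len(text) // 4


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "deploy.md").write_text("# Deploy decisions\n" + "detail " * 2000, encoding="utf-8")
        (root / "prefs.md").write_text("# User preferences\n" + "detail " * 2000, encoding="utf-8")
        index = build_index(root)
        (root / INDEX_NAME).write_text(index, encoding="utf-8")
        print(index)
        print("index tokens (approx):", approx_tokens(index))
        print("one file tokens (approx):", approx_tokens(load_memory(root, "deploy.md")))
```

Only the index string would sit in every call's context; a file's tokens are paid for only on the calls that load it.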

Strengths:
- Lossless. Originals are preserved.
- Human-editable: fixing wrong memory is straightforward.
- Plays well with caching (the index sits at the front and caches).

Weaknesses:
- Requires explicit reference. The model has to know which memory to read; without rules for when to look, it tends to skip the lookup.
- Inefficient for very large memory pools. Beyond ~1MB the index alone becomes insufficient.

When to use: explicit, intermittent references like user preferences, past decisions, external system coordinates. The 3-tier memory pattern from series A part 1 is the standard.

4. RAG — Vector retrieval + injection

Mechanism: embed all documents into a vector DB. At query time, embed the query → top-K retrieval → inject into context → generate answer.
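The embed, retrieve, inject pipeline can be sketched end to end with the standard library. The `embed` function below is a toy stand-in (lowercase word counts) for a real embedding model, and the in-memory list stands in for a vector DB; both are assumptions for illustration, as are the sample documents and function names.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query: str, index: list[tuple[str, Counter]], k: int) -> list[str]:
    """Embed the query and return the k most similar documents."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


def build_prompt(query: str, chunks: list[str]) -> str:
    """Inject the retrieved chunks, numbered so the answer can cite them."""
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return f"Context:\n{context}\n\nQuestion: {query}"


docs = [
    "vacation policy: employees accrue 1.5 days per month",
    "expense policy: submit receipts within 30 days",
    "deploy guide: run the release pipeline from main",
]
index = [(d, embed(d)) for d in docs]  # done once per document

if __name__ == "__main__":
    question = "how do I submit expense receipts"
    print(build_prompt(question, top_k(question, index, k=1)))
```

Only the retrieved chunks reach the prompt, so the context size depends on `k` and chunk length rather than on the size of the corpus.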

Cost structure:
- Embedding cost: one-time per document (linear). 1M tokens costs roughly $0.10 to embed.
- Vector DB: hosting (Qdrant, Pinecone) or self-managed.
- Per query: one query embedding plus one answer-generation LLM call.
- Context: only the K retrieved chunks (typically 5–20K tokens).

Strengths:
- Large knowledge bases (millions of documents) keep context small.
- Freshness: the index updates without retraining.
- Provenance: citable evidence for the answer.

Weaknesses:
- Retrieval failure destroys answer quality. What retrieval misses, the model never sees.
- Adds retrieval, embedding, and DB infrastructure, which is heavy for small teams.
- Many tuning knobs: chunking, embedding model, re-ranking.

When to use: large external knowledge (document libraries, FAQs, wikis) the model can't absorb through training. Domains requiring factual accuracy and traceability.

5. Cost comparison — Same task, three paths

Hypothetical scenario: "answer using relevant info from a 10K-document corporate wiki."

Strategy	Context length	Per-call cost (Sonnet)	Note
Full context (50M tokens)	Exceeds limit → impossible	Impossible	Some strategy is required
auto-compact (repeated)	~150K	$0.45 (input) + $0.15 (compaction)	Information loss; decision reasoning blurred
Memory (index + load)	~10K	$0.03	Needs rules for which memory to load
RAG (top-10 retrieval)	~15K	$0.05 + retrieval infra	Plus DB/host overhead

Observations:
- For one-shot tasks, RAG can be more expensive because of infrastructure overhead.
- For repeated tasks, RAG dominates on cost once the infrastructure overhead is amortized across calls.
- Memory hits a sweet spot at medium scale (hundreds of documents).
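The one-shot versus repeated distinction is a break-even calculation. The sketch below uses the per-call figures from the table; the $50 fixed infrastructure cost is a hypothetical number chosen only to make the arithmetic concrete, and the function names are illustrative.

```python
import math
from typing import Optional


def total_cost(per_call: float, fixed: float, calls: int) -> float:
    """Fixed cost plus per-call cost over `calls` calls."""
    return fixed + per_call * calls


def break_even_calls(cheap_per_call: float, cheap_fixed: float,
                     costly_per_call: float) -> Optional[int]:
    """Smallest call count at which the fixed-cost option is strictly cheaper.

    Returns None when the fixed-cost option never catches up.
    """
    saving = costly_per_call - cheap_per_call
    if saving <= 0:
        return None
    return math.floor(cheap_fixed / saving) + 1


if __name__ == "__main__":
    RAG_PER_CALL = 0.05            # from the table
    AUTO_COMPACT_PER_CALL = 0.60   # $0.45 input + $0.15 compaction, from the table
    MEMORY_PER_CALL = 0.03         # from the table
    RAG_INFRA = 50.0               # hypothetical fixed cost for the period

    print("RAG vs auto-compact:", break_even_calls(RAG_PER_CALL, RAG_INFRA, AUTO_COMPACT_PER_CALL))
    print("RAG vs Memory:", break_even_calls(RAG_PER_CALL, RAG_INFRA, MEMORY_PER_CALL))
```

With the table's figures, Memory's per-call cost ($0.03) is already below RAG's ($0.05), so no break-even exists between those two under these numbers. The comparison that flips with volume is RAG against auto-compact.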

6. Decision rules

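Three axes: task length, freshness, and accuracy requirement. The table maps common task profiles to a recommendation.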

Task profile	Recommendation
Short, one-off; no external knowledge	auto-compact
User- / project-specific explicit info recurs	Memory
Large external knowledge; freshness matters	RAG
Exact source citation required	RAG
Need to trace decision reasoning	Memory (avoid auto-compact)
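The table can be encoded as a small decision function. This is one possible reading: the field names are hypothetical, and the priority order when several rows apply at once (citation and large-corpus needs first, then Memory, then auto-compact as the default) is an assumption the table itself does not state.

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    """Flags mirroring the rows of the table; all default to False."""
    large_external_knowledge: bool = False
    freshness_matters: bool = False
    exact_citation: bool = False
    trace_decisions: bool = False
    recurring_project_info: bool = False


def recommend(p: TaskProfile) -> str:
    """Return the recommended pattern for a task profile."""
    # Assumed priority: RAG rows win over Memory rows when both apply.
    if p.exact_citation or (p.large_external_knowledge and p.freshness_matters):
        return "RAG"
    if p.trace_decisions or p.recurring_project_info:
        return "Memory"
    # Short, one-off work with no external knowledge falls through to the default.
    return "auto-compact"


if __name__ == "__main__":
    print(recommend(TaskProfile()))
    print(recommend(TaskProfile(trace_decisions=True)))
    print(recommend(TaskProfile(large_external_knowledge=True, freshness_matters=True)))
```

A single label is a simplification; real systems often combine patterns.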

Combining patterns:
- Small system: auto-compact + Memory (1–3K index, file-load on demand).
- Medium system: Memory + small RAG (internal wiki only).
- Large system: all three. auto-compact for interactive, Memory for user context, RAG for the knowledge base.

7. Common pitfalls

7.1 Treating RAG as the universal answer

Adding RAG to a small (<10K docs) corpus → infra cost erases savings.
Fix: start with Memory; switch to RAG when the corpus grows.

7.2 Over-relying on auto-compact

Compaction blurs decision reasoning, leading to retries.
Fix: manually consolidate context before critical decisions. Externalize reasoning to memory or session logs.
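One way to externalize reasoning is an append-only decision log that survives compaction because it lives on disk. The sketch below assumes a JSON Lines file; the file format, field names, and helper functions are hypothetical choices, not a Claude Code feature.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def record_decision(log_path: Path, decision: str, reasoning: str) -> dict:
    """Append one decision and its reasoning as a JSON line."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "decision": decision,
        "reasoning": reasoning,
    }
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return entry


def load_decisions(log_path: Path) -> list[dict]:
    """Read all recorded decisions back, oldest first."""
    if not log_path.exists():
        return []
    with log_path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        log = Path(tmp) / "decisions.jsonl"
        record_decision(log, "Use SQLite for the cache", "Single writer; no server to operate")
        for d in load_decisions(log):
            print(d["decision"], "because", d["reasoning"])
```

After a compaction, re-reading the log restores the "why" that the summary may have dropped.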

7.3 Using Memory as a substitute for code

Memory is for non-obvious information. Facts derivable from code should be re-checked against the code.
Fix: ask "can this be derived from the codebase?" before saving to memory.

8. At a glance

Pattern	Primary cost	Data scale	Accuracy traceability	Operational burden
auto-compact	LLM compaction call	Within one session	Weak (lossy)	Low (automatic)
Memory	Index tokens	A few MB or less	Strong	Medium (needs rules)
RAG	Retrieval infra + embeddings	GBs and up	Very strong	High (infra)

Core principle: pick the intersection of data scale × accuracy requirement × operational capacity. Don't lock into a single pattern.

Series wrap — Part 4/4

The AI Operations Economics series stacked four levers: making the bill predictable (part 1), then actively reducing via routing (part 2), caching (part 3), and context management (part 4). Applied together, the same task typically lands at 10–30% of the original cost.

Combined with Series A (Coding Agents in Practice, five parts), the workflow + cost axes are complete. The next campaign is likely to take the next step in measurement and operations — evaluations.

References

Anthropic, Auto-compact in Claude Code — code.claude.com/docs/sessions (verified 2026-05-05).
Anthropic, Memory and Sessions — code.claude.com/docs/memory (verified 2026-05-05).
Lewis et al., 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401.
Series parts 1 (token cost), 2 (routing), 3 (caching).

This is part 4/4 — the final entry in the AI Operations Economics series.


MaJu Tech Notes