AI Operations Economics (4/4) — Context Management Patterns: auto-compact, Memory, RAG Cost Comparison
Context is cost. There are three ways to shrink it — compress, externalize, or retrieve.
Key summary
- Three patterns: auto-compact (automatic summarization) / Memory (external files + index) / RAG (vector retrieval + injection)
- Primary sources: Claude Code compact docs, RAG paper (Lewis et al., 2020), first-person operational measurements
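- Cost order (typical, long tasks): Memory and RAG are close, and both sit far below auto-compact. In the section 5 scenario Memory is about $0.03 per call, RAG about $0.05 plus retrieval infrastructure, and auto-compact about $0.60. For short tasks the order can flip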
- Decision rule: task length × freshness × accuracy requirement
- Pitfalls: auto-compact can lose reasoning behind decisions; RAG suffers cost and accuracy losses on retrieval failure
1. The nature of context cost
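LLM APIs bill every input token on every call, so the whole context is charged each time it is sent. A 200K-token context costs 200K × the input rate per call, before any caching discount (part 3). Shrinking context is the most direct cost-reduction lever.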
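The arithmetic can be sketched in a few lines. This is an illustration only: the $3 per million input tokens default is an assumption (it is the rate that the per-call figures in the section 5 table work out to), and the function names are hypothetical.

```python
def input_cost_usd(context_tokens: int, usd_per_million_tokens: float = 3.0) -> float:
    """Input-side cost of a single call that sends `context_tokens` tokens.

    The default rate is an assumed, illustrative price, not a quoted one.
    """
    return context_tokens * usd_per_million_tokens / 1_000_000


def session_input_cost_usd(context_tokens: int, calls: int,
                           usd_per_million_tokens: float = 3.0) -> float:
    """Cost of `calls` calls that each resend the same-sized context."""
    return calls * input_cost_usd(context_tokens, usd_per_million_tokens)


if __name__ == "__main__":
    for tokens in (200_000, 15_000):
        per_call = input_cost_usd(tokens)
        per_session = session_input_cost_usd(tokens, calls=50)
        print(f"{tokens:>7} tokens: ${per_call:.3f} per call, ${per_session:.2f} over 50 calls")
```

Because the context is resent on every call, the gap between a large and a small context scales with the number of calls, not just with the size difference.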
The three strategies differ in how they shrink.
| Strategy | Mechanism | Context length | Note |
|---|---|---|---|
| auto-compact | Model summarizes its own context | ~30–50% of original after compaction | Loses decision reasoning |
| Memory | Rarely used info kept in external files | Index only (a few KB) | Requires explicit reference |
| RAG | Search + inject documents at query time | Just the top-K (tens of KB) | Needs retrieval infrastructure |
2. auto-compact — Automatic summarization
Mechanism: when the context approaches its limit, the model summarizes its own earlier context into a shorter form. Claude Code triggers this automatically.
Cost structure:
- The compaction itself is another LLM call: 200K input plus the summary output.
- Post-compaction context is shorter, but the compaction-time bill is large and one-shot.
- After compaction the model starts fresh, and cost begins accumulating again.
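A rough simulation makes this sawtooth visible. It is a sketch under stated assumptions, not a model of Claude Code's actual behavior: the rates ($3 input, $15 output per million tokens), the fixed growth per turn, the trigger threshold, and the 40% keep ratio are all illustrative, and `simulate` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class SessionCost:
    call_input_usd: float = 0.0   # input cost of ordinary calls
    compaction_usd: float = 0.0   # cost of the compaction calls themselves
    compactions: int = 0
    final_context: int = 0

    @property
    def total_usd(self) -> float:
        return self.call_input_usd + self.compaction_usd


def simulate(turns: int, tokens_per_turn: int, compact_at: int,
             keep_ratio: float = 0.4, in_rate: float = 3.0,
             out_rate: float = 15.0) -> SessionCost:
    """Accumulate cost over a session that compacts when context hits `compact_at`.

    All rates and ratios are assumed values for illustration.
    """
    cost = SessionCost()
    context = 0
    for _ in range(turns):
        context += tokens_per_turn
        # Every call bills the whole current context as input.
        cost.call_input_usd += context * in_rate / 1_000_000
        if context >= compact_at:
            summary = round(context * keep_ratio)
            # Compaction reads the full context and writes the summary.
            cost.compaction_usd += (context * in_rate + summary * out_rate) / 1_000_000
            cost.compactions += 1
            context = summary  # cost starts accumulating again from here
    cost.final_context = context
    return cost


if __name__ == "__main__":
    result = simulate(turns=40, tokens_per_turn=10_000, compact_at=150_000)
    print(f"compactions: {result.compactions}")
    print(f"ordinary calls: ${result.call_input_usd:.2f}")
    print(f"compaction calls: ${result.compaction_usd:.2f}")
```

Varying `compact_at` and `keep_ratio` shows the trade-off: compacting earlier keeps ordinary calls cheaper but pays the one-shot compaction bill more often.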
Strengths:
- Automatic: no operator action needed.
- Natural fit when work is continuous.
Weaknesses / pitfalls:
- Decision reasoning can vanish during compression. Risky for tasks where "why did we decide that?" affects later steps.
- Compression is a lossy transform, so it is a poor fit for tasks requiring exact citations (legal, medical, financial).
- Frequent compaction is expensive in itself. Compacting a 200K context is essentially one large call.
When to use: default option for general coding / writing / chat. Avoid when accuracy and traceability are critical.
3. Memory — External files + index
Mechanism: store rarely-referenced information in markdown files on disk. Keep only the index (MEMORY.md) in context, and load individual files on demand.
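The index-plus-on-demand-load idea can be sketched as follows. This is a minimal illustration, not how Claude Code implements memory: the index line format (`- file.md: keyword, keyword`), the keyword match, and `build_context` are all assumptions made for the example. In practice the model itself decides which file to open after reading the index.

```python
import tempfile
from pathlib import Path


def build_context(memory_dir: Path, task: str) -> str:
    """Always include the index; add a memory file only when the task matches it.

    Assumes hypothetical index lines of the form "- <file>: <kw>, <kw>".
    """
    index_text = (memory_dir / "MEMORY.md").read_text(encoding="utf-8")
    parts = [index_text]
    task_lower = task.lower()
    for line in index_text.splitlines():
        if not line.startswith("- ") or ":" not in line:
            continue  # headings and free text in the index are ignored
        filename, keywords = line[2:].split(":", 1)
        wanted = [k.strip().lower() for k in keywords.split(",") if k.strip()]
        if any(k in task_lower for k in wanted):
            parts.append((memory_dir / filename.strip()).read_text(encoding="utf-8"))
    return "\n\n".join(parts)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "MEMORY.md").write_text(
            "# Memory index\n- deploy.md: deploy, release\n- style.md: formatting, lint\n",
            encoding="utf-8",
        )
        (root / "deploy.md").write_text("Deploys go through staging first.", encoding="utf-8")
        (root / "style.md").write_text("Use 4-space indentation.", encoding="utf-8")
        print(build_context(root, "Prepare the next release"))
```

Only the index and the one matching file reach the context; the other file stays on disk at zero token cost.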
Cost structure:
- Index: about 1–3K tokens. Cacheable.
- Per-file load: 1–10K tokens added to context only when relevant.
- Net effect: even with hundreds of KB of memory, average context stays at a few KB.
Strengths:
- Lossless. Originals are preserved.
- Human-editable: fixing wrong memory is straightforward.
- Plays well with caching (the index sits at the front and caches).
Weaknesses:
- Requires explicit reference. The model has to know "which memory to read"; without rules, it skips the lookup.
- Inefficient for very large memory pools. Beyond ~1MB the index alone becomes insufficient.
When to use: information that is referenced explicitly but only intermittently, such as user preferences, past decisions, and external system coordinates. The 3-tier memory pattern from series A part 1 is the standard.
4. RAG — Vector retrieval + injection
Mechanism: embed all documents into a vector DB. At query time, embed the query → top-K retrieval → inject into context → generate answer.
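The embed, retrieve, inject flow can be shown end to end with a toy stand-in. Everything here is an assumption for illustration: the bag-of-words `embed` replaces a real embedding model, the in-memory list replaces a vector DB, and `build_prompt` is a hypothetical template. The shape of the pipeline is the point, not the retrieval quality.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, index: list[tuple[str, Counter]], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query (top-K retrieval)."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(query: str, chunks: list[str]) -> str:
    """Inject only the retrieved chunks into the context, then ask the question."""
    sources = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return f"Answer using only these sources:\n{sources}\n\nQuestion: {query}"


DOCS = [
    "VPN access requires a hardware token issued by IT.",
    "Expense reports are due on the fifth business day of each month.",
    "The cafeteria is open from 8am to 3pm on weekdays.",
]
# One-time indexing step: embed every document once.
INDEX = [(doc, embed(doc)) for doc in DOCS]

if __name__ == "__main__":
    question = "When are expense reports due?"
    print(build_prompt(question, retrieve(question, INDEX, k=1)))
```

The prompt contains one chunk regardless of how many documents are indexed, which is why context size stays flat as the corpus grows. It also shows the failure mode: if `retrieve` ranks the wrong chunk first, the model never sees the right one.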
Cost structure:
- Embedding cost: one-time per document (linear). 1M tokens ≈ $0.10 to embed.
- Vector DB: hosting (Qdrant, Pinecone) or self-managed.
- LLM call: query embedding + answer generation.
- Context: only the K retrieved chunks (typically 5–20K tokens).
Strengths:
- Large knowledge bases (millions of documents) keep context small.
- Freshness: the index updates without retraining.
- Provenance: citable evidence for the answer.
Weaknesses:
- Retrieval failure destroys answer quality. What retrieval misses, the model never sees.
- Adds retrieval, embedding, and DB infrastructure, which is heavy for small teams.
- Many tuning knobs: chunking, embedding model, re-ranking.
When to use: large external knowledge (document libraries, FAQs, wikis) the model can't absorb through training. Domains requiring factual accuracy and traceability.
5. Cost comparison — Same task, three paths
Hypothetical scenario: "answer using relevant info from a 10K-document corporate wiki."
| Strategy | Context length | Per-call cost (Sonnet) | Note |
|---|---|---|---|
| Full context (50M tokens) | Exceeds limit → impossible | Impossible | Some strategy is required |
| auto-compact (repeated) | ~150K | $0.45 (input) + $0.15 (compaction) | Information loss; decision reasoning blurred |
| Memory (index + load) | ~10K | $0.03 | Needs rules for which memory to load |
| RAG (top-10 retrieval) | ~15K | $0.05 + retrieval infra | Plus DB/host overhead |
Observations:
- For one-shot tasks, RAG can be more expensive due to infrastructure overhead.
- For repeated tasks, RAG dominates on cost.
- Memory hits a sweet spot for medium-scale (hundreds of documents).
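The one-shot versus repeated distinction is a break-even calculation. The sketch below uses the per-call figures from the table; the $100 fixed infrastructure cost is a made-up placeholder, and `break_even_calls` is a hypothetical helper.

```python
import math
from typing import Optional


def break_even_calls(fixed_infra_usd: float, rag_per_call_usd: float,
                     alt_per_call_usd: float) -> Optional[int]:
    """Number of calls after which RAG's fixed overhead is paid back.

    Returns None when RAG is not cheaper per call, since the fixed
    cost is then never recovered on price alone.
    """
    saving = alt_per_call_usd - rag_per_call_usd
    if saving <= 0:
        return None
    return math.ceil(fixed_infra_usd / saving)


if __name__ == "__main__":
    infra = 100.0  # assumed fixed retrieval-infrastructure cost, illustrative only
    # auto-compact: $0.45 input + $0.15 compaction; RAG: $0.05 per call (table above)
    print("vs auto-compact:", break_even_calls(infra, 0.05, 0.45 + 0.15))
    # Memory: $0.03 per call (table above)
    print("vs Memory:", break_even_calls(infra, 0.05, 0.03))
```

Under these assumed numbers RAG pays for itself against auto-compact after a few hundred calls. Against Memory the table's per-call price never breaks even, so the case for RAG over Memory rests on corpus scale rather than per-call cost.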
6. Decision rules
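Three axes: task length, freshness, and accuracy requirement. The table maps common task profiles to a recommendation.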
| Task profile | Recommendation |
|---|---|
| Short, one-off; no external knowledge | auto-compact |
| User- / project-specific explicit info recurs | Memory |
| Large external knowledge; freshness matters | RAG |
| Exact source citation required | RAG |
| Need to trace decision reasoning | Memory (avoid auto-compact) |
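The table can be encoded as a small rule function. This is one possible reading, not a definitive policy: the field names are hypothetical, and the precedence (RAG rules checked first, then Memory, with auto-compact as the default) is an assumption, since the table does not say what to do when several rows apply.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskProfile:
    """Hypothetical flags mirroring the rows of the decision table."""
    large_external_knowledge: bool = False   # large corpus where freshness matters
    exact_citation_required: bool = False
    must_trace_decisions: bool = False
    recurring_explicit_info: bool = False    # user- or project-specific facts recur


def recommend(profile: TaskProfile) -> str:
    """Map a task profile to a pattern, using an assumed precedence order."""
    if profile.large_external_knowledge or profile.exact_citation_required:
        return "RAG"
    if profile.must_trace_decisions or profile.recurring_explicit_info:
        return "Memory"
    # Short, one-off work with no external knowledge.
    return "auto-compact"


if __name__ == "__main__":
    print(recommend(TaskProfile()))
    print(recommend(TaskProfile(recurring_explicit_info=True)))
    print(recommend(TaskProfile(large_external_knowledge=True)))
```

A function like this returns a single pattern, so it fits the first-cut decision; real systems usually end up combining patterns.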
Combining patterns:
- Small system: auto-compact + Memory (1–3K index, file-load on demand).
- Medium system: Memory + small RAG (internal wiki only).
- Large system: all three. auto-compact for interactive work, Memory for user context, RAG for the knowledge base.
7. Common pitfalls
7.1 Treating RAG as the universal answer
- Adding RAG to a small (<10K docs) corpus → infra cost erases savings.
- Fix: start with Memory; switch to RAG when the corpus grows.
7.2 Over-relying on auto-compact
- Compaction blurs decision reasoning, leading to retries.
- Fix: manually consolidate context before critical decisions. Externalize reasoning to memory or session logs.
7.3 Using Memory as a substitute for code
- Memory is for non-obvious information. Facts derivable from code should be re-checked against the code.
- Fix: ask "can this be derived from the codebase?" before saving to memory.
8. At a glance
| Pattern | Primary cost | Data scale | Accuracy traceability | Operational burden |
|---|---|---|---|---|
| auto-compact | LLM compaction call | Within one session | Weak (lossy) | Low (automatic) |
| Memory | Index tokens | A few MB or less | Strong | Medium (needs rules) |
| RAG | Retrieval infra + embeddings | GBs and up | Very strong | High (infra) |
Core principle: pick the intersection of data scale × accuracy requirement × operational capacity. Don't lock into a single pattern.
Series wrap — Part 4/4
The AI Operations Economics series stacked four levers: making the bill predictable (part 1), then actively reducing via routing (part 2), caching (part 3), and context management (part 4). Applied together, the same task typically lands at 10–30% of the original cost.
Combined with Series A (Coding Agents in Practice, five parts), the workflow + cost axes are complete. The next campaign is likely to take the next step in measurement and operations — evaluations.
References
- Anthropic, Auto-compact in Claude Code — code.claude.com/docs/sessions (verified 2026-05-05).
- Anthropic, Memory and Sessions — code.claude.com/docs/memory (verified 2026-05-05).
- Lewis et al., 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401.
- Series parts 1 (token cost), 2 (routing), 3 (caching).
This is part 4/4 — the final entry in the AI Operations Economics series.