AI Operations Economics (4/4) — Context Management Patterns: auto-compact, Memory, RAG Cost Comparison

Context is cost. There are three ways to shrink it — compress, externalize, or retrieve.


Key summary

  • Three patterns: auto-compact (automatic summarization) / Memory (external files + index) / RAG (vector retrieval + injection)
  • Primary sources: Claude Code compact docs, RAG paper (Lewis et al., 2020), first-person operational measurements
  • Cost order (typical, long tasks): RAG ≤ Memory < auto-compact. For short tasks the order can flip
  • Decision rule: task length × freshness × accuracy requirement
  • Pitfalls: auto-compact can lose reasoning behind decisions; RAG suffers cost and accuracy losses on retrieval failure

1. The nature of context cost

LLM APIs bill the entire input context on every call. A 200K-token context means 200K × the input rate, every call. Shrinking the context is therefore the most direct cost-reduction lever.
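To make the arithmetic concrete, here is a minimal sketch of how input cost accumulates when the same context is resent on every call. The rate of $3 per million input tokens is an assumption (it matches the Sonnet figures used in the section 5 comparison), and the function name is illustrative.

```python
def cumulative_input_cost(context_tokens: int, calls: int, usd_per_million: float) -> float:
    """Total input cost when the full context is billed again on every call."""
    return context_tokens * calls * usd_per_million / 1_000_000


# Assumed input rate in USD per million tokens (not an official price quote).
ASSUMED_INPUT_RATE = 3.0

full = cumulative_input_cost(200_000, calls=50, usd_per_million=ASSUMED_INPUT_RATE)
small = cumulative_input_cost(20_000, calls=50, usd_per_million=ASSUMED_INPUT_RATE)

print(f"200K context x 50 calls: ${full:.2f}")
print(f" 20K context x 50 calls: ${small:.2f}")
```

Under these assumptions, 50 calls at 200K tokens cost $30 of input, against $3 at 20K tokens. Caching discounts are ignored here.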

The three strategies differ in how they shrink.

| Strategy | Mechanism | Context length | Note |
|---|---|---|---|
| auto-compact | Model summarizes its own context | ~30–50% of original after compaction | Loses decision reasoning |
| Memory | Rarely used info kept in external files | Index only (a few KB) | Requires explicit reference |
| RAG | Search + inject documents at query time | Just the top-K (tens of KB) | Needs retrieval infrastructure |

2. auto-compact — Automatic summarization

Mechanism: when the context approaches its limit, the LLM summarizes its own earlier context and continues from that summary. Claude Code triggers this automatically.
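The trigger logic can be sketched roughly as follows. This is not Claude Code's implementation: the 80% threshold, the choice to keep the last two messages verbatim, and every name in the block are assumptions for illustration, and the tokenizer and summarizer are stubs.

```python
from typing import Callable, List


def maybe_compact(
    messages: List[str],
    limit_tokens: int,
    count_tokens: Callable[[str], int],
    summarize: Callable[[List[str]], str],
    threshold: float = 0.8,
    keep_recent: int = 2,
) -> List[str]:
    """Replace older messages with one summary once usage nears the limit."""
    used = sum(count_tokens(m) for m in messages)
    if used < threshold * limit_tokens:
        return messages  # still under the trigger point, nothing to do
    cut = max(len(messages) - keep_recent, 0)
    older, recent = messages[:cut], messages[cut:]
    if not older:
        return messages  # nothing old enough to summarize
    return [summarize(older)] + recent


# Stand-ins for illustration only: a real system would use the model's
# tokenizer and an LLM call (which is itself billed) to write the summary.
def word_count(text: str) -> int:
    return len(text.split())


def stub_summarize(older: List[str]) -> str:
    return f"[summary of {len(older)} earlier messages]"


history = ["alpha beta gamma delta"] * 10  # 40 "tokens" by word count
compacted = maybe_compact(history, 45, word_count, stub_summarize)
print(compacted)
```

Everything in `older` survives only as whatever the summarizer chose to keep, which is where the reasoning behind earlier decisions can get lost.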

Cost structure:

  • The compaction itself is another LLM call: 200K input plus the summary output.
  • Post-compaction context is shorter, but the compaction-time bill is large and one-shot.
  • After compaction the model starts fresh, and cost begins accumulating again.

Strengths:

  • Automatic: no operator action needed.
  • Natural fit when work is continuous.

Weaknesses / pitfalls:

  • Decision reasoning can vanish during compression. Risky for tasks where "why did we decide that?" affects later steps.
  • Compression is a lossy transform, so it is a poor fit for tasks requiring exact citations (legal, medical, financial).
  • Frequent compaction is expensive in itself. Compacting a 200K context is essentially one large call.

When to use: default option for general coding / writing / chat. Avoid when accuracy and traceability are critical.


3. Memory — External files + index

Mechanism: store rarely-referenced information in markdown files on disk. Keep only the index (MEMORY.md) in context, and load individual files on demand.
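A minimal sketch of the index-plus-load flow, under stated assumptions: the one-line-per-file `name: description` index format is hypothetical, and the keyword match stands in for the model deciding which file to read. Only `MEMORY.md` as the index file name comes from the text above.

```python
import tempfile
from pathlib import Path


def build_context(memory_dir: Path, query: str) -> str:
    """Always include the index; load a memory file only if its description matches."""
    index_text = (memory_dir / "MEMORY.md").read_text(encoding="utf-8")
    parts = [index_text]
    query_words = set(query.lower().split())
    for line in index_text.splitlines():
        name, sep, description = line.partition(":")
        if not sep:
            continue  # not an index entry
        if query_words & set(description.lower().split()):
            parts.append((memory_dir / name.strip()).read_text(encoding="utf-8"))
    return "\n\n".join(parts)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical index format: "<file>: <short description>"
    (root / "MEMORY.md").write_text(
        "deploy.md: deploy steps and rollback\nstyle.md: code style preferences\n",
        encoding="utf-8",
    )
    (root / "deploy.md").write_text("Deploys go through the staging cluster first.", encoding="utf-8")
    (root / "style.md").write_text("Prefer small functions.", encoding="utf-8")

    context = build_context(root, "how do we rollback a release")

print(context)
```

Here only `deploy.md` is loaded, because the query shares the word "rollback" with its description; `style.md` stays on disk and costs nothing.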

Cost structure:

  • Index: about 1–3K tokens. Cacheable.
  • Per-file load: 1–10K tokens added to context only when relevant.
  • Net effect: even with hundreds of KB of memory, average context stays at a few KB.

Strengths:

  • Lossless. Originals are preserved.
  • Human-editable, so fixing a wrong memory is straightforward.
  • Plays well with caching (the index sits at the front of the context and caches).

Weaknesses:

  • Requires explicit reference. The model has to know which memory to read; without rules, it skips the lookup.
  • Inefficient for very large memory pools. Beyond ~1MB an index alone is no longer enough.

When to use: explicit, intermittent references like user preferences, past decisions, external system coordinates. The 3-tier memory pattern from series A part 1 is the standard.


4. RAG — Vector retrieval + injection

Mechanism: embed all documents into a vector DB. At query time, embed the query → top-K retrieval → inject into context → generate answer.
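The retrieve-then-inject loop can be sketched in a few lines. To stay self-contained, this uses a toy bag-of-words vector in place of a real embedding model and a plain list in place of a vector DB; both are stand-ins, and all names are illustrative.

```python
import math
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Dict[str, float]:
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return {word: float(n) for word, n in Counter(text.lower().split()).items()}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, docs: List[str], k: int) -> List[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]


def build_prompt(query: str, docs: List[str], k: int = 1) -> str:
    """Inject only the retrieved chunks into the context, not the whole corpus."""
    context = "\n".join(top_k(query, docs, k))
    return f"Context:\n{context}\n\nQuestion: {query}"


docs = [
    "vacation policy allows twenty days per year",
    "expense reports are due monthly",
    "the vpn requires two factor login",
]
prompt = build_prompt("how many vacation days per year", docs, k=1)
print(prompt)
```

The prompt contains one document out of three. If the ranking had picked the wrong one, the model would answer from the wrong context with no way to notice.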

Cost structure:

  • Embedding cost: one-time per document, linear in corpus size. Roughly $0.10 per 1M tokens, depending on the embedding model.
  • Vector DB: hosted (Qdrant, Pinecone) or self-managed.
  • LLM call: query embedding + answer generation.
  • Context: only the K retrieved chunks (typically 5–20K tokens).

Strengths:

  • Large knowledge bases (millions of documents) keep context small.
  • Freshness: the index updates without retraining.
  • Provenance: citable evidence for the answer.

Weaknesses:

  • Retrieval failure destroys answer quality. What retrieval misses, the model never sees.
  • Adds retrieval, embedding, and DB infrastructure, which is heavy for small teams.
  • Many tuning knobs: chunking, embedding model, re-ranking.

When to use: large external knowledge (document libraries, FAQs, wikis) the model can't absorb through training. Domains requiring factual accuracy and traceability.


5. Cost comparison — Same task, three paths

Hypothetical scenario: "answer using relevant info from a 10K-document corporate wiki."

| Strategy | Context length | Per-call cost (Sonnet) | Note |
|---|---|---|---|
| Full context (50M tokens) | Exceeds the context limit | Not possible | Some strategy is required |
| auto-compact (repeated) | ~150K | $0.45 (input) + $0.15 (compaction) | Information loss; decision reasoning blurred |
| Memory (index + load) | ~10K | $0.03 | Needs rules for which memory to load |
| RAG (top-10 retrieval) | ~15K | $0.05 + retrieval infra | Plus DB/host overhead |

Observations:

  • For one-shot tasks, RAG can be more expensive due to infrastructure overhead.
  • For repeated tasks, RAG dominates on cost.
  • Memory hits a sweet spot at medium scale (hundreds of documents).
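The one-shot versus repeated trade-off is a fixed-cost amortization question. The sketch below uses the per-call figures from the table, plus a hypothetical $50/month for retrieval infrastructure; that infrastructure number is an assumption chosen for illustration, not a measurement.

```python
import math

# Per-call figures taken from the comparison table.
RAG_PER_CALL = 0.05
AUTO_COMPACT_PER_CALL = 0.45 + 0.15  # input + compaction

# Hypothetical fixed cost for vector DB hosting, in USD per month.
ASSUMED_RAG_INFRA_MONTHLY = 50.0


def effective_cost_per_call(per_call_usd: float, fixed_monthly_usd: float, calls_per_month: int) -> float:
    """Per-call cost once the fixed monthly cost is spread across all calls."""
    return per_call_usd + fixed_monthly_usd / calls_per_month


def break_even_calls(fixed_monthly_usd: float, cheap_per_call: float, expensive_per_call: float) -> int:
    """Calls per month at which the fixed-cost option becomes the cheaper one."""
    return math.ceil(fixed_monthly_usd / (expensive_per_call - cheap_per_call))


one_shot = effective_cost_per_call(RAG_PER_CALL, ASSUMED_RAG_INFRA_MONTHLY, 1)
heavy_use = effective_cost_per_call(RAG_PER_CALL, ASSUMED_RAG_INFRA_MONTHLY, 10_000)
threshold = break_even_calls(ASSUMED_RAG_INFRA_MONTHLY, RAG_PER_CALL, AUTO_COMPACT_PER_CALL)

print(f"RAG, 1 call/month:      ${one_shot:.2f} per call")
print(f"RAG, 10K calls/month:   ${heavy_use:.3f} per call")
print(f"RAG beats auto-compact from {threshold} calls/month")
```

With these inputs, a single call carries the whole infrastructure bill, while at 10K calls a month the overhead is half a cent per call. Memory's table figure ($0.03) is below RAG's per-call figure, so this calculation only covers RAG versus auto-compact.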


6. Decision rules

Three axes: task length, freshness, and accuracy requirement.

| Task profile | Recommendation |
|---|---|
| Short, one-off; no external knowledge | auto-compact |
| User- / project-specific explicit info recurs | Memory |
| Large external knowledge; freshness matters | RAG |
| Exact source citation required | RAG |
| Need to trace decision reasoning | Memory (avoid auto-compact) |
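The table can be encoded as a small lookup function. The field names are illustrative, and the precedence (RAG checks first, then Memory, then auto-compact as the fallback) is an assumption about how to break ties when several rows apply; the table itself does not specify an order.

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    large_external_knowledge: bool = False
    needs_exact_citation: bool = False
    needs_decision_trace: bool = False
    recurring_explicit_info: bool = False


def recommend(profile: TaskProfile) -> str:
    """Map a task profile to a pattern, following the rows of the table."""
    # Assumed precedence: knowledge scale and citation needs are checked first.
    if profile.large_external_knowledge or profile.needs_exact_citation:
        return "RAG"
    if profile.needs_decision_trace or profile.recurring_explicit_info:
        return "Memory"
    # Short, one-off work with no external knowledge.
    return "auto-compact"


print(recommend(TaskProfile()))
print(recommend(TaskProfile(recurring_explicit_info=True)))
print(recommend(TaskProfile(large_external_knowledge=True)))
```

A real system would often return more than one pattern, since the patterns combine rather than exclude each other.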

Combining patterns:

  • Small system: auto-compact + Memory (1–3K index, file-load on demand).
  • Medium system: Memory + small RAG (internal wiki only).
  • Large system: all three. auto-compact for interactive work, Memory for user context, RAG for the knowledge base.


7. Common pitfalls

7.1 Treating RAG as the universal answer

  • Adding RAG to a small (<10K docs) corpus → infra cost erases savings.
  • Fix: start with Memory; switch to RAG when the corpus grows.

7.2 Over-relying on auto-compact

  • Compaction blurs decision reasoning, leading to retries.
  • Fix: manually consolidate context before critical decisions. Externalize reasoning to memory or session logs.
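One way to externalize reasoning is an append-only decision log that lives outside the context and survives any compaction. The JSON Lines format, the field names, and the file name below are assumptions for illustration.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict, List


def log_decision(log_path: Path, decision: str, reasoning: str) -> None:
    """Append one decision and the reasoning behind it as a JSON line."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "decision": decision,
        "reasoning": reasoning,
    }
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def load_decisions(log_path: Path) -> List[Dict[str, str]]:
    """Read all logged decisions back, oldest first."""
    lines = log_path.read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "decisions.jsonl"  # hypothetical file name
    log_decision(path, "Use SQLite for the cache", "Single writer, no extra service to run")
    log_decision(path, "Skip the ORM", "Only three queries, raw SQL is easier to audit")
    entries = load_decisions(path)

for entry in entries:
    print(f"{entry['decision']}: {entry['reasoning']}")
```

After a compaction, the log can be re-read on demand, so the "why" is recoverable even if the summary dropped it.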

7.3 Using Memory as a substitute for code

  • Memory is for non-obvious information. Facts derivable from code should be re-checked against the code.
  • Fix: ask "can this be derived from the codebase?" before saving to memory.

8. At a glance

| Pattern | Primary cost | Data scale | Accuracy / traceability | Operational burden |
|---|---|---|---|---|
| auto-compact | LLM compaction call | Within one session | Weak (lossy) | Low (automatic) |
| Memory | Index tokens | A few MB or less | Strong | Medium (needs rules) |
| RAG | Retrieval infra + embeddings | GBs and up | Very strong | High (infra) |

Core principle: pick the intersection of data scale × accuracy requirement × operational capacity. Don't lock into a single pattern.


Series wrap — Part 4/4

The AI Operations Economics series stacked four levers: making the bill predictable (part 1), then actively reducing it through routing (part 2), caching (part 3), and context management (part 4). Applied together, the same task typically lands at 10–30% of the original cost.

Together with Series A (Coding Agents in Practice, five parts), this completes the workflow and cost axes. The next series will likely move on to the next step in measurement and operations: evaluations.


References

  • Anthropic, Auto-compact in Claude Code — code.claude.com/docs/sessions (verified 2026-05-05).
  • Anthropic, Memory and Sessions — code.claude.com/docs/memory (verified 2026-05-05).
  • Lewis et al., 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401.
  • Series parts 1 (token cost), 2 (routing), 3 (caching).

This is part 4/4 — the final entry in the AI Operations Economics series.
