Coding Agents in Practice (5/5) — Cost Management: Tokens, Caching, Routing

AI ์ฝ”๋”ฉ ์—์ด์ „ํŠธ ์‹ค์ „ (5/5) — ๋น„์šฉ ๊ด€๋ฆฌ: ํ† ํฐ·์บ์‹ฑ·๋ผ์šฐํŒ…

The answer to "why is this so expensive?" is almost always which tokens didn't hit cache.


ํ•ต์‹ฌ ์š”์•ฝ

  • Four levers: token unit price / cache hit rate / model routing / context management
  • Primary sources: Anthropic prompt caching docs, OpenAI prompt caching docs, Claude Code pricing
  • Strongest single lever: prompt caching — keeping system prompts and CLAUDE.md cache-hit can drop input cost to ~10%
  • Caveat: Anthropic cache writes are base × 1.25 (5-minute TTL) or × 2 (1-hour TTL). Low hit rate can make caching more expensive

1. Where the cost comes from

LLM token cost = (input tokens × input rate) + (output tokens × output rate), multiplied by cache state.

Token type Rate (Claude Sonnet 4.6 example) Note
Input (cache miss) Base 1.0× Sent fresh each time
Input (cache write, 5-min TTL) Base × 1.25 Cost of creating the cache entry
Input (cache write, 1-hour TTL) Base × 2.0 Costs more to live longer
Input (cache hit) Base × 0.1 ~10% on hit
Output Base × ~5 Typically 5× more than input

Key observations: - Output is more expensive than input → simply asking the model for short answers saves cost. - A cache hit is 1/10 the cost → CLAUDE.md, system prompts, and frequent context must be designed to hit cache. - Cache writes are 1.25× (5-min) or 2× (1-hour) → there's a break-even point. The 1-hour cache costs 2× to write, so it pays back only after at least 4 hits (covered in detail in series C-3).


2. Lever 1 — Prompt Caching

The strongest single lever. Cacheable items: - System prompts - CLAUDE.md / rule files - Frequently used static context (tool definitions, design docs)

Application pattern (Anthropic): - Mark cache_control in API messages. - Cache by block — the same block must be sent in the same order to hit. - 5-minute TTL is default. Claude Code is known to use 1-hour TTL by default.

Common mistakes: - The front of the cached block changes each request → no hit. Move volatile content to the back; keep the static parts up front to cache. - Using 1-hour cache for low-hit-rate work → 2× write cost never recouped. - Cached content is too long — caching cuts cost, it does not split it. Manage context length separately.


3. Lever 2 — Model Routing

You don't always need the most expensive model.

Task type Suitable model Reasoning
Deep reasoning / design / hard debugging Opus 4.7 Reasoning depth pays back via fewer retries
Standard code changes / typical PRs Sonnet 4.6 Best price-quality balance
Simple text / classification / summarization Haiku 4.5 ~1/10 the unit cost; fastest
Background monitoring / polling Haiku or local oMLX Avoids external API calls entirely

Two routing axes: - Quality threshold: "If this fails, do we redo the work?" If yes, pay for the better model. - Latency tolerance: "Does the user need a 5-second answer?" If yes, smaller model + cache.

Routing gets its own treatment in series C-2. The point here is: a single-model strategy is almost always more expensive than necessary.


4. Lever 3 — Context management

Long context costs per token. Four ways to shrink:

  • Externalize memory: rarely-referenced information goes to memory files; only the index (MEMORY.md) sits in context.
  • Summary reports: subagents return summaries instead of full output (series part 4).
  • Auto-compaction: Claude Code auto-compacts when context fills — but compaction can lose the reasoning behind decisions. Manually consolidate before critical decisions.
  • Session separation: don't pile too much into one session; start a new one when a task ends.

5. Lever 4 — Output length control

Output is ~5× more expensive than input. Ways to shorten output:

  • Explicit limits: "report under 200 words," "code only, no explanation" in system / skill prompts.
  • Structured output: JSON schemas, tables, checklists naturally compress.
  • Pick one: ask for the single best option, not three options — request comparison only when truly needed.
  • Avoid restatement: block the "summary + conclusion" double-restate pattern via the system prompt.

6. Monitoring — Without measurement, nothing shrinks

Step one of cost management is measurement.

Minimum monitoring: - Daily token usage (input / output / cache hit rate). - Cost distribution per task type (code change vs. review vs. docs). - Per-model usage share.

Tools: - Anthropic Console Usage page shows daily, weekly, monthly token breakdowns. - Claude Code's /cost slash command displays per-session cost (CLI versions). - A custom routing layer should log per-call cost.

Empirical signal: if cache hit rate < 30%, revisit cache placement (likely the front of the cache block is changing each request).


7. At a glance

Lever Magnitude Application difficulty First step
Prompt Caching Very large (1/10) Medium (block re-layout) Cache CLAUDE.md / system prompt
Model Routing Large (10× per-task swing) Medium (task-classification rules) Ask "does this task need Opus?"
Context Management Medium (grows with task length) Low Summary-reports from subagents
Output Length Medium (output costs 5× input) Low Add length limits to system prompts

Recommended starting order: 1. Apply caching to system prompts + CLAUDE.md (often 50%+ savings immediately) 2. Route Haiku for simple tasks 3. Add output-length limits 4. Start daily cache-hit monitoring


Series wrap — Part 5/5

The Coding Agents in Practice series binds five axes — workflow (1), tool comparison (2), MCP (3), multi-agent (4), cost (5). Each post supports the others: workflow has to be in place before multi-agent helps, and multi-agent has to land before cost routing pays off.

The next campaign (Series C — AI Operations Economics, 4 parts) deepens the cost discussion. Part 1 is the measurement traps in token-cost structure.


References

  • Anthropic, Prompt Caching — docs.claude.com/en/docs/build-with-claude/prompt-caching (verified 2026-05-05).
  • Anthropic, Pricing — claude.com/pricing (verified 2026-05-05).
  • OpenAI, Prompt Caching — platform.openai.com/docs/guides/prompt-caching (verified 2026-05-05).
  • Series parts 1 (workflow) and 4 (multi-agent).

This is part 5/5 — the final entry in the Coding Agents in Practice series.

๋Œ“๊ธ€

์ด ๋ธ”๋กœ๊ทธ์˜ ์ธ๊ธฐ ๊ฒŒ์‹œ๋ฌผ

Agent Memory Engine (2/10) — Building an AI Agent Memory System with SQLite Alone

"ML Foundations (9/9) — PyTorch vs TensorFlow, and the Road to Local LLMs"

"RAG Core Study (14/26) — Evaluation Sets with RAGAS & DeepEval"

"ML Foundations (8/9) — Deep Learning Architectures: CNN, RNN, Attention"

"ML Foundations (7/9) — Deep Learning Training: Optimizers, Regularization, Initialization"

OpenClaw to Hermes Migration (2/13) — What to Preserve, Partially Port, or Discard

AI Agents I Built (5/7) — Building an Automated Blogger API Publishing System