Coding Agents in Practice (5/5) — Cost Management: Tokens, Caching, Routing

5월 05, 2026

The answer to "why is this so expensive?" is almost always which tokens didn't hit cache.

핵심 요약

Four levers: token unit price / cache hit rate / model routing / context management
Primary sources: Anthropic prompt caching docs, OpenAI prompt caching docs, Claude Code pricing
Strongest single lever: prompt caching — keeping system prompts and CLAUDE.md cache-hit can drop input cost to ~10%
Caveat: Anthropic cache writes are base × 1.25 (5-minute TTL) or × 2 (1-hour TTL). Low hit rate can make caching more expensive

1. Where the cost comes from

LLM token cost = (input tokens × input rate) + (output tokens × output rate), multiplied by cache state.

Token type	Rate (Claude Sonnet 4.6 example)	Note
Input (cache miss)	Base 1.0×	Sent fresh each time
Input (cache write, 5-min TTL)	Base × 1.25	Cost of creating the cache entry
Input (cache write, 1-hour TTL)	Base × 2.0	Costs more to live longer
Input (cache hit)	Base × 0.1	~10% on hit
Output	Base × ~5	Typically 5× more than input

Key observations: - Output is more expensive than input → simply asking the model for short answers saves cost. - A cache hit is 1/10 the cost → CLAUDE.md, system prompts, and frequent context must be designed to hit cache. - Cache writes are 1.25× (5-min) or 2× (1-hour) → there's a break-even point. The 1-hour cache costs 2× to write, so it pays back only after at least 4 hits (covered in detail in series C-3).

2. Lever 1 — Prompt Caching

The strongest single lever. Cacheable items: - System prompts - CLAUDE.md / rule files - Frequently used static context (tool definitions, design docs)

Application pattern (Anthropic): - Mark cache_control in API messages. - Cache by block — the same block must be sent in the same order to hit. - 5-minute TTL is default. Claude Code is known to use 1-hour TTL by default.

Common mistakes: - The front of the cached block changes each request → no hit. Move volatile content to the back; keep the static parts up front to cache. - Using 1-hour cache for low-hit-rate work → 2× write cost never recouped. - Cached content is too long — caching cuts cost, it does not split it. Manage context length separately.

3. Lever 2 — Model Routing

You don't always need the most expensive model.

Task type	Suitable model	Reasoning
Deep reasoning / design / hard debugging	Opus 4.7	Reasoning depth pays back via fewer retries
Standard code changes / typical PRs	Sonnet 4.6	Best price-quality balance
Simple text / classification / summarization	Haiku 4.5	~1/10 the unit cost; fastest
Background monitoring / polling	Haiku or local oMLX	Avoids external API calls entirely

Two routing axes: - Quality threshold: "If this fails, do we redo the work?" If yes, pay for the better model. - Latency tolerance: "Does the user need a 5-second answer?" If yes, smaller model + cache.

Routing gets its own treatment in series C-2. The point here is: a single-model strategy is almost always more expensive than necessary.

4. Lever 3 — Context management

Long context costs per token. Four ways to shrink:

Externalize memory: rarely-referenced information goes to memory files; only the index (MEMORY.md) sits in context.
Summary reports: subagents return summaries instead of full output (series part 4).
Auto-compaction: Claude Code auto-compacts when context fills — but compaction can lose the reasoning behind decisions. Manually consolidate before critical decisions.
Session separation: don't pile too much into one session; start a new one when a task ends.

5. Lever 4 — Output length control

Output is ~5× more expensive than input. Ways to shorten output:

Explicit limits: "report under 200 words," "code only, no explanation" in system / skill prompts.
Structured output: JSON schemas, tables, checklists naturally compress.
Pick one: ask for the single best option, not three options — request comparison only when truly needed.
Avoid restatement: block the "summary + conclusion" double-restate pattern via the system prompt.

6. Monitoring — Without measurement, nothing shrinks

Step one of cost management is measurement.

Minimum monitoring: - Daily token usage (input / output / cache hit rate). - Cost distribution per task type (code change vs. review vs. docs). - Per-model usage share.

Tools: - Anthropic Console Usage page shows daily, weekly, monthly token breakdowns. - Claude Code's /cost slash command displays per-session cost (CLI versions). - A custom routing layer should log per-call cost.

Empirical signal: if cache hit rate < 30%, revisit cache placement (likely the front of the cache block is changing each request).

7. At a glance

Lever	Magnitude	Application difficulty	First step
Prompt Caching	Very large (1/10)	Medium (block re-layout)	Cache CLAUDE.md / system prompt
Model Routing	Large (10× per-task swing)	Medium (task-classification rules)	Ask "does this task need Opus?"
Context Management	Medium (grows with task length)	Low	Summary-reports from subagents
Output Length	Medium (output costs 5× input)	Low	Add length limits to system prompts

Recommended starting order: 1. Apply caching to system prompts + CLAUDE.md (often 50%+ savings immediately) 2. Route Haiku for simple tasks 3. Add output-length limits 4. Start daily cache-hit monitoring

Series wrap — Part 5/5

The Coding Agents in Practice series binds five axes — workflow (1), tool comparison (2), MCP (3), multi-agent (4), cost (5). Each post supports the others: workflow has to be in place before multi-agent helps, and multi-agent has to land before cost routing pays off.

The next campaign (Series C — AI Operations Economics, 4 parts) deepens the cost discussion. Part 1 is the measurement traps in token-cost structure.

References

Anthropic, Prompt Caching — docs.claude.com/en/docs/build-with-claude/prompt-caching (verified 2026-05-05).
Anthropic, Pricing — claude.com/pricing (verified 2026-05-05).
OpenAI, Prompt Caching — platform.openai.com/docs/guides/prompt-caching (verified 2026-05-05).
Series parts 1 (workflow) and 4 (multi-agent).

This is part 5/5 — the final entry in the Coding Agents in Practice series.

이 블로그 검색

MaJu Tech Notes