Coding Agents in Practice (5/5) — Cost Management: Tokens, Caching, Routing
The answer to "why is this so expensive?" is almost always which tokens didn't hit cache.
ํต์ฌ ์์ฝ
- Four levers: token unit price / cache hit rate / model routing / context management
- Primary sources: Anthropic prompt caching docs, OpenAI prompt caching docs, Claude Code pricing
- Strongest single lever: prompt caching — keeping system prompts and CLAUDE.md cache-hit can drop input cost to ~10%
- Caveat: Anthropic cache writes are base × 1.25 (5-minute TTL) or × 2 (1-hour TTL). Low hit rate can make caching more expensive
1. Where the cost comes from
LLM token cost = (input tokens × input rate) + (output tokens × output rate), multiplied by cache state.
| Token type | Rate (Claude Sonnet 4.6 example) | Note |
|---|---|---|
| Input (cache miss) | Base 1.0× | Sent fresh each time |
| Input (cache write, 5-min TTL) | Base × 1.25 | Cost of creating the cache entry |
| Input (cache write, 1-hour TTL) | Base × 2.0 | Costs more to live longer |
| Input (cache hit) | Base × 0.1 | ~10% on hit |
| Output | Base × ~5 | Typically 5× more than input |
Key observations: - Output is more expensive than input → simply asking the model for short answers saves cost. - A cache hit is 1/10 the cost → CLAUDE.md, system prompts, and frequent context must be designed to hit cache. - Cache writes are 1.25× (5-min) or 2× (1-hour) → there's a break-even point. The 1-hour cache costs 2× to write, so it pays back only after at least 4 hits (covered in detail in series C-3).
2. Lever 1 — Prompt Caching
The strongest single lever. Cacheable items: - System prompts - CLAUDE.md / rule files - Frequently used static context (tool definitions, design docs)
Application pattern (Anthropic):
- Mark cache_control in API messages.
- Cache by block — the same block must be sent in the same order to hit.
- 5-minute TTL is default. Claude Code is known to use 1-hour TTL by default.
Common mistakes: - The front of the cached block changes each request → no hit. Move volatile content to the back; keep the static parts up front to cache. - Using 1-hour cache for low-hit-rate work → 2× write cost never recouped. - Cached content is too long — caching cuts cost, it does not split it. Manage context length separately.
3. Lever 2 — Model Routing
You don't always need the most expensive model.
| Task type | Suitable model | Reasoning |
|---|---|---|
| Deep reasoning / design / hard debugging | Opus 4.7 | Reasoning depth pays back via fewer retries |
| Standard code changes / typical PRs | Sonnet 4.6 | Best price-quality balance |
| Simple text / classification / summarization | Haiku 4.5 | ~1/10 the unit cost; fastest |
| Background monitoring / polling | Haiku or local oMLX | Avoids external API calls entirely |
Two routing axes: - Quality threshold: "If this fails, do we redo the work?" If yes, pay for the better model. - Latency tolerance: "Does the user need a 5-second answer?" If yes, smaller model + cache.
Routing gets its own treatment in series C-2. The point here is: a single-model strategy is almost always more expensive than necessary.
4. Lever 3 — Context management
Long context costs per token. Four ways to shrink:
- Externalize memory: rarely-referenced information goes to memory files; only the index (MEMORY.md) sits in context.
- Summary reports: subagents return summaries instead of full output (series part 4).
- Auto-compaction: Claude Code auto-compacts when context fills — but compaction can lose the reasoning behind decisions. Manually consolidate before critical decisions.
- Session separation: don't pile too much into one session; start a new one when a task ends.
5. Lever 4 — Output length control
Output is ~5× more expensive than input. Ways to shorten output:
- Explicit limits: "report under 200 words," "code only, no explanation" in system / skill prompts.
- Structured output: JSON schemas, tables, checklists naturally compress.
- Pick one: ask for the single best option, not three options — request comparison only when truly needed.
- Avoid restatement: block the "summary + conclusion" double-restate pattern via the system prompt.
6. Monitoring — Without measurement, nothing shrinks
Step one of cost management is measurement.
Minimum monitoring: - Daily token usage (input / output / cache hit rate). - Cost distribution per task type (code change vs. review vs. docs). - Per-model usage share.
Tools:
- Anthropic Console Usage page shows daily, weekly, monthly token breakdowns.
- Claude Code's /cost slash command displays per-session cost (CLI versions).
- A custom routing layer should log per-call cost.
Empirical signal: if cache hit rate < 30%, revisit cache placement (likely the front of the cache block is changing each request).
7. At a glance
| Lever | Magnitude | Application difficulty | First step |
|---|---|---|---|
| Prompt Caching | Very large (1/10) | Medium (block re-layout) | Cache CLAUDE.md / system prompt |
| Model Routing | Large (10× per-task swing) | Medium (task-classification rules) | Ask "does this task need Opus?" |
| Context Management | Medium (grows with task length) | Low | Summary-reports from subagents |
| Output Length | Medium (output costs 5× input) | Low | Add length limits to system prompts |
Recommended starting order: 1. Apply caching to system prompts + CLAUDE.md (often 50%+ savings immediately) 2. Route Haiku for simple tasks 3. Add output-length limits 4. Start daily cache-hit monitoring
Series wrap — Part 5/5
The Coding Agents in Practice series binds five axes — workflow (1), tool comparison (2), MCP (3), multi-agent (4), cost (5). Each post supports the others: workflow has to be in place before multi-agent helps, and multi-agent has to land before cost routing pays off.
The next campaign (Series C — AI Operations Economics, 4 parts) deepens the cost discussion. Part 1 is the measurement traps in token-cost structure.
References
- Anthropic, Prompt Caching — docs.claude.com/en/docs/build-with-claude/prompt-caching (verified 2026-05-05).
- Anthropic, Pricing — claude.com/pricing (verified 2026-05-05).
- OpenAI, Prompt Caching — platform.openai.com/docs/guides/prompt-caching (verified 2026-05-05).
- Series parts 1 (workflow) and 4 (multi-agent).
This is part 5/5 — the final entry in the Coding Agents in Practice series.
๋๊ธ
๋๊ธ ์ฐ๊ธฐ