AI Operations Economics (1/4) — Token Cost Structure and Measurement Pitfalls
"Token rate × usage" looks simple, but the actual bill always diverges from that simple formula. Where it diverges is the starting point of operations.
ํต์ฌ ์์ฝ
- Token cost = (input × input rate) + (output × output rate). Everyone gets that far.
- Five real-world pitfalls: cache-write multipliers / thinking tokens / system-prompt leakage / tool-definition cost / multi-turn accumulation
- Primary sources: Anthropic / OpenAI official pricing and caching docs, plus first-person operational measurements
- Making the bill predictable is the precondition of every cost-reduction lever
1. The simple formula vs. reality
The one-line cost model:
Cost = (input tokens × input rate) + (output tokens × output rate)
Most cost calculators stop here. In production, the bill almost always exceeds the prediction. The gap appears in five places.
2. Pitfall 1 — Cache-write multipliers
The most common misunderstanding. "Cache hit means 1/10 cost" is true, but cache writes are more expensive than baseline.
| Operation | Anthropic rate multiplier |
|---|---|
| Cache miss (regular input) | 1.0× |
| Cache write (5-min TTL) | 1.25× |
| Cache write (1-hour TTL) | 2.0× |
| Cache hit | 0.1× |
Break-even math: - 5-min cache: 0.25× extra cost → 0.9× saving on hit. Break-even after 2 hits. - 1-hour cache: 1.0× extra cost → 0.9× saving. Roughly 2 hits to break even, but the 1.0× upfront is large enough that you should validate hit rate before adopting.
Measurement pitfall: many bills lump cache-write tokens into the same line as base input, which makes it look like input usage spiked. In reality the 1.25×–2× multiplier is the spike.
3. Pitfall 2 — Thinking tokens
Reasoning modes (extended thinking) cause the model to emit tokens to itself before answering. Those tokens still bill at the output rate.
- Tokens the user never sees on screen still appear on the bill.
- Harder tasks generate exponentially more thinking tokens.
- Operating without knowing whether thinking mode is enabled means the bill can balloon unexpectedly.
Measurement rule: when the API response separates usage.output_tokens from usage.thinking_tokens, log the latter independently. Decide which task types actually need thinking per category.
4. Pitfall 3 — System-prompt leakage
The same system prompt on every call adds those tokens to every request's input.
- A 5K-token system prompt × 100 calls/day = 500K tokens of baseline per day. Without cache, all at full price.
- The longer the system prompt, the more asymmetric the cost on small tasks. If "classify one sentence" is 1/100 the size of the system prompt, ~99% of the cost belongs to the prompt.
Mitigation: (a) cache the system prompt; (b) split small tasks (classification, summarization) into a lighter system prompt or route to a smaller model.
5. Pitfall 4 — Tool-definition cost
When using MCP / function calling, tool schemas enter the input on every call. More tools, more input bloat.
- 50 tools × ~100 tokens each = 5K-token fixed cost per call.
- The user invokes only one or two, but the model has to see the whole catalog to decide.
- Cache hits help, but a frequently-changing tool catalog breaks the cache.
Mitigation: expose subset catalogs per task type. Disable irrelevant tools per session, or split into separate agents.
6. Pitfall 5 — Multi-turn accumulation
Conversation- and agent-style calls re-send the entire prior message history every turn. The first turn's input is billed 10 times in a 10-turn conversation.
- Without cache hits, an N-turn conversation's input cost grows O(N²).
- "Why is the same task twice as expensive as yesterday?" is almost always the conversation got longer.
- When auto-compaction triggers, all tokens up to the compact point are billed for that day before context resets.
Mitigation: start a new session when a task ends. For long conversations, explicitly truncate context.
7. Measurement — Where to look
Anthropic Console (Usage): daily, weekly, monthly tokens broken down into cache write / cache hit / regular input / output. The single most reliable source.
Claude Code /cost (CLI): current-session tokens and cost. See what a task just cost immediately after finishing.
Custom router logs: when running multiple providers, log a per-call cost record in the router. Aggregate daily by provider × model × task_type.
Empirical signals: - cache_hit_rate < 30% → re-examine cache placement (front of cache block likely changing). - thinking_token_share > 50% → classify whether thinking is really needed. - Average input > 2× average output → suspect system-prompt or tool-catalog leakage.
8. Unit economics — Cost per task
Monthly totals don't help decisions. Decompose by task.
| Task type | Avg input | Avg output | Avg cost (Sonnet) | Note |
|---|---|---|---|---|
| Classification / labeling | 2K | 50 | Very low | Route to Haiku |
| Summarization (1K text) | 5K | 300 | Low | Lower with system-prompt cache |
| Code change (one file) | 15K | 1K | Medium | 30% cut with cache + short output |
| Multi-agent workflow | 50K+ | 5K+ | High | Largest savings opportunity |
This unit reveals where to apply caching, routing, and short-output rules.
9. At a glance
| Pitfall | Signal | First mitigation |
|---|---|---|
| Cache-write multiplier | Input tokens above expectation | Measure hit rate before enabling 1-hour cache |
| Thinking tokens | Output tokens asymmetrically large | Classify which tasks actually need thinking |
| System-prompt leakage | Small tasks with high unit cost | Cache the system prompt + per-task separation |
| Tool-definition cost | Persistent baseline input | Subset catalogs per task |
| Multi-turn accumulation | Same task, N× the cost | Session separation + explicit context cuts |
Once the bill is predictable, the next step is active reduction — model routing, in part 2/4.
Next up
Part 2/4: Model Routing — The Cost / Quality / Latency Triangle. When unit prices differ by 10×, deciding which task goes where is the next-largest lever.
References
- Anthropic, Pricing — claude.com/pricing (verified 2026-05-05).
- Anthropic, Prompt Caching — docs.claude.com/en/docs/build-with-claude/prompt-caching (verified 2026-05-05).
- Anthropic, Extended Thinking — docs.claude.com/en/docs/build-with-claude/extended-thinking (verified 2026-05-05).
- OpenAI, Pricing — platform.openai.com/pricing (verified 2026-05-05).
This is part 1/4 of the AI Operations Economics series.
๋๊ธ
๋๊ธ ์ฐ๊ธฐ