AI Operations Economics (1/4) — Token Cost Structure and Measurement Pitfalls

May 05, 2026

"Token rate × usage" looks simple, but the actual bill rarely matches that formula. Finding where it diverges is the starting point of cost operations.

Key takeaways

Token cost = (input × input rate) + (output × output rate). Everyone gets that far.
Five real-world pitfalls: cache-write multipliers / thinking tokens / system-prompt leakage / tool-definition cost / multi-turn accumulation
Primary sources: Anthropic / OpenAI official pricing and caching docs, plus first-person operational measurements
Making the bill predictable is the precondition of every cost-reduction lever

1. The simple formula vs. reality

The one-line cost model:

Cost = (input tokens × input rate) + (output tokens × output rate)

Most cost calculators stop here. In production, the bill almost always exceeds the prediction. The gap appears in five places.
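As a baseline for the sections that follow, the one-line model can be written as a small function. This is a minimal sketch: the function name is illustrative, and the rates in the example are placeholders in dollars per million tokens, not a real price list.

```python
def simple_cost(input_tokens: int, output_tokens: int,
                input_rate_per_mtok: float, output_rate_per_mtok: float) -> float:
    """Cost in dollars under the one-line model.

    Rates are expressed in dollars per million tokens.
    """
    return (input_tokens * input_rate_per_mtok
            + output_tokens * output_rate_per_mtok) / 1_000_000


# Placeholder rates for illustration only; substitute your provider's current prices.
estimate = simple_cost(15_000, 1_000, input_rate_per_mtok=3.0, output_rate_per_mtok=15.0)
print(f"predicted cost: ${estimate:.4f}")
```

Comparing this prediction against the billed amount for the same calls gives the size of the gap that the five pitfalls have to explain.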

2. Pitfall 1 — Cache-write multipliers

The most common misunderstanding. "Cache hit means 1/10 cost" is true, but cache writes are more expensive than baseline.

Operation	Anthropic rate multiplier
Cache miss (regular input)	1.0×
Cache write (5-min TTL)	1.25×
Cache write (1-hour TTL)	2.0×
Cache hit	0.1×

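Break-even math (in units of the base input price of the cached prefix):
- 5-min cache: 0.25× extra cost on the write, 0.9× saving on each hit. The first hit already pays it back (1.25× + 0.1× = 1.35× with cache versus 2.0× without).
- 1-hour cache: 1.0× extra cost on the write, 0.9× saving on each hit. One hit is not enough (2.1× versus 2.0×), so it takes 2 hits to break even, and the 1.0× upfront is large enough that you should validate hit rate before adopting.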
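The break-even point can be checked directly from the multipliers in the table. The sketch below compares the cumulative cost of one cache write plus N hits against sending the same prefix uncached N + 1 times; the function name and the `max_hits` guard are illustrative choices.

```python
def hits_to_break_even(write_multiplier: float,
                       hit_multiplier: float = 0.1,
                       max_hits: int = 1000) -> int:
    """Smallest number of cache hits at which caching is strictly cheaper.

    Costs are in units of the base input price of the cached prefix.
    """
    for hits in range(1, max_hits + 1):
        cached = write_multiplier + hits * hit_multiplier  # one write, then hits
        uncached = 1.0 + hits                              # full price every time
        if cached < uncached:
            return hits
    raise ValueError("caching never pays off within max_hits")


print(hits_to_break_even(1.25))  # 5-minute TTL write multiplier
print(hits_to_break_even(2.0))   # 1-hour TTL write multiplier
```

If a cached prefix is typically reused fewer times than this within its TTL, the write multiplier is a net cost rather than a saving.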

Measurement pitfall: many bills lump cache-write tokens into the same line as base input, which makes it look like input usage spiked. In reality the 1.25×–2× multiplier is the spike.

3. Pitfall 2 — Thinking tokens

Reasoning modes (extended thinking) cause the model to emit tokens to itself before answering. Those tokens still bill at the output rate.

Tokens the user never sees on screen still appear on the bill.
Harder tasks tend to generate far more thinking tokens, and the amount is hard to predict in advance.
Operating without knowing whether thinking mode is enabled means the bill can balloon unexpectedly.

Measurement rule: when the API response reports thinking tokens separately from usage.output_tokens (for example as usage.thinking_tokens; the exact field name varies by provider), log them independently. Then decide, per task category, which task types actually need thinking.
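A minimal sketch of that logging rule follows. It assumes the usage record is a plain dict with an `output_tokens` count and an optional, disjoint `thinking_tokens` count; both the dict shape and the disjointness are assumptions for illustration, so check how your provider actually reports these fields before relying on the share.

```python
def split_output_usage(usage: dict) -> dict:
    """Separate visible output tokens from thinking tokens for logging.

    Assumes `output_tokens` and `thinking_tokens` are disjoint counts (hypothetical shape).
    """
    visible = usage.get("output_tokens", 0)
    thinking = usage.get("thinking_tokens")  # None when the provider does not report it
    if thinking is None:
        return {"visible_output": visible, "thinking": None, "thinking_share": None}
    total = visible + thinking
    share = thinking / total if total else 0.0
    return {"visible_output": visible, "thinking": thinking, "thinking_share": share}


record = split_output_usage({"output_tokens": 300, "thinking_tokens": 900})
print(record)
```

Logging the share per task category makes it possible to see which categories spend most of their output budget on reasoning the user never reads.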

4. Pitfall 3 — System-prompt leakage

The same system prompt on every call adds those tokens to every request's input.

A 5K-token system prompt × 100 calls/day = 500K tokens of baseline per day. Without cache, all at full price.
The longer the system prompt, the more asymmetric the cost on small tasks. If the "classify one sentence" input is 1/100 the size of the system prompt, ~99% of the input cost belongs to the prompt.
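The arithmetic behind both figures is short enough to check in code. This sketch uses illustrative function names and counts input tokens only, so the share it reports is a share of input cost, not of the whole bill.

```python
def daily_baseline_tokens(system_prompt_tokens: int, calls_per_day: int) -> int:
    """Input tokens spent on the system prompt alone, per day, without caching."""
    return system_prompt_tokens * calls_per_day


def system_prompt_share(system_prompt_tokens: int, task_tokens: int) -> float:
    """Fraction of a single request's input tokens taken up by the system prompt."""
    return system_prompt_tokens / (system_prompt_tokens + task_tokens)


print(daily_baseline_tokens(5_000, 100))        # 5K-token prompt, 100 calls/day
print(f"{system_prompt_share(5_000, 50):.1%}")  # task input is 1/100 of the prompt
```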

Mitigation: (a) cache the system prompt; (b) split small tasks (classification, summarization) into a lighter system prompt or route to a smaller model.

5. Pitfall 4 — Tool-definition cost

When using MCP / function calling, tool schemas enter the input on every call. More tools, more input bloat.

50 tools × ~100 tokens each = 5K-token fixed cost per call.
The user invokes only one or two, but the model has to see the whole catalog to decide.
Cache hits help, but a frequently-changing tool catalog breaks the cache.

Mitigation: expose subset catalogs per task type. Disable irrelevant tools per session, or split into separate agents.
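One way to picture subset catalogs is a lookup from task type to the tools that task can actually use. Everything in this sketch is hypothetical: the tool names, the 100-token schema size taken from the example above, and the task-to-tool mapping.

```python
# Hypothetical catalog: 50 tools at roughly 100 schema tokens each.
TOOL_CATALOG = {f"tool_{i:02d}": 100 for i in range(50)}

# Hypothetical mapping from task type to the tools that task needs.
TASK_TOOLS = {
    "classification": [],
    "code_change": ["tool_00", "tool_01", "tool_02"],
}


def tools_for_task(task_type: str) -> list[str]:
    """Return the subset to expose; unknown task types fall back to the full catalog."""
    return TASK_TOOLS.get(task_type, list(TOOL_CATALOG))


def schema_tokens(tool_names: list[str]) -> int:
    """Approximate fixed input cost, in tokens, of exposing these tool schemas."""
    return sum(TOOL_CATALOG[name] for name in tool_names)


print(schema_tokens(list(TOOL_CATALOG)))             # whole catalog on every call
print(schema_tokens(tools_for_task("code_change")))  # subset for one task type
```

Keeping each subset stable over time also helps caching, since a catalog that does not change keeps the cached prefix intact.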

6. Pitfall 5 — Multi-turn accumulation

Conversation- and agent-style calls re-send the entire prior message history every turn. The first turn's input is billed 10 times in a 10-turn conversation.

Without cache hits, an N-turn conversation's input cost grows O(N²).
"Why is the same task twice as expensive as yesterday?" almost always has the same answer: the conversation got longer.
Auto-compaction does not undo this. By the time it triggers and the context resets, every token sent up to the compact point has already been billed.
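The quadratic growth is easy to see with a simplified model in which every turn adds the same number of tokens and each request re-sends the full history. The fixed per-turn size is an assumption made for illustration; real conversations vary turn by turn.

```python
def cumulative_input_tokens(tokens_per_turn: int, turns: int) -> int:
    """Total input tokens billed across a conversation with no cache hits.

    Turn t re-sends all t turns so far, so the total is
    tokens_per_turn * (1 + 2 + ... + turns).
    """
    return sum(tokens_per_turn * t for t in range(1, turns + 1))


for n in (5, 10, 20):
    print(n, cumulative_input_tokens(1_000, n))
```

Doubling the conversation length from 10 to 20 turns roughly quadruples the cumulative input in this model, which is why ending a session when the task ends is such an effective lever.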

Mitigation: start a new session when a task ends. For long conversations, explicitly truncate context.

7. Measurement — Where to look

Anthropic Console (Usage): daily, weekly, monthly tokens broken down into cache write / cache hit / regular input / output. The single most reliable source.

Claude Code /cost (CLI): current-session tokens and cost. See what a task just cost immediately after finishing.

Custom router logs: when running multiple providers, log a per-call cost record in the router. Aggregate daily by provider × model × task_type.
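A per-call cost record and its daily aggregation might look like the sketch below. The record fields and the provider and model names are hypothetical; the only part taken from the text is the grouping key of provider × model × task_type per day.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CallRecord:
    """One router log entry (hypothetical shape)."""
    day: date
    provider: str
    model: str
    task_type: str
    cost_usd: float


def aggregate_daily(records: list[CallRecord]) -> dict:
    """Sum cost per (day, provider, model, task_type)."""
    totals = defaultdict(float)
    for r in records:
        totals[(r.day, r.provider, r.model, r.task_type)] += r.cost_usd
    return dict(totals)


logs = [
    CallRecord(date(2026, 5, 5), "provider_a", "model_small", "classification", 0.002),
    CallRecord(date(2026, 5, 5), "provider_a", "model_small", "classification", 0.003),
    CallRecord(date(2026, 5, 5), "provider_b", "model_large", "code_change", 0.060),
]
for key, total in aggregate_daily(logs).items():
    print(key, round(total, 4))
```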

Empirical signals:
- cache_hit_rate < 30% → re-examine cache placement (front of cache block likely changing).
- thinking_token_share > 50% → classify whether thinking is really needed.
- Average input > 2× average output → suspect system-prompt or tool-catalog leakage.
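These thresholds can be turned into an automated check over aggregated metrics. The sketch below uses the three thresholds exactly as stated; the function name, parameter names, and message wording are illustrative.

```python
def cost_signals(cache_hit_rate: float, thinking_token_share: float,
                 avg_input_tokens: float, avg_output_tokens: float) -> list[str]:
    """Return the warnings triggered by the empirical thresholds."""
    warnings = []
    if cache_hit_rate < 0.30:
        warnings.append("low cache hit rate: re-examine cache placement")
    if thinking_token_share > 0.50:
        warnings.append("high thinking share: check which tasks need thinking")
    if avg_input_tokens > 2 * avg_output_tokens:
        warnings.append("input-heavy: suspect system-prompt or tool-catalog leakage")
    return warnings


for message in cost_signals(0.20, 0.60, 10_000, 1_000):
    print(message)
```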

8. Unit economics — Cost per task

Monthly totals don't help decisions. Decompose by task.

Task type	Avg input	Avg output	Avg cost (Sonnet)	Note
Classification / labeling	2K	50	Very low	Route to Haiku
Summarization (1K text)	5K	300	Low	Lower with system-prompt cache
Code change (one file)	15K	1K	Medium	30% cut with cache + short output
Multi-agent workflow	50K+	5K+	High	Largest savings opportunity

This per-task view reveals where to apply caching, routing, and short-output rules.
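To replace the qualitative cost column with numbers, the average token counts from the table can be priced per task and ranked. The rates below are placeholders in dollars per million tokens, and the "50K+" and "5K+" entries are taken at their lower bounds, so the multi-agent figure is a floor rather than an estimate.

```python
# (avg input tokens, avg output tokens) per task type, from the table above.
TASKS = {
    "classification": (2_000, 50),
    "summarization": (5_000, 300),
    "code_change": (15_000, 1_000),
    "multi_agent": (50_000, 5_000),  # lower bounds of "50K+" / "5K+"
}

# Placeholder rates in dollars per million tokens; not a real price list.
INPUT_RATE = 3.0
OUTPUT_RATE = 15.0


def unit_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one task at the placeholder rates."""
    return (input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE) / 1_000_000


ranked = sorted(TASKS, key=lambda name: unit_cost(*TASKS[name]), reverse=True)
for name in ranked:
    print(f"{name:15s} ${unit_cost(*TASKS[name]):.5f}")
```

Multiplying each unit cost by that task's daily call volume then shows which task type dominates the monthly total.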

9. At a glance

Pitfall	Signal	First mitigation
Cache-write multiplier	Input tokens above expectation	Measure hit rate before enabling 1-hour cache
Thinking tokens	Output tokens asymmetrically large	Classify which tasks actually need thinking
System-prompt leakage	Small tasks with high unit cost	Cache the system prompt + per-task separation
Tool-definition cost	Persistent baseline input	Subset catalogs per task
Multi-turn accumulation	Same task, N× the cost	Session separation + explicit context cuts

Once the bill is predictable, the next step is active reduction — model routing, in part 2/4.

Next up

Part 2/4: Model Routing — The Cost / Quality / Latency Triangle. When unit prices differ by 10×, deciding which task goes where is the next-largest lever.

References

Anthropic, Pricing — claude.com/pricing (verified 2026-05-05).
Anthropic, Prompt Caching — docs.claude.com/en/docs/build-with-claude/prompt-caching (verified 2026-05-05).
Anthropic, Extended Thinking — docs.claude.com/en/docs/build-with-claude/extended-thinking (verified 2026-05-05).
OpenAI, Pricing — platform.openai.com/pricing (verified 2026-05-05).

This is part 1/4 of the AI Operations Economics series.

MaJu Tech Notes