AI Operations Economics (1/4) — Token Cost Structure and Measurement Pitfalls


"Token rate × usage" looks simple, but the actual bill almost always diverges from that formula. Finding where it diverges is the starting point for operating these systems.


Key takeaways

  • Token cost = (input × input rate) + (output × output rate). Everyone gets that far.
  • Five real-world pitfalls: cache-write multipliers / thinking tokens / system-prompt leakage / tool-definition cost / multi-turn accumulation
  • Primary sources: Anthropic / OpenAI official pricing and caching docs, plus first-person operational measurements
  • Making the bill predictable is the precondition of every cost-reduction lever

1. The simple formula vs. reality

The one-line cost model:

Cost = (input tokens × input rate) + (output tokens × output rate)

Most cost calculators stop here. In production, the bill almost always exceeds the prediction. The gap appears in five places.
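
As a minimal sketch, the one-line model can be written as a function. The function name and the per-million-token rate parameters are illustrative assumptions, and the rates in the usage line are placeholders rather than quoted prices.

```python
def simple_cost(input_tokens: int, output_tokens: int,
                input_rate_per_mtok: float, output_rate_per_mtok: float) -> float:
    """Naive estimate: (input x input rate) + (output x output rate)."""
    return (input_tokens * input_rate_per_mtok
            + output_tokens * output_rate_per_mtok) / 1_000_000


# Placeholder rates (assumption): 3.0 per million input tokens,
# 15.0 per million output tokens.
estimate = simple_cost(1_000_000, 100_000, 3.0, 15.0)
print(f"{estimate:.2f}")
```

This is the baseline that the pitfalls are measured against: it knows nothing about cache writes, thinking tokens, or context that is re-sent on every call.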


2. Pitfall 1 — Cache-write multipliers

This is the most common misunderstanding. "A cache hit costs 1/10 of the base rate" is true, but cache writes cost more than the base input rate.

Operation                    Anthropic rate multiplier
Cache miss (regular input)   1.0×
Cache write (5-min TTL)      1.25×
Cache write (1-hour TTL)     2.0×
Cache hit                    0.1×

Break-even math, in units of one uncached pass over the cached prefix:

  • 5-min cache: 0.25× extra cost on the write, 0.9× saved on each hit. The first hit already covers the premium.
  • 1-hour cache: 1.0× extra cost on the write, 0.9× saved on each hit. It takes 2 hits to break even, and the 1.0× upfront premium is large enough that you should validate the hit rate before adopting it.
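
The break-even arithmetic can be checked with a short sketch that uses the multipliers from the table. Costs are expressed relative to one uncached pass over the same prefix; the function names are illustrative assumptions. With these multipliers a single hit already covers the 5-minute write premium (0.25 / 0.9 < 1), while the 1-hour write needs two hits (1.0 / 0.9 > 1).

```python
import math

BASE = 1.0                        # regular (uncached) input
HIT = 0.1                         # cache hit
WRITE = {"5m": 1.25, "1h": 2.0}   # cache write, by TTL


def relative_cost(ttl: str, hits: int, cached: bool = True) -> float:
    """Cost of one initial call plus `hits` repeat calls over the same prefix."""
    if not cached:
        return BASE * (1 + hits)
    return WRITE[ttl] + HIT * hits


def break_even_hits(ttl: str) -> int:
    """Smallest number of hits at which caching is no more expensive than not caching."""
    premium = WRITE[ttl] - BASE      # extra paid on the write
    saving_per_hit = BASE - HIT      # saved on every later hit
    return math.ceil(premium / saving_per_hit)


for ttl in WRITE:
    print(ttl, break_even_hits(ttl))
```

Feeding an observed hit count into relative_cost for both the cached and uncached case is a quick way to validate a 1-hour cache before enabling it.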

Measurement pitfall: many bills lump cache-write tokens into the same line as base input, which makes it look like input usage spiked. In reality the 1.25×–2× multiplier is the spike.


3. Pitfall 2 — Thinking tokens

Reasoning modes (extended thinking) cause the model to emit tokens to itself before answering. Those tokens still bill at the output rate.

  • Tokens the user never sees on screen still appear on the bill.
  • Harder tasks tend to generate far more thinking tokens than easy ones.
  • Operating without knowing whether thinking mode is enabled means the bill can balloon unexpectedly.

Measurement rule: when the API response separates usage.output_tokens from usage.thinking_tokens, log the latter as its own field. Then decide, per task category, whether thinking is actually needed.
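
A minimal sketch of that logging rule follows. The UsageRecord type and its fields are hypothetical, not a provider API. The sketch also assumes that output_tokens is the total billed output and that thinking_tokens is the portion of it spent on thinking; adjust the ratio if your provider reports the two as disjoint counts.

```python
from dataclasses import dataclass


@dataclass
class UsageRecord:
    # Hypothetical log record, one per API call.
    task_type: str
    output_tokens: int     # assumed: total billed output, thinking included
    thinking_tokens: int   # assumed: thinking portion of output_tokens


def thinking_share_by_task(records: list[UsageRecord]) -> dict[str, float]:
    """Fraction of billed output spent on thinking, per task type."""
    totals: dict[str, tuple[int, int]] = {}
    for r in records:
        out, think = totals.get(r.task_type, (0, 0))
        totals[r.task_type] = (out + r.output_tokens, think + r.thinking_tokens)
    return {
        task: (think / out if out else 0.0)
        for task, (out, think) in totals.items()
    }


log = [
    UsageRecord("classify", 100, 0),
    UsageRecord("refactor", 1000, 600),
    UsageRecord("refactor", 1000, 800),
]
print(thinking_share_by_task(log))
```

Task types whose share stays high without a matching quality gain are the first candidates for turning thinking off.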


4. Pitfall 3 — System-prompt leakage

The same system prompt on every call adds those tokens to every request's input.

  • A 5K-token system prompt × 100 calls/day = 500K tokens of baseline per day. Without cache, all at full price.
  • The longer the system prompt, the more lopsided the cost of small tasks. If the "classify one sentence" input is 1/100 the size of the system prompt, ~99% of the input cost belongs to the prompt.

Mitigation: (a) cache the system prompt; (b) split small tasks (classification, summarization) into a lighter system prompt or route to a smaller model.
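
The two figures above (500K baseline tokens per day, ~99% of input going to the prompt) can be reproduced with a small sketch. The function names are illustrative assumptions; the numbers are the ones used in this section.

```python
def daily_baseline_tokens(system_tokens: int, calls_per_day: int) -> int:
    """Input tokens spent on the system prompt alone, per day, without caching."""
    return system_tokens * calls_per_day


def system_prompt_share(system_tokens: int, task_tokens: int) -> float:
    """Fraction of a request's input that is system prompt rather than task content."""
    return system_tokens / (system_tokens + task_tokens)


print(daily_baseline_tokens(5_000, 100))          # 500000
print(f"{system_prompt_share(5_000, 50):.1%}")    # 99.0%
```

Tracking the share per task type shows which small tasks would gain the most from a lighter prompt or a smaller model.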


5. Pitfall 4 — Tool-definition cost

When using MCP / function calling, tool schemas enter the input on every call. More tools, more input bloat.

  • 50 tools × ~100 tokens each = 5K-token fixed cost per call.
  • The user invokes only one or two, but the model has to see the whole catalog to decide.
  • Cache hits help, but a frequently-changing tool catalog breaks the cache.

Mitigation: expose subset catalogs per task type. Disable irrelevant tools per session, or split into separate agents.
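
A subset catalog can be sketched as a lookup from task type to tool names. Everything here is hypothetical: the tool names, the flat 100-token estimate per schema, the task types, and the choice to fall back to the full catalog for unknown task types.

```python
# Hypothetical catalog: 50 tools, each schema estimated at ~100 tokens.
TOOL_TOKENS = {f"tool_{i:02d}": 100 for i in range(50)}

# Hypothetical mapping from task type to the tools that task actually needs.
TASK_TOOLS = {
    "code_review": ["tool_00", "tool_01", "tool_02"],
    "search": ["tool_10", "tool_11"],
}


def catalog_for(task_type: str) -> list[str]:
    """Tools to expose for a task type; unknown types get the full catalog."""
    return TASK_TOOLS.get(task_type, list(TOOL_TOKENS))


def catalog_tokens(tools: list[str]) -> int:
    """Fixed input cost, in tokens, of sending these tool schemas on a call."""
    return sum(TOOL_TOKENS[name] for name in tools)


print(catalog_tokens(catalog_for("unknown")))      # 5000
print(catalog_tokens(catalog_for("code_review")))  # 300
```

Keeping each subset stable also helps caching, since the cached prefix only changes when that subset changes.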


6. Pitfall 5 — Multi-turn accumulation

Conversation- and agent-style calls re-send the entire prior message history every turn. The first turn's input is billed 10 times in a 10-turn conversation.

  • Without cache hits, an N-turn conversation's input cost grows O(N²).
  • When someone asks "Why is the same task twice as expensive as yesterday?", the answer is almost always that the conversation got longer.
  • When auto-compaction triggers, every token sent up to the compaction point has already been billed. The context reset only lowers the cost of later turns.

Mitigation: start a new session when a task ends. For long conversations, explicitly truncate context.
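
The quadratic growth is easy to see in a sketch that re-sends the full history on every turn. It assumes no cache hits and treats each list entry as the new tokens a turn adds to the history; the function name is an illustrative assumption.

```python
def cumulative_input_tokens(new_tokens_per_turn: list[int]) -> int:
    """Total billed input when every turn re-sends the whole prior history."""
    total = 0
    history = 0
    for new_tokens in new_tokens_per_turn:
        history += new_tokens   # history now includes this turn
        total += history        # the whole history is billed as input
    return total


ten_turns = cumulative_input_tokens([1_000] * 10)
twenty_turns = cumulative_input_tokens([1_000] * 20)
print(ten_turns, twenty_turns)  # 55000 210000
```

Doubling the number of turns roughly quadruples the billed input, which is why ending a session when the task ends matters more than trimming individual messages.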


7. Measurement — Where to look

Anthropic Console (Usage): daily, weekly, monthly tokens broken down into cache write / cache hit / regular input / output. The single most reliable source.

Claude Code /cost (CLI): current-session tokens and cost. See what a task just cost immediately after finishing.

Custom router logs: when running multiple providers, log a per-call cost record in the router. Aggregate daily by provider × model × task_type.
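
A minimal sketch of that aggregation, assuming each router log record is a dict with date, provider, model, task_type, and cost_usd keys. The record shape, the key names, and the sample values are hypothetical.

```python
from collections import defaultdict


def aggregate_daily(records: list[dict]) -> dict[tuple, float]:
    """Sum per-call cost by (date, provider, model, task_type)."""
    totals: dict[tuple, float] = defaultdict(float)
    for r in records:
        key = (r["date"], r["provider"], r["model"], r["task_type"])
        totals[key] += r["cost_usd"]
    return dict(totals)


# Hypothetical log records with placeholder provider and model names.
calls = [
    {"date": "2026-05-05", "provider": "provider_a", "model": "model_large",
     "task_type": "code_change", "cost_usd": 0.25},
    {"date": "2026-05-05", "provider": "provider_a", "model": "model_large",
     "task_type": "code_change", "cost_usd": 0.5},
    {"date": "2026-05-05", "provider": "provider_b", "model": "model_small",
     "task_type": "classification", "cost_usd": 0.125},
]
for key, cost in aggregate_daily(calls).items():
    print(key, cost)
```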

Empirical signals:

  • cache_hit_rate < 30% → re-examine cache placement (the front of the cached block is likely changing).
  • thinking_token_share > 50% → classify whether thinking is really needed.
  • Average input > 2× average output → suspect system-prompt or tool-catalog leakage.
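
These signals can be turned into a simple check over aggregated metrics. The thresholds are the ones listed above; the function name, parameter names, and message strings are illustrative assumptions, and how each metric is computed depends on your own logs.

```python
def diagnose(cache_hit_rate: float, thinking_token_share: float,
             avg_input_tokens: float, avg_output_tokens: float) -> list[str]:
    """Return the empirical signals that fire for the given metrics."""
    findings = []
    if cache_hit_rate < 0.30:
        findings.append("low cache hit rate: re-examine cache placement")
    if thinking_token_share > 0.50:
        findings.append("high thinking share: classify which tasks need thinking")
    if avg_input_tokens > 2 * avg_output_tokens:
        findings.append("input-heavy: suspect system-prompt or tool-catalog leakage")
    return findings


for finding in diagnose(0.2, 0.6, 10_000, 1_000):
    print(finding)
```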


8. Unit economics — Cost per task

Monthly totals don't help decisions. Decompose by task.

Task type                   Avg input   Avg output   Avg cost (Sonnet)   Note
Classification / labeling   2K          50           Very low            Route to Haiku
Summarization (1K text)     5K          300          Low                 Lower with system-prompt cache
Code change (one file)      15K         1K           Medium              30% cut with cache + short output
Multi-agent workflow        50K+        5K+          High                Largest savings opportunity

This unit reveals where to apply caching, routing, and short-output rules.
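
To make the table concrete, here is a sketch that ranks task types by estimated cost per task. The token averages come from the table, with "50K+" and "5K+" taken at their lower bounds. The rates are placeholders rather than quoted prices, and the names are illustrative assumptions.

```python
# (avg input tokens, avg output tokens) per task type, from the table.
TASK_PROFILES = {
    "Classification / labeling": (2_000, 50),
    "Summarization (1K text)": (5_000, 300),
    "Code change (one file)": (15_000, 1_000),
    "Multi-agent workflow": (50_000, 5_000),   # lower bound of "50K+" / "5K+"
}


def cost_per_task(input_rate_per_mtok: float,
                  output_rate_per_mtok: float) -> dict[str, float]:
    """Estimated cost of one task of each type, at the given per-million rates."""
    return {
        task: (inp * input_rate_per_mtok + out * output_rate_per_mtok) / 1_000_000
        for task, (inp, out) in TASK_PROFILES.items()
    }


# Placeholder rates (assumption), per million tokens.
costs = cost_per_task(3.0, 15.0)
ranked = sorted(costs.items(), key=lambda item: item[1], reverse=True)
for task, cost in ranked:
    print(f"{task:28s} {cost:.5f}")
```

Multiplying each unit cost by its daily call volume gives the ranking that matters for deciding where to apply caching, routing, or short-output rules first.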


9. At a glance

Pitfall                   Signal                               First mitigation
Cache-write multiplier    Input tokens above expectation       Measure hit rate before enabling 1-hour cache
Thinking tokens           Output tokens asymmetrically large   Classify which tasks actually need thinking
System-prompt leakage     Small tasks with high unit cost      Cache the system prompt + per-task separation
Tool-definition cost      Persistent baseline input            Subset catalogs per task
Multi-turn accumulation   Same task, N× the cost               Session separation + explicit context cuts

Once the bill is predictable, the next step is active reduction — model routing, in part 2/4.


Next up

Part 2/4: Model Routing — The Cost / Quality / Latency Triangle. When unit prices differ by 10×, deciding which task goes where is the next-largest lever.


References

  • Anthropic, Pricing — claude.com/pricing (verified 2026-05-05).
  • Anthropic, Prompt Caching — docs.claude.com/en/docs/build-with-claude/prompt-caching (verified 2026-05-05).
  • Anthropic, Extended Thinking — docs.claude.com/en/docs/build-with-claude/extended-thinking (verified 2026-05-05).
  • OpenAI, Pricing — platform.openai.com/pricing (verified 2026-05-05).

This is part 1/4 of the AI Operations Economics series.
