"Claude Code Token & Cache Cost Breakdown 2026 — The 1.25× vs 2× Multiplier Trap"

Official pricing, cache minimums, /usage tracking, and 10 ways to cut tokens


Key summary

  • Audience: Claude Code users who were surprised by a monthly bill, or teams estimating budget before rollout.
  • What you'll get: Current (April 2026) per-model pricing, cache multipliers (1.25× / 2× / 0.1×), per-model minimum cacheable tokens, /usage for live tracking, the official 10-item reduction playbook, and the hidden costs (extended thinking, agent teams, background).
  • Prerequisite: Claude Code installed (install guide). Pro/Max subscribers pay a flat rate — the numbers below don't apply to your bill, only to usage-bar planning.

1. Two billing axes — tokens and cache

Claude Code charges based on Anthropic API token consumption. Cost comes from two axes.

  1. Input / output tokens — every turn bills separately for the prompt (input) and the response (output).
  2. Prompt cache — caching the system prompt, CLAUDE.md, or long context makes retransmissions of the same content cost 0.1×. But cache writes are more expensive than standard input (1.25× or 2×).

Without grasping these two axes, you can't explain "why did the second response cost less than the first?"

Pro/Max/Team/Enterprise subscribers pay a flat monthly rate — per-token costs don't map to billing. What matters is the usage-bar limit per plan. The table below is for Anthropic Console pay-as-you-go.


2. Official pricing (April 2026, per million tokens)

Per Anthropic's pricing page.

Model        Input   5-min write   1-hour write   Cache read   Output
Opus 4.7     $5      $6.25         $10            $0.50        $25
Opus 4.6     $5      $6.25         $10            $0.50        $25
Opus 4.5     $5      $6.25         $10            $0.50        $25
Sonnet 4.6   $3      $3.75         $6             $0.30        $15
Sonnet 4.5   $3      $3.75         $6             $0.30        $15
Haiku 4.5    $1      $1.25         $2             $0.10        $5
Haiku 3.5    $0.80   $1.00         $1.60          $0.08        $4

Observations:

  • Output costs 5× input on every model, so long responses drive cost fastest.
  • Cache writes are a one-time premium; the reads that follow cost 0.1× input.
  • Opus costs about 1.67× Sonnet ($5 vs $3 input) and 5× Haiku 4.5, so model choice is a major lever.
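To make the pricing table concrete, here is a minimal sketch of a per-turn cost calculator. The function name turn_cost and the 20,000 / 2,000 token counts are illustrative assumptions, not anything Claude Code provides; the prices come from the table above.

```shell
#!/bin/bash
# Hypothetical helper (not part of Claude Code): uncached cost of one turn in USD.
# Args: input tokens, output tokens, input price ($/MTok), output price ($/MTok)
turn_cost() {
  LC_ALL=C awk -v i="$1" -v o="$2" -v ip="$3" -v op="$4" 'BEGIN {
    printf "%.4f\n", (i * ip + o * op) / 1e6
  }'
}

# Assumed workload: 20,000 input tokens and 2,000 output tokens in one turn.
turn_cost 20000 2000 3 15   # Sonnet 4.6 -> 0.0900
turn_cost 20000 2000 5 25   # Opus 4.7   -> 0.1500
```

Although output is only a tenth of the tokens in this assumed turn, it accounts for a third of the cost on both models.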


3. Cache multiplier structure — 1.25× vs 2× vs 0.1×

Official rules:

Operation                             Multiplier    Meaning
5-min TTL write (default ephemeral)   1.25× input   Wins if reused within 5 minutes
1-hour TTL write (extended)           2× input      Wins for 5-min to 1-hour gaps
Cache read (hit/refresh)              0.1× input    90% discount on retransmitting same content

Example — Sonnet 4.6, a 50,000-token context reused 10 times in one session:

  • No cache: 50,000 × 10 × $3/MTok = $1.50 (input only)
  • 5-min cache: first write 50,000 × $3.75/MTok + reads 50,000 × 9 × $0.30/MTok = $0.1875 + $0.135 = $0.3225 (~78% savings)
  • 1-hour cache: first write 50,000 × $6/MTok + reads 50,000 × 9 × $0.30/MTok = $0.30 + $0.135 = $0.435 (~71% savings)

If every reuse lands within 5 minutes, the 5-min TTL always wins. But an idle gap longer than 5 minutes expires the cache, and you pay the full write cost again; the 1-hour TTL wins in that regime.
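The comparison above can be reproduced with a short script. This is a sketch under stated assumptions: cache_cost is a hypothetical helper, and the last call models a worst case in which all 10 uses are spaced more than 5 minutes apart, so the 5-min cache is rewritten every time.

```shell
#!/bin/bash
# Hypothetical helper: USD cost of sending one cached context during a session.
# Args: tokens, input price ($/MTok), write multiplier, number of writes, number of reads
# Each cache read is billed at 0.1x the input price.
cache_cost() {
  LC_ALL=C awk -v t="$1" -v p="$2" -v m="$3" -v w="$4" -v r="$5" 'BEGIN {
    printf "%.4f\n", (t * p / 1e6) * (w * m + r * 0.1)
  }'
}

# Sonnet 4.6 ($3/MTok input), 50,000-token context, 10 uses.
cache_cost 50000 3 1.00 10 0   # no cache, 10 full-price sends     -> 1.5000
cache_cost 50000 3 1.25 1 9    # 5-min TTL, all reuse within 5 min -> 0.3225
cache_cost 50000 3 2.00 1 9    # 1-hour TTL                        -> 0.4350
cache_cost 50000 3 1.25 10 0   # 5-min TTL, every use after expiry -> 1.8750
```

The last line is the trap: a 5-min cache that expires before every reuse costs more than no cache at all, because every send pays the 1.25× write premium.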

3.1 What TTL does Claude Code use?

Caveat: The official /costs page doesn't explicitly specify Claude Code's default TTL. Observationally, Claude Code targets cross-session resume use, behaving as a 1-hour TTL-only client. Budget with the 2× multiplier to be safe.

So Claude Code's upside is "cache hits survive long sessions". The downside is "cache write premium is 2×, not 1.25×".


4. Per-model minimum cacheable tokens

If the cacheable portion of a request is below the minimum, caching silently doesn't apply (no error, just bypassed).

Model                  Minimum cacheable tokens
Opus 4.7 / 4.6 / 4.5   4,096
Sonnet 4.6             2,048
Sonnet 4.5             1,024
Haiku 4.5              4,096
Haiku 3.5              2,048

Using Opus on a project with a short CLAUDE.md often misses the 4K threshold. In that regime Sonnet is more economical.
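A quick way to sanity-check whether a prompt is likely to clear the threshold is sketched below. The model labels (opus-4.7 and so on) are shorthand invented for this example, not real API model IDs, and the 4-bytes-per-token estimate is a rough rule of thumb for English text rather than a real tokenizer count.

```shell
#!/bin/bash
# Hypothetical lookup of the minimums in the table above.
# The labels are illustrative shorthand, NOT real API model IDs.
min_cacheable_tokens() {
  case "$1" in
    opus-4.7|opus-4.6|opus-4.5|haiku-4.5) echo 4096 ;;
    sonnet-4.6|haiku-3.5)                 echo 2048 ;;
    sonnet-4.5)                           echo 1024 ;;
    *) echo "unknown model: $1" >&2; return 1 ;;
  esac
}

# Rough token estimate for a file, assuming ~4 bytes per token (approximation only).
estimate_tokens() {
  local bytes
  bytes=$(wc -c < "$1")
  echo $(( bytes / 4 ))
}

# Succeeds (exit 0) when the token count reaches the model's minimum.
meets_cache_minimum() {
  local min
  min=$(min_cacheable_tokens "$1") || return 2
  [ "$2" -ge "$min" ]
}

# A 3,000-token prompt clears the Sonnet 4.6 minimum but not the Opus one.
for model in opus-4.7 sonnet-4.6; do
  if meets_cache_minimum "$model" 3000; then
    echo "$model: cacheable"
  else
    echo "$model: below minimum"
  fi
done
# With a real file: meets_cache_minimum opus-4.7 "$(estimate_tokens CLAUDE.md)"
```

Keep in mind that the cacheable prefix in a real session also includes the system prompt and tool definitions, so this check on CLAUDE.md alone is a conservative lower bound.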

4.1 Cache breakpoint limit

Each request allows at most 4 cache_control breakpoints. An API caller decides where to draw the cache boundaries (system prompt / tool definitions / long files). As a Claude Code user you don't configure this directly, because Claude Code handles it internally.


5. Track live cost with /usage

Run /usage (aliases /cost, /stats) in-session to see tokens, cost, and code change counts.

Official example:

Total cost:            $0.55
Total duration (API):  6m 19.7s
Total duration (wall): 6h 33m 10.2s
Total code changes:    0 lines added, 0 lines removed

The dollar figure is locally estimated from token counts. Authoritative billing is on the Anthropic Console Usage page.

5.1 Show on status line

/statusline can keep context/tokens/cost in the terminal bar continuously — recommended.

/statusline

Interactive setup; can auto-configure from your shell prompt.

5.2 Enterprise averages (reference)

Published numbers:

  • Average ~$13/developer/active day
  • Monthly $150–250/developer
  • 90% of users stay under $30/active day

These are enterprise API deployments. Individual Pro/Max experience varies significantly.
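For a first-pass budget, the published per-day figures can be multiplied out. The sketch below uses a hypothetical helper, team_monthly_budget, with an assumed team of 12 developers and 18 active days per month; only the $13 and $30 daily figures come from the numbers above.

```shell
#!/bin/bash
# Hypothetical helper: rough monthly API budget in whole USD.
# Args: developers, active days per month, [daily USD per developer, default 13]
team_monthly_budget() {
  local devs=$1 active_days=$2 daily_usd=${3:-13}
  echo $(( devs * active_days * daily_usd ))
}

# Assumed team: 12 developers, 18 active days each per month.
team_monthly_budget 12 18      # at the ~$13/day average     -> 2808
team_monthly_budget 12 18 30   # at the $30/day 90th pct cap -> 6480
```

At 18 active days the average case works out to $234 per developer, which sits inside the published $150–250 monthly range.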


6. The 10-item official reduction playbook

6.1 /clear between tasks

When switching to unrelated work, /clear. Stale context wastes tokens on every subsequent message.

6.2 /compact with focus instructions

/compact Focus on code samples and API usage

Telling Claude what to preserve avoids losing key context.

Or set a default in CLAUDE.md:

When you are using compact, please focus on test output and code changes

6.3 Model choice — Sonnet default, Opus when complex

Sonnet handles most coding well and costs 5× less than Opus. Reserve Opus for architectural decisions or multi-step reasoning.

6.4 Cut MCP server overhead

MCP tool definitions are deferred by default — only tool names enter context until Claude invokes a specific tool. Use /context to see what's consuming space. Disable unused servers with /mcp.

Prefer CLI tools (gh, aws, gcloud, sentry-cli) — they add no per-tool listing.

6.5 Code intelligence plugins

For typed languages, install a code intelligence plugin. A "go to definition" call replaces grep + multiple file reads.

6.6 Preprocess with hooks

A hook can grep ERROR from a 10,000-line log before Claude ever sees it. Official example — PreToolUse hook filtering test output to failures only:

.claude/settings.json:

{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [{"type": "command", "command": "~/.claude/hooks/filter-test-output.sh"}]
      }
    ]
  }
}

filter-test-output.sh:

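#!/bin/bash
# Read the hook payload (JSON) from stdin and extract the Bash command.
input=$(cat)
cmd=$(echo "$input" | jq -r '.tool_input.command')

# For test runners, rewrite the command so only failures reach Claude.
if [[ "$cmd" =~ ^(npm test|pytest|go test) ]]; then
  filtered_cmd="$cmd 2>&1 | grep -A 5 -E '(FAIL|ERROR|error:)' | head -100"
  # Build the response with jq so quotes inside the command are escaped correctly.
  jq -n --arg cmd "$filtered_cmd" \
    '{hookSpecificOutput: {hookEventName: "PreToolUse", permissionDecision: "allow", updatedInput: {command: $cmd}}}'
else
  echo "{}"
fi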

Tens of thousands of tokens become hundreds. Full hook mechanics: hooks complete guide.

6.7 Move CLAUDE.md into skills

CLAUDE.md is loaded every session. Move detailed workflow instructions (PR reviews, DB migrations) to skills — they load on demand. Keep CLAUDE.md under 200 lines.

6.8 Tune extended thinking

Extended thinking is on by default. Thinking tokens bill as output tokens — default budget can be tens of thousands per request. For simple tasks, this is waste.

/effort low
MAX_THINKING_TOKENS=8000 claude

6.9 Delegate verbose work to subagents

Test runs, documentation fetches, and log parsing consume significant context. Delegate these to subagents — verbose output stays in the subagent's context, only a summary returns.

6.10 Write specific prompts

"Improve this codebase" triggers broad scanning. "Add input validation to the login function in auth.ts" finishes with minimal file reads.


7. Three hidden costs

7.1 Background

Claude Code uses some tokens while idle:

  • Conversation summarization jobs (for claude --resume)
  • Status-checking commands

Official: under $0.04 per session. Small, but multiplies across many open sessions.

7.2 Agent teams

A feature-flag capability (CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1) that spawns multiple instances. Each teammate runs its own context window, so token usage runs roughly 7× a standard session when teammates work in plan mode.

Mitigation:

  • Use Sonnet for teammates (never Opus)
  • Keep teams small
  • Keep spawn prompts focused
  • Clean up when done

7.3 Default extended thinking on Opus

Complex work on Opus 4.7 can push thinking to hundreds of thousands of tokens. Make checking /usage a habit.


8. Team rollout — TPM/RPM recommendations

Official per-user recommendations by org size (Anthropic API plans):

Team size   TPM/user    RPM/user
1-5         200k-300k   5-7
5-20        100k-150k   2.5-3.5
20-50       50k-75k     1.25-1.75
50-100      25k-35k     0.62-0.87
100-500     15k-20k     0.37-0.47
500+        10k-15k     0.25-0.35

Per-user allocations shrink as teams grow because a smaller share of users is active at the same time. Limits apply at the org level, so an individual can temporarily exceed their share while others are idle.
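Since the limit is set for the whole org, the per-user figures need to be multiplied by headcount. The sketch below does that for the TPM column. Both function names are hypothetical, and because the table's brackets overlap at their edges (5, 20, 50, ...), the code assumes a boundary size belongs to the smaller bracket.

```shell
#!/bin/bash
# Hypothetical helper: per-user TPM range ("LOW HIGH") for a team size,
# taken from the table above. Boundary sizes are assigned to the smaller
# bracket, which is an assumption of this sketch.
per_user_tpm_range() {
  local n=$1
  if   [ "$n" -le 5 ];   then echo "200000 300000"
  elif [ "$n" -le 20 ];  then echo "100000 150000"
  elif [ "$n" -le 50 ];  then echo "50000 75000"
  elif [ "$n" -le 100 ]; then echo "25000 35000"
  elif [ "$n" -le 500 ]; then echo "15000 20000"
  else                        echo "10000 15000"
  fi
}

# Org-level TPM range to request: team size times the per-user range.
org_tpm_range() {
  local n=$1 low high
  read -r low high <<< "$(per_user_tpm_range "$n")"
  echo "$(( n * low )) $(( n * high ))"
}

org_tpm_range 30   # -> 1500000 2250000
```

For an assumed 30-person team this suggests an org-level limit of roughly 1.5M to 2.25M TPM.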


9. PostToolUse hook for cost logging

/usage shows the current session only. For long-term tracking, append to a file with a hook.

.claude/hooks/log-cost.sh:

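#!/bin/bash
# Append one tab-separated row (timestamp, session ID, tool name) per tool call.

INPUT=$(cat)
TOOL_NAME=$(echo "$INPUT" | jq -r '.tool_name')
SESSION_ID=$(echo "$INPUT" | jq -r '.session_id')
TS=$(date +%s)

LOG_DIR="$CLAUDE_PROJECT_DIR/tasks/log"
mkdir -p "$LOG_DIR"  # create the log directory on first run
printf '%s\t%s\t%s\n' "$TS" "$SESSION_ID" "$TOOL_NAME" >> "$LOG_DIR/tool-usage.tsv"

exit 0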

.claude/settings.local.json:

{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "*",
        "hooks": [
          {"type": "command", "command": "\"$CLAUDE_PROJECT_DIR\"/.claude/hooks/log-cost.sh"}
        ]
      }
    ]
  }
}

The example only counts tool calls, but parsing /usage output in a PostToolBatch or Stop hook yields precise monthly stats.
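Once the log has accumulated, a small summary shows which tools are called most often. This is a minimal sketch: summarize_tool_log is a hypothetical helper, it assumes three columns (timestamp, session ID, tool name), and it accepts either a tab or a " | " separator between them.

```shell
#!/bin/bash
# Hypothetical helper: count logged tool calls per tool name, most frequent first.
# Assumes three columns (timestamp, session ID, tool name) separated by a tab
# or by " | ". Output is "COUNT<TAB>TOOL", one line per tool.
summarize_tool_log() {
  LC_ALL=C awk -F '[ ]*[|\t][ ]*' '
    NF >= 3 { count[$3]++ }
    END { for (tool in count) printf "%d\t%s\n", count[tool], tool }
  ' "$1" | LC_ALL=C sort -k1,1nr -k2,2
}

# Usage (path as written by the logging hook):
#   summarize_tool_log "$CLAUDE_PROJECT_DIR/tasks/log/tool-usage.tsv"
```

Call counts are only a proxy for spend, but a tool that dominates the list is a good first candidate for a preprocessing hook or a subagent.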


10. Counter-scenarios — when this optimization doesn't apply

  • Individual Pro/Max subscribers → flat-rate billing; per-token cost is irrelevant. Only the usage-bar matters. Pro hits a monthly cap → wait/upgrade.
  • Short sessions only → the prompt stays below the cache minimum (4K on Opus, 1K–2K on Sonnet), so caching never kicks in and cache-oriented tuning has no effect.
  • Prototyping / exploration → speed beats cost. Overusing /compact or model switching disrupts thinking flow.
  • Enterprise API deployment → team-wide tracking is better handled by a gateway (LiteLLM per official docs) than per-user hooks.

11. What's next

With cost structure in hand:

  1. Hooks complete guide — the foundation for preprocessing and cost-logging hooks above.
  2. CLAUDE.md authoring guide — keep it under 200 lines; move details to skills.
  3. Slash commands complete reference: /usage, /context, /compact, /clear, /model, /effort all documented.


This is post 5/15 in the "AI Coding CLI Entry Guide" series. Last verified: 2026-04-25 (per Anthropic official pricing and Claude Code Costs docs).
