AI Operations Economics (3/4) — Prompt Caching Guide: 1-hour vs 5-minute Cache
Caching is not always savings. It is savings if the hit rate is high enough — otherwise it is a loss.
ํต์ฌ ์์ฝ
- Cache hit = 0.1× base input; cache write = 1.25× (5-min) or 2× (1-hour)
- Break-even: 5-min cache pays back at 2 hits; 1-hour cache around 2–3 hits
- Primary sources: Anthropic prompt caching docs, OpenAI prompt caching docs
- Cache targets: system prompts / CLAUDE.md / tool catalogs / frequent static context
- Decision: 1-hour is not always better than 5-min — only superior when hit rate is reliably high
1. How caching works
LLM caches store the prefix of a prompt. On a later call, if the same prefix appears, computation is reused.
Core rules:
- Prefix matching only. One byte changed in the front of the cache block → miss.
- Block boundaries must be marked explicitly (cache_control in Anthropic). OpenAI handles it automatically.
- Blocks are typically efficient at 1024+ tokens. Smaller blocks have too much overhead.
Hit-time pricing: - Anthropic: base × 0.1 (~10%). - OpenAI: base × 0.5 (~50%, varies by model).
Anthropic is aggressively cheaper, but hit-rate management requires more attention than OpenAI.
2. TTL — 5-minute vs 1-hour
Anthropic offers two TTL options.
| TTL | Cache write rate | Cache hit rate | Where to use |
|---|---|---|---|
| 5-min (default) | base × 1.25 | base × 0.1 | Within-session repetition; interactive |
| 1-hour | base × 2.0 | base × 0.1 | Long sessions, batch, background |
TTL selection rule: - Same context re-called within 5 minutes → 5-min cache. - Re-called between 5–60 minutes → 1-hour cache (if break-even holds). - Gaps over 1 hour → cache expires, and caching becomes a loss.
Claude Code note: Claude Code is known to use 1-hour TTL by default. That means every cache write is billed at 2× base. Bake the 2× multiplier into cost estimates.
3. Break-even math
5-minute cache: - First write: base × 1.25 (extra 0.25× over normal). - Each hit: base × 0.1 → savings of 0.9× per hit. - Break-even: 0.25 = 0.9 × N → N ≈ 0.28 hits. Pays back from the first hit. - Practical rule: enable when 2+ hits are likely.
1-hour cache: - First write: base × 2.0 (extra 1.0× over normal). - Each hit: base × 0.1 → 0.9× savings. - Break-even: 1.0 = 0.9 × N → N ≈ 1.12 hits. Pays back from the second hit. - Practical rule: enable when 4+ hits are likely (safety margin).
Implication: with less than one hit, caching is a loss. "Caching = savings" applied without measurement can raise the bill.
4. What to cache
4.1 System prompts
- Appear on nearly every call. First-priority cache target.
- Highest possible hit rate.
4.2 CLAUDE.md / rule files
- Fixed at the project level — identical across all calls in a project.
- 1-hour cache fits long sessions.
4.3 Tool catalogs
- MCP server defs, function-calling schemas.
- A frequently changing catalog breaks the cache — apply caching after stabilization.
4.4 Frequent static context
- Design docs, data dictionaries, code style guides.
- Push changing parts to the back, keep static parts at the front under cache.
Don't cache: - Per-user / per-session data. - Per-call timestamps and random IDs. - Very small contexts (<1K tokens). Cache overhead absorbs the savings.
5. Anthropic vs. OpenAI
| Item | Anthropic | OpenAI |
|---|---|---|
| Activation | Explicit (cache_control) |
Automatic |
| Hit rate | base × 0.1 | base × 0.5 (model-dependent) |
| Write multiplier | 1.25× (5-min), 2.0× (1-hour) | No additional cost |
| TTL | 5-min / 1-hour | Per model (typically 5–10 min) |
| Block boundary | Marked explicitly | Automatic (1024+ token prefix) |
Choosing: - Anthropic gives a deeper discount when you can fully control hit rate. - OpenAI's automatic caching is safer for volatile workloads.
6. Debugging cache misses
If caching seems inactive, check:
- Front of block changed: does the block start differ each call? (Timestamps in the front are a frequent culprit.)
- Block size: <1024 tokens disables caching.
- TTL expired: did calls space out beyond TTL?
- Reorder: cache misses if cache-block order changed. Verify whether tool definitions auto-reorder.
- Model swap: cache shares only within the same model.
Metric: compare usage.cache_read_input_tokens vs usage.cache_creation_input_tokens in the API response. Hit rate = read / (read + creation).
7. Operational pattern — Cache-friendly prompt layout
[System prompt — static, 5K tokens] ← cached
[CLAUDE.md — static, 3K tokens] ← cached
[Tool catalog — static, 2K tokens] ← cached
─────────────────────────── cache boundary ───
[User input — varies] ← not cached
Core principles: - Put unchanging content first. - Push everything that varies last. - Align cache boundaries with block boundaries.
With this layout, only user input is processed fresh, and the prior 10K tokens enter at 0.1× rate. You buy 10K tokens at the price of 1K.
8. At a glance
| Cache type | Break-even hits | Best workload |
|---|---|---|
| 5-min TTL | 1 (practical: 2) | Interactive, short sessions |
| 1-hour TTL | 2 (practical: 4) | Long sessions, batch, background |
| Cache | Don't cache |
|---|---|
| System prompts / CLAUDE.md / tool catalogs | Per-user data, timestamps, varying random |
| Frequent static documents | Tiny contexts under 1K tokens |
Core rule: don't enable caching without measurement. Cache hit rate < 30% → revisit immediately.
Next up
Part 4/4: Context Management Patterns — auto-compact / Memory / RAG Cost Comparison. Caching makes the same context cheaper to send. The next part is about making context smaller in the first place.
References
- Anthropic, Prompt Caching — docs.claude.com/en/docs/build-with-claude/prompt-caching (verified 2026-05-05).
- Anthropic, 1-Hour Cache TTL — docs.claude.com/en/docs/build-with-claude/prompt-caching#cache-duration (verified 2026-05-05).
- OpenAI, Prompt Caching — platform.openai.com/docs/guides/prompt-caching (verified 2026-05-05).
- Series parts 1 (token cost structure) and 2 (routing).
This is part 3/4 of the AI Operations Economics series.
๋๊ธ
๋๊ธ ์ฐ๊ธฐ