AI Operations Economics (3/4) — Prompt Caching Guide: 1-hour vs 5-minute Cache

5월 05, 2026

AI 운영 경제학 (3/4) — 컨텍스트 캐싱 가이드: 1시간 vs 5분, 어디에 쓰나

Caching is not always savings. It is savings if the hit rate is high enough — otherwise it is a loss.

핵심 요약

Cache hit = 0.1× base input; cache write = 1.25× (5-min) or 2× (1-hour)
Break-even: 5-min cache pays back at 2 hits; 1-hour cache around 2–3 hits
Primary sources: Anthropic prompt caching docs, OpenAI prompt caching docs
Cache targets: system prompts / CLAUDE.md / tool catalogs / frequent static context
Decision: 1-hour is not always better than 5-min — only superior when hit rate is reliably high

1. How caching works

LLM caches store the prefix of a prompt. On a later call, if the same prefix appears, computation is reused.

Core rules: - Prefix matching only. One byte changed in the front of the cache block → miss. - Block boundaries must be marked explicitly (cache_control in Anthropic). OpenAI handles it automatically. - Blocks are typically efficient at 1024+ tokens. Smaller blocks have too much overhead.

Hit-time pricing: - Anthropic: base × 0.1 (~10%). - OpenAI: base × 0.5 (~50%, varies by model).

Anthropic is aggressively cheaper, but hit-rate management requires more attention than OpenAI.

2. TTL — 5-minute vs 1-hour

Anthropic offers two TTL options.

TTL	Cache write rate	Cache hit rate	Where to use
5-min (default)	base × 1.25	base × 0.1	Within-session repetition; interactive
1-hour	base × 2.0	base × 0.1	Long sessions, batch, background

TTL selection rule: - Same context re-called within 5 minutes → 5-min cache. - Re-called between 5–60 minutes → 1-hour cache (if break-even holds). - Gaps over 1 hour → cache expires, and caching becomes a loss.

Claude Code note: Claude Code is known to use 1-hour TTL by default. That means every cache write is billed at 2× base. Bake the 2× multiplier into cost estimates.

3. Break-even math

5-minute cache: - First write: base × 1.25 (extra 0.25× over normal). - Each hit: base × 0.1 → savings of 0.9× per hit. - Break-even: 0.25 = 0.9 × N → N ≈ 0.28 hits. Pays back from the first hit. - Practical rule: enable when 2+ hits are likely.

1-hour cache: - First write: base × 2.0 (extra 1.0× over normal). - Each hit: base × 0.1 → 0.9× savings. - Break-even: 1.0 = 0.9 × N → N ≈ 1.12 hits. Pays back from the second hit. - Practical rule: enable when 4+ hits are likely (safety margin).

Implication: with less than one hit, caching is a loss. "Caching = savings" applied without measurement can raise the bill.

4. What to cache

4.1 System prompts

Appear on nearly every call. First-priority cache target.
Highest possible hit rate.

4.2 CLAUDE.md / rule files

Fixed at the project level — identical across all calls in a project.
1-hour cache fits long sessions.

4.3 Tool catalogs

MCP server defs, function-calling schemas.
A frequently changing catalog breaks the cache — apply caching after stabilization.

4.4 Frequent static context

Design docs, data dictionaries, code style guides.
Push changing parts to the back, keep static parts at the front under cache.

Don't cache: - Per-user / per-session data. - Per-call timestamps and random IDs. - Very small contexts (<1K tokens). Cache overhead absorbs the savings.

5. Anthropic vs. OpenAI

Item	Anthropic	OpenAI
Activation	Explicit (`cache_control`)	Automatic
Hit rate	base × 0.1	base × 0.5 (model-dependent)
Write multiplier	1.25× (5-min), 2.0× (1-hour)	No additional cost
TTL	5-min / 1-hour	Per model (typically 5–10 min)
Block boundary	Marked explicitly	Automatic (1024+ token prefix)

Choosing: - Anthropic gives a deeper discount when you can fully control hit rate. - OpenAI's automatic caching is safer for volatile workloads.

6. Debugging cache misses

If caching seems inactive, check:

Front of block changed: does the block start differ each call? (Timestamps in the front are a frequent culprit.)
Block size: <1024 tokens disables caching.
TTL expired: did calls space out beyond TTL?
Reorder: cache misses if cache-block order changed. Verify whether tool definitions auto-reorder.
Model swap: cache shares only within the same model.

Metric: compare usage.cache_read_input_tokens vs usage.cache_creation_input_tokens in the API response. Hit rate = read / (read + creation).

7. Operational pattern — Cache-friendly prompt layout

[System prompt — static, 5K tokens]      ← cached
[CLAUDE.md — static, 3K tokens]           ← cached
[Tool catalog — static, 2K tokens]        ← cached
─────────────────────────── cache boundary ───
[User input — varies]                     ← not cached

Core principles: - Put unchanging content first. - Push everything that varies last. - Align cache boundaries with block boundaries.

With this layout, only user input is processed fresh, and the prior 10K tokens enter at 0.1× rate. You buy 10K tokens at the price of 1K.

8. At a glance

Cache type	Break-even hits	Best workload
5-min TTL	1 (practical: 2)	Interactive, short sessions
1-hour TTL	2 (practical: 4)	Long sessions, batch, background

Cache	Don't cache
System prompts / CLAUDE.md / tool catalogs	Per-user data, timestamps, varying random
Frequent static documents	Tiny contexts under 1K tokens

Core rule: don't enable caching without measurement. Cache hit rate < 30% → revisit immediately.

Next up

Part 4/4: Context Management Patterns — auto-compact / Memory / RAG Cost Comparison. Caching makes the same context cheaper to send. The next part is about making context smaller in the first place.

References

Anthropic, Prompt Caching — docs.claude.com/en/docs/build-with-claude/prompt-caching (verified 2026-05-05).
Anthropic, 1-Hour Cache TTL — docs.claude.com/en/docs/build-with-claude/prompt-caching#cache-duration (verified 2026-05-05).
OpenAI, Prompt Caching — platform.openai.com/docs/guides/prompt-caching (verified 2026-05-05).
Series parts 1 (token cost structure) and 2 (routing).

This is part 3/4 of the AI Operations Economics series.

이 블로그 검색

MaJu Tech Notes