AI Operations Economics (3/4) — Prompt Caching Guide: 1-hour vs 5-minute Cache

AI ์šด์˜ ๊ฒฝ์ œํ•™ (3/4) — ์ปจํ…์ŠคํŠธ ์บ์‹ฑ ๊ฐ€์ด๋“œ: 1์‹œ๊ฐ„ vs 5๋ถ„, ์–ด๋””์— ์“ฐ๋‚˜

Caching is not always savings. It is savings if the hit rate is high enough — otherwise it is a loss.


ํ•ต์‹ฌ ์š”์•ฝ

  • Cache hit = 0.1× base input; cache write = 1.25× (5-min) or 2× (1-hour)
  • Break-even: 5-min cache pays back at 2 hits; 1-hour cache around 2–3 hits
  • Primary sources: Anthropic prompt caching docs, OpenAI prompt caching docs
  • Cache targets: system prompts / CLAUDE.md / tool catalogs / frequent static context
  • Decision: 1-hour is not always better than 5-min — only superior when hit rate is reliably high

1. How caching works

LLM caches store the prefix of a prompt. On a later call, if the same prefix appears, computation is reused.

Core rules: - Prefix matching only. One byte changed in the front of the cache block → miss. - Block boundaries must be marked explicitly (cache_control in Anthropic). OpenAI handles it automatically. - Blocks are typically efficient at 1024+ tokens. Smaller blocks have too much overhead.

Hit-time pricing: - Anthropic: base × 0.1 (~10%). - OpenAI: base × 0.5 (~50%, varies by model).

Anthropic is aggressively cheaper, but hit-rate management requires more attention than OpenAI.


2. TTL — 5-minute vs 1-hour

Anthropic offers two TTL options.

TTL Cache write rate Cache hit rate Where to use
5-min (default) base × 1.25 base × 0.1 Within-session repetition; interactive
1-hour base × 2.0 base × 0.1 Long sessions, batch, background

TTL selection rule: - Same context re-called within 5 minutes → 5-min cache. - Re-called between 5–60 minutes → 1-hour cache (if break-even holds). - Gaps over 1 hour → cache expires, and caching becomes a loss.

Claude Code note: Claude Code is known to use 1-hour TTL by default. That means every cache write is billed at 2× base. Bake the 2× multiplier into cost estimates.


3. Break-even math

5-minute cache: - First write: base × 1.25 (extra 0.25× over normal). - Each hit: base × 0.1 → savings of 0.9× per hit. - Break-even: 0.25 = 0.9 × N → N ≈ 0.28 hits. Pays back from the first hit. - Practical rule: enable when 2+ hits are likely.

1-hour cache: - First write: base × 2.0 (extra 1.0× over normal). - Each hit: base × 0.1 → 0.9× savings. - Break-even: 1.0 = 0.9 × N → N ≈ 1.12 hits. Pays back from the second hit. - Practical rule: enable when 4+ hits are likely (safety margin).

Implication: with less than one hit, caching is a loss. "Caching = savings" applied without measurement can raise the bill.


4. What to cache

4.1 System prompts

  • Appear on nearly every call. First-priority cache target.
  • Highest possible hit rate.

4.2 CLAUDE.md / rule files

  • Fixed at the project level — identical across all calls in a project.
  • 1-hour cache fits long sessions.

4.3 Tool catalogs

  • MCP server defs, function-calling schemas.
  • A frequently changing catalog breaks the cache — apply caching after stabilization.

4.4 Frequent static context

  • Design docs, data dictionaries, code style guides.
  • Push changing parts to the back, keep static parts at the front under cache.

Don't cache: - Per-user / per-session data. - Per-call timestamps and random IDs. - Very small contexts (<1K tokens). Cache overhead absorbs the savings.


5. Anthropic vs. OpenAI

Item Anthropic OpenAI
Activation Explicit (cache_control) Automatic
Hit rate base × 0.1 base × 0.5 (model-dependent)
Write multiplier 1.25× (5-min), 2.0× (1-hour) No additional cost
TTL 5-min / 1-hour Per model (typically 5–10 min)
Block boundary Marked explicitly Automatic (1024+ token prefix)

Choosing: - Anthropic gives a deeper discount when you can fully control hit rate. - OpenAI's automatic caching is safer for volatile workloads.


6. Debugging cache misses

If caching seems inactive, check:

  • Front of block changed: does the block start differ each call? (Timestamps in the front are a frequent culprit.)
  • Block size: <1024 tokens disables caching.
  • TTL expired: did calls space out beyond TTL?
  • Reorder: cache misses if cache-block order changed. Verify whether tool definitions auto-reorder.
  • Model swap: cache shares only within the same model.

Metric: compare usage.cache_read_input_tokens vs usage.cache_creation_input_tokens in the API response. Hit rate = read / (read + creation).


7. Operational pattern — Cache-friendly prompt layout

[System prompt — static, 5K tokens]      ← cached
[CLAUDE.md — static, 3K tokens]           ← cached
[Tool catalog — static, 2K tokens]        ← cached
─────────────────────────── cache boundary ───
[User input — varies]                     ← not cached

Core principles: - Put unchanging content first. - Push everything that varies last. - Align cache boundaries with block boundaries.

With this layout, only user input is processed fresh, and the prior 10K tokens enter at 0.1× rate. You buy 10K tokens at the price of 1K.


8. At a glance

Cache type Break-even hits Best workload
5-min TTL 1 (practical: 2) Interactive, short sessions
1-hour TTL 2 (practical: 4) Long sessions, batch, background
Cache Don't cache
System prompts / CLAUDE.md / tool catalogs Per-user data, timestamps, varying random
Frequent static documents Tiny contexts under 1K tokens

Core rule: don't enable caching without measurement. Cache hit rate < 30% → revisit immediately.


Next up

Part 4/4: Context Management Patterns — auto-compact / Memory / RAG Cost Comparison. Caching makes the same context cheaper to send. The next part is about making context smaller in the first place.


References

  • Anthropic, Prompt Caching — docs.claude.com/en/docs/build-with-claude/prompt-caching (verified 2026-05-05).
  • Anthropic, 1-Hour Cache TTL — docs.claude.com/en/docs/build-with-claude/prompt-caching#cache-duration (verified 2026-05-05).
  • OpenAI, Prompt Caching — platform.openai.com/docs/guides/prompt-caching (verified 2026-05-05).
  • Series parts 1 (token cost structure) and 2 (routing).

This is part 3/4 of the AI Operations Economics series.

๋Œ“๊ธ€

์ด ๋ธ”๋กœ๊ทธ์˜ ์ธ๊ธฐ ๊ฒŒ์‹œ๋ฌผ

Agent Memory Engine (2/10) — Building an AI Agent Memory System with SQLite Alone

"ML Foundations (9/9) — PyTorch vs TensorFlow, and the Road to Local LLMs"

"RAG Core Study (14/26) — Evaluation Sets with RAGAS & DeepEval"

"ML Foundations (8/9) — Deep Learning Architectures: CNN, RNN, Attention"

"ML Foundations (7/9) — Deep Learning Training: Optimizers, Regularization, Initialization"

OpenClaw to Hermes Migration (2/13) — What to Preserve, Partially Port, or Discard

AI Agents I Built (5/7) — Building an Automated Blogger API Publishing System