"Claude Code --continue/--resume Prompt Cache Invalidation — Reproduced on v2.1.116"

Cache Destruction by Session-Resume Events, Isolated via TTL-matched Control Pairs


Key Summary

  • Symptom: Invoking claude --continue or --resume invalidates the prompt cache even when within the one-hour TTL. deferred_tools_delta, introduced in v2.1.69, is identified as the root cause. Reproduction is confirmed on v2.1.116 despite patches released in v2.1.90–92.
  • Measurement method: A TTL-matched control-pair methodology was applied — comparing cache hit rates between consecutive intra-session turns (baseline) and the first turn immediately after a session resume (test), both at identical idle gaps, thereby eliminating TTL expiry as a confounding variable. Results: ~4-minute idle gap −56 pp; ~28-minute idle gap −99 pp.
  • Response: GitHub issue #51764 is being tracked. Confirmation of the patch scope for the custom-agent + MCP + skill combination path is requested. Introduction of a --no-deferred-tools-delta opt-out flag is proposed.

Background — Problem Statement

Claude Code relies heavily on Prompt Caching in long-running agent environments. When a context exceeding 100k tokens is retransmitted on every turn, reliable cache hits confine most of the cost to cache_read and keep cache_creation (the write premium) to a minimum. Per Anthropic's official documentation, writes to the one-hour cache cost 2× the standard input-token price, while cache reads cost 0.1×. As long as the structure that sustains 95–99% hit rates in normal operating sessions holds, the majority of input-token cost is eliminated at equivalent context sizes.
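The pricing relationship above can be made concrete with a small calculation. This is a minimal sketch that uses only the two multipliers cited in the paragraph; the function name `turn_cost` and the simplification that every context token is either a cache read or a one-hour cache write are assumptions made for illustration.

```python
# Price multipliers relative to the standard input-token price, as cited above.
WRITE_1H = 2.0   # one-hour cache write
READ = 0.1       # cache read


def turn_cost(context_tokens: int, hit_rate: float) -> float:
    """Cost of resending a context, in standard-input-token equivalents.

    Simplification (assumption): every token is either a cache read or a
    one-hour cache write; uncached input and output tokens are ignored.
    """
    read_tokens = context_tokens * hit_rate
    write_tokens = context_tokens * (1.0 - hit_rate)
    return read_tokens * READ + write_tokens * WRITE_1H


warm = turn_cost(100_000, 0.98)  # typical mid-session turn
cold = turn_cost(100_000, 0.0)   # full cache miss
print(f"warm: {warm:,.0f}  cold: {cold:,.0f}  ratio: {cold / warm:.1f}x")
```

Under these assumptions a single fully missed turn on a 100k-token context costs about as much as fourteen warm turns, which is why one resume-induced miss is noticeable.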

This structure breaks down under --continue/--resume. Since deferred_tools_delta was introduced in v2.1.69, session resume has been observed to reorder tool-result blocks and attachment blocks during rollout replay, which alters the cache key on the server side. GitHub issue #42338 was the original report of this problem; it was closed following patches in v2.1.90/91/92. Extensive cache analysis by community contributor ArkNill (Reference 5), a breaking-attachment-set identification gist by simpolism (Reference 6), and an environment reproduction report by taekim34 formed the primary evidence base for that issue.

Nevertheless, the same behavior is reproduced on v2.1.116. The distinction from prior reports is that the TTL-matched control pair methodology eliminates TTL expiry as a causal factor and isolates the session-resume event itself as the direct cause of cache destruction. This methodology advances beyond mere observation of "cache broke" by treating the resume event as an independent variable through controlled comparison.


Methodology — TTL-matched Control Pairs

Prior reports (#42338) used a single-case token-count approach ("470–512k cache_creation observed on 2–12-second exit→resume"). This approach does not easily rule out the counterargument: "perhaps the 5-minute TTL happened to expire." Even acknowledging the fact (Observation 1) that Claude Code internally uses a one-hour TTL rather than a five-minute TTL, the counterargument "the one-hour TTL may have happened to expire" remains available.

To eliminate the TTL factor entirely, per-turn usage data was collected from 14 jsonl session files. Duplicate counts were removed using requestId-based deduplication, after which two comparison sets were constructed:

  • Baseline: consecutive turns within the same jsonl, idle gaps of 3–28 minutes
  • Test: end of one jsonl → start of the next (session resume), at identical idle gaps

If a hit-rate difference is observed between the two sets, TTL cannot explain it, because the idle gaps are identical. The only variable that differs between the two cases is whether a session boundary was crossed; everything else is held constant. Any difference is therefore attributable to the session-resume event itself. This design supports a causal claim through controlled comparison rather than simple observation.
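The collection and pairing procedure described above could be sketched roughly as follows. The field layout is an assumption, not something this post confirms: each jsonl line is taken to be a JSON object with a `requestId`, an ISO-8601 `timestamp`, and a `message.usage` object. The helper names `load_turns`, `hit_rate`, and `gap_pairs` are hypothetical, and the idle gap is approximated as the time between consecutive request timestamps.

```python
import json
from datetime import datetime


def load_turns(lines):
    """Parse jsonl lines into usage records, deduplicated by requestId.

    Assumed layout: {"requestId", "timestamp", "message": {"usage": {...}}}.
    """
    seen = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        message = entry.get("message")
        usage = message.get("usage") if isinstance(message, dict) else None
        request_id = entry.get("requestId")
        if not usage or not request_id:
            continue
        # A later line with the same requestId overwrites the earlier one,
        # so each request is counted exactly once.
        seen[request_id] = {
            "ts": datetime.fromisoformat(entry["timestamp"].replace("Z", "+00:00")),
            "cc": usage.get("cache_creation_input_tokens", 0),
            "cr": usage.get("cache_read_input_tokens", 0),
        }
    return sorted(seen.values(), key=lambda t: t["ts"])


def hit_rate(turn):
    total = turn["cc"] + turn["cr"]
    return turn["cr"] / total if total else 0.0


def gap_pairs(sessions, min_gap_min=3.0):
    """Return (kind, idle_gap_minutes, hit_rate) rows.

    `sessions` is a chronological list of turn lists, one per jsonl file.
    "baseline" rows are consecutive turns in one file; "test" rows span
    the boundary between one file and the next.
    """
    rows = []
    for i, turns in enumerate(sessions):
        for prev, cur in zip(turns, turns[1:]):
            gap = (cur["ts"] - prev["ts"]).total_seconds() / 60
            if gap >= min_gap_min:
                rows.append(("baseline", gap, hit_rate(cur)))
        if turns and i + 1 < len(sessions) and sessions[i + 1]:
            nxt = sessions[i + 1][0]
            gap = (nxt["ts"] - turns[-1]["ts"]).total_seconds() / 60
            rows.append(("test", gap, hit_rate(nxt)))
    return rows
```

Baseline and test rows with similar idle gaps can then be compared directly; because the gap is matched, any remaining hit-rate difference is not explained by TTL.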


Observations

Observation 1 — Claude Code's Internal Cache TTL

ephemeral_5m_input_tokens is zero across all intervals, and all cache_creation flows into ephemeral_1h_input_tokens. Claude Code therefore uses the one-hour TTL cache internally, so intra-session idle periods shorter than about 60 minutes should not expire the cache.
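A check of this kind might look like the sketch below. It assumes the two counters sit in a nested `cache_creation` object inside each usage record; that nesting and the helper name `ttl_split` are assumptions, so adjust the lookup if your logs are laid out differently. The sample values reuse the cache_creation counts of the first two baseline rows, placed entirely in the one-hour bucket as this observation reports.

```python
def ttl_split(usages):
    """Sum cache writes per TTL bucket across usage objects.

    Assumption: the counters live under usage["cache_creation"].
    """
    totals = {"ephemeral_5m_input_tokens": 0, "ephemeral_1h_input_tokens": 0}
    for usage in usages:
        buckets = usage.get("cache_creation") or {}
        for key in totals:
            totals[key] += buckets.get(key, 0)
    return totals


sample = [
    {"cache_creation": {"ephemeral_5m_input_tokens": 0, "ephemeral_1h_input_tokens": 389}},
    {"cache_creation": {"ephemeral_5m_input_tokens": 0, "ephemeral_1h_input_tokens": 376}},
]
totals = ttl_split(sample)
print(totals)
```

A non-zero five-minute total would mean the one-hour assumption behind the control pairs does not hold for that session.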

Observation 2 — Baseline (mid-session, same jsonl)

Idle gap    cache_creation    cache_read    Hit rate
--------    --------------    ----------    --------
3.3 min                389       155,492       99.8%
3.3 min                376        80,990       99.5%
3.7 min              2,123       159,700       98.7%
4.0 min              3,209       107,456       97.1%
4.0 min              5,112       173,721       97.1%
4.3 min              2,015       220,492       99.1%
4.8 min                840        78,457       98.9%
5.7 min                264        82,612       99.7%
5.8 min              3,302        85,549       96.3%
5.9 min              1,283       240,818       99.5%
6.0 min              2,385       161,823       98.5%
7.6 min              4,945       114,763       95.9%
27.4 min             1,559       150,593       99.0%

The average hit rate is approximately 98%. The one-hour cache operates normally while the session process remains alive.
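The hit-rate column can be recomputed from the two token columns, taking hit rate as cache_read / (cache_read + cache_creation). The sketch below does this for the thirteen baseline rows; the definition is inferred from the table values rather than stated explicitly in this post.

```python
# (idle gap in minutes, cache_creation, cache_read) from the baseline table
BASELINE = [
    (3.3, 389, 155_492), (3.3, 376, 80_990), (3.7, 2_123, 159_700),
    (4.0, 3_209, 107_456), (4.0, 5_112, 173_721), (4.3, 2_015, 220_492),
    (4.8, 840, 78_457), (5.7, 264, 82_612), (5.8, 3_302, 85_549),
    (5.9, 1_283, 240_818), (6.0, 2_385, 161_823), (7.6, 4_945, 114_763),
    (27.4, 1_559, 150_593),
]


def hit_rate(cc: int, cr: int) -> float:
    """Share of cached-context tokens that were read rather than rewritten."""
    total = cc + cr
    return cr / total if total else 0.0


rates = [hit_rate(cc, cr) for _, cc, cr in BASELINE]
mean_rate = sum(rates) / len(rates)
print(f"min {min(rates):.1%}  max {max(rates):.1%}  mean {mean_rate:.1%}")
```

The unweighted mean comes out slightly above 98%, in line with the figure quoted above, and no baseline row falls below about 96%.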

Observation 3 — Cross-session Transitions (same agent, new jsonl file)

Transition             Idle gap                 Next cache_creation (cc)    Next cache_read (cr)    Hit rate    Verdict
-------------------    ---------------------    ------------------------    --------------------    --------    --------------------
f3f5c819 → e17536e7    28.2 min                                   40,260                       0        0.0%    Unexpected miss
e17536e7 → 096d96fa    4.1 min                                    23,849                  16,702       41.2%    Unexpected miss
c0eb34ad → b6e07fac    2423.6 min (>1h TTL)                       20,061                       0        0.0%    TTL expiry, expected
b6e07fac → f3f5c819    2059.3 min (>1h TTL)                       41,766                       0        0.0%    TTL expiry, expected

TTL-matched Pairs — Direct Delta

Idle gap    Intra-session (baseline)           Session resume (test)           Delta
--------    -------------------------------    ----------------------------    ------
~4 min      97–99% hit (cr > 100k, cc ≈ 3k)    41% hit (cr 16.7k, cc 23.8k)    −56 pp
~28 min     99% hit (cr 150k, cc 1.5k)         0% hit (cr 0, cc 40.3k)         −99 pp

Identical idle gaps, identical agent workload, identical codebase state. The only difference is whether a session boundary was crossed. This delta supports the conclusion that the session-resume event itself destroys the cache.
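The deltas in the table can be reproduced from the raw token counts in Observations 2 and 3. The sketch below makes two assumptions: the ~4-minute bucket is taken to be the baseline rows at 3.7, 4.0, 4.0, and 4.3 minutes, and the delta is measured against the lowest baseline hit rate in each bucket, which is the conservative choice.

```python
def hit_rate(cc: int, cr: int) -> float:
    total = cc + cr
    return cr / total if total else 0.0


# (cache_creation, cache_read) pairs
baseline_4min = [(2_123, 159_700), (3_209, 107_456), (5_112, 173_721), (2_015, 220_492)]
resume_4min = (23_849, 16_702)     # e17536e7 -> 096d96fa, 4.1 min
baseline_28min = [(1_559, 150_593)]
resume_28min = (40_260, 0)         # f3f5c819 -> e17536e7, 28.2 min


def delta_pp(baseline, resume) -> float:
    """Percentage-point delta versus the lowest baseline hit rate in the bucket."""
    floor = min(hit_rate(cc, cr) for cc, cr in baseline)
    return (hit_rate(*resume) - floor) * 100


d4 = delta_pp(baseline_4min, resume_4min)
d28 = delta_pp(baseline_28min, resume_28min)
print(f"~4 min: {d4:.1f} pp   ~28 min: {d28:.1f} pp")
```

Measured against the highest baseline row instead, the ~4-minute delta would be roughly two points larger, so the quoted figure is the smaller of the plausible values.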


Root-Cause Alignment — deferred_tools_delta Reordering Hypothesis

Issue #42338 identified deferred_tools_delta, introduced in v2.1.69, as the proximate cause. During rollout replay on session resume, this feature reorders tool-result blocks and attachment blocks. When block ordering changes, the byte prefix of the request changes; because the Anthropic server looks up cache keys by prefix match, a key miss results.

The observations are consistent with this hypothesis. The 4.1-minute resume case exhibits partial prefix matching (16,702 tokens cache_read + 23,849 tokens cache_creation). This matches the pattern of "prefix matching through block N, then the cache key breaking at the reordered point." Pure TTL expiry would produce a 0% hit with no partial match; the existence of partial matching is itself internal evidence that refutes the TTL hypothesis.

The 28.2-minute case shows a 0% hit rate, suggesting that reordering occurred at an earlier block in the prefix. This aligns with the positional-shift pattern of skill_listing, todo_reminders, and nested_memory injections catalogued in simpolism's gist (Reference 6). Whether a given resume produces "partial matching (41% hit rate)" or "total miss (0%)" is determined by which block is reordered first on resume.
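The difference between a partial match and a total miss can be illustrated with a toy model of prefix matching. This is only a sketch of the hypothesis, not the server's actual lookup: real caching operates on cache breakpoints rather than on arbitrary blocks, and the block layout and token counts below are invented for illustration.

```python
def cached_prefix_tokens(cached_blocks, request_blocks):
    """Tokens served from cache under a strict block-by-block prefix match.

    Toy model: each block is (name, token_count); matching stops at the
    first position where the two sequences differ.
    """
    hit = 0
    for old, new in zip(cached_blocks, request_blocks):
        if old != new:
            break
        hit += new[1]
    return hit


# Hypothetical layout; names echo the injections mentioned above.
before = [("system", 8_000), ("tools", 6_000), ("skill_listing", 2_000),
          ("history", 20_000), ("todo_reminders", 500)]
late_swap = before[:3] + [before[4], before[3]]      # reorder near the end
early_swap = [before[2]] + before[:2] + before[3:]   # reorder at the start

total = sum(tokens for _, tokens in before)
for label, request in [("identical", before),
                       ("late reorder", late_swap),
                       ("early reorder", early_swap)]:
    hit = cached_prefix_tokens(before, request)
    print(f"{label:14s} cache_read={hit:>6,}  cache_creation={total - hit:>6,}")
```

In this model a late reorder leaves the leading blocks readable from cache, while a reorder at the first block forces the whole prefix to be rewritten, mirroring the 41% and 0% cases respectively.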

Whether the v2.1.90–92 patches addressed this entire code path remains unclear. In particular, the block-composition patterns arising from combinations of custom agents, multiple MCP servers, skills, and hooks may fall outside the patch scope validated against simpler environments.


Impact — Measured Cost

Usage aggregated from a single 11-hour 52-minute session (197 unique turns):

Item                    Value
--------------------    ----------------------------------------------------
cache_read total        60,869,087 tokens
cache_creation total    2,017,110 tokens (all one-hour, 2× write premium)
input_tokens total      9,923 tokens
output_tokens total     446,687 tokens
cc / output ratio       4.52×

The cc/output ratio in a normally-cached session is typically observed at 1–2×. This session's 4.52× is anomalously high. The session includes two legitimate expiry transitions (>1h) where TTL was genuinely exceeded; those are expected costs. However, the f3f5c819 → e17536e7 transition (28.2 minutes, 40,260 cc) produced a total cache miss despite being a resume within TTL. Those 40,260 tokens of cache_creation would not have been consumed had the resume preserved the prefix.

The cache_creation for the one-hour cache carries a 2× cost relative to standard input. When unnecessary writes of the ~40,260-token magnitude are repeated on every session resume, accumulated costs reach a perceptible level in agent operations that include frequent automatic restarts.
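The figures in this section can be checked with a few lines of arithmetic. The sketch assumes the best case for a preserved prefix, namely that all 40,260 tokens would have been served as cache reads at 0.1×; costs are expressed in standard-input-token equivalents, and the variable names are illustrative.

```python
WRITE_1H, READ = 2.0, 0.1   # multipliers versus standard input price, as cited in this post

cache_creation_total = 2_017_110
output_total = 446_687
missed_prefix = 40_260      # cache_creation on the f3f5c819 -> e17536e7 resume

ratio = cache_creation_total / output_total

paid = missed_prefix * WRITE_1H    # prefix rewritten at the one-hour write rate
if_hit = missed_prefix * READ      # same tokens served as cache reads instead
excess = paid - if_hit

print(f"cc/output ratio: {ratio:.2f}x")
print(f"paid {paid:,.0f} vs {if_hit:,.0f} if cached; excess {excess:,.0f} per resume")
```

Multiplying the per-resume excess by the number of automatic restarts in a day gives a rough upper bound on the avoidable cost for a given deployment.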

For subscription-plan users, this cost is drawn directly from the rolling five-hour quota. cache_creation exhausts quota faster than cache_read. In overnight agent auto-restart patterns, resume-induced cache_creation can occupy a substantial portion of the quota, reducing the quota available for actual task processing.


Recommended Responses

Confirm and track issue status: GitHub issue #42338 was closed following the v2.1.90–92 patches, but reproduction is confirmed on v2.1.116. A new report has been submitted as issue #51764. Users experiencing the same behavior are advised to add their reproduction environment and version to that issue. Accumulated reproduction cases can influence patch prioritization.

Confirm patch scope: The official release notes for the v2.1.90–92 patches do not specify the scope of the fix in detail. The block-composition patterns arising in environments that combine custom agents + MCP servers + skills + hooks may have been omitted from the patch scope validated against simpler environments. Users operating that combination are advised either to verify via direct reproduction testing or to add their environment details to the issue as a request for confirmation from Anthropic.

Request --no-deferred-tools-delta opt-out flag: Given that the deferred_tools_delta feature was introduced as a performance improvement, providing an opt-out flag is more practical than removing the feature entirely. For harness users or long-running agent operators, cache stability may take precedence over full feature availability. A request for a --no-deferred-tools-delta or --strict-prefix flag is included in the Asks section of issue #51764.


Reproduction Steps

  1. Start a Claude Code session and accumulate context to ≥100k tokens, including CLAUDE.md, tool definitions, and prior conversation history.
  2. Exit the session with /exit and resume within 1–5 minutes using claude --continue. The resume interval must be kept within the one-hour TTL.
  3. On the first assistant turn of the resumed session, open ~/.claude/projects/<slug>/<sid>.jsonl and check the cache_read_input_tokens and cache_creation_input_tokens fields in the usage object for the corresponding requestId. If a monitoring proxy is available, per-turn usage queries are also feasible.
  4. Collect the control: within the same session, without using /exit, execute the next turn after an identical idle gap. This must be a consecutive turn within the same jsonl file. The baseline should show a hit rate of ≥95%, that is, cache_read / (cache_read + cache_creation).
  5. Compare hit rates between the two cases. If the post-resume hit rate is substantially lower (particularly in the 0–41% range), the bug is considered reproduced. Adding the reproduction environment (version, list of active features) to issue #51764 benefits the community.
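Steps 3 to 5 could be scripted along the following lines. The jsonl layout (`message.usage` on assistant entries) is assumed rather than confirmed here, the helper names are hypothetical, and the 50% cutoff for calling a resume "substantially lower" is an arbitrary illustrative threshold, not a value from this report.

```python
import json


def first_turn_usage(lines):
    """Return (cache_read, cache_creation) from the first line that carries usage."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        message = entry.get("message")
        usage = message.get("usage") if isinstance(message, dict) else None
        if usage:
            return (usage.get("cache_read_input_tokens", 0),
                    usage.get("cache_creation_input_tokens", 0))
    return None


def hit_rate(cache_read: int, cache_creation: int) -> float:
    total = cache_read + cache_creation
    return cache_read / total if total else 0.0


def reproduced(resume_rate: float, baseline_rate: float,
               min_baseline: float = 0.95, max_resume: float = 0.50) -> bool:
    """True when the control is healthy and the resumed turn is far below it.

    min_baseline follows step 4; max_resume is an assumed cutoff.
    """
    return baseline_rate >= min_baseline and resume_rate <= max_resume


# Example with the 4.1-minute resume and a 4.0-minute baseline row from this report.
resume_rate = hit_rate(16_702, 23_849)
baseline_rate = hit_rate(107_456, 3_209)
print(f"resume {resume_rate:.1%}, baseline {baseline_rate:.1%}, "
      f"reproduced={reproduced(resume_rate, baseline_rate)}")
```

To use it on real data, pass an open file handle for the resumed session's jsonl to `first_turn_usage` and feed the result into `hit_rate`.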

Conclusion

The deferred_tools_delta-based cache-destruction bug is reproduced on v2.1.116. The TTL-matched control-pair methodology structurally supports the conclusion that the session-resume event itself is the cause of cache destruction, by removing TTL expiry as a variable. The hit-rate deltas of −56 pp at ~4 minutes and −99 pp at ~28 minutes represent functional cache invalidation, not mere performance degradation.

The patterns most materially affected by this issue are as follows. First, long-running agents where session restarts due to /compact or errors are frequent. Second, monitoring agents with overnight auto-restart structures. Third, complex harness environments maintaining contexts of 100k tokens or more.

The fact that Anthropic attempted a patch in v2.1.90–92 confirms the issue is real. Reproduction on v2.1.116 suggests either that the fix was incomplete or that a code path that re-triggers the behavior under certain environment combinations remains. It is expected that issue #51764 will be maintained in a trackable state until this problem is resolved.


References

  1. GitHub Issue #42338 — Original report (closed, locked): https://github.com/anthropics/claude-code/issues/42338
  2. GitHub Issue #34629 — --print --resume regression since v2.1.69: https://github.com/anthropics/claude-code/issues/34629
  3. GitHub Issue #46829 — Silent 1h→5m default TTL regression: https://github.com/anthropics/claude-code/issues/46829
  4. GitHub Issue #51764 — This report (open): https://github.com/anthropics/claude-code/issues/51764
  5. ArkNill — Claude Code cache analysis: https://github.com/ArkNill/claude-code-cache-analysis
  6. simpolism — Breaking attachment set analysis gist: https://gist.github.com/simpolism/302621e661f462f3e78684d96bf307ba
  7. Anthropic Prompt Caching official documentation: https://platform.claude.com/docs/en/build-with-claude/prompt-caching
