All Agents Went Silent Simultaneously — Claude Code OAuth Token Expiry: Outage Analysis and Recovery

Root cause analysis of an extended-period outage and five prevention measures


Key Summary

  • Multiple agents connected to Telegram and Discord stopped responding simultaneously for an extended period — cause: Claude Code OAuth token expiry
  • This is not an isolated case — GitHub Issue #12447 has 24+ comments and 22+ upvotes; no official Anthropic response as of the time of filing
  • Mitigation: long-lived token via claude setup-token (1-year validity) + watchdog auth checks + remote re-authentication capability

Background

Operating multiple agents around the clock exposes infrastructure-level failures that are quieter and more destructive than code bugs.

In this incident, every agent connected to Telegram and Discord stopped responding at the same time. tmux sessions were alive and messages were being received, but all Claude API calls were failing silently.

The anomaly was only detected after an extended period had elapsed.

The root cause was straightforward: OAuth token expiry.


Analysis

1. Failure Timeline

Phase State
Outage onset All agents stop responding
Outage duration Extended period of undetected silence
Recovery claude /login re-authentication → full restart → restored

The symptoms appeared severe, but the cause was a single expired auth token.


2. Why This Happens

The default authentication method for Claude Code under a Max subscription is OAuth. The token TTL is 2–4 hours, and automatic refresh can silently fail.

When refresh fails, every active session loses authentication simultaneously.

Three structural weaknesses contribute to this:

  1. No expiry warning is issued before the token expires
  2. There is no way to query the token's remaining TTL
  3. All in-flight tasks are left in an incomplete state when expiry occurs

3. Community Reports — GitHub Issue Status

This failure mode has been widely reported on GitHub.

Issue Description Status
#12447 OAuth token expiry interrupts autonomous workflows 24+ comments, 22+ upvotes
#33811 Both login and logout return 401 after token expiry Open
#36911 Token expires multiple times per day Open
#19456 macOS Keychain permission error prevents refresh Open

No official Anthropic response exists as of the time these issues were filed.


4. Resolution: claude setup-token — Long-Lived Token

Claude Code provides a setup-token command that generates an OAuth token valid for one year.

claude setup-token

The generated token follows the format sk-ant-oat01-xxxxx and is applied via environment variable:

export CLAUDE_CODE_OAUTH_TOKEN=sk-ant-oat01-xxxxx

Prerequisites:

  1. A Claude Pro or Max subscription is required
  2. Token generation is CLI-only (not available in the Desktop app)
  3. The token must be stored securely

5. Claude Code Authentication Priority (Official Docs)

Claude Code resolves authentication in the following order. Select the method appropriate for your environment.

Priority Method Description
1 Cloud Provider Bedrock / Vertex / Foundry environment variables
2 ANTHROPIC_AUTH_TOKEN Bearer token (for LLM gateways)
3 ANTHROPIC_API_KEY Console API key
4 apiKeyHelper Dynamic / rotating credential script
5 OAuth (/login) Default for Pro / Max / Team / Enterprise subscribers

Priority 5 (OAuth) is the default. Any subscriber without explicit configuration lands here automatically — and this is the source of the expiry failure.


6. Prevention — Five Measures

After recovery, the following five controls were implemented to prevent recurrence.

  1. Run claude setup-token → set a long-lived auth token (1-year validity)

  2. Add an auth check to the watchdog → immediate alert on expiry detection

  3. /auth command → remotely verify authentication status from Telegram or Discord

  4. /login command → issue a remote re-authentication URL (mobile tap-to-authenticate)

  5. Claude Auth status on the dashboard → single-pane visibility into auth state across all agents


Architectural Implications

  • Authentication is a single point of failure (SPOF). When multiple agents share the same OAuth token, a single expiry takes down the entire fleet.

  • The industry-standard pattern is dual-key rotation. A Blue/Green approach — where the old and new keys are simultaneously valid for a short overlap — is the safe model. Claude Code does not support this natively, making the long-lived token the practical workaround.

  • Without a watchdog, outages go undetected. If there is no alert when an agent goes silent, an extended period of non-responsiveness can pass unnoticed. A watchdog with an auth check closes this gap.


Conclusion

Production automation systems fail. The measure of a resilient system is not whether failures occur, but whether recovery is fast and recurrence is structurally prevented. This incident demonstrates how an invisible dependency — authentication — can become the single point of failure for an entire fleet.

For anyone running Claude Code agents around the clock: claude setup-token is not optional.


Sources: - GitHub Issue #12447 — OAuth token expiration disrupts autonomous workflows - GitHub Issue #33811 — OAuth token expired, no recovery path - Claude Code Authentication Docs

댓글

이 블로그의 인기 게시물

Agent Memory Engine (2/10) — Building an AI Agent Memory System with SQLite Alone

"ML Foundations (9/9) — PyTorch vs TensorFlow, and the Road to Local LLMs"

"RAG Core Study (14/26) — Evaluation Sets with RAGAS & DeepEval"

"ML Foundations (8/9) — Deep Learning Architectures: CNN, RNN, Attention"

"ML Foundations (7/9) — Deep Learning Training: Optimizers, Regularization, Initialization"

OpenClaw to Hermes Migration (2/13) — What to Preserve, Partially Port, or Discard

AI Agents I Built (5/7) — Building an Automated Blogger API Publishing System