Agent Self-Improvement Harness (7/12) — Heartbeat v2: Multi-Mode State Machine and Escalation

Designing a notification system as a state machine


Key Summary

What this post covers:

  • Designing Heartbeat as a state machine: decomposing "alive" into a set of modes and state transitions.
  • Locking escalation to explicit, enumerated reasons: eliminating ambiguous alerts and treating "unknown" as a first-class citizen.
  • Reducing false positives with a single parameter change: raising the cold-start timeout from 60s to 120s eliminates model-loading timeouts.
  • Proactive Preferences feedback loop: feeding user response and non-response into an EMA to dynamically adjust escalation thresholds.

v1 Limitations and v2 Design Goals

The v1 heartbeat operates on a binary state: alive or dead. Its structural limitation is that it cannot represent "alive but abnormal". As a result, alerts accumulate in which an anomaly is detected but the root cause is unclear.

v2 has three design goals:

  1. Decompose state into the product of state × context
  2. Fix alert conditions to explicit enums
  3. Incorporate user response as a learning signal

Technique 1: Multi-Mode State Machine

v2 modes follow the {state}.{context} naming convention. The exceptions are the escalation cluster and the unknown fallback, which use bare names.

State Cluster   Modes
-------------   -----
idle            idle.normal, idle.degraded, idle.silent
working         working.normal, working.slow, working.stuck
recovering      recovering.from-crash, recovering.from-quota, recovering.from-network
escalation      escalating, escalated, cooling-down
maintenance     maintenance.scheduled, maintenance.unplanned
fallback        unknown

How It Works

What matters is not the number of modes but the state transition graph. Example: working.normal → working.slow → working.stuck → escalating. Alerts are emitted only at state transition points, not at arbitrary intervals. This constraint structurally prevents repeated alerts for the same persisted state.
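
The transition-only alert rule can be sketched as follows. This is a minimal illustration, not the post's implementation: the `ALLOWED` graph contains only a handful of edges (the example path from the text plus a few assumed ones, including the recovery edges out of `unknown`), and the `Heartbeat` class and `observe` method are hypothetical names.

```python
from dataclasses import dataclass, field

# Hypothetical transition graph. Only the path
# working.normal -> working.slow -> working.stuck -> escalating
# comes from the text; every other edge is an assumption.
ALLOWED = {
    "working.normal": {"working.slow", "idle.normal"},
    "working.slow": {"working.stuck", "working.normal"},
    "working.stuck": {"escalating", "working.normal"},
    "escalating": {"escalated", "cooling-down"},
    "unknown": {"idle.normal", "working.normal"},
}


@dataclass
class Heartbeat:
    mode: str = "working.normal"
    alerts: list = field(default_factory=list)

    def observe(self, new_mode: str) -> bool:
        """Record an observed mode; return True if an alert was emitted."""
        # An edge that is not in the graph is treated as entry into "unknown".
        if new_mode != self.mode and new_mode not in ALLOWED.get(self.mode, set()):
            new_mode = "unknown"
        # A persisted state never re-alerts.
        if new_mode == self.mode:
            return False
        self.alerts.append((self.mode, new_mode))
        self.mode = new_mode
        return True


hb = Heartbeat()
hb.observe("working.slow")   # transition: one alert
hb.observe("working.slow")   # same state again: no alert
hb.observe("working.stuck")  # transition: one alert
```

Because alerts are appended only inside the transition branch, polling the same degraded state a hundred times still yields one alert.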

Technique 2: Multi-Stage Escalation Reason Enum

Alert conditions are locked to a set of explicit, enumerated reasons:

  1. Quota exceeds 80%
  2. Quota exceeds 100%
  3. Same error repeated 3 times
  4. Response time p95 threshold exceeded
  5. Cron job fails on 2 consecutive runs
  6. Memory directory size runaway
  7. Embedding server unresponsive
  8. External API returning 5xx consecutively
  9. Self-review blocking pattern detected
  10. Retain-tag validation failure rate spike
  11. Zero user response over extended period
  12. New agent self-diagnosis failure
  13. Entry into unknown mode
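
One way to lock alerts to this list is a closed enum plus an emit function that rejects anything else. The sketch below assumes Python's standard `enum` module; the member names and the `escalate` helper are illustrative and do not come from the post.

```python
from enum import Enum, unique


@unique
class EscalationReason(Enum):
    # Values mirror the numbering of the list above; names are illustrative.
    QUOTA_OVER_80 = 1
    QUOTA_OVER_100 = 2
    REPEATED_ERROR_3X = 3
    P95_LATENCY_EXCEEDED = 4
    CRON_FAILED_2X = 5
    MEMORY_DIR_RUNAWAY = 6
    EMBEDDING_SERVER_UNRESPONSIVE = 7
    EXTERNAL_API_5XX = 8
    SELF_REVIEW_BLOCKING = 9
    RETAIN_TAG_FAILURE_SPIKE = 10
    NO_USER_RESPONSE = 11
    SELF_DIAGNOSIS_FAILURE = 12
    UNKNOWN_MODE_ENTRY = 13


def escalate(reason: EscalationReason, detail: str = "") -> dict:
    """Build an alert payload; free-form reasons are rejected."""
    if not isinstance(reason, EscalationReason):
        raise TypeError("escalation requires an enumerated reason")
    return {"reason": reason.name, "code": reason.value, "detail": detail}


alert = escalate(EscalationReason.CRON_FAILED_2X, "nightly-sync")
```

Passing a plain string such as `"disk looks full"` raises `TypeError`, so an alert without an enumerated reason cannot be emitted by accident.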

The Role of Reason 13

Reason 13 is the most critical entry by design. It is the mechanism that promotes indeterminate state to a first-class citizen. A system that stays silent when classification fails carries greater latent risk than a system that fails on classifiable conditions. Designating unknown entry as an escalation reason establishes a path for the system to report "I don't know what I don't know."
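
A sketch of how classification failure can be made loud rather than silent. The classifier rules, the signal keys (`crashed`, `busy`, `latency_s`), and the 5-second slow threshold are all assumptions made for illustration; the point is only that every unmatched input lands in `unknown` and that `unknown` always carries an escalation reason.

```python
SLOW_LATENCY_S = 5.0  # hypothetical threshold, not from the post


def classify(signals: dict) -> str:
    """Map raw signals to a mode; anything unmatched becomes 'unknown'."""
    if signals.get("crashed"):
        return "recovering.from-crash"
    if signals.get("busy"):
        latency = signals.get("latency_s")
        if latency is None:
            return "unknown"  # busy but unmeasurable: do not guess
        return "working.slow" if latency > SLOW_LATENCY_S else "working.normal"
    if signals.get("busy") is False:
        return "idle.normal"
    return "unknown"


def check(signals: dict) -> tuple:
    """Return (mode, escalation reason or None)."""
    mode = classify(signals)
    reason = "UNKNOWN_MODE_ENTRY" if mode == "unknown" else None
    return mode, reason


empty = check({})                                # no signals at all
normal = check({"busy": True, "latency_s": 1.0})  # classifiable input
```

The final `return "unknown"` is the important line: there is no code path where the classifier returns nothing or silently defaults to a healthy mode.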

Technique 3: Cold-Start Timeout Tuning

The dominant source of false-positive alerts was not complex logic but a single timeout value.

  • Symptom: First heartbeat call times out → classified as no response → escalation
  • Root cause: Model cold-start loading time (initial weight loading + warmup) exceeds the default 60s timeout
  • Fix: Separate cold-start-specific timeout, raised from 60s → 120s
  • Result: False-positives on this path eliminated

Generalizable Pattern

The principle that generalizes from this finding: cold and warm path timeouts must be separated into distinct constants. Handling both paths with a single timeout value means optimizing one path degrades the other. A full analysis of 60 operational failures is covered in Part 16.
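
The cold/warm split can be sketched as two constants and a flag that flips after the first successful call. The 120s cold value is from the text; keeping the warm path at the previous 60s default is an assumption, as are the `HeartbeatClient` class and its `probe` callable.

```python
COLD_START_TIMEOUT_S = 120.0  # from the post: raised from 60s
WARM_TIMEOUT_S = 60.0         # assumption: warm path keeps the old default


class HeartbeatClient:
    def __init__(self, probe):
        # probe is a hypothetical callable: probe(timeout_s) -> bool
        self._probe = probe
        self._warmed_up = False

    def timeout_for_next_call(self) -> float:
        return WARM_TIMEOUT_S if self._warmed_up else COLD_START_TIMEOUT_S

    def beat(self) -> bool:
        ok = self._probe(self.timeout_for_next_call())
        if ok:
            # Only a successful call proves the model is loaded.
            self._warmed_up = True
        return ok


seen = []


def fake_probe(timeout_s: float) -> bool:
    seen.append(timeout_s)
    return True


client = HeartbeatClient(fake_probe)
client.beat()
client.beat()
```

With separate constants, tightening the warm timeout later does not reintroduce cold-start false positives.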

Technique 4: Proactive Preferences Feedback Loop

Rather than keeping escalation thresholds static, user response is used as a signal to adjust them dynamically.

Input Signals

  • No response / "quiet" signal: raise the escalation threshold for that mode
  • "Why wasn't I notified" signal: lower the escalation threshold for that mode

Learning Parameters

  • Exponential Moving Average (EMA) + 14-day window
  • Learning rate too fast → alert misses (false-negatives)
  • Learning rate too slow → repeated user correction requests
  • A 14-day window balances convergence and responsiveness
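
The loop can be sketched as an EMA over a signed feedback signal. Several things here are assumptions rather than details from the post: the 14-day window is mapped to a smoothing factor with the common span convention alpha = 2 / (N + 1), feedback is applied as one update per event, and the threshold is allowed to move at most ±50% around its base value. The `ThresholdLearner` class and the feedback labels are hypothetical.

```python
SPAN_DAYS = 14
ALPHA = 2 / (SPAN_DAYS + 1)  # assumed mapping from window length to EMA factor

# +1 pushes the threshold up (fewer alerts), -1 pushes it down (more alerts).
FEEDBACK_SIGNAL = {"quiet": 1.0, "no_response": 1.0, "missed": -1.0}


class ThresholdLearner:
    def __init__(self, base_threshold: float, max_shift: float = 0.5):
        self.base = base_threshold
        self.max_shift = max_shift  # hypothetical bound on the adjustment
        self.score = 0.0            # EMA of feedback, stays within [-1, 1]

    @property
    def threshold(self) -> float:
        return self.base * (1 + self.max_shift * self.score)

    def update(self, feedback: str) -> float:
        signal = FEEDBACK_SIGNAL[feedback]
        self.score = ALPHA * signal + (1 - ALPHA) * self.score
        return self.threshold


learner = ThresholdLearner(base_threshold=100.0)
after_quiet = learner.update("quiet")    # threshold rises above 100
after_missed = learner.update("missed")  # threshold falls below 100
```

A larger `ALPHA` reacts faster to each signal and risks suppressing alerts after a few "quiet" responses; a smaller one takes longer to stop repeating an unwanted alert. That is the trade-off the bullets above describe.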

Measured Outcome

EMA-based convergence structurally eliminates repeated threshold-tuning requests compared to static thresholds. However, the first two weeks of the learning period require significant user feedback — that upfront cost is the price for a quieter system afterward.

Limitations and Porting Direction

Current Limitations

  • Potential misbehavior during the initial 2-week learning period
  • High frequency of reason-13 (unknown) entries increases alert fatigue — sub-classification of unknown required
  • EMA window length (14 days) is an empirical constant; re-tuning required per domain

Hermes Port

  • Heartbeat trigger: cron + on_turn_start hook
  • State machine / escalation enum: ported as-is
  • Proactive loop: MemoryProvider.on_memory_write records user response patterns to memory → escalation threshold calculation references this memory

Applicability and Open Questions

Where This Design Applies

  • Systems where alert frequency directly impacts user satisfaction
  • Operational environments where state classification is feasible (observability in place)
  • Channels where user feedback can be collected

Open Questions

  • How far can the unknown mode entry rate be reduced?
  • Is there a better learning curve than EMA (e.g., Kalman filter)?
  • In multi-user environments, how should individual thresholds be separated from shared thresholds?

The essence of a good notification system reduces to: can you explicitly specify the conditions under which you will not alert? The mode set and escalation enum are that specification. Cold-start tuning and the EMA feedback loop are the mechanisms that protect that specification from misfiring.

