Agent Operations Retrospective (3/7) — Token Runaway to Observation Mode: Measurement-Driven Redesign

What This Post Covers

This is a record of rebuilding an agent runtime (Hermes, hereafter HM) after a token runaway event — following the sequence measure → revert → redesign → observe. Three pieces of information are conveyed:

  1. The repetition ratio metric for quantifying token runaway, and how to measure it
  2. A comparison of two design options in a Strangler Fig migration — structural enforcement vs. behavioral single line of defense — and the reasoning behind the choice
  3. The KPI table for Phase 0 (7-day observation), including the limits of single-sample interpretation

This is not a post presenting conclusions. It is a guide organizing patterns and measurement formats applicable at the in-progress validation stage.

1. Measuring Token Runaway — The Repetition Ratio Metric

HM v0.8.0 consumed 6.5M tokens in a single session: 5 user inputs generated 111 tool calls. The repetition ratio extracted from this event was 42.79×.

The repetition ratio is defined as "the multiplier of tokens re-consumed internally per unit of user input." A higher value means internal loops are over-spending relative to user intent. This value was calculated by HM itself from internal message logs after the incident, and is recorded verbatim without post-processing.
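The definition above can be sketched as follows. This is a minimal illustration, assuming the ratio is the number of tokens spent on internal (non-user) messages divided by the number of tokens in user inputs. The post does not give HM's exact formula or log schema, so the `Message` fields and the `repetition_ratio` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Message:
    """One logged message. Field names are hypothetical, not HM's real schema."""
    role: str    # "user", "assistant", or "tool"
    tokens: int  # tokens consumed by this message, including re-sent context


def repetition_ratio(messages: list[Message]) -> float:
    """Internal (non-user) tokens divided by user-input tokens (assumed formula)."""
    user_tokens = sum(m.tokens for m in messages if m.role == "user")
    if user_tokens == 0:
        raise ValueError("no user input tokens; the ratio is undefined")
    internal_tokens = sum(m.tokens for m in messages if m.role != "user")
    return internal_tokens / user_tokens


# Illustrative numbers only, not data from the incident.
session = [
    Message("user", 100),
    Message("assistant", 400),
    Message("tool", 1500),
    Message("assistant", 500),
]
ratio = repetition_ratio(session)
print(f"{ratio:.2f}x")
```

Whichever formula is used, the comparison only holds if the same one is applied before and after the redesign.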

The utility of this metric is straightforward: it becomes a baseline for comparing pre- and post-redesign on the same axis, rather than tracing causes after a runaway. A measurable runaway is a cost problem; an unmeasurable one is a trust problem. If you don't know where the next runaway breaks out, you can't fix it.

2. Migration Design — Strangler Fig and Where the First Attempt Failed

The initial design was a one-way migration of functionality running on top of OpenClaw (hereafter OC) to HM. OC is a system that stably holds a security layer, self-improvement loop, and part of the harness. Following the Strangler Fig pattern, the setup compared whether HM produced the same output for the same inputs, without shutting OC down.

The first attempt failed at two points:

  • Insufficient call control: HM v0.8.0 could not block tool calls from recursively re-calling themselves at the runtime level.
  • Delayed measurement infrastructure: The measurement tooling for baseline comparison was bolted on after the fact, so data didn't accumulate at the same pace as the migration.
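To make the first failure point concrete, here is a generic sketch of the kind of runtime-level guard that was missing: a context manager that caps nesting depth and total tool calls per user input. It is not HM's or OC's mechanism. `ToolCallGuard`, `max_depth`, and `max_calls` are hypothetical names and limits.

```python
class ToolCallLimitExceeded(RuntimeError):
    """Raised when a tool call would exceed the configured limits."""


class ToolCallGuard:
    """Tracks nesting depth and total count of tool calls for one user input."""

    def __init__(self, max_depth: int = 3, max_calls: int = 20) -> None:
        self.max_depth = max_depth
        self.max_calls = max_calls
        self.depth = 0
        self.calls = 0

    def __enter__(self) -> "ToolCallGuard":
        # Refuse the call before it runs if either limit is already reached.
        if self.depth >= self.max_depth:
            raise ToolCallLimitExceeded(f"depth limit {self.max_depth} reached")
        if self.calls >= self.max_calls:
            raise ToolCallLimitExceeded(f"call budget {self.max_calls} spent")
        self.depth += 1
        self.calls += 1
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Unwind the depth counter whether the call succeeded or failed.
        self.depth -= 1


def recursive_tool(guard: ToolCallGuard) -> int:
    """A stand-in tool that keeps re-calling itself until the guard stops it."""
    with guard:
        return 1 + recursive_tool(guard)


guard = ToolCallGuard(max_depth=3)
try:
    recursive_tool(guard)
    blocked = False
except ToolCallLimitExceeded:
    blocked = True

print(blocked, guard.calls)
```

The guard rejects the fourth nested call, so the loop ends after three calls instead of running until the token budget is gone.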

Given this situation, the options were "keep tuning HM" or "revert to OC and redesign." The former was judged infeasible, because tuning without a comparison baseline means improvement cannot be verified, so the latter was chosen.

3. Post-Revert Redesign — Structural Enforcement vs. Behavioral Single Line of Defense

After reverting to OC and continuing operations on top of it, HM was redesigned in parallel. Two design options were compared at this stage.

Initial Design: Structural Enforcement

This direction blocks the main agent from calling tools directly at the runtime level, forcing delegation. It is theoretically clean. In empirical sessions, however, sub-agents sometimes received zero tools. With no tools available, the model hallucinated external information: specifically, a non-existent PR number, formatted as what appeared to be a reference to the NousResearch/hermes-agent repository.

The hallucination was caught before publishing, but in a big-bang migration it would have gone directly into the production environment.

Post-Redesign: Behavioral Single Line of Defense

Instead of removing tools, this direction enforces delegation flow through behavioral rules and detects deviations by measuring results. Switching to this approach finalized the following:

  • Measurement script (session log parsing → KPI calculation)
  • Environment variable and credential isolation
  • Delegation rate definition
  • Gate pass conditions
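The post does not spell out the delegation rate definition or the log format. As one possible reading, the sketch below treats the delegation rate as the share of tool calls issued by sub-agents rather than by the main agent, parsed from a JSONL session log. The log schema, the `actor` values, and the `delegation_rate` function are all assumptions for illustration.

```python
import json
from typing import Optional

# Hypothetical JSONL session log: one event per line.
# The "event" and "actor" fields are assumptions, not HM's real schema.
SAMPLE_LOG = """\
{"event": "tool_call", "actor": "sub_agent", "tool": "search"}
{"event": "tool_call", "actor": "sub_agent", "tool": "read_file"}
{"event": "message", "actor": "main_agent"}
{"event": "tool_call", "actor": "main_agent", "tool": "search"}
{"event": "tool_call", "actor": "sub_agent", "tool": "write_file"}
"""


def delegation_rate(log_text: str) -> Optional[float]:
    """Share of tool calls issued by sub-agents; None if there were no tool calls."""
    delegated = 0
    total = 0
    for line in log_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        event = json.loads(line)
        if event.get("event") != "tool_call":
            continue  # only tool calls count toward the rate
        total += 1
        if event.get("actor") == "sub_agent":
            delegated += 1
    return delegated / total if total else None


rate = delegation_rate(SAMPLE_LOG)
print(rate)
```

Returning `None` for a session with no tool calls keeps "nothing to delegate" distinct from "nothing was delegated", which matters when a single day's value is read in isolation.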

Cost note: The work between the initial design and the post-redesign switch was the most expensive phase. That said, the hallucinated PR number was discovered during this transition, which justified the switching cost.

4. KPI Measurement at the Retry Stage

After redesign, HM progressed from v0.8.0 → v0.9.0 → v0.10.0, with 411 commits applied in total. At this version, the Stage 1 baseline repetition ratio measured 23.80× (approximately 44% reduction from the runaway peak of 42.79×).

First measurement day KPIs:

#  | KPI                     | Value         | Target            | Status
1  | Repetition ratio        | 13.82×        | < 20×             | OK
2  | Schema resent / session | 29,832 tokens | < 200k tokens     | OK
3  | Delegation rate         | 1.0           | ≥ 0.7             | OK
4  | Crash count (24h)       | 0             | 0                 | OK
5  | Daily cost              | $0            | ≤ $1/day          | OK
6  | Cache hit rate          | 25.64%        | ≥ 40% (secondary) | Miss

On the repetition ratio, 42.79× → 13.82× is approximately 67.7% reduction. However, this reduction is not from "migrating to HM" — it is from "applying the behavioral single line of defense on top of the redesigned HM." Failing to keep this separate from the initial structural enforcement numbers will invert the interpretation.
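The reduction figures can be checked with a few lines of arithmetic. The three ratios are the ones reported in this post; the `reduction_pct` helper is only an illustration.

```python
def reduction_pct(before: float, after: float) -> float:
    """Percentage reduction going from `before` to `after`."""
    return (1 - after / before) * 100


PEAK = 42.79    # runaway peak on v0.8.0
STAGE1 = 23.80  # Stage 1 baseline after the redesign
DAY1 = 13.82    # first measurement day

peak_to_stage1 = reduction_pct(PEAK, STAGE1)
peak_to_day1 = reduction_pct(PEAK, DAY1)

print(f"peak -> Stage 1: {peak_to_stage1:.1f}%")  # 44.4%
print(f"peak -> Day 1:   {peak_to_day1:.1f}%")    # 67.7%
```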

Cache hit rate at 25.64% is a secondary KPI — it does not block gate passage but is currently below target. Auto-cache is estimated to be only partially operational; re-measurement is planned after the next isolation work.
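The split between blocking and secondary KPIs can be expressed as a small check. This sketch assumes the gate passes when every blocking KPI meets its target and that secondary KPIs are only reported. The values and targets come from the Day 1 table; the `Kpi` class and `evaluate` function are hypothetical, and a pass on one day's sample is not a gate decision.

```python
import operator
from dataclasses import dataclass
from typing import Callable


@dataclass
class Kpi:
    name: str
    value: float
    target: float
    compare: Callable[[float, float], bool]  # called as compare(value, target)
    blocking: bool = True  # secondary KPIs do not block the gate


# Values and targets taken from the Day 1 table.
DAY1_KPIS = [
    Kpi("Repetition ratio", 13.82, 20.0, operator.lt),
    Kpi("Schema resent / session (tokens)", 29_832, 200_000, operator.lt),
    Kpi("Delegation rate", 1.0, 0.7, operator.ge),
    Kpi("Crash count (24h)", 0, 0, operator.eq),
    Kpi("Daily cost (USD)", 0.0, 1.0, operator.le),
    Kpi("Cache hit rate (%)", 25.64, 40.0, operator.ge, blocking=False),
]


def evaluate(kpis: list[Kpi]) -> tuple[bool, list[str]]:
    """Return (all blocking KPIs met, names of every missed KPI)."""
    missed = [k for k in kpis if not k.compare(k.value, k.target)]
    gate_ok = not any(k.blocking for k in missed)
    return gate_ok, [k.name for k in missed]


gate_ok, missed = evaluate(DAY1_KPIS)
print(gate_ok, missed)
```

Keeping the `blocking` flag in the data rather than in the logic means the cache hit rate can later be promoted to a gate condition by changing one value.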

5. Limitations — Single-Sample Interpretation and Day 2 Regression

Re-measurement the following day with the same measurement script:

#  | KPI              | Day 1  | Day 2
1  | Repetition ratio | 13.82× | 16.22×
3  | Delegation rate  | 1.0    | 0.0
4  | Crash count      | 0      | 1

The repetition ratio rose from 13.82× to 16.22×, the delegation rate dropped from 1.0 to 0.0, and 1 crash occurred. The delegation rate swing is likely sample variance, but this cannot be confirmed until 7 days of data are available. The crash cause has not been analyzed yet.

Two things that must be recorded honestly here:

  1. Both Day 1 and Day 2 are single samples. Whether values will hold cannot be asserted until 7 days of baseline accumulate.
  2. Day 1 results alone are not sufficient grounds to claim the retry succeeded — Day 2 immediately showed regression.

HM is currently in Phase 0, the 7-day observation mode. The purpose of Phase 0 is to observe, not to fix. The primary goal is to build a baseline that separates stable ranges from unstable ones by logging KPIs daily with the same measurement script. The next stage gate evaluation can only proceed after these 7 days, and one decision it requires is how to handle the cache hit rate falling short of its 40% target.
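A daily baseline of this kind can be summarized with the standard library. The sketch below reports the range, mean, and spread of the samples collected so far and flags the baseline as incomplete until 7 days exist. The two samples are the Day 1 and Day 2 repetition ratios from this post. The `baseline` function is hypothetical, and no stability threshold is assumed because the post does not define one.

```python
from statistics import mean, pstdev

REQUIRED_DAYS = 7  # Phase 0 observation window

# Only Day 1 and Day 2 exist so far.
repetition_ratio_samples = [13.82, 16.22]


def baseline(samples: list[float], required: int = REQUIRED_DAYS) -> dict:
    """Summarize daily samples; mark the baseline incomplete until `required` days."""
    if not samples:
        raise ValueError("no samples collected yet")
    return {
        "days": len(samples),
        "complete": len(samples) >= required,
        "min": min(samples),
        "max": max(samples),
        "mean": mean(samples),
        "spread": pstdev(samples),  # population standard deviation
    }


summary = baseline(repetition_ratio_samples)
print(summary)
```

With two samples the summary is marked incomplete, which matches the rule that neither day can be read as a trend on its own.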

Applicable Patterns and Open Questions

Three extractable patterns from this post:

  • Repetition ratio as a metric: Measuring internal token consumption per unit of user input as a multiplier in an agent runtime enables comparing runaway and improvement on the same axis.
  • Structural enforcement vs. behavioral line of defense: Structural enforcement looks clean but can trigger hallucinations through unintended tool blocking. The behavioral single line of defense is safer when combined with measurement.
  • Parallel redesign after revert: When a first attempt under Strangler Fig fails, running a redesign alongside the original system without shutting it down creates room to catch hallucinations and regressions before publishing.

Open questions:

  • Is the Day 2 delegation rate swing (1.0 → 0.0) sample variance or structural deviation?
  • Should the cache hit rate miss be included as a gate condition or remain a secondary KPI?
  • To what level is detection lag in the behavioral single line of defense acceptable?

Once the 7-day observation completes, a follow-up record will continue with the same measurement format.

Series overview: Series index
