Agent Operations Retrospective (4/7) — Token Guard Hook: Pre-Call Blocking
What This Post Covers
Post-hoc aggregation alone cannot stop token runaway in LLM agents. The numbers in the log after a runaway are accurate — but by that point the cost has already been incurred. This post documents the structure built on top of Hermes's plugin system that intercepts at tool_pre_call, immediately before execution, and blocks the call. Three topics are covered:
- The 3-tier threshold design based on repetition factor, and the rationale behind each number
- The hook system's event model and where Token Guard registers
- The roles of two additional hooks layered on the same system (schema dedup, delegation tracker)
The threshold numbers are operational hypotheses, not empirically validated values. That fact is stated explicitly throughout.
Why Post-Hoc Measurement Is Not Enough
One incident observed in Hermes v0.8.0: 5 inputs expanded to 111 tool calls, reaching 6.5M cumulative tokens. Expressed as a repetition factor relative to input, that is 42.79×: on average, the input was amplified to nearly forty-three times its original size.
The problem is when detection occurs. These numbers came from log aggregation after the session ended. The measurement was accurate, but by the time it was complete, the tokens were already spent. Post-hoc metrics cannot stop runaway. What is required is an intercept point immediately before execution. Token Guard is the plugin built to fulfill that requirement.
Token Guard Design — 3-Tier Threshold Based on Repetition Factor
A single cutoff forces a bad trade: set tight, it blocks legitimate workloads; set loose, it fires so late that it is no different from post-hoc measurement. Token Guard therefore divides the response into three tiers. All thresholds are based on the repetition factor (cumulative tool-call tokens ÷ input tokens).
- T1 (15×) — Warning: Mark in log and notify main orchestrator. No blocking.
- T2 (25×) — Block same-tool re-calls: Reject repeat calls to the same tool only. Calls to other tools pass through. Runaway episodes tend to originate from the pattern of repeatedly invoking the same tool.
- T3 (40×) — Force session termination: Close the session entirely.
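The tier boundaries can be sketched as a small classification function. This is a minimal illustration, not the plugin's actual code: the names `Tier`, `repetition_factor`, and `classify` are hypothetical, and treating each threshold as inclusive (`>=`) is an assumption the post does not state.

```python
from enum import Enum

# Thresholds from the post; operational hypotheses, not validated values.
T1_WARN = 15.0
T2_BLOCK_SAME_TOOL = 25.0
T3_TERMINATE = 40.0


class Tier(Enum):
    NORMAL = "normal"
    T1 = "warn"
    T2 = "block_same_tool"
    T3 = "terminate_session"


def repetition_factor(tool_call_tokens: int, input_tokens: int) -> float:
    """Cumulative tool-call tokens divided by input tokens."""
    if input_tokens <= 0:
        raise ValueError("input_tokens must be positive")
    return tool_call_tokens / input_tokens


def classify(factor: float) -> Tier:
    """Return the highest tier the factor has reached (inclusive bounds assumed)."""
    if factor >= T3_TERMINATE:
        return Tier.T3
    if factor >= T2_BLOCK_SAME_TOOL:
        return Tier.T2
    if factor >= T1_WARN:
        return Tier.T1
    return Tier.NORMAL


print(classify(13.82).name)  # NORMAL
print(classify(42.79).name)  # T3
```

Checking the highest tier first keeps the function a plain cascade, so adding or moving a threshold during re-calibration touches one constant.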
Rationale for Each Threshold Number
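All three numbers are operational hypotheses, not empirically validated values. Two observations informed them: the incident value (42.79×) and the normal operating value (13.82×).
- T2 (25×) rationale: The incident was detected at 42.79×, with a cumulative total of 6.5M tokens. If token volume scales with the repetition factor at fixed input, stopping at 25× corresponds to 25 ÷ 42.79 ≈ 58% of that volume, or roughly 3.8M tokens. Because T2 rejects only same-tool repeats, this is a best-case estimate rather than a guarantee.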
- T1 (15×) rationale: Normal operating repetition factor was observed at 13.82×. T1 sits just above this baseline — capturing the band of "still normal, but anomaly signal possible."
- T3 (40×) rationale: Slightly below the incident value of 42.79×. Terminates before the incident pattern can fully replay.
After the 7-day Phase 0 observation window, the thresholds are subject to re-calibration. The normal workload distribution and T1/T2 trigger frequency from that window will inform revised numbers.
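The T2 arithmetic can be checked in a few lines. This sketch assumes cumulative tokens scale linearly with the repetition factor while input tokens stay fixed; the function name `tokens_at_threshold` is hypothetical.

```python
INCIDENT_FACTOR = 42.79      # repetition factor observed in the incident
INCIDENT_TOKENS = 6_500_000  # cumulative tokens at that point (6.5M, as reported)
T2 = 25.0                    # same-tool blocking threshold


def tokens_at_threshold(threshold: float) -> float:
    """Estimated cumulative tokens when the factor reaches `threshold`.

    Assumes linear scaling with the factor at fixed input tokens.
    """
    return INCIDENT_TOKENS * threshold / INCIDENT_FACTOR


t2_tokens = tokens_at_threshold(T2)
share = T2 / INCIDENT_FACTOR
print(f"{t2_tokens / 1e6:.1f}M tokens, {share:.0%} of the incident volume")
# prints: 3.8M tokens, 58% of the incident volume
```

The estimate is a best case: T2 rejects only repeat calls to the same tool, so calls to other tools could still add tokens beyond this point.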
Hook System — Event Model and Registration Point
Token Guard is not a standalone module — it operates on top of Hermes's plugin system. The plugin system is hook-based: register a callback on a defined event, and the runtime invokes it at that point.
Four events are currently exposed:
- tool_pre_call: immediately before a tool call
- tool_post_call: immediately after a tool call
- session_init: on session start
- session_end: on session end
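The register-then-invoke model can be sketched with a toy registry. `HookRegistry`, `register`, and `emit` are hypothetical names for illustration only; Hermes's actual plugin API is not shown in this post. Only the four event names come from the text.

```python
from collections import defaultdict
from typing import Any, Callable

# The four events named in the post.
EVENTS = ("tool_pre_call", "tool_post_call", "session_init", "session_end")
Callback = Callable[[dict[str, Any]], None]


class HookRegistry:
    """Hypothetical stand-in for a hook-based plugin runtime."""

    def __init__(self) -> None:
        self._hooks: dict[str, list[Callback]] = defaultdict(list)

    def register(self, event: str, callback: Callback) -> None:
        if event not in EVENTS:
            raise ValueError(f"unknown event: {event}")
        self._hooks[event].append(callback)

    def emit(self, event: str, payload: dict[str, Any]) -> None:
        # Callbacks run in registration order.
        for callback in self._hooks[event]:
            callback(payload)


registry = HookRegistry()
seen: list[str] = []
registry.register("tool_pre_call", lambda p: seen.append(f"pre:{p['tool']}"))
registry.register("tool_post_call", lambda p: seen.append(f"post:{p['tool']}"))

# The runtime would emit these around each tool execution.
registry.emit("tool_pre_call", {"tool": "search"})
registry.emit("tool_post_call", {"tool": "search"})
print(seen)  # ['pre:search', 'post:search']
```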
Token Guard registers on tool_pre_call. Blocking only takes effect if the intercept happens before the tool actually executes. The callback performs two operations before each call:
- Read the current session's cumulative tool-call tokens and input tokens, and compute the repetition factor
- Evaluate the tier of the computed value; if T2 or T3, reject the call
tool_post_call is used for measurement updates: after a call completes, it updates the cumulative token count so the next tool_pre_call evaluation reflects the latest state. Implementation details — function signatures, how a rejection signal is propagated — are separated into a dedicated document. This post covers "where it intercepts and what it blocks".
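The pre-call check and post-call update described above might look like the following. This is a sketch under stated assumptions, not the implementation: `SessionState`, `CallRejected`, and both callback signatures are hypothetical, rejection is modeled as an exception, and "same tool" is taken to mean any tool already called in the session.

```python
from dataclasses import dataclass, field

T1, T2, T3 = 15.0, 25.0, 40.0  # thresholds from the post (operational hypotheses)


class CallRejected(Exception):
    """Hypothetical rejection signal; the real mechanism is documented elsewhere."""


@dataclass
class SessionState:
    input_tokens: int  # assumed positive
    tool_call_tokens: int = 0
    tools_called: set[str] = field(default_factory=set)
    terminated: bool = False
    warnings: list[str] = field(default_factory=list)


def on_tool_pre_call(state: SessionState, tool: str) -> None:
    """Before execution: compute the factor, evaluate the tier, reject on T2/T3."""
    factor = state.tool_call_tokens / state.input_tokens
    if factor >= T3:
        state.terminated = True
        raise CallRejected(f"T3: session terminated at {factor:.2f}x")
    if factor >= T2 and tool in state.tools_called:
        raise CallRejected(f"T2: repeat call to {tool!r} blocked at {factor:.2f}x")
    if factor >= T1:
        state.warnings.append(f"T1: {factor:.2f}x before {tool!r}")


def on_tool_post_call(state: SessionState, tool: str, tokens_used: int) -> None:
    """After execution: update totals so the next pre-call check is current."""
    state.tool_call_tokens += tokens_used
    state.tools_called.add(tool)


state = SessionState(input_tokens=1_000)
on_tool_pre_call(state, "search")           # factor 0.0: passes
on_tool_post_call(state, "search", 26_000)  # factor is now 26.0
on_tool_pre_call(state, "read_file")        # other tool: passes, T1 warning logged
try:
    on_tool_pre_call(state, "search")       # same tool at >= 25x: rejected
except CallRejected as exc:
    print(exc)  # T2: repeat call to 'search' blocked at 26.00x
```

Keeping the token update in the post-call hook means the pre-call hook only reads state, which makes the blocking decision easy to reason about in isolation.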
Measurements — Normal Operating Value 13.82× / Just Below T1 at 15×
The most critical validation question is whether the thresholds block normal workloads. If too tight, routine tasks are interrupted and the system becomes non-functional.
First measurement day observations:
- Repetition factor: 13.82×
- T1 threshold: 15×
The normal operating band sits just below T1. No warning triggered; no blocking triggered. However, a single data point cannot be generalized. Re-evaluation after the 7-day distribution observation is required.
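How close "just below" is can be put as a number. A small check, using only the two values reported above:

```python
BASELINE = 13.82  # first-day repetition factor (a single data point)
T1 = 15.0         # warning threshold

# Relative increase in the factor that would be enough to trigger T1.
headroom = T1 / BASELINE - 1
print(f"T1 fires if the factor rises {headroom:.1%} above the first-day value")
# prints: T1 fires if the factor rises 8.5% above the first-day value
```

With a margin this narrow, ordinary day-to-day variation could plausibly trigger warnings, which is one reason the 7-day distribution matters.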
T2 (25×) and T3 (40×) have zero trigger events to date. This admits two interpretations: (1) runaway has not recurred under current operations; (2) the effectiveness of the thresholds is not yet verified. A threshold that has never fired is not a validated threshold.
Other Plugins on the Same Hook System
Two additional hooks are registered on the same plugin system alongside Token Guard.
Schema dedup — blocks repeated transmission of tool schemas within a session. When tool definitions are passed to the LLM, resending the same schema on every call accumulates input tokens in its own right. A schema transmitted once is cached; it is not retransmitted unless it changes. This hook operates by reducing input tokens directly, not by post-hoc token savings.
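The cache-until-changed behavior can be illustrated with a content hash per tool. This is a hypothetical sketch: the class `SchemaDedup` and its `filter` method are invented for illustration, and how the runtime reuses an already transmitted schema is not described in this post.

```python
import hashlib
import json
from typing import Any


class SchemaDedup:
    """Tracks which tool schemas were already sent in the current session."""

    def __init__(self) -> None:
        self._sent: dict[str, str] = {}  # tool name -> digest of last schema sent

    def filter(self, schemas: dict[str, dict[str, Any]]) -> dict[str, dict[str, Any]]:
        """Return only the schemas that are new or changed since the last send."""
        to_send: dict[str, dict[str, Any]] = {}
        for name, schema in schemas.items():
            # sort_keys makes the digest independent of key order.
            payload = json.dumps(schema, sort_keys=True).encode("utf-8")
            digest = hashlib.sha256(payload).hexdigest()
            if self._sent.get(name) != digest:
                to_send[name] = schema
                self._sent[name] = digest
        return to_send


dedup = SchemaDedup()
schemas = {"search": {"type": "object", "properties": {"q": {"type": "string"}}}}

first = dedup.filter(schemas)   # first transmission: schema is sent
second = dedup.filter(schemas)  # unchanged: nothing to resend
schemas["search"]["properties"]["limit"] = {"type": "integer"}
third = dedup.filter(schemas)   # changed: schema is sent again

print(list(first), list(second), list(third))  # ['search'] [] ['search']
```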
Delegation tracker: tracks the rate at which the main orchestrator delegates to sub-agents. It flags the pattern in which the delegation rate declines and the main agent handles heavy tasks directly. When the main agent is a high-cost model (opus-class), handling even simple tasks directly is cost-inefficient. This hook only observes; it does not block.
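An observe-only tracker of this kind could be as simple as a rolling window over recent tasks. The class `DelegationTracker`, the window size, and the `min_rate` cutoff below are all illustrative assumptions; the post gives no numbers for this hook.

```python
from collections import deque


class DelegationTracker:
    """Rolling delegation rate over the most recent tasks (observe only)."""

    def __init__(self, window: int = 20, min_rate: float = 0.5) -> None:
        # window and min_rate are hypothetical values, not from the post.
        self._recent: deque[bool] = deque(maxlen=window)
        self.min_rate = min_rate

    def record(self, delegated: bool) -> None:
        """Record whether the orchestrator delegated the latest task."""
        self._recent.append(delegated)

    @property
    def rate(self) -> float:
        if not self._recent:
            return 1.0  # no data yet: nothing to flag
        return sum(self._recent) / len(self._recent)

    def should_flag(self) -> bool:
        # Flag only once the window is full, to avoid noise from early samples.
        return len(self._recent) == self._recent.maxlen and self.rate < self.min_rate


tracker = DelegationTracker(window=4, min_rate=0.5)
for delegated in (True, False, False, False):
    tracker.record(delegated)

print(tracker.rate, tracker.should_flag())  # 0.25 True
```

The tracker returns a flag and never raises, which matches the observe-only role: the decision about what to do with the signal stays with the operator.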
All three hooks are active simply by registering on the plugin system. The architectural reason for leaving hook slots open is precisely this extensibility.
Limitations and Scope of Applicability
Two claims can be made with confidence:
- An intercept point (tool_pre_call) immediately before execution has been secured, filling the gap that post-hoc measurement could not cover.
- The first-measurement-day normal operating value (13.82×) is positioned just below T1 (15×).
More remains that cannot be claimed.
- T1/T2/T3 thresholds are all operational hypotheses. Not empirically validated. Re-calibration scheduled after Phase 0 7-day observation.
- No T2/T3 trigger events to date. Runaway blocking capability has not yet been demonstrated.
- Single-user environment only. Whether the same thresholds are appropriate for other environments requires separate validation.
Transferable Pattern — The Same Question for Other Metrics
The extractable pattern from this design is the shift from "post-hoc measurement" to "pre-call blocking." The same question applies to other metrics:
- Are there metrics currently viewed only in post-hoc fashion?
- Is there an intercept point available immediately before a call, or at session start?
- Are there metrics where marking alone — without blocking — delivers meaningful signal?
Schema dedup and delegation tracker both originated from that same question. Whether the choice is blocking or observation, securing "a place to intercept" determines the degrees of freedom available for subsequent policy decisions. Threshold re-calibration results and trigger-event distribution after the 7-day observation window will be documented separately.
Open Questions
- What alternative metrics can identify runaway earlier, beyond repetition factor? (Examples: consecutive same-tool call count, session duration relative to input)
- Is workload classification at session_init feasible, enabling preemptive blocking rather than reactive blocking?
- In a multi-user environment, should thresholds be separated per user/workload, or is a single global threshold sufficient?
Series overview: Series index