OpenClaw to Hermes Migration (10/13) — Applying a Strangler Fig Variant in Single-User Environments

What This Post Covers

This post documents a migration technique for running a legacy system and a new system side by side in a single-user (one-operator) environment. The canonical Strangler Fig assumes facade routing and is designed for environments with high volumes of statistically uniform external traffic. In a single-user environment, input is unevenly distributed and traffic splitting carries little statistical meaning, so a variant that removes the facade and retains only "comparison baseline preservation" is a better fit. This post covers how that variant works, its connection to Subset Migration, the measured values, and the limitations confirmed so far.

Three core arguments:

  • The essence of Strangler Fig is not incremental replacement but "preservation of prior state."
  • The sequencing in Subset Migration must be driven by "measurability."
  • A comparison baseline serves two distinct roles: anomaly diagnosis and improvement magnitude verification.

Two systems are used as examples: OpenClaw, the stable system, and Hermes, the new system absorbing it. The documented incident is a token runaway in Hermes v0.8.0 (5 inputs → 111 internal tool calls → 6.5M cumulative tokens, repetition factor 42.79×). This regression event serves as the reference point for validating the variant technique.

Canonical Strangler Fig and the Single-User Variant

The canonical Strangler Fig works as follows. A new system is placed alongside the legacy system, and a facade receives all external requests. The facade routes some requests to the legacy system and some to the new system according to defined rules. Over time, the new system handles an increasing share, and the legacy system withers. Finally, the legacy system is decommissioned.

This pattern operates in environments with many external users and statistically uniform traffic distribution. In a single-user environment, facade routing loses its meaning because input distribution is not uniform. On some days, ten requests may go to one side and zero to the other. Under these conditions, facade-distributed results cannot serve as a statistically valid comparison baseline.

The key change in the variant is removing the facade and replacing it with a shared measurement layer applied simultaneously to both systems.

Given the same input, both systems are measured against identical formulas to check whether they produce equivalent output. Rather than routing requests, the systems share a measurement layer.

In practice, this means the OpenClaw metrics (repetition factor, schema re-sends, delegation, crash, daily cost, cache hit) are measured in Hermes using the same columns and the same formulas. As long as this comparison baseline is active, both systems can be evaluated against a common standard.
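
A minimal sketch of what "same columns, same formulas" could look like in code. The class name, field names, and units below are assumptions made for illustration; the post names the six metrics but does not show its schema or formulas.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class DailyMetrics:
    """One row of the shared measurement layer (field names are illustrative)."""
    system: str            # "OpenClaw" or "Hermes"
    repetition: float      # repetition factor, e.g. 42.79 means 42.79x
    schema_resends: int    # number of schema re-sends
    delegation: float      # sub-agent delegation indicator
    crash: int             # crash count
    daily_cost_usd: float  # daily cost in USD
    cache_hit_pct: float   # cache hit rate in percent


def metric_columns() -> list[str]:
    """Columns compared across systems (everything except the system label)."""
    return [f.name for f in fields(DailyMetrics) if f.name != "system"]


def compare(baseline: DailyMetrics, candidate: DailyMetrics) -> dict[str, float]:
    """Return candidate-minus-baseline per column; both rows share one schema."""
    return {
        name: getattr(candidate, name) - getattr(baseline, name)
        for name in metric_columns()
    }
```

Because both systems are recorded as the same type, a column cannot exist for one system and be missing for the other, which is the property the shared measurement layer depends on.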

The implication of this variant: The essence of Strangler Fig is not "incremental replacement" but "preservation of prior state." Replacing everything at once destroys preservation. Without preservation, comparison cannot be established. Without comparison, the basis for judging improvement or regression disappears.

Subset Migration — Measurability as the Primary Ordering Criterion

Subset Migration is the pattern of migrating systems not all at once, but starting with the most measurable subset. Once a migrated subset operates correctly, the next subset is migrated.

The critical decision in applying this pattern is determining "what is measurable." Migrating an unmeasurable component first means state changes immediately after migration cannot be observed.

The priority order applied:

  1. Memory engine — SQLite-based storage. Input/output schema is well-defined; read/write counts are measurable.
  2. Measurement scripts — Code computing token counts, repetition factors, and cache hit rates. This module must be migrated directly after the memory engine, before anything else, so that all subsequent migration steps can be measured.
  3. Environment variables — Model selection, routing options, and other explicitly declared values.
  4. Credentials — API keys, OAuth tokens. These are static values; migration risk is low.

Components deferred to lower priority were high-variability areas: the tool set and skill triggers. Tools are added and removed frequently; skill trigger keywords change. Migrating high-variability components first means "the effect of migration" and "the effect of tool changes" cannot be separated. Migrating at a point where the measurement baseline is shifting makes regression root-cause analysis impossible.

Determining this ordering required evaluating 34 decision items. Once the dependencies between items are mapped, it is clear that the measurement scripts must come directly after the memory engine and ahead of everything else. Without measurement, the effects of all subsequent migration work remain unobservable.
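
The ordering argument can be sketched as a dependency graph that is sorted topologically. The edges below are hypothetical and only encode the constraints stated above (memory engine first, measurement scripts next, high-variability components last); the 34 decision items themselves are not reproduced.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each key lists what must be migrated before it.
DEPENDS_ON = {
    "memory_engine": set(),
    "measurement_scripts": {"memory_engine"},
    "environment_variables": {"measurement_scripts"},
    "credentials": {"measurement_scripts"},
    # High-variability components are deferred until the static parts are in place.
    "tool_set": {"environment_variables", "credentials"},
    "skill_triggers": {"tool_set"},
}


def migration_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Return one valid migration order; raises graphlib.CycleError on cycles."""
    return list(TopologicalSorter(depends_on).static_order())


order = migration_order(DEPENDS_ON)
print(order)
```

Any order this produces places the measurement scripts before every component whose migration effect needs to be observed. Components without a dependency between them, such as environment variables and credentials here, may come out in either order.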

Migration Phases and Regression Behavior

The full flow breaks into five phases.

Phase 1 — First attempt. Memory engine and measurement scripts are migrated to Hermes first, followed by environment variables and credentials. Hermes v0.8.0 then begins receiving inputs. OpenClaw is frozen and maintained as a preserved baseline.

Phase 2 — Runaway observed. 5 inputs → 111 tool calls → 6.5M tokens, repetition factor 42.79×. Hermes v0.8.0 over-repeated internal tool calls per input unit and re-sent the same schema multiple times, causing cumulative token growth.
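
For scale, the reported figures can be turned into per-input and per-call averages. This is plain arithmetic on the numbers above; "6.5M" is a rounded figure, so the token averages are approximate. These averages are not the repetition factor, whose formula is not given in this post.

```python
inputs = 5
tool_calls = 111
cumulative_tokens = 6_500_000  # "6.5M" as reported; rounded, so results are approximate

calls_per_input = tool_calls / inputs
tokens_per_call = cumulative_tokens / tool_calls
tokens_per_input = cumulative_tokens / inputs

print(f"{calls_per_input:.1f} tool calls per input")              # 22.2
print(f"{tokens_per_call:,.0f} tokens per tool call (approx.)")   # 58,559
print(f"{tokens_per_input:,.0f} tokens per input (approx.)")      # 1,300,000
```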

Phase 3 — Rollback to stable system. Input is redirected back to OpenClaw. This was only possible because the stable system remained alive in a frozen state. If the new system had been deployed as the sole system, no rollback target would have existed, forcing continued operation in a runaway state or a full work stoppage.
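
A minimal sketch of the rollback mechanics, assuming a single switch that decides which system receives input. The post does not say whether the rollback was manual or automated; the class, the guard function, and the limit of 30.0 are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class InputTarget:
    """Tracks which system receives inputs; the stable one is never removed."""
    stable: str = "OpenClaw"
    candidate: str = "Hermes"
    active: str = "Hermes"

    def rollback(self) -> str:
        """Point inputs back at the frozen stable system."""
        self.active = self.stable
        return self.active

    def retry(self) -> str:
        """Point inputs at the candidate again, e.g. after a redesign."""
        self.active = self.candidate
        return self.active


def should_roll_back(repetition: float, limit: float) -> bool:
    """Hypothetical guard: True when the repetition factor exceeds the limit."""
    return repetition > limit


target = InputTarget()
if should_roll_back(42.79, limit=30.0):  # limit chosen only for illustration
    target.rollback()
print(target.active)  # OpenClaw
```

The switch only works because `stable` always names a system that still exists, which is the point the paragraph above makes.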

Phase 4 — Redesign. Design defects in the initial implementation are isolated. The 34 decision items are re-examined, and the direct cause candidates for the runaway (conditions triggering repetition factor spikes, schema re-send origination points) are isolated separately. The version produced by this process is the "post-redesign" version.

Phase 5 — Retry, under observation. Hermes is restarted on the post-redesign version and baseline is measured: 23.80×. Approximately 45% reduction from v0.8.0's 42.79×. However, this is a single baseline data point, and a regression was detected in subsequent operational-day measurements (next section). Phase 5 is therefore "under verification," not "successful."

The structural implication of this flow: the migration did not complete in a single pass. All five phases — first attempt → runaway → rollback → redesign → retry — were only possible because OpenClaw remained alive as the stable system.

Two Roles of the Comparison Baseline

The baseline operated in two distinct modes.

First, anomaly diagnosis. The absolute value of 42.79× alone does not allow severity to be assessed. Placing it alongside the same-formula operational data from OpenClaw reveals the deviation from the mean. Because OpenClaw records were preserved, the diagnosis "this is an anomaly and indicates a system design problem" was possible.
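
One common way to express "deviation from the mean" is a z-score against the preserved history. This is an assumed formulation, not necessarily the one used here, and the numbers in the usage line are synthetic placeholders rather than OpenClaw measurements.

```python
from statistics import mean, stdev


def deviation_from_baseline(value: float, history: list[float]) -> float:
    """Number of sample standard deviations `value` lies from the history mean."""
    if len(history) < 2:
        raise ValueError("need at least two baseline observations")
    spread = stdev(history)
    if spread == 0:
        raise ValueError("baseline has zero spread; deviation is undefined")
    return (value - mean(history)) / spread


# Synthetic numbers purely to show the call shape, not real measurements.
print(deviation_from_baseline(5.0, [1.0, 2.0, 3.0]))  # 3.0
```

Without the preserved history there is nothing to pass as `history`, and the function cannot be called at all. That is the same dependency the paragraph above describes.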

Second, improvement magnitude verification. The retry-phase baseline of 23.80× and v0.8.0's 42.79× are results from the same formula, so comparison is valid: approximately 45% reduction. If the v0.8.0 measurement logs had been discarded at the time of the runaway, 23.80 would become an isolated figure with no reference point. Whether it represents improvement or deterioration would be indeterminate.
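
The reduction figure can be checked directly from the two repetition factors. The function name is illustrative. The exact result is about 44.4%, which the text rounds to "approximately 45%".

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional reduction from `before` to `after` (0.25 means 25% lower)."""
    if before <= 0:
        raise ValueError("`before` must be positive")
    return (before - after) / before


reduction = relative_reduction(42.79, 23.80)
print(f"{reduction:.1%}")  # 44.4%
```

The calculation is only meaningful because both inputs come from the same formula. It says nothing about whether the underlying input samples were comparable.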

Methodological Limitations

The Strangler Fig variant removes facade routing, so it does not run identical inputs through both systems simultaneously for comparison. OpenClaw is frozen; only Hermes receives new inputs. Therefore, the comparison between 23.80 and 42.79 is strictly "applying the same formula to input distributions from different time periods." This is a structural limitation — the input sample is not controlled — and controlled re-measurement will be necessary in a later phase.

First Operational Day and Regression Detection

First operational day KPIs:

  • Repetition: 13.82×
  • Delegation: 1.0
  • Crash: 0
  • Daily cost: $0
  • Cache hit: 25.64%

Repetition came in below the 23.80× baseline, there were no crashes, delegation was 1.0 (the sub-agent was delegated to as intended), and daily cost was $0 (the setup is centered on local models). Up to this point, results were moving in the intended direction.

Second operational day regression:

  • Repetition: 16.22× (up from 13.82 the previous day)
  • Delegation: 0.0 (down from 1.0 the previous day)
  • Crash: 1 (up from 0 the previous day)

Three indicators degraded simultaneously. Root cause is being isolated. This regression is recorded because the purpose of building a comparison baseline is precisely to capture moments like this. Recording only improving days nullifies the baseline's function.
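
The day-over-day check can be sketched as a comparison that knows which direction counts as worse for each KPI. The direction table is an assumption consistent with how the figures above are read (lower repetition and crash counts are better, higher delegation is better). The values are the ones reported for the two operational days.

```python
# Direction of improvement per KPI (assumed, matching the reading above).
LOWER_IS_BETTER = {"repetition": True, "delegation": False, "crash": True}

day1 = {"repetition": 13.82, "delegation": 1.0, "crash": 0}
day2 = {"repetition": 16.22, "delegation": 0.0, "crash": 1}


def degraded_kpis(previous: dict[str, float], current: dict[str, float]) -> list[str]:
    """Return the KPIs that moved in the worse direction since `previous`."""
    worse = []
    for name, lower_is_better in LOWER_IS_BETTER.items():
        delta = current[name] - previous[name]
        if (delta > 0) if lower_is_better else (delta < 0):
            worse.append(name)
    return worse


print(degraded_kpis(day1, day2))  # ['repetition', 'delegation', 'crash']
```

A check like this flags movement between two days only. With this few data points it cannot tell a trend from transient oscillation.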

The system is currently in the Phase 0 observation window. The data set is 1 baseline point + 2 operational-day points. This sample size is insufficient to distinguish a trend from transient oscillation. Until the observation window closes, the 23.80× baseline and subsequent operational KPIs are treated as provisional observations only.

Summary — Applicable Scope and Open Questions

Applying the Strangler Fig and Subset Migration variants in a single-user environment confirmed the following principles at the technique level:

  • Preserve a measurement baseline instead of a facade. Secure both a rollback target and a comparison standard simultaneously.
  • Migrate measurable subsets first. High-variability components are lower priority. Migration effects and tool-change effects must be separable for regression root-cause analysis to be possible.
  • Migration does not complete in a single pass. Design must assume multi-phase iteration; the critical prerequisite for that assumption is the survival of the stable system.

Applicable scope: environments where external traffic is not statistically uniform (internal tooling, personal automation, single-team internal systems); environments where both systems can share identical metrics; environments where inputs can be redirected back to the stable system on regression.

Two open questions remain:

  1. To what extent can a baseline comparison with uncontrolled input samples be used as valid evidence?
  2. Can the regression observed in Phase 0 (Repetition 16.22, Delegation 0.0, Crash 1) be attributed to a single root cause, or is it multi-factorial?

Both are under investigation. Until verification results are available, both the 23.80× baseline and all subsequent operational KPIs are treated as provisional observations only.
