OpenClaw to Hermes Migration (13/13) — Cutover Day: 6-Phase Procedure and Configuration Traps

The required phases, validation checkpoints, and two representative configuration traps that arise from settings coupling when replacing an entire agent stack.


Key Summary

  • Agent cutover follows 6 phases: shut down the existing stack → boot-verify the new stack → apply config → register cron jobs → connect Gateway and messenger → run 24h verification.
  • Two reproducible configuration traps: $HOME pollution in a sandbox environment, and a mismatch between smart_model_routing and the default model.
  • The minimum safe rollback design is a single-line "LaunchAgent + cron restore" command.

Phase 1 — Shut Down the Existing Stack

To avoid port, token, and process conflicts during agent replacement, fully stop the existing stack before proceeding. Since cron, LaunchAgent, and the messenger bot connection can each restart independently, all three must be blocked.

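crontab -l > ~/openclaw_cron_backup.txt   # Back up first; the rollback path restores from this file
crontab -r   # Removes the user's entire crontab, including all OpenClaw entries

launchctl bootout gui/$(id -u) ~/Library/LaunchAgents/com.openclaw.gateway.plist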

Proceeding to Phase 2 without confirming a complete stop causes heartbeat contention from port conflicts during new stack boot. Root-cause isolation becomes difficult in the logs, so "zero residual processes" is the gate condition.
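The "zero residual processes" gate can be sketched as a small check over a process snapshot. This is a minimal sketch, not part of either stack: the function names count_residual and gate_zero_residual are hypothetical, and the pattern "openclaw" assumes the old processes carry that string in their command line.

```shell
# Count lines on stdin that mention the old stack (case-insensitive).
# grep -c prints 0 and exits 1 when nothing matches, so the status is masked.
count_residual() {
  pattern="$1"
  grep -ci -- "$pattern" || true
}

# Gate condition: succeed only when no line on stdin matches the pattern.
gate_zero_residual() {
  n=$(count_residual "$1")
  [ "$n" -eq 0 ]
}
```

One way to use it is to take the snapshot first and then test it, so that the check itself does not appear in the process list: snapshot=$(ps -axo command=); printf '%s\n' "$snapshot" | gate_zero_residual openclaw && echo "safe to enter Phase 2".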


Phase 2 — Boot-Verify the New Stack

Authentication tokens can often be carried over from the existing CLI lineage. Relocate the three token axes: Codex CLI token, OAuth refresh_token, and messenger bot token.

cp ~/.codex/auth.json ~/.hermes/auth.json

grep provider ~/.hermes/config.yaml
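Because auth.json holds credentials, it may be worth making the copy step set owner-only permissions explicitly rather than relying on whatever mode the source file has. The following is a sketch under that assumption; copy_token is a hypothetical helper name, not a Hermes command.

```shell
# copy_token <src> <dst>: copy a credential file and restrict it to the owner.
copy_token() {
  src="$1"; dst="$2"
  mkdir -p "$(dirname "$dst")"
  # install copies the file and sets the mode in one step (600 = owner read/write only).
  install -m 600 "$src" "$dst"
}
```

Usage would mirror the cp line above: copy_token ~/.codex/auth.json ~/.hermes/auth.json.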

After boot, run a self-diagnostic command such as hermes doctor to verify the three axes: identity file, user file, and auth.

SOUL.md     ✓
USER.md     ✓
Auth        ✓

All three must pass before entering Phase 3. If any item fails, correct the file or token and re-verify. Boot-stage errors left unaddressed surface at a far higher cost in later phases.


Phase 3 — Configure config.yaml

Three concerns must be addressed simultaneously in config design: skill load paths, disabled platforms and skills, and routing rules.

skills:
  disabled: 45
  external_dirs:
    - ~/Hermes/.claude/skills

platform_disabled:
  discord: 14

routing:
  smart_model_routing: true
  default_model: gpt-4o-mini   # this setting causes a later issue

provider: openai-codex

external_dirs adds load paths for skill files located outside the default directory. Omitting this prevents Tier 2 skills from loading.

The combination of smart_model_routing: true and gpt-4o-mini directly ties into Configuration Trap 2 described below. When routing is enabled and the default model falls outside the allowed model set for a given mode, the runtime raises an unsupported model error.


Phase 4 — Register Cron Jobs

Break down the "heartbeat" of a persistently running agent into discrete cron entries. Assign the heartbeat an offset of 7/37 minutes to avoid top-of-the-hour conflicts.

Job Name         Schedule        Description
heartbeat-tick   7,37 * * * *    30-minute heartbeat
daily-brief      40 22 * * *     Daily brief at 22:40
research-scan    0 9 * * *       Research scan at 09:00
reflect          0 3 * * *       Reflection loop at 03:00
monthly-memory   0 9 1 * *       Memory cleanup, 09:00 on the 1st

crontab -e
crontab -l  # Confirm registration

On-the-hour and half-hour slots (minute 0 and minute 30) risk collision with other system jobs. Off-round offsets such as 7/37 are safer for heartbeat-class jobs.
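The offset rule can be checked mechanically before registration. The sketch below is illustrative only: both function names are hypothetical, and it handles plain comma-separated minute lists such as 7,37, not ranges, steps, or "*".

```shell
# Succeeds when a comma-separated cron minute field avoids minute 0 and minute 30.
minute_field_is_offset() {
  case ",$1," in
    *,0,*|*,00,*|*,30,*) return 1 ;;
    *) return 0 ;;
  esac
}

# Read crontab lines on stdin and print those whose minute field lands on :00 or :30.
flag_round_minute_jobs() {
  while read -r minute rest; do
    case "$minute" in
      ''|'#'*) continue ;;   # skip blank lines and comments
    esac
    minute_field_is_offset "$minute" || printf '%s %s\n' "$minute" "$rest"
  done
}
```

Running crontab -l | flag_round_minute_jobs lists the entries that sit on a round slot, which makes it easy to decide case by case whether a job is heartbeat-class and should be moved.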


Phase 5 — Connect Gateway and Messenger

hermes gateway install
launchctl load ~/Library/LaunchAgents/com.hermes.gateway.plist

launchctl list | grep hermes

Discord connection verification points:

  • Bot account (Mir#4703) connected
  • 50 slash commands synchronized
  • Cron ticker running at 60-second intervals

Due to Discord API propagation behavior, slash command sync can take up to 1 hour. Commands may not be visible immediately after sync, so wait 5–10 minutes before the first check and allow up to the full hour before treating missing commands as a failure.
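Waiting for propagation is easier to script as a bounded retry than as a manual re-check. This is a generic sketch; retry_until is a hypothetical helper, and whatever command is passed to it (for example a script that confirms the slash commands are visible) is also an assumption, since the source does not name one.

```shell
# retry_until <max_attempts> <delay_seconds> <command...>
# Runs the command until it succeeds or the attempt budget is used up.
retry_until() {
  max_attempts="$1"; delay="$2"; shift 2
  attempt=1
  while :; do
    if "$@"; then return 0; fi
    [ "$attempt" -ge "$max_attempts" ] && return 1
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}
```

With a hypothetical check named commands_visible, retry_until 12 300 commands_visible polls every 5 minutes and gives up after roughly an hour, which matches the propagation window described above.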


Phase 6 — 24-Hour Verification

Cutover pass/fail is determined over a 24-hour observation window. All periodic jobs must execute at least once, and any errors must trigger automatic recovery.
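The "every periodic job ran at least once" condition can be expressed as a check over a run log. The sketch below assumes a log with one line per run that contains the job name; that format, the log location, and the function name missing_jobs are all hypothetical.

```shell
# missing_jobs <logfile> <job>...
# Prints every job name that has no matching line in the log.
missing_jobs() {
  logfile="$1"; shift
  for job in "$@"; do
    grep -qF -- "$job" "$logfile" || printf '%s\n' "$job"
  done
}
```

At the end of the window, an empty result from missing_jobs <logfile> heartbeat-tick daily-brief research-scan reflect would indicate that each daily job fired at least once.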

Representative events in the observation window:

Interval   Event
T+0        Cutover starts / verification countdown begins
T+20m      First heartbeat-tick fires (nominal)
T+4h       daily-brief fires (nominal)
T+9h       reflect fires (nominal)
T+15h      research-scan fires + dawn heartbeat error occurs

Dawn heartbeat error log:

unsupported model error: 1 occurrence
mode info recognition failure: 1 occurrence
→ automatic recovery successful

The error root cause ties back to the routing config in Phase 3. Isolated analysis follows in Trap 2 below.


Configuration Trap 1: Sandbox $HOME Pollution

Symptom: The new stack writes data to an unexpected path instead of ~/.hermes/.

Cause: The Claude Code sandbox environment sets $HOME to a sandbox-isolated path rather than the real user home (/Users/username). If the loader script references $HOME directly, the installation path becomes misaligned.

Fix:

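# Before: follows whatever $HOME the sandbox sets
HERMES_HOME="$HOME/.hermes"

# After: resolve the home directory from the user record instead
REAL_HOME=$(eval echo ~$USER)
HERMES_HOME="$REAL_HOME/.hermes"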

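eval echo ~$USER looks up the home path directly from the user record and is not affected by the sandbox's $HOME override. It does rely on $USER itself being accurate; if the sandbox also rewrites $USER, substituting $(id -un) for $USER avoids that dependency.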

Additional recommendation: perform the initial installation from a user terminal, not from inside the sandbox. Installing inside the sandbox pins symlink targets to sandbox paths, which break on restart.
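A loader script can detect the override before it writes anything. This is a minimal sketch with hypothetical function names; it takes the user name from id -un rather than $USER, and it still uses eval, so it assumes the account name contains no shell metacharacters.

```shell
# Print the home directory recorded for the current user, ignoring $HOME.
real_home() {
  user_name=$(id -un) || return 1
  # ~name is resolved through the user database, not through $HOME.
  eval "printf '%s\n' ~$user_name"
}

# Succeeds when $HOME differs from the recorded home (a sign of a sandbox override).
home_is_overridden() {
  [ "$HOME" != "$(real_home)" ]
}
```

A loader could then abort early with something like: home_is_overridden && { echo "HOME is overridden; install from a user terminal" >&2; exit 1; }.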


Configuration Trap 2: Smart Model Routing vs. Default Model Mismatch

Symptom: During 24h verification, the dawn heartbeat produces one unsupported model error, followed by automatic recovery.

Cause: smart_model_routing: true is active while default_model is set to gpt-4o-mini. During a specific mode's reasoning phase in the heartbeat, the router selects a model that is not permitted for that mode, and the runtime rejects the model load.

Fix:

routing:
  smart_model_routing: false   # disable routing
  default_model: gpt-5.4       # switch to a fixed model

Explanation: When a mode's allowed model set in the routing table does not contain default_model, the fallback path fails for that mode. There are two solutions:

  1. Keep routing enabled, but fix default_model to a model permitted across all modes.
  2. Disable routing and use a single unified model.

Option 2 is safer during early operation. It reduces failure surface area. Once the cost and performance profile stabilizes, expand to Option 1.
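Before moving to Option 1, the condition "default_model is permitted across all modes" can be checked ahead of time instead of discovered at runtime. The sketch below assumes a hypothetical plain-text export of the routing table, one line per mode in the form "mode: model model ..."; the source does not show how Hermes actually stores this table, and the function name is made up for illustration.

```shell
# modes_rejecting_default <default_model>
# Reads "mode: model model ..." lines on stdin and prints each mode
# whose allowed set does not contain the default model.
modes_rejecting_default() {
  default_model="$1"
  while IFS=: read -r mode allowed; do
    [ -n "$mode" ] || continue
    case " $allowed " in
      *" $default_model "*) ;;        # default is permitted in this mode
      *) printf '%s\n' "$mode" ;;     # fallback would fail in this mode
    esac
  done
}
```

An empty result means the chosen default is safe to use as a fallback everywhere; any printed mode is a place where the unsupported model error could recur.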


Rollback Path

On cutover failure or a critical error, revert to the previous stack.

launchctl load ~/Library/LaunchAgents/com.openclaw.gateway.plist && crontab ~/openclaw_cron_backup.txt

Core design principle: rollback must be expressible as a single command — "load LaunchAgent + restore cron." If multiple commands are required, the rollback itself becomes a failure point. Retain the cron backup file until the 24h verification window passes.
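Since the single-line rollback depends on two files being present, a pre-flight check before cutover can confirm that the command will actually have something to restore. This is a sketch; rollback_ready is a hypothetical helper name.

```shell
# rollback_ready <cron_backup> <plist>
# Succeeds only when the cron backup is non-empty and the old LaunchAgent plist exists.
rollback_ready() {
  cron_backup="$1"; plist="$2"
  [ -s "$cron_backup" ] || { echo "cron backup missing or empty: $cron_backup" >&2; return 1; }
  [ -f "$plist" ] || { echo "LaunchAgent plist missing: $plist" >&2; return 1; }
}
```

For this migration the call would be rollback_ready ~/openclaw_cron_backup.txt ~/Library/LaunchAgents/com.openclaw.gateway.plist, run once before Phase 1 finishes and again before the backup is finally deleted.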


Final State Checklist

Item                                     Result
Residual processes from previous stack   0
Residual tokens from previous stack      0
Residual auxiliary tokens                0
heartbeat-tick                           Nominal
daily-brief                              Nominal
reflect                                  Nominal
research-scan                            Nominal (1 error → auto-recovered)
Discord slash commands                   50 synchronized

Applicability and Open Questions

This 6-phase structure is not specific to the OpenClaw → Hermes migration. Any persistently running agent stack composed of cron + LaunchAgent + messenger bot + routing layer can apply the same framework. Specifically, it maps directly onto:

  • Local Claude Code agent ↔ another local agent harness
  • Stacks based on ChatGPT Plus OAuth + Codex CLI tokens
  • Two backends sharing a single messenger bot (Discord/Slack)

Two open questions remain:

  1. Is re-enabling smart_model_routing worth it? Only when per-mode cost and performance profiles diverge meaningfully. If a single model is "good enough" across all modes, routing only adds complexity.
  2. Is a 24h verification window sufficient? The monthly-memory job falls outside the 24h window. Separating low-frequency jobs into a second observation window and running two-stage verification is the safer approach.

Series overview: Series index
