OpenClaw to Hermes Migration (13/13) — Cutover Day: 6-Phase Procedure and Configuration Traps
The required phases, validation checkpoints, and two representative configuration traps that arise from settings coupling when replacing an entire agent stack.
Key Summary
- Agent cutover follows 6 phases: shut down the existing stack → boot-verify the new stack → apply config → register cron jobs → connect Gateway and messenger → run 24h verification.
- Two reproducible configuration traps: $HOME pollution in a sandbox environment, and a mismatch between smart_model_routing and the default model.
- The minimum safe rollback design is a single-line "LaunchAgent + cron restore" command.
Phase 1 — Shut Down the Existing Stack
To avoid port, token, and process conflicts during agent replacement, fully stop the existing stack before proceeding. Since cron, LaunchAgent, and the messenger bot connection can each restart independently, all three must be blocked.
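crontab -l > ~/openclaw_cron_backup.txt # Back up first; the rollback path restores from this file
crontab -r # Removes the user's entire crontab, not only the OpenClaw entries
launchctl bootout gui/$(id -u) ~/Library/LaunchAgents/com.openclaw.gateway.plist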
Proceeding to Phase 2 without confirming a complete stop causes heartbeat contention from port conflicts during new stack boot. Root-cause isolation becomes difficult in the logs, so "zero residual processes" is the gate condition.
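The "zero residual processes" gate can be checked mechanically. A minimal sketch, assuming the old stack's processes carry "openclaw" somewhere in their command line; residual_count and stack_is_stopped are hypothetical helper names, not part of either stack.

```shell
# Count lines in a process listing (read from stdin) that mention the old stack.
# The pattern passed in is an assumption about how the old processes are named.
residual_count() {
    pattern=$1
    # grep -c prints 0 and exits non-zero when nothing matches; keep the 0.
    grep -ci -- "$pattern" || true
}

# Gate: succeed only when the listing on stdin shows no residual process.
stack_is_stopped() {
    [ "$(residual_count "$1")" -eq 0 ]
}
```

Usage would look like `ps -axo pid=,command= | stack_is_stopped openclaw && echo "gate passed"`. Reading the listing from stdin keeps the check independent of how the listing is produced, but any unrelated process whose command line contains the pattern will also be counted.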
Phase 2 — Boot-Verify the New Stack
Authentication tokens can often be carried over from the existing CLI lineage. Relocate the three token axes: Codex CLI token, OAuth refresh_token, and messenger bot token.
cp ~/.codex/auth.json ~/.hermes/auth.json # Carry over the Codex CLI token
grep provider ~/.hermes/config.yaml # Confirm the configured provider
After boot, run a self-diagnostic command such as hermes doctor to verify the three axes: identity file, user file, and auth.
SOUL.md ✓
USER.md ✓
Auth ✓
All three must pass before entering Phase 3. If any item fails, correct the file or token and re-verify. Boot-stage errors left unaddressed surface at a far higher cost in later phases.
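Before running the self-diagnostic, a file-level pre-check can catch the most common cause of a failed item. This is only a sketch: it assumes SOUL.md, USER.md, and auth.json all sit directly under the Hermes home directory, which the article does not confirm, and boot_gate is a hypothetical name. It does not replace the diagnostic command.

```shell
# Report whether each boot-stage file exists and is non-empty.
# Assumption: all three files live directly under the directory passed in.
boot_gate() {
    dir=$1
    rc=0
    for f in SOUL.md USER.md auth.json; do
        if [ -s "$dir/$f" ]; then
            echo "$f ok"
        else
            echo "$f MISSING"
            rc=1
        fi
    done
    # Non-zero when any file is missing or empty, so the call can gate Phase 3.
    return "$rc"
}
```

For example, `boot_gate ~/.hermes || echo "fix files before Phase 3"`.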
Phase 3 — Configure config.yaml
Three concerns must be addressed simultaneously in config design: skill load paths, disabled platforms and skills, and routing rules.
skills:
  disabled: 45
  external_dirs:
    - ~/Hermes/.claude/skills
  platform_disabled:
    discord: 14
routing:
  smart_model_routing: true
  default_model: gpt-4o-mini # this setting causes a later issue
  provider: openai-codex
external_dirs adds load paths for skill files located outside the default directory. Omitting this prevents Tier 2 skills from loading.
The combination of smart_model_routing: true and gpt-4o-mini directly ties into Configuration Trap 2 described below. When routing is enabled and the default model falls outside the allowed model set for a given mode, the runtime raises an unsupported model error.
Phase 4 — Register Cron Jobs
Break down the "heartbeat" of a persistently running agent into discrete cron entries. Assign the heartbeat an offset of 7/37 minutes to avoid top-of-the-hour conflicts.
| Job Name | Schedule | Description |
|---|---|---|
| heartbeat-tick | 7,37 * * * * | 30-minute heartbeat |
| daily-brief | 40 22 * * * | Daily brief at 22:40 |
| research-scan | 0 9 * * * | Research scan at 09:00 |
| reflect | 0 3 * * * | Reflection loop at 03:00 |
| monthly-memory | 0 9 1 * * | Memory cleanup, 09:00 on the 1st |
crontab -e
crontab -l # Confirm registration
On-the-hour and half-hour slots (minute 0 and minute 30) are where other system jobs tend to cluster, so they carry a higher collision risk. Odd offsets such as 7/37 are safer for heartbeat-class jobs.
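The minute field for an offset heartbeat can be generated rather than typed by hand. A minimal sketch, assuming the interval divides 60 evenly so the spacing stays uniform across hour boundaries; cron_minutes is a hypothetical helper.

```shell
# Print the comma-separated cron minute field for a job that fires every
# `interval` minutes, starting `offset` minutes past the hour.
cron_minutes() {
    offset=$1
    interval=$2
    [ "$interval" -gt 0 ] || return 1   # guard against an endless loop
    m=$offset
    out=""
    while [ "$m" -lt 60 ]; do
        out="${out:+$out,}$m"           # add a comma only between values
        m=$((m + interval))
    done
    echo "$out"
}

cron_minutes 7 30   # prints 7,37
```

The result drops straight into the first field of the heartbeat-tick entry in the table above.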
Phase 5 — Connect Gateway and Messenger
hermes gateway install
launchctl load ~/Library/LaunchAgents/com.hermes.gateway.plist
launchctl list | grep hermes
Discord connection verification points:
- Bot account (Mir#4703) connected
- 50 slash commands synchronized
- Cron ticker running at 60-second intervals
Due to Discord API propagation behavior, slash command sync can take up to 1 hour. Commands may not be visible immediately after sync, so wait at least 5–10 minutes before checking, and allow up to an hour before treating missing commands as a failure.
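Waiting and re-checking is easier to get right with a small polling helper than with manual retries. A minimal sketch; retry_until is a hypothetical helper, and the check it runs is whatever command confirms the connection in a given setup.

```shell
# Re-run a check until it succeeds or the attempts run out.
# Usage: retry_until <attempts> <delay_seconds> <command> [args...]
retry_until() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while [ "$n" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        # Sleep only between attempts, not after the final failure.
        if [ "$n" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        n=$((n + 1))
    done
    return 1
}
```

With the check from this phase, `retry_until 10 60 sh -c 'launchctl list | grep -q hermes'` polls once a minute for up to ten attempts. Slash command visibility has no check command in this article, so it would need its own verification step.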
Phase 6 — 24-Hour Verification
Cutover pass/fail is determined over a 24-hour observation window. All periodic jobs must execute at least once, and any errors must trigger automatic recovery.
Representative events in the observation window:
| Interval | Event |
|---|---|
| T+0 | Cutover starts / verification countdown begins |
| T+20m | First heartbeat-tick fires (nominal) |
| T+4h | daily-brief fires (nominal) |
| T+9h | reflect fires (nominal) |
| T+15h | research-scan fires + dawn heartbeat error occurs |
Dawn heartbeat error log:
unsupported model error: 1 occurrence
mode info recognition failure: 1 occurrence
→ automatic recovery successful
The error root cause ties back to the routing config in Phase 3. Isolated analysis follows in Trap 2 below.
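The "every periodic job fires at least once" criterion can be checked against a log at the end of the window. A minimal sketch, assuming each run leaves at least one log line containing the job name; the log location and format are not specified in this article, and missing_jobs is a hypothetical helper.

```shell
# Print every expected job name that never appears in the log file.
# Assumption: a job that ran leaves at least one line containing its name.
missing_jobs() {
    log=$1
    shift
    for job in "$@"; do
        # -F treats the job name as a literal string, not a regex.
        grep -qF -- "$job" "$log" || echo "$job"
    done
}
```

Called as `missing_jobs <logfile> heartbeat-tick daily-brief research-scan reflect`, empty output means all four jobs ran. monthly-memory is left out of the list because it cannot fire inside a 24-hour window unless the cutover spans the 1st of the month.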
Configuration Trap 1: Sandbox $HOME Pollution
Symptom: The new stack writes data to an unexpected path instead of ~/.hermes/.
Cause: The Claude Code sandbox environment sets $HOME to a sandbox-isolated path rather than the real user home (/Users/username). If the loader script references $HOME directly, the installation path becomes misaligned.
Fix:
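# Before: follows the sandbox's $HOME override
HERMES_HOME="$HOME/.hermes"
# After: resolve the home directory from the user record instead
REAL_HOME=$(eval echo ~$USER)
HERMES_HOME="$REAL_HOME/.hermes"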
eval echo ~$USER looks up the home path directly from the user record and is not affected by the sandbox's $HOME override.
Additional recommendation: perform the initial installation from a user terminal, not from inside the sandbox. Installing inside the sandbox pins symlink targets to sandbox paths, which break on restart.
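A loader script can also detect the override up front and refuse to install, rather than silently writing to the wrong path. A minimal sketch built on the same tilde-expansion lookup; real_home and home_is_overridden are hypothetical names, and the user name is taken from `id -un` instead of $USER on the assumption that $USER may itself be unset or altered in a sandbox.

```shell
# Resolve the home directory from the user database, ignoring $HOME.
real_home() {
    user=$(id -un) || return 1
    # Tilde expansion of ~name consults the user record, not $HOME.
    # eval is needed because tilde expansion happens before variable expansion.
    eval "echo ~$user"
}

# Succeed when $HOME differs from the user record (for example in a sandbox).
home_is_overridden() {
    [ "$HOME" != "$(real_home)" ]
}
```

A loader could then start with `home_is_overridden && { echo "sandboxed HOME detected" >&2; exit 1; }`. The eval is acceptable here only because the name comes from `id -un`, not from user-supplied input.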
Configuration Trap 2: Smart Model Routing vs. Default Model Mismatch
Symptom: During 24h verification, the dawn heartbeat produces one unsupported model error, followed by automatic recovery.
Cause: With smart_model_routing: true active, the default model is set to gpt-4o-mini. During a specific mode's reasoning phase in the heartbeat, the router selects a model that is not permitted for that mode, and the runtime rejects the model load.
Fix:
routing:
  smart_model_routing: false # disable routing
  default_model: gpt-5.4 # switch to a fixed model
Explanation: When the per-mode allowed model set in the routing table has no intersection with default_model, the fallback path fails. There are two solutions:
- Option 1: Keep routing enabled, but fix default_model to a model permitted across all modes.
- Option 2: Disable routing and use a single unified model.
Option 2 is safer during early operation. It reduces failure surface area. Once the cost and performance profile stabilizes, expand to Option 1.
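The precondition for Option 1 is that the default model appears in every mode's allowed set. A minimal sketch of that check; the per-mode allow-lists are hypothetical inputs, since the real routing table format is not shown here, and model_allowed_everywhere is an illustrative name.

```shell
# Succeed when `model` appears in every comma-separated allow-list given.
# Each remaining argument is one mode's allow-list (hypothetical input).
model_allowed_everywhere() {
    model=$1
    shift
    for allowed in "$@"; do
        # Wrap both sides in commas so only whole names match.
        case ",$allowed," in
            *",$model,"*) ;;
            *) return 1 ;;
        esac
    done
    return 0
}
```

For instance, `model_allowed_everywhere gpt-4o-mini "gpt-5.4,gpt-4o-mini" "gpt-5.4"` fails because the second list excludes the default model, which is the shape of the mismatch described above. Running such a check at config time would surface the problem before the 24h window instead of during it.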
Rollback Path
On cutover failure or a critical error, revert to the previous stack.
launchctl load ~/Library/LaunchAgents/com.openclaw.gateway.plist && crontab ~/openclaw_cron_backup.txt
Core design principle: rollback must be expressible as a single command — "load LaunchAgent + restore cron." If multiple commands are required, the rollback itself becomes a failure point. Retain the cron backup file until the 24h verification window passes.
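A single-command rollback only works if its two inputs still exist when it is needed. A minimal sketch of a pre-flight check that can be run at any point during the verification window; rollback_ready is a hypothetical helper, and the paths are the ones used in the command above.

```shell
# Succeed only when both rollback inputs are present and usable.
rollback_ready() {
    plist=$1
    backup=$2
    [ -f "$plist" ] || { echo "missing plist: $plist" >&2; return 1; }
    # -s also rejects an empty backup, which would wipe the crontab on restore.
    [ -s "$backup" ] || { echo "missing or empty cron backup: $backup" >&2; return 1; }
}
```

Running `rollback_ready ~/Library/LaunchAgents/com.openclaw.gateway.plist ~/openclaw_cron_backup.txt` right after Phase 1 confirms the rollback path is intact before anything else changes.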
Final State Checklist
| Item | Result |
|---|---|
| Residual processes from previous stack | 0 |
| Residual tokens from previous stack | 0 |
| Residual auxiliary tokens | 0 |
| heartbeat-tick | Nominal |
| daily-brief | Nominal |
| reflect | Nominal |
| research-scan | Nominal (1 error → auto-recovered) |
| Discord slash commands | 50 synchronized |
Applicability and Open Questions
This 6-phase structure is not specific to the OpenClaw → Hermes migration. Any persistently running agent stack composed of cron + LaunchAgent + messenger bot + routing layer can apply the same framework. Specifically, it maps directly onto:
- Local Claude Code agent ↔ another local agent harness
- Stacks based on ChatGPT Plus OAuth + Codex CLI tokens
- Two backends sharing a single messenger bot (Discord/Slack)
Two open questions remain:
- Is re-enabling smart_model_routing worth it? Only when per-mode cost and performance profiles diverge meaningfully. If a single model is "good enough" across all modes, routing only adds complexity.
- Is a 24h verification window sufficient? The monthly-memory job falls outside the 24h window. Separating low-frequency jobs into a second observation window and running two-stage verification is the safer approach.
Series overview: Series index