Redesigning Cost Structure with AI Agent Profiles — Model Routing Strategy
Hermes Cost Architecture: Consolidating Multiple Agents into Profile Units
Key Summary
- This post covers how role-based profile consolidation and model routing strategy work in a multi-agent system.
- Key conclusion: the orchestrator model is fixed at the top tier; repetitive lookups use free/low-cost models; memory processing is offloaded locally.
- Paid spend converges to a single ChatGPT Plus subscription (~$20/month); Gemini free tier and oMLX local fill the rest.
- Smart routing (dynamic orchestrator downgrade) is disabled for the orchestrator: the quality degradation outweighs the cost savings.
What You Can Take from This Post
- A consolidation pattern for grouping multiple agents into role profiles
- Decision criteria for per-profile model selection (judgment depth × call frequency × cost)
- A three-stage fallback chain design — cost → quality → local offline
- Why smart routing must not be applied to the orchestrator, and where the boundary sits
Background: Limits of a Distributed Agent Structure
Early OpenClaw ran multiple agents each with independent configurations. Models differed per agent, and some roles overlapped. The real bottleneck was not cost — it was management complexity. A single model swap scattered configuration changes across multiple locations, making it difficult to maintain consistent routing quality.
This is the starting point of the Hermes redesign. The goal was not to reduce the number of agents, but to consolidate model decisions into a unit manageable from a single point. Re-classifying by role caused individual agent configs to converge into profile-level abstractions.
Profile and Model Mapping
| Profile | Model | Role |
|---|---|---|
| default (mir) | gpt-5.4 | Orchestrator. General judgment, routing decisions |
| coder | gpt-5.3-codex | Code generation, debugging |
| researcher | gemini-3.1-flash-lite | Search, information gathering |
| monitor | gemini-3.1-flash-lite | System state monitoring |
| memory | gemini-2.5-flash | Memory read/write, summarization |
| memory-runner | gemma-4-26b (oMLX local) | Local memory processing |
More profiles were proposed during the draft phase; those with weak separation criteria were absorbed into higher-level profiles during implementation.
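The mapping above can be sketched as a single registry, so that every model decision lives in one place. This is a minimal sketch, not the actual Hermes configuration format: the `Profile` class, the `PROFILES` dict, and `model_for` are hypothetical names introduced for illustration, and only the profile names, model names, and roles come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """One role profile: a name, the model bound to it, and its role."""
    name: str
    model: str
    role: str


# Values copied from the profile table; the registry structure is an assumption.
PROFILES = {
    p.name: p
    for p in (
        Profile("default", "gpt-5.4", "orchestrator: general judgment, routing decisions"),
        Profile("coder", "gpt-5.3-codex", "code generation, debugging"),
        Profile("researcher", "gemini-3.1-flash-lite", "search, information gathering"),
        Profile("monitor", "gemini-3.1-flash-lite", "system state monitoring"),
        Profile("memory", "gemini-2.5-flash", "memory read/write, summarization"),
        Profile("memory-runner", "gemma-4-26b", "local memory processing (oMLX)"),
    )
}


def model_for(profile_name: str) -> str:
    """Resolve a profile name to its model; raises KeyError for unknown profiles."""
    return PROFILES[profile_name].model
```

Callers ask for a profile by name and never hard-code a model string, which is what keeps a later model swap confined to one entry.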
Model Selection Logic
Per-profile model selection is evaluated on three axes: judgment depth, call frequency, and cost elasticity.
Orchestrator: gpt-5.4 Fixed
The orchestrator (default/mir) uses the top-tier model. The principle is simple: the orchestrator always gets the best model.
smart_model_routing was tested with a configuration that downgraded the orchestrator to gpt-4o-mini based on task complexity, and was then disabled. When the orchestrator drops to a lower model, the loss in judgment quality propagates downstream through the entire pipeline. The key point is that the quality impact reaches further than the cost savings do.
Gemini: Deployed for Repetitive Lookup Tasks
researcher and monitor use gemini-3.1-flash-lite. These are high-frequency, low-judgment tasks — search and state monitoring. Marginal utility from upgrading the model is low given the low judgment complexity, and the high call frequency maximizes the value of the free tier. Operating within the Gemini free tier incurs no additional cost.
memory: gemini-2.5-flash
Memory summarization and read/write require contextual understanding but have a low creative judgment component. gemini-2.5-flash fits this range. Context compression uses OpenRouter's gemini-2.5-flash:free endpoint.
memory-runner: gemma-4-26b (Local)
Runs locally on oMLX. Endpoint: 127.0.0.1:8001.
- Model: gemma-4-26b-a4b-it-4bit (4-bit quantized)
- Embedding: bge-m3-mlx-fp16
No external API is used, so there is no added cost. Local inference adds latency, but memory processing is asynchronous and does not need a real-time response, so the tradeoff holds.
Fallback Chain
Relying on a single model means a full stop on API failure. Hermes configures a three-stage fallback chain.
gemini-pro → claude-sonnet-4-6 → omlx/gemma-4
- Stage 1: gemini-pro (cost efficient, fast response)
- Stage 2: claude-sonnet-4-6 (quality guaranteed)
- Stage 3: omlx/gemma-4 (fully local, no external dependency, last resort)
Fallback triggers automatically on API error or response timeout. Chain order is sorted by cost → quality → availability. Placing a local model at the end preserves minimum functionality even during network failures.
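The chain logic can be sketched as a loop that tries each stage in order and moves on when a call fails or times out. This is a minimal sketch under stated assumptions: `call_with_fallback`, `ModelCallError`, and the `call_model` callback are hypothetical names, and the real Hermes trigger conditions may differ. Only the chain order comes from the post.

```python
from typing import Callable, Sequence, Tuple


class ModelCallError(Exception):
    """Hypothetical error type standing in for any provider API error."""


# Chain order from the post: cost, then quality, then local availability.
FALLBACK_CHAIN = ("gemini-pro", "claude-sonnet-4-6", "omlx/gemma-4")


def call_with_fallback(
    prompt: str,
    call_model: Callable[[str, str], str],
    chain: Sequence[str] = FALLBACK_CHAIN,
) -> Tuple[str, str]:
    """Try each model in order and return (model_used, response).

    call_model(model, prompt) is an assumed client callback that raises
    ModelCallError on an API error or TimeoutError on a response timeout.
    """
    errors = []
    for model in chain:
        try:
            return model, call_model(model, prompt)
        except (ModelCallError, TimeoutError) as exc:
            # Record the failure and fall through to the next stage.
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all fallback stages failed: " + "; ".join(errors))


def simulated_call(model: str, prompt: str) -> str:
    """Simulated network outage: only the local stage answers."""
    if model != "omlx/gemma-4":
        raise TimeoutError("simulated outage")
    return f"[{model}] ok"


if __name__ == "__main__":
    print(call_with_fallback("ping", simulated_call))
```

Returning the model name alongside the response lets the caller see which stage actually answered, which matters when a lower stage implies lower quality.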
Monthly Spend Structure
| Item | Cost |
|---|---|
| ChatGPT Plus (mir + coder) | ~$20/month |
| Gemini (researcher, monitor, memory) | $0 (free tier) |
| oMLX local (memory-runner) | $0 |
| Total | ~$20/month |
A direct cost comparison with the OpenClaw era is not available because no documented figures exist. The figure "90% reduction" appeared in early discussions, but it is not claimed here because no verified data supports it.
One fact is verifiable in the current structure: paid spend converges to a single ChatGPT Plus subscription line.
Structural Effects of Consolidation
The numbers alone are surface-level indicators. The real effects occur at two points.
1. Clarity in Routing Decisions
The more agents there are, the more the orchestrator must judge "which agent handles this task" on each call. Grouping into profile units sharpens role boundaries, reducing routing ambiguity and decreasing the frequency of incorrect profile assignments.
2. Locality of Model Swaps
Replacing the model attached to a specific profile requires changing only that profile's configuration in one place. In the previous structure, per-agent model configs were distributed, and the same model swap caused synchronization issues across multiple locations.
Boundary for smart_model_routing
smart_model_routing detects task complexity and dynamically selects the model to deploy — low-cost model for simple lookups, higher model for complex judgment.
Observations from applying it to the orchestrator:
- When the orchestrator drops to a lower model, routing judgment quality degrades noticeably
- Judgment errors propagate to downstream agents, making recovery costs exceed savings
- Restored principle: orchestrator fixed at gpt-5.4
Conclusion: smart_model_routing is valid for intra-agent tasks (researcher, memory, and other single-role profiles), but is not suitable for the orchestrator that determines overall flow. Layers where judgment quality propagates downstream require a fixed model.
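The boundary can be sketched as a guard that pins the orchestrator and lets only single-role profiles switch models by task complexity. This is a minimal sketch, not the actual smart_model_routing implementation: `select_model`, `BASE_MODELS`, and `SMART_ROUTES` are hypothetical names, and the (simple, complex) model pair for researcher is an illustrative assumption, since the post does not state which models sub-agents switch between.

```python
ORCHESTRATOR = "default"

# Fixed per-profile models, taken from the profile table.
BASE_MODELS = {
    "default": "gpt-5.4",
    "researcher": "gemini-3.1-flash-lite",
    "memory": "gemini-2.5-flash",
}

# Hypothetical (simple_task_model, complex_task_model) pairs for profiles
# that opt in to smart routing. The pairing is an assumption for illustration.
SMART_ROUTES = {
    "researcher": ("gemini-3.1-flash-lite", "gemini-2.5-flash"),
}


def select_model(profile: str, is_complex: bool) -> str:
    """Pick a model for a task; the orchestrator is never downgraded."""
    if profile == ORCHESTRATOR:
        # Judgment quality at this layer propagates downstream, so it stays pinned.
        return BASE_MODELS[ORCHESTRATOR]
    route = SMART_ROUTES.get(profile)
    if route is None:
        # Profiles without a smart route keep their fixed model.
        return BASE_MODELS[profile]
    simple_model, complex_model = route
    return complex_model if is_complex else simple_model
```

The orchestrator check comes first on purpose: even if someone later adds a "default" entry to `SMART_ROUTES`, the pinned model still wins.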
Summary Table
| Item | Decision |
|---|---|
| Profile configuration | default, coder, researcher, monitor, memory, memory-runner |
| Orchestrator model | gpt-5.4 fixed |
| Free tier usage | Gemini free tier, oMLX local |
| Monthly spend | ~$20 (ChatGPT Plus) |
| Fallback chain | gemini-pro → claude-sonnet-4-6 → omlx/gemma-4 |
| Smart routing | Disabled for orchestrator; selectively applied to sub-agents only |
Applicability and Open Questions
Applicable to
- Individual or small-scale systems running multiple LLM agents
- Configurations that mix paid APIs, free tiers, and local models to flatten the cost curve
- Orchestrator–worker pipelines where judgment quality propagates downstream
Open Questions
- At what threshold does smart routing for sub-agents branch stably?
- When the fallback chain drops to stage 2 or 3, how should quality degradation be signaled to the user?
- Where does role boundary ambiguity re-emerge when profiles are reduced further?
These three questions define the design axes for the next iteration.