Redesigning Cost Structure with AI Agent Profiles — Model Routing Strategy

Hermes Cost Architecture: Consolidating Multiple Agents into Profile Units


핵심 요약

  • This post covers how role-based profile consolidation and model routing strategy work in a multi-agent system.
  • Key conclusion: the orchestrator model is fixed at the top tier; repetitive lookups use free/low-cost models; memory processing is offloaded locally.
  • Paid spend converges to a single ChatGPT Plus subscription (~$20/month); Gemini free tier and oMLX local fill the rest.
  • smart routing (dynamic orchestrator downgrade) is disabled — quality degradation outweighs cost savings.

What You Can Take from This Post

  1. A consolidation pattern for grouping multiple agents into role profiles
  2. Decision criteria for per-profile model selection (judgment depth × call frequency × cost)
  3. A three-stage fallback chain design — cost → quality → local offline
  4. Why smart routing must not be applied to the orchestrator, and where the boundary sits

Background: Limits of a Distributed Agent Structure

Early OpenClaw ran multiple agents each with independent configurations. Models differed per agent, and some roles overlapped. The real bottleneck was not cost — it was management complexity. A single model swap scattered configuration changes across multiple locations, making it difficult to maintain consistent routing quality.

This is the starting point of the Hermes redesign. The goal was not to reduce the number of agents, but to consolidate model decisions into a unit manageable from a single point. Re-classifying by role caused individual agent configs to converge into profile-level abstractions.


Profile and Model Mapping

Profile Model Role
default (mir) gpt-5.4 Orchestrator. General judgment, routing decisions
coder gpt-5.3-codex Code generation, debugging
researcher gemini-3.1-flash-lite Search, information gathering
monitor gemini-3.1-flash-lite System state monitoring
memory gemini-2.5-flash Memory read/write, summarization
memory-runner gemma-4-26b (oMLX local) Local memory processing

More profiles were proposed during the draft phase; those with weak separation criteria were absorbed into higher-level profiles during implementation.


Model Selection Logic

Per-profile model selection is evaluated on three axes: judgment depth, call frequency, and cost elasticity.

Orchestrator: gpt-5.4 Fixed

The orchestrator (default/mir) uses the top-tier model. The principle is simple: the orchestrator always gets the best model.

smart_model_routing was tested with a configuration that downgraded the orchestrator to gpt-4o-mini based on task complexity, then disabled. When the orchestrator drops to a lower model, judgment quality degradation propagates downstream through the entire pipeline. The scope of quality impact is wider than the cost savings — that is the key point.

Gemini: Deployed for Repetitive Lookup Tasks

researcher and monitor use gemini-3.1-flash-lite. These are high-frequency, low-judgment tasks — search and state monitoring. Marginal utility from upgrading the model is low given the low judgment complexity, and the high call frequency maximizes the value of the free tier. Operating within the Gemini free tier incurs no additional cost.

memory: gemini-2.5-flash

Memory summarization and read/write require contextual understanding but have a low creative judgment component. gemini-2.5-flash fits this range. Context compression uses OpenRouter's gemini-2.5-flash:free endpoint.

memory-runner: gemma-4-26b (Local)

Runs locally on oMLX. Endpoint: 127.0.0.1:8001.

  • Model: gemma-4-26b-a4b-it-4bit (4-bit quantized)
  • Embedding: bge-m3-mlx-fp16

No external API is used, so there is no added cost. Latency exists, but memory processing is an asynchronous task that does not require real-time response — the tradeoff holds.


Fallback Chain

Relying on a single model means a full stop on API failure. Hermes configures a three-stage fallback chain.

gemini-pro → claude-sonnet-4-6 → omlx/gemma-4
  • Stage 1: gemini-pro (cost efficient, fast response)
  • Stage 2: claude-sonnet-4-6 (quality guaranteed)
  • Stage 3: omlx/gemma-4 (fully local, no external dependency, last resort)

Fallback triggers automatically on API error or response timeout. Chain order is sorted by cost → quality → availability. Placing a local model at the end preserves minimum functionality even during network failures.


Monthly Spend Structure

Item Cost
ChatGPT Plus (mir + coder) ~$20/month
Gemini (researcher, monitor, memory) $0 (free tier)
oMLX local (memory-runner) $0
Total ~$20/month

A direct cost comparison with the OpenClaw era is not available — no documented figures exist. The expression "90% reduction" appeared in early discussions, but is not stated here as it lacks verified data.

One fact is verifiable in the current structure: paid spend converges to a single ChatGPT Plus subscription line.


Structural Effects of Consolidation

The numbers alone are surface-level indicators. The real effects occur at two points.

1. Clarity in Routing Decisions

The more agents there are, the more the orchestrator must judge "which agent handles this task" on each call. Grouping into profile units sharpens role boundaries, reducing routing ambiguity and decreasing the frequency of incorrect profile assignments.

2. Locality of Model Swaps

Replacing the model attached to a specific profile requires changing only that profile's configuration in one place. In the previous structure, per-agent model configs were distributed, and the same model swap caused synchronization issues across multiple locations.


Boundary for smart_model_routing

smart_model_routing detects task complexity and dynamically selects the model to deploy — low-cost model for simple lookups, higher model for complex judgment.

Observations from applying it to the orchestrator:

  • When the orchestrator drops to a lower model, routing judgment quality degrades noticeably
  • Judgment errors propagate to downstream agents, making recovery costs exceed savings
  • Restored principle: orchestrator fixed at gpt-5.4

Conclusion: smart_model_routing is valid for intra-agent tasks (researcher, memory, and other single-role profiles), but is not suitable for the orchestrator that determines overall flow. Layers where judgment quality propagates downstream require a fixed model.


Summary Table

Item Decision
Profile configuration default, coder, researcher, monitor, memory, memory-runner
Orchestrator model gpt-5.4 fixed
Free tier usage Gemini free tier, oMLX local
Monthly spend ~$20 (ChatGPT Plus)
Fallback chain gemini-pro → claude-sonnet-4-6 → omlx/gemma-4
Smart routing Disabled for orchestrator; selectively applied to sub-agents only

Applicability and Open Questions

Applicable to

  • Individual or small-scale systems running multiple LLM agents
  • Configurations that mix paid APIs, free tiers, and local models to flatten the cost curve
  • Orchestrator–worker pipelines where judgment quality propagates downstream

Open Questions

  • At what threshold does smart routing for sub-agents branch stably?
  • When the fallback chain drops to stage 2 or 3, how should quality degradation be signaled to the user?
  • Where does role boundary ambiguity re-emerge when profiles are reduced further?

These three questions define the design axes for the next iteration.

댓글

이 블로그의 인기 게시물

Agent Memory Engine (2/10) — Building an AI Agent Memory System with SQLite Alone

"ML Foundations (9/9) — PyTorch vs TensorFlow, and the Road to Local LLMs"

"RAG Core Study (14/26) — Evaluation Sets with RAGAS & DeepEval"

"ML Foundations (8/9) — Deep Learning Architectures: CNN, RNN, Attention"

"ML Foundations (7/9) — Deep Learning Training: Optimizers, Regularization, Initialization"

OpenClaw to Hermes Migration (2/13) — What to Preserve, Partially Port, or Discard

AI Agents I Built (5/7) — Building an Automated Blogger API Publishing System