AI Operations Economics (2/4) — Model Routing: The Cost / Quality / Latency Triangle
"The most expensive model" is not the answer — over 80% of tasks can hit the same outcome at 1/10 the cost.
Key takeaways
- Routing is a rule system that matches tasks to models. It is not on-the-fly model selection
- Three axes: cost / quality / latency — every routing decision is a tradeoff among these
- Primary sources: Anthropic and OpenAI pricing pages; first-person operational data
- Starting question: "If this task fails, can we just retry?" If yes, start with a smaller model
- Real automation: a classifier categorizes the task and the router picks the model. Human ad-hoc selection is not routing
1. Why route at all
LLM unit prices vary widely within a single provider.
| Model | Input ($/1M tokens) | Output ($/1M tokens) |
|---|---|---|
| Claude Opus 4.7 | $15 | $75 |
| Claude Sonnet 4.6 | $3 | $15 |
| Claude Haiku 4.5 | $0.80 | $4 |
Opus is roughly 19× the input price of Haiku. Running a classification task on Opus means a 19× bill.
The right question is "is Haiku enough?" — not "is Opus better?" Picking the cheapest model that is good enough is the heart of routing.
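To make the ratio concrete, here is a minimal sketch that prices one call from the table above. The short model keys, the `call_cost` helper, and the token counts in the example are illustrative assumptions, not measured data.

```python
# Prices in USD per 1M tokens, copied from the table above.
PRICES = {
    "opus": {"input": 15.00, "output": 75.00},
    "sonnet": {"input": 3.00, "output": 15.00},
    "haiku": {"input": 0.80, "output": 4.00},
}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD list-price cost of one call (no caching, no retries)."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


# Hypothetical classification call: 1,000 input tokens, 20 output tokens.
haiku = call_cost("haiku", 1_000, 20)
opus = call_cost("opus", 1_000, 20)
print(f"haiku=${haiku:.5f} opus=${opus:.5f} ratio={opus / haiku:.2f}x")
```

Because both the input and output prices differ by the same factor in this table, the ratio stays at 18.75× regardless of the input/output mix.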
2. The three axes
Every routing decision is a tradeoff across three axes.
2.1 Cost
- Unit price × token usage.
- Caching, routing, and short outputs together cut cost by ~95% (see part 1/4 of this series and part 5/5 of series A, Coding Agents in Practice).
2.2 Quality
- Measure as retry frequency. If a small model needs three tries, effective cost is 3×.
- Quality is not an accuracy threshold — it's a retry function.
2.3 Latency
- Smaller models are faster. For interactive use cases that need 5-second answers, the largest models are out.
- Background jobs have zero latency weight.
Conceptual formula:
Effective cost = (rate × tokens) × (1 + retry_rate) + (latency × user_impatience_penalty)
In practice this collapses into a rule table.
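The conceptual formula can be written as a small function to compare options before turning them into rules. This is a sketch under stated assumptions: the impatience penalty is expressed in USD per second of waiting (a made-up unit for illustration), and only input-token prices are used to keep the arithmetic short.

```python
def effective_cost(
    rate_per_token: float,
    tokens: int,
    retry_rate: float,
    latency_s: float = 0.0,
    impatience_penalty: float = 0.0,
) -> float:
    """Effective cost = (rate * tokens) * (1 + retry_rate) + latency * penalty.

    retry_rate is the expected number of extra attempts per task,
    so three tries in total corresponds to retry_rate = 2.0.
    impatience_penalty is a hypothetical USD-per-second weight; use 0 for
    background jobs, where latency carries no weight.
    """
    return rate_per_token * tokens * (1 + retry_rate) + latency_s * impatience_penalty


# Background job, 2,000 input tokens, latency ignored.
haiku_three_tries = effective_cost(0.80 / 1_000_000, 2_000, retry_rate=2.0)
sonnet_one_try = effective_cost(3.00 / 1_000_000, 2_000, retry_rate=0.0)
print(f"haiku x3 = ${haiku_three_tries:.4f}, sonnet x1 = ${sonnet_one_try:.4f}")
```

With these assumed numbers the small model is still cheaper after three tries, which is why the retry rate has to be measured rather than guessed.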
3. Task × model mapping
A reasonable starting point. Calibrate with your own measurements.
| Task | Recommended model | Reasoning |
|---|---|---|
| Classification / labeling / intent | Haiku or local oMLX | Simple; retry cost is low |
| Summarization (≤5K input) | Haiku or Sonnet | Start with Haiku; promote if quality slips |
| Standard code change / PR drafting | Sonnet | Best price-quality balance |
| Hard debugging / design / multi-step reasoning | Opus | Retry savings outweigh per-call cost |
| Background monitoring / polling | Haiku or local | High frequency, latency-tolerant |
| Interactive chat | Sonnet (streaming) | Latency requirement ≤2s |
Core rule: start small, measure, promote only when needed. Starting with the largest model is almost always overspend.
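One way to hold this mapping in code is a lookup table plus a promotion ladder. The task-type keys, the `pick_model` and `promote` helpers, and the choice of Haiku as the default for unknown task types are assumptions made for this sketch; where the table lists two options, the smaller one is used as the starting point.

```python
# Starting model per task type, taken from the table above.
ROUTING_TABLE = {
    "classification": "haiku",
    "summarization": "haiku",
    "code_change": "sonnet",
    "hard_debugging": "opus",
    "background_monitoring": "haiku",
    "interactive_chat": "sonnet",
}

# Promotion order, smallest to largest.
LADDER = ["haiku", "sonnet", "opus"]


def pick_model(task_type: str, default: str = "haiku") -> str:
    """Return the starting model for a task type (default is an assumption)."""
    return ROUTING_TABLE.get(task_type, default)


def promote(model: str) -> str:
    """Return the next larger model, or the same model if already at the top."""
    index = LADDER.index(model)
    return LADDER[min(index + 1, len(LADDER) - 1)]


start = pick_model("summarization")
print(start, "->", promote(start))
```

Promotion should be driven by measured retries or quality complaints for a task type, not applied per call by hand.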
4. Multi-provider routing
If you mix Anthropic, OpenAI, and local models instead of staying within a single ecosystem, cross-provider routing becomes an additional lever.
Provider selection axes:
- Data location: can code or documents leave for an external API? If not, route local.
- Model strengths: code → Anthropic tends to lead; certain reasoning tasks → OpenAI o-series (decide via benchmarks plus your own measurements).
- Availability: when one provider 5xx's, automatically fall back to another.
Fallback strategy:
- Primary: most cost-effective provider.
- Secondary: same-class model from another provider.
- Tertiary: when both fail, escalate to a stronger model (retry is expensive, but the task must complete).
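The primary / secondary / tertiary chain can be sketched as an ordered list of callables. `ProviderError` and the three stub functions are hypothetical stand-ins; a real setup would wrap each provider's SDK call and map its 5xx-style failures onto one exception type.

```python
from typing import Callable, List, Tuple


class ProviderError(Exception):
    """Stand-in for a provider-side failure such as an HTTP 5xx (hypothetical)."""


Step = Tuple[str, Callable[[str], str]]


def call_with_fallback(prompt: str, chain: List[Step]) -> Tuple[str, str]:
    """Try each (name, call) step in order; return (name, result) of the first success."""
    errors = []
    for name, call in chain:
        try:
            return name, call(prompt)
        except ProviderError as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all fallback steps failed: " + "; ".join(errors))


# Stub providers for illustration only.
def primary(prompt: str) -> str:
    raise ProviderError("503")


def secondary(prompt: str) -> str:
    return f"secondary answered: {prompt}"


def tertiary(prompt: str) -> str:
    return f"tertiary answered: {prompt}"


used, answer = call_with_fallback(
    "summarize the incident",
    [("primary", primary), ("secondary", secondary), ("tertiary", tertiary)],
)
print(used, "->", answer)
```

Returning the name of the step that answered makes it possible to count how often each fallback fires.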
Measurement: track provider × model × success_rate daily. If fallback fires often for one task, reconsider the primary choice.
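A minimal in-memory sketch of that daily tracking is shown below. The `SuccessTracker` class and its method names are assumptions for illustration; in production the same counts would more likely live in a metrics store or a database table.

```python
from collections import defaultdict
from datetime import date


class SuccessTracker:
    """Counts successes and attempts per (provider, model, day)."""

    def __init__(self) -> None:
        # key -> [successes, attempts]
        self._counts = defaultdict(lambda: [0, 0])

    def record(self, provider: str, model: str, day: date, ok: bool) -> None:
        counts = self._counts[(provider, model, day)]
        counts[0] += int(ok)
        counts[1] += 1

    def success_rate(self, provider: str, model: str, day: date):
        """Return successes / attempts, or None when nothing was recorded."""
        successes, attempts = self._counts.get((provider, model, day), (0, 0))
        return successes / attempts if attempts else None


tracker = SuccessTracker()
day = date(2026, 5, 5)
for ok in (True, True, False, True):
    tracker.record("anthropic", "haiku", day, ok)
print(tracker.success_rate("anthropic", "haiku", day))
```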
5. Classifier-driven automatic routing
Human selection per call is not routing — it's improvisation. Real routing is automated by a classifier.
Minimum classifier:
- Input: user request text + metadata (task length, available tools, etc.).
- Output: {model: "haiku-4.5", reason: "classification task"}.
- Implementation: rule-based (keywords) first → upgrade to a small LLM (Haiku) once measured.
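A rule-based first version of the minimum classifier might look like the sketch below. The keyword lists, the `input_tokens` metadata key, the choice of Sonnet for summaries above 5K tokens, and the model labels (short labels in the style of the output example, not real API model IDs) are all assumptions to be replaced with your own measured rules.

```python
from typing import Optional

# (keywords, model label, reason); the first matching rule wins.
RULES = [
    (("classify", "label", "intent"), "haiku-4.5", "classification task"),
    (("debug", "design", "root cause"), "opus-4.7", "hard debugging or design"),
    (("refactor", "pull request", "bug fix"), "sonnet-4.6", "standard code change"),
]
SUMMARY_TOKEN_LIMIT = 5_000


def classify(text: str, metadata: Optional[dict] = None) -> dict:
    """Return {"model": ..., "reason": ...} using plain substring matching."""
    meta = metadata or {}
    lowered = text.lower()
    if "summar" in lowered:
        if meta.get("input_tokens", 0) <= SUMMARY_TOKEN_LIMIT:
            return {"model": "haiku-4.5", "reason": "short summarization"}
        return {"model": "sonnet-4.6", "reason": "long summarization"}
    for keywords, model, reason in RULES:
        if any(keyword in lowered for keyword in keywords):
            return {"model": model, "reason": reason}
    # No rule matched: start small and let measurement drive promotion.
    return {"model": "haiku-4.5", "reason": "no rule matched; start small"}


print(classify("Please label these tickets by intent"))
print(classify("Summarize this report", {"input_tokens": 12_000}))
```

Substring matching is crude (rule order matters and keywords can collide), which is part of why the upgrade to a small LLM classifier is worth considering once there is data.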
Router shape:
[user request] → [classifier] → [model pick] → [LLM call] → [result]
                      │                             │
                      └─[log: task_type, model, cost, latency]
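The router shape above can be sketched as one function that takes the classifier and the LLM call as parameters and appends a log record per request. `RouteLog`, `route`, and the two stubs are hypothetical names; the stubbed `call_model` returns a fixed cost where a real one would compute it from token usage.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class RouteLog:
    task_type: str
    model: str
    cost_usd: float
    latency_s: float


def route(
    request: str,
    classify: Callable[[str], dict],
    call_model: Callable[[str, str], Tuple[str, float]],
    log: List[RouteLog],
) -> str:
    """Classify, pick the model, call it, and log task_type/model/cost/latency."""
    decision = classify(request)
    start = time.perf_counter()
    text, cost_usd = call_model(decision["model"], request)
    latency_s = time.perf_counter() - start
    log.append(RouteLog(decision["task_type"], decision["model"], cost_usd, latency_s))
    return text


# Stubs for illustration only.
def stub_classify(request: str) -> dict:
    return {"task_type": "classification", "model": "haiku"}


def stub_call_model(model: str, request: str) -> Tuple[str, float]:
    return f"[{model}] handled: {request}", 0.00088


log: List[RouteLog] = []
result = route("label this ticket", stub_classify, stub_call_model, log)
print(result, log[0])
```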
The classifier itself burns tokens, so use the smallest model or rule logic for it. If the classifier exceeds 10% of call cost, simplify to rules.
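The 10% guideline is easy to check from logged costs. This sketch reads "call cost" as the cost of the routed LLM call that follows the classifier, which is an interpretation rather than something the text spells out; the function names and example costs are illustrative.

```python
def classifier_cost_share(classifier_cost_usd: float, call_cost_usd: float) -> float:
    """Return classifier cost as a fraction of the routed call's cost."""
    if call_cost_usd <= 0:
        raise ValueError("call_cost_usd must be positive")
    return classifier_cost_usd / call_cost_usd


def should_simplify_to_rules(
    classifier_cost_usd: float, call_cost_usd: float, threshold: float = 0.10
) -> bool:
    """True when the classifier costs more than `threshold` of the call it routes."""
    return classifier_cost_share(classifier_cost_usd, call_cost_usd) > threshold


# Hypothetical averages: an LLM classifier at $0.0004 per request.
print(should_simplify_to_rules(0.0004, 0.0030))  # routed call is cheap
print(should_simplify_to_rules(0.0004, 0.0165))  # routed call is expensive
```

Averaging both costs per task type over a day gives a steadier signal than checking individual requests.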
6. Measurement — What to look at
Effective metrics for routing impact:
- Model-call distribution: per task type, which model is used and at what rate.
- Retry rate: per model. A small model with 30%+ retries is too small for the task.
- Fallback firing rate: too high → primary choice lacks reliability.
- Average cost per task: before vs. after classifier deployment.
Empirical rule: deploying a classifier usually cuts average cost per task by 50% or more. If it does not, the classifier's rules are probably too conservative (it isn't routing enough work to small models).
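The metrics above can be computed from the router's log records. In this sketch, the record fields (`task_type`, `model`, `retries`, `cost_usd`) and the sample values are assumptions, and retry rate is defined as the share of tasks that needed at least one retry.

```python
from collections import Counter, defaultdict


def routing_metrics(records: list) -> dict:
    """Summarize model distribution, per-model retry rate, and average cost per task."""
    distribution = defaultdict(Counter)
    calls = Counter()
    retried = Counter()
    total_cost = 0.0
    for record in records:
        distribution[record["task_type"]][record["model"]] += 1
        calls[record["model"]] += 1
        if record["retries"] > 0:
            retried[record["model"]] += 1
        total_cost += record["cost_usd"]
    return {
        "distribution": {task: dict(models) for task, models in distribution.items()},
        "retry_rate": {model: retried[model] / calls[model] for model in calls},
        "avg_cost_per_task": total_cost / len(records) if records else 0.0,
    }


# Made-up sample records for illustration.
sample = [
    {"task_type": "classification", "model": "haiku", "retries": 0, "cost_usd": 0.001},
    {"task_type": "classification", "model": "haiku", "retries": 1, "cost_usd": 0.002},
    {"task_type": "code_change", "model": "sonnet", "retries": 0, "cost_usd": 0.030},
    {"task_type": "hard_debugging", "model": "opus", "retries": 0, "cost_usd": 0.167},
]
metrics = routing_metrics(sample)
too_small = [m for m, rate in metrics["retry_rate"].items() if rate >= 0.30]
print(metrics["avg_cost_per_task"], too_small)
```

Running the same summary on records from before and after classifier deployment gives the cost comparison described in the list.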
7. Four ways routing breaks
7.1 Permanent regression to small models
- "Haiku is fast and cheap, route everything to it."
- Result: retries explode → effective cost equal or higher.
7.2 Default-to-large
- "Opus is safe, default everything there."
- Result: 80% of work is overspend.
7.3 Routing without measurement
- Rolled out routing but never measured impact.
- Result: six months later the rules don't match reality and no one notices.
7.4 Too many models
- 5 providers × 3 models = 15 options.
- Result: classifier becomes unpredictable, fallback chains explode in complexity, debugging suffers.
8. At a glance
| Step | Core | Signal |
|---|---|---|
| 1. Task classification | Rule-based or classifier | task_type distribution |
| 2. Model mapping | Start small | Retry rate |
| 3. Multi-provider | Data / strength / availability | Fallback firing rate |
| 4. Automation | Classifier + router | Average cost per task trend |
| 5. Monitoring | Daily tracking per provider × model | success_rate, latency |
Routing is not write-once. Model lineups, prices, and internal task mix change — quarterly review is appropriate.
Next up
Part 3/4: Prompt Caching Guide — 1-hour vs 5-minute Cache. If routing decides the model, caching decides the input tokens. Both levers must work together to complete cost reduction.
References
- Anthropic, Pricing — claude.com/pricing (verified 2026-05-05).
- OpenAI, Pricing — platform.openai.com/pricing (verified 2026-05-05).
- Coding Agents in Practice (5/5) — four cost levers (final entry of series A).
- Series part 1 — token cost structure and pitfalls.
This is part 2/4 of the AI Operations Economics series.