AI Operations Economics (2/4) — Model Routing: The Cost / Quality / Latency Triangle

May 05, 2026

"The most expensive model" is not the answer — over 80% of tasks can hit the same outcome at 1/10 the cost.

Key takeaways

Routing is a rule system that matches tasks to models. It is not on-the-fly model selection
Three axes: cost / quality / latency — every routing decision is a tradeoff among these
Primary sources: Anthropic and OpenAI pricing pages; first-person operational data
Starting question: "If this task fails, can we just retry?" If yes, start with a smaller model
Real automation: a classifier categorizes the task and the router picks the model. Human ad-hoc selection is not routing

1. Why route at all

LLM unit prices vary widely within a single provider.

Model	Input ($/1M tokens)	Output ($/1M tokens)
Claude Opus 4.7	$15	$75
Claude Sonnet 4.6	$3	$15
Claude Haiku 4.5	$0.80	$4

Opus costs roughly 19× as much as Haiku on both input and output (15 / 0.80 = 75 / 4 = 18.75). Running a classification task on Opus means a bill roughly 19× larger for the same result.

The right question is "is Haiku enough?" — not "is Opus better?" Picking the cheapest model that is good enough is the heart of routing.
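To make the price gap concrete, here is a minimal sketch that prices a single call against the table above. The token counts (2,000 input, 50 output) are assumed figures for a short classification request, and `PRICES` and `call_cost` are illustrative names, not part of any SDK.

```python
# Prices in USD per 1M tokens, copied from the table above.
PRICES = {
    "opus": {"input": 15.00, "output": 75.00},
    "sonnet": {"input": 3.00, "output": 15.00},
    "haiku": {"input": 0.80, "output": 4.00},
}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the list-price USD cost of one call (no caching, no retries)."""
    price = PRICES[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


# Assumed workload: one short classification request.
for model in PRICES:
    print(f"{model:>6}: ${call_cost(model, 2_000, 50):.6f}")
```

Because the Opus-to-Haiku ratio is the same on input and output, the gap stays at 18.75× for any mix of input and output tokens.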

2. The three axes

Every routing decision is a tradeoff across three axes.

2.1 Cost

Unit price × token usage.
Caching, routing, and short outputs together cut cost by roughly 95% (see part 1/4 of this series and part 5/5 of series A, Coding Agents in Practice).

2.2 Quality

Measure as retry frequency. If a small model needs three tries, effective cost is 3×.
Quality is not an accuracy threshold — it's a retry function.

2.3 Latency

Smaller models are faster. For interactive use cases that need 5-second answers, the largest models are out.
Background jobs have zero latency weight.

Conceptual formula:

Effective cost = (rate × tokens) × (1 + retry_rate) + (latency × user_impatience_penalty)

In practice this collapses into a rule table.
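The formula can be written as a small function. This is a sketch under two assumptions the article does not state: `retry_rate` is read as the average number of extra attempts per task (so three tries means `retry_rate = 2.0`, matching the 3× example in 2.2), and `impatience_penalty` is a hypothetical dollars-per-second weight that is zero for background jobs.

```python
def effective_cost(
    rate_per_token: float,
    tokens: int,
    retry_rate: float,
    latency_s: float = 0.0,
    impatience_penalty: float = 0.0,
) -> float:
    """Effective cost = (rate × tokens) × (1 + retry_rate) + latency × penalty.

    retry_rate: assumed to mean average extra attempts per task.
    impatience_penalty: assumed USD-per-second weight; 0 for background jobs.
    """
    if retry_rate < 0:
        raise ValueError("retry_rate must be >= 0")
    return rate_per_token * tokens * (1 + retry_rate) + latency_s * impatience_penalty


# Input-side cost only, 10,000 tokens, using the table's input prices.
haiku_three_tries = effective_cost(0.80 / 1_000_000, 10_000, retry_rate=2.0)
sonnet_rare_retry = effective_cost(3.00 / 1_000_000, 10_000, retry_rate=0.1)
print(f"haiku:  ${haiku_three_tries:.4f}")
print(f"sonnet: ${sonnet_rare_retry:.4f}")
```

Setting `impatience_penalty` to zero reproduces the background-job case, where only price and retries matter.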

3. Task × model mapping

A reasonable starting point. Calibrate with your own measurements.

Task	Recommended model	Reasoning
Classification / labeling / intent	Haiku or local oMLX	Simple; retry cost is low
Summarization (≤5K input)	Haiku or Sonnet	Start with Haiku; promote if quality slips
Standard code change / PR drafting	Sonnet	Best price-quality balance
Hard debugging / design / multi-step reasoning	Opus	Retry savings outweigh per-call cost
Background monitoring / polling	Haiku or local	High frequency, latency-tolerant
Interactive chat	Sonnet (streaming)	Latency requirement ≤2s

Core rule: start small, measure, promote only when needed. Starting with the largest model is almost always overspend.
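One way to hold this mapping in code is a plain lookup table plus a promotion step. The sketch below is illustrative: the `task_type` labels, the `sonnet` default for unknown tasks, and the helper names are assumptions. The 30% threshold is borrowed from the retry-rate guideline in section 6.

```python
# Starting rule table; keys are hypothetical task_type labels.
ROUTES = {
    "classification": "haiku",
    "summarization": "haiku",
    "code_change": "sonnet",
    "hard_debugging": "opus",
    "background_monitoring": "haiku",
    "interactive_chat": "sonnet",
}

# Promotion ladder, smallest to largest.
LADDER = ["haiku", "sonnet", "opus"]


def pick_model(task_type: str, default: str = "sonnet") -> str:
    """Look up the model for a task type; unknown types get the assumed default."""
    return ROUTES.get(task_type, default)


def promote_if_needed(task_type: str, retry_rate: float, threshold: float = 0.30) -> str:
    """Move a task type one step up the ladder when its measured retry rate is too high."""
    current = pick_model(task_type)
    idx = LADDER.index(current)
    if retry_rate >= threshold and idx < len(LADDER) - 1:
        ROUTES[task_type] = LADDER[idx + 1]
    return pick_model(task_type)


print(pick_model("classification"))
print(promote_if_needed("summarization", retry_rate=0.35))
```

Promotion moves one step at a time so that each change can be measured before the next one.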

4. Multi-provider routing

If you mix Anthropic, OpenAI, and local rather than living in a single ecosystem, cross-provider routing is an additional lever.

Provider selection axes:
- Data location: can code or documents leave for an external API? If not, route local.
- Model strengths: code → Anthropic tends to lead; certain reasoning tasks → OpenAI o-series (decide via benchmarks plus your own measurements).
- Availability: when one provider 5xx's, automatically fall back to another.

Fallback strategy:
- Primary: most cost-effective provider.
- Secondary: same-class model from another provider.
- Tertiary: when both fail, escalate to a stronger model (retry is expensive, but the task must complete).
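The three-tier fallback can be sketched as an ordered chain of callables. Everything here is hypothetical scaffolding: `ProviderError` stands in for whatever your client raises on a 5xx or timeout, and the stub functions replace real provider calls.

```python
from typing import Callable, List, Sequence, Tuple


class ProviderError(Exception):
    """Hypothetical error a call function raises on a 5xx or timeout."""


# Each step is (label, call_fn); call_fn takes the prompt and returns text.
Step = Tuple[str, Callable[[str], str]]


def call_with_fallback(prompt: str, chain: Sequence[Step]) -> Tuple[str, str]:
    """Try each step in order and return (label_that_succeeded, result)."""
    errors: List[str] = []
    for label, call_fn in chain:
        try:
            return label, call_fn(prompt)
        except ProviderError as exc:
            errors.append(f"{label}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Stub providers for illustration only.
def primary(prompt: str) -> str:
    raise ProviderError("503 from primary")


def secondary(prompt: str) -> str:
    return "ok from secondary"


def tertiary(prompt: str) -> str:
    return "ok from stronger model"


chain = [("primary", primary), ("secondary", secondary), ("tertiary", tertiary)]
print(call_with_fallback("hello", chain))
```

Returning the label alongside the result makes it easy to log which tier actually served the request.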

Measurement: track provider × model × success_rate daily. If fallback fires often for one task, reconsider the primary choice.
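A daily success-rate table can be built from call logs with a short aggregation. This is a sketch that assumes each log record is a dict with `date`, `provider`, `model`, and `success` keys; those field names and the sample values are made up for illustration.

```python
from collections import defaultdict


def daily_success_rates(records):
    """Return {(date, provider, model): success_rate} from an iterable of log dicts."""
    totals = defaultdict(lambda: [0, 0])  # key -> [successes, calls]
    for r in records:
        key = (r["date"], r["provider"], r["model"])
        if r["success"]:
            totals[key][0] += 1
        totals[key][1] += 1
    return {key: ok / calls for key, (ok, calls) in totals.items()}


# Hypothetical sample log.
records = [
    {"date": "2026-05-01", "provider": "anthropic", "model": "haiku", "success": True},
    {"date": "2026-05-01", "provider": "anthropic", "model": "haiku", "success": False},
    {"date": "2026-05-01", "provider": "openai", "model": "model-a", "success": True},
]
for key, rate in sorted(daily_success_rates(records).items()):
    print(key, f"{rate:.0%}")
```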

5. Classifier-driven automatic routing

Human selection per call is not routing — it's improvisation. Real routing is automated by a classifier.

Minimum classifier:
- Input: user request text + metadata (task length, available tools, etc.).
- Output: {model: "haiku-4.5", reason: "classification task"}.
- Implementation: rule-based (keywords) first → upgrade to a small LLM (Haiku) once measured.
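A rule-based version of this classifier fits in a few lines. The keyword patterns, the `input_tokens` metadata key, and the `sonnet-4.6` and `opus-4.7` labels are assumptions for illustration (the labels follow the article's `haiku-4.5` style and are not API model IDs). The 5K cutoff reads the "≤5K input" row of the mapping table as tokens.

```python
import re
from typing import Optional

# Hypothetical keyword rules, checked in order; the first match wins.
RULES = [
    (re.compile(r"\b(debug|root cause|architecture|design)\b", re.I), "opus-4.7", "hard reasoning task"),
    (re.compile(r"\b(classify|label|categorize|intent)\b", re.I), "haiku-4.5", "classification task"),
    (re.compile(r"\b(summarize|summary)\b", re.I), "haiku-4.5", "summarization task"),
]
DEFAULT = ("sonnet-4.6", "no rule matched; default tier")


def classify(request_text: str, metadata: Optional[dict] = None) -> dict:
    """Return {"model": ..., "reason": ...} using keyword rules plus one size check."""
    metadata = metadata or {}
    for pattern, model, reason in RULES:
        if pattern.search(request_text):
            # Assumed rule: long summarization inputs start one tier up.
            if reason == "summarization task" and metadata.get("input_tokens", 0) > 5_000:
                return {"model": "sonnet-4.6", "reason": "summarization over 5K input tokens"}
            return {"model": model, "reason": reason}
    model, reason = DEFAULT
    return {"model": model, "reason": reason}


print(classify("Please classify this ticket by intent"))
print(classify("Summarize the attached report", {"input_tokens": 12_000}))
```

Rule order matters here: the hard-reasoning rule comes first so that a request like "debug and summarize" is not sent to the smallest model.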

Router shape:

[user request] → [classifier] → [model pick] → [LLM call] → [result]
                     │                              │
                     └─[log: task_type, model, cost, latency]
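The diagram can be wired up as a single function. This is a minimal sketch: `route`, `fake_classify`, and `fake_call_model` are hypothetical names, the classifier is assumed to return both `task_type` and `model`, and the model call is assumed to return its own cost alongside the result.

```python
import time


def route(request_text, classify, call_model, log):
    """classifier -> model pick -> LLM call, with one log record per request."""
    decision = classify(request_text)
    start = time.perf_counter()
    result, cost_usd = call_model(decision["model"], request_text)
    log.append({
        "task_type": decision["task_type"],
        "model": decision["model"],
        "cost": cost_usd,
        "latency": time.perf_counter() - start,  # seconds
    })
    return result


# Stubs standing in for a real classifier and a real provider call.
def fake_classify(text):
    return {"task_type": "classification", "model": "haiku-4.5"}


def fake_call_model(model, text):
    return f"[{model}] ok", 0.0006


log = []
print(route("Label this ticket", fake_classify, fake_call_model, log))
print(log[-1]["task_type"], log[-1]["model"], log[-1]["cost"])
```

Passing the classifier and the call function in as arguments keeps the router testable without network access.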

The classifier itself burns tokens, so use the smallest model or rule logic for it. If the classifier exceeds 10% of call cost, simplify to rules.
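The 10% guideline is easy to check numerically. In this sketch the token counts are assumed, the prices are the Haiku and Sonnet rows from the table in section 1, and the function names are illustrative.

```python
def cost_usd(input_tokens, output_tokens, in_price, out_price):
    """List-price cost of one call; prices are USD per 1M tokens."""
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


def classifier_too_expensive(classifier_cost, call_cost, limit=0.10):
    """True when the classifier costs more than `limit` of the routed call."""
    if call_cost <= 0:
        raise ValueError("call_cost must be positive")
    return classifier_cost / call_cost > limit


# Assumed sizes: a Haiku-based classifier reading 300 tokens and emitting 20.
classifier = cost_usd(300, 20, 0.80, 4.00)
sonnet_call = cost_usd(2_000, 500, 3.00, 15.00)  # a routed Sonnet task
haiku_call = cost_usd(500, 50, 0.80, 4.00)       # a routed Haiku task

print(classifier_too_expensive(classifier, sonnet_call))
print(classifier_too_expensive(classifier, haiku_call))
```

With these assumed sizes the classifier is a small fraction of a Sonnet call but more than half of a small Haiku call, which is the case where plain rules are the better fit.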

6. Measurement — What to look at

Effective metrics for routing impact:

Model-call distribution: per task type, which model is used and at what rate.
Retry rate: per model. A small model with 30%+ retries is too small for the task.
Fallback firing rate: too high → primary choice lacks reliability.
Average cost per task: before vs. after classifier deployment.
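These four metrics can be computed from the router log in one pass. The sketch assumes each record carries `retries` (extra attempts for that call) and `used_fallback` in addition to the fields in the router diagram; both extra fields and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def routing_metrics(log):
    """Aggregate model distribution, retry rate, fallback rate, and average cost."""
    calls_by_task = defaultdict(Counter)  # task_type -> Counter of models
    retries_by_model = defaultdict(list)  # model -> retries per call
    cost_by_task = defaultdict(list)      # task_type -> cost per call
    fallbacks = 0
    for r in log:
        calls_by_task[r["task_type"]][r["model"]] += 1
        retries_by_model[r["model"]].append(r["retries"])
        cost_by_task[r["task_type"]].append(r["cost"])
        fallbacks += 1 if r["used_fallback"] else 0
    return {
        "model_distribution": {
            task: {m: n / sum(c.values()) for m, n in c.items()}
            for task, c in calls_by_task.items()
        },
        # Average extra attempts per call, per model.
        "retry_rate": {m: sum(v) / len(v) for m, v in retries_by_model.items()},
        "fallback_rate": fallbacks / len(log) if log else 0.0,
        "avg_cost_per_task": {t: sum(v) / len(v) for t, v in cost_by_task.items()},
    }


# Hypothetical sample log.
LOG = [
    {"task_type": "classification", "model": "haiku", "cost": 0.001, "retries": 0, "used_fallback": False},
    {"task_type": "classification", "model": "haiku", "cost": 0.002, "retries": 1, "used_fallback": False},
    {"task_type": "code_change", "model": "sonnet", "cost": 0.020, "retries": 0, "used_fallback": True},
    {"task_type": "classification", "model": "sonnet", "cost": 0.006, "retries": 0, "used_fallback": False},
]
metrics = routing_metrics(LOG)
print(metrics["retry_rate"])
print(metrics["fallback_rate"])
```

Running the same function on logs from before and after a classifier rollout gives the average-cost comparison directly.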

Empirical rule: classifier deployment usually drops average cost by 50% or more. If it does not, the classifier's rules are probably too conservative (it isn't routing enough work to small models).

7. Four ways routing breaks

7.1 Default-to-small

"Haiku is fast and cheap, route everything to it."
Result: retries explode → effective cost equal or higher.

7.2 Default-to-large

"Opus is safe, default everything there."
Result: you overpay on roughly 80% of the work, the share a smaller model could have handled.

7.3 Routing without measurement

Rolled out routing but never measured impact.
Result: six months later the rules don't match reality and no one notices.

7.4 Too many models

5 providers × 3 models = 15 options.
Result: classifier becomes unpredictable, fallback chains explode in complexity, debugging suffers.

8. At a glance

Step	Core	Signal
1. Task classification	Rule-based or classifier	task_type distribution
2. Model mapping	Start small	Retry rate
3. Multi-provider	Data / strength / availability	Fallback firing rate
4. Automation	Classifier + router	Average cost per task trend
5. Monitoring	provider × model × daily	success_rate, latency

Routing is not write-once. Model lineups, prices, and internal task mix change — quarterly review is appropriate.

Next up

Part 3/4: Prompt Caching Guide — 1-hour vs 5-minute Cache. If routing decides the model, caching decides the input tokens. Both levers must work together to complete cost reduction.

References

Anthropic, Pricing — claude.com/pricing (verified 2026-05-05).
OpenAI, Pricing — platform.openai.com/pricing (verified 2026-05-05).
Coding Agents in Practice (5/5) — four cost levers (final entry of series A).
Series part 1 — token cost structure and pitfalls.

This is part 2/4 of the AI Operations Economics series.
