AI Operations Economics (2/4) — Model Routing: The Cost / Quality / Latency Triangle

AI ์šด์˜ ๊ฒฝ์ œํ•™ (2/4) — ๋ชจ๋ธ ๋ผ์šฐํŒ… ์ „๋žต: ๋น„์šฉ·ํ’ˆ์งˆ·์ง€์—ฐ 3์ถ• ์˜์‚ฌ๊ฒฐ์ •

"The most expensive model" is not the answer — over 80% of tasks can hit the same outcome at 1/10 the cost.


ํ•ต์‹ฌ ์š”์•ฝ

  • Routing is a rule system that matches tasks to models. It is not on-the-fly model selection
  • Three axes: cost / quality / latency — every routing decision is a tradeoff among these
  • Primary sources: Anthropic and OpenAI pricing pages; first-person operational data
  • Starting question: "If this task fails, can we just retry?" If yes, start with a smaller model
  • Real automation: a classifier categorizes the task and the router picks the model. Human ad-hoc selection is not routing

1. Why route at all

LLM unit prices vary widely within a single provider.

Model Input ($/1M tokens) Output ($/1M tokens)
Claude Opus 4.7 $15 $75
Claude Sonnet 4.6 $3 $15
Claude Haiku 4.5 $0.80 $4

Opus is roughly 19× the input price of Haiku. Running a classification task on Opus means a 19× bill.

The right question is "is Haiku enough?" — not "is Opus better?" Picking the cheapest model that is good enough is the heart of routing.


2. The three axes

Every routing decision is a tradeoff across three axes.

2.1 Cost

  • Unit price × token usage.
  • Caching, routing, and short outputs together cut cost by ~95% (parts 1/4 and series A's 5/5).

2.2 Quality

  • Measure as retry frequency. If a small model needs three tries, effective cost is 3×.
  • Quality is not an accuracy threshold — it's a retry function.

2.3 Latency

  • Smaller models are faster. For interactive use cases that need 5-second answers, the largest models are out.
  • Background jobs have zero latency weight.

Conceptual formula:

Effective cost = (rate × tokens) × (1 + retry_rate) + (latency × user_impatience_penalty)

In practice this collapses into a rule table.


3. Task × model mapping

A reasonable starting point. Calibrate with your own measurements.

Task Recommended model Reasoning
Classification / labeling / intent Haiku or local oMLX Simple; retry cost is low
Summarization (≤5K input) Haiku or Sonnet Start with Haiku; promote if quality slips
Standard code change / PR drafting Sonnet Best price-quality balance
Hard debugging / design / multi-step reasoning Opus Retry savings outweigh per-call cost
Background monitoring / polling Haiku or local High frequency, latency-tolerant
Interactive chat Sonnet (streaming) Latency requirement ≤2s

Core rule: start small, measure, promote only when needed. Starting with the largest model is almost always overspend.


4. Multi-provider routing

If you mix Anthropic, OpenAI, and local rather than living in a single ecosystem, cross-provider routing is an additional lever.

Provider selection axes: - Data location: can code or documents leave for an external API? If not, route local. - Model strengths: code → Anthropic tends to lead; certain reasoning tasks → OpenAI o-series (decide via benchmarks plus your own measurements). - Availability: when one provider 5xx's, automatically fall back to another.

Fallback strategy: - Primary: most cost-effective provider. - Secondary: same-class model from another provider. - Tertiary: when both fail, escalate to a stronger model (retry is expensive, but the task must complete).

Measurement: track provider × model × success_rate daily. If fallback fires often for one task, reconsider the primary choice.


5. Classifier-driven automatic routing

Human selection per call is not routing — it's improvisation. Real routing is automated by a classifier.

Minimum classifier: - Input: user request text + metadata (task length, available tools, etc.). - Output: {model: "haiku-4.5", reason: "classification task"}. - Implementation: rule-based (keywords) first → upgrade to a small LLM (Haiku) once measured.

Router shape:

[user request] → [classifier] → [model pick] → [LLM call] → [result]
                     │                              │
                     └─[log: task_type, model, cost, latency]

The classifier itself burns tokens, so use the smallest model or rule logic for it. If the classifier exceeds 10% of call cost, simplify to rules.


6. Measurement — What to look at

Effective metrics for routing impact:

  • Model-call distribution: per task type, which model is used and at what rate.
  • Retry rate: per model. A small model with 30%+ retries is too small for the task.
  • Fallback firing rate: too high → primary choice lacks reliability.
  • Average cost per task: before vs. after classifier deployment.

Empirical rule: classifier deployment usually drops average cost by 50%+ . If not, the classifier's rules are too conservative (it isn't routing enough work to small models).


7. Four ways routing breaks

7.1 Permanent regression to small models

  • "Haiku is fast and cheap, route everything to it."
  • Result: retries explode → effective cost equal or higher.

7.2 Default-to-large

  • "Opus is safe, default everything there."
  • Result: 80% of work is overspend.

7.3 Routing without measurement

  • Rolled out routing but never measured impact.
  • Result: six months later the rules don't match reality and no one notices.

7.4 Too many models

  • 5 providers × 3 models = 15 options.
  • Result: classifier becomes unpredictable, fallback chains explode in complexity, debugging suffers.

8. At a glance

Step Core Signal
1. Task classification Rule-based or classifier task_type distribution
2. Model mapping Start small Retry rate
3. Multi-provider Data / strength / availability Fallback firing rate
4. Automation Classifier + router Average cost per task trend
5. Monitoring provider × model × daily success_rate, latency

Routing is not write-once. Model lineups, prices, and internal task mix change — quarterly review is appropriate.


Next up

Part 3/4: Prompt Caching Guide — 1-hour vs 5-minute Cache. If routing decides the model, caching decides the input tokens. Both levers must work together to complete cost reduction.


References

  • Anthropic, Pricing — claude.com/pricing (verified 2026-05-05).
  • OpenAI, Pricing — platform.openai.com/pricing (verified 2026-05-05).
  • Coding Agents in Practice (5/5) — four cost levers (final entry of series A).
  • Series part 1 — token cost structure and pitfalls.

This is part 2/4 of the AI Operations Economics series.

๋Œ“๊ธ€

์ด ๋ธ”๋กœ๊ทธ์˜ ์ธ๊ธฐ ๊ฒŒ์‹œ๋ฌผ

Agent Memory Engine (2/10) — Building an AI Agent Memory System with SQLite Alone

"ML Foundations (9/9) — PyTorch vs TensorFlow, and the Road to Local LLMs"

"RAG Core Study (14/26) — Evaluation Sets with RAGAS & DeepEval"

"ML Foundations (8/9) — Deep Learning Architectures: CNN, RNN, Attention"

"ML Foundations (7/9) — Deep Learning Training: Optimizers, Regularization, Initialization"

OpenClaw to Hermes Migration (2/13) — What to Preserve, Partially Port, or Discard

AI Agents I Built (5/7) — Building an Automated Blogger API Publishing System