"Multi-Provider Routing — Which Model for Which Task (Harness Series 5/6)"
The category-best models from Series ① — using all nine of them is great, but how do you decide which one to call each time? The answer: a routing layer.
This article covers per-task automatic model selection — cost/quality tradeoffs, fallback strategies, building a classifier, and where unified gateways like OpenRouter fit.
Series Roadmap (6 parts)
- What Is Harness Engineering?
- Context Engineering
- Memory Systems
- Tools & Sandboxing
- Multi-Provider Routing ← this article
- Evaluation & Ops
1. Why Multi-Provider
Single-Provider Risk
- API outages (Anthropic 2024 incidents)
- Sudden price increases
- Model deprecation
- Regional limits (Seedance 2.0 = US blocked)
The Best Model Differs by Task
Recap from Series ①: - Overall #1: GPT-5.5 ($5/$30) — expensive but accurate - Price #1: DeepSeek V4-Flash ($0.14/$0.28) — 36× cheaper - Coding #1: Kimi K2.6 ($0.95/$4) — beats GPT-5.4 - Agent #1: MiniMax M2.7 ($0.30/$1.20) — top GDPval ELO - Local: Qwen3.6-27B — $0
If a user makes 100 calls/hour and all go to GPT-5.5, that's $50~$100/hour. With task-based routing, 80% can hit V4-Flash or local → 80% cost reduction.
2. Routing Dimensions
Variables in a routing decision:
2-1. Task Type
- Chat / summarization → V4-Flash or Sonnet 4.6
- Complex reasoning → Opus 4.7 or GPT-5.5
- Coding → Kimi K2.6 or Sonnet 4.6
- Agentic tool calling → MiniMax M2.7
- Image → GPT Image 2 or Nano Banana 2
2-2. Input Size
- < 4K tokens → Haiku 4.5 (cheap)
- 4K~32K → Sonnet 4.6
- 32K+ → Opus 4.7 or V4-Pro
- 200K+ → GPT-5.5 (1.05M) or V4 (1M)
2-3. Cost Sensitivity
- Free-tier user → local first
- Paid user → balanced
- Premium user → quality first
2-4. Accuracy Requirement
- Medical / legal → Opus 4.7 + verification
- General chat → Sonnet 4.6
- Data transformation → Haiku 4.5
2-5. Security / Data Policy
- Contains PII → force local (oMLX)
- Internal code → local or enterprise contract
- General → managed API
3. Routing Patterns
3-1. Static Routing (simplest)
A rule table:
- if: task_type == "chat" and input_size < 4000
use: anthropic/claude-haiku-4.5
- if: task_type == "code"
use: moonshot/kimi-k2.6
- if: task_type == "reasoning" and complexity == "high"
use: anthropic/claude-opus-4.7
- default: deepseek/v4-flash
Pros: clear, debuggable. Cons: every new case requires manual updates.
3-2. Classifier-Based Routing
A separate fast model (like Haiku 4.5) looks at the input and classifies:
User: "Refactor this function for thread safety"
Classifier: → "task=code, complexity=high, est_tokens=8K"
Router: → moonshot/kimi-k2.6
Pros: handles new cases automatically. Cons: classification cost (the classifier itself uses tokens).
3-3. Cascade Routing
Try the cheap model first → evaluate output → done if good, escalate if not.
1. Try Haiku 4.5 ($0.80/$4)
2. Eval: confidence < 0.7?
3. If yes: retry on Opus 4.7 ($15/$75)
Pros: lower average cost. Cons: failure cases pay both costs.
3-4. Ensemble Routing
Send the same input to multiple models and combine: - Critical medical/legal tasks - Higher cost, higher accuracy
4. Fallback Strategy
Provider Outage
Anthropic down → OpenAI OpenAI down → DeepSeek All down → local (oMLX)
Rate Limits
Each provider has per-minute token limits. Hit them → switch provider.
Pseudo-code
async def call_model(prompt, primary, fallbacks):
try:
return await primary.complete(prompt)
except (RateLimitError, ServerError, TimeoutError):
for fb in fallbacks:
try:
return await fb.complete(prompt)
except Exception:
continue
raise AllProvidersFailedError()
Circuit Breaker
3 failures within 5 minutes from one provider → block that provider for 5 minutes. Auto-recover.
5. OpenRouter — The Unified Gateway
What It Is
A service exposing every major LLM provider behind one OpenAI-compatible endpoint.
import openai
client = openai.OpenAI(
api_key="sk-or-v1-...",
base_url="https://openrouter.ai/api/v1"
)
response = client.chat.completions.create(
model="anthropic/claude-opus-4.7",
messages=[...]
)
Pros
- Single API key for every provider
- Auto-failover (configurable)
- Cost comparison + dashboards
- Cache-hit pricing (where supported)
- Mix closed and open-source models
Cons
- 5~10% markup
- Slight streaming differences
- New-model integration takes days~week
Alternatives
- LiteLLM (open source, self-host)
- Vercel AI Gateway
- Portkey
- Custom router (e.g., Router_Control)
6. Custom Router vs Unified Gateway
Build Your Own
- Direct integration with each provider
- Custom policies + cache
- You manage API keys
- Example: Router_Control (Node.js + classifier + proxy)
Unified Gateway (OpenRouter etc.)
- Quick start
- Markup cost
- Limited policy customization
Recommendation: - Starting out: OpenRouter - 1M+ calls/month or specialized policy: build your own
7. Caching — The Single Biggest Cost Lever
Anthropic Prompt Caching
- 5-minute cache: write 1.25× / read 0.1× (10× cheaper)
- 1-hour cache: write 2× / read 0.1×
- Claude Code uses 1-hour cache → 2× write multiplier applies
OpenAI Prompt Caching
- Automatic, 50% discount on cache hits
DeepSeek V4
- Cache-hit input $0.0028 (V4-Flash) — 1/50 of standard
- For internal chatbots with fixed system prompts, hit rates can exceed 90%
Caching Strategy
- Fix the system prompt to maximize cache benefit
- Place frequently changing parts at the end (avoid invalidating prefix cache)
- Inject per-user variables from the memory system dynamically
8. Cost Tracking + Analytics
Per-call cost
cost_per_call = (input_tokens × input_price + output_tokens × output_price) / 1M
Per-user / Per-task aggregation
- Per-user → pricing strategy
- Per-task → improve routing policy
Outlier Detection
- Calls 10× more expensive than average → alert (RAG runaways, infinite loops)
- Sudden hourly spikes
Tools
- LangSmith (managed)
- Langfuse (open source)
- Helicone
- Custom SQLite + dashboard
9. Anti-Patterns
9-1. "Always Use the Most Expensive Model"
80% of cases would do fine on V4-Flash; using only GPT-5.5 → cost explosion.
9-2. "Static Rules Everywhere"
A 50-line if-else file → maintenance burden. Time to introduce a classifier.
9-3. "No Fallback"
One Anthropic outage → entire service down. Multi-provider isn't optional — it's insurance.
9-4. "Caching, What's That?"
A 50K-token system prompt re-prefilling on every call → 100 user calls = 5M tokens = $25 on GPT-5.5. With cache, $2.50.
9-5. "No Cost Tracking"
You only find out when the monthly bill arrives. Real-time tracking is mandatory.
10. Real Routing Policy (Coding Assistant)
routing:
rules:
- condition: input_tokens < 1000 and task == "code_completion"
model: local/qwen3.6-27b # oMLX, $0
- condition: task == "explain_code" and input_tokens < 8000
model: anthropic/claude-haiku-4.5 # $0.80/$4
- condition: task == "refactor" or task == "implement"
model: moonshot/kimi-k2.6 # $0.95/$4, SWE-Pro 58.6%
- condition: task == "architecture_review" or complexity == "high"
model: anthropic/claude-opus-4.7 # $15/$75
- default: deepseek/v4-flash # $0.14/$0.28
fallback:
primary_failure: openrouter
rate_limit: backup_provider
cache:
system_prompt: true
duration: 1h
Outcome: - 80% calls: local + Haiku + V4-Flash → ~$0.30/hr - 15%: Kimi K2.6 → ~$1 - 5%: Opus 4.7 → ~$10
Weighted average ~$1/hr. All-Opus would be ~$20/hr — 20× difference.
Bottom Line
| Decision | Recommendation |
|---|---|
| Starting | OpenRouter + static rules |
| Growing | Add a classifier |
| Cost-critical | Cascade + maximize cache |
| Security-critical | Local (oMLX) primary, managed as fallback |
| Reliability | At least 2 providers + auto-failover |
The single takeaway: "Single-model use is no longer the rational default in 2026."
Part 6 (the final one) covers the measurement layer that makes everything else accountable — Evaluation & Operations.
First-Party Sources
- OpenRouter: openrouter.ai/docs
- Anthropic Prompt Caching: docs.anthropic.com/en/docs/build-with-claude/prompt-caching
- DeepSeek V4 pricing: api-docs.deepseek.com/quick_start/pricing
- LangSmith: docs.smith.langchain.com
- Langfuse: langfuse.com
๋๊ธ
๋๊ธ ์ฐ๊ธฐ