"Multi-Provider Routing — Which Model for Which Task (Harness Series 5/6)"

4월 29, 2026

The category-best models from Series ① — using all nine of them is great, but how do you decide which one to call each time? The answer: a routing layer.

This article covers per-task automatic model selection — cost/quality tradeoffs, fallback strategies, building a classifier, and where unified gateways like OpenRouter fit.

Series Roadmap (6 parts)

What Is Harness Engineering?
Context Engineering
Memory Systems
Tools & Sandboxing
Multi-Provider Routing ← this article
Evaluation & Ops

1. Why Multi-Provider

Single-Provider Risk

API outages (Anthropic 2024 incidents)
Sudden price increases
Model deprecation
Regional limits (Seedance 2.0 = US blocked)

The Best Model Differs by Task

Recap from Series ①: - Overall #1: GPT-5.5 ($5/$30) — expensive but accurate - Price #1: DeepSeek V4-Flash ($0.14/$0.28) — 36× cheaper - Coding #1: Kimi K2.6 ($0.95/$4) — beats GPT-5.4 - Agent #1: MiniMax M2.7 ($0.30/$1.20) — top GDPval ELO - Local: Qwen3.6-27B — $0

If a user makes 100 calls/hour and all go to GPT-5.5, that's $50~$100/hour. With task-based routing, 80% can hit V4-Flash or local → 80% cost reduction.

2. Routing Dimensions

Variables in a routing decision:

2-1. Task Type

Chat / summarization → V4-Flash or Sonnet 4.6
Complex reasoning → Opus 4.7 or GPT-5.5
Coding → Kimi K2.6 or Sonnet 4.6
Agentic tool calling → MiniMax M2.7
Image → GPT Image 2 or Nano Banana 2

2-2. Input Size

< 4K tokens → Haiku 4.5 (cheap)
4K~32K → Sonnet 4.6
32K+ → Opus 4.7 or V4-Pro
200K+ → GPT-5.5 (1.05M) or V4 (1M)

2-3. Cost Sensitivity

Free-tier user → local first
Paid user → balanced
Premium user → quality first

2-4. Accuracy Requirement

Medical / legal → Opus 4.7 + verification
General chat → Sonnet 4.6
Data transformation → Haiku 4.5

2-5. Security / Data Policy

Contains PII → force local (oMLX)
Internal code → local or enterprise contract
General → managed API

3. Routing Patterns

3-1. Static Routing (simplest)

A rule table:

- if: task_type == "chat" and input_size < 4000
  use: anthropic/claude-haiku-4.5
- if: task_type == "code"
  use: moonshot/kimi-k2.6
- if: task_type == "reasoning" and complexity == "high"
  use: anthropic/claude-opus-4.7
- default: deepseek/v4-flash

Pros: clear, debuggable. Cons: every new case requires manual updates.

3-2. Classifier-Based Routing

A separate fast model (like Haiku 4.5) looks at the input and classifies:

User: "Refactor this function for thread safety"
Classifier: → "task=code, complexity=high, est_tokens=8K"
Router: → moonshot/kimi-k2.6

Pros: handles new cases automatically. Cons: classification cost (the classifier itself uses tokens).

3-3. Cascade Routing

Try the cheap model first → evaluate output → done if good, escalate if not.

1. Try Haiku 4.5 ($0.80/$4)
2. Eval: confidence < 0.7?
3. If yes: retry on Opus 4.7 ($15/$75)

Pros: lower average cost. Cons: failure cases pay both costs.

3-4. Ensemble Routing

Send the same input to multiple models and combine: - Critical medical/legal tasks - Higher cost, higher accuracy

4. Fallback Strategy

Provider Outage

Anthropic down → OpenAI OpenAI down → DeepSeek All down → local (oMLX)

Rate Limits

Each provider has per-minute token limits. Hit them → switch provider.

Pseudo-code

async def call_model(prompt, primary, fallbacks):
    try:
        return await primary.complete(prompt)
    except (RateLimitError, ServerError, TimeoutError):
        for fb in fallbacks:
            try:
                return await fb.complete(prompt)
            except Exception:
                continue
        raise AllProvidersFailedError()

Circuit Breaker

3 failures within 5 minutes from one provider → block that provider for 5 minutes. Auto-recover.

5. OpenRouter — The Unified Gateway

What It Is

A service exposing every major LLM provider behind one OpenAI-compatible endpoint.

import openai
client = openai.OpenAI(
    api_key="sk-or-v1-...",
    base_url="https://openrouter.ai/api/v1"
)
response = client.chat.completions.create(
    model="anthropic/claude-opus-4.7",
    messages=[...]
)

Pros

Single API key for every provider
Auto-failover (configurable)
Cost comparison + dashboards
Cache-hit pricing (where supported)
Mix closed and open-source models

Cons

5~10% markup
Slight streaming differences
New-model integration takes days~week

Alternatives

LiteLLM (open source, self-host)
Vercel AI Gateway
Portkey
Custom router (e.g., Router_Control)

6. Custom Router vs Unified Gateway

Build Your Own

Direct integration with each provider
Custom policies + cache
You manage API keys
Example: Router_Control (Node.js + classifier + proxy)

Unified Gateway (OpenRouter etc.)

Quick start
Markup cost
Limited policy customization

Recommendation: - Starting out: OpenRouter - 1M+ calls/month or specialized policy: build your own

7. Caching — The Single Biggest Cost Lever

Anthropic Prompt Caching

5-minute cache: write 1.25× / read 0.1× (10× cheaper)
1-hour cache: write 2× / read 0.1×
Claude Code uses 1-hour cache → 2× write multiplier applies

OpenAI Prompt Caching

Automatic, 50% discount on cache hits

DeepSeek V4

Cache-hit input $0.0028 (V4-Flash) — 1/50 of standard
For internal chatbots with fixed system prompts, hit rates can exceed 90%

Caching Strategy

Fix the system prompt to maximize cache benefit
Place frequently changing parts at the end (avoid invalidating prefix cache)
Inject per-user variables from the memory system dynamically

8. Cost Tracking + Analytics

Per-call cost

cost_per_call = (input_tokens × input_price + output_tokens × output_price) / 1M

Per-user / Per-task aggregation

Per-user → pricing strategy
Per-task → improve routing policy

Outlier Detection

Calls 10× more expensive than average → alert (RAG runaways, infinite loops)
Sudden hourly spikes

Tools

LangSmith (managed)
Langfuse (open source)
Helicone
Custom SQLite + dashboard

9. Anti-Patterns

9-1. "Always Use the Most Expensive Model"

80% of cases would do fine on V4-Flash; using only GPT-5.5 → cost explosion.

9-2. "Static Rules Everywhere"

A 50-line if-else file → maintenance burden. Time to introduce a classifier.

9-3. "No Fallback"

One Anthropic outage → entire service down. Multi-provider isn't optional — it's insurance.

9-4. "Caching, What's That?"

A 50K-token system prompt re-prefilling on every call → 100 user calls = 5M tokens = $25 on GPT-5.5. With cache, $2.50.

9-5. "No Cost Tracking"

You only find out when the monthly bill arrives. Real-time tracking is mandatory.

10. Real Routing Policy (Coding Assistant)

routing:
  rules:
    - condition: input_tokens < 1000 and task == "code_completion"
      model: local/qwen3.6-27b  # oMLX, $0

    - condition: task == "explain_code" and input_tokens < 8000
      model: anthropic/claude-haiku-4.5  # $0.80/$4

    - condition: task == "refactor" or task == "implement"
      model: moonshot/kimi-k2.6  # $0.95/$4, SWE-Pro 58.6%

    - condition: task == "architecture_review" or complexity == "high"
      model: anthropic/claude-opus-4.7  # $15/$75

    - default: deepseek/v4-flash  # $0.14/$0.28

  fallback:
    primary_failure: openrouter
    rate_limit: backup_provider

  cache:
    system_prompt: true
    duration: 1h

Outcome: - 80% calls: local + Haiku + V4-Flash → ~$0.30/hr - 15%: Kimi K2.6 → ~$1 - 5%: Opus 4.7 → ~$10

Weighted average ~$1/hr. All-Opus would be ~$20/hr — 20× difference.

Bottom Line

Decision	Recommendation
Starting	OpenRouter + static rules
Growing	Add a classifier
Cost-critical	Cascade + maximize cache
Security-critical	Local (oMLX) primary, managed as fallback
Reliability	At least 2 providers + auto-failover

The single takeaway: "Single-model use is no longer the rational default in 2026."

Part 6 (the final one) covers the measurement layer that makes everything else accountable — Evaluation & Operations.

First-Party Sources

OpenRouter: openrouter.ai/docs
Anthropic Prompt Caching: docs.anthropic.com/en/docs/build-with-claude/prompt-caching
DeepSeek V4 pricing: api-docs.deepseek.com/quick_start/pricing
LangSmith: docs.smith.langchain.com
Langfuse: langfuse.com