"Multi-Provider Routing — Which Model for Which Task (Harness Series 5/6)"

The category-best models from Series ① — using all nine of them is great, but how do you decide which one to call each time? The answer: a routing layer.

This article covers per-task automatic model selection — cost/quality tradeoffs, fallback strategies, building a classifier, and where unified gateways like OpenRouter fit.

Series Roadmap (6 parts)

  1. What Is Harness Engineering?
  2. Context Engineering
  3. Memory Systems
  4. Tools & Sandboxing
  5. Multi-Provider Routing ← this article
  6. Evaluation & Ops

1. Why Multi-Provider

Single-Provider Risk

  • API outages (Anthropic 2024 incidents)
  • Sudden price increases
  • Model deprecation
  • Regional limits (Seedance 2.0 = US blocked)

The Best Model Differs by Task

Recap from Series ①: - Overall #1: GPT-5.5 ($5/$30) — expensive but accurate - Price #1: DeepSeek V4-Flash ($0.14/$0.28) — 36× cheaper - Coding #1: Kimi K2.6 ($0.95/$4) — beats GPT-5.4 - Agent #1: MiniMax M2.7 ($0.30/$1.20) — top GDPval ELO - Local: Qwen3.6-27B — $0

If a user makes 100 calls/hour and all go to GPT-5.5, that's $50~$100/hour. With task-based routing, 80% can hit V4-Flash or local → 80% cost reduction.


2. Routing Dimensions

Variables in a routing decision:

2-1. Task Type

  • Chat / summarization → V4-Flash or Sonnet 4.6
  • Complex reasoning → Opus 4.7 or GPT-5.5
  • Coding → Kimi K2.6 or Sonnet 4.6
  • Agentic tool calling → MiniMax M2.7
  • Image → GPT Image 2 or Nano Banana 2

2-2. Input Size

  • < 4K tokens → Haiku 4.5 (cheap)
  • 4K~32K → Sonnet 4.6
  • 32K+ → Opus 4.7 or V4-Pro
  • 200K+ → GPT-5.5 (1.05M) or V4 (1M)

2-3. Cost Sensitivity

  • Free-tier user → local first
  • Paid user → balanced
  • Premium user → quality first

2-4. Accuracy Requirement

  • Medical / legal → Opus 4.7 + verification
  • General chat → Sonnet 4.6
  • Data transformation → Haiku 4.5

2-5. Security / Data Policy

  • Contains PII → force local (oMLX)
  • Internal code → local or enterprise contract
  • General → managed API

3. Routing Patterns

3-1. Static Routing (simplest)

A rule table:

- if: task_type == "chat" and input_size < 4000
  use: anthropic/claude-haiku-4.5
- if: task_type == "code"
  use: moonshot/kimi-k2.6
- if: task_type == "reasoning" and complexity == "high"
  use: anthropic/claude-opus-4.7
- default: deepseek/v4-flash

Pros: clear, debuggable. Cons: every new case requires manual updates.

3-2. Classifier-Based Routing

A separate fast model (like Haiku 4.5) looks at the input and classifies:

User: "Refactor this function for thread safety"
Classifier: → "task=code, complexity=high, est_tokens=8K"
Router: → moonshot/kimi-k2.6

Pros: handles new cases automatically. Cons: classification cost (the classifier itself uses tokens).

3-3. Cascade Routing

Try the cheap model first → evaluate output → done if good, escalate if not.

1. Try Haiku 4.5 ($0.80/$4)
2. Eval: confidence < 0.7?
3. If yes: retry on Opus 4.7 ($15/$75)

Pros: lower average cost. Cons: failure cases pay both costs.

3-4. Ensemble Routing

Send the same input to multiple models and combine: - Critical medical/legal tasks - Higher cost, higher accuracy


4. Fallback Strategy

Provider Outage

Anthropic down → OpenAI OpenAI down → DeepSeek All down → local (oMLX)

Rate Limits

Each provider has per-minute token limits. Hit them → switch provider.

Pseudo-code

async def call_model(prompt, primary, fallbacks):
    try:
        return await primary.complete(prompt)
    except (RateLimitError, ServerError, TimeoutError):
        for fb in fallbacks:
            try:
                return await fb.complete(prompt)
            except Exception:
                continue
        raise AllProvidersFailedError()

Circuit Breaker

3 failures within 5 minutes from one provider → block that provider for 5 minutes. Auto-recover.


5. OpenRouter — The Unified Gateway

What It Is

A service exposing every major LLM provider behind one OpenAI-compatible endpoint.

import openai
client = openai.OpenAI(
    api_key="sk-or-v1-...",
    base_url="https://openrouter.ai/api/v1"
)
response = client.chat.completions.create(
    model="anthropic/claude-opus-4.7",
    messages=[...]
)

Pros

  • Single API key for every provider
  • Auto-failover (configurable)
  • Cost comparison + dashboards
  • Cache-hit pricing (where supported)
  • Mix closed and open-source models

Cons

  • 5~10% markup
  • Slight streaming differences
  • New-model integration takes days~week

Alternatives

  • LiteLLM (open source, self-host)
  • Vercel AI Gateway
  • Portkey
  • Custom router (e.g., Router_Control)

6. Custom Router vs Unified Gateway

Build Your Own

  • Direct integration with each provider
  • Custom policies + cache
  • You manage API keys
  • Example: Router_Control (Node.js + classifier + proxy)

Unified Gateway (OpenRouter etc.)

  • Quick start
  • Markup cost
  • Limited policy customization

Recommendation: - Starting out: OpenRouter - 1M+ calls/month or specialized policy: build your own


7. Caching — The Single Biggest Cost Lever

Anthropic Prompt Caching

  • 5-minute cache: write 1.25× / read 0.1× (10× cheaper)
  • 1-hour cache: write 2× / read 0.1×
  • Claude Code uses 1-hour cache → 2× write multiplier applies

OpenAI Prompt Caching

  • Automatic, 50% discount on cache hits

DeepSeek V4

  • Cache-hit input $0.0028 (V4-Flash) — 1/50 of standard
  • For internal chatbots with fixed system prompts, hit rates can exceed 90%

Caching Strategy

  • Fix the system prompt to maximize cache benefit
  • Place frequently changing parts at the end (avoid invalidating prefix cache)
  • Inject per-user variables from the memory system dynamically

8. Cost Tracking + Analytics

Per-call cost

cost_per_call = (input_tokens × input_price + output_tokens × output_price) / 1M

Per-user / Per-task aggregation

  • Per-user → pricing strategy
  • Per-task → improve routing policy

Outlier Detection

  • Calls 10× more expensive than average → alert (RAG runaways, infinite loops)
  • Sudden hourly spikes

Tools

  • LangSmith (managed)
  • Langfuse (open source)
  • Helicone
  • Custom SQLite + dashboard

9. Anti-Patterns

9-1. "Always Use the Most Expensive Model"

80% of cases would do fine on V4-Flash; using only GPT-5.5 → cost explosion.

9-2. "Static Rules Everywhere"

A 50-line if-else file → maintenance burden. Time to introduce a classifier.

9-3. "No Fallback"

One Anthropic outage → entire service down. Multi-provider isn't optional — it's insurance.

9-4. "Caching, What's That?"

A 50K-token system prompt re-prefilling on every call → 100 user calls = 5M tokens = $25 on GPT-5.5. With cache, $2.50.

9-5. "No Cost Tracking"

You only find out when the monthly bill arrives. Real-time tracking is mandatory.


10. Real Routing Policy (Coding Assistant)

routing:
  rules:
    - condition: input_tokens < 1000 and task == "code_completion"
      model: local/qwen3.6-27b  # oMLX, $0

    - condition: task == "explain_code" and input_tokens < 8000
      model: anthropic/claude-haiku-4.5  # $0.80/$4

    - condition: task == "refactor" or task == "implement"
      model: moonshot/kimi-k2.6  # $0.95/$4, SWE-Pro 58.6%

    - condition: task == "architecture_review" or complexity == "high"
      model: anthropic/claude-opus-4.7  # $15/$75

    - default: deepseek/v4-flash  # $0.14/$0.28

  fallback:
    primary_failure: openrouter
    rate_limit: backup_provider

  cache:
    system_prompt: true
    duration: 1h

Outcome: - 80% calls: local + Haiku + V4-Flash → ~$0.30/hr - 15%: Kimi K2.6 → ~$1 - 5%: Opus 4.7 → ~$10

Weighted average ~$1/hr. All-Opus would be ~$20/hr — 20× difference.


Bottom Line

Decision Recommendation
Starting OpenRouter + static rules
Growing Add a classifier
Cost-critical Cascade + maximize cache
Security-critical Local (oMLX) primary, managed as fallback
Reliability At least 2 providers + auto-failover

The single takeaway: "Single-model use is no longer the rational default in 2026."

Part 6 (the final one) covers the measurement layer that makes everything else accountable — Evaluation & Operations.


First-Party Sources

  • OpenRouter: openrouter.ai/docs
  • Anthropic Prompt Caching: docs.anthropic.com/en/docs/build-with-claude/prompt-caching
  • DeepSeek V4 pricing: api-docs.deepseek.com/quick_start/pricing
  • LangSmith: docs.smith.langchain.com
  • Langfuse: langfuse.com

๋Œ“๊ธ€

์ด ๋ธ”๋กœ๊ทธ์˜ ์ธ๊ธฐ ๊ฒŒ์‹œ๋ฌผ

Agent Memory Engine (2/10) — Building an AI Agent Memory System with SQLite Alone

"ML Foundations (9/9) — PyTorch vs TensorFlow, and the Road to Local LLMs"

"RAG Core Study (14/26) — Evaluation Sets with RAGAS & DeepEval"

"ML Foundations (8/9) — Deep Learning Architectures: CNN, RNN, Attention"

"ML Foundations (7/9) — Deep Learning Training: Optimizers, Regularization, Initialization"

OpenClaw to Hermes Migration (2/13) — What to Preserve, Partially Port, or Discard

AI Agents I Built (5/7) — Building an Automated Blogger API Publishing System