"Evaluation & Operations — If You Can't Measure It, You Can't Improve It (Harness Series 6/6)"
The closing piece. You've built context (Part 2), memory (Part 3), tools (Part 4), and routing (Part 5). How do you know it actually works? You measure. What you can't measure, you can't improve.
This article covers the Evaluation (quality measurement) and Operations (runtime management) layers — the system that continuously verifies every component covered in Parts 1~5.
Series Roadmap (6 parts — final)
- What Is Harness Engineering?
- Context Engineering
- Memory Systems
- Tools & Sandboxing
- Multi-Provider Routing
- Evaluation & Operations ← this article (final)
1. Why Eval Is a Distinct Concern
Different from Regular Software
- Unit-test pass ≠ user satisfaction
- Same input → different outputs (non-deterministic)
- "Correct answer" isn't always defined (creative writing, summarization, refactoring)
- A model swap shifts all behavior subtly
Agents Need Continuous Verification
- New model release → routing policy impact
- System-prompt edit → regression risk
- New tool added → old behavior shifts
- Memory accumulation → answer drift
Conclusion: Eval isn't a one-time activity. It's ongoing infrastructure.
2. Eval Types
2-1. Automatic Metrics (quantitative)
- BLEU, ROUGE (translation/summary, limited usefulness)
- Code execution: does it run, do tests pass
- Schema validation: does JSON output match schema
- Length, latency, cost (auto-tracked)
2-2. LLM-as-Judge
- A larger model (e.g., Opus 4.7) evaluates outputs from a smaller model
- Rubric-based scoring (0~10)
- Limit: same model evaluating its own output → self-bias
2-3. Human Eval
- Most accurate, slowest, most expensive
- Build a 50~200-sample baseline up front
- Required to validate correlation with LLM-as-Judge
2-4. Adversarial / Red Team
- Deliberately hard cases
- Prompt injection tests
- Security bypass attempts
3. Building an Eval Harness
Dataset
- 100~1000 representative cases
- Reference outputs labeled (when applicable)
- Categorized (chat / code / reasoning / agent)
Eval Pipeline
results = []
for case in dataset:
output = await agent.run(case.input)
score = await evaluate(case, output) # automated or LLM-as-judge
results.append({
"case_id": case.id,
"model": agent.model,
"input_tokens": output.input_tokens,
"output_tokens": output.output_tokens,
"cost": output.cost,
"latency_ms": output.latency_ms,
"score": score,
"passed": score >= threshold,
})
summary = compute_metrics(results)
report.publish(summary)
CI Integration
- Pre-merge: run eval on 100 cases
- Block merge if below threshold
- Detect cost / latency / accuracy regressions
Continuous Eval
- Sample 5~10% of production traffic for eval
- Aggregate daily → trend detection
4. Observability — What to Measure
Core Metrics
| Category | Metric | Target |
|---|---|---|
| Cost | Per-call cost | Within budget |
| Latency | TTFT, TPOT, total | SLO compliant |
| Quality | Eval score, user feedback | No regression |
| Reliability | Success rate, failure classification | 99%+ |
| Usage | Per-user / per-task | Anomaly detection |
Traces (distributed tracing)
A single user request branches across multiple tool calls and model calls. View it as one graph.
User request
├─ Classifier (Haiku 4.5, 50ms)
├─ Tool: search_codebase (200ms)
├─ Tool: read_file × 3 (30ms each)
└─ Generation (Kimi K2.6, 2.1s)
Total: 2.4s, $0.03
Logs vs Traces vs Metrics
- Logs: text events (debugging)
- Traces: single-request flow (perf analysis)
- Metrics: aggregates (dashboards)
You need all three.
5. Tool Comparison (April 2026)
LangSmith (managed, by LangChain)
- Traces + eval + datasets unified
- Deeply integrated with LangChain
- $39~$300/month + usage
- Low learning curve
Langfuse (open source)
- Self-host capable
- Traces + eval + user feedback
- Cost advantage at scale
- LangChain·LiteLLM·OpenAI integrations
Helicone (managed proxy)
- Proxy-style (minimal code change)
- Logs + metrics + cache integrated
- Cost forecasting + alerts
Phoenix (Arize)
- LLM observability + ML observability unified
- Open source + managed options
Roll Your Own
- SQLite + Grafana
- Sufficient at small scale
- Custom policy
6. Long-Running Tasks — The Hardest Part of Operations
The Problem
- Agent kicks off a 30-minute~multi-hour task
- User closes the app? System restart? Token limit hit?
- How is in-progress state preserved?
The Runner Pattern (CLAUDE.md standard)
CLAUDE.md
runnerskill: "Long-running/background task control. Externalize task state to a durable ledger so compact, handoff, and session resume reconnect to the same work instead of relaunching it."
Core principles: - Save task state to an external ledger (SQLite, files) - Update the ledger after each step - On restart / compact / handoff, restore from the ledger
Step Decomposition
Step 1: research (5min) → save to ledger
Step 2: plan (3min) → save
Step 3: implement_a (10min) → save
Step 4: implement_b (8min) → save
Step 5: verify (5min) → save
Step 6: done
Each step is independently runnable. If step 5 dies, step 4's results carry forward and resumption uses them.
Compact + Handoff
- Context pressure → compress current state into a handoff doc
- New session reads the handoff doc + ledger → knows where to resume
- CLAUDE.md
tasks/handoffs/pattern formalizes this
7. Cost Operations
Budget Management
- Monthly budget set → real-time tracking
- Alert at 80%
- Define behavior at 100% (service stop? Degrade?)
Per-User Limits
- Free tier: 100K tokens/day
- Paid: 1M tokens/day
- At limit → next-day reset or upgrade
Routing + Cache Effect Measurement
- Compare avg cost before/after routing
- Daily cache-hit-rate trend
- Cost per model / per task type
Outlier Alerts
- One user 10× normal usage in one hour → alert (account compromise? infinite loop?)
- Auto-detect cost spikes
8. Quality Gates
Pre-deploy
- 100-case eval → 80% pass required
- Cost regression ≤5%
- Latency regression ≤10%
In Production
- A/B test: validate new policy on 5% traffic
- Auto-rollback: if eval score drops, immediately revert
Post-deploy
- Compare metrics 24h later
- Collect user feedback
9. Incident Response
Severity
- P0: service down → immediate fallback
- P1: quality regression → fix within 24h
- P2: cost spike → next sprint
- P3: single user report → track
Runbook
- Document procedures for each incident type
- "Anthropic API down" → "Fallback to OpenRouter. Retry in 1h"
- "Cost spike detected" → "Review routing policy + apply temporary user limits"
Postmortem
- Blameless analysis after every incident
- What happened / why / how to prevent
- Action items → next sprint
10. Series Synthesis — The Harness Engineering Checklist
| Area | Core Question | Recommended Tools |
|---|---|---|
| 1. Definition | Model vs Harness clear? | (conceptual) |
| 2. Context | Can a 5K answer suffice? | CLAUDE.md, MCP limits |
| 3. Memory | State persisted across sessions? | SQLite+FTS5 → Mem0 → Zep |
| 4. Tools | Permission, isolation, failure handling? | Hooks, worktree, VM |
| 5. Routing | Right model per task? | OpenRouter → custom |
| 6. Eval/Ops | Measurement + ops infra? | LangSmith / Langfuse |
Get all six right and you have a real harness. Miss one and that area becomes the 6× gap.
11. Where to Start
Small Team (1~5 people)
- Write CLAUDE.md (Context 1+2+3 basics)
- Bash permission whitelist (Tools 4)
- Sign up for OpenRouter (Routing 5)
- SQLite + basic logging (Eval 6 start)
- Measure → improve → measure
Team (5~50 people)
- All of the above + self-hosted Langfuse
- Eval 100 cases integrated in CI
- Worktree isolation standardized
- Per-user cost tracking
Enterprise
- Custom router (Router_Control pattern)
- Adopt Mem0 / Zep
- VM isolation (Cursor pattern)
- 24/7 on-call + runbooks
12. Series Closing
What this 6-part series covered:
| Part | Core Thesis |
|---|---|
| 1 | Agent = Model + Harness. Same model, 6× gap. |
| 2 | Context window is a shared budget. Small and precise. |
| 3 | Memory isn't extended context. It's a separate system. |
| 4 | Models want to call tools. The harness defines how. |
| 5 | Single-model use is no longer rational. |
| 6 | What you can't measure, you can't improve. Continuous eval is infrastructure. |
Series one-liner: "In 2026, LLM agent differentiation isn't the model — it's everything stacked on top of it. Building that everything well is harness engineering."
What's Next
This series ends, but the harness field is just forming. Things likely to shift in 6~12 months: - AutoHarness-class meta-tools maturing - Model-specific dedicated harness pattern standardization - Public, shared eval datasets - Mainstream KG memory adoption - Cursor 4 / Claude Code 4 (next quarter)
Quarterly updates will follow on this blog.
First-Party Sources
- LangSmith: docs.smith.langchain.com
- Langfuse: langfuse.com
- "Dive into Claude Code": arxiv.org/abs/2604.14228
- Martin Fowler harness: martinfowler.com/articles/harness-engineering.html
- AutoHarness: github.com/aiming-lab/AutoHarness
- Cursor 3 Cloud Agents: cursor.com/cloud
๋๊ธ
๋๊ธ ์ฐ๊ธฐ