"Evaluation & Operations — If You Can't Measure It, You Can't Improve It (Harness Series 6/6)"

4월 29, 2026

The closing piece. You've built context (Part 2), memory (Part 3), tools (Part 4), and routing (Part 5). How do you know it actually works? You measure. What you can't measure, you can't improve.

This article covers the Evaluation (quality measurement) and Operations (runtime management) layers — the system that continuously verifies every component covered in Parts 1~5.

Series Roadmap (6 parts — final)

What Is Harness Engineering?
Context Engineering
Memory Systems
Tools & Sandboxing
Multi-Provider Routing
Evaluation & Operations ← this article (final)

1. Why Eval Is a Distinct Concern

Different from Regular Software

Unit-test pass ≠ user satisfaction
Same input → different outputs (non-deterministic)
"Correct answer" isn't always defined (creative writing, summarization, refactoring)
A model swap shifts all behavior subtly

Agents Need Continuous Verification

New model release → routing policy impact
System-prompt edit → regression risk
New tool added → old behavior shifts
Memory accumulation → answer drift

Conclusion: Eval isn't a one-time activity. It's ongoing infrastructure.

2. Eval Types

2-1. Automatic Metrics (quantitative)

BLEU, ROUGE (translation/summary, limited usefulness)
Code execution: does it run, do tests pass
Schema validation: does JSON output match schema
Length, latency, cost (auto-tracked)

2-2. LLM-as-Judge

A larger model (e.g., Opus 4.7) evaluates outputs from a smaller model
Rubric-based scoring (0~10)
Limit: same model evaluating its own output → self-bias

2-3. Human Eval

Most accurate, slowest, most expensive
Build a 50~200-sample baseline up front
Required to validate correlation with LLM-as-Judge

2-4. Adversarial / Red Team

Deliberately hard cases
Prompt injection tests
Security bypass attempts

3. Building an Eval Harness

Dataset

100~1000 representative cases
Reference outputs labeled (when applicable)
Categorized (chat / code / reasoning / agent)

Eval Pipeline

results = []
for case in dataset:
    output = await agent.run(case.input)
    score = await evaluate(case, output)  # automated or LLM-as-judge
    results.append({
        "case_id": case.id,
        "model": agent.model,
        "input_tokens": output.input_tokens,
        "output_tokens": output.output_tokens,
        "cost": output.cost,
        "latency_ms": output.latency_ms,
        "score": score,
        "passed": score >= threshold,
    })

summary = compute_metrics(results)
report.publish(summary)

CI Integration

Pre-merge: run eval on 100 cases
Block merge if below threshold
Detect cost / latency / accuracy regressions

Continuous Eval

Sample 5~10% of production traffic for eval
Aggregate daily → trend detection

4. Observability — What to Measure

Core Metrics

Category	Metric	Target
Cost	Per-call cost	Within budget
Latency	TTFT, TPOT, total	SLO compliant
Quality	Eval score, user feedback	No regression
Reliability	Success rate, failure classification	99%+
Usage	Per-user / per-task	Anomaly detection

Traces (distributed tracing)

A single user request branches across multiple tool calls and model calls. View it as one graph.

User request
  ├─ Classifier (Haiku 4.5, 50ms)
  ├─ Tool: search_codebase (200ms)
  ├─ Tool: read_file × 3 (30ms each)
  └─ Generation (Kimi K2.6, 2.1s)
Total: 2.4s, $0.03

Logs vs Traces vs Metrics

Logs: text events (debugging)
Traces: single-request flow (perf analysis)
Metrics: aggregates (dashboards)

You need all three.

5. Tool Comparison (April 2026)

LangSmith (managed, by LangChain)

Traces + eval + datasets unified
Deeply integrated with LangChain
$39~$300/month + usage
Low learning curve

Langfuse (open source)

Self-host capable
Traces + eval + user feedback
Cost advantage at scale
LangChain·LiteLLM·OpenAI integrations

Helicone (managed proxy)

Proxy-style (minimal code change)
Logs + metrics + cache integrated
Cost forecasting + alerts

Phoenix (Arize)

LLM observability + ML observability unified
Open source + managed options

Roll Your Own

SQLite + Grafana
Sufficient at small scale
Custom policy

6. Long-Running Tasks — The Hardest Part of Operations

The Problem

Agent kicks off a 30-minute~multi-hour task
User closes the app? System restart? Token limit hit?
How is in-progress state preserved?

The Runner Pattern (CLAUDE.md standard)

CLAUDE.md runner skill: "Long-running/background task control. Externalize task state to a durable ledger so compact, handoff, and session resume reconnect to the same work instead of relaunching it."

Core principles: - Save task state to an external ledger (SQLite, files) - Update the ledger after each step - On restart / compact / handoff, restore from the ledger

Step Decomposition

Step 1: research (5min) → save to ledger
Step 2: plan (3min) → save
Step 3: implement_a (10min) → save
Step 4: implement_b (8min) → save
Step 5: verify (5min) → save
Step 6: done

Each step is independently runnable. If step 5 dies, step 4's results carry forward and resumption uses them.

Compact + Handoff

Context pressure → compress current state into a handoff doc
New session reads the handoff doc + ledger → knows where to resume
CLAUDE.md tasks/handoffs/ pattern formalizes this

7. Cost Operations

Budget Management

Monthly budget set → real-time tracking
Alert at 80%
Define behavior at 100% (service stop? Degrade?)

Per-User Limits

Free tier: 100K tokens/day
Paid: 1M tokens/day
At limit → next-day reset or upgrade

Routing + Cache Effect Measurement

Compare avg cost before/after routing
Daily cache-hit-rate trend
Cost per model / per task type

Outlier Alerts

One user 10× normal usage in one hour → alert (account compromise? infinite loop?)
Auto-detect cost spikes

8. Quality Gates

Pre-deploy

100-case eval → 80% pass required
Cost regression ≤5%
Latency regression ≤10%

In Production

A/B test: validate new policy on 5% traffic
Auto-rollback: if eval score drops, immediately revert

Post-deploy

Compare metrics 24h later
Collect user feedback

9. Incident Response

Severity

P0: service down → immediate fallback
P1: quality regression → fix within 24h
P2: cost spike → next sprint
P3: single user report → track

Runbook

Document procedures for each incident type
"Anthropic API down" → "Fallback to OpenRouter. Retry in 1h"
"Cost spike detected" → "Review routing policy + apply temporary user limits"

Postmortem

Blameless analysis after every incident
What happened / why / how to prevent
Action items → next sprint

10. Series Synthesis — The Harness Engineering Checklist

Area	Core Question	Recommended Tools
1. Definition	Model vs Harness clear?	(conceptual)
2. Context	Can a 5K answer suffice?	CLAUDE.md, MCP limits
3. Memory	State persisted across sessions?	SQLite+FTS5 → Mem0 → Zep
4. Tools	Permission, isolation, failure handling?	Hooks, worktree, VM
5. Routing	Right model per task?	OpenRouter → custom
6. Eval/Ops	Measurement + ops infra?	LangSmith / Langfuse

Get all six right and you have a real harness. Miss one and that area becomes the 6× gap.

11. Where to Start

Small Team (1~5 people)

Write CLAUDE.md (Context 1+2+3 basics)
Bash permission whitelist (Tools 4)
Sign up for OpenRouter (Routing 5)
SQLite + basic logging (Eval 6 start)
Measure → improve → measure

Team (5~50 people)

All of the above + self-hosted Langfuse
Eval 100 cases integrated in CI
Worktree isolation standardized
Per-user cost tracking

Enterprise

Custom router (Router_Control pattern)
Adopt Mem0 / Zep
VM isolation (Cursor pattern)
24/7 on-call + runbooks

12. Series Closing

What this 6-part series covered:

Part	Core Thesis
1	Agent = Model + Harness. Same model, 6× gap.
2	Context window is a shared budget. Small and precise.
3	Memory isn't extended context. It's a separate system.
4	Models want to call tools. The harness defines how.
5	Single-model use is no longer rational.
6	What you can't measure, you can't improve. Continuous eval is infrastructure.

Series one-liner: "In 2026, LLM agent differentiation isn't the model — it's everything stacked on top of it. Building that everything well is harness engineering."

What's Next

This series ends, but the harness field is just forming. Things likely to shift in 6~12 months: - AutoHarness-class meta-tools maturing - Model-specific dedicated harness pattern standardization - Public, shared eval datasets - Mainstream KG memory adoption - Cursor 4 / Claude Code 4 (next quarter)

Quarterly updates will follow on this blog.

First-Party Sources

LangSmith: docs.smith.langchain.com
Langfuse: langfuse.com
"Dive into Claude Code": arxiv.org/abs/2604.14228
Martin Fowler harness: martinfowler.com/articles/harness-engineering.html
AutoHarness: github.com/aiming-lab/AutoHarness
Cursor 3 Cloud Agents: cursor.com/cloud