"Evaluation & Operations — If You Can't Measure It, You Can't Improve It (Harness Series 6/6)"

The closing piece. You've built context (Part 2), memory (Part 3), tools (Part 4), and routing (Part 5). How do you know it actually works? You measure. What you can't measure, you can't improve.

This article covers the Evaluation (quality measurement) and Operations (runtime management) layers — the system that continuously verifies every component covered in Parts 1~5.

Series Roadmap (6 parts — final)

  1. What Is Harness Engineering?
  2. Context Engineering
  3. Memory Systems
  4. Tools & Sandboxing
  5. Multi-Provider Routing
  6. Evaluation & Operations ← this article (final)

1. Why Eval Is a Distinct Concern

Different from Regular Software

  • Unit-test pass ≠ user satisfaction
  • Same input → different outputs (non-deterministic)
  • "Correct answer" isn't always defined (creative writing, summarization, refactoring)
  • A model swap shifts all behavior subtly

Agents Need Continuous Verification

  • New model release → routing policy impact
  • System-prompt edit → regression risk
  • New tool added → old behavior shifts
  • Memory accumulation → answer drift

Conclusion: Eval isn't a one-time activity. It's ongoing infrastructure.


2. Eval Types

2-1. Automatic Metrics (quantitative)

  • BLEU, ROUGE (translation/summary, limited usefulness)
  • Code execution: does it run, do tests pass
  • Schema validation: does JSON output match schema
  • Length, latency, cost (auto-tracked)

2-2. LLM-as-Judge

  • A larger model (e.g., Opus 4.7) evaluates outputs from a smaller model
  • Rubric-based scoring (0~10)
  • Limit: same model evaluating its own output → self-bias

2-3. Human Eval

  • Most accurate, slowest, most expensive
  • Build a 50~200-sample baseline up front
  • Required to validate correlation with LLM-as-Judge

2-4. Adversarial / Red Team

  • Deliberately hard cases
  • Prompt injection tests
  • Security bypass attempts

3. Building an Eval Harness

Dataset

  • 100~1000 representative cases
  • Reference outputs labeled (when applicable)
  • Categorized (chat / code / reasoning / agent)

Eval Pipeline

results = []
for case in dataset:
    output = await agent.run(case.input)
    score = await evaluate(case, output)  # automated or LLM-as-judge
    results.append({
        "case_id": case.id,
        "model": agent.model,
        "input_tokens": output.input_tokens,
        "output_tokens": output.output_tokens,
        "cost": output.cost,
        "latency_ms": output.latency_ms,
        "score": score,
        "passed": score >= threshold,
    })

summary = compute_metrics(results)
report.publish(summary)

CI Integration

  • Pre-merge: run eval on 100 cases
  • Block merge if below threshold
  • Detect cost / latency / accuracy regressions

Continuous Eval

  • Sample 5~10% of production traffic for eval
  • Aggregate daily → trend detection

4. Observability — What to Measure

Core Metrics

Category Metric Target
Cost Per-call cost Within budget
Latency TTFT, TPOT, total SLO compliant
Quality Eval score, user feedback No regression
Reliability Success rate, failure classification 99%+
Usage Per-user / per-task Anomaly detection

Traces (distributed tracing)

A single user request branches across multiple tool calls and model calls. View it as one graph.

User request
  ├─ Classifier (Haiku 4.5, 50ms)
  ├─ Tool: search_codebase (200ms)
  ├─ Tool: read_file × 3 (30ms each)
  └─ Generation (Kimi K2.6, 2.1s)
Total: 2.4s, $0.03

Logs vs Traces vs Metrics

  • Logs: text events (debugging)
  • Traces: single-request flow (perf analysis)
  • Metrics: aggregates (dashboards)

You need all three.


5. Tool Comparison (April 2026)

LangSmith (managed, by LangChain)

  • Traces + eval + datasets unified
  • Deeply integrated with LangChain
  • $39~$300/month + usage
  • Low learning curve

Langfuse (open source)

  • Self-host capable
  • Traces + eval + user feedback
  • Cost advantage at scale
  • LangChain·LiteLLM·OpenAI integrations

Helicone (managed proxy)

  • Proxy-style (minimal code change)
  • Logs + metrics + cache integrated
  • Cost forecasting + alerts

Phoenix (Arize)

  • LLM observability + ML observability unified
  • Open source + managed options

Roll Your Own

  • SQLite + Grafana
  • Sufficient at small scale
  • Custom policy

6. Long-Running Tasks — The Hardest Part of Operations

The Problem

  • Agent kicks off a 30-minute~multi-hour task
  • User closes the app? System restart? Token limit hit?
  • How is in-progress state preserved?

The Runner Pattern (CLAUDE.md standard)

CLAUDE.md runner skill: "Long-running/background task control. Externalize task state to a durable ledger so compact, handoff, and session resume reconnect to the same work instead of relaunching it."

Core principles: - Save task state to an external ledger (SQLite, files) - Update the ledger after each step - On restart / compact / handoff, restore from the ledger

Step Decomposition

Step 1: research (5min) → save to ledger
Step 2: plan (3min) → save
Step 3: implement_a (10min) → save
Step 4: implement_b (8min) → save
Step 5: verify (5min) → save
Step 6: done

Each step is independently runnable. If step 5 dies, step 4's results carry forward and resumption uses them.

Compact + Handoff

  • Context pressure → compress current state into a handoff doc
  • New session reads the handoff doc + ledger → knows where to resume
  • CLAUDE.md tasks/handoffs/ pattern formalizes this

7. Cost Operations

Budget Management

  • Monthly budget set → real-time tracking
  • Alert at 80%
  • Define behavior at 100% (service stop? Degrade?)

Per-User Limits

  • Free tier: 100K tokens/day
  • Paid: 1M tokens/day
  • At limit → next-day reset or upgrade

Routing + Cache Effect Measurement

  • Compare avg cost before/after routing
  • Daily cache-hit-rate trend
  • Cost per model / per task type

Outlier Alerts

  • One user 10× normal usage in one hour → alert (account compromise? infinite loop?)
  • Auto-detect cost spikes

8. Quality Gates

Pre-deploy

  • 100-case eval → 80% pass required
  • Cost regression ≤5%
  • Latency regression ≤10%

In Production

  • A/B test: validate new policy on 5% traffic
  • Auto-rollback: if eval score drops, immediately revert

Post-deploy

  • Compare metrics 24h later
  • Collect user feedback

9. Incident Response

Severity

  • P0: service down → immediate fallback
  • P1: quality regression → fix within 24h
  • P2: cost spike → next sprint
  • P3: single user report → track

Runbook

  • Document procedures for each incident type
  • "Anthropic API down" → "Fallback to OpenRouter. Retry in 1h"
  • "Cost spike detected" → "Review routing policy + apply temporary user limits"

Postmortem

  • Blameless analysis after every incident
  • What happened / why / how to prevent
  • Action items → next sprint

10. Series Synthesis — The Harness Engineering Checklist

Area Core Question Recommended Tools
1. Definition Model vs Harness clear? (conceptual)
2. Context Can a 5K answer suffice? CLAUDE.md, MCP limits
3. Memory State persisted across sessions? SQLite+FTS5 → Mem0 → Zep
4. Tools Permission, isolation, failure handling? Hooks, worktree, VM
5. Routing Right model per task? OpenRouter → custom
6. Eval/Ops Measurement + ops infra? LangSmith / Langfuse

Get all six right and you have a real harness. Miss one and that area becomes the 6× gap.


11. Where to Start

Small Team (1~5 people)

  1. Write CLAUDE.md (Context 1+2+3 basics)
  2. Bash permission whitelist (Tools 4)
  3. Sign up for OpenRouter (Routing 5)
  4. SQLite + basic logging (Eval 6 start)
  5. Measure → improve → measure

Team (5~50 people)

  1. All of the above + self-hosted Langfuse
  2. Eval 100 cases integrated in CI
  3. Worktree isolation standardized
  4. Per-user cost tracking

Enterprise

  1. Custom router (Router_Control pattern)
  2. Adopt Mem0 / Zep
  3. VM isolation (Cursor pattern)
  4. 24/7 on-call + runbooks

12. Series Closing

What this 6-part series covered:

Part Core Thesis
1 Agent = Model + Harness. Same model, 6× gap.
2 Context window is a shared budget. Small and precise.
3 Memory isn't extended context. It's a separate system.
4 Models want to call tools. The harness defines how.
5 Single-model use is no longer rational.
6 What you can't measure, you can't improve. Continuous eval is infrastructure.

Series one-liner: "In 2026, LLM agent differentiation isn't the model — it's everything stacked on top of it. Building that everything well is harness engineering."


What's Next

This series ends, but the harness field is just forming. Things likely to shift in 6~12 months: - AutoHarness-class meta-tools maturing - Model-specific dedicated harness pattern standardization - Public, shared eval datasets - Mainstream KG memory adoption - Cursor 4 / Claude Code 4 (next quarter)

Quarterly updates will follow on this blog.


First-Party Sources

  • LangSmith: docs.smith.langchain.com
  • Langfuse: langfuse.com
  • "Dive into Claude Code": arxiv.org/abs/2604.14228
  • Martin Fowler harness: martinfowler.com/articles/harness-engineering.html
  • AutoHarness: github.com/aiming-lab/AutoHarness
  • Cursor 3 Cloud Agents: cursor.com/cloud

๋Œ“๊ธ€

์ด ๋ธ”๋กœ๊ทธ์˜ ์ธ๊ธฐ ๊ฒŒ์‹œ๋ฌผ

Agent Memory Engine (2/10) — Building an AI Agent Memory System with SQLite Alone

"ML Foundations (9/9) — PyTorch vs TensorFlow, and the Road to Local LLMs"

"RAG Core Study (14/26) — Evaluation Sets with RAGAS & DeepEval"

"ML Foundations (8/9) — Deep Learning Architectures: CNN, RNN, Attention"

"ML Foundations (7/9) — Deep Learning Training: Optimizers, Regularization, Initialization"

OpenClaw to Hermes Migration (2/13) — What to Preserve, Partially Port, or Discard

AI Agents I Built (5/7) — Building an Automated Blogger API Publishing System