"Agent Evaluation Harnesses (1/4) — How to Validate AI Results with Tests, Rubrics, and Regression Loops"

The most common AI-agent illusion is mistaking "it worked a few times" for "it now works." An evaluation harness breaks that illusion. It asks whether the system still holds together when run again under the same expectations.


Key Takeaways

  • Agent evaluation is not just about answer quality. It must also inspect tool use, policy adherence, failure handling, and final deliverables.
  • Strong evaluation harnesses combine exact checks with rubric-based scoring, and they filter out cheaply detectable failures as early as possible.
  • A rubric is not a vibe check. It is a repeatable scoring frame.
  • Evaluation datasets are not static reports. They are operating assets that get stronger as real failures are added back in.
  • The purpose of evaluation is not leaderboard vanity. It is to build sensors that tell you how to improve the harness.

1. Why agents need a distinct evaluation harness

Testing is important in all software, but agent systems make evaluation harder. Outputs can vary across runs, many tasks do not have a single exact answer, and agents do more than write text. They read files, invoke tools, make intermediate decisions, and sometimes carry work over long sessions.

That means "the answer looks decent" is not enough. In practice you need to ask questions like:

  • did it choose the right tool
  • did it avoid unnecessary actions
  • did it follow explicit constraints
  • did the final output satisfy the requested format and standard
  • did the system regress compared with earlier behavior

So an evaluation harness is not a chatbot scorecard. It is closer to a test rig for the full agent workflow.
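One way to make those questions checkable is to record each run as structured data rather than keeping only the final text. The sketch below is illustrative: `RunRecord`, `ToolCall`, and the tool names are hypothetical and do not come from any specific agent framework.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str   # tool the agent invoked
    ok: bool    # whether the call succeeded


@dataclass
class RunRecord:
    """Hypothetical record of one agent run; all field names are illustrative."""
    task_id: str
    final_output: str
    tool_calls: list[ToolCall] = field(default_factory=list)


def workflow_facts(run: RunRecord, expected_tools: set[str], max_calls: int) -> dict[str, bool]:
    """Answer workflow-level questions that a text-only score cannot see."""
    used = {call.name for call in run.tool_calls}
    return {
        "used_expected_tools": expected_tools <= used,
        "no_unexpected_tools": used <= expected_tools,
        "within_call_budget": len(run.tool_calls) <= max_calls,
        "all_calls_succeeded": all(call.ok for call in run.tool_calls),
    }


run = RunRecord(
    task_id="demo-1",
    final_output="done",
    tool_calls=[ToolCall("read_file", True), ToolCall("read_file", True), ToolCall("shell", True)],
)
facts = workflow_facts(run, expected_tools={"read_file"}, max_calls=5)
print(facts)
```

Here the final output could look fine while `no_unexpected_tools` is false, because the run invoked a `shell` tool the task did not call for.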

2. Exact checks and rubric-based checks are different tools

You usually need both, but they should not be blended into one vague bucket.

| Evaluation type | Best for | Strength | Limitation |
| --- | --- | --- | --- |
| Exact checks | format, values, execution success, policy violations | cheap, fast, consistent | weak on semantic quality |
| Rubric-based checks | summaries, analysis, prioritization, writing quality | catches meaningful differences | more subjective and less stable |

For example, schema compliance, required-section presence, or forbidden-file edits do not need an LLM judge. But questions like "is this analysis decision-useful?" are hard to settle with rigid rules alone.

This distinction matters because anything you can catch deterministically should be caught before you spend expensive judgment on it.
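As a minimal sketch of such deterministic checks, the functions below test for required sections and required JSON keys using only the standard library. The section names and keys are assumptions chosen for illustration; a real harness would take them from its own work contract.

```python
import json

REQUIRED_SECTIONS = ("Key Takeaways", "References")  # hypothetical requirements
REQUIRED_KEYS = {"title", "status"}                  # hypothetical schema keys


def check_sections(text: str) -> list[str]:
    """Return one failure message per required section that is missing."""
    return [f"missing section: {s}" for s in REQUIRED_SECTIONS if s not in text]


def check_json_keys(raw: str) -> list[str]:
    """Return failures if the payload is not a JSON object with the required keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["payload is not a JSON object"]
    missing = sorted(REQUIRED_KEYS - set(data))
    return [f"missing key: {key}" for key in missing]


def deterministic_gate(text: str, raw_json: str) -> list[str]:
    """Run every cheap check; an empty list means the case may move on."""
    return check_sections(text) + check_json_keys(raw_json)


failures = deterministic_gate("Key Takeaways\n...", '{"title": "x"}')
print(failures)  # ['missing section: References', 'missing key: status']
```

Returning a list of messages rather than a single boolean keeps the failure reason visible, which matters later when you group failures by family.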

3. A good rubric is not subjective praise

A rubric is not "write something good." It is a breakdown of criteria that allows different reviewers to reach similar conclusions.

For an agent deliverable, that can look like this:

| Criterion | Question | Example score |
| --- | --- | --- |
| Accuracy | Are facts, paths, and commands correct? | 0 to 5 |
| Procedure compliance | Did it follow required constraints and sequence? | 0 to 5 |
| Completeness | Did it cover all requested outputs? | 0 to 5 |
| Practicality | Can a real user use this immediately? | 0 to 5 |
| Risk | Did it avoid overclaiming, unsafe edits, or unnecessary actions? | 0 to 5 |

The important part is not the label names. It is the decision language. For instance: if the main result is decent but the agent edited a forbidden file, the result is not a pass. It is a failure.
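That decision language can be written down as explicit rules. The sketch below assumes the five criteria above, scored 0 to 5. The thresholds (`pass_mean`, `floor`) are illustrative assumptions, not recommended values.

```python
from dataclasses import dataclass

CRITERIA = ("accuracy", "procedure", "completeness", "practicality", "risk")


@dataclass
class RubricResult:
    scores: dict[str, int]       # each criterion scored 0 to 5
    hard_violations: list[str]   # e.g. "edited forbidden file"


def verdict(result: RubricResult, pass_mean: float = 3.5, floor: int = 2) -> str:
    """Apply decision rules; the thresholds are illustrative assumptions."""
    missing = [c for c in CRITERIA if c not in result.scores]
    if missing:
        raise ValueError(f"unscored criteria: {missing}")
    if result.hard_violations:
        return "fail"  # a hard violation overrides any score
    values = [result.scores[c] for c in CRITERIA]
    if min(values) < floor:
        return "fail"  # one very weak criterion sinks the run
    return "pass" if sum(values) / len(values) >= pass_mean else "fail"


decent_but_forbidden = RubricResult(
    scores={c: 4 for c in CRITERIA},
    hard_violations=["edited forbidden file"],
)
print(verdict(decent_but_forbidden))  # fail
```

Checking `hard_violations` before any averaging is the design choice that matters: a strong mean score cannot buy back a contract violation.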

4. Evaluation datasets should preserve failure, not just success

One common mistake is building datasets from only the best-looking examples. The most useful evaluation set is often built from the opposite:

  • cases that frequently failed
  • borderline cases that caused drift
  • policy-sensitive cases
  • cases likely to break when models, prompts, or tools change

Following the Chapter 5 and Chapter 9 adaptation notes in sources/260518_하네스엔지니어링_15장_블로그활용노트.md, it is more useful to treat the dataset as an operational regression asset than as a polished benchmark.

In practice, it helps to classify examples like this:

  • golden: representative cases that must keep passing
  • edge: ambiguous but important boundary cases
  • failure_regression: cases taken from real failures or review findings
  • policy: permissions, approval, and forbidden-action cases

Once you do this, the average score becomes less interesting than which failure family came back.
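A small sketch of that shift in reporting follows. It tags each case with one of the four categories and counts failures per family. The `EvalCase` shape and the sample case IDs are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass

CATEGORIES = {"golden", "edge", "failure_regression", "policy"}


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    category: str
    passed: bool


def pass_rate(cases: list[EvalCase]) -> float:
    """Overall fraction of passing cases."""
    return sum(c.passed for c in cases) / len(cases)


def failures_by_family(cases: list[EvalCase]) -> dict[str, int]:
    """Count failing cases per category, rejecting unknown category labels."""
    unknown = {c.category for c in cases} - CATEGORIES
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    return dict(Counter(c.category for c in cases if not c.passed))


cases = [
    EvalCase("g1", "golden", True),
    EvalCase("g2", "golden", True),
    EvalCase("g3", "golden", True),
    EvalCase("e1", "edge", True),
    EvalCase("e2", "edge", True),
    EvalCase("r1", "failure_regression", True),
    EvalCase("p1", "policy", False),
    EvalCase("p2", "policy", False),
]
print(pass_rate(cases))           # 0.75
print(failures_by_family(cases))  # {'policy': 2}
```

A 75% pass rate sounds tolerable on its own. The per-family view shows that every failure is a policy failure, which is a very different finding.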

5. Regression evaluation should ask "what broke?"

Operationally, the first question is not "did we innovate?" It is "did we break something that used to work?" Prompt edits, tool-description changes, and model swaps often look good on demos while quietly damaging adjacent behaviors.

That is why regression runs should first detect:

  • representative cases that used to pass but now fail
  • abnormal spikes in cost, latency, or tool-call count
  • increased policy or formatting violations

This connects directly to the continuous measurement logic in drafts/blog/260429_harness_series_06_evaluation_ops_en.md. Evaluation is not a one-off release ritual. It is a recurring guardrail around change.
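The three signals listed above can be computed by comparing a baseline run against a candidate run over the same case IDs. This is a minimal sketch under stated assumptions: the `CaseResult` fields and the 1.5x spike ratio are hypothetical choices for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    passed: bool
    cost_usd: float
    latency_s: float
    tool_calls: int
    violations: int  # policy or formatting violations counted for the run


METRICS = ("cost_usd", "latency_s", "tool_calls")


def regression_report(baseline: dict[str, CaseResult],
                      candidate: dict[str, CaseResult],
                      spike_ratio: float = 1.5) -> dict:
    """Compare two runs of the same cases; only shared case IDs are compared."""
    newly_failing, more_violations, spikes = [], [], {}
    for case_id in sorted(baseline.keys() & candidate.keys()):
        before, after = baseline[case_id], candidate[case_id]
        if before.passed and not after.passed:
            newly_failing.append(case_id)
        if after.violations > before.violations:
            more_violations.append(case_id)
        for metric in METRICS:
            old, new = getattr(before, metric), getattr(after, metric)
            if old > 0 and new / old >= spike_ratio:
                spikes.setdefault(case_id, []).append(metric)
    return {"newly_failing": newly_failing,
            "more_violations": more_violations,
            "spikes": spikes}


baseline = {
    "golden-01": CaseResult(True, 0.02, 3.0, 4, 0),
    "golden-02": CaseResult(True, 0.02, 3.0, 4, 0),
}
candidate = {
    "golden-01": CaseResult(False, 0.02, 3.2, 4, 1),
    "golden-02": CaseResult(True, 0.05, 3.1, 9, 0),
}
report = regression_report(baseline, candidate)
print(report)
```

In this example `golden-02` still passes, yet it is flagged because its cost and tool-call count more than doubled. A pass/fail-only comparison would have missed it.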

6. LLM-as-Judge is useful, but risky as the sole judge

LLM-as-Judge is powerful for nuanced quality checks: explanatory strength, prioritization, usefulness, and semantic alignment. But it becomes risky when treated as the only arbiter.

  • the judge may share the generator's biases
  • scores can drift with prompt wording
  • fluent answers may be overrated even when they are weak

A more durable operating pattern is:

  1. run deterministic checks first
  2. send only surviving cases to rubric evaluation
  3. use human review to calibrate important samples
  4. log judgment rationale so drift can be inspected later

The point is not "do not use LLM judges." The point is to use them inside a broader stack of cheaper and stricter checks.
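Steps 1, 2, and 4 of that pattern can be sketched as a small staged function. The judge here is a stub standing in for an LLM call, and `evaluate`, `no_todo`, and the pass threshold of 3 are hypothetical names and values. Step 3, human calibration, is a process step and is not shown.

```python
import json
from typing import Callable

Check = Callable[[str], list[str]]         # returns failure messages
Judge = Callable[[str], tuple[int, str]]   # returns (score, rationale)


def evaluate(output: str, checks: list[Check], judge: Judge, log: list[str]) -> dict:
    """Run cheap checks first; call the judge only if they all pass."""
    failures = [msg for check in checks for msg in check(output)]
    if failures:
        record = {"stage": "deterministic", "passed": False, "failures": failures}
    else:
        score, rationale = judge(output)
        record = {"stage": "rubric", "passed": score >= 3,
                  "score": score, "rationale": rationale}
    log.append(json.dumps(record))  # keep the rationale so drift can be inspected later
    return record


def no_todo(output: str) -> list[str]:
    return ["contains TODO"] if "TODO" in output else []


judge_calls: list[str] = []


def stub_judge(output: str) -> tuple[int, str]:
    """Stand-in for an LLM judge; a real one would call a model here."""
    judge_calls.append(output)
    return 4, "covers all requested sections"


log: list[str] = []
first = evaluate("TODO: finish", [no_todo], stub_judge, log)
second = evaluate("Complete report", [no_todo], stub_judge, log)
print(first["stage"], second["stage"], len(judge_calls))  # deterministic rubric 1
```

The judge was called once for two outputs, because the first output never survived the deterministic stage.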

7. What an evaluation harness looks like in a workspace like ours

In a repository that mixes documents, tooling rules, and operating constraints, evaluation often cares more about procedure compliance than style alone.

For example:

| Evaluation target | Question |
| --- | --- |
| File ownership compliance | Did it edit only allowed files? |
| Read-order compliance | Did it read the requested references first? |
| Language protocol | Did it keep user-facing output in Korean where required? |
| Concurrent-work safety | Did it avoid reverting someone else's changes? |
| Deliverable completeness | Did it include publish-ready frontmatter and series-nav placeholders? |

These checks are operationally valuable because many real failures are caused not by weak prose, but by violating the work contract around the prose.
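Two of these contract checks are easy to sketch deterministically. The example below assumes that allowed files are described by glob patterns and that frontmatter is a block delimited by `---` lines. Both are assumptions for illustration, and the pattern shown is hypothetical.

```python
from fnmatch import fnmatchcase

ALLOWED_PATTERNS = ("drafts/blog/*.md",)  # hypothetical ownership rule


def ownership_violations(changed_files: list[str],
                         allowed: tuple[str, ...] = ALLOWED_PATTERNS) -> list[str]:
    """Return changed paths that match none of the allowed patterns.

    Note: in fnmatch patterns, * also matches "/" so nested paths are allowed.
    """
    return [path for path in changed_files
            if not any(fnmatchcase(path, pattern) for pattern in allowed)]


def has_frontmatter(text: str) -> bool:
    """True if the text opens with a '---' line and has a later closing '---' line."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return False
    return any(line.strip() == "---" for line in lines[1:])


changed = ["drafts/blog/post_en.md", "docs/design.md"]
print(ownership_violations(changed))                  # ['docs/design.md']
print(has_frontmatter("---\ntitle: x\n---\nBody"))    # True
```

In practice the list of changed files would come from version control. That step is left out here to keep the sketch self-contained.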

8. Practical starting point

You do not need a giant leaderboard to start.

  1. Turn non-negotiable rules into deterministic checks first.
  2. Build a small golden set of 20 to 50 representative tasks.
  3. Reduce the rubric to 3 to 5 criteria so reviewers converge.
  4. Add every real failure back into the regression set.
  5. Re-run the same set before and after prompt, tool, or model changes.

That alone is enough to move evaluation from rhetoric into operations.

9. Common failure modes

Watching only the average score

The average can stay flat while policy violations get worse.

Writing vague rubrics

Criteria like "natural" or "good" drift too easily between evaluators.
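One rough way to spot a vague criterion is to have several reviewers score the same deliverable and look at the spread per criterion. The sketch below uses max minus min as a deliberately simple disagreement measure. The reviewer names, scores, and the cutoff of 2 are made up for illustration.

```python
def criterion_spread(scores_by_reviewer: dict[str, dict[str, int]]) -> dict[str, int]:
    """Max minus min score per criterion across reviewers for one deliverable."""
    criteria = set().union(*(scores.keys() for scores in scores_by_reviewer.values()))
    spread = {}
    for criterion in sorted(criteria):
        values = [s[criterion] for s in scores_by_reviewer.values() if criterion in s]
        spread[criterion] = max(values) - min(values)
    return spread


scores = {
    "reviewer_a": {"accuracy": 4, "natural": 5},
    "reviewer_b": {"accuracy": 4, "natural": 2},
    "reviewer_c": {"accuracy": 5, "natural": 3},
}
spread = criterion_spread(scores)
vague = [name for name, gap in spread.items() if gap >= 2]  # illustrative cutoff
print(spread)  # {'accuracy': 1, 'natural': 3}
print(vague)   # ['natural']
```

A criterion that keeps showing a wide spread is a candidate for rewriting into a more concrete question, or for removal.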

Failing to preserve real failures

If production failures are not added back into the dataset, the system keeps relearning the same lesson the hard way.

Separating evaluation from improvement

If scores do not feed harness changes, the dashboard becomes decoration.

References

  • docs/blog_series_하네스엔지니어링_총괄_design.md
  • sources/260518_하네스엔지니어링_15장_블로그활용노트.md
  • drafts/blog/260429_하네스시리즈06_평가운영_블로그.md
  • drafts/blog/260429_harness_series_06_evaluation_ops_en.md

This is Part 1 of the Operations, Evaluation, and Memory series. Next: Long-running agents — why handoff structure must come before memory systems.

Series overview: Harness Engineering Series Guide
