"Agent Evaluation Harnesses (1/4) — How to Validate AI Results with Tests, Rubrics, and Regression Loops"

May 18, 2026

The most common AI-agent illusion is mistaking "it worked a few times" for "it now works." An evaluation harness breaks that illusion. It asks whether the system still holds together when run again under the same expectations.

Key Takeaways

Agent evaluation is not just about answer quality. It must also inspect tool use, policy adherence, failure handling, and final deliverables.
Strong evaluation harnesses combine exact checks with rubric-based scoring, while removing cheap, obvious failures as early as possible.
A rubric is not a vibe check. It is a repeatable scoring frame.
Evaluation datasets are not static reports. They are operating assets that get stronger as real failures are added back in.
The purpose of evaluation is not leaderboard vanity. It is to build sensors that tell you how to improve the harness.

1. Why agents need a distinct evaluation harness

Testing is important in all software, but agent systems make evaluation harder. Outputs can vary across runs, many tasks do not have a single exact answer, and agents do more than write text. They read files, invoke tools, make intermediate decisions, and sometimes carry work over long sessions.

That means "the answer looks decent" is not enough. In practice, the harness has to answer questions like:

Did it choose the right tool?
Did it avoid unnecessary actions?
Did it follow explicit constraints?
Did the final output satisfy the requested format and standard?
Did the system regress compared with earlier behavior?

So an evaluation harness is not a chatbot scorecard. It is closer to a test rig for the full agent workflow.

2. Exact checks and rubric-based checks are different tools

You usually need both, but they should not be blended into one vague bucket.

Evaluation type	Best for	Strength	Limitation
Exact checks	format, values, execution success, policy violations	cheap, fast, consistent	weak on semantic quality
Rubric-based checks	summaries, analysis, prioritization, writing quality	catches meaningful differences	more subjective and less stable

For example, schema compliance, required-section presence, or forbidden-file edits do not need an LLM judge. But questions like "is this analysis decision-useful?" are hard to settle with rigid rules alone.

This distinction matters because:

Anything you can catch deterministically should be caught before you spend expensive judgment on it.
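As a minimal sketch of the exact-check side, the function below runs only deterministic rules and returns the reasons a deliverable failed. The section names and the `status` / `files_changed` fields are hypothetical placeholders chosen for illustration, not part of any real schema in this series.

```python
import json

# Hypothetical required headings; replace with your own contract.
REQUIRED_SECTIONS = ("Summary", "Changes", "Risks")


def exact_checks(output_text: str, payload_json: str) -> list:
    """Return a list of failure reasons; an empty list means every check passed."""
    failures = []

    # Required-section presence: plain substring checks, no model involved.
    for section in REQUIRED_SECTIONS:
        if section not in output_text:
            failures.append(f"missing section: {section}")

    # Schema compliance: the payload must be a JSON object with typed fields.
    try:
        payload = json.loads(payload_json)
    except json.JSONDecodeError:
        failures.append("payload is not valid JSON")
        return failures

    if not isinstance(payload, dict):
        failures.append("payload is not a JSON object")
        return failures
    if not isinstance(payload.get("status"), str):
        failures.append("field 'status' must be a string")
    if not isinstance(payload.get("files_changed"), list):
        failures.append("field 'files_changed' must be a list")
    return failures
```

Because every rule here is cheap and repeatable, a non-empty result can end the evaluation of that case before any rubric or judge is involved.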

3. A good rubric is not subjective praise

A rubric is not "write something good." It is a breakdown of criteria that allows different reviewers to reach similar conclusions.

For an agent deliverable, that can look like this:

Criterion	Question	Example score
Accuracy	Are facts, paths, and commands correct?	0 to 5
Procedure compliance	Did it follow required constraints and sequence?	0 to 5
Completeness	Did it cover all requested outputs?	0 to 5
Practicality	Can a real user use this immediately?	0 to 5
Risk	Did it avoid overclaiming, unsafe edits, or unnecessary actions?	0 to 5

The important part is not the label names. It is the decision language. For instance: if the main result is decent but the agent edited a forbidden file, the result is not a pass. It is a failure.
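One way to encode that decision language is to make hard violations override the scores entirely. The sketch below assumes the five criteria from the table, integer scores from 0 to 5, and a pass threshold of 3.5 on the mean. The threshold and the field names are illustrative assumptions, not values taken from the series.

```python
from dataclasses import dataclass, field

CRITERIA = ("accuracy", "procedure", "completeness", "practicality", "risk")


@dataclass
class RubricResult:
    scores: dict                                          # criterion -> score in 0..5
    hard_violations: list = field(default_factory=list)   # e.g. ["edited forbidden file"]


def verdict(result: RubricResult, pass_mean: float = 3.5) -> str:
    """Return "pass" or "fail"; any hard violation fails regardless of scores."""
    for name in CRITERIA:
        score = result.scores[name]
        if not 0 <= score <= 5:
            raise ValueError(f"{name} score out of range: {score}")

    # Decision language first: no score can buy back a contract violation.
    if result.hard_violations:
        return "fail"

    mean = sum(result.scores[name] for name in CRITERIA) / len(CRITERIA)
    return "pass" if mean >= pass_mean else "fail"
```

With this shape, a deliverable that scores 5 on every criterion but touched a forbidden file still comes back as `"fail"`.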

4. Evaluation datasets should preserve failure, not just success

One common mistake is building datasets from only the best-looking examples. The most useful evaluation set is often the opposite, built from:

cases that frequently failed
borderline cases that caused drift
policy-sensitive cases
cases likely to break when models, prompts, or tools change

Following the Chapter 5 and Chapter 9 adaptation notes in sources/260518_하네스엔지니어링_15장_블로그활용노트.md, it is more useful to treat the dataset as an operational regression asset than as a polished benchmark.

In practice, it helps to classify examples like this:

golden: representative cases that must keep passing
edge: ambiguous but important boundary cases
failure_regression: cases taken from real failures or review findings
policy: permissions, approval, and forbidden-action cases

Once you do this, the average score becomes less interesting than which failure family came back.
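A small sketch of that reporting shift: instead of averaging, count failures per family and list the families that started failing again. The case layout (`id`, `family`) and the function names are hypothetical; the family labels are the four listed above.

```python
from collections import Counter

FAMILIES = ("golden", "edge", "failure_regression", "policy")


def failures_by_family(cases: list, results: dict) -> dict:
    """cases: dicts with 'id' and 'family'; results: case id -> True if passed."""
    counts = Counter()
    for case in cases:
        if case["family"] not in FAMILIES:
            raise ValueError(f"unknown family: {case['family']}")
        if not results[case["id"]]:
            counts[case["family"]] += 1
    return dict(counts)


def returned_families(previous: dict, current: dict) -> list:
    """Families with failures in the current run that had none in the previous run."""
    return sorted(f for f in current if previous.get(f, 0) == 0)
```

Comparing `failures_by_family` output from two runs with `returned_families` surfaces a reappearing `policy` failure even when the overall pass rate barely moves.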

5. Regression evaluation should ask "what broke?"

Operationally, the first question is not "did we innovate?" It is "did we break something that used to work?" Prompt edits, tool-description changes, and model swaps often look good on demos while quietly damaging adjacent behaviors.

That is why regression runs should first detect:

representative cases that used to pass but now fail
abnormal spikes in cost, latency, or tool-call count
increased policy or formatting violations
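The three signals above can be sketched as a comparison between a baseline run and a candidate run. The record fields (`passed`, `cost_usd`, `latency_s`, `tool_calls`, `violations`) and the 1.5x spike ratio are assumptions made for this example; a real harness would choose its own fields and thresholds.

```python
# Hypothetical per-case metric names.
METRICS = ("cost_usd", "latency_s", "tool_calls")


def regression_report(baseline: dict, candidate: dict, spike_ratio: float = 1.5) -> dict:
    """Compare two runs keyed by case id and list what got worse."""
    newly_failing, spikes, missing = [], [], []
    violations_before = violations_after = 0

    for case_id in sorted(baseline):
        before = baseline[case_id]
        after = candidate.get(case_id)
        if after is None:
            missing.append(case_id)  # case was not re-run; report it, do not guess
            continue

        # Signal 1: used to pass, now fails.
        if before["passed"] and not after["passed"]:
            newly_failing.append(case_id)

        # Signal 2: a metric grew past the allowed ratio.
        # A baseline of 0 means any increase is flagged.
        for metric in METRICS:
            if after[metric] > before[metric] * spike_ratio:
                spikes.append((case_id, metric))

        # Signal 3: policy or formatting violations, summed over shared cases.
        violations_before += before["violations"]
        violations_after += after["violations"]

    return {
        "newly_failing": newly_failing,
        "spikes": spikes,
        "missing": missing,
        "violations_delta": violations_after - violations_before,
    }
```

Note that the report contains no average score at all; each field answers a "what broke?" question directly.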

This connects directly to the continuous measurement logic in drafts/blog/260429_harness_series_06_evaluation_ops_en.md. Evaluation is not a one-off release ritual. It is a recurring guardrail around change.

6. LLM-as-Judge is useful, but risky as the sole judge

LLM-as-Judge is powerful for nuanced quality checks: explanatory strength, prioritization, usefulness, and semantic alignment. But it becomes risky when treated as the only arbiter.

the judge may share the generator's biases
scores can drift with prompt wording
fluent answers may be overrated even when they are weak

A more durable operating pattern is:

run deterministic checks first
send only surviving cases to rubric evaluation
use human review to calibrate important samples
log judgment rationale so drift can be inspected later

The point is not "do not use LLM judges." The point is to use them inside a broader stack of cheaper and stricter checks.
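The ordering in that pattern can be sketched as a two-stage function: deterministic checks run first, and the judge is called only for outputs that survive them. The judge here is an injected callable so the sketch stays independent of any model API; `no_todo_marker`, `stub_judge`, and the pass score of 3 are hypothetical stand-ins. Human calibration is not covered by this sketch.

```python
from typing import Callable, Optional

Check = Callable[[str], Optional[str]]   # returns a failure reason, or None if OK
Judge = Callable[[str], tuple]           # returns (score, rationale)


def evaluate(output: str, checks: list, judge: Judge,
             rationale_log: list, pass_score: int = 3) -> dict:
    # Stage 1: cheap deterministic checks. Any failure stops here,
    # so the judge is never called for obviously broken outputs.
    reasons = [r for r in (check(output) for check in checks) if r is not None]
    if reasons:
        return {"stage": "deterministic", "passed": False, "reasons": reasons}

    # Stage 2: rubric judgment, with the rationale logged for later drift review.
    score, rationale = judge(output)
    rationale_log.append({"score": score, "rationale": rationale})
    return {"stage": "rubric", "passed": score >= pass_score, "score": score}


def no_todo_marker(text: str) -> Optional[str]:
    """Example deterministic check: unfinished work markers are a hard failure."""
    return "contains TODO" if "TODO" in text else None


def stub_judge(text: str) -> tuple:
    """Placeholder for an LLM judge call; always returns a fixed score."""
    return 4, "clear and complete"
```

Calling `evaluate("TODO: finish", [no_todo_marker], stub_judge, log)` returns a deterministic-stage failure and leaves `log` empty, which is the cost saving the pattern is after.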

7. What an evaluation harness looks like in a workspace like ours

In a repository that mixes documents, tooling rules, and operating constraints, evaluation often cares more about procedure compliance than style alone.

For example:

Evaluation target	Question
File ownership compliance	Did it edit only allowed files?
Read-order compliance	Did it read the requested references first?
Language protocol	Did it keep user-facing output in Korean where required?
Concurrent-work safety	Did it avoid reverting someone else's changes?
Deliverable completeness	Did it include publish-ready frontmatter and series-nav placeholders?

These checks are operationally valuable because many real failures are caused not by weak prose, but by violating the work contract around the prose.
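Two of the rows in that table lend themselves to simple deterministic checks. The sketch below assumes the harness already has a list of changed file paths and the user-facing text; the allowed pattern is a hypothetical example, and the language check only tests for the presence of Hangul syllables, which is a rough proxy rather than real language detection.

```python
from fnmatch import fnmatchcase

# Hypothetical ownership rule: the agent may only edit blog drafts.
ALLOWED_PATTERNS = ("drafts/blog/*.md",)


def ownership_violations(changed_files: list, allowed_patterns=ALLOWED_PATTERNS) -> list:
    """Return changed paths that match none of the allowed patterns.

    fnmatchcase is case-sensitive on every platform. Its '*' also matches '/',
    so 'drafts/blog/*.md' covers nested subdirectories as well.
    """
    return [
        path for path in changed_files
        if not any(fnmatchcase(path, pattern) for pattern in allowed_patterns)
    ]


def contains_hangul(text: str) -> bool:
    """True if the text has at least one character in the Hangul Syllables block."""
    return any("\uac00" <= ch <= "\ud7a3" for ch in text)
```

For example, `ownership_violations(["drafts/blog/a.md", "docs/design.md"])` flags only `docs/design.md`.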

8. Practical starting point

You do not need a giant leaderboard to start.

Turn non-negotiable rules into deterministic checks first.
Build a small golden set of 20 to 50 representative tasks.
Reduce the rubric to 3 to 5 criteria so reviewers converge.
Add every real failure back into the regression set.
Re-run the same set before and after prompt, tool, or model changes.

That alone is enough to move evaluation from rhetoric into operations.

9. Common failure modes

Watching only the average score

The average can stay flat while policy violations get worse.

Writing vague rubrics

Criteria like "natural" or "good" drift too easily between evaluators.

Failing to preserve real failures

If production failures are not added back into the dataset, the system keeps relearning the same lesson the hard way.

Separating evaluation from improvement

If scores do not feed harness changes, the dashboard becomes decoration.

References

docs/blog_series_하네스엔지니어링_총괄_design.md
sources/260518_하네스엔지니어링_15장_블로그활용노트.md
drafts/blog/260429_하네스시리즈06_평가운영_블로그.md
drafts/blog/260429_harness_series_06_evaluation_ops_en.md

This is Part 1 of the Operations, Evaluation, and Memory series. Next: Long-running agents — why handoff structure must come before memory systems.

Series overview: Harness Engineering Series Guide

MaJu Tech Notes