Agent Operations Design Notes (5/9) — Agent Evaluation Is Closer to Regression Testing

When people talk about evaluating agents, they often jump to benchmark scores or demo performance. In real operations, the more important question is different: after adding a new model, a new prompt, a new tool, or a new permission policy, did something that used to work start breaking again? That is why practical agent evaluation behaves much more like regression testing.

Key summary

The first job of agent evaluation is to find what started breaking again, not to measure how much better we got.
Deterministic checks, rule-based validation, rubrics, and LLM-as-a-judge each do different work.
Past incidents and repeat review failures are often the strongest starting point for a regression set.
Managed Agents still do not remove the need for local evaluation, because each organization defines failure differently.

1. Why agent evaluation often fails when treated like a benchmark

Benchmarks are useful for comparison, but operations usually need different information:

after which change did something break
what kind of task regressed
why did it regress

Agents are not only answer generators. They read files, call tools, route through approvals, and work over long-running flows. That means the same headline score can still hide very different failure patterns.

So practical evaluation often needs to look less like a leaderboard and more like a regression-detection system.

2. The first question is not how much better, but what broke again

Suppose a new model improves writing quality but increases file-scope violations. Or suppose a new permission policy improves safety but sharply lowers task completion.

A simple scorecard often misses that kind of tradeoff.

Operationally, the first questions are usually:

did tasks that used to pass now fail
did the new setup increase a specific failure bucket
did apparent improvement come with worse irreversible mistakes

That is why evaluation should be built to detect regression before celebration.
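A minimal sketch of that comparison, assuming each run is stored as a dictionary of case results. The case ids, the bucket names, and the result shape are hypothetical, chosen only to illustrate the idea. Both runs below have the same pass rate (two of three), yet one case regressed and the file-scope bucket got worse.

```python
from collections import Counter


def find_regressions(baseline, candidate):
    """Return ids of cases that passed in the baseline run but fail in the candidate run."""
    return sorted(
        case_id
        for case_id, result in candidate.items()
        if baseline.get(case_id, {}).get("passed") and not result["passed"]
    )


def bucket_deltas(baseline, candidate):
    """Return the change in failure count per bucket (candidate minus baseline)."""
    def failures(run):
        return Counter(r["bucket"] for r in run.values() if not r["passed"])

    before, after = failures(baseline), failures(candidate)
    return {
        bucket: after[bucket] - before[bucket]
        for bucket in sorted(set(before) | set(after))
        if after[bucket] != before[bucket]
    }


# Hypothetical results: same headline pass rate, different failure pattern.
baseline = {
    "case-01": {"passed": True, "bucket": "file_scope"},
    "case-02": {"passed": True, "bucket": "file_scope"},
    "case-03": {"passed": False, "bucket": "writing_quality"},
}
candidate = {
    "case-01": {"passed": False, "bucket": "file_scope"},
    "case-02": {"passed": True, "bucket": "file_scope"},
    "case-03": {"passed": True, "bucket": "writing_quality"},
}

regressions = find_regressions(baseline, candidate)
deltas = bucket_deltas(baseline, candidate)
print(regressions)  # ['case-01']
print(deltas)       # {'file_scope': 1, 'writing_quality': -1}
```

Reporting the regressed ids and the per-bucket deltas side by side is what makes the tradeoff visible; a single pass rate would show no change here.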

3. Deterministic checks and rubrics do different jobs

Strong evaluation systems rarely depend on a single method.

Evaluation method	Good at catching	Main limit
deterministic check	format failures, scope violations, forbidden actions, missing approvals	weak on semantic quality
rubric	completeness, usefulness, factual quality, clarity	needs consistency management
LLM-as-a-judge	large-scale semantic comparison, qualitative assistance	judgment can drift
human review	policy exceptions, high-risk judgment	expensive

The important point is that one does not replace the others.

A practical pattern is often:

deterministic checks first
rubric or judge for semantic quality
human review for high-risk policy edges
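The layering above can be sketched as a small pipeline. This is an illustration under assumptions, not a prescribed design: the JSON output shape (`files_touched`, `sends_message`, `approval_id`), the `semantic_scorer` callable standing in for a rubric or judge, and the 0.7 threshold are all hypothetical.

```python
import json


def deterministic_check(raw_output, allowed_prefixes):
    """Return a list of clear structural failures; an empty list means none were found."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["format: output is not valid JSON"]
    if not isinstance(data, dict):
        return ["format: output is not a JSON object"]
    failures = []
    for path in data.get("files_touched", []):
        if not path.startswith(tuple(allowed_prefixes)):
            failures.append(f"scope: {path} is outside the allowed paths")
    if data.get("sends_message") and not data.get("approval_id"):
        failures.append("approval: message sent without an approval id")
    return failures


def evaluate(raw_output, allowed_prefixes, semantic_scorer, high_risk=False, threshold=0.7):
    """Run the cheap, unambiguous checks first; only score meaning if they pass."""
    failures = deterministic_check(raw_output, allowed_prefixes)
    if failures:
        return {"verdict": "fail", "stage": "deterministic", "failures": failures}
    score = semantic_scorer(raw_output)  # rubric or judge stand-in (assumed callable)
    if high_risk:
        # High-risk cases are routed to a person regardless of the semantic score.
        return {"verdict": "needs_human_review", "stage": "human", "score": score}
    verdict = "pass" if score >= threshold else "fail"
    return {"verdict": verdict, "stage": "semantic", "score": score}


# A well-written output that touches a file outside scope still fails early.
out_of_scope = json.dumps({"files_touched": ["docs/guide.md", "infra/prod.tf"]})
result = evaluate(out_of_scope, ["docs/"], semantic_scorer=lambda text: 0.95)
print(result["verdict"], result["stage"], result["failures"])
```

Because the deterministic stage returns before the scorer is called, a high semantic score cannot soften a scope or approval failure.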

4. Why real failures make the strongest regression set

Many teams try to design an elegant evaluation dataset from scratch. That can help, but in operations the most valuable regression set often already exists.

Examples:

previous incidents
recurring review comments
missed-approval cases
scope-violation cases
near-miss publication or transmission errors

Those cases are stronger than generic benchmark examples because they represent places where your system already failed in reality.

So the starting point for a regression suite is often not a neat benchmark. It is a list of failures your team already paid for.

5. Evaluation datasets are operating assets, not reports

If an evaluation set is treated as a one-off experiment file, it becomes stale quickly. To become an operating asset, it usually needs at least:

what failure bucket the case represents
why the case matters
what expected behavior looks like
after which changes it tends to regress

That means the dataset is not just a score generator. It is a way for the organization to remember what it considers failure.
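One way to keep those four pieces of context attached to every case is a small record type. The sketch below assumes a Python dataclass; the field names, the change-type labels, and the two example cases are hypothetical and not taken from any real incident.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegressionCase:
    case_id: str
    failure_bucket: str        # what failure bucket the case represents
    why_it_matters: str        # why the case matters
    expected_behavior: str     # what expected behavior looks like
    regresses_after: tuple = ()  # change types after which it tends to regress


def cases_for_change(cases, change_type):
    """Select the cases that have historically regressed after this kind of change."""
    return [case for case in cases if change_type in case.regresses_after]


# Illustrative cases only.
cases = [
    RegressionCase(
        case_id="incident-001",
        failure_bucket="missing_approval",
        why_it_matters="An outbound message went out without sign-off.",
        expected_behavior="The agent requests approval before sending.",
        regresses_after=("permission_policy", "tool"),
    ),
    RegressionCase(
        case_id="review-007",
        failure_bucket="file_scope",
        why_it_matters="Reviewers keep flagging edits outside the task directory.",
        expected_behavior="The agent only edits files under the task directory.",
        regresses_after=("model", "prompt"),
    ),
]

selected = cases_for_change(cases, "model")
print([case.case_id for case in selected])  # ['review-007']
```

Keeping the rationale next to the case means that whoever sees it fail later can tell why it was added without digging through old incident threads.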

6. Why LLM-as-a-judge should not be the only judge

Judge models are powerful, but they are unreliable as the sole decision-maker.

Problems appear quickly:

the reasoning behind the grade can drift
clear structural failures can be softened into subjective commentary
if the evaluator model changes, your evaluation criteria can change with it

So judge models are usually better treated as a higher-layer semantic helper than as the only gate.

The more stable pattern is:

deterministic checks for clear failures
judge assistance for semantic quality
human review for policy exceptions and high-risk decisions
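The evaluator-change problem in particular can be made visible by recording which judge produced each grade. The sketch below is one possible guard: it refuses to compare two judge scores unless the judge model and rubric version match. The field names, the identifiers `judge-a` and `judge-b`, and the 0.05 tolerance are hypothetical.

```python
def compare_judge_scores(baseline, candidate, tolerance=0.05):
    """Compare two judge grades, refusing when the evaluator itself has changed."""
    same_evaluator = (
        baseline["judge_model"] == candidate["judge_model"]
        and baseline["rubric_version"] == candidate["rubric_version"]
    )
    if not same_evaluator:
        # A score shift here could come from the judge, not from the agent.
        return "not_comparable"
    delta = candidate["score"] - baseline["score"]
    if delta < -tolerance:
        return "regressed"
    if delta > tolerance:
        return "improved"
    return "unchanged"


before = {"judge_model": "judge-a", "rubric_version": "v1", "score": 0.80}
after_same_judge = {"judge_model": "judge-a", "rubric_version": "v1", "score": 0.70}
after_new_judge = {"judge_model": "judge-b", "rubric_version": "v1", "score": 0.90}

print(compare_judge_scores(before, after_same_judge))  # regressed
print(compare_judge_scores(before, after_new_judge))   # not_comparable
```

When the result is `not_comparable`, one option is to re-grade the baseline outputs with the new judge before drawing any conclusion about the agent.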

7. Why local evaluation remains necessary in the Managed Agents era

Provider-side evaluations and generic platform quality signals are useful. They still do not define local failure for you.

The platform usually does not know:

which files in your repo must never be touched
which outbound messages require approval
which publishing flows are forbidden
which regressions are operationally critical for your team

That is why local evaluation does not compete with platform evaluation. It layers your own failure definition on top of it.
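A local failure definition can start as a plain policy table checked against each action the agent attempted. The sketch below assumes actions are logged as dictionaries; the policy keys, the glob patterns, and the action names (`send_email`, `publish_to_production`) are hypothetical placeholders for whatever your team actually forbids.

```python
from fnmatch import fnmatchcase

# Hypothetical local policy: none of these names come from a real platform.
LOCAL_POLICY = {
    "forbidden_file_globs": ["secrets/*", "*.pem"],
    "approval_required_actions": {"send_email"},
    "forbidden_actions": {"publish_to_production"},
}


def violations(action, policy=LOCAL_POLICY):
    """Return the local policy violations for one logged agent action."""
    found = []
    kind = action["kind"]
    if kind in policy["forbidden_actions"]:
        found.append(f"forbidden action: {kind}")
    if kind in policy["approval_required_actions"] and not action.get("approved", False):
        found.append(f"missing approval: {kind}")
    for path in action.get("paths", []):
        if any(fnmatchcase(path, glob) for glob in policy["forbidden_file_globs"]):
            found.append(f"forbidden file: {path}")
    return found


print(violations({"kind": "write_file", "paths": ["secrets/api.key", "docs/readme.md"]}))
print(violations({"kind": "send_email"}))
print(violations({"kind": "send_email", "approved": True}))
```

The point of keeping this table in your own repository is that it encodes knowledge the platform does not have, and it can be reviewed and versioned like any other code.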

8. A small-team way to start with regression-first evaluation

You do not need a large framework to begin.

collect ten recent failure cases
turn format, scope, and approval problems into deterministic checks
add three to five rubric dimensions for completeness and accuracy
run the same set before and after each meaningful change
look for what broke again before asking what improved
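The steps above fit in a few dozen lines. The sketch below is a toy harness under stated assumptions: the agent is any callable from task text to output text, checks are named predicates, and the two stub agents and two cases are invented purely to show the before-and-after loop.

```python
def run_suite(agent, cases, checks):
    """Run every case through the agent; return {case_id: [names of failed checks]}."""
    results = {}
    for case in cases:
        output = agent(case["input"])
        results[case["id"]] = [
            name for name, check in checks.items() if not check(case, output)
        ]
    return results


def broke_again(before, after):
    """Return cases that were clean before the change and have failures after it."""
    return {
        case_id: failed
        for case_id, failed in after.items()
        if failed and case_id in before and not before[case_id]
    }


# Hypothetical cases, as if collected from recent failures.
cases = [
    {"id": "fail-01", "input": "summarize the release notes", "forbidden_path": "config.yaml"},
    {"id": "fail-02", "input": "draft the status update", "forbidden_path": "secrets.env"},
]

# Deterministic checks: each returns True when the output is acceptable.
checks = {
    "non_empty": lambda case, output: bool(output.strip()),
    "scope": lambda case, output: case["forbidden_path"] not in output,
}


def old_agent(task):  # stand-in for the setup before the change
    return f"Done: {task}"


def new_agent(task):  # stand-in for the setup after the change
    return f"Done: {task}; also edited config.yaml"


before = run_suite(old_agent, cases, checks)
after = run_suite(new_agent, cases, checks)
print(broke_again(before, after))  # {'fail-01': ['scope']}
```

Rubric dimensions can be added later as extra entries in `checks` or as a separate scoring pass; the essential habit is running the identical set on both sides of a change.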

That is often much more operationally useful than building demo-only evaluations.

9. Conclusion: the heart of agent evaluation is closer to recurrence prevention than performance comparison

Good agent evaluation is not mainly about making a prettier ranking table.

Operationally, the deeper questions are:

did something that used to work start failing again
which change increased which failure bucket
can we catch that before the next release or handoff

That is why agent evaluation behaves less like a benchmark contest and more like a regression-testing system for repeated failure prevention.

References

OpenAI Agents SDK, Guardrails
OpenAI, Introducing AgentKit, 2025-10-06
Google Cloud, BigQuery Agent Analytics, 2025-11-21



MaJu Tech Notes