Agent Operations Design Notes (5/9) — Agent Evaluation Is Closer to Regression Testing
When people talk about evaluating agents, they often jump to benchmark scores or demo performance. In real operations, the more important question is different: after adding a new model, a new prompt, a new tool, or a new permission policy, did something that used to work start breaking again? That is why practical agent evaluation behaves much more like regression testing.
Key Summary
- The first job of agent evaluation is not "how much better did we get" but "what started breaking again."
- Deterministic checks, rule-based validation, rubrics, and LLM-as-a-judge each do different work.
- Past incidents and repeat review failures are often the strongest starting point for a regression set.
- Managed Agents still do not remove the need for local evaluation, because each organization defines failure differently.
1. Why agent evaluation often fails when treated like a benchmark
Benchmarks are useful for comparison, but operations usually need different information:
- after which change did something break
- what kind of task regressed
- why did it regress
Agents are not only answer generators. They read files, call tools, route through approvals, and work over long-running flows. That means the same headline score can still hide very different failure patterns.
So practical evaluation often needs to look less like a leaderboard and more like a regression-detection system.
2. The first question is not how much better, but what broke again
Suppose a new model improves writing quality but increases file-scope violations. Or suppose a new permission policy improves safety but sharply lowers task completion.
A simple scorecard often misses that kind of tradeoff.
Operationally, the first questions are usually:
- did tasks that used to pass now fail
- did the new setup increase a specific failure bucket
- did apparent improvement come with worse irreversible mistakes
That is why evaluation should be built to detect regression before celebration.
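One way to make "what broke again" concrete is to diff two runs of the same case set. The sketch below is illustrative only: `CaseResult`, `find_regressions`, `bucket_increases`, and the bucket names are hypothetical, not part of any particular framework.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CaseResult:
    case_id: str
    passed: bool
    failure_bucket: Optional[str] = None  # e.g. "scope_violation"; None when passed


def find_regressions(before, after):
    """Case ids that passed in the baseline run but fail in the candidate run."""
    baseline_passed = {r.case_id for r in before if r.passed}
    return sorted(
        r.case_id for r in after if not r.passed and r.case_id in baseline_passed
    )


def bucket_increases(before, after):
    """Failure buckets whose count grew, mapped to the size of the increase."""
    old = Counter(r.failure_bucket for r in before if not r.passed)
    new = Counter(r.failure_bucket for r in after if not r.passed)
    return {bucket: new[bucket] - old[bucket] for bucket in new if new[bucket] > old[bucket]}


# Made-up example data: same cases, run before and after a change.
baseline = [
    CaseResult("summarize-report", True),
    CaseResult("edit-config", True),
    CaseResult("send-email", False, "missing_approval"),
]
candidate = [
    CaseResult("summarize-report", True),
    CaseResult("edit-config", False, "scope_violation"),
    CaseResult("send-email", False, "missing_approval"),
]

print(find_regressions(baseline, candidate))   # ['edit-config']
print(bucket_increases(baseline, candidate))   # {'scope_violation': 1}
```

Both runs have the same count of known failures in the `missing_approval` bucket, so a pass-rate headline would only show a small drop. The diff shows which case regressed and into which bucket.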
3. Deterministic checks and rubrics do different jobs
Strong evaluation systems rarely depend on a single method.
| Evaluation method | Good at catching | Main limit |
|---|---|---|
| deterministic check | format failures, scope violations, forbidden actions, missing approvals | weak on semantic quality |
| rubric | completeness, usefulness, factual quality, clarity | needs consistency management |
| LLM-as-a-judge | large-scale semantic comparison, qualitative assistance | judgment can drift |
| human review | policy exceptions, high-risk judgment | expensive |
The important point is that one does not replace the others.
A practical pattern is often:
- deterministic checks first
- rubric or judge for meaning quality
- human review for high-risk policy edges
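The ordering above can be sketched as a small layered gate. Everything here is an assumption for illustration: the output shape (a plain dict), the two example checks, the `semantic_score` input (standing in for a rubric or judge result), and the `0.7` threshold.

```python
from typing import Callable, List, Optional, Tuple

# Each deterministic check returns a failure label, or None when the output is fine.
Check = Callable[[dict], Optional[str]]


def has_required_keys(output: dict) -> Optional[str]:
    missing = {"summary", "files_changed"} - output.keys()
    return f"format_failure:missing {sorted(missing)}" if missing else None


def approval_recorded(output: dict) -> Optional[str]:
    if output.get("sent_message") and not output.get("approved_by"):
        return "missing_approval"
    return None


CHECKS: List[Check] = [has_required_keys, approval_recorded]


def evaluate(output: dict, checks: List[Check], semantic_score: float,
             high_risk: bool, threshold: float = 0.7) -> Tuple[str, List[str]]:
    """Layered verdict: hard failures first, then semantic quality, then human review."""
    failures = []
    for check in checks:
        label = check(output)
        if label is not None:
            failures.append(label)
    if failures:
        return "fail", failures  # a high semantic score cannot rescue a hard failure
    if semantic_score < threshold:
        return "fail", ["low_semantic_quality"]
    if high_risk:
        return "needs_human_review", []
    return "pass", []


ok = {"summary": "Updated the guide.", "files_changed": ["docs/guide.md"]}
unapproved = {"summary": "Sent update.", "files_changed": [], "sent_message": True}

print(evaluate(ok, CHECKS, 0.9, high_risk=False))           # ('pass', [])
print(evaluate(unapproved, CHECKS, 0.95, high_risk=False))  # ('fail', ['missing_approval'])
print(evaluate(ok, CHECKS, 0.9, high_risk=True))            # ('needs_human_review', [])
```

The design choice is that the cheaper, stricter layer runs first, so semantic scoring and human attention are spent only on outputs that are structurally acceptable.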
4. Why real failures make the strongest regression set
Many teams try to design an elegant evaluation dataset from scratch. That can help, but in operations the most valuable regression set often already exists.
Examples:
- previous incidents
- recurring review comments
- missed-approval cases
- scope-violation cases
- near-miss publication or transmission errors
Those cases are stronger than generic benchmark examples because they represent places where your system already failed in reality.
So the starting point for a regression suite is often not a neat benchmark. It is a list of failures your team already paid for.
5. Evaluation datasets are operating assets, not reports
If an evaluation set is treated as a one-off experiment file, it becomes stale quickly. To become an operating asset, each case usually needs to record at least:
- what failure bucket the case represents
- why the case matters
- what expected behavior looks like
- after which changes it tends to regress
That means the dataset is not just a score generator. It is a way for the organization to remember what it considers failure.
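As a minimal sketch, the four items above could be stored as fields on each case. The `RegressionCase` class, its field names, and the example values are hypothetical; any structured format that keeps the same information would serve.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RegressionCase:
    case_id: str
    failure_bucket: str          # what failure bucket the case represents
    why_it_matters: str          # why the case matters
    expected_behavior: str       # what expected behavior looks like
    regresses_after: list = field(default_factory=list)  # change types that tend to break it


def missing_fields(case: RegressionCase) -> list:
    """Names of fields left empty, so half-documented cases are easy to spot."""
    return [name for name, value in asdict(case).items() if not value]


# Made-up example case.
case = RegressionCase(
    case_id="example-01",
    failure_bucket="missing_approval",
    why_it_matters="An outbound message went out without sign-off.",
    expected_behavior="The agent pauses and requests approval before sending.",
    regresses_after=["permission policy change", "model upgrade"],
)
draft = RegressionCase("example-02", "scope_violation", "", "")

print(json.dumps(asdict(case), indent=2))
print(missing_fields(draft))  # ['why_it_matters', 'expected_behavior', 'regresses_after']
```

Running `missing_fields` over the whole set before each evaluation run is one cheap way to notice cases that have drifted into score-only entries.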
6. Why LLM-as-a-judge should not be the only judge
Judge models are powerful, but they are unstable as the only decision-maker.
Problems appear quickly:
- the reasoning behind the grade can drift
- clear structural failures can be softened into subjective commentary
- if the evaluator model changes, your evaluation criteria can change with it
So judge models are usually better treated as a higher-layer semantic helper than as the only gate.
The more stable pattern is:
- deterministic checks for clear failures
- judge assistance for semantic quality
- human review for policy exceptions and high-risk decisions
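One hedge against silently shifting criteria is to stamp every judge verdict with the judge configuration that produced it, and to refuse score comparisons across different configurations. The sketch below assumes a hypothetical `judge_id` string (for example, a model name plus prompt version) and an arbitrary tolerance of `0.1`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeVerdict:
    case_id: str
    score: float   # assumed scale: 0.0 to 1.0
    judge_id: str  # hypothetical label for judge model + prompt version


def comparable(before, after) -> bool:
    """Scores are only comparable when one judge configuration produced both runs."""
    judges = {v.judge_id for v in before} | {v.judge_id for v in after}
    return len(judges) == 1


def score_drops(before, after, tolerance: float = 0.1):
    """Case ids whose judge score fell by more than `tolerance` between runs."""
    if not comparable(before, after):
        raise ValueError("judge changed between runs; re-score the baseline first")
    old = {v.case_id: v.score for v in before}
    return sorted(
        v.case_id for v in after
        if v.case_id in old and old[v.case_id] - v.score > tolerance
    )


before = [JudgeVerdict("a", 0.9, "judge-v1"), JudgeVerdict("b", 0.8, "judge-v1")]
after = [JudgeVerdict("a", 0.6, "judge-v1"), JudgeVerdict("b", 0.8, "judge-v1")]
print(score_drops(before, after))  # ['a']
```

If the judge does change, re-scoring the baseline with the new judge keeps the comparison about the agent rather than about the evaluator.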
7. Why local evaluation remains necessary in the Managed Agents era
Provider-side evaluations and generic platform quality signals are useful. They still do not define local failure for you.
The platform usually does not know:
- which files in your repo must never be touched
- which outbound messages require approval
- which publishing flows are forbidden
- which regressions are operationally critical for your team
That is why local evaluation does not compete with platform evaluation. It layers your own failure definition on top of it.
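A local failure definition can start as a small policy table checked on top of whatever the platform reports. The sketch below is an assumption-heavy illustration: `LOCAL_POLICY`, its patterns and flow names, the run dict shape, and the `platform_passed` flag are all placeholders, not a real platform API.

```python
from fnmatch import fnmatch

# Hypothetical local policy. Patterns and names are placeholders for your own rules.
LOCAL_POLICY = {
    "protected_files": ["secrets/*", "*.pem"],
    "approval_required_messages": {"customer_email"},
    "forbidden_publish_flows": {"direct_to_production"},
}


def local_failures(run: dict, policy: dict = LOCAL_POLICY) -> list:
    """Return local policy violations found in a recorded agent run."""
    failures = []
    for path in run.get("touched_files", []):
        if any(fnmatch(path, pattern) for pattern in policy["protected_files"]):
            failures.append(f"protected_file:{path}")
    for message in run.get("outbound_messages", []):
        needs_approval = message["kind"] in policy["approval_required_messages"]
        if needs_approval and not message.get("approved"):
            failures.append(f"unapproved_message:{message['kind']}")
    flow = run.get("publish_flow")
    if flow in policy["forbidden_publish_flows"]:
        failures.append(f"forbidden_flow:{flow}")
    return failures


def final_verdict(platform_passed: bool, run: dict):
    """A platform pass is necessary here, but local rules can still fail the run."""
    failures = local_failures(run)
    return platform_passed and not failures, failures


run = {
    "touched_files": ["secrets/api.key", "README.md"],
    "outbound_messages": [{"kind": "customer_email"}],
    "publish_flow": "staging",
}
print(final_verdict(True, run))
# (False, ['protected_file:secrets/api.key', 'unapproved_message:customer_email'])
```

The platform signal is an input to the verdict, not the verdict itself.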
8. A small-team way to start with regression-first evaluation
You do not need a large framework to begin.
- collect ten recent failure cases
- turn format, scope, and approval problems into deterministic checks
- add three to five rubric dimensions for completeness and accuracy
- run the same set before and after each meaningful change
- look for what broke again before asking what improved
That is often much more operationally useful than building demo-only evaluations.
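For the rubric step, a handful of named dimensions with a per-dimension floor is usually enough to begin. The sketch below assumes three hypothetical dimensions scored from 1 to 5 by a reviewer or judge, and an arbitrary floor of 3. An average alone can hide one weak dimension, so the floor is checked separately.

```python
# Hypothetical rubric: dimension name -> what a reviewer should look for.
RUBRIC = {
    "completeness": "All requested parts of the task are addressed.",
    "accuracy": "Claims and changes are factually correct.",
    "clarity": "The result is easy for a reviewer to follow.",
}


def rubric_result(scores: dict, min_per_dimension: int = 3) -> dict:
    """Summarize 1-5 scores; fail when any single dimension falls below the floor."""
    missing = sorted(set(RUBRIC) - scores.keys())
    if missing:
        raise ValueError(f"missing rubric scores: {missing}")
    weak = sorted(name for name in RUBRIC if scores[name] < min_per_dimension)
    mean = sum(scores[name] for name in RUBRIC) / len(RUBRIC)
    return {"mean": round(mean, 2), "weak_dimensions": weak, "passed": not weak}


result = rubric_result({"completeness": 4, "accuracy": 2, "clarity": 5})
print(result)  # {'mean': 3.67, 'weak_dimensions': ['accuracy'], 'passed': False}
```

Recording these results per case before and after a change lets the same regression question apply to semantic quality: which dimension dropped, and on which cases.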
9. Conclusion: the heart of agent evaluation is closer to recurrence prevention than performance comparison
Good agent evaluation is not mainly about making a prettier ranking table.
Operationally, the deeper questions are:
- did something that used to work start failing again
- which change increased which failure bucket
- can we catch that before the next release or handoff
That is why agent evaluation behaves less like a benchmark contest and more like a regression-testing system for repeated failure prevention.
Related Internal Links
- AI Agent Permission Design: Where Should You Draw the Line Between Allow, Ask, and Deny?
- In the Managed Agents Era, How Should You Design an Approval Loop?
- Sandboxing Is Not Just a Security Feature. It Is a Quality Structure.
- In Long-Running Agent Operations, Handoff Design Comes Before Memory
- What a Good Agent Memory Architecture Looks Like
References
- OpenAI Agents SDK, Guardrails
- OpenAI, Introducing AgentKit, 2025-10-06
- Google Cloud, BigQuery Agent Analytics, 2025-11-21
Series overview: Series index