Agent Operations Design Notes (5/9) — Agent Evaluation Is Closer to Regression Testing

When people talk about evaluating agents, they often jump to benchmark scores or demo performance. In real operations, the more important question is different: after adding a new model, a new prompt, a new tool, or a new permission policy, did something that used to work start breaking again? That is why practical agent evaluation behaves much more like regression testing.


Key Summary

  • The first job of agent evaluation is to ask not "how much better did we get?" but "what started breaking again?"
  • Deterministic checks, rule-based validation, rubrics, and LLM-as-a-judge each do different work.
  • Past incidents and repeat review failures are often the strongest starting point for a regression set.
  • Managed Agents still do not remove the need for local evaluation, because each organization defines failure differently.

1. Why agent evaluation often fails when treated like a benchmark

Benchmarks are useful for comparison, but operations usually need different information:

  • after which change did something break
  • what kind of task regressed
  • why did it regress

Agents are not only answer generators. They read files, call tools, route through approvals, and work over long-running flows. That means the same headline score can still hide very different failure patterns.

So practical evaluation often needs to look less like a leaderboard and more like a regression-detection system.

2. The first question is not how much better, but what broke again

Suppose a new model improves writing quality but increases file-scope violations. Or suppose a new permission policy improves safety but sharply lowers task completion.

A simple scorecard often misses that kind of tradeoff.

Operationally, the first questions are usually:

  • did tasks that used to pass now fail
  • did the new setup increase a specific failure bucket
  • did apparent improvement come with worse irreversible mistakes

That is why evaluation should be built to detect regression before celebration.
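As a minimal sketch of that comparison, assume each run is stored as a mapping from task ID to a pass flag plus a failure bucket. The names `RunResult`, `newly_failing`, and `bucket_deltas` are hypothetical and not taken from any particular framework.

```python
from collections import Counter
from typing import NamedTuple, Optional


class RunResult(NamedTuple):
    passed: bool
    bucket: Optional[str] = None  # failure bucket label; None when the task passed


def newly_failing(baseline: dict[str, RunResult], candidate: dict[str, RunResult]) -> list[str]:
    """Task IDs that passed in the baseline run but fail in the candidate run."""
    return sorted(
        task_id
        for task_id, before in baseline.items()
        if before.passed and task_id in candidate and not candidate[task_id].passed
    )


def bucket_deltas(baseline: dict[str, RunResult], candidate: dict[str, RunResult]) -> dict[str, int]:
    """Change in failure count per bucket (candidate minus baseline), omitting unchanged buckets."""
    before = Counter(r.bucket or "unbucketed" for r in baseline.values() if not r.passed)
    after = Counter(r.bucket or "unbucketed" for r in candidate.values() if not r.passed)
    return {
        bucket: after[bucket] - before[bucket]
        for bucket in sorted(set(before) | set(after))
        if after[bucket] != before[bucket]
    }


# Illustrative data only: the candidate fixes a format failure but adds a scope violation.
baseline = {"t1": RunResult(True), "t2": RunResult(True), "t3": RunResult(False, "format")}
candidate = {"t1": RunResult(True), "t2": RunResult(False, "scope_violation"), "t3": RunResult(True)}

print(newly_failing(baseline, candidate))
print(bucket_deltas(baseline, candidate))
```

Both runs have exactly one failure, so a headline pass rate would show no change. The per-task and per-bucket view shows that a previously passing task regressed into a different bucket.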

3. Deterministic checks and rubrics do different jobs

Strong evaluation systems rarely depend on a single method.

Evaluation method   | Good at catching                                                        | Main limit
deterministic check | format failures, scope violations, forbidden actions, missing approvals | weak on semantic quality
rubric              | completeness, usefulness, factual quality, clarity                      | needs consistency management
LLM-as-a-judge      | large-scale semantic comparison, qualitative assistance                 | judgment can drift
human review        | policy exceptions, high-risk judgment                                   | expensive

The important point is that one does not replace the others.

A practical pattern is often:

  1. deterministic checks first
  2. rubric or judge for semantic quality
  3. human review for high-risk policy edges
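The ordering above can be sketched as a single function that stops at the first layer that objects. Everything here is an assumption for illustration: the `Verdict` type, the callable signatures, and the `min_score` threshold of 0.7 are placeholders, and `semantic_score` stands in for whatever rubric or judge a team actually uses.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    status: str  # "fail", "needs_human", or "pass"
    reason: str


def evaluate(
    output: dict,
    deterministic_checks: list[Callable[[dict], Optional[str]]],
    semantic_score: Callable[[dict], float],
    is_high_risk: Callable[[dict], bool],
    min_score: float = 0.7,  # hypothetical threshold
) -> Verdict:
    # 1. Deterministic checks run first; any reported problem is a hard fail.
    for check in deterministic_checks:
        problem = check(output)
        if problem is not None:
            return Verdict("fail", problem)
    # 2. The rubric or judge scores semantic quality only once structure is clean.
    score = semantic_score(output)
    if score < min_score:
        return Verdict("fail", f"semantic score {score:.2f} below {min_score:.2f}")
    # 3. High-risk outputs go to a person even when the automated layers pass.
    if is_high_risk(output):
        return Verdict("needs_human", "high-risk action requires review")
    return Verdict("pass", "all layers passed")


def has_summary(output: dict) -> Optional[str]:
    """Example deterministic check: return a problem description, or None if clean."""
    return None if "summary" in output else "missing field: summary"


print(evaluate({"text": "draft"}, [has_summary], lambda o: 1.0, lambda o: False))
```

Because the deterministic layer returns early, a structurally broken output never reaches the semantic scorer, so a generous judge cannot soften a clear failure.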

4. Why real failures make the strongest regression set

Many teams try to design an elegant evaluation dataset from scratch. That can help, but in operations the most valuable regression set often already exists.

Examples:

  • previous incidents
  • recurring review comments
  • missed-approval cases
  • scope-violation cases
  • near-miss publication or transmission errors

Those cases are stronger than generic benchmark examples because they represent places where your system already failed in reality.

So the starting point for a regression suite is often not a neat benchmark. It is a list of failures your team already paid for.

5. Evaluation datasets are operating assets, not reports

If an evaluation set is treated as a one-off experiment file, it becomes stale quickly. To become an operating asset, it usually needs at least:

  • what failure bucket the case represents
  • why the case matters
  • what expected behavior looks like
  • after which changes it tends to regress

That means the dataset is not just a score generator. It is a way for the organization to remember what it considers failure.
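One way to keep those four fields attached to every case is a small record type. This is a sketch only: the `RegressionCase` class, its field names, and the change-type labels such as "model_swap" are hypothetical, and the sample cases are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegressionCase:
    case_id: str
    failure_bucket: str     # what failure bucket the case represents
    why_it_matters: str     # why the case matters
    expected_behavior: str  # what expected behavior looks like
    regresses_after: tuple[str, ...] = ()  # kinds of change after which it tends to regress


def cases_for_change(cases: list[RegressionCase], change_type: str) -> list[RegressionCase]:
    """Select the cases known to be sensitive to a given kind of change."""
    return [c for c in cases if change_type in c.regresses_after]


# Invented examples, not real incidents.
cases = [
    RegressionCase(
        case_id="scope-001",
        failure_bucket="scope_violation",
        why_it_matters="agent edited a file outside the requested directory",
        expected_behavior="only files under the task directory are modified",
        regresses_after=("model_swap", "prompt_change"),
    ),
    RegressionCase(
        case_id="approval-004",
        failure_bucket="missing_approval",
        why_it_matters="outbound message was sent without sign-off",
        expected_behavior="agent pauses and requests approval before sending",
        regresses_after=("permission_policy_change",),
    ),
]

print([c.case_id for c in cases_for_change(cases, "model_swap")])
```

Recording `regresses_after` lets a team run the most relevant subset first when a specific kind of change lands, while still keeping the full set for scheduled runs.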

6. Why LLM-as-a-judge should not be the only judge

Judge models are powerful, but they are unstable as the only decision-maker.

Problems appear quickly:

  • the reasoning behind the grade can drift
  • clear structural failures can be softened into subjective commentary
  • if the evaluator model changes, your evaluation criteria can change with it

So judge models are usually better treated as a higher-layer semantic helper than as the only gate.

The more stable pattern is:

  • deterministic checks for clear failures
  • judge assistance for semantic quality
  • human review for policy exceptions and high-risk decisions
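One hedged way to notice drift when the evaluator model changes is to keep a small set of human-labeled anchor cases and measure how often the judge agrees with them. The function names and the 0.9 agreement threshold below are assumptions chosen for illustration, not a recommended standard.

```python
def judge_agreement(human_labels: dict[str, bool], judge_labels: dict[str, bool]) -> float:
    """Fraction of shared anchor cases where the judge matches the human label."""
    shared = human_labels.keys() & judge_labels.keys()
    if not shared:
        raise ValueError("no overlapping anchor cases")
    return sum(human_labels[k] == judge_labels[k] for k in shared) / len(shared)


def judge_is_trustworthy(
    human_labels: dict[str, bool],
    judge_labels: dict[str, bool],
    min_agreement: float = 0.9,  # hypothetical threshold
) -> bool:
    """True when the judge agrees with human labels often enough to keep using it."""
    return judge_agreement(human_labels, judge_labels) >= min_agreement


# Illustrative labels: True means "acceptable output".
human = {"a": True, "b": False, "c": True, "d": False}
judge = {"a": True, "b": False, "c": False, "d": False}

print(judge_agreement(human, judge))
print(judge_is_trustworthy(human, judge))
```

Re-running this check whenever the judge model or its prompt changes turns "the evaluator drifted" from a suspicion into a number that can be tracked.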

7. Why local evaluation remains necessary in the Managed Agents era

Provider-side evaluations and generic platform quality signals are useful. They still do not define local failure for you.

The platform usually does not know:

  • which files in your repo must never be touched
  • which outbound messages require approval
  • which publishing flows are forbidden
  • which regressions are operationally critical for your team

That is why local evaluation does not compete with platform evaluation. It layers your own failure definition on top of it.
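Local rules of this kind are often simple enough to express as deterministic checks. The sketch below assumes a run log that lists touched file paths and executed actions. The patterns in `PROTECTED_PATHS`, the action names, and the `approved` flag are placeholders, not a real policy.

```python
from fnmatch import fnmatchcase

# Hypothetical local policy; replace with your own paths and action names.
PROTECTED_PATHS = ["secrets/*", "*.pem", "infra/prod/*"]
APPROVAL_REQUIRED_ACTIONS = {"send_email", "publish_post"}


def scope_violations(touched_files: list[str]) -> list[str]:
    """Files the agent touched that match a protected pattern."""
    return [
        path
        for path in touched_files
        if any(fnmatchcase(path, pattern) for pattern in PROTECTED_PATHS)
    ]


def missing_approvals(actions: list[dict]) -> list[str]:
    """Names of approval-required actions that ran without an approval flag."""
    return [
        action["name"]
        for action in actions
        if action["name"] in APPROVAL_REQUIRED_ACTIONS and not action.get("approved", False)
    ]


print(scope_violations(["src/app.py", "secrets/api.txt", "certs/key.pem"]))
print(missing_approvals([
    {"name": "send_email"},
    {"name": "publish_post", "approved": True},
    {"name": "read_file"},
]))
```

Note that `fnmatchcase` lets `*` match across directory separators, so `"*.pem"` also covers nested paths such as `certs/key.pem`. A stricter policy may need explicit path handling instead.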

8. A small-team way to start with regression-first evaluation

You do not need a large framework to begin.

  1. collect ten recent failure cases
  2. turn format, scope, and approval problems into deterministic checks
  3. add three to five rubric dimensions for completeness and accuracy
  4. run the same set before and after each meaningful change
  5. look for what broke again before asking what improved

That is often much more operationally useful than building demo-only evaluations.

9. Conclusion: the heart of agent evaluation is closer to recurrence prevention than performance comparison

Good agent evaluation is not mainly about making a prettier ranking table.

Operationally, the deeper questions are:

  • did something that used to work start failing again
  • which change increased which failure bucket
  • can we catch that before the next release or handoff

That is why agent evaluation behaves less like a benchmark contest and more like a regression-testing system for repeated failure prevention.
