Agent Self-Improvement Harness (3/12) — Self-Review Cron in Three Stages: scan→apply→fix
Why doing everything at once is slower in practice
Key Summary
- Running scan + analysis + fix in a single pipeline causes false positives to spike
- Splitting into three discrete stages — scan→apply→fix — clarifies each stage's responsibility and structurally blocks false positives
- The batch limit (2 files, 10 minutes) is a root-cause tracing design decision, not an efficiency constraint
What Is the Self-Review Cron
OpenClaw's self-review cron periodically scans the codebase and automatically corrects items that fall below quality thresholds.
Two trigger paths exist. The first is an event-based trigger that fires immediately on every commit push. The second is a schedule-based trigger that inspects the full codebase at regular intervals. Both paths go through the same three-stage pipeline.
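As a rough illustration of two triggers sharing one pipeline, the sketch below models each trigger as a small request object that maps to the same ordered stages. The names `ReviewRequest`, `on_push`, `on_schedule`, and `plan_stages` are hypothetical, and limiting the push trigger to the changed files is an assumption, not something OpenClaw is confirmed to do.

```python
from dataclasses import dataclass

PIPELINE_STAGES = ("scan", "apply", "fix")


@dataclass(frozen=True)
class ReviewRequest:
    trigger: str             # "push" or "schedule" (illustrative labels)
    paths: tuple[str, ...]   # empty tuple means "inspect the full codebase"


def on_push(changed_files: list[str]) -> ReviewRequest:
    """Event-based trigger. Assumption: only the pushed files are reviewed."""
    return ReviewRequest(trigger="push", paths=tuple(changed_files))


def on_schedule() -> ReviewRequest:
    """Schedule-based trigger: review the full codebase."""
    return ReviewRequest(trigger="schedule", paths=())


def plan_stages(request: ReviewRequest) -> tuple[str, ...]:
    """Both trigger types go through the same ordered stages."""
    return PIPELINE_STAGES
```

The point of the shape is that the trigger only decides what gets reviewed, never how it is reviewed.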
The initial design was simple: one cron cycle handles scan + analysis + fix in a single pass. Logically clean. In practice, false positives appeared at scale and forced a structural redesign.
Body
1. Structural Root Cause of False Positives
In a monolithic pipeline, once the scan stage produces a list of "suspected candidates," the next stage immediately executes fixes. The problem is that scan operates conservatively — it captures everything that might be a problem.
That is not a flaw in the scanner. Scanners are supposed to produce more false positives than false negatives. Catching too much is preferable to missing real issues. Distinguishing actual problems from noise is the responsibility of the next stage.
In a single pipeline, that disambiguation stage is either absent or runs in the same execution context as the scan — so it arrives at the same conclusion with the same bias. If scan says "looks like a problem," analysis agrees, and the fix proceeds.
The result is an accumulation of cases where correct code is unnecessarily modified.
2. Three-Stage Split: scan→apply→fix
The solution was to physically separate generation from verification.
scan (Stage 1: Candidate List Generation)
- Scans the codebase and produces a list of "potentially problematic candidates"
- Criteria are intentionally loose: the goal is to miss nothing
- Output is a candidate list file. No immediate modification
apply (Stage 2: Filtering + Fix Queue Registration)
- Receives scan output and determines whether each candidate is an actual problem
- The critical false-positive filtering stage
- Only confirmed issues are registered in the fix queue
fix (Stage 3: Batched Correction)
- Pulls items from the fix queue and executes the actual modifications
- Batch limit: 2 files, 10 minutes
- No bulk changes in a single cycle
Each stage is an independent execution unit. apply does not run until scan completes; fix does not run until apply completes. The three stages do not all have to run within the same cron cycle; a later stage can pick up the previous stage's output in a subsequent cycle.
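One way to picture this gating is a file-based handoff in which each stage leaves a completion marker and a cron tick runs only the next pending stage. This is a minimal sketch under that assumption; the marker file naming and the helpers `mark_done` and `next_stage` are hypothetical, not OpenClaw's actual mechanism.

```python
import json
from pathlib import Path
from typing import Optional

STAGES = ("scan", "apply", "fix")


def done_marker(state_dir: Path, stage: str) -> Path:
    """Path of the file a stage writes once it has finished."""
    return state_dir / f"{stage}.done.json"


def next_stage(state_dir: Path) -> Optional[str]:
    """Return the first stage that has not completed, or None if all are done."""
    for stage in STAGES:
        if not done_marker(state_dir, stage).exists():
            return stage
    return None


def mark_done(state_dir: Path, stage: str, output: list) -> None:
    """Record a stage's output; refuse to run a stage out of order."""
    if next_stage(state_dir) != stage:
        raise RuntimeError(f"{stage} cannot complete before its predecessor")
    done_marker(state_dir, stage).write_text(json.dumps(output))
```

Because the state lives on disk rather than in process memory, scan can finish in one cron cycle and apply can start in the next without sharing an execution context.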
3. Evaluation Criteria: What Each Stage Judges
Each stage applies distinct evaluation criteria.
scan stage criteria:
- Static pattern matching (undefined variables, unused imports, TODO/FIXME comments)
- Complexity thresholds (function length, nesting depth)
- Coverage gaps (public methods with no tests)
Criteria are intentionally loose. This stage's objective is "miss nothing." False positives are handled downstream.
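A deliberately loose scan might look like the sketch below, which covers only two of the criteria above: TODO/FIXME comments and function length. The function name `scan_source`, the candidate dictionary layout, and the 50-line threshold are assumptions chosen for illustration.

```python
import ast
import re

# Loose on purpose: this also matches a "# TODO" that sits inside a string literal.
TODO_PATTERN = re.compile(r"#\s*(TODO|FIXME)\b", re.IGNORECASE)
MAX_FUNCTION_LINES = 50  # assumed threshold, not a value from the article


def scan_source(path: str, source: str,
                max_function_lines: int = MAX_FUNCTION_LINES) -> list[dict]:
    """Return candidate findings for one file. Nothing is modified here."""
    candidates = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if TODO_PATTERN.search(line):
            candidates.append({"file": path, "line": lineno, "kind": "todo_comment"})
    # ast.parse raises SyntaxError on invalid code; the caller decides how to handle it.
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            length = node.end_lineno - node.lineno + 1
            if length > max_function_lines:
                candidates.append({"file": path, "line": node.lineno, "kind": "long_function"})
    return candidates
```

Note that the scanner makes no attempt to decide whether a finding matters. It only records where to look.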
apply stage criteria:
- Context analysis: is the TODO planned or unfinished?
- Impact scope: does the fix cascade to other modules?
- Confidence score: is the scan's evidence sufficient? (items below threshold are excluded from the queue)
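The filtering step can be sketched as a function that takes scan candidates and returns only those that belong in the fix queue. The `Candidate` fields and the 0.8 threshold are hypothetical. Holding back cascading fixes for a human is also an assumption; the criteria above only say that impact scope is examined, not what happens to a cascading item.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.8  # assumed value; the real threshold is not stated


@dataclass
class Candidate:
    file: str
    kind: str
    confidence: float        # strength of the scan's evidence, 0.0 to 1.0
    cascades: bool = False   # True if the fix would touch other modules


def apply_stage(candidates: list[Candidate],
                threshold: float = CONFIDENCE_THRESHOLD) -> list[Candidate]:
    """Return only the candidates that should enter the fix queue."""
    fix_queue = []
    for candidate in candidates:
        if candidate.confidence < threshold:
            continue  # evidence too weak: treated as a likely false positive
        if candidate.cascades:
            continue  # assumption: cascading fixes are left for a human
        fix_queue.append(candidate)
    return fix_queue
```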
fix stage output format:
fix_result:
  file: src/core/agent.py
  change_type: remove_unused_import
  lines_affected: [12, 45]
  confidence: 0.91
  rollback_ref: commit_sha_before
Output includes change type, affected lines, confidence score, and rollback reference. If fix fails or exceeds 10 minutes, the item remains in the queue and carries over to the next cycle.
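For readers who want to produce a record of this shape in code, here is a minimal sketch using a dataclass. The class name `FixResult`, the validation rules, and the helper `to_record` are hypothetical, and `lines_affected` is treated as a plain list of line numbers, which is an assumption about the format.

```python
from dataclasses import dataclass, asdict


@dataclass
class FixResult:
    file: str
    change_type: str
    lines_affected: list[int]
    confidence: float
    rollback_ref: str

    def __post_init__(self) -> None:
        # Assumed sanity checks; the article does not specify any validation.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
        if not self.rollback_ref:
            raise ValueError("rollback_ref is required so the change can be reverted")


def to_record(result: FixResult) -> dict:
    """Wrap the result under the same top-level key as the format above."""
    return {"fix_result": asdict(result)}
```

Rejecting a result with no rollback reference keeps every automated change revertible, which matters when a later error has to be traced back to a specific fix.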
4. Why the Batch Is Capped at 2 Files
"Wouldn't fixing everything at once be faster?" is a natural question.
The answer is root-cause tracing. If 10 files are modified in one batch and something breaks elsewhere afterward, the cause must be isolated among all 10 changes. Limiting to 2 files narrows the blast radius immediately.
The 10-minute cap follows the same logic. A fix taking longer than expected signals that the change is not simple. Complex modifications require human judgment, not automation. When the limit is exceeded, the item stays in the queue for the next cycle.
This is not sacrificing speed. It is paying the debugging cost upfront.
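The two caps can be sketched as a single fix cycle that attempts at most two files and stops once the time budget is spent. The names `run_fix_cycle` and `fix_one` are hypothetical, and this version only checks the clock between files; it does not interrupt a fix that is already running, which a real implementation would likely need to do.

```python
import time
from collections import deque
from typing import Callable

MAX_FILES_PER_CYCLE = 2
TIME_BUDGET_SECONDS = 10 * 60


def run_fix_cycle(queue: deque,
                  fix_one: Callable[[str], bool],
                  clock: Callable[[], float] = time.monotonic) -> list[str]:
    """Attempt up to two queued files; anything unfinished stays in the queue."""
    start = clock()
    fixed = []
    attempts = min(MAX_FILES_PER_CYCLE, len(queue))
    for _ in range(attempts):
        if clock() - start >= TIME_BUDGET_SECONDS:
            break  # budget spent: remaining items wait for the next cycle
        path = queue.popleft()
        if fix_one(path):
            fixed.append(path)
        else:
            queue.append(path)  # failed fix carries over to a later cycle
    return fixed
```

Passing the clock in as a parameter makes the time limit testable without waiting ten minutes.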
5. Prescan False Positive Fix: apply Stage Integration Case
The canonical problem before the three-stage split was a prescan (a preliminary check that runs before the main scan) that wrongly flagged certain patterns, producing false positives.
When prescan found a TODO comment in code, it classified the location as "incomplete code" and added it to the fix list. The problem was that many TODOs were intentionally placed markers — planned extensions, deferred to a later version. All of them were captured as "incomplete."
In the monolithic pipeline, there was no mechanism to filter this out. After splitting into three stages, a context-aware classification rule was added to the apply stage.
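def classify_todo(comment: str, context_lines: list[str]) -> str:
    """
    Returns: 'actionable' | 'planned' | 'ambiguous'
    """
    # context_lines is accepted for surrounding-code checks but is not used yet.
    text = comment.lower()  # match signals case-insensitively
    # Korean signals: "향후" = "in the future", "다음 버전" = "next version"
    planned_signals = ["향후", "다음 버전", "v2", "roadmap", "planned"]
    if any(sig in text for sig in planned_signals):
        return "planned"
    # "fix" also covers FIXME once the comment is lowercased
    if any(sig in text for sig in ["fix", "bug", "broken"]):
        return "actionable"
    return "ambiguous"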
The apply stage excludes TODOs classified as planned from the fix queue. Items marked ambiguous receive a reduced confidence score, enter the queue, and are automatically deferred to the next cycle if the 10-minute fix limit is hit first.
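How the apply stage might act on the three labels can be sketched as a small routing function. The name `route_todo` and the 0.5 penalty for ambiguous items are assumptions; the actual size of the confidence reduction is not stated here.

```python
from typing import Optional

AMBIGUOUS_PENALTY = 0.5  # assumed multiplier for ambiguous TODOs


def route_todo(label: str, base_confidence: float) -> Optional[float]:
    """Return the confidence to enqueue with, or None to keep the item out of the queue."""
    if label == "planned":
        return None  # intentional marker: never enters the fix queue
    if label == "ambiguous":
        return base_confidence * AMBIGUOUS_PENALTY  # queued, but with less weight
    if label == "actionable":
        return base_confidence
    raise ValueError(f"unknown TODO label: {label!r}")
```

Raising on an unknown label keeps a typo in the classifier from silently sending an item into the fix queue.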
The result: prescan false positive rate dropped significantly, and unintended deletion of planned TODOs was eliminated.
6. OpenClaw → Hermes Migration Context
OpenClaw is a stable, actively running agent platform. The three-stage self-review cron split is a design pattern established during OpenClaw's stabilization phase.
Hermes is OpenClaw's next-generation platform. An initial OpenClaw→Hermes (OC→HM) migration was attempted but ran into a token-explosion problem and was rolled back to OpenClaw. A second Hermes attempt is currently underway and in the verification stage. Related changes are tracked in PR#12497 against the harness git repository.
The same self-review pattern is applicable to the Hermes re-attempt (currently under verification). The platform changes; the principle of separating generation from verification does not.
Lessons Learned
- Designed for "finish it in one pass" → false positives multiplied and manual recovery work increased. The purpose of automation is to reduce human intervention — but over-triggering made intervention more frequent
- Attempted to tighten scan criteria before introducing the apply stage → false positives decreased, but real issues started to be missed (false negatives increased). Returned to the principle that scan's job is "miss nothing"
- Ran fix without a batch limit, once modifying 8 files in a single cycle → an apparently unrelated error appeared afterward. Tracing the root cause across 8 changes took longer than the fix itself
Closing
"Separate generation from verification" is a principle that applies across software engineering. It is why code review is separate from coding, why build and test are distinct steps in CI/CD, and why writing a PR and merging it are typically handled by different people.
The three-stage self-review cron follows the same logic. Doing everything in one pass looks faster, but when false-positive recovery costs are included, it is slower. Splitting into scan→apply→fix allows each stage to focus on its single responsibility, and when something breaks, the failure's origin is immediately traceable.
When designing an automation pipeline, splitting into discrete stages is often the better choice over "finish it in one shot."
Series overview: Series index