Agent Self-Improvement Harness (6/12) — Evolution of the Reflect Pipeline


Optimizing AI agent long-term memory management from LLM-only to regex-based scripts


Key Summary

  • The Reflect pipeline evolved through three stages: v0.3 (LLM-only) to v0.5 (inverted architecture) to the current script-augmented design
  • Once data solidifies into structured formats, replacing LLM calls with regex scripts is the rational move
  • The script transition improved testability, predictability, and debugging efficiency across the board

Background

The Reflect pipeline is the long-term memory management system for an AI agent. Every night, it processes the day's temporary memory and merges relevant entries into permanent storage. This post covers the optimization lessons learned from rearchitecting this pipeline three times.


The Evolution

v0.3 (Initial): LLM Handles Everything

Each of the four runners made its own LLM call, so a failure in any one runner could halt the entire pipeline. The design was fragile.

v0.5 (Inverted Architecture): Separate Judgment from Execution

The Manager now processes only a summary report (2k-5k tokens), while downstream Runners handle first-pass processing of raw files. This reduced the LLM's workload while preserving its judgment capabilities.
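The split between first-pass processing and judgment can be sketched roughly as follows. This is a minimal illustration, not the actual pipeline: the names `RunnerReport`, `run_first_pass`, and `build_summary` are hypothetical, and the "first pass" here is a trivial stand-in (drop blank lines, keep a few highlights) for whatever each Runner really does.

```python
from dataclasses import dataclass


@dataclass
class RunnerReport:
    """Compact result of one Runner's first pass (hypothetical structure)."""
    runner: str
    items_seen: int
    highlights: list[str]


def run_first_pass(runner: str, raw_lines: list[str], max_highlights: int = 3) -> RunnerReport:
    # Stand-in for real first-pass work: drop blank lines, keep a few highlights.
    non_empty = [ln.strip() for ln in raw_lines if ln.strip()]
    return RunnerReport(runner, len(non_empty), non_empty[:max_highlights])


def build_summary(reports: list[RunnerReport]) -> str:
    # The Manager would receive only this short text, never the raw files.
    parts = []
    for report in reports:
        parts.append(f"[{report.runner}] {report.items_seen} items")
        parts.extend(f"  - {h}" for h in report.highlights)
    return "\n".join(parts)
```

The point of the shape is that the size of the Manager's input depends on the number of reports and highlights, not on the size of the raw files.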

Current: Script-Augmented Stage

The turning point was the "extract Retain tags from conversation logs" task. The actual data had fully stabilized into structured formats (- W:, - O(c=0.90):, - S[entity]:), so I replaced the LLM runner with a Python regex script.
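A regex extractor for those line formats might look like the sketch below. It is not the script from the pipeline: the names `RetainTag`, `TAG_RE`, and `extract_retain_tags` are hypothetical, the tag letters are treated as opaque kinds because their meaning is not spelled out here, and the pattern assumes that the `(c=...)` confidence and `[entity]` parts are optional on any tag.

```python
import re
from typing import NamedTuple, Optional


class RetainTag(NamedTuple):
    kind: str                    # "W", "O", or "S" (treated as opaque labels)
    confidence: Optional[float]  # from "(c=0.90)", if present
    entity: Optional[str]        # from "[entity]", if present
    text: str


# Assumed line shapes: "- W: ...", "- O(c=0.90): ...", "- S[entity]: ..."
TAG_RE = re.compile(
    r"^\s*-\s*(?P<kind>[WOS])"
    r"(?:\(c=(?P<conf>\d(?:\.\d+)?)\))?"
    r"(?:\[(?P<entity>[^\]]+)\])?"
    r":\s*(?P<text>.+?)\s*$"
)


def extract_retain_tags(log: str) -> list[RetainTag]:
    """Return every Retain tag line in the log; other lines are ignored."""
    tags = []
    for line in log.splitlines():
        m = TAG_RE.match(line)
        if m is None:
            continue
        conf = float(m["conf"]) if m["conf"] else None
        tags.append(RetainTag(m["kind"], conf, m["entity"], m["text"]))
    return tags
```

Because the function is pure and deterministic, the same log always yields the same tags, which is exactly the property the LLM runner could not guarantee.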

The division of labor is now clear: structured data is extracted by regex, while the LLM focuses exclusively on semantic judgment and natural language generation.

Pitfalls and Caveats

LLMs are flexible but non-deterministic — they don't guarantee the same output for the same input. Continuing to use an LLM after the data format has been finalized means carrying unnecessary cost and instability. Conversely, switching to scripts while the data format is still in flux means rewriting code with every format change.

The timing heuristic: Switch to scripts once the data format has gone at least two weeks without a change.
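As a small illustration, the two-week rule reduces to a date comparison. The function name `ready_for_script` and the idea of tracking a "last format change" date are assumptions made for this sketch; how a format change is actually detected is outside its scope.

```python
from datetime import date, timedelta

# The two-week threshold from the heuristic above.
STABILITY_WINDOW = timedelta(weeks=2)


def ready_for_script(last_format_change: date, today: date,
                     window: timedelta = STABILITY_WINDOW) -> bool:
    """True once the format has gone at least `window` without a change."""
    return today - last_format_change >= window
```

For example, a format last changed on January 1 would qualify on January 15 but not on January 14.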

Takeaway

Once the data format has stabilized, move anything that can be structured into scripts. Script-based processing is testable, predictable, and easy to debug because failures can be isolated to a single stage. Reserve the LLM for the work that genuinely needs one: semantic judgment and natural language generation. That is the key to long-term operational sustainability.
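Failure isolation for script stages can be sketched like this. It is an assumption about how such stages could be wired together, not a description of the actual pipeline, and `run_stages` is a hypothetical helper: each stage runs independently, and an exception in one is recorded instead of halting the rest.

```python
from typing import Callable


def run_stages(stages: dict[str, Callable[[str], object]],
               data: str) -> tuple[dict[str, object], dict[str, str]]:
    """Run every stage on the same input; collect results and failures separately."""
    results: dict[str, object] = {}
    failures: dict[str, str] = {}
    for name, stage in stages.items():
        try:
            results[name] = stage(data)
        except Exception as exc:  # record the failure and keep going
            failures[name] = f"{type(exc).__name__}: {exc}"
    return results, failures
```

The `failures` mapping names the stage that broke and the error it raised, which gives a debugging session a concrete place to start.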
