"Designing LLM-Script Collaboration — Let AI Judge, Let Code Execute"
A role-separation architecture that cut token usage by 96%
핵심 요약
- The core principle: LLMs handle semantic judgment, scripts handle rule-based execution
- Token consumption in the Reflect pipeline dropped from 277k to 5k — a 96% reduction
- Twelve automation scripts are categorized into four groups: data transformation, validation, maintenance, and real-time invocation
Background
Running an AI agent system taught me the limits of LLMs through repeated failures. They burned massive token counts on simple data transformations, produced different outputs for identical inputs, and fell short on arithmetic reliability. The "just let the LLM do everything" approach failed on both cost and accuracy.
The Design
Philosophy: Separate Judgment from Execution
The principle I arrived at is straightforward. Tasks requiring semantic judgment go to the LLM. Tasks following strict rules go to Python or Bash scripts. This single separation delivered three wins:
- Cost reduction: Reflect pipeline tokens dropped from 277k to 5k (96%)
- Accuracy: Rule-based tasks hit 100% correctness
- Determinism: Same input, same output, every time
Twelve Scripts in Four Categories
Scripts are grouped by function:
- Data Transformation:
retain-merge.py,conflict-apply.py, etc. — structured data processing - Validation & Monitoring:
confidence-decay.py,bank-lint.py, etc. — rule-based quality checks - Maintenance & Expansion:
session-cleanup.py,topics-expand.py— automated housekeeping - Real-time Invocation:
recall-match.py,recall-cleanup.py— memory system integration
Hybrid Search Strategy
The memory system runs two retrieval methods in parallel: keyword-based recall matching for precision and embedding-based semantic search for coverage. Each compensates for the other's blind spots.
Lessons Learned
Early on, I let the LLM handle data transformations too. The results were non-deterministic and expensive. Drawing the boundary between "what LLMs are good at" and "what code is good at" took trial and error. The deciding question is always the same: does this need judgment, or does it need rule application?
Takeaway
LLMs are not a universal solution. Separating judgment from execution and deploying the right tool for each is how you control both cost and reliability in an AI agent system. The 96% token reduction was not a clever trick — it was an architectural decision.
댓글
댓글 쓰기