"Designing LLM-Script Collaboration — Let AI Judge, Let Code Execute"

April 4, 2026

A role-separation architecture that cut token usage by 96%

Key Summary

The core principle: LLMs handle semantic judgment, scripts handle rule-based execution
Token consumption in the Reflect pipeline dropped from 277k to 5k — a 96% reduction
Twelve automation scripts are categorized into four groups: data transformation, validation, maintenance, and real-time invocation

Background

Running an AI agent system taught me the limits of LLMs through repeated failures. They burned large numbers of tokens on simple data transformations, produced different outputs for identical inputs, and were unreliable at arithmetic. The "just let the LLM do everything" approach failed on both cost and accuracy.

The Design

Philosophy: Separate Judgment from Execution

The principle I arrived at is straightforward. Tasks requiring semantic judgment go to the LLM. Tasks following strict rules go to Python or Bash scripts. This single separation delivered three wins:

Cost reduction: Reflect pipeline tokens dropped from 277k to 5k (96%)
Accuracy: Rule-based tasks hit 100% correctness
Determinism: Same input, same output, every time
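
The separation can be sketched as follows. This is a minimal illustration, not the actual pipeline code: the `Decision` schema, the action names, and both function names are assumptions made for the example. The idea is that the LLM returns only a small structured verdict, and plain Python validates and applies it.

```python
import json
from dataclasses import dataclass

# Hypothetical set of verdicts the LLM is allowed to return.
ALLOWED_ACTIONS = {"keep", "merge", "drop"}


@dataclass(frozen=True)
class Decision:
    """Structured verdict requested from the LLM (assumed schema)."""
    record_id: str
    action: str


def parse_decision(llm_output: str) -> Decision:
    """Validate the LLM's JSON reply before any code acts on it."""
    data = json.loads(llm_output)
    action = data["action"]
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unexpected action: {action!r}")
    return Decision(record_id=str(data["record_id"]), action=action)


def apply_decision(records: dict, decision: Decision) -> dict:
    """Rule-based execution: same records and decision give the same result."""
    if decision.record_id not in records:
        raise KeyError(decision.record_id)
    # Work on a copy so the caller's data is never mutated.
    result = {key: dict(value) for key, value in records.items()}
    if decision.action == "drop":
        del result[decision.record_id]
    elif decision.action == "merge":
        result[decision.record_id]["merged"] = True
    return result
```

Only the short JSON verdict passes through the model; the records themselves never need to be sent back and forth for the transformation, which is where the token savings in this kind of design come from.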

Twelve Scripts in Four Categories

Scripts are grouped by function:

Data Transformation: retain-merge.py, conflict-apply.py, etc. — structured data processing
Validation & Monitoring: confidence-decay.py, bank-lint.py, etc. — rule-based quality checks
Maintenance & Expansion: session-cleanup.py, topics-expand.py — automated housekeeping
Real-time Invocation: recall-match.py, recall-cleanup.py — memory system integration
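
To show why a check like this belongs in a script rather than a prompt, here is a hedged sketch of what a confidence-decay rule could look like. The post does not describe how confidence-decay.py works, so the exponential half-life formula, the 30-day constant, and the function name below are all assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumed half-life; the real script's parameters are not given in the post.
HALF_LIFE_DAYS = 30.0


def decayed_confidence(
    confidence: float,
    last_confirmed: datetime,
    now: datetime,
    half_life_days: float = HALF_LIFE_DAYS,
) -> float:
    """Halve the confidence for every half-life that has elapsed."""
    age_days = (now - last_confirmed).total_seconds() / 86400.0
    age_days = max(age_days, 0.0)  # timestamps in the future do not boost confidence
    return confidence * 0.5 ** (age_days / half_life_days)


if __name__ == "__main__":
    now = datetime(2026, 4, 4, tzinfo=timezone.utc)
    print(decayed_confidence(0.8, now - timedelta(days=30), now))
```

A rule like this is pure arithmetic on timestamps, so running it in code costs no tokens and gives the same answer on every run.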

Hybrid Search Strategy

The memory system runs two retrieval methods in parallel: keyword-based recall matching for precision and embedding-based semantic search for coverage. Each compensates for the other's blind spots.
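
A toy sketch of the hybrid idea follows. It is not the code behind recall-match.py: the function names, the 0.8 similarity threshold, and the hand-written two-dimensional vectors standing in for real embeddings are assumptions. For simplicity the two retrievers run one after the other here rather than in parallel.

```python
import math


def keyword_hits(query: str, docs: dict) -> list:
    """Precision path: documents sharing at least one exact term with the query."""
    terms = set(query.lower().split())
    return [doc_id for doc_id, text in docs.items()
            if terms & set(text.lower().split())]


def cosine(a: list, b: list) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_hits(query_vec: list, doc_vecs: dict, threshold: float) -> list:
    """Coverage path: documents whose embedding is close enough to the query's."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True)
            if score >= threshold]


def hybrid_search(query, query_vec, docs, doc_vecs, threshold=0.8):
    """Keyword matches first, then semantic matches not already found."""
    results = keyword_hits(query, docs)
    seen = set(results)
    for doc_id in semantic_hits(query_vec, doc_vecs, threshold):
        if doc_id not in seen:
            results.append(doc_id)
            seen.add(doc_id)
    return results
```

In this sketch a document that shares no words with the query can still be returned through the semantic path, while an exact term match is ranked first even if its embedding score is unremarkable.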

Lessons Learned

Early on, I let the LLM handle data transformations too. The results were non-deterministic and expensive. Drawing the boundary between "what LLMs are good at" and "what code is good at" took trial and error. The deciding question is always the same: does this need judgment, or does it need rule application?

Takeaway

LLMs are not a universal solution. Separating judgment from execution and deploying the right tool for each is how you control both cost and reliability in an AI agent system. The 96% token reduction was not a clever trick — it was an architectural decision.

MaJu Tech Notes