Designing LLM-Script Collaboration: Let AI Judge, Let Code Execute

A role-separation architecture that cut token usage by 96%


Key Takeaways

  • The core principle: LLMs handle semantic judgment, scripts handle rule-based execution
  • Token consumption in the Reflect pipeline dropped from 277k to 5k — a 96% reduction
  • Twelve automation scripts are categorized into four groups: data transformation, validation, maintenance, and real-time invocation

Background

Running an AI agent system taught me the limits of LLMs through repeated failures. They burned large numbers of tokens on simple data transformations, produced different outputs for identical inputs, and were unreliable at arithmetic. The "just let the LLM do everything" approach failed on both cost and accuracy.

The Design

Philosophy: Separate Judgment from Execution

The principle I arrived at is straightforward. Tasks requiring semantic judgment go to the LLM. Tasks following strict rules go to Python or Bash scripts. This single separation delivered three wins:

  • Cost reduction: Reflect pipeline tokens dropped from 277k to 5k (96%)
  • Accuracy: Rule-based tasks hit 100% correctness
  • Determinism: Same input, same output, every time
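
To make the split concrete, here is a minimal sketch of a dispatcher, not the actual pipeline code. The names Task, RULE_HANDLERS, merge_lines, dispatch, and fake_llm are all hypothetical, and the "merge" rule (deduplicate and sort lines) is an assumed stand-in for whatever a real transformation script does.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Task:
    kind: str      # hypothetical task label, e.g. "merge" or "summarize"
    payload: str


def merge_lines(payload: str) -> str:
    """Rule-based step (assumed example): deduplicate and sort lines."""
    return "\n".join(sorted(set(payload.splitlines())))


# Tasks that follow strict rules map to plain functions (hypothetical registry).
RULE_HANDLERS: Dict[str, Callable[[str], str]] = {"merge": merge_lines}


def dispatch(task: Task, call_llm: Callable[[str], str]) -> str:
    """Send rule-based tasks to code and everything else to the LLM."""
    handler = RULE_HANDLERS.get(task.kind)
    if handler is not None:
        return handler(task.payload)   # deterministic, no tokens spent
    return call_llm(task.payload)      # semantic judgment goes to the model


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call so the sketch runs offline.
    return f"LLM judged: {prompt}"
```

Calling dispatch(Task("merge", "b\na\nb"), fake_llm) never touches the model, while any task kind missing from the registry falls through to the LLM callable.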

Twelve Scripts in Four Categories

Scripts are grouped by function:

  1. Data Transformation: retain-merge.py, conflict-apply.py, etc. — structured data processing
  2. Validation & Monitoring: confidence-decay.py, bank-lint.py, etc. — rule-based quality checks
  3. Maintenance & Expansion: session-cleanup.py, topics-expand.py — automated housekeeping
  4. Real-time Invocation: recall-match.py, recall-cleanup.py — memory system integration
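
One way to keep this grouping explicit is a small lookup table. The sketch below is illustrative only: it contains just the eight script names mentioned above (the rest are not listed here), and the category keys and the category_of helper are names I made up for the example.

```python
from typing import Dict, List, Optional

# Only the scripts named in the list above; the remaining ones are omitted.
# Category keys are hypothetical labels, not names used by the real system.
SCRIPT_CATEGORIES: Dict[str, List[str]] = {
    "data_transformation": ["retain-merge.py", "conflict-apply.py"],
    "validation_monitoring": ["confidence-decay.py", "bank-lint.py"],
    "maintenance_expansion": ["session-cleanup.py", "topics-expand.py"],
    "realtime_invocation": ["recall-match.py", "recall-cleanup.py"],
}


def category_of(script: str) -> Optional[str]:
    """Return the category a script belongs to, or None if it is not registered."""
    for category, scripts in SCRIPT_CATEGORIES.items():
        if script in scripts:
            return category
    return None
```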

Hybrid Search Strategy

The memory system runs two retrieval methods in parallel: keyword-based recall matching for precision and embedding-based semantic search for coverage. Each compensates for the other's blind spots.
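
The idea can be sketched in a few lines. This is a toy under stated assumptions, not the memory system's real code: a bag-of-words cosine similarity stands in for actual embeddings, the two paths run one after the other rather than in parallel, and the function names and the 0.3 threshold are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    return text.lower().split()


def keyword_hits(query: str, docs: Dict[str, str]) -> List[str]:
    """Precision path: documents that contain every query token."""
    terms = set(tokenize(query))
    return [doc_id for doc_id, text in docs.items() if terms <= set(tokenize(text))]


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_hits(query: str, docs: Dict[str, str], threshold: float = 0.3) -> List[str]:
    """Coverage path: toy bag-of-words similarity standing in for embeddings."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(text))), doc_id) for doc_id, text in docs.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True) if score >= threshold]


def hybrid_search(query: str, docs: Dict[str, str]) -> List[str]:
    """Keyword hits first, then semantic hits that are not already present."""
    results = keyword_hits(query, docs)
    for doc_id in semantic_hits(query, docs):
        if doc_id not in results:
            results.append(doc_id)
    return results
```

With a query that contains one word no document has, the keyword path returns nothing and the similarity path still surfaces the closest documents, which is the blind-spot coverage described above.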

Lessons Learned

Early on, I let the LLM handle data transformations too. The results were non-deterministic and expensive. Drawing the boundary between "what LLMs are good at" and "what code is good at" took trial and error. The deciding question is always the same: does this need judgment, or does it need rule application?

Takeaway

LLMs are not a universal solution. Separating judgment from execution and deploying the right tool for each is how you control both cost and reliability in an AI agent system. The 96% token reduction was not a clever trick — it was an architectural decision.
