How I Built a Self-Improving Agent Architecture — A Loop Design Breakdown

April 4, 2026

Automatic skill generation, memory compression, and session archiving — how agents learn from experience

Summary

- Implementing a "learn-from-experience" structure in AI agents yields progressively higher quality across sessions
- A self-improvement loop that automatically generates skill candidates after complex tasks and patches problems immediately upon detection
- Structuring and compressing memory to eliminate context waste
- A session archive that extracts core knowledge even from deleted sessions

Background

Running AI agents surfaces a recurring set of structural problems: similar trial-and-error occurs each time a familiar task type appears, gains from one session do not transfer to the next, and performance degrades as context length grows.

Humans learn from experience and perform better over time. Can an agent operate on the same principle?

The OpenClaw project addressed this problem directly. OpenClaw is a multi-agent platform with hardened security and a stable self-improvement architecture. The design enables agents to automatically extract patterns from task experience, convert them into skills, and reorganize memory.

Body

1. Skill Self-Improvement Loop — The Agent Generates Its Own Skills

Conventionally, agent skills are authored by humans: write the skill file, specify the procedure, configure the trigger.

However, when an agent executes complex tasks, patterns emerge naturally. Combining multiple tools, or finding a resolution path through error recovery — that process itself is a skill candidate.

How the self-improvement loop operates:

Execute complex task (5+ tools used, error recovery included)
  ↓
Automatic analysis after task completion
  ↓
If repeatable pattern detected → register in skill-candidates.json
  ↓
Apply candidate skill in next task
  ↓
Patch immediately if problems are found

The critical constraint: a pattern must repeat 3 or more times before promotion to a skill. A single occurrence may be coincidence. Three repetitions signal structure.

Skill candidates are not pushed directly to production. They are applied experimentally in the next task. If a problem surfaces, it is patched immediately. The loop is: "This should work" → "This part breaks in practice" → "Fixed." Iterate.
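
The counting and promotion step can be sketched in a few lines of Python. This is a minimal illustration, not the OpenClaw implementation: the function names, the entry fields (`description`, `count`, `status`), and the example pattern ID are all assumptions. Only the file name `skill-candidates.json` and the threshold of 3 come from the design described above.

```python
import json
from pathlib import Path

PROMOTION_THRESHOLD = 3  # from the text: three repetitions before promotion


def record_pattern(store: dict, pattern_id: str, description: str) -> str:
    """Count one more sighting of a pattern and return its current status."""
    entry = store.setdefault(
        pattern_id,
        {"description": description, "count": 0, "status": "candidate"},
    )
    entry["count"] += 1
    if entry["count"] >= PROMOTION_THRESHOLD:
        entry["status"] = "promoted"
    return entry["status"]


def load_store(path: Path) -> dict:
    """Read the candidate file (e.g. skill-candidates.json), or start empty."""
    return json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}


def save_store(path: Path, store: dict) -> None:
    """Write the candidate file back to disk."""
    path.write_text(json.dumps(store, indent=2, ensure_ascii=False), encoding="utf-8")


# Hypothetical pattern, seen after three separate tasks.
store = {}
statuses = [
    record_pattern(store, "retry-then-rebase", "Recover from a failed push")
    for _ in range(3)
]
print(statuses)  # ['candidate', 'candidate', 'promoted']
```

Keeping the count in the candidate file means the threshold can be changed without touching detection logic.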

2. Memory Compression — Core Context in 500 Tokens

An agent's context window is finite. As sessions lengthen, early instructions receive less weight and performance degrades.

Memory flush was restructured to solve this.

Previous approach: Agents stored memory in free-form text. No length limit. Tool outputs were often included in full. Memory grew bloated and wasted context.

Structured approach:

## Memory Flush
- Goal: [objective of this task]
- Progress: [completed / remaining]
- Decisions: [decisions made and their rationale]
- Changed Files: [list of modified files]
- Blockers: [open blockers]
- Next Steps: [next actions]

Three rules govern this:

1. Summarize tool output over 50 lines and discard the raw output. Re-run if the original is needed.
2. Target 500-token compression. A hard constraint that forces retention of only what matters.
3. Fix the schema to 6 fields. Free-form structures expand without bound.

The result: minimal tokens spent reading memory at session handoff, with core context preserved.
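
As a rough sketch, the six-field schema and the two numeric limits could be expressed as follows. The class and function names are hypothetical, and `estimate_tokens` is a crude stand-in (about four characters per token) rather than a real tokenizer, so a real system would swap in the tokenizer of the model it uses.

```python
from dataclasses import dataclass, field
from typing import List

MAX_TOOL_OUTPUT_LINES = 50  # rule 1 in the text
TOKEN_BUDGET = 500          # rule 2 in the text


def estimate_tokens(text: str) -> int:
    """Assumption: roughly 4 characters per token. Not a real tokenizer."""
    return (len(text) + 3) // 4


def compact_tool_output(output: str, summary: str) -> str:
    """Keep short output as is; replace long output with a caller-supplied summary."""
    if len(output.splitlines()) > MAX_TOOL_OUTPUT_LINES:
        return summary
    return output


@dataclass
class MemoryFlush:
    """The fixed six-field schema (rule 3). No free-form fields."""
    goal: str
    progress: str
    decisions: List[str] = field(default_factory=list)
    changed_files: List[str] = field(default_factory=list)
    blockers: List[str] = field(default_factory=list)
    next_steps: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Render in the same layout as the template above."""
        def join(items: List[str]) -> str:
            return "; ".join(items) if items else "none"

        return "\n".join([
            "## Memory Flush",
            f"- Goal: {self.goal}",
            f"- Progress: {self.progress}",
            f"- Decisions: {join(self.decisions)}",
            f"- Changed Files: {join(self.changed_files)}",
            f"- Blockers: {join(self.blockers)}",
            f"- Next Steps: {join(self.next_steps)}",
        ])

    def within_budget(self) -> bool:
        """True if the rendered flush fits the 500-token target."""
        return estimate_tokens(self.render()) <= TOKEN_BUDGET
```

A caller would check `within_budget()` before handoff and shorten the longest field when it fails.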

3. Session Archive — Learning from Deleted Sessions

When a session is deleted or reset in Claude Code, the conversation history disappears. If the insights gained in that session disappear with it, that is a net loss.

Session archiving solves this structurally.

How it operates:

Session end / deletion detected
  ↓
Run session-archive.py
  ↓
Auto-extract key content from .deleted / .reset sessions
  ↓
Store in staging/session-archive.json

What is extracted:

- Problems solved and how they were solved
- Decisions made and their rationale
- Patterns discovered or cautions identified

What is not extracted:

- Full source code (already in git)
- Raw tool output (re-runnable)
- Transient debugging traces

Sessions are ephemeral. What was learned persists.
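
The filtering step might look something like the sketch below. It is not the actual `session-archive.py`, whose contents are not shown here. It assumes, purely for illustration, that each session file is JSON Lines and that every record already carries a `kind` tag and a `text` field. Deciding which parts of a conversation count as a solution or a decision is the hard part and is assumed to happen upstream.

```python
import json
from pathlib import Path

# Hypothetical tags for the three things the text says are extracted.
KEEP_KINDS = {"solution", "decision", "pattern"}
ARCHIVE_MARKERS = (".deleted", ".reset")


def is_archivable(path: Path) -> bool:
    """True for session files whose name carries a .deleted or .reset marker."""
    return any(marker in path.name for marker in ARCHIVE_MARKERS)


def extract_entries(records: list) -> list:
    """Keep only records tagged with a kind worth archiving; drop everything else."""
    return [
        {"kind": r["kind"], "text": r["text"]}
        for r in records
        if r.get("kind") in KEEP_KINDS and r.get("text")
    ]


def archive_sessions(session_dir: Path, archive_path: Path) -> int:
    """Append extracted entries to the archive file; return how many were added."""
    archive = []
    if archive_path.exists():
        archive = json.loads(archive_path.read_text(encoding="utf-8"))
    added = 0
    for path in sorted(session_dir.iterdir()):
        if not path.is_file() or not is_archivable(path):
            continue
        lines = path.read_text(encoding="utf-8").splitlines()
        records = [json.loads(line) for line in lines if line.strip()]
        for entry in extract_entries(records):
            archive.append({"session": path.name, **entry})
            added += 1
    archive_path.parent.mkdir(parents=True, exist_ok=True)
    archive_path.write_text(
        json.dumps(archive, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return added
```

Source code, raw tool output, and debugging traces never reach the archive because their records carry no kept tag.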

4. From Observation to Verification — Dialectical Modeling

When an agent learns user behavior patterns, how does it validate what it has observed?

Drawing conclusions from a single observation is fragile. The user tag (U-tag) system applies a dialectical model.

3-stage verification:

Stage	Condition	Meaning
observation	1 occurrence	"This tendency is visible"
hypothesis	2+ repetitions	"This pattern is likely"
verified	sustained over time, no counterevidence	"Confirmed pattern"

A contradictions field tracks counterevidence. If a pattern is in verified state — say, "this user prefers data-driven decisions" — and an instance of intuition-based decision-making is observed, it is recorded as a contradiction and the status is reviewed.

Behavioral pattern assessments require the same validation rigor as data. This is a built-in defense against confirmation bias.
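
The three stages and the contradictions field map naturally onto a small state machine. The sketch below is one possible reading, with hypothetical names throughout. It makes two assumptions: "2+ repetitions" is read as two sightings in total, and "the status is reviewed" is read as a demotion from verified back to hypothesis. The "sustained over time" condition is left to the caller, who decides when to call `verify()`.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class UserTag:
    """One observed tendency about a user, with its verification state."""
    claim: str
    supporting: int = 0
    contradictions: List[str] = field(default_factory=list)
    status: str = "observation"

    def observe(self) -> None:
        """Record one more supporting sighting; two sightings make a hypothesis."""
        self.supporting += 1
        if self.status == "observation" and self.supporting >= 2:
            self.status = "hypothesis"

    def contradict(self, note: str) -> None:
        """Record counterevidence. Assumption: a verified tag is demoted for review."""
        self.contradictions.append(note)
        if self.status == "verified":
            self.status = "hypothesis"

    def verify(self) -> bool:
        """Promote a hypothesis only while no counterevidence is on record."""
        if self.status == "hypothesis" and not self.contradictions:
            self.status = "verified"
        return self.status == "verified"


tag = UserTag("prefers data-driven decisions")
tag.observe()
tag.observe()
tag.verify()
tag.contradict("picked a vendor on intuition")
print(tag.status)  # hypothesis
```

This sketch has no way to resolve a contradiction once recorded, so a demoted tag stays at hypothesis until a reviewer clears the list.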

5. Automated Code Review — 3-Phase Separation

The process for agents reviewing agent-generated code was also redesigned.

Previous: Scan, analysis, and fix executed in a single pass. High false-positive rate. Fix scope unclear.

Redesigned: 3 distinct phases.

scan
  ↓ Generate candidate issue list
apply (analysis + queue)
  ↓ Filter to real issues only, enqueue for fix
fix (batch)
  → 2 files at a time, 10-minute cap

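This enforces separation of generation and verification: scan generates candidate issues, and apply verifies them before anything is changed. Running both in a single pass degrades the quality of each. Separated, each phase has a single job, and its output can be inspected before the next phase runs.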

The batch limit (2 files at a time, with a 10-minute cap) exists because fixing too many files at once makes root-cause tracing difficult when something breaks.
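
The three phases can be sketched as three small functions. All names and signatures here are illustrative assumptions, and the detector, the false-positive filter, and the fixer are passed in as callables because the text does not describe them. The 10-minute cap is interpreted as a per-batch time limit, which is one plausible reading.

```python
import time
from typing import Callable, List

BATCH_SIZE = 2              # from the text: 2 files at a time
BATCH_TIME_CAP_S = 10 * 60  # from the text: 10-minute cap (read here as per batch)


def scan(files: List[str], detect: Callable[[str], List[str]]) -> List[dict]:
    """Phase 1: list every candidate issue without judging it."""
    return [{"file": f, "issue": i} for f in files for i in detect(f)]


def apply(candidates: List[dict], is_real: Callable[[dict], bool]) -> List[str]:
    """Phase 2: drop false positives and queue each affected file once, in order."""
    queue: List[str] = []
    for candidate in candidates:
        if is_real(candidate) and candidate["file"] not in queue:
            queue.append(candidate["file"])
    return queue


def fix(
    queue: List[str],
    fix_file: Callable[[str], None],
    clock: Callable[[], float] = time.monotonic,
) -> List[List[str]]:
    """Phase 3: fix files in batches of two; stop a batch once its time cap is spent."""
    batches = []
    for start in range(0, len(queue), BATCH_SIZE):
        deadline = clock() + BATCH_TIME_CAP_S
        done = []
        for path in queue[start:start + BATCH_SIZE]:
            if clock() >= deadline:
                break  # the rest of this batch is left for a later run
            fix_file(path)
            done.append(path)
        batches.append(done)
    return batches
```

Returning the list of batches gives a record of which files changed together, which is what makes root-cause tracing tractable.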

Findings and Caveats

- Skill auto-generation pitfall: The initial design promoted candidates after 2 repetitions. Coincidental patterns were being promoted to skills. Raising the threshold to 3 repetitions stabilized quality.
- Over-aggressive memory compression loses context. At 300 tokens, "why this decision was made" was frequently dropped. 500 tokens is the current balance point.
- Self-improvement can become an infinite loop ("improving the skill that improves the skill that improves…"), so meta-loops must be prevented. Self-improvement is restricted to trigger only after real task execution.
- Lesson from the Hermes migration: During the early OpenClaw-to-Hermes transition, a token runaway event occurred. The self-improvement loop had been deployed to production without sufficient validation. The project reverted to OpenClaw, and Hermes is currently being re-verified. The principle extracted from this: self-improvement structures must be fully validated in staging before production deployment.
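
The meta-loop restriction can be illustrated with a small trigger guard. This is a hedged sketch: the `TaskRun` type, its fields, and the `origin` labels are hypothetical, and it assumes the trigger condition from section 1 (5+ tools used, error recovery included) means both conditions must hold.

```python
from dataclasses import dataclass

MIN_TOOLS = 5  # from the text: "5+ tools used"


@dataclass(frozen=True)
class TaskRun:
    """Minimal facts about a finished run (hypothetical fields)."""
    tools_used: int
    recovered_from_error: bool
    origin: str  # hypothetical labels: "user_task" or "self_improvement"


def should_analyze(run: TaskRun) -> bool:
    """Trigger skill analysis only after a real, complex task."""
    if run.origin != "user_task":
        return False  # a run started by the improver never triggers the improver
    # Assumption: both conditions are required, not either one.
    return run.tools_used >= MIN_TOOLS and run.recovered_from_error


print(should_analyze(TaskRun(6, True, "user_task")))         # True
print(should_analyze(TaskRun(6, True, "self_improvement")))  # False
```

Tagging each run with its origin is what breaks the cycle: a skill patch can be arbitrarily complex without ever qualifying as input to the next round of analysis.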

Conclusion

Implementing a self-improvement architecture in an agent produces a system that compounds in quality with use. It stops repeating the same mistakes, extracts patterns from experience, and manages memory efficiently.

Three core principles:

- Automate what repeats (3 occurrences → skill promotion)
- Separate generation from verification (scan / analyze / fix, 3 phases)
- Retain only the core, discard the rest (500-token compression, session archive)

Full autonomous evolution is a long way off. But the first step toward an agent that learns from experience is simpler than it seems: detect repeated patterns, record them, apply them next time. Not so different from how humans learn.

MaJu Tech Notes