Agent Memory Engine (8/10) — Applying Karpathy LLM Wiki in Production
Implementation record: 17 of Andrej Karpathy's 18 proposed features implemented, plus 6 custom additions
Summary
- Applied Karpathy's LLM Wiki concept in full to the OpenClaw agent platform
- 17 of the original 18 features are implemented; 6 additional features address gaps found in production
- Instead of searching external documents on every query (RAG), the LLM incrementally builds and maintains a permanent, structured wiki
What Is an LLM Wiki
The core idea Andrej Karpathy published is straightforward. Rather than retrieving external documents on every query (RAG), the LLM itself incrementally builds and maintains a structured markdown wiki.
The wiki operates through three actions:
- Ingest: Accepts raw sources and updates wiki pages. Triggers cascade updates to related pages.
- Query: Answers questions using the wiki. Knowledge gaps discovered during answering are promoted into the wiki.
- Lint: Audits the entire wiki. Detects and resolves contradictions, gaps, and stale entries.
When these three actions form a loop, the knowledge base self-refines without human intervention.
Body
1. Karpathy's Original 18 Features vs. OpenClaw Implementation
| # | Karpathy Feature | OpenClaw Implementation | Status |
|---|---|---|---|
| 1 | Raw Sources (immutable originals) | memory/ + docs/manuals/ | ✅ |
| 2 | Wiki (LLM-maintained markdown) | bank/ multi-file Context Tree | ✅ |
| 3 | Schema (co-evolving config) | tree-index.json + AGENTS.md | ✅ |
| 4 | Ingest cascade updates | retain-merge cascade_update | ✅ |
| 5 | Interactive Ingest | source-ingest.py --interactive | ✅ |
| 6 | Automated external source Ingest | source-ingest.py --url/--file | ✅ |
| 7 | Query (wiki-backed answers) | recall-tree.py BM25 hybrid | ✅ |
| 8 | Query → Wiki promotion | recall-tree.py --promote | ✅ |
| 9 | LLM reranking | recall-tree.py --rerank (oMLX gemma-4) | ✅ |
| 10 | Basic Lint | bank-lint + memory-warning-report | ✅ |
| 11 | Lint exploration suggestions | exploration.page_gap_suggestions | ✅ |
| 12 | Lint unresolved contradiction tracking | exploration.unresolved_contradictions | ✅ |
| 13 | index.md (catalog) | tree-index.json (full file index) | ✅ |
| 14 | log.md (chronological record) | _meta/changelog.md | ✅ |
| 15 | Git version control | git tracked | ✅ |
| 16 | Schema co-evolution | self-review cron auto-update | ✅ |
| 17 | Batch vs. interactive toggle | default=1 item, Reflect=batch | ✅ |
| 18 | Obsidian graph view | N/A (CLI environment) | — |
17/18 implemented. The only unimplemented feature is the Obsidian graph view, which is not applicable in a CLI environment.
2. Ingest / Query / Lint: Concrete Implementation
Each of the three actions is described in detail below.
Ingest
    def ingest(source, interactive=False):
        content = fetch(source)
        tags = TAG_ROUTING.classify(content)   # Deterministic regex-based classification
        target_files = resolve_targets(tags)   # W/B/O/S → resolve target files
        for f in target_files:
            cascade_update(f, content)         # Cascade update
        changelog.append(source, tags)
TAG_ROUTING classifies content into four tag types — W (world facts) / B (activities) / O (opinions) / S (status) — using regex keyword rules. The decision to use deterministic rules instead of LLM judgment is explained in the lessons-learned section below.
Query
    def query(q, rerank=False, promote=False):
        candidates = bm25_hybrid_search(q)     # BM25 + keyword hybrid
        if rerank:
            candidates = llm_rerank(candidates, model="oMLX/gemma-4")
        answer = generate_answer(q, candidates)
        if promote and has_knowledge_gap(answer):
            wiki.create_stub(q)                # Gap → create new wiki stub page
        return answer
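The source does not show the internals of `bm25_hybrid_search`. As a rough illustration of the BM25 half only, the sketch below scores pre-tokenized documents with the standard Okapi BM25 formula. The function name `bm25_scores`, the defaults `k1=1.5` and `b=0.75`, and the sample documents are assumptions made for this example, not details of recall-tree.py.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document; docs are lists of tokens."""
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            n = doc_freq[term]
            idf = math.log((n_docs - n + 0.5) / (n + 0.5) + 1)
            # Length normalization: long documents are penalized via b.
            denom = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


# Hypothetical, already-normalized documents.
docs = [
    ["가치관", "정리"],
    ["flutter", "state", "management"],
    ["프로젝트", "일정"],
]
scores = bm25_scores(["가치관"], docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

Because scoring works on exact token equality, any morphological normalization has to happen before the tokens reach this function.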
Handling Korean agglutinative morphology is the key challenge. Morpheme normalization such as "가치관이" → "가치관" (dropping the subject particle 이) is applied at the BM25 indexing stage. English-centric search libraries do not provide this out of the box.
Lint
    lint_checks = [
        check_contradictions,   # Detect unresolved contradictions
        check_page_gaps,        # Detect linked-but-missing pages
        check_stale_entries,    # Entries below confidence-decay threshold
        check_orphan_pages,     # Pages referenced by nothing
        check_data_gaps,        # Entries with figures but no source
    ]
Lint runs in two modes: manual and automated. The Reflect cron (03:00) executes a full batch Lint; the micro-cycle (every 30 minutes) runs a lightweight version.
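Two of these checks can be illustrated over a simple link graph. The sketch below assumes the wiki is available as a mapping from page name to the set of pages it links to. That representation, the `roots` parameter, and the sample pages are assumptions for illustration; the real bank-lint may work differently.

```python
def check_page_gaps(pages: dict[str, set[str]]) -> set[str]:
    """Return link targets that have no page of their own."""
    linked = set().union(*pages.values())
    return linked - set(pages)


def check_orphan_pages(
    pages: dict[str, set[str]],
    roots: frozenset[str] = frozenset({"index"}),
) -> set[str]:
    """Return pages that nothing links to, ignoring entry points in roots."""
    linked = set().union(*pages.values())
    return set(pages) - linked - roots


# Hypothetical wiki: page name -> pages it links to.
wiki_pages = {
    "index": {"tools", "projects"},
    "tools": {"flutter"},      # "flutter" has no page yet
    "projects": set(),
    "scratch": set(),          # nothing links here
}

gaps = check_page_gaps(wiki_pages)
orphans = check_orphan_pages(wiki_pages)
```

Both functions only report findings; deciding whether to create a stub or delete an orphan is left to the caller.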
3. Six OpenClaw-Specific Features Not in Karpathy's Proposal
These features address problems discovered in production that the original proposal does not cover.
1. confidence-decay (Opinion Confidence Decay)
Each entry in the opinion files is assigned a confidence score that automatically decreases by 0.02 per day. Entries falling below 0.30 are automatically removed. Karpathy mentions contradiction detection, but quantifying the temporal value of opinions is a separate problem. Old opinions and recent opinions must not carry equal weight.
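The decay rule can be sketched as follows. The rate of 0.02 per day and the 0.30 threshold come from the text; the record layout (`confidence`, `recorded`) and the function names are assumptions for this example.

```python
from datetime import date

DECAY_PER_DAY = 0.02       # from the text
REMOVAL_THRESHOLD = 0.30   # from the text


def decayed_confidence(initial: float, recorded: date, today: date) -> float:
    """Linear decay: subtract 0.02 for every full day since the entry was recorded."""
    days = max((today - recorded).days, 0)
    return max(initial - DECAY_PER_DAY * days, 0.0)


def prune(opinions: list[dict], today: date) -> list[dict]:
    """Keep only opinions whose decayed confidence is still at or above the threshold."""
    return [
        o for o in opinions
        if decayed_confidence(o["confidence"], o["recorded"], today) >= REMOVAL_THRESHOLD
    ]


# Hypothetical opinion records.
today = date(2025, 1, 31)
opinions = [
    {"text": "a", "confidence": 0.8, "recorded": date(2025, 1, 21)},  # 10 days old
    {"text": "b", "confidence": 0.5, "recorded": date(2025, 1, 11)},  # 20 days old
]
kept = prune(opinions, today)
```

Under these numbers an opinion that starts at 0.8 survives for 25 days before it drops below the threshold, unless something resets its score.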
2. fuzzy dedup (Fuzzy Duplicate Prevention)
Duplicate accumulation is blocked at an 80% keyword overlap threshold. "Flutter state management is good" and "Flutter์ state management๊ฐ ์ข๋ค" are semantically identical despite different surface forms — exact-match deduplication cannot catch this.
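A minimal sketch of a keyword-overlap check. The text gives only the 80% threshold, so measuring overlap as Jaccard similarity over keyword sets is an assumption, as are the stopword list and the function names.

```python
import re

STOPWORDS = {"is", "the", "a", "an", "of", "for", "to", "and"}  # hypothetical


def keywords(text: str) -> set[str]:
    """Lowercased word set with stopwords removed."""
    return {w for w in re.findall(r"\w+", text.lower()) if w not in STOPWORDS}


def keyword_overlap(a: str, b: str) -> float:
    """Jaccard similarity of the two keyword sets (assumed metric)."""
    ka, kb = keywords(a), keywords(b)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def is_duplicate(candidate: str, existing: list[str], threshold: float = 0.8) -> bool:
    """True if the candidate overlaps any existing entry at or above the threshold."""
    return any(keyword_overlap(candidate, e) >= threshold for e in existing)


existing = ["Flutter state management is good"]
dup = is_duplicate("State management for Flutter is good", existing)
new = is_duplicate("Riverpod handles dependency injection", existing)
```

This version compares surface keywords only. Korean text would additionally need particle stripping before comparison, and a mixed-language pair would need its remaining words mapped to a common form, both of which this sketch omits.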
3. TAG_ROUTING (Deterministic Tag Routing)
W/B/O/S tags are auto-classified via regex keywords to determine target files. "Tool change" → world/tools.md; "Project" → world/projects.md. Deterministic rules are chosen over LLM judgment to ensure routing stability.
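Deterministic routing of this kind can be sketched with an ordered list of regex rules. The two world/ targets come from the text; assigning them the W tag, the fallback route, and the function name `route` are assumptions for illustration.

```python
import re

# (pattern, tag, target file). The W tag for world/ files is assumed.
ROUTES = [
    (re.compile(r"\btool change\b", re.IGNORECASE), "W", "world/tools.md"),
    (re.compile(r"\bproject\b", re.IGNORECASE), "W", "world/projects.md"),
]

# Hypothetical fallback for text that matches no rule.
DEFAULT_ROUTE = ("S", "status/inbox.md")


def route(text: str) -> list[tuple[str, str]]:
    """Return every (tag, target) whose pattern matches, in rule order."""
    matches = [(tag, target) for pattern, tag, target in ROUTES if pattern.search(text)]
    return matches or [DEFAULT_ROUTE]


first = route("Tool change: switched the formatter")
second = route("Tool change: switched the formatter")
```

Because the rules are plain regexes evaluated in a fixed order, the same input always yields the same targets, which is the stability property that LLM-based classification lacked.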
4. proactive briefing (Autonomous Briefing)
A multi-mode proactive briefing system that surfaces relevant wiki content without being asked. Includes a conversation-context-aware mode. The wiki should not only respond to manual searches — it must proactively surface information.
5. Three-Tier Automated Distillation Pipeline
memory/ (daily log) → bank/ (curated knowledge) → recall/ (search buffer) — three stages running automatically. Reflect (full distillation at 03:00 daily) and micro-cycle (lightweight distillation every 30 minutes) refine knowledge without human intervention.
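The two schedules can be expressed as a small dispatcher. This sketch assumes the micro-cycle is aligned to :00 and :30 and that both jobs run at 03:00; neither detail is stated in the text, and the function name `due_jobs` is hypothetical.

```python
from datetime import datetime


def due_jobs(now: datetime) -> list[str]:
    """Return the distillation jobs that should start at this minute."""
    jobs = []
    if now.minute % 30 == 0:
        jobs.append("micro-cycle")   # lightweight distillation every 30 minutes
    if now.hour == 3 and now.minute == 0:
        jobs.append("reflect")       # full distillation at 03:00 daily
    return jobs


at_three = due_jobs(datetime(2025, 1, 1, 3, 0))
at_half_past = due_jobs(datetime(2025, 1, 1, 14, 30))
```

In practice a cron entry per job would do the same work; the function form just makes the schedule easy to test.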
6. Korean Agglutinative BM25 Partial Matching
Morpheme normalization such as "가치관이" → "가치관" strips the trailing particle so that an inflected form matches its stem. English-centric search systems do not provide this out of the box.
4. RAG vs. Wiki Paradigm
RAG approach: query → vector search → top-N chunk retrieval → LLM answer
- Advantage: simple to set up
- Disadvantage: answer quality degrades on retrieval failure; knowledge does not accumulate
Wiki approach: LLM writes knowledge directly into structured markdown → search traverses structured files
- Advantage: knowledge is progressively refined; structured files yield higher retrieval precision
- Disadvantage: initial build requires effort; maintenance pipeline required
The wiki approach is decisively superior in long-term operation because of cumulative knowledge refinement. RAG searches raw chunks on every query; the wiki searches the refined structure built up by prior Ingest, Query, and Lint passes. Answer quality for the same query improves as the system matures.
5. Measured Results
| Metric | Value |
|---|---|
| Search precision | 10/10 natural language queries |
| File size | Peak 46 KB → 10.6 KB (about 77% reduction) |
| Duplicate accumulation | 0 after fuzzy dedup |
| Topic coverage | 100% across major topics |
| memory-warning count | Many → few |
| LLM accuracy (Gemma 4) | 16/16 (100%) |
Lessons Learned
Wiki was not designed from the start: Knowledge accumulated in a single file, which bloated to 46 KB before the need for structure became apparent. The schema then had to be designed with Karpathy's framework as a reference, and the existing file had to be decomposed and reclassified after the fact.
Overreliance on LLM judgment: Initial tag classification was delegated to the LLM, but nondeterministic outputs caused the same input to route to different files. Stability was restored after introducing TAG_ROUTING with deterministic rules.
Lesson from the OpenClaw → Hermes migration: A migration to Hermes was attempted while OpenClaw was in stable operation, but it triggered runaway token consumption. The system was reverted to OpenClaw, and the Hermes redesign is currently under validation. The episode confirmed that the wiki system must remain platform-independent.
Conclusion
Karpathy's LLM Wiki operates as a loop of three actions: Ingest, Query, and Lint. Applying this structure in production achieves a level of knowledge refinement and retrieval precision unreachable with the RAG approach. The key is not copying the idea verbatim, but redesigning it for your own environment — language, operational model, and data characteristics.
Series overview: Series index