Agent Memory Engine (8/10) — Applying Karpathy LLM Wiki in Production

Implementation record: 17 of Andrej Karpathy's 18 proposed features implemented, plus 6 custom additions


Summary

  • Applied Karpathy's LLM Wiki concept to the OpenClaw agent platform
  • 17 of the original 18 features are implemented; 6 additional features address gaps found in production
  • Instead of searching external documents on every query (RAG), the LLM incrementally builds and maintains a permanent, structured wiki

What Is an LLM Wiki

The core idea Andrej Karpathy published is straightforward. Rather than retrieving external documents on every query (RAG), the LLM itself incrementally builds and maintains a structured markdown wiki.

The wiki operates through three actions:

  • Ingest: Accepts raw sources and updates wiki pages. Triggers cascade updates to related pages.
  • Query: Answers questions using the wiki. Knowledge gaps discovered during answering are promoted into the wiki.
  • Lint: Audits the entire wiki. Detects and resolves contradictions, gaps, and stale entries.

When these three actions form a loop, the knowledge base self-refines without human intervention.


Body

1. Karpathy's Original 18 Features vs. OpenClaw Implementation

| # | Karpathy Feature | OpenClaw Implementation | Status |
| --- | --- | --- | --- |
| 1 | Raw Sources (immutable originals) | memory/ + docs/manuals/ | Implemented |
| 2 | Wiki (LLM-maintained markdown) | bank/ multi-file Context Tree | Implemented |
| 3 | Schema (co-evolving config) | tree-index.json + AGENTS.md | Implemented |
| 4 | Ingest cascade updates | retain-merge cascade_update | Implemented |
| 5 | Interactive Ingest | source-ingest.py --interactive | Implemented |
| 6 | Automated external source Ingest | source-ingest.py --url/--file | Implemented |
| 7 | Query (wiki-backed answers) | recall-tree.py BM25 hybrid | Implemented |
| 8 | Query → Wiki promotion | recall-tree.py --promote | Implemented |
| 9 | LLM reranking | recall-tree.py --rerank (oMLX gemma-4) | Implemented |
| 10 | Basic Lint | bank-lint + memory-warning-report | Implemented |
| 11 | Lint exploration suggestions | exploration.page_gap_suggestions | Implemented |
| 12 | Lint unresolved contradiction tracking | exploration.unresolved_contradictions | Implemented |
| 13 | index.md (catalog) | tree-index.json (full file index) | Implemented |
| 14 | log.md (chronological record) | _meta/changelog.md | Implemented |
| 15 | Git version control | git tracked | Implemented |
| 16 | Schema co-evolution | self-review cron auto-update | Implemented |
| 17 | Batch vs. interactive toggle | default=1 item, Reflect=batch | Implemented |
| 18 | Obsidian graph view | N/A (CLI environment) | Not implemented |

17/18 implemented. The only unimplemented feature is the Obsidian graph view, which is not applicable in a CLI environment.


2. Ingest / Query / Lint: Concrete Implementation

Each of the three actions is described in detail below.

Ingest


def ingest(source, interactive=False):
    content = fetch(source)
    tags = TAG_ROUTING.classify(content)   # Deterministic regex-based classification
    target_files = resolve_targets(tags)   # W/B/O/S → resolve target files
    for f in target_files:
        cascade_update(f, content)         # Cascade update
    changelog.append(source, tags)

TAG_ROUTING classifies content into four tag types — W (world facts) / B (activities) / O (opinions) / S (status) — using regex keyword rules. The decision to use deterministic rules instead of LLM judgment is explained in the lessons-learned section below.

Query


def query(q, rerank=False, promote=False):
    candidates = bm25_hybrid_search(q)    # BM25 + keyword hybrid
    if rerank:
        candidates = llm_rerank(candidates, model="oMLX/gemma-4")
    answer = generate_answer(q, candidates)
    if promote and has_knowledge_gap(answer):
        wiki.create_stub(q)               # Gap → create new wiki stub page
    return answer

Handling Korean agglutinative morphology is the key challenge. Morpheme normalization such as "가치관이" → "가치관" (stripping the subject particle from the noun meaning "values") is applied at the BM25 indexing stage. English-centric search libraries do not provide this out of the box.
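The normalization step can be illustrated with a minimal sketch. It assumes a naive suffix-stripping approach with a hand-written particle list; the names `PARTICLES`, `normalize_token`, and `tokenize` are hypothetical and not taken from the OpenClaw code. A suffix list like this will over-strip some nouns that merely end in a particle-like syllable, so a production system would more likely rely on a morphological analyzer.

```python
# Hypothetical particle list for illustration only; not the author's actual rules.
PARTICLES = ("으로", "에서", "에게", "은", "는", "이", "가",
             "을", "를", "의", "에", "도", "로", "와", "과")

# Try longer particles first so "에서" wins over "에".
_SORTED_PARTICLES = sorted(PARTICLES, key=len, reverse=True)


def normalize_token(token: str) -> str:
    """Strip one trailing particle, keeping a stem of at least two characters."""
    for particle in _SORTED_PARTICLES:
        if token.endswith(particle) and len(token) - len(particle) >= 2:
            return token[: -len(particle)]
    return token


def tokenize(text: str) -> list[str]:
    """Whitespace tokenization followed by particle stripping.

    The same function would be applied to documents at indexing time
    and to queries at search time so that both sides share one form.
    """
    return [normalize_token(tok) for tok in text.split()]
```

Applying the same `tokenize` function to both the index and the query is what lets a query containing "가치관이" match a page that stores "가치관".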

Lint


lint_checks = [
    check_contradictions,      # Detect unresolved contradictions
    check_page_gaps,           # Detect linked-but-missing pages
    check_stale_entries,       # Entries below confidence-decay threshold
    check_orphan_pages,        # Pages referenced by nothing
    check_data_gaps,           # Entries with figures but no source
]

Lint runs in two modes: manual and automated. The Reflect cron (03:00) executes a full batch Lint; the micro-cycle (every 30 minutes) runs a lightweight version.
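Two of the checks listed above, page gaps and orphan pages, reduce to set operations over the wiki's link graph. The following is a minimal sketch under stated assumptions: pages are held in a dict of name to body text, links use a `[[page-name]]` syntax (the source does not specify the link format), and the `roots` parameter is a hypothetical way to exempt entry pages from the orphan check.

```python
import re

# Assumed link syntax: [[page-name]]. The actual wiki may link differently.
LINK_RE = re.compile(r"\[\[([^\]|#]+)")


def extract_links(body: str) -> set[str]:
    """Return the set of page names referenced in one page body."""
    return {name.strip() for name in LINK_RE.findall(body)}


def all_links(pages: dict[str, str]) -> set[str]:
    """Union of every link target across the whole wiki."""
    linked: set[str] = set()
    for body in pages.values():
        linked |= extract_links(body)
    return linked


def check_page_gaps(pages: dict[str, str]) -> set[str]:
    """Pages that are linked somewhere but do not exist yet."""
    return all_links(pages) - set(pages)


def check_orphan_pages(pages: dict[str, str], roots=("index",)) -> set[str]:
    """Existing pages that nothing links to, excluding entry pages."""
    return set(pages) - all_links(pages) - set(roots)
```

Each check returns a set of page names, so a lint runner could report the results or turn page gaps into stub pages.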


3. Six OpenClaw-Specific Features Not in Karpathy's Proposal

These features address problems discovered in production that the original proposal does not cover.

1. confidence-decay (Opinion Confidence Decay)

Opinion files are assigned a confidence score that automatically decreases by 0.02 per day. Entries falling below 0.30 are automatically removed. Karpathy mentions contradiction detection, but quantifying the temporal value of opinions is a separate problem. Old opinions and recent opinions must not carry equal weight.
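The decay rule can be written down in a few lines. This sketch uses the two constants stated above (0.02 per day, removal below 0.30); the `Opinion` dataclass and its `last_confirmed` field are assumptions made for illustration, since the source does not describe how opinion entries are stored. Under this rule an opinion recorded at 0.90 would sit at about 0.30 after 30 days and fall below the threshold on day 31 (0.90 - 0.02 × 31 = 0.28).

```python
from dataclasses import dataclass
from datetime import date

DECAY_PER_DAY = 0.02      # stated in the text
REMOVAL_THRESHOLD = 0.30  # stated in the text


@dataclass
class Opinion:
    """Hypothetical record layout; the real file format is not specified."""
    text: str
    confidence: float      # confidence at the time of last confirmation
    last_confirmed: date


def current_confidence(op: Opinion, today: date) -> float:
    """Linear decay since the last confirmation, floored at zero."""
    days = max((today - op.last_confirmed).days, 0)
    return max(op.confidence - DECAY_PER_DAY * days, 0.0)


def prune(opinions: list[Opinion], today: date) -> list[Opinion]:
    """Keep only opinions whose decayed confidence has not fallen below the threshold."""
    return [op for op in opinions
            if current_confidence(op, today) >= REMOVAL_THRESHOLD]
```

Storing the confidence together with a confirmation date, rather than rewriting the score daily, keeps the calculation idempotent: running the prune step twice on the same day gives the same result.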

2. fuzzy dedup (Fuzzy Duplicate Prevention)

Duplicate accumulation is blocked at an 80% keyword overlap threshold. "Flutter state management is good" and "Flutter์˜ state management๊ฐ€ ์ข‹๋‹ค" are semantically identical despite different surface forms — exact-match deduplication cannot catch this.
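A keyword-overlap check of this kind might look like the sketch below. The source gives only the 80% threshold, so the overlap measure (shared keywords divided by the size of the smaller keyword set), the stopword list, and the function names are all assumptions. As written, the sketch only compares surface tokens, so catching a cross-language pair would additionally require mapping words such as "좋다" and "good" to a shared keyword.

```python
import re

THRESHOLD = 0.8  # the 80% overlap threshold stated in the text
# Illustrative stopword list; the real keyword extraction is not described.
STOPWORDS = {"is", "the", "a", "an", "of", "and", "to"}


def keywords(text: str) -> set[str]:
    """Lowercased word tokens with stopwords removed."""
    return {w for w in re.findall(r"\w+", text.lower()) if w not in STOPWORDS}


def overlap(a: str, b: str) -> float:
    """Overlap coefficient: shared keywords over the smaller keyword set."""
    ka, kb = keywords(a), keywords(b)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / min(len(ka), len(kb))


def add_if_new(entries: list[str], candidate: str) -> bool:
    """Append candidate unless it overlaps an existing entry at or above the threshold."""
    if any(overlap(candidate, e) >= THRESHOLD for e in entries):
        return False
    entries.append(candidate)
    return True
```

Dividing by the smaller set makes a short entry that is fully contained in a longer one count as a duplicate, which is one reasonable reading of "keyword overlap"; a Jaccard ratio would be stricter.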

3. TAG_ROUTING (Deterministic Tag Routing)

W/B/O/S tags are auto-classified via regex keywords to determine target files. "Tool change" → world/tools.md; "Project" → world/projects.md. Deterministic rules are chosen over LLM judgment to ensure routing stability.
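A deterministic router of this kind can be sketched as an ordered table of regex rules. Only the two keyword-to-file mappings ("Tool change" → world/tools.md, "Project" → world/projects.md) come from the text; tagging them as W is inferred from the world/ path prefix, and the opinion rule, the fallback target, and all function names are hypothetical.

```python
import re

# (pattern, tag, target file), evaluated in order.
ROUTES = [
    (re.compile(r"\btool change\b", re.IGNORECASE), "W", "world/tools.md"),
    (re.compile(r"\bproject\b", re.IGNORECASE), "W", "world/projects.md"),
    # Hypothetical rule and path, shown only to illustrate a second tag type.
    (re.compile(r"\b(i think|i prefer)\b", re.IGNORECASE), "O", "opinions/general.md"),
]

# Hypothetical fallback so that every input routes somewhere.
DEFAULT_ROUTE = ("S", "status/inbox.md")


def route(text: str) -> list[tuple[str, str]]:
    """Return every (tag, target file) whose pattern matches, in rule order."""
    matches = [(tag, path) for pattern, tag, path in ROUTES if pattern.search(text)]
    return matches or [DEFAULT_ROUTE]
```

Because the rules are plain regexes evaluated in a fixed order, the same input always produces the same target list, which is the stability property the text attributes to TAG_ROUTING.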

4. proactive briefing (Autonomous Briefing)

A multi-mode proactive briefing system that surfaces relevant wiki content without being asked. Includes a conversation-context-aware mode. The wiki should not only respond to manual searches — it must proactively surface information.

5. Three-Tier Automated Distillation Pipeline

memory/ (daily log) → bank/ (curated knowledge) → recall/ (search buffer) — three stages running automatically. Reflect (full distillation at 03:00 daily) and micro-cycle (lightweight distillation every 30 minutes) refine knowledge without human intervention.

6. Korean Agglutinative BM25 Partial Matching

Morpheme normalization such as "가치관이" → "가치관". English-centric search systems do not provide this out of the box.


4. RAG vs. Wiki Paradigm

RAG approach: query → vector search → top-N chunk retrieval → LLM answer

  • Advantage: simple to set up
  • Disadvantage: answer quality degrades on retrieval failure; knowledge does not accumulate

Wiki approach: LLM writes knowledge directly into structured markdown → search traverses structured files

  • Advantage: knowledge is progressively refined; structured files yield higher retrieval precision
  • Disadvantage: initial build requires effort; a maintenance pipeline is required

The wiki approach is decisively superior in long-term operation because of cumulative knowledge refinement. RAG searches raw chunks on every query; the wiki searches the refined structure built up by prior Ingest, Query, and Lint passes. Answer quality for the same query improves as the system matures.


5. Measured Results

| Metric | Value |
| --- | --- |
| Search precision | 10/10 natural language queries |
| File size | Peak 46 KB → 10.6 KB (about 77% reduction) |
| Duplicate accumulation | 0 after fuzzy dedup |
| Topic coverage | 100% across major topics |
| memory-warning count | Many → few |
| LLM accuracy (Gemma 4) | 16/16 (100%) |

Lessons Learned

Wiki was not designed from the start: Knowledge accumulated in a single file, and the need for structure became apparent only after that file had bloated to 46 KB. The schema had to be designed afterward using Karpathy's framework as a reference, and the existing file was then decomposed and reclassified retroactively.

Overreliance on LLM judgment: Initial tag classification was delegated to the LLM, but nondeterministic outputs caused the same input to route to different files. Stability was restored after introducing TAG_ROUTING with deterministic rules.

Lesson from the OpenClaw → Hermes migration: A migration to Hermes was attempted while OpenClaw was in stable operation, but token runaway occurred. The system was reverted to OpenClaw, and a redesigned Hermes setup is currently under validation. This process confirmed that the wiki system must remain platform-independent.


Conclusion

Karpathy's LLM Wiki operates as a loop of three actions: Ingest, Query, and Lint. Applying this structure in production achieves a level of knowledge refinement and retrieval precision unreachable with the RAG approach. The key is not copying the idea verbatim, but redesigning it for your own environment — language, operational model, and data characteristics.

Series overview: Series index
