Agent Memory Engine (10/10) — Context Tree + BM25 Hybrid Search

4월 05, 2026

OpenClaw memory system rebuilt from a 46KB monolithic file to a multi-layer hierarchy with BM25 search

Key Summary

Replaced a bloated 46KB single file and flat keyword matching with Context Tree + BM25 hybrid
Korean agglutinative morphology problem ("가치관이" fails to match "가치관") resolved via substring matching
Results: search precision 65 → 90, file size reduced 75%, memory-warnings reduced from frequent to rare

Background

An AI agent's memory system is core infrastructure for preserving knowledge across sessions. OpenClaw is a multi-agent platform in stable production. It stores learned content, user preferences, and project rules in memory between sessions, then retrieves that data on demand.

The initial design managed all memory in a single file (self-architecture.md). At small scale this worked. As operations accumulated, the file grew to 46KB — requiring the agent to load the entire file into context on every retrieval. A single file was consuming a significant portion of the context window.

The deeper problem was not file size. It was search precision.

Body

1. Three Limits of the Original System

Limit 1: Monolithic file bloat

When all knowledge is concentrated in one file, even a narrow query forces a full load. Fetching only values-related records still pulls 46KB into context. Irrelevant information interferes with response generation and reduces token efficiency.

Limit 2: Flat keyword matching fails for Korean

The original search performed exact full-keyword matching. A search for "가치관" (values) matched only lines containing that exact string. Korean is an agglutinative language: the same stem combines with postpositions to produce "가치관이", "가치관을", "가치관에서", and other surface forms. Stem-only search misses all inflected variants — result: NO_MATCH.

In English, searching "values" matches most occurrences. In Korean, postposition attachment creates distinct surface strings that diverge from the original stem.

Limit 3: Duplicate accumulation and mixed classification

There was no procedure to check existing content before adding new knowledge. Information such as "TTL is set to 30 minutes" was recorded multiple times in different contexts. Technical decisions, personal preferences, and project rules coexisted inside a single file with no structural separation.

2. Context Tree — Node Schema and Hierarchy

The first axis of the solution is structure. The single file is split by topic; tree-index.json manages the hierarchy.

Node schema

Each node is defined by the following fields:

{
  "id": "identity/values",
  "path": "bank/identity/values.md",
  "keywords": ["가치관", "판단기준", "원칙"],
  "parent": "identity",
  "children": [],
  "weight": 1.0
}

id: unique identifier; slash notation encodes hierarchy (category/topic)
path: actual file path
keywords: search terms that route to this node
parent / children: tree relationship
weight: score correction factor

Hierarchy example (parent-child relationships)

root
├── identity/              (parent: root)
│   ├── identity/values    (parent: identity)
│   └── identity/personality
├── technical/             (parent: root)
│   ├── technical/heartbeat-reflect
│   └── technical/cron-design
└── project/               (parent: root)
    ├── project/rules
    └── project/decisions

Multiple nodes are defined across this structure.

Tree traversal logic

On an incoming query, traversal proceeds in this order:

Iterate all nodes in tree-index.json; compare query against each keywords array
Add matched nodes to the candidate list
If no candidates: walk up to the parent node and retry (fallback)
Load only the files of the final candidate nodes — average 2–5 KB

This replaces full 46KB loads with selective per-file loads.

3. BM25 Search Engine

Structure alone is insufficient. A query may not exactly match any keyword in the tree index.

BM25 (Best Match 25) is the standard scoring algorithm in information retrieval. It extends TF-IDF with document-length normalization, enabling fair comparison across documents of varying lengths.

Three additions were made to BM25 for OpenClaw.

IDF cache

IDF (Inverse Document Frequency) quantifies how rarely a term appears across the document corpus. Recalculating this on every query is wasteful, so the values are cached and invalidated only when documents change.

idf_cache = {}

def get_idf(term, corpus):
    if term in idf_cache:
        return idf_cache[term]
    df = sum(1 for doc in corpus if term in doc)
    idf = math.log((len(corpus) - df + 0.5) / (df + 0.5) + 1)
    idf_cache[term] = idf
    return idf

Korean substring matching

Morphological analyzers (e.g., KoNLPy) were evaluated but rejected: heavy dependencies and unstable behavior across environments. Instead, substring match was adopted. If "가치관" is contained within "가치관이", the match succeeds. Agglutinative morphology is handled pragmatically without a parser.

def token_match(query_term, document_text):
    # exact match preferred; substring match as fallback
    if query_term in document_text.split():
        return 1.0
    elif query_term in document_text:
        return 0.7  # partial match with weight decay
    return 0.0

30-minute TTL query cache

Repeated queries return cached results. The initial TTL was 5 minutes, but operational patterns showed frequent repeated queries within the same context causing premature expiry. 30 minutes was confirmed as the appropriate value.

4. Hybrid Scoring

The final score combines Context Tree keyword matching with BM25:

final_score = 0.4 × tree_keyword_score + 0.6 × bm25_score

tree_keyword_score: node keyword match result × weight correction
bm25_score: content-based relevance score

The initial weighting was 0.5/0.5. Documents not present in the tree keyword list were consistently ranked low — but content-relevant documents exist outside the keyword inventory. Raising BM25 weight to 0.6 resolved this.

The BM25 bias reflects empirical results: when a query partially matches tree keywords, documents with high BM25 scores were consistently more accurate.

5. MemTree Auto-Split and Retain-Merge

bank-size-watch.py monitors file size periodically. Files exceeding the threshold (10 KB) are split by topic automatically. At 5 KB the files fragmented too aggressively, increasing management complexity. 10 KB was the balance point between split frequency and file count.

retain-merge TAG_ROUTING routes new knowledge to the appropriate file. Each knowledge item receives a tag (W/B/O/S); keyword-based routing places it in the correct node.

TAG_ROUTING = {
    "W": ["가치관", "원칙", "판단기준"],   # → identity/values.md
    "B": ["크론", "스케줄", "설계"],       # → technical/cron-design.md
    "O": ["Reflect", "파이프라인", "heartbeat"],  # → technical/heartbeat-reflect.md
    "S": ["프로젝트", "규칙", "결정"],     # → project/rules.md
}

Before placing incoming knowledge into its target file, retain-merge compares it against existing content. Duplicates are merged; conflicts are flagged.

6. Three-Layer Memory Architecture

The full memory system is divided into three layers:

Layer	Directory	Role
Raw data	`memory/`	Daily logs. Pre-processed state. Immutable.
Curated knowledge	`bank/`	Verified and promoted content. Managed by Context Tree.
Search buffer	`recall/`	Cache of frequently referenced content.

Data flow: memory/ → (verify/extract) → bank/ → (on query) → recall/

Queries always check recall/ first, then bank/. memory/ is raw data and excluded from direct search.

7. Query Examples

Query 1: "가치관이 뭐야?" (What are your values?)

[tree traversal]
  query term "가치관이"
  → keyword "가치관" ⊂ "가치관이" → substring match success
  → node: identity/values (weight: 1.0)

[bm25]
  identity/values.md: score 0.84
  project/rules.md:   score 0.31

[final]
  identity/values: 0.4×1.0 + 0.6×0.84 = 0.904  ← rank 1
  project/rules:   0.4×0.0 + 0.6×0.31 = 0.186

→ load identity/values.md (2.1 KB)

Before: NO_MATCH. After: exact target file loaded.

Query 2: "Reflect pipeline structure"

[tree traversal]
  "Reflect" → keywords ["Reflect", "파이프라인", "heartbeat"]
  → node: technical/heartbeat-reflect (weight: 1.0)

[bm25]
  heartbeat-reflect.md: score 0.91

[final]
  0.4×1.0 + 0.6×0.91 = 0.946 ← rank 1

→ load heartbeat-reflect.md (3.4 KB)

Before: full 46 KB load. After: 3.4 KB direct match.

8. Measured Results

Search precision: 65 → 90 - Significant reduction in Korean inflected-query match failures

File size: 46 KB → 10.6 KB (75% reduction) - Single file replaced by multiple files; individual file average 2–5 KB

Duplicate accumulation: 0 occurrences - retain-merge compares against existing content and merges duplicates before writing

memory-warnings: frequent → rare - Context window pressure reduced; memory-related warnings substantially decreased

Current State and Hermes

OpenClaw is in stable production. The Context Tree + BM25 architecture is production-validated.

Hermes is the next-generation platform designed to succeed OpenClaw. An initial migration triggered token runaway, leading to a rollback to OpenClaw followed by a Hermes redesign. The system is currently under re-verification. The memory architecture design is grounded in the structure validated by OpenClaw.

Conclusion

In an AI agent's memory system, retrieval matters more than storage. No matter how much is stored, inaccessible knowledge is effectively absent.

Context Tree provides structure — node schema and parent-child relationships define where knowledge lives. BM25 provides search precision — content-based relevance scoring defines how relevant a document is. Combined as a hybrid, the system selects documents that are both structurally correct and content-relevant.

Korean agglutinative morphology is handled practically with substring matching, without a morphological analyzer. No complex dependencies; sufficient for production operation.

When designing a memory system: design the retrieval structure before the storage structure. A memory that cannot be found is not a memory.

Series overview: Series index

이 블로그 검색

MaJu Tech Notes