Agent Memory Engine (10/10) — Context Tree + BM25 Hybrid Search
OpenClaw memory system rebuilt from a 46KB monolithic file to a multi-layer hierarchy with BM25 search
Key Summary
- Replaced a bloated 46KB single file and flat keyword matching with Context Tree + BM25 hybrid
- Korean agglutinative morphology problem ("๊ฐ์น๊ด์ด" fails to match "๊ฐ์น๊ด") resolved via substring matching
- Results: search precision 65 → 90, file size reduced 75%, memory-warnings reduced from frequent to rare
Background
An AI agent's memory system is core infrastructure for preserving knowledge across sessions. OpenClaw is a multi-agent platform in stable production. It stores learned content, user preferences, and project rules in memory between sessions, then retrieves that data on demand.
The initial design managed all memory in a single file (self-architecture.md). At small scale this worked. As operations accumulated, the file grew to 46KB — requiring the agent to load the entire file into context on every retrieval. A single file was consuming a significant portion of the context window.
The deeper problem was not file size. It was search precision.
Body
1. Three Limits of the Original System
Limit 1: Monolithic file bloat
When all knowledge is concentrated in one file, even a narrow query forces a full load. Fetching only values-related records still pulls 46KB into context. Irrelevant information interferes with response generation and reduces token efficiency.
Limit 2: Flat keyword matching fails for Korean
The original search performed exact full-keyword matching. A search for "๊ฐ์น๊ด" (values) matched only lines containing that exact string. Korean is an agglutinative language: the same stem combines with postpositions to produce "๊ฐ์น๊ด์ด", "๊ฐ์น๊ด์", "๊ฐ์น๊ด์์", and other surface forms. Stem-only search misses all inflected variants — result: NO_MATCH.
In English, searching "values" matches most occurrences. In Korean, postposition attachment creates distinct surface strings that diverge from the original stem.
Limit 3: Duplicate accumulation and mixed classification
There was no procedure to check existing content before adding new knowledge. Information such as "TTL is set to 30 minutes" was recorded multiple times in different contexts. Technical decisions, personal preferences, and project rules coexisted inside a single file with no structural separation.
2. Context Tree — Node Schema and Hierarchy
The first axis of the solution is structure. The single file is split by topic; tree-index.json manages the hierarchy.
Node schema
Each node is defined by the following fields:
{
"id": "identity/values",
"path": "bank/identity/values.md",
"keywords": ["๊ฐ์น๊ด", "ํ๋จ๊ธฐ์ค", "์์น"],
"parent": "identity",
"children": [],
"weight": 1.0
}
id: unique identifier; slash notation encodes hierarchy (category/topic)path: actual file pathkeywords: search terms that route to this nodeparent/children: tree relationshipweight: score correction factor
Hierarchy example (parent-child relationships)
root
├── identity/ (parent: root)
│ ├── identity/values (parent: identity)
│ └── identity/personality
├── technical/ (parent: root)
│ ├── technical/heartbeat-reflect
│ └── technical/cron-design
└── project/ (parent: root)
├── project/rules
└── project/decisions
Multiple nodes are defined across this structure.
Tree traversal logic
On an incoming query, traversal proceeds in this order:
- Iterate all nodes in
tree-index.json; compare query against eachkeywordsarray - Add matched nodes to the candidate list
- If no candidates: walk up to the parent node and retry (fallback)
- Load only the files of the final candidate nodes — average 2–5 KB
This replaces full 46KB loads with selective per-file loads.
3. BM25 Search Engine
Structure alone is insufficient. A query may not exactly match any keyword in the tree index.
BM25 (Best Match 25) is the standard scoring algorithm in information retrieval. It extends TF-IDF with document-length normalization, enabling fair comparison across documents of varying lengths.
Three additions were made to BM25 for OpenClaw.
IDF cache
IDF (Inverse Document Frequency) quantifies how rarely a term appears across the document corpus. Recalculating this on every query is wasteful, so the values are cached and invalidated only when documents change.
idf_cache = {}
def get_idf(term, corpus):
if term in idf_cache:
return idf_cache[term]
df = sum(1 for doc in corpus if term in doc)
idf = math.log((len(corpus) - df + 0.5) / (df + 0.5) + 1)
idf_cache[term] = idf
return idf
Korean substring matching
Morphological analyzers (e.g., KoNLPy) were evaluated but rejected: heavy dependencies and unstable behavior across environments. Instead, substring match was adopted. If "๊ฐ์น๊ด" is contained within "๊ฐ์น๊ด์ด", the match succeeds. Agglutinative morphology is handled pragmatically without a parser.
def token_match(query_term, document_text):
# exact match preferred; substring match as fallback
if query_term in document_text.split():
return 1.0
elif query_term in document_text:
return 0.7 # partial match with weight decay
return 0.0
30-minute TTL query cache
Repeated queries return cached results. The initial TTL was 5 minutes, but operational patterns showed frequent repeated queries within the same context causing premature expiry. 30 minutes was confirmed as the appropriate value.
4. Hybrid Scoring
The final score combines Context Tree keyword matching with BM25:
final_score = 0.4 × tree_keyword_score + 0.6 × bm25_score
- tree_keyword_score: node keyword match result ×
weightcorrection - bm25_score: content-based relevance score
The initial weighting was 0.5/0.5. Documents not present in the tree keyword list were consistently ranked low — but content-relevant documents exist outside the keyword inventory. Raising BM25 weight to 0.6 resolved this.
The BM25 bias reflects empirical results: when a query partially matches tree keywords, documents with high BM25 scores were consistently more accurate.
5. MemTree Auto-Split and Retain-Merge
bank-size-watch.py monitors file size periodically. Files exceeding the threshold (10 KB) are split by topic automatically. At 5 KB the files fragmented too aggressively, increasing management complexity. 10 KB was the balance point between split frequency and file count.
retain-merge TAG_ROUTING routes new knowledge to the appropriate file. Each knowledge item receives a tag (W/B/O/S); keyword-based routing places it in the correct node.
TAG_ROUTING = {
"W": ["๊ฐ์น๊ด", "์์น", "ํ๋จ๊ธฐ์ค"], # → identity/values.md
"B": ["ํฌ๋ก ", "์ค์ผ์ค", "์ค๊ณ"], # → technical/cron-design.md
"O": ["Reflect", "ํ์ดํ๋ผ์ธ", "heartbeat"], # → technical/heartbeat-reflect.md
"S": ["ํ๋ก์ ํธ", "๊ท์น", "๊ฒฐ์ "], # → project/rules.md
}
Before placing incoming knowledge into its target file, retain-merge compares it against existing content. Duplicates are merged; conflicts are flagged.
6. Three-Layer Memory Architecture
The full memory system is divided into three layers:
| Layer | Directory | Role |
|---|---|---|
| Raw data | memory/ |
Daily logs. Pre-processed state. Immutable. |
| Curated knowledge | bank/ |
Verified and promoted content. Managed by Context Tree. |
| Search buffer | recall/ |
Cache of frequently referenced content. |
Data flow: memory/ → (verify/extract) → bank/ → (on query) → recall/
Queries always check recall/ first, then bank/. memory/ is raw data and excluded from direct search.
7. Query Examples
Query 1: "๊ฐ์น๊ด์ด ๋ญ์ผ?" (What are your values?)
[tree traversal]
query term "๊ฐ์น๊ด์ด"
→ keyword "๊ฐ์น๊ด" ⊂ "๊ฐ์น๊ด์ด" → substring match success
→ node: identity/values (weight: 1.0)
[bm25]
identity/values.md: score 0.84
project/rules.md: score 0.31
[final]
identity/values: 0.4×1.0 + 0.6×0.84 = 0.904 ← rank 1
project/rules: 0.4×0.0 + 0.6×0.31 = 0.186
→ load identity/values.md (2.1 KB)
Before: NO_MATCH. After: exact target file loaded.
Query 2: "Reflect pipeline structure"
[tree traversal]
"Reflect" → keywords ["Reflect", "ํ์ดํ๋ผ์ธ", "heartbeat"]
→ node: technical/heartbeat-reflect (weight: 1.0)
[bm25]
heartbeat-reflect.md: score 0.91
[final]
0.4×1.0 + 0.6×0.91 = 0.946 ← rank 1
→ load heartbeat-reflect.md (3.4 KB)
Before: full 46 KB load. After: 3.4 KB direct match.
8. Measured Results
Search precision: 65 → 90 - Significant reduction in Korean inflected-query match failures
File size: 46 KB → 10.6 KB (75% reduction) - Single file replaced by multiple files; individual file average 2–5 KB
Duplicate accumulation: 0 occurrences - retain-merge compares against existing content and merges duplicates before writing
memory-warnings: frequent → rare - Context window pressure reduced; memory-related warnings substantially decreased
Current State and Hermes
OpenClaw is in stable production. The Context Tree + BM25 architecture is production-validated.
Hermes is the next-generation platform designed to succeed OpenClaw. An initial migration triggered token runaway, leading to a rollback to OpenClaw followed by a Hermes redesign. The system is currently under re-verification. The memory architecture design is grounded in the structure validated by OpenClaw.
Conclusion
In an AI agent's memory system, retrieval matters more than storage. No matter how much is stored, inaccessible knowledge is effectively absent.
Context Tree provides structure — node schema and parent-child relationships define where knowledge lives. BM25 provides search precision — content-based relevance scoring defines how relevant a document is. Combined as a hybrid, the system selects documents that are both structurally correct and content-relevant.
Korean agglutinative morphology is handled practically with substring matching, without a morphological analyzer. No complex dependencies; sufficient for production operation.
When designing a memory system: design the retrieval structure before the storage structure. A memory that cannot be found is not a memory.
Series overview: Series index
๋๊ธ
๋๊ธ ์ฐ๊ธฐ