Agent Memory Engine (9/10) — Applying ByteRover Context Tree: BM25 Hybrid Search

A record of analyzing an external tool's design principles and re-engineering them as a hybrid search system tailored to a custom environment.


Summary

  • Analyzed the 5-tier hybrid search architecture from the ByteRover CLI, then re-implemented it natively within OpenClaw.
  • In a self-evaluation against the ByteRover methodology, the native implementation scored 90/100, compared with 65/100 for the official ByteRover tool.
  • Resolved Korean agglutinative morphology failures using BM25 + tree keyword hybrid scoring.

Background: Memory Retrieval Accuracy Problem

In AI agent systems, retrieving the right memory is significantly harder than accumulating it. As the knowledge file count grew to 43, simple natural-language queries such as "what are my values?" repeatedly failed to surface the correct file.

To address this, the design of ByteRover CLI's 5-tier hybrid search (BM25 + LLM agentic, 92–96% accuracy) was analyzed. Rather than adopting ByteRover directly, the decision was made to extract only the core algorithmic ideas and re-implement them as OpenClaw-native components.

OpenClaw vs Hermes context: OpenClaw is the stable production memory system, and Hermes is the migration target. An earlier transition attempt caused token runaway and was rolled back to OpenClaw. A second Hermes attempt is currently in verification. The Context Tree implementation described here is OpenClaw-based.


Implementation

1. ByteRover vs OpenClaw: Comparison

Dimension | ByteRover Official | OpenClaw Native
--- | --- | ---
Overall Score (self-eval) | 65/100 | 90/100
Retrieval Accuracy | 92–96% | 100% (10/10 queries)
Search Method | BM25 + LLM agentic | BM25 + tree keyword hybrid
External Dependencies | Requires external CLI | Zero (scripts embedded)
Korean Language Support | Unverified | Agglutinative partial-match implemented

The score gap is explained by specialization: ByteRover is a general-purpose tool; the OpenClaw implementation is purpose-built for a specific agent runtime. Generality was traded for targeted performance.


2. Four Core Integration Mechanisms

Mechanism 1: Context Tree Hierarchical Structure

Problem: In a flat-file layout, self-architecture.md grew to 46 KB. Every query required loading the entire file, consuming context with irrelevant content.

Design: Split the content into domain directories under bank/ (eight domains plus _meta, as listed below), totaling 43 files.

bank/
├── identity/      (3 files)   — values, persona, self-definition
├── daily/         (6 files)   — routines, habits, condition logs
├── knowledge/    (12 files)   — domain knowledge, learning notes
├── patterns/      (4 files)   — behavioral patterns, action rules
├── entities/      (6 files)   — relationships, tools, service definitions
├── experience/    (1 file)    — case experience
├── opinions/      (2 files)   — assessments, confidence-weighted evaluations
├── world/         (3 files)   — projects, environment, tooling status
└── _meta/         (3 files)   — index, schema, config

Outcome: Maximum file size reduced from 46 KB → 10.6 KB (about 77% reduction).


Mechanism 2: BM25 Hybrid Search Scoring

tree-index.json defines a node → keyword → file mapping (24 nodes). Hybrid scores are computed as:

score = 0.4 * tree_keyword_match + 0.6 * bm25_idf_score

IDF cache: tree-index.json pre-embeds 3,452 terms. This enables BM25 without any external library.
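The scoring rule above can be sketched roughly as follows. This is a minimal illustration, not the actual recall-tree.py code: the function names, the k1 and b values, the squashing step that maps the raw BM25 score into [0, 1), and the toy IDF values are all assumptions. Only the 0.4/0.6 weights and the idea of reading IDF values from a precomputed cache come from the text.

```python
from collections import Counter

TREE_WEIGHT, BM25_WEIGHT = 0.4, 0.6  # weights from the formula above


def bm25_score(query_terms, doc_terms, idf_cache, avg_doc_len, k1=1.5, b=0.75):
    """Okapi BM25 using precomputed IDF values (k1 and b are assumed defaults)."""
    tf = Counter(doc_terms)
    length_norm = 1 - b + b * len(doc_terms) / avg_doc_len
    score = 0.0
    for term in query_terms:
        if tf[term] == 0:
            continue  # term absent from this document
        idf = idf_cache.get(term, 0.0)  # terms missing from the cache add nothing
        score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
    return score


def hybrid_score(tree_keyword_match, bm25_idf_score):
    """Combine two scores that are both assumed to be scaled to [0, 1]."""
    return TREE_WEIGHT * tree_keyword_match + BM25_WEIGHT * bm25_idf_score


# Toy data (hypothetical): one tokenized document and two cached IDF values.
idf_cache = {"values": 2.0, "routine": 1.2}
doc = ["values", "principles", "values"]
raw = bm25_score(["values"], doc, idf_cache, avg_doc_len=3)
bm25_norm = raw / (raw + 1.0)  # assumed squashing into [0, 1)
score = hybrid_score(tree_keyword_match=1.0, bm25_idf_score=bm25_norm)
```

Because the IDF values are looked up rather than computed from a corpus at query time, the only per-query work is counting term frequencies in the candidate files.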

Query cache: TTL 30 minutes, max 20 entries. Repeated identical queries skip recomputation.
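A cache with these two limits might look like the sketch below. The class name, the oldest-first eviction policy, and the injectable clock are assumptions made for illustration; the text only specifies the 30-minute TTL and the 20-entry cap.

```python
import time
from collections import OrderedDict


class QueryCache:
    """Small TTL cache; evicts the oldest entry once max_entries is reached."""

    def __init__(self, ttl_seconds=30 * 60, max_entries=20, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.max_entries = max_entries
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = OrderedDict()  # query -> (stored_at, result)

    def get(self, query):
        entry = self._entries.get(query)
        if entry is None:
            return None
        stored_at, result = entry
        if self.clock() - stored_at > self.ttl_seconds:
            del self._entries[query]  # expired: treat as a miss
            return None
        return result

    def put(self, query, result):
        self._entries.pop(query, None)  # re-inserting moves the query to the end
        if len(self._entries) >= self.max_entries:
            self._entries.popitem(last=False)  # drop the oldest entry
        self._entries[query] = (self.clock(), result)
```

A search function would call get() first and only fall through to normalization and scoring on a miss, storing the result with put() afterwards.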

Korean agglutinative normalization:

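def normalize_korean(token: str, index_keywords: list[str]) -> str:
    """Map a particle-bearing token to the index keyword it contains, if any."""
    # Try longer keywords first so the most specific stem wins.
    for keyword in sorted(index_keywords, key=len, reverse=True):
        if keyword in token:  # substring test also covers token.startswith(keyword)
            return keyword
    return token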

Prior flat matching failed to link "가치관이" (values + subject marker) to "가치관" (values). Korean attaches postpositions (particles) directly to noun stems, so reducing a particle-bearing token to its indexed base form was essential for reliable retrieval.


Mechanism 3: MemTree Auto-Split

bank-size-watch.py monitors file sizes and triggers automatic splits when thresholds are exceeded.

THRESHOLDS = {
    "warn":  8 * 1024,   # 8 KB  — log warning
    "split": 15 * 1024,  # 15 KB — auto-split trigger
}
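How the thresholds translate into a decision per file can be sketched as below. This is not the actual bank-size-watch.py: the function names, the strict "greater than" comparison, and the restriction to *.md files are assumptions.

```python
from pathlib import Path

THRESHOLDS = {"warn": 8 * 1024, "split": 15 * 1024}  # bytes, as in the text


def classify_size(size_bytes):
    """Return 'split', 'warn', or 'ok' for a file size in bytes."""
    if size_bytes > THRESHOLDS["split"]:
        return "split"
    if size_bytes > THRESHOLDS["warn"]:
        return "warn"
    return "ok"


def scan_bank(bank_dir):
    """Map each Markdown file under bank_dir to its size status."""
    return {
        str(path.relative_to(bank_dir)): classify_size(path.stat().st_size)
        for path in sorted(Path(bank_dir).rglob("*.md"))
    }
```

A watcher built on this would log every "warn" entry and hand every "split" entry to the splitting step.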

Production split cases:

  • guide.md (10 KB) → split into 3 files
  • full.md (46 KB) → split into 7 files
  • 30+ dependency paths auto-updated after each split

Dependency path updates must be automated. Without them, references break silently. A prerequisite step is constructing the dependency graph before splitting.
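A minimal sketch of that prerequisite step follows, assuming references are plain path strings inside text files. The function names and the list of scanned suffixes are hypothetical. Deciding which of the new files each old reference should point to is left to the caller, since that mapping depends on how the content was split.

```python
from pathlib import Path


def find_references(root, target_name, suffixes=(".md", ".py", ".json")):
    """Return {file: [line numbers]} for files under root that mention target_name."""
    hits = {}
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        lines = path.read_text(encoding="utf-8", errors="ignore").splitlines()
        matched = [n for n, line in enumerate(lines, start=1) if target_name in line]
        if matched:
            hits[str(path.relative_to(root))] = matched
    return hits


def rewrite_references(root, hits, old, new):
    """Replace old with new in every file listed in hits; return the files changed."""
    changed = []
    for rel_path in hits:
        path = Path(root) / rel_path
        text = path.read_text(encoding="utf-8")
        updated = text.replace(old, new)
        if updated != text:
            path.write_text(updated, encoding="utf-8")
            changed.append(rel_path)
    return changed
```

Running find_references() again after the rewrite and checking that it returns an empty result is a cheap way to confirm that no stale path survived.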


Mechanism 4: TAG_ROUTING Keyword-Based Classification

retain-merge.py classifies memory tags extracted from conversation (W/B/O/S) via regex patterns and automatically selects the target file.

TAG_ROUTING = {
    "W": {  # World: environment / tools / projects
        r"도구|툴|tool": "world/tools.md",
        r"프로젝트|project": "world/projects.md",
    },
    "B": {  # Behavior: patterns / habits
        r"패턴|반복|루틴": "patterns/routines.md",
    },
    "O": {  # Opinion: assessments / evaluations
        r"생각|판단|평가": "opinions/assessments.md",
    },
    "S": {  # Self: identity / values
        r"가치관|원칙|철학": "identity/values.md",
    },
}
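A lookup over such a table could be sketched as follows. This is not the actual retain-merge.py: the route() function, its first-match-wins behavior, case-insensitive matching, and the default return value are assumptions, and the table is a trimmed copy so the example stays self-contained.

```python
import re

# Trimmed-down routing table in the same shape as TAG_ROUTING.
TAG_ROUTING = {
    "W": {r"도구|툴|tool": "world/tools.md", r"프로젝트|project": "world/projects.md"},
    "S": {r"가치관|원칙|철학": "identity/values.md"},
}


def route(tag, text, default=None):
    """Return the target file for the first pattern under tag that matches text."""
    for pattern, target in TAG_ROUTING.get(tag, {}).items():
        if re.search(pattern, text, flags=re.IGNORECASE):
            return target
    return default  # no rule matched: the caller decides what to do


print(route("W", "Installed a new CLI tool"))  # world/tools.md
print(route("S", "가치관이 바뀌었다"))          # identity/values.md
```

Because re.search matches anywhere in the text, a particle-bearing form such as "가치관이" still hits the "가치관" pattern without separate normalization.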

Fuzzy dedup: Blocks duplicate accumulation at 80% keyword overlap. Prevents the same content from being stored multiple times under surface-level rephrasing.
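One way to implement such a check is sketched below. The text does not say how overlap is measured, so this sketch assumes it is the share of the smaller keyword set that also appears in the other set. The function names are hypothetical.

```python
def keyword_overlap(a, b):
    """Share of the smaller keyword set that also appears in the other set."""
    set_a, set_b = set(a), set(b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / min(len(set_a), len(set_b))


def is_duplicate(candidate, existing_entries, threshold=0.8):
    """True if candidate overlaps any stored entry at or above the threshold."""
    return any(keyword_overlap(candidate, entry) >= threshold for entry in existing_entries)


stored = [["morning", "run", "5km", "daily", "habit"]]
rephrased = ["morning", "run", "5km", "daily", "routine"]  # 4 of 5 keywords shared
print(is_duplicate(rephrased, stored))               # True
print(is_duplicate(["evening", "reading"], stored))  # False
```

Measuring against the smaller set makes a short restatement of a longer entry count as a duplicate. A Jaccard ratio would be stricter and would let more near-duplicates through.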


3. Three-Layer Memory Pipeline Architecture

The full system operates across three layers.

Layer 1: memory/          ← Daily conversation logs
  └── Auto-generated by memoryFlush. Contains Retain tags. Raw memory storage.

Layer 2: bank/            ← Curated knowledge (Context Tree structure)
  └── Reflect (daily 03:00) auto-distills: memory/ → bank/
      7 scripts executed automatically (Phase 1–3.5)

Layer 3: recall/          ← Search buffer
  └── recall-tree.py BM25 hybrid search
      TTL 1-hour cache

Auxiliary systems:

  • micro-cycle: Lightweight distillation from memory/ to bank/ every 30 minutes. Prevents memory loss in long sessions.
  • confidence-decay: Daily confidence decay on opinions/*.md (−0.02/day). Entries below 0.30 are auto-deleted, pruning stale assessments.
  • proactive briefing: 18 autonomous briefing modes. Proactively surfaces relevant memories at context switches.
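The confidence-decay rule can be illustrated with a small sketch. How confidence values are actually stored in opinions/*.md is not described, so the dictionary of entries, the function names, and the rounding step are assumptions. The 0.02 daily decay and the 0.30 deletion threshold come from the text.

```python
DECAY_PER_DAY = 0.02  # from the text
DELETE_BELOW = 0.30   # from the text


def decayed(confidence, days):
    """Linear decay, floored at 0.0."""
    return max(0.0, confidence - DECAY_PER_DAY * days)


def apply_decay(opinions, days=1):
    """Return the opinions that survive decay; entries below the threshold are dropped."""
    survivors = {}
    for key, confidence in opinions.items():
        new_confidence = round(decayed(confidence, days), 4)  # avoid float noise
        if new_confidence >= DELETE_BELOW:
            survivors[key] = new_confidence
    return survivors


opinions = {"tool-x is reliable": 0.90, "plan-y will slip": 0.31}
print(apply_decay(opinions))  # {'tool-x is reliable': 0.88}
```

Under this rule, an opinion stored at 0.50 and never reinforced reaches the threshold after ten days and is removed on the following run.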


4. Search Query Flow

Query input
  ↓
Query cache lookup (30-min TTL)
  ├── Cache HIT → return immediately
  └── Cache MISS
        ↓
      Korean agglutinative normalization
        ↓
      tree-index.json node mapping (tree keyword score)
        ↓
      BM25 IDF scoring (IDF cache: 3,452 terms)
        ↓
      Hybrid score aggregation (0.4 : 0.6)
        ↓
      Return top-N files → store in recall/ cache (1-hour TTL)

Measured Results

Metric | Before | After
--- | --- | ---
Search Precision | Flat matching (frequent Korean failures) | 10/10 natural-language queries passed
Max File Size | 46 KB | 10.6 KB (about 77% reduction)
Duplicate Accumulation | Recurring (3+ instances) | 0 (fuzzy dedup)
Topic Coverage | Incomplete | 43/43 files, 100%
memory-warnings | 31 | 6

Representative case: the query "가치관이 뭐야?" ("What are my values?") previously returned NO_MATCH. After agglutinative normalization, it correctly resolves to identity/values.md.


Design Decisions and Trade-offs

Why ByteRover was not adopted directly: External CLI dependency, cloud sync functionality, and Git-like branch management introduced unnecessary complexity for the target environment. Extracting only the core algorithm and embedding it as internal scripts achieves equivalent retrieval quality with zero external dependencies.

Rationale for the 0.4:0.6 weighting (tree keyword : BM25): Starting from a 0.5:0.5 split, accuracy improved when BM25 IDF scoring carried the higher weight in a Korean agglutinative context. Tree keyword matching handles coarse node classification; fine-grained ranking is left mostly to BM25.

Post-split dependency management: Splitting full.md into 7 files broke 30+ path references. The required sequence: run grep -r "full.md" . to map dependencies before splitting, then execute the automated path-update script after splitting.


Conclusion

The design principles behind ByteRover Context Tree — hierarchical file structure, BM25 hybrid scoring, and auto-split — are directly applicable to AI agent memory systems. However, adopting a general-purpose tool as-is increases system complexity through features irrelevant to the target environment.

Extracting only the core algorithm and adding Korean agglutinative normalization with a native IDF cache produced higher retrieval accuracy than the original on the 10-query self-evaluation (10/10, versus the 92–96% reported for ByteRover). Generality and specialized performance trade off against each other. The effective approach is to define the target environment first, then decide whether to adopt the tool or re-implement its core ideas.

Series overview: Series index
