Agent Memory Engine (1/10) — SQLite Memory Engine for LLM Agents: memcore Architecture

4월 12, 2026

Design principles and structure for building agent memory in a single SQLite file

핵심 요약

Three core techniques covered in this post:

Retain tags: Force the LLM to emit structured tags at conversation time, then harvest them post-hoc with regex. Deterministic, hallucination-free, zero token overhead versus post-hoc LLM extraction.
2-Track execution: Rule-expressible tasks run as scripts (Track 1); only tasks requiring natural language go to an economy LLM (Track 2). A practical criterion for auditing whether LLM calls are being wasted on work that rules can handle.
SQLite single-file memory: FTS5 (keyword) + sqlite-vec (semantic) + 6-tier hybrid recall. Zero external dependencies; the entire agent memory runs in a 9 MB file.

Introduction

The first engineering bottleneck encountered when operating LLM agents long-term is memory. At tens of conversations the full context can be appended on every call without issue. Once hundreds to thousands of facts, opinions, and states must be managed cumulatively and only the relevant subset retrieved on demand, the problem changes in character entirely.

This post documents the design principles behind memcore, a SQLite-based memory engine. The full agent memory lives in a single SQLite file; markdown files serve as the source of truth. Topics covered: (1) four structural constraints of LLM memory design, (2) five design principles that address them, (3) concrete techniques — Retain tags, hybrid recall, opinion decay — and (4) the minimum portable unit for transplanting the pattern to other agents.

1. Four Structural Reasons LLM Memory Is Hard

Context is finite and expensive

Including cumulative conversation history in every call causes token consumption to grow super-linearly. In a monolithic heartbeat architecture, a single loop was observed consuming up to 4.5 million tokens per day. Cost accelerates relative to conversation volume.

Relevance degrades at scale

When memory holds 50 entries, injecting everything is feasible. At 500 entries, only the necessary few must be selected with precision. Single-vector search alone is unstable for this selection. For proper-noun queries like "which model does reflect use?", keyword matching provides a stronger signal than semantic similarity.

Memory quality decays over time

Opinions frequently become invalid as time passes. Unmanaged past opinions accumulate as noise, and the agent makes current decisions based on stale judgments.

Using frontier models for maintenance drains the budget

Applying top-tier models to memory cleanup, conflict detection, and summarization routes the budget away from user conversations into plumbing. Infrastructure cost can reverse its relationship with primary task cost.

2. Five Design Principles

Principle 1: Script-first

If it can be expressed as a rule, a script handles it. Most of memory extraction, merging, deduplication, and diagnostics reduces to deterministic rules. Judgments like "if this tag is present, extract it", "if similarity exceeds 80%, it's a duplicate", "if an opinion is 30 days old, delete it" require no LLM.

The Track 1 pipeline (10–30-minute intervals) operates at zero token consumption. LLMs are invoked only for semantic judgments that cannot be reduced to rules.

Principle 2: Dual-write

bank/ markdown files are the source of truth; memcore.db (SQLite) is the search and analysis layer. Markdown is human-readable, git-tracked, and editable in any editor. A corrupted DB can be rebuilt from markdown; the inverse is asymmetrically harder. On conflict, bank/ wins.

Principle 3: Modularize by schedule

Each concern gets an independent module with an independent schedule. Adding a new task costs one module plus one macOS LaunchAgent entry. Modules have no inter-dependencies, so a failure in one does not propagate.

Principle 4: Tiered LLM allocation

Even within "AI judgment", model choice spans up to 100× in cost. User conversation → premium; memory maintenance → free/local; semantic judgment → economy. This separation is the core of overall token reduction.

Principle 5: No promotion without evidence

Three stages — observation → hypothesis → verified pattern — must be traversed before a fact is promoted to long-term memory. A single utterance is never immediately hardened into a rule.

3. Overall Architecture

Conversation → session-scan (10 min, tag extraction, 0 tokens)
             → micro-cycle  (30 min, merge + 1 local-LLM judgment)
             → reflect      (daily 03:00, full validation + cloud LLM)

bank/ (md files, git-tracked)  ←→  memcore.db (SQLite, FTS5, vectors)
                                         ↑
                                  recall (6-tier hybrid search)

All memory fits in a single SQLite file. WAL mode supports concurrent reads; 22 tables divide responsibilities. File size: approximately 9 MB.

Backup is file copy; migration is sqlite3 dump. No external vector DB dependencies — no PostgreSQL, Redis, or Pinecone.

SQLite was chosen because the retrieval requirements are not purely semantic. Proper-noun queries are served more accurately by keyword matching than embeddings. SQLite + FTS5 handles keywords; sqlite-vec + bge-m3 handles semantics. Two retrieval systems coexist in one file.

4. Retain Tags — Inverting the Extraction Contract

Problem: LLM extraction is unreliable

The conventional approach to extracting facts from conversations asks an LLM after the fact: "extract the important facts." Structural drawbacks:

Non-deterministic: different results on every call
Hallucination bleed: facts absent from the source get generated
Call overhead: an LLM call is required per conversation
Non-debuggable: extraction rationale is hard to trace

Solution: Enforce structure at distillation time

Retain tags invert the extraction contract. Instead of post-hoc extraction, the prompt forces the LLM to emit structured tags at conversation time. Five tag types are used:

- W: Mac Mini M4 32GB RAM confirmed          (World fact)
- B: Reflect v0.5 pipeline build complete    (Behavior log)
- O(c=0.90): inverted structure wins on cost (Opinion + confidence)
- S:openclaw-rebuild: Phase C complete       (State)
- U-Preference: reject features without verification (User pattern)

W: Objective fact. Does not decay.
B: Activity log. Records "what was done."
O: Opinion. Carries a confidence score; decays over time.
S: Entity state. Tracks the current state of a specific project or system.
U: User observation. Records preferences, habits, and patterns.

Extraction is pure regex

Because the tag format is enforced at conversation time, extraction is handled deterministically with regex.

RETAIN_PATTERNS = [
    re.compile(r"^- (W): (.+)", re.MULTILINE),
    re.compile(r"^- (B): (.+)", re.MULTILINE),
    re.compile(r"^- (O\(c=[\d.]+\)): (.+)", re.MULTILINE),
    re.compile(r"^- (S:[^:]+): (.+)", re.MULTILINE),
    re.compile(r"^- (U-\w+): (.+)", re.MULTILINE),
]

No LLM calls. Zero token consumption, deterministic output, no hallucinations, zero false positives (unmatched tags go unextracted). Parsing runs live against JSONL every 10 minutes without waiting for session close; results land in bank/ within 30 minutes.

Core principle

Retain tags are not merely a data format — they are a forcing function. Making the conversation LLM structure its own output makes the downstream extraction pipeline cheap and trustworthy. The design decision is to move classification from "post-hoc LLM" to "concurrent LLM."

5. 6-Tier Hybrid Recall

Memory that cannot be retrieved is functionally the same as memory that was never stored. memcore uses 6-tier hybrid search.

"""Search strategy:
  1. Topic keyword match  → score × 2.0  (curator-approved, strongest signal)
  2. Vector similarity    → score × 1.5  (semantic search, oMLX/bge-m3)
  3. FTS5 full-text       → BM25 score   (mixed Korean+English)
  4. LIKE fallback        → score × 0.5  (last resort)
  5. Wiki FTS             → separate tier (feature doc search)
  6. Wiki LIKE            → separate tier (doc fallback)
"""

Tier 1 — Topic keyword match (×2.0): Exact match against curator-approved keywords. Highest weight because it is an explicit human signal.

Tier 2 — Vector similarity (×1.5): Cosine similarity of bge-m3 embeddings generated by oMLX. Handles semantic queries where "cost reduction" must also surface "token efficiency."

Tier 3 — FTS5 full-text (BM25): Full-text search via the SQLite FTS5 extension. Scores mixed Korean/English content using the BM25 algorithm.

Tier 4 — LIKE fallback (×0.5): Last resort when the top three tiers find nothing. Simple substring containment check. Lowest weight.

Tiers 5–6 — Wiki FTS/LIKE: Separate tiers for feature document search. Enables the agent to treat its own spec as searchable memory.

Graceful degradation

When oMLX is offline, Tier 2 (vector) is automatically skipped and the remaining four tiers provide coverage. Keyword and full-text search handle the majority of queries without vectors. Availability is deliberately prioritized over quality.

6. Micro-Cycle — 30-Minute Pipeline, One LLM Call

The micro-cycle is an 8-step pipeline on a 30-minute interval.

1. [Script] Check for new memory/*.md files
2. [Script] retain-extract → staging JSON
3. [Script] retain-merge → bank/ merge + conflict detection + fuzzy dedup (80%)
4. [Script] Diagnostics (bank-lint, memory-optimize)
5. [Script] decision-prepare → prepare items requiring judgment
6. [LLM]   oMLX/gemma-4 — 1 semantic judgment (local, free)
7. [Script] decision-apply → apply to bank/
8. [Script] Write log

7 of 8 steps are scripts. The LLM is called once, at step 6, and handles exactly one semantic judgment — e.g., "do these two memories conflict?" or "does this observation support the existing hypothesis?"

Rationale for the one-call limit

The micro-cycle runs 48 times per day. Permitting multiple calls could accumulate 10 per cycle, yielding 480 calls daily and triggering rate-limit exhaustion and runaway cost. The one-call limit guarantees predictable behavior. Memory maintenance tolerates latency in a way that user conversations do not, so the remainder defers to the next cycle.

Rationale for local oMLX

Consuming a significant fraction of the Groq free tier (1,500 req/day) on a single micro-cycle reduces the budget available for other free-LLM use cases. oMLX + gemma-4 run locally — offline, unlimited, electricity-unit cost — and operate independently of network failures and API outages.

7. Opinion Decay System

The confidence on an O(c=0.90) tag decays by -0.02 per day. An initial confidence of 0.90 reaches 0.30 after 30 days, at which point the entry is automatically deleted.

Design rationale: opinions can become invalid over time. A record stating "this model is the most efficient" from three months ago may already be meaningless given a new model release. Left unmanaged, the agent makes present decisions on stale past judgments.

The decay system automates this. Beliefs not reconfirmed by new observations expire naturally. Opinions that remain valid get their confidence reset as they are reconfirmed in subsequent conversations.

Facts (W tags) do not decay. Objective facts are either true or retracted, independent of elapsed time.

8. U-Tag Dialectic — 3-Stage Promotion

User observations are not committed without evidence; they traverse three stages.

Observation (first sighting)
  → Hypothesis (supported by 2+ observations)
    → Verified (3+ observations, promoted to stable pattern)

Hardening a single utterance immediately into a pattern produces overfitting: a context-specific statement becomes a universal rule. This mechanism prevents "keep it simple today" from solidifying into "always keep it simple."

9. 2-Track Execution — Separating Rules from LLMs

Every task is classified into one of two tracks.

Track 1 (script): Output is determinable by rule → script. Zero tokens.
Track 2 (LLM): Output requires natural language → LLM. Economy model, not frontier.

In the initial implementation, all 14 modes called frontier models. Auditing revealed that only 6 genuinely required LLM judgment, and economy models were sufficient for those. The remaining 8 — file existence checks, tag parsing, deduplication, diagnostic reports — were fully replaceable with rule-based logic.

The magnitude of reduction varies by the existing call pattern. The key takeaway is not a specific number but the introduction of the decision criterion: "audit whether LLM is being used for tasks that rules can handle."

10. 5-Tier LLM System

L0 Premium  — gpt-5.4         (subscription, conversation-only)
L1 Standard — claude-sonnet, gemini-pro  (pay-per-use, fallback)
L2 Economy  — gemini-flash     ($0.075/M, reflect/self-review)
L3 Free     — groq/llama-70b, github-models  ($0, cron alerts)
L4 Local    — oMLX/gemma-4    ($0, memory pipeline)

Operational rules

L0 is cron-prohibited: Spending subscription tokens on automation reduces headroom for user conversations.
Fallback within the same tier: On L2 failure, retry another L2; do not escalate to L0.
Fallback monitoring: Free-tier exhaustion for Groq (1,500 req/day) and GitHub Models (150 req/day) is detected in real time. Gateway logs are scanned for 429/timeout/error, and per-provider health is reported.

11. Feature Docs ↔ memcore Auto-Sync

An agent must be able to query its own capabilities. A query like "how does the recall system work?" should surface the spec document written by the developer.

features_sync.py splits docs/features/ markdown at H2 boundaries, detects changes via content hash, and syncs to the wiki table in memcore. The system's own spec becomes searchable memory.

12. CLI — 33 Commands

The memcore CLI is organized into three categories.

Pipeline (10) — data flow

extract, merge, decision-prepare, decision-apply, conflict-apply,
dialectic, skill-stage, session-scan, micro-cycle, migrate

Data ingestion (extract) → merge → judgment (decision) → application (apply).

Maintain (13) — health

decay, lint, topics-validate, topics-expand, entity-audit, bank-size,
archive, session-cleanup, session-archive, optimize, ontology-sync,
features-sync, backfill-vectors

Opinion decay, integrity checks, vector backfill, and other memory health tasks.

Diagnose (10) — status

stats, monitor, warn, wiki-lint, task-scan, upcoming-event,
rollback, fallback-monitor, pricing-show, pricing-report

State inspection, anomaly detection, and cost tracking tools.

13. Key Metrics

Item	Value
Track 1 (scripts)	Zero LLM calls
Track 2 (LLM)	Economy model only
Curated rows	295
Wiki pages	1,666
CLI commands	33 (3 categories)
DB tables	22
DB size	~9 MB (single file)
LLM tiers	5
Recall tiers	6
Micro-cycle interval	30 min
Session-scan interval	10 min
Opinion decay rate	-0.02 / day
External dependencies	0 (SQLite only)

14. Minimum Portable Unit for Other Agents

There is no need to adopt memcore wholesale. Applying five core concepts resolves the majority of memory problems.

1. Define the extraction contract

Define 3–5 tag types and enforce the format in the distillation prompt. The specific tag structure varies per agent, but the pattern — "structure at conversation time, harvest post-hoc with regex" — is universal.

2. SQLite + FTS5

Place a curated table and an FTS5 virtual table in a single file. Add sqlite-vec if semantic search is required.

3. A `bank/` directory

Maintain the source of truth as markdown files. Git tracking preserves change history; the DB can be rebuilt from markdown if corrupted.

4. A micro-cycle script

Run memory updates on a 30-minute schedule, designed to operate without LLM calls. Cap LLM invocations at one per cycle.

5. Tiered LLM allocation

Best model for conversation; cheapest model for maintenance. This single principle accounts for most of the token savings.

Conclusion

The general principle extractable from the memcore design is that most memory problems are engineering problems, not LLM problems.

Extraction → regex is sufficient
Deduplication → fuzzy matching is sufficient
Opinion management → a simple decay function is sufficient
Where LLM is genuinely needed → semantic judgment: "do these two facts conflict?"

Making this distinction explicit reduces cost while preserving memory quality. Scripts, unlike LLMs, are deterministic, hallucination-free, and debuggable.

Applicability and Open Questions

Applicable scenarios: Single-user or small-scale collaborative agents, locally-operated environments, memory growth rates within tens of MB per day.
Scenarios where limits surface: Multi-tenant environments with hundreds of writes per second, environments requiring distributed read/write consistency. In those cases the single-file SQLite assumption breaks first.
Open questions: (1) What mechanism would enable the Retain tag schema to evolve automatically? (2) What signals are needed to adaptively tune opinion decay rates per topic? (3) How does recall rate change when feature doc sync is extended to code-level specs (function signatures, tests)?

One SQLite file, a handful of markdown files, a few lines of regex. This minimal combination is sufficient to build agent memory with zero external dependencies — that is the core claim memcore makes.

Series overview: Series index