Agent Memory Engine (1/10) — SQLite Memory Engine for LLM Agents: memcore Architecture
Design principles and structure for building agent memory in a single SQLite file
ํต์ฌ ์์ฝ
Three core techniques covered in this post:
- Retain tags: Force the LLM to emit structured tags at conversation time, then harvest them post-hoc with regex. Deterministic, hallucination-free, zero token overhead versus post-hoc LLM extraction.
- 2-Track execution: Rule-expressible tasks run as scripts (Track 1); only tasks requiring natural language go to an economy LLM (Track 2). A practical criterion for auditing whether LLM calls are being wasted on work that rules can handle.
- SQLite single-file memory: FTS5 (keyword) + sqlite-vec (semantic) + 6-tier hybrid recall. Zero external dependencies; the entire agent memory runs in a 9 MB file.
Introduction
The first engineering bottleneck encountered when operating LLM agents long-term is memory. At tens of conversations the full context can be appended on every call without issue. Once hundreds to thousands of facts, opinions, and states must be managed cumulatively and only the relevant subset retrieved on demand, the problem changes in character entirely.
This post documents the design principles behind memcore, a SQLite-based memory engine. The full agent memory lives in a single SQLite file; markdown files serve as the source of truth. Topics covered: (1) four structural constraints of LLM memory design, (2) five design principles that address them, (3) concrete techniques — Retain tags, hybrid recall, opinion decay — and (4) the minimum portable unit for transplanting the pattern to other agents.
1. Four Structural Reasons LLM Memory Is Hard
Context is finite and expensive
Including cumulative conversation history in every call causes token consumption to grow super-linearly. In a monolithic heartbeat architecture, a single loop was observed consuming up to 4.5 million tokens per day. Cost accelerates relative to conversation volume.
Relevance degrades at scale
When memory holds 50 entries, injecting everything is feasible. At 500 entries, only the necessary few must be selected with precision. Single-vector search alone is unstable for this selection. For proper-noun queries like "which model does reflect use?", keyword matching provides a stronger signal than semantic similarity.
Memory quality decays over time
Opinions frequently become invalid as time passes. Unmanaged past opinions accumulate as noise, and the agent makes current decisions based on stale judgments.
Using frontier models for maintenance drains the budget
Applying top-tier models to memory cleanup, conflict detection, and summarization routes the budget away from user conversations into plumbing. Infrastructure cost can reverse its relationship with primary task cost.
2. Five Design Principles
Principle 1: Script-first
If it can be expressed as a rule, a script handles it. Most of memory extraction, merging, deduplication, and diagnostics reduces to deterministic rules. Judgments like "if this tag is present, extract it", "if similarity exceeds 80%, it's a duplicate", "if an opinion is 30 days old, delete it" require no LLM.
The Track 1 pipeline (10–30-minute intervals) operates at zero token consumption. LLMs are invoked only for semantic judgments that cannot be reduced to rules.
Principle 2: Dual-write
bank/ markdown files are the source of truth; memcore.db (SQLite) is the search and analysis layer. Markdown is human-readable, git-tracked, and editable in any editor. A corrupted DB can be rebuilt from markdown; the inverse is asymmetrically harder. On conflict, bank/ wins.
Principle 3: Modularize by schedule
Each concern gets an independent module with an independent schedule. Adding a new task costs one module plus one macOS LaunchAgent entry. Modules have no inter-dependencies, so a failure in one does not propagate.
Principle 4: Tiered LLM allocation
Even within "AI judgment", model choice spans up to 100× in cost. User conversation → premium; memory maintenance → free/local; semantic judgment → economy. This separation is the core of overall token reduction.
Principle 5: No promotion without evidence
Three stages — observation → hypothesis → verified pattern — must be traversed before a fact is promoted to long-term memory. A single utterance is never immediately hardened into a rule.
3. Overall Architecture
Conversation → session-scan (10 min, tag extraction, 0 tokens)
→ micro-cycle (30 min, merge + 1 local-LLM judgment)
→ reflect (daily 03:00, full validation + cloud LLM)
bank/ (md files, git-tracked) ←→ memcore.db (SQLite, FTS5, vectors)
↑
recall (6-tier hybrid search)
All memory fits in a single SQLite file. WAL mode supports concurrent reads; 22 tables divide responsibilities. File size: approximately 9 MB.
Backup is file copy; migration is sqlite3 dump. No external vector DB dependencies — no PostgreSQL, Redis, or Pinecone.
SQLite was chosen because the retrieval requirements are not purely semantic. Proper-noun queries are served more accurately by keyword matching than embeddings. SQLite + FTS5 handles keywords; sqlite-vec + bge-m3 handles semantics. Two retrieval systems coexist in one file.
4. Retain Tags — Inverting the Extraction Contract
Problem: LLM extraction is unreliable
The conventional approach to extracting facts from conversations asks an LLM after the fact: "extract the important facts." Structural drawbacks:
- Non-deterministic: different results on every call
- Hallucination bleed: facts absent from the source get generated
- Call overhead: an LLM call is required per conversation
- Non-debuggable: extraction rationale is hard to trace
Solution: Enforce structure at distillation time
Retain tags invert the extraction contract. Instead of post-hoc extraction, the prompt forces the LLM to emit structured tags at conversation time. Five tag types are used:
- W: Mac Mini M4 32GB RAM confirmed (World fact)
- B: Reflect v0.5 pipeline build complete (Behavior log)
- O(c=0.90): inverted structure wins on cost (Opinion + confidence)
- S:openclaw-rebuild: Phase C complete (State)
- U-Preference: reject features without verification (User pattern)
- W: Objective fact. Does not decay.
- B: Activity log. Records "what was done."
- O: Opinion. Carries a confidence score; decays over time.
- S: Entity state. Tracks the current state of a specific project or system.
- U: User observation. Records preferences, habits, and patterns.
Extraction is pure regex
Because the tag format is enforced at conversation time, extraction is handled deterministically with regex.
RETAIN_PATTERNS = [
re.compile(r"^- (W): (.+)", re.MULTILINE),
re.compile(r"^- (B): (.+)", re.MULTILINE),
re.compile(r"^- (O\(c=[\d.]+\)): (.+)", re.MULTILINE),
re.compile(r"^- (S:[^:]+): (.+)", re.MULTILINE),
re.compile(r"^- (U-\w+): (.+)", re.MULTILINE),
]
No LLM calls. Zero token consumption, deterministic output, no hallucinations, zero false positives (unmatched tags go unextracted). Parsing runs live against JSONL every 10 minutes without waiting for session close; results land in bank/ within 30 minutes.
Core principle
Retain tags are not merely a data format — they are a forcing function. Making the conversation LLM structure its own output makes the downstream extraction pipeline cheap and trustworthy. The design decision is to move classification from "post-hoc LLM" to "concurrent LLM."
5. 6-Tier Hybrid Recall
Memory that cannot be retrieved is functionally the same as memory that was never stored. memcore uses 6-tier hybrid search.
"""Search strategy:
1. Topic keyword match → score × 2.0 (curator-approved, strongest signal)
2. Vector similarity → score × 1.5 (semantic search, oMLX/bge-m3)
3. FTS5 full-text → BM25 score (mixed Korean+English)
4. LIKE fallback → score × 0.5 (last resort)
5. Wiki FTS → separate tier (feature doc search)
6. Wiki LIKE → separate tier (doc fallback)
"""
Tier 1 — Topic keyword match (×2.0): Exact match against curator-approved keywords. Highest weight because it is an explicit human signal.
Tier 2 — Vector similarity (×1.5): Cosine similarity of bge-m3 embeddings generated by oMLX. Handles semantic queries where "cost reduction" must also surface "token efficiency."
Tier 3 — FTS5 full-text (BM25): Full-text search via the SQLite FTS5 extension. Scores mixed Korean/English content using the BM25 algorithm.
Tier 4 — LIKE fallback (×0.5): Last resort when the top three tiers find nothing. Simple substring containment check. Lowest weight.
Tiers 5–6 — Wiki FTS/LIKE: Separate tiers for feature document search. Enables the agent to treat its own spec as searchable memory.
Graceful degradation
When oMLX is offline, Tier 2 (vector) is automatically skipped and the remaining four tiers provide coverage. Keyword and full-text search handle the majority of queries without vectors. Availability is deliberately prioritized over quality.
6. Micro-Cycle — 30-Minute Pipeline, One LLM Call
The micro-cycle is an 8-step pipeline on a 30-minute interval.
1. [Script] Check for new memory/*.md files
2. [Script] retain-extract → staging JSON
3. [Script] retain-merge → bank/ merge + conflict detection + fuzzy dedup (80%)
4. [Script] Diagnostics (bank-lint, memory-optimize)
5. [Script] decision-prepare → prepare items requiring judgment
6. [LLM] oMLX/gemma-4 — 1 semantic judgment (local, free)
7. [Script] decision-apply → apply to bank/
8. [Script] Write log
7 of 8 steps are scripts. The LLM is called once, at step 6, and handles exactly one semantic judgment — e.g., "do these two memories conflict?" or "does this observation support the existing hypothesis?"
Rationale for the one-call limit
The micro-cycle runs 48 times per day. Permitting multiple calls could accumulate 10 per cycle, yielding 480 calls daily and triggering rate-limit exhaustion and runaway cost. The one-call limit guarantees predictable behavior. Memory maintenance tolerates latency in a way that user conversations do not, so the remainder defers to the next cycle.
Rationale for local oMLX
Consuming a significant fraction of the Groq free tier (1,500 req/day) on a single micro-cycle reduces the budget available for other free-LLM use cases. oMLX + gemma-4 run locally — offline, unlimited, electricity-unit cost — and operate independently of network failures and API outages.
7. Opinion Decay System
The confidence on an O(c=0.90) tag decays by -0.02 per day. An initial confidence of 0.90 reaches 0.30 after 30 days, at which point the entry is automatically deleted.
Design rationale: opinions can become invalid over time. A record stating "this model is the most efficient" from three months ago may already be meaningless given a new model release. Left unmanaged, the agent makes present decisions on stale past judgments.
The decay system automates this. Beliefs not reconfirmed by new observations expire naturally. Opinions that remain valid get their confidence reset as they are reconfirmed in subsequent conversations.
Facts (W tags) do not decay. Objective facts are either true or retracted, independent of elapsed time.
8. U-Tag Dialectic — 3-Stage Promotion
User observations are not committed without evidence; they traverse three stages.
Observation (first sighting)
→ Hypothesis (supported by 2+ observations)
→ Verified (3+ observations, promoted to stable pattern)
Hardening a single utterance immediately into a pattern produces overfitting: a context-specific statement becomes a universal rule. This mechanism prevents "keep it simple today" from solidifying into "always keep it simple."
9. 2-Track Execution — Separating Rules from LLMs
Every task is classified into one of two tracks.
- Track 1 (script): Output is determinable by rule → script. Zero tokens.
- Track 2 (LLM): Output requires natural language → LLM. Economy model, not frontier.
In the initial implementation, all 14 modes called frontier models. Auditing revealed that only 6 genuinely required LLM judgment, and economy models were sufficient for those. The remaining 8 — file existence checks, tag parsing, deduplication, diagnostic reports — were fully replaceable with rule-based logic.
The magnitude of reduction varies by the existing call pattern. The key takeaway is not a specific number but the introduction of the decision criterion: "audit whether LLM is being used for tasks that rules can handle."
10. 5-Tier LLM System
L0 Premium — gpt-5.4 (subscription, conversation-only)
L1 Standard — claude-sonnet, gemini-pro (pay-per-use, fallback)
L2 Economy — gemini-flash ($0.075/M, reflect/self-review)
L3 Free — groq/llama-70b, github-models ($0, cron alerts)
L4 Local — oMLX/gemma-4 ($0, memory pipeline)
Operational rules
- L0 is cron-prohibited: Spending subscription tokens on automation reduces headroom for user conversations.
- Fallback within the same tier: On L2 failure, retry another L2; do not escalate to L0.
- Fallback monitoring: Free-tier exhaustion for Groq (1,500 req/day) and GitHub Models (150 req/day) is detected in real time. Gateway logs are scanned for 429/timeout/error, and per-provider health is reported.
11. Feature Docs ↔ memcore Auto-Sync
An agent must be able to query its own capabilities. A query like "how does the recall system work?" should surface the spec document written by the developer.
features_sync.py splits docs/features/ markdown at H2 boundaries, detects changes via content hash, and syncs to the wiki table in memcore. The system's own spec becomes searchable memory.
12. CLI — 33 Commands
The memcore CLI is organized into three categories.
Pipeline (10) — data flow
extract, merge, decision-prepare, decision-apply, conflict-apply,
dialectic, skill-stage, session-scan, micro-cycle, migrate
Data ingestion (extract) → merge → judgment (decision) → application (apply).
Maintain (13) — health
decay, lint, topics-validate, topics-expand, entity-audit, bank-size,
archive, session-cleanup, session-archive, optimize, ontology-sync,
features-sync, backfill-vectors
Opinion decay, integrity checks, vector backfill, and other memory health tasks.
Diagnose (10) — status
stats, monitor, warn, wiki-lint, task-scan, upcoming-event,
rollback, fallback-monitor, pricing-show, pricing-report
State inspection, anomaly detection, and cost tracking tools.
13. Key Metrics
| Item | Value |
|---|---|
| Track 1 (scripts) | Zero LLM calls |
| Track 2 (LLM) | Economy model only |
| Curated rows | 295 |
| Wiki pages | 1,666 |
| CLI commands | 33 (3 categories) |
| DB tables | 22 |
| DB size | ~9 MB (single file) |
| LLM tiers | 5 |
| Recall tiers | 6 |
| Micro-cycle interval | 30 min |
| Session-scan interval | 10 min |
| Opinion decay rate | -0.02 / day |
| External dependencies | 0 (SQLite only) |
14. Minimum Portable Unit for Other Agents
There is no need to adopt memcore wholesale. Applying five core concepts resolves the majority of memory problems.
1. Define the extraction contract
Define 3–5 tag types and enforce the format in the distillation prompt. The specific tag structure varies per agent, but the pattern — "structure at conversation time, harvest post-hoc with regex" — is universal.
2. SQLite + FTS5
Place a curated table and an FTS5 virtual table in a single file. Add sqlite-vec if semantic search is required.
3. A bank/ directory
Maintain the source of truth as markdown files. Git tracking preserves change history; the DB can be rebuilt from markdown if corrupted.
4. A micro-cycle script
Run memory updates on a 30-minute schedule, designed to operate without LLM calls. Cap LLM invocations at one per cycle.
5. Tiered LLM allocation
Best model for conversation; cheapest model for maintenance. This single principle accounts for most of the token savings.
Conclusion
The general principle extractable from the memcore design is that most memory problems are engineering problems, not LLM problems.
- Extraction → regex is sufficient
- Deduplication → fuzzy matching is sufficient
- Opinion management → a simple decay function is sufficient
- Where LLM is genuinely needed → semantic judgment: "do these two facts conflict?"
Making this distinction explicit reduces cost while preserving memory quality. Scripts, unlike LLMs, are deterministic, hallucination-free, and debuggable.
Applicability and Open Questions
- Applicable scenarios: Single-user or small-scale collaborative agents, locally-operated environments, memory growth rates within tens of MB per day.
- Scenarios where limits surface: Multi-tenant environments with hundreds of writes per second, environments requiring distributed read/write consistency. In those cases the single-file SQLite assumption breaks first.
- Open questions: (1) What mechanism would enable the Retain tag schema to evolve automatically? (2) What signals are needed to adaptively tune opinion decay rates per topic? (3) How does recall rate change when feature doc sync is extended to code-level specs (function signatures, tests)?
One SQLite file, a handful of markdown files, a few lines of regex. This minimal combination is sufficient to build agent memory with zero external dependencies — that is the core claim memcore makes.
Series overview: Series index
๋๊ธ
๋๊ธ ์ฐ๊ธฐ