Agent Operations Retrospective (1/7) — 7 Lessons Distilled from 60 Operational Failures
cold-start 60→120s, page_gap 8→50, exit-code split — the most powerful fixes were single-line changes
Key Summary
- Approximately 60 failure logs accumulated during long-term operation were analyzed and distilled into 7 reusable lessons.
- The most effective fixes were not large-scale refactors — they were single-line threshold or definition changes.
- The common principle across all 7 lessons: make the system express more precisely what it knows and what it does not know.
This post targets engineers working with failure logs from agent-based systems. It covers 7 parameters, conventions, and cache policies where measurable improvements were observed.
1. Exit-Code Convention Split
The conventional meaning of exit 0 is "everything is OK," but in production, it must be defined as "my job is done" for cron analysis to function correctly. Split into three modes:
- --strict mode: 0 = validation passed, 1 = validation failed.
- Reflect mode: 0 = clean exit regardless of queue state.
- Daily-audit mode: 1 = "issues found" (success), 0 = no results (suspicious).
How it works. The inverted code convention in daily-audit is designed to flag a "check that found nothing" as a suspect signal. It surfaces cases where the inspection script exits silently empty-handed due to sampling failure, missing input, or similar silent faults.
Impact. This single convention change allowed the cron analyzer to distinguish genuine failures from normal exits. Previously, all exit 0 values were treated identically, leaving suspect cases undetected.
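The three conventions above can be sketched as a small lookup that a cron analyzer might apply to each run. This is a minimal sketch, not the production code: the mode names follow the post, while `Verdict` and `interpret_exit` are hypothetical names, and the handling of exit codes the post does not mention (nonzero in reflect mode, anything other than 0 or 1 in daily-audit) is an assumption.

```python
from enum import Enum


class Verdict(Enum):
    OK = "ok"
    FAILED = "failed"
    SUSPICIOUS = "suspicious"


def interpret_exit(mode: str, code: int) -> Verdict:
    """Map (mode, exit code) to a verdict. Hypothetical helper for a cron analyzer."""
    if mode == "strict":
        # 0 = validation passed, 1 = validation failed.
        return Verdict.OK if code == 0 else Verdict.FAILED
    if mode == "reflect":
        # 0 = clean exit regardless of queue state.
        # Assumption: any nonzero code is treated as a failure.
        return Verdict.OK if code == 0 else Verdict.FAILED
    if mode == "daily-audit":
        # Inverted convention: 1 = issues found (the audit did its job),
        # 0 = nothing found, which is flagged for a closer look.
        if code == 1:
            return Verdict.OK
        if code == 0:
            return Verdict.SUSPICIOUS
        # Assumption: any other code means the audit itself crashed.
        return Verdict.FAILED
    raise ValueError(f"unknown mode: {mode}")
```

The point of routing every code through one function is that the same exit 0 yields a different verdict depending on the mode that produced it.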
2. Automatic Warning Resolution
Warnings that do not recur within 10 days are automatically resolved. The underlying assumption is simple: real problems come back.
How it works. One-off noise — transient network blips, external API rate-limit spikes — does not recur, so it should not remain in the long-term tracking queue. The 10-day window covers at least one retry cycle for periodic cron and batch jobs.
Limitation. Intermittent failures with a recurrence interval exceeding 10 days (e.g., monthly) may be missed by auto-resolution. A separate monthly accumulation counter compensates for this gap.
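A rough sketch of the 10-day rule together with the monthly counter might look like the following. `TrackedWarning`, `record`, and `auto_resolve` are hypothetical names, and resolving at exactly 10 days since the last occurrence (rather than strictly after) is an assumption.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

RESOLVE_AFTER = timedelta(days=10)  # window from the post


@dataclass
class TrackedWarning:
    key: str
    last_seen: datetime
    resolved: bool = False
    # Occurrences per "YYYY-MM". Kept after resolution so that slow
    # recurrences (e.g., monthly) remain visible.
    monthly_counts: dict[str, int] = field(default_factory=dict)


def record(warning: TrackedWarning, now: datetime) -> None:
    """Register an occurrence: reopen the warning and bump this month's counter."""
    warning.last_seen = now
    warning.resolved = False
    month = now.strftime("%Y-%m")
    warning.monthly_counts[month] = warning.monthly_counts.get(month, 0) + 1


def auto_resolve(warnings: list[TrackedWarning], now: datetime) -> list[str]:
    """Resolve warnings not seen for RESOLVE_AFTER; return the keys resolved."""
    resolved_keys = []
    for warning in warnings:
        if not warning.resolved and now - warning.last_seen >= RESOLVE_AFTER:
            warning.resolved = True
            resolved_keys.append(warning.key)
    return resolved_keys
```

A warning that fires once a month is resolved each time, but its `monthly_counts` keeps growing, which is the signal the separate counter is meant to preserve.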
3. page_gap 8 → 50
A chunk-linking parameter in the memory/ directory. Even when topics match, chunks beyond the threshold distance are not merged into the same graph.
How it works. page_gap=8 requires a topic to appear continuously within 8 chunks — a strict setting. In long sessions where topics resurface intermittently, the graph fragments into disconnected subgraphs. Raising to 50 allows the same topic scattered across a week's worth of sessions to be consolidated into a single graph.
Impact. Recall hit-rate improved noticeably. The trade-off is increased false-positive connections, which are filtered at the topic-matching layer (Resolution L2/L1/L0, §5).
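To make the fragmentation effect concrete, here is a minimal sketch of gap-based linking. It assumes "distance" means the difference in chunk position and that a gap equal to the threshold still links; `link_chunks` and the sample data are hypothetical and not taken from the actual memory code.

```python
from collections import defaultdict


def link_chunks(chunks: list[tuple[int, str]], page_gap: int) -> list[list[int]]:
    """Group chunk positions by topic, splitting a group whenever the distance
    to the previous same-topic chunk exceeds page_gap."""
    by_topic: dict[str, list[int]] = defaultdict(list)
    for position, topic in chunks:
        by_topic[topic].append(position)

    groups: list[list[int]] = []
    for positions in by_topic.values():
        positions.sort()
        current = [positions[0]]
        for pos in positions[1:]:
            if pos - current[-1] <= page_gap:
                current.append(pos)  # close enough: same subgraph
            else:
                groups.append(current)  # too far: start a new subgraph
                current = [pos]
        groups.append(current)
    return groups


# A topic that resurfaces intermittently across a long session.
sample = [(0, "deploy"), (5, "deploy"), (30, "deploy"), (45, "deploy")]
fragmented = link_chunks(sample, page_gap=8)
consolidated = link_chunks(sample, page_gap=50)
```

With these sample positions, the threshold of 8 produces three separate groups, while 50 joins all four chunks into one.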
4. Heartbeat cold-start 60s → 120s
The first-call timeout ceiling. Covered in Part 8 but worth reiterating.
How it works. Most cold-start timeouts originate from model loading time — not network latency or queue saturation. The dominant cause is the one-time memory load on first invocation.
Impact. A single line change from 60s to 120s eliminated false-positive alerts during overnight hours. The count of heartbeats misclassified as failures dropped to near zero.
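One way to express a separate first-call ceiling is sketched below. Only the 120s cold-start value comes from the post; the `Heartbeat` class, its method names, and the 30s warm timeout are placeholders chosen for illustration.

```python
COLD_START_TIMEOUT_S = 120.0  # raised from 60 in the post
WARM_TIMEOUT_S = 30.0  # hypothetical placeholder; not stated in the post


class Heartbeat:
    """Classifies heartbeat calls, allowing extra time for the first one."""

    def __init__(self) -> None:
        self._warmed_up = False

    def timeout_for_next_call(self) -> float:
        return WARM_TIMEOUT_S if self._warmed_up else COLD_START_TIMEOUT_S

    def classify(self, elapsed_s: float) -> str:
        """Return "ok" or "failed" for a call that took elapsed_s seconds."""
        if elapsed_s > self.timeout_for_next_call():
            return "failed"
        # Once one call succeeds, the model is assumed to be loaded.
        self._warmed_up = True
        return "ok"
```

A 95-second first call passes under this scheme, while the same latency on a later call is still reported as a failure.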
5. Topic-Cued Recall TTL 60min + Resolution L2/L1/L0
Cache policy for retrieval results. A simple 60-minute TTL cache is combined with a resolution-based update mechanism.
Resolution tiers.
- L2 = exact match
- L1 = topic match
- L0 = keyword match
How it works. If the same topic is recalled within 60 minutes, the cache is used. However, if the cached hit has low resolution (L0/L1), a 0.3x score penalty is applied, preserving the opportunity for a higher-precision result to replace it. This structure converts a simple TTL cache into an adaptive cache.
Applicability. Generalizes to any agent system with a retrieval layer. Adjust the TTL value to match the domain's topic-switching frequency.
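The TTL-plus-resolution policy could be sketched roughly as follows. This reads "0.3x" as multiplying the cached score by 0.3, which is an interpretation; `RecallCache`, `Hit`, and the `offer` replacement rule are hypothetical names and choices, not the actual implementation.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional

TTL_SECONDS = 60 * 60  # 60-minute TTL from the post
LOW_RESOLUTION_PENALTY = 0.3  # assumption: "0.3x" is a score multiplier


@dataclass
class Hit:
    content: str
    score: float
    resolution: int  # 2 = exact (L2), 1 = topic (L1), 0 = keyword (L0)
    stored_at: float


class RecallCache:
    def __init__(self, now: Callable[[], float] = time.monotonic) -> None:
        self._now = now
        self._entries: dict[str, Hit] = {}

    def put(self, topic: str, content: str, score: float, resolution: int) -> None:
        self._entries[topic] = Hit(content, score, resolution, self._now())

    def get(self, topic: str) -> Optional[tuple[str, float]]:
        """Return (content, effective_score), or None if missing or expired."""
        hit = self._entries.get(topic)
        if hit is None:
            return None
        if self._now() - hit.stored_at > TTL_SECONDS:
            del self._entries[topic]
            return None
        score = hit.score
        if hit.resolution < 2:
            score *= LOW_RESOLUTION_PENALTY  # penalize L0/L1 cached hits
        return hit.content, score

    def offer(self, topic: str, content: str, score: float, resolution: int) -> bool:
        """Store a fresh result only if it beats the cached effective score."""
        cached = self.get(topic)
        if cached is None or score > cached[1]:
            self.put(topic, content, score, resolution)
            return True
        return False
```

Because low-resolution hits are penalized on read, a fresh exact match with a modest raw score can still displace a cached keyword match within the TTL window.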
6. softThresholdTokens 6000 — Early Compression
Context compression trigger. Fire it well before the limit, not near it.
How it works. Triggering compression at the limit means too much content is compressed at once, increasing information loss. Triggering at the 6000-token threshold reduces the volume compressed in each pass, which reduces semantic loss per compression event.
Impact. Compression degraded model response quality less often. Because early compression runs in smaller increments, cumulative loss per session follows a gentle curve rather than a linear ramp.
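A minimal sketch of an early trigger is shown below. Only the 6000-token threshold comes from the post; `maybe_compress`, the `keep_recent` parameter, and the caller-supplied `count_tokens` and `summarize` callables are hypothetical stand-ins for whatever tokenizer and summarizer the system actually uses.

```python
from typing import Callable

SOFT_THRESHOLD_TOKENS = 6000  # value from the post


def maybe_compress(
    messages: list[str],
    count_tokens: Callable[[str], int],
    summarize: Callable[[list[str]], str],
    keep_recent: int = 4,
) -> list[str]:
    """Compress older messages into one summary once the soft threshold is hit."""
    total = sum(count_tokens(m) for m in messages)
    if total < SOFT_THRESHOLD_TOKENS or len(messages) <= keep_recent:
        return messages  # below the threshold, or nothing old enough to fold
    split = len(messages) - keep_recent
    older, recent = messages[:split], messages[split:]
    # Only the older slice is summarized; recent turns stay verbatim.
    return [summarize(older)] + recent
```

Calling this after every turn means each pass folds a small slice of history, instead of one large pass when the hard limit is reached.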
7. Retain 4 Categories + c=X Confidence
Two metadata axes are attached to stored entries: a category tag and a confidence score.
Category tags.
- W = world fact
- B = behavior pattern
- O = opinion
- S = source pointer
Confidence c=. A value between 0 and 1. Combined with confidence-decay, it decreases automatically over time.
How it works. The decay function operates differently per category — old opinions (O) weaken automatically; old facts (W) have a decay coefficient near zero and remain nearly stable. This distinction enforces at the system level that facts are stable and interpretations are subject to review.
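Per-category decay can be illustrated with a simple exponential, c(t) = c0 * exp(-rate * t). The rates below are invented for illustration; the post only states that W decays at a rate near zero and that O weakens over time. The function name `decayed_confidence` is likewise hypothetical.

```python
import math

# Hypothetical per-day decay rates, chosen only to show the ordering:
# facts (W) barely move, opinions (O) fade fastest.
DECAY_PER_DAY = {"W": 0.0005, "B": 0.01, "O": 0.05, "S": 0.005}


def decayed_confidence(c0: float, category: str, age_days: float) -> float:
    """Exponentially decay an initial confidence c0 for an entry of a given age."""
    if not 0.0 <= c0 <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    return c0 * math.exp(-DECAY_PER_DAY[category] * age_days)


fact_after_90_days = decayed_confidence(0.9, "W", 90)
opinion_after_90_days = decayed_confidence(0.9, "O", 90)
```

Under these illustrative rates, a 90-day-old fact keeps most of its confidence, while an opinion of the same age has faded close to zero and would be due for review.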
Common Pattern: Making Room for Not-Knowing
The single principle threading through all 7 lessons:
Give the system a designated place to express that it does not know.
- exit-code split expresses: "is this a failure or a normal exit?"
- c= confidence expresses: "how certain is this information?"
- Resolution L0/L1/L2 expresses: "how precise was this retrieval?"
A system with dedicated representation for uncertainty becomes capable of self-inspection. An interface that does not hide uncertainty is the practical starting point of observability.
Portability: Hermes Port Mapping
All 7 lessons are backend-agnostic. Mapping to Hermes equivalents:
| Lesson | Hermes Target |
|---|---|
| exit-code convention | cron payload |
| page_gap | memcore parameter |
| cold-start 120s | Hermes hook |
| TTL 60min + Resolution | prefetch cache |
| softThreshold 6000 | context_compressor config |
| Retain 4 categories | ingest pipeline |
| Warning auto-resolution | audit loop |
All are structurally portable with a 1:1 mapping.
Summary: Large Refactor vs. Single-Line Fix
Large refactors typically introduce larger bugs. The most effective improvements during long-term operation were single-line threshold or definition changes. Three properties they share:
- Locality — the change scope is within one file and one line.
- Semantic-layer change — not merely changing a value, but redefining what that value means.
- Observability gain — logs carry more information after the change than before.
Open Questions
- Are 4 category tags (W/B/O/S) sufficient? Long-term operation may surface domains requiring finer granularity.
- page_gap=50 is optimal at current operating scale, but whether the same value holds for domains with different chunk density (e.g., high-frequency conversation vs. low-frequency documents) requires re-measurement.
- The decay curve for c= confidence is currently exponential. A different curve shape (e.g., a step function) may be more accurate for certain categories.
Values in this post should be treated as starting points. Adjust to the domain characteristics of the target system.
Series overview: Series index