Agent Operations Retrospective (1/7) — 7 Lessons Distilled from 60 Operational Failures

April 08, 2026

cold-start 60→120s, page_gap 8→50, exit-code split — the most powerful fixes were single-line changes

Key Summary

Approximately 60 failure logs accumulated during long-term operation were analyzed and distilled into 7 reusable lessons.
The most effective fixes were not large-scale refactors — they were single-line threshold or definition changes.
The common principle across all 7 lessons: make the system express more precisely what it knows and what it does not know.

This post targets engineers working with failure logs from agent-based systems. It covers 7 parameters, conventions, and cache policies where measurable improvements were observed.

1. Exit-Code Convention Split

Conventionally, exit 0 means "everything is OK." In production it needs the narrower meaning "my job is done" for cron analysis to function correctly. The convention is split across three modes:

--strict mode: 0 = validation passed, 1 = validation failed.
Reflect mode: 0 = clean exit regardless of queue state.
Daily-audit mode: 1 = "issues found" (success), 0 = no results (suspicious).

How it works. The inverted code convention in daily-audit is designed to flag a "check that found nothing" as a suspect signal. It surfaces cases where the inspection script exits silently empty-handed due to sampling failure, missing input, or similar silent faults.
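As a rough illustration of how a cron analyzer could read the three conventions, here is a minimal sketch. The `Mode` enum, the `interpret_exit` function, and the "ok" / "failed" / "suspect" labels are hypothetical names, not the actual analyzer. Treating any exit code not listed above as a failure is also an assumption.

```python
from enum import Enum


class Mode(Enum):
    STRICT = "strict"
    REFLECT = "reflect"
    DAILY_AUDIT = "daily-audit"


def interpret_exit(mode: Mode, code: int) -> str:
    """Translate a raw exit code into 'ok', 'failed', or 'suspect'."""
    if mode is Mode.STRICT:
        # 0 = validation passed, anything else = validation failed.
        return "ok" if code == 0 else "failed"
    if mode is Mode.REFLECT:
        # 0 = clean exit regardless of queue state.
        # Assumption: a non-zero code here means the run itself broke.
        return "ok" if code == 0 else "failed"
    if mode is Mode.DAILY_AUDIT:
        # Inverted convention: 1 means the audit found issues (it did its job),
        # 0 means it found nothing, which is flagged for a closer look.
        if code == 1:
            return "ok"
        if code == 0:
            return "suspect"
        return "failed"  # assumption: other codes are genuine failures
    raise ValueError(f"unknown mode: {mode!r}")


print(interpret_exit(Mode.DAILY_AUDIT, 0))
```

The point of the sketch is that the same raw code maps to different meanings per mode, so the analyzer needs the mode as an input and cannot judge the exit code alone.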

Impact. This single convention change allowed the cron analyzer to distinguish genuine failures from normal exits. Previously, all exit 0 values were treated identically, leaving suspect cases undetected.

2. Automatic Warning Resolution

Warnings that do not recur within 10 days are automatically resolved. The underlying assumption is simple: real problems come back.

How it works. One-off noise — transient network blips, external API rate-limit spikes — does not recur, so it should not remain in the long-term tracking queue. The 10-day window covers at least one retry cycle for periodic cron and batch jobs.

Limitation. Intermittent failures with a recurrence interval exceeding 10 days (e.g., monthly) may be missed by auto-resolution. A separate monthly accumulation counter compensates for this gap.
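The resolution rule and the monthly counter could be sketched roughly as follows. `WarningTracker`, `TrackedWarning`, and the `(key, "YYYY-MM")` counter layout are hypothetical, and the post does not say whether the 10-day boundary is inclusive, so `>=` here is an assumption.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta

RESOLVE_AFTER = timedelta(days=10)  # window from the post


@dataclass
class TrackedWarning:
    key: str
    last_seen: datetime
    resolved: bool = False


class WarningTracker:
    def __init__(self) -> None:
        self.warnings: dict[str, TrackedWarning] = {}
        # (warning key, "YYYY-MM") -> occurrences; survives auto-resolution.
        self.monthly_counts: Counter = Counter()

    def record(self, key: str, now: datetime) -> None:
        # A recurrence reopens the warning and refreshes its timestamp.
        self.warnings[key] = TrackedWarning(key, last_seen=now)
        self.monthly_counts[(key, now.strftime("%Y-%m"))] += 1

    def auto_resolve(self, now: datetime) -> list[str]:
        """Resolve every open warning that has not recurred within the window."""
        resolved = []
        for w in self.warnings.values():
            if not w.resolved and now - w.last_seen >= RESOLVE_AFTER:
                w.resolved = True
                resolved.append(w.key)
        return resolved


tracker = WarningTracker()
tracker.record("net-blip", datetime(2026, 3, 1))
tracker.record("disk-full", datetime(2026, 3, 10))
print(tracker.auto_resolve(datetime(2026, 3, 12)))
```

Because `monthly_counts` is never cleared by `auto_resolve`, a warning that returns once a month still leaves a visible trail even though each occurrence is resolved on its own.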

3. page_gap 8 → 50

A chunk-linking parameter in the memory/ directory. Even when topics match, chunks beyond the threshold distance are not merged into the same graph.

How it works. page_gap=8 requires a topic to appear continuously within 8 chunks — a strict setting. In long sessions where topics resurface intermittently, the graph fragments into disconnected subgraphs. Raising to 50 allows the same topic scattered across a week's worth of sessions to be consolidated into a single graph.
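To make the fragmentation effect concrete, here is a small sketch of a gap-limited linking rule. It assumes that distance is measured in chunk indices and that a chunk joins a topic's current group when it is at most `page_gap` chunks after that group's last member. The real linker in memory/ may define distance differently, and `link_chunks` is a hypothetical name.

```python
def link_chunks(topics: list[str], page_gap: int) -> list[list[int]]:
    """Group chunk indices by topic, splitting a topic whenever the distance
    to its previous occurrence exceeds page_gap."""
    groups: list[list[int]] = []
    open_group: dict[str, list[int]] = {}  # topic -> group still being extended
    for idx, topic in enumerate(topics):
        group = open_group.get(topic)
        if group is not None and idx - group[-1] <= page_gap:
            group.append(idx)
        else:
            # Too far from the last occurrence (or first sighting): new subgraph.
            group = [idx]
            groups.append(group)
            open_group[topic] = group
    return groups


# "deploy" appears at chunk 0 and again at chunk 20, with other talk in between.
session = ["deploy"] + ["other"] * 19 + ["deploy"]
print(len(link_chunks(session, page_gap=8)))   # the two deploy chunks stay separate
print(len(link_chunks(session, page_gap=50)))  # the two deploy chunks are merged
```

With `page_gap=8` the two "deploy" chunks land in separate groups; with `page_gap=50` they share one, which is the consolidation described above.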

Impact. Recall hit-rate improved noticeably. The trade-off is increased false-positive connections, which are filtered at the topic-matching layer (Resolution L2/L1/L0, §5).

4. Heartbeat cold-start 60s → 120s

The first-call timeout ceiling. Covered in Part 8 but worth reiterating.

How it works. Most cold-start timeouts originate from model loading time — not network latency or queue saturation. The dominant cause is the one-time memory load on first invocation.
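A minimal sketch of a heartbeat check that gives the first call a separate ceiling. `classify_heartbeat` and its labels are hypothetical, and the 30-second warm-call timeout is an invented placeholder, since the post only specifies the first-call ceiling.

```python
COLD_START_TIMEOUT_S = 120.0  # raised from 60.0 in the post
WARM_TIMEOUT_S = 30.0         # assumption: not stated in the post


def classify_heartbeat(
    elapsed_s: float,
    is_first_call: bool,
    cold_start_timeout_s: float = COLD_START_TIMEOUT_S,
    warm_timeout_s: float = WARM_TIMEOUT_S,
) -> str:
    """Return 'ok' or 'timeout' for a single heartbeat."""
    # The first call pays the one-time model load, so it gets the larger ceiling.
    limit = cold_start_timeout_s if is_first_call else warm_timeout_s
    return "ok" if elapsed_s <= limit else "timeout"


# A 90-second first call: flagged under the old ceiling, accepted under the new one.
print(classify_heartbeat(90.0, True, cold_start_timeout_s=60.0))
print(classify_heartbeat(90.0, True))
```

Keeping the cold-start ceiling as its own parameter means the steady-state timeout does not have to be loosened to absorb model loading time.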

Impact. A single line change from 60s to 120s eliminated false-positive alerts during overnight hours. The count of heartbeats misclassified as failures dropped to near zero.

5. Topic-Cued Recall TTL 60min + Resolution L2/L1/L0

Cache policy for retrieval results. A simple 60-minute TTL cache is combined with a resolution-based update mechanism.

Resolution tiers.

L2 = exact match
L1 = topic match
L0 = keyword match

How it works. If the same topic is recalled within 60 minutes, the cache is used. However, if the cached hit has low resolution (L0/L1), a 0.3x score penalty is applied, preserving the opportunity for a higher-precision result to replace it. This structure converts a simple TTL cache into an adaptive cache.
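One way to read this mechanism is sketched below. It interprets the "0.3x score penalty" as multiplying the score of an L0/L1 hit by 0.3, which is an assumption. `Hit`, `RecallCache`, and `offer` are hypothetical names, and timestamps are passed in explicitly to keep the example deterministic.

```python
from dataclasses import dataclass
from typing import Optional

TTL_SECONDS = 60 * 60    # 60-minute TTL from the post
LOW_RES_PENALTY = 0.3    # assumption: applied as a multiplier to L0/L1 hits


@dataclass
class Hit:
    payload: str
    score: float
    resolution: int      # 2 = exact, 1 = topic, 0 = keyword
    stored_at: float     # seconds


def effective_score(hit: Hit) -> float:
    # Low-resolution hits are discounted so a sharper result can displace them.
    return hit.score if hit.resolution == 2 else hit.score * LOW_RES_PENALTY


class RecallCache:
    def __init__(self, ttl: float = TTL_SECONDS) -> None:
        self.ttl = ttl
        self._hits: dict[str, Hit] = {}

    def get(self, topic: str, now: float) -> Optional[Hit]:
        hit = self._hits.get(topic)
        if hit is None or now - hit.stored_at > self.ttl:
            self._hits.pop(topic, None)  # expired or missing
            return None
        return hit

    def offer(self, topic: str, candidate: Hit) -> bool:
        """Store the candidate if the slot is empty, expired, or outscored."""
        current = self._hits.get(topic)
        expired = (
            current is not None
            and candidate.stored_at - current.stored_at > self.ttl
        )
        if (
            current is None
            or expired
            or effective_score(candidate) > effective_score(current)
        ):
            self._hits[topic] = candidate
            return True
        return False


cache = RecallCache()
cache.offer("deploy", Hit("keyword hit", 0.9, 0, stored_at=0.0))
cache.offer("deploy", Hit("exact hit", 0.5, 2, stored_at=600.0))
print(cache.get("deploy", now=1200.0).payload)
```

In this reading, a keyword hit with a high raw score still loses to a modest exact match inside the TTL window, which is what keeps the cache from pinning a low-precision result for the full hour.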

Applicability. Generalizes to any agent system with a retrieval layer. Adjust the TTL value to match the domain's topic-switching frequency.

6. softThresholdTokens 6000 — Early Compression

Context compression trigger. Fire it well before the limit, not near it.

How it works. Triggering compression at the limit means too much content is compressed at once, increasing information loss. Triggering at the 6000-token threshold reduces the volume compressed in each pass, which reduces semantic loss per compression event.
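The volume argument can be checked with a toy simulation. Everything except the 6000-token threshold is invented for illustration: the 1000-token turns, the 30000-token "near the limit" trigger, and the assumption that each pass halves the context. `simulate_compression` is a hypothetical helper.

```python
SOFT_THRESHOLD_TOKENS = 6000  # value from the post


def simulate_compression(
    turn_tokens: int, turns: int, trigger: int, keep_ratio: float = 0.5
) -> list[int]:
    """Return the context size at each compression pass.

    Assumption: a pass shrinks the whole context to keep_ratio of its size.
    """
    context = 0
    passes = []
    for _ in range(turns):
        context += turn_tokens
        if context >= trigger:
            passes.append(context)               # volume handled in this pass
            context = int(context * keep_ratio)  # what survives compression
    return passes


early = simulate_compression(1000, 40, trigger=SOFT_THRESHOLD_TOKENS)
late = simulate_compression(1000, 40, trigger=30000)
print(len(early), max(early))  # many small passes
print(len(late), max(late))    # one large pass
```

Under these toy numbers the early trigger runs many passes of about 6000 tokens each, while the late trigger runs a single pass over 30000 tokens. The simulation says nothing about semantic loss itself, only about how much text each pass has to summarize.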

Impact. Compression degraded model response quality less often. Because early compression runs in smaller increments, cumulative loss per session follows a gentle curve rather than a linear ramp.

7. Retain 4 Categories + c=X Confidence

Two metadata axes are attached to stored entries: a category tag and a confidence score.

Category tags.

W = world fact
B = behavior pattern
O = opinion
S = source pointer

Confidence c=. A value between 0 and 1. Combined with confidence-decay, it decreases automatically over time.

How it works. The decay function operates differently per category — old opinions (O) weaken automatically; old facts (W) have a decay coefficient near zero and remain nearly stable. This distinction enforces at the system level that facts are stable and interpretations are subject to review.
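A minimal sketch of per-category decay, assuming the exponential form mentioned under Open Questions:

$c(t) = c_0 \cdot e^{-\lambda_{cat} \, t}$

The per-day rates below are invented placeholders chosen only to show the contrast between W and O. They are not the values used in production, and `decayed_confidence` is a hypothetical name.

```python
import math

# Hypothetical per-day decay rates; only the ordering (W near zero, O fastest)
# reflects the post.
DECAY_PER_DAY = {
    "W": 0.0005,  # world fact: nearly stable
    "B": 0.01,    # behavior pattern
    "O": 0.05,    # opinion: weakens quickly
    "S": 0.005,   # source pointer
}


def decayed_confidence(c0: float, category: str, age_days: float) -> float:
    """Exponentially decay an initial confidence c0 in [0, 1]."""
    if not 0.0 <= c0 <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    rate = DECAY_PER_DAY[category]
    return c0 * math.exp(-rate * age_days)


for tag in ("W", "O"):
    print(tag, round(decayed_confidence(0.8, tag, age_days=90), 3))
```

With these placeholder rates, a 90-day-old fact keeps most of its confidence while a 90-day-old opinion is close to zero, so the opinion drops out of ranking without anyone deleting it.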

Common Pattern: Making Room for Not-Knowing

The single principle threading through all 7 lessons:

Give the system a designated place to express that it does not know.

exit-code split expresses: "is this a failure or a normal exit?"
c= confidence expresses: "how certain is this information?"
Resolution L0/L1/L2 expresses: "how precise was this retrieval?"

A system with dedicated representation for uncertainty becomes capable of self-inspection. An interface that does not hide uncertainty is the practical starting point of observability.

Portability: Hermes Port Mapping

All 7 lessons are backend-agnostic. Mapping to Hermes equivalents:

Lesson	Hermes Target
exit-code convention	cron payload
page_gap	memcore parameter
cold-start 120s	Hermes hook
TTL 60min + Resolution	prefetch cache
softThreshold 6000	context_compressor config
Retain 4 categories	ingest pipeline
Warning auto-resolution	audit loop

All are structurally portable with a 1:1 mapping.

Summary: Large Refactor vs. Single-Line Fix

Large refactors typically introduce larger bugs. The most effective improvements during long-term operation were single-line threshold or definition changes. Three properties they share:

Locality — the change scope is within one file and one line.
Semantic-layer change — not merely changing a value, but redefining what that value means.
Observability gain — logs carry more information after the change than before.

Open Questions

Are 4 category tags (W/B/O/S) sufficient? Long-term operation may surface domains requiring finer granularity.
page_gap=50 is optimal at current operating scale, but whether the same value holds for domains with different chunk density (e.g., high-frequency conversation vs. low-frequency documents) requires re-measurement.
The decay curve for c= confidence is currently exponential. A different decay shape (e.g., a step function) may be more accurate for certain categories.

Values in this post should be treated as starting points. Adjust to the domain characteristics of the target system.

Series overview: Series index

MaJu Tech Notes