"Long-Running Agents (2/4) — Designing Handoff Structure So Work Survives Context Breaks"

It is easy to frame long-running agents as a memory problem. That framing is incomplete. In practice, the first requirement is not "how do we remember everything?" but "how does the next session know where to resume?"


Key Takeaways

  • Long-running agent work usually fails not because the model is weak, but because the system has no clean resume structure when a session breaks.
  • That is why the first design unit is not generic memory. It is progress files, handoff notes, stage boundaries, and resume rules.
  • A strong handoff does not try to replay the whole past. It gives the next session an immediately actionable state summary.
  • Memory comes later. First, you need durable records that tell the system what happened and what comes next.
  • Operationally, long-running quality should be judged less by eloquence and more by recovery after interruption.

1. The real problem is discontinuity

Once an agent needs to work for 30 minutes, two hours, or several days, the shape of the problem changes. Context windows are finite, sessions end, people revise decisions midstream, and tool state may move underneath the task.

At that point, many people jump straight to "long-term memory." But a more immediate problem usually appears first:

  • nobody knows exactly where the task stopped
  • prior decisions are scattered
  • the next session does not know what to read first
  • completion criteria are unclear, so the work loops

In other words, the primary bottleneck is not storage depth. It is missing handoff structure.

2. Why "handoff before memory"

The Chapter 8 and Chapter 15 adaptation notes in sources/260518_하네스엔지니어링_15장_블로그활용노트.md keep repeating the same lesson from long-running agent cases: a new session often performs better when it starts from one structured handoff artifact rather than trying to reconstruct the entire conversation history.

The principle is simple:

Memory preserves more of the past. Handoff narrows the next action.

For long-running operations, the second function matters first. The next session does not need everything. It needs the smallest durable state that allows work to continue correctly.

3. A good handoff is not a chat summary

If handoff is treated as a rough recap, it becomes weak fast. A good handoff should communicate not just what was discussed, but what state the work is in now.

At minimum, it helps to include:

Field                 Purpose
Current goal          Fixes the target deliverable
Completed stages      Makes progress explicit
Next action           Tells the next session what to do first
Verification state    Marks what is still unverified
Risks / blockers      Preserves unresolved issues
Reference paths       Points to the files the next session must read

With these six fields, a new session often does not need to re-read hundreds of turns.
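To make the six fields concrete, here is a minimal sketch of a handoff note as a small Python data structure saved to JSON. The class name, field names, and JSON format are illustrative assumptions, not the format this workspace actually uses.

```python
import json
from dataclasses import dataclass, field, asdict
from pathlib import Path


@dataclass
class Handoff:
    """Illustrative handoff note. Field names are assumptions, not a standard."""
    current_goal: str
    completed_stages: list[str] = field(default_factory=list)
    next_action: str = ""
    unverified: list[str] = field(default_factory=list)        # verification state
    risks: list[str] = field(default_factory=list)             # risks / blockers
    reference_paths: list[str] = field(default_factory=list)   # must-read files


def save_handoff(note: Handoff, path: Path) -> None:
    """Write the note as JSON, creating the parent directory if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    text = json.dumps(asdict(note), indent=2, ensure_ascii=False)
    path.write_text(text, encoding="utf-8")


def load_handoff(path: Path) -> Handoff:
    """Read a note back so a new session can start from it."""
    return Handoff(**json.loads(path.read_text(encoding="utf-8")))
```

A new session would call load_handoff first and act on next_action before reading anything else. JSON is only one option; a short structured text file with the same six headings would serve the same purpose.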

4. Resume requires stage boundaries

Another reason long-running work drifts is that phases blur together. Research, planning, implementation, and verification get mixed into one continuous stream. Once interrupted, the system cannot tell where to restart.

That is why it usually helps to break long work into stages like:

  1. research
  2. scope lock
  3. draft or implementation
  4. verification
  5. handoff or done

If each stage writes something durable before moving on, interruptions cost far less. This aligns naturally with the runner-style logic described in drafts/blog/260429_harness_series_06_evaluation_ops_en.md.
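As a rough sketch of "write something durable before moving on", the runner below records each finished stage in a checkpoint file and skips recorded stages on the next run. The stage names mirror the list above; the function names, the checkpoint format, and the handler mapping are hypothetical.

```python
import json
from pathlib import Path
from typing import Callable

# Stage names follow the list in this section; identifiers are illustrative.
STAGES = ["research", "scope_lock", "draft", "verification", "handoff"]


def load_done(checkpoint: Path) -> list[str]:
    """Return the stages already recorded as complete (empty if no file yet)."""
    if not checkpoint.exists():
        return []
    return json.loads(checkpoint.read_text(encoding="utf-8"))["done"]


def run_stages(checkpoint: Path, handlers: dict[str, Callable[[], None]]) -> list[str]:
    """Run stages in order, skipping completed ones.

    The checkpoint is rewritten after every stage, so an exception in a
    later stage does not lose earlier progress. Returns the stages that
    ran during this call.
    """
    done = load_done(checkpoint)
    ran = []
    for stage in STAGES:
        if stage in done:
            continue
        handlers[stage]()  # may raise, which models an interrupted session
        done.append(stage)
        checkpoint.write_text(json.dumps({"done": done}), encoding="utf-8")
        ran.append(stage)
    return ran
```

If the draft handler fails, the checkpoint still lists research and scope_lock, and the next call to run_stages starts at draft. The sketch assumes each stage is safe to rerun from its beginning, since only stage boundaries are recorded.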

5. Progress files are cheaper than memory and more direct for operations

Progress files matter because they require very little interpretation. Memory retrieval must decide what to bring back. A progress file states the explicit operational status of the current job.

Even this is enough to improve resumption quality:

  • current stage
  • completed checklist
  • remaining tasks
  • items needing verification
  • last touched file or document

This is less glamorous than long-term memory, but far more useful when the goal is simply to continue the work correctly.
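A progress file can be very small. The sketch below writes the five items listed above as JSON and turns them into a short resume briefing. The key names and the briefing format are assumptions made for illustration. The write goes through a temporary file and os.replace so that an interruption during the write should not leave a half-written progress file behind.

```python
import json
import os
import tempfile
from pathlib import Path


def write_progress(path: Path, progress: dict) -> None:
    """Write the progress file via a temp file plus an atomic rename."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(progress, f, indent=2)
    os.replace(tmp, path)  # replaces the old file in one step


def resume_brief(path: Path) -> str:
    """Render the progress file as a short briefing for the next session."""
    p = json.loads(path.read_text(encoding="utf-8"))
    lines = [f"Stage: {p['current_stage']}"]
    lines += [f"TODO: {task}" for task in p["remaining"]]
    lines += [f"VERIFY: {item}" for item in p["needs_verification"]]
    lines.append(f"Last touched: {p['last_touched']}")
    return "\n".join(lines)
```

The briefing deliberately leaves out the completed checklist: the next session needs to know what remains, while the completed items stay in the file for anyone who wants to audit them.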

6. What this looks like in our repository

This workspace does not treat long-running operations as one memory-system problem. It already distributes the job across multiple external artifacts.

Role                   Example path          Function
Current scope          tasks/plan.md         fixes what is being done now
Session handoff        tasks/handoffs/       passes state to the next session
Recovery snapshot      tasks/sessions/       preserves intermediate state
Document map           docs/memory-map.md    shows what should be read
Long-term knowledge    docs/                 accumulates reusable information

This makes the operating lesson clear: long-term memory is one layer, not the whole long-running strategy.
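One way a new session could use this layout is to build a short reading list before doing anything else. The sketch below uses the paths from the table, but the reading order, the .md extension for handoff files, and the idea that handoff file names sort chronologically (for example, date-prefixed names) are all assumptions.

```python
from pathlib import Path


def resume_reading_list(root: Path) -> list[Path]:
    """Return the files a new session should read first, in order.

    Assumed order: current scope, newest handoff, then the document map.
    Files that do not exist are left out of the list.
    """
    candidates = [root / "tasks" / "plan.md"]

    handoff_dir = root / "tasks" / "handoffs"
    if handoff_dir.is_dir():
        handoffs = sorted(handoff_dir.glob("*.md"))  # assumes sortable names
        if handoffs:
            candidates.append(handoffs[-1])

    candidates.append(root / "docs" / "memory-map.md")
    return [p for p in candidates if p.is_file()]
```

The long-term knowledge under docs/ is left out of the list on purpose: the document map is expected to point at whatever part of it the task needs.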

7. How this differs from a memory-systems article

drafts/blog/260429_harness_series_03_memory_systems_en.md focused on types of external memory and storage strategy. That article is about preserving information. This one is about recovering interrupted work.

The distinction looks like this:

Question           Memory systems                    Long-running ops / handoff
Main concern       what should be stored             where should work resume
Unit of storage    facts, events, procedures         current state, next action, verification state
Retrieval mode     retrieve and inject               read directly and continue
Failure signal     stale memory, noise, conflicts    failed resume, duplicate work, costly reconstruction

So this C2 entry is not repeating memory-system theory. It is defining an operational recovery pattern.

8. Practical starting point

You do not need heavy orchestration to start.

  1. Break long tasks into 3 to 5 stages.
  2. Standardize what gets written at the end of each stage.
  3. Put the must-read file list into the handoff.
  4. Separate completion criteria from still-unverified items.
  5. Test whether resume actually works after a forced interruption.

The important metric is not data volume. It is resumability under interruption.
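Step 5 can be checked mechanically. The sketch below is a toy version of a forced-interruption test: it processes a list of work items, raises a made-up Interrupted exception partway through, starts again from the state file, and checks that every item was handled exactly once. All names here are hypothetical, and the "work" is just appending to a log.

```python
import json
from pathlib import Path
from typing import Optional


class Interrupted(Exception):
    """Stands in for a session ending mid-task."""


def process_all(items: list[str], state_file: Path, log: list[str],
                fail_after: Optional[int] = None) -> None:
    """Process pending items, recording each one durably once it is done."""
    done = json.loads(state_file.read_text()) if state_file.exists() else []
    pending = [item for item in items if item not in done]
    for n, item in enumerate(pending):
        if fail_after is not None and n == fail_after:
            raise Interrupted(item)
        log.append(item)  # the actual work would happen here
        done.append(item)
        state_file.write_text(json.dumps(done))


def resume_survives_interruption(items: list[str], state_file: Path,
                                 fail_after: int) -> bool:
    """Interrupt once, resume once, and check nothing was skipped or repeated."""
    log: list[str] = []
    try:
        process_all(items, state_file, log, fail_after)
    except Interrupted:
        pass
    process_all(items, state_file, log)  # a fresh "session" resumes here
    return log == items
```

A real test would replace the log with the agent's actual side effects, but the shape stays the same: interrupt on purpose, resume from the durable state only, and compare against the uninterrupted result.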

9. Common failure modes

Treating the full conversation log as the handoff

Logs are rich, but they often do not narrow the next action enough.

Trying to solve everything with memory

Even strong long-term memory does not tell the system the current job state unless that state was explicitly externalized.

Omitting the next action

A status summary without a first action forces the next session back into planning mode.

Failing to record verification state

If unfinished validation is not written down, the resumed session cannot tell what was already checked, so it either repeats the analysis or treats unverified results as done.
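Some of these failure modes can be caught before a handoff is written out. The check below is a minimal sketch: it assumes the handoff is a dictionary with hypothetical keys next_action, unverified, and summary, and it uses an arbitrary length threshold as a rough signal that a raw log was pasted in place of a summary.

```python
def lint_handoff(note: dict, max_summary_chars: int = 4000) -> list[str]:
    """Return a list of problems found in a handoff note (empty means OK).

    Key names and the length threshold are illustrative assumptions.
    """
    problems = []

    # Omitting the next action sends the next session back into planning.
    if not str(note.get("next_action", "")).strip():
        problems.append("missing next action")

    # An empty list is fine: it says explicitly that nothing is unverified.
    if "unverified" not in note:
        problems.append("verification state not recorded")

    # A very long summary is probably a conversation log, not a handoff.
    if len(str(note.get("summary", ""))) > max_summary_chars:
        problems.append("summary is too long to narrow the next action")

    return problems
```

The second failure mode, relying on memory alone, cannot be linted this way. It shows up as the absence of any handoff file at all.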

10. A better completion test for long-running agents

A long-running agent is not "good" just because it spent a lot of time thinking. A better test is whether the system can answer yes to questions like:

  • when the session changes, is the next action still clear?
  • does the system avoid repeating the same work?
  • is unverified state preserved outside the context window?
  • can a human quickly understand the task state midstream?

Only after that foundation is in place does it make sense to add richer memory layers, automation, or multi-agent delegation.

References

  • docs/blog_series_하네스엔지니어링_총괄_design.md
  • sources/260518_하네스엔지니어링_15장_블로그활용노트.md
  • drafts/blog/260429_하네스시리즈03_메모리시스템_블로그.md
  • drafts/blog/260429_harness_series_06_evaluation_ops_en.md

This is Part 2 of the Operations, Evaluation, and Memory series. Previous: Agent evaluation harnesses. Next: guardrails for permissions, approvals, sandboxing, and audit logs.

Series overview: Harness Engineering Series Guide
