"Long-Running Agents (2/4) — Designing Handoff Structure So Work Survives Context Breaks"
It is easy to frame long-running agents as a memory problem. That framing is incomplete. In practice, the first requirement is not "how do we remember everything?" but "how does the next session know where to resume?"
Key Takeaways
- Long-running agent work usually fails not because the model is weak, but because the system has no clean resume structure when a session breaks.
- That is why the first design unit is not generic memory. It is progress files, handoff notes, stage boundaries, and resume rules.
- A strong handoff does not try to replay the whole past. It gives the next session an immediately actionable state summary.
- Memory comes later. First, you need durable records that tell the system what happened and what comes next.
- Operationally, long-running quality should be judged less by eloquence and more by recovery after interruption.
1. The real problem is discontinuity
Once an agent needs to work for 30 minutes, two hours, or several days, the shape of the problem changes. Context windows are finite, sessions end, people revise decisions midstream, and tool state may move underneath the task.
At that point, many people jump straight to "long-term memory." But a more immediate problem usually appears first:
- nobody knows exactly where the task stopped
- prior decisions are scattered
- the next session does not know what to read first
- completion criteria are unclear, so the work loops
In other words, the primary bottleneck is not storage depth. It is missing handoff structure.
2. Why "handoff before memory"
The Chapter 8 and Chapter 15 adaptation notes in sources/260518_하네스엔지니어링_15장_블로그활용노트.md keep repeating the same lesson from long-running agent cases: a new session often performs better when it starts from one structured handoff artifact rather than trying to reconstruct the entire conversation history.
The principle is simple:
Memory preserves more of the past. Handoff narrows the next action.
For long-running operations, the second function matters first. The next session does not need everything. It needs the smallest durable state that allows work to continue correctly.
3. A good handoff is not a chat summary
If handoff is treated as a rough recap, it becomes weak fast. A good handoff should communicate not just what was discussed, but what state the work is in now.
At minimum, it helps to include:
| Field | Purpose |
|---|---|
| Current goal | Fixes the target deliverable |
| Completed stages | Makes progress explicit |
| Next action | Tells the next session what to do first |
| Verification state | Marks what is still unverified |
| Risks / blockers | Preserves unresolved issues |
| Reference paths | Points to the files the next session must read |
With these six fields, a new session often does not need to re-read hundreds of turns.
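As a minimal sketch, the six fields can be held in one small record that round-trips through JSON. The class name `HandoffNote`, its field names, and the sample values are illustrative assumptions, not a format this series prescribes.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class HandoffNote:
    """Hypothetical six-field handoff record mirroring the table above."""
    current_goal: str
    completed_stages: List[str] = field(default_factory=list)
    next_action: str = ""
    unverified: List[str] = field(default_factory=list)
    risks: List[str] = field(default_factory=list)
    reference_paths: List[str] = field(default_factory=list)

    def to_json(self) -> str:
        # ensure_ascii=False keeps non-ASCII file names readable on disk.
        return json.dumps(asdict(self), indent=2, ensure_ascii=False)

    @classmethod
    def from_json(cls, text: str) -> "HandoffNote":
        return cls(**json.loads(text))


# Sample values are made up for illustration.
note = HandoffNote(
    current_goal="Publish part 2 of the series",
    completed_stages=["research", "scope lock"],
    next_action="Draft section 4 on stage boundaries",
    unverified=["table in section 6 matches the repository layout"],
    risks=["reviewer has not confirmed scope"],
    reference_paths=["tasks/plan.md", "docs/memory-map.md"],
)

# A new session rebuilds the same state from the serialized text alone.
restored = HandoffNote.from_json(note.to_json())
```

Keeping the record this small is deliberate: anything that does not fit one of the six fields probably belongs in a referenced file, not in the handoff itself.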
4. Resume requires stage boundaries
Another reason long-running work drifts is that phases blur together. Research, planning, implementation, and verification get mixed into one continuous stream. Once interrupted, the system cannot tell where to restart.
That is why it usually helps to break long work into stages like:
- research
- scope lock
- draft or implementation
- verification
- handoff or done
If each stage writes something durable before moving on, interruptions cost far less. This aligns naturally with the runner-style logic described in drafts/blog/260429_harness_series_06_evaluation_ops_en.md.
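One way to make "write something durable before moving on" concrete is sketched below. It assumes one JSON record per finished stage in a checkpoint directory; the `Stage` enum, `finish_stage`, `resume_stage`, and the file naming are hypothetical choices for illustration.

```python
import json
import tempfile
from enum import Enum
from pathlib import Path


class Stage(str, Enum):
    RESEARCH = "research"
    SCOPE_LOCK = "scope lock"
    DRAFT = "draft or implementation"
    VERIFICATION = "verification"
    DONE = "handoff or done"


ORDER = list(Stage)  # Enum iteration follows definition order.


def finish_stage(checkpoint_dir: Path, stage: Stage, summary: str) -> Stage:
    """Write a durable record for `stage`, then return the stage that follows."""
    index = ORDER.index(stage)
    record = {"stage": stage.value, "summary": summary}
    path = checkpoint_dir / f"{index:02d}_{stage.name.lower()}.json"
    path.write_text(json.dumps(record), encoding="utf-8")
    return ORDER[min(index + 1, len(ORDER) - 1)]


def resume_stage(checkpoint_dir: Path) -> Stage:
    """Return the first stage that has no record on disk."""
    done = {
        json.loads(p.read_text(encoding="utf-8"))["stage"]
        for p in checkpoint_dir.glob("*.json")
    }
    for stage in ORDER:
        if stage.value not in done:
            return stage
    return Stage.DONE


with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp)
    fresh = resume_stage(work)  # nothing recorded yet
    finish_stage(work, Stage.RESEARCH, "sources collected")
    finish_stage(work, Stage.SCOPE_LOCK, "outline frozen")
    # A new session needs only the directory to find its restart point.
    restart = resume_stage(work)
```

The restart point is derived from what is on disk, not from anything the previous session remembered, which is the property the stage boundaries are meant to provide.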
5. Progress files are cheaper than memory and more direct for operations
Progress files matter because they require very little interpretation. Memory retrieval must decide what to bring back. A progress file states the explicit operational status of the current job.
Even this is enough to improve resumption quality:
- current stage
- completed checklist
- remaining tasks
- items needing verification
- last touched file or document
This is less glamorous than long-term memory, but far more useful when the goal is simply to continue the work correctly.
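A progress file with those five items can be a single JSON document. The sketch below assumes that layout and writes it through a temporary file followed by `os.replace`, so a crash in the middle of a write leaves the previous version in place. The key names, the helper names, and the sample path `drafts/part2.md` are illustrative assumptions.

```python
import json
import os
import tempfile
from pathlib import Path


def save_progress(path: Path, progress: dict) -> None:
    """Write to a temp file in the same directory, then rename over the target."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(progress, handle, indent=2)
    os.replace(tmp_name, path)  # readers see the old file or the new one


def mark_done(path: Path, task: str) -> dict:
    """Move one task from remaining to completed and persist the change."""
    progress = json.loads(path.read_text(encoding="utf-8"))
    progress["remaining"].remove(task)
    progress["completed"].append(task)
    save_progress(path, progress)
    return progress


with tempfile.TemporaryDirectory() as tmp:
    progress_path = Path(tmp) / "progress.json"
    save_progress(progress_path, {
        "stage": "draft or implementation",
        "completed": ["outline"],
        "remaining": ["write section 5", "write section 6"],
        "needs_verification": ["example paths"],
        "last_touched": "drafts/part2.md",  # hypothetical file
    })
    after = mark_done(progress_path, "write section 5")
    on_disk = json.loads(progress_path.read_text(encoding="utf-8"))
```

Nothing here needs retrieval or ranking: the next session reads one file and knows the stage, what is left, and what still needs checking.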
6. What this looks like in our repository
This workspace does not treat long-running operations as one memory-system problem. It already distributes the job across multiple external artifacts.
| Role | Example path | Function |
|---|---|---|
| Current scope | tasks/plan.md | fixes what is being done now |
| Session handoff | tasks/handoffs/ | passes state to the next session |
| Recovery snapshot | tasks/sessions/ | preserves intermediate state |
| Document map | docs/memory-map.md | shows what should be read |
| Long-term knowledge | docs/ | accumulates reusable information |
This makes the operating lesson clear: long-term memory is one layer, not the whole long-running strategy.
7. How this differs from a memory-systems article
drafts/blog/260429_harness_series_03_memory_systems_en.md focused on types of external memory and storage strategy. That article is about preserving information. This one is about recovering interrupted work.
The distinction looks like this:
| Question | Memory systems | Long-running ops / handoff |
|---|---|---|
| Main concern | what should be stored | where should work resume |
| Unit of storage | facts, events, procedures | current state, next action, verification state |
| Retrieval mode | retrieve and inject | read directly and continue |
| Failure signal | stale memory, noise, conflicts | failed resume, duplicate work, costly reconstruction |
So this C2 entry is not repeating memory-system theory. It is defining an operational recovery pattern.
8. Practical starting point
You do not need heavy orchestration to start.
- Break long tasks into 3 to 5 stages.
- Standardize what gets written at the end of each stage.
- Put the must-read file list into the handoff.
- Separate completion criteria from still-unverified items.
- Test whether resume actually works after a forced interruption.
The important metric is not data volume. It is resumability under interruption.
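The forced-interruption check from the last bullet can be rehearsed in a few lines. The sketch below is a toy harness under stated assumptions: the step names, the `Interrupted` exception, and the `stop_after` parameter are all hypothetical stand-ins for a real session boundary.

```python
import json
import tempfile
from pathlib import Path

STEPS = ["research", "scope lock", "draft", "verification"]


class Interrupted(Exception):
    """Stands in for a session ending mid-task."""


def run(state_path, log, stop_after=None):
    """Run the remaining steps, persisting the finished list after each one."""
    done = json.loads(state_path.read_text()) if state_path.exists() else []
    for step in STEPS:
        if step in done:
            continue  # already recorded by an earlier session
        if stop_after is not None and len(done) >= stop_after:
            raise Interrupted(step)
        log.append(step)  # placeholder for the real work
        done.append(step)
        state_path.write_text(json.dumps(done))


with tempfile.TemporaryDirectory() as tmp:
    state = Path(tmp) / "state.json"
    executed = []
    try:
        run(state, executed, stop_after=2)  # first session dies after two steps
    except Interrupted:
        pass
    run(state, executed)  # second session resumes from disk

# Resume works if every step ran exactly once, in order.
resumed_cleanly = executed == STEPS
```

A check like this measures the thing the section cares about: whether the second session repeats or skips work, not how much was stored.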
9. Common failure modes
Treating the full conversation log as the handoff
Logs are rich, but they often do not narrow the next action enough.
Trying to solve everything with memory
Even strong long-term memory does not tell the system the current job state unless that state was explicitly externalized.
Omitting the next action
A status summary without a first action forces the next session back into planning mode.
Failing to record verification state
If unfinished validation is not written down, the resumed session cannot tell which checks already passed, so it repeats analysis unnecessarily.
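These four failure modes can be turned into a small pre-save check. The sketch below assumes a handoff stored as a dictionary with `completed_stages`, `next_action`, and `unverified` keys. Those key names and the 4000-character size threshold are arbitrary assumptions chosen for illustration.

```python
import json


def lint_handoff(handoff, max_chars=4000):
    """Return a list of problems; an empty list means the note passes these checks."""
    problems = []
    # A very large note is probably a pasted log, not a narrowed handoff.
    if len(json.dumps(handoff)) > max_chars:
        problems.append("too long: looks like a log, not a handoff")
    # Job state must be written out explicitly, not left to memory retrieval.
    if "completed_stages" not in handoff:
        problems.append("job state not externalized: completed_stages missing")
    if not str(handoff.get("next_action", "")).strip():
        problems.append("next action missing")
    # An empty list is fine; an absent key means nobody recorded the state.
    if "unverified" not in handoff:
        problems.append("verification state missing")
    return problems


good = {
    "completed_stages": ["research"],
    "next_action": "Lock scope with the reviewer",
    "unverified": [],
}
bloated = dict(good, transcript="a" * 5000)

good_report = lint_handoff(good)
empty_report = lint_handoff({})
bloated_report = lint_handoff(bloated)
```

The check distinguishes "nothing left to verify" (an empty list) from "verification state never recorded" (a missing key), which is the difference the last failure mode turns on.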
10. A better completion test for long-running agents
A long-running agent is not "good" just because it spent a lot of time thinking. A better test is whether the system can answer yes to questions like:
- When the session changes, is the next action still clear?
- Does the system avoid repeating the same work?
- Is unverified state preserved outside the context window?
- Can a human quickly understand the task state midstream?
Only after that foundation is in place does it make sense to add richer memory layers, automation, or multi-agent delegation.
References
- docs/blog_series_하네스엔지니어링_총괄_design.md
- sources/260518_하네스엔지니어링_15장_블로그활용노트.md
- drafts/blog/260429_하네스시리즈03_메모리시스템_블로그.md
- drafts/blog/260429_harness_series_06_evaluation_ops_en.md
This is Part 2 of the Operations, Evaluation, and Memory series. Previous: Agent evaluation harnesses. Next: guardrails for permissions, approvals, sandboxing, and audit logs.
Series overview: Harness Engineering Series Guide