"Long-Running Agents (2/4) — Designing Handoff Structure So Work Survives Context Breaks"

May 18, 2026

It is easy to frame long-running agents as a memory problem. That framing is incomplete. In practice, the first requirement is not "how do we remember everything?" but "how does the next session know where to resume?"

Key Takeaways

Long-running agent work usually fails not because the model is weak, but because the system has no clean resume structure when a session breaks.
That is why the first design unit is not generic memory. It is progress files, handoff notes, stage boundaries, and resume rules.
A strong handoff does not try to replay the whole past. It gives the next session an immediately actionable state summary.
Memory comes later. First, you need durable records that tell the system what happened and what comes next.
Operationally, long-running quality should be judged less by eloquence and more by recovery after interruption.

1. The real problem is discontinuity

Once an agent needs to work for 30 minutes, two hours, or several days, the shape of the problem changes. Context windows are finite, sessions end, people revise decisions midstream, and tool state may move underneath the task.

At that point, many people jump straight to "long-term memory." But a more immediate problem usually appears first:

nobody knows exactly where the task stopped
prior decisions are scattered
the next session does not know what to read first
completion criteria are unclear, so the work loops

In other words, the primary bottleneck is not storage depth. It is missing handoff structure.

2. Why "handoff before memory"

The Chapter 8 and Chapter 15 adaptation notes in sources/260518_하네스엔지니어링_15장_블로그활용노트.md keep repeating the same lesson from long-running agent cases: a new session often performs better when it starts from one structured handoff artifact rather than trying to reconstruct the entire conversation history.

The principle is simple:

Memory preserves more of the past. Handoff narrows the next action.

For long-running operations, the second function matters first. The next session does not need everything. It needs the smallest durable state that allows work to continue correctly.

3. A good handoff is not a chat summary

If handoff is treated as a rough recap, it becomes weak fast. A good handoff should communicate not just what was discussed, but what state the work is in now.

At minimum, it helps to include:

Field	Purpose
Current goal	Fixes the target deliverable
Completed stages	Makes progress explicit
Next action	Tells the next session what to do first
Verification state	Marks what is still unverified
Risks / blockers	Preserves unresolved issues
Reference paths	Points to the files the next session must read

With these six fields, a new session often does not need to re-read hundreds of turns.
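As a minimal sketch, the six fields above can be written down as a small structured record instead of free-form prose. The class name `Handoff`, the attribute names, and the choice of JSON are assumptions made for illustration; they are not the format used in this repository.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Handoff:
    """One handoff note. Attribute names are a hypothetical mapping of the table fields."""
    current_goal: str                                          # Current goal
    completed_stages: list[str] = field(default_factory=list)  # Completed stages
    next_action: str = ""                                      # Next action
    unverified: list[str] = field(default_factory=list)        # Verification state
    risks: list[str] = field(default_factory=list)             # Risks / blockers
    reference_paths: list[str] = field(default_factory=list)   # Reference paths

    def to_json(self) -> str:
        # ensure_ascii=False keeps non-ASCII file names readable in the note.
        return json.dumps(asdict(self), ensure_ascii=False, indent=2)

    @classmethod
    def from_json(cls, text: str) -> "Handoff":
        return cls(**json.loads(text))
```

A session would write `Handoff(...).to_json()` to disk before it ends, and the next session would start from `Handoff.from_json(...)` rather than from the conversation history.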

4. Resume requires stage boundaries

Another reason long-running work drifts is that phases blur together. Research, planning, implementation, and verification get mixed into one continuous stream. Once interrupted, the system cannot tell where to restart.

That is why it usually helps to break long work into stages like:

research
scope lock
draft or implementation
verification
handoff or done

If each stage writes something durable before moving on, interruptions cost far less. This aligns naturally with the runner-style logic described in drafts/blog/260429_harness_series_06_evaluation_ops_en.md.
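One way to picture "write something durable before moving on" is a loop that records each stage as soon as it finishes and skips recorded stages on the next run. This is a sketch under stated assumptions: the stage identifiers, the `run_stages` function, and the JSON checkpoint layout are all hypothetical, and the runner described in the referenced draft may work differently.

```python
import json
from pathlib import Path

# Hypothetical identifiers for the five stages listed above.
STAGES = ["research", "scope_lock", "draft", "verification", "done"]


def load_completed(checkpoint: Path) -> list[str]:
    """Return the stages already recorded as finished (empty if no checkpoint yet)."""
    if not checkpoint.exists():
        return []
    return json.loads(checkpoint.read_text(encoding="utf-8"))["completed"]


def run_stages(checkpoint: Path, handlers: dict) -> list[str]:
    """Run stages in order, skipping recorded ones and recording each after it finishes."""
    completed = load_completed(checkpoint)
    for stage in STAGES:
        if stage in completed:
            continue  # finished in an earlier session
        handlers[stage]()  # may raise if the session is interrupted
        completed.append(stage)
        checkpoint.write_text(json.dumps({"completed": completed}), encoding="utf-8")
    return completed
```

If a handler raises partway through, the checkpoint still lists every stage that finished before it, so calling `run_stages` again restarts at the interrupted stage rather than at research.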

5. Progress files are cheaper than memory and more direct for operations

Progress files matter because they require very little interpretation. Memory retrieval must decide what to bring back. A progress file states the explicit operational status of the current job.

Even this is enough to improve resumption quality:

current stage
completed checklist
remaining tasks
items needing verification
last touched file or document

This is less glamorous than long-term memory, but far more useful when the goal is simply to continue the work correctly.
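A progress file only helps if an interruption cannot leave it half-written. The sketch below writes to a temporary file and then renames it over the target; `os.replace` is atomic on POSIX systems when both paths are on the same filesystem. The function names and the JSON keys in the usage note are assumptions for illustration.

```python
import json
import os
import tempfile
from pathlib import Path


def write_progress(path: Path, progress: dict) -> None:
    """Write the progress record via a temp file in the same directory, then rename."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(progress, f, ensure_ascii=False, indent=2)
    os.replace(tmp_name, path)  # swaps in the new file in one step


def read_progress(path: Path) -> dict:
    """Load the last fully written progress record."""
    return json.loads(path.read_text(encoding="utf-8"))
```

A record might carry keys such as `current_stage`, `completed`, `remaining`, `needs_verification`, and `last_touched`, matching the five items listed above.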

6. What this looks like in our repository

This workspace does not treat long-running operations as one memory-system problem. It already distributes the job across multiple external artifacts.

Role	Example path	Function
Current scope	`tasks/plan.md`	fixes what is being done now
Session handoff	`tasks/handoffs/`	passes state to the next session
Recovery snapshot	`tasks/sessions/`	preserves intermediate state
Document map	`docs/memory-map.md`	shows what should be read
Long-term knowledge	`docs/`	accumulates reusable information

This makes the operating lesson clear: long-term memory is one layer, not the whole long-running strategy.
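Because each role lives at a known path, a session can check cheaply whether the artifacts it depends on exist before it starts. The sketch below uses the paths from the table; the `ARTIFACTS` mapping and the `missing_artifacts` helper are hypothetical and are not part of the repository.

```python
from pathlib import Path

# Role -> path relative to the workspace root, taken from the table above.
ARTIFACTS = {
    "current scope": "tasks/plan.md",
    "session handoff": "tasks/handoffs",
    "recovery snapshot": "tasks/sessions",
    "document map": "docs/memory-map.md",
    "long-term knowledge": "docs",
}


def missing_artifacts(root: Path) -> list[str]:
    """Return the roles whose file or directory does not exist under root."""
    return [role for role, rel in ARTIFACTS.items() if not (root / rel).exists()]
```

An empty result only says the files are present. Whether their contents are current is a separate check.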

7. How this differs from a memory-systems article

drafts/blog/260429_harness_series_03_memory_systems_en.md focused on types of external memory and storage strategy. That article is about preserving information. This one is about recovering interrupted work.

The distinction looks like this:

Question	Memory systems	Long-running ops / handoff
Main concern	what should be stored	where should work resume
Unit of storage	facts, events, procedures	current state, next action, verification state
Retrieval mode	retrieve and inject	read directly and continue
Failure signal	stale memory, noise, conflicts	failed resume, duplicate work, costly reconstruction

So this C2 entry is not repeating memory-system theory. It is defining an operational recovery pattern.

8. Practical starting point

You do not need heavy orchestration to start.

Break long tasks into 3 to 5 stages.
Standardize what gets written at the end of each stage.
Put the must-read file list into the handoff.
Separate completion criteria from still-unverified items.
Test whether resume actually works after a forced interruption.

The important metric is not data volume. It is resumability under interruption.
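The last step, testing resume after a forced interruption, can be rehearsed on a toy workload before it is applied to a real agent. In this sketch, `process`, `resume_drill`, the `Interrupted` exception, and the JSON journal are all hypothetical stand-ins: the drill interrupts a run on purpose, resumes it, and reports whether any item was processed twice.

```python
import json
from pathlib import Path


class Interrupted(Exception):
    """Raised to simulate a session ending mid-task."""


def process(items, journal: Path, work, interrupt_after=None) -> list:
    """Process items not yet in the journal, recording each one after it completes."""
    done = json.loads(journal.read_text(encoding="utf-8")) if journal.exists() else []
    pending = [item for item in items if item not in done]
    for n, item in enumerate(pending):
        if interrupt_after is not None and n == interrupt_after:
            raise Interrupted(item)
        work(item)
        done.append(item)
        journal.write_text(json.dumps(done), encoding="utf-8")
    return done


def resume_drill(items, journal: Path, interrupt_after: int) -> dict:
    """Interrupt one run on purpose, resume it, and measure completeness and duplicates."""
    seen = []
    try:
        process(items, journal, seen.append, interrupt_after)
    except Interrupted:
        pass
    process(items, journal, seen.append)  # the resumed session
    return {
        "all_done": sorted(seen) == sorted(items),
        "duplicates": len(seen) - len(set(seen)),
    }
```

A drill passes when `all_done` is true and `duplicates` is zero, which measures resumability directly instead of data volume.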

9. Common failure modes

Treating the full conversation log as the handoff

Logs are rich, but they often do not narrow the next action enough.

Trying to solve everything with memory

Even strong long-term memory does not tell the system the current job state unless that state was explicitly externalized.

Omitting the next action

A status summary without a first action forces the next session back into planning mode.

Failing to record verification state

If unfinished validation is not written down, resumed work repeats analysis unnecessarily.
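Several of these failure modes can be caught mechanically before a session ends. The following sketch checks a handoff note, represented here as a plain dict, for a missing next action, unrecorded verification state, an empty must-read list, and a size that suggests a pasted log. The key names and the 4000-character threshold are assumptions chosen for illustration.

```python
import json


def lint_handoff(note: dict, max_chars: int = 4000) -> list[str]:
    """Return a list of problems found in a handoff note (empty list means none found)."""
    problems = []
    if not str(note.get("next_action") or "").strip():
        problems.append("missing next action")
    # An empty list is acceptable: it states explicitly that nothing is unverified.
    if "unverified" not in note:
        problems.append("verification state not recorded")
    if not note.get("reference_paths"):
        problems.append("no must-read reference paths")
    # Arbitrary threshold (assumption): a very long note is probably a log.
    if len(json.dumps(note, ensure_ascii=False)) > max_chars:
        problems.append("note is long enough to be a log, not a handoff")
    return problems
```

Running a check like this at the end of each stage turns the failure modes above into something the harness can refuse to accept.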

10. A better completion test for long-running agents

A long-running agent is not "good" just because it spent a lot of time thinking. A better test is whether the system can answer yes to questions like:

When the session changes, is the next action still clear?
Does the system avoid repeating the same work?
Is unverified state preserved outside the context window?
Can a human quickly understand the task state midstream?

Only after that foundation is in place does it make sense to add richer memory layers, automation, or multi-agent delegation.

References

docs/blog_series_하네스엔지니어링_총괄_design.md
sources/260518_하네스엔지니어링_15장_블로그활용노트.md
drafts/blog/260429_하네스시리즈03_메모리시스템_블로그.md
drafts/blog/260429_harness_series_06_evaluation_ops_en.md

This is Part 2 of the Operations, Evaluation, and Memory series. Previous: Agent evaluation harnesses. Next: guardrails for permissions, approvals, sandboxing, and audit logs.

Series overview: Harness Engineering Series Guide


MaJu Tech Notes