"Harness Engineering Basics (2/4) — How AI Agents Actually Work: Context, Tool Calls, and the Agent Loop"
AI agents do not directly touch your computer. The model proposes the next action in text, the harness turns that proposal into a real tool call, and the result comes back as new context. Once you understand that loop, tool misuse, context blow-up, and long-task failures become much easier to explain.
Key Takeaways
- The default agent pattern is a repeated loop of observe → plan → act → verify → record.
- The model does not directly operate the filesystem or shell. It emits a tool-call instruction, and the harness executes it.
- System rules, developer rules, user requests, and tool results are not all the same kind of prompt. They are different input layers.
- The context window is not infinite storage. It is a shared budget, so placement matters as much as size.
- In practice, many agent quality problems come less from model intelligence than from loop design and input structure.
1. An agent is not a chatbot answer, but a repeated loop
A normal chatbot interaction is simple: the user asks, the model answers, and the turn ends. A work-oriented agent usually does more. It reads, searches, executes, checks, and then decides what to do next.
The pattern usually looks like this:
- read the goal and constraints
- decide what information is missing
- propose the next tool call
- read the tool result
- continue the loop until the completion condition is met
Compressed into one line:
Agent = model reasoning + harness execution + result reinjection
So an agent is not just a model with a personality. It is a loop that the model and harness run together.
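The one-line formula above can be sketched in a few lines of Python. This is a minimal illustration, not a real harness: `scripted_model`, `TOOLS`, and `run_agent` are hypothetical names, and the "model" is a hard-coded stand-in that asks for one file and then finishes.

```python
from typing import Callable

# Hypothetical tool registry: tool name -> function the harness may run.
TOOLS: dict[str, Callable[[str], str]] = {
    "read_file": lambda path: f"(contents of {path})",  # placeholder, no disk access
}

def scripted_model(context: list[str]) -> dict:
    """Stand-in for a real model: request one file, then declare completion."""
    if not any(line.startswith("tool_result:") for line in context):
        return {"type": "tool_call", "tool": "read_file", "arg": "README.md"}
    return {"type": "final", "text": "Done: summarized README.md"}

def run_agent(goal: str, model=scripted_model, max_steps: int = 5) -> tuple[str, list[str]]:
    context = [f"goal: {goal}"]
    for _ in range(max_steps):
        proposal = model(context)                          # model reasoning
        if proposal["type"] == "final":                    # completion condition met
            return proposal["text"], context
        result = TOOLS[proposal["tool"]](proposal["arg"])  # harness execution
        context.append(f"tool_result: {result}")           # result reinjection
    return "stopped: step limit reached", context

answer, trace = run_agent("summarize the README")
print(answer)
```

The step limit is a design choice worth noting: even a toy loop needs a stopping rule that does not depend on the model deciding it is finished.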
2. The model does not execute directly. The harness does
This is the first misconception to remove. When an agent reads a file or runs a shell command, the model is not directly manipulating the operating system.
What actually happens is simpler:
- the model outputs a structured request such as "read this file" or "run this command"
- the harness checks whether that call is allowed
- if allowed, the harness runs the tool
- the result is returned to the model as new context
- the model decides the next step
That is why tool calling is not just a feature. It is part of the operating loop.
If you miss this structure, you also miss:
- why permissions and sandboxing matter
- why the length and format of tool results affect the next answer
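The request, check, run, return sequence described in this section might look like the following. It is a sketch under assumptions: the JSON request shape, the `ALLOWED_TOOLS` policy, and the tool names are invented for illustration and do not reflect any specific vendor's tool-calling format.

```python
import json

ALLOWED_TOOLS = {"read_file"}  # hypothetical policy: read-only session

def read_file(path: str) -> str:
    return f"(contents of {path})"  # placeholder; a real tool would read from disk

def run_shell(command: str) -> str:
    return "(not executed in this sketch)"

REGISTRY = {"read_file": read_file, "run_shell": run_shell}

def handle_tool_call(raw: str) -> dict:
    """Parse the model's structured request, apply policy, run the tool, return a result message."""
    call = json.loads(raw)
    name, args = call["tool"], call.get("args", {})
    if name not in REGISTRY:
        return {"role": "tool", "ok": False, "content": f"unknown tool: {name}"}
    if name not in ALLOWED_TOOLS:
        # The refusal is returned as context too, so the model can plan around it.
        return {"role": "tool", "ok": False, "content": f"denied by policy: {name}"}
    return {"role": "tool", "ok": True, "content": REGISTRY[name](**args)}

allowed = handle_tool_call('{"tool": "read_file", "args": {"path": "notes.txt"}}')
denied = handle_tool_call('{"tool": "run_shell", "args": {"command": "rm -rf build"}}')
print(allowed["content"], "|", denied["content"])
```

In this sketch the model only ever produces the JSON string. Whether anything runs is decided entirely inside `handle_tool_call`, which is where permission and sandboxing rules would live.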
3. Agent input is layered, not flat
Calling all agent input a "prompt" hides important differences. In practice, several layers are combined into what the model sees.
| Input layer | Role |
|---|---|
| System instructions | highest-level operating and safety rules |
| Developer or project instructions | repository-specific workflow and policy |
| User request | the current task and outcome target |
| Tool results | facts and execution results from the outside world |
| Conversation history | the recent reasoning trail and intermediate state |
These layers should not be treated the same because their priorities and lifetimes differ.
- System rules should be stable.
- User requests can redirect the task.
- Tool results are useful but often short-lived.
- Conversation history is valuable, but it is the layer most likely to be compacted or summarized as the session grows.
If you ignore this layering, you often end up with bloated system prompts that try to do everything at once.
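One way to keep the layers distinct is to give each one a name and an explicit lifetime instead of concatenating everything into one string. The sketch below assumes a simple two-way lifetime (`persistent` or not). The `Layer` class and the example texts are hypothetical, and real harnesses usually have finer-grained rules.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Layer:
    name: str          # e.g. "system", "project", "user", "tool_result", "history"
    text: str
    persistent: bool   # True: survives across tasks; False: may be dropped or compacted

def render(layers):
    """Render each layer under its own label so the layers stay distinguishable."""
    return "\n\n".join(f"[{layer.name}]\n{layer.text}" for layer in layers)

def carry_over(layers):
    """Keep only the layers that should survive into the next task."""
    return [layer for layer in layers if layer.persistent]

layers = [
    Layer("system", "Never publish without review.", persistent=True),
    Layer("project", "Run the quality gates before committing.", persistent=True),
    Layer("user", "Fix the failing path check.", persistent=False),
    Layer("tool_result", "path check: 1 failure in docs/", persistent=False),
    Layer("history", "Tried renaming the file; check still fails.", persistent=False),
]
print(render(carry_over(layers)))
```

Only the system and project layers carry over here, which mirrors the point above: stable rules persist, while task text and tool output are short-lived.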
4. The context window is a shared budget
The series design document and reading notes both make the same point: a larger context is not automatically a better one. Multiple elements compete for the same space.
- instruction files
- the current user task
- reference documents
- tool results
- the model's own recent plans and outputs
So context behaves less like a giant archive and more like a workbench. Only the relevant material should be on it.
This matters for three reasons.
4.1 More material can reduce quality
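Long inputs bury important rules. Models also tend to use information in the middle of a long context less reliably than information near the beginning or end, so mid-context details are the easiest to lose.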
4.2 Tool results are expensive
Search results, long logs, and file dumps can consume a large part of the token budget for several subsequent turns.
4.3 Re-reading on demand is often stronger
Keeping file paths and fetch tools available is usually better than carrying every file inline at all times.
Good agents are not strong because they hold everything in memory. They are strong because they decide what stays outside and what comes in now.
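The shared-budget idea can be made concrete with a small packing routine. This is a sketch under loud assumptions: token cost is approximated as roughly four characters per token (real tokenizers differ), priorities are assigned by hand, and `pack_context` is a hypothetical helper rather than part of any library.

```python
def estimate_tokens(text: str) -> int:
    # Rough assumption: about 4 characters per token. Real tokenizers differ.
    return max(1, len(text) // 4)

def pack_context(items, budget):
    """items: list of (priority, label, text); a lower priority number is more important.

    Returns (included labels, deferred labels, tokens used).
    """
    included, deferred, used = [], [], 0
    for _priority, label, text in sorted(items, key=lambda item: item[0]):
        cost = estimate_tokens(text)
        if used + cost <= budget:
            included.append(label)
            used += cost
        else:
            deferred.append(label)  # stays outside the window; fetch on demand
    return included, deferred, used

items = [
    (0, "AGENTS.md", "x" * 400),   # about 100 tokens
    (1, "user task", "y" * 200),   # about 50 tokens
    (2, "build.log", "z" * 8000),  # about 2000 tokens, too large to inline
    (3, "notes", "n" * 100),       # about 25 tokens
]
included, deferred, used = pack_context(items, budget=300)
print(included, deferred, used)
```

The oversized log is deferred rather than truncated, so the agent keeps a pointer to it and can read the relevant part later instead of paying for all of it on every turn.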
5. Tool results are part of context too
If tool calls are only seen as external execution, the picture remains incomplete. Tool results come back as model input, so tool design immediately becomes context design.
Compare these two search tools:
- one returns title, date, source, and a short summary
- one dumps thousands of lines of raw text
Both are "search," but the second one can quickly damage the quality of the next few turns.
That is why good harnesses do more than expose tools.
- they cap output size
- they store large results externally
- they give the model summaries and pointers first
- they require a second fetch when details are truly needed
This is also where the A4 article on tool engineering begins.
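The four harness behaviors listed above can be sketched together. This is an illustration only: the 200-character cap, the in-memory `_STORE`, and the function names are hypothetical, and a plain head-of-output preview stands in for a real summary.

```python
import hashlib

MAX_INLINE_CHARS = 200        # hypothetical cap on what is reinjected inline
_STORE: dict[str, str] = {}   # stands in for external storage (disk, blob store)

def wrap_tool_result(output: str) -> dict:
    """Return small outputs inline; store large ones and return a preview plus a pointer."""
    if len(output) <= MAX_INLINE_CHARS:
        return {"inline": output, "ref": None, "total_chars": len(output)}
    ref = hashlib.sha256(output.encode("utf-8")).hexdigest()[:12]
    _STORE[ref] = output
    return {"inline": output[:MAX_INLINE_CHARS], "ref": ref, "total_chars": len(output)}

def fetch_detail(ref: str, start: int = 0, length: int = MAX_INLINE_CHARS) -> str:
    """Second fetch: return one bounded slice of a stored result."""
    return _STORE[ref][start:start + length]

long_log = "line of log output\n" * 500
wrapped = wrap_tool_result(long_log)
print(len(wrapped["inline"]), wrapped["total_chars"])
print(repr(fetch_detail(wrapped["ref"], start=200, length=19)))
```

Reporting `total_chars` alongside the preview tells the model how much it has not seen, which helps it decide whether a second fetch is worth the tokens.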
6. The simplest useful agent loop
Real systems vary, but for beginners the following model is enough.
| Step | Question | Harness role |
|---|---|---|
| Observe | what do we know and what is missing | assemble the right inputs |
| Plan | what should happen next | keep goals and constraints visible |
| Act | which tool should be called | check permissions and execute |
| Verify | did the result improve the state | provide tests, reviews, retry rules |
| Record | what must survive the turn or session | store logs, handoffs, state files |
This is useful because it makes failures diagnosable. Instead of saying "the model was bad," you can ask which stage of the loop was weak.
For example:
- observe failure: it never read the needed file
- plan failure: it ignored constraints and jumped into action
- act failure: it chose the wrong tool
- verify failure: it accepted a bad result
- record failure: the next session could not resume
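If each stage of a run is logged with a pass or fail flag, the diagnosis above becomes a lookup instead of a guess. The sketch below assumes such a trace already exists. The `Stage` enum and `first_weak_stage` are hypothetical names, and deciding what counts as a failure at each stage is left to the harness.

```python
from enum import Enum
from typing import Optional

class Stage(Enum):
    OBSERVE = "observe"
    PLAN = "plan"
    ACT = "act"
    VERIFY = "verify"
    RECORD = "record"

def first_weak_stage(trace) -> Optional[Stage]:
    """trace: list of (Stage, ok) pairs in loop order. Return the earliest stage that failed."""
    for stage, ok in trace:
        if not ok:
            return stage
    return None

trace = [
    (Stage.OBSERVE, False),  # the needed file was never read
    (Stage.PLAN, True),
    (Stage.ACT, True),
    (Stage.VERIFY, False),   # a bad result was accepted
    (Stage.RECORD, True),
]
weak = first_weak_stage(trace)
print(weak.value)  # prints "observe"
```

Returning the earliest failure is deliberate: a verify failure that follows an observe failure is often a symptom, not the cause.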
7. This maps directly to repository-native structures
In this repository, the agent loop already has familiar anchors.
| Loop element | Repository example |
|---|---|
| High-level instructions | AGENTS.md, CLAUDE.md |
| Scope reduction | tasks/plan.md, docs/memory-map.md |
| Execution boundary | publish restrictions, protected config/ |
| State preservation | tasks/handoffs/, tasks/sessions/ |
| Verification surface | quality gates, path checks, review steps |
Understanding the loop is therefore less about learning new jargon and more about recognizing which document or mechanism owns which part of agent behavior.
8. Common failure signals
Across the original notes and earlier drafts, several failure signals show up repeatedly when people do not understand how agents operate.
8.1 Everything is stuffed into one instruction file
Without layer separation, system rules and task-specific rules start to conflict.
8.2 Tool results are reinjected without limits
Large logs and raw search dumps pollute context and weaken later decisions.
8.3 The loop can act but cannot verify
The system reads and executes, but no test or review loop catches bad outcomes.
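A verification step can be as small as a check function plus a retry rule. The sketch below is one possible shape, not a prescribed design: `act_with_verification` is a hypothetical helper, and the candidate functions stand in for successive model attempts.

```python
def act_with_verification(act, verify, max_attempts=3):
    """Run act(), accept the result only if verify(result) passes, retry otherwise."""
    failures = []
    for attempt in range(1, max_attempts + 1):
        result = act(failures)        # earlier failure reasons are fed back in
        ok, reason = verify(result)
        if ok:
            return {"ok": True, "result": result, "attempts": attempt, "failures": failures}
        failures.append(reason)
    return {"ok": False, "result": None, "attempts": max_attempts, "failures": failures}

# Stand-ins for two successive attempts at writing an `add` function.
candidates = [lambda a, b: a - b, lambda a, b: a + b]

def act(failures):
    return candidates[min(len(failures), len(candidates) - 1)]

def verify(fn):
    return (fn(2, 3) == 5, "add(2, 3) did not return 5")

outcome = act_with_verification(act, verify)
print(outcome["ok"], outcome["attempts"], outcome["failures"])
```

The first candidate fails the check, the failure reason is recorded, and the second candidate is accepted. Without the `verify` call the loop would have returned the subtracting function as a success.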
8.4 Session continuity depends only on conversation history
Long work gets compacted or interrupted. Without handoff notes or state files, every new session starts from reconstruction.
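A handoff note does not need to be elaborate. The sketch below writes and reloads a small JSON state file. The field names (`done`, `next`, `notes`) and the file name `session.json` are assumptions made for illustration, and a temporary directory is used so the example does not touch a real `tasks/handoffs/` folder.

```python
import json
import tempfile
from pathlib import Path

def save_handoff(path: Path, state: dict) -> None:
    """Write the state that must survive the session."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(state, indent=2), encoding="utf-8")

def load_handoff(path: Path) -> dict:
    """Load the previous state, or start empty if no handoff exists yet."""
    if not path.exists():
        return {"done": [], "next": [], "notes": ""}
    return json.loads(path.read_text(encoding="utf-8"))

state = {
    "done": ["read failing test", "reproduced the error"],
    "next": ["patch the path check", "rerun quality gates"],
    "notes": "Do not edit protected config/.",
}

with tempfile.TemporaryDirectory() as tmp:
    handoff = Path(tmp) / "tasks" / "handoffs" / "session.json"
    save_handoff(handoff, state)
    resumed = load_handoff(handoff)

print(resumed["next"])
```

A new session that loads this file can begin at the first item under `next` instead of reconstructing progress from compacted conversation history.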
9. Minimum practical principles
You do not need to memorize a complicated architecture chart. The four principles below are enough to reason better about most agents.
- The model does not execute directly. The harness executes and returns results.
- Agent input is layered. Different layers have different roles and lifetimes.
- Context is a shared budget. Reducing and staging inputs often matters more than expanding them.
- An agent is a loop. A good loop includes verification and recording, not just action.
Once that is clear, the next topics naturally follow.
- A3 will focus on what belongs in instruction files versus external context.
- A4 will focus on how tools should be designed so the model can act more accurately.
References
- docs/blog_series_하네스엔지니어링_총괄_design.md
- sources/260518_하네스엔지니어링_15장_블로그활용노트.md
- drafts/blog/260429_harness_series_02_context_engineering_en.md
- WikiDocs, Chapter 2 notes from 하네스 엔지니어링 백과사전 (Harness Engineering Encyclopedia)
This is Part 2 of the Harness Engineering Basics series. Next: Why instruction structure and context design matter more than longer prompts.
Series overview: Harness Engineering Series Guide