"Harness Engineering Basics (2/4) — How AI Agents Actually Work: Context, Tool Calls, and the Agent Loop"

AI agents do not directly touch your computer. The model proposes the next action in text, the harness turns that proposal into a real tool call, and the result comes back as new context. Once you understand that loop, tool misuse, context blow-up, and long-task failures become much easier to explain.


Key Takeaways

  • The default agent pattern is a repeated loop of observe → plan → act → verify → record.
  • The model does not directly operate the filesystem or shell. It emits a tool-call instruction, and the harness executes it.
  • System rules, developer rules, user requests, and tool results are not all the same kind of prompt. They are different input layers.
  • The context window is not infinite storage. It is a shared budget, so placement matters as much as size.
  • In practice, many agent quality problems come less from model intelligence than from loop design and input structure.

1. An agent is not a chatbot answer, but a repeated loop

A normal chatbot interaction is simple: the user asks, the model answers, and the turn ends. A work-oriented agent usually does more. It reads, searches, executes, checks, and then decides what to do next.

The pattern usually looks like this:

  1. read the goal and constraints
  2. decide what information is missing
  3. propose the next tool call
  4. read the tool result
  5. continue the loop until the completion condition is met

Compressed into one line:

Agent = model reasoning + harness execution + result reinjection

So an agent is not just a model with a personality. It is a loop that the model and harness run together.
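
The one-line formula above can be made concrete with a small sketch. Everything here is an assumption for illustration: `fake_model` stands in for a real model call, `TOOLS` is a hypothetical tool registry, and the step limit is an arbitrary safety stop. It is not the loop of any particular product.

```python
from typing import Callable

# Hypothetical stand-in for a model: it reads the context and proposes an action.
def fake_model(context: list[str]) -> dict:
    if not any(line.startswith("tool_result:") for line in context):
        return {"type": "tool_call", "tool": "read_file", "arg": "notes.txt"}
    return {"type": "final", "text": "done"}

# Hypothetical tool registry. It belongs to the harness, not to the model.
TOOLS: dict[str, Callable[[str], str]] = {
    "read_file": lambda path: f"(contents of {path})",
}

def run_agent(goal: str, model=fake_model, max_steps: int = 5) -> tuple[str, list[str]]:
    context = [f"goal: {goal}"]
    for _ in range(max_steps):
        action = model(context)                        # model reasoning
        if action["type"] == "final":                  # completion condition met
            return action["text"], context
        result = TOOLS[action["tool"]](action["arg"])  # harness execution
        context.append(f"tool_result: {result}")       # result reinjection
    return "stopped: step limit reached", context

answer, context = run_agent("summarize notes.txt")
```

The model never touches `TOOLS` directly. It only returns a dictionary, and the loop decides what to do with it.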

2. The model does not execute directly. The harness does

This is the first misconception to remove. When an agent reads a file or runs a shell command, the model is not directly manipulating the operating system.

What actually happens is simpler:

  • the model outputs a structured request such as "read this file" or "run this command"
  • the harness checks whether that call is allowed
  • if allowed, the harness runs the tool
  • the result is returned to the model as new context
  • the model decides the next step

That is why tool calling is not just a feature. It is part of the operating loop.

If you miss this structure, you also miss:

  • why permissions and sandboxing matter
  • why the length and format of tool results affect the next answer
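
The check-then-execute sequence described in this section might look like the following sketch. The tool name `read_file`, the allowlist, and the workspace rule are hypothetical policies chosen for the example, not the behavior of a specific harness.

```python
import tempfile
from pathlib import Path

# Hypothetical policy: only these tool names may run at all.
ALLOWED_TOOLS = {"read_file"}

def execute_tool_call(call: dict, workspace: Path) -> dict:
    """Validate a model-proposed call, run it if allowed, and return a result."""
    name = call.get("tool")
    if name not in ALLOWED_TOOLS:
        return {"ok": False, "output": f"denied: tool {name!r} is not allowed"}
    # Sandbox rule (assumed): reads must stay inside the workspace directory.
    target = (workspace / call["path"]).resolve()
    if not target.is_relative_to(workspace.resolve()):
        return {"ok": False, "output": "denied: path is outside the workspace"}
    try:
        return {"ok": True, "output": target.read_text(encoding="utf-8")}
    except OSError as exc:
        return {"ok": False, "output": f"error: {exc}"}

with tempfile.TemporaryDirectory() as tmp:
    ws = Path(tmp)
    (ws / "notes.txt").write_text("hello", encoding="utf-8")
    allowed = execute_tool_call({"tool": "read_file", "path": "notes.txt"}, ws)
    escaped = execute_tool_call({"tool": "read_file", "path": "../secret.txt"}, ws)
    unknown = execute_tool_call({"tool": "delete_file", "path": "notes.txt"}, ws)
```

Denials are returned as ordinary results rather than raised as exceptions, so the model sees the refusal as new context and can choose a different step.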

3. Agent input is layered, not flat

Calling all agent input a "prompt" hides important differences. In practice, several layers are combined into what the model sees.

Input layer                       | Role
----------------------------------|---------------------------------------------------
System instructions               | highest-level operating and safety rules
Developer or project instructions | repository-specific workflow and policy
User request                      | the current task and outcome target
Tool results                      | facts and execution results from the outside world
Conversation history              | the recent reasoning trail and intermediate state

These layers should not be treated the same because their priorities and lifetimes differ.

  • System rules should be stable.
  • User requests can redirect the task.
  • Tool results are useful but often short-lived.
  • Conversation history is valuable but becomes a compression target.

If you ignore this layering, you often end up with bloated system prompts that try to do everything at once.
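
One way to keep the layers distinct is to tag each one with its lifetime before it is merged into the model input. The sketch below is a simplification under stated assumptions: the `Layer` class, the lifetime labels, and the sample contents are hypothetical, and conversation history is simply dropped on compaction, where a real harness would more likely summarize it.

```python
from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    content: str
    lifetime: str  # hypothetical labels: "stable", "task", or "short"

def build_input(layers: list[Layer]) -> str:
    """Join layers into one model input while keeping each layer labeled."""
    return "\n\n".join(f"[{layer.name}]\n{layer.content}" for layer in layers)

def compact(layers: list[Layer]) -> list[Layer]:
    """Drop short-lived layers, keep stable rules and the current task."""
    return [layer for layer in layers if layer.lifetime != "short"]

layers = [
    Layer("system", "Never publish without review.", "stable"),
    Layer("project", "Run the quality gate before handoff.", "stable"),
    Layer("user", "Fix the failing path check.", "task"),
    Layer("tool_result", "path check: 3 files missing", "short"),
    Layer("history", "Earlier turn: listed the docs folder.", "short"),
]
kept = [layer.name for layer in compact(layers)]
```

Because the lifetime lives on the layer rather than inside one large prompt string, compaction can remove stale tool output without touching the rules.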

4. The context window is a shared budget

The series design document and reading notes both make the same point: a bigger context is not automatically a better one. Multiple elements compete for the same space.

  • instruction files
  • the current user task
  • reference documents
  • tool results
  • the model's own recent plans and outputs

So context behaves less like a giant archive and more like a workbench. Only the relevant material should be on it.

This matters for three reasons.

4.1 More material can reduce quality

Long inputs bury important rules. Models also tend to use information in the middle of a long input less reliably than information near the start or end, so material placed there is easier to lose.

4.2 Tool results are expensive

Search results, long logs, and file dumps can consume a large part of the token budget for several subsequent turns.

4.3 Re-reading on demand is often stronger

Keeping file paths and fetch tools available is usually better than carrying every file inline at all times.

Good agents are not strong because they hold everything in memory. They are strong because they decide what stays outside and what comes in now.
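
The workbench idea can be sketched as a budget filler: higher-priority items go in first, and anything that does not fit stays outside to be fetched on demand. The word-count token estimate, the priorities, and the budget of 20 are rough assumptions for illustration only. Real tokenizers count differently.

```python
def estimate_tokens(text: str) -> int:
    # Crude assumption: one token per whitespace-separated word.
    return len(text.split())

def fill_context(items: list[tuple[int, str, str]], budget: int):
    """items are (priority, label, text); a lower priority number is placed first."""
    included, deferred, used = [], [], 0
    for _, label, text in sorted(items, key=lambda item: item[0]):
        cost = estimate_tokens(text)
        if used + cost <= budget:
            included.append(label)
            used += cost
        else:
            deferred.append(label)  # stays outside; re-read it when needed
    return included, deferred, used

items = [
    (0, "instructions", "follow repo rules"),
    (1, "task", "fix the failing test"),
    (2, "big_log", "line " * 500),
    (3, "file_path", "src/app.py"),
]
included, deferred, used = fill_context(items, budget=20)
```

Here the 500-word log is deferred while the one-word file path still fits, which is the trade the section describes: carry the pointer, not the payload.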

5. Tool results are part of context too

If tool calls are only seen as external execution, the picture remains incomplete. Tool results come back as model input, so tool design immediately becomes context design.

Compare these two search tools:

  • one returns title, date, source, and a short summary
  • one dumps thousands of lines of raw text

Both are "search," but the second one can quickly damage the quality of the next few turns.

That is why good harnesses do more than expose tools.

  • they cap output size
  • they store large results externally
  • they give the model summaries and pointers first
  • they require a second fetch when details are truly needed

This is also where the A4 article on tool engineering begins.
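
The four harness practices listed in this section can be combined in a small sketch: cap what is reinjected, store the full output externally, return a preview with a pointer, and require a second fetch for details. The function names, the 200-character cap, and the result fields are all hypothetical.

```python
import tempfile
from pathlib import Path

MAX_CHARS = 200  # hypothetical cap on what is reinjected into context

def package_tool_result(raw: str, store_dir: Path, name: str) -> dict:
    """Return a small preview for the model and keep the full output on disk."""
    if len(raw) <= MAX_CHARS:
        return {"preview": raw, "total_chars": len(raw), "pointer": None}
    path = store_dir / f"{name}.txt"
    path.write_text(raw, encoding="utf-8")
    return {"preview": raw[:MAX_CHARS], "total_chars": len(raw), "pointer": str(path)}

def fetch_detail(pointer: str, start: int, length: int) -> str:
    """Second fetch: read only the requested slice of the stored output."""
    return Path(pointer).read_text(encoding="utf-8")[start:start + length]

with tempfile.TemporaryDirectory() as tmp:
    raw_log = "\n".join(f"log line {i}" for i in range(1000))
    packaged = package_tool_result(raw_log, Path(tmp), "build_log")
    first_line = fetch_detail(packaged["pointer"], 0, 10)
    small = package_tool_result("ok", Path(tmp), "status")
```

A plain character cut is the simplest possible preview. A harness that can afford it might return a structured summary instead, but the pointer-plus-second-fetch shape stays the same.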

6. The simplest useful agent loop

Real systems vary, but for beginners the following model is enough.

Step    | Question                                 | Harness role
--------|------------------------------------------|-------------------------------------
Observe | what do we know and what is missing      | assemble the right inputs
Plan    | what should happen next                  | keep goals and constraints visible
Act     | which tool should be called              | check permissions and execute
Verify  | did the result improve the state         | provide tests, reviews, retry rules
Record  | what must survive the turn or session    | store logs, handoffs, state files

This is useful because it makes failures diagnosable. Instead of saying "the model was bad," you can ask which stage of the loop was weak.

For example:

  • observe failure: it never read the needed file
  • plan failure: it ignored constraints and jumped into action
  • act failure: it chose the wrong tool
  • verify failure: it accepted a bad result
  • record failure: the next session could not resume
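
As a rough illustration of stage-level diagnosis, the sketch below records one pass/fail check per stage and reports the earliest weak one. The `Stage` enum and the boolean checks are hypothetical. In a real run each check would come from logs or traces rather than hand-written values.

```python
from enum import Enum
from typing import Optional

class Stage(Enum):
    OBSERVE = "observe"
    PLAN = "plan"
    ACT = "act"
    VERIFY = "verify"
    RECORD = "record"

def first_weak_stage(checks: dict) -> Optional[Stage]:
    """Return the earliest stage whose check failed or is missing, else None."""
    for stage in Stage:  # Enum members iterate in definition order
        if not checks.get(stage, False):
            return stage
    return None

run_checks = {
    Stage.OBSERVE: True,   # the needed file was read
    Stage.PLAN: True,      # constraints were restated before acting
    Stage.ACT: True,       # the intended tool was called
    Stage.VERIFY: False,   # no test was run on the result
    Stage.RECORD: True,    # a handoff note was written
}
weak = first_weak_stage(run_checks)
```

With these sample values `weak` is `Stage.VERIFY`, which points the review at the missing test rather than at the model in general.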

7. This maps directly to repository-native structures

In this repository, the agent loop already has familiar anchors.

Loop element            | Repository example
------------------------|--------------------------------------------
High-level instructions | AGENTS.md, CLAUDE.md
Scope reduction         | tasks/plan.md, docs/memory-map.md
Execution boundary      | publish restrictions, protected config/
State preservation      | tasks/handoffs/, tasks/sessions/
Verification surface    | quality gates, path checks, review steps

Understanding the loop is therefore less about learning new jargon and more about recognizing which document or mechanism owns which part of agent behavior.

8. Common failure signals

Across the original notes and earlier drafts, several failure signals show up repeatedly when people do not understand how agents operate.

8.1 Everything is stuffed into one instruction file

Without layer separation, system rules and task-specific rules start to conflict.

8.2 Tool results are reinjected without limits

Large logs and raw search dumps pollute context and weaken later decisions.

8.3 The loop can act but cannot verify

The system reads and executes, but no test or review loop catches bad outcomes.
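
A minimal sketch of closing that gap is to wrap each action in a verify-and-retry step. The function names, the retry limit, and the sample action are assumptions for illustration. A real verifier would run tests or a review rather than compare strings.

```python
from typing import Callable

def act_with_verification(
    act: Callable[[int], str],
    verify: Callable[[str], bool],
    max_attempts: int = 3,
) -> dict:
    """Run an action, check its result, and retry until it passes or attempts run out."""
    result = None
    for attempt in range(1, max_attempts + 1):
        result = act(attempt)
        if verify(result):
            return {"ok": True, "attempts": attempt, "result": result}
    return {"ok": False, "attempts": max_attempts, "result": result}

# Hypothetical action that only succeeds from the second attempt onward.
def flaky_action(attempt: int) -> str:
    return "tests passed" if attempt >= 2 else "tests failed"

def passed(result: str) -> bool:
    return result == "tests passed"

outcome = act_with_verification(flaky_action, passed)
never = act_with_verification(lambda attempt: "tests failed", passed)
```

The failure case returns an explicit `ok: False` instead of the last result alone, so a bad outcome cannot be mistaken for a finished task.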

8.4 Session continuity depends only on conversation history

Long work gets compacted or interrupted. Without handoff notes or state files, every new session starts from reconstruction.
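
A handoff note can be as small as a JSON file that a new session reads before doing anything else. The field names and file name below are hypothetical and are not the format used under tasks/handoffs/ in this repository.

```python
import json
import tempfile
from pathlib import Path

def write_handoff(path: Path, goal: str, done: list[str], next_steps: list[str]) -> None:
    """Persist the minimum state a fresh session needs."""
    state = {"goal": goal, "done": done, "next_steps": next_steps}
    path.write_text(json.dumps(state, indent=2), encoding="utf-8")

def resume_prompt(path: Path) -> str:
    """Turn the stored state into a short block for the next session's context."""
    state = json.loads(path.read_text(encoding="utf-8"))
    return (
        f"Goal: {state['goal']}\n"
        f"Done: {'; '.join(state['done'])}\n"
        f"Next: {'; '.join(state['next_steps'])}"
    )

with tempfile.TemporaryDirectory() as tmp:
    handoff = Path(tmp) / "handoff.json"
    write_handoff(
        handoff,
        goal="migrate config loader",
        done=["read config/", "wrote plan"],
        next_steps=["run tests"],
    )
    resumed = resume_prompt(handoff)
```

The resumed block is three short lines, so continuity costs very little context compared with replaying the earlier conversation.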

9. Minimum practical principles

You do not need to memorize a complicated architecture chart. The four principles below are enough to reason better about most agents.

  1. The model does not execute directly. The harness executes and returns results.
  2. Agent input is layered. Different layers have different roles and lifetimes.
  3. Context is a shared budget. Reducing and staging inputs often matters more than expanding them.
  4. An agent is a loop. A good loop includes verification and recording, not just action.

Once that is clear, the next topics naturally follow.

  • A3 will focus on what belongs in instruction files versus external context.
  • A4 will focus on how tools should be designed so the model can act more accurately.

References

  • docs/blog_series_하네스엔지니어링_총괄_design.md
  • sources/260518_하네스엔지니어링_15장_블로그활용노트.md
  • drafts/blog/260429_harness_series_02_context_engineering_en.md
  • WikiDocs, Chapter 2 notes from 하네스 엔지니어링 백과사전 (Harness Engineering Encyclopedia)

This is Part 2 of the Harness Engineering Basics series. Next: Why instruction structure and context design matter more than longer prompts.

Series overview: Harness Engineering Series Guide
