"Harness Engineering Basics (4/4) — MCP and Tool Engineering: Design the Tool Surface for AI Agents"

Adding more tools to an agent does not automatically make it more capable. If tool names are vague, permissions are too broad, and outputs are too long, even a strong model will make poor choices repeatedly. The real issue is not tool count. It is the tool surface. MCP can standardize the connection, but it does not design the interface for you.

Key Takeaways

Tools are the model's hands and feet, but the model still does not control them directly. Tool descriptions, input schemas, permission boundaries, and output formats all influence behavior quality.
MCP is a useful standard for connecting external tools and data, but "connected" does not mean "well designed."
A good tool surface is usually clearly named, narrowly scoped, risk-separated, and structured in its output.
A bad tool surface creates failures through broad permissions, overlapping functions, vague descriptions, and raw verbose results.
Tool engineering is therefore less about feature expansion and more about reducing decision cost and blast radius.

1. Why tool design is a quality problem

As Part 2 showed, agents work through a tool-calling loop. That means tool surfaces become part of the model's decision environment.

Consider these two tools:

search_docs: searches product docs and returns the top five matches with short summaries
fetch_anything: gets whatever you want

Both may technically be search-like utilities, but the second one imposes much more decision burden on the model. It is unclear when to use it, what it returns, and how expensive the result might be.

So tool design is not just backend engineering. It is interface design for model judgment.

2. A good tool surface clarifies four things

At the beginner level, tool engineering can be kept simple. The most important job is to make four elements explicit.

Element	Core question
Name	Is it obvious what this tool does?
Description	Does it say when to use it and when not to use it?
Input schema	Does it ask only for what is necessary and reduce easy mistakes?
Output shape	Does it return the minimum structure needed for the next decision?

When those four are clear, the same model often behaves much more reliably.
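As a minimal sketch of those four elements, the search_docs tool from section 1 could be written down as a plain data structure. The ToolSpec class, its field names, and the exact schema are hypothetical illustrations, not part of MCP or any SDK; only the tool name and its "top five matches with short summaries" behavior come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSpec:
    """Hypothetical container for the four elements of a tool surface."""
    name: str             # what the tool does, readable at a glance
    description: str      # when to use it and when not to use it
    input_schema: dict    # only the parameters that are necessary
    output_fields: tuple  # minimum structure needed for the next decision


search_docs = ToolSpec(
    name="search_docs",
    description=(
        "Searches product docs and returns the top five matches with short "
        "summaries. Use this before opening a full document. Do not use it "
        "to read a whole document."
    ),
    input_schema={
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
        "additionalProperties": False,
    },
    output_fields=("doc_id", "title", "summary"),
)
```

Writing the four elements side by side makes gaps easy to spot: a tool whose description or output fields cannot be filled in clearly is probably not ready to expose.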

3. Names and descriptions matter far more than people expect

A human engineer can recover intent from code or documentation. A model usually relies first on the tool description and schema it is given. So names and descriptions are not just docs. They are behavior-guidance mechanisms.

Common bad names and descriptions look like this:

run
helper
project_tool
"Useful tool for many tasks"

These do not provide enough decision signal.

Better directions look more like:

read_repo_file
search_blog_drafts
list_pending_tasks
fetch_official_docs_snippet

Descriptions follow the same rule. "What it can do" is not enough. It helps much more to also say "when it should be used."

For example:

Weak description: "Search documents."
Better description: "Searches project docs and returns top matches with titles and short summaries. Use this before opening large documents directly."
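One way to make this rule checkable is a small lint pass over tool names and descriptions before they are registered. The sketch below is a crude heuristic under stated assumptions: the vague-name set, the phrase list, the eight-word threshold, and the function name lint_tool_text are all invented for illustration.

```python
# Assumed examples of names that carry no decision signal.
VAGUE_NAMES = {"run", "helper", "tool", "project_tool", "fetch_anything"}
# Assumed phrases that indicate usage guidance in a description.
USAGE_HINTS = ("use this", "use it", "do not use", "don't use")


def lint_tool_text(name: str, description: str) -> list[str]:
    """Return a list of warnings; an empty list means no obvious problem."""
    warnings = []
    # Crude check: a verb_object style name contains at least one underscore.
    if name.lower() in VAGUE_NAMES or "_" not in name:
        warnings.append(f"name '{name}' does not say what the tool acts on")
    if len(description.split()) < 8:
        warnings.append("description is too short to guide tool choice")
    if not any(hint in description.lower() for hint in USAGE_HINTS):
        warnings.append("description does not say when to use the tool")
    return warnings
```

With these assumptions, the weak description above triggers warnings while the better one passes. A check like this cannot judge whether a description is accurate; it only catches the most obvious omissions.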

4. Input schemas are mistake-prevention devices

Many teams treat schemas as simple validation. In practice, they matter more than that. A schema reduces the size of the model's action space.

Good schemas usually:

make required parameters explicit
prefer constrained choices over open-ended input when possible
make dangerous combinations harder to produce
keep a single responsibility

For example, file reading and file deletion should usually not be collapsed into one vague "file tool." Search and edit should also stay separate. Merging functions can look convenient to humans while becoming ambiguous for the model.

So a schema is not just backend formality. It is UX for misuse prevention.
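The points above can be sketched with a JSON Schema style definition for a read-only file tool. The schema contents and the check_args helper are assumptions made for illustration; the helper covers only required keys, unknown keys, and enum values, and is not a full JSON Schema validator.

```python
# Hypothetical schema for read_repo_file: one responsibility, one required
# parameter, a constrained choice for encoding, and no extra parameters.
READ_REPO_FILE_SCHEMA = {
    "type": "object",
    "properties": {
        "path": {"type": "string"},
        "encoding": {"type": "string", "enum": ["utf-8", "latin-1"]},
    },
    "required": ["path"],
    "additionalProperties": False,
}


def check_args(schema: dict, args: dict) -> list[str]:
    """Check required keys, unknown keys, and enum values only."""
    errors = []
    properties = schema.get("properties", {})
    for key in schema.get("required", []):
        if key not in args:
            errors.append(f"missing required parameter: {key}")
    for key, value in args.items():
        if key not in properties:
            if schema.get("additionalProperties", True) is False:
                errors.append(f"unknown parameter: {key}")
            continue
        allowed = properties[key].get("enum")
        if allowed is not None and value not in allowed:
            errors.append(f"{key} must be one of {allowed}")
    return errors
```

Because the schema has no "delete" or "mode" parameter at all, a call that tries to turn a read into a destructive action is rejected before it reaches the backend.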

5. Outputs should be short and structured

The earlier tool-and-sandbox draft and the chapter notes both repeat the same warning: many agent failures begin when tools return too much, too easily.

Oversized output costs twice:

once at execution time
again in every later turn, because the result stays in context

That is why stronger tool outputs usually contain:

a short summary
source or file-path metadata
date or version information when relevant
an ID or pointer for follow-up fetches

Weaker outputs tend to look like:

huge raw logs
full-text search dumps
mixed essential and non-essential details
no source or uncertainty markers

A good tool is not the one that returns the most. It is the one that returns what the next decision actually needs.
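A minimal sketch of that output discipline for a search tool follows. It assumes each raw hit is a dict with 'id', 'path', 'updated', and 'text' keys; those key names, the function name, and the default limits are hypothetical.

```python
def shape_search_results(hits: list[dict], limit: int = 5,
                         summary_chars: int = 160) -> dict:
    """Reduce raw search hits to a short, structured result.

    Each hit is assumed to be a dict with 'id', 'path', 'updated', 'text'.
    """
    results = []
    for hit in hits[:limit]:
        text = " ".join(hit["text"].split())  # collapse whitespace
        if len(text) > summary_chars:
            text = text[:summary_chars].rstrip() + "..."
        results.append({
            "id": hit["id"],            # pointer for a follow-up fetch
            "path": hit["path"],        # source metadata
            "updated": hit["updated"],  # date or version information
            "summary": text,
        })
    return {
        "results": results,
        "total_hits": len(hits),
        "truncated": len(hits) > limit,  # tells the model more exists
    }
```

The truncated flag and the per-result id let the model decide whether a follow-up fetch is worth the cost, instead of paying for the full text up front.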

6. MCP is a standard, not automatic design

MCP is a meaningful improvement because it standardizes how external tools and data sources are exposed. Different clients can reuse the same servers, and the connection layer becomes more systematic.

But one misunderstanding appears repeatedly:

Attaching MCP does not finish tool engineering.

MCP mostly solves the connection layer. The harder design questions still remain:

which tools should be exposed
how much permission should they have
how should they be named and described
how should large results be limited
how should untrusted servers be isolated

MCP is therefore closer to a power strip than to a finished tool strategy. Useful, but not sufficient.
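The first of those questions, which tools to expose, can be sketched without any SDK. The example below assumes the harness already holds the tool definitions a server listed, as plain dicts with the name, description, and inputSchema fields that MCP tool definitions carry. The function name and the allowlist are hypothetical, and no real MCP client call is shown.

```python
def select_exposed_tools(listed_tools: list[dict],
                         allowlist: set[str]) -> list[dict]:
    """Keep only allowlisted tools that carry a description and a schema."""
    exposed = []
    for tool in listed_tools:
        if tool.get("name") not in allowlist:
            continue  # connected, but deliberately not shown to the model
        if not tool.get("description") or "inputSchema" not in tool:
            continue  # too underspecified to guide tool choice
        exposed.append(tool)
    return exposed
```

The design choice here is that connecting a server and exposing its tools are separate decisions: everything a server offers stays hidden until someone adds it to the allowlist on purpose.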

7. Risk separation should be built into the tool surface

Good tool surfaces reflect not only function but also risk level. Read and write, search and execute, local edit and external send should not all be treated as equivalent.

In practice, the following split is a sound baseline.

Tool type	Default posture
Read tools	can often be allowed more broadly
Search tools	need output-size discipline
Edit tools	need scope restriction and follow-up checks
Execution tools	need allowlists and timeouts
External-send tools	usually need approvals or explicit limits

Without this separation, the model may attempt actions with very different risk levels with the same level of confidence.
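The table can be turned into a small policy gate that runs before every tool call. This is a sketch under assumptions: the tool names edit_repo_file, run_command, and send_email, the workspace/ prefix, and the command allowlist are hypothetical, and timeouts and output-size limits are left out for brevity.

```python
from enum import Enum


class Risk(Enum):
    READ = "read"
    SEARCH = "search"
    EDIT = "edit"
    EXECUTE = "execute"
    EXTERNAL_SEND = "external_send"


# Hypothetical tool-to-risk mapping; tools missing from it are denied.
TOOL_RISK = {
    "read_repo_file": Risk.READ,
    "search_blog_drafts": Risk.SEARCH,
    "edit_repo_file": Risk.EDIT,
    "run_command": Risk.EXECUTE,
    "send_email": Risk.EXTERNAL_SEND,
}

COMMAND_ALLOWLIST = {"pytest", "ls"}  # assumed safe commands


def decide(tool: str, args: dict, approved: bool = False) -> str:
    """Return 'allow', 'deny', or 'needs_approval' for one tool call."""
    risk = TOOL_RISK.get(tool)
    if risk is None:
        return "deny"
    if risk in (Risk.READ, Risk.SEARCH):
        return "allow"
    if risk is Risk.EDIT:
        # Scope restriction: edits stay under an assumed workspace prefix.
        # A real check would also normalize the path first.
        in_scope = str(args.get("path", "")).startswith("workspace/")
        return "allow" if in_scope else "deny"
    if risk is Risk.EXECUTE:
        return "allow" if args.get("command") in COMMAND_ALLOWLIST else "deny"
    # External-send tools require an explicit approval.
    return "allow" if approved else "needs_approval"
```

For example, decide("send_email", {}) returns "needs_approval" until a human or policy layer passes approved=True, while a read call goes through without that step.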

8. Common anti-patterns

Weak tool engineering tends to produce the same mistakes again and again.

8.1 One universal tool for everything

It looks simple on paper but makes the choice problem much harder.

8.2 Abstract descriptions

The model lacks enough signal about when the tool should be used.

8.3 Raw result dumping

Long logs, full search dumps, and full-document returns become immediate context problems.

8.4 Handling risk only outside the tool surface

High-level policy matters, but the tools themselves also need narrow scopes and risk-based separation.

8.5 Connecting everything that can be connected

Attaching many MCP servers does not guarantee better behavior. Often it only increases choice confusion.

9. Practical checklist before adding a tool

Before exposing a new tool or MCP server, these questions are worth asking first:

Does this tool have a distinct role, or does it overlap with an existing one?
Can the model infer when to use it from the name alone?
Does the description explain both when to use it and when not to use it?
Does the schema reduce dangerous or vague freedom?
Is the output concise, structured, and suitable for follow-up fetches?
Are read, edit, execute, and external-send risks separated?
If the tool fails, how should the agent recover or stop?

If these questions do not have clear answers, connectivity alone will not improve quality.
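The last question on the list, how the agent should recover or stop when a tool fails, can also be answered in code. The wrapper below is one possible sketch: the function name, the result keys, and the choice to retry only on TimeoutError are assumptions made for illustration.

```python
def call_with_recovery(tool_fn, args: dict, max_attempts: int = 2) -> dict:
    """Call a tool and return a structured result instead of raising."""
    last_error = ""
    for attempt in range(1, max_attempts + 1):
        try:
            return {"ok": True, "data": tool_fn(**args), "attempts": attempt}
        except TimeoutError as exc:
            last_error = f"timeout: {exc}"  # transient, worth another try
        except Exception as exc:            # anything else: stop immediately
            return {"ok": False, "error": str(exc), "retryable": False,
                    "attempts": attempt}
    # Every attempt timed out; report it and let the agent decide.
    return {"ok": False, "error": last_error, "retryable": True,
            "attempts": max_attempts}
```

Returning a short structured error with a retryable flag gives the model a clear signal to retry, switch tools, or stop, rather than a raw stack trace to interpret.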

10. Design the surface, not just the inventory

To close the A-series, the most important line is this:

Agent quality is often shaped more by the clarity of the tool surface than by the sheer number of tools.

Good harnesses do not expand capability blindly. They make it easier for the model to choose the right action for the current situation. The levers are the ones covered in this post:

concrete names
usage-oriented descriptions
narrow, explicit schemas
short, structured outputs
permissions separated by risk

Even with the same model, those choices reduce instability. That is why tool engineering is less about adding power and more about designing interfaces for better decisions.

References

docs/blog_series_하네스엔지니어링_총괄_design.md
sources/260518_하네스엔지니어링_15장_블로그활용노트.md
drafts/blog/260429_하네스시리즈04_도구와샌드박싱_블로그.md
drafts/blog/260429_harness_series_04_tools_sandboxing_en.md
WikiDocs, Chapter 4 notes from 하네스 엔지니어링 백과사전

This is Part 4 of the Harness Engineering Basics series. Suggested next reading: evaluation harnesses, long-running agents, and the OpenAI/Claude implementation track.

Series overview: Harness Engineering Series Guide
