"Harness Engineering Basics (4/4) — MCP and Tool Engineering: Design the Tool Surface for AI Agents"
Adding more tools to an agent does not automatically make it more capable. If tool names are vague, permissions are too broad, and outputs are too long, even a strong model will make poor choices repeatedly. The real issue is not tool count. It is the tool surface. MCP can standardize the connection, but it does not design the interface for you.
Key Takeaways
- Tools are the model's hands and feet, but the model still does not control them directly. Tool descriptions, input schemas, permission boundaries, and output formats all influence behavior quality.
- MCP is a useful standard for connecting external tools and data, but "connected" does not mean "well designed."
- A good tool surface is usually clearly named, narrowly scoped, risk-separated, and structured in its output.
- A bad tool surface creates failures through broad permissions, overlapping functions, vague descriptions, and raw verbose results.
- Tool engineering is therefore less about feature expansion and more about reducing decision cost and blast radius.
1. Why tool design is a quality problem
As Part 2 showed, agents work through a tool-calling loop. That means tool surfaces become part of the model's decision environment.
Consider these two tools:
- search_docs: searches product docs and returns the top five matches with short summaries
- fetch_anything: gets whatever you want
Both may technically be search-like utilities, but the second one imposes much more decision burden on the model. It is unclear when to use it, what it returns, and how expensive the result might be.
So tool design is not just backend engineering. It is interface design for model judgment.
2. A good tool surface clarifies four things
At the beginner level, tool engineering can be kept simple. The most important job is to make four elements explicit.
| Element | Core question |
|---|---|
| Name | Is it obvious what this tool does? |
| Description | Does it say when to use it and when not to use it? |
| Input schema | Does it ask only for what is necessary and reduce easy mistakes? |
| Output shape | Does it return the minimum structure needed for the next decision? |
When those four are clear, the same model often behaves much more reliably.
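As a minimal sketch of how the four elements might sit together in one definition, the following uses a hypothetical `ToolSpec` container. The class, its field names, and the "do not use" sentence in the description are assumptions made for illustration, not part of any particular framework.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ToolSpec:
    """Hypothetical container for the four elements of a tool surface."""
    name: str                       # what the tool does, readable at a glance
    description: str                # when to use it and when not to
    input_schema: dict[str, Any]    # JSON-Schema-style input contract
    output_fields: tuple[str, ...]  # minimum structure returned to the model


search_docs = ToolSpec(
    name="search_docs",
    description=(
        "Searches product docs and returns the top five matches with short "
        "summaries. Use this before opening large documents directly. "
        # The next sentence is an illustrative assumption.
        "Do not use it to read a full document."
    ),
    input_schema={
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
        "additionalProperties": False,
    },
    output_fields=("doc_id", "title", "summary"),
)
```

Keeping all four in one place makes it harder to ship a tool whose name, description, schema, and output drift apart.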
3. Names and descriptions matter far more than people expect
A human engineer can recover intent from code or documentation. A model usually relies first on the tool description and schema it is given. So names and descriptions are not just docs. They are behavior-guidance mechanisms.
Common bad names look like this:
- run
- helper
- project_tool
- "Useful tool for many tasks"
These do not provide enough decision signal.
Better directions look more like:
- read_repo_file
- search_blog_drafts
- list_pending_tasks
- fetch_official_docs_snippet
Descriptions follow the same rule. "What it can do" is not enough. It helps much more to also say "when it should be used."
For example:
- Weak description: "Search documents."
- Better description: "Searches project docs and returns top matches with titles and short summaries. Use this before opening large documents directly."
4. Input schemas are mistake-prevention devices
Many teams treat schemas as simple validation. In practice, they matter more than that. A schema reduces the size of the model's action space.
Good schemas usually:
- make required parameters explicit
- prefer constrained choices over open-ended input when possible
- make dangerous combinations harder to produce
- keep a single responsibility
For example, file reading and file deletion should usually not be collapsed into one vague "file tool." Search and edit should also stay separate. Merging functions can look convenient to humans while becoming ambiguous for the model.
So a schema is not just backend formality. It is UX for misuse prevention.
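The points above can be sketched with a small JSON-Schema-style definition and a hand-written checker. The schema for `read_repo_file`, the `encoding` parameter, and the `validate_args` helper are all hypothetical. The checker covers only the few keywords used here, so a real system would more likely rely on a full schema validator.

```python
from typing import Any

# Hypothetical schema for a read-only file tool. Deletion is deliberately
# not expressible here; it would live in a separate, more restricted tool.
READ_REPO_FILE_SCHEMA: dict[str, Any] = {
    "type": "object",
    "properties": {
        "path": {"type": "string"},
        "encoding": {"type": "string", "enum": ["utf-8", "latin-1"]},
    },
    "required": ["path"],
    "additionalProperties": False,
}


def validate_args(schema: dict[str, Any], args: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the call is acceptable.

    Covers only the subset of JSON Schema used above: required,
    additionalProperties, string type, and enum.
    """
    problems: list[str] = []
    props = schema.get("properties", {})
    for key in schema.get("required", []):
        if key not in args:
            problems.append(f"missing required parameter: {key}")
    for key, value in args.items():
        if key not in props:
            if schema.get("additionalProperties", True) is False:
                problems.append(f"unexpected parameter: {key}")
            continue
        rule = props[key]
        if rule.get("type") == "string" and not isinstance(value, str):
            problems.append(f"{key} must be a string")
        elif "enum" in rule and value not in rule["enum"]:
            problems.append(f"{key} must be one of {rule['enum']}")
    return problems


print(validate_args(READ_REPO_FILE_SCHEMA, {"path": "README.md"}))
print(validate_args(READ_REPO_FILE_SCHEMA, {"path": "a.txt", "delete": True}))
```

The first call returns an empty list. The second is rejected because `delete` is not a declared parameter, which is the kind of dangerous combination a narrow schema is meant to rule out.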
5. Outputs should be short and structured
The earlier tool-and-sandbox draft and the chapter notes both repeat the same warning: many agent failures begin when tools return too much, too easily.
That creates cost twice:
- at execution time
- and again in future turns, because the result becomes part of context
That is why stronger tool outputs usually contain:
- a short summary
- source or file-path metadata
- date or version information when relevant
- an ID or pointer for follow-up fetches
Weaker outputs tend to look like:
- huge raw logs
- full-text search dumps
- mixed essential and non-essential details
- no source or uncertainty markers
A good tool is not the one that returns the most. It is the one that returns what the next decision actually needs.
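One way to apply this is to shape results before they reach the model. The sketch below assumes each raw search hit is a dict with `id`, `path`, `updated`, and `text` keys. Those key names, the two size caps, and the `shape_search_output` function are illustrative assumptions.

```python
from typing import Any

MAX_RESULTS = 5          # assumed cap, tune per tool
MAX_SUMMARY_CHARS = 160  # assumed cap, tune per tool


def shape_search_output(hits: list[dict[str, Any]]) -> dict[str, Any]:
    """Reduce raw search hits to what the next decision needs.

    Each hit is assumed to carry 'id', 'path', 'updated', and 'text' keys.
    """
    results = []
    for hit in hits[:MAX_RESULTS]:
        text = " ".join(hit["text"].split())  # collapse whitespace
        summary = text[:MAX_SUMMARY_CHARS]
        if len(text) > MAX_SUMMARY_CHARS:
            summary = summary.rstrip() + "..."
        results.append({
            "id": hit["id"],            # pointer for a follow-up fetch
            "path": hit["path"],        # source metadata
            "updated": hit["updated"],  # date or version information
            "summary": summary,
        })
    return {
        "total_matches": len(hits),
        "returned": len(results),
        "truncated": len(hits) > len(results),  # tells the model more exists
        "results": results,
    }
```

The `truncated` flag and the per-result `id` let the model ask for more only when it needs to, instead of paying for the full dump on every turn.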
6. MCP is a standard, not automatic design
MCP is a meaningful improvement because it standardizes how external tools and data sources are exposed. Different clients can reuse the same servers, and the connection layer becomes more systematic.
But one misunderstanding appears repeatedly:
Attaching MCP does not finish tool engineering.
MCP mostly solves the connection layer. The harder design questions still remain:
- which tools should be exposed
- how much permission should they have
- how should they be named and described
- how should large results be limited
- how should untrusted servers be isolated
MCP is therefore closer to a power strip than to a finished tool strategy. Useful, but not sufficient.
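The first of those questions, which tools to expose, can be handled with an explicit allowlist between the server and the agent. The sketch below assumes a tool listing has already been retrieved as a list of dicts with `name`, `description`, and `inputSchema` keys, as in an MCP tool listing. The server name, the allowlist contents, the `delete_doc` tool, and `select_tools` are hypothetical.

```python
from typing import Any

# Hypothetical allowlist: server name -> tool names the agent may see.
EXPOSED_TOOLS: dict[str, set[str]] = {
    "docs-server": {"search_docs", "fetch_official_docs_snippet"},
}


def select_tools(server: str, listed: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep only allowlisted tools from one server's tool listing.

    Servers that are not in the allowlist expose nothing, so connecting a
    new server does not silently widen the tool surface.
    """
    allowed = EXPOSED_TOOLS.get(server, set())
    return [tool for tool in listed if tool["name"] in allowed]


# Illustrative listing; a real one would come from the connected server.
listing = [
    {"name": "search_docs", "description": "Searches project docs.", "inputSchema": {}},
    {"name": "delete_doc", "description": "Deletes a document.", "inputSchema": {}},
]
print([tool["name"] for tool in select_tools("docs-server", listing)])
print(select_tools("unknown-server", listing))
```

Defaulting to an empty set for unknown servers is a deliberate choice here: exposure has to be opted into, tool by tool.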
7. Risk separation should be built into the tool surface
Good tool surfaces reflect not only function but also risk level. Read and write, search and execute, local edit and external send should not all be treated as equivalent.
In practice, the following split is a sound baseline.
| Tool type | Default posture |
|---|---|
| Read tools | can often be allowed more broadly |
| Search tools | need output-size discipline |
| Edit tools | need scope restriction and follow-up checks |
| Execution tools | need allowlists and timeouts |
| External-send tools | usually need approvals or explicit limits |
Without this separation, the model may attempt actions with very different risk levels with the same level of confidence.
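The baseline in the table could be encoded as data that the harness consults before every call. This is a sketch under assumptions: the `Posture` fields, the concrete limits, the example command allowlist, and `may_run` are all made up for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ToolType(Enum):
    READ = "read"
    SEARCH = "search"
    EDIT = "edit"
    EXECUTE = "execute"
    EXTERNAL_SEND = "external_send"


@dataclass(frozen=True)
class Posture:
    """Hypothetical default posture for one class of tools."""
    needs_approval: bool = False
    max_output_chars: Optional[int] = None   # output-size discipline
    timeout_seconds: Optional[int] = None    # execution time limit
    path_scoped: bool = False                # restrict to a working directory
    allowed_commands: frozenset[str] = frozenset()


# Illustrative values only; real limits depend on the deployment.
DEFAULT_POSTURE: dict[ToolType, Posture] = {
    ToolType.READ: Posture(),
    ToolType.SEARCH: Posture(max_output_chars=2000),
    ToolType.EDIT: Posture(path_scoped=True),
    ToolType.EXECUTE: Posture(
        timeout_seconds=30,
        allowed_commands=frozenset({"pytest", "ls"}),
    ),
    ToolType.EXTERNAL_SEND: Posture(needs_approval=True),
}


def may_run(tool_type: ToolType, approved: bool = False) -> bool:
    """Return True if a call may proceed under the default posture."""
    posture = DEFAULT_POSTURE[tool_type]
    return approved or not posture.needs_approval
```

With this in place, `may_run(ToolType.READ)` passes without approval, while `may_run(ToolType.EXTERNAL_SEND)` is refused until a human or policy step sets `approved=True`.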
8. Common anti-patterns
Weak tool engineering tends to produce the same mistakes again and again.
8.1 One universal tool for everything
It looks simple on paper but makes the choice problem much harder.
8.2 Abstract descriptions
The model lacks enough signal about when the tool should be used.
8.3 Raw result dumping
Long logs, full search dumps, and full-document returns become immediate context problems.
8.4 Handling risk only outside the tool surface
High-level policy matters, but the tools themselves also need narrow scopes and risk-based separation.
8.5 Connecting everything that can be connected
Attaching many MCP servers does not guarantee better behavior. Often it only increases choice confusion.
9. Practical checklist before adding a tool
Before exposing a new tool or MCP server, these questions are worth asking first:
- Does this tool have a distinct role, or does it overlap with an existing one?
- Can the model infer when to use it from the name alone?
- Does the description explain both when to use it and when not to use it?
- Does the schema reduce dangerous or vague freedom?
- Is the output concise, structured, and suitable for follow-up fetches?
- Are read, edit, execute, and external-send risks separated?
- If the tool fails, how should the agent recover or stop?
If these questions do not have clear answers, connectivity alone will not improve quality.
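Parts of this checklist can be approximated mechanically before a human review. The sketch below is a rough heuristic, not a complete check: the `review_tool` function, the vague-name list, and the phrase matching on "use this" and "do not use" are assumptions, and questions such as failure recovery or risk separation still need human judgment.

```python
from typing import Any

VAGUE_NAMES = {"run", "helper", "tool", "fetch_anything"}  # illustrative list


def review_tool(tool: dict[str, Any], existing_names: set[str]) -> list[str]:
    """Return checklist items that a proposed tool does not yet satisfy.

    `tool` is assumed to have 'name' and 'description' keys and an optional
    'input_schema' in JSON-Schema style. The checks are crude heuristics.
    """
    findings: list[str] = []
    name = tool["name"]
    description = tool["description"].lower()
    schema = tool.get("input_schema", {})

    if name in existing_names:
        findings.append("name collides with an existing tool")
    # Crude proxy for verb_object naming such as read_repo_file.
    if name in VAGUE_NAMES or "_" not in name:
        findings.append("name does not say what the tool acts on")
    if "use this" not in description:
        findings.append("description does not say when to use the tool")
    if "do not use" not in description:
        findings.append("description does not say when not to use the tool")
    if not schema.get("required"):
        findings.append("schema declares no required parameters")
    if schema.get("additionalProperties", True) is not False:
        findings.append("schema accepts undeclared parameters")
    return findings


weak = {"name": "helper", "description": "Useful tool for many tasks"}
for finding in review_tool(weak, existing_names=set()):
    print("-", finding)
```

Running this kind of check when a tool is registered gives an early warning, while the final decision about exposing the tool stays with a reviewer.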
10. Design the surface, not just the inventory
To close the A-series, the most important line is this:
Agent quality is often shaped more by the clarity of the tool surface than by the sheer number of tools.
Good harnesses do not expand capability blindly. They make it easier for the model to choose the right action for the current situation, through:
- concrete names
- usage-oriented descriptions
- narrow, explicit schemas
- short, structured outputs
- permissions separated by risk
Even with the same model, those choices reduce instability. That is why tool engineering is less about adding power and more about designing interfaces for better decisions.
References
- docs/blog_series_하네스엔지니어링_총괄_design.md
- sources/260518_하네스엔지니어링_15장_블로그활용노트.md
- drafts/blog/260429_하네스시리즈04_도구와샌드박싱_블로그.md
- drafts/blog/260429_harness_series_04_tools_sandboxing_en.md
- WikiDocs, Chapter 4 notes from 하네스 엔지니어링 백과사전 (Harness Engineering Encyclopedia)
This is Part 4 of the Harness Engineering Basics series. Suggested next reading: evaluation harnesses, long-running agents, and the OpenAI/Claude implementation track.
Series overview: Harness Engineering Series Guide