"Building an OpenAI Harness (1/3) — Understanding Responses API, Tools, and the Agents SDK as an Operational Stack"

If you understand OpenAI only as "a model API," it becomes hard to see where the real agent system should be assembled. In practice, Responses API, tools, function calling, remote MCP, and the Agents SDK look like one connected surface, but they play different roles. Once you separate those roles, it becomes much easier to see where the harness should stay thin and where it needs structure.


Key Takeaways

  • The Responses API is not just a text-generation endpoint. It is the base operational surface for responses, tool calls, and conversation-state linkage.
  • Tools expand the model's action space through built-in tools, custom function calling, and remote MCP, but the decision of when a tool should be available is still a harness design problem.
  • The Agents SDK does not replace model calls. It sits above them as a layer for turn management, handoffs, guardrails, tracing, and session orchestration.
  • In practice, it is often best to keep Responses API at the bottom, attach a narrow tool surface and verification rules above it, and bring in the Agents SDK only when orchestration complexity genuinely appears.
  • So the central OpenAI harness question is not "which model should we use?" but rather "which operational responsibility belongs on which layer?"

1. Why OpenAI should be read as layers, not just a product

From a harness perspective, the OpenAI surface is not one thing.

  • response generation
  • conversation-state linkage
  • tool calling
  • file, web, and external-system access
  • multi-agent execution flow management

If you collapse those into one mental bucket, implementations bloat quickly. If you separate them into layers, it becomes much clearer what should be delegated to the API and what should remain your responsibility.

This is a useful split:

Layer          Primary role
-------------  ---------------------------------------------------------------
Responses API  input, response generation, tool calls, multi-turn linkage
Tools          web search, file search, function calling, remote MCP
Your harness   approvals, policy, retries, verification, logging, cost control
Agents SDK     orchestration across turns and agents

The point of this post is not to list features. It is to read them as operational responsibilities.

2. The Responses API is really a runtime boundary

In the current official OpenAI docs, the center of gravity is the Responses API. It handles not just visible output, but also tool use and follow-up turn linkage in the same overall flow.

Three details matter most in practice:

  1. A response can include output items and tool calls, not just a plain string.
  2. Multi-turn linkage can be maintained through mechanisms such as previous_response_id.
  3. The same request surface exposes control points like tools, parallel_tool_calls, and max_tool_calls.

That means that as soon as you adopt the Responses API, you are already touching part of an agent loop. It is therefore better understood as the runtime boundary of a minimal harness, not merely a completion-style endpoint.
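The three details above can be sketched as a small request builder. This is a minimal sketch, not the official client: it only assembles keyword arguments as a plain dict. The parameter names (tools, parallel_tool_calls, max_tool_calls, previous_response_id) are the ones named above; the model name, the default ceiling of 4, and the "resp_123" id are placeholder assumptions.

```python
from typing import Any, Optional


def build_request(
    user_input: str,
    tools: list[dict[str, Any]],
    previous_response_id: Optional[str] = None,
    max_tool_calls: int = 4,  # assumption: an arbitrary illustrative ceiling
    model: str = "gpt-4.1-mini",  # assumption: substitute the model you actually use
) -> dict[str, Any]:
    """Assemble keyword arguments for one Responses API call."""
    request: dict[str, Any] = {
        "model": model,
        "input": user_input,
        "tools": tools,
        # One tool call at a time keeps a thin loop easier to audit.
        "parallel_tool_calls": False,
        "max_tool_calls": max_tool_calls,
    }
    # Link to the prior turn only when there is one.
    if previous_response_id is not None:
        request["previous_response_id"] = previous_response_id
    return request


first = build_request("Summarize the open incidents.", tools=[])
follow_up = build_request(
    "Which one is oldest?", tools=[], previous_response_id="resp_123"
)
```

With the official Python client, a dict like this would typically be passed as client.responses.create(**request). Check the current API reference for which parameters your chosen model supports.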

3. Built-in tools are an action surface, not a feature checklist

The official tools guide now presents built-in tools, function calling, and remote MCP inside the same tool surface. That matters because it assumes the model may need to move from answering into acting.

Some of the most operationally meaningful tools are:

  • web_search: for freshness-sensitive or externally verifiable work
  • file_search: for uploaded files or vector-store retrieval
  • function calling: for your own code and internal systems
  • mcp: for external tool servers exposed through a standard interface

But the practical question is not "should we enable everything?" In most systems, the right answer is no.

  • Enabling web_search by default for non-freshness tasks adds cost and variability.
  • Search tools that dump long raw results pollute the next turn's context.
  • Loosely defined functions make tool choice less reliable.
  • Too many MCP servers increase connectivity and confusion at the same time.

Tools are therefore not just a capability surface. They are also a blast-radius surface.
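One way to keep that blast radius small is a per-task allowlist that the harness applies before every request. The sketch below is an illustration under stated assumptions: the task names, the TOOL_POLICY table, and the "vs_example" vector store id are all hypothetical, and the tool dicts are simplified.

```python
from typing import Any

# Hypothetical task categories mapped to the tool types each one may use.
TOOL_POLICY: dict[str, set[str]] = {
    "summarize_local_doc": set(),  # no tools: local context is enough
    "answer_from_kb": {"file_search"},
    "check_latest_release": {"web_search", "file_search"},
}


def tools_for_task(task: str, available: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return only the tool definitions the policy allows for this task."""
    allowed = TOOL_POLICY.get(task, set())  # unknown tasks get no tools
    return [tool for tool in available if tool.get("type") in allowed]


available_tools = [
    {"type": "web_search"},
    {"type": "file_search", "vector_store_ids": ["vs_example"]},
]

kb_tools = tools_for_task("answer_from_kb", available_tools)
```

The design choice here is that the default is empty: a task nobody classified gets no tools, rather than all of them.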

4. Function calling is boundary design

Many teams think of function calling as "the model can call a function." That is true but incomplete. In a harness, function calling is the boundary between model judgment and system action.
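That boundary can be made explicit with a small dispatcher. In the Responses API, a function call arrives as an output item with a name, a call_id, and arguments encoded as a JSON string, and the result goes back as a function_call_output item carrying the same call_id. The sketch below uses plain dicts for those items (the official client returns typed objects), and search_docs is a hypothetical handler.

```python
import json
from typing import Any, Callable


def search_docs(query: str) -> dict[str, Any]:
    """Hypothetical read-only handler; a real one would query your index."""
    return {"query": query, "hits": []}


# Only functions registered here can ever run, whatever the model asks for.
HANDLERS: dict[str, Callable[..., Any]] = {"search_docs": search_docs}


def run_function_calls(output_items: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Turn function_call items into function_call_output items."""
    results: list[dict[str, Any]] = []
    for item in output_items:
        if item.get("type") != "function_call":
            continue  # messages and other item types are not our concern here
        handler = HANDLERS.get(item["name"])
        if handler is None:
            payload: Any = {"error": f"unknown function: {item['name']}"}
        else:
            try:
                payload = handler(**json.loads(item["arguments"]))
            except (TypeError, ValueError) as exc:
                # Malformed JSON or wrong parameters: report, do not crash.
                payload = {"error": f"bad arguments: {exc}"}
        results.append(
            {
                "type": "function_call_output",
                "call_id": item["call_id"],
                "output": json.dumps(payload),
            }
        )
    return results
```

Errors are returned to the model as data instead of raised, so the harness, not the model, decides whether the loop continues.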

Good function schemas usually have these traits:

  • names make roles obvious
  • parameters stay narrow and predictable
  • different risk levels are not mixed together
  • outputs return only the minimum structure needed for the next decision

For example, "search docs" and "edit docs" should not usually live behind one vague tool abstraction. What feels convenient to a human often becomes ambiguous to a model.

That is why harness quality is often more sensitive to schema clarity than to the sheer number of functions.
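To make the "search docs" versus "edit docs" split concrete, here is a hedged sketch of two narrow function tool definitions. The overall shape (type, name, description, parameters as JSON Schema, strict) follows the Responses API function tool format as documented at the time of writing; the function names, descriptions, and parameters are hypothetical.

```python
from typing import Any


def function_tool(
    name: str, description: str, properties: dict[str, Any]
) -> dict[str, Any]:
    """Build a strict function tool definition with every parameter required."""
    return {
        "type": "function",
        "name": name,
        "description": description,
        "strict": True,
        "parameters": {
            "type": "object",
            "properties": properties,
            "required": list(properties),
            "additionalProperties": False,
        },
    }


# Read-only, low risk: narrow inputs, bounded output.
SEARCH_DOCS = function_tool(
    "search_docs",
    "Read-only search over internal docs. Returns at most `limit` hits.",
    {"query": {"type": "string"}, "limit": {"type": "integer"}},
)

# Write action, higher risk: kept as a separate tool so it can be gated.
EDIT_DOC = function_tool(
    "edit_doc",
    "Replace one section of one document. Write action.",
    {
        "doc_id": {"type": "string"},
        "section": {"type": "string"},
        "new_text": {"type": "string"},
    },
)
```

Because the two tools are separate, the harness can expose SEARCH_DOCS freely and attach EDIT_DOC only to jobs where a write is expected and approved.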

5. Remote MCP improves connectivity, not design

It is significant that remote MCP now appears in the same official tool surface. In OpenAI-based harnesses, external tool connectivity is becoming a standard part of the platform story rather than an ad hoc side channel.

But that does not mean MCP designs the harness for you.

You still need to decide:

  • which servers should be exposed
  • which ones should remain read-only
  • which results should be summarized before returning
  • which actions should require explicit approval
  • which traces and audit records should be stored

MCP lowers friction for extension. It does not remove the need for operational architecture.
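Several of the decisions above can be written down as configuration. The sketch below builds remote MCP tool entries with a narrowed tool list and an approval setting. The field names (server_label, server_url, allowed_tools, require_approval) follow the remote MCP tool configuration as documented at the time of writing and should be verified against the current tools guide; the server labels, URLs, and tool names are hypothetical.

```python
from typing import Any


def mcp_tool(
    label: str, url: str, allowed_tools: list[str], read_only: bool
) -> dict[str, Any]:
    """Build one remote MCP tool entry with an explicit approval policy."""
    return {
        "type": "mcp",
        "server_label": label,
        "server_url": url,
        # Expose only the named tools, not everything the server offers.
        "allowed_tools": allowed_tools,
        # Policy assumption: read-only servers skip approval, others never do.
        "require_approval": "never" if read_only else "always",
    }


DOCS_SERVER = mcp_tool(
    "internal_docs", "https://mcp.example.com/docs", ["search", "fetch"], read_only=True
)
TICKETS_SERVER = mcp_tool(
    "tickets", "https://mcp.example.com/tickets", ["create_ticket"], read_only=False
)
```

Summarization of results and audit storage do not fit into this config. They stay in the harness code that handles what the server returns.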

6. The Agents SDK is best understood as a higher-level loop manager

If you read the official OpenAI documentation together with the SDK docs, the Agents SDK looks much less like a replacement for the Responses API and much more like an orchestration layer above it.

The recurring concepts are:

  • tools
  • handoffs
  • sessions
  • guardrails
  • tracing
  • streaming

That combination makes the role fairly clear. The Agents SDK is about managing agent execution flows, not just sending prompts.

This framing usually reduces confusion:

Question                                           Better matching layer
-------------------------------------------------  --------------------------------------
Which tools may the model use in this request?     Responses API + tools
How is this conversation linked to the next turn?  Responses API
Can this agent hand off to a specialist agent?     Agents SDK
Where do traces and run flow live?                 Agents SDK or your observability layer
Where should pre/post guardrails sit?              your harness plus SDK support

So the SDK is not the default starting point. It is an upper structure for systems that have already become orchestration-heavy.

7. When the Responses API is enough, and when the SDK helps

The answer depends on the team, but the pattern is usually straightforward.

Cases where the Responses API is often enough

  • a single agent or a thin loop is sufficient
  • the tool surface is small and clear
  • conversation continuity is relatively simple
  • request-by-request handling matters more than handoffs

Cases where the Agents SDK becomes attractive

  • specialist agents need to hand work off to each other
  • session and handoff patterns recur often
  • tracing and guardrails need to be reused consistently
  • multi-step execution should be managed as one run unit

The key point is this:

The SDK is not "the advanced way to use OpenAI." It is a way to reduce orchestration chaos once operational complexity reaches a certain level.

8. What your side still has to own

Even as OpenAI provides broader runtime surfaces, the core harness responsibilities do not disappear. Your system still needs to own:

  • which tools are allowed for which jobs
  • whether freshness checks are required
  • how external search output is summarized and verified
  • when to retry versus stop
  • how cost and tool-call ceilings are enforced
  • where traces, audit records, and approvals live

In other words, OpenAI increasingly offers strong agent runtime parts, but the product-level harness is still your design.
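Two of those responsibilities, retry-versus-stop and tool-call ceilings, are simple enough to sketch in plain Python. Everything here is a hypothetical illustration: the RunBudget class, its default limits, and the error kinds treated as transient are assumptions, not part of any OpenAI API.

```python
from dataclasses import dataclass


@dataclass
class RunBudget:
    """Per-run ceilings that the harness enforces outside the model."""

    max_tool_calls: int = 8
    max_retries: int = 2
    tool_calls: int = 0
    retries: int = 0

    def record_tool_call(self) -> bool:
        """Count a tool call; return False once the ceiling is exceeded."""
        self.tool_calls += 1
        return self.tool_calls <= self.max_tool_calls

    def should_retry(self, error_kind: str) -> bool:
        """Retry only transient failures, and only while retries remain."""
        if error_kind not in {"timeout", "rate_limit"}:
            return False  # e.g. validation errors will not fix themselves
        if self.retries >= self.max_retries:
            return False
        self.retries += 1
        return True


budget = RunBudget(max_tool_calls=2, max_retries=1)
```

A loop would call record_tool_call() before executing each tool and stop the run when it returns False, independent of any ceiling also passed to the API.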

9. Practical checklist before wiring an OpenAI harness

  1. Is this task just response generation, or does it really need a tool loop?
  2. Does it require freshness, or is local context enough?
  3. How should built-in tools and custom functions be separated?
  4. Are function schemas split by responsibility and risk level?
  5. Is remote MCP actually needed, or are internal functions enough?
  6. How much multi-turn state should remain in the API versus your own artifacts?
  7. Has orchestration complexity grown enough to justify the SDK?

10. The more clearly you see the layers, the simpler the harness gets

As the OpenAI surface expands, it can look more confusing to newcomers. But from a harness perspective, the structure is actually fairly manageable.

  • Responses API is the base runtime boundary
  • tools expand the action surface
  • function calling and MCP connect external systems
  • the Agents SDK manages more complex loops

This framing also removes the pressure to adopt everything at once. Good harnesses usually start smaller:

  1. build a thin loop with the Responses API
  2. attach only the tools that are necessary
  3. keep verification and approvals in your own harness
  4. bring in the Agents SDK only when complexity warrants it

That is why the real OpenAI harness question is not model choice. It is this:

Which operational responsibility belongs on which layer?

Part 2 moves that same question to Claude, where the separation between CLAUDE.md, skills, hooks, and permissions becomes even more concrete.

References

  • OpenAI Docs, Responses API
    https://platform.openai.com/docs/api-reference/responses/create?api-mode=responses
  • OpenAI Docs, Using tools
    https://platform.openai.com/docs/guides/tools?api-mode=responses
  • OpenAI Docs, Conversation state
    https://platform.openai.com/docs/guides/conversation-state?api-mode=responses
  • OpenAI Docs, Agents SDK guide
    https://platform.openai.com/docs/guides/agents-sdk
  • OpenAI Agents SDK Docs, Agents
    https://openai.github.io/openai-agents-python/agents/
  • docs/blog_series_하네스엔지니어링_총괄_design.md
  • sources/260518_하네스엔지니어링_15장_블로그활용노트.md

This is Part 1/3 of the OpenAI and Claude Harnesses series. Suggested next reading: building a Claude harness, then comparing OpenAI and Claude harness design philosophies.

Series overview: Harness Engineering Series Guide
