"Harness Engineering by Real Cases (4/4) — How OpenAI, Anthropic, Vercel, GitHub, and Cursor Actually Design Agents"

One of the fastest ways to understand harness engineering is to stop arguing in abstractions and compare public cases. Under the same label of "AI agents," OpenAI emphasizes runtime primitives, Anthropic emphasizes operator surfaces and work separation, Vercel emphasizes productization and deployment flow, GitHub emphasizes pull-request-native collaboration, and Cursor emphasizes the cloud workbench. The real difference is not just the model. It is what each company chooses as the primary work surface.


Key Takeaways

  • When comparing public agent systems, the most useful lens is not model quality but five harness questions: execution environment, tool surface, review artifact, handoff structure, and permission boundary.
  • OpenAI is pushing hard on standard runtime components for agents: the Responses API, shell tool, hosted container workspace, and Codex App Server.
  • Anthropic keeps exposing task separation and context separation: subagents, hooks, permissions, background work, and orchestrator-worker patterns.
  • Vercel treats agents less like standalone coding assistants and more like product surfaces that can generate, run, deploy, and inspect.
  • GitHub places the agent inside the pull request workflow. Branches, commits, reviews, Actions logs, and MCP defaults become the collaboration surface.
  • Cursor emphasizes cloud VMs, handoff between environments, and visual verification artifacts such as screenshots and demos.
  • The practical conclusion is simple: strong harnesses do not just give agents more capability. They create work surfaces and artifacts that humans can reliably review.

1. The right way to read these cases

If you only compare feature lists, the systems start to blur together. They all use tools. They all handle multi-step work. They all talk about MCP or long-running tasks.

Harness engineering asks a better set of questions:

  1. Where does the agent actually work?
  2. What tool surface does it see by default?
  3. What do humans inspect when reviewing the result?
  4. How does work survive across longer runs?
  5. Where are permissions and risk boundaries enforced?

That is the lens used throughout this post.

2. A quick comparison table

| Company | Primary work surface | Most visible harness move | What we should learn |
|---|---|---|---|
| OpenAI | API runtime + container workspace + App Server | tool loop, shell, hosted workspace, cross-surface protocol | unify multiple products around a shared execution engine |
| Anthropic | Claude Code workspace + subagents + hooks + background tasks | context isolation, permission layering, orchestrator-worker | long-running work needs role separation before memory expansion |
| Vercel | v0 product surface + AI SDK + deployment flow + Vercel MCP | prompt-to-product flow, deployability, remote MCP with OAuth | agent outputs become valuable when they are tied to deployment and operations |
| GitHub | cloud agent in a branch/PR/review loop | PR-native workflow, logs, MCP defaults, human review loop | place the agent inside existing engineering collaboration artifacts |
| Cursor | local/cloud agent workspace + remote machine + demos | isolated machine, parallel agents, handoff UX, visual proof | long-running coding needs reviewable evidence, not only textual claims |

This table already shows that each company chooses a different center of gravity.

3. OpenAI: standardizing the agent runtime

On February 4, 2026, OpenAI described the Codex App Server as the shared bridge behind the web app, CLI, IDE extension, and macOS app. Then on March 11, 2026, OpenAI explained how the Responses API can be equipped with a shell tool and hosted container workspace so that model-proposed actions run inside an isolated environment with files, storage, and restricted networking.

The important point is that OpenAI is not mainly presenting the agent as a chat surface. It is presenting the agent as a shared runtime loop.

What the public surface reveals

  • the model proposes tool calls instead of touching the filesystem directly
  • the platform owns shell execution, workspace storage, and network restrictions
  • the Codex App Server lets multiple clients reuse the same harness behavior
  • the runtime semantics matter more than any single UI surface

What to learn from it

This is a strong reminder that if an agent experience is going to exist across terminal, IDE, app, and API surfaces, then reusing prompts is not enough. The loop itself must be shared.

That gives you:

  • more consistent behavior across products
  • fewer duplicated implementations of progress streaming, diffs, and workspace inspection
  • a better place to enforce permissions and execution constraints

The practical lesson is:

If your agent will appear in multiple surfaces, standardize the execution loop and artifact format before you standardize the UI.
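
As an illustration of what "sharing the loop" can mean, here is a minimal Python sketch of a propose/execute loop that any client (CLI, IDE, app) could call. It is not OpenAI's API: `ToolCall`, `run_loop`, `scripted_model`, and the event dictionary format are hypothetical names chosen for this example, and the model is replaced by a scripted stub.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ToolCall:
    name: str
    argument: str


# Hypothetical model interface: given the event history so far, return the
# next proposed tool call, or None when the model considers the task done.
ModelStep = Callable[[list[dict]], Optional[ToolCall]]


def run_loop(model_step: ModelStep,
             tools: dict[str, Callable[[str], str]],
             max_steps: int = 10) -> list[dict]:
    """Run the propose/execute loop and return the event log (the artifact)."""
    events: list[dict] = []
    for _ in range(max_steps):
        call = model_step(events)
        if call is None:
            events.append({"type": "done"})
            break
        if call.name not in tools:
            # The runtime, not the model, decides what is executable.
            events.append({"type": "denied", "tool": call.name})
            continue
        output = tools[call.name](call.argument)
        events.append({"type": "tool_result", "tool": call.name, "output": output})
    return events


def scripted_model(events: list[dict]) -> Optional[ToolCall]:
    """Stand-in for a real model: proposes two calls, then stops."""
    script = [ToolCall("echo", "hello"), ToolCall("rm", "-rf /")]
    return script[len(events)] if len(events) < len(script) else None


log = run_loop(scripted_model, {"echo": lambda text: text})
```

Because every client receives the same event log, progress streaming and review views can be built once against that format rather than once per UI.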

4. Anthropic: long-running work depends on separation

Anthropic's public design surface is different. Claude Code documentation exposes subagents with separate context windows and separate tool permissions. Hooks documentation exposes official lifecycle control points such as PreToolUse, PostToolUse, and SubagentStop. Anthropic's June 13, 2025 engineering post on Research describes a lead agent spawning workers in parallel through an orchestrator-worker pattern. Its September 29, 2025 Claude Code update also highlighted background work for longer-running tasks.

Anthropic is therefore making work separation and operating rules highly visible parts of the product.

What the public surface reveals

  • subagents can run with their own context windows
  • tool access can vary by subagent
  • hooks are an official place to enforce policies and automation
  • long-running work is framed around decomposition and supervision, not just larger memory

What to learn from it

The biggest lesson here is that long-running agents are not mainly a memory feature. They are a structure feature.

What matters more is:

  • context isolation
  • narrowly scoped permissions
  • explicit control points before and after actions
  • worker decomposition under a lead agent

That leads to a practical rule:

To make long-running work more reliable, start with narrower roles and narrower permissions before trying to make one giant agent remember more.
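
The rule can be sketched in a few lines of Python. This is an assumption-laden illustration, not Claude Code's configuration format or hook API: `Worker`, `PreToolHook`, and `no_secrets` are hypothetical names, and tool execution is simulated with strings.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical hook signature (role, tool, argument): return True to allow.
PreToolHook = Callable[[str, str, str], bool]


@dataclass
class Worker:
    role: str
    allowed_tools: frozenset[str]
    context: list[str] = field(default_factory=list)  # private to this worker

    def use_tool(self, tool: str, argument: str, hook: PreToolHook) -> str:
        if tool not in self.allowed_tools:
            result = f"denied: {self.role} may not use {tool}"
        elif not hook(self.role, tool, argument):
            result = f"blocked by hook: {tool} {argument}"
        else:
            result = f"ran {tool} {argument}"  # simulated execution
        self.context.append(result)
        return result


def no_secrets(role: str, tool: str, argument: str) -> bool:
    """Example policy: no worker may touch .env files."""
    return ".env" not in argument


researcher = Worker("researcher", frozenset({"read"}))
editor = Worker("editor", frozenset({"read", "edit"}))

# A lead agent would receive these short results rather than each
# worker's full context.
results = [
    researcher.use_tool("read", "README.md", no_secrets),
    researcher.use_tool("edit", "README.md", no_secrets),
    editor.use_tool("edit", ".env", no_secrets),
]
lead_summary = {w.role: len(w.context) for w in (researcher, editor)}
```

The two checks are deliberately separate: the per-worker tool set narrows the role, while the hook applies a policy that holds for every role.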

5. Vercel: connecting generation to deployment

Vercel's agent story looks different again. The v0 docs present a surface where natural language can generate UIs and apps that can then be deployed to Vercel. Vercel Academy's agent skeleton explicitly frames an agent as a loop handled by ToolLoopAgent. Vercel's MCP docs, updated February 12, 2026, expose an official remote MCP server with OAuth for project management, deployment management, and log analysis.

What makes this interesting is that Vercel does not stop at "the agent made code." It ties generation to run, deploy, and inspect.

What the public surface reveals

  • v0 is presented as a creation surface that naturally flows into deployment
  • the AI SDK exposes the actual tool loop instead of hiding it completely
  • Vercel MCP connects external agents to real operational surfaces
  • deployment and inspection are part of the same product story

What to learn from it

Vercel's strongest lesson is that code generation alone is not yet a product harness. The harness becomes much more useful when the output is already connected to the next operational step.

That means:

  • generated output should be runnable
  • runnable output should be deployable
  • deployable output should remain inspectable
  • MCP should be seen as an operational connector, not just a docs search add-on

The practical takeaway is:

If the agent's output does not naturally continue into the next operational step, the harness is still incomplete.
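
One way to picture "runnable, then deployable, then inspectable" is a pipeline that stops at the first stage that fails. The Python sketch below is illustrative only: the stage functions, the `preview.invalid` URL, and the artifact dictionary are hypothetical, and no Vercel API is called.

```python
from typing import Callable

# Each stage takes the artifact dict and returns (ok, updated_artifact).
Stage = Callable[[dict], tuple[bool, dict]]


def run_check(artifact: dict) -> tuple[bool, dict]:
    # Stand-in for actually starting the app: here, just compile the source.
    try:
        compile(artifact["source"], "<generated>", "exec")
    except SyntaxError:
        return False, artifact
    return True, {**artifact, "runnable": True}


def deploy(artifact: dict) -> tuple[bool, dict]:
    # Hypothetical deploy step; a real one would call a deployment API.
    return True, {**artifact, "preview_url": "https://preview.invalid/build-1"}


def attach_logs(artifact: dict) -> tuple[bool, dict]:
    # Keep something a human can look at after the run.
    return True, {**artifact, "logs": ["deployed build-1"]}


def pipeline(artifact: dict, stages: list[tuple[str, Stage]]) -> dict:
    """Run stages in order and stop at the first one that fails."""
    for name, stage in stages:
        ok, artifact = stage(artifact)
        if not ok:
            return {**artifact, "stopped_at": name}
    return {**artifact, "stopped_at": None}


STAGES = [("run", run_check), ("deploy", deploy), ("inspect", attach_logs)]
good = pipeline({"source": "print('hi')"}, STAGES)
bad = pipeline({"source": "print('hi'"}, STAGES)  # unbalanced parenthesis
```

The second input never reaches the deploy stage, so the artifact records where the flow stopped and carries no preview URL.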

6. GitHub: making the agent PR-native

GitHub's Copilot cloud agent docs explain that work happens in a GitHub Actions-powered environment, where the agent can research a repository, create a plan, make changes on a branch, and optionally open a pull request. The MCP docs show default GitHub and Playwright MCP servers, with the GitHub server using a specially scoped read-only token for the current repository by default. Review docs explicitly say that Copilot-created PRs still need normal human review.

GitHub's philosophy is unusually clear: the agent belongs inside branches, commits, pull requests, reviews, and workflow logs.

What the public surface reveals

  • coding and iteration happen on GitHub, not only in a chat UI
  • the working unit is both a session and a branch/PR artifact
  • the default MCP servers support repository context and browser-based verification
  • human review remains a first-class step

What to learn from it

GitHub's design works because it keeps the agent inside artifacts teams already understand.

That gives you:

  • lower adoption cost
  • natural reuse of review culture and compliance policy
  • a durable audit trail through diffs, logs, and comments

The practical lesson is:

For team-facing coding agents, pull request quality is often a more meaningful metric than chat quality.
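
To make the PR-native idea concrete, here is a small Python sketch of a merge gate in which agent-authored work still needs passing checks and approval from a human reviewer. It models the concept only: `PullRequest` and `can_merge` are hypothetical and do not correspond to GitHub's API or its branch protection settings.

```python
from dataclasses import dataclass, field


@dataclass
class PullRequest:
    branch: str
    author: str                      # e.g. "agent" or a human login
    commits: list[str] = field(default_factory=list)
    approvals: set[str] = field(default_factory=set)
    checks_passed: bool = False


def can_merge(pr: PullRequest, human_reviewers: set[str]) -> tuple[bool, str]:
    """Gate: agent-authored work still needs human approval and green checks."""
    if not pr.commits:
        return False, "no commits to review"
    if not pr.checks_passed:
        return False, "workflow checks have not passed"
    human_approvals = (pr.approvals & human_reviewers) - {pr.author}
    if not human_approvals:
        return False, "needs approval from a human reviewer"
    return True, "ok"


reviewers = {"alice", "bob"}
pr = PullRequest(branch="agent/fix-login", author="agent",
                 commits=["fix: handle empty password"], checks_passed=True)
before = can_merge(pr, reviewers)
pr.approvals.add("agent")            # a self-approval does not count
self_approved = can_merge(pr, reviewers)
pr.approvals.add("alice")
after = can_merge(pr, reviewers)
```

Every field the gate reads is already a pull request artifact, so the agent's work can be judged without consulting a chat transcript.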

7. Cursor: the cloud workbench and proof artifacts

Cursor's February 24, 2026 post about agents controlling their own computers emphasized cloud sandboxes, screenshots, videos, and browser testing. Its April 2, 2026 Cursor 3 launch highlighted multi-repo layout, local/cloud handoff, and one interface for many agents across desktop, web, mobile, Slack, GitHub, and Linear. Background agents documentation also describes isolated Ubuntu-based machines, repo cloning, branch handoff, and environment setup via environment.json.

Cursor is therefore making the remote workbench itself a core part of the agent harness.

What the public surface reveals

  • remote machines are not a hidden implementation detail but a user-facing concept
  • environment setup, background terminals, and machine snapshots matter
  • handoff between local and cloud work is an explicit design problem
  • screenshots, demos, and videos are treated as review artifacts

What to learn from it

Cursor is especially strong on one point: textual assurances are not enough. People trust long-running coding agents more when they can inspect visible proof of what happened.

That suggests a practical rule:

  • UI work should often produce screenshots
  • browser work should often produce demos or recordings
  • longer runs should end in a branch or handoff artifact, not only a summary paragraph

So the main lesson is:

The more autonomous the coding agent becomes, the more important reviewable evidence becomes.
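
A harness can turn this rule into a check at the end of a run. The following Python sketch assumes a hypothetical mapping from task kind to required evidence; the kind names and artifact keys are illustrative and are not taken from Cursor's product.

```python
# Hypothetical policy: which evidence each kind of task owes the reviewer.
REQUIRED_EVIDENCE = {
    "ui": {"screenshot"},
    "browser": {"recording"},
    "long_run": {"branch"},
}


def missing_evidence(task_kinds: set[str], artifacts: dict[str, str]) -> set[str]:
    """Return the evidence types the run still owes the reviewer."""
    required: set[str] = set()
    for kind in task_kinds:
        required |= REQUIRED_EVIDENCE.get(kind, set())
    # An artifact only counts if it points at a non-empty location.
    provided = {name for name, location in artifacts.items() if location}
    return required - provided


run_artifacts = {"screenshot": "artifacts/home.png", "recording": ""}
owed = missing_evidence({"ui", "browser"}, run_artifacts)
```

A run that returns a non-empty set here would be reported as incomplete instead of "done", however confident the summary text sounds.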

8. Common patterns across all five cases

Despite major product differences, a few harness patterns repeat.

8-1. The agent is an operator in a workspace, not just a responder

OpenAI has a hosted workspace. Cursor has a remote machine. GitHub has an Actions environment. Vercel has a deployment-connected product surface. Anthropic has an operating workspace with subagents and hooks.

All of them move beyond a pure answer box.

8-2. Tool surface design matters more than raw tool count

GitHub defaults to a narrow MCP set. Anthropic emphasizes per-subagent permissions. OpenAI explicitly frames the shell tool and workspace. Vercel connects agents to project and deployment tools. The repeated pattern is a small, legible action surface: one where a reviewer can tell what the agent was able to do.

8-3. Long-running work depends on handoff structure more than on memory alone

Cursor's local/cloud handoff, GitHub's branch and PR artifacts, Anthropic's worker separation, and OpenAI's shared cross-surface harness all reflect the same insight: continuity comes from structure kept outside the model (branches, handoffs, worker boundaries, a shared harness), not from a longer memory.

8-4. Strong harnesses leave reviewable artifacts for humans

  • OpenAI: diffs and streamed progress
  • Anthropic: hook boundaries, subagent structure, background operating surfaces
  • Vercel: deployable output, project state, deployment logs
  • GitHub: branch, commit, PR, review, workflow log
  • Cursor: screenshot, demo, video, remote branch handoff

The common principle is that a harness should not trap truth inside the model's internal state. It should leave external evidence humans can inspect.

9. A practical checklist for our own work

These cases point to a short but useful checklist.

1. Decide the primary work surface first

Is the agent working in chat, in a repository, in a PR loop, in a deployment system, or in a remote machine? If this stays vague, the rest of the harness usually stays vague too.

2. Reduce the tool surface before expanding it

Separate search, read, edit, execute, publish, and deploy. Ambiguous tools increase confusion for both the model and the reviewer.
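
One possible way to keep those six actions separate is to tag every tool with exactly one capability and expose only the tools a role has been granted. The tool names in this Python sketch are hypothetical.

```python
from enum import Flag, auto


class Capability(Flag):
    SEARCH = auto()
    READ = auto()
    EDIT = auto()
    EXECUTE = auto()
    PUBLISH = auto()
    DEPLOY = auto()


# Hypothetical tool registry: each tool declares exactly one capability.
TOOLS = {
    "grep_repo": Capability.SEARCH,
    "read_file": Capability.READ,
    "apply_patch": Capability.EDIT,
    "run_tests": Capability.EXECUTE,
    "open_pr": Capability.PUBLISH,
    "deploy_preview": Capability.DEPLOY,
}


def visible_tools(granted: Capability) -> list[str]:
    """Return only the tools whose capability is within the grant."""
    return sorted(name for name, cap in TOOLS.items() if cap in granted)


reviewer_grant = Capability.SEARCH | Capability.READ
coder_grant = reviewer_grant | Capability.EDIT | Capability.EXECUTE

reviewer_tools = visible_tools(reviewer_grant)
coder_tools = visible_tools(coder_grant)
```

Neither grant includes publish or deploy, so those tools never appear in the surface the model sees, and a reviewer can read the grant to know what was possible.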

3. Let long-running work resume from artifacts, not just conversation

Plans, handoffs, branches, PRs, screenshots, and deployment logs are often better continuity anchors than chat history.
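
A minimal Python sketch of resuming from an artifact, assuming a hypothetical `handoff.json` layout with `plan`, `done`, and `branch` fields:

```python
import json
import tempfile
from pathlib import Path


def save_handoff(path: Path, plan: list[str], done: list[str], branch: str) -> None:
    """Write the state a later session needs, independent of chat history."""
    state = {"plan": plan, "done": done, "branch": branch}
    path.write_text(json.dumps(state, indent=2))


def next_steps(path: Path) -> list[str]:
    """Resume: read the artifact and return the plan steps not yet done."""
    state = json.loads(path.read_text())
    return [step for step in state["plan"] if step not in state["done"]]


with tempfile.TemporaryDirectory() as tmp:
    handoff = Path(tmp) / "handoff.json"
    save_handoff(handoff,
                 plan=["write failing test", "fix bug", "open PR"],
                 done=["write failing test"],
                 branch="agent/fix-login")
    remaining = next_steps(handoff)
```

A new session, a different agent, or a human can pick up from this file without replaying the conversation that produced it.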

4. Make verification visible to humans

In many workflows, a diff, screenshot, preview URL, or workflow log is more useful than a plain "done" message.

5. Enforce permissions through execution boundaries

Approvals, sandboxing, branch isolation, read-only tokens, remote OAuth, and hooks are real guardrails. Descriptive warnings in prompts are not enough.
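
The difference between a prompt warning and an execution boundary can be shown with a path check that runs before any file operation. This Python sketch is one illustrative boundary, not a complete sandbox; `resolve_inside` and the workspace path are hypothetical.

```python
from pathlib import Path


class BoundaryError(PermissionError):
    """Raised when a requested path falls outside the workspace."""


def resolve_inside(workspace: Path, requested: str) -> Path:
    """Resolve a requested path and refuse anything outside the workspace."""
    root = workspace.resolve()
    target = (root / requested).resolve()
    if not target.is_relative_to(root):  # requires Python 3.9+
        raise BoundaryError(f"{requested!r} escapes {root}")
    return target


def try_resolve(workspace: Path, requested: str) -> str:
    try:
        resolve_inside(workspace, requested)
        return "allowed"
    except BoundaryError:
        return "refused"


workspace = Path("/tmp/agent-workspace")
outcomes = {p: try_resolve(workspace, p)
            for p in ["src/app.py", "../../etc/passwd", "/etc/passwd"]}
```

The check holds no matter what the model was told or decided, which is the property a guardrail needs.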

10. Common misreadings

"The best systems just have more features"

Not really. The strongest public cases tend to make execution and review surfaces clearer, not merely larger.

"Long-running agents are mainly about bigger context windows"

The public cases suggest otherwise. Handoff, isolation, permissions, and evidence surfaces matter at least as much.

"Adding MCP automatically improves the harness"

Only partially. MCP is a connection standard. The more important questions are what gets connected, how it is scoped, and what permissions and review paths surround it.

11. Conclusion: five companies, one underlying lesson

OpenAI, Anthropic, Vercel, GitHub, and Cursor speak different product languages.

  • OpenAI speaks in runtime language
  • Anthropic speaks in operator and decomposition language
  • Vercel speaks in productization and deployment language
  • GitHub speaks in collaboration and review language
  • Cursor speaks in cloud workbench and verification artifact language

But when you translate all five back into harness engineering, they keep repeating the same lesson:

Strong agents do not come only from stronger models. They come from clearer workbenches, narrower tool surfaces, stronger handoff structures, more reviewable artifacts, and firmer permission boundaries.

That is also the conclusion of the full D-series: repeatable patterns matter, architecture decisions matter, and ACI (the agent-computer interface) matters because the workbench changes the quality of the work.

References

  • docs/blog_series_하네스엔지니어링_총괄_design.md
  • sources/260518_하네스엔지니어링_15장_블로그활용노트.md
  • OpenAI, Unlocking the Codex harness: how we built the App Server (February 4, 2026) — https://openai.com/index/unlocking-the-codex-harness/
  • OpenAI, From model to agent: Equipping the Responses API with a computer environment (March 11, 2026) — https://openai.com/index/equip-responses-api-computer-environment/
  • Anthropic, How we built our multi-agent research system (June 13, 2025) — https://www.anthropic.com/engineering/multi-agent-research-system
  • Anthropic Docs, Subagents / Hooks reference / Security (verified May 19, 2026) — https://docs.anthropic.com/en/docs/claude-code/sub-agents / https://docs.anthropic.com/en/docs/claude-code/hooks / https://docs.anthropic.com/en/docs/claude-code/security
  • Vercel Docs, v0 / Use Vercel's MCP server / Vercel Academy Agent Skeleton (verified May 19, 2026; MCP doc updated February 12, 2026) — https://v0.app/docs / https://vercel.com/docs/agent-resources/vercel-mcp / https://vercel.com/academy/filesystem-agents/agent-skeleton
  • GitHub Docs, About Copilot cloud agent / MCP and Copilot cloud agent (verified May 19, 2026) — https://docs.github.com/en/copilot/concepts/agents/cloud-agent/about-cloud-agent / https://docs.github.com/en/copilot/concepts/agents/cloud-agent/mcp-and-cloud-agent
  • Cursor, Cursor agents can now control their own computers (February 24, 2026) / Meet the new Cursor (April 2, 2026) — https://cursor.com/blog/agent-computer-use / https://cursor.com/blog/cursor-3

This post is Part 4 of 4 in the Patterns, Strategy, and Cases series. Earlier parts: 12 Harness Patterns, 7 Harness Design Decisions, Harness Is Everything.

Series overview: Harness Engineering Series Guide
