"Context Engineering — What to Show the Model (Harness Series 2/6)"

April 29, 2026

This is the second of the seven engineering problems from Series 1/6: deciding what to show the model. Good context → good answer. Break the context, and even GPT-5.5 starts answering like GPT-2.

Part 1 introduced the Agent = Model + Harness thesis. This article covers the most consequential axis of that harness: Context Engineering.

Context Engineering = "the discipline of designing systems that dynamically assemble the right information for each step, rather than loading everything and hoping."

This guide walks through the principles, techniques, and per-tool implementations (Claude Code, Cursor, oMLX, etc.).

Series Roadmap (6 parts)

What Is Harness Engineering? — definition, history, why
Context Engineering ← this article
Memory Systems — session/long-term, vector vs SQL
Tools & Sandboxing — tool spec, MCP, VM isolation
Multi-Provider Routing — cost/quality routing, fallback
Evaluation & Ops — eval harness, observability, runner

1. The Context Window Is a Shared Budget

One token ≈ 0.75 English words. A 128K window holds about 96K words. A 1M window holds roughly 750K words — five novels.

That budget is shared across:

- System prompt
- Retrieved documents (RAG)
- Conversation history
- The model's own output
- Tool call results (MCP and friends)
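To make the shared-budget idea concrete, here is a minimal accounting sketch. The `ContextBudget` class, the consumer names, and the per-consumer token figures are all illustrative assumptions, not numbers from any real tool; only the 0.75 words-per-token rule of thumb comes from the text above.

```python
from dataclasses import dataclass, field


@dataclass
class ContextBudget:
    """Tracks how one fixed token window is split across consumers (hypothetical helper)."""
    window: int
    spent: dict[str, int] = field(default_factory=dict)

    def remaining(self) -> int:
        return self.window - sum(self.spent.values())

    def spend(self, consumer: str, tokens: int) -> None:
        # Refuse allocations that would overflow the window.
        if tokens > self.remaining():
            raise ValueError(f"{consumer} needs {tokens} tokens, {self.remaining()} left")
        self.spent[consumer] = self.spent.get(consumer, 0) + tokens


def words_for(tokens: int, words_per_token: float = 0.75) -> int:
    """Rough word capacity, using the ~0.75 words/token rule of thumb."""
    return round(tokens * words_per_token)


# Illustrative allocation for a 128K window (made-up figures).
budget = ContextBudget(window=128_000)
for consumer, tokens in [
    ("system_prompt", 2_000),
    ("retrieved_docs", 20_000),
    ("history", 30_000),
    ("tool_results", 25_000),
    ("output_reserve", 8_000),
]:
    budget.spend(consumer, tokens)

print(words_for(128_000))   # 96000
print(budget.remaining())   # 43000
```

Every consumer draws from the same pool, so a large tool result directly reduces what is left for history or retrieved documents.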

Key principle: "Bigger context window = better outcome" is wrong. What you spend the budget on is what matters.

Major-Model Context (April 2026)

Model	Context
GPT-5.5	1.05M tokens (2× rate above 272K input)
Claude Opus 4.7	200K (1M optional)
DeepSeek V4	1M tokens
Kimi K2.6	256K tokens
Qwen3.6-27B	262K (extensible to 1M via YaRN)

A million tokens is now common, but filling it is rarely the answer.

2. Why "Don't Fill It" — Lost in the Middle

Long-context research consistently shows:

Models use the beginning and the end of context best. The middle gets lost (the "Lost in the Middle" problem).

Filling 1M tokens means:

- Higher TTFT (prefill is O(N²))
- Higher cost
- Lower accuracy (information buried in the middle)
- Reasoning burnout: a 50-step workflow at 20K tokens/call = 1M cumulative, and output quality degrades
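The arithmetic behind those bullets can be checked in a few lines. This is a back-of-the-envelope sketch that takes the O(N²) prefill claim at face value; real inference stacks add linear terms, caching, and kernel optimizations, so the ratio is an upper-end illustration rather than a measurement. Both function names are made up for this example.

```python
def cumulative_tokens(steps: int, tokens_per_call: int) -> int:
    """Total tokens processed across a multi-step workflow."""
    return steps * tokens_per_call


def relative_prefill_cost(n_tokens: int, baseline_tokens: int) -> float:
    """Cost ratio under a pure O(N^2) model (ignores linear terms and caching)."""
    return (n_tokens / baseline_tokens) ** 2


print(cumulative_tokens(50, 20_000))               # 1000000
print(relative_prefill_cost(1_000_000, 100_000))   # 100.0
```

Under this simplified model, a prompt 10 times longer costs about 100 times more prefill compute.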

Bottom line: Context should be small and precise.

3. Five Core Techniques

3-1. Auto-Compact

When context utilization nears the cap, compact older content into a summary.

Claude Code: runs auto-compact at roughly 98% utilization, summarizing older turns to free space.

oMLX takes a different angle — external persistence via SSD KV cache rather than in-window compaction.
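A minimal sketch of threshold-triggered compaction follows. It is not how Claude Code implements auto-compact; the `summarize` callback stands in for a model call, `estimate_tokens` is a crude word-count heuristic, and `keep_recent` is an assumed parameter. The 0.98 default simply mirrors the figure quoted above.

```python
import math
from typing import Callable


def estimate_tokens(text: str) -> int:
    """Crude estimate: words / 0.75 (assumption, not a real tokenizer)."""
    return math.ceil(len(text.split()) / 0.75)


def auto_compact(
    turns: list[str],
    window: int,
    summarize: Callable[[list[str]], str],
    threshold: float = 0.98,
    keep_recent: int = 4,
) -> list[str]:
    """Replace older turns with one summary once usage nears the cap."""
    used = sum(estimate_tokens(t) for t in turns)
    if used < threshold * window or len(turns) <= keep_recent:
        return turns
    older, recent = turns[:-keep_recent], turns[-keep_recent:]
    return [summarize(older)] + recent


turns = ["word " * 30 for _ in range(10)]   # 10 turns, about 40 tokens each
compacted = auto_compact(
    turns, window=400, summarize=lambda old: f"[summary of {len(old)} turns]"
)
print(len(compacted), compacted[0])   # 5 [summary of 6 turns]
```

The most recent turns stay verbatim because they are the ones the model is most likely to need word for word.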

3-2. Hierarchical Processing

Convert older context into successively more abstract representations:

- Raw: "User edited line 42 of file X to handle case Y..."
- Level 1: "User edited auth logic in file X"
- Level 2: "Session focused on auth + routing"

Deeper compression = more density, more detail loss. Find the balance.
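One way to picture the trade-off is to pick an abstraction level by age. The sketch below assumes each event already carries its three representations (in practice a model would generate the summaries); the `Event` type and the `keep_raw` / `keep_level1` cutoffs are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Event:
    raw: str      # full detail
    level1: str   # per-event summary
    level2: str   # session-level summary


def render(events: list[Event], keep_raw: int = 2, keep_level1: int = 3) -> list[str]:
    """Newest events stay raw; older ones fall back to coarser summaries."""
    out: list[str] = []
    newest = len(events) - 1
    for i, event in enumerate(events):
        age = newest - i
        if age < keep_raw:
            text = event.raw
        elif age < keep_raw + keep_level1:
            text = event.level1
        else:
            text = event.level2
        # Many old events share one session-level summary; emit it once.
        if not out or out[-1] != text:
            out.append(text)
    return out


events = [Event(f"raw {i}", f"summary {i}", "Session focused on auth + routing") for i in range(7)]
for line in render(events):
    print(line)
```

With seven events, the two oldest collapse into a single session-level line, the next three appear as per-event summaries, and the last two stay raw.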

3-3. Tool-Output Limiting

Tool calls eat context fast.

Claude Code default: MCP tool output capped at 25,000 tokens, with a warning at 10,000. Servers may opt into up to 500,000 characters, persisted to disk instead of held in context.

Don't load full tool results into the window. Big results go to disk; the model sees a summary or index.
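The "big results go to disk" rule can be sketched as a wrapper around any tool call. This is an illustration, not Claude Code's mechanism: the `limit_tool_output` name is made up, and it caps by characters rather than tokens to avoid depending on a tokenizer.

```python
import tempfile
from typing import Optional


def limit_tool_output(result: str, max_chars: int = 2_000, spill_dir: Optional[str] = None) -> str:
    """Return small results as-is; spill large ones to disk and return a preview."""
    if len(result) <= max_chars:
        return result
    # Persist the full output so it can be read back on demand.
    with tempfile.NamedTemporaryFile(
        "w", suffix=".txt", prefix="tool_out_", dir=spill_dir, delete=False, encoding="utf-8"
    ) as handle:
        handle.write(result)
        path = handle.name
    preview = result[:max_chars]
    return f"{preview}\n[truncated: {len(result)} chars total; full output saved to {path}]"


print(limit_tool_output("ok"))                          # ok
print(limit_tool_output("x" * 5_000, max_chars=20))
```

The model sees the preview plus a pointer, and can ask for a specific slice of the saved file if the preview is not enough.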

3-4. Reference vs Inline

Instead of dropping a whole code file into context:

- Put just the file path in the system prompt
- Let the model use a Read tool when needed

This makes context-filling dynamic. A 100K file isn't carried every turn.
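A rough sketch of the pattern: the system prompt lists paths only, and a read tool returns a bounded, line-numbered slice on request. The `build_system_prompt` and `read_file` names, and the `big_module.py` file, are hypothetical and not the signature of any real tool.

```python
import tempfile
from pathlib import Path


def build_system_prompt(paths: list[str]) -> str:
    """List file paths only; contents are fetched on demand."""
    listing = "\n".join(f"- {p}" for p in paths)
    return f"Project files (use read_file to view contents):\n{listing}"


def read_file(path: str, start_line: int = 1, max_lines: int = 200) -> str:
    """Return a bounded, line-numbered slice of a file."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    chunk = lines[start_line - 1 : start_line - 1 + max_lines]
    return "\n".join(f"{start_line + i}: {line}" for i, line in enumerate(chunk))


# Demo with a throwaway 1,000-line file.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "big_module.py"
    demo.write_text("\n".join(f"line {n}" for n in range(1, 1001)), encoding="utf-8")
    prompt = build_system_prompt([demo.name])
    excerpt = read_file(str(demo), start_line=3, max_lines=2)

print(prompt)
print(excerpt)
```

The prompt costs a handful of tokens per file regardless of file size; only the requested slice ever enters the window.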

3-5. CLAUDE.md / Project Context Files

Persistent project information lives in a separate file that's auto-loaded each session.

Claude Code: auto-loads CLAUDE.md from the project root every session. Typical contents are custom commands, coding conventions, and forbidden patterns.

This file is part of Guides (per Part 1's classification). It persists across sessions and accumulates knowledge instead of vanishing when the context is cleared.

4. Claude Code's 5-Layer Compaction Pipeline

Per arXiv 2604.14228 ("Dive into Claude Code"), context management is a 5-layer pipeline:

Live conversation buffer — most recent turns intact
Recent tool results — partial retention
Compacted history — older turns summarized
Project context — CLAUDE.md + project metadata
System primitives — core system prompt

Each layer has its own compression ratio and lifetime. When layer 1 fills, content is evicted to layers 2 and 3; once those fill, content gets summarized.

vs Plain FIFO

Most chatbots just delete the oldest message. Claude Code summarizes and keeps. Information loss is minimized while space is reclaimed.
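The difference from plain FIFO can be shown with a toy model of the first three layers. This is a loose illustration of layered eviction, not the pipeline described in the paper: the `LayeredHistory` class, the capacities, and the word-truncation "summary" are all assumptions (a real system would summarize with a model).

```python
from collections import deque


class LayeredHistory:
    """Toy 3-layer history: live buffer -> recent -> compacted summaries."""

    def __init__(self, live_cap: int = 3, recent_cap: int = 3) -> None:
        self.live: deque[str] = deque()
        self.recent: deque[str] = deque()
        self.compacted: list[str] = []
        self.live_cap = live_cap
        self.recent_cap = recent_cap

    def add(self, turn: str) -> None:
        self.live.append(turn)
        # Overflow from the live buffer moves down a layer instead of being dropped.
        while len(self.live) > self.live_cap:
            self.recent.append(self.live.popleft())
        # Overflow from the recent layer is summarized, not deleted.
        while len(self.recent) > self.recent_cap:
            self.compacted.append(self._summarize(self.recent.popleft()))

    @staticmethod
    def _summarize(turn: str) -> str:
        # Placeholder summary: keep the first three words.
        return " ".join(turn.split()[:3]) + " ..."


history = LayeredHistory()
for i in range(8):
    history.add(f"turn {i}: user asked about module {i} in detail")

print(history.compacted)    # ['turn 0: user ...', 'turn 1: user ...']
print(list(history.live))   # the three most recent turns, intact
```

A FIFO buffer of the same total size would have discarded turns 0 and 1 entirely; here a trace of them survives.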

5. MCP — Standardized Tool Interface

Anthropic's Model Context Protocol (MCP), released November 2024, is the standard for tool-call wiring. As of 2026, Claude Code, Cursor, and OpenAI Codex CLI all use MCP servers.

Context Implications

MCP servers expose external systems (DB, API, FS) as tools
Tool results land in the context window → pressure
The Claude Code 25K/500K limits exist to manage this pressure

Good vs Bad MCP Servers

Good: returns summarized results; large data exposed via IDs to fetch separately
Bad: dumps a 50K-row SQL result back, blowing up context

Tool authors share responsibility for context engineering.
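The "good server" shape can be sketched without any MCP SDK: return a small summary plus a handle, and expose a second call for paging. The `run_query` and `fetch_rows` functions and the in-memory `_RESULTS` store are hypothetical; a real server would wire equivalents into its own tool definitions and storage.

```python
import uuid

# In-memory result store, keyed by handle (a real server might use disk or a cache).
_RESULTS: dict[str, list[dict]] = {}


def run_query(rows: list[dict], preview_rows: int = 5) -> dict:
    """Store the full result server-side; return only a summary and a handle."""
    result_id = uuid.uuid4().hex
    _RESULTS[result_id] = rows
    return {
        "result_id": result_id,
        "row_count": len(rows),
        "columns": sorted(rows[0]) if rows else [],
        "preview": rows[:preview_rows],
    }


def fetch_rows(result_id: str, offset: int = 0, limit: int = 50) -> list[dict]:
    """Page through a stored result on demand."""
    return _RESULTS[result_id][offset : offset + limit]


rows = [{"id": i, "name": f"user{i}"} for i in range(50_000)]
summary = run_query(rows)
print(summary["row_count"], summary["columns"])   # 50000 ['id', 'name']
print(fetch_rows(summary["result_id"], offset=100, limit=2))
```

The 50K-row result never enters the window; the model pays for five preview rows and fetches more only if it needs them.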

6. Production Patterns

6-1. System-Prompt Diet

Bad: "You are an extremely helpful, friendly, cooperative, deeply thoughtful, accurate..."
Good: "Goal-driven. State results directly. No filler."

Trade adjective stacks for behavior commands. Saves tokens. Gives the model an actionable directive.

6-2. Externalize In-Progress State

Big tasks save mid-step state to external files (plan.md, tasks/) → reload after a context clear.

CLAUDE.md pattern: every task ends with a snapshot in tasks/sessions/. Auto-loaded next session.
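A minimal sketch of snapshotting state to disk and restoring it after a context clear. The JSON layout, the `snapshot.json` filename, and the helper names are assumptions; only the `tasks/sessions/` directory comes from the text above.

```python
import json
import tempfile
from pathlib import Path


def save_snapshot(path: Path, state: dict) -> None:
    """Write in-progress state to disk so it survives a context clear."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(state, indent=2), encoding="utf-8")


def load_snapshot(path: Path) -> dict:
    """Reload saved state; an absent file means a fresh start."""
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


# Demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    snapshot = Path(tmp) / "tasks" / "sessions" / "snapshot.json"
    save_snapshot(snapshot, {"done": ["parse config"], "next": "write tests"})
    restored = load_snapshot(snapshot)
    missing = load_snapshot(Path(tmp) / "absent.json")

print(restored["next"])   # write tests
```

The snapshot only needs enough to resume: what is done, what is next, and any decisions that would be expensive to rediscover.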

6-3. Sub-Agent Isolation

Long searches and analyses go to a sub-agent. Main context only sees the result, not the search noise.

Claude Code's Agent tool implements this pattern.
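The isolation idea reduces to a private scratch context whose contents never reach the caller. The sketch below is a toy, not the Agent tool's implementation: `run_subagent`, the `Step` type, and the lambda steps (with their made-up file path) stand in for real model and tool calls.

```python
from typing import Callable

# A step reads the sub-agent's private context and returns new text.
Step = Callable[[list[str]], str]


def run_subagent(task: str, steps: list[Step]) -> str:
    """Run steps in a private context; return only the last step's output."""
    sub_context = [f"task: {task}"]
    for step in steps:
        sub_context.append(step(sub_context))
    return sub_context[-1]


steps: list[Step] = [
    lambda ctx: "grep output: 400 lines of matches ...",
    lambda ctx: "opened 12 candidate files ...",
    lambda ctx: "Result: token refresh is handled in auth/session.py",
]

main_context = ["user: where is token refresh handled?"]
main_context.append(run_subagent("find token refresh logic", steps))
print(main_context)
```

The main context grows by one entry no matter how many intermediate steps the sub-agent ran.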

6-4. RAG vs Long Context

RAG: 100K-doc index, fetch 5K per query
Long Context: load all 100K in-window

Comparison:

| Item | RAG | Long Context |
|---|---|---|
| Cost | Low | High (full prefill) |
| Accuracy | Depends on retrieval | Self-attention |
| Update | Re-index only | Re-fill every time |
| Use when | Structured retrieval | One-shot whole-doc reasoning |

Even with 1M-token models, RAG retains a clear cost edge.
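To illustrate the retrieval side, here is a deliberately simple sketch that ranks documents by word overlap and stops at a token budget. Real systems would use embeddings or BM25; the `retrieve` function, the overlap scoring, and the sample documents are assumptions chosen to keep the example dependency-free.

```python
import math


def estimate_tokens(text: str) -> int:
    """Crude estimate: words / 0.75 (assumption, not a real tokenizer)."""
    return math.ceil(len(text.split()) / 0.75)


def retrieve(query: str, docs: list[str], token_budget: int) -> list[str]:
    """Pick the best-overlapping docs that fit within the token budget."""
    query_words = set(query.lower().split())

    def overlap(doc: str) -> int:
        return len(query_words & set(doc.lower().split()))

    picked: list[str] = []
    used = 0
    for doc in sorted(docs, key=overlap, reverse=True):
        if overlap(doc) == 0:
            break  # remaining docs share no words with the query
        cost = estimate_tokens(doc)
        if used + cost <= token_budget:
            picked.append(doc)
            used += cost
    return picked


docs = [
    "auth tokens expire after one hour",
    "routing table lives in router.py",
    "the cafeteria menu changes weekly",
]
print(retrieve("when do auth tokens expire", docs, token_budget=20))
# ['auth tokens expire after one hour']
```

With the figures above, fetching 5K tokens per query instead of loading all 100K means 20 times fewer input tokens on every call.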

7. Tool-by-Tool Comparison

Tool	Auto-compact	Tool-output cap	Project context	External memory
Claude Code	✅ (~98%)	25K/500K	CLAUDE.md	✅ (Skills, MCP)
Cursor 3	✅	—	.cursorrules	✅ (cloud agent)
OpenClaw	✅	—	CLAUDE.md (compat)	✅ (memcore)
Hermes	✅	—	own	✅ (FTS5 + vector)
oMLX	— (model-level)	—	—	SSD KV cache

oMLX persists KV cache at the inference server layer (not the model). Repeated calls that share the same system prompt reuse the cached prefix, so prefill for that prefix is effectively free.

8. Anti-Patterns (the common mistakes)

8-1. "Just stuff it into a big-window model"

A 1M window doesn't help if you fill it:

- TTFT explodes (prefill ∝ N²)
- Cost explodes (GPT-5.5 above 272K input doubles)
- Middle information lost (Lost in the Middle)

8-2. "Pipe tool results directly into context"

A 50K-row SQL dump pollutes the context window for every subsequent turn. Summarize and externalize.

8-3. "Cram everything into CLAUDE.md"

A 30K-token CLAUDE.md eats 30K from every session start. Keep core only; move infrequent knowledge to docs/ and load on demand.

8-4. "Preserve all conversation history"

Long conversations need compression and externalization. Keeping everything is the most common failure when moving from chatbot patterns to agents.

Bottom Line

Context Engineering is not a window-size race.

Principle	One line
Small and precise	5K accurate beats 1M filled
Dynamic assembly	Build at runtime, not statically
Externalize	Memory, tools, files live outside — pulled in when needed
Auto-compact	Compress before the cap
Layer separation	System / project / session / tool results have different lifetimes

The single takeaway: "The context window is a shared budget. Every token has a price."

Part 3 (next) covers preserving information outside the window — Memory Systems.

First-Party Sources

"Dive into Claude Code": arxiv.org/abs/2604.14228 (Liu et al., 2026-04-14)
Anthropic MCP spec: modelcontextprotocol.io
Lost in the Middle: arXiv:2307.03172
Martin Fowler harness article: martinfowler.com/articles/harness-engineering.html