"Context Engineering — What to Show the Model (Harness Series 2/6)"
Second of the seven engineering problems from Series 1/6. Deciding what to show the model. Good context → good answer. Break the context, and even GPT-5.5 starts answering like GPT-2.
Part 1 introduced the Agent = Model + Harness thesis. This article covers the most consequential axis of that harness: Context Engineering.
Context Engineering = "the discipline of designing systems that dynamically assemble the right information for each step, rather than loading everything and hoping."
This guide walks through the principles, techniques, and per-tool implementations (Claude Code, Cursor, oMLX, etc.).
Series Roadmap (6 parts)
- What Is Harness Engineering? — definition, history, why
- Context Engineering ← this article
- Memory Systems — session/long-term, vector vs SQL
- Tools & Sandboxing — tool spec, MCP, VM isolation
- Multi-Provider Routing — cost/quality routing, fallback
- Evaluation & Ops — eval harness, observability, runner
1. The Context Window Is a Shared Budget
One token ≈ 0.75 English words. A 128K window holds about 96K words. A 1M window holds roughly 750K words — five novels.
That budget is shared across: - System prompt - Retrieved documents (RAG) - Conversation history - The model's own output - Tool call results (MCP and friends)
Key principle: "Bigger context window = better outcome" is wrong. What you spend the budget on is what matters.
Major-Model Context (April 2026)
| Model | Context |
|---|---|
| GPT-5.5 | 1.05M tokens (2× rate above 272K input) |
| Claude Opus 4.7 | 200K (1M optional) |
| DeepSeek V4 | 1M tokens |
| Kimi K2.6 | 256K tokens |
| Qwen3.6-27B | 262K (extensible to 1M via YaRN) |
A million tokens is now standard — but filling it is rarely the answer.
2. Why "Don't Fill It" — Lost in the Middle
Long-context research consistently shows:
Models use the beginning and the end of context best. The middle gets lost (the "Lost in the Middle" problem).
Filling 1M tokens means: - Higher TTFT (prefill is O(N²)) - Higher cost - Lower accuracy (information buried in the middle) - Reasoning burnout: a 50-step workflow at 20K tokens/call = 1M cumulative — output quality degrades
Bottom line: Context should be small and precise.
3. Five Core Techniques
3-1. Auto-Compact
When context utilization nears the cap, compact older content into a summary.
Claude Code: at roughly 98% utilization, runs auto-compact. Older turns get summarized to free space.
oMLX takes a different angle — external persistence via SSD KV cache rather than in-window compaction.
3-2. Hierarchical Processing
Convert older context into successively more abstract representations: - Raw: "User edited line 42 of file X to handle case Y..." - Level 1: "User edited auth logic in file X" - Level 2: "Session focused on auth + routing"
Deeper compression = more density, more detail loss. Find the balance.
3-3. Tool-Output Limiting
Tool calls eat context fast.
Claude Code default: MCP tool output capped at 25,000 tokens, with a warning at 10,000. Servers may opt into up to 500,000 characters, persisted to disk instead of held in context.
Don't load full tool results into the window. Big results go to disk; the model sees a summary or index.
3-4. Reference vs Inline
Instead of dropping a whole code file into context:
- Put just the file path in the system prompt
- Let the model use a Read tool when needed
This makes context-filling dynamic. A 100K file isn't carried every turn.
3-5. CLAUDE.md / Project Context Files
Persistent project information lives in a separate file that's auto-loaded each session.
Claude Code: auto-loads
CLAUDE.mdfrom the project root every session. Custom commands, coding conventions, forbidden patterns.
This file is part of Guides (per Part 1's classification). It accumulates without volatilizing.
4. Claude Code's 5-Layer Compaction Pipeline
Per arXiv 2604.14228 ("Dive into Claude Code"), context management is a 5-layer pipeline:
- Live conversation buffer — most recent turns intact
- Recent tool results — partial retention
- Compacted history — older turns summarized
- Project context — CLAUDE.md + project metadata
- System primitives — core system prompt
Each layer has its own compression ratio and lifetime. When layer 1 fills, content evicts to 2~3; once those fill, content gets summarized.
vs Plain FIFO
Most chatbots just delete the oldest message. Claude Code summarizes and keeps. Information loss is minimized while space is reclaimed.
5. MCP — Standardized Tool Interface
Anthropic's Model Context Protocol (MCP), released November 2024, is the standard for tool-call wiring. As of 2026, Claude Code, Cursor, and OpenAI Codex CLI all use MCP servers.
Context Implications
- MCP servers expose external systems (DB, API, FS) as tools
- Tool results land in the context window → pressure
- The Claude Code 25K/500K limits exist to manage this pressure
Good vs Bad MCP Servers
- Good: returns summarized results; large data exposed via IDs to fetch separately
- Bad: dumps a 50K-row SQL result back, blowing up context
Tool authors share responsibility for context engineering.
6. Production Patterns
6-1. System-Prompt Diet
- Bad: "You are an extremely helpful, friendly, cooperative, deeply thoughtful, accurate..."
- Good: "Goal-driven. State results directly. No filler."
Trade adjective stacks for behavior commands. Saves tokens. Gives the model an actionable directive.
6-2. Externalize In-Progress State
Big tasks save mid-step state to external files (plan.md, tasks/) → reload after a context clear.
CLAUDE.md pattern: every task ends with a snapshot in
tasks/sessions/. Auto-loaded next session.
6-3. Sub-Agent Isolation
Long searches and analyses go to a sub-agent. Main context only sees the result, not the search noise.
Claude Code's
Agenttool implements this pattern.
6-4. RAG vs Long Context
- RAG: 100K-doc index, fetch 5K per query
- Long Context: load all 100K in-window
Comparison: | Item | RAG | Long Context | |---|---|---| | Cost | Low | High (full prefill) | | Accuracy | Depends on retrieval | Self-attention | | Update | Re-index only | Re-fill every time | | Use when | Structured retrieval | One-shot whole-doc reasoning |
Even with 1M-token models, RAG retains a clear cost edge.
7. Tool-by-Tool Comparison
| Tool | Auto-compact | Tool-output cap | Project context | External memory |
|---|---|---|---|---|
| Claude Code | ✅ (~98%) | 25K/500K | CLAUDE.md | ✅ (Skills, MCP) |
| Cursor 3 | ✅ | — | .cursorrules | ✅ (cloud agent) |
| OpenClaw | ✅ | — | CLAUDE.md (compat) | ✅ (memcore) |
| Hermes | ✅ | — | own | ✅ (FTS5 + vector) |
| oMLX | — (model-level) | — | — | SSD KV cache |
oMLX persists KV cache at the inference server layer (not the model). Repeated calls with the same system prompt make prefill effectively free.
8. Anti-Patterns (the common mistakes)
8-1. "Just stuff it into a big-window model"
A 1M window doesn't help if you fill it: - TTFT explodes (prefill ∝ N²) - Cost explodes (GPT-5.5 above 272K input doubles) - Middle information lost (Lost in the Middle)
8-2. "Pipe tool results directly into context"
A 50K-row SQL dump pollutes the context window for every subsequent turn. Summarize and externalize.
8-3. "Cram everything into CLAUDE.md"
A 30K-token CLAUDE.md eats 30K from every session start. Keep core only; move infrequent knowledge to docs/ and load on demand.
8-4. "Preserve all conversation history"
Long conversations need compression and externalization. The most common failure when transitioning from chatbot patterns to agents.
Bottom Line
Context Engineering is not a window-size race.
| Principle | One line |
|---|---|
| Small and precise | 5K accurate beats 1M filled |
| Dynamic assembly | Build at runtime, not statically |
| Externalize | Memory, tools, files live outside — pulled in when needed |
| Auto-compact | Compress before the cap |
| Layer separation | System / project / session / tool results have different lifetimes |
The single takeaway: "The context window is a shared budget. Every token has a price."
Part 3 (next) covers preserving information outside the window — Memory Systems.
First-Party Sources
- "Dive into Claude Code": arxiv.org/abs/2604.14228 (Liu et al., 2026-04-14)
- Anthropic MCP spec: modelcontextprotocol.io
- Lost in the Middle: arXiv:2307.03172
- Martin Fowler harness article: martinfowler.com/articles/harness-engineering.html
๋๊ธ
๋๊ธ ์ฐ๊ธฐ