"LLM Reasoning Modes (1/6) — Why Thinking Costs: Test-Time Compute and Reasoning Tokens"
Claude's low·medium·high·xhigh·max, OpenAI's reasoning_effort. This series dissects what that dial actually controls and why it behaves the way it does, from primary sources. Part 1 establishes the one premise every dial rests on: thinking costs.
Modern models don't answer immediately. Claude and GPT alike, given a hard question, first emit a block of thinking and only then begin the answer. That stretch — shown as "thinking…" or a long pause — is the model spending extra tokens, separate from the answer you see. Whether it's called effort or reasoning_effort, a reasoning mode is ultimately a knob over how many of these extra tokens to spend.
To understand this series you have to understand the thing the knob controls before the knob itself. That thing is test-time compute.
In One Paragraph
A reasoning model unrolls its chain-of-thought as tokens before the final answer. This is test-time compute (spending more compute at inference instead of scaling training), and it lifts accuracy on hard, multi-step work: math, logic, coding, agentic loops. But those thinking tokens are billed as output tokens and add latency. So every vendor exposes a per-call dial over how much to think (Claude effort, OpenAI reasoning_effort). The raw chain of thought is generally not returned to the caller, only summaries.
1. What a Reasoning Model Is: Tokens Before the Answer
A standard model takes input and generates answer tokens one by one. A reasoning model inserts a step in between: before producing the answer, it generates thinking (reasoning) tokens, then uses that thinking as context to write the final response.
Those thinking tokens are a chain-of-thought (CoT) — intermediate reasoning that breaks the problem into steps — spelled out as a token sequence. It plays the same role as a human jotting down intermediate work on paper, except the model does it by generating tokens.
The key point: this is not free. Thinking tokens are real generated tokens, so they consume compute, time, and money. "Think harder" is literally "spend more tokens."
2. Test-Time Compute — Spending on Inference, Not Training
There are two broad ways to raise model performance.
- Train-time scaling: grow parameters, data, and training compute. Make the model itself bigger and smarter. The cost is paid once, at training.
- Test-time scaling: leave the model as is, and spend more compute solving each individual question. Generate longer chains of thought, explore multiple paths, self-verify. The cost is paid per question.
Reasoning modes control the latter. With a single fixed model, you can think briefly (or not at all) on some questions and deeply on others — i.e., allocate compute per call to match how hard the question is. Tuning compute to "how hard is this one" without swapping models is the essence of the dial.
Let test-time compute be \(C_{\text{test}}\). On hard problems, accuracy generally rises as \(C_{\text{test}}\) grows, but not without bound: the curve shows diminishing returns. The shape of that curve is what drives the cost-quality tradeoff covered in Part 6 of this series.
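One way to see both the lift and the diminishing returns is a toy model of "explore multiple paths": sample the same question n times and take the majority answer. The sketch below is an idealization, not a description of how any vendor's reasoning mode works internally. It assumes a binary right/wrong answer and fully independent samples, and the function name `majority_vote_accuracy` is hypothetical.

```python
from math import comb


def majority_vote_accuracy(p_single: float, n_samples: int) -> float:
    """Probability that a strict majority of n independent samples is correct.

    Assumptions (illustrative only): each sample is right with probability
    p_single, samples are independent, and n_samples is odd so ties cannot occur.
    """
    if n_samples % 2 == 0:
        raise ValueError("use an odd number of samples to avoid ties")
    needed = n_samples // 2 + 1  # smallest count that forms a strict majority
    return sum(
        comb(n_samples, k) * p_single**k * (1 - p_single) ** (n_samples - k)
        for k in range(needed, n_samples + 1)
    )


if __name__ == "__main__":
    # More samples means more test-time compute; watch the gain per step shrink.
    for n in (1, 3, 5, 9, 17, 33):
        print(n, round(majority_vote_accuracy(0.6, n), 3))
```

With a 60% single-sample accuracy, three samples give 0.648 and five give about 0.683: each extra pair of samples buys less than the previous one, which is the diminishing-returns shape described above.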
3. Why More Thinking Helps — and Where It Stops
Where extra thinking helps and where it doesn't splits cleanly.
High payoff: problems that need multi-step reasoning.
- Math and logic (you must accumulate intermediate steps to reach the answer)
- Hard coding and debugging (trace causes, test hypotheses)
- Agentic loops (plan the next tool call, interpret results)
- Long-form analysis (weave many facts into a conclusion)
These have a long path to the answer. When the model lays that path out in tokens, it makes fewer mistakes than when it tries to jump straight to the answer.
Little payoff or actively harmful: narrow output spaces.
- Simple classification, extraction, format conversion
- Short rewrites, single-answer lookups
When the answer is effectively fixed, more thinking doesn't improve it. Excess reasoning instead becomes overthinking — burning tokens needlessly, sometimes second-guessing a correct answer into a wrong one. Even Anthropic's own effort documentation notes that on structured-output tasks the top setting (max) "can lead to overthinking."
So the correct use of a reasoning mode is not "always maximum" but "match the dial to task difficulty." This principle recurs throughout the series.
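Matching the dial to the task can be as simple as a lookup table. The sketch below is one possible shape for it. The level names are the ones this article lists for Claude, but the task categories, the specific assignments, and the helper `pick_effort` are illustrative assumptions, not vendor recommendations.

```python
# Hypothetical routing table: task category -> effort level.
# Level names come from the article; the categories and the
# assignments are assumptions made for illustration.
EFFORT_BY_TASK = {
    "classification": "low",
    "extraction": "low",
    "format_conversion": "low",
    "short_rewrite": "low",
    "long_form_analysis": "medium",
    "agentic_loop": "high",
    "math": "high",
    "hard_debugging": "high",
}


def pick_effort(task_kind: str, default: str = "medium") -> str:
    """Return the effort level for a task category, or a default if unknown."""
    return EFFORT_BY_TASK.get(task_kind, default)


if __name__ == "__main__":
    for kind in ("extraction", "math", "something_else"):
        print(kind, "->", pick_effort(kind))
```

In practice the table would be tuned against your own evaluations; the point is only that the choice is made per call, not once per application.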
4. The Cost of Reasoning Tokens: Billed as Output Tokens
This is where the dial stops being a quality knob and becomes a cost knob.
Claude's thinking tokens and OpenAI's reasoning tokens are both billed as output tokens. Output tokens are far pricier than input tokens (e.g., Claude Opus 4.8 is 5 USD input / 25 USD output per 1M). Increasing thinking means spending more of the most expensive token class.
The cost structure:
| Segment | What | Billed at |
|---|---|---|
| Input | prompt and context | input rate (cheap) |
| Thinking | thinking / reasoning tokens | output rate (expensive) |
| Answer | the visible response | output rate (expensive) |
Thinking tokens also raise latency. The model must finish thinking before the first answer token appears, so a high reasoning setting increases time-to-first-token. Cost and latency moving together is the first thing you feel when operating reasoning modes.
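The billing structure in the table can be worked through as plain arithmetic. The sketch below uses the 5 USD / 25 USD per 1M rates quoted above; the token counts are invented for illustration, and `Rates` and `call_cost_usd` are hypothetical names, not part of any SDK.

```python
from dataclasses import dataclass

PER_MILLION = 1_000_000


@dataclass(frozen=True)
class Rates:
    input_usd_per_m: float   # price per 1M input tokens
    output_usd_per_m: float  # price per 1M output tokens


def call_cost_usd(rates: Rates, input_tokens: int,
                  thinking_tokens: int, answer_tokens: int) -> float:
    """Cost of one call: thinking and answer tokens both bill at the output rate."""
    output_tokens = thinking_tokens + answer_tokens
    return (input_tokens * rates.input_usd_per_m
            + output_tokens * rates.output_usd_per_m) / PER_MILLION


if __name__ == "__main__":
    rates = Rates(input_usd_per_m=5.0, output_usd_per_m=25.0)
    # Same prompt and same visible answer; only the thinking differs.
    shallow = call_cost_usd(rates, input_tokens=2_000, thinking_tokens=0, answer_tokens=500)
    deep = call_cost_usd(rates, input_tokens=2_000, thinking_tokens=8_000, answer_tokens=500)
    print(f"shallow: ${shallow:.4f}  deep: ${deep:.4f}")
```

With these made-up counts the shallow call costs 0.0225 USD and the deep one 0.2225 USD: the visible answer is identical in length, yet the bill is roughly ten times higher.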
5. Thinking Is Invisible — Hidden CoT and Summaries
Thinking tokens are generated and billed, yet the raw chain of thought is generally not returned to you verbatim.
- Claude: current models never return the raw chain of thought. They optionally provide a summary (thinking.display: "summarized" returns a readable summary; "omitted" leaves the thinking block's text empty). Thinking happens and is billed the same regardless of the display setting.
- OpenAI: reasoning tokens are not surfaced verbatim (summaries only), yet are still billed as output tokens.
Why hide it? The raw chain of thought exposes model behavior directly and carries safety/misuse exposure risk. What matters practically for you is twofold — (1) cost includes tokens you can't see, so account for them, and (2) to show progress in a UI you must explicitly enable summaries (the default is often the hidden side).
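Accounting for invisible tokens usually means reading them from the usage data the API returns with each response. The sketch below works on a plain dictionary whose field names (`output_tokens`, `output_tokens_details.reasoning_tokens`) are modeled on the usage object of OpenAI's Responses API; treat that shape as an assumption and check the current vendor documentation. The numbers are invented and `hidden_token_share` is a hypothetical helper.

```python
def hidden_token_share(usage: dict) -> float:
    """Fraction of billed output tokens that were reasoning (not visible) tokens.

    Assumes a usage dict shaped like:
      {"output_tokens": int, "output_tokens_details": {"reasoning_tokens": int}}
    Missing fields are treated as zero.
    """
    output_tokens = usage.get("output_tokens", 0)
    details = usage.get("output_tokens_details") or {}
    reasoning_tokens = details.get("reasoning_tokens", 0)
    if output_tokens == 0:
        return 0.0
    return reasoning_tokens / output_tokens


if __name__ == "__main__":
    # Invented example: 4,000 billed output tokens, 3,200 of them reasoning.
    example_usage = {
        "input_tokens": 1_200,
        "output_tokens": 4_000,
        "output_tokens_details": {"reasoning_tokens": 3_200},
    }
    print(f"{hidden_token_share(example_usage):.0%} of output tokens were hidden reasoning")
```

Logging this ratio per call is a cheap way to notice when a setting change has quietly multiplied the invisible part of the bill.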
6. So a Dial Was Born — effort / reasoning_effort
In one line: thinking raises quality but increases cost and latency, and its payoff varies by task. The obvious conclusion — thinking should be tunable per call. Both vendors expose exactly that knob.
- Claude, effort: low / medium / high / xhigh / max. Unusually, it controls not just thinking depth but all response tokens (explanations, tool calls included) with a single dial. Default is high.
- OpenAI, reasoning_effort: minimal / low / medium / high (GPT-5; later generations add none and xhigh). Default is medium. Output length is controlled separately by a verbosity parameter.
That the two designs differ (Claude uses one dial for thinking and output, OpenAI splits thinking from output) is central to the comparison later in this series.
What Comes Next
The series decomposes the dial in this order:
- Part 2 (Claude's Thinking): the shift from a fixed budget (budget_tokens) to adaptive thinking, thinking blocks and display, hidden raw CoT.
- Part 3 (Claude effort in full): behavior of each level low~max, per-model support and defaults, effect on tool calls, preamble, and thinking.
- Part 4 (Claude Code's effort in practice): the five CLI levels, /effort and env vars, session persistence, what ultracode actually is.
- Part 5 (OpenAI and Codex reasoning_effort): minimal~xhigh, GPT-5 / 5.5 / 5.2-codex differences, config.toml, separation from verbosity.
- Part 6 (Comparison, tradeoffs, practical guide): Claude vs OpenAI summary, the cost-latency-quality curve, benchmark figures, matching the dial to task difficulty.
Next, we look at how Claude's thinking moved from a fixed budget to an adaptive mode the model sets for itself.
This article is grounded in Anthropic and OpenAI primary documentation. Exact parameter values, defaults, and per-model support are tabulated in Parts 3-5.
Series overview: Series index