"LLM Reasoning Modes (6/6) — Comparison and Practical Guide: Cost, Latency, Quality, and Which Dial When"

6월 15, 2026

We've taken the dial apart piece by piece — Part 1 the principle, Part 2 Claude's thinking, Parts 3-4 Claude effort, Part 5 OpenAI reasoning_effort. Part 6 puts the two side by side and answers, with benchmarks, "does more thinking always help?" Then it tabulates which notch to use per task.

The one-line conclusion of this series, up front: the reasoning dial is both a quality knob and a cost knob, and the right answer is not "always maximum" but "match it to task difficulty." Part 6 gathers the comparison, figures, and guide that back that up.

In One Paragraph

Claude's effort controls thinking, text, and tool calls with a single dial (default high); OpenAI splits thinking depth (reasoning_effort, default medium) and output length (verbosity) into two dials. Benchmarks show the dial is a cost-efficiency lever — per Anthropic, on SWE-bench Verified, Opus 4.5 at medium effort matched Sonnet 4.5's best score using 76% fewer output tokens, and at high exceeded it by 4.3 points using 48% fewer tokens. More thinking is not always better (overthinking). Practical rule: low for simple tasks, high for hard reasoning/agents, top notch only for genuine frontier problems.

1. Claude vs OpenAI — One-Table Comparison

Both platforms address the same tradeoff (quality ↔ cost/latency) with different design philosophies.

Dimension	Claude `effort`	OpenAI `reasoning_effort`
Levels	`low · medium · high · xhigh · max`	`none·minimal · low · medium · high · xhigh` (varies by model)
Default	`high`	`medium`
Scope	all tokens (thinking + text + tool calls)	thinking depth. Output length is separate (`verbosity`)
Thinking control	adaptive thinking + effort (one dial)	`reasoning_effort` + `verbosity` (two dials)
Lowest notch	`low` (still thinks on hard problems)	`none` / `minimal` (≈ no reasoning)
Raw chain of thought	summary only, never returned verbatim	hidden reasoning + optional summary
Loop-cost knob	Task Budgets (countdown, min 20k)	(none separate — effort + output cap)

Two key differences:

Single vs split. Claude's one dial moves thinking depth and output/tool verbosity together. OpenAI turns thinking (reasoning_effort) and output length (verbosity) separately. "Think deeply but answer briefly" is two explicit parameters on OpenAI; on Claude you supplement with the prompt.
Defaults. Claude high, OpenAI medium. The same "leave it at default" means different amounts of thinking, so cost and latency feel different when you move between platforms.

2. Cost, Latency, Quality — Three Variables That Move Together

Recall the curve from Part 1. Let test-time compute be \(C_{\text{test}}\); on hard problems accuracy generally rises as \(C_{\text{test}}\) grows but with diminishing returns. Turning the dial up means growing \(C_{\text{test}}\), and three variables move with it.

Quality ↑ — but only on hard tasks, and increasingly gently.
Cost ↑ — thinking/reasoning tokens are billed at the expensive output-token rate (both platforms). Turn the dial up and even the invisible tokens grow, inflating the bill.
Latency ↑ — thinking must finish before the first answer token, so time-to-first-token rises.

So the dial isn't a pure "be smarter" button — it's a knob over how much cost and latency you'll pay for quality. In the flat region of the curve, turning it higher barely moves quality while cost and latency keep climbing — that flat region is the overthinking of the next section.

3. What the Benchmarks Say — Smarter Model + Lower Effort

The dial's real value isn't "crank it to max to squeeze out points" — it's getting the same score for less. Anthropic's reported SWE-bench Verified figures for Opus 4.5 show this well.

medium effort: Opus 4.5 matched Sonnet 4.5's best score — using 76% fewer output tokens.
high effort: Opus 4.5 exceeded Sonnet 4.5 by 4.3 percentage points — still using 48% fewer tokens.

Two implications to take away:

Running a smarter model at low effort beats running a weaker model at maximum — and costs less. Model choice and effort choice have to be considered together.
Raising effort is not automatically a token explosion. The same task can finish better and cheaper, because better reasoning cuts wasted, flailing attempts.

So effort is not a simple "burn cost to buy quality" trade — it's an optimization where, depending on the model+effort combination, quality and cost can improve together.

4. Overthinking — Where More Thinking Hurts

Push the dial past the flat region and extra thinking isn't just unhelpful — it can hurt.

On narrow-output tasks (classification, extraction, formatting, single-answer), more thinking doesn't improve the answer. Even Anthropic's effort documentation states that on structured-output tasks max "can lead to overthinking."
Excess reasoning sometimes second-guesses a correct answer into a wrong one.
And all that extra thinking is billed as cost and latency — with no quality gain.

That's why max (Claude) and the top notches of high/xhigh (OpenAI) are reserved seats, not defaults. Claude Code making max apply to the current session only (Part 4) is the same logic — a guardrail against the top notch silently becoming a standing default.

5. Practical Guide — Matching the Dial to Task Difficulty

The whole series' principle, compressed into one table. This is a starting point for matching the dial to difficulty (then fine-tune on your own eval set).

Task type	Claude `effort`	OpenAI `reasoning_effort`
Simple classification / extraction / formatting / short answer	`low`	`minimal` (GPT-5) / `none` (GPT-5.5) / `low` (Codex family)
General workflow / ordinary generation	`medium`	`medium` (default)
Complex reasoning / hard coding	`high` (default)	`high`
Long agentic / coding sessions	`xhigh`	`xhigh` (supported models)
Genuine frontier problems (latency/cost-insensitive)	`max`	`high`/`xhigh`

Operating tips:

Set the starting point deliberately. Claude defaults to high, so light tasks need a step down to cut cost/latency. Sonnet 4.6 in particular runs at high if effort is unset, which can raise latency — make medium your explicit default there.
High effort needs a generous output cap. At Claude xhigh/max, set a large max_tokens (start ~64k) so the model has room to unfold (Part 3).
Output length is not an effort problem. On OpenAI, "think deeply but answer briefly" keeps reasoning_effort and lowers verbosity. On Claude, control length via the prompt.
Shallow reasoning → the dial, not the prompt. If a hard task comes back shallow, raising effort a notch is more direct than a "think harder" prompt (Part 3).

6. Loop Cost Is Separate — Task Budgets

Set per-turn depth with effort, but bound the cumulative tokens of a whole agent loop with a separate knob. Claude's Task Budgets (output_config.task_budget, minimum 20,000) tells the model its loop-wide budget and lets it self-moderate against a shrinking countdown — unlike max_tokens, an enforced ceiling the model is unaware of, this is a budget the model knows about and manages (Part 3). On long autonomous work, splitting "depth via effort, total via Task Budget" lets you manage quality and cost at once.

Closing

A reasoning mode is, in the end, the answer to one question — how much compute should this call spend?

Principle (Part 1): thinking is test-time compute, billed as output tokens.
Claude (Parts 2-4): adaptive thinking sets "whether and when," effort sets "how deep, across the whole response." Claude Code exposes the same dial in the CLI.
OpenAI (Part 5): reasoning_effort (depth) and verbosity (length) as two knobs.
Choice (Part 6): match the dial to task difficulty. A smarter model at the right effort — that's the path to quality and cost together.

Once you can see the dial, "why is the same model fast and cheap one day, slow and expensive the next" stops being a mystery. It's usually a question of where the dial was set.

Comparisons and figures are grounded in Anthropic and OpenAI primary documentation and Anthropic's reported Opus 4.5 SWE-bench Verified results. Benchmark numbers vary by model and setup — validate on your own eval set before relying on them in production.

Series overview: Series index

이 블로그 검색

MaJu Tech Notes