"LLM Reasoning Modes (6/6) — Comparison and Practical Guide: Cost, Latency, Quality, and Which Dial When"

We've taken the dial apart piece by piece — Part 1 the principle, Part 2 Claude's thinking, Parts 3-4 Claude effort, Part 5 OpenAI reasoning_effort. Part 6 puts the two side by side and answers, with benchmarks, "does more thinking always help?" Then it tabulates which notch to use per task.

The one-line conclusion of this series, up front: the reasoning dial is both a quality knob and a cost knob, and the right answer is not "always maximum" but "match it to task difficulty." Part 6 gathers the comparison, figures, and guide that back that up.

In One Paragraph

Claude's effort controls thinking, text, and tool calls with a single dial (default high); OpenAI splits thinking depth (reasoning_effort, default medium) and output length (verbosity) into two dials. Benchmarks show the dial is a cost-efficiency lever — per Anthropic, on SWE-bench Verified, Opus 4.5 at medium effort matched Sonnet 4.5's best score using 76% fewer output tokens, and at high exceeded it by 4.3 points using 48% fewer tokens. More thinking is not always better (overthinking). Practical rule: low for simple tasks, high for hard reasoning/agents, top notch only for genuine frontier problems.


1. Claude vs OpenAI — One-Table Comparison

Both platforms address the same tradeoff (quality ↔ cost/latency) with different design philosophies.

Dimension Claude effort OpenAI reasoning_effort
Levels low · medium · high · xhigh · max none·minimal · low · medium · high · xhigh (varies by model)
Default high medium
Scope all tokens (thinking + text + tool calls) thinking depth. Output length is separate (verbosity)
Thinking control adaptive thinking + effort (one dial) reasoning_effort + verbosity (two dials)
Lowest notch low (still thinks on hard problems) none / minimal (≈ no reasoning)
Raw chain of thought summary only, never returned verbatim hidden reasoning + optional summary
Loop-cost knob Task Budgets (countdown, min 20k) (none separate — effort + output cap)

Two key differences:

  • Single vs split. Claude's one dial moves thinking depth and output/tool verbosity together. OpenAI turns thinking (reasoning_effort) and output length (verbosity) separately. "Think deeply but answer briefly" is two explicit parameters on OpenAI; on Claude you supplement with the prompt.
  • Defaults. Claude high, OpenAI medium. The same "leave it at default" means different amounts of thinking, so cost and latency feel different when you move between platforms.

2. Cost, Latency, Quality — Three Variables That Move Together

Recall the curve from Part 1. Let test-time compute be \(C_{\text{test}}\); on hard problems accuracy generally rises as \(C_{\text{test}}\) grows but with diminishing returns. Turning the dial up means growing \(C_{\text{test}}\), and three variables move with it.

  • Quality ↑ — but only on hard tasks, and increasingly gently.
  • Cost ↑ — thinking/reasoning tokens are billed at the expensive output-token rate (both platforms). Turn the dial up and even the invisible tokens grow, inflating the bill.
  • Latency ↑ — thinking must finish before the first answer token, so time-to-first-token rises.

So the dial isn't a pure "be smarter" button — it's a knob over how much cost and latency you'll pay for quality. In the flat region of the curve, turning it higher barely moves quality while cost and latency keep climbing — that flat region is the overthinking of the next section.

3. What the Benchmarks Say — Smarter Model + Lower Effort

The dial's real value isn't "crank it to max to squeeze out points" — it's getting the same score for less. Anthropic's reported SWE-bench Verified figures for Opus 4.5 show this well.

  • medium effort: Opus 4.5 matched Sonnet 4.5's best score — using 76% fewer output tokens.
  • high effort: Opus 4.5 exceeded Sonnet 4.5 by 4.3 percentage points — still using 48% fewer tokens.

Two implications to take away:

  1. Running a smarter model at low effort beats running a weaker model at maximum — and costs less. Model choice and effort choice have to be considered together.
  2. Raising effort is not automatically a token explosion. The same task can finish better and cheaper, because better reasoning cuts wasted, flailing attempts.

So effort is not a simple "burn cost to buy quality" trade — it's an optimization where, depending on the model+effort combination, quality and cost can improve together.

4. Overthinking — Where More Thinking Hurts

Push the dial past the flat region and extra thinking isn't just unhelpful — it can hurt.

  • On narrow-output tasks (classification, extraction, formatting, single-answer), more thinking doesn't improve the answer. Even Anthropic's effort documentation states that on structured-output tasks max "can lead to overthinking."
  • Excess reasoning sometimes second-guesses a correct answer into a wrong one.
  • And all that extra thinking is billed as cost and latency — with no quality gain.

That's why max (Claude) and the top notches of high/xhigh (OpenAI) are reserved seats, not defaults. Claude Code making max apply to the current session only (Part 4) is the same logic — a guardrail against the top notch silently becoming a standing default.

5. Practical Guide — Matching the Dial to Task Difficulty

The whole series' principle, compressed into one table. This is a starting point for matching the dial to difficulty (then fine-tune on your own eval set).

Task type Claude effort OpenAI reasoning_effort
Simple classification / extraction / formatting / short answer low minimal (GPT-5) / none (GPT-5.5) / low (Codex family)
General workflow / ordinary generation medium medium (default)
Complex reasoning / hard coding high (default) high
Long agentic / coding sessions xhigh xhigh (supported models)
Genuine frontier problems (latency/cost-insensitive) max high/xhigh

Operating tips:

  • Set the starting point deliberately. Claude defaults to high, so light tasks need a step down to cut cost/latency. Sonnet 4.6 in particular runs at high if effort is unset, which can raise latency — make medium your explicit default there.
  • High effort needs a generous output cap. At Claude xhigh/max, set a large max_tokens (start ~64k) so the model has room to unfold (Part 3).
  • Output length is not an effort problem. On OpenAI, "think deeply but answer briefly" keeps reasoning_effort and lowers verbosity. On Claude, control length via the prompt.
  • Shallow reasoning → the dial, not the prompt. If a hard task comes back shallow, raising effort a notch is more direct than a "think harder" prompt (Part 3).

6. Loop Cost Is Separate — Task Budgets

Set per-turn depth with effort, but bound the cumulative tokens of a whole agent loop with a separate knob. Claude's Task Budgets (output_config.task_budget, minimum 20,000) tells the model its loop-wide budget and lets it self-moderate against a shrinking countdown — unlike max_tokens, an enforced ceiling the model is unaware of, this is a budget the model knows about and manages (Part 3). On long autonomous work, splitting "depth via effort, total via Task Budget" lets you manage quality and cost at once.

Closing

A reasoning mode is, in the end, the answer to one question — how much compute should this call spend?

  • Principle (Part 1): thinking is test-time compute, billed as output tokens.
  • Claude (Parts 2-4): adaptive thinking sets "whether and when," effort sets "how deep, across the whole response." Claude Code exposes the same dial in the CLI.
  • OpenAI (Part 5): reasoning_effort (depth) and verbosity (length) as two knobs.
  • Choice (Part 6): match the dial to task difficulty. A smarter model at the right effort — that's the path to quality and cost together.

Once you can see the dial, "why is the same model fast and cheap one day, slow and expensive the next" stops being a mystery. It's usually a question of where the dial was set.


Comparisons and figures are grounded in Anthropic and OpenAI primary documentation and Anthropic's reported Opus 4.5 SWE-bench Verified results. Benchmark numbers vary by model and setup — validate on your own eval set before relying on them in production.

Series overview: Series index

๋Œ“๊ธ€

์ด ๋ธ”๋กœ๊ทธ์˜ ์ธ๊ธฐ ๊ฒŒ์‹œ๋ฌผ

"ML Foundations (9/9) — PyTorch vs TensorFlow, and the Road to Local LLMs"

Agent Memory Engine (2/10) — Building an AI Agent Memory System with SQLite Alone

"ML Foundations (8/9) — Deep Learning Architectures: CNN, RNN, Attention"

"RAG Core Study (14/26) — Evaluation Sets with RAGAS & DeepEval"

"ML Foundations (7/9) — Deep Learning Training: Optimizers, Regularization, Initialization"

AI Agents I Built (5/7) — Building an Automated Blogger API Publishing System

"ML Foundations (6/9) — Neural Networks: From Perceptron to MLP"