"LLM Reasoning Modes (5/6) — OpenAI and Codex reasoning_effort: from minimal to xhigh"

Parts 3-4 showed that Claude's single effort knob tunes thinking, text, and tool calls all at once. Part 5 looks at how OpenAI handles the same trade-off. OpenAI also lets you tune quality, latency, and cost per call without switching models, through a parameter called reasoning_effort. It differs from Claude's effort in two decisive ways: the knob is split in two (reasoning_effort for thinking depth, verbosity for output length), and the set of accepted values changes by model generation while the default stays at medium. Part 5 covers both.

In One Paragraph

OpenAI controls thinking depth with reasoning_effort (the Responses/Chat API; in the Codex CLI config it's model_reasoning_effort). The values differ by model generation — GPT-5 is minimal·low·medium·high, GPT-5.5 is none·low·medium·high·xhigh, and the gpt-5-codex family is low·medium·high·xhigh (no minimal; configuring it normalizes to low). The default is medium, the recommended balanced starting point. And OpenAI splits output length into a separate parameter, verbosity — in contrast to Claude's single effort, which moves thinking depth and output expansiveness together. Reasoning tokens are not returned verbatim (summaries only) and are billed as output tokens.


1. The Name of the Knob — reasoning_effort and model_reasoning_effort

In OpenAI, the parameter that tunes thinking depth is reasoning_effort. It surfaces in two places.

  • API: reasoning_effort as a top-level parameter in Chat Completions; the Responses API carries the same setting nested as reasoning.effort.
  • Codex CLI config: model_reasoning_effort.

It's the same conceptual knob; just remember that the key name differs between calling the API directly in code and setting it in the Codex CLI config file. The meaning is identical: it steers how many reasoning (thinking) tokens the model spends before the visible answer.

Higher effort means more reasoning tokens, longer inference, and better accuracy on hard tasks. Lower effort is the reverse. It's the same kind of trade-off knob as Claude's effort, but as the next section shows, the set of values differs by model generation.
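As a rough illustration of the naming difference, here is a minimal sketch that builds the keyword arguments for each API surface. The helper name effort_kwargs and the surface labels "chat" and "responses" are made up for this example; the assumption is that Chat Completions takes a flat reasoning_effort while the Responses API nests the value under reasoning.

```python
from typing import Any, Dict


def effort_kwargs(surface: str, effort: str) -> Dict[str, Any]:
    """Return the keyword arguments that carry the effort for one API surface.

    `surface` is a label invented for this sketch: "chat" or "responses".
    """
    if surface == "chat":
        # Chat Completions: flat, top-level parameter.
        return {"reasoning_effort": effort}
    if surface == "responses":
        # Responses API: the same setting nested under `reasoning`.
        return {"reasoning": {"effort": effort}}
    raise ValueError(f"unknown surface: {surface!r}")


print(effort_kwargs("chat", "low"))
print(effort_kwargs("responses", "low"))
```

With the official Python SDK, the result would typically be splatted into the call, for example client.responses.create(model=..., input=..., **effort_kwargs("responses", "low")).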

2. Values by Model Generation — Same Knob, Different Notches

The first trap you hit: the values reasoning_effort accepts differ by model. A value that exists on one model is absent on another.

Model generation | Accepted values | Default | Notes
GPT-5 | minimal, low, medium, high | medium | supports minimal
GPT-5.5 | none, low, medium, high, xhigh | medium | bottom is none; adds xhigh at the top
GPT-5.2-Codex / gpt-5-codex | low, medium, high, xhigh | medium | no minimal; configuring it normalizes to low

How to read it:

  • GPT-5 puts minimal at the bottom.
  • GPT-5.5 shifts the bottom to none and adds xhigh at the top.
  • The gpt-5-codex family (including GPT-5.2-Codex) is tuned for coding and agentic work and does not support minimal. If you configure minimal, it normalizes to low. So on a Codex model, the lightest "reasoning-nearly-off" step is low.

Because the notches differ per generation, when moving code or config to a different model you must stay within the values that model accepts.
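One way to guard against this when porting code is a small client-side check. The sketch below hard-codes the table from this section; the family keys and the normalize_effort helper are names invented for the example, not anything OpenAI ships, and rejecting unsupported values with an error is a choice made here rather than documented platform behavior.

```python
# Accepted values per model family, copied from the table in this section.
ACCEPTED = {
    "gpt-5": ["minimal", "low", "medium", "high"],
    "gpt-5.5": ["none", "low", "medium", "high", "xhigh"],
    "gpt-5-codex": ["low", "medium", "high", "xhigh"],
}


def normalize_effort(family: str, effort: str) -> str:
    """Return an effort value the given family accepts, or raise ValueError."""
    allowed = ACCEPTED[family]
    if effort in allowed:
        return effort
    # Per the table: on the Codex family, minimal normalizes to low.
    if family == "gpt-5-codex" and effort == "minimal":
        return "low"
    # Everything else is rejected here (a design choice of this sketch).
    raise ValueError(f"{effort!r} is not accepted by {family}; use one of {allowed}")


print(normalize_effort("gpt-5", "minimal"))
print(normalize_effort("gpt-5-codex", "minimal"))
```

Failing loudly for everything except the documented minimal-to-low case keeps a silent downgrade from hiding a porting mistake.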

3. minimal — The Latency-Shaving Step

minimal runs with few or no reasoning tokens to minimize latency — especially time-to-first-token.

When to use it:

  • Deterministic, lightweight tasks with a narrow output space — extraction, formatting, short rewrites, simple classification.
  • Work where deep thinking barely improves accuracy and only adds latency.

When to avoid it:

  • Multi-step planning or tool-heavy workflows. Shaving reasoning on these hurts quality.

minimal exists on GPT-5 and is absent on the gpt-5-codex family (configuring it normalizes to low). On GPT-5.5 the bottom rung is none.

4. The Default Is medium — and When to Go Higher

The default, per the table above, is medium. It's the recommended starting point that balances quality, reliability, latency, and cost. Absent a specific reason, start here and move toward whichever side you need.

  • If answers come out shallow on a hard task, raise effort (high, and xhigh on models that support it). More reasoning tokens, longer inference, higher accuracy on hard tasks.
  • For lightweight, deterministic tasks, lower it (low, GPT-5's minimal, GPT-5.5's none). Less latency and cost.

Note that where Claude defaults to high, OpenAI defaults to medium — a common point of confusion when moving between the two platforms, since "leave it at the default" means a different amount of thinking on each.
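The "start at medium, then move one way or the other" advice can be sketched as a walk along a model's ladder of values. The ladders below come from the table in section 2; shift_effort is a hypothetical helper written for this post, and clamping at the ends is an assumption about what you would want, not API behavior.

```python
from typing import Sequence

# Ladders from the table in section 2, ordered lowest to highest.
GPT_5 = ("minimal", "low", "medium", "high")
GPT_5_5 = ("none", "low", "medium", "high", "xhigh")

DEFAULT_EFFORT = "medium"


def shift_effort(ladder: Sequence[str], current: str, steps: int) -> str:
    """Move `steps` notches along `ladder`, clamping at both ends."""
    index = ladder.index(current) + steps
    index = max(0, min(index, len(ladder) - 1))
    return ladder[index]


print(shift_effort(GPT_5, DEFAULT_EFFORT, +1))  # high
print(shift_effort(GPT_5_5, "high", +1))        # xhigh
print(shift_effort(GPT_5, "low", -3))           # minimal (clamped at the bottom)
```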

5. reasoning_effort vs verbosity — Splitting the Knob in Two

Here is the core design difference in OpenAI. Separate from reasoning_effort, OpenAI has a parameter called verbosity.

  • reasoning_effort = the depth of thinking. How many reasoning tokens to spend before the visible answer.
  • verbosity (low / medium / high) = the length/expansiveness of the output. How far to unfold the answer.

The two are orthogonal. verbosity lets you tune answer length without rewriting the prompt — e.g., think deeply (high reasoning_effort) but answer briefly (low verbosity), or the reverse.

Let's make the contrast with Claude explicit.

Platform | Thinking depth | Output length/expansiveness
Claude | effort | effort (the same knob controls both)
OpenAI | reasoning_effort | verbosity (separate parameter)

So Claude's single effort knob moves thinking depth and output/tool verbosity together. OpenAI splits this into two knobs, turning thinking depth (reasoning_effort) and output length (verbosity) independently. Two design philosophies for the same trade-off.
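To make the two-knob idea concrete, here is a sketch of a request payload that sets both independently. It assumes the Responses API shape, where effort sits under reasoning and verbosity under text; build_request is a name invented for this example, and the model string is only a placeholder.

```python
from typing import Any, Dict

VERBOSITY_LEVELS = ("low", "medium", "high")


def build_request(model: str, prompt: str,
                  effort: str = "medium",
                  verbosity: str = "medium") -> Dict[str, Any]:
    """Build a Responses-style payload with thinking depth and output length set separately."""
    if verbosity not in VERBOSITY_LEVELS:
        raise ValueError(f"verbosity must be one of {VERBOSITY_LEVELS}")
    return {
        "model": model,
        "input": prompt,
        "reasoning": {"effort": effort},   # how deeply to think
        "text": {"verbosity": verbosity},  # how long the visible answer is
    }


# Think deeply, answer briefly.
request = build_request("gpt-5", "Why does this query deadlock?",
                        effort="high", verbosity="low")
print(request["reasoning"], request["text"])
```

Swapping the two arguments (low effort, high verbosity) gives the reverse combination without touching the prompt.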

6. How Reasoning Tokens Are Returned and Billed

OpenAI's reasoning tokens are not returned verbatim — they're hidden, with optional summaries only. And reasoning tokens are billed as output tokens (general OpenAI platform behavior and pricing). So raising reasoning_effort increases the hidden reasoning tokens, and your output-token cost rises accordingly.

This shares the same broad frame as Claude — on both platforms, thinking/reasoning tokens are billed at the expensive output-token tier, and the raw chain of thought is not returned (a summary is the ceiling). That's why the effort knob is not only a quality lever but a direct cost lever.
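The cost effect is plain arithmetic, sketched below. The price of 10 USD per million output tokens and the token counts are made-up numbers for illustration only; check the current price list for real figures. The function takes visible and reasoning tokens separately just to make the split visible.

```python
def output_cost_usd(visible_tokens: int, reasoning_tokens: int,
                    usd_per_million_output: float) -> float:
    """Cost of one response when reasoning tokens are billed as output tokens."""
    billed_tokens = visible_tokens + reasoning_tokens
    return billed_tokens * usd_per_million_output / 1_000_000


PRICE = 10.0  # hypothetical USD per 1M output tokens, not a real quote

# Same 500-token visible answer, with different amounts of hidden reasoning.
low_effort = output_cost_usd(500, 500, PRICE)
high_effort = output_cost_usd(500, 4500, PRICE)
print(f"low: ${low_effort:.4f}  high: ${high_effort:.4f}")
```

In this invented example the visible answer is identical, yet the higher-effort call costs five times as much because of the hidden reasoning tokens.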

7. Codex CLI — config.toml and the -c Override

In the Codex CLI, the same knob surfaces in both the config file and on the command line.

Write the model and reasoning effort into ~/.codex/config.toml.

model = "gpt-5.2-codex"
model_reasoning_effort = "high"   # low | medium | high | xhigh

To use a different effort for a single run, override the setting with -c.

codex -m gpt-5.2-codex -c model_reasoning_effort="xhigh" "your prompt"

The full enum the config accepts is none | minimal | low | medium | high | xhigh. But whether a value actually applies is subject to per-model support. For instance, the gpt-5-codex family has no minimal (configuring it normalizes to low), so on a default Codex model the lightest step is low.
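If you launch Codex from a script, the same override can be assembled as an argument list. This is a sketch under the assumption that the flags are exactly the ones shown in the command above (-m and -c); codex_argv is a hypothetical helper, and nothing here actually runs the CLI.

```python
import shlex
from typing import List


def codex_argv(model: str, effort: str, prompt: str) -> List[str]:
    """Build the argument list for a one-off run with an overridden effort."""
    # In the shell command above, the shell strips the double quotes, so the
    # process receives model_reasoning_effort=xhigh as a single argument.
    return ["codex", "-m", model, "-c", f"model_reasoning_effort={effort}", prompt]


argv = codex_argv("gpt-5.2-codex", "xhigh", "your prompt")
print(shlex.join(argv))  # a shell-safe rendering of the same command
```

Passing the list to subprocess.run(argv) avoids shell quoting issues in the prompt text.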

8. Which Value to Pick — Summary

  • Default to medium — the balance of quality, reliability, latency, and cost. Start here.
  • Lightweight, deterministic tasks (extraction, formatting, short rewrites, simple classification) → minimal (GPT-5) / none (GPT-5.5) / low (Codex family). Shaves latency.
  • Hard, multi-step, tool-heavy tasks → high, and xhigh on supporting models. Spend more reasoning tokens and time to gain accuracy.
  • If answer length is the problem, move verbosity, not effort. Thinking depth and output length are separate knobs.
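The summary above can be condensed into a small lookup. This is only a sketch of the rules of thumb in this post: the task labels ("lightweight", "hard", "general"), the family keys, and the pick_effort helper are all invented for the example.

```python
# Lowest rung per family, from the table in section 2.
LIGHTEST = {"gpt-5": "minimal", "gpt-5.5": "none", "gpt-5-codex": "low"}
# Families whose ladder includes xhigh.
SUPPORTS_XHIGH = {"gpt-5.5", "gpt-5-codex"}


def pick_effort(family: str, task: str, max_out: bool = False) -> str:
    """Map a coarse task label to an effort value for the given family."""
    if task == "lightweight":
        # Extraction, formatting, short rewrites, simple classification.
        return LIGHTEST[family]
    if task == "hard":
        # Multi-step, tool-heavy work: high, or xhigh where available and requested.
        if max_out and family in SUPPORTS_XHIGH:
            return "xhigh"
        return "high"
    # Everything else starts at the default.
    return "medium"


print(pick_effort("gpt-5", "lightweight"))
print(pick_effort("gpt-5-codex", "hard", max_out=True))
print(pick_effort("gpt-5.5", "general"))
```

Verbosity is deliberately absent from the function: answer length is a separate decision from thinking depth.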

What Comes Next

We've now dissected Claude's effort (Parts 3-4) and OpenAI's reasoning_effort plus verbosity (Part 5) separately. The final Part 6 puts them side by side — comparing the cost, latency, and quality trade-offs with benchmark figures, and laying out how to match the knob to task difficulty. That's where the answer to "does more thinking always help?" lives.


Parameter names, per-model values, defaults, and Codex configuration are grounded in OpenAI's GPT-5 new-parameters docs, the GPT-5.5 and GPT-5.2-Codex model docs, and Codex CLI configuration sources. Reasoning-token billing (at the output-token tier) is stated as general OpenAI platform behavior.
