Local AI Infrastructure Notes (11/15) — You Cannot Turn Off Qwen3 Thinking Mode

Gate 조건과 Orchestration Preset — 복잡한 작업을 파이프라인으로 분해하는 법

Why the fastest model on benchmarks took 102 seconds in production


핵심 요약

  • qwen3:30b-a3b had the highest tok/s (37.7), but thinking mode couldn't be disabled — "Say OK" took 102 seconds
  • Modelfile edits, system prompts, and API parameters all failed — the limitation is baked into the model weights
  • Final choice: qwen3.5:9b — thinking mode is controllable and VRAM usage is a manageable 9.2GB

Background

While benchmarking local LLMs for faster inference, I discovered the qwen3:30b-a3b model. With its MoE (Mixture-of-Experts) architecture activating only 3B of its 30B total parameters, it clocked in at 37.7 tok/s — the fastest of all candidates.

But a simple "Say OK" request took 102 seconds. Roughly 3,500 tokens were burned entirely on reasoning — not a single one on the actual answer.

1. 4단계 게이트 — 각 단계에 진입 조건과 종료 조건이 있다

Three Attempts, Three Failures

Attempt 1: Ollama Modelfile modification — Tried to suppress <think> tag injection. Ineffective because the model weights themselves are trained to reason.

Attempt 2: System prompt suppression — "Never use thinking mode" was simply ignored. The model explained within its thinking block why reasoning was necessary.

Attempt 3: API think:false parameter — The thinking field came back empty, but the reasoning text just migrated to the content field. Time and token consumption remained identical.

2. 각 게이트의 Built-in Rules

qwen3 vs qwen3.5

Model think:false result
qwen3:30b-a3b Not controllable
qwen3.5:9b Works as expected
qwen3.5:35b-a3b Works as expected

Ollama's Default Template: A Double Trap

Ollama's default template hard-codes <think> tag injection during response generation. Combined with the model weights' forced reasoning, this creates a compounding problem.

The VRAM Mismatch Trap

deepseek-v2:16b has a download size of 8.9GB but actually occupies 19.0GB of VRAM. MoE architectures carry router and KV cache overhead that inflates memory usage 60%+ beyond the download size.

Final Selection

Model tok/s Response Time Thinking Control VRAM
qwen3:30b-a3b 37.7 102s No 20.4GB
qwen3.5:35b-a3b 18.3 6.2s Yes 26.3GB
qwen3.5:9b 13.5 8.1s Yes 9.2GB

Pitfalls and Caveats

  1. The danger of features you can't disable: Thinking mode is a strength, but when it can't be turned off, it becomes poison in an agent system.
  2. tok/s does not equal response time: Fast generation speed doesn't guarantee fast end-to-end response time.
  3. Download size does not equal VRAM: For MoE models, you must check actual usage with ollama ps after loading in a real environment.

Takeaway

The fastest model on benchmarks can become the slowest in production. Benchmark against actual agent tasks — tool calls, heartbeat responses — not peak throughput numbers.

댓글