Local AI Infrastructure Notes (11/15) — You Cannot Turn Off Qwen3 Thinking Mode
Why the fastest model on benchmarks took 102 seconds in production
핵심 요약
- qwen3:30b-a3b had the highest tok/s (37.7), but thinking mode couldn't be disabled — "Say OK" took 102 seconds
- Modelfile edits, system prompts, and API parameters all failed — the limitation is baked into the model weights
- Final choice: qwen3.5:9b — thinking mode is controllable and VRAM usage is a manageable 9.2GB
Background
While benchmarking local LLMs for faster inference, I discovered the qwen3:30b-a3b model. With its MoE (Mixture-of-Experts) architecture activating only 3B of its 30B total parameters, it clocked in at 37.7 tok/s — the fastest of all candidates.
But a simple "Say OK" request took 102 seconds. Roughly 3,500 tokens were burned entirely on reasoning — not a single one on the actual answer.
Three Attempts, Three Failures
Attempt 1: Ollama Modelfile modification — Tried to suppress <think> tag injection. Ineffective because the model weights themselves are trained to reason.
Attempt 2: System prompt suppression — "Never use thinking mode" was simply ignored. The model explained within its thinking block why reasoning was necessary.
Attempt 3: API think:false parameter — The thinking field came back empty, but the reasoning text just migrated to the content field. Time and token consumption remained identical.
qwen3 vs qwen3.5
| Model | think:false result |
|---|---|
| qwen3:30b-a3b | Not controllable |
| qwen3.5:9b | Works as expected |
| qwen3.5:35b-a3b | Works as expected |
Ollama's Default Template: A Double Trap
Ollama's default template hard-codes <think> tag injection during response generation. Combined with the model weights' forced reasoning, this creates a compounding problem.
The VRAM Mismatch Trap
deepseek-v2:16b has a download size of 8.9GB but actually occupies 19.0GB of VRAM. MoE architectures carry router and KV cache overhead that inflates memory usage 60%+ beyond the download size.
Final Selection
| Model | tok/s | Response Time | Thinking Control | VRAM |
|---|---|---|---|---|
| qwen3:30b-a3b | 37.7 | 102s | No | 20.4GB |
| qwen3.5:35b-a3b | 18.3 | 6.2s | Yes | 26.3GB |
| qwen3.5:9b | 13.5 | 8.1s | Yes | 9.2GB |
Pitfalls and Caveats
- The danger of features you can't disable: Thinking mode is a strength, but when it can't be turned off, it becomes poison in an agent system.
- tok/s does not equal response time: Fast generation speed doesn't guarantee fast end-to-end response time.
- Download size does not equal VRAM: For MoE models, you must check actual usage with
ollama psafter loading in a real environment.
Takeaway
The fastest model on benchmarks can become the slowest in production. Benchmark against actual agent tasks — tool calls, heartbeat responses — not peak throughput numbers.
댓글
댓글 쓰기