Local AI Infrastructure Notes (11/15) — You Cannot Turn Off Qwen3 Thinking Mode

Gate 조건과 Orchestration Preset — 복잡한 작업을 파이프라인으로 분해하는 법

Why the fastest model on benchmarks took 102 seconds in production

핵심 요약

qwen3:30b-a3b had the highest tok/s (37.7), but thinking mode couldn't be disabled — "Say OK" took 102 seconds
Modelfile edits, system prompts, and API parameters all failed — the limitation is baked into the model weights
Final choice: qwen3.5:9b — thinking mode is controllable and VRAM usage is a manageable 9.2GB

Background

While benchmarking local LLMs for faster inference, I discovered the qwen3:30b-a3b model. With its MoE (Mixture-of-Experts) architecture activating only 3B of its 30B total parameters, it clocked in at 37.7 tok/s — the fastest of all candidates.

But a simple "Say OK" request took 102 seconds. Roughly 3,500 tokens were burned entirely on reasoning — not a single one on the actual answer.

Three Attempts, Three Failures

Attempt 1: Ollama Modelfile modification — Tried to suppress <think> tag injection. Ineffective because the model weights themselves are trained to reason.

Attempt 2: System prompt suppression — "Never use thinking mode" was simply ignored. The model explained within its thinking block why reasoning was necessary.

Attempt 3: API think:false parameter — The thinking field came back empty, but the reasoning text just migrated to the content field. Time and token consumption remained identical.

qwen3 vs qwen3.5

Model	think:false result
qwen3:30b-a3b	Not controllable
qwen3.5:9b	Works as expected
qwen3.5:35b-a3b	Works as expected

Ollama's Default Template: A Double Trap

Ollama's default template hard-codes <think> tag injection during response generation. Combined with the model weights' forced reasoning, this creates a compounding problem.

The VRAM Mismatch Trap

deepseek-v2:16b has a download size of 8.9GB but actually occupies 19.0GB of VRAM. MoE architectures carry router and KV cache overhead that inflates memory usage 60%+ beyond the download size.

Final Selection

Model	tok/s	Response Time	Thinking Control	VRAM
qwen3:30b-a3b	37.7	102s	No	20.4GB
qwen3.5:35b-a3b	18.3	6.2s	Yes	26.3GB
qwen3.5:9b	13.5	8.1s	Yes	9.2GB

Pitfalls and Caveats

The danger of features you can't disable: Thinking mode is a strength, but when it can't be turned off, it becomes poison in an agent system.
tok/s does not equal response time: Fast generation speed doesn't guarantee fast end-to-end response time.
Download size does not equal VRAM: For MoE models, you must check actual usage with ollama ps after loading in a real environment.

Takeaway

The fastest model on benchmarks can become the slowest in production. Benchmark against actual agent tasks — tool calls, heartbeat responses — not peak throughput numbers.

이 블로그 검색

MaJu Tech Notes