Local AI Infrastructure Notes (10/15) — Mac Mini M4 Local LLM Benchmark: MoE vs Dense
Comparing 4 Qwen3.5 variants, the thinking mode trap, and warm/cold start strategies
핵심 요약
- A MoE (Mixture of Experts) 35B model is 3.8x faster than a Dense 27B model
- Thinking mode enabled by default on newer models wastes 80 seconds on trivial tasks — disable it explicitly
- The real value of local LLMs isn't cost recovery; it's unlimited experimentation and full data privacy
Background
Running AI agents 24/7 meant I wanted to offload repetitive tasks — monitoring, memory cleanup, file I/O — from cloud APIs to local inference. The goal was an autonomous automation pipeline free from rate limits and network latency. I tested four Qwen3.5 model variants on a Mac Mini M4 with 32GB RAM running Ollama.
Benchmark Results
| Model | Architecture | Generation | VRAM | 3k Token Response | Verdict |
|---|---|---|---|---|---|
| qwen3.5:27b | Dense | 4.5 tok/s | 25.3GB | ~170s | Rejected |
| qwen3:8b | Dense | Fast | ~8GB | ~15s | Doesn't follow instructions |
| qwen3.5:9b | Dense | 13.1 tok/s | ~8GB | 30-60s | Backup |
| qwen3.5:35b-a3b-16k | MoE | 17.2 tok/s | 26.0GB | 20-40s | Adopted |
The MoE model's advantage was overwhelming: 4.7x faster prompt eval, 3.8x faster generation, roughly 7x faster end-to-end response. MoE activates only ~3B parameters during inference, so VRAM usage is similar while actual compute drops dramatically.
The Thinking Mode Trap
An automated cron job started failing. A simple "Say OK" prompt was taking 80.8 seconds. Analysis revealed that actual text generation took 0.12 seconds — the remaining time was entirely wasted on the reasoning (thinking) process.
Newer models ship with reasoning mode enabled by default. You must explicitly disable it via the API:
{
"payload": {
"thinking": "off",
"lightContext": true,
"timeoutSeconds": 1800
}
}
Hiding the reasoning output in the UI is not enough. thinking: "off" is mandatory.
Warm/Cold Start Strategy
Cold starts add roughly 30 seconds of latency on the first call. Two strategies to mitigate this:
- LaunchAgent auto warm-up: On boot, load the model into VRAM permanently with
keep_alive=-1 - Model sharing: Monitoring and memory management agents share the same model. A 30-minute heartbeat cycle naturally maintains the warm state
Local LLM Prompt Design Rules
Prompts that worked on cloud LLMs often failed locally:
- Single responsibility: 1 prompt = 1 file + 1 action. Compound instructions are split at the script level
- Explicit tool control: Not "save it" but "use the write tool to save it"
- No edit tool: Hallucination-induced parameter errors were frequent; only write (overwrite) is allowed
Pitfalls and Caveats
Initial hardware cost was roughly $800 USD. Break-even takes about 5-6 years. If financial ROI is your goal, local LLMs are not the answer. Also, allocating 26GB to the model on a 32GB machine leaves only 6GB for the OS — larger model experiments are off the table. 64GB or a Mac Studio is recommended.
Takeaway
The real value of local LLMs comes down to three things: an unlimited test environment, complete data privacy, and predictable latency (consistently around 24 seconds). Architecture (MoE vs Dense) matters more than raw parameter count — confirmed through actual measurement.
댓글
댓글 쓰기