Local AI Infrastructure Notes (10/15) — Mac Mini M4 Local LLM Benchmark: MoE vs Dense

4월 04, 2026

Comparing 4 Qwen3.5 variants, the thinking mode trap, and warm/cold start strategies

핵심 요약

A MoE (Mixture of Experts) 35B model is 3.8x faster than a Dense 27B model
Thinking mode enabled by default on newer models wastes 80 seconds on trivial tasks — disable it explicitly
The real value of local LLMs isn't cost recovery; it's unlimited experimentation and full data privacy

Background

Running AI agents 24/7 meant I wanted to offload repetitive tasks — monitoring, memory cleanup, file I/O — from cloud APIs to local inference. The goal was an autonomous automation pipeline free from rate limits and network latency. I tested four Qwen3.5 model variants on a Mac Mini M4 with 32GB RAM running Ollama.

2. Layer 2: 행동 규칙 (lessons.md) — 경험에서 승격된 규칙

Benchmark Results

Model	Architecture	Generation	VRAM	3k Token Response	Verdict
qwen3.5:27b	Dense	4.5 tok/s	25.3GB	~170s	Rejected
qwen3:8b	Dense	Fast	~8GB	~15s	Doesn't follow instructions
qwen3.5:9b	Dense	13.1 tok/s	~8GB	30-60s	Backup
qwen3.5:35b-a3b-16k	MoE	17.2 tok/s	26.0GB	20-40s	Adopted

The MoE model's advantage was overwhelming: 4.7x faster prompt eval, 3.8x faster generation, roughly 7x faster end-to-end response. MoE activates only ~3B parameters during inference, so VRAM usage is similar while actual compute drops dramatically.

The Thinking Mode Trap

An automated cron job started failing. A simple "Say OK" prompt was taking 80.8 seconds. Analysis revealed that actual text generation took 0.12 seconds — the remaining time was entirely wasted on the reasoning (thinking) process.

Newer models ship with reasoning mode enabled by default. You must explicitly disable it via the API:

{
  "payload": {
    "thinking": "off",
    "lightContext": true,
    "timeoutSeconds": 1800
  }
}

Hiding the reasoning output in the UI is not enough. thinking: "off" is mandatory.

Warm/Cold Start Strategy

Cold starts add roughly 30 seconds of latency on the first call. Two strategies to mitigate this:

LaunchAgent auto warm-up: On boot, load the model into VRAM permanently with keep_alive=-1
Model sharing: Monitoring and memory management agents share the same model. A 30-minute heartbeat cycle naturally maintains the warm state

Local LLM Prompt Design Rules

Prompts that worked on cloud LLMs often failed locally:

Single responsibility: 1 prompt = 1 file + 1 action. Compound instructions are split at the script level
Explicit tool control: Not "save it" but "use the write tool to save it"
No edit tool: Hallucination-induced parameter errors were frequent; only write (overwrite) is allowed

Pitfalls and Caveats

Initial hardware cost was roughly $800 USD. Break-even takes about 5-6 years. If financial ROI is your goal, local LLMs are not the answer. Also, allocating 26GB to the model on a 32GB machine leaves only 6GB for the OS — larger model experiments are off the table. 64GB or a Mac Studio is recommended.

Takeaway

The real value of local LLMs comes down to three things: an unlimited test environment, complete data privacy, and predictable latency (consistently around 24 seconds). Architecture (MoE vs Dense) matters more than raw parameter count — confirmed through actual measurement.

이 블로그 검색

MaJu Tech Notes