Local AI Infrastructure Notes (10/15) — Mac Mini M4 Local LLM Benchmark: MoE vs Dense

3계층 메모리 아키텍처 — 도메인 지식 / 행동 규칙 / 세션 복원

Comparing 4 Qwen3.5 variants, the thinking mode trap, and warm/cold start strategies


핵심 요약

  • A MoE (Mixture of Experts) 35B model is 3.8x faster than a Dense 27B model
  • Thinking mode enabled by default on newer models wastes 80 seconds on trivial tasks — disable it explicitly
  • The real value of local LLMs isn't cost recovery; it's unlimited experimentation and full data privacy
1. Layer 1: 프로젝트 지식 (docs/) — 영구 기억

Background

Running AI agents 24/7 meant I wanted to offload repetitive tasks — monitoring, memory cleanup, file I/O — from cloud APIs to local inference. The goal was an autonomous automation pipeline free from rate limits and network latency. I tested four Qwen3.5 model variants on a Mac Mini M4 with 32GB RAM running Ollama.

2. Layer 2: 행동 규칙 (lessons.md) — 경험에서 승격된 규칙

Benchmark Results

Model Architecture Generation VRAM 3k Token Response Verdict
qwen3.5:27b Dense 4.5 tok/s 25.3GB ~170s Rejected
qwen3:8b Dense Fast ~8GB ~15s Doesn't follow instructions
qwen3.5:9b Dense 13.1 tok/s ~8GB 30-60s Backup
qwen3.5:35b-a3b-16k MoE 17.2 tok/s 26.0GB 20-40s Adopted

The MoE model's advantage was overwhelming: 4.7x faster prompt eval, 3.8x faster generation, roughly 7x faster end-to-end response. MoE activates only ~3B parameters during inference, so VRAM usage is similar while actual compute drops dramatically.

The Thinking Mode Trap

An automated cron job started failing. A simple "Say OK" prompt was taking 80.8 seconds. Analysis revealed that actual text generation took 0.12 seconds — the remaining time was entirely wasted on the reasoning (thinking) process.

Newer models ship with reasoning mode enabled by default. You must explicitly disable it via the API:

{
  "payload": {
    "thinking": "off",
    "lightContext": true,
    "timeoutSeconds": 1800
  }
}

Hiding the reasoning output in the UI is not enough. thinking: "off" is mandatory.

Warm/Cold Start Strategy

Cold starts add roughly 30 seconds of latency on the first call. Two strategies to mitigate this:

  1. LaunchAgent auto warm-up: On boot, load the model into VRAM permanently with keep_alive=-1
  2. Model sharing: Monitoring and memory management agents share the same model. A 30-minute heartbeat cycle naturally maintains the warm state

Local LLM Prompt Design Rules

Prompts that worked on cloud LLMs often failed locally:

  1. Single responsibility: 1 prompt = 1 file + 1 action. Compound instructions are split at the script level
  2. Explicit tool control: Not "save it" but "use the write tool to save it"
  3. No edit tool: Hallucination-induced parameter errors were frequent; only write (overwrite) is allowed

Pitfalls and Caveats

Initial hardware cost was roughly $800 USD. Break-even takes about 5-6 years. If financial ROI is your goal, local LLMs are not the answer. Also, allocating 26GB to the model on a 32GB machine leaves only 6GB for the OS — larger model experiments are off the table. 64GB or a Mac Studio is recommended.

Takeaway

The real value of local LLMs comes down to three things: an unlimited test environment, complete data privacy, and predictable latency (consistently around 24 seconds). Architecture (MoE vs Dense) matters more than raw parameter count — confirmed through actual measurement.

댓글

이 블로그의 인기 게시물

Agent Memory Engine (2/10) — Building an AI Agent Memory System with SQLite Alone

"ML Foundations (9/9) — PyTorch vs TensorFlow, and the Road to Local LLMs"

"RAG Core Study (14/26) — Evaluation Sets with RAGAS & DeepEval"

"ML Foundations (8/9) — Deep Learning Architectures: CNN, RNN, Attention"

"ML Foundations (7/9) — Deep Learning Training: Optimizers, Regularization, Initialization"

OpenClaw to Hermes Migration (2/13) — What to Preserve, Partially Port, or Discard

AI Agents I Built (5/7) — Building an Automated Blogger API Publishing System