Local AI Infrastructure Notes (9/15) — Replacing the Local LLM: From Qwen3.5 to Gemma 4

A generational swap between MoE models. Speed, memory, accuracy, context — all improved.


Summary

  • Replaced Qwen3.5-35B-A3B-4bit with gemma-4-26b-a4b-it-4bit on Mac Mini M4 32GB
  • Speed +7%, RAM -4GB, accuracy 94%→100%, context 4× (32K→131K), multimodal support added
  • All existing pipeline tests passed: recall-tree, retain-merge, micro-cycle, and more

Background: Why a Local LLM

Cloud LLMs are powerful but not cost-effective for all workloads. Memory management, classification, and summarization tasks that run automatically hundreds of times per session will produce runaway token costs when routed through a cloud provider.

This was directly observed during a migration attempt that connected a multi-agent orchestration platform's memory pipeline to a cloud LLM — token consumption became uncontrollable. The architecture was restructured to isolate a local model for mechanical tasks. The local LLM handles classification, summarization, and tag extraction; the cloud LLM handles tasks requiring strategic judgment.

Qwen3.5-35B-A3B (Alibaba) had been running on this layer for an extended period. When Google released Gemma 4 — another MoE architecture — a same-class generational replacement became viable.


Body

1. Model Spec Comparison

| Item | Qwen3.5-35B-A3B (Alibaba) | Gemma-4-26B-A4B (Google) |
|---|---|---|
| Total parameters | 35B | 26B |
| Active parameters | 3B | 3.8B |
| Architecture | MoE (35B/3B) | MoE (26B/3.8B, 128 experts, top-8) |
| Quantization | 4-bit GGUF | 4-bit MLX safetensors |
| Disk | 19 GB | 15 GB |
| Context | 32K | 131K (256K supported) |
| Vision | None | Multimodal supported |

Qwen has more total parameters, but Gemma activates more per token (3.8B vs 3B). In an MoE model only the active parameters take part in each forward pass, so they set the per-token compute and the capacity applied to each token. On that measure Gemma has the edge.
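As a quick check on those ratios, the share of weights active per token can be computed from the table. This is only arithmetic on the quoted figures; the function name is illustrative.

```python
def active_fraction(total_b: float, active_b: float) -> float:
    """Share of parameters used per token, with both sizes in billions."""
    return active_b / total_b


qwen = active_fraction(35, 3)     # Qwen3.5-35B-A3B
gemma = active_fraction(26, 3.8)  # Gemma-4-26B-A4B

print(f"Qwen:  {qwen:.1%} of weights active per token")   # 8.6%
print(f"Gemma: {gemma:.1%} of weights active per token")  # 14.6%
```

Gemma is the smaller model overall but routes each token through a larger slice of its weights.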

Disk footprint is also smaller: 15 GB vs 19 GB. The MLX safetensors format runs natively on Apple Silicon.

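Considerations when selecting a local LLM on Apple Silicon (Mac Mini M4):

  • Unified memory architecture means model size maps directly to RAM consumption; there is no VRAM/RAM split.
  • MLX is Apple's array framework built for Apple Silicon. It runs on the CPU and GPU against unified memory (it does not target the Neural Engine), and an MLX build is often more efficient on this hardware than a GGUF build of the same model, though the gap depends on the runtime and the model.
  • A 15 GB model on a 32 GB system leaves sufficient headroom for the OS and services. 19 GB is tight.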
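The sizing argument can be sketched as back-of-envelope arithmetic. The lower bound below assumes exactly 4 bits per weight; the reported sizes (19 GB and 15 GB) are larger, presumably because of quantization metadata and layers kept at higher precision, which is an assumption rather than something measured here. Function names are illustrative.

```python
def ideal_weight_gb(total_params_b: float, bits: int = 4) -> float:
    """Lower bound on weight size in GB (1e9 bytes): parameters x bits / 8."""
    return total_params_b * bits / 8


def headroom_gb(system_ram_gb: float, model_gb: float) -> float:
    """RAM left for the OS and other services once the model is resident."""
    return system_ram_gb - model_gb


SYSTEM_RAM_GB = 32  # Mac Mini M4 from this article

# (name, total parameters in billions, size reported in the spec table)
models = [("Qwen3.5-35B-A3B", 35, 19), ("Gemma-4-26B-A4B", 26, 15)]

for name, params_b, reported_gb in models:
    print(
        f"{name}: ideal {ideal_weight_gb(params_b):.1f} GB, "
        f"reported {reported_gb} GB, "
        f"headroom {headroom_gb(SYSTEM_RAM_GB, reported_gb)} GB"
    )
```

On a unified-memory machine the headroom figure is what the rest of the stack has to live in, so it is worth computing before downloading a model.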


2. Speed Benchmarks

Measured on Mac Mini M4 32GB, oMLX 0.3.4.

| Item | Qwen3.5 | Gemma 4 | Delta |
|---|---|---|---|
| Generation speed (256 tokens, 3-run avg) | 27.3 tok/s | 29.1 tok/s | +7% |
| oMLX stats average | 14.2 tok/s (long-term cumulative) | 22.5 tok/s (early measurement) | +58% |
| RAM usage | ~19 GB | ~15 GB | -4 GB |
| Free RAM | ~13 GB | ~17 GB | +4 GB |

The 256-token controlled benchmark shows a 7% improvement. The oMLX stats comparison shows 58%, but it sets Qwen's long-term cumulative average (which includes intermittent stalls) against Gemma's early measurements, so the two figures are not like-for-like. The comparison should be repeated once Gemma has accumulated a similar operating history.
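A controlled measurement of this kind can be sketched as follows. The `generate` callable is a hypothetical stand-in for whatever client calls the local server and returns the number of tokens produced; it is not an oMLX API. The injectable `clock` parameter is also an assumption, added so the harness can be checked deterministically.

```python
import time
from statistics import mean


def measure_tok_per_s(generate, max_tokens=256, runs=3, clock=time.perf_counter):
    """Average tokens/second over several runs of a fixed-length generation."""
    rates = []
    for _ in range(runs):
        start = clock()
        produced = generate(max_tokens)  # hypothetical: returns tokens produced
        elapsed = clock() - start
        rates.append(produced / elapsed)
    return mean(rates)


def pct_delta(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


def fake_generate(max_tokens: int) -> int:
    """Stub standing in for a real model call."""
    time.sleep(0.01)
    return max_tokens


print(f"stub rate: {measure_tok_per_s(fake_generate):.0f} tok/s")
print(f"controlled benchmark: {pct_delta(27.3, 29.1):+.1f}%")  # +6.6%
print(f"oMLX stats average:   {pct_delta(14.2, 22.5):+.1f}%")  # +58.5%
```

Running both models through the same harness, with the same prompt and token budget, is what makes the 7% figure comparable in a way the cumulative stats are not.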

The RAM savings are the more immediately actionable result. Freeing 4 GB allows reallocation to other services: dashboard, monitoring, and embedding server.


3. Accuracy Tests

Ran the T01–T14 test suite used by OpenClaw's memory-runner pipeline.

| Test | Qwen3.5 | Gemma 4 |
|---|---|---|
| T01–T10 (basic instruction / file ops) | 10/10 | 10/10 |
| T13 (multi-tag extraction) | 3/3 | 3/3 |
| T14 (conditional branching) | 2/3 | 3/3 |
| Total | 15/16 (94%) | 16/16 (100%) |

The critical delta is T14: conditional branching. This was Qwen3.5's single failure point. Gemma 4 passes it cleanly. Conditional branching governs "where does this memory get stored" in the memory management layer — a 100% pass rate here is operationally significant.
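The totals can be reproduced from the per-group results. The dictionary layout below is illustrative and is not the memory-runner's actual result format.

```python
# (passed, total) per test group, copied from the results above
results = {
    "T01-T10": {"qwen": (10, 10), "gemma": (10, 10)},
    "T13": {"qwen": (3, 3), "gemma": (3, 3)},
    "T14": {"qwen": (2, 3), "gemma": (3, 3)},
}


def pass_rate(model: str):
    """Return (passed, total, ratio) for one model across all groups."""
    passed = sum(group[model][0] for group in results.values())
    total = sum(group[model][1] for group in results.values())
    return passed, total, passed / total


for model in ("qwen", "gemma"):
    passed, total, ratio = pass_rate(model)
    print(f"{model}: {passed}/{total} ({ratio:.0%})")
```

The whole gap is one T14 case out of three, so the headline percentages rest on a small sample.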


4. Pipeline Compatibility

Model-agnostic pipeline design was the precondition for this swap. Pipelines that depend on model identity rather than output format cannot be cleanly migrated.

All pipelines passed:

  • recall-tree: full query test suite passed
  • retain-merge: TAG_ROUTING keyword classification intact
  • micro-cycle: lightweight distillation intact
  • confidence-decay: decay calculation intact
  • bank-lint: knowledge file integrity check intact
  • proactive-select: briefing selection logic intact
  • Embedding (bge-m3 1024d): hybrid search intact

Why the swap had zero pipeline impact: the pipelines depend on output format (JSON, tags), not on model identity. This decoupled design is what makes model replacement safe.
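One way to picture a format-only dependency is a parser that checks the shape of the output and never asks which model produced it. The schema here (`tags` and `summary` fields) is hypothetical and chosen for illustration; the actual pipeline contracts are not shown in this article.

```python
import json

# Hypothetical output contract: field name -> required Python type
REQUIRED_FIELDS = {"tags": list, "summary": str}


def parse_model_output(raw: str) -> dict:
    """Validate raw model output against the contract, whichever model wrote it."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("model output must be a JSON object")
    for field, expected in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected):
            raise ValueError(f"field {field!r} missing or not a {expected.__name__}")
    if not all(isinstance(tag, str) for tag in data["tags"]):
        raise ValueError("every tag must be a string")
    return data


parsed = parse_model_output('{"tags": ["infra", "llm"], "summary": "model swapped"}')
print(parsed["tags"])
```

If every pipeline stage consumes model output through a check like this, a replacement model only has to satisfy the contract, and a failure shows up as a validation error rather than as silent drift.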


5. What 4× Context Expansion Actually Means

Context expanded from 32K to 131K (up to 256K supported). Practical implications:

  • memoryFlush: longer session summaries can be processed in a single pass
  • retain-merge: more context available for tag classification decisions
  • Reflect distillation: a full day's memory can be processed in one pass without chunking

Longer context does increase per-token processing time, so simply maximizing context length is not the correct approach. The value is headroom when it is needed.
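The headroom can be made concrete by counting how many passes a given input needs at each window size. This sketch assumes 32K means 32,768 tokens and 131K means 131,072, and reserves a hypothetical 2,048 tokens for the prompt and the generated output; the 100,000-token input is an illustrative figure, not a measurement.

```python
import math


def passes_needed(input_tokens: int, context_window: int, reserved_tokens: int = 2048) -> int:
    """Number of chunks needed to process the input within the context window."""
    usable = context_window - reserved_tokens
    if usable <= 0:
        raise ValueError("reserved_tokens must be smaller than context_window")
    return max(1, math.ceil(input_tokens / usable))


day_of_memory = 100_000  # illustrative token count

print(passes_needed(day_of_memory, 32_768))   # 4 passes at the old window
print(passes_needed(day_of_memory, 131_072))  # 1 pass at the new window
```

Fewer passes also means fewer chunk boundaries where a summary can lose cross-references, which is a separate benefit from raw throughput.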


6. Local LLM vs Cloud LLM — Selection Criteria

Local LLM is preferable when:

  • High-frequency repetitive calls: classification, tag extraction, summarization in automated pipelines running hundreds to thousands of times. Cloud LLM cost scales linearly.
  • Latency tolerance: 27–29 tok/s is insufficient for real-time conversation but adequate for background batch processing.
  • Privacy: data does not leave the local machine.

Cloud LLM is preferable when:

  • Complex reasoning: strategic judgment, multi-step planning, creative generation.
  • Current information: local models have a training cutoff.
  • Long context + high throughput: cloud infrastructure has no hardware ceiling.

Practical approach: route mechanical, repetitive tasks to local; route judgment tasks to cloud. The two-tier architecture optimizes both cost and quality simultaneously.
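The two-tier split can be sketched as a small routing function. The task names and the `Tier` enum are hypothetical labels for illustration, not the platform's actual identifiers.

```python
from enum import Enum


class Tier(Enum):
    LOCAL = "local"
    CLOUD = "cloud"


# Mechanical, high-frequency task types handled by the local model (illustrative names)
MECHANICAL_TASKS = {"classification", "summarization", "tag_extraction"}


def route(task_type: str) -> Tier:
    """Send mechanical tasks to the local model and everything else to the cloud."""
    return Tier.LOCAL if task_type in MECHANICAL_TASKS else Tier.CLOUD


for task in ("tag_extraction", "summarization", "strategic_planning"):
    print(f"{task} -> {route(task).value}")
```

Defaulting unknown task types to the cloud tier is a design choice in this sketch: it favors quality over cost whenever a task has not been explicitly classified as mechanical.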


Lessons Learned

oMLX stats sampling bias: Qwen's long-term cumulative average vs Gemma's early sample is not a controlled comparison. Recalibration after extended operation is necessary. Early benchmark results warrant measured interpretation.

Quantization format difference: GGUF (Qwen) and MLX safetensors (Gemma) use different runtimes. MLX is built for Apple Silicon, which favors Gemma on the Mac Mini, so part of the measured gap may come from the runtime rather than the model. On Linux/CUDA environments, results may differ.


Conclusion

Gemma 4 is a clean generational replacement: every benchmark metric improved while full pipeline compatibility was maintained. The T14 conditional branching 100% pass rate and the 4 GB RAM reduction are the two results with the most direct operational value.

For developers running local LLMs, two design recommendations:

  1. Run pipeline compatibility tests before committing to a model swap.
  2. Design pipelines to depend on output format (JSON/tags) rather than on model identity — this reduces replacement cost to near zero.

Series overview: Series index