Local AI Infrastructure Notes (9/15) — Replacing the Local LLM: From Qwen3.5 to Gemma 4
A generational swap between MoE models. Speed, memory, accuracy, context — all improved.
Summary
- Replaced Qwen3.5-35B-A3B-4bit with gemma-4-26b-a4b-it-4bit on Mac Mini M4 32GB
- Speed +7%, RAM -4GB, accuracy 94%→100%, context 4× (32K→131K), multimodal support added
- All existing pipeline tests passed: recall-tree, retain-merge, micro-cycle, and more
Background: Why a Local LLM
Cloud LLMs are powerful but not cost-effective for all workloads. Memory management, classification, and summarization tasks that run automatically hundreds of times per session will produce runaway token costs when routed through a cloud provider.
This was directly observed during a migration attempt that connected a multi-agent orchestration platform's memory pipeline to a cloud LLM — token consumption became uncontrollable. The architecture was restructured to isolate a local model for mechanical tasks. The local LLM handles classification, summarization, and tag extraction; the cloud LLM handles tasks requiring strategic judgment.
Qwen3.5-35B-A3B (Alibaba) had been running on this layer for an extended period. When Google released Gemma 4 — another MoE architecture — a same-class generational replacement became viable.
Body
1. Model Spec Comparison
| Item | Qwen3.5-35B-A3B (Alibaba) | Gemma-4-26B-A4B (Google) |
|---|---|---|
| Total parameters | 35B | 26B |
| Active parameters | 3B | 3.8B |
| Architecture | MoE (35B/3B) | MoE (26B/3.8B, 128 experts, top-8) |
| Quantization | 4-bit GGUF | 4-bit MLX safetensors |
| Disk | 19 GB | 15 GB |
| Context | 32K | 131K (256K supported) |
| Vision | None | Multimodal supported |
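Qwen has more total parameters (35B vs 26B), but Gemma activates more of them per token (3.8B vs 3B). In an MoE model, the active parameters set the compute spent on each generated token, while the total count mainly sets the memory footprint. Gemma therefore applies more parameters per token while occupying less memory.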
Disk footprint is also smaller: 15 GB vs 19 GB. The MLX safetensors format runs natively on Apple Silicon.
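Considerations when selecting a local LLM on Apple Silicon (Mac Mini M4):
- Unified memory architecture means model size maps directly to RAM consumption; there is no VRAM/RAM split.
- MLX is Apple's array framework for Apple Silicon. It runs on the GPU through Metal and works directly in unified memory; it does not target the Apple Neural Engine. GGUF is a file format used by llama.cpp-based runtimes, so any efficiency gap is between runtimes rather than formats, and it can vary by model and runtime version.
- A 15 GB model on a 32 GB system leaves sufficient headroom for the OS and other services. 19 GB is tight.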
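The headroom argument above is simple arithmetic, sketched here in Python. The function name is hypothetical, and the sketch assumes resident model memory is roughly equal to the on-disk size, ignoring KV cache and runtime overhead.

```python
def headroom_gb(total_ram_gb: float, model_gb: float, reserved_gb: float = 0.0) -> float:
    """Unified memory left after loading the model and any reserved services.

    Assumption: resident model size ~= on-disk size (KV cache not counted).
    """
    return total_ram_gb - model_gb - reserved_gb


if __name__ == "__main__":
    total = 32.0  # Mac Mini M4, 32 GB unified memory
    for name, size_gb in [("Qwen3.5-35B-A3B 4-bit", 19.0),
                          ("Gemma-4-26B-A4B 4-bit", 15.0)]:
        free = headroom_gb(total, size_gb)
        print(f"{name}: {free:.0f} GB free of {total:.0f} GB ({free / total:.0%})")
```

Passing a non-zero `reserved_gb` (for example, for an embedding server) shows how quickly the 19 GB case runs out of room.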
2. Speed Benchmarks
Measured on Mac Mini M4 32GB, oMLX 0.3.4.
| Item | Qwen3.5 | Gemma 4 | Delta |
|---|---|---|---|
| Generation speed (256 tokens, 3-run avg) | 27.3 tok/s | 29.1 tok/s | +7% |
| oMLX stats average | 14.2 tok/s (long-term cumulative) | 22.5 tok/s (early measurement) | +58% |
| RAM usage | ~19 GB | ~15 GB | -4 GB |
| Free RAM | ~13 GB | ~17 GB | +4 GB |
The 256-token controlled benchmark shows a 7% improvement. The oMLX stats comparison shows 58%, but this compares Qwen's long-term cumulative average (including intermittent stalls) against Gemma's early measurement set — a direct comparison is not statistically fair. Long-term recalibration is required after extended operation.
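For reference, the two deltas in the table can be reproduced with a few lines of Python. This is a minimal sketch: the helper names are hypothetical, and `mean_tok_s` only illustrates how a multi-run average could be computed from (tokens, seconds) pairs, not how oMLX gathers its statistics.

```python
from statistics import mean


def mean_tok_s(runs: list[tuple[int, float]]) -> float:
    """Average tokens/second over several runs given as (tokens, elapsed_seconds)."""
    return mean(tokens / seconds for tokens, seconds in runs)


def percent_delta(baseline: float, candidate: float) -> float:
    """Relative change of candidate versus baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


if __name__ == "__main__":
    # Controlled 256-token benchmark (values from the table above)
    print(f"{percent_delta(27.3, 29.1):+.1f}%")  # +6.6%
    # oMLX cumulative stats (not a like-for-like comparison)
    print(f"{percent_delta(14.2, 22.5):+.1f}%")  # +58.5%
```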
The RAM savings are the more immediately actionable result. Freeing 4 GB allows reallocation to other services: dashboard, monitoring, and embedding server.
3. Accuracy Tests
Ran the T01–T14 test suite used by OpenClaw's memory-runner pipeline.
| Test | Qwen3.5 | Gemma 4 |
|---|---|---|
| T01–T10 (basic instruction / file ops) | 10/10 | 10/10 |
| T13 (multi-tag extraction) | 3/3 | 3/3 |
| T14 (conditional branching) | 2/3 | 3/3 |
| Total | 15/16 (94%) | 16/16 (100%) |
The critical delta is T14: conditional branching. This was Qwen3.5's single failure point, and Gemma 4 passes it cleanly. Conditional branching governs "where does this memory get stored" in the memory management layer, so a 100% pass rate here is operationally significant. T14 covers only three cases, so the result should be rechecked as more production traffic accumulates.
4. Pipeline Compatibility
Model-agnostic pipeline design was the precondition for this swap. Pipelines that depend on model identity rather than output format cannot be cleanly migrated.
All pipelines passed:
- recall-tree: full query test suite passed
- retain-merge: TAG_ROUTING keyword classification intact
- micro-cycle: lightweight distillation intact
- confidence-decay: decay calculation intact
- bank-lint: knowledge file integrity check intact
- proactive-select: briefing selection logic intact
- Embedding (bge-m3 1024d): hybrid search intact
Why the swap had zero pipeline impact: the pipelines depend on output format (JSON, tags), not on model identity. This decoupled design is what makes model replacement safe.
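One way to express "depend on output format, not model identity" is a contract check that every model reply must pass before the pipeline consumes it. The sketch below is illustrative only: the `{"tags": [...]}` schema and the function name are assumptions, not the actual format used by retain-merge or the other pipelines.

```python
import json


def parse_tag_output(raw: str) -> list[str]:
    """Validate a model reply against a hypothetical contract: {"tags": ["...", ...]}.

    Nothing here refers to which model produced the reply; any model that
    emits this shape is accepted, and any that does not is rejected.
    """
    # Tolerate chatter around the JSON by slicing from the first '{' to the last '}'.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    payload = json.loads(raw[start:end + 1])  # JSONDecodeError is a ValueError
    tags = payload.get("tags")
    if not isinstance(tags, list) or not all(isinstance(t, str) and t for t in tags):
        raise ValueError("'tags' must be a list of non-empty strings")
    return tags


if __name__ == "__main__":
    print(parse_tag_output('Sure. {"tags": ["infra", "llm"]}'))
```

Running the same check against both models' replies before a swap gives a quick compatibility signal that is independent of either model.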
5. What 4× Context Expansion Actually Means
Context expanded from 32K to 131K (up to 256K supported). Practical implications:
- memoryFlush: longer session summaries can be processed in a single pass
- retain-merge: more context available for tag classification decisions
- Reflect distillation: a full day's memory can be processed in one pass without chunking
Longer context does increase per-token processing time, so simply maximizing context length is not the correct approach. The value is headroom when it is needed.
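The "single pass versus chunking" point can be made concrete with a small planning function. This is a sketch under stated assumptions: 32K and 131K are taken as 32,768 and 131,072 tokens, the 2,048-token output reserve and the 100,000-token input are hypothetical, and the function name is invented for illustration.

```python
import math


def plan_passes(n_tokens: int, context_window: int, reserved_for_output: int = 2048) -> int:
    """Number of passes needed to feed n_tokens of input through a model.

    The input budget per pass is the context window minus a reserve for the
    model's own output (the reserve size is an assumption).
    """
    budget = context_window - reserved_for_output
    if budget <= 0:
        raise ValueError("context window is smaller than the output reserve")
    return max(1, math.ceil(n_tokens / budget))


if __name__ == "__main__":
    day_of_memory = 100_000  # hypothetical input size, in tokens
    print(plan_passes(day_of_memory, 32_768))   # 4 passes at 32K
    print(plan_passes(day_of_memory, 131_072))  # 1 pass at 131K
```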
6. Local LLM vs Cloud LLM — Selection Criteria
Local LLM is preferable when:
- High-frequency repetitive calls: classification, tag extraction, summarization in automated pipelines running hundreds to thousands of times. Cloud LLM cost scales linearly.
- Latency tolerance: 27–29 tok/s is insufficient for real-time conversation but adequate for background batch processing.
- Privacy: data does not leave the local machine.
Cloud LLM is preferable when:
- Complex reasoning: strategic judgment, multi-step planning, creative generation.
- Current information: local models have a training cutoff.
- Long context + high throughput: cloud infrastructure is not bound by the local machine's memory and compute ceiling.
Practical approach: route mechanical, repetitive tasks to local; route judgment tasks to cloud. The two-tier architecture optimizes both cost and quality simultaneously.
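A minimal sketch of the two-tier routing rule follows. The task labels and names (`Tier`, `MECHANICAL_TASKS`, `route`) are hypothetical; a real router would likely also consider input length, privacy flags, and fallback behavior.

```python
from enum import Enum


class Tier(Enum):
    LOCAL = "local"
    CLOUD = "cloud"


# Mechanical, high-frequency task kinds named in this article (labels are illustrative).
MECHANICAL_TASKS = {"classification", "summarization", "tag_extraction"}


def route(task_kind: str) -> Tier:
    """Send mechanical tasks to the local model and everything else to the cloud."""
    return Tier.LOCAL if task_kind in MECHANICAL_TASKS else Tier.CLOUD


if __name__ == "__main__":
    for kind in ("tag_extraction", "strategic_planning"):
        print(kind, "->", route(kind).value)
```

Defaulting unknown task kinds to the cloud tier favors quality over cost; the opposite default would be the choice if runaway token spend is the larger risk.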
Lessons Learned
oMLX stats sampling bias: Qwen's long-term cumulative average vs Gemma's early sample is not a controlled comparison. Recalibration after extended operation is necessary. Early benchmark results warrant measured interpretation.
Quantization format difference: GGUF (Qwen) and MLX safetensors (Gemma) use different runtimes. On Mac Mini, MLX is native, which advantages Gemma, so part of the measured gain may come from the runtime rather than the model itself. On Linux/CUDA environments, results may differ.
Conclusion
Gemma 4 is a clean generational replacement: every benchmark metric improved while full pipeline compatibility was maintained. The T14 conditional branching 100% pass rate and the 4 GB RAM reduction are the two results with the most direct operational value.
For developers running local LLMs, two design recommendations:
- Run pipeline compatibility tests before committing to a model swap.
- Design pipelines to depend on output format (JSON/tags) rather than on model identity — this reduces replacement cost to near zero.
Series overview: Series index