Local AI Infrastructure Notes (7/15) — Best Local LLM Setup on macOS: The Complete oMLX Guide
GPT-5.5 at $5 per million input tokens. Claude Opus 4.7 with similar premium pricing. For heavy users on M3/M4 Macs, the monthly bill stings. There is another path — local inference. But the tooling landscape (Ollama, LM Studio, llama.cpp) is fragmented, and recommendations swing wildly.
oMLX, which shipped v0.3.8rc1 on April 28, 2026, settles the question. It is a macOS-only inference server built on Apple's MLX framework, with vLLM-style continuous batching and an SSD-backed KV cache. It cuts agent time-to-first-token (TTFT) from 30-90 seconds to under 5 seconds, and Claude Code, Cursor, and OpenClaw can use it as a drop-in backend.
This guide explains why oMLX is the right choice for Apple Silicon Macs as of April 2026, sourced entirely from primary documentation (GitHub, Apple ML Research, Ollama blog). It covers installation, model recommendations by Mac RAM tier, and Claude Code integration.
In One Paragraph
oMLX is a local LLM server you launch from the macOS menu bar. It exposes both OpenAI- and Anthropic-compatible APIs, making it a drop-in target for Claude Code, Cursor, and OpenClaw. Its differentiator is a two-tier KV cache (RAM hot + SSD cold): time-to-first-token (TTFT) for agents with long system prompts drops from 30-90 seconds to under 5 seconds. Apache 2.0 licensed; requires macOS 15+ on Apple Silicon (M1 through M5).
What oMLX Actually Is
Maintainer: jundot (GitHub jundot/omlx)
Latest release: v0.3.8rc1 (2026-04-28)
License: Apache 2.0
System requirements:
- macOS 15.0 (Sequoia) or higher
- Apple Silicon (M1/M2/M3/M4/M5)
- Python 3.10+
Four core capabilities
- Tiered KV cache
  - RAM hot cache → evicts to SSD cold cache when full
  - Block-level prefix sharing + Copy-on-Write (vLLM-inspired)
  - Every KV block persisted as safetensors → instant restore on restart
- Continuous batching
  - Built on mlx-lm's BatchGenerator
  - Multiple concurrent requests → higher GPU utilization
- Multi-model concurrent serving
  - LLM + VLM (Qwen3.5, GLM-4V, Pixtral) + OCR (DeepSeek-OCR, GLM-OCR) + embeddings (BGE-M3) + rerankers
  - LRU eviction inside RAM budget
- API compatibility
  - OpenAI: http://localhost:8000/v1/chat/completions
  - Anthropic: http://localhost:8000/v1/messages
  - Drop-in for Claude Code, Cursor, OpenClaw
The menu-bar app handles model download, start, and stop visually. The /admin web dashboard provides real-time monitoring and one-click benchmarking.
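The tiered cache idea can be illustrated with a toy model. The sketch below is not oMLX's implementation: `block_keys`, `TwoTierCache`, the block size of 4, and the use of `pickle` files (standing in for the safetensors files oMLX is described as writing) are all assumptions made for illustration. It shows two things: chained hashing, so that requests sharing a prefix map to the same block keys, and LRU eviction from a small hot tier into files on disk.

```python
import hashlib
import pickle
from collections import OrderedDict
from pathlib import Path


def block_keys(tokens, block_size=4):
    """Chain-hash full token blocks so that equal prefixes yield equal keys."""
    keys, parent = [], b""
    full = len(tokens) - len(tokens) % block_size  # ignore the partial tail block
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).hexdigest()
        keys.append(digest)
        parent = digest.encode()  # each key depends on everything before it
    return keys


class TwoTierCache:
    """Hot tier: in-memory LRU. Cold tier: one file per block on disk."""

    def __init__(self, cold_dir, hot_capacity=2):
        self.hot = OrderedDict()
        self.hot_capacity = hot_capacity
        self.cold_dir = Path(cold_dir)
        self.cold_dir.mkdir(parents=True, exist_ok=True)

    def put(self, key, value):
        self.hot[key] = value
        self.hot.move_to_end(key)  # mark as most recently used
        while len(self.hot) > self.hot_capacity:
            old_key, old_value = self.hot.popitem(last=False)  # evict LRU entry
            # pickle is a stand-in here; it is NOT the format oMLX uses
            (self.cold_dir / old_key).write_bytes(pickle.dumps(old_value))

    def get(self, key):
        if key in self.hot:
            self.hot.move_to_end(key)
            return self.hot[key]
        path = self.cold_dir / key
        if path.exists():
            value = pickle.loads(path.read_bytes())
            self.put(key, value)  # promote back into the hot tier
            return value
        return None
```

Because cold blocks live on disk, a fresh `TwoTierCache` pointed at the same directory can still serve them, which is the toy analogue of restoring the cache after a restart.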
Why MLX Wins on Apple Silicon
oMLX runs on Apple's MLX framework, which exploits Apple Silicon's Unified Memory Architecture (UMA) directly. CPU and GPU share memory, so there is no copy step. This is where Apple Silicon diverges from PC GPU stacks.
Same model, same Mac — measured differences
- MLX vs llama.cpp: 15-30% faster throughput, ~10% less memory
- M4 Pro 64GB, LM Studio MLX vs Ollama (llama.cpp): 46% faster
- Qwen3.5-35B-A3B: MLX 71.2 tok/s vs Ollama 30.3 tok/s — 2.3×
The gap widens with model size. Memory bandwidth is the binding constraint for LLM inference, and MLX taps UMA without overhead.
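A rough way to see why bandwidth binds: during decode, a dense model has to stream its weights from memory once per generated token, so bandwidth divided by weight size gives a ceiling on tokens per second. The sketch below is a back-of-the-envelope estimate, not a benchmark. The function name is hypothetical, the bandwidth figures are the base M4 and M5 numbers quoted in this guide, and the 14 GB weight size is the figure this guide gives for a 27B model at 4-bit.

```python
def decode_ceiling_tok_s(bandwidth_gb_s, weights_read_gb):
    """Upper bound on decode speed if every token streams the weights once."""
    return bandwidth_gb_s / weights_read_gb


# Assumption: a dense model with ~14 GB of weights, all read for each token.
for chip, bandwidth in (("M4", 120.0), ("M5", 153.0)):
    print(chip, round(decode_ceiling_tok_s(bandwidth, 14.0), 1))
# M4 8.6
# M5 10.9
```

Real throughput lands below this ceiling. MoE models read only their active experts per token, which is one reason a 35B model with 3B active parameters can decode far faster than its total size suggests.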
M5 Neural Accelerators (announced 2025-11)
Apple ML Research's M5 measurements:
| Item | M4 | M5 | Improvement |
|---|---|---|---|
| Memory bandwidth | 120 GB/s | 153 GB/s | +28% |
| 14B dense prefill TTFT | — | <10s | — |
| 30B MoE prefill TTFT | — | <3s | — |
| Total TTFT speedup | baseline | 3.33× to 4.06× | — |
The M5 GPU includes dedicated matrix-multiplication units (Neural Accelerators), engaged via Metal 4's TensorOps + Metal Performance Primitives. MLX uses them automatically. On M5, MLX gets faster on top of its existing lead.
Ollama 0.19's MLX backend (2026-03-30)
Ollama 0.19 added an experimental MLX backend. On 32GB+ unified-memory Macs, it bypasses llama.cpp:
- Prefill: 1,154 → 1,810 tok/s
- Decode: 58 → 112 tok/s
- NVFP4 quantization
So as long as you stay on MLX, the underlying speed is comparable across oMLX, LM Studio MLX, and Ollama 0.19+. The differentiation comes from API compatibility and caching strategy.
oMLX vs Ollama vs LM Studio — One-Line Comparison
| Item | Ollama | LM Studio | oMLX |
|---|---|---|---|
| Form | CLI + daemon | GUI app | Menu bar app + server |
| Backend | llama.cpp (+optional MLX) | MLX or llama.cpp | MLX only |
| API | OpenAI | OpenAI | OpenAI + Anthropic |
| KV cache | RAM | RAM | RAM + SSD |
| Multi-model | One active | One active | Concurrent (LRU) |
| VLM/OCR | Some | Some | First-class |
| License | MIT | Closed | Apache 2.0 |
When to use which
- Ollama: Docker/k8s, multi-node ops, x86 portability
- LM Studio: GUI exploration, parameter tuning (non-developers)
- oMLX: Agent workflows, Claude Code/Cursor drop-in, repeated calls with long system prompts
Installation — Three Paths
A. macOS app (recommended for general users)
- Download the latest .dmg from https://github.com/jundot/omlx/releases
- Mount the .dmg → drag to Applications
- First run → security warning → allow in System Settings
- Click menu bar icon → download a model → Start
- In-app auto-update keeps it current
B. Homebrew (developers)
brew tap jundot/omlx
brew install omlx
omlx serve
C. From source (customization)
git clone https://github.com/jundot/omlx.git
cd omlx
pip install -e .
omlx serve
Verify it's running
curl http://localhost:8000/v1/models
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"qwen3-4b-instruct","messages":[{"role":"user","content":"hello"}]}'
Web admin: http://localhost:8000/admin
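The same check can be scripted. This is a minimal sketch using only the Python standard library; `build_chat_request` and `send` are hypothetical helper names, and the response shape noted in the comment assumes oMLX mirrors the OpenAI chat completions format.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt):
    """Build (but do not send) a POST to the OpenAI-compatible chat endpoint."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and decode the JSON reply. Needs a running server."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage, with the server running:
#   reply = send(build_chat_request("http://localhost:8000", "qwen3-4b-instruct", "hello"))
#   print(reply["choices"][0]["message"]["content"])  # assumes OpenAI-style response
```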
Model Recommendations by Mac RAM
Which models run well under oMLX depends on how much unified memory your Mac has. All quantizations below are 4-bit unless noted and come from mlx-community on Hugging Face.
16GB RAM (M1/M2/M3 base, M4 base, some MacBook Air)
- Qwen3-4B-Instruct 4-bit (~2.5 GB) — chat, summary
- Llama 4 Scout 4B 4-bit (~2.5 GB) — general
- Phi-5-mini 3.8B 4-bit (~2.3 GB) — strong reasoning
16GB is safe for 4B to 8B models. Larger models spill into SSD swap, and throughput collapses.
24GB RAM (some M3/M4 Pro, M4 Air 24GB)
- Qwen3.6-27B 4-bit (~14 GB) — primary for coding/general
- Mistral Medium 3 22B 4-bit (~12 GB) — multilingual
- Llama 4 Scout 17B 4-bit (~10 GB) — general
24GB is the current local-LLM sweet spot. Qwen3.6-27B delivers near-GPT-5-mini quality.
36 to 48GB RAM (M3/M4 Pro upper)
- Qwen3.6-27B 8-bit (~28 GB) — high-quality coding
- GPT-OSS 20B 4-bit (~12 GB) — reasoning
- Coding model + embeddings concurrently (oMLX multi-model)
64GB+ RAM (M3/M4/M5 Max)
- Qwen3.6-27B 8-bit (~28 GB) — primary
- Kimi K2.6 32B-active 4-bit (~64 GB total weight, 1T MoE) — top coding
- GPT-OSS 20B + Qwen3.6-27B + embeddings concurrent
128GB (M3/M4 Ultra)
- DeepSeek-V4-Flash 4-bit (~150 GB) splittable
- Qwen3.6-72B / Llama 4 Maverick
mlx-community on Hugging Face hosts roughly 4,653 MLX-quantized models. New releases typically appear in 4-bit form within 24 hours.
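The sizes quoted in the tiers above follow from simple arithmetic: raw weight size is parameter count times bits per weight, divided by 8. The sketch below makes that explicit. Both function names are hypothetical, and the 25% RAM reserve for macOS, the KV cache, and other apps is an assumed rule of thumb, not a figure from the oMLX documentation.

```python
def weight_gb(params_billion, bits):
    """Raw weight size in GB (10^9 bytes): parameters x bits / 8."""
    return params_billion * bits / 8


def fits(params_billion, bits, ram_gb, reserve_fraction=0.25):
    """True if the weights fit after reserving a share of RAM.

    reserve_fraction is an assumed allowance for macOS, KV cache, and apps.
    """
    return weight_gb(params_billion, bits) <= ram_gb * (1 - reserve_fraction)


print(weight_gb(27, 4))   # 13.5
print(weight_gb(27, 8))   # 27.0
print(fits(27, 4, 24))    # True
print(fits(27, 4, 16))    # False
```

Published file sizes run slightly higher than these raw numbers (for example ~14 GB rather than 13.5 GB) because quantized checkpoints also store scales and some layers at higher precision.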
Claude Code as Drop-in Backend
This is oMLX's real value proposition. Claude Code normally calls Anthropic's API, but point its base URL at oMLX's Anthropic-compatible endpoint and inference runs locally.
Configuration
In .zshrc or project env:
export ANTHROPIC_BASE_URL=http://localhost:8000
export ANTHROPIC_API_KEY=local
Or in .claude/settings.json:
{
  "env": {
    "ANTHROPIC_BASE_URL": "http://localhost:8000",
    "ANTHROPIC_API_KEY": "local"
  }
}
oMLX's /v1/messages endpoint follows the Anthropic Messages API spec, including Tool Use (verify with a short test for newer tools).
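A short test of the endpoint can be scripted before pointing Claude Code at it. The sketch below builds a request in the shape the Anthropic Messages API defines (`model`, `max_tokens`, `messages`, plus the `x-api-key` and `anthropic-version` headers). Whether oMLX requires or inspects those headers is an assumption, the helper names are hypothetical, and the model ID is only an example of a downloaded model.

```python
import json
import urllib.request


def build_messages_request(base_url, model, prompt, api_key="local", max_tokens=256):
    """Build (but do not send) a POST in Anthropic Messages API shape."""
    body = {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/messages",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "content-type": "application/json",
            "x-api-key": api_key,               # any string for a local server (assumed)
            "anthropic-version": "2023-06-01",  # header defined by the Anthropic API
        },
        method="POST",
    )


def first_text(response_body):
    """Return the first text block of a Messages-style response dict, if any."""
    for block in response_body.get("content", []):
        if block.get("type") == "text":
            return block["text"]
    return None


# Usage, with the server running:
#   req = build_messages_request("http://localhost:8000", "qwen3.6-27b-4bit", "hello")
#   with urllib.request.urlopen(req, timeout=60) as resp:
#       print(first_text(json.loads(resp.read())))
```

`first_text` returns `None` when the reply contains only non-text blocks such as `tool_use`, which makes it a quick way to spot Tool Use responses during a compatibility check.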
Effects
- External API cost: $0
- No data leaving your machine
- Works offline
- SSD KV cache → repeated calls with the same system prompt see TTFT under 5 seconds
Cursor configuration
API URL: http://localhost:8000/v1
API Key: (any string)
Model: any downloaded ID (e.g., qwen3.6-27b-4bit)
Caveats
macOS 15.0 minimum
Pre-Sequoia is unsupported. Stick with Ollama or LM Studio there.
M-series above M5
Release notes don't explicitly mention M6+. General forward compatibility is expected, but check GitHub Issues.
Anthropic-compat API edges
- Newer Tool Use formats (some MCP tools) may not match exactly
- Test with a small task before relying on it
SSD wear
Persisted KV cache → more SSD writes.
- Use an external SSD (e.g., T7 Shield) to protect internal NVMe
- Otherwise monitor SMART every 6 to 12 months
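To judge how much data a persisted KV cache actually writes, the standard per-token size formula helps: 2 (keys and values) × layers × KV heads × head dimension × bytes per value. The sketch below applies it. The model dimensions (48 layers, 8 KV heads, head dimension 128) are hypothetical and not taken from any model named in this guide, and 2 bytes per value assumes an fp16 cache, which may not match what oMLX stores.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache per token: keys and values for every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_gb(tokens, **dims):
    """Total KV cache size in GB (10^9 bytes) for a prompt of `tokens` tokens."""
    return tokens * kv_bytes_per_token(**dims) / 1e9


# Hypothetical model dimensions, chosen only for illustration.
dims = {"layers": 48, "kv_heads": 8, "head_dim": 128}
print(kv_bytes_per_token(**dims))      # 196608
print(round(kv_gb(20_000, **dims), 2)) # 3.93
```

Under these assumptions, caching a 20,000-token system prompt writes about 4 GB once; rewrites happen only when the prefix changes.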
Security — keep it local
Default bind is 127.0.0.1. To access from another device, route through Tailscale or similar. Never expose directly to the internet.
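On the client side, a small guard can confirm that a configured base URL really points at the local machine before any prompt is sent. This is a sketch using only the standard library; `is_loopback_url` is a hypothetical helper, and it deliberately treats DNS names other than `localhost` as not provably local.

```python
import ipaddress
from urllib.parse import urlsplit


def is_loopback_url(url):
    """True if the URL's host is localhost or a loopback IP address."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # a DNS name: cannot be shown to be local without resolving


print(is_loopback_url("http://127.0.0.1:8000"))     # True
print(is_loopback_url("http://192.168.1.20:8000"))  # False
```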
Real Workflow — M4 Max 64GB Mac
Heavy Claude Code users can hit $200 to $500 per month once the $20 Pro subscription and metered API usage are added together. With oMLX, you can shave off about 80%:
- Install oMLX (Homebrew)
- Pull Qwen3.6-27B 8-bit + GPT-OSS 20B 4-bit
- Set Claude Code's ANTHROPIC_BASE_URL=http://localhost:8000
- Routine coding (autocomplete, refactor, small functions): Qwen3.6-27B for speed
- Hard reasoning (architecture, thread safety, concurrency): route to Anthropic API for Claude Opus 4.7
- Result: ~80% of calls go local → typical savings of $200+/month
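The local/remote split described above can be written down as a routing rule. The sketch below is illustrative only: the task labels, `pick_backend`, and `monthly_savings` are hypothetical, and the savings estimate assumes cost scales linearly with the share of calls moved local, which ignores that hard-reasoning calls tend to be the expensive ones.

```python
# Hypothetical task labels, mirroring the split described in the list above.
LOCAL_TASKS = {"autocomplete", "refactor", "small_function"}


def pick_backend(task_kind):
    """Route routine coding to the local model, everything else to the hosted API."""
    return "local" if task_kind in LOCAL_TASKS else "remote"


def monthly_savings(metered_usd, local_share):
    """Estimated savings, assuming cost is proportional to the share of calls."""
    return metered_usd * local_share


print(pick_backend("refactor"))              # local
print(pick_backend("architecture"))          # remote
print(round(monthly_savings(250, 0.8), 2))   # 200.0
```

Defaulting unknown task kinds to the hosted model is a conservative choice: a misrouted hard task costs money, while a misrouted one sent to the local model may cost correctness.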
In-house RAG with no data egress
- oMLX + bge-m3 embeddings + Qwen3.6-27B 8-bit
- All models served concurrently (oMLX strength)
- In-house docs → embeddings → vector DB (Chroma/Qdrant)
- Query → top-k retrieval → Qwen3.6-27B answer
- Zero external API calls, response time roughly 2 to 5 s
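The retrieval step in the pipeline above reduces to ranking stored vectors by cosine similarity and passing the best passages to the model. The sketch below shows that step in pure Python with toy two-dimensional vectors. In practice the vectors would come from the embedding model and the ranking from the vector database; the function names and prompt wording are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, docs, k=2):
    """docs is a list of (text, vector); return the k most similar texts."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question, passages):
    """Assemble a grounded prompt from the retrieved passages."""
    context = "\n\n".join(passages)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"


docs = [("vacation policy", [1.0, 0.0]), ("build guide", [0.0, 1.0]), ("leave form", [0.9, 0.1])]
print(top_k([1.0, 0.0], docs))  # ['vacation policy', 'leave form']
```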
Bottom Line
The one-sentence summary: "Apple Silicon-optimized Claude Code backend."
| Strength | Reason |
|---|---|
| Speed | MLX-based → 1.5× to 2.3× over Ollama's llama.cpp backend (Apple Silicon) |
| Compatibility | OpenAI + Anthropic APIs |
| Cost | Apache 2.0, models free (mlx-community) |
| Integration | Drop-in for Claude Code/Cursor |
| Cache | Two-tier RAM + SSD → agent TTFT under 5s |
| Multimodal | LLM + VLM + OCR + embeddings concurrently |
If you have an M3+ Mac and Claude Code costs are wearing you down, install it this weekend. Easy to remove if it doesn't fit. Worth a smaller bill if it does.
First-Party Sources
- oMLX: github.com/jundot/omlx, omlx.ai
- Apple MLX research: machinelearning.apple.com/research/exploring-llms-mlx-m5
- MLX framework: github.com/ml-explore/mlx
- Ollama MLX: ollama.com/blog/mlx
- mlx-community: huggingface.co/mlx-community
- Qwen3.6-27B: huggingface.co/Qwen/Qwen3.6-27B