Local AI Infrastructure Notes (7/15) — Best Local LLM Setup on macOS: The Complete oMLX Guide

April 29, 2026

GPT-5.5 at $5 per million input tokens. Claude Opus 4.7 with similar premium pricing. For heavy users on M3/M4 Macs, the monthly bill stings. There is another path — local inference. But the tooling landscape (Ollama, LM Studio, llama.cpp) is fragmented, and recommendations swing wildly.

oMLX, which shipped v0.3.8rc1 on April 28, 2026, settles the question. It is a macOS-only inference server built on Apple's MLX framework, with vLLM-style continuous batching and an SSD-backed KV cache. It cuts agent time-to-first-token (TTFT) from 30-90 seconds to under 5 seconds, and Claude Code, Cursor, and OpenClaw can use it as a drop-in backend.

This guide explains why oMLX is the right choice for Apple Silicon Macs as of April 2026, sourced entirely from primary documentation (GitHub, Apple ML Research, Ollama blog). It covers installation, model recommendations by Mac RAM tier, and Claude Code integration.

In One Paragraph

oMLX is a local LLM server you launch from the macOS menu bar. It exposes both OpenAI- and Anthropic-compatible APIs, making it a drop-in target for Claude Code, Cursor, and OpenClaw. Its differentiator is a two-tier KV cache (hot in RAM, cold on SSD): TTFT for agents with long system prompts drops from 30-90 seconds to under 5 seconds. Apache 2.0, macOS 15+, M1 through M5 supported.

What oMLX Actually Is

Maintainer: jundot (GitHub jundot/omlx)
Latest release: v0.3.8rc1 (2026-04-28)
License: Apache 2.0
System requirements:
- macOS 15.0 (Sequoia) or higher
- Apple Silicon (M1/M2/M3/M4/M5)
- Python 3.10+

Four core capabilities

Tiered KV cache
- RAM hot cache → evicts to SSD cold cache when full
- Block-level prefix sharing + Copy-on-Write (vLLM-inspired)
- Every KV block persisted as safetensors → instant restore on restart
Continuous batching
- Built on mlx-lm's BatchGenerator
- Multiple concurrent requests → higher GPU utilization
Multi-model concurrent serving
- LLM + VLM (Qwen3.5, GLM-4V, Pixtral) + OCR (DeepSeek-OCR, GLM-OCR) + embeddings (BGE-M3) + rerankers
- LRU eviction inside RAM budget
API compatibility
- OpenAI: http://localhost:8000/v1/chat/completions
- Anthropic: http://localhost:8000/v1/messages
- Drop-in for Claude Code, Cursor, OpenClaw
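
The hot/cold behaviour described under the tiered KV cache can be sketched in a few lines of Python. This is a toy model, not oMLX's implementation: the class name `TieredCache`, the pickle-on-disk format (oMLX is described as persisting safetensors), and the block-count budget are all assumptions made for illustration.

```python
import os
import pickle
import tempfile
from collections import OrderedDict


class TieredCache:
    """Toy two-tier cache: an LRU 'hot' dict in RAM, spilling to files on disk."""

    def __init__(self, hot_capacity, cold_dir):
        self.hot_capacity = hot_capacity      # max blocks kept in RAM (hypothetical budget)
        self.cold_dir = cold_dir              # directory standing in for the SSD tier
        self.hot = OrderedDict()              # key -> block, ordered oldest to newest

    def _cold_path(self, key):
        return os.path.join(self.cold_dir, f"{key}.blk")

    def put(self, key, block):
        self.hot[key] = block
        self.hot.move_to_end(key)             # mark as most recently used
        while len(self.hot) > self.hot_capacity:
            old_key, old_block = self.hot.popitem(last=False)   # least recently used
            with open(self._cold_path(old_key), "wb") as f:
                pickle.dump(old_block, f)     # evict to the cold tier instead of dropping

    def get(self, key):
        if key in self.hot:
            self.hot.move_to_end(key)
            return self.hot[key], "hot"
        path = self._cold_path(key)
        if os.path.exists(path):
            with open(path, "rb") as f:
                block = pickle.load(f)
            self.put(key, block)              # promote back into RAM
            return block, "cold"
        return None, "miss"


cold_dir = tempfile.mkdtemp(prefix="kv_cold_")
cache = TieredCache(hot_capacity=2, cold_dir=cold_dir)
for name in ("a", "b", "c"):
    cache.put(name, [name] * 4)

first_lookup = cache.get("a")    # "a" was evicted to disk when "c" arrived
second_lookup = cache.get("a")   # now promoted, so served from RAM
missing = cache.get("zzz")
```

The point of the second tier is that an evicted block costs a disk read on its next use rather than a full recomputation of the prompt.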

The menu-bar app handles model download, start, and stop visually. The /admin web dashboard provides real-time monitoring and one-click benchmarking.

Why MLX Wins on Apple Silicon

oMLX runs on Apple's MLX framework, which exploits Apple Silicon's Unified Memory Architecture (UMA) directly. CPU and GPU share memory, so there is no copy step. This is where Apple Silicon diverges from PC GPU stacks.

Same model, same Mac — measured differences

MLX vs llama.cpp: 15-30% higher throughput, about 10% less memory
M4 Pro 64GB, LM Studio MLX vs Ollama (llama.cpp): 46% faster
Qwen3.5-35B-A3B: MLX 71.2 tok/s vs Ollama 30.3 tok/s — 2.3×

The gap widens with model size. Memory bandwidth is the binding constraint for LLM inference, and MLX reads weights straight from unified memory with no CPU-to-GPU copy.
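
A rough way to see why bandwidth dominates: during decode, each generated token has to stream the active weights from memory once, so bandwidth divided by weight bytes gives a ceiling on tokens per second. The sketch below is back-of-envelope arithmetic under that assumption (it ignores KV-cache reads, compute time, and batching). The function name is hypothetical, the 120 GB/s figure is the M4 bandwidth quoted in this guide, and the weight sizes are illustrative.

```python
def decode_ceiling_tok_s(bandwidth_gb_s, active_weight_gb):
    """Upper bound on decode speed if every token reads all active weights once."""
    return bandwidth_gb_s / active_weight_gb


# 120 GB/s is the M4 bandwidth figure from this guide; weight sizes are illustrative.
small = decode_ceiling_tok_s(120, 2.5)    # a ~2.5 GB 4-bit model
large = decode_ceiling_tok_s(120, 14.0)   # a ~14 GB 4-bit model

print(f"2.5 GB model: at most {small:.0f} tok/s")
print(f"14 GB model:  at most {large:.1f} tok/s")
```

A mixture-of-experts model reads only its active experts per token, which is why a 35B-A3B model can decode far faster than a dense model of the same total size.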

M5 Neural Accelerators (announced 2025-11)

Apple ML Research's M5 measurements:

Item	M4	M5	Improvement
Memory bandwidth	120 GB/s	153 GB/s	+28%
14B dense prefill TTFT	—	<10s	—
30B MoE prefill TTFT	—	<3s	—
Total TTFT speedup	baseline	3.33×~4.06×	—

The M5 GPU includes dedicated matrix-multiplication units (Neural Accelerators), engaged via Metal 4's TensorOps + Metal Performance Primitives. MLX uses them automatically. On M5, MLX gets faster on top of its existing lead.

Ollama 0.19's MLX backend (2026-03-30)

Ollama 0.19 added an experimental MLX backend. On Macs with 32GB or more of unified memory, it bypasses llama.cpp:
- Prefill: 1,154 → 1,810 tok/s (about 1.57×)
- Decode: 58 → 112 tok/s (about 1.93×)
- NVFP4 quantization

So as long as you stay on MLX, the underlying speed is comparable across oMLX, LM Studio MLX, and Ollama 0.19+. The differentiation comes from API compatibility and caching strategy.

oMLX vs Ollama vs LM Studio — One-Line Comparison

Item	Ollama	LM Studio	oMLX
Form	CLI + daemon	GUI app	Menu bar app + server
Backend	llama.cpp (+optional MLX)	MLX or llama.cpp	MLX only
API	OpenAI	OpenAI	OpenAI + Anthropic
KV cache	RAM	RAM	RAM + SSD
Multi-model	One active	One active	Concurrent (LRU)
VLM/OCR	Some	Some	First-class
License	MIT	Closed	Apache 2.0

When to use which

Ollama: Docker/k8s, multi-node ops, x86 portability
LM Studio: GUI exploration, parameter tuning (non-developers)
oMLX: Agent workflows, Claude Code/Cursor drop-in, repeated calls with long system prompts

Installation — Three Paths

A. macOS app (recommended for general users)

Download the latest .dmg from https://github.com/jundot/omlx/releases
Mount the .dmg → drag to Applications
First run → security warning → allow in System Settings
Click menu bar icon → download a model → Start
In-app auto-update keeps it current

B. Homebrew (developers)

brew tap jundot/omlx
brew install omlx
omlx serve

C. From source (customization)

git clone https://github.com/jundot/omlx.git
cd omlx
pip install -e .
omlx serve

Verify it's running

curl http://localhost:8000/v1/models

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3-4b-instruct","messages":[{"role":"user","content":"hello"}]}'

Web admin: http://localhost:8000/admin
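
The same check can be scripted. The sketch below uses only Python's standard library and mirrors the curl call above; the helper names `build_chat_request` and `send` are hypothetical, and the response shape in the comments assumes the server follows the OpenAI chat completions format it advertises.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"   # default address used throughout this guide


def build_chat_request(model, prompt, base_url=BASE_URL):
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the decoded JSON response (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


req = build_chat_request("qwen3-4b-instruct", "hello")
# With the server running:
# reply = send(req)
# print(reply["choices"][0]["message"]["content"])   # OpenAI response shape
```

Building the request separately from sending it makes the payload easy to inspect before anything touches the network.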

Model Recommendations by Mac RAM

The same oMLX runs different models well depending on Mac size. All quantizations below are 4-bit by default (mlx-community on Hugging Face).
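
The sizes quoted in the tiers below follow from simple arithmetic: weight footprint is roughly parameter count times bits per weight divided by 8. The sketch below is an estimate only; real quantized files come out slightly larger because they also store quantization scales, and the 8 GB headroom for macOS, the KV cache, and other apps is an arbitrary illustrative margin, not a measured value.

```python
def weight_gb(params_billion, bits):
    """Approximate weight footprint: parameters x bits per weight / 8, in GB."""
    return params_billion * bits / 8


def fits(params_billion, bits, ram_gb, headroom_gb=8):
    """True if the weights fit with headroom left for macOS, the KV cache, and apps.

    headroom_gb=8 is an arbitrary illustrative margin, not a measured value.
    """
    return weight_gb(params_billion, bits) + headroom_gb <= ram_gb


print(weight_gb(27, 4))      # 13.5
print(weight_gb(27, 8))      # 27.0
print(fits(27, 4, 24))       # True
print(fits(27, 8, 24))       # False
```

This matches the pattern in the recommendations: a 27B model at 4-bit suits a 24GB Mac, while the 8-bit version needs the next tier up.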

16GB RAM (M1/M2/M3 base, M4 base, some MacBook Air)

Qwen3-4B-Instruct 4-bit (~2.5 GB) — chat, summary
Llama 4 Scout 4B 4-bit (~2.5 GB) — general
Phi-5-mini 3.8B 4-bit (~2.3 GB) — strong reasoning

16GB is safe for 4B to 8B models. Larger models spill into SSD swap and throughput collapses.

24GB RAM (some M3/M4 Pro, M4 Air 24GB)

Qwen3.6-27B 4-bit (~14 GB) — primary for coding/general
Mistral Medium 3 22B 4-bit (~12 GB) — multilingual
Llama 4 Scout 17B 4-bit (~10 GB) — general

24GB is the current local-LLM sweet spot. Qwen3.6-27B delivers near-GPT-5-mini quality.

36-48GB RAM (higher-memory M3/M4 Pro configurations)

Qwen3.6-27B 8-bit (~28 GB) — high-quality coding
GPT-OSS 20B 4-bit (~12 GB) — reasoning
Coding model + embeddings concurrently (oMLX multi-model)

64GB+ RAM (M3/M4/M5 Max)

Qwen3.6-27B 8-bit (~28 GB) — primary
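Kimi K2.6 4-bit (1T-parameter MoE, 32B active): top coding. The ~64 GB weight figure is inconsistent with 1T parameters at 4 bits (roughly 500 GB), so check the actual download size against your RAM first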
GPT-OSS 20B + Qwen3.6-27B + embeddings concurrent

128GB (M3/M4 Ultra)

DeepSeek-V4-Flash 4-bit (~150 GB): larger than 128 GB of RAM, so it does not fit in one piece and is listed here only as splittable
Qwen3.6-72B / Llama 4 Maverick

mlx-community on Hugging Face hosts roughly 4,653 MLX-quantized models. New releases typically appear in 4-bit form within 24 hours.

Claude Code as Drop-in Backend

This is oMLX's real value proposition. Claude Code normally calls Anthropic's API, but swap the API URL to oMLX's Anthropic-compatible endpoint and inference becomes local.

Configuration

In .zshrc or project env:

export ANTHROPIC_BASE_URL=http://localhost:8000
export ANTHROPIC_API_KEY=local

Or in .claude/settings.json:

{
  "env": {
    "ANTHROPIC_BASE_URL": "http://localhost:8000",
    "ANTHROPIC_API_KEY": "local"
  }
}

oMLX's /v1/messages endpoint follows the Anthropic Messages API spec, including Tool Use (verify with a short test for newer tools).
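
A short test along those lines might look like the following. It builds a minimal Anthropic Messages API request with the standard library; `model`, `max_tokens`, and `messages` are the required body fields in Anthropic's spec. Whether oMLX checks the `x-api-key` or `anthropic-version` headers is an assumption here, and the helper name and the model ID are placeholders.

```python
import json
import urllib.request


def build_messages_request(model, prompt, base_url="http://localhost:8000", api_key="local"):
    """Build (but do not send) an Anthropic Messages API style request."""
    body = {
        "model": model,
        "max_tokens": 256,     # required field in the Messages API
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/messages",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "x-api-key": api_key,                 # placeholder; assumed not validated locally
            "anthropic-version": "2023-06-01",    # header Anthropic's hosted API expects
        },
        method="POST",
    )


req = build_messages_request("qwen3.6-27b-4bit", "Reply with the word ok.")
# With the server running:
# with urllib.request.urlopen(req, timeout=120) as resp:
#     reply = json.loads(resp.read())
#     print(reply["content"][0]["text"])   # Messages API response shape
```

If this round trip works, adding a `tools` array to the body is the natural next step for checking Tool Use.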

Effects

External API cost: $0
No data leaving your machine
Works offline
SSD KV cache → repeated calls with the same system prompt see TTFT under 5 seconds
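
Why a repeated system prompt hits the cache can be illustrated with chained block hashes, the general idea behind vLLM-style prefix sharing. This is a sketch of the technique, not oMLX's code: the block size of 4, the SHA-256 keying, and the toy token IDs are all assumptions.

```python
import hashlib


def block_keys(token_ids, block_size=4):
    """Chain-hash full blocks so a block's key depends on everything before it."""
    keys, parent = [], b""
    full = len(token_ids) - len(token_ids) % block_size   # ignore the partial tail block
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).hexdigest()
        keys.append(digest)
        parent = digest.encode()
    return keys


def shared_prefix_blocks(a, b):
    """Count leading blocks whose keys match, i.e. KV blocks that could be reused."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


system_prompt = list(range(100, 112))          # 12 tokens = 3 full blocks (toy token IDs)
request_1 = system_prompt + [1, 2, 3, 4]
request_2 = system_prompt + [9, 8, 7, 6]

reused = shared_prefix_blocks(block_keys(request_1), block_keys(request_2))
print(reused)   # 3
```

Two requests that share a system prompt produce identical keys for the shared blocks, so only the differing tail needs a fresh prefill.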

Cursor configuration

API URL: http://localhost:8000/v1
API Key: (any string)
Model: any downloaded ID (e.g., qwen3.6-27b-4bit)

Caveats

macOS 15.0 minimum

Pre-Sequoia is unsupported. Stick with Ollama or LM Studio there.

M-series above M5

Release notes don't explicitly mention M6+. General forward compatibility is expected, but check GitHub Issues.

Anthropic-compat API edges

Newer Tool Use formats (some MCP tools) may not match exactly
Test with a small task before relying on it

SSD wear

A persisted KV cache means more SSD writes.
- Use an external SSD (e.g., T7 Shield) to protect the internal NVMe
- Otherwise, check SMART data every 6 to 12 months
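
To get a feel for the write volume, KV-cache size can be estimated from model shape: keys and values for every layer, KV head, and head dimension, per token. The model shape below is hypothetical, chosen only to make the arithmetic concrete, and the estimate assumes 16-bit values and that each block is written once.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
    """Keys plus values for one token across all layers (factor 2 = K and V)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


# Hypothetical model shape, chosen only to make the arithmetic concrete.
per_token = kv_bytes_per_token(layers=48, kv_heads=8, head_dim=128)
prompt_tokens = 20_000
total_gb = per_token * prompt_tokens / 1e9

print(per_token)            # 196608
print(round(total_gb, 2))   # 3.93
```

Under these assumptions a 20,000-token prompt persists about 4 GB once, so write volume depends mostly on how often prompts change rather than on how often they repeat.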

Security — keep it local

Default bind is 127.0.0.1. To access from another device, route through Tailscale or similar. Never expose directly to the internet.

Real Workflow — M4 Max 64GB Mac

Heavy Claude Code users can spend $200-$500 per month once metered API usage is added to the $20 Pro subscription. With oMLX, about 80% of calls can run locally:

Install oMLX (Homebrew)
Pull Qwen3.6-27B 8-bit + GPT-OSS 20B 4-bit
Set Claude Code's ANTHROPIC_BASE_URL=http://localhost:8000
Routine coding (autocomplete, refactor, small functions): Qwen3.6-27B for speed
Hard reasoning (architecture, thread safety, concurrency): route to Anthropic API for Claude Opus 4.7
Result: ~80% of calls go local → typical savings of $200+/month
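
The routing rule in this workflow can be written down as a small function. The task labels and helper names below are hypothetical, and the savings estimate assumes cost scales with call count and that local calls cost nothing; since the calls kept on the hosted API are the harder ones, real savings may be lower than this.

```python
LOCAL_URL = "http://localhost:8000"          # oMLX, per this guide
REMOTE_URL = "https://api.anthropic.com"     # Anthropic's hosted API

# Hypothetical task labels; a real setup would choose its own categories.
HARD_TASKS = {"architecture", "thread_safety", "concurrency"}


def pick_backend(task_kind):
    """Send hard-reasoning tasks to the hosted API and everything else to oMLX."""
    return REMOTE_URL if task_kind in HARD_TASKS else LOCAL_URL


def monthly_savings(metered_bill, local_share):
    """Savings if cost scales with call count and local calls cost nothing."""
    return metered_bill * local_share


print(pick_backend("refactor"))
print(pick_backend("concurrency"))
print(f"${monthly_savings(250, 0.8):.0f} saved on a $250 metered bill")
```

With a hypothetical $250 metered bill and 80% of calls local, the estimate lands at the $200 mark quoted above.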

In-house RAG with no data egress

oMLX + bge-m3 embeddings + Qwen3.6-27B 8-bit
All models served concurrently (oMLX strength)
In-house docs → embeddings → vector DB (Chroma/Qdrant)
Query → top-k retrieval → Qwen3.6-27B answer
Zero external API calls, response time roughly 2-5 s
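
The retrieval step of this pipeline can be sketched without any server at all. Below, toy 3-dimensional vectors stand in for bge-m3 embeddings and a plain sort stands in for the vector DB; the function names and sample documents are hypothetical. In a real setup the vectors would come from the embedding model and the final prompt would go to the chat endpoint.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k documents most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, passages):
    """Assemble retrieved passages and the question into a single prompt."""
    context = "\n\n".join(passages)
    return f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"


# Toy 3-dimensional vectors stand in for real embedding output.
docs = ["VPN setup guide", "Expense policy", "VPN troubleshooting"]
doc_vecs = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.8, 0.3, 0.1]]
query_vec = [1.0, 0.1, 0.0]

hits = top_k(query_vec, doc_vecs, k=2)
prompt = build_prompt("How do I connect to the VPN?", [docs[i] for i in hits])
print(hits)   # [0, 2]
```

Because every stage runs on the same machine, the only latency components are embedding, search, and generation.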

Bottom Line

The one-sentence summary: "Apple Silicon-optimized Claude Code backend."

Strength	Reason
Speed	MLX-based → about 1.5× to 2.3× over Ollama (Apple Silicon)
Compatibility	OpenAI + Anthropic APIs
Cost	Apache 2.0, models free (mlx-community)
Integration	Drop-in for Claude Code/Cursor
Cache	Two-tier RAM + SSD → agent TTFT under 5s
Multimodal	LLM + VLM + OCR + embeddings concurrently

If you have an M3+ Mac and Claude Code costs are wearing you down, install it this weekend. Easy to remove if it doesn't fit. Worth a smaller bill if it does.

First-Party Sources

oMLX: github.com/jundot/omlx, omlx.ai
Apple MLX research: machinelearning.apple.com/research/exploring-llms-mlx-m5
MLX framework: github.com/ml-explore/mlx
Ollama MLX: ollama.com/blog/mlx
mlx-community: huggingface.co/mlx-community
Qwen3.6-27B: huggingface.co/Qwen/Qwen3.6-27B
