Local AI Infrastructure Notes (7/15) — Best Local LLM Setup on macOS: The Complete oMLX Guide

GPT-5.5 at $5 per million input tokens. Claude Opus 4.7 with similar premium pricing. For heavy users on M3/M4 Macs, the monthly bill stings. There is another path — local inference. But the tooling landscape (Ollama, LM Studio, llama.cpp) is fragmented, and recommendations swing wildly.

oMLX, which shipped v0.3.8rc1 on April 28, 2026, settles the question. It is a macOS-only inference server built on Apple's MLX framework, with vLLM-style continuous batching and an SSD-backed KV cache. It drops agent time-to-first-token from 30~90 seconds to under 5 seconds, and Claude Code, Cursor, and OpenClaw can use it as a drop-in backend.

This guide explains why oMLX is the right choice for Apple Silicon Macs as of April 2026, sourced entirely from primary documentation (GitHub, Apple ML Research, Ollama blog). It covers installation, model recommendations by Mac RAM tier, and Claude Code integration.

In One Paragraph

oMLX is a local LLM server you launch from the macOS menu bar. It exposes both OpenAI- and Anthropic-compatible APIs, making it a drop-in target for Claude Code, Cursor, and OpenClaw. Its differentiator is a two-tier KV cache (RAM hot + SSD cold): TTFT for long-system-prompt agents drops from 30~90 seconds to under 5 seconds. Apache 2.0 licensed; requires macOS 15+ on Apple Silicon (M1~M5).


What oMLX Actually Is

Maintainer: jundot (GitHub jundot/omlx)
Latest release: v0.3.8rc1 (2026-04-28)
License: Apache 2.0
System requirements:

  • macOS 15.0 (Sequoia) or higher
  • Apple Silicon (M1/M2/M3/M4/M5)
  • Python 3.10+

Four core capabilities

  1. Tiered KV cache
     • RAM hot cache → evicts to SSD cold cache when full
     • Block-level prefix sharing + Copy-on-Write (vLLM-inspired)
     • Every KV block persisted as safetensors → instant restore on restart

  2. Continuous batching
     • Built on mlx-lm's BatchGenerator
     • Multiple concurrent requests → higher GPU utilization

  3. Multi-model concurrent serving
     • LLM + VLM (Qwen3.5, GLM-4V, Pixtral) + OCR (DeepSeek-OCR, GLM-OCR) + embeddings (BGE-M3) + rerankers
     • LRU eviction inside RAM budget

  4. API compatibility
     • OpenAI: http://localhost:8000/v1/chat/completions
     • Anthropic: http://localhost:8000/v1/messages
     • Drop-in for Claude Code, Cursor, OpenClaw
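
To make the hot/cold idea in the first capability concrete, here is a minimal sketch of a two-tier cache in Python. It is not oMLX's implementation: the class name, the raw-bytes file format, and the capacity counted in blocks are all assumptions made for illustration (the project itself is described as persisting blocks as safetensors).

```python
from collections import OrderedDict
from pathlib import Path
from typing import Optional


class TieredCache:
    """Toy hot (RAM) / cold (disk) cache; hypothetical, for illustration only."""

    def __init__(self, hot_capacity: int, cold_dir: Path):
        self.hot_capacity = hot_capacity      # max number of blocks kept in RAM
        self.cold_dir = cold_dir              # directory standing in for the SSD tier
        self.hot = OrderedDict()              # insertion order doubles as LRU order

    def _cold_path(self, key: str) -> Path:
        return self.cold_dir / f"{key}.bin"

    def put(self, key: str, block: bytes) -> None:
        self.hot[key] = block
        self.hot.move_to_end(key)             # mark as most recently used
        while len(self.hot) > self.hot_capacity:
            old_key, old_block = self.hot.popitem(last=False)  # least recently used
            self._cold_path(old_key).write_bytes(old_block)    # spill to the cold tier

    def get(self, key: str) -> Optional[bytes]:
        if key in self.hot:                   # hot hit
            self.hot.move_to_end(key)
            return self.hot[key]
        path = self._cold_path(key)
        if path.exists():                     # cold hit: promote back into RAM
            block = path.read_bytes()
            self.put(key, block)
            return block
        return None                           # miss: the caller must recompute
```

Because spilled blocks stay on disk, a freshly constructed cache pointed at the same directory can still serve them, which is the property behind "instant restore on restart".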

The menu-bar app handles model download, start, and stop visually. The /admin web dashboard provides real-time monitoring and one-click benchmarking.


Why MLX Wins on Apple Silicon

oMLX runs on Apple's MLX framework, which exploits Apple Silicon's Unified Memory Architecture (UMA) directly. CPU and GPU share memory, so there is no copy step. This is where Apple Silicon diverges from PC GPU stacks.

Same model, same Mac — measured differences

  • MLX vs llama.cpp: 15~30% faster throughput, ~10% less memory
  • M4 Pro 64GB, LM Studio MLX vs Ollama (llama.cpp): 46% faster
  • Qwen3.5-35B-A3B: MLX 71.2 tok/s vs Ollama 30.3 tok/s (2.3×)

The gap widens with model size. Memory bandwidth is the binding constraint for LLM inference, and MLX taps UMA without overhead.
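
A rough way to see why bandwidth is the constraint: during decode, each generated token has to stream the active weights from memory once, so bandwidth divided by weight size gives a ceiling on tokens per second. The sketch below applies that back-of-envelope rule. The function name is hypothetical, the 14 GB figure stands for a 27B model at 4-bit, and real throughput lands below the ceiling.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, active_weight_gb: float) -> float:
    """Upper bound on decode speed if every token reads all active weights once."""
    return bandwidth_gb_s / active_weight_gb


# 120 GB/s and 153 GB/s are the M4 and M5 bandwidth figures quoted in this guide.
m4 = decode_ceiling_tok_s(120, 14)
m5 = decode_ceiling_tok_s(153, 14)
print(f"M4 ceiling: {m4:.1f} tok/s, M5 ceiling: {m5:.1f} tok/s")
# prints "M4 ceiling: 8.6 tok/s, M5 ceiling: 10.9 tok/s"
```

For a mixture-of-experts model such as the 35B-A3B above, only the active parameters are read per token, which is one plausible reason its measured rate is far higher than a dense model of the same total size would allow.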

M5 Neural Accelerators (announced 2025-11)

Apple ML Research's M5 measurements:

Item                   | M4       | M5          | Improvement
Memory bandwidth       | 120 GB/s | 153 GB/s    | +28%
14B dense prefill TTFT | -        | <10s        | -
30B MoE prefill TTFT   | -        | <3s         | -
Total TTFT speedup     | baseline | 3.33×~4.06× | -

The M5 GPU includes dedicated matrix-multiplication units (Neural Accelerators), engaged via Metal 4's TensorOps + Metal Performance Primitives. MLX uses them automatically. On M5, MLX gets faster on top of its existing lead.

Ollama 0.19's MLX backend (2026-03-30)

Ollama 0.19 added an experimental MLX backend. On 32GB+ unified-memory Macs, it bypasses llama.cpp:

  • Prefill 1,154 → 1,810 tok/s
  • Decode 58 → 112 tok/s
  • NVFP4 quantization

So as long as you stay on MLX, the underlying speed is comparable across oMLX, LM Studio MLX, and Ollama 0.19+. The differentiation comes from API compatibility and caching strategy.
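
Why caching strategy matters more than raw speed for agents can be estimated from the prefill rates above: time-to-first-token is roughly the number of uncached prompt tokens divided by the prefill rate. The sketch below is an approximation under stated assumptions. The 60,000-token prompt and the 58,000 cached tokens are hypothetical, and the model ignores cache lookup and load time.

```python
def prefill_seconds(prompt_tokens: int, prefill_tok_s: float, cached_tokens: int = 0) -> float:
    """Rough TTFT: only tokens not covered by a cached prefix need prefill."""
    uncached = max(prompt_tokens - cached_tokens, 0)
    return uncached / prefill_tok_s


PROMPT = 60_000  # hypothetical agent context: system prompt + tool schemas + files

# Ollama prefill rates quoted above, before and after the MLX backend.
for rate in (1154, 1810):
    print(rate, round(prefill_seconds(PROMPT, rate), 1))   # 52.0 s, then 33.1 s

# Same prompt, but with a cached prefix covering all but the last 2,000 tokens.
print(round(prefill_seconds(PROMPT, 1810, cached_tokens=58_000), 1))   # 1.1 s
```

A faster backend moves the first two numbers by tens of percent; a prefix cache that survives between calls removes most of the work.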


oMLX vs Ollama vs LM Studio — One-Line Comparison

Item        | Ollama                    | LM Studio        | oMLX
Form        | CLI + daemon              | GUI app          | Menu bar app + server
Backend     | llama.cpp (+optional MLX) | MLX or llama.cpp | MLX only
API         | OpenAI                    | OpenAI           | OpenAI + Anthropic
KV cache    | RAM                       | RAM              | RAM + SSD
Multi-model | One active                | One active       | Concurrent (LRU)
VLM/OCR     | Some                      | Some             | First-class
License     | MIT                       | Closed           | Apache 2.0

When to use which

  • Ollama: Docker/k8s, multi-node ops, x86 portability
  • LM Studio: GUI exploration, parameter tuning (non-developers)
  • oMLX: Agent workflows, Claude Code/Cursor drop-in, repeated calls with long system prompts

Installation — Three Paths

A. macOS app (recommended for general users)

  1. Download the latest .dmg from https://github.com/jundot/omlx/releases
  2. Mount the .dmg → drag to Applications
  3. First run → security warning → allow in System Settings
  4. Click menu bar icon → download a model → Start
  5. In-app auto-update keeps it current

B. Homebrew (developers)

brew tap jundot/omlx
brew install omlx
omlx serve

C. From source (customization)

git clone https://github.com/jundot/omlx.git
cd omlx
pip install -e .
omlx serve

Verify it's running

curl http://localhost:8000/v1/models

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3-4b-instruct","messages":[{"role":"user","content":"hello"}]}'
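
The same check can be scripted. This is a minimal sketch using only the Python standard library; the helper names are hypothetical, and the response parsing assumes the server returns the standard OpenAI chat-completion shape (choices → message → content).

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(base_url: str, model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the assistant text (assumes OpenAI response shape)."""
    req = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

With the server running, `ask("http://localhost:8000", "qwen3-4b-instruct", "hello")` mirrors the curl call above.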

Web admin: http://localhost:8000/admin


Model Recommendations by Mac RAM

Which models run well under oMLX depends on how much unified memory the Mac has. All quantizations below are 4-bit unless noted (mlx-community on Hugging Face).

16GB RAM (M1/M2/M3 base, M4 base, some MacBook Air)

  • Qwen3-4B-Instruct 4-bit (~2.5 GB) — chat, summary
  • Llama 4 Scout 4B 4-bit (~2.5 GB) — general
  • Phi-5-mini 3.8B 4-bit (~2.3 GB) — strong reasoning

16GB is safe for 4B~8B models. Anything larger spills into SSD swap and throughput collapses.

24GB RAM (some M3/M4 Pro, M4 Air 24GB)

  • Qwen3.6-27B 4-bit (~14 GB) — primary for coding/general
  • Mistral Medium 3 22B 4-bit (~12 GB) — multilingual
  • Llama 4 Scout 17B 4-bit (~10 GB) — general

24GB is the current local-LLM sweet spot. Qwen3.6-27B delivers near-GPT-5-mini quality.

36~48GB RAM (M3/M4 Pro upper)

  • Qwen3.6-27B 8-bit (~28 GB) — high-quality coding
  • GPT-OSS 20B 4-bit (~12 GB) — reasoning
  • Coding model + embeddings concurrently (oMLX multi-model)

64GB+ RAM (M3/M4/M5 Max)

  • Qwen3.6-27B 8-bit (~28 GB) — primary
  • Kimi K2.6 32B-active 4-bit (~64 GB total weight, 1T MoE) — top coding
  • GPT-OSS 20B + Qwen3.6-27B + embeddings concurrent

128GB (M3/M4 Ultra)

  • DeepSeek-V4-Flash 4-bit (~150 GB) splittable
  • Qwen3.6-72B / Llama 4 Maverick

mlx-community on Hugging Face hosts roughly 4,653 MLX-quantized models. New releases typically appear in 4-bit form within 24 hours.
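
The dense-model sizes in the tiers above follow from simple arithmetic: parameters times bits per weight, divided by eight. The sketch below encodes that rule plus a headroom check. Both function names and the 30% reserve for macOS, the KV cache, and other apps are assumptions for illustration, not oMLX settings; real quantized files run somewhat larger because they also store scales and metadata.

```python
def weight_gb(params_billion: float, bits: int) -> float:
    """Approximate weight size in GB for a dense model at a given bit width."""
    return params_billion * bits / 8


def fits(params_billion: float, bits: int, ram_gb: float, reserve_fraction: float = 0.3) -> bool:
    """True if the weights fit after reserving a (hypothetical) share of RAM for everything else."""
    return weight_gb(params_billion, bits) <= ram_gb * (1 - reserve_fraction)


print(weight_gb(27, 4))   # 13.5, in line with the ~14 GB listed for a 27B model at 4-bit
print(weight_gb(27, 8))   # 27.0, in line with the ~28 GB listed at 8-bit
print(fits(27, 4, 24))    # True
print(fits(27, 4, 16))    # False
```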


Claude Code as Drop-in Backend

This is oMLX's real value proposition. Claude Code normally calls Anthropic's API, but swap the API URL to oMLX's Anthropic-compatible endpoint and inference becomes local.

Configuration

In .zshrc or project env:

export ANTHROPIC_BASE_URL=http://localhost:8000
export ANTHROPIC_API_KEY=local

Or in .claude/settings.json, under the env key:

{
  "env": {
    "ANTHROPIC_BASE_URL": "http://localhost:8000",
    "ANTHROPIC_API_KEY": "local"
  }
}

oMLX's /v1/messages endpoint follows the Anthropic Messages API spec, including Tool Use (verify with a short test for newer tools).
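
A short test against the endpoint might look like the sketch below. The body fields (model, max_tokens, messages) and the x-api-key and anthropic-version headers come from Anthropic's Messages API; whether oMLX requires or ignores those headers is an assumption, and the helper names are hypothetical.

```python
import json
import urllib.request


def build_messages_request(base_url: str, model: str, prompt: str,
                           api_key: str = "local", max_tokens: int = 256) -> urllib.request.Request:
    """Build (but do not send) an Anthropic Messages-style request."""
    body = {
        "model": model,
        "max_tokens": max_tokens,   # required by the Messages API
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/messages",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "content-type": "application/json",
            "x-api-key": api_key,               # placeholder value for a local server
            "anthropic-version": "2023-06-01",
        },
        method="POST",
    )


def reply_text(reply: dict) -> str:
    """Join the text blocks of a Messages API response."""
    return "".join(b["text"] for b in reply.get("content", []) if b.get("type") == "text")
```

Sending the request with `urllib.request.urlopen` and passing the decoded JSON to `reply_text` is enough to confirm the endpoint answers before pointing Claude Code at it.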

Effects

  • External API cost: $0
  • No data leaving your machine
  • Works offline
  • SSD KV cache → repeated calls with the same system prompt see TTFT under 5 seconds
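
The last point depends on the server recognizing that two requests start with the same tokens. One common way to do that, used by vLLM-style block caches, is to split the prompt into fixed-size blocks and hash each block together with the hash of its predecessor. The sketch below illustrates that idea only; the block size, hash choice, and function names are assumptions, not oMLX internals.

```python
import hashlib

BLOCK = 4  # tokens per block; deliberately tiny for the demo (hypothetical value)


def block_keys(tokens: list, block_size: int = BLOCK) -> list:
    """Chain-hash full blocks so each key identifies a block and everything before it."""
    keys, parent = [], ""
    for start in range(0, len(tokens) - block_size + 1, block_size):
        chunk = tokens[start:start + block_size]
        text = parent + "|" + ",".join(map(str, chunk))
        parent = hashlib.sha256(text.encode("utf-8")).hexdigest()
        keys.append(parent)
    return keys


def shared_prefix_blocks(a: list, b: list) -> int:
    """Count leading blocks whose cached KV could be reused between two prompts."""
    count = 0
    for x, y in zip(block_keys(a), block_keys(b)):
        if x != y:
            break
        count += 1
    return count


system_prompt = list(range(8))                 # stand-in for a long, identical system prompt
first = system_prompt + [100, 101, 102, 103]   # first user turn
second = system_prompt + [200, 201, 202, 203]  # a different user turn
print(shared_prefix_blocks(first, second))     # 2
```

Only the blocks after the shared prefix need prefill, which is why a long, stable system prompt is the best case for this kind of cache.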

Cursor configuration

API URL: http://localhost:8000/v1
API Key: (any string)
Model: any downloaded ID (e.g., qwen3.6-27b-4bit)

Caveats

macOS 15.0 minimum

Pre-Sequoia is unsupported. Stick with Ollama or LM Studio there.

M-series above M5

Release notes don't explicitly mention M6+. General forward compatibility is expected, but check GitHub Issues.

Anthropic-compat API edges

  • Newer Tool Use formats (some MCP tools) may not match exactly
  • Test with a small task before relying on it

SSD wear

Persisted KV cache → more SSD writes.

  • Use an external SSD (e.g., T7 Shield) to protect internal NVMe
  • Otherwise monitor SMART every 6~12 months

Security — keep it local

Default bind is 127.0.0.1. To access from another device, route through Tailscale or similar. Never expose directly to the internet.
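
A client-side guard can catch a misconfigured base URL before any prompt is sent. This is a small sketch with a hypothetical helper name; it checks only the address in the URL and says nothing about how the server itself is bound.

```python
import ipaddress
from urllib.parse import urlparse


def is_loopback_url(url: str) -> bool:
    """True if the URL's host is localhost or a loopback IP address."""
    host = urlparse(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False   # some other DNS name: not provably local


print(is_loopback_url("http://127.0.0.1:8000"))      # True
print(is_loopback_url("http://192.168.1.10:8000"))   # False
```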


Real Workflow — M4 Max 64GB Mac

Heavy Claude Code users can spend $200~$500/month once metered API usage is added to the $20 Pro subscription. With oMLX, roughly 80% of those calls can stay local:

  1. Install oMLX (Homebrew)
  2. Pull Qwen3.6-27B 8-bit + GPT-OSS 20B 4-bit
  3. Set Claude Code's ANTHROPIC_BASE_URL=http://localhost:8000
  4. Routine coding (autocomplete, refactor, small functions): Qwen3.6-27B for speed
  5. Hard reasoning (architecture, thread safety, concurrency): route to Anthropic API for Claude Opus 4.7
  6. Result: ~80% of calls go local → typical savings of $200+/month
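
The routing in steps 4 to 6 can be sketched as a small dispatch rule plus the savings arithmetic. The task labels, function names, and the $250 example bill are hypothetical, and the savings estimate assumes cost scales linearly with the share of calls, which real token usage will only approximate.

```python
LOCAL_URL = "http://localhost:8000"        # oMLX, as configured in step 3
REMOTE_URL = "https://api.anthropic.com"   # Anthropic's hosted API

# Hypothetical labels for the "hard reasoning" bucket in step 5.
HARD_TASKS = {"architecture", "thread-safety", "concurrency"}


def pick_backend(task_kind: str) -> str:
    """Send hard reasoning to the hosted API and everything else to the local server."""
    return REMOTE_URL if task_kind in HARD_TASKS else LOCAL_URL


def monthly_savings(metered_bill: float, local_share: float) -> float:
    """Estimated savings if cost is proportional to the share of calls moved local."""
    return metered_bill * local_share


print(pick_backend("refactor"))               # http://localhost:8000
print(pick_backend("concurrency"))            # https://api.anthropic.com
print(round(monthly_savings(250, 0.8), 2))    # 200.0
```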

In-house RAG with no data egress

  1. oMLX + bge-m3 embeddings + Qwen3.6-27B 8-bit
  2. All models served concurrently (oMLX strength)
  3. In-house docs → embeddings → vector DB (Chroma/Qdrant)
  4. Query → top-k retrieval → Qwen3.6-27B answer
  5. Zero external API calls, response time ~2~5s
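
Step 4's top-k retrieval reduces to ranking stored vectors by cosine similarity with the query vector. The sketch below shows that step in pure Python with toy three-dimensional vectors. In the pipeline above the vectors would come from the bge-m3 embedding model and the search would be delegated to Chroma or Qdrant, so the function names and data here are purely illustrative.

```python
import math


def cosine(a: list, b: list) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: list, docs: dict, k: int = 2) -> list:
    """Return the names of the k documents most similar to the query."""
    ranked = sorted(docs, key=lambda name: cosine(query_vec, docs[name]), reverse=True)
    return ranked[:k]


docs = {
    "vpn-setup": [1.0, 0.0, 0.0],
    "payroll": [0.0, 1.0, 0.0],
    "vpn-faq": [0.9, 0.1, 0.0],
}
print(top_k([1.0, 0.0, 0.0], docs))   # ['vpn-setup', 'vpn-faq']
```

The retrieved documents are then placed into the prompt sent to the local chat model, so no text leaves the machine at any step.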

Bottom Line

The one-sentence summary: oMLX is an "Apple Silicon-optimized Claude Code backend."

Strength      | Reason
Speed         | MLX-based → 1.5~2.3× over Ollama (Apple Silicon)
Compatibility | OpenAI + Anthropic APIs
Cost          | Apache 2.0, models free (mlx-community)
Integration   | Drop-in for Claude Code/Cursor
Cache         | Two-tier RAM + SSD → agent TTFT under 5s
Multimodal    | LLM + VLM + OCR + embeddings concurrently

If you have an M3+ Mac and Claude Code costs are wearing you down, install it this weekend. Easy to remove if it doesn't fit. Worth a smaller bill if it does.


First-Party Sources

  • oMLX: github.com/jundot/omlx, omlx.ai
  • Apple MLX research: machinelearning.apple.com/research/exploring-llms-mlx-m5
  • MLX framework: github.com/ml-explore/mlx
  • Ollama MLX: ollama.com/blog/mlx
  • mlx-community: huggingface.co/mlx-community
  • Qwen3.6-27B: huggingface.co/Qwen/Qwen3.6-27B
