"Local LLMs 2026 — Ollama vs LM Studio, and What Actually Runs on Your Machine"
Install to model selection, with picks for 8 GB / 16 GB / 24 GB / 64 GB systems on Mac and Windows
ํต์ฌ ์์ฝ
- Audience: Anyone who wants to run LLMs on their own computer for cost, privacy, or learning — both general users and developers.
- What you'll get: 1) Ollama vs LM Studio — which to pick, 2) what runs on 8/16/24/64 GB systems, 3) Mac and Windows install steps, 4) realistic use cases and limits, 5) honest comparison vs cloud.
- One-liner: First-time GUI → LM Studio. Automation/server/dev → Ollama. Both free. Apple M-series surprises at 8 GB and shines at 64 GB.
1. Why local
| Factor | Cloud | Local |
|---|---|---|
| Cost | Per-token | Free after install (electricity, hardware) |
| Privacy | Sent to vendor | Never leaves your machine |
| Offline | Needs network | Works |
| Quality | GPT-5.5 / Claude 4.7 / Gemini 3 | Llama 3 / Qwen 3 / Mistral (open-source SOTA) |
| Speed | Depends on model + network | Depends on your hardware (~30–40 tok/s on a typical laptop) |
| Multimodal | Rich | Text + some vision |
| Context | 1M+ | 8K–128K |
Local fits when: - Sensitive data (medical, legal, contracts). - Heavy automation that's making cloud costs sting. - Offline environments. - Learning / experimentation (model comparison, fine-tuning).
Cloud still wins when: - You need top-end quality on hard tasks. - Multimodal (image/audio in/out). - Casual usage.
2. Ollama vs LM Studio
2.1 Core differences
| Ollama | LM Studio | |
|---|---|---|
| Interface | CLI (terminal) | GUI desktop app |
| Audience | Devs / automation | Non-devs / experimenters |
| Model library | Curated (ollama.com/library) | HuggingFace integration |
| Apple Silicon | Metal (good) | MLX (1.6× faster on M3 Ultra) |
| GPU auto-detect | NVIDIA / AMD / Apple | Manual config |
| Docker | Official image | ❌ |
| API server | Built in (localhost:11434) |
Toggle in app |
| n8n / LangChain | First-class | Works via OpenAI-compatible API |
| Free | ✅ | ✅ |
2.2 Which to pick
- GUI exploration / model browsing → LM Studio
- CLI / automation / server → Ollama
- Both is fine — they don't conflict. Many start with LM Studio, then add Ollama for dev work.
2.3 Speed (M3 Ultra, Gemma 3 1B)
- LM Studio (MLX): 237 tok/s
- Ollama (Metal): 149 tok/s
Ollama is adopting MLX, but as of April 2026 LM Studio is faster on Apple Silicon.
3. What runs on your machine
3.1 Model picks by RAM/VRAM (Q4_K_M)
| System | Recommended model | Memory used | Use |
|---|---|---|---|
| 8 GB VRAM (or 8 GB Apple unified) | Qwen 3.5 9B (default rec) | 6.6 GB | General chat, summary, translate |
| 8 GB | Llama 3.1 8B | 5–6 GB | English-first chat |
| 8 GB | Mistral 7B | 4.5 GB | Lighter tasks |
| 16 GB | Qwen 3 14B Q5_K_M | 10.2 GB | High-quality chat |
| 16 GB | Mistral Small 24B Q4 | 13.4 GB | Stronger reasoning |
| 24 GB | Mixtral 8x7B / 35B Q4 | 18–22 GB | Production-tier work |
| 48+ GB | Llama 3.3 70B Q4 / Qwen 2.5 72B Q4 | 40–44 GB | Near-cloud quality |
| Apple M-Mac 64 GB+ | Llama 3.3 70B / Qwen 72B | uses unified memory | M5 Max 64 GB ≈ H100-class |
Sources: LocalLLM.in VRAM guide, GitHub ollama/ollama.
3.2 Quantization in one line
Q4_K_M = 4-bit compression. ~75% smaller than FP16, with ~5–10% quality loss. The standard default. - Q8 / FP16 = full quality (4× memory) - Q5_K_M = slightly better, +20% memory - Q3 = even smaller, but visibly worse
Start with Q4_K_M everywhere; only step up if quality bothers you.
3.3 The Apple Silicon trick
Unified memory lets the GPU use system RAM. A 64 GB MacBook Pro is effectively a 64 GB VRAM machine. 70B models that won't fit on an RTX 4090 (24 GB) run on M3/M4/M5 Max 64 GB.
4. Install — Mac & Windows
4.1 Ollama (CLI)
Mac
brew install ollama
ollama run llama3.1:8b
Windows (10/11, native ARM64 in 2026) 1. Get the installer from ollama.com/download 2. After install, in PowerShell:
ollama run qwen3:9b
Common commands
ollama list # installed models
ollama pull mistral # download without running
ollama rm llama3.1 # remove
ollama serve # API server (default port 11434)
4.2 LM Studio (GUI)
Mac & Windows 1. Download from lmstudio.ai 2. Open → search icon → search a model (e.g., "qwen 3.5 9b") 3. Download → use the "Chat" tab right away 4. (Optional) "Local Server" tab → Start → enables an OpenAI-compatible API
LM Studio shows "Will fit in your RAM/VRAM" on every model — friendlier on hardware questions.
5. Five practical use cases
5.1 Sensitive document analysis
Summarize legal contracts, medical records, internal memos without sending them anywhere. Even Qwen 3.5 9B on an 8 GB Mac is enough.
5.2 Coding assist
Llama 3.1 8B or Qwen 2.5 Coder 7B paired with the Continue VS Code extension — a viable Copilot alternative.
5.3 RAG over your own files
Ollama API + LangChain/LlamaIndex over your PDFs/notes — your own NotebookLM that never phones home.
5.4 Automation (n8n + Ollama)
Swap the Claude Haiku call from Part 7's email-triage workflow with a local Llama → token cost goes to zero.
5.5 Learning / experimentation
Compare model behavior, choose a fine-tuning base, measure quantization impact.
6. Honest limits
- Quality gap: a 70B local model is close, not equal, to GPT-5.5 / Claude Opus 4.7. Most visible on reasoning, code, and non-English tasks.
- Multimodal is limited: image/audio in/out is rougher than cloud.
- Context windows: typically 8K–32K (some 128K). Cloud is at 1M+ now.
- Heat / fan / power: 70B on a laptop = fans on full. Desktops cost a bit more in electricity.
- First download is slow: 4–80 GB per model, 30 minutes to 2 hours over Wi-Fi.
- Weaker safety alignment: open-source models have lighter safety filters — you carry the responsibility.
7. First-week starter plan
| Day | Task |
|---|---|
| 1 | Install LM Studio → download Qwen 3.5 9B → spend an hour chatting; compare with cloud feel. |
| 2 | Install Ollama → pull the same model → CLI run → start the API server. |
| 3 | Connect your Python or n8n code via the OpenAI-compatible API. Run it without any cloud key. |
| 4–5 | Try RAG on one PDF (LM Studio built-in or LangChain). |
| 6–7 | Evaluate. Tally savings vs lost capabilities → keep / mix / drop. |
8. Cloud vs local — the honest call
| Cloud (Claude / GPT / Gemini) | Local (Llama / Qwen / Mistral) | |
|---|---|---|
| Quality | Top-tier | Strong (15–30% gap) |
| Cost | Usage-based | $0 ongoing, hardware + power |
| Speed | Network-bound | Hardware-bound |
| Privacy | Policy-dependent | Total |
| Multimodal | Rich | Limited |
| Context | 1M+ | 8K–128K |
| Scaling | Unlimited | Hardware ceiling |
| Liability | Vendor-shared | All yours |
Practical recommendation: Cloud as your main stack, local for sensitive work and automation. The "all of one" stance loses on both fronts; mixing is the rational play.
Developer notes
- Ollama API is OpenAI-compatible: set
OPENAI_API_BASE=http://localhost:11434/v1and reuse the OpenAI SDK as-is. Switch between local and cloud with one env var. - Mind the licenses: Llama 3 (Meta), Qwen (Alibaba), Mistral — each license differs. Some restrict commercial use over MAU thresholds.
- n8n integration: use the Ollama Chat node or the OpenAI Chat node with a custom Base URL.
- Fine-tuning: 8B models LoRA-tune on 16 GB VRAM. Tools like unsloth and axolotl make this practical.
- Benchmarking: lm-eval-harness, MTEB — automate model comparisons on your own dataset.
- Production serving: vLLM (concurrent requests), TGI (HuggingFace), llama.cpp server beat Ollama on throughput. Ollama is best for personal use and MVPs.
- Edge deployment: Raspberry Pi 5 + 8 GB handles 3B models. Mobile gets Llama 3.2 1B/3B at quantization.
References
- Ollama official
- Ollama model library
- LM Studio official
- LocalLLM.in — Ollama VRAM requirements
- GitHub — ollama/ollama
- Continue (VS Code extension)
This is part 8 of 11 in the AI Basics series. Series complete. Browse the full series in the blog TOC page.
๋๊ธ
๋๊ธ ์ฐ๊ธฐ