"Local LLMs 2026 — Ollama vs LM Studio, and What Actually Runs on Your Machine"

Install to model selection, with picks for 8 GB / 16 GB / 24 GB / 64 GB systems on Mac and Windows


ํ•ต์‹ฌ ์š”์•ฝ

  • Audience: Anyone who wants to run LLMs on their own computer for cost, privacy, or learning — both general users and developers.
  • What you'll get: 1) Ollama vs LM Studio — which to pick, 2) what runs on 8/16/24/64 GB systems, 3) Mac and Windows install steps, 4) realistic use cases and limits, 5) honest comparison vs cloud.
  • One-liner: First-time GUI → LM Studio. Automation/server/dev → Ollama. Both free. Apple M-series surprises at 8 GB and shines at 64 GB.

1. Why local

Factor Cloud Local
Cost Per-token Free after install (electricity, hardware)
Privacy Sent to vendor Never leaves your machine
Offline Needs network Works
Quality GPT-5.5 / Claude 4.7 / Gemini 3 Llama 3 / Qwen 3 / Mistral (open-source SOTA)
Speed Depends on model + network Depends on your hardware (~30–40 tok/s on a typical laptop)
Multimodal Rich Text + some vision
Context 1M+ 8K–128K

Local fits when: - Sensitive data (medical, legal, contracts). - Heavy automation that's making cloud costs sting. - Offline environments. - Learning / experimentation (model comparison, fine-tuning).

Cloud still wins when: - You need top-end quality on hard tasks. - Multimodal (image/audio in/out). - Casual usage.


2. Ollama vs LM Studio

2.1 Core differences

Ollama LM Studio
Interface CLI (terminal) GUI desktop app
Audience Devs / automation Non-devs / experimenters
Model library Curated (ollama.com/library) HuggingFace integration
Apple Silicon Metal (good) MLX (1.6× faster on M3 Ultra)
GPU auto-detect NVIDIA / AMD / Apple Manual config
Docker Official image
API server Built in (localhost:11434) Toggle in app
n8n / LangChain First-class Works via OpenAI-compatible API
Free

2.2 Which to pick

  • GUI exploration / model browsing → LM Studio
  • CLI / automation / server → Ollama
  • Both is fine — they don't conflict. Many start with LM Studio, then add Ollama for dev work.

2.3 Speed (M3 Ultra, Gemma 3 1B)

  • LM Studio (MLX): 237 tok/s
  • Ollama (Metal): 149 tok/s

Ollama is adopting MLX, but as of April 2026 LM Studio is faster on Apple Silicon.


3. What runs on your machine

3.1 Model picks by RAM/VRAM (Q4_K_M)

System Recommended model Memory used Use
8 GB VRAM (or 8 GB Apple unified) Qwen 3.5 9B (default rec) 6.6 GB General chat, summary, translate
8 GB Llama 3.1 8B 5–6 GB English-first chat
8 GB Mistral 7B 4.5 GB Lighter tasks
16 GB Qwen 3 14B Q5_K_M 10.2 GB High-quality chat
16 GB Mistral Small 24B Q4 13.4 GB Stronger reasoning
24 GB Mixtral 8x7B / 35B Q4 18–22 GB Production-tier work
48+ GB Llama 3.3 70B Q4 / Qwen 2.5 72B Q4 40–44 GB Near-cloud quality
Apple M-Mac 64 GB+ Llama 3.3 70B / Qwen 72B uses unified memory M5 Max 64 GB ≈ H100-class

Sources: LocalLLM.in VRAM guide, GitHub ollama/ollama.

3.2 Quantization in one line

Q4_K_M = 4-bit compression. ~75% smaller than FP16, with ~5–10% quality loss. The standard default. - Q8 / FP16 = full quality (4× memory) - Q5_K_M = slightly better, +20% memory - Q3 = even smaller, but visibly worse

Start with Q4_K_M everywhere; only step up if quality bothers you.

3.3 The Apple Silicon trick

Unified memory lets the GPU use system RAM. A 64 GB MacBook Pro is effectively a 64 GB VRAM machine. 70B models that won't fit on an RTX 4090 (24 GB) run on M3/M4/M5 Max 64 GB.


4. Install — Mac & Windows

4.1 Ollama (CLI)

Mac

brew install ollama

ollama run llama3.1:8b

Windows (10/11, native ARM64 in 2026) 1. Get the installer from ollama.com/download 2. After install, in PowerShell:

ollama run qwen3:9b

Common commands

ollama list           # installed models
ollama pull mistral   # download without running
ollama rm llama3.1    # remove
ollama serve          # API server (default port 11434)

4.2 LM Studio (GUI)

Mac & Windows 1. Download from lmstudio.ai 2. Open → search icon → search a model (e.g., "qwen 3.5 9b") 3. Download → use the "Chat" tab right away 4. (Optional) "Local Server" tab → Start → enables an OpenAI-compatible API

LM Studio shows "Will fit in your RAM/VRAM" on every model — friendlier on hardware questions.


5. Five practical use cases

5.1 Sensitive document analysis

Summarize legal contracts, medical records, internal memos without sending them anywhere. Even Qwen 3.5 9B on an 8 GB Mac is enough.

5.2 Coding assist

Llama 3.1 8B or Qwen 2.5 Coder 7B paired with the Continue VS Code extension — a viable Copilot alternative.

5.3 RAG over your own files

Ollama API + LangChain/LlamaIndex over your PDFs/notes — your own NotebookLM that never phones home.

5.4 Automation (n8n + Ollama)

Swap the Claude Haiku call from Part 7's email-triage workflow with a local Llama → token cost goes to zero.

5.5 Learning / experimentation

Compare model behavior, choose a fine-tuning base, measure quantization impact.


6. Honest limits

  • Quality gap: a 70B local model is close, not equal, to GPT-5.5 / Claude Opus 4.7. Most visible on reasoning, code, and non-English tasks.
  • Multimodal is limited: image/audio in/out is rougher than cloud.
  • Context windows: typically 8K–32K (some 128K). Cloud is at 1M+ now.
  • Heat / fan / power: 70B on a laptop = fans on full. Desktops cost a bit more in electricity.
  • First download is slow: 4–80 GB per model, 30 minutes to 2 hours over Wi-Fi.
  • Weaker safety alignment: open-source models have lighter safety filters — you carry the responsibility.

7. First-week starter plan

Day Task
1 Install LM Studio → download Qwen 3.5 9B → spend an hour chatting; compare with cloud feel.
2 Install Ollama → pull the same model → CLI run → start the API server.
3 Connect your Python or n8n code via the OpenAI-compatible API. Run it without any cloud key.
4–5 Try RAG on one PDF (LM Studio built-in or LangChain).
6–7 Evaluate. Tally savings vs lost capabilities → keep / mix / drop.

8. Cloud vs local — the honest call

Cloud (Claude / GPT / Gemini) Local (Llama / Qwen / Mistral)
Quality Top-tier Strong (15–30% gap)
Cost Usage-based $0 ongoing, hardware + power
Speed Network-bound Hardware-bound
Privacy Policy-dependent Total
Multimodal Rich Limited
Context 1M+ 8K–128K
Scaling Unlimited Hardware ceiling
Liability Vendor-shared All yours

Practical recommendation: Cloud as your main stack, local for sensitive work and automation. The "all of one" stance loses on both fronts; mixing is the rational play.


Developer notes

  1. Ollama API is OpenAI-compatible: set OPENAI_API_BASE=http://localhost:11434/v1 and reuse the OpenAI SDK as-is. Switch between local and cloud with one env var.
  2. Mind the licenses: Llama 3 (Meta), Qwen (Alibaba), Mistral — each license differs. Some restrict commercial use over MAU thresholds.
  3. n8n integration: use the Ollama Chat node or the OpenAI Chat node with a custom Base URL.
  4. Fine-tuning: 8B models LoRA-tune on 16 GB VRAM. Tools like unsloth and axolotl make this practical.
  5. Benchmarking: lm-eval-harness, MTEB — automate model comparisons on your own dataset.
  6. Production serving: vLLM (concurrent requests), TGI (HuggingFace), llama.cpp server beat Ollama on throughput. Ollama is best for personal use and MVPs.
  7. Edge deployment: Raspberry Pi 5 + 8 GB handles 3B models. Mobile gets Llama 3.2 1B/3B at quantization.

References


This is part 8 of 11 in the AI Basics series. Series complete. Browse the full series in the blog TOC page.

๋Œ“๊ธ€

์ด ๋ธ”๋กœ๊ทธ์˜ ์ธ๊ธฐ ๊ฒŒ์‹œ๋ฌผ

Agent Memory Engine (2/10) — Building an AI Agent Memory System with SQLite Alone

"ML Foundations (9/9) — PyTorch vs TensorFlow, and the Road to Local LLMs"

"RAG Core Study (14/26) — Evaluation Sets with RAGAS & DeepEval"

"ML Foundations (8/9) — Deep Learning Architectures: CNN, RNN, Attention"

"ML Foundations (7/9) — Deep Learning Training: Optimizers, Regularization, Initialization"

OpenClaw to Hermes Migration (2/13) — What to Preserve, Partially Port, or Discard

AI Agents I Built (5/7) — Building an Automated Blogger API Publishing System