"Local LLMs 2026 — Ollama vs LM Studio, and What Actually Runs on Your Machine"

4월 25, 2026

Install to model selection, with picks for 8 GB / 16 GB / 24 GB / 64 GB systems on Mac and Windows

핵심 요약

Audience: Anyone who wants to run LLMs on their own computer for cost, privacy, or learning — both general users and developers.
What you'll get: 1) Ollama vs LM Studio — which to pick, 2) what runs on 8/16/24/64 GB systems, 3) Mac and Windows install steps, 4) realistic use cases and limits, 5) honest comparison vs cloud.
One-liner: First-time GUI → LM Studio. Automation/server/dev → Ollama. Both free. Apple M-series surprises at 8 GB and shines at 64 GB.

1. Why local

Factor	Cloud	Local
Cost	Per-token	Free after install (electricity, hardware)
Privacy	Sent to vendor	Never leaves your machine
Offline	Needs network	Works
Quality	GPT-5.5 / Claude 4.7 / Gemini 3	Llama 3 / Qwen 3 / Mistral (open-source SOTA)
Speed	Depends on model + network	Depends on your hardware (~30–40 tok/s on a typical laptop)
Multimodal	Rich	Text + some vision
Context	1M+	8K–128K

Local fits when: - Sensitive data (medical, legal, contracts). - Heavy automation that's making cloud costs sting. - Offline environments. - Learning / experimentation (model comparison, fine-tuning).

Cloud still wins when: - You need top-end quality on hard tasks. - Multimodal (image/audio in/out). - Casual usage.

2. Ollama vs LM Studio

2.1 Core differences

	Ollama	LM Studio
Interface	CLI (terminal)	GUI desktop app
Audience	Devs / automation	Non-devs / experimenters
Model library	Curated (ollama.com/library)	HuggingFace integration
Apple Silicon	Metal (good)	MLX (1.6× faster on M3 Ultra)
GPU auto-detect	NVIDIA / AMD / Apple	Manual config
Docker	Official image	❌
API server	Built in (`localhost:11434`)	Toggle in app
n8n / LangChain	First-class	Works via OpenAI-compatible API
Free	✅	✅

2.2 Which to pick

GUI exploration / model browsing → LM Studio
CLI / automation / server → Ollama
Both is fine — they don't conflict. Many start with LM Studio, then add Ollama for dev work.

2.3 Speed (M3 Ultra, Gemma 3 1B)

LM Studio (MLX): 237 tok/s
Ollama (Metal): 149 tok/s

Ollama is adopting MLX, but as of April 2026 LM Studio is faster on Apple Silicon.

3. What runs on your machine

3.1 Model picks by RAM/VRAM (Q4_K_M)

System	Recommended model	Memory used	Use
8 GB VRAM (or 8 GB Apple unified)	Qwen 3.5 9B (default rec)	6.6 GB	General chat, summary, translate
8 GB	Llama 3.1 8B	5–6 GB	English-first chat
8 GB	Mistral 7B	4.5 GB	Lighter tasks
16 GB	Qwen 3 14B Q5_K_M	10.2 GB	High-quality chat
16 GB	Mistral Small 24B Q4	13.4 GB	Stronger reasoning
24 GB	Mixtral 8x7B / 35B Q4	18–22 GB	Production-tier work
48+ GB	Llama 3.3 70B Q4 / Qwen 2.5 72B Q4	40–44 GB	Near-cloud quality
Apple M-Mac 64 GB+	Llama 3.3 70B / Qwen 72B	uses unified memory	M5 Max 64 GB ≈ H100-class

Sources: LocalLLM.in VRAM guide, GitHub ollama/ollama.

3.2 Quantization in one line

Q4_K_M = 4-bit compression. ~75% smaller than FP16, with ~5–10% quality loss. The standard default. - Q8 / FP16 = full quality (4× memory) - Q5_K_M = slightly better, +20% memory - Q3 = even smaller, but visibly worse

Start with Q4_K_M everywhere; only step up if quality bothers you.

3.3 The Apple Silicon trick

Unified memory lets the GPU use system RAM. A 64 GB MacBook Pro is effectively a 64 GB VRAM machine. 70B models that won't fit on an RTX 4090 (24 GB) run on M3/M4/M5 Max 64 GB.

4. Install — Mac & Windows

4.1 Ollama (CLI)

Mac

brew install ollama

ollama run llama3.1:8b

Windows (10/11, native ARM64 in 2026) 1. Get the installer from ollama.com/download 2. After install, in PowerShell:

ollama run qwen3:9b

Common commands

ollama list           # installed models
ollama pull mistral   # download without running
ollama rm llama3.1    # remove
ollama serve          # API server (default port 11434)

4.2 LM Studio (GUI)

Mac & Windows 1. Download from lmstudio.ai 2. Open → search icon → search a model (e.g., "qwen 3.5 9b") 3. Download → use the "Chat" tab right away 4. (Optional) "Local Server" tab → Start → enables an OpenAI-compatible API

LM Studio shows "Will fit in your RAM/VRAM" on every model — friendlier on hardware questions.

5. Five practical use cases

5.1 Sensitive document analysis

Summarize legal contracts, medical records, internal memos without sending them anywhere. Even Qwen 3.5 9B on an 8 GB Mac is enough.

5.2 Coding assist

Llama 3.1 8B or Qwen 2.5 Coder 7B paired with the Continue VS Code extension — a viable Copilot alternative.

5.3 RAG over your own files

Ollama API + LangChain/LlamaIndex over your PDFs/notes — your own NotebookLM that never phones home.

5.4 Automation (n8n + Ollama)

Swap the Claude Haiku call from Part 7's email-triage workflow with a local Llama → token cost goes to zero.

5.5 Learning / experimentation

Compare model behavior, choose a fine-tuning base, measure quantization impact.

6. Honest limits

Quality gap: a 70B local model is close, not equal, to GPT-5.5 / Claude Opus 4.7. Most visible on reasoning, code, and non-English tasks.
Multimodal is limited: image/audio in/out is rougher than cloud.
Context windows: typically 8K–32K (some 128K). Cloud is at 1M+ now.
Heat / fan / power: 70B on a laptop = fans on full. Desktops cost a bit more in electricity.
First download is slow: 4–80 GB per model, 30 minutes to 2 hours over Wi-Fi.
Weaker safety alignment: open-source models have lighter safety filters — you carry the responsibility.

7. First-week starter plan

Day	Task
1	Install LM Studio → download Qwen 3.5 9B → spend an hour chatting; compare with cloud feel.
2	Install Ollama → pull the same model → CLI run → start the API server.
3	Connect your Python or n8n code via the OpenAI-compatible API. Run it without any cloud key.
4–5	Try RAG on one PDF (LM Studio built-in or LangChain).
6–7	Evaluate. Tally savings vs lost capabilities → keep / mix / drop.

8. Cloud vs local — the honest call

	Cloud (Claude / GPT / Gemini)	Local (Llama / Qwen / Mistral)
Quality	Top-tier	Strong (15–30% gap)
Cost	Usage-based	$0 ongoing, hardware + power
Speed	Network-bound	Hardware-bound
Privacy	Policy-dependent	Total
Multimodal	Rich	Limited
Context	1M+	8K–128K
Scaling	Unlimited	Hardware ceiling
Liability	Vendor-shared	All yours

Practical recommendation: Cloud as your main stack, local for sensitive work and automation. The "all of one" stance loses on both fronts; mixing is the rational play.

Developer notes

Ollama API is OpenAI-compatible: set OPENAI_API_BASE=http://localhost:11434/v1 and reuse the OpenAI SDK as-is. Switch between local and cloud with one env var.
Mind the licenses: Llama 3 (Meta), Qwen (Alibaba), Mistral — each license differs. Some restrict commercial use over MAU thresholds.
n8n integration: use the Ollama Chat node or the OpenAI Chat node with a custom Base URL.
Fine-tuning: 8B models LoRA-tune on 16 GB VRAM. Tools like unsloth and axolotl make this practical.
Benchmarking: lm-eval-harness, MTEB — automate model comparisons on your own dataset.
Production serving: vLLM (concurrent requests), TGI (HuggingFace), llama.cpp server beat Ollama on throughput. Ollama is best for personal use and MVPs.
Edge deployment: Raspberry Pi 5 + 8 GB handles 3B models. Mobile gets Llama 3.2 1B/3B at quantization.

References

This is part 8 of 11 in the AI Basics series. Series complete. Browse the full series in the blog TOC page.

이 블로그 검색

MaJu Tech Notes