"ML Foundations (9/9) — PyTorch vs TensorFlow, and the Road to Local LLMs"
The final part. So far we've focused on models. Now we focus on the tools that actually run them. We'll lay out the philosophical difference between PyTorch and TensorFlow, and trace how Hugging Face transformers, llama.cpp, MLX, and Ollama built the bridge to running large language models on your own machine. By the end you should have the full mental model of "download a pretrained LLM and serve it locally."
0. Learning Objectives
- Explain the eager-vs-graph difference between PyTorch and TensorFlow.
- Explain, in graph terms, how autograd automates the backward pass.
- Use Hugging Face transformers' AutoModel, AutoTokenizer, and pipeline abstractions.
- Describe llama.cpp's GGUF quantization, INT4 inference, and CPU-first flow.
- Describe Apple MLX's use of unified memory and how it differs from PyTorch.
- Run a local LLM with Ollama and call it through an OpenAI-compatible API.
1. Key Summary
- PyTorch: eager by default. Each line executes immediately; autograd builds a dynamic graph in the background. Today's de facto standard in research, NLP, and the LLM community.
- TensorFlow 2: turns code into graphs with tf.function. Strong in production, mobile, and embedded.
- Hugging Face transformers: load a pretrained model in one line. The entry point for nearly every open LLM.
- llama.cpp: a C++ LLM inference engine. Combined with GGUF quantization, it runs INT4-quantized models on Mac, Windows, and Linux, on CPU and GPU.
- MLX (Apple): an ML framework built for Apple Silicon's unified memory. Fast and memory-efficient on Macs.
- Ollama: a desktop wrapper around llama.cpp. ollama run llama3 downloads the model, runs it, and exposes an OpenAI-compatible API.
2. Intuition — Eager vs Graph
2.1 Two Execution Models
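import torch
import tensorflow as tf

# PyTorch: eager execution, each line runs immediately
x = torch.tensor([1.0, 2.0])
y = x * 2
print(y)  # immediately prints tensor([2., 4.])

# TensorFlow 2: tf.function traces the Python function into a graph
@tf.function
def f(x):
    return x * 2

y = f(tf.constant([1.0, 2.0]))  # first call traces, then runs a compiled graph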
Eager runs each line immediately. Debugging feels natural. The cost is that the compiler never sees the whole computation, so optimization is limited.
Graph mode builds a computation graph first, then a compiler (XLA, for example) optimizes the whole thing as one program. Faster, but harder to debug and weaker on dynamic control flow.
Both camps are converging on "write in eager, compile selectively": PyTorch's torch.compile, TensorFlow's @tf.function.
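To make "trace once, then replay the graph" concrete, here is a toy sketch in plain Python. It is not how TensorFlow or PyTorch are implemented; the names Sym, trace, and run are hypothetical and exist only for this illustration. It shows one consequence that also holds for tf.function: Python side effects inside the function run during tracing, not on every later call.

```python
# Toy illustration only: not how TensorFlow or PyTorch work internally.

class Sym:
    """Symbolic placeholder that records multiplications instead of computing them."""
    def __init__(self, graph):
        self.graph = graph

    def __mul__(self, k):
        self.graph.append(("mul", k))  # record the op, do no arithmetic yet
        return self


def trace(fn):
    """Run fn once on a placeholder and return the recorded op list."""
    graph = []
    fn(Sym(graph))
    return graph


def run(graph, values):
    """Replay the recorded ops on real data, without calling fn again."""
    out = list(values)
    for op, k in graph:
        if op == "mul":
            out = [v * k for v in out]
    return out


calls = []

def f(x):
    calls.append("python body ran")  # side effect: happens only while tracing
    return x * 2


graph = trace(f)              # first call: trace
y1 = run(graph, [1.0, 2.0])   # later calls: replay the recorded graph
y2 = run(graph, [3.0])
print(graph, y1, y2, len(calls))
```

Because the whole op list exists before any data flows through it, a compiler could fuse or reorder those ops. That is the optimization opportunity eager mode gives up.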
2.2 Why Research Moved to PyTorch
Around 2018–2020 the research community migrated en masse. The reasons are not romantic:
- Debugging works with plain Python (pdb, print).
- Dynamic graphs fit variable-input workloads (NLP, RL, Transformers).
- The API is close to NumPy, so the learning curve is low.
- The pretrained-model ecosystem (Hugging Face) grew up on PyTorch.
2.3 Why TensorFlow Stuck Around in Production
- TensorFlow Serving, TFLite, and TensorFlow.js make a mature deployment pipeline.
- Mobile inference (iOS Core ML / Android TFLite) is well-supported.
- TPU is a first-class target (Google infra).
Even so, PyTorch is steadily eroding the production gap, especially in the LLM era.
3. Definitions — autograd
3.1 The Dynamic Computation Graph
PyTorch records a graph of operations as the forward pass executes:
x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x * 2
z = y.sum()
z.backward()
print(x.grad) # tensor([2., 2.])
z.backward() walks the graph in reverse and runs the \(\delta\) propagation from Part 6 automatically.
3.2 Peeking at the Graph
z.grad_fn                                      # <SumBackward0>
z.grad_fn.next_functions                       # ((<MulBackward0>, 0),)
z.grad_fn.next_functions[0][0].next_functions  # ((<AccumulateGrad>, 0), (None, 0))
Each node carries its backward function; next_functions points to the nodes that produced its inputs. AccumulateGrad marks a leaf tensor (here x), and None stands for the constant 2, which needs no gradient.
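The same mechanism can be sketched in a few dozen lines of plain Python. This is a minimal scalar-only illustration, not PyTorch's implementation; the Value class and its attribute names are hypothetical. Each result remembers its parents and a backward function (the role grad_fn plays), and backward() applies the chain rule in reverse topological order.

```python
class Value:
    """Scalar that remembers how it was produced, so gradients can flow back."""
    def __init__(self, data, parents=(), backward_fn=None):
        self.data = data
        self.grad = 0.0
        self.parents = parents
        self.backward_fn = backward_fn  # plays the role of grad_fn

    def __mul__(self, k):  # multiply by a plain Python number
        out = Value(self.data * k, parents=(self,))
        def backward_fn():
            self.grad += k * out.grad  # d(out)/d(self) = k
        out.backward_fn = backward_fn
        return out

    def __add__(self, other):
        out = Value(self.data + other.data, parents=(self, other))
        def backward_fn():
            self.grad += out.grad   # d(out)/d(self) = 1
            other.grad += out.grad  # d(out)/d(other) = 1
        out.backward_fn = backward_fn
        return out

    def backward(self):
        # Topologically order the graph, then apply the chain rule in reverse.
        order, seen = [], set()
        def visit(v):
            if id(v) not in seen:
                seen.add(id(v))
                for p in v.parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            if v.backward_fn is not None:
                v.backward_fn()


x = [Value(1.0), Value(2.0)]
y = [xi * 2 for xi in x]  # y = x * 2
z = y[0] + y[1]           # z = y.sum()
z.backward()
print([xi.grad for xi in x])
```

The graph is built as a side effect of running the forward code, which is why it is called dynamic: a different Python branch on the next call would simply build a different graph.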
3.3 Disabling Gradient Tracking
with torch.no_grad():
    y = model(x)  # no graph recording: saves memory and time

x_detached = x.detach()  # returns a new tensor that shares data but is cut off from the graph
Required for inference and evaluation code.
3.4 torch.compile (PyTorch 2.0+)
model = torch.compile(model) # TorchDynamo + Inductor → compiled graph
torch.compile traces the eager code dynamically and compiles it. Typical speedups are around 1.3–2×, but the gain depends heavily on the model and hardware.
4. Definitions — Hugging Face transformers
4.1 AutoModel + AutoTokenizer
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
model_id = "meta-llama/Llama-3.1-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
prompt = "Explain backpropagation in one sentence."
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tok.decode(out[0], skip_special_tokens=True))
A handful of lines load an 8B-parameter LLM, tokenize the input, and generate text.
4.2 Layers of Abstraction
| Layer | Role |
|---|---|
| AutoTokenizer | Text ↔ token-ID conversion |
| AutoModel | Load weights, run forward |
| AutoModelFor{Task} | Task-specific heads (classification, generation, QA, …) |
| pipeline() | Tokenize, run, decode in one call |
| Trainer | Training loop, integrating PyTorch and DeepSpeed |
4.3 PEFT (LoRA)
from peft import LoraConfig, get_peft_model
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05, bias="none", task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()
This is the LoRA technique from LLM Core Study Part 2, applied with a single get_peft_model call.
5. Definitions — llama.cpp / MLX / Ollama
5.1 llama.cpp
A C++ LLM inference engine. Its essentials:
- GGUF format: model weights + metadata in one file, including quantization info.
- Low-bit quantization: Q4_K_M, Q5_K_M, Q8_0. A 70B model drops from about 140 GB in fp16 to about 40 GB at Q4_K_M.
- Backends: CPU (AVX2/AVX-512), CUDA, Metal (Apple), Vulkan.
- CLI and server: ./llama-cli, ./llama-server (OpenAI-compatible API).
./build/bin/llama-cli -m models/llama-3.1-8b-instruct.Q4_K_M.gguf \
-p "Explain backprop in one sentence." \
-n 128 -t 8 -ngl 99
5.2 GGUF Quantization
| Quantization | Nominal bits | 8B model size (approx.) | Quality loss |
|---|---|---|---|
| F16 | 16 | 16 GB | baseline |
| Q8_0 | 8 | 8 GB | negligible |
| Q5_K_M | 5 | 5.5 GB | very small |
| Q4_K_M | 4 | 4.6 GB | small |
| Q3_K_M | 3 | 3.5 GB | noticeable |
QLoRA (LLM Core Study Part 2) uses quantization to shrink the frozen base model during training; GGUF quantization is applied once, ahead of time, to shrink the model for inference.
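A simplified sketch of block-wise quantization helps explain what a GGUF file stores. The code below follows the spirit of llama.cpp's Q8_0 scheme (blocks of 32 weights, int8 values, one scale per block) but is not its actual code or file layout; the function names and the sample weights are hypothetical, and the real format rounds the scale to fp16.

```python
BLOCK = 32  # weights per block (the Q8_0 block size in llama.cpp)

def quantize_block(weights):
    """Absmax-quantize one block to int8-range values plus a single scale."""
    absmax = max(abs(w) for w in weights)
    scale = absmax / 127 if absmax > 0 else 1.0
    q = [round(w / scale) for w in weights]  # integers in [-127, 127]
    return scale, q

def dequantize_block(scale, q):
    """Recover approximate float weights from the stored integers."""
    return [scale * qi for qi in q]

# Hypothetical weights, evenly spread over [-0.31, 0.31].
weights = [(i - 15.5) / 50 for i in range(BLOCK)]
scale, q = quantize_block(weights)
restored = dequantize_block(scale, q)

max_err = max(abs(a - b) for a, b in zip(weights, restored))
stored_bytes = 2 + BLOCK  # one 2-byte fp16 scale + 32 one-byte values
bits_per_weight = stored_bytes * 8 / BLOCK

print(f"max error {max_err:.5f}, {bits_per_weight} bits per weight")
```

The per-block scale is why an "8-bit" file comes out slightly larger than 8 bits per weight would suggest. The lower-bit K-quants use more elaborate block layouts, but the store-integers-plus-scales idea is the same.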
5.3 MLX (Apple)
Apple's ML framework for Apple Silicon. The largest difference from PyTorch is its native use of unified memory: CPU and GPU share the same physical RAM, so host↔device copies disappear.
import mlx.core as mx
import mlx.nn as nn
class MLP(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, dim)

    def __call__(self, x):
        return mx.maximum(self.fc(x), 0)
model = MLP(64)
x = mx.random.normal(shape=(2, 64))
print(model(x).shape)
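The API resembles PyTorch but defaults to lazy evaluation: operations are computed only when a result is actually needed, for example when it is printed or passed to mx.eval. On Apple Silicon, memory efficiency and throughput are both strong.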
5.4 Ollama
A desktop-friendly wrapper around llama.cpp.
ollama pull llama3.1:8b
ollama run llama3.1:8b "Explain backprop in one sentence."
The server mode launches automatically:
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "llama3.1:8b", "messages": [{"role":"user","content":"hi"}]}'
You can connect with the OpenAI Python SDK directly:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
resp = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Explain backprop in one sentence."}],
)
print(resp.choices[0].message.content)
Change only the base_url and existing OpenAI client code targets a local model. This compatibility is the single biggest accelerator behind local LLM adoption.
6. Diagram — Training to Local Inference
7. Principle Walkthrough — Pivot Points From Training to Local Inference
A local LLM workflow compresses to five stages: pretrained model → quantization → inference engine → API-compatible interface → efficient adaptation. The learning value lies in seeing why each pivot point exists.
7.1 transformers — The Common Standard for Pretrained Models
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto")
Why transformers became the standard: After Hugging Face's 2018 pytorch-pretrained-BERT, the Auto classes emerged, and AutoModelForCausalLM selects the right architecture automatically. A pretrained model loaded in a single line was the infrastructure that triggered the LLM-ecosystem explosion.
Limits: a Python + PyTorch dependency. Ideal for research and experimentation, but inference speed is average. Production inference typically uses purpose-built engines like vLLM, TGI, or llama.cpp.
7.2 Quantization — Why GGUF Became Indispensable
Observed pattern:
- Qwen2.5-7B fp16: 14 GB memory (at the edge of consumer GPUs)
- Q4_K_M (4-bit quantization): 4.4 GB (fits a 32 GB Mac mini)
- Quality loss: negligible at Q5+, noticeable at Q3 on code/math tasks
Why it appeared: llama.cpp appeared in March 2023, within weeks of LLaMA 1's release; the GGUF file format followed in August 2023, replacing the earlier GGML format. Running 7B models on a Raspberry Pi became possible, the decisive moment of LLM democratization. On the algorithm side, LLM.int8() (Dettmers 2022), GPTQ (Frantar 2022), and AWQ (Lin 2023) are the standard quantization methods.
Forward link: GGUF-style quantization reduces inference memory only. For training, techniques like QLoRA (Dettmers 2023) train LoRA adapters on top of 4-bit weights.
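The memory figures in this section follow from simple arithmetic: parameters times bits per weight, divided by eight. The sketch below assumes a nominal 7e9 parameters (real "7B" models vary) and counts weights only, ignoring the KV cache and activations; the helper name weights_gb is hypothetical.

```python
def weights_gb(n_params, bits_per_weight):
    """Memory for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

N_7B = 7e9  # nominal parameter count; real "7B" models vary

fp16_gb = weights_gb(N_7B, 16)
int4_gb = weights_gb(N_7B, 4)
# Effective bits per weight implied by the 4.4 GB figure quoted above.
implied_bits = 4.4e9 * 8 / N_7B

print(f"fp16: {fp16_gb:.1f} GB, pure 4-bit: {int4_gb:.1f} GB, "
      f"implied bits/weight at 4.4 GB: {implied_bits:.2f}")
```

A pure 4-bit encoding would give 3.5 GB. The quoted Q4_K_M size is larger because the format also stores per-block scales and keeps some tensors at higher precision, so its effective bits per weight sit above 4.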
7.3 Inference Engines — Why Several Coexist
The same model is served by different engines because each environment optimizes differently:
| Engine | Where it's fast | Why |
|---|---|---|
| transformers | Research, prototyping (CUDA, MPS) | Flexibility |
| vLLM | Server GPUs with many users | PagedAttention (paged KV cache) |
| TGI (HF) | Production servers | Multi-worker orchestration |
| llama.cpp | CPU, Mac, Raspberry Pi | GGUF + SIMD optimization |
| MLX | Apple Silicon | Unified memory, Metal GPU |
| Ollama | Desktop UX | llama.cpp wrapper + model registry |
The framing: "which engine for which environment" matters more than "which model." The same 7B model can reach on the order of 1000 tok/s of aggregate throughput on vLLM with many batched requests, and about 5 tok/s for a single stream on llama.cpp CPU.
7.4 OpenAI-Compatible API — Same Code, Different Backends
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
resp = client.chat.completions.create(model="qwen2.5:7b", messages=[...])
Why this changed the game: The OpenAI API became the de facto common interface for LLMs. Ollama, vLLM, and TGI all expose OpenAI-compatible endpoints, so the same client code targets local and cloud.
The framing: interface standardization = vendor lock-in avoidance. To swap backends, change base_url (and the model name).
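To see why the swap is this cheap, it helps to look at what goes over the wire. The sketch below builds, but does not send, a Chat Completions request with only the standard library. The helper name build_chat_request is hypothetical, and the cloud model name and API key are placeholders.

```python
import json
import urllib.request

def build_chat_request(base_url, api_key, model, prompt):
    """Build (but do not send) a Chat Completions request."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )

# Same function, two backends: only base_url, key, and model name differ.
local = build_chat_request("http://localhost:11434/v1", "ollama", "qwen2.5:7b", "hi")
cloud = build_chat_request("https://api.openai.com/v1", "YOUR_API_KEY", "gpt-4o-mini", "hi")

print(local.full_url)
print(cloud.full_url)
# To actually send: urllib.request.urlopen(local), with an Ollama server running.
```

The path, headers, and JSON shape are identical for both targets, which is what lets an SDK written for one backend talk to the other.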
7.5 LoRA — Why Full Fine-Tuning Disappeared
from peft import LoraConfig, get_peft_model
config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
model = get_peft_model(base_model, config)
What you observe: Full fine-tuning a 7B model takes roughly 14 GB × 4 = 56 GB, a rough count of four same-sized tensors: fp16 weights, gradients, and Adam's two moment buffers. Out of consumer reach. LoRA: instead of updating \(W\), freeze it and train a low-rank correction \(BA\) (rank 16), so the effective weight is \(W + BA\). Trainable parameters: under 0.1% of the model.
Why it appeared: Hu et al. 2022, LoRA: Low-Rank Adaptation of Large Language Models. The insight is that the change fine-tuning makes on top of a pretrained model has low rank, so a small update is enough. QLoRA (Dettmers 2023) further reduces memory by keeping the base model in 4-bit.
The framing: for most practitioners, full fine-tuning has given way to LoRA + QLoRA + careful data curation. Efficient adaptation is the dominant paradigm.
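The "under 0.1%" figure can be checked by counting. For one weight matrix of shape d_out × d_in, LoRA adds B (d_out × r) and A (r × d_in), so r · (d_in + d_out) trainable parameters. The layer count and dimensions below are assumptions for a Qwen2.5-7B-like model (check the model's config.json for real values), and the helper name lora_params is hypothetical.

```python
def lora_params(d_in, d_out, r):
    """Trainable parameters LoRA adds to one weight: B (d_out x r) plus A (r x d_in)."""
    return r * (d_in + d_out)

# Assumed shapes for a Qwen2.5-7B-like model; check the model's config.json.
N_LAYERS = 28
HIDDEN = 3584
KV_DIM = 512          # v_proj output is smaller under grouped-query attention
TOTAL_PARAMS = 7.6e9  # approximate
R = 16

# target_modules=["q_proj", "v_proj"]: two adapted matrices per layer.
per_layer = lora_params(HIDDEN, HIDDEN, R) + lora_params(HIDDEN, KV_DIM, R)
trainable = N_LAYERS * per_layer
fraction = trainable / TOTAL_PARAMS

print(f"trainable: {trainable:,} ({fraction:.3%} of the model)")
```

Only these adapter parameters need gradients and optimizer state, which is where the memory saving over full fine-tuning comes from. Adapting more modules or raising r increases the count proportionally.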
8. Variants & Case Studies
8.1 Inference Engines Compared
| Engine | Strength | Weakness |
|---|---|---|
| transformers | Best for research and experimentation | Inference speed average |
| vLLM | Very fast via paged KV cache | Needs server-class GPU |
| TGI (Hugging Face) | Production inference server | More complex to operate |
| llama.cpp | Fast on CPU / Mac | Python integration is separate |
| MLX | Highest efficiency on Apple Silicon | macOS only |
| Ollama | Best desktop UX | Limited advanced options |
| LM Studio | GUI, beginner-friendly | Depends on bundled engine |
8.2 Training vs Inference Pipelines
| Stage | Tools |
|---|---|
| Data prep | datasets, pandas, polars |
| Full training | PyTorch + DeepSpeed / FSDP |
| LoRA training | PEFT, mlx-lm |
| Evaluation | lm-evaluation-harness, OpenAI evals |
| Quantization | bitsandbytes, AWQ, GPTQ, GGUF |
| Serving | vLLM / TGI / Ollama / llama.cpp |
| Client | OpenAI SDK-compatible |
8.3 Real-World Examples
- Personal assistant: Ollama + a small model + local embeddings for offline RAG.
- Code assistant: Continue.dev + Ollama as a Copilot alternative.
- Desktop chatbot: LM Studio, Jan, Msty.
- On-device search: an embedding model + ChromaDB.
- Research experiments: Hugging Face Hub model + transformers + Weights & Biases.
9. Limits & Failure Modes
9.1 GPU / Memory Ceiling
Why it is essential: A 70B model in fp16 needs 140 GB. Even Q4_K_M-quantized, it's around 40 GB, beyond consumer GPUs and all but the highest-memory Macs. Model size itself becomes the ceiling.
How you spot it: Immediate OOM, or catastrophic slowdown from swap.
Next step:
- Start with 4–13B.
- For >13B, use Apple Silicon unified memory (64 GB+) or a GPU cluster.
- MoE models (Mixtral 8×7B) activate only a subset of experts per token, which cuts compute per token; all expert weights still have to fit in memory.
9.2 Quantization Quality Loss
Why it is essential: Below Q4, precise tasks like code and math show visible degradation. The model is "almost right" but wrong at decisive points.
How you spot it: Compare fp16 vs Q4 outputs on the same prompts. Casual chat barely differs; code accumulates noise.
Next step:
- Casual chat: Q4_K_M is fine.
- Code / math: Q5_K_M or higher.
- Reasoning models: Q6 or fp16 recommended.
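One way to build intuition for why quality drops faster below 4 bits: every bit removed roughly doubles the quantization step, and with it the worst-case rounding error per weight. The sketch below uses plain symmetric absmax quantization over synthetic Gaussian weights with a fixed seed. It is a simplification (real GGUF K-quants use per-block scales and other refinements), and the function name roundtrip_error is hypothetical.

```python
import random

def roundtrip_error(weights, bits):
    """Max absolute error and step size after symmetric absmax quantization."""
    levels = 2 ** (bits - 1) - 1  # e.g. 7 positive levels for 4 bits
    scale = max(abs(w) for w in weights) / levels
    restored = [round(w / scale) * scale for w in weights]
    err = max(abs(w - r) for w, r in zip(weights, restored))
    return err, scale

rng = random.Random(0)  # fixed seed, synthetic weights
weights = [rng.gauss(0.0, 0.02) for _ in range(256)]

results = {bits: roundtrip_error(weights, bits) for bits in (8, 5, 4, 3)}
for bits, (err, scale) in results.items():
    print(f"{bits}-bit: step {scale:.5f}, max error {err:.5f}")
```

Per-weight error is only a proxy. Whether it matters for a task still has to be checked by comparing outputs on the same prompts, as described above.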
9.3 Context Length
Why it is essential: KV cache memory scales linearly with context length. At 32k context, the KV cache of a 7B-class model without grouped-query attention approaches the size of the model weights.
How you spot it: OOM on long inputs, or a spike in response time.
Next step:
- Sliding Window Attention (Mistral): keep only a fixed window of keys and values.
- Flash Attention 2: memory-efficient attention computation.
- Mamba / RWKV: \(O(L)\) compute with a fixed-size state, so there is no KV cache growing with context.
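The linear scaling is easy to verify with the standard KV cache size formula: two tensors (keys and values) per layer, each holding n_kv_heads × head_dim values per token. The shapes below are assumptions for a 7B-class model with 32 layers and 128-dimensional heads, and the helper name kv_cache_gb is hypothetical.

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of the K and V caches for one sequence, in decimal gigabytes."""
    values = 2 * n_layers * n_kv_heads * head_dim * seq_len  # 2 = keys + values
    return values * bytes_per_value / 1e9

# Assumed shapes: a 7B-class model with 32 layers and 128-dim heads, fp16 cache.
full_mha = kv_cache_gb(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=32_768)
with_gqa = kv_cache_gb(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=32_768)

print(f"32 KV heads: {full_mha:.1f} GB, 8 KV heads (GQA): {with_gqa:.1f} GB")
```

With full multi-head attention the cache at 32k tokens is about as large as the fp16 weights of the model itself. Grouped-query attention with 8 KV heads cuts it to a quarter, which is one reason newer models adopt it.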
9.4 Inference Speed
Why it is essential: CPU inference is memory-bandwidth bound. 7B model: CPU 5–20 tok/s, Apple Silicon 30–50 tok/s, RTX 4090 100+ tok/s.
How you spot it: Measure whether interactive chat is usable.
Next step:
- Smaller model (≤ 3B): faster responses.
- Speculative decoding: a small model drafts, the big model verifies.
- Efficient engines (vLLM) + quantization.
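"Memory-bandwidth bound" gives a quick back-of-envelope ceiling: in single-stream decoding, each new token has to read essentially every weight once, so tokens per second cannot exceed bandwidth divided by model size. The bandwidth numbers below are illustrative assumptions, not measurements, and the helper name max_tokens_per_second is hypothetical.

```python
def max_tokens_per_second(bandwidth_gb_s, model_gb):
    """Rough ceiling for single-stream decoding: each token reads every weight once."""
    return bandwidth_gb_s / model_gb

MODEL_GB = 4.4  # the 7B Q4_K_M size quoted earlier in this article

# Illustrative memory bandwidths in GB/s (assumptions; look up your own hardware).
hardware = {
    "desktop CPU, dual-channel DDR": 50,
    "Apple Silicon (mid-range)": 200,
    "RTX 4090": 1000,
}
ceilings = {name: max_tokens_per_second(bw, MODEL_GB) for name, bw in hardware.items()}

for name, tps in ceilings.items():
    print(f"{name}: up to ~{tps:.0f} tok/s")
```

Real throughput lands below these ceilings, but the ordering matches the ranges quoted above. The same formula shows why a smaller or more heavily quantized model speeds up decoding: there are fewer bytes to stream per token.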
9.5 Version / Format Churn
Why it is essential: The GGUF format keeps evolving (V1 → V2 → V3). llama.cpp / Ollama updates may break model-file compatibility.
How you spot it: Errors mentioning quantization format during model loading.
Next step:
- Update regularly, and record which model version you're using.
- Re-quantize to the latest GGUF when possible.
9.6 Security and Privacy
Why it is essential: Models can surface implicit biases from training data. Sensitive domains (medical, legal) require evaluation.
How you spot it: Use domain-specific evaluation sets to measure output quality and bias.
Next step:
- Domain-specific fine-tuning (LoRA), the §7.5 pattern.
- Guardrails (NeMo Guardrails, LangChain) for output filtering.
- For sensitive data, local LLMs provide a privacy advantage: data never leaves the device.
10. Quick Recap — Answer Before You Peek
Five core questions this article answered. Cover the answers, give a one-line response yourself, then check.
Q1. What was the decisive reason the transformers library became the LLM ecosystem standard?
Answer: AutoModelForCausalLM.from_pretrained(...), a one-line common API for any pretrained model.
Why: No need to know per-architecture classes; the Auto classes pick the right one. This was the pretrained-model sharing infrastructure. (Section 7.1.)
Q2. Why is 4-bit quantization (Q4_K_M) popular? Its limit?
Answer: It shrinks a 7B model from 14 GB to 4.4 GB, fitting consumer hardware. Limit: noticeable degradation on precise tasks like code and math, so Q5+ is recommended there.
Why: llama.cpp + GGUF (2023) was the decisive moment of LLM democratization. Quantization removed the barrier where model size alone decided whether local inference was feasible. (Sections 7.2, 9.2.)
Q3. Why do vLLM vs llama.cpp produce different speeds on the same model and task?
Answer: Each engine optimizes for a different environment. vLLM targets server GPUs with many users (PagedAttention); llama.cpp targets CPU/Mac (SIMD).
Why: "Which engine in which environment" matters more than "which model." A 7B model can reach on the order of 1000 tok/s of aggregate batched throughput on vLLM and about 5 tok/s for a single stream on llama.cpp CPU. (Section 7.3.)
Q4. Why did the OpenAI-compatible API become the de facto interface across local LLM tools?
Answer: It is effectively the common interface. The same client code targets local (Ollama) and cloud (OpenAI); only base_url changes.
Why: Interface standardization = vendor-lock-in avoidance. Swapping models costs almost nothing. (Section 7.4.)
Q5. Why did LoRA replace full fine-tuning?
Answer: It trains under 0.1% of the parameters, fitting consumer-grade memory. Learning the "small change" on top of the pretrained model suffices.
Why: Hu et al. 2022 introduced LoRA. QLoRA (Dettmers 2023) keeps the base model in 4-bit, cutting memory further: full fine-tune 56 GB → QLoRA 6 GB. (Section 7.5.)
If you answered four or five easily, the five pivot points of the local LLM workflow are in place.
11. Closing the Series — Where to Next
Across these nine parts we covered:
- Part 1 sklearn: the shared grammar of ML
- Part 2 k-NN: distance, dimension, and bias-variance in one model
- Part 3 linear regression: where loss and optimization start
- Part 4 logistic regression: probabilities, cross-entropy, MLE
- Part 5 regularization: containing overfitting
- Part 6 neural networks: backprop and the universal approximation theorem
- Part 7 training craft: Adam, BatchNorm, initialization, schedules
- Part 8 architectures: CNN, RNN, Transformer
- Part 9 (this part): frameworks and local LLMs
The natural sequel is the LLM Core Study series (six parts).
- The inside of a single Transformer block: tokenization, embeddings, attention, positional encoding
- Fine-tuning: LoRA, QLoRA, distillation
- Decoding: greedy, beam, top-p, speculative
- Advanced: RAG, CoT, MoE, in-context learning
- Mathematical intuition: softmax, CE, KL, LayerNorm
- The study roadmap
With the foundations in place, the next series tracks why an LLM looks the way it does, at the same depth.
12. Further Reading
Primary sources
- Paszke, A. et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. NeurIPS 2019. arXiv:1912.01703.
- Abadi, M. et al. TensorFlow: A system for large-scale machine learning. OSDI 2016. arXiv:1605.08695.
- Wolf, T. et al. Transformers: State-of-the-Art Natural Language Processing. EMNLP 2020 Demo. arXiv:1910.03771.
- Hu, E. J. et al. LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022. arXiv:2106.09685.
- Dettmers, T. et al. QLoRA: Efficient Finetuning of Quantized LLMs. NeurIPS 2023. arXiv:2305.14314.
Official docs and repositories
- PyTorch: https://pytorch.org/docs/stable/
- TensorFlow: https://www.tensorflow.org/api_docs/
- Hugging Face transformers: https://huggingface.co/docs/transformers/
- PEFT: https://huggingface.co/docs/peft/
- llama.cpp: https://github.com/ggml-org/llama.cpp
- MLX: https://ml-explore.github.io/mlx/
- Ollama: https://ollama.com/
- vLLM: https://docs.vllm.ai/
Companion books
- d2l.ai (Dive into Deep Learning): free textbook.
- Karpathy, A. Neural Networks: Zero to Hero (YouTube). Backprop and Transformers from scratch.
- Chip Huyen. Designing Machine Learning Systems. O'Reilly 2022. The deployment-and-ops side.
I hope the series helped clarify the larger shape of machine learning. See you in the next series, where we dive into the inside of an LLM at the same depth.
Series overview: Series index