"ML Foundations (9/9) — PyTorch vs TensorFlow, and the Road to Local LLMs"

The final part. So far we've focused on models. Now we focus on the tools that actually run them. We'll lay out the philosophical difference between PyTorch and TensorFlow, and trace how Hugging Face transformers, llama.cpp, MLX, and Ollama built the bridge to running large language models on your own machine. By the end you should have the full mental model of "download a pretrained LLM and serve it locally."

0. Learning Objectives

Explain the eager-vs-graph difference between PyTorch and TensorFlow.
Explain, in graph terms, how autograd automates the backward pass.
Use Hugging Face transformers' AutoModel, AutoTokenizer, and pipeline abstractions.
Describe llama.cpp's GGUF quantization, INT4 inference, and CPU-first flow.
Describe Apple MLX's use of unified memory and how it differs from PyTorch.
Run a local LLM with Ollama and call it through an OpenAI-compatible API.

1. Key Summary

PyTorch: eager by default. Each line executes immediately; autograd builds a dynamic graph in the background. Today's de facto standard in research, NLP, and the LLM community.
TensorFlow 2: also eager by default, but tf.function traces Python code into graphs. Strong in production, mobile, and embedded.
Hugging Face transformers: load a pretrained model in one line. The entry point for nearly every open LLM.
llama.cpp: a C++ LLM inference engine. Combined with GGUF quantization, it runs INT4-quantized models on Mac, Windows, and Linux, on CPU and GPU.
MLX (Apple): an ML framework built for Apple Silicon's unified memory. Fast and memory-efficient on Macs.
Ollama: a desktop wrapper around llama.cpp. ollama run llama3 downloads and runs the model, and the local Ollama server exposes an OpenAI-compatible API.

2. Intuition — Eager vs Graph

2.1 Two Execution Models

import torch

x = torch.tensor([1.0, 2.0])
y = x * 2
print(y)                 # immediately prints tensor([2., 4.])

import tensorflow as tf

@tf.function
def f(x):
    return x * 2
y = f(tf.constant([1.0, 2.0]))  # first call traces, then runs a compiled graph

Eager runs each line immediately. Debugging feels natural. The cost is that the compiler never sees the whole computation, so optimization is limited.

Graph mode builds a computation graph first, then a compiler (XLA, for example) optimizes the whole thing as one program. Faster, but harder to debug and weaker on dynamic control flow.

Both camps are converging on "write in eager, compile selectively": PyTorch's torch.compile, TensorFlow's @tf.function.
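To make the trace-then-replay idea concrete, here is a toy sketch in plain Python. It is not how tf.function or torch.compile are implemented; Tracer and trace are hypothetical names invented for this illustration, and the sketch only handles a straight chain of multiply and add. What it shows is that under tracing the Python body runs once, and later calls replay the recorded operations.

```python
class Tracer:
    """Stands in for a tensor during tracing: records ops instead of computing."""

    def __init__(self, ops):
        self.ops = ops

    def __mul__(self, k):
        self.ops.append(("mul", k))
        return self

    def __add__(self, k):
        self.ops.append(("add", k))
        return self


def trace(fn):
    """Run fn once on a Tracer, then return a function that replays the recording."""
    ops = []
    fn(Tracer(ops))  # the Python body executes here, exactly once

    def compiled(x):
        for name, k in ops:
            x = x * k if name == "mul" else x + k
        return x

    compiled.ops = ops
    return compiled


calls = []  # counts how often the Python body itself runs


def f(x):
    calls.append("python body ran")
    return x * 2 + 1


eager_result = f(3.0)   # eager: the body runs now and returns a number
graph_f = trace(f)      # graph: the body runs once more, only to record ops
graph_results = [graph_f(v) for v in (1.0, 2.0, 3.0)]  # replays, body not re-run
```

After this runs, calls has two entries (one eager call, one trace) even though graph_f was invoked three times. The same effect is why a print inside a @tf.function body fires only while the function is being traced.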

2.2 Why Research Moved to PyTorch

Around 2018–2020 the research community migrated en masse. The reasons are not romantic:

Debugging works with plain Python (pdb, print).
Dynamic graphs fit variable-input workloads (NLP, RL, Transformers).
The API is close to numpy — low learning curve.
The pretrained-model ecosystem (Hugging Face) grew up on PyTorch.

2.3 Why TensorFlow Stuck Around in Production

TensorFlow Serving, TFLite, and TensorFlow.js make a mature deployment pipeline.
Mobile inference is well supported: TFLite runs on Android and iOS, and TensorFlow models can also be converted to Core ML for Apple devices.
TPU is a first-class target (Google infra).

Even so, PyTorch is steadily eroding the production gap, especially in the LLM era.

3. Definitions — autograd

3.1 The Dynamic Computation Graph

PyTorch records a graph of operations as the forward pass executes:

x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x * 2
z = y.sum()
z.backward()
print(x.grad)            # tensor([2., 2.])

z.backward() walks the graph in reverse and runs the \(\delta\) propagation from Part 6 automatically.

3.2 Peeking at the Graph

z.grad_fn                            # <SumBackward0>
z.grad_fn.next_functions             # ((<MulBackward0>, 0),)
z.grad_fn.next_functions[0][0].next_functions  # ((<AccumulateGrad>, 0),)

Each node carries its backward function; next_functions points to the nodes that produced its inputs, ending at AccumulateGrad for leaf tensors such as x.
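The same mechanism can be sketched in a few dozen lines of plain Python. This is a teaching toy under stated assumptions, not PyTorch's implementation: Value is a hypothetical scalar class, it supports only multiply-by-constant and add, and the real autograd engine works on tensors with nodes implemented in C++. It reproduces the x * 2 then sum example from Section 3.1.

```python
class Value:
    """A scalar that remembers how it was produced, so gradients can flow back."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents      # nodes that produced this value's inputs
        self.backward_fn = None     # pushes self.grad to the parents

    def __mul__(self, k):           # multiply by a plain Python constant
        out = Value(self.data * k, parents=(self,))

        def backward_fn():
            self.grad += k * out.grad

        out.backward_fn = backward_fn
        return out

    def __add__(self, other):       # add another Value
        out = Value(self.data + other.data, parents=(self, other))

        def backward_fn():
            self.grad += out.grad
            other.grad += out.grad

        out.backward_fn = backward_fn
        return out

    def backward(self):
        # Topologically order the graph, then walk it in reverse.
        order, seen = [], set()

        def visit(node):
            if id(node) in seen:
                return
            seen.add(id(node))
            for parent in node.parents:
                visit(parent)
            order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            if node.backward_fn is not None:
                node.backward_fn()


x1, x2 = Value(1.0), Value(2.0)
z = x1 * 2 + x2 * 2          # same computation as (x * 2).sum()
z.backward()
grads = [x1.grad, x2.grad]   # matches tensor([2., 2.]) from Section 3.1
```

The graph is built as a side effect of running the forward expression, which is what "dynamic" means here: a different Python branch on the next call would simply record a different graph.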

3.3 Disabling Gradient Tracking

with torch.no_grad():
    y = model(x)           # no graph recording — saves memory and time

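x.detach()                 # returns a new tensor cut off from the graph (x itself is unchanged)

Use no_grad in inference and evaluation code. It is not needed for correctness, but it avoids the memory and time cost of recording a graph you will never differentiate.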

3.4 torch.compile (PyTorch 2.0+)

model = torch.compile(model)         # TorchDynamo + Inductor → compiled graph

torch.compile traces the eager code dynamically and compiles it. Speedups vary by model and hardware; 1.3–2× is a commonly reported range.

4. Definitions — Hugging Face transformers

4.1 AutoModel + AutoTokenizer

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "meta-llama/Llama-3.1-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Explain backpropagation in one sentence."
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tok.decode(out[0], skip_special_tokens=True))

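A handful of lines load an 8B-parameter LLM, tokenize the input, and generate text. (The meta-llama repositories are gated, so you need to accept the license on the Hugging Face Hub first.)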

4.2 Layers of Abstraction

Layer	Role
`AutoTokenizer`	Text ↔ token-ID conversion
`AutoModel`	Load weights, run forward
`AutoModelFor{Task}`	Task-specific heads (classification, generation, QA, …)
`pipeline()`	Tokenize, run, decode — one line
`Trainer`	Training loop, integrating PyTorch and DeepSpeed

4.3 PEFT (LoRA)

from peft import LoraConfig, get_peft_model

lora_cfg = LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05, bias="none", task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()

This is the LoRA technique from LLM Core Study Part 2, applied with a single get_peft_model call.

5. Definitions — llama.cpp / MLX / Ollama

5.1 llama.cpp

A C++ LLM inference engine. Its essentials:

GGUF format: model weights + metadata in one file, including quantization info.
Low-bit quantization: Q4_K_M, Q5_K_M, Q8_0. A 70B model can drop from about 140 GB in fp16 to about 40 GB at Q4_K_M.
Backends: CPU (AVX2/AVX-512), CUDA, Metal (Apple), Vulkan.
CLI and server: ./llama-cli, ./llama-server (OpenAI-compatible API).

./build/bin/llama-cli -m models/llama-3.1-8b-instruct.Q4_K_M.gguf \
                      -p "Explain backprop in one sentence." \
                      -n 128 -t 8 -ngl 99

5.2 GGUF Quantization

Quantization	Nominal bits	Approx. size (8B model)	Quality loss
F16	16	16 GB	baseline
Q8_0	8	8 GB	negligible
Q5_K_M	5	5.5 GB	very small
Q4_K_M	4	4.6 GB	small
Q3_K_M	3	3.5 GB	noticeable

LLM Core Study Part 2's QLoRA quantizes the base weights so that training fits in memory; GGUF quantizes already-trained weights offline so that inference fits in memory.
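A minimal sketch of what block-wise quantization does, assuming a simplified symmetric scheme: each block of weights is stored as small integers plus one shared scale. The function names are hypothetical and this is not the actual GGUF code path. The 32-weight block with a 16-bit scale mirrors the simple Q4_0 / Q8_0 layouts, which is where figures like 4.5 and 8.5 bits per weight come from; K-quants such as Q4_K_M use a more elaborate nested scheme that is not modeled here.

```python
def quantize_block(weights, bits=4):
    """Symmetric quantization of one block: small integers plus one shared scale."""
    qmax = 2 ** (bits - 1) - 1                          # 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0  # fall back to 1.0 for an all-zero block
    return [round(w / scale) for w in weights], scale


def dequantize_block(q, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [v * scale for v in q]


def bits_per_weight(block_size, bits, scale_bits=16):
    """Storage cost per weight once the per-block scale is amortized."""
    return bits + scale_bits / block_size


block = [0.70, -0.20, 0.05, -0.70, 0.33, 0.00, -0.51, 0.12]  # made-up weights
q, scale = quantize_block(block, bits=4)
restored = dequantize_block(q, scale)
max_error = max(abs(a - b) for a, b in zip(block, restored))  # at most scale / 2
```

The rounding error per weight is bounded by half the scale, so a block with one large outlier loses more precision on its small weights. That is the intuition behind the "quality loss" column: fewer bits means a coarser grid inside every block.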

5.3 MLX (Apple)

Apple's ML framework for Apple Silicon. The largest difference from PyTorch is its native use of unified memory: CPU and GPU share the same physical RAM, so host↔device copies disappear.

import mlx.core as mx
import mlx.nn as nn

class MLP(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, dim)
    def __call__(self, x):
        return mx.maximum(self.fc(x), 0)

model = MLP(64)
x = mx.random.normal(shape=(2, 64))
print(model(x).shape)

The API resembles PyTorch, but evaluation is lazy by default: arrays are computed only when their values are needed, for example on print or mx.eval. On Apple Silicon, memory efficiency and throughput are both strong.

5.4 Ollama

A desktop-friendly wrapper around llama.cpp.

ollama pull llama3.1:8b
ollama run llama3.1:8b "Explain backprop in one sentence."

The Ollama app keeps a local server running in the background on port 11434. If it is not already running, start it with ollama serve:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.1:8b", "messages": [{"role":"user","content":"hi"}]}'

You can connect with the OpenAI Python SDK directly:

from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
resp = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Explain backprop in one sentence."}],
)
print(resp.choices[0].message.content)

Change only the base_url and existing OpenAI client code targets a local model. This compatibility is the single biggest accelerator behind local LLM adoption.

6. Diagram — Training to Local Inference

7. Principle Walkthrough — Pivot Points From Training to Local Inference

A local LLM workflow compresses to five stages: pretrained model → quantization → inference engine → API-compatible interface → efficient adaptation. The learning value lies in seeing why each pivot point exists.

7.1 transformers — The Common Standard for Pretrained Models

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto")

Why transformers became the standard: After Hugging Face's 2018 pytorch-pretrained-BERT, the library grew the Auto classes; AutoModelForCausalLM selects the right architecture from the checkpoint's config. Loading a pretrained model in a single line was the infrastructure that triggered the LLM-ecosystem explosion.

Limits: It depends on Python and PyTorch. Ideal for research and experimentation, but inference speed is average. Production inference typically uses purpose-built engines like vLLM, TGI, or llama.cpp.

7.2 Quantization — Why GGUF Became Indispensable

Observed pattern:
Qwen2.5-7B fp16: about 14 GB of memory (at the edge of consumer GPUs)
Q4_K_M (4-bit quantization): about 4.4 GB (fits easily on a 32 GB Mac mini)
Quality loss: negligible at Q5+, noticeable at Q3 on code/math tasks
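The sizes above follow from simple arithmetic: parameters times bits per weight, divided by eight. A rough sketch, assuming a round 7B parameter count and about 5 effective bits per weight for Q4_K_M (an assumption made here to match the 4.4 GB figure; K-quants store scales and keep some tensors at higher precision, so the effective rate sits above the nominal 4 bits). The helper name is hypothetical.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


fp16_7b = weight_memory_gb(7e9, 16)      # about 14 GB
q4_7b = weight_memory_gb(7e9, 5.0)       # about 4.4 GB at ~5 effective bits (assumed)
fp16_70b = weight_memory_gb(70e9, 16)    # about 140 GB
```

These numbers cover weights only. The KV cache and activations come on top, which is why a model that "fits" on paper can still run out of memory on long prompts.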

Why it appeared: llama.cpp appeared within weeks of LLaMA 1's release in early 2023, and the GGUF format followed later that year as the successor to the original GGML files. Running 7B models on a Raspberry Pi became possible, the decisive moment of LLM democratization. Elsewhere in the ecosystem, LLM.int8() (Dettmers 2022), GPTQ (Frantar 2022), and AWQ (Lin 2023) are the standard quantization algorithms.

Forward link: GGUF-style quantization is applied after training and reduces inference memory. For training, techniques like QLoRA (Dettmers 2023) train LoRA adapters on top of 4-bit weights.

7.3 Inference Engines — Why Several Coexist

The same model is served by different engines because each environment optimizes differently:

Engine	Where it's fast	Why
transformers	Research, prototyping (CUDA, MPS)	Flexibility
vLLM	Server GPUs with many users	PagedAttention (paged KV cache)
TGI (HF)	Production servers	Multi-worker orchestration
llama.cpp	CPU, Mac, Raspberry Pi	GGUF + SIMD optimization
MLX	Apple Silicon	Unified memory, Metal GPU
Ollama	Desktop UX	llama.cpp wrapper + model registry

The framing: "which engine for which environment" matters more than "which model." The same 7B model can reach on the order of 1000 tok/s of aggregate throughput on vLLM with many batched requests, and around 5 tok/s on llama.cpp on a slow CPU.

7.4 OpenAI-Compatible API — Same Code, Different Backends

from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
resp = client.chat.completions.create(model="qwen2.5:7b", messages=[...])

Why this changed the game: The OpenAI API became the de facto common interface for LLMs. Ollama, vLLM, and TGI all expose OpenAI-compatible endpoints, so the same client code targets local and cloud.

The framing: interface standardization = vendor lock-in avoidance. To swap backends, change base_url (and the model name).
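To see why the swap is so cheap, it helps to look at the request on the wire. The sketch below builds the same chat-completions request for two backends using only the standard library and does not send anything. build_chat_request is a hypothetical helper, the cloud model name and API key are placeholders, and the payload shows only the minimal fields used in the curl example from Section 5.4.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, api_key="ollama"):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


local = build_chat_request("http://localhost:11434/v1", "qwen2.5:7b", "hi")
cloud = build_chat_request(
    "https://api.openai.com/v1", "gpt-4o-mini", "hi", api_key="YOUR_API_KEY"
)

# Only the URL, model name, and key differ; the message format is identical.
same_messages = json.loads(local.data)["messages"] == json.loads(cloud.data)["messages"]
# To actually send one: urllib.request.urlopen(local)  (requires a running server)
```

In real code the OpenAI SDK shown above is the more convenient choice; the point here is only that the payload shape is the shared contract.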

7.5 LoRA — Why Full Fine-Tuning Disappeared

from peft import LoraConfig, get_peft_model
config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
model = get_peft_model(base_model, config)

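What you observe: Full fine-tuning a 7B model needs roughly 14 GB × 4 = 56 GB, counting fp16 weights, gradients, and Adam's two moment estimates, before any activations. Out of consumer reach. LoRA: instead of updating \(W\), freeze it and train a low-rank correction \(BA\) (rank 16), so the effective weight is \(W + BA\). Trainable parameters: roughly 0.1% of the model or less.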
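Both numbers can be checked with back-of-the-envelope arithmetic. The sketch below assumes Llama-3.1-8B-like shapes (32 layers, hidden size 4096, and a 1024-wide v_proj output because of grouped-query attention) and an approximate total of 8.03B parameters; these shapes are assumptions brought in for illustration, and the helper names are hypothetical.

```python
def full_finetune_gb(n_params, bytes_per_value=2, copies=4):
    """Weights + gradients + two Adam moments, all at the same precision."""
    return n_params * bytes_per_value * copies / 1e9


def lora_params(d_in, d_out, rank):
    """One adapted matrix adds A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


LAYERS, HIDDEN, KV_DIM, RANK = 32, 4096, 1024, 16   # assumed model shapes
TOTAL_PARAMS = 8.03e9                               # approximate

per_layer = lora_params(HIDDEN, HIDDEN, RANK) + lora_params(HIDDEN, KV_DIM, RANK)  # q_proj + v_proj
trainable = LAYERS * per_layer
fraction = trainable / TOTAL_PARAMS

full_7b_gb = full_finetune_gb(7e9)   # the 56 GB figure from the text
```

With these shapes the adapters come to about 6.8 million parameters, under 0.1% of the model. A model whose v_proj is as wide as its q_proj lands slightly above 0.1%, so the exact fraction depends on the architecture and on which modules are targeted.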

Why it appeared: Hu et al. 2022, LoRA: Low-Rank Adaptation of Large Language Models. The insight is that learning a small change on top of a pretrained model is enough. QLoRA (Dettmers 2023) further reduces memory by keeping the base model in 4-bit.

The framing: outside large labs, full fine-tuning is largely gone; LoRA + QLoRA + careful data curation is the new standard. Efficient adaptation is the dominant paradigm.

8. Variants & Case Studies

8.1 Inference Engines Compared

Engine	Strength	Weakness
transformers	Best for research and experimentation	Inference speed average
vLLM	Very fast via paged KV cache	Needs server-class GPU
TGI (Hugging Face)	Production inference server	More complex to operate
llama.cpp	Fast on CPU / Mac	Python integration is separate
MLX	Highest efficiency on Apple Silicon	macOS only
Ollama	Best desktop UX	Limited advanced options
LM Studio	GUI, beginner-friendly	Depends on bundled engine

8.2 Training vs Inference Pipelines

Stage	Tools
Data prep	datasets, pandas, polars
Full training	PyTorch + DeepSpeed / FSDP
LoRA training	PEFT, mlx-lm
Evaluation	lm-evaluation-harness, OpenAI evals
Quantization	bitsandbytes, AWQ, GPTQ, GGUF
Serving	vLLM / TGI / Ollama / llama.cpp
Client	OpenAI SDK-compatible

8.3 Real-World Examples

Personal assistant: Ollama + a small model + local embeddings for offline RAG.
Code assistant: Continue.dev + Ollama as a Copilot alternative.
Desktop chatbot: LM Studio, Jan, Msty.
On-device search: an embedding model + ChromaDB.
Research experiments: Hugging Face Hub model + transformers + Weights & Biases.

9. Limits & Failure Modes

9.1 GPU / Memory Ceiling

Why it is essential: A 70B model in fp16 needs 140 GB. Even Q4_K_M-quantized, it's around 40 GB, beyond any single consumer GPU and most Macs. Model size itself becomes the ceiling.

How you spot it: Immediate OOM, or catastrophic slowdown from swap.

Next step:
Start with 4–13B.
For >13B, use Apple Silicon unified memory (64 GB+) or a GPU cluster.
MoE models (Mixtral 8×7B) activate only a subset of experts per token. That saves compute, not memory: all experts still have to be loaded.

9.2 Quantization Quality Loss

Why it is essential: Below Q4, precise tasks like code and math show visible degradation. The model is "almost right" but wrong at decisive points.

How you spot it: Compare fp16 vs Q4 outputs on the same prompts. Casual chat barely differs; code accumulates noise.

Next step:
Casual chat: Q4_K_M is fine.
Code / math: Q5_K_M or higher.
Reasoning models: Q6 or fp16 recommended.

9.3 Context Length

Why it is essential: KV cache memory scales linearly with context length. At 32k context, the KV cache can approach the size of the model weights, especially in models without grouped-query attention.
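The linear scaling is easy to put numbers on: every token stores one key and one value vector per layer. A rough sketch, assuming fp16 cache entries and two illustrative configurations (32 layers with head dimension 128, once with 32 KV heads and once with 8 as in grouped-query attention). The configurations are assumptions chosen for illustration, and kv_cache_gb is a hypothetical helper.

```python
def kv_cache_gb(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """KV cache size for one sequence, in GB (1 GB = 1e9 bytes)."""
    # Leading 2: one key vector and one value vector per layer per token.
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token_bytes * n_tokens / 1e9


CONTEXT = 32_768
full_heads = kv_cache_gb(CONTEXT, n_layers=32, n_kv_heads=32, head_dim=128)
grouped = kv_cache_gb(CONTEXT, n_layers=32, n_kv_heads=8, head_dim=128)
```

Under these assumptions the full-head configuration needs about 17 GB for a single 32k sequence, more than the 14 GB of fp16 weights, while the grouped-query configuration needs a quarter of that.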

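How you spot it: OOM on long inputs, or a spike in response time.

Next step:
Sliding Window Attention (Mistral): keep only a fixed window of past tokens.
Flash Attention 2: avoids materializing the full attention matrix, cutting activation memory (the KV cache itself is unchanged).
Mamba / RWKV: a fixed-size recurrent state, so inference memory is \(O(1)\) in sequence length \(L\).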
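The sliding-window idea from the list above can be sketched with a bounded deque. The strings stand in for per-token key/value tensors and the window size is a made-up value; real implementations use a rolling buffer over tensors rather than a Python container.

```python
from collections import deque

WINDOW = 4                        # hypothetical window size, in tokens
kv_cache = deque(maxlen=WINDOW)   # the oldest entry is evicted automatically

for t in range(10):               # pretend to decode 10 tokens
    kv_cache.append((f"k{t}", f"v{t}"))

cached_keys = [k for k, _ in kv_cache]   # only the last WINDOW tokens remain
```

Memory stays constant no matter how long generation runs, at the cost that tokens outside the window are no longer directly attended to.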

9.4 Inference Speed

Why it is essential: Token-by-token generation is memory-bandwidth bound, most visibly on CPU. Rough figures for a 7B model: CPU 5–20 tok/s, Apple Silicon 30–50 tok/s, RTX 4090 100+ tok/s.
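A bandwidth-bound workload allows a quick ceiling estimate: if every generated token has to read all weights once, tokens per second cannot exceed memory bandwidth divided by model size. The sketch below uses nominal peak bandwidths as illustrative assumptions (the DDR5 figure in particular is a round number), and max_tokens_per_second is a hypothetical helper. Measured speeds land below these ceilings.

```python
def max_tokens_per_second(bandwidth_gb_s, model_gb):
    """Upper bound for single-stream decoding: each token reads all weights once."""
    return bandwidth_gb_s / model_gb


MODEL_GB = 4.4  # 7B model at Q4_K_M, the figure used in Section 7.2

# Nominal peak memory bandwidth in GB/s (illustrative assumptions).
devices = {
    "dual-channel DDR5 desktop CPU": 80,
    "Apple M2 Max": 400,
    "RTX 4090": 1008,
}
ceilings = {name: max_tokens_per_second(bw, MODEL_GB) for name, bw in devices.items()}
```

The ordering of the ceilings matches the ordering of the measured figures in the text, which is the practical takeaway: for local chat, memory bandwidth predicts speed better than raw compute does.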

How you spot it: Measure whether interactive chat is usable.

Next step:
Smaller model (≤ 3B): faster responses.
Speculative decoding: a small model drafts, the big model verifies.
Efficient engines (vLLM) + quantization.

9.5 Version / Format Churn

Why it is essential: The GGUF format keeps evolving (V1 → V2 → V3). llama.cpp / Ollama updates may break model-file compatibility.

How you spot it: Errors mentioning the quantization format or file version during model loading.

Next step:
Update regularly and record which model file and engine version you're using.
Re-quantize to the latest GGUF when possible.
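Recording the format version can be automated. A GGUF file starts with the four magic bytes GGUF followed by a 32-bit version number; the sketch below reads just those eight bytes. It assumes the common little-endian layout, read_gguf_version is a hypothetical helper, and the demo runs on a synthetic header rather than a real model file.

```python
import os
import struct
import tempfile


def read_gguf_version(path):
    """Return the GGUF format version stored in the first 8 bytes of the file."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != b"GGUF":
        raise ValueError("not a GGUF file")
    (version,) = struct.unpack("<I", header[4:8])  # little-endian uint32 (assumed)
    return version


# Demo on a synthetic header, not a real model file.
with tempfile.NamedTemporaryFile(suffix=".gguf", delete=False) as tmp:
    tmp.write(b"GGUF" + struct.pack("<I", 3))
detected = read_gguf_version(tmp.name)
os.remove(tmp.name)
```

Logging this value next to the engine version makes a later loading failure much easier to diagnose.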

9.6 Security and Privacy

Why it is essential: Models can surface implicit biases from training data. Sensitive domains (medical, legal) require evaluation.

How you spot it: Run domain-specific evaluation sets to measure output quality and bias.

Next step:
Domain-specific fine-tuning (LoRA): the §7.5 pattern.
Guardrails (NeMo Guardrails, LangChain) for output filtering.
For sensitive data, local LLMs provide a privacy advantage: data never leaves the device.

10. Quick Recap — Answer Before You Peek

Five core questions this article answered. Cover the answers, give a one-line response yourself, then check.

Q1. What was the decisive reason the transformers library became the LLM ecosystem standard?

Answer: AutoModelForCausalLM.from_pretrained(...), a one-line common API for any pretrained model.
Why: No need to know per-architecture classes; the Auto classes pick the right one. This was the pretrained-model sharing infrastructure. (Section 7.1.)

Q2. Why is 4-bit quantization (Q4_K_M) popular? Its limit?

Answer: It shrinks a 7B model from 14 GB to 4.4 GB, fitting consumer hardware. Limit: noticeable degradation on precise tasks like code and math, so Q5+ is recommended there.
Why: llama.cpp + GGUF (2023) was the decisive moment of LLM democratization. Quantization removed the ceiling that model size had placed on local inference. (Sections 7.2, 9.2.)

Q3. Why do vLLM and llama.cpp run the same model at different speeds?

Answer: Each engine optimizes for a different environment. vLLM targets server GPUs with many users (PagedAttention); llama.cpp targets CPU/Mac (SIMD).
Why: "Which engine in which environment" matters more than "which model." A 7B model can reach on the order of 1000 tok/s of aggregate throughput on vLLM and around 5 tok/s on llama.cpp on CPU. (Section 7.3.)

Q4. Why did the OpenAI-compatible API become the de facto interface across local LLM tools?

Answer: It is effectively the common interface. The same client code targets local (Ollama) and cloud (OpenAI); only base_url changes.
Why: Interface standardization = vendor lock-in avoidance. Swapping backends costs almost nothing. (Section 7.4.)

Q5. Why did LoRA replace full fine-tuning?

Answer: It trains roughly 0.1% of the parameters or less, fitting consumer-grade memory. Learning the "small change" on top of the pretrained model suffices.
Why: Hu et al. 2022 LoRA. QLoRA (Dettmers 2023) keeps the base model in 4-bit, cutting further: full fine-tune about 56 GB → QLoRA about 6 GB. (Section 7.5.)

If you answered four or five easily, the five pivot points of the local LLM workflow are in place.

11. Closing the Series — Where to Next

Across these nine parts we covered:

Part 1 sklearn: the shared grammar of ML
Part 2 k-NN: distance, dimension, and bias-variance in one model
Part 3 linear regression: where loss and optimization start
Part 4 logistic regression: probabilities, cross-entropy, MLE
Part 5 regularization: containing overfitting
Part 6 neural networks: backprop and the universal approximation theorem
Part 7 training craft: Adam, BatchNorm, initialization, schedules
Part 8 architectures: CNN, RNN, Transformer
Part 9 (this part): frameworks and local LLMs

The natural sequel is the LLM Core Study series (six parts).

The inside of a single Transformer block: tokenization, embeddings, attention, positional encoding
Fine-tuning: LoRA, QLoRA, distillation
Decoding: greedy, beam, top-p, speculative
Advanced: RAG, CoT, MoE, in-context learning
Mathematical intuition: softmax, CE, KL, LayerNorm
The study roadmap

With the foundations in place, the next series tracks why an LLM looks the way it does, at the same depth.

12. Further Reading

Primary sources

Paszke, A. et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. NeurIPS 2019. arXiv:1912.01703.
Abadi, M. et al. TensorFlow: A system for large-scale machine learning. OSDI 2016. arXiv:1605.08695.
Wolf, T. et al. Transformers: State-of-the-Art Natural Language Processing. EMNLP 2020 Demo. arXiv:1910.03771.
Hu, E. J. et al. LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022. arXiv:2106.09685.
Dettmers, T. et al. QLoRA: Efficient Finetuning of Quantized LLMs. NeurIPS 2023. arXiv:2305.14314.

Official docs and repositories

PyTorch: https://pytorch.org/docs/stable/
TensorFlow: https://www.tensorflow.org/api_docs/
Hugging Face transformers: https://huggingface.co/docs/transformers/
PEFT: https://huggingface.co/docs/peft/
llama.cpp: https://github.com/ggml-org/llama.cpp
MLX: https://ml-explore.github.io/mlx/
Ollama: https://ollama.com/
vLLM: https://docs.vllm.ai/

Companion books

d2l.ai (Dive into Deep Learning): free textbook.
Karpathy, A. Neural Networks: Zero to Hero (YouTube). Backprop and Transformers from scratch.
Chip Huyen. Designing Machine Learning Systems. O'Reilly 2022. The deployment-and-ops side.

I hope the series helped clarify the larger shape of machine learning. See you in the next series, where we dive into the inside of an LLM at the same depth.
