"LLM Core Study (2/6) — Fine-tuning: LoRA, QLoRA, Distillation, Adapter"

Adapting a pretrained LLM to your data is the topic of this lecture. We start from the memory bookkeeping of full fine-tuning, then walk the PEFT family — Adapter → LoRA → QLoRA — and end with knowledge distillation. Each technique is shown as the same picture at a different zoom level.


0. Learning Objectives

  • Split full fine-tuning memory into the four canonical components (weights, gradients, optimizer state, activations).
  • State the difference between Adapter (Houlsby 2019) and LoRA (Hu 2021) in one line each.
  • Explain why \(\Delta W = BA\) has rank at most \(r\) and what choosing \(r\) decides.
  • Take apart QLoRA (Dettmers 2023) into NF4, double quantisation, and paged optimiser.
  • Define catastrophic forgetting from a distribution-shift perspective and list two mitigations.
  • Derive Hinton 2015's distillation loss and explain the role of the \(T^2\) factor.

1. Key Summary

  • Full FT is expensive in memory, disk, and time. A 7 B model needs serious multi-GPU machinery.
  • PEFT trains only a tiny fraction of parameters yet matches FT on many tasks.
  • LoRA: factor weight updates as \(\Delta W = BA\) with rank \(r \ll \min(d_{\text{in}}, d_{\text{out}})\).
  • QLoRA: quantise the base to 4-bit (NF4) and keep LoRA in BF16; nearly recovers full-precision FT.
  • Catastrophic forgetting: the model loses old capabilities when adapted to a new distribution. PEFT partially mitigates.
  • Distillation: train a student on the soft targets of a teacher so the student learns inter-class structure for free.

2. Intuition: why pretraining alone is not enough

Pretraining tunes weights to predict the next token over a generic distribution. Your use case is usually a narrow slice: a domain (law, medical), a format (JSON, function calls), a policy (refusals). Fine-tuning moves weights toward that slice.

The cost is the trouble. For a 7 B model in BF16: weights are 14 GB; gradients another 14 GB; AdamW state (FP32 first and second moments, 8 bytes per parameter) adds ~56 GB, four times the weights; activations take 10–30 GB depending on sequence length and gradient checkpointing. Total: ~94–114 GB. That does not fit on a single 80 GB GPU, so you need to shard the training state (ZeRO/FSDP), shrink the trainable footprint (PEFT), or both.

The PEFT idea: train only a tiny fraction of the parameters, so optimiser state and gradients become tiny, even if the weights still take their full slot in memory.


3. Full Fine-Tuning

3.1 Loss

Standard next-token cross-entropy:

$$ \mathcal{L}(\theta) = -\sum_{t=1}^{T} \log P_\theta(x_t \mid x_{<t}), $$

optimised with AdamW (Loshchilov & Hutter 2019):

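$$ m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t,\ \ v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2, $$

$$ \hat{m}_t = \frac{m_t}{1-\beta_1^t},\ \ \hat{v}_t = \frac{v_t}{1-\beta_2^t}, $$

$$ \theta_t = \theta_{t-1} - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} - \eta \lambda \theta_{t-1}, $$

where \(g_t = \nabla_\theta \mathcal{L}(\theta_{t-1})\), the hats denote bias-corrected moments, \(\eta\) is the learning rate, and \(\lambda\) is the decoupled weight-decay coefficient.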
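The update rule can be traced by hand on a single scalar parameter. The sketch below is a minimal plain-Python version, not the PyTorch implementation; the function name `adamw_step` and the hyperparameter defaults are illustrative assumptions.

```python
import math


def adamw_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter; t starts at 1."""
    m = beta1 * m + (1 - beta1) * g          # first moment (running mean of g)
    v = beta2 * v + (1 - beta2) * g * g      # second moment (running mean of g^2)
    m_hat = m / (1 - beta1 ** t)             # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    # Adaptive step plus decoupled weight decay, both applied to the old theta.
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps) - lr * weight_decay * theta
    return theta, m, v


theta, m, v = 1.0, 0.0, 0.0
theta, m, v = adamw_step(theta, g=0.5, m=m, v=v, t=1)
print(theta)  # close to 0.99899: one step of size ~lr plus a small decay
```

Note that `m` and `v` are two extra numbers carried per parameter, which is exactly where the optimizer-state line in the memory budget comes from.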

3.2 Memory decomposition (7 B, BF16 mixed precision)

Component                      | Size
Weights \(\theta\) (BF16)      | 14 GB
Gradients \(g\) (BF16)         | 14 GB
AdamW \(m, v\) (FP32)          | 56 GB
Activations (4 K seq, BS 1)    | 10–30 GB
Total                          | 94–114 GB

The total does not fit on a single 80 GB A100. Either distribute (FSDP/ZeRO-3) or shrink the trainable footprint (PEFT).
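The budget above is plain arithmetic, so it is easy to recompute for other model sizes. A minimal sketch, assuming 2 bytes per parameter for BF16 weights and gradients, 8 bytes for the two FP32 AdamW moments, and 1 GB = 1e9 bytes; the function name is hypothetical and the activation figure is passed in rather than modelled.

```python
def full_ft_memory_gb(n_params: float, activations_gb: float) -> dict:
    """Rough full fine-tuning memory budget in GB (1 GB = 1e9 bytes)."""
    bytes_per_param = {
        "weights": 2,      # BF16 weights
        "gradients": 2,    # BF16 gradients
        "adamw_state": 8,  # FP32 first and second moments, 4 bytes each
    }
    budget = {name: n_params * b / 1e9 for name, b in bytes_per_param.items()}
    budget["activations"] = activations_gb
    budget["total"] = sum(budget.values())
    return budget


low = full_ft_memory_gb(7e9, activations_gb=10)
high = full_ft_memory_gb(7e9, activations_gb=30)
print(low["adamw_state"], low["total"], high["total"])  # 56.0 94.0 114.0
```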

3.3 Limits

  • Cost. Wasteful for small datasets.
  • Catastrophic forgetting risk (see §7).
  • Deployment overhead: a separate 14 GB checkpoint per task (7 B parameters in BF16).

4. Adapter — the first PEFT

Houlsby et al., 2019 inserted small down-up projections inside each transformer block:

$$ h \leftarrow h + W_{\text{up}} \, \sigma(W_{\text{down}} h), $$

with \(W_{\text{down}} \in \mathbb{R}^{r \times d}, W_{\text{up}} \in \mathbb{R}^{d \times r}\). Pretrained weights stay frozen; only adapters train. With 1–5 % trainable parameters, BERT-class models nearly matched full fine-tuning.

The cost is inference latency — each block gains extra matmuls. LoRA was motivated to keep adapter benefits without the inference penalty.
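A toy version of the adapter update makes both the residual form and the extra per-block work visible. This is a sketch in plain Python lists, not Houlsby et al.'s implementation: `tanh` stands in for \(\sigma\), biases are omitted, and the zero initialisation of \(W_{\text{up}}\) is an assumption chosen so the block starts as an identity.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def adapter_forward(h, W_down, W_up):
    """h + W_up * sigma(W_down h), with sigma = tanh in this sketch."""
    z = [math.tanh(a) for a in matvec(W_down, h)]      # bottleneck of width r
    return [hi + ui for hi, ui in zip(h, matvec(W_up, z))]


d, r = 8, 2
rng = random.Random(0)
W_down = [[rng.gauss(0, 0.1) for _ in range(d)] for _ in range(r)]
W_up = [[0.0] * r for _ in range(d)]       # zero init: adapter starts as identity
h = [rng.gauss(0, 1) for _ in range(d)]

print(adapter_forward(h, W_down, W_up) == h)   # True
print("trainable params per adapter:", 2 * d * r)
```

The two `matvec` calls are the extra matmuls that stay in the forward pass at inference time.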


5. LoRA (Hu 2021)

5.1 The construction

For a linear layer \(y = W_0 x\) with \(W_0 \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}\),

$$ y = W_0 x + \Delta W \, x,\ \ \Delta W = B A, $$

with \(A \in \mathbb{R}^{r \times d_{\text{in}}}\), \(B \in \mathbb{R}^{d_{\text{out}} \times r}\), and \(r \ll \min(d_{\text{in}}, d_{\text{out}})\) (commonly 4–64).

  • \(W_0\) is frozen; only \(A, B\) train.
  • Initialise \(A \sim \mathcal{N}(0, \sigma^2)\) and \(B = 0\): the model starts indistinguishable from the base.
  • Practical scaling: \(\Delta W = (\alpha / r) BA\), which keeps the update magnitude roughly comparable across ranks and reduces the need to retune the learning rate when \(r\) changes.
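The rank claim follows directly from the shapes above. A short derivation in the notation of this section:

$$ \operatorname{rank}(\Delta W) = \operatorname{rank}(BA) \le \min\big(\operatorname{rank}(B), \operatorname{rank}(A)\big) \le r, $$

because \(A\) has only \(r\) rows and \(B\) has only \(r\) columns. Equivalently, \(\Delta W x = B(Ax)\) first compresses \(x\) to \(r\) numbers and then expands them, so \(r\) caps how many independent directions the update can change.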

5.2 Why low-rank works

The hypothesis: a strong pretrained model is already "almost right," so adaptation lives in a low-rank correction. Hu et al., 2021 showed that GPT-3 175B benefits from LoRA with rank as low as 1 on some tasks. Typical task rank is 8–64.

5.3 Merging at inference

After training, compute \(W = W_0 + (\alpha/r) B A\) once and replace the base weight. Inference cost is identical to the base — no Adapter penalty. You can hot-swap multiple LoRAs by adding/subtracting their \(\Delta W\)s on demand.
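The equivalence between the merged and unmerged paths can be checked numerically. A minimal sketch with small random matrices in plain Python; the sizes, seed, and variable names are arbitrary choices for illustration.

```python
import random


def matmul(X, Y):
    """Matrix product of two lists-of-rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


rng = random.Random(0)
d_in, d_out, r, alpha = 6, 5, 2, 4.0
scale = alpha / r
W0 = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(r)]
B = [[rng.gauss(0, 1) for _ in range(r)] for _ in range(d_out)]
x = [rng.gauss(0, 1) for _ in range(d_in)]

# Unmerged path: base output plus the scaled low-rank correction.
low_rank = matvec(B, matvec(A, x))
y_unmerged = [b + scale * c for b, c in zip(matvec(W0, x), low_rank)]

# Merged path: fold the correction into one matrix, then a single matmul.
BA = matmul(B, A)
W_merged = [[w + scale * u for w, u in zip(w_row, u_row)] for w_row, u_row in zip(W0, BA)]
y_merged = matvec(W_merged, x)
max_diff = max(abs(a - b) for a, b in zip(y_unmerged, y_merged))

# Hot-swap: subtracting the same update restores the base weight.
W_restored = [[w - scale * u for w, u in zip(w_row, u_row)] for w_row, u_row in zip(W_merged, BA)]
restore_diff = max(abs(a - b) for ra, rb in zip(W_restored, W0) for a, b in zip(ra, rb))
print(max_diff, restore_diff)  # both at floating-point rounding level
```

In low-precision formats such as BF16, repeated add/subtract cycles accumulate rounding error, so keeping a pristine copy of \(W_0\) is the safer way to swap adapters.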

5.4 Parameter savings

For \(d = 4096, r = 8\): \(W_0\) has \(\sim 1.68 \times 10^7\) parameters; LoRA adds \(r(d_{\text{in}} + d_{\text{out}}) = 65{,}536\). About 0.4 %. Optimizer state and gradients shrink by the same factor.
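The same count for any layer shape and rank, as a small helper (the function name is illustrative):

```python
def lora_param_counts(d_in: int, d_out: int, r: int):
    """Return (base params, LoRA params, LoRA / base) for one linear layer."""
    base = d_in * d_out
    lora = r * (d_in + d_out)   # A is r x d_in, B is d_out x r
    return base, lora, lora / base


base, lora, frac = lora_param_counts(4096, 4096, 8)
print(base, lora, f"{frac:.2%}")  # 16777216 65536 0.39%
```

Doubling \(r\) doubles the LoRA count, so the fraction stays below 2 % for this layer even at \(r = 32\).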

5.5 Where to insert LoRA

The original paper applied LoRA only to attention's \(W^Q\) and \(W^V\). Follow-up work extends to \(W^K, W^O\) and the MLP \(W_1, W_2\). Choosing which modules and which rank per module is the main LoRA tuning decision.

5.6 Minimal PyTorch

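import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear and add a trainable low-rank update (alpha / r) * B A."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        # Freeze the pretrained weight (and bias, if any); only A and B train.
        for p in self.base.parameters():
            p.requires_grad = False
        in_f, out_f = base.in_features, base.out_features
        # Create A and B on the same device and dtype as the base weight.
        kw = {"device": base.weight.device, "dtype": base.weight.dtype}
        # A: small Gaussian, B: zeros, so B A = 0 and the layer starts equal to the base.
        self.A = nn.Parameter(torch.randn(r, in_f, **kw) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_f, r, **kw))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x @ A^T projects to r dims, @ B^T projects back to out_f; B A is never materialised.
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale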

5.7 Limits

  • Tasks far from the pretraining distribution (e.g. protein sequences) often still need full FT.
  • Choice of target modules and rank is heuristic.
  • Forgetting is mitigated but not eliminated.

5.8 Practice

  1. Replace \(B = 0\) with \(B \sim \mathcal{N}(0, 0.01)\) and observe output variance at step 1.
  2. Train two LoRAs on the same base and combine them via summed \(\Delta W\); measure task interference.
  3. Sweep \(r \in \{1, 4, 8, 32\}\) and plot validation perplexity.

6. QLoRA (Dettmers 2023)

6.1 Motivation

LoRA shrinks what you train but not how big the base is in GPU memory. QLoRA additionally quantises the frozen base to 4-bit, letting 65B-class models fine-tune on a single 48 GB GPU while matching 16-bit performance.

6.2 NF4 (NormalFloat 4-bit)

Weights are approximately zero-mean Gaussian. NF4 defines its 16 quantisation levels as the quantiles of a standard normal, not as equally spaced integers. Same memory budget (4 bits / parameter), much better fit, hence less quantisation noise.

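A 7 B model with NF4 is roughly 3.5 GB of weights, plus quantisation constants: about 0.44 GB with one FP32 scale per 64-weight block, dropping to about 0.11 GB with double quantisation (§6.3).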
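The quantile idea can be illustrated in a few lines. This is a simplified sketch, not the exact NF4 code table shipped in bitsandbytes (which is built asymmetrically so that zero is represented exactly); the function names and the toy block are assumptions for illustration.

```python
from statistics import NormalDist


def normal_quantile_levels(bits: int = 4):
    """Levels at evenly spaced quantiles of N(0, 1), rescaled to [-1, 1]."""
    n = 2 ** bits
    probs = [(i + 0.5) / n for i in range(n)]          # avoids p = 0 and p = 1
    q = [NormalDist().inv_cdf(p) for p in probs]
    top = max(abs(v) for v in q)
    return [v / top for v in q]


def quantise_block(weights, levels):
    """Absmax-scale a block, then snap each weight to its nearest level."""
    absmax = max(abs(w) for w in weights) or 1.0
    codes = [min(range(len(levels)), key=lambda i: abs(w / absmax - levels[i]))
             for w in weights]
    return codes, absmax


def dequantise_block(codes, absmax, levels):
    return [levels[c] * absmax for c in codes]


levels = normal_quantile_levels()
block = [0.02, -0.11, 0.07, 0.30, -0.25, 0.0, 0.15, -0.04]
codes, absmax = quantise_block(block, levels)
restored = dequantise_block(codes, absmax, levels)
print(codes)                               # one 4-bit index per weight
print([round(w, 3) for w in restored])     # approximate reconstruction
```

The levels are packed more tightly near zero than near ±1, which is where most of a roughly Gaussian weight block sits; equally spaced INT4 levels spend the same resolution everywhere.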

6.3 Double quantisation

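Block-wise quantisation stores one FP32 scale per block (e.g. 64 weights), an overhead of 32/64 = 0.5 bit per weight. Double quantisation quantises those scales too (to 8 bits in groups of 256 in the QLoRA paper), cutting the overhead to about 0.127 bit per weight, a saving of roughly 0.37 bit.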
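The per-weight overhead is quick to verify. The sketch below assumes the configuration reported in the QLoRA paper (FP32 scales per 64-weight block, re-quantised to 8 bits in groups of 256 with one FP32 second-level scale per group); those second-level figures are not stated in this article.

```python
block_size = 64          # weights per first-level block
scale_bits = 32          # one FP32 scale per block
dq_scale_bits = 8        # first-level scales re-quantised to 8 bits
dq_block_size = 256      # first-level scales per second-level block

before = scale_bits / block_size
after = dq_scale_bits / block_size + scale_bits / (block_size * dq_block_size)
saved_gb_7b = 7e9 * (before - after) / 8 / 1e9   # bits -> bytes -> GB

print(f"{before:.3f} -> {after:.3f} bits/weight, saving {before - after:.3f}")
# 0.500 -> 0.127 bits/weight, saving 0.373
print(f"{saved_gb_7b:.2f} GB saved on a 7 B model")
```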

6.4 Paged optimiser

Activation and optimiser spikes during long sequences can OOM. QLoRA uses NVIDIA unified memory to page optimiser state between CPU and GPU. Stable under spikes, slightly slower under sustained load.

6.5 Training flow

  1. Load \(W_0\) as NF4 + double-quant (read-only).
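  2. On forward, dequantise the 4-bit weights to BF16 on the fly. Gradients flow through the dequantised matmul to reach the activations and the LoRA adapters, but no gradient is computed for the 4-bit codes, which are never updated.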
  3. Train only the LoRA adapters (BF16). Optimizer state lives only on those.
  4. For deployment, merge LoRA into a BF16 base (or keep base quantised and apply LoRA separately).
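Steps 1 to 3 map fairly directly onto the Hugging Face stack. The sketch below assumes `transformers`, `peft`, and `bitsandbytes` are installed and a CUDA GPU is available; `model_id` is a placeholder you supply, the `q_proj`/`v_proj` module names assume a Llama-style architecture, and the rank, alpha, and dropout values are arbitrary starting points rather than recommendations.

```python
def load_qlora_model(model_id: str, r: int = 16, alpha: int = 32):
    """Sketch of steps 1-3: NF4 base with double quantisation + BF16 LoRA adapters.

    Imports are local so the snippet can be read without the libraries installed.
    """
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",               # step 1: NF4 storage
        bnb_4bit_use_double_quant=True,          # step 1: double quantisation
        bnb_4bit_compute_dtype=torch.bfloat16,   # step 2: BF16 compute after dequantising
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config, device_map="auto"
    )
    model = prepare_model_for_kbit_training(model)
    lora_config = LoraConfig(
        r=r, lora_alpha=alpha, lora_dropout=0.05, bias="none",
        target_modules=["q_proj", "v_proj"],     # assumed Llama-style module names
        task_type="CAUSAL_LM",
    )
    return get_peft_model(model, lora_config)    # step 3: only the adapters train
```

With the Hugging Face `Trainer`, the paged optimiser from §6.4 is typically selected through `TrainingArguments(optim="paged_adamw_32bit")`. Calling `print_trainable_parameters()` on the returned model is a quick check that only the adapters are trainable.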

6.6 Limits and gotchas

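  • The base stays quantised. Merging LoRA into a quantised base is non-trivial: the merged weight has to be re-quantised, which adds error the adapter was never trained against. A common route is to dequantise, merge in BF16, and re-evaluate.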
  • KV cache and activations are not automatically quantised; for inference savings see GPTQ/AWQ separately.
  • Small loss-of-precision can compound across tens of LoRA targets — always re-evaluate.

6.7 Practice

  1. Load a 7 B model in NF4 vs BF16 with bitsandbytes and compare GPU memory.
  2. Run identical training scripts (BF16 base vs NF4 base) and compare perplexity and throughput.
  3. Measure NF4 quantisation error per layer to identify the most-affected modules.

7. Catastrophic Forgetting

7.1 Definition

A model that performed well on a distribution \(\mathcal{D}_A\) degrades sharply after further training on \(\mathcal{D}_B\). First named by McCloskey & Cohen (1989) in neural network research.

7.2 LLM manifestations

  • An English pretrained model fine-tuned on Korean often loses English MMLU score.
  • Aggressive code-specific fine-tuning reduces general conversational ability.
  • Safety fine-tuning sometimes hurts general capability if overdone.

7.3 Mitigations

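  • PEFT itself. LoRA leaves \(W_0\) untouched, so detaching the adapter recovers the base model exactly. With the adapter active the model can still forget, which is why the mitigation is only partial; merging the adapter and discarding the original \(W_0\) gives up the option to revert.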
  • Replay. Mix pretraining-style data into the fine-tuning set. Strong, but you need access to similar data.
  • EWC (Kirkpatrick 2017). Penalise drift on parameters with high Fisher information:

$$ \mathcal{L}_{\text{EWC}} = \mathcal{L}_{\text{new}}(\theta) + \frac{\lambda}{2} \sum_i F_i (\theta_i - \theta_i^A)^2. $$

  • Mixed curriculum. Practical and cheap: e.g. 70 % new task, 30 % general pretraining text.
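The EWC penalty from the list above is a weighted squared distance, which a few lines make concrete. A minimal sketch with hand-picked numbers; the Fisher values are hypothetical, and in practice they are estimated from squared gradients on task A data.

```python
def ewc_loss(new_task_loss, theta, theta_a, fisher, lam):
    """L_new + (lambda / 2) * sum_i F_i * (theta_i - theta_i^A)^2."""
    penalty = sum(f * (t - ta) ** 2 for f, t, ta in zip(fisher, theta, theta_a))
    return new_task_loss + 0.5 * lam * penalty


theta_a = [1.0, -2.0, 0.5]       # parameters after task A
theta = [1.5, -2.0, 1.5]         # parameters during training on task B
fisher = [4.0, 1.0, 0.1]         # diagonal Fisher estimates (hypothetical values)
print(ewc_loss(0.8, theta, theta_a, fisher, lam=2.0))  # about 1.9
```

The first parameter moved by only 0.5 but contributes 1.0 to the penalty, while the third moved by 1.0 and contributes 0.1: drift is cheap exactly where task A did not care.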

7.4 Practice

  1. Compare MMLU before/after (a) full FT, (b) LoRA on Korean instruction data.
  2. Sweep replay ratios (0 / 10 / 30 %) and plot the trade-off curve.

8. Knowledge Distillation (Hinton 2015)

8.1 The loss

Train a student \(P_S\) to match a teacher \(P_T\):

$$ \mathcal{L}_{\text{KD}} = \alpha \cdot \mathcal{L}_{\text{CE}}(P_S, y_{\text{true}}) + (1 - \alpha) \cdot T^2 \cdot \mathrm{KL}(P_T^{(T)} \,\|\, P_S^{(T)}), $$

where the temperature-softened softmax is

$$ P^{(T)}(y) = \frac{\exp(z_y / T)}{\sum_{y'} \exp(z_{y'} / T)},\ \ T > 1. $$

The \(T^2\) factor restores gradient magnitude after temperature softening (gradients scale by \(1/T^2\) at the softmax).
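The loss above can be written out directly. A minimal sketch in plain Python for a single example; the logits are made-up numbers, and the hard-label term is evaluated at \(T = 1\), which is the usual convention but is an assumption here since the formula does not state it.

```python
import math


def softmax(logits, T=1.0):
    """Temperature-softened softmax, shifted by the max for numerical stability."""
    m = max(z / T for z in logits)
    exps = [math.exp(z / T - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, true_idx, alpha=0.5, T=4.0):
    """alpha * CE(student, label) + (1 - alpha) * T^2 * KL(teacher_T || student_T)."""
    ce = -math.log(softmax(student_logits)[true_idx])    # hard-label term at T = 1
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s))
    return alpha * ce + (1 - alpha) * T * T * kl


teacher = [6.0, 2.5, -3.0]     # "dog", "cat", "truck" logits (made-up numbers)
student = [3.0, 0.5, 0.0]
print([round(p, 3) for p in softmax(teacher, T=1.0)])   # nearly one-hot
print([round(p, 3) for p in softmax(teacher, T=4.0)])   # softened: "cat" becomes visible
print(kd_loss(student, teacher, true_idx=0))
```

At \(T = 1\) the teacher is almost one-hot, so the soft target carries little beyond the label; raising \(T\) exposes the relative ordering of the wrong classes.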

8.2 Why it works

  • One-hot labels carry no inter-class information.
  • A teacher's soft target says "this is mostly a dog, but slightly cat-like, and very not-truck" — dark knowledge that helps the student generalise from less data.

8.3 LLM-scale distillation

  • DistilBERT (Sanh 2019) keeps 97 % of BERT-base quality at 40 % fewer parameters.
  • Some recent small LLMs (e.g. Llama-3.2-1B) are distilled from larger siblings; others of similar size, such as TinyLlama, are pretrained from scratch.
  • For generative models, sequence-level distillation (student matches teacher-generated sequences) is common and often more practical than logit matching.

8.4 Practice

  1. On a small classification dataset, compare student trained with one-hot CE vs distilled with \(T = 4\).
  2. Sweep \((\alpha, T)\) and plot validation accuracy contours.

9. Vocabulary management — the silent foot-gun

If you fine-tune in a domain with frequent rare tokens (drug names, code identifiers), they split into many sub-words. Two responses:

  1. Extend the vocabulary. Add new rows to embedding and LM head, initialised at the mean of existing rows. The model must learn what the new token means from scratch.
  2. Live with it. Higher token counts and cost, but weights remain unchanged.

With LoRA, extending the vocabulary is awkward because the embedding/LM-head rows are not normally trained. Most practitioners keep the vocabulary and add data.
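The mean initialisation from option 1 looks like this on a toy embedding table. This is a sketch in plain Python lists; the function name is illustrative, and a real model would apply the same idea to both the embedding matrix and the LM head.

```python
def extend_embeddings(embedding, n_new):
    """Append n_new rows, each set to the column-wise mean of the existing rows."""
    n_rows = len(embedding)
    mean_row = [sum(col) / n_rows for col in zip(*embedding)]
    return embedding + [list(mean_row) for _ in range(n_new)]


embedding = [[1.0, 0.0], [0.0, 2.0], [2.0, 4.0]]   # toy 3-token, 2-dim table
extended = extend_embeddings(embedding, n_new=2)
print(len(extended), extended[-1])  # 5 [1.0, 2.0]
```

In Hugging Face `transformers`, the resizing itself is usually done with `model.resize_token_embeddings(len(tokenizer))`; the new rows then have to be made trainable alongside the LoRA adapters, which is the awkward part noted above.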


10. Quick Recap — Answer Before You Peek

Five core questions this article answered. Cover the answers, give a one-line response yourself, then check.

Q1. How much memory does AdamW's optimizer state take relative to weights?

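Answer: Roughly 4×. AdamW keeps two FP32 moments per parameter (4 bytes each, 8 bytes total) against 2-byte BF16 weights, so 8 / 2 = 4. For a 7B model that is 56 GB of optimizer state on top of 14 GB of weights; with gradients and activations, full fine-tuning totals roughly 94–114 GB.

Why: The ceiling on full fine-tuning is optimizer state, not model size. This is why LoRA and QLoRA emerged: LoRA shrinks gradients and optimizer state to the adapter parameters, and QLoRA additionally shrinks the 14 GB base to about 3.5 GB in NF4. (Section §3 Full Fine-Tuning.)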

Q2. What are the shapes and initialization of \(A\) and \(B\) in LoRA?

Answer: \(A \in \mathbb{R}^{r \times d_{\text{in}}}\) is initialized from a small zero-mean Gaussian; \(B \in \mathbb{R}^{d_{\text{out}} \times r}\) is initialized to zero. So at training start, \(BA = 0\) and the base model is untouched.

Why: Guarantees that the initial state does not break the base. Only as \(B\) drifts from zero during training does the correction take effect. Same identity-as-default pattern as ResNet's skip connection. (Section §5 LoRA.)

Q3. What does "merging LoRA into the base" buy at inference?

Answer: You precompute \(W_0 + (\alpha/r) BA\) into a single \(W'\). At inference no extra matmul runs, so latency is the same as the base model.

Why: Separating \(BA\) saves training memory; merging saves inference latency. This is why LoRA became standard across training and inference. (Section §5.)

Q4. NF4 vs INT4 in one line — what's the difference?

Answer: NF4 (NormalFloat 4-bit) places its 16 levels at the quantiles of a normal distribution; INT4 spaces them uniformly. LLM weights are roughly normally distributed, so NF4 loses less information than INT4 at the same 4 bits per weight.

Why: This is the storage format behind QLoRA (Dettmers 2023): NF4, double quantization, and the paged optimizer together let you LoRA-train a 65B model on a single 48 GB GPU. (Section §6 QLoRA.)

Q5. Define catastrophic forgetting in distribution-shift terms. Why is there a \(T^2\) factor in distillation?

Answer: Forgetting: fine-tuning on a new domain moves weights from the best fit of the old distribution toward the best fit of the new one, so old knowledge is overwritten. \(T^2\): dividing the logits by \(T\) flattens the distribution and shrinks the soft-target gradients by \(1/T^2\); multiplying by \(T^2\) compensates.

Why: Forgetting: a distribution-shift view of parameter movement; EWC (Kirkpatrick 2017) and LoRA partially address it. Distillation \(T^2\): Hinton 2015; it keeps the relative weight of the hard-target and soft-target terms roughly constant when \(T\) changes. (Sections §7 Forgetting, §8 Distillation.)


If four or five came out as one-liners, the memory decomposition → LoRA → QLoRA → distribution preservation arc is in place.



Part 2 of 6 in the LLM Core Study series. Part 3 covers decoding: Greedy, Beam, Top-p, Speculative.
