"ML Foundations (7/9) — Deep Learning Training: Optimizers, Regularization, Initialization"
Building an MLP, as we did in Part 6, is not the same as getting it to train well. The right optimizer, regularization, initialization, and learning rate — when these line up, deep networks converge. When they don't, the network refuses to learn at all. This part is the catalog of those crafts, with formulas, paper citations, and code in one place.
0. Learning Objectives
- Compare and write the update rules for SGD, Momentum, Nesterov, and Adam.
- Explain how Dropout, BatchNorm, and LayerNorm work and where they belong in a model.
- Derive the variance formulas for Xavier and He initialization and match them to activations.
- Implement step, cosine, and warmup learning-rate schedules in PyTorch.
- Explain why gradient clipping is effectively required for RNNs and Transformers.
- Diagnose the most common training failures (NaN, plateau, overfitting) and apply first-line fixes.
1. Key Summary
- SGD: \(w \leftarrow w - \eta \nabla\). Momentum + Nesterov accelerate flat and oscillating regions.
- Adam: tracks first and second moments to give every coordinate its own adaptive learning rate. The default for deep networks.
- Dropout: zero out random neurons during training; pass them through at inference. Noise-based regularization.
- BatchNorm: per-batch mean/variance normalization → faster, more stable training. Uses running stats at inference.
- LayerNorm: per-sample normalization. The Transformer standard.
- Initialization: Xavier (tanh) → \(\text{Var}(W) = 2/(n_{\text{in}} + n_{\text{out}})\); He (ReLU) → \(\text{Var}(W) = 2/n_{\text{in}}\).
- Learning-rate schedule: warmup → main training → cosine decay is the modern default.
- Gradient clipping: prevents exploding gradients. Effectively mandatory in RNNs and Transformers.
2. Intuition — Why Plain SGD Falls Short
2.1 The Shape of the Loss Landscape
A neural network loss \(\mathcal{L}(w)\) is non-convex, saddle-rich, and full of narrow valleys.
- Narrow valleys: steep in one direction, flat in another. Plain SGD oscillates along the steep axis and barely advances along the flat one.
- Saddle points: gradients shrink to near zero even though loss is still high.
- Flat plateaus: gradients near zero, learning stalls.
Momentum and Adam exist to deal with these specific landscape features.
2.2 Different Coordinates Want Different Learning Rates
Gradients on different features can vary in magnitude by orders of magnitude (sparse vs dense features). A single \(\eta\) for all coordinates either explodes one side or freezes the other. Adam adapts the learning rate per coordinate, and that single property is the reason it dominates.
2.3 Why Initialization Is Decisive
If the variance of the forward signal amplifies or shrinks layer by layer, the backward gradient does the same — vanishing or exploding. Initialization is the practice of choosing the weight variance so that the forward signal variance is preserved across layers.
3. Definitions & Notation — Optimizers
3.1 SGD
$$ w_{t+1} = w_t - \eta \, g_t, \quad g_t = \nabla_w \mathcal{L}(w_t) $$
Mini-batch SGD estimates \(g_t\) from \(B\) randomly selected examples at each step.
3.2 Momentum (Polyak 1964)
$$ \begin{aligned} v_{t+1} &= \mu v_t + g_t \\ w_{t+1} &= w_t - \eta v_{t+1} \end{aligned} $$
\(\mu \in [0, 1)\), typically 0.9. The velocity \(v_t\) is an exponentially decaying sum of past gradients and acts like inertia.
3.3 Nesterov Accelerated Gradient (NAG)
Take a partial step in the momentum direction before evaluating the gradient.
$$ \begin{aligned} v_{t+1} &= \mu v_t + \nabla \mathcal{L}(w_t - \eta \mu v_t) \\ w_{t+1} &= w_t - \eta v_{t+1} \end{aligned} $$
Damps oscillations in narrow valleys.
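To make the three update rules concrete, here is a minimal sketch in plain Python that runs each one on a toy narrow valley. The quadratic loss, its curvatures (1 and 50), the starting point, the learning rate, and the step count are illustrative assumptions, not values from the article; only the update formulas follow Sections 3.1 to 3.3.

```python
# Toy narrow valley: L(w) = 0.5 * (1 * w0^2 + 50 * w1^2).
# Axis 0 is flat, axis 1 is steep. All constants here are assumptions.
CURVATURES = (1.0, 50.0)

def grad(w):
    """Gradient of the quadratic loss at point w."""
    return [c * wi for c, wi in zip(CURVATURES, w)]

def loss(w):
    return 0.5 * sum(c * wi * wi for c, wi in zip(CURVATURES, w))

def run(method, steps=100, lr=0.024, mu=0.9):
    """Run one of 'sgd', 'momentum', 'nesterov' and return the final loss."""
    w = [10.0, 1.0]
    v = [0.0, 0.0]
    for _ in range(steps):
        if method == "sgd":
            g = grad(w)
            w = [wi - lr * gi for wi, gi in zip(w, g)]
            continue
        if method == "nesterov":
            # Evaluate the gradient at the look-ahead point w - lr * mu * v.
            lookahead = [wi - lr * mu * vi for wi, vi in zip(w, v)]
            g = grad(lookahead)
        else:  # classical momentum
            g = grad(w)
        v = [mu * vi + gi for vi, gi in zip(v, g)]
        w = [wi - lr * vi for wi, vi in zip(w, v)]
    return loss(w)

final_losses = {m: run(m) for m in ("sgd", "momentum", "nesterov")}
```

Under these assumed settings, both momentum variants finish at a lower loss than plain SGD after the same number of steps, because plain SGD crawls along the flat axis. The learning rate is kept small enough that all three rules stay stable on the steep axis.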
3.4 Adam (Kingma & Ba 2015)
$$ \begin{aligned} m_t &= \beta_1 m_{t-1} + (1 - \beta_1) g_t \quad \text{(first moment)} \\ v_t &= \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \quad \text{(second moment, elementwise)} \\ \hat{m}_t &= \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \\ w_{t+1} &= w_t - \eta \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \end{aligned} $$
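Defaults: \(\beta_1 = 0.9, \beta_2 = 0.999, \epsilon = 10^{-8}\). AdamW (Loshchilov & Hutter 2019) decouples weight decay from the gradient: instead of adding an L2 term to the loss, it shrinks the weights directly at each step, so the decay is not rescaled by \(\sqrt{\hat{v}_t}\). It is the de facto standard for Transformers.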
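The Adam equations above can be sketched for a single scalar weight as follows. This is a plain-Python illustration, not PyTorch's implementation; the function name `adam_step`, the `decoupled` flag, and the example numbers are assumptions made for this sketch.

```python
import math

def adam_step(w, g, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
              weight_decay=0.0, decoupled=True):
    """One Adam update for a scalar weight. `state` is the tuple (m, v, t)."""
    if weight_decay and not decoupled:
        g = g + weight_decay * w          # L2 penalty folded into the gradient
    m, v, t = state
    t += 1
    m = beta1 * m + (1 - beta1) * g       # first moment
    v = beta2 * v + (1 - beta2) * g * g   # second moment
    m_hat = m / (1 - beta1 ** t)          # bias correction
    v_hat = v / (1 - beta2 ** t)
    w_new = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    if weight_decay and decoupled:
        w_new -= lr * weight_decay * w    # AdamW-style decay, outside the adaptive ratio
    return w_new, (m, v, t)

fresh = (0.0, 0.0, 0)

# First step from w = 0 with a huge and a tiny gradient.
step_large_grad = -adam_step(0.0, 1000.0, fresh)[0]
step_small_grad = -adam_step(0.0, 0.001, fresh)[0]

# Zero loss gradient, so only weight decay acts on w = 2.0.
w_l2, _ = adam_step(2.0, 0.0, fresh, lr=0.1, weight_decay=0.01, decoupled=False)
w_decoupled, _ = adam_step(2.0, 0.0, fresh, lr=0.1, weight_decay=0.01, decoupled=True)
```

Two things are visible here. On the first step the bias-corrected ratio is close to the sign of the gradient, so both the large and the small gradient move the weight by about `lr`; that is the per-coordinate scale adaptation. And when weight decay is folded into the gradient, it gets normalized the same way (the weight moves by about `lr` no matter how small the decay coefficient is), whereas the decoupled form shrinks the weight by `lr * weight_decay * w`.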
| Optimizer | Strength | Weakness |
|---|---|---|
| SGD | Simple, often best generalization | Tuning LR is delicate |
| Momentum SGD | Acceleration | Still tuning-heavy |
| Adam | Works out of the box, per-coord LR | Sometimes loses to SGD on generalization |
| AdamW | Adam with decoupled weight decay; Transformer standard | Same optimizer-state memory as Adam (two buffers per parameter) |
4. Math & Mechanism — Regularization
4.1 Dropout (Srivastava et al. 2014)
Drop each neuron with probability \(p\) during training:
$$ h_i' = \frac{m_i}{1 - p} \, h_i, \quad m_i \sim \mathrm{Bernoulli}(1 - p) $$
The \(1/(1-p)\) factor (inverted dropout) keeps the expected activation unchanged during training, so inference needs no extra adjustment: skip both the mask and the scaling and use \(h_i' = h_i\).
Intuition: prevent the network from relying on specific neuron combinations → ensemble effect.
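The inverted-dropout formula can be sketched in a few lines of plain Python. The function name, the list-based activations, and the constant input are assumptions for illustration; in PyTorch the same behavior comes from `nn.Dropout` together with `model.train()` and `model.eval()`.

```python
import random

def dropout(h, p, training, rng=random):
    """Inverted dropout on a list of activations; p is the drop probability."""
    if not training or p == 0.0:
        return list(h)                    # inference: no mask, no rescaling
    keep = 1.0 - p
    # Keep each unit with probability 1 - p and scale survivors by 1 / (1 - p).
    return [x / keep if rng.random() < keep else 0.0 for x in h]

rng = random.Random(0)                    # fixed seed so the sketch is repeatable
h = [1.0] * 20000
train_out = dropout(h, p=0.2, training=True, rng=rng)
eval_out = dropout(h, p=0.2, training=False)
train_mean = sum(train_out) / len(train_out)
```

Each training output is either 0 or 1.25, yet the mean stays close to the input value of 1, which is why no correction is needed at inference.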
4.2 BatchNorm (Ioffe & Szegedy 2015)
Normalize per channel within the batch:
$$ \hat{h}_i = \frac{h_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \quad y_i = \gamma \hat{h}_i + \beta $$
\(\gamma, \beta\) are learnable scale and shift. Train: batch statistics. Inference: running mean/variance.
Benefits: stable training, larger learning rates, reduced internal covariate shift. Drawback: small batches produce noisy stats, which makes BatchNorm a poor fit for RNNs and Transformers — use LayerNorm there.
4.3 LayerNorm (Ba, Kiros, Hinton 2016)
Normalize across all channels of a single sample (no batch dependency):
$$ \hat{h}_i = \frac{h_i - \mu_{\text{layer}}}{\sqrt{\sigma_{\text{layer}}^2 + \epsilon}} $$
\(\mu, \sigma\) are computed inside one sample. Batch-size invariant. The Transformer/LLM default.
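The difference between the two normalizations is only the axis over which the mean and variance are taken. A minimal sketch on a small `x[sample][feature]` table, assuming plain Python lists and leaving out the learnable \(\gamma, \beta\) and BatchNorm's running statistics:

```python
import math

def normalize(values, eps=1e-5):
    """Zero-mean, unit-variance normalization of one list (biased variance)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def batch_norm(x):
    """Normalize each feature column using statistics across the batch."""
    columns = [normalize(list(col)) for col in zip(*x)]
    return [list(row) for row in zip(*columns)]

def layer_norm(x):
    """Normalize each sample row using statistics across its own features."""
    return [normalize(row) for row in x]

x = [[1.0, 2.0, 3.0],
     [2.0, 4.0, 6.0],
     [3.0, 6.0, 9.0],
     [10.0, 20.0, 30.0]]

bn = batch_norm(x)
ln = layer_norm(x)
ln_single = layer_norm([x[0]])[0]   # the same sample, normalized on its own
```

`ln_single` equals `ln[0]`: a sample's LayerNorm output does not depend on what else is in the batch. The BatchNorm output for the first row would change if the outlier row were removed.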
| Normalization | Statistics over | Typical use |
|---|---|---|
| BatchNorm | Batch + spatial | CNN |
| LayerNorm | Channels (one sample) | Transformers, RNNs |
| GroupNorm | Groups of channels | Small-batch CNN |
| InstanceNorm | One sample, one channel | Style transfer |
| RMSNorm | Channels (one sample), root-mean-square only, no mean subtraction | LLaMA and recent LLMs |
4.4 Initialization
Xavier (Glorot & Bengio 2010) — for tanh / sigmoid:
$$ \text{Var}(W) = \frac{2}{n_{\text{in}} + n_{\text{out}}} $$
He (He et al. 2015) — for ReLU. Because ReLU kills the negative half, double the variance.
$$ \text{Var}(W) = \frac{2}{n_{\text{in}}} $$
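PyTorch:
import torch.nn as nn  # W is any weight tensor, e.g. layer.weight
nn.init.kaiming_normal_(W, nonlinearity="relu")  # He: Var(W) = 2 / n_in (fan_in mode)
nn.init.xavier_normal_(W)  # Xavier: Var(W) = 2 / (n_in + n_out)
nn.Linear's default initializer is a He-uniform variant (kaiming_uniform_ with a=√5), which works out to weights drawn from U(−1/√n_in, 1/√n_in).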
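To see why the factor of 2 matters for ReLU, the sketch below pushes one random input through a stack of square ReLU layers and compares the two variance rules. It is a plain-Python simulation under assumed settings (width 128, depth 10, Gaussian weights, a fixed seed), not a derivation.

```python
import math
import random

def he_var(n_in, n_out):
    return 2.0 / n_in

def xavier_var(n_in, n_out):
    return 2.0 / (n_in + n_out)

def relu_stack_ratio(var_fn, width=128, depth=10, seed=0):
    """Send one random input through `depth` square ReLU layers and return
    (mean-square activation at the output) / (mean-square of the input)."""
    rng = random.Random(seed)
    h = [rng.gauss(0.0, 1.0) for _ in range(width)]
    ms_in = sum(v * v for v in h) / width
    std = math.sqrt(var_fn(width, width))
    for _ in range(depth):
        # Each output unit gets a fresh row of Gaussian weights, then ReLU.
        h = [max(0.0, sum(rng.gauss(0.0, std) * v for v in h)) for _ in range(width)]
    return (sum(v * v for v in h) / width) / ms_in

he_ratio = relu_stack_ratio(he_var)
xavier_ratio = relu_stack_ratio(xavier_var)
```

In expectation, He initialization keeps the mean-square activation constant from layer to layer, while the Xavier variance (which is \(1/n\) for square layers) halves it at every ReLU layer, so after 10 layers the signal is roughly a thousand times weaker.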
5. Diagram
6. Principle Walkthrough — Five Decisions Hidden in One Training Loop
A standard training loop looks simple, yet behind that single loop five decisions operate simultaneously: optimizer / regularization / initialization / learning-rate schedule / gradient clipping. Section 7 unpacks each into its family.
6.1 The Standard Training Loop — Five Decisions Compressed
import torch
import torch.nn.functional as F
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# MLPWithReg, cosine_warmup_fn, and the batch (xb, yb) are assumed to be defined elsewhere.
model = MLPWithReg(in_dim=64, hidden=128, out_dim=10, p_drop=0.2)  # ① init + regularization
opt = AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)  # ② optimizer
sched = LambdaLR(opt, lr_lambda=cosine_warmup_fn)  # ③ LR schedule
for step in range(10_000):
    loss = F.cross_entropy(model(xb), yb)
    opt.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # ④ gradient clipping
    opt.step()
    sched.step()
Each of these decisions is a core part of neural-network training; the subsections below take them one at a time.
6.2 Adam — Why It Combined Momentum with RMSProp
Observed pattern
- SGD: a single step along the current gradient. Oscillates on noisy gradients.
- Momentum: an EMA of past gradients accelerates along consistent directions and damps reversals. First moment \(m_t\).
- RMSProp: per-parameter adaptive learning rate. Parameters with consistently large gradients get smaller steps; parameters with small or infrequent gradients get larger ones. Second moment \(v_t\).
- Adam (Kingma & Ba 2015): both at once. \(m_t\) (direction) + \(v_t\) (scale) + bias correction.
Why it became the NLP / Transformer default
Per-parameter scale adaptation fits naturally to weights with very different distributions: embeddings, attention, FFN. In vision, SGD+Momentum often still wins on generalization, but for Transformers AdamW is the de facto choice.
Forward link
Lion (Google 2023) and Sophia (Liu 2024) target large-LLM training. Lion keeps only a momentum buffer and applies sign-based updates, which cuts optimizer memory; Sophia adds a cheap diagonal-Hessian estimate to make use of second-order information.
6.3 BatchNorm vs LayerNorm — What Differs and Why Both Exist
What's different
- BatchNorm: normalizes across the sample dimension within a batch. CNN standard.
- LayerNorm: normalizes across the channel dimension per sample. Transformer standard.
Why BatchNorm fails for Transformers
- Variable sequence length → sequence-wise statistics are weakly defined.
- Small per-device batches in LLM training → noisy BatchNorm statistics.
- At inference time, normalizing per input is more natural.
The framing: both stabilize each layer's input distribution — they just normalize along different dimensions depending on data structure.
6.4 Learning-Rate Schedule — Why cosine warmup Became the LLM Standard
Observed pattern
- Warmup (first N steps): linearly increase LR from 0 to peak. Protects weights from huge early gradients.
- Cosine annealing (after peak): smoothly decay LR along a cosine curve, down to zero or to a small floor (GPT-3 and LLaMA both stop at 10% of the peak). Enables fine adjustment in the late stage.
Why this combination became standard
- Linear warmup ensures early stability.
- Cosine annealing ensures smooth late-stage convergence.
- GPT-3 (Brown et al. 2020), LLaMA (Touvron et al. 2023) both use this combination.
Forward link
LLM training in Part 8 extends this perspective: schedule and data ordering are themselves quality decisions.
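The warmup-then-cosine shape can be written as a single function of the step index. The sketch below is one possible body for the `cosine_warmup_fn` name used in the Section 6.1 loop; the article does not show its definition, so the signature, the 500-step warmup, the 10,000-step horizon, and the 10% floor are all assumptions. It returns a multiplier on the base learning rate, which is the form `torch.optim.lr_scheduler.LambdaLR` expects from `lr_lambda`.

```python
import math

def cosine_warmup_fn(step, warmup_steps=500, total_steps=10_000, min_ratio=0.1):
    """LR multiplier: linear warmup from 0 to 1, then cosine decay to min_ratio."""
    if step < warmup_steps:
        return step / warmup_steps
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_ratio + (1.0 - min_ratio) * cosine

multipliers = [cosine_warmup_fn(s) for s in range(0, 10_001, 250)]
```

Setting `min_ratio=0.0` gives the decay-to-zero variant. Because the function only depends on `step`, it can be plotted or unit-tested without building a model or an optimizer.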
6.5 Gradient Clipping — The Safety Net Against Blow-Up
What it stops
RNNs and Transformers are especially prone to gradient explosion. A single anomalous step can throw the weights far off course and destroy training.
Mechanism
clip_grad_norm_(params, max_norm=1.0) computes one global L2 norm over all parameter gradients and, when it exceeds 1.0, scales every gradient by the same factor. Direction is preserved; magnitude is bounded.
Why it is the standard
- Exploding gradients are among the most common causes of NaN loss in LLM training.
- max_norm=1.0 is a common, safe default across most Transformer pipelines.
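Global-norm clipping is short enough to write out by hand. The following is a plain-Python sketch of the idea behind `clip_grad_norm_`, with gradients represented as nested lists; the function name and the example gradients are assumptions for illustration.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """grads: one list of gradient values per parameter.
    Returns (clipped gradients, total norm before clipping)."""
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # Scale factor is capped at 1, so small gradients pass through unchanged.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [[g * scale for g in grad] for grad in grads], total_norm

grads = [[3.0, 0.0], [0.0, 4.0]]          # global norm is 5
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for grad in clipped for g in grad))

small = [[0.1, 0.2]]
small_clipped, _ = clip_by_global_norm(small, max_norm=1.0)
```

Every entry is multiplied by the same factor, so the ratio between any two gradient components (and therefore the update direction) is unchanged. Logging `norm_before` at each step is also a cheap way to spot an instability before the loss turns into NaN.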
7. Variants & Case Studies — Training Tricks Form an Interacting Family
These choices are not picked separately. Optimizer, regularization, initialization, schedule, and gradient clipping affect each other.
7.1 Optimizer Family — Trade-Offs of Memory / Generalization / Speed
| Name | Core idea | Where it lives |
|---|---|---|
| SGD + Momentum | Simple, strong generalization | Vision (ResNet-style CNNs) |
| Adam (Kingma 2015) | First & second moments adapt | NLP default |
| AdamW (Loshchilov 2019) | Decouples weight decay from the gradient update | Transformer standard |
| Adafactor (Shazeer 2018) | Memory-saving second moment | T5 / PaLM / large LLMs |
| Lion (Google 2023) | Sign-based updates; tracks momentum only, so less memory than AdamW | A few large vision and language models |
| Sophia (Liu 2024) | Hessian-diagonal second-order info | Large-LLM acceleration |
The framing: as model size grows, optimizer state becomes a major share of training memory (Adam keeps two extra buffers per parameter), which is the reason Adafactor and Lion exist.
7.2 Regularization Family — A Symphony of Mechanisms
| Name | Where it acts | Effect |
|---|---|---|
| Dropout (Srivastava 2014) | Each forward pass | Implicit ensemble |
| DropPath / Stochastic Depth | Layer level | Generalization for deep nets |
| Label smoothing (Szegedy 2016) | Loss function | Prevents overconfidence, calibration ↑ |
| Mixup (Zhang 2018) | Input side | Linear interpolation of two images |
| CutMix (Yun 2019) | Input side | Region swap between images |
| EMA | Inference time | Weight averaging for generalization |
| Weight decay | Optimizer | Shrinks weights toward zero (equivalent to L2 regularization for plain SGD) |
The framing: neural-net regularization intervenes at four phases at once: data, model, training, and inference. The mechanisms interact, so strengthening one in isolation can cancel out the effect of another.
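As one concrete member of this family, label smoothing only changes the target distribution fed to the loss. Below is a minimal plain-Python sketch, assuming the common formulation that mixes the one-hot target with a uniform distribution over all classes; the function names and the example logits are illustrative.

```python
import math

def smoothed_targets(label, num_classes, smoothing=0.1):
    """Mix the one-hot target with a uniform distribution over all classes."""
    uniform = smoothing / num_classes
    return [uniform + (1.0 - smoothing if k == label else 0.0)
            for k in range(num_classes)]

def cross_entropy(logits, target_probs):
    """Cross-entropy between softmax(logits) and a target distribution."""
    m = max(logits)                                   # subtract max for stability
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return -sum(t * (z - log_z) for t, z in zip(target_probs, logits))

targets = smoothed_targets(label=2, num_classes=4, smoothing=0.1)

# Same correct prediction, with an extreme and a moderate logit margin.
loss_overconfident = cross_entropy([0.0, 0.0, 20.0, 0.0], targets)
loss_moderate = cross_entropy([0.0, 0.0, 3.0, 0.0], targets)
```

With smoothed targets the extreme-margin prediction incurs the higher loss even though both predictions are correct, which is how the method discourages overconfidence. In PyTorch the same effect is available through the `label_smoothing` argument of `F.cross_entropy`.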
7.3 Initialization Family — Paired With Activations
| Name | Paired activation | Author |
|---|---|---|
| Xavier / Glorot | sigmoid, tanh | Glorot & Bengio 2010 |
| He / Kaiming | ReLU family | He et al. 2015 |
| Orthogonal | RNNs, gated | Saxe et al. 2014 |
| Identity | Residual blocks | He et al. 2016 (ResNet) |
| MUP (Yang 2021) | All activations, scale-adaptive | μ-parametrization |
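The framing: initialization must match the activation's variance behavior. ReLU zeroes the negative half, so only half the input variance passes through → He init uses a weight standard deviation of \(\sqrt{2/n_{\text{in}}}\). Xavier targets the linear regime of sigmoid and tanh with \(\sqrt{2/(n_{\text{in}} + n_{\text{out}})}\), which reduces to \(\sqrt{1/n_{\text{in}}}\) when \(n_{\text{in}} = n_{\text{out}}\).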
7.4 Learning-Rate Schedules — A Policy on Decay
| Schedule | Behavior | Use |
|---|---|---|
| Step decay | \(\eta \times 0.1\) every fixed number of epochs (e.g., every 30) | Early-vision default |
| Cosine annealing (Loshchilov 2017) | Smooth decay | Vision and Transformers |
| One-cycle (Smith 2018) | Aggressive ramp then decay | fast.ai recommended |
| Cosine + warm restarts | Periodic re-start | Multi-model averaging |
| Linear warmup + cosine | Early stability + smooth late | The de facto LLM standard |
| Polynomial decay | Polynomial schedule | BERT training |
The framing: learning rate is not a static hyperparameter but a function over time. The same model behaves very differently across schedules.
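Treating the learning rate as a function of time is easy to see with step decay, the simplest schedule in the table. A minimal sketch, assuming a base rate of 0.1 and a tenfold drop every 30 epochs (illustrative values); PyTorch offers the same behavior as `torch.optim.lr_scheduler.StepLR(optimizer, step_size, gamma)`.

```python
def step_decay(epoch, base_lr=0.1, gamma=0.1, step_size=30):
    """Multiply the learning rate by `gamma` once every `step_size` epochs."""
    return base_lr * gamma ** (epoch // step_size)

# Learning rate just before and just after each drop.
lrs = [step_decay(e) for e in (0, 29, 30, 59, 60, 90)]
```

The rate is piecewise constant: it stays flat inside each 30-epoch window and drops by a factor of 10 at the boundary, in contrast to the smooth cosine curve.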
7.5 Industry Cases — LLM Training on One Line
| Model | Optimizer | Schedule | Regularization | Init |
|---|---|---|---|---|
| BERT (2018) | AdamW | Linear warmup + decay | Dropout 0.1 | Truncated normal |
| GPT-3 (2020) | AdamW (β₁=0.9, β₂=0.95) | Cosine + warmup | No dropout | Scaled normal |
| LLaMA (2023) | AdamW (β₁=0.9, β₂=0.95) | Cosine, lr=3e-4 | RMSNorm, no dropout | Truncated normal |
| LLaMA 2 | AdamW | Cosine | RMSNorm | μP-influenced |
| Mixtral | AdamW | Cosine | RMSNorm, MoE separate | Truncated normal |
The framing: post-GPT, dropout has effectively disappeared (or weakened drastically) and BatchNorm has been replaced by LayerNorm / RMSNorm as standard. Training practice evolves.
8. Limits & Failure Modes
8.1 NaN Loss
Why it is essential
Floating-point arithmetic easily overflows or underflows in \(\exp\), \(\log\), or large gradients. Once NaN appears, it propagates and ruins every weight.
How you spot it
Loss is inf or NaN. Log the loss and gradient norm at every step to find where it first appeared.
Next step
- LR ÷ 10, gradient clip max_norm=1.0
- Normalize inputs (StandardScaler)
- Mixed-precision training: ensure GradScaler is configured
- Add LayerNorm/RMSNorm for stabilization
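The \(\exp\) overflow mentioned above is easy to reproduce and to avoid. The sketch below compares a naive log-sum-exp (the normalizer inside softmax cross-entropy) with the max-subtraction form; the logit values are assumptions chosen to force the overflow.

```python
import math

def naive_logsumexp(logits):
    return math.log(sum(math.exp(z) for z in logits))

def stable_logsumexp(logits):
    """Subtract the max first so the largest exponent is exp(0) = 1."""
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits))

logits = [1000.0, 999.0, 998.0]

try:
    naive = naive_logsumexp(logits)
except OverflowError:
    # Python floats raise here; tensor libraries return inf instead,
    # which then turns into NaN further down the computation.
    naive = float("inf")

stable = stable_logsumexp(logits)
```

This is the reason to compute the loss from raw logits with a fused routine such as `F.cross_entropy` rather than applying softmax and log separately.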
8.2 Plateau
Why it is essential
Training is not progressing. Possible causes include too-low LR, dead ReLU, or an undersized model.
How you spot it
Loss curve is flat. Gradient norms near zero.
Next step
- Try LR × 3
- Swap sigmoid for ReLU or GELU (for dead ReLUs, try LeakyReLU or GELU)
- Increase hidden size
- Shorten warmup
8.3 Train ↓ Test ↑ (Overfitting)
Why it is essential
Model capacity > information in the data. The §7.2 regularization symphony is insufficient.
How you spot it
A large train/test gap, for example train accuracy 99% and test accuracy 70%.
Next step
Increase dropout p, increase weight decay, strengthen augmentation, apply early stopping. If still failing, shrink the model.
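Of the fixes listed above, early stopping is the one that needs a little bookkeeping. Here is a minimal sketch of a patience-based stopper; the class name, the `patience` and `min_delta` parameters, and the validation-loss sequence are assumptions for illustration.

```python
class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` checks."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_checks = 0

    def should_stop(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best = val_loss          # improvement: reset the counter
            self.bad_checks = 0
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience

# Validation loss improves, then drifts upward as the model overfits.
val_losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.65, 0.70]
stopper = EarlyStopping(patience=3)
stop_epoch = next(
    (i for i, v in enumerate(val_losses) if stopper.should_stop(v)), None
)
```

In a real loop the weights from the best epoch would be checkpointed when `best` is updated and restored after the stop signal.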
8.4 BatchNorm with Small Batches
Why it is essential
BatchNorm uses batch statistics. At small batch sizes (roughly below 16), those statistics are noisy and destabilize training.
How you spot it
The same model trains well at large batch size and badly at small batch size.
Next step
LayerNorm (Transformers), GroupNorm (CNNs), or SyncBatchNorm (distributed).
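How noisy small-batch statistics are can be estimated with a quick simulation. The sketch below draws activations from a standard normal distribution (an assumption for illustration) and measures how much the per-batch mean, which BatchNorm would subtract, fluctuates from batch to batch.

```python
import random
import statistics

def batch_mean_spread(batch_size, num_batches=1000, seed=0):
    """Standard deviation of the per-batch mean for N(0, 1) activations."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(0.0, 1.0) for _ in range(batch_size))
        for _ in range(num_batches)
    ]
    return statistics.pstdev(means)

spread_small = batch_mean_spread(batch_size=4)
spread_large = batch_mean_spread(batch_size=256)
```

The spread shrinks roughly as \(1/\sqrt{B}\), so a batch of 4 gives a mean estimate that is several times noisier than a batch of 256. Per-sample normalizers such as LayerNorm and GroupNorm avoid this dependence on batch size.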
8.5 Adam vs SGD — The Generalization Paradox
Why it is essential
Adam wins on speed but SGD+Momentum often wins on generalization, especially in vision (Wilson et al. 2017, The Marginal Value of Adaptive Gradient Methods).
How you spot it
On the same model, Adam has better train accuracy / R² but worse test.
Next step
Try SGD + Momentum with a cosine schedule. Transformers still favor AdamW.
8.6 Distributed-Training Statistics Sync
Why it is essential
When each BatchNorm layer sees only its local batch, per-GPU statistics disagree and normalization loses consistency.
How you spot it
Single-GPU and multi-GPU learning curves diverge.
Next step
Call torch.nn.SyncBatchNorm.convert_sync_batchnorm(model) to synchronize statistics across GPUs, or switch to LayerNorm.
What the Limits Sketch
The six limits of training compress the LLM-era training standard:
- NaN → mixed precision + grad clip 1.0
- Plateau → cosine warmup
- Overfit → stronger regularization (dropout, weight decay) + stronger augmentation
- BatchNorm limits → LayerNorm/RMSNorm
- Adam generalization → AdamW + weight decay
- Distributed statistics → SyncBN or LN
9. Quick Recap — Answer Before You Peek
Q1. Why did Adam become the Transformer default?
Answer Its per-parameter adaptive learning rate matches the very different distributions of embedding, attention, and FFN weights.
Why Adam combines first-moment (direction) + second-moment (scale) + bias correction. SGD applies a single LR to every parameter, which is fine for vision but inadequate for Transformers. In vision, SGD+Momentum still often generalizes better (Wilson et al. 2017). (Sections 6.2, 8.5.)
Q2. Why do Transformers use LayerNorm instead of BatchNorm?
Answer Variable sequence length and small per-device batches make BatchNorm statistics noisy. Normalizing per input (LayerNorm) is more natural.
Why BatchNorm normalizes across the batch dimension; LayerNorm normalizes across the channel dimension. CNN image batches fit BatchNorm; Transformer token sequences fit LayerNorm. RMSNorm is a simpler LayerNorm variant (LLaMA). (Sections 6.3, 8.4.)
Q3. What's the difference between He init and Xavier init? Why pair them with activations?
Answer Xavier: weight std \(\sqrt{2/(n_{\text{in}} + n_{\text{out}})}\), which is \(\sqrt{1/n_{\text{in}}}\) for square layers (sigmoid/tanh). He: weight std \(\sqrt{2/n_{\text{in}}}\) (ReLU).
Why ReLU zeroes the negative half, so only half the input variance passes through. To preserve output variance, the weight variance doubles. A wrong init makes activations and gradients blow up or die out exponentially with depth. (Section 7.3.)
Q4. Why has "linear warmup + cosine annealing" become the LLM training standard?
Answer Warmup protects weights from large early gradients; cosine annealing produces smooth late-stage convergence.
Why Early gradients are noisy → too-large LRs destroy weights. Ramp up gradually, then ease down for fine adjustment. GPT-3 and LLaMA both use this combination. (Sections 6.4, 7.4.)
Q5. Loss went NaN. Four first responses?
Answer (1) LR ÷ 10, (2) gradient clip max_norm=1.0, (3) normalize inputs, (4) check GradScaler configuration if using mixed precision.
Why The dominant cause is exploding gradients meeting floating-point limits. These four are the standard safety kit for LLM training. Adding LayerNorm/RMSNorm also helps. (Sections 6.5, 8.1.)
If four or five came easily, the five decisions of neural-network training are in place.
10. Further Reading
Primary sources
- Kingma, D. P., Ba, J. Adam: A Method for Stochastic Optimization. ICLR 2015. arXiv:1412.6980.
- Loshchilov, I., Hutter, F. Decoupled Weight Decay Regularization. ICLR 2019. arXiv:1711.05101 — AdamW.
- Srivastava, N. et al. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR 15 (2014): 1929–1958.
- Ioffe, S., Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML 2015. arXiv:1502.03167.
- Ba, J. L., Kiros, J. R., Hinton, G. E. Layer Normalization. arXiv:1607.06450 (2016).
- Glorot, X., Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. AISTATS 2010. — Xavier init.
- He, K. et al. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ICCV 2015. arXiv:1502.01852 — He init.
- Smith, L. N. Cyclical Learning Rates for Training Neural Networks. WACV 2017. arXiv:1506.01186.
Official docs
- PyTorch optimizers: https://pytorch.org/docs/stable/optim.html
- LR schedulers: https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
- BatchNorm / LayerNorm: https://pytorch.org/docs/stable/nn.html#normalization-layers
Companion books
- Goodfellow, Bengio, Courville. Deep Learning. Chapter 8.
- Bottou, Curtis, Nocedal. Optimization Methods for Large-Scale Machine Learning. SIAM Review 60(2), 223–311 (2018).
In Part 8 we take a tour of deep learning architectures: CNN (LeNet → AlexNet → ResNet), RNN/LSTM, the arrival of attention, and the long path that ended in the Transformer.