"ML Foundations (7/9) — Deep Learning Training: Optimizers, Regularization, Initialization"

Building an MLP, as we did in Part 6, is not the same as getting it to train well. The right optimizer, regularization, initialization, and learning rate — when these line up, deep networks converge. When they don't, the network refuses to learn at all. This part is the catalog of those crafts, with formulas, paper citations, and code in one place.

0. Learning Objectives

Compare and write the update rules for SGD, Momentum, Nesterov, and Adam.
Explain how Dropout, BatchNorm, and LayerNorm work and where they belong in a model.
Derive the variance formulas for Xavier and He initialization and match them to activations.
Implement step, cosine, and warmup learning-rate schedules in PyTorch.
Explain why gradient clipping is effectively required for RNNs and Transformers.
Diagnose the most common training failures (NaN, plateau, overfitting) and apply first-line fixes.

1. Key Summary

SGD: $w \leftarrow w - \eta \nabla_w \mathcal{L}$. Momentum and Nesterov speed up progress across flat regions and damp oscillation in steep ones.
Adam: tracks first and second moments to give every coordinate its own adaptive learning rate. The default for deep networks.
Dropout: zero out random neurons during training; use the full network at inference. Noise-based (stochastic) regularization.
BatchNorm: per-batch mean/variance normalization → faster, more stable training. Uses running stats at inference.
LayerNorm: per-sample normalization. The Transformer standard.
Initialization: Xavier (tanh) → $\text{Var}(W) = 2/(n_{\text{in}} + n_{\text{out}})$, which is $1/n_{\text{in}}$ for square layers; He (ReLU) → $\text{Var}(W) = 2/n_{\text{in}}$.
Learning-rate schedule: warmup → main training → cosine decay is the modern default.
Gradient clipping: prevents exploding gradients. Effectively mandatory in RNNs and Transformers.

2. Intuition — Why Plain SGD Falls Short

2.1 The Shape of the Loss Landscape

A neural network loss $\mathcal{L}(w)$ is non-convex, saddle-rich, and full of narrow valleys.

Narrow valleys: steep in one direction, flat in another. Plain SGD oscillates along the steep axis and barely advances along the flat one.
Saddle points: gradients shrink to near zero even though loss is still high.
Flat plateaus: gradients near zero, learning stalls.

Momentum and Adam exist to deal with these specific landscape features.

2.2 Different Coordinates Want Different Learning Rates

Gradients on different features can differ by orders of magnitude (sparse vs dense features). A single $\eta$ for all coordinates either blows up the large-gradient coordinates or barely moves the small-gradient ones. Adam adapts the learning rate per coordinate, which is a major reason it became the default.

2.3 Why Initialization Is Decisive

If the variance of the forward signal amplifies or shrinks layer by layer, the backward gradient does the same — vanishing or exploding. Initialization is the practice of choosing the weight variance so that the forward signal variance is preserved across layers.

3. Definitions & Notation — Optimizers

3.1 SGD

$$ w_{t+1} = w_t - \eta \, g_t, \quad g_t = \nabla_w \mathcal{L}(w_t) $$

Mini-batch SGD estimates $g_t$ from $B$ randomly selected examples at each step.

3.2 Momentum (Polyak 1964)

$$ \begin{aligned} v_{t+1} &= \mu v_t + g_t \\ w_{t+1} &= w_t - \eta v_{t+1} \end{aligned} $$

$\mu \in [0, 1)$, typically 0.9. The velocity $v_t$ is an exponentially decaying sum of past gradients, so it acts like inertia.

3.3 Nesterov Accelerated Gradient (NAG)

Take a partial step in the momentum direction before evaluating the gradient.

$$ \begin{aligned} v_{t+1} &= \mu v_t + \nabla \mathcal{L}(w_t - \eta \mu v_t) \\ w_{t+1} &= w_t - \eta v_{t+1} \end{aligned} $$

Damps oscillations in narrow valleys.
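The three update rules in 3.1 to 3.3 can be compared side by side on a toy problem. The sketch below is an illustration under stated assumptions, not a benchmark: the quadratic "narrow valley" (curvature 100 in one direction, 1 in the other), the learning rate 0.012, and the helper names `grad`, `loss`, and `run` are all hypothetical choices made for this example.

```python
# Hypothetical toy problem: a narrow quadratic valley, steep in w[0], flat in w[1].
CURVATURE = (100.0, 1.0)


def grad(w):
    """Gradient of L(w) = 0.5 * sum(c_i * w_i^2)."""
    return [c * wi for c, wi in zip(CURVATURE, w)]


def loss(w):
    return 0.5 * sum(c * wi * wi for c, wi in zip(CURVATURE, w))


def run(method, lr=0.012, mu=0.9, steps=200):
    w = [1.0, 1.0]
    v = [0.0, 0.0]
    for _ in range(steps):
        if method == "sgd":
            g = grad(w)
            w = [wi - lr * gi for wi, gi in zip(w, g)]
            continue
        if method == "momentum":
            g = grad(w)                                   # gradient at the current point
        elif method == "nesterov":
            lookahead = [wi - lr * mu * vi for wi, vi in zip(w, v)]
            g = grad(lookahead)                           # gradient at the look-ahead point
        else:
            raise ValueError(method)
        v = [mu * vi + gi for vi, gi in zip(v, g)]        # v_{t+1} = mu * v_t + g
        w = [wi - lr * vi for wi, vi in zip(w, v)]        # w_{t+1} = w_t - lr * v_{t+1}
    return loss(w)


final = {m: run(m) for m in ("sgd", "momentum", "nesterov")}
```

In this setup plain SGD is limited by the flat direction, where each step shrinks the coordinate by only a factor of 0.988, so both momentum variants finish at a much lower loss after the same 200 steps. The learning rate was picked so that all three methods are stable on the steep axis; with a larger one, the comparison changes.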

3.4 Adam (Kingma & Ba 2015)

$$ \begin{aligned} m_t &= \beta_1 m_{t-1} + (1 - \beta_1) g_t \quad \text{(first moment)} \\ v_t &= \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \quad \text{(second moment, elementwise)} \\ \hat{m}_t &= \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \\ w_{t+1} &= w_t - \eta \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \end{aligned} $$

Defaults: $\beta_1 = 0.9, \beta_2 = 0.999, \epsilon = 10^{-8}$. AdamW (Loshchilov & Hutter 2019) decouples weight decay from the loss for better stability. It is the de facto standard for Transformers.
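To make the per-coordinate scaling concrete, here is a minimal pure-Python sketch of the Adam update above. It assumes parameters stored as a flat list of floats; the names `adam_step` and `init_state` are hypothetical, and the optional `weight_decay` argument follows the decoupled (AdamW-style) form, applied directly to the weights rather than added to the gradient.

```python
import math


def init_state(n):
    return {"t": 0, "m": [0.0] * n, "v": [0.0] * n}


def adam_step(w, g, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.0):
    """One Adam update on a list of floats; weight_decay > 0 adds decoupled decay."""
    state["t"] += 1
    t = state["t"]
    new_w = []
    for i, (wi, gi) in enumerate(zip(w, g)):
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * gi        # first moment
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * gi * gi   # second moment
        m_hat = state["m"][i] / (1 - beta1 ** t)                        # bias correction
        v_hat = state["v"][i] / (1 - beta2 ** t)
        wi = wi - lr * weight_decay * wi                                # decoupled decay
        new_w.append(wi - lr * m_hat / (math.sqrt(v_hat) + eps))
    return new_w


# Two coordinates whose gradients differ by six orders of magnitude.
w0 = [0.0, 0.0]
w1 = adam_step(w0, [1000.0, 0.001], init_state(2))
step_sizes = [abs(a - b) for a, b in zip(w1, w0)]
```

On the first step, bias correction makes $\hat{m}_t / \sqrt{\hat{v}_t}$ equal to the sign of the gradient, so both coordinates move by roughly $\eta$ even though their gradients differ by a factor of $10^6$. Plain SGD would move them by amounts that differ by the same factor.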

Optimizer	Strength	Weakness / note
SGD	Simple, often best generalization	Tuning LR is delicate
Momentum SGD	Acceleration	Still tuning-heavy
Adam	Works out of the box, per-coord LR	Sometimes loses to SGD on generalization
AdamW	Adam with sane weight decay	Transformer standard

4. Math & Mechanism — Regularization

4.1 Dropout (Srivastava et al. 2014)

Drop each neuron with probability $p$ during training:

$$ h_i' = \frac{m_i}{1 - p} \, h_i, \quad m_i \sim \mathrm{Bernoulli}(1 - p) $$

The $1/(1-p)$ factor (inverted dropout) keeps the expected activation unchanged during training, so inference needs no adjustment: the mask and the scaling are both dropped and $h_i' = h_i$.

Intuition: prevent the network from relying on specific neuron combinations → ensemble effect.
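A minimal sketch of inverted dropout on a plain list of activations, assuming drop probability `p` as defined above. The function name `dropout` and the explicit `rng` argument are hypothetical conveniences for this example.

```python
import random


def dropout(h, p, training, rng):
    """Inverted dropout on a list of activations; p is the drop probability."""
    if not training or p == 0.0:
        return list(h)                      # inference: identity, no rescaling
    keep = 1.0 - p
    # Keep each unit with probability (1 - p) and scale survivors by 1 / (1 - p).
    return [hi / keep if rng.random() < keep else 0.0 for hi in h]


rng = random.Random(0)                       # fixed seed, for reproducibility only
h = [1.0] * 100_000
train_out = dropout(h, p=0.2, training=True, rng=rng)
eval_out = dropout(h, p=0.2, training=False, rng=rng)
train_mean = sum(train_out) / len(train_out)
```

With $p = 0.2$, surviving activations are scaled to 1.25 and the rest are zero, so the training-time mean stays close to the inference-time value of 1.0. That is the property that lets inference skip any correction.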

4.2 BatchNorm (Ioffe & Szegedy 2015)

Normalize per channel within the batch:

$$ \hat{h}_i = \frac{h_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \quad y_i = \gamma \hat{h}_i + \beta $$

$\gamma, \beta$ are learnable scale and shift. Train: batch statistics. Inference: running mean/variance.

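Benefits: stable training and larger usable learning rates; the original paper attributes this to reduced internal covariate shift. Drawback: small batches produce noisy statistics, and variable-length sequences make batch statistics poorly defined, which makes BatchNorm a poor fit for RNNs and Transformers. Use LayerNorm there.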
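The train/inference split is easiest to see in code. The sketch below handles a single scalar feature and fixes $\gamma = 1$, $\beta = 0$ for brevity; the class name `BatchNorm1Feature` is hypothetical, and real implementations differ in details such as which variance estimate feeds the running average.

```python
class BatchNorm1Feature:
    """BatchNorm for one scalar feature; gamma and beta are fixed at 1 and 0."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.momentum, self.eps = momentum, eps
        self.running_mean, self.running_var = 0.0, 1.0
        self.training = True

    def __call__(self, batch):
        if self.training:
            mean = sum(batch) / len(batch)
            var = sum((x - mean) ** 2 for x in batch) / len(batch)
            # Exponential moving average of the batch statistics.
            self.running_mean += self.momentum * (mean - self.running_mean)
            self.running_var += self.momentum * (var - self.running_var)
        else:
            # Inference: use the stored statistics, independent of the batch.
            mean, var = self.running_mean, self.running_var
        return [(x - mean) / (var + self.eps) ** 0.5 for x in batch]


bn = BatchNorm1Feature()
batch = [2.0, 4.0, 6.0, 8.0]              # mean 5, variance 5
for _ in range(200):
    train_out = bn(batch)
bn.training = False
single = bn([5.0])                         # inference works with a single sample
```

In training mode the output depends on whatever else is in the batch. In inference mode a single sample can be normalized on its own, because the statistics come from the running averages.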

4.3 LayerNorm (Ba, Kiros, Hinton 2016)

Normalize across all channels of a single sample (no batch dependency):

$$ \hat{h}_i = \frac{h_i - \mu_{\text{layer}}}{\sqrt{\sigma_{\text{layer}}^2 + \epsilon}} $$

$\mu, \sigma$ are computed inside one sample. Batch-size invariant. The Transformer/LLM default.

Normalization	Statistics over	Typical use
BatchNorm	Batch + spatial	CNN
LayerNorm	Channels (one sample)	Transformers, RNNs
GroupNorm	Groups of channels	Small-batch CNN
InstanceNorm	One sample, one channel	Style transfer
RMSNorm	Channels (one sample), root mean square only, no mean subtraction	LLaMA and recent LLMs
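The rows of the table differ mainly in which axis the statistics are taken over. A small sketch on a nested list `x[sample][channel]`, with the learnable scale and shift omitted; the function names are hypothetical and chosen to mirror the table.

```python
def _normalize(values, eps=1e-5):
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / (var + eps) ** 0.5 for v in values]


def batch_norm(x):
    """Statistics per channel, across the samples in the batch."""
    columns = [_normalize(col) for col in zip(*x)]
    return [list(row) for row in zip(*columns)]


def layer_norm(x):
    """Statistics per sample, across its channels."""
    return [_normalize(row) for row in x]


def rms_norm(x, eps=1e-5):
    """Divide each sample by its root mean square; no mean subtraction."""
    out = []
    for row in x:
        rms = (sum(v * v for v in row) / len(row) + eps) ** 0.5
        out.append([v / rms for v in row])
    return out


x = [[1.0, 10.0, 100.0],
     [2.0, 20.0, 200.0],
     [3.0, 30.0, 300.0]]
bn, ln, rn = batch_norm(x), layer_norm(x), rms_norm(x)
```

Here each sample is a scaled copy of the first one, so `layer_norm` maps all three rows to (almost) the same vector, while `batch_norm` instead centers each column across the batch. `rms_norm` rescales each row but leaves its mean nonzero.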

4.4 Initialization

Xavier (Glorot & Bengio 2010) — for tanh / sigmoid:

$$ \text{Var}(W) = \frac{2}{n_{\text{in}} + n_{\text{out}}} $$

He (He et al. 2015) — for ReLU. Because ReLU kills the negative half, double the variance.

$$ \text{Var}(W) = \frac{2}{n_{\text{in}}} $$

PyTorch:

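import torch.nn as nn
W = nn.Linear(64, 128).weight                     # example weight, shape (128, 64)
nn.init.kaiming_normal_(W, nonlinearity="relu")   # He: std = sqrt(2 / fan_in)
nn.init.xavier_normal_(W)                         # Xavier: std = sqrt(2 / (fan_in + fan_out)); overwrites W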

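nn.Linear's default initializer is a He-uniform variant: kaiming_uniform_ with a=√5, which works out to $W \sim U(-1/\sqrt{n_{\text{in}}}, 1/\sqrt{n_{\text{in}}})$.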
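The effect of the factor of 2 can be checked numerically. The sketch below pushes one random vector through a stack of square ReLU layers and compares the two weight variances. The width (128), depth (10), and the function name `mean_square_after` are assumptions for illustration; with $n_{\text{in}} = n_{\text{out}}$, the Xavier variance $2/(n_{\text{in}} + n_{\text{out}})$ equals $1/n_{\text{in}}$.

```python
import math
import random


def mean_square_after(depth, width, weight_var, seed=0):
    """Push one random vector through `depth` ReLU layers; return mean(h^2) at the end."""
    rng = random.Random(seed)
    std = math.sqrt(weight_var)
    h = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        # Each output unit: ReLU of a dot product with freshly drawn weights.
        h = [max(0.0, sum(rng.gauss(0.0, std) * hj for hj in h)) for _ in range(width)]
    return sum(v * v for v in h) / width


width, depth = 128, 10
he = mean_square_after(depth, width, weight_var=2.0 / width)       # He: Var(W) = 2 / n_in
xavier = mean_square_after(depth, width, weight_var=1.0 / width)   # Xavier with n_in = n_out
ratio = he / xavier
```

Because both runs use the same seed, the weights differ only by a factor of $\sqrt{2}$, and ReLU is positively homogeneous. The signal energy under the Xavier variance is therefore smaller by a factor of 2 per layer, about $2^{10}$ after ten layers, while the He variance keeps it at the same order of magnitude as the input.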

5. Diagram

6. Principle Walkthrough — Five Decisions Hidden in One Training Loop

A standard training loop looks simple, yet behind that single loop five decisions operate simultaneously: optimizer / regularization / initialization / learning-rate schedule / gradient clipping. Section 7 unpacks each into its family.

6.1 The Standard Training Loop — Five Decisions Compressed

import torch
import torch.nn.functional as F
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# MLPWithReg, cosine_warmup_fn, and the batch (xb, yb) are assumed to be defined elsewhere.
model = MLPWithReg(in_dim=64, hidden=128, out_dim=10, p_drop=0.2)   # ① init + regularization
opt = AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)          # ② optimizer
sched = LambdaLR(opt, lr_lambda=cosine_warmup_fn)                    # ③ LR schedule
for step in range(10_000):
    loss = F.cross_entropy(model(xb), yb)
    opt.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0) # ④ gradient clipping
    opt.step()
    sched.step()

Each of these decisions is a core part of neural-network training, not an optional extra.

6.2 Adam — Why It Combined Momentum with RMSProp

Observed pattern:

SGD: a single step along the current gradient. Oscillates on noisy gradients.
Momentum: an exponential average of past gradients accelerates along consistent directions and damps reversals. This is the first moment $m_t$.
RMSProp: per-parameter adaptive learning rate. Parameters with consistently large gradients get smaller steps; parameters with small or rare gradients get larger ones. This is the second moment $v_t$.
Adam (Kingma & Ba 2015): both at once. $m_t$ (direction) + $v_t$ (scale) + bias correction.

Why it became the NLP / Transformer default: per-parameter scale adaptation fits naturally to weights with very different distributions, such as embeddings, attention, and FFN layers. In vision, SGD+Momentum often still wins on generalization, but for Transformers AdamW is the de facto choice.

Forward link: Lion (Google 2023) and Sophia (Liu 2024) target large-LLM training. Lion cuts optimizer memory with a sign-based update; Sophia aims for faster convergence with a cheap Hessian-diagonal estimate.

6.3 BatchNorm vs LayerNorm — What Differs and Why Both Exist

What's different:

BatchNorm: normalizes across the sample dimension within a batch. CNN standard.
LayerNorm: normalizes across the channel dimension per sample. Transformer standard.

Why BatchNorm fails for Transformers:

Variable sequence length → sequence-wise statistics are weakly defined.
Small per-device batches in LLM training → noisy BatchNorm statistics.
At inference time, normalizing per input is more natural.

The framing: both stabilize each layer's input distribution — they just normalize along different dimensions depending on data structure.

6.4 Learning-Rate Schedule — Why cosine warmup Became the LLM Standard

Observed pattern:

Warmup (first N steps): linearly increase LR from 0 to peak. Protects weights from huge early gradients.
Cosine annealing (after peak): smoothly decay LR along a cosine curve down to zero or a small floor. Enables fine adjustment in the late stage.

Why this combination became standard:

Linear warmup ensures early stability.
Cosine annealing ensures smooth late-stage convergence.
GPT-3 (Brown et al. 2020) and LLaMA (Touvron et al. 2023) both use this combination, decaying to 10% of the peak LR rather than to zero.
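Both schedules reduce to a function from step count to a multiplier on the peak learning rate, which is the form `LambdaLR` expects for its `lr_lambda` argument. The training loop in 6.1 refers to `cosine_warmup_fn` without defining it; the version below is one possible definition, and `warmup_steps`, `total_steps`, `min_ratio`, and `drop_every` are assumed parameters, not values from the original.

```python
import math


def step_decay(epoch, drop_every=30, factor=0.1):
    """Multiplier for step decay: shrink by `factor` every `drop_every` epochs."""
    return factor ** (epoch // drop_every)


def cosine_warmup_fn(step, warmup_steps=500, total_steps=10_000, min_ratio=0.0):
    """Multiplier on the peak LR: linear warmup, then cosine decay to `min_ratio`."""
    if step < warmup_steps:
        return step / warmup_steps                       # linear ramp from 0 to 1
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 down to 0
    return min_ratio + (1.0 - min_ratio) * cosine
```

With these defaults the multiplier is 0 at step 0, reaches 1 at step 500, passes 0.5 halfway through the decay phase, and ends at `min_ratio`. Setting `min_ratio=0.1` gives the decay-to-10% variant. Note that a multiplier of exactly 0 at step 0 means the very first update does nothing; some implementations start the ramp at `(step + 1) / warmup_steps` to avoid that.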

Forward link: LLM training in Part 8 extends this perspective. Schedule and data ordering are themselves quality decisions.

6.5 Gradient Clipping — The Safety Net Against Blow-Up

What it stops: RNNs and Transformers are especially prone to gradient explosion. A single anomalous step can throw the weights far off and destroy training.

Mechanism: clip_grad_norm_(params, max_norm=1.0) computes the total L2 norm over all parameter gradients and, if it exceeds 1.0, rescales every gradient by the same factor. Direction is preserved; magnitude is bounded.

Why it is the standard:

Exploding gradients are a leading cause of NaN loss in LLM training.
max_norm=1.0 is a common default across Transformer pipelines.
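The clipping rule itself is only a few lines. This is a pure-Python sketch of global-norm clipping on a flat list of gradient values, not the PyTorch implementation; the function name `clip_by_global_norm` and the small `eps` in the denominator are assumptions made for the example.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Rescale gradient values so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))   # never scales gradients up
    return [g * scale for g in grads], total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
untouched, _ = clip_by_global_norm([0.03, 0.04], max_norm=1.0)
```

The gradient (3, 4) has norm 5 and is scaled down to roughly (0.6, 0.8), keeping the 3:4 direction. The small gradient is already inside the bound and passes through unchanged. Logging `norm_before` each step is a cheap way to see spikes before they turn into NaN.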

7. Variants & Case Studies — Training Tricks Form an Interacting Family

These choices are not picked separately. Optimizer, regularization, initialization, schedule, and gradient clipping affect each other.

7.1 Optimizer Family — Trade-Offs of Memory / Generalization / Speed

Name	Core idea	Where it lives
SGD + Momentum	Simple, strong generalization	Vision (ResNet-style CNNs)
Adam (Kingma 2015)	First & second moments adapt	NLP default
AdamW (Loshchilov 2019)	Decouples weight decay from the gradient update	Transformer standard
Adafactor (Shazeer 2018)	Memory-saving second moment	T5 / PaLM / large LLMs
Lion (Google 2023)	Sign-based, competitive with AdamW at lower memory	A few large models
Sophia (Liu 2024)	Hessian-diagonal second-order info	Large-LLM acceleration

The framing: as model size grows, the optimizer's own memory drives training cost — the reason Adafactor and Lion exist.
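A back-of-the-envelope calculation shows why. The sketch below counts only full-size state buffers per parameter and assumes fp32 state; the 7-billion-parameter model size and the function names are hypothetical, picked for round numbers.

```python
# Full-size state tensors kept per parameter.
STATE_BUFFERS = {
    "sgd": 0,            # no extra state
    "sgd_momentum": 1,   # velocity
    "adam": 2,           # first and second moments
    "adamw": 2,
    "lion": 1,           # momentum only
}


def optimizer_state_gb(num_params, optimizer, bytes_per_value=4):
    """Optimizer-state memory in GB (10^9 bytes); fp32 state by default."""
    return num_params * STATE_BUFFERS[optimizer] * bytes_per_value / 1e9


def adafactor_second_moment_values(rows, cols):
    """Adafactor keeps row and column statistics instead of a full rows x cols matrix."""
    return rows + cols


adamw_gb = optimizer_state_gb(7_000_000_000, "adamw")   # 56.0
lion_gb = optimizer_state_gb(7_000_000_000, "lion")     # 28.0
factored = adafactor_second_moment_values(4096, 4096)   # 8192 values instead of 16,777,216
```

Under these assumptions, AdamW state alone takes 56 GB for a 7B-parameter model, before counting weights, gradients, or activations. Halving that (Lion) or factoring the second moment (Adafactor) is a direct saving.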

7.2 Regularization Family — A Symphony of Mechanisms

Name	Where it acts	Effect
Dropout (Srivastava 2014)	Each forward pass	Implicit ensemble
DropPath / Stochastic Depth	Layer level	Generalization for deep nets
Label smoothing (Szegedy 2016)	Loss function	Prevents overconfidence, calibration ↑
Mixup (Zhang 2018)	Input side	Linear interpolation of two images
CutMix (Yun 2019)	Input side	Region swap between images
EMA	Inference time	Weight averaging for generalization
Weight decay	Optimizer	Shrinks weights toward zero (equals L2 regularization for plain SGD; decoupled in AdamW)

The framing: neural-net regularization intervenes at four phases at once: data, model, training, and inference. The mechanisms interact, so strengthening one in isolation can cancel the benefit of another.
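Two of the table's entries are short enough to write out. The sketch below shows label smoothing and mixup on plain lists; the function names are hypothetical, and `lam` is passed in directly, whereas in practice it is typically sampled from a Beta distribution for each batch.

```python
def smooth_labels(target, num_classes, eps=0.1):
    """Label smoothing: (1 - eps) on the true class plus eps / K spread over all classes."""
    return [(1.0 - eps) * (1.0 if k == target else 0.0) + eps / num_classes
            for k in range(num_classes)]


def mixup(x_a, y_a, x_b, y_b, lam):
    """Mixup: the same convex combination applied to inputs and to (soft) labels."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y


soft = smooth_labels(target=2, num_classes=4)   # true class 0.925, others 0.025
x_mix, y_mix = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], lam=0.7)
```

Both act by softening the training target: the model is never asked to put probability 1.0 on a single class, which is where the calibration benefit comes from.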

7.3 Initialization Family — Paired With Activations

Name	Paired activation	Author
Xavier / Glorot	sigmoid, tanh	Glorot & Bengio 2010
He / Kaiming	ReLU family	He et al. 2015
Orthogonal	RNNs, gated	Saxe et al. 2014
Identity	Residual blocks	He et al. 2016 (ResNet)
μP (maximal update parametrization)	Any; scales init with layer width	Yang 2021

The framing: initialization must match the activation's variance behavior. ReLU zeroes the negative half, so only half the input variance passes through → He init uses standard deviation $\sqrt{2/n_{\text{in}}}$. Xavier targets the near-linear regime of tanh and sigmoid with standard deviation $\sqrt{2/(n_{\text{in}} + n_{\text{out}})}$, which is $\sqrt{1/n_{\text{in}}}$ for square layers.

7.4 Learning-Rate Schedules — A Policy on Decay

Schedule	Behavior	Use
Step decay	$\eta \to \eta / 10$ every fixed number of epochs	Early-vision default
Cosine annealing (Loshchilov 2017)	Smooth decay	Vision and Transformers
One-cycle (Smith 2018)	Aggressive ramp then decay	fast.ai recommended
Cosine + warm restarts	Periodic re-start	Multi-model averaging
Linear warmup + cosine	Early stability + smooth late	The de facto LLM standard
Polynomial decay	Polynomial schedule	BERT training

The framing: learning rate is not a static hyperparameter but a function over time. The same model behaves very differently across schedules.

7.5 Industry Cases — LLM Training on One Line

Model	Optimizer	Schedule	Regularization	Init
BERT (2018)	AdamW	Linear warmup + decay	Dropout 0.1	Truncated normal
GPT-3 (2020)	AdamW (β=0.9, 0.95)	Cosine + warmup	No dropout	Scaled normal
LLaMA (2023)	AdamW (β=0.9, 0.95)	Cosine, lr=3e-4	RMSNorm, no dropout	Truncated normal
LLaMA 2	AdamW	Cosine	RMSNorm	μP-influenced
Mixtral	AdamW	Cosine	RMSNorm, MoE separate	Truncated normal

The framing: post-GPT, dropout has largely disappeared from LLM pre-training (or is applied much more weakly), and normalization has settled on LayerNorm / RMSNorm rather than BatchNorm. Training practice evolves.

8. Limits & Failure Modes

8.1 NaN Loss

Why it is essential: floating-point arithmetic easily overflows or underflows in $\exp$, $\log$, or large gradients. Once NaN appears, it propagates and ruins every weight.

How you spot it: loss is inf or NaN. Trace per step where it first appeared.

Next step:

LR ÷ 10, gradient clip max_norm=1.0
Normalize inputs (StandardScaler)
Mixed-precision training: ensure GradScaler is configured
Add LayerNorm/RMSNorm for stabilization
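The $\exp$ overflow is easy to reproduce and to avoid. The sketch below contrasts a naive log-softmax with the log-sum-exp form in pure Python; the function names are hypothetical. In pure Python the naive version raises `OverflowError`, whereas tensor libraries typically return inf or NaN instead of raising.

```python
import math


def naive_log_softmax(logits):
    exps = [math.exp(z) for z in logits]          # overflows for large logits
    total = sum(exps)
    return [math.log(e / total) for e in exps]


def stable_log_softmax(logits):
    """Log-sum-exp trick: subtract the max so every exponent is <= 0."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]


logits = [1000.0, 1001.0, 1002.0]
try:
    naive_log_softmax(logits)
    naive_failed = False
except OverflowError:
    naive_failed = True
stable = stable_log_softmax(logits)
```

The two versions agree on small logits, but only the stable one survives logits around 1000. Losses that take raw logits and apply this trick internally are the safer default over hand-written softmax followed by log.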

8.2 Plateau

Why it is essential: training is not progressing. Possible causes include too-low LR, dead ReLU, or an undersized model.

How you spot it: loss curve is flat. Gradient norms near zero.

Next step:

Try LR × 3
Swap sigmoid for ReLU or GELU; if ReLU units are dying, try LeakyReLU or GELU
Increase hidden size
Shorten warmup

8.3 Train ↓ Test ↑ (Overfitting)

Why it is essential: model capacity > information in the data. The §7.2 regularization symphony is insufficient.

How you spot it: train accuracy 99%, test accuracy 70%.

Next step: increase dropout p, increase weight decay, strengthen augmentation, apply early stopping. If still failing, shrink the model.

8.4 BatchNorm with Small Batches

Why it is essential: BatchNorm uses batch statistics. At small batch sizes (roughly below 16), those statistics are noisy and destabilize training.

How you spot it: the same model trains well at large batch size and badly at small batch size.

Next step: LayerNorm (Transformers), GroupNorm (CNNs), or SyncBatchNorm (distributed).

8.5 Adam vs SGD — The Generalization Paradox

Why it is essential: Adam wins on speed, but SGD+Momentum often wins on generalization (Wilson et al. 2017, The Marginal Value of Adaptive Gradient Methods), especially in vision.

How you spot it: on the same model, Adam has better train accuracy / R² but worse test.

Next step: try SGD + Momentum with a cosine schedule. Transformers still favor AdamW.

8.6 Distributed-Training Statistics Sync

Why it is essential: when each BatchNorm layer sees only its local batch, the per-GPU statistics disagree and normalization loses consistency.

How you spot it: single-GPU and multi-GPU learning curves diverge.

Next step: torch.nn.SyncBatchNorm.convert_sync_batchnorm(model) to synchronize statistics across GPUs. Or switch to LayerNorm.

What the Limits Sketch

The six limits of training compress into the LLM-era training standard:

NaN → mixed precision + grad clip 1.0
Plateau → cosine warmup
Overfit → weaker dropout + stronger augmentation
BatchNorm limits → LayerNorm/RMSNorm
Adam generalization → AdamW + weight decay
Distributed statistics → SyncBN or LN

9. Quick Recap — Answer Before You Peek

Q1. Why did Adam become the Transformer default?

Answer: Its per-parameter adaptive learning rate matches the very different distributions of embedding, attention, and FFN weights.

Why: Adam combines first-moment (direction) + second-moment (scale) + bias correction. SGD applies a single LR to every parameter, which is fine for vision but inadequate for Transformers. In vision, SGD+Momentum still often generalizes better (Wilson et al. 2017). (Sections 6.2, 8.5.)

Q2. Why do Transformers use LayerNorm instead of BatchNorm?

Answer: Variable sequence length and small per-device batches make BatchNorm statistics noisy. Normalizing per input (LayerNorm) is more natural.

Why: BatchNorm normalizes across the batch dimension; LayerNorm across the channel dimension. CNN image batches fit BatchNorm; Transformer token sequences fit LayerNorm. RMSNorm is a simpler LayerNorm variant (LLaMA). (Sections 6.3, 8.4.)

Q3. What's the difference between He init and Xavier init? Why pair them with activations?

Answer: Xavier: standard deviation $\sqrt{2/(n_{\text{in}} + n_{\text{out}})}$, i.e. $\sqrt{1/n_{\text{in}}}$ for square layers (sigmoid/tanh). He: standard deviation $\sqrt{2/n_{\text{in}}}$ (ReLU).

Why: ReLU zeroes the negative half, so only half the input variance passes through. To preserve output variance, the weight variance doubles. A wrong init makes activations and gradients blow up or die out exponentially with depth. (Section 7.3.)

Q4. Why has "linear warmup + cosine annealing" become the LLM training standard?

Answer: Warmup protects weights from large early gradients; cosine annealing produces smooth late-stage convergence.

Why: Early gradients are noisy → too-large LRs destroy weights. Ramp up gradually, then ease down for fine adjustment. GPT-3 and LLaMA both use this combination. (Sections 6.4, 7.4.)

Q5. Loss went NaN. Four first responses?

Answer: (1) LR ÷ 10, (2) gradient clip max_norm=1.0, (3) normalize inputs, (4) check GradScaler configuration if using mixed precision.

Why: The dominant cause is exploding gradients meeting floating-point limits. These four are the standard safety kit for LLM training. Adding LayerNorm/RMSNorm also helps. (Sections 6.5, 8.1.)

If four or five came easily, the five decisions of neural-network training are in place.

10. Further Reading

Primary sources

Kingma, D. P., Ba, J. Adam: A Method for Stochastic Optimization. ICLR 2015. arXiv:1412.6980.
Loshchilov, I., Hutter, F. Decoupled Weight Decay Regularization. ICLR 2019. arXiv:1711.05101 — AdamW.
Srivastava, N. et al. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR 15 (2014): 1929–1958.
Ioffe, S., Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML 2015. arXiv:1502.03167.
Ba, J. L., Kiros, J. R., Hinton, G. E. Layer Normalization. arXiv:1607.06450 (2016).
Glorot, X., Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. AISTATS 2010. — Xavier init.
He, K. et al. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. ICCV 2015. arXiv:1502.01852 — He init.
Smith, L. N. Cyclical Learning Rates for Training Neural Networks. WACV 2017. arXiv:1506.01186.

Official docs

PyTorch optimizers: https://pytorch.org/docs/stable/optim.html
LR schedulers: https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
BatchNorm / LayerNorm: https://pytorch.org/docs/stable/nn.html#normalization-layers

Companion books

Goodfellow, Bengio, Courville. Deep Learning. Chapter 8.
Bottou, Curtis, Nocedal. Optimization Methods for Large-Scale Machine Learning. SIAM Review 60(2), 223–311 (2018).

In Part 8 we take a tour of deep learning architectures: CNN (LeNet → AlexNet → ResNet), RNN/LSTM, the arrival of attention, and the long path that ended in the Transformer.

