"ML Foundations (5/9) — Regularization & Model Selection: Taming Overfitting"

Making a model bigger is easy: raise the polynomial degree, add features. The hard part is knowing when to stop. Regularization adds a single penalty line to the loss and lets the model police itself. That one line is the seed of Ridge, Lasso, ElasticNet, and the weight decay and dropout machinery of neural networks.


0. Learning Objectives

  • Write the bias-variance decomposition in one formula and identify where regularization acts.
  • State the Ridge (\(L_2\)) and Lasso (\(L_1\)) losses, with the closed form (Ridge) or KKT conditions (Lasso).
  • Explain geometrically why \(L_1\) produces sparse solutions, using two pictures.
  • State ElasticNet's loss and when to use it.
  • Pick \(\lambda\) with k-fold cross-validation in code.
  • Define AIC and BIC and contrast them with CV.

1. Key Summary

  • Regularization = add a complexity penalty to the loss: \(\mathcal{L} \to \mathcal{L} + \lambda \|w\|_p^p\), with \(p = 2\) (Ridge) or \(p = 1\) (Lasso).
  • Ridge (\(L_2\)): \(\lambda \|w\|_2^2\) — shrinks large weights. Robust under multicollinearity.
  • Lasso (\(L_1\)): \(\lambda \|w\|_1\) — drives weights exactly to zero. Automatic feature selection.
  • ElasticNet: weighted mix of both. Stable when \(p\) is large and features are correlated.
  • The hyperparameter \(\lambda\) is chosen by cross-validation. AIC and BIC are single-pass information criteria — faster alternatives to CV.
  • Weight decay, dropout, and early stopping in neural networks are all members of the same family.

2. Intuition — Where Overfitting Comes From

2.1 Polynomial Degree and Overfitting

In Part 3 we synthesized \(y = 0.5x^3 - 2x + 5 + \varepsilon\) and fit polynomials of varying degree. Training R² climbs toward 1 as the degree grows, but test R² collapses past some point.

  test R²
   1.0 +    .         .  .
       |  .  .       .    \
       |.    .     .       \
       |      \  .           \
   0.5 +       \              \
       |        \              \____
   0.0 +---------+----+----+----+----+--→ degree
              1    2    3    5    10   20

Overfitting is the moment the model starts memorizing noise instead of capturing signal.
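The curve above can be reproduced with a small experiment. The following is a minimal sketch, not the Part 3 code: the sample sizes, the noise level, and the input range are assumptions chosen for illustration, and the fit uses NumPy's `Polynomial.fit`.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def make_data(n):
    # Assumed setup: x uniform on [-3, 3], Gaussian noise with std 0.5.
    x = rng.uniform(-3.0, 3.0, size=n)
    y = 0.5 * x**3 - 2.0 * x + 5.0 + rng.normal(0.0, 0.5, size=n)
    return x, y

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

x_tr, y_tr = make_data(30)     # small training set, easy to overfit
x_te, y_te = make_data(200)    # fresh data from the same process

train_r2, test_r2 = {}, {}
for degree in (1, 3, 12):
    poly = Polynomial.fit(x_tr, y_tr, deg=degree)   # least-squares fit
    train_r2[degree] = r2(y_tr, poly(x_tr))
    test_r2[degree] = r2(y_te, poly(x_te))
    print(f"degree={degree:2d}  train R2={train_r2[degree]:.3f}  "
          f"test R2={test_r2[degree]:.3f}")
```

Training R² can only go up as the degree grows, because each model contains the previous one. Test R² is the number to watch.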

2.2 Why Shrinking Weights Helps

Overfit models typically have huge weights with alternating signs. Instead of expressing the signal with a few moderate terms, the fit leans on large positive and large negative weights that nearly cancel each other. The sum still passes close to every training point, but it swings wildly between them.

Regularization adds a norm penalty \(\lambda\|w\|_p^p\) to the loss, so growing the weights itself becomes expensive. The optimizer must trade fit quality against weight magnitude, and the result is a smoother function.

This also has a clean Bayesian reading: under MAP estimation, a zero-mean Gaussian prior on the weights gives \(L_2\), and a zero-mean Laplace prior gives \(L_1\). Regularization is a prior.
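To make the correspondence concrete, here is the standard MAP derivation as a sketch. It assumes Gaussian noise with variance \(\sigma^2\) and independent zero-mean priors on each weight; the symbols \(\sigma^2\), \(\tau^2\), and \(b\) are introduced here for illustration only.

$$ \hat{w}_{\text{MAP}} = \arg\max_w \big[\log p(y \mid X, w) + \log p(w)\big] $$

$$ w_j \sim \mathcal{N}(0, \tau^2): \quad -\log p(w \mid X, y) = \frac{1}{2\sigma^2}\|y - Xw\|_2^2 + \frac{1}{2\tau^2}\|w\|_2^2 + \text{const} $$

$$ w_j \sim \text{Laplace}(0, b): \quad -\log p(w \mid X, y) = \frac{1}{2\sigma^2}\|y - Xw\|_2^2 + \frac{1}{b}\|w\|_1 + \text{const} $$

Multiplying through by \(\sigma^2 / n\) gives a \(\frac{1}{2n}\)-scaled squared error plus \(\lambda\) times the norm, with \(\lambda = \sigma^2 / (2n\tau^2)\) in the Gaussian case and \(\lambda = \sigma^2 / (nb)\) in the Laplace case. A wider prior (larger \(\tau\) or \(b\)) means a weaker penalty.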

2.3 Bias-Variance Decomposition, Revisited

From Part 1:

$$ \mathbb{E}[L] = \underbrace{\text{Bias}^2}_{\text{model simplicity}} + \underbrace{\text{Variance}}_{\text{data jitter}} + \sigma^2 $$

Regularization reduces variance and slightly increases bias. Small \(\lambda\) → plain regression (high variance), \(\lambda \to \infty\) → \(w = 0\) (high bias). The optimum minimizes the sum.
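The trade-off can be checked by simulation. The sketch below makes several assumptions that are not from the text: a fixed random design, a hypothetical true coefficient vector, an unscaled Ridge objective without intercept, and bias and variance measured on the coefficient estimates rather than on predictions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 40, 5, 1.0
X = rng.normal(size=(n, p))                       # fixed design (assumed)
beta_true = np.array([2.0, -1.0, 0.5, 0.0, 1.0])  # hypothetical truth

def ridge_fit(X, y, lam):
    # Minimizer of ||y - X b||^2 + lam * ||b||^2 (no intercept).
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

results = {}
for lam in (0.0, 10.0, 400.0):
    # Refit on 500 independent noise draws for the same X.
    fits = np.array([
        ridge_fit(X, X @ beta_true + rng.normal(0.0, sigma, size=n), lam)
        for _ in range(500)
    ])
    bias2 = np.sum((fits.mean(axis=0) - beta_true) ** 2)
    variance = np.sum(fits.var(axis=0))
    results[lam] = (bias2, variance)
    print(f"lambda={lam:6.1f}  bias^2={bias2:.4f}  variance={variance:.4f}")
```

As \(\lambda\) grows, the variance column falls and the squared-bias column rises, which is the trade-off the decomposition describes.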


3. Definitions & Notation

3.1 Ridge Regression (L2)

$$ \mathcal{L}_{\text{Ridge}}(\beta) = \frac{1}{2n}\|y - X\beta\|_2^2 + \lambda \|w\|_2^2 $$

where \(w\) is the weight vector without the intercept and \(\lambda > 0\) is the regularization strength.

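Closed form (written for the unscaled objective \(\|y - X\beta\|_2^2 + \lambda\|\beta\|_2^2\); with the \(\frac{1}{2n}\) scaling above, replace \(\lambda\) by \(2n\lambda\)):

$$ \boxed{\,\hat{\beta}_{\text{Ridge}} = (X^\top X + \lambda I_d)^{-1} X^\top y\,} $$

(In practice the intercept is excluded from the penalty, typically by centering \(X\) and \(y\) first.)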
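The closed form is easy to check numerically. The sketch below uses synthetic data (an assumption) and compares the formula with scikit-learn's `Ridge`, which minimizes the unscaled objective \(\|y - Xw\|_2^2 + \alpha\|w\|_2^2\); the intercept is switched off to keep the comparison exact.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, -2.0, 0.0, 0.5]) + rng.normal(0.0, 0.1, size=50)
lam = 3.0

# Closed form for the unscaled objective ||y - X b||^2 + lam * ||b||^2.
beta_closed = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# scikit-learn's Ridge minimizes the same objective with alpha = lam.
beta_sklearn = Ridge(alpha=lam, fit_intercept=False, solver="cholesky").fit(X, y).coef_

print(beta_closed)
print(beta_sklearn)
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse explicitly, which is the usual numerical advice.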

3.2 Lasso Regression (L1)

$$ \mathcal{L}_{\text{Lasso}}(\beta) = \frac{1}{2n}\|y - X\beta\|_2^2 + \lambda \|w\|_1 $$

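The \(L_1\) term is non-differentiable at zero, so there is no closed form in general. Solve with coordinate descent, cycling over \(j\):

$$ w_j \leftarrow \mathrm{soft}_\lambda\Big(\, \frac{x_j^\top r_{-j}}{x_j^\top x_j} \,\Big) $$

with \(r_{-j} = y - \sum_{k \ne j} x_k w_k\) the residual excluding the \(j\)-th term and \(\mathrm{soft}_\lambda(z) = \mathrm{sign}(z)\max(|z| - \lambda, 0)\) the soft threshold. This form assumes standardized columns (\(x_j^\top x_j = n\)); otherwise the threshold becomes \(n\lambda / x_j^\top x_j\).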
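The update can be sketched in a few lines of NumPy. This is a minimal illustration, not scikit-learn's implementation: it runs a fixed number of sweeps with no convergence check, omits the intercept, and handles non-standardized columns by dividing by \(x_j^\top x_j / n\). The function names and the synthetic data are assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=500):
    # Minimizes (1/2n)||y - X w||^2 + lam * ||w||_1, no intercept.
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n                 # x_j^T x_j / n
    for _ in range(n_sweeps):
        for j in range(p):
            r_minus_j = y - X @ w + X[:, j] * w[j]    # residual without feature j
            rho = X[:, j] @ r_minus_j / n
            w[j] = soft_threshold(rho, lam) / col_sq[j]
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, 0.0, -2.0, 0.0]) + rng.normal(0.0, 1.0, size=100)

w_cd = lasso_cd(X, y, lam=0.5)
w_sk = Lasso(alpha=0.5, fit_intercept=False, tol=1e-10,
             max_iter=100_000).fit(X, y).coef_
print(w_cd)
print(w_sk)
```

scikit-learn's `Lasso` uses the same \(\frac{1}{2n}\)-scaled objective, so the two coefficient vectors should agree closely, including the exact zeros.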

3.3 ElasticNet

$$ \mathcal{L}_{\text{EN}}(\beta) = \frac{1}{2n}\|y - X\beta\|_2^2 + \lambda \big(\alpha \|w\|_1 + (1 - \alpha) \|w\|_2^2\big) $$

with mixing parameter \(\alpha \in [0, 1]\). Proposed by Zou & Hastie (2005).
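scikit-learn parameterizes ElasticNet slightly differently: its penalty is `alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2`. The sketch below converts \((\lambda, \alpha)\) as defined above into scikit-learn's `(alpha, l1_ratio)`. The helper names `to_sklearn_params` and `objective` are hypothetical, and the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

def to_sklearn_params(lam, alpha_mix):
    # Match lam * (alpha_mix*||w||_1 + (1 - alpha_mix)*||w||_2^2) term by term
    # against sklearn's alpha*l1_ratio*||w||_1 + 0.5*alpha*(1 - l1_ratio)*||w||_2^2.
    sk_alpha = lam * (2.0 - alpha_mix)
    l1_ratio = alpha_mix / (2.0 - alpha_mix)
    return sk_alpha, l1_ratio

def objective(w, X, y, lam, alpha_mix):
    # The ElasticNet loss exactly as written in Section 3.3 (no intercept).
    n = len(y)
    fit = np.sum((y - X @ w) ** 2) / (2 * n)
    penalty = alpha_mix * np.abs(w).sum() + (1 - alpha_mix) * np.sum(w ** 2)
    return fit + lam * penalty

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 6))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.0, 0.5]) + rng.normal(0.0, 0.5, size=80)

lam, alpha_mix = 0.1, 0.5
sk_alpha, l1_ratio = to_sklearn_params(lam, alpha_mix)
model = ElasticNet(alpha=sk_alpha, l1_ratio=l1_ratio, fit_intercept=False,
                   tol=1e-10, max_iter=100_000).fit(X, y)
w_hat = model.coef_
print("sklearn alpha:", sk_alpha, " l1_ratio:", l1_ratio)
print("objective at fit:", objective(w_hat, X, y, lam, alpha_mix))
```

If the conversion is right, `w_hat` minimizes the Section 3.3 loss, so any small perturbation of it should not lower `objective`.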


4. Math & Mechanism

4.1 Why Ridge Is Stable

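For any \(\lambda > 0\), \(X^\top X + \lambda I\) is invertible: \(X^\top X\) is positive semidefinite, and adding \(\lambda I\) lifts every eigenvalue to at least \(\lambda\). Even when \(X^\top X\) is near-singular under multicollinearity, adding \(\lambda\) along the diagonal improves the condition number.

Via the SVD, Ridge shrinks each singular direction by a different amount. With \(X = U \Sigma V^\top\):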

$$ \hat{\beta}_{\text{Ridge}} = V \, \mathrm{diag}\Big(\frac{\sigma_i}{\sigma_i^2 + \lambda}\Big) \, U^\top y $$

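Least squares weights direction \(i\) by \(1/\sigma_i\); Ridge multiplies that weight by \(\sigma_i^2 / (\sigma_i^2 + \lambda)\). For small singular values (the noise directions) the Ridge weight is about \(\sigma_i / \lambda\), far below \(1/\sigma_i\); for large ones (signal) it stays close to \(1/\sigma_i\). Only the noise directions are aggressively shrunk.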
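The SVD expression can be verified against the direct solve. The toy design below, with two nearly collinear columns, is an assumption chosen to produce one very small singular value.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 60
z = rng.normal(size=n)
# Two nearly collinear columns plus one independent column (assumed toy design).
X = np.column_stack([z, z + 0.01 * rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([1.0, 1.0, -1.0]) + rng.normal(0.0, 0.5, size=n)
lam = 1.0

U, s, Vt = np.linalg.svd(X, full_matrices=False)
beta_svd = Vt.T @ ((s / (s**2 + lam)) * (U.T @ y))
beta_direct = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Per-direction factor relative to ordinary least squares (1 / s_i).
shrink = s**2 / (s**2 + lam)
for s_i, f_i in zip(s, shrink):
    print(f"singular value={s_i:8.3f}  ridge/OLS factor={f_i:.4f}")
```

The large singular values keep a factor close to 1, while the near-collinear direction is scaled almost to zero.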

4.2 The Geometry of Lasso — Why Sparsity Emerges

The optimum sits where the loss contours (ellipses) just touch the constraint region (\(\|w\|_p \le t\), with \(p = 2\) for Ridge and \(p = 1\) for Lasso).

  • \(L_2\)'s constraint region is a disk. The contact point generally is not on a coordinate axis, so coefficients are nonzero.
  • \(L_1\)'s constraint region is a diamond with vertices on the coordinate axes. Contact points tend to land at vertices, where some coefficients are exactly zero.

L2 (disk):                     L1 (diamond):
   |                              |
   |  *  ← loss contours          |  ↘
  -|---+---                       --+----- ← vertex on axis
   |                              |

That is precisely what "Lasso performs automatic feature selection" means. Many of the resulting coefficients are exactly zero.

4.3 Picking \(\lambda\)

Grid search with k-fold CV. sklearn helpers:

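from sklearn.linear_model import RidgeCV, LassoCV, ElasticNetCV
# scikit-learn calls the penalty strength `alpha`; `l1_ratio` is the L1/L2 mix.
RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0], cv=5)         # explicit grid
LassoCV(cv=5)                                        # builds its own alpha grid
ElasticNetCV(l1_ratio=[.1, .5, .7, .9, .95], cv=5)   # also searches the mix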

Alternatives:

  • 1-SE rule (Breiman et al. 1984): choose the largest \(\lambda\) whose mean CV error is within one standard error of the best. Favors simpler models.
  • AIC / BIC: one-shot information criteria.

$$ \text{AIC} = 2k - 2\log \mathcal{L}, \quad \text{BIC} = k \log n - 2\log \mathcal{L} $$

where \(k\) is the effective parameter count, \(n\) the sample size, and \(\mathcal{L}\) here denotes the maximized likelihood (not the training loss). BIC penalizes larger models more heavily once \(\log n > 2\), that is, for \(n \ge 8\). BIC is the typical pick for small data; CV is the default once you have enough samples.
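For a Gaussian linear model both criteria can be computed by hand from the residual sum of squares. The sketch below selects a polynomial degree on synthetic data; the data-generating setup, the helper name `gaussian_ic`, and the choice to count the noise variance as one parameter are assumptions.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-3.0, 3.0, size=n)
y = 0.5 * x**3 - 2.0 * x + 5.0 + rng.normal(0.0, 1.0, size=n)

def gaussian_ic(y, y_hat, k):
    # With Gaussian noise and the variance set to its ML estimate RSS/n:
    #   -2 log L = n * log(2*pi*RSS/n) + n
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    neg2_loglik = n * np.log(2 * np.pi * rss / n) + n
    aic = 2 * k + neg2_loglik
    bic = k * np.log(n) + neg2_loglik
    return aic, bic

scores = {}
for degree in range(1, 9):
    y_hat = Polynomial.fit(x, y, deg=degree)(x)
    k = degree + 2          # polynomial coefficients plus the noise variance
    scores[degree] = gaussian_ic(y, y_hat, k)

best_aic = min(scores, key=lambda d: scores[d][0])
best_bic = min(scores, key=lambda d: scores[d][1])
print("AIC picks degree", best_aic, "| BIC picks degree", best_bic)
```

Because BIC charges more per parameter, its choice is never more complex than AIC's on the same set of nested fits.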

4.4 Regularization Inside Neural Networks

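With plain SGD, the optimizer flag weight_decay=1e-4 is effectively \(L_2\): each step does \(w \leftarrow w - \eta(\nabla \mathcal{L} + \lambda w)\), which is a gradient step on \(\mathcal{L} + \frac{\lambda}{2}\|w\|_2^2\). With adaptive optimizers such as Adam the two are no longer equivalent, which is the motivation for decoupled weight decay as in AdamW. Dropout regularizes by injecting noise into the activations; early stopping is regularization along the time axis. They are all in the same family, and Part 7 covers this in depth.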
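The equivalence for plain SGD can be checked numerically. The sketch below uses a toy least-squares loss (an assumption) and compares the weight-decay update with a gradient step on the loss plus \(\frac{\lambda}{2}\|w\|_2^2\), where that gradient is obtained by finite differences so the check does not rely on the same algebra twice.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))
y = rng.normal(size=32)
w = rng.normal(size=4)
eta, lam = 0.1, 1e-2

def grad_data_loss(w):
    # Gradient of the data loss (1/2n)||y - X w||^2.
    return X.T @ (X @ w - y) / len(y)

# (a) Weight-decay form of the update, as written in the text.
w_decay = w - eta * (grad_data_loss(w) + lam * w)

# (b) Plain gradient step on loss + (lam/2)||w||^2, via central differences.
def penalized_loss(w):
    return np.sum((y - X @ w) ** 2) / (2 * len(y)) + 0.5 * lam * np.sum(w ** 2)

eps = 1e-6
num_grad = np.array([
    (penalized_loss(w + eps * e) - penalized_loss(w - eps * e)) / (2 * eps)
    for e in np.eye(4)
])
w_l2 = w - eta * num_grad

print("max difference:", np.max(np.abs(w_decay - w_l2)))
```

Note the factor \(\frac{1}{2}\): a penalty of \(\lambda\|w\|_2^2\) without it would contribute \(2\lambda w\) to the gradient.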


5. Diagram

[Diagram 1: image not included in this text version]

6. Principle Walkthrough — The Geometric Difference Between L1 and L2 in One Picture

The code itself is a one-liner. The geometric intuition behind why L1 and L2 produce different solutions is the key that unlocks all of regularization.

6.1 The Standard Call

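import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
pipe = Pipeline([("scaler", StandardScaler()), ("reg", RidgeCV(alphas=np.logspace(-3, 3, 13), cv=5))])
pipe.fit(X_tr, y_tr)  # X_tr, y_tr: the training split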

RidgeCV runs an internal 5-fold CV across the \(\alpha\) grid (scikit-learn's name for \(\lambda\)) and picks the optimum automatically. StandardScaler must come first: the \(L_2\) penalty \(\lambda \|w\|^2\) acts on the coefficients, and a coefficient's size depends on its feature's units. If units differ, the amount of shrinkage each feature receives is decided by unit choice rather than by the data.

6.2 The Real Reason Ridge Is Stable — The SVD View

What you watch: The normal equation \(\hat{\beta} = (X^{\top}X)^{-1} X^{\top} y\) explodes when \(X^{\top}X\) is near-singular. Ridge replaces the inverse with \((X^{\top}X + \lambda I)^{-1}\), which forces invertibility.

Rewriting with \(X = U \Sigma V^{\top}\):

$$ \hat{\beta}_{\text{Ridge}} = V \, \mathrm{diag}\!\Big(\frac{\sigma_i}{\sigma_i^2 + \lambda}\Big) \, U^{\top} y $$

  • Large \(\sigma_i\) (signal directions): weight \(\sigma_i / (\sigma_i^2 + \lambda) \approx 1/\sigma_i\), nearly unchanged from least squares.
  • Small \(\sigma_i\) (noise directions): weight \(\approx \sigma_i / \lambda\), far below the least-squares \(1/\sigma_i\), so strongly shrunk.

The framing: Ridge selectively shrinks the noise directions. It does not flatten everything uniformly. This is the same place Part 3 §6.2 landed on SVD stability.

6.3 Lasso's Geometry — Why Sparse Solutions

The intersection of the loss contour (an ellipse) and the constraint region gives the solution.

L2 (Ridge): a circle                L1 (Lasso): a diamond
        |                                    |
       /-\                                  /\
      |   |  ●← loss contour              /  \● ← loss contour
       \-/                                  \/
        |                                    |

\(L_2\) region: a ball (a disk in 2D). The loss contour generally meets it off-axis, so all \(w_j\) end up nonzero but small.

\(L_1\) region: a diamond, with vertices on the axes. The first point the expanding loss contour touches is often a vertex, meaning some \(w_j\) are exactly zero.

Origin story: Tibshirani 1996, Regression Shrinkage and Selection via the Lasso (JRSS-B). Before this, feature selection and regularization were separate procedures. Lasso does both in a single optimization. This is why it took off in genomics, text, and any \(p \gg n\) domain.

The framing: the shape of the penalty region decides the character of the solution. L2 smooths; L1 sparsifies; ElasticNet blends.
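The difference shows up directly in the fitted coefficients. The sketch below assumes synthetic data in which only 3 of 20 features carry signal, and penalty strengths chosen for illustration; it counts exact zeros under each penalty.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n, p = 200, 20
X = StandardScaler().fit_transform(rng.normal(size=(n, p)))
beta_true = np.zeros(p)
beta_true[:3] = [4.0, -3.0, 2.0]             # only 3 of 20 features carry signal
y = X @ beta_true + rng.normal(0.0, 1.0, size=n)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.3).fit(X, y)

n_zero_ridge = int(np.sum(ridge.coef_ == 0.0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0.0))
print("exact zeros  Ridge:", n_zero_ridge, " Lasso:", n_zero_lasso)
```

Ridge leaves every coefficient small but nonzero, while Lasso sets most of the noise coefficients to exactly zero and keeps the three signal features.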

6.4 Choosing \(\lambda\) — How?

  • k-fold CV: the most honest. RidgeCV / LassoCV automate it. The default for large data.
  • 1-SE rule (Breiman et al. 1984): pick the largest \(\lambda\) whose mean CV error is within one standard error of the minimum. Favors simpler models.
  • AIC / BIC: information criteria. One-pass but distributionally dependent.

$$ \text{AIC} = 2k - 2\log \mathcal{L}, \quad \text{BIC} = k \log n - 2\log \mathcal{L} $$

BIC's \(\log n\) penalty exceeds AIC's \(2\) once \(n \ge 8\), so BIC picks simpler models, and the gap widens on large datasets.
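scikit-learn does not apply the 1-SE rule for you, but it can be layered on top of `LassoCV` using its fitted `alphas_` and `mse_path_` attributes. The following is a sketch on synthetic data; the standard-error estimate across folds is one common convention, not the only one.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 30))
y = X[:, :3] @ np.array([3.0, -2.0, 1.5]) + rng.normal(0.0, 1.0, size=150)

cv = LassoCV(cv=5).fit(X, y)                 # default alpha grid

mean_mse = cv.mse_path_.mean(axis=1)         # one value per alpha
n_folds = cv.mse_path_.shape[1]
se_mse = cv.mse_path_.std(axis=1, ddof=1) / np.sqrt(n_folds)

best = np.argmin(mean_mse)
threshold = mean_mse[best] + se_mse[best]
# Largest alpha whose mean CV error is within one SE of the minimum.
alpha_1se = cv.alphas_[mean_mse <= threshold].max()
print("CV-optimal alpha:", cv.alpha_, " 1-SE alpha:", alpha_1se)
```

The 1-SE choice is never smaller than the CV-optimal one, so the refit model is at least as sparse.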

Forward link: In neural networks, \(\lambda\) selection becomes a joint choice of weight-decay strength, dropout rate, early-stopping epoch, augmentation strength, and BatchNorm momentum, all interacting (Part 7). A single "one-line regularization" gives way to a regularization symphony.


7. Variants & Case Studies — Regularization Is Not One Idea

L1 and L2 are the starting points; data structure decides which variant is more appropriate. Neural networks then explode the single \(\lambda\) into a symphony of mechanisms.

7.1 Group Lasso — Keep or Drop Feature Groups Together

What changes: Replace L1's \(\|w\|_1\) with a group norm such as \(\sum_g \sqrt{p_g}\,\|w_g\|_2\), where \(w_g\) collects the \(p_g\) coefficients of group \(g\). For example, all five dummy variables of a one-hot-encoded categorical feature survive or die together.

Why it appeared: Yuan & Lin 2006, Model Selection and Estimation in Regression with Grouped Variables. Plain L1 arbitrarily keeps a single member of a group, and domain interpretation collapses.

What it enabled:

  • Categorical features: cities 1–5 stay or leave as a unit.
  • Multi-resolution time features (day, week, month): kept together when meaningful.
  • Gene clusters in genomics.
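The "survive or die together" behavior comes from the proximal step of the group norm, which shrinks each block as a unit. Below is a minimal sketch of that step for the penalty \(\lambda \sum_g \sqrt{p_g}\,\|w_g\|_2\). The function name and the `groups` layout are hypothetical, and this is only the thresholding operator, not a full solver.

```python
import numpy as np

def group_soft_threshold(w, groups, lam):
    # Proximal operator of lam * sum_g sqrt(p_g) * ||w_g||_2.
    # `groups` is a list of index arrays, one per feature group.
    out = np.zeros_like(w)
    for idx in groups:
        norm = np.linalg.norm(w[idx])
        t = lam * np.sqrt(len(idx))
        if norm > t:
            out[idx] = (1.0 - t / norm) * w[idx]   # shrink the whole block
        # otherwise the whole block stays at exactly zero
    return out

w = np.array([0.3, -0.2, 0.1, 2.0, -1.5])
groups = [np.array([0, 1, 2]), np.array([3, 4])]
w_prox = group_soft_threshold(w, groups, lam=0.5)
print(w_prox)
```

The weak first group is zeroed as a whole, while the strong second group is kept and scaled by a single common factor.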

7.2 Fused Lasso — Smoothness on Adjacent Coefficients

What changes: The penalty becomes \(\|w\|_1 + \|Dw\|_1\), where \(D\) is a difference matrix. It includes \(|w_j - w_{j+1}|\), so adjacent coefficients are pulled to similar values.

Why it appeared: Tibshirani et al. 2005, Sparsity and Smoothness via the Fused Lasso. For time-series or spatial data where features have order, adjacent coefficients are usually close. Encode that prior into the penalty.

What it enabled:

  • DNA copy-number variant detection (piecewise-constant signal).
  • Change-point detection in time series.
  • Piecewise-constant image segmentation.
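To see what the difference term rewards, the sketch below builds \(D\) explicitly and evaluates the penalty on two coefficient vectors with the same \(L_1\) norm. The separate weights `lam1` and `lam2` and the example vectors are assumptions for illustration.

```python
import numpy as np

def fused_lasso_penalty(w, lam1, lam2):
    # lam1 * ||w||_1 + lam2 * ||D w||_1, with D the first-difference matrix.
    p = len(w)
    D = np.diff(np.eye(p), axis=0)           # row j is e_{j+1} - e_j, shape (p-1, p)
    return lam1 * np.abs(w).sum() + lam2 * np.abs(D @ w).sum()

w_piecewise = np.array([0.0, 0.0, 2.0, 2.0, 2.0, 0.0])   # one flat plateau
w_jagged = np.array([0.0, 2.0, 0.0, 2.0, 0.0, 2.0])      # same L1 norm, no order

pen_piecewise = fused_lasso_penalty(w_piecewise, lam1=1.0, lam2=1.0)
pen_jagged = fused_lasso_penalty(w_jagged, lam1=1.0, lam2=1.0)
print("piecewise-constant:", pen_piecewise, " jagged:", pen_jagged)
```

Both vectors cost the same under plain L1; only the difference term separates them, which is why the fused penalty favors piecewise-constant solutions.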

7.3 SCAD / MCP — Lighter Penalty on Big Coefficients

What changes: L1 penalizes every coefficient at the same rate, including the genuinely large ones. SCAD (Fan & Li 2001) and MCP (Zhang 2010) penalize small coefficients strongly while letting large ones through nearly unpenalized.

Why it appeared: Shrinking the genuine signal biases large coefficients and breaks the oracle property, so the procedure fails to recover the true model. Non-convex penalties answer this.

What it enabled: Theoretical oracle consistency (as \(n \to \infty\), the true model is recovered) combined with practical sparsity. A standard candidate in genomics studies.

7.4 Regularization for Classification — Direct Extension of Part 4

Part 4's logistic regression with penalty="l1"/"l2"/"elasticnet" is exactly the same machinery, with the strength reparameterized as \(C = 1/\lambda\) (a larger C means weaker regularization). SVM's regularization works on the same principle: maximize the margin while regularizing.

7.5 Regularization in Neural Networks — A Symphony

In neural networks, regularization is no longer one knob — it is several simultaneous mechanisms.

| Mechanism | Where it acts | Equivalent to |
|---|---|---|
| Weight decay | Optimizer | L2 regularization |
| Dropout (Srivastava et al. 2014) | Each forward pass | Implicit ensemble |
| BatchNorm (Ioffe & Szegedy 2015) | Each layer | Normalization + learned affine |
| LayerNorm (Ba et al. 2016) | Each token | The Transformer default |
| Early stopping | Training time dimension | Temporal regularization |
| Data augmentation | Input space | Input-side regularization |
| Label smoothing | Loss function | Output-side regularization |

These interact — applying weight decay + dropout + augmentation simultaneously usually requires lowering each (Part 7).

7.6 When to Use What

| Situation | Recommended | Why |
|---|---|---|
| Multicollinearity, every feature matters | Ridge | Stabilize all coefficients, preserve interpretation |
| Many features, few are real signal | Lasso | Automatic feature selection |
| Groups of correlated features | ElasticNet | Co-survival of groups |
| Small \(n\), large \(p\) | BIC or 1-SE rule | Conservative selection |
| Large dataset | CV over a wide grid | No distributional assumption |
| Categorical group features | Group Lasso | Preserve domain interpretation |
| Neural network | Weight decay + dropout + augmentation combo | Single mechanism insufficient |

8. Limits & Failure Modes

8.1 Choosing \(\lambda\) on a Single Split

Why it is essential: A single train/val split has large random variance. Different random_state values can shift the chosen \(\lambda^{*}\) by a factor of two or three.

How you spot it: Re-run with different seeds and check whether \(\lambda^{*}\) is stable.

Next step: k-fold CV (k=5 or 10), where every part of the data plays validation in turn. Or nested CV: the outer folds estimate test performance, the inner folds select \(\lambda\).
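Nested CV can be assembled from pieces scikit-learn already provides. In the sketch below, `RidgeCV` is the inner loop and `cross_val_score` the outer loop; the synthetic data and the hypothetical coefficient vector are assumptions.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 10))
beta = np.linspace(-2.0, 2.0, 10)            # hypothetical true coefficients
y = X @ beta + rng.normal(0.0, 1.0, size=120)

# Inner loop: RidgeCV picks alpha by 5-fold CV on each outer training fold.
model = make_pipeline(StandardScaler(),
                      RidgeCV(alphas=np.logspace(-3, 3, 13), cv=5))

# Outer loop: 5 held-out folds that never influence the choice of alpha.
outer = KFold(n_splits=5, shuffle=True, random_state=0)
outer_scores = cross_val_score(model, X, y, cv=outer, scoring="r2")
print("outer-fold R2:", np.round(outer_scores, 3))
```

Because scaling and \(\lambda\) selection both live inside the pipeline, each outer test fold is untouched by either, which keeps the reported score honest.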

8.2 Forgetting to Scale

Why it is essential: The penalty \(\lambda \sum w_j^2\) acts on coefficients, whose size depends on feature units. A feature whose numeric scale is 1000× smaller than another's needs a 1000× larger coefficient for the same effect, so it pays a 1,000,000× larger penalty and is effectively shrunk out of the model.

How you spot it: Check per-feature std. A 10× gap is enough to demand normalization.

Next step: Always wrap with StandardScaler or RobustScaler inside a Pipeline. In neural networks, BatchNorm / LayerNorm play a similar role for activations.
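The effect is easy to reproduce. The sketch below assumes two equally important synthetic features, one of them recorded in units that make its numeric values 1000× smaller, and compares Ridge with and without scaling. Effect size is reported as coefficient times feature std so the two features are comparable.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)
b = rng.normal(size=n)
y = a + b + rng.normal(0.0, 0.1, size=n)     # both features matter equally

# Same information, but feature b is stored with values 1000x smaller.
X_raw = np.column_stack([a, b / 1000.0])

raw = Ridge(alpha=1.0).fit(X_raw, y)
scaled = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X_raw, y)

# Per-feature effect size = coefficient * feature std (comparable across units).
effect_raw = raw.coef_ * X_raw.std(axis=0)
effect_scaled = scaled.named_steps["ridge"].coef_   # inputs already have std 1
print("raw    :", np.round(effect_raw, 3))
print("scaled :", np.round(effect_scaled, 3))
```

Without scaling, the small-valued feature is almost erased even though it matters as much as the other; inside the pipeline both effects are recovered.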

8.3 Grid Too Narrow

Why it is essential: sklearn's default grid may not match your data, and the true optimum could lie outside it. CV only finds the optimum within the grid.

How you spot it: The chosen \(\lambda^{*}\) sits at the grid edge (min or max). Widen until it lands in the interior.

Next step: Start with np.logspace(-4, 4, 40) and adjust until \(\lambda^{*}\) lies in the middle.
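The edge check is simple enough to automate. The helper `at_grid_edge` below is hypothetical, and the synthetic data and the deliberately narrow grid are assumptions; which \(\lambda^{*}\) gets chosen depends on the data, so the sketch prints the result rather than promising one.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

def at_grid_edge(chosen, grid):
    # True when the selected value is the smallest or largest candidate.
    grid = np.asarray(grid)
    return bool(np.isclose(chosen, grid.min(), rtol=1e-9, atol=0.0)
                or np.isclose(chosen, grid.max(), rtol=1e-9, atol=0.0))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = 0.1 * (X @ rng.normal(size=20)) + rng.normal(0.0, 1.0, size=100)

narrow_grid = np.logspace(-4, -2, 5)
wide_grid = np.logspace(-4, 4, 40)
for name, grid in (("narrow", narrow_grid), ("wide", wide_grid)):
    alpha = RidgeCV(alphas=grid, cv=5).fit(X, y).alpha_
    print(f"{name:6s} grid: alpha*={alpha:.4g}  at edge: {at_grid_edge(alpha, grid)}")
```

If the check returns True, widen the grid in that direction and search again before trusting the result.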

8.4 Lasso's Unstable Selection

Why it is essential: Among highly correlated features, Lasso picks one arbitrarily. Small data perturbations can swap which feature survives, so interpretation becomes unstable.

How you spot it: Bootstrap resamples, or vary \(\lambda\) slightly, and watch the selected-feature set change.

Next step: ElasticNet (L1 + L2), which keeps correlated groups together. Or stability selection (Meinshausen & Bühlmann 2010), which adopts only features chosen consistently across bootstrap resamples.
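The bootstrap check can be sketched as follows. This is only the diagnostic (selection frequency per feature), not the full stability-selection procedure of Meinshausen & Bühlmann; the synthetic data with two near-duplicate features and the penalty strength are assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n = 200
z = rng.normal(size=n)
# Features 0 and 1 are near-duplicates of the same signal; 2-5 are pure noise.
X = np.column_stack([z + 0.05 * rng.normal(size=n),
                     z + 0.05 * rng.normal(size=n),
                     rng.normal(size=(n, 4))])
y = 2.0 * z + rng.normal(0.0, 1.0, size=n)

n_boot = 100
selected = np.zeros(X.shape[1])
for _ in range(n_boot):
    idx = rng.integers(0, n, size=n)         # bootstrap resample with replacement
    coef = Lasso(alpha=0.1).fit(X[idx], y[idx]).coef_
    selected += (coef != 0.0)

freq = selected / n_boot
print("selection frequency per feature:", np.round(freq, 2))
```

Frequencies well below 1 for features 0 and 1 indicate that the choice between the two near-duplicates changes from resample to resample, which is the instability described above.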

8.5 AIC / BIC Under Misspecification

Why it is essential: AIC and BIC explicitly use the likelihood \(\mathcal{L}\), so they depend on a distributional assumption. When the Gaussian assumption fails, the numbers become uninterpretable.

How you spot it: Normality checks on residuals (Shapiro-Wilk test, Q-Q plot).

Next step: CV is the distribution-free, safer alternative. The standard for large datasets.

8.6 Neural-Network Regularization Is Coupled

Why it is essential: A neural net's regularization is not a single \(\lambda\). Weight-decay strength, dropout probability, BatchNorm momentum, augmentation strength, and early-stopping epoch all interact. Turning up only one breaks training; turning up everything causes underfitting.

How you spot it: Ablations. Strengthen or weaken one knob at a time.

Next step: Part 7 develops regularization as symphony in depth.


What the Limits Sketch

Six limits, two larger truths:

  • Honesty of selection: skipping \(\lambda\) selection, a wide enough grid, or scaling means the model is not honestly meeting the data. This is the same family as the Part 1 §8.1 leakage problem.
  • Interaction is the new norm: the era of one model, one regularization is over. In Part 7's neural-net training, the interaction of multiple regularizers itself becomes model design.


9. Quick Recap — Answer Before You Peek

Q1. Ridge and Lasso produce different solutions on the same data. Why?

Answer: The geometry of the constraint regions differs. Ridge's L2 region is a ball; Lasso's L1 region is a diamond. The loss contour often meets the diamond at a vertex on an axis, so some coefficients become exactly zero. Why: L1's sparsity is a side effect of geometry, and it provides automatic feature selection (Tibshirani 1996). L2 shrinks every coefficient smoothly. Penalty shape dictates solution character. (Section 6.3.)

Q2. What is the real mechanism by which Ridge cures multicollinearity? (Beyond "forcing invertibility.")

Answer: Through the SVD lens, Ridge strongly shrinks only the noise directions (small \(\sigma_i\)). Signal directions (large \(\sigma_i\)) are left nearly untouched. Why: \(\hat{\beta}_{\text{Ridge}} = V \, \mathrm{diag}(\sigma_i / (\sigma_i^2 + \lambda)) \, U^\top y\), so noise directions are weighted by about \(\sigma_i / \lambda\) instead of \(1/\sigma_i\), while signal directions keep roughly \(1/\sigma_i\). Same place as Part 3 §6.2's SVD stability. (Section 6.2.)

Q3. Neural-network weight decay corresponds to what in sklearn?

Answer: L2 regularization (the Ridge penalty term). Why: With plain SGD, the update rule \(w \leftarrow w - \eta(\nabla \mathcal{L} + \lambda w)\) is exactly a gradient step on a loss with an added \(\frac{\lambda}{2}\|w\|_2^2\) term. In neural nets it works alongside dropout, BatchNorm, early stopping, and augmentation, because no single mechanism suffices. (Sections 4.4, 7.5.)

Q4. Why is Lasso's selection unstable on correlated features? The remedy?

Answer: The uniform L1 penalty arbitrarily picks one member of a correlated group, and small data perturbations switch which member survives. Remedy: ElasticNet (L1 + L2) keeps correlated groups together. Or stability selection (Meinshausen & Bühlmann 2010) adopts only features consistently chosen across bootstrap resamples. (Sections 7.1, 8.4.)

Q5. CV chose a \(\lambda\) at the edge of the grid. What do you suspect?

Answer: The grid is too narrow, and the true optimum likely sits outside it. Why: sklearn's default grid may not match your data. Re-search with np.logspace(-4, 4, 40). A trustworthy \(\lambda^{*}\) lies in the interior of the grid. (Sections 6.4, 8.3.)


If all five answers came easily, the geometry, the SVD lens, and the neural-net extension of regularization are in place.


10. Further Reading

Primary sources

  • Tibshirani, R. Regression Shrinkage and Selection via the Lasso. JRSS-B 58(1), 267–288 (1996). — The Lasso paper.
  • Hoerl, A. E., Kennard, R. W. Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics 12(1), 55–67 (1970). — Ridge.
  • Zou, H., Hastie, T. Regularization and Variable Selection via the Elastic Net. JRSS-B 67(2), 301–320 (2005). — ElasticNet.
  • Akaike, H. A new look at the statistical model identification. IEEE TAC 19(6), 716–723 (1974). — AIC.
  • Schwarz, G. Estimating the Dimension of a Model. Annals of Statistics 6(2), 461–464 (1978). — BIC.

Official docs

  • sklearn linear models: https://scikit-learn.org/stable/modules/linear_model.html
  • lasso_path: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.lasso_path.html
  • Model selection: https://scikit-learn.org/stable/modules/model_evaluation.html

Companion books

  • Hastie, Tibshirani, Friedman. ESL. Sections 3.4, 3.6, and Chapter 7.
  • Bishop. PRML. Sections 3.1.4 and 4.1.

In Part 6 we leave classical ML for neural networks. From the single-layer perceptron to the XOR problem, multi-layer perceptrons (MLP), backpropagation, activation functions, and the universal approximation theorem — the foundations you want in hand before saying the word "deep learning."
