"ML Foundations (3/9) — Linear Regression: From Least Squares to Gradient Descent"
Fitting a line to data was solved by Gauss and Legendre two centuries ago. Yet their solution still anchors the loss-function design and optimizer flow of modern deep learning. Linear regression looks tiny, but it is a scale model of all of machine learning. In this part we will see loss, optimization, and the seed of regularization meeting inside a single formula.
0. Learning Objectives
- Write the linear hypothesis \(\hat{y} = w^\top x + b\) in full matrix form.
- Explain why residual sum of squares (RSS) is the standard loss, grounded in the Gaussian noise assumption.
- Derive the normal equation \(w^* = (X^\top X)^{-1} X^\top y\) and know when it breaks.
- Run one gradient-descent update by hand.
- Explain why polynomial regression is still a linear model.
- Use residual analysis to diagnose whether the fit is adequate.
1. ํต์ฌ ์์ฝ
- Linear regression looks for a function of the form \(\hat{y} = w^\top x + b\). It is simple but the best-understood model in ML.
- Loss: residual sum of squares \(\sum (y_i - \hat{y}_i)^2\). Equivalent to maximum likelihood under Gaussian noise.
- Closed form: \(w^* = (X^\top X)^{-1} X^\top y\). One line of code for small \(d\).
- When the closed form breaks (large \(d\), multicollinearity): gradient descent or regularization (Part 5).
- Polynomial and interaction features keep the model linear in the weights while turning the input space nonlinear — that is how this small model fits curves.
2. Intuition — Why Minimize Squared Residuals?
2.1 Drawing a Line Through Scattered Points
Imagine a scatter plot. Which line passes "best" through the points? The intuition is straightforward: the line that stays as close to all points as possible.
But "close" needs a definition.
- Absolute deviations \(\sum |y_i - \hat{y}_i|\) → MAE regression (robust to outliers)
- Squared deviations \(\sum (y_i - \hat{y}_i)^2\) → least-squares regression (smooth, differentiable, math-friendly)
- Quantile losses → quantile regression
We adopt squared deviations as the default. The math is clean, and there is a deeper reason — Gaussian noise.
2.2 Why Squaring Is Natural (Gaussian Noise Assumption)
Assume the data-generating model
$$ y_i = w^\top x_i + b + \varepsilon_i, \quad \varepsilon_i \sim \mathcal{N}(0, \sigma^2) $$
If \(\varepsilon_i\) are i.i.d. zero-mean Gaussian, the likelihood is
$$ \mathcal{L}(w, b) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\sigma} \exp\!\Big(-\frac{(y_i - w^\top x_i - b)^2}{2\sigma^2}\Big) $$
Take logs:
$$ \log \mathcal{L} = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - w^\top x_i - b)^2 $$
The \(w, b\) that maximize the likelihood are exactly those that minimize RSS. So "square the residuals" is not an arbitrary intuition — it is maximum likelihood under Gaussian noise.
2.3 A Quick Historical Note
- Legendre, A.-M. Nouvelles mรฉthodes pour la dรฉtermination des orbites des comรจtes (1805) — first published use of the term "least squares."
- Gauss, C. F. Theoria Motus Corporum Coelestium (1809) — justified least squares under Gaussian errors.
The mathematics of fitting comet orbits is still inside modern ML.
3. Definitions & Notation
3.1 Matrix Form of the Linear Model
With data \(\{(x_i, y_i)\}_{i=1}^{n}\), \(x_i \in \mathbb{R}^d\), \(y_i \in \mathbb{R}\), build the design matrix (with a leading column of ones to absorb the bias):
$$ X = \begin{bmatrix} 1 & x_{1,1} & \cdots & x_{1,d} \ 1 & x_{2,1} & \cdots & x_{2,d} \ \vdots & & & \vdots \ 1 & x_{n,1} & \cdots & x_{n,d} \end{bmatrix} \;\in\; \mathbb{R}^{n \times (d+1)} $$
$$ \beta = \begin{bmatrix} b \ w_1 \ \vdots \ w_d \end{bmatrix} \;\in\; \mathbb{R}^{d+1} $$
Predictions are \(\hat{y} = X\beta\), and the loss is
$$ \mathcal{L}(\beta) = \frac{1}{2n} \|y - X\beta\|_2^{2} $$
(The \(\tfrac{1}{2n}\) constant is just a convention for clean gradients — the optimum is identical.)
3.2 Two Solution Paths
| Method | Formula | When to use |
|---|---|---|
| Normal equation | \(\beta^* = (X^\top X)^{-1} X^\top y\) | \(X^\top X\) invertible, small \(d\) |
| Gradient descent | Iterative updates | Large \(d\) or \(n\), or nonlinear losses |
4. Math & Mechanism
4.1 Deriving the Normal Equation
From \(\mathcal{L}(\beta) = \tfrac{1}{2n}\|y - X\beta\|^2\), the gradient is
$$ \nabla_\beta \mathcal{L} = \frac{1}{n} X^\top (X\beta - y) $$
Setting it to zero gives
$$ X^\top X \beta = X^\top y $$
$$ \boxed{\beta^* = (X^\top X)^{-1} X^\top y} $$
That is the normal equation.
When it fails
- \(X^\top X\) is singular (multicollinearity, or \(d > n\)): the inverse does not exist or is numerically unstable. Use Ridge (Part 5) or SVD-based solves (
np.linalg.lstsq). - \(d\) is huge: the \((d+1)^3\) inversion becomes too expensive. Switch to gradient descent.
4.2 Gradient Descent
With learning rate \(\eta\):
$$ \beta^{(t+1)} = \beta^{(t)} - \eta \nabla_\beta \mathcal{L}(\beta^{(t)}) = \beta^{(t)} - \frac{\eta}{n} X^\top (X\beta^{(t)} - y) $$
For least-squares regression the loss is convex, so any \(\eta < 2 / \lambda_{\max}(X^\top X / n)\) guarantees convergence.
Mini-batch SGD samples \(B\) examples per step to estimate the gradient. We dig into that in Part 7.
4.3 The Coefficient of Determination
To summarize fit quality with one number:
$$ R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} $$
The numerator is the residual sum of squares; the denominator is the total sum of squares. R² measures how much of the variance the model explains beyond simply predicting the mean.
- R² = 1: perfect fit
- R² = 0: as good as predicting the mean
- R² < 0: worse than predicting the mean
4.4 Polynomial Regression — Still Linear
\(\hat{y} = w_0 + w_1 x + w_2 x^2 + w_3 x^3\) is still a linear model. Why?
Apply the feature map \(\phi(x) = (1, x, x^2, x^3)\). The model becomes \(\hat{y} = w^\top \phi(x)\), which is linear in \(w\).
This basis-function trick is extremely general: polynomials, trigonometric functions, radial basis functions. The model fits a line/hyperplane in the transformed space. The price is dimension blowup — Part 5's regularization becomes essential.
5. Diagram
Residual patterns at a glance:
random scatter around 0 → fit looks OK
visible pattern (U, fan) → wrong model, needs extra terms
6. Principle Walkthrough — Unpacking the sklearn One-Liner
The code itself is almost a single line. The interesting question is what is happening behind it.
6.1 The Standard Call
from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(X_tr, y_tr)
That call dispatches to np.linalg.lstsq — the LAPACK routine that solves the least-squares system via SVD. Why SVD instead of the closed-form normal equation \(\beta^{*} = (X^{\top}X)^{-1} X^{\top} y\)? The next section answers.
6.2 Two Solution Paths — When Closed Form vs. Gradient Descent
What's being compared The same loss \(\mathcal{L}(\beta) = \tfrac{1}{2n}\|y - X\beta\|_2^2\) can be minimized two ways: (a) closed-form \((X^{\top}X)^{-1} X^{\top} y\), (b) iterative \(\beta \leftarrow \beta - \eta \nabla \mathcal{L}\). The minimum is identical; the cost structure is not.
Why both survived The normal equation costs \(O(d^3)\) — only in features \(d\), with samples \(n\) appearing once. When \(d\) is small (tens to hundreds) this dominates. When \(d\) is large (tens of thousands or millions — text, image, genomics) the cubic explodes, and gradient descent's \(O(n d k)\) for \(k\) iterations becomes the only feasible path. Data scale chooses the algorithm.
Why SVD replaced \((X^{\top}X)^{-1}\) When \(X^{\top}X\) is nearly singular (multicollinearity), the inverse blows up. With SVD \(X = U \Sigma V^{\top}\) you write \(\beta^{*} = V \Sigma^{+} U^{\top} y\) — small singular values are automatically truncated, preserving numerical stability. That is why sklearn ships SVD by default. The same SVD lens later explains how Ridge regression works in Part 5.
Forward link Gradient descent applied one batch at a time becomes stochastic gradient descent (SGD) — the workhorse of every deep network from Part 6 onward. Linear-regression GD is the simplest possible example of neural-net training.
6.3 Polynomial Regression and the First Encounter with Overfitting
Polynomial regression replaces \(x\) with \(x, x^2, \ldots, x^d\). A trivial change — except the feature count explodes as \(\binom{d+p}{p}\), and the model memorizes the training set. That single observation motivates the whole of Part 5 (regularization). Recall that linear in "linear regression" means the coefficients are linear, not the features.
6.4 Reading Residuals as Diagnosis
Residuals \(r_i = y_i - \hat{y}_i\) that are uncorrelated with predictions and roughly Gaussian indicate that the noise assumption is in reasonable agreement with the data. Pattern in residuals (e.g., bigger fitted values yield bigger residuals) means the assumption is violated — Section 8 returns to that as heteroscedasticity or nonlinearity.
7. Variants & Case Studies — Why This Model Keeps Mutating
Linear regression survived two centuries not because of simplicity but because of extensibility. Swap a single assumption and you land in another domain's standard toolkit.
7.1 Weighted Least Squares (WLS)
What changes Instead of treating every residual equally under squared loss, attach a weight \(w_i\) to each sample.
$$ \mathcal{L}(\beta) = \frac{1}{2n} \sum_i w_i (y_i - x_i^\top \beta)^2 $$
Closed form: \(\beta^{*} = (X^{\top} W X)^{-1} X^{\top} W y\), \(W = \mathrm{diag}(w_1, \dots, w_n)\).
Why it appeared In physics and astronomy each measurement has a different accuracy. Gauss already played with this in his 1809 work on celestial orbits, and the idea became standard in 19th-century geodesy (Bessel, Helmert). The slogan: "give more reliable samples a louder voice."
What it enabled - Heteroscedastic data (residual variance depends on input): down-weight noisy samples to correct OLS bias. - Time-series recency weighting: larger \(w_i\) on recent points becomes an exponentially-weighted regression. - Asymmetric-cost industries: false positives and false negatives differ in cost, encoded into per-sample weights.
Limits and next step The weights \(w_i\) must be supplied from outside — the model cannot learn them. The technique that does learn them, iteratively reweighted least squares (IRLS), is the training algorithm for logistic regression in Part 4.
7.2 Generalized Linear Models (GLM)
What changes Replace the Gaussian assumption on \(y\) with any member of the exponential family. The linear predictor \(\eta = w^{\top} x + b\) connects to the mean \(\mu = \mathbb{E}[y]\) via a link function \(g(\mu) = \eta\).
| Output type | Distribution | Link function | Model |
|---|---|---|---|
| Continuous real | Gaussian | identity \(g(\mu) = \mu\) | Linear regression |
| Binary (0/1) | Bernoulli | logit \(\log\frac{\mu}{1-\mu}\) | Logistic regression (Part 4) |
| Count (0,1,2,…) | Poisson | log \(\log \mu\) | Poisson regression |
| Positive continuous | Gamma | reciprocal \(1/\mu\) | Gamma regression |
Why it appeared Nelder & Wedderburn 1972, Generalized Linear Models (JRSS-A), introduced this unifying perspective. Before then, every output distribution had its own estimation procedure scattered across the statistics literature. The discovery that one algorithm (IRLS) handles them all reshaped how regression is taught.
What it enabled - Medicine: mortality regression (Bernoulli) and incident-count regression (Poisson). - Insurance: a two-stage frequency-Poisson + severity-Gamma model for claim modeling. - Marketing: click-through (logistic) plus purchase amount (gamma). - Part 4 lets us see logistic regression as one slot in this larger GLM grid.
Limits and next step GLM still insists on a linear combination of features. Nonlinear relationships push you toward polynomial features, GAMs (Generalized Additive Models), or — Part 6 — neural networks. A neural net is effectively learned representation + a GLM-style output head.
7.3 Robust Regression — Huber and Quantile
What changes Replace squared loss with one that doesn't blow up on outliers.
- Huber loss: quadratic for small residuals, linear for large; the boundary is a hyperparameter \(\delta\).
- Quantile regression: estimate the \(\tau\)-quantile, not the mean. \(\tau = 0.5\) becomes median regression.
Why it appeared Huber 1964, Robust Estimation of a Location Parameter (Ann. Math. Stat.), is the founding paper of robust statistics. It quantified for the first time the fragility of OLS under a single outlier — the precise way the Gaussian assumption fails.
What it enabled - Finance: pricing models where extreme events would otherwise wreck a mean fit. - Risk modeling: directly estimating the 95th-percentile loss (Value-at-Risk). - Real-estate: estimating price band rather than a single average — separate models for the low end and the high end of the market.
Limits and next step Robust regression makes you choose which loss matches your domain. Part 7 (deep-learning training) generalizes loss-function design (focal loss, contrastive loss, etc.) into deep models.
7.4 Bayesian Linear Regression
What changes Place a prior over \(\beta\) and estimate the posterior. Instead of a single \(\beta^{*}\), you get the full \(p(\beta \mid \text{data})\).
Why it appeared Point estimates from OLS produce unrealistic confidence intervals on small samples. The Bayesian view writes down explicitly the procedure of narrowing uncertainty with data. With a Gaussian prior, the posterior is also Gaussian — closed form — and its mean coincides with Ridge regression, a duality Part 5 unpacks.
What it enabled - Confidence intervals on small-sample medical studies. - Bayesian A/B tests with full posterior on effect size. - Active learning: query points where posterior variance is high.
Limits and next step Prior choice influences the answer. The honest point is that the prior is stated explicitly, while OLS's implicit assumptions are not.
7.5 Where the Model Shows Up in the Real World
- Housing-price prediction: linear combination of area, location, age — with polynomial extensions.
- Medicine — dose response: linear regression as a first pass, then GLM or mixed-effect models.
- Economic indicators: GDP and unemployment trend estimation.
- A/B test adjustment: covariate regression to correct treatment effects (CUPED).
- LLM training loss curves: fitting a simple trend to predict the convergence point.
- Physics and astronomy: the same role Gauss gave it in 1809 — 200 years and still in active use.
8. Limits & Failure Modes — Each Limit Births the Next Model
8.1 Violation of Linearity
If the underlying relation is curved, a straight line cannot follow it. R² stays low; residuals show a curved or U-shaped pattern.
Why it is essential A linear model expresses only a linear combination of given features. Interactions between inputs and nonlinear transforms must be supplied in advance. This is the same structural limit Gauss accepted in 1809.
How you spot it Curved or U-shaped pattern in the residuals-vs-predictions scatter. Both train and test R² low together implies underfitting.
Next step Polynomial regression → GAMs (splines) → neural networks (Part 6) — the expressiveness track. The key idea: learn the features, and linear regression survives as the final output head.
8.2 Sensitivity to Outliers
Squared loss inflates large residuals. A single outlier can drag the entire fit toward itself.
Why it is essential The Gaussian density decays as \(\exp(-x^2/2)\) — the Gaussian generative assumption and the large penalty on large residuals are two sides of the same coin. When the real data has heavy tails, OLS absorbs that thickness into estimation bias.
How you spot it Cook's distance, leverage, studentized residuals — diagnostics that identify high-influence samples. The residual Q-Q plot drifts off the diagonal in the tails.
Next step Huber / quantile regression (Section 7.3) or RANSAC (Random Sample Consensus) for robust estimation.
8.3 Multicollinearity
Highly correlated features make \(X^{\top}X\) near-singular; coefficient variance explodes.
Why it is essential When two features carry "the same information," the model has no basis to decide how to split weight between them. Small perturbations in data lead to large swings in coefficients.
How you spot it VIF (Variance Inflation Factor) above 5–10 is a warning. Correlation-matrix entries above 0.9 identify collinear pairs.
Next step Part 5's Ridge regression confronts this directly — adding \(\lambda I\) forces invertibility while shrinking all singular values proportionally. Lasso takes a further step by surviving only one variable per collinear group.
8.4 Heteroscedasticity
Residual variance changes with input. The constant-variance part of the Gaussian noise assumption breaks.
Why it is essential OLS's standard errors and confidence intervals become wrong. The point estimate \(\hat{\beta}\) is still consistent, but the quoted uncertainty misleads you.
How you spot it Funnel shape in residuals-vs-predictions. Breusch–Pagan or White tests.
Next step Section 7.1 WLS / GLS, or more generally GLM (Section 7.2) where the variance function is named explicitly.
8.5 Extrapolation
No model can be trusted outside the training input range.
Why it is essential The model only learned a function on the region of seen data. Outside that region, model structure dictates extrapolation behavior — linear regression keeps going straight, polynomial regression diverges according to its degree. Neither choice is "the truth."
How you spot it Measure distance-to-training-distribution (Mahalanobis distance, kNN distance).
Next step Calibrated uncertainty — prediction intervals, Conformal Prediction — and Bayesian regression, whose posterior variance automatically grows in extrapolation regions.
8.6 \(p \gg n\) — More Features Than Samples
\(X^{\top}X\) becomes non-invertible and the closed-form solution is undefined.
Why it is essential An under-determined system has infinitely many \(\beta\) values that fit the training data perfectly. Which solution gets picked must be decided by an assumption or regularization, not by the data alone.
How you spot it \(d > n\) by inspection. Text (BoW dimensions in the tens to hundreds of thousands), genomics, high-resolution images all live here.
Next step Part 5's full toolkit — Lasso (sparse), Ridge (shrunken), ElasticNet (both). Or dimensionality reduction (PCA, learned embeddings in Part 6) followed by regression.
What the Limits Sketch
The six limits of linear regression compress six future stops in the ML curriculum: nonlinearity → neural networks (Part 6) · outliers → robust statistics · multicollinearity → regularization (Part 5) · heteroscedasticity → GLM (7.2) · extrapolation → uncertainty estimation · \(p \gg n\) → sparse estimation (Part 5).
9. Quick Recap — Answer Before You Peek
Five core questions this article answered. Cover the answers, give a one-line response yourself, then check.
Q1. The normal equation and gradient descent give the same answer — why keep both?
Answer Because data scale changes the cost structure. Why Normal equation costs \(O(d^3)\) in features \(d\) — dominant when \(d\) is small, infeasible when \(d\) is huge. Gradient descent costs \(O(n d k)\) and is the only path at scale. Part 6's stochastic gradient descent for neural nets is the same recipe. (Section 6.2.)
Q2. Why doesn't sklearn LinearRegression compute \((X^{\top}X)^{-1}\) directly?
Answer Because the inverse explodes when \(X^{\top}X\) is nearly singular (multicollinearity). It uses SVD instead. Why With \(X = U\Sigma V^{\top}\), tiny singular values are truncated and the solution stays numerically stable. The same SVD lens explains Ridge regression in Part 5. (Section 6.2.)
Q3. In polynomial regression, why does test R² fall after some degree even though train R² keeps rising?
Answer Overfitting. The feature count blows up as \(\binom{d+p}{p}\) and the model memorizes the training set. Why Sample size \(n\) is fixed; parameter count grows, so model variance rises (the bias-variance decomposition of Part 1). Part 5's regularization is the direct response. (Sections 6.3, 8.6.)
Q4. A funnel-shaped residual plot — which assumption broke and what do you reach for?
Answer Heteroscedasticity. Residual variance depends on input. The point estimate is fine, but the confidence intervals are wrong. Switch to WLS or a GLM. Why The constant-variance part of the Gaussian noise assumption is dead. WLS down-weights noisy samples; GLM names the variance function via the chosen distribution. (Sections 8.4, 7.1, 7.2.)
Q5. Of the six limits of linear regression, why is extrapolation the one not solved by switching to a different model?
Answer Because extrapolation is a problem of data coverage, not of model capacity. Why The other five are fixable by changing the model (nonlinear, regularized, robust loss, etc.). Extrapolation is a prediction in a region you have not seen — no model can guarantee truth there. The honest response is calibrated uncertainty (Bayesian posterior variance, Conformal Prediction) rather than a "better" point estimate. (Section 8.5.)
If at least four answers were on the tip of your tongue, the article landed. If a question stuck, jump back to the section in parentheses and re-read.
10. Further Reading
Primary sources
- Legendre, A.-M. Nouvelles mรฉthodes pour la dรฉtermination des orbites des comรจtes (1805). — Coined "least squares."
- Gauss, C. F. Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientium (1809). — Gaussian-error justification of least squares.
- Stigler, S. M. The History of Statistics: The Measurement of Uncertainty Before 1900, Harvard 1986.
Official docs
- sklearn
LinearRegression:https://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares np.linalg.lstsq:https://numpy.org/doc/stable/reference/generated/numpy.linalg.lstsq.html
Companion books
- Hastie, Tibshirani, Friedman. Elements of Statistical Learning. Chapter 3.
- Bishop. Pattern Recognition and Machine Learning. Chapter 3.
In Part 4, we meet logistic regression — called "regression" but actually a classification model. By pushing the linear combination \(w^\top x + b\) through a sigmoid, we get probabilities, and from that single idea we get cross-entropy, softmax, and the MLE foundation for almost every modern classifier.
Series overview: Series index
๋๊ธ
๋๊ธ ์ฐ๊ธฐ