"ML Foundations (4/9) — Logistic Regression & Classification: Making Probabilities"
The name says "regression" but the job is classification. Take Part 3's linear score \(w^\top x + b\) and push it through a sigmoid — that single move turns the output into a probability between 0 and 1. On top of that tiny modification sit cross-entropy loss, softmax multi-class extension, and the maximum-likelihood foundation that almost every modern classifier rests on. Even the final layer of a neural network is, almost without exception, this same construction.
0. Learning Objectives
- Define the logistic function \(\sigma(z)\), its derivative, and its meaning, both in formula and in a picture.
- Derive cross-entropy from the Bernoulli likelihood.
- Extend naturally to multi-class via softmax + categorical cross-entropy.
- Explain why the decision boundary \(w^\top x + b = 0\) is a straight line or hyperplane.
- Sweep the decision threshold, draw the ROC curve, and compute AUC.
- Recognize which metrics to use under class imbalance and how to mitigate it.
1. Key Summary
- Logistic regression: \(P(y=1 \mid x) = \sigma(w^\top x + b)\), with \(\sigma(z) = 1 / (1 + e^{-z})\).
- Loss: negative log-likelihood = binary cross-entropy.
$$ \mathcal{L}(w, b) = -\frac{1}{n}\sum_{i} \big[ y_i \log \hat{p}_i + (1 - y_i) \log(1 - \hat{p}_i) \big] $$
- Multi-class extension: sigmoid → softmax, binary CE → categorical CE.
- The decision boundary is \(w^\top x + b = 0\) — a hyperplane in input space.
- No closed-form solution; solve via gradient descent or IRLS (iteratively reweighted least squares). The loss is convex, so any local optimum the solver finds is a global one.
- Because the output is a probability, you can shift the threshold to balance precision against recall directly.
2. Intuition — What the Sigmoid Does
2.1 Slip a "Probability Hat" on the Linear Output
In Part 3, linear regression returned \(\hat{y} = w^\top x + b\) directly. For classification, the target is 0 or 1, and the output we want is a probability in \([0, 1]\). So we run the score through the sigmoid:
$$ \sigma(z) = \frac{1}{1 + e^{-z}} $$
Properties:
- \(\sigma(0) = 0.5\)
- \(z \to +\infty\) ⇒ \(\sigma \to 1\)
- \(z \to -\infty\) ⇒ \(\sigma \to 0\)
- Derivative: \(\sigma'(z) = \sigma(z)(1 - \sigma(z))\) — we'll revisit this during backprop.
The sigmoid smoothly compresses a score \(z = w^\top x + b\) into a probability.
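The properties listed above are easy to check numerically. The following is a minimal sketch, not a library implementation: the function names are illustrative, and the branch on the sign of `z` is one common way to keep `exp()` from overflowing.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function, written so exp() never sees a large positive argument."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)               # z < 0, so exp(z) < 1
    return ez / (1.0 + ez)

def sigmoid_grad(z: float) -> float:
    """Derivative computed from the function value alone: s * (1 - s)."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Compare the closed-form derivative with a central finite difference.
z0, h = 0.7, 1e-6
numeric = (sigmoid(z0 + h) - sigmoid(z0 - h)) / (2 * h)

print(sigmoid(0.0), sigmoid_grad(0.0))        # 0.5 0.25
print(sigmoid(-1000.0), sigmoid(1000.0))      # 0.0 1.0 (saturated, no overflow)
print(abs(numeric - sigmoid_grad(z0)) < 1e-8)
```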
2.2 Why 0.5 Is the Default Boundary
Predicting class 1 when \(P(y=1 \mid x) > 0.5\) is equivalent to \(\sigma(z) > 0.5\), which is equivalent to \(z > 0\). So the decision boundary is \(w^\top x + b = 0\). That equation defines a hyperplane in input space. Swap the threshold for some other value and the hyperplane shifts in parallel.
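To make the "parallel shift" concrete: thresholding the probability at \(\tau\) is the same as thresholding the score at \(\log\frac{\tau}{1-\tau}\), so the boundary becomes \(w^\top x + b = \log\frac{\tau}{1-\tau}\). A small sketch, with `logit` and `predict` as illustrative helper names:

```python
import math

def logit(p: float) -> float:
    """Inverse of the sigmoid: the score z at which sigma(z) equals p."""
    return math.log(p / (1.0 - p))

def predict(z: float, tau: float = 0.5) -> int:
    """Compare the score with logit(tau) instead of comparing sigma(z) with tau."""
    return int(z > logit(tau))

# tau = 0.5 puts the boundary at z = 0; tau = 0.2 moves it to a negative score.
print(logit(0.5))                 # 0.0
print(round(logit(0.2), 4))       # -1.3863
print(predict(-1.0, tau=0.5), predict(-1.0, tau=0.2))   # 0 1
```

The same point with score \(z = -1\) is rejected at \(\tau = 0.5\) and accepted at \(\tau = 0.2\); the weights never changed, only the offset of the hyperplane.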
2.3 Why Cross-Entropy Is the Loss
Linear regression's squared loss was MLE under Gaussian noise. We use the same trick for classification.
Assume the label \(y \in \{0, 1\}\) is Bernoulli:
$$ P(y \mid x; w, b) = \hat{p}^{y} (1 - \hat{p})^{1 - y}, \quad \hat{p} = \sigma(w^\top x + b) $$
The negative log-likelihood (NLL) over the dataset:
$$ -\log \mathcal{L} = -\sum_i \big[y_i \log \hat{p}_i + (1 - y_i) \log(1 - \hat{p}_i)\big] $$
That is binary cross-entropy. Divide by \(n\) for the per-sample loss. Gaussian noise → squared error; Bernoulli noise → cross-entropy. Both are NLL.
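The identity between the Bernoulli NLL and binary cross-entropy can be confirmed on a handful of numbers. This is a sketch with made-up labels and probabilities; the function names are illustrative.

```python
import math

def bernoulli_likelihood(y: int, p: float) -> float:
    """P(y | p) = p^y * (1 - p)^(1 - y) for y in {0, 1}."""
    return (p ** y) * ((1.0 - p) ** (1 - y))

def binary_cross_entropy(ys, ps) -> float:
    """Mean of -[y log p + (1 - y) log(1 - p)] over the dataset."""
    total = sum(y * math.log(p) + (1 - y) * math.log(1.0 - p) for y, p in zip(ys, ps))
    return -total / len(ys)

ys = [1, 0, 1, 1]              # illustrative labels
ps = [0.9, 0.2, 0.6, 0.3]      # illustrative predicted P(y = 1)

# Likelihood of the whole dataset is the product of per-sample Bernoulli terms.
likelihood = math.prod(bernoulli_likelihood(y, p) for y, p in zip(ys, ps))
nll_per_sample = -math.log(likelihood) / len(ys)

print(abs(nll_per_sample - binary_cross_entropy(ys, ps)) < 1e-12)
```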
3. Definitions & Notation
3.1 Binary Logistic Regression
- Input: \(x \in \mathbb{R}^d\)
- Weights: \(w \in \mathbb{R}^d\), bias: \(b \in \mathbb{R}\)
- Score: \(z = w^\top x + b\)
- Probability: \(\hat{p} = \sigma(z)\)
- Prediction: \(\hat{y} = \mathbb{1}[\hat{p} > 0.5]\) (or any chosen threshold)
3.2 Binary Cross-Entropy
$$ \mathcal{L}(w, b) = -\frac{1}{n}\sum_{i=1}^{n} \big[y_i \log \sigma(z_i) + (1 - y_i) \log(1 - \sigma(z_i))\big] $$
3.3 Multi-Class: Softmax
\(C\) classes, one weight vector per class:
$$ z_c = w_c^\top x + b_c, \quad c = 1, \dots, C $$
Softmax:
$$ \hat{p}_c = \frac{e^{z_c}}{\sum_{c'=1}^{C} e^{z_{c'}}} $$
Categorical cross-entropy:
$$ \mathcal{L} = -\frac{1}{n}\sum_{i=1}^{n}\sum_{c=1}^{C} y_{i,c} \log \hat{p}_{i,c} $$
\(y_{i,c}\) is one-hot. This is the exact structure of the next-token loss in LLM training (see LLM Core Study Part 5).
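A short sketch of softmax and the per-sample categorical cross-entropy, assuming plain Python lists for the logits. Subtracting the maximum logit before exponentiating is a standard stability trick that leaves the probabilities unchanged. The last lines check that with \(C = 2\) and logits \((z, 0)\), softmax collapses to the sigmoid.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    """Softmax with the max logit subtracted first to avoid overflow."""
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(one_hot, probs):
    """Per-sample loss: -sum_c y_c log p_c (only the true class contributes)."""
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs) if y)

probs = softmax([2.0, 1.0, 0.1])          # illustrative logits for C = 3
loss = categorical_cross_entropy([1, 0, 0], probs)
print(abs(sum(probs) - 1.0) < 1e-12)
print(abs(loss + math.log(probs[0])) < 1e-12)

# Two classes with logits (z, 0): softmax reduces to sigmoid(z).
z = 1.3
print(abs(softmax([z, 0.0])[0] - sigmoid(z)) < 1e-12)
```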
4. Math & Mechanism
4.1 Gradient Derivation (Binary)
Apply the chain rule to the NLL. The intermediate step is the cleanest in all of ML:
$$ \frac{\partial \mathcal{L}}{\partial z_i} = \hat{p}_i - y_i $$
Just "prediction minus target." This simplicity is what makes sigmoid + cross-entropy the canonical pair.
$$ \frac{\partial \mathcal{L}}{\partial w} = \frac{1}{n} \sum_i (\hat{p}_i - y_i) x_i = \frac{1}{n} X^\top (\hat{p} - y) $$
$$ \frac{\partial \mathcal{L}}{\partial b} = \frac{1}{n} \sum_i (\hat{p}_i - y_i) $$
Identical shape to linear regression's \(\tfrac{1}{n}X^\top (X\beta - y)\). Only \(X\beta\) is replaced by \(\hat{p}\).
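One way to convince yourself of the "prediction minus target" gradient is a finite-difference check. The sketch below uses a tiny made-up dataset and hypothetical helper names (`bce_loss`, `bce_gradient`); it compares the analytic formulas with central differences on each parameter.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(w, b, x):
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def bce_loss(w, b, X, y):
    """Mean binary cross-entropy over the dataset."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(score(w, b, xi))
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return -total / len(y)

def bce_gradient(w, b, X, y):
    """Analytic gradient: mean of (p_i - y_i) * x_i for w, mean of (p_i - y_i) for b."""
    n = len(y)
    grad_w, grad_b = [0.0] * len(w), 0.0
    for xi, yi in zip(X, y):
        err = sigmoid(score(w, b, xi)) - yi          # prediction minus target
        grad_w = [g + err * xj / n for g, xj in zip(grad_w, xi)]
        grad_b += err / n
    return grad_w, grad_b

# Illustrative data and parameters (not from the article).
X = [[0.5, 1.2], [-1.0, 0.3], [1.5, -0.7], [-0.2, -1.1]]
y = [1, 0, 1, 0]
w, b, h = [0.3, -0.4], 0.1, 1e-6

grad_w, grad_b = bce_gradient(w, b, X, y)

# Central finite differences, one parameter at a time.
numeric_w = []
for j in range(len(w)):
    w_hi = [wk + (h if k == j else 0.0) for k, wk in enumerate(w)]
    w_lo = [wk - (h if k == j else 0.0) for k, wk in enumerate(w)]
    numeric_w.append((bce_loss(w_hi, b, X, y) - bce_loss(w_lo, b, X, y)) / (2 * h))
numeric_b = (bce_loss(w, b + h, X, y) - bce_loss(w, b - h, X, y)) / (2 * h)

max_error = max([abs(a - num) for a, num in zip(grad_w, numeric_w)]
                + [abs(grad_b - numeric_b)])
print(max_error < 1e-7)
```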
4.2 The Loss Is Convex
\(\nabla^2 \mathcal{L}\) is positive semi-definite, so every local optimum is global. From any initial weights, gradient descent (with sufficiently small learning rate) converges toward the same minimum loss, and to the same weights whenever the minimizer is unique.
This is a luxury neural networks do not have, and it is why logistic regression is so robust. The same math reappears at the output layer of practically every classifier neural network.
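A quick experiment illustrates the point, under the assumption that the classes overlap so a finite minimizer exists. The data, learning rate, and step count below are arbitrary illustrative choices; two very different starting points should end at the same \((w, b)\).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_gd(xs, ys, w, b, lr=0.5, steps=5000):
    """Plain gradient descent on mean binary cross-entropy for one feature."""
    n = len(xs)
    for _ in range(steps):
        errs = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= lr * sum(e * x for e, x in zip(errs, xs)) / n
        b -= lr * sum(errs) / n
    return w, b

# Illustrative 1-D data with overlapping classes (x = -0.5 is positive, x = 0.5 is negative).
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]

w_a, b_a = fit_gd(xs, ys, w=-3.0, b=2.0)    # start on the "wrong" side
w_b, b_b = fit_gd(xs, ys, w=4.0, b=-1.0)    # start far on the other side
print(w_a, b_a)
print(abs(w_a - w_b) < 1e-6, abs(b_a - b_b) < 1e-6)
```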
4.3 IRLS — Newton's Method for Logistic Regression
Newton's method:
$$ \beta^{(t+1)} = \beta^{(t)} - H^{-1} g $$
where \(g\) is the gradient and \(H\) the Hessian. For logistic regression:
$$ H = \frac{1}{n} X^\top W X, \quad W = \mathrm{diag}\big(\hat{p}_i(1 - \hat{p}_i)\big) $$
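This update is Iteratively Reweighted Least Squares (IRLS). In sklearn, newton-cg and newton-cholesky are second-order solvers, the default lbfgs is a quasi-Newton method that approximates the Hessian, liblinear uses coordinate descent, and saga is a first-order stochastic method.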
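The Newton update can be sketched for the smallest interesting case: one feature plus a bias, so \(H\) is 2×2 and can be solved by hand. This is an illustration of the formulas above, not how any library implements its solvers; `newton_step` and the data are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def newton_step(xs, ys, w, b):
    """One Newton update for (w, b); rows of X are [x_i, 1]."""
    n = len(xs)
    ps = [sigmoid(w * x + b) for x in xs]
    # Gradient g = (1/n) X^T (p - y)
    g_w = sum((p - y) * x for p, y, x in zip(ps, ys, xs)) / n
    g_b = sum(p - y for p, y in zip(ps, ys)) / n
    # Hessian H = (1/n) X^T W X with W = diag(p_i (1 - p_i))
    curv = [p * (1.0 - p) for p in ps]
    h_ww = sum(c * x * x for c, x in zip(curv, xs)) / n
    h_wb = sum(c * x for c, x in zip(curv, xs)) / n
    h_bb = sum(curv) / n
    # Solve H d = g by Cramer's rule, then step to beta - d.
    det = h_ww * h_bb - h_wb * h_wb
    d_w = (g_w * h_bb - g_b * h_wb) / det
    d_b = (h_ww * g_b - h_wb * g_w) / det
    return w - d_w, b - d_b, math.hypot(g_w, g_b)

# Illustrative overlapping 1-D data, so the minimizer is finite.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]

w, b = 0.0, 0.0
for it in range(8):
    w, b, grad_norm = newton_step(xs, ys, w, b)   # grad_norm is measured before the step
    print(it, round(w, 6), grad_norm)
```

On this data the gradient norm drops to roundoff level within a handful of iterations, where gradient descent needs far more steps. The price is forming and solving with \(H\) at every step.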
4.4 Decision Boundaries and Margins
The boundary is \(w^\top x + b = 0\). For multi-class:
$$ \{x : w_c^\top x + b_c = w_{c'}^\top x + b_{c'}\} $$
is the boundary between classes \(c\) and \(c'\). All are hyperplanes.
If the data are not linearly separable, no choice of \(w, b\) eliminates all errors. When those errors come from curved or interleaved class structure rather than label noise, that is a signal that a nonlinear decision boundary is required: polynomial features or neural networks (Part 6).
5. Diagram
Multi-class swaps sigmoid → softmax and grows the output to \(C\) nodes — the rest is identical.
6. Principle Walkthrough — The Chain From Logit to Sigmoid to Cross-Entropy
The code itself is one line. The interesting content is the chain logit → sigmoid → cross-entropy → MLE that this line packages — the same chain that powers the output head of every neural classifier.
6.1 The Standard Call
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(max_iter=10000).fit(X_tr, y_tr)
y_prob = clf.predict_proba(X_te)[:, 1]
predict_proba returns the post-sigmoid probability. predict is just that probability thresholded at 0.5 — and moving that threshold to match business cost is the key downstream decision (Section 6.3).
6.2 Why the Sigmoid Derivative Is So Clean
The sigmoid \(\sigma(z) = 1/(1 + e^{-z})\) has derivative \(\sigma'(z) = \sigma(z)\,(1 - \sigma(z))\), computable from the function value alone.
Why it matters
- In backpropagation (Part 7), each layer's activation derivative gets multiplied into the gradient. With sigmoid, caching the forward output lets the backward pass reuse it instead of re-evaluating the exponential.
- But: \(\sigma'\) maxes out at 0.25 (at \(z = 0\)). Through \(L\) sigmoid layers, the activation-derivative factors alone contribute at most \(0.25^L\), which collapses fast → vanishing gradient. Part 7 traces this directly to the adoption of ReLU.
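The 0.25 ceiling and its compounding are quick to see numerically. A rough sketch: the grid scan is only a sanity check on the maximum, and the product ignores the weight matrices, which also enter the real backpropagated gradient.

```python
import math

def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

# Scan z in [-5, 5]; the largest derivative sits at z = 0.
peak = max(sigmoid_grad(k / 100) for k in range(-500, 501))
print(peak)                        # 0.25

# Best-case product of activation derivatives through `depth` sigmoid layers.
for depth in (1, 5, 10, 20):
    print(depth, 0.25 ** depth)
```

At depth 10 the best-case factor is already below one in a million.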
Forward link In Part 6's MLP, sigmoid almost vanishes as a hidden activation — but it survives as the output head of binary classifiers. Sigmoid + cross-entropy is the canonical neural-net classification output.
6.3 ROC and Thresholds — The Threshold, Not the Model, Makes the Decision
What you watch The same model produces different confusion matrices at different thresholds \(\tau\). \(\tau = 0.5\) is arbitrary — the right threshold is dictated by business cost.
The framing
- ROC curve: trace (FPR, TPR) over all \(\tau\). AUC = 1 is perfect, 0.5 is random. The curve measures model separability, decoupled from threshold.
- PR curve: precision vs recall. More useful under class imbalance (ROC is over-optimistic when negatives dominate).
- Drop \(\tau\) from 0.5 to 0.3 → recall ↑, precision ↓. Medical screening (recall-first) versus spam filtering (precision-first): different cost asymmetries, different operating points.
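The threshold sweep is short enough to write by hand. The sketch below uses eight made-up scores and illustrative helper names; in practice a library routine would do the same job. It predicts positive when the score is at least \(\tau\), builds the ROC points, and integrates them with the trapezoid rule.

```python
def confusion(y_true, scores, tau):
    """Counts (tp, fp, fn, tn) when predicting positive for score >= tau."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= tau and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= tau and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < tau and y == 1)
    tn = sum(1 for y, s in zip(y_true, scores) if s < tau and y == 0)
    return tp, fp, fn, tn

def roc_points(y_true, scores):
    """(FPR, TPR) at every distinct score used as a threshold, starting from (0, 0)."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    for tau in sorted(set(scores), reverse=True):
        tp, fp, _, _ = confusion(y_true, scores, tau)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the piecewise-linear curve (trapezoid rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))

def precision_recall(y_true, scores, tau):
    tp, fp, fn, _ = confusion(y_true, scores, tau)
    return tp / (tp + fp), tp / (tp + fn)

# Illustrative labels and predicted probabilities.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]

print(auc(roc_points(y_true, scores)))          # 0.75
print(precision_recall(y_true, scores, 0.5))    # (0.75, 0.75)
print(precision_recall(y_true, scores, 0.3))    # precision 4/7, recall 1.0
```

Lowering \(\tau\) from 0.5 to 0.3 lifts recall to 1.0 and drops precision to about 0.57, while the AUC, a property of the scores alone, stays at 0.75.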
Forward link Threshold selection is a post-training decision. Part 7's training-time class-imbalance techniques (focal loss, weighted loss) generalize the same intuition into the training loop.
6.4 Why Cross-Entropy Is the Logistic Regression Loss
Start with the Bernoulli likelihood \(p(y \mid x) = \hat{p}^y (1 - \hat{p})^{1-y}\) and run maximum likelihood estimation (MLE). The negative log-likelihood is exactly cross-entropy:
$$ \mathcal{L}(\beta) = -\frac{1}{n} \sum_i \big[y_i \log \hat{p}_i + (1 - y_i) \log(1 - \hat{p}_i)\big] $$
Cross-entropy is not a freely chosen loss — it is the direct consequence of assuming a Bernoulli output. The same chain produces softmax + categorical cross-entropy for multi-class, MSE under a Gaussian assumption for regression. Every loss is the shadow of a distributional assumption — a perspective that surfaces again in GLMs (Part 3 §7.2) and in neural networks (Part 6).
7. Variants & Case Studies — The Weight a Probability-Producing Model Carries
Logistic regression has survived since Cox 1958 not because of accuracy but because of its well-calibrated probabilities. The probability is itself an input to a decision.
7.1 Regularized Logistic Regression — What Is \(C\)?
What changes Add an \(L_2\) or \(L_1\) penalty to the loss. sklearn's API uses \(C = 1/\lambda\): smaller \(C\), stronger regularization. (\(C\) is the "confidence in the data" framing; \(\lambda\) is the "penalty strength" framing.)
Why it appeared Small-sample / high-feature domains like medicine and finance can drive MLE into perfect separation — coefficients diverge to \(\pm\infty\). Tibshirani's Lasso (1996) and Hoerl-Kennard's Ridge (1970) ported directly to logistic regression.
What it enabled
- L2: mitigates multicollinearity, stabilizes coefficients. The standard for ad CTR models.
- L1: feature selection. In text classification with vocabularies in the hundreds of thousands, a strong enough penalty can leave only a few hundred words with nonzero weights.
- ElasticNet: keeps correlated feature groups together.
Limits and next step Part 5 develops the same principle for regression. Neural networks generalize the same family as weight decay, dropout, and early stopping (Part 7).
7.2 Class Imbalance — Threshold or Loss Weighting?
What changes With a 1% positive rate, an unweighted model thresholded at 0.5 can degenerate to "always negative." Four standard moves:
| Technique | Where it intervenes | Core idea |
|---|---|---|
| class_weight="balanced" | Loss function | 1 positive sample is worth 99 negatives in loss |
| Lower the threshold | Inference time | 0.5 → 0.2 (recall-first) |
| Oversampling (SMOTE) | Data | Synthesize more positives |
| Change the metric | Evaluation | F1, PR-AUC instead of accuracy |
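The "1 positive is worth 99 negatives" entry follows from the heuristic scikit-learn documents for `class_weight="balanced"`: weight for class \(c\) = `n_samples / (n_classes * count_c)`. The helper below re-implements that formula in plain Python for illustration; it is not the library's code.

```python
from collections import Counter

def balanced_class_weights(y):
    """Weight per class: n_samples / (n_classes * count of that class)."""
    counts = Counter(y)
    n, k = len(y), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

y = [1] * 10 + [0] * 990          # illustrative labels with a 1% positive rate
weights = balanced_class_weights(y)

print(weights[1])                  # 50.0
print(round(weights[0], 4))        # 0.5051
print(round(weights[1] / weights[0], 6))   # 99.0
```

Each class ends up contributing the same total weight to the loss (10 × 50 = 990 × 0.505 ≈ 500), which is what "balanced" means here.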
Why it appeared Provost & Fawcett 1997, Analysis and Visualization of Classifier Performance, formalized the class-imbalance evaluation problem. Chawla et al.'s SMOTE (2002) is the canonical data-level fix.
What it enabled
- Fraud detection (positives at 0.1%): class_weight + threshold tuning is the standard combination.
- Medical diagnostics (recall-first): deliberately lower the threshold to avoid missing positives.
- Ad CTR (99% negative, but the negatives carry information): negative sampling extracts only useful negatives instead of plain weighting.
Limits and next step Weighting alone does not save extreme imbalance (<0.1%) — anomaly-detection framings or a different model class may be needed. Part 7's focal loss (Lin et al. 2017) extends the idea by focusing on hard examples at training time.
7.3 Probability Calibration — How Confident Should the Model Be?
What changes When the model says "probability 0.9," is the empirical accuracy on those predictions actually ~0.9? If yes, the model is well-calibrated. If 0.9 maps to 0.65, the model is overconfident.
Why it appeared Medical diagnostics, insurance claims, credit scoring all consume the probability as the decision input. The difference between 0.95 and 0.99 changes a premium — so the model's ability to measure that gap matters. Platt 1999; Niculescu-Mizil & Caruana 2005, Predicting Good Probabilities with Supervised Learning.
What it enabled
- Reliability diagrams diagnose miscalibration visually.
- Platt scaling (sigmoid post-hoc), isotonic regression (monotonic post-hoc).
- sklearn's CalibratedClassifierCV applies post-hoc calibration in one call.
- A calibrated model also makes threshold-choice decisions more trustworthy — precision-recall trade-offs land on real costs.
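The numbers behind a reliability diagram are a simple binning exercise. A minimal sketch, assuming equal-width bins and made-up predictions; `reliability_table` is an illustrative name, and the plot itself is left out.

```python
def reliability_table(y_true, probs, n_bins=5):
    """Per non-empty bin: (mean predicted probability, empirical positive rate, count)."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, probs):
        idx = min(int(p * n_bins), n_bins - 1)    # p = 1.0 falls in the last bin
        bins[idx].append((p, y))
    rows = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            rate = sum(y for _, y in members) / len(members)
            rows.append((mean_p, rate, len(members)))
    return rows

# Illustrative case: ten predictions at 0.9 with only six positives (overconfident),
# and ten predictions at 0.1 with one positive (well calibrated).
probs = [0.9] * 10 + [0.1] * 10
y_true = [1] * 6 + [0] * 4 + [1] * 1 + [0] * 9

for mean_p, rate, count in reliability_table(y_true, probs):
    print(round(mean_p, 3), round(rate, 3), count)
```

A row where the first two numbers agree sits on the diagonal; the 0.9-versus-0.6 row is the kind of gap post-hoc calibration is meant to close.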
Limits and next step Neural networks tend to be overconfident (Guo et al. 2017, On Calibration of Modern Neural Networks). Part 7's temperature scaling is the standard post-hoc fix.
7.4 Direct Link to Neural Networks — One Layer = Logistic Regression
The framing Logistic regression = a one-layer neural network with sigmoid activation. The MLP of Part 6 stacks this structure with nonlinear hidden activations. In other words, logistic regression's cross-entropy + sigmoid pair is the direct ancestor of every neural classification head.
Why this view matters The last layer of nearly every neural classifier is (linear) → sigmoid (binary) or (linear) → softmax (multi-class). Understanding logistic regression = understanding the output stage of every neural classifier.
7.5 Industry Cases
- Credit scoring: the oldest application of logistic regression. Demands both interpretability and calibrated probability.
- Ad CTR (Click-Through Rate): billions of predictions per day. Simplicity means speed; logistic regression often beats more sophisticated models for that reason alone.
- Medical risk scores (Apgar, MELD, etc.): the same additive, interpretable form as logistic regression, and many modern clinical risk scores are fitted with it directly.
- First-stage spam filters: fast rough cut before downstream neural models.
- A/B test conversion modeling: covariate-adjusted logistic regression for unbiased treatment-effect estimation.
8. Limits & Failure Modes — Linearity, Saturation, and Calibration Decide It
8.1 Not Linearly Separable
Why it is essential Logistic regression's decision boundary is \(w^\top x + b = 0\) — a hyperplane in the input space. Problems like XOR have no \(w, b\) that separates the classes.
How you spot it Visualize the data with PCA in 2D; if the classes interleave or wrap around each other rather than splitting along a line, suspect nonlinearity. Or simply observe unreasonably low accuracy after training.
Next step Manual nonlinear feature engineering (polynomial / interaction terms), kernel trick (SVM), or Part 6's neural networks. The essence of a neural net is: learn features, then apply a linear classifier.
8.2 Sigmoid Saturation
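Why it is essential For large \(|z|\), \(\sigma(z) \to 0\) or \(1\), and the derivative \(\sigma'(z)\) goes to zero. With cross-entropy at the output, the \(\sigma'\) factor cancels (the gradient is \(\hat{p} - y\), Section 4.1), so a confidently wrong prediction still receives a strong gradient. The trouble appears when the \(\sigma'\) factor survives, as with a squared-error loss or sigmoid hidden layers: even with a big loss, the gradient does not flow back, and learning stalls.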
How you spot it A flat learning curve at high loss → suspect saturation. Weight magnitudes much larger than expected.
Next step Regularize to suppress \(|w|\), clip outliers. In neural nets, swap sigmoid for non-saturating activations like ReLU/GELU (Part 7 §6.3).
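A two-line comparison shows where saturation actually bites. Take a confidently wrong prediction (\(z = -10\), \(y = 1\)) and compare the score-level gradient under cross-entropy, which is \(\hat{p} - y\) as in Section 4.1, with the one under a squared-error loss \(\tfrac{1}{2}(\hat{p} - y)^2\), where the \(\sigma'(z)\) factor remains. The function names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_z_cross_entropy(z, y):
    """d/dz of -[y log p + (1 - y) log(1 - p)] with p = sigmoid(z): simply p - y."""
    return sigmoid(z) - y

def grad_z_squared_error(z, y):
    """d/dz of 0.5 * (p - y)^2: the sigma'(z) = p (1 - p) factor stays in the product."""
    p = sigmoid(z)
    return (p - y) * p * (1.0 - p)

z, y = -10.0, 1                      # saturated and wrong
print(grad_z_cross_entropy(z, y))    # close to -1: a strong corrective signal
print(grad_z_squared_error(z, y))    # close to 0: the signal is almost gone
```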
8.3 Multicollinearity
Why it is essential Same as Part 3 §8.3 — highly correlated features explode coefficient variance.
How you spot it VIF, correlation matrix, or noticing that small changes in regularization shift coefficients dramatically.
Next step L2 (Ridge-like) regularization. In high-dimensional domains like text and genomics, L1 (Lasso-like) produces sparse weights.
8.4 Class Imbalance
Why it is essential The loss weights every sample equally, so the majority class dominates it. Section 7.2's mitigations are required.
How you spot it Near-zero recall on the positive class; low PR-AUC.
Next step class_weight="balanced", SMOTE, threshold tuning, or focal loss (Part 7).
8.5 Calibration Error
Why it is essential Accuracy and calibration are separate problems. If the model says 90% but the empirical rate is 65%, the predicted probability is unfit as a decision input — even with high accuracy.
How you spot it Reliability diagram — predicted probability on \(x\)-axis, empirical accuracy on \(y\)-axis. Off the diagonal = miscalibration.
Next step CalibratedClassifierCV (Platt scaling or isotonic). For neural networks, temperature scaling.
8.6 \(p \gg n\) or Sparse Classes
Why it is essential MLE entering perfect-separation territory drives weights to \(\pm\infty\). This is the pathology of learning too well.
How you spot it Unregularized training produces unrealistically large weight magnitudes. Training loss collapses to zero while generalization fails.
Next step Regularization is mandatory, not optional. Part 5's full toolkit is the direct answer.
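The divergence is easy to reproduce on four perfectly separable points. In this sketch (made-up data, arbitrary learning rate, and an arbitrary illustrative penalty strength `lam = 0.1`), the unregularized weight keeps growing the longer you train, while the \(L_2\)-penalized run settles at a finite value.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(xs, ys, steps, lam=0.0, lr=0.5):
    """Gradient descent on mean cross-entropy plus an optional (lam / 2) * w^2 penalty."""
    w, b, n = 0.0, 0.0, len(xs)
    for _ in range(steps):
        errs = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= lr * (sum(e * x for e, x in zip(errs, xs)) / n + lam * w)
        b -= lr * sum(errs) / n
    return w, b

# Perfectly separable 1-D data: every negative lies left of every positive.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

w_100, _ = train(xs, ys, steps=100)
w_1k, _ = train(xs, ys, steps=1000)
w_10k, _ = train(xs, ys, steps=10000)
w_reg, _ = train(xs, ys, steps=10000, lam=0.1)

print(w_100, w_1k, w_10k)    # unregularized: larger every time
print(w_reg)                 # regularized: a finite fixed point
```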
What the Limits Sketch
The six limits of logistic regression compress the design decisions of every neural classification head:
- Linear separation limit → hidden layers (Part 6 MLP)
- Sigmoid saturation → ReLU/GELU (Part 7 activation comparison)
- Multicollinearity → weight decay (Part 7 regularization)
- Class imbalance → focal loss (Part 7 loss design)
- Calibration error → temperature scaling (Part 7 post-hoc calibration)
- MLE divergence → mandatory regularization (Part 5)
9. Quick Recap — Answer Before You Peek
Five core questions this article answered. Cover the answers, give a one-line response yourself, then check.
Q1. Why is logistic regression — despite the name — actually a classification model?
Answer The output is the probability \(\hat{p}(y=1 \mid x)\) — it regresses a continuous quantity, but that quantity is a probability, and thresholding it produces a classifier. Why The logit \(\log\frac{\hat{p}}{1-\hat{p}} = w^\top x + b\) is a linear regression; sigmoid maps it into \([0, 1]\). It is exactly the Bernoulli + logit-link case of the GLM (Part 3 §7.2). (Sections 6.4, 7.4.)
Q2. Why is cross-entropy not an arbitrary loss?
Answer Maximum likelihood estimation under a Bernoulli assumption gives the negative log-likelihood — which is exactly cross-entropy. Why "Every loss is the shadow of a distributional assumption" — Gaussian assumption gives MSE for regression, categorical assumption gives softmax cross-entropy for multi-class. Loss choice = distribution choice. (Section 6.4.)
Q3. Where does the threshold 0.5 come from? Why move it?
Answer 0.5 is arbitrary. The real threshold is decided by business cost. Why Medical screening (high cost of false negatives) → lower \(\tau\), favor recall. Spam filtering (high cost of false positives) → higher \(\tau\), favor precision. ROC and PR curves visualize all thresholds simultaneously. (Sections 6.3, 7.2.)
Q4. Why is the sigmoid derivative \(\sigma(z)(1-\sigma(z))\) a problem in deep networks?
Answer Its maximum is 0.25 → the gradient through \(L\) sigmoid layers shrinks like \(0.25^L\), and learning halts (vanishing gradient). Why This is the direct historical reason Part 7 adopts ReLU — its derivative is exactly 1 on the positive side, so gradients pass cleanly. Sigmoid survives only as an output head. (Section 6.2.)
Q5. Why might a 95%-accurate model be unsuitable for medical decisions?
Answer If the probability isn't calibrated, threshold choice and risk scoring become unreliable. Accuracy and calibration are independent problems. Why A model that says "0.9 confidence" while being right 65% of the time is unfit as a decision input. Reliability diagrams diagnose miscalibration; Platt or isotonic regression fixes it (Niculescu-Mizil & Caruana 2005). (Sections 7.3, 8.5.)
If four or five answers came easily, the logistic-regression chain — logit → sigmoid → cross-entropy → MLE → calibration — is in place.
10. Further Reading
Primary sources
- Cox, D. R. The Regression Analysis of Binary Sequences. Journal of the Royal Statistical Society, Series B, 20(2), 215–242 (1958). — The origin of logistic regression.
- McCullagh, P., Nelder, J. A. Generalized Linear Models, 2nd ed., Chapman & Hall (1989). — The systematic GLM treatment.
- Krishnapuram, B. et al. Sparse Multinomial Logistic Regression: Fast Algorithms and Generalization Bounds. IEEE TPAMI 27(6), 957–968 (2005). — Multi-class with regularization.
Official docs
- sklearn LogisticRegression: https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
- Probability calibration: https://scikit-learn.org/stable/modules/calibration.html
- ROC and PR: https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html
Companion books
- Hastie, Tibshirani, Friedman. ESL. Chapter 4.
- Bishop. PRML. Section 4.3.
In Part 5 we tackle regularization and model selection. We'll formalize the \(L_1\) and \(L_2\) penalties that have been hovering in Parts 3 and 4, walk through Ridge, Lasso, and ElasticNet with both equations and pictures, and use cross-validation plus information criteria (AIC, BIC) to choose a model.