"ML Foundations (1/9) — Machine Learning & sklearn: The Coordinate System of Learning"

This is a 9-part study series on classical machine learning, written as a runway toward neural networks and local LLMs. Part 1 nails down the shared grammar that every ML model obeys. sklearn shows this grammar more consistently than any other library. By the end of this part, the fit → predict → score → evaluate loop should feel like a single muscle, regardless of which model sits inside it.


0. Learning Objectives

By the end of this part you should be able to:

  • Distinguish supervised, unsupervised, and reinforcement learning in one sentence each.
  • Use sklearn's estimator interface (fit, predict, score, transform) inside a single screen of code.
  • Explain why we split data into train/validation/test and where data leakage typically enters.
  • Compute the core classification and regression metrics (accuracy, precision, recall, F1, MSE, R²) from their definitions, not just from a library call.
  • Run k-fold cross-validation in one line and report the mean and standard deviation of the score.
  • Use Pipeline to bind preprocessing and the model into a single object that structurally blocks data leakage.

1. Key Summary

  • Machine learning is the task of finding a function \(f: X \to Y\) from data. Every model, no matter how fancy, is a variation on this.
  • sklearn unifies all models behind the same interface (fit, predict, score), so swapping models is a one-line change.
  • The train/val/test split is the standard. Use validation to choose hyperparameters; touch the test set exactly once, at the very end.
  • Data leakage — test information bleeding into training — is the single most common beginner mistake. Pipeline + cross_val_score blocks it structurally.
  • Metrics depend on the problem type (classification vs regression) and on class balance. Never trust accuracy alone.

2. Intuition — What Does Machine Learning Actually Do?

2.1 We Are Finding a Function

Classical programming is the human writing the rules.

input → [human-written rules] → output

Machine learning gives both the input and the output, and asks the algorithm to discover the rules in the data.

(input, output) pairs → [learning algorithm] → function f

In symbols, with input space \(X\) and output space \(Y\), ML solves

$$ f^{*} = \arg\min_{f \in \mathcal{F}} \; \mathbb{E}_{(x,y) \sim \mathcal{D}}\big[\, L\big(f(x),\, y\big)\,\big] $$

where \(\mathcal{F}\) is the hypothesis class (the family of functions we are willing to consider), \(L\) is a loss function, and \(\mathcal{D}\) is the data distribution.

The catch is that \(\mathcal{D}\) is unknowable in full. We only ever have a finite sample \(\{(x_i, y_i)\}_{i=1}^{n}\), so we approximate the expectation with the empirical risk:

$$ \hat{f} = \arg\min_{f \in \mathcal{F}} \; \frac{1}{n} \sum_{i=1}^{n} L\big(f(x_i),\, y_i\big) $$

Almost every ML algorithm is a variation on this formula. They differ in the model class \(\mathcal{F}\), in the loss \(L\), and in the optimization method — that's it.
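
To make the empirical-risk formula concrete, here is a minimal sketch using NumPy only. The data-generating rule (y = 2x plus noise), the hypothesis class (lines through the origin on a grid of slopes), and the squared-error loss are all assumptions chosen for illustration; they are not part of the series' own code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Finite sample {(x_i, y_i)} from an assumed distribution D:
# y = 2x + small Gaussian noise (hypothetical, for illustration only).
x = rng.uniform(-1.0, 1.0, size=200)
y = 2.0 * x + rng.normal(0.0, 0.1, size=200)

# Hypothesis class F: lines through the origin, f_w(x) = w * x,
# restricted to a grid of candidate slopes.
candidate_slopes = np.linspace(0.0, 4.0, 41)


def empirical_risk(w):
    """Mean squared-error loss of f_w over the sample."""
    return np.mean((w * x - y) ** 2)


risks = np.array([empirical_risk(w) for w in candidate_slopes])
best_slope = candidate_slopes[np.argmin(risks)]  # the empirical risk minimizer
print(f"best slope = {best_slope:.1f}, empirical risk = {risks.min():.4f}")
```

Changing the grid changes \(\mathcal{F}\), changing the body of empirical_risk changes \(L\), and replacing the exhaustive search with gradient descent changes the optimizer. Those are the three knobs named above.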

2.2 The Three Branches

| Branch | Input | Output | Typical task |
|---|---|---|---|
| Supervised | x | y (label) | Classification, regression |
| Unsupervised | x | none (no labels) | Clustering, dimensionality reduction, density estimation |
| Reinforcement | state s | action a, reward r | Games, robotics, RLHF |

This series focuses on supervised learning. Unsupervised methods appear in Part 6 (autoencoders), and reinforcement learning appears in the LLM series under RLHF.

2.3 Why sklearn?

sklearn (scikit-learn) started in 2007 as a Google Summer of Code project and has since become the de facto Python library for classical ML. In a sentence: it is the library that gives every ML model the same coat.

  • Every estimator follows the fit(X, y) → predict(X) pattern.
  • Swapping models is a one-line change in the class name.
  • Preprocessing (StandardScaler, OneHotEncoder) all the way through evaluation (cross_val_score) share the same grammar.
  • It is not deep learning, but it teaches the mental flow that deep learning code also follows (dataset → model → training → evaluation).

The canonical citation:

Pedregosa, F. et al. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12 (2011): 2825–2830.

The JMLR 2011 paper states the design priorities explicitly: ease of use, performance, documentation, and API consistency, with minimal dependencies (in practice NumPy and SciPy).


3. Definitions & Notation — The Estimator Interface

3.1 Data Notation

  • \(X \in \mathbb{R}^{n \times d}\): the input matrix with \(n\) samples and \(d\)-dimensional features.
  • \(y \in \mathbb{R}^{n}\) (regression) or \(y \in \{0, 1, \dots, C-1\}^{n}\) (classification): the label vector.
  • A single sample: \(x_i \in \mathbb{R}^{d}\), \(y_i\).

3.2 The Estimator Interface

In sklearn, every learnable object is an estimator. The standard method signatures are:

class Estimator:
    """Schematic of the estimator interface, not a real sklearn base class."""
    def fit(self, X, y=None): ...            # learn from data; returns self
    def predict(self, X): ...                # return predictions y_hat for new X
    def score(self, X, y): ...               # return the default metric on (X, y)
    def transform(self, X): ...              # preprocessors / dim. reducers: return X_transformed
    def fit_transform(self, X, y=None): ...  # fit, then transform the same X

| Method | Lives on | Meaning |
|---|---|---|
| fit | All estimators | Learn from data |
| predict | Classifiers, regressors | Predict for new input |
| score | Classifiers, regressors | Default metric: accuracy for classifiers, R² for regressors |
| transform | Preprocessors, dim. reducers | Map input into a new representation |
| predict_proba | Probabilistic classifiers | Probability for each class |

This consistency is sklearn's core asset. Swapping LogisticRegression for RandomForestClassifier or KNeighborsClassifier changes only the line that constructs the model; the fit, predict, and score calls stay exactly the same.
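
As a rough illustration of that uniformity, the sketch below drives three different classifiers through one loop. The dataset (load_breast_cancer) and the hyperparameters are assumptions picked for convenience, not recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Only the constructor differs between models.
models = [
    LogisticRegression(max_iter=10000),
    RandomForestClassifier(random_state=42),
    KNeighborsClassifier(),
]

scores = {}
for model in models:
    model.fit(X_tr, y_tr)                                   # same call for every model
    scores[type(model).__name__] = model.score(X_te, y_te)  # accuracy for classifiers

for name, acc in scores.items():
    print(f"{name}: acc = {acc:.4f}")
```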


4. Math & Mechanism — Generalization, Splits, Leakage

4.1 Decomposing Generalization Error

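For squared-error loss \(L(\hat{y}, y) = (y - \hat{y})^{2}\), the generalization error of a learned model \(\hat{f}\) at a point \(x\) decomposes as follows. The expectation runs over training sets and label noise, and \(f^{*}(x)\) here denotes the true regression function \(\mathbb{E}[y \mid x]\):

$$ \mathbb{E}\big[(y - \hat{f}(x))^{2}\big] \;=\; \underbrace{\big(\,\mathbb{E}[\hat{f}(x)] - f^{*}(x)\,\big)^{2}}_{\text{Bias}^2} \;+\; \underbrace{\mathrm{Var}\big[\hat{f}(x)\big]}_{\text{Variance}} \;+\; \underbrace{\sigma^{2}}_{\text{Noise}} $$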

  • Bias: how badly the model class fails to express the true function — i.e., underfitting.
  • Variance: how much the model wobbles as the training set changes — i.e., overfitting.
  • Noise: irreducible uncertainty in the data itself. No model removes this.

We will revisit this in Part 5 (regularization). For now, the only thing to internalize is: low training error does not imply low generalization error.
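
The gap between training error and generalization error is easy to see with a high-variance model. The following is a minimal sketch under stated assumptions: synthetic data from make_classification with 20% of labels assigned at random, and an unpruned DecisionTreeClassifier as the deliberately overfit model.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; flip_y assigns a random class to 20% of the samples,
# which plays the role of irreducible noise.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# An unpruned tree keeps splitting until every training sample is classified
# correctly, so it memorizes the noisy labels too.
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
train_acc = tree.score(X_tr, y_tr)
test_acc = tree.score(X_te, y_te)
print(f"train acc = {train_acc:.3f}, test acc = {test_acc:.3f}")
```

The training score is perfect because the tree has memorized the sample; the test score is what the bias-variance decomposition is actually about.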

4.2 Why Train / Val / Test Are Three Things

  • Train: fit the model's parameters (weights).
  • Validation: choose hyperparameters (e.g., \(k\) for k-NN, \(\lambda\) for Ridge).
  • Test: estimate the final generalization error exactly once.

If you skip the validation/test separation, your hyperparameter choice itself starts fitting the test set, producing an optimistic bias. Looking at the same test set repeatedly while reporting numbers is, in effect, training on it.

Typical splits are 60/20/20 or 70/15/15. With very large datasets, 98/1/1 is fine — one percent of a million points is still 10,000 test examples.
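
sklearn has no single three-way split function, so one common approach is to call train_test_split twice. The sketch below assumes a 60/20/20 target on a toy array of 1,000 rows; the array contents are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 1,000 samples, 2 features, balanced binary labels.
X = np.arange(2000).reshape(1000, 2)
y = np.array([0, 1] * 500)

# Step 1: carve off 20% as the test set.
X_trval, X_te, y_trval, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Step 2: 25% of the remaining 80% is 20% of the total, used for validation.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_trval, y_trval, test_size=0.25, random_state=42, stratify=y_trval
)

print(len(X_tr), len(X_val), len(X_te))  # 600 200 200
```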

4.3 Data Leakage

This is the single most important concept in this part. Leakage is the future leaking into the past.

Typical leakage modes:

  1. Preprocessing leakage: fit StandardScaler on the entire dataset before splitting → test statistics seep into train normalization.
  2. Target leakage: a feature is effectively the label (e.g., a date-of-death column when predicting mortality).
  3. Temporal leakage: shuffling a time series randomly → you are predicting the past from the future.
  4. Group leakage: the same patient appears in both train and test.

The mental model is to treat test as the future. No statistic computed on the test set may be reused in training.

sklearn's Pipeline enforces this discipline at the structural level — we will rely on it from Part 4 onward.
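
Before relying on Pipeline, it helps to see the manual version of the rule once. The sketch below uses random placeholder data and contrasts a scaler fit on the training rows only with the leaky pattern of fitting on everything; the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))  # placeholder features

X_tr, X_te = train_test_split(X, test_size=0.25, random_state=0)

# Correct: statistics come from the training rows only.
scaler = StandardScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)  # test is transformed, never fit

# Leaky pattern, shown only for contrast: test rows influence the statistics.
leaky_scaler = StandardScaler().fit(X)

print("train-only mean:", scaler.mean_.round(3))
print("full-data mean: ", leaky_scaler.mean_.round(3))
```

The two means differ only slightly here, which is exactly why this kind of leakage goes unnoticed: nothing crashes, the numbers just get a little too good.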


5. Diagram — The Full Flow

[Diagram 1: the full flow, from data split to final test evaluation]

The rule of thumb: use validation to choose the model, then look at test once. If you ever look at test and then change the model, that test set has effectively become a validation set.


6. Principle Walkthrough — Unpacking sklearn's Five Steps

The visible code is just five lines. The interesting part is what learning-science principle each line encodes.

6.1 The Standard Five-Step Workflow

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
model = LogisticRegression(max_iter=10000).fit(X_tr, y_tr)
print(f"acc = {accuracy_score(y_te, model.predict(X_te)):.4f}")

Five actions — load, split, fit, predict, evaluate. The same five apply to every sklearn model, and that uniform fit/predict interface is why sklearn became the de facto ML standard (Buitinck et al. 2013).

6.2 How Pipeline Structurally Prevents Data Leakage

The failure mode: The most common leakage pattern is calling StandardScaler().fit(X) on the full dataset before splitting. Validation/test statistics contaminate the train-time normalization, and test performance inflates.

How Pipeline blocks it: When Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())]) is passed to cross_val_score, the scaler is fit only on each fold's training portion; the validation portion only sees transform. Preprocessing leakage is blocked by construction. Target, temporal, and group leakage still need the right features and the right splitter.

Why this design was adopted: Pipeline is part of the API design described in Buitinck et al. 2013. It targets a recurring source of irreproducible results: the same dataset and model yielding different scores depending on where the preprocessing was fit. Pipeline makes the order of fitting explicit in code, so that mistake cannot be made silently.

Forward link: Part 7's BatchNorm applies a related principle to neural networks: normalization statistics come from training mini-batches, and inference reuses the running averages collected during training. LayerNorm sidesteps the issue by normalizing each sample on its own.
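
A minimal sketch of the Pipeline pattern follows, assuming the same breast cancer dataset used in Section 6.1 and default 5-fold cross-validation. The step names "scaler" and "clf" are arbitrary labels.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scaler and classifier travel together as one estimator.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# For each fold, cross_val_score clones the pipeline and calls fit on the
# training portion only; the held-out portion goes through transform + predict.
scores = cross_val_score(pipe, X, y, cv=5)
print(f"acc = {scores.mean():.4f} ± {scores.std():.4f}")
```

Because the pipeline is itself an estimator, the one-line cross-validation call stays the same no matter how many preprocessing steps are added in front of the model.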

6.3 Metrics Are Not Black Boxes — Each Is a One-Line Definition

From the confusion matrix \([[\text{TN}, \text{FP}], [\text{FN}, \text{TP}]]\), every classification metric reduces to one line.

| Metric | Definition | Meaning |
|---|---|---|
| Accuracy | \((\text{TP}+\text{TN})/(\text{TP}+\text{TN}+\text{FP}+\text{FN})\) | "Fraction of correct predictions" |
| Precision | \(\text{TP}/(\text{TP}+\text{FP})\) | "Of predicted positives, how many were truly positive" |
| Recall | \(\text{TP}/(\text{TP}+\text{FN})\) | "Of true positives, how many we caught" |
| F1 | \(2 \cdot \text{Prec} \cdot \text{Rec} / (\text{Prec}+\text{Rec})\) | Harmonic mean of the two |

Regression metrics follow the same template — \(\text{MSE} = \tfrac{1}{n}\sum (y - \hat{y})^2\), \(R^2 = 1 - \text{SS}_{\text{res}}/\text{SS}_{\text{tot}}\). Before reading a number off, recall what the formula is asking — that habit is the starting point of Section 7's metric selection.
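
To check that the definitions and the library agree, the sketch below computes each metric by hand and compares it with the corresponding sklearn.metrics function. The label vectors are small made-up examples.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score,
    mean_squared_error, precision_score, r2_score, recall_score,
)

# --- Classification (made-up labels) ---
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 0])

# sklearn's binary confusion matrix is [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(f"acc={accuracy:.3f} prec={precision:.3f} rec={recall:.3f} f1={f1:.3f}")

# --- Regression (made-up targets) ---
y = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

mse = np.mean((y - y_hat) ** 2)
ss_res = np.sum((y - y_hat) ** 2)        # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)     # total sum of squares
r2 = 1 - ss_res / ss_tot
print(f"mse={mse:.3f} r2={r2:.3f}")
```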


7. Variants & Case Studies — Where Metric Choice Becomes a Business Decision

Once a model goes into production, the most contested topic is rarely model architecture — it is which metric you optimize. Pick the wrong one and the model looks great while quietly damaging the business.

7.1 Class Imbalance — Where Accuracy Stops Working

What changes: With a 1% positive rate, "always predict negative" gives 99% accuracy. The model did nothing and scored highly.
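
The arithmetic can be reproduced directly. This sketch builds a toy label vector with exactly 1% positives and uses DummyClassifier as the "always predict negative" model; the feature matrix is a placeholder because the dummy model ignores it.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# 10 positives among 1,000 samples: a 1% positive rate.
y = np.array([1] * 10 + [0] * 990)
X = np.zeros((1000, 1))  # placeholder features; the baseline never looks at them

baseline = DummyClassifier(strategy="constant", constant=0).fit(X, y)
y_pred = baseline.predict(X)

acc = accuracy_score(y, y_pred)
rec = recall_score(y, y_pred)
print(f"accuracy = {acc:.2f}, recall = {rec:.2f}")  # accuracy = 0.99, recall = 0.00
```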

Why it appeared: Medical diagnostics, fraud detection, and anomaly detection all live in the rare-positive regime. Provost & Fawcett 1997, Analysis and Visualization of Classifier Performance (KDD), made the case for ROC analysis over plain accuracy when class and cost distributions are uncertain; PR curves are the usual companion when positives are rare.

What it enabled:

  • Precision vs recall: precision is "fraction of predicted positives that were real"; recall is "fraction of real positives caught." Splitting the two metrics makes the trade-off explicit.
  • F1 (harmonic mean): appropriate when both error types cost roughly the same.
  • PR-AUC / ROC-AUC: model-level separability before picking a threshold.
  • Weighted-cost model: directly minimize \(c_{\text{FN}} \cdot \text{FN} + c_{\text{FP}} \cdot \text{FP}\), the domain-specific expected loss.

Limits and next step: Mapping a metric to business cost is not automated. Part 4 (logistic regression) explains threshold selection; Part 5 (regularization) covers class-imbalance correction.
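
One way to act on a weighted-cost objective is to pick the decision threshold that minimizes it on validation data. The sketch below is an illustration under stated assumptions: the cost values C_FN and C_FP are hypothetical, the data is synthetic with roughly 5% positives, and the threshold grid is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

C_FN, C_FP = 10.0, 1.0  # hypothetical costs: a miss is 10x worse than a false alarm

X, y = make_classification(n_samples=4000, weights=[0.95, 0.05], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_val)[:, 1]  # estimated P(y = 1)


def total_cost(threshold):
    """c_FN * FN + c_FP * FP on the validation set at a given threshold."""
    y_pred = (proba >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_val, y_pred, labels=[0, 1]).ravel()
    return C_FN * fn + C_FP * fp


thresholds = np.linspace(0.05, 0.95, 19)
costs = np.array([total_cost(t) for t in thresholds])
best_threshold = thresholds[np.argmin(costs)]
print(f"best threshold = {best_threshold:.2f}, cost = {costs.min():.0f}")
```

The threshold is chosen on validation data, not on the test set, for the same reason hyperparameters are.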

7.2 Regression Metrics — A Policy on "What Counts as a Big Error"

| Metric | Property | Use case |
|---|---|---|
| MSE | Quadratic penalty on big errors | Default, differentiable |
| RMSE | √MSE, same unit as y | Reporting / interpretability |
| MAE | Robust to outliers | When outliers are routine |
| R² | Fraction of variance explained | Scale-invariant model comparison |
| MAPE | Percentage error | Time series with large dynamic range |
| Quantile loss | \(\tau\)-quantile estimation | Risk / VaR modeling |

The key idea: choosing a metric is choosing a loss function. Picking MSE means committing to a policy in which one error of size 10 costs as much as a hundred errors of size 1. Under MAE, the same big error costs only as much as ten errors of size 1. The metric is a value statement.
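
A small worked example shows how differently the two metrics weigh the same residuals. The two error vectors below are made up: one has a single error of size 10 among 100 samples, the other has an error of size 1 on every sample.

```python
import numpy as np

# Scenario A: one big error of 10, the other 99 predictions are exact.
one_big = np.zeros(100)
one_big[0] = 10.0

# Scenario B: every one of the 100 predictions is off by 1.
many_small = np.ones(100)


def mse(errors):
    return np.mean(errors ** 2)


def mae(errors):
    return np.mean(np.abs(errors))


print(f"MSE: A = {mse(one_big):.2f}, B = {mse(many_small):.2f}")  # MSE: A = 1.00, B = 1.00
print(f"MAE: A = {mae(one_big):.2f}, B = {mae(many_small):.2f}")  # MAE: A = 0.10, B = 1.00
```

MSE rates the two scenarios as equally bad; MAE rates scenario B ten times worse. Neither is wrong. They encode different opinions about what a big error costs.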

7.3 Cross-Validation — The Data's Structure Picks the Split

| Method | When | Why |
|---|---|---|
| KFold(5) | Default regression | IID assumption |
| StratifiedKFold(5) | Classification | Preserves class ratio per fold |
| GroupKFold | Patient / user / store grouping | Same group across folds = leakage |
| TimeSeriesSplit | Time series | Random shuffle = future leaks into past |
| LeaveOneOut | Very small data | Maximizes use of every sample |

The key idea: splitting is not "how to cut the data"; it is "how to tell the evaluation what is not IID." Random-splitting a time series lets the model train on data from the future of its validation points, so the generalization estimate comes out optimistically biased.
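
Printing the indices each splitter produces makes the difference tangible. The sketch below assumes 12 time-ordered samples and four hypothetical patient IDs with three measurements each.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 time-ordered samples

# TimeSeriesSplit: every training index precedes every test index.
ts_splits = list(TimeSeriesSplit(n_splits=3).split(X))
for train_idx, test_idx in ts_splits:
    print("time  train:", train_idx, "test:", test_idx)

# GroupKFold: hypothetical patient IDs, three measurements per patient.
groups = np.repeat(["p1", "p2", "p3", "p4"], 3)
group_splits = list(GroupKFold(n_splits=4).split(X, groups=groups))
for train_idx, test_idx in group_splits:
    print("group test patients:", sorted(set(groups[test_idx])))
```

Either splitter can be passed as the cv argument of cross_val_score, so the rest of the evaluation code does not change.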

7.4 Real-World Examples

  • Spam filter: precision-weighted (a false positive loses user trust).
  • Cancer screening: recall-weighted (a missed positive harms a patient).
  • Click-through rate (CTR) prediction: PR-AUC + calibration (probabilities must mean what they claim).
  • Fraud detection: direct weighted-cost loss minimization.
  • Demand forecasting: MAPE (so small and large stores can be compared).
  • A/B test effect estimation: regression with confidence intervals — report the effect size band, not a point estimate.

8. Limits & Failure Modes — What sklearn Cannot Save You From

The sklearn API gives you a consistent, well-tested way to fit models. The honesty of your evaluation is on you. Most failures live here.

8.1 Data Leakage

Why it is essential: Leakage is future information bleeding into the past. The model learned that information honestly, but the information will not exist in production. The result is a systematic overestimate of offline performance.

How you spot it:

  • Test R²/accuracy nearly equal to or higher than train: suspect leakage.
  • Feature importance lists a "direct answer" feature (e.g., date-of-death predicting mortality).
  • Any preprocessing called with fit_transform before splitting.

Next step: Wrap every preprocessing step in Pipeline and pass it to cross-validation. Use TimeSeriesSplit for time series and GroupKFold for grouped data. Part 5's regularization is the natural extension of this discipline.

8.2 Tuning Hyperparameters on the Test Set

Why it is essential: If you tune by looking at test scores, test becomes validation. After 100 comparisons, the best score is inflated by chance alone, even if no model is truly better. This is the multiple-comparisons problem.
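
The effect can be simulated without any real model. In this sketch every "model" is a coin flip, so the true accuracy of each one is 50%; the sizes (100 models, 100 test points) are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_test = 100, 100

y_test = rng.integers(0, 2, size=n_test)

# Each row is one "model" that guesses labels uniformly at random.
preds = rng.integers(0, 2, size=(n_models, n_test))
test_acc = (preds == y_test).mean(axis=1)

print(f"mean test accuracy over all models = {test_acc.mean():.3f}")
print(f"best test accuracy among them      = {test_acc.max():.3f}")
```

Reporting only the best row would suggest a model that beats chance, even though none of them does.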

How you spot it: Any workflow that compares N models and reports the highest test score has already committed the sin.

Next step: Three-way split (train / val / test). Or nested CV: outer fold is test, inner fold is hyperparameter selection. Part 5 returns to this when selecting \(\lambda\).
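
Nested CV can be sketched by passing a GridSearchCV object to cross_val_score. The dataset, the grid over the regularization strength C, and the fold counts below are assumptions for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Inner loop: choose C using 3-fold CV on whatever data it is given.
inner = GridSearchCV(pipe, param_grid={"clf__C": [0.01, 0.1, 1.0, 10.0]}, cv=3)

# Outer loop: each of the 5 outer test folds is never seen by the inner search.
outer_scores = cross_val_score(inner, X, y, cv=5)
print(f"nested CV acc = {outer_scores.mean():.4f} ± {outer_scores.std():.4f}")
```

The "step__parameter" naming (clf__C) is how a grid reaches a parameter inside a Pipeline step.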

8.3 No Random Seed

Why it is essential: sklearn's train_test_split, KFold (with shuffle=True), and any randomized initialization depend on the RNG. Without a fixed seed, scores can move by a few percentage points from run to run on small datasets, so a "0.3% improvement" becomes statistically meaningless.

How you spot it: Run the code twice. If the scores differ, the seed is loose.

Next step: Set random_state (e.g., random_state=42) everywhere. And for honest reporting, run with several seeds and report \(\text{mean} \pm \text{std}\).
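
Both halves of that advice fit in a few lines. The sketch below wraps one experiment in a hypothetical helper run(seed), then repeats it over five seeds; the dataset and the random forest settings are placeholders.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)


def run(seed):
    """One full experiment; the seed controls both the split and the model."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    return model.fit(X_tr, y_tr).score(X_te, y_te)


# Same seed, same score: the run is reproducible.
assert run(0) == run(0)

# Different seeds: report the spread, not a single number.
scores = np.array([run(seed) for seed in range(5)])
print(f"acc = {scores.mean():.4f} ± {scores.std():.4f} over {len(scores)} seeds")
```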

8.4 \(d\) (Features) ≫ \(n\) (Samples)

Why it is essential: When \(d\) approaches or exceeds \(n\), a flexible model has enough capacity to memorize the training set, and generalization cannot be estimated reliably. Even 30 features × 100 samples (\(d/n = 0.3\)) is already in the caution zone.

How you spot it: As a rule of thumb, \(d/n > 0.1\) needs caution; \(d/n > 1\) demands regularization.

Next step: Regularization (Part 5: Lasso, Ridge), dimensionality reduction (PCA), or collecting more data. Section 8.6 of Part 3 confronts this head-on.
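
An extreme case makes the danger visible: with more features than samples, plain linear regression can fit pure noise perfectly. The sketch below assumes 100 random features, 50 training samples, and targets that are independent of the features by construction.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, d = 50, 100  # more features than samples

# Features and targets are independent noise: there is nothing to learn.
X_tr, y_tr = rng.normal(size=(n, d)), rng.normal(size=n)
X_te, y_te = rng.normal(size=(200, d)), rng.normal(size=200)

model = LinearRegression().fit(X_tr, y_tr)
train_r2 = model.score(X_tr, y_tr)  # R² is the default score for regressors
test_r2 = model.score(X_te, y_te)
print(f"train R² = {train_r2:.3f}, test R² = {test_r2:.3f}")
```

A near-perfect training R² on data with no signal is memorization, and the test score exposes it.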

8.5 Non-IID Samples (Time Series, Grouped Data)

Why it is essential: Random splits rest on the IID (independent, identically distributed) assumption. Time series, patient cohorts, and store-level data are not IID. Mixing measurements from the same patient across train and val produces a model overfit to that patient.

How you spot it: Domain knowledge is the entry point. Are there time or group indices? If yes, random splits are risky.

Next step: TimeSeriesSplit, GroupKFold, or prospective evaluation (measure on genuinely new data).

8.6 Single-Metric Tunnel Vision

Why it is essential: Comparing models by accuracy alone hides class imbalance and cost structure. Comparing by F1 alone hides threshold dependence. In production, consensus across multiple metrics is safer.

How you spot it: A model-comparison table with one column.

Next step: Pair the classification report (precision, recall, F1, support) with a confusion matrix, a calibration curve, and a business cost function. Report them together.
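
The first two items of that report come straight from sklearn.metrics. A minimal sketch, assuming the breast cancer dataset and a scaled logistic regression as the model under evaluation:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
y_pred = model.fit(X_tr, y_tr).predict(X_te)

cm = confusion_matrix(y_te, y_pred)                            # [[TN, FP], [FN, TP]]
report = classification_report(y_te, y_pred, output_dict=True)  # per-class metrics as a dict

print(cm)
print(classification_report(y_te, y_pred))  # precision, recall, F1, support per class
```

The calibration curve and the business cost function are left out here; the cost function in particular has to come from the domain, not from the library.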


What the Limits Sketch

Six limits, one message: sklearn automates learning; it does not automate honesty in evaluation. The same insight returns in Part 7 ("why pick by val loss, not train loss?") and Part 8 ("why must transformer effects be evaluated task by task?").


9. Quick Recap — Answer Before You Peek

Five core questions this article answered. Cover the answers, give a one-line response yourself, then check.

Q1. Why did sklearn's fit/predict interface become the ML standard?

Answer: Because every model can be driven with the same five-step workflow (load → split → fit → predict → evaluate). Why: A uniform interface lowered the cost of comparing, reproducing, and automating ML experiments at once (Buitinck et al. 2013). The same dataset → model → training → evaluation flow reappears in PyTorch and TensorFlow code. (Section 6.1; Part 9.)

Q2. What happens if you call StandardScaler().fit_transform(X) before splitting?

Answer: Data leakage. The validation/test mean and variance contaminate train-time normalization, inflating test scores. Why: When you pass sklearn's Pipeline to cross_val_score, the scaler is fit only on each fold's training portion and the validation portion only sees transform, so preprocessing leakage is blocked by construction. (Sections 6.2, 8.1.)

Q3. A model has 99% accuracy on a dataset where the positive class is 1%. Is it a good model?

Answer: No. "Predict negative for everything" gets the same score. Why: Under heavy class imbalance, accuracy says almost nothing about the positive class. Precision, recall, F1, and PR-AUC are the right tools; Provost & Fawcett 1997 make the general case against relying on accuracy. (Section 7.1.)

Q4. What goes wrong if you use KFold(shuffle=True) on time-series data?

Answer: Time leakage. You effectively train on the future to predict the past, and the generalization estimate becomes optimistically biased. Why: Random splitting assumes IID. Time series are not IID; you need TimeSeriesSplit. Grouped data needs GroupKFold for the same reason. (Sections 7.3, 8.5.)

Q5. The same code gives scores that vary by ±2% between runs. How do you make a fair comparison?

Answer: Fix the RNG (random_state=42) and run with several seeds, reporting \(\text{mean} \pm \text{std}\). Why: A single score is indistinguishable from noise. A 0.3% difference must exceed seed variance to be meaningful. (Section 8.3.)


If four or five answers came easily, sklearn's core concepts are in place. For any question that stalled, jump to the section in parentheses and re-read.


10. Further Reading

Primary sources

  • Pedregosa, F. et al. Scikit-learn: Machine Learning in Python. JMLR 12 (2011): 2825–2830. — The original sklearn paper.
  • Buitinck, L. et al. API design for machine learning software: experiences from the scikit-learn project. ECML PKDD Workshop, 2013. arXiv:1309.0238 — the design philosophy behind fit/predict.

Official docs

  • scikit-learn User Guide: https://scikit-learn.org/stable/user_guide.html
  • Model evaluation: https://scikit-learn.org/stable/modules/model_evaluation.html
  • Cross-validation: https://scikit-learn.org/stable/modules/cross_validation.html
  • Pipelines: https://scikit-learn.org/stable/modules/compose.html

Companion books

  • Hastie, T., Tibshirani, R., Friedman, J. The Elements of Statistical Learning, 2nd ed., Springer 2009. Freely available PDF; Chapters 1–2 are the natural sequel to this part.
  • Bishop, C. Pattern Recognition and Machine Learning. Springer 2006. Chapter 1.

In Part 2 we visit the simplest model in machine learning: k-Nearest Neighbors. From a single sentence — "measure distance, average the nearest answers" — we will see all the core ideas (bias-variance, the curse of dimensionality, non-parametric models, decision boundaries) emerge together.

