"ML Foundations (6/9) — Neural Networks: From Perceptron to MLP"

A neural network is logistic regression from Part 4 stacked into layers. Each layer passes through a new nonlinear activation, expanding the family of functions the network can represent. In this part we put how a neural network learns onto one page. Perceptrons, XOR, MLPs, backpropagation, activation functions, and the universal approximation theorem all meet here.


0. Learning Objectives

  • Write the perceptron's learning rule and explain why it cannot solve XOR.
  • Express an MLP's forward pass as one line of matrix products.
  • Derive backpropagation by hand on a two-layer MLP.
  • Compare the definitions, derivatives, and trade-offs of sigmoid, tanh, ReLU, and GELU.
  • State the universal approximation theorem (Cybenko 1989 / Hornik 1991) precisely.
  • Train an MLP end-to-end in PyTorch.

1. Key Summary

  • Perceptron (1958): \(\hat{y} = \mathrm{step}(w^\top x + b)\). Converges only on linearly separable data.
  • XOR (1969): a single-layer perceptron cannot solve it — a major reason for the first AI winter.
  • An MLP is a repetition of linear combinations followed by nonlinear activations: \(h = \phi(Wx + b)\), stacked.
  • Training: forward to compute the loss → backward to propagate gradients → update weights.
  • Universal approximation (1989): a sufficiently wide single-hidden-layer MLP can approximate any continuous function on a compact domain to arbitrary accuracy.
  • ReLU's adoption (2010) made deep networks practically trainable.

2. Intuition — What Makes Neural Networks Different

2.1 A Single Neuron: Logistic Regression Reborn

One neuron's output:

$$ h = \phi(w^\top x + b) $$

If \(\phi\) is the sigmoid, this is exactly Part 4's logistic regression. If \(\phi\) is the step function, it is Rosenblatt's 1958 perceptron.

The novelty of a network is not the single neuron — it is that the outputs of many neurons feed the next layer's inputs. Stacking layers transforms the input space nonlinearly, which lets the network express far richer decision boundaries.

2.2 XOR — Why One Layer Is Not Enough

The XOR truth table:

| \(x_1\) | \(x_2\) | \(y\) |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |

No straight line separates these four points. A single-layer perceptron, which only draws hyperplanes, cannot learn XOR.
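To make the limit concrete, here is a minimal sketch of the classic perceptron learning rule, \(w \leftarrow w + \eta\,(y - \hat{y})\,x\) and \(b \leftarrow b + \eta\,(y - \hat{y})\), run on AND (linearly separable) and on XOR. The helper names `train_perceptron` and `accuracy`, the learning rate, and the epoch budget are illustrative assumptions, not something the text prescribes.

```python
def step(z):
    return 1 if z > 0 else 0

def train_perceptron(data, lr=1.0, max_epochs=100):
    """Perceptron rule: w += lr * (y - yhat) * x, b += lr * (y - yhat)."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for (x1, x2), y in data:
            yhat = step(w[0] * x1 + w[1] * x2 + b)
            err = y - yhat
            if err != 0:                 # update only on misclassified points
                w[0] += lr * err * x1
                w[1] += lr * err * x2
                b += lr * err
                mistakes += 1
        if mistakes == 0:                # a full clean pass means convergence
            break
    return w, b

def accuracy(data, w, b):
    hits = sum(step(w[0] * x1 + w[1] * x2 + b) == y for (x1, x2), y in data)
    return hits / len(data)

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

acc_and = accuracy(AND, *train_perceptron(AND))
acc_xor = accuracy(XOR, *train_perceptron(XOR))
print(acc_and, acc_xor)
```

On AND the rule reaches a clean pass after a few epochs. On XOR no linear boundary classifies more than three of the four points correctly, so the accuracy cannot exceed 0.75 however long the loop runs.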

With two layers it becomes possible. Layer one builds \(h_1 = x_1 \text{ AND NOT } x_2\) and \(h_2 = \text{NOT } x_1 \text{ AND } x_2\); layer two builds \(y = h_1 \text{ OR } h_2\). The nonlinear activation in layer one reshapes the input space so that the linear classifier in layer two suddenly faces a separable problem.
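The construction above can be written out with hand-picked weights. The thresholds below (all 0.5) are one of many valid choices and are an assumption made for illustration; nothing is learned here.

```python
def step(z):
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    h1 = step(x1 - x2 - 0.5)     # fires only for x1 AND NOT x2
    h2 = step(-x1 + x2 - 0.5)    # fires only for NOT x1 AND x2
    return step(h1 + h2 - 0.5)   # h1 OR h2

table = {(a, b): xor_net(a, b) for a in (0, 1) for b in (0, 1)}
print(table)  # {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
```

In the \((h_1, h_2)\) space the four inputs land on (0,0), (0,1), (1,0), and (0,0) again, which a single line separates.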

Minsky & Papert's Perceptrons (1969) formalized the single-layer limit, and the field stalled. The thing that ended the freeze was the rediscovery of backpropagation (Rumelhart, Hinton, Williams 1986), which made multi-layer training tractable.

2.3 Why the Nonlinearity Matters

If \(\phi(z) = z\) (the identity), a two-layer MLP collapses into

$$ h_2 = W_2 (W_1 x + b_1) + b_2 = W_2 W_1 x + (W_2 b_1 + b_2) $$

which is a single linear model. Stacking adds no expressive power without nonlinearities. That is the entire reason activations exist.
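A quick numerical check of the collapse, as a sketch with arbitrary random shapes (the dimensions and seed are assumptions): the two-layer identity network and the single merged linear map give the same output.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n1, n2 = 4, 5, 3
W1, b1 = rng.normal(size=(n1, d)), rng.normal(size=n1)
W2, b2 = rng.normal(size=(n2, n1)), rng.normal(size=n2)
x = rng.normal(size=d)

# Two "layers" with identity activation
two_layer = W2 @ (W1 @ x + b1) + b2

# The equivalent single linear model
W_eff = W2 @ W1
b_eff = W2 @ b1 + b2
one_layer = W_eff @ x + b_eff

collapsed = np.allclose(two_layer, one_layer)
print(collapsed)
```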


3. Definitions & Notation

3.1 MLP Forward Pass

For an MLP with \(L\) hidden layers:

Input layer:

$$ h^{(0)} = x $$

Hidden layers \(\ell = 1, \dots, L\):

$$ h^{(\ell)} = \phi^{(\ell)}\!\big(W^{(\ell)} h^{(\ell-1)} + b^{(\ell)}\big) $$

Output layer (regression):

$$ \hat{y} = W^{(L+1)} h^{(L)} + b^{(L+1)} $$

Output layer (classification):

$$ \hat{y} = \mathrm{softmax}\!\big(W^{(L+1)} h^{(L)} + b^{(L+1)}\big) $$

  • \(W^{(\ell)} \in \mathbb{R}^{n_\ell \times n_{\ell-1}}\), \(b^{(\ell)} \in \mathbb{R}^{n_\ell}\)
  • \(\phi^{(\ell)}\): the activation used at layer \(\ell\) (usually the same function in every hidden layer: ReLU by default, GELU in Transformers, tanh in some older or recurrent models).
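The definitions above translate almost line for line into NumPy. This is a minimal sketch for a single input vector, assuming ReLU in every hidden layer and a softmax output; `init_params`, `forward`, and the layer sizes are hypothetical names and values chosen for the example.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - z.max())              # subtract the max for numerical stability
    return e / e.sum()

def init_params(sizes, rng):
    """sizes = [n_0, n_1, ..., n_{L+1}]; W^(l) has shape (n_l, n_{l-1})."""
    return [(rng.normal(scale=np.sqrt(2.0 / n_in), size=(n_out, n_in)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(x, params):
    h = x                                # h^(0) = x
    for W, b in params[:-1]:             # hidden layers 1..L
        h = relu(W @ h + b)
    W_out, b_out = params[-1]            # output layer L+1
    return softmax(W_out @ h + b_out)

rng = np.random.default_rng(0)
params = init_params([4, 8, 8, 3], rng)  # d = 4, two hidden layers, 3 classes
y_hat = forward(rng.normal(size=4), params)
print(y_hat, y_hat.sum())
```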

3.2 Losses

Regression: MSE. Classification: cross-entropy (Part 4). Almost every neural-network output layer is one of these two.

3.3 Activation Functions

$$ \sigma(z) = \frac{1}{1 + e^{-z}}, \quad \sigma'(z) = \sigma(z)(1 - \sigma(z)) $$

$$ \tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}, \quad \tanh'(z) = 1 - \tanh^2(z) $$

$$ \mathrm{ReLU}(z) = \max(0, z), \quad \mathrm{ReLU}'(z) = \mathbb{1}[z > 0] $$

$$ \mathrm{GELU}(z) = z \cdot \Phi(z) \approx 0.5\, z\,\big(1 + \tanh(\sqrt{2/\pi}(z + 0.044715 z^3))\big) $$

| Function | Range | Strength | Weakness |
|---|---|---|---|
| sigmoid | (0, 1) | Probabilistic interpretation | Vanishing gradient for large \(\lvert z \rvert\) |
| tanh | (-1, 1) | Zero-centered | Still saturates |
| ReLU | [0, ∞) | Simple gradient, fast | "Dead ReLU" risk |
| GELU | ≈ [-0.17, ∞) | Smooth ReLU, the transformer default | Slightly more expensive |
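The formulas above can be sanity-checked numerically. The sketch below implements the four activations in NumPy, compares the closed-form derivatives of sigmoid and tanh against central finite differences, and measures how far the tanh approximation of GELU is from the exact \(z\,\Phi(z)\). The grid and the step size `eps` are arbitrary choices.

```python
import math
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2

def relu(z):
    return np.maximum(0.0, z)

def d_relu(z):
    return (z > 0).astype(float)

_erf = np.vectorize(math.erf)

def gelu(z):
    # exact form: z * Phi(z), with Phi the standard normal CDF
    return z * 0.5 * (1.0 + _erf(z / math.sqrt(2.0)))

def gelu_tanh(z):
    return 0.5 * z * (1.0 + np.tanh(math.sqrt(2.0 / math.pi) * (z + 0.044715 * z ** 3)))

z = np.linspace(-6.0, 6.0, 2401)
eps = 1e-5
# central finite differences as a check on the closed-form derivatives
fd_sigmoid = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
fd_tanh = (np.tanh(z + eps) - np.tanh(z - eps)) / (2 * eps)
gelu_gap = np.max(np.abs(gelu(z) - gelu_tanh(z)))

print(np.max(np.abs(fd_sigmoid - d_sigmoid(z))), np.max(np.abs(fd_tanh - d_tanh(z))), gelu_gap)
```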

4. Math & Mechanism — Backpropagation

4.1 Climbing the Loss Gradient One Layer at a Time

View the MLP as a composition \(f = f_{L+1} \circ f_L \circ \cdots \circ f_1\), one function per layer including the output layer. The chain rule does the rest.

At each layer keep two quantities:

  • Pre-activation \(z^{(\ell)} = W^{(\ell)} h^{(\ell-1)} + b^{(\ell)}\)
  • Post-activation \(h^{(\ell)} = \phi^{(\ell)}(z^{(\ell)})\)

Define the backward signal: \(\delta^{(\ell)} = \frac{\partial \mathcal{L}}{\partial z^{(\ell)}}\).

At the output:

$$ \delta^{(L+1)} = \hat{y} - y \quad \text{(cross-entropy + softmax / sigmoid combination)} $$

(This is the same clean cancellation from Part 4. For regression with an identity output and the loss \(\tfrac{1}{2}\lVert \hat{y} - y \rVert^2\), the output signal is again \(\hat{y} - y\).)

For a hidden layer \(\ell\):

$$ \boxed{\delta^{(\ell)} = \big(W^{(\ell+1)\top} \delta^{(\ell+1)}\big) \odot \phi'^{(\ell)}(z^{(\ell)})} $$

\(\odot\) is element-wise multiplication. Multiply the next layer's \(\delta\) by the weight transpose, then point-wise by this layer's activation derivative.

Weight gradients:

$$ \frac{\partial \mathcal{L}}{\partial W^{(\ell)}} = \delta^{(\ell)} (h^{(\ell-1)})^\top, \quad \frac{\partial \mathcal{L}}{\partial b^{(\ell)}} = \delta^{(\ell)} $$

These four lines are all of backpropagation. The same pattern repeats at every depth.

4.2 By Hand — Two-Layer MLP

Binary classification, two hidden units, one output, ReLU + sigmoid:

forward:
  z1 = W1 x + b1                 # (2,)
  h1 = ReLU(z1)
  z2 = w2^T h1 + b2              # scalar
  yhat = sigmoid(z2)
  L = -[y log yhat + (1-y) log(1-yhat)]

backward:
  delta2 = yhat - y
  dw2 = delta2 * h1              # (2,)
  db2 = delta2

  delta1 = (w2 * delta2) * (z1 > 0).astype(float)   # ReLU derivative
  dW1 = outer(delta1, x)         # (2, d)
  db1 = delta1

Automatic differentiation in PyTorch and TensorFlow automates exactly this procedure (Part 9).
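The hand derivation can be checked numerically. The sketch below is a NumPy transcription of the forward and backward steps above, followed by a central finite-difference check on every entry of `W1`. The input dimension, the random values, and the function name `loss_and_grads` are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grads(W1, b1, w2, b2, x, y):
    # forward
    z1 = W1 @ x + b1                      # (2,)
    h1 = np.maximum(0.0, z1)
    z2 = w2 @ h1 + b2                     # scalar
    yhat = sigmoid(z2)
    loss = -(y * np.log(yhat) + (1 - y) * np.log(1 - yhat))
    # backward
    delta2 = yhat - y
    dw2 = delta2 * h1
    db2 = delta2
    delta1 = (w2 * delta2) * (z1 > 0)     # ReLU derivative as a 0/1 mask
    dW1 = np.outer(delta1, x)             # (2, d)
    db1 = delta1
    return loss, (dW1, db1, dw2, db2)

rng = np.random.default_rng(0)
d = 3
W1, b1 = rng.normal(size=(2, d)), rng.normal(size=2)
w2, b2 = rng.normal(size=2), 0.1
x, y = rng.normal(size=d), 1.0

loss, (dW1, db1, dw2, db2) = loss_and_grads(W1, b1, w2, b2, x, y)

# finite-difference estimate of dL/dW1, one entry at a time
eps = 1e-6
num_dW1 = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        Wp, Wm = W1.copy(), W1.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        num_dW1[i, j] = (loss_and_grads(Wp, b1, w2, b2, x, y)[0]
                         - loss_and_grads(Wm, b1, w2, b2, x, y)[0]) / (2 * eps)

max_err = np.max(np.abs(num_dW1 - dW1))
print(max_err)
```

If the analytic and numerical gradients disagree by more than rounding noise, the usual suspects are a missing transpose or a forgotten activation mask.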

4.3 The Universal Approximation Theorem

(Cybenko 1989; Hornik 1991) For any continuous function \(f: [0,1]^d \to \mathbb{R}\) and any \(\epsilon > 0\), there exists a single-hidden-layer MLP \(g\) with enough hidden units such that \(\sup_{x \in [0,1]^d} |g(x) - f(x)| < \epsilon\).

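The condition on the activation: continuous and sigmoidal, that is, tending to 0 at \(-\infty\) and 1 at \(+\infty\) (Cybenko 1989); or, more generally, continuous, bounded, and non-constant (Hornik 1991). Leshno et al. (1993) later showed that any non-polynomial activation suffices, which covers ReLU.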

Caveat: the theorem guarantees existence only. It does not say you can find such a \(g\) efficiently, or that a single wide layer is the best architecture. Later results (Eldan & Shamir 2016 and others) show that depth, not just width, brings dramatic gains in both expressivity and efficiency.

4.4 Vanishing and Exploding Gradients

As \(\delta^{(\ell)} = (W^{(\ell+1)\top}\delta^{(\ell+1)}) \odot \phi'(z^{(\ell)})\) is pushed back through the layers, products of small numbers vanish toward zero, while products of large numbers explode.

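  • sigmoid: \(\phi' \le 0.25\), so vanishing accelerates with depth. tanh: \(\phi' \le 1\), but it drops toward 0 once the unit saturates.
  • ReLU: \(\phi'\) is 0 or 1, so the signal is blocked in the negative regime and preserved in the positive regime.
  • Initialization (Xavier, He): scale the weight variance to the layer width (fan-in for He, the average of fan-in and fan-out for Xavier). Part 7 has the details.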
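A rough way to see the effect is to push a unit-norm backward signal through a random deep network and measure what arrives at the first layer. The sketch below does this for sigmoid and ReLU. The depth, width, seed, and weight scales (\(1/\sqrt{n}\) for sigmoid, \(\sqrt{2/n}\) for ReLU) are assumptions picked for illustration, and `first_layer_delta_norm` is a hypothetical helper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def relu(z):
    return np.maximum(0.0, z)

def d_relu(z):
    return (z > 0).astype(float)

def first_layer_delta_norm(act, d_act, weight_std, depth=20, width=64, seed=0):
    rng = np.random.default_rng(seed)
    Ws = [rng.normal(scale=weight_std, size=(width, width)) for _ in range(depth)]
    # forward pass, caching every pre-activation z^(l)
    h, zs = rng.normal(size=width), []
    for W in Ws:
        z = W @ h
        zs.append(z)
        h = act(z)
    # backward pass: start from a unit-norm delta at the top layer
    delta = np.ones(width) / np.sqrt(width)
    for l in range(depth - 2, -1, -1):
        delta = (Ws[l + 1].T @ delta) * d_act(zs[l])
    return float(np.linalg.norm(delta))

norm_sigmoid = first_layer_delta_norm(sigmoid, d_sigmoid, weight_std=np.sqrt(1.0 / 64))
norm_relu = first_layer_delta_norm(relu, d_relu, weight_std=np.sqrt(2.0 / 64))
print(norm_sigmoid, norm_relu)
```

With these settings the sigmoid signal shrinks by many orders of magnitude on its way down, while the ReLU signal stays at a usable scale.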

5. Diagram

[Diagram 1: forward and backward passes through the MLP]

Forward runs top-down (solid arrows); backward runs bottom-up (dashed). The gradient flows from the loss back toward the inputs, one stage at a time.


6. Principle Walkthrough — What loss.backward() Hides in One Line

The code is nearly standard PyTorch. But the mechanism beneath loss.backward(), the chain rule unwound by automatic differentiation, is the body of neural-network training.

6.1 The Standard Two-Layer MLP

import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    """Two-layer MLP: Linear -> ReLU -> Linear, returning raw logits."""
    def __init__(self, in_dim, hidden, out_dim):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)    # W1, b1
        self.fc2 = nn.Linear(hidden, out_dim)   # W2, b2
    def forward(self, x):
        # No softmax here: nn.CrossEntropyLoss expects logits.
        return self.fc2(F.relu(self.fc1(x)))

A handful of lines for an entire model. What PyTorch handles for you:

  • Random initialization of \(W_1, b_1, W_2, b_2\) (nn.Linear uses a Kaiming-style uniform scheme by default).
  • Building a computation graph during the forward pass.
  • Tracing the graph backward at loss.backward() and computing all gradients via the chain rule.
  • Updating the weights from those gradients when you call the optimizer's step().
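To show the model in an end-to-end loop, here is a minimal training sketch. The synthetic XOR-like dataset, the hidden width, the Adam learning rate, and the step count are all assumptions chosen to keep the example small; they are not settings taken from this series.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    def __init__(self, in_dim, hidden, out_dim):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, out_dim)
    def forward(self, x):
        return self.fc2(F.relu(self.fc1(x)))

torch.manual_seed(0)
# Synthetic task (illustrative assumption): label 1 when x1 and x2 share a sign
X = torch.rand(512, 2) * 2 - 1
y = ((X[:, 0] * X[:, 1]) > 0).long()

model = MLP(in_dim=2, hidden=16, out_dim=2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()       # takes raw logits and integer class labels

losses = []
for _ in range(300):
    optimizer.zero_grad()             # clear gradients left over from the last step
    loss = loss_fn(model(X), y)       # forward pass builds the graph
    loss.backward()                   # backward pass fills .grad on every parameter
    optimizer.step()                  # weight update
    losses.append(loss.item())

with torch.no_grad():
    accuracy = (model(X).argmax(dim=1) == y).float().mean().item()
print(losses[0], losses[-1], accuracy)
```

The four calls inside the loop (zero_grad, forward, backward, step) are the whole training pattern; everything else is data and bookkeeping.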

6.2 What Backprop Actually Does at One Step

The single line loss.backward() implicitly executes:

$$ \delta^{(L+1)} = \frac{\partial \mathcal{L}}{\partial z^{(L+1)}}, \qquad \delta^{(\ell)} = \big((W^{(\ell+1)})^{\top} \delta^{(\ell+1)}\big) \odot \phi'(z^{(\ell)}) $$

At each layer \(\ell\), the cached \(h^{(\ell-1)}, z^{(\ell)}\) from forward give \(\frac{\partial \mathcal{L}}{\partial W^{(\ell)}} = \delta^{(\ell)} (h^{(\ell-1)})^{\top}\).
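One way to convince yourself that autograd and the delta recursion agree is to compute both on a tiny network. The sketch below builds a one-hidden-layer ReLU network with a softmax cross-entropy loss from raw tensors, calls loss.backward(), and compares the resulting .grad fields with the hand-computed \(\delta\) formulas. Shapes, seed, and variable names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, n1, k = 4, 5, 3
W1 = torch.randn(n1, d, requires_grad=True)
b1 = torch.randn(n1, requires_grad=True)
W2 = torch.randn(k, n1, requires_grad=True)
b2 = torch.randn(k, requires_grad=True)
x = torch.randn(d)
y = torch.tensor(1)                                   # true class index

# forward pass, keeping z1 and h1 for the manual backward pass
z1 = W1 @ x + b1
h1 = torch.relu(z1)
z2 = W2 @ h1 + b2
loss = F.cross_entropy(z2.unsqueeze(0), y.unsqueeze(0))
loss.backward()                                       # autograd fills .grad

with torch.no_grad():
    y_onehot = F.one_hot(y, k).float()
    delta2 = torch.softmax(z2, dim=0) - y_onehot      # output delta: yhat - y
    delta1 = (W2.T @ delta2) * (z1 > 0).float()       # hidden delta
    manual_dW2 = torch.outer(delta2, h1)
    manual_dW1 = torch.outer(delta1, x)

match = (torch.allclose(W2.grad, manual_dW2, atol=1e-5)
         and torch.allclose(W1.grad, manual_dW1, atol=1e-5)
         and torch.allclose(b1.grad, delta1, atol=1e-5)
         and torch.allclose(b2.grad, delta2, atol=1e-5))
print(match)
```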

Why autodiff changed the field:

  • You write the model as a formula; training works.
  • New activations, new losses, and new regularizers can be tried in one line.
  • Roughly thirty years separate Rumelhart et al.'s 1986 backprop paper from PyTorch (2017). Over that span, mature autodiff frameworks turned neural-network research from a specialist skill into something broadly accessible.

The framing: PyTorch automates the calculus. Your job is to decide the architecture and loss. Part 7 will show that those decisions — optimizer, learning rate, regularizers — are themselves the body of model design.

6.3 Activation Comparison (concept)

The experimental setup: Train the same MLP with the same hyperparameters (learning rate, batch size, optimizer) on the same dataset (MNIST or CIFAR-10), and change only the activation function: sigmoid → tanh → ReLU → GELU. Every difference in the learning curve is then attributable to the activation alone.

What the curves show:

  • Shallow nets (2–3 layers): all four activations end up at roughly the same accuracy. The differences barely matter.
  • Deeper nets (4+ layers): sigmoid and tanh barely move the loss, while ReLU and GELU converge cleanly. This is the visible signature of vanishing gradients. Sigmoid's derivative peaks at 0.25 and tanh's collapses toward zero in the saturated tails; multiply four sigmoid factors and the gradient is scaled by at most \(0.25^4 \approx 0.004\). ReLU's derivative is exactly 1 on the positive side, so the gradient passes through active units unscaled.

Why this experiment changed the field: Before 2010, sigmoid and tanh were the defaults, and deep networks were widely considered "untrainable." Nair & Hinton 2010, Rectified Linear Units Improve Restricted Boltzmann Machines, made ReLU mainstream and the depth ceiling fell. ReLU was one of the key ingredients in AlexNet's 2012 ImageNet win, the result that kicked off the modern deep-learning era. A simple controlled experiment opened the door.

The current landscape:

  • Transformers: GELU (BERT, GPT family).
  • CNNs: ReLU and variants (LeakyReLU, PReLU).
  • Mobile / quantization-friendly models: ReLU6 and Hardswish, clipped variants that play nicely with int8 inference.

6.4 Universal Approximation Demo (concept)

The experimental setup: Take a single-hidden-layer MLP and grow the hidden width through 8 → 32 → 128 → 512 → 2048. Use it to fit \(y = \sin(x)\) for \(x \in [-2\pi, 2\pi]\), with 1024 sampled points and MSE loss.

What you see:

  • 8 hidden units: a jagged near-linear curve that misses \(\sin\) entirely.
  • 32 hidden units: a rough oscillating curve; peaks and troughs are in roughly the right places.
  • 128 hidden units: visually indistinguishable from the true \(\sin\) curve.
  • 512+ hidden units: no visible improvement; the numerical error keeps shrinking but your eye won't tell the difference.
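A fast stand-in for this demo, as a sketch: instead of training with gradient descent, fix the hidden weights at random and solve only the output layer by least squares (a random-features shortcut, not the training procedure described above). It still shows the fit error on \(\sin\) falling as the hidden width grows. The tanh activation, the weight scale, the seed, and the width list are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-2 * np.pi, 2 * np.pi, 1024)
y = np.sin(x)

max_width = 512
w = rng.normal(scale=2.0, size=max_width)                 # hidden weights, fixed at random
b = rng.uniform(-2 * np.pi, 2 * np.pi, size=max_width)    # hidden biases, fixed at random
H_full = np.tanh(np.outer(x, w) + b)                      # (1024, max_width) hidden activations

mse = {}
for width in (8, 32, 128, 512):
    # use the first `width` hidden units plus a constant column for the output bias
    H = np.column_stack([H_full[:, :width], np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(H, y, rcond=None)          # fit the output layer only
    mse[width] = float(np.mean((H @ coef - y) ** 2))

for width, err in mse.items():
    print(width, err)
```

Because each wider model contains the narrower one's hidden units, the fit can only improve (up to numerical noise) as the width increases.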

Why this is the cleanest intuition for universal approximation: The universal approximation theorem (Cybenko 1989, Hornik 1991) states that a single-hidden-layer MLP, given enough hidden units, can approximate any continuous function on a compact domain to arbitrary precision. This is an existence result; it says nothing about efficiency. The demo makes that limit concrete: even a function as smooth as \(\sin\) needs ~100 hidden units before the eye is satisfied.

Why we still stack depth: Single-hidden-layer networks gain expressive power through width, but they cannot build hierarchical abstractions. For data with nested structure (pixels → edges → parts → objects, or characters → words → phrases → meanings), deep networks reach the same accuracy with far fewer parameters. Part 8 (deep-learning architectures) revisits why CNNs, RNNs, and Transformers all chose depth. Universal approximation says "you can"; depth says "you can, efficiently."


7. Variants & Case Studies — MLP Variants Connect All of Modern Deep Learning

7.1 The Activation Evolution — sigmoid → ReLU → GELU

| Name | Formula | Origin | Where it lives now |
|---|---|---|---|
| Sigmoid | \(1/(1+e^{-z})\) | Logistic regression (1958) | Classification output head only |
| Tanh | \(\tanh(z)\) | Early RNNs | LSTM / GRU internals |
| ReLU | \(\max(0, z)\) | Nair & Hinton 2010 | CNN default |
| Leaky ReLU | \(\max(\alpha z, z)\) | Address dead ReLU | CNN variants |
| PReLU | learnable \(\alpha\) | He et al. 2015 | Auto-tuned |
| ELU | negative side \(\alpha(e^z - 1)\) | Clevert 2015 | Mean-zero outputs |
| Swish/SiLU | \(z \cdot \sigma(z)\) | Ramachandran 2017 | LLaMA, EfficientNet |
| GELU | \(z \cdot \Phi(z)\) | Hendrycks 2016 | Transformer default (BERT, GPT) |

Core trajectory:

  • sigmoid → ReLU: the vanishing-gradient problem is largely mitigated (~2010).
  • ReLU → GELU/Swish: a smooth derivative aids training stability (Transformer era).

7.2 MLP Regularization — Five Mechanisms Working Together

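What changes: On top of plain SGD, add all of:

nn.Dropout(p=0.5)                                   # randomly zeroes activations on each training forward pass
optim.Adam(model.parameters(), weight_decay=1e-4)   # L2 penalty applied inside the optimizer
nn.BatchNorm1d / nn.LayerNorm                       # layer-wise distribution stabilization

Why all five are needed: Neural networks have so much capacity that any single regularizer leaks. The snippet shows three of the mechanisms (dropout, weight decay, normalization); Section 8.3 names early stopping and data augmentation as further members of the set. Part 7 explores how the five mechanisms complement each other.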
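As a sketch of how the three pieces in the snippet fit into one model, the block below wires dropout and LayerNorm into a small MLP and passes weight decay to the optimizer. The class name `RegularizedMLP`, the layer order, and every hyperparameter are assumptions made for illustration; Part 7 is the place for tuned settings.

```python
import torch
import torch.nn as nn

class RegularizedMLP(nn.Module):
    def __init__(self, in_dim, hidden, out_dim, p_drop=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.LayerNorm(hidden),      # normalizes each example's hidden activations
            nn.ReLU(),
            nn.Dropout(p=p_drop),      # active only in train() mode
            nn.Linear(hidden, out_dim),
        )
    def forward(self, x):
        return self.net(x)

torch.manual_seed(0)
model = RegularizedMLP(in_dim=10, hidden=32, out_dim=3)
# weight_decay adds an L2 penalty on the parameters inside the optimizer update
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

x = torch.randn(4, 10)
model.eval()                           # dropout off: repeated passes agree exactly
with torch.no_grad():
    eval_same = torch.equal(model(x), model(x))
model.train()                          # dropout on: each pass samples a fresh mask
with torch.no_grad():
    out_train = model(x)
print(eval_same, out_train.shape)
```

Forgetting to switch between model.train() and model.eval() is a common source of noisy validation numbers, since dropout stays active in train mode.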

7.3 Output Heads = GLM in Disguise

| Task | Output activation | Loss | Distributional assumption |
|---|---|---|---|
| Binary classification | sigmoid | binary cross-entropy | Bernoulli |
| Multi-class | softmax | categorical cross-entropy | Categorical |
| Regression | identity | MSE / MAE | Gaussian / Laplace |
| Multi-label | per-class sigmoid | per-class BCE | Bernoulli × K |
| Count | exp | Poisson NLL | Poisson |

The framing: a neural network's output head is a learned representation plus a GLM, the same framing as Part 3 §7.2. The network keeps a standard GLM output and stacks representation learning in front of it.
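One practical consequence of the GLM view is that, for these activation and loss pairings, the gradient of the loss with respect to the pre-activation is \(\hat{y} - y\). The sketch below checks this with finite differences for the Bernoulli, Gaussian, Poisson, and categorical heads. It assumes the squared-error loss is written as \(\tfrac{1}{2}(z - y)^2\) and drops the constant \(\log y!\) term from the Poisson NLL; the test points are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# (name, loss as a function of the pre-activation z, mean function yhat(z), target y)
cases = [
    ("bernoulli", lambda z, y: -(y * np.log(sigmoid(z)) + (1 - y) * np.log(1 - sigmoid(z))), sigmoid, 1.0),
    ("gaussian", lambda z, y: 0.5 * (z - y) ** 2, lambda z: z, 2.0),
    ("poisson", lambda z, y: np.exp(z) - y * z, np.exp, 3.0),
]

z, eps = 0.3, 1e-6
gaps = {}
for name, loss, mean_fn, y in cases:
    numeric = (loss(z + eps, y) - loss(z - eps, y)) / (2 * eps)   # dL/dz by central difference
    gaps[name] = abs(numeric - (mean_fn(z) - y))                  # compare with yhat - y

# categorical head: softmax + cross-entropy, checked coordinate by coordinate
zc = np.array([0.2, -1.0, 0.5])
yc = np.array([0.0, 1.0, 0.0])

def ce(v):
    return -np.sum(yc * np.log(softmax(v)))

numeric_c = np.array([(ce(zc + eps * e) - ce(zc - eps * e)) / (2 * eps) for e in np.eye(3)])
gaps["categorical"] = float(np.max(np.abs(numeric_c - (softmax(zc) - yc))))
print(gaps)
```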

7.4 Where MLPs Live as Sub-blocks

  • CNN (Part 8): convolution = a weight-shared local MLP. The same weights applied at every spatial location.
  • Transformer feed-forward (LLM Core Study Part 1): a two-layer MLP applied to each token — about 2/3 of model parameters live here.
  • Recommendation (Wide & Deep): linear (wide) + MLP (deep). Google 2016, Wide & Deep Learning.
  • Mixture of Experts (MoE): multiple MLPs selected by a gating network (Mixtral; reportedly GPT-4, though this is unconfirmed).
  • GNN message function: graph node messages parameterized by an MLP.

The framing: in modern deep learning, the minimum unit block of nearly every architecture is still an MLP. Even the attention of Part 8 is assembled from linear projections, the same ingredient an MLP layer uses.
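The "about 2/3" figure for the Transformer feed-forward block follows from simple counting. The sketch below assumes a standard block with four \(d \times d\) attention projections (Q, K, V, output) and a feed-forward hidden size of \(4d\), and it ignores biases, layer norms, and embeddings. The function name and `d_model=768` are illustrative.

```python
def block_param_split(d_model, ffn_mult=4):
    """Weight counts for one Transformer block, ignoring biases, norms and embeddings."""
    attention = 4 * d_model * d_model                    # Q, K, V and output projections
    feed_forward = 2 * d_model * (ffn_mult * d_model)    # up-projection and down-projection
    return attention, feed_forward

attn, ffn = block_param_split(d_model=768)
ffn_fraction = ffn / (attn + ffn)                        # 8 d^2 / 12 d^2 = 2/3
print(attn, ffn, ffn_fraction)
```

Models that change the feed-forward multiplier or use gated feed-forward variants shift this ratio, so treat 2/3 as a rule of thumb for the standard layout.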

7.5 Industry Cases — Where MLPs Alone Are Enough

  • Tabular-data competitions: XGBoost / LightGBM usually win, but neural models (TabNet, SAINT) have narrowed the gap since 2020.
  • Embedding-based search and recommendation: the embedding is the substance; one or two MLP layers suffice.
  • AlphaFold (protein structure): MLPs as key building blocks between attention.
  • LLM feed-forward layers: half or more of model parameters live in two-layer MLPs.

8. Limits & Failure Modes — Six Enemies That Depth Created

8.1 Vanishing Gradient

Why it is essential: Saturating activations have small derivatives. Sigmoid's never exceeds 0.25, and tanh's reaches 1 only at \(z = 0\) and decays quickly away from it. With depth \(L\), the sigmoid derivative factors alone contribute at most \(0.25^L\) to the gradient, so the layers closest to the input stop training.

How you spot it: Early in training, the weights of the layers closest to the input barely change, and the learning curve is very slow.

Next step: ReLU-family activations (Section 6.3) + He initialization + BatchNorm/LayerNorm + skip connections (Part 8, ResNet). All four work together to unlock 100+ layer depth.

8.2 Dead ReLU

Why it is essential: ReLU has zero derivative for \(z < 0\). A neuron whose pre-activation is negative for every training input receives no gradient and can stay stuck there: a dead neuron.

How you spot it: Inspecting post-training activations reveals many neurons that always output zero.

Next step: Leaky ReLU (\(\alpha = 0.01\)), PReLU (learned \(\alpha\)), or GELU/Swish (smooth derivatives + nonzero negative side).
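Spotting dead units takes only a few lines: run a batch through the layer and flag every unit whose output is zero for all inputs. In the sketch below five units are made dead on purpose by giving them a large negative bias; the data, layer sizes, and the bias value are assumptions for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, d, hidden = 1000, 10, 32
X = rng.normal(size=(n_samples, d))
W = rng.normal(scale=np.sqrt(2.0 / d), size=(hidden, d))
b = np.zeros(hidden)
b[:5] = -100.0                      # simulate five units pushed deep into the negative regime

H = np.maximum(0.0, X @ W.T + b)    # (n_samples, hidden) post-ReLU activations
dead = np.all(H == 0.0, axis=0)     # a unit counts as dead if it never fires on any input
dead_fraction = dead.mean()
print(int(dead.sum()), dead_fraction)
```

On a real model the same check would be run on the cached activations of each hidden layer over a validation batch.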

8.3 Overfitting

Why it is essential: Neural networks can approximate any function (universal approximation). That capacity is identical to the capacity to memorize the training set. With limited data, memorization wins.

How you spot it: Train accuracy 99%, test accuracy 70% → overfit. Train loss drops while validation loss rises.

Next step: Dropout + weight decay + early stopping + data augmentation, used together (Part 7).

8.4 Weakness on Tabular Data

Why it is essential: Tabular data has little implicit structure (no spatial locality of pixels, no sequential order of text) for neural networks to exploit. Tree models (XGBoost, LightGBM) naturally use axis-aligned splits on tabular data.

How you spot it: In almost every Kaggle tabular competition, gradient boosting wins.

Next step: TabNet (Arik & Pfister 2020), SAINT (Somepalli et al. 2021), FT-Transformer (Gorishniy et al. 2021): architectures that apply attention to tabular features.

8.5 Initialization Sensitivity

Why it is essential: Naive initialization, such as drawing every weight from a standard normal, makes deep-net activations and gradients grow or shrink exponentially with depth. Training cannot even start.

How you spot it: Loss is NaN from the start, or the learning curve flatlines.

Next step: Xavier (Glorot & Bengio 2010, for tanh) and He (He et al. 2015, for ReLU) initialization. Both scale the weight variance to the layer width: He uses the fan-in, Xavier the average of fan-in and fan-out. Part 7 covers this in depth.
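The effect is easy to reproduce without any training. The sketch below pushes one random input through 20 ReLU layers and reports the root-mean-square activation at the end, once with standard-normal weights and once with He scaling, \(\mathrm{Var}(W) = 2/\text{fan-in}\). Depth, width, seed, and the helper name `final_rms` are illustrative assumptions.

```python
import numpy as np

def final_rms(weight_std_fn, depth=20, width=256, seed=0):
    rng = np.random.default_rng(seed)
    h = rng.normal(size=width)
    for _ in range(depth):
        W = rng.normal(scale=weight_std_fn(width), size=(width, width))
        h = np.maximum(0.0, W @ h)                       # ReLU layer, no bias
    return float(np.sqrt(np.mean(h ** 2)))

rms_std_normal = final_rms(lambda fan_in: 1.0)                   # N(0, 1) weights
rms_he = final_rms(lambda fan_in: np.sqrt(2.0 / fan_in))         # He: Var(W) = 2 / fan_in
print(rms_std_normal, rms_he)
```

With standard-normal weights each layer multiplies the activation scale by roughly \(\sqrt{\text{width}/2}\), which compounds to astronomically large values; with He scaling the scale stays near its starting value.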

8.6 Absence of Interpretability

Why it is essential: Neural-net weights are learned representations, and humans cannot easily assign them meaning. This is a decisive weakness in medical diagnostics and other domains where the why matters.

How you spot it: A request to "explain this prediction" instantly hits the limit.

Next step: SHAP (Lundberg & Lee 2017), Integrated Gradients (Sundararajan et al. 2017), LIME (Ribeiro et al. 2016) as post-hoc explanation tools, or attention visualization (Part 8, Transformer).


What the Limits Sketch

The six limits of neural networks map onto the next six advances of modern deep learning:

  • Vanishing gradient → ResNet skip connections (Part 8)
  • Dead ReLU → GELU/Swish (Transformer era)
  • Overfitting → regularizers used in combination (Part 7)
  • Tabular weakness → TabNet, FT-Transformer
  • Init sensitivity → He/Xavier standardization (Part 7)
  • Lack of interpretability → SHAP, attention visualization (Part 8)

The limits of neural networks are the motivation for what came next.


9. Quick Recap — Answer Before You Peek

Q1. Why can't a single-layer perceptron solve XOR? Why does one hidden layer fix it?

Answer: XOR is not linearly separable. A single layer computes a linear combination of its inputs followed by an activation, so its decision boundary is a line (or hyperplane) and cannot separate the four points. A hidden layer applies learned nonlinear transformations that map the input space into a new space where linear separation works.

Why: Minsky & Papert (1969) formalized the limit; backprop (Rumelhart et al. 1986) is the direct answer. (Section 2.2.)

Q2. Why does sigmoid's derivative impose a depth limit on neural networks? How was it solved?

Answer: Sigmoid's derivative caps at 0.25, so with depth \(L\) the derivative factors shrink the gradient by up to \(0.25^L\) (vanishing gradient). ReLU's positive-side derivative is exactly 1, so gradients pass through.

Why: Nair & Hinton (2010) brought ReLU into the mainstream; AlexNet (2012) demonstrated the depth payoff. The activation comparison in Section 6.3 shows it on training curves. (Sections 6.3, 8.1.)

Q3. What does the one line loss.backward() do for you?

Answer: It walks the computation graph in reverse and applies the chain rule to compute gradients for every weight, using the cached \(h^{(\ell)}, z^{(\ell)}\) from the forward pass.

Why: Autodiff changed the field: writing the model as a formula is enough, with no manual \(\delta\) derivation. PyTorch (2017) lowered the barrier to entry. (Section 6.2.)

Q4. Universal approximation says "any function." Why stack depth?

Answer: Universal approximation is an existence result and says nothing about efficiency. A single hidden layer needs ~100 hidden units even for \(\sin\). Depth produces hierarchical abstractions, achieving the same accuracy with far fewer parameters.

Why: For nested structure like pixels → edges → parts → objects, depth is essential. The CNNs, RNNs, and Transformers of Part 8 all chose depth for this reason. (Section 6.4.)

Q5. A Transformer is known as an attention architecture, yet about 2/3 of its parameters are MLPs. How?

Answer: A two-layer MLP (the feed-forward block) is applied to every token, and roughly two-thirds of model parameters sit there. Attention routes information; MLPs do the semantic transformation.

Why: In modern deep learning, the minimum unit block is still an MLP. CNN convolutions are weight-shared local MLPs. Most of an LLM's size lives in feed-forward layers. (Section 7.4.)


If you got four or five answers easily, the why / how / where to of neural networks is in place.


10. Further Reading

Primary sources

  • Rosenblatt, F. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review 65(6), 386–408 (1958). — The perceptron paper.
  • Minsky, M., Papert, S. Perceptrons: An Introduction to Computational Geometry. MIT Press (1969). — Formalizing the single-layer limit.
  • Rumelhart, D. E., Hinton, G. E., Williams, R. J. Learning representations by back-propagating errors. Nature 323, 533–536 (1986). — Backprop rediscovered.
  • Cybenko, G. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems 2, 303–314 (1989). — Universal approximation.
  • Hornik, K. Approximation capabilities of multilayer feedforward networks. Neural Networks 4(2), 251–257 (1991). — Generalized universal approximation.
  • Nair, V., Hinton, G. E. Rectified Linear Units Improve Restricted Boltzmann Machines. ICML (2010). — Introduction of ReLU into the modern era.

Official docs

  • PyTorch nn.Linear: https://pytorch.org/docs/stable/generated/torch.nn.Linear.html
  • PyTorch autograd: https://pytorch.org/docs/stable/notes/autograd.html
  • nn.functional activations: https://pytorch.org/docs/stable/nn.functional.html

Companion books

  • Goodfellow, Bengio, Courville. Deep Learning. MIT Press 2016. Chapter 6.
  • Nielsen, M. Neural Networks and Deep Learning (free online, 2015).

In Part 7 we'll dive into deep learning training tricks: SGD / Momentum / Adam optimizers, Dropout / BatchNorm / LayerNorm, Xavier / He initialization, learning-rate schedules, and gradient clipping — the craft that actually makes a deep network converge.

