"ML Foundations (8/9) — Deep Learning Architectures: CNN, RNN, Attention"

What did we keep adding on top of an MLP? Vision went CNN, sequences went RNN/LSTM, and eventually both converged on Attention and the Transformer. Each architecture is easier to remember if you read it as what it gave up to gain something else. This part is the one-line summary of the decisive inflection points: LeNet → AlexNet → ResNet → LSTM → Attention → Transformer.

0. Learning Objectives

Explain why CNN's convolution, pooling, and weight sharing match images so well.
Trace what got unblocked from LeNet → AlexNet → VGG → ResNet as depth grew.
State BPTT for RNNs and the vanishing/exploding gradient problem.
Write the LSTM gate equations (forget, input, output) with intuition.
Write the attention formula in query/key/value form and the scaled dot-product variant.
State the precise reasons Transformers replaced RNNs (parallelism + long-range dependencies).

1. Key Summary

CNN: weight sharing + local receptive fields exploit images' spatial locality. LeNet (1998) → AlexNet (2012) → VGG (2014) → ResNet (2015).
ResNet's skip connection: $y = F(x) + x$ made networks of 100+ layers practically trainable.
RNN: passes order through a hidden state; vanilla RNNs fail at long-range dependencies due to vanishing gradients.
LSTM (1997): gates plus a cell state preserve long-range information; the standard sequence model in NLP until Transformers took over.
Attention (Bahdanau 2014): "what should I look at" becomes learnable. The scaled dot-product form later standardized by the Transformer is $\mathrm{softmax}(QK^\top / \sqrt{d_k})V$.
Transformer (Vaswani 2017): drops the RNN, keeps only attention. Parallelism + long-range dependency at once.

2. Intuition — Why a New Architecture Was Needed

2.1 Why an MLP Struggles with Images and Sequences

Images: 224×224×3 = 150,528 inputs. A single MLP hidden layer of 1024 units has about 154 million weights (150,528 × 1,024), and flattening the image discards its spatial arrangement entirely.
Sequences: variable-length inputs, but MLPs need a fixed shape.

The fix is to bake structural assumptions (inductive biases) into the architecture: spatial locality for images, temporal order for sequences.
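The arithmetic behind the image example can be checked in a few lines. This is only a sketch of the count for one fully connected layer; the variable names are illustrative and the fp32 size assumes 4 bytes per parameter.

```python
# Weight count for one fully connected layer on a 224x224 RGB image.
height, width, channels = 224, 224, 3
hidden_units = 1024  # hidden width used in the text

num_inputs = height * width * channels
num_weights = num_inputs * hidden_units  # one weight per (input, hidden unit) pair
num_biases = hidden_units

print(f"inputs:  {num_inputs:,}")     # 150,528
print(f"weights: {num_weights:,}")    # 154,140,672
print(f"fp32 size: {4 * (num_weights + num_biases) / 1e6:.0f} MB")
```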

2.2 The Inductive Bias Cheat Sheet

Data	Assumption	Architecture
Images	Locality + translation invariance	CNN
Time series	Sequential order + incremental information	RNN
Audio	Time + frequency locality	CNN + RNN (Conformer)
Text	Arbitrary-distance token dependencies	Attention / Transformer
Graphs	Message passing between nodes	GNN

3. Definitions — CNN Family

3.1 The Convolution Operation

2D convolution with input $X \in \mathbb{R}^{H \times W \times C}$ and kernel $K \in \mathbb{R}^{k_h \times k_w \times C \times C'}$:

$$ Y_{i,j,c'} = \sum_{u=0}^{k_h - 1}\sum_{v=0}^{k_w - 1}\sum_{c=0}^{C-1} K_{u,v,c,c'} \cdot X_{i+u, j+v, c} $$

Key properties:

Weight sharing: the same $K$ is reused across positions → far fewer parameters.
Translation equivariance: shifting the input shifts the output by the same amount.
Local receptive field: a single output pixel depends only on a $k_h \times k_w$ input region.
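The formula and the three properties can be made concrete with a direct, unoptimized implementation. The sketch below assumes NumPy, no padding, and stride 1; `conv2d` is a hypothetical helper written for this illustration, not a library function.

```python
import numpy as np

def conv2d(x, kernel):
    """Valid (no padding), stride-1 convolution following the formula above.

    x:      (H, W, C) input
    kernel: (k_h, k_w, C, C_out) weights, shared across all positions
    returns (H - k_h + 1, W - k_w + 1, C_out)
    """
    H, W, C = x.shape
    k_h, k_w, C_in, C_out = kernel.shape
    assert C == C_in, "channel mismatch"
    out = np.zeros((H - k_h + 1, W - k_w + 1, C_out))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i:i + k_h, j:j + k_w, :]  # local receptive field
            # Sum over u, v, c for every output channel c'.
            out[i, j, :] = np.tensordot(patch, kernel, axes=([0, 1, 2], [0, 1, 2]))
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 3))
kernel = rng.normal(size=(3, 3, 3, 4))  # 108 weights, regardless of image size
y = conv2d(x, kernel)

# Translation equivariance: shift the input down by one row and the
# output shifts down by one row (compared away from the wrapped border).
x_shifted = np.roll(x, shift=1, axis=0)
y_shifted = conv2d(x_shifted, kernel)
print(y.shape)                             # (6, 6, 4)
print(np.allclose(y_shifted[1:], y[:-1]))  # True
```

The kernel has the same 108 weights whether the image is 8×8 or 800×800, which is the weight-sharing saving in miniature.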

3.2 Pooling

Take max or mean over a small window. Reduces resolution and adds a little translation invariance.
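A minimal sketch of max pooling on a single-channel feature map, assuming NumPy and non-overlapping windows (`max_pool2d` is a hypothetical helper for illustration):

```python
import numpy as np

def max_pool2d(x, size=2):
    """Non-overlapping max pooling over size x size windows on an (H, W) map."""
    H, W = x.shape
    H_out, W_out = H // size, W // size
    # Crop any remainder, then expose each window as its own pair of axes.
    windows = x[:H_out * size, :W_out * size].reshape(H_out, size, W_out, size)
    return windows.max(axis=(1, 3))

feature_map = np.array([[1, 2, 0, 0],
                        [3, 4, 0, 0],
                        [0, 0, 5, 6],
                        [0, 0, 7, 8]])
print(max_pool2d(feature_map))
# [[4 0]
#  [0 8]]
```

Moving the 4 anywhere inside its 2×2 window leaves the output unchanged, which is the small amount of translation invariance pooling provides.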

3.3 The ResNet Block (He et al. 2015)

$$ y = F(x; W) + x $$

$F$ is two or three conv + activation layers. The skip connection creates a gradient highway that finally made very deep networks trainable. The same idea is reused inside every Transformer sublayer.

3.4 CNN Lineage in One Line Each

Model	Year	Key idea
LeNet-5	1998	The original CNN (LeCun)
AlexNet	2012	GPU + ReLU + Dropout; ImageNet winner that lit deep learning's fuse
VGG	2014	A simple deep stack of 3×3 convolutions
GoogLeNet/Inception	2014	1×1 conv for channel reduction, multi-scale
ResNet	2015	Skip connections at depth 100+
DenseNet	2017	Connect every layer to every later layer
EfficientNet	2019	Jointly scale depth, width, and resolution
ConvNeXt	2022	A modern CNN that re-thought design through the ViT lens

4. Definitions — RNN / LSTM

4.1 Vanilla RNN

$$ h_t = \tanh(W_x x_t + W_h h_{t-1} + b) $$

Training is BPTT (Backpropagation Through Time): unroll the network in time and apply standard backprop. A gradient sent back $t$ steps is multiplied by $W_h$ (and by the $\tanh$ derivative) $t$ times, so it tends to vanish or explode as the sequence grows.
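The repeated multiplication can be isolated numerically. The sketch below assumes NumPy and a deliberately simplified setting: $W_h$ is a scaled orthogonal matrix, and the $\tanh$ derivative (at most 1) is left out, so the only effect shown is the power of $W_h$. The function name is illustrative.

```python
import numpy as np

def recurrent_gradient_norm(scale, steps, dim=8, seed=0):
    """Spectral norm of W_h^steps, where W_h = scale * (orthogonal matrix).

    This is the factor applied to a gradient sent back `steps` time steps
    through a linear recurrence. Including the tanh derivative could only
    shrink the result further.
    """
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))  # orthogonal: singular values are all 1
    W_h = scale * q
    return np.linalg.norm(np.linalg.matrix_power(W_h, steps), ord=2)

for scale in (0.9, 1.0, 1.1):
    print(scale, recurrent_gradient_norm(scale, steps=100))
# 0.9 gives about 2.7e-05 (vanishing), 1.0 stays at 1, 1.1 gives about 1.4e+04 (exploding)
```

A 10% change in the scale of $W_h$ moves the 100-step gradient by roughly nine orders of magnitude, which is why vanilla RNN training is so sensitive.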

4.2 LSTM (Hochreiter & Schmidhuber 1997)

Gates plus a cell state preserve long-range information.

$$ \begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) \quad \text{(forget gate)} \\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) \quad \text{(input gate)} \\
\tilde{c}_t &= \tanh(W_c [h_{t-1}, x_t] + b_c) \quad \text{(candidate)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) \quad \text{(output gate)} \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned} $$

The forget gate, valued in (0, 1), tunes how much past information to keep. The gradient now flows through the cell state by addition, which dramatically mitigates vanishing.
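The gate equations translate almost line for line into code. This is a minimal single-step sketch assuming NumPy; storing the weights in dictionaries keyed by gate name is a choice made for this illustration, not how any particular library lays them out.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step.

    W maps each name ("f", "i", "c", "o") to a (hidden, hidden + input) matrix
    and b maps it to a (hidden,) bias, matching W_f, W_i, W_c, W_o above.
    """
    z = np.concatenate([h_prev, x_t])       # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])      # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])      # input gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])  # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde      # additive cell update
    o_t = sigmoid(W["o"] @ z + b["o"])      # output gate
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

hidden, inputs = 4, 3
rng = np.random.default_rng(0)
W = {g: rng.normal(scale=0.1, size=(hidden, hidden + inputs)) for g in "fico"}
b = {g: np.zeros(hidden) for g in "fico"}

# Push the forget gate toward 1 and the input gate toward 0:
# the cell state should then be carried forward almost unchanged.
b["f"] += 20.0
b["i"] -= 20.0

c_prev = rng.normal(size=hidden)
h_t, c_t = lstm_step(rng.normal(size=inputs), np.zeros(hidden), c_prev, W, b)
print(np.allclose(c_t, c_prev, atol=1e-6))  # True
```

With the forget gate saturated near 1, the cell state passes through the step essentially untouched, which is the mechanism that lets information and gradients survive many steps.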

GRU (Cho et al. 2014) is a simplified LSTM with just two gates — similar quality with fewer parameters.

4.3 Bidirectional / Stacked

Bidirectional RNN: concatenate forward and backward hidden states to see both sides of context.
Stacked RNN: layer multiple RNN passes vertically.

5. Definitions — Attention / Transformer

5.1 Scaled Dot-Product Attention

Given queries $Q$, keys $K$, values $V$:

$$ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\Big(\frac{Q K^\top}{\sqrt{d_k}}\Big) V $$

Intuition: for each query, output a weighted average of values, weighted by query-key similarity. Any two positions can refer directly to one another — no sequential journey required.
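The formula is short enough to write out directly. A minimal sketch assuming NumPy, unbatched 2D inputs, and no masking; the function names are illustrative.

```python
import numpy as np

def softmax(scores, axis=-1):
    scores = scores - scores.max(axis=axis, keepdims=True)  # numerical stability
    exp = np.exp(scores)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (L_q, d_k), K: (L_k, d_k), V: (L_k, d_v).

    Returns the output (L_q, d_v) and the attention weights (L_q, L_k).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # query-key similarity
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 16))
K = rng.normal(size=(7, 16))
V = rng.normal(size=(7, 32))
out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape, weights.shape)                # (5, 32) (5, 7)
print(np.allclose(weights.sum(axis=-1), 1.0))  # True
```

Each output row is a convex combination of the rows of `V`; if all keys were identical, every query would receive the plain average of the values.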

The $\sqrt{d_k}$ divisor counteracts the fact that the dot-product variance scales with $d_k$, preventing softmax from collapsing onto a single position.
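The variance claim is easy to check empirically. The sketch below assumes NumPy and that query and key components are independent with mean 0 and variance 1, which is an idealization chosen for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
num_pairs = 20_000

# Query and key components drawn i.i.d. with mean 0 and variance 1.
q = rng.normal(size=(num_pairs, d_k))
k = rng.normal(size=(num_pairs, d_k))

dots = np.sum(q * k, axis=1)
raw_var = dots.var()
scaled_var = (dots / np.sqrt(d_k)).var()
print(raw_var)     # close to d_k = 64
print(scaled_var)  # close to 1
```

Dividing by $\sqrt{d_k}$ brings the scores back to roughly unit variance, so the softmax stays in a range where its gradients are not negligible.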

5.2 Multi-Head Attention

Run $h$ attentions in parallel:

$$ \mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V) $$

$$ \mathrm{MHA}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h) W^O $$

Each head learns a different representation subspace.
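A compact sketch of the two equations, assuming NumPy and the self-attention case where queries, keys, and values all come from the same input `X`. The per-head projection matrices are stacked along a leading axis, which is a layout chosen for this illustration.

```python
import numpy as np

def softmax(scores, axis=-1):
    scores = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o):
    """Self-attention with h heads.

    X: (L, d_model); W_q, W_k, W_v: (h, d_model, d_head); W_o: (h * d_head, d_model).
    """
    heads = [attention(X @ W_q[i], X @ W_k[i], X @ W_v[i])
             for i in range(W_q.shape[0])]
    return np.concatenate(heads, axis=-1) @ W_o  # Concat(head_1..head_h) W^O

L, d_model, h = 6, 32, 4
d_head = d_model // h
rng = np.random.default_rng(0)
X = rng.normal(size=(L, d_model))
W_q, W_k, W_v = (rng.normal(scale=d_model ** -0.5, size=(h, d_model, d_head))
                 for _ in range(3))
W_o = rng.normal(scale=d_model ** -0.5, size=(h * d_head, d_model))

out = multi_head_attention(X, W_q, W_k, W_v, W_o)
print(out.shape)  # (6, 32)
```

Splitting `d_model` into `h` heads of width `d_model // h` keeps the total cost close to that of a single full-width head, while letting each head form its own attention pattern.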

5.3 The Transformer Block (Vaswani 2017)

$$ \begin{aligned}
y_1 &= \mathrm{LayerNorm}(x + \mathrm{MHA}(x)) \\
y_2 &= \mathrm{LayerNorm}(y_1 + \mathrm{FFN}(y_1))
\end{aligned} $$

MHA + residual + LayerNorm
FFN: a 2-layer MLP (hidden width is typically 4 × the model dimension) with ReLU in the original paper and GELU in BERT and GPT
Positional encoding: absolute (Sinusoidal) or relative (RoPE, ALiBi)

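Stack $N$ of these blocks and you have the body of GPT, BERT, LLaMA, and most modern LLMs. Many later models move the normalization in front of each sublayer (pre-norm) or swap LayerNorm for RMSNorm, but the residual skeleton is the same.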
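The two block equations can be sketched end to end. This is a simplified illustration assuming NumPy: single-head self-attention stands in for MHA, LayerNorm has no learned gain or bias, the FFN uses ReLU, and the parameter dictionary `p` with its key names is hypothetical.

```python
import numpy as np

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def self_attention(x, W_q, W_k, W_v, W_o):
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return weights @ V @ W_o

def ffn(x, W1, b1, W2, b2):
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2  # 2-layer MLP with ReLU

def transformer_block(x, p):
    # y1 = LayerNorm(x + Attention(x)); y2 = LayerNorm(y1 + FFN(y1))
    y1 = layer_norm(x + self_attention(x, p["W_q"], p["W_k"], p["W_v"], p["W_o"]))
    y2 = layer_norm(y1 + ffn(y1, p["W1"], p["b1"], p["W2"], p["b2"]))
    return y2

L, d_model = 5, 16
rng = np.random.default_rng(0)

def init(*shape):
    return rng.normal(scale=shape[0] ** -0.5, size=shape)

p = {
    "W_q": init(d_model, d_model), "W_k": init(d_model, d_model),
    "W_v": init(d_model, d_model), "W_o": init(d_model, d_model),
    "W1": init(d_model, 4 * d_model), "b1": np.zeros(4 * d_model),  # hidden = 4 x d_model
    "W2": init(4 * d_model, d_model), "b2": np.zeros(d_model),
}
x = rng.normal(size=(L, d_model))
y = transformer_block(x, p)
print(y.shape)  # (5, 16)
```

Because the output has the same shape as the input, blocks can be stacked by feeding `y` into the next block with its own parameters.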

5.4 Why It Replaced the RNN

Aspect	RNN/LSTM	Transformer
Parallelism	Sequential in time (slow)	Tokens processed in parallel (fast)
Long-range dependency	Vanishing gradient	One-hop attention
Expressivity	Partly via gates	Rich via multi-head
Memory	$O(L)$	$O(L^2)$ — a weakness on very long sequences

On modern GPUs, parallelism decided the contest. RNNs have a sequential dependency in the time axis that throttles GPU utilization; Transformers can compute all tokens at once.

6. Diagram

ResBlock:
  x --+-- conv -- relu -- conv --(+)-- relu -- y
      |                           |
      +---------------------------+

Transformer block:
  x --+-- MHA --(+)-- LN --+-- FFN --(+)-- LN -- y
      |          |         |          |
      +----------+         +----------+

Both architectures share the skip-connection idea.

7. Principle Walkthrough — The Shared Foundation of CNN, RNN, and Attention

Reading each architecture as code is less productive than seeing what kind of information flow each one assumes. CNN = spatial locality, RNN = temporal order, Attention = direct connection between arbitrary positions.

7.1 CNN — Spatial Locality and Weight Sharing in One Line

nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

This single line assumes:
- Spatial locality: a pixel's meaning depends on its 3×3 neighborhood; far-away pixels exert no direct influence.
- Weight sharing: the same filter applies across the whole image. A "cat ear" detector is the same wherever the ear sits.
- Translation equivariance: shift the input, and the output shifts by the same amount.

Parameter savings: as a rough rule of thumb, 10–100× fewer parameters than an MLP of comparable capability, and the gap for a single layer can be far larger. This is why CNNs train well even on small data.
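To see where the savings come from, the count for one convolution can be compared with a dense layer that maps the same input to the same output shape. This is a sketch with hypothetical sizes (a 32×32 RGB input and 16 output channels), and the helper names are illustrative.

```python
def conv2d_params(in_ch, out_ch, kernel_size):
    """Weights plus biases of a square-kernel 2D convolution."""
    return kernel_size * kernel_size * in_ch * out_ch + out_ch

def dense_params(num_inputs, num_outputs):
    """Weights plus biases of a fully connected layer."""
    return num_inputs * num_outputs + num_outputs

H = W = 32  # hypothetical small image
in_ch, out_ch = 3, 16

conv = conv2d_params(in_ch, out_ch, kernel_size=3)
# A dense layer producing the same H x W x out_ch output from the same input.
dense = dense_params(H * W * in_ch, H * W * out_ch)

print(conv)           # 448
print(dense)          # 50348032
print(dense // conv)  # 112384
```

The convolution's count does not depend on `H` or `W` at all, while the dense layer's count grows with the square of the number of pixels.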

7.2 ResNet's Skip Connection — Why One Line Made 100-Layer Networks Possible

return F.relu(out + x)        # one line that changed everything

What you see: Add the input directly to the block's output. The block computes $F(x) + x$.

Why it broke the depth ceiling:
- The gradient always has a bypass route through the block. Even if $\nabla F$ shrinks, the $+x$ path lets it pass, which directly counters vanishing gradients.
- If the block learns zero, then $F(x) = 0$ and the block passes $x$ through, so the identity is the default. In principle a deeper network can always match a shallower one, so adding blocks need not make the solution worse.
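The bypass argument can be illustrated with Jacobians. The sketch below assumes NumPy and replaces each block's $F$ with a small random linear map, which is a simplification: a plain block then has Jacobian $W$, while a residual block has Jacobian $W + I$.

```python
import numpy as np

def end_to_end_jacobian(weights, residual):
    """Jacobian of a stack of linear blocks, with or without skip connections.

    Each block is F(x) = W x (a linear stand-in for the conv layers).
    Plain block:    x -> W x        Jacobian W
    Residual block: x -> W x + x    Jacobian W + I
    """
    dim = weights[0].shape[0]
    J = np.eye(dim)
    for W in weights:
        J = (W + np.eye(dim) if residual else W) @ J
    return J

rng = np.random.default_rng(0)
dim, depth = 16, 50
weights = [rng.normal(scale=0.1 / np.sqrt(dim), size=(dim, dim))
           for _ in range(depth)]

plain = np.linalg.norm(end_to_end_jacobian(weights, residual=False), ord=2)
skip = np.linalg.norm(end_to_end_jacobian(weights, residual=True), ord=2)
print(f"plain: {plain:.2e}   residual: {skip:.2e}")
```

With the same small weights, the plain stack's Jacobian collapses toward zero after 50 blocks, while the residual stack's stays on the order of 1 because every block contributes an identity term.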

Forward link: This pattern carries directly into the Transformer's residual + LayerNorm structure. Every Transformer block follows the ResNet pattern.

7.3 The Essence of an RNN — Recurrence Encodes Time Dependence

h_t = tanh(W_x @ x_t + W_h @ h_prev + b)    # h_prev holds h_{t-1}

The key:
- At every time step $t$, the same weights $W_x, W_h$ are reused: weight sharing applied to the time axis.
- The hidden state $h_{t-1}$ compresses past information and passes it forward.

What LSTM unlocked: In a vanilla RNN the backpropagated gradient picks up a factor of $W_h$ at every step, roughly $W_h$ raised to the power of $t$; past a length of roughly 100, gradients vanish or explode. LSTM (Hochreiter & Schmidhuber 1997) introduces a cell state and three gates (forget, input, output) to control information flow explicitly: a forget gate near 1 lets information flow through intact.

Limits: Sequential computation prevents parallelism across time steps on a GPU. Even LSTMs tend to fade on dependencies longer than roughly 1,000 tokens. These two limits motivate the Transformer.
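The recurrence and its sequential bottleneck are both visible in a short forward pass. A minimal sketch assuming NumPy; `rnn_forward` is a hypothetical helper that returns only the final hidden state.

```python
import numpy as np

def rnn_forward(xs, W_x, W_h, b):
    """Run a vanilla RNN over xs of shape (T, input_dim); return the final hidden state."""
    h = np.zeros(W_h.shape[0])
    for x_t in xs:  # strictly sequential: step t needs h from step t-1
        h = np.tanh(W_x @ x_t + W_h @ h + b)
    return h

input_dim, hidden_dim = 3, 5
rng = np.random.default_rng(0)
W_x = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
W_h = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

short_seq = rng.normal(size=(4, input_dim))
long_seq = rng.normal(size=(40, input_dim))
print(rnn_forward(short_seq, W_x, W_h, b).shape)  # (5,)
print(rnn_forward(long_seq, W_x, W_h, b).shape)   # (5,)
```

The same three parameter arrays handle sequences of any length, but the loop cannot be vectorized over time, and a 40-step history is squeezed into the same 5 numbers as a 4-step one.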

7.4 Attention — Every Position Directly Connected to Every Position

$$ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\Big(\frac{Q K^{\top}}{\sqrt{d_k}}\Big) V $$

The essence:
- $Q K^{\top}$: similarity between the query and every key. Each token measures its relevance to every other token.
- softmax: weights sum to 1.
- $\sqrt{d_k}$ scaling: prevents dot products from blowing up in high dimensions.
- $\cdot V$: collect values as a weighted sum.

Why it replaced RNNs:
- Parallelization: every position computed in a single matrix multiplication. No sequential dependency.
- Long-range dependency: token 1 and token 1,000 are one step apart.
- Capacity: at $O(L^2)$ cost, every pair is examined.

Multi-Head Attention: The same input is projected into several subspaces, attention runs in parallel on each, and the results are concatenated. Different kinds of relationships (syntactic, semantic, positional) are learned simultaneously.

Forward link: LLM Core Study Part 1 covers the full Transformer block (attention + FFN + residual + LayerNorm). The attention covered here in Part 8 is the minimum unit.

8. Limits & Failure Modes — Architecture Limits Drive the Next Advance

8.1 CNN Limits

Why they are intrinsic:
- Global context is weak: stacking 3×3 kernels grows the receptive field only gradually with depth, so distant regions of a 1,000×1,000 image interact only after many layers.
- Absolute position is weak: translation equivariance is the limit (e.g., telling "first line of text" from "last line").

How you spot it: Image classification works, but tasks that need global or positional reasoning break down: scene-graph understanding, locating text in OCR, or pinpointing where a finding sits in a medical image.

Next step:
- Vision Transformer (Dosovitskiy 2020): attention for global context.
- CoAtNet (Dai 2021): CNN + attention hybrids.
- ConvNeXt (Liu 2022): design patterns reimported from Transformers.

8.2 RNN/LSTM Limits

Why they are intrinsic:
- Sequential compute: time step $t$ depends on $t-1$, so the time axis cannot be parallelized on a GPU.
- Gradient flow: very long dependencies (1,000+ tokens) defeat even LSTMs.
- Fixed-size memory: hidden state size is independent of sequence length, which creates an information bottleneck.

How you spot it: On long documents, code, or time series, performance degrades toward the end of the sequence, where early context has faded. Training time grows linearly with sequence length and cannot be parallelized across time steps.

Next step:
- Transformer (2017): eliminates sequential dependency; computes every position simultaneously.
- Mamba (Gu & Dao 2023): selective SSM that achieves linear time + long context.

8.3 Transformer Limits

Why they are intrinsic:
- $O(L^2)$ cost: memory and compute are quadratic in context length $L$. At 100K tokens that is $10^{10}$ dot products per head per layer.
- Missing positional information: self-attention by itself is permutation-equivariant (reordering the tokens just reorders the outputs), so explicit positional encoding (Sinusoidal, RoPE, ALiBi) is required.
- Scaling-law dependence: with small data, CNNs/RNNs can win. Weak inductive bias means all structure must be learned from data.
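The quadratic cost is easiest to appreciate as plain arithmetic. The sketch below counts the entries of one $L \times L$ score matrix for a single head in a single layer; the 2 bytes per score assumes fp16 or bf16 storage, which is an assumption for illustration.

```python
def attention_matrix_cost(seq_len, bytes_per_score=2):
    """Entries and bytes in one L x L score matrix (one head, one layer)."""
    entries = seq_len * seq_len
    return entries, entries * bytes_per_score

for seq_len in (1_000, 10_000, 100_000):
    entries, num_bytes = attention_matrix_cost(seq_len)
    print(f"L={seq_len:>7,}: {entries:.0e} scores, {num_bytes / 1e9:,.3f} GB")
# L=100,000 gives 1e+10 scores, i.e. 20 GB for a single materialized score matrix.
```

Multiplying the context length by 10 multiplies this cost by 100, which is why long-context models avoid materializing the full matrix.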

How you spot it: OOM as context length grows; poor generalization on small datasets.

Next step:
- Linear Attention (Katharopoulos 2020): $O(L)$ approximation.
- Flash Attention (Dao 2022): exact attention that avoids materializing the $L \times L$ matrix, so memory is linear in $L$ while compute stays $O(L^2)$.
- Mamba / RWKV: SSM/RNN-style with $O(L)$ cost.
- Sliding Window Attention (Mistral): local attention pattern.
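As one concrete instance of a local attention pattern, a sliding-window mask can be sketched as follows. This assumes NumPy and a causal window rule (each query sees itself and the `window - 1` tokens before it); the exact rule used by any specific model is not stated in the text, so treat the rule and the function name as assumptions.

```python
import numpy as np

def sliding_window_causal_mask(seq_len, window):
    """True where query i may attend to key j: j <= i and i - j < window."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (i - j < window)

mask = sliding_window_causal_mask(seq_len=6, window=3)
print(mask.astype(int))
print(mask.sum())  # 15 allowed pairs, versus 21 for a full causal mask
```

The number of allowed pairs grows as roughly `seq_len * window` rather than `seq_len ** 2`, which is where the savings come from.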

8.4 Common to All

Why they are intrinsic:
- Dominance of pretrain + fine-tune: training from scratch is increasingly rare. Which pretrained model you start from matters more than which architecture you pick.
- Compute & data primacy: Sutton 2019, The Bitter Lesson. How much data and compute you can wield matters more than architectural cleverness.
- Interpretability: any deep model struggles to explain why it predicted what it did.

Next step:
- Part 9 (PyTorch / TensorFlow + local LLMs): the standard practice of adopting pretrained models.
- The LLM Core Study series: detailed structure of Transformer and attention.

What the Limits Sketch

The limits of the three architectures map precisely onto the motivations for what came next:
- CNN global-context limit → ViT
- RNN sequential dependency → Transformer
- Transformer $O(L^2)$ → Mamba / Flash Attention / Linear Attention
- All → dominance of pretrain + fine-tune (Part 9)

9. Quick Recap — Answer Before You Peek

Q1. Why does a CNN have 10–100× fewer parameters than an equally expressive MLP?

Answer: Weight sharing. The same 3×3 kernel applies across the whole image, instead of separate weights at every position.
Why: This inductive bias is exactly why CNNs train well on small data. But global context remains weak, which is the gap that ViT closes. (Sections 7.1, 8.1.)

Q2. What mechanism makes ResNet's `out + x` one-liner enable 100-layer training?

Answer: Gradients always have a direct bypass around the block. Even when $\nabla F$ shrinks, the $+x$ path passes the gradient through.
Why: This attacks vanishing gradients head-on. If a block learns zero, the identity is the default, so in principle adding more blocks need not make the solution worse. The same pattern recurs in every Transformer block. (Section 7.2.)

Q3. The single most decisive reason Transformers replaced RNNs?

Answer: Parallelization. RNNs require $h_t$ to depend on $h_{t-1}$, and that sequential computation limits GPU utilization. Attention computes every position in one matrix multiply.
Why: This is the core insight of "Attention Is All You Need" (Vaswani 2017). In addition, long-range dependencies connect in one step. (Sections 7.3, 7.4, 8.2.)

Q4. Three variants designed to break Transformer's $O(L^2)$ cost?

Answer: (a) Linear Attention: $O(L)$ approximation. (b) Flash Attention: exact attention with memory linear in $L$ (compute stays quadratic). (c) Mamba/RWKV: SSM/RNN-style at $O(L)$.
Why: 100K-token contexts mean $10^{10}$ dot products, which is OOM territory. Different use cases adopt different variants; Mistral uses sliding window attention. (Section 8.3.)

Q5. Under the Bitter Lesson (Sutton 2019), what's the real architecture-selection truth?

Answer: How much data and how much compute dominate over which architecture. With pretrain + fine-tune as the standard, training from scratch is increasingly rare.
Why: Part 9 (using pretrained models) is the modern workflow. The era of choosing a model is being replaced by the era of choosing a source. (Section 8.4.)

If you got four or five answers easily, the shared foundation and evolution of CNN, RNN, and attention is in place.

10. Further Reading

Primary sources

LeCun, Y. et al. Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998). — LeNet.
Krizhevsky, A., Sutskever, I., Hinton, G. E. ImageNet Classification with Deep Convolutional Neural Networks. NeurIPS 2012. — AlexNet.
Simonyan, K., Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR 2015. arXiv:1409.1556 — VGG.
He, K., Zhang, X., Ren, S., Sun, J. Deep Residual Learning for Image Recognition. CVPR 2016. arXiv:1512.03385 — ResNet.
Hochreiter, S., Schmidhuber, J. Long Short-Term Memory. Neural Computation 9(8), 1735–1780 (1997). — LSTM.
Cho, K. et al. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. EMNLP 2014. arXiv:1406.1078 — GRU.
Bahdanau, D., Cho, K., Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. ICLR 2015. arXiv:1409.0473 — Attention.
Vaswani, A. et al. Attention Is All You Need. NeurIPS 2017. arXiv:1706.03762 — Transformer.

Official docs

PyTorch Conv2d: https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html
PyTorch LSTM: https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html
PyTorch MultiheadAttention: https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html
torchvision models: https://pytorch.org/vision/stable/models.html

Companion books

Goodfellow, Bengio, Courville. Deep Learning. Chapters 9, 10, 12.
Dive into Deep Learning (d2l.ai): a free, hands-on textbook.

In the final Part 9, we'll tackle PyTorch vs TensorFlow, and local LLMs: comparing the two frameworks' philosophies and ecosystems, then tracing how transformers, llama.cpp, MLX, and Ollama act as the bridge to running an LLM on your own machine.
