"LLM Core Study (1/6) — Fundamentals: Tokenization, Embeddings, Attention, Positional Encoding"

The path string → tokens → embeddings → attention → output representation is what one transformer block actually does. By the end of this lecture you can answer in one line: why \(\sqrt{d_k}\), what RoPE rotates, why Korean costs more tokens than English.


0. Learning Objectives

  • Simulate the BPE training and inference loops on paper.
  • State the shape of the embedding matrix, how it is initialised, and what weight tying means.
  • Trace shapes through scaled dot-product attention and reproduce it in ~30 lines of PyTorch.
  • Explain what changes between a single-head and a multi-head attention, and where the cost grows.
  • Distinguish Sinusoidal, RoPE, and ALiBi by where they inject positional information.
  • Explain why standard attention scales as \(O(N^2)\) in sequence length \(N\), and why "Lost in the Middle" happens.

1. 핵심 요약

  • Transformer inputs are integer ID sequences, not strings. The tokenizer cuts the string.
  • An embedding is a learnable lookup table \(\mathbb{Z} \to \mathbb{R}^{d_{\text{model}}}\).
  • Self-Attention is one line: \(\text{softmax}(QK^\top / \sqrt{d_k}) V\). Multi-head runs the same in parallel sub-spaces.
  • Positional information is injected by adding to embeddings (Sinusoidal), rotating Q and K (RoPE), or biasing attention scores by distance (ALiBi).
  • Standard attention is \(O(N^2)\), and very long contexts suffer from middle-position recall loss.

2. Intuition: What does one transformer block do?

A self-supervised language model plays one game: given the tokens seen so far, output a probability distribution over the next token. Playing this well requires the model to organise the relationships between previously seen tokens. RNNs accumulated those relationships through time order; transformers let tokens exchange weighted messages directly regardless of position (Vaswani et al., 2017). Self-attention is the quantitative definition of that exchange.

A raw string cannot enter the model. Tokenisation breaks it into integer IDs; embedding maps each ID to a \(d\)-dimensional coordinate; one attention block rearranges those coordinates by content-similar weighting; an MLP transforms each coordinate further; residual connections and normalisation keep the gradient alive. Stack \(L\) such blocks and you have a model. This lecture covers only the input representation + attention + positions part. Optimisation dynamics live in Lecture 5.


3. Tokenisation

3.1 Definition

Tokenisation is a function \(T: \Sigma^* \to \mathbb{Z}^*\). The vocabulary \(V = \{0, \dots, |V|-1\}\) is fixed at tokenizer-training time, and every downstream weight (embedding matrix, output head) depends on \(|V|\).

Vocabulary sizes in the wild:

  • GPT-2 / GPT-3: BPE, \(|V| = 50{,}257\).
  • LLaMA family: SentencePiece BPE, \(|V| = 32{,}000\) (LLaMA 1·2), \(128{,}000\) (LLaMA 3).
  • Gemma: SentencePiece, \(|V| = 256{,}000\) (multilingual-heavy).

3.2 Byte-Pair Encoding (BPE)

BPE originated as a compression algorithm and was repurposed for NMT by Sennrich et al., 2016. The training loop merges the most frequent adjacent symbol pair into a new symbol until the vocabulary reaches the target size.

Input: character-tokenised corpus, e.g. low, lower, newest, widest
Init vocab: {l, o, w, e, r, n, s, t, i, d, </w>}
Repeat:
  1) count adjacent symbol pairs.
  2) merge the most frequent pair (X, Y) into a new symbol XY.
  3) substitute (X, Y) -> XY everywhere in the corpus.
Stop when vocab reaches the target size.

The first few merges typically look like:

('e','s'): 9   →  merge into 'es'
('es','t'): 9 →  merge into 'est'
('l','o'): 7  →  merge into 'lo'
('lo','w'): 7 →  merge into 'low'

Training output is the ordered list of merge rules; encoding new strings applies the same rules in the same order.

3.3 Encoding new strings

A greedy algorithm that prefers earlier-learned merges:

def bpe_encode(text, merges, vocab):
    tokens = list(text)  # start at character granularity
    while True:
        pairs = [(tokens[i], tokens[i+1]) for i in range(len(tokens) - 1)]
        best = None
        best_rank = float('inf')
        for i, p in enumerate(pairs):
            if p in merges and merges.index(p) < best_rank:
                best_rank = merges.index(p)
                best = (i, p)
        if best is None:
            break
        i, (a, b) = best
        tokens = tokens[:i] + [a + b] + tokens[i + 2:]
    return [vocab[t] for t in tokens]

Real implementations use heap structures for \(O(N \log N)\) but the spirit is identical: frequent patterns become single tokens, rare patterns split into sub-words.

3.4 WordPiece vs SentencePiece vs Byte-level BPE

Algorithm Training criterion Whitespace Models
BPE (Sennrich 2016) Max-frequency merge Auxiliary </w> GPT-2
WordPiece (Schuster 2012) Max-likelihood merge Word boundary BERT
SentencePiece (Kudo 2018) Unigram LM or BPE on raw text Preserved via LLaMA, T5
Byte-level BPE BPE on UTF-8 bytes n/a GPT-2 onwards

Byte-level BPE never sees an OOV: every Unicode input is bytes 0–255 at worst. The cost is that non-Latin scripts (CJK in particular) tokenise into many more tokens than English. SentencePiece keeps explicit whitespace via and trains directly on raw text, which generalises well across languages and code.

3.5 Why Korean costs more tokens than English

A multilingual BPE vocabulary devotes its budget to frequent patterns. Frequent English fragments ("the", "ing", "tion") fit into single tokens; Korean syllable blocks rarely fit and fall back to byte or sub-syllable splits. As a result, the same meaning takes roughly 1.5–2.5× more tokens in Korean than in English on the GPT-4 tokenizer. Korean-specific tokenizers (Polyglot-ko, KoGPT) close that gap at the cost of weaker English coverage.

3.6 Diagram

diagram-1

3.7 Limits & failure modes

  • Arithmetic. "12345" splitting as "1", "23", "45" breaks numeric reasoning. Recent models pre-tokenise digits 0–9 individually.
  • Named entities. Out-of-vocabulary names fragment into sub-syllables and become hard for the model to track consistently.
  • Cost asymmetry. Pricing and context budgeting based on English token counts under-estimate the real cost for Korean, Japanese, and other non-Latin corpora by up to 2×.

3.8 Practice tasks

  1. Apply BPE for five merges by hand to the corpus ["banana", "ban", "bandit", "anchor"] and record the vocabulary.
  2. With tiktoken (pip install tiktoken), compare English and Korean token counts for the same sentence in o200k_base.
  3. Measure how many tokens 12,345,678 produces versus 12345678 in GPT-4's tokenizer.

4. Embeddings

4.1 Definition

For token ID \(t \in \{0, \dots, |V|-1\}\),

$$ E: t \mapsto E[t, :] \in \mathbb{R}^{d_{\text{model}}}. $$

The embedding matrix \(E \in \mathbb{R}^{|V| \times d_{\text{model}}}\) is a learnable parameter. For \(|V| = 32{,}000\) and \(d_{\text{model}} = 4096\) the table alone holds \(1.3 \times 10^8\) parameters — about 1.9 % of a 7 B model.

4.2 Initialisation

Standard transformer training uses \(\mathcal{N}(0, \sigma^2)\) with \(\sigma = 0.02\). Larger initial variance combined with the \(\sqrt{d_{\text{model}}}\) scaling factor leads to early-training gradient blow-up, which is the He et al., 2015 perspective applied at the embedding boundary.

4.3 Weight tying

The final logits are \(h_T W_{\text{out}}^\top + b\). Press & Wolf, 2017 showed that sharing the embedding matrix and the output projection saves parameters and improves perplexity.

self.embedding = nn.Embedding(vocab_size, d_model)
self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
self.lm_head.weight = self.embedding.weight  # tied

GPT-2 and T5 tie weights. Some recent models (LLaMA 3) keep them separate to gain stability and inference flexibility at the cost of more parameters.

4.4 Geometry of the embedding space

Embeddings do not learn "meaning" by design; they learn coordinates useful for next-token prediction. Word2Vec-era analogies like "king − man + woman ≈ queen" only partially hold for LLM embeddings because tokens are sub-words and the loss is sequence-level. For semantic search, contextual representations (mean-pooled hidden states or [EOS] hidden state) are the standard.

4.5 Practice

  1. Compute by hand the parameter count of nn.Embedding(50257, 1024) and what fraction of a 7 B model that is with and without tying.
  2. Verify that two tokenisations of the same string produce identical embedding lookups (lookup is deterministic).

5. Self-Attention

5.1 The single line

Given input embeddings \(X \in \mathbb{R}^{N \times d_{\text{model}}}\) and learnable matrices \(W^Q, W^K, W^V \in \mathbb{R}^{d_{\text{model}} \times d_k}\),

$$ Q = X W^Q,\quad K = X W^K,\quad V = X W^V, $$

scaled dot-product attention is

$$ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left( \frac{Q K^\top}{\sqrt{d_k}} \right) V \in \mathbb{R}^{N \times d_v}. $$

This single line — Vaswani et al., 2017, eq. (1) — is the alpha and omega of the transformer.

5.2 Shape tracking

Tensor Shape
\(X\) \(N \times d_{\text{model}}\)
\(Q, K, V\) \(N \times d_k\)
\(QK^\top\) \(N \times N\) — attention scores
softmax row-sums to 1 \(N \times N\) — weights
Output \(N \times d_v\)

Both the matmul and the softmax are \(O(N^2)\) in sequence length, which is the fundamental cost of context extension.

5.3 Why \(\sqrt{d_k}\)?

If components of \(Q, K\) are i.i.d. with zero mean and unit variance, each entry of \(QK^\top\) has variance \(d_k\). Softmax over high-variance inputs saturates: a single position takes weight near 1, all others near 0, and gradients vanish. Dividing by \(\sqrt{d_k}\) restores unit variance and keeps gradients alive. A small but decisive detail.

5.4 Multi-Head Attention

A single head captures one type of relation. Running attention in \(h\) parallel sub-spaces lets the model learn multiple relation types at once. With \(d_k = d_{\text{model}} / h\),

$$ \mathrm{head}_i = \mathrm{Attention}(X W^Q_i, X W^K_i, X W^V_i), $$ $$ \mathrm{MHA}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h) W^O, $$

with \(W^O \in \mathbb{R}^{d_{\text{model}} \times d_{\text{model}}}\). Total parameters match a single-head version of size \(d_{\text{model}}\); the expressivity multiplies because each head learns in a different sub-space.

5.5 Causal masking

A decoder must not see future tokens. Add a mask that fills the strict upper triangle of \(QK^\top\) with \(-\infty\). After softmax those entries become exactly zero.

Pre-mask scores (3 × 3):           After mask:
[ s11 s12 s13 ]                     [ s11 -inf -inf ]
[ s21 s22 s23 ]                     [ s21  s22 -inf ]
[ s31 s32 s33 ]                     [ s31  s32  s33 ]

5.6 Minimal PyTorch implementation

import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, causal: bool = True):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_model, self.n_heads, self.d_k = d_model, n_heads, d_model // n_heads
        self.causal = causal
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, _ = x.shape
        qkv = self.qkv(x)
        q, k, v = qkv.chunk(3, dim=-1)
        # (B, N, h, d_k) -> (B, h, N, d_k)
        q = q.view(B, N, self.n_heads, self.d_k).transpose(1, 2)
        k = k.view(B, N, self.n_heads, self.d_k).transpose(1, 2)
        v = v.view(B, N, self.n_heads, self.d_k).transpose(1, 2)
        scores = (q @ k.transpose(-2, -1)) / (self.d_k ** 0.5)
        if self.causal:
            mask = torch.triu(torch.ones(N, N, device=x.device), diagonal=1).bool()
            scores = scores.masked_fill(mask, float('-inf'))
        weights = F.softmax(scores, dim=-1)
        out = weights @ v
        out = out.transpose(1, 2).contiguous().view(B, N, self.d_model)
        return self.out(out)


if __name__ == '__main__':
    mha = MultiHeadAttention(d_model=128, n_heads=8, causal=True)
    print(mha(torch.randn(2, 10, 128)).shape)  # torch.Size([2, 10, 128])

Run it and confirm weights[0, 0, 0, 5:] are zero — the causal mask is doing its job.

5.7 Variants and case studies

  • Grouped-Query Attention (GQA, Ainslie 2023). Fewer KV heads than Q heads to shrink the KV cache. LLaMA 2 70B has 8 KV heads shared by 64 Q heads.
  • Multi-Query Attention (MQA). Extreme GQA with 1 KV head. Faster inference, harder to train.
  • FlashAttention (Dao 2022). Same \(O(N^2)\) time, but tiles I/O so memory bandwidth stops being the bottleneck.
  • Sliding-Window Attention. Each token attends only to a window of \(w\) neighbours; Mistral 7B uses \(w = 4096\).

5.8 Limits and failure modes

  • \(O(N^2)\) cost. A 100K-token context is 10⁴× more attention compute than a 1K context. Tiling reduces the constant, not the growth rate.
  • Lost in the Middle (Liu 2023). In long contexts, beginnings and endings are recalled well; middles fade. Practical RAG advice puts critical evidence near the boundaries.
  • Numerical tokenisation interactions, as in §3.7.

5.9 Practice

  1. Extend the implementation above to accept an external attn_mask argument.
  2. Compare outputs for n_heads=1 vs n_heads=8 on the same input; observe norm and weight distributions.
  3. For \(N = 1024\), estimate the FP16 memory of the scores tensor and the full KV cache.

6. Positional Encoding

Attention is a set operation: without explicit position signals, permuting the input would leave the output unchanged. Position must be injected somewhere.

6.1 Sinusoidal (Vaswani 2017)

For position \(p\) and dimension \(i\),

$$ \mathrm{PE}(p, 2i) = \sin(p / 10000^{2i/d_{\text{model}}}), $$ $$ \mathrm{PE}(p, 2i + 1) = \cos(p / 10000^{2i/d_{\text{model}}}). $$

Added directly to token embeddings. Extrapolation works mechanically, but accuracy degrades beyond the training length, and attention sees only absolute positions.

6.2 Learned positional embeddings

GPT-2 stores a learnable \(P_{\max} \times d_{\text{model}}\) matrix. Simple and effective inside the trained range, but no graceful extrapolation.

6.3 Rotary Position Embedding (RoPE, Su 2021)

Instead of adding to embeddings, rotate Q and K. Pair dimensions \((2i, 2i+1)\), define frequency \(\theta_i = 10000^{-2i/d}\), and apply

$$ \begin{bmatrix} q'_{2i} \\ q'_{2i+1} \end{bmatrix} = R(p\theta_i) \begin{bmatrix} q_{2i} \\ q_{2i+1} \end{bmatrix},\ \ R(\phi) = \begin{bmatrix} \cos\phi & -\sin\phi \\ \sin\phi & \cos\phi \end{bmatrix}, $$

with the same rotation applied to \(K\). The dot product \(Q_p^\top K_q\) then becomes a function of \(p - q\) — relative position emerges naturally.

  • Encodes relative position without adding parameters.
  • Extrapolates better than sinusoidal, especially with frequency-base rescaling (NTK-aware, YaRN).
  • Used by LLaMA, Qwen, Mistral, and most modern open models.
def apply_rope(x: torch.Tensor, freqs: torch.Tensor) -> torch.Tensor:
    """x: (B, h, N, d_k) — d_k must be even. freqs: (N, d_k/2)."""
    x_pair = x.reshape(*x.shape[:-1], -1, 2)
    cos = freqs.cos().unsqueeze(0).unsqueeze(0)
    sin = freqs.sin().unsqueeze(0).unsqueeze(0)
    x0, x1 = x_pair[..., 0], x_pair[..., 1]
    return torch.stack([x0 * cos - x1 * sin, x0 * sin + x1 * cos], dim=-1).flatten(-2)

6.4 ALiBi (Press 2022)

Leave Q, K, and embeddings alone. Instead bias attention scores by relative distance:

$$ \text{scores}_{ij} \leftarrow \text{scores}_{ij} - m \cdot |i - j|. $$

Per-head negative slopes \(m\) are fixed; extrapolation is excellent. Used by MosaicBERT and Falcon.

6.5 One-line comparison

Method Where it acts Extrapolation Models
Sinusoidal Add to embeddings Possible, lossy Original Transformer
Learned Add to embeddings Not possible GPT-2
RoPE Rotate Q, K Good (great with base rescaling) LLaMA, Qwen
ALiBi Bias scores Strong Falcon, MosaicBERT

6.6 Limits

  • Every positional scheme degrades outside its training distribution. RoPE and ALiBi merely degrade more gracefully.
  • Extending context requires touching KV cache, data distribution, evaluation harness, and positional encoding — not just one of them.

6.7 Practice

  1. Compute the Sinusoidal PE matrix for \(d_{\text{model}} = 8, N = 4\) by hand.
  2. Apply apply_rope to a length-1024 batch and verify that \(Q_p^\top K_q\) depends on \(p - q\) only.
  3. Derive the standard ALiBi slope schedule \(m = 2^{-8/h}, 2^{-16/h}, \dots\) for \(h\) heads and confirm attention biases penalise distant positions correctly.

7. Putting it together: one transformer block (decoder-style)

diagram-2

After \(L\) blocks, the final hidden state is projected through unembedding to logits, and softmax produces the next-token distribution. How to sample from that distribution is the topic of Lecture 3.


8. A quick note on context length

Long context is expensive in three coupled ways:

  1. Attention matrix \(O(N^2)\). FlashAttention / sliding window reduce the constant.
  2. KV cache \(O(N \cdot L \cdot 2 \cdot d_{\text{kv}})\). Grows linearly per layer; GQA shrinks \(d_{\text{kv}}\).
  3. Quality. Lost-in-the-middle plus positional-encoding extrapolation limits combine to weaken middle-context recall.

So the answer to "do we still need RAG with 1M-token context?" is yes, for three reasons: cost, middle recall, and freshness. We return to RAG in Lecture 4.


9. Quick Recap — Answer Before You Peek

Five core questions this article answered. Cover the answers, give a one-line response yourself, then check.

Q1. Describe BPE training in four lines.

Answer (1) split every word into characters, (2) find the most frequent character pair and merge into a new token, (3) repeat until the vocabulary cap is reached, (4) apply the learned merge rules at inference time. Why Frequency + composition = semantically meaningful subwords, statistically resolving OOV (Sennrich et al. 2016). (Section §3 Tokenization.)

Q2. How do \(|V|\) and \(d_{\text{model}}\) connect to the embedding parameter count?

Answer Embedding matrix = \(|V| \times d_{\text{model}}\). E.g., \(|V| = 50{,}000\) and \(d_{\text{model}} = 4096\) → ~200M parameters, i.e. 10–30% of the model. Why The output unembed matrix is usually tied to the same shape. Growing the vocabulary raises parameters + KV memory + training time together. (Section §4 Embedding.)

Q3. What breaks if you remove the \(\sqrt{d_k}\) scaling in attention?

Answer \(Q K^{\top}\) variance grows linearly with \(d_k\), so the softmax saturates into a near-one-hot distribution → gradients collapse (softmax saturation). Why The softmax derivative is near zero for large inputs. Loss plateaus early. Scaling is essential at large \(d_k\). (Section §5 Attention.)

Q4. When \(h\) (heads) grows in Multi-Head Attention, where does the cost grow — parameters, KV memory, or both?

Answer Neither — total parameters and KV memory are unchanged. Only tensor reshape overhead grows. Why \(d_{\text{model}}\) is split across \(h\) heads, so each head uses \(d_k = d_{\text{model}}/h\). Total weight count is identical; the gain is parallel learning of different relationships. (Section §5.)

Q5. Where do RoPE and ALiBi physically enter, and why does "Lost in the Middle" happen?

Answer RoPE: multiply Q and K by a rotation at application time. ALiBi: add a distance-based bias to the attention score matrix. Lost in the Middle: attention frequently misses information at middle positions — training distributions are biased toward start and end (Liu et al. 2024). Why Both positional schemes have different trade-offs for extrapolation (contexts longer than seen during training). RoPE generalizes well to longer lengths; ALiBi degrades gracefully outside training range. (Sections §6 Positional Encoding, §8 Context Extension.)


If four or five came out as one-liners, the internal flow of a Transformer block is in place. For any question that stalled, return to the matching section.


10. Further reading


This is part 1 of 6 in the LLM Core Study series. Part 2 covers fine-tuning: LoRA, QLoRA, and distillation.

Series overview: Series index

댓글

이 블로그의 인기 게시물

Agent Memory Engine (2/10) — Building an AI Agent Memory System with SQLite Alone

"ML Foundations (9/9) — PyTorch vs TensorFlow, and the Road to Local LLMs"

"RAG Core Study (14/26) — Evaluation Sets with RAGAS & DeepEval"

"ML Foundations (8/9) — Deep Learning Architectures: CNN, RNN, Attention"

"ML Foundations (7/9) — Deep Learning Training: Optimizers, Regularization, Initialization"

OpenClaw to Hermes Migration (2/13) — What to Preserve, Partially Port, or Discard

AI Agents I Built (5/7) — Building an Automated Blogger API Publishing System