"LLM Core Study (3/6) — Decoding and Generation: Greedy, Beam, Top-p, Speculative"
One forward pass produces vocabulary-sized logits — a distribution, not text. The decoder turns that distribution into a token. This lecture organises decoders into four families: deterministic, stochastic, structural search, and acceleration.
0. Learning Objectives
- Reproduce the logits → probability → next-token pipeline in code.
- Distinguish Temperature, Top-k, Top-p (Holtzman 2020), and Min-p in one line each.
- Explain when Greedy or Beam are good and why beam needs length penalty.
- Know the equations and side effects of repetition, frequency, and presence penalties.
- State Speculative Decoding's (Leviathan 2023) acceptance rule and verify it by hand.
- Compare autoregressive (GPT) vs masked (BERT) training and inference.
1. ํต์ฌ ์์ฝ
- Decoding = "policy for choosing a token from logits". Repeated greedily, token by token.
- Greedy/Beam are deterministic. Beam needs length penalty or it favours short outputs.
- Temperature scales logits. Top-k truncates by count; Top-p by cumulative mass (adaptive); Min-p by ratio to the max.
- Penalties suppress repetition at the cost of legitimate re-use.
- Speculative Decoding lets a small draft model propose \(k\) tokens, verified by the target with provably equivalent distribution.
- AR (GPT) generates left-to-right; MLM (BERT) is bidirectional and not for generation.
2. Intuition: choosing one token from a distribution
The model gives \(P(x_{t+1} \mid x_{1:t}) = \mathrm{softmax}(z)\) with \(z \in \mathbb{R}^{|V|}\). The decoder picks one token, appends it to the input, and repeats until a stop condition. Different policies are different trade-offs between determinism, diversity, length, and repetition.
3. Deterministic decoders
3.1 Greedy
Always take \(\arg\max_z\). Simplest, fully deterministic, but easily trapped in local optima.
def greedy_decode(model, ids, max_new_tokens=64):
for _ in range(max_new_tokens):
with torch.no_grad():
logits = model(ids)[:, -1, :]
nxt = logits.argmax(dim=-1, keepdim=True)
ids = torch.cat([ids, nxt], dim=1)
if nxt.item() == EOS_ID:
break
return ids
3.2 Beam Search
Maintain the top-\(B\) partial sequences at every step. Each partial sequence carries a score
$$ \text{score}(x_{1:t}) = \sum_{i=1}^{t} \log P(x_i \mid x_{<i}). $$
Because all log-probs are negative, longer sequences accumulate more negative score and lose by default. The standard length penalty (Wu et al., 2016):
$$ \text{score}_{\text{lp}}(x_{1:T}) = \frac{\sum_i \log P(x_i \mid x_{<i})}{((5 + T)/6)^\alpha},\ \ \alpha \in [0.6, 1.0]. $$
Beam search is strong for translation and summarisation, weak for open-ended generation (it produces homogeneous, somewhat dull text).
3.3 Pros and cons
- ✅ Deterministic, locally consistent.
- ❌ \(B \times\) memory and compute.
- ❌ Mode collapse: all \(B\) beams converge on similar continuations.
- ❌ Repetition loops without
no_repeat_ngram_size.
4. Stochastic decoders
4.1 Temperature
$$ P_T(y) = \frac{\exp(z_y / T)}{\sum_{y'} \exp(z_{y'}/T)}. $$
- \(T \to 0\): concentrates on argmax (≈ Greedy).
- \(T = 1\): the model's natural distribution.
- \(T > 1\): flatter, more diverse, less consistent.
4.2 Top-k
Keep the top-\(k\) logits, mask the rest to \(-\infty\). Pros: always exactly \(k\) candidates. Cons: too few in flat distributions, too many in sharp ones.
4.3 Top-p (Nucleus, Holtzman 2020)
Sort by probability and keep the smallest prefix whose cumulative mass exceeds \(p\):
$$ \mathrm{Nucleus}_p = \{y_1, \dots, y_K\}\quad \text{s.t.}\quad \sum_{i=1}^{K} P(y_i) \ge p. $$
Adaptive: small \(K\) on sharp distributions, large \(K\) on flat ones. Typical settings: \(p \in [0.9, 0.95]\).
def top_p_filter(logits, p=0.9):
sorted_logits, sorted_idx = torch.sort(logits, descending=True)
probs = torch.softmax(sorted_logits, dim=-1)
cum = torch.cumsum(probs, dim=-1)
mask = cum > p
mask[..., 0] = False # keep at least one
sorted_logits[mask] = float('-inf')
return sorted_logits.scatter(-1, sorted_idx, sorted_logits)
4.4 Min-p
Threshold by ratio to the max probability: drop \(y\) if \(P(y) < \text{minp} \cdot P_{\max}\). Natural narrowing on sharp distributions, generous on flat ones.
4.5 Common combinations
Temperature is typically applied first, then Top-p or Top-k truncation. Order matters because softmax is non-linear: switching order changes the resulting distribution.
4.6 Limits
- Non-deterministic — debugging requires Greedy or
T = 0. - Diversity vs coherence is a domain-specific trade-off.
5. Repetition control
5.1 Repetition penalty (Keskar 2019)
Divide (positive logits) or multiply (negative) by a constant \(r > 1\) for tokens already seen.
$$ z'_y = \begin{cases} z_y / r & z_y > 0, y \in \text{history} \\ z_y \cdot r & z_y < 0, y \in \text{history} \\ z_y & \text{otherwise} \end{cases} $$
Side effect: punishes legitimate re-use (e.g. repeated proper nouns).
5.2 Frequency penalty (OpenAI API)
Subtract \(n \cdot \alpha\) from a token's logit, where \(n\) is the number of past occurrences. Smooth control of accumulating repetition.
5.3 Presence penalty (OpenAI API)
Subtract a fixed \(\beta\) once the token has appeared at all. Encourages topic-shifting.
5.4 No-repeat n-gram (beam search)
Forbid n-grams that already appeared in the partial sequence. Strong for translation, harmful when repetition is part of the truth.
6. Speculative Decoding (Leviathan 2023)
6.1 The bottleneck it removes
LLM inference is sequential: \(T\) tokens require \(T\) forwards. GPUs are bandwidth-bound; one giant matmul outputs one token.
6.2 The idea
A small draft model \(M_q\) proposes \(k\) tokens sequentially. The large target \(M_p\) checks them in a single forward (parallel over positions). At each position \(i\) we accept with probability
$$ \min\!\left(1, \frac{p_i(x_i)}{q_i(x_i)}\right). $$
If we reject at position \(i\), we resample once from the residual distribution \((p - q)_+/Z\). The remarkable result: the joint distribution is exactly what \(p\) would have produced (Leviathan et al., 2023, Theorem 1).
6.3 Expected speedup
With acceptance rate \(\alpha\) and draft length \(k\), each target forward yields \(1 + \alpha + \dots + \alpha^{k-1}\) tokens on average. With \(\alpha = 0.7, k = 5\), the expected gain is ~2.6× before draft cost, ~1.5–3× in practice.
6.4 Implementation pitfalls
- Draft and target must share the vocabulary.
- The target produces an extra logits row to use as fallback if all \(k\) drafts are rejected.
- The KV cache of the target must be truncated correctly on rejection.
6.5 Variants
- Medusa: train multiple prediction heads on the same model; eliminates the draft entirely.
- Lookahead decoding: same-model n-gram cache.
- EAGLE-2: more elaborate draft training.
6.6 Limits
- Low acceptance in difficult domains (code, math, low-temperature creative writing).
- Roughly 2× memory (draft + target).
7. AR vs Masked
7.1 Autoregressive (GPT)
$$ P(x_{1:T}) = \prod_{t=1}^{T} P(x_t \mid x_{<t}). $$
Train with a causal mask; predict each position from past tokens. Inference is left-to-right.
7.2 Masked Language Modelling (BERT)
Randomly replace 15 % of tokens with [MASK] and predict them using bidirectional context. Strong for classification and embeddings, not directly suited to generation.
7.3 Comparison
| Aspect | AR (GPT) | MLM (BERT) |
|---|---|---|
| Objective | Next-token prediction | Mask recovery |
| Mask | Causal (upper triangle) | Random positions |
| Directionality | Left-to-right | Bidirectional |
| Generation | Native AR | Not directly; encoder-decoder variants exist |
| Classification | Last hidden state | [CLS] or pooled |
Hybrids exist (T5 encoder-decoder, Prefix-LM, Diffusion-LM), but the decoding strategies in this lecture are mostly for AR models.
8. Quick Recap — Answer Before You Peek
Five core questions this article answered. Cover the answers, give a one-line response yourself, then check.
Q1. Greedy vs Beam: determinism and memory?
Answer Both are deterministic (same input → same output). Memory: Greedy O(1), Beam O(\(b \cdot L\)) for beam size \(b\) and length \(L\). Why Greedy keeps one argmax token per step; Beam tracks the top-\(b\) hypotheses simultaneously. Beam wins on translation and summarization where a clear correct answer exists; creative generation favors stochastic decoders. (Section §3.)
Q2. What does it mean that Top-p is dynamic compared to Top-k?
Answer Top-k cuts off at a fixed count \(k\). Top-p cuts off where the cumulative probability mass reaches \(p\) — narrow for sharp distributions, wide for flat ones. Why "Narrow the cut when confident, widen when uncertain." A single \(p = 0.9\) might select 3 tokens at one step and 50 at the next. Holtzman et al. 2020. (Section §4.)
Q3. Frequency vs Presence penalty?
Answer Frequency: penalty proportional to how many times a token appeared. Presence: a fixed penalty if it has appeared at all. Why Frequency curbs repetition rate; Presence promotes diversity. OpenAI's API exposes both because they push different objectives. (Section §5.)
Q4. Write the acceptance rule of Speculative Decoding. Why must draft and target share a vocabulary?
Answer Acceptance probability: \(\min\!\left(1, \frac{p_{\text{target}}(t)}{p_{\text{draft}}(t)}\right)\). Shared vocabulary: the target must be able to interpret tokens the draft proposes for verification to make sense. Why Leviathan 2023. A small fast draft model proposes N tokens; the target verifies them in parallel. Without vocabulary alignment, token IDs are incompatible. Yields ~2–3× latency reduction. (Section §6.)
Q5. AR vs MLM training objectives — one line each.
Answer AR (Autoregressive, GPT family): predict the next token. MLM (Masked LM, BERT family): predict the masked tokens using bidirectional context. Why AR is optimal for generation; MLM is optimal for representation learning. The same Transformer architecture becomes a different model purely through training objective. Decoding algorithms apply only to AR. (Section §7.)
If four or five came out as one-liners, the deterministic vs stochastic → repetition control → speculative → AR vs MLM arc is in place.
9. Practice
- For
temperature ∈ {0.1, 0.7, 1.5}andtop_p ∈ {0.5, 0.9, 0.99}, generate 10 samples each and score diversity vs coherence on a 1–5 scale. - With beam \(B = 4\) and
length_penalty ∈ {0, 0.6, 1.0}, decode a short translation and compare output lengths. - Use Hugging Face's
assistant_modelargument (speculative decoding) and compare tokens/sec. - Remove
mask[..., 0] = Falsefrom the Top-p filter and find inputs that break.
10. Further reading
- Holtzman et al., 2020, The Curious Case of Neural Text Degeneration. arXiv:1904.09751.
- Leviathan et al., 2023, Fast Inference from Transformers via Speculative Decoding. arXiv:2211.17192.
- Cai et al., 2024, Medusa. arXiv:2401.10774.
- Devlin et al., 2019, BERT. arXiv:1810.04805.
- Radford et al., 2019, GPT-2.
- Keskar et al., 2019, CTRL. arXiv:1909.05858.
- Wu et al., 2016, Google NMT. arXiv:1609.08144.
- Hugging Face generation guide: huggingface.co/docs/transformers/generation_strategies.
Part 3 of 6 in the LLM Core Study series. Part 4 covers advanced techniques: RAG, CoT, MoE, and In-Context Learning.
Series overview: Series index
๋๊ธ
๋๊ธ ์ฐ๊ธฐ