"LLM Core Study (6/6) — Learning Roadmap: 12-week Course, 10 Papers, 10 Repos"
Parts 1–5 covered principles. This finale orders them into a 12-week schedule with paper-reading, code-cloning, and experiment loops, then ranks the 10 must-read papers and 10 must-touch repositories, and finishes with a self-evaluation and next-step branches.
0. Learning Objectives
- See, at a glance, what to do each week for 12 weeks.
- Understand why the papers are read in this particular order.
- Match each repository to the right phase of learning.
- Use the self-evaluation to decide which deep-dive branch (inference, RAG, alignment, agents) comes next.
1. The 12-week course
Three activities per week: read, clone, experiment. Keep a short retro note each week — what built, what broke, what's next.
1.1 Weeks 0–2 — Build a tiny transformer
- Read: Vaswani 2017; review Lecture 1.
- Clone: retype nanoGPT from scratch in an empty file.
- Experiment: train on Tiny Shakespeare; reach perplexity ≤ 10.
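To make the perplexity target concrete, here is a minimal sketch of how perplexity follows from per-token log-probabilities. It assumes natural-log probabilities, and the helper names `perplexity` and `loss_target` are illustrative, not taken from nanoGPT. Note that perplexity per character and perplexity per subword token are not directly comparable, so the tokenisation should be recorded alongside the number.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood), natural-log inputs."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

def loss_target(max_perplexity):
    """Cross-entropy (in nats) that corresponds to a perplexity ceiling."""
    return math.log(max_perplexity)

# A model that assigns probability 0.1 to every target token has perplexity 10.
uniform_ten = [math.log(0.1)] * 50
print(round(perplexity(uniform_ten), 6))

# Perplexity <= 10 is the same as a mean cross-entropy loss <= ln(10).
print(round(loss_target(10.0), 3))
```

In practice this means the week's goal can be read straight off the validation loss: once it drops below roughly 2.30 nats, the perplexity target is met.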
1.2 Weeks 3–4 — Tokenisers and embeddings
- Read: Sennrich 2016 (BPE), Kudo & Richardson 2018 (SentencePiece).
- Clone: implement BPE in under 50 lines.
- Experiment: compare English vs Korean compression at three vocab sizes.
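As a starting point for the BPE exercise, the following is a minimal sketch of the merge-learning loop in the style of Sennrich et al. It assumes whitespace pre-tokenisation and Unicode characters as the initial symbols; the function names and the `</w>` end-of-word marker are choices made for this example. A byte-level variant would start from UTF-8 bytes instead, which matters for the Korean comparison because each Hangul syllable is three bytes.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges from whitespace-split text."""
    words = dict(Counter(tuple(w) + ("</w>",) for w in text.split()))
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break
        best = max(counts, key=counts.get)  # most frequent pair
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

def token_count(words):
    """Total number of symbols needed to encode the training text."""
    return sum(len(symbols) * freq for symbols, freq in words.items())

corpus = " ".join(["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3)
merges, words = train_bpe(corpus, 3)
print(merges, token_count(words))
```

For the compression experiment, one option is to divide `token_count` by the number of characters (or bytes) in the text and compare that ratio across languages as `num_merges` grows.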
1.3 Weeks 5–6 — Decoding strategies
- Read: Holtzman 2020 (Top-p), Leviathan 2023 (Speculative).
- Clone: write Greedy / Beam / Top-p / Min-p in one decoder module.
- Experiment: compare BLEU and repetition rate across the four.
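The sampling side of the decoder module can be sketched as a set of filters over one next-token distribution. This is a simplified illustration rather than a reference implementation: it works on plain Python lists, beam search is left out because it needs a model to expand hypotheses, and `repetition_rate` (the share of repeated n-grams) is one possible definition assumed here for the experiment.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; temperature < 1 sharpens the distribution."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def renormalise(probs, kept):
    """Zero out tokens outside `kept` and rescale the rest to sum to 1."""
    keep = set(kept)
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def greedy(probs):
    """Index of the most probable token."""
    return max(range(len(probs)), key=probs.__getitem__)

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return renormalise(probs, kept)

def min_p_filter(probs, min_p=0.1):
    """Keep tokens with probability at least min_p times the top probability."""
    threshold = min_p * max(probs)
    return renormalise(probs, [i for i, q in enumerate(probs) if q >= threshold])

def sample(probs, rng):
    """Draw one token index from a (filtered) distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def repetition_rate(tokens, n=2):
    """Fraction of n-grams that repeat an earlier n-gram (assumed metric)."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)

probs = softmax([2.0, 1.0, 0.5, -1.0])
rng = random.Random(0)
print(greedy(probs), sample(top_p_filter(probs, 0.9), rng), sample(min_p_filter(probs, 0.2), rng))
```

The difference worth observing in the experiment is that the Top-p cutoff depends on cumulative mass, while the Min-p cutoff scales with the confidence of the top token.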
1.4 Weeks 7–8 — Fine-tuning (LoRA, QLoRA)
- Read: Hu 2021, Dettmers 2023.
- Clone: fine-tune a 7B model with LoRA on Korean instruction data.
- Experiment: sweep \(r \in \{4, 8, 32\}\); record loss and MMLU.
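Before running the rank sweep, it can help to estimate how many parameters each setting trains. The sketch below assumes a LLaMA-7B-like shape (hidden size 4096, 32 layers) and adapters on two square projection matrices per layer; all three numbers are assumptions for illustration and should be replaced with the actual model configuration. It relies on the LoRA factorisation, where a rank-\(r\) update to a \(d_{\text{out}} \times d_{\text{in}}\) matrix adds \(r(d_{\text{in}} + d_{\text{out}})\) trainable parameters.

```python
def lora_trainable_params(r, d_in, d_out, n_matrices):
    """Parameters in A (r x d_in) plus B (d_out x r), over all adapted matrices."""
    return r * (d_in + d_out) * n_matrices

# Assumed shape: hidden size 4096, 32 layers, two adapted projections per layer.
HIDDEN = 4096
N_LAYERS = 32
ADAPTED_PER_LAYER = 2
TOTAL_PARAMS = 7e9  # nominal size of a "7B" model

for r in (4, 8, 32):
    n = lora_trainable_params(r, HIDDEN, HIDDEN, N_LAYERS * ADAPTED_PER_LAYER)
    print(f"r={r:>2}: {n:,} trainable ({100 * n / TOTAL_PARAMS:.3f}% of 7B)")
```

Under these assumptions the sweep covers about 2.1M, 4.2M, and 16.8M trainable parameters, so any change in loss or MMLU can be read against an eightfold change in adapter capacity.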
1.5 Weeks 9–10 — RAG
- Read: Lewis 2020 (RAG), Karpukhin 2020 (DPR).
- Clone: build FAISS + BGE-M3 + LLM into a working RAG pipeline.
- Experiment: chunk-size and top-k sweeps; faithfulness deltas.
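The sweep itself can be prototyped before FAISS and BGE-M3 are wired in. The sketch below is a toy stand-in: it uses bag-of-words cosine similarity in place of a real embedding model and index, and it measures retrieval hit rate (does the answer string appear in the top-k chunks), which is only a retrieval-side proxy and not a faithfulness score. All function names are hypothetical.

```python
import math
from collections import Counter

def chunk_words(text, size, overlap=0):
    """Split text into windows of `size` words, stepping by size - overlap."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    chunks = []
    for start in range(0, len(words), size - overlap):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks

def bow(text):
    """Toy 'embedding': lower-cased bag of words."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k):
    """Return the k chunks most similar to the query."""
    q = bow(query)
    return sorted(chunks, key=lambda c: cosine(q, bow(c)), reverse=True)[:k]

def hit_rate(doc, qa_pairs, size, k):
    """Fraction of questions whose answer string appears in the top-k chunks."""
    chunks = chunk_words(doc, size)
    hits = sum(any(ans in c for c in retrieve(q, chunks, k)) for q, ans in qa_pairs)
    return hits / len(qa_pairs)

doc = "paris is the capital of france berlin is the capital of germany"
qa = [("capital of france", "paris"), ("capital of germany", "berlin")]
for size in (3, 6, 12):
    for k in (1, 2):
        print(size, k, hit_rate(doc, qa, size, k))
```

Swapping `bow` and `retrieve` for a real embedding model and vector index leaves the sweep loop unchanged, which keeps the chunk-size and top-k comparison separate from the retrieval backend.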
1.6 Weeks 11–12 — Evaluation and operations
- Read: Lecture 4 RAGAS section + Lecture 5 perplexity/calibration.
- Clone: bolt a harness onto your own RAG.
- Experiment: build a dashboard that tracks three axes: hallucination rate, latency, and cost.
2. Ten papers — read order
- Vaswani et al., 2017, Attention Is All You Need — start of Lecture 1. arXiv:1706.03762.
- Sennrich et al., 2016, NMT of Rare Words with Subword Units — BPE origin. arXiv:1508.07909.
- Su et al., 2021, RoFormer: Rotary Position Embedding — Lecture 1 §6. arXiv:2104.09864.
- Hu et al., 2021, LoRA — Lecture 2. arXiv:2106.09685.
- Dettmers et al., 2023, QLoRA — Lecture 2. arXiv:2305.14314.
- Holtzman et al., 2020, The Curious Case of Neural Text Degeneration — Lecture 3. arXiv:1904.09751.
- Lewis et al., 2020, Retrieval-Augmented Generation — Lecture 4. arXiv:2005.11401.
- Wei et al., 2022, Chain-of-Thought Prompting — Lecture 4. arXiv:2201.11903.
- Fedus et al., 2022, Switch Transformer — Lecture 4 (MoE). arXiv:2101.03961.
- He et al., 2016, Deep Residual Learning — Lecture 5 (stability). arXiv:1512.03385.
Order principle: inside the model → training → applications → systems.
Bonus list (read after the core 10):
- Brown 2020 (GPT-3).
- Ouyang 2022 (InstructGPT).
- Rafailov 2023 (DPO).
- Touvron 2023 (LLaMA 2).
- Liu 2023 (Lost in the Middle).
3. Ten repositories
Read code, change one thing, run.
- karpathy/nanoGPT — 250-line GPT, the entire Lecture 1 in one file. github.com/karpathy/nanoGPT.
- karpathy/minbpe — minimal BPE tokenizer. github.com/karpathy/minbpe.
- huggingface/transformers: industrial-strength reference; generation/utils.py is Lecture 3 in code. github.com/huggingface/transformers.
- huggingface/peft: LoRA / Adapter / Prefix-Tuning library. github.com/huggingface/peft.
- TimDettmers/bitsandbytes — NF4, double-quant, paged optimisers. github.com/TimDettmers/bitsandbytes.
- Dao-AILab/flash-attention — FlashAttention v2 / v3. github.com/Dao-AILab/flash-attention.
- vllm-project/vllm — production LLM serving (PagedAttention + speculative). github.com/vllm-project/vllm.
- facebookresearch/faiss — vector search baseline. github.com/facebookresearch/faiss.
- langchain-ai/langchain — RAG and agent orchestration. github.com/langchain-ai/langchain.
- mistralai/mistral-inference — official Mixtral/MoE inference. github.com/mistralai/mistral-inference.
Suggested order: 1 → 2 → 3 → 4 → 5 → 6 → 7 → (8, 9 in parallel) → 10.
4. Self-evaluation
4.0 Series-wide Quick Recap — Answer Before You Peek (with answers)
The five core questions this series answered. Before tackling the 30-prompt detailed evaluation below, give a one-liner for each and compare.
Q1. What are the four core components of a Transformer block?
Answer: Tokenization · Embedding · Self-Attention (+ positional encoding) · Feed-Forward MLP, all wrapped by Residual + LayerNorm.
Why: This is the entire internal flow covered in Part 1. The FFN holds roughly 2/3 of model parameters; attention routes information, while the MLP performs the semantic transformation. (Part 1 §7 Synthesis.)
Q2. How does QLoRA cut memory by roughly 10× compared to full fine-tuning?
Answer: Three moves applied together: (a) freeze the base model in 4-bit NF4, so it needs no gradients or optimizer state, (b) train only about 0.1% of parameters via LoRA, (c) use a paged optimizer that moves optimizer state between CPU and GPU memory.
Why: Full fine-tuning's 56 GB is roughly 4× the weight memory once gradients and optimizer state are counted (Part 2 §3). QLoRA removes most of that by shrinking the set of trainable parameters. (Part 2 §6.)
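The memory argument can be checked with back-of-the-envelope arithmetic. The sketch below is one accounting that reproduces the 56 GB figure, assuming a 7B model with 2 bytes per value for weights, gradients, and each of the two Adam moments; the actual breakdown in Part 2 may differ, and real runs often keep optimizer state in higher precision. Activations, quantisation constants, and framework overhead are deliberately left out.

```python
GB = 1e9  # decimal gigabytes, matching the 56 GB figure

def full_finetune_gb(n_params, bytes_per_value=2):
    """Weights + gradients + two Adam moments, all at the same precision (assumption)."""
    return n_params * bytes_per_value * 4 / GB

def qlora_gb(n_params, trainable_fraction=0.001, bytes_per_value=2):
    """4-bit frozen base plus adapter weights, gradients, and two Adam moments."""
    base = n_params * 0.5 / GB  # 4 bits = 0.5 byte per parameter
    adapters = n_params * trainable_fraction * bytes_per_value * 4 / GB
    return base + adapters

n = 7e9
full = full_finetune_gb(n)
qlora = qlora_gb(n)
print(f"full fine-tuning: {full:.1f} GB")
print(f"QLoRA (weights and optimizer state only): {qlora:.2f} GB")
print(f"ratio: {full / qlora:.1f}x")
```

Because activation memory is excluded on both sides, the printed ratio is an upper-bound style estimate; the ratio observed in practice depends on batch size and sequence length.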
Q3. What is the central trade-off in decoding-algorithm selection?
Answer: Confident distributions (translation, summarization) → Beam/Greedy. Creative generation (writing) → Top-p + Temperature. Speculative Decoding cuts latency directly (combinable with either family).
Why: Determinism vs diversity is one axis; latency is another. Beam search is deterministic but diversity-poor; stochastic sampling is the inverse. Repetition penalties tune both. (Part 3 §3·§4·§5·§6.)
Q4. How do the four advanced techniques (RAG · CoT · MoE · ICL) combine in practice?
Answer: RAG injects external knowledge, CoT spells out reasoning steps, MoE delivers efficiency + capacity, ICL provides adaptation. They solve different problems and compose.
Why: Any single technique used as a universal cure fails. RAG alone cannot reason; CoT alone lacks external knowledge. Modern LLM systems typically combine RAG + CoT + ICL. (Part 4 §6.)
Q5. The five mathematical pillars under LLM training and inference?
Answer: Softmax-CE (classification standard + clean gradient \(p - y\)) · KL Divergence (distribution distance, asymmetric) · Perplexity (\(e^{\mathrm{CE}}\), evaluation metric) · Residual + LayerNorm/RMSNorm (gradient flow + distribution stabilization) · AdamW (decoupled weight decay).
Why: All five operate simultaneously in any Transformer training run. Remove any one and deep-network training either destabilises or becomes hard to evaluate. (Part 5 §2·§3·§4·§5·§6·§7·§8.)
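Two of these pillars are easy to verify numerically. The following sketch checks the \(p - y\) gradient of softmax cross-entropy against a central finite difference and shows that KL divergence changes when its arguments are swapped. The logits and distributions are arbitrary example values chosen for illustration.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, target):
    """Negative log-probability of the target class under softmax(z)."""
    return -math.log(softmax(z)[target])

def analytic_grad(z, target):
    """Gradient of cross-entropy w.r.t. the logits: p - y with one-hot y."""
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(softmax(z))]

def numeric_grad(z, target, eps=1e-6):
    """Central finite-difference estimate of the same gradient."""
    grads = []
    for i in range(len(z)):
        up = z[:i] + [z[i] + eps] + z[i + 1:]
        down = z[:i] + [z[i] - eps] + z[i + 1:]
        grads.append((cross_entropy(up, target) - cross_entropy(down, target)) / (2 * eps))
    return grads

def kl(p, q):
    """KL(p || q) in nats for discrete distributions with q > 0 wherever p > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

logits, target = [2.0, 1.0, 0.1], 0
gap = max(abs(a - b) for a, b in zip(analytic_grad(logits, target), numeric_grad(logits, target)))
print(f"max |analytic - numeric| = {gap:.2e}")

p, q = [0.9, 0.1], [0.5, 0.5]
print(f"KL(p||q) = {kl(p, q):.4f}, KL(q||p) = {kl(q, p):.4f}")
```

The gradient components sum to zero because both \(p\) and \(y\) sum to one, which is a quick sanity check when implementing the loss by hand.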
If a question stalled, return to the matching section in Parts 1–5 (Section 9).
4.1 Detailed self-evaluation — 30 prompts (answer each in under 30 seconds; five per lecture)
Lecture 1 — Fundamentals
- Write the tokenizer function signature.
- What breaks without the \(\sqrt{d_k}\) scaling?
- Relate head count and parameter count in MHA.
- Compare RoPE and ALiBi by where they act.
- Why does "Lost in the Middle" happen?
Lecture 2 — Fine-tuning
- Ratio of AdamW state to weights.
- Shapes and initialisation of \(A, B\) in LoRA.
- Equation that "merges" LoRA into the base.
- NF4 vs INT4 in one line.
- Role of \(T^2\) in distillation.
Lecture 3 — Decoding
- Determinism: Greedy vs Beam.
- Top-p vs Top-k adaptiveness.
- Frequency vs Presence penalty.
- Speculative Decoding acceptance equation.
- AR vs MLM objective.
Lecture 4 — Advanced
- DPR contrastive loss.
- Where does the cross-encoder fit?
- At what model scale does CoT become emergent?
- Switch vs Mixtral routing.
- Zero/one/few-shot definitions.
Lecture 5 — Math
- Derive softmax + CE gradient.
- KL asymmetry intuition.
- Perplexity ↔ cross-entropy.
- Residual connection's role.
- LayerNorm vs RMSNorm equations.
System
- 7B LoRA memory decomposition (four parts).
- KV cache size formula.
- Active parameters of Mixtral 8x7B.
- Two RAG failure modes.
- Why the answer to "do we still need RAG at 1M context?" is yes.
Whatever you cannot answer, return to that lecture's "Self-check" section.
5. Where to go next
Once the core is solid, choose one branch.
5.1 Inference Engineering
- Resources: vLLM, TensorRT-LLM, GPTQ/AWQ quantisation.
- Topics: paged KV cache, FlashAttention v3, speculative + MoE.
5.2 RAG / Search Engineering
- Resources: ColBERT, hybrid (BM25 + vector) search, chunking strategies, metadata filters.
- Topics: retrieval evaluation, multimodal RAG, agentic RAG.
5.3 Post-Training / Alignment
- Resources: InstructGPT, DPO, Constitutional AI, reward models.
- Topics: multi-objective safety/honesty/helpfulness.
5.4 Agents and Orchestration
- Resources: ReAct, Toolformer, function calling, ReST^EM, multi-agent frameworks.
- Topics: tool use, memory systems, evaluation harnesses.
6. Integrated extra reading
- Deep Learning (Goodfellow et al., 2016), chapters 6, 8, 10.
- Speech and Language Processing (Jurafsky & Martin, 3rd ed.), chapters 8–12.
- Stanford CS336, Language Modeling from Scratch (2024+).
- Andrej Karpathy, Let's build GPT and Let's build the GPT Tokenizer (YouTube).
- Anthropic, Building Effective Agents (2024).
- OpenAI, Spec for the OpenAI API — the canonical reference for decoding parameters.
7. Closing
These six lectures climb the ladder of principles → training → applications → systems. Any new topic — MoE router stability, multimodal attention, 8-bit KV caches, longest-prefix caching — can be re-told with this same scaffolding. Throw me a topic and the series gains one more lecture in the same format.
Learning ends here only as a curriculum. The work begins next.
Part 6 of 6 — the end of the LLM Core Study series. The next series (Inference Engineering / RAG / Alignment / Agents) starts when you do.