"LLM Core Study (6/6) — Learning Roadmap: 12-week Course, 10 Papers, 10 Repos"

Parts 1–5 covered principles. This finale orders them into a 12-week schedule with paper-reading, code-cloning, and experiment loops, then ranks the 10 must-read papers and 10 must-touch repositories, and finishes with a self-evaluation and next-step branches.


0. Learning Objectives

  • See, at a glance, what to do each week for 12 weeks.
  • Understand why the papers are read in this particular order.
  • Match each repository to the right phase of learning.
  • Use the self-evaluation to decide which deep-dive branch (inference, RAG, alignment, agents) comes next.

1. The 12-week course

Three activities per week: read, clone, experiment. Keep a short retro note each week — what built, what broke, what's next.

1.1 Weeks 0–2 — Build a tiny transformer

  • Read: Vaswani 2017; review Lecture 1.
  • Clone: retype nanoGPT from scratch into an empty file instead of copy-pasting it.
  • Experiment: train on Tiny Shakespeare; reach perplexity ≤ 10.
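
The perplexity target above is easier to track as a loss threshold. A minimal sketch, assuming your training loop logs per-token cross-entropy in nats; the loss values here are hypothetical, and the number is only comparable between runs that use the same tokenisation:

```python
import math

def perplexity(token_losses):
    """Perplexity is exp of the mean per-token cross-entropy (in nats)."""
    if not token_losses:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_losses) / len(token_losses))

# Hypothetical per-token validation losses, in nats.
losses = [2.1, 1.9, 2.4, 2.0]
ppl = perplexity(losses)

# Perplexity <= 10 is the same condition as mean loss <= ln(10), about 2.303.
target_loss = math.log(10)
print(f"perplexity {ppl:.2f}, target mean loss {target_loss:.3f}")
```

Watching the mean loss against `target_loss` during training gives the same signal without exponentiating every step.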

1.2 Weeks 3–4 — Tokenisers and embeddings

  • Read: Sennrich 2016 (BPE), Kudo & Richardson 2018 (SentencePiece).
  • Clone: implement BPE in under 50 lines.
  • Experiment: compare English vs Korean compression at three vocab sizes.
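
One possible shape for the under-50-lines BPE and the compression comparison is sketched below. It is a byte-level variant in the spirit of minbpe, not a copy of it; the two sample strings and the merge counts are hypothetical stand-ins for a real corpus and real vocab sizes (vocab size = 256 + number of merges).

```python
from collections import Counter

def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of pair with new_id."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Learn up to num_merges merges over UTF-8 bytes; return (merges, ids)."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (left_id, right_id) -> new_id
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:  # nothing left worth merging
            break
        new_id = 256 + len(merges)
        merges[pair] = new_id
        ids = merge(ids, pair, new_id)
    return merges, ids

def compression_ratio(text, num_merges):
    """Bytes per token after training on the text itself."""
    _, ids = train_bpe(text, num_merges)
    return len(text.encode("utf-8")) / len(ids)

# Hypothetical toy corpora; swap in real English and Korean text.
english = "the cat sat on the mat. " * 20
korean = "고양이가 매트 위에 앉았다. " * 20

for merges_budget in (8, 32, 128):
    en = compression_ratio(english, merges_budget)
    ko = compression_ratio(korean, merges_budget)
    print(f"vocab {256 + merges_budget}: en {en:.2f} bytes/token, ko {ko:.2f} bytes/token")
```

Because Hangul syllables take three UTF-8 bytes each, bytes per token and characters per token tell different stories for Korean; reporting both is worth considering in the experiment.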

1.3 Weeks 5–6 — Decoding strategies

  • Read: Holtzman 2020 (Top-p), Leviathan 2023 (Speculative).
  • Clone: write Greedy / Beam / Top-p / Min-p in one decoder module.
  • Experiment: compare BLEU and repetition rate across the four.
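
The sampling half of that decoder module, plus a simple repetition metric, might look like the sketch below. Beam search is left out for brevity, the probability vector is hypothetical, and `repetition_rate` is one possible definition (share of n-grams that duplicate an earlier one), not a standard the series prescribes.

```python
import random

def renormalise(probs, kept):
    """Zero out everything outside kept and rescale to sum to 1."""
    keep = set(kept)
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def top_p_filter(probs, p=0.9):
    """Keep the smallest high-probability set whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return renormalise(probs, kept)

def min_p_filter(probs, min_p=0.1):
    """Keep tokens whose probability is at least min_p times the maximum."""
    threshold = min_p * max(probs)
    return renormalise(probs, [i for i, q in enumerate(probs) if q >= threshold])

def greedy(probs):
    return max(range(len(probs)), key=lambda i: probs[i])

def sample(probs, rng):
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def repetition_rate(tokens, n=2):
    """Fraction of n-grams that repeat an n-gram seen earlier."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)

# Hypothetical next-token distribution over a four-token vocabulary.
probs = [0.5, 0.3, 0.15, 0.05]
rng = random.Random(0)
print("greedy:", greedy(probs))
print("top-p :", sample(top_p_filter(probs, p=0.9), rng))
print("min-p :", sample(min_p_filter(probs, min_p=0.4), rng))
```

Top-p adapts the cutoff to cumulative mass, while min-p scales it with the top token's confidence, which is the difference worth probing in the experiment.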

1.4 Weeks 7–8 — Fine-tuning (LoRA, QLoRA)

  • Read: Hu 2021, Dettmers 2023.
  • Clone: fine-tune a 7B model with LoRA on Korean instruction data.
  • Experiment: sweep \(r \in \{4, 8, 32\}\); record loss and MMLU.
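
Before launching the rank sweep, it helps to know how many parameters each \(r\) actually trains. A back-of-the-envelope sketch, assuming a 7B-class shape that is not stated in this series: hidden size 4096, 32 layers, and LoRA applied to two square projections per layer.

```python
def lora_params_per_matrix(d_in, d_out, r):
    """A is (r x d_in) and B is (d_out x r), so the adapter adds r * (d_in + d_out)."""
    return r * (d_in + d_out)

def lora_params_total(r, d_model=4096, n_layers=32, matrices_per_layer=2):
    # Assumed shapes: square d_model x d_model projections, two adapted per layer.
    return n_layers * matrices_per_layer * lora_params_per_matrix(d_model, d_model, r)

BASE_PARAMS = 7_000_000_000  # nominal 7B base, used only for the ratio

for r in (4, 8, 32):
    n = lora_params_total(r)
    print(f"r={r:>2}: {n:>10,} trainable ({100 * n / BASE_PARAMS:.3f}% of base)")
```

Trainable parameters grow linearly in \(r\), so the sweep spans an 8× range of adapter capacity; recording that number next to loss and MMLU makes the comparison easier to read.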

1.5 Weeks 9–10 — RAG

  • Read: Lewis 2020 (RAG), Karpukhin 2020 (DPR).
  • Clone: build FAISS + BGE-M3 + LLM into a working RAG pipeline.
  • Experiment: chunk-size and top-k sweeps; faithfulness deltas.
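
The sweep itself is just a grid over two knobs. The sketch below shows the loop structure only: it swaps in bag-of-words cosine similarity for FAISS + BGE-M3 and a simple "did the answer land in the retrieved chunks" check for a real faithfulness score, and the document, query, and answer are hypothetical.

```python
import math
from collections import Counter
from itertools import product

def chunk(words, size):
    """Split a word list into consecutive chunks of at most size words."""
    return [words[i:i + size] for i in range(0, len(words), size)]

def bow_cosine(a, b):
    """Cosine similarity between bag-of-words count vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k):
    """Return the k chunks most similar to the query."""
    return sorted(chunks, key=lambda c: bow_cosine(query, c), reverse=True)[:k]

def hit(query, answer, chunks, k):
    """Proxy metric: is the answer word inside any retrieved chunk?"""
    return any(answer in c for c in retrieve(query, chunks, k))

# Hypothetical toy document, query, and gold answer.
words = ("the dog runs in the park every morning . "
         "the cat sleeps on the sofa at night . "
         "the bird sings near the window at dawn .").split()
query, answer = ["cat", "sleeps"], "sofa"

results = {}
for size, k in product((4, 8, 16), (1, 2)):
    results[(size, k)] = hit(query, answer, chunk(words, size), k)
    print(f"chunk_size={size:>2} top_k={k}: hit={results[(size, k)]}")
```

On this toy text the smallest chunk size separates the query terms from the answer, which is the kind of failure the real sweep is meant to surface.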

1.6 Weeks 11–12 — Evaluation and operations

  • Read: Lecture 4 RAGAS section + Lecture 5 perplexity/calibration.
  • Clone: bolt a harness onto your own RAG.
  • Experiment: dashboard hallucination, latency, cost on three axes.
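
A dashboard starts as a summary function over logged requests. A minimal sketch, assuming each request is logged as a dict with the hypothetical fields `latency_s`, `tokens`, and `hallucinated` (the last set by whatever judge your harness uses); the price per 1k tokens is a placeholder, not a real rate.

```python
import math

def percentile(sorted_values, q):
    """Nearest-rank percentile of an already sorted list, q in (0, 100]."""
    rank = math.ceil(q / 100 * len(sorted_values))
    return sorted_values[max(0, rank - 1)]

def summarise(records, usd_per_1k_tokens):
    """Collapse request logs into the three dashboard axes."""
    latencies = sorted(r["latency_s"] for r in records)
    total_tokens = sum(r["tokens"] for r in records)
    return {
        "hallucination_rate": sum(r["hallucinated"] for r in records) / len(records),
        "latency_p50_s": percentile(latencies, 50),
        "latency_p95_s": percentile(latencies, 95),
        "cost_usd": total_tokens / 1000 * usd_per_1k_tokens,
    }

# Hypothetical request log.
records = [
    {"latency_s": 0.8, "tokens": 500, "hallucinated": False},
    {"latency_s": 1.2, "tokens": 700, "hallucinated": True},
    {"latency_s": 0.9, "tokens": 600, "hallucinated": False},
    {"latency_s": 2.5, "tokens": 900, "hallucinated": False},
    {"latency_s": 1.0, "tokens": 300, "hallucinated": False},
]
summary = summarise(records, usd_per_1k_tokens=0.002)  # placeholder price
for name, value in summary.items():
    print(f"{name}: {value}")
```

Tail latency (p95) is reported alongside the median because a few slow retrievals can dominate user experience even when the median looks healthy.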

2. Ten papers — read order

  1. Vaswani et al., 2017, Attention Is All You Need — start of Lecture 1. arXiv:1706.03762.
  2. Sennrich et al., 2016, NMT of Rare Words with Subword Units — BPE origin. arXiv:1508.07909.
  3. Su et al., 2021, RoFormer: Rotary Position Embedding — Lecture 1 §6. arXiv:2104.09864.
  4. Hu et al., 2021, LoRA — Lecture 2. arXiv:2106.09685.
  5. Dettmers et al., 2023, QLoRA — Lecture 2. arXiv:2305.14314.
  6. Holtzman et al., 2020, The Curious Case of Neural Text Degeneration — Lecture 3. arXiv:1904.09751.
  7. Lewis et al., 2020, Retrieval-Augmented Generation — Lecture 4. arXiv:2005.11401.
  8. Wei et al., 2022, Chain-of-Thought Prompting — Lecture 4. arXiv:2201.11903.
  9. Fedus et al., 2022, Switch Transformer — Lecture 4 (MoE). arXiv:2101.03961.
  10. He et al., 2016, Deep Residual Learning — Lecture 5 (stability). arXiv:1512.03385.

Order principle: inside the model → training → applications → systems. Bonus list (read after the core 10):

  11. Brown 2020 (GPT-3).
  12. Ouyang 2022 (InstructGPT).
  13. Rafailov 2023 (DPO).
  14. Touvron 2023 (LLaMA 2).
  15. Liu 2023 (Lost in the Middle).

3. Ten repositories

Read code, change one thing, run.

  1. karpathy/nanoGPT — a GPT defined in roughly 300 lines (model.py), the entire Lecture 1 in one file. github.com/karpathy/nanoGPT.
  2. karpathy/minbpe — minimal BPE tokenizer. github.com/karpathy/minbpe.
  3. huggingface/transformers — industrial-strength reference; generation/utils.py is Lecture 3 in code. github.com/huggingface/transformers.
  4. huggingface/peft — LoRA / Adapter / Prefix-Tuning library. github.com/huggingface/peft.
  5. TimDettmers/bitsandbytes — NF4, double-quant, paged optimisers. github.com/TimDettmers/bitsandbytes.
  6. Dao-AILab/flash-attention — FlashAttention v2 / v3. github.com/Dao-AILab/flash-attention.
  7. vllm-project/vllm — production LLM serving (PagedAttention + speculative). github.com/vllm-project/vllm.
  8. facebookresearch/faiss — vector search baseline. github.com/facebookresearch/faiss.
  9. langchain-ai/langchain — RAG and agent orchestration. github.com/langchain-ai/langchain.
  10. mistralai/mistral-inference — official Mixtral/MoE inference. github.com/mistralai/mistral-inference.

Suggested order: 1 → 2 → 3 → 4 → 5 → 6 → 7 → (8, 9 in parallel) → 10.


4. Self-evaluation

4.0 Series-wide Quick Recap — Answer Before You Peek (with answers)

The five core questions this series answered. Before tackling the 30-prompt detailed evaluation below, give a one-liner for each and compare.

Q1. What are the four core stages of a Transformer's forward pass?

Answer: Tokenization · Embedding · Self-Attention (+ positional encoding) · Feed-Forward MLP, with Residual + LayerNorm wrapped around the attention and MLP sublayers.

Why: This is the entire internal flow covered in Part 1. The FFN holds roughly 2/3 of a block's parameters: attention routes information, the MLP performs the semantic transformation. (Part 1 §7 Synthesis.)

Q2. How does QLoRA cut memory by roughly 10× compared to full fine-tuning?

Answer: Three moves applied simultaneously: (a) freeze the base in 4-bit NF4, so it carries no gradients or optimizer state, (b) train only about 0.1% of parameters via LoRA, (c) use a Paged Optimizer that swaps optimizer state between CPU and GPU.

Why: In full fine-tuning, the 56 GB of optimizer state (4× the weights, Part 2 §3) dominates memory. QLoRA nearly eliminates it by shrinking the set of trainable parameters. (Part 2 §6.)
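
The arithmetic behind those numbers can be checked in a few lines. This is a sketch under stated assumptions, not a profiler: a 7B-parameter model, 16-bit weights and gradients, fp32 Adam moments, and a hypothetical adapter of about 4.2M parameters. Activations, quantisation constants, and framework overhead are left out.

```python
GB = 1e9

def full_finetune_gb(n_params):
    """Static memory for full fine-tuning with AdamW, in GB."""
    parts = {
        "weights_16bit": 2 * n_params / GB,
        "grads_16bit": 2 * n_params / GB,
        "adam_moments_fp32": 8 * n_params / GB,  # two fp32 tensors = 4x the 16-bit weights
    }
    parts["total"] = sum(parts.values())
    return parts

def qlora_gb(n_params, n_adapter_params):
    """Static memory for QLoRA: 4-bit frozen base plus a small trainable adapter."""
    parts = {
        "base_nf4": 0.5 * n_params / GB,  # 4 bits = 0.5 bytes per parameter
        "adapter_weights_grads_adam": (2 + 2 + 8) * n_adapter_params / GB,
    }
    parts["total"] = sum(parts.values())
    return parts

full = full_finetune_gb(7e9)
qlora = qlora_gb(7e9, 4.2e6)  # adapter size is an assumption
for name, parts in (("full fine-tuning", full), ("QLoRA", qlora)):
    print(name, {k: round(v, 2) for k, v in parts.items()})
```

The static ratio this prints is larger than the roughly 10× quoted above because activations and overhead, which QLoRA does not shrink, are excluded here.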

Q3. What is the central trade-off in decoding-algorithm selection?

Answer: Confident distributions (translation, summarization) → Beam/Greedy. Creative generation (writing) → Top-p + Temperature. Speculative Decoding cuts latency directly and is combinable with either family.

Why: Determinism vs Diversity is one axis; Latency is another. Beam is deterministic but diversity-poor; stochastic sampling is the inverse. Repetition penalties tune both. (Part 3 §3·§4·§5·§6.)
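
Speculative Decoding combines with either family because its accept/reject rule leaves the target distribution unchanged. A single-token sketch of that rule as described in Leviathan 2023: accept a draft token \(x\) with probability \(\min(1, p(x)/q(x))\), otherwise resample from the normalised residual \(\max(0, p - q)\). The two distributions below are hypothetical.

```python
import random

def accept_prob(p_target, q_draft, token):
    """Probability of keeping a token proposed by the draft model."""
    return min(1.0, p_target[token] / q_draft[token])

def residual(p_target, q_draft):
    """Distribution to resample from after a rejection: normalised max(0, p - q)."""
    diff = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    total = sum(diff)
    return [d / total for d in diff]

def speculative_step(p_target, q_draft, rng):
    """Propose from the draft, then accept or resample. Returns (token, accepted)."""
    vocab = range(len(q_draft))
    token = rng.choices(vocab, weights=q_draft, k=1)[0]
    if rng.random() < accept_prob(p_target, q_draft, token):
        return token, True
    # A rejection implies p != q somewhere, so the residual has positive mass.
    return rng.choices(vocab, weights=residual(p_target, q_draft), k=1)[0], False

# Hypothetical target and draft distributions over three tokens.
p_target = [0.6, 0.3, 0.1]
q_draft = [0.3, 0.5, 0.2]

rng = random.Random(0)
trials = 20000
counts = [0, 0, 0]
accepted = 0
for _ in range(trials):
    token, ok = speculative_step(p_target, q_draft, rng)
    counts[token] += 1
    accepted += ok
print("empirical:", [round(c / trials, 3) for c in counts])
print("acceptance rate:", round(accepted / trials, 3))
```

The empirical frequencies should track `p_target`, not `q_draft`: the draft model only affects how often a proposal is accepted, which is where the latency gain comes from.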

Q4. How do the four advanced techniques (RAG · CoT · MoE · ICL) combine in practice?

Answer: RAG injects external knowledge, CoT spells out reasoning steps, MoE delivers efficiency + capacity, ICL provides adaptation. They solve different problems and compose.

Why: Any single technique used as a universal cure fails. RAG alone cannot reason; CoT alone lacks external knowledge. Modern LLM systems typically combine RAG + CoT + ICL. (Part 4 §6.)

Q5. The five mathematical pillars under LLM training and inference?

Answer: Softmax-CE (classification standard + clean gradient \(p - y\)) · KL Divergence (distribution distance, asymmetric) · Perplexity (\(e^{\mathrm{CE}}\), evaluation metric) · Residual + LayerNorm/RMSNorm (gradient flow + distribution stabilization) · AdamW (decoupled weight decay).

Why: All five are in play in any Transformer training run. The loss, the residual/normalization path, and the optimizer are each load-bearing: remove one and deep-network training becomes unstable or stalls. (Part 5 §2·§3·§4·§5·§6·§7·§8.)
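
Two of those pillars are easy to verify numerically. The sketch below checks the \(p - y\) gradient of softmax + cross-entropy against central finite differences and shows KL asymmetry on a pair of distributions; the logits and distributions are arbitrary illustrative values.

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, target):
    """Loss for logits z against a one-hot target index."""
    return -math.log(softmax(z)[target])

def analytic_grad(z, target):
    """dCE/dz = p - y, with y the one-hot target."""
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(softmax(z))]

def numeric_grad(z, target, eps=1e-6):
    """Central finite differences, one logit at a time."""
    grad = []
    for i in range(len(z)):
        hi = z[:i] + [z[i] + eps] + z[i + 1:]
        lo = z[:i] + [z[i] - eps] + z[i + 1:]
        grad.append((cross_entropy(hi, target) - cross_entropy(lo, target)) / (2 * eps))
    return grad

def kl(p, q):
    """KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

logits, target = [2.0, -1.0, 0.5], 0
gap = max(abs(a - n) for a, n in zip(analytic_grad(logits, target),
                                     numeric_grad(logits, target)))
print(f"max |analytic - numeric| = {gap:.2e}")

p, q = [0.9, 0.1], [0.5, 0.5]
print(f"KL(p||q) = {kl(p, q):.3f}, KL(q||p) = {kl(q, p):.3f}")
```

The two KL values differ because each direction weights the log-ratio by a different distribution, which is the asymmetry the recap refers to.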

If a question stalled, return to the matching section in Parts 1–5 (Section 9).


4.1 Detailed self-evaluation — 30 prompts (answer each in under 30 seconds; five per group: Lectures 1–5 plus System)

Lecture 1 — Fundamentals

  • Write the tokenizer function signature.
  • What breaks without the \(\sqrt{d_k}\) scaling?
  • Relate head count and parameter count in MHA.
  • Compare RoPE and ALiBi by where they act.
  • Why does "Lost in the Middle" happen?

Lecture 2 — Fine-tuning

  • Ratio of AdamW state to weights.
  • Shapes and initialisation of \(A, B\) in LoRA.
  • Equation that "merges" LoRA into the base.
  • NF4 vs INT4 in one line.
  • Role of \(T^2\) in distillation.

Lecture 3 — Decoding

  • Determinism: Greedy vs Beam.
  • Top-p vs Top-k adaptiveness.
  • Frequency vs Presence penalty.
  • Speculative Decoding acceptance equation.
  • AR vs MLM objective.

Lecture 4 — Advanced

  • DPR contrastive loss.
  • Where does the cross-encoder fit?
  • At what model scale does CoT become emergent?
  • Switch vs Mixtral routing.
  • Zero/one/few-shot definitions.

Lecture 5 — Math

  • Derive softmax + CE gradient.
  • KL asymmetry intuition.
  • Perplexity ↔ cross-entropy.
  • Residual connection's role.
  • LayerNorm vs RMSNorm equations.

System

  • 7 B LoRA memory decomposition (four parts).
  • KV cache size formula.
  • Active parameters of Mixtral 8x7B.
  • Two RAG failure modes.
  • Why "do we still need RAG at 1 M context?" is a yes.

Whatever you cannot answer, return to that lecture's "Self-check" section.


5. Where to go next

Once the core is solid, choose one branch.

5.1 Inference Engineering

  • Resources: vLLM, TensorRT-LLM, GPTQ/AWQ quantisation.
  • Topics: paged KV cache, FlashAttention v3, speculative + MoE.
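
A quick sizing exercise shows why the KV cache is the first topic on this branch. The formula is the standard one (a K and a V tensor per layer, per token); the configuration below is a hypothetical 7B-class shape, not taken from any specific model card.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for seq_len tokens."""
    # The leading 2 accounts for one K and one V tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_value

GIB = 1024 ** 3

# Hypothetical shape: 32 layers, 32 KV heads of dimension 128, 16-bit values.
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
full_ctx = kv_cache_bytes(32, 32, 128, seq_len=4096)
grouped = kv_cache_bytes(32, 8, 128, seq_len=4096)              # fewer KV heads (GQA)
int8 = kv_cache_bytes(32, 32, 128, seq_len=4096, bytes_per_value=1)

print(f"per token      : {per_token / 1024:.0f} KiB")
print(f"4096 tokens    : {full_ctx / GIB:.2f} GiB")
print(f"with 8 KV heads: {grouped / GIB:.2f} GiB")
print(f"with 8-bit KV  : {int8 / GIB:.2f} GiB")
```

The cache grows linearly with sequence length and batch size, which is the pressure that paging, grouped KV heads, and lower-precision caches each relieve in a different way.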

5.2 RAG / Search Engineering

  • Resources: ColBERT, hybrid (BM25 + vector) search, chunking strategies, metadata filters.
  • Topics: retrieval evaluation, multimodal RAG, agentic RAG.

5.3 Post-Training / Alignment

  • Resources: InstructGPT, DPO, Constitutional AI, reward models.
  • Topics: multi-objective safety/honesty/helpfulness.

5.4 Agents and Orchestration

  • Resources: ReAct, Toolformer, function calling, ReST^EM, multi-agent frameworks.
  • Topics: tool use, memory systems, evaluation harnesses.

6. Integrated extra reading

  • Deep Learning (Goodfellow et al., 2016), chapters 6, 8, 10.
  • Speech and Language Processing (Jurafsky & Martin, 3rd ed.), chapters 8–12.
  • Stanford CS336, Language Modeling from Scratch (2024+).
  • Andrej Karpathy, Let's build GPT and Let's build the GPT Tokenizer (YouTube).
  • Anthropic, Building Effective Agents (2024).
  • OpenAI, Spec for the OpenAI API — the canonical reference for decoding parameters.

7. Closing

These six lectures climb the ladder of principles → training → applications → systems. Any new topic — MoE router stability, multimodal attention, 8-bit KV caches, longest-prefix caching — can be re-told with this same scaffolding. Throw me a topic and the series gains one more lecture in the same format.

Learning ends here only as a curriculum. The work begins next.


Part 6 of 6 — the end of the LLM Core Study series. The next series (Inference Engineering / RAG / Alignment / Agents) starts when you do.

