"LLM Core Study (6/6) — Learning Roadmap: 12-week Course, 10 Papers, 10 Repos"

May 05, 2026

Parts 1–5 covered principles. This finale orders them into a 12-week schedule with paper-reading, code-cloning, and experiment loops, then ranks the 10 must-read papers and 10 must-touch repositories, and finishes with a self-evaluation and next-step branches.

0. Learning Objectives

See, at a glance, what to do each week for 12 weeks.
Understand why the papers are read in this particular order.
Match each repository to the right phase of learning.
Use the self-evaluation to decide which deep-dive branch (inference, RAG, alignment, agents) comes next.

1. The 12-week course

Three activities per week: read, clone, experiment. Keep a short retro note each week — what built, what broke, what's next.

1.1 Weeks 0–2 — Build a tiny transformer

Read: Vaswani 2017; review Lecture 1.
Clone: retype nanoGPT by hand into an empty file, without copy-pasting.
Experiment: train on Tiny Shakespeare; reach perplexity ≤ 10.
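
The perplexity target translates directly into a loss target, because perplexity is the exponential of the mean per-token cross-entropy. A minimal sketch, assuming the loss is measured in nats and averaged per token (per character in a character-level setup); the function name `perplexity` is illustrative and not part of nanoGPT:

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    if not token_nlls:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_nlls) / len(token_nlls))

# A perplexity target of 10 corresponds to a mean loss of ln(10) nats.
target_loss = math.log(10)
```

Under these assumptions, watching the validation loss fall below `target_loss` (about 2.30) is equivalent to meeting the perplexity goal.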

1.2 Weeks 3–4 — Tokenisers and embeddings

Read: Sennrich 2016 (BPE), Kudo & Richardson 2018 (SentencePiece).
Clone: implement BPE in under 50 lines.
Experiment: compare English vs Korean compression at three vocab sizes.
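
One possible shape for the under-50-lines exercise is a byte-level BPE trainer. This is a sketch under stated assumptions, not the implementation from either paper: it starts from UTF-8 bytes, merges the most frequent adjacent pair greedily, and stops when no pair repeats. The names `train_bpe`, `merge`, and `compression_ratio` are hypothetical.

```python
from collections import Counter

def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges; returns (merges, encoded ids)."""
    ids = list(text.encode("utf-8"))       # start from raw bytes (ids 0..255)
    merges = {}                             # (left_id, right_id) -> new_id
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)    # most frequent adjacent pair
        if pairs[best] < 2:
            break                           # no pair repeats, nothing to gain
        new_id = 256 + len(merges)
        merges[best] = new_id
        ids = merge(ids, best, new_id)
    return merges, ids

def compression_ratio(text, num_merges):
    """UTF-8 bytes per token after training on the same text."""
    _, ids = train_bpe(text, num_merges)
    return len(text.encode("utf-8")) / len(ids)
```

For the English vs Korean comparison, note that a Hangul syllable occupies three bytes in UTF-8, so bytes per token and characters per token can tell different stories; recording both is a reasonable choice.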

1.3 Weeks 5–6 — Decoding strategies

Read: Holtzman 2020 (Top-p), Leviathan 2023 (Speculative).
Clone: write Greedy / Beam / Top-p / Min-p in one decoder module.
Experiment: compare BLEU and repetition rate across the four.
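
The sampling half of the decoder module can be sketched as filters over a next-token probability list. This is an illustration in plain Python under simplifying assumptions: the input is already a normalised distribution, beam search is omitted because it needs model calls over several hypotheses, and all function names are hypothetical.

```python
import random

def _renormalise(probs, kept):
    """Zero out dropped tokens and rescale the kept ones to sum to 1."""
    keep = set(kept)
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def greedy(probs):
    """Pick the single most likely token."""
    return max(range(len(probs)), key=lambda i: probs[i])

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= p:
            break
    return _renormalise(probs, kept)

def min_p_filter(probs, min_p=0.1):
    """Keep tokens whose probability is at least min_p times the maximum."""
    threshold = min_p * max(probs)
    kept = [i for i, q in enumerate(probs) if q >= threshold]
    return _renormalise(probs, kept)

def sample(probs, rng=random):
    """Draw one token index from the (filtered) distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Keeping the filters separate from `sample` lets the same sampling call serve both Top-p and Min-p, which makes the four-way comparison easier to run with a fixed seed.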

1.4 Weeks 7–8 — Fine-tuning (LoRA, QLoRA)

Read: Hu 2021, Dettmers 2023.
Clone: fine-tune a 7B model with LoRA on Korean instruction data.
Experiment: sweep \(r \in \{4, 8, 32\}\); record loss and MMLU.
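
Before running the sweep it helps to know how many trainable parameters each rank implies. For a weight of shape \(d_{out} \times d_{in}\), LoRA adds \(r(d_{in} + d_{out})\) parameters. The sketch below assumes a 7B-class shape that the source does not specify (hidden size 4096, 32 layers, adapters on two square projections per layer); treat those numbers and the function names as placeholders.

```python
def lora_params(d_in, d_out, r):
    """Trainable parameters added by one LoRA pair: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)

def lora_sweep(ranks=(4, 8, 32), hidden=4096, layers=32, matrices_per_layer=2):
    """Total LoRA parameters per rank under the assumed model shape."""
    per_matrix = {r: lora_params(hidden, hidden, r) for r in ranks}
    return {r: layers * matrices_per_layer * n for r, n in per_matrix.items()}

counts = lora_sweep()
```

The count grows linearly in \(r\), so the step from 8 to 32 quadruples the adapter size; logging it next to loss and MMLU makes the cost of each point in the sweep explicit.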

1.5 Weeks 9–10 — RAG

Read: Lewis 2020 (RAG), Karpukhin 2020 (DPR).
Clone: build FAISS + BGE-M3 + LLM into a working RAG pipeline.
Experiment: sweep chunk size and top-k; record how faithfulness changes.
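
The sweep itself is a small grid over chunk size and top-k. The sketch below shows only the grid's shape: it swaps a toy bag-of-words cosine score in for BGE-M3 embeddings and FAISS search, and it leaves out the LLM call and faithfulness scoring. All function names are hypothetical.

```python
import math
from collections import Counter
from itertools import product

def chunk_words(text, size):
    """Split text into consecutive chunks of `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k):
    """Return the k chunks most similar to the query (toy stand-in for vector search)."""
    q = Counter(query.lower().split())
    ranked = sorted(chunks, key=lambda c: cosine(q, Counter(c.lower().split())), reverse=True)
    return ranked[:k]

def retrieval_sweep(text, query, sizes=(4, 8), ks=(1, 2)):
    """Retrieved chunks for every (chunk size, top-k) combination."""
    return {(s, k): retrieve(query, chunk_words(text, s), k) for s, k in product(sizes, ks)}
```

In a real pipeline each grid cell would feed its retrieved chunks to the LLM and a faithfulness metric, and the deltas between cells are what the experiment records.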

1.6 Weeks 11–12 — Evaluation and operations

Read: Lecture 4 RAGAS section + Lecture 5 perplexity/calibration.
Clone: bolt a harness onto your own RAG.
Experiment: build a dashboard that tracks three axes: hallucination rate, latency, and cost.

2. Ten papers — read order

1. Vaswani et al., 2017, Attention Is All You Need — start of Lecture 1. arXiv:1706.03762.
2. Sennrich et al., 2016, NMT of Rare Words with Subword Units — BPE origin. arXiv:1508.07909.
3. Su et al., 2021, RoFormer: Rotary Position Embedding — Lecture 1 §6. arXiv:2104.09864.
4. Hu et al., 2021, LoRA — Lecture 2. arXiv:2106.09685.
5. Dettmers et al., 2023, QLoRA — Lecture 2. arXiv:2305.14314.
6. Holtzman et al., 2020, The Curious Case of Neural Text Degeneration — Lecture 3. arXiv:1904.09751.
7. Lewis et al., 2020, Retrieval-Augmented Generation — Lecture 4. arXiv:2005.11401.
8. Wei et al., 2022, Chain-of-Thought Prompting — Lecture 4. arXiv:2201.11903.
9. Fedus et al., 2022, Switch Transformer — Lecture 4 (MoE). arXiv:2101.03961.
10. He et al., 2016, Deep Residual Learning — Lecture 5 (stability). arXiv:1512.03385.

Order principle: inside the model → training → applications → systems. Bonus list (read after the core 10):

11. Brown 2020 (GPT-3). 12. Ouyang 2022 (InstructGPT). 13. Rafailov 2023 (DPO). 14. Touvron 2023 (LLaMA 2). 15. Liu 2023 (Lost in the Middle).

3. Ten repositories

Read code, change one thing, run.

1. karpathy/nanoGPT — a roughly 300-line GPT model definition (model.py), the entire Lecture 1 in one file. github.com/karpathy/nanoGPT.
2. karpathy/minbpe — minimal BPE tokenizer. github.com/karpathy/minbpe.
3. huggingface/transformers — industrial-strength reference; generation/utils.py is Lecture 3 in code. github.com/huggingface/transformers.
4. huggingface/peft — LoRA / Adapter / Prefix-Tuning library. github.com/huggingface/peft.
5. TimDettmers/bitsandbytes — NF4, double-quant, paged optimisers. github.com/TimDettmers/bitsandbytes.
6. Dao-AILab/flash-attention — FlashAttention v2 / v3. github.com/Dao-AILab/flash-attention.
7. vllm-project/vllm — production LLM serving (PagedAttention + speculative). github.com/vllm-project/vllm.
8. facebookresearch/faiss — vector search baseline. github.com/facebookresearch/faiss.
9. langchain-ai/langchain — RAG and agent orchestration. github.com/langchain-ai/langchain.
10. mistralai/mistral-inference — official Mixtral/MoE inference. github.com/mistralai/mistral-inference.

Suggested order: 1 → 2 → 3 → 4 → 5 → 6 → 7 → (8, 9 in parallel) → 10.

4. Self-evaluation

4.0 Series-wide Quick Recap — Answer Before You Peek (with answers)

The five core questions this series answered. Before tackling the 30-prompt detailed evaluation below, give a one-liner for each and compare.

Q1. What are the four core components of a Transformer block?

Answer: Tokenization · Embedding · Self-Attention (+ positional encoding) · Feed-Forward MLP, all wrapped by Residual + LayerNorm.

Why: This is the entire internal flow covered in Part 1. The FFN holds roughly 2/3 of model parameters; attention routes information, and the MLP performs the semantic transformation. (Part 1 §7 Synthesis.)
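
The two-thirds figure can be reproduced with a quick count. This is a worked check under common assumptions that the answer does not spell out: model width \(d\), four \(d \times d\) attention projections (\(W_Q, W_K, W_V, W_O\)), an FFN with inner width \(4d\), and biases, norms, and embeddings ignored.

\[
\underbrace{4d^2}_{\text{attention}} + \underbrace{2 \cdot d \cdot 4d}_{\text{FFN}} = 12d^2,
\qquad
\frac{8d^2}{12d^2} = \frac{2}{3}.
\]

A gated FFN with three matrices and inner width \(\tfrac{8}{3}d\) lands on the same \(8d^2\), so the ratio survives that variant.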

Q2. How does QLoRA cut memory by roughly 10× compared to full fine-tuning?

Answer: Three moves applied together: (a) freeze the base model in 4-bit NF4, so it carries no gradient or optimizer cost, (b) train only about 0.1% of the parameters via LoRA, (c) let the Paged Optimizer swap optimizer state between CPU and GPU.

Why: Full fine-tuning's 56 GB comes from the 4× optimizer state (Part 2 §3). QLoRA removes nearly all of it by shrinking the set of trainable parameters. (Part 2 §6.)
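
One way to read the numbers in this answer is as simple byte arithmetic. The sketch below assumes a 7B-parameter model, fp16 weights, and two fp32 AdamW moments per trainable parameter; Part 2 may account for memory differently, and gradients, activations, and quantisation constants are left out. The helper names are hypothetical.

```python
GB = 1e9  # decimal gigabytes, to match the round figures in the text

def weights_gb(n_params, bits):
    """Memory for the weights alone at the given precision."""
    return n_params * bits / 8 / GB

def optimizer_state_gb(n_trainable, bytes_per_state=4, states=2):
    """AdamW keeps two moments (m, v) per trainable parameter."""
    return n_trainable * bytes_per_state * states / GB

n = 7e9                                      # assumed model size
fp16_weights = weights_gb(n, 16)             # about 14 GB
full_ft_state = optimizer_state_gb(n)        # about 56 GB, i.e. 4x the fp16 weights
nf4_weights = weights_gb(n, 4)               # about 3.5 GB for the frozen base
lora_state = optimizer_state_gb(0.001 * n)   # about 0.06 GB at 0.1% trainable
```

Under these assumptions the optimizer state drops by three orders of magnitude and the base weights by a factor of four, which is where most of the saving comes from.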

Q3. What is the central trade-off in decoding-algorithm selection?

Answer: Confident distributions (translation, summarization) → Beam/Greedy. Creative generation (writing) → Top-p + Temperature. Speculative Decoding cuts latency directly and can be combined with either family.

Why: Determinism vs diversity is one axis; latency is another. Beam is deterministic but diversity-poor; stochastic sampling is the inverse. Repetition penalties tune both. (Part 3 §3·§4·§5·§6.)

Q4. How do the four advanced techniques (RAG · CoT · MoE · ICL) combine in practice?

Answer: RAG injects external knowledge, CoT spells out reasoning steps, MoE delivers efficiency + capacity, ICL provides adaptation. They solve different problems and compose.

Why: Any single technique used as a universal cure fails. RAG alone cannot reason; CoT alone lacks external knowledge. Modern LLM systems typically combine RAG + CoT + ICL. (Part 4 §6.)

Q5. The five mathematical pillars under LLM training and inference?

Answer: Softmax-CE (classification standard + clean gradient \(p - y\)) · KL Divergence (distribution distance, asymmetric) · Perplexity (\(e^{\mathrm{CE}}\), evaluation metric) · Residual + LayerNorm/RMSNorm (gradient flow + distribution stabilization) · AdamW (decoupled weight decay).

Why: All five operate simultaneously in any Transformer training run. Drop any one and deep-network training collapses. (Part 5 §2·§3·§4·§5·§6·§7·§8.)
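
Two of these claims are easy to check numerically: that the softmax cross-entropy gradient with respect to the logits is \(p - y\), and that KL divergence is asymmetric. A small sketch in plain Python, using central finite differences as the reference; the function names are illustrative.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(z, target):
    """Negative log-probability of the target class."""
    return -math.log(softmax(z)[target])

def analytic_grad(z, target):
    """Closed form: p - y, where y is the one-hot target."""
    return [p_i - (1.0 if i == target else 0.0) for i, p_i in enumerate(softmax(z))]

def numeric_grad(z, target, eps=1e-6):
    """Central finite-difference estimate of d(CE)/d(logits)."""
    grads = []
    for i in range(len(z)):
        up = z[:i] + [z[i] + eps] + z[i + 1:]
        down = z[:i] + [z[i] - eps] + z[i + 1:]
        grads.append((cross_entropy(up, target) - cross_entropy(down, target)) / (2 * eps))
    return grads

def kl(p, q):
    """KL(p || q) in nats; terms with p_i = 0 contribute nothing."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)
```

Comparing `analytic_grad` with `numeric_grad` on any logit list, and `kl(p, q)` with `kl(q, p)` on two different distributions, confirms both properties without a deep-learning framework.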

If a question stalled you, return to the matching section in Parts 1–5 (Section 9).

4.1 Detailed self-evaluation — 30 prompts (answer each in under 30 seconds; five per lecture, plus five system-level prompts)

Lecture 1 — Fundamentals

Write the tokenizer function signature.
What breaks without the \(\sqrt{d_k}\) scaling?
Relate head count and parameter count in MHA.
Compare RoPE and ALiBi by where they act.
Why does "Lost in the Middle" happen?

Lecture 2 — Fine-tuning

Ratio of AdamW state to weights.
Shapes and initialisation of \(A, B\) in LoRA.
Equation that "merges" LoRA into the base.
NF4 vs INT4 in one line.
Role of \(T^2\) in distillation.

Lecture 3 — Decoding

Determinism: Greedy vs Beam.
Top-p vs Top-k adaptiveness.
Frequency vs Presence penalty.
Speculative Decoding acceptance equation.
AR vs MLM objective.

Lecture 4 — Advanced

DPR contrastive loss.
Where does the cross-encoder fit?
At what model scale does CoT become emergent?
Switch vs Mixtral routing.
Zero/one/few-shot definitions.

Lecture 5 — Math

Derive softmax + CE gradient.
KL asymmetry intuition.
Perplexity ↔ cross-entropy.
Residual connection's role.
LayerNorm vs RMSNorm equations.

System

7B LoRA memory decomposition (four parts).
KV cache size formula.
Active parameters of Mixtral 8x7B.
Two RAG failure modes.
Why "do we still need RAG at 1M context?" is a yes.

Whatever you cannot answer, return to that lecture's "Self-check" section.

5. Where to go next

Once the core is solid, choose one branch.

5.1 Inference Engineering

Resources: vLLM, TensorRT-LLM, GPTQ/AWQ quantisation.
Topics: paged KV cache, FlashAttention v3, speculative + MoE.

5.2 RAG / Search Engineering

Resources: ColBERT, hybrid (BM25 + vector) search, chunking strategies, metadata filters.
Topics: retrieval evaluation, multimodal RAG, agentic RAG.

5.3 Post-Training / Alignment

Resources: InstructGPT, DPO, Constitutional AI, reward models.
Topics: multi-objective safety/honesty/helpfulness.

5.4 Agents and Orchestration

Resources: ReAct, Toolformer, function calling, ReST^EM, multi-agent frameworks.
Topics: tool use, memory systems, evaluation harnesses.

6. Integrated extra reading

Deep Learning (Goodfellow et al., 2016), chapters 6, 8, 10.
Speech and Language Processing (Jurafsky & Martin, 3rd ed.), chapters 8–12.
Stanford CS336, Language Modeling from Scratch (2024+).
Andrej Karpathy, Let's build GPT and Let's build the GPT Tokenizer (YouTube).
Anthropic, Building Effective Agents (2024).
OpenAI, Spec for the OpenAI API — the canonical reference for decoding parameters.

7. Closing

These six lectures climb the ladder of principles → training → applications → systems. Any new topic — MoE router stability, multimodal attention, 8-bit KV caches, longest-prefix caching — can be re-told with this same scaffolding. Throw me a topic and the series gains one more lecture in the same format.

Learning ends here only as a curriculum. The work begins next.

Part 6 of 6 — the end of the LLM Core Study series. The next series (Inference Engineering / RAG / Alignment / Agents) starts when you do.
