"LLM Core Study (4/6) — Advanced: RAG, CoT, MoE, In-Context Learning"

Four orthogonal techniques sit on top of pretraining: external knowledge (RAG), explicit reasoning steps (CoT), sparse activation (MoE), and prompt-time learning (ICL). In production they are often combined; here we look at each one cleanly.

0. Learning Objectives

Diagram a RAG pipeline and name the metrics for each stage (retriever and reader).
State the difference between Chain-of-Thought (Wei 2022) and Self-Consistency (Wang 2022) in one line.
Write the routing equation for Sparse MoE (Switch / Mixtral) and explain the capacity factor.
Distinguish zero-shot, few-shot, and instruction tuning by what changes in the model.
List two failure modes for each technique.

1. Key Summary

RAG: retrieve relevant documents from an external corpus and inject them into the prompt. Reduces hallucination, staleness, and domain gap.
CoT: generate intermediate reasoning steps before the final answer. Self-Consistency = majority vote over multiple samples.
MoE: activate only a subset of experts per token to enlarge parameter capacity without proportional inference cost.
ICL: change behaviour by examples in the prompt, without changing weights.

2. RAG — Retrieval-Augmented Generation

2.1 Motivation

LLMs know what they were trained on. For anything published after the training cutoff, or for private corpora, the model risks producing plausible-sounding falsehoods. RAG closes the gap by searching at inference time.

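2.2 Pipeline

Offline: chunk the documents, embed each chunk, and store the vectors in an index. Online: query → bi-encoder retriever (top-k candidates) → cross-encoder reranker → prompt assembly → LLM reader → answer.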

2.3 Retriever — bi-encoder

Map queries and documents to the same vector space and score by cosine similarity:

$$ \mathrm{sim}(q, d) = \frac{f_q(q)^\top f_d(d)}{\|f_q(q)\|\|f_d(d)\|}. $$
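
A minimal sketch of the scoring step, assuming the embeddings already exist. The three-dimensional vectors and the helper names `cosine_sim` and `top_k` are made up for illustration; a real system would get vectors from an embedding model and search them with an approximate nearest-neighbour index rather than a full sort.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query_vec, doc_vecs, k):
    """Return the ids of the k documents most similar to the query."""
    scored = sorted(doc_vecs.items(),
                    key=lambda item: cosine_sim(query_vec, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

# Toy embeddings (hypothetical values, not from any real model).
docs = {
    "d1": [1.0, 0.0, 0.0],
    "d2": [0.7, 0.7, 0.0],
    "d3": [0.0, 0.0, 1.0],
}
query = [1.0, 0.2, 0.0]
ranking = top_k(query, docs, k=2)
print(ranking)
```

Because cosine similarity normalises both vectors, only direction matters: scaling a document embedding by a constant does not change its rank.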

Train contrastively against hard negatives:

$$ \mathcal{L} = -\log \frac{\exp(\mathrm{sim}(q, d^+)/\tau)}{\exp(\mathrm{sim}(q, d^+)/\tau) + \sum_{d^-} \exp(\mathrm{sim}(q, d^-)/\tau)}. $$
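
The loss above can be evaluated for a single query as follows. This is a sketch in plain Python, not a training loop: the similarity values and the default `tau=0.05` are illustrative assumptions, and `info_nce` is a hypothetical helper name.

```python
import math

def info_nce(sim_pos, sim_negs, tau=0.05):
    """Contrastive loss for one query: -log softmax of the positive logit."""
    logits = [sim_pos / tau] + [s / tau for s in sim_negs]
    m = max(logits)  # subtract the max before exp for numerical stability
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]

# Easy negatives are far from the query; hard negatives are nearly as close
# as the positive, so they contribute much more to the loss.
easy = info_nce(0.9, [0.1, 0.0])
hard = info_nce(0.9, [0.85, 0.80])
print(easy, hard)
```

The gap between the two values illustrates why hard negatives carry most of the training signal: easy negatives leave the loss near zero and so produce almost no gradient.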

Models in production: DPR (Karpukhin 2020), E5, BGE-M3, GTE.

2.4 Cross-Encoder reranker

A second-stage cross-encoder reads [q] [SEP] [d] together and predicts a fine-grained relevance score. Expensive per call, far more accurate. Typical: bi-encoder top-100 → cross-encoder top-5.
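
The two-stage pattern can be sketched independently of any model. In the example below both scorers are deliberately crude stand-ins (word overlap for the bi-encoder, ordered word-pair overlap for the cross-encoder); all function names and the toy corpus are assumptions for illustration only.

```python
def retrieve_then_rerank(query, corpus, cheap_score, precise_score,
                         n_candidates=100, n_final=5):
    """Stage 1 narrows the corpus cheaply; stage 2 reorders the survivors."""
    candidates = sorted(corpus, key=lambda d: cheap_score(query, d),
                        reverse=True)[:n_candidates]
    return sorted(candidates, key=lambda d: precise_score(query, d),
                  reverse=True)[:n_final]

def word_overlap(query, doc):
    """Stand-in for bi-encoder similarity: number of shared words."""
    return len(set(query.split()) & set(doc.split()))

def bigram_overlap(query, doc):
    """Stand-in for a cross-encoder: shared adjacent word pairs (order-aware)."""
    def bigrams(text):
        words = text.split()
        return set(zip(words, words[1:]))
    return len(bigrams(query) & bigrams(doc))

corpus = [
    "the capital of france is paris",
    "france is famous for wine and the capital markets",
    "berlin is the capital of germany",
    "bananas are yellow",
]
query = "capital of france"
stage1_only = sorted(corpus, key=lambda d: word_overlap(query, d),
                     reverse=True)[:2]
reranked = retrieve_then_rerank(query, corpus, word_overlap, bigram_overlap,
                                n_candidates=3, n_final=2)
print(stage1_only)
print(reranked)
```

The expensive scorer only ever sees `n_candidates` documents, which is what keeps the per-query cost bounded regardless of corpus size.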

2.5 Reader — the generator

The original RAG paper (Lewis 2020) trained the generator jointly with the retriever (RAG-Sequence, RAG-Token). Industrial implementations almost always use frozen LLM + prompt injection: the retrieved docs are pasted into a structured prompt, and a general LLM answers.
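
A minimal sketch of the prompt-injection step, assuming the retrieved passages arrive as plain strings. The template wording, the `[n]` citation convention, and the name `build_rag_prompt` are illustrative choices, not a standard.

```python
def build_rag_prompt(question, docs):
    """Paste numbered passages into a prompt that asks for grounded, cited answers."""
    numbered = "\n\n".join(f"[{i}] {doc}" for i, doc in enumerate(docs, start=1))
    return (
        "Answer the question using only the context below. "
        "Cite sources as [n]. If the context is insufficient, say so.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Hypothetical retrieved passages.
prompt = build_rag_prompt(
    "When was the library first released?",
    ["The library was first released in 2019.", "Version 2 added streaming."],
)
print(prompt)
```

Numbering the passages gives the system-level metric "citation accuracy" something concrete to check against.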

2.6 Evaluation

Retriever: Recall@k, MRR, nDCG.
Reader: faithfulness, answer relevancy, context precision/recall (e.g. RAGAS).
System: latency, cost, hallucination rate, citation accuracy.
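
The retriever metrics are short enough to write out. The sketch below assumes binary relevance labels and computes each metric for a single query; MRR is the mean of `reciprocal_rank` over all queries. The document ids are made up.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant documents that appear in the top k."""
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance nDCG: discounted gain divided by the ideal ordering's gain."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked[:k], start=1)
              if doc_id in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

ranked = ["d3", "d1", "d7", "d2"]   # retriever output, best first
relevant = {"d1", "d2"}             # gold labels for this query
print(recall_at_k(ranked, relevant, 3))
print(reciprocal_rank(ranked, relevant))
print(ndcg_at_k(ranked, relevant, 4))
```

Recall@k ignores order inside the top k, while MRR and nDCG reward placing relevant documents earlier, which matters once a reranker is involved.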

2.7 Limits and failure modes

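Bad chunking. Chunks that are too small break context; chunks that are too large inflate cost and push relevant text into the middle of a long prompt, where recall drops.
Lost in the Middle (Liu 2024). Adding more documents does not monotonically improve answers: evidence placed mid-context is the most likely to be missed.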
Retrieval-answer misalignment. The model can ignore retrieved evidence in favour of pretraining priors.
Domain mismatch. Embedding models trained on web text struggle on legal/medical Korean.
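
To experiment with the chunking trade-off from the first item, a sliding-window chunker is a reasonable starting point. This sketch counts whitespace-separated words as a stand-in for tokens, and the `overlap` parameter is an added assumption (a common way to soften boundary cuts), not something the failure list prescribes.

```python
def chunk_words(text, chunk_size, overlap=0):
    """Split text into windows of chunk_size words, sharing `overlap` words between neighbours."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

text = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(text, chunk_size=4, overlap=1)
print(chunks)
```

Smaller `chunk_size` values yield more, shorter chunks to index; larger values yield fewer chunks but more prompt tokens per retrieved hit.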

2.8 Practice

Index 10K documents with BGE-M3 in FAISS and report Recall@10 on 100 queries.
Measure faithfulness with and without a cross-encoder reranker on the same retrieval.
Sweep chunk sizes 256/512/1024 and observe answer quality changes.

3. Chain-of-Thought (CoT)

3.1 Definition

Wei et al., 2022 showed that asking an LLM to write out reasoning steps before the final answer dramatically improves multi-step accuracy. Same weights, same model — only the prompt changes.

3.2 Triggers

Zero-shot CoT (Kojima 2022): append "Let's think step by step."
Original CoT (Wei 2022): few-shot examples whose answers include reasoning chains.
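
The two triggers differ only in how the prompt is assembled. A sketch, where the `Q:`/`A:` layout and the function names are assumptions for illustration, and the exemplar is made up:

```python
def zero_shot_cot(question):
    """Zero-shot CoT: append the trigger phrase instead of giving examples."""
    return f"Q: {question}\nA: Let's think step by step."

def few_shot_cot(question, exemplars):
    """Few-shot CoT: each exemplar is (question, reasoning, answer)."""
    blocks = [f"Q: {q}\nA: {reasoning} The answer is {answer}."
              for q, reasoning, answer in exemplars]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)

exemplars = [
    ("Tom has 3 apples and buys 2 more. How many does he have?",
     "Tom starts with 3 apples. 3 + 2 = 5.",
     "5"),
]
print(zero_shot_cot("A train travels 60 km in 1.5 hours. What is its speed?"))
print(few_shot_cot("A train travels 60 km in 1.5 hours. What is its speed?", exemplars))
```

In the few-shot variant the reasoning text inside each exemplar is what distinguishes CoT prompting from ordinary few-shot prompting.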

3.3 Why does it work?

Compressing an answer into a single token forces the model to do all computation in hidden state; spelling reasoning out as tokens lets the model reuse its own intermediate outputs.
Each generated reasoning token additionally conditions the next, acting as an external scratchpad.
The effect is most striking in large models (roughly 100 B parameters and up), where Wei 2022 describes it as emergent.

3.4 Self-Consistency (Wang 2022)

CoT under sampling is non-deterministic. Sample $N$ reasoning chains, take the majority answer:

$$ \hat{y} = \arg\max_y \sum_{i=1}^{N} \mathbb{1}[y_i = y]. $$

Accuracy improves; cost scales with $N$.
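
The vote itself is a few lines once final answers are extracted from each chain. The sketch below assumes numeric answers and takes the last number in each chain as the answer, which is a simplifying heuristic; the sampled chains are hard-coded stand-ins for model outputs.

```python
import re
from collections import Counter

def extract_answer(chain):
    """Take the last number in a reasoning chain as its final answer (heuristic)."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", chain)
    return numbers[-1] if numbers else None

def self_consistency(chains):
    """Majority vote over the extracted answers; chains with no answer are skipped."""
    answers = [a for a in map(extract_answer, chains) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]

# Stand-ins for N = 3 sampled chains on the same question.
chains = [
    "3 + 4 = 7, then 7 * 2 = 14. The answer is 14",
    "2 * 3 = 6, then 6 + 4 = 10. The answer is 10",
    "(3 + 4) * 2 = 14. The answer is 14",
]
print(self_consistency(chains))
```

Note that the vote is over final answers, not over reasoning text: two chains that reach the same answer by different routes count as agreeing.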

3.5 Extensions

Tree of Thoughts (Yao 2023): explore reasoning as a tree with a critic.
Self-Refine (Madaan 2023): the model critiques and revises its own answer.
ReAct (Yao 2022): interleaved reasoning and tool calls.

3.6 Limits

Compute: $N$× cost for self-consistency.
A confident but wrong chain can survive self-critique.
Multi-step systems compound errors.

3.7 Practice

On GSM8K, compare accuracy with (a) no CoT, (b) zero-shot CoT, (c) Self-Consistency $N = 10$.
Plot accuracy vs reasoning tokens to estimate marginal value per additional step.

4. Mixture of Experts (MoE)

4.1 Motivation

Buy more capacity without buying proportional inference cost. Solution: activate only a fraction of expert FFNs per token.

4.2 Switch Transformer (Fedus 2022)

Each MoE layer contains $E$ experts and a gating matrix $W_g \in \mathbb{R}^{E \times d}$, which maps a token representation $h \in \mathbb{R}^d$ to one logit per expert:

$$ p(e \mid h) = \mathrm{softmax}(W_g h)_e. $$

Switch uses Top-1 routing: send each token to its highest-scoring expert.

$$ \text{MoE}(h) = p_{e^*}(h) \cdot f_{e^*}(h),\ \ e^* = \arg\max_e p(e \mid h). $$
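
A single-token sketch of Top-1 routing in plain Python. The gate is stored as one weight row per expert, the experts are toy functions, and all names and values are illustrative assumptions; a real layer would batch this over tokens and use learned FFNs.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def switch_layer(h, gate_weights, experts):
    """Route h to its highest-probability expert and scale the output by that probability."""
    logits = [sum(w * x for w, x in zip(row, h)) for row in gate_weights]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    output = [probs[best] * y for y in experts[best](h)]
    return output, best

# Two toy experts over 2-dimensional inputs (hypothetical).
gate_weights = [[1.0, 0.0], [0.0, 1.0]]
experts = [lambda h: [2 * x for x in h], lambda h: [-x for x in h]]
output, chosen = switch_layer([3.0, 1.0], gate_weights, experts)
print(chosen, output)
```

Multiplying by the gate probability is what keeps the router trainable: the argmax itself has no gradient, but the scale factor does.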

4.3 Mixtral (Jiang 2024)

$E = 8$, Top-2 routing: two experts per token, weighted by their gates:

$$ \text{MoE}(h) = \sum_{e \in \mathrm{Top2}(p)} p_e \cdot f_e(h). $$

Top-2 routing tends to train more stably than Top-1, and it costs less compute per token than a dense FFN with the same total parameter count.
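
A single-token sketch of Top-2 routing. One detail is an assumption worth flagging: the equation above weights experts by the raw gate probabilities $p_e$, whereas the Mixtral paper applies the softmax over the two selected logits, which is the same as renormalising the two probabilities so they sum to 1. The hypothetical `renormalize` flag switches between the two readings.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top2_layer(h, gate_weights, experts, renormalize=True):
    """Combine the two highest-probability experts with gate weights."""
    logits = [sum(w * x for w, x in zip(row, h)) for row in gate_weights]
    probs = softmax(logits)
    top2 = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:2]
    weights = [probs[e] for e in top2]
    if renormalize:
        total = sum(weights)
        weights = [w / total for w in weights]
    outputs = [experts[e](h) for e in top2]
    combined = [sum(w * out[i] for w, out in zip(weights, outputs))
                for i in range(len(outputs[0]))]
    return combined, top2, weights

# Three toy experts over 2-dimensional inputs (hypothetical).
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
experts = [
    lambda h: [2 * x for x in h],
    lambda h: [-x for x in h],
    lambda h: [0.0 for _ in h],
]
combined, top2, weights = top2_layer([2.0, 1.0], gate_weights, experts)
print(top2, weights, combined)
```

The third expert is never evaluated for this token, which is where the compute saving comes from.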

4.4 Capacity factor and load balancing

Experts must receive roughly balanced loads. Capacity per expert is

$$ \text{capacity} = C \cdot \frac{N_{\text{tokens}}}{E}. $$

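With $C = 1.0$ each expert has room only for a perfectly even split, so any imbalance overflows and the excess tokens are dropped (they skip the expert and only the residual connection passes through). $C = 1.25$ leaves a 25% buffer for overflow.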
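
A worked example of the capacity rule with 8 tokens and 4 experts. The sketch assumes tokens are dispatched in order and that capacity is rounded up; implementations differ on rounding, so treat that as an assumption. The routing decisions are made up.

```python
import math

def expert_capacity(n_tokens, n_experts, capacity_factor):
    """Maximum number of tokens one expert accepts in a batch (rounded up here)."""
    return math.ceil(capacity_factor * n_tokens / n_experts)

def dispatch(assignments, n_experts, capacity):
    """Fill experts in token order; tokens arriving at a full expert are dropped."""
    load = [0] * n_experts
    kept, dropped = [], []
    for token, expert in enumerate(assignments):
        if load[expert] < capacity:
            load[expert] += 1
            kept.append(token)
        else:
            dropped.append(token)
    return kept, dropped

# Expert chosen by the router for each of 8 tokens; expert 0 is over-subscribed.
assignments = [0, 0, 0, 1, 1, 2, 3, 0]
for c in (1.0, 1.25, 2.0):
    cap = expert_capacity(len(assignments), 4, c)
    _, dropped = dispatch(assignments, 4, cap)
    print(c, cap, dropped)
```

Raising $C$ reduces drops but reserves buffer space for every expert, so the gain in token coverage is paid for in memory and padded compute.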

Add an auxiliary load-balancing loss:

$$ \mathcal{L}_{\text{aux}} = E \cdot \sum_{e=1}^{E} f_e \cdot P_e, $$

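where $f_e$ here is the fraction of tokens routed to expert $e$ (not the expert function $f_e(h)$ used in the routing equations) and $P_e$ is the mean gate probability for expert $e$ over the batch. The term is smallest under uniform routing and is added to the training loss scaled by a small coefficient (Switch uses $\alpha = 10^{-2}$).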
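
A sketch of the auxiliary loss on two tiny made-up batches, one balanced and one collapsed onto a single expert. The scaling coefficient applied before adding the term to the training loss is left out, and the function name is an assumption.

```python
def load_balancing_loss(router_probs, assignments, n_experts):
    """E * sum_e (fraction of tokens sent to e) * (mean gate probability of e)."""
    n_tokens = len(assignments)
    frac = [assignments.count(e) / n_tokens for e in range(n_experts)]
    mean_prob = [sum(p[e] for p in router_probs) / n_tokens
                 for e in range(n_experts)]
    return n_experts * sum(f * p for f, p in zip(frac, mean_prob))

# Balanced: gate is uniform and tokens split evenly across 2 experts.
balanced = load_balancing_loss([[0.5, 0.5]] * 4, [0, 1, 0, 1], 2)

# Collapsed: gate strongly prefers expert 0 and every token goes there.
collapsed = load_balancing_loss([[0.9, 0.1]] * 4, [0, 0, 0, 0], 2)
print(balanced, collapsed)
```

The balanced case gives 1.0 and the collapsed case gives a larger value, so minimising the term pushes the router toward even usage.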

4.5 Memory vs inference cost

Memory: all $E$ experts live in GPU memory.
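Inference: only the activated experts run their matmuls. Mixtral 8x7B has ~47 B total parameters and ~13 B active per token. The 2/8 ratio applies to the expert FFNs only; attention and embeddings are shared and always active.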
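
The ~13 B figure can be reproduced with back-of-envelope arithmetic. The hyperparameters below are assumptions recalled from the published Mixtral 8x7B configuration (hidden size 4096, FFN size 14336, 32 layers, grouped-query attention with 8 KV heads, vocabulary 32000); normalisation weights are ignored, so the result is approximate.

```python
# Assumed Mixtral 8x7B hyperparameters (verify against the released config).
d_model, d_ff, n_layers = 4096, 14336, 32
n_experts, top_k = 8, 2
n_heads, n_kv_heads, head_dim = 32, 8, 128
vocab = 32000

expert_params = 3 * d_model * d_ff   # gated FFN: gate, up, and down projections
attn_params = d_model * head_dim * (2 * n_heads + 2 * n_kv_heads)  # q, o and k, v
router_params = d_model * n_experts
embed_params = 2 * vocab * d_model   # input embedding plus output head

# Everything except the expert FFNs runs for every token.
shared = n_layers * (attn_params + router_params) + embed_params
total = shared + n_layers * n_experts * expert_params
active = shared + n_layers * top_k * expert_params

print(f"total  ~ {total / 1e9:.1f} B")
print(f"active ~ {active / 1e9:.1f} B")
print(f"active / total = {active / total:.2f}")
```

Under these assumptions the active share comes out slightly above 2/8, because the shared attention and embedding weights are used by every token.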

4.6 Limits

Dead experts (always under-routed).
Distributed training requires all-to-all token shuffles.
Routing is sensitive to quantisation.

4.7 Practice

Implement a 4-expert Top-1 MoE FFN block in PyTorch.
Compare a dense FFN to a MoE FFN on the same data and budget.

5. In-Context Learning (ICL)

5.1 Definition

The ability to perform a task by reading examples in the prompt, without weight updates. Reported as emergent in GPT-3 (Brown 2020).

5.2 Zero / One / Few-shot

Zero-shot: instruction only.
One-shot: instruction + one example.
Few-shot: instruction + several examples.
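
The three settings are the same prompt template with a different number of examples $k$. A sketch, where the `Input:`/`Label:` layout, the function name, and the sentiment examples are illustrative assumptions:

```python
def build_prompt(instruction, examples, query, k):
    """k = 0 gives zero-shot, k = 1 one-shot, k > 1 few-shot."""
    lines = [instruction, ""]
    for text, label in examples[:k]:
        lines += [f"Input: {text}", f"Label: {label}", ""]
    lines += [f"Input: {query}", "Label:"]
    return "\n".join(lines)

instruction = "Classify the sentiment as positive or negative."
examples = [
    ("I loved this film.", "positive"),
    ("The plot made no sense.", "negative"),
    ("A wonderful surprise.", "positive"),
]
for k in (0, 1, 3):
    print(build_prompt(instruction, examples, "Not worth the ticket.", k))
    print("---")
```

Nothing about the model changes between the three calls; only the prompt grows, which is the defining property of ICL.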

5.3 Working hypotheses

The model has implicitly meta-learned over its pretraining corpus; the prompt becomes a tiny "training set" that attention processes via what behaves like gradient descent (von Oswald 2023).
Or: examples activate the right representational sub-space and format the output.

5.4 Instruction tuning

Fine-tune on (instruction, output) pairs to make zero-shot work without examples. FLAN (Wei 2021), Alpaca, Vicuna, Tulu, and most chat-tuned models follow this pattern.

5.5 RLHF / DPO (brief)

Further alignment with human or AI preferences. Detailed treatment is deferred; the key constraint is a KL penalty keeping the policy close to a reference model.

5.6 Limits

Long prompts cost money and suffer the same lost-in-the-middle recall drop as long RAG contexts.
Example order matters more than intuition suggests (Lu 2022).
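ICL cannot truly learn tasks that lie beyond what the pretrained model can already represent; with no weight updates, it can only recombine existing capabilities.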

5.7 Practice

Plot accuracy vs k for k ∈ {0, 1, 4, 8} on a classification task.
Shuffle few-shot example order 10 times and measure variance.

6. How they combine in practice

RAG fills knowledge.
ICL fills format / style.
CoT fills reasoning.
MoE is an internal cost-efficiency lever.

7. Quick Recap — Answer Before You Peek

Five core questions this article answered. Cover the answers, give a one-line response yourself, then check.

Q1. Describe the RAG retriever / reader pipeline in one line.

Answer: Retriever: k-NN search against a document-embedding index using the query embedding to fetch top-k docs. Reader: the LLM produces an answer using the retrieved docs as context.
Why: k-NN over embeddings (see ML Foundations Part 2 §7.4). Learned dense retrievers like DPR (Karpukhin 2020) outperform sparse retrievers (BM25) at semantic matching. (Section §2.)

Q2. Write the DPR contrastive loss.

Answer: $\mathcal{L} = -\log \frac{e^{\mathrm{sim}(q, d^+)/\tau}}{e^{\mathrm{sim}(q, d^+)/\tau} + \sum_{d^-} e^{\mathrm{sim}(q, d^-)/\tau}}$ (InfoNCE form; DPR itself uses a dot-product similarity with no temperature, i.e. $\tau = 1$).
Why: The loss lifts the positive document $d^+$ for a query $q$ and pushes down the negatives $d^-$. Negative sampling is the lever: hard negatives (top BM25 results that are not the gold answer) determine quality. Same family as metric learning in ML Foundations Part 2. (Section §2.3.)

Q3. What does "Lost in the Middle" mean for RAG design?

Answer: Information placed in the middle of the context is often missed. Best practice: re-rank and place the most relevant documents at the start and end.
Why: Liu et al. 2024. Attention biases toward the ends due to training distributions (Part 1 §6 Positional Encoding). RAG systems use rerankers (Cohere, Voyage) to compress top-k into top-5 and reorder. (Section §2.7.)

Q4. Switch vs Mixtral — two lines.

Answer: Switch (Fedus 2022): activates one expert per token (top-1 routing). Mixtral (Jiang 2024): activates two experts per token (top-2 routing) and combines them via a gate-weighted sum.
Why: Mixtral's top-2 routing trains two experts per token and avoids single-point-of-failure routing. The "8x7B" name suggests 56B, but attention and embeddings are shared across experts, so the total is ~47B parameters with only ~13B active per token. The saving is in compute, not memory, because all experts must stay loaded. (Section §4.)

Q5. Zero / one / few-shot definitions, plus when each advanced technique fails?

Answer: Zero: no examples. One: a single example. Few: two to a few dozen. Failure modes: RAG (bad retrieval produces confident hallucination), CoT (distracting on tasks with short answers), MoE (minimal gain on small models, <7B), ICL (pattern mimicry only on very OOD tasks).
Why: The four techniques shine in combination: RAG for external knowledge, CoT for reasoning, MoE for efficiency, ICL for domain adaptation. Each individually has these failure modes. (Sections §2·§3·§4·§5.)

If four or five came out as one-liners, the RAG·CoT·MoE·ICL composition strategy is in place.

8. Further reading

Lewis et al., 2020, RAG. arXiv:2005.11401.
Karpukhin et al., 2020, DPR. arXiv:2004.04906.
Wei et al., 2022, Chain-of-Thought. arXiv:2201.11903.
Wang et al., 2022, Self-Consistency. arXiv:2203.11171.
Kojima et al., 2022, Zero-shot CoT. arXiv:2205.11916.
Fedus et al., 2022, Switch Transformer. arXiv:2101.03961.
Jiang et al., 2024, Mixtral. arXiv:2401.04088.
Brown et al., 2020, GPT-3. arXiv:2005.14165.
Wei et al., 2021, FLAN. arXiv:2109.01652.
von Oswald et al., 2023, Transformers Learn In-Context by Gradient Descent. arXiv:2212.07677.
LangChain RAG tutorial: python.langchain.com/docs/use_cases/question_answering.

Part 4 of 6 in the LLM Core Study series. Part 5 turns to the math: softmax, cross-entropy, KL, gradients, LayerNorm.
