"RAG Core Study (1/26) — What RAG Is and Why You Need It"
Series one-liner — "Find the evidence that fits the question, and never answer outside that evidence."
This 26-part series is for engineers and researchers who must actually build and operate a RAG (Retrieval-Augmented Generation) system. Part 1 keeps a single goal in view: what RAG is, why it appeared, and when you should not use it. Code stays to a single call, and the math reduces to Lewis 2020's one formula. If "why do we need RAG?" lands in your hands here, every decision in the next 25 parts — chunking, embedding, hybrid search, reranking, routing — gains a reference point.
0. Prerequisites
Three concepts make this article easier to read. You can survive without them, but they make the reasoning behind each design choice quicker to follow.
- Tokenization: how text is split into subword units before reaching the model. Covered in AI Basics (1/11) — LLM Fundamentals 2026 and in LLM Core Study (1/6) — Tokenization · Embedding · Attention · Positional Encoding.
- Context Window: the upper bound on tokens the model can see in one shot (8K to 1M nowadays). The "long-context vs RAG" tradeoff lives on this axis.
- Hallucination: when the model produces plausible but false statements. §8.2 returns to the point that RAG does not fully eliminate hallucinations.
A one-line definition for each is enough. On top of those definitions, this article builds a separate abstraction layer called the five elements of RAG.
1. Learning Objectives
After this article, you should be able to explain three things in your own words.
- The five elements of RAG — Retriever, Generator, Knowledge Base, Context, Grounding — each in a single line, and how they combine into one answer.
- Why RAG emerged — to address two problems at once: knowledge that did not make it into the model's weights (recent, internal, regulatory) and over-confident answers without evidence.
- When not to use RAG — narrow domains where parametric memory suffices, one-shot answers that need no provenance, and cases where long-context cost beats RAG infrastructure cost.
These three are the entry point for the entire 26-part series. Whichever chunking strategy, embedding, or vector DB you eventually choose, those decisions become cost without reason unless "why RAG in the first place" is settled.
2. Key Summary: One Paragraph
RAG stands for Retrieval-Augmented Generation. One line: "Find external documents that match the question, then have the LLM answer strictly inside that evidence." Two motivations brought it here. First, knowledge created after training — yesterday's internal doc, today's regulatory update, a meeting summary from 30 minutes ago — never enters the model's weights. Second, the same model produces confident plausible answers even when it has no grounds. That is hallucination. RAG's one-line prescription, "search → put the result into context → only answer from there," reduces both at once. The catch: when retrieval itself is wrong, hallucinations do not disappear — they become more plausible. That is why 25 of the 26 parts concentrate on making retrieval correct.
3. Intuition — The Desk Student and the Library Student
Two students sit for the same exam.
- Student A (LLM alone, parametric memory): the desk holds only the notes the student memorised. Fast and accurate within that range. But when a question outside the notes arrives, instead of saying "I don't remember," the student invents an answer. That is hallucination.
- Student B (RAG): the same notes, plus the option to walk to the library, open relevant books, and come back. It takes more time, and if the wrong book is opened, the wrong answer is delivered with confidence. But on unfamiliar topics there is now a chance of producing an evidence-backed answer.
The essence of RAG is "turning the desk student into a library student." Which shelf to walk to, which book to open, which page to read — each of those is a separate decision. The 26 parts of this series unfold that decision tree, one branch at a time.
Core message (note §34): "Find the evidence that fits the question, and never answer outside that evidence." — every later decision sits below this line.
The student analogy maps to two flows. Student A: question → LLM → answer. Student B: question → retrieval → context → LLM → answer.
Same model, different answers — because the procedure changed, not the model. RAG is a procedural design.
4. Definition — The Five Elements of RAG
Reduced from Lewis et al. 2020 to five words.
| Element | One-line Definition | Primary Source |
|---|---|---|
| Retriever | A function that takes a query and returns the top-K relevant documents | Lewis 2020 §2 |
| Generator | An LLM that takes (Query + Context) and produces an answer | Lewis 2020 §2 |
| Knowledge Base | The store holding external documents and embedding indexes | Khandelwal 2020 §1 |
| Context | The excerpt injected into the Generator input, usually top-K passages | Liu 2023 §2 |
| Grounding | The property of staying inside the evidence — a behavioural contract | Asai 2023 §3 |
The five elements never operate in isolation. They form a single flow that produces one answer.
The most commonly missed distinction is between Knowledge Base and Context. Knowledge Base is storage; Context is excerpt. Retrieval decides "where to fetch from"; context assembly decides "which excerpts, in which order, in what shape." The two have different costs and different failure modes. Lost in the Middle in §8.1 is, strictly speaking, a context-side failure.
A second frequent question: "How do we enforce Grounding in code?" Honest answer: we don't fully. The Generator is still an LLM and can pull from its parametric memory even when the system prompt says "use only the context." That is why RAG also needs prompt design ("ignore information outside the provided context"), citation enforcement (require the answer to cite which excerpt it came from), and verification stages (Self-RAG, Corrective RAG). Parts 22 (Corrective), 23 (Graph), and 24 (Agentic) of this series each address this problem in depth.
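Citation enforcement can be checked mechanically, at least in part. The sketch below assumes a hypothetical convention in which the prompt numbers each excerpt and the answer cites them as `[1]`, `[2]`. The function name `check_citations` and the marker format are illustrative and not taken from any library. It only verifies that every sentence carries a citation and that cited excerpts exist. It does not verify that the cited excerpt actually supports the claim.

```python
import re

# Hypothetical citation marker format: [1], [2], ...
CITATION = re.compile(r"\[(\d+)\]")

def check_citations(answer: str, num_excerpts: int) -> dict:
    """Return structural citation problems found in an answer.

    Assumes excerpts were numbered 1..num_excerpts in the prompt and
    that the answer cites them inline, before the sentence-ending mark.
    """
    # Split on sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    uncited = [s for s in sentences if not CITATION.search(s)]
    cited_ids = {int(n) for n in CITATION.findall(answer)}
    unknown = sorted(i for i in cited_ids if not 1 <= i <= num_excerpts)
    return {
        "uncited_sentences": uncited,
        "unknown_ids": unknown,
        "ok": not uncited and not unknown,
    }

report = check_citations(
    "The policy changed in March [1]. It applies to contractors [3].",
    num_excerpts=2,
)
print(report)
```

A failed check is a signal to retry, abstain, or hand the answer to a verification stage. A passed check is only a necessary condition for grounding.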
5. Math — Lewis 2020 in One Line
The original RAG paper defines the answer distribution as
$$P(y \mid x) = \sum_{z \in \text{top-K}(x)} P_\eta(z \mid x) \cdot P_\theta(y \mid x, z)$$
- \(x\) — the input query
- \(z\) — a retrieved document
- \(y\) — the answer
- \(P_\eta(z \mid x)\) — the Retriever scoring "how relevant is this document to this query"
- \(P_\theta(y \mid x, z)\) — the Generator scoring "given this query and this document, the probability of this answer"
The line to take with you: the answer factorises into a product of retriever and generator, summed over top-K. The two models can be trained separately or jointly, and Lewis 2020 defines two variants of the marginalisation, RAG-Sequence and RAG-Token. The formula above is the RAG-Sequence form: each retrieved document is used to generate the whole answer, and the sum over documents is taken once per sequence. RAG-Token re-applies the sum per token, so each token can draw on a different document. Lewis 2020 Table 2 reports the trade-off.
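A small numeric sketch can make the difference between the two marginalisations concrete. The probabilities below are invented toy values, and the function names `rag_sequence` and `rag_token` are illustrative. This is not the training code from the paper, only the arithmetic of where the sum over documents sits.

```python
from math import prod

def rag_sequence(p_retrieve, p_tokens):
    """Sum over documents of P(z|x) times the product of token probabilities."""
    return sum(pz * prod(tokens) for pz, tokens in zip(p_retrieve, p_tokens))

def rag_token(p_retrieve, p_tokens):
    """Product over token positions of the document-marginalised token probability."""
    n_tokens = len(p_tokens[0])
    return prod(
        sum(pz * tokens[i] for pz, tokens in zip(p_retrieve, p_tokens))
        for i in range(n_tokens)
    )

# Toy numbers (assumed): two retrieved documents, a two-token answer.
p_retrieve = [0.5, 0.5]        # P_eta(z | x) for z1, z2
p_tokens = [
    [0.9, 0.2],                # P_theta(y_i | x, z1, y_<i) for i = 1, 2
    [0.1, 0.8],                # P_theta(y_i | x, z2, y_<i) for i = 1, 2
]

seq = rag_sequence(p_retrieve, p_tokens)   # 0.5 * 0.18 + 0.5 * 0.08
tok = rag_token(p_retrieve, p_tokens)      # (0.45 + 0.05) * (0.10 + 0.40)
print(round(seq, 4), round(tok, 4))
```

With these numbers the sequence-level sum gives 0.13 and the token-level sum gives 0.25. The gap comes from the second token being able to lean on `z2` even though the first token was best supported by `z1`.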
The formula is simple, but nearly every later part of this series can be read as which slot of this formula it improves.
- Retriever (\(P_\eta\)) improvements: Parts 7–11 (embedding, vector DB, dense, BM25), Part 12 (hybrid search), Part 13 (reranker), Parts 17–21 (query processing, dynamic).
- Generator (\(P_\theta\)) improvements: Part 22 (Corrective), Part 24 (Agentic verification).
- top-K selection: Part 21 (Adaptive Top-K).
- The sum itself: RAG-Token vs RAG-Sequence is outside the scope of this series. The series focuses on the retrieve-then-read paradigm.
A side effect of the product-sum structure: "a confidently wrong retriever lifts the generator's confidence too." §8.2 picks this up.
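The side effect can be seen by plugging toy numbers into the formula. The sketch below assumes two documents, one that supports the right answer and one that supports a wrong one. All probabilities are invented for illustration, and `answer_distribution` is a hypothetical helper name.

```python
def answer_distribution(p_retrieve, p_answer_given_doc):
    """P(y|x) = sum over z of P_eta(z|x) * P_theta(y|x,z), for each candidate answer y."""
    answers = p_answer_given_doc[0].keys()
    return {
        a: sum(pz * dist[a] for pz, dist in zip(p_retrieve, p_answer_given_doc))
        for a in answers
    }

# Generator behaviour is held fixed (assumed values):
# the good document supports "right", the misleading one supports "wrong".
p_answer_given_doc = [
    {"right": 0.9, "wrong": 0.1},   # P_theta(y | x, z_good)
    {"right": 0.1, "wrong": 0.9},   # P_theta(y | x, z_misleading)
]

healthy = answer_distribution([0.8, 0.2], p_answer_given_doc)
confidently_wrong = answer_distribution([0.1, 0.9], p_answer_given_doc)

print(healthy)            # "wrong" gets 0.8 * 0.1 + 0.2 * 0.9
print(confidently_wrong)  # "wrong" gets 0.1 * 0.1 + 0.9 * 0.9
```

Only the retriever weights changed between the two calls, yet the probability of the wrong answer moves from about 0.26 to about 0.82. The generator did nothing differently.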
6. Walkthrough — From a Question to an Answer
Code stays minimal. A walkthrough here is a one-line dissection of the algorithm, not a code dump.
answer = generator(prompt, context=retriever(query, k=5))
Six things happen inside this one line.
- Normalisation. Strip quotes, punctuation, whitespace; expand acronyms; normalise units. Part 18 (Query Rewrite) takes this further.
- Embedding. Map the normalised query into a d-dimensional vector with an embedding model. Part 8 compares embedding models.
- Search. Score all (or pre-filtered) document vectors in the knowledge base and keep the top-K by similarity. Dense uses cosine, sparse uses BM25, hybrid blends the two. Parts 9, 10, 11, and 12 live here.
- Reranking — optional. A cross-encoder reorders the top-K and keeps K' ≤ K. Part 13.
- Context assembly. Decide order, format, and truncation of the remaining documents in the prompt. To mitigate Lost in the Middle, place the most relevant items at both ends (Liu 2023).
- Generation. Feed (system prompt + context + query) to the LLM. To keep the answer inside the evidence, the system prompt typically forbids using information outside the provided context.
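The six steps above can be sketched end to end with toy components. Everything here is a stand-in chosen so the example runs without dependencies: a bag-of-words counter plays the embedding model, term overlap plays the cross-encoder, and the `generator` argument is any callable that accepts a prompt string. The function names and the prompt wording are assumptions, not an API from any library.

```python
import math
import re
from collections import Counter

def normalise(query: str) -> str:
    """Step 1: lowercase, strip punctuation, collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", query.lower()).split())

def embed(text: str) -> Counter:
    """Step 2: toy bag-of-words vector (stand-in for a real embedding model)."""
    return Counter(normalise(text).split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(query_vec: Counter, knowledge_base: list, k: int) -> list:
    """Step 3: score every document and keep the top-K by similarity."""
    ranked = sorted(knowledge_base, key=lambda d: cosine(query_vec, embed(d)), reverse=True)
    return ranked[:k]

def rerank(query: str, docs: list, k_prime: int) -> list:
    """Step 4 (optional): toy reranker using exact-term overlap, keeps K' <= K."""
    terms = set(normalise(query).split())
    ranked = sorted(docs, key=lambda d: len(terms & set(normalise(d).split())), reverse=True)
    return ranked[:k_prime]

def assemble_context(docs: list) -> str:
    """Step 5: number the excerpts so the answer can cite them."""
    return "\n".join(f"[{i}] {d}" for i, d in enumerate(docs, start=1))

def build_prompt(query: str, context: str) -> str:
    """Step 6 input: system instruction, then context, then the query."""
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

def rag_answer(query, knowledge_base, generator, k=5, k_prime=3):
    q = normalise(query)
    top_k = search(embed(q), knowledge_base, k)
    kept = rerank(q, top_k, k_prime)
    return generator(build_prompt(query, assemble_context(kept)))

kb = [
    "Refunds are processed within five business days.",
    "The office is closed on public holidays.",
    "Security policy requires two-factor authentication.",
]
# An identity "generator" returns the prompt, which shows what the LLM would see.
print(rag_answer("How long do refunds take?", kb, generator=lambda p: p, k=2, k_prime=1))
```

Passing an identity function as the generator is a cheap way to inspect the assembled prompt before any model is involved, which is where most retrieval bugs are visible.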
Each step has a later part that deepens it:
| Step | Where it deepens |
|---|---|
| Normalise / rewrite | Part 18 — Query Rewrite |
| Embed | Part 8 — Embedding models |
| Search (dense / sparse / hybrid) | Parts 10, 11, 12 |
| Rerank | Part 13 |
| Context assembly | Part 6 — Contextual Chunking, Part 14 — eval sets |
| Generation (Grounding) | Parts 22, 24 — Corrective, Agentic |
The table doubles as a reading map for the rest of the series. After Part 1, the next read becomes self-evident.
7. Variants and Cases — Naive · Advanced · Modular · RETRO · kNN-LM
Each variant follows the same five-block pattern: what changes → why it appeared → what becomes possible → where it fits → limits and the next step.
7.1 Naive RAG — "retrieve → generate" once
- What changes: one retrieval + one generation. The simplest form.
- Why it appeared: the Lewis 2020 default; the standard from 2020 to 2022.
- What becomes possible: LLMs gain access to post-training knowledge and private corpora without retraining.
- Where it fits: small clean corpus, narrow query distribution. POCs, internal FAQ, questions answered within a single document.
- Limits and next step: no retry after a single retrieval; top-K growth exposes Lost in the Middle. → Advanced RAG.
7.2 Advanced RAG — Pre / Retrieval / Post three-stage
- What changes: stages are added around retrieval. Pre: query rewrite, expansion, classification. Post: rerank, filter, dedupe.
- Why it appeared: the limits of Naive forced iterative refinement of retrieval. Gao et al. 2024 RAG Survey §4 codifies this taxonomy.
- What becomes possible: standard behaviours like "rewrite ambiguous queries before search" and "rerank when too many results" enter the playbook. Hybrid Search (Part 12) and Reranker (Part 13) are core here.
- Where it fits: production. Variable natural-language queries, multi-domain corpora.
- Limits and next step: more stages add latency, and a fixed sequence cannot adapt per query. → Modular RAG.
7.3 Modular RAG — Replaceable, recomposable modules
- What changes: each stage of Pre / Retrieval / Post is abstracted into a module that can be swapped, inserted, removed, or looped. Search modules, Memory modules, Routing modules, Predict modules, and more.
- Why it appeared: teams needed domain-specific pipelines and routing by question type inside one system, which a fixed Pre / Retrieval / Post sequence cannot express.
- What becomes possible: routing such as "regulatory queries → exact-match module, meeting summaries → session-context module, general FAQ → hybrid + rerank module" within a single system.
- Where it fits: multi-domain operations with heterogeneous user groups.
- Limits and next step: freedom raises the evaluation problem. Which composition is best requires experiment automation. → Parts 14 (eval sets), 15 (search-quality metrics), 16 (LangSmith / Phoenix).
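The routing example in this subsection can be sketched as a dispatch table. The keyword router and the three module functions below are hypothetical placeholders. A production router would more likely be a classifier or an LLM call, and each module would wrap a real retrieval pipeline.

```python
from typing import Callable, Dict, List

Retriever = Callable[[str], List[str]]

# Placeholder modules: each would wrap a real pipeline in practice.
def exact_match_module(query: str) -> List[str]:
    return [f"exact-match results for: {query}"]

def session_context_module(query: str) -> List[str]:
    return [f"session-context results for: {query}"]

def hybrid_rerank_module(query: str) -> List[str]:
    return [f"hybrid + rerank results for: {query}"]

MODULES: Dict[str, Retriever] = {
    "regulatory": exact_match_module,
    "meeting": session_context_module,
    "faq": hybrid_rerank_module,
}

def route(query: str) -> str:
    """Toy keyword router (an assumption, chosen only to keep the sketch runnable)."""
    q = query.lower()
    if any(word in q for word in ("regulation", "policy", "compliance")):
        return "regulatory"
    if any(word in q for word in ("meeting", "minutes")):
        return "meeting"
    return "faq"

def retrieve(query: str) -> List[str]:
    """Pick a module per query, then delegate retrieval to it."""
    return MODULES[route(query)](query)

print(retrieve("What does the new compliance regulation say?"))
```

Because modules share one signature, swapping or adding one is a change to the table rather than to the pipeline, which is also what makes the compositions comparable in evaluation.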
7.4 RETRO / Fusion-in-Decoder — Coupling at training time
- What changes: retrieval is folded into training. RETRO (Borgeaud 2022) retrieves chunks from a trillion-token-scale database during training and injects them into the decoder via chunked cross-attention. Fusion-in-Decoder (Izacard 2021) encodes each retrieved passage independently in the encoder and fuses them in the decoder.
- Why it appeared: the Lewis 2020 setup separates retrieval and generation losses, leaving efficiency on the table. Joint training was proposed to make small models match larger ones.
- What becomes possible: a 7.5B-parameter RETRO model reports performance comparable to GPT-3 (175B) on the Pile with about 25× fewer parameters (Borgeaud 2022).
- Where it fits: groups that pre-train their own models (research labs, large providers). Operational RAG rarely touches this band.
- Limits and next step: high training cost. Most production RAG remains retrieve-then-read, and this series stays there.
7.5 kNN-LM — Token-level retrieval
- What changes: the retrieval unit is token representation vectors, not documents. At generation time, the nearest token contexts in the corpus are queried and blended into the next-token distribution (Khandelwal 2020).
- Why it appeared: a way to borrow corpus statistics at runtime when weight memory is insufficient.
- What becomes possible: domain adaptation without further training; sharp perplexity drops on rare token sequences.
- Where it fits: narrow domains with peculiar distributions (regulatory text, code, medical notes).
- Limits and next step: a retrieval per decoded token — latency is heavy. One of the reasons RAG concentrated on document-level retrieve-then-read. The series does not cover kNN-LM, but knowing why it became a minority path helps when Part 21 (Adaptive Top-K) introduces budget-aware retrieval.
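The blending step can be written down in a few lines. Khandelwal 2020 interpolates the model's next-token distribution with a distribution built from the retrieved neighbours, weighted by a mixing parameter. The sketch below follows that shape with invented toy values. The function names are illustrative, and real systems look up neighbours in a large vector index rather than a Python list.

```python
import math

def knn_distribution(neighbours):
    """Turn (distance, next_token) pairs into a next-token distribution.

    Each neighbour is weighted by exp(-distance); weights for the same
    token are summed and the result is normalised.
    """
    weights = [(math.exp(-distance), token) for distance, token in neighbours]
    total = sum(w for w, _ in weights)
    dist = {}
    for w, token in weights:
        dist[token] = dist.get(token, 0.0) + w / total
    return dist

def interpolate(p_lm, p_knn, lam):
    """p(y|x) = lam * p_kNN(y|x) + (1 - lam) * p_LM(y|x)."""
    vocab = set(p_lm) | set(p_knn)
    return {t: lam * p_knn.get(t, 0.0) + (1 - lam) * p_lm.get(t, 0.0) for t in vocab}

# Toy values (assumed): the base model prefers "court",
# but the nearest contexts in a legal corpus continue with "tribunal".
p_lm = {"court": 0.6, "tribunal": 0.4}
neighbours = [(0.0, "tribunal"), (0.0, "tribunal"), (0.0, "court")]

p_knn = knn_distribution(neighbours)
mixed = interpolate(p_lm, p_knn, lam=0.25)
print(mixed)
```

The neighbour lookup has to run for every decoded token, which is the latency cost the bullet above refers to.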
A condensed comparison:
| Variant | Retrieval unit | Training coupling | Operational difficulty | Coverage in this series |
|---|---|---|---|---|
| Naive RAG | Document | No | Low | Parts 1, 2, 5, 6 |
| Advanced RAG | Document | No | Medium | Parts 7–13 |
| Modular RAG | Document | No | High | Parts 14–24 |
| RETRO / FiD | Chunk / passage | Yes (pre-training for RETRO, fine-tuning for FiD) | Very high | Not covered |
| kNN-LM | Token | No (inference-time interpolation over a prebuilt datastore) | Very high | Not covered |
The 26 parts cover the first three rows. Training-time and token-level methods belong to a separate future series (Agent / Fine-tuning).
8. Limits and Failure Modes
RAG is not a silver bullet. When one decision misfires, what does the system look like, how is it diagnosed, and how is it mitigated? Each failure mode below follows four blocks: why it is intrinsic → how to diagnose → mitigation → which later part picks it up.
8.1 Lost in the Middle
- Why intrinsic: Liu et al. 2023 (TACL 2024) showed that middle-positioned information is used less than information at the start or end of a long context. Subsequent analyses tie this to training-data position priors, not raw attention.
- How to diagnose: place the same document at the start / middle / end of the prompt and compare accuracy. If accuracy drops in the middle, the effect is active.
- Mitigation: lower top-K, reorder so the most relevant items sit at both ends, use a reranker to compress K' and shorten the context.
- Later part: Parts 6 (Contextual Chunking), 13 (Reranker), 15 (search-quality metrics).
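The "both ends" reordering can be done with a simple alternation. The sketch below assumes the input list is already sorted from most to least relevant. The function name `reorder_for_edges` is illustrative, and this is one possible arrangement rather than the procedure from Liu 2023.

```python
def reorder_for_edges(ranked_docs):
    """Place the most relevant items at the start and end of the context.

    Input is assumed to be sorted by relevance, best first. Even-ranked
    items fill the front, odd-ranked items fill the back in reverse, so
    the least relevant items end up in the middle.
    """
    front, back = [], []
    for i, doc in enumerate(ranked_docs):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

ranked = ["d1", "d2", "d3", "d4", "d5"]   # d1 is the most relevant
print(reorder_for_edges(ranked))
```

For five documents this yields `d1, d3, d5, d4, d2`: the top two sit at the edges and the weakest sits in the middle. Whether this helps for a given model is an empirical question, which is what the start / middle / end diagnostic above measures.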
8.2 Garbage-In Garbage-Out — Wrong retrieval yields more confident hallucinations
- Why intrinsic: from the formula in §5, the Generator conditions on the Retriever's output. A plausible but wrong document raises \(P_\theta(y \mid x, z)\) toward a plausible wrong answer. Hallucinations are not eliminated; they become grounded in the wrong evidence.
- How to diagnose: when an answer is wrong, check separately whether the correct document was even in the context. RAGAS Context Recall (Part 14) measures this directly.
- Mitigation: improve retrieval quality with hybrid + rerank, abstain when confidence is low, force citations in the answer to enable downstream verification.
- Later part: Parts 12 (Hybrid), 13 (Reranker), 14 (RAGAS), 22 (Corrective RAG).
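The diagnostic in this subsection can be approximated without any tooling when the evaluation set records which documents are relevant for each question. The sketch below is a simplified ID-based stand-in and is not the RAGAS implementation. The function names and the triage labels are assumptions.

```python
def context_recall_by_id(retrieved_ids, relevant_ids):
    """Fraction of known-relevant documents that made it into the context.

    Returns None when no relevant documents are labelled for the question.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return None
    return len(relevant & set(retrieved_ids)) / len(relevant)

def triage(answer_correct, recall):
    """First-pass attribution of a wrong answer to retrieval or generation."""
    if answer_correct:
        return "ok"
    if recall is not None and recall < 1.0:
        return "retrieval miss"
    return "generation or grounding failure"

recall = context_recall_by_id(retrieved_ids=["a", "b", "c"], relevant_ids=["a", "d"])
print(recall, triage(answer_correct=False, recall=recall))
```

Separating the two cases matters because the fixes differ: a retrieval miss points at search and reranking, while a wrong answer with full recall points at context assembly or the generator.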
8.3 Cost and Latency Stack-Up
- Why intrinsic: retrieval (embedding + vector DB) + rerank (cross-encoder) + generation (LLM) run in a sequential pipeline. Each step adds tens to hundreds of milliseconds, and the cost is distributed across embedding calls, LLM tokens, and infrastructure (vector DB).
- How to diagnose: measure p50 / p95 / p99 per stage and break out costs by component. LangSmith and Phoenix are the standard tools for this separation.
- Mitigation: Adaptive Top-K (Part 21) shrinks K when confidence is high; Conditional Reranking (Part 21) only invokes the reranker on low-confidence cases; cache (query → top-K).
- Later part: Parts 16 (LangSmith / Phoenix), 21 (Adaptive Top-K).
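Per-stage percentiles can be collected with the standard library before any tracing tool is adopted. The sketch below is a minimal in-process timer. The class name `StageTimer` and the nearest-rank percentile method are choices made for this example, not part of LangSmith or Phoenix.

```python
import math
import time
from collections import defaultdict
from contextlib import contextmanager

class StageTimer:
    """Collects wall-clock durations per pipeline stage, in milliseconds."""

    def __init__(self):
        self.samples = defaultdict(list)   # stage name -> list of durations

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.samples[name].append((time.perf_counter() - start) * 1000.0)

    def percentile(self, name, p):
        """Nearest-rank percentile of the recorded durations for one stage."""
        data = sorted(self.samples[name])
        if not data:
            raise ValueError(f"no samples for stage {name!r}")
        rank = max(1, math.ceil(p / 100 * len(data)))
        return data[rank - 1]

timer = StageTimer()
for _ in range(20):
    with timer.stage("retrieve"):
        sum(range(1000))        # placeholder for embedding + vector search
    with timer.stage("generate"):
        sum(range(5000))        # placeholder for the LLM call

for name in ("retrieve", "generate"):
    print(name, [round(timer.percentile(name, p), 3) for p in (50, 95, 99)])
```

Reporting p95 and p99 per stage, rather than one end-to-end average, is what shows whether the reranker or the generator dominates tail latency.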
8.4 Parametric vs Retrieved Conflict
- Why intrinsic: when the model's internal memory and the retrieved evidence disagree, which one wins is unspecified. Even with a strong "use only the context" instruction, a confident region of parametric memory can override the context.
- How to diagnose: feed deliberately wrong context on questions the model already knows. If the answer follows the context, grounding is healthy. If it follows the weights, conflict is active.
- Mitigation: prompts that require citations, Self-RAG / Corrective RAG that re-verify answers, smaller models whose parametric confidence is lower (experiments report stronger context adherence).
- Later part: Parts 22 (Corrective RAG), 24 (Agentic RAG).
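The diagnostic can be framed as a small harness around whatever model call is in use. In the sketch below, `generate` is any callable that takes a prompt string and returns text, and the case format, prompt wording, and substring matching are assumptions made to keep the example short. Substring matching is crude and will misclassify replies that mention both answers.

```python
def conflict_probe(generate, cases):
    """Measure how often replies follow a deliberately wrong context.

    Each case needs: question, context (counterfactual), context_answer
    (what the wrong context implies), parametric_answer (what the model
    is expected to know). Returns fractions per outcome.
    """
    counts = {"context": 0, "parametric": 0, "other": 0}
    for case in cases:
        prompt = (
            "Use only the context to answer.\n"
            f"Context: {case['context']}\n"
            f"Question: {case['question']}"
        )
        reply = generate(prompt).lower()
        if case["context_answer"].lower() in reply:
            counts["context"] += 1
        elif case["parametric_answer"].lower() in reply:
            counts["parametric"] += 1
        else:
            counts["other"] += 1
    total = len(cases)
    return {k: (v / total if total else 0.0) for k, v in counts.items()}

cases = [{
    "question": "What is the capital of France?",
    "context": "The capital of France is Lyon.",   # deliberately wrong
    "context_answer": "Lyon",
    "parametric_answer": "Paris",
}]

# Stub generators standing in for a real model call.
follows_context = lambda prompt: "Lyon"
follows_weights = lambda prompt: "It is Paris."

print(conflict_probe(follows_context, cases))
print(conflict_probe(follows_weights, cases))
```

A high "parametric" fraction on such cases suggests the "use only the context" instruction is being overridden, which is the condition this subsection describes.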
8.5 Missing Document Boundaries — "Which group of documents are we searching?"
- Why intrinsic: when policies, training materials, recent meeting minutes, past decisions, developer documentation, and user manuals all live in the same vector space, semantic similarity overrides purpose distinctions. A user asking for "the latest security policy" can receive a training brochure simply because it embeds closer.
- How to diagnose: compare per-document-group search against unified search. If unified search is worse, this is active.
- Mitigation: pre-filter by metadata, multi-collection / namespace design, filter-first retrieval. Part 3 collects all three.
- Later part: Part 3 — Ingestion Design: Document Boundary, Model-aware Schema, Filter-first Retrieval.
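The security-policy example can be reproduced with a toy corpus. The sketch below uses term overlap as a stand-in for embedding similarity and a `group` field as the metadata. The `Doc` class, the field name, and both search functions are hypothetical and only illustrate the order of operations: filter first, then rank.

```python
import re
from dataclasses import dataclass

@dataclass
class Doc:
    text: str
    group: str   # document group, e.g. "policy", "training", "minutes"

def tokens(text):
    return re.findall(r"\w+", text.lower())

def term_overlap(query, text):
    """Toy relevance score: number of distinct query terms found in the text."""
    return len(set(tokens(query)) & set(tokens(text)))

def search(query, docs, k=3):
    """Unified search: rank every document regardless of its group."""
    return sorted(docs, key=lambda d: term_overlap(query, d.text), reverse=True)[:k]

def filter_first_search(query, docs, group, k=3):
    """Restrict to one document group before ranking."""
    return search(query, [d for d in docs if d.group == group], k)

docs = [
    Doc("Security awareness training brochure: latest security policy tips", "training"),
    Doc("Security policy v7: passwords rotate every 90 days", "policy"),
    Doc("Minutes: discussed office move and parking", "minutes"),
]

query = "latest security policy"
print(search(query, docs, k=1)[0].group)                       # unified search
print(filter_first_search(query, docs, "policy", k=1)[0].group)  # filter-first
```

In this toy corpus the brochure wins the unified search because it repeats the query terms, while the filtered search returns the actual policy. Comparing the two calls is the same per-group versus unified diagnostic described above.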
The five limits are not independent. They show up together in production and mask each other's diagnosis. That is why this series places evaluation, search-quality metrics, and experiment automation early at Parts 14–16.
8.6 Common Pitfalls
- "Finer chunks always help." — Cuts across semantic units and weakens the retrieval signal. Part 5 covers appropriate sizes.
- "Larger top-K always helps." — Lost in the Middle. §8.1.
- "RAG eliminates hallucination." — §8.2. Wrong retrieval yields more confident hallucination.
- "Long-context models replace RAG." — Cost drops, but provenance disappears. You cannot ask "where did this part of the answer come from?" anymore.
- "Reranking is always worth it." — Adds cost; minimal gain when retrieval is already accurate. Part 21 (Conditional Reranking).
The upshot: no RAG decision is unconditionally correct. Cost, latency, and accuracy form a triad to be measured — and that is why half of this series is allocated to evaluation, experiments, and routing.
9. Settled Conclusions — Answer Without Looking
Five questions. Answer each in one line, then verify against the one-sentence rationale and the chapter it lives in.
Q1. List the five elements of RAG from memory.
Answer: Retriever, Generator, Knowledge Base, Context, Grounding. Rationale: Retriever does top-K search; Generator is the LLM; Knowledge Base is the store; Context is the excerpt; Grounding is the stay-inside-evidence behavioural contract. Chapter: §4.
Q2. In Lewis 2020's formula \(P(y \mid x) = \sum_z P_\eta(z\mid x) P_\theta(y \mid x, z)\), what are the two factors?
Answer: \(P_\eta\) is the Retriever; \(P_\theta\) is the Generator. Rationale: The first factor scores document-to-query relevance; the second scores answer-given-query-and-document. Their product is summed over top-K. Chapter: §5.
Q3. What does Lost in the Middle mean?
Answer: Information in the middle of a long context is underused compared to the start and end. Rationale: Liu 2023 showed the position dependence, and subsequent analyses link it to training-data position priors. Mitigate by lowering top-K and by reordering or reranking. Chapter: §8.1.
Q4. Give two cases where you should not use RAG.
Answer: ① Narrow domains fully covered by the model's parametric memory. ② One-shot answers that need no provenance. (Optionally: cases where long-context cost beats the RAG infrastructure cost.) Rationale: RAG adds cost and latency. The cost must be recovered through evidence, domain coverage, or recency. Chapter: §1, §2, §8.4.
Q5. When Grounding breaks, what should you suspect first?
Answer: Whether the correct document was even in the context — i.e. Context Recall. Rationale: A broken Grounding is not always the Generator's fault. If the context never contained the answer, the Generator could not have stayed inside it. §8.2 and Part 14's RAGAS Context Recall return to this. Chapter: §8.2.
If you can answer all five with the page closed, the three learning objectives are in hand.
10. Further Reading
Primary sources
- Lewis, P. et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020. arXiv:2005.11401
- Khandelwal, U. et al. Generalization through Memorization: Nearest Neighbor Language Models. ICLR 2020. arXiv:1911.00172
- Borgeaud, S. et al. Improving Language Models by Retrieving from Trillions of Tokens (RETRO). ICML 2022. arXiv:2112.04426
- Izacard, G. & Grave, E. Leveraging Passage Retrieval with Generative Models for ODQA (FiD). EACL 2021. arXiv:2007.01282
- Liu, N. et al. Lost in the Middle. TACL 2024. arXiv:2307.03172
- Asai, A. et al. Self-RAG. ICLR 2024. arXiv:2310.11511
- Yan, S. et al. Corrective Retrieval Augmented Generation. arXiv:2401.15884
- Gao, Y. et al. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv:2312.10997
Official docs / guides
- LangChain RAG tutorial: https://python.langchain.com/docs/tutorials/rag/
- LlamaIndex Indexing & Querying: https://docs.llamaindex.ai/en/stable/understanding/indexing/
- Anthropic Contextual Retrieval blog (2024-09)
- Pinecone "What is RAG" overview
- Weaviate Academy — Vector Search & RAG fundamentals
Supporting notes
- Author's notes §0–§5 — definition, history, overall structure.
- Author's notes §34 — core message ("Find the evidence that fits the question, and never answer outside that evidence").
- Author's notes §35-9 — "which group of documents are we searching" → §8.5 → Part 3.
Cheat Sheet
| Term | One-line Definition |
|---|---|
| Retriever | Returns top-K documents matching a query |
| Generator | LLM that produces an answer from Query + Context |
| Knowledge Base | Storage of external documents and embeddings |
| Context | The excerpt inserted into the Generator input |
| Grounding | The property of staying inside the evidence |
| Naive RAG | retrieve → generate, once |
| Advanced RAG | Pre / Retrieval / Post three-stage |
| Modular RAG | Each stage abstracted as a swappable module |
| Lost in the Middle | Middle of a long context is underused |
| Grounded Answer | An answer derived only from retrieved evidence |
| RAG-Token / RAG-Sequence | Per-token vs per-answer document choice |
| Context Recall | Whether the correct document made it into the context (§8.2 diagnostic) |
Bridge — What's Next
Next — RAG Core Study (2/26) — Document Preprocessing & PDF→Markdown Pipeline.
The first concrete problem you meet while building the Knowledge Base: how to store the raw documents. A corpus mixing PDFs, HWP, DOCX, HTML, and CSV must be reshaped into markdown that LLMs and embedding models can handle. Part 2 compares MarkItDown · Unstructured · Docling · LlamaParse on five axes: structure preservation, table extraction, image handling, licensing, operational cost. That comparison is where the next decision starts.
Series overview: Series index