"RAG Core Study (2/26) — Document Preprocessing & PDF→Markdown Pipeline"

Part 1 settled the five elements of RAG. When you actually start building the Knowledge Base, the very first question is — "in what shape do we store the raw documents?"

In nearly every organisation, source material is a mixture of PDF · HWP · DOCX · HTML · CSV. Humans read them visually, but LLMs and embedding models need a single format with text and structure preserved. That format is Markdown. This part compares four libraries that own the PDF → Markdown transformation — MarkItDown · Unstructured · Docling · LlamaParse — on the same axes. It closes with a decision matrix: which tool fits which corpus.


0. Prerequisites

Two items are enough.

  • Python basics: virtualenv, pip, defining a small function.
  • Markdown: #, ##, table pipes, fenced code blocks.

No need to know the internals of PDF (objects, streams, text boxes). This part judges tools by the quality of the resulting markdown.


1. Learning Objectives

After this article you should be able to state three things in your own words.

  1. Why markdown — the Knowledge Base format decision directly affects embedding quality and chunkability.
  2. The one-line difference between the four tools — MarkItDown (MS, lightweight), Unstructured (general, element tree), Docling (IBM, structure-preserving), LlamaParse (LLM-based).
  3. The five-axis decision matrix — structure preservation, table extraction, image handling, licensing, operational cost.

This decision is made once per Knowledge Base. Getting it wrong costs you the corpus.


2. Key Summary

PDF → Markdown is not "extract the text and ship it." Heading hierarchy, table structure, image position, code blocks, and footnotes all need to survive — otherwise chunking won't follow semantic units and embeddings won't see the right signal. MarkItDown is Microsoft's lightweight converter: fast and simple but weak on tables and complex layouts. Unstructured is a general parser that normalises many formats into a consistent element tree. Docling is IBM's structure-preserving parser: it handles tables, reading order, and equations. LlamaParse is LLM-driven and delivers the highest quality at the highest cost and latency. The choice is dictated by corpus characteristics and operational budget.
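
To make the chunking claim concrete, here is a minimal sketch of a heading-based splitter. It is not taken from any of the four tools; `split_by_headings` and the `path` / `text` keys are hypothetical names chosen for illustration. It only works when the converter has preserved `#` headings, which is the point: lose the heading hierarchy and the whole document collapses into a single chunk.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown: str) -> list[dict]:
    """Split markdown into one chunk per heading, keeping the heading path."""
    chunks: list[dict] = []
    path: list[str] = []   # current heading trail, e.g. ["Policy", "Leave"]
    body: list[str] = []   # lines collected under the current heading

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append({"path": " > ".join(path), "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the chunk that belongs to the previous heading
            level = len(match.group(1))
            del path[level - 1:]  # drop headings at the same or deeper level
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks


sample = "# Policy\nIntro.\n## Leave\n| a | b |\n|---|---|\n| 1 | 2 |\n## Travel\nBook early."
for chunk in split_by_headings(sample):
    print(chunk["path"], "->", len(chunk["text"]), "chars")
```

The sketch ignores fenced code blocks, so a `# comment` line inside a fence would be misread as a heading. A production splitter needs to track fence state.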


3. Intuition — One PDF, Four Outcomes

Take a single PDF page: title at the top, body in two columns, a table in the middle, an image with a caption at the bottom. The four tools produce visibly different markdown.

  • MarkItDown: text is extracted top to bottom in a straight line. Two-column layouts get interleaved, and the table becomes a one-liner embedded in the prose. Fast.
  • Unstructured: text is split into elements (Title, NarrativeText, Table, Image). The two-column layout is captured per element. The table is preserved as a separate element.
  • Docling: reading order is normalised so the two columns never zig-zag. The table is converted to a proper markdown table. Equations and code go into their own blocks.
  • LlamaParse: an LLM reads the rendered page image and emits markdown. The table is reconstructed almost cell for cell, at the price of a per-page LLM cost and added latency.

The same PDF becomes four different corpora. The signal seen by the embeddings differs across tools, and that difference shows up directly in retrieval accuracy.


4. Definition — The Four Tools Positioned

| Tool | Origin | Core Idea | License |
|---|---|---|---|
| MarkItDown | Microsoft | Fast lightweight converter. 30+ formats to markdown via direct extraction. | MIT |
| Unstructured | unstructured.io | General-purpose element tree parser. Consistent NarrativeText / Title / Table units. | Apache 2.0 (open); commercial API separate |
| Docling | IBM Research | Structure-preserving. Reading order, tables, equations restored. | MIT |
| LlamaParse | LlamaIndex | LLM-based visual reading. PDF pages processed as images. | Commercial API; 1,000 pages/day free tier |

The four tools produce different artefacts from the same PDF. Reports consistently show retrieval accuracy gaps of 5–20 percentage points across tools on the same corpus (Docling 2024 tech report; LlamaParse public benchmarks).

The question this part answers: "Which tool fits my corpus?" Five weighted axes drive the answer.

  1. Structure preservation — heading hierarchy, reading order, paragraph boundaries.
  2. Table extraction — tables rendered as markdown (or HTML) tables.
  3. Image handling — captions, alt-text, optional OCR.
  4. Licensing — open vs commercial; does the data leave the network?
  5. Operational cost — per-page time and dollars, re-indexing cost.
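
One way to turn the five axes into a decision is a weighted score per tool. The sketch below is illustrative only: the 1 to 5 scores in `SCORES` are hypothetical placeholders, not measurements from this article or from any benchmark, and `rank_tools` is a made-up helper name. Replace the scores with numbers from your own sample conversions before trusting the ranking.

```python
AXES = ("structure", "tables", "images", "licensing", "cost")

# Hypothetical 1-5 scores (5 = best on that axis). Placeholders, not benchmarks.
SCORES = {
    "MarkItDown":   {"structure": 2, "tables": 1, "images": 1, "licensing": 5, "cost": 5},
    "Unstructured": {"structure": 3, "tables": 2, "images": 2, "licensing": 4, "cost": 4},
    "Docling":      {"structure": 5, "tables": 4, "images": 3, "licensing": 5, "cost": 3},
    "LlamaParse":   {"structure": 5, "tables": 5, "images": 5, "licensing": 2, "cost": 1},
}


def rank_tools(weights: dict[str, float]) -> list[tuple[str, float]]:
    """Return (tool, weighted average score) pairs, best first."""
    total = sum(weights[axis] for axis in AXES)
    ranked = [
        (tool, sum(weights[axis] * scores[axis] for axis in AXES) / total)
        for tool, scores in SCORES.items()
    ]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# A table-heavy corpus that must stay on-premises: tables and licensing weigh most.
on_prem = {"structure": 2, "tables": 3, "images": 1, "licensing": 3, "cost": 1}
for tool, score in rank_tools(on_prem):
    print(f"{tool:<13}{score:.1f}")
```

With these placeholder scores, the on-premises weighting puts Docling first, while a weighting that zeroes out licensing and cost puts LlamaParse first. The weights, not the tools, carry the decision.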

5. Standard Invocations

Each tool is shown in its shortest form: one import and one conversion call. Fuller pipelines follow the same shape in each library's official docs.

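# MarkItDown: returns the whole document as one markdown string.
# PDF input may need the optional extra: pip install "markitdown[pdf]"
from markitdown import MarkItDown
md_text = MarkItDown().convert("doc.pdf").text_content

# Unstructured: returns a list of typed elements (Title, NarrativeText, Table, ...), not markdown.
from unstructured.partition.pdf import partition_pdf
elements = partition_pdf("doc.pdf")

# Docling: convert to a document object, then export it as markdown.
from docling.document_converter import DocumentConverter
doc = DocumentConverter().convert("doc.pdf").document
md_text = doc.export_to_markdown()

# LlamaParse: hosted API; expects an API key (LLAMA_CLOUD_API_KEY) in the environment.
from llama_cloud_services import LlamaParse
documents = LlamaParse(result_type="markdown").load_data("doc.pdf")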

All four take a single PDF path and produce text or an element tree. The shape and quality of the output is what this part compares.
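
To compare outputs side by side it helps to hide the four invocations behind one function. The wrapper below is a sketch, not an API from any of the libraries: `pdf_to_markdown` and its `tool` strings are hypothetical names. The calls inside each branch are the ones shown above. Imports are lazy, so only the library you actually select has to be installed. For Unstructured the element texts are simply joined, which is an assumption about what "text" should mean for that tool and loses the element types.

```python
def pdf_to_markdown(path: str, tool: str) -> str:
    """Convert one PDF with the named tool and return text for comparison."""
    if tool == "markitdown":
        from markitdown import MarkItDown
        return MarkItDown().convert(path).text_content
    if tool == "unstructured":
        from unstructured.partition.pdf import partition_pdf
        # Elements are joined as plain text; no markdown structure is added here.
        return "\n\n".join(str(element) for element in partition_pdf(path))
    if tool == "docling":
        from docling.document_converter import DocumentConverter
        return DocumentConverter().convert(path).document.export_to_markdown()
    if tool == "llamaparse":
        from llama_cloud_services import LlamaParse
        documents = LlamaParse(result_type="markdown").load_data(path)
        return "\n\n".join(document.text for document in documents)
    raise ValueError(f"unknown tool: {tool!r}")
```

A typical use is to run the same sample pages through two or three tools, write each result to its own file, and diff them by eye before committing to one tool for the whole corpus.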


6. Walkthrough — A Page Becomes Markdown

Six general stages move a single page from PDF to markdown:

  1. Layout analysis — identify blocks (text boxes, tables, images) with coordinates; recover column structure.
  2. Reading order — sort blocks the way a human would read them; prevent zig-zag across columns.
  3. Text extraction — pull characters from each block in coordinate order; map bold and italic to markdown.
  4. Table detection — recover rows and columns from a table block; handle merged cells and multi-row headers.
  5. Image handling — pair images with captions; OCR or vision-model generate alt-text where needed.
  6. Markdown serialization — emit headings, paragraphs, tables, image links in markdown.

The four tools differ by how deeply each stage is processed.

  • MarkItDown does steps 1–3 shallowly; steps 4 and 5 are essentially skipped.
  • Unstructured does steps 1–5 and stops at an element tree; markdown serialization is the user's responsibility.
  • Docling does all six. Reading order, tables, equations, and code blocks are first-class.
  • LlamaParse delegates all six to an LLM. Pages are read as images and markdown is synthesised. Costly and slow, but the most robust on complex layouts.

7. Variants and Cases — Five-block View per Tool

7.1 MarkItDown — Microsoft Lightweight Converter

  • What changes: one-line conversion of 30+ formats — PDF, DOCX, PPTX, XLSX, images, audio — into markdown via a unified interface.
  • Why it appeared: Microsoft needed a lightweight adapter for internal tooling (released 2024-12).
  • What becomes possible: POCs and small corpora converted in minutes. CLI shipped.
  • Where it fits: simple internal documents, Word- and PowerPoint-heavy archives, retrieval cases where speed outweighs precision.
  • Limits and next step: complex tables and multi-column layouts get interleaved; image captions are dropped. → Move to Docling or LlamaParse when tables or reading order matter.

7.2 Unstructured — General-purpose Element Tree

  • What changes: the output is an element tree (Title, NarrativeText, Table, Image, …). The user decides whether to serialize to markdown.
  • Why it appeared: chunking and embedding pipelines need consistent semantic units; element trees provide them.
  • What becomes possible: PDF, HTML, DOCX, EML, CSV all flow into the same tree. LangChain UnstructuredLoader is the canonical integration.
  • Where it fits: heterogeneous corpora with element-based chunking.
  • Limits and next step: open-build table fidelity is weak; high-quality tables require the commercial Unstructured API. → Docling or LlamaParse as alternatives.
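
Because serialization is left to the user, a small mapping from element type to markdown is usually the first thing you write. The sketch below runs on a stand-in `Element` dataclass rather than on Unstructured's own classes, and assumes the real elements expose comparable `category` and `text` attributes. The mapping itself (Title to `##`, ListItem to `-`, a marker comment in front of tables) is an illustrative choice, not the library's behaviour.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Stand-in for a parsed element; only the two fields used below."""
    category: str  # e.g. "Title", "NarrativeText", "ListItem", "Table"
    text: str


def elements_to_markdown(elements: list[Element]) -> str:
    """Serialize a flat element list into markdown blocks."""
    blocks = []
    for element in elements:
        text = element.text.strip()
        if not text:
            continue  # skip empty or whitespace-only elements
        if element.category == "Title":
            blocks.append(f"## {text}")
        elif element.category == "ListItem":
            blocks.append(f"- {text}")
        elif element.category == "Table":
            # Mark tables so they can be reviewed instead of silently flattened.
            blocks.append(f"<!-- table -->\n{text}")
        else:
            blocks.append(text)
    return "\n\n".join(blocks)


sample = [
    Element("Title", "Leave Policy"),
    Element("NarrativeText", "Staff accrue leave monthly."),
    Element("ListItem", "Annual leave"),
    Element("NarrativeText", "   "),
]
print(elements_to_markdown(sample))
```

Every title becomes a level-2 heading here because a flat element list carries no heading depth. Recovering real hierarchy needs extra signals such as font size or numbering.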

7.3 Docling — IBM Structure-preserving

  • What changes: tables, equations, code, and reading order are deeply handled. Tables export to markdown or HTML at cell granularity. Equations preserved as LaTeX.
  • Why it appeared: IBM Research targeted retrieval quality on enterprise documents — research reports, financial statements, policies (v1 2024-08, v2 2024-11).
  • What becomes possible: markdown that closely tracks the source for complex tables, equations, and multi-column reports. Self-contained packaging with docling-models.
  • Where it fits: academic papers, financial tables, multi-column reports, technical manuals.
  • Limits and next step: ~1–3 seconds per page on CPU; a 100K-page corpus takes tens of hours. → Batch processing with GPU acceleration, or tiered usage (Docling for key documents, MarkItDown for the rest).
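
The "tens of hours" figure follows directly from the per-page range. A quick check, using the 1 to 3 seconds per page quoted above and a hypothetical helper `hours_to_process`; the `workers` parameter assumes throughput scales linearly with parallel workers, which real hardware only approximates.

```python
def hours_to_process(pages: int, seconds_per_page: float, workers: int = 1) -> float:
    """Wall-clock hours for a batch, assuming linear scaling across workers."""
    return pages * seconds_per_page / workers / 3600


low = hours_to_process(100_000, 1.0)
high = hours_to_process(100_000, 3.0)
print(f"1 worker:  {low:.1f} to {high:.1f} hours")   # 27.8 to 83.3 hours

low4 = hours_to_process(100_000, 1.0, workers=4)
high4 = hours_to_process(100_000, 3.0, workers=4)
print(f"4 workers: {low4:.1f} to {high4:.1f} hours")  # 6.9 to 20.8 hours
```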

7.4 LlamaParse — LLM-driven Visual Reading

  • What changes: pages are rendered as images and sent to an LLM; markdown is generated by the model.
  • Why it appeared: rule-based parsers hit a ceiling on complex charts, handwritten notes, and multilingual tables. LlamaIndex shipped LlamaParse as an API service in early 2024.
  • What becomes possible: charts, diagrams, and handwritten notes survive with meaning preserved. Visually complex tables are nearly reconstructed.
  • Where it fits: corpora where retrieval quality is non-negotiable and budget is available — regulatory text, medical charts, academic books.
  • Limits and next step: per-page cost (Premium ~$0.045/page + add-ons; Balanced ~$0.003/page). Data leaves the network via the API, which may conflict with internal security policy. → Self-hosted Docling for strict environments; LlamaParse only for the critical subset under budget.

8. Limits and Failure Modes

Five common failures in PDF preprocessing. Each is described by why it is intrinsic, how to diagnose it, how to mitigate it, and which later part revisits it.

8.1 Multi-column Text Interleaving

  • Why intrinsic: PDF does not store reading order explicitly. Text boxes carry only coordinates; the left-to-right and top-to-bottom flow is inferred.
  • Diagnosis: a single sentence in the output zig-zags between columns.
  • Mitigation: switch to a tool that handles reading order (Docling), or process visually (LlamaParse).
  • Later part: Part 4 — OCR and Layout Analysis.
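
The inference step can be sketched in a few lines. This is a deliberately naive heuristic, not how Docling or any other tool implements reading order: it assumes equal-width columns, top-left block coordinates with y growing downward, and a known page width. `Block` and `reading_order` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Block:
    x0: float  # left edge of the text box
    y0: float  # top edge, with y growing downward
    text: str


def reading_order(blocks: list[Block], page_width: float, columns: int = 2) -> list[Block]:
    """Order blocks column by column, top to bottom within each column."""
    column_width = page_width / columns

    def key(block: Block) -> tuple[int, float]:
        column = min(int(block.x0 // column_width), columns - 1)
        return (column, block.y0)

    return sorted(blocks, key=key)


blocks = [
    Block(50, 100, "L1"), Block(320, 100, "R1"),
    Block(50, 200, "L2"), Block(320, 200, "R2"),
]

# Sorting by vertical position alone interleaves the columns.
naive = sorted(blocks, key=lambda b: (b.y0, b.x0))
print([b.text for b in naive])                                   # ['L1', 'R1', 'L2', 'R2']
print([b.text for b in reading_order(blocks, page_width=600)])   # ['L1', 'L2', 'R1', 'R2']
```

The heuristic breaks as soon as a full-width table or figure sits between two-column regions, which is why layout-aware tools segment the page into regions first.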

8.2 Tables Collapse Into One Line

  • Why intrinsic: tables rely on coordinate alignment; cell boundaries depend on visual lines or whitespace. Tables with weak boundaries collapse during text extraction.
  • Diagnosis: pipes (|) almost absent in the markdown output; cell values appear as a row of whitespace-separated tokens.
  • Mitigation: Docling or LlamaParse; or extract tables separately with OCR table recognition (AWS Textract Table Extraction, Azure DI). Revisited in Part 4.
  • Later part: Parts 4, 6.
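
The diagnosis above can be automated as a rough screen over converted output. The sketch flags lines that contain no pipe but consist mostly of numeric tokens, a common signature of a table row flattened into text. The function name, the regular expression, and the thresholds are illustrative assumptions; tune them on your own corpus and expect false positives on numeric prose.

```python
import re

# Tokens made only of digits and common numeric punctuation, e.g. "1,200" or "12.5%".
NUMERIC = re.compile(r"^[\d.,%$()+-]+$")


def suspect_collapsed_tables(markdown: str, min_numeric: int = 3) -> list[int]:
    """Return 1-based line numbers that look like flattened table rows."""
    suspects = []
    for number, line in enumerate(markdown.splitlines(), start=1):
        if "|" in line:
            continue  # already formatted as a markdown table row
        tokens = line.split()
        numeric = sum(
            1 for token in tokens
            if NUMERIC.match(token) and any(ch.isdigit() for ch in token)
        )
        # Flag lines where numeric tokens are both frequent and dominant.
        if numeric >= min_numeric and numeric >= len(tokens) / 2:
            suspects.append(number)
    return suspects


converted = (
    "Revenue by quarter\n"
    "Q1 120 135 12.5%\n"
    "| Q2 | 140 | 150 | 7.1% |\n"
    "We hired 3 engineers in 2024."
)
print(suspect_collapsed_tables(converted))  # [2]
```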

8.3 Image Information Loss

  • Why intrinsic: figures, diagrams, and screenshots are not text. Embeddings cannot index them without a separate vision embedding.
  • Diagnosis: output has almost no image captions or alt-text; the prose references "see Figure 3" but the figure carries no metadata.
  • Mitigation: LlamaParse or a vision LLM to auto-caption. Docling extracts image regions but does not caption them.
  • Later part: Part 4 (image OCR), Part 8 (some multimodal embeddings).
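
A simple screen for this failure is to compare the figures the prose mentions with the figures that actually carry alt-text. The sketch below assumes a convention in which captions live in the image alt-text and name the figure number ("Figure 1: ..."). That convention is an assumption for illustration, as are the function and pattern names.

```python
import re

FIGURE_REF = re.compile(r"\b(?:Figure|Fig\.)\s*(\d+)")
IMAGE_ALT = re.compile(r"!\[([^\]]*)\]\([^)]*\)")


def uncaptioned_figures(markdown: str) -> set[int]:
    """Figure numbers mentioned in prose but absent from every image alt-text."""
    alt_text = " ".join(IMAGE_ALT.findall(markdown))
    described = {int(n) for n in FIGURE_REF.findall(alt_text)}
    prose = IMAGE_ALT.sub("", markdown)  # drop image tags before scanning the prose
    referenced = {int(n) for n in FIGURE_REF.findall(prose)}
    return referenced - described


converted = (
    "See Figure 3 for the flow.\n\n"
    "![Figure 1: architecture](img/a.png)\n\n"
    "Figure 1 shows the layers.\n\n"
    "![](img/b.png)"
)
print(uncaptioned_figures(converted))  # {3}
```

A non-empty result on a sample of documents is a signal to add a captioning step before indexing.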

8.4 CJK Text Garbled

  • Why intrinsic: incorrect or missing font maps in the PDF emit unassigned Unicode or broken glyphs, common in digital PDFs without embedded fonts.
  • Diagnosis: question marks (?) or empty squares (□) in the output.
  • Mitigation: re-process the PDF with OCR (Tesseract, PaddleOCR); or LlamaParse to re-read visually.
  • Later part: Part 4.
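
This diagnosis can be turned into a per-document score that decides which files go to the OCR fallback. The sketch counts question marks, the Unicode replacement character, empty squares, and private-use code points. Which characters count as suspicious and where to set the threshold are assumptions to calibrate on your own corpus, since genuine question marks also raise the ratio.

```python
def garbled_ratio(text: str) -> float:
    """Share of non-whitespace characters that look like failed glyph mappings."""
    suspicious = 0
    total = 0
    for ch in text:
        if ch.isspace():
            continue
        total += 1
        # "?", U+FFFD (replacement character), U+25A1 (empty square),
        # or a private-use code point left behind by a broken font map.
        if ch in "?\ufffd\u25a1" or 0xE000 <= ord(ch) <= 0xF8FF:
            suspicious += 1
    return suspicious / total if total else 0.0


clean = "계약 기간은 1년이다."
broken = "?? ???? 1?"
print(garbled_ratio(clean))   # 0.0
print(garbled_ratio(broken))  # 0.875
```

Routing documents above a chosen threshold (for example 0.2, a value picked here only for illustration) to OCR keeps the expensive path limited to files that need it.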

8.5 Cost and Latency Stack-up

  • Why intrinsic: processing 10K pages with LlamaParse Premium costs ~$450 (10,000 × $0.045) plus LLM time; local Docling costs near zero in fees but takes roughly 3–8 hours on CPU at 1–3 seconds per page; MarkItDown takes minutes. Tool choice dominates total KB build cost.
  • Diagnosis: estimate a single re-indexing pass in cost and time; can you afford it semi-annually?
  • Mitigation: tier the corpus — LlamaParse for critical documents, MarkItDown for the rest. Or incremental indexing — only the delta.
  • Later part: Part 25 — security and re-indexing operations.
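
The tiering argument is easy to check with arithmetic. The sketch uses the per-page prices quoted in §7.4; the 10% critical share, the two passes per year, and the helper name `reindex_cost` are hypothetical inputs chosen for illustration.

```python
# Per-page prices quoted in section 7.4 (USD); check current pricing before budgeting.
USD_PER_PAGE = {"premium": 0.045, "balanced": 0.003}


def reindex_cost(pages: int, critical_share: float,
                 passes_per_year: int = 2, mode: str = "premium") -> dict[str, float]:
    """Compare sending every page to the API with sending only the critical tier."""
    critical_pages = round(pages * critical_share)
    per_pass_all = pages * USD_PER_PAGE[mode]
    per_pass_tiered = critical_pages * USD_PER_PAGE[mode]
    return {
        "per_pass_all": per_pass_all,
        "per_pass_tiered": per_pass_tiered,
        "per_year_all": per_pass_all * passes_per_year,
        "per_year_tiered": per_pass_tiered * passes_per_year,
    }


estimate = reindex_cost(10_000, critical_share=0.10)
for name, usd in estimate.items():
    print(f"{name:<16}${usd:,.2f}")
```

For 10,000 pages re-indexed twice a year, sending everything through Premium comes to about $900 per year, while sending only the critical 10% comes to about $90. The remaining pages go through a local tool at no API cost.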

8.6 Common Pitfalls

  • "One parser for everything." — Cost and time explode. Tier the corpus and mix tools.
  • "OCR is only for scanned PDFs." — Korean digital PDFs often have broken font maps; OCR is a useful fallback.
  • "Markdown done, chunk immediately." — If a table collapsed to one line and you chunk it, the table is lost forever. Sample-check the converted output before chunking.
  • "LlamaParse is always best." — Data leaves the network via the API; check security policy and DPA.
  • "A converted KB lasts forever." — Tools update and embedding models change; re-indexing is required. Part 25.

9. Settled Conclusions

Q1. State each tool's strength in one line.

  • MarkItDown — fast lightweight converter (MS, 30+ formats).
  • Unstructured — general element-tree parser (consistent semantic units).
  • Docling — structure preservation (tables, reading order, equations).
  • LlamaParse — LLM visual reading (top quality on complex layouts, high cost).

Chapter: §4, §7.

Q2. What are the five axes of the decision matrix?

Structure preservation / table extraction / image handling / licensing / operational cost. The same corpus can show a 5–20 pp retrieval-accuracy gap across tools (§4). Chapter: §4.

Q3. What is the root cause of two-column text interleaving?

PDFs do not record reading order; text boxes carry only coordinates and the L→R then next-column flow has to be inferred. Chapter: §8.1.

Q4. Two reasons to not use LlamaParse?

① Security-sensitive corpora that cannot leave the network. ② Page volume large enough that per-page cost breaks the RAG budget. Chapter: §7.4, §8.5.

Q5. When Korean PDF output shows question marks, what to suspect first?

Broken font maps in the PDF — missing or mis-mapped embedded fonts. Fix with OCR re-reading. Chapter: §8.4.


10. Further Reading

Primary sources

  • Docling Team. Docling Technical Report. IBM Research, 2024. arXiv:2408.09869
  • Pfitzmann, B., Auer, C. et al. DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis. KDD 2022.
  • LlamaParse benchmarks (LlamaIndex blog, 2024-04 / 2024-09 updates).
  • Unstructured.io technical whitepaper (2024).

Official docs

  • MarkItDown: https://github.com/microsoft/markitdown
  • Unstructured: https://docs.unstructured.io/
  • Docling: https://github.com/DS4SD/docling
  • LlamaParse: https://docs.cloud.llamaindex.ai/llamaparse/getting_started

Supporting notes

  • Author notes §2, §3, §4 — preprocessing core.
  • LangChain Document Loaders: https://python.langchain.com/docs/integrations/document_loaders/

Cheat Sheet

| Tool | License | Strength | Weakness | Best Fit |
|---|---|---|---|---|
| MarkItDown | MIT | Fast, 30+ formats | Weak on tables, columns | Simple internal documents |
| Unstructured | Apache 2.0 | Consistent element tree | Open-build table fidelity | Heterogeneous corpora |
| Docling | MIT | Structure, equations | Long processing time | Academic, financial, technical |
| LlamaParse | Commercial API | Top quality | Cost, data egress | Complex critical documents |

Bridge — What's Next

Next — RAG Core Study (3/26) — Ingestion Design: Document Boundary, Model-aware Schema, Filter-first Retrieval.

After turning documents into markdown, the next decision is where to search. If policies, training material, and meeting minutes share one vector space, retrieval breaks. Part 3 collects the six-stage workflow from author note §35 in one place — Whole-document Retrieval, Model-aware Ingestion, Title-Body Schema, Document Boundary, Filter-first Retrieval.
