"RAG Core Study (4/26) — OCR & Layout Analysis"

May 14, 2026

Part 2 covered well-behaved digital PDFs. This part covers everything else — scans, text trapped inside images, hard tables, multi-column layouts.

Corpora almost always contain abnormal PDFs: scanned paper, broken-font Korean PDFs, tables that only humans can read, multi-column reports. Part 4 compares five tools — PaddleOCR · Tesseract · AWS Textract · Azure Document Intelligence · LayoutLMv3 — on five axes: character accuracy · layout preservation · table extraction · Korean (CJK) support · operational cost.

0. Prerequisites

Part 2's markdown conversion flow.
The distinction between images and text objects inside a PDF.

You do not need OCR internals (CTC, attention OCR). The article focuses on the tool decision table.

1. Learning Objectives

State the difference between OCR and layout analysis in one line — OCR recognises characters; layout analysis recognises blocks, order, and roles.
Compare the five tools on strength, weakness, license, cost.
Identify the most dangerous failure for Korean (CJK) corpora.

2. Key Summary

OCR recognises characters; layout analysis recognises blocks of characters and their reading order. They are separate stages, and a good RAG preprocessor passes through both. PaddleOCR is the open-source, multilingual, free default. Tesseract is the oldest open OCR but weaker on Korean. AWS Textract and Azure DI are commercial APIs with strong table, form, and signature support. LayoutLMv3 is not an OCR; it is a model that consumes (text + bbox + image patches) for document understanding. For Korean, start with PaddleOCR and mix in Azure DI when tables matter.

3. Intuition — What the Five Tools See on the Same Page

A scanned Korean meeting-minutes page: top logo, Korean title, two-column body, a table in the middle, handwritten signatures at the bottom.

PaddleOCR: clean CJK character recognition; table cells identified by coordinates but markdown table conversion is weak.
Tesseract: has a kor model but struggles with handwriting and unusual fonts. Free and fast.
AWS Textract: strong tables and forms. Korean officially supported (2023+). Per-page cost.
Azure Document Intelligence: top-tier tables, signatures, checkboxes. Excellent Korean support. Per-page cost.
LayoutLMv3: not an OCR. Takes (text + bbox + image patches) and performs document understanding — classification, QA, entity extraction.

4. Definition — Two Separate Stages

Stage	Output	Example Tools
OCR	Pixels → Unicode text with bounding boxes (per character, word, or line, depending on the tool).	Tesseract, PaddleOCR
Layout Analysis	bbox set → block classification (Title / Paragraph / Table / Image) + reading order	LayoutLMv3, DocTR, Donut
End-to-end Document AI	Pixels → (text + structure + extracted fields) in one pass	AWS Textract, Azure DI

Side-by-side (higher numbers are better; a dash means not applicable):

Tool	OCR	Layout	Table	Form/Sign	License	Korean
Tesseract	3	1	1	—	Apache 2.0	2
PaddleOCR	4	2	2	—	Apache 2.0	4
AWS Textract	4	3	4	3	Commercial	3
Azure DI	4	4	4	4	Commercial	4
LayoutLMv3	—	4	2	2	CC BY-NC-SA 4.0 (non-commercial)	depends on OCR

LayoutLMv3 is a model, not an OCR. Korean performance depends on whatever OCR feeds it.

5. Standard Calls

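from paddleocr import PaddleOCR
ocr = PaddleOCR(lang="korean")  # "korean" is PaddleOCR's language code; the model downloads on first use
result = ocr.ocr("page.png")  # PaddleOCR 2.x: result[0] is [[bbox, (text, score)], ...] for the page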

import pytesseract
text = pytesseract.image_to_string("page.png", lang="kor")

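import boto3
client = boto3.client("textract")
# detect_document_text returns text lines and words only; tables and forms need analyze_document
with open("page.png", "rb") as f:
    resp = client.detect_document_text(Document={"Bytes": f.read()})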

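from azure.core.credentials import AzureKeyCredential
from azure.ai.documentintelligence import DocumentIntelligenceClient
# endpoint and key come from your Azure resource; the key must be wrapped in AzureKeyCredential
client = DocumentIntelligenceClient(endpoint, AzureKeyCredential(key))
with open("page.pdf", "rb") as f:
    poller = client.begin_analyze_document("prebuilt-layout", body=f)  # body= as in SDK 1.0+
    result = poller.result()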

LayoutLMv3 loads via Hugging Face Transformers and operates on OCR output, not raw pixels.
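The PaddleOCR call in this section returns nested lists that are awkward to pass downstream. The following is a minimal sketch of flattening them, assuming the PaddleOCR 2.x layout in which each page holds `[bbox, (text, score)]` entries and `bbox` is four `[x, y]` corner points. The helper name `flatten_paddle_result` and the sample data are illustrative and not part of PaddleOCR.

```python
from typing import Any


def flatten_paddle_result(result: list[Any]) -> list[dict]:
    """Turn a PaddleOCR 2.x style result into flat records with axis-aligned boxes."""
    records = []
    for page in result:
        if not page:  # a page with no detected text may come back as None
            continue
        for quad, (text, score) in page:
            xs = [point[0] for point in quad]
            ys = [point[1] for point in quad]
            records.append({
                "text": text,
                "score": float(score),
                # collapse the four corner points into (x0, y0, x1, y1)
                "bbox": (min(xs), min(ys), max(xs), max(ys)),
            })
    return records


# Hand-written sample in the assumed 2.x shape (not real OCR output)
sample = [[
    [[[10, 10], [110, 10], [110, 40], [10, 40]], ("회의록", 0.98)],
    [[[10, 60], [200, 62], [200, 90], [10, 88]], ("2026년 5월", 0.91)],
]]
records = flatten_paddle_result(sample)
```

Flat `(x0, y0, x1, y1)` boxes are easier to sort and group than corner quads, at the cost of losing rotation information for skewed lines.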

6. Walkthrough — From Pixels to Markdown

Image normalisation — rotation, deskew, denoise.
Text detection — bounding boxes for character regions.
Text recognition — characters per box (CRNN, attention OCR).
Layout classification — group bboxes into blocks (title, paragraph, table, image).
Table recognition — split table regions into row/column cells.
Reading-order sort — human-style order across columns.
Markdown serialisation — emit headings, paragraphs, tables, image refs.

Tools differ in how far down this pipeline they take you.
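The last pipeline step, markdown serialisation, can be sketched once the earlier steps have produced classified blocks in reading order. This is an illustration under assumptions: the block dictionaries with `type`, `text`, `rows`, and `path` keys are a hypothetical intermediate format, not the output of any of the five tools.

```python
def blocks_to_markdown(blocks: list[dict]) -> str:
    """Serialise classified, ordered blocks into markdown text."""
    parts = []
    for block in blocks:
        kind = block["type"]
        if kind == "title":
            parts.append("# " + block["text"])
        elif kind == "paragraph":
            parts.append(block["text"])
        elif kind == "table":
            rows = block["rows"]  # list of rows, first row treated as the header
            header = "| " + " | ".join(rows[0]) + " |"
            divider = "| " + " | ".join("---" for _ in rows[0]) + " |"
            body = ["| " + " | ".join(row) + " |" for row in rows[1:]]
            parts.append("\n".join([header, divider, *body]))
        elif kind == "image":
            parts.append(f"![{block.get('alt', '')}]({block['path']})")
    return "\n\n".join(parts)


page = [
    {"type": "title", "text": "Meeting Minutes"},
    {"type": "paragraph", "text": "Attendees approved the budget."},
    {"type": "table", "rows": [["Item", "Cost"], ["Paper", "10"]]},
    {"type": "image", "path": "sig.png", "alt": "signature"},
]
markdown = blocks_to_markdown(page)
```

Everything before this step exists to fill in `type`, order, and table rows correctly; the serialiser itself stays trivial.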

7. Variants and Cases

7.1 PaddleOCR

What changes: Apache 2.0 open OCR for 80+ languages; v4 Korean is the standard accuracy baseline.
Why it appeared: Baidu open-sourced its internal OCR, enabling multilingual processing without paying for APIs.
What becomes possible: free local OCR for Korean. GPU acceleration available.
Where it fits: Korean-primary corpora; budget-sensitive deployments.
Limits: weak on tables, signatures, checkboxes; pair with another tool.

7.2 Tesseract

What changes: the oldest open OCR, with a kor language pack.
Why it appeared: HP/Google released it as public infrastructure; easy install everywhere.
What becomes possible: a free OCR baseline on any OS.
Where it fits: simple Latin-text corpora.
Limits: CJK accuracy lags PaddleOCR; weak on handwriting and unusual fonts.
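Tesseract's weak spots show up as low per-word confidence, which `pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT)` exposes alongside word boxes. The sketch below filters such a dictionary; it works on a hand-written sample so it runs without Tesseract installed, and the helper name and the 60-point threshold are assumptions to tune per corpus.

```python
def words_from_tesseract(data: dict, min_conf: float = 60.0) -> list[dict]:
    """Keep recognised words at or above min_conf from an image_to_data dict."""
    words = []
    columns = zip(data["text"], data["conf"], data["left"],
                  data["top"], data["width"], data["height"])
    for text, conf, left, top, width, height in columns:
        conf = float(conf)  # some pytesseract versions return conf as strings
        if not text.strip() or conf < min_conf:
            continue  # structural rows carry empty text and conf -1
        words.append({
            "text": text,
            "conf": conf,
            "bbox": (left, top, left + width, top + height),
        })
    return words


# Hand-written sample in the shape of the image_to_data dictionary
sample = {
    "text": ["", "Meeting", "minutes", "s1gn"],
    "conf": [-1, 96.0, 91.5, 23.0],
    "left": [0, 10, 90, 10],
    "top": [0, 10, 10, 300],
    "width": [500, 70, 70, 60],
    "height": [400, 20, 20, 30],
}
words = words_from_tesseract(sample)
```

With real input, the dictionary would come from `pytesseract.image_to_data("page.png", lang="kor", output_type=pytesseract.Output.DICT)`.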

7.3 AWS Textract

What changes: AWS Document AI API — text + tables + forms + handwriting + natural-language Queries for field extraction.
Why it appeared: enterprise documents (invoices, contracts, forms) needed automated extraction. Korean official 2023+.
What becomes possible: business forms and contracts to cell-level markdown with SLA / HIPAA options.
Where it fits: AWS-native enterprise document workflows.
Limits: per-page cost; documents are sent to AWS for processing, so it does not fit strict on-prem requirements.
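Textract returns tables as a flat list of blocks linked by IDs, so getting to cell-level output takes one reassembly step. The sketch below rebuilds a grid from `TABLE`, `CELL`, and `WORD` blocks as returned by `analyze_document` with `FeatureTypes=["TABLES"]`. It is a simplified illustration: merged cells and selection elements are ignored, and the helper name and sample blocks are hypothetical.

```python
def textract_tables(blocks: list[dict]) -> list[list[list[str]]]:
    """Rebuild each TABLE block into a row-major grid of cell strings."""
    by_id = {block["Id"]: block for block in blocks}

    def child_ids(block: dict) -> list[str]:
        relationships = block.get("Relationships") or []
        return [i for rel in relationships if rel["Type"] == "CHILD" for i in rel["Ids"]]

    tables = []
    for table in (b for b in blocks if b["BlockType"] == "TABLE"):
        cells = [by_id[i] for i in child_ids(table) if by_id[i]["BlockType"] == "CELL"]
        if not cells:
            continue
        n_rows = max(cell["RowIndex"] for cell in cells)
        n_cols = max(cell["ColumnIndex"] for cell in cells)
        grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
        for cell in cells:
            words = [by_id[i]["Text"] for i in child_ids(cell)
                     if by_id[i]["BlockType"] == "WORD"]
            # RowIndex and ColumnIndex are 1-based in Textract responses
            grid[cell["RowIndex"] - 1][cell["ColumnIndex"] - 1] = " ".join(words)
        tables.append(grid)
    return tables


def _cell(cell_id, row, col, word_ids=None):
    block = {"Id": cell_id, "BlockType": "CELL", "RowIndex": row, "ColumnIndex": col}
    if word_ids:
        block["Relationships"] = [{"Type": "CHILD", "Ids": word_ids}]
    return block


# Hand-written sample blocks in the shape of resp["Blocks"]
sample_blocks = [
    {"Id": "t1", "BlockType": "TABLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
    _cell("c1", 1, 1, ["w1"]), _cell("c2", 1, 2, ["w2"]),
    _cell("c3", 2, 1, ["w3", "w4"]), _cell("c4", 2, 2),
    {"Id": "w1", "BlockType": "WORD", "Text": "Item"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Qty"},
    {"Id": "w3", "BlockType": "WORD", "Text": "A4"},
    {"Id": "w4", "BlockType": "WORD", "Text": "paper"},
]
tables = textract_tables(sample_blocks)
```

Each grid can then be written out as a markdown table with the first row as the header.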

7.4 Azure Document Intelligence (formerly Form Recognizer)

What changes: Microsoft Document AI with prebuilt models (layout / receipts / IDs / business cards) and custom fine-tunable models.
Why it appeared: Azure's OCR stack matured; Korean support is top-tier.
What becomes possible: tables, signatures, checkboxes, key-value pairs in one pass. prebuilt-layout is a strong RAG preprocessor baseline.
Where it fits: Azure environments with both Korean and table requirements.
Limits: per-page cost; data egress.
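To use prebuilt-layout output in a RAG preprocessor, each detected table has to become markdown. The sketch below assumes the table objects in `result.tables` expose `row_count`, `column_count`, and `cells` with zero-based `row_index`, `column_index`, and `content`, as in the Python SDK. It uses `SimpleNamespace` stand-ins so it runs without an Azure resource; the helper name is illustrative.

```python
from types import SimpleNamespace


def azure_table_to_markdown(table) -> str:
    """Render one table with at least one row as a markdown table."""
    grid = [["" for _ in range(table.column_count)] for _ in range(table.row_count)]
    for cell in table.cells:
        # cell content may contain line breaks, which would break a markdown row
        grid[cell.row_index][cell.column_index] = cell.content.replace("\n", " ")
    lines = [
        "| " + " | ".join(grid[0]) + " |",
        "| " + " | ".join(["---"] * table.column_count) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in grid[1:]]
    return "\n".join(lines)


# Stand-in for one entry of result.tables (not real service output)
sample_table = SimpleNamespace(
    row_count=2,
    column_count=2,
    cells=[
        SimpleNamespace(row_index=0, column_index=0, content="안건"),
        SimpleNamespace(row_index=0, column_index=1, content="결과"),
        SimpleNamespace(row_index=1, column_index=0, content="예산\n승인"),
        SimpleNamespace(row_index=1, column_index=1, content="가결"),
    ],
)
table_md = azure_table_to_markdown(sample_table)
```

Spanning cells are written only at their top-left position in this sketch, so merged headers appear once and leave the remaining positions empty.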

7.5 LayoutLMv3 — Post-OCR Understanding

What changes: takes (text + bbox + image patches) and learns a unified embedding for document understanding.
Why it appeared: text alone misses reading order and block roles. Microsoft Research (2022).
What becomes possible: extract total fields on receipts, signature blocks in contracts, titles vs body in reports.
Where it fits: post-processing OCR output for classification and field extraction in RAG KB enrichment.
Limits: not an OCR; Korean LayoutLMv3 support is limited and depends on the upstream OCR.
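Because LayoutLMv3 consumes OCR output, the upstream boxes must be converted to the coordinate system the model expects. The Hugging Face documentation for the LayoutLM family describes boxes normalised to a 0 to 1000 scale relative to page size. A minimal sketch of that conversion follows; the helper name and the clamping step are my additions.

```python
def normalize_bbox(bbox, page_width: float, page_height: float) -> list[int]:
    """Scale a pixel-space (x0, y0, x1, y1) box to the 0-1000 integer range."""
    x0, y0, x1, y1 = bbox
    scaled = [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]
    # clamp in case the OCR box slightly overshoots the page edge
    return [max(0, min(1000, value)) for value in scaled]


box = normalize_bbox((100, 200, 300, 400), page_width=1000, page_height=2000)
```

The normalised boxes are passed together with the OCR words when the processor is configured not to run its own OCR.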

8. Limits and Failure Modes

8.1 Korean Output Shows Question Marks

Why intrinsic: PDF font-mapping errors plus no OCR pass. Digital PDFs whose fonts are not embedded, or whose embedded fonts lack a usable character-to-Unicode map, render correctly but break text extraction.
Diagnosis: ? or □ in the output.
Mitigation: rasterise to image and OCR with PaddleOCR(lang="korean").
Later part: this part §7.1.
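The diagnosis can be automated by measuring how much of the extracted text consists of placeholder characters and falling back to OCR above a cutoff. This is a heuristic sketch: the set of suspicious characters and the 0.2 threshold are assumptions, and legitimate question marks count against the page.

```python
SUSPICIOUS = {"?", "\ufffd", "\u25a1"}  # question mark, replacement char, empty box


def garbled_ratio(text: str) -> float:
    """Share of non-whitespace characters that look like failed glyph mappings."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 1.0  # no text layer at all: treat the page as needing OCR
    return sum(ch in SUSPICIOUS for ch in visible) / len(visible)


def needs_ocr(text: str, threshold: float = 0.2) -> bool:
    """Decide whether a page's extracted text should be replaced by an OCR pass."""
    return garbled_ratio(text) > threshold


broken = needs_ocr("??? ?? \u25a1\u25a1")
clean = needs_ocr("회의록 2026년 5월")
```

Running the check per page rather than per document keeps the OCR bill limited to the pages that actually failed.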

8.2 Tables Collapse to a Single Line

Why intrinsic: OCR alone does not know cell boundaries; cells collapse into a whitespace-delimited row.
Diagnosis: few | pipes; cell values appear as whitespace blobs.
Mitigation: use Textract or Azure DI table modes; or post-process OCR bboxes into cells.
Later part: §7.3, §7.4.
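For the post-processing route, a first approximation is to cluster OCR boxes into rows by vertical centre and sort each row left to right. The sketch below does only that; it does not align columns or detect empty cells, and the record format, helper name, and 10-pixel tolerance are assumptions.

```python
def group_into_rows(words: list[dict], y_tol: float = 10.0) -> list[list[str]]:
    """Cluster word boxes into rows by vertical centre, then order each row by x."""
    def centre_y(word: dict) -> float:
        return (word["bbox"][1] + word["bbox"][3]) / 2

    rows: list[dict] = []
    for word in sorted(words, key=centre_y):
        if rows and abs(centre_y(word) - rows[-1]["cy"]) <= y_tol:
            rows[-1]["words"].append(word)
        else:
            rows.append({"cy": centre_y(word), "words": [word]})
    return [
        [w["text"] for w in sorted(row["words"], key=lambda w: w["bbox"][0])]
        for row in rows
    ]


# Boxes are (x0, y0, x1, y1) in pixels, deliberately out of order
sample_words = [
    {"text": "Chair", "bbox": (120, 51, 175, 71)},
    {"text": "Name", "bbox": (10, 10, 60, 30)},
    {"text": "Kim", "bbox": (10, 50, 50, 70)},
    {"text": "Role", "bbox": (120, 12, 170, 32)},
]
rows = group_into_rows(sample_words)
```

Real tables with blank cells need a second pass that clusters x positions into columns, which is where the dedicated table modes earn their cost.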

8.3 Multi-column Reading-order Errors

Why intrinsic: OCR recognises characters, not order. Sorting purely top-down, left-right scrambles two-column text.
Diagnosis: sentences cross columns mid-way.
Mitigation: LayoutLMv3, or reading-order algorithms inside Textract / Azure DI.
Later part: §7.4, §7.5.
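A rule-based alternative for simple two-column pages is to read the left column fully before the right one, treating any block that crosses the page midline as a full-width separator. This sketch assumes exactly two columns split at the page centre and a hypothetical block format with `(x0, y0, x1, y1)` boxes; it is not the algorithm used by any of the tools above.

```python
def reading_order(blocks: list[dict], page_width: float) -> list[dict]:
    """Order blocks left column first, flushing columns at each full-width block."""
    mid = page_width / 2
    ordered, left, right = [], [], []
    for block in sorted(blocks, key=lambda b: b["bbox"][1]):  # top to bottom
        x0, _, x1, _ = block["bbox"]
        if x0 < mid < x1:
            # full-width block (title, wide table): emit pending columns first
            ordered += left + right + [block]
            left, right = [], []
        elif (x0 + x1) / 2 < mid:
            left.append(block)
        else:
            right.append(block)
    return ordered + left + right


sample_blocks = [
    {"text": "R2", "bbox": (520, 220, 950, 400)},
    {"text": "L1", "bbox": (50, 60, 480, 200)},
    {"text": "Title", "bbox": (50, 10, 950, 40)},
    {"text": "R1", "bbox": (520, 60, 950, 200)},
    {"text": "L2", "bbox": (50, 220, 480, 400)},
]
order = [b["text"] for b in reading_order(sample_blocks, page_width=1000)]
```

A plain top-down, left-right sort of the same blocks would interleave L1, R1, L2, R2, which is the failure described above.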

8.4 Handwriting / Signatures Lost

Why intrinsic: most OCRs target printed text; handwriting accuracy drops below 50% on the same image.
Diagnosis: handwritten regions return blank or noisy strings.
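Mitigation: Textract AnalyzeDocument with the FORMS and SIGNATURES feature types, or an Azure DI prebuilt model that reads handwriting, such as prebuilt-read or prebuilt-layout.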
Later part: §7.3, §7.4.
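One way to act on the diagnosis is to separate blank or low-confidence regions from the first OCR pass and send only those to a handwriting-capable service. The sketch assumes records with `text` and a 0 to 1 `score`; the helper name and the 0.6 cutoff are assumptions to calibrate on labelled pages.

```python
def split_by_confidence(records: list[dict], min_score: float = 0.6):
    """Partition OCR records into trusted ones and ones to re-run elsewhere."""
    keep, retry = [], []
    for record in records:
        trusted = bool(record["text"].strip()) and record["score"] >= min_score
        (keep if trusted else retry).append(record)
    return keep, retry


sample_records = [
    {"text": "회의록", "score": 0.98},
    {"text": "", "score": 0.90},       # blank output from a signature area
    {"text": "s1gn@t", "score": 0.31},  # noisy handwriting guess
]
keep, retry = split_by_confidence(sample_records)
```

Only the boxes in `retry` need to be cropped and resubmitted, which keeps paid API calls proportional to the handwritten area.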

8.5 Cost and Latency Stack-up

Why intrinsic: 10K-page OCR with PaddleOCR is tens of hours locally; Textract/Azure DI costs tens to hundreds of dollars; LayoutLMv3 post-processing adds GPU time.
Diagnosis: estimate one full KB build's time and dollars.
Mitigation: tier the corpus and use incremental OCR.
Later part: Part 25.
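The estimate is simple arithmetic once the corpus is split into tiers: dollars are pages times the per-page price, and hours are pages times seconds per page divided by 3600. In the sketch below every price and throughput figure is a placeholder, not a quoted rate for any tool; substitute measured values.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    name: str
    pages: int
    usd_per_1000_pages: float  # placeholder price, not a vendor quote
    seconds_per_page: float    # placeholder throughput, measure on your hardware


def estimate(tiers: list[Tier]) -> tuple[float, float]:
    """Return (total dollars, total hours) for one full knowledge-base build."""
    usd = sum(t.pages * t.usd_per_1000_pages / 1000 for t in tiers)
    hours = sum(t.pages * t.seconds_per_page for t in tiers) / 3600
    return usd, hours


# Hypothetical split of a 10K-page corpus: most pages local, table pages via API
tiers = [
    Tier("local OCR", pages=8000, usd_per_1000_pages=0.0, seconds_per_page=9.0),
    Tier("table API", pages=2000, usd_per_1000_pages=10.0, seconds_per_page=1.0),
]
usd, hours = estimate(tiers)
```

Incremental OCR fits the same model: on a rebuild, count only pages whose content hash changed since the last run.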

8.6 Common Pitfalls

"Digital PDFs don't need OCR." — Korean font-mapping fails often. §8.1.
"OCR captures tables automatically." — Often not; tables need a dedicated step. §8.2.
"One tool fits all documents." — Cost and accuracy both lose; tier the corpus.
"One OCR pass and we are done." — Tables, handwriting, multi-column may need separate models.
"LayoutLMv3 is an OCR." — It is not.

9. Settled Conclusions

Q1. What is the difference between OCR and Layout Analysis?

OCR recognises characters. Layout analysis recognises blocks of characters and their reading order. Two separate stages. Chapter: §4.

Q2. Default OCR for Korean corpora?

PaddleOCR (lang="korean"). Mix in Azure DI when tables or signatures matter. Chapter: §7.1, §7.4.

Q3. Which two tools lead on table extraction?

AWS Textract and Azure DI. Both produce cell-level tables and key-value pairs. Chapter: §7.3, §7.4.

Q4. When is LayoutLMv3 the right choice?

After OCR provides (text + bbox), when you need classification or field extraction on the document — typically RAG KB enrichment. Chapter: §7.5.

Q5. Why does OCR fix the "question marks" failure on Korean digital PDFs?

OCR re-recognises characters from pixels, bypassing the PDF's broken font map. Chapter: §8.1.

10. Further Reading

Primary

Du, Y. et al. PP-OCRv4. arXiv:2304.10833
Smith, R. An Overview of the Tesseract OCR Engine. ICDAR 2007.
Huang, Y. et al. LayoutLMv3. ACM MM 2022. arXiv:2204.08387
AWS Textract / Azure Document Intelligence official model cards (2024 updates).

Official docs

PaddleOCR: https://github.com/PaddlePaddle/PaddleOCR
Tesseract: https://tesseract-ocr.github.io/
AWS Textract: https://docs.aws.amazon.com/textract/
Azure Document Intelligence: https://learn.microsoft.com/azure/ai-services/document-intelligence/
LayoutLMv3 model card: https://huggingface.co/microsoft/layoutlmv3-base

Supporting

Author note §3 — OCR / Layout priority.
Author note §35-6 — Filter-first's preprocessing value.

Cheat Sheet

Tool	Korean	Tables	License	Cost	Fits
Tesseract	2	1	Apache	Free	Simple English
PaddleOCR	4	2	Apache	Free	Korean default
AWS Textract	3	4	Commercial	Per-page	Forms, tables
Azure DI	4	4	Commercial	Per-page	Korean + tables
LayoutLMv3	OCR-dep	2	CC BY-NC-SA 4.0	GPU time	OCR post-processing

Bridge — What's Next

Next — RAG Core Study (5/26) — Five Paths of Chunking.

The documents are markdown, the search space is partitioned. Now: in what unit do we split? Fixed / Paragraph / Heading / Semantic / Sliding Window — and the Whole-document vs Chunk-level decision table left over from Part 3.

Series overview: Series index