"RAG Core Study (4/26) — OCR & Layout Analysis"
Part 2 covered well-behaved digital PDFs. This part covers everything else — scans, text trapped inside images, hard tables, multi-column layouts.
Corpora almost always contain abnormal PDFs: scanned paper, broken-font Korean PDFs, tables that only humans can read, multi-column reports. Part 4 compares five tools — PaddleOCR · Tesseract · AWS Textract · Azure Document Intelligence · LayoutLMv3 — on five axes: character accuracy · layout preservation · table extraction · Korean (CJK) support · operational cost.
0. Prerequisites
- Part 2's markdown conversion flow.
- The distinction between images and text objects inside a PDF.
You do not need OCR internals (CTC, attention OCR). The article focuses on the tool decision table.
1. Learning Objectives
- State the difference between OCR and layout analysis in one line — OCR recognises characters; layout analysis recognises blocks, order, and roles.
- Compare the five tools on strength, weakness, license, cost.
- Identify the most dangerous failure for Korean (CJK) corpora.
2. Key Summary
OCR recognises characters; layout analysis recognises blocks of characters, their order, and their roles. They are separate stages, and a good RAG preprocessor passes through both. PaddleOCR is the open, multilingual, free default. Tesseract is the oldest open OCR but weaker on Korean. AWS Textract and Azure DI are commercial APIs with strong table, form, and signature support. LayoutLMv3 is not an OCR: it is a model that consumes (text + bbox + image patches) for document understanding. For Korean, start with PaddleOCR and mix in Azure DI when tables matter.
3. Intuition — What the Five Tools See on the Same Page
A scanned Korean meeting-minutes page: top logo, Korean title, two-column body, a table in the middle, handwritten signatures at the bottom.
- PaddleOCR: clean CJK character recognition; table cells identified by coordinates but markdown table conversion is weak.
- Tesseract: has a `kor` model but struggles with handwriting and unusual fonts. Free and fast.
- AWS Textract: strong tables and forms. Korean officially supported (2023+). Per-page cost.
- Azure Document Intelligence: top-tier tables, signatures, checkboxes. Excellent Korean support. Per-page cost.
- LayoutLMv3: not an OCR. Takes (text + bbox + image patches) and performs document understanding — classification, QA, entity extraction.
4. Definition — Two Separate Stages
| Stage | Output | Example Tools |
|---|---|---|
| OCR | Pixels → Unicode characters with per-character bbox. | Tesseract, PaddleOCR |
| Layout Analysis | bbox set → block classification (Title / Paragraph / Table / Image) + reading order | LayoutLMv3, DocTR, Donut |
| End-to-end Document AI | Pixels → (text + structure + extracted fields) in one pass | AWS Textract, Azure DI |
Side-by-side (relative scores, higher is better; a dash means the tool does not cover that axis):
| Tool | OCR | Layout | Table | Form/Sign | License | Korean |
|---|---|---|---|---|---|---|
| Tesseract | 3 | 1 | 1 | — | Apache 2.0 | 2 |
| PaddleOCR | 4 | 2 | 2 | — | Apache 2.0 | 4 |
| AWS Textract | 4 | 3 | 4 | 3 | Commercial | 3 |
| Azure DI | 4 | 4 | 4 | 4 | Commercial | 4 |
| LayoutLMv3 | — | 4 | 2 | 2 | CC BY-NC-SA 4.0 (model weights) | depends on OCR |
LayoutLMv3 is a model, not an OCR. Korean performance depends on whatever OCR feeds it.
5. Standard Calls
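```python
# PaddleOCR (2.x-style call) with the Korean model
from paddleocr import PaddleOCR
ocr = PaddleOCR(lang="korean")
result = ocr.ocr("page.png")  # one entry per page: [[bbox, (text, score)], ...]
```

```python
# Tesseract via pytesseract; needs the kor language pack installed
import pytesseract
text = pytesseract.image_to_string("page.png", lang="kor")
```

```python
# AWS Textract: synchronous text detection on a single image
import boto3
client = boto3.client("textract")
with open("page.png", "rb") as f:
    resp = client.detect_document_text(Document={"Bytes": f.read()})
```

```python
# Azure Document Intelligence: prebuilt-layout model
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.core.credentials import AzureKeyCredential
# endpoint and key are placeholders for your own resource's values
client = DocumentIntelligenceClient(endpoint, AzureKeyCredential(key))
with open("page.pdf", "rb") as f:
    poller = client.begin_analyze_document("prebuilt-layout", body=f)
    result = poller.result()
```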
LayoutLMv3 loads via Hugging Face Transformers and operates on OCR output, not raw pixels.
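Because LayoutLMv3 consumes OCR output, the usual glue step is turning OCR words and pixel boxes into the inputs the model expects. LayoutLM-family models take each box on a 0 to 1000 grid relative to the page size. The sketch below shows only that conversion; the helper names `normalize_box` and `to_layoutlm_inputs` and the sample words are illustrative assumptions, not part of any library.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


def normalize_box(box: Box, page_width: int, page_height: int) -> List[int]:
    """Scale a pixel box to the 0-1000 grid used by LayoutLM-style models."""
    x0, y0, x1, y1 = box
    scaled = [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]
    # Clamp so boxes that slightly overshoot the page stay in range.
    return [min(1000, max(0, v)) for v in scaled]


def to_layoutlm_inputs(
    ocr_words: Sequence[Tuple[str, Box]], page_width: int, page_height: int
) -> Tuple[List[str], List[List[int]]]:
    """Split (text, box) pairs into parallel word and box lists."""
    words = [text for text, _ in ocr_words]
    boxes = [normalize_box(box, page_width, page_height) for _, box in ocr_words]
    return words, boxes


# Hypothetical OCR output for a 2000 x 1000 pixel page.
ocr_words = [("회의록", (200, 50, 600, 120)), ("2024", (1000, 500, 2000, 1000))]
words, boxes = to_layoutlm_inputs(ocr_words, page_width=2000, page_height=1000)
```

The resulting `words` and `boxes` are the shape that a `LayoutLMv3Processor` loaded with `apply_ocr=False` accepts alongside the page image, which keeps the choice of OCR engine (and therefore Korean quality) under your control.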
6. Walkthrough — From Pixels to Markdown
- Image normalisation: rotation, deskew, denoise.
- Text detection: bounding boxes for character regions.
- Text recognition: characters per box (CRNN, attention OCR).
- Layout classification: group bboxes into blocks (title, paragraph, table, image).
- Table recognition: split table regions into row/column cells.
- Reading-order sort: human-style reading order across columns.
- Markdown serialisation: emit headings, paragraphs, tables, image refs.
Tools differ in how far down this pipeline they take you.
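The last step of the pipeline above, markdown serialisation, is simple enough to sketch once the earlier steps have produced classified blocks in reading order. The `Block` type and its `kind` labels below are assumptions made for illustration; real tools return their own structures.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Block:
    kind: str  # "title", "paragraph", "table" or "image"
    content: Union[str, List[List[str]]]  # text, table rows, or an image path


def table_to_markdown(rows: List[List[str]]) -> str:
    """Render rows as a pipe table, treating the first row as the header."""
    header, *body = rows
    lines = ["| " + " | ".join(header) + " |", "|" + "---|" * len(header)]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def blocks_to_markdown(blocks: List[Block]) -> str:
    """Serialise blocks that are already in reading order."""
    parts = []
    for block in blocks:
        if block.kind == "title":
            parts.append("# " + block.content)
        elif block.kind == "paragraph":
            parts.append(block.content)
        elif block.kind == "table":
            parts.append(table_to_markdown(block.content))
        elif block.kind == "image":
            parts.append(f"![image]({block.content})")
    return "\n\n".join(parts)


page = [
    Block("title", "회의록"),
    Block("paragraph", "First paragraph of the body."),
    Block("table", [["Item", "Price"], ["Pen", "1,000"]]),
    Block("image", "fig1.png"),
]
markdown = blocks_to_markdown(page)
```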
7. Variants and Cases
7.1 PaddleOCR
- What changes: Apache 2.0 open OCR for 80+ languages; v4 Korean is the standard accuracy baseline.
- Why it appeared: Baidu open-sourced its internal OCR, enabling multilingual processing without paying for APIs.
- What becomes possible: free local OCR for Korean. GPU acceleration available.
- Where it fits: Korean-primary corpora; budget-sensitive deployments.
- Limits: weak on tables, signatures, checkboxes; pair with another tool.
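PaddleOCR's 2.x-style output is a nested list of four-point quads with `(text, score)` tuples, which is awkward to pass downstream. A minimal sketch of flattening one page into plain records follows; the function name `flatten_page` and the sample data are illustrative, and the input shape is assumed to match the `[[bbox, (text, score)], ...]` form shown in §5.

```python
from typing import Dict, List, Optional


def flatten_page(page: Optional[list]) -> List[Dict]:
    """Convert one page of [[quad, (text, score)], ...] into flat records.

    Each quad is four [x, y] corner points; the record keeps an
    axis-aligned box (x0, y0, x1, y1) that encloses the quad.
    """
    records = []
    for quad, (text, score) in page or []:  # tolerate an empty or None page
        xs = [point[0] for point in quad]
        ys = [point[1] for point in quad]
        records.append(
            {
                "text": text,
                "score": float(score),
                "bbox": (min(xs), min(ys), max(xs), max(ys)),
            }
        )
    return records


# Hypothetical result for a single page with two detected lines.
sample_page = [
    [[[10, 10], [110, 12], [110, 42], [10, 40]], ("회의록", 0.98)],
    [[[10, 60], [300, 60], [300, 90], [10, 90]], ("2024년 3월", 0.91)],
]
records = flatten_page(sample_page)
```

Keeping the box and the score on every record is what later makes table reconstruction and low-confidence routing possible without re-running OCR.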
7.2 Tesseract
- What changes: the oldest open OCR with a `kor` language pack.
- Why it appeared: HP/Google released it as public infrastructure; easy install everywhere.
- What becomes possible: a free OCR baseline on any OS.
- Where it fits: simple Latin-text corpora.
- Limits: CJK accuracy lags PaddleOCR; weak on handwriting and unusual fonts.
7.3 AWS Textract
- What changes: AWS Document AI API — text + tables + forms + handwriting + natural-language Queries for field extraction.
- Why it appeared: enterprise documents (invoices, contracts, forms) needed automated extraction. Korean official 2023+.
- What becomes possible: business forms and contracts to cell-level markdown with SLA / HIPAA options.
- Where it fits: AWS-native enterprise document workflows.
- Limits: per-page cost; data leaves to AWS; not a fit for strict on-prem requirements.
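A Textract response is a flat list of `Blocks`, so some post-processing is needed before the text is usable. The sketch below pulls out `LINE` blocks and orders them top to bottom; the function name `textract_lines`, the confidence cut-off, and the sample response are illustrative assumptions, and the plain top-then-left sort is only adequate for single-column pages.

```python
from typing import Dict, List


def textract_lines(response: Dict, min_confidence: float = 0.0) -> List[str]:
    """Return LINE texts from a Textract response, sorted top-to-bottom.

    Textract reports Confidence on a 0-100 scale and BoundingBox
    coordinates as ratios of the page size.
    """
    lines = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue
        if block.get("Confidence", 0.0) < min_confidence:
            continue
        box = block["Geometry"]["BoundingBox"]
        lines.append((box["Top"], box["Left"], block["Text"]))
    lines.sort()
    return [text for _, _, text in lines]


def _line(text: str, top: float, confidence: float) -> Dict:
    """Build a fake LINE block for the example below."""
    box = {"Top": top, "Left": 0.1, "Width": 0.5, "Height": 0.05}
    return {
        "BlockType": "LINE",
        "Text": text,
        "Confidence": confidence,
        "Geometry": {"BoundingBox": box},
    }


# Hypothetical response trimmed to the fields used here.
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        _line("second line", top=0.50, confidence=99.0),
        _line("first line", top=0.10, confidence=98.5),
        _line("smudge", top=0.30, confidence=40.0),
    ]
}
texts = textract_lines(sample_response, min_confidence=80.0)
```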
7.4 Azure Document Intelligence (formerly Form Recognizer)
- What changes: Microsoft Document AI with prebuilt models (layout / receipts / IDs / business cards) and custom fine-tunable models.
- Why it appeared: Azure's OCR stack matured; Korean support is top-tier.
- What becomes possible: tables, signatures, checkboxes, key-value pairs in one pass. `prebuilt-layout` is a strong RAG preprocessor baseline.
- Where it fits: Azure environments with both Korean and table requirements.
- Limits: per-page cost; data egress.
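The layout result exposes tables as cells with row and column indices, which maps directly onto a markdown table. The sketch below assumes objects shaped like the SDK's table results (`row_count`, `column_count`, and `cells` carrying `row_index`, `column_index`, `content`) and uses stand-in objects so it runs without an Azure call; it ignores merged cells, which is a simplification.

```python
from types import SimpleNamespace


def azure_table_to_markdown(table) -> str:
    """Render one table object as a markdown pipe table.

    Merged cells (row or column spans) are not expanded in this sketch.
    """
    grid = [[""] * table.column_count for _ in range(table.row_count)]
    for cell in table.cells:
        # Flatten line breaks so each cell stays on one markdown row.
        grid[cell.row_index][cell.column_index] = " ".join(cell.content.split())
    if not grid:
        return ""
    header, *body = grid
    lines = ["| " + " | ".join(header) + " |", "|" + "---|" * len(header)]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def _cell(row: int, col: int, content: str) -> SimpleNamespace:
    """Stand-in for a table cell, used only for this example."""
    return SimpleNamespace(row_index=row, column_index=col, content=content)


# Stand-in for one entry of result.tables.
sample_table = SimpleNamespace(
    row_count=2,
    column_count=2,
    cells=[
        _cell(0, 0, "항목"),
        _cell(0, 1, "금액"),
        _cell(1, 0, "펜"),
        _cell(1, 1, "1,000\n원"),
    ],
)
table_md = azure_table_to_markdown(sample_table)
```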
7.5 LayoutLMv3 — Post-OCR Understanding
- What changes: takes (text + bbox + image patches) and learns a unified embedding for document understanding.
- Why it appeared: text alone misses reading order and block roles. Microsoft Research (2022).
- What becomes possible: extract total fields on receipts, signature blocks in contracts, titles vs body in reports.
- Where it fits: post-processing OCR output for classification and field extraction in RAG KB enrichment.
- Limits: not an OCR; Korean LayoutLMv3 support is limited and depends on the upstream OCR.
8. Limits and Failure Modes
8.1 Korean Output Shows Question Marks
- Why intrinsic: PDF font-mapping errors plus no OCR pass. Digital PDFs without embedded fonts break text extraction.
- Diagnosis: `?` or `□` in the output.
- Mitigation: rasterise to image and OCR with `PaddleOCR(lang="korean")`.
- Later part: §7.1 of this part.
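The diagnosis for this failure can be automated as a cheap check on the extracted text before deciding whether to pay for an OCR pass. This is a rough sketch: the suspect character set and the 0.2 threshold are assumptions to tune on your own corpus, and a text with many genuine question marks would be a false positive.

```python
# Characters that typically show up when a PDF font map is broken:
# a literal question mark, U+FFFD (replacement character), U+25A1 (white square).
SUSPECT_CHARS = {"?", "\ufffd", "\u25a1"}


def broken_text_ratio(text: str) -> float:
    """Share of non-whitespace characters that look like mapping failures."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 1.0  # an empty text layer is treated as fully broken
    return sum(ch in SUSPECT_CHARS for ch in visible) / len(visible)


def needs_ocr(text: str, threshold: float = 0.2) -> bool:
    """Decide whether a page should be rasterised and sent through OCR."""
    return broken_text_ratio(text) >= threshold


broken_page = "???? \u25a1\u25a1 ??? 2024"
healthy_page = "회의록 2024년 3월 정기 회의는 언제 열렸는가?"
```

Running the check per page rather than per document keeps the OCR bill down when only a few pages in a file have broken fonts.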
8.2 Tables Collapse to a Single Line
- Why intrinsic: OCR alone does not know cell boundaries; cells collapse into a whitespace-delimited row.
- Diagnosis: few `|` pipes; cell values appear as whitespace blobs.
- Mitigation: use Textract or Azure DI table modes; or post-process OCR bboxes into cells.
- Later part: §7.3, §7.4.
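The "post-process OCR bboxes into cells" route can be sketched as clustering boxes by vertical centre into rows, then sorting each row left to right. This is a minimal illustration that assumes one OCR box per cell, a roughly deskewed page, and a hypothetical `y_tolerance` in pixels; real tables with wrapped or merged cells need more than this.

```python
from typing import List, Tuple

Word = Tuple[str, Tuple[int, int, int, int]]  # (text, (x0, y0, x1, y1))


def group_into_rows(words: List[Word], y_tolerance: int = 10) -> List[List[str]]:
    """Cluster boxes into rows by vertical centre, then order cells by x0."""
    rows = []
    for text, box in sorted(words, key=lambda w: (w[1][1] + w[1][3]) / 2):
        centre_y = (box[1] + box[3]) / 2
        if rows and abs(centre_y - rows[-1]["centre_y"]) <= y_tolerance:
            rows[-1]["cells"].append((box[0], text))
        else:
            rows.append({"centre_y": centre_y, "cells": [(box[0], text)]})
    return [[text for _, text in sorted(row["cells"])] for row in rows]


def rows_to_markdown(rows: List[List[str]]) -> str:
    """Emit a pipe table, using the first row as the header."""
    header, *body = rows
    lines = ["| " + " | ".join(header) + " |", "|" + "---|" * len(header)]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


# Hypothetical OCR boxes for a 2 x 2 table, deliberately out of order.
table_words = [
    ("1,000", (200, 52, 260, 72)),
    ("Item", (10, 10, 60, 30)),
    ("Pen", (10, 50, 50, 70)),
    ("Price", (200, 12, 260, 32)),
]
table_rows = group_into_rows(table_words)
table_markdown = rows_to_markdown(table_rows)
```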
8.3 Multi-column Reading-order Errors
- Why intrinsic: OCR recognises characters, not order. Sorting purely top-down, left-right scrambles two-column text.
- Diagnosis: sentences cross columns mid-way.
- Mitigation: LayoutLMv3, or reading-order algorithms inside Textract / Azure DI.
- Later part: §7.4, §7.5.
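To see why the naive sort fails and what a column-aware sort changes, here is a small sketch that groups blocks into columns by their left edge before sorting top to bottom. It assumes clean columns with no full-width blocks (titles, wide tables) interleaved, and the `column_gap` value is a hypothetical pixel threshold; it is a heuristic, not what LayoutLMv3 or the commercial APIs do internally.

```python
from typing import List, Tuple

TextBlock = Tuple[str, Tuple[int, int, int, int]]  # (text, (x0, y0, x1, y1))


def naive_order(blocks: List[TextBlock]) -> List[str]:
    """Top-down, then left-right: interleaves the two columns."""
    return [t for t, _ in sorted(blocks, key=lambda b: (b[1][1], b[1][0]))]


def column_aware_order(blocks: List[TextBlock], column_gap: int = 100) -> List[str]:
    """Group blocks into columns by left edge, then read each column downwards."""
    columns = []
    for block in sorted(blocks, key=lambda b: b[1][0]):
        x0 = block[1][0]
        if columns and x0 - columns[-1]["x0"] <= column_gap:
            columns[-1]["blocks"].append(block)
        else:
            # Left edge jumped by more than the gap: start a new column.
            columns.append({"x0": x0, "blocks": [block]})
    ordered = []
    for column in columns:
        ordered.extend(sorted(column["blocks"], key=lambda b: b[1][1]))
    return [text for text, _ in ordered]


# Hypothetical two-column page: L = left column, R = right column.
two_columns = [
    ("L1", (50, 100, 500, 180)),
    ("R1", (600, 100, 1050, 180)),
    ("L2", (52, 200, 500, 280)),
    ("R2", (598, 200, 1050, 280)),
]
scrambled = naive_order(two_columns)
fixed = column_aware_order(two_columns)
```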
8.4 Handwriting / Signatures Lost
- Why intrinsic: most OCRs target printed text; handwriting accuracy drops below 50% on the same image.
- Diagnosis: handwritten regions return blank or noisy strings.
- Mitigation: Textract `FORMS` mode or Azure DI `prebuilt-handwritten`.
- Later part: §7.3, §7.4.
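The diagnosis above (blank or noisy strings in handwritten regions) can double as a routing signal: keep confident lines from the first OCR pass and send only the rest to a handwriting-capable service. The sketch below assumes records with `text` and a 0 to 1 `score`, and the 0.6 threshold is a hypothetical starting point.

```python
from typing import Dict, List, Tuple


def split_by_confidence(
    lines: List[Dict], threshold: float = 0.6
) -> Tuple[List[Dict], List[Dict]]:
    """Separate confident OCR lines from regions worth a second pass.

    A line is rerouted when its text is blank or its score is below
    the threshold.
    """
    keep, reroute = [], []
    for line in lines:
        confident = line["score"] >= threshold and line["text"].strip() != ""
        (keep if confident else reroute).append(line)
    return keep, reroute


# Hypothetical first-pass output: printed title, blank region, noisy signature.
first_pass = [
    {"text": "회의록", "score": 0.97, "bbox": (10, 10, 110, 42)},
    {"text": "  ", "score": 0.90, "bbox": (10, 800, 200, 860)},
    {"text": "~lz;", "score": 0.31, "bbox": (220, 800, 420, 860)},
]
kept, second_pass = split_by_confidence(first_pass)
```

Only the boxes in `second_pass` need to be cropped and sent onward, which limits the per-page cost of the commercial APIs to the pages that actually contain handwriting.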
8.5 Cost and Latency Stack-up
- Why intrinsic: 10K-page OCR with PaddleOCR is tens of hours locally; Textract/Azure DI costs tens to hundreds of dollars; LayoutLMv3 post-processing adds GPU time.
- Diagnosis: estimate one full KB build's time and dollars.
- Mitigation: tier the corpus and use incremental OCR.
- Later part: Part 25.
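The "estimate one full KB build" step is plain arithmetic, sketched below. Every number in the example is a placeholder: the per-page seconds and the price per 1,000 pages are hypothetical inputs, not quoted vendor pricing or measured throughput, so substitute your own measurements and the current price sheet.

```python
from dataclasses import dataclass


@dataclass
class Estimate:
    hours: float
    dollars: float


def estimate_build(
    pages: int,
    seconds_per_page: float,
    dollars_per_1000_pages: float,
    workers: int = 1,
) -> Estimate:
    """Rough wall-clock time and API spend for one full corpus pass."""
    hours = pages * seconds_per_page / workers / 3600
    dollars = pages / 1000 * dollars_per_1000_pages
    return Estimate(hours=hours, dollars=dollars)


# Hypothetical inputs for a 10,000-page corpus.
local_ocr = estimate_build(10_000, seconds_per_page=9, dollars_per_1000_pages=0)
cloud_api = estimate_build(10_000, seconds_per_page=1, dollars_per_1000_pages=15)
local_parallel = estimate_build(
    10_000, seconds_per_page=9, dollars_per_1000_pages=0, workers=5
)
```

With these placeholder inputs the local run comes to 25 hours at no API cost and the cloud run to 150 dollars, which is the kind of gap that makes tiering the corpus worthwhile.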
8.6 Common Pitfalls
- "Digital PDFs don't need OCR." — Korean font-mapping fails often. §8.1.
- "OCR captures tables automatically." — Often not; tables need a dedicated step. §8.2.
- "One tool fits all documents." — Cost and accuracy both lose; tier the corpus.
- "One OCR pass and we are done." — Tables, handwriting, multi-column may need separate models.
- "LayoutLMv3 is an OCR." — It is not.
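The "tier the corpus" advice can be made concrete as a small routing function that picks a pipeline per document. This is one possible reading of the decision table in this part, not a fixed rule; the `DocProfile` fields and the returned labels are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocProfile:
    text_layer_ok: bool  # digital PDF whose extracted text is readable
    has_tables: bool
    has_handwriting: bool


def choose_tool(doc: DocProfile, cloud_allowed: bool = True) -> str:
    """Pick the cheapest pipeline that should handle the document."""
    needs_structure = doc.has_tables or doc.has_handwriting
    if doc.text_layer_ok and not needs_structure:
        return "text-extraction"  # no OCR needed
    if needs_structure and cloud_allowed:
        return "azure-di"  # tables, signatures, handwriting
    return "paddleocr"  # free local default for Korean


clean_report = DocProfile(text_layer_ok=True, has_tables=False, has_handwriting=False)
scanned_minutes = DocProfile(text_layer_ok=False, has_tables=True, has_handwriting=True)
scanned_letter = DocProfile(text_layer_ok=False, has_tables=False, has_handwriting=False)
```

Passing `cloud_allowed=False` models a strict on-prem requirement: everything that needs OCR then falls back to the local tool, and table quality has to be recovered by post-processing.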
9. Settled Conclusions
Q1. What is the difference between OCR and Layout Analysis?
OCR recognises characters. Layout analysis recognises blocks of characters and their reading order. Two separate stages. Chapter: §4.
Q2. Default OCR for Korean corpora?
PaddleOCR (`lang="korean"`). Mix in Azure DI when tables or signatures matter. Chapter: §7.1, §7.4.
Q3. Which two tools lead on table extraction?
AWS Textract and Azure DI. Both produce cell-level tables and key-value pairs. Chapter: §7.3, §7.4.
Q4. When is LayoutLMv3 the right choice?
After OCR provides (text + bbox), when you need classification or field extraction on the document — typically RAG KB enrichment. Chapter: §7.5.
Q5. Why does OCR fix the "question marks" failure on Korean digital PDFs?
OCR re-recognises characters from pixels, bypassing the PDF's broken font map. Chapter: §8.1.
10. Further Reading
Primary
- Du, Y. et al. PP-OCRv4. arXiv:2304.10833
- Smith, R. An Overview of the Tesseract OCR Engine. ICDAR 2007.
- Huang, Y. et al. LayoutLMv3. ACM MM 2022. arXiv:2204.08387
- AWS Textract / Azure Document Intelligence official model cards (2024 updates).
Official docs
- PaddleOCR: https://github.com/PaddlePaddle/PaddleOCR
- Tesseract: https://tesseract-ocr.github.io/
- AWS Textract: https://docs.aws.amazon.com/textract/
- Azure Document Intelligence: https://learn.microsoft.com/azure/ai-services/document-intelligence/
- LayoutLMv3 model card: https://huggingface.co/microsoft/layoutlmv3-base
Supporting
- Author note §3 — OCR / Layout priority.
- Author note §35-6 — Filter-first's preprocessing value.
Cheat Sheet
| Tool | Korean | Tables | License | Cost | Fits |
|---|---|---|---|---|---|
| Tesseract | 2 | 1 | Apache | Free | Simple English |
| PaddleOCR | 4 | 2 | Apache | Free | Korean default |
| AWS Textract | 3 | 4 | Commercial | Per-page | Forms, tables |
| Azure DI | 4 | 4 | Commercial | Per-page | Korean + tables |
| LayoutLMv3 | OCR-dep | 2 | CC BY-NC-SA | GPU time | OCR post-processing |
Bridge — What's Next
Next — RAG Core Study (5/26) — Five Paths of Chunking.
The documents are markdown, the search space is partitioned. Now: in what unit do we split? Fixed / Paragraph / Heading / Semantic / Sliding Window — and the Whole-document vs Chunk-level decision table left over from Part 3.