How to extract tables from PDFs reliably
Extract structured tables from PDFs — bordered, borderless, multi-page, rotated — and convert them into clean JSON, CSV, or Excel. A practical guide for engineers and analysts.
Why table extraction is harder than text extraction
Pulling text out of a PDF is largely a solved problem. Pulling structured tables — preserving rows, columns, cell relationships, and headers — is not. Tables in PDFs are usually visual, not semantic. The viewer sees a grid; the file format sees scattered text fragments positioned on a canvas.
That gap is why naive text extraction returns columns concatenated as one long string and rows squashed together. Real table extraction has to reconstruct grid structure from layout signals.
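As a rough illustration of the difference, the sketch below rebuilds rows from positioned fragments. The `Word` type, the coordinates, and the `y_tolerance` value are assumptions made up for this example; they do not come from any particular PDF library.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    x: float  # left edge of the fragment (hypothetical units)
    y: float  # vertical position of the fragment


def group_into_rows(words, y_tolerance=2.0):
    """Cluster fragments with similar y positions into rows, then sort each row by x."""
    rows = []
    for word in sorted(words, key=lambda w: (w.y, w.x)):
        # Compare against the first fragment of the current row.
        if rows and abs(word.y - rows[-1][0].y) <= y_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    return [[w.text for w in sorted(row, key=lambda w: w.x)] for row in rows]


# Fragments in the arbitrary order a PDF content stream might store them.
words = [
    Word("Qty", 120, 10), Word("Item", 20, 10.5), Word("Price", 200, 10),
    Word("Widget", 20, 30), Word("2", 120, 30.4), Word("9.99", 200, 30),
]

naive = " ".join(w.text for w in words)  # one long string, no structure
table = group_into_rows(words)           # rows recovered from layout
```

Here `naive` is a single string with the header out of order, while `table` holds two rows of three cells each. Real documents need more than a fixed tolerance, but the principle is the same: structure comes from positions, not from the text stream.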
Three table types and how each fails
Bordered tables (visible lines around every cell) are the easiest. Most table extractors get them right with line-detection algorithms. Even so, those extractors still fail on merged cells and on cells whose content wraps onto multiple lines.
Borderless tables rely entirely on whitespace between columns. Rule-based extractors guess the column boundaries from spacing and frequently misalign rows. AI extractors handle them better by reasoning about content alignment.
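A minimal sketch of the rule-based approach, assuming each text fragment is available as a hypothetical `(x_start, x_end)` span and that a gap of at least `min_gap` units separates columns. Both the input shape and the threshold are illustrative assumptions.

```python
def find_column_gaps(spans, min_gap=10.0):
    """Return x positions of column separators: midpoints of wide empty gaps.

    spans: (x_start, x_end) for every text fragment in the table region.
    """
    # Merge overlapping spans so each column collapses into one interval.
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])

    separators = []
    for (_, left_end), (right_start, _) in zip(merged, merged[1:]):
        if right_start - left_end >= min_gap:
            separators.append((left_end + right_start) / 2)
    return separators


def assign_column(x_start, separators):
    """Column index = number of separators to the left of the fragment."""
    return sum(1 for s in separators if x_start >= s)


spans = [(20, 60), (22, 48), (120, 130), (118, 135), (200, 230), (205, 228)]
separators = find_column_gaps(spans)
```

This also shows where the approach breaks down: a single long cell that spills into the neighboring whitespace shrinks the gap below `min_gap`, the separator disappears, and two columns merge.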
Multi-page tables, where header rows repeat and data continues across page breaks, require an extractor that understands document continuity rather than treating every page independently.
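One simple way to approximate continuity, assuming each page has already been extracted into a list of rows, is to treat the first row of the first page as the header and drop later rows that repeat it. The function names and the normalization rule below are assumptions for illustration only.

```python
def normalize(row):
    """Compare header rows loosely: ignore case and surrounding whitespace."""
    return [cell.strip().lower() for cell in row]


def stitch_pages(pages):
    """Join per-page tables into one table, keeping the header only once.

    pages: list of tables, each a list of rows (lists of strings).
    """
    pages = [page for page in pages if page]  # skip pages with no rows
    if not pages:
        return []
    header = pages[0][0]
    stitched = [header] + pages[0][1:]
    for page in pages[1:]:
        # Skip the first row only if it repeats the header.
        start = 1 if normalize(page[0]) == normalize(header) else 0
        stitched.extend(page[start:])
    return stitched


pages = [
    [["Date", "Amount"], ["2024-01-01", "10.00"]],
    [["Date ", "amount"], ["2024-01-02", "20.00"]],  # repeated header
    [["2024-01-03", "30.00"]],                       # continuation, no header
]
stitched = stitch_pages(pages)
```

A production extractor would also need to handle rows split across a page break and subtotal rows inserted at the bottom of each page, which this sketch ignores.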
Rotated, scanned, and noisy tables
Tables rotated 90 degrees (common in landscape financial reports embedded in portrait documents) need orientation detection before extraction. Scanned tables need OCR plus structure inference, which is harder than either piece alone.
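Orientation detection can be approximated with a very rough heuristic, sketched below under the assumption that fragment bounding boxes are available as hypothetical `(text, width, height)` tuples. Upright words of several characters are usually wider than they are tall, so a region where most boxes are taller than wide probably contains rotated text. The thresholds are arbitrary, and the check cannot tell 90 degrees from 270.

```python
def looks_rotated(boxes, min_chars=3, threshold=0.5):
    """Guess whether text in a region runs vertically.

    boxes: (text, width, height) for each text fragment in the region.
    """
    # Very short fragments have ambiguous shapes, so leave them out.
    candidates = [(w, h) for text, w, h in boxes if len(text) >= min_chars]
    if not candidates:
        return False
    tall = sum(1 for w, h in candidates if h > w)
    return tall / len(candidates) > threshold


upright = [("Revenue", 42, 10), ("2023", 24, 10), ("Q1", 12, 10)]
sideways = [("Revenue", 10, 42), ("2023", 10, 24), ("Q1", 10, 12)]
```

With these inputs, `looks_rotated(upright)` is false and `looks_rotated(sideways)` is true. Scanned pages have no fragment boxes until OCR has run, so for scans this kind of check would have to follow OCR or be replaced by an image-based orientation step.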
Stamps, watermarks, and signature blocks frequently overlap tables and confuse line-detection algorithms. Modern AI extractors tend to be more resilient because they can disregard visual noise that does not change what the table says.
A schema-first approach with DocPeel
Instead of asking "extract every table from this PDF," define the table you actually want. For an invoice, that is a line_items array with description, quantity, unit_price, and amount columns. For a bank statement, it is a transactions array with date, description, amount, and balance columns.
See the [DocPeel template system](/template-extraction) for how to define a table schema once and have the AI extract consistent rows from any PDF layout.
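To make the idea concrete, here is a sketch of what a line_items schema could look like in plain Python. This is not DocPeel's template format or API; the `LineItem` class, the helper functions, and the raw row shape are all assumptions for illustration.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class LineItem:
    description: str
    quantity: Decimal
    unit_price: Decimal
    amount: Decimal


def parse_line_items(raw_rows):
    """Coerce raw extracted rows (dicts of strings) into typed LineItem objects."""
    items = []
    for row in raw_rows:
        items.append(LineItem(
            description=row["description"].strip(),
            quantity=Decimal(row["quantity"]),
            # Strip thousands separators before parsing money values.
            unit_price=Decimal(row["unit_price"].replace(",", "")),
            amount=Decimal(row["amount"].replace(",", "")),
        ))
    return items


def inconsistent_rows(items, tolerance=Decimal("0.01")):
    """Return rows where quantity * unit_price does not match amount."""
    return [i for i in items if abs(i.quantity * i.unit_price - i.amount) > tolerance]


raw = [
    {"description": " Widget ", "quantity": "2", "unit_price": "9.99", "amount": "19.98"},
    {"description": "Setup fee", "quantity": "1", "unit_price": "1,200.00", "amount": "1,250.00"},
]
items = parse_line_items(raw)
suspect = inconsistent_rows(items)
```

A fixed schema gives every document the same column names and types, and it makes cheap sanity checks possible. Here the second row is flagged because 1 × 1,200.00 does not equal 1,250.00.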
When tables should not be tables in the output
Sometimes the best output for a "table" is not actually a table. A summary section with totals is usually better as named fields than a grid. Repeating sections (multiple invoices in one PDF) should usually be an array of invoice objects, not one giant flattened table.
Modeling the output to match the downstream system saves you a pivot step later.
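As an example of that modeling choice, the sketch below turns one flattened table into an array of invoice objects, with the total held as a named field rather than repeated on every row. The field names (`invoice_number`, `invoice_total`, and so on) are hypothetical.

```python
def group_invoices(flat_rows):
    """Group flat rows into one object per invoice, preserving first-seen order."""
    invoices = {}
    for row in flat_rows:
        number = row["invoice_number"]
        if number not in invoices:
            invoices[number] = {
                "invoice_number": number,
                "total": row["invoice_total"],  # named field, stored once
                "line_items": [],
            }
        invoices[number]["line_items"].append(
            {"description": row["description"], "amount": row["amount"]}
        )
    return list(invoices.values())


flat = [
    {"invoice_number": "INV-1", "invoice_total": "30.00", "description": "A", "amount": "10.00"},
    {"invoice_number": "INV-1", "invoice_total": "30.00", "description": "B", "amount": "20.00"},
    {"invoice_number": "INV-2", "invoice_total": "5.00", "description": "C", "amount": "5.00"},
]
invoices = group_invoices(flat)
```

The result is two invoice objects, the first with two line items and the second with one. If the extractor can produce this shape directly, the grouping step disappears from the downstream code.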