Why PDFs are still the hardest format to work with
PDFs were designed for printing, not data exchange. The format stores text as positioned glyphs with no inherent semantic structure — a "Total" label and its value might be in completely different parts of the file's internal object model, even if they look adjacent on screen.
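To make the problem concrete, here is a minimal sketch of what an extractor actually sees: a bag of words with page coordinates and nothing linking a label to its value. The records and the grouping heuristic below are invented for illustration, not DocPeel's internals.

```python
# Hypothetical positioned-glyph records, as a low-level PDF text
# extractor might return them: each word carries only its text and
# coordinates. Nothing in the file says "Total" and "128.50" belong
# together; they merely happen to share a vertical position.
words = [
    {"text": "Invoice", "x": 72,  "y": 700},
    {"text": "Total",   "x": 72,  "y": 120},
    {"text": "128.50",  "x": 400, "y": 121},  # adjacent on screen only
]

def group_into_lines(words, y_tolerance=3):
    """Rebuild visual lines by clustering words with similar y positions,
    then ordering each line left to right."""
    lines = {}
    for w in words:
        key = round(w["y"] / y_tolerance)  # words within ~y_tolerance pts share a line
        lines.setdefault(key, []).append(w)
    # order lines top-to-bottom (PDF y grows upward), words left-to-right
    ordered = sorted(lines.values(), key=lambda ws: -ws[0]["y"])
    return [" ".join(w["text"] for w in sorted(ws, key=lambda w: w["x"]))
            for ws in ordered]

print(group_into_lines(words))  # ['Invoice', 'Total 128.50']
```

Coordinate heuristics like this recover visual adjacency, but they still say nothing about *meaning*, which is the gap the language model fills.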
Scanned PDFs are even harder: the file is just an image, with no embedded text at all. Traditional extraction tools rely on rigid coordinate-based templates that break the moment a vendor changes their layout, a printer skews the scan, or a new document type arrives.
DocPeel uses a large language model that reads each PDF the way a human analyst would — understanding structure from visual context rather than from fixed coordinates.
Three types of PDF, all handled the same way
Digitally generated PDFs (from Word, InDesign, accounting software) contain a native text layer that DocPeel reads directly. Extraction is fast and highly accurate because there is no OCR step.
Scanned PDFs are images of paper documents. DocPeel applies image pre-processing — deskewing, contrast enhancement, noise reduction — before running character recognition. Most clean scans reach the same accuracy level as native PDFs.
Hybrid PDFs mix native text with scanned pages — common when someone scans a signed page and appends it to a digital document. DocPeel handles each page according to its content type, automatically, without configuration.
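The per-page routing described above can be sketched in a few lines. The function and the page inputs here are hypothetical; a real pipeline would get each page's text layer from a PDF library before deciding.

```python
def route_page(page_text):
    """Decide how to process one page of a (possibly hybrid) PDF.
    `page_text` is whatever a text-layer extractor returned for the page;
    scanned, image-only pages typically yield None or an empty string.
    """
    if page_text and page_text.strip():
        return "native"  # embedded text layer: read it directly, no OCR
    return "ocr"         # image-only: deskew, enhance contrast, then OCR

# A hybrid document: two digital pages with a scanned page in between.
pages = ["Invoice #42 ...", "", "Signed: J. Smith"]
print([route_page(p) for p in pages])  # ['native', 'ocr', 'native']
```

Routing page by page, rather than per file, is what lets a single document mix both extraction paths without configuration.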
Tables, multi-column layouts, and complex structures
Tables in PDFs are notoriously difficult. The PDF format has no concept of a "table" — each cell is just text at a position. Reconstructing rows, columns, and spanning cells requires spatial reasoning.
DocPeel extracts tables as arrays of row objects. Column headers become field names, and each row becomes a record. For multi-header tables (where the header spans multiple rows), the model collapses the header rows into a single clean schema.
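The header-to-field-name mapping above is easy to picture as code. This is a sketch of the output shape only, with invented example data; the hard part — recovering the header and rows from positioned text — happens before this step.

```python
def table_to_records(header, rows):
    """Turn an extracted table (one header row plus data rows) into an
    array of row objects. Multi-row headers are assumed to have already
    been collapsed into a single list of field names."""
    return [dict(zip(header, row)) for row in rows]

header = ["item", "qty", "price"]
rows = [["Widget", "2", "9.99"],
        ["Gadget", "1", "14.50"]]
print(table_to_records(header, rows))
# [{'item': 'Widget', 'qty': '2', 'price': '9.99'},
#  {'item': 'Gadget', 'qty': '1', 'price': '14.50'}]
```

Because every row shares the header's field names, the result inserts directly into a database table or feeds an API call without per-document mapping code.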
Multi-column layouts — common in academic papers, catalogues, and forms — are read in the correct reading order rather than being streamed left-to-right across columns.
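To see why reading order matters, here is a deliberately naive two-column sketch. The gutter position is passed in as a known value for illustration; real layout analysis infers column boundaries from whitespace, and the model reasons about reading order visually rather than with a hard-coded split.

```python
def column_reading_order(words, column_split_x):
    """Read a two-column page in column order: everything left of the
    gutter top-to-bottom, then the right column. A naive left-to-right
    stream would interleave the columns line by line instead."""
    left  = [w for w in words if w["x"] < column_split_x]
    right = [w for w in words if w["x"] >= column_split_x]
    ordered = (sorted(left,  key=lambda w: -w["y"]) +   # PDF y grows upward
               sorted(right, key=lambda w: -w["y"]))
    return [w["text"] for w in ordered]

words = [
    {"text": "Abstract", "x": 60,  "y": 700},
    {"text": "Results",  "x": 320, "y": 700},
    {"text": "Methods",  "x": 60,  "y": 400},
    {"text": "Refs",     "x": 320, "y": 400},
]
print(column_reading_order(words, column_split_x=300))
# ['Abstract', 'Methods', 'Results', 'Refs']
```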
Using templates to standardise output across documents
When you receive the same PDF type from multiple sources — supplier invoices, bank statements, application forms — a custom extraction template locks in the exact field names and data types you want, regardless of variations in layout or wording across senders.
Define your schema once: field name, data type (text, number, date, currency, boolean, table), and whether it is required. Every PDF processed with that template returns an identical JSON shape, making downstream database inserts and API calls trivial.
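A template of that shape, plus a required-field check on the extracted JSON, might look like the sketch below. All field names and the `validate` helper are invented for illustration; the point is the stable shape, not a specific API.

```python
# Hypothetical template: field name -> data type and required flag,
# mirroring the schema definition described above.
template = {
    "invoice_number": {"type": "text",     "required": True},
    "issue_date":     {"type": "date",     "required": True},
    "total":          {"type": "currency", "required": True},
    "line_items":     {"type": "table",    "required": False},
}

def validate(record, template):
    """Return the names of required fields missing from an extraction,
    so incomplete documents can be flagged before the database insert."""
    return [name for name, spec in template.items()
            if spec["required"] and record.get(name) in (None, "")]

extracted = {"invoice_number": "INV-001", "issue_date": "2024-03-01"}
print(validate(extracted, template))  # ['total']
```

Because every document processed with the template yields this same shape, the downstream code is written once and never changes when a new supplier's layout arrives.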
Templates combine with AI extraction: the model uses your schema as a guide but still reads the document flexibly, finding the right data even when it appears in an unexpected position.