Scanned PDF to Excel
Scanned PDF to Excel: what works today and what does not
The public converter is optimized for text-based PDFs. Image-only scanned PDFs are detected early and rejected with a clear error instead of returning a broken spreadsheet.
Built for private business documents
- Async conversion keeps large PDF processing out of the browser request.
- Source PDFs and Excel outputs are cleaned up automatically by retention jobs.
- The public page shows simple review guidance; detailed telemetry stays internal.
Clean scanned-PDF detection
The engine checks text density and image-heavy page signals before attempting table extraction.
No broken spreadsheet output
When a PDF appears image-only, the converter fails clearly instead of fabricating rows.
OCR boundary is explicit
OCR is planned as a separate capability and is not advertised as active.
Extraction technology
How scanned PDF detection works
A scanned PDF usually has page images but little or no selectable text. The extractor uses that boundary to protect output quality.
Text-layer checks
Very low extracted word count and low text density are strong scanned-PDF signals.
Image-heavy pages
Pages dominated by images raise the confidence that the PDF is a scan.
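The text-layer and image signals described above could be combined roughly as follows. This is a minimal sketch under stated assumptions, not the converter's actual engine: the `PageStats` fields, the threshold values, the weights, and the function names are all hypothetical, and the per-page measurements are assumed to come from whatever PDF parser is in use.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Per-page measurements (hypothetical; assumed to come from a PDF parser)."""
    word_count: int    # words found in the text layer
    char_count: int    # characters found in the text layer
    page_area: float   # page area in square points
    image_area: float  # total area covered by embedded images


# Illustrative values only; real thresholds would need tuning on sample files.
MIN_WORDS_PER_PAGE = 5
MIN_CHARS_PER_1000_SQ_PT = 0.5
IMAGE_DOMINANT_FRACTION = 0.8
LOW_TEXT_WEIGHT = 0.75      # low text is treated as the strong signal
IMAGE_HEAVY_WEIGHT = 0.25   # image dominance only adds confidence
SCANNED_THRESHOLD = 0.75


def page_scanned_confidence(page: PageStats) -> float:
    """Return a 0..1 confidence that a single page is image-only."""
    if page.page_area <= 0:
        return 0.0
    text_density = page.char_count / (page.page_area / 1000.0)
    low_text = (
        page.word_count < MIN_WORDS_PER_PAGE
        and text_density < MIN_CHARS_PER_1000_SQ_PT
    )
    image_heavy = page.image_area / page.page_area >= IMAGE_DOMINANT_FRACTION
    confidence = 0.0
    if low_text:
        confidence += LOW_TEXT_WEIGHT
    if image_heavy:
        confidence += IMAGE_HEAVY_WEIGHT
    return confidence


def document_scanned_confidence(pages: list[PageStats]) -> float:
    """Average the per-page confidence; an empty document scores 0."""
    if not pages:
        return 0.0
    return sum(page_scanned_confidence(p) for p in pages) / len(pages)


def looks_scanned(pages: list[PageStats]) -> bool:
    """Decide before table extraction whether to stop with a clear failure."""
    return document_scanned_confidence(pages) >= SCANNED_THRESHOLD
```

In this sketch a page with a healthy text layer never scores above 0.25, even if it has a full-page background image, so digitally generated reports with heavy graphics are not rejected by the image signal alone.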
Graceful failure
The public UI explains the limitation without exposing internal counters.
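One way to keep internal counters out of the public response is to carry them on the error object and log them, while the user only sees a fixed message. The sketch below is illustrative: `ScannedPdfError`, `to_public_response`, the message wording, and the response fields are hypothetical names, not the product's real API.

```python
import logging

logger = logging.getLogger("converter")

# Fixed, user-facing wording; contains no measurements.
PUBLIC_SCANNED_MESSAGE = (
    "This PDF appears to be a scanned image. "
    "Please upload a text-based PDF where the text can be selected."
)


class ScannedPdfError(Exception):
    """Raised when a PDF appears image-only; carries internal details."""

    def __init__(self, internal_details: dict):
        super().__init__(PUBLIC_SCANNED_MESSAGE)
        self.internal_details = internal_details


def to_public_response(error: ScannedPdfError) -> dict:
    """Log the internal counters and return only the public explanation."""
    logger.info("scanned PDF rejected: %s", error.internal_details)
    return {
        "status": "failed",
        "reason": "scanned_pdf",
        "message": str(error),
    }
```

The response contains no spreadsheet and no counters, so a failed conversion cannot be mistaken for a partial result.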
Future OCR path
OCR can be added later, once regression samples cover its quality and processing-time benchmarks.
Supported examples
- Text-based PDFs where table text can be selected
- Digitally generated statements, invoices, and reports
- PDFs exported from ERP/accounting systems
Limitations to know
- Image-only scanned PDFs are not converted today.
- Photos of documents are not supported.
- OCR accuracy for Vietnamese text and numeric tables needs separate evaluation before release.
Have a text-based PDF instead?
Upload the text-based version and the converter can extract tables into Excel.
Try a text-based PDF
Frequently asked questions
Can scanned PDFs be converted right now?
No. The public converter detects scanned/image-only PDFs and fails cleanly because OCR is not enabled.
Why not return partial results?
Returning guessed rows from images would be unreliable without OCR and validation.
What should I upload?
Use a PDF where text can be selected in the browser or PDF viewer.