PDF Documents
PDF files are indexed as documents of type pdf.
What OneSearch extracts
OneSearch uses pypdf to extract:
- text from pages
- page count
- PDF metadata such as title, author, subject, creator, and producer when present
The document title comes from PDF metadata first, then falls back to the filename.
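The extraction described above can be sketched with pypdf as follows. This is a minimal illustration, not OneSearch's actual code: the function names `choose_title` and `extract_pdf`, the shape of the returned dictionary, and the choice to keep the file extension in the fallback title are all assumptions.

```python
from pathlib import Path
from typing import Optional


def choose_title(metadata_title: Optional[str], path: str) -> str:
    """Prefer a non-empty metadata title, otherwise fall back to the filename."""
    if metadata_title and metadata_title.strip():
        return metadata_title.strip()
    # Assumption: the fallback keeps the extension (e.g. "report.pdf").
    return Path(path).name


def extract_pdf(path: str) -> dict:
    """Collect page text, page count, and document metadata with pypdf."""
    # Imported lazily so choose_title() works without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    info = reader.metadata  # None when the PDF has no document info

    metadata = {}
    for name in ("title", "author", "subject", "creator", "producer"):
        value = getattr(info, name, None) if info is not None else None
        if value:  # keep only fields that are present
            metadata[name] = value

    # extract_text() can yield an empty string for image-only pages.
    page_texts = [page.extract_text() or "" for page in reader.pages]

    return {
        "title": choose_title(metadata.get("title"), path),
        "page_count": len(reader.pages),
        "text": "\n".join(page_texts),
        "metadata": metadata,
    }
```

Calling `extract_pdf("manual.pdf")` on a PDF without a title in its metadata would, under these assumptions, produce a record whose title is `manual.pdf`.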
Search and preview
Extracted text is searchable like any other document. The document page shows the extracted text and metadata.
Extraction quality depends on how the PDF was produced. A born-digital PDF usually extracts well. A scanned PDF that contains only page images has no text layer, so it has no searchable text unless OCR support is added later.
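One way to tell the two cases apart is to check whether any page produced non-whitespace text. The helper below is a hypothetical sketch; `has_searchable_text` and its `min_chars` threshold are illustrative names and are not part of OneSearch or pypdf.

```python
from typing import Sequence


def has_searchable_text(page_texts: Sequence[str], min_chars: int = 1) -> bool:
    """Return True if the pages contain at least min_chars non-whitespace-trimmed characters."""
    total = sum(len(text.strip()) for text in page_texts)
    return total >= min_chars
```

A scanned, image-only PDF typically yields empty strings for every page, so this check returns False and the document would be findable only through its metadata.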
Limits
PDF extraction is bounded by size and timeout settings.
Encrypted PDFs are tried with an empty password. If that fails, the file is recorded as an extraction failure and can still show basic filename/path metadata.
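The empty-password attempt and the failure record might look roughly like this. It is a sketch under stated assumptions: the names `failure_record`, `exceeds_size_limit`, and `open_reader`, the record fields, and treating an oversized file as a failure are all hypothetical. Timeout handling is left out. The pypdf calls used (`PdfReader`, `is_encrypted`, `decrypt`) are real.

```python
import os
from pathlib import Path


def failure_record(path: str, reason: str) -> dict:
    """Basic filename/path metadata kept when extraction fails (hypothetical shape)."""
    return {
        "filename": Path(path).name,
        "path": path,
        "status": "extraction_failed",
        "reason": reason,
    }


def exceeds_size_limit(size_bytes: int, max_bytes: int) -> bool:
    """True when a file is larger than the configured limit."""
    return size_bytes > max_bytes


def open_reader(path: str, max_bytes: int):
    """Return a pypdf reader, or a failure record if the file cannot be read."""
    # Imported lazily so the helpers above work without pypdf installed.
    from pypdf import PdfReader

    # Assumption: an oversized file is recorded as a failure.
    if exceeds_size_limit(os.path.getsize(path), max_bytes):
        return failure_record(path, "file exceeds size limit")

    reader = PdfReader(path)
    if reader.is_encrypted:
        # decrypt() returns a falsy value when the password does not match.
        if not reader.decrypt(""):
            return failure_record(path, "encrypted, empty password rejected")
    return reader
```

Some PDFs are encrypted only to restrict printing or copying and use an empty user password, which is why trying an empty password first can succeed.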
Not supported yet
- OCR for scanned PDFs
- password-protected PDFs that require a real password