PDF Documents

OneSearch indexes PDF files as documents of type pdf.

What OneSearch extracts

OneSearch uses pypdf to extract:

  • text from pages
  • page count
  • PDF metadata such as title, author, subject, creator, and producer when present

The document title comes from PDF metadata first, then falls back to the filename.
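The extraction described above could be sketched roughly as follows. This is not OneSearch's actual code: the function names `pick_title`, `summarize`, and `extract_pdf` and the shape of the returned dictionary are hypothetical. Only the pypdf calls (`PdfReader`, `reader.pages`, `page.extract_text()`, `reader.metadata`) are real library API. Whether the filename fallback keeps the `.pdf` extension is also an assumption.

```python
from pathlib import Path


def pick_title(metadata_title, path):
    """Prefer the PDF metadata title; fall back to the filename."""
    title = (metadata_title or "").strip()
    return title or Path(path).name


def summarize(reader, path):
    """Collect text, page count, and metadata from a pypdf.PdfReader."""
    meta = reader.metadata  # may be None when the PDF carries no metadata
    fields = ("title", "author", "subject", "creator", "producer")
    metadata = {}
    for name in fields:
        value = getattr(meta, name, None) if meta is not None else None
        if value:  # keep only the fields that are present
            metadata[name] = value
    # extract_text() can yield an empty string for image-only pages.
    text = "\n".join((page.extract_text() or "") for page in reader.pages)
    return {
        "title": pick_title(metadata.get("title"), path),
        "page_count": len(reader.pages),
        "metadata": metadata,
        "text": text,
    }


def extract_pdf(path):
    """Open a PDF with pypdf and summarize it (hypothetical entry point)."""
    # Imported here so the helpers above stay usable without pypdf installed.
    from pypdf import PdfReader

    return summarize(PdfReader(path), path)
```

Calling `extract_pdf("report.pdf")` on a PDF with no title metadata would, under these assumptions, produce a title of `report.pdf`.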

Search and preview

Extracted text is searchable like any other document. The document page shows the extracted text and metadata.

Extraction quality depends on how the PDF was produced. A born-digital PDF usually extracts well. A scanned PDF that contains only page images has no text layer, so it yields no searchable text; OCR is not supported yet.

Limits

PDF extraction is bounded by size and timeout settings:

MAX_PDF_FILE_SIZE_MB=50
PDF_EXTRACTION_TIMEOUT=30
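As an illustration of how limits like these might be applied, here is a minimal sketch. It assumes the settings are read from environment variables, that the timeout is in seconds, and that one megabyte means 1024 * 1024 bytes; none of this is confirmed by OneSearch. The helper names `load_limits`, `within_size_limit`, and `run_with_timeout` are hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def load_limits(env=None):
    """Read the two settings, using the documented values as defaults."""
    env = os.environ if env is None else env
    max_mb = int(env.get("MAX_PDF_FILE_SIZE_MB", "50"))
    timeout = int(env.get("PDF_EXTRACTION_TIMEOUT", "30"))  # assumed seconds
    return max_mb, timeout


def within_size_limit(size_bytes, max_mb):
    """True when the file is no larger than the limit (1 MB = 1024 * 1024 bytes here)."""
    return size_bytes <= max_mb * 1024 * 1024


def run_with_timeout(func, timeout_seconds, *args):
    """Return (True, result), or (False, None) if func does not finish in time."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, *args)
    try:
        return True, future.result(timeout=timeout_seconds)
    except FutureTimeout:
        return False, None
    finally:
        # Do not block on a stuck worker; the thread itself is not killed.
        executor.shutdown(wait=False)
```

A thread-based timeout only stops waiting for the result; the extraction keeps running in the background. A real indexer would more likely run extraction in a separate process so it can be terminated.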

OneSearch first tries to open an encrypted PDF with an empty password. If that fails, the file is recorded as an extraction failure, but it can still show basic filename and path metadata.
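The empty-password attempt could look something like the sketch below. It relies on pypdf's `PdfReader.is_encrypted` and `PdfReader.decrypt()`, which returns a falsy value when the password is rejected. The function name `try_open` and the layout of the failure record are hypothetical, and treating any decryption error as a failure is an assumption.

```python
import os


def try_open(reader, path):
    """Return (ok, failure_record) for a pypdf.PdfReader-like object."""
    if not reader.is_encrypted:
        return True, None
    try:
        # An empty password unlocks PDFs that are encrypted without a user password.
        unlocked = bool(reader.decrypt(""))
    except Exception:
        # Assumption: any error while decrypting counts as a failure.
        unlocked = False
    if unlocked:
        return True, None
    # Hypothetical record: only filename and path metadata remain available.
    failure = {
        "filename": os.path.basename(path),
        "path": path,
        "status": "extraction_failed",
        "reason": "password required",
    }
    return False, failure
```

With pypdf installed, this would be called as `try_open(PdfReader(path), path)` before any page text is read.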

Not supported yet

  • OCR for scanned PDFs
  • password-protected PDFs that require a real password