Category Archives: Text

Understanding the PDF file Format – ActualText

Text is defined in the PDF file format as a display value (normally what you see onscreen) and an extraction value. It is useful to have two options because some characters are displayed differently from what you extract (for example … Continue reading

Posted in Text

PDF search

I have recently seen some very strange discussions about PDF search in several places, so I felt a blog post would be useful. Firstly, you cannot generally search a PDF document directly. You cannot … Continue reading

Posted in Search, Text

Search PDF Files With Regular Expressions – Generating Teasers

Recently I have had some questions about how to display search results that include two words from either side of the search result. This is something we already have set up for our simple viewer. We … Continue reading

Posted in PDF, Search, Text
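
The teaser idea in the post above (a search result shown with two words from either side) can be sketched with the Java regular expressions engine. This is only an illustration under stated assumptions, not the viewer's actual code: the class `TeaserSketch` and the method `teasers` are hypothetical names, the text is assumed to have already been extracted from the PDF into a plain `String`, a "word" is taken to be any run of non-whitespace characters, and the search term is assumed to start and end with a letter or digit so that `\b` works as a whole-word test.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: not part of any real PDF library.
final class TeaserSketch {

    /**
     * Returns one teaser per match of {@code term}, each containing up to
     * {@code contextWords} whitespace-separated words before and after it.
     */
    static List<String> teasers(String text, String term, int contextWords) {
        // Up to N "word + whitespace" groups before the term.
        String before = "(?:\\S+\\s+){0," + contextWords + "}";
        // Up to N "whitespace + word" groups after the term.
        String after = "(?:\\s+\\S+){0," + contextWords + "}";
        // Pattern.quote makes the term literal; \b restricts it to whole words.
        Pattern pattern = Pattern.compile(
                before + "\\b" + Pattern.quote(term) + "\\b" + after,
                Pattern.CASE_INSENSITIVE);

        List<String> results = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            results.add(matcher.group());
        }
        return results;
    }
}
```

Calling `TeaserSketch.teasers("The quick brown fox jumps over the lazy dog", "fox", 2)` gives a single teaser, `quick brown fox jumps over`. Because `find()` resumes after the previous match, two hits that sit closer together than the context window will share, rather than repeat, the words between them.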

PDF to HTML5 conversion – Where are my hyphens?

The devil is always in the detail with the PDF spec. I have been working on a PDF file where the hyphen character was not appearing in the converted HTML5 output. This was odd as I have seen it on … Continue reading

Posted in html, Java, PDF, Text

Search PDF Files With Regular Expressions – Customizing Your Search of PDF files

With our update to the PDF search code we now route all our search functionality through the Java regular expressions engine. This has enabled some clever search features by adding regular expressions to search terms. What … Continue reading

Posted in Search, Text
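
As a rough illustration of routing every search through the Java regular expressions engine, as the post above describes, the sketch below turns a search term plus a few options into a `java.util.regex.Pattern`. The class `RegexSearchSketch`, its methods, and the three options (regex or literal term, case sensitivity, whole word) are assumptions made for this example; the excerpt does not say which features the real search code offers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: illustrates the approach, not the actual search code.
final class RegexSearchSketch {

    /** Builds a Pattern from a search term and some assumed search options. */
    static Pattern build(String term, boolean isRegex,
                         boolean caseSensitive, boolean wholeWord) {
        // A plain term is quoted so characters like '.' are matched literally.
        String body = isRegex ? term : Pattern.quote(term);
        if (wholeWord) {
            // Non-capturing group keeps \b applied to the whole expression.
            body = "\\b(?:" + body + ")\\b";
        }
        int flags = caseSensitive
                ? 0
                : Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
        return Pattern.compile(body, flags);
    }

    /** Returns {start, end} offsets (end exclusive) of every match in text. */
    static List<int[]> find(String text, Pattern pattern) {
        List<int[]> hits = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            hits.add(new int[] {matcher.start(), matcher.end()});
        }
        return hits;
    }
}
```

For example, `build("colou?r", true, false, false)` finds both spellings in `"Colour and color"`, while `build("a.b", false, true, false)` matches only the literal `a.b` and not `axb`. Funnelling literal and regex terms through one `Pattern` means the matching loop stays the same whichever options are chosen.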