Category Archives: Text
Understanding the PDF file Format – ActualText
Text is defined in the PDF file format as a display value (normally what you see onscreen) and an extraction value. It is useful to have two options because some characters are displayed differently to what you extract (for example … Continue reading
PDF search
PDF search is a topic on which I have recently seen some very strange discussions in several places, so I felt a blog post would be useful. Firstly, you generally cannot search a PDF document directly. You cannot … Continue reading
Search PDF Files With Regular Expressions – Generating Teasers
Recently I have had some questions about how to display search results together with two words from either side of the search result. This is something we already have set up for our simple viewer. We … Continue reading
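As a rough illustration of the teaser idea described in this excerpt, here is a minimal sketch using only `java.util.regex`. It is not the code from the post or from the viewer: the class name `TeaserSketch`, the method `teasers` and its parameters are hypothetical, and it assumes the page text has already been extracted into a plain `String` with words separated by whitespace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the viewer's actual implementation.
class TeaserSketch {

    // Returns one teaser per hit: the token containing the term plus up to
    // `contextWords` whitespace-delimited words on either side of it.
    static List<String> teasers(String text, String term, int contextWords) {
        // Up to N "word + whitespace" groups before the hit.
        String before = "(?:\\S+\\s+){0," + contextWords + "}";
        // Up to N "whitespace + word" groups after the hit.
        String after = "(?:\\s+\\S+){0," + contextWords + "}";
        // Pattern.quote makes the term a literal; the \S* on either side
        // pulls in the rest of the token (for example trailing punctuation).
        String regex = before + "\\S*" + Pattern.quote(term) + "\\S*" + after;

        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(text);
        List<String> result = new ArrayList<>();
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumps over the lazy dog";
        // Prints each teaser for the term "fox" with two words of context.
        for (String teaser : teasers(text, "fox", 2)) {
            System.out.println(teaser);
        }
    }
}
```

One limitation of this sketch: `Matcher.find()` resumes after the previous match, so a second hit that falls inside the trailing context of the first is not reported separately.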
PDF to HTML5 conversion – Where are my hyphens?
The devil is always in the detail with the PDF spec. I have been working on a PDF file where the Hyphen character was not appearing in the converted HTML5 output. This was odd as I have seen it on … Continue reading
Search PDF Files With Regular Expressions – Customizing Your Search of PDF files
With our update to the PDF search code, we now route all our search functionality through the Java regular expressions engine. This lets us offer some clever search features by adding regular expressions to search terms. What … Continue reading
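To make the idea of a regular expression inside a search term concrete, here is a small sketch built on the standard `java.util.regex` classes. The class `RegexSearchSketch` and the method `find` are hypothetical names, not the library's API, and the sketch assumes the text to search is already available as a plain `String`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the library's actual search code.
class RegexSearchSketch {

    // Returns every substring of `text` matched by `searchTerm`, where the
    // term is interpreted as a Java regular expression.
    static List<String> find(String text, String searchTerm, boolean caseSensitive) {
        int flags = caseSensitive ? 0 : Pattern.CASE_INSENSITIVE;
        Matcher m = Pattern.compile(searchTerm, flags).matcher(text);
        List<String> hits = new ArrayList<>();
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // "colou?r" matches both the British and the American spelling.
        System.out.println(find("Colour and color", "colou?r", false));
    }
}
```

A term such as `colou?r` then finds both spellings in one pass, which is the kind of feature a plain substring search cannot offer.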


