org.jpedal.grouping
Class PdfGroupingAlgorithms

java.lang.Object
  extended by org.jpedal.grouping.PdfGroupingAlgorithms

public class PdfGroupingAlgorithms
extends java.lang.Object

Applies heuristics to unstructured PDF text to create content


Field Summary
static char MARKER2
           
static boolean oldTextExtraction
          Flag used to debug new text routines
static boolean useUnrotatedCoords
           
 
Constructor Summary
PdfGroupingAlgorithms()
           
PdfGroupingAlgorithms(org.jpedal.objects.PdfData pdf_data)
          create a new instance, passing in raw data
 
Method Summary
 void cleanupText(org.jpedal.objects.PdfData pdf_data)
          Generic decode: cleans up the data and removes our embedded information.
 java.util.Map extractTextAsTable(int x1, int y1, int x2, int y2, int pageNumber, boolean isCSV, boolean keepFontInfo, boolean keepWidthInfo, boolean keepAlignmentInfo, int borderWidth, boolean AddCustomTags)
          Calls various low-level merging routines to build the table. isCSV sets whether the output is CSV or XHTML. For XHTML there are also options to include font tags (keepFontInfo), preserve widths (keepWidthInfo), try to preserve alignment (keepAlignmentInfo), and set a table border width (borderWidth). AddCustomTags should always be set to false.
 java.util.Vector extractTextAsWordlist(int x1, int y1, int x2, int y2, int page_number, boolean estimateParagraphs, boolean breakFragments, java.lang.String punctuation)
          Algorithm to place data into an object for each page (hardcoded into program). Co-ordinates are x1,y1 (top left-hand corner) and x2,y2 (bottom right). If the co-ordinates are not valid, a PdfException is thrown. Returns a Vector with the words and co-ordinates (all values are Strings).
 java.lang.String extractTextInRectangle(int x1, int y1, int x2, int y2, int page_number, boolean estimateParagraphs, boolean breakFragments)
          Algorithm to place data into an object for each page (hardcoded into program).
Co-ordinates are x1,y1 (top left-hand corner) and x2,y2 (bottom right).
If the co-ordinates are not valid, a PdfException is thrown.
 java.util.List findMultipleTermsInRectangle(int x1, int y1, int x2, int y2, int rotation, int page_number, java.lang.String[] terms, boolean orderResults, int searchType, org.jpedal.grouping.SearchListener listener)
          Algorithm to find multiple text terms in x1,y1,x2,y2 rectangle on page_number.
 java.util.SortedMap findMultipleTermsInRectangleWithMatchingTeasers(int x1, int y1, int x2, int y2, int rotation, int page_number, java.lang.String[] terms, int searchType, org.jpedal.grouping.SearchListener listener)
          Algorithm to find multiple text terms in x1,y1,x2,y2 rectangle on page_number, with matching teaser
 float[] findTextInRectangle(int x1, int y1, int x2, int y2, int page_number, java.lang.String textValue)
          Deprecated.  
 float[] findTextInRectangle(int x1, int y1, int x2, int y2, int page_number, java.lang.String textValue, int searchType)
          Algorithm to find textValue in x1,y1,x2,y2 rectangle on page_number
 void generateTeasers()
          Tells the find-text methods to generate teasers as well.
 float[] getEndPoints()
          Returns the endpoints from the last find-text call.
 java.lang.String[] getTeasers()
          Returns the text teasers from the last find-text call, provided generateTeasers() was called before the find.
static java.lang.String removeHiddenMarkers(java.lang.String contents)
          Returns the data without the embedded encoding (hidden markers).
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

MARKER2

public static char MARKER2

oldTextExtraction

public static boolean oldTextExtraction
Flag used to debug new text routines


useUnrotatedCoords

public static boolean useUnrotatedCoords

Constructor Detail

PdfGroupingAlgorithms

public PdfGroupingAlgorithms()

PdfGroupingAlgorithms

public PdfGroupingAlgorithms(org.jpedal.objects.PdfData pdf_data)
create a new instance, passing in raw data

Method Detail

cleanupText

public final void cleanupText(org.jpedal.objects.PdfData pdf_data)
Generic decode: cleans up the data and removes our embedded information.


removeHiddenMarkers

public static java.lang.String removeHiddenMarkers(java.lang.String contents)
Returns the data without the embedded encoding (hidden markers).


extractTextAsTable

public final java.util.Map extractTextAsTable(int x1,
                                              int y1,
                                              int x2,
                                              int y2,
                                              int pageNumber,
                                              boolean isCSV,
                                              boolean keepFontInfo,
                                              boolean keepWidthInfo,
                                              boolean keepAlignmentInfo,
                                              int borderWidth,
                                              boolean AddCustomTags)
                                       throws PdfException
Calls various low-level merging routines to build the table. isCSV sets whether the output is CSV or XHTML. For XHTML there are also options to include font tags (keepFontInfo), preserve widths (keepWidthInfo), try to preserve alignment (keepAlignmentInfo), and set a table border width (borderWidth). AddCustomTags should always be set to false.

Throws:
PdfException
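
The effect of isCSV and borderWidth can be pictured with a small stand-alone sketch. It does not call JPedal and does not reproduce its actual output or the contents of the returned Map, which this page does not describe. TableOutputSketch and its render method are hypothetical names, and the markup and escaping rules are assumptions chosen for illustration.

```java
import java.util.Arrays;
import java.util.List;

public class TableOutputSketch {

    /**
     * Renders rows of cells either as CSV or as a bare XHTML table.
     * HYPOTHETICAL helper: the layout is illustrative, not JPedal's output.
     */
    public static String render(List<List<String>> rows, boolean isCSV, int borderWidth) {
        StringBuilder out = new StringBuilder();
        if (!isCSV) {
            out.append("<table border=\"").append(borderWidth).append("\">\n");
        }
        for (List<String> row : rows) {
            if (isCSV) {
                for (int i = 0; i < row.size(); i++) {
                    if (i > 0) {
                        out.append(',');
                    }
                    out.append(csvCell(row.get(i)));
                }
                out.append('\n');
            } else {
                out.append("<tr>");
                for (String cell : row) {
                    out.append("<td>").append(xmlCell(cell)).append("</td>");
                }
                out.append("</tr>\n");
            }
        }
        if (!isCSV) {
            out.append("</table>\n");
        }
        return out.toString();
    }

    // Quote a CSV field when it contains a comma, a quote, or a newline.
    static String csvCell(String s) {
        boolean quote = s.contains(",") || s.contains("\"") || s.contains("\n");
        return quote ? "\"" + s.replace("\"", "\"\"") + "\"" : s;
    }

    // Escape the characters that are significant in XHTML text content.
    static String xmlCell(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("Item", "Price"),
                Arrays.asList("Widget, large", "9.50"));
        System.out.print(render(rows, true, 0));
        System.out.print(render(rows, false, 1));
    }
}
```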

extractTextAsWordlist

public final java.util.Vector extractTextAsWordlist(int x1,
                                                    int y1,
                                                    int x2,
                                                    int y2,
                                                    int page_number,
                                                    boolean estimateParagraphs,
                                                    boolean breakFragments,
                                                    java.lang.String punctuation)
                                             throws PdfException
Algorithm to place data into an object for each page (hardcoded into program). Co-ordinates are x1,y1 (top left-hand corner) and x2,y2 (bottom right). If the co-ordinates are not valid, a PdfException is thrown. Returns a Vector with the words and co-ordinates (all values are Strings).

Throws:
PdfException
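
Because the returned Vector is flat and holds only Strings, callers will usually want to regroup it into one record per word. The sketch below assumes that each word occupies five consecutive entries in the order word, x1, y1, x2, y2; this page does not state the layout, so check it against real output first. WordListSketch, Word and group are hypothetical names, not part of JPedal.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WordListSketch {

    /** Hypothetical holder for one word and its bounding box. */
    public static final class Word {
        public final String text;
        public final float x1, y1, x2, y2;

        Word(String text, float x1, float y1, float x2, float y2) {
            this.text = text;
            this.x1 = x1;
            this.y1 = y1;
            this.x2 = x2;
            this.y2 = y2;
        }
    }

    /**
     * Regroups a flat list of Strings into Word records.
     * ASSUMPTION: entries repeat as word, x1, y1, x2, y2.
     */
    public static List<Word> group(List<?> flat) {
        final int stride = 5;
        if (flat.size() % stride != 0) {
            throw new IllegalArgumentException("expected " + stride + " entries per word");
        }
        List<Word> words = new ArrayList<>();
        for (int i = 0; i < flat.size(); i += stride) {
            words.add(new Word(
                    String.valueOf(flat.get(i)),
                    coord(flat.get(i + 1)),
                    coord(flat.get(i + 2)),
                    coord(flat.get(i + 3)),
                    coord(flat.get(i + 4))));
        }
        return words;
    }

    // All values arrive as Strings, so co-ordinates must be parsed.
    private static float coord(Object value) {
        return Float.parseFloat(String.valueOf(value));
    }

    public static void main(String[] args) {
        // Stand-in for the Vector returned by extractTextAsWordlist.
        List<String> flat = Arrays.asList(
                "Hello", "72", "720", "101.5", "708",
                "world", "105", "720", "134", "708");
        for (Word w : group(flat)) {
            System.out.println(w.text + " at " + w.x1 + "," + w.y1 + " to " + w.x2 + "," + w.y2);
        }
    }
}
```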

extractTextInRectangle

public final java.lang.String extractTextInRectangle(int x1,
                                                     int y1,
                                                     int x2,
                                                     int y2,
                                                     int page_number,
                                                     boolean estimateParagraphs,
                                                     boolean breakFragments)
                                              throws PdfException
Algorithm to place data into an object for each page (hardcoded into program).
Co-ordinates are x1,y1 (top left-hand corner) and x2,y2 (bottom right).
If the co-ordinates are not valid, a PdfException is thrown.

Throws:
PdfException
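
Since x1,y1 must be the top left-hand corner and x2,y2 the bottom right, it can help to normalise two arbitrary corners before calling. The sketch below assumes PDF-style co-ordinates in which y increases up the page, so the top edge has the larger y value; this page does not define what the library treats as invalid, so the isValid check is only a guess at that rule. RectangleSketch, normalise and isValid are hypothetical helpers.

```java
import java.util.Arrays;

public class RectangleSketch {

    /**
     * Returns {x1, y1, x2, y2} with (x1,y1) as top left and (x2,y2) as
     * bottom right. ASSUMPTION: y increases up the page.
     */
    public static int[] normalise(int xa, int ya, int xb, int yb) {
        return new int[] {
            Math.min(xa, xb), // left
            Math.max(ya, yb), // top
            Math.max(xa, xb), // right
            Math.min(ya, yb)  // bottom
        };
    }

    /** Guess at a validity rule: non-empty and correctly oriented. */
    public static boolean isValid(int x1, int y1, int x2, int y2) {
        return x1 < x2 && y1 > y2;
    }

    public static void main(String[] args) {
        // Corners supplied in the wrong order, e.g. from a mouse drag.
        int[] r = normalise(500, 100, 50, 700);
        System.out.println(Arrays.toString(r) + " valid=" + isValid(r[0], r[1], r[2], r[3]));
    }
}
```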

findMultipleTermsInRectangleWithMatchingTeasers

public java.util.SortedMap findMultipleTermsInRectangleWithMatchingTeasers(int x1,
                                                                           int y1,
                                                                           int x2,
                                                                           int y2,
                                                                           int rotation,
                                                                           int page_number,
                                                                           java.lang.String[] terms,
                                                                           int searchType,
                                                                           org.jpedal.grouping.SearchListener listener)
                                                                    throws PdfException
Algorithm to find multiple text terms in x1,y1,x2,y2 rectangle on page_number, with matching teaser

Parameters:
x1 - the left x co-ordinate
y1 - the upper y co-ordinate
x2 - the right x co-ordinate
y2 - the lower y co-ordinate
rotation - the rotation of the page to be searched
page_number - the page number to search on
terms - the terms to search for
searchType - the search type made up from one or more constants obtained from the SearchType class
listener - an implementation of SearchListener is required; this enables searching to be cancelled
Returns:
a SortedMap in which each Rectangle describing the location of found text is mapped to a String holding the matching teaser
Throws:
PdfException - If the co-ordinates are not valid

findMultipleTermsInRectangle

public java.util.List findMultipleTermsInRectangle(int x1,
                                                   int y1,
                                                   int x2,
                                                   int y2,
                                                   int rotation,
                                                   int page_number,
                                                   java.lang.String[] terms,
                                                   boolean orderResults,
                                                   int searchType,
                                                   org.jpedal.grouping.SearchListener listener)
                                            throws PdfException
Algorithm to find multiple text terms in x1,y1,x2,y2 rectangle on page_number.

Parameters:
x1 - the left x co-ordinate
y1 - the upper y co-ordinate
x2 - the right x co-ordinate
y2 - the lower y co-ordinate
rotation - the rotation of the page to be searched
page_number - the page number to search on
terms - the terms to search for
orderResults - if true, the returned list is ordered so that the resulting rectangles follow a logical order descending down the page; if false, rectangles for multiple terms are grouped together
searchType - the search type made up from one or more constants obtained from the SearchType class
listener - an implementation of SearchListener is required; this enables searching to be cancelled
Returns:
a list of Rectangle describing the location of found text
Throws:
PdfException - If the co-ordinates are not valid
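
The difference between the two orderResults settings can be sketched without the library. The example below takes a list of hits grouped by term and re-sorts it from the top of the page downwards. It assumes that a larger Rectangle.y means higher on the page and that ties are broken left to right; the library's own ordering rule is not specified on this page. OrderResultsSketch and topToBottom are hypothetical names.

```java
import java.awt.Rectangle;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class OrderResultsSketch {

    /**
     * Returns a copy of the hits sorted from the top of the page down.
     * ASSUMPTION: larger y is higher on the page; ties go left to right.
     */
    public static List<Rectangle> topToBottom(List<Rectangle> hits) {
        List<Rectangle> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingInt((Rectangle r) -> r.y).reversed()
                .thenComparingInt(r -> r.x));
        return sorted;
    }

    public static void main(String[] args) {
        // Grouped by term: two hits for the first term, then one for the second.
        List<Rectangle> grouped = Arrays.asList(
                new Rectangle(72, 700, 40, 12),
                new Rectangle(72, 300, 40, 12),
                new Rectangle(200, 500, 55, 12));
        for (Rectangle r : topToBottom(grouped)) {
            System.out.println(r);
        }
    }
}
```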

findTextInRectangle

public final float[] findTextInRectangle(int x1,
                                         int y1,
                                         int x2,
                                         int y2,
                                         int page_number,
                                         java.lang.String textValue)
                                  throws PdfException
Deprecated. 

Algorithm to find the first occurrence of textValue in the x1,y1,x2,y2 rectangle on page_number, using a case-sensitive comparison.
Co-ordinates are x1,y1 (top left-hand corner) and x2,y2 (bottom right).
If the co-ordinates are not valid, a PdfException is thrown.

Throws:
PdfException

findTextInRectangle

public final float[] findTextInRectangle(int x1,
                                         int y1,
                                         int x2,
                                         int y2,
                                         int page_number,
                                         java.lang.String textValue,
                                         int searchType)
                                  throws PdfException
Algorithm to find textValue in x1,y1,x2,y2 rectangle on page_number

Parameters:
x1 - the left x co-ordinate
y1 - the upper y co-ordinate
x2 - the right x co-ordinate
y2 - the lower y co-ordinate
page_number - the page number to search on
textValue - the text string to search for
searchType - the search type made up from one or more constants obtained from the SearchType class
Returns:
array of coordinates for found text
Throws:
PdfException - If the co-ordinates are not valid
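
A searchType "made up from one or more constants" suggests a bit mask built with the | operator. The sketch below shows that pattern only. The three constants and their values are placeholders defined locally for illustration; the real names and values come from the SearchType class, which is not documented on this page.

```java
public class SearchTypeSketch {

    // HYPOTHETICAL flags: stand-ins for constants from the SearchType class.
    public static final int WHOLE_WORDS_ONLY = 1;
    public static final int CASE_SENSITIVE = 2;
    public static final int FIND_FIRST_ONLY = 4;

    /** True if every bit of flag is set in searchType. */
    public static boolean has(int searchType, int flag) {
        return (searchType & flag) == flag;
    }

    public static void main(String[] args) {
        // Combine options into a single int, as the searchType parameter expects.
        int searchType = CASE_SENSITIVE | FIND_FIRST_ONLY;
        System.out.println("case sensitive: " + has(searchType, CASE_SENSITIVE));
        System.out.println("whole words:    " + has(searchType, WHOLE_WORDS_ONLY));
    }
}
```

Under this reading, the deprecated six-argument overload (case-sensitive, first occurrence) would correspond to passing a combination like the one in main, but that equivalence is an inference, not something this page confirms.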

getEndPoints

public float[] getEndPoints()
Returns the endpoints from the last find-text call.


getTeasers

public java.lang.String[] getTeasers()
Returns the text teasers from the last find-text call, provided generateTeasers() was called before the find.


generateTeasers

public void generateTeasers()
Tells the find-text methods to generate teasers as well.