JPedal Java PDF library 4.01b28 API Documentation - http://www.jpedal.org

org.jpedal.grouping
Class PdfGroupingAlgorithms

java.lang.Object
  extended by org.jpedal.grouping.PdfGroupingAlgorithms

public class PdfGroupingAlgorithms
extends java.lang.Object

Applies heuristics to unstructured PDF text to create content


Field Summary
static char MARKER2
           
static boolean oldTextExtraction
          Flag used to debug new text routines
static int SURROUND_BY_ANY_PUNCTUATION
           
static java.lang.String SystemSeparator
           
static int USER_DEFINED_LIST_ONLY
           
static boolean useUnrotatedCoords
           
 int wordDetectionTechnique
           
 
Constructor Summary
PdfGroupingAlgorithms(org.jpedal.objects.PdfData pdf_data)
          create a new instance, passing in raw data
 
Method Summary
 java.util.Map extractTextAsTable(int x1, int y1, int x2, int y2, int pageNumber, boolean isCSV, boolean keepFontInfo, boolean keepWidthInfo, boolean keepAlignmentInfo, int borderWidth)
          Extracts the text in the given rectangle as a table by calling various low-level merging routines. isCSV sets whether the output is XHTML or CSV. XHTML output also has options to include font tags (keepFontInfo), preserve widths (keepWidthInfo), try to preserve alignment (keepAlignmentInfo), and set a table border width (borderWidth). AddCustomTags should always be set to false.
 java.util.Vector extractTextAsWordlist(int x1, int y1, int x2, int y2, int page_number, boolean breakFragments, java.lang.String punctuation)
          Extracts the words found within the given coordinates into a Vector of word, word coords (x1,y1,x2,y2)
 java.lang.String extractTextInRectangle(int x1, int y1, int x2, int y2, int page_number, boolean estimateParagraphs, boolean breakFragments)
          algorithm to place data from specified coordinates on a page into a String.
 java.util.List findMultipleTermsInRectangle(int x1, int y1, int x2, int y2, int rotation, int page_number, java.lang.String[] terms, boolean orderResults, int searchType, org.jpedal.grouping.SearchListener listener)
          Algorithm to find multiple text terms in x1,y1,x2,y2 rectangle on page_number.
 java.util.SortedMap findMultipleTermsInRectangleWithMatchingTeasers(int x1, int y1, int x2, int y2, int rotation, int page_number, java.lang.String[] terms, int searchType, org.jpedal.grouping.SearchListener listener)
          Algorithm to find multiple text terms in x1,y1,x2,y2 rectangle on page_number, with matching teaser
 float[] findTextInRectangle(int x1, int y1, int x2, int y2, int page_number, java.lang.String textValue, int searchType)
          Algorithm to find textValue in x1,y1,x2,y2 rectangle on page_number
 float[] findTextInRectangleAcrossLines(int x1, int y1, int x2, int y2, int page_number, java.lang.String textValue, int searchType)
          Method to find text in the specified area allowing for the text to be split across multiple lines.
The results are returned in a float[] whose coords are organised in the following order.
[0]=result x1 coord [1]=result y1 coord [2]=result x2 coord [3]=result y2 coord [4]=-101 if the next text area is the remainder of this word on another line; any other value is ignored.
 void generateTeasers()
          tell find text to generate teasers as well
 float[] getEndPoints()
          return endpoints from last findtext
 java.lang.String[] getTeasers()
          return text teasers from findtext if generateTeasers() called before find
 int getWordDetectionTechnique()
          Get the value of the word detection technique
static java.lang.String removeHiddenMarkers(java.lang.String contents)
          Returns contents with the hidden markers removed, so the data can be shown without encoding
 void setIncludeHTML(boolean value)
          sets whether HTML markup is included in teasers (that is, whether the matched word in a teaser is marked up or left as plain text)
static void setSeparator(java.lang.String sep)
           
 void setWordDetectionTechnique(int wordDetectionTechnique)
          Set the word detection technique using one of the constants USER_DEFINED_LIST_ONLY (0) or SURROUND_BY_ANY_PUNCTUATION (1)
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

wordDetectionTechnique

public int wordDetectionTechnique

USER_DEFINED_LIST_ONLY

public static final int USER_DEFINED_LIST_ONLY
See Also:
Constant Field Values

SURROUND_BY_ANY_PUNCTUATION

public static final int SURROUND_BY_ANY_PUNCTUATION
See Also:
Constant Field Values

SystemSeparator

public static java.lang.String SystemSeparator

MARKER2

public static char MARKER2

oldTextExtraction

public static boolean oldTextExtraction
Flag used to debug new text routines


useUnrotatedCoords

public static boolean useUnrotatedCoords

Constructor Detail

PdfGroupingAlgorithms

public PdfGroupingAlgorithms(org.jpedal.objects.PdfData pdf_data)
create a new instance, passing in raw data

Method Detail

setSeparator

public static void setSeparator(java.lang.String sep)

setIncludeHTML

public void setIncludeHTML(boolean value)
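sets whether HTML markup is included in teasers (that is, whether the matched word in a teaser is marked up or left as plain text)

Parameters:
value - true to include HTML markup in teasers, false to leave them as plain text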

removeHiddenMarkers

public static java.lang.String removeHiddenMarkers(java.lang.String contents)
Returns contents with the hidden markers removed, so the data can be shown without encoding


extractTextAsTable

public final java.util.Map extractTextAsTable(int x1,
                                              int y1,
                                              int x2,
                                              int y2,
                                              int pageNumber,
                                              boolean isCSV,
                                              boolean keepFontInfo,
                                              boolean keepWidthInfo,
                                              boolean keepAlignmentInfo,
                                              int borderWidth)
                                       throws PdfException
Extracts the text in the given rectangle as a table by calling various low-level merging routines. isCSV sets whether the output is XHTML or CSV. XHTML output also has options to include font tags (keepFontInfo), preserve widths (keepWidthInfo), try to preserve alignment (keepAlignmentInfo), and set a table border width (borderWidth). AddCustomTags should always be set to false.

Parameters:
x1 - is the x coord of the top left corner
y1 - is the y coord of the top left corner
x2 - is the x coord of the bottom right corner
y2 - is the y coord of the bottom right corner
pageNumber - is the page you wish to extract from
isCSV - if true the text is output as CSV; if false the output is XHTML
keepFontInfo - if true and isCSV is false, keeps font information in the extracted text
keepWidthInfo - if true and isCSV is false, keeps width information in the extracted text
keepAlignmentInfo - if true and isCSV is false, keeps alignment information in the extracted text
borderWidth - is the width of the table border for XHTML output
Returns:
Map containing text found in estimated table cells
Throws:
PdfException - If the co-ordinates are not valid
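The parameter list above describes (x1, y1) as the top left corner and (x2, y2) as the bottom right corner. The sketch below checks and reorders a rectangle to match that description before it is passed in. ExtractionArea is a hypothetical helper, not part of JPedal, and it assumes default PDF page coordinates in which y grows upwards (so the top edge has the larger y value). What the library itself treats as "not valid" is not stated here.

```java
// Hypothetical helper, not part of JPedal.
final class ExtractionArea {
    final int x1, y1, x2, y2;

    ExtractionArea(int x1, int y1, int x2, int y2) {
        this.x1 = x1;
        this.y1 = y1;
        this.x2 = x2;
        this.y2 = y2;
    }

    // Assumption: y grows upwards, so top left means smaller x and larger y.
    boolean isWellFormed() {
        return x1 < x2 && y1 > y2;
    }

    // Returns a copy whose corners are reordered to top left / bottom right.
    ExtractionArea normalised() {
        return new ExtractionArea(Math.min(x1, x2), Math.max(y1, y2),
                                  Math.max(x1, x2), Math.min(y1, y2));
    }

    public static void main(String[] args) {
        ExtractionArea area = new ExtractionArea(300, 100, 50, 700).normalised();
        System.out.println(area.x1 + "," + area.y1 + " " + area.x2 + "," + area.y2);
    }
}
```

The normalised values could then be passed as x1, y1, x2, y2 to any of the extraction methods on this page.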

extractTextAsWordlist

public final java.util.Vector extractTextAsWordlist(int x1,
                                                    int y1,
                                                    int x2,
                                                    int y2,
                                                    int page_number,
                                                    boolean breakFragments,
                                                    java.lang.String punctuation)
                                             throws PdfException
Extracts the words found within the given coordinates into a Vector of word, word coords (x1,y1,x2,y2)

Parameters:
x1 - is the x coord of the top left corner
y1 - is the y coord of the top left corner
x2 - is the x coord of the bottom right corner
y2 - is the y coord of the bottom right corner
page_number - is the page you wish to extract from
breakFragments - will divide up text based on white space characters
punctuation - is a string containing all values that should be used to divide up words
Returns:
Vector containing words found and words coordinates (word, x1,y1,x2,y2...)
Throws:
PdfException - If the co-ordinates are not valid
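The returned Vector is described as a flat sequence (word, x1, y1, x2, y2, ...). A minimal sketch of walking such a list in groups of five follows. WordBox is a hypothetical class, not part of JPedal, and the element types are an assumption: the coordinates are converted through their string form so that the code works whether they arrive as Strings or as numbers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical holder for one word and its bounding coordinates.
final class WordBox {
    final String word;
    final float x1, y1, x2, y2;

    WordBox(String word, float x1, float y1, float x2, float y2) {
        this.word = word;
        this.x1 = x1;
        this.y1 = y1;
        this.x2 = x2;
        this.y2 = y2;
    }

    // Reads a flat list laid out as word, x1, y1, x2, y2, word, x1, ...
    static List<WordBox> fromFlatList(List<?> flat) {
        if (flat.size() % 5 != 0) {
            throw new IllegalArgumentException("expected groups of five values");
        }
        List<WordBox> boxes = new ArrayList<>();
        for (int i = 0; i < flat.size(); i += 5) {
            boxes.add(new WordBox(String.valueOf(flat.get(i)),
                    toFloat(flat.get(i + 1)), toFloat(flat.get(i + 2)),
                    toFloat(flat.get(i + 3)), toFloat(flat.get(i + 4))));
        }
        return boxes;
    }

    // Assumption: each coordinate is a String or a Number.
    private static float toFloat(Object value) {
        return Float.parseFloat(String.valueOf(value));
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for a real extraction result.
        List<Object> sample = Arrays.<Object>asList("Hello", "72", "700", "110", "688");
        for (WordBox box : fromFlatList(sample)) {
            System.out.println(box.word + " at " + box.x1 + "," + box.y1);
        }
    }
}
```

Because java.util.Vector implements List, the Vector returned by extractTextAsWordlist could be passed to fromFlatList directly.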

extractTextInRectangle

public final java.lang.String extractTextInRectangle(int x1,
                                                     int y1,
                                                     int x2,
                                                     int y2,
                                                     int page_number,
                                                     boolean estimateParagraphs,
                                                     boolean breakFragments)
                                              throws PdfException
algorithm to place data from specified coordinates on a page into a String.

Parameters:
x1 - is the x coord of the top left corner
y1 - is the y coord of the top left corner
x2 - is the x coord of the bottom right corner
y2 - is the y coord of the bottom right corner
page_number - is the page you wish to extract from
estimateParagraphs - will attempt to find paragraphs and add new lines in output if true
breakFragments - will divide up text based on white space characters if true
Returns:
String containing the text found within the specified coordinates
Throws:
PdfException - If the co-ordinates are not valid

findMultipleTermsInRectangleWithMatchingTeasers

public java.util.SortedMap findMultipleTermsInRectangleWithMatchingTeasers(int x1,
                                                                           int y1,
                                                                           int x2,
                                                                           int y2,
                                                                           int rotation,
                                                                           int page_number,
                                                                           java.lang.String[] terms,
                                                                           int searchType,
                                                                           org.jpedal.grouping.SearchListener listener)
                                                                    throws PdfException
Algorithm to find multiple text terms in x1,y1,x2,y2 rectangle on page_number, with matching teaser

Parameters:
x1 - the left x coord
y1 - the upper y coord
x2 - the right x coord
y2 - the lower y coord
rotation - the rotation of the page to be searched
page_number - the page number to search on
terms - the terms to search for
searchType - the search type made up from one or more constants obtained from the SearchType class
listener - an implementation of SearchListener (required); it enables searching to be cancelled
Returns:
a SortedMap containing a collection of Rectangle describing the location of found text, mapped to a String which is the matching teaser
Throws:
PdfException - If the co-ordinates are not valid
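The return value maps each Rectangle to its teaser String. A minimal sketch of reading such a map follows. TeaserReport and its sample data are hypothetical, and the comparator is an assumption: java.awt.Rectangle does not implement Comparable, so any SortedMap keyed on it needs one, and the ordering the library actually uses is not documented here. The method returns a raw SortedMap, so a cast to SortedMap<Rectangle, String> would be needed before calling describe.

```java
import java.awt.Rectangle;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical helper for reading a Rectangle-to-teaser map.
final class TeaserReport {

    // Builds one printable line per hit, following the map's own order.
    static List<String> describe(SortedMap<Rectangle, String> hits) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<Rectangle, String> entry : hits.entrySet()) {
            Rectangle r = entry.getKey();
            lines.add("[" + r.x + "," + r.y + " " + r.width + "x" + r.height + "] "
                    + entry.getValue());
        }
        return lines;
    }

    // Made-up data standing in for a real search result.
    static SortedMap<Rectangle, String> sampleHits() {
        // Assumed ordering for the sample only: by y, then by x.
        Comparator<Rectangle> byPosition =
                Comparator.comparingInt((Rectangle r) -> r.y).thenComparingInt(r -> r.x);
        SortedMap<Rectangle, String> hits = new TreeMap<>(byPosition);
        hits.put(new Rectangle(200, 500, 40, 12), "the invoice total is shown");
        hits.put(new Rectangle(72, 300, 40, 12), "a total of three pages");
        return hits;
    }

    public static void main(String[] args) {
        for (String line : describe(sampleHits())) {
            System.out.println(line);
        }
    }
}
```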

findMultipleTermsInRectangle

public java.util.List findMultipleTermsInRectangle(int x1,
                                                   int y1,
                                                   int x2,
                                                   int y2,
                                                   int rotation,
                                                   int page_number,
                                                   java.lang.String[] terms,
                                                   boolean orderResults,
                                                   int searchType,
                                                   org.jpedal.grouping.SearchListener listener)
                                            throws PdfException
Algorithm to find multiple text terms in x1,y1,x2,y2 rectangle on page_number.

Parameters:
x1 - the left x coord
y1 - the upper y coord
x2 - the right x coord
y2 - the lower y coord
rotation - the rotation of the page to be searched
page_number - the page number to search on
terms - the terms to search for
orderResults - if true, the returned list is ordered so that the resulting rectangles follow a logical order descending down the page; if false, rectangles for multiple terms are grouped together
searchType - the search type made up from one or more constants obtained from the SearchType class
listener - an implementation of SearchListener (required); it enables searching to be cancelled
Returns:
a list of Rectangle describing the location of found text
Throws:
PdfException - If the co-ordinates are not valid
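The effect of orderResults can be illustrated with a small sketch. ResultOrdering is a hypothetical class, not JPedal code, and its reading of "descending down the page" is an assumption: larger y values are taken to be higher on the page, with ties broken left to right.

```java
import java.awt.Rectangle;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of grouped versus ordered results.
final class ResultOrdering {

    // Assumption: larger y is higher on the page, so reading order is
    // descending y, then ascending x.
    static final Comparator<Rectangle> DOWN_THE_PAGE =
            Comparator.comparingInt((Rectangle r) -> r.y).reversed()
                      .thenComparingInt(r -> r.x);

    // hitsPerTerm holds one list of rectangles for each search term.
    static List<Rectangle> combine(List<List<Rectangle>> hitsPerTerm, boolean orderResults) {
        List<Rectangle> all = new ArrayList<>();
        for (List<Rectangle> hits : hitsPerTerm) {
            all.addAll(hits); // keeps the hits grouped term by term
        }
        if (orderResults) {
            all.sort(DOWN_THE_PAGE); // interleaves terms in page order
        }
        return all;
    }

    public static void main(String[] args) {
        List<List<Rectangle>> hits = Arrays.asList(
                Arrays.asList(new Rectangle(72, 700, 40, 12), new Rectangle(72, 300, 40, 12)),
                Arrays.asList(new Rectangle(100, 500, 40, 12)));
        System.out.println(combine(hits, false));
        System.out.println(combine(hits, true));
    }
}
```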

findTextInRectangle

public final float[] findTextInRectangle(int x1,
                                         int y1,
                                         int x2,
                                         int y2,
                                         int page_number,
                                         java.lang.String textValue,
                                         int searchType)
                                  throws PdfException
Algorithm to find textValue in x1,y1,x2,y2 rectangle on page_number

Parameters:
x1 - the left x coord
y1 - the upper y coord
x2 - the right x coord
y2 - the lower y coord
page_number - the page number to search on
textValue - the text string to search for
searchType - the search type made up from one or more constants obtained from the SearchType class
Returns:
array of coordinates for found text
Throws:
PdfException - If the co-ordinates are not valid
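The searchType parameter is described as "made up from one or more constants". One common way to build such a value is to combine bit flags with a bitwise OR, sketched below. The constant names and values in DemoSearchFlags are hypothetical and are not the real SearchType constants, and whether SearchType uses bit flags at all is an assumption to check against that class's documentation.

```java
// Hypothetical flags for illustration only; the real constants are
// defined in the SearchType class.
final class DemoSearchFlags {
    static final int WHOLE_WORDS = 1;
    static final int MATCH_CASE = 2;
    static final int FIRST_HIT_ONLY = 4;

    // True when every bit of flag is present in searchType.
    static boolean isSet(int searchType, int flag) {
        return (searchType & flag) == flag;
    }

    public static void main(String[] args) {
        int searchType = WHOLE_WORDS | MATCH_CASE; // combine two options
        System.out.println(isSet(searchType, WHOLE_WORDS));
        System.out.println(isSet(searchType, FIRST_HIT_ONLY));
    }
}
```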

findTextInRectangleAcrossLines

public final float[] findTextInRectangleAcrossLines(int x1,
                                                    int y1,
                                                    int x2,
                                                    int y2,
                                                    int page_number,
                                                    java.lang.String textValue,
                                                    int searchType)
                                             throws PdfException
Method to find text in the specified area allowing for the text to be split across multiple lines.
The results are returned in a float[] whose coords are organised in the following order.
[0]=result x1 coord [1]=result y1 coord [2]=result x2 coord [3]=result y2 coord [4]=-101 if the next text area is the remainder of this word on another line; any other value is ignored.

Parameters:
x1 - the x coord of the top left of the search area
y1 - the y coord of the top left of the search area
x2 - the x coord of the bottom right of the search area
y2 - the y coord of the bottom right of the search area
page_number - the current page to search
textValue - the text to search for
searchType - info on how to search the pdf
Returns:
the coords of the found text followed by a value to specify if the following area should be linked as multi line
Throws:
PdfException
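The float[] layout described above can be unpacked with a short loop. The sketch below assumes the array holds consecutive groups of five values, one group per text area, and joins areas whose fifth value is -101 with the area that follows. AcrossLinesResult is a hypothetical helper, not part of JPedal.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for grouping multi-line search results.
final class AcrossLinesResult {
    // Marker value from the method description above.
    static final float CONTINUES_ON_NEXT_AREA = -101f;

    // Returns one entry per match; each match is a list of {x1, y1, x2, y2} areas.
    static List<List<float[]>> group(float[] coords) {
        if (coords.length % 5 != 0) {
            throw new IllegalArgumentException("expected groups of five values");
        }
        List<List<float[]>> matches = new ArrayList<>();
        List<float[]> current = new ArrayList<>();
        for (int i = 0; i < coords.length; i += 5) {
            current.add(new float[] {coords[i], coords[i + 1], coords[i + 2], coords[i + 3]});
            if (coords[i + 4] != CONTINUES_ON_NEXT_AREA) {
                matches.add(current); // this area ends the match
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            matches.add(current); // last area was flagged but nothing followed
        }
        return matches;
    }

    public static void main(String[] args) {
        // Made-up data: one match split over two lines, then a single-line match.
        float[] raw = {72, 700, 300, 688, -101, 72, 686, 120, 674, 0, 200, 500, 260, 488, 0};
        System.out.println(group(raw).size() + " matches");
    }
}
```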

getEndPoints

public float[] getEndPoints()
return endpoints from last findtext


getTeasers

public java.lang.String[] getTeasers()
return text teasers from findtext if generateTeasers() called before find


generateTeasers

public void generateTeasers()
tell find text to generate teasers as well


getWordDetectionTechnique

public int getWordDetectionTechnique()
Get the value of the word detection technique

Returns:
the int value of the word detection technique

setWordDetectionTechnique

public void setWordDetectionTechnique(int wordDetectionTechnique)
Set the word detection technique using one of the constants USER_DEFINED_LIST_ONLY (0) or SURROUND_BY_ANY_PUNCTUATION (1)

Parameters:
wordDetectionTechnique - the int value for the word detection technique
