JPedal Java PDF library 4.01b28 API Documentation - http://www.jpedal.org
java.lang.Object
  org.jpedal.grouping.PdfGroupingAlgorithms
public class PdfGroupingAlgorithms
Applies heuristics to unstructured PDF text to create content
| Field Summary | | |
|---|---|---|
| static char | MARKER2 | |
| static boolean | oldTextExtraction | Flag used to debug new text routines |
| static int | SURROUND_BY_ANY_PUNCTUATION | |
| static java.lang.String | SystemSeparator | |
| static int | USER_DEFINED_LIST_ONLY | |
| static boolean | useUnrotatedCoords | |
| int | wordDetectionTechnique | |
| Constructor Summary | |
|---|---|
| PdfGroupingAlgorithms(org.jpedal.objects.PdfData pdf_data) | Creates a new instance, passing in raw data |
| Method Summary | | |
|---|---|---|
| java.util.Map | extractTextAsTable(int x1, int y1, int x2, int y2, int pageNumber, boolean isCSV, boolean keepFontInfo, boolean keepWidthInfo, boolean keepAlignmentInfo, int borderWidth) | Calls various low-level merging routines to extract the text as a table. isCSV selects CSV output (true) or XHTML output (false). XHTML output can also include font tags (keepFontInfo), preserve widths (keepWidthInfo), try to preserve alignment (keepAlignmentInfo), and set a table border width (borderWidth). AddCustomTags should always be set to false. |
| java.util.Vector | extractTextAsWordlist(int x1, int y1, int x2, int y2, int page_number, boolean breakFragments, java.lang.String punctuation) | Places data from within the coordinates into a Vector of word, word coords (x1,y1,x2,y2) |
| java.lang.String | extractTextInRectangle(int x1, int y1, int x2, int y2, int page_number, boolean estimateParagraphs, boolean breakFragments) | Places data from the specified coordinates on a page into a String |
| java.util.List | findMultipleTermsInRectangle(int x1, int y1, int x2, int y2, int rotation, int page_number, java.lang.String[] terms, boolean orderResults, int searchType, org.jpedal.grouping.SearchListener listener) | Finds multiple text terms in the x1,y1,x2,y2 rectangle on page_number |
| java.util.SortedMap | findMultipleTermsInRectangleWithMatchingTeasers(int x1, int y1, int x2, int y2, int rotation, int page_number, java.lang.String[] terms, int searchType, org.jpedal.grouping.SearchListener listener) | Finds multiple text terms in the x1,y1,x2,y2 rectangle on page_number, with matching teasers |
| float[] | findTextInRectangle(int x1, int y1, int x2, int y2, int page_number, java.lang.String textValue, int searchType) | Finds textValue in the x1,y1,x2,y2 rectangle on page_number |
| float[] | findTextInRectangleAcrossLines(int x1, int y1, int x2, int y2, int page_number, java.lang.String textValue, int searchType) | Finds text in the specified area, allowing for the text to be split across multiple lines. The results are returned in a float[] whose coordinates are organised in the following order: [0] = result x1 coord, [1] = result y1 coord, [2] = result x2 coord, [3] = result y2 coord, [4] = -101 to show that the next text area is the remainder of this word on another line; any other value is ignored. |
| void | generateTeasers() | Tells find text to generate teasers as well |
| float[] | getEndPoints() | Returns the endpoints from the last find text |
| java.lang.String[] | getTeasers() | Returns text teasers from find text if generateTeasers() was called before the find |
| int | getWordDetectionTechnique() | Gets the value of the word detection technique |
| static java.lang.String | removeHiddenMarkers(java.lang.String contents) | Method to show data without encoding |
| void | setIncludeHTML(boolean value) | Sets whether teasers include HTML, that is, whether the matched word in a teaser such as "this is word" is wrapped in HTML markup or left as plain text |
| static void | setSeparator(java.lang.String sep) | |
| void | setWordDetectionTechnique(int wordDetectionTechnique) | Sets the word detection technique using one of the constants USER_DEFINED_LIST_ONLY (0) or SURROUND_BY_ANY_PUNCTUATION (1) |
| Methods inherited from class java.lang.Object |
|---|
| equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Field Detail |
|---|
public int wordDetectionTechnique
public static final int USER_DEFINED_LIST_ONLY
public static final int SURROUND_BY_ANY_PUNCTUATION
public static java.lang.String SystemSeparator
public static char MARKER2
public static boolean oldTextExtraction
public static boolean useUnrotatedCoords
| Constructor Detail |
|---|
public PdfGroupingAlgorithms(org.jpedal.objects.PdfData pdf_data)
| Method Detail |
|---|
public static void setSeparator(java.lang.String sep)
public void setIncludeHTML(boolean value)
value - whether to include HTML in teasers
public static java.lang.String removeHiddenMarkers(java.lang.String contents)
public final java.util.Map extractTextAsTable(int x1,
int y1,
int x2,
int y2,
int pageNumber,
boolean isCSV,
boolean keepFontInfo,
boolean keepWidthInfo,
boolean keepAlignmentInfo,
int borderWidth)
throws PdfException
x1 - the x coord of the top left corner
y1 - the y coord of the top left corner
x2 - the x coord of the bottom right corner
y2 - the y coord of the bottom right corner
pageNumber - the page you wish to extract from
isCSV - if false the output is XHTML; if true the text is output as CSV
keepFontInfo - if true and isCSV is false, keeps font information in the extracted text
keepWidthInfo - if true and isCSV is false, keeps width information in the extracted text
keepAlignmentInfo - if true and isCSV is false, keeps alignment information in the extracted text
borderWidth - the width of the border for XHTML output
PdfException - If the co-ordinates are not valid
public final java.util.Vector extractTextAsWordlist(int x1,
int y1,
int x2,
int y2,
int page_number,
boolean breakFragments,
java.lang.String punctuation)
throws PdfException
x1 - the x coord of the top left corner
y1 - the y coord of the top left corner
x2 - the x coord of the bottom right corner
y2 - the y coord of the bottom right corner
page_number - the page you wish to extract from
breakFragments - if true, divides up text based on white space characters
punctuation - a string containing all values that should be used to divide up words
PdfException - If the co-ordinates are not valid
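The summary describes the returned Vector as holding each word together with its coords (x1,y1,x2,y2). The sketch below shows one way such a flat list could be unpacked. It assumes five consecutive elements per word (word, x1, y1, x2, y2) and that each coordinate's toString() value parses as a float; neither detail is confirmed by this page. `WordListReader` and `PositionedWord` are hypothetical names, not part of JPedal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Vector;

final class WordListReader {

    /** One word plus its bounding box. */
    static final class PositionedWord {
        final String text;
        final float x1, y1, x2, y2;

        PositionedWord(String text, float x1, float y1, float x2, float y2) {
            this.text = text;
            this.x1 = x1;
            this.y1 = y1;
            this.x2 = x2;
            this.y2 = y2;
        }
    }

    /**
     * Assumption: the Vector is flat, with five consecutive elements per word
     * (word, x1, y1, x2, y2). Trailing elements that do not fill a group are ignored.
     */
    static List<PositionedWord> read(Vector<?> raw) {
        List<PositionedWord> words = new ArrayList<>();
        if (raw == null) {
            return words;
        }
        for (int i = 0; i + 5 <= raw.size(); i += 5) {
            String text = String.valueOf(raw.get(i));
            float x1 = Float.parseFloat(String.valueOf(raw.get(i + 1)));
            float y1 = Float.parseFloat(String.valueOf(raw.get(i + 2)));
            float x2 = Float.parseFloat(String.valueOf(raw.get(i + 3)));
            float y2 = Float.parseFloat(String.valueOf(raw.get(i + 4)));
            words.add(new PositionedWord(text, x1, y1, x2, y2));
        }
        return words;
    }

    public static void main(String[] args) {
        // Hand-built sample standing in for the Vector returned by extractTextAsWordlist.
        Vector<Object> sample = new Vector<>();
        sample.add("Hello");
        sample.add("72");
        sample.add("720");
        sample.add("101");
        sample.add("708");
        for (PositionedWord w : read(sample)) {
            System.out.println(w.text + " (" + w.x1 + "," + w.y1 + ")-(" + w.x2 + "," + w.y2 + ")");
        }
    }
}
```

With a real document, the Vector returned by extractTextAsWordlist would be passed to `read` in place of the sample.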
public final java.lang.String extractTextInRectangle(int x1,
int y1,
int x2,
int y2,
int page_number,
boolean estimateParagraphs,
boolean breakFragments)
throws PdfException
x1 - the x coord of the top left corner
y1 - the y coord of the top left corner
x2 - the x coord of the bottom right corner
y2 - the y coord of the bottom right corner
page_number - the page you wish to extract from
estimateParagraphs - if true, attempts to find paragraphs and adds new lines to the output
breakFragments - if true, divides up text based on white space characters
PdfException - If the co-ordinates are not valid
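The extraction methods above take (x1, y1) as the top left corner and (x2, y2) as the bottom right corner. A small helper along the following lines could order two arbitrary corner points that way before a call. It assumes PDF-style page coordinates in which y increases up the page, so the top edge has the larger y value. `PageRectangle` and `fromCorners` are hypothetical names, not part of JPedal, and this page does not state which coordinate checks cause a PdfException.

```java
final class PageRectangle {
    final int x1, y1, x2, y2;

    private PageRectangle(int x1, int y1, int x2, int y2) {
        this.x1 = x1;
        this.y1 = y1;
        this.x2 = x2;
        this.y2 = y2;
    }

    /**
     * Builds a rectangle from any two opposite corners.
     * Assumption: y grows upwards, so top left = (min x, max y)
     * and bottom right = (max x, min y).
     */
    static PageRectangle fromCorners(int ax, int ay, int bx, int by) {
        return new PageRectangle(
                Math.min(ax, bx), Math.max(ay, by),
                Math.max(ax, bx), Math.min(ay, by));
    }

    /** True when the rectangle has no area and so cannot contain text. */
    boolean isEmpty() {
        return x1 == x2 || y1 == y2;
    }

    public static void main(String[] args) {
        // Corners supplied in the "wrong" order are swapped into place.
        PageRectangle r = fromCorners(300, 100, 50, 700);
        System.out.println(r.x1 + "," + r.y1 + " -> " + r.x2 + "," + r.y2);
    }
}
```

The four fields can then be passed as x1, y1, x2, y2 to any of the rectangle-based methods.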
public java.util.SortedMap findMultipleTermsInRectangleWithMatchingTeasers(int x1,
int y1,
int x2,
int y2,
int rotation,
int page_number,
java.lang.String[] terms,
int searchType,
org.jpedal.grouping.SearchListener listener)
throws PdfException
x1 - the left x coord
y1 - the upper y coord
x2 - the right x coord
y2 - the lower y coord
rotation - the rotation of the page to be searched
page_number - the page number to search on
terms - the terms to search for
searchType - the search type, made up from one or more constants obtained from the SearchType class
listener - an implementation of SearchListener, required so that searching can be cancelled
PdfException - If the co-ordinates are not valid
public java.util.List findMultipleTermsInRectangle(int x1,
int y1,
int x2,
int y2,
int rotation,
int page_number,
java.lang.String[] terms,
boolean orderResults,
int searchType,
org.jpedal.grouping.SearchListener listener)
throws PdfException
x1 - the left x coord
y1 - the upper y coord
x2 - the right x coord
y2 - the lower y coord
rotation - the rotation of the page to be searched
page_number - the page number to search on
terms - the terms to search for
orderResults - if true, the returned list is ordered so that the resulting rectangles follow a logical order descending down the page; if false, rectangles for multiple terms are grouped together
searchType - the search type, made up from one or more constants obtained from the SearchType class
listener - an implementation of SearchListener, required so that searching can be cancelled
PdfException - If the co-ordinates are not valid
public final float[] findTextInRectangle(int x1,
int y1,
int x2,
int y2,
int page_number,
java.lang.String textValue,
int searchType)
throws PdfException
x1 - the left x coord
y1 - the upper y coord
x2 - the right x coord
y2 - the lower y coord
page_number - the page number to search on
textValue - the text string to search for
searchType - the search type, made up from one or more constants obtained from the SearchType class
PdfException - If the co-ordinates are not valid
public final float[] findTextInRectangleAcrossLines(int x1,
int y1,
int x2,
int y2,
int page_number,
java.lang.String textValue,
int searchType)
throws PdfException
x1 - x coord of the top left of the search area
y1 - y coord of the top left of the search area
x2 - x coord of the bottom right of the search area
y2 - y coord of the bottom right of the search area
page_number - the current page to search
textValue - the text to search for
searchType - info on how to search the PDF
PdfException
public float[] getEndPoints()
public java.lang.String[] getTeasers()
public void generateTeasers()
public int getWordDetectionTechnique()
public void setWordDetectionTechnique(int wordDetectionTechnique)
wordDetectionTechnique - the int value for the word detection technique
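The float[] layout documented for findTextInRectangleAcrossLines ([0] to [3] hold a rectangle, [4] is -101 when the next area is the remainder of the same word on another line) can be decoded as sketched below. The sketch assumes that multiple results are packed back to back in groups of five values, which this page does not state outright. `AcrossLinesResults` and `groupMatches` are hypothetical names, not part of JPedal.

```java
import java.util.ArrayList;
import java.util.List;

final class AcrossLinesResults {
    /** Marker documented for index [4]: the next area continues this match. */
    static final float CONTINUES_ON_NEXT_LINE = -101f;

    /** Assumption: five floats per result area (x1, y1, x2, y2, marker). */
    static final int STRIDE = 5;

    /**
     * Groups the flat array into matches. Each match is a list of
     * {x1, y1, x2, y2} rectangles, one per line the match spans.
     */
    static List<List<float[]>> groupMatches(float[] coords) {
        List<List<float[]>> matches = new ArrayList<>();
        if (coords == null) {
            return matches;
        }
        List<float[]> current = new ArrayList<>();
        for (int i = 0; i + STRIDE <= coords.length; i += STRIDE) {
            current.add(new float[] {coords[i], coords[i + 1], coords[i + 2], coords[i + 3]});
            boolean continues = coords[i + 4] == CONTINUES_ON_NEXT_LINE;
            if (!continues) {
                matches.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            // Last area was flagged as continuing but nothing followed; keep what we have.
            matches.add(current);
        }
        return matches;
    }

    public static void main(String[] args) {
        // Hand-built sample: one match split over two lines, then a single-line match.
        float[] sample = {
            10f, 700f, 80f, 690f, -101f,
            10f, 688f, 40f, 678f, 0f,
            200f, 500f, 260f, 490f, 0f
        };
        for (List<float[]> match : groupMatches(sample)) {
            System.out.println("match spanning " + match.size() + " area(s)");
        }
    }
}
```

Grouping by the -101 marker keeps the areas of a split word together, so a caller can highlight or report them as one hit.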