PdfProcessor
extends TextProcessor
Used to create crawl summary information for PDF files
Table of Contents
- $image_types : array<string|int, mixed>
- Array of file types which should be considered images.
- $indexed_file_types : array<string|int, mixed>
- Array of file extensions which can be handled by the search engine; other extensions will be ignored.
- $max_description_len : int
- Max number of chars to extract for description from a page to index.
- $max_links_to_extract : int
- Maximum number of urls to extract from a single document
- $mime_processor : array<string|int, mixed>
- Associative array of mime_type => (page processor name that can process that type). Sub-classes add to this array with the types they handle.
- $plugin_instances : array<string|int, mixed>
- Indexing plugins which might be used with the current processor
- $summarizer : object
- Stores the summarizer object used by this instance of page processor to be used in generating a summary
- $summarizer_option : string
- Stores the name of the summarizer used for crawling.
- $text_data : bool
- Whether the current processor is for text data (i.e., text, html, xml, etc) or for some other format (gif, png, etc)
- __construct() : mixed
- Sets up any indexing plugins associated with this page processor
- calculateLang() : string
- Tries to determine the language of the document by looking at the provided $sample_text and $url
- closeDanglingTags() : mixed
- If the end of the file is reached before closing tags are seen, this method closes those tags in the correct order.
- convertChar() : string
- Used to convert characters from one of the built in PDF encodings to UTF-8
- createThumb() : mixed
- Used to create a thumbnail file in a thumb folder from a PDF file, provided the ImageMagick command convert exists.
- dom() : object
- Return a document object based on a string containing the contents of a web page
- extractHttpHttpsUrls() : array<string|int, mixed>
- Tries to extract http or https links from a string of text.
- getBetweenTags() : array<string|int, mixed>
- Gets the text between two tags in a document starting at the current position.
- getEncodingTitle() : array<string|int, mixed>
- Returns the first encoding format information found in the PDF document
- getNextObject() : string
- Gets the text between an obj and endobj tag at the current position in a PDF document
- getObjectDictionary() : string
- Gets the object dictionary portion of the current PDF object
- getObjectStream() : string
- Gets the object stream portion of the current PDF object
- getText() : string
- Gets the text out of a PDF document
- handle() : array<string|int, mixed>
- Method used to handle processing data for a web page. It makes a summary for the page (via the process() function which should be subclassed) as well as runs any plugins that are associated with the processors to create sub-documents
- initializeIndexedFileTypes() : mixed
- Gets processors for different file types. Constructing them will populate the self::$indexed_file_types, self::$image_types, and self::$mime_processor arrays
- objectDictionaryHas() : bool
- Checks if the PDF object's object dictionary is in a list of types
- parseBrackets() : array<string|int, mixed>
- Extracts text up to the next closing bracket
- parseParentheses() : array<string|int, mixed>
- Extracts ASCII text up to the next closing parenthesis
- parseText() : string
- Extracts text from PDF data, getting rid of non-printable data, square brackets, and parentheses, and converting char codes to their values.
- process() : array<string|int, mixed>
- Used to extract the title, description and links from a string consisting of PDF data.
Properties
$image_types
Array of file types which should be considered images.
public
static array<string|int, mixed>
$image_types
= []
Sub-classes add to this array with the types they handle
$indexed_file_types
Array of file extensions which can be handled by the search engine; other extensions will be ignored.
public
static array<string|int, mixed>
$indexed_file_types
= ["unknown"]
Sub-classes add to this array with the types they handle
$max_description_len
Max number of chars to extract for description from a page to index.
public
static int
$max_description_len
Only words in the description are indexed.
$max_links_to_extract
Maximum number of urls to extract from a single document
public
static int
$max_links_to_extract
$mime_processor
Associative array of mime_type => (page processor name that can process that type). Sub-classes add to this array with the types they handle.
public
static array<string|int, mixed>
$mime_processor
= []
$plugin_instances
Indexing plugins which might be used with the current processor
public
array<string|int, mixed>
$plugin_instances
$summarizer
Stores the summarizer object used by this instance of page processor to be used in generating a summary
public
object
$summarizer
$summarizer_option
Stores the name of the summarizer used for crawling.
public
string
$summarizer_option
Possible values are self::BASIC_SUMMARIZER, self::GRAPH_BASED_SUMMARIZER, self::CENTROID_SUMMARIZER and self::CENTROID_WEIGHTED_SUMMARIZER
$text_data
Whether the current processor is for text data (i.e., text, html, xml, etc) or for some other format (gif, png, etc)
public
bool
$text_data
Methods
__construct()
Sets up any indexing plugins associated with this page processor
public
__construct([array<string|int, mixed> $plugins = [] ][, int $max_description_len = null ][, int $max_links_to_extract = null ][, string $summarizer_option = self::BASIC_SUMMARIZER ]) : mixed
Parameters
- $plugins : array<string|int, mixed> = []
-
an array of indexing plugins which might do further processing on the data handled by this page processor
- $max_description_len : int = null
-
maximal length of a page summary
- $max_links_to_extract : int = null
-
maximum number of links to extract from a single document
- $summarizer_option : string = self::BASIC_SUMMARIZER
-
CRAWL_CONSTANT specifying what kind of summarizer to use: self::BASIC_SUMMARIZER, self::GRAPH_BASED_SUMMARIZER, self::CENTROID_SUMMARIZER, or self::CENTROID_WEIGHTED_SUMMARIZER
Return values
mixed
calculateLang()
Tries to determine the language of the document by looking at the provided $sample_text and $url
public
static calculateLang([string $sample_text = null ][, string $url = null ]) : string
Parameters
- $sample_text : string = null
-
sample text to try to guess the language from
- $url : string = null
-
URL of the web page; as a fallback, the country in the URL is used to figure out the language
Return values
string —language tag for guessed language
closeDanglingTags()
If the end of the file is reached before closing tags are seen, this method closes those tags in the correct order.
public
static closeDanglingTags(string &$page) : mixed
Parameters
- $page : string
-
a reference to an xml or html document
Return values
mixed
convertChar()
Used to convert characters from one of the built in PDF encodings to UTF-8
public
static convertChar(char $cur_char, string $encoding) : string
Parameters
- $cur_char : char
-
character to convert
- $encoding : string
-
which of the default (if any) PDF encoding formats is being used: MacRomanEncoding, WinAnsiEncoding, PDFDocEncoding, etc.
Return values
string —resulting converted string for the character
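As an illustration of this kind of conversion, the sketch below maps a single WinAnsiEncoding byte to UTF-8. It is not the PdfProcessor implementation: the helper names `codePointToUtf8()` and `winAnsiToUtf8()` are hypothetical, WinAnsiEncoding is assumed to behave like Windows code page 1252, and the lookup table is deliberately partial.

```php
<?php
// Illustrative sketch only; NOT the PdfProcessor::convertChar() code.
// Both function names below are hypothetical.

/** Encodes a Unicode code point below U+10000 as a UTF-8 string. */
function codePointToUtf8(int $code_point): string
{
    if ($code_point < 0x80) {
        return chr($code_point); // one byte
    }
    if ($code_point < 0x800) {
        return chr(0xC0 | ($code_point >> 6)) .
            chr(0x80 | ($code_point & 0x3F)); // two bytes
    }
    return chr(0xE0 | ($code_point >> 12)) .
        chr(0x80 | (($code_point >> 6) & 0x3F)) .
        chr(0x80 | ($code_point & 0x3F)); // three bytes
}

/** Converts one WinAnsiEncoding byte (assumed CP1252) to UTF-8. */
function winAnsiToUtf8(string $cur_char): string
{
    // Partial table (assumption): a few of the 0x80-0x9F bytes that
    // differ from Latin-1. Bytes not listed fall through unchanged.
    static $special = [
        0x80 => 0x20AC, // euro sign
        0x85 => 0x2026, // horizontal ellipsis
        0x91 => 0x2018, // left single quotation mark
        0x92 => 0x2019, // right single quotation mark
        0x93 => 0x201C, // left double quotation mark
        0x94 => 0x201D, // right double quotation mark
    ];
    $byte = ord($cur_char);
    return codePointToUtf8($special[$byte] ?? $byte);
}
```

Calling `winAnsiToUtf8("\xE9")` yields the two-byte UTF-8 sequence for "é". A complete converter would need the full table for each supported encoding.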
createThumb()
Used to create a thumbnail file in a thumb folder from a PDF file, provided the ImageMagick command convert exists.
public
static createThumb(string $folder, string $thumb_folder, string $file_name[, int $width = CTHUMB_DIM ][, int $height = CTHUMB_DIM ]) : mixed
For this method to do anything the constant IMAGE_MAGICK must be set to the path to the "convert" command.
Parameters
- $folder : string
-
folder with the PDF in it
- $thumb_folder : string
-
folder in which to generate the thumbnail
- $file_name : string
-
name of the PDF file in $folder
- $width : int = CTHUMB_DIM
-
width in pixels of the thumbnail
- $height : int = CTHUMB_DIM
-
height in pixels of the thumbnail
Return values
mixed
dom()
Return a document object based on a string containing the contents of a web page
public
static dom(string $page) : object
Parameters
- $page : string
-
a web page
Return values
object —document object
extractHttpHttpsUrls()
Tries to extract http or https links from a string of text.
public
static extractHttpHttpsUrls(string $page) : array<string|int, mixed>
It does this using a very approximate regular expression.
Parameters
- $page : string
-
text string of a document
Return values
array<string|int, mixed> —a set of http or https links that were extracted from the document
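To give a sense of what an approximate regular-expression approach can look like, here is a minimal sketch. The pattern and the function name `extractUrlsSketch()` are assumptions made for illustration; they are not taken from PdfProcessor.

```php
<?php
// Illustrative sketch only; the pattern is an assumption, not the
// regular expression used by extractHttpHttpsUrls().
function extractUrlsSketch(string $page): array
{
    // http:// or https:// followed by a run of characters that are
    // not whitespace, quotes, or angle brackets.
    preg_match_all('~https?://[^\s"\'<>]+~i', $page, $matches);
    $urls = [];
    foreach ($matches[0] as $url) {
        // Trailing punctuation usually belongs to the surrounding
        // sentence rather than to the link.
        $url = rtrim($url, '.,;:!?)');
        $urls[$url] = true; // array keys remove duplicates
    }
    return array_keys($urls);
}
```

A pattern this loose will accept some strings that are not valid URLs and will miss links broken across lines, which is the usual trade-off of an approximate extractor.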
getBetweenTags()
Gets the text between two tags in a document starting at the current position.
public
static getBetweenTags(string $string, int $cur_pos, string $start_tag, string $end_tag) : array<string|int, mixed>
Parameters
- $string : string
-
document to extract text from
- $cur_pos : int
-
current position from which to look for text to extract
- $start_tag : string
-
starting tag that we want to extract after
- $end_tag : string
-
ending tag that we want to extract until
Return values
array<string|int, mixed> —pair consisting of the position in the document just after the end tag, together with the data between the two tags
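The (position, data) return convention can be sketched as follows. This is a stand-alone illustration under assumptions, not the library code: `betweenTagsSketch()` is a hypothetical name, and returning the string length with empty data when a tag is missing is a choice made here.

```php
<?php
// Illustrative sketch only; NOT the PdfProcessor::getBetweenTags() code.
function betweenTagsSketch(string $string, int $cur_pos,
    string $start_tag, string $end_tag): array
{
    $len = strlen($string);
    if ($cur_pos < 0 || $cur_pos > $len) {
        return [$len, ""]; // position outside the document
    }
    $start = strpos($string, $start_tag, $cur_pos);
    if ($start === false) {
        return [$len, ""]; // assumption: no start tag means nothing left
    }
    $start += strlen($start_tag); // skip past the start tag itself
    $end = strpos($string, $end_tag, $start);
    if ($end === false) {
        // assumption: an unterminated tag yields the rest of the document
        return [$len, substr($string, $start)];
    }
    // new position is just after the end tag
    return [$end + strlen($end_tag), substr($string, $start, $end - $start)];
}
```

Because the first element of the pair is the position after the end tag, a caller can feed it back in as `$cur_pos` to walk through successive tag pairs in a loop.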
getEncodingTitle()
Returns the first encoding format information found in the PDF document
public
static getEncodingTitle(string $pdf_string) : array<string|int, mixed>
Parameters
- $pdf_string : string
-
a string representing the PDF document
Return values
array<string|int, mixed> —[encoding, title] which of the default (if any) PDF encoding formats is being used: MacRomanEncoding, WinAnsiEncoding, PDFDocEncoding, etc as well as a title for the document if found
getNextObject()
Gets the text between an obj and endobj tag at the current position in a PDF document
public
static getNextObject(string $pdf_string, int $cur_pos) : string
Parameters
- $pdf_string : string
-
a string of a PDF document
- $cur_pos : int
-
an integer position in that string
Return values
string —the contents of the PDF object located at $cur_pos
getObjectDictionary()
Gets the object dictionary portion of the current PDF object
public
static getObjectDictionary(string $object_string) : string
Parameters
- $object_string : string
-
represents the contents of a PDF object
Return values
string —the object dictionary for the object
getObjectStream()
Gets the object stream portion of the current PDF object
public
static getObjectStream(string $object_string) : string
Parameters
- $object_string : string
-
represents the contents of a PDF object
Return values
string —the object stream for the object
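The relationship between an object's dictionary and its stream can be sketched as below. This is a simplified illustration, not the PdfProcessor code: `objectPartsSketch()` is a hypothetical name, and the sketch assumes the keyword `stream` does not occur inside the dictionary itself.

```php
<?php
// Illustrative sketch only; NOT getObjectDictionary()/getObjectStream().
// Splits the contents of a PDF object into [dictionary, stream].
function objectPartsSketch(string $object_string): array
{
    // Assumption: the first "stream" keyword ends the dictionary.
    $stream_pos = strpos($object_string, "stream");
    if ($stream_pos === false) {
        return [trim($object_string), ""]; // object without a stream
    }
    $dictionary = trim(substr($object_string, 0, $stream_pos));
    $body_start = $stream_pos + strlen("stream");
    $end_pos = strpos($object_string, "endstream", $body_start);
    $body = ($end_pos === false) ?
        substr($object_string, $body_start) :
        substr($object_string, $body_start, $end_pos - $body_start);
    // Drop the single end-of-line marker after "stream" and the one
    // before "endstream"; the rest may be binary and is left untouched.
    $body = preg_replace('/^\r?\n/', '', $body);
    $body = preg_replace('/\r?\n\z/', '', $body);
    return [$dictionary, $body];
}
```

In real PDF files the stream is often compressed (for example with a FlateDecode filter named in the dictionary), so the returned body generally needs decoding before any text can be read from it.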
getText()
Gets the text out of a PDF document
public
static getText(string $pdf_string, $url[, string $encoding = "" ]) : string
Parameters
- $pdf_string : string
-
a string representing the PDF document
- $url :
-
the url where the page contents came from, used to canonicalize relative links
- $encoding : string = ""
-
which of the default (if any) PDF encoding formats is being used: MacRomanEncoding, WinAnsiEncoding, PDFDocEncoding, etc.
Return values
string —text extracted from the document
handle()
Method used to handle processing data for a web page. It makes a summary for the page (via the process() function which should be subclassed) as well as runs any plugins that are associated with the processors to create sub-documents
public
handle(string $page, string $url) : array<string|int, mixed>
Parameters
- $page : string
-
string of a web document
- $url : string
-
location the document came from
Return values
array<string|int, mixed> —a summary (title, description, links, and content) of the information in $page; it also has a subdocs array containing any subdocuments returned from a plugin. Subdocuments might be things like recipes or tweets that appeared in a page.
initializeIndexedFileTypes()
Gets processors for different file types. Constructing them will populate the self::$indexed_file_types, self::$image_types, and self::$mime_processor arrays
public
static initializeIndexedFileTypes() : mixed
Return values
mixed
objectDictionaryHas()
Checks if the PDF object's object dictionary is in a list of types
public
static objectDictionaryHas(string $object_dictionary, array<string|int, mixed> $type_array) : bool
Parameters
- $object_dictionary : string
-
the object dictionary to check
- $type_array : array<string|int, mixed>
-
the list of types to check against
Return values
bool —whether it is in the list or not
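A check of this kind might look like the following sketch. It assumes, purely for illustration, that types are passed without the leading slash used by PDF names and that a plain substring match is good enough; `dictionaryHasSketch()` is a hypothetical name, not part of PdfProcessor.

```php
<?php
// Illustrative sketch only; NOT the objectDictionaryHas() code.
function dictionaryHasSketch(string $object_dictionary,
    array $type_array): bool
{
    foreach ($type_array as $type) {
        // PDF names are written with a leading slash, e.g. /Font.
        // Assumption: $type_array holds the names without that slash.
        if (strpos($object_dictionary, "/" . $type) !== false) {
            return true;
        }
    }
    return false;
}
```

A substring test like this also matches longer names that merely begin with the given type, so it trades precision for simplicity.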
parseBrackets()
Extracts text up to the next closing bracket
public
static parseBrackets(string $data, int $cur_pos[, string $encoding = "" ]) : array<string|int, mixed>
Parameters
- $data : string
-
source to extract character data from
- $cur_pos : int
-
position to start in $data
- $encoding : string = ""
-
which of the default (if any) PDF encoding formats is being used: MacRomanEncoding, WinAnsiEncoding, PDFDocEncoding, etc.
Return values
array<string|int, mixed> —pair consisting of the final position in $data as well as extracted text
parseParentheses()
Extracts ASCII text up to the next closing parenthesis
public
static parseParentheses(string $data, int $cur_pos, string $encoding) : array<string|int, mixed>
Parameters
- $data : string
-
source to extract character data from
- $cur_pos : int
-
position to start in $data
- $encoding : string
-
which of the default (if any) PDF encoding formats is being used: MacRomanEncoding, WinAnsiEncoding, PDFDocEncoding, etc.
Return values
array<string|int, mixed> —pair consisting of the final position in $data as well as extracted text
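For context, PDF literal strings are enclosed in parentheses, may contain balanced nested parentheses, and use backslash escapes. The sketch below reads one such string and returns the same kind of (position, text) pair. It is an assumption-laden illustration rather than the PdfProcessor code: `literalStringSketch()` is a hypothetical name, `$cur_pos` is assumed to point just past the opening parenthesis, and octal escapes and encoding conversion are left out.

```php
<?php
// Illustrative sketch only; NOT the PdfProcessor::parseParentheses() code.
function literalStringSketch(string $data, int $cur_pos): array
{
    // Single-character escapes; octal escapes (\ddd) are not handled.
    $escapes = ["n" => "\n", "r" => "\r", "t" => "\t",
        "(" => "(", ")" => ")", "\\" => "\\"];
    $out = "";
    $depth = 1; // assumption: $cur_pos is just past the opening "("
    $len = strlen($data);
    while ($cur_pos < $len) {
        $char = $data[$cur_pos];
        $cur_pos++;
        if ($char === "\\" && $cur_pos < $len) {
            // Escaped character: translate it and skip over it.
            $next = $data[$cur_pos];
            $cur_pos++;
            $out .= $escapes[$next] ?? $next;
            continue;
        }
        if ($char === "(") {
            $depth++; // nested, unescaped parenthesis
        } elseif ($char === ")") {
            $depth--;
            if ($depth === 0) {
                break; // matching close parenthesis reached
            }
        }
        $out .= $char;
    }
    return [$cur_pos, $out];
}
```

The returned position is the index just after the closing parenthesis, so the caller can continue scanning the content stream from there.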
parseText()
Extracts text from PDF data, getting rid of non-printable data, square brackets, and parentheses, and converting char codes to their values.
public
static parseText(string $data[, string $encoding = "" ]) : string
Parameters
- $data : string
-
source to extract character data from
- $encoding : string = ""
-
which of the default (if any) PDF encoding formats is being used: MacRomanEncoding, WinAnsiEncoding, PDFDocEncoding, etc.
Return values
string —extracted text
process()
Used to extract the title, description and links from a string consisting of PDF data.
public
process(string $page, string $url) : array<string|int, mixed>
Parameters
- $page : string
-
a string consisting of web-page contents
- $url : string
-
the url where the page contents came from, used to canonicalize relative links
Return values
array<string|int, mixed> —a summary of the contents of the page