Yioop_V9.5_Source_Code_Documentation

PdfProcessor extends TextProcessor
in package

Used to create crawl summary information for PDF files

Tags
author

Chris Pollett

Table of Contents

$image_types  : array<string|int, mixed>
Array of file types which should be considered images.
$indexed_file_types  : array<string|int, mixed>
Array of file extensions which can be handled by the search engine, other extensions will be ignored.
$max_description_len  : int
Max number of chars to extract for description from a page to index.
$max_links_to_extract  : int
Maximum number of urls to extract from a single document
$mime_processor  : array<string|int, mixed>
Associative array of mime_type => (page processor name that can process that type). Sub-classes add to this array with the types they handle.
$plugin_instances  : array<string|int, mixed>
indexing_plugins which might be used with the current processor
$summarizer  : object
Stores the summarizer object used by this instance of page processor to be used in generating a summary
$summarizer_option  : string
Stores the name of the summarizer used for crawling.
$text_data  : bool
Whether the current processor is for text data (i.e., text, html, xml, etc) or for some other format (gif, png, etc)
__construct()  : mixed
Sets up any indexing plugins associated with this page processor
calculateLang()  : string
Tries to determine the language of the document by looking at the provided $sample_text and $url
closeDanglingTags()  : mixed
If the end of the file is reached before closing tags are seen, this method closes these tags in the correct order.
convertChar()  : string
Used to convert characters from one of the built in PDF encodings to UTF-8
createThumb()  : mixed
Used to create a thumbnail file in a thumb folder from a PDF file, provided the ImageMagick command convert exists.
dom()  : object
Return a document object based on a string containing the contents of a web page
extractHttpHttpsUrls()  : array<string|int, mixed>
Tries to extract http or https links from a string of text.
getBetweenTags()  : array<string|int, mixed>
Gets the text between two tags in a document starting at the current position.
getEncodingTitle()  : array<string|int, mixed>
Returns the first encoding format information found in the PDF document
getNextObject()  : string
Gets the text between an obj and endobj tag at the current position in a PDF document
getObjectDictionary()  : string
Gets the object dictionary portion of the current PDF object
getObjectStream()  : string
Gets the object stream portion of the current PDF object
getText()  : string
Gets the text out of a PDF document
handle()  : array<string|int, mixed>
Method used to handle processing data for a web page. It makes a summary for the page (via the process() function which should be subclassed) as well as runs any plugins that are associated with the processors to create sub-documents
initializeIndexedFileTypes()  : mixed
Gets processors for different file types. Constructing them populates the self::$indexed_file_types, self::$image_types, and self::$mime_processor arrays
objectDictionaryHas()  : bool
Checks if the PDF object's object dictionary is in a list of types
parseBrackets()  : array<string|int, mixed>
Extracts text until the next closing bracket
parseParentheses()  : array<string|int, mixed>
Extracts ASCII text until the next closing parenthesis
parseText()  : string
Extracts text from PDF data, getting rid of non-printable data, square brackets, and parentheses, and converting char codes to their values.
process()  : array
Used to extract the title, description and links from a string consisting of PDF data.

Properties

$image_types

Array of file types which should be considered images.

public static array<string|int, mixed> $image_types = []

Sub-classes add to this array with the types they handle

$indexed_file_types

Array of file extensions which can be handled by the search engine, other extensions will be ignored.

public static array<string|int, mixed> $indexed_file_types = ["unknown"]

Sub-classes add to this array with the types they handle

$max_description_len

Max number of chars to extract for description from a page to index.

public static int $max_description_len

Only words in the description are indexed.

$max_links_to_extract

Maximum number of urls to extract from a single document

public static int $max_links_to_extract

$mime_processor

Associative array of mime_type => (page processor name that can process that type). Sub-classes add to this array with the types they handle.

public static array<string|int, mixed> $mime_processor = []

$plugin_instances

indexing_plugins which might be used with the current processor

public array<string|int, mixed> $plugin_instances

$summarizer

Stores the summarizer object used by this instance of page processor to be used in generating a summary

public object $summarizer

$summarizer_option

Stores the name of the summarizer used for crawling.

public string $summarizer_option

Possible values are self::BASIC_SUMMARIZER, self::GRAPH_BASED_SUMMARIZER, self::CENTROID_SUMMARIZER and self::CENTROID_WEIGHTED_SUMMARIZER

$text_data

Whether the current processor is for text data (i.e., text, html, xml, etc) or for some other format (gif, png, etc)

public bool $text_data

Methods

__construct()

Sets up any indexing plugins associated with this page processor

public __construct([array<string|int, mixed> $plugins = [] ][, int $max_description_len = null ][, int $max_links_to_extract = null ][, string $summarizer_option = self::BASIC_SUMMARIZER ]) : mixed
Parameters
$plugins : array<string|int, mixed> = []

an array of indexing plugins which might do further processing on the data handled by this page processor

$max_description_len : int = null

maximal length of a page summary

$max_links_to_extract : int = null

maximum number of links to extract from a single document

$summarizer_option : string = self::BASIC_SUMMARIZER

crawl constant specifying what kind of summarizer to use: self::BASIC_SUMMARIZER, self::GRAPH_BASED_SUMMARIZER, self::CENTROID_SUMMARIZER, or self::CENTROID_WEIGHTED_SUMMARIZER

Return values
mixed

calculateLang()

Tries to determine the language of the document by looking at the provided $sample_text and $url

public static calculateLang([string $sample_text = null ][, string $url = null ]) : string
Parameters
$sample_text : string = null

sample text to try to guess the language from

$url : string = null

url of the web page; as a fallback, the country of the url is looked at to figure out the language

Return values
string

language tag for guessed language

closeDanglingTags()

If the end of the file is reached before closing tags are seen, this method closes these tags in the correct order.

public static closeDanglingTags(string &$page) : mixed
Parameters
$page : string

a reference to an xml or html document

Return values
mixed
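
The description above suggests a stack of open tags that is unwound at the end of the document. The following is a minimal sketch of that idea, not Yioop's implementation: the function name sketchCloseDanglingTags, the list of void tags, and the loose tag regular expression are all assumptions made for illustration.

```php
<?php
// Hypothetical sketch: appends closing tags for elements that are still
// open when the document ends. Not Yioop's actual code.
function sketchCloseDanglingTags(string &$page): void
{
    // Assumed list of tags that never take a closing tag.
    $void_tags = ["br", "hr", "img", "input", "meta", "link"];
    // Loosely match opening, closing, and self-closing tags in order.
    preg_match_all('~<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*?(/?)>~', $page,
        $tags, PREG_SET_ORDER);
    $open = [];
    foreach ($tags as $tag) {
        $name = strtolower($tag[2]);
        if ($tag[3] === "/" || in_array($name, $void_tags, true)) {
            continue; // self-closing or void: nothing to track
        }
        if ($tag[1] === "/") {
            // A closing tag pops the stack only if it matches the top.
            if (end($open) === $name) {
                array_pop($open);
            }
            continue;
        }
        $open[] = $name; // remember the opening tag
    }
    // Close whatever is left, innermost tag first.
    while ($open) {
        $page .= "</" . array_pop($open) . ">";
    }
}

$page = "<html><body><p>Unfinished <b>text";
sketchCloseDanglingTags($page);
echo $page, "\n";
// prints <html><body><p>Unfinished <b>text</b></p></body></html>
```

Because $page is passed by reference, the caller's string is modified in place, which matches the documented signature.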

convertChar()

Used to convert characters from one of the built in PDF encodings to UTF-8

public static convertChar(char $cur_char, string $encoding) : string
Parameters
$cur_char : char

character to convert

$encoding : string

which of the default (if any) PDF encoding formats is being used: MacRomanEncoding, WinAnsiEncoding, PDFDocEncoding, etc.

Return values
string

resulting converted string for character

createThumb()

Used to create a thumbnail file in a thumb folder from a PDF file, provided the ImageMagick command convert exists.

public static createThumb(string $folder, string $thumb_folder, string $file_name[, int $width = CTHUMB_DIM ][, int $height = CTHUMB_DIM ]) : mixed

For this method to do anything the constant IMAGE_MAGICK must be set to the path to the "convert" command.

Parameters
$folder : string

folder with the pdf in it

$thumb_folder : string

folder to generate the thumb in

$file_name : string

name of the pdf file in $folder

$width : int = CTHUMB_DIM

width in pixels of thumb

$height : int = CTHUMB_DIM

height in pixels of thumb

Return values
mixed

dom()

Return a document object based on a string containing the contents of a web page

public static dom(string $page) : object
Parameters
$page : string

a web page

Return values
object

document object

extractHttpHttpsUrls()

Tries to extract http or https links from a string of text.

public static extractHttpHttpsUrls(string $page) : array<string|int, mixed>

Does this by a very approximate regular expression.

Parameters
$page : string

text string of a document

Return values
array<string|int, mixed>

a set of http or https links that were extracted from the document
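
To illustrate what a "very approximate regular expression" for links can look like, here is a minimal sketch. The function name sketchExtractHttpUrls and the pattern itself are assumptions for illustration; the pattern Yioop actually uses may differ.

```php
<?php
// Hypothetical helper, not Yioop's code: pulls http/https links out of
// plain text using a deliberately loose pattern.
function sketchExtractHttpUrls(string $text): array
{
    // Match "http://" or "https://" followed by characters that are not
    // whitespace, quotes, angle brackets, or parentheses.
    preg_match_all('~https?://[^\s"\'<>()]+~i', $text, $matches);
    $links = [];
    foreach ($matches[0] as $url) {
        // Trim trailing punctuation that usually belongs to the sentence.
        $url = rtrim($url, '.,;:!?');
        $links[$url] = true; // array keys de-duplicate the links
    }
    return array_keys($links);
}

$text = "See http://example.com/a.pdf, or (https://example.org/b?x=1).";
print_r(sketchExtractHttpUrls($text));
```

A loose pattern like this trades precision for speed: it can pick up malformed urls, which is usually acceptable when the links are validated later.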

getBetweenTags()

Gets the text between two tags in a document starting at the current position.

public static getBetweenTags(string $string, int $cur_pos, string $start_tag, string $end_tag) : array<string|int, mixed>
Parameters
$string : string

document to extract text from

$cur_pos : int

current location to start looking for text to extract

$start_tag : string

starting tag that we want to extract after

$end_tag : string

ending tag that we want to extract until

Return values
array<string|int, mixed>

pair consisting of the position in the document just after the end tag, together with the data between the two tags
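
The (position, data) pair lets a caller scan a document tag by tag. A minimal sketch of this contract follows, assuming a plain strpos-based search; the name sketchBetweenTags and the behavior when a tag is missing are assumptions, not Yioop's implementation.

```php
<?php
// Hypothetical sketch of a "text between two tags" helper.
// Returns [position just after the end tag, text between the tags].
function sketchBetweenTags(string $string, int $cur_pos,
    string $start_tag, string $end_tag): array
{
    $len = strlen($string);
    $cur_pos = max(0, min($cur_pos, $len)); // keep the offset in range
    $start = strpos($string, $start_tag, $cur_pos);
    if ($start === false) {
        return [$len, ""]; // assumption: no start tag means nothing left
    }
    $start += strlen($start_tag);
    $end = strpos($string, $end_tag, $start);
    if ($end === false) {
        // Assumption: with no end tag, take everything to the end.
        return [$len, substr($string, $start)];
    }
    return [$end + strlen($end_tag), substr($string, $start, $end - $start)];
}

[$pos, $data] = sketchBetweenTags("a <b>bold</b> c", 0, "<b>", "</b>");
echo $pos, " ", $data, "\n"; // prints 13 bold
```

Passing the returned position back in as $cur_pos continues the scan after the tag pair just consumed.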

getEncodingTitle()

Returns the first encoding format information found in the PDF document

public static getEncodingTitle(string $pdf_string) : array<string|int, mixed>
Parameters
$pdf_string : string

a string representing the PDF document

Return values
array<string|int, mixed>

[encoding, title] which of the default (if any) PDF encoding formats is being used: MacRomanEncoding, WinAnsiEncoding, PDFDocEncoding, etc as well as a title for the document if found

getNextObject()

Gets the text between an obj and endobj tag at the current position in a PDF document

public static getNextObject(string $pdf_string, int $cur_pos) : string
Parameters
$pdf_string : string

a string of a PDF document

$cur_pos : int

an integer position in that string

Return values
string

the contents of the PDF object located at $cur_pos
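
A PDF body is a sequence of "N G obj ... endobj" blocks, so walking a document amounts to repeatedly extracting the next block. The sketch below shows one way to do that on a toy PDF string. The name sketchNextObject is hypothetical, and, as an assumption made so the caller can advance, it returns a (next position, contents) pair rather than the bare string documented above.

```php
<?php
// Hypothetical sketch: finds the next "obj ... endobj" block at or after
// $cur_pos. Returns [position after endobj, contents], or
// [length of input, null] when no further object exists.
function sketchNextObject(string $pdf_string, int $cur_pos): array
{
    // \b keeps "obj" from matching inside "endobj".
    $found = preg_match('/\bobj\b(.*?)\bendobj\b/s', $pdf_string, $match,
        PREG_OFFSET_CAPTURE, $cur_pos);
    if (!$found) {
        return [strlen($pdf_string), null];
    }
    $next_pos = $match[0][1] + strlen($match[0][0]);
    return [$next_pos, trim($match[1][0])];
}

// Toy document with two objects (not a complete, valid PDF file).
$pdf = "1 0 obj\n<< /Type /Catalog >>\nendobj\n" .
    "2 0 obj\n<< /Length 5 >>\nstream\nHello\nendstream\nendobj\n";
$pos = 0;
$objects = [];
while (true) {
    [$pos, $object] = sketchNextObject($pdf, $pos);
    if ($object === null) {
        break;
    }
    $objects[] = $object;
}
echo count($objects), "\n"; // prints 2
```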

getObjectDictionary()

Gets the object dictionary portion of the current PDF object

public static getObjectDictionary(string $object_string) : string
Parameters
$object_string : string

represents the contents of a PDF object

Return values
string

the object dictionary for the object

getObjectStream()

Gets the object stream portion of the current PDF object

public static getObjectStream(string $object_string) : string
Parameters
$object_string : string

represents the contents of a PDF object

Return values
string

the object stream for the object
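
A PDF object typically consists of a dictionary delimited by << and >>, optionally followed by data between the stream and endstream keywords. The following sketch separates the two parts. The names sketchObjectDictionary and sketchObjectStream are hypothetical, and the sketch assumes the delimiters do not also occur inside string literals or stream data.

```php
<?php
// Hypothetical sketch: returns the text inside the outermost << ... >>
// of a PDF object, tracking nesting depth for nested dictionaries.
function sketchObjectDictionary(string $object_string): string
{
    $start = strpos($object_string, "<<");
    if ($start === false) {
        return "";
    }
    $depth = 0;
    $len = strlen($object_string);
    for ($i = $start; $i < $len - 1; $i++) {
        $pair = substr($object_string, $i, 2);
        if ($pair === "<<") {
            $depth++;
            $i++; // skip the second character of the delimiter
        } elseif ($pair === ">>") {
            $depth--;
            $i++;
            if ($depth === 0) {
                // $i now indexes the last ">" of the closing delimiter.
                return trim(substr($object_string, $start + 2,
                    $i - 1 - ($start + 2)));
            }
        }
    }
    return ""; // unbalanced dictionary
}

// Hypothetical sketch: returns the raw bytes between stream and endstream.
function sketchObjectStream(string $object_string): string
{
    if (!preg_match('/stream\r?\n(.*?)\r?\n?endstream/s',
        $object_string, $match)) {
        return "";
    }
    return $match[1];
}

$object = "<< /Length 5 >>\nstream\nHello\nendstream";
echo sketchObjectDictionary($object), "\n"; // prints /Length 5
echo sketchObjectStream($object), "\n";     // prints Hello
```

In real PDFs the stream is usually compressed (for example with /FlateDecode), so the returned bytes would still need decoding before any text could be read from them.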

getText()

Gets the text out of a PDF document

public static getText(string $pdf_string,  $url[, string $encoding = "" ]) : string
Parameters
$pdf_string : string

a string representing the PDF document

$url :

the url where the page contents came from, used to canonicalize relative links

$encoding : string = ""

which of the default (if any) PDF encoding formats is being used: MacRomanEncoding, WinAnsiEncoding, PDFDocEncoding, etc.

Return values
string

text extracted from the document

handle()

Method used to handle processing data for a web page. It makes a summary for the page (via the process() function which should be subclassed) as well as runs any plugins that are associated with the processors to create sub-documents

public handle(string $page, string $url) : array<string|int, mixed>
Parameters
$page : string

string of a web document

$url : string

location the document came from

Return values
array<string|int, mixed>

a summary (title, description, links, and content) of the information in $page. It also has a subdocs array containing any subdocuments returned from a plugin. Subdocuments might be things like recipes or tweets that appeared in a page.

initializeIndexedFileTypes()

Gets processors for different file types. Constructing them populates the self::$indexed_file_types, self::$image_types, and self::$mime_processor arrays

public static initializeIndexedFileTypes() : mixed
Return values
mixed

objectDictionaryHas()

Checks if the PDF object's object dictionary is in a list of types

public static objectDictionaryHas(string $object_dictionary, array<string|int, mixed> $type_array) : bool
Parameters
$object_dictionary : string

the object dictionary to check

$type_array : array<string|int, mixed>

the list of types to check against

Return values
bool

whether it is in or not
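
One plausible reading of this check is a substring test of each type name against the dictionary text. The sketch below works under that assumption; the name sketchDictionaryHas and the example type names are illustrative and not taken from Yioop's code.

```php
<?php
// Hypothetical sketch: true if any of the given type names occurs in the
// dictionary text. Assumes a plain substring match is sufficient.
function sketchDictionaryHas(string $object_dictionary,
    array $type_array): bool
{
    foreach ($type_array as $type) {
        if (strpos($object_dictionary, $type) !== false) {
            return true;
        }
    }
    return false;
}

// Example: skip objects that hold fonts or images when looking for text.
$dictionary = "/Type /Font /Subtype /Type1";
var_dump(sketchDictionaryHas($dictionary, ["/Image", "/Font"])); // bool(true)
```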

parseBrackets()

Extracts text until the next closing bracket

public static parseBrackets(string $data, int $cur_pos[, string $encoding = "" ]) : array<string|int, mixed>
Parameters
$data : string

source to extract character data from

$cur_pos : int

position to start in $data

$encoding : string = ""

which of the default (if any) PDF encoding formats is being used: MacRomanEncoding, WinAnsiEncoding, PDFDocEncoding, etc.

Return values
array<string|int, mixed>

pair consisting of the final position in $data as well as extracted text

parseParentheses()

Extracts ASCII text until the next closing parenthesis

public static parseParentheses(string $data, int $cur_pos, string $encoding) : array<string|int, mixed>
Parameters
$data : string

source to extract character data from

$cur_pos : int

position to start in $data

$encoding : string

which of the default (if any) PDF encoding formats is being used: MacRomanEncoding, WinAnsiEncoding, PDFDocEncoding, etc.

Return values
array<string|int, mixed>

pair consisting of the final position in $data as well as extracted text
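
In PDF content streams, literal strings are written in parentheses, with a backslash escaping special characters. The sketch below reads such a string starting just after the opening parenthesis. It is a simplified illustration: the name sketchParseParentheses is hypothetical, encoding conversion is omitted, and octal escapes and unescaped nested parentheses are not handled.

```php
<?php
// Hypothetical sketch: reads a PDF literal string. $cur_pos must point
// just after the opening "(". Returns [position after ")", text].
function sketchParseParentheses(string $data, int $cur_pos): array
{
    $out = "";
    $len = strlen($data);
    // A few backslash escapes; any other escaped char is kept literally.
    $escapes = ["n" => "\n", "r" => "\r", "t" => "\t"];
    while ($cur_pos < $len) {
        $char = $data[$cur_pos];
        if ($char === "\\" && $cur_pos + 1 < $len) {
            $next = $data[$cur_pos + 1];
            $out .= $escapes[$next] ?? $next;
            $cur_pos += 2;
            continue;
        }
        if ($char === ")") {
            $cur_pos++; // step past the closing parenthesis
            break;
        }
        $out .= $char;
        $cur_pos++;
    }
    return [$cur_pos, $out];
}

$data = 'BT (Hi \(there\)) Tj ET';
[$pos, $text] = sketchParseParentheses($data, 4);
echo $pos, " ", $text, "\n"; // prints 17 Hi (there)
```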

parseText()

Extracts text from PDF data, getting rid of non-printable data, square brackets, and parentheses, and converting char codes to their values.

public static parseText(string $data[, string $encoding = "" ]) : string
Parameters
$data : string

source to extract character data from

$encoding : string = ""

which of the default (if any) PDF encoding formats is being used: MacRomanEncoding, WinAnsiEncoding, PDFDocEncoding, etc.

Return values
string

extracted text

process()

Used to extract the title, description and links from a string consisting of PDF data.

public process(string $page, string $url) : array
Parameters
$page : string

a string consisting of web-page contents

$url : string

the url where the page contents came from, used to canonicalize relative links

Return values
array

a summary of the contents of the page


        
