Yioop_V9.5_Source_Code

PdfProcessor extends TextProcessor
in package

Application

Used to create crawl summary information for PDF files

$image_types

Array of file types that should be considered images.


    public static array<string|int, mixed> $image_types = []

Sub-classes add to this array with the types they handle

$indexed_file_types

Array of file extensions that the search engine can handle; other extensions are ignored.


    public static array<string|int, mixed> $indexed_file_types = ["unknown"]

Sub-classes add to this array with the types they handle

$max_description_len

Max number of chars to extract for description from a page to index.


    public static int $max_description_len

Only words in the description are indexed.

$max_links_to_extract

Maximum number of urls to extract from a single document


    public static int $max_links_to_extract

$mime_processor

Associative array of mime_type => (page processor name that can process that type). Sub-classes add to this array with the types they handle.


    public static array<string|int, mixed> $mime_processor = []

$plugin_instances

indexing_plugins which might be used with the current processor


    public array<string|int, mixed> $plugin_instances

$summarizer

Stores the summarizer object used by this instance of page processor to be used in generating a summary


    public object $summarizer

$summarizer_option

Stores the name of the summarizer used for crawling.


    public string $summarizer_option

Possible values are self::BASIC_SUMMARIZER, self::GRAPH_BASED_SUMMARIZER, self::CENTROID_SUMMARIZER, and self::CENTROID_WEIGHTED_SUMMARIZER

$text_data

Whether the current processor is for text data (i.e., text, html, xml, etc) or for some other format (gif, png, etc)


    public bool $text_data

__construct()

Sets up any indexing plugins associated with this page processor


    public __construct([array<string|int, mixed> $plugins = [] ][, int $max_description_len = null ][, int $max_links_to_extract = null ][, string $summarizer_option = self::BASIC_SUMMARIZER ]) : mixed

Parameters

$plugins : array<string|int, mixed> = []: an array of indexing plugins which might do further processing on the data handled by this page processor
$max_description_len : int = null: maximal length of a page summary
$max_links_to_extract : int = null: maximum number of links to extract from a single document
$summarizer_option : string = self::BASIC_SUMMARIZER: CRAWL_CONSTANT specifying which kind of summarizer to use: self::BASIC_SUMMARIZER, self::GRAPH_BASED_SUMMARIZER, self::CENTROID_SUMMARIZER, or self::CENTROID_WEIGHTED_SUMMARIZER

Return values

mixed —

calculateLang()

Tries to determine the language of the document by looking at the provided $sample_text and $url


    public static calculateLang([string $sample_text = null ][, string $url = null ]) : string

Parameters

$sample_text : string = null: sample text to try to guess the language from
$url : string = null: url of the web page; as a fallback, the country in the url is used to figure out the language

Return values

string —

language tag for guessed language

closeDanglingTags()

If the end of the file is reached before closing tags are seen, this method closes these tags in the correct order.


    public static closeDanglingTags(string &$page) : mixed

Parameters

$page : string: a reference to an xml or html document

Return values

mixed —

convertChar()

Used to convert characters from one of the built in PDF encodings to UTF-8


    public static convertChar(char $cur_char, string $encoding) : string

Parameters

$cur_char : char: character to convert
$encoding : string: which of the default (if any) PDF encoding formats is being used: MacRomanEncoding, WinAnsiEncoding, PDFDocEncoding, etc.

Return values

string —

resulting converted string for the character

createThumb()

Used to create a thumbnail file in a thumb folder from a PDF file, provided the ImageMagick command convert exists.


    public static createThumb(string $folder, string $thumb_folder, string $file_name[, int $width = CTHUMB_DIM ][, int $height = CTHUMB_DIM ]) : mixed

For this method to do anything the constant IMAGE_MAGICK must be set to the path to the "convert" command.

Parameters

$folder : string: folder with the pdf in it
$thumb_folder : string: folder in which to generate the thumb
$file_name : string: name of the pdf file in $folder
$width : int = CTHUMB_DIM: width in pixels of thumb
$height : int = CTHUMB_DIM: height in pixels of thumb

Return values

mixed —

dom()

Return a document object based on a string containing the contents of a web page


    public static dom(string $page) : object

Parameters

$page : string: a web page

Return values

object —

document object

extractHttpHttpsUrls()

Tries to extract http or https links from a string of text.


    public static extractHttpHttpsUrls(string $page) : array<string|int, mixed>

Does this by a very approximate regular expression.

Parameters

$page : string: text string of a document

Return values

array<string|int, mixed> —

a set of http or https links that were extracted from the document
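The reference does not show the regular expression that is used. As a rough illustration only, the following sketch shows one way such an approximate extraction could work. The function name `sketchExtractHttpHttpsUrls` and the pattern are assumptions made for this example and are not taken from the Yioop source.

```php
<?php
// Hypothetical sketch: NOT Yioop's actual implementation or pattern.
function sketchExtractHttpHttpsUrls(string $page): array
{
    // Match http:// or https:// followed by a run of characters that are
    // not whitespace, quotes, or angle brackets.
    $pattern = '~https?://[^\s"\'<>]+~i';
    preg_match_all($pattern, $page, $matches);
    // Strip trailing punctuation that usually belongs to the surrounding
    // sentence rather than to the link.
    $urls = array_map(
        fn(string $url): string => rtrim($url, '.,;:!?)'),
        $matches[0]
    );
    // Drop duplicates while keeping first-seen order.
    return array_values(array_unique($urls));
}

$text = 'See https://www.example.com/a.pdf, or (http://example.org/b).';
print_r(sketchExtractHttpHttpsUrls($text));
```

A pattern this loose can both miss valid links and accept malformed ones, which is in keeping with the "very approximate" description above.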

getBetweenTags()

Gets the text between two tags in a document starting at the current position.


    public static getBetweenTags(string $string, int $cur_pos, string $start_tag, string $end_tag) : array<string|int, mixed>

Parameters

$string : string: document to extract text from
$cur_pos : int: current location from which to look for text to extract
$start_tag : string: starting tag that we want to extract after
$end_tag : string: ending tag that we want to extract until

Return values

array<string|int, mixed> —

pair consisting of the position in the document just after the end tag, together with the data between the two tags
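To make the returned pair concrete, here is a minimal sketch of how a position-plus-data extraction of this kind might be written. The name `sketchGetBetweenTags` is hypothetical, and the behaviour when a tag is missing is an assumption of this example, not something the reference specifies.

```php
<?php
// Hypothetical sketch; the real method may treat missing tags differently.
function sketchGetBetweenTags(
    string $string,
    int $cur_pos,
    string $start_tag,
    string $end_tag
): array {
    $len = strlen($string);
    // Clamp the offset so strpos() never receives an out-of-range value.
    $cur_pos = max(0, min($cur_pos, $len));
    $start = strpos($string, $start_tag, $cur_pos);
    if ($start === false) {
        // Assumption: no start tag means nothing is left to extract.
        return [$len, ""];
    }
    $data_start = $start + strlen($start_tag);
    $end = strpos($string, $end_tag, $data_start);
    if ($end === false) {
        // Assumption: an unterminated region runs to the end of the input.
        return [$len, substr($string, $data_start)];
    }
    $data = substr($string, $data_start, $end - $data_start);
    return [$end + strlen($end_tag), $data];
}

[$pos, $data] = sketchGetBetweenTags("a<b>cd</b>e", 0, "<b>", "</b>");
echo $pos, " ", $data, "\n";
```

Returning the new position alongside the data lets a caller invoke the function repeatedly to walk through a document one tagged region at a time.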

getEncodingTitle()

Returns the first encoding format information found in the PDF document


    public static getEncodingTitle(string $pdf_string) : array<string|int, mixed>

Parameters

$pdf_string : string: a string representing the PDF document

Return values

array<string|int, mixed> —

[encoding, title]: which of the default (if any) PDF encoding formats is being used (MacRomanEncoding, WinAnsiEncoding, PDFDocEncoding, etc.), as well as a title for the document if found
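As an illustration of the [encoding, title] pair, the sketch below scans a PDF string for the first named encoding and the first simple /Title entry. The function name `sketchGetEncodingTitle` and both patterns are assumptions for this example. It ignores titles that contain parentheses or that are stored as hex or UTF-16 strings, and it is not Yioop's implementation.

```php
<?php
// Hypothetical sketch; not Yioop's actual implementation.
function sketchGetEncodingTitle(string $pdf_string): array
{
    $encoding = "";
    $title = "";
    // First of the named encodings that appears anywhere in the input.
    $names = 'MacRomanEncoding|WinAnsiEncoding|PDFDocEncoding';
    if (preg_match('~/(' . $names . ')\b~', $pdf_string, $match)) {
        $encoding = $match[1];
    }
    // First /Title entry written as a literal string without parentheses
    // inside it.
    if (preg_match('~/Title\s*\(([^()]*)\)~', $pdf_string, $match)) {
        $title = $match[1];
    }
    return [$encoding, $title];
}

$pdf = '<< /Title (My Doc) >> << /Encoding /WinAnsiEncoding >>';
print_r(sketchGetEncodingTitle($pdf));
```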

getNextObject()

Gets the contents between an obj and endobj tag at the current position in a PDF document


    public static getNextObject(string $pdf_string, int $cur_pos) : string

Parameters

$pdf_string : string: a string of a PDF document
$cur_pos : int: an integer position in that string

Return values

string —

the contents of the PDF object located at $cur_pos

getObjectDictionary()

Gets the object dictionary portion of the current PDF object


    public static getObjectDictionary(string $object_string) : string

Parameters

$object_string : string: represents the contents of a PDF object

Return values

string —

the object dictionary for the object

getObjectStream()

Gets the object stream portion of the current PDF object


    public static getObjectStream(string $object_string) : string

Parameters

$object_string : string: represents the contents of a PDF object

Return values

string —

the object stream for the object
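A PDF object usually consists of a dictionary delimited by `<<` and `>>`, optionally followed by data between the `stream` and `endstream` keywords. The following sketch shows one simple way to separate the two parts. The names `sketchObjectDictionary` and `sketchObjectStream` are hypothetical, and the handling of edge cases is an assumption of this example, not a description of Yioop's code.

```php
<?php
// Hypothetical helpers; Yioop's own parsing may differ.
function sketchObjectDictionary(string $object_string): string
{
    // Only inspect the part before any stream data, since binary stream
    // contents could themselves contain ">>".
    $stream_pos = strpos($object_string, "stream");
    $head = ($stream_pos === false)
        ? $object_string
        : substr($object_string, 0, $stream_pos);
    $start = strpos($head, "<<");
    $end = strrpos($head, ">>");
    if ($start === false || $end === false || $end < $start) {
        return "";
    }
    // Text strictly between the outermost "<<" and ">>".
    return substr($head, $start + 2, $end - $start - 2);
}

function sketchObjectStream(string $object_string): string
{
    $start = strpos($object_string, "stream");
    if ($start === false) {
        return "";
    }
    $start += strlen("stream");
    $end = strpos($object_string, "endstream", $start);
    if ($end === false) {
        return "";
    }
    // Remove the line breaks that surround the stream data.
    return trim(substr($object_string, $start, $end - $start), "\r\n");
}

$object = "<< /Length 5 >>\nstream\nHELLO\nendstream";
var_dump(sketchObjectDictionary($object), sketchObjectStream($object));
```

Trimming line breaks is only safe for text-like data; for compressed streams, a careful parser would rely on the /Length entry instead.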

getText()

Gets the text out of a PDF document


    public static getText(string $pdf_string, $url[, string $encoding = "" ]) : string

Parameters

$pdf_string : string: a string representing the PDF document
$url: the url where the page contents came from, used to canonicalize relative links
$encoding : string = "": which of the default (if any) PDF encoding formats is being used: MacRomanEncoding, WinAnsiEncoding, PDFDocEncoding, etc.

Return values

string —

text extracted from the document

handle()

Method used to handle processing data for a web page. It makes a summary for the page (via the process() function which should be subclassed) as well as runs any plugins that are associated with the processors to create sub-documents


    public handle(string $page, string $url) : array<string|int, mixed>

Parameters

$page : string: string of a web document
$url : string: location the document came from

Return values

array<string|int, mixed> —

a summary (title, description, links, and content) of the information in $page. It also has a subdocs array containing any subdocuments returned from a plugin. Subdocuments might be things like recipes that appeared in a page, tweets, etc.

initializeIndexedFileTypes()

Gets processors for different file types. Constructing them populates the self::$indexed_file_types, self::$image_types, and self::$mime_processor arrays


    public static initializeIndexedFileTypes() : mixed

Return values

mixed —

objectDictionaryHas()

Checks if the PDF object's object dictionary is in a list of types


    public static objectDictionaryHas(string $object_dictionary, array<string|int, mixed> $type_array) : bool

Parameters

$object_dictionary : string: the object dictionary to check
$type_array : array<string|int, mixed>: the list of types to check against

Return values

bool —

whether it is in the list or not
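One plausible reading of this check is a substring test of the dictionary text against each listed type name. The sketch below follows that reading. The function name `sketchObjectDictionaryHas` and the substring approach are assumptions, since the reference does not say how the match is performed.

```php
<?php
// Hypothetical sketch: assumes a plain substring match per type name.
function sketchObjectDictionaryHas(
    string $object_dictionary,
    array $type_array
): bool {
    foreach ($type_array as $type) {
        // strpos() returns false only when the name is absent; a match
        // at offset 0 is still a match, hence the strict comparison.
        if (strpos($object_dictionary, (string) $type) !== false) {
            return true;
        }
    }
    return false;
}

$dictionary = " /Length 5 /Filter /FlateDecode ";
var_dump(sketchObjectDictionaryHas($dictionary, ["FlateDecode", "LZWDecode"]));
```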

parseBrackets()

Extracts text until the next closing bracket


    public static parseBrackets(string $data, int $cur_pos[, string $encoding = "" ]) : array<string|int, mixed>

Parameters

$data : string: source to extract character data from
$cur_pos : int: position to start in $data
$encoding : string = "": which of the default (if any) PDF encoding formats is being used: MacRomanEncoding, WinAnsiEncoding, PDFDocEncoding, etc.

Return values

array<string|int, mixed> —

pair consisting of the final position in $data as well as extracted text

parseParentheses()

Extracts ASCII text until the next closing parenthesis


    public static parseParentheses(string $data, int $cur_pos, string $encoding) : array<string|int, mixed>

Parameters

$data : string: source to extract character data from
$cur_pos : int: position to start in $data
$encoding : string: which of the default (if any) PDF encoding formats is being used: MacRomanEncoding, WinAnsiEncoding, PDFDocEncoding, etc.

Return values

array<string|int, mixed> —

pair consisting of the final position in $data as well as extracted text
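In PDF content streams, literal strings are written in parentheses, with backslash escapes and optionally balanced nested parentheses. The sketch below shows how text up to the matching closing parenthesis could be collected. The name `sketchParseParentheses` is hypothetical; the example assumes $cur_pos points just past the opening parenthesis, leaves out the $encoding parameter, and does not handle octal escapes, so it is an illustration and not Yioop's implementation.

```php
<?php
// Hypothetical sketch; omits the $encoding handling of the real method.
function sketchParseParentheses(string $data, int $cur_pos): array
{
    $len = strlen($data);
    $cur_pos = max(0, $cur_pos);
    $out = "";
    $depth = 0; // unescaped "(" seen inside the string and not yet closed
    $escapes = [
        "n" => "\n", "r" => "\r", "t" => "\t", "b" => "\x08", "f" => "\f",
    ];
    while ($cur_pos < $len) {
        $c = $data[$cur_pos];
        if ($c === "\\" && $cur_pos + 1 < $len) {
            $next = $data[$cur_pos + 1];
            // Known escapes become control characters; anything else,
            // such as \( \) or \\, stands for itself.
            $out .= $escapes[$next] ?? $next;
            $cur_pos += 2;
            continue;
        }
        if ($c === "(") {
            $depth++;
        } elseif ($c === ")") {
            if ($depth === 0) {
                // Matching close found: report the position after it.
                return [$cur_pos + 1, $out];
            }
            $depth--;
        }
        $out .= $c;
        $cur_pos++;
    }
    // Assumption: unterminated input yields whatever was collected.
    return [$len, $out];
}

[$pos, $text] = sketchParseParentheses('Hello \(PDF\) (x)) Tj', 0);
echo $pos, " ", $text, "\n";
```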

parseText()

Extracts text from PDF data, getting rid of non-printable data, square brackets, and parentheses, and converting char codes to their values.


    public static parseText(string $data[, string $encoding = "" ]) : string

Parameters

$data : string: source to extract character data from
$encoding : string = "": which of the default (if any) PDF encoding formats is being used: MacRomanEncoding, WinAnsiEncoding, PDFDocEncoding, etc.

Return values

string —

extracted text

process()

Used to extract the title, description and links from a string consisting of PDF data.


    public process(string $page, string $url) : array<string|int, mixed>

Parameters

$page : string: a string consisting of web-page contents
$url : string: the url where the page contents came from, used to canonicalize relative links

Return values

array<string|int, mixed> —

a summary of the contents of the page
