Yioop_V9.5_Source_Code

CompressedProcessor extends PageProcessor
in package

Application

Used to create crawl summary information for a gz compressed file whose uncompressed form has a processor we index.

$image_types

Array of file types which should be considered images.


    public
    static    array<string|int, mixed>
    $image_types
     = []

Sub-classes add to this array with the types they handle

$indexed_file_types

Array of file extensions which can be handled by the search engine; other extensions will be ignored.


    public
    static    array<string|int, mixed>
    $indexed_file_types
     = ["unknown"]

Sub-classes add to this array with the types they handle

$max_description_len

Max number of chars to extract for description from a page to index.


    public
    static    int
    $max_description_len

Only words in the description are indexed.

$max_links_to_extract

Maximum number of urls to extract from a single document


    public
    static    int
    $max_links_to_extract

$mime_processor

Associative array of mime_type => (page processor name that can process that type). Sub-classes add to this array with the types they handle.


    public
    static    array<string|int, mixed>
    $mime_processor
     = []

$plugin_instances

Indexing plugins which might be used with the current processor


    public
        array<string|int, mixed>
    $plugin_instances

$summarizer

Stores the summarizer object used by this instance of page processor to be used in generating a summary


    public
        object
    $summarizer

$summarizer_option

Stores the name of the summarizer used for crawling.


    public
        string
    $summarizer_option

Possible values are self::BASIC_SUMMARIZER, self::GRAPH_BASED_SUMMARIZER, self::CENTROID_SUMMARIZER and self::CENTROID_WEIGHTED_SUMMARIZER

$text_data

Whether the current processor is for text data (text, html, xml, etc.) or for some other format (gif, png, etc.)


    public
        bool
    $text_data

__construct()

Sets up any indexing plugins associated with this page processor


    public
                    __construct([array<string|int, mixed> $plugins = [] ][, int $max_description_len = null ][, int $max_links_to_extract = null ][, string $summarizer_option = self::BASIC_SUMMARIZER ]) : mixed

Parameters

$plugins : array<string|int, mixed> = []: an array of indexing plugins which might do further processing on the data handled by this page processor
$max_description_len : int = null: maximal length of a page summary
$max_links_to_extract : int = null: maximum number of links to extract from a single document
$summarizer_option : string = self::BASIC_SUMMARIZER: CRAWL_CONSTANT specifying what kind of summarizer to use: self::BASIC_SUMMARIZER, self::GRAPH_BASED_SUMMARIZER, self::CENTROID_SUMMARIZER or self::CENTROID_WEIGHTED_SUMMARIZER

Return values

mixed —

dom()

Return a document object based on a string containing the contents of an XML page


    public
            static        dom(string $page) : object

Parameters

$page : string: a web page

Return values

object —

document object
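As a rough illustration of what a dom()-style helper might do, the sketch below parses an XML string with PHP's built-in DOMDocument. The function name xmlStringToDom and the error-handling choices are assumptions made for this example; they are not taken from the Yioop source.

```php
<?php
/**
 * Hypothetical stand-in for a dom()-style helper: builds a DOMDocument
 * from a string containing XML. Not the Yioop implementation.
 */
function xmlStringToDom(string $page): DOMDocument
{
    $dom = new DOMDocument();
    // Collect libxml parse errors internally instead of emitting warnings
    $previous = libxml_use_internal_errors(true);
    $dom->loadXML($page);
    libxml_clear_errors();
    // Restore whatever error-handling mode was active before
    libxml_use_internal_errors($previous);
    return $dom;
}

$dom = xmlStringToDom("<feed><title>Example</title></feed>");
$title = $dom->getElementsByTagName("title")->item(0)->textContent;
echo $title, "\n";
```

Suppressing libxml errors is a design choice suited to crawled pages, which are often malformed; a caller that needs diagnostics could read libxml_get_errors() before clearing them.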

handle()

Method used to handle processing data for a web page. It makes a summary for the page (via the process() method, which subclasses should override) and runs any plugins associated with the processor to create sub-documents.


    public
                    handle(string $page, string $url) : array<string|int, mixed>

Parameters

$page : string: string of a web document
$url : string: location the document came from

Return values

array<string|int, mixed> —

a summary (title, description, links, and content) of the information in $page. It also has a subdocs array containing any subdocuments returned from a plugin. Subdocuments might be things like recipes that appeared in a page, tweets, etc.
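The flow described for handle() can be sketched roughly as follows: summarize the page with process(), then let each plugin contribute subdocuments. Everything here is hypothetical: the SketchPlugin interface, the pageProcessing method name, the RecipeSketchPlugin class, and the lowercase array keys are assumptions for illustration, and the real class may use different plugin hooks and summary field names.

```php
<?php
/** Hypothetical plugin interface; the real plugin API may differ. */
interface SketchPlugin
{
    /** Returns a list of subdocument summaries found in $page */
    public function pageProcessing(string $page, string $url): array;
}

/** Toy plugin: every line starting with "RECIPE:" becomes a subdocument */
class RecipeSketchPlugin implements SketchPlugin
{
    public function pageProcessing(string $page, string $url): array
    {
        $subdocs = [];
        foreach (explode("\n", $page) as $line) {
            if (strpos($line, "RECIPE:") === 0) {
                // Drop the 7-character "RECIPE:" prefix and trim whitespace
                $subdocs[] = ["title" => trim(substr($line, 7)),
                    "url" => $url];
            }
        }
        return $subdocs;
    }
}

/** Hypothetical processor showing the handle() -> process() relationship */
class SketchProcessor
{
    public $plugin_instances;

    public function __construct(array $plugins = [])
    {
        $this->plugin_instances = $plugins;
    }

    /** Placeholder summary: first line is the title, no links extracted */
    public function process(string $page, string $url): array
    {
        $lines = explode("\n", $page);
        return ["title" => $lines[0], "description" => $page,
            "links" => []];
    }

    /** Summarize the page, then append subdocuments from every plugin */
    public function handle(string $page, string $url): array
    {
        $summary = $this->process($page, $url);
        $summary["subdocs"] = [];
        foreach ($this->plugin_instances as $plugin) {
            foreach ($plugin->pageProcessing($page, $url) as $subdoc) {
                $summary["subdocs"][] = $subdoc;
            }
        }
        return $summary;
    }
}

$processor = new SketchProcessor([new RecipeSketchPlugin()]);
$summary = $processor->handle("My page\nRECIPE: Pancakes\nother text",
    "http://example.com/");
```

Keeping process() as the only method a subclass overrides means the plugin loop in handle() is shared by every processor type.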

initializeIndexedFileTypes()

Get processors for different file types. Constructing them will populate the self::$indexed_file_types, self::$image_types, and self::$mime_processor arrays.


    public
            static        initializeIndexedFileTypes() : mixed

Return values

mixed —
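The registration-by-construction idea can be illustrated with a minimal sketch. All class names below (SketchBaseProcessor, SketchGzProcessor, SketchPngProcessor) and the specific extensions and mime types they register are assumptions chosen for the example, not values read from the Yioop source.

```php
<?php
/** Hypothetical base class holding the shared static registries */
class SketchBaseProcessor
{
    public static $indexed_file_types = ["unknown"];
    public static $image_types = [];
    public static $mime_processor = [];
}

/** Hypothetical subclass that registers a compressed file type */
class SketchGzProcessor extends SketchBaseProcessor
{
    public function __construct()
    {
        // Inherited static properties are shared with the base class
        self::$indexed_file_types[] = "gz";
        self::$mime_processor["application/gzip"] = "SketchGzProcessor";
    }
}

/** Hypothetical subclass that registers an image type */
class SketchPngProcessor extends SketchBaseProcessor
{
    public function __construct()
    {
        self::$indexed_file_types[] = "png";
        self::$image_types[] = "png";
        self::$mime_processor["image/png"] = "SketchPngProcessor";
    }
}

/** Constructing each processor once fills the registries as a side effect */
function initializeSketchFileTypes(array $class_names): void
{
    foreach ($class_names as $class_name) {
        new $class_name();
    }
}

initializeSketchFileTypes(["SketchGzProcessor", "SketchPngProcessor"]);
print_r(SketchBaseProcessor::$indexed_file_types);
```

One consequence of this pattern is that constructing the same processor twice appends its types twice, so a real initializer would likely construct each processor only once or deduplicate the arrays.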

process()

Used to extract the title, description and links from a string consisting of a compressed file of some known indexed_file_type


    public
                    process(string $page, string $url) : array<string|int, mixed>

Parameters

$page : string: web-page contents
$url : string: the url where the page contents came from, used to canonicalize relative links

Return values

array<string|int, mixed> —

a summary of the contents of the page
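One plausible way to realize this is to decompress the gzip data, infer the inner file type from the URL with its .gz suffix removed, and hand the result to whichever handler covers that type. The sketch below assumes PHP's zlib extension is available; innerExtension, processCompressedSketch, and the $handlers map are hypothetical names, and the actual method may decompress and dispatch differently.

```php
<?php
/**
 * Hypothetical helper: guesses the extension of the file inside a .gz
 * archive from its URL, e.g. ".../notes.txt.gz" gives "txt".
 */
function innerExtension(string $url): string
{
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        $path = "";
    }
    $name = basename($path);
    if (substr($name, -3) === ".gz") {
        $name = substr($name, 0, -3);
    }
    $ext = pathinfo($name, PATHINFO_EXTENSION);
    return $ext === "" ? "unknown" : strtolower($ext);
}

/**
 * Hypothetical process()-style function: decompress, then delegate to a
 * handler keyed by inner extension. Returns null if nothing can handle it.
 */
function processCompressedSketch(string $page, string $url,
    array $handlers): ?array
{
    // gzdecode returns false (with a warning, suppressed here) on bad data
    $inner = @gzdecode($page);
    if ($inner === false) {
        return null;
    }
    $ext = innerExtension($url);
    if (!isset($handlers[$ext])) {
        return null;
    }
    return $handlers[$ext]($inner, $url);
}

// Toy handler map standing in for the per-type page processors
$handlers = [
    "txt" => function (string $text, string $url): array {
        $lines = explode("\n", $text);
        return ["title" => $lines[0], "description" => $text,
            "links" => []];
    },
];
$compressed = gzencode("Hello\nworld");
$summary = processCompressedSketch($compressed,
    "http://example.com/notes.txt.gz", $handlers);
```

Dispatching on the inner extension keeps the compressed-file logic independent of any particular format: supporting a new inner type only requires adding a handler.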
