Yioop_V9.5_Source_Code

PageProcessor
in package Application

implements CrawlConstants

Base class common to all processors of web page data

A processor is used by the crawl portion of Yioop to extract indexable data from a page that might contain tags, binary data, or other content that should not be indexed. Subclasses of PageProcessor stored in APP_DIR/library/processors will be detected by Yioop, so one can add code there to make a custom processor for a new mimetype.

Interfaces, Classes, Traits and Enums

CrawlConstants: Shared constants and enums used by components that are involved in the crawling process

$image_types : array<string|int, mixed>: Array of file types that should be considered images.
$indexed_file_types : array<string|int, mixed>: Array of file extensions which can be handled by the search engine, other extensions will be ignored.
$max_description_len : int: Max number of chars to extract for description from a page to index.
$max_links_to_extract : int: Maximum number of urls to extract from a single document
$mime_processor : array<string|int, mixed>: Associative array of mime_type => (page processor name that can process that type). Sub-classes add to this array with the types they handle.
$plugin_instances : array<string|int, mixed>: indexing_plugins which might be used with the current processor
$summarizer : object: Stores the summarizer object used by this instance of page processor to be used in generating a summary
$summarizer_option : string: Stores the name of the summarizer used for crawling.
$text_data : bool: Whether the current processor is for text data (i.e., text, html, xml, etc) or for some other format (gif, png, etc)
__construct() : mixed: Sets up any indexing plugins associated with this page processor
handle() : array<string|int, mixed>: Method used to handle processing data for a web page. It makes a summary for the page (via the process() function which should be subclassed) as well as runs any plugins that are associated with the processors to create sub-documents
initializeIndexedFileTypes() : mixed: Get processors for different file types. Constructing them will populate the self::$indexed_file_types, self::$image_types, and self::$mime_processor arrays
process() : array<string|int, mixed>: Should be implemented to compute a summary based on a text string of a document. This method is called from handle().

$image_types

Array of file types that should be considered images.


    public static array<string|int, mixed> $image_types = []

Sub-classes add to this array with the types they handle

$indexed_file_types

Array of file extensions which can be handled by the search engine, other extensions will be ignored.


    public static array<string|int, mixed> $indexed_file_types = ["unknown"]

Sub-classes add to this array with the types they handle

$max_description_len

Max number of chars to extract for description from a page to index.


    public static int $max_description_len

Only words in the description are indexed.

$max_links_to_extract

Maximum number of urls to extract from a single document


    public static int $max_links_to_extract

$mime_processor

Associative array of mime_type => (page processor name that can process that type). Sub-classes add to this array with the types they handle.


    public static array<string|int, mixed> $mime_processor = []
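
As a minimal sketch of how a sub-class might add its types to these static arrays when it is constructed, consider the following. SketchPageProcessor is a hypothetical stand-in for the base class, and MarkdownSketchProcessor, the "md" extension, and the "text/markdown" mime type are illustrative assumptions rather than anything shipped with Yioop.

```php
<?php
// Hypothetical stand-in for the documented base class; not Yioop's code.
abstract class SketchPageProcessor
{
    /** @var array file extensions the search engine can handle */
    public static $indexed_file_types = ["unknown"];
    /** @var array file types treated as images */
    public static $image_types = [];
    /** @var array mime_type => name of processor that handles it */
    public static $mime_processor = [];
}

// Hypothetical sub-class; the extension and mime type are assumptions.
class MarkdownSketchProcessor extends SketchPageProcessor
{
    public function __construct()
    {
        // Guard so repeated construction does not add duplicates.
        if (!in_array("md", self::$indexed_file_types)) {
            self::$indexed_file_types[] = "md";
        }
        self::$mime_processor["text/markdown"] = "MarkdownSketchProcessor";
    }
}

// Constructing the processor is what populates the shared arrays.
new MarkdownSketchProcessor();
new MarkdownSketchProcessor();
print_r(SketchPageProcessor::$indexed_file_types);
print_r(SketchPageProcessor::$mime_processor);
```

Because the sub-class does not redeclare the static properties, self:: inside it refers to the same arrays as the base class, so every processor contributes to one shared registry.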

$plugin_instances

indexing_plugins which might be used with the current processor


    public array<string|int, mixed> $plugin_instances

$summarizer

Stores the summarizer object used by this instance of page processor to be used in generating a summary


    public object $summarizer

$summarizer_option

Stores the name of the summarizer used for crawling.


    public string $summarizer_option

Possible values are self::BASIC_SUMMARIZER, self::GRAPH_BASED_SUMMARIZER, self::CENTROID_SUMMARIZER and self::CENTROID_WEIGHTED_SUMMARIZER

$text_data

Whether the current processor is for text data (i.e., text, html, xml, etc) or for some other format (gif, png, etc)


    public bool $text_data

__construct()

Sets up any indexing plugins associated with this page processor


    public __construct([array<string|int, mixed> $plugins = [] ][, int $max_description_len = null ][, mixed $max_links_to_extract = null ][, int $summarizer_option = self::BASIC_SUMMARIZER ]) : mixed

Parameters

$plugins : array<string|int, mixed> = []: an array of indexing plugins which might do further processing on the data handled by this page processor
$max_description_len : int = null: maximal length of a page summary
$max_links_to_extract : mixed = null: maximum number of urls to extract from a single document
$summarizer_option : int = self::BASIC_SUMMARIZER: crawl constant specifying which summarizer to use: self::BASIC_SUMMARIZER, self::GRAPH_BASED_SUMMARIZER, self::CENTROID_SUMMARIZER, or self::CENTROID_WEIGHTED_SUMMARIZER

Return values

mixed —

handle()

Method used to handle processing data for a web page. It makes a summary for the page (via the process() function which should be subclassed) as well as runs any plugins that are associated with the processors to create sub-documents


    public handle(string $page, string $url) : array<string|int, mixed>

Parameters

$page : string: string of a web document
$url : string: location the document came from

Return values

array<string|int, mixed> —

a summary (title, description, links, and content) of the information in $page. It also has a subdocs array containing any subdocuments returned from a plugin. Subdocuments might be things like recipes or tweets that appeared in a page.
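
The division of labour between handle() and process() described above can be sketched roughly as follows. This is not Yioop's implementation: HandleSketchProcessor is a hypothetical stand-in, the plugin method name pageProcessing() is assumed for illustration, and the plain string keys ("title", "description", "links", "subdocs") are placeholders for whatever constants the real summary uses.

```php
<?php
// Hypothetical stand-in showing handle() delegating to process() and plugins.
abstract class HandleSketchProcessor
{
    /** @var array plugin objects that may produce sub-documents */
    public $plugin_instances = [];

    public function __construct($plugins = [])
    {
        $this->plugin_instances = $plugins;
    }

    // Sub-classes compute the page summary here.
    abstract public function process($page, $url);

    public function handle($page, $url)
    {
        $summary = $this->process($page, $url);
        $summary["subdocs"] = [];
        foreach ($this->plugin_instances as $plugin) {
            // pageProcessing() is an assumed plugin method name.
            $subdocs = $plugin->pageProcessing($page, $url);
            if (is_array($subdocs)) {
                $summary["subdocs"] =
                    array_merge($summary["subdocs"], $subdocs);
            }
        }
        return $summary;
    }
}

// Hypothetical processor that treats the whole page as the description.
class PlainSketchProcessor extends HandleSketchProcessor
{
    public function process($page, $url)
    {
        return ["title" => $url, "description" => trim($page),
            "links" => []];
    }
}

// Hypothetical plugin that turns each #hashtag into a sub-document.
class HashtagSketchPlugin
{
    public function pageProcessing($page, $url)
    {
        preg_match_all('/#(\w+)/', $page, $matches);
        $subdocs = [];
        foreach ($matches[1] as $tag) {
            $subdocs[] = ["title" => $tag,
                "description" => "hashtag found in $url"];
        }
        return $subdocs;
    }
}

$processor = new PlainSketchProcessor([new HashtagSketchPlugin()]);
$result = $processor->handle("Dinner notes #soup #bread",
    "https://example.com/notes");
print_r($result);
```

In this sketch, callers only ever invoke handle(); the sub-class supplies process(), and plugins contribute sub-documents without changing the main summary.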

initializeIndexedFileTypes()

Get processors for different file types. Constructing them will populate the self::$indexed_file_types, self::$image_types, and self::$mime_processor arrays


    public static initializeIndexedFileTypes() : mixed

Return values

mixed —

process()

Should be implemented to compute a summary based on a text string of a document. This method is called from handle().


    public abstract process(string $page, string $url) : array<string|int, mixed>

Parameters

$page : string: string of a document
$url : string: location the document came from

Return values

array<string|int, mixed> —

a summary (title, description, links, and content) of the information in $page
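
A sub-class implementation of process() might look something like the sketch below, which treats the first line of a text document as the title and respects the two static limits. ProcessSketchBase and LineSketchProcessor are hypothetical names, the limit values 60 and 2 are arbitrary, the content field is omitted for brevity, and the string keys stand in for whatever constants the real summary array uses.

```php
<?php
// Hypothetical stand-in carrying only the limits process() needs.
abstract class ProcessSketchBase
{
    /** @var int max chars to keep for the description (arbitrary value) */
    public static $max_description_len = 60;
    /** @var int max urls to extract per document (arbitrary value) */
    public static $max_links_to_extract = 2;

    abstract public function process($page, $url);
}

// Hypothetical processor: first line is the title, the rest is the body.
class LineSketchProcessor extends ProcessSketchBase
{
    public function process($page, $url)
    {
        $lines = preg_split('/\R/', trim($page));
        $title = trim($lines[0]);
        $body = trim(implode(" ", array_slice($lines, 1)));
        // Collect absolute http(s) urls, capped at the configured maximum.
        preg_match_all('~https?://[^\s"<>]+~', $body, $matches);
        $urls = array_slice(array_unique($matches[0]), 0,
            self::$max_links_to_extract);
        $links = [];
        foreach ($urls as $link) {
            $links[$link] = $link; // url => anchor text (here, the url)
        }
        return [
            "title" => $title,
            "description" =>
                substr($body, 0, self::$max_description_len),
            "links" => $links,
        ];
    }
}

$page = "Soup log\nSee https://example.com/a and "
    . "https://example.com/b and https://example.com/c";
$summary = (new LineSketchProcessor())->process($page,
    "https://example.com/log");
print_r($summary);
```

With these limits, only the first two urls are kept and the description is cut at 60 characters, which mirrors the roles of $max_links_to_extract and $max_description_len.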
