PageProcessor
in package
implements
CrawlConstants
Base class common to all processors of web page data
A processor is used by the crawl portion of Yioop to extract indexable data from a page that might contains tags/binary data/etc that should not be indexed. Subclasses of PageProcessor stored in APP_DIR/library/processors will be detected by Yioop. So one can add code there if one want to make a custom processor for a new mimetype.
Tags
Interfaces, Classes, Traits and Enums
- CrawlConstants
- Shared constants and enums used by components that are involved in the crawling process
Table of Contents
- $image_types : array<string|int, mixed>
- Array filetypes which should be considered images.
- $indexed_file_types : array<string|int, mixed>
- Array of file extensions which can be handled by the search engine, other extensions will be ignored.
- $max_description_len : int
- Max number of chars to extract for description from a page to index.
- $max_links_to_extract : int
- Maximum number of urls to extract from a single document
- $mime_processor : array<string|int, mixed>
- Associative array of mime_type => (page processor name that can process that type) Sub-classes add to this array with the types they handle
- $plugin_instances : array<string|int, mixed>
- indexing_plugins which might be used with the current processor
- $summarizer : object
- Stores the summarizer object used by this instance of page processor to be used in generating a summary
- $summarizer_option : string
- Stores the name of the summarizer used for crawling.
- $text_data : bool
- Whether the current processor is for text data (i.e., text, html, xml, etc) or for some other format (gif, png, etc)
- __construct() : mixed
- Set-ups the any indexing plugins associated with this page processor
- handle() : array<string|int, mixed>
- Method used to handle processing data for a web page. It makes a summary for the page (via the process() function which should be subclassed) as well as runs any plugins that are associated with the processors to create sub-documents
- initializeIndexedFileTypes() : mixed
- Get processors for different file types. constructing them will populate the self::$indexed_file_types, self::$image_types, and self::$mime_processor arrays
- process() : array<string|int, mixed>
- Should be implemented to compute a summary based on a text string of a document. This method is called from
Properties
$image_types
Array filetypes which should be considered images.
public
static array<string|int, mixed>
$image_types
= []
Sub-classes add to this array with the types they handle
$indexed_file_types
Array of file extensions which can be handled by the search engine, other extensions will be ignored.
public
static array<string|int, mixed>
$indexed_file_types
= ["unknown"]
Sub-classes add to this array with the types they handle
$max_description_len
Max number of chars to extract for description from a page to index.
public
static int
$max_description_len
Only words in the description are indexed.
$max_links_to_extract
Maximum number of urls to extract from a single document
public
static int
$max_links_to_extract
$mime_processor
Associative array of mime_type => (page processor name that can process that type) Sub-classes add to this array with the types they handle
public
static array<string|int, mixed>
$mime_processor
= []
$plugin_instances
indexing_plugins which might be used with the current processor
public
array<string|int, mixed>
$plugin_instances
$summarizer
Stores the summarizer object used by this instance of page processor to be used in generating a summary
public
object
$summarizer
$summarizer_option
Stores the name of the summarizer used for crawling.
public
string
$summarizer_option
Possible values are self::BASIC, self::GRAPH_BASED_SUMMARIZER, self::CENTROID_SUMMARIZER and self::CENTROID_WEIGHTED_SUMMARIZER
$text_data
Whether the current processor is for text data (i.e., text, html, xml, etc) or for some other format (gif, png, etc)
public
bool
$text_data
Methods
__construct()
Set-ups the any indexing plugins associated with this page processor
public
__construct([array<string|int, mixed> $plugins = [] ][, int $max_description_len = null ][, mixed $max_links_to_extract = null ][, int $summarizer_option = self::BASIC_SUMMARIZER ]) : mixed
Parameters
- $plugins : array<string|int, mixed> = []
-
an array of indexing plugins which might do further processing on the data handles by this page processor
- $max_description_len : int = null
-
maximal length of a page summary
- $max_links_to_extract : mixed = null
- $summarizer_option : int = self::BASIC_SUMMARIZER
-
CRAWL_CONSTANT specifying what kind of summarizer to use self::BASIC_SUMMARIZER, self::GRAPH_BASED_SUMMARIZER, self::CENTROID_SUMMARIZER and self::CENTROID_WEIGHTED_SUMMARIZER
Return values
mixed —handle()
Method used to handle processing data for a web page. It makes a summary for the page (via the process() function which should be subclassed) as well as runs any plugins that are associated with the processors to create sub-documents
public
handle(string $page, string $url) : array<string|int, mixed>
Parameters
- $page : string
-
string of a web document
- $url : string
-
location the document came from
Return values
array<string|int, mixed> —a summary of (title, description,links, and content) of the information in $page also has a subdocs array containing any subdocuments returned from a plugin. A subdocuments might be things like recipes that appeared in a page or tweets, etc.
initializeIndexedFileTypes()
Get processors for different file types. constructing them will populate the self::$indexed_file_types, self::$image_types, and self::$mime_processor arrays
public
static initializeIndexedFileTypes() : mixed
Return values
mixed —process()
Should be implemented to compute a summary based on a text string of a document. This method is called from
public
abstract process(string $page, string $url) : array<string|int, mixed>
Parameters
- $page : string
-
string of a document
- $url : string
-
location the document came from
Tags
Return values
array<string|int, mixed> —a summary of (title, description,links, and content) of the information in $page