Yioop V9.5 Source Code Documentation

RobotProcessor extends PageProcessor

Processor class used to extract information from robots.txt files

Tags
author

Chris Pollett

Table of Contents

$image_types  : array<string|int, mixed>
Array of file types which should be considered images.
$indexed_file_types  : array<string|int, mixed>
Array of file extensions which can be handled by the search engine, other extensions will be ignored.
$max_description_len  : int
Max number of chars to extract for description from a page to index.
$max_links_to_extract  : int
Maximum number of urls to extract from a single document
$mime_processor  : array<string|int, mixed>
Associative array of mime_type => (page processor name that can process that type). Sub-classes add to this array with the types they handle
$plugin_instances  : array<string|int, mixed>
Indexing plugins which might be used with the current processor
$summarizer  : object
Stores the summarizer object used by this page processor instance to generate a summary
$summarizer_option  : string
Stores the name of the summarizer used for crawling.
$text_data  : bool
Whether the current processor is for text data (i.e., text, html, xml, etc) or for some other format (gif, png, etc)
__construct()  : mixed
Sets up any indexing plugins associated with this page processor
handle()  : array<string|int, mixed>
Method used to handle processing data for a web page. It makes a summary for the page (via the process() function, which should be overridden by subclasses) and runs any plugins associated with the processor to create sub-documents
initializeIndexedFileTypes()  : mixed
Gets processors for different file types. Constructing them populates the self::$indexed_file_types, self::$image_types, and self::$mime_processor arrays
makeCanonicalRobotPath()  : string
Converts a path in a robots.txt file into a standard form usable by Yioop. For robot paths, foo is treated the same as /foo. A path might contain urlencoded characters; these are all decoded except for %2F, which corresponds to a / (as per http://www.robotstxt.org/norobots-rfc.txt)
process()  : array<string|int, mixed>
Parses the contents of a robots.txt page, extracting allowed and disallowed paths, crawl-delay, and sitemaps. We also extract a list of all user agent strings seen.

Properties

$image_types

Array of file types which should be considered images.

public static array<string|int, mixed> $image_types = []

Sub-classes add to this array with the types they handle

$indexed_file_types

Array of file extensions which can be handled by the search engine, other extensions will be ignored.

public static array<string|int, mixed> $indexed_file_types = ["unknown"]

Sub-classes add to this array with the types they handle

$max_description_len

Max number of chars to extract for description from a page to index.

public static int $max_description_len

Only words in the description are indexed.

$max_links_to_extract

Maximum number of urls to extract from a single document

public static int $max_links_to_extract

$mime_processor

Associative array of mime_type => (page processor name that can process that type). Sub-classes add to this array with the types they handle

public static array<string|int, mixed> $mime_processor = []
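
For illustration, after processors have been constructed this array might contain entries like the following (example values only; the actual set depends on which processor classes get loaded):

// Example $mime_processor entries; mime types and class names
// shown are illustrative, not an authoritative list.
PageProcessor::$mime_processor["text/html"] = "HtmlProcessor";
PageProcessor::$mime_processor["text/plain"] = "TextProcessor";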

$plugin_instances

Indexing plugins which might be used with the current processor

public array<string|int, mixed> $plugin_instances

$summarizer

Stores the summarizer object used by this page processor instance to generate a summary

public object $summarizer

$summarizer_option

Stores the name of the summarizer used for crawling.

public string $summarizer_option

Possible values are self::BASIC, self::GRAPH_BASED_SUMMARIZER, self::CENTROID_SUMMARIZER and self::CENTROID_WEIGHTED_SUMMARIZER

$text_data

Whether the current processor is for text data (i.e., text, html, xml, etc) or for some other format (gif, png, etc)

public bool $text_data

Methods

__construct()

Sets up any indexing plugins associated with this page processor

public __construct([array<string|int, mixed> $plugins = [] ][, int $max_description_len = null ][, int $max_links_to_extract = null ][, string $summarizer_option = self::BASIC_SUMMARIZER ]) : mixed
Parameters
$plugins : array<string|int, mixed> = []

an array of indexing plugins which might do further processing on the data handled by this page processor

$max_description_len : int = null

maximal length of a page summary

$max_links_to_extract : int = null

maximum number of links to extract from a single document

$summarizer_option : string = self::BASIC_SUMMARIZER

CRAWL_CONSTANT specifying what kind of summarizer to use: self::BASIC_SUMMARIZER, self::GRAPH_BASED_SUMMARIZER, self::CENTROID_SUMMARIZER, or self::CENTROID_WEIGHTED_SUMMARIZER

Return values
mixed
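
A hypothetical instantiation (the namespace path and argument values below are assumptions for illustration):

use seekquarry\yioop\library\processors\RobotProcessor;

// no plugins, summaries up to 2000 chars, at most 50 links,
// basic summarizer
$processor = new RobotProcessor([], 2000, 50,
    RobotProcessor::BASIC_SUMMARIZER);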

handle()

Method used to handle processing data for a web page. It makes a summary for the page (via the process() function, which should be overridden by subclasses) and runs any plugins associated with the processor to create sub-documents

public handle(string $page, string $url) : array<string|int, mixed>
Parameters
$page : string

string of a web document

$url : string

location the document came from

Return values
array<string|int, mixed>

a summary of (title, description, links, and content) of the information in $page; it also has a subdocs array containing any sub-documents returned from a plugin. Sub-documents might be things like recipes that appeared in a page, tweets, etc.
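
The control flow this implies can be sketched as below (a simplification, not Yioop's actual code; the plugin hook name pageProcessing() and the SUBDOCS key are assumptions based on the description above):

public function handle($page, $url)
{
    // build the page summary via the subclass-supplied process()
    $summary = $this->process($page, $url);
    if (!empty($summary)) {
        foreach ($this->plugin_instances as $plugin) {
            // a plugin may return sub-documents (recipes, tweets, ...)
            $subdocs = $plugin->pageProcessing($page, $url); // assumed hook
            if (!empty($subdocs)) {
                $summary["SUBDOCS"][] = $subdocs; // key name assumed
            }
        }
    }
    return $summary;
}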

initializeIndexedFileTypes()

Gets processors for different file types. Constructing them populates the self::$indexed_file_types, self::$image_types, and self::$mime_processor arrays

public static initializeIndexedFileTypes() : mixed
Return values
mixed
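
Calling this method once makes lookups against the static registries possible. A usage sketch (the particular values queried are examples):

PageProcessor::initializeIndexedFileTypes();
// each processor constructed inside appends the types it handles
$indexable = in_array("html", PageProcessor::$indexed_file_types);
$is_image = in_array("png", PageProcessor::$image_types);
$handler = PageProcessor::$mime_processor["text/html"] ?? null;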

makeCanonicalRobotPath()

Converts a path in a robots.txt file into a standard form usable by Yioop. For robot paths, foo is treated the same as /foo. A path might contain urlencoded characters; these are all decoded except for %2F, which corresponds to a / (as per http://www.robotstxt.org/norobots-rfc.txt)

public makeCanonicalRobotPath(string $path) : string
Parameters
$path : string

to convert

Return values
string

Yioop canonical path
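
A minimal sketch of this canonicalization, following the rules stated above (illustrative, not the actual Yioop implementation):

function makeCanonicalRobotPath(string $path): string
{
    // foo is treated the same as /foo: ensure a leading slash
    if ($path == "" || $path[0] != "/") {
        $path = "/" . $path;
    }
    // decode all urlencoded characters except %2F, which stands for
    // a literal / per http://www.robotstxt.org/norobots-rfc.txt;
    // shield %2F behind a placeholder before decoding
    $path = str_ireplace("%2F", "\0", $path);
    $path = urldecode($path);
    return str_replace("\0", "%2F", $path);
}

For example, this sketch maps "a%20b%2Fc" to "/a b%2Fc": the space is decoded, the %2F is preserved, and a leading slash is added.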

process()

Parses the contents of a robots.txt page, extracting allowed and disallowed paths, crawl-delay, and sitemaps. We also extract a list of all user agent strings seen.

public process(string $page, string $url) : array<string|int, mixed>
Parameters
$page : string

text string of a document

$url : string

location the document came from; not used by TextProcessor at this point. Some of its subclasses override this method and use $url to produce complete links for relative links within a document

Return values
array<string|int, mixed>

a summary of (title, description, links, and content) of the information in $page
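
A simplified sketch of this parsing (the actual RobotProcessor::process() is more involved, handling wildcard paths, agent-specific records, and Yioop's summary format; the array keys below are illustrative, and the path canonicalization reuses the makeCanonicalRobotPath() sketch above):

function parseRobotsTxt(string $page): array
{
    $info = ["AGENTS" => [], "ALLOWED" => [], "DISALLOWED" => [],
        "CRAWL_DELAY" => 0, "SITEMAPS" => []];
    foreach (preg_split('/\r\n|\r|\n/', $page) as $line) {
        $line = trim(preg_replace('/#.*/', '', $line)); // drop comments
        if (!preg_match('/^([^:]+):(.*)$/', $line, $matches)) {
            continue; // skip blank or malformed lines
        }
        $field = strtolower(trim($matches[1]));
        $value = trim($matches[2]);
        switch ($field) {
            case "user-agent":
                $info["AGENTS"][] = $value;
                break;
            case "allow":
                $info["ALLOWED"][] = makeCanonicalRobotPath($value);
                break;
            case "disallow":
                if ($value != "") { // an empty Disallow forbids nothing
                    $info["DISALLOWED"][] = makeCanonicalRobotPath($value);
                }
                break;
            case "crawl-delay":
                $info["CRAWL_DELAY"] = max($info["CRAWL_DELAY"],
                    intval($value));
                break;
            case "sitemap":
                $info["SITEMAPS"][] = $value;
                break;
        }
    }
    return $info;
}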


        
