CompressedProcessor
extends PageProcessor
Used to create crawl summary information for a gzip-compressed file whose uncompressed form is of a type that one of the indexed page processors can handle.
Table of Contents
- $image_types : array<string|int, mixed>
- Array of file types which should be considered images.
- $indexed_file_types : array<string|int, mixed>
- Array of file extensions which can be handled by the search engine; other extensions will be ignored.
- $max_description_len : int
- Max number of chars to extract for description from a page to index.
- $max_links_to_extract : int
- Maximum number of urls to extract from a single document
- $mime_processor : array<string|int, mixed>
- Associative array of mime_type => (page processor name that can process that type). Sub-classes add to this array with the types they handle.
- $plugin_instances : array<string|int, mixed>
- indexing_plugins which might be used with the current processor
- $summarizer : object
- Stores the summarizer object that this page processor instance uses to generate a summary
- $summarizer_option : string
- Stores the name of the summarizer used for crawling.
- $text_data : bool
- Whether the current processor is for text data (e.g., text, html, xml) or for some other format (e.g., gif, png)
- __construct() : mixed
- Sets up any indexing plugins associated with this page processor
- dom() : object
- Return a document object based on a string containing the contents of an XML page
- handle() : array<string|int, mixed>
- Method used to handle processing data for a web page. It makes a summary for the page (via the process() function, which subclasses should override) and runs any plugins associated with the processor to create sub-documents.
- initializeIndexedFileTypes() : mixed
- Get processors for different file types. Constructing them populates the self::$indexed_file_types, self::$image_types, and self::$mime_processor arrays.
- process() : array<string|int, mixed>
- Used to extract the title, description and links from a string consisting of a compressed file of some known indexed_file_type
Properties
$image_types
Array of file types which should be considered images.
public
static array<string|int, mixed>
$image_types
= []
Sub-classes add to this array with the types they handle
$indexed_file_types
Array of file extensions which can be handled by the search engine; other extensions will be ignored.
public
static array<string|int, mixed>
$indexed_file_types
= ["unknown"]
Sub-classes add to this array with the types they handle
$max_description_len
Max number of chars to extract for description from a page to index.
public
static int
$max_description_len
Only words in the description are indexed.
$max_links_to_extract
Maximum number of urls to extract from a single document
public
static int
$max_links_to_extract
$mime_processor
Associative array of mime_type => (page processor name that can process that type). Sub-classes add to this array with the types they handle.
public
static array<string|int, mixed>
$mime_processor
= []
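The registration pattern described above (sub-classes adding the types they handle to shared static arrays) can be sketched roughly as follows. This is a minimal illustration, not the library's code: SketchProcessor, SketchTextProcessor, and sketchProcessorFor are hypothetical names introduced here, and the real classes may register their types differently.

```php
<?php
// Hypothetical stand-ins for illustration only; not the library's classes.
class SketchProcessor
{
    /** mime_type => name of the processor class that handles that type */
    public static $mime_processor = [];
    /** file extensions that some processor can handle */
    public static $indexed_file_types = ["unknown"];
}

class SketchTextProcessor extends SketchProcessor
{
    public function __construct()
    {
        // A sub-class registers the types it handles in the shared
        // static arrays inherited from the base class.
        self::$mime_processor["text/plain"] = "SketchTextProcessor";
        if (!in_array("txt", self::$indexed_file_types)) {
            self::$indexed_file_types[] = "txt";
        }
    }
}

/** Returns the registered processor name for $mime_type, or null if none. */
function sketchProcessorFor(string $mime_type): ?string
{
    return SketchProcessor::$mime_processor[$mime_type] ?? null;
}

// Constructing the sub-class is what populates the static arrays.
new SketchTextProcessor();
```

Because the static arrays are declared once on the base class and not redeclared in the sub-class, every sub-class writes into the same storage, so a single lookup covers all registered types.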
$plugin_instances
indexing_plugins which might be used with the current processor
public
array<string|int, mixed>
$plugin_instances
$summarizer
Stores the summarizer object that this page processor instance uses to generate a summary
public
object
$summarizer
$summarizer_option
Stores the name of the summarizer used for crawling.
public
string
$summarizer_option
Possible values are self::BASIC_SUMMARIZER, self::GRAPH_BASED_SUMMARIZER, self::CENTROID_SUMMARIZER and self::CENTROID_WEIGHTED_SUMMARIZER
$text_data
Whether the current processor is for text data (e.g., text, html, xml) or for some other format (e.g., gif, png)
public
bool
$text_data
Methods
__construct()
Sets up any indexing plugins associated with this page processor
public
__construct([array<string|int, mixed> $plugins = [] ][, int $max_description_len = null ][, int $max_links_to_extract = null ][, string $summarizer_option = self::BASIC_SUMMARIZER ]) : mixed
Parameters
- $plugins : array<string|int, mixed> = []
-
an array of indexing plugins which might do further processing on the data handled by this page processor
- $max_description_len : int = null
-
maximal length of a page summary
- $max_links_to_extract : int = null
-
maximum number of links to extract from a single document
- $summarizer_option : string = self::BASIC_SUMMARIZER
-
CRAWL_CONSTANT specifying what kind of summarizer to use: self::BASIC_SUMMARIZER, self::GRAPH_BASED_SUMMARIZER, or self::CENTROID_SUMMARIZER
Return values
mixed
dom()
Return a document object based on a string containing the contents of an XML page
public
static dom(string $page) : object
Parameters
- $page : string
-
a web page
Return values
object —document object
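As a rough illustration of turning an XML string into a document object, the sketch below uses PHP's built-in DOMDocument. The function name sketchDom is hypothetical, and whether the library's dom() suppresses parser errors in the same way is an assumption.

```php
<?php
/**
 * Sketch: parse an XML string into a DOMDocument without emitting
 * parser warnings. sketchDom is a hypothetical name, not the
 * library's dom() implementation.
 */
function sketchDom(string $page): DOMDocument
{
    $dom = new DOMDocument();
    // Collect libxml errors internally instead of raising PHP warnings
    $previous = libxml_use_internal_errors(true);
    $dom->loadXML($page);
    libxml_clear_errors();
    // Restore whatever error-handling mode was active before the call
    libxml_use_internal_errors($previous);
    return $dom;
}

$dom = sketchDom("<feed><title>Example</title></feed>");
$title = $dom->getElementsByTagName("title")->item(0)->textContent;
```

Note that in PHP 8, DOMDocument::loadXML() throws a ValueError when given an empty string, so a caller would likely want to check for empty input first.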
handle()
Method used to handle processing data for a web page. It makes a summary for the page (via the process() function, which subclasses should override) and runs any plugins associated with the processor to create sub-documents.
public
handle(string $page, string $url) : array<string|int, mixed>
Parameters
- $page : string
-
string of a web document
- $url : string
-
location the document came from
Return values
array<string|int, mixed> —a summary (title, description, links, and content) of the information in $page. It also has a subdocs array containing any subdocuments returned from a plugin. Subdocuments might be things like recipes or tweets that appeared in the page.
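The flow described for handle() (summarize the page, then let each plugin contribute sub-documents) might look roughly like the sketch below. Everything in it is an assumption made for illustration: the function names, the plugin method extractSubdocs, the "RECIPE:" line convention, and the plain string array keys are all hypothetical and do not reflect the library's actual plugin interface or summary keys.

```php
<?php
// Hypothetical plugin: treats lines starting with "RECIPE:" as sub-documents.
class SketchRecipePlugin
{
    public function extractSubdocs(string $page, string $url): array
    {
        $subdocs = [];
        foreach (preg_split('/\R/', $page) as $line) {
            if (strpos($line, "RECIPE:") === 0) {
                // Drop the 7-character "RECIPE:" prefix to get the title
                $subdocs[] = ["title" => trim(substr($line, 7)), "url" => $url];
            }
        }
        return $subdocs;
    }
}

// Hypothetical stand-in for process(): first line is the title,
// the remaining lines form the description.
function sketchProcess(string $page, string $url): array
{
    $lines = preg_split('/\R/', trim($page));
    return [
        "title" => $lines[0] ?? "",
        "description" => implode(" ", array_slice($lines, 1)),
        "links" => [],
    ];
}

// Sketch of handle(): build the summary, then gather plugin sub-documents.
function sketchHandle(string $page, string $url, array $plugins): array
{
    $summary = sketchProcess($page, $url);
    $summary["subdocs"] = [];
    foreach ($plugins as $plugin) {
        foreach ($plugin->extractSubdocs($page, $url) as $subdoc) {
            $summary["subdocs"][] = $subdoc;
        }
    }
    return $summary;
}

$result = sketchHandle(
    "My Page\nRECIPE: Soup\nmore text",
    "http://example.com/",
    [new SketchRecipePlugin()]
);
```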
initializeIndexedFileTypes()
Get processors for different file types. Constructing them populates the self::$indexed_file_types, self::$image_types, and self::$mime_processor arrays.
public
static initializeIndexedFileTypes() : mixed
Return values
mixed
process()
Used to extract the title, description and links from a string consisting of a compressed file of some known indexed_file_type
public
process(string $page, string $url) : array<string|int, mixed>
Parameters
- $page : string
-
web-page contents
- $url : string
-
the url where the page contents came from, used to canonicalize relative links
Return values
array<string|int, mixed> —a summary of the contents of the page
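One plausible shape for processing a gz compressed page is to decompress it and hand the result to a processor for the inner file type. The sketch below is only an illustration under stated assumptions: sketchProcessCompressed and the $processors map are hypothetical, guessing the inner type from the url's extension is an assumption (the documentation does not say how the class determines it), and gzdecode() requires PHP's zlib extension.

```php
<?php
/**
 * Sketch only: sketchProcessCompressed and $processors are hypothetical
 * names. Returns a summary array, or null if the data cannot be
 * decompressed or no processor is known for the inner type.
 */
function sketchProcessCompressed(string $page, string $url, array $processors): ?array
{
    // gzdecode() returns false (and warns) on invalid gzip data
    $uncompressed = @gzdecode($page);
    if ($uncompressed === false) {
        return null;
    }
    // Assumption: infer the inner type from the url, "notes.txt.gz" -> "txt"
    $path = (string) parse_url($url, PHP_URL_PATH);
    $inner_name = preg_replace('/\.gz$/i', '', $path);
    $type = strtolower(pathinfo($inner_name, PATHINFO_EXTENSION));
    if (!isset($processors[$type])) {
        return null; // no processor registered for the uncompressed type
    }
    return $processors[$type]($uncompressed, $url);
}

// Hypothetical map of inner file type => callable that summarizes it
$processors = [
    "txt" => function (string $text, string $url): array {
        $lines = preg_split('/\R/', trim($text));
        return ["title" => $lines[0], "description" => $text, "links" => []];
    },
];

$summary = sketchProcessCompressed(
    gzencode("Hello\nworld"),
    "http://example.com/notes.txt.gz",
    $processors
);
```

Returning null for undecodable input or an unknown inner type is a choice made for this sketch; the real method may report such cases differently.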