HtmlProcessor
extends TextProcessor
Used to create crawl summary information for HTML files
Table of Contents
- MAX_TITLE_LEN = 100
- Maximum number of characters in a title
- $image_types : array<string|int, mixed>
- Array of file types which should be considered images.
- $indexed_file_types : array<string|int, mixed>
- Array of file extensions which can be handled by the search engine; other extensions will be ignored.
- $max_description_len : int
- Max number of chars to extract for description from a page to index.
- $max_links_to_extract : int
- Maximum number of urls to extract from a single document
- $mime_processor : array<string|int, mixed>
- Associative array of mime_type => (page processor name that can process that type). Sub-classes add to this array with the types they handle.
- $page_options_testing : bool
- Whether we are using this processor in the Page Options activity
- $plugin_instances : array<string|int, mixed>
- Indexing plugins which might be used with the current processor
- $scrapers : array<string|int, mixed>
- An array of scrapers to be used by this HtmlProcessor
- $summarizer : object
- Stores the summarizer object used by this instance of page processor to be used in generating a summary
- $summarizer_option : string
- Stores the name of the summarizer used for crawling.
- $text_data : bool
- Whether the current processor is for text data (i.e., text, html, xml, etc) or for some other format (gif, png, etc)
- __construct() : mixed
- Sets up any indexing plugins associated with this page processor
- calculateLang() : string
- Tries to determine the language of the document by looking at the provided $sample_text and $url
- closeDanglingTags() : mixed
- If the end of the file is reached before open tags have been closed, this method closes these tags in the correct order.
- computeTopLevelLinks() : array<string|int, mixed>
- For a url which consists of just a hostname, computes the top level links within its web page. These links will eventually be displayed underneath the main link in the search results
- createThumb() : mixed
- Used to create a thumbnail file in a thumb folder from an epub, html, or text file, provided the ImageMagick command convert exists and the calibre command ebook-convert exists.
- crudeDescription() : string
- Returns a summary of the body of a web page based on crude regex matching; used as a fallback if DOM parsing did not work.
- crudeTitle() : string
- Returns the title of a web page based on a crude regex match; used as a fallback if DOM parsing did not work.
- dom() : object
- Return a document object based on a string containing the contents of a web page
- domNodeToString() : string
- This returns the text content of a node but with spaces where tags were (unlike just using textContent)
- extractHttpHttpsUrls() : array<string|int, mixed>
- Tries to extract http or https links from a string of text.
- favicon() : string
- Used to compute the favicon url for a web page.
- getBetweenTags() : array<string|int, mixed>
- Gets the text between two tags in a document starting at the current position.
- getMetaRobots() : array<string|int, mixed>
- Gets any NOINDEX, NOFOLLOW, NOARCHIVE, or NONE information from any robots meta tags.
- handle() : array<string|int, mixed>
- Handles processing of the data for a web page. It makes a summary for the page (via the process() method, which subclasses should override) and runs any plugins associated with the processor to create sub-documents.
- initializeIndexedFileTypes() : mixed
- Gets processors for different file types. Constructing them will populate the self::$indexed_file_types, self::$image_types, and self::$mime_processor arrays
- lang() : string
- Determines the language of the html document by looking at the root language attribute. If that fails $sample_text is used to try to guess the language
- links() : array<string|int, mixed>
- Returns up to MAX_LINKS_TO_EXTRACT many links from the supplied dom object where links have been canonicalized according to the supplied $site information.
- location() : mixed
- Extracts the location or refresh url from the meta tags of an html page
- process() : array<string|int, mixed>
- Used to extract the title, description and links from a string consisting of webpage data.
- relCanonical() : mixed
- If a canonical link element (https://en.wikipedia.org/wiki/Canonical_link_element) is in $dom, then this function extracts it
- title() : string
- Returns title of a webpage based on its document object
Constants
MAX_TITLE_LEN
Maximum number of characters in a title
public
mixed
MAX_TITLE_LEN
= 100
Properties
$image_types
Array of file types which should be considered images.
public
static array<string|int, mixed>
$image_types
= []
Sub-classes add to this array with the types they handle
$indexed_file_types
Array of file extensions which can be handled by the search engine; other extensions will be ignored.
public
static array<string|int, mixed>
$indexed_file_types
= ["unknown"]
Sub-classes add to this array with the types they handle
$max_description_len
Max number of chars to extract for description from a page to index.
public
static int
$max_description_len
Only words in the description are indexed.
$max_links_to_extract
Maximum number of urls to extract from a single document
public
static int
$max_links_to_extract
$mime_processor
Associative array of mime_type => (page processor name that can process that type). Sub-classes add to this array with the types they handle.
public
static array<string|int, mixed>
$mime_processor
= []
$page_options_testing
Whether we are using this processor in the Page Options activity
public
static bool
$page_options_testing
= false
$plugin_instances
Indexing plugins which might be used with the current processor
public
array<string|int, mixed>
$plugin_instances
$scrapers
An array of scrapers to be used by this HtmlProcessor
public
array<string|int, mixed>
$scrapers
= []
$summarizer
Stores the summarizer object used by this instance of page processor to be used in generating a summary
public
object
$summarizer
$summarizer_option
Stores the name of the summarizer used for crawling.
public
string
$summarizer_option
Possible values are self::BASIC_SUMMARIZER, self::GRAPH_BASED_SUMMARIZER, self::CENTROID_SUMMARIZER and self::CENTROID_WEIGHTED_SUMMARIZER
$text_data
Whether the current processor is for text data (i.e., text, html, xml, etc) or for some other format (gif, png, etc)
public
bool
$text_data
Methods
__construct()
Sets up any indexing plugins associated with this page processor
public
__construct([array<string|int, mixed> $plugins = [] ][, int $max_description_len = null ][, int $max_links_to_extract = CMAX_LINKS_TO_EXTRACT ][, string $summarizer_option = self::BASIC_SUMMARIZER ]) : mixed
Parameters
- $plugins : array<string|int, mixed> = []
-
an array of indexing plugins which might do further processing on the data handled by this page processor
- $max_description_len : int = null
-
maximal length of a page summary
- $max_links_to_extract : int = CMAX_LINKS_TO_EXTRACT
-
maximum number of links to extract from a single document
- $summarizer_option : string = self::BASIC_SUMMARIZER
-
CRAWL_CONSTANT specifying what kind of summarizer to use: self::BASIC_SUMMARIZER, self::GRAPH_BASED_SUMMARIZER, self::CENTROID_SUMMARIZER, or self::CENTROID_WEIGHTED_SUMMARIZER
Return values
mixed
calculateLang()
Tries to determine the language of the document by looking at the provided $sample_text and $url
public
static calculateLang([string $sample_text = null ][, string $url = null ]) : string
Parameters
- $sample_text : string = null
-
sample text from which to try to guess the language
- $url : string = null
-
url of the web page; as a fallback, the country in the url is used to figure out the language
Return values
string —language tag for guessed language
closeDanglingTags()
If the end of the file is reached before open tags have been closed, this method closes these tags in the correct order.
public
static closeDanglingTags(string &$page) : mixed
Parameters
- $page : string
-
a reference to an xml or html document
Return values
mixed
computeTopLevelLinks()
For a url which consists of just a hostname, computes the top level links within its web page. These links will eventually be displayed underneath the main link in the search results
public
static computeTopLevelLinks(string $url, array<string|int, mixed> $links) : array<string|int, mixed>
Parameters
- $url : string
-
url of the website that is currently being processed
- $links : array<string|int, mixed>
-
associative array of $link_url => $link_text pairs
Return values
array<string|int, mixed> —of important links for the url
createThumb()
Used to create a thumbnail file in a thumb folder from an epub, html, or text file, provided the ImageMagick command convert exists and the calibre command ebook-convert exists.
public
static createThumb(string $folder, string $thumb_folder, string $file_name[, int $width = CTHUMB_DIM ][, int $height = CTHUMB_DIM ]) : mixed
For this method to do anything, the constant IMAGE_MAGICK must be set to the path to the "convert" command and the constant CALIBRE must be set to the path of "ebook-convert". As currently implemented, this is very brute force and slow: it creates a PDF from the file and then extracts the first page to make a thumb.
Parameters
- $folder : string
-
folder with the file in it
- $thumb_folder : string
-
folder in which to generate the thumb
- $file_name : string
-
name of the file in $folder
- $width : int = CTHUMB_DIM
-
width in pixels of thumb
- $height : int = CTHUMB_DIM
-
height in pixels of thumb
Return values
mixed
crudeDescription()
Returns a summary of the body of a web page based on crude regex matching; used as a fallback if DOM parsing did not work.
public
static crudeDescription(string $page) : string
Parameters
- $page : string
-
to extract description from
Return values
string —a summary of the body of the page
crudeTitle()
Returns the title of a web page based on a crude regex match; used as a fallback if DOM parsing did not work.
public
static crudeTitle(string $page) : string
Parameters
- $page : string
-
to extract title from
Return values
string —a title of the page
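As a rough illustration of the regex fallback described above, the following sketch pulls the first title element out of raw markup without building a DOM. The function name sketchCrudeTitle and its $max_len parameter are hypothetical and are not part of HtmlProcessor; the default of 100 mirrors MAX_TITLE_LEN, and the mbstring extension is assumed to be available.

```php
<?php
// Hypothetical helper, not the library's implementation: extract the first
// <title> element from raw markup with a regex when DOM parsing has failed.
function sketchCrudeTitle(string $page, int $max_len = 100): string
{
    // /i ignores case, /s lets "." match newlines inside the title
    if (!preg_match('/<title\b[^>]*>(.*?)<\/title>/is', $page, $matches)) {
        return "";
    }
    // Decode entities, then collapse runs of whitespace to single spaces
    $title = html_entity_decode($matches[1], ENT_QUOTES | ENT_HTML5, 'UTF-8');
    $title = trim(preg_replace('/\s+/u', ' ', $title));
    // Truncate by characters rather than bytes (needs mbstring)
    return mb_substr($title, 0, $max_len);
}
```

A regex like this can be fooled by title-like text inside scripts or comments, which is presumably why it is only a fallback.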
dom()
Return a document object based on a string containing the contents of a web page
public
static dom(string $page) : object
Parameters
- $page : string
-
a web page
Return values
object —document object
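One common way to build a document object from possibly malformed markup uses PHP's DOM extension, as sketched below. The helper name sketchDom is hypothetical, and whether dom() takes these exact steps is an assumption. Note that loadHTML() rejects an empty string, so callers would need to guard against empty pages.

```php
<?php
// Hypothetical helper: build a DOMDocument from possibly malformed HTML.
function sketchDom(string $page): DOMDocument
{
    $dom = new DOMDocument();
    // Collect libxml parse warnings internally instead of emitting them
    $previous = libxml_use_internal_errors(true);
    // LIBXML_NONET asks libxml not to access the network while parsing
    $dom->loadHTML($page, LIBXML_NONET);
    // Discard the collected warnings and restore the earlier setting
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $dom;
}
```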
domNodeToString()
This returns the text content of a node but with spaces where tags were (unlike just using textContent)
public
static domNodeToString(object $node) : string
Parameters
- $node : object
-
a DOMNode
Return values
string —its text content with spaces
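The difference from textContent can be seen in a small recursive walk that joins the text of child nodes with spaces. This is only a sketch of the idea under that assumption; sketchNodeToString is a hypothetical name, and the real method may treat whitespace differently.

```php
<?php
// Hypothetical helper: text of a node with a space wherever a tag boundary was.
function sketchNodeToString(DOMNode $node): string
{
    // Text and CDATA nodes contribute their own value
    if ($node instanceof DOMText) {
        return $node->nodeValue;
    }
    // Comments and other childless nodes contribute nothing
    if (!$node->hasChildNodes()) {
        return "";
    }
    $pieces = [];
    foreach ($node->childNodes as $child) {
        $pieces[] = sketchNodeToString($child);
    }
    // Join children with spaces, then collapse repeated whitespace
    return trim(preg_replace('/\s+/u', ' ', implode(' ', $pieces)));
}
```

For the markup `<div><p>one</p><p>two</p></div>`, textContent of the div gives "onetwo", while this walk gives "one two".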
extractHttpHttpsUrls()
Tries to extract http or https links from a string of text.
public
static extractHttpHttpsUrls(string $page) : array<string|int, mixed>
Does this by a very approximate regular expression.
Parameters
- $page : string
-
text string of a document
Return values
array<string|int, mixed> —a set of http or https links that were extracted from the document
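To give a sense of what a very approximate regular expression might look like here, the sketch below matches a scheme followed by a run of characters that are not whitespace, quotes, or angle brackets. The pattern and the helper name sketchExtractUrls are assumptions made for illustration, not the pattern the library uses.

```php
<?php
// Hypothetical helper: approximate extraction of http/https links from text.
function sketchExtractUrls(string $text): array
{
    // Scheme, then anything up to whitespace, a quote, or an angle bracket
    preg_match_all('/https?:\/\/[^\s"\'<>]+/i', $text, $matches);
    $urls = [];
    foreach ($matches[0] as $url) {
        // Trailing punctuation is usually sentence text, not part of the url
        $urls[] = rtrim($url, '.,;:!?)');
    }
    // Remove duplicates and reindex from 0
    return array_values(array_unique($urls));
}
```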
favicon()
Used to compute the favicon url for a web page.
public
static favicon(object $dom, string $url) : string
Parameters
- $dom : object
-
document object model of the web page trying to compute the favicon url for
- $url : string
-
url of the web page that $dom corresponds to. Used to help compute the favicon url if the link to the icon in $dom is relative, or if no icon link is present and the url must be guessed from the hostname.
Return values
string —url of favicon for web page (empty string if couldn't determine)
getBetweenTags()
Gets the text between two tags in a document starting at the current position.
public
static getBetweenTags(string $string, int $cur_pos, string $start_tag, string $end_tag) : array<string|int, mixed>
Parameters
- $string : string
-
document to extract text from
- $cur_pos : int
-
current position in the document from which to look for text to extract
- $start_tag : string
-
starting tag that we want to extract after
- $end_tag : string
-
ending tag that we want to extract until
Return values
array<string|int, mixed> —pair consisting of the position in the document just after the end tag, together with the data between the two tags
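The (position, data) return shape lends itself to scanning a document in a loop. The following sketch shows one way such a function could work using strpos(); the name sketchGetBetweenTags is hypothetical, and its behaviour when a tag is missing (returning the document length as the position) is an assumption, not documented behaviour.

```php
<?php
// Hypothetical helper: return [position after end tag, text between the tags].
function sketchGetBetweenTags(string $string, int $cur_pos,
    string $start_tag, string $end_tag): array
{
    $len = strlen($string);
    // Guard the offset: strpos() rejects offsets past the end of the string
    $start = ($cur_pos <= $len) ? strpos($string, $start_tag, $cur_pos) : false;
    if ($start === false) {
        return [$len, ""]; // assumption: no start tag means nothing to extract
    }
    $content_start = $start + strlen($start_tag);
    $end = strpos($string, $end_tag, $content_start);
    if ($end === false) {
        // assumption: an unclosed tag runs to the end of the document
        return [$len, substr($string, $content_start)];
    }
    return [$end + strlen($end_tag),
        substr($string, $content_start, $end - $content_start)];
}
```

Feeding the returned position back in as $cur_pos walks through successive tag pairs.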
getMetaRobots()
Gets any NOINDEX, NOFOLLOW, NOARCHIVE, or NONE information from any robots meta tags.
public
static getMetaRobots(object $dom) : array<string|int, mixed>
Parameters
- $dom : object
-
a document object to check the meta tags for
Return values
array<string|int, mixed> —of robot meta instructions
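A minimal sketch of reading robots directives from a document object follows. It assumes the directives live in the content attribute of meta tags named "robots" and are comma separated; the helper name sketchMetaRobots and the exact shape of the returned array are hypothetical.

```php
<?php
// Hypothetical helper: collect robots directives from <meta name="robots">.
function sketchMetaRobots(DOMDocument $dom): array
{
    $known = ["NOINDEX", "NOFOLLOW", "NOARCHIVE", "NONE"];
    $directives = [];
    foreach ($dom->getElementsByTagName("meta") as $meta) {
        // Attribute values are compared case-insensitively
        if (strtolower(trim($meta->getAttribute("name"))) !== "robots") {
            continue;
        }
        foreach (explode(",", $meta->getAttribute("content")) as $part) {
            $part = strtoupper(trim($part));
            if (in_array($part, $known, true)) {
                $directives[] = $part;
            }
        }
    }
    // Remove duplicates and reindex from 0
    return array_values(array_unique($directives));
}
```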
handle()
Handles processing of the data for a web page. It makes a summary for the page (via the process() method, which subclasses should override) and runs any plugins associated with the processor to create sub-documents.
public
handle(string $page, string $url) : array<string|int, mixed>
Parameters
- $page : string
-
string of a web document
- $url : string
-
location the document came from
Return values
array<string|int, mixed> —a summary (title, description, links, and content) of the information in $page. It also has a subdocs array containing any subdocuments returned from a plugin. A subdocument might be something like a recipe that appeared in a page, a tweet, etc.
initializeIndexedFileTypes()
Gets processors for different file types. Constructing them will populate the self::$indexed_file_types, self::$image_types, and self::$mime_processor arrays
public
static initializeIndexedFileTypes() : mixed
Return values
mixed
lang()
Determines the language of the html document by looking at the root language attribute. If that fails $sample_text is used to try to guess the language
public
static lang(object $dom[, string $sample_text = null ][, string $url = null ]) : string
Parameters
- $dom : object
-
a document object to check the language of
- $sample_text : string = null
-
sample text from which to try to guess the language
- $url : string = null
-
url of the web page; as a fallback, the country in the url is used to figure out the language
Return values
string —language tag for guessed language
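The first step described above, looking at the root language attribute, can be sketched as follows. The helper name sketchRootLang is hypothetical, and the sketch deliberately stops there: guessing from $sample_text or $url is left out, so an empty string signals that a fallback would be needed.

```php
<?php
// Hypothetical helper: read the lang attribute of the root element, if any.
function sketchRootLang(DOMDocument $dom): string
{
    $root = $dom->documentElement;
    if ($root === null) {
        return ""; // empty document, nothing to inspect
    }
    // getAttribute() returns "" when the attribute is absent
    return trim($root->getAttribute("lang"));
}
```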
links()
Returns up to MAX_LINKS_TO_EXTRACT many links from the supplied dom object where links have been canonicalized according to the supplied $site information.
public
static links(object $dom, string $site, string $lang) : array<string|int, mixed>
Parameters
- $dom : object
-
a document object with links on it
- $site : string
-
a string containing a url
- $lang : string
-
locale for document
Return values
array<string|int, mixed> —links from the $dom object
location()
Extracts the location or refresh url from the meta tags of an html page
public
static location(object $dom, string $url) : mixed
Parameters
- $dom : object
-
document object version of web page
- $url : string
-
the url where the dom object comes from
Return values
mixed —refresh or location url if found, false otherwise
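For the refresh case, the target url sits inside the content attribute of a meta tag whose http-equiv is "refresh", in a form such as "0; url=https://example.com/next". The sketch below extracts it; sketchMetaRefreshUrl is a hypothetical name, and unlike location() this sketch takes no $url argument and does not resolve relative urls.

```php
<?php
// Hypothetical helper: find the target of a <meta http-equiv="refresh"> tag.
// Returns the url string if found, false otherwise.
function sketchMetaRefreshUrl(DOMDocument $dom)
{
    foreach ($dom->getElementsByTagName("meta") as $meta) {
        $equiv = strtolower(trim($meta->getAttribute("http-equiv")));
        if ($equiv !== "refresh") {
            continue;
        }
        // Content looks like "5; url=https://example.com/next"
        $content = $meta->getAttribute("content");
        if (preg_match('/url\s*=\s*[\'"]?([^\'"\s;]+)/i', $content, $m)) {
            return $m[1];
        }
    }
    return false;
}
```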
process()
Used to extract the title, description and links from a string consisting of webpage data.
public
process(string $page, string $url) : array<string|int, mixed>
Parameters
- $page : string
-
web-page contents
- $url : string
-
the url where the page contents came from, used to canonicalize relative links
Return values
array<string|int, mixed> —a summary of the contents of the page
relCanonical()
If a canonical link element (https://en.wikipedia.org/wiki/Canonical_link_element) is in $dom, then this function extracts it
public
static relCanonical(object $dom, string $url) : mixed
Parameters
- $dom : object
-
document object version of web page
- $url : string
-
the url where the dom object comes from
Return values
mixed —canonical url if found, false otherwise
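A canonical link element has the form `<link rel="canonical" href="...">`. A minimal sketch of extracting it is shown below; sketchRelCanonical is a hypothetical name, and the sketch returns the href as written rather than resolving it against the page url.

```php
<?php
// Hypothetical helper: return the href of the first rel="canonical" link,
// or false if the document has none.
function sketchRelCanonical(DOMDocument $dom)
{
    foreach ($dom->getElementsByTagName("link") as $link) {
        // rel may hold several space-separated tokens
        $rels = preg_split('/\s+/', strtolower(trim($link->getAttribute("rel"))));
        if (in_array("canonical", $rels, true)) {
            $href = trim($link->getAttribute("href"));
            if ($href !== "") {
                return $href;
            }
        }
    }
    return false;
}
```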
title()
Returns title of a webpage based on its document object
public
static title(object $dom) : string
Parameters
- $dom : object
-
a document object to extract a title from.
Return values
string —a title of the page
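As an illustration of title extraction from a document object, the sketch below reads the first title element and truncates it to 100 characters, matching MAX_TITLE_LEN. The fallback to the first h1 when no title is present is an assumption of this sketch, and both sketchTitle and SKETCH_MAX_TITLE_LEN are hypothetical names. The mbstring extension is assumed to be available.

```php
<?php
// Mirrors MAX_TITLE_LEN = 100 from the class constants
const SKETCH_MAX_TITLE_LEN = 100;

// Hypothetical helper: title text from a DOMDocument, capped in length.
function sketchTitle(DOMDocument $dom): string
{
    $text = "";
    // Assumption: try <title> first, then fall back to the first <h1>
    foreach (["title", "h1"] as $tag) {
        $nodes = $dom->getElementsByTagName($tag);
        if ($nodes->length > 0) {
            // Collapse whitespace so multi-line titles become one line
            $text = trim(preg_replace('/\s+/u', ' ',
                $nodes->item(0)->textContent));
        }
        if ($text !== "") {
            break;
        }
    }
    return mb_substr($text, 0, SKETCH_MAX_TITLE_LEN);
}
```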