Yioop_V9.5_Source_Code_Documentation

HtmlProcessor extends TextProcessor
in package

Used to create crawl summary information for HTML files

Tags
author

Chris Pollett

Table of Contents

MAX_TITLE_LEN  = 100
Maximum number of characters in a title
$image_types  : array<string|int, mixed>
Array of file types which should be considered images.
$indexed_file_types  : array<string|int, mixed>
Array of file extensions which can be handled by the search engine; other extensions will be ignored.
$max_description_len  : int
Max number of chars to extract for description from a page to index.
$max_links_to_extract  : int
Maximum number of urls to extract from a single document
$mime_processor  : array<string|int, mixed>
Associative array of mime_type => (page processor name that can process that type). Sub-classes add to this array with the types they handle.
$page_options_testing  : bool
Whether we are using this processor in the Page Options activity
$plugin_instances  : array<string|int, mixed>
indexing_plugins which might be used with the current processor
$scrapers  : array<string|int, mixed>
An array of scrapers to be used by this HtmlProcessor
$summarizer  : object
Stores the summarizer object used by this instance of page processor to be used in generating a summary
$summarizer_option  : string
Stores the name of the summarizer used for crawling.
$text_data  : bool
Whether the current processor is for text data (i.e., text, html, xml, etc) or for some other format (gif, png, etc)
__construct()  : mixed
Sets up any indexing plugins associated with this page processor
calculateLang()  : string
Tries to determine the language of the document by looking at the provided $sample_text and $url
closeDanglingTags()  : mixed
If an end of file is reached before closing tags are seen, this method closes these tags in the correct order.
computeTopLevelLinks()  : array<string|int, mixed>
For a url which consists of just a hostname, computes the top level links within its web page. These links will eventually be displayed underneath the main link in the search results
createThumb()  : mixed
Used to create a thumbnail file in a thumb folder from an epub, html, or text file, provided the ImageMagick command convert exists and the Calibre command ebook-convert exists.
crudeDescription()  : string
Returns summary of body of a web page based on crude regex matching used as a fall back if dom parsing did not work.
crudeTitle()  : string
Returns title of a webpage based on crude regex match, used as a fall back if dom parsing did not work.
dom()  : object
Return a document object based on a string containing the contents of a web page
domNodeToString()  : string
This returns the text content of a node but with spaces where tags were (unlike just using textContent)
extractHttpHttpsUrls()  : array<string|int, mixed>
Tries to extract http or https links from a string of text.
favicon()  : string
Used to compute the favicon url for a web page.
getBetweenTags()  : array<string|int, mixed>
Gets the text between two tags in a document starting at the current position.
getMetaRobots()  : array<string|int, mixed>
Get any NOINDEX, NOFOLLOW, NOARCHIVE, NONE, info out of any robot meta tags.
handle()  : array<string|int, mixed>
Method used to handle processing data for a web page. It makes a summary for the page (via the process() function which should be subclassed) as well as runs any plugins that are associated with the processors to create sub-documents
initializeIndexedFileTypes()  : mixed
Get processors for different file types. Constructing them will populate the self::$indexed_file_types, self::$image_types, and self::$mime_processor arrays
lang()  : string
Determines the language of the html document by looking at the root language attribute. If that fails $sample_text is used to try to guess the language
links()  : array<string|int, mixed>
Returns up to MAX_LINKS_TO_EXTRACT many links from the supplied dom object where links have been canonicalized according to the supplied $site information.
location()  : mixed
Extracts the location of refresh urls from the meta tags of an html page
process()  : array<string|int, mixed>
Used to extract the title, description and links from a string consisting of webpage data.
relCanonical()  : mixed
If a canonical link element (https://en.wikipedia.org/wiki/Canonical_link_element) is in $dom, then this function extracts it
title()  : string
Returns title of a webpage based on its document object

Constants

MAX_TITLE_LEN

Maximum number of characters in a title

public mixed MAX_TITLE_LEN = 100

Properties

$image_types

Array of file types which should be considered images.

public static array<string|int, mixed> $image_types = []

Sub-classes add to this array with the types they handle

$indexed_file_types

Array of file extensions which can be handled by the search engine; other extensions will be ignored.

public static array<string|int, mixed> $indexed_file_types = ["unknown"]

Sub-classes add to this array with the types they handle

$max_description_len

Max number of chars to extract for description from a page to index.

public static int $max_description_len

Only words in the description are indexed.

$max_links_to_extract

Maximum number of urls to extract from a single document

public static int $max_links_to_extract

$mime_processor

Associative array of mime_type => (page processor name that can process that type). Sub-classes add to this array with the types they handle.

public static array<string|int, mixed> $mime_processor = []

$page_options_testing

Whether we are using this processor in the Page Options activity

public static bool $page_options_testing = false

$plugin_instances

indexing_plugins which might be used with the current processor

public array<string|int, mixed> $plugin_instances

$scrapers

An array of scrapers to be used by this HtmlProcessor

public array<string|int, mixed> $scrapers = []

$summarizer

Stores the summarizer object used by this instance of page processor to be used in generating a summary

public object $summarizer

$summarizer_option

Stores the name of the summarizer used for crawling.

public string $summarizer_option

Possible values are self::BASIC_SUMMARIZER, self::GRAPH_BASED_SUMMARIZER, self::CENTROID_SUMMARIZER and self::CENTROID_WEIGHTED_SUMMARIZER

$text_data

Whether the current processor is for text data (i.e., text, html, xml, etc) or for some other format (gif, png, etc)

public bool $text_data

Methods

__construct()

Sets up any indexing plugins associated with this page processor

public __construct([array<string|int, mixed> $plugins = [] ][, int $max_description_len = null ][, int $max_links_to_extract = C\MAX_LINKS_TO_EXTRACT ][, string $summarizer_option = self::BASIC_SUMMARIZER ]) : mixed
Parameters
$plugins : array<string|int, mixed> = []

an array of indexing plugins which might do further processing on the data handled by this page processor

$max_description_len : int = null

maximal length of a page summary

$max_links_to_extract : int = C\MAX_LINKS_TO_EXTRACT

maximum number of links to extract from a single document

$summarizer_option : string = self::BASIC_SUMMARIZER

CRAWL_CONSTANT specifying what kind of summarizer to use: self::BASIC_SUMMARIZER, self::GRAPH_BASED_SUMMARIZER, self::CENTROID_SUMMARIZER, or self::CENTROID_WEIGHTED_SUMMARIZER

Return values
mixed

calculateLang()

Tries to determine the language of the document by looking at the provided $sample_text and $url

public static calculateLang([string $sample_text = null ][, string $url = null ]) : string
Parameters
$sample_text : string = null

sample text to try to guess the language from

$url : string = null

url of the web page; as a fallback, the country in the url is used to figure out the language

Return values
string

language tag for guessed language

closeDanglingTags()

If an end of file is reached before closing tags are seen, this method closes these tags in the correct order.

public static closeDanglingTags(string &$page) : mixed
Parameters
$page : string

a reference to an xml or html document

Return values
mixed
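The tag-closing idea can be sketched with a stack of open element names. This is a minimal illustration, not Yioop's implementation: the function name closeDanglingTagsSketch and the list of void elements are assumptions, and the regex scan does not treat comments or script contents specially.

```php
<?php
/**
 * Hypothetical sketch: appends closing tags for any elements that are
 * still open when the end of $page is reached.
 */
function closeDanglingTagsSketch(string &$page): void
{
    // Elements that never take a closing tag (assumed list)
    $void_tags = ["area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"];
    $stack = [];
    // Group 1 is "/" for a closing tag, group 2 is the tag name,
    // group 3 is "/" for a self-closing tag
    preg_match_all('/<(\/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*?(\/?)>/', $page,
        $matches, PREG_SET_ORDER);
    foreach ($matches as $match) {
        $name = strtolower($match[2]);
        if (in_array($name, $void_tags, true) || $match[3] === "/") {
            continue;
        }
        if ($match[1] === "") {
            $stack[] = $name; // opening tag
        } else {
            // Closing tag: pop back to the most recent matching open tag
            $pos = array_search($name, array_reverse($stack, true), true);
            if ($pos !== false) {
                $stack = array_slice($stack, 0, $pos);
            }
        }
    }
    // Close whatever is still open, innermost first
    while ($stack) {
        $page .= "</" . array_pop($stack) . ">";
    }
}
```

Because $page is passed by reference, as in the documented signature, the caller's string is modified in place and nothing is returned.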

computeTopLevelLinks()

For a url which consists of just a hostname, computes the top level links within its web page. These links will eventually be displayed underneath the main link in the search results

public static computeTopLevelLinks(string $url, array<string|int, mixed> $links) : array<string|int, mixed>
Parameters
$url : string

url of the website that is currently being processed

$links : array<string|int, mixed>

associative array of $link_url => $link_text pairs

Return values
array<string|int, mixed>

of important links for the url

createThumb()

Used to create a thumbnail file in a thumb folder from an epub, html, or text file, provided the ImageMagick command convert exists and the Calibre command ebook-convert exists.

public static createThumb(string $folder, string $thumb_folder, string $file_name[, int $width = C\THUMB_DIM ][, int $height = C\THUMB_DIM ]) : mixed

For this method to do anything, the constant IMAGE_MAGICK must be set to the path to the "convert" command and the constant CALIBRE must be set to the path of "ebook-convert". As currently implemented, this is very brute force and slow: it creates a PDF from the file and then extracts the first page to make a thumb.

Parameters
$folder : string

folder with the file in it

$thumb_folder : string

folder to generate the thumb in

$file_name : string

name of the file in $folder

$width : int = C\THUMB_DIM

width in pixels of thumb

$height : int = C\THUMB_DIM

height in pixels of thumb

Return values
mixed

crudeDescription()

Returns summary of body of a web page based on crude regex matching used as a fall back if dom parsing did not work.

public static crudeDescription(string $page) : string
Parameters
$page : string

to extract description from

Return values
string

a description of the page

crudeTitle()

Returns title of a webpage based on crude regex match, used as a fall back if dom parsing did not work.

public static crudeTitle(string $page) : string
Parameters
$page : string

to extract title from

Return values
string

a title of the page
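A regex-based title fallback of this kind might look like the sketch below. The function name crudeTitleSketch is hypothetical, the default limit of 100 simply mirrors MAX_TITLE_LEN, and the actual pattern Yioop uses is not shown in this documentation.

```php
<?php
/**
 * Hypothetical sketch: pulls the contents of the first <title> element
 * out of raw page text without building a DOM.
 */
function crudeTitleSketch(string $page, int $max_len = 100): string
{
    // "i" ignores tag case, "s" lets "." span line breaks
    if (!preg_match('/<title[^>]*>(.*?)<\/title>/is', $page, $match)) {
        return "";
    }
    // Drop stray markup, decode entities, collapse whitespace runs
    $title = html_entity_decode(strip_tags($match[1]), ENT_QUOTES, "UTF-8");
    $title = trim(preg_replace('/\s+/', " ", $title));
    // Prefer a multibyte-safe cut when the mbstring extension is loaded
    return function_exists("mb_substr")
        ? mb_substr($title, 0, $max_len, "UTF-8")
        : substr($title, 0, $max_len);
}
```

Since it never parses the document, this approach still yields a title when the markup is too broken for DOM parsing.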

dom()

Return a document object based on a string containing the contents of a web page

public static dom(string $page) : object
Parameters
$page : string

a web page

Return values
object

document object
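Building a document object from page text can be sketched with PHP's DOMDocument class. This is an assumption about the approach, not Yioop's code; domSketch is a hypothetical name, and character-encoding handling is left out.

```php
<?php
/**
 * Hypothetical sketch: parses an HTML string into a DOMDocument while
 * keeping libxml parse warnings out of the output.
 */
function domSketch(string $page): DOMDocument
{
    $dom = new DOMDocument();
    if ($page === "") {
        return $dom; // loadHTML() rejects an empty string in PHP 8
    }
    // Real-world HTML is often malformed; collect warnings internally
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($page);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $dom;
}
```

The returned object can then be queried with methods such as getElementsByTagName().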

domNodeToString()

This returns the text content of a node but with spaces where tags were (unlike just using textContent)

public static domNodeToString(object $node) : string
Parameters
$node : object

a DOMNode

Return values
string

its text content with spaces
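The difference from textContent can be illustrated with a small recursive walk that joins text pieces with single spaces. The function name domNodeToStringSketch is hypothetical and the whitespace handling is a simplification, not Yioop's implementation.

```php
<?php
/**
 * Hypothetical sketch: text content of a node with a space wherever a
 * tag boundary separated two pieces of text.
 */
function domNodeToStringSketch(DOMNode $node): string
{
    if ($node->nodeType === XML_TEXT_NODE) {
        return (string) $node->nodeValue;
    }
    if (!$node->hasChildNodes()) {
        return ""; // comments, empty elements, and so on
    }
    $pieces = [];
    foreach ($node->childNodes as $child) {
        $text = trim(domNodeToStringSketch($child));
        if ($text !== "") {
            $pieces[] = $text;
        }
    }
    return implode(" ", $pieces);
}
```

For a div holding two paragraphs "one" and "two", textContent gives "onetwo", whereas this sketch gives "one two", which keeps adjacent words from being fused before indexing.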

extractHttpHttpsUrls()

Tries to extract http or https links from a string of text.

public static extractHttpHttpsUrls(string $page) : array<string|int, mixed>

Does this by a very approximate regular expression.

Parameters
$page : string

text string of a document

Return values
array<string|int, mixed>

a set of http or https links that were extracted from the document
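An approximate regex extraction of this kind could be sketched as follows. The pattern and the name extractHttpHttpsUrlsSketch are illustrative assumptions; the expression Yioop actually uses is not given in this documentation.

```php
<?php
/**
 * Hypothetical sketch: finds http and https urls in free text and
 * returns them without duplicates, in order of first appearance.
 */
function extractHttpHttpsUrlsSketch(string $page): array
{
    // A url runs until whitespace, an angle bracket, a quote, or ")"
    preg_match_all('/https?:\/\/[^\s<>"\')]+/i', $page, $matches);
    $urls = [];
    foreach ($matches[0] as $url) {
        // Trailing sentence punctuation is unlikely to be part of the url
        $url = rtrim($url, ".,;:!?");
        $urls[$url] = true; // array keys give de-duplication
    }
    return array_keys($urls);
}
```

Being approximate, a pattern like this can both miss unusual urls and pick up strings that merely look like urls.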

favicon()

Used to compute the favicon url for a web page.

public static favicon(object $dom, string $url) : string
Parameters
$dom : object

document object model of the web page trying to compute the favicon url for

$url : string

url of the web page that $dom corresponds to. Used to help compute the favicon url if the link to the icon in $dom is relative, or if no link is present and the url is guessed from the hostname.

Return values
string

url of favicon for web page (empty string if couldn't determine)
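One way to picture the favicon computation is to look for a link element whose rel contains "icon" and otherwise guess /favicon.ico on the host. The sketch below is an assumption about the general approach; faviconSketch is a hypothetical name and its relative-url resolution is deliberately simplified.

```php
<?php
/**
 * Hypothetical sketch: favicon url from a <link rel="icon"> element,
 * falling back to a guess based on the hostname of $url.
 */
function faviconSketch(DOMDocument $dom, string $url): string
{
    $parts = parse_url($url);
    if (empty($parts["scheme"]) || empty($parts["host"])) {
        return ""; // could not determine
    }
    $origin = $parts["scheme"] . "://" . $parts["host"] .
        (isset($parts["port"]) ? ":" . $parts["port"] : "");
    foreach ($dom->getElementsByTagName("link") as $link) {
        $rels = preg_split('/\s+/',
            strtolower(trim($link->getAttribute("rel"))));
        $href = trim($link->getAttribute("href"));
        if (!in_array("icon", $rels, true) || $href === "") {
            continue;
        }
        if (preg_match('/^https?:\/\//i', $href)) {
            return $href; // already absolute
        }
        if (substr($href, 0, 2) === "//") {
            return $parts["scheme"] . ":" . $href; // protocol-relative
        }
        if ($href[0] === "/") {
            return $origin . $href; // root-relative
        }
        // Simplification: resolve against the directory of the page path
        $dir = isset($parts["path"])
            ? preg_replace('/[^\/]*$/', "", $parts["path"]) : "/";
        return $origin . $dir . $href;
    }
    return $origin . "/favicon.ico"; // guess from the hostname
}
```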

getBetweenTags()

Gets the text between two tags in a document starting at the current position.

public static getBetweenTags(string $string, int $cur_pos, string $start_tag, string $end_tag) : array<string|int, mixed>
Parameters
$string : string

document to extract text from

$cur_pos : int

current position in the document from which to try to extract text

$start_tag : string

starting tag that we want to extract after

$end_tag : string

ending tag that we want to extract until

Return values
array<string|int, mixed>

pair consisting of the position in the document just after the end tag, together with the data between the two tags
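The (position, data) return pair lends itself to scanning a document in a loop. Below is a minimal sketch assuming plain substring search; getBetweenTagsSketch is a hypothetical name, and the behavior when a tag is missing is an assumption rather than documented Yioop behavior.

```php
<?php
/**
 * Hypothetical sketch: returns [position just after $end_tag, text
 * between $start_tag and $end_tag], searching from $cur_pos.
 */
function getBetweenTagsSketch(string $string, int $cur_pos,
    string $start_tag, string $end_tag): array
{
    // Keep the offset in range; strpos() rejects out-of-range offsets
    $cur_pos = max(0, min($cur_pos, strlen($string)));
    $start = strpos($string, $start_tag, $cur_pos);
    if ($start === false) {
        // Assumption: no start tag means no data and no movement
        return [$cur_pos, ""];
    }
    $start += strlen($start_tag);
    $end = strpos($string, $end_tag, $start);
    if ($end === false) {
        // Assumption: an unclosed tag yields the rest of the document
        return [strlen($string), substr($string, $start)];
    }
    return [$end + strlen($end_tag),
        substr($string, $start, $end - $start)];
}
```

Feeding the returned position back in as $cur_pos walks through successive tag pairs one at a time.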

getMetaRobots()

Get any NOINDEX, NOFOLLOW, NOARCHIVE, NONE, info out of any robot meta tags.

public static getMetaRobots(object $dom) : array<string|int, mixed>
Parameters
$dom : object

a document object to check the meta tags for

Return values
array<string|int, mixed>

of robot meta instructions
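Reading robot directives out of meta tags can be sketched as below. The name getMetaRobotsSketch and the choice to return a list of upper-case directive names are assumptions; the documentation only says an array of robot meta instructions is returned.

```php
<?php
/**
 * Hypothetical sketch: collects NOINDEX, NOFOLLOW, NOARCHIVE, and NONE
 * directives from <meta name="robots"> tags.
 */
function getMetaRobotsSketch(DOMDocument $dom): array
{
    $known = ["NOINDEX", "NOFOLLOW", "NOARCHIVE", "NONE"];
    $found = [];
    foreach ($dom->getElementsByTagName("meta") as $meta) {
        if (strtolower($meta->getAttribute("name")) !== "robots") {
            continue;
        }
        // The content attribute is a comma-separated directive list
        foreach (explode(",", $meta->getAttribute("content")) as $item) {
            $item = strtoupper(trim($item));
            if (in_array($item, $known, true)) {
                $found[$item] = true; // keys avoid duplicates
            }
        }
    }
    return array_keys($found);
}
```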

handle()

Method used to handle processing data for a web page. It makes a summary for the page (via the process() function which should be subclassed) as well as runs any plugins that are associated with the processors to create sub-documents

public handle(string $page, string $url) : array<string|int, mixed>
Parameters
$page : string

string of a web document

$url : string

location the document came from

Return values
array<string|int, mixed>

a summary (title, description, links, and content) of the information in $page. It also has a subdocs array containing any subdocuments returned from a plugin. Subdocuments might be things like recipes or tweets that appeared in a page.

initializeIndexedFileTypes()

Get processors for different file types. Constructing them will populate the self::$indexed_file_types, self::$image_types, and self::$mime_processor arrays

public static initializeIndexedFileTypes() : mixed
Return values
mixed

lang()

Determines the language of the html document by looking at the root language attribute. If that fails $sample_text is used to try to guess the language

public static lang(object $dom[, string $sample_text = null ][, string $url = null ]) : string
Parameters
$dom : object

a document object to check the language of

$sample_text : string = null

sample text to try to guess the language from

$url : string = null

url of the web page; as a fallback, the country in the url is used to figure out the language

Return values
string

language tag for guessed language
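The two-step lookup (root attribute first, guess second) can be sketched as follows. The name langSketch and the $guesser callback are hypothetical stand-ins; in the documented class the fallback role belongs to calculateLang(), whose internals are not described here.

```php
<?php
/**
 * Hypothetical sketch: language tag from the root element's lang
 * attribute, else from a caller-supplied guesser, else "".
 */
function langSketch(DOMDocument $dom, ?string $sample_text = null,
    ?callable $guesser = null): string
{
    $root = $dom->documentElement;
    if ($root !== null) {
        foreach (["lang", "xml:lang"] as $attribute) {
            $lang = trim($root->getAttribute($attribute));
            if ($lang !== "") {
                return $lang;
            }
        }
    }
    // Assumption: the guesser maps sample text to a language tag
    if ($guesser !== null && $sample_text !== null) {
        return (string) $guesser($sample_text);
    }
    return "";
}
```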

links()

Returns up to MAX_LINKS_TO_EXTRACT many links from the supplied dom object where links have been canonicalized according to the supplied $site information.

public static links(object $dom, string $site, string $lang) : array<string|int, mixed>
Parameters
$dom : object

a document object with links on it

$site : string

a string containing a url

$lang : string

locale for document

Return values
array<string|int, mixed>

links from the $dom object

location()

Extracts the location of refresh urls from the meta tags of an html page

public static location(object $dom, string $url) : mixed
Parameters
$dom : object

document object version of web page

$url : string

the url where the dom object comes from

Return values
mixed

refresh or location url if found, false otherwise
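A meta refresh target can be read out of the content attribute roughly as shown below. The name metaRefreshSketch and the pattern are assumptions, and the sketch returns the target as written, omitting the resolution against $url that the documented method would need for relative targets.

```php
<?php
/**
 * Hypothetical sketch: returns the url named in a
 * <meta http-equiv="refresh"> tag, or false if there is none.
 *
 * @return string|false
 */
function metaRefreshSketch(DOMDocument $dom)
{
    foreach ($dom->getElementsByTagName("meta") as $meta) {
        if (strtolower($meta->getAttribute("http-equiv")) !== "refresh") {
            continue;
        }
        // content looks like "5; url=https://example.com/new"
        $pattern = '/^\s*\d*\s*[;,]\s*url\s*=\s*[\'"]?([^\'"]+)/i';
        if (preg_match($pattern, $meta->getAttribute("content"), $match)) {
            return trim($match[1]);
        }
    }
    return false;
}
```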

process()

Used to extract the title, description and links from a string consisting of webpage data.

public process(string $page, string $url) : array<string|int, mixed>
Parameters
$page : string

web-page contents

$url : string

the url where the page contents came from, used to canonicalize relative links

Return values
array<string|int, mixed>

a summary of the contents of the page

relCanonical()

If a canonical link element (https://en.wikipedia.org/wiki/Canonical_link_element) is in $dom, then this function extracts it

public static relCanonical(object $dom, string $url) : mixed
Parameters
$dom : object

document object version of web page

$url : string

the url where the dom object comes from

Return values
mixed

canonical url if found, false otherwise
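Extracting a canonical link element can be sketched as a scan over link tags. The name relCanonicalSketch is hypothetical, and the sketch returns the href as written rather than resolving it against $url.

```php
<?php
/**
 * Hypothetical sketch: returns the href of the first
 * <link rel="canonical"> element, or false if there is none.
 *
 * @return string|false
 */
function relCanonicalSketch(DOMDocument $dom)
{
    foreach ($dom->getElementsByTagName("link") as $link) {
        // rel may hold several space-separated tokens
        $rels = preg_split('/\s+/',
            strtolower(trim($link->getAttribute("rel"))));
        $href = trim($link->getAttribute("href"));
        if (in_array("canonical", $rels, true) && $href !== "") {
            return $href;
        }
    }
    return false;
}
```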

title()

Returns title of a webpage based on its document object

public static title(object $dom) : string
Parameters
$dom : object

a document object to extract a title from.

Return values
string

a title of the page


        
