Yioop_V9.5_Source_Code_Documentation

ScraperManager

Class used by HTML processors to detect whether a page matches a particular signature, such as that of a content management system, and to provide scraping mechanisms for the content of such a page.

Tags
author

Charles Bocage (charles.bocage@sjsu.edu); updated to support scraper priorities and extract fields by Chris Pollett

Table of Contents

applyScraperRules()  : string
Applies scrape rules to a given page. A scrape rule consists of a TEXT_PATH xpath for the main content of a web page, a sequence of \n-separated DELETE_PATHS for what should be removed from the main content as irrelevant, and finally a list, EXTRACT_FIELDS, of additional summary fields which should be extracted from the page content.
checkSignature()  : bool
If $signature begins with '/', checks whether applying the xpath in $signature to $page results in a non-empty DOM node list. Otherwise, matches $signature as a regex (written without its start and end delimiters, say /) against $page and returns whether a match was found.
getContentByXquery()  : DOMDocument
Get the contents of a document via an xpath
getScraper()  : array<string|int, mixed>
Checks a page against a supplied list of scrapers for a matching signature. If a match is found, that scraper is returned.
removeContentByXquery()  : bool
Removes from the contents of a DOMDocument the results of an xpath query

Methods

applyScraperRules()

Applies scrape rules to a given page. A scrape rule consists of a TEXT_PATH xpath for the main content of a web page, a sequence of \n-separated DELETE_PATHS for what should be removed from the main content as irrelevant, and finally a list, EXTRACT_FIELDS, of additional summary fields which should be extracted from the page content.

public static applyScraperRules(string $page, mixed $scraper) : string
Parameters
$page : string

the web page to operate on

$scraper : mixed

the scraper whose rules (TEXT_PATH, DELETE_PATHS, EXTRACT_FIELDS) are applied to the page

Return values
string

the result of extracting the content matched by the first xpath and then deleting from it whatever the remaining xpath rules match
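
The extract-then-delete flow described above can be sketched roughly as follows. This is a minimal illustration, not the Yioop implementation: the function name sketchApplyScraperRules is hypothetical, $scraper is assumed to be an associative array with TEXT_PATH and DELETE_PATHS keys, EXTRACT_FIELDS is left out because its format is not described here, and returning the page unchanged when TEXT_PATH matches nothing is also an assumption.

```php
<?php
// Hypothetical sketch of applying a scrape rule with PHP's DOM extension.
// Assumes $scraper is an array with 'TEXT_PATH' and 'DELETE_PATHS' keys.
function sketchApplyScraperRules(string $page, array $scraper): string
{
    $textPath = trim($scraper['TEXT_PATH'] ?? '');
    if ($page === '' || $textPath === '') {
        return $page;
    }
    $source = new DOMDocument();
    // Real-world HTML is often malformed, so parser warnings are suppressed.
    if (!@$source->loadHTML($page)) {
        return $page;
    }
    $sourceXpath = new DOMXPath($source);
    $matches = @$sourceXpath->query($textPath);
    if ($matches === false || $matches->length === 0) {
        // Assumption: fall back to the original page when nothing matches.
        return $page;
    }
    // Build a simplified document that holds only the main content.
    $content = new DOMDocument();
    $body = $content->createElement('body');
    $content->appendChild($body);
    foreach ($matches as $node) {
        if ($node instanceof DOMElement) {
            $body->appendChild($content->importNode($node, true));
        }
    }
    // Apply each newline-separated delete xpath to the extracted content.
    $contentXpath = new DOMXPath($content);
    foreach (explode("\n", $scraper['DELETE_PATHS'] ?? '') as $deletePath) {
        $deletePath = trim($deletePath);
        if ($deletePath === '') {
            continue;
        }
        $unwanted = @$contentXpath->query($deletePath);
        if ($unwanted === false) {
            continue;
        }
        foreach (iterator_to_array($unwanted) as $node) {
            if ($node instanceof DOMElement && $node->parentNode !== null) {
                $node->parentNode->removeChild($node);
            }
        }
    }
    return $content->saveHTML($body);
}
```

With a TEXT_PATH of //div[@id="main"] and DELETE_PATHS listing an advertisement block, the sketch returns the main div's markup with the advertisement removed and everything outside the main div dropped.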

checkSignature()

If $signature begins with '/', checks whether applying the xpath in $signature to $page results in a non-empty DOM node list. Otherwise, matches $signature as a regex (written without its start and end delimiters, say /) against $page and returns whether a match was found.

public static checkSignature(string $page, string $signature) : bool
Parameters
$page : string

a web document to check

$signature : string

an xpath (if it begins with '/') or a regex without delimiters to check against

Return values
bool

true if the given xpath returns a non-empty DOM node list or, for a regex signature, if the regex matches the page
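
The xpath-or-regex decision described for this method might look like the sketch below. It is illustrative only: sketchCheckSignature is a hypothetical name, and the choice of ~ as the regex delimiter is an assumption made here (the documentation only says the regex is supplied without delimiters).

```php
<?php
// Hypothetical sketch of a signature check; not the Yioop implementation.
function sketchCheckSignature(string $page, string $signature): bool
{
    if ($page === '' || $signature === '') {
        return false;
    }
    if ($signature[0] === '/') {
        // Treat the signature as an xpath and test for a non-empty node list.
        $dom = new DOMDocument();
        // Suppress warnings triggered by malformed HTML.
        if (!@$dom->loadHTML($page)) {
            return false;
        }
        $xpath = new DOMXPath($dom);
        $nodes = @$xpath->query($signature);
        return $nodes !== false && $nodes->length > 0;
    }
    // Otherwise treat the signature as a regex supplied without delimiters.
    // Assumption: '~' is used as the delimiter, so the pattern itself must
    // not contain an unescaped '~'.
    return @preg_match('~' . $signature . '~', $page) === 1;
}
```

For example, a signature of //meta[@name="generator"] is evaluated as an xpath, while a signature of WordPress \d is evaluated as a regex against the raw page text.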

getContentByXquery()

Get the contents of a document via an xpath

public static getContentByXquery(string $page, string $query) : DOMDocument
Parameters
$page : string

a document to apply the xpath query against

$query : string

the xpath query to run

Return values
DOMDocument

DOM of a simplified web page containing the nodes that match the xpath query, placed within an html body tag.
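
One way to build such a simplified document is sketched below. The function name sketchGetContentByXquery is hypothetical, and returning an empty html/body skeleton when the page cannot be parsed or the query fails is an assumption rather than documented behavior.

```php
<?php
// Hypothetical sketch: copy nodes matching an xpath into a fresh document
// whose only structure is <html><body>...</body></html>.
function sketchGetContentByXquery(string $page, string $query): DOMDocument
{
    $result = new DOMDocument();
    $html = $result->createElement('html');
    $body = $result->createElement('body');
    $html->appendChild($body);
    $result->appendChild($html);
    if ($page === '') {
        return $result;
    }
    $source = new DOMDocument();
    // Suppress warnings triggered by malformed HTML.
    if (!@$source->loadHTML($page)) {
        return $result;
    }
    $xpath = new DOMXPath($source);
    $nodes = @$xpath->query($query);
    if ($nodes === false) {
        return $result;
    }
    foreach ($nodes as $node) {
        // Only element and text nodes can sensibly be appended to <body>.
        if ($node instanceof DOMElement || $node instanceof DOMText) {
            // importNode copies the node (deeply) into the new document.
            $body->appendChild($result->importNode($node, true));
        }
    }
    return $result;
}
```

Importing the nodes into a new document, rather than editing the parsed page in place, keeps the original DOM untouched for any later queries.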

getScraper()

Checks a page against a supplied list of scrapers for a matching signature. If a match is found, that scraper is returned.

public static getScraper(string $page, array<string|int, mixed> $scrapers) : array<string|int, mixed>
Parameters
$page : string

the html page to check

$scrapers : array<string|int, mixed>

an array of scrapers to check against

Return values
array<string|int, mixed>

an associative array of scraper properties if a matching scraper signature is found; otherwise, the empty array
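
The lookup can be pictured as a first-match loop over the scraper list, as in the sketch below. The names sketchGetScraper and sketchSignatureMatches are hypothetical, the SIGNATURE key and the NAME key used in the usage note are assumptions about how a scraper array is laid out, and scraper priorities are not modeled because this page does not describe them.

```php
<?php
// Hypothetical helper: xpath check if the signature starts with '/',
// otherwise a regex check using '~' as an assumed delimiter.
function sketchSignatureMatches(string $page, string $signature): bool
{
    if ($page === '' || $signature === '') {
        return false;
    }
    if ($signature[0] === '/') {
        $dom = new DOMDocument();
        // Suppress warnings triggered by malformed HTML.
        if (!@$dom->loadHTML($page)) {
            return false;
        }
        $nodes = @(new DOMXPath($dom))->query($signature);
        return $nodes !== false && $nodes->length > 0;
    }
    return @preg_match('~' . $signature . '~', $page) === 1;
}

// Hypothetical sketch: return the first scraper whose signature matches,
// or the empty array when none does.
function sketchGetScraper(string $page, array $scrapers): array
{
    foreach ($scrapers as $scraper) {
        // Assumption: each scraper is an array with a 'SIGNATURE' entry.
        if (!is_array($scraper)) {
            continue;
        }
        $signature = $scraper['SIGNATURE'] ?? '';
        if (is_string($signature)
            && sketchSignatureMatches($page, $signature)) {
            return $scraper;
        }
    }
    return [];
}
```

Given one scraper with the xpath signature //meta[contains(@content, "Drupal")] and another with the regex signature wp-content, a page linking to /wp-content/style.css would yield the second scraper, and a page matching neither would yield the empty array.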

removeContentByXquery()

Removes from the contents of a DOMDocument the results of an xpath query

public static removeContentByXquery(DOMDocument $dom, string $query) : bool
Parameters
$dom : DOMDocument

a document to apply the xpath query against

$query : string

the xpath query to run

Return values
bool

whether anything was removed from the DOMDocument
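
A removal of this kind can be sketched with DOMXPath as shown below. The function name sketchRemoveContentByXquery is hypothetical, and skipping attribute matches is a simplification made for this sketch.

```php
<?php
// Hypothetical sketch: delete every node matched by an xpath query and
// report whether at least one node was removed.
function sketchRemoveContentByXquery(DOMDocument $dom, string $query): bool
{
    $xpath = new DOMXPath($dom);
    // An invalid query makes query() return false; the warning is suppressed.
    $nodes = @$xpath->query($query);
    if ($nodes === false) {
        return false;
    }
    $removed = false;
    // Copy the matches into an array so removal cannot disturb iteration.
    foreach (iterator_to_array($nodes) as $node) {
        // Simplification: attribute matches are skipped in this sketch.
        if ($node instanceof DOMAttr || $node->parentNode === null) {
            continue;
        }
        $node->parentNode->removeChild($node);
        $removed = true;
    }
    return $removed;
}
```

Because the DOMDocument is modified in place, calling the sketch a second time with the same query returns false once the matching nodes are gone.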
