Yioop_V9.5_Source_Code

ScraperManager
in package Application

Class used by HTML processors to detect whether a page matches a particular signature, such as that of a content management system, and to provide scraping mechanisms for the content of such a page.

applyScraperRules()

Applies scrape rules to a given page. A scrape rule consists of a TEXT_PATH XPath for the main content of a web page, a sequence of \n-separated DELETE_PATHS for what should be removed from the main content as irrelevant, and a list EXTRACT_FIELDS of additional summary fields that should be extracted from the page content.


    public static applyScraperRules(string $page, mixed $scraper) : string

Parameters

$page : string: the web page to operate on
$scraper : mixed: the scraper whose rules (TEXT_PATH, DELETE_PATHS, EXTRACT_FIELDS) are applied to the page

Return values

string: the result of extracting the content matched by the first XPath and deleting from it according to the remaining XPath rules
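The extract-then-delete flow described above can be sketched with PHP's DOM extension. This is a minimal illustration, not Yioop's implementation: the function name `sketchApplyScraperRules` is hypothetical, `$scraper` is assumed to be an associative array keyed by `TEXT_PATH` and `DELETE_PATHS`, the page is returned unchanged when nothing matches (also an assumption), and `EXTRACT_FIELDS` handling is left out because the documentation does not say how those fields are returned.

```php
<?php
// Illustrative sketch only; names and fallback behavior are assumptions.
function sketchApplyScraperRules(string $page, array $scraper): string
{
    if ($page === '' || empty($scraper['TEXT_PATH'])) {
        return $page;
    }
    $source = new DOMDocument();
    // Real-world HTML is often malformed, so collect parse warnings silently.
    $previous = libxml_use_internal_errors(true);
    $loaded = $source->loadHTML($page);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return $page;
    }
    $xpath = new DOMXPath($source);
    $main = @$xpath->query($scraper['TEXT_PATH']);
    if ($main === false || $main->length === 0) {
        return $page; // assumed fallback: nothing matched, keep the page
    }
    // Copy the main content into a fresh html/body skeleton.
    $result = new DOMDocument();
    $html = $result->createElement('html');
    $body = $result->createElement('body');
    $html->appendChild($body);
    $result->appendChild($html);
    foreach ($main as $node) {
        if ($node instanceof DOMElement) {
            $body->appendChild($result->importNode($node, true));
        }
    }
    // Apply each newline-separated delete path to the extracted content.
    $resultXpath = new DOMXPath($result);
    foreach (explode("\n", $scraper['DELETE_PATHS'] ?? '') as $path) {
        $path = trim($path);
        if ($path === '') {
            continue;
        }
        $doomed = @$resultXpath->query($path);
        if ($doomed === false) {
            continue; // skip malformed XPath expressions
        }
        foreach (iterator_to_array($doomed) as $node) {
            if ($node->parentNode !== null && !($node instanceof DOMAttr)) {
                $node->parentNode->removeChild($node);
            }
        }
    }
    return $result->saveHTML();
}

$page = '<html><body><div id="nav">Menu</div>'
    . '<div id="content"><p>Story</p><div class="ad">Buy</div></div>'
    . '</body></html>';
$rules = [
    'TEXT_PATH' => '//div[@id="content"]',
    'DELETE_PATHS' => "//div[@class=\"ad\"]\n//script",
];
$cleaned = sketchApplyScraperRules($page, $rules);
```

In this example the navigation block is dropped because it lies outside TEXT_PATH, and the advertisement is dropped because it matches one of the DELETE_PATHS.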

checkSignature()

If $signature begins with '/', checks whether applying the XPath in $signature to $page yields a non-empty DOM node list. Otherwise, matches $signature as a regex (written without its start and end delimiters, say /) against $page and returns whether a match was found.


    public static checkSignature(string $page, string $signature) : bool

Parameters

$page : string: a web document to check
$signature : string: an XPath (if it begins with '/') or a regex without delimiters to check against

Return values

bool: true if the given XPath returns a non-empty DOM node list or, for a regex signature, if the regex matches $page; false otherwise
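The two-branch behavior described for this method can be sketched as follows. This is a hedged illustration rather than Yioop's code: `sketchCheckSignature` is a hypothetical name, and the regex branch assumes the signature contains no unescaped '/' because that character is used here as the delimiter.

```php
<?php
// Illustrative sketch only; not the library's implementation.
function sketchCheckSignature(string $page, string $signature): bool
{
    if ($page === '' || $signature === '') {
        return false;
    }
    if ($signature[0] === '/') {
        // XPath branch: the signature matches if the node list is non-empty.
        $dom = new DOMDocument();
        $previous = libxml_use_internal_errors(true);
        $loaded = $dom->loadHTML($page);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);
        if (!$loaded) {
            return false;
        }
        $xpath = new DOMXPath($dom);
        $nodes = @$xpath->query($signature); // false on a malformed XPath
        return $nodes !== false && $nodes->length > 0;
    }
    // Regex branch: the signature carries no delimiters, so add them here.
    // Assumption: the signature has no unescaped '/' inside it.
    return @preg_match('/' . $signature . '/', $page) === 1;
}

$html = '<html><head><meta name="generator" content="WordPress 6.0">'
    . '</head><body><p>Hi</p></body></html>';
$byXpath = sketchCheckSignature($html, '//meta[@name="generator"]');
$byRegex = sketchCheckSignature($html, 'WordPress \d');
```

Both calls in the example report a match: the first through the XPath branch, the second through the regex branch.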

getContentByXquery()

Gets the contents of a document that match an XPath query.


    public static getContentByXquery(string $page, string $query) : DOMDocument

Parameters

$page : string: a document to apply the XPath query against
$query : string: the XPath query to run

Return values

DOMDocument: DOM of a simplified web page containing the nodes matching the XPath query within an HTML body tag
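One way to build such a simplified document is to import the matched nodes into a fresh html/body skeleton. The sketch below is an assumption about the general technique, not the library's code; `sketchGetContentByXquery` is a hypothetical name, and only element nodes are copied.

```php
<?php
// Illustrative sketch only; not the library's implementation.
function sketchGetContentByXquery(string $page, string $query): DOMDocument
{
    // Start with an empty <html><body></body></html> document.
    $result = new DOMDocument();
    $html = $result->createElement('html');
    $body = $result->createElement('body');
    $html->appendChild($body);
    $result->appendChild($html);
    if ($page === '') {
        return $result;
    }
    $source = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $loaded = $source->loadHTML($page);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return $result;
    }
    $xpath = new DOMXPath($source);
    $nodes = @$xpath->query($query); // false on a malformed XPath
    if ($nodes === false) {
        return $result;
    }
    foreach ($nodes as $node) {
        // Nodes belong to $source, so import a deep copy before appending.
        if ($node instanceof DOMElement) {
            $body->appendChild($result->importNode($node, true));
        }
    }
    return $result;
}

$html = '<html><body><div id="main"><p>Keep</p></div>'
    . '<div id="nav">Skip</div></body></html>';
$dom = sketchGetContentByXquery($html, '//div[@id="main"]');
$bodyNode = $dom->getElementsByTagName('body')->item(0);
```

Importing with `importNode` matters because a DOM node cannot be appended directly to a document other than the one that created it.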

getScraper()

Checks a page against a supplied list of scrapers for a matching signature. If a match is found, that scraper is returned.


    public static getScraper(string $page, array<string|int, mixed> $scrapers) : array<string|int, mixed>

Parameters

$page : string: the HTML page to check
$scrapers : array<string|int, mixed>: an array of scrapers to check against

Return values

array<string|int, mixed>: an associative array of scraper properties if a matching scraper signature is found; otherwise, the empty array
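The lookup can be pictured as a loop that returns the first scraper whose signature matches. In the sketch below the names `sketchGetScraper` and `$matchesSignature` are hypothetical, the `SIGNATURE` and `NAME` keys are assumed field names that this page does not document, and returning the first match is also an assumption. The signature test is passed in as a callable so that the example stays self-contained.

```php
<?php
// Illustrative sketch only; key names and first-match policy are assumptions.
function sketchGetScraper(
    string $page,
    array $scrapers,
    callable $matchesSignature
): array {
    foreach ($scrapers as $scraper) {
        if (!is_array($scraper) || empty($scraper['SIGNATURE'])) {
            continue; // skip entries without a usable signature
        }
        if ($matchesSignature($page, $scraper['SIGNATURE'])) {
            return $scraper;
        }
    }
    return []; // no scraper matched
}

// Hypothetical scraper list using regex-style signatures.
$scrapers = [
    ['NAME' => 'ExampleCMS', 'SIGNATURE' => 'content="ExampleCMS'],
    ['NAME' => 'OtherCMS', 'SIGNATURE' => 'content="OtherCMS'],
];
// Simplified matcher: treats every signature as a delimiter-free regex.
$regexMatcher = function (string $page, string $signature): bool {
    return @preg_match('/' . $signature . '/', $page) === 1;
};
$page = '<meta name="generator" content="OtherCMS 2.1">';
$found = sketchGetScraper($page, $scrapers, $regexMatcher);
```

Callers can tell a miss from a hit by testing whether the returned array is empty.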

removeContentByXquery()

Removes from a DOMDocument the nodes matched by an XPath query.


    public static removeContentByXquery(DOMDocument $dom, string $query) : bool

Parameters

$dom : DOMDocument: a document to apply the XPath query against
$query : string: the XPath query to run

Return values

bool: whether anything was removed from the DOMDocument
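A minimal sketch of in-place removal with a boolean result is shown below. It is an illustration under stated assumptions, not Yioop's code: `sketchRemoveContentByXquery` is a hypothetical name, and attribute nodes are skipped so that only element and text matches are removed.

```php
<?php
// Illustrative sketch only; not the library's implementation.
function sketchRemoveContentByXquery(DOMDocument $dom, string $query): bool
{
    $xpath = new DOMXPath($dom);
    $nodes = @$xpath->query($query); // false on a malformed XPath
    if ($nodes === false) {
        return false;
    }
    $removed = false;
    // Work on a plain array copy so removals cannot disturb the iteration.
    foreach (iterator_to_array($nodes) as $node) {
        if ($node instanceof DOMAttr || $node->parentNode === null) {
            continue;
        }
        $node->parentNode->removeChild($node);
        $removed = true;
    }
    return $removed;
}

$dom = new DOMDocument();
$dom->loadHTML('<html><body><div id="main"><p>Keep</p>'
    . '<div class="ad">Ad</div></div></body></html>');
$firstPass = sketchRemoveContentByXquery($dom, '//div[@class="ad"]');
$secondPass = sketchRemoveContentByXquery($dom, '//div[@class="ad"]');
```

The second call returns false because the first call already removed the only matching node, which shows how the return value reports whether the document changed.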
