ScraperManager
Class used by HTML processors to detect whether a page matches a particular signature, such as that of a content management system, and to provide scraping mechanisms for the content of such a page.
Table of Contents
- applyScraperRules() : string
- Applies scrape rules to a given page. A scrape rule consists of a TEXT_PATH XPath for the main content of a web page, a newline-separated (\n) sequence of DELETE_PATHS for what should be removed from the main content as irrelevant, and a list EXTRACT_FIELDS of additional summary fields to extract from the page content.
- checkSignature() : bool
- If $signature begins with '/', checks whether applying the XPath in $signature to $page yields a non-empty DOM node list. Otherwise, matches $signature as a regex, written without its start and end delimiters (say, /), against $page and returns whether a match was found.
- getContentByXquery() : DOMDocument
- Gets the contents of a document that match an XPath query
- getScraper() : array<string|int, mixed>
- Checks a page against a supplied list of scrapers for a matching signature. If a match is found, that scraper is returned.
- removeContentByXquery() : bool
- Removes from the contents of a DOMDocument the results of an xpath query
Methods
applyScraperRules()
Applies scrape rules to a given page. A scrape rule consists of a TEXT_PATH XPath for the main content of a web page, a newline-separated (\n) sequence of DELETE_PATHS for what should be removed from the main content as irrelevant, and a list EXTRACT_FIELDS of additional summary fields to extract from the page content.
public
static applyScraperRules(string $page, mixed $scraper) : string
Parameters
- $page : string
  the web page to operate on
- $scraper : mixed
Return values
string —the result of extracting the content matched by the first XPath (TEXT_PATH) and deleting from it according to the remaining XPath rules (DELETE_PATHS)
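The extract-then-delete flow described above can be sketched with PHP's built-in DOM extension. This is a minimal illustration, not the library's code: the function name `sketchApplyScraperRules` is hypothetical, treating `$scraper` as an array keyed by `TEXT_PATH` and `DELETE_PATHS` is an assumption, and `EXTRACT_FIELDS` is left out because its format is not described here.

```php
<?php
// Hypothetical stand-in for applyScraperRules(); not the library's code.
// Assumes $scraper is an array with TEXT_PATH and DELETE_PATHS entries
// and that both kinds of XPath select element nodes.
function sketchApplyScraperRules(string $page, array $scraper): string
{
    $textPath = trim($scraper['TEXT_PATH'] ?? '');
    if ($page === '' || $textPath === '') {
        return $page; // nothing to extract with
    }
    $source = new DOMDocument();
    @$source->loadHTML($page); // @ hides warnings from messy HTML
    $sourceXpath = new DOMXPath($source);
    $main = @$sourceXpath->query($textPath);

    // Copy the main-content nodes into a fresh html/body skeleton.
    $result = new DOMDocument();
    $html = $result->appendChild($result->createElement('html'));
    $body = $html->appendChild($result->createElement('body'));
    if ($main !== false) {
        foreach ($main as $node) {
            $body->appendChild($result->importNode($node, true));
        }
    }

    // Apply each newline-separated delete path to the extracted content.
    $resultXpath = new DOMXPath($result);
    $paths = explode("\n", $scraper['DELETE_PATHS'] ?? '');
    foreach (array_filter(array_map('trim', $paths)) as $path) {
        $doomed = @$resultXpath->query($path);
        if ($doomed === false) {
            continue; // skip malformed expressions
        }
        foreach (iterator_to_array($doomed) as $node) {
            if ($node->parentNode !== null) {
                $node->parentNode->removeChild($node);
            }
        }
    }
    return (string) $result->saveHTML();
}

$examplePage = '<html><body><div id="nav">Menu</div><div id="content">' .
    '<h1>Title</h1><div class="ad">Buy now</div><p>Body text</p>' .
    '<span class="share">Share</span></div></body></html>';
$exampleScraper = [
    'TEXT_PATH' => '//div[@id="content"]',
    'DELETE_PATHS' => "//div[@class=\"ad\"]\n//span[@class=\"share\"]",
];
echo sketchApplyScraperRules($examplePage, $exampleScraper);
```

In this example the navigation block is dropped because it lies outside the TEXT_PATH match, while the ad and share elements are dropped by the two DELETE_PATHS entries.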
checkSignature()
If $signature begins with '/', checks whether applying the XPath in $signature to $page yields a non-empty DOM node list. Otherwise, matches $signature as a regex, written without its start and end delimiters (say, /), against $page and returns whether a match was found.
public
static checkSignature(string $page, string $signature) : bool
Parameters
- $page : string
  a web document to check
- $signature : string
  an XPath (if it begins with '/') or a regex without delimiters to check against
Return values
bool —true if the given XPath returns a non-empty DOM node list or the given regex matches the page; false otherwise
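The two branches described above (XPath when the signature starts with '/', regex otherwise) might look roughly like the following. This is a sketch under stated assumptions rather than the library's implementation: `sketchCheckSignature` is a hypothetical name, and wrapping the regex body in `@` delimiters is an assumption that only works when the signature itself contains no `@`.

```php
<?php
// Hypothetical stand-in for checkSignature(); not the library's code.
function sketchCheckSignature(string $page, string $signature): bool
{
    if ($page === '' || $signature === '') {
        return false;
    }
    if ($signature[0] === '/') {
        // XPath branch: true when the query yields at least one node.
        $dom = new DOMDocument();
        @$dom->loadHTML($page); // @ hides warnings from messy HTML
        $xpath = new DOMXPath($dom);
        $nodes = @$xpath->query($signature);
        return $nodes !== false && $nodes->length > 0;
    }
    // Regex branch: the '@' delimiter is an assumption of this sketch.
    return @preg_match('@' . $signature . '@', $page) === 1;
}

$cmsPage = '<html><head><meta name="generator" content="WordPress 6.4">' .
    '</head><body><p>Hello</p></body></html>';
var_dump(sketchCheckSignature($cmsPage, '/html/head/meta[@name="generator"]')); // bool(true)
var_dump(sketchCheckSignature($cmsPage, 'content="WordPress'));                 // bool(true)
var_dump(sketchCheckSignature($cmsPage, '//div[@id="drupal"]'));                // bool(false)
```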
getContentByXquery()
Gets the contents of a document that match an XPath query
public
static getContentByXquery(string $page, string $query) : DOMDocument
Parameters
- $page : string
  a document to apply the xpath query against
- $query : string
  the xpath query to run
Return values
DOMDocument —DOM of a simplified web page containing the nodes that match the XPath query, placed within an HTML body tag.
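One way to build such a simplified document is to copy the matching nodes into a new html/body skeleton. The sketch below illustrates the idea with PHP's DOM extension; `sketchGetContentByXquery` is a hypothetical name, and the code assumes the query selects element nodes.

```php
<?php
// Hypothetical stand-in for getContentByXquery(); not the library's code.
// Assumes $query selects element nodes (not attributes or the document root).
function sketchGetContentByXquery(string $page, string $query): DOMDocument
{
    $source = new DOMDocument();
    if ($page !== '') {
        @$source->loadHTML($page); // @ hides warnings from messy HTML
    }
    $xpath = new DOMXPath($source);
    $matches = @$xpath->query($query);

    // Build a minimal <html><body>...</body></html> holder for the matches.
    $result = new DOMDocument();
    $html = $result->appendChild($result->createElement('html'));
    $body = $html->appendChild($result->createElement('body'));
    if ($matches !== false) {
        foreach ($matches as $node) {
            // importNode() deep-copies the node into the new document.
            $body->appendChild($result->importNode($node, true));
        }
    }
    return $result;
}

$articlePage = '<html><body><div id="nav">Menu</div><div id="content">' .
    '<h1>Title</h1><p>Body text</p></div></body></html>';
$contentDom = sketchGetContentByXquery($articlePage, '//div[@id="content"]');
echo $contentDom->saveHTML();
```

Copying with `importNode()` leaves the source document untouched, so the same page can be queried again with a different XPath.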
getScraper()
Checks a page against a supplied list of scrapers for a matching signature. If a match is found, that scraper is returned.
public
static getScraper(string $page, array<string|int, mixed> $scrapers) : array<string|int, mixed>
Parameters
- $page : string
  the html page to check
- $scrapers : array<string|int, mixed>
  an array of scrapers to check against
Return values
array<string|int, mixed> —an associative array of scraper properties if a matching scraper signature is found; otherwise, the empty array
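The lookup can be pictured as a first-match scan over the scraper list. The following sketch is illustrative only: both function names are hypothetical, and the `SIGNATURE` and `NAME` array keys are assumptions, since the property names of a scraper are not listed on this page.

```php
<?php
// Hypothetical matcher: XPath if the signature starts with '/', else a
// regex body wrapped in '@' delimiters (an assumption of this sketch).
function sketchSignatureMatches(string $page, string $signature): bool
{
    if ($page === '' || $signature === '') {
        return false;
    }
    if ($signature[0] === '/') {
        $dom = new DOMDocument();
        @$dom->loadHTML($page); // @ hides warnings from messy HTML
        $xpath = new DOMXPath($dom);
        $nodes = @$xpath->query($signature);
        return $nodes !== false && $nodes->length > 0;
    }
    return @preg_match('@' . $signature . '@', $page) === 1;
}

// Hypothetical stand-in for getScraper(); returns the first scraper whose
// assumed 'SIGNATURE' entry matches, or the empty array if none does.
function sketchGetScraper(string $page, array $scrapers): array
{
    foreach ($scrapers as $scraper) {
        if (sketchSignatureMatches($page, $scraper['SIGNATURE'] ?? '')) {
            return $scraper;
        }
    }
    return [];
}

$knownScrapers = [
    ['NAME' => 'Drupal',
     'SIGNATURE' => '//meta[@name="generator"][contains(@content,"Drupal")]'],
    ['NAME' => 'WordPress', 'SIGNATURE' => 'wp-content'],
];
$blogPage = '<html><head><link rel="stylesheet" ' .
    'href="/wp-content/themes/x/style.css"></head><body><p>Post</p></body></html>';
print_r(sketchGetScraper($blogPage, $knownScrapers));
```

Returning the empty array rather than null lets callers test the result with `empty()` and keeps the return type uniform.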
removeContentByXquery()
Removes from the contents of a DOMDocument the results of an xpath query
public
static removeContentByXquery(DOMDocument $dom, string $query) : bool
Parameters
- $dom : DOMDocument
  a document to apply the xpath query against
- $query : string
  the xpath query to run
Return values
bool —whether anything was removed from the DOMDocument
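In-place removal of XPath matches can be sketched as follows. The function name `sketchRemoveContentByXquery` is hypothetical, and the code assumes the query selects element nodes; it is an illustration of the technique, not the library's implementation.

```php
<?php
// Hypothetical stand-in for removeContentByXquery(); not the library's code.
// Assumes $query selects element nodes.
function sketchRemoveContentByXquery(DOMDocument $dom, string $query): bool
{
    $xpath = new DOMXPath($dom);
    $matches = @$xpath->query($query);
    if ($matches === false) {
        return false; // malformed expression: nothing removed
    }
    $removed = false;
    // Copy the matches to an array as a precaution before mutating the tree.
    foreach (iterator_to_array($matches) as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
            $removed = true;
        }
    }
    return $removed;
}

$adDom = new DOMDocument();
@$adDom->loadHTML('<html><body><div class="ad">Buy</div><p>Keep</p></body></html>');
var_dump(sketchRemoveContentByXquery($adDom, '//div[@class="ad"]')); // bool(true)
var_dump(sketchRemoveContentByXquery($adDom, '//div[@class="ad"]')); // bool(false)
```

The second call returns false because the first call already removed the only matching node from the document.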