Yioop_V9.5_Source_Code_Documentation

Fetcher
in package
implements CrawlConstants

This class is responsible for fetching web pages for the SeekQuarry/Yioop search engine

Fetcher periodically queries the queue server asking for web pages to fetch. It gets at most MAX_FETCH_SIZE many web pages from the queue_server in one go. It then fetches these pages. Pages are fetched in batches of NUM_MULTI_CURL_PAGES many pages. Once downloaded, the fetcher sends summaries back to the machine on which the queue_server lives. It does this by making a request of the web server on that machine and POSTing the data to the yioop web app. This data is handled by the FetchController class. The summary data can include up to three things: (1) robots.txt data, (2) summaries of each web page downloaded in the batch, and (3) a list of future urls to add to the to-crawl queue.
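To make the batching concrete, here is a minimal sketch of downloading a batch of pages in one go with PHP's curl_multi interface. It is illustrative only and not Yioop's actual fetch code: the batch size value, the timeout, and the example urls are assumptions made for this example.

```php
<?php
// Illustrative sketch: download a batch of pages with curl_multi, roughly
// how a fetcher grabs NUM_MULTI_CURL_PAGES pages in one go. The constant
// value, timeout, and urls below are assumptions, not Yioop's settings.
const NUM_MULTI_CURL_PAGES = 4;

function downloadBatch(array $urls): array
{
    $multi = curl_multi_init();
    $handles = [];
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        curl_multi_add_handle($multi, $ch);
        $handles[$url] = $ch;
    }
    do { // run all transfers until every handle has finished
        $status = curl_multi_exec($multi, $active);
        if ($active) {
            curl_multi_select($multi);
        }
    } while ($active && $status == CURLM_OK);
    $pages = [];
    foreach ($handles as $url => $ch) {
        $pages[$url] = [
            'PAGE' => curl_multi_getcontent($ch),
            'HTTP_CODE' => curl_getinfo($ch, CURLINFO_HTTP_CODE),
        ];
        curl_multi_remove_handle($multi, $ch);
        curl_close($ch);
    }
    curl_multi_close($multi);
    return $pages;
}

$batch = array_slice(["https://www.example.com/", "https://www.iana.org/"], 0,
    NUM_MULTI_CURL_PAGES);
foreach (downloadBatch($batch) as $url => $info) {
    echo "$url => {$info['HTTP_CODE']}, " . strlen((string)$info['PAGE']) . " bytes\n";
}
```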

Tags
author

Chris Pollett

Interfaces, Classes, Traits and Enums

CrawlConstants
Shared constants and enums used by components that are involved in the crawling process

Table of Contents

DEFAULT_POST_MAX_SIZE  = 2000000
Before receiving any data from a queue server's web app this is the default assumed post_max_size in bytes
DOMAIN_FILTER_GLOB  = \seekquarry\yioop\configs\DATA_DIR . "/domain_filters/*.ftr"
Domain Filter Glob pattern for Bloom filters used to specify allowed domains. (Only specifies if some filter file of this type exists)
GIT_URL_CONTINUE  = '@@@@'
constant indicating Git repository
HEX_NULL_CHARACTER  = "\x00"
An indicator to represent the next position after the access code in a Git tree object
INDICATOR_NONE  = 'none'
An indicator to tell no actions to be taken
REPOSITORY_GIT  = 'git'
constant indicating Git repository
$active_classifiers  : array<string|int, mixed>
Contains which classifiers to use for the current crawl. Classifiers can be used to label web documents with a meta word if the classifier's threshold is met
$active_rankers  : array<string|int, mixed>
Contains which classifiers are being used to rank web documents for the current crawl. The score that a classifier gives to a document is used for ranking purposes
$all_file_types  : array<string|int, mixed>
List of all known file extensions including those not used for crawl
$all_git_urls  : array<string|int, mixed>
To store all the internal git urls fetched
$allow_disallow_cache_time  : int
Microtime used to look up the cached $allowed_sites and $disallowed_sites filtering data structures
$allowed_sites  : array<string|int, mixed>
Web-sites that the crawler can crawl. If used, ONLY these will be crawled
$arc_dir  : string
For a non-web archive crawl, holds the path to the directory that contains the archive files and their description (web archives have a different structure and are already distributed across machines and fetchers)
$arc_type  : string
For an archive crawl, holds the name of the type of archive being iterated over (this is the class name of the iterator, without the word 'Iterator')
$archive_iterator  : object
If a web archive crawl (i.e., a re-crawl) is active, then this field holds the iterator object used to iterate over the archive
$cache_pages  : bool
Whether to cache pages or just the summaries
$channel  : int
Channel that queue server listens to messages for
$check_crawl_time  : int
The last time the name server was checked for a crawl time
$crawl_index  : string
If the crawl_type is self::ARCHIVE_CRAWL, then crawl_index is the timestamp of the existing archive to crawl
$crawl_order  : string
Stores the name of the ordering used to crawl pages. This is used in a switch/case when computing weights of urls to be crawled before sending these new urls back to a queue_server.
$crawl_stat_filename  : string
Name of file used to store fetcher statistics for the current crawl
$crawl_stat_info  : string
Fetcher statistics for the current crawl
$crawl_time  : int
Timestamp of the current crawl
$crawl_type  : string
Indicates the kind of crawl being performed: self::WEB_CRAWL indicates a new crawl of the web; self::ARCHIVE_CRAWL indicates a crawl of an existing web archive
$current_server  : int
Index into $queue_servers of the server to get the schedule from (or the last one we got the schedule from)
$db  : object
Reference to a database object. Used since it has directory manipulation functions
$debug  : string
Holds the value of a debug message that might have been sent from the command line during the current execution of loop();
$disallowed_sites  : array<string|int, mixed>
Web-sites that the crawler must not crawl
$domain_filters  : array<string|int, mixed>
An array of Bloom filters which, if non-empty, will be used to restrict discovered url links. Namely, if the domain of a url is not in any of the filters and is not of the form of a company level domain (cld) or www.cld then it will be pruned. Here a company level domain is of the form some_name.tld where tld is a top level domain, or of the form some_name.country_level_tld.tld if the tld is that of a country. So for site.somewhere.jp, somewhere.jp would be the cld; for site.somewhere.co.jp, somewhere.co.jp would be the cld.
$fetcher_num  : string
Which fetcher instance we are (if the fetcher is run as a job and there is more than one)
$found_sites  : array<string|int, mixed>
Summary information for visited sites that the fetcher hasn't sent to a queue_server yet
$hosts_with_errors  : array<string|int, mixed>
An array to keep track of hosts which have had a lot of http errors
$indexed_file_types  : array<string|int, mixed>
List of file extensions supported for the crawl
$max_depth  : int
Maximum depth, measured from the seed urls, to which the fetcher should extract new urls
$max_description_len  : int
Max number of chars to extract for description from a page to index.
$max_links_to_extract  : int
Maximum number of urls to extract from a single document
$minimum_fetch_loop_time  : int
Fetcher must wait at least this long between multi-curl requests.
$name_server  : array<string|int, mixed>
Urls or IP address of the web_server used to administer this instance of yioop. Used to figure out available queue_servers to contact for crawling data
$no_process_links  : bool
When processing recrawl data this says to assume the data has already had its links extracted into a field and so this doesn't have to be done in a separate step
$num_download_attempts  : mixed
Number of attempts to download urls in current fetch batch
$num_multi_curl  : int
For a web crawl only, the number of web pages to download in one go.
$page_processors  : array<string|int, mixed>
An associative array of (mimetype => name of processor class to handle) pairs.
$page_range_request  : int
Maximum number of bytes to download of a webpage
$page_rule_parser  : array<string|int, mixed>
Holds the parsed page rules which will be applied to document summaries before finally storing and indexing them
$plugin_hash  : string
Hash used to keep track of whether $plugin_processors info needs to be changed
$plugin_processors  : array<string|int, mixed>
An associative array of (page processor => array of indexing plugin name associated with the page processor). It is used to determine after a page is processed which plugins' pageProcessing($page, $url) method should be called
$post_max_size  : int
Maximum number of bytes which can be uploaded to the current queue server's web app in one go
$processors  : array<string|int, mixed>
Page processors used by this fetcher
$programming_language_extension  : array<string|int, mixed>
To map programming languages with their extensions
$proxy_servers  : array<string|int, mixed>
An array of proxy servers to use rather than directly downloading web pages from the current machine. If it is the empty array, then we just download directly from the current machine
$queue_servers  : array<string|int, mixed>
Array of Urls or IP addresses of the queue_servers to get sites to crawl from
$recrawl_check_scheduler  : bool
Keeps track of whether during the recrawl we should notify a queue_server scheduler about our progress in mini-indexing documents in the archive
$restrict_sites_by_url  : bool
Says whether the $allowed_sites array is being used or not
$robots_txt  : int
One of a fixed set of values which are used to control to what extent Yioop follows robots.txt files: ALWAYS_FOLLOW_ROBOTS, ALLOW_LANDING_ROBOTS, IGNORE_ROBOTS
$schedule_time  : int
Timestamp from a queue_server of the current schedule of sites to download. This is sent back to the server once this schedule is completed to help the queue server implement crawl-delay if needed.
$scrapers  : array<string|int, mixed>
Contains an array of scrapers used to extract the important content from particular kinds of HTML pages, for example, pages generated by a particular content management system.
$sequence_number  : int
Holds the sequence number of the current schedule received from queue server
$sleep_duration  : string
If a crawl quiescent period is being used with the crawl, then this property will be positive and indicates the duration in seconds of the quiescent period.
$sleep_start  : string
If a crawl quiescent period is being used with the crawl, then this stores the time of day at which that period starts
$summarizer_option  : string
Stores the name of the summarizer used for crawling.
$to_crawl  : array<string|int, mixed>
Contains the list of web pages to crawl from a queue_server
$to_crawl_again  : array<string|int, mixed>
Contains the list of web pages to crawl that failed on first attempt (we give them one more try before bailing on them)
$tor_proxy  : string
If this is not null and a .onion url is detected, then this url will be used as a proxy server to download the .onion url
$total_git_urls  : int
To keep track of total number of Git internal urls
__construct()  : mixed
Sets up the field variables so that crawling can begin
addToCrawlSites()  : mixed
Used to add a set of links from a web page to the array of sites which need to be crawled.
allowedToCrawlSite()  : bool
Checks if url belongs to a list of sites that are allowed to be crawled and that the file type is crawlable
checkArchiveScheduler()  : array<string|int, mixed>
During an archive crawl this method is used to get from the name server a collection of pages to process. The fetcher will later process these and send summaries to various queue_servers.
checkCrawlTime()  : bool
Makes a request of the name server machine to get the timestamp of the currently running crawl to see if it changed
checkScheduler()  : mixed
Get status, current crawl, crawl order, and new site information from the queue_server.
compressAndUnsetSeenUrls()  : string
Computes a string of compressed urls from the seen urls and extracted links destined for the current queue server. Then unsets these values from $this->found_sites
copySiteFields()  : mixed
Copies fields from the array of site data to the $i indexed element of the $summarized_site_pages. This will flatten info in a DOC_INFO subarray. This method is both useful to prepare downloaded site info and to prepare meta info for sub docs that might be produced by an indexing plugin.
cullNoncrawlableSites()  : mixed
Used to remove from the to_crawl urls those that are no longer crawlable because the allowed and disallowed sites have changed.
deleteOldCrawls()  : mixed
Deletes any crawl web archive bundles not in the provided array of crawls
disallowedToCrawlSite()  : bool
Checks if url belongs to a list of sites that aren't supposed to be crawled
downloadPagesArchiveCrawl()  : array<string|int, mixed>
Extracts NUM_MULTI_CURL_PAGES from the current Archive Bundle that is being recrawled.
downloadPagesWebCrawl()  : array<string|int, mixed>
Get a list of urls from the current fetch batch provided by the queue server. Then downloads these pages. Finally, reschedules, if possible, pages that did not successfully get downloaded.
exceedMemoryThreshold()  : bool
Function to check if memory for this fetcher instance is getting low relative to what the system will allow.
getFetchSites()  : array<string|int, mixed>
Prepare an array of up to NUM_MULTI_CURL_PAGES' worth of sites to be downloaded in one go using the to_crawl array. Delete these sites from the to_crawl array.
getPageThumbs()  : mixed
Adds thumbs for websites with a self::THUMB_URL field by downloading the linked-to images and making thumbs from them.
loop()  : mixed
Main loop for the fetcher.
pageProcessor()  : object
Return the fetcher's copy of a page processor for the given mimetype.
processFetchPages()  : array<string|int, mixed>
Processes an array of downloaded web pages with the appropriate page processor.
processSubdocs()  : mixed
The pageProcessing method of an IndexingPlugin generates a self::SUBDOCS array of additional "micro-documents" that might have been in the page. This method adds these documents to the summarized_site_pages and stored_site_pages arrays constructed during the execution of processFetchPages()
pruneLinks()  : mixed
This method attempts to cull from the doc_info struct the best $this->max_links_to_extract links. Currently, this is done by first removing links of filetypes or to sites the crawler is forbidden from crawling.
reschedulePages()  : array
Sorts out pages for which no content was downloaded so that they can be scheduled to be crawled again.
selectCurrentServerAndUpdateIfNeeded()  : mixed
At least once, and while memory is low, selects the next server and sends any fetcher data we have to it.
setCrawlParamsFromArray()  : mixed
Sets parameters for fetching based on provided info struct ($info typically would come from the queue server)
start()  : mixed
This is the function that should be called to get the fetcher to start fetching. Calls init to handle the command-line arguments then enters the fetcher's main loop
updateDomainFilters()  : mixed
Updates the array of domain filters currently loaded into memory based on which BloomFilterFiles are present in WORK_DIRECTORY/data/domain_filters and if they have changed since the current in-memory filters were loaded
updateFoundSites()  : mixed
Updates the $this->found_sites array with data from the most recently downloaded sites. This means updating the following sub arrays: the self::ROBOT_PATHS, self::TO_CRAWL. It checks if there are still more urls to crawl. If so, a mini index is built and the queue server is called with the data.
updateScheduler()  : mixed
Updates the queue_server about sites that have been crawled.
uploadCrawlData()  : mixed
Sends the to-crawl, robot, and index data to the current queue server.

Constants

DEFAULT_POST_MAX_SIZE

Before receiving any data from a queue server's web app this is the default assumed post_max_size in bytes

public mixed DEFAULT_POST_MAX_SIZE = 2000000

DOMAIN_FILTER_GLOB

Domain Filter Glob pattern for Bloom filters used to specify allowed domains. (Only specifies if some filter file of this type exists)

public mixed DOMAIN_FILTER_GLOB = \seekquarry\yioop\configs\DATA_DIR . "/domain_filters/*.ftr"

GIT_URL_CONTINUE

constant indicating Git repository

public mixed GIT_URL_CONTINUE = '@@@@'

HEX_NULL_CHARACTER

An indicator to represent the next position after the access code in a Git tree object

public mixed HEX_NULL_CHARACTER = "\x00"

INDICATOR_NONE

An indicator to tell no actions to be taken

public mixed INDICATOR_NONE = 'none'

REPOSITORY_GIT

constant indicating Git repository

public mixed REPOSITORY_GIT = 'git'

Properties

$active_classifiers

Contains which classifiers to use for the current crawl. Classifiers can be used to label web documents with a meta word if the classifier's threshold is met

public array<string|int, mixed> $active_classifiers

$active_rankers

Contains which classifiers are being used to rank web documents for the current crawl. The score that a classifier gives to a document is used for ranking purposes

public array<string|int, mixed> $active_rankers

$all_file_types

List of all known file extensions including those not used for crawl

public array<string|int, mixed> $all_file_types

$all_git_urls

To store all the internal git urls fetched

public array<string|int, mixed> $all_git_urls

$allow_disallow_cache_time

Microtime used to look up the cached $allowed_sites and $disallowed_sites filtering data structures

public int $allow_disallow_cache_time

$allowed_sites

Web-sites that the crawler can crawl. If used, ONLY these will be crawled

public array<string|int, mixed> $allowed_sites

$arc_dir

For a non-web archive crawl, holds the path to the directory that contains the archive files and their description (web archives have a different structure and are already distributed across machines and fetchers)

public string $arc_dir

$arc_type

For an archive crawl, holds the name of the type of archive being iterated over (this is the class name of the iterator, without the word 'Iterator')

public string $arc_type

$archive_iterator

If a web archive crawl (i.e., a re-crawl) is active, then this field holds the iterator object used to iterate over the archive

public object $archive_iterator

$cache_pages

Whether to cache pages or just the summaries

public bool $cache_pages

$channel

Channel that queue server listens to messages for

public int $channel

$check_crawl_time

The last time the name server was checked for a crawl time

public int $check_crawl_time

$crawl_index

If the crawl_type is self::ARCHIVE_CRAWL, then crawl_index is the timestamp of the existing archive to crawl

public string $crawl_index

$crawl_order

Stores the name of the ordering used to crawl pages. This is used in a switch/case when computing weights of urls to be crawled before sending these new urls back to a queue_server.

public string $crawl_order

$crawl_stat_filename

Name of file used to store fetcher statistics for the current crawl

public string $crawl_stat_filename

$crawl_stat_info

Fetcher statistics for the current crawl

public string $crawl_stat_info

$crawl_time

Timestamp of the current crawl

public int $crawl_time

$crawl_type

Indicates the kind of crawl being performed: self::WEB_CRAWL indicates a new crawl of the web; self::ARCHIVE_CRAWL indicates a crawl of an existing web archive

public string $crawl_type

$current_server

Index into $queue_servers of the server to get the schedule from (or the last one we got the schedule from)

public int $current_server

$db

Reference to a database object. Used since it has directory manipulation functions

public object $db

$debug

Holds the value of a debug message that might have been sent from the command line during the current execution of loop();

public string $debug

$disallowed_sites

Web-sites that the crawler must not crawl

public array<string|int, mixed> $disallowed_sites

$domain_filters

An array of Bloom filters which, if non-empty, will be used to restrict discovered url links. Namely, if the domain of a url is not in any of the filters and is not of the form of a company level domain (cld) or www.cld then it will be pruned. Here a company level domain is of the form some_name.tld where tld is a top level domain, or of the form some_name.country_level_tld.tld if the tld is that of a country. So for site.somewhere.jp, somewhere.jp would be the cld; for site.somewhere.co.jp, somewhere.co.jp would be the cld.

public array<string|int, mixed> $domain_filters
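As a worked example of the company level domain rule just described, the helper below reduces a host to its cld. It is only an approximation for illustration: the list of country tlds is a truncated, assumed subset, and Yioop's real url-handling code is not reproduced here.

```php
<?php
// Sketch of the company level domain (cld) rule described above. The list
// of country tlds is a truncated, illustrative subset.
function companyLevelDomain(string $host): string
{
    $country_tlds = ['jp', 'uk', 'au', 'in', 'br']; // assumed subset
    $parts = explode('.', strtolower($host));
    $n = count($parts);
    if ($n <= 2) {
        return $host; // already of the form some_name.tld
    }
    // some_name.second_level.country_tld, e.g. somewhere.co.jp
    if (in_array($parts[$n - 1], $country_tlds) && strlen($parts[$n - 2]) <= 3) {
        return implode('.', array_slice($parts, -3));
    }
    return implode('.', array_slice($parts, -2)); // e.g. somewhere.jp
}

echo companyLevelDomain("site.somewhere.jp") . "\n";    // somewhere.jp
echo companyLevelDomain("site.somewhere.co.jp") . "\n"; // somewhere.co.jp
```

A discovered url would then be pruned if neither its cld nor www.cld is present in any of the loaded Bloom filters, per the description above.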

$fetcher_num

Which fetcher instance we are (if the fetcher is run as a job and there is more than one)

public string $fetcher_num

$found_sites

Summary information for visited sites that the fetcher hasn't sent to a queue_server yet

public array<string|int, mixed> $found_sites

$hosts_with_errors

An array to keep track of hosts which have had a lot of http errors

public array<string|int, mixed> $hosts_with_errors

$indexed_file_types

List of file extensions supported for the crawl

public array<string|int, mixed> $indexed_file_types

$max_depth

Maximum depth, measured from the seed urls, to which the fetcher should extract new urls

public int $max_depth

$max_description_len

Max number of chars to extract for description from a page to index.

public int $max_description_len

Only words in the description are indexed.

$max_links_to_extract

Maximum number of urls to extract from a single document

public int $max_links_to_extract

$minimum_fetch_loop_time

Fetcher must wait at least this long between multi-curl requests.

public int $minimum_fetch_loop_time

The value below is dynamically determined but is at least as large as MINIMUM_FETCH_LOOP_TIME
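A minimal sketch of what enforcing such a minimum loop time can look like; the constant value and the doWork() stand-in are assumptions made for the example, not Yioop's loop code.

```php
<?php
// Sketch: make each pass of a fetch loop take at least
// MINIMUM_FETCH_LOOP_TIME seconds so multi-curl requests are not issued
// too rapidly. The constant value and doWork() are assumed for the example.
const MINIMUM_FETCH_LOOP_TIME = 5;

function doWork(): void
{
    usleep(random_int(100000, 800000)); // stand-in for download/process work
}

for ($iteration = 0; $iteration < 3; $iteration++) {
    $start_time = microtime(true);
    doWork();
    $elapsed = microtime(true) - $start_time;
    if ($elapsed < MINIMUM_FETCH_LOOP_TIME) {
        // sleep off whatever remains of the minimum loop time
        usleep((int)((MINIMUM_FETCH_LOOP_TIME - $elapsed) * 1000000));
    }
    echo "iteration $iteration took at least " . MINIMUM_FETCH_LOOP_TIME . "s\n";
}
```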

$name_server

Urls or IP address of the web_server used to administer this instance of yioop. Used to figure out available queue_servers to contact for crawling data

public array<string|int, mixed> $name_server

$no_process_links

When processing recrawl data this says to assume the data has already had its links extracted into a field and so this doesn't have to be done in a separate step

public bool $no_process_links

$num_download_attempts

Number of attempts to download urls in current fetch batch

public mixed $num_download_attempts

$num_multi_curl

For a web crawl only, the number of web pages to download in one go.

public int $num_multi_curl

$page_processors

An associative array of (mimetype => name of processor class to handle) pairs.

public array<string|int, mixed> $page_processors

$page_range_request

Maximum number of bytes to download of a webpage

public int $page_range_request

$page_rule_parser

Holds the parsed page rules which will be applied to document summaries before finally storing and indexing them

public array<string|int, mixed> $page_rule_parser

$plugin_hash

Hash used to keep track of whether $plugin_processors info needs to be changed

public string $plugin_hash

$plugin_processors

An associative array of (page processor => array of indexing plugin name associated with the page processor). It is used to determine after a page is processed which plugins' pageProcessing($page, $url) method should be called

public array<string|int, mixed> $plugin_processors

$post_max_size

Maximum number of bytes which can be uploaded to the current queue server's web app in one go

public int $post_max_size

$processors

Page processors used by this fetcher

public array<string|int, mixed> $processors

$programming_language_extension

To map programming languages with their extensions

public array<string|int, mixed> $programming_language_extension

$proxy_servers

An array of proxy servers to use rather than directly downloading web pages from the current machine. If it is the empty array, then we just download directly from the current machine

public array<string|int, mixed> $proxy_servers

$queue_servers

Array of Urls or IP addresses of the queue_servers to get sites to crawl from

public array<string|int, mixed> $queue_servers

$recrawl_check_scheduler

Keeps track of whether during the recrawl we should notify a queue_server scheduler about our progress in mini-indexing documents in the archive

public bool $recrawl_check_scheduler

$restrict_sites_by_url

Says whether the $allowed_sites array is being used or not

public bool $restrict_sites_by_url

$robots_txt

One of a fixed set of values which are used to control to what extent Yioop follows robots.txt files: ALWAYS_FOLLOW_ROBOTS, ALLOW_LANDING_ROBOTS, IGNORE_ROBOTS

public int $robots_txt

$schedule_time

Timestamp from a queue_server of the current schedule of sites to download. This is sent back to the server once this schedule is completed to help the queue server implement crawl-delay if needed.

public int $schedule_time

$scrapers

Contains an array of scrapers used to extract the important content from particular kinds of HTML pages, for example, pages generated by a particular content management system.

public array<string|int, mixed> $scrapers

$sequence_number

Holds the sequence number of the current schedule received from queue server

public int $sequence_number

$sleep_duration

If a crawl quiescent period is being used with the crawl, then this property will be positive and indicates the duration in seconds of the quiescent period.

public string $sleep_duration

$sleep_start

If a crawl quiescent period is being used with the crawl, then this stores the time of day at which that period starts

public string $sleep_start

$summarizer_option

Stores the name of the summarizer used for crawling.

public string $summarizer_option

Possible values are self::BASIC, self::GRAPH_BASED_SUMMARIZER, self::CENTROID_SUMMARIZER and self::CENTROID_WEIGHTED_SUMMARIZER

$to_crawl

Contains the list of web pages to crawl from a queue_server

public array<string|int, mixed> $to_crawl

$to_crawl_again

Contains the list of web pages to crawl that failed on first attempt (we give them one more try before bailing on them)

public array<string|int, mixed> $to_crawl_again

$tor_proxy

If this is not null and a .onion url is detected, then this url will be used as a proxy server to download the .onion url

public string $tor_proxy

$total_git_urls

To keep track of total number of Git internal urls

public int $total_git_urls

Methods

__construct()

Sets up the field variables so that crawling can begin

public __construct() : mixed
Return values
mixed

addToCrawlSites()

Used to add a set of links from a web page to the array of sites which need to be crawled.

public addToCrawlSites(array<string|int, mixed> $link_urls, string $old_url, int $old_weight, int $old_depth, int $num_common) : mixed
Parameters
$link_urls : array<string|int, mixed>

an array of urls to be crawled

$old_url : string

url of page where links came from

$old_weight : int

the weight on the page the link came from (order of importance among links on page)

$old_depth : int

depth of the web page the links came from

$num_common : int

number of company level domains in common between $link_urls and $old_url

Return values
mixed

allowedToCrawlSite()

Checks if url belongs to a list of sites that are allowed to be crawled and that the file type is crawlable

public allowedToCrawlSite(string $url) : bool
Parameters
$url : string

url to check

Return values
bool

whether the url is allowed to be crawled or not

checkArchiveScheduler()

During an archive crawl this method is used to get from the name server a collection of pages to process. The fetcher will later process these and send summaries to various queue_servers.

public checkArchiveScheduler() : array<string|int, mixed>
Return values
array<string|int, mixed>

containing archive page data

checkCrawlTime()

Makes a request of the name server machine to get the timestamp of the currently running crawl to see if it changed

public checkCrawlTime() : bool

If the timestamp has changed, save the rest of the current fetch batch, then load any existing fetch batch for the new crawl; otherwise, set the crawl to empty. Also, handles deleting old crawls on this fetcher machine based on a list of current crawls on the name server.

Return values
bool

true if a fetch batch was loaded due to the time change

checkScheduler()

Get status, current crawl, crawl order, and new site information from the queue_server.

public checkScheduler() : mixed
Return values
mixed

array or bool. If we are doing a web crawl and we still have pages to crawl, then true; if the scheduler page fails to download, then false; otherwise, returns an array of info from the scheduler.

compressAndUnsetSeenUrls()

Computes a string of compressed urls from the seen urls and extracted links destined for the current queue server. Then unsets these values from $this->found_sites

public compressAndUnsetSeenUrls(int $server) : string
Parameters
$server : int

index of queue server to compress and unset urls for

Return values
string

of compressed urls
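A rough sketch of the compress-then-unset pattern this method describes. It uses gzcompress plus base64 for readability; Yioop's own encoding and the array key used here ('SEEN_URLS') are not claimed to match the real implementation.

```php
<?php
// Sketch: pack the urls destined for one queue server into a single
// compressed string, then unset them to free memory. The 'SEEN_URLS' key
// is illustrative, not necessarily Yioop's CrawlConstants value.
function compressAndUnsetSeenUrls(array &$found_sites, int $server): string
{
    $urls = $found_sites['SEEN_URLS'][$server] ?? [];
    $compressed = base64_encode(gzcompress(implode("\n", $urls), 9));
    unset($found_sites['SEEN_URLS'][$server]); // free memory once packaged
    return $compressed;
}

$found_sites = ['SEEN_URLS' => [0 => ["https://a.example/", "https://b.example/"]]];
$packed = compressAndUnsetSeenUrls($found_sites, 0);
// the receiving side reverses the encoding:
print_r(explode("\n", gzuncompress(base64_decode($packed))));
```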

copySiteFields()

Copies fields from the array of site data to the $i indexed element of the $summarized_site_pages. This will flatten info in a DOC_INFO subarray. This method is both useful to prepare downloaded site info and to prepare meta info for sub docs that might be produced by an indexing plugin.

public copySiteFields(int $i, array<string|int, mixed> $site, array<string|int, mixed> &$summarized_site_pages[, array<string|int, mixed> $exclude_fields = [] ]) : mixed
Parameters
$i : int

index to copy to

$site : array<string|int, mixed>

web page info to copy

$summarized_site_pages : array<string|int, mixed>

array of summaries of web pages

$exclude_fields : array<string|int, mixed> = []

an array of fields not to copy

Return values
mixed

cullNoncrawlableSites()

Used to remove from the to_crawl urls those that are no longer crawlable because the allowed and disallowed sites have changed.

public cullNoncrawlableSites() : mixed
Return values
mixed

deleteOldCrawls()

Deletes any crawl web archive bundles not in the provided array of crawls

public deleteOldCrawls(array<string|int, mixed> &$still_active_crawls) : mixed
Parameters
$still_active_crawls : array<string|int, mixed>

those crawls which should not be deleted, so all others will be deleted

Tags
see
loop()
Return values
mixed

disallowedToCrawlSite()

Checks if url belongs to a list of sites that aren't supposed to be crawled

public disallowedToCrawlSite(string $url) : bool
Parameters
$url : string

url to check

Return values
bool

whether it shouldn't be crawled

downloadPagesArchiveCrawl()

Extracts NUM_MULTI_CURL_PAGES from the current Archive Bundle that is being recrawled.

public downloadPagesArchiveCrawl() : array<string|int, mixed>
Return values
array<string|int, mixed>

an associative array of web pages and meta data from the archive bundle being iterated over

downloadPagesWebCrawl()

Get a list of urls from the current fetch batch provided by the queue server. Then downloads these pages. Finally, reschedules, if possible, pages that did not successfully get downloaded.

public downloadPagesWebCrawl() : array<string|int, mixed>
Return values
array<string|int, mixed>

an associative array of web pages and meta data fetched from the internet

exceedMemoryThreshold()

Function to check if memory for this fetcher instance is getting low relative to what the system will allow.

public exceedMemoryThreshold() : bool
Return values
bool

whether available memory is getting low

getFetchSites()

Prepare an array of up to NUM_MULTI_CURL_PAGES' worth of sites to be downloaded in one go using the to_crawl array. Delete these sites from the to_crawl array.

public getFetchSites() : array<string|int, mixed>
Return values
array<string|int, mixed>

sites which are ready to be downloaded

getPageThumbs()

Adds thumbs for websites with a self::THUMB_URL field by downloading the linked-to images and making thumbs from them.

public getPageThumbs(array<string|int, mixed> &$sites) : mixed
Parameters
$sites : array<string|int, mixed>

associative array of web site information to add thumbs for. At least one site in the array should have a self::THUMB_URL field that we want to have the thumb of

Return values
mixed
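A sketch of the thumbnail step using PHP's GD extension; the 128 pixel width, the example url, and the output path are assumptions made for the example, and error handling is kept minimal.

```php
<?php
// Sketch: turn downloaded image bytes into a small thumbnail with GD,
// roughly what making a thumb for a site's THUMB_URL involves. Requires
// the gd extension; the 128px width is an arbitrary choice.
function makeThumb(string $image_bytes, int $thumb_width = 128): ?string
{
    $img = @imagecreatefromstring($image_bytes);
    if ($img === false) {
        return null; // bytes were not a decodable image
    }
    $thumb = imagescale($img, $thumb_width); // height scales proportionally
    imagedestroy($img);
    if ($thumb === false) {
        return null;
    }
    ob_start();
    imagejpeg($thumb, null, 75); // capture jpeg bytes instead of writing a file
    imagedestroy($thumb);
    return ob_get_clean();
}

$bytes = file_get_contents("https://www.example.com/logo.png"); // illustrative url
if ($bytes !== false && ($thumb = makeThumb($bytes)) !== null) {
    file_put_contents("/tmp/thumb.jpg", $thumb);
}
```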

loop()

Main loop for the fetcher.

public loop() : mixed

Checks for stop message, checks queue server if crawl has changed and for new pages to crawl. Loop gets a group of next pages to crawl if there are pages left to crawl (otherwise sleep 5 seconds). It downloads these pages, deduplicates them, and updates the found site info with the result before looping again.

Return values
mixed

pageProcessor()

Return the fetcher's copy of a page processor for the given mimetype.

public pageProcessor(string $type) : object
Parameters
$type : string

mimetype we want a processor for

Return values
object

a page processor for that mimetype, or false if that mimetype can't be handled
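The lookup itself amounts to a mimetype-to-class map, sketched below with stand-in processor classes; Yioop's actual processor classes and namespaces are not reproduced here.

```php
<?php
// Sketch of a mimetype => page processor lookup. HtmlProcessor and
// TextProcessor are stand-ins, not Yioop's actual processor classes.
class HtmlProcessor { /* ... */ }
class TextProcessor { /* ... */ }

$page_processors = [
    'text/html'  => HtmlProcessor::class,
    'text/plain' => TextProcessor::class,
];

/** @return object|false a processor instance, or false if unhandled */
function pageProcessor(array $page_processors, string $type)
{
    // ignore any charset suffix, e.g. "text/html; charset=UTF-8"
    $type = strtolower(trim(explode(';', $type)[0]));
    if (!isset($page_processors[$type])) {
        return false;
    }
    $class = $page_processors[$type];
    return new $class();
}

var_dump(pageProcessor($page_processors, "text/html; charset=UTF-8")); // object
var_dump(pageProcessor($page_processors, "application/x-unknown"));    // false
```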

processFetchPages()

Processes an array of downloaded web pages with the appropriate page processor.

public processFetchPages(array<string|int, mixed> $site_pages) : array<string|int, mixed>

Summary data is extracted from each non robots.txt file in the array. Disallowed paths and crawl-delays are extracted from robots.txt files.

Parameters
$site_pages : array<string|int, mixed>

a collection of web pages to process

Return values
array<string|int, mixed>

summary data extracted from these pages

processSubdocs()

The pageProcessing method of an IndexingPlugin generates a self::SUBDOCS array of additional "micro-documents" that might have been in the page. This method adds these documents to the summarized_site_pages and stored_site_pages arrays constructed during the execution of processFetchPages()

public processSubdocs(int &$i, array<string|int, mixed> $site, array<string|int, mixed> &$summarized_site_pages) : mixed
Parameters
$i : int

index to begin adding subdocs at

$site : array<string|int, mixed>

web page that subdocs were from and from which some subdoc summary info is copied

$summarized_site_pages : array<string|int, mixed>

array of summaries of web pages

Return values
mixed

pruneLinks()

This method attempts to cull from the doc_info struct the best $this->max_links_to_extract links. Currently, this is done by first removing links of filetypes or to sites the crawler is forbidden from crawling.

public pruneLinks(array<string|int, mixed> &$doc_info[, string $field = CrawlConstants::LINKS ], int $member_cache_time) : mixed

Then a crude estimate of the information contained in each link, strlen(gzip(text)), is used to extract the best remaining links (see the sketch following this method entry).

Parameters
$doc_info : array<string|int, mixed>

an array with a CrawlConstants::LINKS subarray. This subarray in turn contains url => text pairs.

$field : string = CrawlConstants::LINKS

field for links default is CrawlConstants::LINKS

$member_cache_time : int

says how long allowed and disallowed url info should be cached by urlMemberSiteArray

Return values
mixed
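Here is the kind of strlen(gzip(text)) ranking the description above refers to, as a standalone sketch; the scoring and cutoff are illustrative and do not reproduce Yioop's exact pruning code.

```php
<?php
// Sketch of the strlen(gzip(text)) heuristic mentioned for pruneLinks():
// rank each link's anchor text by its compressed size (a crude proxy for
// information content) and keep only the best few.
function pruneLinksSketch(array $links, int $max_links_to_extract): array
{
    $scores = [];
    foreach ($links as $url => $text) {
        $scores[$url] = strlen(gzdeflate($text, 9)); // bigger ~ more information
    }
    arsort($scores); // highest score first
    $kept = array_slice(array_keys($scores), 0, $max_links_to_extract);
    return array_intersect_key($links, array_flip($kept));
}

$links = [
    "https://a.example/" => "home",
    "https://b.example/paper" => "An in-depth report on crawler politeness",
    "https://c.example/x" => "click here",
];
print_r(pruneLinksSketch($links, 2));
```

Compressed length penalizes repetitive or very short anchor text, which is why it serves as a cheap stand-in for information content.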

reschedulePages()

Sorts out pages for which no content was downloaded so that they can be scheduled to be crawled again.

public reschedulePages(array<string|int, mixed> &$site_pages) : array
Parameters
$site_pages : array<string|int, mixed>

pages to sort

Return values
array

an array consisting of two arrays: downloaded pages and not-downloaded pages.
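A minimal sketch of that split, keyed on whether any page content came back; the 'PAGE' field name is an assumption made for the example.

```php
<?php
// Sketch: split a batch of fetched site records into pages that got
// content and pages that should be rescheduled. 'PAGE' is an illustrative
// field name, not necessarily Yioop's CrawlConstants value.
function reschedulePagesSketch(array $site_pages): array
{
    $downloaded = [];
    $not_downloaded = [];
    foreach ($site_pages as $url => $site) {
        if (!empty($site['PAGE'])) {
            $downloaded[$url] = $site;
        } else {
            $not_downloaded[$url] = $site; // give it one more try later
        }
    }
    return [$downloaded, $not_downloaded];
}

[$ok, $retry] = reschedulePagesSketch([
    "https://a.example/" => ['PAGE' => "<html>...</html>"],
    "https://b.example/" => ['PAGE' => ""],
]);
echo count($ok) . " downloaded, " . count($retry) . " to reschedule\n";
```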

selectCurrentServerAndUpdateIfNeeded()

At least once, and while memory is low, selects the next server and sends any fetcher data we have to it.

public selectCurrentServerAndUpdateIfNeeded(bool $at_least_current_server) : mixed
Parameters
$at_least_current_server : bool

whether to send the site info to at least one queue server or to send only if memory is above threshold. Only in the latter case is the next server advanced.

Return values
mixed

setCrawlParamsFromArray()

Sets parameters for fetching based on provided info struct ($info typically would come from the queue server)

public setCrawlParamsFromArray(array<string|int, mixed> &$info) : mixed
Parameters
$info : array<string|int, mixed>

struct with info about the kind of crawl, timestamp of index, crawl order, etc.

Return values
mixed

start()

This is the function that should be called to get the fetcher to start fetching. Calls init to handle the command-line arguments then enters the fetcher's main loop

public start() : mixed
Return values
mixed

updateDomainFilters()

Updates the array of domain filters currently loaded into memory based on which BloomFilterFiles are present in WORK_DIRECTORY/data/domain_filters and if they have changed since the current in-memory filters were loaded

public updateDomainFilters() : mixed
Return values
mixed
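A sketch of the reload-on-change check this method describes, using glob() and file modification times; the directory path is a placeholder and loadFilter() stands in for actually reading a Yioop BloomFilterFile.

```php
<?php
// Sketch: reload domain filters only when the set of *.ftr files, or their
// modification times, has changed. loadFilter() is a placeholder; reading
// Yioop's BloomFilterFile format is outside the scope of this example.
$filter_glob = "/path/to/WORK_DIRECTORY/data/domain_filters/*.ftr"; // placeholder

function currentFilterState(string $glob): array
{
    $state = [];
    foreach (glob($glob) ?: [] as $path) {
        $state[$path] = filemtime($path); // track path => last modified time
    }
    return $state;
}

function loadFilter(string $path): array
{
    return ['path' => $path]; // stand-in for a real Bloom filter object
}

$loaded_state = [];   // state recorded when filters were last loaded
$domain_filters = [];
$new_state = currentFilterState($filter_glob);
if ($new_state !== $loaded_state) {
    $domain_filters = array_map('loadFilter', array_keys($new_state));
    $loaded_state = $new_state;
    echo "reloaded " . count($domain_filters) . " domain filter(s)\n";
}
```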

updateFoundSites()

Updates the $this->found_sites array with data from the most recently downloaded sites. This means updating the following sub arrays: the self::ROBOT_PATHS, self::TO_CRAWL. It checks if there are still more urls to crawl. If so, a mini index is built and the queue server is called with the data.

public updateFoundSites(array<string|int, mixed> $sites[, bool $force_send = false ]) : mixed
Parameters
$sites : array<string|int, mixed>

site data to use for the update

$force_send : bool = false

whether to force send data back to queue_server or rely on usual thresholds before sending

Return values
mixed

updateScheduler()

Updates the queue_server about sites that have been crawled.

public updateScheduler(string $server[, bool $send_robots = false ]) : mixed

This method is called if there are currently no more sites to crawl. It compresses and does a post request to send the page summary data, robot data, and to-crawl url data back to the server. In the event that the server doesn't acknowledge, it loops and tries again after a delay until the post is successful. At this point, memory for this data is freed.

Parameters
$server : string

index of queue server to update

$send_robots : bool = false

whether to send robots.txt data if present

Return values
mixed

uploadCrawlData()

Sends the to-crawl, robot, and index data to the current queue server.

public uploadCrawlData(string $queue_server, array<string|int, mixed> $byte_counts, array<string|int, mixed> &$post_data) : mixed

If this data is more than post_max_size, it splits it into chunks which are then reassembled by the queue server web app before being put into the appropriate schedule sub-directory.

Parameters
$queue_server : string

url of the current queue server

$byte_counts : array<string|int, mixed>

has four fields: TOTAL, ROBOT, SCHEDULE, INDEX. These give the number of bytes overall for the 'data' field of $post_data and for each of these components.

$post_data : array<string|int, mixed>

data to be uploaded to the queue server web app

Return values
mixed
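The chunking idea can be sketched as below. The endpoint url, the part/num_parts field names, and the base64 step are assumptions made for the example and do not reproduce the actual FetchController protocol.

```php
<?php
// Sketch of the chunking behind uploadCrawlData(): if the payload exceeds
// the server's post_max_size, split it and POST one piece at a time. The
// endpoint and field names are assumptions, not Yioop's actual protocol.
function uploadInChunks(string $queue_server, string $data, int $post_max_size): void
{
    // 0.7 leaves slack for base64 growth and the other POST fields
    $chunk_size = max(1, (int)(0.7 * $post_max_size));
    $chunks = str_split($data, $chunk_size);
    $num_parts = count($chunks);
    foreach ($chunks as $i => $chunk) {
        $post = http_build_query([
            'part' => $i + 1,
            'num_parts' => $num_parts,
            'data' => base64_encode($chunk),
        ]);
        $ch = curl_init($queue_server);
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, $post);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $response = curl_exec($ch);
        curl_close($ch);
        // a real fetcher retries after a delay until the server acknowledges
        echo "sent part " . ($i + 1) . "/$num_parts (" . strlen($chunk) . " bytes)\n";
    }
}

uploadInChunks("https://queue.example.com/", str_repeat("x", 5000000), 2000000);
```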
