Fetcher
in package
implements
CrawlConstants
This class is responsible for fetching web pages for the SeekQuarry/Yioop search engine
Fetcher periodically queries the queue server asking for web pages to fetch. It gets at most MAX_FETCH_SIZE many web pages from the queue_server in one go. It then fetches these pages. Pages are fetched in batches of NUM_MULTI_CURL_PAGES many pages. Once downloaded, the fetcher sends summaries back to the machine on which the queue_server lives. It does this by making a request of the web server on that machine and POSTing the data to the yioop web app. This data is handled by the FetchController class. The summary data can include up to three things: (1) robots.txt data, (2) summaries of each web page downloaded in the batch, and (3) a list of future urls to add to the to-crawl queue.
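The batch-download step can be pictured with a small, self-contained sketch. This is not the code Yioop itself uses for downloading; the urls, curl options, and error handling below are illustrative only.

```php
<?php
// Sketch: download a batch of pages in one go with curl_multi, roughly
// how a fetcher grabs NUM_MULTI_CURL_PAGES pages at a time.
// The url list and options are illustrative, not Yioop's.
$urls = ["https://example.com/", "https://example.org/"];
$multi = curl_multi_init();
$handles = [];
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    curl_multi_add_handle($multi, $ch);
    $handles[$url] = $ch;
}
do { // drive all transfers until every handle has finished
    $status = curl_multi_exec($multi, $running);
    if ($running) {
        curl_multi_select($multi);
    }
} while ($running && $status == CURLM_OK);
$pages = [];
foreach ($handles as $url => $ch) {
    $pages[$url] = curl_multi_getcontent($ch); // false/"" if download failed
    curl_multi_remove_handle($multi, $ch);
    curl_close($ch);
}
curl_multi_close($multi);
```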
Interfaces, Classes, Traits and Enums
- CrawlConstants
- Shared constants and enums used by components that are involved in the crawling process
Table of Contents
- DEFAULT_POST_MAX_SIZE = 2000000
- Before receiving any data from a queue server's web app this is the default assumed post_max_size in bytes
- DOMAIN_FILTER_GLOB = \seekquarry\yioop\configs\DATA_DIR . "/domain_filters/*.ftr"
- Domain Filter Glob pattern for Bloom filters used to specify allowed domains. (Only specifies if some filter file of this type exists)
- GIT_URL_CONTINUE = '@@@@'
- constant indicating Git repository
- HEX_NULL_CHARACTER = "\x00"
- An indicator representing the next position after the access code in a Git tree object
- INDICATOR_NONE = 'none'
- An indicator that no actions are to be taken
- REPOSITORY_GIT = 'git'
- constant indicating Git repository
- $active_classifiers : array<string|int, mixed>
- Contains which classifiers to use for the current crawl. Classifiers can be used to label web documents with a meta word if the classifier's threshold is met
- $active_rankers : array<string|int, mixed>
- Contains which classifiers are used to rank web documents for the current crawl. The score that the classifier gives to a document is used for ranking purposes
- $all_file_types : array<string|int, mixed>
- List of all known file extensions including those not used for crawl
- $all_git_urls : array<string|int, mixed>
- To store all the internal git urls fetched
- $allow_disallow_cache_time : int
- Microtime used to look up cached $allowed_sites and $disallowed_sites filtering data structures
- $allowed_sites : array<string|int, mixed>
- Web-sites that the crawler can crawl. If used, ONLY these will be crawled
- $arc_dir : string
- For a non-web archive crawl, holds the path to the directory that contains the archive files and their description (web archives have a different structure and are already distributed across machines and fetchers)
- $arc_type : string
- For an archive crawl, holds the name of the type of archive being iterated over (this is the class name of the iterator, without the word 'Iterator')
- $archive_iterator : object
- If a web archive crawl (i.e. a re-crawl) is active then this field holds the iterator object used to iterate over the archive
- $cache_pages : bool
- Whether to cache pages or just the summaries
- $channel : int
- Channel that the queue server listens to for messages
- $check_crawl_time : int
- The last time the name server was checked for a crawl time
- $crawl_index : string
- If the crawl_type is self::ARCHIVE_CRAWL, then crawl_index is the timestamp of the existing archive to crawl
- $crawl_order : string
- Stores the name of the ordering used to crawl pages. This is used in a switch/case when computing weights of urls to be crawled before sending these new urls back to a queue_server.
- $crawl_stat_filename : string
- Name of file used to store fetcher statistics for the current crawl
- $crawl_stat_info : string
- Fetcher statistics for the current crawl
- $crawl_time : int
- Timestamp of the current crawl
- $crawl_type : string
- Indicates the kind of crawl being performed: self::WEB_CRAWL indicates a new crawl of the web; self::ARCHIVE_CRAWL indicates a crawl of an existing web archive
- $current_server : int
- Index into $queue_servers of the server to get the schedule from (or the last one we got a schedule from)
- $db : object
- Reference to a database object. Used since it has directory manipulation functions
- $debug : string
- Holds the value of a debug message that might have been sent from the command line during the current execution of loop();
- $disallowed_sites : array<string|int, mixed>
- Web-sites that the crawler must not crawl
- $domain_filters : array<string|int, mixed>
- An array of Bloom filters which, if non-empty, will be used to restrict discovered url links. Namely, if the domain of a url is not in any of the filters and is not of the form of a company level domain (cld) or www.cld then it will be pruned. Here a company level domain is of the form some_name.tld where tld is a top level domain, or of the form some_name.country_level_tld.tld if the tld is that of a country. So for site.somewhere.jp, somewhere.jp would be the cld; for site.somewhere.co.jp, somewhere.co.jp would be the cld.
- $fetcher_num : string
- Which fetcher instance we are (if the fetcher is run as a job and there is more than one)
- $found_sites : array<string|int, mixed>
- Summary information for visited sites that the fetcher hasn't sent to a queue_server yet
- $hosts_with_errors : array<string|int, mixed>
- An array to keep track of hosts which have had a lot of http errors
- $indexed_file_types : array<string|int, mixed>
- List of file extensions supported for the crawl
- $max_depth : int
- Maximum depth, relative to the seed urls, to which the fetcher should extract links
- $max_description_len : int
- Max number of chars to extract for description from a page to index.
- $max_links_to_extract : int
- Maximum number of urls to extract from a single document
- $minimum_fetch_loop_time : int
- Fetcher must wait at least this long between multi-curl requests.
- $name_server : array<string|int, mixed>
- Urls or IP address of the web_server used to administer this instance of yioop. Used to figure out available queue_servers to contact for crawling data
- $no_process_links : bool
- When processing recrawl data this says to assume the data has already had its links extracted into a field and so this doesn't have to be done in a separate step
- $num_download_attempts : mixed
- Number of attempts to download urls in current fetch batch
- $num_multi_curl : int
- For a web crawl only, the number of web pages to download in one go.
- $page_processors : array<string|int, mixed>
- An associative array of (mimetype => name of processor class to handle) pairs.
- $page_range_request : int
- Maximum number of bytes to download of a webpage
- $page_rule_parser : array<string|int, mixed>
- Holds the parsed page rules which will be applied to document summaries before finally storing and indexing them
- $plugin_hash : string
- Hash used to keep track of whether $plugin_processors info needs to be changed
- $plugin_processors : array<string|int, mixed>
- An associative array of (page processor => array of indexing plugin names associated with the page processor). It is used to determine, after a page is processed, which plugins' pageProcessing($page, $url) method should be called
- $post_max_size : int
- Maximum number of bytes which can be uploaded to the current queue server's web app in one go
- $processors : array<string|int, mixed>
- Page processors used by this fetcher
- $programming_language_extension : array<string|int, mixed>
- To map programming languages with their extensions
- $proxy_servers : array<string|int, mixed>
- An array of proxy servers to use rather than downloading web pages directly from the current machine. If it is the empty array, then we just download directly from the current machine
- $queue_servers : array<string|int, mixed>
- Array of Urls or IP addresses of the queue_servers to get sites to crawl from
- $recrawl_check_scheduler : bool
- Keeps track of whether during the recrawl we should notify a queue_server scheduler about our progress in mini-indexing documents in the archive
- $restrict_sites_by_url : bool
- Says whether the $allowed_sites array is being used or not
- $robots_txt : int
- One of a fixed set of values which are used to control to what extent Yioop follows robots.txt files: ALWAYS_FOLLOW_ROBOTS, ALLOW_LANDING_ROBOTS, IGNORE_ROBOTS
- $schedule_time : int
- Timestamp from a queue_server of the current schedule of sites to download. This is sent back to the server once this schedule is completed to help the queue server implement crawl-delay if needed.
- $scrapers : array<string|int, mixed>
- Contains an array of scrapers used to extract the important content from particular kinds of HTML pages, for example, pages generated by a particular content management system.
- $sequence_number : int
- Holds the sequence number of the current schedule received from queue server
- $sleep_duration : string
- If a crawl quiescent period is being used with the crawl, then this property will be positive and indicates the duration in seconds of the quiescent period.
- $sleep_start : string
- If a crawl quiescent period is being used with the crawl, then this stores the time of day at which that period starts
- $summarizer_option : string
- Stores the name of the summarizer used for crawling.
- $to_crawl : array<string|int, mixed>
- Contains the list of web pages to crawl from a queue_server
- $to_crawl_again : array<string|int, mixed>
- Contains the list of web pages to crawl that failed on first attempt (we give them one more try before bailing on them)
- $tor_proxy : string
- If this is not null and a .onion url is detected then this url will be used as a proxy server to download the .onion url
- $total_git_urls : int
- To keep track of total number of Git internal urls
- __construct() : mixed
- Sets up the field variables so that crawling can begin
- addToCrawlSites() : mixed
- Used to add a set of links from a web page to the array of sites which need to be crawled.
- allowedToCrawlSite() : bool
- Checks if url belongs to a list of sites that are allowed to be crawled and that the file type is crawlable
- checkArchiveScheduler() : array<string|int, mixed>
- During an archive crawl this method is used to get from the name server a collection of pages to process. The fetcher will later process these and send summaries to various queue_servers.
- checkCrawlTime() : bool
- Makes a request of the name server machine to get the timestamp of the currently running crawl to see if it changed
- checkScheduler() : mixed
- Get status, current crawl, crawl order, and new site information from the queue_server.
- compressAndUnsetSeenUrls() : string
- Computes a string of compressed urls from the seen urls and extracted links destined for the current queue server. Then unsets these values from $this->found_sites
- copySiteFields() : mixed
- Copies fields from the array of site data to the $i indexed element of the $summarized_site_pages. This will flatten info in a DOC_INFO subarray. This method is both useful to prepare downloaded site info and to prepare meta info for sub docs that might be produced by an indexing plugin.
- cullNoncrawlableSites() : mixed
- Used to remove from the to_crawl urls those that are no longer crawlable because the allowed and disallowed sites have changed.
- deleteOldCrawls() : mixed
- Deletes any crawl web archive bundles not in the provided array of crawls
- disallowedToCrawlSite() : bool
- Checks if url belongs to a list of sites that aren't supposed to be crawled
- downloadPagesArchiveCrawl() : array<string|int, mixed>
- Extracts NUM_MULTI_CURL_PAGES from the current Archive Bundle that is being recrawled.
- downloadPagesWebCrawl() : array<string|int, mixed>
- Get a list of urls from the current fetch batch provided by the queue server. Then downloads these pages. Finally, reschedules, if possible, pages that did not successfully get downloaded.
- exceedMemoryThreshold() : bool
- Function to check if memory for this fetcher instance is getting low relative to what the system will allow.
- getFetchSites() : array<string|int, mixed>
- Prepare an array of up to NUM_MULTI_CURL_PAGES' worth of sites to be downloaded in one go using the to_crawl array. Delete these sites from the to_crawl array.
- getPageThumbs() : mixed
- Adds thumbs for websites with a self::THUMB_URL field by downloading the linked-to images and making thumbs from them.
- loop() : mixed
- Main loop for the fetcher.
- pageProcessor() : object
- Return the fetcher's copy of a page processor for the given mimetype.
- processFetchPages() : array<string|int, mixed>
- Processes an array of downloaded web pages with the appropriate page processor.
- processSubdocs() : mixed
- The pageProcessing method of an IndexingPlugin generates a self::SUBDOCS array of additional "micro-documents" that might have been in the page. This method adds these documents to the summarized_site_pages and stored_site_pages arrays constructed during the execution of processFetchPages()
- pruneLinks() : mixed
- This method attempts to cull from the doc_info struct the best $this->max_links_to_extract links. Currently, this is done by first removing links whose filetype or site the crawler is forbidden from crawling.
- reschedulePages() : array<string|int, mixed>
- Sorts out pages for which no content was downloaded so that they can be scheduled to be crawled again.
- selectCurrentServerAndUpdateIfNeeded() : mixed
- At least once, and while memory is low, selects the next server and sends any fetcher data we have to it.
- setCrawlParamsFromArray() : mixed
- Sets parameters for fetching based on provided info struct ($info typically would come from the queue server)
- start() : mixed
- This is the function that should be called to get the fetcher to start fetching. Calls init to handle the command-line arguments then enters the fetcher's main loop
- updateDomainFilters() : mixed
- Updates the array of domain filters currently loaded into memory based on which BloomFilterFiles are present in WORK_DIRECTORY/data/domain_filters and if they have changed since the current in-memory filters were loaded
- updateFoundSites() : mixed
- Updates the $this->found_sites array with data from the most recently downloaded sites. This means updating the following sub arrays: the self::ROBOT_PATHS, self::TO_CRAWL. It checks if there are still more urls to crawl. If so, a mini index is built and the queue server is called with the data.
- updateScheduler() : mixed
- Updates the queue_server about sites that have been crawled.
- uploadCrawlData() : mixed
- Sends to-crawl, robot, and index data to the current queue server.
Constants
DEFAULT_POST_MAX_SIZE
Before receiving any data from a queue server's web app this is the default assumed post_max_size in bytes
public
mixed
DEFAULT_POST_MAX_SIZE
= 2000000
DOMAIN_FILTER_GLOB
Domain Filter Glob pattern for Bloom filters used to specify allowed domains. (Only specifies if some filter file of this type exists)
public
mixed
DOMAIN_FILTER_GLOB
= \seekquarry\yioop\configs\DATA_DIR . "/domain_filters/*.ftr"
GIT_URL_CONTINUE
constant indicating Git repository
public
mixed
GIT_URL_CONTINUE
= '@@@@'
HEX_NULL_CHARACTER
An indicator representing the next position after the access code in a Git tree object
public
mixed
HEX_NULL_CHARACTER
= "\x00"
INDICATOR_NONE
An indicator that no actions are to be taken
public
mixed
INDICATOR_NONE
= 'none'
REPOSITORY_GIT
constant indicating Git repository
public
mixed
REPOSITORY_GIT
= 'git'
Properties
$active_classifiers
Contains which classifiers to use for the current crawl. Classifiers can be used to label web documents with a meta word if the classifier's threshold is met
public
array<string|int, mixed>
$active_classifiers
$active_rankers
Contains which classifiers are used to rank web documents for the current crawl. The score that the classifier gives to a document is used for ranking purposes
public
array<string|int, mixed>
$active_rankers
$all_file_types
List of all known file extensions including those not used for crawl
public
array<string|int, mixed>
$all_file_types
$all_git_urls
To store all the internal git urls fetched
public
array<string|int, mixed>
$all_git_urls
$allow_disallow_cache_time
Microtime used to look up cached $allowed_sites and $disallowed_sites filtering data structures
public
int
$allow_disallow_cache_time
$allowed_sites
Web-sites that the crawler can crawl. If used, ONLY these will be crawled
public
array<string|int, mixed>
$allowed_sites
$arc_dir
For a non-web archive crawl, holds the path to the directory that contains the archive files and their description (web archives have a different structure and are already distributed across machines and fetchers)
public
string
$arc_dir
$arc_type
For an archive crawl, holds the name of the type of archive being iterated over (this is the class name of the iterator, without the word 'Iterator')
public
string
$arc_type
$archive_iterator
If a web archive crawl (i.e. a re-crawl) is active then this field holds the iterator object used to iterate over the archive
public
object
$archive_iterator
$cache_pages
Whether to cache pages or just the summaries
public
bool
$cache_pages
$channel
Channel that the queue server listens to for messages
public
int
$channel
$check_crawl_time
The last time the name server was checked for a crawl time
public
int
$check_crawl_time
$crawl_index
If the crawl_type is self::ARCHIVE_CRAWL, then crawl_index is the timestamp of the existing archive to crawl
public
string
$crawl_index
$crawl_order
Stores the name of the ordering used to crawl pages. This is used in a switch/case when computing weights of urls to be crawled before sending these new urls back to a queue_server.
public
string
$crawl_order
$crawl_stat_filename
Name of file used to store fetcher statistics for the current crawl
public
string
$crawl_stat_filename
$crawl_stat_info
Fetcher statistics for the current crawl
public
string
$crawl_stat_info
$crawl_time
Timestamp of the current crawl
public
int
$crawl_time
$crawl_type
Indicates the kind of crawl being performed: self::WEB_CRAWL indicates a new crawl of the web; self::ARCHIVE_CRAWL indicates a crawl of an existing web archive
public
string
$crawl_type
$current_server
Index into $queue_servers of the server to get the schedule from (or the last one we got a schedule from)
public
int
$current_server
$db
Reference to a database object. Used since it has directory manipulation functions
public
object
$db
$debug
Holds the value of a debug message that might have been sent from the command line during the current execution of loop();
public
string
$debug
$disallowed_sites
Web-sites that the crawler must not crawl
public
array<string|int, mixed>
$disallowed_sites
$domain_filters
An array of Bloom filters which, if non-empty, will be used to restrict discovered url links. Namely, if the domain of a url is not in any of the filters and is not of the form of a company level domain (cld) or www.cld then it will be pruned. Here a company level domain is of the form some_name.tld where tld is a top level domain, or of the form some_name.country_level_tld.tld if the tld is that of a country. So for site.somewhere.jp, somewhere.jp would be the cld; for site.somewhere.co.jp, somewhere.co.jp would be the cld. (A simplified sketch of this rule follows this entry.)
public
array<string|int, mixed>
$domain_filters
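The company level domain rule above can be illustrated with a small hypothetical helper. This is not the parsing code Yioop uses, and the country-tld test is a deliberately crude heuristic.

```php
<?php
// Hypothetical sketch of extracting a company level domain (cld) from a
// host name, following the rule described for $domain_filters above.
function companyLevelDomain(string $host): string
{
    $parts = explode(".", $host);
    $num = count($parts);
    if ($num <= 2) {
        return $host; // already of the form some_name.tld
    }
    // crude heuristic: a two-letter tld preceded by a short label such as
    // "co" or "com" is treated as a country tld with a second level
    if (strlen($parts[$num - 1]) == 2 && strlen($parts[$num - 2]) <= 3) {
        return implode(".", array_slice($parts, -3));
    }
    return implode(".", array_slice($parts, -2));
}
// companyLevelDomain("site.somewhere.jp")    => "somewhere.jp"
// companyLevelDomain("site.somewhere.co.jp") => "somewhere.co.jp"
```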
$fetcher_num
Which fetcher instance we are (if the fetcher is run as a job and there is more than one)
public
string
$fetcher_num
$found_sites
Summary information for visited sites that the fetcher hasn't sent to a queue_server yet
public
array<string|int, mixed>
$found_sites
$hosts_with_errors
An array to keep track of hosts which have had a lot of http errors
public
array<string|int, mixed>
$hosts_with_errors
$indexed_file_types
List of file extensions supported for the crawl
public
array<string|int, mixed>
$indexed_file_types
$max_depth
Maximum depth, relative to the seed urls, to which the fetcher should extract links
public
int
$max_depth
$max_description_len
Max number of chars to extract for description from a page to index.
public
int
$max_description_len
Only words in the description are indexed.
$max_links_to_extract
Maximum number of urls to extract from a single document
public
int
$max_links_to_extract
$minimum_fetch_loop_time
Fetcher must wait at least this long between multi-curl requests.
public
int
$minimum_fetch_loop_time
The value below is dynamically determined but is at least as large as MINIMUM_FETCH_LOOP_TIME
$name_server
Urls or IP address of the web_server used to administer this instance of yioop. Used to figure out available queue_servers to contact for crawling data
public
array<string|int, mixed>
$name_server
$no_process_links
When processing recrawl data this says to assume the data has already had its links extracted into a field and so this doesn't have to be done in a separate step
public
bool
$no_process_links
$num_download_attempts
Number of attempts to download urls in current fetch batch
public
mixed
$num_download_attempts
$num_multi_curl
For a web crawl only, the number of web pages to download in one go.
public
int
$num_multi_curl
$page_processors
An associative array of (mimetype => name of processor class to handle) pairs.
public
array<string|int, mixed>
$page_processors
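For illustration, such a map might be shaped as shown below; the class names are examples, and the actual map is built from Yioop's configuration.

```php
<?php
// Illustrative shape of a (mimetype => processor class) map; the real
// contents come from Yioop's configuration, not this hard-coded list.
$page_processors = [
    "text/html"       => "HtmlProcessor",
    "text/plain"      => "TextProcessor",
    "application/pdf" => "PdfProcessor",
    "image/png"       => "PngProcessor",
];
```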
$page_range_request
Maximum number of bytes to download of a webpage
public
int
$page_range_request
$page_rule_parser
Holds the parsed page rules which will be applied to document summaries before finally storing and indexing them
public
array<string|int, mixed>
$page_rule_parser
$plugin_hash
Hash used to keep track of whether $plugin_processors info needs to be changed
public
string
$plugin_hash
$plugin_processors
An associative array of (page processor => array of indexing plugin names associated with the page processor). It is used to determine, after a page is processed, which plugins' pageProcessing($page, $url) method should be called
public
array<string|int, mixed>
$plugin_processors
$post_max_size
Maximum number of bytes which can be uploaded to the current queue server's web app in one go
public
int
$post_max_size
$processors
Page processors used by this fetcher
public
array<string|int, mixed>
$processors
$programming_language_extension
To map programming languages with their extensions
public
array<string|int, mixed>
$programming_language_extension
$proxy_servers
An array of proxy servers to use rather than downloading web pages directly from the current machine. If it is the empty array, then we just download directly from the current machine
public
array<string|int, mixed>
$proxy_servers
$queue_servers
Array of Urls or IP addresses of the queue_servers to get sites to crawl from
public
array<string|int, mixed>
$queue_servers
$recrawl_check_scheduler
Keeps track of whether during the recrawl we should notify a queue_server scheduler about our progress in mini-indexing documents in the archive
public
bool
$recrawl_check_scheduler
$restrict_sites_by_url
Says whether the $allowed_sites array is being used or not
public
bool
$restrict_sites_by_url
$robots_txt
One of a fixed set of values which are used to control to what extent Yioop follows robots.txt files: ALWAYS_FOLLOW_ROBOTS, ALLOW_LANDING_ROBOTS, IGNORE_ROBOTS
public
int
$robots_txt
$schedule_time
Timestamp from a queue_server of the current schedule of sites to download. This is sent back to the server once this schedule is completed to help the queue server implement crawl-delay if needed.
public
int
$schedule_time
$scrapers
Contains an array of scrapers used to extract the important content from particular kinds of HTML pages, for example, pages generated by a particular content management system.
public
array<string|int, mixed>
$scrapers
$sequence_number
Holds the sequence number of the current schedule received from queue server
public
int
$sequence_number
$sleep_duration
If a crawl quiescent period is being used with the crawl, then this property will be positive and indicates the duration in seconds of the quiescent period.
public
string
$sleep_duration
$sleep_start
If a crawl quiescent period is being used with the crawl, then this stores the time of day at which that period starts
public
string
$sleep_start
$summarizer_option
Stores the name of the summarizer used for crawling.
public
string
$summarizer_option
Possible values are self::BASIC, self::GRAPH_BASED_SUMMARIZER, self::CENTROID_SUMMARIZER and self::CENTROID_WEIGHTED_SUMMARIZER
$to_crawl
Contains the list of web pages to crawl from a queue_server
public
array<string|int, mixed>
$to_crawl
$to_crawl_again
Contains the list of web pages to crawl that failed on first attempt (we give them one more try before bailing on them)
public
array<string|int, mixed>
$to_crawl_again
$tor_proxy
If this is not null and a .onion url is detected then this url will be used as a proxy server to download the .onion url
public
string
$tor_proxy
$total_git_urls
To keep track of total number of Git internal urls
public
int
$total_git_urls
Methods
__construct()
Sets up the field variables so that crawling can begin
public
__construct() : mixed
Return values
mixed
addToCrawlSites()
Used to add a set of links from a web page to the array of sites which need to be crawled.
public
addToCrawlSites(array<string|int, mixed> $link_urls, string $old_url, int $old_weight, int $old_depth, int $num_common) : mixed
Parameters
- $link_urls : array<string|int, mixed>
-
an array of urls to be crawled
- $old_url : string
-
url of page where links came from
- $old_weight : int
-
the weight on the page the link came from (order of importance among links on page)
- $old_depth : int
-
depth of the web page the links came from
- $num_common : int
-
number of company level domains in common between $link_urls and $old_url
Return values
mixed
allowedToCrawlSite()
Checks if url belongs to a list of sites that are allowed to be crawled and that the file type is crawlable
public
allowedToCrawlSite(string $url) : bool
Parameters
- $url : string
-
url to check
Return values
bool —whether it is allowed to be crawled or not
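A much-simplified version of this kind of check is sketched below. The helper is hypothetical; Yioop's real test also handles host, domain, and path patterns.

```php
<?php
// Hypothetical sketch of an allowed-sites / file-type check.
function allowedToCrawl(string $url, array $allowed_sites,
    array $indexed_file_types): bool
{
    $host = parse_url($url, PHP_URL_HOST) ?: "";
    $path = parse_url($url, PHP_URL_PATH) ?: "/";
    $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    $type_ok = ($extension == "") ||
        in_array($extension, $indexed_file_types);
    $site_ok = empty($allowed_sites); // no restriction configured
    foreach ($allowed_sites as $site) {
        if (stripos($host, $site) !== false) {
            $site_ok = true;
            break;
        }
    }
    return $site_ok && $type_ok;
}
```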
checkArchiveScheduler()
During an archive crawl this method is used to get from the name server a collection of pages to process. The fetcher will later process these and send summaries to various queue_servers.
public
checkArchiveScheduler() : array<string|int, mixed>
Return values
array<string|int, mixed> —containing archive page data
checkCrawlTime()
Makes a request of the name server machine to get the timestamp of the currently running crawl to see if it changed
public
checkCrawlTime() : bool
If the timestamp has changed save the rest of the current fetch batch, then load any existing fetch from the new crawl; otherwise, set the crawl to empty. Also, handles deleting old crawls on this fetcher machine based on a list of current crawls on the name server.
Return values
bool —true if loaded a fetch batch due to time change
checkScheduler()
Get status, current crawl, crawl order, and new site information from the queue_server.
public
checkScheduler() : mixed
Return values
mixed —array or bool. If we are doing a web crawl and we still have pages to crawl then true; if the scheduler page fails to download then false; otherwise, returns an array of info from the scheduler.
compressAndUnsetSeenUrls()
Computes a string of compressed urls from the seen urls and extracted links destined for the current queue server. Then unsets these values from $this->found_sites
public
compressAndUnsetSeenUrls(int $server) : string
Parameters
- $server : int
-
index of queue server to compress and unset urls for
Return values
string —of compressed urls
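The general idea, pack a batch of urls into one compressed string and drop them from the working array, can be sketched as follows. Yioop's real method uses its own packing format and also carries per-url scheduling information; the helper below is hypothetical.

```php
<?php
// Hypothetical sketch: compress a list of urls into one string and
// unset them from the array they came from.
function compressUrls(array &$found_urls): string
{
    $packed = "";
    foreach ($found_urls as $url) {
        $packed .= strlen($url) . "\n" . $url . "\n";
    }
    $found_urls = []; // unset the urls that were just packed
    return gzcompress($packed);
}
```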
copySiteFields()
Copies fields from the array of site data to the $i indexed element of the $summarized_site_pages. This will flatten info in a DOC_INFO subarray. This method is both useful to prepare downloaded site info and to prepare meta info for sub docs that might be produced by an indexing plugin.
public
copySiteFields(int $i, array<string|int, mixed> $site, array<string|int, mixed> &$summarized_site_pages[, array<string|int, mixed> $exclude_fields = [] ]) : mixed
Parameters
- $i : int
-
index to copy to
- $site : array<string|int, mixed>
-
web page info to copy
- $summarized_site_pages : array<string|int, mixed>
-
array of summaries of web pages
- $exclude_fields : array<string|int, mixed> = []
-
an array of fields not to copy
Return values
mixed
cullNoncrawlableSites()
Used to remove from the to_crawl urls those that are no longer crawlable because the allowed and disallowed sites have changed.
public
cullNoncrawlableSites() : mixed
Return values
mixed
deleteOldCrawls()
Deletes any crawl web archive bundles not in the provided array of crawls
public
deleteOldCrawls(array<string|int, mixed> &$still_active_crawls) : mixed
Parameters
- $still_active_crawls : array<string|int, mixed>
-
those crawls which should not be deleted, so all others will be deleted
Return values
mixed
disallowedToCrawlSite()
Checks if url belongs to a list of sites that aren't supposed to be crawled
public
disallowedToCrawlSite(string $url) : bool
Parameters
- $url : string
-
url to check
Return values
bool —whether it shouldn't be crawled
downloadPagesArchiveCrawl()
Extracts NUM_MULTI_CURL_PAGES from the current Archive Bundle that is being recrawled.
public
downloadPagesArchiveCrawl() : array<string|int, mixed>
Return values
array<string|int, mixed> —an associative array of web pages and meta data from the archive bundle being iterated over
downloadPagesWebCrawl()
Get a list of urls from the current fetch batch provided by the queue server. Then downloads these pages. Finally, reschedules, if possible, pages that did not successfully get downloaded.
public
downloadPagesWebCrawl() : array<string|int, mixed>
Return values
array<string|int, mixed> —an associative array of web pages and meta data fetched from the internet
exceedMemoryThreshold()
Function to check if memory for this fetcher instance is getting low relative to what the system will allow.
public
exceedMemoryThreshold() : bool
Return values
bool —whether available memory is getting low
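A minimal sketch of this kind of check, assuming the threshold is some fraction of PHP's memory_limit; the 0.7 fraction below is illustrative, not Yioop's actual threshold.

```php
<?php
// Sketch: compare current memory usage against a fraction of the
// memory_limit ini setting (shorthand suffixes K, M, G handled).
function memoryGettingLow(float $fraction = 0.7): bool
{
    $limit = trim(ini_get("memory_limit"));
    $value = (int) $limit;
    switch (strtoupper(substr($limit, -1))) {
        case "G": $value *= 1024; // fall through
        case "M": $value *= 1024; // fall through
        case "K": $value *= 1024;
    }
    if ($value <= 0) {
        return false; // a memory_limit of -1 means no limit
    }
    return memory_get_usage() > $fraction * $value;
}
```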
getFetchSites()
Prepare an array of up to NUM_MULTI_CURL_PAGES' worth of sites to be downloaded in one go using the to_crawl array. Delete these sites from the to_crawl array.
public
getFetchSites() : array<string|int, mixed>
Return values
array<string|int, mixed> —sites which are ready to be downloaded
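In outline, carving off a batch might look like the sketch below; the constant's value and the use of array_splice are illustrative, not the method's actual implementation.

```php
<?php
// Sketch: take up to NUM_MULTI_CURL_PAGES entries off a to-crawl list
// and remove them from that list in the same step.
define("NUM_MULTI_CURL_PAGES", 100); // illustrative value
function nextFetchBatch(array &$to_crawl): array
{
    return array_splice($to_crawl, 0, NUM_MULTI_CURL_PAGES);
}
```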
getPageThumbs()
Adds thumbs for websites with a self::THUMB_URL field by downloading the linked-to images and making thumbs from them.
public
getPageThumbs(array<string|int, mixed> &$sites) : mixed
Parameters
- $sites : array<string|int, mixed>
-
associative array of web site information to add thumbs for. At least one site in the array should have a self::THUMB_URL field that we want the thumb of
Return values
mixed
loop()
Main loop for the fetcher.
public
loop() : mixed
Checks for stop message, checks queue server if crawl has changed and for new pages to crawl. Loop gets a group of next pages to crawl if there are pages left to crawl (otherwise sleep 5 seconds). It downloads these pages, deduplicates them, and updates the found site info with the result before looping again.
Return values
mixed
pageProcessor()
Return the fetcher's copy of a page processor for the given mimetype.
public
pageProcessor(string $type) : object
Parameters
- $type : string
-
mimetype want a processor for
Return values
object —a page processor for that mimetype, or false if that mimetype can't be handled
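A stripped-down version of such a lookup is sketched below. The helper is hypothetical; the real method returns the fetcher's already-constructed copy of the processor rather than building a new one.

```php
<?php
// Hypothetical sketch of a mimetype => processor lookup that returns
// false when no processor handles the type.
function processorFor(string $mimetype, array $page_processors)
{
    // strip any parameters, e.g. "text/html; charset=UTF-8" => "text/html"
    $mimetype = trim(explode(";", $mimetype)[0]);
    if (empty($page_processors[$mimetype])) {
        return false;
    }
    $class = $page_processors[$mimetype];
    return new $class();
}
```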
processFetchPages()
Processes an array of downloaded web pages with the appropriate page processor.
public
processFetchPages(array<string|int, mixed> $site_pages) : array<string|int, mixed>
Summary data is extracted from each non robots.txt file in the array. Disallowed paths and crawl-delays are extracted from robots.txt files.
Parameters
- $site_pages : array<string|int, mixed>
-
a collection of web pages to process
Return values
array<string|int, mixed> —summary data extracted from these pages
processSubdocs()
The pageProcessing method of an IndexingPlugin generates a self::SUBDOCS array of additional "micro-documents" that might have been in the page. This method adds these documents to the summarized_site_pages and stored_site_pages arrays constructed during the execution of processFetchPages()
public
processSubdocs(int &$i, array<string|int, mixed> $site, array<string|int, mixed> &$summarized_site_pages) : mixed
Parameters
- $i : int
-
index to begin adding subdocs at
- $site : array<string|int, mixed>
-
web page that subdocs were from and from which some subdoc summary info is copied
- $summarized_site_pages : array<string|int, mixed>
-
array of summaries of web pages
Return values
mixed
pruneLinks()
This method attempts to cull from the doc_info struct the best $this->max_links_to_extract links. Currently, this is done by first removing links whose filetype or site the crawler is forbidden from crawling.
public
pruneLinks(array<string|int, mixed> &$doc_info[, string $field = CrawlConstants::LINKS ], int $member_cache_time) : mixed
Then a crude estimate of the information contained in the links test: strlen(gzip(text)) is used to extract the best remaining links.
Parameters
- $doc_info : array<string|int, mixed>
-
an array with a CrawlConstants::LINKS subarray. This subarray in turn contains url => text pairs.
- $field : string = CrawlConstants::LINKS
-
field for links default is CrawlConstants::LINKS
- $member_cache_time : int
-
says how long allowed and disallowed url info should be cached by urlMemberSiteArray
Return values
mixed
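The strlen(gzip(text)) idea used by pruneLinks() can be sketched as follows; gzcompress stands in for the gzip step, the helper is hypothetical, and the filetype/forbidden-site filtering done first is omitted.

```php
<?php
// Sketch: score each link's text by its compressed length (a rough
// proxy for information content) and keep only the best links.
function bestLinks(array $links, int $max_links): array
{
    $scores = [];
    foreach ($links as $url => $text) {
        $scores[$url] = strlen(gzcompress($text));
    }
    arsort($scores); // highest scores first
    $keep = array_slice(array_keys($scores), 0, $max_links);
    return array_intersect_key($links, array_flip($keep));
}
```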
reschedulePages()
Sorts out pages for which no content was downloaded so that they can be scheduled to be crawled again.
public
reschedulePages(array<string|int, mixed> &$site_pages) : array<string|int, mixed>
Parameters
- $site_pages : array<string|int, mixed>
-
pages to sort
Return values
array<string|int, mixed> —an array consisting of two arrays: downloaded pages and not downloaded pages.
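In outline, the partition might look like the sketch below; the PAGE key used to detect whether any content came back is illustrative.

```php
<?php
// Sketch: split fetch results into downloaded and not-downloaded pages.
function partitionByDownload(array $site_pages): array
{
    $downloaded = [];
    $not_downloaded = [];
    foreach ($site_pages as $site) {
        if (!empty($site["PAGE"])) { // illustrative key for page content
            $downloaded[] = $site;
        } else {
            $not_downloaded[] = $site;
        }
    }
    return [$downloaded, $not_downloaded];
}
```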
selectCurrentServerAndUpdateIfNeeded()
At least once, and while memory is low, selects the next server and sends any fetcher data we have to it.
public
selectCurrentServerAndUpdateIfNeeded(bool $at_least_current_server) : mixed
Parameters
- $at_least_current_server : bool
-
whether to send the site info to at least one queue server or to send only if memory is above threshold. Only in the latter case is the next server advanced.
Return values
mixed
setCrawlParamsFromArray()
Sets parameters for fetching based on provided info struct ($info typically would come from the queue server)
public
setCrawlParamsFromArray(array<string|int, mixed> &$info) : mixed
Parameters
- $info : array<string|int, mixed>
-
struct with info about the kind of crawl, timestamp of index, crawl order, etc.
Return values
mixed
start()
This is the function that should be called to get the fetcher to start fetching. Calls init to handle the command-line arguments then enters the fetcher's main loop
public
start() : mixed
Return values
mixed
updateDomainFilters()
Updates the array of domain filters currently loaded into memory based on which BloomFilterFiles are present in WORK_DIRECTORY/data/domain_filters and if they have changed since the current in-memory filters were loaded
public
updateDomainFilters() : mixed
Return values
mixed
updateFoundSites()
Updates the $this->found_sites array with data from the most recently downloaded sites. This means updating the following sub arrays: the self::ROBOT_PATHS, self::TO_CRAWL. It checks if there are still more urls to crawl. If so, a mini index is built and the queue server is called with the data.
public
updateFoundSites(array<string|int, mixed> $sites[, bool $force_send = false ]) : mixed
Parameters
- $sites : array<string|int, mixed>
-
site data to use for the update
- $force_send : bool = false
-
whether to force send data back to queue_server or rely on usual thresholds before sending
Return values
mixed
updateScheduler()
Updates the queue_server about sites that have been crawled.
public
updateScheduler(string $server[, bool $send_robots = false ]) : mixed
This method is called if there are currently no more sites to crawl. It compresses and does a post request to send the page summary data, robot data, and to-crawl url data back to the server. In the event that the server doesn't acknowledge, it loops and tries again after a delay until the post is successful. At this point, memory for this data is freed.
Parameters
- $server : string
-
index of queue server to update
- $send_robots : bool = false
-
whether to send robots.txt data if present
Return values
mixed
uploadCrawlData()
Sends to-crawl, robot, and index data to the current queue server.
public
uploadCrawlData(string $queue_server, array<string|int, mixed> $byte_counts, array<string|int, mixed> &$post_data) : mixed
If this data is more than post_max_size, it splits it into chunks which are then reassembled by the queue server web app before being put into the appropriate schedule sub-directory.
Parameters
- $queue_server : string
-
url of the current queue server
- $byte_counts : array<string|int, mixed>
-
has four fields: TOTAL, ROBOT, SCHEDULE, INDEX. These give the number of bytes overall for the 'data' field of $post_data and for each of these components.
- $post_data : array<string|int, mixed>
-
data to be uploaded to the queue server web app
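The chunking idea can be sketched as follows. The field names, target url handling, and use of PHP streams are illustrative only; Yioop's real upload format and acknowledgement/retry protocol differ.

```php
<?php
// Hypothetical sketch: split a payload into pieces no larger than
// $post_max_size and POST each piece with its part number so the
// receiving web app can reassemble the original data.
function postInChunks(string $queue_server, string $data,
    int $post_max_size): void
{
    $chunks = str_split($data, $post_max_size);
    $num_parts = count($chunks);
    foreach ($chunks as $part => $chunk) {
        $context = stream_context_create(["http" => [
            "method" => "POST",
            "header" => "Content-Type: application/x-www-form-urlencoded",
            "content" => http_build_query([
                "part" => $part,
                "num_parts" => $num_parts,
                "data" => $chunk,
            ]),
        ]]);
        // error checking / retry (as described above) omitted
        file_get_contents($queue_server, false, $context);
    }
}
```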