Fetcher
in package
implements
CrawlConstants
This class is responsible for fetching web pages for the SeekQuarry/Yioop search engine
Fetcher periodically queries the queue server asking for web pages to fetch. It gets at most MAX_FETCH_SIZE many web pages from the queue_server in one go. It then fetches these pages. Pages are fetched in batches of NUM_MULTI_CURL_PAGES many pages. Once downloaded, the fetcher sends summaries back to the machine on which the queue_server lives. It does this by making a request of the web server on that machine and POSTing the data to the yioop web app. This data is handled by the FetchController class. The summary data can include up to three things: (1) robots.txt data, (2) summaries of each web page downloaded in the batch, and (3) a list of future urls to add to the to-crawl queue.
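The batch-download step can be pictured with a small, self-contained sketch. This is not the code Yioop itself uses for downloading; the urls, curl options, and error handling below are illustrative only.

```php
<?php
// Sketch: download a batch of pages in one go with curl_multi, roughly
// how a fetcher grabs NUM_MULTI_CURL_PAGES pages at a time.
// The url list and options are illustrative, not Yioop's.
$urls = ["https://example.com/", "https://example.org/"];
$multi = curl_multi_init();
$handles = [];
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    curl_multi_add_handle($multi, $ch);
    $handles[$url] = $ch;
}
do { // drive all transfers until every handle has finished
    $status = curl_multi_exec($multi, $running);
    if ($running) {
        curl_multi_select($multi);
    }
} while ($running && $status == CURLM_OK);
$pages = [];
foreach ($handles as $url => $ch) {
    $pages[$url] = curl_multi_getcontent($ch); // false/"" if download failed
    curl_multi_remove_handle($multi, $ch);
    curl_close($ch);
}
curl_multi_close($multi);
```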
Interfaces, Classes, Traits and Enums
- CrawlConstants
- Shared constants and enums used by components that are involved in the crawling process
Table of Contents
- DEFAULT_POST_MAX_SIZE = 2000000
- Before receiving any data from a queue server's web app this is the default assumed post_max_size in bytes
- DOMAIN_FILTER_GLOB = \seekquarry\yioop\configs\DATA_DIR . "/domain_filters/*.ftr"
- Domain Filter Glob pattern for Bloom filters used to specify allowed domains. (Only specifies if some filter file of this type exists)
- GIT_URL_CONTINUE = '@@@@'
- constant indicating Git repository
- HEX_NULL_CHARACTER = "\x00"
- An indicator representing the next position after the access code in a Git tree object
- INDICATOR_NONE = 'none'
- An indicator that no actions are to be taken
- REPOSITORY_GIT = 'git'
- constant indicating Git repository
- $active_classifiers : array<string|int, mixed>
- Contains which classifiers to use for the current crawl. Classifiers can be used to label web documents with a meta word if the classifier's threshold is met
- $active_rankers : array<string|int, mixed>
- Contains which classifiers are used to rank web documents for the current crawl. The score that the classifier gives to a document is used for ranking purposes
- $all_file_types : array<string|int, mixed>
- List of all known file extensions including those not used for crawl
- $all_git_urls : array<string|int, mixed>
- To store all the internal git urls fetched
- $allow_disallow_cache_time : int
- Microtime used to look up cached $allowed_sites and $disallowed_sites filtering data structures
- $allowed_sites : array<string|int, mixed>
- Web-sites that the crawler can crawl. If used, ONLY these will be crawled
- $arc_dir : string
- For a non-web archive crawl, holds the path to the directory that contains the archive files and their description (web archives have a different structure and are already distributed across machines and fetchers)
- $arc_type : string
- For an archive crawl, holds the name of the type of archive being iterated over (this is the class name of the iterator, without the word 'Iterator')
- $archive_iterator : object
- If a web archive crawl (i.e. a re-crawl) is active then this field holds the iterator object used to iterate over the archive
- $cache_pages : bool
- Whether to cache pages or just the summaries
- $channel : int
- Channel that the queue server listens to for messages
- $check_crawl_time : int
- The last time the name server was checked for a crawl time
- $crawl_index : string
- If the crawl_type is self::ARCHIVE_CRAWL, then crawl_index is the timestamp of the existing archive to crawl
- $crawl_order : string
- Stores the name of the ordering used to crawl pages. This is used in a switch/case when computing weights of urls to be crawled before sending these new urls back to a queue_server.
- $crawl_stat_filename : string
- Name of file used to store fetcher statistics for the current crawl
- $crawl_stat_info : string
- Fetcher statistics for the current crawl
- $crawl_time : int
- Timestamp of the current crawl
- $crawl_type : string
- Indicates the kind of crawl being performed: self::WEB_CRAWL indicates a new crawl of the web; self::ARCHIVE_CRAWL indicates a crawl of an existing web archive
- $current_server : int
- Index into $queue_servers of the server to get the schedule from (or the last one we got a schedule from)
- $db : object
- Reference to a database object. Used since it has directory manipulation functions
- $debug : string
- Holds the value of a debug message that might have been sent from the command line during the current execution of loop();
- $disallowed_sites : array<string|int, mixed>
- Web-sites that the crawler must not crawl
- $domain_filters : array<string|int, mixed>
- An array of Bloom filters which, if non-empty, will be used to restrict discovered url links. Namely, if the domain of a url is not in any of the filters and is not of the form of a company level domain (cld) or www.cld then it will be pruned. Here a company level domain is of the form some_name.tld where tld is a top level domain, or of the form some_name.country_level_tld.tld if the tld is that of a country. So for site.somewhere.jp, somewhere.jp would be the cld; for site.somewhere.co.jp, somewhere.co.jp would be the cld.
- $fetcher_num : string
- Which fetcher instance we are (if the fetcher is run as a job and there is more than one)
- $found_sites : array<string|int, mixed>
- Summary information for visited sites that the fetcher hasn't sent to a queue_server yet
- $hosts_with_errors : array<string|int, mixed>
- An array to keep track of hosts which have had a lot of http errors
- $indexed_file_types : array<string|int, mixed>
- List of file extensions supported for the crawl
- $max_depth : int
- Maximum depth, relative to the seed urls, to which the fetcher should extract links
- $max_description_len : int
- Max number of chars to extract for description from a page to index.
- $max_links_to_extract : int
- Maximum number of urls to extract from a single document
- $minimum_fetch_loop_time : int
- Fetcher must wait at least this long between multi-curl requests.
- $name_server : array<string|int, mixed>
- Urls or IP address of the web_server used to administer this instance of yioop. Used to figure out available queue_servers to contact for crawling data
- $no_process_links : bool
- When processing recrawl data this says to assume the data has already had its links extracted into a field and so this doesn't have to be done in a separate step
- $num_download_attempts : mixed
- Number of attempts to download urls in current fetch batch
- $num_multi_curl : int
- For a web crawl only, the number of web pages to download in one go.
- $page_processors : array<string|int, mixed>
- An associative array of (mimetype => name of processor class to handle) pairs.
- $page_range_request : int
- Maximum number of bytes to download of a webpage
- $page_rule_parser : array<string|int, mixed>
- Holds the parsed page rules which will be applied to document summaries before finally storing and indexing them
- $plugin_hash : string
- Hash used to keep track of whether $plugin_processors info needs to be changed
- $plugin_processors : array<string|int, mixed>
- An associative array of (page processor => array of indexing plugin names associated with the page processor). It is used to determine, after a page is processed, which plugins' pageProcessing($page, $url) method should be called
- $post_max_size : int
- Maximum number of bytes which can be uploaded to the current queue server's web app in one go
- $processors : array<string|int, mixed>
- Page processors used by this fetcher
- $programming_language_extension : array<string|int, mixed>
- To map programming languages with their extensions
- $proxy_servers : array<string|int, mixed>
- An array of proxy servers to use rather than downloading web pages directly from the current machine. If it is the empty array, then we just download directly from the current machine
- $queue_servers : array<string|int, mixed>
- Array of Urls or IP addresses of the queue_servers to get sites to crawl from
- $recrawl_check_scheduler : bool
- Keeps track of whether during the recrawl we should notify a queue_server scheduler about our progress in mini-indexing documents in the archive
- $restrict_sites_by_url : bool
- Says whether the $allowed_sites array is being used or not
- $robots_txt : int
- One of a fixed set of values which are used to control to what extent Yioop follows robots.txt files: ALWAYS_FOLLOW_ROBOTS, ALLOW_LANDING_ROBOTS, IGNORE_ROBOTS
- $schedule_time : int
- Timestamp from a queue_server of the current schedule of sites to download. This is sent back to the server once this schedule is completed to help the queue server implement crawl-delay if needed.
- $scrapers : array<string|int, mixed>
- Contains an array of scrapers used to extract the important content from particular kinds of HTML pages, for example, pages generated by a particular content management system.
- $sequence_number : int
- Holds the sequence number of the current schedule received from queue server
- $sleep_duration : string
- If a crawl quiescent period is being used with the crawl, then this property will be positive and indicates the duration in seconds of the quiescent period.
- $sleep_start : string
- If a crawl quiescent period is being used with the crawl, then this stores the time of day at which that period starts
- $summarizer_option : string
- Stores the name of the summarizer used for crawling.
- $to_crawl : array<string|int, mixed>
- Contains the list of web pages to crawl from a queue_server
- $to_crawl_again : array<string|int, mixed>
- Contains the list of web pages to crawl that failed on first attempt (we give them one more try before bailing on them)
- $tor_proxy : string
- If this is not null and a .onion url is detected then this url will be used as a proxy server to download the .onion url
- $total_git_urls : int
- To keep track of total number of Git internal urls
- __construct() : mixed
- Sets up the field variables so that crawling can begin
- addToCrawlSites() : mixed
- Used to add a set of links from a web page to the array of sites which need to be crawled.
- allowedToCrawlSite() : bool
- Checks if url belongs to a list of sites that are allowed to be crawled and that the file type is crawlable
- checkArchiveScheduler() : array<string|int, mixed>
- During an archive crawl this method is used to get from the name server a collection of pages to process. The fetcher will later process these and send summaries to various queue_servers.
- checkCrawlTime() : bool
- Makes a request of the name server machine to get the timestamp of the currently running crawl to see if it changed
- checkScheduler() : mixed
- Get status, current crawl, crawl order, and new site information from the queue_server.
- compressAndUnsetSeenUrls() : string
- Computes a string of compressed urls from the seen urls and extracted links destined for the current queue server. Then unsets these values from $this->found_sites
- copySiteFields() : mixed
- Copies fields from the array of site data to the $i indexed element of the $summarized_site_pages. This will flatten info in a DOC_INFO subarray. This method is both useful to prepare downloaded site info and to prepare meta info for sub docs that might be produced by an indexing plugin.
- cullNoncrawlableSites() : mixed
- Used to remove from the to_crawl urls those that are no longer crawlable because the allowed and disallowed sites have changed.
- deleteOldCrawls() : mixed
- Deletes any crawl web archive bundles not in the provided array of crawls
- disallowedToCrawlSite() : bool
- Checks if url belongs to a list of sites that aren't supposed to be crawled
- downloadPagesArchiveCrawl() : array<string|int, mixed>
- Extracts NUM_MULTI_CURL_PAGES from the current Archive Bundle that is being recrawled.
- downloadPagesWebCrawl() : array<string|int, mixed>
- Get a list of urls from the current fetch batch provided by the queue server. Then downloads these pages. Finally, reschedules, if possible, pages that did not successfully get downloaded.
- exceedMemoryThreshold() : bool
- Function to check if memory for this fetcher instance is getting low relative to what the system will allow.
- getFetchSites() : array<string|int, mixed>
- Prepare an array of up to NUM_MULTI_CURL_PAGES' worth of sites to be downloaded in one go using the to_crawl array. Delete these sites from the to_crawl array.
- getPageThumbs() : mixed
- Adds thumbs for websites with a self::THUMB_URL field by downloading the linked-to images and making thumbs from them.
- loop() : mixed
- Main loop for the fetcher.
- pageProcessor() : object
- Return the fetcher's copy of a page processor for the given mimetype.
- processFetchPages() : array<string|int, mixed>
- Processes an array of downloaded web pages with the appropriate page processor.
- processSubdocs() : mixed
- The pageProcessing method of an IndexingPlugin generates a self::SUBDOCS array of additional "micro-documents" that might have been in the page. This method adds these documents to the summarized_site_pages and stored_site_pages arrays constructed during the execution of processFetchPages()
- pruneLinks() : mixed
- This method attempts to cull from the doc_info struct the best $this->max_links_to_extract links. Currently, this is done by first removing links whose filetype or site the crawler is forbidden from crawling.
- reschedulePages() : array<string|int, mixed>
- Sorts out pages for which no content was downloaded so that they can be scheduled to be crawled again.
- selectCurrentServerAndUpdateIfNeeded() : mixed
- At least once, and while memory is low, selects the next server and sends any fetcher data we have to it.
- setCrawlParamsFromArray() : mixed
- Sets parameters for fetching based on provided info struct ($info typically would come from the queue server)
- start() : mixed
- This is the function that should be called to get the fetcher to start fetching. Calls init to handle the command-line arguments then enters the fetcher's main loop
- updateDomainFilters() : mixed
- Updates the array of domain filters currently loaded into memory based on which BloomFilterFiles are present in WORK_DIRECTORY/data/domain_filters and if they have changed since the current in-memory filters were loaded
- updateFoundSites() : mixed
- Updates the $this->found_sites array with data from the most recently downloaded sites. This means updating the following sub arrays: the self::ROBOT_PATHS, self::TO_CRAWL. It checks if there are still more urls to crawl. If so, a mini index is built and the queue server is called with the data.
- updateScheduler() : mixed
- Updates the queue_server about sites that have been crawled.
- uploadCrawlData() : mixed
- Sends to-crawl, robot, and index data to the current queue server.
Constants
DEFAULT_POST_MAX_SIZE
Before receiving any data from a queue server's web app this is the default assumed post_max_size in bytes
public
mixed
DEFAULT_POST_MAX_SIZE
= 2000000
DOMAIN_FILTER_GLOB
Domain Filter Glob pattern for Bloom filters used to specify allowed domains. (Only specifies if some filter file of this type exists)
public
mixed
DOMAIN_FILTER_GLOB
= \seekquarry\yioop\configs\DATA_DIR . "/domain_filters/*.ftr"
GIT_URL_CONTINUE
constant indicating Git repository
public
mixed
GIT_URL_CONTINUE
= '@@@@'
HEX_NULL_CHARACTER
An indicator representing the next position after the access code in a Git tree object
public
mixed
HEX_NULL_CHARACTER
= "\x00"
INDICATOR_NONE
An indicator that no actions are to be taken
public
mixed
INDICATOR_NONE
= 'none'
REPOSITORY_GIT
constant indicating Git repository
public
mixed
REPOSITORY_GIT
= 'git'
Properties
$active_classifiers
Contains which classifiers to use for the current crawl. Classifiers can be used to label web documents with a meta word if the classifier's threshold is met
public
array<string|int, mixed>
$active_classifiers
$active_rankers
Contains which classifiers are used to rank web documents for the current crawl. The score that the classifier gives to a document is used for ranking purposes
public
array<string|int, mixed>
$active_rankers
$all_file_types
List of all known file extensions including those not used for crawl
public
array<string|int, mixed>
$all_file_types
$all_git_urls
To store all the internal git urls fetched
public
array<string|int, mixed>
$all_git_urls
$allow_disallow_cache_time
Microtime used to look up cached $allowed_sites and $disallowed_sites filtering data structures
public
int
$allow_disallow_cache_time
$allowed_sites
Web-sites that the crawler can crawl. If used, ONLY these will be crawled
public
array<string|int, mixed>
$allowed_sites
$arc_dir
For a non-web archive crawl, holds the path to the directory that contains the archive files and their description (web archives have a different structure and are already distributed across machines and fetchers)
public
string
$arc_dir
$arc_type
For an archive crawl, holds the name of the type of archive being iterated over (this is the class name of the iterator, without the word 'Iterator')
public
string
$arc_type
$archive_iterator
If a web archive crawl (i.e. a re-crawl) is active then this field holds the iterator object used to iterate over the archive
public
object
$archive_iterator
$cache_pages
Whether to cache pages or just the summaries
public
bool
$cache_pages
$channel
Channel that the queue server listens to for messages
public
int
$channel
$check_crawl_time
The last time the name server was checked for a crawl time
public
int
$check_crawl_time
$crawl_index
If the crawl_type is self::ARCHIVE_CRAWL, then crawl_index is the timestamp of the existing archive to crawl
public
string
$crawl_index
$crawl_order
Stores the name of the ordering used to crawl pages. This is used in a switch/case when computing weights of urls to be crawled before sending these new urls back to a queue_server.
public
string
$crawl_order
$crawl_stat_filename
Name of file used to store fetcher statistics for the current crawl
public
string
$crawl_stat_filename
$crawl_stat_info
Fetcher statistics for the current crawl
public
string
$crawl_stat_info
$crawl_time
Timestamp of the current crawl
public
int
$crawl_time
$crawl_type
Indicates the kind of crawl being performed: self::WEB_CRAWL indicates a new crawl of the web; self::ARCHIVE_CRAWL indicates a crawl of an existing web archive
public
string
$crawl_type
$current_server
Index into $queue_servers of the server to get the schedule from (or the last one we got a schedule from)
public
int
$current_server
$db
Reference to a database object. Used since it has directory manipulation functions
public
object
$db
$debug
Holds the value of a debug message that might have been sent from the command line during the current execution of loop();
public
string
$debug
$disallowed_sites
Web-sites that the crawler must not crawl
public
array<string|int, mixed>
$disallowed_sites
$domain_filters
An array of Bloom filters which, if non-empty, will be used to restrict discovered url links. Namely, if the domain of a url is not in any of the filters and is not of the form of a company level domain (cld) or www.cld then it will be pruned. Here a company level domain is of the form some_name.tld where tld is a top level domain, or of the form some_name.country_level_tld.tld if the tld is that of a country. So for site.somewhere.jp, somewhere.jp would be the cld; for site.somewhere.co.jp, somewhere.co.jp would be the cld. (A simplified sketch of this rule follows this entry.)
public
array<string|int, mixed>
$domain_filters
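The company level domain rule above can be illustrated with a small hypothetical helper. This is not the parsing code Yioop uses, and the country-tld test is a deliberately crude heuristic.

```php
<?php
// Hypothetical sketch of extracting a company level domain (cld) from a
// host name, following the rule described for $domain_filters above.
function companyLevelDomain(string $host): string
{
    $parts = explode(".", $host);
    $num = count($parts);
    if ($num <= 2) {
        return $host; // already of the form some_name.tld
    }
    // crude heuristic: a two-letter tld preceded by a short label such as
    // "co" or "com" is treated as a country tld with a second level
    if (strlen($parts[$num - 1]) == 2 && strlen($parts[$num - 2]) <= 3) {
        return implode(".", array_slice($parts, -3));
    }
    return implode(".", array_slice($parts, -2));
}
// companyLevelDomain("site.somewhere.jp")    => "somewhere.jp"
// companyLevelDomain("site.somewhere.co.jp") => "somewhere.co.jp"
```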
$fetcher_num
Which fetcher instance we are (if the fetcher is run as a job and there is more than one)
public
string
$fetcher_num
$found_sites
Summary information for visited sites that the fetcher hasn't sent to a queue_server yet
public
array<string|int, mixed>
$found_sites
$hosts_with_errors
An array to keep track of hosts which have had a lot of http errors
public
array<string|int, mixed>
$hosts_with_errors
$indexed_file_types
List of file extensions supported for the crawl
public
array<string|int, mixed>
$indexed_file_types
$max_depth
Maximum depth, relative to the seed urls, to which the fetcher should extract links
public
int
$max_depth
$max_description_len
Max number of chars to extract for description from a page to index.
public
int
$max_description_len
Only words in the description are indexed.
$max_links_to_extract
Maximum number of urls to extract from a single document
public
int
$max_links_to_extract
$minimum_fetch_loop_time
Fetcher must wait at least this long between multi-curl requests.
public
int
$minimum_fetch_loop_time
The value below is dynamically determined but is at least as large as MINIMUM_FETCH_LOOP_TIME
$name_server
Urls or IP address of the web_server used to administer this instance of yioop. Used to figure out available queue_servers to contact for crawling data
public
array<string|int, mixed>
$name_server
$no_process_links
When processing recrawl data this says to assume the data has already had its links extracted into a field and so this doesn't have to be done in a separate step
public
bool
$no_process_links
$num_download_attempts
Number of attempts to download urls in current fetch batch
public
mixed
$num_download_attempts
$num_multi_curl
For a web crawl only, the number of web pages to download in one go.
public
int
$num_multi_curl
$page_processors
An associative array of (mimetype => name of processor class to handle) pairs.
public
array<string|int, mixed>
$page_processors
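For illustration, such a map might be shaped as shown below; the class names are examples, and the actual map is built from Yioop's configuration.

```php
<?php
// Illustrative shape of a (mimetype => processor class) map; the real
// contents come from Yioop's configuration, not this hard-coded list.
$page_processors = [
    "text/html"       => "HtmlProcessor",
    "text/plain"      => "TextProcessor",
    "application/pdf" => "PdfProcessor",
    "image/png"       => "PngProcessor",
];
```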
$page_range_request
Maximum number of bytes to download of a webpage
public
int
$page_range_request
$page_rule_parser
Holds the parsed page rules which will be applied to document summaries before finally storing and indexing them
public
array<string|int, mixed>
$page_rule_parser
$plugin_hash
Hash used to keep track of whether $plugin_processors info needs to be changed
public
string
$plugin_hash
$plugin_processors
An associative array of (page processor => array of indexing plugin names associated with the page processor). It is used to determine, after a page is processed, which plugins' pageProcessing($page, $url) method should be called
public
array<string|int, mixed>
$plugin_processors
$post_max_size
Maximum number of bytes which can be uploaded to the current queue server's web app in one go
public
int
$post_max_size
$processors
Page processors used by this fetcher
public
array<string|int, mixed>
$processors
$programming_language_extension
To map programming languages with their extensions
public
array<string|int, mixed>
$programming_language_extension
$proxy_servers
An array of proxy servers to use rather than downloading web pages directly from the current machine. If it is the empty array, then we just download directly from the current machine
public
array<string|int, mixed>
$proxy_servers
$queue_servers
Array of Urls or IP addresses of the queue_servers to get sites to crawl from
public
array<string|int, mixed>
$queue_servers
$recrawl_check_scheduler
Keeps track of whether during the recrawl we should notify a queue_server scheduler about our progress in mini-indexing documents in the archive
public
bool
$recrawl_check_scheduler
$restrict_sites_by_url
Says whether the $allowed_sites array is being used or not
public
bool
$restrict_sites_by_url
$robots_txt
One of a fixed set of values which are used to control to what extent Yioop follows robots.txt files: ALWAYS_FOLLOW_ROBOTS, ALLOW_LANDING_ROBOTS, IGNORE_ROBOTS
public
int
$robots_txt
$schedule_time
Timestamp from a queue_server of the current schedule of sites to download. This is sent back to the server once this schedule is completed to help the queue server implement crawl-delay if needed.
public
int
$schedule_time
$scrapers
Contains an array of scrapers used to extract the important content from particular kinds of HTML pages, for example, pages generated by a particular content management system.
public
array<string|int, mixed>
$scrapers
$sequence_number
Holds the sequence number of the current schedule received from queue server
public
int
$sequence_number
$sleep_duration
If a crawl quiescent period is being used with the crawl, then this property will be positive and indicates the duration in seconds of the quiescent period.
public
string
$sleep_duration
$sleep_start
If a crawl quiescent period is being used with the crawl, then this stores the time of day at which that period starts
public
string
$sleep_start
$summarizer_option
Stores the name of the summarizer used for crawling.
public
string
$summarizer_option
Possible values are self::BASIC, self::GRAPH_BASED_SUMMARIZER, self::CENTROID_SUMMARIZER and self::CENTROID_WEIGHTED_SUMMARIZER
$to_crawl
Contains the list of web pages to crawl from a queue_server
public
array<string|int, mixed>
$to_crawl
$to_crawl_again
Contains the list of web pages to crawl that failed on first attempt (we give them one more try before bailing on them)
public
array<string|int, mixed>
$to_crawl_again
$tor_proxy
If this is not null and a .onion url is detected then this url will be used as a proxy server to download the .onion url
public
string
$tor_proxy
$total_git_urls
To keep track of total number of Git internal urls
public
int
$total_git_urls
Methods
__construct()
Sets up the field variables so that crawling can begin
public
__construct() : mixed
Return values
mixed
addToCrawlSites()
Used to add a set of links from a web page to the array of sites which need to be crawled.
public
addToCrawlSites(array<string|int, mixed> $link_urls, string $old_url, int $old_weight, int $old_depth, int $num_common) : mixed
Parameters
- $link_urls : array<string|int, mixed>
-
an array of urls to be crawled
- $old_url : string
-
url of page where links came from
- $old_weight : int
-
the weight on the page the link came from (order of importance among links on page)
- $old_depth : int
-
depth of the web page the links came from
- $num_common : int
-
number of company level domains in common between $link_urls and $old_url
Return values
mixed
allowedToCrawlSite()
Checks if url belongs to a list of sites that are allowed to be crawled and that the file type is crawlable
public
allowedToCrawlSite(string $url) : bool
Parameters
- $url : string
-
url to check
Return values
bool —whether it is allowed to be crawled or not
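A much-simplified version of this kind of check is sketched below. The helper is hypothetical; Yioop's real test also handles host, domain, and path patterns.

```php
<?php
// Hypothetical sketch of an allowed-sites / file-type check.
function allowedToCrawl(string $url, array $allowed_sites,
    array $indexed_file_types): bool
{
    $host = parse_url($url, PHP_URL_HOST) ?: "";
    $path = parse_url($url, PHP_URL_PATH) ?: "/";
    $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    $type_ok = ($extension == "") ||
        in_array($extension, $indexed_file_types);
    $site_ok = empty($allowed_sites); // no restriction configured
    foreach ($allowed_sites as $site) {
        if (stripos($host, $site) !== false) {
            $site_ok = true;
            break;
        }
    }
    return $site_ok && $type_ok;
}
```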
checkArchiveScheduler()
During an archive crawl this method is used to get from the name server a collection of pages to process. The fetcher will later process these and send summaries to various queue_servers.
public
checkArchiveScheduler() : array<string|int, mixed>
Return values
array<string|int, mixed> —containing archive page data
checkCrawlTime()
Makes a request of the name server machine to get the timestamp of the currently running crawl to see if it changed
public
checkCrawlTime() : bool
If the timestamp has changed save the rest of the current fetch batch, then load any existing fetch from the new crawl; otherwise, set the crawl to empty. Also, handles deleting old crawls on this fetcher machine based on a list of current crawls on the name server.
Return values
bool —true if loaded a fetch batch due to time change
checkScheduler()
Get status, current crawl, crawl order, and new site information from the queue_server.
public
checkScheduler() : mixed
Return values
mixed —array or bool. If we are doing a web crawl and we still have pages to crawl then true; if the scheduler page fails to download then false; otherwise, returns an array of info from the scheduler.
compressAndUnsetSeenUrls()
Computes a string of compressed urls from the seen urls and extracted links destined for the current queue server. Then unsets these values from $this->found_sites
public
compressAndUnsetSeenUrls(int $server) : string
Parameters
- $server : int
-
index of queue server to compress and unset urls for
Return values
string —of compressed urls
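The general idea, pack a batch of urls into one compressed string and drop them from the working array, can be sketched as follows. Yioop's real method uses its own packing format and also carries per-url scheduling information; the helper below is hypothetical.

```php
<?php
// Hypothetical sketch: compress a list of urls into one string and
// unset them from the array they came from.
function compressUrls(array &$found_urls): string
{
    $packed = "";
    foreach ($found_urls as $url) {
        $packed .= strlen($url) . "\n" . $url . "\n";
    }
    $found_urls = []; // unset the urls that were just packed
    return gzcompress($packed);
}
```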
copySiteFields()
Copies fields from the array of site data to the $i indexed element of the $summarized_site_pages. This will flatten info in a DOC_INFO subarray. This method is both useful to prepare downloaded site info and to prepare meta info for sub docs that might be produced by an indexing plugin.
public
copySiteFields(int $i, array<string|int, mixed> $site, array<string|int, mixed> &$summarized_site_pages[, array<string|int, mixed> $exclude_fields = [] ]) : mixed
Parameters
- $i : int
-
index to copy to
- $site : array<string|int, mixed>
-
web page info to copy
- $summarized_site_pages : array<string|int, mixed>
-
array of summaries of web pages
- $exclude_fields : array<string|int, mixed> = []
-
an array of fields not to copy
Return values
mixed
cullNoncrawlableSites()
Used to remove from the to_crawl urls those that are no longer crawlable because the allowed and disallowed sites have changed.
public
cullNoncrawlableSites() : mixed
Return values
mixed
deleteOldCrawls()
Deletes any crawl web archive bundles not in the provided array of crawls
public
deleteOldCrawls(array<string|int, mixed> &$still_active_crawls) : mixed
Parameters
- $still_active_crawls : array<string|int, mixed>
-
those crawls which should not be deleted, so all others will be deleted
Return values
mixed
disallowedToCrawlSite()
Checks if url belongs to a list of sites that aren't supposed to be crawled
public
disallowedToCrawlSite(string $url) : bool
Parameters
- $url : string
-
url to check
Return values
bool —whether it shouldn't be crawled
downloadPagesArchiveCrawl()
Extracts NUM_MULTI_CURL_PAGES from the current Archive Bundle that is being recrawled.
public
downloadPagesArchiveCrawl() : array<string|int, mixed>
Return values
array<string|int, mixed> —an associative array of web pages and meta data from the archive bundle being iterated over
downloadPagesWebCrawl()
Get a list of urls from the current fetch batch provided by the queue server. Then downloads these pages. Finally, reschedules, if possible, pages that did not successfully get downloaded.
public
downloadPagesWebCrawl() : array<string|int, mixed>
Return values
array<string|int, mixed> —an associative array of web pages and meta data fetched from the internet
exceedMemoryThreshold()
Function to check if memory for this fetcher instance is getting low relative to what the system will allow.
public
exceedMemoryThreshold() : bool
Return values
bool —whether available memory is getting low
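A minimal sketch of this kind of check, assuming the threshold is some fraction of PHP's memory_limit; the 0.7 fraction below is illustrative, not Yioop's actual threshold.

```php
<?php
// Sketch: compare current memory usage against a fraction of the
// memory_limit ini setting (shorthand suffixes K, M, G handled).
function memoryGettingLow(float $fraction = 0.7): bool
{
    $limit = trim(ini_get("memory_limit"));
    $value = (int) $limit;
    switch (strtoupper(substr($limit, -1))) {
        case "G": $value *= 1024; // fall through
        case "M": $value *= 1024; // fall through
        case "K": $value *= 1024;
    }
    if ($value <= 0) {
        return false; // a memory_limit of -1 means no limit
    }
    return memory_get_usage() > $fraction * $value;
}
```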
getFetchSites()
Prepare an array of up to NUM_MULTI_CURL_PAGES' worth of sites to be downloaded in one go using the to_crawl array. Delete these sites from the to_crawl array.
public
getFetchSites() : array<string|int, mixed>
Return values
array<string|int, mixed> —sites which are ready to be downloaded
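In outline, carving off a batch might look like the sketch below; the constant's value and the use of array_splice are illustrative, not the method's actual implementation.

```php
<?php
// Sketch: take up to NUM_MULTI_CURL_PAGES entries off a to-crawl list
// and remove them from that list in the same step.
define("NUM_MULTI_CURL_PAGES", 100); // illustrative value
function nextFetchBatch(array &$to_crawl): array
{
    return array_splice($to_crawl, 0, NUM_MULTI_CURL_PAGES);
}
```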
getPageThumbs()
Adds thumbs for websites with a self::THUMB_URL field by downloading the linked-to images and making thumbs from them.
public
getPageThumbs(array<string|int, mixed> &$sites) : mixed
Parameters
- $sites : array<string|int, mixed>
-
associative array of web site information to add thumbs for. At least one site in the array should have a self::THUMB_URL field that we want the thumb of
Return values
mixed
loop()
Main loop for the fetcher.
public
loop() : mixed
Checks for stop message, checks queue server if crawl has changed and for new pages to crawl. Loop gets a group of next pages to crawl if there are pages left to crawl (otherwise sleep 5 seconds). It downloads these pages, deduplicates them, and updates the found site info with the result before looping again.
Return values
mixed
pageProcessor()
Return the fetcher's copy of a page processor for the given mimetype.
public
pageProcessor(string $type) : object
Parameters
- $type : string
-
mimetype want a processor for
Return values
object —a page processor for that mimetype, or false if that mimetype can't be handled
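A stripped-down version of such a lookup is sketched below. The helper is hypothetical; the real method returns the fetcher's already-constructed copy of the processor rather than building a new one.

```php
<?php
// Hypothetical sketch of a mimetype => processor lookup that returns
// false when no processor handles the type.
function processorFor(string $mimetype, array $page_processors)
{
    // strip any parameters, e.g. "text/html; charset=UTF-8" => "text/html"
    $mimetype = trim(explode(";", $mimetype)[0]);
    if (empty($page_processors[$mimetype])) {
        return false;
    }
    $class = $page_processors[$mimetype];
    return new $class();
}
```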
processFetchPages()
Processes an array of downloaded web pages with the appropriate page processor.
public
processFetchPages(array<string|int, mixed> $site_pages) : array<string|int, mixed>
Summary data is extracted from each non robots.txt file in the array. Disallowed paths and crawl-delays are extracted from robots.txt files.
Parameters
- $site_pages : array<string|int, mixed>
-
a collection of web pages to process
Return values
array<string|int, mixed> —summary data extracted from these pages
processSubdocs()
The pageProcessing method of an IndexingPlugin generates a self::SUBDOCS array of additional "micro-documents" that might have been in the page. This method adds these documents to the summarized_site_pages and stored_site_pages arrays constructed during the execution of processFetchPages()
public
processSubdocs(int &$i, array<string|int, mixed> $site, array<string|int, mixed> &$summarized_site_pages) : mixed
Parameters
- $i : int
-
index to begin adding subdocs at
- $site : array<string|int, mixed>
-
web page that subdocs were from and from which some subdoc summary info is copied
- $summarized_site_pages : array<string|int, mixed>
-
array of summaries of web pages
Return values
mixed
pruneLinks()
This method attempts to cull from the doc_info struct the best $this->max_links_to_extract links. Currently, this is done by first removing links whose filetype or site the crawler is forbidden from crawling.
public
pruneLinks(array<string|int, mixed> &$doc_info[, string $field = CrawlConstants::LINKS ], int $member_cache_time) : mixed
Then a crude estimate of the information contained in the links test: strlen(gzip(text)) is used to extract the best remaining links.
Parameters
- $doc_info : array<string|int, mixed>
-
an array with a CrawlConstants::LINKS subarray. This subarray in turn contains url => text pairs.
- $field : string = CrawlConstants::LINKS
-
field for links default is CrawlConstants::LINKS
- $member_cache_time : int
-
says how long allowed and disallowed url info should be cached by urlMemberSiteArray
Return values
mixed
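The strlen(gzip(text)) idea used by pruneLinks() can be sketched as follows; gzcompress stands in for the gzip step, the helper is hypothetical, and the filetype/forbidden-site filtering done first is omitted.

```php
<?php
// Sketch: score each link's text by its compressed length (a rough
// proxy for information content) and keep only the best links.
function bestLinks(array $links, int $max_links): array
{
    $scores = [];
    foreach ($links as $url => $text) {
        $scores[$url] = strlen(gzcompress($text));
    }
    arsort($scores); // highest scores first
    $keep = array_slice(array_keys($scores), 0, $max_links);
    return array_intersect_key($links, array_flip($keep));
}
```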
reschedulePages()
Sorts out pages for which no content was downloaded so that they can be scheduled to be crawled again.
public
reschedulePages(array<string|int, mixed> &$site_pages) : array<string|int, mixed>
Parameters
- $site_pages : array<string|int, mixed>
-
pages to sort
Return values
array<string|int, mixed> —an array consisting of two arrays: downloaded pages and not downloaded pages.
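In outline, the partition might look like the sketch below; the PAGE key used to detect whether any content came back is illustrative.

```php
<?php
// Sketch: split fetch results into downloaded and not-downloaded pages.
function partitionByDownload(array $site_pages): array
{
    $downloaded = [];
    $not_downloaded = [];
    foreach ($site_pages as $site) {
        if (!empty($site["PAGE"])) { // illustrative key for page content
            $downloaded[] = $site;
        } else {
            $not_downloaded[] = $site;
        }
    }
    return [$downloaded, $not_downloaded];
}
```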
selectCurrentServerAndUpdateIfNeeded()
At least once, and while memory is low, selects the next server and sends any fetcher data we have to it.
public
selectCurrentServerAndUpdateIfNeeded(bool $at_least_current_server) : mixed
Parameters
- $at_least_current_server : bool
-
whether to send the site info to at least one queue server or to send only if memory is above threshold. Only in the latter case is the next server advanced.
Return values
mixed
setCrawlParamsFromArray()
Sets parameters for fetching based on provided info struct ($info typically would come from the queue server)
public
setCrawlParamsFromArray(array<string|int, mixed> &$info) : mixed
Parameters
- $info : array<string|int, mixed>
-
struct with info about the kind of crawl, timestamp of index, crawl order, etc.
Return values
mixed
start()
This is the function that should be called to get the fetcher to start fetching. Calls init to handle the command-line arguments then enters the fetcher's main loop
public
start() : mixed
Return values
mixed
updateDomainFilters()
Updates the array of domain filters currently loaded into memory based on which BloomFilterFiles are present in WORK_DIRECTORY/data/domain_filters and if they have changed since the current in-memory filters were loaded
public
updateDomainFilters() : mixed
Return values
mixed
updateFoundSites()
Updates the $this->found_sites array with data from the most recently downloaded sites. This means updating the following sub arrays: the self::ROBOT_PATHS, self::TO_CRAWL. It checks if there are still more urls to crawl. If so, a mini index is built and the queue server is called with the data.
public
updateFoundSites(array<string|int, mixed> $sites[, bool $force_send = false ]) : mixed
Parameters
- $sites : array<string|int, mixed>
-
site data to use for the update
- $force_send : bool = false
-
whether to force send data back to queue_server or rely on usual thresholds before sending
Return values
mixed
updateScheduler()
Updates the queue_server about sites that have been crawled.
public
updateScheduler(string $server[, bool $send_robots = false ]) : mixed
This method is called if there are currently no more sites to crawl. It compresses and does a post request to send the page summary data, robot data, and to-crawl url data back to the server. In the event that the server doesn't acknowledge, it loops and tries again after a delay until the post is successful. At this point, memory for this data is freed.
Parameters
- $server : string
-
index of queue server to update
- $send_robots : bool = false
-
whether to send robots.txt data if present
Return values
mixed
uploadCrawlData()
Sends to-crawl, robot, and index data to the current queue server.
public
uploadCrawlData(string $queue_server, array<string|int, mixed> $byte_counts, array<string|int, mixed> &$post_data) : mixed
If this data is more than post_max_size, it splits it into chunks which are then reassembled by the queue server web app before being put into the appropriate schedule sub-directory.
Parameters
- $queue_server : string
-
url of the current queue server
- $byte_counts : array<string|int, mixed>
-
has four fields: TOTAL, ROBOT, SCHEDULE, INDEX. These give the number of bytes overall for the 'data' field of $post_data and for each of these components.
- $post_data : array<string|int, mixed>
-
data to be uploaded to the queue server web app
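The chunking idea can be sketched as follows. The field names, target url handling, and use of PHP streams are illustrative only; Yioop's real upload format and acknowledgement/retry protocol differ.

```php
<?php
// Hypothetical sketch: split a payload into pieces no larger than
// $post_max_size and POST each piece with its part number so the
// receiving web app can reassemble the original data.
function postInChunks(string $queue_server, string $data,
    int $post_max_size): void
{
    $chunks = str_split($data, $post_max_size);
    $num_parts = count($chunks);
    foreach ($chunks as $part => $chunk) {
        $context = stream_context_create(["http" => [
            "method" => "POST",
            "header" => "Content-Type: application/x-www-form-urlencoded",
            "content" => http_build_query([
                "part" => $part,
                "num_parts" => $num_parts,
                "data" => $chunk,
            ]),
        ]]);
        // error checking / retry (as described above) omitted
        file_get_contents($queue_server, false, $context);
    }
}
```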