QueueServer
in package
implements CrawlConstants
Command line program responsible for managing Yioop crawls.
It maintains a queue of urls that are going to be scheduled to be seen. It also keeps track of what has been seen and robots.txt info. Its last responsibility is to create and populate the IndexDocumentBundle that is used by the search front end.
Tags
Interfaces, Classes, Traits and Enums
- CrawlConstants
- Shared constants and enums used by components that are involved in the crawling process
Table of Contents
- $all_file_types : array<string|int, mixed>
- List of all known file extensions including those not used for crawl
- $allow_disallow_cache_time : int
- Microtime used to look up cache $allowed_sites and $disallowed_sites filtering data structures
- $allowed_sites : array<string|int, mixed>
- Web-sites that crawler can crawl. If used, ONLY these will be crawled
- $archive_modified_time : int
- This keeps track of the time the current archive info was last modified. This way the queue server knows if the user has changed the crawl parameters during the crawl.
- $cache_pages : bool
- Used in schedules to tell the fetcher whether or not to cache pages
- $channel : int
- Channel that queue server listens to messages for
- $crawl_index : string
- If the crawl_type is self::ARCHIVE_CRAWL, then crawl_index is the timestamp of the existing archive to crawl
- $crawl_order : string
- Constant saying the method used to order the priority queue for the crawl
- $crawl_queue : object
- Holds the CrawlQueueBundle for the crawl. This bundle encapsulates the queue of urls that specifies what to crawl next
- $crawl_status_file_name : string
- Name of the file used to hold statistics about the current crawl
- $crawl_time : int
- The timestamp of the current active crawl
- $crawl_type : string
- Indicates the kind of crawl being performed: self::WEB_CRAWL indicates a new crawl of the web; self::ARCHIVE_CRAWL indicates a crawl of an existing web archive
- $db : object
- Reference to a database object. Used since it has directory manipulation functions
- $debug : string
- Holds the value of a debug message that might have been sent from the command line during the current execution of loop();
- $disallowed_sites : array<string|int, mixed>
- Web-sites that the crawler must not crawl
- $hourly_crawl_data : array<string|int, mixed>
- This is a list of hourly (timestamp, number_of_urls_crawled) statistics
- $index_archive : object
- Holds the IndexDocumentBundle for the current crawl. This encapsulates the inverted index word-->documents for the crawls as well as document summaries of each document.
- $index_dirty : int
- flags for whether the index has data to be written to disk
- $indexed_file_types : array<string|int, mixed>
- List of file extensions supported for the crawl
- $indexing_plugins : array<string|int, mixed>
- This is a list of indexing_plugins which might do post processing after the crawl. The plugin's postProcessing function is called if it is selected in the crawl options page.
- $indexing_plugins_data : array<string|int, mixed>
- This is an array of crawl parameters for indexing_plugins which might do post processing after the crawl.
- $info_parameter_map : array<string|int, mixed>
- A mapping between class field names and parameters which might be sent to a queue server via an info associative array.
- $last_index_save_time : int
- Last time index was saved to disk
- $last_next_partition_to_add : int
- Holds the int value of the previous partition in index
- $max_depth : string
- Constant saying the depth from the seed sites that the crawl can go to
- $max_description_len : int
- Max number of chars to extract for description from a page to index.
- $max_links_to_extract : int
- Maximum number of urls to extract from a single document
- $messages_bundle : object
- Holds the MessagesBundle to be used for the crawl. This bundle is used to store data that is sent between the QueueServer and Fetcher that has yet to be processed.
- $most_recent_fetcher : string
- IP address as a string of the fetcher that most recently spoke with the queue server.
- $page_range_request : int
- Maximum number of bytes to download of a webpage
- $page_recrawl_frequency : int
- Number of days between resets of the page url filter. If nonpositive, the filter is never reset
- $page_rules : array<string|int, mixed>
- Used to add page rules to be applied to downloaded pages to schedules that the fetcher will use (and hence apply the page rules)
- $process_name : string
- String used for naming log files and for naming the processes which run related to the queue server
- $quota_clear_time : int
- Timestamp of the last time download-from-site quotas were cleared
- $quota_sites : array<string|int, mixed>
- Web-sites that have an hourly crawl quota
- $quota_sites_keys : array<string|int, mixed>
- Cache of array_keys of $quota_sites
- $repeat_type : int
- Controls whether a repeating crawl is being done (negative means no) and, if so, its frequency in seconds
- $restrict_sites_by_url : bool
- Says whether the $allowed_sites array is being used or not
- $robots_txt : int
- One of a fixed set of values which are used to control to what extent Yioop follows robots.txt files: ALWAYS_FOLLOW_ROBOTS, ALLOW_LANDING_ROBOTS, IGNORE_ROBOTS
- $server_name : string
- String used to describe this kind of queue server (Indexer, Scheduler, etc.) in the log files.
- $server_type : mixed
- Used to say what kind of queue server this is (one of BOTH, INDEXER, SCHEDULER)
- $sleep_duration : string
- If a crawl quiescent period is being used with the crawl, then this property will be positive and indicate the number of seconds duration for the quiescent period.
- $sleep_start : string
- If a crawl quiescent period is being used with the crawl, then this stores the time of day at which that period starts
- $start_dictionary_time : int
- Keeps track of the time needed for the dictionary updater to add the current partition contents to index
- $summarizer_option : string
- Stores the name of the summarizer used for crawling.
- $waiting_hosts : array<string|int, mixed>
- This is a list of hosts whose robots.txt file had a Crawl-delay directive and which we have produced a schedule with urls for, but we have not heard back from the fetcher who was processing those urls. Hosts on this list will not be scheduled for more downloads until the fetcher with earlier urls has gotten back to the queue server.
- $window_size : int
- Maximum number of to_crawl schedules that can be waiting to be returned in the sequence they were sent out. I.e., if crawl results for fetch batch x have not been returned then fetch batch x + $window_size cannot be created from the queue.
- __construct() : mixed
- Creates a Queue Server Daemon
- allowedToCrawlSite() : bool
- Checks if url belongs to a list of sites that are allowed to be crawled and that the file type is crawlable
- calculateScheduleMetaInfo() : string
- Used to create an encoded string representing meta info for a fetcher schedule.
- checkBothProcessesRunning() : mixed
- Checks to make sure both the indexer and the scheduler processes are running and, if not, restarts the stopped process
- checkProcessRunning() : mixed
- Checks to make sure the given process (either Indexer or Scheduler) is running.
- checkRepeatingCrawlSwap() : bool
- Check for a repeating crawl whether it is time to swap between the active and search crawls.
- checkUpdateCrawlParameters() : mixed
- Checks to see if the parameters by which the active crawl are being conducted have been modified since the last time the values were put into queue server field variables. If so, it updates the values to their new values
- deleteOrphanedBundles() : mixed
- Delete all the queue bundles and schedules that don't have an associated index bundle as this means that the crawl has been deleted.
- disallowedToCrawlSite() : bool
- Checks if url belongs to a list of sites that aren't supposed to be crawled
- dumpBigScheduleToSmall() : mixed
- Used to split a large schedule of to crawl sites into small ones (which are written to disk) that can be handled by processToCrawlUrls
- getEarliestSlot() : int
- Gets the first unfilled schedule slot after $index in $arr
- handleAdminMessages() : array<string|int, mixed>
- Handles messages passed via files to the QueueServer.
- indexSave() : mixed
- Builds inverted index and saves active partition
- initializeCrawlQueue() : mixed
- This method sets up a CrawlQueueBundle according to the current crawl order so that it can receive urls and prioritize them.
- initializeIndexBundle() : mixed
- Function used to set up an indexer's IndexDocumentBundle or DoubleIndexBundle according to the current crawl parameters or the values stored in an existing bundle.
- isAIndexer() : bool
- Used to check if the current queue server process is acting as an indexer of data coming from fetchers
- isAScheduler() : bool
- Used to check if the current queue server process is acting as a url scheduler for fetchers
- isOnlyIndexer() : bool
- Used to check if the current queue server process is acting only as an indexer of data coming from fetchers (and not some other activity like scheduler as well)
- isOnlyScheduler() : bool
- Used to check if the current queue server process is acting only as a url scheduler for fetchers (and not some other activity like indexer as well)
- loop() : mixed
- Main runtime loop of the queue server.
- processCrawlData() : mixed
- Main body of queue server loop where indexing, scheduling, robot file processing is done.
- processEtagExpires() : mixed
- Process cache page validation data files sent by Fetcher
- processEtagExpiresArchive() : mixed
- Processes cache page validation data. Extracts key-value pairs from the data and inserts them into the LinearHashTable used for storing cache page validation data.
- processIndexArchive() : mixed
- Adds the summary and index data in $file to summary bundle and word index
- processIndexData() : mixed
- Sets up the directory to look for a file of unprocessed index archive data from fetchers then calls the function processDataFile to process the oldest file found
- processReceivedRobotTxtUrls() : mixed
- This method moves urls in the waiting hosts folder for hosts listed in $this->crawl_queue->robot_notify_hosts into the queue. Membership of a host in $this->crawl_queue->robot_notify_hosts indicates that a robots.txt file has just been received for that domain.
- processRecrawlDataArchive() : mixed
- Processes fetcher data file information during a recrawl
- processRecrawlRobotUrls() : mixed
- Even during a recrawl the fetcher may send robot data to the queue server. This function prints a log message and calls another function to delete this useless robot file.
- processRobotArchive() : mixed
- Reads in $sites of robot data: hosts and their associated robots.txt allowed/disallowed paths, crawl delay info, and dns info.
- processRobotUrls() : mixed
- Checks how old the oldest robot data is and dumps it if older than a threshold, then sets up the path to the robot schedule directory and tries to process a file of robots.txt paths data from there
- processToCrawlArchive() : mixed
- Process to-crawl urls adding to or adjusting the weight in the PriorityQueue of those which have not been seen. Also updates the queue with seen url info
- processToCrawlUrls() : mixed
- Checks for a new crawl file or schedule data for the current crawl and, if such a file exists, processes its contents, adding the relevant urls to the priority queue
- produceFetchBatch() : mixed
- Produces a schedule.txt file of url data for a fetcher to crawl next.
- runPostProcessingPlugins() : mixed
- During crawl shutdown this is called to run any post processing plugins
- shutdownDictionary() : mixed
- During crawl shutdown, this function is called to do a final save and merge of the crawl dictionary, so that it is ready to serve queries.
- start() : mixed
- This is the function that should be called to get the queue server to start. Calls init to handle the command line arguments then enters the queue server's main loop
- startCrawl() : mixed
- Begins crawling based on the time, order, and restricted site info in $info. Setting up a crawl involves creating a queue bundle and an index archive bundle
- stopCrawl() : mixed
- Used to stop the currently running crawl gracefully so that it can be restarted. This involves writing the queue's contents back to schedules, making the crawl's dictionary all the same tier, and running any indexing_plugins.
- updateDisallowedQuotaSites() : mixed
- This is called whenever the crawl options are modified to parse, from the disallowed sites, those sites of the format: site#quota, where quota is the number of urls allowed to be downloaded in an hour from the site. These sites are then deleted from disallowed_sites and added to $this->quota_sites. An entry in $this->quota_sites has the format: $quota_site => [$quota, $num_urls_downloaded_this_hr]
- updateMostRecentFetcher() : mixed
- Determines the most recent fetcher that has spoken with the web server of this queue server and stores the result in the field variable most_recent_fetcher
- withinQuota() : bool
- Checks if the $url is from a site which has an hourly quota to download.
- writeAdminMessage() : mixed
- Used to write an admin crawl status message during a start or stop crawl.
- writeArchiveCrawlInfo() : mixed
- Used to write info about the current recrawl to file as well as to process any recrawl data files received
- writeCrawlStatus() : mixed
- Writes status information about the current crawl so that the webserver app can use it for its display.
Properties
$all_file_types
List of all known file extensions including those not used for crawl
public
array<string|int, mixed>
$all_file_types
$allow_disallow_cache_time
Microtime used to look up cache $allowed_sites and $disallowed_sites filtering data structures
public
int
$allow_disallow_cache_time
$allowed_sites
Web-sites that crawler can crawl. If used, ONLY these will be crawled
public
array<string|int, mixed>
$allowed_sites
$archive_modified_time
This keeps track of the time the current archive info was last modified. This way the queue server knows if the user has changed the crawl parameters during the crawl.
public
int
$archive_modified_time
$cache_pages
Used in schedules to tell the fetcher whether or not to cache pages
public
bool
$cache_pages
$channel
Channel that queue server listens to messages for
public
int
$channel
$crawl_index
If the crawl_type is self::ARCHIVE_CRAWL, then crawl_index is the timestamp of the existing archive to crawl
public
string
$crawl_index
$crawl_order
Constant saying the method used to order the priority queue for the crawl
public
string
$crawl_order
$crawl_queue
Holds the CrawlQueueBundle for the crawl. This bundle encapsulates the queue of urls that specifies what to crawl next
public
object
$crawl_queue
$crawl_status_file_name
Name of the file used to hold statistics about the current crawl
public
string
$crawl_status_file_name
$crawl_time
The timestamp of the current active crawl
public
int
$crawl_time
$crawl_type
Indicates the kind of crawl being performed: self::WEB_CRAWL indicates a new crawl of the web; self::ARCHIVE_CRAWL indicates a crawl of an existing web archive
public
string
$crawl_type
$db
Reference to a database object. Used since it has directory manipulation functions
public
object
$db
$debug
Holds the value of a debug message that might have been sent from the command line during the current execution of loop();
public
string
$debug
$disallowed_sites
Web-sites that the crawler must not crawl
public
array<string|int, mixed>
$disallowed_sites
$hourly_crawl_data
This is a list of hourly (timestamp, number_of_urls_crawled) statistics
public
array<string|int, mixed>
$hourly_crawl_data
$index_archive
Holds the IndexDocumentBundle for the current crawl. This encapsulates the inverted index word-->documents for the crawls as well as document summaries of each document.
public
object
$index_archive
$index_dirty
flags for whether the index has data to be written to disk
public
int
$index_dirty
$indexed_file_types
List of file extensions supported for the crawl
public
array<string|int, mixed>
$indexed_file_types
$indexing_plugins
This is a list of indexing_plugins which might do post processing after the crawl. The plugin's postProcessing function is called if it is selected in the crawl options page.
public
array<string|int, mixed>
$indexing_plugins
$indexing_plugins_data
This is an array of crawl parameters for indexing_plugins which might do post processing after the crawl.
public
array<string|int, mixed>
$indexing_plugins_data
$info_parameter_map
A mapping between class field names and parameters which might be sent to a queue server via an info associative array.
public
static array<string|int, mixed>
$info_parameter_map
= ["crawl_order" => self::CRAWL_ORDER, "crawl_type" => self::CRAWL_TYPE, "crawl_index" => self::CRAWL_INDEX, "cache_pages" => self::CACHE_PAGES, "page_range_request" => self::PAGE_RANGE_REQUEST, "max_depth" => self::MAX_DEPTH, "repeat_type" => self::REPEAT_TYPE, "sleep_start" => self::SLEEP_START, "sleep_duration" => self::SLEEP_DURATION, "robots_txt" => self::ROBOTS_TXT, "max_description_len" => self::MAX_DESCRIPTION_LEN, "max_links_to_extract" => self::MAX_LINKS_TO_EXTRACT, "page_recrawl_frequency" => self::PAGE_RECRAWL_FREQUENCY, "indexed_file_types" => self::INDEXED_FILE_TYPES, "restrict_sites_by_url" => self::RESTRICT_SITES_BY_URL, "allowed_sites" => self::ALLOWED_SITES, "disallowed_sites" => self::DISALLOWED_SITES, "page_rules" => self::PAGE_RULES, "indexing_plugins" => self::INDEXING_PLUGINS, "indexing_plugins_data" => self::INDEXING_PLUGINS_DATA]
$last_index_save_time
Last time index was saved to disk
public
int
$last_index_save_time
$last_next_partition_to_add
Holds the int value of the previous partition in index
public
int
$last_next_partition_to_add
$max_depth
Constant saying the depth from the seed sites that the crawl can go to
public
string
$max_depth
$max_description_len
Max number of chars to extract for description from a page to index.
public
int
$max_description_len
Only words in the description are indexed.
$max_links_to_extract
Maximum number of urls to extract from a single document
public
int
$max_links_to_extract
$messages_bundle
Holds the MessagesBundle to be used for the crawl. This bundle is used to store data that is sent between the QueueServer and Fetcher that has yet to be processed.
public
object
$messages_bundle
$most_recent_fetcher
IP address as a string of the fetcher that most recently spoke with the queue server.
public
string
$most_recent_fetcher
$page_range_request
Maximum number of bytes to download of a webpage
public
int
$page_range_request
$page_recrawl_frequency
Number of days between resets of the page url filter. If nonpositive, the filter is never reset
public
int
$page_recrawl_frequency
$page_rules
Used to add page rules to be applied to downloaded pages to schedules that the fetcher will use (and hence apply the page rules)
public
array<string|int, mixed>
$page_rules
$process_name
String used for naming log files and for naming the processes which run related to the queue server
public
string
$process_name
$quota_clear_time
Timestamp of the last time download-from-site quotas were cleared
public
int
$quota_clear_time
$quota_sites
Web-sites that have an hourly crawl quota
public
array<string|int, mixed>
$quota_sites
$quota_sites_keys
Cache of array_keys of $quota_sites
public
array<string|int, mixed>
$quota_sites_keys
$repeat_type
Controls whether a repeating crawl is being done (negative means no) and, if so, its frequency in seconds
public
int
$repeat_type
$restrict_sites_by_url
Says whether the $allowed_sites array is being used or not
public
bool
$restrict_sites_by_url
$robots_txt
One of a fixed set of values which are used to control to what extent Yioop follows robots.txt files: ALWAYS_FOLLOW_ROBOTS, ALLOW_LANDING_ROBOTS, IGNORE_ROBOTS
public
int
$robots_txt
$server_name
String used to describe this kind of queue server (Indexer, Scheduler, etc.) in the log files.
public
string
$server_name
$server_type
Used to say what kind of queue server this is (one of BOTH, INDEXER, SCHEDULER)
public
mixed
$server_type
$sleep_duration
If a crawl quiescent period is being used with the crawl, then this property will be positive and indicate the number of seconds duration for the quiescent period.
public
string
$sleep_duration
$sleep_start
If a crawl quiescent period is being used with the crawl, then this stores the time of day at which that period starts
public
string
$sleep_start
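As a rough, hypothetical sketch (not taken from the Yioop source) of how $sleep_start and $sleep_duration might be combined to decide whether the crawl is currently in its quiescent period, assuming $sleep_start holds a time of day such as "01:30" and $sleep_duration a length in seconds:
// Hypothetical helper: true if the current time falls in the quiescent period.
function inQuiescentPeriod(string $sleep_start, int $sleep_duration): bool
{
    if ($sleep_duration <= 0) {
        return false; // no quiescent period configured
    }
    $start = strtotime($sleep_start); // today's occurrence of the start time
    $now = time();
    // also cover a period that began yesterday and wraps past midnight
    return ($now >= $start && $now < $start + $sleep_duration) ||
        ($now >= $start - 86400 && $now < $start - 86400 + $sleep_duration);
}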
$start_dictionary_time
Keeps track of the time needed for the dictionary updater to add the current partition contents to index
public
int
$start_dictionary_time
$summarizer_option
Stores the name of the summarizer used for crawling.
public
string
$summarizer_option
Possible values are Basic and Centroid
$waiting_hosts
This is a list of hosts whose robots.txt file had a Crawl-delay directive and which we have produced a schedule with urls for, but we have not heard back from the fetcher who was processing those urls. Hosts on this list will not be scheduled for more downloads until the fetcher with earlier urls has gotten back to the queue server.
public
array<string|int, mixed>
$waiting_hosts
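A minimal sketch, assuming a simple associative array keyed by host, of the bookkeeping this list implies; the helper names below are invented for illustration and are not Yioop methods.
// Hypothetical crawl-delay bookkeeping around a waiting-hosts list.
function markHostWaiting(array &$waiting_hosts, string $host, int $schedule_time)
{
    $waiting_hosts[$host] = $schedule_time; // urls for $host are out with a fetcher
}
function canScheduleHost(array $waiting_hosts, string $host): bool
{
    return !isset($waiting_hosts[$host]); // skip hosts still waiting on a fetcher
}
function hostScheduleReturned(array &$waiting_hosts, string $host)
{
    unset($waiting_hosts[$host]); // fetcher reported back; host may be scheduled again
}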
$window_size
Maximum number of to_crawl schedules that can be waiting to be returned in the sequence they were sent out. I.e., if crawl results for fetch batch x have not been returned then fetch batch x + $window_size cannot be created from the queue.
public
int
$window_size
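A small illustrative check (not the actual implementation) of the constraint this property describes:
// Hypothetical sketch: batch x + $window_size cannot be produced until
// results for batch x have come back from the fetchers.
function canProduceFetchBatch(int $next_batch, int $oldest_unreturned_batch,
    int $window_size): bool
{
    return $next_batch < $oldest_unreturned_batch + $window_size;
}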
Methods
__construct()
Creates a Queue Server Daemon
public
__construct() : mixed
Return values
mixed
allowedToCrawlSite()
Checks if url belongs to a list of sites that are allowed to be crawled and that the file type is crawlable
public
allowedToCrawlSite(string $url) : bool
Parameters
- $url : string
-
url to check
Return values
bool —whether it is allowed to be crawled or not
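A simplified, hypothetical version of this kind of check is sketched below; the real method presumably relies on Yioop's own url handling together with the crawl parameters described in the Properties section, so this only illustrates the idea.
// Hypothetical allowed-to-crawl sketch using $allowed_sites (prefix match),
// $indexed_file_types, and $restrict_sites_by_url.
function allowedToCrawlSiteSketch(string $url, array $allowed_sites,
    array $indexed_file_types, bool $restrict_sites_by_url): bool
{
    $path = (string) parse_url($url, PHP_URL_PATH);
    $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    if ($extension != "" && !in_array($extension, $indexed_file_types)) {
        return false; // file type is not crawlable
    }
    if (!$restrict_sites_by_url) {
        return true; // the allowed sites list is not being enforced
    }
    foreach ($allowed_sites as $site) {
        if (strpos($url, $site) === 0) {
            return true; // url begins with one of the allowed site prefixes
        }
    }
    return false;
}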
calculateScheduleMetaInfo()
Used to create an encoded string representing meta info for a fetcher schedule.
public
calculateScheduleMetaInfo(int $schedule_time) : string
Parameters
- $schedule_time : int
-
timestamp of the schedule
Return values
string —base64 encoded meta info
checkBothProcessesRunning()
Checks to make sure both the indexer and the scheduler processes are running and, if not, restarts the stopped process
public
checkBothProcessesRunning(array<string|int, mixed> $info) : mixed
Parameters
- $info : array<string|int, mixed>
-
information about queue server state used to determine if a crawl is active.
Return values
mixed
checkProcessRunning()
Checks to make sure the given process (either Indexer or Scheduler) is running.
public
checkProcessRunning(string $process, array<string|int, mixed> $info) : mixed
Parameters
- $process : string
-
should be either self::INDEXER or self::SCHEDULER
- $info : array<string|int, mixed>
-
information about queue server state used to determine if a crawl is active.
Return values
mixed
checkRepeatingCrawlSwap()
Check for a repeating crawl whether it is time to swap between the active and search crawls.
public
checkRepeatingCrawlSwap() : bool
Return values
bool —true if the time to swap has come
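Purely as an illustration of what such a test might look like (the value recording the last swap time is an assumption of this sketch, not a documented property):
// Hypothetical repeating-crawl swap test. $repeat_type is the frequency in
// seconds of the repeating crawl (nonpositive means not repeating).
function timeToSwapSketch(int $repeat_type, int $last_swap_time): bool
{
    if ($repeat_type <= 0) {
        return false; // not a repeating crawl
    }
    return time() - $last_swap_time >= $repeat_type;
}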
checkUpdateCrawlParameters()
Checks to see if the parameters by which the active crawl are being conducted have been modified since the last time the values were put into queue server field variables. If so, it updates the values to their new values
public
checkUpdateCrawlParameters() : mixed
Return values
mixed
deleteOrphanedBundles()
Delete all the queue bundles and schedules that don't have an associated index bundle as this means that the crawl has been deleted.
public
deleteOrphanedBundles() : mixed
Return values
mixed
disallowedToCrawlSite()
Checks if url belongs to a list of sites that aren't supposed to be crawled
public
disallowedToCrawlSite(string $url) : bool
Parameters
- $url : string
-
url to check
Return values
bool —whether it shouldn't be crawled
dumpBigScheduleToSmall()
Used to split a large schedule of to crawl sites into small ones (which are written to disk) that can be handled by processToCrawlUrls
public
dumpBigScheduleToSmall(int $schedule_time, array<string|int, mixed> &$sites) : mixed
The size of the to crawl list depends on the number of links found during a fetch batch. This can be quite large compared to the fetch batch itself, and during processing we might be doing a fair bit of manipulation of arrays of sites, so the idea is that splitting like this will hopefully reduce the memory burden of scheduling.
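A minimal sketch of this kind of splitting, assuming serialized chunks written to a schedule directory; the chunk size and file naming are invented for the example.
// Hypothetical split of a big to-crawl array into smaller schedule files.
function dumpBigScheduleToSmallSketch(int $schedule_time, array &$sites,
    string $schedule_dir, int $chunk_size = 5000)
{
    foreach (array_chunk($sites, $chunk_size) as $i => $chunk) {
        $file_name = "$schedule_dir/At{$schedule_time}Part{$i}.txt";
        file_put_contents($file_name, serialize($chunk));
    }
    $sites = []; // release the memory held by the large in-RAM schedule
}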
Parameters
- $schedule_time : int
-
timestamp of schedule we are splitting
- $sites : array<string|int, mixed>
-
array containing to crawl data
Return values
mixed
getEarliestSlot()
Gets the first unfilled schedule slot after $index in $arr
public
getEarliestSlot(int $index, array<string|int, mixed> &$arr) : int
A schedule of sites for a fetcher to crawl consists of MAX_FETCH_SIZE many slots, each of which could eventually hold url information. This function is used to schedule slots for crawl-delayed hosts.
Parameters
- $index : int
-
location to begin searching for an empty slot
- $arr : array<string|int, mixed>
-
list of slots to look in
Return values
int —index of first available slot
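A sketch consistent with the description above, assuming an unfilled slot is simply an empty entry of the schedule array:
// Hypothetical first-empty-slot search over a fetch schedule.
function getEarliestSlotSketch(int $index, array &$arr): int
{
    $num_slots = count($arr);
    for ($i = $index + 1; $i < $num_slots; $i++) {
        if (empty($arr[$i])) {
            return $i; // first unfilled slot after $index
        }
    }
    return $num_slots; // no free slot remains in this schedule
}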
handleAdminMessages()
Handles messages passed via files to the QueueServer.
public
handleAdminMessages(array<string|int, mixed> $info) : array<string|int, mixed>
These files are typically written by CrawlDaemon::init() when QueueServer is run using command-line arguments
Parameters
- $info : array<string|int, mixed>
-
associative array with info about current state of queue server
Return values
array<string|int, mixed> —an updated version of $info reflecting changes that occurred during the handling of the admin message files.
indexSave()
Builds inverted index and saves active partition
public
indexSave() : mixed
Return values
mixed
initializeCrawlQueue()
This method sets up a CrawlQueueBundle according to the current crawl order so that it can receive urls and prioritize them.
public
initializeCrawlQueue() : mixed
Return values
mixed
initializeIndexBundle()
Function used to set up an indexer's IndexDocumentBundle or DoubleIndexBundle according to the current crawl parameters or the values stored in an existing bundle.
public
initializeIndexBundle([array<string|int, mixed> $info = [] ][, array<string|int, mixed> $try_to_set_from_old_index = null ]) : mixed
Parameters
- $info : array<string|int, mixed> = []
-
if initializing a new crawl this should contain the crawl parameters
- $try_to_set_from_old_index : array<string|int, mixed> = null
-
parameters of the crawl to try to set from values already stored in archive info, other parameters are assumed to have been updated since.
Return values
mixed
isAIndexer()
Used to check if the current queue server process is acting as an indexer of data coming from fetchers
public
isAIndexer() : bool
Return values
bool —whether it is or not
isAScheduler()
Used to check if the current queue server process is acting as a url scheduler for fetchers
public
isAScheduler() : bool
Return values
bool —whether it is or not
isOnlyIndexer()
Used to check if the current queue server process is acting only as an indexer of data coming from fetchers (and not some other activity like scheduler as well)
public
isOnlyIndexer() : bool
Return values
bool —whether it is or not
isOnlyScheduler()
Used to check if the current queue server process is acting only as a url scheduler for fetchers (and not some other activity like indexer as well)
public
isOnlyScheduler() : bool
Return values
bool —whether it is or not
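The four role checks above presumably come down to tests on $server_type. A hedged sketch of what such tests might look like, assuming $server_type holds one of the constants BOTH, INDEXER, or SCHEDULER mentioned in the Properties section (this is not necessarily the exact Yioop implementation):
// Hypothetical role checks based on the documented $server_type values.
public function isAIndexer(): bool
{
    return in_array($this->server_type, [self::BOTH, self::INDEXER]);
}
public function isAScheduler(): bool
{
    return in_array($this->server_type, [self::BOTH, self::SCHEDULER]);
}
public function isOnlyIndexer(): bool
{
    return $this->server_type === self::INDEXER;
}
public function isOnlyScheduler(): bool
{
    return $this->server_type === self::SCHEDULER;
}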
loop()
Main runtime loop of the queue server.
public
loop() : mixed
Loops until a stop message is received, checks for start, stop, and resume crawl messages, and deletes any CrawlQueueBundle for which an IndexDocumentBundle does not exist. Each iteration also processes crawl data (see processCrawlData()).
Return values
mixed
processCrawlData()
Main body of queue server loop where indexing, scheduling, robot file processing is done.
public
processCrawlData() : mixed
Return values
mixed
processEtagExpires()
Process cache page validation data files sent by Fetcher
public
processEtagExpires() : mixed
Return values
mixed
processEtagExpiresArchive()
Processes cache page validation data. Extracts key-value pairs from the data and inserts them into the LinearHashTable used for storing cache page validation data.
public
processEtagExpiresArchive(array<string|int, mixed> &$etag_expires_data) : mixed
Parameters
- $etag_expires_data : array<string|int, mixed>
-
is the cache page validation data from the Fetchers.
Return values
mixed
processIndexArchive()
Adds the summary and index data in $file to summary bundle and word index
public
processIndexArchive(string &$pre_sites_and_index) : mixed
Parameters
- $pre_sites_and_index : string
-
containing web pages summaries
Return values
mixed
processIndexData()
Sets up the directory to look for a file of unprocessed index archive data from fetchers then calls the function processDataFile to process the oldest file found
public
processIndexData() : mixed
Return values
mixed
processReceivedRobotTxtUrls()
This method moves urls in the waiting hosts folder for hosts listed in $this->crawl_queue->robot_notify_hosts into the queue. Membership of a host in $this->crawl_queue->robot_notify_hosts indicates that a robots.txt file has just been received for that domain.
public
processReceivedRobotTxtUrls() : mixed
Return values
mixed
processRecrawlDataArchive()
Processes fetcher data file information during a recrawl
public
processRecrawlDataArchive(array<string|int, mixed> $sites) : mixed
Parameters
- $sites : array<string|int, mixed>
-
a file with recently crawled urls (and other to_crawl data, which will be discarded because we are doing a recrawl)
Return values
mixed
processRecrawlRobotUrls()
Even during a recrawl the fetcher may send robot data to the queue server. This function prints a log message and calls another function to delete this useless robot file.
public
processRecrawlRobotUrls() : mixed
Return values
mixed
processRobotArchive()
Reads in $sites of robot data: hosts and their associated robots.txt allowed/disallowed paths, crawl delay info, and dns info.
public
processRobotArchive(mixed &$sites) : mixed
Adds this to the robot_table entry for this host. Adds dns info to the RAM-based dns cache hash table.
Parameters
- $sites : mixed
Return values
mixed
processRobotUrls()
Checks how old the oldest robot data is and dumps it if older than a threshold, then sets up the path to the robot schedule directory and tries to process a file of robots.txt paths data from there
public
processRobotUrls() : mixed
Return values
mixed
processToCrawlArchive()
Process to-crawl urls adding to or adjusting the weight in the PriorityQueue of those which have not been seen. Also updates the queue with seen url info
public
processToCrawlArchive(array<string|int, mixed> &$sites) : mixed
Parameters
- $sites : array<string|int, mixed>
-
containing to crawl and seen url info
Return values
mixed
processToCrawlUrls()
Checks for a new crawl file or schedule data for the current crawl and, if such a file exists, processes its contents, adding the relevant urls to the priority queue
public
processToCrawlUrls() : mixed
Return values
mixed
produceFetchBatch()
Produces a schedule.txt file of url data for a fetcher to crawl next.
public
produceFetchBatch() : mixed
The hard part of scheduling is to make sure that the overall crawl process obeys robots.txt files. This involves checking that the url is in an allowed path for that host, and it also involves making sure the Crawl-delay directive is respected. The first fetcher that contacts the server requesting data to crawl will get the schedule.txt produced by produceFetchBatch(), at which point that file will be unlinked (these latter things are controlled in FetchController).
Tags
Return values
mixed
runPostProcessingPlugins()
During crawl shutdown this is called to run any post processing plugins
public
runPostProcessingPlugins() : mixed
Return values
mixed
shutdownDictionary()
During crawl shutdown, this function is called to do a final save and merge of the crawl dictionary, so that it is ready to serve queries.
public
shutdownDictionary() : mixed
Return values
mixed
start()
This is the function that should be called to get the queue server to start. Calls init to handle the command line arguments then enters the queue server's main loop
public
start() : mixed
Return values
mixed
startCrawl()
Begins crawling based on the time, order, and restricted site info in $info. Setting up a crawl involves creating a queue bundle and an index archive bundle
public
startCrawl(array<string|int, mixed> $info) : mixed
Parameters
- $info : array<string|int, mixed>
-
parameter for the crawl
Return values
mixed
stopCrawl()
Used to stop the currently running crawl gracefully so that it can be restarted. This involves writing the queue's contents back to schedules, making the crawl's dictionary all the same tier, and running any indexing_plugins.
public
stopCrawl() : mixed
Return values
mixed
updateDisallowedQuotaSites()
This is called whenever the crawl options are modified to parse, from the disallowed sites, those sites of the format: site#quota, where quota is the number of urls allowed to be downloaded in an hour from the site. These sites are then deleted from disallowed_sites and added to $this->quota_sites. An entry in $this->quota_sites has the format: $quota_site => [$quota, $num_urls_downloaded_this_hr]
public
updateDisallowedQuotaSites() : mixed
Return values
mixed
updateMostRecentFetcher()
Determines the most recent fetcher that has spoken with the web server of this queue server and stores the result in the field variable most_recent_fetcher
public
updateMostRecentFetcher() : mixed
Return values
mixed
withinQuota()
Checks if the $url is from a site which has an hourly quota to download.
public
withinQuota(string $url[, int $bump_count = 1 ]) : bool
If so, it bumps the quota count and returns true; false otherwise. This method also resets the quota queue every hour.
Parameters
- $url : string
-
to check if within quota
- $bump_count : int = 1
-
how much to bump quota count if url is from a site with a quota
Return values
bool —whether $url exceeds the hourly quota of the site it is from
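A hypothetical sketch of such a quota check using the $quota_sites entry format documented above ($quota_site => [$quota, $num_urls_downloaded_this_hr]); the function name and exact behavior are illustrative assumptions, not the actual method.
// Hypothetical hourly quota check and bump for a site with a quota.
function withinQuotaSketch(string $site, array &$quota_sites,
    int $bump_count = 1): bool
{
    if (!isset($quota_sites[$site])) {
        return true; // site has no quota, so it is always within quota
    }
    list($quota, $num_downloaded) = $quota_sites[$site];
    if ($num_downloaded + $bump_count > $quota) {
        return false; // hourly quota would be exceeded
    }
    $quota_sites[$site][1] += $bump_count; // bump this hour's download count
    return true;
}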
writeAdminMessage()
Used to write an admin crawl status message during a start or stop crawl.
public
writeAdminMessage(string $message) : mixed
Parameters
- $message : string
-
to write into crawl_status.txt this will show up in the web crawl status element.
Return values
mixed
writeArchiveCrawlInfo()
Used to write info about the current recrawl to file as well as to process any recrawl data files received
public
writeArchiveCrawlInfo() : mixed
Return values
mixed
writeCrawlStatus()
Writes status information about the current crawl so that the webserver app can use it for its display.
public
writeCrawlStatus(array<string|int, mixed> $recent_urls) : mixed
Parameters
- $recent_urls : array<string|int, mixed>
-
contains the most recently crawled sites