Yioop_V9.5_Source_Code_Documentation

QueueServer
in package
implements CrawlConstants

Command line program responsible for managing Yioop crawls.

It maintains a queue of urls that are waiting to be scheduled to be crawled. It also keeps track of what has been seen and of robots.txt info. Its last responsibility is to create and populate the IndexDocumentBundle that is used by the search front end.

Tags
author

Chris Pollett

Interfaces, Classes, Traits and Enums

CrawlConstants
Shared constants and enums used by components that are involved in the crawling process

Table of Contents

$all_file_types  : array<string|int, mixed>
List of all known file extensions including those not used for crawl
$allow_disallow_cache_time  : int
Microtime used to look up cache $allowed_sites and $disallowed_sites filtering data structures
$allowed_sites  : array<string|int, mixed>
Web-sites that crawler can crawl. If used, ONLY these will be crawled
$archive_modified_time  : int
This keeps track of the time the current archive info was last modified. This way the queue server knows if the user has changed the crawl parameters during the crawl.
$cache_pages  : bool
Used in schedules to tell the fetcher whether or not to cache pages
$channel  : int
Channel that queue server listens to messages for
$crawl_index  : string
If the crawl_type is self::ARCHIVE_CRAWL, then crawl_index is the timestamp of the existing archive to crawl
$crawl_order  : string
Constant saying the method used to order the priority queue for the crawl
$crawl_queue  : object
Holds the CrawlQueueBundle for the crawl. This bundle encapsulates the queue of urls that specifies what to crawl next
$crawl_status_file_name  : string
Name of the file used to hold statistics about the current crawl
$crawl_time  : int
The timestamp of the current active crawl
$crawl_type  : string
Indicates the kind of crawl being performed: self::WEB_CRAWL indicates a new crawl of the web; self::ARCHIVE_CRAWL indicates a crawl of an existing web archive
$db  : object
Reference to a database object. Used since it has directory manipulation functions
$debug  : string
Holds the value of a debug message that might have been sent from the command line during the current execution of loop();
$disallowed_sites  : array<string|int, mixed>
Web-sites that the crawler must not crawl
$hourly_crawl_data  : array<string|int, mixed>
This is a list of hourly (timestamp, number_of_urls_crawled) statistics
$index_archive  : object
Holds the IndexDocumentBundle for the current crawl. This encapsulates the inverted index word-->documents for the crawls as well as document summaries of each document.
$index_dirty  : int
Flag for whether the index has data to be written to disk
$indexed_file_types  : array<string|int, mixed>
List of file extensions supported for the crawl
$indexing_plugins  : array<string|int, mixed>
This is a list of indexing_plugins which might do post processing after the crawl. The plugin's postProcessing function is called if it is selected in the crawl options page.
$indexing_plugins_data  : array<string|int, mixed>
This is an array of crawl parameters for indexing_plugins which might do post processing after the crawl.
$info_parameter_map  : array<string|int, mixed>
A mapping between class field names and parameters which might be sent to a queue server via an info associative array.
$last_index_save_time  : int
Last time index was saved to disk
$last_next_partition_to_add  : int
Holds the int value of the previous partition in index
$max_depth  : string
Constant saying the maximum depth from the seed sites that the crawl can go to
$max_description_len  : int
Max number of chars to extract for description from a page to index.
$max_links_to_extract  : int
Maximum number of urls to extract from a single document
$messages_bundle  : object
Holds the MessagesBundle to be used for the crawl. This bundle is used to store data that is sent between the QueueServer and Fetcher but has yet to be processed.
$most_recent_fetcher  : string
IP address as a string of the fetcher that most recently spoke with the queue server.
$page_range_request  : int
Maximum number of bytes to download of a webpage
$page_recrawl_frequency  : int
Number of days between resets of the page url filter. If nonpositive, the filter is never reset
$page_rules  : array<string|int, mixed>
Used to add page rules, to be applied to downloaded pages, to the schedules that the fetcher will use (and hence apply the page rules)
$process_name  : string
String used for naming log files and for naming the processes which run related to the queue server
$quota_clear_time  : int
Timestamp of the last time the download-from-site quotas were cleared
$quota_sites  : array<string|int, mixed>
Web-sites that have an hourly crawl quota
$quota_sites_keys  : array<string|int, mixed>
Cache of array_keys of $quota_sites
$repeat_type  : int
Controls whether a repeating crawl is being done (negative means no) and, if so, its frequency in seconds
$restrict_sites_by_url  : bool
Says whether the $allowed_sites array is being used or not
$robots_txt  : int
One of a fixed set of values which are used to control to what extent Yioop follows robots.txt files: ALWAYS_FOLLOW_ROBOTS, ALLOW_LANDING_ROBOTS, IGNORE_ROBOTS
$server_name  : string
String used to describe this kind of queue server (Indexer, Scheduler, etc.) in the log files.
$server_type  : mixed
Used to say what kind of queue server this is (one of BOTH, INDEXER, SCHEDULER)
$sleep_duration  : string
If a crawl quiescent period is being used with the crawl, then this property will be positive and indicate the duration of the quiescent period in seconds.
$sleep_start  : string
If a crawl quiescent period is being used with the crawl, then this stores the time of day at which that period starts
$start_dictionary_time  : int
Keeps track of the time needed for the dictionary updater to add the current partition contents to index
$summarizer_option  : string
Stores the name of the summarizer used for crawling.
$waiting_hosts  : array<string|int, mixed>
This is a list of hosts whose robots.txt file had a Crawl-delay directive and which we have produced a schedule with urls for, but we have not heard back from the fetcher who was processing those urls. Hosts on this list will not be scheduled for more downloads until the fetcher with earlier urls has gotten back to the queue server.
$window_size  : int
Maximum number of to_crawl schedules that can be waiting to be returned in the sequence they were sent out. I.e., if crawl results for fetch batch x have not been returned, then fetch batch x + $window_size cannot be created from the queue.
__construct()  : mixed
Creates a Queue Server Daemon
allowedToCrawlSite()  : bool
Checks if url belongs to a list of sites that are allowed to be crawled and that the file type is crawlable
calculateScheduleMetaInfo()  : string
Used to create an encoded string representing meta info for a fetcher schedule.
checkBothProcessesRunning()  : mixed
Checks to make sure both the indexer process and the scheduler processes are running and if not restart the stopped process
checkProcessRunning()  : mixed
Checks to make sure the given process (either Indexer or Scheduler) is running.
checkRepeatingCrawlSwap()  : bool
Check for a repeating crawl whether it is time to swap between the active and search crawls.
checkUpdateCrawlParameters()  : mixed
Checks to see if the parameters by which the active crawl is being conducted have been modified since the last time the values were put into queue server field variables. If so, it updates the values to their new values
deleteOrphanedBundles()  : mixed
Delete all the queue bundles and schedules that don't have an associated index bundle as this means that crawl has been deleted.
disallowedToCrawlSite()  : bool
Checks if url belongs to a list of sites that aren't supposed to be crawled
dumpBigScheduleToSmall()  : mixed
Used to split a large schedule of to-crawl sites into small ones (which are written to disk) that can be handled by processToCrawlUrls
getEarliestSlot()  : int
Gets the first unfilled schedule slot after $index in $arr
handleAdminMessages()  : array<string|int, mixed>
Handles messages passed via files to the QueueServer.
indexSave()  : mixed
Builds inverted index and saves active partition
initializeCrawlQueue()  : mixed
This method sets up a CrawlQueueBundle according to the current crawl order so that it can receive urls and prioritize them.
initializeIndexBundle()  : mixed
Function used to set up an indexer's IndexDocumentBundle or DoubleIndexBundle according to the current crawl parameters or the values stored in an existing bundle.
isAIndexer()  : bool
Used to check if the current queue server process is acting as an indexer of data coming from fetchers
isAScheduler()  : bool
Used to check if the current queue server process is acting as a url scheduler for fetchers
isOnlyIndexer()  : bool
Used to check if the current queue server process is acting only as an indexer of data coming from fetchers (and not some other activity like scheduler as well)
isOnlyScheduler()  : bool
Used to check if the current queue server process is acting only as a url scheduler for fetchers (and not some other activity like indexer as well)
loop()  : mixed
Main runtime loop of the queue server.
processCrawlData()  : mixed
Main body of queue server loop where indexing, scheduling, robot file processing is done.
processEtagExpires()  : mixed
Process cache page validation data files sent by Fetcher
processEtagExpiresArchive()  : mixed
Processes cache page validation data. Extracts key-value pairs from the data and inserts them into the LinearHashTable used for storing cache page validation data.
processIndexArchive()  : mixed
Adds the summary and index data in $file to summary bundle and word index
processIndexData()  : mixed
Sets up the directory to look for a file of unprocessed index archive data from fetchers, then calls the function processDataFile to process the oldest file found
processReceivedRobotTxtUrls()  : mixed
This method is used to move urls in the waiting hosts folder for hosts listed in $this->crawl_queue->robot_notify_hosts into the queue, because host membership in $this->crawl_queue->robot_notify_hosts indicates that a robots.txt file has just been received for the particular domain.
processRecrawlDataArchive()  : mixed
Processes fetcher data file information during a recrawl
processRecrawlRobotUrls()  : mixed
Even during a recrawl the fetcher may send robot data to the queue server. This function prints a log message and calls another function to delete this useless robot file.
processRobotArchive()  : mixed
Reads in $sites of robot data: hosts and their associated robots.txt allowed/disallowed paths, crawl delay info, and dns info.
processRobotUrls()  : mixed
Checks how old the oldest robot data is and dumps it if older than a threshold, then sets up the path to the robot schedule directory and tries to process a file of robots.txt robot paths data from there
processToCrawlArchive()  : mixed
Process to-crawl urls adding to or adjusting the weight in the PriorityQueue of those which have not been seen. Also updates the queue with seen url info
processToCrawlUrls()  : mixed
Checks for a new crawl file or schedule data for the current crawl and, if such a file exists, processes its contents adding the relevant urls to the priority queue
produceFetchBatch()  : mixed
Produces a schedule.txt file of url data for a fetcher to crawl next.
runPostProcessingPlugins()  : mixed
During crawl shutdown this is called to run any post processing plugins
shutdownDictionary()  : mixed
During crawl shutdown, this function is called to do a final save and merge of the crawl dictionary, so that it is ready to serve queries.
start()  : mixed
This is the function that should be called to get the queue server to start. Calls init to handle the command line arguments then enters the queue server's main loop
startCrawl()  : mixed
Begins crawling based on the time, order, and restricted site info in $info. Setting up a crawl involves creating a queue bundle and an index archive bundle
stopCrawl()  : mixed
Used to stop the currently running crawl gracefully so that it can be restarted. This involves writing the queue's contents back to schedules, making the crawl's dictionary all the same tier, and running any indexing_plugins.
updateDisallowedQuotaSites()  : mixed
This is called whenever the crawl options are modified to parse from the disallowed sites those sites of the format: site#quota, where quota is the number of urls allowed to be downloaded in an hour from the site. These sites are then deleted from disallowed_sites and added to $this->quota_sites. An entry in $this->quota_sites has the format: $quota_site => [$quota, $num_urls_downloaded_this_hr]
updateMostRecentFetcher()  : mixed
Determines the most recent fetcher that has spoken with the web server of this queue server and stores the result in the field variable most_recent_fetcher
withinQuota()  : bool
Checks if the $url is from a site which has an hourly quota to download.
writeAdminMessage()  : mixed
Used to write an admin crawl status message during a start or stop crawl.
writeArchiveCrawlInfo()  : mixed
Used to write info about the current recrawl to file as well as to process any recrawl data files received
writeCrawlStatus()  : mixed
Writes status information about the current crawl so that the webserver app can use it for its display.

Properties

$all_file_types

List of all known file extensions including those not used for crawl

public array<string|int, mixed> $all_file_types

$allow_disallow_cache_time

Microtime used to look up cache $allowed_sites and $disallowed_sites filtering data structures

public int $allow_disallow_cache_time

$allowed_sites

Web-sites that crawler can crawl. If used, ONLY these will be crawled

public array<string|int, mixed> $allowed_sites

$archive_modified_time

This keeps track of the time the current archive info was last modified. This way the queue server knows if the user has changed the crawl parameters during the crawl.

public int $archive_modified_time

$cache_pages

Used in schedules to tell the fetcher whether or not to cache pages

public bool $cache_pages

$channel

Channel that queue server listens to messages for

public int $channel

$crawl_index

If the crawl_type is self::ARCHIVE_CRAWL, then crawl_index is the timestamp of the existing archive to crawl

public string $crawl_index

$crawl_order

Constant saying the method used to order the priority queue for the crawl

public string $crawl_order

$crawl_queue

Holds the CrawlQueueBundle for the crawl. This bundle encapsulates the queue of urls that specifies what to crawl next

public object $crawl_queue

$crawl_status_file_name

Name of the file used to hold statistics about the current crawl

public string $crawl_status_file_name

$crawl_time

The timestamp of the current active crawl

public int $crawl_time

$crawl_type

Indicates the kind of crawl being performed: self::WEB_CRAWL indicates a new crawl of the web; self::ARCHIVE_CRAWL indicates a crawl of an existing web archive

public string $crawl_type

$db

Reference to a database object. Used since it has directory manipulation functions

public object $db

$debug

Holds the value of a debug message that might have been sent from the command line during the current execution of loop();

public string $debug

$disallowed_sites

Web-sites that the crawler must not crawl

public array<string|int, mixed> $disallowed_sites

$hourly_crawl_data

This is a list of hourly (timestamp, number_of_urls_crawled) statistics

public array<string|int, mixed> $hourly_crawl_data

$index_archive

Holds the IndexDocumentBundle for the current crawl. This encapsulates the inverted index word-->documents for the crawls as well as document summaries of each document.

public object $index_archive

$index_dirty

Flag for whether the index has data to be written to disk

public int $index_dirty

$indexed_file_types

List of file extensions supported for the crawl

public array<string|int, mixed> $indexed_file_types

$indexing_plugins

This is a list of indexing_plugins which might do post processing after the crawl. The plugin's postProcessing function is called if it is selected in the crawl options page.

public array<string|int, mixed> $indexing_plugins

$indexing_plugins_data

This is an array of crawl parameters for indexing_plugins which might do post processing after the crawl.

public array<string|int, mixed> $indexing_plugins_data

$info_parameter_map

A mapping between class field names and parameters which might be sent to a queue server via an info associative array.

public static array<string|int, mixed> $info_parameter_map = ["crawl_order" => self::CRAWL_ORDER, "crawl_type" => self::CRAWL_TYPE, "crawl_index" => self::CRAWL_INDEX, "cache_pages" => self::CACHE_PAGES, "page_range_request" => self::PAGE_RANGE_REQUEST, "max_depth" => self::MAX_DEPTH, "repeat_type" => self::REPEAT_TYPE, "sleep_start" => self::SLEEP_START, "sleep_duration" => self::SLEEP_DURATION, "robots_txt" => self::ROBOTS_TXT, "max_description_len" => self::MAX_DESCRIPTION_LEN, "max_links_to_extract" => self::MAX_LINKS_TO_EXTRACT, "page_recrawl_frequency" => self::PAGE_RECRAWL_FREQUENCY, "indexed_file_types" => self::INDEXED_FILE_TYPES, "restrict_sites_by_url" => self::RESTRICT_SITES_BY_URL, "allowed_sites" => self::ALLOWED_SITES, "disallowed_sites" => self::DISALLOWED_SITES, "page_rules" => self::PAGE_RULES, "indexing_plugins" => self::INDEXING_PLUGINS, "indexing_plugins_data" => self::INDEXING_PLUGINS_DATA]
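
The following is a minimal sketch (not Yioop's actual code) of how a map like $info_parameter_map can be used to copy crawl parameters received in an info associative array into the corresponding queue server fields; the class, constant values, and method name below are placeholders for illustration.

<?php
// Sketch: copy values from an $info array into object fields using a
// field-name => message-key map in the style of $info_parameter_map.
// Constant values are placeholders, not Yioop's real constants.
class InfoMapExample
{
    const CRAWL_ORDER = "co";
    const MAX_DEPTH = "md";
    public static $info_parameter_map = [
        "crawl_order" => self::CRAWL_ORDER,
        "max_depth" => self::MAX_DEPTH,
    ];
    public $crawl_order;
    public $max_depth;
    public function setFieldsFromInfo(array $info)
    {
        foreach (self::$info_parameter_map as $field => $key) {
            if (isset($info[$key])) {
                $this->$field = $info[$key];
            }
        }
    }
}
$server = new InfoMapExample();
$server->setFieldsFromInfo([InfoMapExample::CRAWL_ORDER => "page_importance",
    InfoMapExample::MAX_DEPTH => 10]);
echo $server->crawl_order . " " . $server->max_depth . "\n"; // page_importance 10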

$last_index_save_time

Last time index was saved to disk

public int $last_index_save_time

$last_next_partition_to_add

Holds the int value of the previous partition in index

public int $last_next_partition_to_add

$max_depth

Constant saying the maximum depth from the seed sites that the crawl can go to

public string $max_depth

$max_description_len

Max number of chars to extract for description from a page to index.

public int $max_description_len

Only words in the description are indexed.

$max_links_to_extract

Maximum number of urls to extract from a single document

public int $max_links_to_extract

$messages_bundle

Holds the MessagesBundle to be used for the crawl. This bundle is used to store data that is sent between the QueueServer and Fetcher but has yet to be processed.

public object $messages_bundle

$most_recent_fetcher

IP address as a string of the fetcher that most recently spoke with the queue server.

public string $most_recent_fetcher

$page_range_request

Maximum number of bytes to download of a webpage

public int $page_range_request

$page_recrawl_frequency

Number of days between resets of the page url filter. If nonpositive, the filter is never reset

public int $page_recrawl_frequency

$page_rules

Used to add page rules, to be applied to downloaded pages, to the schedules that the fetcher will use (and hence apply the page rules)

public array<string|int, mixed> $page_rules

$process_name

String used for naming log files and for naming the processes which run related to the queue server

public string $process_name

$quota_clear_time

Timestamp of the last time the download-from-site quotas were cleared

public int $quota_clear_time

$quota_sites

Web-sites that have an hourly crawl quota

public array<string|int, mixed> $quota_sites

$quota_sites_keys

Cache of array_keys of $quota_sites

public array<string|int, mixed> $quota_sites_keys

$repeat_type

Controls whether a repeating crawl is being done (negative means no) and, if so, its frequency in seconds

public int $repeat_type

$restrict_sites_by_url

Says whether the $allowed_sites array is being used or not

public bool $restrict_sites_by_url

$robots_txt

One of a fixed set of values which are used to control to what extent Yioop follows robots.txt files: ALWAYS_FOLLOW_ROBOTS, ALLOW_LANDING_ROBOTS, IGNORE_ROBOTS

public int $robots_txt

$server_name

String used to describe this kind of queue server (Indexer, Scheduler, etc.) in the log files.

public string $server_name

$server_type

Used to say what kind of queue server this is (one of BOTH, INDEXER, SCHEDULER)

public mixed $server_type

$sleep_duration

If a crawl quiescent period is being used with the crawl, then this property will be positive and indicate the duration of the quiescent period in seconds.

public string $sleep_duration

$sleep_start

If a crawl quiescent period is being used with the crawl, then this stores the time of day at which that period starts

public string $sleep_start
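
Below is a hedged sketch of how a daily quiescent period could be checked from a start time of day and a duration in seconds; it assumes $sleep_start is an "HH:MM" string and $sleep_duration is a number of seconds, and is illustrative rather than Yioop's implementation.

<?php
// Sketch: is the current time inside the crawl's quiescent period?
// Assumes an "HH:MM" start time and a duration in seconds.
function inQuiescentPeriod(string $sleep_start, int $sleep_duration,
    int $now): bool
{
    if ($sleep_duration <= 0) {
        return false; // no quiescent period configured
    }
    list($hour, $minute) = array_map('intval', explode(":", $sleep_start));
    $start_today = mktime($hour, $minute, 0, (int)date("n", $now),
        (int)date("j", $now), (int)date("Y", $now));
    // also test yesterday's start so a period that wraps past midnight works
    foreach ([$start_today, $start_today - 86400] as $start) {
        if ($now >= $start && $now < $start + $sleep_duration) {
            return true;
        }
    }
    return false;
}
// Example: a two hour quiescent period starting at 01:30 each day
var_dump(inQuiescentPeriod("01:30", 7200, time()));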

$start_dictionary_time

Keeps track of the time needed for the dictionary updater to add the current partition contents to index

public int $start_dictionary_time

$summarizer_option

Stores the name of the summarizer used for crawling.

public string $summarizer_option

Possible values are Basic and Centroid

$waiting_hosts

This is a list of hosts whose robots.txt file had a Crawl-delay directive and which we have produced a schedule with urls for, but we have not heard back from the fetcher who was processing those urls. Hosts on this list will not be scheduled for more downloads until the fetcher with earlier urls has gotten back to the queue server.

public array<string|int, mixed> $waiting_hosts

$window_size

Maximum number of to_crawl schedules that can be waiting to be returned in the sequence they were sent out. I.e., if crawl results for fetch batch x have not been returned, then fetch batch x + $window_size cannot be created from the queue.

public int $window_size
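
The window constraint can be stated as a one-line check; the following is an illustrative sketch with hypothetical names, not Yioop's code.

<?php
// Sketch: fetch batch $next_batch may only be produced if it is within
// $window_size of the oldest batch whose results have not yet come back.
function canProduceBatch(int $next_batch, int $oldest_unreturned_batch,
    int $window_size): bool
{
    return $next_batch < $oldest_unreturned_batch + $window_size;
}
var_dump(canProduceBatch(12, 10, 5)); // true: still inside the window
var_dump(canProduceBatch(15, 10, 5)); // false: batch 10 results still pending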

Methods

__construct()

Creates a Queue Server Daemon

public __construct() : mixed
Return values
mixed

allowedToCrawlSite()

Checks if url belongs to a list of sites that are allowed to be crawled and that the file type is crawlable

public allowedToCrawlSite(string $url) : bool
Parameters
$url : string

url to check

Return values
bool

whether it is allowed to be crawled or not

calculateScheduleMetaInfo()

Used to create an encoded string representing meta info for a fetcher schedule.

public calculateScheduleMetaInfo(int $schedule_time) : string
Parameters
$schedule_time : int

timestamp of the schedule

Return values
string

base64 encoded meta info

checkBothProcessesRunning()

Checks to make sure both the indexer process and the scheduler processes are running and if not restart the stopped process

public checkBothProcessesRunning(array<string|int, mixed> $info) : mixed
Parameters
$info : array<string|int, mixed>

information about queue server state used to determine if a crawl is active.

Return values
mixed

checkProcessRunning()

Checks to make sure the given process (either Indexer or Scheduler) is running.

public checkProcessRunning(string $process, array<string|int, mixed> $info) : mixed
Parameters
$process : string

should be either self::INDEXER or self::SCHEDULER

$info : array<string|int, mixed>

information about queue server state used to determine if a crawl is active.

Return values
mixed

checkRepeatingCrawlSwap()

Check for a repeating crawl whether it is time to swap between the active and search crawls.

public checkRepeatingCrawlSwap() : bool
Return values
bool

true if the time to swap has come

checkUpdateCrawlParameters()

Checks to see if the parameters by which the active crawl is being conducted have been modified since the last time the values were put into queue server field variables. If so, it updates the values to their new values

public checkUpdateCrawlParameters() : mixed
Return values
mixed

deleteOrphanedBundles()

Delete all the queue bundles and schedules that don't have an associated index bundle as this means that crawl has been deleted.

public deleteOrphanedBundles() : mixed
Return values
mixed

disallowedToCrawlSite()

Checks if url belongs to a list of sites that aren't supposed to be crawled

public disallowedToCrawlSite(string $url) : bool
Parameters
$url : string

url to check

Return values
bool

whether it shouldn't be crawled
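
To illustrate how the allowed and disallowed site lists interact, here is a hedged, self-contained sketch; the simple prefix-matching rule is an assumption for illustration, not Yioop's exact matching logic.

<?php
// Sketch: decide if a url passes host-based allow/disallow filtering in the
// spirit of allowedToCrawlSite()/disallowedToCrawlSite().
function urlMatchesSiteList(string $url, array $sites): bool
{
    foreach ($sites as $site) {
        if (strncmp($url, $site, strlen($site)) == 0) {
            return true; // $site is a prefix of $url
        }
    }
    return false;
}
$allowed = ["https://example.org/", "https://example.com/docs/"];
$disallowed = ["https://example.com/docs/private/"];
$url = "https://example.com/docs/private/a.html";
$ok = urlMatchesSiteList($url, $allowed) &&
    !urlMatchesSiteList($url, $disallowed);
var_dump($ok); // false: the url falls under a disallowed path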

dumpBigScheduleToSmall()

Used to split a large schedule of to-crawl sites into small ones (which are written to disk) that can be handled by processToCrawlUrls

public dumpBigScheduleToSmall(int $schedule_time, array<string|int, mixed> &$sites) : mixed

The size of the to-crawl list depends on the number of links found during a fetch batch. This can be quite large compared to the fetch batch, and during processing we might be doing a fair bit of manipulation of arrays of sites, so the idea is that splitting like this will hopefully reduce the memory burden of scheduling.

Parameters
$schedule_time : int

timestamp of schedule we are splitting

$sites : array<string|int, mixed>

array containing to crawl data

Return values
mixed
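
The splitting itself amounts to chunking a large array and writing each chunk to its own file; the sketch below shows that idea with an assumed chunk size and file naming scheme, not Yioop's actual format.

<?php
// Sketch: break a big to-crawl list into small schedule files that can be
// processed one at a time, reducing peak memory use during scheduling.
function dumpScheduleChunks(array $sites, string $dir, int $chunk_size = 5000)
{
    if (!is_dir($dir)) {
        mkdir($dir, 0777, true);
    }
    foreach (array_chunk($sites, $chunk_size) as $i => $chunk) {
        file_put_contents($dir . "/schedule_part_" . $i . ".txt",
            serialize($chunk));
    }
}
dumpScheduleChunks(["https://a.example/", "https://b.example/"],
    sys_get_temp_dir() . "/schedules", 1);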

getEarliestSlot()

Gets the first unfilled schedule slot after $index in $arr

public getEarliestSlot(int $index, array<string|int, mixed> &$arr) : int

A schedule of sites for a fetcher to crawl consists of MAX_FETCH_SIZE many slots, each of which could eventually hold url information. This function is used to schedule slots for crawl-delayed hosts.

Parameters
$index : int

location to begin searching for an empty slot

$arr : array<string|int, mixed>

list of slots to look in

Return values
int

index of first available slot
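
A minimal sketch of the slot search described above; the sentinel value used to mark an empty slot (false here) is an assumption for illustration.

<?php
// Sketch: find the first unfilled slot after position $index in a fixed-size
// schedule, as getEarliestSlot() does when placing crawl-delayed urls.
function earliestSlot(int $index, array $slots): int
{
    $num_slots = count($slots);
    for ($i = $index + 1; $i < $num_slots; $i++) {
        if ($slots[$i] === false) {
            return $i; // first empty slot after $index
        }
    }
    return -1; // no free slot remains
}
$slots = [false, "url-a", "url-b", false, false];
echo earliestSlot(1, $slots) . "\n"; // 3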

handleAdminMessages()

Handles messages passed via files to the QueueServer.

public handleAdminMessages(array<string|int, mixed> $info) : array<string|int, mixed>

These files are typically written by CrawlDaemon::init() when QueueServer is run using command-line arguments

Parameters
$info : array<string|int, mixed>

associative array with info about current state of queue server

Return values
array<string|int, mixed>

an updated version of $info reflecting changes that occurred during the handling of the admin message files.
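
The following hedged sketch shows the general file-based message passing pattern this method relies on: a controller writes a small message file which the queue server reads and removes on its next pass. The file name, location, and message format are assumptions, not Yioop's exact ones.

<?php
// Sketch: consume a one-shot admin message file if one has been written.
$message_file = "/tmp/queue_server_messages.txt"; // hypothetical path
if (file_exists($message_file)) {
    $message = unserialize(file_get_contents($message_file));
    unlink($message_file); // remove so the message is handled only once
    if (($message["command"] ?? "") == "stop") {
        // a real queue server would shut the crawl down gracefully here
        echo "stop message received\n";
    }
}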

indexSave()

Builds inverted index and saves active partition

public indexSave() : mixed
Return values
mixed

initializeCrawlQueue()

This method sets up a CrawlQueueBundle according to the current crawl order so that it can receive urls and prioritize them.

public initializeCrawlQueue() : mixed
Return values
mixed

initializeIndexBundle()

Function used to set up an indexer's IndexDocumentBundle or DoubleIndexBundle according to the current crawl parameters or the values stored in an existing bundle.

public initializeIndexBundle([array<string|int, mixed> $info = [] ][, array<string|int, mixed> $try_to_set_from_old_index = null ]) : mixed
Parameters
$info : array<string|int, mixed> = []

if initializing a new crawl this should contain the crawl parameters

$try_to_set_from_old_index : array<string|int, mixed> = null

parameters of the crawl to try to set from values already stored in archive info, other parameters are assumed to have been updated since.

Return values
mixed

isAIndexer()

Used to check if the current queue server process is acting as an indexer of data coming from fetchers

public isAIndexer() : bool
Return values
bool

whether it is or not

isAScheduler()

Used to check if the current queue server process is acting as a url scheduler for fetchers

public isAScheduler() : bool
Return values
bool

whether it is or not

isOnlyIndexer()

Used to check if the current queue server process is acting only as an indexer of data coming from fetchers (and not some other activity like scheduler as well)

public isOnlyIndexer() : bool
Return values
bool

whether it is or not

isOnlyScheduler()

Used to check if the current queue server process is acting only as a url scheduler for fetchers (and not some other activity like indexer as well)

public isOnlyScheduler() : bool
Return values
bool

whether it is or not
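
The four predicates above can all be phrased in terms of the $server_type field; the sketch below shows that relationship with placeholder constant values, not Yioop's.

<?php
// Sketch: server role predicates derived from a $server_type that is one of
// BOTH, INDEXER, SCHEDULER.
class ServerTypeExample
{
    const BOTH = "both";
    const INDEXER = "indexer";
    const SCHEDULER = "scheduler";
    public $server_type = self::BOTH;
    public function isAIndexer(): bool
    {
        return in_array($this->server_type, [self::BOTH, self::INDEXER]);
    }
    public function isAScheduler(): bool
    {
        return in_array($this->server_type, [self::BOTH, self::SCHEDULER]);
    }
    public function isOnlyIndexer(): bool
    {
        return $this->server_type == self::INDEXER;
    }
    public function isOnlyScheduler(): bool
    {
        return $this->server_type == self::SCHEDULER;
    }
}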

loop()

Main runtime loop of the queue server.

public loop() : mixed

Loops until a stop message is received; checks for start, stop, and resume crawl messages; deletes any CrawlQueueBundle for which an IndexDocumentBundle does not exist; and processes any crawl data received from fetchers.

Return values
mixed
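
A very reduced sketch of the control flow loop() describes follows: poll for admin messages, process crawl data while a crawl is active, and exit when a stop message arrives. Only the documented method names (handleAdminMessages, processCrawlData) are taken from this page; the FakeServer class, the STATUS key, and the stop condition are illustrative assumptions.

<?php
// Sketch: skeleton of a queue-server-style main loop.
class FakeServer
{
    private $ticks = 0;
    public function handleAdminMessages(array $info): array
    {
        // pretend a stop message arrives after three iterations
        if (++$this->ticks >= 3) {
            $info["STATUS"] = "stop";
        }
        return $info;
    }
    public function processCrawlData()
    {
        echo "processing crawl data\n";
    }
}
$server = new FakeServer();
$info = ["STATUS" => "continue", "CRAWL_RUNNING" => true];
while ($info["STATUS"] != "stop") {
    $info = $server->handleAdminMessages($info);
    if ($info["STATUS"] == "stop") {
        break; // stop requested; a real server would shut down gracefully
    }
    if (!empty($info["CRAWL_RUNNING"])) {
        $server->processCrawlData();
    }
}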

processCrawlData()

Main body of queue server loop where indexing, scheduling, robot file processing is done.

public processCrawlData() : mixed
Return values
mixed

processEtagExpires()

Process cache page validation data files sent by Fetcher

public processEtagExpires() : mixed
Return values
mixed

processEtagExpiresArchive()

Processes cache page validation data. Extracts key-value pairs from the data and inserts them into the LinearHashTable used for storing cache page validation data.

public processEtagExpiresArchive(array<string|int, mixed> &$etag_expires_data) : mixed
Parameters
$etag_expires_data : array<string|int, mixed>

is the cache page validation data from the Fetchers.

Return values
mixed

processIndexArchive()

Adds the summary and index data in $file to summary bundle and word index

public processIndexArchive(string &$pre_sites_and_index) : mixed
Parameters
$pre_sites_and_index : string

containing web pages summaries

Return values
mixed

processIndexData()

Sets up the directory to look for a file of unprocessed index archive data from fetchers, then calls the function processDataFile to process the oldest file found

public processIndexData() : mixed
Return values
mixed

processReceivedRobotTxtUrls()

This method is used to move urls in the waiting hosts folder for hosts listed in $this->crawl_queue->robot_notify_hosts into the queue, because host membership in $this->crawl_queue->robot_notify_hosts indicates that a robots.txt file has just been received for the particular domain.

public processReceivedRobotTxtUrls() : mixed
Return values
mixed

processRecrawlDataArchive()

Processes fetcher data file information during a recrawl

public processRecrawlDataArchive(array<string|int, mixed> $sites) : mixed
Parameters
$sites : array<string|int, mixed>

a file of recently crawled urls (and other to_crawl data, which will be discarded because we are doing a recrawl)

Return values
mixed

processRecrawlRobotUrls()

Even during a recrawl the fetcher may send robot data to the queue server. This function prints a log message and calls another function to delete this useless robot file.

public processRecrawlRobotUrls() : mixed
Return values
mixed

processRobotArchive()

Reads in $sites of robot data: hosts and their associated robots.txt allowed/disallowed paths, crawl delay info, and dns info.

public processRobotArchive(mixed &$sites) : mixed

Adds this to the robot_table entry for this host. Adds dns info to the RAM-based dns cache hash table.

Parameters
$sites : mixed
Return values
mixed

processRobotUrls()

Checks how old the oldest robot data is and dumps it if older than a threshold, then sets up the path to the robot schedule directory and tries to process a file of robots.txt robot paths data from there

public processRobotUrls() : mixed
Return values
mixed

processToCrawlArchive()

Process to-crawl urls adding to or adjusting the weight in the PriorityQueue of those which have not been seen. Also updates the queue with seen url info

public processToCrawlArchive(array<string|int, mixed> &$sites) : mixed
Parameters
$sites : array<string|int, mixed>

containing to crawl and seen url info

Return values
mixed

processToCrawlUrls()

Checks for a new crawl file or schedule data for the current crawl and, if such a file exists, processes its contents adding the relevant urls to the priority queue

public processToCrawlUrls() : mixed
Return values
mixed

produceFetchBatch()

Produces a schedule.txt file of url data for a fetcher to crawl next.

public produceFetchBatch() : mixed

The hard part of scheduling is to make sure that the overall crawl process obeys robots.txt files. This involves checking the url is in an allowed path for that host and it also involves making sure the Crawl-delay directive is respected. The first fetcher that contacts the server requesting data to crawl will get the schedule.txt produced by produceFetchBatch() at which point it will be unlinked (these latter things are controlled in FetchController).

Tags
see
FetchController
Return values
mixed

runPostProcessingPlugins()

During crawl shutdown this is called to run any post processing plugins

public runPostProcessingPlugins() : mixed
Return values
mixed

shutdownDictionary()

During crawl shutdown, this function is called to do a final save and merge of the crawl dictionary, so that it is ready to serve queries.

public shutdownDictionary() : mixed
Return values
mixed

start()

This is the function that should be called to get the queue server to start. Calls init to handle the command line arguments then enters the queue server's main loop

public start() : mixed
Return values
mixed
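
As a hedged usage note (the exact script name, path, and accepted arguments depend on the Yioop version and installation), the queue server is normally launched as its own command-line process, and start() is its entry point:

<?php
// Typical invocation (illustrative; verify against your Yioop install):
//   php QueueServer.php terminal   - run in the foreground, logging to the terminal
//   php QueueServer.php start      - run as a background daemon
//   php QueueServer.php stop       - stop the daemon
// Programmatically, the entry point is simply:
// $queue_server = new QueueServer();
// $queue_server->start(); // handles command-line args, then enters loop()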

startCrawl()

Begins crawling based on the time, order, and restricted site info in $info. Setting up a crawl involves creating a queue bundle and an index archive bundle

public startCrawl(array<string|int, mixed> $info) : mixed
Parameters
$info : array<string|int, mixed>

parameter for the crawl

Return values
mixed

stopCrawl()

Used to stop the currently running crawl gracefully so that it can be restarted. This involves writing the queue's contents back to schedules, making the crawl's dictionary all the same tier, and running any indexing_plugins.

public stopCrawl() : mixed
Return values
mixed

updateDisallowedQuotaSites()

This is called whenever the crawl options are modified to parse from the disallowed sites those sites of the format: site#quota, where quota is the number of urls allowed to be downloaded in an hour from the site. These sites are then deleted from disallowed_sites and added to $this->quota_sites. An entry in $this->quota_sites has the format: $quota_site => [$quota, $num_urls_downloaded_this_hr]

public updateDisallowedQuotaSites() : mixed
Return values
mixed
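
The transformation this method performs can be sketched as follows; the helper name is hypothetical, and only the site#quota format and the $quota_site => [$quota, $num_urls_downloaded_this_hr] entry layout are taken from the description above.

<?php
// Sketch: split site#quota entries out of a disallowed sites list into a
// quota map, leaving the genuinely disallowed sites behind.
function splitQuotaSites(array $disallowed_sites): array
{
    $quota_sites = [];
    $remaining = [];
    foreach ($disallowed_sites as $site) {
        if (strpos($site, "#") !== false) {
            list($quota_site, $quota) = explode("#", $site, 2);
            // entry format: $quota_site => [$quota, urls downloaded this hour]
            $quota_sites[$quota_site] = [intval($quota), 0];
        } else {
            $remaining[] = $site;
        }
    }
    return [$remaining, $quota_sites];
}
list($disallowed, $quotas) =
    splitQuotaSites(["https://example.com/", "https://example.org/#100"]);
print_r($quotas); // ["https://example.org/" => [100, 0]]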

updateMostRecentFetcher()

Determines the most recent fetcher that has spoken with the web server of this queue server and stores the result in the field variable most_recent_fetcher

public updateMostRecentFetcher() : mixed
Return values
mixed

withinQuota()

Checks if the $url is from a site which has an hourly quota to download.

public withinQuota(string $url[, int $bump_count = 1 ]) : bool

If so, it bumps the quota count and returns true; false otherwise. This method also resets the quota counts every hour.

Parameters
$url : string

to check if within quota

$bump_count : int = 1

how much to bump quota count if url is from a site with a quota

Return values
bool

whether $url is within the hourly quota of the site it is from
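
A minimal sketch of an hourly quota check in the spirit of withinQuota() follows: bump the per-site count and report whether the quota is still respected. The quota entry layout matches the description of $quota_sites above; the exact host matching and hourly reset are omitted, and the function name is hypothetical.

<?php
// Sketch: check and update an hourly per-site download quota.
function withinQuotaExample(array &$quota_sites, string $site,
    int $bump_count = 1): bool
{
    if (!isset($quota_sites[$site])) {
        return true; // no quota configured for this site
    }
    list($quota, $count) = $quota_sites[$site];
    if ($count + $bump_count > $quota) {
        return false; // this download would exceed the hourly quota
    }
    $quota_sites[$site][1] = $count + $bump_count;
    return true;
}
$quota_sites = ["https://example.org/" => [2, 0]];
var_dump(withinQuotaExample($quota_sites, "https://example.org/")); // true
var_dump(withinQuotaExample($quota_sites, "https://example.org/")); // true
var_dump(withinQuotaExample($quota_sites, "https://example.org/")); // false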

writeAdminMessage()

Used to write an admin crawl status message during a start or stop crawl.

public writeAdminMessage(string $message) : mixed
Parameters
$message : string

to write into crawl_status.txt this will show up in the web crawl status element.

Return values
mixed

writeArchiveCrawlInfo()

Used to write info about the current recrawl to file as well as to process any recrawl data files received

public writeArchiveCrawlInfo() : mixed
Return values
mixed

writeCrawlStatus()

Writes status information about the current crawl so that the webserver app can use it for its display.

public writeCrawlStatus(array<string|int, mixed> $recent_urls) : mixed
Parameters
$recent_urls : array<string|int, mixed>

contains the most recently crawled sites

Return values
mixed

        
