Yioop_V9.5_Source_Code_Documentation

QueueServer
in package
implements CrawlConstants

Command line program responsible for managing Yioop crawls.

It maintains a queue of urls that are waiting to be scheduled to be crawled. It also keeps track of what has been seen and of robots.txt info. Its last responsibility is to create and populate the IndexDocumentBundle that is used by the search front end.

Tags
author

Chris Pollett

Interfaces, Classes, Traits and Enums

CrawlConstants
Shared constants and enums used by components that are involved in the crawling process

Table of Contents

$all_file_types  : array<string|int, mixed>
List of all known file extensions including those not used for crawl
$allow_disallow_cache_time  : int
Microtime used to look up cache $allowed_sites and $disallowed_sites filtering data structures
$allowed_sites  : array<string|int, mixed>
Web-sites that crawler can crawl. If used, ONLY these will be crawled
$archive_modified_time  : int
This keeps track of the time the current archive info was last modified. This way the queue server knows if the user has changed the crawl parameters during the crawl.
$cache_pages  : bool
Used in schedules to tell the fetcher whether or not to cache pages
$channel  : int
Channel that queue server listens to messages for
$crawl_index  : string
If the crawl_type is self::ARCHIVE_CRAWL, then crawl_index is the timestamp of the existing archive to crawl
$crawl_order  : string
Constant saying the method used to order the priority queue for the crawl
$crawl_queue  : object
Holds the CrawlQueueBundle for the crawl. This bundle encapsulates the queue of urls that specifies what to crawl next
$crawl_status_file_name  : string
Name of the file used to hold statistics about the current crawl
$crawl_time  : int
The timestamp of the current active crawl
$crawl_type  : string
Indicates the kind of crawl being performed: self::WEB_CRAWL indicates a new crawl of the web; self::ARCHIVE_CRAWL indicates a crawl of an existing web archive
$db  : object
Reference to a database object. Used since it has directory manipulation functions
$debug  : string
Holds the value of a debug message that might have been sent from the command line during the current execution of loop();
$disallowed_sites  : array<string|int, mixed>
Web-sites that the crawler must not crawl
$hourly_crawl_data  : array<string|int, mixed>
This is a list of hourly (timestamp, number_of_urls_crawled) statistics
$index_archive  : object
Holds the IndexDocumentBundle for the current crawl. This encapsulates the inverted index word-->documents for the crawls as well as document summaries of each document.
$index_dirty  : int
Flag for whether the index has data to be written to disk
$indexed_file_types  : array<string|int, mixed>
List of file extensions supported for the crawl
$indexing_plugins  : array<string|int, mixed>
This is a list of indexing_plugins which might do post processing after the crawl. The plugin's postProcessing function is called if it is selected in the crawl options page.
$indexing_plugins_data  : array<string|int, mixed>
This is an array of crawl parameters for indexing_plugins which might do post processing after the crawl.
$info_parameter_map  : array<string|int, mixed>
A mapping between class field names and parameters which might be sent to a queue server via an info associative array.
$last_index_save_time  : int
Last time index was saved to disk
$last_next_partition_to_add  : int
Holds the int value of the previous partition in index
$max_depth  : string
Constant saying the maximum depth from the seed sites that the crawl can go to
$max_description_len  : int
Max number of chars to extract for description from a page to index.
$max_links_to_extract  : int
Maximum number of urls to extract from a single document
$messages_bundle  : object
Holds the MessagesBundle to be used for the crawl. This bundle is used to store data that is sent between the QueueServer and Fetcher but has yet to be processed.
$most_recent_fetcher  : string
IP address as a string of the fetcher that most recently spoke with the queue server.
$page_range_request  : int
Maximum number of bytes to download of a webpage
$page_recrawl_frequency  : int
Number of days between resets of the page url filter. If nonpositive, the filter is never reset
$page_rules  : array<string|int, mixed>
Used to add page rules, to be applied to downloaded pages, to the schedules that the fetcher will use (and hence apply the page rules)
$process_name  : string
String used for naming log files and for naming the processes which run related to the queue server
$quota_clear_time  : int
Timestamp of the last time the download-from-site quotas were cleared
$quota_sites  : array<string|int, mixed>
Web-sites that have an hourly crawl quota
$quota_sites_keys  : array<string|int, mixed>
Cache of array_keys of $quota_sites
$repeat_type  : int
Controls whether a repeating crawl is being done (negative means no) and, if so, its frequency in seconds
$restrict_sites_by_url  : bool
Says whether the $allowed_sites array is being used or not
$robots_txt  : int
One of a fixed set of values which are used to control to what extent Yioop follows robots.txt files: ALWAYS_FOLLOW_ROBOTS, ALLOW_LANDING_ROBOTS, IGNORE_ROBOTS
$server_name  : string
String used to describe this kind of queue server (Indexer, Scheduler, etc.) in the log files.
$server_type  : mixed
Used to say what kind of queue server this is (one of BOTH, INDEXER, SCHEDULER)
$sleep_duration  : string
If a crawl quiescent period is being used with the crawl, then this property will be positive and indicate the duration of the quiescent period in seconds.
$sleep_start  : string
If a crawl quiescent period is being used with the crawl, then this stores the time of day at which that period starts
$start_dictionary_time  : int
Keeps track of the time needed for the dictionary updater to add the current partition contents to index
$summarizer_option  : string
Stores the name of the summarizer used for crawling.
$waiting_hosts  : array<string|int, mixed>
This is a list of hosts whose robots.txt file had a Crawl-delay directive and which we have produced a schedule with urls for, but we have not heard back from the fetcher who was processing those urls. Hosts on this list will not be scheduled for more downloads until the fetcher with earlier urls has gotten back to the queue server.
$window_size  : int
Maximum number of to_crawl schedules that can be waiting to be returned in the sequence they were sent out. I.e., if crawl results for fetch batch x have not been returned, then fetch batch x + $window_size cannot be created from the queue.
__construct()  : mixed
Creates a Queue Server Daemon
allowedToCrawlSite()  : bool
Checks if url belongs to a list of sites that are allowed to be crawled and that the file type is crawlable
calculateScheduleMetaInfo()  : string
Used to create an encoded string representing meta info for a fetcher schedule.
checkBothProcessesRunning()  : mixed
Checks to make sure both the indexer process and the scheduler processes are running and if not restart the stopped process
checkProcessRunning()  : mixed
Checks to make sure the given process (either Indexer or Scheduler) is running.
checkRepeatingCrawlSwap()  : bool
Check for a repeating crawl whether it is time to swap between the active and search crawls.
checkUpdateCrawlParameters()  : mixed
Checks to see if the parameters by which the active crawl is being conducted have been modified since the last time the values were put into queue server field variables. If so, it updates the values to their new values
deleteOrphanedBundles()  : mixed
Delete all the queue bundles and schedules that don't have an associated index bundle as this means that crawl has been deleted.
disallowedToCrawlSite()  : bool
Checks if url belongs to a list of sites that aren't supposed to be crawled
dumpBigScheduleToSmall()  : mixed
Used to split a large schedule of to-crawl sites into small ones (which are written to disk) that can be handled by processToCrawlUrls
getEarliestSlot()  : int
Gets the first unfilled schedule slot after $index in $arr
handleAdminMessages()  : array<string|int, mixed>
Handles messages passed via files to the QueueServer.
indexSave()  : mixed
Builds inverted index and saves active partition
initializeCrawlQueue()  : mixed
This method sets up a CrawlQueueBundle according to the current crawl order so that it can receive urls and prioritize them.
initializeIndexBundle()  : mixed
Function used to set up an indexer's IndexDocumentBundle or DoubleIndexBundle according to the current crawl parameters or the values stored in an existing bundle.
isAIndexer()  : bool
Used to check if the current queue server process is acting as an indexer of data coming from fetchers
isAScheduler()  : bool
Used to check if the current queue server process is acting as a url scheduler for fetchers
isOnlyIndexer()  : bool
Used to check if the current queue server process is acting only as an indexer of data coming from fetchers (and not some other activity like scheduler as well)
isOnlyScheduler()  : bool
Used to check if the current queue server process is acting only as a url scheduler for fetchers (and not some other activity like indexer as well)
loop()  : mixed
Main runtime loop of the queue server.
processCrawlData()  : mixed
Main body of queue server loop where indexing, scheduling, robot file processing is done.
processEtagExpires()  : mixed
Process cache page validation data files sent by Fetcher
processEtagExpiresArchive()  : mixed
Processes cache page validation data. Extracts key-value pairs from the data and inserts them into the LinearHashTable used for storing cache page validation data.
processIndexArchive()  : mixed
Adds the summary and index data in $file to summary bundle and word index
processIndexData()  : mixed
Sets up the directory to look for a file of unprocessed index archive data from fetchers, then calls the function processDataFile to process the oldest file found
processReceivedRobotTxtUrls()  : mixed
This method is used to move urls in the waiting hosts folder for hosts listed in $this->crawl_queue->robot_notify_hosts into the queue, because host membership in $this->crawl_queue->robot_notify_hosts indicates that a robots.txt file has just been received for the particular domain.
processRecrawlDataArchive()  : mixed
Processes fetcher data file information during a recrawl
processRecrawlRobotUrls()  : mixed
Even during a recrawl the fetcher may send robot data to the queue server. This function prints a log message and calls another function to delete this useless robot file.
processRobotArchive()  : mixed
Reads in $sites of robot data: hosts and their associated robots.txt allowed/disallowed paths, crawl delay info, and dns info.
processRobotUrls()  : mixed
Checks how old the oldest robot data is and dumps it if older than a threshold, then sets up the path to the robot schedule directory and tries to process a file of robots.txt robot paths data from there
processToCrawlArchive()  : mixed
Process to-crawl urls adding to or adjusting the weight in the PriorityQueue of those which have not been seen. Also updates the queue with seen url info
processToCrawlUrls()  : mixed
Checks for a new crawl file or schedule data for the current crawl and, if such a file exists, processes its contents adding the relevant urls to the priority queue
produceFetchBatch()  : mixed
Produces a schedule.txt file of url data for a fetcher to crawl next.
runPostProcessingPlugins()  : mixed
During crawl shutdown this is called to run any post processing plugins
shutdownDictionary()  : mixed
During crawl shutdown, this function is called to do a final save and merge of the crawl dictionary, so that it is ready to serve queries.
start()  : mixed
This is the function that should be called to get the queue server to start. Calls init to handle the command line arguments then enters the queue server's main loop
startCrawl()  : mixed
Begins crawling based on the time, order, and restricted site info in $info. Setting up a crawl involves creating a queue bundle and an index archive bundle
stopCrawl()  : mixed
Used to stop the currently running crawl gracefully so that it can be restarted. This involves writing the queue's contents back to schedules, making the crawl's dictionary all the same tier, and running any indexing_plugins.
updateDisallowedQuotaSites()  : mixed
This is called whenever the crawl options are modified to parse from the disallowed sites those sites of the format: site#quota, where quota is the number of urls allowed to be downloaded in an hour from the site. These sites are then deleted from disallowed_sites and added to $this->quota_sites. An entry in $this->quota_sites has the format: $quota_site => [$quota, $num_urls_downloaded_this_hr]
updateMostRecentFetcher()  : mixed
Determines the most recent fetcher that has spoken with the web server of this queue server and stores the result in the field variable most_recent_fetcher
withinQuota()  : bool
Checks if the $url is from a site which has an hourly quota to download.
writeAdminMessage()  : mixed
Used to write an admin crawl status message during a start or stop crawl.
writeArchiveCrawlInfo()  : mixed
Used to write info about the current recrawl to file as well as to process any recrawl data files received
writeCrawlStatus()  : mixed
Writes status information about the current crawl so that the webserver app can use it for its display.

Properties

$all_file_types

List of all known file extensions including those not used for crawl

public array<string|int, mixed> $all_file_types

$allow_disallow_cache_time

Microtime used to look up cache $allowed_sites and $disallowed_sites filtering data structures

public int $allow_disallow_cache_time

$allowed_sites

Web-sites that crawler can crawl. If used, ONLY these will be crawled

public array<string|int, mixed> $allowed_sites

$archive_modified_time

This keeps track of the time the current archive info was last modified. This way the queue server knows if the user has changed the crawl parameters during the crawl.

public int $archive_modified_time

$cache_pages

Used in schedules to tell the fetcher whether or not to cache pages

public bool $cache_pages

$channel

Channel that queue server listens to messages for

public int $channel

$crawl_index

If the crawl_type is self::ARCHIVE_CRAWL, then crawl_index is the timestamp of the existing archive to crawl

public string $crawl_index

$crawl_order

Constant saying the method used to order the priority queue for the crawl

public string $crawl_order

$crawl_queue

Holds the CrawlQueueBundle for the crawl. This bundle encapsulates the queue of urls that specifies what to crawl next

public object $crawl_queue

$crawl_status_file_name

Name of the file used to hold statistics about the current crawl

public string $crawl_status_file_name

$crawl_time

The timestamp of the current active crawl

public int $crawl_time

$crawl_type

Indicates the kind of crawl being performed: self::WEB_CRAWL indicates a new crawl of the web; self::ARCHIVE_CRAWL indicates a crawl of an existing web archive

public string $crawl_type

$db

Reference to a database object. Used since it has directory manipulation functions

public object $db

$debug

Holds the value of a debug message that might have been sent from the command line during the current execution of loop();

public string $debug

$disallowed_sites

Web-sites that the crawler must not crawl

public array<string|int, mixed> $disallowed_sites

$hourly_crawl_data

This is a list of hourly (timestamp, number_of_urls_crawled) statistics

public array<string|int, mixed> $hourly_crawl_data

$index_archive

Holds the IndexDocumentBundle for the current crawl. This encapsulates the inverted index word-->documents for the crawls as well as document summaries of each document.

public object $index_archive

$index_dirty

Flag for whether the index has data to be written to disk

public int $index_dirty

$indexed_file_types

List of file extensions supported for the crawl

public array<string|int, mixed> $indexed_file_types

$indexing_plugins

This is a list of indexing_plugins which might do post processing after the crawl. The plugin's postProcessing function is called if it is selected in the crawl options page.

public array<string|int, mixed> $indexing_plugins

$indexing_plugins_data

This is an array of crawl parameters for indexing_plugins which might do post processing after the crawl.

public array<string|int, mixed> $indexing_plugins_data

$info_parameter_map

A mapping between class field names and parameters which might be sent to a queue server via an info associative array.

public static array<string|int, mixed> $info_parameter_map = ["crawl_order" => self::CRAWL_ORDER, "crawl_type" => self::CRAWL_TYPE, "crawl_index" => self::CRAWL_INDEX, "cache_pages" => self::CACHE_PAGES, "page_range_request" => self::PAGE_RANGE_REQUEST, "max_depth" => self::MAX_DEPTH, "repeat_type" => self::REPEAT_TYPE, "sleep_start" => self::SLEEP_START, "sleep_duration" => self::SLEEP_DURATION, "robots_txt" => self::ROBOTS_TXT, "max_description_len" => self::MAX_DESCRIPTION_LEN, "max_links_to_extract" => self::MAX_LINKS_TO_EXTRACT, "page_recrawl_frequency" => self::PAGE_RECRAWL_FREQUENCY, "indexed_file_types" => self::INDEXED_FILE_TYPES, "restrict_sites_by_url" => self::RESTRICT_SITES_BY_URL, "allowed_sites" => self::ALLOWED_SITES, "disallowed_sites" => self::DISALLOWED_SITES, "page_rules" => self::PAGE_RULES, "indexing_plugins" => self::INDEXING_PLUGINS, "indexing_plugins_data" => self::INDEXING_PLUGINS_DATA]
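
The following is a minimal sketch (not Yioop's actual code) of how a map like $info_parameter_map can be used to copy crawl parameters received in an info associative array into the corresponding queue server fields; the class, constant values, and method name below are placeholders for illustration.

<?php
// Sketch: copy values from an $info array into object fields using a
// field-name => message-key map in the style of $info_parameter_map.
// Constant values are placeholders, not Yioop's real constants.
class InfoMapExample
{
    const CRAWL_ORDER = "co";
    const MAX_DEPTH = "md";
    public static $info_parameter_map = [
        "crawl_order" => self::CRAWL_ORDER,
        "max_depth" => self::MAX_DEPTH,
    ];
    public $crawl_order;
    public $max_depth;
    public function setFieldsFromInfo(array $info)
    {
        foreach (self::$info_parameter_map as $field => $key) {
            if (isset($info[$key])) {
                $this->$field = $info[$key];
            }
        }
    }
}
$server = new InfoMapExample();
$server->setFieldsFromInfo([InfoMapExample::CRAWL_ORDER => "page_importance",
    InfoMapExample::MAX_DEPTH => 10]);
echo $server->crawl_order . " " . $server->max_depth . "\n"; // page_importance 10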

$last_index_save_time

Last time index was saved to disk

public int $last_index_save_time

$last_next_partition_to_add

Holds the int value of the previous partition in index

public int $last_next_partition_to_add

$max_depth

Constant saying the maximum depth from the seed sites that the crawl can go to

public string $max_depth

$max_description_len

Max number of chars to extract for description from a page to index.

public int $max_description_len

Only words in the description are indexed.

$max_links_to_extract

Maximum number of urls to extract from a single document

public int $max_links_to_extract

$messages_bundle

Holds the MessagesBundle to be used for the crawl. This bundle is used to store data that is sent between the QueueServer and Fetcher but has yet to be processed.

public object $messages_bundle

$most_recent_fetcher

IP address as a string of the fetcher that most recently spoke with the queue server.

public string $most_recent_fetcher

$page_range_request

Maximum number of bytes to download of a webpage

public int $page_range_request

$page_recrawl_frequency

Number of days between resets of the page url filter. If nonpositive, the filter is never reset

public int $page_recrawl_frequency

$page_rules

Used to add page rules, to be applied to downloaded pages, to the schedules that the fetcher will use (and hence apply the page rules)

public array<string|int, mixed> $page_rules

$process_name

String used for naming log files and for naming the processes which run related to the queue server

public string $process_name

$quota_clear_time

Timestamp of the last time the download-from-site quotas were cleared

public int $quota_clear_time

$quota_sites

Web-sites that have an hourly crawl quota

public array<string|int, mixed> $quota_sites

$quota_sites_keys

Cache of array_keys of $quota_sites

public array<string|int, mixed> $quota_sites_keys

$repeat_type

Controls whether a repeating crawl is being done (negative means no) and, if so, its frequency in seconds

public int $repeat_type

$restrict_sites_by_url

Says whether the $allowed_sites array is being used or not

public bool $restrict_sites_by_url

$robots_txt

One of a fixed set of values which are used to control to what extent Yioop follows robots.txt files: ALWAYS_FOLLOW_ROBOTS, ALLOW_LANDING_ROBOTS, IGNORE_ROBOTS

public int $robots_txt

$server_name

String used to describe this kind of queue server (Indexer, Scheduler, etc.) in the log files.

public string $server_name

$server_type

Used to say what kind of queue server this is (one of BOTH, INDEXER, SCHEDULER)

public mixed $server_type

$sleep_duration

If a crawl quiescent period is being used with the crawl, then this property will be positive and indicate the duration of the quiescent period in seconds.

public string $sleep_duration

$sleep_start

If a crawl quiescent period is being used with the crawl, then this stores the time of day at which that period starts

public string $sleep_start
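
Below is a hedged sketch of how a daily quiescent period could be checked from a start time of day and a duration in seconds; it assumes $sleep_start is an "HH:MM" string and $sleep_duration is a number of seconds, and is illustrative rather than Yioop's implementation.

<?php
// Sketch: is the current time inside the crawl's quiescent period?
// Assumes an "HH:MM" start time and a duration in seconds.
function inQuiescentPeriod(string $sleep_start, int $sleep_duration,
    int $now): bool
{
    if ($sleep_duration <= 0) {
        return false; // no quiescent period configured
    }
    list($hour, $minute) = array_map('intval', explode(":", $sleep_start));
    $start_today = mktime($hour, $minute, 0, (int)date("n", $now),
        (int)date("j", $now), (int)date("Y", $now));
    // also test yesterday's start so a period that wraps past midnight works
    foreach ([$start_today, $start_today - 86400] as $start) {
        if ($now >= $start && $now < $start + $sleep_duration) {
            return true;
        }
    }
    return false;
}
// Example: a two hour quiescent period starting at 01:30 each day
var_dump(inQuiescentPeriod("01:30", 7200, time()));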

$start_dictionary_time

Keeps track of the time needed for the dictionary updater to add the current partition contents to index

public int $start_dictionary_time

$summarizer_option

Stores the name of the summarizer used for crawling.

public string $summarizer_option

Possible values are Basic and Centroid

$waiting_hosts

This is a list of hosts whose robots.txt file had a Crawl-delay directive and which we have produced a schedule with urls for, but we have not heard back from the fetcher who was processing those urls. Hosts on this list will not be scheduled for more downloads until the fetcher with earlier urls has gotten back to the queue server.

public array<string|int, mixed> $waiting_hosts

$window_size

Maximum number of to_crawl schedules that can be waiting to be returned in the sequence they were sent out. I.e., if crawl results for fetch batch x have not been returned, then fetch batch x + $window_size cannot be created from the queue.

public int $window_size
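
The window constraint can be stated as a one-line check; the following is an illustrative sketch with hypothetical names, not Yioop's code.

<?php
// Sketch: fetch batch $next_batch may only be produced if it is within
// $window_size of the oldest batch whose results have not yet come back.
function canProduceBatch(int $next_batch, int $oldest_unreturned_batch,
    int $window_size): bool
{
    return $next_batch < $oldest_unreturned_batch + $window_size;
}
var_dump(canProduceBatch(12, 10, 5)); // true: still inside the window
var_dump(canProduceBatch(15, 10, 5)); // false: batch 10 results still pending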

Methods

__construct()

Creates a Queue Server Daemon

public __construct() : mixed
Return values
mixed

allowedToCrawlSite()

Checks if url belongs to a list of sites that are allowed to be crawled and that the file type is crawlable

public allowedToCrawlSite(string $url) : bool
Parameters
$url : string

url to check

Return values
bool

whether it is allowed to be crawled or not

calculateScheduleMetaInfo()

Used to create an encoded string representing meta info for a fetcher schedule.

public calculateScheduleMetaInfo(int $schedule_time) : string
Parameters
$schedule_time : int

timestamp of the schedule

Return values
string

base64 encoded meta info

checkBothProcessesRunning()

Checks to make sure both the indexer process and the scheduler processes are running and if not restart the stopped process

public checkBothProcessesRunning(array<string|int, mixed> $info) : mixed
Parameters
$info : array<string|int, mixed>

information about queue server state used to determine if a crawl is active.

Return values
mixed

checkProcessRunning()

Checks to make sure the given process (either Indexer or Scheduler) is running.

public checkProcessRunning(string $process, array<string|int, mixed> $info) : mixed
Parameters
$process : string

should be either self::INDEXER or self::SCHEDULER

$info : array<string|int, mixed>

information about queue server state used to determine if a crawl is active.

Return values
mixed

checkRepeatingCrawlSwap()

Check for a repeating crawl whether it is time to swap between the active and search crawls.

public checkRepeatingCrawlSwap() : bool
Return values
bool

true if the time to swap has come

checkUpdateCrawlParameters()

Checks to see if the parameters by which the active crawl is being conducted have been modified since the last time the values were put into queue server field variables. If so, it updates the values to their new values

public checkUpdateCrawlParameters() : mixed
Return values
mixed

deleteOrphanedBundles()

Delete all the queue bundles and schedules that don't have an associated index bundle as this means that crawl has been deleted.

public deleteOrphanedBundles() : mixed
Return values
mixed

disallowedToCrawlSite()

Checks if url belongs to a list of sites that aren't supposed to be crawled

public disallowedToCrawlSite(string $url) : bool
Parameters
$url : string

url to check

Return values
bool

whether it shouldn't be crawled
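
To illustrate how the allowed and disallowed site lists interact, here is a hedged, self-contained sketch; the simple prefix-matching rule is an assumption for illustration, not Yioop's exact matching logic.

<?php
// Sketch: decide if a url passes host-based allow/disallow filtering in the
// spirit of allowedToCrawlSite()/disallowedToCrawlSite().
function urlMatchesSiteList(string $url, array $sites): bool
{
    foreach ($sites as $site) {
        if (strncmp($url, $site, strlen($site)) == 0) {
            return true; // $site is a prefix of $url
        }
    }
    return false;
}
$allowed = ["https://example.org/", "https://example.com/docs/"];
$disallowed = ["https://example.com/docs/private/"];
$url = "https://example.com/docs/private/a.html";
$ok = urlMatchesSiteList($url, $allowed) &&
    !urlMatchesSiteList($url, $disallowed);
var_dump($ok); // false: the url falls under a disallowed path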

dumpBigScheduleToSmall()

Used to split a large schedule of to-crawl sites into small ones (which are written to disk) that can be handled by processToCrawlUrls

public dumpBigScheduleToSmall(int $schedule_time, array<string|int, mixed> &$sites) : mixed

The size of the to-crawl list depends on the number of links found during a fetch batch. This can be quite large compared to the fetch batch, and during processing we might be doing a fair bit of manipulation of arrays of sites, so the idea is that splitting like this will hopefully reduce the memory burden of scheduling.

Parameters
$schedule_time : int

timestamp of schedule we are splitting

$sites : array<string|int, mixed>

array containing to crawl data

Return values
mixed
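
The splitting itself amounts to chunking a large array and writing each chunk to its own file; the sketch below shows that idea with an assumed chunk size and file naming scheme, not Yioop's actual format.

<?php
// Sketch: break a big to-crawl list into small schedule files that can be
// processed one at a time, reducing peak memory use during scheduling.
function dumpScheduleChunks(array $sites, string $dir, int $chunk_size = 5000)
{
    if (!is_dir($dir)) {
        mkdir($dir, 0777, true);
    }
    foreach (array_chunk($sites, $chunk_size) as $i => $chunk) {
        file_put_contents($dir . "/schedule_part_" . $i . ".txt",
            serialize($chunk));
    }
}
dumpScheduleChunks(["https://a.example/", "https://b.example/"],
    sys_get_temp_dir() . "/schedules", 1);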

getEarliestSlot()

Gets the first unfilled schedule slot after $index in $arr

public getEarliestSlot(int $index, array<string|int, mixed> &$arr) : int

A schedule of sites for a fetcher to crawl consists of MAX_FETCH_SIZE many slots, each of which could eventually hold url information. This function is used to schedule slots for crawl-delayed hosts.

Parameters
$index : int

location to begin searching for an empty slot

$arr : array<string|int, mixed>

list of slots to look in

Return values
int

index of first available slot
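
A minimal sketch of the slot search described above; the sentinel value used to mark an empty slot (false here) is an assumption for illustration.

<?php
// Sketch: find the first unfilled slot after position $index in a fixed-size
// schedule, as getEarliestSlot() does when placing crawl-delayed urls.
function earliestSlot(int $index, array $slots): int
{
    $num_slots = count($slots);
    for ($i = $index + 1; $i < $num_slots; $i++) {
        if ($slots[$i] === false) {
            return $i; // first empty slot after $index
        }
    }
    return -1; // no free slot remains
}
$slots = [false, "url-a", "url-b", false, false];
echo earliestSlot(1, $slots) . "\n"; // 3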

handleAdminMessages()

Handles messages passed via files to the QueueServer.

public handleAdminMessages(array<string|int, mixed> $info) : array<string|int, mixed>

These files are typically written by CrawlDaemon::init() when QueueServer is run using command-line arguments

Parameters
$info : array<string|int, mixed>

associative array with info about current state of queue server

Return values
array<string|int, mixed>

an updated version of $info reflecting changes that occurred during the handling of the admin message files.
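
The following hedged sketch shows the general file-based message passing pattern this method relies on: a controller writes a small message file which the queue server reads and removes on its next pass. The file name, location, and message format are assumptions, not Yioop's exact ones.

<?php
// Sketch: consume a one-shot admin message file if one has been written.
$message_file = "/tmp/queue_server_messages.txt"; // hypothetical path
if (file_exists($message_file)) {
    $message = unserialize(file_get_contents($message_file));
    unlink($message_file); // remove so the message is handled only once
    if (($message["command"] ?? "") == "stop") {
        // a real queue server would shut the crawl down gracefully here
        echo "stop message received\n";
    }
}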

indexSave()

Builds inverted index and saves active partition

public indexSave() : mixed
Return values
mixed

initializeCrawlQueue()

This method sets up a CrawlQueueBundle according to the current crawl order so that it can receive urls and prioritize them.

public initializeCrawlQueue() : mixed
Return values
mixed

initializeIndexBundle()

Function used to set up an indexer's IndexDocumentBundle or DoubleIndexBundle according to the current crawl parameters or the values stored in an existing bundle.

public initializeIndexBundle([array<string|int, mixed> $info = [] ][, array<string|int, mixed> $try_to_set_from_old_index = null ]) : mixed
Parameters
$info : array<string|int, mixed> = []

if initializing a new crawl this should contain the crawl parameters

$try_to_set_from_old_index : array<string|int, mixed> = null

parameters of the crawl to try to set from values already stored in archive info, other parameters are assumed to have been updated since.

Return values
mixed

isAIndexer()

Used to check if the current queue server process is acting as an indexer of data coming from fetchers

public isAIndexer() : bool
Return values
bool

whether it is or not

isAScheduler()

Used to check if the current queue server process is acting as a url scheduler for fetchers

public isAScheduler() : bool
Return values
bool

whether it is or not

isOnlyIndexer()

Used to check if the current queue server process is acting only as an indexer of data coming from fetchers (and not some other activity like scheduler as well)

public isOnlyIndexer() : bool
Return values
bool

whether it is or not

isOnlyScheduler()

Used to check if the current queue server process is acting only as a url scheduler for fetchers (and not some other activity like indexer as well)

public isOnlyScheduler() : bool
Return values
bool

whether it is or not
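
The four predicates above can all be phrased in terms of the $server_type field; the sketch below shows that relationship with placeholder constant values, not Yioop's.

<?php
// Sketch: server role predicates derived from a $server_type that is one of
// BOTH, INDEXER, SCHEDULER.
class ServerTypeExample
{
    const BOTH = "both";
    const INDEXER = "indexer";
    const SCHEDULER = "scheduler";
    public $server_type = self::BOTH;
    public function isAIndexer(): bool
    {
        return in_array($this->server_type, [self::BOTH, self::INDEXER]);
    }
    public function isAScheduler(): bool
    {
        return in_array($this->server_type, [self::BOTH, self::SCHEDULER]);
    }
    public function isOnlyIndexer(): bool
    {
        return $this->server_type == self::INDEXER;
    }
    public function isOnlyScheduler(): bool
    {
        return $this->server_type == self::SCHEDULER;
    }
}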

loop()

Main runtime loop of the queue server.

public loop() : mixed

Loops until a stop message is received; checks for start, stop, and resume crawl messages; deletes any CrawlQueueBundle for which an IndexDocumentBundle does not exist; and processes any crawl data received from fetchers.

Return values
mixed
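
A very reduced sketch of the control flow loop() describes follows: poll for admin messages, process crawl data while a crawl is active, and exit when a stop message arrives. Only the documented method names (handleAdminMessages, processCrawlData) are taken from this page; the FakeServer class, the STATUS key, and the stop condition are illustrative assumptions.

<?php
// Sketch: skeleton of a queue-server-style main loop.
class FakeServer
{
    private $ticks = 0;
    public function handleAdminMessages(array $info): array
    {
        // pretend a stop message arrives after three iterations
        if (++$this->ticks >= 3) {
            $info["STATUS"] = "stop";
        }
        return $info;
    }
    public function processCrawlData()
    {
        echo "processing crawl data\n";
    }
}
$server = new FakeServer();
$info = ["STATUS" => "continue", "CRAWL_RUNNING" => true];
while ($info["STATUS"] != "stop") {
    $info = $server->handleAdminMessages($info);
    if ($info["STATUS"] == "stop") {
        break; // stop requested; a real server would shut down gracefully
    }
    if (!empty($info["CRAWL_RUNNING"])) {
        $server->processCrawlData();
    }
}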

processCrawlData()

Main body of queue server loop where indexing, scheduling, robot file processing is done.

public processCrawlData() : mixed
Return values
mixed

processEtagExpires()

Process cache page validation data files sent by Fetcher

public processEtagExpires() : mixed
Return values
mixed

processEtagExpiresArchive()

Processes cache page validation data. Extracts key-value pairs from the data and inserts them into the LinearHashTable used for storing cache page validation data.

public processEtagExpiresArchive(array<string|int, mixed> &$etag_expires_data) : mixed
Parameters
$etag_expires_data : array<string|int, mixed>

is the cache page validation data from the Fetchers.

Return values
mixed

processIndexArchive()

Adds the summary and index data in $file to summary bundle and word index

public processIndexArchive(string &$pre_sites_and_index) : mixed
Parameters
$pre_sites_and_index : string

containing web pages summaries

Return values
mixed

processIndexData()

Sets up the directory to look for a file of unprocessed index archive data from fetchers, then calls the function processDataFile to process the oldest file found

public processIndexData() : mixed
Return values
mixed

processReceivedRobotTxtUrls()

This method is used to move urls in the waiting hosts folder for hosts listed in $this->crawl_queue->robot_notify_hosts into the queue, because host membership in $this->crawl_queue->robot_notify_hosts indicates that a robots.txt file has just been received for the particular domain.

public processReceivedRobotTxtUrls() : mixed
Return values
mixed

processRecrawlDataArchive()

Processes fetcher data file information during a recrawl

public processRecrawlDataArchive(array<string|int, mixed> $sites) : mixed
Parameters
$sites : array<string|int, mixed>

a file of recently crawled urls (and other to_crawl data, which will be discarded because we are doing a recrawl)

Return values
mixed

processRecrawlRobotUrls()

Even during a recrawl the fetcher may send robot data to the queue server. This function prints a log message and calls another function to delete this useless robot file.

public processRecrawlRobotUrls() : mixed
Return values
mixed

processRobotArchive()

Reads in $sites of robot data: hosts and their associated robots.txt allowed/disallowed paths, crawl delay info, and dns info.

public processRobotArchive(mixed &$sites) : mixed

Adds this to the robot_table entry for this host. Adds dns info to the RAM-based dns cache hash table.

Parameters
$sites : mixed
Return values
mixed

processRobotUrls()

Checks how old the oldest robot data is and dumps it if older than a threshold, then sets up the path to the robot schedule directory and tries to process a file of robots.txt robot paths data from there

public processRobotUrls() : mixed
Return values
mixed

processToCrawlArchive()

Process to-crawl urls adding to or adjusting the weight in the PriorityQueue of those which have not been seen. Also updates the queue with seen url info

public processToCrawlArchive(array<string|int, mixed> &$sites) : mixed
Parameters
$sites : array<string|int, mixed>

containing to crawl and seen url info

Return values
mixed

processToCrawlUrls()

Checks for a new crawl file or schedule data for the current crawl and, if such a file exists, processes its contents adding the relevant urls to the priority queue

public processToCrawlUrls() : mixed
Return values
mixed

produceFetchBatch()

Produces a schedule.txt file of url data for a fetcher to crawl next.

public produceFetchBatch() : mixed

The hard part of scheduling is to make sure that the overall crawl process obeys robots.txt files. This involves checking the url is in an allowed path for that host and it also involves making sure the Crawl-delay directive is respected. The first fetcher that contacts the server requesting data to crawl will get the schedule.txt produced by produceFetchBatch() at which point it will be unlinked (these latter things are controlled in FetchController).

Tags
see
FetchController
Return values
mixed

runPostProcessingPlugins()

During crawl shutdown this is called to run any post processing plugins

public runPostProcessingPlugins() : mixed
Return values
mixed

shutdownDictionary()

During crawl shutdown, this function is called to do a final save and merge of the crawl dictionary, so that it is ready to serve queries.

public shutdownDictionary() : mixed
Return values
mixed

start()

This is the function that should be called to get the queue server to start. Calls init to handle the command line arguments then enters the queue server's main loop

public start() : mixed
Return values
mixed
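
As a hedged usage note (the exact script name, path, and accepted arguments depend on the Yioop version and installation), the queue server is normally launched as its own command-line process, and start() is its entry point:

<?php
// Typical invocation (illustrative; verify against your Yioop install):
//   php QueueServer.php terminal   - run in the foreground, logging to the terminal
//   php QueueServer.php start      - run as a background daemon
//   php QueueServer.php stop       - stop the daemon
// Programmatically, the entry point is simply:
// $queue_server = new QueueServer();
// $queue_server->start(); // handles command-line args, then enters loop()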

startCrawl()

Begins crawling based on the time, order, and restricted site info in $info. Setting up a crawl involves creating a queue bundle and an index archive bundle

public startCrawl(array<string|int, mixed> $info) : mixed
Parameters
$info : array<string|int, mixed>

parameter for the crawl

Return values
mixed

stopCrawl()

Used to stop the currently running crawl gracefully so that it can be restarted. This involves writing the queue's contents back to schedules, making the crawl's dictionary all the same tier, and running any indexing_plugins.

public stopCrawl() : mixed
Return values
mixed

updateDisallowedQuotaSites()

This is called whenever the crawl options are modified to parse from the disallowed sites those sites of the format: site#quota, where quota is the number of urls allowed to be downloaded in an hour from the site. These sites are then deleted from disallowed_sites and added to $this->quota_sites. An entry in $this->quota_sites has the format: $quota_site => [$quota, $num_urls_downloaded_this_hr]

public updateDisallowedQuotaSites() : mixed
Return values
mixed
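
The transformation this method performs can be sketched as follows; the helper name is hypothetical, and only the site#quota format and the $quota_site => [$quota, $num_urls_downloaded_this_hr] entry layout are taken from the description above.

<?php
// Sketch: split site#quota entries out of a disallowed sites list into a
// quota map, leaving the genuinely disallowed sites behind.
function splitQuotaSites(array $disallowed_sites): array
{
    $quota_sites = [];
    $remaining = [];
    foreach ($disallowed_sites as $site) {
        if (strpos($site, "#") !== false) {
            list($quota_site, $quota) = explode("#", $site, 2);
            // entry format: $quota_site => [$quota, urls downloaded this hour]
            $quota_sites[$quota_site] = [intval($quota), 0];
        } else {
            $remaining[] = $site;
        }
    }
    return [$remaining, $quota_sites];
}
list($disallowed, $quotas) =
    splitQuotaSites(["https://example.com/", "https://example.org/#100"]);
print_r($quotas); // ["https://example.org/" => [100, 0]]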

updateMostRecentFetcher()

Determines the most recent fetcher that has spoken with the web server of this queue server and stores the result in the field variable most_recent_fetcher

public updateMostRecentFetcher() : mixed
Return values
mixed

withinQuota()

Checks if the $url is from a site which has an hourly quota to download.

public withinQuota(string $url[, int $bump_count = 1 ]) : bool

If so, it bumps the quota count and returns true; false otherwise. This method also resets the quota counts every hour.

Parameters
$url : string

to check if within quota

$bump_count : int = 1

how much to bump quota count if url is from a site with a quota

Return values
bool

whether $url is within the hourly quota of the site it is from
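
A minimal sketch of an hourly quota check in the spirit of withinQuota() follows: bump the per-site count and report whether the quota is still respected. The quota entry layout matches the description of $quota_sites above; the exact host matching and hourly reset are omitted, and the function name is hypothetical.

<?php
// Sketch: check and update an hourly per-site download quota.
function withinQuotaExample(array &$quota_sites, string $site,
    int $bump_count = 1): bool
{
    if (!isset($quota_sites[$site])) {
        return true; // no quota configured for this site
    }
    list($quota, $count) = $quota_sites[$site];
    if ($count + $bump_count > $quota) {
        return false; // this download would exceed the hourly quota
    }
    $quota_sites[$site][1] = $count + $bump_count;
    return true;
}
$quota_sites = ["https://example.org/" => [2, 0]];
var_dump(withinQuotaExample($quota_sites, "https://example.org/")); // true
var_dump(withinQuotaExample($quota_sites, "https://example.org/")); // true
var_dump(withinQuotaExample($quota_sites, "https://example.org/")); // false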

writeAdminMessage()

Used to write an admin crawl status message during a start or stop crawl.

public writeAdminMessage(string $message) : mixed
Parameters
$message : string

to write into crawl_status.txt this will show up in the web crawl status element.

Return values
mixed

writeArchiveCrawlInfo()

Used to write info about the current recrawl to file as well as to process any recrawl data files received

public writeArchiveCrawlInfo() : mixed
Return values
mixed

writeCrawlStatus()

Writes status information about the current crawl so that the webserver app can use it for its display.

public writeCrawlStatus(array<string|int, mixed> $recent_urls) : mixed
Parameters
$recent_urls : array<string|int, mixed>

contains the most recently crawled sites

Return values
mixed

        
