CrawlQueueBundle
in package
Encapsulates the data structures needed to maintain a queue of urls to crawl
Tags
Table of Contents
- CRAWL_DELAYED_FOLDER = "CrawlDelayedHosts"
- Name of the folder used for the queue of crawl-delayed hosts waiting on fetch batches to finish
- HASH_KEY_SIZE = 8
- Number of bytes in a hash table key
- INT_SIZE = 4
- Size of int
- IP_SIZE = 16
- Length of an IPv6 ip address (IPv4 addresses are padded)
- MAX_URL_BUFFER_BEFORE_WRITE = 500
- When writing urls to robot_table, how many to buffer at a time and then bulk put.
- MAX_URL_FILE_SIZE = 1000000
- Maximum number of bytes of compressed url info stored in a single url info file before a new file is started
- NO_FLAGS = 0
- Url type flag
- ROBOT_WAIT_FOLDER = "WaitRobotUrls"
- Name of the folder used for the queue of urls waiting for robots.txt files to be received
- TIER_PREFIX = "Tier"
- Prefix used in the folder names of send-fetcher-queue tiers
- URL_FILES_EXTENSION = ".txt.gz"
- File extension used for files of serialized url data
- URL_QUEUE_FOLDER = "UrlQueue"
- Name of the folder used for the queue of urls waiting to be scheduled into fetch batches
- WAITING_HOST = 1
- Url type flag
- $dir_name : string
- The folder name of this CrawlQueueBundle
- $dns_table : object
- Host-IP table used for DNS look-up; its data comes in with robots.txt data and is deleted with the same frequency
- $domain_table : LinearHashTable
- LinearHashTable of information about company level domains that have been crawled. Information includes number of SEEN_URLS, number of WEIGHTED_SEEN_URLS, number of WEIGHTED_INCOMING_URLS.
- $etag_table : LinearHashTable
- Holds etag and expires http data
- $filter_size : int
- Number of items that can be stored in a partition of the page exists filter bundle
- $num_urls_ram : int
- Number of entries the priority queue used by this queue bundle can store
- $robot_cache : array<string|int, mixed>
- RAM cache of recent robots.txt info, stored as crawlHash(host) => robots.txt info
- $robot_cache_times : array<string|int, mixed>
- Times at which the recent robots.txt info for each host was cached, stored as crawlHash(host) => timestamp
- $robot_notify_hosts : array<string|int, mixed>
- Array of hosts for which a robots.txt file has just been received and processed, but whose waiting urls have not yet been notified for queueing.
- $robot_table : LinearHashTable
- LinearHashTable used to store robots.txt information (robot paths, crawl delay, capture time, flags) for hosts
- $url_exists_filter_bundle : object
- BloomFilter used to keep track of which urls we've already seen
- __construct() : mixed
- Makes a CrawlQueueBundle with the provided parameters
- addCrawlDelayedHosts() : mixed
- For a timestamp $schedule_time of a fetch batch of urls to be downloaded and for a list of crawl-delayed hosts in that batch, add the hosts to a $schedule_time file in the CrawlDelayedHosts queue so they can be notified when that fetch batch is done processing. Until notified, any url from one of these crawl-delayed hosts will be rescheduled rather than put in a fetch batch for download.
- addDNSCache() : mixed
- Add an entry to this crawl queue bundle's DNS cache
- addSeenUrlFilter() : mixed
- Adds the supplied url to the url_exists_filter_bundle
- addSendFetcherQueue() : mixed
- Adds an array of url tuples to the queue of urls about to be scheduled into fetch batches to be downloaded by fetchers. This queue consists of tiers. Url tuples are sorted into a tier based on the number of urls that have been downloaded for that url's host and their weight.
- addUrlsDirectory() : mixed
- Adds the url info (such as url tuples (url, weight, referer)) to the appropriate file in a subfolder of the folder $dir used to implement a CrawlQueueBundle queue. If $timestamp is 0, data is stored in $dir/current day timestamp/last_file_in_folder.txt.gz; if the last file exceeds 1MB, a new last file is started. If $timestamp > 0, data is stored in $dir/$timestamp's day timestamp/$timestamp.txt.gz
- addWaitRobotQueue() : mixed
- Adds an array of url tuples to the queue of urls waiting for robots.txt files to be received. This queue consists of a folder CrawlQueueBundle::ROBOT_WAIT_FOLDER whose subfolders are named by the hash of a host that doesn't have a robots.txt file received yet.
- checkRobotOkay() : bool
- Checks if the given $url is allowed to be crawled based on stored robots.txt info.
- chooseFetchBatchQueueFolder() : string
- Returns the path to the send-fetcher-queue tier to use to make the next fetch batch of urls to download.
- computeTierUrl() : int
- Used to compute which send-fetcher-queue tier a url should be added to, based on the data related to the url in $url_tuple, its company level domain data, and the crawl order being used
- differenceSeenUrls() : mixed
- Removes all url objects from $url_array which have been seen
- dnsLookup() : string
- Used to lookup an entry in the DNS cache
- emptyDNSCache() : string
- Delete the Hash table used to store DNS lookup info.
- emptyUrlFilter() : mixed
- Empty the crawled url filter for this queue bundle; resets the timestamp of the last time this filter was emptied.
- getDayFolders() : array<string|int, mixed>
- Returns an array of all the days folders for a crawl queue.
- getDnsAge() : int
- Gets the timestamp of the oldest dns address still stored in the queue bundle
- getRobotData() : array<string|int, mixed>
- For a provided hostname, returns the robots.txt information stored in the robot table: [HOSTNAME, CAPTURE_TIME, CRAWL_DELAY, ROBOT_PATHS => [ALLOWED_SITES, DISALLOWED_SITES], FLAGS] (FLAGS records whether to wait for notification from a schedule being downloaded before continuing to crawl the site).
- getUrlFilterAge() : int
- Gets the timestamp of the oldest url filter data still stored in the queue bundle
- getUrlsFileContents() : array<string|int, mixed>
- Returns the unserialized contents of a url info file after decompression.
- getUrlsFiles() : array<string|int, mixed>
- Returns an array of all the url info files in a queue subfolder of a queue for a CrawlQueueBundle. Url info files are usually stored in a file with a nine digit number followed by the queue's file extension (usually .txt.gz) and store up to 1MB of compressed url info.
- gotRobotTxtTime() : int|bool
- Returns the timestamp of the last time a host's robots.txt file was downloaded
- isResumable() : mixed
- Checks whether the crawl queue bundle stored in the provided folder can be resumed
- notifyCrawlDelayedHosts() : mixed
- For each host in the crawl-delayed hosts queue waiting on the fetch batch schedule with timestamp $timestamp, clear its FLAGS field in the robot table so that urls with this host are allowed to be scheduled into future fetch batches for download.
- processReceivedRobotTxtUrls() : mixed
- This method moves urls that are in the waiting hosts folder for hosts listed in $this->robot_notify_hosts into the url queue, since a host's membership in $this->robot_notify_hosts indicates that a robots.txt file has just been received for it.
- processWaitingHostFile() : mixed
- Used by @see notifyCrawlDelayedHosts($timestamp).
- putUrlsFileContents() : mixed
- Serializes and compresses the url info (such as url tuples (url, weight, referer)) provided in $url_data and saves the results into $file_name
- updateCompanyLevelDomainData() : int
- Computes an update to the company level domain data provided in $cld_data, updating the WEIGHTED_SEEN_URLS and WEIGHTED_INCOMING_URLS fields according to information about a discovered url in $url_tuple
Constants
CRAWL_DELAYED_FOLDER
Name of the folder used for the queue of crawl-delayed hosts waiting on fetch batches to finish
public
mixed
CRAWL_DELAYED_FOLDER
= "CrawlDelayedHosts"
HASH_KEY_SIZE
Number of bytes in a hash table key
public
mixed
HASH_KEY_SIZE
= 8
INT_SIZE
Size of int
public
mixed
INT_SIZE
= 4
IP_SIZE
Length of an IPv6 ip address (IPv4 address are padded)
public
mixed
IP_SIZE
= 16
MAX_URL_BUFFER_BEFORE_WRITE
When writing urls to robot_table, how many to buffer at a time and then bulk put.
public
mixed
MAX_URL_BUFFER_BEFORE_WRITE
= 500
MAX_URL_FILE_SIZE
Maximum number of bytes of compressed url info stored in a single url info file before a new file is started
public
mixed
MAX_URL_FILE_SIZE
= 1000000
NO_FLAGS
Url type flag
public
mixed
NO_FLAGS
= 0
ROBOT_WAIT_FOLDER
Name of the folder used for the queue of urls waiting for robots.txt files to be received
public
mixed
ROBOT_WAIT_FOLDER
= "WaitRobotUrls"
TIER_PREFIX
Prefix used in the folder names of send-fetcher-queue tiers
public
mixed
TIER_PREFIX
= "Tier"
URL_FILES_EXTENSION
File extension used for files of serialized url data
public
mixed
URL_FILES_EXTENSION
= ".txt.gz"
URL_QUEUE_FOLDER
Name of the folder used for the queue of urls waiting to be scheduled into fetch batches
public
mixed
URL_QUEUE_FOLDER
= "UrlQueue"
WAITING_HOST
Url type flag
public
mixed
WAITING_HOST
= 1
Properties
$dir_name
The folder name of this CrawlQueueBundle
public
string
$dir_name
$dns_table
Host-IP table used for DNS look-up; its data comes in with robots.txt data and is deleted with the same frequency
public
object
$dns_table
$domain_table
LinearHashTable of information about company level domains that have been crawled. Information includes number of SEEN_URLS, number of WEIGHTED_SEEN_URLS, number of WEIGHTED_INCOMING_URLS.
public
LinearHashTable
$domain_table
(A company level domain is google.com or google.co.uk, but not fo.la.google.com, www.google.com, foo.google.com or foo.google.co.uk)
$etag_table
Holds etag and expires http data
public
LinearHashTable
$etag_table
$filter_size
Number of items that can be stored in a partition of the page exists filter bundle
public
int
$filter_size
$num_urls_ram
Number of entries the priority queue used by this queue bundle can store
public
int
$num_urls_ram
$robot_cache
RAM cache of recent robots.txt info, stored as crawlHash(host) => robots.txt info
public
array<string|int, mixed>
$robot_cache
= []
$robot_cache_times
Times at which the recent robots.txt info for each host was cached, stored as crawlHash(host) => timestamp
public
array<string|int, mixed>
$robot_cache_times
= []
$robot_notify_hosts
Array of hosts for which a robots.txt file has just been received and processed, but whose waiting urls have not yet been notified for queueing.
public
array<string|int, mixed>
$robot_notify_hosts
$robot_table
LinearHashTable used to store robots.txt information (robot paths, crawl delay, capture time, flags) for hosts
public
LinearHashTable
$robot_table
$url_exists_filter_bundle
BloomFilter used to keep track of which urls we've already seen
public
object
$url_exists_filter_bundle
Methods
__construct()
Makes a CrawlQueueBundle with the provided parameters
public
__construct(string $dir_name, int $filter_size, int $num_urls_ram) : mixed
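For illustration, a minimal construction sketch; the folder path and sizes below are hypothetical values, not defaults taken from the source:

```php
// Hypothetical values: bundle folder, BloomFilterBundle partition size,
// and number of in-RAM priority queue entries.
$queue = new CrawlQueueBundle("/tmp/test-crawl-queue", 1000000, 300000);
```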
Parameters
- $dir_name : string
-
folder name used by this CrawlQueueBundle
- $filter_size : int
-
size of each partition in the page exists BloomFilterBundle
- $num_urls_ram : int
-
number of entries in ram for the priority queue
Return values
mixed
addCrawlDelayedHosts()
For a timestamp $schedule_time of a fetch batch of urls to be downloaded and for a list of crawl-delayed hosts in that batch, add the hosts to a $schedule_time file in the CrawlDelayedHosts queue so they can be notified when that fetch batch is done processing. Until notified, any url from one of these crawl-delayed hosts will be rescheduled rather than put in a fetch batch for download.
public
addCrawlDelayedHosts(mixed $schedule_time, array<string|int, mixed> $host_urls) : mixed
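A hedged usage sketch pairing this method with notifyCrawlDelayedHosts(); the $schedule_time and $host_urls values are illustrative placeholders and $queue is assumed to be a CrawlQueueBundle instance:

```php
$schedule_time = time();
$host_urls = ["https://slow.example.com/"];
// Record the crawl-delayed hosts appearing in the fetch batch with this timestamp.
$queue->addCrawlDelayedHosts($schedule_time, $host_urls);
// ... once fetchers finish processing that batch ...
$queue->notifyCrawlDelayedHosts($schedule_time);
```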
Parameters
- $schedule_time : mixed
- $host_urls : array<string|int, mixed>
-
array of urls for hosts that are crawl delayed and for which there is a schedule currently running on fetchers which might download from that host
Return values
mixed
addDNSCache()
Add an entry to this crawl queue bundle's DNS cache
public
addDNSCache(string $host, string $ip_address) : mixed
Parameters
- $host : string
-
hostname to add to DNS Lookup table
- $ip_address : string
-
in presentation format (not as int) to add to table
Return values
mixed
addSeenUrlFilter()
Adds the supplied url to the url_exists_filter_bundle
public
addSeenUrlFilter(string $url) : mixed
Parameters
- $url : string
-
url to add
Return values
mixed
addSendFetcherQueue()
Adds an array of url tuples to the queue of urls about to be scheduled into fetch batches to be downloaded by fetchers. This queue consists of tiers. Url tuples are sorted into a tier based on the number of urls that have been downloaded for that url's host and their weight.
public
addSendFetcherQueue(array<string|int, mixed> $url_tuples, string $crawl_order) : mixed
Naively, without weights, a url goes into tier floor(log(# of urls already downloaded for its host)). Within a tier, urls are stored in folders by day received and then in a file from a sequence of files according to order received. Each file in the sequence can store up to 1MB of compressed url tuples.
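The naive rule above can be illustrated with the following sketch; it only covers the unweighted case, since the real computeTierUrl() also factors in weights, company level domain data, and the crawl order:

```php
// floor(log(# of urls already downloaded for the url's host)), clamped at tier 0.
function naiveTier(int $num_downloaded_for_host): int
{
    return (int)floor(log(max(1, $num_downloaded_for_host)));
}
```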
Parameters
- $url_tuples : array<string|int, mixed>
-
array of tuples of the form (url, weight, referer)
- $crawl_order : string
-
one of CrawlConstants::BREADTH_FIRST or CrawlConstants::HOST_BUDGETING
Return values
mixed
addUrlsDirectory()
Adds the url info (such as url tuples (url, weight, referer)) to the appropriate file in a subfolder of the folder $dir used to implement a CrawlQueueBundle queue. If $timestamp is 0, data is stored in $dir/current day timestamp/last_file_in_folder.txt.gz; if the last file exceeds 1MB, a new last file is started. If $timestamp > 0, data is stored in $dir/$timestamp's day timestamp/$timestamp.txt.gz
public
addUrlsDirectory(string $dir, array<string|int, mixed> $url_info, int $timestamp) : mixed
Parameters
- $dir : string
-
folder to store data into a subfolder of
- $url_info : array<string|int, mixed>
-
information to serialized, compress, and store
- $timestamp : int
-
to use during storage to determine path as described above
Return values
mixed
addWaitRobotQueue()
Adds an array of url tuples to the queue of urls waiting for robots.txt files to be received. This queue consists of a folder CrawlQueueBundle::ROBOT_WAIT_FOLDER whose subfolders are named by the hash of a host that doesn't have a robots.txt file received yet.
public
addWaitRobotQueue(array<string|int, mixed> $url_tuples) : mixed
The url tuples are sorted into the appropriate host subfolder and stored in subfolders by the day received, then in a file from a sequence of files according to order received. Each file in the sequence can store up to 1MB of compressed url tuples.
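A hedged sketch of the overall robots.txt wait flow, assuming $queue is a CrawlQueueBundle instance and the url tuple below is illustrative:

```php
$url_tuples = [["https://www.example.com/page.html", 1.0, "https://www.example.com/"]];
// Park the urls until a robots.txt file for their host has been received.
$queue->addWaitRobotQueue($url_tuples);
// ... later, after robots.txt data for the host has been processed and the host
// appears in $robot_notify_hosts ...
$queue->processReceivedRobotTxtUrls(CrawlConstants::HOST_BUDGETING);
```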
Parameters
- $url_tuples : array<string|int, mixed>
-
array of tuples of the form (url, weight, referer)
Return values
mixed
checkRobotOkay()
Checks if the given $url is allowed to be crawled based on stored robots.txt info.
public
checkRobotOkay(string $url) : bool
Parameters
- $url : string
-
to check
Return values
bool —whether it was allowed or not
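A short sketch, assuming $queue is a CrawlQueueBundle with robots.txt data already stored for the host:

```php
if ($queue->checkRobotOkay("https://www.example.com/private/page.html")) {
    // allowed by the stored robots.txt rules: safe to schedule for download
}
```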
chooseFetchBatchQueueFolder()
Returns the path to the send-fetcher-queue tier to use to make the next fetch batch of urls to download.
public
chooseFetchBatchQueueFolder(string $crawl_order) : string
Parameters
- $crawl_order : string
-
one of CrawlConstants::BREADTH_FIRST or CrawlConstants::HOST_BUDGETING
Return values
string —path to send-fetcher-queue tier
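A hedged sketch of how this method pairs with addSendFetcherQueue(); the url tuple is illustrative and $queue is assumed to be a CrawlQueueBundle instance:

```php
$url_tuples = [["https://www.example.com/page.html", 1.0, "https://www.example.com/"]];
// Sort the tuples into send-fetcher-queue tiers, then pick the tier folder the
// next fetch batch should be drawn from.
$queue->addSendFetcherQueue($url_tuples, CrawlConstants::HOST_BUDGETING);
$tier_folder = $queue->chooseFetchBatchQueueFolder(CrawlConstants::HOST_BUDGETING);
```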
computeTierUrl()
Used to compute which send-fetcher-queue tier a url should be added to, based on the data related to the url in $url_tuple, its company level domain data, and the crawl order being used
public
computeTierUrl(array<string|int, mixed> $url_tuple, array<string|int, mixed> $cld_data, string $crawl_order) : int
Parameters
- $url_tuple : array<string|int, mixed>
-
5-tuple containing a url, its weight, the depth in the crawl where it was found, the url that referred to it, and that url's weight
- $cld_data : array<string|int, mixed>
- $crawl_order : string
-
one of CrawlConstants::BREADTH_FIRST or CrawlConstants::HOST_BUDGETING
Return values
int —tier $url should be queued into
differenceSeenUrls()
Removes all url objects from $url_array which have been seen
public
differenceSeenUrls(array<string|int, mixed> &$url_array[, array<string|int, mixed> $field_names = null ]) : mixed
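A small sketch pairing this method with addSeenUrlFilter(); $queue is assumed to be a CrawlQueueBundle instance:

```php
$queue->addSeenUrlFilter("https://www.example.com/a.html");
$candidates = ["https://www.example.com/a.html", "https://www.example.com/b.html"];
// $field_names omitted: $candidates is a plain array of url strings.
$queue->differenceSeenUrls($candidates);
// $candidates now only contains the b.html url.
```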
Parameters
- $url_array : array<string|int, mixed>
-
objects to check if have been seen
- $field_names : array<string|int, mixed> = null
-
an array of components of a url_array element which contain a url to check if seen. If null, assumes url_array is just an array of urls, not an array of url infos (i.e., an array of arrays), and just directly checks those strings
Return values
mixed
dnsLookup()
Used to lookup an entry in the DNS cache
public
dnsLookup(string $host) : string
Parameters
- $host : string
-
hostname to look up in the DNS cache
Return values
string —ipv4 or ipv6 address written as a string
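A minimal sketch pairing addDNSCache() with dnsLookup(); the host and address are illustrative:

```php
$queue->addDNSCache("www.example.com", "93.184.216.34");
$ip = $queue->dnsLookup("www.example.com"); // expect "93.184.216.34"
```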
emptyDNSCache()
Delete the Hash table used to store DNS lookup info.
public
emptyDNSCache() : string
Then construct an empty new one. This is called roughly once a day at the same time as
Tags
Return values
string —$message with what happened during empty process
emptyUrlFilter()
Empty the crawled url filter for this queue bundle; resets the timestamp of the last time this filter was emptied.
public
emptyUrlFilter() : mixed
Return values
mixed
getDayFolders()
Returns an array of all the days folders for a crawl queue.
public
getDayFolders(string $dir) : array<string|int, mixed>
By design, queues in a CrawlQueueBundle consist of a sequence of subfolders with day timestamps (floor(unixstamp/86400)), and then files within these folders. This function returns a list of the day folder paths for such a queue. Note this function assumes that there aren't so many day folders that their paths exceed available memory; if a crawl runs at most a few years, this should be the case.
Parameters
- $dir : string
-
folder which is acting as a CrawlQueueBundle queue
Return values
array<string|int, mixed> —of paths to day folders
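A sketch of walking one queue with this method together with getUrlsFiles() and getUrlsFileContents(); the tier path below is an assumption pieced together from the constants documented above, not a layout guaranteed by the source:

```php
// Assumed layout: <bundle dir>/UrlQueue/Tier0/<day timestamp>/<nine digit>.txt.gz
$tier_dir = $queue->dir_name . "/" . CrawlQueueBundle::URL_QUEUE_FOLDER . "/" .
    CrawlQueueBundle::TIER_PREFIX . "0";
foreach ($queue->getDayFolders($tier_dir) as $day_folder) {
    foreach ($queue->getUrlsFiles($day_folder) as $url_file) {
        $url_tuples = $queue->getUrlsFileContents($url_file);
        // ... process url info such as (url, weight, referer) tuples
    }
}
```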
getDnsAge()
Gets the timestamp of the oldest dns address still stored in the queue bundle
public
getDnsAge() : int
Return values
int —a Unix timestamp
getRobotData()
For a provided hostname, returns the robots.txt information stored in the robot table: [HOSTNAME, CAPTURE_TIME, CRAWL_DELAY, ROBOT_PATHS => [ALLOWED_SITES, DISALLOWED_SITES], FLAGS] (FLAGS records whether to wait for notification from a schedule being downloaded before continuing to crawl the site).
public
getRobotData(string $host) : array<string|int, mixed>
Parameters
- $host : string
-
hostname to look up robots.txt info for (no trailing / in hostname, i.e., https://www.yahoo.com, not https://www.yahoo.com/)
Return values
array<string|int, mixed> —robot table row as described above
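A sketch of reading back stored robots.txt data; the string keys follow the field names listed above but should be treated as assumptions about the actual array keys (in the source they may be CrawlConstants constants):

```php
$robot_data = $queue->getRobotData("https://www.example.com");
$crawl_delay = $robot_data["CRAWL_DELAY"] ?? 0; // assumed key name
$disallowed = $robot_data["ROBOT_PATHS"]["DISALLOWED_SITES"] ?? []; // assumed key names
```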
getUrlFilterAge()
Gets the timestamp of the oldest url filter data still stored in the queue bundle
public
getUrlFilterAge() : int
Return values
int —a Unix timestamp
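A hedged maintenance sketch using the two age methods; the one day interval is a hypothetical policy, not a value taken from the source:

```php
$one_day = 86400;
if (time() - $queue->getDnsAge() > $one_day) {
    $queue->emptyDNSCache();
}
if (time() - $queue->getUrlFilterAge() > $one_day) {
    $queue->emptyUrlFilter();
}
```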
getUrlsFileContents()
Returns the unserialized contents of a url info file after decompression.
public
getUrlsFileContents(string $file_name) : array<string|int, mixed>
Assumes the resulting structure is small enough to fit in memory
Parameters
- $file_name : string
-
name of url info file
Return values
array<string|int, mixed> —of uncompressed, unserialized contents of this file.
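A round-trip sketch with putUrlsFileContents() (documented below); the file name is a hypothetical example of the nine digit naming scheme:

```php
$tuples = [["https://www.example.com/", 1.0, "https://www.example.com/index.html"]];
$queue->putUrlsFileContents("/tmp/000000001.txt.gz", $tuples);
$restored = $queue->getUrlsFileContents("/tmp/000000001.txt.gz"); // same tuples back
```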
getUrlsFiles()
Returns an array of all the url info files in a queue subfolder of a queue for a CrawlQueueBundle. Url info files are usually stored in a file with a nine digit number followed by the queue's file extension (usually .txt.gz) and store up to 1MB of compressed url info.
public
getUrlsFiles(string $dir) : array<string|int, mixed>
This function assumes the paths to all of the url info files in the provided folder can fit in memory
Parameters
- $dir : string
-
folder containing url info files
Return values
array<string|int, mixed> —of paths to each url info file found.
gotRobotTxtTime()
Returns the timestamp of the last time a host's robots.txt file was downloaded
public
gotRobotTxtTime(string $host) : int|bool
Parameters
- $host : string
-
url to check
Return values
int|bool —returns false if no capture of robots.txt yet, otherwise returns an integer timestamp
isResumable()
Checks whether the crawl queue bundle stored in the provided folder can be resumed
public
static isResumable(mixed $queue_bundle_dir) : mixed
Parameters
- $queue_bundle_dir : mixed
Return values
mixed
notifyCrawlDelayedHosts()
For each host in the crawl-delayed hosts queue waiting on the fetch batch schedule with timestamp $timestamp, clear its FLAGS field in the robot table so that urls with this host are allowed to be scheduled into future fetch batches for download.
public
notifyCrawlDelayedHosts(int $timestamp) : mixed
Parameters
- $timestamp : int
-
of a fetch batch schedule to notify crawl-delayed hosts that it has completed download.
Return values
mixed
processReceivedRobotTxtUrls()
This method moves urls that are in the waiting hosts folder for hosts listed in $this->robot_notify_hosts into the url queue, since a host's membership in $this->robot_notify_hosts indicates that a robots.txt file has just been received for it.
public
processReceivedRobotTxtUrls(string $crawl_order) : mixed
Parameters
- $crawl_order : string
-
one of CrawlConstants::BREADTH_FIRST or CrawlConstants::HOST_BUDGETING
Return values
mixed
processWaitingHostFile()
Used by @see notifyCrawlDelayedHosts($timestamp).
public
processWaitingHostFile(string $file_name, mixed $robot_rows) : mixed
For each host listed in the file $file_name, get its robot info from the robot_table, clear its FLAGS column, and store the update into a temporary array $robot_rows. Every MAX_URL_BUFFER_BEFORE_WRITE many such hosts, write the updates in $robot_rows back to the robot_table on disk. Any modified rows that have not yet been written when the file is done being processed are returned in $robot_rows.
Parameters
- $file_name : string
-
file to get hosts from whose FLAGS columns should be cleared
- $robot_rows : mixed
-
rows of updated hosts, potentially left over from a previously processed file
Return values
mixed —leftover updated robot host rows that haven't been written to disk yet
putUrlsFileContents()
Serializes and compresses the url info (such as url tuples (url, weight, referer)) provided in $url_data and saves the results into $file_name
public
putUrlsFileContents(string $file_name, array<string|int, mixed> $url_data) : mixed
Parameters
- $file_name : string
-
name of file to store url info into
- $url_data : array<string|int, mixed>
-
data to be serialized, compressed, and stored.
Return values
mixed
updateCompanyLevelDomainData()
Computes an update to the company level domain data provided in $cld_data, updating the WEIGHTED_SEEN_URLS and WEIGHTED_INCOMING_URLS fields according to information about a discovered url in $url_tuple
public
updateCompanyLevelDomainData(array<string|int, mixed> $url_tuple, array<string|int, mixed> $cld_data, string $crawl_order) : int
Parameters
- $url_tuple : array<string|int, mixed>
-
5-tuple containing a url, its weight, the depth in the crawl where it was found, the url that referred to it, and that url's weight
- $cld_data : array<string|int, mixed>
-
company level domain data to update
- $crawl_order : string
-
one of CrawlConstants::BREADTH_FIRST or CrawlConstants::HOST_BUDGETING
Return values
int —tier $url should be queued into