Yioop V9.5 Source Code Documentation

CrawlQueueBundle
in package

Encapsulates the data structures needed to have a queue of urls to crawl

Tags
author

Chris Pollett

Table of Contents

CRAWL_DELAYED_FOLDER  = "CrawlDelayedHosts"
Name of the folder used for the queue of crawl-delayed hosts waiting on fetch batches to complete
HASH_KEY_SIZE  = 8
Number of bytes in a hash table key
INT_SIZE  = 4
Size of int
IP_SIZE  = 16
Length of an IPv6 ip address (IPv4 addresses are padded)
MAX_URL_BUFFER_BEFORE_WRITE  = 500
When writing urls to robot_table, how many to buffer at a time and then bulk put.
MAX_URL_FILE_SIZE  = 1000000
Maximum number of bytes of compressed url info to store in a single url file before a new file is started (roughly 1MB)
NO_FLAGS  = 0
Flag value indicating no special status for a url's host
ROBOT_WAIT_FOLDER  = "WaitRobotUrls"
Name of the folder holding urls waiting for their hosts' robots.txt files to be received
TIER_PREFIX  = "Tier"
Prefix used to name the tier subfolders of the send-fetcher queue
URL_FILES_EXTENSION  = ".txt.gz"
File extension used for files of serialized url data
URL_QUEUE_FOLDER  = "UrlQueue"
Name of the folder holding the queue of urls waiting to be scheduled into fetch batches
WAITING_HOST  = 1
Flag value indicating a url's host is waiting (e.g., for a crawl-delayed fetch batch to complete) before its urls may be scheduled
$dir_name  : string
The folder name of this CrawlQueueBundle
$dns_table  : object
Host-IP table used for DNS look-ups; it comes from robots.txt data and is deleted with the same frequency
$domain_table  : LinearHashTable
LinearHashTable of information about company level domains that have been crawled. Information includes number of SEEN_URLS, number of WEIGHTED_SEEN_URLS, number of WEIGHTED_INCOMING_URLS.
$etag_table  : LinearHashTable
Holds etag and expires http data
$filter_size  : int
Number of items that can be stored in a partition of the page exists filter bundle
$num_urls_ram  : int
Number of entries the priority queue used by this crawl queue bundle can store in RAM
$robot_cache  : array<string|int, mixed>
RAM cache of recent robot.txt stuff crawlHash(host) => robot.txt info
$robot_cache_times  : array<string|int, mixed>
Time when the cached robot.txt info for a host was made, in the format crawlHash(host) => timestamp
$robot_notify_hosts  : array<string|int, mixed>
Array of hosts for which a robots.txt file has just been received and processed, but whose urls are still waiting to be notified for queueing.
$robot_table  : LinearHashTable
LinearHashTable used to store robots.txt information (such as capture time, crawl delay, and allowed/disallowed paths) for hosts
$url_exists_filter_bundle  : object
BloomFilter used to keep track of which urls we've already seen
__construct()  : mixed
Makes a CrawlQueueBundle with the provided parameters
addCrawlDelayedHosts()  : mixed
For a timestamp $schedule_time of a fetch batch of urls to be downloaded, and for a list of crawl-delayed hosts in that batch, adds the hosts to a $schedule_time file in the CrawlDelayedHosts queue so they can be notified when that fetch batch is done processing. Until notified, any url from one of these crawl-delayed hosts will be rescheduled rather than put in a fetch batch for download.
addDNSCache()  : mixed
Adds an entry to this queue bundle's DNS cache
addSeenUrlFilter()  : mixed
Adds the supplied url to the url_exists_filter_bundle
addSendFetcherQueue()  : mixed
Adds an array of url tuples to the queue of urls about to be scheduled into fetch batches to be downloaded by fetchers. This queue consists of tiers. Url tuples are sorted into a tier based on the number of urls that have been downloaded for that url's host and their weight.
addUrlsDirectory()  : mixed
Adds the url info (such as url tuples (url, weight, referer)) to the appropriate file in a subfolder of the folder $dir used to implement a CrawlQueueBundle queue. If $timestamp is 0, then data is stored in $dir/current day time stamp/last_file_in_folder.txt.gz. If the last file exceeds 1MB, a new last file is started. If $timestamp > 0, then data is stored in $dir/$timestamp's day time stamp/$timestamp.txt.gz
addWaitRobotQueue()  : mixed
Adds an array of url tuples to the queue of urls waiting for robots.txt files to be received. This queue consists of a folder CrawlQueueBundle::ROBOT_WAIT_FOLDER whose subfolders are named by the hash of a host name for which a robots.txt file has not yet been received.
checkRobotOkay()  : bool
Checks if the given $url is allowed to be crawled based on stored robots.txt info.
chooseFetchBatchQueueFolder()  : string
Returns the path to the send-fetcher-queue tier to use to make the next fetch batch of urls to download.
computeTierUrl()  : int
Used to compute which send-fetcher-queue tier a url should be added to, based on the data related to the url in $url_tuple, its company level domain data, and the crawl order being used
differenceSeenUrls()  : mixed
Removes all url objects from $url_array which have been seen
dnsLookup()  : string
Used to lookup an entry in the DNS cache
emptyDNSCache()  : string
Delete the Hash table used to store DNS lookup info.
emptyUrlFilter()  : mixed
Empty the crawled url filter for this web queue bundle; resets the timestamp of the last time this filter was emptied.
getDayFolders()  : array<string|int, mixed>
Returns an array of all the days folders for a crawl queue.
getDnsAge()  : int
Gets the timestamp of the oldest dns address still stored in the queue bundle
getRobotData()  : array<string|int, mixed>
For a provided hostname, returns the robots.txt information stored in the robot table: [HOSTNAME, CAPTURE_TIME, CRAWL_DELAY, ROBOT_PATHS => [ALLOWED_SITES, DISALLOWED_SITES], FLAGS (noting whether the crawler should wait for notification that a schedule being downloaded has finished before continuing to crawl the site)].
getUrlFilterAge()  : int
Gets the timestamp of the oldest url filter data still stored in the queue bundle
getUrlsFileContents()  : array<string|int, mixed>
Returns the unserialized contents of a url info file after decompression.
getUrlsFiles()  : array<string|int, mixed>
Returns an array of all the url info files in a queue subfolder of a queue for a CrawlQueueBundle. Url info files are usually stored in a file with a nine digit number followed by the queue's file extension (usually .txt.gz) and store up to 1MB of compressed url info.
gotRobotTxtTime()  : int|bool
Returns the timestamp of the last time a host's robots.txt file was downloaded
isResumable()  : mixed
Checks whether a crawl can be resumed from the CrawlQueueBundle stored in a given directory
notifyCrawlDelayedHosts()  : mixed
For each host in the crawl-delayed hosts queue waiting on the fetch batch schedule with timestamp $timestamp, clear their FLAGS variable in the robot table so that urls from these hosts are allowed to be scheduled into future fetch batches for download.
processReceivedRobotTxtUrls()  : mixed
Moves urls that are in the waiting hosts folder for hosts listed in $this->robot_notify_hosts into the url queue; membership of a host in $this->robot_notify_hosts indicates that a robots.txt file has just been received for that host.
processWaitingHostFile()  : mixed
Used by @see notifyCrawlDelayedHosts($timestamp).
putUrlsFileContents()  : mixed
Serializes and compresses the url info (such as url tuples (url, weight, referer)) provided in $url_data and saves the results into $file_name
updateCompanyLevelDomainData()  : int
Computes an update to the company level domain data provided in cld_data, updating the WEIGHTED_SEEN_URLS and WEIGHTED_INCOMING_URLS fields according to information about a discovered url in $url_tuple

Constants

CRAWL_DELAYED_FOLDER

Name of the folder used for the queue of crawl-delayed hosts waiting on fetch batches to complete

public mixed CRAWL_DELAYED_FOLDER = "CrawlDelayedHosts"

HASH_KEY_SIZE

Number of bytes in a hash table key

public mixed HASH_KEY_SIZE = 8

IP_SIZE

Length of an IPv6 ip address (IPv4 addresses are padded)

public mixed IP_SIZE = 16

MAX_URL_BUFFER_BEFORE_WRITE

When writing urls to robot_table, how many to buffer at a time and then bulk put.

public mixed MAX_URL_BUFFER_BEFORE_WRITE = 500

MAX_URL_FILE_SIZE

Maximum number of bytes of compressed url info to store in a single url file before a new file is started (roughly 1MB)

public mixed MAX_URL_FILE_SIZE = 1000000

ROBOT_WAIT_FOLDER

Name of the folder holding urls waiting for their hosts' robots.txt files to be received

public mixed ROBOT_WAIT_FOLDER = "WaitRobotUrls"

URL_FILES_EXTENSION

File extension used for files of serialized url data

public mixed URL_FILES_EXTENSION = ".txt.gz"

URL_QUEUE_FOLDER

Name of the folder holding the queue of urls waiting to be scheduled into fetch batches

public mixed URL_QUEUE_FOLDER = "UrlQueue"

Properties

$dir_name

The folder name of this CrawlQueueBundle

public string $dir_name

$dns_table

Host-IP table used for DNS look-ups; it comes from robots.txt data and is deleted with the same frequency

public object $dns_table

$domain_table

LinearHashTable of information about company level domains that have been crawled. Information includes number of SEEN_URLS, number of WEIGHTED_SEEN_URLS, number of WEIGHTED_INCOMING_URLS.

public LinearHashTable $domain_table

(A company level domain is google.com or google.co.uk, but not fo.la.google.com, www.google.com, foo.google.com or foo.google.co.uk)

$filter_size

Number of items that can be stored in a partition of the page exists filter bundle

public int $filter_size

$num_urls_ram

Number of entries the priority queue used by this crawl queue bundle can store in RAM

public int $num_urls_ram

$robot_cache

RAM cache of recent robot.txt stuff crawlHash(host) => robot.txt info

public array<string|int, mixed> $robot_cache = []

$robot_cache_times

Time when the cached robot.txt info for a host was made, in the format crawlHash(host) => timestamp

public array<string|int, mixed> $robot_cache_times = []

$robot_notify_hosts

Array of hosts for which a robots.txt file has just been received and processed, but whose urls are still waiting to be notified for queueing.

public array<string|int, mixed> $robot_notify_hosts

$url_exists_filter_bundle

BloomFilter used to keep track of which urls we've already seen

public object $url_exists_filter_bundle

Methods

__construct()

Makes a CrawlQueueBundle with the provided parameters

public __construct(string $dir_name, int $filter_size, int $num_urls_ram) : mixed
Parameters
$dir_name : string

folder name used by this CrawlQueueBundle

$filter_size : int

size of each partition in the page exists BloomFilterBundle

$num_urls_ram : int

number of entries in ram for the priority queue

Return values
mixed
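
A minimal construction sketch (not code from the Yioop sources): it assumes the class has been loaded via Yioop's autoloader and that the directory below is writable; the folder name and size values are made up for illustration.

    // Hypothetical setup values for illustration only
    $bundle_dir = "/tmp/TestCrawlQueueBundle"; // illustrative folder name
    $filter_size = 1000000;  // items per page-exists BloomFilterBundle partition
    $num_urls_ram = 300000;  // entries the in-RAM priority queue can hold
    $queue_bundle = new CrawlQueueBundle($bundle_dir, $filter_size,
        $num_urls_ram);

Later sketches in this page reuse this $queue_bundle variable.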

addCrawlDelayedHosts()

For a timestamp $schedule_time of a fetch batch of urls to be downloaded, and for a list of crawl-delayed hosts in that batch, adds the hosts to a $schedule_time file in the CrawlDelayedHosts queue so they can be notified when that fetch batch is done processing. Until notified, any url from one of these crawl-delayed hosts will be rescheduled rather than put in a fetch batch for download.

public addCrawlDelayedHosts(mixed $schedule_time, array<string|int, mixed> $host_urls) : mixed
Parameters
$schedule_time : mixed
$host_urls : array<string|int, mixed>

array of urls for hosts that are crawl delayed and for which there is a schedule currently running on fetchers which might download from that host

Return values
mixed

addDNSCache()

Adds an entry to this queue bundle's DNS cache

public addDNSCache(string $host, string $ip_address) : mixed
Parameters
$host : string

hostname to add to DNS Lookup table

$ip_address : string

in presentation format (not as int) to add to table

Return values
mixed
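
A hedged round-trip sketch, reusing the $queue_bundle from the construction example under __construct(); the return value convention for a cache miss is an assumption noted in the comment.

    // Cache the resolved address for a host in presentation format
    $queue_bundle->addDNSCache("www.example.com", "93.184.216.34");
    // ... later, check the cache before doing a fresh DNS lookup
    $ip = $queue_bundle->dnsLookup("www.example.com");
    if (!empty($ip)) { // assumes an empty/false value signals a cache miss
        echo "Cached address for www.example.com is $ip\n";
    }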

addSeenUrlFilter()

Adds the supplied url to the url_exists_filter_bundle

public addSeenUrlFilter(string $url) : mixed
Parameters
$url : string

url to add

Return values
mixed

addSendFetcherQueue()

Adds an array of url tuples to the queue of urls about to be scheduled into fetch batches to be downloaded by fetchers. This queue consists of tiers. Url tuples are sorted into a tier based on the number of urls that have been downloaded for that url's host and their weight.

public addSendFetcherQueue(array<string|int, mixed> $url_tuples, string $crawl_order) : mixed

Naively, without weight, a url goes into tier floor(log(# of urls downloaded already for its host)). Within a tier, urls are stored in folders by day received and then into a file from a sequence of files according to the order received. Each file in the sequence can store up to 1MB of compressed url tuples.

Parameters
$url_tuples : array<string|int, mixed>

array of tuples of the form (url, weight, referer)

$crawl_order : string

one of CrawlConstants::BREADTH_FIRST or CrawlConstants::HOST_BUDGETING

Return values
mixed
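
The naive tier rule quoted above can be illustrated with a short sketch (illustrative only: the real tier computation in computeTierUrl() also uses url weights, company level domain data, and the crawl order, and the base of the logarithm is not specified here; a natural log is assumed).

    // Naive tier for a host from which $num_downloaded urls have already
    // been downloaded: floor(log(# downloaded)), natural log assumed.
    function naiveTier(int $num_downloaded): int
    {
        return ($num_downloaded <= 1) ? 0 : (int)floor(log($num_downloaded));
    }
    // e.g., naiveTier(1) == 0, naiveTier(100) == 4, naiveTier(10000) == 9
    // Queueing a (url, weight, referer) tuple under host budgeting might look like:
    $queue_bundle->addSendFetcherQueue(
        [["https://www.example.com/", 1.0, "https://www.example.com/index.php"]],
        CrawlConstants::HOST_BUDGETING);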

addUrlsDirectory()

Adds the url info (such as url tuples (url, weight, referer)) to the appropriate file in a subfolder of the folder $dir used to implement a CrawlQueueBundle queue. If $timestamp is 0, then data is stored in $dir/current day time stamp/last_file_in_folder.txt.gz. If the last file exceeds 1MB, a new last file is started. If $timestamp > 0, then data is stored in $dir/$timestamp's day time stamp/$timestamp.txt.gz

public addUrlsDirectory(string $dir, array<string|int, mixed> $url_info, int $timestamp) : mixed
Parameters
$dir : string

folder to store data into a subfolder of

$url_info : array<string|int, mixed>

information to serialize, compress, and store

$timestamp : int

to use during storage to determine path as described above

Return values
mixed

addWaitRobotQueue()

Adds an array of url tuples to the queue of urls waiting for robots.txt files to be received. This queue consists of a folder CrawlQueueBundle::ROBOT_WAIT_FOLDER whose subfolders are named by the hash of a host name for which a robots.txt file has not yet been received.

public addWaitRobotQueue(array<string|int, mixed> $url_tuples) : mixed

The url tuples are sorted into the appropriate host subfolder and stored in subfolders by the day received, and then into a file in a sequence of files according to the order received. Each file in the sequence can store up to 1MB of compressed url tuples.

Parameters
$url_tuples : array<string|int, mixed>

array of tuples of the form (url, weight, referer)

Return values
mixed

checkRobotOkay()

Checks if the given $url is allowed to be crawled based on stored robots.txt info.

public checkRobotOkay(string $url) : bool
Parameters
$url : string

to check

Return values
bool

whether it was allowed or not
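
A hedged sketch of how a caller might gate urls on robots.txt data (reusing $queue_bundle; the tuple passed to addWaitRobotQueue() is assumed to be of the (url, weight, referer) form per that method's description, and the exact control flow inside Yioop's queue server may differ).

    $url = "https://www.example.com/some/page.html";
    $host = "https://www.example.com"; // no trailing slash, per getRobotData()
    if ($queue_bundle->gotRobotTxtTime($host) === false) {
        // no robots.txt captured yet: let the url wait for one
        $queue_bundle->addWaitRobotQueue([[$url, 1.0, ""]]);
    } else if ($queue_bundle->checkRobotOkay($url)) {
        // allowed by the stored robots.txt rules; safe to schedule for download
    } else {
        // disallowed by robots.txt; drop the url
    }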

chooseFetchBatchQueueFolder()

Returns the path to the send-fetcher-queue tier to use to make the next fetch batch of urls to download.

public chooseFetchBatchQueueFolder(string $crawl_order) : string
Parameters
$crawl_order : string

one of CrawlConstants::BREADTH_FIRST or CrawlConstants::HOST_BUDGETING

Return values
string

path to send-fetcher-queue tier

computeTierUrl()

Used to compute which send-fetcher-queue tier a url should be added to, based on the data related to the url in $url_tuple, its company level domain data, and the crawl order being used

public computeTierUrl(array<string|int, mixed> $url_tuple, array<string|int, mixed> $cld_data, string $crawl_order) : int
Parameters
$url_tuple : array<string|int, mixed>

5-tuple containing a url, its weight, the depth in the crawl where it was found, the url that referred to it, and that url's weight

$cld_data : array<string|int, mixed>
$crawl_order : string

one of CrawlConstants::BREADTH_FIRST or CrawlConstants::HOST_BUDGETING

Return values
int

tier $url should be queued into

differenceSeenUrls()

Removes all url objects from $url_array which have been seen

public differenceSeenUrls(array<string|int, mixed> &$url_array[, array<string|int, mixed> $field_names = null ]) : mixed
Parameters
$url_array : array<string|int, mixed>

objects to check whether they have been seen

$field_names : array<string|int, mixed> = null

an array of components of a url_array element which contain a url to check if seen. If null, assumes $url_array is just an array of urls, not an array of url infos (i.e., an array of arrays), and directly checks those strings

Return values
mixed
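
An illustrative dedup round trip (a sketch reusing $queue_bundle; plain url strings are passed, so $field_names is left at its null default).

    // Record that a url has been seen, then filter it from a later candidate list
    $queue_bundle->addSeenUrlFilter("https://www.example.com/a.html");
    $candidates = [
        "https://www.example.com/a.html", // already seen, should be removed
        "https://www.example.com/b.html", // not seen, should remain
    ];
    $queue_bundle->differenceSeenUrls($candidates);
    // $candidates now contains only the unseen url(s)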

dnsLookup()

Used to lookup an entry in the DNS cache

public dnsLookup(string $host) : string
Parameters
$host : string

hostname to look up in the DNS cache

Return values
string

ipv4 or ipv6 address written as a string

emptyDNSCache()

Delete the Hash table used to store DNS lookup info.

public emptyDNSCache() : string

Then constructs an empty new one. This is called roughly once a day, at the same time as emptyRobotFilters().

Tags
see
emptyRobotFilters()
Return values
string

$message with what happened during empty process

emptyUrlFilter()

Empty the crawled url filter for this web queue bundle; resets the timestamp of the last time this filter was emptied.

public emptyUrlFilter() : mixed
Return values
mixed

getDayFolders()

Returns an array of all the days folders for a crawl queue.

public getDayFolders(string $dir) : array<string|int, mixed>

By design, queues in a CrawlQueueBundle consist of a sequence of subfolders with day timestamps (floor(unixstamp/86400)), and then files within these folders. This function returns a list of the day folder paths for such a queue. Note this function assumes there aren't so many day folders that the returned list exceeds memory; if a crawl runs for at most a few years, this should be the case.

Parameters
$dir : string

folder which is acting as a CrawlQueueBundle queue

Return values
array<string|int, mixed>

of paths to day folders
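
The day-stamp convention above can be illustrated as follows (a sketch reusing $queue_bundle and $bundle_dir from the __construct() example; the UrlQueue sub-path is an assumption built from URL_QUEUE_FOLDER, not a path copied from the sources).

    // Day stamp = number of whole days since the Unix epoch
    $day_stamp = floor(time() / 86400); // e.g., 20000 corresponds to Oct 2024
    // Hypothetical queue layout: <bundle dir>/UrlQueue/<day stamp>/<url files>
    $queue_dir = $bundle_dir . "/" . CrawlQueueBundle::URL_QUEUE_FOLDER;
    foreach ($queue_bundle->getDayFolders($queue_dir) as $day_folder) {
        echo basename($day_folder), "\n"; // prints the day stamps found
    }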

getDnsAge()

Gets the timestamp of the oldest dns address still stored in the queue bundle

public getDnsAge() : int
Return values
int

a Unix timestamp

getRobotData()

For a provided hostname, returns the robots.txt information stored in the robot table: [HOSTNAME, CAPTURE_TIME, CRAWL_DELAY, ROBOT_PATHS => [ALLOWED_SITES, DISALLOWED_SITES], FLAGS (noting whether the crawler should wait for notification that a schedule being downloaded has finished before continuing to crawl the site)].

public getRobotData(string $host) : array<string|int, mixed>
Parameters
$host : string

hostname to look up robots.txt info for. (No trailing / in hostname; i.e., https://www.yahoo.com, not https://www.yahoo.com/)

Return values
array<string|int, mixed>

robot table row as described above
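
A sketch of reading back stored robots.txt data (reusing $queue_bundle; treating the field names listed above as plain array keys is an assumption about the row format).

    $robot_row = $queue_bundle->getRobotData("https://www.example.com");
    if (!empty($robot_row)) {
        // keys below are assumed to match the field list above
        $capture_time = $robot_row["CAPTURE_TIME"] ?? 0;
        $crawl_delay = $robot_row["CRAWL_DELAY"] ?? 0;
        echo "robots.txt captured at $capture_time; crawl delay $crawl_delay\n";
    }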

getUrlFilterAge()

Gets the timestamp of the oldest url filter data still stored in the queue bundle

public getUrlFilterAge() : int
Return values
int

a Unix timestamp

getUrlsFileContents()

Returns the unserialized contents of a url info file after decompression.

public getUrlsFileContents(string $file_name) : array<string|int, mixed>

Assumes the resulting structure is small enough to fit in memory

Parameters
$file_name : string

name of url info file

Return values
array<string|int, mixed>

of uncompressed, unserialized contents of this file.

getUrlsFiles()

Returns an array of all the url info files in a queue subfolder of a queue for a CrawlQueueBundle. Url info files are usually stored in a file with a nine digit number followed by the queue's file extension (usually .txt.gz) and store up to 1MB of compressed url info.

public getUrlsFiles(string $dir) : array<string|int, mixed>

This function assumes the paths to the url info files in the provided folder can fit in memory

Parameters
$dir : string

folder containing url info files

Return values
array<string|int, mixed>

of paths to each url info file found.
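
The nine-digit naming scheme suggests file names like the one sketched below (illustrative; the $day_folder path is hypothetical and $queue_bundle is the one constructed earlier).

    // Illustrative name of the 42nd url info file in a day folder:
    $example_name = sprintf("%09d", 42) . CrawlQueueBundle::URL_FILES_EXTENSION;
    // => "000000042.txt.gz"
    $day_folder = "/tmp/TestCrawlQueueBundle/UrlQueue/" . floor(time() / 86400);
    foreach ($queue_bundle->getUrlsFiles($day_folder) as $url_file) {
        $url_infos = $queue_bundle->getUrlsFileContents($url_file);
        // ... each file holds up to ~1MB of compressed url info
    }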

gotRobotTxtTime()

Returns the timestamp of the last time a host's robots.txt file was downloaded

public gotRobotTxtTime(string $host) : int|bool
Parameters
$host : string

host to check

Return values
int|bool

returns false if no capture of robots.txt yet, otherwise returns an integer timestamp

isResumable()

Checks whether a crawl can be resumed from the CrawlQueueBundle stored in the provided directory.

public static isResumable(mixed $queue_bundle_dir) : mixed
Parameters
$queue_bundle_dir : mixed
Return values
mixed

notifyCrawlDelayedHosts()

For each host in the crawl-delayed hosts queue waiting on the fetch batch schedule with timestamp $timestamp, clear their FLAGS variable in the robot table so that urls from these hosts are allowed to be scheduled into future fetch batches for download.

public notifyCrawlDelayedHosts(int $timestamp) : mixed
Parameters
$timestamp : int

of a fetch batch schedule to notify crawl-delayed hosts that it has completed download.

Return values
mixed
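
A hedged sketch of the crawl-delay bookkeeping around one fetch batch (reusing $queue_bundle; the timestamp and url below are made up for illustration).

    $schedule_time = time(); // illustrative fetch batch timestamp
    // hosts in the batch that are crawl delayed; their urls must wait until
    // the batch finishes downloading before being rescheduled
    $queue_bundle->addCrawlDelayedHosts($schedule_time,
        ["https://slow.example.org/"]);
    // ... fetchers download the batch ...
    // once the batch with this timestamp completes, clear the wait flags
    $queue_bundle->notifyCrawlDelayedHosts($schedule_time);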

processReceivedRobotTxtUrls()

Moves urls that are in the waiting hosts folder for hosts listed in $this->robot_notify_hosts into the url queue; membership of a host in $this->robot_notify_hosts indicates that a robots.txt file has just been received for that host.

public processReceivedRobotTxtUrls(string $crawl_order) : mixed
Parameters
$crawl_order : string

one of CrawlConstants::BREADTH_FIRST or CrawlConstants::HOST_BUDGETING

Return values
mixed
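
A sketch of the hand-off from the robot wait queue into the url queue (reusing $queue_bundle; when exactly the queue server issues this call is not shown in this reference).

    // urls whose hosts have no robots.txt yet go into the robot wait queue ...
    $queue_bundle->addWaitRobotQueue([
        ["https://www.example.net/page.html", 1.0, ""],
    ]);
    // ... once robots.txt files arrive, their hosts appear in
    // $queue_bundle->robot_notify_hosts and the waiting urls can be released
    $queue_bundle->processReceivedRobotTxtUrls(CrawlConstants::HOST_BUDGETING);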

processWaitingHostFile()

Used by @see notifyCrawlDelayedHosts($timestamp).

public processWaitingHostFile(string $file_name, mixed $robot_rows) : mixed

For each host listed in the file $file_name, get its robot info from robot_table, clear its FLAGS column, and store the update into a temporary array $robot_rows. Every MAX_URL_BUFFER_BEFORE_WRITE many such hosts, write the updates in $robot_rows back to the robot_table on disk. Any modified rows that have not yet been written to disk when the file has been completely processed are returned in $robot_rows.

Parameters
$file_name : string

file to get hosts to clear flag columns of

$robot_rows : mixed

rows of updated hosts, potentially from a previously processed file

Return values
mixed

leftover updated robot host rows that haven't been written to disk yet
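
The buffer-then-bulk-write pattern described above looks roughly like the generic sketch below; bulkWriteRobotRows(), $hosts_in_file, and the FLAGS key are stand-in names for illustration, not Yioop APIs.

    // stand-in for a bulk put of updated rows into the robot table
    function bulkWriteRobotRows(array $rows): void { /* hypothetical */ }
    $hosts_in_file = ["https://a.example.com", "https://b.example.com"]; // assumed input
    $robot_rows = [];
    foreach ($hosts_in_file as $host) {
        $row = $queue_bundle->getRobotData($host);
        $row["FLAGS"] = 0; // clear the wait flag (key name assumed)
        $robot_rows[] = $row;
        if (count($robot_rows) >= CrawlQueueBundle::MAX_URL_BUFFER_BEFORE_WRITE) {
            bulkWriteRobotRows($robot_rows); // write a full buffer, then reset it
            $robot_rows = [];
        }
    }
    // any rows still buffered here are the "leftover" rows returned to the caller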

putUrlsFileContents()

Serializes and compresses the url info (such as url tuples (url, weight, referer)) provided in $url_data and saves the results into $file_name

public putUrlsFileContents(string $file_name, array<string|int, mixed> $url_data) : mixed
Parameters
$file_name : string

name of file to store url info into

$url_data : array<string|int, mixed>

data to be serialized, compressed, and stored.

Return values
mixed
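
A round-trip sketch (reusing $queue_bundle; the file path and tuple contents are illustrative).

    $file_name = "/tmp/000000001" . CrawlQueueBundle::URL_FILES_EXTENSION;
    $url_data = [
        ["https://www.example.com/", 1.0, ""], // assumed (url, weight, referer) form
    ];
    $queue_bundle->putUrlsFileContents($file_name, $url_data);
    $read_back = $queue_bundle->getUrlsFileContents($file_name);
    // $read_back should equal $url_data after decompression and unserialization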

updateCompanyLevelDomainData()

Computes an update to the company level domain data provided in cld_data, updating the WEIGHTED_SEEN_URLS and WEIGHTED_INCOMING_URLS fields according to information about a discovered url in $url_tuple

public updateCompanyLevelDomainData(array<string|int, mixed> $url_tuple, array<string|int, mixed> $cld_data, string $crawl_order) : int
Parameters
$url_tuple : array<string|int, mixed>

5-tuple containing a url, its weight, the depth in the crawl where it was found, the url that referred to it, and that url's weight

$cld_data : array<string|int, mixed>

company level domain data to update

$crawl_order : string

one of CrawlConstants::BREADTH_FIRST or CrawlConstants::HOST_BUDGETING

Return values
int

tier $url should be queued into


        
