Yioop V9.5 Source Code Documentation

CrawlQueueBundle
in package

Encapsulates the data structures needed to have a queue of urls to crawl

Tags
author

Chris Pollett

Table of Contents

CRAWL_DELAYED_FOLDER  = "CrawlDelayedHosts"
Name of the folder used for the queue of crawl-delayed hosts waiting on fetch batches to complete
HASH_KEY_SIZE  = 8
Number of bytes in a hash table key
INT_SIZE  = 4
Size of int
IP_SIZE  = 16
Length of an IPv6 ip address (IPv4 addresses are padded)
MAX_URL_BUFFER_BEFORE_WRITE  = 500
When writing urls to robot_table, how many to buffer at a time and then bulk put.
MAX_URL_FILE_SIZE  = 1000000
Maximum number of bytes of compressed url info to store in a single url file before a new file is started (roughly 1MB)
NO_FLAGS  = 0
Flag value indicating no special status for a url's host
ROBOT_WAIT_FOLDER  = "WaitRobotUrls"
Name of the folder holding urls waiting for their hosts' robots.txt files to be received
TIER_PREFIX  = "Tier"
Prefix used to name the tier subfolders of the send-fetcher queue
URL_FILES_EXTENSION  = ".txt.gz"
File extension used for files of serialized url data
URL_QUEUE_FOLDER  = "UrlQueue"
Name of the folder holding the queue of urls waiting to be scheduled into fetch batches
WAITING_HOST  = 1
Flag value indicating a url's host is waiting (e.g., for a crawl-delayed fetch batch to complete) before its urls may be scheduled
$dir_name  : string
The folder name of this CrawlQueueBundle
$dns_table  : object
Host-IP table used for DNS look-ups; it comes from robots.txt data and is deleted with the same frequency
$domain_table  : LinearHashTable
LinearHashTable of information about company level domains that have been crawled. Information includes number of SEEN_URLS, number of WEIGHTED_SEEN_URLS, number of WEIGHTED_INCOMING_URLS.
$etag_table  : LinearHashTable
Holds etag and expires http data
$filter_size  : int
Number of items that can be stored in a partition of the page exists filter bundle
$num_urls_ram  : int
Number of entries the priority queue used by this crawl queue bundle can store in RAM
$robot_cache  : array<string|int, mixed>
RAM cache of recent robot.txt stuff crawlHash(host) => robot.txt info
$robot_cache_times  : array<string|int, mixed>
Time when the cached robot.txt info for a host was made, in the format crawlHash(host) => timestamp
$robot_notify_hosts  : array<string|int, mixed>
Array of hosts for which a robots.txt file has just been received and processed, but whose urls are still waiting to be notified for queueing.
$robot_table  : LinearHashTable
LinearHashTable used to store robots.txt information (such as capture time, crawl delay, and allowed/disallowed paths) for hosts
$url_exists_filter_bundle  : object
BloomFilter used to keep track of which urls we've already seen
__construct()  : mixed
Makes a CrawlQueueBundle with the provided parameters
addCrawlDelayedHosts()  : mixed
For a timestamp $schedule_time of a fetch batch of urls to be downloaded, and for a list of crawl-delayed hosts in that batch, adds the hosts to a $schedule_time file in the CrawlDelayedHosts queue so they can be notified when that fetch batch is done processing. Until notified, any url from one of these crawl-delayed hosts will be rescheduled rather than put in a fetch batch for download.
addDNSCache()  : mixed
Adds an entry to this queue bundle's DNS cache
addSeenUrlFilter()  : mixed
Adds the supplied url to the url_exists_filter_bundle
addSendFetcherQueue()  : mixed
Adds an array of url tuples to the queue of urls about to be scheduled into fetch batches to be downloaded by fetchers. This queue consists of tiers. Url tuples are sorted into a tier based on the number of urls that have been downloaded for that url's host and their weight.
addUrlsDirectory()  : mixed
Adds the url info (such as url tuples (url, weight, referer)) to the appropriate file in a subfolder of the folder $dir used to implement a CrawlQueueBundle queue. If $timestamp is 0, then data is stored in $dir/current day time stamp/last_file_in_folder.txt.gz. If the last file exceeds 1MB, a new last file is started. If $timestamp > 0, then data is stored in $dir/$timestamp's day time stamp/$timestamp.txt.gz
addWaitRobotQueue()  : mixed
Adds an array of url tuples to the queue of urls waiting for robots.txt files to be received. This queue consists of a folder CrawlQueueBundle::ROBOT_WAIT_FOLDER whose subfolders are named by the hash of a host name for which a robots.txt file has not yet been received.
checkRobotOkay()  : bool
Checks if the given $url is allowed to be crawled based on stored robots.txt info.
chooseFetchBatchQueueFolder()  : string
Returns the path to the send-fetcher-queue tier to use to make the next fetch batch of urls to download.
computeTierUrl()  : int
Used to compute which send-fetcher-queue tier a url should be added to, based on the data related to the url in $url_tuple, its company level domain data, and the crawl order being used
differenceSeenUrls()  : mixed
Removes all url objects from $url_array which have been seen
dnsLookup()  : string
Used to lookup an entry in the DNS cache
emptyDNSCache()  : string
Delete the Hash table used to store DNS lookup info.
emptyUrlFilter()  : mixed
Empty the crawled url filter for this web queue bundle; resets the timestamp of the last time this filter was emptied.
getDayFolders()  : array<string|int, mixed>
Returns an array of all the days folders for a crawl queue.
getDnsAge()  : int
Gets the timestamp of the oldest dns address still stored in the queue bundle
getRobotData()  : array<string|int, mixed>
For a provided hostname, returns the robots.txt information stored in the robot table: [HOSTNAME, CAPTURE_TIME, CRAWL_DELAY, ROBOT_PATHS => [ALLOWED_SITES, DISALLOWED_SITES], FLAGS (noting whether the crawler should wait for notification that a schedule being downloaded has finished before continuing to crawl the site)].
getUrlFilterAge()  : int
Gets the timestamp of the oldest url filter data still stored in the queue bundle
getUrlsFileContents()  : array<string|int, mixed>
Returns the unserialized contents of a url info file after decompression.
getUrlsFiles()  : array<string|int, mixed>
Returns an array of all the url info files in a queue subfolder of a queue for a CrawlQueueBundle. Url info files are usually stored in a file with a nine digit number followed by the queue's file extension (usually .txt.gz) and store up to 1MB of compressed url info.
gotRobotTxtTime()  : int|bool
Returns the timestamp of the last time a host's robots.txt file was downloaded
isResumable()  : mixed
Checks whether a crawl can be resumed from the CrawlQueueBundle stored in a given directory
notifyCrawlDelayedHosts()  : mixed
For each host in the crawl-delayed hosts queue waiting on the fetch batch schedule with timestamp $timestamp, clear their FLAGS variable in the robot table so that urls from these hosts are allowed to be scheduled into future fetch batches for download.
processReceivedRobotTxtUrls()  : mixed
Moves urls that are in the waiting hosts folder for hosts listed in $this->robot_notify_hosts into the url queue; membership of a host in $this->robot_notify_hosts indicates that a robots.txt file has just been received for that host.
processWaitingHostFile()  : mixed
Used by @see notifyCrawlDelayedHosts($timestamp).
putUrlsFileContents()  : mixed
Serializes and compresses the url info (such as url tuples (url, weight, referer)) provided in $url_data and saves the results into $file_name
updateCompanyLevelDomainData()  : int
Computes an update to the company level domain data provided in cld_data, updating the WEIGHTED_SEEN_URLS and WEIGHTED_INCOMING_URLS fields according to information about a discovered url in $url_tuple

Constants

CRAWL_DELAYED_FOLDER

Name of the folder used for the queue of crawl-delayed hosts waiting on fetch batches to complete

public mixed CRAWL_DELAYED_FOLDER = "CrawlDelayedHosts"

HASH_KEY_SIZE

Number of bytes in a hash table key

public mixed HASH_KEY_SIZE = 8

IP_SIZE

Length of an IPv6 ip address (IPv4 addresses are padded)

public mixed IP_SIZE = 16

MAX_URL_BUFFER_BEFORE_WRITE

When writing urls to robot_table, how many to buffer at a time and then bulk put.

public mixed MAX_URL_BUFFER_BEFORE_WRITE = 500

MAX_URL_FILE_SIZE

Maximum number of bytes of compressed url info to store in a single url file before a new file is started (roughly 1MB)

public mixed MAX_URL_FILE_SIZE = 1000000

ROBOT_WAIT_FOLDER

Name of the folder holding urls waiting for their hosts' robots.txt files to be received

public mixed ROBOT_WAIT_FOLDER = "WaitRobotUrls"

URL_FILES_EXTENSION

File extension used for files of serialized url data

public mixed URL_FILES_EXTENSION = ".txt.gz"

URL_QUEUE_FOLDER

Name of the folder holding the queue of urls waiting to be scheduled into fetch batches

public mixed URL_QUEUE_FOLDER = "UrlQueue"

Properties

$dir_name

The folder name of this CrawlQueueBundle

public string $dir_name

$dns_table

Host-IP table used for DNS look-ups; it comes from robots.txt data and is deleted with the same frequency

public object $dns_table

$domain_table

LinearHashTable of information about company level domains that have been crawled. Information includes number of SEEN_URLS, number of WEIGHTED_SEEN_URLS, number of WEIGHTED_INCOMING_URLS.

public LinearHashTable $domain_table

(A company level domain is google.com or google.co.uk, but not fo.la.google.com, www.google.com, foo.google.com or foo.google.co.uk)

$filter_size

Number of items that can be stored in a partition of the page exists filter bundle

public int $filter_size

$num_urls_ram

Number of entries the priority queue used by this crawl queue bundle can store in RAM

public int $num_urls_ram

$robot_cache

RAM cache of recent robot.txt stuff crawlHash(host) => robot.txt info

public array<string|int, mixed> $robot_cache = []

$robot_cache_times

Time when the cached robot.txt info for a host was made, in the format crawlHash(host) => timestamp

public array<string|int, mixed> $robot_cache_times = []

$robot_notify_hosts

Array of hosts for which a robots.txt file has just been received and processed, but whose urls are still waiting to be notified for queueing.

public array<string|int, mixed> $robot_notify_hosts

$url_exists_filter_bundle

BloomFilter used to keep track of which urls we've already seen

public object $url_exists_filter_bundle

Methods

__construct()

Makes a CrawlQueueBundle with the provided parameters

public __construct(string $dir_name, int $filter_size, int $num_urls_ram) : mixed
Parameters
$dir_name : string

folder name used by this CrawlQueueBundle

$filter_size : int

size of each partition in the page exists BloomFilterBundle

$num_urls_ram : int

number of entries in ram for the priority queue

Return values
mixed
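
A minimal construction sketch (not code from the Yioop sources): it assumes the class has been loaded via Yioop's autoloader and that the directory below is writable; the folder name and size values are made up for illustration.

    // Hypothetical setup values for illustration only
    $bundle_dir = "/tmp/TestCrawlQueueBundle"; // illustrative folder name
    $filter_size = 1000000;  // items per page-exists BloomFilterBundle partition
    $num_urls_ram = 300000;  // entries the in-RAM priority queue can hold
    $queue_bundle = new CrawlQueueBundle($bundle_dir, $filter_size,
        $num_urls_ram);

Later sketches in this page reuse this $queue_bundle variable.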

addCrawlDelayedHosts()

For a timestamp $schedule_time of a fetch batch of urls to be downloaded, and for a list of crawl-delayed hosts in that batch, adds the hosts to a $schedule_time file in the CrawlDelayedHosts queue so they can be notified when that fetch batch is done processing. Until notified, any url from one of these crawl-delayed hosts will be rescheduled rather than put in a fetch batch for download.

public addCrawlDelayedHosts(mixed $schedule_time, array<string|int, mixed> $host_urls) : mixed
Parameters
$schedule_time : mixed
$host_urls : array<string|int, mixed>

array of urls for hosts that are crawl delayed and for which there is a schedule currently running on fetchers which might download from that host

Return values
mixed

addDNSCache()

Adds an entry to this queue bundle's DNS cache

public addDNSCache(string $host, string $ip_address) : mixed
Parameters
$host : string

hostname to add to DNS Lookup table

$ip_address : string

in presentation format (not as int) to add to table

Return values
mixed
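
A hedged round-trip sketch, reusing the $queue_bundle from the construction example under __construct(); the return value convention for a cache miss is an assumption noted in the comment.

    // Cache the resolved address for a host in presentation format
    $queue_bundle->addDNSCache("www.example.com", "93.184.216.34");
    // ... later, check the cache before doing a fresh DNS lookup
    $ip = $queue_bundle->dnsLookup("www.example.com");
    if (!empty($ip)) { // assumes an empty/false value signals a cache miss
        echo "Cached address for www.example.com is $ip\n";
    }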

addSeenUrlFilter()

Adds the supplied url to the url_exists_filter_bundle

public addSeenUrlFilter(string $url) : mixed
Parameters
$url : string

url to add

Return values
mixed

addSendFetcherQueue()

Adds an array of url tuples to the queue of urls about to be scheduled into fetch batches to be downloaded by fetchers. This queue consists of tiers. Url tuples are sorted into a tier based on the number of urls that have been downloaded for that url's host and their weight.

public addSendFetcherQueue(array<string|int, mixed> $url_tuples, string $crawl_order) : mixed

Naively, without weight, a url goes into tier floor(log(# of urls downloaded already for its host)). Within a tier, urls are stored in folders by day received and then into a file from a sequence of files according to the order received. Each file in the sequence can store up to 1MB of compressed url tuples.

Parameters
$url_tuples : array<string|int, mixed>

array of tuples of the form (url, weight, referer)

$crawl_order : string

one of CrawlConstants::BREADTH_FIRST or CrawlConstants::HOST_BUDGETING

Return values
mixed
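
The naive tier rule quoted above can be illustrated with a short sketch (illustrative only: the real tier computation in computeTierUrl() also uses url weights, company level domain data, and the crawl order, and the base of the logarithm is not specified here; a natural log is assumed).

    // Naive tier for a host from which $num_downloaded urls have already
    // been downloaded: floor(log(# downloaded)), natural log assumed.
    function naiveTier(int $num_downloaded): int
    {
        return ($num_downloaded <= 1) ? 0 : (int)floor(log($num_downloaded));
    }
    // e.g., naiveTier(1) == 0, naiveTier(100) == 4, naiveTier(10000) == 9
    // Queueing a (url, weight, referer) tuple under host budgeting might look like:
    $queue_bundle->addSendFetcherQueue(
        [["https://www.example.com/", 1.0, "https://www.example.com/index.php"]],
        CrawlConstants::HOST_BUDGETING);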

addUrlsDirectory()

Adds the url info (such as url tuples (url, weight, referer)) to the appropriate file in a subfolder of the folder $dir used to implement a CrawlQueueBundle queue. If $timestamp is 0, then data is stored in $dir/current day time stamp/last_file_in_folder.txt.gz. If the last file exceeds 1MB, a new last file is started. If $timestamp > 0, then data is stored in $dir/$timestamp's day time stamp/$timestamp.txt.gz

public addUrlsDirectory(string $dir, array<string|int, mixed> $url_info, int $timestamp) : mixed
Parameters
$dir : string

folder to store data into a subfolder of

$url_info : array<string|int, mixed>

information to serialize, compress, and store

$timestamp : int

to use during storage to determine path as described above

Return values
mixed

addWaitRobotQueue()

Adds an array of url tuples to the queue of urls waiting for robots.txt files to be received. This queue consists of a folder CrawlQueueBundle::ROBOT_WAIT_FOLDER whose subfolders are named by the hash of a host name for which a robots.txt file has not yet been received.

public addWaitRobotQueue(array<string|int, mixed> $url_tuples) : mixed

The url tuples are sorted into the appropriate host subfolder and stored in subfolders by the day received, and then into a file in a sequence of files according to the order received. Each file in the sequence can store up to 1MB of compressed url tuples.

Parameters
$url_tuples : array<string|int, mixed>

array of tuples of the form (url, weight, referer)

Return values
mixed

checkRobotOkay()

Checks if the given $url is allowed to be crawled based on stored robots.txt info.

public checkRobotOkay(string $url) : bool
Parameters
$url : string

to check

Return values
bool

whether it was allowed or not
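
A hedged sketch of how a caller might gate urls on robots.txt data (reusing $queue_bundle; the tuple passed to addWaitRobotQueue() is assumed to be of the (url, weight, referer) form per that method's description, and the exact control flow inside Yioop's queue server may differ).

    $url = "https://www.example.com/some/page.html";
    $host = "https://www.example.com"; // no trailing slash, per getRobotData()
    if ($queue_bundle->gotRobotTxtTime($host) === false) {
        // no robots.txt captured yet: let the url wait for one
        $queue_bundle->addWaitRobotQueue([[$url, 1.0, ""]]);
    } else if ($queue_bundle->checkRobotOkay($url)) {
        // allowed by the stored robots.txt rules; safe to schedule for download
    } else {
        // disallowed by robots.txt; drop the url
    }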

chooseFetchBatchQueueFolder()

Returns the path to the send-fetcher-queue tier to use to make the next fetch batch of urls to download.

public chooseFetchBatchQueueFolder(string $crawl_order) : string
Parameters
$crawl_order : string

one of CrawlConstants::BREADTH_FIRST or CrawlConstants::HOST_BUDGETING

Return values
string

path to send-fetcher-queue tier

computeTierUrl()

Used to compute which send-fetcher-queue tier a url should be added to, based on the data related to the url in $url_tuple, its company level domain data, and the crawl order being used

public computeTierUrl(array<string|int, mixed> $url_tuple, array<string|int, mixed> $cld_data, string $crawl_order) : int
Parameters
$url_tuple : array<string|int, mixed>

5-tuple containing a url, its weight, the depth in the crawl where it was found, the url that referred to it, and that url's weight

$cld_data : array<string|int, mixed>
$crawl_order : string

one of CrawlConstants::BREADTH_FIRST or CrawlConstants::HOST_BUDGETING

Return values
int

tier $url should be queued into

differenceSeenUrls()

Removes all url objects from $url_array which have been seen

public differenceSeenUrls(array<string|int, mixed> &$url_array[, array<string|int, mixed> $field_names = null ]) : mixed
Parameters
$url_array : array<string|int, mixed>

objects to check whether they have been seen

$field_names : array<string|int, mixed> = null

an array of components of a url_array element which contain a url to check if seen. If null, assumes $url_array is just an array of urls, not an array of url infos (i.e., an array of arrays), and directly checks those strings

Return values
mixed
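
An illustrative dedup round trip (a sketch reusing $queue_bundle; plain url strings are passed, so $field_names is left at its null default).

    // Record that a url has been seen, then filter it from a later candidate list
    $queue_bundle->addSeenUrlFilter("https://www.example.com/a.html");
    $candidates = [
        "https://www.example.com/a.html", // already seen, should be removed
        "https://www.example.com/b.html", // not seen, should remain
    ];
    $queue_bundle->differenceSeenUrls($candidates);
    // $candidates now contains only the unseen url(s)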

dnsLookup()

Used to lookup an entry in the DNS cache

public dnsLookup(string $host) : string
Parameters
$host : string

hostname to look up in the DNS cache

Return values
string

ipv4 or ipv6 address written as a string

emptyDNSCache()

Delete the Hash table used to store DNS lookup info.

public emptyDNSCache() : string

Then constructs an empty new one. This is called roughly once a day, at the same time as emptyRobotFilters().

Tags
see
emptyRobotFilters()
Return values
string

$message with what happened during empty process

emptyUrlFilter()

Empty the crawled url filter for this web queue bundle; resets the timestamp of the last time this filter was emptied.

public emptyUrlFilter() : mixed
Return values
mixed

getDayFolders()

Returns an array of all the days folders for a crawl queue.

public getDayFolders(string $dir) : array<string|int, mixed>

By design, queues in a CrawlQueueBundle consist of a sequence of subfolders with day timestamps (floor(unixstamp/86400)), and then files within these folders. This function returns a list of the day folder paths for such a queue. Note this function assumes there aren't so many day folders that the returned list exceeds memory; if a crawl runs for at most a few years, this should be the case.

Parameters
$dir : string

folder which is acting as a CrawlQueueBundle queue

Return values
array<string|int, mixed>

of paths to day folders
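
The day-stamp convention above can be illustrated as follows (a sketch reusing $queue_bundle and $bundle_dir from the __construct() example; the UrlQueue sub-path is an assumption built from URL_QUEUE_FOLDER, not a path copied from the sources).

    // Day stamp = number of whole days since the Unix epoch
    $day_stamp = floor(time() / 86400); // e.g., 20000 corresponds to Oct 2024
    // Hypothetical queue layout: <bundle dir>/UrlQueue/<day stamp>/<url files>
    $queue_dir = $bundle_dir . "/" . CrawlQueueBundle::URL_QUEUE_FOLDER;
    foreach ($queue_bundle->getDayFolders($queue_dir) as $day_folder) {
        echo basename($day_folder), "\n"; // prints the day stamps found
    }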

getDnsAge()

Gets the timestamp of the oldest dns address still stored in the queue bundle

public getDnsAge() : int
Return values
int

a Unix timestamp

getRobotData()

For a provided hostname, returns the robots.txt information stored in the robot table: [HOSTNAME, CAPTURE_TIME, CRAWL_DELAY, ROBOT_PATHS => [ALLOWED_SITES, DISALLOWED_SITES], FLAGS (noting whether the crawler should wait for notification that a schedule being downloaded has finished before continuing to crawl the site)].

public getRobotData(string $host) : array<string|int, mixed>
Parameters
$host : string

hostname to look up robots.txt info for. (No trailing / in hostname; i.e., https://www.yahoo.com, not https://www.yahoo.com/)

Return values
array<string|int, mixed>

robot table row as described above
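
A sketch of reading back stored robots.txt data (reusing $queue_bundle; treating the field names listed above as plain array keys is an assumption about the row format).

    $robot_row = $queue_bundle->getRobotData("https://www.example.com");
    if (!empty($robot_row)) {
        // keys below are assumed to match the field list above
        $capture_time = $robot_row["CAPTURE_TIME"] ?? 0;
        $crawl_delay = $robot_row["CRAWL_DELAY"] ?? 0;
        echo "robots.txt captured at $capture_time; crawl delay $crawl_delay\n";
    }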

getUrlFilterAge()

Gets the timestamp of the oldest url filter data still stored in the queue bundle

public getUrlFilterAge() : int
Return values
int

a Unix timestamp

getUrlsFileContents()

Returns the unserialized contents of a url info file after decompression.

public getUrlsFileContents(string $file_name) : array<string|int, mixed>

Assumes the resulting structure is small enough to fit in memory

Parameters
$file_name : string

name of url info file

Return values
array<string|int, mixed>

of uncompressed, unserialized contents of this file.

getUrlsFiles()

Returns an array of all the url info files in a queue subfolder of a queue for a CrawlQueueBundle. Url info files are usually stored in a file with a nine digit number followed by the queue's file extension (usually .txt.gz) and store up to 1MB of compressed url info.

public getUrlsFiles(string $dir) : array<string|int, mixed>

This function assumes the paths to the url info files in the provided folder can fit in memory

Parameters
$dir : string

folder containing url info files

Return values
array<string|int, mixed>

of paths to each url info file found.
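
The nine-digit naming scheme suggests file names like the one sketched below (illustrative; the $day_folder path is hypothetical and $queue_bundle is the one constructed earlier).

    // Illustrative name of the 42nd url info file in a day folder:
    $example_name = sprintf("%09d", 42) . CrawlQueueBundle::URL_FILES_EXTENSION;
    // => "000000042.txt.gz"
    $day_folder = "/tmp/TestCrawlQueueBundle/UrlQueue/" . floor(time() / 86400);
    foreach ($queue_bundle->getUrlsFiles($day_folder) as $url_file) {
        $url_infos = $queue_bundle->getUrlsFileContents($url_file);
        // ... each file holds up to ~1MB of compressed url info
    }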

gotRobotTxtTime()

Returns the timestamp of the last time a host's robots.txt file was downloaded

public gotRobotTxtTime(string $host) : int|bool
Parameters
$host : string

host to check

Return values
int|bool

returns false if no capture of robots.txt yet, otherwise returns an integer timestamp

isResumable()

Checks whether a crawl can be resumed from the CrawlQueueBundle stored in the provided directory.

public static isResumable(mixed $queue_bundle_dir) : mixed
Parameters
$queue_bundle_dir : mixed
Return values
mixed

notifyCrawlDelayedHosts()

For each host in the crawl-delayed hosts queue waiting on the fetch batch schedule with timestamp $timestamp, clear their FLAGS variable in the robot table so that urls from these hosts are allowed to be scheduled into future fetch batches for download.

public notifyCrawlDelayedHosts(int $timestamp) : mixed
Parameters
$timestamp : int

of a fetch batch schedule to notify crawl-delayed hosts that it has completed download.

Return values
mixed
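
A hedged sketch of the crawl-delay bookkeeping around one fetch batch (reusing $queue_bundle; the timestamp and url below are made up for illustration).

    $schedule_time = time(); // illustrative fetch batch timestamp
    // hosts in the batch that are crawl delayed; their urls must wait until
    // the batch finishes downloading before being rescheduled
    $queue_bundle->addCrawlDelayedHosts($schedule_time,
        ["https://slow.example.org/"]);
    // ... fetchers download the batch ...
    // once the batch with this timestamp completes, clear the wait flags
    $queue_bundle->notifyCrawlDelayedHosts($schedule_time);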

processReceivedRobotTxtUrls()

Moves urls that are in the waiting hosts folder for hosts listed in $this->robot_notify_hosts into the url queue; membership of a host in $this->robot_notify_hosts indicates that a robots.txt file has just been received for that host.

public processReceivedRobotTxtUrls(string $crawl_order) : mixed
Parameters
$crawl_order : string

one of CrawlConstants::BREADTH_FIRST or CrawlConstants::HOST_BUDGETING

Return values
mixed
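
A sketch of the hand-off from the robot wait queue into the url queue (reusing $queue_bundle; when exactly the queue server issues this call is not shown in this reference).

    // urls whose hosts have no robots.txt yet go into the robot wait queue ...
    $queue_bundle->addWaitRobotQueue([
        ["https://www.example.net/page.html", 1.0, ""],
    ]);
    // ... once robots.txt files arrive, their hosts appear in
    // $queue_bundle->robot_notify_hosts and the waiting urls can be released
    $queue_bundle->processReceivedRobotTxtUrls(CrawlConstants::HOST_BUDGETING);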

processWaitingHostFile()

Used by @see notifyCrawlDelayedHosts($timestamp).

public processWaitingHostFile(string $file_name, mixed $robot_rows) : mixed

For each host listed in the file $file_name, get its robot info from robot_table, clear its FLAGS column, and store the update into a temporary array $robot_rows. Every MAX_URL_BUFFER_BEFORE_WRITE many such hosts, write the updates in $robot_rows back to the robot_table on disk. Any modified rows that have not yet been written to disk when the file has been completely processed are returned in $robot_rows.

Parameters
$file_name : string

file to get hosts to clear flag columns of

$robot_rows : mixed

rows of updated hosts, potentially from a previously processed file

Return values
mixed

leftover updated robot host rows that haven't been written to disk yet
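
The buffer-then-bulk-write pattern described above looks roughly like the generic sketch below; bulkWriteRobotRows(), $hosts_in_file, and the FLAGS key are stand-in names for illustration, not Yioop APIs.

    // stand-in for a bulk put of updated rows into the robot table
    function bulkWriteRobotRows(array $rows): void { /* hypothetical */ }
    $hosts_in_file = ["https://a.example.com", "https://b.example.com"]; // assumed input
    $robot_rows = [];
    foreach ($hosts_in_file as $host) {
        $row = $queue_bundle->getRobotData($host);
        $row["FLAGS"] = 0; // clear the wait flag (key name assumed)
        $robot_rows[] = $row;
        if (count($robot_rows) >= CrawlQueueBundle::MAX_URL_BUFFER_BEFORE_WRITE) {
            bulkWriteRobotRows($robot_rows); // write a full buffer, then reset it
            $robot_rows = [];
        }
    }
    // any rows still buffered here are the "leftover" rows returned to the caller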

putUrlsFileContents()

Serializes and compresses the url info (such as url tuples (url, weight, referer)) provided in $url_data and saves the results into $file_name

public putUrlsFileContents(string $file_name, array<string|int, mixed> $url_data) : mixed
Parameters
$file_name : string

name of file to store url info into

$url_data : array<string|int, mixed>

data to be serialized, compressed, and stored.

Return values
mixed
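
A round-trip sketch (reusing $queue_bundle; the file path and tuple contents are illustrative).

    $file_name = "/tmp/000000001" . CrawlQueueBundle::URL_FILES_EXTENSION;
    $url_data = [
        ["https://www.example.com/", 1.0, ""], // assumed (url, weight, referer) form
    ];
    $queue_bundle->putUrlsFileContents($file_name, $url_data);
    $read_back = $queue_bundle->getUrlsFileContents($file_name);
    // $read_back should equal $url_data after decompression and unserialization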

updateCompanyLevelDomainData()

Computes an update to the company level domain data provided in cld_data, updating the WEIGHTED_SEEN_URLS and WEIGHTED_INCOMING_URLS fields according to information about a discovered url in $url_tuple

public updateCompanyLevelDomainData(array<string|int, mixed> $url_tuple, array<string|int, mixed> $cld_data, string $crawl_order) : int
Parameters
$url_tuple : array<string|int, mixed>

5-tuple containing a url, its weight, the depth in the crawl where it was found, the url that referred to it, and that url's weight

$cld_data : array<string|int, mixed>

company level domain data to update

$crawl_order : string

one of CrawlConstants::BREADTH_FIRST or CrawlConstants::HOST_BUDGETING

Return values
int

tier $url should be queued into


        
