DoubleIndexBundle
in package
implements
CrawlConstants
A DoubleIndexBundle encapsulates and provides methods for two IndexDocumentBundles used to store a repeating crawl. One of these bundles is used to handle current search queries, while the other is used to store an ongoing crawl; once the crawl time has been reached, the roles of the two bundles are swapped
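For orientation, here is a minimal sketch of opening such a bundle. The namespace import, folder path, description, and parameter values are illustrative assumptions, not taken from this page:

    <?php
    // Assumed namespace for Yioop library classes; adjust to your install.
    use seekquarry\yioop\library\DoubleIndexBundle;

    // Open read-only, e.g. on the query side (hypothetical folder path):
    $query_bundle = new DoubleIndexBundle("/crawls/MyRepeatingCrawl");

    // Open read-write for crawling, swapping roles every hour:
    $crawl_bundle = new DoubleIndexBundle("/crawls/MyRepeatingCrawl", false,
        "my repeating crawl", 50000 /* illustrative partition size */, 3600);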
Interfaces, Classes, Traits and Enums
- CrawlConstants
- Shared constants and enums used by components that are involved in the crawling process
Table of Contents
- $active_archive : IndexDocumentBundle
- The internal IndexDocumentBundle which is active
- $active_archive_num : int
- The number of the internal IndexDocumentBundle which is active
- $description : string
- A short text name for this DoubleIndexBundle
- $num_docs_per_partition : int
- Number of docs before a new generation is started for an IndexDocumentBundle in this DoubleIndexBundle
- $repeat_frequency : int
- How frequently the live and ongoing archives should be swapped, in seconds
- $repeat_time : int
- Last time live and ongoing archives were switched
- $swap_count : int
- The number of times live and ongoing archives have swapped
- __construct() : mixed
- Makes or initializes a DoubleIndexBundle with the provided parameters
- addPages() : mixed
- Adds the array of $pages to the active IndexDocumentBundle.
- computeDocId() : mixed
- Given a $site array of information about a web page/document, uses the CrawlConstants::URL and CrawlConstants::HASH fields to compute a unique doc id for the array.
- forceSave() : mixed
- Forces the current shard to be saved
- getArchiveInfo() : array<string|int, mixed>
- Gets information about a DoubleIndexBundle out of its status.txt file
- getCachePage() : array<string|int, mixed>
- Returns a full page cache (usually the web page downloaded as opposed to a summary of the web page) associated with a supplied key.
- getParamModifiedTime() : mixed
- Returns the last time the archive info of the bundle was modified.
- getStartSchedule() : mixed
- The start schedule is the first schedule a queue server makes when a crawl is just started. To facilitate switching between IndexDocumentBundles when doing a crawl with a DoubleIndexBundle, this start schedule is stored in the DoubleIndexBundle. When the IndexDocumentBundles' roles (query and crawl) are swapped, this method copies the start schedule from the DoubleIndexBundle to the schedule folder to restart the crawl
- getSummary() : array<string|int, mixed>
- Returns a document summary from the active archive associated with the supplied key
- setArchiveInfo() : mixed
- Sets the archive info struct for the index archive and web archive bundles associated with this double index bundle. This struct has fields like: DESCRIPTION (serialized store of global parameters of the crawl, such as seed sites, timestamp, etc.), COUNT (num urls seen + pages seen stored for the index archive in use for crawling), VISITED_URLS_COUNT (number of pages seen for the index archive in use for crawling), QUERY_COUNT (num urls seen + pages seen stored for the index archive in use for querying, not crawling), QUERY_VISITED_URLS_COUNT (number of pages seen for the index archive in use for querying, not crawling), NUM_DOCS_PER_PARTITION (how many docs per web archive in the bundle).
- setStartSchedule() : mixed
- The start schedule is the first schedule a queue server makes when a crawl is just started. To facilitate switching between IndexDocumentBundles when doing a crawl with a DoubleIndexBundle, this start schedule is stored in the DoubleIndexBundle; when the IndexDocumentBundles' roles (query and crawl) are swapped, the DoubleIndexBundle copy is used to start the crawl from the beginning again. This method copies the start schedule from the schedule folder to the DoubleIndexBundle at the start of a crawl for later use in this swapping
- stopIndexing() : mixed
- Used when a crawl stops to perform final dictionary operations to produce a working stand-alone index.
- swapActiveBundle() : mixed
- Switches which of the two bundles is the one to which new index data will be written. Before switching, closes the old bundle properly.
- swapTimeReached() : bool
- Checks if the amount of time since the roles of the two IndexDocumentBundles in this DoubleIndexBundle were last swapped has exceeded the swap time for this bundle.
- updateDictionary() : mixed
- Checks if there is enough data in the active partition of the active archive to warrant storing in the dictionary; if so, builds an inverted index for the active partition of the active archive and adds the postings to the dictionary
Properties
$active_archive
The internal IndexDocumentBundle which is active
public
IndexDocumentBundle
$active_archive
$active_archive_num
The number of the internal IndexDocumentBundle which is active
public
int
$active_archive_num
$description
A short text name for this DoubleIndexBundle
public
string
$description
$num_docs_per_partition
Number of docs before a new generation is started for an IndexDocumentBundle in this DoubleIndexBundle
public
int
$num_docs_per_partition
$repeat_frequency
How frequently the live and ongoing archives should be swapped, in seconds
public
int
$repeat_frequency
$repeat_time
Last time live and ongoing archives were switched
public
int
$repeat_time
$swap_count
The number of times live and ongoing archives have swapped
public
int
$swap_count
Methods
__construct()
Makes or initializes a DoubleIndexBundle with the provided parameters
public
__construct(string $dir_name[, bool $read_only_archive = true ][, string $description = null ][, int $num_docs_per_partition = C\NUM_DOCS_PER_PARTITION ][, int $repeat_frequency = 3600 ]) : mixed
Parameters
- $dir_name : string
- folder name to store this bundle
- $read_only_archive : bool = true
- whether to open archive only for reading or reading and writing
- $description : string = null
- a text name/serialized info about this IndexDocumentBundle
- $num_docs_per_partition : int = C\NUM_DOCS_PER_PARTITION
- the number of pages to be stored in a single shard
- $repeat_frequency : int = 3600
- how often the crawl should be redone in seconds (has no effect if $read_only_archive is true)
Return values
mixed
addPages()
Adds the array of $pages to the active IndexDocumentBundle.
public
addPages(array<string|int, mixed> &$pages, int $visited_urls_count) : mixed
Parameters
- $pages : array<string|int, mixed>
- data to store
- $visited_urls_count : int
- number to add to the count of visited urls (visited urls is a smaller number than the total count of objects stored in the index).
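A hedged sketch of a call, assuming $bundle is a writable DoubleIndexBundle as in the earlier sketch; the page arrays below are fabricated, and real summaries carry many more fields:

    use seekquarry\yioop\library\CrawlConstants;

    // Two fabricated page summaries keyed by CrawlConstants field names.
    $pages = [
        [CrawlConstants::URL => "https://example.com/",
            CrawlConstants::HASH => "deadbeef01"],
        [CrawlConstants::URL => "https://example.com/about",
            CrawlConstants::HASH => "deadbeef02"],
    ];
    // Both urls were actually downloaded, so the visited count grows by 2.
    $bundle->addPages($pages, 2);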
Return values
mixed
computeDocId()
Given a $site array of information about a web page/document, uses the CrawlConstants::URL and CrawlConstants::HASH fields to compute a unique doc id for the array.
public
computeDocId(array<string|int, mixed> $site) : mixed
Parameters
- $site : array<string|int, mixed>
- site to compute doc_id for
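As a sketch, assuming $bundle as above; only the two fields the description says are read are shown, with fabricated values:

    use seekquarry\yioop\library\CrawlConstants;

    $site = [
        CrawlConstants::URL => "https://example.com/page",
        CrawlConstants::HASH => "deadbeef03", // hash of the page content
    ];
    $doc_id = $bundle->computeDocId($site); // id derived from url + hash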
Return values
mixed
forceSave()
Forces the current shard to be saved
public
forceSave() : mixed
Return values
mixed
getArchiveInfo()
Gets information about a DoubleIndexBundle out of its status.txt file
public
static getArchiveInfo(string $dir_name) : array<string|int, mixed>
Parameters
- $dir_name : string
- folder name of the DoubleIndexBundle to get info for
Return values
array<string|int, mixed> — containing the name (description) of the DoubleIndexBundle, the number of items stored in it, and the number of WebArchive file partitions it uses.
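For example, a sketch of reading this info statically; the folder path is hypothetical and the field names echo those listed under setArchiveInfo() below:

    $info = DoubleIndexBundle::getArchiveInfo("/crawls/MyRepeatingCrawl");
    echo $info["DESCRIPTION"] ?? ""; // name/serialized params of the bundle
    echo $info["COUNT"] ?? 0;        // items stored in the crawl-side index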
getCachePage()
Returns a full page cache (usually the web page downloaded as opposed to a summary of the web page) associated with a supplied key.
public
getCachePage(string $doc_key) : array<string|int, mixed>
Parameters
- $doc_key : string
- key (usually based on the url of where the document came from) associated with the document whose cache is wanted
Return values
array<string|int, mixed> — desired cache
getParamModifiedTime()
Returns the last time the archive info of the bundle was modified.
public
static getParamModifiedTime(string $dir_name) : mixed
Parameters
- $dir_name : string
- folder with archive bundle
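A sketch of how a process might poll for changes using this timestamp (folder path hypothetical):

    $dir = "/crawls/MyRepeatingCrawl";
    $last_seen = DoubleIndexBundle::getParamModifiedTime($dir);
    // ... later, re-read the archive info only if someone has changed it ...
    if (DoubleIndexBundle::getParamModifiedTime($dir) > $last_seen) {
        $info = DoubleIndexBundle::getArchiveInfo($dir);
        $last_seen = DoubleIndexBundle::getParamModifiedTime($dir);
    }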
Return values
mixed
getStartSchedule()
The start schedule is the first schedule a queue server makes when a crawl is just started. To facilitate switching between IndexDocumentBundles when doing a crawl with a DoubleIndexBundle, this start schedule is stored in the DoubleIndexBundle. When the IndexDocumentBundles' roles (query and crawl) are swapped, this method copies the start schedule from the DoubleIndexBundle to the schedule folder to restart the crawl
public
static getStartSchedule(string $dir_name, int $channel) : mixed
Parameters
- $dir_name : string
- folder in the bundle where the schedule is stored
- $channel : int
- channel that is being used to do the current double index crawl. A typical Yioop instance might have several ongoing crawls, each with a different channel
Return values
mixed
getSummary()
Returns a document summary from the active archive associated with the supplied key
public
getSummary(string $doc_key) : array<string|int, mixed>
Parameters
- $doc_key : string
- key (usually based on the url of where the document came from) associated with the document whose summary is wanted
Return values
array<string|int, mixed> — desired summary
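A sketch combining this method with getCachePage() above, assuming $bundle as earlier; $doc_key is a hypothetical key derived from a document's url:

    $summary = $bundle->getSummary($doc_key);  // title, snippet, etc.
    $cache = $bundle->getCachePage($doc_key);  // full downloaded page, if kept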
setArchiveInfo()
Sets the archive info struct for the index archive and web archive bundles associated with this double index bundle. This struct has fields like: DESCRIPTION (serialized store of global parameters of the crawl, such as seed sites, timestamp, etc.), COUNT (num urls seen + pages seen stored for the index archive in use for crawling), VISITED_URLS_COUNT (number of pages seen for the index archive in use for crawling), QUERY_COUNT (num urls seen + pages seen stored for the index archive in use for querying, not crawling), QUERY_VISITED_URLS_COUNT (number of pages seen for the index archive in use for querying, not crawling), NUM_DOCS_PER_PARTITION (how many docs per web archive in the bundle).
public
static setArchiveInfo(string $dir_name, array<string|int, mixed> $info) : mixed
Parameters
- $dir_name : string
- folder with archive bundle
- $info : array<string|int, mixed>
- struct with above fields
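A sketch of a read-modify-write of this struct; the folder path and $new_crawl_params are hypothetical, and only DESCRIPTION is touched:

    $dir = "/crawls/MyRepeatingCrawl";
    $info = DoubleIndexBundle::getArchiveInfo($dir);
    $info["DESCRIPTION"] = serialize($new_crawl_params); // crawl params
    DoubleIndexBundle::setArchiveInfo($dir, $info);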
Return values
mixed
setStartSchedule()
The start schedule is the first schedule a queue server makes when a crawl is just started. To facilitate switching between IndexDocumentBundles when doing a crawl with a DoubleIndexBundle, this start schedule is stored in the DoubleIndexBundle; when the IndexDocumentBundles' roles (query and crawl) are swapped, the DoubleIndexBundle copy is used to start the crawl from the beginning again. This method copies the start schedule from the schedule folder to the DoubleIndexBundle at the start of a crawl for later use in this swapping
public
static setStartSchedule(string $dir_name, int $channel) : mixed
Parameters
- $dir_name : string
- folder in the bundle where the schedule should be stored
- $channel : int
- channel that is being used to do the current double index crawl. A typical Yioop instance might have several ongoing crawls, each with a different channel
Return values
mixed
stopIndexing()
Used when a crawl stops to perform final dictionary operations to produce a working stand-alone index.
public
stopIndexing() : mixed
Return values
mixed
swapActiveBundle()
Switches which of the two bundles is the one to which new index data will be written. Before switching, closes the old bundle properly.
public
swapActiveBundle() : mixed
Return values
mixed
swapTimeReached()
Checks if the amount of time since the roles of the two IndexDocumentBundles in this DoubleIndexBundle were last swapped has exceeded the swap time for this bundle.
public
swapTimeReached() : bool
Return values
bool — true if the swap time has been exceeded
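A sketch of the periodic check a queue-server-style loop might perform, assuming a writable $bundle as in the earlier sketches:

    if ($bundle->swapTimeReached()) {
        // Close the crawl-side bundle, then make it the query-side bundle
        // (and vice versa); see getStartSchedule() for restarting the crawl.
        $bundle->swapActiveBundle();
    }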
updateDictionary()
Checks if there is enough data in the active partition of the active archive to warrant storing in the dictionary; if so, builds an inverted index for the active partition of the active archive and adds the postings to the dictionary
public
updateDictionary([string $taking_too_long_touch = null ]) : mixed
Parameters
- $taking_too_long_touch : string = null
- name of a file to touch if checking the update takes longer than LOG_TIMEOUT. To prevent a crawl from stopping because nothing appears to be happening, the file usually supplied is C\SCHEDULES_DIR . "/{$this->channel}-" . self::crawl_status_file