Yioop_V9.5_Source_Code_Documentation

DoubleIndexBundle
in package
implements CrawlConstants

A DoubleIndexBundle encapsulates and provides methods for two IndexDocumentBundles used to store a repeating crawl. One of these bundles is used to handle current search queries, while the other is used to store an ongoing crawl; once the crawl time has been reached, the roles of the two bundles are swapped
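The query/crawl role swap described above can be pictured with a minimal sketch. The class and field names below are illustrative stand-ins, not Yioop's actual implementation: new pages always go to the active (crawl-role) bundle, queries are served from the other one, and a swap flips the roles and clears the bundle that will receive the next crawl.

```php
<?php
// Illustrative double-buffer sketch of the DoubleIndexBundle idea.
// ToyDoubleBundle and its members are hypothetical names.
class ToyDoubleBundle
{
    public $bundles = [[], []];   // two stand-ins for IndexDocumentBundles
    public $active_num = 0;       // which bundle currently receives crawl data

    // New crawl data always goes to the active (crawl-role) bundle
    public function addPages(array $pages)
    {
        foreach ($pages as $page) {
            $this->bundles[$this->active_num][] = $page;
        }
    }

    // Queries are served from the other (query-role) bundle
    public function querySource(): array
    {
        return $this->bundles[1 - $this->active_num];
    }

    // Swap roles: the finished crawl becomes the query index, and the old
    // query bundle is emptied to receive the next crawl
    public function swapActiveBundle()
    {
        $this->active_num = 1 - $this->active_num;
        $this->bundles[$this->active_num] = [];
    }
}
```

After a swap, pages added during the previous crawl cycle become the ones visible to queries, which is how a repeating crawl can serve search results while re-crawling.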

Tags
author

Chris Pollett

Interfaces, Classes, Traits and Enums

CrawlConstants
Shared constants and enums used by components that are involved in the crawling process

Table of Contents

$active_archive  : IndexDocumentBundle
The internal IndexDocumentBundle which is active
$active_archive_num  : int
The number of the internal IndexDocumentBundle which is active
$description  : string
A short text name for this DoubleIndexBundle
$num_docs_per_partition  : int
Number of docs before a new generation is started for an IndexDocumentBundle in this DoubleIndexBundle
$repeat_frequency  : int
How frequently, in seconds, the live and ongoing archives should be swapped
$repeat_time  : int
Last time live and ongoing archives were switched
$swap_count  : int
The number of times live and ongoing archives have swapped
__construct()  : mixed
Makes or initializes a DoubleIndexBundle with the provided parameters
addPages()  : mixed
Add the array of $pages to the active IndexDocumentBundle, storing in the partition $generation and the field used to store the resulting offsets given by $offset_field.
computeDocId()  : mixed
Given a $site array of information about a web page/document, uses the CrawlConstant::URL and CrawlConstant::HASH fields to compute a unique doc id for the array.
forceSave()  : mixed
Forces the current shard to be saved
getArchiveInfo()  : array<string|int, mixed>
Gets information about a DoubleIndexBundle out of its status.txt file
getCachePage()  : array<string|int, mixed>
Returns a full page cache (usually the web page downloaded as opposed to a summary of the web page) associated with a supplied key.
getParamModifiedTime()  : mixed
Returns the last time the archive info of the bundle was modified.
getStartSchedule()  : mixed
The start schedule is the first schedule a queue server makes when a crawl is first started. To facilitate switching between IndexDocumentBundles when doing a crawl with a DoubleIndexBundle, this start schedule is stored in the DoubleIndexBundle. When the IndexDocumentBundles' roles (query and crawl) are swapped, this method copies the start schedule from the DoubleIndexBundle to the schedule folder to restart the crawl
getSummary()  : array<string|int, mixed>
Returns a document summary from the active archive associated with the supplied key
setArchiveInfo()  : mixed
Sets the archive info struct for the index archive and web archive bundles associated with this double index bundle. This struct has fields like: DESCRIPTION (serialized store of global parameters of the crawl, such as seed sites, timestamp, etc.), COUNT (num urls seen + pages stored for the index archive in use for crawling), VISITED_URLS_COUNT (number of pages seen for the index archive in use for crawling), QUERY_COUNT (num urls seen + pages stored for the index archive in use for querying, not crawling), QUERY_VISITED_URLS_COUNT (number of pages seen for the index archive in use for querying, not crawling), and NUM_DOCS_PER_PARTITION (how many docs per web archive in the bundle).
setStartSchedule()  : mixed
The start schedule is the first schedule a queue server makes when a crawl is first started. To facilitate switching between IndexDocumentBundles when doing a crawl with a DoubleIndexBundle, this start schedule is stored in the DoubleIndexBundle; when the IndexDocumentBundles' roles (query and crawl) are swapped, the DoubleIndexBundle copy is used to start the crawl from the beginning again. This method copies the start schedule from the schedule folder to the DoubleIndexBundle at the start of a crawl for later use in this swapping
stopIndexing()  : mixed
Used when a crawl stops to perform final dictionary operations to produce a working stand-alone index.
swapActiveBundle()  : mixed
Switches which of the two bundles is the one to which new index data will be written. Before switching, closes the old bundle properly.
swapTimeReached()  : bool
Checks if the amount of time since the roles of the two IndexDocumentBundles in this DoubleIndexBundle were last swapped has exceeded the swap time for this bundle.
updateDictionary()  : mixed
Checks if there is enough data in the active partition of the active archive to warrant storing in the dictionary; if so, builds an inverted index for the active partition of the active archive and adds the postings to the dictionary

Properties

$description

A short text name for this DoubleIndexBundle

public string $description

$num_docs_per_partition

Number of docs before a new generation is started for an IndexDocumentBundle in this DoubleIndexBundle

public int $num_docs_per_partition

$repeat_frequency

How frequently, in seconds, the live and ongoing archives should be swapped

public int $repeat_frequency

$repeat_time

Last time live and ongoing archives were switched

public int $repeat_time

$swap_count

The number of times live and ongoing archives have swapped

public int $swap_count

Methods

__construct()

Makes or initializes a DoubleIndexBundle with the provided parameters

public __construct(string $dir_name[, bool $read_only_archive = true ][, string $description = null ][, int $num_docs_per_partition = C\NUM_DOCS_PER_PARTITION ][, int $repeat_frequency = 3600 ]) : mixed
Parameters
$dir_name : string

folder name to store this bundle

$read_only_archive : bool = true

whether to open archive only for reading or reading and writing

$description : string = null

a text name/serialized info about this IndexDocumentBundle

$num_docs_per_partition : int = C\NUM_DOCS_PER_PARTITION

the number of pages to be stored in a single shard

$repeat_frequency : int = 3600

how often the crawl should be redone in seconds (has no effect if $read_only_archive is true)

Return values
mixed

addPages()

Add the array of $pages to the active IndexDocumentBundle, storing in the partition $generation and the field used to store the resulting offsets given by $offset_field.

public addPages(array<string|int, mixed> &$pages, int $visited_urls_count) : mixed
Parameters
$pages : array<string|int, mixed>

data to store

$visited_urls_count : int

number to add to the count of visited urls (visited urls is a smaller number than the total count of objects stored in the index).

Return values
mixed

computeDocId()

Given a $site array of information about a web page/document, uses the CrawlConstant::URL and CrawlConstant::HASH fields to compute a unique doc id for the array.

public computeDocId(array<string|int, mixed> $site) : mixed
Parameters
$site : array<string|int, mixed>

site to compute doc_id for

Return values
mixed
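A hypothetical sketch of what a doc-id computation along these lines could look like. The key names "url" and "hash", the md5 digest, and the id layout are all assumptions for illustration; Yioop's actual computeDocId() may use different fields and hashing.

```php
<?php
// Illustrative only: derive a doc id from a $site array's url and
// content-hash fields. Key names and hashing scheme are assumptions.
function toyComputeDocId(array $site): string
{
    // Combine where the page came from with a digest of its content, so the
    // same content at two different urls still gets distinct ids
    return substr(md5($site["url"]), 0, 8) . substr($site["hash"], 0, 8);
}
```

The point of mixing both fields is that neither the url nor the content hash alone uniquely identifies a crawled document.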

forceSave()

Forces the current shard to be saved

public forceSave() : mixed
Return values
mixed

getArchiveInfo()

Gets information about a DoubleIndexBundle out of its status.txt file

public static getArchiveInfo(string $dir_name) : array<string|int, mixed>
Parameters
$dir_name : string

folder name of the DoubleIndexBundle to get info for

Return values
array<string|int, mixed>

containing the name (description) of the DoubleIndexBundle, the number of items stored in it, and the number of WebArchive file partitions it uses.

getCachePage()

Returns a full page cache (usually the web page downloaded as opposed to a summary of the web page) associated with a supplied key.

public getCachePage(string $doc_key) : array<string|int, mixed>
Parameters
$doc_key : string

key (usually based on the url of where the document came from) associated with the document whose cached page is wanted

Return values
array<string|int, mixed>

desired cache

getParamModifiedTime()

Returns the last time the archive info of the bundle was modified.

public static getParamModifiedTime(string $dir_name) : mixed
Parameters
$dir_name : string

folder with archive bundle

Return values
mixed

getStartSchedule()

The start schedule is the first schedule a queue server makes when a crawl is first started. To facilitate switching between IndexDocumentBundles when doing a crawl with a DoubleIndexBundle, this start schedule is stored in the DoubleIndexBundle. When the IndexDocumentBundles' roles (query and crawl) are swapped, this method copies the start schedule from the DoubleIndexBundle to the schedule folder to restart the crawl

public static getStartSchedule(string $dir_name, int $channel) : mixed
Parameters
$dir_name : string

folder in the bundle where the schedule is stored

$channel : int

channel that is being used to do the current double index crawl. A typical Yioop instance might have several ongoing crawls, each with a different channel

Return values
mixed

getSummary()

Returns a document summary from the active archive associated with the supplied key

public getSummary(string $doc_key) : array<string|int, mixed>
Parameters
$doc_key : string

key (usually based on the url of where the document came from) associated with the document whose summary is wanted

Return values
array<string|int, mixed>

desired summary

setArchiveInfo()

Sets the archive info struct for the index archive and web archive bundles associated with this double index bundle. This struct has fields like: DESCRIPTION (serialized store of global parameters of the crawl, such as seed sites, timestamp, etc.), COUNT (num urls seen + pages stored for the index archive in use for crawling), VISITED_URLS_COUNT (number of pages seen for the index archive in use for crawling), QUERY_COUNT (num urls seen + pages stored for the index archive in use for querying, not crawling), QUERY_VISITED_URLS_COUNT (number of pages seen for the index archive in use for querying, not crawling), and NUM_DOCS_PER_PARTITION (how many docs per web archive in the bundle).

public static setArchiveInfo(string $dir_name, array<string|int, mixed> $info) : mixed
Parameters
$dir_name : string

folder with archive bundle

$info : array<string|int, mixed>

struct with above fields

Return values
mixed
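The $info struct can be pictured as a PHP array over the field names listed in the description above. The values shown here are made-up placeholders, not real crawl figures.

```php
<?php
// Sketch of a setArchiveInfo() $info struct, using the documented field
// names. All values below are invented placeholders for illustration.
$info = [
    "DESCRIPTION" => serialize(["seed_sites" => ["https://example.org/"],
        "timestamp" => 1700000000]),     // serialized global crawl parameters
    "COUNT" => 125000,                   // urls seen + pages stored, crawl index
    "VISITED_URLS_COUNT" => 40000,       // pages seen, crawl index
    "QUERY_COUNT" => 118000,             // urls seen + pages stored, query index
    "QUERY_VISITED_URLS_COUNT" => 38000, // pages seen, query index
    "NUM_DOCS_PER_PARTITION" => 50000,   // docs per web archive partition
];
```

Note the parallel COUNT/QUERY_COUNT and VISITED_URLS_COUNT/QUERY_VISITED_URLS_COUNT pairs: one set of statistics per role, matching the two bundles in a DoubleIndexBundle.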

setStartSchedule()

The start schedule is the first schedule a queue server makes when a crawl is first started. To facilitate switching between IndexDocumentBundles when doing a crawl with a DoubleIndexBundle, this start schedule is stored in the DoubleIndexBundle; when the IndexDocumentBundles' roles (query and crawl) are swapped, the DoubleIndexBundle copy is used to start the crawl from the beginning again. This method copies the start schedule from the schedule folder to the DoubleIndexBundle at the start of a crawl for later use in this swapping

public static setStartSchedule(string $dir_name, int $channel) : mixed
Parameters
$dir_name : string

folder in the bundle where the schedule should be stored

$channel : int

channel that is being used to do the current double index crawl. A typical Yioop instance might have several ongoing crawls, each with a different channel

Return values
mixed

stopIndexing()

Used when a crawl stops to perform final dictionary operations to produce a working stand-alone index.

public stopIndexing() : mixed
Return values
mixed

swapActiveBundle()

Switches which of the two bundles is the one to which new index data will be written. Before switching, closes the old bundle properly.

public swapActiveBundle() : mixed
Return values
mixed

swapTimeReached()

Checks if the amount of time since the roles of the two IndexDocumentBundles in this DoubleIndexBundle were last swapped has exceeded the swap time for this bundle.

public swapTimeReached() : bool
Return values
bool

true if the swap time has been exceeded
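Given the $repeat_time and $repeat_frequency properties documented above, the check swapTimeReached() performs can plausibly be sketched as a simple elapsed-time comparison. The function and parameter names below are illustrative, not Yioop's actual code.

```php
<?php
// Illustrative sketch of a swap-time check: has more than
// $repeat_frequency seconds elapsed since the last swap at $repeat_time?
// Names here are assumptions for the sake of the example.
function toySwapTimeReached(int $repeat_time, int $repeat_frequency,
    int $now): bool
{
    return ($now - $repeat_time) > $repeat_frequency;
}
```

A queue server polling this check would call swapActiveBundle() once it returns true, then record the new swap time.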

updateDictionary()

Checks if there is enough data in the active partition of the active archive to warrant storing in the dictionary; if so, builds an inverted index for the active partition of the active archive and adds the postings to the dictionary

public updateDictionary([string $taking_too_long_touch = null ]) : mixed
Parameters
$taking_too_long_touch : string = null

name of a file to touch if checking the update takes longer than LOG_TIMEOUT. To prevent a crawl from stopping because nothing appears to be happening, the file usually supplied is C\SCHEDULES_DIR . "/{$this->channel}-" . self::crawl_status_file

Return values
mixed

        
