DoubleIndexBundle
in package
implements
CrawlConstants
A DoubleIndexBundle encapsulates and provides methods for two IndexDocumentBundles used to store a repeating crawl. One of these bundles is used to handle current search queries, while the other is used to store an ongoing crawl; once the crawl time has been reached, the roles of the two bundles are swapped
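For orientation, here is a minimal sketch of opening such a bundle. The namespace import, folder path, description, and parameter values are illustrative assumptions, not taken from this page:

    <?php
    // Assumed namespace for Yioop library classes; adjust to your install.
    use seekquarry\yioop\library\DoubleIndexBundle;

    // Open read-only, e.g. on the query side (hypothetical folder path):
    $query_bundle = new DoubleIndexBundle("/crawls/MyRepeatingCrawl");

    // Open read-write for crawling, swapping roles every hour:
    $crawl_bundle = new DoubleIndexBundle("/crawls/MyRepeatingCrawl", false,
        "my repeating crawl", 50000 /* illustrative partition size */, 3600);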
Interfaces, Classes, Traits and Enums
- CrawlConstants
- Shared constants and enums used by components that are involved in the crawling process
Table of Contents
- $active_archive : IndexDocumentBundle
- The internal IndexDocumentBundle which is active
- $active_archive_num : int
- The number of the internal IndexDocumentBundle which is active
- $description : string
- A short text name for this DoubleIndexBundle
- $num_docs_per_partition : int
- Number of docs before a new generation is started for an IndexDocumentBundle in this DoubleIndexBundle
- $repeat_frequency : int
- How frequently the live and ongoing archives should be swapped, in seconds
- $repeat_time : int
- Last time live and ongoing archives were switched
- $swap_count : int
- The number of times live and ongoing archives have swapped
- __construct() : mixed
- Makes or initializes a DoubleIndexBundle with the provided parameters
- addPages() : mixed
- Adds the array of $pages to the active IndexDocumentBundle.
- computeDocId() : mixed
- Given a $site array of information about a web page/document, uses the CrawlConstants::URL and CrawlConstants::HASH fields to compute a unique doc id for the array.
- forceSave() : mixed
- Forces the current shard to be saved
- getArchiveInfo() : array<string|int, mixed>
- Gets information about a DoubleIndexBundle out of its status.txt file
- getCachePage() : array<string|int, mixed>
- Returns a full page cache (usually the web page downloaded as opposed to a summary of the web page) associated with a supplied key.
- getParamModifiedTime() : mixed
- Returns the last time the archive info of the bundle was modified.
- getStartSchedule() : mixed
- The start schedule is the first schedule a queue server makes when a crawl is just started. To facilitate switching between IndexDocumentBundles when doing a crawl with a DoubleIndexBundle, this start schedule is stored in the DoubleIndexBundle. When the IndexDocumentBundles' roles (query and crawl) are swapped, this method copies the start schedule from the DoubleIndexBundle to the schedule folder to restart the crawl
- getSummary() : array<string|int, mixed>
- Returns a document summary from the active archive associated with the supplied key
- setArchiveInfo() : mixed
- Sets the archive info struct for the index archive and web archive bundles associated with this double index bundle. This struct has fields like: DESCRIPTION (serialized store of global parameters of the crawl, such as seed sites, timestamp, etc.), COUNT (num urls seen + pages seen stored for the index archive in use for crawling), VISITED_URLS_COUNT (number of pages seen for the index archive in use for crawling), QUERY_COUNT (num urls seen + pages seen stored for the index archive in use for querying, not crawling), QUERY_VISITED_URLS_COUNT (number of pages seen for the index archive in use for querying, not crawling), NUM_DOCS_PER_PARTITION (how many docs per web archive in the bundle).
- setStartSchedule() : mixed
- The start schedule is the first schedule a queue server makes when a crawl is just started. To facilitate switching between IndexDocumentBundles when doing a crawl with a DoubleIndexBundle, this start schedule is stored in the DoubleIndexBundle; when the IndexDocumentBundles' roles (query and crawl) are swapped, the DoubleIndexBundle copy is used to start the crawl from the beginning again. This method copies the start schedule from the schedule folder to the DoubleIndexBundle at the start of a crawl for later use in this swapping
- stopIndexing() : mixed
- Used when a crawl stops to perform final dictionary operations to produce a working stand-alone index.
- swapActiveBundle() : mixed
- Switches which of the two bundles is the one to which new index data will be written. Before switching, closes the old bundle properly.
- swapTimeReached() : bool
- Checks if the amount of time since the roles of the two IndexDocumentBundles in this DoubleIndexBundle were last swapped has exceeded the swap time for this bundle.
- updateDictionary() : mixed
- Checks if there is enough data in the active partition of the active archive to warrant storing in the dictionary; if so, builds an inverted index for the active partition of the active archive and adds the postings to the dictionary
Properties
$active_archive
The internal IndexDocumentBundle which is active
public
IndexDocumentBundle
$active_archive
$active_archive_num
The number of the internal IndexDocumentBundle which is active
public
int
$active_archive_num
$description
A short text name for this DoubleIndexBundle
public
string
$description
$num_docs_per_partition
Number of docs before a new generation is started for an IndexDocumentBundle in this DoubleIndexBundle
public
int
$num_docs_per_partition
$repeat_frequency
How frequently the live and ongoing archives should be swapped, in seconds
public
int
$repeat_frequency
$repeat_time
Last time live and ongoing archives were switched
public
int
$repeat_time
$swap_count
The number of times live and ongoing archives have swapped
public
int
$swap_count
Methods
__construct()
Makes or initializes a DoubleIndexBundle with the provided parameters
public
__construct(string $dir_name[, bool $read_only_archive = true ][, string $description = null ][, int $num_docs_per_partition = C\NUM_DOCS_PER_PARTITION ][, int $repeat_frequency = 3600 ]) : mixed
Parameters
- $dir_name : string
- folder name to store this bundle
- $read_only_archive : bool = true
- whether to open archive only for reading or reading and writing
- $description : string = null
- a text name/serialized info about this IndexDocumentBundle
- $num_docs_per_partition : int = C\NUM_DOCS_PER_PARTITION
- the number of pages to be stored in a single shard
- $repeat_frequency : int = 3600
- how often the crawl should be redone in seconds (has no effect if $read_only_archive is true)
Return values
mixed
addPages()
Adds the array of $pages to the active IndexDocumentBundle.
public
addPages(array<string|int, mixed> &$pages, int $visited_urls_count) : mixed
Parameters
- $pages : array<string|int, mixed>
- data to store
- $visited_urls_count : int
- number to add to the count of visited urls (visited urls is a smaller number than the total count of objects stored in the index).
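A hedged sketch of a call, assuming $bundle is a writable DoubleIndexBundle as in the earlier sketch; the page arrays below are fabricated, and real summaries carry many more fields:

    use seekquarry\yioop\library\CrawlConstants;

    // Two fabricated page summaries keyed by CrawlConstants field names.
    $pages = [
        [CrawlConstants::URL => "https://example.com/",
            CrawlConstants::HASH => "deadbeef01"],
        [CrawlConstants::URL => "https://example.com/about",
            CrawlConstants::HASH => "deadbeef02"],
    ];
    // Both urls were actually downloaded, so the visited count grows by 2.
    $bundle->addPages($pages, 2);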
Return values
mixed
computeDocId()
Given a $site array of information about a web page/document, uses the CrawlConstants::URL and CrawlConstants::HASH fields to compute a unique doc id for the array.
public
computeDocId(array<string|int, mixed> $site) : mixed
Parameters
- $site : array<string|int, mixed>
- site to compute doc_id for
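As a sketch, assuming $bundle as above; only the two fields the description says are read are shown, with fabricated values:

    use seekquarry\yioop\library\CrawlConstants;

    $site = [
        CrawlConstants::URL => "https://example.com/page",
        CrawlConstants::HASH => "deadbeef03", // hash of the page content
    ];
    $doc_id = $bundle->computeDocId($site); // id derived from url + hash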
Return values
mixed
forceSave()
Forces the current shard to be saved
public
forceSave() : mixed
Return values
mixed
getArchiveInfo()
Gets information about a DoubleIndexBundle out of its status.txt file
public
static getArchiveInfo(string $dir_name) : array<string|int, mixed>
Parameters
- $dir_name : string
- folder name of the DoubleIndexBundle to get info for
Return values
array<string|int, mixed> — containing the name (description) of the DoubleIndexBundle, the number of items stored in it, and the number of WebArchive file partitions it uses.
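For example, a sketch of reading this info statically; the folder path is hypothetical and the field names echo those listed under setArchiveInfo() below:

    $info = DoubleIndexBundle::getArchiveInfo("/crawls/MyRepeatingCrawl");
    echo $info["DESCRIPTION"] ?? ""; // name/serialized params of the bundle
    echo $info["COUNT"] ?? 0;        // items stored in the crawl-side index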
getCachePage()
Returns a full page cache (usually the web page downloaded as opposed to a summary of the web page) associated with a supplied key.
public
getCachePage(string $doc_key) : array<string|int, mixed>
Parameters
- $doc_key : string
- key (usually based on the url of where the document came from) associated with the document whose cache is wanted
Return values
array<string|int, mixed> — desired cache
getParamModifiedTime()
Returns the last time the archive info of the bundle was modified.
public
static getParamModifiedTime(string $dir_name) : mixed
Parameters
- $dir_name : string
- folder with archive bundle
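A sketch of how a process might poll for changes using this timestamp (folder path hypothetical):

    $dir = "/crawls/MyRepeatingCrawl";
    $last_seen = DoubleIndexBundle::getParamModifiedTime($dir);
    // ... later, re-read the archive info only if someone has changed it ...
    if (DoubleIndexBundle::getParamModifiedTime($dir) > $last_seen) {
        $info = DoubleIndexBundle::getArchiveInfo($dir);
        $last_seen = DoubleIndexBundle::getParamModifiedTime($dir);
    }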
Return values
mixed
getStartSchedule()
The start schedule is the first schedule a queue server makes when a crawl is just started. To facilitate switching between IndexDocumentBundles when doing a crawl with a DoubleIndexBundle, this start schedule is stored in the DoubleIndexBundle. When the IndexDocumentBundles' roles (query and crawl) are swapped, this method copies the start schedule from the DoubleIndexBundle to the schedule folder to restart the crawl
public
static getStartSchedule(string $dir_name, int $channel) : mixed
Parameters
- $dir_name : string
- folder in the bundle where the schedule is stored
- $channel : int
- channel that is being used to do the current double index crawl. A typical Yioop instance might have several ongoing crawls, each with a different channel
Return values
mixed
getSummary()
Returns a document summary from the active archive associated with the supplied key
public
getSummary(string $doc_key) : array<string|int, mixed>
Parameters
- $doc_key : string
- key (usually based on the url of where the document came from) associated with the document whose summary is wanted
Return values
array<string|int, mixed> — desired summary
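A sketch combining this method with getCachePage() above, assuming $bundle as earlier; $doc_key is a hypothetical key derived from a document's url:

    $summary = $bundle->getSummary($doc_key);  // title, snippet, etc.
    $cache = $bundle->getCachePage($doc_key);  // full downloaded page, if kept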
setArchiveInfo()
Sets the archive info struct for the index archive and web archive bundles associated with this double index bundle. This struct has fields like: DESCRIPTION (serialized store of global parameters of the crawl, such as seed sites, timestamp, etc.), COUNT (num urls seen + pages seen stored for the index archive in use for crawling), VISITED_URLS_COUNT (number of pages seen for the index archive in use for crawling), QUERY_COUNT (num urls seen + pages seen stored for the index archive in use for querying, not crawling), QUERY_VISITED_URLS_COUNT (number of pages seen for the index archive in use for querying, not crawling), NUM_DOCS_PER_PARTITION (how many docs per web archive in the bundle).
public
static setArchiveInfo(string $dir_name, array<string|int, mixed> $info) : mixed
Parameters
- $dir_name : string
- folder with archive bundle
- $info : array<string|int, mixed>
- struct with above fields
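A sketch of a read-modify-write of this struct; the folder path and $new_crawl_params are hypothetical, and only DESCRIPTION is touched:

    $dir = "/crawls/MyRepeatingCrawl";
    $info = DoubleIndexBundle::getArchiveInfo($dir);
    $info["DESCRIPTION"] = serialize($new_crawl_params); // crawl params
    DoubleIndexBundle::setArchiveInfo($dir, $info);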
Return values
mixed
setStartSchedule()
The start schedule is the first schedule a queue server makes when a crawl is just started. To facilitate switching between IndexDocumentBundles when doing a crawl with a DoubleIndexBundle, this start schedule is stored in the DoubleIndexBundle; when the IndexDocumentBundles' roles (query and crawl) are swapped, the DoubleIndexBundle copy is used to start the crawl from the beginning again. This method copies the start schedule from the schedule folder to the DoubleIndexBundle at the start of a crawl for later use in this swapping
public
static setStartSchedule(string $dir_name, int $channel) : mixed
Parameters
- $dir_name : string
- folder in the bundle where the schedule should be stored
- $channel : int
- channel that is being used to do the current double index crawl. A typical Yioop instance might have several ongoing crawls, each with a different channel
Return values
mixed
stopIndexing()
Used when a crawl stops to perform final dictionary operations to produce a working stand-alone index.
public
stopIndexing() : mixed
Return values
mixed
swapActiveBundle()
Switches which of the two bundles is the one to which new index data will be written. Before switching, closes the old bundle properly.
public
swapActiveBundle() : mixed
Return values
mixed
swapTimeReached()
Checks if the amount of time since the roles of the two IndexDocumentBundles in this DoubleIndexBundle were last swapped has exceeded the swap time for this bundle.
public
swapTimeReached() : bool
Return values
bool — true if the swap time has been exceeded
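A sketch of the periodic check a queue-server-style loop might perform, assuming a writable $bundle as in the earlier sketches:

    if ($bundle->swapTimeReached()) {
        // Close the crawl-side bundle, then make it the query-side bundle
        // (and vice versa); see getStartSchedule() for restarting the crawl.
        $bundle->swapActiveBundle();
    }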
updateDictionary()
Checks if there is enough data in the active partition of the active archive to warrant storing in the dictionary; if so, builds an inverted index for the active partition of the active archive and adds the postings to the dictionary
public
updateDictionary([string $taking_too_long_touch = null ]) : mixed
Parameters
- $taking_too_long_touch : string = null
- name of a file to touch if checking the update takes longer than LOG_TIMEOUT. To prevent a crawl from stopping because nothing appears to be happening, the file usually supplied is C\SCHEDULES_DIR . "/{$this->channel}-" . self::crawl_status_file