IndexArchiveBundle
implements CrawlConstants
Encapsulates a set of web page summaries and an inverted word-index of terms from these summaries which allow one to search for summaries containing a particular word.
The basic file structures for an IndexArchiveBundle are:
- A WebArchiveBundle for web page summaries.
- An IndexDictionary containing all the words stored in the bundle. Each word entry in the dictionary contains starting and ending offsets for documents containing that word for some particular IndexShard generation.
- A set of index shard generations. These generations have names index0, index1,... A shard has word entries, word doc entries and document entries. For more information see the index shard documentation.
- The file generations.txt keeps track of the current generation. A given generation can hold NUM_WORDS_PER_GENERATION words among all its partitions, after which the next generation begins.
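The generation rollover described in the list above can be sketched as follows. This is a minimal illustration, not the bundle's actual code: the limit value, the function names, and the use of plain string keys for ACTIVE and NUM_WORDS are assumptions made for the example.

```php
<?php
// Hypothetical limit for illustration only; the real value of
// NUM_WORDS_PER_GENERATION is not given in this documentation.
const SKETCH_WORDS_PER_GENERATION = 5;

/**
 * Returns updated generation info after adding $num_words words,
 * starting a new generation when the active one would overflow.
 */
function sketchAddWords(array $info, int $num_words): array
{
    if ($info['NUM_WORDS'] + $num_words > SKETCH_WORDS_PER_GENERATION) {
        $info['ACTIVE']++;        // the next generation begins
        $info['NUM_WORDS'] = 0;
    }
    $info['NUM_WORDS'] += $num_words;
    return $info;
}

/** Shard name for a generation, following the index0, index1,... pattern. */
function sketchShardName(int $generation): string
{
    return "index" . $generation;
}

$info = ['ACTIVE' => 0, 'NUM_WORDS' => 0];
foreach ([3, 2, 4] as $batch) {
    $info = sketchAddWords($info, $batch);
}
echo sketchShardName($info['ACTIVE']), " holds ", $info['NUM_WORDS'], " words\n";
```

With the toy limit of 5, the first two batches fill generation 0 exactly and the third batch opens generation 1.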
Interfaces, Classes, Traits and Enums
- CrawlConstants
- Shared constants and enums used by components that are involved in the crawling process
Table of Contents
- FORCE_ADVANCE_SIZE = 120000000
- Index shard size threshold beyond which we force the generation to advance
- NO_LOAD_SIZE = 50000000
- Threshold beyond which we don't load the old index shard when restarting and instead just advance to a new shard
- $current_shard : object
- Index Shard for current generation inverted word index
- $description : string
- A short text name for this IndexArchiveBundle
- $dictionary : object
- IndexDictionary for all shards in the IndexArchiveBundle. This contains entries of the form (word, num_shards with word, posting list info for the 0th shard containing the word, posting list info for the 1st shard containing the word, ...)
- $dir_name : string
- Folder name to use for this IndexArchiveBundle
- $generation_info : array<string|int, mixed>
- Structure containing info about the current generation: its index (ACTIVE) and the number of words it contains (NUM_WORDS).
- $incremental : bool
- For a non-read-only archive, whether the index archive is built incrementally, adding new documents periodically, or is instead rebuilt in full each time forceSave is called.
- $num_docs_per_generation : int
- Number of docs before a new generation is started
- $num_partitions_summaries : int
- Number of partitions in the summaries WebArchiveBundle
- $summaries : object
- WebArchiveBundle for web page summaries
- $version : int
- What version of index archive bundle this is
- __construct() : mixed
- Makes or initializes an IndexArchiveBundle with the provided parameters
- addActiveShardDictionary() : mixed
- Adds the words from this shard to the dictionary
- addAdvanceGeneration() : mixed
- Starts a new generation. The dictionary of the old shard is copied to the bundle's dictionary and a log-merge is performed if needed. This function may be called by updateShardsAndDictionary, as well as when resuming a crawl rather than loading the periodic index save of a shard that is too large.
- addIndexData() : mixed
- Adds the provided mini inverted index data to the IndexArchiveBundle. Expects updateShardsAndDictionary to be called before, so the generation is correct.
- addPages() : mixed
- Adds the array of $pages to the summaries WebArchiveBundle. The pages are stored in the partition $generation, and the resulting offsets are recorded in the field given by $offset_field.
- buildInvertedIndexShard() : mixed
- Builds an inverted index shard for the current generation's index shard.
- forceSave() : mixed
- Forces the current shard to be saved
- getActiveShard() : object
- Sets the current shard to be the active shard (the active shard is what we call the last (highest indexed) shard in the bundle), then returns a reference to this shard.
- getArchiveInfo() : array<string|int, mixed>
- Gets the description, count of summaries, and number of partitions of the summaries store in the supplied directory. If the file arc_description.txt exists, this is viewed as a dummy index archive for the sole purpose of allowing conversions of downloaded data such as arc files into Yioop! format.
- getCurrentShard() : object
- Returns the shard which is currently being used to read word-document data from the bundle. If one wants to write data to the bundle use getActiveShard() instead. The point of this method is to allow for lazy reading of the file associated with the shard.
- getPage() : array<string|int, mixed>
- Gets the page out of the summaries WebArchiveBundle with the given offset and generation
- getParamModifiedTime() : mixed
- Returns the last time the archive info of the bundle was modified.
- setArchiveInfo() : mixed
- Sets the archive info struct for the web archive bundle associated with this bundle. This struct has fields like: DESCRIPTION (serialized store of global parameters of the crawl like seed sites, timestamp, etc), COUNT (num urls seen + pages seen stored), VISITED_URLS_COUNT (number of pages seen while crawling), NUM_DOCS_PER_PARTITION (how many docs per web archive in the bundle).
- setCurrentShard() : mixed
- Sets the current shard to be the $i th shard in the index bundle.
- stopIndexing() : mixed
- Used when a crawl stops to perform final dictionary operations to produce a working stand-alone index.
- updateShardsAndDictionary() : int
- Determines, based on its size, whether index_shard should be added to the active generation or whether a new generation should be started.
Constants
FORCE_ADVANCE_SIZE
Index shard size threshold beyond which we force the generation to advance
public
mixed
FORCE_ADVANCE_SIZE
= 120000000
NO_LOAD_SIZE
Threshold beyond which we don't load the old index shard when restarting and instead just advance to a new shard
public
mixed
NO_LOAD_SIZE
= 50000000
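How the two thresholds might be applied can be sketched as below. The constant values are copied from this page; everything else is an assumption for illustration, including the function names, the strict comparisons, and the idea that the size is measured in bytes (the unit is not stated here).

```php
<?php
// Values copied from the constants documented above.
const SKETCH_FORCE_ADVANCE_SIZE = 120000000;
const SKETCH_NO_LOAD_SIZE = 50000000;

/**
 * On restart: decide whether to load the old shard or just advance.
 * $shard_size is assumed to be a size in bytes.
 */
function sketchRestartAction(int $shard_size): string
{
    return ($shard_size > SKETCH_NO_LOAD_SIZE) ? "advance" : "load";
}

/** While indexing: decide whether the generation must be advanced. */
function sketchMustForceAdvance(int $shard_size): bool
{
    return $shard_size > SKETCH_FORCE_ADVANCE_SIZE;
}

foreach ([10000000, 60000000, 130000000] as $size) {
    echo $size, ": restart=", sketchRestartAction($size),
        ", force=", var_export(sketchMustForceAdvance($size), true), "\n";
}
```

Because NO_LOAD_SIZE is the smaller of the two values, a shard can be too large to reload on restart while still being below the size that forces an advance during indexing.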
Properties
$current_shard
Index Shard for current generation inverted word index
public
object
$current_shard
$description
A short text name for this IndexArchiveBundle
public
string
$description
$dictionary
IndexDictionary for all shards in the IndexArchiveBundle. This contains entries of the form (word, num_shards with word, posting list info for the 0th shard containing the word, posting list info for the 1st shard containing the word, ...)
public
object
$dictionary
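To make the entry layout concrete, here is a small sketch of one dictionary entry as a PHP array. The key names (word, num_shards, postings, generation, start, end) and the sample values are hypothetical; this page only gives the tuple layout and says that entries carry starting and ending offsets per shard generation.

```php
<?php
// Hypothetical field names and values; only the overall shape
// (word, number of shards, one posting-list record per shard) comes
// from the description above.
$sketch_entry = [
    'word' => 'archive',
    'num_shards' => 2,
    'postings' => [
        ['generation' => 0, 'start' => 128, 'end' => 512],
        ['generation' => 3, 'start' => 64, 'end' => 96],
    ],
];

/** Lists the generations whose shards contain the entry's word. */
function sketchGenerationsWithWord(array $entry): array
{
    return array_column($entry['postings'], 'generation');
}

print_r(sketchGenerationsWithWord($sketch_entry));
```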
$dir_name
Folder name to use for this IndexArchiveBundle
public
string
$dir_name
$generation_info
Structure containing info about the current generation: its index (ACTIVE) and the number of words it contains (NUM_WORDS).
public
array<string|int, mixed>
$generation_info
$incremental
For a non-read-only archive, whether the index archive is built incrementally, adding new documents periodically, or is instead rebuilt in full each time forceSave is called.
public
bool
$incremental
$num_docs_per_generation
Number of docs before a new generation is started
public
int
$num_docs_per_generation
$num_partitions_summaries
Number of partitions in the summaries WebArchiveBundle
public
int
$num_partitions_summaries
$summaries
WebArchiveBundle for web page summaries
public
object
$summaries
$version
What version of index archive bundle this is
public
int
$version
Methods
__construct()
Makes or initializes an IndexArchiveBundle with the provided parameters
public
__construct(string $dir_name[, bool $read_only_archive = true ][, string $description = null ][, int $num_docs_per_generation = CNUM_DOCS_PER_PARTITION ][, bool $incremental = false ]) : mixed
Parameters
- $dir_name : string
-
folder name to store this bundle
- $read_only_archive : bool = true
-
whether to open archive only for reading or reading and writing
- $description : string = null
-
a text name/serialized info about this IndexArchiveBundle
- $num_docs_per_generation : int = CNUM_DOCS_PER_PARTITION
-
the number of pages to be stored in a single shard
- $incremental : bool = false
-
for a non-read-only archive, whether the index archive is built incrementally, adding new documents periodically, or is instead rebuilt in full each time forceSave is called.
Return values
mixed
addActiveShardDictionary()
Adds the words from this shard to the dictionary
public
addActiveShardDictionary([object $callback = null ]) : mixed
Parameters
- $callback : object = null
-
object with join function to be called if process is taking too long
Return values
mixed
addAdvanceGeneration()
Starts a new generation. The dictionary of the old shard is copied to the bundle's dictionary and a log-merge is performed if needed. This function may be called by updateShardsAndDictionary, as well as when resuming a crawl rather than loading the periodic index save of a shard that is too large.
public
addAdvanceGeneration([object $callback = null ]) : mixed
Parameters
- $callback : object = null
-
object with join function to be called if process is taking too long
Return values
mixed
addIndexData()
Adds the provided mini inverted index data to the IndexArchiveBundle. Expects updateShardsAndDictionary to be called before, so the generation is correct.
public
addIndexData(object $index_shard) : mixed
Parameters
- $index_shard : object
-
a mini inverted index of word_key=>doc data to add to this IndexArchiveBundle
Return values
mixed
addPages()
Adds the array of $pages to the summaries WebArchiveBundle. The pages are stored in the partition $generation, and the resulting offsets are recorded in the field given by $offset_field.
public
addPages(int $generation, string $offset_field, array<string|int, mixed> &$pages, int $visited_urls_count) : mixed
Parameters
- $generation : int
-
field used to select partition
- $offset_field : string
-
field used to record offsets after storing
- $pages : array<string|int, mixed>
-
data to store
- $visited_urls_count : int
-
number to add to the count of visited urls (visited urls is a smaller number than the total count of objects stored in the index).
Return values
mixed
buildInvertedIndexShard()
Builds an inverted index shard for the current generation's index shard.
public
buildInvertedIndexShard() : mixed
Return values
mixed
forceSave()
Forces the current shard to be saved
public
forceSave() : mixed
Return values
mixed
getActiveShard()
Sets the current shard to be the active shard (the active shard is what we call the last (highest indexed) shard in the bundle), then returns a reference to this shard.
public
getActiveShard() : object
Return values
object —last shard in the bundle
getArchiveInfo()
Gets the description, count of summaries, and number of partitions of the summaries store in the supplied directory. If the file arc_description.txt exists, this is viewed as a dummy index archive for the sole purpose of allowing conversions of downloaded data such as arc files into Yioop! format.
public
static getArchiveInfo(string $dir_name) : array<string|int, mixed>
Parameters
- $dir_name : string
-
path to a directory containing a summaries WebArchiveBundle
Return values
array<string|int, mixed> —summary of the given archive
getCurrentShard()
Returns the shard which is currently being used to read word-document data from the bundle. If one wants to write data to the bundle use getActiveShard() instead. The point of this method is to allow for lazy reading of the file associated with the shard.
public
getCurrentShard([bool $force_read = false ]) : object
Parameters
- $force_read : bool = false
-
whether to force a read without the advance-generation and merge-dictionary side effects
Return values
object —the index shard currently being used
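The lazy-reading idea behind getCurrentShard() can be illustrated with a toy class. This is a generic sketch of lazy loading, not the bundle's implementation: the class name, the read counter, and the simulated file contents are all assumptions.

```php
<?php
/** Toy stand-in for a shard whose file is read only on first access. */
final class SketchLazyShard
{
    /** @var array|null decoded shard contents, null until first read */
    private $data = null;
    /** @var int how many times the (simulated) file was read */
    public $reads = 0;
    /** @var string name of the shard file this object stands for */
    private $file_name;

    public function __construct(string $file_name)
    {
        $this->file_name = $file_name;
    }

    public function contents(): array
    {
        if ($this->data === null) {
            // A real shard would read $this->file_name from disk here.
            $this->reads++;
            $this->data = ['file' => $this->file_name, 'words' => []];
        }
        return $this->data;
    }
}

$shard = new SketchLazyShard("index0");
echo $shard->reads, "\n";   // nothing is read at construction time
$shard->contents();
$shard->contents();
echo $shard->reads, "\n";   // read on first access only, then cached
```

Deferring the read means that code which only needs a handle on the shard does not pay the cost of loading its file.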
getPage()
Gets the page out of the summaries WebArchiveBundle with the given offset and generation
public
getPage(int $offset[, int $generation = -1 ]) : array<string|int, mixed>
Parameters
- $offset : int
-
byte offset in partition of desired page
- $generation : int = -1
-
which generation's WebArchive to look in; defaults to the same number as the current shard
Return values
array<string|int, mixed> —desired page
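The default for $generation can be sketched as follows. The in-memory array stands in for the summaries partitions, and the function names and the `title` key are hypothetical; only the rule that -1 means "use the current shard's generation" comes from the parameter description above.

```php
<?php
/** Resolves a getPage()-style generation argument; -1 means "use current". */
function sketchResolveGeneration(int $generation, int $current_generation): int
{
    return ($generation === -1) ? $current_generation : $generation;
}

/**
 * Looks up a page by offset in an in-memory stand-in for the partitions,
 * indexed as $partitions[generation][offset].
 */
function sketchGetPage(array $partitions, int $offset, int $current,
    int $generation = -1): ?array
{
    $g = sketchResolveGeneration($generation, $current);
    return $partitions[$g][$offset] ?? null;
}

$partitions = [
    0 => [0 => ['title' => 'first page']],
    1 => [40 => ['title' => 'second page']],
];
$page = sketchGetPage($partitions, 40, 1);    // generation defaults to 1
echo $page['title'], "\n";
```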
getParamModifiedTime()
Returns the last time the archive info of the bundle was modified.
public
static getParamModifiedTime(string $dir_name) : mixed
Parameters
- $dir_name : string
-
folder with archive bundle
Return values
mixed
setArchiveInfo()
Sets the archive info struct for the web archive bundle associated with this bundle. This struct has fields like: DESCRIPTION (serialized store of global parameters of the crawl like seed sites, timestamp, etc), COUNT (num urls seen + pages seen stored), VISITED_URLS_COUNT (number of pages seen while crawling), NUM_DOCS_PER_PARTITION (how many docs per web archive in the bundle).
public
static setArchiveInfo(string $dir_name, array<string|int, mixed> $info) : mixed
Parameters
- $dir_name : string
-
folder with archive bundle
- $info : array<string|int, mixed>
-
struct with above fields
Return values
mixed
setCurrentShard()
Sets the current shard to be the $i th shard in the index bundle.
public
setCurrentShard( $i[, $disk_based = false ]) : mixed
Parameters
- $i :
-
which shard to set the current shard to be
- $disk_based : = false
-
whether to read the whole shard in before using it or leave it on disk except for the pages needed
Return values
mixed
stopIndexing()
Used when a crawl stops to perform final dictionary operations to produce a working stand-alone index.
public
stopIndexing() : mixed
Return values
mixed
updateShardsAndDictionary()
Determines, based on its size, whether index_shard should be added to the active generation or whether a new generation should be started.
public
updateShardsAndDictionary(int $add_num_docs[, object $callback = null ][, bool $blocking = false ]) : int
If a new generation is needed, it is started, the old generation is saved, the dictionary of the old shard is copied to the bundle's dictionary, and a log-merge is performed if needed.
Parameters
- $add_num_docs : int
-
number of docs in the shard about to be added
- $callback : object = null
-
object with join function to be called if process is taking too long
- $blocking : bool = false
-
whether there is an ongoing merge tiers operation; if so, don't do anything and return -1
Return values
int —the active generation after the check and possible change has been performed
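The decision described for updateShardsAndDictionary() might look roughly like the sketch below. It is a simplified model under stated assumptions: the class and its members are hypothetical, the capacity check uses a plain document count against the per-generation limit, and saving the old shard, copying its dictionary, and the log-merge are omitted.

```php
<?php
/** Toy model of the generation check; not the real bundle class. */
final class SketchBundle
{
    /** @var int index of the active generation */
    public $active = 0;
    /** @var int documents already stored in the active generation */
    public $docs_in_active = 0;
    /** @var int documents allowed before a new generation is started */
    public $num_docs_per_generation;

    public function __construct(int $num_docs_per_generation)
    {
        $this->num_docs_per_generation = $num_docs_per_generation;
    }

    /** Returns the active generation after the check, or -1 if blocked. */
    public function updateGeneration(int $add_num_docs,
        bool $blocking = false): int
    {
        if ($blocking) {
            return -1;  // a merge is in progress, so do nothing
        }
        if ($this->docs_in_active + $add_num_docs >
            $this->num_docs_per_generation) {
            $this->active++;            // start a new generation
            $this->docs_in_active = 0;
        }
        return $this->active;
    }

    /** Records docs as added; call after updateGeneration(). */
    public function addDocs(int $num_docs): void
    {
        $this->docs_in_active += $num_docs;
    }
}

$bundle = new SketchBundle(100);
foreach ([60, 60] as $num_docs) {
    $generation = $bundle->updateGeneration($num_docs);
    $bundle->addDocs($num_docs);
    echo "stored ", $num_docs, " docs in generation ", $generation, "\n";
}
```

The call order mirrors the note on addIndexData(): the generation check runs first so that the data is then added to the correct generation.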