Yioop_V9.5_Source_Code_Documentation

IndexArchiveBundle
implements CrawlConstants

Encapsulates a set of web page summaries and an inverted word-index of terms from these summaries which allow one to search for summaries containing a particular word.

The basic file structures for an IndexArchiveBundle are:

  1. A WebArchiveBundle for web page summaries.
  2. An IndexDictionary containing all the words stored in the bundle. Each word entry in the dictionary contains starting and ending offsets for documents containing that word for some particular IndexShard generation.
  3. A set of index shard generations. These generations have names index0, index1,... A shard has word entries, word doc entries and document entries. For more information see the index shard documentation.
  4. The file generations.txt, which keeps track of the current generation. A given generation can hold NUM_WORDS_PER_GENERATION words amongst all its partitions, after which the next generation begins.
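As a rough illustration of items 3 and 4, the sketch below models how shard names of the form index0, index1,... and a record of the current generation could be tracked. The function names and the string keys 'ACTIVE' and 'NUM_WORDS' are assumptions made for this sketch; the actual on-disk format of generations.txt is not described here.

```php
<?php
// Hypothetical helper: name of the shard for generation $generation,
// following the index0, index1, ... naming described above.
function shardName(int $generation): string
{
    return "index" . $generation;
}

// Hypothetical helper: move the record to the next generation and
// reset its word count. The keys are illustrative, not Yioop's constants.
function advanceGeneration(array $info): array
{
    return ['ACTIVE' => $info['ACTIVE'] + 1, 'NUM_WORDS' => 0];
}

$info = ['ACTIVE' => 0, 'NUM_WORDS' => 0];
$info = advanceGeneration($info);
// Shard name for the new active generation
echo shardName($info['ACTIVE']), "\n";
```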
Tags
author

Chris Pollett

Interfaces, Classes, Traits and Enums

CrawlConstants
Shared constants and enums used by components that are involved in the crawling process

Table of Contents

FORCE_ADVANCE_SIZE  = 120000000
Threshold index shard size beyond which we force the generation to advance
NO_LOAD_SIZE  = 50000000
Threshold beyond which we don't load the old index shard when restarting and instead just advance to a new shard
$current_shard  : object
Index Shard for current generation inverted word index
$description  : string
A short text name for this IndexArchiveBundle
$dictionary  : object
IndexDictionary for all shards in the IndexArchiveBundle. This contains entries of the form (word, num_shards with word, posting list info for 0th shard containing the word, posting list info for 1st shard containing the word, ...)
$dir_name  : string
Folder name to use for this IndexArchiveBundle
$generation_info  : array<string|int, mixed>
Structure containing info about the current generation: its index (ACTIVE) and the number of words it contains (NUM_WORDS).
$incremental  : bool
For a non-read-only archive, holds whether we build the IndexArchive in an incremental fashion, adding new documents periodically, or instead rebuild the whole index archive each time we forceSave.
$num_docs_per_generation  : int
Number of docs before a new generation is started
$num_partitions_summaries  : int
Number of partitions in the summaries WebArchiveBundle
$summaries  : object
WebArchiveBundle for web page summaries
$version  : int
What version of index archive bundle this is
__construct()  : mixed
Makes or initializes an IndexArchiveBundle with the provided parameters
addActiveShardDictionary()  : mixed
Adds the words from this shard to the dictionary
addAdvanceGeneration()  : mixed
Starts a new generation. The dictionary of the old shard is copied to the bundle's dictionary and a log-merge is performed if needed. This function may be called by updateShardsAndDictionary, as well as when resuming a crawl rather than loading the periodic index save of a shard that is too large.
addIndexData()  : mixed
Adds the provided mini inverted index data to the IndexArchiveBundle. Expects updateShardsAndDictionary to be called beforehand, so that the generation is correct.
addPages()  : mixed
Adds the array of $pages to the summaries WebArchiveBundle. Pages are stored in the partition $generation, and the resulting offsets are recorded in the field given by $offset_field.
buildInvertedIndexShard()  : mixed
Builds an inverted index shard for the current generation's index shard.
forceSave()  : mixed
Forces the current shard to be saved
getActiveShard()  : object
Sets the current shard to be the active shard (the active shard is what we call the last (highest indexed) shard in the bundle). Then returns a reference to this shard.
getArchiveInfo()  : array<string|int, mixed>
Gets the description, count of summaries, and number of partitions of the summaries store in the supplied directory. If the file arc_description.txt exists, this is viewed as a dummy index archive for the sole purpose of allowing conversions of downloaded data such as arc files into Yioop! format.
getCurrentShard()  : object
Returns the shard which is currently being used to read word-document data from the bundle. If one wants to write data to the bundle use getActiveShard() instead. The point of this method is to allow for lazy reading of the file associated with the shard.
getPage()  : array<string|int, mixed>
Gets the page out of the summaries WebArchiveBundle with the given offset and generation
getParamModifiedTime()  : mixed
Returns the last time the archive info of the bundle was modified.
setArchiveInfo()  : mixed
Sets the archive info struct for the web archive bundle associated with this bundle. This struct has fields like: DESCRIPTION (serialized store of global parameters of the crawl like seed sites, timestamp, etc.), COUNT (num urls seen + pages seen stored), VISITED_URLS_COUNT (number of pages seen while crawling), NUM_DOCS_PER_PARTITION (how many docs per web archive in the bundle).
setCurrentShard()  : mixed
Sets the current shard to be the $i-th shard in the index bundle.
stopIndexing()  : mixed
Used when a crawl stops to perform final dictionary operations to produce a working stand-alone index.
updateShardsAndDictionary()  : int
Determines, based on its size, whether index_shard should be added to the active generation or whether a new generation should be started.

Constants

FORCE_ADVANCE_SIZE

Threshold index shard size beyond which we force the generation to advance

public mixed FORCE_ADVANCE_SIZE = 120000000

NO_LOAD_SIZE

Threshold beyond which we don't load the old index shard when restarting and instead just advance to a new shard

public mixed NO_LOAD_SIZE = 50000000
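
The sketch below shows one way these two thresholds could be applied. It assumes both constants are compared against a shard's size (the unit is not stated above), and the function names are hypothetical rather than part of IndexArchiveBundle.

```php
<?php
// Values copied from the constants documented above.
const FORCE_ADVANCE_SIZE = 120000000;
const NO_LOAD_SIZE = 50000000;

// Hypothetical: true if an old shard of this size should be loaded on
// restart; a larger shard is skipped in favor of advancing to a new one.
function shouldLoadOldShard(int $shard_size): bool
{
    return $shard_size <= NO_LOAD_SIZE;
}

// Hypothetical: true if a shard of this size forces the generation
// to advance.
function mustForceAdvance(int $shard_size): bool
{
    return $shard_size > FORCE_ADVANCE_SIZE;
}

var_dump(shouldLoadOldShard(10000000));
var_dump(mustForceAdvance(10000000));
```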

Properties

$current_shard

Index Shard for current generation inverted word index

public object $current_shard

$description

A short text name for this IndexArchiveBundle

public string $description

$dictionary

IndexDictionary for all shards in the IndexArchiveBundle. This contains entries of the form (word, num_shards with word, posting list info for 0th shard containing the word, posting list info for 1st shard containing the word, ...)

public object $dictionary
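
To make the entry layout concrete, the following sketch assembles a tuple of the form described above from per-shard posting list info. The function name and the contents of each posting list info element (here a shard number with start and end offsets) are assumptions for illustration only, not IndexDictionary's actual storage format.

```php
<?php
// Hypothetical: build (word, num_shards with word, info for 0th shard
// containing the word, info for 1st shard containing the word, ...).
function makeDictionaryEntry(string $word, array $shard_infos): array
{
    return array_merge([$word, count($shard_infos)],
        array_values($shard_infos));
}

$entry = makeDictionaryEntry("yioop", [
    ['shard' => 0, 'start' => 0, 'end' => 128],
    ['shard' => 2, 'start' => 64, 'end' => 96],
]);
print_r($entry);
```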

$dir_name

Folder name to use for this IndexArchiveBundle

public string $dir_name

$generation_info

Structure containing info about the current generation: its index (ACTIVE) and the number of words it contains (NUM_WORDS).

public array<string|int, mixed> $generation_info

$incremental

For a non-read-only archive, holds whether we build the IndexArchive in an incremental fashion, adding new documents periodically, or instead rebuild the whole index archive each time we forceSave.

public bool $incremental

$num_docs_per_generation

Number of docs before a new generation is started

public int $num_docs_per_generation

$num_partitions_summaries

Number of partitions in the summaries WebArchiveBundle

public int $num_partitions_summaries

Methods

__construct()

Makes or initializes an IndexArchiveBundle with the provided parameters

public __construct(string $dir_name[, bool $read_only_archive = true ][, string $description = null ][, int $num_docs_per_generation = C\NUM_DOCS_PER_PARTITION ][, bool $incremental = false ]) : mixed
Parameters
$dir_name : string

folder name to store this bundle

$read_only_archive : bool = true

whether to open archive only for reading or reading and writing

$description : string = null

a text name/serialized info about this IndexArchiveBundle

$num_docs_per_generation : int = C\NUM_DOCS_PER_PARTITION

the number of pages to be stored in a single shard

$incremental : bool = false

for a non-read-only archive, whether we build the IndexArchive in an incremental fashion, adding new documents periodically, or instead rebuild the whole index archive each time we forceSave.

Return values
mixed

addActiveShardDictionary()

Adds the words from this shard to the dictionary

public addActiveShardDictionary([object $callback = null ]) : mixed
Parameters
$callback : object = null

object with join function to be called if process is taking too long

Return values
mixed

addAdvanceGeneration()

Starts a new generation. The dictionary of the old shard is copied to the bundle's dictionary and a log-merge is performed if needed. This function may be called by updateShardsAndDictionary, as well as when resuming a crawl rather than loading the periodic index save of a shard that is too large.

public addAdvanceGeneration([object $callback = null ]) : mixed
Parameters
$callback : object = null

object with join function to be called if process is taking too long

Return values
mixed

addIndexData()

Adds the provided mini inverted index data to the IndexArchiveBundle. Expects updateShardsAndDictionary to be called beforehand, so that the generation is correct.

public addIndexData(object $index_shard) : mixed
Parameters
$index_shard : object

a mini inverted index of word_key=>doc data to add to this IndexArchiveBundle

Return values
mixed

addPages()

Adds the array of $pages to the summaries WebArchiveBundle. Pages are stored in the partition $generation, and the resulting offsets are recorded in the field given by $offset_field.

public addPages(int $generation, string $offset_field, array<string|int, mixed> &$pages, int $visited_urls_count) : mixed
Parameters
$generation : int

field used to select partition

$offset_field : string

field used to record offsets after storing

$pages : array<string|int, mixed>

data to store

$visited_urls_count : int

number to add to the count of visited urls (visited urls is a smaller number than the total count of objects stored in the index).

Return values
mixed

buildInvertedIndexShard()

Builds an inverted index shard for the current generation's index shard.

public buildInvertedIndexShard() : mixed
Return values
mixed

forceSave()

Forces the current shard to be saved

public forceSave() : mixed
Return values
mixed

getActiveShard()

Sets the current shard to be the active shard (the active shard is what we call the last (highest indexed) shard in the bundle). Then returns a reference to this shard.

public getActiveShard() : object
Return values
object

last shard in the bundle

getArchiveInfo()

Gets the description, count of summaries, and number of partitions of the summaries store in the supplied directory. If the file arc_description.txt exists, this is viewed as a dummy index archive for the sole purpose of allowing conversions of downloaded data such as arc files into Yioop! format.

public static getArchiveInfo(string $dir_name) : array<string|int, mixed>
Parameters
$dir_name : string

path to a directory containing a summaries WebArchiveBundle

Return values
array<string|int, mixed>

summary of the given archive

getCurrentShard()

Returns the shard which is currently being used to read word-document data from the bundle. If one wants to write data to the bundle use getActiveShard() instead. The point of this method is to allow for lazy reading of the file associated with the shard.

public getCurrentShard([bool $force_read = false ]) : object
Parameters
$force_read : bool = false

whether to force the shard to be read without the advance-generation and merge-dictionary side effects

Return values
object

the index shard currently being used
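
The lazy reading mentioned above can be pictured with a small sketch. The class below is a hypothetical stand-in, not IndexArchiveBundle itself: it defers an expensive load until the shard is first requested and reuses the result afterwards. The class name, the loader callback, and the load_count field are all assumptions introduced for illustration.

```php
<?php
// Hypothetical stand-in for a bundle that defers reading a shard
// until the shard is first requested.
class LazyShardHolder
{
    private $current_shard = null;
    private $loader;
    public $load_count = 0;

    public function __construct(callable $loader)
    {
        $this->loader = $loader;
    }

    public function getCurrentShard()
    {
        if ($this->current_shard === null) {
            // First access: perform the (potentially expensive) read.
            $this->current_shard = ($this->loader)();
            $this->load_count++;
        }
        return $this->current_shard;
    }
}

$holder = new LazyShardHolder(function () {
    return ['name' => 'index0']; // placeholder for real shard data
});
$holder->getCurrentShard();
$holder->getCurrentShard();
// The loader ran only on the first call
echo $holder->load_count, "\n";
```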

getPage()

Gets the page out of the summaries WebArchiveBundle with the given offset and generation

public getPage(int $offset[, int $generation = -1 ]) : array<string|int, mixed>
Parameters
$offset : int

byte offset in partition of desired page

$generation : int = -1

which generation WebArchive to look up in; defaults to the same number as the current shard

Return values
array<string|int, mixed>

desired page

getParamModifiedTime()

Returns the last time the archive info of the bundle was modified.

public static getParamModifiedTime(string $dir_name) : mixed
Parameters
$dir_name : string

folder with archive bundle

Return values
mixed

setArchiveInfo()

Sets the archive info struct for the web archive bundle associated with this bundle. This struct has fields like: DESCRIPTION (serialized store of global parameters of the crawl like seed sites, timestamp, etc.), COUNT (num urls seen + pages seen stored), VISITED_URLS_COUNT (number of pages seen while crawling), NUM_DOCS_PER_PARTITION (how many docs per web archive in the bundle).

public static setArchiveInfo(string $dir_name, array<string|int, mixed> $info) : mixed
Parameters
$dir_name : string

folder with archive bundle

$info : array<string|int, mixed>

struct with above fields

Return values
mixed

setCurrentShard()

Sets the current shard to be the $i-th shard in the index bundle.

public setCurrentShard($i[, $disk_based = false ]) : mixed
Parameters
$i :

which shard to set the current shard to be

$disk_based : = false

whether to read the whole shard in before using it or leave it on disk except for the pages needed

Return values
mixed

stopIndexing()

Used when a crawl stops to perform final dictionary operations to produce a working stand-alone index.

public stopIndexing() : mixed
Return values
mixed

updateShardsAndDictionary()

Determines, based on its size, whether index_shard should be added to the active generation or whether a new generation should be started.

public updateShardsAndDictionary(int $add_num_docs[, object $callback = null ][, bool $blocking = false ]) : int

If a new generation is needed, it is started, the old generation is saved, and the dictionary of the old shard is copied to the bundle's dictionary, with a log-merge performed if needed.

Parameters
$add_num_docs : int

number of docs in the shard about to be added

$callback : object = null

object with join function to be called if process is taking too long

$blocking : bool = false

whether there is an ongoing merge tiers operation; if so, do nothing and return -1

Return values
int

the active generation after the check and possible change has been performed
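
The decision described for this method can be modeled roughly as follows. This is a simplified sketch, not the method's actual code: the function name, the extra parameters, and in particular the rule that a generation advances when the incoming documents would exceed the per-generation limit are assumptions. Only the -1 return while blocked and the return of the active generation come from the description above.

```php
<?php
// Hypothetical model of the decision. Returns the active generation
// after the check, or -1 when $blocking is true.
function chooseGeneration(int $active, int $docs_in_active,
    int $add_num_docs, int $num_docs_per_generation,
    bool $blocking = false): int
{
    if ($blocking) {
        // Documented behavior: do nothing during a merge tiers operation.
        return -1;
    }
    // Assumed rule: advance when the incoming docs would not fit.
    if ($docs_in_active + $add_num_docs > $num_docs_per_generation) {
        $active++;
    }
    return $active;
}

var_dump(chooseGeneration(0, 10, 5, 20));
var_dump(chooseGeneration(0, 18, 5, 20));
var_dump(chooseGeneration(3, 0, 1, 20, true));
```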


        
