Yioop_V9.5_Source_Code_Documentation

FeedArchiveBundle extends IndexArchiveBundle

Subclass of IndexArchiveBundle that uses Bloom filters to make it easy to check whether a news feed item is already in the bundle before adding it.

Tags
author

Chris Pollett

Table of Contents

FORCE_ADVANCE_SIZE  = 120000000
Threshold index shard size beyond which we force the generation to advance
NO_LOAD_SIZE  = 50000000
Threshold beyond which we don't load the old index shard when restarting and instead just advance to a new shard
$current_shard  : object
Index Shard for current generation inverted word index
$description  : string
A short text name for this IndexArchiveBundle
$dictionary  : object
IndexDictionary for all shards in the IndexArchiveBundle. This contains entries of the form (word, num_shards with word, posting list info 0th shard containing the word, posting list info 1st shard containing the word, ...)
$dir_name  : string
Folder name to use for this IndexArchiveBundle
$filter_a  : BloomFilterFile
Used to store unique identifiers of feed items that have been stored in this FeedArchiveBundle. filter_a is the filter used to check whether items are already in the archive. Once it holds URL_FILTER_SIZE/2 items, new items are added to filter_b as well as filter_a. When filter_a reaches URL_FILTER_SIZE items, filter_a is deleted, filter_b is renamed to filter_a, and the process repeats.
$filter_b  : BloomFilterFile
Auxiliary BloomFilterFile used in checking if feed items are in this archive or not.
$generation_info  : array<string|int, mixed>
Structure containing info about the current generation: its index (ACTIVE), and the number of words it contains (NUM_WORDS).
$incremental  : bool
For a non-read-only archive, holds whether the index archive is built incrementally, with new documents added periodically, or is instead rebuilt in full each time forceSave is called.
$num_docs_per_generation  : int
Number of docs before a new generation is started
$num_partitions_summaries  : int
Number of partitions in the summaries WebArchiveBundle
$summaries  : object
WebArchiveBundle for web page summaries
$version  : int
What version of index archive bundle this is
__construct()  : mixed
Makes or initializes a FeedArchiveBundle with the provided parameters
addActiveShardDictionary()  : mixed
Adds the words from this shard to the dictionary
addAdvanceGeneration()  : mixed
Starts a new generation. The dictionary of the old shard is copied to the bundle's dictionary and a log-merge is performed if needed. This function may be called by updateShardsAndDictionary, and also when resuming a crawl, in place of loading the periodically saved index of a shard that is too large.
addFilters()  : mixed
Adds the key (often a GUID) of a feed item to the Bloom filter pair associated with this archive. The key is always added to filter a; if filter a is more than half full, the key is also added to filter b. If filter a is full, it is deleted and filter b is renamed filter a. The process then continues, with a new filter b created when the new filter a becomes half full.
addIndexData()  : mixed
Adds the provided mini inverted index data to the IndexArchiveBundle. Expects updateShardsAndDictionary to be called beforehand so that the generation is correct.
addPages()  : mixed
Adds the array of $pages to the summaries WebArchiveBundle. The pages are stored in the partition $generation, and the resulting offsets are recorded in the field given by $offset_field.
addPagesAndSeenKeys()  : mixed
Adds the array of $pages to the summaries WebArchiveBundle. The pages are stored in the partition $generation, and the resulting offsets are recorded in the field given by $offset_field.
buildInvertedIndexShard()  : mixed
Builds an inverted index shard for the current generation's index shard.
contains()  : bool
Whether the active filter for this feed contains the feed item with the supplied key
forceSave()  : mixed
Forces the current shard to be saved
getActiveShard()  : object
Sets the current shard to be the active shard (the active shard is what we call the last, highest indexed, shard in the bundle). Then returns a reference to this shard.
getArchiveInfo()  : array<string|int, mixed>
Gets the description, count of summaries, and number of partitions of the summaries store in the supplied directory. If the file arc_description.txt exists, this is viewed as a dummy index archive for the sole purpose of allowing conversions of downloaded data such as arc files into Yioop! format.
getCurrentShard()  : object
Returns the shard which is currently being used to read word-document data from the bundle. If one wants to write data to the bundle use getActiveShard() instead. The point of this method is to allow for lazy reading of the file associated with the shard.
getPage()  : array<string|int, mixed>
Gets the page out of the summaries WebArchiveBundle with the given offset and generation
getParamModifiedTime()  : mixed
Returns the last time the archive info of the bundle was modified.
setArchiveInfo()  : mixed
Sets the archive info struct for the web archive bundle associated with this bundle. This struct has fields like: DESCRIPTION (serialized store of global parameters of the crawl like seed sites, timestamp, etc), COUNT (num urls seen + pages seen stored), VISITED_URLS_COUNT (number of pages seen while crawling), NUM_DOCS_PER_PARTITION (how many docs per web archive in the bundle).
setCurrentShard()  : mixed
Sets the current shard to be the $i th shard in the index bundle.
stopIndexing()  : mixed
Used when a crawl stops to perform final dictionary operations to produce a working stand-alone index.
updateShardsAndDictionary()  : int
Determines, based on its size, whether index_shard should be added to the active generation or a new generation should be started.

Constants

FORCE_ADVANCE_SIZE

Threshold index shard size beyond which we force the generation to advance

public mixed FORCE_ADVANCE_SIZE = 120000000

NO_LOAD_SIZE

Threshold beyond which we don't load the old index shard when restarting and instead just advance to a new shard

public mixed NO_LOAD_SIZE = 50000000
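
As a rough illustration of how the two thresholds above differ, the sketch below compares a shard size against each constant. The helper names `mustAdvance` and `restartAction` are hypothetical and not part of Yioop, and treating "beyond" as a strict greater-than comparison on a size in a single unit (presumably bytes) is an assumption.

```php
<?php
// Values copied from the constant definitions above.
const FORCE_ADVANCE_SIZE = 120000000;
const NO_LOAD_SIZE = 50000000;

// Hypothetical helper: true when a shard of this size should force
// the generation to advance (assumes a strict "greater than" test).
function mustAdvance(int $shard_size): bool
{
    return $shard_size > FORCE_ADVANCE_SIZE;
}

// Hypothetical helper: on restart, either load the old shard or
// skip loading it and advance to a new shard.
function restartAction(int $shard_size): string
{
    return ($shard_size > NO_LOAD_SIZE) ? 'advance' : 'load';
}

$small = restartAction(10000000);   // below NO_LOAD_SIZE, so 'load'
$medium = restartAction(60000000);  // above NO_LOAD_SIZE, so 'advance'
$forced = mustAdvance(60000000);    // still below FORCE_ADVANCE_SIZE
```

Under these assumptions, a shard between the two thresholds can keep growing while the process runs but would not be reloaded after a restart.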

Properties

$current_shard

Index Shard for current generation inverted word index

public object $current_shard

$description

A short text name for this IndexArchiveBundle

public string $description

$dictionary

IndexDictionary for all shards in the IndexArchiveBundle. This contains entries of the form (word, num_shards with word, posting list info 0th shard containing the word, posting list info 1st shard containing the word, ...)

public object $dictionary

$dir_name

Folder name to use for this IndexArchiveBundle

public string $dir_name

$filter_a

Used to store unique identifiers of feed items that have been stored in this FeedArchiveBundle. filter_a is the filter used to check whether items are already in the archive. Once it holds URL_FILTER_SIZE/2 items, new items are added to filter_b as well as filter_a. When filter_a reaches URL_FILTER_SIZE items, filter_a is deleted, filter_b is renamed to filter_a, and the process repeats.

public BloomFilterFile $filter_a

$generation_info

Structure containing info about the current generation: its index (ACTIVE), and the number of words it contains (NUM_WORDS).

public array<string|int, mixed> $generation_info

$incremental

For a non-read-only archive, holds whether the index archive is built incrementally, with new documents added periodically, or is instead rebuilt in full each time forceSave is called.

public bool $incremental

$num_docs_per_generation

Number of docs before a new generation is started

public int $num_docs_per_generation

$num_partitions_summaries

Number of partitions in the summaries WebArchiveBundle

public int $num_partitions_summaries

Methods

__construct()

Makes or initializes a FeedArchiveBundle with the provided parameters

public __construct(string $dir_name[, bool $read_only_archive = true ][, string $description = null ][, int $num_docs_per_generation = C\NUM_DOCS_PER_PARTITION ]) : mixed
Parameters
$dir_name : string

folder name to store this bundle

$read_only_archive : bool = true

whether to open archive only for reading or reading and writing

$description : string = null

a text name/serialized info about this IndexArchiveBundle

$num_docs_per_generation : int = C\NUM_DOCS_PER_PARTITION

the number of pages to be stored in a single shard

Return values
mixed

addActiveShardDictionary()

Adds the words from this shard to the dictionary

public addActiveShardDictionary([object $callback = null ]) : mixed
Parameters
$callback : object = null

object with join function to be called if process is taking too long

Return values
mixed

addAdvanceGeneration()

Starts a new generation. The dictionary of the old shard is copied to the bundle's dictionary and a log-merge is performed if needed. This function may be called by updateShardsAndDictionary, and also when resuming a crawl, in place of loading the periodically saved index of a shard that is too large.

public addAdvanceGeneration([object $callback = null ]) : mixed
Parameters
$callback : object = null

object with join function to be called if process is taking too long

Return values
mixed

addFilters()

Adds the key (often a GUID) of a feed item to the Bloom filter pair associated with this archive. The key is always added to filter a; if filter a is more than half full, the key is also added to filter b. If filter a is full, it is deleted and filter b is renamed filter a. The process then continues, with a new filter b created when the new filter a becomes half full.

public addFilters(string $key) : mixed
Parameters
$key : string

unique identifier of a feed item

Return values
mixed
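
The two-filter rotation described for addFilters() can be sketched as follows. This is a minimal illustration, not Yioop's implementation: `SetFilter` is a hypothetical exact-set stand-in for BloomFilterFile (a real Bloom filter is probabilistic and file-backed), `RotatingFilterPair` is an invented class name, and `$capacity` plays the role of URL_FILTER_SIZE. It assumes PHP 7.4 or later.

```php
<?php
// Hypothetical stand-in for a Bloom filter: an exact set of keys.
class SetFilter
{
    private array $items = [];

    public function add(string $key): void
    {
        $this->items[$key] = true;
    }

    public function contains(string $key): bool
    {
        return isset($this->items[$key]);
    }

    public function count(): int
    {
        return count($this->items);
    }
}

// Hypothetical sketch of the filter a / filter b rotation.
class RotatingFilterPair
{
    public SetFilter $filter_a;
    public ?SetFilter $filter_b = null;
    private int $capacity;

    public function __construct(int $capacity)
    {
        $this->capacity = $capacity;
        $this->filter_a = new SetFilter();
    }

    public function addFilters(string $key): void
    {
        // Every key goes into filter a.
        $this->filter_a->add($key);
        // Once filter a is more than half full, mirror keys into filter b.
        if ($this->filter_a->count() > intdiv($this->capacity, 2)) {
            $this->filter_b ??= new SetFilter();
            $this->filter_b->add($key);
        }
        // When filter a is full, drop it and promote filter b.
        if ($this->filter_a->count() >= $this->capacity) {
            $this->filter_a = $this->filter_b ?? new SetFilter();
            $this->filter_b = null;
        }
    }

    public function contains(string $key): bool
    {
        // Only the active filter (filter a) is consulted.
        return $this->filter_a->contains($key);
    }
}

$pair = new RotatingFilterPair(4);
foreach (['k1', 'k2', 'k3', 'k4'] as $key) {
    $pair->addFilters($key);
}
$recent = $pair->contains('k4'); // true: k4 was mirrored into the promoted filter
$oldest = $pair->contains('k1'); // false: k1 was only in the discarded filter
```

The consequence of this design is that the archive only remembers the most recent keys: after a rotation, items seen before filter a reached the halfway mark are forgotten, which bounds the filter's size at the cost of possibly re-adding very old feed items.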

addIndexData()

Adds the provided mini inverted index data to the IndexArchiveBundle. Expects updateShardsAndDictionary to be called beforehand so that the generation is correct.

public addIndexData(object $index_shard) : mixed
Parameters
$index_shard : object

a mini inverted index of word_key=>doc data to add to this IndexArchiveBundle

Return values
mixed

addPages()

Adds the array of $pages to the summaries WebArchiveBundle. The pages are stored in the partition $generation, and the resulting offsets are recorded in the field given by $offset_field.

public addPages(int $generation, string $offset_field, array<string|int, mixed> &$pages, int $visited_urls_count) : mixed
Parameters
$generation : int

field used to select partition

$offset_field : string

field used to record offsets after storing

$pages : array<string|int, mixed>

data to store

$visited_urls_count : int

number to add to the count of visited urls (visited urls is a smaller number than the total count of objects stored in the index).

Return values
mixed

addPagesAndSeenKeys()

Adds the array of $pages to the summaries WebArchiveBundle. The pages are stored in the partition $generation, and the resulting offsets are recorded in the field given by $offset_field.

public addPagesAndSeenKeys(int $generation, string $offset_field, string $key_field, array<string|int, mixed> &$pages, int $visited_urls_count) : mixed
Parameters
$generation : int

field used to select partition

$offset_field : string

field used to record offsets after storing

$key_field : string

field used to store the unique identifier for each page item.

$pages : array<string|int, mixed>

data to store

$visited_urls_count : int

number to add to the count of visited urls (visited urls is a smaller number than the total count of objects stored in the index).

Return values
mixed

buildInvertedIndexShard()

Builds an inverted index shard for the current generation's index shard.

public buildInvertedIndexShard() : mixed
Return values
mixed

contains()

Whether the active filter for this feed contains the feed item with the supplied key

public contains(string $key) : bool
Parameters
$key : string

the feed item id to check if in archive

Return values
bool

true if it is in the archive, false otherwise
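
A typical use of a membership check like contains() is to skip feed items that were already stored. The sketch below shows that pattern in isolation. `filterUnseenItems` is a hypothetical helper, and its `$contains` and `$add` callables merely stand in for FeedArchiveBundle::contains() and FeedArchiveBundle::addFilters(); here they are backed by a plain array rather than a Bloom filter.

```php
<?php
/**
 * Hypothetical helper: keeps only items whose key has not been seen
 * yet and records each new key as seen.
 */
function filterUnseenItems(array $items, string $key_field,
    callable $contains, callable $add): array
{
    $fresh = [];
    foreach ($items as $item) {
        $key = $item[$key_field];
        if ($contains($key)) {
            continue; // already in the archive, skip it
        }
        $add($key);
        $fresh[] = $item;
    }
    return $fresh;
}

// Plain array standing in for the filter pair.
$seen = [];
$items = [
    ['guid' => 'a', 'title' => 'First'],
    ['guid' => 'b', 'title' => 'Second'],
    ['guid' => 'a', 'title' => 'First again'],
];
$fresh = filterUnseenItems(
    $items,
    'guid',
    function (string $key) use (&$seen): bool {
        return isset($seen[$key]);
    },
    function (string $key) use (&$seen): void {
        $seen[$key] = true;
    }
);
// $fresh holds the 'First' and 'Second' items; the repeated GUID is skipped.
```

Because Bloom filters can report false positives, a check backed by one may occasionally answer true for a key that was never added, so a small number of new items could be skipped. They never report false for a key that is currently in the filter.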

forceSave()

Forces the current shard to be saved

public forceSave() : mixed
Return values
mixed

getActiveShard()

Sets the current shard to be the active shard (the active shard is what we call the last, highest indexed, shard in the bundle). Then returns a reference to this shard.

public getActiveShard() : object
Return values
object

last shard in the bundle

getArchiveInfo()

Gets the description, count of summaries, and number of partitions of the summaries store in the supplied directory. If the file arc_description.txt exists, this is viewed as a dummy index archive for the sole purpose of allowing conversions of downloaded data such as arc files into Yioop! format.

public static getArchiveInfo(string $dir_name) : array<string|int, mixed>
Parameters
$dir_name : string

path to a directory containing a summaries WebArchiveBundle

Return values
array<string|int, mixed>

summary of the given archive

getCurrentShard()

Returns the shard which is currently being used to read word-document data from the bundle. If one wants to write data to the bundle use getActiveShard() instead. The point of this method is to allow for lazy reading of the file associated with the shard.

public getCurrentShard([bool $force_read = false ]) : object
Parameters
$force_read : bool = false

whether to force a read that has no advance-generation or merge-dictionary side effects

Return values
object

the index shard currently being used
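
The lazy reading mentioned for getCurrentShard() follows a common pattern: the file behind an object is read only when the data is first requested. The sketch below illustrates that general pattern and is not Yioop's code; `LazyShardHolder` and its `$reads` counter are hypothetical, and the shard is represented by a plain string read from a temporary file.

```php
<?php
// Hypothetical illustration of lazy loading of a shard file.
class LazyShardHolder
{
    private string $path;
    private ?string $contents = null;
    public int $reads = 0; // counts actual disk reads, for demonstration

    public function __construct(string $path)
    {
        // Constructing the holder does not touch the file.
        $this->path = $path;
    }

    public function getShard(): string
    {
        if ($this->contents === null) {
            $data = file_get_contents($this->path);
            if ($data === false) {
                throw new RuntimeException("Cannot read {$this->path}");
            }
            $this->contents = $data;
            $this->reads++;
        }
        return $this->contents;
    }
}

$path = tempnam(sys_get_temp_dir(), 'shard');
file_put_contents($path, 'posting data');
$holder = new LazyShardHolder($path);
$before = $holder->reads;    // no read has happened yet
$data = $holder->getShard(); // first call reads the file
$holder->getShard();         // second call reuses the cached contents
unlink($path);
```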

getPage()

Gets the page out of the summaries WebArchiveBundle with the given offset and generation

public getPage(int $offset[, int $generation = -1 ]) : array<string|int, mixed>
Parameters
$offset : int

byte offset in partition of desired page

$generation : int = -1

which generation WebArchive to look up in; defaults to the same number as the current shard

Return values
array<string|int, mixed>

desired page

getParamModifiedTime()

Returns the last time the archive info of the bundle was modified.

public static getParamModifiedTime(string $dir_name) : mixed
Parameters
$dir_name : string

folder with archive bundle

Return values
mixed

setArchiveInfo()

Sets the archive info struct for the web archive bundle associated with this bundle. This struct has fields like: DESCRIPTION (serialized store of global parameters of the crawl like seed sites, timestamp, etc), COUNT (num urls seen + pages seen stored), VISITED_URLS_COUNT (number of pages seen while crawling), NUM_DOCS_PER_PARTITION (how many docs per web archive in the bundle).

public static setArchiveInfo(string $dir_name, array<string|int, mixed> $info) : mixed
Parameters
$dir_name : string

folder with archive bundle

$info : array<string|int, mixed>

struct with above fields

Return values
mixed
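
For orientation, an info struct with the fields listed for setArchiveInfo() might look like the array below. All values are made up, and the key names inside the serialized DESCRIPTION (`seed_sites`, `timestamp`) are assumptions chosen to mirror the prose description, not Yioop's actual crawl parameter names.

```php
<?php
// Hypothetical crawl parameters; key names here are invented.
$crawl_params = [
    'seed_sites' => ['https://www.example.com/feed'],
    'timestamp' => 1700000000,
];

// Info struct with the four documented fields.
$info = [
    'DESCRIPTION' => serialize($crawl_params), // serialized crawl parameters
    'COUNT' => 1250,                 // urls seen + pages seen stored
    'VISITED_URLS_COUNT' => 1000,    // pages seen while crawling
    'NUM_DOCS_PER_PARTITION' => 50000,
];

// Reading the description back restores the original array.
$restored = unserialize($info['DESCRIPTION']);
```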

setCurrentShard()

Sets the current shard to be the $i th shard in the index bundle.

public setCurrentShard( $i[,  $disk_based = false ]) : mixed
Parameters
$i :

which shard to set the current shard to be

$disk_based : = false

whether to read the whole shard in before using it or leave it on disk except for the pages needed

Return values
mixed

stopIndexing()

Used when a crawl stops to perform final dictionary operations to produce a working stand-alone index.

public stopIndexing() : mixed
Return values
mixed

updateShardsAndDictionary()

Determines, based on its size, whether index_shard should be added to the active generation or a new generation should be started.

public updateShardsAndDictionary(int $add_num_docs[, object $callback = null ][, bool $blocking = false ]) : int

If a new generation is needed, it is started, the old generation is saved, the dictionary of the old shard is copied to the bundle's dictionary, and a log-merge is performed if needed.

Parameters
$add_num_docs : int

number of docs in the shard about to be added

$callback : object = null

object with join function to be called if process is taking too long

$blocking : bool = false

whether there is an ongoing merge tiers operation occurring, if so don't do anything and return -1

Return values
int

the active generation after the check and possible change has been performed
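
The decision described for updateShardsAndDictionary() can be sketched as a small function. This is an illustration under stated assumptions rather than the actual method: `chooseGeneration` is a hypothetical name, `$state` mimics `$generation_info` with its ACTIVE index plus an invented NUM_DOCS counter, and the test "current docs plus incoming docs exceeds the per-generation limit" is an assumed reading of "based on its size".

```php
<?php
/**
 * Hypothetical sketch of the generation-advance decision.
 * Returns -1 when a merge is in progress, otherwise the active
 * generation index after any advance.
 */
function chooseGeneration(array &$state, int $add_num_docs,
    int $num_docs_per_generation, bool $merge_in_progress = false): int
{
    if ($merge_in_progress) {
        return -1; // documented result: do nothing while merging
    }
    if ($state['NUM_DOCS'] + $add_num_docs > $num_docs_per_generation) {
        // A real bundle would save the old shard and copy its
        // dictionary into the bundle's dictionary at this point.
        $state['ACTIVE']++;
        $state['NUM_DOCS'] = 0;
    }
    $state['NUM_DOCS'] += $add_num_docs;
    return $state['ACTIVE'];
}

$state = ['ACTIVE' => 0, 'NUM_DOCS' => 8];
$first = chooseGeneration($state, 1, 10);       // 9 docs fit, generation stays 0
$second = chooseGeneration($state, 3, 10);      // 12 would exceed 10, advance to 1
$third = chooseGeneration($state, 1, 10, true); // merge ongoing, returns -1
```

Calling the check before adding data matches the note on addIndexData(), which expects the generation to have been settled first.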


        
