FeedArchiveBundle
extends IndexArchiveBundle
Subclass of IndexArchiveBundle with Bloom filters that make it easy to check whether a news feed item has already been added to the bundle before adding it
Table of Contents
- FORCE_ADVANCE_SIZE = 120000000
- Threshold index shard size beyond which we force the generation to advance
- NO_LOAD_SIZE = 50000000
- Threshold beyond which we don't load the old index shard when restarting and instead just advance to a new shard
- $current_shard : object
- Index Shard for current generation inverted word index
- $description : string
- A short text name for this IndexArchiveBundle
- $dictionary : object
- IndexDictionary for all shards in the IndexArchiveBundle. This contains entries of the form (word, num_shards with word, posting list info for 0th shard containing the word, posting list info for 1st shard containing the word, ...)
- $dir_name : string
- Folder name to use for this IndexArchiveBundle
- $filter_a : BloomFilterFile
- Used to store unique identifiers of feed items that have been stored in this FeedArchiveBundle. filter_a is used for checking whether items are already in the archive. Once it holds URL_FILTER_SIZE/2 items, new items are added to filter_b as well as filter_a. When filter_a reaches URL_FILTER_SIZE items, filter_a is deleted, filter_b is renamed to filter_a, and the process repeats.
- $filter_b : BloomFilterFile
- Auxiliary BloomFilterFile used in checking if feed items are in this archive or not.
- $generation_info : array<string|int, mixed>
- Structure containing info about the current generation: its index (ACTIVE) and the number of words it contains (NUM_WORDS).
- $incremental : bool
- For a non-read-only archive, whether we build the index archive incrementally, adding new documents periodically, or instead rebuild the whole index archive each time we forceSave.
- $num_docs_per_generation : int
- Number of docs before a new generation is started
- $num_partitions_summaries : int
- Number of partitions in the summaries WebArchiveBundle
- $summaries : object
- WebArchiveBundle for web page summaries
- $version : int
- What version of index archive bundle this is
- __construct() : mixed
- Makes or initializes a FeedArchiveBundle with the provided parameters
- addActiveShardDictionary() : mixed
- Adds the words from this shard to the dictionary
- addAdvanceGeneration() : mixed
- Starts a new generation; the dictionary of the old shard is copied to the bundle's dictionary and a log-merge is performed if needed. This function may be called by updateShardsAndDictionary, as well as when resuming a crawl, rather than loading the periodically saved index of a shard that is too large.
- addFilters() : mixed
- Adds the key (often a GUID) of a feed item to the Bloom filter pair associated with this archive. This always adds to filter a; if filter a is more than half full, it also adds to filter b. If filter a is full, it is deleted and filter b is renamed filter a, and the process continues, with a new filter b being created when the new filter a becomes half full.
- addIndexData() : mixed
- Adds the provided mini inverted index data to the IndexArchiveBundle. Expects updateShardsAndDictionary to be called before, so the generation is correct
- addPages() : mixed
- Adds the array of $pages to the summaries WebArchiveBundle; the pages are stored in the partition $generation, and the resulting offsets are recorded in the field named by $offset_field.
- addPagesAndSeenKeys() : mixed
- Adds the array of $pages to the summaries WebArchiveBundle; the pages are stored in the partition $generation, and the resulting offsets are recorded in the field named by $offset_field.
- buildInvertedIndexShard() : mixed
- Builds an inverted index shard for the current generation's index shard.
- contains() : bool
- Whether the active filter for this feed contains the feed item with the supplied key
- forceSave() : mixed
- Forces the current shard to be saved
- getActiveShard() : object
- Sets the current shard to be the active shard (the active shard is what we call the last (highest indexed) shard in the bundle), then returns a reference to this shard
- getArchiveInfo() : array<string|int, mixed>
- Gets the description, count of summaries, and number of partitions of the summaries store in the supplied directory. If the file arc_description.txt exists, this is viewed as a dummy index archive for the sole purpose of allowing conversions of downloaded data such as arc files into Yioop! format.
- getCurrentShard() : object
- Returns the shard which is currently being used to read word-document data from the bundle. If one wants to write data to the bundle use getActiveShard() instead. The point of this method is to allow for lazy reading of the file associated with the shard.
- getPage() : array<string|int, mixed>
- Gets the page out of the summaries WebArchiveBundle with the given offset and generation
- getParamModifiedTime() : mixed
- Returns the last time the archive info of the bundle was modified.
- setArchiveInfo() : mixed
- Sets the archive info struct for the web archive bundle associated with this bundle. This struct has fields like: DESCRIPTION (serialized store of global parameters of the crawl like seed sites, timestamp, etc.), COUNT (num urls seen + pages seen stored), VISITED_URLS_COUNT (number of pages seen while crawling), NUM_DOCS_PER_PARTITION (how many docs per web archive in the bundle).
- setCurrentShard() : mixed
- Sets the current shard to be the $i-th shard in the index bundle.
- stopIndexing() : mixed
- Used when a crawl stops to perform final dictionary operations to produce a working stand-alone index.
- updateShardsAndDictionary() : int
- Determines, based on its size, whether index_shard should be added to the active generation or a new generation should be started.
Constants
FORCE_ADVANCE_SIZE
Threshold index shard size beyond which we force the generation to advance
public mixed FORCE_ADVANCE_SIZE = 120000000
NO_LOAD_SIZE
Threshold beyond which we don't load the old index shard when restarting and instead just advance to a new shard
public mixed NO_LOAD_SIZE = 50000000
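A minimal sketch of how these two thresholds might be applied, assuming they are compared against a shard's size in bytes (the unit is not stated above) and that "beyond" means strictly greater. The helper names `shouldLoadShardOnRestart` and `mustAdvanceGeneration` are hypothetical and are not part of FeedArchiveBundle.

```php
<?php
// Values copied from the constants documented above.
const FORCE_ADVANCE_SIZE = 120000000;
const NO_LOAD_SIZE = 50000000;

/**
 * Hypothetical helper: on restart, load the old shard only if it is
 * not larger than NO_LOAD_SIZE; otherwise a new shard would be started.
 * Assumes $shard_size is measured in bytes.
 */
function shouldLoadShardOnRestart(int $shard_size): bool
{
    return $shard_size <= NO_LOAD_SIZE;
}

/**
 * Hypothetical helper: force the generation to advance once the shard
 * has grown beyond FORCE_ADVANCE_SIZE.
 */
function mustAdvanceGeneration(int $shard_size): bool
{
    return $shard_size > FORCE_ADVANCE_SIZE;
}
```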
Properties
$current_shard
Index Shard for current generation inverted word index
public object $current_shard
$description
A short text name for this IndexArchiveBundle
public string $description
$dictionary
IndexDictionary for all shards in the IndexArchiveBundle. This contains entries of the form (word, num_shards with word, posting list info for 0th shard containing the word, posting list info for 1st shard containing the word, ...)
public object $dictionary
$dir_name
Folder name to use for this IndexArchiveBundle
public string $dir_name
$filter_a
Used to store unique identifiers of feed items that have been stored in this FeedArchiveBundle. filter_a is used for checking whether items are already in the archive. Once it holds URL_FILTER_SIZE/2 items, new items are added to filter_b as well as filter_a. When filter_a reaches URL_FILTER_SIZE items, filter_a is deleted, filter_b is renamed to filter_a, and the process repeats.
public BloomFilterFile $filter_a
$filter_b
Auxiliary BloomFilterFile used in checking if feed items are in this archive or not.
public BloomFilterFile $filter_b
@see $filter_a
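The rotation between the two filters can be sketched as follows. This is a simplified in-memory model written for PHP 8, not the class's actual code: plain arrays stand in for BloomFilterFile (so membership is exact rather than probabilistic), and the class name `FilterPairSketch` and its `$capacity` parameter, which plays the role of URL_FILTER_SIZE, are hypothetical.

```php
<?php
/**
 * In-memory stand-in for the filter_a / filter_b pair.
 * Hypothetical class; arrays replace BloomFilterFile objects.
 */
class FilterPairSketch
{
    /** @var array<string, bool> keys held by the primary filter */
    private array $filter_a = [];
    /** @var array<string, bool>|null secondary filter, created on demand */
    private ?array $filter_b = null;

    public function __construct(private int $capacity)
    {
    }

    public function addFilters(string $key): void
    {
        // Every key always goes into filter_a.
        $this->filter_a[$key] = true;
        // Once filter_a is more than half full, mirror keys into filter_b.
        if (count($this->filter_a) > intdiv($this->capacity, 2)) {
            $this->filter_b ??= [];
            $this->filter_b[$key] = true;
        }
        // When filter_a is full, discard it and promote filter_b.
        if (count($this->filter_a) >= $this->capacity) {
            $this->filter_a = $this->filter_b ?? [];
            $this->filter_b = null;
        }
    }

    public function contains(string $key): bool
    {
        // Only the primary filter is consulted for membership.
        return isset($this->filter_a[$key]);
    }
}
```

In this sketch, with a capacity of 4, adding four distinct keys triggers one rotation: the two oldest keys are forgotten and the two newest remain visible to `contains()`.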
$generation_info
Structure containing info about the current generation: its index (ACTIVE) and the number of words it contains (NUM_WORDS).
public array<string|int, mixed> $generation_info
$incremental
For a non-read-only archive, whether we build the index archive incrementally, adding new documents periodically, or instead rebuild the whole index archive each time we forceSave.
public bool $incremental
$num_docs_per_generation
Number of docs before a new generation is started
public int $num_docs_per_generation
$num_partitions_summaries
Number of partitions in the summaries WebArchiveBundle
public int $num_partitions_summaries
$summaries
WebArchiveBundle for web page summaries
public object $summaries
$version
What version of index archive bundle this is
public int $version
Methods
__construct()
Makes or initializes a FeedArchiveBundle with the provided parameters
public __construct(string $dir_name[, bool $read_only_archive = true ][, string $description = null ][, int $num_docs_per_generation = CNUM_DOCS_PER_PARTITION ]) : mixed
Parameters
- $dir_name : string
-
folder name to store this bundle
- $read_only_archive : bool = true
-
whether to open archive only for reading or reading and writing
- $description : string = null
-
a text name/serialized info about this IndexArchiveBundle
- $num_docs_per_generation : int = CNUM_DOCS_PER_PARTITION
-
the number of pages to be stored in a single shard
Return values
mixed
addActiveShardDictionary()
Adds the words from this shard to the dictionary
public addActiveShardDictionary([object $callback = null ]) : mixed
Parameters
- $callback : object = null
-
object with join function to be called if process is taking too long
Return values
mixed
addAdvanceGeneration()
Starts a new generation; the dictionary of the old shard is copied to the bundle's dictionary and a log-merge is performed if needed. This function may be called by updateShardsAndDictionary, as well as when resuming a crawl, rather than loading the periodically saved index of a shard that is too large.
public addAdvanceGeneration([object $callback = null ]) : mixed
Parameters
- $callback : object = null
-
object with join function to be called if process is taking too long
Return values
mixed
addFilters()
Adds the key (often a GUID) of a feed item to the Bloom filter pair associated with this archive. This always adds to filter a; if filter a is more than half full, it also adds to filter b. If filter a is full, it is deleted and filter b is renamed filter a, and the process continues, with a new filter b being created when the new filter a becomes half full.
public addFilters(string $key) : mixed
Parameters
- $key : string
-
unique identifier of a feed item
Return values
mixed
addIndexData()
Adds the provided mini inverted index data to the IndexArchiveBundle. Expects updateShardsAndDictionary to be called before, so the generation is correct
public addIndexData(object $index_shard) : mixed
Parameters
- $index_shard : object
-
a mini inverted index of word_key=>doc data to add to this IndexArchiveBundle
Return values
mixed
addPages()
Adds the array of $pages to the summaries WebArchiveBundle; the pages are stored in the partition $generation, and the resulting offsets are recorded in the field named by $offset_field.
public addPages(int $generation, string $offset_field, array<string|int, mixed> &$pages, int $visited_urls_count) : mixed
Parameters
- $generation : int
-
field used to select partition
- $offset_field : string
-
field used to record offsets after storing
- $pages : array<string|int, mixed>
-
data to store
- $visited_urls_count : int
-
number to add to the count of visited urls (visited urls is a smaller number than the total count of objects stored in the index).
Return values
mixed
addPagesAndSeenKeys()
Adds the array of $pages to the summaries WebArchiveBundle; the pages are stored in the partition $generation, and the resulting offsets are recorded in the field named by $offset_field.
public addPagesAndSeenKeys(int $generation, string $offset_field, string $key_field, array<string|int, mixed> &$pages, int $visited_urls_count) : mixed
Parameters
- $generation : int
-
field used to select partition
- $offset_field : string
-
field used to record offsets after storing
- $key_field : string
-
field used to store the unique identifier for each page item.
- $pages : array<string|int, mixed>
-
data to store
- $visited_urls_count : int
-
number to add to the count of visited urls (visited urls is a smaller number than the total count of objects stored in the index).
Return values
mixed
buildInvertedIndexShard()
Builds an inverted index shard for the current generation's index shard.
public buildInvertedIndexShard() : mixed
Return values
mixed
contains()
Whether the active filter for this feed contains the feed item with the supplied key
public contains(string $key) : bool
Parameters
- $key : string
-
the feed item id to check if in archive
Return values
bool —true if it is in the archive, false otherwise
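A caller-side pattern for avoiding duplicate feed items might look like the sketch below. It relies only on the `contains()` and `addFilters()` signatures documented on this page; the helper `filterUnseenItems`, the anonymous stand-in class, and the `guid` field name are hypothetical, introduced so the example runs without Yioop.

```php
<?php
/**
 * Hypothetical helper: keep only the feed items whose key has not been
 * seen yet, and record the keys of the items that are kept.
 * $archive is any object offering contains(string): bool and
 * addFilters(string), as documented above.
 *
 * @param object $archive
 * @param array<int, array<string, mixed>> $items
 * @param string $key_field name of the field holding each item's unique id
 * @return array<int, array<string, mixed>>
 */
function filterUnseenItems(object $archive, array $items, string $key_field): array
{
    $fresh = [];
    foreach ($items as $item) {
        $key = (string) $item[$key_field];
        if ($archive->contains($key)) {
            // Already archived (or, with a real Bloom filter, a false positive).
            continue;
        }
        $archive->addFilters($key);
        $fresh[] = $item;
    }
    return $fresh;
}

// Stand-in for a FeedArchiveBundle so the sketch is self-contained.
$archive = new class {
    private array $seen = [];
    public function contains(string $key): bool
    {
        return isset($this->seen[$key]);
    }
    public function addFilters(string $key): void
    {
        $this->seen[$key] = true;
    }
};

$items = [
    ['guid' => 'a1', 'title' => 'First'],
    ['guid' => 'b2', 'title' => 'Second'],
    ['guid' => 'a1', 'title' => 'First again'],
];
$fresh = filterUnseenItems($archive, $items, 'guid');
```

Because a Bloom filter can report false positives, a real archive may occasionally skip an item it has never stored; it will not report a stored key as absent while that key is still in the active filter.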
forceSave()
Forces the current shard to be saved
public forceSave() : mixed
Return values
mixed
getActiveShard()
Sets the current shard to be the active shard (the active shard is what we call the last (highest indexed) shard in the bundle), then returns a reference to this shard
public getActiveShard() : object
Return values
object —last shard in the bundle
getArchiveInfo()
Gets the description, count of summaries, and number of partitions of the summaries store in the supplied directory. If the file arc_description.txt exists, this is viewed as a dummy index archive for the sole purpose of allowing conversions of downloaded data such as arc files into Yioop! format.
public static getArchiveInfo(string $dir_name) : array<string|int, mixed>
Parameters
- $dir_name : string
-
path to a directory containing a summaries WebArchiveBundle
Return values
array<string|int, mixed> —summary of the given archive
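For illustration, an archive info struct of the kind handled by getArchiveInfo() and setArchiveInfo() might look like the following. The field names are the ones listed in the setArchiveInfo() description; all values, and the keys inside the crawl parameters, are invented, and treating the fields as string array keys is an assumption.

```php
<?php
// Hypothetical crawl parameters; the real contents of DESCRIPTION are
// whatever the crawl stored. These keys are invented for illustration.
$crawl_params = [
    'seed_sites' => ['https://www.example.com/'],
    'timestamp' => 1700000000,
];

// Field names follow the setArchiveInfo() description; values are made up.
$info = [
    'DESCRIPTION' => serialize($crawl_params),
    'COUNT' => 1500,
    'VISITED_URLS_COUNT' => 1200,
    'NUM_DOCS_PER_PARTITION' => 50000,
];

// Reading the crawl parameters back out of the struct.
$restored = unserialize($info['DESCRIPTION'], ['allowed_classes' => false]);
```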
getCurrentShard()
Returns the shard which is currently being used to read word-document data from the bundle. If one wants to write data to the bundle use getActiveShard() instead. The point of this method is to allow for lazy reading of the file associated with the shard.
public getCurrentShard([bool $force_read = false ]) : object
Parameters
- $force_read : bool = false
-
whether to force a read with no advance-generation or dictionary-merge side effects
Return values
object —the current index shard
getPage()
Gets the page out of the summaries WebArchiveBundle with the given offset and generation
public getPage(int $offset[, int $generation = -1 ]) : array<string|int, mixed>
Parameters
- $offset : int
-
byte offset in partition of desired page
- $generation : int = -1
-
which generation WebArchive to look up in; defaults to the same number as the current shard
Return values
array<string|int, mixed> —desired page
getParamModifiedTime()
Returns the last time the archive info of the bundle was modified.
public static getParamModifiedTime(string $dir_name) : mixed
Parameters
- $dir_name : string
-
folder with archive bundle
Return values
mixed
setArchiveInfo()
Sets the archive info struct for the web archive bundle associated with this bundle. This struct has fields like: DESCRIPTION (serialized store of global parameters of the crawl like seed sites, timestamp, etc.), COUNT (num urls seen + pages seen stored), VISITED_URLS_COUNT (number of pages seen while crawling), NUM_DOCS_PER_PARTITION (how many docs per web archive in the bundle).
public static setArchiveInfo(string $dir_name, array<string|int, mixed> $info) : mixed
Parameters
- $dir_name : string
-
folder with archive bundle
- $info : array<string|int, mixed>
-
struct with above fields
Return values
mixed
setCurrentShard()
Sets the current shard to be the $i-th shard in the index bundle.
public setCurrentShard( $i[, $disk_based = false ]) : mixed
Parameters
- $i :
-
which shard to set the current shard to be
- $disk_based : = false
-
whether to read the whole shard in before using it or leave it on disk except for the pages needed
Return values
mixed
stopIndexing()
Used when a crawl stops to perform final dictionary operations to produce a working stand-alone index.
public stopIndexing() : mixed
Return values
mixed
updateShardsAndDictionary()
Determines, based on its size, whether index_shard should be added to the active generation or a new generation should be started.
public updateShardsAndDictionary(int $add_num_docs[, object $callback = null ][, bool $blocking = false ]) : int
If a new generation is needed, it is started, the old generation is saved, the dictionary of the old shard is copied to the bundle's dictionary, and a log-merge is performed if needed.
Parameters
- $add_num_docs : int
-
number of docs in the shard about to be added
- $callback : object = null
-
object with join function to be called if process is taking too long
- $blocking : bool = false
-
whether there is an ongoing merge tiers operation; if so, don't do anything and return -1
Return values
int —the active generation after the check and possible change has been performed
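The decision described for updateShardsAndDictionary() can be modeled roughly as below. This is a sketch under stated assumptions, written for PHP 8: it assumes the size check compares document counts against `$num_docs_per_generation`, it omits the `$callback` parameter, and it leaves out saving the shard, copying its dictionary, and the log-merge. The class `GenerationSketch` and its method `chooseGeneration` are hypothetical. Unlike the documented flow, where addIndexData() is a separate call, the sketch also records the added documents itself to stay self-contained.

```php
<?php
/**
 * Hypothetical, simplified model of the generation bookkeeping.
 * Tracks only a generation number and a document count.
 */
class GenerationSketch
{
    /** Index of the active generation. */
    public int $active = 0;
    /** Number of docs already in the active generation. */
    public int $num_docs = 0;

    public function __construct(private int $num_docs_per_generation)
    {
    }

    /**
     * @return int the active generation after the check, or -1 when
     *      $blocking is true and nothing was done
     */
    public function chooseGeneration(int $add_num_docs, bool $blocking = false): int
    {
        if ($blocking) {
            // A merge is assumed to be in progress: do nothing.
            return -1;
        }
        if ($this->num_docs + $add_num_docs > $this->num_docs_per_generation) {
            // The incoming shard would not fit: start a new generation.
            $this->active++;
            $this->num_docs = 0;
        }
        $this->num_docs += $add_num_docs;
        return $this->active;
    }
}
```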