IndexManager
in package
implements
CrawlConstants
Manages open IndexArchiveBundle objects while a query is being performed. It provides a single place to obtain references to these bundles and ensures that only one object per bundle is instantiated, in a singleton-like way.
Interfaces, Classes, Traits and Enums
- CrawlConstants
- Shared constants and enums used by components that are involved in the crawling process
Table of Contents
- INDEX_CACHE_SIZE = 1000
- Max number of IndexArchiveBundles that can be cached
- $index_times : array<string|int, mixed>
- List of entries of the form name of bundle => time when cached
- $indexes : array<string|int, mixed>
- Open IndexArchiveBundle objects managed by this manager
- clearCache() : mixed
- Clears the static variables that cache loaded indexes and dictionary info.
- discountedNumDocsTerm() : int
- Returns the number of documents in which a given term or phrase appears in the given index, discounting later generations (those with lower document rank) more heavily
- getIndex() : object
- Returns a reference to the managed copy of an IndexArchiveBundle object with a given timestamp or feed (for handling media feeds)
- getVersion() : int
- Returns the version of the index, so that Yioop can determine how to do word lookup. The only major change to the format was when word_ids went from 8 to 20 bytes, which happened around Unix time 1369754208.
- getWordInfo() : array<string|int, mixed>
- Gets an array of posting list positions for each shard in the bundle $index_name for the word id $term_id
Constants
INDEX_CACHE_SIZE
Max number of IndexArchiveBundles that can be cached
public
mixed
INDEX_CACHE_SIZE
= 1000
Properties
$index_times
List of entries of the form name of bundle => time when cached
public
static array<string|int, mixed>
$index_times
= []
$indexes
Open IndexArchiveBundle objects managed by this manager
public
static array<string|int, mixed>
$indexes
= []
Methods
clearCache()
Clears the static variables that cache loaded indexes and dictionary info.
public
static clearCache() : mixed
Return values
mixed
discountedNumDocsTerm()
Returns the number of documents in which a given term or phrase appears in the given index, discounting later generations (those with lower document rank) more heavily
public
static discountedNumDocsTerm(string $term, string $index_name) : int
Parameters
- $term : string
-
what to look up in the index's dictionary; no mask is used for this lookup
- $index_name : string
-
index to look up term or phrase in
Return values
int —number of documents
getIndex()
Returns a reference to the managed copy of an IndexArchiveBundle object with a given timestamp or feed (for handling media feeds)
public
static getIndex(string $index_name) : object
Parameters
- $index_name : string
-
timestamp of desired IndexArchiveBundle
Return values
object —the desired IndexArchiveBundle reference
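The caching behaviour described for $indexes, $index_times and INDEX_CACHE_SIZE can be pictured with a minimal sketch. The class name BundleCacheSketch, the small cache size, the placeholder bundle object and the evict-oldest policy are all assumptions made for illustration; this is not Yioop's implementation.

```php
<?php
// Hypothetical stand-in for a per-name bundle cache; not Yioop's code.
class BundleCacheSketch
{
    // Kept small for illustration; the documented INDEX_CACHE_SIZE is 1000.
    const CACHE_SIZE = 3;
    // bundle name => cached object
    public static $indexes = [];
    // bundle name => time when cached
    public static $index_times = [];

    public static function getIndex(string $index_name): object
    {
        if (!isset(self::$indexes[$index_name])) {
            if (count(self::$indexes) >= self::CACHE_SIZE) {
                // Assumed policy: drop the entry that was cached longest ago.
                asort(self::$index_times);
                $oldest = array_key_first(self::$index_times);
                unset(self::$indexes[$oldest], self::$index_times[$oldest]);
            }
            // Placeholder for opening the real bundle from disk.
            self::$indexes[$index_name] = (object) ['name' => $index_name];
            self::$index_times[$index_name] = microtime(true);
        }
        // Every caller asking for the same name gets the same object.
        return self::$indexes[$index_name];
    }
}
```

Calling BundleCacheSketch::getIndex() twice with the same name returns the same object, which is the "one object per bundle" property the class description refers to.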
getVersion()
Returns the version of the index, so that Yioop can determine how to do word lookup. The only major change to the format was when word_ids went from 8 to 20 bytes, which happened around Unix time 1369754208.
public
static getVersion(string $index_name) : int
Parameters
- $index_name : string
-
unix timestamp of index
Return values
int —0 if the index uses the original Yioop index format; 1 if it uses the 20-byte word_id format
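One way to picture the version check is a comparison of the index timestamp against the cutoff quoted above. The function name indexVersionSketch, the constant name, and the choice to treat the cutoff itself as version 1 are assumptions; the real method may inspect the bundle rather than rely on the timestamp alone.

```php
<?php
// Cutoff taken from the description above; the constant name is hypothetical.
const WORD_ID_FORMAT_CHANGE = 1369754208;

// Hypothetical helper, not Yioop's getVersion().
function indexVersionSketch(string $index_name): int
{
    // Assumption: $index_name is a numeric Unix timestamp string.
    // Assumption: an index created exactly at the cutoff is already version 1.
    return ((int) $index_name >= WORD_ID_FORMAT_CHANGE) ? 1 : 0;
}
```

Under these assumptions, a caller would pick the 8-byte word_id lookup when the result is 0 and the 20-byte lookup when it is 1.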
getWordInfo()
Gets an array of posting list positions for each shard in the bundle $index_name for the word id $term_id
public
static getWordInfo(string $index_name, string $term_id[, int $threshold = -1 ][, int $start_generation = -1 ][, int $num_distinct_generations = -1 ][, bool $with_remaining_total = false ]) : array<string|int, mixed>
Parameters
- $index_name : string
-
bundle to look for $term_id in
- $term_id : string
-
id of phrase or word to look up in bundle dictionary
- $threshold : int = -1
-
once the number of results exceeds this amount, stop looking for more dictionary entries
- $start_generation : int = -1
-
generation in the index from which to start finding occurrences of the phrase
- $num_distinct_generations : int = -1
-
how many generations to search forward from $start_generation
- $with_remaining_total : bool = false
-
whether to also return the total number of postings found
Return values
array<string|int, mixed> —either [total, sequence of 4-tuples] or a sequence of 4-tuples, where each tuple is (index_shard generation, posting_list_offset, length, exact id that matches $term_id)
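Because the return value has two shapes depending on $with_remaining_total, a caller has to unpack it accordingly. The helper below is a hypothetical sketch of that unpacking, working on plain arrays; the function name and the key names in the returned rows are illustrative and do not come from Yioop.

```php
<?php
// Hypothetical helper for reading a getWordInfo()-style result.
function splitWordInfoSketch(array $result, bool $with_remaining_total): array
{
    if ($with_remaining_total) {
        // Shape: [total, sequence of 4-tuples]
        [$total, $tuples] = $result;
    } else {
        // Shape: sequence of 4-tuples only
        $total = null;
        $tuples = $result;
    }
    $rows = [];
    foreach ($tuples as [$generation, $offset, $length, $matched_id]) {
        // Give each tuple position a readable key.
        $rows[] = compact('generation', 'offset', 'length', 'matched_id');
    }
    return [$total, $rows];
}

// Made-up sample data, only to show the two shapes.
$sample_tuples = [[0, 128, 12, "id_a"], [1, 64, 7, "id_b"]];
[$total, $rows] = splitWordInfoSketch([19, $sample_tuples], true);
[$no_total, $same_rows] = splitWordInfoSketch($sample_tuples, false);
```

Normalising both shapes into one [total, rows] pair keeps the rest of the calling code independent of which flag was passed.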