IndexShard
extends PersistentStructure
in package
implements
CrawlConstants
Data structure used to store one generation's worth of the word document index (inverted index). This data structure consists of three main components: word entries, word_doc entries, and document entries.
Word entries are described in the documentation for the words field. Word-doc entries are described in the documentation for the word_docs field. Document entries are described in the documentation for the doc_infos field.
IndexShards also have two access modes: a $read_only_from_disk mode and a loaded-in-memory mode. Loaded-in-memory mode is mainly for writing new data to the shard. When in memory, data in the shard can also be in one of two states: packed or unpacked. Roughly, when it is in a packed state it is ready to be serialized to disk; when it is in an unpacked state, its methods for adding data can be used.
Serialized on disk, a shard has a header with document statistics, followed by a prefix index into the words component, followed by the words component itself, then the word-docs component, and finally the document component.
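The on-disk ordering described above implies simple offset arithmetic. The following is a minimal sketch, assuming the components are stored back to back with no padding; the function name shardSectionOffsets and the array keys are hypothetical and are not part of IndexShard.

```php
<?php
// Value of IndexShard::HEADER_LENGTH as documented on this page.
const HEADER_LENGTH = 40;

/**
 * Hypothetical helper: given the byte lengths of the variable-length
 * components, returns the byte offset at which each component would start,
 * assuming they follow the header in the documented order without padding.
 */
function shardSectionOffsets(int $prefixes_len, int $words_len,
    int $word_docs_len): array
{
    $prefixes_offset = HEADER_LENGTH;
    $words_offset = $prefixes_offset + $prefixes_len;
    $word_docs_offset = $words_offset + $words_len;
    $doc_infos_offset = $word_docs_offset + $word_docs_len;
    return [
        'prefixes' => $prefixes_offset,
        'words' => $words_offset,
        'word_docs' => $word_docs_offset,
        'doc_infos' => $doc_infos_offset,
    ];
}

// Example with made-up component lengths.
$offsets = shardSectionOffsets(1024, 320, 4000);
```

Under this assumption, the values for word_docs and doc_infos would play the role of the $word_doc_offset and $doc_info_offset properties.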
Tags
Interfaces, Classes, Traits and Enums
- CrawlConstants
- Shared constants and enums used by components that are involved in the crawling process
Table of Contents
- BLANK = "\xff\xff\xff\xff\xff\xff\xff\xff"
- Represents an empty prefix item
- DEFAULT_SAVE_FREQUENCY = 50000
- If not specified in the constructor, this will be the number of operations between saves
- DESCRIPTION_WEIGHT = 2.0
- BM25F weight factor for terms in description
- DOC_ID_LEN = 24
- Length of DOC ID.
- DOC_KEY_LEN = 8
- Length of a key in a DOC ID.
- FLATTEN_FREQUENCY = 10000
- Fraction of NUM_DOCS_PER_PARTITION document inserts before data from the words array is flattened to word_postings. (It will also be flattened during periodic index saves)
- HALF_BLANK = "\xff\xff\xff\xff"
- Flag used to indicate that a word item should not be packed or unpacked
- HEADER_LENGTH = 40
- Header Length of an IndexShard (sum of its non-variable length fields)
- LINK_FLAG = 0x800000
- Used to keep track of whether a record in document infos is for a document or for a link
- LINK_WEIGHT = 1.0
- BM25F weight factor for terms in a link
- MAX_AUX_DOC_KEYS = 200
- Maximum number of auxiliary document keys
- POSTING_LEN = 4
- Length of one posting (a doc offset, occurrence pair) in a posting list
- SHARD_BLOCK_POWER = 12
- Shard block size is 1<< this power
- SHARD_BLOCK_SIZE = 4096
- Size in bytes of one block in IndexShard
- STORE_FLAG = "\x80"
- Represents an empty prefix item
- TITLE_WEIGHT = 4.0
- BM25F weight factor for terms in title
- WORD_DATA_LEN = 12
- Length of the data portion of a word entry in bytes in the shard
- WORD_KEY_LEN = 20
- Length of a word entry's key in bytes
- WORD_POSTING_COPY_LEN = 32000
- Bytes of tmp string allowed during flattenings
- $blocks : array<string|int, mixed>
- A cached array of disk blocks for an index shard that has not been completely loaded into memory.
- $blocks_words : array<string|int, mixed>
- Stores $blocks contents in (32 bit) unsigned int
- $doc_info_offset : int
- Holds offset of the doc_infos strings
- $doc_infos : string
- Stores document ids and links to document ids together with summary offset information and the number of words in the doc/link. The format for a record is a 4 byte offset, followed by 3 bytes for the document length, followed by 1 byte containing the number of 8 byte doc key strings that make up the doc id (2 for a doc, 3 for a link), followed by the doc key strings themselves.
- $docids_len : int
- Length of $doc_infos as a string
- $fh : resource
- File handle for a shard if we are going to use it in read mode and not completely load it.
- $file_len : int
- Keeps track of the length of the shard as a file
- $filename : string
- Name of the file in which to store the PersistentStructure
- $generation : int
- This is supposed to hold the number of earlier shards, prior to the current shard.
- $hash_name : string
- Used to hold the computed 8 byte hash of the index shard filename
- $last_flattened_words_count : mixed
- Number of document inserts since the last time word data was flattened to the word_postings string.
- $len_all_docs : int
- Number of words stored in total in all documents in this shard
- $len_all_link_docs : int
- Number of words stored in total in all links in this shard
- $num_docs : int
- Number of documents (not links) stored in this shard
- $num_docs_per_generation : int
- This is supposed to hold the number of documents that a given shard can hold.
- $num_docs_word : array<string|int, mixed>
- Keeps track of the number of documents a word is in
- $num_link_docs : int
- Number of links (not documents) stored in this shard
- $prefixes : array<string|int, mixed>
- An array representing offsets into the words dictionary of the index of the first occurrence of a two byte prefix of a word_id.
- $prefixes_len : int
- Length of the prefix index into the dictionary of the shard
- $read_only_from_disk : bool
- Flag used to determine if this shard is going to be largely kept on disk and be in read-only mode. Otherwise, the shard is assumed to be completely held in memory and to be read/writable.
- $save_frequency : int
- Number of operations between saves. If set to -1, checkSave never saves.
- $unsaved_operations : int
- Number of operations since the last save
- $word_doc_offset : int
- Holds offset of the word_docs strings
- $word_docs : string
- This string is non-empty when shard is loaded and in its packed state.
- $word_docs_len : int
- Length of $word_docs as a string
- $word_docs_packed : bool
- Keeps track of the packed/unpacked state of the word_docs list
- $word_postings : string
- Used to hold word_id, posting_len, posting triples as a memory efficient string
- $words : array<string|int, mixed>
- Stores the array of word entries for this shard. In the packed state, word entries consist of the word id, a generation number, an offset into the word_docs structure where the posting list for that word begins, and a length of this posting list. In the unpacked state, each entry is a string of all the posting items for that word. Periodically, data in this words array is flattened to the word_postings string, which is a more memory efficient way of storing data in PHP.
- $words_len : int
- Stores length of the words array in the shard on disk. Only set if we're in $read_only_from_disk mode
- __construct() : mixed
- Makes an index shard with the given file name and generation offset
- addDocumentWords() : bool
- Add a new document to the index shard with the given summary offset.
- appendIndexShard() : mixed
- Adds the contents of the supplied $index_shard to the current index shard
- binarySearchPostingOffsetDocOffset() : array<string|int, mixed>|bool
- Computes (via binary search) a pair (posting_offset, next_doc_offset) such that next_doc_offset is the next document offset in the passed direction beyond doc_offset in the posting list bounded by indexes $start and $end (indices are bit shifts of offsets, so they are smaller numbers). Returns false if no such pair can be found.
- changeDocumentOffsets() : mixed
- Changes the summary offsets associated with a set of doc_ids to new values. This is needed because the fetcher puts documents in a shard before sending them to a queue_server. It is on the queue_server however where documents are stored in the IndexArchiveBundle and summary offsets are obtained. Thus, the shard needs to be updated at that point. This function should be called when shard unpacked (we check and unpack to be on the safe side).
- checkSave() : mixed
- Adds one to the unsaved_operations count. If this goes above save_frequency, saves the PersistentStructure to secondary storage.
- computeProximity() : int
- Returns a proximity score for a single term based on its location in doc.
- docOffsetFromPostingOffset() : int
- Given an offset of a posting into the word_docs string, looks up the posting there and computes the doc_offset stored in it.
- docStats() : mixed
- Computes BM25F relevance and a score for the supplied item based on the supplied parameters.
- gallopPostingOffsetDocOffset() : int
- Performs a galloping search (doubling the forward jump distance at each failure step) in a posting list, starting from position $current and continuing until either $end is reached or a posting with document index bigger than $doc_index is found.
- getDocIndexOfPostingAtOffset() : int
- Returns the document index of the posting at offset $current in word_docs
- getDocInfoSubstring() : string
- From disk gets $len many bytes starting from $offset in the doc_infos strings
- getPostingAtOffset() : string
- Gets the posting closest to index $current in the word_docs string. Modifies the passed-by-ref variables $posting_start and $posting_end so they are the indexes of the start and end of the posting.
- getPostingsSlice() : array<string|int, mixed>
- Returns documents using the word_docs string (either as stored on disk or completely read in) of records starting at the given offset and using its linked list of records. Traversal of the list stops if an offset larger than $last_offset is seen or $len many docs have been returned. Since $next_offset is passed by reference, the value of $next_offset will point to the next record in the list (if it exists) after the function is called.
- getPostingsSliceById() : array<string|int, mixed>
- Returns $len many documents which contained the word corresponding to $word_id (only works for loaded shards)
- getShardSubstring() : string
- Gets from disk $len many bytes of data beginning at $offset in the current IndexShard
- getShardWord() : int
- Reads 32 bit word as an unsigned int from the offset given in the shard
- getWordDocsSubstring() : string
- From disk gets $len many bytes starting from $offset in the word_docs strings
- getWordDocsWord() : mixed
- Reads 32 bit word as an unsigned int from the offset given in the word_docs string in the shard
- getWordInfo() : array<string|int, mixed>
- Returns the first offset, last offset, and number of documents the word occurred in for this shard. The first offset (similarly, the last offset) is the byte offset into the word_docs string of the first (last) record involving that word.
- getWordInfoFromString() : array<string|int, mixed>
- Converts $str into 3 ints for a first offset into word_docs, a last offset into word_docs, and a count of number of docs with that word.
- getWordString() : mixed
- Returns a word record (word key + posting lookup data) from the shard posting list
- headerToShardFields() : mixed
- Splits a header string into a shard's field variables
- load() : IndexShard
- Load an IndexShard from a file or string
- makeItem() : array<string|int, mixed>
- Returns (docid, item) where item has document statistics (summary offset, relevance, doc rank, and score) for the document given by the supplied posting, based on the posting list's number of docs with the word and the number of occurrences of the word in the doc.
- makeWords() : mixed
- Callback function for the load method. Splits a word_key . word_info string into an entry in the passed shard: $shard->words[word_key] = $word_info.
- mergeWordPostingsToString() : mixed
- Used to flatten the words associative array to a more memory efficient word_postings string.
- nextPostingOffsetDocOffset() : array<string|int, mixed>
- Finds the first posting offset between $start_offset and $end_offset of a posting that has a doc_offset bigger than or equal to $doc_offset. This is implemented using a galloping search (double the offset until it overshoots, then binary search).
- numDocsOrLinks() : int
- An upper bound on the number of docs or links represented by the start and ending integer offsets into a posting list.
- outputPostingLists() : mixed
- Used to convert the word_postings string into a word_docs string or if a file handle is provided write out the word_docs sequence of postings to the provided file handle.
- packAuxiliaryDocumentKeys() : string
- Used to pack a list of description scores and user ranks as a string of auxiliary keys for a document map entry in the shard.
- packDoclenNum() : string
- Used to store the length of a document as well as the number of key components in its doc_id as a packed int (4 byte string)
- packValues() : string
- Used to pack either an array of nonnegative ints each less than 65535 or an array of floats. Packing is done into a string of 2 byte shorts, one per entry.
- packWords() : mixed
- Posting lists are initially stored associated with a word as a key value pair. The merge operation then merges these into the word_postings string. packWords separates words from postings.
- postingsSliceAscending() : array<string|int, mixed>
- Returns, as an array, $len posting items in ascending order from the posting list between $start_offset and $last_offset, beginning at $next_offset, while updating the position of $next_offset
- postingsSliceDescending() : array<string|int, mixed>
- Returns, as an array, $len posting items in descending order from the posting list between $start_offset and $last_offset, beginning at $next_offset, while updating the position of $next_offset
- prepareWordsAndPrefixes() : mixed
- Computes the prefix string index for the current words array.
- readBlockShardAtOffset() : mixed
- Reads SHARD_BLOCK_SIZE from the current IndexShard's file beginning at byte offset $bytes
- readShardHeader() : bool
- If not already loaded, reads in from disk the fixed-length field variables of this IndexShard ($this->words_len, etc.)
- save() : string
- Save the IndexShard to its filename
- saveWithoutDictionary() : mixed
- This method re-saves a saved shard without the prefixes and dictionary.
- unpackAuxiliaryDocumentKeys() : array<string|int, mixed>
- Used to unpack a list of description scores and user ranks from a document map entry in the shard. We assume these scores were packed using packAuxiliaryDocumentKeys.
- unpackDoclenNum() : array<string|int, mixed>
- Used to extract from a 32 bit unsigned int, a pair which represents the length of a document together with the number of keys in its doc_id
- unpackValues() : array<string|int, mixed>
- Used to unpack from a string an array of short nonnegative ints or 2 byte floats.
- unpackWordDocs() : mixed
- Takes the word docs string and splits it into posting lists which are assigned to particular words in the words dictionary array.
- weightedCount() : array<string|int, mixed>
- Used to sum over the occurrences in a position list counting with weight based on term location in the document
Constants
BLANK
Represents an empty prefix item
public
mixed
BLANK
= "\xff\xff\xff\xff\xff\xff\xff\xff"
DEFAULT_SAVE_FREQUENCY
If not specified in the constructor, this will be the number of operations between saves
public
int
DEFAULT_SAVE_FREQUENCY
= 50000
DESCRIPTION_WEIGHT
BM25F weight factor for terms in description
public
mixed
DESCRIPTION_WEIGHT
= 2.0
DOC_ID_LEN
Length of DOC ID.
public
mixed
DOC_ID_LEN
= 24
DOC_KEY_LEN
Length of a key in a DOC ID.
public
mixed
DOC_KEY_LEN
= 8
FLATTEN_FREQUENCY
Fraction of NUM_DOCS_PER_PARTITION document inserts before data from the words array is flattened to word_postings. (It will also be flattened during periodic index saves)
public
mixed
FLATTEN_FREQUENCY
= 10000
HALF_BLANK
Flag used to indicate that a word item should not be packed or unpacked
public
mixed
HALF_BLANK
= "\xff\xff\xff\xff"
HEADER_LENGTH
Header Length of an IndexShard (sum of its non-variable length fields)
public
mixed
HEADER_LENGTH
= 40
LINK_FLAG
Used to keep track of whether a record in document infos is for a document or for a link
public
mixed
LINK_FLAG
= 0x800000
LINK_WEIGHT
BM25F weight factor for terms in a link
public
mixed
LINK_WEIGHT
= 1.0
MAX_AUX_DOC_KEYS
Maximum number of auxiliary document keys
public
mixed
MAX_AUX_DOC_KEYS
= 200
POSTING_LEN
Length of one posting (a doc offset, occurrence pair) in a posting list
public
mixed
POSTING_LEN
= 4
SHARD_BLOCK_POWER
Shard block size is 1<< this power
public
mixed
SHARD_BLOCK_POWER
= 12
SHARD_BLOCK_SIZE
Size in bytes of one block in IndexShard
public
mixed
SHARD_BLOCK_SIZE
= 4096
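Because SHARD_BLOCK_SIZE equals 1 << SHARD_BLOCK_POWER, block numbers and block-aligned offsets can be computed with bit shifts. The sketch below illustrates that arithmetic only; the helper names blockStart and blockSpan are hypothetical and do not appear in IndexShard.

```php
<?php
// Constants mirroring the values documented on this page.
const SHARD_BLOCK_POWER = 12;
const SHARD_BLOCK_SIZE = 1 << SHARD_BLOCK_POWER; // 4096

/**
 * Hypothetical helper: rounds a byte offset down to the start of the
 * block that contains it.
 */
function blockStart(int $offset): int
{
    return ($offset >> SHARD_BLOCK_POWER) << SHARD_BLOCK_POWER;
}

/**
 * Hypothetical helper: lists the block numbers touched by a read of
 * $len bytes (with $len >= 1) starting at $offset.
 */
function blockSpan(int $offset, int $len): array
{
    $first = $offset >> SHARD_BLOCK_POWER;
    $last = ($offset + $len - 1) >> SHARD_BLOCK_POWER;
    return range($first, $last);
}
```

A read that straddles a block boundary, such as 10 bytes starting at offset 4090, touches two blocks, which is the case a block cache has to handle.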
STORE_FLAG
Represents an empty prefix item
public
mixed
STORE_FLAG
= "\x80"
TITLE_WEIGHT
BM25F weight factor for terms in title
public
mixed
TITLE_WEIGHT
= 4.0
WORD_DATA_LEN
Length of the data portion of a word entry in bytes in the shard
public
mixed
WORD_DATA_LEN
= 12
WORD_KEY_LEN
Length of a word entry's key in bytes
public
mixed
WORD_KEY_LEN
= 20
WORD_POSTING_COPY_LEN
Bytes of tmp string allowed during flattenings
public
mixed
WORD_POSTING_COPY_LEN
= 32000
Properties
$blocks
A cached array of disk blocks for an index shard that has not been completely loaded into memory.
public
array<string|int, mixed>
$blocks
$blocks_words
Stores $blocks contents in (32 bit) unsigned int
public
array<string|int, mixed>
$blocks_words
$doc_info_offset
Holds offset of the doc_infos strings
public
int
$doc_info_offset
$doc_infos
Stores document ids and links to document ids together with summary offset information and the number of words in the doc/link. The format for a record is a 4 byte offset, followed by 3 bytes for the document length, followed by 1 byte containing the number of 8 byte doc key strings that make up the doc id (2 for a doc, 3 for a link), followed by the doc key strings themselves.
public
string
$doc_infos
In the case of a document, the first doc key string has a hash of the url, the second a hash of a tag-stripped version of the document. In the case of a link, the keys are a unique identifier for the link context, followed by 8 bytes for the hash of the url being pointed to by the link, followed by 8 bytes for the hash of "info:url_pointed_to_by_link".
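The record format described for $doc_infos can be illustrated with pack() and unpack(). This is a sketch under stated assumptions: it assumes big-endian byte order and that the 3 byte length and 1 byte key count share one 32 bit integer with the length in the high bytes. The function names packDocRecord and unpackDocRecord are hypothetical, and the LINK_FLAG handling of the real class is not modeled.

```php
<?php
// Value of IndexShard::DOC_KEY_LEN as documented on this page.
const DOC_KEY_LEN = 8;

/**
 * Hypothetical helper: builds one doc_infos-style record from a summary
 * offset, a document length, and a string of concatenated 8 byte keys.
 */
function packDocRecord(int $summary_offset, int $doc_len,
    string $doc_keys): string
{
    $num_keys = intdiv(strlen($doc_keys), DOC_KEY_LEN);
    // Assumed layout: high 3 bytes hold the length, low byte the key count.
    $len_num = (($doc_len & 0xFFFFFF) << 8) | ($num_keys & 0xFF);
    return pack("N", $summary_offset) . pack("N", $len_num) . $doc_keys;
}

/**
 * Hypothetical helper: splits a record produced by packDocRecord back
 * into its fields.
 */
function unpackDocRecord(string $record): array
{
    $fields = unpack("Noffset/Nlen_num", $record);
    $num_keys = $fields['len_num'] & 0xFF;
    return [
        'summary_offset' => $fields['offset'],
        'doc_len' => $fields['len_num'] >> 8,
        'num_keys' => $num_keys,
        'doc_keys' => substr($record, 8, $num_keys * DOC_KEY_LEN),
    ];
}

// A document record: two 8 byte keys (url hash, content hash placeholders).
$record = packDocRecord(1234, 500, str_repeat("a", 8) . str_repeat("b", 8));
$fields = unpackDocRecord($record);
```

With two keys, the record is 4 + 4 + 16 = 24 bytes long; a link record with three keys would be 32 bytes under the same assumptions.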
$docids_len
Length of $doc_infos as a string
public
int
$docids_len
$fh
File handle for a shard if we are going to use it in read mode and not completely load it.
public
resource
$fh
$file_len
Keeps track of the length of the shard as a file
public
int
$file_len
$filename
Name of the file in which to store the PersistentStructure
public
string
$filename
$generation
This is supposed to hold the number of earlier shards, prior to the current shard.
public
int
$generation
$hash_name
Used to hold the computed 8 byte hash of the index shard filename
public
string
$hash_name
$last_flattened_words_count
Number of document inserts since the last time word data was flattened to the word_postings string.
public
mixed
$last_flattened_words_count
$len_all_docs
Number of words stored in total in all documents in this shard
public
int
$len_all_docs
$len_all_link_docs
Number of words stored in total in all links in this shard
public
int
$len_all_link_docs
$num_docs
Number of documents (not links) stored in this shard
public
int
$num_docs
$num_docs_per_generation
This is supposed to hold the number of documents that a given shard can hold.
public
int
$num_docs_per_generation
$num_docs_word
Keeps track of the number of documents a word is in
public
array<string|int, mixed>
$num_docs_word
$num_link_docs
Number of links (not documents) stored in this shard
public
int
$num_link_docs
$prefixes
An array representing offsets into the words dictionary of the index of the first occurrence of a two byte prefix of a word_id.
public
array<string|int, mixed>
$prefixes
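The idea of a two byte prefix index can be sketched as follows. This is an illustration only: it assumes the word ids are already sorted and records the position of the first word id for each distinct two byte prefix. The function name buildPrefixIndex is hypothetical, and the actual $prefixes array may store additional data per prefix.

```php
<?php
/**
 * Hypothetical helper: maps each two byte prefix (as a 16 bit integer)
 * to the index of the first word id in $sorted_word_ids with that prefix.
 * Each word id is assumed to be at least two bytes long.
 */
function buildPrefixIndex(array $sorted_word_ids): array
{
    $prefixes = [];
    foreach ($sorted_word_ids as $index => $word_id) {
        // Interpret the first two bytes as a big-endian unsigned short.
        $prefix = unpack("n", substr($word_id, 0, 2))[1];
        if (!isset($prefixes[$prefix])) {
            $prefixes[$prefix] = $index;
        }
    }
    return $prefixes;
}

// Made-up, sorted word ids for illustration.
$word_ids = ["aa11", "aa22", "ab00", "ba99"];
$prefix_index = buildPrefixIndex($word_ids);
```

A lookup for a word id can then jump directly to the first dictionary entry sharing its prefix instead of searching the whole dictionary.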
$prefixes_len
Length of the prefix index into the dictionary of the shard
public
int
$prefixes_len
$read_only_from_disk
Flag used to determine if this shard is going to be largely kept on disk and be in read-only mode. Otherwise, the shard is assumed to be completely held in memory and to be read/writable.
public
bool
$read_only_from_disk
$save_frequency
Number of operations between saves. If set to -1, checkSave never saves.
public
int
$save_frequency
$unsaved_operations
Number of operations since the last save
public
int
$unsaved_operations
$word_doc_offset
Holds offset of the word_docs strings
public
int
$word_doc_offset
$word_docs
This string is non-empty when shard is loaded and in its packed state.
public
string
$word_docs
It consists of a sequence of posting records. Each posting consists of an offset into the document entries structure for a document containing the word the posting is for, as well as the number of occurrences of that word in that document.
$word_docs_len
Length of $word_docs as a string
public
int
$word_docs_len
$word_docs_packed
Keeps track of the packed/unpacked state of the word_docs list
public
bool
$word_docs_packed
$word_postings
Used to hold word_id, posting_len, posting triples as a memory efficient string
public
string
$word_postings
$words
Stores the array of word entries for this shard. In the packed state, word entries consist of the word id, a generation number, an offset into the word_docs structure where the posting list for that word begins, and a length of this posting list. In the unpacked state, each entry is a string of all the posting items for that word. Periodically, data in this words array is flattened to the word_postings string, which is a more memory efficient way of storing data in PHP.
public
array<string|int, mixed>
$words
$words_len
Stores length of the words array in the shard on disk. Only set if we're in $read_only_from_disk mode
public
int
$words_len
Methods
__construct()
Makes an index shard with the given file name and generation offset
public
__construct(string $fname, int $generation[, int $num_docs_per_generation = C\NUM_DOCS_PER_PARTITION ][, bool $read_only_from_disk = false ]) : mixed
Parameters
- $fname : string
-
filename to store the index shard with
- $generation : int
-
when returning documents from the shard pretend there are this many earlier documents
- $num_docs_per_generation : int = C\NUM_DOCS_PER_PARTITION
-
the number of documents that a given shard can hold.
- $read_only_from_disk : bool = false
-
used to determine if this shard is going to be largely kept on disk and be in read-only mode. Otherwise, the shard is assumed to be completely held in memory and to be read/writable.
Return values
mixed
addDocumentWords()
Add a new document to the index shard with the given summary offset.
public
addDocumentWords(string $doc_keys, int $summary_offset, array<string|int, mixed> $word_lists[, array<string|int, mixed> $meta_ids = [] ][, bool $is_doc = false ][, mixed $rank = false ][, array<string|int, mixed> $description_scores = [] ][, array<string|int, mixed> $user_ranks = [] ]) : bool
Associate with this document the supplied list of words and word counts. Finally, associate the given meta words with this document.
Parameters
- $doc_keys : string
-
a string of concatenated keys for a document to insert. Each key is assumed to be a string of DOC_KEY_LEN many bytes. This whole set of keys is viewed as fixing one document.
- $summary_offset : int
-
offset into the word archive in which the document's data is stored
- $word_lists : array<string|int, mixed>
-
(word => array of word positions in doc)
- $meta_ids : array<string|int, mixed> = []
-
meta words to be associated with the document. An example meta word would be filetype:pdf for a PDF document.
- $is_doc : bool = false
-
flag used to indicate if what is being scored is a document or a link to a document
- $rank : mixed = false
-
either false if not used, or a 4 bit estimate of the rank of this document item
- $description_scores : array<string|int, mixed> = []
- $user_ranks : array<string|int, mixed> = []
Return values
bool —success or failure of performing the add
appendIndexShard()
Adds the contents of the supplied $index_shard to the current index shard
public
appendIndexShard(object $index_shard) : mixed
Parameters
- $index_shard : object
-
the shard to append to the current shard
Return values
mixed
binarySearchPostingOffsetDocOffset()
Computes (via binary search) a pair (posting_offset, next_doc_offset) such that next_doc_offset is the next document offset in the passed direction beyond doc_offset in the posting list bounded by indexes $start and $end (indices are bit shifts of offsets, so they are smaller numbers). Returns false if no such pair can be found.
public
binarySearchPostingOffsetDocOffset(int $start, int $end, int $current, int $doc_index, int $direction) : array<string|int, mixed>|bool
Parameters
- $start : int
-
lower index of posting list
- $end : int
-
upper index of posting list
- $current : int
-
current index in posting list
- $doc_index : int
-
document index after which to find the next doc offset
- $direction : int
-
either self::ASCENDING or self::DESCENDING
Return values
array<string|int, mixed>|bool —either (posting_offset, next_doc_offset) or false
changeDocumentOffsets()
Changes the summary offsets associated with a set of doc_ids to new values. This is needed because the fetcher puts documents in a shard before sending them to a queue_server. It is on the queue_server however where documents are stored in the IndexArchiveBundle and summary offsets are obtained. Thus, the shard needs to be updated at that point. This function should be called when shard unpacked (we check and unpack to be on the safe side).
public
changeDocumentOffsets(array<string|int, mixed> $docid_offsets) : mixed
Parameters
- $docid_offsets : array<string|int, mixed>
-
a set of doc_id associated with a new_doc_offset.
Return values
mixed
checkSave()
Adds one to the unsaved_operations count. If this goes above save_frequency, saves the PersistentStructure to secondary storage.
public
checkSave() : mixed
Return values
mixed
computeProximity()
Returns a proximity score for a single term based on its location in doc.
public
computeProximity(array<string|int, mixed> $position_list, bool $is_doc) : int
Parameters
- $position_list : array<string|int, mixed>
-
locations of term within item
- $is_doc : bool
-
whether the item is a document or not
Return values
int —a score for proximity
docOffsetFromPostingOffset()
Given an offset of a posting into the word_docs string, looks up the posting there and computes the doc_offset stored in it.
public
docOffsetFromPostingOffset(int $offset) : int
Parameters
- $offset : int
-
byte/char offset into the word_docs string
Return values
int —a document byte/char offset into the doc_infos string
docStats()
Computes BM25F relevance and a score for the supplied item based on the supplied parameters.
public
static docStats(array<string|int, mixed> &$item, int $occurrences, int $doc_len, int $num_doc_or_links, float $average_doc_len, int $num_docs, int $total_docs_or_links, float $type_weight) : mixed
Parameters
- $item : array<string|int, mixed>
-
doc summary to compute a relevance and score for. Pass-by-ref so self::RELEVANCE and self::SCORE fields can be changed
- $occurrences : int
-
number of occurrences of the term in the item
- $doc_len : int
-
number of words in doc item represents
- $num_doc_or_links : int
-
number of links or docs containing the term
- $average_doc_len : float
-
average length of items in corpus
- $num_docs : int
-
either number of links or number of docs depending if item represents a link or a doc.
- $total_docs_or_links : int
-
number of docs or links in corpus
- $type_weight : float
-
BM25F weight for this component (doc or link) of score
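To show how parameters of this kind typically enter a relevance score, here is a sketch of a textbook Okapi BM25 term score scaled by a per-component weight. It is not taken from the docStats source: the function name bm25TermScore, the constants k1 = 1.2 and b = 0.75, and the particular IDF variant are assumptions, and multiplying a finished component score by the weight is a simplification of BM25F.

```php
<?php
/**
 * Hypothetical helper: textbook BM25 score for one term in one item,
 * multiplied by a component weight (for example 4.0 for a title,
 * 2.0 for a description, 1.0 for a link).
 * Assumes $average_doc_len > 0.
 */
function bm25TermScore(int $occurrences, int $doc_len,
    float $average_doc_len, int $num_items_with_term, int $total_items,
    float $type_weight, float $k1 = 1.2, float $b = 0.75): float
{
    // Inverse document frequency; the "1 +" keeps the value positive.
    $idf = log(1 + ($total_items - $num_items_with_term + 0.5) /
        ($num_items_with_term + 0.5));
    // Term frequency saturation with document length normalization.
    $length_norm = 1 - $b + $b * $doc_len / $average_doc_len;
    $tf = $occurrences * ($k1 + 1) / ($occurrences + $k1 * $length_norm);
    return $type_weight * $idf * $tf;
}

// Same term statistics scored as a link component and as a title component.
$link_score = bm25TermScore(3, 100, 120.0, 50, 1000, 1.0);
$title_score = bm25TermScore(3, 100, 120.0, 50, 1000, 4.0);
```

In this sketch a title match counts four times as much as a link match with identical statistics, mirroring the ratio of TITLE_WEIGHT to LINK_WEIGHT.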
Return values
mixed
gallopPostingOffsetDocOffset()
Performs a galloping search (doubling the forward jump distance at each failure step) in a posting list, starting from position $current and continuing until either $end is reached or a posting with document index bigger than $doc_index is found.
public
gallopPostingOffsetDocOffset(int &$current, int $doc_index, int $end, int $direction) : int
Parameters
- $current : int
-
current posting offset into posting list
- $doc_index : int
-
document index we want the result to be bigger than or equal to
- $end : int
-
last index of posting list
- $direction : int
-
which direction to iterate through elements of the posting slice (self::ASCENDING or self::DESCENDING) as compared to the order of when they were stored
Return values
int —document index bigger than or equal to $doc_index. If found, $current points at the posting where this occurs, so failure can be detected by checking whether $current > $end.
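The galloping technique can be sketched on a plain sorted array of document indexes. This is an illustration of the general idea (double the jump until overshooting, then binary search inside the last jump), not the method's actual code: the function name gallopSearch is hypothetical, only the ascending direction is handled, and the real method works on postings in the word_docs string.

```php
<?php
/**
 * Hypothetical helper: returns the position of the first element of the
 * ascending array $doc_indexes, at or after position $current, whose value
 * is >= $target. Returns count($doc_indexes) if there is none.
 */
function gallopSearch(array $doc_indexes, int $current, int $target): int
{
    $end = count($doc_indexes) - 1;
    if ($current > $end) {
        return $end + 1;
    }
    if ($doc_indexes[$current] >= $target) {
        return $current;
    }
    // Gallop phase: $low always points at a value smaller than $target.
    $low = $current;
    $step = 1;
    $high = $low + $step;
    while ($high <= $end && $doc_indexes[$high] < $target) {
        $low = $high;
        $step *= 2;
        $high = $low + $step;
    }
    // $high is now either past the end or at a value >= $target.
    $high = min($high, $end + 1);
    // Binary search phase on the interval ($low, $high].
    while ($high - $low > 1) {
        $mid = intdiv($low + $high, 2);
        if ($doc_indexes[$mid] < $target) {
            $low = $mid;
        } else {
            $high = $mid;
        }
    }
    return $high;
}

$doc_indexes = [2, 5, 9, 14, 20, 33, 47];
$position = gallopSearch($doc_indexes, 0, 15);
```

Galloping is attractive when the target is usually close to the current position, since the number of probes grows with the logarithm of the distance moved rather than of the whole list length.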
getDocIndexOfPostingAtOffset()
Returns the document index of the posting at offset $current in word_docs
public
getDocIndexOfPostingAtOffset(int $current) : int
Parameters
- $current : int
-
an offset into the posting lists (word_docs)
Return values
int —the doc index of the pointed to posting
getDocInfoSubstring()
From disk gets $len many bytes starting from $offset in the doc_infos strings
public
getDocInfoSubstring( $offset, $len[, bool $cache = false ]) : string
Parameters
- $offset :
-
byte offset to begin getting data out of disk-based doc_infos
- $len :
-
number of bytes to get
- $cache : bool = false
-
whether to cache disk blocks read from disk
Return values
string —desired
getPostingAtOffset()
Gets the posting closest to index $current in the word_docs string. Modifies the passed-by-ref variables $posting_start and $posting_end so they are the indexes of the start and end of the posting.
public
getPostingAtOffset(int $current, int &$posting_start, int &$posting_end) : string
Parameters
- $current : int
-
an index into the word_docs string; corresponds to a starting search location of $current * self::POSTING_LEN
- $posting_start : int
-
after function call will be index of start of nearest posting to current
- $posting_end : int
-
after function call will be index of end of nearest posting to current
Return values
string —the substring of word_docs corresponding to the posting
getPostingsSlice()
Returns documents using the word_docs string (either as stored on disk or completely read in) of records starting at the given offset and using its linked list of records. Traversal of the list stops if an offset larger than $last_offset is seen or $len many docs have been returned. Since $next_offset is passed by reference, the value of $next_offset will point to the next record in the list (if it exists) after the function is called.
public
getPostingsSlice(int $start_offset, int &$next_offset, int $last_offset, int $len[, int $direction = self::ASCENDING ]) : array<string|int, mixed>
Parameters
- $start_offset : int
-
start offset of the current posting list for the query term; used in calculating BM25F.
- $next_offset : int
-
where to start in word docs
- $last_offset : int
-
offset at which to stop
- $len : int
-
number of documents desired
- $direction : int = self::ASCENDING
-
which direction to iterate through elements of the posting slice (self::ASCENDING or self::DESCENDING) as compared to the order of when they were stored
Return values
array<string|int, mixed> —desired list of doc's and their info
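The pass-by-reference cursor pattern described for $next_offset can be sketched as below. This is a simplified illustration, assuming every posting is exactly POSTING_LEN bytes and holds a single big-endian 32 bit value; the function name postingsSliceSketch is hypothetical, and real postings carry more information than this.

```php
<?php
// Value of IndexShard::POSTING_LEN as documented on this page.
const POSTING_LEN = 4;

/**
 * Hypothetical helper: reads up to $len fixed-width postings from
 * $word_docs starting at $next_offset, stopping once $next_offset
 * passes $last_offset. $next_offset is advanced in place so that a
 * later call resumes where this one stopped.
 */
function postingsSliceSketch(string $word_docs, int &$next_offset,
    int $last_offset, int $len): array
{
    $slice = [];
    while (count($slice) < $len && $next_offset <= $last_offset) {
        $posting = substr($word_docs, $next_offset, POSTING_LEN);
        $slice[] = unpack("N", $posting)[1];
        $next_offset += POSTING_LEN;
    }
    return $slice;
}

// Five made-up postings; the last one starts at byte offset 16.
$word_docs = pack("N*", 10, 20, 30, 40, 50);
$next_offset = 0;
$first_page = postingsSliceSketch($word_docs, $next_offset, 16, 2);
$second_page = postingsSliceSketch($word_docs, $next_offset, 16, 2);
$third_page = postingsSliceSketch($word_docs, $next_offset, 16, 2);
```

Because the cursor is updated by reference, the caller can page through a posting list with repeated calls and no extra bookkeeping.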
getPostingsSliceById()
Returns $len many documents which contained the word corresponding to $word_id (only works for loaded shards)
public
getPostingsSliceById(string $word_id, int $len[, mixed $direction = self::ASCENDING ]) : array<string|int, mixed>
Parameters
- $word_id : string
-
key to look up documents for
- $len : int
-
number of documents
- $direction : mixed = self::ASCENDING
Return values
array<string|int, mixed> —desired list of doc's and their info
getShardSubstring()
Gets from disk $len many bytes of data beginning at $offset in the current IndexShard
public
getShardSubstring(int $offset, int $len[, bool $cache = true ]) : string
Parameters
- $offset : int
-
byte offset to start reading from
- $len : int
-
number of bytes to read
- $cache : bool = true
-
whether to cache disk blocks read from disk
Return values
string —data from that location in the shard
getShardWord()
Reads 32 bit word as an unsigned int from the offset given in the shard
public
getShardWord(int $offset) : int
Parameters
- $offset : int
-
a byte offset into the shard
Return values
int —the desired word, or false
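Reading a 32 bit word as an unsigned int from a byte offset can be sketched with unpack(). The sketch assumes big-endian byte order and operates on an in-memory string rather than on shard blocks; the function name readWordAt is hypothetical.

```php
<?php
/**
 * Hypothetical helper: returns the 32 bit unsigned int stored at byte
 * offset $offset of $data, or false if fewer than 4 bytes are available.
 * Big-endian byte order is an assumption of this sketch.
 */
function readWordAt(string $data, int $offset)
{
    if ($offset < 0 || $offset + 4 > strlen($data)) {
        return false;
    }
    return unpack("N", substr($data, $offset, 4))[1];
}

// Two made-up 32 bit words stored back to back.
$data = pack("N", 4096) . pack("N", 7);
$second_word = readWordAt($data, 4);
```

Returning false for out-of-range reads mirrors the "word or false" contract stated in the return value description.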
getWordDocsSubstring()
From disk gets $len many bytes starting from $offset in the word_docs strings
public
getWordDocsSubstring( $offset, $len[, bool $cache = true ]) : string
Parameters
- $offset :
-
byte offset to begin getting data out of disk-based word_docs
- $len :
-
number of bytes to get
- $cache : bool = true
-
whether to cache disk blocks read from disk
Return values
string —the desired string
getWordDocsWord()
Reads a 32-bit word as an unsigned int from the given offset in the word_docs string in the shard
public
getWordDocsWord(int $offset) : mixed
Parameters
- $offset : int
-
a byte offset into the word_docs string
Return values
mixed
getWordInfo()
Returns the first offset, last offset, and number of documents the word occurred in for this shard. The first offset (similarly, the last offset) is the byte offset into the word_docs string of the first (last) record involving that word.
public
getWordInfo(string $word_id[, bool $raw = false ]) : array<string|int, mixed>
Parameters
- $word_id : string
-
id of the word one wants to look up
- $raw : bool = false
-
whether the id is our version of base64 encoded or not
Return values
array<string|int, mixed> —first offset, last offset, count, exact matching id
getWordInfoFromString()
Converts $str into 3 ints for a first offset into word_docs, a last offset into word_docs, and a count of number of docs with that word.
public
static getWordInfoFromString(string $str[, bool $include_generation = false ]) : array<string|int, mixed>
Parameters
- $str : string
- $include_generation : bool = false
Return values
array<string|int, mixed> —of these three or four ints
getWordString()
Returns a word record (word key + posting lookup data) from the shard posting list
public
getWordString(bool $is_disk, int $start, int $location, int $word_item_len) : mixed
Parameters
- $is_disk : bool
-
whether the shard is on disk or in memory
- $start : int
-
offset to start of the dictionary
- $location : int
-
index of record to extract from dictionary
- $word_item_len : int
-
length of a word + data record
Return values
mixed
headerToShardFields()
Splits a header string into a shard's field variables
public
static headerToShardFields(string $header, object $shard) : mixed
Parameters
- $header : string
-
a string with packed shard header data
- $shard : object
-
IndexShard to put data into
Return values
mixed
load()
Load an IndexShard from a file or string
public
static load(string $fname[, string &$data = null ]) : IndexShard
Parameters
- $fname : string
-
the name of the file to load the IndexShard from (and save it to)
- $data : string = null
-
stringified shard data to load shard from. If null then the data is loaded from the $fname if possible
Return values
IndexShard —the IndexShard loaded
makeItem()
Returns (docid, item), where item has document statistics (summary offset, relevance, doc rank, and score) for the document given by the supplied posting, based on the posting list's number of docs with the word and the number of occurrences of the word in the doc.
public
makeItem(string $posting, int $num_doc_or_links[, int $direction = self::ASCENDING ]) : array<string|int, mixed>
Parameters
- $posting : string
-
a posting entry from some word's posting list
- $num_doc_or_links : int
-
number of documents or links doc appears in
- $direction : int = self::ASCENDING
-
whether to compute DOC_RANK based on the assumption the iterator is traversing the index in an ascending or descending fashion
Return values
array<string|int, mixed> —($doc_id, posting_stats_array) for posting
makeWords()
Callback function for the load method. Splits a word_key . word_info string into an entry in the passed shard: $shard->words[word_key] = $word_info.
public
static makeWords(string &$value, int $key, object $shard) : mixed
Parameters
- $value : string
-
the word_key . word_info string
- $key : int
-
index in array (unused)
- $shard : object
-
IndexShard to add the entry to word table for
Return values
mixed
mergeWordPostingsToString()
Used to flatten the words associative array to a more memory efficient word_postings string.
public
mergeWordPostingsToString([bool $replace = false ]) : mixed
$this->words is an associative array with associations wordid => postinglistforid. This format is relatively wasteful of memory.
$this->word_postings is a string in the format wordid1len1postings1wordid2len2postings2 ..., where the wordids are in lexicographic order. This is more memory efficient, as the former relies on the more wasteful PHP implementation of associative arrays.
mergeWordPostingsToString converts the former format to the latter for each of the current wordids. $this->words is then set to []. Note that before this operation is done, $this->word_postings might have data from earlier calls to mergeWordPostingsToString, in which case the behavior is controlled by $replace.
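The flattening described above can be illustrated with a small sketch. The function name flattenWords is hypothetical, and the fixed-width word ids and 4-byte big-endian length field are assumptions made for the example; merging with previously flattened data (the $replace case) is not shown.

```php
<?php
/**
 * Hypothetical illustration (not the IndexShard code): flattens an array
 * word_id => postings into one string of word_id . len . postings
 * records, with the word ids in lexicographic order.
 * Assumes fixed-width word ids and a 4-byte big-endian length field.
 */
function flattenWords(array $words): string
{
    ksort($words, SORT_STRING); // lexicographic order of word ids
    $out = "";
    foreach ($words as $word_id => $postings) {
        $out .= $word_id . pack("N", strlen($postings)) . $postings;
    }
    return $out;
}

$words = ["bbbb" => "xyz", "aaaa" => "pq"];
$word_postings = flattenWords($words);
$words = []; // the array form is no longer needed once flattened
```

A single string avoids the per-entry overhead of a PHP array, at the cost of needing the length fields to walk from one record to the next.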
Parameters
- $replace : bool = false
-
whether to overwrite existing word_id postings (true) or to append (false)
Return values
mixed
nextPostingOffsetDocOffset()
Finds the first posting offset between $start_offset and $end_offset of a posting that has a doc_offset greater than or equal to $doc_offset. This is implemented using a galloping search (double the offset until the target is passed, then binary search).
public
nextPostingOffsetDocOffset(int $start_offset, int $end_offset, int $doc_offset[, int $direction = self::ASCENDING ]) : array<string|int, mixed>
Parameters
- $start_offset : int
-
first posting to consider
- $end_offset : int
-
last posting to consider before giving up
- $doc_offset : int
-
document offset we want to be greater than or equal to (when ASCENDING) or less than or equal to (when DESCENDING)
- $direction : int = self::ASCENDING
-
which direction to iterate through elements of the posting slice (self::ASCENDING or self::DESCENDING), relative to the order in which they were stored
Return values
array<string|int, mixed> —(int offset to next posting, doc_offset for this post)
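A galloping search in the ascending case can be sketched as below. This is a simplified illustration, not the method's code: the function gallopSearch is hypothetical, it works on a plain sorted array of ints, and array indices stand in for posting offsets.

```php
<?php
/**
 * Hypothetical sketch of a galloping (exponential) search over a sorted
 * array of ints. Returns the first index in [$start, $end] whose value
 * is >= $target, or false if there is none.
 */
function gallopSearch(array $sorted, int $start, int $end, int $target)
{
    if ($start > $end || $sorted[$end] < $target) {
        return false;
    }
    // Gallop: double the step until the target is reached or passed
    $low = $start;
    $high = $start;
    $step = 1;
    while ($high < $end && $sorted[$high] < $target) {
        $low = $high;
        $high = min($end, $high + $step);
        $step *= 2;
    }
    // Binary search in [$low, $high] for the first value >= $target
    while ($low < $high) {
        $mid = intdiv($low + $high, 2);
        if ($sorted[$mid] < $target) {
            $low = $mid + 1;
        } else {
            $high = $mid;
        }
    }
    return $low;
}

$doc_offsets = [2, 5, 9, 14, 20, 31, 47];
$index = gallopSearch($doc_offsets, 0, 6, 15);
```

Galloping is attractive when the wanted posting tends to be close to the starting point, since the number of probes then depends on the distance travelled rather than on the length of the whole list.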
numDocsOrLinks()
An upper bound on the number of docs or links represented by the start and ending integer offsets into a posting list.
public
static numDocsOrLinks(int $start_offset, int $last_offset[, float $avg_posting_len = 4 ]) : int
Parameters
- $start_offset : int
-
starting location in posting list
- $last_offset : int
-
ending location in posting list
- $avg_posting_len : float = 4
-
number of bytes in an average posting
Return values
int —number of docs or links
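The arithmetic behind such an estimate can be sketched as follows. The function name estimateNumPostings is hypothetical, and both the rounding up and the use of the same units for offsets and posting length are assumptions, not statements about the actual method.

```php
<?php
/**
 * Hypothetical sketch: estimates how many postings lie between two
 * offsets into a posting list, given an average posting length.
 * Assumes the offsets and $avg_posting_len are in the same units.
 */
function estimateNumPostings(int $start_offset, int $last_offset,
    float $avg_posting_len = 4): int
{
    return (int) ceil(($last_offset - $start_offset) / $avg_posting_len);
}

$estimate = estimateNumPostings(100, 140);
$estimate_long = estimateNumPostings(100, 140, 6.5);
```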
outputPostingLists()
Used to convert the word_postings string into a word_docs string or, if a file handle is provided, to write out the word_docs sequence of postings to the provided file handle.
public
outputPostingLists([resource $fh = null ][, bool $with_logging = false ]) : mixed
Parameters
- $fh : resource = null
-
a filehandle to write to
- $with_logging : bool = false
-
whether to log progress
Return values
mixed
packAuxiliaryDocumentKeys()
Used to pack a list of description scores and user ranks as a string of auxiliary keys for a document map entry in the shard.
public
packAuxiliaryDocumentKeys([array<string|int, mixed> $description_scores = [] ][, array<string|int, mixed> $user_ranks = [] ]) : string
A document map entry consists of a four byte offset into a WebArchive, three more bytes for the document length, one byte for the number of 8 byte aux keys, followed by a 24 byte key usually derived from the url, host, etc., followed by the description score and user rank auxiliary keys.
Parameters
- $description_scores : array<string|int, mixed> = []
-
pairs position in document => weight score that position got during summarization process.
- $user_ranks : array<string|int, mixed> = []
-
float scores produced by a user classifier/ranker defined using Manage Classifiers.
Return values
string —a string padded to a length that is a multiple of 16, where @see packValues has been used to map each of the above arrays into a string
packDoclenNum()
Used to store the length of a document as well as the number of key components in its doc_id as a packed int (4 byte string)
public
static packDoclenNum(int $doc_len, int $num_keys) : string
Parameters
- $doc_len : int
-
number of words in the document
- $num_keys : int
-
number of keys that are used to make up its doc_id
Return values
string —packed int string representing these two values
packValues()
Used to pack either an array of nonnegative ints, each less than 65535, or an array of floats. Values are packed into a string using 2 bytes (a short) per entry.
public
packValues(array<string|int, mixed> $values[, string $type = "i" ]) : string
Parameters
- $values : array<string|int, mixed>
-
nonnegative integers or floats to pack
- $type : string = "i"
-
if "i", the values being packed are assumed to be integers; otherwise floats
Return values
string —with packed values
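For the integer case, packing at 2 bytes per entry and reading the values back can be sketched with PHP's pack and unpack. The helper names packShorts and unpackShorts are hypothetical, big-endian order ("n") is an assumption, and the 2 byte float encoding is not covered here.

```php
<?php
/**
 * Hypothetical sketch (integer case only): packs nonnegative ints, each
 * below 65535, into a string using 2 bytes per entry.
 * Big-endian byte order is an assumption.
 */
function packShorts(array $values): string
{
    return pack("n*", ...$values);
}

/** Reverses packShorts, returning a 0-indexed array of ints */
function unpackShorts(string $packed): array
{
    // unpack returns a 1-indexed array; array_values reindexes from 0
    return array_values(unpack("n*", $packed));
}

$packed = packShorts([1, 258, 65534]);
$values = unpackShorts($packed);
```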
packWords()
Posting lists are initially stored associated with a word as a key value pair. The merge operation then merges these into the string word_postings. packWords separates words from postings.
public
packWords([resource $fh = null ][, bool $with_logging = false ]) : mixed
After being applied, words is a string consisting of triples (as concatenated strings) word_id, start_offset, end_offset. The offsets are integer offsets into the string $this->word_docs. Finally, if a file handle is given, it writes the word dictionary out to the file as a long string. This function assumes mergeWordPostingsToString has just been called.
Parameters
- $fh : resource = null
-
a file handle to write the dictionary to, if desired
- $with_logging : bool = false
-
whether to write progress log messages every 30 seconds
Return values
mixed
postingsSliceAscending()
Returns $len posting items in ascending order from the posting list between $start_offset and $last_offset, beginning at $next_offset, as an array, and updates the position of $next_offset
public
postingsSliceAscending(int $start_offset, int &$next_offset, int $last_offset, int $len) : array<string|int, mixed>
Parameters
- $start_offset : int
-
byte offset beginning of given posting list
- $next_offset : int
-
byte offset between $start_offset and $last_offset of a posting
- $last_offset : int
-
byte offset ending of given posting list
- $len : int
-
how many postings to return increasing from $next_offset
Return values
array<string|int, mixed> —of posting items
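The by-reference $next_offset parameter lets a caller fetch a posting list in batches. A simplified sketch of that pattern follows; the function sliceAscending is hypothetical, and array indices stand in for the byte offsets of postings.

```php
<?php
/**
 * Hypothetical sketch of slicing with a by-reference cursor: returns up
 * to $len items starting at $next_offset and not past $last_offset, and
 * advances $next_offset so the next call resumes where this one stopped.
 */
function sliceAscending(array $postings, int &$next_offset,
    int $last_offset, int $len): array
{
    $results = [];
    while ($next_offset <= $last_offset && count($results) < $len) {
        $results[] = $postings[$next_offset];
        $next_offset++;
    }
    return $results;
}

$postings = ["d1", "d2", "d3", "d4", "d5"];
$next = 0;
$first_batch = sliceAscending($postings, $next, 4, 2);
$second_batch = sliceAscending($postings, $next, 4, 5);
```

The second call asks for more items than remain, so it returns only what is left and leaves the cursor just past $last_offset.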
postingsSliceDescending()
Returns $len posting items in descending order from the posting list between $start_offset and $last_offset, beginning at $next_offset, as an array, and updates the position of $next_offset
public
postingsSliceDescending(int $start_offset, int &$next_offset, int $last_offset, int $len) : array<string|int, mixed>
Parameters
- $start_offset : int
-
byte offset beginning of given posting list
- $next_offset : int
-
byte offset between $start_offset and $last_offset of a posting
- $last_offset : int
-
byte offset ending of given posting list
- $len : int
-
how many postings to return decreasing from $next_offset
Return values
array<string|int, mixed> —of posting items
prepareWordsAndPrefixes()
Computes the prefix string index for the current words array.
public
prepareWordsAndPrefixes([bool $with_logging = false ]) : mixed
This index gives offsets of the first occurrences of the lead two chars of a word_id in the words array. This method assumes that the word data is already in $this->word_postings.
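The idea of a two-character prefix index can be sketched as below. The function buildPrefixIndex is hypothetical and uses a plain array for readability; how IndexShard actually stores the index is not shown here.

```php
<?php
/**
 * Hypothetical sketch: for a lexicographically sorted list of word ids,
 * records the index of the first word id that starts with each
 * two-character prefix.
 */
function buildPrefixIndex(array $sorted_word_ids): array
{
    $prefixes = [];
    foreach ($sorted_word_ids as $index => $word_id) {
        $prefix = substr($word_id, 0, 2);
        if (!isset($prefixes[$prefix])) {
            $prefixes[$prefix] = $index; // first occurrence only
        }
    }
    return $prefixes;
}

$prefix_index = buildPrefixIndex(
    ["aab1", "aab7", "abz0", "ca11", "ca99"]);
```

A lookup for a word id can then jump to the first entry sharing its prefix instead of searching the whole word list.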
Parameters
- $with_logging : bool = false
-
whether log messages should be written as the method progresses
Return values
mixed
readBlockShardAtOffset()
Reads SHARD_BLOCK_SIZE from the current IndexShard's file beginning at byte offset $bytes
public
readBlockShardAtOffset(int $bytes[, bool $cache = true ]) : mixed
Parameters
- $bytes : int
-
byte offset to start reading from
- $cache : bool = true
-
whether to cache disk blocks that have been read to RAM
Return values
mixed —data from IndexShard file if found, false otherwise
readShardHeader()
If not already loaded, reads in from disk the fixed-length field variables of this IndexShard ($this->words_len, etc.)
public
readShardHeader([bool $force = false ]) : bool
Parameters
- $force : bool = false
-
If true
Return values
bool —whether the header could be read in or not
save()
Save the IndexShard to its filename
public
save([bool $to_string = false ][, bool $with_logging = false ]) : string
Parameters
- $to_string : bool = false
-
whether output should be written to a string rather than the default file location
- $with_logging : bool = false
-
whether log messages should be written as the shard save progresses
Return values
string —serialized shard if output was to string else empty string
saveWithoutDictionary()
This method re-saves a saved shard without the prefixes and dictionary.
public
saveWithoutDictionary([bool $with_logging = false ]) : mixed
It would typically be called after this information has been stored in an IndexDictionary object, so that the data is not redundantly stored.
Parameters
- $with_logging : bool = false
-
whether log messages should be written as the shard save progresses
Return values
mixed
unpackAuxiliaryDocumentKeys()
Used to unpack a list of description scores and user ranks from a document map entry in the shard. We assume these scores were packed using @see packAuxiliaryDocumentKeys.
public
unpackAuxiliaryDocumentKeys(string $packed_data, int $offset) : array<string|int, mixed>
Parameters
- $packed_data : string
-
containing packed description scores and user ranks
- $offset : int
-
where in the string to begin unpacking from
Return values
array<string|int, mixed> —[$description_scores, $user_ranks]
unpackDoclenNum()
Used to extract from a 32 bit unsigned int, a pair which represents the length of a document together with the number of keys in its doc_id
public
static unpackDoclenNum(int $doc_info) : array<string|int, mixed>
Parameters
- $doc_info : int
-
integer to unpack
Return values
array<string|int, mixed> —pair (number of words in the document, number of keys that are used to make up its doc_id)
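Storing two values in one 32-bit int can be sketched as a round trip. The function names below are hypothetical, and the layout (document length in the upper 24 bits, key count in the low 8 bits, serialized big-endian) is an assumption chosen to match the three-byte length and one-byte count mentioned for document map entries.

```php
<?php
/**
 * Hypothetical sketch: stores a document length in the upper 24 bits and
 * a key count in the low 8 bits of one 32-bit int, serialized as a
 * 4-byte big-endian string. The exact bit layout is an assumption.
 */
function packLenAndKeys(int $doc_len, int $num_keys): string
{
    return pack("N", (($doc_len & 0xFFFFFF) << 8) | ($num_keys & 0xFF));
}

/** Splits the 32-bit int back into [document length, number of keys] */
function unpackLenAndKeys(int $doc_info): array
{
    return [($doc_info >> 8) & 0xFFFFFF, $doc_info & 0xFF];
}

$packed = packLenAndKeys(1500, 3);
$doc_info = unpack("N", $packed)[1];
list($doc_len, $num_keys) = unpackLenAndKeys($doc_info);
```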
unpackValues()
Used to unpack from a string an array of short nonnegative ints or 2 byte floats.
public
unpackValues(mixed $packed_data, mixed $offset[, string $type = 'i' ]) : array<string|int, mixed>
@see packValues
Parameters
- $packed_data : mixed
- $offset : mixed
- $type : string = 'i'
-
if "i", the values being unpacked are assumed to be integers; otherwise floats
Return values
array<string|int, mixed> —[unpacked values array, offset to where processed to in string]
unpackWordDocs()
Takes the word docs string and splits it into posting lists which are assigned to particular words in the words dictionary array.
public
unpackWordDocs() : mixed
This method is memory expensive as it briefly has essentially two copies of what's in word_docs.
Return values
mixed
weightedCount()
Used to sum over the occurrences in a position list counting with weight based on term location in the document
public
weightedCount(array<string|int, mixed> $position_list, bool $is_doc, int $title_length[, array<string|int, mixed> $position_scores = [] ]) : array<string|int, mixed>
Parameters
- $position_list : array<string|int, mixed>
-
positions of term in item
- $is_doc : bool
-
whether the item is a document or a link
- $title_length : int
-
position in position list at which point no longer in title of original doc
- $position_scores : array<string|int, mixed> = []
-
pairs position => weight saying how much a word at a given position range is worth
Return values
array<string|int, mixed> —associative array of document_part => weight count of occurrences of term in