WordIterator
extends IndexBundleIterator
Used to iterate through the documents associated with a word in an IndexArchiveBundle. It also makes it easy to get the summaries of these documents.
A description of how words and the documents containing them are stored is given in the documentation of IndexArchiveBundle.
Table of Contents
- DOC_RANK_WEIGHT = 50
- Weighting factor used as a multiplier when computing a doc-rank (an approximate score of a document based on its position in the index, that is, when it was crawled).
- HOST_KEY_POS = 17
- Host Key position + 1 (first char says doc, inlink or external link)
- KEY_LEN = 8
- Length of a doc key part
- RESULTS_PER_BLOCK = 200
- Default number of documents returned for each block (at most)
- $archive_file : int
- $avg_items_per_partition : int
- $base64_word_key : string
- Word key above in our modified base 64 encoding
- $count_block : int
- The number of documents in the current block
- $current_block_fresh : bool
- Says whether the value in $this->count_block is up to date
- $current_doc_offset : int
- The current value of the doc_offset of current posting if known
- $current_generation : int
- Number of the current shard
- $current_offset : int
- The current byte offset in the IndexShard (if older index)
- $dictionary_info : array<string|int, mixed>
- An array of shard generation and posting list offsets, lengths, and numbers of documents
- $direction : int
- Whether the iterator iterates forward or backward through documents in bundle
- $empty : int
- Keeps track of whether the word_iterator list is empty because the word does not appear in the index shard
- $filter : SearchfiltersModel
- Model responsible for keeping track of edited and deleted search results
- $generation_pointer : int
- Index into dictionary_info corresponding to the current shard
- $index_name : string
- The timestamp of the index associated with this iterator
- $index_version : int
- The index version affects how the iterator cycles through documents. There was a big change in index format between version 3 and prior formats.
- $is_meta : string
- Whether word key corresponds to a meta word
- $last_offset : int
- Last Offset of word occurrence in the IndexShard
- $max_items_per_partition : int
- $next_offset : int
- The next byte offset in the IndexShard
- $no_more_generations : bool
- Used to keep track of whether getWordInfo might still get more data on the search terms as generations advance
- $num_docs : int
- Estimate of the number of documents that this iterator can return
- $num_generations : int
- The total number of shards that have data for this word
- $num_occurrences : int
- $pages : array<string|int, mixed>
- Cache of what currentDocsWithWord returns
- $ranking_factors : array<string|int, mixed>
- How url, keywords, and title words should influence relevance and doc rank calculations
- $results_per_block : int
- Number of documents returned for each block (at most)
- $seen_docs : int
- The number of documents already iterated over
- $start_generation : int
- First shard generation that word info was obtained for
- $start_offset : int
- Starting Offset of word occurrence in the IndexShard
- $term_info_computed : int
- $threshold_exceeded : int
- $total_num_docs : int
- $total_num_docs_and_links : int
- $total_number_of_partitions : int
- $word_key : string
- hash of word or phrase that the iterator iterates over
- __construct() : mixed
- Creates a word iterator with the given parameters.
- advance() : mixed
- Forwards the iterator one group of docs
- advanceGeneration() : mixed
- Switches which index shard is being used to return occurrences of the word to the next shard containing the word
- advanceSeenDocs() : mixed
- Updates the seen_docs count during an advance() call
- currentDocsWithWord() : mixed
- Gets the current block of doc ids and scores associated with this iterator's word
- currentGenDocOffsetWithWord() : mixed
- Gets the doc_offset and generation for the next document that would be returned by this iterator
- findDocsWithWord() : mixed
- Hook function used by currentDocsWithWord to return the current block of docs if it is not cached
- frequencyNormalizationPrefaceScoring() : array<string|int, mixed>
- Normalizes the frequencies of a term within a document with respect to the length of the document, the positions of the term within the document, and the overall importance score for a given position within the document. Also computes the score of the posting for the host keywords, title keywords, and path keywords.
- genDocOffsetCmp() : int
- Compares two arrays each containing a (generation, offset) pair.
- getCurrentDocsForKeys() : array<string|int, mixed>
- Gets the summaries associated with the keys provided the keys can be found in the current block of docs returned by this iterator
- getDirection() : int
- Returns CrawlConstants::ASCENDING or CrawlConstants::DESCENDING depending on the direction in which this iterator traverses the underlying index archive bundle.
- getDocKeyPositionsScoringInfo() : mixed
- Adds scoring information, position list information, and information about the relative weights of given positions to a set of postings from a partition, based on the position list file and doc_map file.
- getGenerationPostings() : array<string|int, mixed>
- Given a partition number in the index's PartitionDocumentBundle, retrieves all the postings for the word iterator's term in that partition.
- getPostingsSliceResults() : mixed
- Given the current_offset, results_per_block, and index used, gets the results_per_block postings from the index, starting from current_offset in the current direction (ascending or descending), for the term this word iterator iterates over.
- nextDocIndexOffsetPair() : array<string|int, mixed>
- Computes a pair [posting_slice_offset, $doc_index] such that the $doc_index, when shifted to make a doc_offset, is greater than $doc_offset and posting_slice_offset is the offset of the first posting with this property.
- nextDocsWithWord() : array<string|int, mixed>
- Gets the current block of doc summaries for the word iterator and advances the current pointer to the next block of documents. If a doc index is given, the next block must consist of docs after this doc_index.
- plainAdvance() : mixed
- Forwards the iterator one group of docs. This is what advance($gen_doc_offset) calls if $gen_doc_offset is null.
- plan() : string
- Returns a string representation of a plan by which the current iterator finds its results
- reset() : mixed
- Resets the iterator to the first document block that it could iterate over
- setResultsPerBlock() : mixed
- Sets the value of the results_per_block field. This field controls the maximum number of results that can be returned in one go by currentDocsWithWord().
- termInfoIteratorFields() : mixed
- Used to compute fields such as $this->total_num_docs for this iterator on term $word_key for index $index_name
Constants
DOC_RANK_WEIGHT
Weighting factor used as a multiplier when computing a doc-rank (an approximate score of a document based on its position in the index, that is, when it was crawled).
public
mixed
DOC_RANK_WEIGHT
= 50
This weight affects the amount doc_rank determines the overall score of a document.
HOST_KEY_POS
Host Key position + 1 (first char says doc, inlink or external link)
public
mixed
HOST_KEY_POS
= 17
KEY_LEN
Length of a doc key part
public
mixed
KEY_LEN
= 8
RESULTS_PER_BLOCK
Default number of documents returned for each block (at most)
public
int
RESULTS_PER_BLOCK
= 200
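As a rough illustration of what "at most RESULTS_PER_BLOCK documents per block" means, the sketch below splits a flat list of doc ids into blocks. The function name splitIntoBlocks and the sample data are hypothetical and do not come from the class; the real iterator reads postings from the index rather than from an in-memory list.

```php
<?php
// Value copied from this page; the constant here is a local stand-in,
// not WordIterator::RESULTS_PER_BLOCK itself.
const RESULTS_PER_BLOCK = 200;

/**
 * Splits a flat list of doc ids into blocks of at most $per_block entries.
 * Illustrative only: it does not touch an index.
 */
function splitIntoBlocks(array $doc_ids, int $per_block = RESULTS_PER_BLOCK): array
{
    // array_chunk requires a length of at least 1
    return array_chunk($doc_ids, max(1, $per_block));
}

$blocks = splitIntoBlocks(range(1, 450));
echo count($blocks), "\n";    // 3 blocks: 200 + 200 + 50
echo count($blocks[2]), "\n"; // 50, the last block is only partly filled
```

The last block can be shorter than the limit, which is why the description says "at most".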
Properties
$archive_file
public
int
$archive_file
$avg_items_per_partition
public
int
$avg_items_per_partition
$base64_word_key
Word key above in our modified base 64 encoding
public
string
$base64_word_key
$count_block
The number of documents in the current block
public
int
$count_block
$current_block_fresh
Says whether the value in $this->count_block is up to date
public
bool
$current_block_fresh
$current_doc_offset
The current value of the doc_offset of current posting if known
public
int
$current_doc_offset
$current_generation
Number of the current shard
public
int
$current_generation
$current_offset
The current byte offset in the IndexShard (if older index)
public
int
$current_offset
$dictionary_info
An array of shard generation and posting list offsets, lengths, and numbers of documents
public
array<string|int, mixed>
$dictionary_info
$direction
Whether the iterator iterates forward or backward through documents in bundle
public
int
$direction
$empty
Keeps track of whether the word_iterator list is empty because the word does not appear in the index shard
public
int
$empty
$filter
Model responsible for keeping track of edited and deleted search results
public
SearchfiltersModel
$filter
$generation_pointer
Index into dictionary_info corresponding to the current shard
public
int
$generation_pointer
$index_name
The timestamp of the index associated with this iterator
public
string
$index_name
$index_version
The index version affects how the iterator cycles through documents. There was a big change in index format between version 3 and prior formats.
public
int
$index_version
$is_meta
Whether word key corresponds to a meta word
public
string
$is_meta
$last_offset
Last Offset of word occurrence in the IndexShard
public
int
$last_offset
$max_items_per_partition
public
int
$max_items_per_partition
$next_offset
The next byte offset in the IndexShard
public
int
$next_offset
$no_more_generations
Used to keep track of whether getWordInfo might still get more data on the search terms as generations advance
public
bool
$no_more_generations
$num_docs
Estimate of the number of documents that this iterator can return
public
int
$num_docs
$num_generations
The total number of shards that have data for this word
public
int
$num_generations
$num_occurrences
public
int
$num_occurrences
$pages
Cache of what currentDocsWithWord returns
public
array<string|int, mixed>
$pages
$ranking_factors
How url, keywords, and title words should influence relevance and doc rank calculations
public
array<string|int, mixed>
$ranking_factors
$results_per_block
Number of documents returned for each block (at most)
public
int
$results_per_block
= self::RESULTS_PER_BLOCK
$seen_docs
The number of documents already iterated over
public
int
$seen_docs
$start_generation
First shard generation that word info was obtained for
public
int
$start_generation
$start_offset
Starting Offset of word occurrence in the IndexShard
public
int
$start_offset
$term_info_computed
public
int
$term_info_computed
$threshold_exceeded
public
int
$threshold_exceeded
$total_num_docs
public
int
$total_num_docs
$total_num_docs_and_links
public
int
$total_num_docs_and_links
$total_number_of_partitions
public
int
$total_number_of_partitions
$word_key
hash of word or phrase that the iterator iterates over
public
string
$word_key
Methods
__construct()
Creates a word iterator with the given parameters.
public
__construct(string $word_key, string $index_name[, bool $raw = false ][, SearchfiltersModel $filter = null ][, int $results_per_block = IndexBundleIterator::RESULTS_PER_BLOCK ][, int $direction = self::ASCENDING ][, array<string|int, mixed> $ranking_factors = [] ]) : mixed
Parameters
- $word_key : string
-
hash of word or phrase to iterate docs of
- $index_name : string
-
time_stamp of the index to use
- $raw : bool = false
-
whether the $word_key is encoded in our variant of base64
- $filter : SearchfiltersModel = null
-
Model responsible for keeping track of edited and deleted search results
- $results_per_block : int = IndexBundleIterator::RESULTS_PER_BLOCK
-
the maximum number of results that can be returned by a findDocsWithWord call
- $direction : int = self::ASCENDING
-
the order in which results from $index_name should be presented when they are accessed. self::ASCENDING is from first added to last added; self::DESCENDING is from last added to first added. Note: this value is not saved permanently, so you could in theory open two read-only versions of the same bundle and read the results in different directions.
- $ranking_factors : array<string|int, mixed> = []
-
fields that say how url, keywords, and title words should influence relevance and doc rank calculations
Return values
mixed
advance()
Forwards the iterator one group of docs
public
advance([array<string|int, mixed> $gen_doc_offset = null ]) : mixed
Parameters
- $gen_doc_offset : array<string|int, mixed> = null
-
a [generation, doc_offset] pair. If not null, then in the ascending search case the next block must be of greater than or equal generation and, if the generation is equal, all of its doc_offsets must be larger than or equal to this value. The comparisons are reversed for a descending search.
Return values
mixed
advanceGeneration()
Switches which index shard is being used to return occurrences of the word to the next shard containing the word
public
advanceGeneration([int $generation = null ]) : mixed
Parameters
- $generation : int = null
-
generation to advance beyond
Return values
mixed
advanceSeenDocs()
Updates the seen_docs count during an advance() call
public
advanceSeenDocs() : mixed
Return values
mixed
currentDocsWithWord()
Gets the current block of doc ids and scores associated with this iterator's word
public
currentDocsWithWord() : mixed
Return values
mixed —doc ids and score if there are docs left, -1 otherwise
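The return convention above (a block of results, or -1 when nothing is left) suggests a simple consumption loop. The sketch below uses a hypothetical stand-in class, StubBlockIterator, that mimics only this block protocol; it is not WordIterator and does not read an index. The helper collectAllDocs is likewise an illustrative name.

```php
<?php
/**
 * Hypothetical stand-in that mimics only the protocol described here:
 * currentDocsWithWord() returns an array of doc id => score entries, or -1
 * when no docs are left, and advance() moves on to the next block.
 */
class StubBlockIterator
{
    private array $blocks;
    private int $pos = 0;

    public function __construct(array $scored_docs, int $results_per_block)
    {
        // third argument true keeps the doc ids as array keys
        $this->blocks = array_chunk($scored_docs,
            max(1, $results_per_block), true);
    }

    public function currentDocsWithWord()
    {
        return $this->blocks[$this->pos] ?? -1;
    }

    public function advance(): void
    {
        $this->pos++;
    }
}

/** Drains an iterator block by block until it reports -1. */
function collectAllDocs(StubBlockIterator $iterator): array
{
    $all = [];
    while (is_array($block = $iterator->currentDocsWithWord())) {
        $all += $block; // doc ids are distinct keys, so array union is safe
        $iterator->advance();
    }
    return $all;
}

$iterator = new StubBlockIterator(['d1' => 0.9, 'd2' => 0.5, 'd3' => 0.2], 2);
print_r(collectAllDocs($iterator));
```

Checking with is_array() rather than comparing against -1 directly avoids treating an unexpected non-array value as a block.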
currentGenDocOffsetWithWord()
Gets the doc_offset and generation for the next document that would be returned by this iterator
public
currentGenDocOffsetWithWord() : mixed
Return values
mixed —an array with the desired document offset and generation; -1 on fail
findDocsWithWord()
Hook function used by currentDocsWithWord to return the current block of docs if it is not cached
public
findDocsWithWord() : mixed
Return values
mixed —doc ids and score if there are docs left, -1 otherwise
frequencyNormalizationPrefaceScoring()
Normalizes the frequencies of a term within a document with respect to the length of the document, the positions of the term within the document, and the overall importance score for a given position within the document. Also computes the score of the posting for the host keywords, title keywords, and path keywords.
public
frequencyNormalizationPrefaceScoring(array<string|int, mixed> $positions, int $num_words, int $host_keywords_end_pos, int $title_end_pos, int $path_keywords_end_pos, array<string|int, mixed> $descriptions_scores) : array<string|int, mixed>
Parameters
- $positions : array<string|int, mixed>
-
positions of this iterator's term in the document
- $num_words : int
-
number of terms in the document
- $host_keywords_end_pos : int
-
term offset into the document summary that marks the end of the host keywords portion of the summary
- $title_end_pos : int
-
absolute term offset into the document summary that marks the end of the title portion of the summary
- $path_keywords_end_pos : int
-
absolute term offset into the document summary that marks the end of the path keywords portion of the summary
- $descriptions_scores : array<string|int, mixed>
-
boundaries and scores of different regions within the document
Return values
array<string|int, mixed> —[normalized frequency, score for host name, title, and path keywords]
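To make the role of the three end-position parameters more concrete, here is a minimal sketch that buckets term positions by summary region and computes a plain occurrences-per-word frequency. It is not the scoring formula used by this method. It assumes, without confirmation from this page, that the regions appear in the order host keywords, title, path keywords, then body, and that each end position is an exclusive upper bound. The function name countPositionsByRegion is hypothetical.

```php
<?php
/**
 * Counts term positions per summary region and computes a simple
 * length-normalized frequency.
 * ASSUMPTIONS (not stated on this page): regions are ordered host keywords,
 * title, path keywords, body; each *_end_pos is an exclusive upper bound.
 */
function countPositionsByRegion(array $positions, int $num_words,
    int $host_keywords_end_pos, int $title_end_pos,
    int $path_keywords_end_pos): array
{
    $counts = ['host' => 0, 'title' => 0, 'path' => 0, 'body' => 0];
    foreach ($positions as $position) {
        if ($position < $host_keywords_end_pos) {
            $counts['host']++;
        } elseif ($position < $title_end_pos) {
            $counts['title']++;
        } elseif ($position < $path_keywords_end_pos) {
            $counts['path']++;
        } else {
            $counts['body']++;
        }
    }
    // Simplest possible normalization: occurrences per word of the document
    $frequency = ($num_words > 0) ? count($positions) / $num_words : 0.0;
    return [$frequency, $counts];
}

// Term occurs at positions 1, 4, 9, and 30 of a 40-word summary
[$frequency, $counts] = countPositionsByRegion([1, 4, 9, 30], 40, 3, 8, 12);
print_r($counts);
```

A real scorer would weight each region differently (see $ranking_factors and $descriptions_scores); the sketch only shows how positions map to regions.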
genDocOffsetCmp()
Compares two arrays each containing a (generation, offset) pair.
public
genDocOffsetCmp(array<string|int, mixed> $gen_doc1, array<string|int, mixed> $gen_doc2[, int $direction = self::ASCENDING ]) : int
Parameters
- $gen_doc1 : array<string|int, mixed>
-
first ordered pair
- $gen_doc2 : array<string|int, mixed>
-
second ordered pair
- $direction : int = self::ASCENDING
-
whether the comparison should be done for a self::ASCENDING or a self::DESCENDING search
Return values
int —-1, 0, or 1 depending on which pair is bigger
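A comparison of (generation, offset) pairs can be sketched as a lexicographic comparison: generation first, then offset. The version below is an illustration under stated assumptions, not the method's actual body. The constants SKETCH_ASCENDING and SKETCH_DESCENDING are hypothetical stand-ins for the CrawlConstants values, and flipping the sign for a descending search is an assumption about the intended semantics.

```php
<?php
// Hypothetical direction flags; the real values are defined in CrawlConstants.
const SKETCH_ASCENDING = 1;
const SKETCH_DESCENDING = -1;

/**
 * Compares two [generation, offset] pairs, generation first, then offset.
 * Returns -1, 0, or 1. ASSUMPTION: for a descending search the result is
 * negated, so the pair an iterator reaches first still compares as smaller.
 */
function compareGenOffset(array $gen_doc1, array $gen_doc2,
    int $direction = SKETCH_ASCENDING): int
{
    // If the generations tie (result 0), fall through to the offsets
    $cmp = ($gen_doc1[0] <=> $gen_doc2[0]) ?: ($gen_doc1[1] <=> $gen_doc2[1]);
    return ($direction === SKETCH_DESCENDING) ? -$cmp : $cmp;
}

// The comparator can be handed to usort to order pairs for an ascending scan
$pairs = [[2, 10], [1, 50], [1, 7]];
usort($pairs, 'compareGenOffset');
print_r($pairs); // [1, 7], [1, 50], [2, 10]
```

Comparing the generation before the offset matters because offsets are only meaningful relative to their own generation.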
getCurrentDocsForKeys()
Gets the summaries associated with the keys provided the keys can be found in the current block of docs returned by this iterator
public
getCurrentDocsForKeys([array<string|int, mixed> $keys = null ]) : array<string|int, mixed>
Parameters
- $keys : array<string|int, mixed> = null
-
keys to try to find in the current block of returned results
Return values
array<string|int, mixed> —doc summaries that match provided keys
getDirection()
Returns CrawlConstants::ASCENDING or CrawlConstants::DESCENDING depending on the direction in which this iterator traverses the underlying index archive bundle.
public
getDirection() : int
Return values
int —direction traversing underlying archive bundle
getDocKeyPositionsScoringInfo()
Adds scoring information, position list information, and information about the relative weights of given positions to a set of postings from a partition, based on the position list file and doc_map file.
public
getDocKeyPositionsScoringInfo(mixed $postings, int $partition) : mixed
Parameters
- $postings : mixed
-
posting data to add scoring information to
- $partition : int
-
which partition from the PartitionDocumentBundle the postings are related to
Return values
mixed
getGenerationPostings()
Given a partition number in the index's PartitionDocumentBundle, retrieves all the postings for the word iterator's term in that partition.
public
getGenerationPostings(int $generation) : array<string|int, mixed>
Parameters
- $generation : int
-
partition to get postings for
Return values
array<string|int, mixed> —of posting items
getPostingsSliceResults()
Given the current_offset, results_per_block, and index used, gets the results_per_block postings from the index, starting from current_offset in the current direction (ascending or descending), for the term this word iterator iterates over.
public
getPostingsSliceResults() : mixed
Return values
mixed
nextDocIndexOffsetPair()
Computes a pair [posting_slice_offset, $doc_index] such that the $doc_index, when shifted to make a doc_offset, is greater than $doc_offset and posting_slice_offset is the offset of the first posting with this property.
public
nextDocIndexOffsetPair(int $doc_offset) : array<string|int, mixed>
Parameters
- $doc_offset : int
-
the doc_offset for which we are trying to find a posting whose doc_index has a bigger doc_offset
Return values
array<string|int, mixed> —[posting_slice_offset, $doc_index]
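Finding the first posting past a given doc_offset is a lower-bound style search. The sketch below shows one way to do it with a binary search over an in-memory list. It assumes that the doc indexes are sorted in ascending order and that a doc_index becomes a doc_offset through a left shift by a fixed number of bits. The shift amount SKETCH_DOC_INDEX_SHIFT, the function name firstPostingAfter, and the null result for "not found" are all hypothetical and are not taken from the class.

```php
<?php
// Hypothetical number of bits a doc_index is shifted by to form a doc_offset.
const SKETCH_DOC_INDEX_SHIFT = 4;

/**
 * Given doc indexes sorted ascending (one per posting), finds the first
 * posting whose doc_index, shifted into a doc_offset, is strictly greater
 * than $doc_offset. Returns [posting_slice_offset, doc_index], or null if
 * no posting qualifies.
 */
function firstPostingAfter(array $doc_indexes, int $doc_offset): ?array
{
    $low = 0;
    $high = count($doc_indexes);
    while ($low < $high) {
        $mid = intdiv($low + $high, 2);
        if (($doc_indexes[$mid] << SKETCH_DOC_INDEX_SHIFT) > $doc_offset) {
            $high = $mid;    // $mid qualifies; look for an earlier one
        } else {
            $low = $mid + 1; // $mid is too small; discard it and the left half
        }
    }
    return ($low < count($doc_indexes)) ? [$low, $doc_indexes[$low]] : null;
}

// Shifted offsets are 16, 48, 96, 160; the first one above 48 is 96,
// found at slice offset 2 with doc_index 6
var_dump(firstPostingAfter([1, 3, 6, 10], 48));
```

Binary search keeps the lookup logarithmic in the number of postings, which matters when a slice holds many entries.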
nextDocsWithWord()
Gets the current block of doc summaries for the word iterator and advances the current pointer to the next block of documents. If a doc index is given, the next block must consist of docs after this doc_index.
public
nextDocsWithWord([ $doc_offset = null ]) : array<string|int, mixed>
Parameters
- $doc_offset : = null
-
if set the next block must all have $doc_offsets equal to or larger than this value
Return values
array<string|int, mixed> —doc summaries matching the $this->restrict_phrases
plainAdvance()
Forwards the iterator one group of docs. This is what advance($gen_doc_offset) calls if $gen_doc_offset is null.
public
plainAdvance() : mixed
Return values
mixed
plan()
Returns a string representation of a plan by which the current iterator finds its results
public
plan() : string
Return values
string —a representation of the current iterator and its subiterators, useful for determining how a query will be processed
reset()
Resets the iterator to the first document block that it could iterate over
public
reset() : mixed
Return values
mixed
setResultsPerBlock()
Sets the value of the results_per_block field. This field controls the maximum number of results that can be returned in one go by currentDocsWithWord().
public
setResultsPerBlock(int $num) : mixed
Parameters
- $num : int
-
the maximum number of results that can be returned by a block
Return values
mixed
termInfoIteratorFields()
Used to compute fields such as $this->total_num_docs for this iterator on term $word_key for index $index_name
protected
termInfoIteratorFields(string $index_name, string $word_key) : mixed
Parameters
- $index_name : string
-
name of index to compute statistics with respect to
- $word_key : string
-
term to compute statistics with respect to