DocIterator
extends IndexBundleIterator
Used to iterate through all the documents and links associated with an IndexArchiveBundle. It iterates through each doc or link regardless of the words it contains. It also makes it easy to get the summaries of these documents.
A description of how words and the documents containing them are stored is given in the documentation of IndexArchiveBundle.
Table of Contents
- HOST_KEY_POS = 17
- Host Key position + 1 (first char says doc, inlink or external link)
- KEY_LEN = 8
- Length of a doc key
- RESULTS_PER_BLOCK = 200
- Default number of documents returned for each block (at most)
- $count_block : int
- The number of documents in the current block
- $current_block_fresh : bool
- Says whether the value in $this->count_block is up to date
- $current_generation : int
- Number of the current shard
- $current_offset : int
- The current byte offset in the IndexShard
- $direction : int
- The order in which results accessed from $index_name should be presented. self::ASCENDING is from first added to last added, self::DESCENDING is from last added to first added.
- $doc_map : array<string|int, mixed>
- $doc_map_generation : int
- Index of the current generation/partition in the doc_map to get results from
- $filter : SearchfiltersModel
- Model responsible for keeping track of edited and deleted search results
- $index_name : string
- The timestamp of the index associated with this iterator
- $index_version : int
- The index version affects how the iterator cycles through documents. There was a big change in index format between version 3 and prior formats.
- $key_index : int
- $last_offset : int
- Last offset of a doc occurrence in the IndexShard
- $next_offset : int
- The next byte offset of a doc in the IndexShard
- $num_docs : int
- Estimate of the number of documents that this iterator can return
- $num_generations : int
- The total number of shards that have data for this word
- $pages : array<string|int, mixed>
- Cache of what currentDocsWithWord returns
- $results_per_block : int
- Number of documents returned for each block (at most)
- $seen_docs : int
- The number of documents already iterated over
- $shard_lens : array<string|int, mixed>
- An array of shard docids_lens
- $total_num_docs : int
- __construct() : mixed
- Creates a doc iterator with the given parameters.
- advance() : mixed
- Forwards the iterator one group of docs
- advanceGeneration() : mixed
- Switches which index shard is being used to return occurrences of the word to the next shard containing the word
- advanceSeenDocs() : mixed
- Updates the seen_docs count during an advance() call
- currentDocsWithWord() : mixed
- Gets the current block of doc ids and scores associated with this iterator's word
- currentGenDocOffsetWithWord() : mixed
- Gets the doc_offset and generation for the next document that would be returned by this iterator
- findDocsWithWord() : mixed
- Hook function used by currentDocsWithWord to return the current block of docs if it is not cached
- genDocOffsetCmp() : int
- Compares two arrays each containing a (generation, offset) pair.
- getCurrentDocsForKeys() : array<string|int, mixed>
- Gets the summaries associated with the given keys, provided the keys can be found in the current block of docs returned by this iterator
- getDirection() : int
- Returns the direction of an IndexBundleIterator. Depending on the iterator, this could be either forward from the start of an index (self::ASCENDING) or backward from the end of the index (self::DESCENDING). For this base class, the function always returns self::ASCENDING, but subclasses might return different values.
- getGenerationInfo() : mixed
- Mainly used to get the last_offset in shard $generation of the current index bundle. In the case where this wasn't previously cached it loads in the index bundle, sets the current generation to $generation, stores the docids_len (the last offset) of this shard in shard_lens and sets up last_offset as $generation's docids_len
- getPreviousDocOffset() : int
- Get the document offset prior to the current $doc_offset
- nextDocsWithWord() : array<string|int, mixed>
- Gets the current block of doc summaries for the word iterator and advances the current pointer to the next block of documents. If a doc index is given, the next block must consist of docs after this doc_index
- plan() : string
- Returns a string representation of a plan by which the current iterator finds its results
- reset() : mixed
- Returns the iterator to the first document block that it could iterate over
- setResultsPerBlock() : mixed
- Sets the value of the results_per_block field. This field controls the maximum number of results that can be returned in one go by currentDocsWithWord()
Constants
HOST_KEY_POS
Host Key position + 1 (first char says doc, inlink or external link)
public
mixed
HOST_KEY_POS
= 17
KEY_LEN
Length of a doc key
public
mixed
KEY_LEN
= 8
RESULTS_PER_BLOCK
Default number of documents returned for each block (at most)
public
int
RESULTS_PER_BLOCK
= 200
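One possible reading of HOST_KEY_POS and KEY_LEN, given here only as a hedged sketch: if a key string starts with a one-character type marker followed by fixed-width 8-byte keys, then the host key would be the 8 bytes starting at position 17. The layout, the sample key, and the helper name hostKeySketch are assumptions made for illustration; they are not taken from the class itself.

```php
<?php
// Values copied from the constants listed above.
const KEY_LEN = 8;
const HOST_KEY_POS = 17;

/**
 * Hypothetical helper: slices KEY_LEN bytes starting at HOST_KEY_POS.
 * Assumes (not confirmed by this page) that $prefixed_key is laid out as
 * 1 type character + two 8-byte keys + an 8-byte host key.
 */
function hostKeySketch(string $prefixed_key): string
{
    return substr($prefixed_key, HOST_KEY_POS, KEY_LEN);
}

// Made-up sample key: type marker "d", then three 8-byte fields.
$sample_key = "d" . str_repeat("a", KEY_LEN) . str_repeat("b", KEY_LEN) .
    "hostkey1";
$host_key = hostKeySketch($sample_key);
```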
Properties
$count_block
The number of documents in the current block
public
int
$count_block
$current_block_fresh
Says whether the value in $this->count_block is up to date
public
bool
$current_block_fresh
$current_generation
Number of the current shard
public
int
$current_generation
$current_offset
The current byte offset in the IndexShard
public
int
$current_offset
$direction
The order in which results accessed from $index_name should be presented. self::ASCENDING is from first added to last added, self::DESCENDING is from last added to first added.
public
int
$direction
$doc_map
public
array<string|int, mixed>
$doc_map
$doc_map_generation
Index of the current generation/partition in the doc_map to get results from
public
int
$doc_map_generation
$filter
Model responsible for keeping track of edited and deleted search results
public
SearchfiltersModel
$filter
$index_name
The timestamp of the index associated with this iterator
public
string
$index_name
$index_version
The index version affects how the iterator cycles through documents. There was a big change in index format between version 3 and prior formats.
public
int
$index_version
$key_index
public
int
$key_index
$last_offset
Last offset of a doc occurrence in the IndexShard
public
int
$last_offset
$next_offset
The next byte offset of a doc in the IndexShard
public
int
$next_offset
$num_docs
Estimate of the number of documents that this iterator can return
public
int
$num_docs
$num_generations
The total number of shards that have data for this word
public
int
$num_generations
$pages
Cache of what currentDocsWithWord returns
public
array<string|int, mixed>
$pages
$results_per_block
Number of documents returned for each block (at most)
public
int
$results_per_block
= self::RESULTS_PER_BLOCK
$seen_docs
The number of documents already iterated over
public
int
$seen_docs
$shard_lens
An array of shard docids_lens
public
array<string|int, mixed>
$shard_lens
$total_num_docs
public
int
$total_num_docs
Methods
__construct()
Creates a doc iterator with the given parameters.
public
__construct(string $index_name[, SearchfiltersModel $filter = null ][, int $results_per_block = IndexBundleIterator::RESULTS_PER_BLOCK ][, int $direction = self::ASCENDING ]) : mixed
Parameters
- $index_name : string
-
time_stamp of the index to use
- $filter : SearchfiltersModel = null
-
Model responsible for keeping track of edited and deleted search results
- $results_per_block : int = IndexBundleIterator::RESULTS_PER_BLOCK
-
the maximum number of results that can be returned by a findDocsWithWord call
- $direction : int = self::ASCENDING
-
the order in which results accessed from $index_name should be presented. self::ASCENDING is from first added to last added, self::DESCENDING is from last added to first added. Note: this value is not saved permanently, so you could in theory open two read-only versions of the same bundle and read the results in different directions
Return values
mixed
advance()
Forwards the iterator one group of docs
public
advance([array<string|int, mixed> $gen_doc_offset = null ]) : mixed
Parameters
- $gen_doc_offset : array<string|int, mixed> = null
-
a (generation, doc_offset) pair. If set, the next block must be of greater than or equal generation, and if the generation is equal, the next block must all have $doc_offsets larger than or equal to this value
Return values
mixed
advanceGeneration()
Switches which index shard is being used to return occurrences of the word to the next shard containing the word
public
advanceGeneration([int $generation = null ]) : mixed
Parameters
- $generation : int = null
-
generation to advance beyond
Return values
mixed
advanceSeenDocs()
Updates the seen_docs count during an advance() call
public
advanceSeenDocs() : mixed
Return values
mixed
currentDocsWithWord()
Gets the current block of doc ids and scores associated with this iterator's word
public
currentDocsWithWord() : mixed
Return values
mixed —doc ids and score if there are docs left, -1 otherwise
currentGenDocOffsetWithWord()
Gets the doc_offset and generation for the next document that would be returned by this iterator
public
currentGenDocOffsetWithWord() : mixed
Return values
mixed —an array with the desired document offset and generation; -1 on fail
findDocsWithWord()
Hook function used by currentDocsWithWord to return the current block of docs if it is not cached
public
findDocsWithWord() : mixed
Return values
mixed —doc ids and score if there are docs left, -1 otherwise
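The relationship between currentDocsWithWord(), the findDocsWithWord() hook, the $pages cache, and the $current_block_fresh flag can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the class's actual code: CachedBlockSketch, its in-memory $docs array, and the $find_calls counter are hypothetical and exist only to show that the hook runs once per block.

```php
<?php
/**
 * Hypothetical stand-in for an iterator that caches its current block.
 * Property names mirror the ones documented above; the behavior is assumed.
 */
class CachedBlockSketch
{
    public $pages = null;                 // cached block of docs
    public $current_block_fresh = false;  // is the cache up to date?
    public $count_block = 0;              // number of docs in the cached block
    public $find_calls = 0;               // illustration only: hook call count
    private $docs;                        // in-memory stand-in for an index

    public function __construct(array $docs)
    {
        $this->docs = $docs;
    }

    /** Returns the cached block if fresh, otherwise asks the hook for it. */
    public function currentDocsWithWord()
    {
        if ($this->current_block_fresh) {
            return $this->pages;
        }
        $this->pages = $this->findDocsWithWord();
        $this->count_block = is_array($this->pages) ? count($this->pages) : 0;
        $this->current_block_fresh = true;
        return $this->pages;
    }

    /** Hook: returns the block of docs, or -1 when there are none left. */
    public function findDocsWithWord()
    {
        $this->find_calls++;
        return empty($this->docs) ? -1 : $this->docs;
    }
}

$sketch = new CachedBlockSketch(["key_a" => 1.5, "key_b" => 0.7]);
$first = $sketch->currentDocsWithWord();
$second = $sketch->currentDocsWithWord(); // served from the cache
```

In this sketch an advance() call would be the natural place to set $current_block_fresh back to false so that the next lookup recomputes the block.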
genDocOffsetCmp()
Compares two arrays each containing a (generation, offset) pair.
public
genDocOffsetCmp(array<string|int, mixed> $gen_doc1, array<string|int, mixed> $gen_doc2[, int $direction = self::ASCENDING ]) : int
Parameters
- $gen_doc1 : array<string|int, mixed>
-
first ordered pair
- $gen_doc2 : array<string|int, mixed>
-
second ordered pair
- $direction : int = self::ASCENDING
-
whether the comparison should be done for a self::ASCENDING or a self::DESCENDING search
Return values
int —-1,0,1 depending on which is bigger
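A comparison of (generation, offset) pairs like the one described here can be sketched as a lexicographic compare: generation first, then offset. The function name genDocOffsetCmpSketch, the numeric values chosen for the two direction constants, and the choice to flip the sign for a descending search are assumptions for illustration; the real method may differ in detail.

```php
<?php
// Hypothetical stand-ins for self::ASCENDING and self::DESCENDING.
const ASCENDING = 1;
const DESCENDING = -1;

/**
 * Compares two [generation, offset] pairs, generation first, then offset.
 * Returns -1, 0, or 1. For a descending search the result is negated
 * (an assumption about how direction is meant to be applied).
 */
function genDocOffsetCmpSketch(array $gen_doc1, array $gen_doc2,
    int $direction = ASCENDING): int
{
    $cmp = $gen_doc1[0] <=> $gen_doc2[0];
    if ($cmp === 0) {
        $cmp = $gen_doc1[1] <=> $gen_doc2[1];
    }
    return ($direction === ASCENDING) ? $cmp : -$cmp;
}

// A pair in an earlier generation sorts first, whatever its offset.
$earlier_generation = genDocOffsetCmpSketch([1, 500], [2, 0]);
// Within one generation the offset decides.
$same_generation = genDocOffsetCmpSketch([2, 40], [2, 8]);
```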
getCurrentDocsForKeys()
Gets the summaries associated with the given keys, provided the keys can be found in the current block of docs returned by this iterator
public
getCurrentDocsForKeys([array<string|int, mixed> $keys = null ]) : array<string|int, mixed>
Parameters
- $keys : array<string|int, mixed> = null
-
keys to try to find in the current block of returned results
Return values
array<string|int, mixed> —doc summaries that match provided keys
getDirection()
Returns the direction of an IndexBundleIterator. Depending on the iterator, this could be either forward from the start of an index (self::ASCENDING) or backward from the end of the index (self::DESCENDING). For this base class, the function always returns self::ASCENDING, but subclasses might return different values.
public
getDirection() : int
Return values
int —either CrawlConstants::ASCENDING or CrawlConstants::DESCENDING
getGenerationInfo()
Mainly used to get the last_offset in shard $generation of the current index bundle. In the case where this wasn't previously cached it loads in the index bundle, sets the current generation to $generation, stores the docids_len (the last offset) of this shard in shard_lens and sets up last_offset as $generation's docids_len
public
getGenerationInfo( $generation) : mixed
Parameters
Return values
mixed
getPreviousDocOffset()
Get the document offset prior to the current $doc_offset
public
getPreviousDocOffset(int $doc_offset) : int
Parameters
- $doc_offset : int
-
an offset into the document map of an IndexShard
Return values
int —previous doc_offset
nextDocsWithWord()
Gets the current block of doc summaries for the word iterator and advances the current pointer to the next block of documents. If a doc index is given, the next block must consist of docs after this doc_index
public
nextDocsWithWord([ $doc_offset = null ]) : array<string|int, mixed>
Parameters
- $doc_offset : = null
-
if set the next block must all have $doc_offsets equal to or larger than this value
Return values
array<string|int, mixed> —doc summaries matching the $this->restrict_phrases
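Block-at-a-time iteration with nextDocsWithWord() and setResultsPerBlock() might be used along the following lines. This is a sketch built on a hypothetical in-memory class, BlockIteratorSketch, rather than on the real iterator; in particular, returning -1 once the documents are exhausted is an assumption borrowed from the currentDocsWithWord() description.

```php
<?php
/**
 * Hypothetical in-memory iterator that hands out docs in blocks.
 * Method and property names follow this page; the internals are assumed.
 */
class BlockIteratorSketch
{
    const RESULTS_PER_BLOCK = 200;
    public $results_per_block = self::RESULTS_PER_BLOCK;
    public $seen_docs = 0;   // number of docs already iterated over
    private $docs;

    public function __construct(array $docs)
    {
        $this->docs = $docs;
    }

    public function setResultsPerBlock(int $num): void
    {
        $this->results_per_block = $num;
    }

    /** Returns the current block and advances; -1 when nothing is left. */
    public function nextDocsWithWord()
    {
        if ($this->seen_docs >= count($this->docs)) {
            return -1;
        }
        $block = array_slice($this->docs, $this->seen_docs,
            $this->results_per_block, true);
        $this->seen_docs += count($block);
        return $block;
    }
}

/** Drains an iterator and records how many docs each block contained. */
function drainBlocks(BlockIteratorSketch $iterator): array
{
    $sizes = [];
    while (is_array($block = $iterator->nextDocsWithWord())) {
        $sizes[] = count($block);
    }
    return $sizes;
}

$iterator = new BlockIteratorSketch(range(1, 5));
$iterator->setResultsPerBlock(2);
$block_sizes = drainBlocks($iterator);
```

The loop tests is_array() rather than truthiness so that a sentinel such as -1 ends the iteration cleanly.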
plan()
Returns a string representation of a plan by which the current iterator finds its results
public
plan() : string
Return values
string —a representation of the current iterator and its subiterators, useful for determining how a query will be processed
reset()
Returns the iterator to the first document block that it could iterate over
public
reset() : mixed
Return values
mixed
setResultsPerBlock()
Sets the value of the results_per_block field. This field controls the maximum number of results that can be returned in one go by currentDocsWithWord()
public
setResultsPerBlock(int $num) : mixed
Parameters
- $num : int
-
the maximum number of results that can be returned by a block