
DocIterator extends IndexBundleIterator

Used to iterate through all the documents and links associated with an IndexArchiveBundle. It iterates through each doc or link regardless of the words it contains. It also makes it easy to get the summaries of these documents.

A description of how words and the documents containing them are stored is given in the documentation of IndexArchiveBundle.
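The block-at-a-time traversal described above can be sketched as follows. This is a minimal illustration, not the library's implementation: StubDocIterator and collectAllDocs are hypothetical names, and the stub only mimics the documented contract that currentDocsWithWord() returns the current block (or -1 when nothing is left) and advance() moves forward one block.

```php
<?php
// Hypothetical stand-in that mimics the documented contract only.
// It is NOT the real DocIterator and reads no IndexArchiveBundle.
class StubDocIterator
{
    private array $docs;
    private int $results_per_block;
    private int $seen_docs = 0;

    public function __construct(array $docs, int $results_per_block = 200)
    {
        $this->docs = $docs;
        $this->results_per_block = $results_per_block;
    }

    // Returns the current block of docs, or -1 if none are left.
    public function currentDocsWithWord()
    {
        if ($this->seen_docs >= count($this->docs)) {
            return -1;
        }
        return array_slice($this->docs, $this->seen_docs,
            $this->results_per_block, true);
    }

    // Moves the iterator forward by one block of docs.
    public function advance(): void
    {
        $this->seen_docs = min(count($this->docs),
            $this->seen_docs + $this->results_per_block);
    }
}

// Hypothetical helper: drains an iterator block by block.
function collectAllDocs($iterator): array
{
    $all = [];
    while (is_array($block = $iterator->currentDocsWithWord())) {
        $all += $block; // array union keeps the doc keys
        $iterator->advance();
    }
    return $all;
}

$docs = ["doc_a" => 1.0, "doc_b" => 0.5, "doc_c" => 0.25];
$iterator = new StubDocIterator($docs, 2);
$all = collectAllDocs($iterator);
echo count($all), "\n";
```

Checking is_array() on the returned block, rather than truthiness, keeps the loop correct under the documented -1 sentinel.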


Chris Pollett


Table of Contents

HOST_KEY_POS  = 17
Host Key position + 1 (first char says doc, inlink or external link)
KEY_LEN  = 8
Length of a doc key
RESULTS_PER_BLOCK  = 200
Default number of documents returned for each block (at most)
$count_block  : int
The number of documents in the current block
$current_block_fresh  : bool
Says whether the value in $this->count_block is up to date
$current_generation  : int
Number of the current shard
$current_offset  : int
The current byte offset in the IndexShard
$direction  : int
The order in which results accessed from $index_name should be presented. self::ASCENDING is from first added to last added, self::DESCENDING is from last added to first added.
$doc_map  : array<string|int, mixed>
$doc_map_generation  : int
Index of the current generation/partition in the doc_map to get results from
$filter  : SearchfiltersModel
Model responsible for keeping track of edited and deleted search results
$index_name  : string
The timestamp of the index associated with this iterator
$index_version  : int
The index version affects how the iterator cycles through documents. There was a big change in index format between version 3 and prior formats.
$key_index  : int
$last_offset  : int
Last offset of a doc occurrence in the IndexShard
$next_offset  : int
The next byte offset of a doc in the IndexShard
$num_docs  : int
Estimate of the number of documents that this iterator can return
$num_generations  : int
The total number of shards that have data for this word
$pages  : array<string|int, mixed>
Cache of what currentDocsWithWord returns
$results_per_block  : int
Number of documents returned for each block (at most)
$seen_docs  : int
The number of documents already iterated over
$shard_lens  : array<string|int, mixed>
An array of shard docids_lens
$total_num_docs  : int
__construct()  : mixed
Creates a doc iterator with the given parameters.
advance()  : mixed
Forwards the iterator one group of docs
advanceGeneration()  : mixed
Switches which index shard is being used to return occurrences of the word to the next shard containing the word
advanceSeenDocs()  : mixed
Updates the seen_docs count during an advance() call
currentDocsWithWord()  : mixed
Gets the current block of doc ids and scores associated with this iterator's word
currentGenDocOffsetWithWord()  : mixed
Gets the doc_offset and generation for the next document that would be returned by this iterator
findDocsWithWord()  : mixed
Hook function used by currentDocsWithWord to return the current block of docs if it is not cached
genDocOffsetCmp()  : int
Compares two arrays each containing a (generation, offset) pair.
getCurrentDocsForKeys()  : array<string|int, mixed>
Gets the summaries associated with the given keys, provided the keys can be found in the current block of docs returned by this iterator
getDirection()  : int
Returns the direction of an IndexBundleIterator. Depending on the iterator, this could be either forward from the start of an index (self::ASCENDING) or backward from the end of the index (self::DESCENDING). For this base class, the function always returns self::ASCENDING, but subclasses might return different values.
getGenerationInfo()  : mixed
Mainly used to get the last_offset in shard $generation of the current index bundle. If this wasn't previously cached, it loads the index bundle, sets the current generation to $generation, stores the docids_len (the last offset) of this shard in shard_lens, and sets last_offset to $generation's docids_len.
getPreviousDocOffset()  : int
Get the document offset prior to the current $doc_offset
nextDocsWithWord()  : array<string|int, mixed>
Gets the current block of doc summaries for the word iterator and advances the current pointer to the next block of documents. If a doc index is set, the next block must consist of docs after this doc_index.
plan()  : string
Returns a string representation of a plan by which the current iterator finds its results
reset()  : mixed
Returns the iterator to the first document block that it could iterate over
setResultsPerBlock()  : mixed
Sets the value of the results_per_block field. This field controls the maximum number of results that can be returned in one go by currentDocsWithWord()



Host Key position + 1 (first char says doc, inlink or external link)

public mixed HOST_KEY_POS = 17


Length of a doc key

public mixed KEY_LEN = 8


Default number of documents returned for each block (at most)

public int RESULTS_PER_BLOCK = 200



Says whether the value in $this->count_block is up to date

public bool $current_block_fresh


Number of the current shard

public int $current_generation


The current byte offset in the IndexShard

public int $current_offset


The order in which results accessed from $index_name should be presented. self::ASCENDING is from first added to last added, self::DESCENDING is from last added to first added.

public int $direction


Index of the current generation/partition in the doc_map to get results from

public int $doc_map_generation


Model responsible for keeping track of edited and deleted search results

public SearchfiltersModel $filter


The timestamp of the index associated with this iterator

public string $index_name


The index version affects how the iterator cycles through documents. There was a big change in index format between version 3 and prior formats.

public int $index_version


Last offset of a doc occurrence in the IndexShard

public int $last_offset


The next byte offset of a doc in the IndexShard

public int $next_offset


Estimate of the number of documents that this iterator can return

public int $num_docs


The total number of shards that have data for this word

public int $num_generations


Cache of what currentDocsWithWord returns

public array<string|int, mixed> $pages


Number of documents returned for each block (at most)

public int $results_per_block = self::RESULTS_PER_BLOCK


An array of shard docids_lens

public array<string|int, mixed> $shard_lens



Creates a doc iterator with the given parameters.

public __construct(string $index_name[, SearchfiltersModel $filter = null ][, int $results_per_block = IndexBundleIterator::RESULTS_PER_BLOCK ][, int $direction = self::ASCENDING ]) : mixed
$index_name : string

time_stamp of the index to use

$filter : SearchfiltersModel = null

Model responsible for keeping track of edited and deleted search results

$results_per_block : int = IndexBundleIterator::RESULTS_PER_BLOCK

the maximum number of results that can be returned by a findDocsWithWord call

$direction : int = self::ASCENDING

the order in which results accessed from $index_name should be presented. self::ASCENDING is from first added to last added, self::DESCENDING is from last added to first added. Note: this value is not saved permanently, so you could in theory open two read-only versions of the same bundle and read the results in different directions.

Return values


Forwards the iterator one group of docs

public advance([array<string|int, mixed> $gen_doc_offset = null ]) : mixed
$gen_doc_offset : array<string|int, mixed> = null

a (generation, doc_offset) pair. If set, the next block must be of greater than or equal generation, and if the generation is equal, the docs in the next block must all have $doc_offsets larger than or equal to this value

Return values


Switches which index shard is being used to return occurrences of the word to the next shard containing the word

public advanceGeneration([int $generation = null ]) : mixed
$generation : int = null

generation to advance beyond

Return values


Updates the seen_docs count during an advance() call

public advanceSeenDocs() : mixed
Return values


Gets the current block of doc ids and scores associated with this iterator's word

public currentDocsWithWord() : mixed
Return values

doc ids and score if there are docs left, -1 otherwise


Gets the doc_offset and generation for the next document that would be returned by this iterator

public currentGenDocOffsetWithWord() : mixed
Return values

an array with the desired document offset and generation; -1 on fail


Hook function used by currentDocsWithWord to return the current block of docs if it is not cached

public findDocsWithWord() : mixed
Return values

doc ids and score if there are docs left, -1 otherwise
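The relationship between currentDocsWithWord(), the findDocsWithWord() hook, and the $pages cache can be sketched as a simple template-method pattern. The sketch below is an assumption about how such caching might be arranged, not the library's code: CachingBlockIterator, CountingIterator, and the $find_calls counter are hypothetical, and using $current_block_fresh as the cache-validity flag is a simplification made for the example.

```php
<?php
// Hypothetical base class illustrating the cache-then-hook pattern.
abstract class CachingBlockIterator
{
    public $pages = null;                     // cached current block
    public bool $current_block_fresh = false; // assumed cache-valid flag

    // Hook: subclasses compute the current block here.
    abstract public function findDocsWithWord();

    // Returns the cached block, calling the hook only on a cache miss.
    public function currentDocsWithWord()
    {
        if (!$this->current_block_fresh) {
            $this->pages = $this->findDocsWithWord();
            $this->current_block_fresh = true;
        }
        return $this->pages;
    }

    // Moving forward invalidates the cached block.
    public function advance(): void
    {
        $this->current_block_fresh = false;
    }
}

// Hypothetical subclass that counts how often the hook runs.
class CountingIterator extends CachingBlockIterator
{
    public int $find_calls = 0;

    public function findDocsWithWord()
    {
        $this->find_calls++;
        return ["doc_a" => 1.0];
    }
}

$cached = new CountingIterator();
$cached->currentDocsWithWord();
$cached->currentDocsWithWord(); // served from the cache
$calls_before_advance = $cached->find_calls;
$cached->advance();
$cached->currentDocsWithWord(); // hook runs again after advance()
$calls_after_advance = $cached->find_calls;
echo $calls_before_advance, " ", $calls_after_advance, "\n";
```

With this arrangement, repeated reads of the same block cost one hook call, and only advance() forces the block to be recomputed.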


Compares two arrays each containing a (generation, offset) pair.

public genDocOffsetCmp(array<string|int, mixed> $gen_doc1, array<string|int, mixed> $gen_doc2[, int $direction = self::ASCENDING ]) : int
$gen_doc1 : array<string|int, mixed>

first ordered pair

$gen_doc2 : array<string|int, mixed>

second ordered pair

$direction : int = self::ASCENDING

whether the comparison should be done for a self::ASCENDING or a self::DESCENDING search

Return values

-1, 0, or 1 depending on which is bigger
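A comparison of (generation, offset) pairs along these lines could look like the sketch below. It is illustrative only: the function name genDocOffsetCmpSketch and the constant values are hypothetical, and the assumption that a descending search simply reverses the ascending order is not confirmed by this page.

```php
<?php
// Hypothetical direction constants; the library's actual values may differ.
const SKETCH_ASCENDING = 1;
const SKETCH_DESCENDING = -1;

/**
 * Compares two [generation, offset] pairs, generation first, then offset.
 * Assumes a descending search reverses the ascending order.
 */
function genDocOffsetCmpSketch(array $gen_doc1, array $gen_doc2,
    int $direction = SKETCH_ASCENDING): int
{
    [$gen1, $offset1] = $gen_doc1;
    [$gen2, $offset2] = $gen_doc2;
    // Fall back to the offsets only when the generations are equal.
    $cmp = ($gen1 <=> $gen2) ?: ($offset1 <=> $offset2);
    return ($direction === SKETCH_DESCENDING) ? -$cmp : $cmp;
}

$earlier_generation = genDocOffsetCmpSketch([1, 50], [2, 10]);
$same_pair = genDocOffsetCmpSketch([2, 10], [2, 10]);
$larger_offset = genDocOffsetCmpSketch([2, 30], [2, 10]);
$reversed = genDocOffsetCmpSketch([1, 50], [2, 10], SKETCH_DESCENDING);
echo "$earlier_generation $same_pair $larger_offset $reversed\n";
```

Comparing the generation before the offset matters because offsets are only meaningful within a single shard.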


Gets the summaries associated with the given keys, provided the keys can be found in the current block of docs returned by this iterator

public getCurrentDocsForKeys([array<string|int, mixed> $keys = null ]) : array<string|int, mixed>
$keys : array<string|int, mixed> = null

keys to try to find in the current block of returned results

Return values
array<string|int, mixed>

doc summaries that match provided keys


Returns the direction of an IndexBundleIterator. Depending on the iterator, this could be either forward from the start of an index (self::ASCENDING) or backward from the end of the index (self::DESCENDING). For this base class, the function always returns self::ASCENDING, but subclasses might return different values.

public getDirection() : int
Return values

either CrawlConstants::ASCENDING or CrawlConstants::DESCENDING


Mainly used to get the last_offset in shard $generation of the current index bundle. If this wasn't previously cached, it loads the index bundle, sets the current generation to $generation, stores the docids_len (the last offset) of this shard in shard_lens, and sets last_offset to $generation's docids_len.

public getGenerationInfo( $generation) : mixed
$generation :

generation to get the last offset for

Return values


Get the document offset prior to the current $doc_offset

public getPreviousDocOffset(int $doc_offset) : int
$doc_offset : int

an offset into the document map of an IndexShard

Return values

previous doc_offset


Gets the current block of doc summaries for the word iterator and advances the current pointer to the next block of documents. If a doc index is set, the next block must consist of docs after this doc_index.

public nextDocsWithWord([ $doc_offset = null ]) : array<string|int, mixed>
$doc_offset : = null

if set the next block must all have $doc_offsets equal to or larger than this value

Return values
array<string|int, mixed>

doc summaries matching the $this->restrict_phrases


Returns a string representation of a plan by which the current iterator finds its results

public plan() : string
Return values

a representation of the current iterator and its subiterators, useful for determining how a query will be processed


Returns the iterator to the first document block that it could iterate over

public reset() : mixed
Return values


Sets the value of the results_per_block field. This field controls the maximum number of results that can be returned in one go by currentDocsWithWord()

public setResultsPerBlock(int $num) : mixed
$num : int

the maximum number of results that can be returned by a block

Return values

