IntersectIterator
extends IndexBundleIterator
Used to iterate over the documents which occur in all of a set of iterator results
Table of Contents
- RESULTS_PER_BLOCK = 200
- Default number of documents returned for each block (at most)
- SYNC_TIMEOUT = 3
- Number of seconds after which syncGenDocOffsetsAmongstIterators times out and stops if it is running slowly
- $count_block : int
- The number of documents in the current block
- $current_block_fresh : bool
- Says whether the value in $this->count_block is up to date
- $index_bundle_iterators : array<string|int, mixed>
- An array of iterators whose intersection we get documents from
- $least_num_doc_index : int
- Which of the iterators has the current document with least index
- $num_docs : int
- Estimate of the number of documents that this iterator can return
- $num_iterators : int
- Number of elements in $this->index_bundle_iterators
- $num_words : int
- Number of elements in $this->word_iterator_map
- $pages : array<string|int, mixed>
- Cache of what currentDocsWithWord returns
- $quote_positions : array<string|int, mixed>
- Each element in this array corresponds to one quoted phrase in the original query. Each element is in turn an array whose entries give the position of a term in the original query followed by its length (a term might involve more than one word, so the length could be greater than one). Entries may also be of the form *num => * to indicate that an asterisk (a wildcard that can match any number of terms) appeared at that place in the query
- $results_per_block : int
- Number of documents returned for each block (at most)
- $seen_docs : int
- The number of documents already iterated over
- $seen_docs_unfiltered : int
- The number of iterated docs before the restriction test
- $sync_time : int
- Start time for syncGenDocOffsetsAmongstIterators
- $sync_timer_on : bool
- Whether to run a timer that shuts down the intersect iterator if syncGenDocOffsetsAmongstIterators takes longer than the time out period
- $to_advance_index : int
- Index of the iterator amongst those we are intersecting to advance next
- $total_num_docs : int
- $weight : float
- A weighting factor to multiply with each doc SCORE returned from this iterator
- $word_iterator_map : array<string|int, mixed>
- Associative array (term position in original query => iterator index of an iterator for that term). This handles queries where the same term occurs multiple times, for example, the rock band "The The"
- __construct() : mixed
- Creates an intersect iterator with the given parameters.
- advance() : mixed
- Forwards the iterator one group of docs
- advanceSeenDocs() : mixed
- Updates the seen_docs count during an advance() call
- checkQuote() : int
- Auxiliary function for checkQuotes() used to check whether quoted terms in the search query appear exactly in the position lists of the current document
- checkQuotes() : bool
- Used to check if quoted terms in search query appear exactly in the position lists of the current document
- computeProximity() : sum
- Given the position_lists of a collection of terms computes a score for how close those words were in the given document
- currentDocsWithWord() : mixed
- Gets the current block of doc ids and scores associated with this iterator's word
- currentGenDocOffsetWithWord() : mixed
- Gets the doc_offset and generation for the next document that would be returned by this iterator
- findDocsWithWord() : mixed
- Hook function used by currentDocsWithWord to return the current block of docs if it is not cached
- genDocOffsetCmp() : int
- Compares two arrays each containing a (generation, offset) pair.
- getCurrentDocsForKeys() : array<string|int, mixed>
- Gets the summaries associated with the keys provided the keys can be found in the current block of docs returned by this iterator
- getDirection() : int
- Returns CrawlConstants::ASCENDING or CrawlConstants::DESCENDING depending on the direction in which this iterator traverses the underlying index archive bundle.
- nextDocsWithWord() : array<string|int, mixed>
- Gets the current block of doc summaries for the word iterator and advances the current pointer to the next block of documents. If a doc index is set, the next block must consist of docs after this doc_index
- plan() : string
- Returns a string representation of a plan by which the current iterator finds its results
- reset() : mixed
- Returns the iterators to the first document block that it could iterate over
- setResultsPerBlock() : mixed
- This method is supposed to set the value of the results_per_block field, which controls the maximum number of results that currentDocsWithWord() can return in one go. It cannot be implemented consistently for this iterator while still behaving well when the iterator is used together with union_iterator. To prevent a user from doing this, calling this method results in a user-defined error
- syncGenDocOffsetsAmongstIterators() : mixed
- Finds the next generation and doc offset amongst all the iterators that contain the word. It assumes that the (generation, doc offset) pairs are ordered in an increasing fashion for the underlying iterators
Constants
RESULTS_PER_BLOCK
Default number of documents returned for each block (at most)
public
int
RESULTS_PER_BLOCK
= 200
SYNC_TIMEOUT
Number of seconds after which syncGenDocOffsetsAmongstIterators times out and stops if it is running slowly
public
mixed
SYNC_TIMEOUT
= 3
Properties
$count_block
The number of documents in the current block
public
int
$count_block
$current_block_fresh
Says whether the value in $this->count_block is up to date
public
bool
$current_block_fresh
$index_bundle_iterators
An array of iterators whose intersection we get documents from
public
array<string|int, mixed>
$index_bundle_iterators
$least_num_doc_index
Which of the iterators has the current document with least index
public
int
$least_num_doc_index
$num_docs
Estimate of the number of documents that this iterator can return
public
int
$num_docs
$num_iterators
Number of elements in $this->index_bundle_iterators
public
int
$num_iterators
$num_words
Number of elements in $this->word_iterator_map
public
int
$num_words
$pages
Cache of what currentDocsWithWord returns
public
array<string|int, mixed>
$pages
$quote_positions
Each element in this array corresponds to one quoted phrase in the original query. Each element is in turn an array whose entries give the position of a term in the original query followed by its length (a term might involve more than one word, so the length could be greater than one). Entries may also be of the form *num => * to indicate that an asterisk (a wildcard that can match any number of terms) appeared at that place in the query
public
array<string|int, mixed>
$quote_positions
$results_per_block
Number of documents returned for each block (at most)
public
int
$results_per_block
= self::RESULTS_PER_BLOCK
$seen_docs
The number of documents already iterated over
public
int
$seen_docs
$seen_docs_unfiltered
The number of iterated docs before the restriction test
public
int
$seen_docs_unfiltered
$sync_time
Start time for syncGenDocOffsetsAmongstIterators
public
int
$sync_time
$sync_timer_on
Whether to run a timer that shuts down the intersect iterator if syncGenDocOffsetsAmongstIterators takes longer than the time out period
public
bool
$sync_timer_on
$to_advance_index
Index of the iterator amongst those we are intersecting to advance next
public
int
$to_advance_index
$total_num_docs
public
int
$total_num_docs
$weight
A weighting factor to multiply with each doc SCORE returned from this iterator
public
float
$weight
$word_iterator_map
Associative array (term position in original query => iterator index of an iterator for that term). This handles queries where the same term occurs multiple times, for example, the rock band "The The"
public
array<string|int, mixed>
$word_iterator_map
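To make the shape of $word_iterator_map concrete, here is a minimal sketch of how such a map could be built from a list of query terms. The helper name buildWordIteratorMap and the rule that each distinct term gets exactly one iterator are assumptions for illustration; the source does not say how the map is actually constructed.

```php
<?php
/**
 * Hypothetical helper: maps each term position in a query to the index of
 * the iterator serving that term, so repeated terms share one iterator.
 */
function buildWordIteratorMap(array $query_terms): array
{
    $iterator_index_of = []; // term => index of the iterator serving it
    $word_iterator_map = []; // term position => iterator index
    foreach (array_values($query_terms) as $position => $term) {
        if (!isset($iterator_index_of[$term])) {
            // First time this term is seen: give it the next iterator index
            $iterator_index_of[$term] = count($iterator_index_of);
        }
        $word_iterator_map[$position] = $iterator_index_of[$term];
    }
    return $word_iterator_map;
}
```

Under these assumptions, the query "The The" has two term positions that both point at iterator 0, so $num_words would be 2 while $num_iterators would be 1.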
Methods
__construct()
Creates an intersect iterator with the given parameters.
public
__construct(object $index_bundle_iterators, array<string|int, mixed> $word_iterator_map[, array<string|int, mixed> $quote_positions = null ][, float $weight = 1 ]) : mixed
Parameters
- $index_bundle_iterators : object
-
to use as a source of documents to iterate over
- $word_iterator_map : array<string|int, mixed>
-
Associative array (term position in original query => iterator index of an iterator for that term)
- $quote_positions : array<string|int, mixed> = null
-
Each element in this array corresponds to one quoted phrase in the original query. See the $quote_positions field of this class for more info
- $weight : float = 1
-
multiplicative factor to apply to scores returned from this iterator
Return values
mixed
advance()
Forwards the iterator one group of docs
public
advance([array<string|int, mixed> $gen_doc_offset = null ]) : mixed
Parameters
- $gen_doc_offset : array<string|int, mixed> = null
-
a (generation, doc_offset) pair. If set, the next block must be of greater than or equal generation, and if the generation is equal, the next block must all have doc offsets larger than or equal to this value
Return values
mixed
advanceSeenDocs()
Updates the seen_docs count during an advance() call
public
advanceSeenDocs() : mixed
Return values
mixed
checkQuote()
Auxiliary function for checkQuotes() used to check whether quoted terms in the search query appear exactly in the position lists of the current document
public
checkQuote(array<string|int, mixed> &$position_lists, int $cur_pos, mixed $next_pos, array<string|int, mixed> $ngram_positions_within_quoted_query) : int
Parameters
- $position_lists : array<string|int, mixed>
-
of search terms in the current document
- $cur_pos : int
-
to look after in any position list
- $next_pos : mixed
-
either * or an int. If *, next_pos must be >= $cur_pos + len_search_term. $next_pos represents the position the next quoted term should be at
- $ngram_positions_within_quoted_query : array<string|int, mixed>
-
pairs: $ngram_position_within_quoted_query => $len_of_ngram
Return values
int —-1 on failure, 0 on backtrack, 1 on success
checkQuotes()
Used to check if quoted terms in search query appear exactly in the position lists of the current document
public
checkQuotes(array<string|int, mixed> &$position_lists) : bool
Parameters
- $position_lists : array<string|int, mixed>
-
of search terms in the current document
Return values
bool —whether the quoted terms in the search appear exactly
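The idea of checking a quoted phrase against position lists can be sketched as follows. This is a simplified illustration, not the method's actual logic: the function name phraseOccurs is hypothetical, every term is assumed to have length 1, and * wildcards and backtracking are ignored.

```php
<?php
/**
 * Returns true if the terms occur at consecutive positions, in order.
 * $position_lists[$i] holds the positions of the $i-th quoted term.
 * Simplification (assumption): each term has length 1, no * wildcards.
 */
function phraseOccurs(array $position_lists): bool
{
    $position_lists = array_values($position_lists);
    if ($position_lists === []) {
        return false;
    }
    // Try each occurrence of the first term as the start of the phrase
    foreach ($position_lists[0] as $start) {
        $matched = true;
        foreach ($position_lists as $i => $positions) {
            // The $i-th term must sit exactly $i positions after the start
            if (!in_array($start + $i, $positions, true)) {
                $matched = false;
                break;
            }
        }
        if ($matched) {
            return true;
        }
    }
    return false;
}
```

For instance, with position lists [3, 10], [4, 20], and [5, 30], the phrase matches starting at position 3.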
computeProximity()
Given the position_lists of a collection of terms computes a score for how close those words were in the given document
public
computeProximity(array<string|int, mixed> &$word_position_lists, array<string|int, mixed> &$word_len_lists, bool $is_doc, int $doc_len) : sum
Parameters
- $word_position_lists : array<string|int, mixed>
-
a 2D array item number => position_list (locations in doc where item occurred) for that item.
- $word_len_lists : array<string|int, mixed>
-
length for each item of its position list
- $is_doc : bool
-
whether this is the position list of a document or a link
- $doc_len : int
-
the length of the document
Return values
sum —of inverse of all covers computed by plane sweep algorithm
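A rough sketch of a "sum of inverse cover lengths" score computed with a single left-to-right sweep is given below. It assumes a cover is a minimal span of positions containing at least one occurrence of every term, and it ignores the $word_len_lists, $is_doc, and $doc_len parameters; the function name proximityScore is hypothetical and the real method may weight covers differently.

```php
<?php
/**
 * Sums 1 / (cover length) over minimal covers found by a sweep.
 * $word_position_lists: item number => list of positions in the document.
 */
function proximityScore(array $word_position_lists): float
{
    $lists = array_values($word_position_lists);
    $num_terms = count($lists);
    if ($num_terms === 0) {
        return 0.0;
    }
    // Flatten into [position, term] events and sort them by position
    $events = [];
    foreach ($lists as $term => $positions) {
        foreach ($positions as $position) {
            $events[] = [$position, $term];
        }
    }
    sort($events);
    $counts = array_fill(0, $num_terms, 0); // occurrences inside the window
    $covered = 0; // number of distinct terms inside the window
    $left = 0;    // index of the leftmost event in the window
    $score = 0.0;
    foreach ($events as [$position, $term]) {
        if ($counts[$term]++ === 0) {
            $covered++;
        }
        if ($covered < $num_terms) {
            continue;
        }
        // Shrink from the left while every term is still present
        while ($counts[$events[$left][1]] > 1) {
            $counts[$events[$left][1]]--;
            $left++;
        }
        $score += 1.0 / ($position - $events[$left][0] + 1);
        // Drop the leftmost event so the next cover must start later
        $counts[$events[$left][1]]--;
        $covered--;
        $left++;
    }
    return $score;
}
```

With this definition, terms that appear close together produce short covers and therefore larger contributions to the score.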
currentDocsWithWord()
Gets the current block of doc ids and scores associated with this iterator's word
public
currentDocsWithWord() : mixed
Return values
mixed —doc ids and score if there are docs left, -1 otherwise
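Since this method returns -1 once no documents are left, a caller would typically loop until the result is no longer an array. The sketch below illustrates that pattern with a hypothetical StubIterator that mimics only currentDocsWithWord() and advance(); it is not the real class, and the block contents are made-up placeholders.

```php
<?php
/** Hypothetical stand-in that mimics only the two calls used below. */
class StubIterator
{
    private array $blocks;
    private int $index = 0;

    public function __construct(array $blocks)
    {
        $this->blocks = $blocks;
    }

    /** Returns the current block, or -1 when no blocks are left. */
    public function currentDocsWithWord()
    {
        return $this->blocks[$this->index] ?? -1;
    }

    /** Moves on to the next block. */
    public function advance(): void
    {
        $this->index++;
    }
}

/** Collects every block into one array keyed by doc id. */
function collectAllDocs($iterator): array
{
    $all = [];
    while (is_array($docs = $iterator->currentDocsWithWord())) {
        $all += $docs; // array union keeps the first entry per doc id
        $iterator->advance();
    }
    return $all;
}
```

Checking with is_array() rather than comparing to -1 avoids PHP's loose comparison rules between arrays and integers.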
currentGenDocOffsetWithWord()
Gets the doc_offset and generation for the next document that would be returned by this iterator
public
currentGenDocOffsetWithWord() : mixed
Return values
mixed —an array with the desired document offset and generation; -1 on fail
findDocsWithWord()
Hook function used by currentDocsWithWord to return the current block of docs if it is not cached
public
findDocsWithWord() : mixed
Return values
mixed —doc ids and rank if there are docs left, -1 otherwise
genDocOffsetCmp()
Compares two arrays each containing a (generation, offset) pair.
public
genDocOffsetCmp(array<string|int, mixed> $gen_doc1, array<string|int, mixed> $gen_doc2[, int $direction = self::ASCENDING ]) : int
Parameters
- $gen_doc1 : array<string|int, mixed>
-
first ordered pair
- $gen_doc2 : array<string|int, mixed>
-
second ordered pair
- $direction : int = self::ASCENDING
-
whether the comparison should be done for a self::ASCENDING or a self::DESCENDING search
Return values
int —-1,0,1 depending on which is bigger
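As an illustration of comparing (generation, offset) pairs, the sketch below orders two pairs lexicographically: generation first, then offset. The function name compareGenDoc, the constant values, and the choice to flip the sign for a descending search are all assumptions, not the documented behavior of genDocOffsetCmp().

```php
<?php
// Hypothetical stand-ins for the direction constants; values are assumptions.
const ASCENDING = 1;
const DESCENDING = -1;

/**
 * Lexicographic comparison of two [generation, offset] pairs.
 * Returns -1, 0, or 1; the sign is flipped for a descending search.
 */
function compareGenDoc(array $gen_doc1, array $gen_doc2,
    int $direction = ASCENDING): int
{
    // Compare generations first, then offsets within the same generation
    $cmp = ($gen_doc1[0] <=> $gen_doc2[0]) ?: ($gen_doc1[1] <=> $gen_doc2[1]);
    return ($direction === ASCENDING) ? $cmp : -$cmp;
}
```

Here a pair in a later generation always compares as larger in an ascending search, regardless of its offset.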
getCurrentDocsForKeys()
Gets the summaries associated with the keys provided the keys can be found in the current block of docs returned by this iterator
public
getCurrentDocsForKeys([array<string|int, mixed> $keys = null ]) : array<string|int, mixed>
Parameters
- $keys : array<string|int, mixed> = null
-
keys to try to find in the current block of returned results
Return values
array<string|int, mixed> —doc summaries that match provided keys
getDirection()
Returns CrawlConstants::ASCENDING or CrawlConstants::DESCENDING depending on the direction in which this iterator traverses the underlying index archive bundle.
public
getDirection() : int
Return values
int —direction traversing underlying archive bundle
nextDocsWithWord()
Gets the current block of doc summaries for the word iterator and advances the current pointer to the next block of documents. If a doc index is set, the next block must consist of docs after this doc_index
public
nextDocsWithWord([ $doc_offset = null ]) : array<string|int, mixed>
Parameters
- $doc_offset : = null
-
if set the next block must all have $doc_offsets equal to or larger than this value
Return values
array<string|int, mixed> —doc summaries matching the $this->restrict_phrases
plan()
Returns a string representation of a plan by which the current iterator finds its results
public
plan() : string
Return values
string —a representation of the current iterator and its subiterators, useful for determining how a query will be processed
reset()
Returns the iterators to the first document block that it could iterate over
public
reset() : mixed
Return values
mixed
setResultsPerBlock()
This method is supposed to set the value of the results_per_block field, which controls the maximum number of results that currentDocsWithWord() can return in one go. It cannot be implemented consistently for this iterator while still behaving well when the iterator is used together with union_iterator. To prevent a user from doing this, calling this method results in a user-defined error
public
setResultsPerBlock(int $num) : mixed
Parameters
- $num : int
-
the maximum number of results that can be returned by a block
Return values
mixed
syncGenDocOffsetsAmongstIterators()
Finds the next generation and doc offset amongst all the iterators that contain the word. It assumes that the (generation, doc offset) pairs are ordered in an increasing fashion for the underlying iterators
public
syncGenDocOffsetsAmongstIterators() : mixed
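The synchronization idea can be sketched on plain arrays: given several lists of (generation, offset) pairs sorted in increasing order, repeatedly advance whichever list currently holds the smallest pair until all lists agree. The function name firstCommonGenDoc and the use of in-memory arrays in place of real iterators are assumptions; the actual method also enforces a timeout (see SYNC_TIMEOUT), which is left out here.

```php
<?php
/**
 * Finds the first [generation, offset] pair present in every sorted list.
 * $lists is a hypothetical stand-in for the underlying iterators.
 * Returns the common pair, or -1 if some list runs out first.
 */
function firstCommonGenDoc(array $lists)
{
    $lists = array_values($lists);
    if ($lists === []) {
        return -1;
    }
    $cursors = array_fill(0, count($lists), 0);
    while (true) {
        $least_index = 0;
        $all_equal = true;
        foreach ($lists as $i => $list) {
            if ($cursors[$i] >= count($list)) {
                return -1; // one list is exhausted, no common pair remains
            }
            $current = $list[$cursors[$i]];
            $least = $lists[$least_index][$cursors[$least_index]];
            $first = $lists[0][$cursors[0]];
            // <=> on equal-length arrays compares element by element
            if (($current <=> $least) < 0) {
                $least_index = $i;
            }
            if (($current <=> $first) !== 0) {
                $all_equal = false;
            }
        }
        if ($all_equal) {
            return $lists[0][$cursors[0]];
        }
        // Advance the list that is furthest behind
        $cursors[$least_index]++;
    }
}
```

This sketch steps one entry at a time for clarity; an iterator that can skip directly to a given (generation, offset) pair would reach the same answer with fewer steps.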