Yioop_V9.5_Source_Code_Documentation

IntersectIterator extends IndexBundleIterator

Used to iterate over the documents which occur in all of a set of iterator results

Tags
author

Chris Pollett

see
IndexArchiveBundle

Table of Contents

RESULTS_PER_BLOCK  = 200
Default number of documents returned for each block (at most)
SYNC_TIMEOUT  = 3
Number of seconds to wait before timing out and stopping syncGenDocOffsetsAmongstIterators if it is slow
$count_block  : int
The number of documents in the current block
$current_block_fresh  : bool
Says whether the value in $this->count_block is up to date
$index_bundle_iterators  : array<string|int, mixed>
An array of iterators whose intersection we get documents from
$least_num_doc_index  : int
Which of the iterators has the current document with least index
$num_docs  : int
Estimate of the number of documents that this iterator can return
$num_iterators  : int
Number of elements in $this->index_bundle_iterators
$num_words  : int
Number of elements in $this->word_iterator_map
$pages  : array<string|int, mixed>
Cache of what currentDocsWithWord returns
$quote_positions  : array<string|int, mixed>
Each element in this array corresponds to one quoted phrase in the original query. Each element is in turn an array with elements corresponding to the position of a term in the original query followed by its length (a term might involve more than one word, so the length could be greater than one). Entries may also be of the form *num => * to indicate that an asterisk (a wild card that can match any number of terms) appeared at that place in the query
$results_per_block  : int
Number of documents returned for each block (at most)
$seen_docs  : int
The number of documents already iterated over
$seen_docs_unfiltered  : int
The number of iterated docs before the restriction test
$sync_time  : int
Start time for syncGenDocOffsetsAmongstIterators
$sync_timer_on  : bool
Whether to run a timer that shuts down the intersect iterator if syncGenDocOffsetsAmongstIterators takes longer than the time out period
$to_advance_index  : int
Index of the iterator amongst those we are intersecting to advance next
$total_num_docs  : int
$weight  : float
A weighting factor to multiply with each doc SCORE returned from this iterator
$word_iterator_map  : array<string|int, mixed>
Associative array (term position in original query => iterator index of an iterator for that term). This is to handle queries where the same term occurs multiple times. For example, the rock band "The The"
__construct()  : mixed
Creates an intersect iterator with the given parameters.
advance()  : mixed
Forwards the iterator one group of docs
advanceSeenDocs()  : mixed
Updates the seen_docs count during an advance() call
checkQuote()  : int
Auxiliary function for checkQuotes() used to check if quoted terms in the search query appear exactly in the position lists of the current document
checkQuotes()  : bool
Used to check if quoted terms in search query appear exactly in the position lists of the current document
computeProximity()  : sum
Given the position_lists of a collection of terms, computes a score for how close those words were in the given document
currentDocsWithWord()  : mixed
Gets the current block of doc ids and scores associated with this iterator's word
currentGenDocOffsetWithWord()  : mixed
Gets the doc_offset and generation for the next document that would be returned by this iterator
findDocsWithWord()  : mixed
Hook function used by currentDocsWithWord to return the current block of docs if it is not cached
genDocOffsetCmp()  : int
Compares two arrays each containing a (generation, offset) pair.
getCurrentDocsForKeys()  : array<string|int, mixed>
Gets the summaries associated with the keys provided the keys can be found in the current block of docs returned by this iterator
getDirection()  : int
Returns CrawlConstants::ASCENDING or CrawlConstants::DESCENDING depending on the direction in which this iterator traverses the underlying index archive bundle.
nextDocsWithWord()  : array<string|int, mixed>
Gets the current block of doc summaries for the word iterator and advances the current pointer to the next block of documents. If a doc offset is given, the next block must consist of docs at or after this doc_offset
plan()  : string
Returns a string representation of a plan by which the current iterator finds its results
reset()  : mixed
Returns the iterator to the first document block that it could iterate over
setResultsPerBlock()  : mixed
This method is supposed to set the value of the results_per_block field. This field controls the maximum number of results that can be returned in one go by currentDocsWithWord(). This method cannot be implemented consistently for this iterator while still behaving nicely when this iterator is used together with union_iterator. So, to prevent a user from doing this, calling this method results in a user-defined error
syncGenDocOffsetsAmongstIterators()  : mixed
Finds the next generation and doc offset amongst all the iterators that contain the word. It assumes that the (generation, doc offset) pairs are ordered in an increasing fashion for the underlying iterators

Constants

RESULTS_PER_BLOCK

Default number of documents returned for each block (at most)

public int RESULTS_PER_BLOCK = 200

SYNC_TIMEOUT

Number of seconds to wait before timing out and stopping syncGenDocOffsetsAmongstIterators if it is slow

public mixed SYNC_TIMEOUT = 3

Properties

$current_block_fresh

Says whether the value in $this->count_block is up to date

public bool $current_block_fresh

$index_bundle_iterators

An array of iterators whose intersection we get documents from

public array<string|int, mixed> $index_bundle_iterators

$least_num_doc_index

Which of the iterators has the current document with least index

public int $least_num_doc_index

$num_docs

Estimate of the number of documents that this iterator can return

public int $num_docs

$num_iterators

Number of elements in $this->index_bundle_iterators

public int $num_iterators

$num_words

Number of elements in $this->word_iterator_map

public int $num_words

$pages

Cache of what currentDocsWithWord returns

public array<string|int, mixed> $pages

$quote_positions

Each element in this array corresponds to one quoted phrase in the original query. Each element is in turn an array with elements corresponding to the position of a term in the original query followed by its length (a term might involve more than one word, so the length could be greater than one). Entries may also be of the form *num => * to indicate that an asterisk (a wild card that can match any number of terms) appeared at that place in the query

public array<string|int, mixed> $quote_positions

$results_per_block

Number of documents returned for each block (at most)

public int $results_per_block = self::RESULTS_PER_BLOCK

$seen_docs_unfiltered

The number of iterated docs before the restriction test

public int $seen_docs_unfiltered

$sync_time

Start time for syncGenDocOffsetsAmongstIterators

public int $sync_time

$sync_timer_on

Whether to run a timer that shuts down the intersect iterator if syncGenDocOffsetsAmongstIterators takes longer than the time out period

public bool $sync_timer_on

$to_advance_index

Index of the iterator amongst those we are intersecting to advance next

public int $to_advance_index

$weight

A weighting factor to multiply with each doc SCORE returned from this iterator

public float $weight

$word_iterator_map

Associative array (term position in original query => iterator index of an iterator for that term). This is to handle queries where the same term occurs multiple times. For example, the rock band "The The"

public array<string|int, mixed> $word_iterator_map
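
As an illustration of the mapping described above, the following sketch builds such an array for a tokenized query. The function name sketchWordIteratorMap and its return shape are hypothetical and not part of Yioop; the sketch only assumes that each distinct term is served by one iterator index.

```php
<?php
/**
 * Builds [term position in query => iterator index] so that repeated
 * terms share one iterator. Returns the map plus the distinct terms.
 * Hypothetical helper for illustration only.
 */
function sketchWordIteratorMap(array $query_terms): array
{
    $iterator_index_of = []; // term => index of its single iterator
    $word_iterator_map = [];
    foreach (array_values($query_terms) as $position => $term) {
        if (!isset($iterator_index_of[$term])) {
            // first time this term is seen: give it the next free index
            $iterator_index_of[$term] = count($iterator_index_of);
        }
        $word_iterator_map[$position] = $iterator_index_of[$term];
    }
    return [$word_iterator_map, array_keys($iterator_index_of)];
}
```

For the query terms ["the", "the"] the sketch yields the map [0 => 0, 1 => 0] with a single distinct term, so the map has two entries while only one iterator is needed.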

Methods

__construct()

Creates an intersect iterator with the given parameters.

public __construct(object $index_bundle_iterators, array<string|int, mixed> $word_iterator_map[, array<string|int, mixed> $quote_positions = null ][, float $weight = 1 ]) : mixed
Parameters
$index_bundle_iterators : object

to use as a source of documents to iterate over

$word_iterator_map : array<string|int, mixed>

associative array (term position in original query => iterator index of an iterator for that term)

$quote_positions : array<string|int, mixed> = null

Each element in this array corresponds to one quoted phrase in the original query. See the $quote_positions field variable in this class for more info

$weight : float = 1

multiplicative factor to apply to scores returned from this iterator

Return values
mixed

advance()

Forwards the iterator one group of docs

public advance([array<string|int, mixed> $gen_doc_offset = null ]) : mixed
Parameters
$gen_doc_offset : array<string|int, mixed> = null

a (generation, doc_offset) pair. If set, the next block must be of greater than or equal generation, and if the generation is equal, all docs in the next block must have doc offsets larger than or equal to this value

Return values
mixed
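
The constraint on $gen_doc_offset can be pictured as skipping forward through an ascending list of pairs. This is only a sketch under stated assumptions: sketchAdvanceTo, the flat $pairs array, and the integer cursor are hypothetical stand-ins for the iterator's internal state, which this page does not describe.

```php
<?php
/**
 * Moves a cursor forward through ascending [generation, offset] pairs
 * until it reaches the first pair that is >= $gen_doc_offset.
 * Returns the new cursor, which equals count($pairs) if no pair qualifies.
 * Hypothetical helper for illustration only.
 */
function sketchAdvanceTo(array $pairs, int $position,
    array $gen_doc_offset): int
{
    $num_pairs = count($pairs);
    // equal-length arrays compare element by element, so the generation
    // is compared first and the offset only breaks ties
    while ($position < $num_pairs &&
        ($pairs[$position] <=> $gen_doc_offset) < 0) {
        $position++;
    }
    return $position;
}
```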

advanceSeenDocs()

Updates the seen_docs count during an advance() call

public advanceSeenDocs() : mixed
Return values
mixed

checkQuote()

Auxiliary function for checkQuotes() used to check if quoted terms in the search query appear exactly in the position lists of the current document

public checkQuote(array<string|int, mixed> &$position_lists, int $cur_pos, mixed $next_pos, array<string|int, mixed> $ngram_positions_within_quoted_query) : int
Parameters
$position_lists : array<string|int, mixed>

position lists of the search terms in the current document

$cur_pos : int

position to look after in any position list

$next_pos : mixed

"*" or int. If "*", the next position must be >= $cur_pos + len_search_term. $next_pos represents the position the next quoted term should be at

$ngram_positions_within_quoted_query : array<string|int, mixed>

pairs: $ngram_position_within_quoted_query => $len_of_ngram

Return values
int

-1 on failure, 0 on backtrack, 1 on success

checkQuotes()

Used to check if quoted terms in search query appear exactly in the position lists of the current document

public checkQuotes(array<string|int, mixed> &$position_lists) : bool
Parameters
$position_lists : array<string|int, mixed>

position lists of the search terms in the current document

Return values
bool

whether the quoted terms in the search appear exactly
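
To make the idea of an exact match against position lists concrete, here is a simplified sketch. It is not the Yioop implementation: sketchPhraseMatch and its $lengths parameter are hypothetical, and the sketch ignores the wildcard entries and the backtracking signalled by checkQuote()'s -1/0/1 return values.

```php
<?php
/**
 * Returns true if some occurrence of the first quoted term is followed
 * immediately by each later quoted term in turn.
 * $position_lists[$i] holds the positions of the i-th quoted term and
 * $lengths[$i] says how many word positions that term spans.
 * Hypothetical helper for illustration only.
 */
function sketchPhraseMatch(array $position_lists, array $lengths): bool
{
    if (empty($position_lists)) {
        return false;
    }
    // flip each list so that position membership is a key lookup
    $lookups = array_map('array_flip', $position_lists);
    foreach ($position_lists[0] as $start) {
        $expected = $start;
        $matched = true;
        foreach ($lookups as $i => $lookup) {
            if (!isset($lookup[$expected])) {
                $matched = false;
                break;
            }
            // the next term must begin right after this one ends
            $expected += $lengths[$i];
        }
        if ($matched) {
            return true;
        }
    }
    return false;
}
```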

computeProximity()

Given the position_lists of a collection of terms, computes a score for how close those words were in the given document

public computeProximity(array<string|int, mixed> &$word_position_lists, array<string|int, mixed> &$word_len_lists, bool $is_doc, int $doc_len) : sum
Parameters
$word_position_lists : array<string|int, mixed>

a 2D array item number => position_list (locations in doc where item occurred) for that item.

$word_len_lists : array<string|int, mixed>

length for each item of its position list

$is_doc : bool

whether this is the position list of a document or a link

$doc_len : int

the length of the document

Return values
sum

the sum of the inverses of all covers, computed by a plane sweep algorithm
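
The return value can be illustrated with a small plane sweep over the merged position lists. This is a sketch under assumptions, not the Yioop code: it treats a cover as a minimal span of positions containing at least one occurrence of every term, measures its length as last position minus first position plus one, and leaves out any weighting by $is_doc, $doc_len, or $word_len_lists. The name sketchProximity is hypothetical.

```php
<?php
/**
 * Sums 1 / (cover length) over all minimal covers of the given terms.
 * $word_position_lists maps item number => ascending list of positions.
 * Hypothetical helper for illustration only.
 */
function sketchProximity(array $word_position_lists): float
{
    $num_terms = count($word_position_lists);
    if ($num_terms == 0) {
        return 0.0;
    }
    // flatten into [position, term index] events and sort by position
    $events = [];
    foreach (array_values($word_position_lists) as $term => $positions) {
        foreach ($positions as $position) {
            $events[] = [$position, $term];
        }
    }
    sort($events);
    $counts = array_fill(0, $num_terms, 0); // occurrences inside window
    $covered = 0; // number of distinct terms inside the window
    $left = 0;    // index of the window's left-most event
    $score = 0.0;
    foreach ($events as [$right_pos, $term]) {
        if ($counts[$term]++ == 0) {
            $covered++;
        }
        if ($covered < $num_terms) {
            continue;
        }
        // drop redundant occurrences from the left edge of the window
        while ($counts[$events[$left][1]] > 1) {
            $counts[$events[$left][1]]--;
            $left++;
        }
        $score += 1.0 / ($right_pos - $events[$left][0] + 1);
        // slide past the left edge so the next cover found is a new one
        $counts[$events[$left][1]]--;
        $covered--;
        $left++;
    }
    return $score;
}
```

Under these assumptions, two terms at adjacent positions contribute 1/2, and covers that stretch over more positions contribute less, so documents where the terms cluster together score higher.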

currentDocsWithWord()

Gets the current block of doc ids and scores associated with this iterator's word

public currentDocsWithWord() : mixed
Return values
mixed

doc ids and score if there are docs left, -1 otherwise

currentGenDocOffsetWithWord()

Gets the doc_offset and generation for the next document that would be returned by this iterator

public currentGenDocOffsetWithWord() : mixed
Return values
mixed

an array with the desired document offset and generation; -1 on fail

findDocsWithWord()

Hook function used by currentDocsWithWord to return the current block of docs if it is not cached

public findDocsWithWord() : mixed
Return values
mixed

doc ids and rank if there are docs left, -1 otherwise
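
The relationship between currentDocsWithWord() and this hook can be sketched as a simple cache-then-compute pattern. The class below is a hypothetical stand-in, not Yioop code: using $current_block_fresh as the cache flag, the $blocks array, and the $find_calls counter are all assumptions made for illustration.

```php
<?php
class SketchCachedBlocks
{
    /** @var mixed cache of what currentDocsWithWord() last returned */
    public $pages = [];
    /** @var bool whether $pages reflects the current block (assumed use) */
    public $current_block_fresh = false;
    /** @var int counts hook calls, only to make the caching visible */
    public $find_calls = 0;
    /** @var array remaining blocks, a stand-in for a real index */
    private $blocks;

    public function __construct(array $blocks)
    {
        $this->blocks = $blocks;
    }
    public function currentDocsWithWord()
    {
        if (!$this->current_block_fresh) {
            // cache miss: ask the hook for the block and remember it
            $this->pages = $this->findDocsWithWord();
            $this->current_block_fresh = true;
        }
        return $this->pages;
    }
    public function findDocsWithWord()
    {
        $this->find_calls++;
        return empty($this->blocks) ? -1 : $this->blocks[0];
    }
    public function advance()
    {
        array_shift($this->blocks);
        // the cached block is stale once the iterator has moved on
        $this->current_block_fresh = false;
    }
}
```

In this sketch, repeated calls to currentDocsWithWord() between advances reach the hook only once, and -1 is returned when no blocks remain.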

genDocOffsetCmp()

Compares two arrays each containing a (generation, offset) pair.

public genDocOffsetCmp(array<string|int, mixed> $gen_doc1, array<string|int, mixed> $gen_doc2[, int $direction = self::ASCENDING ]) : int
Parameters
$gen_doc1 : array<string|int, mixed>

first ordered pair

$gen_doc2 : array<string|int, mixed>

second ordered pair

$direction : int = self::ASCENDING

whether the comparison should be done for a self::ASCENDING or a self::DESCENDING search

Return values
int

-1,0,1 depending on which is bigger
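
A comparison of this kind might look like the following sketch. The constants SKETCH_ASCENDING and SKETCH_DESCENDING and the function sketchGenDocOffsetCmp are hypothetical stand-ins; the sketch assumes each pair is stored as [generation, offset] and that a descending search simply reverses the sign, which this page does not confirm.

```php
<?php
// Hypothetical stand-ins for the direction constants; the real values
// are defined elsewhere in Yioop and may differ.
const SKETCH_ASCENDING = 1;
const SKETCH_DESCENDING = -1;

/**
 * Compares two [generation, offset] pairs, generation first.
 * Returns -1, 0, or 1; the sign is flipped for a descending search.
 * Hypothetical helper for illustration only.
 */
function sketchGenDocOffsetCmp(array $gen_doc1, array $gen_doc2,
    int $direction = SKETCH_ASCENDING): int
{
    $cmp = $gen_doc1[0] <=> $gen_doc2[0];
    if ($cmp === 0) {
        // same generation, so the offset decides
        $cmp = $gen_doc1[1] <=> $gen_doc2[1];
    }
    return ($direction === SKETCH_ASCENDING) ? $cmp : -$cmp;
}
```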

getCurrentDocsForKeys()

Gets the summaries associated with the keys provided the keys can be found in the current block of docs returned by this iterator

public getCurrentDocsForKeys([array<string|int, mixed> $keys = null ]) : array<string|int, mixed>
Parameters
$keys : array<string|int, mixed> = null

keys to try to find in the current block of returned results

Return values
array<string|int, mixed>

doc summaries that match provided keys

getDirection()

Returns CrawlConstants::ASCENDING or CrawlConstants::DESCENDING depending on the direction in which this iterator traverses the underlying index archive bundle.

public getDirection() : int
Return values
int

direction traversing underlying archive bundle

nextDocsWithWord()

Gets the current block of doc summaries for the word iterator and advances the current pointer to the next block of documents. If a doc offset is given, the next block must consist of docs at or after this doc_offset

public nextDocsWithWord([ $doc_offset = null ]) : array<string|int, mixed>
Parameters
$doc_offset : = null

if set the next block must all have $doc_offsets equal to or larger than this value

Return values
array<string|int, mixed>

doc summaries matching the $this->restrict_phrases

plan()

Returns a string representation of a plan by which the current iterator finds its results

public plan() : string
Return values
string

a representation of the current iterator and its subiterators, useful for determining how a query will be processed

reset()

Returns the iterator to the first document block that it could iterate over

public reset() : mixed
Return values
mixed

setResultsPerBlock()

This method is supposed to set the value of the results_per_block field. This field controls the maximum number of results that can be returned in one go by currentDocsWithWord(). This method cannot be implemented consistently for this iterator while still behaving nicely when this iterator is used together with union_iterator. So, to prevent a user from doing this, calling this method results in a user-defined error

public setResultsPerBlock(int $num) : mixed
Parameters
$num : int

the maximum number of results that can be returned by a block

Return values
mixed

syncGenDocOffsetsAmongstIterators()

Finds the next generation and doc offset amongst all the iterators that contain the word. It assumes that the (generation, doc offset) pairs are ordered in an increasing fashion for the underlying iterators

public syncGenDocOffsetsAmongstIterators() : mixed
Return values
mixed
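
The synchronization step relies on the ordering assumption stated above, which a small sketch can make concrete. This is not the Yioop implementation: sketchIntersectSorted is a hypothetical function that works on plain arrays of [generation, offset] pairs instead of iterators, and its $timeout parameter only mimics the role that SYNC_TIMEOUT appears to play.

```php
<?php
/**
 * Intersects several ascending lists of [generation, offset] pairs by
 * repeatedly advancing whichever lists are behind the current maximum.
 * Returns the common pairs found before $timeout seconds elapse.
 * Hypothetical helper for illustration only.
 */
function sketchIntersectSorted(array $lists, float $timeout = 3.0): array
{
    $lists = array_values($lists);
    $common = [];
    if (empty($lists)) {
        return $common;
    }
    $start = microtime(true);
    $positions = array_fill(0, count($lists), 0);
    while (microtime(true) - $start <= $timeout) {
        // find the largest current pair; every list must catch up to it
        $max = null;
        foreach ($lists as $i => $list) {
            if ($positions[$i] >= count($list)) {
                return $common; // one list is exhausted: no more matches
            }
            $current = $list[$positions[$i]];
            if ($max === null || ($current <=> $max) > 0) {
                $max = $current;
            }
        }
        // advance every list that is still behind the maximum
        $all_equal = true;
        foreach ($lists as $i => $list) {
            if (($list[$positions[$i]] <=> $max) < 0) {
                $positions[$i]++;
                $all_equal = false;
            }
        }
        if ($all_equal) {
            // all lists agree on this pair, so it is in the intersection
            $common[] = $max;
            foreach ($positions as $i => $position) {
                $positions[$i] = $position + 1;
            }
        }
    }
    return $common; // timed out: return what was found so far
}
```

Because each list is ascending, a list that is behind the maximum can never contain a match before that maximum, which is why only the lagging lists need to move.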

        
