NetworkIterator
extends IndexBundleIterator
This iterator handles sending a query to a network of queue_servers and collecting their results
Table of Contents
- HOST_KEY_POS = 17
- Host Key position + 1 (first char says doc, inlink or external link)
- KEY_LEN = 8
- Length of a doc key
- RESULTS_PER_BLOCK = 200
- Default number of documents returned for each block (at most)
- $base_query : string
- Part of query without limit and num to be processed by all queue_server machines
- $count_block : int
- The number of documents in the current block
- $current_block_fresh : bool
- Says whether the value in $this->count_block is up to date
- $filter : SearchfiltersModel
- Model responsible for keeping track of edited and deleted search results
- $hard_query : int
- Used to keep track of the original desired number of results to be returned in one find docs call versus the number actually retrieved.
- $last_results_per_block : int
- Last value for results_per_block
- $limit : string
- Current limit number to be added to base query
- $more_flags : mixed
- Flags used to keep track of whether a given machine has more search result data. Array of booleans
- $more_results : array<string|int, mixed>
- Flags for each server saying if there are more results for that server or not
- $next_results_per_server : int
- Used to adaptively change the number of pages requested from each machine based on the number of machines that still have results
- $num_docs : int
- Estimate of the number of documents that this iterator can return
- $num_downloaded : int
- Number of query results downloaded from machines involved in network query
- $pages : array<string|int, mixed>
- Cache of what currentDocsWithWord returns
- $queue_servers : string
- An array of servers to which the query is sent
- $ranking_factors : array<string|int, mixed>
- How url, keywords, and title words should influence relevance and doc rank calculations
- $results_per_block : int
- Number of documents returned for each block (at most)
- $seen_docs : int
- The number of documents already iterated over
- $total_num_docs : int
- __construct() : mixed
- Creates a network iterator with the given parameters.
- advance() : mixed
- Forwards the iterator one group of docs
- advanceSeenDocs() : mixed
- Updates the seen_docs count during an advance() call
- currentDocsWithWord() : mixed
- Gets the current block of doc ids and scores associated with this iterator's word
- currentGenDocOffsetWithWord() : mixed
- Gets the doc_offset and generation for the next document that would be returned by this iterator. As this is not easily determined for a network iterator, this method always returns -1 for this iterator
- findDocsWithWord() : mixed
- Hook function used by currentDocsWithWord to return the current block of docs if it is not cached
- genDocOffsetCmp() : int
- Compares two arrays each containing a (generation, offset) pair.
- getCurrentDocsForKeys() : array<string|int, mixed>
- Gets the summaries associated with the keys provided the keys can be found in the current block of docs returned by this iterator
- getDirection() : int
- Returns the direction of an IndexBundleIterator. Depending on the iterator, the direction could be either forward from the start of an index (self::ASCENDING) or backward from the end of the index (self::DESCENDING). For this base class, the function always returns self::ASCENDING, but subclasses might return different values.
- makeLookupLink() : string
- Called to make a link for AnalyticsManager about a network query performed by this iterator.
- nextDocsWithWord() : array<string|int, mixed>
- Gets the current block of doc summaries for the word iterator and advances the current pointer to the next block of documents. If a doc index is given, the next block must consist of docs after this doc_index
- plan() : string
- Returns a string representation of a plan by which the current iterator finds its results
- reset() : mixed
- Returns the iterator to the first document block that it could iterate over
- serverAdjustedResultsPerBlock() : int
- If we want the top $num_results results (a block) and we have $num_machines, this computes how many results we should request of each machine.
- setResultsPerBlock() : mixed
- Sets the value of the results_per_block field. This field controls the maximum number of results that can be returned in one go by currentDocsWithWord()
Constants
HOST_KEY_POS
Host Key position + 1 (first char says doc, inlink or external link)
public
mixed
HOST_KEY_POS
= 17
KEY_LEN
Length of a doc key
public
mixed
KEY_LEN
= 8
RESULTS_PER_BLOCK
Default number of documents returned for each block (at most)
public
int
RESULTS_PER_BLOCK
= 200
Properties
$base_query
Part of query without limit and num to be processed by all queue_server machines
public
string
$base_query
$count_block
The number of documents in the current block
public
int
$count_block
$current_block_fresh
Says whether the value in $this->count_block is up to date
public
bool
$current_block_fresh
$filter
Model responsible for keeping track of edited and deleted search results
public
SearchfiltersModel
$filter
$hard_query
Used to keep track of the original desired number of results to be returned in one find docs call versus the number actually retrieved.
public
int
$hard_query
$last_results_per_block
Last value for results_per_block
public
int
$last_results_per_block
$limit
Current limit number to be added to base query
public
string
$limit
$more_flags
Flags used to keep track of whether a given machine has more search result data. Array of booleans
public
mixed
$more_flags
@var array
$more_results
Flags for each server saying if there are more results for that server or not
public
array<string|int, mixed>
$more_results
$next_results_per_server
Used to adaptively change the number of pages requested from each machine based on the number of machines that still have results
public
int
$next_results_per_server
$num_docs
Estimate of the number of documents that this iterator can return
public
int
$num_docs
$num_downloaded
Number of query results downloaded from machines involved in network query
public
int
$num_downloaded
$pages
Cache of what currentDocsWithWord returns
public
array<string|int, mixed>
$pages
$queue_servers
An array of servers to which the query is sent
public
string
$queue_servers
$ranking_factors
How url, keywords, and title words should influence relevance and doc rank calculations
public
array<string|int, mixed>
$ranking_factors
$results_per_block
Number of documents returned for each block (at most)
public
int
$results_per_block
= self::RESULTS_PER_BLOCK
$seen_docs
The number of documents already iterated over
public
int
$seen_docs
$total_num_docs
public
int
$total_num_docs
Methods
__construct()
Creates a network iterator with the given parameters.
public
__construct(string $query, array<string|int, mixed> $queue_servers, string $timestamp[, SearchfiltersModel $filter = null ][, string $save_timestamp_name = "" ][, array<string|int, mixed> $ranking_factors = [] ]) : mixed
Parameters
- $query : string
-
the query that was supplied by the end user that we are trying to get search results for
- $queue_servers : array<string|int, mixed>
-
urls of yioop instances on which documents indexes live
- $timestamp : string
-
the timestamp of the particular current index archive bundles that we look in for results
- $filter : SearchfiltersModel = null
-
Model responsible for keeping track of edited and deleted search results
- $save_timestamp_name : string = ""
-
if this timestamp is nonzero, then when making queries to separate machines the save_timestamp is sent so that the queries on those machines can make savepoints. Note that the format of save_timestamp is timestamp-query_part, where query_part is the number of the item in a query presentation (usually 0).
- $ranking_factors : array<string|int, mixed> = []
-
fields that say how url, keywords, and title words should influence relevance and doc rank calculations
Return values
mixed
advance()
Forwards the iterator one group of docs
public
advance([array<string|int, mixed> $gen_doc_offset = null ]) : mixed
Parameters
- $gen_doc_offset : array<string|int, mixed> = null
-
a (generation, doc_offset) pair. If set, the next block must be of greater than or equal generation, and if the generation is equal, the next block must all have $doc_offsets larger than or equal to this value
Return values
mixed
advanceSeenDocs()
Updates the seen_docs count during an advance() call
public
advanceSeenDocs() : mixed
Return values
mixed
currentDocsWithWord()
Gets the current block of doc ids and scores associated with this iterator's word
public
currentDocsWithWord() : mixed
Return values
mixed —doc ids and score if there are docs left, -1 otherwise
currentGenDocOffsetWithWord()
Gets the doc_offset and generation for the next document that would be returned by this iterator. As this is not easily determined for a network iterator, this method always returns -1 for this iterator
public
currentGenDocOffsetWithWord() : mixed
Return values
mixed —an array with the desired document offset and generation; -1 on fail
findDocsWithWord()
Hook function used by currentDocsWithWord to return the current block of docs if it is not cached
public
findDocsWithWord() : mixed
Return values
mixed —doc ids and score if there are docs left, -1 otherwise
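The relationship between currentDocsWithWord(), the findDocsWithWord() hook, and advance() can be illustrated with a toy iterator. This is only a sketch: the class SketchBlockIterator, its in-memory document list, and the $hook_calls counter are hypothetical and are not how NetworkIterator actually talks to queue_servers. It assumes only what the descriptions above state: the current block is cached, the hook runs when no fresh block is cached, and -1 signals that no docs are left.

```php
<?php
// Hypothetical toy iterator; NOT the real NetworkIterator implementation.
class SketchBlockIterator
{
    /** @var array|int|null cached block (or -1), null before first use */
    public $pages = null;
    /** @var bool whether $pages reflects the current position */
    public $current_block_fresh = false;
    /** @var int number of documents already iterated over */
    public $seen_docs = 0;
    /** @var int maximum number of documents per block */
    public $results_per_block = 2;
    /** @var int how often the hook ran (illustration only) */
    public $hook_calls = 0;
    /** @var array made-up documents: doc key => score */
    private $all_docs;

    public function __construct(array $all_docs)
    {
        $this->all_docs = $all_docs;
    }
    // Returns the cached block when it is fresh, otherwise asks the hook.
    public function currentDocsWithWord()
    {
        if (!$this->current_block_fresh) {
            $this->pages = $this->findDocsWithWord();
            $this->current_block_fresh = true;
        }
        return $this->pages;
    }
    // Hook: computes the current block, or -1 when no docs are left.
    public function findDocsWithWord()
    {
        $this->hook_calls++;
        $block = array_slice($this->all_docs, $this->seen_docs,
            $this->results_per_block, true);
        return empty($block) ? -1 : $block;
    }
    // Moves past the current block and invalidates the cache.
    public function advance()
    {
        $current = $this->currentDocsWithWord();
        if (is_array($current)) {
            $this->seen_docs += count($current);
        }
        $this->current_block_fresh = false;
    }
}

$iterator = new SketchBlockIterator(["a" => 0.9, "b" => 0.7, "c" => 0.4]);
$first = $iterator->currentDocsWithWord();
$again = $iterator->currentDocsWithWord(); // served from the cache
$iterator->advance();
$second = $iterator->currentDocsWithWord();
$iterator->advance();
$done = $iterator->currentDocsWithWord(); // nothing left
```

In this sketch the hook only runs when the cached block is stale, so repeated calls between two advance() calls cost nothing extra.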
genDocOffsetCmp()
Compares two arrays each containing a (generation, offset) pair.
public
genDocOffsetCmp(array<string|int, mixed> $gen_doc1, array<string|int, mixed> $gen_doc2[, int $direction = self::ASCENDING ]) : int
Parameters
- $gen_doc1 : array<string|int, mixed>
-
first ordered pair
- $gen_doc2 : array<string|int, mixed>
-
second ordered pair
- $direction : int = self::ASCENDING
-
whether the comparison should be done for a self::ASCENDING or a self::DESCENDING search
Return values
int —-1,0,1 depending on which is bigger
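A comparison of (generation, offset) pairs along these lines might look like the following. This is a minimal sketch, not the library's code: the function name sketchGenDocOffsetCmp and the constants SKETCH_ASCENDING and SKETCH_DESCENDING are hypothetical stand-ins, and it assumes that a descending search simply reverses the ascending order.

```php
<?php
// Hypothetical stand-ins for self::ASCENDING and self::DESCENDING; the real
// constant values are not given in this reference.
const SKETCH_ASCENDING = 1;
const SKETCH_DESCENDING = -1;

/**
 * Compares two [generation, offset] pairs: generation first, then offset.
 * Assumption: a DESCENDING search just flips the sign of the result.
 */
function sketchGenDocOffsetCmp(array $gen_doc1, array $gen_doc2,
    int $direction = SKETCH_ASCENDING): int
{
    // Offsets only matter when the generations are equal.
    $cmp = ($gen_doc1[0] <=> $gen_doc2[0])
        ?: ($gen_doc1[1] <=> $gen_doc2[1]);
    return ($direction === SKETCH_DESCENDING) ? -$cmp : $cmp;
}

$same_generation = sketchGenDocOffsetCmp([1, 5], [1, 7]);
$later_generation = sketchGenDocOffsetCmp([2, 0], [1, 100]);
$equal_pairs = sketchGenDocOffsetCmp([3, 4], [3, 4]);
$descending = sketchGenDocOffsetCmp([1, 5], [1, 7], SKETCH_DESCENDING);
```

Comparing the generation before the offset means a pair in a later generation always sorts after any pair in an earlier one, whatever its offset.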
getCurrentDocsForKeys()
Gets the summaries associated with the keys provided the keys can be found in the current block of docs returned by this iterator
public
getCurrentDocsForKeys([array<string|int, mixed> $keys = null ]) : array<string|int, mixed>
Parameters
- $keys : array<string|int, mixed> = null
-
keys to try to find in the current block of returned results
Return values
array<string|int, mixed> —doc summaries that match provided keys
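The key lookup described here can be pictured as filtering the current block by the requested keys. The sketch below is illustrative only: sketchDocsForKeys is a hypothetical helper, it assumes the current block is an array of summaries indexed by doc key, and returning the whole block when $keys is null is an assumption about the default, not documented behavior.

```php
<?php
/**
 * Hypothetical helper: picks the summaries for $keys out of $current_block,
 * assumed to be an array of summaries indexed by doc key.
 * Assumption: with no keys given, the whole block is returned.
 */
function sketchDocsForKeys(array $current_block, ?array $keys = null): array
{
    if ($keys === null) {
        return $current_block;
    }
    // array_flip turns the list of keys into a lookup table of doc keys.
    return array_intersect_key($current_block, array_flip($keys));
}

$block = [
    "key1" => ["title" => "First page"],
    "key2" => ["title" => "Second page"],
    "key3" => ["title" => "Third page"],
];
// "missing" is not in the block, so it is silently skipped.
$found = sketchDocsForKeys($block, ["key2", "missing"]);
$everything = sketchDocsForKeys($block);
```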
getDirection()
Returns the direction of an IndexBundleIterator. Depending on the iterator, the direction could be either forward from the start of an index (self::ASCENDING) or backward from the end of the index (self::DESCENDING). For this base class, the function always returns self::ASCENDING, but subclasses might return different values.
public
getDirection() : int
Return values
int —either CrawlConstants::ASCENDING or CrawlConstants::DESCENDING
makeLookupLink()
Called to make a link for AnalyticsManager about a network query performed by this iterator.
public
makeLookupLink(array<string|int, mixed> $sites, int $index) : string
Parameters
- $sites : array<string|int, mixed>
-
used by this network iterator
- $index : int
-
which site in array to make link for
Return values
string —html of link
nextDocsWithWord()
Gets the current block of doc summaries for the word iterator and advances the current pointer to the next block of documents. If a doc index is given, the next block must consist of docs after this doc_index
public
nextDocsWithWord([ $doc_offset = null ]) : array<string|int, mixed>
Parameters
- $doc_offset : = null
-
if set, the next block must all have $doc_offsets equal to or larger than this value
Return values
array<string|int, mixed> —doc summaries matching the $this->restrict_phrases
plan()
Returns a string representation of a plan by which the current iterator finds its results
public
plan() : string
Return values
string —a representation of the current iterator and its subiterators, useful for determining how a query will be processed
reset()
Returns the iterator to the first document block that it could iterate over
public
reset() : mixed
Return values
mixed
serverAdjustedResultsPerBlock()
If we want the top $num_results results (a block) and we have $num_machines, this computes how many results we should request of each machine.
public
static serverAdjustedResultsPerBlock(int $num_machines, mixed $num_results) : int
Büttcher, Clarke, and Cormack give an exact formula to compute this, but it is slow to compute. We instead compute (1/$num_machines^{3/4}) * $num_results + 5.
Parameters
- $num_machines : int
-
number of machines each having a portion of the results
- $num_results : mixed
Return values
int —number of best results we should ask from each machine to ensure get top k best results overall
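As a worked illustration of the approximation (1/$num_machines^{3/4}) * $num_results + 5, here is a minimal sketch. The function name sketchServerAdjustedResultsPerBlock is hypothetical, and both the rounding to the nearest integer and the clamping of $num_machines to at least 1 are assumptions of this sketch rather than documented behavior.

```php
<?php
/**
 * Sketch of the approximation (1 / $num_machines^{3/4}) * $num_results + 5.
 * Assumptions: the result is rounded to the nearest integer and
 * $num_machines is clamped to at least 1.
 */
function sketchServerAdjustedResultsPerBlock(int $num_machines,
    int $num_results): int
{
    $num_machines = max(1, $num_machines);
    return (int) round($num_results / pow($num_machines, 0.75) + 5);
}

// One machine must supply the whole block plus the safety margin of 5.
$single = sketchServerAdjustedResultsPerBlock(1, 200);
// With 16 machines, 16^{3/4} = 8, so each is asked for 200 / 8 + 5 results.
$sixteen = sketchServerAdjustedResultsPerBlock(16, 200);
```

Under these assumptions, 16 machines and a block of 200 give 30 requested results per machine, or 480 downloaded results in total. The over-fetching is what lets the merged list still contain the overall top results when they are unevenly spread across machines.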
setResultsPerBlock()
Sets the value of the results_per_block field. This field controls the maximum number of results that can be returned in one go by currentDocsWithWord()
public
setResultsPerBlock(int $num) : mixed
Parameters
- $num : int
-
the maximum number of results that can be returned by a block