Yioop_V9.5_Source_Code_Documentation

NetworkIterator extends IndexBundleIterator

This iterator handles sending a query to a network of queue_servers and collecting their results.

Tags
author

Chris Pollett

Table of Contents

HOST_KEY_POS  = 17
Host Key position + 1 (first char says doc, inlink or external link)
KEY_LEN  = 8
Length of a doc key
RESULTS_PER_BLOCK  = 200
Default number of documents returned for each block (at most)
$base_query  : string
Part of query without limit and num to be processed by all queue_server machines
$count_block  : int
The number of documents in the current block
$current_block_fresh  : bool
Says whether the value in $this->count_block is up to date
$filter  : SearchfiltersModel
Model responsible for keeping track of edited and deleted search results
$hard_query  : int
Used to keep track of the original desired number of results to be returned in one find docs call versus the number actually retrieved.
$last_results_per_block  : int
Last value for results_per_block
$limit  : string
Current limit number to be added to base query
$more_flags  : mixed
Flags used to keep track of whether a given machine has more search result data. Array of booleans
$more_results  : array<string|int, mixed>
Flags for each server saying if there are more results for that server or not
$next_results_per_server  : int
Used to adaptively change the number of pages requested from each machine based on the number of machines that still have results
$num_docs  : int
Estimate of the number of documents that this iterator can return
$num_downloaded  : int
Number of query results downloaded from machines involved in network query
$pages  : array<string|int, mixed>
Cache of what currentDocsWithWord returns
$queue_servers  : string
An array of servers to send a query to
$ranking_factors  : array<string|int, mixed>
How url, keywords, and title words should influence relevance and doc rank calculations
$results_per_block  : int
Number of documents returned for each block (at most)
$seen_docs  : int
The number of documents already iterated over
$total_num_docs  : int
__construct()  : mixed
Creates a network iterator with the given parameters.
advance()  : mixed
Forwards the iterator one group of docs
advanceSeenDocs()  : mixed
Updates the seen_docs count during an advance() call
currentDocsWithWord()  : mixed
Gets the current block of doc ids and scores associated with this iterator's word
currentGenDocOffsetWithWord()  : mixed
Gets the doc_offset and generation for the next document that would be returned by this iterator. As this is not easily determined for a network iterator, this method always returns -1 for this iterator
findDocsWithWord()  : mixed
Hook function used by currentDocsWithWord to return the current block of docs if it is not cached
genDocOffsetCmp()  : int
Compares two arrays each containing a (generation, offset) pair.
getCurrentDocsForKeys()  : array<string|int, mixed>
Gets the summaries associated with the keys provided the keys can be found in the current block of docs returned by this iterator
getDirection()  : int
Returns the direction of an IndexBundleIterator. Depending on the iterator, this could be either forward from the start of an index (self::ASCENDING) or backward from the end of the index (self::DESCENDING). For this base class, the function always returns self::ASCENDING, but subclasses might return different values.
makeLookupLink()  : string
Called to make a link for AnalyticsManager about a network query performed by this iterator.
nextDocsWithWord()  : array<string|int, mixed>
Gets the current block of doc summaries for the word iterator and advances the current pointer to the next block of documents. If a doc index is given, the next block must consist of docs after this doc_index
plan()  : string
Returns a string representation of a plan by which the current iterator finds its results
reset()  : mixed
Returns the iterator to the first document block that it could iterate over
serverAdjustedResultsPerBlock()  : int
If we want the top $num_results results (a block) and we have $num_machines, this computes how many results we should request of each machine.
setResultsPerBlock()  : mixed
Sets the value of the results_per_block field. This field controls the maximum number of results that can be returned in one go by currentDocsWithWord()

Constants

HOST_KEY_POS

Host Key position + 1 (first char says doc, inlink or external link)

public mixed HOST_KEY_POS = 17

RESULTS_PER_BLOCK

Default number of documents returned for each block (at most)

public int RESULTS_PER_BLOCK = 200

Properties

$base_query

Part of query without limit and num to be processed by all queue_server machines

public string $base_query

$current_block_fresh

Says whether the value in $this->count_block is up to date

public bool $current_block_fresh

$filter

Model responsible for keeping track of edited and deleted search results

public SearchfiltersModel $filter

$hard_query

Used to keep track of the original desired number of results to be returned in one find docs call versus the number actually retrieved.

public int $hard_query

$last_results_per_block

Last value for results_per_block

public int $last_results_per_block

$limit

Current limit number to be added to base query

public string $limit

$more_flags

Flags used to keep track of whether a given machine has more search result data. Array of booleans

public mixed $more_flags

@var array

$more_results

Flags for each server saying if there are more results for that server or not

public array<string|int, mixed> $more_results

$next_results_per_server

Used to adaptively change the number of pages requested from each machine based on the number of machines that still have results

public int $next_results_per_server

$num_docs

Estimate of the number of documents that this iterator can return

public int $num_docs

$num_downloaded

Number of query results downloaded from machines involved in network query

public int $num_downloaded

$pages

Cache of what currentDocsWithWord returns

public array<string|int, mixed> $pages

$queue_servers

An array of servers to send a query to

public string $queue_servers

$ranking_factors

How url, keywords, and title words should influence relevance and doc rank calculations

public array<string|int, mixed> $ranking_factors

$results_per_block

Number of documents returned for each block (at most)

public int $results_per_block = self::RESULTS_PER_BLOCK

Methods

__construct()

Creates a network iterator with the given parameters.

public __construct(string $query, array<string|int, mixed> $queue_servers, string $timestamp[, SearchfiltersModel $filter = null ][, string $save_timestamp_name = "" ][, array<string|int, mixed> $ranking_factors = [] ]) : mixed
Parameters
$query : string

the query that was supplied by the end user that we are trying to get search results for

$queue_servers : array<string|int, mixed>

URLs of Yioop instances on which document indexes live

$timestamp : string

the timestamp of the particular current index archive bundles that we look in for results

$filter : SearchfiltersModel = null

Model responsible for keeping track of edited and deleted search results

$save_timestamp_name : string = ""

if this timestamp is nonzero, then when making queries to separate machines the save_timestamp is sent so the queries on those machines can make savepoints. Note the format of save_timestamp is timestamp-query_part, where query_part is the number of the item in a query presentation (usually 0).

$ranking_factors : array<string|int, mixed> = []

field saying how url, keywords, and title words should influence relevance and doc rank calculations

Return values
mixed
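
The timestamp-query_part format described for $save_timestamp_name can be illustrated with a small sketch. The helper name sketchSplitSaveTimestamp() is hypothetical and is not part of Yioop; defaulting query_part to 0 when no dash is present is an assumption based on the "usually 0" remark above.

```php
<?php
/**
 * Hypothetical helper (not part of Yioop): splits a save_timestamp of the
 * form "timestamp-query_part" into its two pieces.
 */
function sketchSplitSaveTimestamp(string $save_timestamp_name): array
{
    // Split on the first dash only, so at most two parts come back
    $parts = explode("-", $save_timestamp_name, 2);
    return [
        "timestamp" => $parts[0],
        // Assumption: a missing query_part is treated as 0
        "query_part" => isset($parts[1]) ? (int) $parts[1] : 0,
    ];
}
```

For example, sketchSplitSaveTimestamp("1700000000-0") yields the timestamp "1700000000" and query_part 0.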

advance()

Forwards the iterator one group of docs

public advance([array<string|int, mixed> $gen_doc_offset = null ]) : mixed
Parameters
$gen_doc_offset : array<string|int, mixed> = null

a generation, doc_offset pair. If set, the next block must be of greater than or equal generation, and if equal, the next block must all have $doc_offsets larger than or equal to this value

Return values
mixed

advanceSeenDocs()

Updates the seen_docs count during an advance() call

public advanceSeenDocs() : mixed
Return values
mixed

currentDocsWithWord()

Gets the current block of doc ids and scores associated with this iterator's word

public currentDocsWithWord() : mixed
Return values
mixed

doc ids and score if there are docs left, -1 otherwise

currentGenDocOffsetWithWord()

Gets the doc_offset and generation for the next document that would be returned by this iterator. As this is not easily determined for a network iterator, this method always returns -1 for this iterator

public currentGenDocOffsetWithWord() : mixed
Return values
mixed

an array with the desired document offset and generation; -1 on fail

findDocsWithWord()

Hook function used by currentDocsWithWord to return the current block of docs if it is not cached

public findDocsWithWord() : mixed
Return values
mixed

doc ids and score if there are docs left, -1 otherwise
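
The relationship between currentDocsWithWord() and its findDocsWithWord() hook is a cache-then-compute pattern. The toy class below is a minimal sketch of that pattern only: the class name, the in-memory $blocks array, and the $find_calls counter are hypothetical, and using $current_block_fresh as the cache-validity flag is an assumption rather than a description of Yioop's implementation.

```php
<?php
// Hypothetical toy class, not Yioop's NetworkIterator
class SketchCachedBlockIterator
{
    public $pages = [];                  // cache of the current block
    public $current_block_fresh = false; // assumed to mark the cache valid
    public $find_calls = 0;              // counts how often the hook ran
    private $blocks;                     // stand-in for remote results
    private $position = 0;

    public function __construct(array $blocks)
    {
        $this->blocks = $blocks;
    }

    public function currentDocsWithWord()
    {
        // Only call the hook when nothing valid is cached
        if (!$this->current_block_fresh) {
            $this->pages = $this->findDocsWithWord();
            $this->current_block_fresh = true;
        }
        return $this->pages;
    }

    public function findDocsWithWord()
    {
        $this->find_calls++;
        // -1 signals that no docs are left, as in the documented contract
        return $this->blocks[$this->position] ?? -1;
    }

    public function advance()
    {
        $this->position++;
        // Moving to a new block invalidates the cached one
        $this->current_block_fresh = false;
    }
}
```

Calling currentDocsWithWord() twice in a row on this sketch runs the hook once; the second call is served from $pages until advance() invalidates the cache.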

genDocOffsetCmp()

Compares two arrays each containing a (generation, offset) pair.

public genDocOffsetCmp(array<string|int, mixed> $gen_doc1, array<string|int, mixed> $gen_doc2[, int $direction = self::ASCENDING ]) : int
Parameters
$gen_doc1 : array<string|int, mixed>

first ordered pair

$gen_doc2 : array<string|int, mixed>

second ordered pair

$direction : int = self::ASCENDING

whether the comparison should be done for a self::ASCENDING or a self::DESCENDING search

Return values
int

-1, 0, or 1 depending on which is bigger
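
A comparison of (generation, offset) pairs can be sketched as a lexicographic compare: generation first, then offset. The function name and the two SKETCH_ constants below are hypothetical stand-ins (the real values live in Yioop's constants), and flipping the sign for a descending search is an assumption about how $direction is applied.

```php
<?php
// Assumed stand-in values; not the constants defined by Yioop
const SKETCH_ASCENDING = 1;
const SKETCH_DESCENDING = -1;

/**
 * Hypothetical sketch: compares two [generation, offset] pairs.
 * Returns -1, 0, or 1.
 */
function sketchGenDocOffsetCmp(
    array $gen_doc1,
    array $gen_doc2,
    int $direction = SKETCH_ASCENDING
): int {
    // Compare generations; fall back to offsets only when they tie
    $cmp = ($gen_doc1[0] <=> $gen_doc2[0]) ?: ($gen_doc1[1] <=> $gen_doc2[1]);
    // Assumption: a descending search reverses the ordering
    return ($direction === SKETCH_ASCENDING) ? $cmp : -$cmp;
}
```

With these assumptions, [1, 5] sorts before [1, 7] in an ascending search and after it in a descending one.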

getCurrentDocsForKeys()

Gets the summaries associated with the keys provided the keys can be found in the current block of docs returned by this iterator

public getCurrentDocsForKeys([array<string|int, mixed> $keys = null ]) : array<string|int, mixed>
Parameters
$keys : array<string|int, mixed> = null

keys to try to find in the current block of returned results

Return values
array<string|int, mixed>

doc summaries that match provided keys
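
Selecting summaries by key can be sketched as follows, assuming the current block is an associative array of doc key => summary. That array shape, the function name, and returning the whole block when $keys is null are all assumptions for illustration.

```php
<?php
/**
 * Hypothetical sketch: keeps only the summaries whose doc key is listed
 * in $keys. Keys missing from the current block are silently skipped.
 */
function sketchDocsForKeys(array $current_docs, ?array $keys = null): array
{
    // Assumption: no keys supplied means the whole block is wanted
    if ($keys === null) {
        return $current_docs;
    }
    // array_flip turns the key list into a lookup table for intersection
    return array_intersect_key($current_docs, array_flip($keys));
}
```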

getDirection()

Returns the direction of an IndexBundleIterator. Depending on the iterator, this could be either forward from the start of an index (self::ASCENDING) or backward from the end of the index (self::DESCENDING). For this base class, the function always returns self::ASCENDING, but subclasses might return different values.

public getDirection() : int
Return values
int

either CrawlConstants::ASCENDING or CrawlConstants::DESCENDING

makeLookupLink()

Called to make a link for AnalyticsManager about a network query performed by this iterator.

public makeLookupLink(array<string|int, mixed> $sites, int $index) : string
Parameters
$sites : array<string|int, mixed>

used by this network iterator

$index : int

which site in array to make link for

Return values
string

html of link

nextDocsWithWord()

Gets the current block of doc summaries for the word iterator and advances the current pointer to the next block of documents. If a doc index is given, the next block must consist of docs after this doc_index

public nextDocsWithWord([ $doc_offset = null ]) : array<string|int, mixed>
Parameters
$doc_offset : = null

if set the next block must all have $doc_offsets equal to or larger than this value

Return values
array<string|int, mixed>

doc summaries matching the $this->restrict_phrases

plan()

Returns a string representation of a plan by which the current iterator finds its results

public plan() : string
Return values
string

a representation of the current iterator and its subiterators, useful for determining how a query will be processed

reset()

Returns the iterator to the first document block that it could iterate over

public reset() : mixed
Return values
mixed

serverAdjustedResultsPerBlock()

If we want the top $num_results results (a block) and we have $num_machines, this computes how many results we should request of each machine.

public static serverAdjustedResultsPerBlock(int $num_machines, mixed $num_results) : int

Büttcher, Clarke, and Cormack give an exact formula to compute this, but it is slow to compute. We instead compute (1/$num_machines^{3/4}) * $num_results + 5.

Parameters
$num_machines : int

number of machines each having a portion of the results

$num_results : mixed
Return values
int

number of best results we should ask from each machine to ensure we get the top k best results overall
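
The approximation (1/$num_machines^{3/4}) * $num_results + 5 can be tried out with the sketch below. The function name is hypothetical, and three details are assumptions not stated in the source: treating $num_results as an int, rounding up with ceil(), and clamping $num_machines to at least 1.

```php
<?php
/**
 * Hypothetical stand-in for the heuristic described above; not Yioop's code.
 */
function sketchServerAdjustedResultsPerBlock(
    int $num_machines,
    int $num_results
): int {
    // Assumption: fewer than one machine is treated as one machine
    $num_machines = max(1, $num_machines);
    // (1 / num_machines^{3/4}) * num_results + 5, rounded up (assumption)
    return (int) ceil($num_results / pow($num_machines, 0.75) + 5);
}
```

Under these assumptions, one machine asked for 200 results is queried for 205, while four machines are each queried for 76 (200 / 4^{3/4} is about 70.7, plus 5, rounded up). The per-machine request shrinks more slowly than $num_results / $num_machines, which leaves headroom for an uneven spread of top results across machines.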

setResultsPerBlock()

Sets the value of the results_per_block field. This field controls the maximum number of results that can be returned in one go by currentDocsWithWord()

public setResultsPerBlock(int $num) : mixed
Parameters
$num : int

the maximum number of results that can be returned by a block

Return values
mixed

        
