Yioop_V9.5_Source_Code

GroupIterator extends IndexBundleIterator
in package

Application

This iterator is used to group together documents or document parts which share the same url. For instance, a link document item and the document that it links to will both be stored in the IndexArchiveBundle by the QueueServer. This iterator would combine both these items into a single document result with the sum of their scores and a summary, if returned, containing text from both sources. The iterator's purpose is vaguely analogous to a SQL GROUP BY clause.
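The grouping idea in the description can be illustrated with a minimal, self-contained sketch. It is not the Yioop implementation: the function name sketchGroupByUrl and the 'url', 'score', and 'text' keys are hypothetical names chosen for this example only.

```php
<?php
/**
 * Hypothetical helper (not part of Yioop): merges items that share a url
 * into one result whose score is the sum of the item scores and whose
 * text collects the text of every item in the group.
 */
function sketchGroupByUrl(array $items): array
{
    $groups = [];
    foreach ($items as $item) {
        $url = $item['url'];
        if (!isset($groups[$url])) {
            // first time this url is seen: start an empty group
            $groups[$url] = ['url' => $url, 'score' => 0.0, 'text' => []];
        }
        $groups[$url]['score'] += $item['score'];
        $groups[$url]['text'][] = $item['text'];
    }
    return array_values($groups);
}

$items = [
    ['url' => 'https://example.com/', 'score' => 1.5, 'text' => 'page body'],
    ['url' => 'https://example.com/', 'score' => 0.5,
        'text' => 'anchor text of a link'],
    ['url' => 'https://example.org/', 'score' => 1.0, 'text' => 'another page'],
];
$grouped = sketchGroupByUrl($items);
```

Here the page and the link item pointing at it collapse into one entry, much as rows sharing a key collapse under a SQL GROUP BY.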

RESULTS_PER_BLOCK

Default number of documents returned for each block (at most)


    public
        int
    RESULTS_PER_BLOCK
    = 200

$count_block

The number of documents in the current block after filtering by restricted words


    public
        int
    $count_block

$count_block_unfiltered

The number of documents in the current block before filtering by restricted words


    public
        int
    $count_block_unfiltered

$current_block_fresh

Says whether the value in $this->count_block is up to date


    public
        bool
    $current_block_fresh

$current_block_hashes

Hashes of document web pages seen in results returned from the most recent call to findDocsWithWord


    public
        array<string|int, mixed>
    $current_block_hashes

$current_machine

Id of queue_server this group_iterator lives on


    public
        int
    $current_machine

$current_seen_hashes


    public
        array<string|int, mixed>
    $current_seen_hashes

$domain_factors

Used to keep track of, and to weight, pages based on the number of other pages from the same domain


    public
        array<string|int, mixed>
    $domain_factors

$grouped_hashes

Hashes of document web pages used to keep track of groups seen so far


    public
        array<string|int, mixed>
    $grouped_hashes

$grouped_keys

Hashed url keys used to keep track of groups seen so far


    public
        array<string|int, mixed>
    $grouped_keys

$index_bundle_iterator

The iterator we are using to get documents from


    public
        string
    $index_bundle_iterator

$num_docs

Estimate of the number of documents that this iterator can return


    public
        int
    $num_docs

$pages

Cache of what currentDocsWithWord returns


    public
        array<string|int, mixed>
    $pages

$results_per_block

Number of documents returned for each block (at most)


    public
        int
    $results_per_block
     = self::RESULTS_PER_BLOCK

$seen_docs

The number of documents already iterated over


    public
        int
    $seen_docs

$seen_docs_unfiltered

The number of iterated docs before the restriction test


    public
        int
    $seen_docs_unfiltered

$total_num_docs


    public
        int
    $total_num_docs

__construct()

Creates a group iterator with the given parameters.


    public
                    __construct(object $index_bundle_iterator[, int $num_iterators = 1 ], int $current_machine) : mixed

Parameters

$index_bundle_iterator : object: to use as a source of documents to iterate over
$num_iterators : int = 1: number of word iterators appearing in sub-iterators -- if larger than 1, reduce the default grouping number
$current_machine : int: if this iterator is being used in a multi-queue_server setting, then this is the id of the current queue_server

Return values

mixed —

advance()

Forwards the iterator one group of docs


    public
                    advance([array<string|int, mixed> $gen_doc_offset = null ]) : mixed

Parameters

$gen_doc_offset : array<string|int, mixed> = null: a generation, doc_offset pair. If set, the next block must be of greater than or equal generation, and if the generation is equal, the next block must all have $doc_offsets larger than or equal to this value

Return values

mixed —

advanceSeenDocs()

Updates the seen_docs count during an advance() call


    public
                    advanceSeenDocs() : mixed

Return values

mixed —

aggregateScores()

For a collection of pages each with the same url, computes the page with the max score, as well as the max of the ranks, proximity, and relevance scores.


    public
                    aggregateScores(string $hash_url, array<string|int, mixed> &$pre_hash_page) : mixed

Stores this information in the first element of the array of pages. This process is described in detail at: https://www.seekquarry.com/?c=main&p=ranking#search

Parameters

$hash_url : string: the crawlHash of the url of the page we are scoring which will be compared with that of the host to see if the current page has the url of a hostname.
$pre_hash_page : array<string|int, mixed>: pages to compute scores for

Return values

mixed —
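As a rough illustration of the max-based aggregation described for aggregateScores(), the sketch below picks the page with the highest score, takes the maximum of each component score across the group, and stores the result in the first element. The function name and the 'score', 'rank', 'proximity', and 'relevance' keys are assumptions made for this example; the actual method also takes $hash_url into account, which is not modeled here.

```php
<?php
/**
 * Hypothetical sketch (not the Yioop code): moves the highest scoring
 * page of a list-indexed group to index 0 and gives it the maximum
 * rank, proximity, and relevance found anywhere in the group.
 */
function sketchAggregateMax(array &$pages): void
{
    if (count($pages) === 0) {
        return;
    }
    $fields = ['rank', 'proximity', 'relevance'];
    $best = 0;
    $maxima = [];
    foreach ($fields as $field) {
        $maxima[$field] = $pages[0][$field];
    }
    foreach ($pages as $i => $page) {
        if ($page['score'] > $pages[$best]['score']) {
            $best = $i;
        }
        foreach ($fields as $field) {
            $maxima[$field] = max($maxima[$field], $page[$field]);
        }
    }
    // swap the best page into the first slot so no page is lost
    [$pages[0], $pages[$best]] = [$pages[$best], $pages[0]];
    foreach ($fields as $field) {
        $pages[0][$field] = $maxima[$field];
    }
}

$pages = [
    ['score' => 3.0, 'rank' => 2.0, 'proximity' => 0.4, 'relevance' => 0.6],
    ['score' => 5.0, 'rank' => 1.5, 'proximity' => 0.9, 'relevance' => 0.2],
];
sketchAggregateMax($pages);
```

After the call, the first element carries the top score together with the per-field maxima of the whole group.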

computeOutPages()

For a collection of grouped pages generates a grouped summary for each group and returns an array of out pages consisting of single summarized documents for each group. These single summarized documents have aggregated scores.


    public
                    computeOutPages(array<string|int, mixed> &$pre_out_pages) : array<string|int, mixed>

Parameters

$pre_out_pages : array<string|int, mixed>: array of groups of pages for which out pages are to be generated.

Return values

array<string|int, mixed> —

$out_pages array of single summarized documents

currentDocsWithWord()

Gets the current block of doc ids and scores associated with this iterator's word


    public
                    currentDocsWithWord() : mixed

Return values

mixed —

doc ids and score if there are docs left, -1 otherwise
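The return convention above (an array of docs while blocks remain, -1 afterwards) suggests a simple consumption loop. The sketch below uses a hypothetical stand-in class, SketchBlockIterator, that only mimics that convention; it is not GroupIterator and knows nothing about index bundles.

```php
<?php
/**
 * Hypothetical stand-in that serves pre-built blocks and returns -1
 * once they run out, mirroring the documented return convention.
 */
class SketchBlockIterator
{
    private array $blocks;
    private int $position = 0;

    public function __construct(array $blocks)
    {
        $this->blocks = array_values($blocks);
    }

    public function currentDocsWithWord()
    {
        // an array while blocks remain, -1 otherwise
        return $this->blocks[$this->position] ?? -1;
    }

    public function advance(): void
    {
        $this->position++;
    }
}

$iterator = new SketchBlockIterator([
    ['doc1' => 0.9, 'doc2' => 0.7],
    ['doc3' => 0.4],
]);
$all_docs = [];
while (is_array($docs = $iterator->currentDocsWithWord())) {
    $all_docs += $docs; // doc id => score pairs from this block
    $iterator->advance();
}
```

Testing with is_array() rather than comparing against -1 avoids loose-comparison surprises between arrays and integers.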

currentGenDocOffsetWithWord()

Gets the doc_offset and generation for the next document that would be returned by this iterator


    public
                    currentGenDocOffsetWithWord() : mixed

Return values

mixed —

an array with the desired document offset and generation; -1 on fail

findDocsWithWord()

Hook function used by currentDocsWithWord to return the current block of docs if it is not cached


    public
                    findDocsWithWord() : mixed

Return values

mixed —

doc ids and score if there are docs left, -1 otherwise

genDocOffsetCmp()

Compares two arrays each containing a (generation, offset) pair.


    public
                    genDocOffsetCmp(array<string|int, mixed> $gen_doc1, array<string|int, mixed> $gen_doc2[, int $direction = self::ASCENDING ]) : int

Parameters

$gen_doc1 : array<string|int, mixed>: first ordered pair
$gen_doc2 : array<string|int, mixed>: second ordered pair
$direction : int = self::ASCENDING: whether the comparison should be done for a self::ASCENDING or a self::DESCENDING search

Return values

int —

-1,0,1 depending on which is bigger
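A comparison of (generation, offset) pairs can be sketched as a lexicographic comparison: generations decide first, and offsets break ties. The constants and function below are hypothetical stand-ins, and treating a descending search as a sign flip is an assumption of this sketch rather than something the documentation states.

```php
<?php
// Hypothetical stand-ins for the ASCENDING and DESCENDING class constants
const SKETCH_ASCENDING = 1;
const SKETCH_DESCENDING = -1;

/**
 * Hypothetical sketch: compares two [generation, offset] pairs,
 * returning -1, 0, or 1.
 */
function sketchGenDocOffsetCmp(array $gen_doc1, array $gen_doc2,
    int $direction = SKETCH_ASCENDING): int
{
    // compare generations first; fall back to offsets on a tie
    $cmp = $gen_doc1[0] <=> $gen_doc2[0];
    if ($cmp === 0) {
        $cmp = $gen_doc1[1] <=> $gen_doc2[1];
    }
    // assumption: a descending search simply reverses the ordering
    return ($direction === SKETCH_DESCENDING) ? -$cmp : $cmp;
}

$earlier_generation = sketchGenDocOffsetCmp([1, 50], [2, 10]);
$same_pair = sketchGenDocOffsetCmp([2, 10], [2, 10]);
$later_offset = sketchGenDocOffsetCmp([2, 30], [2, 10]);
$reversed = sketchGenDocOffsetCmp([1, 50], [2, 10], SKETCH_DESCENDING);
```

Note that a larger offset in an earlier generation still compares as smaller, because the generation is checked before the offset.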

getCurrentDocsForKeys()

Gets the summaries associated with the keys, provided the keys can be found in the current block of docs returned by this iterator


    public
                    getCurrentDocsForKeys([array<string|int, mixed> $keys = null ]) : array<string|int, mixed>

Parameters

$keys : array<string|int, mixed> = null: keys to try to find in the current block of returned results

Return values

array<string|int, mixed> —

doc summaries that match provided keys

getDirection()

Returns CrawlConstants::ASCENDING or CrawlConstants::DESCENDING depending on the direction in which this iterator traverses the underlying index archive bundle.


    public
                    getDirection() : int

Return values

int —

direction traversing underlying archive bundle

getPagesToGroup()

Gets a sample of a few hundred pages on which to do grouping by URL


    public
                    getPagesToGroup() : array<string|int, mixed>

Return values

array<string|int, mixed> —

of pages of document key --> meta data arrays

groupByHashAndAggregate()

For documents which had been previously grouped by the hash of their url, groups these groups further by the hash of their page contents.


    public
                    groupByHashAndAggregate(array<string|int, mixed> &$pre_out_pages) : mixed

For each group of groups with the same hash summary, this function then selects the subgroup with the highest max score for that group as its representative. The function then modifies the supplied argument array to make it an array of group representatives.

Parameters

$pre_out_pages : array<string|int, mixed>: documents previously grouped by hash of url

Return values

mixed —
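The selection of one representative per content hash can be sketched as follows. This is only an illustration under assumed data shapes: each url group is reduced to a hypothetical 'content_hash' and 'max_score' pair, and the function name is invented for the example.

```php
<?php
/**
 * Hypothetical sketch: among url groups sharing the same content hash,
 * keeps only the group with the highest max score. Modifies the
 * argument in place, as the documented method is said to do.
 */
function sketchPickRepresentatives(array &$groups): void
{
    $best = []; // content hash => key of the best group seen so far
    foreach ($groups as $key => $group) {
        $hash = $group['content_hash'];
        if (!isset($best[$hash]) ||
            $group['max_score'] > $groups[$best[$hash]]['max_score']) {
            $best[$hash] = $key;
        }
    }
    // keep only the winning groups, preserving their original order
    $groups = array_intersect_key($groups, array_flip($best));
}

$groups = [
    'urlA' => ['content_hash' => 'h1', 'max_score' => 2.0],
    'urlB' => ['content_hash' => 'h1', 'max_score' => 3.5],
    'urlC' => ['content_hash' => 'h2', 'max_score' => 1.0],
];
sketchPickRepresentatives($groups);
```

In this example urlA and urlB hold identical content, so only the higher scoring urlB survives alongside urlC.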

groupByHashUrl()

Groups documents as well as mini-pages based on links to documents by url to produce an array of arrays of documents with same url. Since this is called in an iterator, documents which were already returned by a previous call to currentDocsWithWord() followed by an advance() will have been remembered in grouped_keys and will be ignored in the return result of this function.


    public
                    groupByHashUrl(array<string|int, mixed> &$pages) : array<string|int, mixed>

Parameters

$pages : array<string|int, mixed>: pages to group

Return values

array<string|int, mixed> —

$pre_out_pages pages after grouping
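The interaction between grouping and previously returned keys can be sketched like this. The 'hash_url' key, the $seen_keys lookup table, and the function name are hypothetical; in the class itself the remembered keys live in grouped_keys, and how they are recorded is not shown here.

```php
<?php
/**
 * Hypothetical sketch: groups pages by a hashed url key and drops any
 * page whose key was already returned by an earlier block.
 */
function sketchGroupSkippingSeen(array $pages, array $seen_keys): array
{
    $pre_out_pages = [];
    foreach ($pages as $page) {
        $key = $page['hash_url'];
        if (isset($seen_keys[$key])) {
            continue; // this group was returned by an earlier block
        }
        $pre_out_pages[$key][] = $page;
    }
    return $pre_out_pages;
}

$seen_keys = ['u1' => true];
$pages = [
    ['hash_url' => 'u1', 'id' => 1],
    ['hash_url' => 'u2', 'id' => 2],
    ['hash_url' => 'u2', 'id' => 3],
];
$result = sketchGroupSkippingSeen($pages, $seen_keys);
// a caller would then remember the new keys before the next block
foreach (array_keys($result) as $key) {
    $seen_keys[$key] = true;
}
```

Keeping the seen keys as an associative array makes the membership test a constant-time isset() check.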

nextDocsWithWord()

Gets the current block of doc summaries for the word iterator and advances the current pointer to the next block of documents. If a doc index is set, the next block must be of docs after this doc_index


    public
                    nextDocsWithWord([ $doc_offset = null ]) : array<string|int, mixed>

Parameters

$doc_offset : = null: if set the next block must all have $doc_offsets equal to or larger than this value

Return values

array<string|int, mixed> —

doc summaries matching the $this->restrict_phrases

plan()

Returns a string representation of a plan by which the current iterator finds its results


    public
                    plan() : string

Return values

string —

a representation of the current iterator and its subiterators, useful for determining how a query will be processed

reset()

Returns the iterator to the first document block that it could iterate over


    public
                    reset() : mixed

Return values

mixed —

setResultsPerBlock()

Sets the value of the results_per_block field. This field controls the maximum number of results that can be returned in one go by currentDocsWithWord().


    public
                    setResultsPerBlock(int $num) : mixed

Parameters

$num : int: the maximum number of results that can be returned by a block

Return values

mixed —
