
GroupIterator extends IndexBundleIterator

This iterator is used to group together documents or document parts which share the same URL. For instance, a link document item and the document that it links to will both be stored in the IndexArchiveBundle by the QueueServer. This iterator would combine both these items into a single document result whose score is the sum of their scores and whose summary, if returned, contains text from both sources. The iterator's purpose is vaguely analogous to a SQL GROUP BY clause.
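The grouping idea can be pictured with a small standalone sketch. It is not the class's implementation: the function name `groupByUrlSketch` and the field names `url`, `score`, and `text` are assumptions made for illustration only.

```php
<?php
/**
 * Hypothetical helper: merges items that share a url into one result,
 * summing their scores and joining their text. The field names
 * 'url', 'score' and 'text' are assumptions, not the real index keys.
 */
function groupByUrlSketch(array $items): array
{
    $groups = [];
    foreach ($items as $item) {
        $url = $item['url'];
        if (!isset($groups[$url])) {
            $groups[$url] = ['url' => $url, 'score' => 0.0, 'text' => ''];
        }
        // Sum the scores of everything that shares this url
        $groups[$url]['score'] += $item['score'];
        // Build a summary containing text from every source
        $groups[$url]['text'] =
            trim($groups[$url]['text'] . ' ' . $item['text']);
    }
    return array_values($groups);
}

$items = [
    ['url' => 'https://example.com/a', 'score' => 2.0,
        'text' => 'page text'],
    ['url' => 'https://example.com/a', 'score' => 0.5,
        'text' => 'anchor text of a link'],
    ['url' => 'https://example.com/b', 'score' => 1.0,
        'text' => 'other page'],
];
$grouped = groupByUrlSketch($items);
```

Here the document and the link item pointing to it collapse into one entry, much as rows sharing a GROUP BY column collapse into one row.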


Chris Pollett


Table of Contents

RESULTS_PER_BLOCK  = 200
Default number of documents returned for each block (at most)
$count_block  : int
The number of documents in the current block after filtering by restricted words
$count_block_unfiltered  : int
The number of documents in the current block before filtering by restricted words
$current_block_fresh  : bool
Says whether the value in $this->count_block is up to date
$current_block_hashes  : array<string|int, mixed>
Hashes of document web pages seen in results returned from the most recent call to findDocsWithWord
$current_machine  : int
Id of queue_server this group_iterator lives on
$current_seen_hashes  : array<string|int, mixed>
$domain_factors  : array<string|int, mixed>
Used to keep track of, and to weight, pages based on the number of other pages from the same domain
$grouped_hashes  : array<string|int, mixed>
Hashes of document web pages used to keep track of groups seen so far
$grouped_keys  : array<string|int, mixed>
Hashed URL keys used to keep track of groups seen so far
$index_bundle_iterator  : string
The iterator we are using to get documents from
$num_docs  : int
Estimate of the number of documents that this iterator can return
$pages  : array<string|int, mixed>
Cache of what currentDocsWithWord returns
$results_per_block  : int
Number of documents returned for each block (at most)
$seen_docs  : int
The number of documents already iterated over
$seen_docs_unfiltered  : int
The number of iterated docs before the restriction test
$total_num_docs  : int
__construct()  : mixed
Creates a group iterator with the given parameters.
advance()  : mixed
Forwards the iterator one group of docs
advanceSeenDocs()  : mixed
Updates the seen_docs count during an advance() call
aggregateScores()  : mixed
For a collection of pages each with the same url, computes the page with the max score, as well as the max of the ranks, proximity, and relevance scores.
computeOutPages()  : array<string|int, mixed>
For a collection of grouped pages generates a grouped summary for each group and returns an array of out pages consisting of single summarized documents for each group. These single summarized documents have aggregated scores.
currentDocsWithWord()  : mixed
Gets the current block of doc ids and scores associated with this iterator's word
currentGenDocOffsetWithWord()  : mixed
Gets the doc_offset and generation for the next document that would be returned by this iterator
findDocsWithWord()  : mixed
Hook function used by currentDocsWithWord to return the current block of docs if it is not cached
genDocOffsetCmp()  : int
Compares two arrays each containing a (generation, offset) pair.
getCurrentDocsForKeys()  : array<string|int, mixed>
Gets the summaries associated with the keys, provided the keys can be found in the current block of docs returned by this iterator
getDirection()  : int
Returns CrawlConstants::ASCENDING or CrawlConstants::DESCENDING depending on the direction in which this iterator traverses the underlying index archive bundle.
getPagesToGroup()  : array<string|int, mixed>
Gets a sample of a few hundred pages on which to do grouping by URL
groupByHashAndAggregate()  : mixed
For documents which had been previously grouped by the hash of their url, groups these groups further by the hash of their pages contents.
groupByHashUrl()  : array<string|int, mixed>
Groups documents as well as mini-pages based on links to documents by url to produce an array of arrays of documents with same url. Since this is called in an iterator, documents which were already returned by a previous call to currentDocsWithWord() followed by an advance() will have been remembered in grouped_keys and will be ignored in the return result of this function.
nextDocsWithWord()  : array<string|int, mixed>
Gets the current block of doc summaries for the word iterator and advances the current pointer to the next block of documents. If a doc index is set, the next block must be of docs after this doc_index
plan()  : string
Returns a string representation of a plan by which the current iterator finds its results
reset()  : mixed
Returns the iterator to the first document block that it could iterate over
setResultsPerBlock()  : mixed
Sets the value of the results_per_block field. This field controls the maximum number of results that can be returned in one go by currentDocsWithWord()



Default number of documents returned for each block (at most)

public int RESULTS_PER_BLOCK = 200



The number of documents in the current block after filtering by restricted words

public int $count_block


The number of documents in the current block before filtering by restricted words

public int $count_block_unfiltered


Says whether the value in $this->count_block is up to date

public bool $current_block_fresh


Hashes of document web pages seen in results returned from the most recent call to findDocsWithWord

public array<string|int, mixed> $current_block_hashes


Id of queue_server this group_iterator lives on

public int $current_machine


public array<string|int, mixed> $current_seen_hashes


Used to keep track of, and to weight, pages based on the number of other pages from the same domain

public array<string|int, mixed> $domain_factors


Hashes of document web pages used to keep track of groups seen so far

public array<string|int, mixed> $grouped_hashes


Hashed URL keys used to keep track of groups seen so far

public array<string|int, mixed> $grouped_keys


The iterator we are using to get documents from

public string $index_bundle_iterator


Estimate of the number of documents that this iterator can return

public int $num_docs


Cache of what currentDocsWithWord returns

public array<string|int, mixed> $pages


Number of documents returned for each block (at most)

public int $results_per_block = self::RESULTS_PER_BLOCK


The number of iterated docs before the restriction test

public int $seen_docs_unfiltered



Creates a group iterator with the given parameters.

public __construct(object $index_bundle_iterator[, int $num_iterators = 1 ], int $current_machine) : mixed
$index_bundle_iterator : object

to use as a source of documents to iterate over

$num_iterators : int = 1

number of word iterators appearing in sub-iterators -- if larger than reduce the default grouping number

$current_machine : int

if this iterator is being used in a multi-queue_server setting, then this is the id of the current queue_server

Return values


Forwards the iterator one group of docs

public advance([array<string|int, mixed> $gen_doc_offset = null ]) : mixed
$gen_doc_offset : array<string|int, mixed> = null

a (generation, doc_offset) pair. If set, the next block must be of greater than or equal generation, and if the generation is equal, the docs in the next block must all have $doc_offsets larger than or equal to this value

Return values


Updates the seen_docs count during an advance() call

public advanceSeenDocs() : mixed
Return values


For a collection of pages each with the same url, computes the page with the max score, as well as the max of the ranks, proximity, and relevance scores.

public aggregateScores(string $hash_url, array<string|int, mixed> &$pre_hash_page) : mixed

Stores this information in the first element of the array of pages. This process is described in detail at: https://www.seekquarry.com/?c=main&p=ranking#search

$hash_url : string

the crawlHash of the url of the page we are scoring which will be compared with that of the host to see if the current page has the url of a hostname.

$pre_hash_page : array<string|int, mixed>

pages to compute scores for

Return values
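As a rough illustration of the aggregation described above, the sketch below picks the page with the highest score, takes the maximum of each component score, and writes the result into the first element. The function name and the field names `score`, `rank`, `proximity`, and `relevance` are assumptions; the comparison against the host that `$hash_url` is used for is left out.

```php
<?php
/**
 * Illustrative only: the field names used here are assumptions,
 * not the keys the real class works with.
 */
function aggregateScoresSketch(array &$pages): void
{
    if (count($pages) === 0) {
        return;
    }
    $parts = ['rank', 'proximity', 'relevance'];
    $best = $pages[0];
    $maxima = [];
    foreach ($parts as $part) {
        $maxima[$part] = $pages[0][$part];
    }
    foreach ($pages as $page) {
        // Remember the page with the highest overall score
        if ($page['score'] > $best['score']) {
            $best = $page;
        }
        // Track the maximum of each component score separately
        foreach ($parts as $part) {
            $maxima[$part] = max($maxima[$part], $page[$part]);
        }
    }
    // Store the aggregate in the first element of the array
    $pages[0] = array_merge($best, $maxima);
}

$pages = [
    ['key' => 'link item', 'score' => 1.0, 'rank' => 3.0,
        'proximity' => 1.0, 'relevance' => 2.0],
    ['key' => 'document', 'score' => 4.0, 'rank' => 1.0,
        'proximity' => 5.0, 'relevance' => 1.0],
];
aggregateScoresSketch($pages);
```

After the call, `$pages[0]` describes the best-scoring page while carrying the largest rank, proximity, and relevance found anywhere in the group.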


For a collection of grouped pages generates a grouped summary for each group and returns an array of out pages consisting of single summarized documents for each group. These single summarized documents have aggregated scores.

public computeOutPages(array<string|int, mixed> &$pre_out_pages) : array<string|int, mixed>
$pre_out_pages : array<string|int, mixed>

array of groups of pages for which out pages are to be generated.

Return values
array<string|int, mixed>

$out_pages array of single summarized documents


Gets the current block of doc ids and scores associated with this iterator's word

public currentDocsWithWord() : mixed
Return values

doc ids and score if there are docs left, -1 otherwise


Gets the doc_offset and generation for the next document that would be returned by this iterator

public currentGenDocOffsetWithWord() : mixed
Return values

an array with the desired document offset and generation; -1 on fail


Hook function used by currentDocsWithWord to return the current block of docs if it is not cached

public findDocsWithWord() : mixed
Return values

doc ids and score if there are docs left, -1 otherwise


Compares two arrays each containing a (generation, offset) pair.

public genDocOffsetCmp(array<string|int, mixed> $gen_doc1, array<string|int, mixed> $gen_doc2[, int $direction = self::ASCENDING ]) : int
$gen_doc1 : array<string|int, mixed>

first ordered pair

$gen_doc2 : array<string|int, mixed>

second ordered pair

$direction : int = self::ASCENDING

whether the comparison should be done for a self::ASCENDING or a self::DESCENDING search

Return values

-1, 0, or 1 depending on which is bigger
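One way to picture such a comparison is a lexicographic compare: generation first, then offset. The sketch below assumes each pair is laid out as `[generation, offset]` and that a descending search simply flips the sign; the constants and function name are hypothetical stand-ins, not the class's own.

```php
<?php
// Hypothetical stand-ins for the direction constants
const ASCENDING_SKETCH = 1;
const DESCENDING_SKETCH = -1;

/**
 * Compares two [generation, offset] pairs lexicographically.
 * Assumption: a descending search reverses the ascending result.
 */
function genDocOffsetCmpSketch(array $gen_doc1, array $gen_doc2,
    int $direction = ASCENDING_SKETCH): int
{
    // Compare generations; only if they tie, compare offsets
    $cmp = ($gen_doc1[0] <=> $gen_doc2[0])
        ?: ($gen_doc1[1] <=> $gen_doc2[1]);
    return ($direction === ASCENDING_SKETCH) ? $cmp : -$cmp;
}

$same_generation = genDocOffsetCmpSketch([1, 5], [1, 7]);
$later_generation = genDocOffsetCmpSketch([2, 0], [1, 9]);
$equal_pairs = genDocOffsetCmpSketch([1, 5], [1, 5]);
$descending = genDocOffsetCmpSketch([1, 5], [1, 7], DESCENDING_SKETCH);
```

With these inputs the generation decides the result whenever the generations differ, and the offset only matters on a tie.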


Gets the summaries associated with the keys, provided the keys can be found in the current block of docs returned by this iterator

public getCurrentDocsForKeys([array<string|int, mixed> $keys = null ]) : array<string|int, mixed>
$keys : array<string|int, mixed> = null

keys to try to find in the current block of returned results

Return values
array<string|int, mixed>

doc summaries that match provided keys


Returns CrawlConstants::ASCENDING or CrawlConstants::DESCENDING depending on the direction in which this iterator traverses the underlying index archive bundle.

public getDirection() : int
Return values

direction traversing underlying archive bundle


Gets a sample of a few hundred pages on which to do grouping by URL

public getPagesToGroup() : array<string|int, mixed>
Return values
array<string|int, mixed>

array of pages of the form document key --> meta data array


For documents which had been previously grouped by the hash of their url, groups these groups further by the hash of their pages contents.

public groupByHashAndAggregate(array<string|int, mixed> &$pre_out_pages) : mixed

For each group of groups with the same hash summary, this function then selects the subgroup with the highest max score for that group as its representative. The function then modifies the supplied argument array to make it an array of group representatives.

$pre_out_pages : array<string|int, mixed>

documents previously grouped by hash of url

Return values
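The second-level grouping can be sketched as follows. This is a simplified illustration under stated assumptions: each url group is taken to carry its content hash and max score in its first page under the hypothetical field names `content_hash` and `max_score`, and the function name is invented for the example.

```php
<?php
/**
 * Illustrative only: among url groups whose pages have the same
 * content hash, keeps the group with the highest max score and
 * rewrites the argument to hold just those representatives.
 */
function groupByContentHashSketch(array &$pre_out_pages): void
{
    $representatives = [];
    foreach ($pre_out_pages as $url_key => $group) {
        $hash = $group[0]['content_hash'];
        $score = $group[0]['max_score'];
        if (!isset($representatives[$hash]) ||
            $score > $representatives[$hash]['score']) {
            $representatives[$hash] = ['score' => $score,
                'url_key' => $url_key, 'group' => $group];
        }
    }
    // Replace the argument with one representative per content hash
    $pre_out_pages = [];
    foreach ($representatives as $representative) {
        $pre_out_pages[$representative['url_key']] =
            $representative['group'];
    }
}

$pre_out_pages = [
    'u1' => [['content_hash' => 'h1', 'max_score' => 1.0]],
    'u2' => [['content_hash' => 'h1', 'max_score' => 3.0]],
    'u3' => [['content_hash' => 'h2', 'max_score' => 2.0]],
];
groupByContentHashSketch($pre_out_pages);
```

Groups `u1` and `u2` share a content hash, so only the higher-scoring `u2` survives alongside `u3`.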


Groups documents as well as mini-pages based on links to documents by url to produce an array of arrays of documents with same url. Since this is called in an iterator, documents which were already returned by a previous call to currentDocsWithWord() followed by an advance() will have been remembered in grouped_keys and will be ignored in the return result of this function.

public groupByHashUrl(array<string|int, mixed> &$pages) : array<string|int, mixed>
$pages : array<string|int, mixed>

pages to group

Return values
array<string|int, mixed>

$pre_out_pages pages after grouping
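A minimal sketch of this step, assuming pages carry a `url` field and using `md5` purely as a stand-in for the hash of the url (the real class may hash differently). The `$grouped_keys` argument plays the role of the remembered keys of groups already returned.

```php
<?php
/**
 * Illustrative only: buckets pages by a hash of their url and skips
 * any url whose hash is already recorded in $grouped_keys.
 */
function groupByHashUrlSketch(array $pages, array $grouped_keys): array
{
    $pre_out_pages = [];
    foreach ($pages as $page) {
        $hash_url = md5($page['url']); // stand-in hash, an assumption
        if (isset($grouped_keys[$hash_url])) {
            continue; // this group was returned by an earlier block
        }
        $pre_out_pages[$hash_url][] = $page;
    }
    return $pre_out_pages;
}

$pages = [
    ['url' => 'https://example.com/a', 'kind' => 'document'],
    ['url' => 'https://example.com/a', 'kind' => 'link'],
    ['url' => 'https://example.com/b', 'kind' => 'document'],
];
// Pretend the group for /b was already returned earlier
$grouped_keys = [md5('https://example.com/b') => true];
$pre_out_pages = groupByHashUrlSketch($pages, $grouped_keys);
```

The result is an array of arrays: one bucket per unseen url, holding both the document and the link item that points to it.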


Gets the current block of doc summaries for the word iterator and advances the current pointer to the next block of documents. If a doc index is set, the next block must be of docs after this doc_index

public nextDocsWithWord([ $doc_offset = null ]) : array<string|int, mixed>
$doc_offset : = null

if set the next block must all have $doc_offsets equal to or larger than this value

Return values
array<string|int, mixed>

doc summaries matching the $this->restrict_phrases
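The get-then-advance pattern can be illustrated with a toy iterator over an in-memory array. The class `ArrayBlockIteratorSketch` is hypothetical and only borrows the method names for readability; it also assumes that -1 signals that no docs are left, mirroring the return value documented for currentDocsWithWord().

```php
<?php
/**
 * Toy block iterator over an in-memory array; not the real class.
 */
class ArrayBlockIteratorSketch
{
    private array $docs;
    private int $results_per_block;
    private int $position = 0;

    public function __construct(array $docs, int $results_per_block)
    {
        $this->docs = $docs;
        $this->results_per_block = $results_per_block;
    }
    /** Returns the current block, or -1 when nothing is left. */
    public function currentDocsWithWord()
    {
        $block = array_slice($this->docs, $this->position,
            $this->results_per_block, true);
        return (count($block) > 0) ? $block : -1;
    }
    /** Moves the pointer forward by one block. */
    public function advance(): void
    {
        $this->position = min(
            $this->position + $this->results_per_block,
            count($this->docs));
    }
    /** Gets the current block, then advances past it. */
    public function nextDocsWithWord()
    {
        $block = $this->currentDocsWithWord();
        $this->advance();
        return $block;
    }
}

$iterator = new ArrayBlockIteratorSketch(
    ['d1' => 0.9, 'd2' => 0.7, 'd3' => 0.4], 2);
$blocks = [];
while (is_array($block = $iterator->nextDocsWithWord())) {
    $blocks[] = $block;
}
```

A caller loops until a non-array value comes back, collecting at most `$results_per_block` entries per block.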


Returns a string representation of a plan by which the current iterator finds its results

public plan() : string
Return values

a representation of the current iterator and its subiterators, useful for determining how a query will be processed


Returns the iterator to the first document block that it could iterate over

public reset() : mixed
Return values


Sets the value of the results_per_block field. This field controls the maximum number of results that can be returned in one go by currentDocsWithWord()

public setResultsPerBlock(int $num) : mixed
$num : int

the maximum number of results that can be returned by a block

Return values

