Yioop_V9.5_Source_Code_Documentation

WordIterator extends IndexBundleIterator

Used to iterate through the documents associated with a word in an IndexArchiveBundle. It also makes it easy to get the summaries of these documents.

A description of how words and the documents containing them are stored is given in the documentation of IndexArchiveBundle.

Tags
author

Chris Pollett

see
IndexArchiveBundle

Table of Contents

DOC_RANK_WEIGHT  = 50
Weighting factor by which a doc rank (an approximate score of a document based on its position in the index, that is, on when it was crawled) is multiplied.
HOST_KEY_POS  = 17
Host key position + 1 (the first character indicates doc, inlink, or external link)
KEY_LEN  = 8
Length of a doc key part
RESULTS_PER_BLOCK  = 200
Default number of documents returned for each block (at most)
$archive_file  : int
$avg_items_per_partition  : int
$base64_word_key  : string
Word key above in our modified base 64 encoding
$count_block  : int
The number of documents in the current block
$current_block_fresh  : bool
Says whether the value in $this->count_block is up to date
$current_doc_offset  : int
The current value of the doc_offset of current posting if known
$current_generation  : int
Number of the current shard
$current_offset  : int
The current byte offset in the IndexShard (if older index)
$dictionary_info  : array<string|int, mixed>
An array of shard generation and posting list offsets, lengths, and numbers of documents
$direction  : int
Whether the iterator iterates forward or backward through documents in bundle
$empty  : int
Keeps track of whether the word_iterator list is empty because the word does not appear in the index shard
$filter  : SearchfiltersModel
Model responsible for keeping track of edited and deleted search results
$generation_pointer  : int
Index into dictionary_info corresponding to the current shard
$index_name  : string
The timestamp of the index associated with this iterator
$index_version  : int
The index version affects how the iterator cycles through documents. There was a big change in index format between version 3 and prior formats.
$is_meta  : string
Whether word key corresponds to a meta word
$last_offset  : int
Last Offset of word occurrence in the IndexShard
$max_items_per_partition  : int
$next_offset  : int
The next byte offset in the IndexShard
$no_more_generations  : bool
Used to keep track of whether getWordInfo might still get more data on the search terms as generations are advanced
$num_docs  : int
Estimate of the number of documents that this iterator can return
$num_generations  : int
The total number of shards that have data for this word
$num_occurrences  : int
$pages  : array<string|int, mixed>
Cache of what currentDocsWithWord returns
$ranking_factors  : array<string|int, mixed>
How url, keywords, and title words should influence relevance and doc rank calculations
$results_per_block  : int
Number of documents returned for each block (at most)
$seen_docs  : int
The number of documents already iterated over
$start_generation  : int
First shard generation that word info was obtained for
$start_offset  : int
Starting Offset of word occurrence in the IndexShard
$term_info_computed  : int
$threshold_exceeded  : int
$total_num_docs  : int
$total_num_docs_and_links  : int
$total_number_of_partitions  : int
$word_key  : string
Hash of the word or phrase that the iterator iterates over
__construct()  : mixed
Creates a word iterator with the given parameters.
advance()  : mixed
Forwards the iterator one group of docs
advanceGeneration()  : mixed
Switches which index shard is being used to return occurrences of the word to the next shard containing the word
advanceSeenDocs()  : mixed
Updates the seen_docs count during an advance() call
currentDocsWithWord()  : mixed
Gets the current block of doc ids and scores associated with this iterator's word
currentGenDocOffsetWithWord()  : mixed
Gets the doc_offset and generation for the next document that would be returned by this iterator
findDocsWithWord()  : mixed
Hook function used by currentDocsWithWord to return the current block of docs if it is not cached
frequencyNormalizationPrefaceScoring()  : array<string|int, mixed>
Normalizes the frequencies of a term within a document with respect to the length of the document, the positions of the term within the document, and the overall importance score for a given position within the document. Also computes the score of the posting for the host keywords, title keywords, and path keywords.
genDocOffsetCmp()  : int
Compares two arrays each containing a (generation, offset) pair.
getCurrentDocsForKeys()  : array<string|int, mixed>
Gets the summaries associated with the keys provided the keys can be found in the current block of docs returned by this iterator
getDirection()  : int
Returns CrawlConstants::ASCENDING or CrawlConstants::DESCENDING depending on the direction in which this iterator traverses the underlying index archive bundle.
getDocKeyPositionsScoringInfo()  : mixed
Adds scoring information, position list information, and information about the relative weights of given positions to a set of postings from a partition, based on the position list file and doc_map file.
getGenerationPostings()  : array<string|int, mixed>
Given a partition number in the index's PartitionDocumentBundle, retrieves all the postings for the word iterator's term in that partition.
getPostingsSliceResults()  : mixed
Given the current_offset, results_per_block, and index used, gets results_per_block postings, starting from current_offset in the current direction (ascending or descending), for the term that this word iterator iterates over.
nextDocIndexOffsetPair()  : array<string|int, mixed>
Computes a pair [posting_slice_offset, $doc_index], such that the $doc_index when shifted to make a doc_offset is greater than $doc_offset and posting_slice_offset is the offset of the first posting with this property.
nextDocsWithWord()  : array<string|int, mixed>
Gets the current block of doc summaries for the word iterator and advances the current pointer to the next block of documents. If a doc index is given, the next block must consist of docs after this doc_index.
plainAdvance()  : mixed
Forwards the iterator one group of docs. This is what is called by advance($gen_doc_offset) if $gen_doc_offset is null.
plan()  : string
Returns a string representation of a plan by which the current iterator finds its results
reset()  : mixed
Resets the iterator to the first document block that it could iterate over
setResultsPerBlock()  : mixed
Sets the value of the results_per_block field. This field controls the maximum number of results that can be returned in one go by currentDocsWithWord().
termInfoIteratorFields()  : mixed
Used to compute fields such as $this->total_num_docs for this iterator on term $word_key for index $index_name

Constants

DOC_RANK_WEIGHT

Weighting factor by which a doc rank (an approximate score of a document based on its position in the index, that is, on when it was crawled) is multiplied.

public mixed DOC_RANK_WEIGHT = 50

This weight affects how much doc_rank contributes to the overall score of a document.

HOST_KEY_POS

Host key position + 1 (the first character indicates doc, inlink, or external link)

public mixed HOST_KEY_POS = 17

KEY_LEN

Length of a doc key part

public mixed KEY_LEN = 8
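
The relationship between the two constants above is simple arithmetic: HOST_KEY_POS = 17 is 2 * KEY_LEN + 1. The following is a minimal sketch of how a key could be split using them. It assumes, without confirmation from this page, that a doc key is made of three KEY_LEN-byte parts and that the third part starts with a one-character type flag. The function name splitDocKey and the sample key are hypothetical.

```php
<?php
// Stand-ins that mirror the documented constant values.
const KEY_LEN = 8;
const HOST_KEY_POS = 17;

/**
 * Splits a doc key into parts, ASSUMING (not confirmed by this page) that
 * a key is three KEY_LEN-byte parts and that the third part begins with a
 * one-character type flag followed by host key bytes.
 */
function splitDocKey(string $doc_key): array
{
    $type_pos = HOST_KEY_POS - 1; // 16 = 2 * KEY_LEN
    return [
        'first_part' => substr($doc_key, 0, KEY_LEN),
        'second_part' => substr($doc_key, KEY_LEN, KEY_LEN),
        'type_flag' => $doc_key[$type_pos],
        'host_part' => substr($doc_key, HOST_KEY_POS),
    ];
}

// Made-up 24-byte key: two 8-byte parts, a flag character, 7 host bytes.
$parts = splitDocKey("AAAAAAAA" . "BBBBBBBB" . "d" . "HHHHHHH");
```

With this made-up key, the flag character sits at index 16 and the remaining host bytes start at index 17.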

RESULTS_PER_BLOCK

Default number of documents returned for each block (at most)

public int RESULTS_PER_BLOCK = 200

Properties

$avg_items_per_partition

public int $avg_items_per_partition

$base64_word_key

Word key above in our modified base 64 encoding

public string $base64_word_key

$current_block_fresh

Says whether the value in $this->count_block is up to date

public bool $current_block_fresh

$current_doc_offset

The current value of the doc_offset of current posting if known

public int $current_doc_offset

$current_generation

Number of the current shard

public int $current_generation

$current_offset

The current byte offset in the IndexShard (if older index)

public int $current_offset

$dictionary_info

An array of shard generation and posting list offsets, lengths, and numbers of documents

public array<string|int, mixed> $dictionary_info

$direction

Whether the iterator iterates forward or backward through documents in bundle

public int $direction

$empty

Keeps track of whether the word_iterator list is empty because the word does not appear in the index shard

public int $empty

$filter

Model responsible for keeping track of edited and deleted search results

public SearchfiltersModel $filter

$generation_pointer

Index into dictionary_info corresponding to the current shard

public int $generation_pointer

$index_name

The timestamp of the index associated with this iterator

public string $index_name

$index_version

The index version affects how the iterator cycles through documents. There was a big change in index format between version 3 and prior formats.

public int $index_version

$is_meta

Whether word key corresponds to a meta word

public string $is_meta

$last_offset

Last Offset of word occurrence in the IndexShard

public int $last_offset

$max_items_per_partition

public int $max_items_per_partition

$next_offset

The next byte offset in the IndexShard

public int $next_offset

$no_more_generations

Used to keep track of whether getWordInfo might still get more data on the search terms as generations are advanced

public bool $no_more_generations

$num_docs

Estimate of the number of documents that this iterator can return

public int $num_docs

$num_generations

The total number of shards that have data for this word

public int $num_generations

$pages

Cache of what currentDocsWithWord returns

public array<string|int, mixed> $pages

$ranking_factors

How url, keywords, and title words should influence relevance and doc rank calculations

public array<string|int, mixed> $ranking_factors

$results_per_block

Number of documents returned for each block (at most)

public int $results_per_block = self::RESULTS_PER_BLOCK

$start_generation

First shard generation that word info was obtained for

public int $start_generation

$start_offset

Starting Offset of word occurrence in the IndexShard

public int $start_offset

$total_num_docs_and_links

public int $total_num_docs_and_links

$total_number_of_partitions

public int $total_number_of_partitions

$word_key

Hash of the word or phrase that the iterator iterates over

public string $word_key

Methods

__construct()

Creates a word iterator with the given parameters.

public __construct(string $word_key, string $index_name[, bool $raw = false ][, SearchfiltersModel $filter = null ][, int $results_per_block = IndexBundleIterator::RESULTS_PER_BLOCK ][, int $direction = self::ASCENDING ][, array<string|int, mixed> $ranking_factors = [] ]) : mixed
Parameters
$word_key : string

hash of word or phrase to iterate docs of

$index_name : string

time_stamp of the index to use

$raw : bool = false

whether the $word_key is encoded in our variant of base64

$filter : SearchfiltersModel = null

Model responsible for keeping track of edited and deleted search results

$results_per_block : int = IndexBundleIterator::RESULTS_PER_BLOCK

the maximum number of results that can be returned by a findDocsWithWord call

$direction : int = self::ASCENDING

the order in which results should be presented when they are accessed from $index_name. self::ASCENDING is from first added to last added; self::DESCENDING is from last added to first added. Note: this value is not saved permanently, so you could in theory open two read-only versions of the same bundle and read the results in different directions.

$ranking_factors : array<string|int, mixed> = []

fields saying how url, keywords, and title words should influence relevance and doc rank calculations

Return values
mixed

advance()

Forwards the iterator one group of docs

public advance([array<string|int, mixed> $gen_doc_offset = null ]) : mixed
Parameters
$gen_doc_offset : array<string|int, mixed> = null

a (generation, doc_offset) pair. If not null, then in the ascending search case (the opposite holds for descending) the next block must be of greater than or equal generation, and if the generation is equal, all of its $doc_offsets must be larger than or equal to this value.

Return values
mixed
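
The constraint on $gen_doc_offset can be pictured as a skip over (generation, doc_offset) pairs. The sketch below is not the Yioop implementation. It assumes the ascending case, a list of pairs already sorted in ascending order, and a hypothetical helper name skipToGenDocOffset.

```php
<?php
/**
 * Returns the index of the first [generation, doc_offset] pair that is
 * greater than or equal to $target, or -1 if there is none. ASSUMES the
 * pairs are sorted in ascending order (the ascending search case).
 */
function skipToGenDocOffset(array $pairs, array $target): int
{
    foreach ($pairs as $i => $pair) {
        // Compare generations first, then doc offsets on a tie.
        $cmp = $pair[0] <=> $target[0];
        if ($cmp === 0) {
            $cmp = $pair[1] <=> $target[1];
        }
        if ($cmp >= 0) {
            return $i;
        }
    }
    return -1;
}

// Made-up pairs spanning two generations.
$pairs = [[0, 16], [0, 48], [1, 0], [1, 32]];
$same_generation = skipToGenDocOffset($pairs, [0, 20]);
$next_generation = skipToGenDocOffset($pairs, [1, 0]);
$past_the_end = skipToGenDocOffset($pairs, [2, 0]);
```

A linear scan keeps the sketch short; a real index would more plausibly jump using stored offsets rather than visit every pair.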

advanceGeneration()

Switches which index shard is being used to return occurrences of the word to the next shard containing the word

public advanceGeneration([int $generation = null ]) : mixed
Parameters
$generation : int = null

generation to advance beyond

Return values
mixed

advanceSeenDocs()

Updates the seen_docs count during an advance() call

public advanceSeenDocs() : mixed
Return values
mixed

currentDocsWithWord()

Gets the current block of doc ids and scores associated with this iterator's word

public currentDocsWithWord() : mixed
Return values
mixed

doc ids and score if there are docs left, -1 otherwise
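
The return contract above (a block of docs, or -1 when none are left) suggests a simple consumption loop. The following sketch uses a toy stand-in class, ToyBlockIterator, which is hypothetical and only mimics the documented contract; it does not read a real index.

```php
<?php
/**
 * Toy stand-in for a block iterator; NOT the real WordIterator. It only
 * mimics the documented contract: currentDocsWithWord() returns a block
 * of docs or -1, and advance() moves to the next block.
 */
class ToyBlockIterator
{
    private array $docs;
    private int $results_per_block;
    private int $seen_docs = 0;

    public function __construct(array $docs, int $results_per_block)
    {
        $this->docs = $docs;
        $this->results_per_block = $results_per_block;
    }

    public function currentDocsWithWord()
    {
        if ($this->seen_docs >= count($this->docs)) {
            return -1; // no docs left
        }
        return array_slice($this->docs, $this->seen_docs,
            $this->results_per_block, true);
    }

    public function advance(): void
    {
        $this->seen_docs += $this->results_per_block;
    }
}

// Made-up doc ids and scores, read two at a time.
$iterator = new ToyBlockIterator(
    ['doc_a' => 0.9, 'doc_b' => 0.7, 'doc_c' => 0.4], 2);
$blocks = [];
while (is_array($block = $iterator->currentDocsWithWord())) {
    $blocks[] = array_keys($block);
    $iterator->advance();
}
```

Testing with is_array() rather than a truthiness check avoids confusing the -1 sentinel with a block.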

currentGenDocOffsetWithWord()

Gets the doc_offset and generation for the next document that would be returned by this iterator

public currentGenDocOffsetWithWord() : mixed
Return values
mixed

an array with the desired document offset and generation; -1 on fail

findDocsWithWord()

Hook function used by currentDocsWithWord to return the current block of docs if it is not cached

public findDocsWithWord() : mixed
Return values
mixed

doc ids and score if there are docs left, -1 otherwise
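
The split between currentDocsWithWord() and this hook resembles a cache in front of a lookup. The sketch below illustrates that pattern only. The $pages name echoes the documented cache property, while the $pages_fresh flag, the $hook_calls counter, and the class ToyCachedBlock are hypothetical, and the logic is an assumption rather than Yioop's code.

```php
<?php
/**
 * Toy illustration of a cached "current block" backed by a hook method.
 * The caching logic is an assumption, not taken from the Yioop source.
 */
class ToyCachedBlock
{
    public array $pages = [];
    public bool $pages_fresh = false; // hypothetical freshness flag
    public int $hook_calls = 0;       // counts calls, for illustration

    public function currentDocsWithWord()
    {
        if ($this->pages_fresh) {
            return $this->pages; // serve the cached block
        }
        $pages = $this->findDocsWithWord();
        if (is_array($pages)) {
            $this->pages = $pages;
            $this->pages_fresh = true;
        }
        return $pages;
    }

    public function findDocsWithWord()
    {
        $this->hook_calls++;
        return ['doc_a' => 1.0]; // made-up block
    }

    public function advance(): void
    {
        $this->pages_fresh = false; // the cached block is now stale
    }
}

$cached = new ToyCachedBlock();
$cached->currentDocsWithWord();
$cached->currentDocsWithWord();
$calls_before_advance = $cached->hook_calls;
$cached->advance();
$cached->currentDocsWithWord();
$calls_after_advance = $cached->hook_calls;
```

In this toy version the hook runs once per block no matter how often the current block is requested.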

frequencyNormalizationPrefaceScoring()

Normalizes the frequencies of a term within a document with respect to the length of the document, the positions of the term within the document, and the overall importance score for a given position within the document. Also computes the score of the posting for the host keywords, title keywords, and path keywords.

public frequencyNormalizationPrefaceScoring(array<string|int, mixed> $positions, int $num_words, int $host_keywords_end_pos, int $title_end_pos, int $path_keywords_end_pos, array<string|int, mixed> $descriptions_scores) : array<string|int, mixed>
Parameters
$positions : array<string|int, mixed>

positions of this iterator's term in the document

$num_words : int

number of terms in the document

$host_keywords_end_pos : int

term offset into the document summary that marks the end of the host keywords portion of the summary

$title_end_pos : int

absolute term offset into the document summary that marks the end of the title portion of the summary

$path_keywords_end_pos : int

absolute term offset into the document summary that marks the end of the path keywords portion of the summary

$descriptions_scores : array<string|int, mixed>

boundaries and scores of different regions within the document

Return values
array<string|int, mixed>

[normalized frequency, score for host name, title, and path keywords]
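
To make the role of the three end positions concrete, here is a minimal sketch that only counts term positions per region. It assumes, as a guess from the parameter order, that the summary is laid out as host keywords, then title, then path keywords, then description, and that each end position is the first position past its region. It does not reproduce the actual normalization or scoring, and the name countPositionsByRegion is hypothetical.

```php
<?php
/**
 * Counts term positions per region, ASSUMING the summary is laid out as
 * host keywords, then title, then path keywords, then description, with
 * each *_end_pos being the first position past its region.
 */
function countPositionsByRegion(array $positions,
    int $host_keywords_end_pos, int $title_end_pos,
    int $path_keywords_end_pos): array
{
    $counts = ['host' => 0, 'title' => 0, 'path' => 0, 'description' => 0];
    foreach ($positions as $position) {
        if ($position < $host_keywords_end_pos) {
            $counts['host']++;
        } else if ($position < $title_end_pos) {
            $counts['title']++;
        } else if ($position < $path_keywords_end_pos) {
            $counts['path']++;
        } else {
            $counts['description']++;
        }
    }
    return $counts;
}

// Made-up positions and region boundaries.
$counts = countPositionsByRegion([1, 4, 9, 30, 31], 3, 8, 12);
```

Counts like these could then be weighted per region, which is the kind of input a per-region score would need.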

genDocOffsetCmp()

Compares two arrays each containing a (generation, offset) pair.

public genDocOffsetCmp(array<string|int, mixed> $gen_doc1, array<string|int, mixed> $gen_doc2[, int $direction = self::ASCENDING ]) : int
Parameters
$gen_doc1 : array<string|int, mixed>

first ordered pair

$gen_doc2 : array<string|int, mixed>

second ordered pair

$direction : int = self::ASCENDING

whether the comparison should be done for a self::ASCENDING or a self::DESCENDING search

Return values
int

-1, 0, or 1 depending on which is bigger
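
A comparison of this kind can be sketched as a lexicographic comparison on the pair. The version below assumes that generations are compared first, then offsets, and that a descending search simply flips the sign. The constant values and the function name genDocOffsetCmpSketch are hypothetical; the real direction constants are defined elsewhere in Yioop.

```php
<?php
// Hypothetical direction flags; the real values are defined in Yioop.
const ASCENDING = 1;
const DESCENDING = -1;

/**
 * Compares two [generation, offset] pairs. ASSUMPTION: generations are
 * compared first, then offsets, and a descending search flips the sign.
 */
function genDocOffsetCmpSketch(array $gen_doc1, array $gen_doc2,
    int $direction = ASCENDING): int
{
    $cmp = $gen_doc1[0] <=> $gen_doc2[0];
    if ($cmp === 0) {
        $cmp = $gen_doc1[1] <=> $gen_doc2[1];
    }
    return ($direction === ASCENDING) ? $cmp : -$cmp;
}

$earlier_generation = genDocOffsetCmpSketch([1, 5], [2, 0]);
$same_pair = genDocOffsetCmpSketch([2, 7], [2, 7]);
$flipped = genDocOffsetCmpSketch([1, 5], [2, 0], DESCENDING);
```

Note that the generation dominates: [1, 5] sorts before [2, 0] in the ascending case even though its offset is larger.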

getCurrentDocsForKeys()

Gets the summaries associated with the keys, provided the keys can be found in the current block of docs returned by this iterator

public getCurrentDocsForKeys([array<string|int, mixed> $keys = null ]) : array<string|int, mixed>
Parameters
$keys : array<string|int, mixed> = null

keys to try to find in the current block of returned results

Return values
array<string|int, mixed>

doc summaries that match provided keys

getDirection()

Returns CrawlConstants::ASCENDING or CrawlConstants::DESCENDING depending on the direction in which this iterator traverses the underlying index archive bundle.

public getDirection() : int
Return values
int

direction traversing underlying archive bundle

getDocKeyPositionsScoringInfo()

Adds scoring information, position list information, and information about the relative weights of given positions to a set of postings from a partition, based on the position list file and doc_map file.

public getDocKeyPositionsScoringInfo(mixed $postings, int $partition) : mixed

Parameters
$postings : mixed

posting data to add scoring information to

$partition : int

the partition of the PartitionDocumentBundle that the postings are related to

Return values
mixed

getGenerationPostings()

Given a partition number in the index's PartitionDocumentBundle, retrieves all the postings for the word iterator's term in that partition.

public getGenerationPostings(int $generation) : array<string|int, mixed>
Parameters
$generation : int

partition to get postings for

Return values
array<string|int, mixed>

of posting items

getPostingsSliceResults()

Given the current_offset, results_per_block, and index used, gets results_per_block postings, starting from current_offset in the current direction (ascending or descending), for the term that this word iterator iterates over.

public getPostingsSliceResults() : mixed
Return values
mixed
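
The idea of taking a block-sized slice in either direction can be sketched on a plain array. The version below assumes that in a descending search the offset counts back from the end of the posting list, which this page does not confirm. The function name postingsSlice and the sample postings are hypothetical.

```php
<?php
/**
 * Returns up to $results_per_block postings starting at $current_offset.
 * ASSUMPTION: in a descending search $current_offset counts back from the
 * end of the list, so offset 0 is the last posting.
 */
function postingsSlice(array $postings, int $current_offset,
    int $results_per_block, bool $ascending = true): array
{
    if (!$ascending) {
        $postings = array_reverse($postings);
    }
    return array_slice($postings, $current_offset, $results_per_block);
}

// Made-up postings, listed in the order they were added.
$postings = ['p0', 'p1', 'p2', 'p3', 'p4'];
$forward = postingsSlice($postings, 1, 2);
$backward = postingsSlice($postings, 1, 2, false);
```

Reversing the whole array is fine for a toy list; a real posting list would more likely be read from the appropriate end without copying.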

nextDocIndexOffsetPair()

Computes a pair [posting_slice_offset, $doc_index], such that the $doc_index when shifted to make a doc_offset is greater than $doc_offset and posting_slice_offset is the offset of the first posting with this property.

public nextDocIndexOffsetPair(int $doc_offset) : array<string|int, mixed>
Parameters
$doc_offset : int

the offset for which we are trying to find a posting whose doc_index has a bigger doc_offset

Return values
array<string|int, mixed>

[posting_slice_offset, $doc_index]
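
Finding the first posting past a given doc offset is a natural fit for binary search. The sketch below assumes the doc indexes are sorted in ascending order and that a doc_offset is obtained by a left shift; the shift amount of 4 bits, the [-1, -1] failure value, and the function name are all hypothetical.

```php
<?php
/**
 * Finds the first posting whose doc_index, shifted left by $shift bits,
 * exceeds $doc_offset. $shift and the posting layout are hypothetical.
 * Returns [posting_slice_offset, doc_index], or [-1, -1] if none exists.
 */
function nextDocIndexOffsetPairSketch(array $doc_indexes, int $doc_offset,
    int $shift = 4): array
{
    $low = 0;
    $high = count($doc_indexes); // search the half-open range [low, high)
    while ($low < $high) {
        $mid = intdiv($low + $high, 2);
        if (($doc_indexes[$mid] << $shift) > $doc_offset) {
            $high = $mid; // candidate found; look for an earlier one
        } else {
            $low = $mid + 1;
        }
    }
    return ($low < count($doc_indexes)) ? [$low, $doc_indexes[$low]] :
        [-1, -1];
}

// Made-up doc indexes; shifted by 4 bits they become 16, 32, 80, 144.
$pair = nextDocIndexOffsetPairSketch([1, 2, 5, 9], 32);
$none = nextDocIndexOffsetPairSketch([1, 2], 100);
```

The comparison is strict, so a posting whose doc_offset equals $doc_offset is skipped, matching the "greater than" wording above.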

nextDocsWithWord()

Gets the current block of doc summaries for the word iterator and advances the current pointer to the next block of documents. If a doc index is given, the next block must consist of docs after this doc_index.

public nextDocsWithWord([ $doc_offset = null ]) : array<string|int, mixed>
Parameters
$doc_offset : = null

if set the next block must all have $doc_offsets equal to or larger than this value

Return values
array<string|int, mixed>

doc summaries matching the $this->restrict_phrases

plainAdvance()

Forwards the iterator one group of docs. This is what is called by advance($gen_doc_offset) if $gen_doc_offset is null.

public plainAdvance() : mixed
Return values
mixed

plan()

Returns a string representation of a plan by which the current iterator finds its results

public plan() : string
Return values
string

a representation of the current iterator and its subiterators, useful for determining how a query will be processed

reset()

Resets the iterator to the first document block that it could iterate over

public reset() : mixed
Return values
mixed

setResultsPerBlock()

Sets the value of the results_per_block field. This field controls the maximum number of results that can be returned in one go by currentDocsWithWord().

public setResultsPerBlock(int $num) : mixed
Parameters
$num : int

the maximum number of results that can be returned by a block

Return values
mixed

termInfoIteratorFields()

Used to compute fields such as $this->total_num_docs for this iterator on term $word_key for index $index_name

protected termInfoIteratorFields(string $index_name, string $word_key) : mixed
Parameters
$index_name : string

name of index to compute statistics with respect to

$word_key : string

term to compute statistics with respect to

Return values
mixed

        
