Yioop_V9.5_Source_Code

WordIterator extends IndexBundleIterator
in package

Application

Used to iterate through the documents associated with a word in an IndexArchiveBundle. It also makes it easy to get the summaries of these documents.

A description of how words and the documents containing them are stored is given in the documentation of IndexArchiveBundle.

DOC_RANK_WEIGHT

Weighting factor used as a multiplier when computing a doc-rank, an approximate score of a document based on its position in the index (that is, when it was crawled).


    public
        mixed
    DOC_RANK_WEIGHT
    = 50

This weight controls how much doc_rank contributes to the overall score of a document.

HOST_KEY_POS

Host Key position + 1 (first char says doc, inlink or external link)


    public
        mixed
    HOST_KEY_POS
    = 17

KEY_LEN

Length of a doc key part


    public
        mixed
    KEY_LEN
    = 8

RESULTS_PER_BLOCK

Default number of documents returned for each block (at most)


    public
        int
    RESULTS_PER_BLOCK
    = 200

$archive_file


    public
        int
    $archive_file

$avg_items_per_partition


    public
        int
    $avg_items_per_partition

$base64_word_key

The word key ($word_key) in our modified base64 encoding


    public
        string
    $base64_word_key
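
This page does not spell out what "modified" means. As a minimal sketch, assuming a URL-safe variant that swaps `+` and `/` for `-` and `_` and drops the `=` padding, a round trip could look like the following. The helper names `encodeWordKey` and `decodeWordKey` are hypothetical and are not part of the documented API.

```php
<?php
// Hypothetical helpers: a URL-safe base64 variant (an assumption, not
// confirmed by this page) that avoids the characters +, / and =.
function encodeWordKey(string $raw): string
{
    // Standard base64, then swap the two unsafe characters and strip padding.
    return rtrim(strtr(base64_encode($raw), "+/", "-_"), "=");
}

function decodeWordKey(string $encoded): string
{
    // Undo the character swap and restore padding to a multiple of 4.
    $b64 = strtr($encoded, "-_", "+/");
    $pad = (4 - strlen($b64) % 4) % 4;
    return base64_decode($b64 . str_repeat("=", $pad));
}

// An 8-byte binary key chosen so that plain base64 would contain + and /.
$raw = "\xfb\xff\xfe\x00\x01\x02\x03\x04";
$encoded = encodeWordKey($raw);
$decoded = decodeWordKey($encoded);
echo $encoded, "\n";
```

An encoding like this keeps a binary hash safe to embed in URLs and file names, which is the usual reason for modifying base64.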

$count_block

The number of documents in the current block


    public
        int
    $count_block

$current_block_fresh

Says whether the value in $this->count_block is up to date


    public
        bool
    $current_block_fresh

$current_doc_offset

The doc_offset of the current posting, if known


    public
        int
    $current_doc_offset

$current_generation

Number of the current shard


    public
        int
    $current_generation

$current_offset

The current byte offset in the IndexShard (if older index)


    public
        int
    $current_offset

$dictionary_info

An array of shard generation and posting list offsets, lengths, and numbers of documents


    public
        array<string|int, mixed>
    $dictionary_info

$direction

Whether the iterator iterates forward or backward through documents in bundle


    public
        int
    $direction

$empty

Keeps track of whether the word_iterator list is empty because the word does not appear in the index shard


    public
        int
    $empty

$filter

Model responsible for keeping track of edited and deleted search results


    public
        SearchfiltersModel
    $filter

$generation_pointer

Index into dictionary_info corresponding to the current shard


    public
        int
    $generation_pointer

$index_name

The timestamp of the index associated with this iterator


    public
        string
    $index_name

$index_version

The index version affects how the iterator cycles through documents. There was a big change in index format between version 3 and prior formats.


    public
        int
    $index_version

$is_meta

Whether word key corresponds to a meta word


    public
        string
    $is_meta

$last_offset

Last Offset of word occurrence in the IndexShard


    public
        int
    $last_offset

$max_items_per_partition


    public
        int
    $max_items_per_partition

$next_offset

The next byte offset in the IndexShard


    public
        int
    $next_offset

$no_more_generations

Used to keep track of whether getWordInfo might still get more data on the search terms as generations are advanced


    public
        bool
    $no_more_generations

$num_docs

Estimate of the number of documents that this iterator can return


    public
        int
    $num_docs

$num_generations

The total number of shards that have data for this word


    public
        int
    $num_generations

$num_occurrences


    public
        int
    $num_occurrences

$pages

Cache of what currentDocsWithWord returns


    public
        array<string|int, mixed>
    $pages

$ranking_factors

How url, keywords, and title words should influence relevance and doc rank calculations


    public
        array<string|int, mixed>
    $ranking_factors

$results_per_block

Number of documents returned for each block (at most)


    public
        int
    $results_per_block
     = self::RESULTS_PER_BLOCK

$seen_docs

The number of documents already iterated over


    public
        int
    $seen_docs

$start_generation

First shard generation that word info was obtained for


    public
        int
    $start_generation

$start_offset

Starting Offset of word occurrence in the IndexShard


    public
        int
    $start_offset

$term_info_computed


    public
        int
    $term_info_computed

$threshold_exceeded


    public
        int
    $threshold_exceeded

$total_num_docs


    public
        int
    $total_num_docs

$total_num_docs_and_links


    public
        int
    $total_num_docs_and_links

$total_number_of_partitions


    public
        int
    $total_number_of_partitions

$word_key

Hash of the word or phrase that the iterator iterates over


    public
        string
    $word_key

__construct()

Creates a word iterator with the given parameters.


    public
                    __construct(string $word_key, string $index_name[, bool $raw = false ][, SearchfiltersModel $filter = null ][, int $results_per_block = IndexBundleIterator::RESULTS_PER_BLOCK ][, int $direction = self::ASCENDING ][, array<string|int, mixed> $ranking_factors = [] ]) : mixed

Parameters

$word_key : string: hash of word or phrase to iterate docs of
$index_name : string: timestamp of the index to use
$raw : bool = false: whether $word_key is encoded in our variant of base64
$filter : SearchfiltersModel = null: Model responsible for keeping track of edited and deleted search results
$results_per_block : int = IndexBundleIterator::RESULTS_PER_BLOCK: the maximum number of results that can be returned by a findDocsWithWord call
$direction : int = self::ASCENDING: the order in which results accessed from $index_name should be presented. self::ASCENDING is from first added to last added; self::DESCENDING is from last added to first added. Note: this value is not saved permanently, so you could in theory open two read-only versions of the same bundle that read the results in different directions.
$ranking_factors : array<string|int, mixed> = []: fields that say how url, keywords, and title words should influence relevance and doc rank calculations

Return values

mixed —

advance()

Forwards the iterator one group of docs


    public
                    advance([array<string|int, mixed> $gen_doc_offset = null ]) : mixed

Parameters

$gen_doc_offset : array<string|int, mixed> = null: a (generation, doc_offset) pair. If not null, then in the ascending search case the next block must be of greater than or equal generation, and if the generation is equal, all of its doc_offsets must be larger than or equal to this value. The descending case is the opposite.

Return values

mixed —

advanceGeneration()

Switches which index shard is being used to return occurrences of the word to the next shard containing the word


    public
                    advanceGeneration([int $generation = null ]) : mixed

Parameters

$generation : int = null: generation to advance beyond

Return values

mixed —

advanceSeenDocs()

Updates the seen_docs count during an advance() call


    public
                    advanceSeenDocs() : mixed

Return values

mixed —

currentDocsWithWord()

Gets the current block of doc ids and scores associated with this iterator's word


    public
                    currentDocsWithWord() : mixed

Return values

mixed —

doc ids and score if there are docs left, -1 otherwise

currentGenDocOffsetWithWord()

Gets the doc_offset and generation for the next document that would be returned by this iterator


    public
                    currentGenDocOffsetWithWord() : mixed

Return values

mixed —

an array with the desired document offset and generation; -1 on fail

findDocsWithWord()

Hook function used by currentDocsWithWord to return the current block of docs if it is not cached


    public
                    findDocsWithWord() : mixed

Return values

mixed —

doc ids and score if there are docs left, -1 otherwise
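
The relationship between currentDocsWithWord(), the findDocsWithWord() hook, and the $pages cache can be illustrated with a small stand-alone class. This is a sketch of the caching pattern only: the class name `CachedBlockSketch`, the `$find_calls` counter, and the sample data are hypothetical, and using `$current_block_fresh` as the cache flag is an assumption about how the real class behaves.

```php
<?php
// Toy stand-in for an iterator; not the real WordIterator.
class CachedBlockSketch
{
    public $pages = null;                // cache of what currentDocsWithWord returns
    public $count_block = 0;             // number of documents in the current block
    public $current_block_fresh = false; // whether the cached values are up to date
    public $find_calls = 0;              // hypothetical counter, for illustration

    // Hook: does the (potentially expensive) lookup of the current block.
    public function findDocsWithWord()
    {
        $this->find_calls++;
        return ["doc_a" => 0.9, "doc_b" => 0.4]; // made-up doc ids and scores
    }

    // Returns the cached block, calling the hook only when the cache is stale.
    public function currentDocsWithWord()
    {
        if (!$this->current_block_fresh) {
            $this->pages = $this->findDocsWithWord();
            $this->count_block = is_array($this->pages) ? count($this->pages) : 0;
            $this->current_block_fresh = true;
        }
        return $this->pages;
    }
}

$sketch = new CachedBlockSketch();
$first = $sketch->currentDocsWithWord();
$second = $sketch->currentDocsWithWord(); // served from $pages, hook not re-run
```

In this sketch, anything that moves the iterator would set `$current_block_fresh` back to false so that the next call recomputes the block.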

frequencyNormalizationPrefaceScoring()

Normalizes the frequencies of a term within a document with respect to the length of the document, the positions of the term within the document, and the overall importance score for a given position within the document. Also computes the score of the posting for the host keywords, title keywords, and path keywords.


    public
                    frequencyNormalizationPrefaceScoring(array<string|int, mixed> $positions, int $num_words, int $host_keywords_end_pos, int $title_end_pos, int $path_keywords_end_pos, array<string|int, mixed> $descriptions_scores) : array<string|int, mixed>

Parameters

$positions : array<string|int, mixed>: positions of this iterator's term in the document
$num_words : int: number of terms in the document
$host_keywords_end_pos : int: term offset into the document summary that marks the end of the host keywords portion of the summary
$title_end_pos : int: absolute term offset into the document summary that marks the end of the title portion of the summary
$path_keywords_end_pos : int: absolute term offset into the document summary that marks the end of the path keywords portion of the summary
$descriptions_scores : array<string|int, mixed>: boundaries and scores of different regions within the document

Return values

array<string|int, mixed> —

[normalized frequency, score for host name, title, and path keywords]
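
The three end-position parameters suggest that term positions are bucketed by summary region before being scored. The sketch below shows only that bucketing step and does not reproduce the actual normalization or scoring formula. It assumes the regions are contiguous and ordered host keywords, title, path keywords, then the rest of the document, and that each end position is an exclusive bound; the function name `countPositionsByRegion` is hypothetical.

```php
<?php
// Hypothetical helper: counts term positions per summary region.
// Assumes regions are ordered host, title, path, body and that each
// *_end_pos value is an exclusive upper bound (both are assumptions).
function countPositionsByRegion(
    array $positions,
    int $host_keywords_end_pos,
    int $title_end_pos,
    int $path_keywords_end_pos
): array {
    $counts = ["host" => 0, "title" => 0, "path" => 0, "body" => 0];
    foreach ($positions as $position) {
        if ($position < $host_keywords_end_pos) {
            $counts["host"]++;
        } elseif ($position < $title_end_pos) {
            $counts["title"]++;
        } elseif ($position < $path_keywords_end_pos) {
            $counts["path"]++;
        } else {
            $counts["body"]++;
        }
    }
    return $counts;
}

// Made-up positions with region boundaries at 3, 8, and 12.
$region_counts = countPositionsByRegion([1, 4, 5, 9, 30, 42], 3, 8, 12);
print_r($region_counts);
```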

genDocOffsetCmp()

Compares two arrays each containing a (generation, offset) pair.


    public
                    genDocOffsetCmp(array<string|int, mixed> $gen_doc1, array<string|int, mixed> $gen_doc2[, int $direction = self::ASCENDING ]) : int

Parameters

$gen_doc1 : array<string|int, mixed>: first ordered pair
$gen_doc2 : array<string|int, mixed>: second ordered pair
$direction : int = self::ASCENDING: whether the comparison should be done for a self::ASCENDING or a self::DESCENDING search

Return values

int —

-1, 0, or 1 depending on which is bigger
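
A comparison of this kind can be sketched as a lexicographic comparison on the pair, with the sign flipped for descending searches. The numeric values of the direction constants and the function name `genDocOffsetCmpSketch` are assumptions made for illustration. The real method may also differ in details such as how it treats malformed pairs.

```php
<?php
// Hypothetical direction values; the real constants live in the Yioop source.
const ASCENDING = 1;
const DESCENDING = -1;

// Compares two [generation, doc_offset] pairs: generation first, then offset.
function genDocOffsetCmpSketch(array $gen_doc1, array $gen_doc2,
    int $direction = ASCENDING): int
{
    // The spaceship operator yields -1, 0, or 1; ?: falls through on a tie.
    $cmp = ($gen_doc1[0] <=> $gen_doc2[0]) ?: ($gen_doc1[1] <=> $gen_doc2[1]);
    // For a descending search, "bigger" means earlier, so flip the sign.
    return ($direction === ASCENDING) ? $cmp : -$cmp;
}

// Usage: sort made-up pairs in ascending order.
$pairs = [[2, 10], [1, 500], [1, 40]];
usort($pairs, fn($a, $b) => genDocOffsetCmpSketch($a, $b));
```

Comparing the generation first means that any posting in a later shard sorts after every posting in an earlier one, regardless of its offset.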

getCurrentDocsForKeys()

Gets the summaries associated with the keys, provided the keys can be found in the current block of docs returned by this iterator


    public
                    getCurrentDocsForKeys([array<string|int, mixed> $keys = null ]) : array<string|int, mixed>

Parameters

$keys : array<string|int, mixed> = null: keys to try to find in the current block of returned results

Return values

array<string|int, mixed> —

doc summaries that match provided keys

getDirection()

Returns CrawlConstants::ASCENDING or CrawlConstants::DESCENDING depending on the direction in which this iterator traverses the underlying index archive bundle.


    public
                    getDirection() : int

Return values

int —

direction traversing underlying archive bundle

getDocKeyPositionsScoringInfo()

Adds scoring information, position list information, and information about the relative weights of given positions to a set of postings from a partition, based on the position list file and doc_map file.


    public
                    getDocKeyPositionsScoringInfo(mixed $postings, int $partition) : mixed

Parameters

$postings : mixed: array of posting data to add scoring information to
$partition : int: which partition from the PartitionDocumentBundle the postings are related to

Return values

mixed —

getGenerationPostings()

Given a partition number in the index's PartitionDocumentBundle, retrieves all the postings for the word iterator's term in that partition.


    public
                    getGenerationPostings(int $generation) : array<string|int, mixed>

Parameters

$generation : int: partition to get postings for

Return values

array<string|int, mixed> —

of posting items

getPostingsSliceResults()

Given the current_offset, results_per_block, and index used, gets results_per_block postings starting from current_offset in the current direction (ascending or descending) for the term that this word iterator iterates over.


    public
                    getPostingsSliceResults() : mixed

Return values

mixed —
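
The slicing described above can be sketched on an in-memory array. This is a simplified illustration, assuming postings are held in a plain list and that in the descending case current_offset points at the first item to return; the function name `postingsSliceSketch` and the boolean direction flag are hypothetical.

```php
<?php
// Hypothetical helper: returns up to $results_per_block postings starting
// at $current_offset, walking forward (ascending) or backward (descending).
function postingsSliceSketch(array $postings, int $current_offset,
    int $results_per_block, bool $ascending = true): array
{
    if ($current_offset < 0 || $current_offset >= count($postings)) {
        return []; // nothing left in this direction
    }
    if ($ascending) {
        return array_slice($postings, $current_offset, $results_per_block);
    }
    // Descending: take the items ending at $current_offset, newest first.
    $start = max(0, $current_offset - $results_per_block + 1);
    $length = $current_offset - $start + 1;
    return array_reverse(array_slice($postings, $start, $length));
}

$postings = range(10, 19); // made-up posting values 10..19
$forward = postingsSliceSketch($postings, 2, 3);
$backward = postingsSliceSketch($postings, 4, 3, false);
```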

nextDocIndexOffsetPair()

Computes a pair [posting_slice_offset, $doc_index] such that $doc_index, when shifted to make a doc_offset, is greater than $doc_offset, and posting_slice_offset is the offset of the first posting with this property.


    public
                    nextDocIndexOffsetPair(int $doc_offset) : array<string|int, mixed>

Parameters

$doc_offset : int: the doc_offset for which we are trying to find a posting whose doc_index has a bigger doc_offset

Return values

array<string|int, mixed> —

[posting_slice_offset, $doc_index]
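
Finding the first posting past a given doc_offset is a lower-bound style search. The sketch below uses a binary search over doc indexes sorted in ascending order. The shift amount `DOC_INDEX_SHIFT`, the function name, the null return for "no such posting", and the choice of binary search are all assumptions for illustration; this page does not say how the real method searches.

```php
<?php
// Hypothetical shift that turns a doc_index into a doc_offset.
const DOC_INDEX_SHIFT = 4;

// Returns [posting_slice_offset, doc_index] for the first posting whose
// shifted doc_index exceeds $doc_offset, or null if there is none.
// Assumes $doc_indexes is sorted in ascending order.
function nextDocIndexOffsetPairSketch(array $doc_indexes, int $doc_offset): ?array
{
    $low = 0;
    $high = count($doc_indexes);
    while ($low < $high) {
        $mid = intdiv($low + $high, 2);
        if (($doc_indexes[$mid] << DOC_INDEX_SHIFT) > $doc_offset) {
            $high = $mid;      // candidate found; look for an earlier one
        } else {
            $low = $mid + 1;   // too small; search to the right
        }
    }
    return ($low < count($doc_indexes)) ? [$low, $doc_indexes[$low]] : null;
}

// Made-up doc indexes; shifted by 4 bits they become 16, 48, 64, 144.
$doc_indexes = [1, 3, 4, 9];
$pair = nextDocIndexOffsetPairSketch($doc_indexes, 48);
```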

nextDocsWithWord()

Gets the current block of doc summaries for the word iterator and advances the current pointer to the next block of documents. If a doc index is given, the next block must be of docs after this doc_index.


    public
                    nextDocsWithWord([ $doc_offset = null ]) : array<string|int, mixed>

Parameters

$doc_offset : = null: if set, the next block must all have $doc_offsets equal to or larger than this value

Return values

array<string|int, mixed> —

doc summaries matching the $this->restrict_phrases
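
The get-then-advance behavior lends itself to a simple consumption loop. The toy class below mimics the documented method names on an in-memory array so that the loop can run on its own. It is not the real WordIterator, which needs a Yioop index; the class name `ToyBlockIterator`, the sample data, and returning -1 from nextDocsWithWord() when exhausted are assumptions modeled on the currentDocsWithWord() return value.

```php
<?php
// Toy stand-in that serves docs in fixed-size blocks from an array.
class ToyBlockIterator
{
    private array $docs;
    private int $results_per_block;
    private int $current_offset = 0;

    public function __construct(array $docs, int $results_per_block)
    {
        $this->docs = $docs;
        $this->results_per_block = $results_per_block;
    }

    // Current block of doc id => score, or -1 when no docs are left.
    public function currentDocsWithWord()
    {
        if ($this->current_offset >= count($this->docs)) {
            return -1;
        }
        return array_slice($this->docs, $this->current_offset,
            $this->results_per_block, true);
    }

    // Moves forward by one block.
    public function advance(): void
    {
        $this->current_offset += $this->results_per_block;
    }

    // Returns the current block, then advances past it.
    public function nextDocsWithWord()
    {
        $block = $this->currentDocsWithWord();
        if (is_array($block)) {
            $this->advance();
        }
        return $block;
    }

    // Goes back to the first block.
    public function reset(): void
    {
        $this->current_offset = 0;
    }
}

$iterator = new ToyBlockIterator(
    ["a" => 3.0, "b" => 2.5, "c" => 2.0, "d" => 1.0, "e" => 0.5], 2);
$seen = [];
while (is_array($block = $iterator->nextDocsWithWord())) {
    foreach ($block as $doc_id => $score) {
        $seen[] = $doc_id;
    }
}
```

Checking `is_array()` rather than truthiness keeps the loop correct for the -1 sentinel, since -1 is truthy in PHP.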

plainAdvance()

Forwards the iterator one group of docs. This is what is called by advance($gen_doc_offset) if $gen_doc_offset is null.


    public
                    plainAdvance() : mixed

Return values

mixed —

plan()

Returns a string representation of a plan by which the current iterator finds its results


    public
                    plan() : string

Return values

string —

a representation of the current iterator and its subiterators, useful for determining how a query will be processed

reset()

Resets the iterator to the first document block that it could iterate over


    public
                    reset() : mixed

Return values

mixed —

setResultsPerBlock()

Sets the value of the results_per_block field. This field controls the maximum number of results that can be returned in one go by currentDocsWithWord().


    public
                    setResultsPerBlock(int $num) : mixed

Parameters

$num : int: the maximum number of results that can be returned by a block

Return values

mixed —

termInfoIteratorFields()

Used to compute fields such as $this->total_num_docs for this iterator on term $word_key for index $index_name


    protected
                    termInfoIteratorFields(string $index_name, string $word_key) : mixed

Parameters

$index_name : string: name of index to compute statistics with respect to
$word_key : string: term to compute statistics with respect to

Return values

mixed —
