Yioop_V9.5_Source_Code

IndexDictionary
in package

Application

implements CrawlConstants

Data structure used to store entries of the form: word id, index shard generation, posting list offset, and length of posting list. It has entries for all words stored in a given IndexArchiveBundle. There might be multiple entries for a given word_id if it occurs in more than one index shard in the given IndexArchiveBundle.

In terms of file structure, a dictionary is stored in a folder consisting of 256 subfolders. Each subfolder is used to store the word_ids beginning with a particular character. Within a subfolder are files of various tier levels representing the data stored. As crawling proceeds, words from a shard are added to the dictionary in files of tier level 0 with either suffix A or suffix B. If both an A and a B file of a given tier level are detected, the contents of these two files are merged into a new file one tier level up. The old files are then deleted. This process is applied recursively until there is at most an A file on each level.
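
The A/B slot scheme behaves much like incrementing a binary counter. The following is a minimal in-memory sketch of that idea, assuming each tier file can be modeled as a sorted PHP array and that a newly arriving file takes the B role whenever slot A is occupied. The names `addToTiers` and `$tiers` are hypothetical and are not part of IndexDictionary.

```php
<?php
/**
 * Toy model of the tiered A/B merge. $tiers[$level] holds the "file" in
 * slot 'A' of that tier; each file is just a sorted list of word ids.
 * Hypothetical helper, not Yioop's implementation.
 */
function addToTiers(array &$tiers, array $shard_words): void
{
    sort($shard_words);
    $level = 0;
    $file = $shard_words;
    while (true) {
        if (!isset($tiers[$level]['A'])) {
            // Slot A is free: store the file and stop.
            $tiers[$level]['A'] = $file;
            return;
        }
        // Slot A is taken, so this file plays the B role: merge A and B,
        // delete both, and carry the merged result one tier up.
        $merged = array_merge($tiers[$level]['A'], $file);
        sort($merged);
        unset($tiers[$level]['A']);
        $level++;
        $file = $merged;
    }
}

$tiers = [];
foreach ([['cat'], ['dog'], ['ant'], ['bee'], ['eel']] as $shard) {
    addToTiers($tiers, $shard);
}
// Five shards (binary 101) leave an A file on tiers 0 and 2 only.
foreach ($tiers as $level => $slots) {
    if (!empty($slots)) {
        echo "tier $level: " . implode(",", $slots['A']) . "\n";
    }
}
```

After any number of additions, each tier holds at most one file, which matches the end state described above.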

Interfaces, Classes, Traits and Enums

CrawlConstants: Shared constants and enums used by components that are involved in the crawling process

AUX_RECORD_BLANK = "\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff": Represents an empty element in an Auxiliary dictionary entry record 10 bytes long
AUX_RECORD_LEN = 10: Represents the length of an element in a dictionary aux record
DICT_BLOCK_POWER = 12: Disk block size is 1<< this power
DICT_BLOCK_SIZE = 4096: Size in bytes of one block in IndexDictionary
MAX_DICT_FILE_HANDLES = 200: Maximum number of simultaneously open file handles
NUM_PREFIX_LETTERS = 256: Number of possible prefix records (number of possible values for second char of a word id)
PREFIX_HEADER_SIZE = 2048: One dictionary file represents the words whose ids begin with a fixed char. Among these ids, the prefix index gives the offsets at which ids with a given second char start. The total length of the records needed is PREFIX_ITEM_SIZE * NUM_PREFIX_LETTERS.
PREFIX_ITEM_SIZE = 8: Size of an item in the prefix index used to look up words.
SEGMENT_SIZE = 20000000: When merging two files on a given dictionary tier, this is the max number of bytes to read in one go. (Must be divisible by WORD_ITEM_LEN)
$active_tiers : array<string|int, mixed>: Tiers which currently have data for reading
$blocks : array<string|int, mixed>: A cached array of disk blocks for an index dictionary that has not been completely loaded into memory.
$dir_name : string: Folder name to use for this IndexDictionary
$fhs : resource: Array of file handles for files in the dictionary. Members are used to read files to look up words.
$file_lens : int: Array of file lengths for files in the dictionary. Used so that we don't try to seek past the end of files
$max_tier : int: The highest tiered index in the IndexDictionary
$parent_archive_bundle : object: If not null, then the parent IndexArchiveBundle this dictionary belongs to
$read_tier : int: Tier currently being used to read dictionary data from
$shard_doc_lens : array<string|int, mixed>: Length of the doc strings for each of the shards that have been added to the dictionary.
__construct() : mixed: Makes an index dictionary with the given name
addAuxInfoRecords() : mixed: Adds auxiliary records for a given word id if after merging info for a given word id can't be stored in a single record.
addLookedUpEntry() : mixed: This method is used when computing the array of (generation, posting_list_start, len, exact_word_id) quadruples when looking up a $word_id in an index dictionary. It adds the word record to the quadruple array $info that has been calculated so far. It also updates $total_count, as well as the $previous_generation and $previous_id values for the previous matching record.
addShardDictionary() : mixed: Adds the words in the provided IndexShard to the dictionary.
calculateActiveTiers() : array<string|int, mixed>: Based on the current set of tiers in the 0 prefix sub-folder, determines an array of active dictionary tiers.
combineDictionaryRecord() : string: Used to combine the dictionary records for a given word_id that come from two different tier files
decodeAuxRecord() : array<string|int, mixed>: Used to decode an auxiliary dictionary record associated with a given word_id
extractPrefixRecord() : array<string|int, mixed>: Returns the $record_num'th prefix record from $prefix_string
formatWordInfo() : array<string|int, mixed>: Auxiliary method that takes the input triple ($total_count, $max_retained_generation, $info), filters blank entries from $info, and returns the resulting triple
getDictSubstring() : string: Gets from disk $len many bytes beginning at $offset from the $file_num prefix file in the index dictionary
getWordInfo() : mixed: For each index shard generation in which a word occurred, returns an array entry of the form generation, first offset, last offset, and number of documents the word occurred in for this shard. The first offset (similarly, the last offset) is the byte offset into the word_docs string of the first (last) record involving that word.
getWordInfoTier() : mixed: This method facilitates query processing of an ongoing crawl.
makePrefixLetters() : mixed: Makes dictionary sub-directories for each of the 256 possible first hash characters that crawlHash in raw mode can output.
makePrefixRecord() : string: Makes a prefix record string out of an offset and count (packs and concatenates).
mergeAllTiers() : mixed: Merges, for each tier and for each first letter subdirectory, the $tier pair of (A and B) files of dictionary words. If max_tier has not been reached but only one of the two tier files is present, then that file is renamed with a name one tier higher. The output in all cases is stored in a file ending with A or B one tier up. B is used if an A file is already present.
mergeTier() : mixed: Merges for each first letter subdirectory, the $tier pair of files of dictionary words. The output is stored in $out_slot.
mergeTierFiles() : mixed: For a fixed prefix directory merges the $tier pair of files of dictionary words. The output is stored in $out_slot.
readBlockDictAtOffset() : string: Reads DICT_BLOCK_SIZE bytes from the prefix file $file_num beginning at byte offset $bytes

AUX_RECORD_BLANK

Represents an empty element in an Auxiliary dictionary entry record 10 bytes long


    public
        mixed
    AUX_RECORD_BLANK
    = "\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff"

AUX_RECORD_LEN

Represents the length of an element in a dictionary aux record


    public
        mixed
    AUX_RECORD_LEN
    = 10

DICT_BLOCK_POWER

Disk block size is 1<< this power


    public
        mixed
    DICT_BLOCK_POWER
    = 12

DICT_BLOCK_SIZE

Size in bytes of one block in IndexDictionary


    public
        mixed
    DICT_BLOCK_SIZE
    = 4096

MAX_DICT_FILE_HANDLES

Maximum number of simultaneously open file handles


    public
        mixed
    MAX_DICT_FILE_HANDLES
    = 200

NUM_PREFIX_LETTERS

Number of possible prefix records (number of possible values for second char of a word id)


    public
        mixed
    NUM_PREFIX_LETTERS
    = 256

PREFIX_HEADER_SIZE

One dictionary file represents the words whose ids begin with a fixed char. Among these ids, the prefix index gives the offsets at which ids with a given second char start. The total length of the records needed is PREFIX_ITEM_SIZE * NUM_PREFIX_LETTERS.


    public
        mixed
    PREFIX_HEADER_SIZE
    = 2048

PREFIX_ITEM_SIZE

Size of an item in the prefix index used to look up words.


    public
        mixed
    PREFIX_ITEM_SIZE
    = 8

If the sub-dir was 65 (ASCII A), and the second char was also ASCII 65, then the corresponding prefix record would be the offset to the first word_id beginning with AA, followed by the number of such AA records.
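
The arithmetic behind the prefix header can be checked directly. This is a small sketch using the constant values documented on this page. It assumes the records are laid out in order of the second character's byte value, and `prefixRecordPosition` is a hypothetical helper name.

```php
<?php
// Values copied from the constants documented above.
const NUM_PREFIX_LETTERS = 256;
const PREFIX_ITEM_SIZE = 8;
const PREFIX_HEADER_SIZE = 2048;

/**
 * Byte position, within the prefix header, of the record for word ids
 * whose second character is $second_char. Hypothetical helper; assumes
 * records are stored in order of the character's byte value.
 */
function prefixRecordPosition(string $second_char): int
{
    return ord($second_char) * PREFIX_ITEM_SIZE;
}

// The header has one fixed-size record per possible second character.
$header_size = PREFIX_ITEM_SIZE * NUM_PREFIX_LETTERS;
echo $header_size, "\n";               // 2048
echo prefixRecordPosition("A"), "\n";  // 520, i.e. 65 * 8
```

Under that assumption, the AA record in sub-dir 65 would occupy bytes 520 through 527 of the header.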

SEGMENT_SIZE

When merging two files on a given dictionary tier, this is the max number of bytes to read in one go. (Must be divisible by WORD_ITEM_LEN)


    public
        mixed
    SEGMENT_SIZE
    = 20000000

$active_tiers

Tiers which currently have data for reading


    public
        array<string|int, mixed>
    $active_tiers

$blocks

A cached array of disk blocks for an index dictionary that has not been completely loaded into memory.


    public
        array<string|int, mixed>
    $blocks

$dir_name

Folder name to use for this IndexDictionary


    public
        string
    $dir_name

$fhs

Array of file handles for files in the dictionary. Members are used to read files to look up words.


    public
        resource
    $fhs

$file_lens

Array of file lengths for files in the dictionary. Used so that we don't try to seek past the end of files


    public
        int
    $file_lens

$max_tier

The highest tiered index in the IndexDictionary


    public
        int
    $max_tier

$parent_archive_bundle

If not null, then the parent IndexArchiveBundle this dictionary belongs to


    public
        object
    $parent_archive_bundle

$read_tier

Tier currently being used to read dictionary data from


    public
        int
    $read_tier

$shard_doc_lens

Length of the doc strings for each of the shards that have been added to the dictionary.


    public
        array<string|int, mixed>
    $shard_doc_lens

__construct()

Makes an index dictionary with the given name


    public
                    __construct(string $dir_name[, object $parent_archive_bundle = null ]) : mixed

Parameters

$dir_name : string: the directory name to store the index dictionary in
$parent_archive_bundle : object = null: parent index archive bundle this dictionary is for

Return values

mixed —

addAuxInfoRecords()

Adds auxiliary records for a given word id if after merging info for a given word id can't be stored in a single record.


    public
                    addAuxInfoRecords(string $id, int $file_num, int $num_aux_records, int &$total_count, int $threshold, array<string|int, mixed> &$info, int &$previous_generation, int &$num_generations, int $offset, int $num_distinct_generations, int &$max_retained_generation, array<string|int, mixed> &$id_info) : mixed

A typical dictionary entry consists of a 20-byte word id followed by three 4-byte ints: the generation, the offset, and the length of the posting list in that generation. If the high bit of the prefix characters in the word id is flipped, this indicates the presence of auxiliary records for that word id, in which case bytes 1 and 2 of the generation encode the number of auxiliary records for this word id. An auxiliary record is 32 bytes long. It begins with a single bit based on the high-order bit of the current prefix letter, followed by a 15-bit code giving the record's position in the sequence of aux records for this word id, followed by three 10-byte records, each made up of a 2-byte generation, a 4-byte offset, and a 4-byte len.
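
The two 32-byte layouts described above can be sketched with `pack` and `unpack`. This is an illustration only: big-endian byte order is an assumption, and `packEntry`, `packAuxRecord`, and `unpackAuxRecord` are hypothetical helper names rather than IndexDictionary methods.

```php
<?php
// Constants copied from the documentation above.
const AUX_RECORD_LEN = 10;
const AUX_RECORD_BLANK = "\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff";

/** Packs a plain entry: 20-byte word id + three 4-byte ints (32 bytes). */
function packEntry(string $word_id, int $generation, int $offset, int $len): string
{
    return str_pad(substr($word_id, 0, 20), 20, "\x00")
        . pack("N3", $generation, $offset, $len);
}

/**
 * Packs an auxiliary record: a 2-byte header (1 prefix bit + 15-bit
 * sequence number) followed by three 10-byte slots. Unused slots are
 * filled with AUX_RECORD_BLANK. $slots is a list of up to three
 * [generation, offset, len] arrays. Byte order is an assumption.
 */
function packAuxRecord(int $prefix_bit, int $seq_num, array $slots): string
{
    $out = pack("n", $prefix_bit | ($seq_num & 0x7FFF));
    for ($i = 0; $i < 3; $i++) {
        $out .= isset($slots[$i]) ? pack("nNN", ...$slots[$i]) : AUX_RECORD_BLANK;
    }
    return $out;
}

/** Inverse of packAuxRecord(); blank slots are skipped. */
function unpackAuxRecord(string $record): array
{
    $header = unpack("n", $record)[1];
    $slots = [];
    for ($i = 0; $i < 3; $i++) {
        $slot = substr($record, 2 + $i * AUX_RECORD_LEN, AUX_RECORD_LEN);
        if ($slot !== AUX_RECORD_BLANK) {
            $slots[] = array_values(unpack("ngeneration/Noffset/Nlen", $slot));
        }
    }
    return [$header & 0x8000, $header & 0x7FFF, $slots];
}

$entry = packEntry(str_repeat("w", 20), 3, 4096, 120);
$aux = packAuxRecord(0x8000, 1, [[3, 4096, 120], [5, 0, 64]]);
echo strlen($entry), " ", strlen($aux), "\n"; // 32 32
```

Both record kinds come out at 32 bytes (2 + 3 * 10 for the auxiliary record), so in this sketch a file of mixed records can still be addressed in fixed-width steps.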

Parameters

$id : string: word id to add aux records for
$file_num : int: which prefix file to read from (always reads a file at the max_tier level)
$num_aux_records : int
$total_count : int
$threshold : int
$info : array<string|int, mixed>
$previous_generation : int
$num_generations : int
$offset : int
$num_distinct_generations : int
$max_retained_generation : int
$id_info : array<string|int, mixed>

Return values

mixed —

addLookedUpEntry()

This method is used when computing the array of (generation, posting_list_start, len, exact_word_id) quadruples when looking up a $word_id in an index dictionary. It adds the word record to the quadruple array $info that has been calculated so far. It also updates $total_count, as well as the $previous_generation and $previous_id values for the previous matching record.


    public
                    addLookedUpEntry(string $id, string $word_id, array<string|int, mixed> $record, array<string|int, mixed> &$info, int &$total_count, int &$previous_generation, int &$previous_id, int &$num_generations, int $num_distinct_generations, int &$max_retained_generation, array<string|int, mixed> &$id_info) : mixed

Parameters

$id : string: of a row to compare $word_id against
$word_id : string: the word id of a term or phrase we are computing the quadruple array for
$record : array<string|int, mixed>: current record from dictionary that we may or may not add to info
$info : array<string|int, mixed>: quadruple array we are adding to
$total_count : int: count of items in $info
$previous_generation : int: last generation added to $info
$previous_id : int: last exact id added to $info
$num_generations : int
$num_distinct_generations : int
$max_retained_generation : int
$id_info : array<string|int, mixed>

Return values

mixed —

addShardDictionary()

Adds the words in the provided IndexShard to the dictionary.


    public
                    addShardDictionary(object $index_shard[, object $callback = null ]) : mixed

Merges tiers as needed.

Parameters

$index_shard : object: the shard whose words are to be added to the dictionary
$callback : object = null: object with join function to be called if process is taking too long

Return values

mixed —

calculateActiveTiers()

Based on the current set of tiers in the 0 prefix sub-folder, determines an array of active dictionary tiers.


    public
                    calculateActiveTiers() : array<string|int, mixed>

Return values

array<string|int, mixed> —

active dictionary tiers which may be added to by an ongoing crawl

combineDictionaryRecord()

Used to combine the dictionary records for a given word_id that come from two different tier files


    public
                    combineDictionaryRecord(string $record_a, string $record_b, int $prefix_bit) : string

Parameters

$record_a : string: a dictionary record including auxiliary records from the 'a'th file of the given tier
$record_b : string: a dictionary record including auxiliary records from the 'b'th file of the given tier
$prefix_bit : int: either 0 or 32768. The first bit of an auxiliary record should be the negation of the higher order bit of the given prefix letter used by the tier file.

Return values

string —

a single record with merged strings, making use of auxiliary records as needed, containing (generation, posting list offset, length) information.

decodeAuxRecord()

Used to decode an auxiliary dictionary record associated with a given word_id


    public
                    decodeAuxRecord(string $record_string, string $offset) : array<string|int, mixed>

Parameters

$record_string : string: string in which dictionary records occur
$offset : string: a byte offset into $record_string

Return values

array<string|int, mixed> —

of up to three strings

extractPrefixRecord()

Returns the $record_num'th prefix record from $prefix_string


    public
                    extractPrefixRecord(string $prefix_string, int $record_num) : array<string|int, mixed>

Parameters

$prefix_string : string: string to get record from
$record_num : int: which record to extract

Return values

array<string|int, mixed> —

$offset, $count array

formatWordInfo()

Auxiliary method that takes the input triple ($total_count, $max_retained_generation, $info), filters blank entries from $info, and returns the resulting triple


    public
                    formatWordInfo(int $total_count, int $max_retained_generation, array<string|int, mixed> $info) : array<string|int, mixed>

Parameters

$total_count : int
$max_retained_generation : int
$info : array<string|int, mixed>

Return values

array<string|int, mixed> —

resulting triple

getDictSubstring()

Gets from disk $len many bytes beginning at $offset from the $file_num prefix file in the index dictionary


    public
                    getDictSubstring(int $file_num, int $offset, int $len) : string

Parameters

$file_num : int: which prefix file to read from (always reads a file at the max_tier level)
$offset : int: byte offset to start reading from
$len : int: number of bytes to read

Return values

string —

data from that location in the shard

getWordInfo()

For each index shard generation in which a word occurred, returns an array entry of the form generation, first offset, last offset, and number of documents the word occurred in for this shard. The first offset (similarly, the last offset) is the byte offset into the word_docs string of the first (last) record involving that word.


    public
                    getWordInfo(string $word_id[, bool $raw = false ][, int $threshold = -1 ][, int $start_generation = -1 ][, int $num_distinct_generations = -1 ][, bool $with_remaining_total = false ]) : mixed

Parameters

$word_id : string: id of the word or phrase one wants to look up
$raw : bool = false: whether the id is our version of base64 encoded or not
$threshold : int = -1: if greater than zero how many posting list results in dictionary info returned before stopping looking for more matches
$start_generation : int = -1: which index shard in inverted index to start search from
$num_distinct_generations : int = -1: how many shard to consider after $start_generation
$with_remaining_total : bool = false

Return values

mixed —

an array of entries of the form generation, first offset, last offset, count, matched_key. If $with_remaining_total is true, then a pair is returned, with the second element as above and the first element the estimated total number of docs

getWordInfoTier()

This method facilitates query processing of an ongoing crawl.


    public
                    getWordInfoTier(string $word_id, bool $raw, int $tier[, int $threshold = -1 ][, int $start_generation = -1 ][, int $num_distinct_generations = -1 ]) : mixed

During an ongoing crawl, the dictionary is arranged into tiers as per the logarithmic merge algorithm rather than just one tier as in a crawl that has been stopped. Word info for more recently crawled pages will tend to be in lower tiers than data that was crawled earlier. getWordInfoTier gets word info data for a specific tier in the index dictionary. Each tier will have word info for a specific, disjoint set of shards, so the format of how to look up posting lists in a shard can be the same regardless of the tier: an array entry is of the form generation, first offset, last offset, and number of documents the word occurred in for this shard.

Parameters

$word_id : string: id of the word one wants to look up
$raw : bool: whether the id is our version of base64 encoded or not
$tier : int: which tier to get word info from
$threshold : int = -1: if greater than zero how many posting list results in dictionary info returned before stopping looking for more matches
$start_generation : int = -1: if positive the first generation to return information about
$num_distinct_generations : int = -1: if positive, determines the number of generations after the starting generation to return information about

Return values

mixed —

a tuple (total_count, max_found_generation, an array of entries of the form generation, first offset, last offset, count, matched_key) or false if no data
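
Because each tier covers a disjoint set of shards, per-tier results can in principle be stitched together without resolving conflicts. Here is a hedged sketch of such a combination step, operating on values shaped like the documented return tuple. `combineTierInfo` is a hypothetical helper, and both summing the counts and sorting by generation are assumptions about how a caller might want the combined result.

```php
<?php
/**
 * Combines per-tier results shaped like the documented return value of
 * getWordInfoTier(): [total_count, max_found_generation, entries] or false.
 * Hypothetical helper, not Yioop's implementation.
 */
function combineTierInfo(array $tier_results)
{
    $total_count = 0;
    $max_generation = -1;
    $entries = [];
    foreach ($tier_results as $result) {
        if ($result === false) {
            continue; // this tier had no data for the word
        }
        [$count, $generation, $tier_entries] = $result;
        $total_count += $count;
        $max_generation = max($max_generation, $generation);
        $entries = array_merge($entries, $tier_entries);
    }
    if (!$entries) {
        return false;
    }
    // Tiers cover disjoint shards, so generations never collide here.
    usort($entries, fn($a, $b) => $a[0] <=> $b[0]);
    return [$total_count, $max_generation, $entries];
}

// Made-up sample data: entries are
// [generation, first offset, last offset, count, matched_key].
$tier0 = [7, 5, [[5, 0, 60, 7, "idA"]]];
$tier2 = [30, 3, [[3, 40, 400, 20, "idA"], [1, 0, 90, 10, "idA"]]];
$combined = combineTierInfo([$tier0, false, $tier2]);
echo $combined[0], " ", $combined[1], "\n"; // 37 5
```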

makePrefixLetters()

Makes dictionary sub-directories for each of the 256 possible first hash characters that crawlHash in raw mode can output.


    public
            static        makePrefixLetters(string $dir_name) : mixed

Parameters

$dir_name : string: base directory in which these sub-directories should be made

Return values

mixed —
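
As a rough illustration of the resulting folder layout, the sketch below creates 256 sub-directories named by byte value, following the "sub-dir 65" example in the PREFIX_ITEM_SIZE description. `makePrefixDirs` is a hypothetical stand-in, and the decimal naming scheme is an assumption drawn from that example.

```php
<?php
/**
 * Creates one sub-directory per possible first byte of a raw word id,
 * named by the byte's decimal value (0 through 255). Hypothetical
 * stand-in for makePrefixLetters(); the naming is an assumption.
 */
function makePrefixDirs(string $dir_name): void
{
    for ($i = 0; $i < 256; $i++) {
        $path = $dir_name . DIRECTORY_SEPARATOR . $i;
        if (!is_dir($path)) {
            mkdir($path, 0777, true);
        }
    }
}

// Work in a throwaway folder under the system temp directory.
$base = sys_get_temp_dir() . DIRECTORY_SEPARATOR . "dict_sketch_" . uniqid();
makePrefixDirs($base);
$made = glob($base . DIRECTORY_SEPARATOR . "*", GLOB_ONLYDIR);
echo count($made), "\n"; // 256
```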

makePrefixRecord()

Makes a prefix record string out of an offset and count (packs and concatenates).


    public
                    makePrefixRecord(int $offset, int $count) : string

Parameters

$offset : int: byte offset into words for the prefix record
$count : int: number of words with that prefix

Return values

string —

the packed record
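
A pack-and-concatenate round trip for an 8-byte prefix record might look like the following. The sketch assumes two unsigned 32-bit big-endian ints, which fits PREFIX_ITEM_SIZE = 8 but is not confirmed by this page. The `...Sketch` function names are hypothetical and only mirror the documented signatures.

```php
<?php
const PREFIX_ITEM_SIZE = 8; // value documented above

/** Packs an offset and a count as two unsigned 32-bit big-endian ints. */
function makePrefixRecordSketch(int $offset, int $count): string
{
    return pack("N", $offset) . pack("N", $count);
}

/** Reads the $record_num'th 8-byte record back out of a prefix string. */
function extractPrefixRecordSketch(string $prefix_string, int $record_num): array
{
    $record = substr($prefix_string, $record_num * PREFIX_ITEM_SIZE,
        PREFIX_ITEM_SIZE);
    $parts = unpack("Noffset/Ncount", $record);
    return [$parts["offset"], $parts["count"]];
}

// Build a three-record prefix string and read the middle record back.
$header = makePrefixRecordSketch(0, 4)
    . makePrefixRecordSketch(128, 2)
    . makePrefixRecordSketch(192, 9);
[$offset, $count] = extractPrefixRecordSketch($header, 1);
echo "$offset $count\n"; // 128 2
```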

mergeAllTiers()

Merges, for each tier and for each first letter subdirectory, the $tier pair of (A and B) files of dictionary words. If max_tier has not been reached but only one of the two tier files is present, then that file is renamed with a name one tier higher. The output in all cases is stored in a file ending with A or B one tier up. B is used if an A file is already present.


    public
                    mergeAllTiers([object $callback = null ][, int $max_tier = -1 ][, bool $fast_merge_all = false ]) : mixed

Parameters

$callback : object = null: object with join function to be called if process is taking too long
$max_tier : int = -1: the maximum tier to merge up to -- if not set, then $this->max_tier is used. Otherwise, one would typically set it to a value bigger than $this->max_tier
$fast_merge_all : bool = false: if true then merge away B slots but don't merge everything to a top tier

Return values

mixed —

mergeTier()

Merges for each first letter subdirectory, the $tier pair of files of dictionary words. The output is stored in $out_slot.


    public
                    mergeTier(int $tier, string $out_slot) : mixed

Parameters

$tier : int: tier level to perform the merge of files at
$out_slot : string: either "A" or "B", the suffix but not extension of the file one tier up to create with the merged results.

Return values

mixed —

mergeTierFiles()

For a fixed prefix directory merges the $tier pair of files of dictionary words. The output is stored in $out_slot.


    public
                    mergeTierFiles(int $prefix, int $tier, string $out_slot) : mixed

Parameters

$prefix : int: which prefix directory to perform the merge of files
$tier : int: tier level to perform the merge of files at
$out_slot : string: either "A" or "B", the suffix but not extension of the file one tier up to create with the merged results.

Return values

mixed —
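
At its core, merging a pair of tier files is a two-way merge of sorted fixed-width records. The following is a simplified sketch, assuming 32-byte entries sorted by their 20-byte word id and compared byte-wise. The constants `WORD_ID_LEN` and `ENTRY_LEN` and both function names are hypothetical. Entries with equal ids are simply kept side by side here, whereas the real merge would combine them (see combineDictionaryRecord()).

```php
<?php
const WORD_ID_LEN = 20; // word id length stated for dictionary entries
const ENTRY_LEN = 32;   // 20-byte id + three 4-byte ints

/**
 * Merges two strings of fixed-width entries, each sorted by word id,
 * into one sorted string. On equal ids the entry from $a goes first.
 */
function mergeSortedEntries(string $a, string $b): string
{
    $out = "";
    $i = 0;
    $j = 0;
    $len_a = strlen($a);
    $len_b = strlen($b);
    while ($i < $len_a && $j < $len_b) {
        // Byte-wise comparison of ids; this sort order is an assumption.
        $id_a = substr($a, $i, WORD_ID_LEN);
        $id_b = substr($b, $j, WORD_ID_LEN);
        if (strcmp($id_a, $id_b) <= 0) {
            $out .= substr($a, $i, ENTRY_LEN);
            $i += ENTRY_LEN;
        } else {
            $out .= substr($b, $j, ENTRY_LEN);
            $j += ENTRY_LEN;
        }
    }
    // One input is exhausted; append whatever remains of the other.
    return $out . substr($a, $i) . substr($b, $j);
}

/** Builds one demo entry; ids are padded with "_" to 20 bytes. */
function demoEntry(string $id, int $generation): string
{
    return str_pad($id, WORD_ID_LEN, "_") . pack("N3", $generation, 0, 10);
}

$file_a = demoEntry("apple", 0) . demoEntry("pear", 0);
$file_b = demoEntry("fig", 1) . demoEntry("plum", 1);
$merged = mergeSortedEntries($file_a, $file_b);
$ids = array_map(
    fn($e) => rtrim(substr($e, 0, WORD_ID_LEN), "_"),
    str_split($merged, ENTRY_LEN)
);
echo implode(",", $ids), "\n"; // apple,fig,pear,plum
```

A real merge would process the files in chunks of at most SEGMENT_SIZE bytes rather than holding both in memory as this sketch does.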

readBlockDictAtOffset()

Reads DICT_BLOCK_SIZE bytes from the prefix file $file_num beginning at byte offset $bytes


    public
                &    readBlockDictAtOffset(int $file_num, int $bytes) : string

Parameters

$file_num : int: which dictionary file (given by first letter prefix) to read from
$bytes : int: byte offset to start reading from

Return values

string —

&data from IndexShard file
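
A block read with a cache, in the spirit of the $blocks property, can be sketched as follows. It uses the documented constants, while `blockStart` and `readBlockAt` are hypothetical names. Rounding an arbitrary offset down to a block boundary before reading is an assumption about how callers would use such a method.

```php
<?php
const DICT_BLOCK_POWER = 12;                   // value documented above
const DICT_BLOCK_SIZE = 1 << DICT_BLOCK_POWER; // 4096

/** Rounds a byte offset down to the start of its block. */
function blockStart(int $bytes): int
{
    // Clearing the low DICT_BLOCK_POWER bits gives a multiple of 4096.
    return ($bytes >> DICT_BLOCK_POWER) << DICT_BLOCK_POWER;
}

/**
 * Reads DICT_BLOCK_SIZE bytes from the open file $fh starting at
 * $offset, caching the result in $blocks keyed by offset.
 */
function readBlockAt($fh, int $offset, array &$blocks): string
{
    if (!isset($blocks[$offset])) {
        fseek($fh, $offset);
        $data = fread($fh, DICT_BLOCK_SIZE);
        $blocks[$offset] = ($data === false) ? "" : $data;
    }
    return $blocks[$offset];
}

// Demo on a temporary 10000-byte file: byte 5000 lives in the block at 4096.
$fh = tmpfile();
fwrite($fh, str_repeat("a", 4096) . str_repeat("b", 4096)
    . str_repeat("c", 1808));
$blocks = [];
$block = readBlockAt($fh, blockStart(5000), $blocks);
echo strlen($block), " ", $block[0], "\n"; // 4096 b
```

A second call with the same offset is served from `$blocks` without touching the file.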
