Yioop_V9.5_Source_Code_Documentation

IndexDictionary
in package
implements CrawlConstants

Data structure used to store for entries of the form: word id, index shard generation, posting list offset, and length of posting list. It has entries for all words stored in a given IndexArchiveBundle. There might be multiple entries for a given word_id if it occurs in more than one index shard in the given IndexArchiveBundle.

In terms of file structure, a dictionary is stored a folder consisting of 256 subfolders. Each subfolder is used to store the word_ids beginning with a particular character. Within a folder are files of various tier levels representing the data stored. As crawling proceeds words from a shard are added to the dictionary in files of tier level 0 either with suffix A or B. If it is detected that both an A and a B file of a given tier level exist, then the results of these two files are merged to a new file at one tier level up . The old files are then deleted. This process is applied recursively until there is at most an A file on each level.

Tags
author

Chris Pollett

Interfaces, Classes, Traits and Enums

CrawlConstants
Shared constants and enums used by components that are involved in the crawling process

Table of Contents

AUX_RECORD_BLANK  = "\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff"
Represents an empty element in an Auxiliary dictionary entry record 10 bytes long
AUX_RECORD_LEN  = 10
Represents the length of an element in a dictionary aux record
DICT_BLOCK_POWER  = 12
Disk block size is 1<< this power
DICT_BLOCK_SIZE  = 4096
Size in bytes of one block in IndexDictionary
MAX_DICT_FILE_HANDLES  = 200
Maximum number of simultaneously open file handles
NUM_PREFIX_LETTERS  = 256
Number of possible prefix records (number of possible values for second char of a word id)
PREFIX_HEADER_SIZE  = 2048
One dictionary file represents the words whose ids begin with a fixed char. Amongst these id, the prefix index gives offsets for where id's with a given second char start. The total length of the records needed is PREFIX_ITEM_SIZE * NUM_PREFIX_LETTERS.
PREFIX_ITEM_SIZE  = 8
Size of an item in the prefix index used to look up words.
SEGMENT_SIZE  = 20000000
When merging two files on a given dictionary tier. This is the max number of bytes to read in one go. (Must be divisible by WORD_ITEM_LEN)
$active_tiers  : array<string|int, mixed>
Tiers which currently have data for reading
$blocks  : array<string|int, mixed>
A cached array of disk blocks for an index dictionary that has not been completely loaded into memory.
$dir_name  : string
Folder name to use for this IndexDictionary
$fhs  : resource
Array of file handle for files in the dictionary. Members are used to read files to look up words.
$file_lens  : int
Array of file lengths for files in the dictionary. Use so don't try to seek past end of files
$max_tier  : int
The highest tiered index in the IndexDictionary
$parent_archive_bundle  : object
If not null, then the parent IndexArchiveBundle this dictionary belongs to
$read_tier  : int
Tier currently being used to read dictionary data from
$shard_doc_lens  : array<string|int, mixed>
Length of the doc strings for each of the shards that have been added to the dictionary.
__construct()  : mixed
Makes an index dictionary with the given name
addAuxInfoRecords()  : mixed
Adds auxiliary records for a given word id if after merging info for a given word id can't be stored in a single record.
addLookedUpEntry()  : mixed
This method is used when computing the array of (generation, posting_list_start, len, exact_word_id) quadruples when looking up a $word_id in an index dictionary. It adds the word record to the quadruple array $info that has been calculated so far. It also update $total_count, and as well as $previous info for the previous matching record.
addShardDictionary()  : mixed
Adds the words in the provided IndexShard to the dictionary.
calculateActiveTiers()  : array<string|int, mixed>
Based on the current set of tiers in the 0 prrefix sub-folder determine an array of active dictionary tiers.
combineDictionaryRecord()  : string
Used to combine the dictionary records for a given word_id between that come from two different tier files
decodeAuxRecord()  : array<string|int, mixed>
Used to decode an auxiliary dictionary record associated with a given word_id
extractPrefixRecord()  : array<string|int, mixed>
Returns the $record_num'th prefix record from $prefix_string
formatWordInfo()  : array<string|int, mixed>
Auxiliary methods that takes the input triple ($total_count, $max_retained_generation, $info) and filters blank entries from $info and returns the resulting triple
getDictSubstring()  : string
Gets from disk $len many bytes beginning at $offset from the $file_num prefix file in the index dictionary
getWordInfo()  : mixed
For each index shard generation a word occurred in, return as part of array, an array entry of the form generation, first offset, last offset, and number of documents the word occurred in for this shard. The first offset (similarly, the last offset) is the byte offset into the word_docs string of the first (last) record involving that word.
getWordInfoTier()  : mixed
This method facilitates query processing of an ongoing crawl.
makePrefixLetters()  : mixed
Makes dictionary sub-directories for each of the 256 possible first hash characters that crawHash in raw mode code output.
makePrefixRecord()  : string
Makes a prefix record string out of an offset and count (packs and concatenates).
mergeAllTiers()  : mixed
Merges for each tier and for each first letter subdirectory, the $tier pair of (A and B) files of dictionary words. If max_tier has not been reached but only one of the two tier files is present then that file is renamed with a name one tier higher. The output in all cases is stored in file ending with A or B one tier up. B is used if an A file is already present.
mergeTier()  : mixed
Merges for each first letter subdirectory, the $tier pair of files of dictionary words. The output is stored in $out_slot.
mergeTierFiles()  : mixed
For a fixed prefix directory merges the $tier pair of files of dictionary words. The output is stored in $out_slot.
readBlockDictAtOffset()  : string
Reads DICT_BLOCK_SIZE bytes from the prefix file $file_num beginning at byte offset $bytes

Constants

AUX_RECORD_BLANK

Represents an empty element in an Auxiliary dictionary entry record 10 bytes long

public mixed AUX_RECORD_BLANK = "\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff"

AUX_RECORD_LEN

Represents the length of an element in a dictionary aux record

public mixed AUX_RECORD_LEN = 10

DICT_BLOCK_POWER

Disk block size is 1<< this power

public mixed DICT_BLOCK_POWER = 12

DICT_BLOCK_SIZE

Size in bytes of one block in IndexDictionary

public mixed DICT_BLOCK_SIZE = 4096

MAX_DICT_FILE_HANDLES

Maximum number of simultaneously open file handles

public mixed MAX_DICT_FILE_HANDLES = 200

NUM_PREFIX_LETTERS

Number of possible prefix records (number of possible values for second char of a word id)

public mixed NUM_PREFIX_LETTERS = 256

PREFIX_HEADER_SIZE

One dictionary file represents the words whose ids begin with a fixed char. Amongst these id, the prefix index gives offsets for where id's with a given second char start. The total length of the records needed is PREFIX_ITEM_SIZE * NUM_PREFIX_LETTERS.

public mixed PREFIX_HEADER_SIZE = 2048

PREFIX_ITEM_SIZE

Size of an item in the prefix index used to look up words.

public mixed PREFIX_ITEM_SIZE = 8

If the sub-dir was 65 (ASCII A), and the second char was also ASCII 65, then the corresponding prefix record would be the offset to the first word_id beginning with AA, followed by the number of such AA records.

SEGMENT_SIZE

When merging two files on a given dictionary tier. This is the max number of bytes to read in one go. (Must be divisible by WORD_ITEM_LEN)

public mixed SEGMENT_SIZE = 20000000

Properties

$active_tiers

Tiers which currently have data for reading

public array<string|int, mixed> $active_tiers

$blocks

A cached array of disk blocks for an index dictionary that has not been completely loaded into memory.

public array<string|int, mixed> $blocks

$dir_name

Folder name to use for this IndexDictionary

public string $dir_name

$fhs

Array of file handle for files in the dictionary. Members are used to read files to look up words.

public resource $fhs

$file_lens

Array of file lengths for files in the dictionary. Use so don't try to seek past end of files

public int $file_lens

$max_tier

The highest tiered index in the IndexDictionary

public int $max_tier

$parent_archive_bundle

If not null, then the parent IndexArchiveBundle this dictionary belongs to

public object $parent_archive_bundle

$read_tier

Tier currently being used to read dictionary data from

public int $read_tier

$shard_doc_lens

Length of the doc strings for each of the shards that have been added to the dictionary.

public array<string|int, mixed> $shard_doc_lens

Methods

__construct()

Makes an index dictionary with the given name

public __construct(string $dir_name[, object $parent_archive_bundle = null ]) : mixed
Parameters
$dir_name : string

the directory name to store the index dictionary in

$parent_archive_bundle : object = null

parent index archive bundle this dictionary is for

Return values
mixed

addAuxInfoRecords()

Adds auxiliary records for a given word id if after merging info for a given word id can't be stored in a single record.

public addAuxInfoRecords(string $id, int $file_num, int $num_aux_records, int &$total_count, int $threshold, array<string|int, mixed> &$info, int &$previous_generation, int &$num_generations, int $offset, int $num_distinct_generations, int &$max_retained_generation, array<string|int, mixed> &$id_info) : mixed

A typical dictionary entry consists of a 20 byte word id, followed by the 4 bytes ints generation, offset, and length of the posting lists in that generation. If the high bit of the prefix characters in the word id are flipped, it indicates the presence of auxiliary records for that word id. In which case bytes 1, and 2 of the generation, code the number of auxiliary records there will be for this word id. An auxiliary record is 32 bytes long beginning with a bit of the current high prefix letter, followed by a 15 bit code of which aux record in the sequence of aux records for this word id it is, followed by three 10 byte 2byte generation, 4 byte offset, 4 byte len records.

Parameters
$id : string

word id to add aux records for

$file_num : int

which prefix file to read from (always reads a file at the max_tier level)

$num_aux_records : int
$total_count : int
$threshold : int
$info : array<string|int, mixed>
$previous_generation : int
$num_generations : int
$offset : int
$num_distinct_generations : int
$max_retained_generation : int
$id_info : array<string|int, mixed>
Return values
mixed

addLookedUpEntry()

This method is used when computing the array of (generation, posting_list_start, len, exact_word_id) quadruples when looking up a $word_id in an index dictionary. It adds the word record to the quadruple array $info that has been calculated so far. It also update $total_count, and as well as $previous info for the previous matching record.

public addLookedUpEntry(string $id, string $word_id, array<string|int, mixed> $record, array<string|int, mixed> &$info, int &$total_count, int &$previous_generation, int &$previous_id, int &$num_generations, int $num_distinct_generations, int &$max_retained_generation, array<string|int, mixed> &$id_info) : mixed
Parameters
$id : string

of a row to compare $word_id against

$word_id : string

the word id of a term or phrase we are computing the quadruple array for

$record : array<string|int, mixed>

current record from dictionary that we may or may not add to info

$info : array<string|int, mixed>

quadruple array we are adding to

$total_count : int

count of items in $info

$previous_generation : int

last generation added to $info

$previous_id : int

last exact if added to $info

$num_generations : int
$num_distinct_generations : int
$max_retained_generation : int
$id_info : array<string|int, mixed>
Return values
mixed

addShardDictionary()

Adds the words in the provided IndexShard to the dictionary.

public addShardDictionary(object $index_shard[, object $callback = null ]) : mixed

Merges tiers as needed.

Parameters
$index_shard : object

the shard to add the word to the dictionary with

$callback : object = null

object with join function to be called if process is taking too long

Return values
mixed

calculateActiveTiers()

Based on the current set of tiers in the 0 prrefix sub-folder determine an array of active dictionary tiers.

public calculateActiveTiers() : array<string|int, mixed>
Return values
array<string|int, mixed>

active dictionary tiers which may be added to by and ongoing crawl

combineDictionaryRecord()

Used to combine the dictionary records for a given word_id between that come from two different tier files

public combineDictionaryRecord(string $record_a, string $record_b, int $prefix_bit) : string
Parameters
$record_a : string

a dictionary record including auxiliary records from the 'a'th file of the give tier

$record_b : string

a dictionary record including auxiliary records from the 'b'th file of the give tier

$prefix_bit : int

either 0 or 32768. The first bit of an auxiliary record should be negation of higher order bit of the given prefix letter used by the tier file.

Return values
string

a single record with merged strings making use of auxliary records as needed containing (generation, posting list offset, length) information.

decodeAuxRecord()

Used to decode an auxiliary dictionary record associated with a given word_id

public decodeAuxRecord(string $record_string, string $offset) : array<string|int, mixed>
Parameters
$record_string : string

string in which dictionary records occur

$offset : string

a byte offset into $record_string

Return values
array<string|int, mixed>

of up to three strings

extractPrefixRecord()

Returns the $record_num'th prefix record from $prefix_string

public extractPrefixRecord(string $prefix_string, int $record_num) : array<string|int, mixed>
Parameters
$prefix_string : string

string to get record from

$record_num : int

which record to extract

Return values
array<string|int, mixed>

$offset, $count array

formatWordInfo()

Auxiliary methods that takes the input triple ($total_count, $max_retained_generation, $info) and filters blank entries from $info and returns the resulting triple

public formatWordInfo(int $total_count, int $max_retained_generation, array<string|int, mixed> $info) : array<string|int, mixed>
Parameters
$total_count : int
$max_retained_generation : int
$info : array<string|int, mixed>
Return values
array<string|int, mixed>

resulting triple

getDictSubstring()

Gets from disk $len many bytes beginning at $offset from the $file_num prefix file in the index dictionary

public getDictSubstring(int $file_num, int $offset, int $len) : string
Parameters
$file_num : int

which prefix file to read from (always reads a file at the max_tier level)

$offset : int

byte offset to start reading from

$len : int

number of bytes to read

Return values
string

data from that location in the shard

getWordInfo()

For each index shard generation a word occurred in, return as part of array, an array entry of the form generation, first offset, last offset, and number of documents the word occurred in for this shard. The first offset (similarly, the last offset) is the byte offset into the word_docs string of the first (last) record involving that word.

public getWordInfo(string $word_id[, bool $raw = false ][, int $threshold = -1 ][, int $start_generation = -1 ][, int $num_distinct_generations = -1 ][, bool $with_remaining_total = false ]) : mixed
Parameters
$word_id : string

id of the word or phrase one wants to look up

$raw : bool = false

whether the id is our version of base64 encoded or not

$threshold : int = -1

if greater than zero how many posting list results in dictionary info returned before stopping looking for more matches

$start_generation : int = -1

which index shard in inverted index to start search from

$num_distinct_generations : int = -1

how many shard to consider after $start_generation

$with_remaining_total : bool = false
Return values
mixed

an array of entries of the form generation, first offset, last offset, count, matched_key If also have with remaining true, then get a pair, with second element as above and first element the estimated total number of of docs

getWordInfoTier()

This method facilitates query processing of an ongoing crawl.

public getWordInfoTier(string $word_id, bool $raw, int $tier[, int $threshold = -1 ][, int $start_generation = -1 ][, int $num_distinct_generations = -1 ]) : mixed

During an ongoing crawl, the dictionary is arranged into tiers as per the logarithmic merge algorithm rather than just one tier as in a crawl that has been stopped. Word info for more recently crawled pages will tend to be in lower tiers than data that was crawled earlier. getWordInfoTier gets word info data for a specific tier in the index dictionary. Each tier will have word info for a specific, disjoint set of shards, so the format of how to look up posting lists in a shard can be the same regardless of the tier: an array entry is of the form generation, first offset, last offset, and number of documents the word occurred in for this shard.

Parameters
$word_id : string

id of the word one wants to look up

$raw : bool

whether the id is our version of base64 encoded or not

$tier : int

which tier to get word info from

$threshold : int = -1

if greater than zero how many posting list results in dictionary info returned before stopping looking for more matches

$start_generation : int = -1

if positive the first generation to return information about

$num_distinct_generations : int = -1

if positive number of then determines the number of generations after the starting generation to return information about

Return values
mixed

a tuple (total_count, max_found_generation, an array of entries of the form generation, first offset, last offset, count, matched_key) or false if no data

makePrefixLetters()

Makes dictionary sub-directories for each of the 256 possible first hash characters that crawHash in raw mode code output.

public static makePrefixLetters(string $dir_name) : mixed
Parameters
$dir_name : string

base directory in which these sub-directories should be made

Return values
mixed

makePrefixRecord()

Makes a prefix record string out of an offset and count (packs and concatenates).

public makePrefixRecord(int $offset, int $count) : string
Parameters
$offset : int

byte offset into words for the prefix record

$count : int

number of word with that prefix

Return values
string

the packed record

mergeAllTiers()

Merges for each tier and for each first letter subdirectory, the $tier pair of (A and B) files of dictionary words. If max_tier has not been reached but only one of the two tier files is present then that file is renamed with a name one tier higher. The output in all cases is stored in file ending with A or B one tier up. B is used if an A file is already present.

public mergeAllTiers([object $callback = null ][, int $max_tier = -1 ][, bool $fast_merge_all = false ]) : mixed
Parameters
$callback : object = null

object with join function to be called if process is taking too long

$max_tier : int = -1

the maximum tier to merge to merge till -- if not set then $this->max_tier used. Otherwise, one would typically set to a value bigger than $this->max_tier

$fast_merge_all : bool = false

if true then merge away B slots but don't merge everything to a top tier

Return values
mixed

mergeTier()

Merges for each first letter subdirectory, the $tier pair of files of dictionary words. The output is stored in $out_slot.

public mergeTier(int $tier, string $out_slot) : mixed
Parameters
$tier : int

tier level to perform the merge of files at

$out_slot : string

either "A" or "B", the suffix but not extension of the file one tier up to create with the merged results.

Return values
mixed

mergeTierFiles()

For a fixed prefix directory merges the $tier pair of files of dictionary words. The output is stored in $out_slot.

public mergeTierFiles(int $prefix, int $tier, string $out_slot) : mixed
Parameters
$prefix : int

which prefix directory to perform the merge of files

$tier : int

tier level to perform the merge of files at

$out_slot : string

either "A" or "B", the suffix but not extension of the file one tier up to create with the merged results.

Return values
mixed

readBlockDictAtOffset()

Reads DICT_BLOCK_SIZE bytes from the prefix file $file_num beginning at byte offset $bytes

public & readBlockDictAtOffset(int $file_num, int $bytes) : string
Parameters
$file_num : int

which dictionary file (given by first letter prefix) to read from

$bytes : int

byte offset to start reading from

Return values
string

&data fromIndexShard file


        

Search results