IndexDictionary
in package
implements
CrawlConstants
Data structure used to store for entries of the form: word id, index shard generation, posting list offset, and length of posting list. It has entries for all words stored in a given IndexArchiveBundle. There might be multiple entries for a given word_id if it occurs in more than one index shard in the given IndexArchiveBundle.
In terms of file structure, a dictionary is stored a folder consisting of 256 subfolders. Each subfolder is used to store the word_ids beginning with a particular character. Within a folder are files of various tier levels representing the data stored. As crawling proceeds words from a shard are added to the dictionary in files of tier level 0 either with suffix A or B. If it is detected that both an A and a B file of a given tier level exist, then the results of these two files are merged to a new file at one tier level up . The old files are then deleted. This process is applied recursively until there is at most an A file on each level.
Tags
Interfaces, Classes, Traits and Enums
- CrawlConstants
- Shared constants and enums used by components that are involved in the crawling process
Table of Contents
- AUX_RECORD_BLANK = "\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff"
- Represents an empty element in an Auxiliary dictionary entry record 10 bytes long
- AUX_RECORD_LEN = 10
- Represents the length of an element in a dictionary aux record
- DICT_BLOCK_POWER = 12
- Disk block size is 1<< this power
- DICT_BLOCK_SIZE = 4096
- Size in bytes of one block in IndexDictionary
- MAX_DICT_FILE_HANDLES = 200
- Maximum number of simultaneously open file handles
- NUM_PREFIX_LETTERS = 256
- Number of possible prefix records (number of possible values for second char of a word id)
- PREFIX_HEADER_SIZE = 2048
- One dictionary file represents the words whose ids begin with a fixed char. Amongst these id, the prefix index gives offsets for where id's with a given second char start. The total length of the records needed is PREFIX_ITEM_SIZE * NUM_PREFIX_LETTERS.
- PREFIX_ITEM_SIZE = 8
- Size of an item in the prefix index used to look up words.
- SEGMENT_SIZE = 20000000
- When merging two files on a given dictionary tier. This is the max number of bytes to read in one go. (Must be divisible by WORD_ITEM_LEN)
- $active_tiers : array<string|int, mixed>
- Tiers which currently have data for reading
- $blocks : array<string|int, mixed>
- A cached array of disk blocks for an index dictionary that has not been completely loaded into memory.
- $dir_name : string
- Folder name to use for this IndexDictionary
- $fhs : resource
- Array of file handle for files in the dictionary. Members are used to read files to look up words.
- $file_lens : int
- Array of file lengths for files in the dictionary. Use so don't try to seek past end of files
- $max_tier : int
- The highest tiered index in the IndexDictionary
- $parent_archive_bundle : object
- If not null, then the parent IndexArchiveBundle this dictionary belongs to
- $read_tier : int
- Tier currently being used to read dictionary data from
- $shard_doc_lens : array<string|int, mixed>
- Length of the doc strings for each of the shards that have been added to the dictionary.
- __construct() : mixed
- Makes an index dictionary with the given name
- addAuxInfoRecords() : mixed
- Adds auxiliary records for a given word id if after merging info for a given word id can't be stored in a single record.
- addLookedUpEntry() : mixed
- This method is used when computing the array of (generation, posting_list_start, len, exact_word_id) quadruples when looking up a $word_id in an index dictionary. It adds the word record to the quadruple array $info that has been calculated so far. It also update $total_count, and as well as $previous info for the previous matching record.
- addShardDictionary() : mixed
- Adds the words in the provided IndexShard to the dictionary.
- calculateActiveTiers() : array<string|int, mixed>
- Based on the current set of tiers in the 0 prrefix sub-folder determine an array of active dictionary tiers.
- combineDictionaryRecord() : string
- Used to combine the dictionary records for a given word_id between that come from two different tier files
- decodeAuxRecord() : array<string|int, mixed>
- Used to decode an auxiliary dictionary record associated with a given word_id
- extractPrefixRecord() : array<string|int, mixed>
- Returns the $record_num'th prefix record from $prefix_string
- formatWordInfo() : array<string|int, mixed>
- Auxiliary methods that takes the input triple ($total_count, $max_retained_generation, $info) and filters blank entries from $info and returns the resulting triple
- getDictSubstring() : string
- Gets from disk $len many bytes beginning at $offset from the $file_num prefix file in the index dictionary
- getWordInfo() : mixed
- For each index shard generation a word occurred in, return as part of array, an array entry of the form generation, first offset, last offset, and number of documents the word occurred in for this shard. The first offset (similarly, the last offset) is the byte offset into the word_docs string of the first (last) record involving that word.
- getWordInfoTier() : mixed
- This method facilitates query processing of an ongoing crawl.
- makePrefixLetters() : mixed
- Makes dictionary sub-directories for each of the 256 possible first hash characters that crawHash in raw mode code output.
- makePrefixRecord() : string
- Makes a prefix record string out of an offset and count (packs and concatenates).
- mergeAllTiers() : mixed
- Merges for each tier and for each first letter subdirectory, the $tier pair of (A and B) files of dictionary words. If max_tier has not been reached but only one of the two tier files is present then that file is renamed with a name one tier higher. The output in all cases is stored in file ending with A or B one tier up. B is used if an A file is already present.
- mergeTier() : mixed
- Merges for each first letter subdirectory, the $tier pair of files of dictionary words. The output is stored in $out_slot.
- mergeTierFiles() : mixed
- For a fixed prefix directory merges the $tier pair of files of dictionary words. The output is stored in $out_slot.
- readBlockDictAtOffset() : string
- Reads DICT_BLOCK_SIZE bytes from the prefix file $file_num beginning at byte offset $bytes
Constants
AUX_RECORD_BLANK
Represents an empty element in an Auxiliary dictionary entry record 10 bytes long
public
mixed
AUX_RECORD_BLANK
= "\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff"
AUX_RECORD_LEN
Represents the length of an element in a dictionary aux record
public
mixed
AUX_RECORD_LEN
= 10
DICT_BLOCK_POWER
Disk block size is 1<< this power
public
mixed
DICT_BLOCK_POWER
= 12
DICT_BLOCK_SIZE
Size in bytes of one block in IndexDictionary
public
mixed
DICT_BLOCK_SIZE
= 4096
MAX_DICT_FILE_HANDLES
Maximum number of simultaneously open file handles
public
mixed
MAX_DICT_FILE_HANDLES
= 200
NUM_PREFIX_LETTERS
Number of possible prefix records (number of possible values for second char of a word id)
public
mixed
NUM_PREFIX_LETTERS
= 256
PREFIX_HEADER_SIZE
One dictionary file represents the words whose ids begin with a fixed char. Amongst these id, the prefix index gives offsets for where id's with a given second char start. The total length of the records needed is PREFIX_ITEM_SIZE * NUM_PREFIX_LETTERS.
public
mixed
PREFIX_HEADER_SIZE
= 2048
PREFIX_ITEM_SIZE
Size of an item in the prefix index used to look up words.
public
mixed
PREFIX_ITEM_SIZE
= 8
If the sub-dir was 65 (ASCII A), and the second char was also ASCII 65, then the corresponding prefix record would be the offset to the first word_id beginning with AA, followed by the number of such AA records.
SEGMENT_SIZE
When merging two files on a given dictionary tier. This is the max number of bytes to read in one go. (Must be divisible by WORD_ITEM_LEN)
public
mixed
SEGMENT_SIZE
= 20000000
Properties
$active_tiers
Tiers which currently have data for reading
public
array<string|int, mixed>
$active_tiers
$blocks
A cached array of disk blocks for an index dictionary that has not been completely loaded into memory.
public
array<string|int, mixed>
$blocks
$dir_name
Folder name to use for this IndexDictionary
public
string
$dir_name
$fhs
Array of file handle for files in the dictionary. Members are used to read files to look up words.
public
resource
$fhs
$file_lens
Array of file lengths for files in the dictionary. Use so don't try to seek past end of files
public
int
$file_lens
$max_tier
The highest tiered index in the IndexDictionary
public
int
$max_tier
$parent_archive_bundle
If not null, then the parent IndexArchiveBundle this dictionary belongs to
public
object
$parent_archive_bundle
$read_tier
Tier currently being used to read dictionary data from
public
int
$read_tier
$shard_doc_lens
Length of the doc strings for each of the shards that have been added to the dictionary.
public
array<string|int, mixed>
$shard_doc_lens
Methods
__construct()
Makes an index dictionary with the given name
public
__construct(string $dir_name[, object $parent_archive_bundle = null ]) : mixed
Parameters
- $dir_name : string
-
the directory name to store the index dictionary in
- $parent_archive_bundle : object = null
-
parent index archive bundle this dictionary is for
Return values
mixed —addAuxInfoRecords()
Adds auxiliary records for a given word id if after merging info for a given word id can't be stored in a single record.
public
addAuxInfoRecords(string $id, int $file_num, int $num_aux_records, int &$total_count, int $threshold, array<string|int, mixed> &$info, int &$previous_generation, int &$num_generations, int $offset, int $num_distinct_generations, int &$max_retained_generation, array<string|int, mixed> &$id_info) : mixed
A typical dictionary entry consists of a 20 byte word id, followed by the 4 bytes ints generation, offset, and length of the posting lists in that generation. If the high bit of the prefix characters in the word id are flipped, it indicates the presence of auxiliary records for that word id. In which case bytes 1, and 2 of the generation, code the number of auxiliary records there will be for this word id. An auxiliary record is 32 bytes long beginning with a bit of the current high prefix letter, followed by a 15 bit code of which aux record in the sequence of aux records for this word id it is, followed by three 10 byte 2byte generation, 4 byte offset, 4 byte len records.
Parameters
- $id : string
-
word id to add aux records for
- $file_num : int
-
which prefix file to read from (always reads a file at the max_tier level)
- $num_aux_records : int
- $total_count : int
- $threshold : int
- $info : array<string|int, mixed>
- $previous_generation : int
- $num_generations : int
- $offset : int
- $num_distinct_generations : int
- $max_retained_generation : int
- $id_info : array<string|int, mixed>
Return values
mixed —addLookedUpEntry()
This method is used when computing the array of (generation, posting_list_start, len, exact_word_id) quadruples when looking up a $word_id in an index dictionary. It adds the word record to the quadruple array $info that has been calculated so far. It also update $total_count, and as well as $previous info for the previous matching record.
public
addLookedUpEntry(string $id, string $word_id, array<string|int, mixed> $record, array<string|int, mixed> &$info, int &$total_count, int &$previous_generation, int &$previous_id, int &$num_generations, int $num_distinct_generations, int &$max_retained_generation, array<string|int, mixed> &$id_info) : mixed
Parameters
- $id : string
-
of a row to compare $word_id against
- $word_id : string
-
the word id of a term or phrase we are computing the quadruple array for
- $record : array<string|int, mixed>
-
current record from dictionary that we may or may not add to info
- $info : array<string|int, mixed>
-
quadruple array we are adding to
- $total_count : int
-
count of items in $info
- $previous_generation : int
-
last generation added to $info
- $previous_id : int
-
last exact if added to $info
- $num_generations : int
- $num_distinct_generations : int
- $max_retained_generation : int
- $id_info : array<string|int, mixed>
Return values
mixed —addShardDictionary()
Adds the words in the provided IndexShard to the dictionary.
public
addShardDictionary(object $index_shard[, object $callback = null ]) : mixed
Merges tiers as needed.
Parameters
- $index_shard : object
-
the shard to add the word to the dictionary with
- $callback : object = null
-
object with join function to be called if process is taking too long
Return values
mixed —calculateActiveTiers()
Based on the current set of tiers in the 0 prrefix sub-folder determine an array of active dictionary tiers.
public
calculateActiveTiers() : array<string|int, mixed>
Return values
array<string|int, mixed> —active dictionary tiers which may be added to by and ongoing crawl
combineDictionaryRecord()
Used to combine the dictionary records for a given word_id between that come from two different tier files
public
combineDictionaryRecord(string $record_a, string $record_b, int $prefix_bit) : string
Parameters
- $record_a : string
-
a dictionary record including auxiliary records from the 'a'th file of the give tier
- $record_b : string
-
a dictionary record including auxiliary records from the 'b'th file of the give tier
- $prefix_bit : int
-
either 0 or 32768. The first bit of an auxiliary record should be negation of higher order bit of the given prefix letter used by the tier file.
Return values
string —a single record with merged strings making use of auxliary records as needed containing (generation, posting list offset, length) information.
decodeAuxRecord()
Used to decode an auxiliary dictionary record associated with a given word_id
public
decodeAuxRecord(string $record_string, string $offset) : array<string|int, mixed>
Parameters
- $record_string : string
-
string in which dictionary records occur
- $offset : string
-
a byte offset into $record_string
Return values
array<string|int, mixed> —of up to three strings
extractPrefixRecord()
Returns the $record_num'th prefix record from $prefix_string
public
extractPrefixRecord(string $prefix_string, int $record_num) : array<string|int, mixed>
Parameters
- $prefix_string : string
-
string to get record from
- $record_num : int
-
which record to extract
Return values
array<string|int, mixed> —$offset, $count array
formatWordInfo()
Auxiliary methods that takes the input triple ($total_count, $max_retained_generation, $info) and filters blank entries from $info and returns the resulting triple
public
formatWordInfo(int $total_count, int $max_retained_generation, array<string|int, mixed> $info) : array<string|int, mixed>
Parameters
- $total_count : int
- $max_retained_generation : int
- $info : array<string|int, mixed>
Return values
array<string|int, mixed> —resulting triple
getDictSubstring()
Gets from disk $len many bytes beginning at $offset from the $file_num prefix file in the index dictionary
public
getDictSubstring(int $file_num, int $offset, int $len) : string
Parameters
- $file_num : int
-
which prefix file to read from (always reads a file at the max_tier level)
- $offset : int
-
byte offset to start reading from
- $len : int
-
number of bytes to read
Return values
string —data from that location in the shard
getWordInfo()
For each index shard generation a word occurred in, return as part of array, an array entry of the form generation, first offset, last offset, and number of documents the word occurred in for this shard. The first offset (similarly, the last offset) is the byte offset into the word_docs string of the first (last) record involving that word.
public
getWordInfo(string $word_id[, bool $raw = false ][, int $threshold = -1 ][, int $start_generation = -1 ][, int $num_distinct_generations = -1 ][, bool $with_remaining_total = false ]) : mixed
Parameters
- $word_id : string
-
id of the word or phrase one wants to look up
- $raw : bool = false
-
whether the id is our version of base64 encoded or not
- $threshold : int = -1
-
if greater than zero how many posting list results in dictionary info returned before stopping looking for more matches
- $start_generation : int = -1
-
which index shard in inverted index to start search from
- $num_distinct_generations : int = -1
-
how many shard to consider after $start_generation
- $with_remaining_total : bool = false
Return values
mixed —an array of entries of the form generation, first offset, last offset, count, matched_key If also have with remaining true, then get a pair, with second element as above and first element the estimated total number of of docs
getWordInfoTier()
This method facilitates query processing of an ongoing crawl.
public
getWordInfoTier(string $word_id, bool $raw, int $tier[, int $threshold = -1 ][, int $start_generation = -1 ][, int $num_distinct_generations = -1 ]) : mixed
During an ongoing crawl, the dictionary is arranged into tiers as per the logarithmic merge algorithm rather than just one tier as in a crawl that has been stopped. Word info for more recently crawled pages will tend to be in lower tiers than data that was crawled earlier. getWordInfoTier gets word info data for a specific tier in the index dictionary. Each tier will have word info for a specific, disjoint set of shards, so the format of how to look up posting lists in a shard can be the same regardless of the tier: an array entry is of the form generation, first offset, last offset, and number of documents the word occurred in for this shard.
Parameters
- $word_id : string
-
id of the word one wants to look up
- $raw : bool
-
whether the id is our version of base64 encoded or not
- $tier : int
-
which tier to get word info from
- $threshold : int = -1
-
if greater than zero how many posting list results in dictionary info returned before stopping looking for more matches
- $start_generation : int = -1
-
if positive the first generation to return information about
- $num_distinct_generations : int = -1
-
if positive number of then determines the number of generations after the starting generation to return information about
Return values
mixed —a tuple (total_count, max_found_generation, an array of entries of the form generation, first offset, last offset, count, matched_key) or false if no data
makePrefixLetters()
Makes dictionary sub-directories for each of the 256 possible first hash characters that crawHash in raw mode code output.
public
static makePrefixLetters(string $dir_name) : mixed
Parameters
- $dir_name : string
-
base directory in which these sub-directories should be made
Return values
mixed —makePrefixRecord()
Makes a prefix record string out of an offset and count (packs and concatenates).
public
makePrefixRecord(int $offset, int $count) : string
Parameters
- $offset : int
-
byte offset into words for the prefix record
- $count : int
-
number of word with that prefix
Return values
string —the packed record
mergeAllTiers()
Merges for each tier and for each first letter subdirectory, the $tier pair of (A and B) files of dictionary words. If max_tier has not been reached but only one of the two tier files is present then that file is renamed with a name one tier higher. The output in all cases is stored in file ending with A or B one tier up. B is used if an A file is already present.
public
mergeAllTiers([object $callback = null ][, int $max_tier = -1 ][, bool $fast_merge_all = false ]) : mixed
Parameters
- $callback : object = null
-
object with join function to be called if process is taking too long
- $max_tier : int = -1
-
the maximum tier to merge to merge till -- if not set then $this->max_tier used. Otherwise, one would typically set to a value bigger than $this->max_tier
- $fast_merge_all : bool = false
-
if true then merge away B slots but don't merge everything to a top tier
Return values
mixed —mergeTier()
Merges for each first letter subdirectory, the $tier pair of files of dictionary words. The output is stored in $out_slot.
public
mergeTier(int $tier, string $out_slot) : mixed
Parameters
- $tier : int
-
tier level to perform the merge of files at
- $out_slot : string
-
either "A" or "B", the suffix but not extension of the file one tier up to create with the merged results.
Return values
mixed —mergeTierFiles()
For a fixed prefix directory merges the $tier pair of files of dictionary words. The output is stored in $out_slot.
public
mergeTierFiles(int $prefix, int $tier, string $out_slot) : mixed
Parameters
- $prefix : int
-
which prefix directory to perform the merge of files
- $tier : int
-
tier level to perform the merge of files at
- $out_slot : string
-
either "A" or "B", the suffix but not extension of the file one tier up to create with the merged results.
Return values
mixed —readBlockDictAtOffset()
Reads DICT_BLOCK_SIZE bytes from the prefix file $file_num beginning at byte offset $bytes
public
& readBlockDictAtOffset(int $file_num, int $bytes) : string
Parameters
- $file_num : int
-
which dictionary file (given by first letter prefix) to read from
- $bytes : int
-
byte offset to start reading from
Return values
string —&data fromIndexShard file