Yioop_V9.5_Source_Code

IndexShard extends PersistentStructure
in package

Application

implements CrawlConstants

Data structure used to store one generation's worth of the word-document index (inverted index). It consists of three main components: word entries, word_doc entries, and document entries.

Word entries are described in the documentation for the words field. Word-doc entries are described in the documentation for the word_docs field. Document entries are described in the documentation for the doc_infos field.

IndexShards also have two access modes: a $read_only_from_disk mode and a loaded-in-memory mode. Loaded-in-memory mode is mainly for writing new data to the shard. When in memory, data in the shard can be in one of two states: packed or unpacked. Roughly, when it is in a packed state it is ready to be serialized to disk; when it is in an unpacked state its methods for adding data can be used.

Serialized on disk, a shard has a header with document statistics, followed by a prefix index into the words component, followed by the words component itself, then the word-docs component, and finally the document component.

Interfaces, Classes, Traits and Enums

CrawlConstants: Shared constants and enums used by components that are involved in the crawling process

BLANK = "\xff\xff\xff\xff\xff\xff\xff\xff": Represents an empty prefix item
DEFAULT_SAVE_FREQUENCY = 50000: If not specified in the constructor, this will be the number of operations between saves
DESCRIPTION_WEIGHT = 2.0: BM25F weight factor for terms in description
DOC_ID_LEN = 24: Length of DOC ID.
DOC_KEY_LEN = 8: Length of a key in a DOC ID.
FLATTEN_FREQUENCY = 10000: Fraction of NUM_DOCS_PER_PARTITION document inserts before data from the words array is flattened to word_postings. (It will also be flattened during periodic index saves)
HALF_BLANK = "\xff\xff\xff\xff": Flag used to indicate that a word item should not be packed or unpacked
HEADER_LENGTH = 40: Header Length of an IndexShard (sum of its non-variable length fields)
LINK_FLAG = 0x800000: Used to keep track of whether a record in document infos is for a document or for a link
LINK_WEIGHT = 1.0: BM25F weight factor for terms in a link
MAX_AUX_DOC_KEYS = 200: Maximum number of auxiliary document keys
POSTING_LEN = 4: Length of one posting (a doc offset, occurrence pair) in a posting list
SHARD_BLOCK_POWER = 12: Shard block size is 1<< this power
SHARD_BLOCK_SIZE = 4096: Size in bytes of one block in IndexShard
STORE_FLAG = "\x80": Represents an empty prefix item
TITLE_WEIGHT = 4.0: BM25F weight factor for terms in title
WORD_DATA_LEN = 12: Length of the data portion of a word entry in bytes in the shard
WORD_KEY_LEN = 20: Length of a word entry's key in bytes
WORD_POSTING_COPY_LEN = 32000: Bytes of tmp string allowed during flattenings
$blocks : array<string|int, mixed>: A cached array of disk blocks for an index shard that has not been completely loaded into memory.
$blocks_words : array<string|int, mixed>: Stores $blocks contents in (32 bit) unsigned int
$doc_info_offset : int: Holds offset of the doc_infos strings
$doc_infos : string: Stores document ids and link-to-document ids together with summary offset information and the number of words in the doc/link. The format for a record is a 4 byte offset, followed by 3 bytes for the document length, followed by 1 byte containing the number of 8 byte doc key strings that make up the doc id (2 for a doc, 3 for a link), followed by the doc key strings themselves.
$docids_len : int: Length of $doc_infos as a string
$fh : resource: File handle for a shard if we are going to use it in read mode and not completely load it.
$file_len : int: Keeps track of the length of the shard as a file
$filename : string: Name of the file in which to store the PersistentStructure
$generation : int: This is supposed to hold the number of earlier shards, prior to the current shard.
$hash_name : string: Used to hold the computed 8 byte hash of the index shard filename
$last_flattened_words_count : mixed: Number of document inserts since the last time word data was flattened to the word_postings string.
$len_all_docs : int: Number of words stored in total in all documents in this shard
$len_all_link_docs : int: Number of words stored in total in all links in this shard
$num_docs : int: Number of documents (not links) stored in this shard
$num_docs_per_generation : int: This is supposed to hold the number of documents that a given shard can hold.
$num_docs_word : array<string|int, mixed>: Keeps track of the number of documents a word is in
$num_link_docs : int: Number of links (not documents) stored in this shard
$prefixes : array<string|int, mixed>: An array representing offsets into the words dictionary of the index of the first occurrence of a two byte prefix of a word_id.
$prefixes_len : int: Length of the prefix index into the dictionary of the shard
$read_only_from_disk : bool: Flag used to determine whether this shard is going to be largely kept on disk and be in read-only mode. Otherwise, the shard is assumed to be completely held in memory and to be read/writable.
$save_frequency : int: Number of operations between saves. If == -1, never save using checkSave
$unsaved_operations : int: Number of operations since the last save
$word_doc_offset : int: Holds offset of the word_docs strings
$word_docs : string: This string is non-empty when shard is loaded and in its packed state.
$word_docs_len : int: Length of $word_docs as a string
$word_docs_packed : bool: Keeps track of the packed/unpacked state of the word_docs list
$word_postings : string: Used to hold word_id, posting_len, posting triples as a memory efficient string
$words : array<string|int, mixed>: Stores the array of word entries for this shard. In the packed state, word entries consist of the word id, a generation number, an offset into the word_docs structure where the posting list for that word begins, and a length of this posting list. In the unpacked state, each entry is a string of all the posting items for that word. Periodically, data in this words array is flattened to the word_postings string, which is a more memory-efficient way of storing data in PHP.
$words_len : int: Stores length of the words array in the shard on disk. Only set if we're in $read_only_from_disk mode
__construct() : mixed: Makes an index shard with the given file name and generation offset
addDocumentWords() : bool: Add a new document to the index shard with the given summary offset.
appendIndexShard() : mixed: Adds the contents of the supplied $index_shard to the current index shard
binarySearchPostingOffsetDocOffset() : array<string|int, mixed>|bool: Computes (via binary search) a pair (posting_offset, next_doc_offset) such that next_doc_offset is the next document offset in the passed direction beyond doc_offset in the posting list bounded by indexes $start and $end (indices are bit shifts of offsets so are smaller numbers). If this cannot be found, returns false.
changeDocumentOffsets() : mixed: Changes the summary offsets associated with a set of doc_ids to new values. This is needed because the fetcher puts documents in a shard before sending them to a queue_server. It is on the queue_server, however, where documents are stored in the IndexArchiveBundle and summary offsets are obtained. Thus, the shard needs to be updated at that point. This function should be called when the shard is unpacked (we check and unpack to be on the safe side).
checkSave() : mixed: Add one to the unsaved_operations count. If this goes above the save_frequency then save the PersistentStructure to secondary storage
computeProximity() : int: Returns a proximity score for a single term based on its location in doc.
docOffsetFromPostingOffset() : int: Given an offset of a posting into the word_docs string, looks up the posting there and computes the doc_offset stored in it.
docStats() : mixed: Computes BM25F relevance and a score for the supplied item based on the supplied parameters.
gallopPostingOffsetDocOffset() : int: Performs a galloping search (double forward jump distance each failure step) forward in a posting list from position $current forward until either $end is reached or a posting with document index bigger than $doc_index is found
getDocIndexOfPostingAtOffset() : int: Returns the document index of the posting at offset $current in word_docs
getDocInfoSubstring() : string: From disk gets $len many bytes starting from $offset in the doc_infos strings
getPostingAtOffset() : string: Gets the posting closest to index $current in the word_docs string and modifies the passed-by-ref variables $posting_start and $posting_end so they are the indexes of the start and end of the posting
getPostingsSlice() : array<string|int, mixed>: Returns documents using the word_docs string (either as stored on disk or completely read in) of records starting at the given offset and using its link-list of records. Traversal of the list stops if an offset larger than $last_offset is seen or $len many doc's have been returned. Since $next_offset is passed by reference the value of $next_offset will point to the next record in the list (if it exists) after the function is called.
getPostingsSliceById() : array<string|int, mixed>: Returns $len many documents which contained the word corresponding to $word_id (only works for loaded shards)
getShardSubstring() : string: Gets from Disk Data $len many bytes beginning at $offset from the current IndexShard
getShardWord() : int: Reads 32 bit word as an unsigned int from the offset given in the shard
getWordDocsSubstring() : string: From disk gets $len many bytes starting from $offset in the word_docs strings
getWordDocsWord() : mixed: Reads 32 bit word as an unsigned int from the offset given in the word_docs string in the shard
getWordInfo() : array<string|int, mixed>: Returns the first offset, last offset, and number of documents the word occurred in for this shard. The first offset (similarly, the last offset) is the byte offset into the word_docs string of the first (last) record involving that word.
getWordInfoFromString() : array<string|int, mixed>: Converts $str into 3 ints for a first offset into word_docs, a last offset into word_docs, and a count of number of docs with that word.
getWordString() : mixed: Return word record (word key + posting lookup data) from the shard posting list
headerToShardFields() : mixed: Split a header string into a shards field variable
load() : IndexShard: Load an IndexShard from a file or string
makeItem() : array<string|int, mixed>: Return (docid, item) where item has document statistics (summary offset, relevance, doc rank, and score) for the document given by the supplied posting, based on the posting list's number of docs with the word and the number of occurrences of the word in the doc.
makeWords() : mixed: Callback function for load method. splits a word_key . word_info string into an entry in the passed shard $shard->words[word_key] = $word_info.
mergeWordPostingsToString() : mixed: Used to flatten the words associative array to a more memory efficient word_postings string.
nextPostingOffsetDocOffset() : array<string|int, mixed>: Finds the first posting offset between $start_offset and $end_offset of a posting that has a doc_offset bigger than or equal to $doc_offset. This is implemented using a galloping search (double the offset until it gets larger, then binary search).
numDocsOrLinks() : int: An upper bound on the number of docs or links represented by the start and ending integer offsets into a posting list.
outputPostingLists() : mixed: Used to convert the word_postings string into a word_docs string or if a file handle is provided write out the word_docs sequence of postings to the provided file handle.
packAuxiliaryDocumentKeys() : string: Used to pack a list of description scores and user ranks as a string of auxiliary keys for a document map entry in the shard.
packDoclenNum() : string: Used to store the length of a document as well as the number of key components in its doc_id as a packed int (4 byte string)
packValues() : string: Used to pack either an array of nonnegative ints each less than 65535 or an array of floats. Values are packed into a string of 2-byte shorts, one per entry.
packWords() : mixed: Posting lists are initially stored associated with a word as a key-value pair. The merge operation then merges these into the word_postings string. packWords separates words from postings.
postingsSliceAscending() : array<string|int, mixed>: Returns the $len postings items in ascending order from the posting list between $start_offset to $last_offset beginning at $next_offset and returns them as an array while updating the position of $next_offset
postingsSliceDescending() : array<string|int, mixed>: Returns the $len postings items in descending order from the posting list between $start_offset to $last_offset beginning at $next_offset and returns them as an array while updating the position of $next_offset
prepareWordsAndPrefixes() : mixed: Computes the prefix string index for the current words array.
readBlockShardAtOffset() : mixed: Reads SHARD_BLOCK_SIZE from the current IndexShard's file beginning at byte offset $bytes
readShardHeader() : bool: If not already loaded, reads in from disk the fixed-length field variables of this IndexShard ($this->words_len, etc.)
save() : string: Save the IndexShard to its filename
saveWithoutDictionary() : mixed: This method re-saves a saved shard without the prefixes and dictionary.
unpackAuxiliaryDocumentKeys() : array<string|int, mixed>: Used to unpack a list of description scores and user ranks from a document map entry in the shard. We assume these scores were packed using packAuxiliaryDocumentKeys().
unpackDoclenNum() : array<string|int, mixed>: Used to extract from a 32 bit unsigned int, a pair which represents the length of a document together with the number of keys in its doc_id
unpackValues() : array<string|int, mixed>: Used to unpack from a string an array of short nonnegative ints or 2-byte floats.
unpackWordDocs() : mixed: Takes the word docs string and splits it into posting lists which are assigned to particular words in the words dictionary array.
weightedCount() : array<string|int, mixed>: Used to sum over the occurrences in a position list counting with weight based on term location in the document

BLANK

Represents an empty prefix item


    public
        mixed
    BLANK
    = "\xff\xff\xff\xff\xff\xff\xff\xff"

DEFAULT_SAVE_FREQUENCY

If not specified in the constructor, this will be the number of operations between saves


    public
        int
    DEFAULT_SAVE_FREQUENCY
    = 50000

DESCRIPTION_WEIGHT

BM25F weight factor for terms in description


    public
        mixed
    DESCRIPTION_WEIGHT
    = 2.0

DOC_ID_LEN

Length of DOC ID.


    public
        mixed
    DOC_ID_LEN
    = 24

DOC_KEY_LEN

Length of a key in a DOC ID.


    public
        mixed
    DOC_KEY_LEN
    = 8

FLATTEN_FREQUENCY

Fraction of NUM_DOCS_PER_PARTITION document inserts before data from the words array is flattened to word_postings. (It will also be flattened during periodic index saves)


    public
        mixed
    FLATTEN_FREQUENCY
    = 10000

HALF_BLANK

Flag used to indicate that a word item should not be packed or unpacked


    public
        mixed
    HALF_BLANK
    = "\xff\xff\xff\xff"

HEADER_LENGTH

Header Length of an IndexShard (sum of its non-variable length fields)


    public
        mixed
    HEADER_LENGTH
    = 40

LINK_FLAG

Used to keep track of whether a record in document infos is for a document or for a link


    public
        mixed
    LINK_FLAG
    = 0x800000

LINK_WEIGHT

BM25F weight factor for terms in a link


    public
        mixed
    LINK_WEIGHT
    = 1.0

MAX_AUX_DOC_KEYS

Maximum number of auxiliary document keys


    public
        mixed
    MAX_AUX_DOC_KEYS
    = 200

POSTING_LEN

Length of one posting (a doc offset, occurrence pair) in a posting list


    public
        mixed
    POSTING_LEN
    = 4

SHARD_BLOCK_POWER

Shard block size is 1<< this power


    public
        mixed
    SHARD_BLOCK_POWER
    = 12

SHARD_BLOCK_SIZE

Size in bytes of one block in IndexShard


    public
        mixed
    SHARD_BLOCK_SIZE
    = 4096
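
The relationship between SHARD_BLOCK_POWER and SHARD_BLOCK_SIZE can be illustrated with a little block arithmetic. The following is a minimal sketch, not code from IndexShard: the helper names blockStart() and offsetInBlock() are hypothetical, and only the two constant values come from this page.

```php
<?php
// Constant values as documented above; the helpers below are hypothetical.
const SHARD_BLOCK_POWER = 12;
const SHARD_BLOCK_SIZE = 1 << SHARD_BLOCK_POWER; // 4096 bytes

/** Byte offset at which the block containing $offset begins. */
function blockStart(int $offset): int
{
    // Dropping the low 12 bits rounds down to a multiple of 4096.
    return ($offset >> SHARD_BLOCK_POWER) << SHARD_BLOCK_POWER;
}

/** Position of $offset relative to the start of its block. */
function offsetInBlock(int $offset): int
{
    // Because the block size is a power of two, a bit mask acts as modulo.
    return $offset & (SHARD_BLOCK_SIZE - 1);
}

echo blockStart(10000), " ", offsetInBlock(10000), "\n"; // prints "8192 1808"
```

Keeping the size a power of two is what lets shifts and masks stand in for division and modulo here.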

STORE_FLAG

Represents an empty prefix item


    public
        mixed
    STORE_FLAG
    = "\x80"

TITLE_WEIGHT

BM25F weight factor for terms in title


    public
        mixed
    TITLE_WEIGHT
    = 4.0

WORD_DATA_LEN

Length of the data portion of a word entry in bytes in the shard


    public
        mixed
    WORD_DATA_LEN
    = 12

WORD_KEY_LEN

Length of a word entry's key in bytes


    public
        mixed
    WORD_KEY_LEN
    = 20

WORD_POSTING_COPY_LEN

Bytes of tmp string allowed during flattenings


    public
        mixed
    WORD_POSTING_COPY_LEN
    = 32000

$blocks

A cached array of disk blocks for an index shard that has not been completely loaded into memory.


    public
        array<string|int, mixed>
    $blocks

$blocks_words

Stores $blocks contents in (32 bit) unsigned int


    public
        array<string|int, mixed>
    $blocks_words

$doc_info_offset

Holds offset of the doc_infos strings


    public
        int
    $doc_info_offset

$doc_infos

Stores document ids and link-to-document ids together with summary offset information and the number of words in the doc/link. The format for a record is a 4 byte offset, followed by 3 bytes for the document length, followed by 1 byte containing the number of 8 byte doc key strings that make up the doc id (2 for a doc, 3 for a link), followed by the doc key strings themselves.


    public
        string
    $doc_infos

In the case of a document, the first doc key string has a hash of the url, the second a hash of a tag-stripped version of the document. In the case of a link, the keys are a unique identifier for the link context, followed by 8 bytes for the hash of the url being pointed to by the link, followed by 8 bytes for the hash of "info:url_pointed_to_by_link".
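
The record layout described above (4 byte offset, 3 byte length, 1 byte key count, then the keys) can be sketched as follows. This is an illustration only: the function names are hypothetical, and big-endian byte order with the length in the high three bytes of the second word is an assumption, not something this page states.

```php
<?php
const DOC_KEY_LEN = 8; // value documented above

/** Hypothetical helper: builds one doc_infos-style record. */
function packDocInfoRecord(int $summary_offset, int $doc_len,
    string $doc_keys): string
{
    $num_keys = intdiv(strlen($doc_keys), DOC_KEY_LEN);
    // Assumed layout: 3 byte length and 1 byte key count share one 32 bit int.
    $len_num = (($doc_len & 0xFFFFFF) << 8) | ($num_keys & 0xFF);
    return pack("N", $summary_offset) . pack("N", $len_num) . $doc_keys;
}

/** Hypothetical helper: inverse of packDocInfoRecord(). */
function unpackDocInfoRecord(string $record): array
{
    $summary_offset = unpack("N", substr($record, 0, 4))[1];
    $len_num = unpack("N", substr($record, 4, 4))[1];
    return [
        "summary_offset" => $summary_offset,
        "doc_len" => $len_num >> 8,
        "num_keys" => $len_num & 0xFF,
        "doc_keys" => substr($record, 8),
    ];
}

// A document has two 8 byte keys, so its record is 4 + 4 + 16 = 24 bytes.
$record = packDocInfoRecord(1234, 500, str_repeat("a", 8) . str_repeat("b", 8));
echo strlen($record), "\n";
```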

$docids_len

Length of $doc_infos as a string


    public
        int
    $docids_len

$fh

File handle for a shard if we are going to use it in read mode and not completely load it.


    public
        resource
    $fh

$file_len

Keeps track of the length of the shard as a file


    public
        int
    $file_len

$filename

Name of the file in which to store the PersistentStructure


    public
        string
    $filename

$generation

This is supposed to hold the number of earlier shards, prior to the current shard.


    public
        int
    $generation

$hash_name

Used to hold the computed 8 byte hash of the index shard filename


    public
        string
    $hash_name

$last_flattened_words_count

Number of document inserts since the last time word data was flattened to the word_postings string.


    public
        mixed
    $last_flattened_words_count

$len_all_docs

Number of words stored in total in all documents in this shard


    public
        int
    $len_all_docs

$len_all_link_docs

Number of words stored in total in all links in this shard


    public
        int
    $len_all_link_docs

$num_docs

Number of documents (not links) stored in this shard


    public
        int
    $num_docs

$num_docs_per_generation

This is supposed to hold the number of documents that a given shard can hold.


    public
        int
    $num_docs_per_generation

$num_docs_word

Keeps track of the number of documents a word is in


    public
        array<string|int, mixed>
    $num_docs_word

$num_link_docs

Number of links (not documents) stored in this shard


    public
        int
    $num_link_docs

$prefixes

An array representing offsets into the words dictionary of the index of the first occurrence of a two byte prefix of a word_id.


    public
        array<string|int, mixed>
    $prefixes
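
As a rough illustration of a two-byte prefix index, the sketch below maps each prefix to the position of the first word id that starts with it. The function names, the use of array positions instead of byte offsets, and the big-endian combination of the two bytes are all assumptions made for the example.

```php
<?php
/** Hypothetical helper: numeric value of the first two bytes of a word id. */
function prefixIndex(string $word_id): int
{
    return (ord($word_id[0]) << 8) + ord($word_id[1]);
}

/**
 * Hypothetical helper: for word ids already sorted as byte strings, records
 * the position of the first id seen for each two-byte prefix.
 */
function buildPrefixTable(array $sorted_word_ids): array
{
    $prefixes = [];
    foreach ($sorted_word_ids as $position => $word_id) {
        $prefix = prefixIndex($word_id);
        if (!isset($prefixes[$prefix])) {
            $prefixes[$prefix] = $position;
        }
    }
    return $prefixes;
}

$table = buildPrefixTable(["\x00\x01a", "\x00\x01b", "\x01\x00c"]);
print_r($table);
```

With such a table, a lookup for a word id only has to search the dictionary entries that share its first two bytes.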

$prefixes_len

Length of the prefix index into the dictionary of the shard


    public
        int
    $prefixes_len

$read_only_from_disk

Flag used to determine whether this shard is going to be largely kept on disk and be in read-only mode. Otherwise, the shard is assumed to be completely held in memory and to be read/writable.


    public
        bool
    $read_only_from_disk

$save_frequency

Number of operations between saves. If == -1, never save using checkSave


    public
        int
    $save_frequency

$unsaved_operations

Number of operations since the last save


    public
        int
    $unsaved_operations

$word_doc_offset

Holds offset of the word_docs strings


    public
        int
    $word_doc_offset

$word_docs

This string is non-empty when shard is loaded and in its packed state.


    public
        string
    $word_docs

It consists of a sequence of posting records. Each posting consists of an offset into the document entries structure for a document containing the word this is the posting for, as well as the number of occurrences of that word in that document.

$word_docs_len

Length of $word_docs as a string


    public
        int
    $word_docs_len

$word_docs_packed

Keeps track of the packed/unpacked state of the word_docs list


    public
        bool
    $word_docs_packed

$word_postings

Used to hold word_id, posting_len, posting triples as a memory efficient string


    public
        string
    $word_postings

$words

Stores the array of word entries for this shard. In the packed state, word entries consist of the word id, a generation number, an offset into the word_docs structure where the posting list for that word begins, and a length of this posting list. In the unpacked state, each entry is a string of all the posting items for that word. Periodically, data in this words array is flattened to the word_postings string, which is a more memory-efficient way of storing data in PHP.


    public
        array<string|int, mixed>
    $words

$words_len

Stores length of the words array in the shard on disk. Only set if we're in $read_only_from_disk mode


    public
        int
    $words_len

__construct()

Makes an index shard with the given file name and generation offset


    public
                    __construct(string $fname, int $generation[, int $num_docs_per_generation = C\NUM_DOCS_PER_PARTITION ][, bool $read_only_from_disk = false ]) : mixed

Parameters

$fname : string: filename to store the index shard with
$generation : int: when returning documents from the shard pretend there are this many earlier documents
$num_docs_per_generation : int = C\NUM_DOCS_PER_PARTITION: the number of documents that a given shard can hold.
$read_only_from_disk : bool = false: used to determine whether this shard is going to be largely kept on disk and be in read-only mode. Otherwise, the shard is assumed to be completely held in memory and to be read/writable.

Return values

mixed —

addDocumentWords()

Add a new document to the index shard with the given summary offset.


    public
                    addDocumentWords(string $doc_keys, int $summary_offset, array<string|int, mixed> $word_lists[, array<string|int, mixed> $meta_ids = [] ][, bool $is_doc = false ][, mixed $rank = false ][, array<string|int, mixed> $description_scores = [] ][, array<string|int, mixed> $user_ranks = [] ]) : bool

Associate with this document the supplied list of words and word counts. Finally, associate the given meta words with this document.

Parameters

$doc_keys : string: a string of concatenated keys for a document to insert. Each key is assumed to be a string of DOC_KEY_LEN many bytes. This whole set of keys is viewed as fixing one document.
$summary_offset : int: offset into the word archive at which the document's data is stored
$word_lists : array<string|int, mixed>: (word => array of word positions in doc)
$meta_ids : array<string|int, mixed> = []: meta words to be associated with the document; an example meta word would be filetype:pdf for a PDF document.
$is_doc : bool = false: flag used to indicate if what is being scored is a document or a link to a document
$rank : mixed = false: either false if not used, or a 4 bit estimate of the rank of this document item
$description_scores : array<string|int, mixed> = []
$user_ranks : array<string|int, mixed> = []

Return values

bool —

success or failure of performing the add
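
The $word_lists parameter has the shape (word => array of word positions in doc). A minimal sketch of building such an array from plain text is shown below. buildWordLists() is a hypothetical helper, the tokenization is deliberately naive, and counting positions from 0 is an assumption.

```php
<?php
/**
 * Hypothetical helper: maps each lowercased term to the list of positions
 * (counted from 0) at which it occurs in $text.
 */
function buildWordLists(string $text): array
{
    $word_lists = [];
    // Split on runs of non-word characters and drop empty pieces.
    $terms = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($terms as $position => $term) {
        $word_lists[$term][] = $position;
    }
    return $word_lists;
}

print_r(buildWordLists("The cat saw the other cat"));
```

The number of entries in each position list is the occurrence count for that word, which is the quantity a posting records alongside the document offset.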

appendIndexShard()

Adds the contents of the supplied $index_shard to the current index shard


    public
                    appendIndexShard(object $index_shard) : mixed

Parameters

$index_shard : object: the shard to append to the current shard

Return values

mixed —

binarySearchPostingOffsetDocOffset()

Computes (via binary search) a pair (posting_offset, next_doc_offset) such that next_doc_offset is the next document offset in the passed direction beyond doc_offset in the posting list bounded by indexes $start and $end (indices are bit shifts of offsets so are smaller numbers). If this cannot be found, returns false.


    public
                    binarySearchPostingOffsetDocOffset(int $start, int $end, int $current, int $doc_index, int $direction) : array<string|int, mixed>|bool

Parameters

$start : int: lower index of posting list
$end : int: upper index of posting list
$current : int: current index in posting list
$doc_index : int: document index that the next doc offset should come after
$direction : int: either self::ASCENDING or self::DESCENDING

Return values

array<string|int, mixed>|bool —

either (posting_offset, next_doc_offset) or false
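
The general idea of the search can be sketched on a plain sorted array. This is a simplified illustration rather than the method's implementation: it works on an in-memory array of doc offsets instead of the packed word_docs string, handles only the ascending direction, and the function name is hypothetical.

```php
<?php
/**
 * Hypothetical helper: among $doc_offsets[$start..$end] (sorted ascending),
 * finds the first entry that is >= $target.
 * Returns [index, doc_offset], or false if there is no such entry.
 */
function firstPostingAtOrAfter(array $doc_offsets, int $start, int $end,
    int $target)
{
    if ($start > $end || $doc_offsets[$end] < $target) {
        return false;
    }
    $low = $start;
    $high = $end; // invariant: $doc_offsets[$high] >= $target
    while ($low < $high) {
        $mid = intdiv($low + $high, 2);
        if ($doc_offsets[$mid] < $target) {
            $low = $mid + 1;
        } else {
            $high = $mid;
        }
    }
    return [$low, $doc_offsets[$low]];
}

var_dump(firstPostingAtOrAfter([3, 8, 15, 15, 42], 0, 4, 9));
```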

changeDocumentOffsets()

Changes the summary offsets associated with a set of doc_ids to new values. This is needed because the fetcher puts documents in a shard before sending them to a queue_server. It is on the queue_server, however, where documents are stored in the IndexArchiveBundle and summary offsets are obtained. Thus, the shard needs to be updated at that point. This function should be called when the shard is unpacked (we check and unpack to be on the safe side).


    public
                    changeDocumentOffsets(array<string|int, mixed> $docid_offsets) : mixed

Parameters

$docid_offsets : array<string|int, mixed>: a set of doc_id associated with a new_doc_offset.

Return values

mixed —

checkSave()

Add one to the unsaved_operations count. If this goes above the save_frequency then save the PersistentStructure to secondary storage


    public
                    checkSave() : mixed

Return values

mixed —
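
The counting behaviour can be sketched with a small stand-in class. CountingStore is hypothetical and only mimics the documented fields; the exact comparison used (>= here) and the reset of the counter inside save() are assumptions.

```php
<?php
/** Hypothetical stand-in for a persistent structure that saves periodically. */
class CountingStore
{
    public $save_frequency;
    public $unsaved_operations = 0;
    public $saves = 0; // counts calls to save(), for demonstration only

    public function __construct(int $save_frequency)
    {
        $this->save_frequency = $save_frequency;
    }

    public function save(): void
    {
        // A real structure would serialize itself to its file here.
        $this->saves++;
        $this->unsaved_operations = 0;
    }

    public function checkSave(): void
    {
        $this->unsaved_operations++;
        // A save_frequency of -1 disables saving through checkSave.
        if ($this->save_frequency !== -1 &&
            $this->unsaved_operations >= $this->save_frequency) {
            $this->save();
        }
    }
}

$store = new CountingStore(3);
for ($i = 0; $i < 7; $i++) {
    $store->checkSave();
}
echo $store->saves, " saves, ", $store->unsaved_operations, " pending\n";
```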

computeProximity()

Returns a proximity score for a single term based on its location in doc.


    public
                    computeProximity(array<string|int, mixed> $position_list, bool $is_doc) : int

Parameters

$position_list : array<string|int, mixed>: locations of term within item
$is_doc : bool: whether the item is a document or not

Return values

int —

a score for proximity

docOffsetFromPostingOffset()

Given an offset of a posting into the word_docs string, looks up the posting there and computes the doc_offset stored in it.


    public
                    docOffsetFromPostingOffset(int $offset) : int

Parameters

$offset : int: byte/char offset into the word_docs string

Return values

int —

a document byte/char offset into the doc_infos string

docStats()

Computes BM25F relevance and a score for the supplied item based on the supplied parameters.


    public
            static        docStats(array<string|int, mixed> &$item, int $occurrences, int $doc_len, int $num_doc_or_links, float $average_doc_len, int $num_docs, int $total_docs_or_links, float $type_weight) : mixed

Parameters

$item : array<string|int, mixed>

doc summary to compute a relevance and score for. Pass-by-ref so self::RELEVANCE and self::SCORE fields can be changed

$occurrences : int

number of occurrences of the term in the item

$doc_len : int

number of words in doc item represents

$num_doc_or_links : int

number of links or docs containing the term

$average_doc_len : float

average length of items in corpus

$num_docs : int

either number of links or number of docs depending if item represents a link or a doc.

$total_docs_or_links : int

number of docs or links in corpus

$type_weight : float

BM25F weight for this component (doc or link) of score

Return values

mixed —
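
For orientation, a textbook BM25 term score with a per-component weight is sketched below. This is not the formula docStats() uses: the function name, the k1 and b defaults, and the exact form of the IDF term are assumptions taken from the standard BM25 definition, with $type_weight standing in for constants such as TITLE_WEIGHT or LINK_WEIGHT.

```php
<?php
/**
 * Hypothetical helper: standard BM25 score of one term in one item,
 * scaled by a component weight.
 */
function bm25TermScore(int $occurrences, int $doc_len, float $average_doc_len,
    int $num_docs_with_term, int $total_docs, float $type_weight,
    float $k1 = 1.2, float $b = 0.75): float
{
    // Rare terms get a larger inverse document frequency.
    $idf = log(1 + ($total_docs - $num_docs_with_term + 0.5) /
        ($num_docs_with_term + 0.5));
    // Longer than average items are penalized, shorter ones rewarded.
    $length_norm = 1 - $b + $b * $doc_len / $average_doc_len;
    // Term frequency saturates as occurrences grow.
    $tf = $occurrences * ($k1 + 1) / ($occurrences + $k1 * $length_norm);
    return $type_weight * $idf * $tf;
}

echo bm25TermScore(2, 100, 100.0, 10, 1000, 1.0), "\n";
```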

gallopPostingOffsetDocOffset()

Performs a galloping search (double forward jump distance each failure step) forward in a posting list from position $current forward until either $end is reached or a posting with document index bigger than $doc_index is found


    public
                    gallopPostingOffsetDocOffset(int &$current, int $doc_index, int $end, int $direction) : int

Parameters

$current : int: current posting offset into posting list
$doc_index : int: document index that the result should be bigger than or equal to
$end : int: last index of posting list
$direction : int: which direction to iterate through elements of the posting slice (self::ASCENDING or self::DESCENDING) as compared to the order of when they were stored

Return values

int —

document index bigger than or equal to $doc_index. If found, $current points at the posting where this occurs; failure can be detected by checking whether $current > $end
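
A galloping search can be sketched on a plain sorted array as follows. This is a simplified stand-in, not the method itself: it searches an in-memory array rather than packed postings, only moves forward, finishes with a binary search inside the bracket it finds, and returns an index (or -1) instead of updating $current by reference. The function name is hypothetical.

```php
<?php
/**
 * Hypothetical helper: index of the first element of $sorted[$start..$end]
 * that is >= $target, or -1 if there is none.
 */
function gallopSearch(array $sorted, int $start, int $end, int $target): int
{
    if ($start > $end) {
        return -1;
    }
    $low = $start;
    $high = $start;
    $stride = 1;
    // Gallop: double the jump each time the probed value is still too small.
    while ($high <= $end && $sorted[$high] < $target) {
        $low = $high + 1;
        $high += $stride;
        $stride *= 2;
    }
    if ($high > $end) {
        $high = $end;
        if ($low > $end || $sorted[$high] < $target) {
            return -1;
        }
    }
    // Binary search inside [$low, $high]; $sorted[$high] >= $target holds.
    while ($low < $high) {
        $mid = intdiv($low + $high, 2);
        if ($sorted[$mid] < $target) {
            $low = $mid + 1;
        } else {
            $high = $mid;
        }
    }
    return $low;
}

echo gallopSearch([1, 4, 9, 16, 25, 36, 49, 64], 0, 7, 30), "\n"; // prints 5
```

Galloping pays off when the wanted posting is usually close to the current position, because the number of probes grows with the logarithm of the distance rather than of the whole list length.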

getDocIndexOfPostingAtOffset()

Returns the document index of the posting at offset $current in word_docs


    public
                    getDocIndexOfPostingAtOffset(int $current) : int

Parameters

$current : int: an offset into the posting lists (word_docs)

Return values

int —

the doc index of the pointed to posting

getDocInfoSubstring()

From disk gets $len many bytes starting from $offset in the doc_infos strings


    public
                    getDocInfoSubstring( $offset,  $len[, bool $cache = false ]) : string

Parameters

$offset :: byte offset to begin getting data out of disk-based doc_infos
$len :: number of bytes to get
$cache : bool = false: whether to cache disk blocks read from disk

Return values

string —

the desired substring of doc_infos

getPostingAtOffset()

Gets the posting closest to index $current in the word_docs string and modifies the passed-by-ref variables $posting_start and $posting_end so they are the indexes of the start and end of the posting


    public
                    getPostingAtOffset(int $current, int &$posting_start, int &$posting_end) : string

Parameters

$current : int: an index into the word_docs string; corresponds to a search start location of $current * self::POSTING_LEN
$posting_start : int: after function call will be index of start of nearest posting to current
$posting_end : int: after function call will be index of end of nearest posting to current

Return values

string —

the substring of word_docs corresponding to the posting

getPostingsSlice()

Returns documents using the word_docs string (either as stored on disk or completely read in) of records starting at the given offset and using its link-list of records. Traversal of the list stops if an offset larger than $last_offset is seen or $len many doc's have been returned. Since $next_offset is passed by reference the value of $next_offset will point to the next record in the list (if it exists) after the function is called.


    public
                    getPostingsSlice(int $start_offset, int &$next_offset, int $last_offset, int $len[, int $direction = self::ASCENDING ]) : array<string|int, mixed>

Parameters

$start_offset : int: starting offset of the current posting list for the query term; used in calculating BM25F.
$next_offset : int: where to start in word docs
$last_offset : int: offset at which to stop
$len : int: number of documents desired
$direction : int = self::ASCENDING: which direction to iterate through elements of the posting slice (self::ASCENDING or self::DESCENDING) as compared to the order of when they were stored

Return values

array<string|int, mixed> —

desired list of doc's and their info
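
The by-reference cursor pattern described here can be sketched with a plain array. takeSlice() is a hypothetical helper that handles only the ascending direction and uses array indexes in place of byte offsets; it is meant to show how $next_offset lets a caller resume where the previous call stopped.

```php
<?php
/**
 * Hypothetical helper: returns up to $len items starting at $next_offset and
 * not going past $last_offset, advancing $next_offset as it goes.
 */
function takeSlice(array $postings, int &$next_offset, int $last_offset,
    int $len): array
{
    $results = [];
    while ($next_offset <= $last_offset && count($results) < $len) {
        $results[] = $postings[$next_offset];
        $next_offset++;
    }
    return $results;
}

$postings = ["a", "b", "c", "d", "e"];
$next = 0;
// Each call continues from where the previous one left off.
while ($slice = takeSlice($postings, $next, 4, 2)) {
    echo implode(",", $slice), "\n";
}
```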

getPostingsSliceById()

Returns $len many documents which contained the word corresponding to $word_id (only works for loaded shards)


    public
                    getPostingsSliceById(string $word_id, int $len[, mixed $direction = self::ASCENDING ]) : array<string|int, mixed>

Parameters

$word_id : string: key to look up documents for
$len : int: number of documents
$direction : mixed = self::ASCENDING

Return values

array<string|int, mixed> —

desired list of docs and their info

getShardSubstring()

Reads from disk $len bytes of the current IndexShard, beginning at byte offset $offset


    public
                    getShardSubstring(int $offset, int $len[, bool $cache = true ]) : string

Parameters

$offset : int: byte offset to start reading from
$len : int: number of bytes to read
$cache : bool = true: whether to cache disk blocks read from disk

Return values

string —

data from that location in the shard
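The block-based reading with an optional cache that getShardSubstring() and readBlockShardAtOffset() describe can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names readBlock and readSubstring, the BLOCK_SIZE value of 4096, and the use of a php://memory stream in place of a shard file are all hypothetical and are not taken from the IndexShard source.

```php
<?php
// Stand-in for SHARD_BLOCK_SIZE; the real value is not given in this section.
const BLOCK_SIZE = 4096;

/**
 * Hypothetical helper: returns the block beginning at $block_start,
 * reading it from $fh unless it is already in the $blocks cache.
 */
function readBlock($fh, int $block_start, array &$blocks, bool $cache = true)
{
    if (isset($blocks[$block_start])) {
        return $blocks[$block_start];
    }
    if (fseek($fh, $block_start) !== 0) {
        return false;
    }
    $data = fread($fh, BLOCK_SIZE);
    if ($data === false || $data === "") {
        return false;
    }
    if ($cache) {
        $blocks[$block_start] = $data;
    }
    return $data;
}

/**
 * Hypothetical helper: assembles $len bytes starting at $offset by
 * stitching together the (possibly cached) blocks that cover the range.
 */
function readSubstring($fh, int $offset, int $len, array &$blocks): string
{
    $out = "";
    $pos = $offset;
    $end = $offset + $len;
    while ($pos < $end) {
        $block_start = $pos - ($pos % BLOCK_SIZE);
        $block = readBlock($fh, $block_start, $blocks);
        if ($block === false) {
            break; // ran past the end of the file
        }
        $chunk = substr($block, $pos - $block_start, $end - $pos);
        if ($chunk === "" || $chunk === false) {
            break; // nothing left in this (short, final) block
        }
        $out .= $chunk;
        $pos += strlen($chunk);
    }
    return $out;
}

// Usage: a 5000-byte in-memory "file" whose marker straddles two blocks.
$fh = fopen("php://memory", "w+");
fwrite($fh, str_repeat("a", 4094) . "WXYZ" . str_repeat("b", 902));
$blocks = [];
echo readSubstring($fh, 4094, 4, $blocks), "\n";
```

Keying the cache by block start offset means repeated reads of nearby offsets touch the file only once per block, which is the apparent purpose of the $cache flag.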

getShardWord()

Reads a 32-bit word as an unsigned int from the given offset in the shard


    public
                    getShardWord(int $offset) : int

Parameters

$offset : int: a byte offset into the shard

Return values

int —

desired word or false
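As a rough illustration of reading a 32-bit unsigned word at a byte offset, the sketch below works on an in-memory string rather than a shard file. The function name readUnsignedWord and the big-endian byte order are assumptions made for the example; this page does not state the byte order the shard uses.

```php
<?php
/**
 * Hypothetical helper: reads 4 bytes at $offset in $data as an unsigned
 * 32-bit int (big-endian assumed), or returns false if out of range.
 */
function readUnsignedWord(string $data, int $offset)
{
    if ($offset < 0 || $offset + 4 > strlen($data)) {
        return false;
    }
    $parts = unpack("N", substr($data, $offset, 4));
    return $parts[1];
}

// Two packed words: 7 at byte offset 0 and 65536 at byte offset 4.
$data = pack("N", 7) . pack("N", 65536);
var_dump(readUnsignedWord($data, 4)); // int(65536)
var_dump(readUnsignedWord($data, 6)); // bool(false): only 2 bytes remain
```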

getWordDocsSubstring()

From disk, gets $len bytes starting at $offset in the word_docs string


    public
                    getWordDocsSubstring( $offset,  $len[, bool $cache = true ]) : string

Parameters

$offset :: byte offset to begin getting data out of disk-based word_docs
$len :: number of bytes to get
$cache : bool = true: whether to cache disk blocks read from disk

Return values

string —

the desired substring of word_docs

getWordDocsWord()

Reads a 32-bit word as an unsigned int from the given offset in the word_docs string in the shard


    public
                    getWordDocsWord(int $offset) : mixed

Parameters

$offset : int: a byte offset into the word_docs string

Return values

mixed —

getWordInfo()

Returns the first offset, last offset, and number of documents the word occurred in for this shard. The first offset (similarly, the last offset) is the byte offset into the word_docs string of the first (last) record involving that word.


    public
                    getWordInfo(string $word_id[, bool $raw = false ]) : array<string|int, mixed>

Parameters

$word_id : string: id of the word one wants to look up
$raw : bool = false: whether the id is our version of base64 encoded or not

Return values

array<string|int, mixed> —

first offset, last offset, count, exact matching id

getWordInfoFromString()

Converts $str into 3 ints for a first offset into word_docs, a last offset into word_docs, and a count of number of docs with that word.


    public
            static        getWordInfoFromString(string $str[, bool $include_generation = false ]) : array<string|int, mixed>

Parameters

$str : string
$include_generation : bool = false

Return values

array<string|int, mixed> —

array of these three or four ints

getWordString()

Returns a word record (word key + posting lookup data) from the shard posting list


    public
                    getWordString(bool $is_disk, int $start, int $location, int $word_item_len) : mixed

Parameters

$is_disk : bool: whether the shard is on disk or in memory
$start : int: offset to start of the dictionary
$location : int: index of record to extract from dictionary
$word_item_len : int: length of a word + data record

Return values

mixed —

headerToShardFields()

Splits a header string into a shard's field variables


    public
            static        headerToShardFields(string $header, object $shard) : mixed

Parameters

$header : string: a string with packed shard header data
$shard : object: IndexShard to put data into

Return values

mixed —

load()

Load an IndexShard from a file or string


    public
            static        load(string $fname[, string &$data = null ]) : IndexShard

Parameters

$fname : string: the name of the file to load the IndexShard from (and save it to)
$data : string = null: stringified shard data to load shard from. If null then the data is loaded from the $fname if possible

Return values

IndexShard —

the IndexShard loaded

makeItem()

Returns (docid, item), where item has document statistics (summary offset, relevance, doc rank, and score) for the document given by the supplied posting, based on the posting list's number of docs with the word and the number of occurrences of the word in the doc.


    public
                    makeItem(string $posting, int $num_doc_or_links[, int $direction = self::ASCENDING ]) : array<string|int, mixed>

Parameters

$posting : string: a posting entry from some word's posting list
$num_doc_or_links : int: number of documents or links doc appears in
$direction : int = self::ASCENDING: whether to compute DOC_RANK based on the assumption the iterator is traversing the index in an ascending or descending fashion

Return values

array<string|int, mixed> —

($doc_id, posting_stats_array) for posting

makeWords()

Callback function for the load method. Splits a word_key . word_info string into an entry in the passed shard: $shard->words[word_key] = $word_info.


    public
            static        makeWords(string &$value, int $key, object $shard) : mixed

Parameters

$value : string: the word_key . word_info string
$key : int: index in array - we don't use
$shard : object: IndexShard to add the entry to word table for

Return values

mixed —

mergeWordPostingsToString()

Used to flatten the words associative array to a more memory efficient word_postings string.


    public
                    mergeWordPostingsToString([bool $replace = false ]) : mixed

$this->words is an associative array with associations wordid => postinglistforid. This format is relatively wasteful of memory.

$this->word_postings is a string in the format wordid1len1postings1wordid2len2postings2 ... where the wordids are lexicographically ordered. This is more memory efficient, as the former relies on the more wasteful PHP implementation of associative arrays.

mergeWordPostingsToString converts the former format to the latter for each of the current wordids. $this->words is then set to []. Note that before this operation is done, $this->word_postings might have data from earlier times mergeWordPostingsToString was called, in which case the behavior is controlled by $replace.

Parameters

$replace : bool = false: whether to overwrite existing word_id postings (true) or to append (false)

Return values

mixed —
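The flattening described above can be sketched as follows. This is a simplified illustration, not the method's actual code: the function name flattenWords is hypothetical, the 4-byte big-endian length field is an assumption (the page does not say how len is encoded), and the sketch ignores the $replace handling of data already present in word_postings.

```php
<?php
/**
 * Hypothetical sketch: flattens [word_id => postings] into one string of
 * word_id . len . postings records, ordered lexicographically by word_id.
 */
function flattenWords(array $words): string
{
    ksort($words, SORT_STRING); // lexicographic order of word ids
    $out = "";
    foreach ($words as $word_id => $postings) {
        // assumed record layout: id, 4-byte length, then the raw postings
        $out .= $word_id . pack("N", strlen($postings)) . $postings;
    }
    return $out;
}

$words = ["bbbbbbbb" => "XY", "aaaaaaaa" => "Z"];
$word_postings = flattenWords($words);
$words = []; // the associative array can now be released
echo strlen($word_postings), "\n"; // 27 = (8 + 4 + 1) + (8 + 4 + 2)
```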

nextPostingOffsetDocOffset()

Finds the first posting offset between $start_offset and $end_offset of a posting that has a doc_offset bigger than or equal to $doc_offset. This is implemented using a galloping search (double the offset until a larger value is found, then binary search).


    public
                    nextPostingOffsetDocOffset(int $start_offset, int $end_offset, int $doc_offset[, int $direction = self::ASCENDING ]) : array<string|int, mixed>

Parameters

$start_offset : int: first posting to consider
$end_offset : int: last posting to consider before giving up
$doc_offset : int: document offset we want to be greater than or equal to (when ASCENDING) or less than or equal to (when DESCENDING)
$direction : int = self::ASCENDING: which direction to iterate through elements of the posting slice (self::ASCENDING or self::DESCENDING) as compared to the order of when they were stored

Return values

array<string|int, mixed> —

(int offset to next posting, doc_offset for this post)
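For readers unfamiliar with galloping search, here is a minimal sketch of the ascending case over a plain sorted array of doc offsets. It is an assumption-laden illustration: the function name gallopSearch is hypothetical, it works on array indexes rather than byte offsets into word_docs, and it does not cover the DESCENDING direction.

```php
<?php
/**
 * Hypothetical sketch of a galloping (exponential) search. Returns the
 * smallest index in [$start, $end] whose value is >= $target, or false
 * if there is none. $doc_offsets must be sorted in ascending order.
 */
function gallopSearch(array $doc_offsets, int $start, int $end, int $target)
{
    if ($start > $end) {
        return false;
    }
    // Gallop: double the step until we reach a value >= $target
    // or run past $end. Everything before $low is known to be too small.
    $step = 1;
    $low = $start;
    $high = $start;
    while ($high <= $end && $doc_offsets[$high] < $target) {
        $low = $high + 1;
        $high += $step;
        $step *= 2;
    }
    $high = min($high, $end);
    // Binary search in [$low, $high] for the first value >= $target.
    $found = false;
    while ($low <= $high) {
        $mid = intdiv($low + $high, 2);
        if ($doc_offsets[$mid] >= $target) {
            $found = $mid;
            $high = $mid - 1;
        } else {
            $low = $mid + 1;
        }
    }
    return $found;
}

$doc_offsets = [2, 5, 9, 14, 20, 31, 47];
var_dump(gallopSearch($doc_offsets, 0, 6, 14)); // int(3)
var_dump(gallopSearch($doc_offsets, 0, 6, 48)); // bool(false)
```

Galloping is attractive when the target is usually close to the starting position, since the doubling phase then ends after only a few probes.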

numDocsOrLinks()

An upper bound on the number of docs or links represented by the start and ending integer offsets into a posting list.


    public
            static        numDocsOrLinks(int $start_offset, int $last_offset[, float $avg_posting_len = 4 ]) : int

Parameters

$start_offset : int: starting location in posting list
$last_offset : int: ending location in posting list
$avg_posting_len : float = 4: number of bytes in an average posting

Return values

int —

number of docs or links

outputPostingLists()

Used to convert the word_postings string into a word_docs string or, if a file handle is provided, to write out the word_docs sequence of postings to that file handle.


    public
                    outputPostingLists([resource $fh = null ][, bool $with_logging = false ]) : mixed

Parameters

$fh : resource = null: a filehandle to write to
$with_logging : bool = false: whether to log progress

Return values

mixed —

packAuxiliaryDocumentKeys()

Used to pack a list of description scores and user ranks as a string of auxiliary keys for a document map entry in the shard.


    public
                    packAuxiliaryDocumentKeys([array<string|int, mixed> $description_scores = [] ][, array<string|int, mixed> $user_ranks = [] ]) : string

A document map entry consists of a four byte offset into a WebArchive, three more bytes for the document length, one byte for the number of 8 byte aux keys, followed by a 24 byte key usually derived from the url, host, etc., followed by the description score and user rank auxiliary keys.

Parameters

$description_scores : array<string|int, mixed> = []: pairs position in document => weight score that position got during summarization process.
$user_ranks : array<string|int, mixed> = []: float scores obtained from a user classifier/ranker defined using Manage Classifiers.

Return values

string —

a string padded to a length that is a multiple of 16, where packValues has been used to map each of the above arrays into a string (@see packValues)

packDoclenNum()

Used to store the length of a document as well as the number of key components in its doc_id as a packed int (4 byte string)


    public
            static        packDoclenNum(int $doc_len, int $num_keys) : string

Parameters

$doc_len : int: number of words in the document
$num_keys : int: number of keys that are used to make up its doc_id

Return values

string —

packed int string representing these two values

packValues()

Used to pack either an array of nonnegative ints, each less than 65535, or an array of floats. Packing is done into a string of 2-byte shorts, one per entry.


    public
                    packValues(array<string|int, mixed> $values[, string $type = "i" ]) : string

Parameters

$values : array<string|int, mixed>: nonnegative integers or floats to pack
$type : string = "i": if "i", the values being packed are assumed to be integers; otherwise, floats

Return values

string —

with packed values

packWords()

Posting lists are initially stored associated with a word as a key value pair. The merge operation then merges these into the word_postings string. packWords separates words from postings.


    public
                    packWords([resource $fh = null ][, bool $with_logging = false ]) : mixed

After being applied, words is a string consisting of triples (as concatenated strings) word_id, start_offset, end_offset. The offsets are integer offsets into the string $this->word_docs. Finally, if a file handle is given, it writes the word dictionary out to the file as a long string. This function assumes mergeWordPostingsToString has just been called.

Parameters

$fh : resource = null: a file handle to write the dictionary to, if desired
$with_logging : bool = false: whether to write progress log messages every 30 seconds

Return values

mixed —

postingsSliceAscending()

Returns $len posting items in ascending order from the posting list between $start_offset and $last_offset, beginning at $next_offset, as an array, and updates the position of $next_offset


    public
                    postingsSliceAscending(int $start_offset, int &$next_offset, int $last_offset, int $len) : array<string|int, mixed>

Parameters

$start_offset : int: byte offset beginning of given posting list
$next_offset : int: byte offset between $start_offset and $last_offset of a posting
$last_offset : int: byte offset ending of given posting list
$len : int: how many postings to return increasing from $next_offset

Return values

array<string|int, mixed> —

of posting items
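The by-reference cursor used by postingsSliceAscending() can be illustrated with a much simplified sketch. Everything here is assumed for the sake of the example: the function name sliceAscending is hypothetical, postings are treated as fixed 4-byte records, $last_offset is treated as the inclusive offset of the final posting, and the postings are returned as raw substrings rather than decoded items.

```php
<?php
const POSTING_LEN = 4; // assumed fixed posting size for this sketch

/**
 * Hypothetical sketch: returns up to $len postings from $word_docs in
 * ascending order, advancing the by-reference cursor $next_offset.
 */
function sliceAscending(string $word_docs, int $start_offset,
    int &$next_offset, int $last_offset, int $len): array
{
    $items = [];
    // never start before the beginning of this posting list
    $next_offset = max($next_offset, $start_offset);
    while (count($items) < $len && $next_offset <= $last_offset) {
        $items[] = substr($word_docs, $next_offset, POSTING_LEN);
        $next_offset += POSTING_LEN;
    }
    return $items;
}

$word_docs = "AAAABBBBCCCCDDDD";
$next_offset = 4;
print_r(sliceAscending($word_docs, 0, $next_offset, 12, 2)); // BBBB, CCCC
print_r(sliceAscending($word_docs, 0, $next_offset, 12, 2)); // DDDD
```

Because the cursor is updated in place, a caller can request successive slices without tracking the position itself, which matches how $next_offset is described above.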

postingsSliceDescending()

Returns $len posting items in descending order from the posting list between $start_offset and $last_offset, beginning at $next_offset, as an array, and updates the position of $next_offset


    public
                    postingsSliceDescending(int $start_offset, int &$next_offset, int $last_offset, int $len) : array<string|int, mixed>

Parameters

$start_offset : int: byte offset beginning of given posting list
$next_offset : int: byte offset between $start_offset and $last_offset of a posting
$last_offset : int: byte offset ending of given posting list
$len : int: how many postings to return decreasing from $next_offset

Return values

array<string|int, mixed> —

of posting items

prepareWordsAndPrefixes()

Computes the prefix string index for the current words array.


    public
                    prepareWordsAndPrefixes([bool $with_logging = false ]) : mixed

This index gives offsets of the first occurrences of the leading two characters of a word_id in the words array. This method assumes that the word data is already in $this->word_postings

Parameters

$with_logging : bool = false: whether log messages should be written as the method progresses

Return values

mixed —
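The idea of a two-character prefix index can be sketched as below. This is a simplified stand-in, not the shard's actual layout: the function name buildPrefixIndex is hypothetical, the words are assumed to be fixed-length records already sorted by word_id, and the result is a PHP array rather than the packed prefix string stored in the shard.

```php
<?php
/**
 * Hypothetical sketch: for a string of sorted fixed-length word records,
 * maps each two-byte prefix to the byte offset of its first record.
 */
function buildPrefixIndex(string $words, int $item_len): array
{
    $prefixes = [];
    $num_items = intdiv(strlen($words), $item_len);
    for ($i = 0; $i < $num_items; $i++) {
        $offset = $i * $item_len;
        $prefix = substr($words, $offset, 2);
        // only the first occurrence of each prefix is recorded
        if (!isset($prefixes[$prefix])) {
            $prefixes[$prefix] = $offset;
        }
    }
    return $prefixes;
}

// Four 4-byte records; two of them share the prefix "ab".
$words = "aaXX" . "abYY" . "abZZ" . "caWW";
print_r(buildPrefixIndex($words, 4));
```

With such an index, a lookup for a word_id only needs to search the records between its prefix's offset and the next prefix's offset instead of the whole dictionary.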

readBlockShardAtOffset()

Reads SHARD_BLOCK_SIZE bytes from the current IndexShard's file, beginning at byte offset $bytes


    public
                    readBlockShardAtOffset(int $bytes[, bool $cache = true ]) : mixed

Parameters

$bytes : int: byte offset to start reading from
$cache : bool = true: whether to cache disk blocks that have been read to RAM

Return values

mixed —

data from the IndexShard file if found, false otherwise

readShardHeader()

If not already loaded, reads in from disk the fixed-length field variables of this IndexShard ($this->words_len, etc.)


    public
                    readShardHeader([bool $force = false ]) : bool

Parameters

$force : bool = false: if true, the header is read from disk even if it has already been loaded

Return values

bool —

whether the header could be read in or not

save()

Save the IndexShard to its filename


    public
                    save([bool $to_string = false ][, bool $with_logging = false ]) : string

Parameters

$to_string : bool = false: whether output should be written to a string rather than the default file location
$with_logging : bool = false: whether log messages should be written as the shard save progresses

Return values

string —

serialized shard if output was to string else empty string

saveWithoutDictionary()

This method re-saves a saved shard without the prefixes and dictionary.


    public
                    saveWithoutDictionary([bool $with_logging = false ]) : mixed

It would typically be called after this information has been stored in an IndexDictionary object, so that the data is not stored redundantly

Parameters

$with_logging : bool = false: whether log messages should be written as the shard save progresses

Return values

mixed —

unpackAuxiliaryDocumentKeys()

Used to unpack a list of description scores and user ranks from a document map entry in the shard. We assume these scores were packed using packAuxiliaryDocumentKeys (@see packAuxiliaryDocumentKeys).


    public
                    unpackAuxiliaryDocumentKeys(string $packed_data, int $offset) : array<string|int, mixed>

Parameters

$packed_data : string: containing packed description scores and user ranks
$offset : int: where in the string to begin unpacking from

Return values

array<string|int, mixed> —

[$description_scores, $user_ranks]

unpackDoclenNum()

Used to extract from a 32 bit unsigned int, a pair which represents the length of a document together with the number of keys in its doc_id


    public
            static        unpackDoclenNum(int $doc_info) : array<string|int, mixed>

Parameters

$doc_info : int: integer to unpack

Return values

array<string|int, mixed> —

pair (number of words in the document, number of keys that are used to make up its doc_id)
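A round trip through packing and unpacking might look like the sketch below. The document map description on this page gives three bytes to the document length and one byte to the number of keys; which bits hold which value, and the big-endian byte order, are assumptions here, and the function names packLenAndKeys and unpackLenAndKeys are hypothetical stand-ins for packDoclenNum() and unpackDoclenNum().

```php
<?php
// Assumed layout (not confirmed by this page): document length in the
// upper 24 bits, number of keys in the low 8 bits, stored big-endian.

/** Hypothetical stand-in for packDoclenNum(). */
function packLenAndKeys(int $doc_len, int $num_keys): string
{
    $combined = (($doc_len & 0xFFFFFF) << 8) | ($num_keys & 0xFF);
    return pack("N", $combined);
}

/** Hypothetical stand-in for unpackDoclenNum(). */
function unpackLenAndKeys(int $doc_info): array
{
    return [($doc_info >> 8) & 0xFFFFFF, $doc_info & 0xFF];
}

$packed = packLenAndKeys(1234, 3);       // 4-byte string
$doc_info = unpack("N", $packed)[1];     // back to a 32-bit unsigned int
list($doc_len, $num_keys) = unpackLenAndKeys($doc_info);
echo "$doc_len words, $num_keys keys\n"; // 1234 words, 3 keys
```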

unpackValues()

Used to unpack from a string an array of short nonnegative ints or 2-byte floats.


    public
                    unpackValues(mixed $packed_data, mixed $offset[, string $type = 'i' ]) : array<string|int, mixed>

@see packValues

Parameters

$packed_data : mixed
$offset : mixed
$type : string = 'i': if "i", the values being unpacked are assumed to be integers; otherwise, floats

Return values

array<string|int, mixed> —

[unpacked values array, offset to where processed to in string]
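For the integer case, the 2-bytes-per-entry packing and the [values, new offset] return shape can be sketched as follows. The names packShorts and unpackShorts are hypothetical, the big-endian byte order is an assumption, and the explicit $count parameter is an invention of this sketch: the page does not say how unpackValues() learns how many entries to read, and the float encoding is not covered.

```php
<?php
/** Hypothetical sketch: packs nonnegative ints (< 65535) as 2-byte shorts. */
function packShorts(array $values): string
{
    return pack("n*", ...array_values($values));
}

/**
 * Hypothetical sketch: unpacks $count shorts starting at byte $offset and
 * returns [values, offset just past the bytes that were processed].
 */
function unpackShorts(string $packed_data, int $offset, int $count): array
{
    $values = array_values(unpack("n" . $count, $packed_data, $offset));
    return [$values, $offset + 2 * $count];
}

// Two bytes of unrelated data precede the packed values.
$data = "xx" . packShorts([1, 500, 65534]);
list($values, $next_offset) = unpackShorts($data, 2, 3);
print_r($values);         // 1, 500, 65534
echo $next_offset, "\n";  // 8
```

Returning the offset alongside the values lets a caller unpack several consecutive arrays from one string, which is how the description scores and user ranks appear to be laid out.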

unpackWordDocs()

Takes the word docs string and splits it into posting lists which are assigned to particular words in the words dictionary array.


    public
                    unpackWordDocs() : mixed

This method is memory expensive as it briefly has essentially two copies of what's in word_docs.

Return values

mixed —

weightedCount()

Used to sum over the occurrences in a position list counting with weight based on term location in the document


    public
                    weightedCount(array<string|int, mixed> $position_list, bool $is_doc, int $title_length[, array<string|int, mixed> $position_scores = [] ]) : array<string|int, mixed>

Parameters

$position_list : array<string|int, mixed>: positions of term in item
$is_doc : bool: whether the item is a document or a link
$title_length : int: position in position list at which point no longer in title of original doc
$position_scores : array<string|int, mixed> = []: pairs position => weight saying how much a word at a given position range is worth

Return values

array<string|int, mixed> —

associative array of document_part => weighted count of occurrences of term in that part
