Yioop_V9.5_Source_Code_Documentation

IndexShard extends PersistentStructure
in package
implements CrawlConstants

Data structure used to store one generation's worth of the word-document index (inverted index). It consists of three main components: word entries, word_doc entries, and document entries.

Word entries are described in the documentation for the words field. Word-doc entries are described in the documentation for the word_docs field. Document entries are described in the documentation for the doc_infos field.

IndexShards also have two access modes: a $read_only_from_disk mode and a loaded-in-memory mode. Loaded-in-memory mode is mainly for writing new data to the shard. When in memory, data in the shard can be in one of two states: packed or unpacked. Roughly, when it is in a packed state it is ready to be serialized to disk; when it is in an unpacked state, its methods for adding data can be used.

Serialized on disk, a shard has a header with document statistics, followed by a prefix index into the words component, followed by the words component itself, then the word-docs component, and finally the document component.
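
As a rough illustration of the on-disk layout described above, the sketch below computes where each component would begin, given the lengths of the variable-length components. The function shardSectionOffsets and its array keys are hypothetical and not part of IndexShard; only the ordering of components and the 40 byte HEADER_LENGTH are taken from this page.

```php
<?php
// Hypothetical helper for illustration only; not part of IndexShard.
const HEADER_LENGTH = 40; // value of IndexShard::HEADER_LENGTH on this page

/**
 * Given the lengths of the variable-length components, returns the byte
 * offset at which each component would start within the shard file.
 */
function shardSectionOffsets(int $prefixes_len, int $words_len,
    int $word_docs_len): array
{
    $prefixes_offset = HEADER_LENGTH;
    $words_offset = $prefixes_offset + $prefixes_len;
    $word_docs_offset = $words_offset + $words_len;
    $doc_infos_offset = $word_docs_offset + $word_docs_len;
    return [
        "prefixes" => $prefixes_offset,
        "words" => $words_offset,
        "word_docs" => $word_docs_offset,
        "doc_infos" => $doc_infos_offset,
    ];
}

$offsets = shardSectionOffsets(1024, 320, 4000);
```

The last two values play the role that the $word_doc_offset and $doc_info_offset properties appear to play in the class.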

Tags
author

Chris Pollett

Interfaces, Classes, Traits and Enums

CrawlConstants
Shared constants and enums used by components that are involved in the crawling process

Table of Contents

BLANK  = "\xff\xff\xff\xff\xff\xff\xff\xff"
Represents an empty prefix item
DEFAULT_SAVE_FREQUENCY  = 50000
If not specified in the constructor, this will be the number of operations between saves
DESCRIPTION_WEIGHT  = 2.0
BM25F weight factor for terms in description
DOC_ID_LEN  = 24
Length of DOC ID.
DOC_KEY_LEN  = 8
Length of a key in a DOC ID.
FLATTEN_FREQUENCY  = 10000
Fraction of NUM_DOCS_PER_PARTITION document inserts before data from the words array is flattened to word_postings. (It will also be flattened during periodic index saves)
HALF_BLANK  = "\xff\xff\xff\xff"
Flag used to indicate that a word item should not be packed or unpacked
HEADER_LENGTH  = 40
Header Length of an IndexShard (sum of its non-variable length fields)
LINK_FLAG  = 0x800000
Used to keep track of whether a record in document infos is for a document or for a link
LINK_WEIGHT  = 1.0
BM25F weight factor for terms in a link
MAX_AUX_DOC_KEYS  = 200
Maximum number of auxiliary document keys.
POSTING_LEN  = 4
Length of one posting (a doc offset and occurrence count pair) in a posting list
SHARD_BLOCK_POWER  = 12
Shard block size is 1 << this power
SHARD_BLOCK_SIZE  = 4096
Size in bytes of one block in IndexShard
STORE_FLAG  = "\x80"
Represents an empty prefix item
TITLE_WEIGHT  = 4.0
BM25F weight factor for terms in title
WORD_DATA_LEN  = 12
Length of the data portion of a word entry in bytes in the shard
WORD_KEY_LEN  = 20
Length of a word entry's key in bytes
WORD_POSTING_COPY_LEN  = 32000
Bytes of tmp string allowed during flattenings
$blocks  : array<string|int, mixed>
A cached array of disk blocks for an index shard that has not been completely loaded into memory.
$blocks_words  : array<string|int, mixed>
Stores the contents of $blocks as (32 bit) unsigned ints
$doc_info_offset  : int
Holds offset of the doc_infos strings
$doc_infos  : string
Stores document ids and links to document ids together with summary offset information and the number of words in the doc/link. The format for a record is a 4 byte offset, followed by 3 bytes for the document length, followed by 1 byte containing the number of 8 byte doc key strings that make up the doc id (2 for a doc, 3 for a link), followed by the doc key strings themselves.
$docids_len  : int
Length of $doc_infos as a string
$fh  : resource
File handle for a shard if we are going to use it in read mode and not completely load it.
$file_len  : int
Keeps track of the length of the shard as a file
$filename  : string
Name of the file in which to store the PersistentStructure
$generation  : int
This is supposed to hold the number of earlier shards, prior to the current shard.
$hash_name  : string
Used to hold the computed 8 byte hash of the index shard filename
$last_flattened_words_count  : mixed
Number of document inserts since the last time word data was flattened to the word_postings string.
$len_all_docs  : int
Number of words stored in total in all documents in this shard
$len_all_link_docs  : int
Number of words stored in total in all links in this shard
$num_docs  : int
Number of documents (not links) stored in this shard
$num_docs_per_generation  : int
This is supposed to hold the number of documents that a given shard can hold.
$num_docs_word  : array<string|int, mixed>
Keeps track of the number of documents a word is in
$num_link_docs  : int
Number of links (not documents) stored in this shard
$prefixes  : array<string|int, mixed>
An array representing offsets into the words dictionary of the index of the first occurrence of a two byte prefix of a word_id.
$prefixes_len  : int
Length of the prefix index into the dictionary of the shard
$read_only_from_disk  : bool
Flag used to determine whether this shard is going to be largely kept on disk and used in read only mode. Otherwise, the shard is assumed to be completely held in memory and to be read/writable.
$save_frequency  : int
Number of operations between saves. If this is -1, checkSave never saves.
$unsaved_operations  : int
Number of operations since the last save
$word_doc_offset  : int
Holds offset of the word_docs strings
$word_docs  : string
This string is non-empty when shard is loaded and in its packed state.
$word_docs_len  : int
Length of $word_docs as a string
$word_docs_packed  : bool
Keeps track of the packed/unpacked state of the word_docs list
$word_postings  : string
Used to hold word_id, posting_len, posting triples as a memory efficient string
$words  : array<string|int, mixed>
Stores the array of word entries for this shard. In the packed state, word entries consist of the word id, a generation number, an offset into the word_docs structure where the posting list for that word begins, and the length of this posting list. In the unpacked state, each entry is a string of all the posting items for that word. Periodically, data in this words array is flattened to the word_postings string, which is a more memory efficient way of storing data in PHP.
$words_len  : int
Stores length of the words array in the shard on disk. Only set if we're in $read_only_from_disk mode
__construct()  : mixed
Makes an index shard with the given file name and generation offset
addDocumentWords()  : bool
Add a new document to the index shard with the given summary offset.
appendIndexShard()  : mixed
Adds the contents of the supplied $index_shard to the current index shard
binarySearchPostingOffsetDocOffset()  : array<string|int, mixed>|bool
Computes (via binary search) a pair (posting_offset, next_doc_offset) such that next_doc_offset is the next document offset in the passed direction beyond doc_offset in the posting list bounded by indexes $start and $end (indices are bit shifts of offsets, so are smaller numbers). If this cannot be found, returns false.
changeDocumentOffsets()  : mixed
Changes the summary offsets associated with a set of doc_ids to new values. This is needed because the fetcher puts documents in a shard before sending them to a queue_server. It is on the queue_server, however, where documents are stored in the IndexArchiveBundle and summary offsets are obtained. Thus, the shard needs to be updated at that point. This function should be called when the shard is unpacked (we check and unpack to be on the safe side).
checkSave()  : mixed
Adds one to the unsaved_operations count. If this goes above the save_frequency, then saves the PersistentStructure to secondary storage.
computeProximity()  : int
Returns a proximity score for a single term based on its location in doc.
docOffsetFromPostingOffset()  : int
Given an offset of a posting into the word_docs string, looks up the posting there and computes the doc_offset stored in it.
docStats()  : mixed
Computes BM25F relevance and a score for the supplied item based on the supplied parameters.
gallopPostingOffsetDocOffset()  : int
Performs a galloping search (doubling the jump distance at each failed step) forward in a posting list from position $current until either $end is reached or a posting with document index bigger than $doc_index is found
getDocIndexOfPostingAtOffset()  : int
Returns the document index of the posting at offset $current in word_docs
getDocInfoSubstring()  : string
From disk gets $len many bytes starting from $offset in the doc_infos strings
getPostingAtOffset()  : string
Gets the posting closest to index $current in the word_docs string. Modifies the passed-by-ref variables $posting_start and $posting_end so they are the indexes of the start and end of the posting.
getPostingsSlice()  : array<string|int, mixed>
Returns documents using the word_docs string (either as stored on disk or completely read in) of records starting at the given offset and using its linked list of records. Traversal of the list stops if an offset larger than $last_offset is seen or $len many docs have been returned. Since $next_offset is passed by reference, the value of $next_offset will point to the next record in the list (if it exists) after the function is called.
getPostingsSliceById()  : array<string|int, mixed>
Returns $len many documents which contained the word corresponding to $word_id (only works for loaded shards)
getShardSubstring()  : string
Gets from Disk Data $len many bytes beginning at $offset from the current IndexShard
getShardWord()  : int
Reads 32 bit word as an unsigned int from the offset given in the shard
getWordDocsSubstring()  : string
From disk gets $len many bytes starting from $offset in the word_docs strings
getWordDocsWord()  : mixed
Reads 32 bit word as an unsigned int from the offset given in the word_docs string in the shard
getWordInfo()  : array<string|int, mixed>
Returns the first offset, last offset, and number of documents the word occurred in for this shard. The first offset (similarly, the last offset) is the byte offset into the word_docs string of the first (last) record involving that word.
getWordInfoFromString()  : array<string|int, mixed>
Converts $str into 3 ints for a first offset into word_docs, a last offset into word_docs, and a count of number of docs with that word.
getWordString()  : mixed
Returns a word record (word key + posting lookup data) from the shard posting list
headerToShardFields()  : mixed
Splits a header string into a shard's field variables
load()  : IndexShard
Load an IndexShard from a file or string
makeItem()  : array<string|int, mixed>
Returns (docid, item) where item has document statistics (summary offset, relevance, doc rank, and score) for the document given by the supplied posting, based on the posting list's number of docs with the word and the number of occurrences of the word in the doc.
makeWords()  : mixed
Callback function for the load method. Splits a word_key . word_info string into an entry in the passed shard: $shard->words[word_key] = $word_info.
mergeWordPostingsToString()  : mixed
Used to flatten the words associative array to a more memory efficient word_postings string.
nextPostingOffsetDocOffset()  : array<string|int, mixed>
Finds the first posting offset between $start_offset and $end_offset of a posting that has a doc_offset bigger than or equal to $doc_offset. This is implemented using a galloping search (double the offset until it gets too large, then binary search).
numDocsOrLinks()  : int
An upper bound on the number of docs or links represented by the start and ending integer offsets into a posting list.
outputPostingLists()  : mixed
Used to convert the word_postings string into a word_docs string or if a file handle is provided write out the word_docs sequence of postings to the provided file handle.
packAuxiliaryDocumentKeys()  : string
Used to pack a list of description scores and user ranks as a string of auxiliary keys for a document map entry in the shard.
packDoclenNum()  : string
Used to store the length of a document as well as the number of key components in its doc_id as a packed int (4 byte string)
packValues()  : string
Used to pack either an array of nonnegative ints, each less than 65535, or an array of floats. Packing is done into a string using 2 bytes (a short) per entry.
packWords()  : mixed
Posting lists are initially stored associated with a word as a key value pair. The merge operation then merges these into the word_postings string. packWords separates words from postings.
postingsSliceAscending()  : array<string|int, mixed>
Returns the $len postings items in ascending order from the posting list between $start_offset to $last_offset beginning at $next_offset and returns them as an array while updating the position of $next_offset
postingsSliceDescending()  : array<string|int, mixed>
Returns the $len postings items in descending order from the posting list between $start_offset to $last_offset beginning at $next_offset and returns them as an array while updating the position of $next_offset
prepareWordsAndPrefixes()  : mixed
Computes the prefix string index for the current words array.
readBlockShardAtOffset()  : mixed
Reads SHARD_BLOCK_SIZE from the current IndexShard's file beginning at byte offset $bytes
readShardHeader()  : bool
If not already loaded, reads in from disk the fixed-length field variables of this IndexShard ($this->words_len, etc.)
save()  : string
Save the IndexShard to its filename
saveWithoutDictionary()  : mixed
This method re-saves a saved shard without the prefixes and dictionary.
unpackAuxiliaryDocumentKeys()  : array<string|int, mixed>
Used to unpack a list of description scores and user ranks from a document map entry in the shard. We assume these scores were packed using packAuxiliaryDocumentKeys.
unpackDoclenNum()  : array<string|int, mixed>
Used to extract from a 32 bit unsigned int, a pair which represents the length of a document together with the number of keys in its doc_id
unpackValues()  : array<string|int, mixed>
Used to unpack from a string an array of short nonnegative ints or 2 byte floats.
unpackWordDocs()  : mixed
Takes the word docs string and splits it into posting lists which are assigned to particular words in the words dictionary array.
weightedCount()  : array<string|int, mixed>
Used to sum over the occurrences in a position list counting with weight based on term location in the document

Constants

BLANK

Represents an empty prefix item

public mixed BLANK = "\xff\xff\xff\xff\xff\xff\xff\xff"

DEFAULT_SAVE_FREQUENCY

If not specified in the constructor, this will be the number of operations between saves

public int DEFAULT_SAVE_FREQUENCY = 50000

DESCRIPTION_WEIGHT

BM25F weight factor for terms in description

public mixed DESCRIPTION_WEIGHT = 2.0

DOC_ID_LEN

Length of DOC ID.

public mixed DOC_ID_LEN = 24

DOC_KEY_LEN

Length of a key in a DOC ID.

public mixed DOC_KEY_LEN = 8

FLATTEN_FREQUENCY

Fraction of NUM_DOCS_PER_PARTITION document inserts before data from the words array is flattened to word_postings. (It will also be flattened during periodic index saves)

public mixed FLATTEN_FREQUENCY = 10000

HALF_BLANK

Flag used to indicate that a word item should not be packed or unpacked

public mixed HALF_BLANK = "\xff\xff\xff\xff"

HEADER_LENGTH

Header Length of an IndexShard (sum of its non-variable length fields)

public mixed HEADER_LENGTH = 40

LINK_FLAG

Used to keep track of whether a record in document infos is for a document or for a link

public mixed LINK_FLAG = 0x800000

LINK_WEIGHT

BM25F weight factor for terms in a link

public mixed LINK_WEIGHT = 1.0

MAX_AUX_DOC_KEYS

Maximum number of auxiliary document keys.

public mixed MAX_AUX_DOC_KEYS = 200

POSTING_LEN

Length of one posting (a doc offset and occurrence count pair) in a posting list

public mixed POSTING_LEN = 4

SHARD_BLOCK_POWER

Shard block size is 1 << this power

public mixed SHARD_BLOCK_POWER = 12

SHARD_BLOCK_SIZE

Size in bytes of one block in IndexShard

public mixed SHARD_BLOCK_SIZE = 4096

STORE_FLAG

Represents an empty prefix item

public mixed STORE_FLAG = "\x80"

TITLE_WEIGHT

BM25F weight factor for terms in title

public mixed TITLE_WEIGHT = 4.0

WORD_DATA_LEN

Length of the data portion of a word entry in bytes in the shard

public mixed WORD_DATA_LEN = 12

WORD_KEY_LEN

Length of a word entry's key in bytes

public mixed WORD_KEY_LEN = 20

WORD_POSTING_COPY_LEN

Bytes of tmp string allowed during flattenings

public mixed WORD_POSTING_COPY_LEN = 32000

Properties

$blocks

A cached array of disk blocks for an index shard that has not been completely loaded into memory.

public array<string|int, mixed> $blocks

$blocks_words

Stores the contents of $blocks as (32 bit) unsigned ints

public array<string|int, mixed> $blocks_words

$doc_info_offset

Holds offset of the doc_infos strings

public int $doc_info_offset

$doc_infos

Stores document ids and links to document ids together with summary offset information and the number of words in the doc/link. The format for a record is a 4 byte offset, followed by 3 bytes for the document length, followed by 1 byte containing the number of 8 byte doc key strings that make up the doc id (2 for a doc, 3 for a link), followed by the doc key strings themselves.

public string $doc_infos

In the case of a document, the first doc key string holds a hash of the url and the second a hash of a tag-stripped version of the document. In the case of a link, the keys are a unique identifier for the link context, followed by 8 bytes for the hash of the url being pointed to by the link, followed by 8 bytes for the hash of "info:url_pointed_to_by_link".
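
The record format described for $doc_infos can be sketched as follows. The helper names packDocInfoRecord and unpackDocInfoRecord are hypothetical, and big-endian byte order is an assumption made for the example; the field widths (4 byte offset, 3 byte length, 1 byte key count, 8 byte keys) are the ones stated above.

```php
<?php
// Hypothetical helpers for illustration; big-endian order is an assumption.
const DOC_KEY_LEN = 8;

/** Builds one record: offset, doc length + key count, then the keys. */
function packDocInfoRecord(int $summary_offset, int $doc_len,
    string $doc_keys): string
{
    $num_keys = intdiv(strlen($doc_keys), DOC_KEY_LEN);
    // 3 bytes of document length and 1 byte of key count share one 32 bit int
    $len_num = (($doc_len & 0xFFFFFF) << 8) | ($num_keys & 0xFF);
    return pack("N", $summary_offset) . pack("N", $len_num) . $doc_keys;
}

/** Splits a record built by packDocInfoRecord back into its fields. */
function unpackDocInfoRecord(string $record): array
{
    $fields = unpack("Noffset/Nlen_num", $record);
    $num_keys = $fields["len_num"] & 0xFF;
    return [
        "summary_offset" => $fields["offset"],
        "doc_len" => $fields["len_num"] >> 8,
        "num_keys" => $num_keys,
        "doc_keys" => substr($record, 8, $num_keys * DOC_KEY_LEN),
    ];
}

$keys = str_repeat("A", 8) . str_repeat("B", 8); // two keys, as for a document
$record = packDocInfoRecord(4096, 250, $keys);
$info = unpackDocInfoRecord($record);
```

With two keys the record is 4 + 4 + 16 = 24 bytes long.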

$docids_len

Length of $doc_infos as a string

public int $docids_len

$fh

File handle for a shard if we are going to use it in read mode and not completely load it.

public resource $fh

$file_len

Keeps track of the length of the shard as a file

public int $file_len

$filename

Name of the file in which to store the PersistentStructure

public string $filename

$generation

This is supposed to hold the number of earlier shards, prior to the current shard.

public int $generation

$hash_name

Used to hold the computed 8 byte hash of the index shard filename

public string $hash_name

$last_flattened_words_count

Number of document inserts since the last time word data was flattened to the word_postings string.

public mixed $last_flattened_words_count

$len_all_docs

Number of words stored in total in all documents in this shard

public int $len_all_docs

$len_all_link_docs

Number of words stored in total in all links in this shard

public int $len_all_link_docs

$num_docs

Number of documents (not links) stored in this shard

public int $num_docs

$num_docs_per_generation

This is supposed to hold the number of documents that a given shard can hold.

public int $num_docs_per_generation

$num_docs_word

Keeps track of the number of documents a word is in

public array<string|int, mixed> $num_docs_word

$num_link_docs

Number of links (not documents) stored in this shard

public int $num_link_docs

$prefixes

An array representing offsets into the words dictionary of the index of the first occurrence of a two byte prefix of a word_id.

public array<string|int, mixed> $prefixes
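
To illustrate the idea of a two byte prefix index, the sketch below maps each prefix to the position of its first word id in a sorted list. The function buildPrefixIndex is hypothetical, and the extra count stored with each prefix is an assumption added for the example; it is not a description of how $prefixes is actually laid out.

```php
<?php
// Hypothetical sketch: maps each two byte prefix to [first index, count]
// within a sorted list of word ids. Not the actual $prefixes layout.
function buildPrefixIndex(array $sorted_word_ids): array
{
    $prefixes = [];
    foreach ($sorted_word_ids as $index => $word_id) {
        $prefix = substr($word_id, 0, 2);
        if (!isset($prefixes[$prefix])) {
            // first time this prefix is seen: remember where it starts
            $prefixes[$prefix] = [$index, 0];
        }
        $prefixes[$prefix][1]++;
    }
    return $prefixes;
}

$word_ids = ["aab1", "aab7", "abz0", "zz42"];
sort($word_ids, SORT_STRING);
$prefixes = buildPrefixIndex($word_ids);
```

A lookup for a word id then only has to search the slice of the dictionary that shares its first two bytes.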

$prefixes_len

Length of the prefix index into the dictionary of the shard

public int $prefixes_len

$read_only_from_disk

Flag used to determine whether this shard is going to be largely kept on disk and used in read only mode. Otherwise, the shard is assumed to be completely held in memory and to be read/writable.

public bool $read_only_from_disk

$save_frequency

Number of operations between saves. If this is -1, checkSave never saves.

public int $save_frequency

$unsaved_operations

Number of operations since the last save

public int $unsaved_operations

$word_doc_offset

Holds offset of the word_docs strings

public int $word_doc_offset

$word_docs

This string is non-empty when shard is loaded and in its packed state.

public string $word_docs

It consists of a sequence of posting records. Each posting consists of an offset into the document entries structure for a document containing the word that the posting is for, as well as the number of occurrences of that word in that document.

$word_docs_len

Length of $word_docs as a string

public int $word_docs_len

$word_docs_packed

Keeps track of the packed/unpacked state of the word_docs list

public bool $word_docs_packed

$word_postings

Used to hold word_id, posting_len, posting triples as a memory efficient string

public string $word_postings

$words

Stores the array of word entries for this shard. In the packed state, word entries consist of the word id, a generation number, an offset into the word_docs structure where the posting list for that word begins, and the length of this posting list. In the unpacked state, each entry is a string of all the posting items for that word. Periodically, data in this words array is flattened to the word_postings string, which is a more memory efficient way of storing data in PHP.

public array<string|int, mixed> $words

$words_len

Stores length of the words array in the shard on disk. Only set if we're in $read_only_from_disk mode

public int $words_len

Methods

__construct()

Makes an index shard with the given file name and generation offset

public __construct(string $fname, int $generation[, int $num_docs_per_generation = C\NUM_DOCS_PER_PARTITION ][, bool $read_only_from_disk = false ]) : mixed
Parameters
$fname : string

filename to store the index shard with

$generation : int

when returning documents from the shard pretend there are this many earlier documents

$num_docs_per_generation : int = C\NUM_DOCS_PER_PARTITION

the number of documents that a given shard can hold.

$read_only_from_disk : bool = false

used to determine whether this shard is going to be largely kept on disk and used in read only mode. Otherwise, the shard is assumed to be completely held in memory and to be read/writable.

Return values
mixed

addDocumentWords()

Add a new document to the index shard with the given summary offset.

public addDocumentWords(string $doc_keys, int $summary_offset, array<string|int, mixed> $word_lists[, array<string|int, mixed> $meta_ids = [] ][, bool $is_doc = false ][, mixed $rank = false ][, array<string|int, mixed> $description_scores = [] ][, array<string|int, mixed> $user_ranks = [] ]) : bool

Associate with this document the supplied list of words and word counts. Finally, associate the given meta words with this document.

Parameters
$doc_keys : string

a string of concatenated keys for a document to insert. Each key is assumed to be a string of DOC_KEY_LEN many bytes. This whole set of keys is viewed as fixing one document.

$summary_offset : int

the document's offset into the word archive in which its data is stored

$word_lists : array<string|int, mixed>

(word => array of word positions in doc)

$meta_ids : array<string|int, mixed> = []

meta words to be associated with the document. An example meta word would be filetype:pdf for a PDF document.

$is_doc : bool = false

flag used to indicate if what is being scored is a document or a link to a document

$rank : mixed = false

either false if not used, or a 4 bit estimate of the rank of this document item

$description_scores : array<string|int, mixed> = []
$user_ranks : array<string|int, mixed> = []
Return values
bool

success or failure of performing the add
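
The shape of the $word_lists argument (word => array of word positions in doc) can be illustrated with a toy in-memory inverted index. This is only a sketch of the general idea; the function addDocumentToIndex is hypothetical and does not reflect how IndexShard encodes or stores postings.

```php
<?php
// Toy in-memory inverted index; not the IndexShard storage format.
function addDocumentToIndex(array &$index, int $doc_offset,
    array $word_lists): void
{
    foreach ($word_lists as $word => $positions) {
        // one posting per (word, document): document offset plus positions
        $index[$word][] = [$doc_offset, $positions];
    }
}

$index = [];
addDocumentToIndex($index, 0, ["cat" => [1, 5], "dog" => [2]]);
addDocumentToIndex($index, 24, ["cat" => [3]]);
```

After both calls the entry for "cat" holds two postings, one per document that contains the word.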

appendIndexShard()

Adds the contents of the supplied $index_shard to the current index shard

public appendIndexShard(object $index_shard) : mixed
Parameters
$index_shard : object

the shard to append to the current shard

Return values
mixed

binarySearchPostingOffsetDocOffset()

Computes (via binary search) a pair (posting_offset, next_doc_offset) such that next_doc_offset is the next document offset in the passed direction beyond doc_offset in the posting list bounded by indexes $start and $end (indices are bit shifts of offsets, so are smaller numbers). If this cannot be found, returns false.

public binarySearchPostingOffsetDocOffset(int $start, int $end, int $current, int $doc_index, int $direction) : array<string|int, mixed>|bool
Parameters
$start : int

lower index of posting list

$end : int

upper index of posting list

$current : int

current index in posting list

$doc_index : int

document index that the next doc offset should come after

$direction : int

either self::ASCENDING or self::DESCENDING

Return values
array<string|int, mixed>|bool

either (posting_offset, next_doc_offset) or false
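
The search itself can be sketched on a plain sorted array of document indexes. The function firstAtLeast is hypothetical; it handles only the ascending direction and works on array positions rather than on offsets into word_docs, so it is a simplification of what the method above is described as doing.

```php
<?php
// Sketch on a plain ascending array of document indexes.
// Returns [position, doc_index] of the first entry >= $doc_index within
// positions $start..$end, or false if there is none.
function firstAtLeast(array $doc_indexes, int $start, int $end,
    int $doc_index)
{
    $low = $start;
    $high = $end;
    $found = false;
    while ($low <= $high) {
        $mid = intdiv($low + $high, 2);
        if ($doc_indexes[$mid] >= $doc_index) {
            $found = [$mid, $doc_indexes[$mid]];
            $high = $mid - 1; // keep looking for an earlier match
        } else {
            $low = $mid + 1;
        }
    }
    return $found;
}

$postings = [3, 8, 15, 15, 42, 77];
$hit = firstAtLeast($postings, 0, 5, 10);
$miss = firstAtLeast($postings, 0, 5, 100);
```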

changeDocumentOffsets()

Changes the summary offsets associated with a set of doc_ids to new values. This is needed because the fetcher puts documents in a shard before sending them to a queue_server. It is on the queue_server, however, where documents are stored in the IndexArchiveBundle and summary offsets are obtained. Thus, the shard needs to be updated at that point. This function should be called when the shard is unpacked (we check and unpack to be on the safe side).

public changeDocumentOffsets(array<string|int, mixed> $docid_offsets) : mixed
Parameters
$docid_offsets : array<string|int, mixed>

a set of doc_id associated with a new_doc_offset.

Return values
mixed

checkSave()

Adds one to the unsaved_operations count. If this goes above the save_frequency, then saves the PersistentStructure to secondary storage.

public checkSave() : mixed
Return values
mixed
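
A minimal sketch of this counting behavior is given below, assuming the counter is reset after each save. The class CountingStructure is a hypothetical stand-in, not PersistentStructure itself, and its save() only counts calls instead of writing to disk.

```php
<?php
// Hypothetical stand-in for a PersistentStructure subclass.
class CountingStructure
{
    public int $save_frequency;
    public int $unsaved_operations = 0;
    public int $times_saved = 0;

    public function __construct(int $save_frequency = 50000)
    {
        $this->save_frequency = $save_frequency;
    }

    public function save(): void
    {
        $this->times_saved++; // a real structure would write to disk here
    }

    public function checkSave(): void
    {
        $this->unsaved_operations++;
        // a save frequency of -1 means checkSave never saves
        if ($this->save_frequency != -1 &&
            $this->unsaved_operations > $this->save_frequency) {
            $this->save();
            $this->unsaved_operations = 0; // assumed reset after saving
        }
    }
}

$structure = new CountingStructure(3);
for ($i = 0; $i < 4; $i++) {
    $structure->checkSave();
}
```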

computeProximity()

Returns a proximity score for a single term based on its location in doc.

public computeProximity(array<string|int, mixed> $position_list, bool $is_doc) : int
Parameters
$position_list : array<string|int, mixed>

locations of term within item

$is_doc : bool

whether the item is a document or not

Return values
int

a score for proximity

docOffsetFromPostingOffset()

Given an offset of a posting into the word_docs string, looks up the posting there and computes the doc_offset stored in it.

public docOffsetFromPostingOffset(int $offset) : int
Parameters
$offset : int

byte/char offset into the word_docs string

Return values
int

a document byte/char offset into the doc_infos string

docStats()

Computes BM25F relevance and a score for the supplied item based on the supplied parameters.

public static docStats(array<string|int, mixed> &$item, int $occurrences, int $doc_len, int $num_doc_or_links, float $average_doc_len, int $num_docs, int $total_docs_or_links, float $type_weight) : mixed
Parameters
$item : array<string|int, mixed>

doc summary to compute a relevance and score for. Pass-by-ref so self::RELEVANCE and self::SCORE fields can be changed

$occurrences : int

number of occurrences of the term in the item

$doc_len : int

number of words in doc item represents

$num_doc_or_links : int

number of links or docs containing the term

$average_doc_len : float

average length of items in corpus

$num_docs : int

either number of links or number of docs depending if item represents a link or a doc.

$total_docs_or_links : int

number of docs or links in corpus

$type_weight : float

BM25F weight for this component (doc or link) of score

Return values
mixed
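
For orientation, the sketch below computes a textbook BM25 term score and scales it by a field weight such as TITLE_WEIGHT or LINK_WEIGHT. It is not the formula used by docStats: the function bm25TermScore, the constants K1 and B (conventional defaults), and the exact form of the IDF term are all assumptions made for the example.

```php
<?php
// Textbook BM25 term score scaled by a field weight. K1 and B are
// conventional defaults, not values taken from Yioop.
const K1 = 1.2;
const B = 0.75;

function bm25TermScore(int $occurrences, int $doc_len,
    float $average_doc_len, int $num_with_term, int $total_items,
    float $type_weight): float
{
    // rarer terms get a larger inverse document frequency
    $idf = log(1 + ($total_items - $num_with_term + 0.5) /
        ($num_with_term + 0.5));
    // longer than average items are penalized
    $length_norm = 1 - B + B * $doc_len / $average_doc_len;
    $tf = $occurrences * (K1 + 1) / ($occurrences + K1 * $length_norm);
    return $type_weight * $idf * $tf;
}

$title = bm25TermScore(3, 100, 120.0, 50, 10000, 4.0);
$link = bm25TermScore(3, 100, 120.0, 50, 10000, 1.0);
```

Because the weight is a plain multiplier here, a term weighted 4.0 scores four times what the same term scores with weight 1.0.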

gallopPostingOffsetDocOffset()

Performs a galloping search (doubling the jump distance at each failed step) forward in a posting list from position $current until either $end is reached or a posting with document index bigger than $doc_index is found

public gallopPostingOffsetDocOffset(int &$current, int $doc_index, int $end, int $direction) : int
Parameters
$current : int

current posting offset into posting list

$doc_index : int

document index that the found posting should be bigger than or equal to

$end : int

last index of posting list

$direction : int

which direction to iterate through elements of the posting slice (self::ASCENDING or self::DESCENDING) as compared to the order of when they were stored

Return values
int

document index bigger than or equal to $doc_index. If found, $current points at the posting where this occurs; lack of success is indicated by $current > $end
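
A galloping search can be sketched on a plain ascending array of document indexes: jump ahead with doubling strides until the target is passed, then binary search inside the bracket that was skipped over. The function gallopFirstAtLeast is hypothetical; it returns an array position rather than updating $current by reference, and it ignores the direction parameter.

```php
<?php
// Sketch on a plain ascending array of document indexes.
// Returns the position of the first entry >= $doc_index within
// positions $start..$end, or false if there is none.
function gallopFirstAtLeast(array $doc_indexes, int $start, int $end,
    int $doc_index)
{
    $low = $start;
    $high = $start;
    $stride = 1;
    // gallop: double the stride until the target or the end is passed
    while ($high <= $end && $doc_indexes[$high] < $doc_index) {
        $low = $high + 1;
        $high += $stride;
        $stride *= 2;
    }
    if ($high > $end) {
        $high = $end;
    }
    // binary search within the bracket $low..$high
    $found = false;
    while ($low <= $high) {
        $mid = intdiv($low + $high, 2);
        if ($doc_indexes[$mid] >= $doc_index) {
            $found = $mid;
            $high = $mid - 1;
        } else {
            $low = $mid + 1;
        }
    }
    return $found;
}

$postings = [3, 8, 15, 15, 42, 77];
$position = gallopFirstAtLeast($postings, 0, 5, 40);
```

Galloping is attractive when the wanted posting is usually close to the starting position, since the number of probes grows with the logarithm of the distance travelled rather than of the whole list length.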

getDocIndexOfPostingAtOffset()

Returns the document index of the posting at offset $current in word_docs

public getDocIndexOfPostingAtOffset(int $current) : int
Parameters
$current : int

an offset into the posting lists (word_docs)

Return values
int

the doc index of the pointed to posting

getDocInfoSubstring()

From disk gets $len many bytes starting from $offset in the doc_infos strings

public getDocInfoSubstring($offset, $len[, bool $cache = false ]) : string
Parameters
$offset :

byte offset to begin getting data out of disk-based doc_infos

$len :

number of bytes to get

$cache : bool = false

whether to cache disk blocks read from disk

Return values
string

desired string

getPostingAtOffset()

Gets the posting closest to index $current in the word_docs string. Modifies the passed-by-ref variables $posting_start and $posting_end so they are the indexes of the start and end of the posting.

public getPostingAtOffset(int $current, int &$posting_start, int &$posting_end) : string
Parameters
$current : int

an index into the word_docs string; corresponds to a start search location of $current * self::POSTING_LEN

$posting_start : int

after function call will be index of start of nearest posting to current

$posting_end : int

after function call will be index of end of nearest posting to current

Return values
string

the substring of word_docs corresponding to the posting

getPostingsSlice()

Returns documents using the word_docs string (either as stored on disk or completely read in) of records starting at the given offset and using its linked list of records. Traversal of the list stops if an offset larger than $last_offset is seen or $len many docs have been returned. Since $next_offset is passed by reference, the value of $next_offset will point to the next record in the list (if it exists) after the function is called.

public getPostingsSlice(int $start_offset, int &$next_offset, int $last_offset, int $len[, int $direction = self::ASCENDING ]) : array<string|int, mixed>
Parameters
$start_offset : int

starting offset of the current posting list for the query term; used in calculating BM25F.

$next_offset : int

where to start in word docs

$last_offset : int

offset at which to stop

$len : int

number of documents desired

$direction : int = self::ASCENDING

which direction to iterate through elements of the posting slice (self::ASCENDING or self::DESCENDING) as compared to the order of when they were stored

Return values
array<string|int, mixed>

desired list of docs and their info

getPostingsSliceById()

Returns $len many documents which contained the word corresponding to $word_id (only works for loaded shards)

public getPostingsSliceById(string $word_id, int $len[, mixed $direction = self::ASCENDING ]) : array<string|int, mixed>
Parameters
$word_id : string

key to look up documents for

$len : int

number of documents

$direction : mixed = self::ASCENDING
Return values
array<string|int, mixed>

desired list of docs and their info

getShardSubstring()

Gets from Disk Data $len many bytes beginning at $offset from the current IndexShard

public getShardSubstring(int $offset, int $len[, bool $cache = true ]) : string
Parameters
$offset : int

byte offset to start reading from

$len : int

number of bytes to read

$cache : bool = true

whether to cache disk blocks read from disk

Return values
string

data from that location in the shard
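
One way block-cached reads of this kind might work is sketched below: the file is read in SHARD_BLOCK_SIZE pieces, blocks are optionally kept in an array keyed by their starting offset, and a requested range is assembled from one or more blocks. The class BlockReader and its method names are hypothetical and are not the IndexShard implementation.

```php
<?php
// Hypothetical reader that caches fixed-size blocks of a file.
const SHARD_BLOCK_POWER = 12;
const SHARD_BLOCK_SIZE = 1 << SHARD_BLOCK_POWER; // 4096

class BlockReader
{
    private $fh;
    private array $blocks = [];

    public function __construct($fh)
    {
        $this->fh = $fh;
    }

    /** Returns the block containing byte $offset, reading it if needed. */
    private function block(int $offset, bool $cache): string
    {
        $block_start = ($offset >> SHARD_BLOCK_POWER) << SHARD_BLOCK_POWER;
        if (isset($this->blocks[$block_start])) {
            return $this->blocks[$block_start];
        }
        fseek($this->fh, $block_start);
        $data = (string) fread($this->fh, SHARD_BLOCK_SIZE);
        if ($cache) {
            $this->blocks[$block_start] = $data;
        }
        return $data;
    }

    /** Reads $len bytes starting at $offset, possibly spanning blocks. */
    public function substring(int $offset, int $len,
        bool $cache = true): string
    {
        $out = "";
        while ($len > 0) {
            $within = $offset & (SHARD_BLOCK_SIZE - 1);
            $piece = substr($this->block($offset, $cache), $within, $len);
            if ($piece === "" || $piece === false) {
                break; // past the end of the file
            }
            $out .= $piece;
            $offset += strlen($piece);
            $len -= strlen($piece);
        }
        return $out;
    }
}

$fh = tmpfile();
fwrite($fh, str_repeat("a", 4090) . "0123456789" . str_repeat("b", 100));
$reader = new BlockReader($fh);
$text = $reader->substring(4088, 14); // crosses the first block boundary
```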

getShardWord()

Reads 32 bit word as an unsigned int from the offset given in the shard

public getShardWord(int $offset) : int
Parameters
$offset : int

a byte offset into the shard

Return values
int

desired word or false
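
Reading a 32 bit unsigned int at a byte offset can be sketched with PHP's unpack. The function readUnsignedInt32 is hypothetical, it reads from an in-memory string rather than from the shard file, and big-endian byte order is an assumption; on a 64 bit build of PHP the result is always nonnegative.

```php
<?php
// Big-endian byte order is an assumption made for this sketch.
function readUnsignedInt32(string $data, int $offset)
{
    $bytes = substr($data, $offset, 4);
    if ($bytes === false || strlen($bytes) < 4) {
        return false; // not enough data at this offset
    }
    return unpack("N", $bytes)[1];
}

$data = pack("N", 7) . pack("N", 4000000000);
$first = readUnsignedInt32($data, 0);
$second = readUnsignedInt32($data, 4);
$missing = readUnsignedInt32($data, 6);
```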

getWordDocsSubstring()

From disk gets $len many bytes starting from $offset in the word_docs strings

public getWordDocsSubstring($offset, $len[, bool $cache = true ]) : string
Parameters
$offset :

byte offset to begin getting data out of disk-based word_docs

$len :

number of bytes to get

$cache : bool = true

whether to cache disk blocks read from disk

Return values
string

desired string

getWordDocsWord()

Reads 32 bit word as an unsigned int from the offset given in the word_docs string in the shard

public getWordDocsWord(int $offset) : mixed
Parameters
$offset : int

a byte offset into the word_docs string

Return values
mixed

getWordInfo()

Returns the first offset, last offset, and number of documents the word occurred in for this shard. The first offset (similarly, the last offset) is the byte offset into the word_docs string of the first (last) record involving that word.

public getWordInfo(string $word_id[, bool $raw = false ]) : array<string|int, mixed>
Parameters
$word_id : string

id of the word one wants to look up

$raw : bool = false

whether the id is our version of base64 encoded or not

Return values
array<string|int, mixed>

first offset, last offset, count, exact matching id

getWordInfoFromString()

Converts $str into 3 ints for a first offset into word_docs, a last offset into word_docs, and a count of number of docs with that word.

public static getWordInfoFromString(string $str[, bool $include_generation = false ]) : array<string|int, mixed>
Parameters
$str : string
$include_generation : bool = false
Return values
array<string|int, mixed>

array of these three or four ints

getWordString()

Returns the word record (word key + posting lookup data) from the shard's posting list

public getWordString(bool $is_disk, int $start, int $location, int $word_item_len) : mixed
Parameters
$is_disk : bool

whether the shard is on disk or in memory

$start : int

offset to start of the dictionary

$location : int

index of record to extract from dictionary

$word_item_len : int

length of a word + data record

Return values
mixed
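
The parameters suggest fixed-length records addressed by index. A minimal Python sketch of that arithmetic follows, assuming the record at index $location begins at $start + $location * $word_item_len. This is an inference from the parameter descriptions, not a confirmed detail of the PHP implementation. The $is_disk switch is omitted and an in-memory bytes object stands in for the shard.

```python
def word_record(shard: bytes, start: int, location: int, word_item_len: int) -> bytes:
    """Slice out one fixed-length dictionary record by its index."""
    begin = start + location * word_item_len
    return shard[begin:begin + word_item_len]


# Hypothetical layout: a 3-byte header, then three 4-byte records
demo_shard = b"HDR" + b"aaaa" + b"bbbb" + b"cccc"
second_record = word_record(demo_shard, 3, 1, 4)
```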

headerToShardFields()

Splits a header string into a shard's field variables

public static headerToShardFields(string $header, object $shard) : mixed
Parameters
$header : string

a string with packed shard header data

$shard : object

IndexShard to put data into

Return values
mixed

load()

Load an IndexShard from a file or string

public static load(string $fname[, string &$data = null ]) : IndexShard
Parameters
$fname : string

the name of the file to load the IndexShard from

$data : string = null

stringified shard data to load shard from. If null then the data is loaded from the $fname if possible

Return values
IndexShard

the IndexShard loaded

makeItem()

Returns (docid, item), where item has document statistics (summary offset, relevance, doc rank, and score) for the document given by the supplied posting, based on the posting list's number of docs with the word and the number of occurrences of the word in the doc.

public makeItem(string $posting, int $num_doc_or_links[, int $direction = self::ASCENDING ]) : array<string|int, mixed>
Parameters
$posting : string

a posting entry from some word's posting list

$num_doc_or_links : int

number of documents or links doc appears in

$direction : int = self::ASCENDING

whether to compute DOC_RANK based on the assumption the iterator is traversing the index in an ascending or descending fashion

Return values
array<string|int, mixed>

($doc_id, posting_stats_array) for posting

makeWords()

Callback function for the load method. Splits a word_key . word_info string into an entry in the passed shard: $shard->words[word_key] = $word_info.

public static makeWords(string &$value, int $key, object $shard) : mixed
Parameters
$value : string

the word_key . word_info string

$key : int

index in array - we don't use

$shard : object

IndexShard to add the entry to word table for

Return values
mixed

mergeWordPostingsToString()

Used to flatten the words associative array to a more memory efficient word_postings string.

public mergeWordPostingsToString([bool $replace = false ]) : mixed

$this->words is an associative array with associations wordid => postinglistforid. This format is relatively wasteful of memory.

$this->word_postings is a string in the format wordid1len1postings1wordid2len2postings2 ..., where the wordids are in lexicographic order. This is more memory efficient, as the former relies on the more wasteful PHP implementation of associative arrays.

mergeWordPostingsToString converts the former format to the latter for each of the current wordids. $this->words is then set to []. Note that before this operation is done, $this->word_postings might have data from earlier calls to mergeWordPostingsToString, in which case the behavior is controlled by $replace.

Parameters
$replace : bool = false

whether to overwrite existing word_id postings (true) or to append (false)

Return values
mixed
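
The flattening described above can be sketched in Python as follows. This is only an illustration of the wordid/len/postings layout, not the PHP implementation. The 4-byte big-endian length field and the name flatten_words are assumptions, and the $replace handling for data already present in word_postings is not modeled.

```python
import struct


def flatten_words(words: dict) -> bytes:
    """Concatenate word_id + length + postings records in lexicographic word_id order."""
    pieces = []
    for word_id in sorted(words):
        postings = words[word_id]
        # Assumed layout: 4-byte big-endian length of the postings that follow
        pieces.append(word_id + struct.pack(">I", len(postings)) + postings)
    return b"".join(pieces)


# Hypothetical two-word dictionary with 2-byte word ids
words = {b"bb": b"xyz", b"aa": b"q"}
word_postings = flatten_words(words)
words = {}  # the associative array is emptied after flattening
```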

nextPostingOffsetDocOffset()

Finds the first posting offset between $start_offset and $end_offset of a posting that has a doc_offset bigger than or equal to $doc_offset. This is implemented using a galloping search (double the offset until the value found is larger than the target, then binary search).

public nextPostingOffsetDocOffset(int $start_offset, int $end_offset, int $doc_offset[, int $direction = self::ASCENDING ]) : array<string|int, mixed>
Parameters
$start_offset : int

first posting to consider

$end_offset : int

last posting to consider before giving up

$doc_offset : int

document offset we want to be greater than or equal to (when ASCENDING) or less than or equal to (when DESCENDING)

$direction : int = self::ASCENDING

direction in which to iterate through the elements of the posting slice (self::ASCENDING or self::DESCENDING), relative to the order in which they were stored

Return values
array<string|int, mixed>

(int offset to next posting, doc_offset for this post)
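
A galloping search of the kind described above can be sketched in Python over a plain sorted list. This is a simplified illustration, not the PHP implementation: real postings are packed byte strings, whereas here doc_offsets is a hypothetical list of already-decoded doc offsets, only the ascending direction is shown, and gallop_search is an assumed name.

```python
from bisect import bisect_left


def gallop_search(doc_offsets, start, end, target):
    """Return (index, doc_offset) of the first entry in [start, end] >= target, or None."""
    if start > end:
        return None
    low, high, step = start, start, 1
    # Gallop: keep doubling the step while the probed value is still too small
    while high <= end and doc_offsets[high] < target:
        low = high + 1
        high += step
        step *= 2
    high = min(high, end)
    # Binary search inside the bracket [low, high] found by galloping
    index = bisect_left(doc_offsets, target, low, high + 1)
    if index > end:
        return None
    return index, doc_offsets[index]


offsets = [2, 5, 9, 14, 20, 27, 35]
exact_hit = gallop_search(offsets, 0, 6, 14)
next_larger = gallop_search(offsets, 0, 6, 15)
no_match = gallop_search(offsets, 0, 6, 100)
```

Galloping keeps the cost proportional to the logarithm of the distance to the target, which suits iterators that usually advance only a short way through a long posting list.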

numDocsOrLinks()

An upper bound on the number of docs or links represented by the start and ending integer offsets into a posting list.

public static numDocsOrLinks(int $start_offset, int $last_offset[, float $avg_posting_len = 4 ]) : int
Parameters
$start_offset : int

starting location in posting list

$last_offset : int

ending location in posting list

$avg_posting_len : float = 4

number of bytes in an average posting

Return values
int

number of docs or links

outputPostingLists()

Used to convert the word_postings string into a word_docs string or if a file handle is provided write out the word_docs sequence of postings to the provided file handle.

public outputPostingLists([resource $fh = null ][, bool $with_logging = false ]) : mixed
Parameters
$fh : resource = null

a filehandle to write to

$with_logging : bool = false

whether to log progress

Return values
mixed

packAuxiliaryDocumentKeys()

Used to pack a list of description scores and user ranks as a string of auxiliary keys for a document map entry in the shard.

public packAuxiliaryDocumentKeys([array<string|int, mixed> $description_scores = [] ][, array<string|int, mixed> $user_ranks = [] ]) : string

A document map entry consists of a four byte offset into a WebArchive, three more bytes for the document length, one byte for the number of 8 byte aux keys, followed by a 24 byte key usually derived from the url, host, etc., followed by the description scores and user rank auxiliary keys.

Parameters
$description_scores : array<string|int, mixed> = []

pairs position in document => weight score that position got during summarization process.

$user_ranks : array<string|int, mixed> = []

float scores obtained from a user classifier/ranker defined using Manage Classifiers.

Return values
string

a string padded to a length that is a multiple of 16, where packValues (@see packValues) has been used to map each of the above arrays into a string

packDoclenNum()

Used to store the length of a document as well as the number of key components in its doc_id as a packed int (4 byte string)

public static packDoclenNum(int $doc_len, int $num_keys) : string
Parameters
$doc_len : int

number of words in the document

$num_keys : int

number of keys that are used to make up its doc_id

Return values
string

packed int string representing these two values
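
One way to combine the two values into a single 4-byte string is sketched below in Python. The bit layout (document length in the upper three bytes, number of keys in the low byte) and the big-endian byte order are assumptions, suggested by the "three bytes for the document length, one byte for the number of aux keys" description of a document map entry. They are not confirmed details of the PHP code.

```python
import struct


def pack_doclen_num(doc_len: int, num_keys: int) -> bytes:
    """Pack a 24-bit document length and an 8-bit key count into 4 bytes."""
    if not (0 <= doc_len < 1 << 24 and 0 <= num_keys < 1 << 8):
        raise ValueError("value does not fit the assumed 24-bit/8-bit layout")
    return struct.pack(">I", (doc_len << 8) | num_keys)


def unpack_doclen_num(doc_info: int):
    """Split the 32-bit unsigned int back into (doc_len, num_keys)."""
    return doc_info >> 8, doc_info & 0xFF


packed = pack_doclen_num(300, 3)
as_int = struct.unpack(">I", packed)[0]
round_trip = unpack_doclen_num(as_int)
```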

packValues()

Used to pack either an array of nonnegative ints, each less than 65535, or an array of floats. Each entry is packed into a 2 byte short in the output string.

public packValues(array<string|int, mixed> $values[, string $type = "i" ]) : string
Parameters
$values : array<string|int, mixed>

nonnegative integers or floats to pack

$type : string = "i"

if "i", the values being packed are assumed to be integers; otherwise, floats

Return values
string

with packed values
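
For the integer case, packing into 2-byte shorts can be sketched in Python as below. This is an illustration only: the big-endian byte order, the names pack_shorts and unpack_shorts, and the explicit count argument are assumptions. The 2-byte float encoding is not shown because its format is not described here.

```python
import struct


def pack_shorts(values) -> bytes:
    """Pack nonnegative ints (each below 65535) as consecutive 2-byte shorts."""
    return struct.pack(f">{len(values)}H", *values)


def unpack_shorts(packed: bytes, offset: int, count: int):
    """Return (values, offset just past the last short that was read)."""
    values = list(struct.unpack_from(f">{count}H", packed, offset))
    return values, offset + 2 * count


packed_values = pack_shorts([1, 258, 65534])
values, next_offset = unpack_shorts(packed_values, 0, 3)
```

Returning the offset reached alongside the values matches the [unpacked values array, offset] pair that unpackValues is documented to return, so that several packed arrays can be read back to back.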

packWords()

Posting lists are initially stored associated with a word as a key value pair. The merge operation then merges these into the word_postings string. packWords separates words from postings.

public packWords([resource $fh = null ][, bool $with_logging = false ]) : mixed

After being applied, words is a string consisting of triples (as concatenated strings) word_id, start_offset, end_offset. The offsets are integer offsets into the string $this->word_docs. Finally, if a file handle is given, it writes the word dictionary out to the file as a long string. This function assumes mergeWordPostingsToString has just been called.

Parameters
$fh : resource = null

a file handle to write the dictionary to, if desired

$with_logging : bool = false

whether to write progress log messages every 30 seconds

Return values
mixed

postingsSliceAscending()

Returns $len posting items in ascending order from the posting list between $start_offset and $last_offset, beginning at $next_offset, as an array, and updates the position of $next_offset

public postingsSliceAscending(int $start_offset, int &$next_offset, int $last_offset, int $len) : array<string|int, mixed>
Parameters
$start_offset : int

byte offset beginning of given posting list

$next_offset : int

byte offset between $start_offset and $last_offset of a posting

$last_offset : int

byte offset ending of given posting list

$len : int

how many postings to return increasing from $next_offset

Return values
array<string|int, mixed>

of posting items

postingsSliceDescending()

Returns $len posting items in descending order from the posting list between $start_offset and $last_offset, beginning at $next_offset, as an array, and updates the position of $next_offset

public postingsSliceDescending(int $start_offset, int &$next_offset, int $last_offset, int $len) : array<string|int, mixed>
Parameters
$start_offset : int

byte offset beginning of given posting list

$next_offset : int

byte offset between $start_offset and $last_offset of a posting

$last_offset : int

byte offset ending of given posting list

$len : int

how many postings to return decreasing from $next_offset

Return values
array<string|int, mixed>

of posting items

prepareWordsAndPrefixes()

Computes the prefix string index for the current words array.

public prepareWordsAndPrefixes([bool $with_logging = false ]) : mixed

This index gives offsets of the first occurrences of the leading two characters of a word_id in the words array. This method assumes that the word data is already in $this->word_postings.

Parameters
$with_logging : bool = false

whether log messages should be written as the computation progresses

Return values
mixed
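
The idea of a two-character prefix index can be sketched in Python as follows. This is a simplified illustration, not the PHP implementation: it assumes a sorted list of word ids and fixed-length word records, and stores only the byte offset of the first record for each prefix. The name build_prefix_index and the 20-byte record length are hypothetical.

```python
def build_prefix_index(sorted_word_ids, word_item_len):
    """Map each leading 2-byte prefix to the byte offset of its first word record."""
    index = {}
    for position, word_id in enumerate(sorted_word_ids):
        prefix = word_id[:2]
        if prefix not in index:
            # Only the first occurrence matters because the ids are sorted
            index[prefix] = position * word_item_len
    return index


word_ids = [b"aab", b"aac", b"abz", b"ba0"]
prefix_index = build_prefix_index(word_ids, 20)
```

A lookup can then jump straight to the first record sharing the word's prefix and search only that region of the dictionary.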

readBlockShardAtOffset()

Reads SHARD_BLOCK_SIZE from the current IndexShard's file beginning at byte offset $bytes

public readBlockShardAtOffset(int $bytes[, bool $cache = true ]) : mixed
Parameters
$bytes : int

byte offset to start reading from

$cache : bool = true

whether to cache disk blocks that have been read to RAM

Return values
mixed

data from IndexShard file if found, false otherwise
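
Block-aligned reads with an optional cache, and a substring read built on top of them, can be sketched in Python as below. This illustrates the general technique rather than the PHP implementation. The BlockReader class, the 4096-byte value for SHARD_BLOCK_SIZE, and the dict used as a cache are all assumptions.

```python
import io

SHARD_BLOCK_SIZE = 4096  # assumed value, for illustration only


class BlockReader:
    def __init__(self, file_handle):
        self.file_handle = file_handle
        self.blocks = {}  # block start offset => block bytes

    def read_block(self, offset, cache=True):
        """Read one block starting at offset; return False if nothing is there."""
        if offset in self.blocks:
            return self.blocks[offset]
        self.file_handle.seek(offset)
        block = self.file_handle.read(SHARD_BLOCK_SIZE)
        if not block:
            return False
        if cache:
            self.blocks[offset] = block
        return block

    def substring(self, offset, length, cache=True):
        """Collect the blocks covering [offset, offset + length) and slice them."""
        collected = b""
        block_start = offset - offset % SHARD_BLOCK_SIZE
        while block_start < offset + length:
            block = self.read_block(block_start, cache)
            if block is False:
                break
            collected += block
            block_start += SHARD_BLOCK_SIZE
        skip = offset % SHARD_BLOCK_SIZE
        return collected[skip:skip + length]


data = bytes(range(256)) * 40  # 10240 bytes standing in for a shard file
reader = BlockReader(io.BytesIO(data))
straddling = reader.substring(4090, 10)  # spans the first two blocks
```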

readShardHeader()

If not already loaded, reads in from disk the fixed-length field variables of this IndexShard ($this->words_len, etc.)

public readShardHeader([bool $force = false ]) : bool
Parameters
$force : bool = false

If true

Return values
bool

whether the header could be read in or not

save()

Save the IndexShard to its filename

public save([bool $to_string = false ][, bool $with_logging = false ]) : string
Parameters
$to_string : bool = false

whether output should be written to a string rather than the default file location

$with_logging : bool = false

whether log messages should be written as the shard save progresses

Return values
string

serialized shard if output was to string else empty string

saveWithoutDictionary()

This method re-saves a saved shard without the prefixes and dictionary.

public saveWithoutDictionary([bool $with_logging = false ]) : mixed

It would typically be called after this information has been stored in an IndexDictionary object, so that the data is not stored redundantly.

Parameters
$with_logging : bool = false

whether log messages should be written as the shard save progresses

Return values
mixed

unpackAuxiliaryDocumentKeys()

Used to unpack a list of description scores and user ranks from a document map entry in the shard. We assume these scores were packed using packAuxiliaryDocumentKeys (@see packAuxiliaryDocumentKeys).

public unpackAuxiliaryDocumentKeys(string $packed_data, int $offset) : array<string|int, mixed>
Parameters
$packed_data : string

containing packed description scores and user ranks

$offset : int

where in the string to begin unpacking from

Return values
array<string|int, mixed>

[$description_scores, $user_ranks]

unpackDoclenNum()

Used to extract from a 32 bit unsigned int, a pair which represents the length of a document together with the number of keys in its doc_id

public static unpackDoclenNum(int $doc_info) : array<string|int, mixed>
Parameters
$doc_info : int

integer to unpack

Return values
array<string|int, mixed>

pair (number of words in the document, number of keys that are used to make up its doc_id)

unpackValues()

Used to unpack from a string an array of short nonnegative ints or 2 byte floats.

public unpackValues(mixed $packed_data, mixed $offset[, string $type = 'i' ]) : array<string|int, mixed>

@see packValues

Parameters
$packed_data : mixed
$offset : mixed
$type : string = 'i'

if "i", the values being unpacked are assumed to be integers; otherwise, floats

Return values
array<string|int, mixed>

[unpacked values array, offset to where processed to in string]

unpackWordDocs()

Takes the word docs string and splits it into posting lists which are assigned to particular words in the words dictionary array.

public unpackWordDocs() : mixed

This method is memory expensive as it briefly has essentially two copies of what's in word_docs.

Return values
mixed

weightedCount()

Used to sum over the occurrences in a position list counting with weight based on term location in the document

public weightedCount(array<string|int, mixed> $position_list, bool $is_doc, int $title_length[, array<string|int, mixed> $position_scores = [] ]) : array<string|int, mixed>
Parameters
$position_list : array<string|int, mixed>

positions of term in item

$is_doc : bool

whether the item is a document or a link

$title_length : int

position in position list at which point no longer in title of original doc

$position_scores : array<string|int, mixed> = []

pairs position => weight saying how much a word at a given position range is worth

Return values
array<string|int, mixed>

associative array of document_part => weight count of occurrences of term in


        
