Yioop_V9.5_Source_Code_Documentation

IndexDocumentBundle
in package
implements CrawlConstants

Encapsulates a set of web page documents and an inverted word-index of terms from these documents which allow one to search for documents containing a particular word.

Tags
author

Chris Pollett

Interfaces, Classes, Traits and Enums

CrawlConstants
Shared constants and enums used by components that are involved in the crawling process

Table of Contents

ARCHIVE_INFO_FILE  = "archive_info.txt"
File name used to store within the folder of the IndexDocumentBundle parameter/configuration information about the bundle
DEFAULT_PARAMETERS  = ["DESCRIPTION" => "", "VERSION" => self::DEFAULT_VERSION]
Default values for the configuration parameters of an IndexDocumentBundle
DEFAULT_VERSION  = "3.2"
The version of this IndexDocumentBundle. The lowest format number is 3.0, as prior inverted index/document stores used IndexArchiveBundles.
DICTIONARY_FOLDER  = "dictionary"
Subfolder of IndexDocumentBundle to store the btree with term => posting list information (i.e., the inverted index)
DOC_MAP_FILENAME  = "doc_map"
Partition i in an IndexDocumentBundle has a subfolder i within self::POSITIONS_DOC_MAP_FOLDER. Within this subfolder i, self::DOC_MAP_FILENAME is the name of the file used to store the document map for the partition. The document map consists of a sequence of records associated with each doc_id of a document stored in the partition. The first record is ["POS" => $num_words, "SCORE" => floatval($global_score_for_document)]. The second record is ["POS" => $length_of_title_of_document, "SCORE" => floatval($num_description_scores)]. Here a description score is a score for the importance of a section of a document. Subsequent records list ["POS" => the length of the jth section of the document, "SCORE" => its score].
DOCID_LEN  = 24
Length of DocIds used by this IndexDocumentBundle
DOCID_PART_LEN  = 8
DocIds are made of three parts: hash of url, hash of document, hash of url hostname. Each of these hashes is DOCID_PART_LEN long
DOCUMENTS_FOLDER  = "documents"
Folder used to store the partition data of this IndexDocumentBundle. These will consist of .txt.gz files for each partition, which are used to store summaries of documents and actual documents (web pages), and .ix files, which are used to store doc_ids and the associated offsets to their summary and actual document within the .txt.gz file.
LAST_ENTRIES_FILENAME  = "last_entries"
Name of the last entries file used to help compute difference lists for doc_map_index and position list offsets used in postings for the partition. This file is also used to track the total number of occurrences of a term in a partition.
NEXT_PARTITION_FILE  = "next_partition.txt"
The name of the file used to keep track of the integer that says which partition is the next one whose documents can be added to this IndexDocumentBundle's dictionary. I.e., it should be that next_partition <= save_partition.
PARTITION_FILENAMES  = [self::DOC_MAP_FILENAME, self::LAST_ENTRIES_FILENAME, self::POSITIONS_FILENAME, self::POSTINGS_FILENAME]
Names for the files which appear within a partition sub-folder
POSITIONS_DOC_MAP_FOLDER  = "positions_doc_maps"
Name of the folder used to hold position lists and document maps. Within this folder there is a subfolder for each partition which contains a doc_map file, postings file for the docs within the partition, position lists file for those postings, and a last_entries file used in the computation of difference list for doc_map_index and position list offsets, as well as number of occurrences of terms.
POSITIONS_FILENAME  = "positions"
Name of the file within a partition's positions_doc_maps folder used to contain the partition's position lists for all terms in the partition.
POSTINGS_BUFFER_SIZE  = 1000000
How many bytes of postings to buffer before writing when addPartitionPostingsDictionary() is run
POSTINGS_FILENAME  = "postings"
Name of the file within a partition's positions_doc_maps folder with posting information for all terms in that partition. This consists of key value pairs term_id => posting records for all documents with that term.
TEMP_POSTINGS_FILENAME  = "temp_postings"
Temporary name for postings from a POSTINGS_FILENAME file while they are being compressed.
TERMID_LEN  = 16
Length of TermIds used by this IndexDocumentBundle
$archive_info  : array<string|int, mixed>
Holds property value pairs concerning the configuration of the current IndexDocumentBundle
$description  : string
A short text name for this IndexDocumentBundle
$dictionary  : object
IndexDictionary for all shards in the IndexArchiveBundle. This contains entries of the form (word, num_shards with word, posting list info 0th shard containing the word, posting list info 1st shard containing the word, ...)
$dir_name  : string
Folder name to use for this IndexDocumentBundle
$doc_map  : array<string|int, mixed>
Associative array of docid=>doc_record pairs
$doc_map_counter  : int
Keeps track of the number of documents present in the current partition
$doc_map_tools  : PackedTableTools
Used to read and write data to the $doc_map array
$documents  : object
PartitionDocumentBundle for web page documents
$extract_phrase_time  : int
Holds the total time needed to extract phrases (sequences of adjacent words) from site descriptions for a partition
$last_entries  : array<string|int, mixed>
Used to keep track of the previous values of posting quantities so difference lists can be computed. For example, previous $doc_map_index, previous position list offset. It also tracks the total number of occurrences of a term within a partition.
$last_entries_tools  : PackedTableTools
Used to read and write data to the $last_entries array
$next_partition_to_add  : array<string|int, mixed>
Structure containing info about the current partition
$positions  : string
A string consisting of the concatenated term position information for each document in turn and, within this, for each term in that document.
$postings  : array<string|int, mixed>
Associative array $term_id => posting list records for that term in the partition.
$postings_tools  : PackedTableTools
Used to read and write data to the $postings array
$unpack_len_map  : array<string|int, mixed>
Array of the string lengths that each of $unpack_map's codes consumes
$unpack_map  : array<string|int, mixed>
Map from int -> three character unpack string used to unpack posting info
__construct()  : mixed
Makes or initializes an IndexDocumentBundle with the provided parameters
addPages()  : bool
Add the array of $pages to the documents PartitionDocumentBundle
addPartitionPostingsDictionary()  : mixed
Adds the previously constructed inverted index $partition to the inverted index of the whole bundle
addScoresDocMap()  : mixed
Used to add a doc_id => doc_record to the current partition's document map ($this->doc_map). A doc record records the number of words in the document, an overall length of the document, the length of its title, scores for each of the sentences included in the summary for the document, and classifier scores for each classifier that was used by the crawl.
addTermPostingLists()  : mixed
Adds posting records associated to a document to the posting lists for a partition.
buildInvertedIndexPartition()  : mixed
Builds an inverted index shard for a documents PartitionDocumentBundle partition.
computeDocId()  : string
Given a $site array of information about a web page/document, uses the CrawlConstants::URL and CrawlConstants::HASH fields to compute a unique doc id for the array.
deDeltaPostingsSumFrequencies()  : int
Within postings, DOC_MAP_INDEX and POSITION_OFFSETS to position lists are stored as delta lists (differences over previous values). This method undoes the delta encoding to restore the actual DOC_MAP_INDEX and POSITION_OFFSETS values. It also computes the sum of the frequencies of items within the list of postings. This method is currently only used for the active partition in an index (the one whose terms haven't yet been added to the B+-tree).
findNumSlashes()  : mixed
Finds number of '/' in the url after the hostname represented by doc_id $key.
forceSave()  : mixed
Forces the current shard to be saved
getArchiveInfo()  : array<string|int, mixed>
Gets the description, count of documents, and number of partitions of the documents store in the supplied directory. If the file arc_description.txt exists, this is viewed as a dummy index archive for the sole purpose of allowing conversions of downloaded data such as arc files into Yioop! format.
getCachePage()  : array<string|int, mixed>
Given the $doc_id of a document and a $partition to look for it in, returns the cached page of the document if present and [] otherwise
getParamModifiedTime()  : mixed
Returns the last time the archive info of the bundle was modified.
getPartitionBaseFolder()  : string
Gets the file path corresponding to the partition with index $partition
getPostingsString()  : string
Gets the postings stored in a partition's postings file from $offset to $offset + $len, with the 255 encoding removed.
getSummary()  : array<string|int, mixed>
Given the $doc_id of a document and a $partition to look for it in, returns the document summary info if present and [] otherwise.
getWordInfo()  : array<string|int, mixed>
Gets an array of posting list positions for each shard in the bundle $index_name for the word id $term_id
invertOneSite()  : string
Used to create inverted index for one site and add its information to the current partition.
isACldDocId()  : mixed
Checks if a doc_id $key is that of a Company level domain (cld) or www.cld.
isAHostDocId()  : mixed
Checks if a doc_id $key is that of a host url.
isAWikipediaPage()  : mixed
Checks if a doc_id $key is that of a Wikipedia page.
isType()  : bool
Checks if a doc_id corresponds to a particular large scale type among external_link, internal_link, link (union of previous two), binary, feed, image, text, video, document (union of previous five)
prepareIndexMap()  : array<string|int, mixed>
As a pre-step to calculating the inverted index information for a partition, this method groups documents and links to documents into single objects.
setArchiveInfo()  : mixed
Sets the archive info struct for the web archive bundle associated with this bundle. This struct has fields like: DESCRIPTION (serialized store of global parameters of the crawl like seed sites, timestamp, etc).
stopIndexing()  : mixed
Used when a crawl stops to perform final dictionary operations to produce a working stand-alone index.
unpackPostings()  : array<string|int, mixed>
Given the postings as a string for a partition for a term, unpacks them into an array of postings, doing de-delta of doc_map_indices and de-delta of positions. Each posting represents the occurrences of a term in a document, so the frequency component is the number of occurrences of the term in the document. This method also computes the sum of these frequencies over all postings in the partition.
updateDictionary()  : mixed
For every partition between next partition and save partition, adds the posting list information to the dictionary BPlusTree. At the end of this process, next partition and save partition should be the same.

Constants

ARCHIVE_INFO_FILE

File name used to store within the folder of the IndexDocumentBundle parameter/configuration information about the bundle

public mixed ARCHIVE_INFO_FILE = "archive_info.txt"

DEFAULT_PARAMETERS

Default values for the configuration parameters of an IndexDocumentBundle

public mixed DEFAULT_PARAMETERS = ["DESCRIPTION" => "", "VERSION" => self::DEFAULT_VERSION]

DEFAULT_VERSION

The version of this IndexDocumentBundle. The lowest format number is 3.0, as prior inverted index/document stores used IndexArchiveBundles.

public mixed DEFAULT_VERSION = "3.2"

DICTIONARY_FOLDER

Subfolder of IndexDocumentBundle to store the btree with term => posting list information (i.e., the inverted index)

public mixed DICTIONARY_FOLDER = "dictionary"

DOC_MAP_FILENAME

Partition i in an IndexDocumentBundle has a subfolder i within self::POSITIONS_DOC_MAP_FOLDER. Within this subfolder i, self::DOC_MAP_FILENAME is the name of the file used to store the document map for the partition. The document map consists of a sequence of records associated with each doc_id of a document stored in the partition. The first record is ["POS" => $num_words, "SCORE" => floatval($global_score_for_document)]. The second record is ["POS" => $length_of_title_of_document, "SCORE" => floatval($num_description_scores)]. Here a description score is a score for the importance of a section of a document. Subsequent records list ["POS" => the length of the jth section of the document, "SCORE" => its score].

public mixed DOC_MAP_FILENAME = "doc_map"
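
The record sequence described above can be illustrated with a minimal sketch. The helper buildDocMapRecords() and its sample values are hypothetical and not part of IndexDocumentBundle; the sketch only follows the simplified layout given in this description and ignores the extra fields (host keywords, path keywords, classifier scores) that addScoresDocMap() accepts, as well as the packed on-disk encoding.

```php
<?php
// Hypothetical helper (not a Yioop method): builds the sequence of
// ["POS" => ..., "SCORE" => ...] records for one document, following the
// layout described for the doc_map file.
function buildDocMapRecords(int $num_words, float $global_score,
    int $title_length, array $description_scores): array
{
    $records = [
        // first record: number of words and overall document score
        ["POS" => $num_words, "SCORE" => floatval($global_score)],
        // second record: title length and number of description scores
        ["POS" => $title_length,
            "SCORE" => floatval(count($description_scores))],
    ];
    // subsequent records: one per section, (section length, section score)
    foreach ($description_scores as [$section_length, $section_score]) {
        $records[] = ["POS" => $section_length,
            "SCORE" => floatval($section_score)];
    }
    return $records;
}

// Sample values are made up for illustration.
$records = buildDocMapRecords(120, 0.75, 6, [[40, 0.9], [80, 0.4]]);
```

With two description scores, the sketch yields four records: the two header records followed by one record per section.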

DOCID_LEN

Length of DocIds used by this IndexDocumentBundle

public mixed DOCID_LEN = 24

DOCID_PART_LEN

DocIds are made of three parts: hash of url, hash of document, hash of url hostname. Each of these hashes is DOCID_PART_LEN long

public mixed DOCID_PART_LEN = 8
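
As a rough illustration of how DOCID_LEN and DOCID_PART_LEN relate, the sketch below splits a 24-byte doc id into three 8-byte parts in the order listed above. The function splitDocId(), the array keys, and the sample id are assumptions made for illustration; the real doc ids may also encode other information (such as a document type) within these bytes.

```php
<?php
// Stand-ins for IndexDocumentBundle::DOCID_LEN and ::DOCID_PART_LEN
const DOCID_LEN = 24;
const DOCID_PART_LEN = 8;

// Hypothetical helper (not a Yioop method): splits a doc id into its
// three fixed-length parts, assuming the order url, document, hostname.
function splitDocId(string $doc_id): array
{
    if (strlen($doc_id) !== DOCID_LEN) {
        throw new InvalidArgumentException("doc id must be " . DOCID_LEN .
            " bytes long");
    }
    [$hash_url, $hash_doc, $hash_host] =
        str_split($doc_id, DOCID_PART_LEN);
    return ["hash_url" => $hash_url, "hash_doc" => $hash_doc,
        "hash_host" => $hash_host];
}

// A made-up doc id: eight "u" bytes, eight "d" bytes, eight "h" bytes
$parts = splitDocId(str_repeat("u", 8) . str_repeat("d", 8) .
    str_repeat("h", 8));
```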

DOCUMENTS_FOLDER

Folder used to store the partition data of this IndexDocumentBundle. These will consist of .txt.gz files for each partition, which are used to store summaries of documents and actual documents (web pages), and .ix files, which are used to store doc_ids and the associated offsets to their summary and actual document within the .txt.gz file.

public mixed DOCUMENTS_FOLDER = "documents"

LAST_ENTRIES_FILENAME

Name of the last entries file used to help compute difference lists for doc_map_index and position list offsets used in postings for the partition. This file is also used to track the total number of occurrences of a term in a partition.

public mixed LAST_ENTRIES_FILENAME = "last_entries"

NEXT_PARTITION_FILE

The name of the file used to keep track of the integer that says which partition is the next one whose documents can be added to this IndexDocumentBundle's dictionary. I.e., it should be that next_partition <= save_partition.

public mixed NEXT_PARTITION_FILE = "next_partition.txt"

PARTITION_FILENAMES

Names for the files which appear within a partition sub-folder

public mixed PARTITION_FILENAMES = [self::DOC_MAP_FILENAME, self::LAST_ENTRIES_FILENAME, self::POSITIONS_FILENAME, self::POSTINGS_FILENAME]

POSITIONS_DOC_MAP_FOLDER

Name of the folder used to hold position lists and document maps. Within this folder there is a subfolder for each partition which contains a doc_map file, postings file for the docs within the partition, position lists file for those postings, and a last_entries file used in the computation of difference list for doc_map_index and position list offsets, as well as number of occurrences of terms.

public mixed POSITIONS_DOC_MAP_FOLDER = "positions_doc_maps"
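
The folder layout described above can be sketched as follows. The function partitionFilePaths() and the example bundle directory are hypothetical, and the sketch assumes the four files are stored under exactly the names in PARTITION_FILENAMES with no extension, which the documentation does not confirm.

```php
<?php
// Stand-ins for the corresponding IndexDocumentBundle constants
const POSITIONS_DOC_MAP_FOLDER = "positions_doc_maps";
const PARTITION_FILENAMES = ["doc_map", "last_entries", "positions",
    "postings"];

// Hypothetical helper (not a Yioop method): lists the paths of the files
// expected in the subfolder for partition $partition of the bundle
// stored in $dir_name.
function partitionFilePaths(string $dir_name, int $partition): array
{
    $base = $dir_name . "/" . POSITIONS_DOC_MAP_FOLDER . "/" . $partition;
    $paths = [];
    foreach (PARTITION_FILENAMES as $name) {
        $paths[$name] = $base . "/" . $name;
    }
    return $paths;
}

// "/tmp/bundle" is a made-up bundle directory
$paths = partitionFilePaths("/tmp/bundle", 3);
```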

POSITIONS_FILENAME

Name of the file within a partition's positions_doc_maps folder used to contain the partition's position lists for all terms in the partition.

public mixed POSITIONS_FILENAME = "positions"

POSTINGS_BUFFER_SIZE

How many bytes of postings to buffer before writing when addPartitionPostingsDictionary() is run

public mixed POSTINGS_BUFFER_SIZE = 1000000

POSTINGS_FILENAME

Name of the file within a partition's positions_doc_maps folder with posting information for all terms in that partition. This consists of key value pairs term_id => posting records for all documents with that term.

public mixed POSTINGS_FILENAME = "postings"

TEMP_POSTINGS_FILENAME

Temporary name for postings from a POSTINGS_FILENAME file while they are being compressed.

public mixed TEMP_POSTINGS_FILENAME = "temp_postings"

TERMID_LEN

Length of TermIds used by this IndexDocumentBundle

public mixed TERMID_LEN = 16

Properties

$archive_info

Holds property value pairs concerning the configuration of the current IndexDocumentBundle

public array<string|int, mixed> $archive_info

$description

A short text name for this IndexDocumentBundle

public string $description

$dictionary

IndexDictionary for all shards in the IndexArchiveBundle. This contains entries of the form (word, num_shards with word, posting list info 0th shard containing the word, posting list info 1st shard containing the word, ...)

public object $dictionary

$doc_map

Associative array of docid=>doc_record pairs

public array<string|int, mixed> $doc_map

$doc_map_counter

Keeps track of the number of documents present in the current partition

public int $doc_map_counter

$extract_phrase_time

Holds the total time needed to extract phrases (sequences of adjacent words) from site descriptions for a partition

public int $extract_phrase_time

$last_entries

Used to keep track of the previous values of posting quantities so difference lists can be computed. For example, previous $doc_map_index, previous position list offset. It also tracks the total number of occurrences of a term within a partition.

public array<string|int, mixed> $last_entries

$next_partition_to_add

Structure containing info about the current partition

public array<string|int, mixed> $next_partition_to_add

$positions

A string consisting of the concatenated term position information for each document in turn and, within this, for each term in that document.

public string $positions

$postings

Associative array $term_id => posting list records for that term in the partition.

public array<string|int, mixed> $postings

$unpack_len_map

Array of the string lengths that each of $unpack_map's codes consumes

public array<string|int, mixed> $unpack_len_map

$unpack_map

Map from int -> three character unpack string used to unpack posting info

public array<string|int, mixed> $unpack_map

Methods

__construct()

Makes or initializes an IndexDocumentBundle with the provided parameters

public __construct(string $dir_name[, bool $read_only_archive = true ][, string $description = null ][, int $num_docs_per_partition = CNUM_DOCS_PER_PARTITION ][, int $max_keys = BPlusTree::MAX_KEYS ]) : mixed
Parameters
$dir_name : string

folder name to store this bundle

$read_only_archive : bool = true

whether to open archive only for reading or reading and writing

$description : string = null

a text name/serialized info about this IndexDocumentBundle

$num_docs_per_partition : int = CNUM_DOCS_PER_PARTITION

the number of documents to be stored in a single partition

$max_keys : int = BPlusTree::MAX_KEYS

the maximum number of keys used by the BPlusTree used for the inverted index

Return values
mixed

addPages()

Add the array of $pages to the documents PartitionDocumentBundle

public addPages(array<string|int, mixed> $pages, int $visited_urls_count) : bool
Parameters
$pages : array<string|int, mixed>

data to store

$visited_urls_count : int

number to add to the count of visited urls (visited urls is a smaller number than the total count of objects stored in the index).

Return values
bool

success or failure of adding the pages

addPartitionPostingsDictionary()

Adds the previously constructed inverted index $partition to the inverted index of the whole bundle

public addPartitionPostingsDictionary([int $partition = -1 ][, string $taking_too_long_touch = null ]) : mixed
Parameters
$partition : int = -1

which partition's inverted index to add, by default the current save partition

$taking_too_long_touch : string = null

a filename of a file to touch so its last modified time becomes the current time. In a typical Yioop crawl this is done for the CrawlConstants::crawl_status_file file to prevent Yioop's web interface from stopping the crawl because it has seen no recent progress activity on a crawl.

Return values
mixed

addScoresDocMap()

Used to add a doc_id => doc_record to the current partition's document map ($this->doc_map). A doc record records the number of words in the document, an overall length of the document, the length of its title, scores for each of the sentences included in the summary for the document, and classifier scores for each classifier that was used by the crawl.

public addScoresDocMap(string $doc_id, int $num_words, float $score, int $host_keywords_end_pos, int $title_end_pos, int $path_keywords_end_pos, array<string|int, mixed> $description_scores, array<string|int, mixed> $user_ranks) : mixed
Parameters
$doc_id : string

new document id to add a record for

$num_words : int

number of terms in the document associated with the doc-id

$score : float

overall score for the importance of this document

$host_keywords_end_pos : int

end of the portion of the document summary containing terms coming from the hostname

$title_end_pos : int

end of the portion of the document summary containing terms in the title

$path_keywords_end_pos : int

length of the portion of the document summary containing terms in the url path

$description_scores : array<string|int, mixed>

pairs of the form (length of summary portion, score for that portion)

$user_ranks : array<string|int, mixed>

for each user defined classifier for this crawl the float score of the classifier on this document

Return values
mixed

addTermPostingLists()

Adds posting records associated to a document to the posting lists for a partition.

public addTermPostingLists(int $position_offset, int $doc_length, array<string|int, mixed> $word_lists, array<string|int, mixed> $meta_ids, int $doc_map_index) : mixed
Parameters
$position_offset : int

number of header bytes that might be used before including any position data in the file in which positions will eventually be stored.

$doc_length : int

length of document in terms for the document for which we are adding posting data.

$word_lists : array<string|int, mixed>

term => positions within current document of that term for the document whose posting data we are adding

$meta_ids : array<string|int, mixed>

meta terms associated with the document we are adding. An example meta term might be "media:news"

$doc_map_index : int

which document within the partition is the one we are adding. I.e., 5 would mean there were 5 earlier documents whose postings we have already added.

Return values
mixed

buildInvertedIndexPartition()

Builds an inverted index shard for a documents PartitionDocumentBundle partition.

public buildInvertedIndexPartition([int $partition = -1 ][, string $taking_too_long_touch = null ][, mixed $just_stats = false ]) : mixed
Parameters
$partition : int = -1

to build index for

$taking_too_long_touch : string = null

a filename of a file to touch so its last modified time becomes the current time. In a typical Yioop crawl this is done for the CrawlConstants::crawl_status_file file to prevent Yioop's web interface from stopping the crawl because it has seen no recent progress activity on a crawl.

$just_stats : mixed = false
Return values
mixed

whether job executed to completion (true or false) if !$just_stats, otherwise, an array with NUM_DOCS, NUM_LINKS, and TERM_STATISTICS (the latter having term frequency info)

computeDocId()

Given a $site array of information about a web page/document, uses the CrawlConstants::URL and CrawlConstants::HASH fields to compute a unique doc id for the array.

public static computeDocId(array<string|int, mixed> $site) : string
Parameters
$site : array<string|int, mixed>

site to compute doc_id for

Return values
string

the computed doc_id

deDeltaPostingsSumFrequencies()

Within postings, DOC_MAP_INDEX and POSITION_OFFSETS to position lists are stored as delta lists (differences over previous values). This method undoes the delta encoding to restore the actual DOC_MAP_INDEX and POSITION_OFFSETS values. It also computes the sum of the frequencies of items within the list of postings. This method is currently only used for the active partition in an index (the one whose terms haven't yet been added to the B+-tree).

public deDeltaPostingsSumFrequencies(array<string|int, mixed> &$postings) : int
Parameters
$postings : array<string|int, mixed>

a reference to an array of posting lists for a term (this will be changed by this method)

Return values
int

sum of the frequencies of term occurrences as given by the above postings
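
The delta-list idea behind this method can be sketched in a few lines. This is an illustration only, not the Yioop implementation: the function name, the field names DOC_MAP_INDEX, POSITION_OFFSET and FREQUENCY, and the assumption that the first delta is taken relative to 0 are all choices made for the example.

```php
<?php
// Hypothetical stand-in for the de-delta step: replaces the delta-encoded
// DOC_MAP_INDEX and POSITION_OFFSET of each posting with running totals
// and returns the sum of the FREQUENCY fields.
function deDeltaSumFrequencies(array &$postings): int
{
    $doc_map_index = 0;
    $position_offset = 0;
    $sum_frequencies = 0;
    foreach ($postings as &$posting) {
        // each stored value is a difference over the previous posting
        $doc_map_index += $posting["DOC_MAP_INDEX"];
        $position_offset += $posting["POSITION_OFFSET"];
        $posting["DOC_MAP_INDEX"] = $doc_map_index;
        $posting["POSITION_OFFSET"] = $position_offset;
        $sum_frequencies += $posting["FREQUENCY"];
    }
    unset($posting); // break the reference left over from the loop
    return $sum_frequencies;
}

// Made-up delta-encoded postings for a single term
$postings = [
    ["DOC_MAP_INDEX" => 2, "POSITION_OFFSET" => 10, "FREQUENCY" => 3],
    ["DOC_MAP_INDEX" => 3, "POSITION_OFFSET" => 7, "FREQUENCY" => 1],
    ["DOC_MAP_INDEX" => 1, "POSITION_OFFSET" => 12, "FREQUENCY" => 4],
];
$total = deDeltaSumFrequencies($postings);
```

In this example the doc map indices become 2, 5, 6, the position offsets become 10, 17, 29, and the returned frequency sum is 8. Storing differences rather than absolute values keeps the numbers small, which generally helps variable-length integer encodings.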

findNumSlashes()

Finds number of '/' in the url after the hostname represented by doc_id $key.

public static findNumSlashes(string $key) : mixed
Parameters
$key : string

to find '/' count

Return values
mixed

forceSave()

Forces the current shard to be saved

public forceSave() : mixed
Return values
mixed

getArchiveInfo()

Gets the description, count of documents, and number of partitions of the documents store in the supplied directory. If the file arc_description.txt exists, this is viewed as a dummy index archive for the sole purpose of allowing conversions of downloaded data such as arc files into Yioop! format.

public static getArchiveInfo(string $dir_name) : array<string|int, mixed>
Parameters
$dir_name : string

path to a directory containing a documents IndexDocumentBundle

Return values
array<string|int, mixed>

summary of the given archive

getCachePage()

Given the $doc_id of a document and a $partition to look for it in, returns the cached page of the document if present and [] otherwise

public getCachePage(string $doc_id, int $partition) : array<string|int, mixed>
Parameters
$doc_id : string

of document to look up

$partition : int

to look for document in

Return values
array<string|int, mixed>

desired page cache or [] if look up failed

getParamModifiedTime()

Returns the last time the archive info of the bundle was modified.

public static getParamModifiedTime(string $dir_name) : mixed
Parameters
$dir_name : string

folder with archive bundle

Return values
mixed

either the modification time if the file exists, or false

getPartitionBaseFolder()

Gets the file path corresponding to the partition with index $partition

public getPartitionBaseFolder(int $partition) : string
Parameters
$partition : int

desired partition index

Return values
string

file path to where this partition's index data is stored (not the original documents, which are stored in the PartitionDocumentBundle)

getPostingsString()

Gets the postings stored in a partition's postings file from $offset to $offset + $len, with the 255 encoding removed.

public getPostingsString(int $partition, int $offset, int $len) : string
Parameters
$partition : int

partition to retrieve posting from

$offset : int

byte offset into the partition/postings file at which to look for them

$len : int

length of the posting list to retrieve.

Return values
string

encoded posting list data: the vbyte encoded number of postings, followed by the posting data in PackedTableTools format

getSummary()

Given the $doc_id of a document and a $partition to look for it in, returns the document summary info if present and [] otherwise.

public getSummary(string $doc_id, int $partition) : array<string|int, mixed>
Parameters
$doc_id : string

of document to look up

$partition : int

to look for document in

Return values
array<string|int, mixed>

desired summary or [] if look up failed

getWordInfo()

Gets an array of posting list positions for each shard in the bundle $index_name for the word id $term_id

public getWordInfo(string $term_id[, int $threshold = -1 ], mixed $offset[, mixed $num_partitions = -1 ][, bool $with_remaining_total = false ]) : array<string|int, mixed>
Parameters
$term_id : string

id of phrase or word to look up in bundle dictionary

$threshold : int = -1

after the number of results exceeds this amount stop looking for more dictionary entries.

$offset : mixed
$num_partitions : mixed = -1
$with_remaining_total : bool = false

whether to also return the total number of postings found

Return values
array<string|int, mixed>

either [total, sequence of four tuples] or sequence of four tuples: (index_shard generation, posting_list_offset, length, exact id that match $term_id)

invertOneSite()

Used to create inverted index for one site and add its information to the current partition.

public invertOneSite(array<string|int, mixed> $site, array<string|int, mixed> $url_info, int &$link_cnt) : string
Parameters
$site : array<string|int, mixed>

site to invert

$url_info : array<string|int, mixed>

collection of urls and hashes of documents which map to the same document

$link_cnt : int

current count of number of links discovered so far

Return values
string

$site_url canonical url for site

isACldDocId()

Checks if a doc_id $key is that of a Company level domain (cld) or www.cld.

public static isACldDocId(string $key) : mixed

I.e., a url https://yahoo.com/ or https://www.yahoo.com/ as opposed to https://foo.yahoo.com/

Parameters
$key : string

to check if doc or not

Return values
mixed

isAHostDocId()

Checks if a doc_id $key is that of a host url.

public static isAHostDocId(string $key) : mixed

I.e., a url https://www.yahoo.com/ as opposed to https://www.yahoo.com/foo

Parameters
$key : string

to check if doc or not

Return values
mixed

isAWikipediaPage()

Checks if a doc_id $key is that of a Wikipedia page.

public static isAWikipediaPage(string $key) : mixed
Parameters
$key : string

to check if Wikipedia page or not

Return values
mixed

isType()

Checks if a doc_id corresponds to a particular large scale type among external_link, internal_link, link (union of previous two), binary, feed, image, text, video, document (union of previous five)

public static isType(string $key, mixed $types) : bool
Parameters
$key : string

to check if doc or not

$types : mixed
Return values
bool

true if a document

prepareIndexMap()

As a pre-step to calculating the inverted index information for a partition, this method groups documents and links to documents into single objects.

public prepareIndexMap(int $partition[, array<string|int, mixed> $test_index = [] ]) : array<string|int, mixed>

It also does simple deduplication of documents that have the same hash. It then returns an array of the grouped document data. Grouping is done by giving a score to each document based on (number of doc in index - order doc added). For two entries with the same hash_url, a document will be chosen over a link as the representative; otherwise, the one with higher score will be chosen as the representative. The representative document is given the sum of the scores of its constituents. A second phase where documents are grouped by hash of the text body is also done. Finally, the returned documents are sorted by their scores. So the order of documents from this process is roughly in the order of importance.

Parameters
$partition : int

index of partition to do deduplication for in the case that test index is empty

$test_index : array<string|int, mixed> = []

is non-empty only when testing what this method does. In that case, it should consist of an array of $doc_id => string representing a possible record for that doc. As deduplication is done entirely based on components of the doc_id (hash_url, doc_type, hash_doc, hash_host), the string doesn't matter too much.

Return values
array<string|int, mixed>

groups doc_id => records associated with that doc_id
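
The first grouping phase described for this method can be approximated by the sketch below. It is a simplified illustration under stated assumptions, not the actual prepareIndexMap() logic: the function groupByHashUrl(), the entry fields (id, hash_url, is_doc), and the sample data are hypothetical, and the second phase (grouping by hash of the text body) is omitted.

```php
<?php
// Hypothetical sketch of grouping entries that share a hash_url.
// Each entry gets a score of (number of entries - order added); within a
// group a document beats a link, otherwise the higher own score wins,
// and the representative receives the sum of the group's scores.
function groupByHashUrl(array $entries): array
{
    $num_entries = count($entries);
    $groups = [];
    foreach (array_values($entries) as $order => $entry) {
        $entry["own_score"] = $num_entries - $order;
        $entry["score"] = $entry["own_score"];
        $key = $entry["hash_url"];
        if (!isset($groups[$key])) {
            $groups[$key] = $entry;
            continue;
        }
        $current = $groups[$key];
        $group_score = $current["score"] + $entry["own_score"];
        if ($entry["is_doc"] !== $current["is_doc"]) {
            // a document is preferred over a link as representative
            $representative = $entry["is_doc"] ? $entry : $current;
        } else {
            $representative =
                ($entry["own_score"] > $current["own_score"]) ?
                $entry : $current;
        }
        $representative["score"] = $group_score;
        $groups[$key] = $representative;
    }
    // most important (highest score) groups first
    uasort($groups, fn($a, $b) => $b["score"] <=> $a["score"]);
    return $groups;
}

// Made-up entries in the order they were added
$groups = groupByHashUrl([
    ["id" => "link-a", "hash_url" => "a", "is_doc" => false],
    ["id" => "doc-b", "hash_url" => "b", "is_doc" => true],
    ["id" => "doc-a", "hash_url" => "a", "is_doc" => true],
    ["id" => "link-c", "hash_url" => "c", "is_doc" => false],
]);
```

Here the link and the document sharing hash_url "a" collapse into one group represented by the document, with the combined score 4 + 2 = 6, so that group sorts ahead of "b" (score 3) and "c" (score 1).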

setArchiveInfo()

Sets the archive info struct for the web archive bundle associated with this bundle. This struct has fields like: DESCRIPTION (serialized store of global parameters of the crawl like seed sites, timestamp, etc).

public static setArchiveInfo(string $dir_name, array<string|int, mixed> $update_info) : mixed
Parameters
$dir_name : string

folder with archive bundle

$update_info : array<string|int, mixed>

struct with above fields

Return values
mixed

stopIndexing()

Used when a crawl stops to perform final dictionary operations to produce a working stand-alone index.

public stopIndexing() : mixed
Return values
mixed

unpackPostings()

Given the postings as a string for a partition for a term, unpacks them into an array of postings, doing de-delta of doc_map_indices and de-delta of positions. Each posting represents the occurrences of a term in a document, so the frequency component is the number of occurrences of the term in the document. This method also computes the sum of these frequencies over all postings in the partition.

public unpackPostings(string $postings_string) : array<string|int, mixed>
Parameters
$postings_string : string

compressed string representation of a set of postings for a term

Return values
array<string|int, mixed>

a pair [array of unpacked postings, sum of frequencies of all the postings]

updateDictionary()

For every partition between next partition and save partition, adds the posting list information to the dictionary BPlusTree. At the end of this process, next partition and save partition should be the same.

public updateDictionary([string $taking_too_long_touch = null ][, bool $till_equal = true ]) : mixed
Parameters
$taking_too_long_touch : string = null

a filename of a file to touch so its last modified time becomes the current time. In a typical Yioop crawl this is done for the CrawlConstants::crawl_status_file file to prevent Yioop's web interface from stopping the crawl because it has seen no recent progress activity on a crawl.

$till_equal : bool = true

if set to true, keeps adding partitions up to the save partition; if set to false, only adds one partition

Return values
mixed

        
