IndexDocumentBundle
in package
implements
CrawlConstants
Encapsulates a set of web page documents and an inverted word-index of terms from these documents which allow one to search for documents containing a particular word.
Interfaces, Classes, Traits and Enums
- CrawlConstants
- Shared constants and enums used by components that are involved in the crawling process
Table of Contents
- ARCHIVE_INFO_FILE = "archive_info.txt"
- File name used to store within the folder of the IndexDocumentBundle parameter/configuration information about the bundle
- DEFAULT_PARAMETERS = ["DESCRIPTION" => "", "VERSION" => self::DEFAULT_VERSION]
- Default values for the configuration parameters of an IndexDocumentBundle
- DEFAULT_VERSION = "3.2"
- The version of this IndexDocumentBundle. The lowest format number is 3.0, as prior inverted index/document stores used IndexArchiveBundles
- DICTIONARY_FOLDER = "dictionary"
- Subfolder of IndexDocumentBundle to store the btree with term => posting list information (i.e., the inverted index)
- DOC_MAP_FILENAME = "doc_map"
- Partition i in an IndexDocumentBundle has a subfolder i within self::POSITIONS_DOC_MAP_FOLDER. Within this subfolder i, self::DOC_MAP_FILENAME is the name of the file used to store the document map for the partition. The document map consists of a sequence of records associated with each doc_id of a document stored in the partition. The first record is ["POS" => $num_words, "SCORE" => floatval($global_score_for_document)]. The second record is ["POS" => $length_of_title_of_document, "SCORE" => floatval($num_description_scores)]. Here a description score is a score for the importance of a section of a document. Subsequent records list ["POS" => the length of the jth section of the document, "SCORE" => its score].
- DOCID_LEN = 24
- Length of DocIds used by this IndexDocumentBundle
- DOCID_PART_LEN = 8
- DocIds are made of three parts: hash of url, hash of document, hash of url hostname. Each of these hashes is DOCID_PART_LEN long
- DOCUMENTS_FOLDER = "documents"
- Folder used to store the partition data of this IndexDocumentBundle. These consist of .txt.gz files for each partition, which store summaries of documents and the actual documents (web pages), and .ix files, which store each doc_id and the associated offsets to its summary and actual document within the .txt.gz file
- LAST_ENTRIES_FILENAME = "last_entries"
- Name of the last entries file used to help compute difference lists for doc_map_index and position list offsets used in postings for the partition. This file is also used to track the total number of occurrences of a term in a partition
- NEXT_PARTITION_FILE = "next_partition.txt"
- The name of the file used to keep track of the integer giving the next partition with documents that can be added to this IndexDocumentBundle's dictionary, i.e., it should hold that next_partition <= save_partition
- PARTITION_FILENAMES = [self::DOC_MAP_FILENAME, self::LAST_ENTRIES_FILENAME, self::POSITIONS_FILENAME, self::POSTINGS_FILENAME]
- Names for the files which appear within a partition sub-folder
- POSITIONS_DOC_MAP_FOLDER = "positions_doc_maps"
- Name of the folder used to hold position lists and document maps. Within this folder there is a subfolder for each partition which contains a doc_map file, postings file for the docs within the partition, position lists file for those postings, and a last_entries file used in the computation of difference list for doc_map_index and position list offsets, as well as number of occurrences of terms.
- POSITIONS_FILENAME = "positions"
- Name of the file within a partition's positions_doc_maps folder used to contain the partition's position lists for all terms in the partition.
- POSTINGS_BUFFER_SIZE = 1000000
- Number of bytes of postings to buffer before writing during addPartitionPostingsDictionary()
- POSTINGS_FILENAME = "postings"
- Name of the file within a partition's positions_doc_maps folder with posting information for all terms in that partition. This consists of key value pairs term_id => posting records for all documents with that term.
- TEMP_POSTINGS_FILENAME = "temp_postings"
- Temporary name for postings from a POSTINGS_FILENAME file while they are being compressed.
- TERMID_LEN = 16
- Length of TermIds used by this IndexDocumentBundle
- $archive_info : array<string|int, mixed>
- Holds property value pairs concerning the configuration of the current IndexDocumentBundle
- $description : string
- A short text name for this IndexDocumentBundle
- $dictionary : object
- IndexDictionary for all shards in the IndexArchiveBundle. This contains entries of the form (word, num_shards with word, posting list info for the 0th shard containing the word, posting list info for the 1st shard containing the word, ...)
- $dir_name : string
- Folder name to use for this IndexDocumentBundle
- $doc_map : array<string|int, mixed>
- Associative array of docid=>doc_record pairs
- $doc_map_counter : int
- Keeps track of the number of documents present in the current partition
- $doc_map_tools : PackedTableTools
- Used to read and write data to the $doc_map array
- $documents : object
- PartitionDocumentBundle for web page documents
- $extract_phrase_time : int
- Holds the total time needed to extract phrases (sequences of adjacent words) from site descriptions for a partition
- $last_entries : array<string|int, mixed>
- Used to keep track of the previous values of posting quantities so difference lists can be computed, for example, the previous $doc_map_index and the previous position list offset. It also tracks the total number of occurrences of a term within a partition.
- $last_entries_tools : PackedTableTools
- Used to read and write data to the $last_entries array
- $next_partition_to_add : array<string|int, mixed>
- Structure containing information about the current partition
- $positions : string
- A string consisting of a concatenated sequence of term position information for each document in turn and, within this, for each term in that document.
- $postings : array<string|int, mixed>
- Associative array $term_id => posting list records for that term in the partition.
- $postings_tools : PackedTableTools
- Used to read and write data to the $postings array
- $unpack_len_map : array<string|int, mixed>
- Array of the string lengths that each of the $unpack_map codes consumes
- $unpack_map : array<string|int, mixed>
- Map from int -> three character unpack string used to unpack posting info
- __construct() : mixed
- Makes or initializes an IndexDocumentBundle with the provided parameters
- addPages() : bool
- Add the array of $pages to the documents PartitionDocumentBundle
- addPartitionPostingsDictionary() : mixed
- Adds the previously constructed inverted index $partition to the inverted index of the whole bundle
- addScoresDocMap() : mixed
- Used to add a doc_id => doc_record to the current partition's document map ($this->doc_map). A doc record records the number of words in the document, an overall length of the document, the length of its title, scores for each of the sentences included in the summary for the document, and classifier scores for each classifier that was used by the crawl.
- addTermPostingLists() : mixed
- Adds posting records associated to a document to the posting lists for a partition.
- buildInvertedIndexPartition() : mixed
- Builds an inverted index shard for a documents PartitionDocumentBundle partition.
- computeDocId() : string
- Given a $site array of information about a web page/document, uses the CrawlConstants::URL and CrawlConstants::HASH fields to compute a unique doc id for the array.
- deDeltaPostingsSumFrequencies() : int
- Within postings, DOC_MAP_INDEX and POSITION_OFFSETS to position lists are stored as delta lists (differences over previous values). This method undoes the delta encoding to restore the actual DELTA_DOC_MAP_INDEX and POSITION_OFFSETS values. It also computes the sum of the frequencies of items within the list of postings. This method is currently only used for the active partition in an index (the one whose terms haven't yet been added to the B+-tree).
- findNumSlashes() : mixed
- Finds number of '/' in the url after the hostname represented by doc_id $key.
- forceSave() : mixed
- Forces the current shard to be saved
- getArchiveInfo() : array<string|int, mixed>
- Gets the description, count of documents, and number of partitions of the document store in the supplied directory. If the file arc_description.txt exists, this is viewed as a dummy index archive for the sole purpose of allowing conversions of downloaded data such as arc files into Yioop! format.
- getCachePage() : array<string|int, mixed>
- Given the $doc_id of a document and a $partition to look for it in, returns the cached page of the document if present and [] otherwise
- getParamModifiedTime() : mixed
- Returns the last time the archive info of the bundle was modified.
- getPartitionBaseFolder() : string
- Gets the file path corresponding to the partition with index $partition
- getPostingsString() : string
- Gets the postings stored in a partition's postings file from $offset to $offset + $len and removes the 255 encoding.
- getSummary() : array<string|int, mixed>
- Given the $doc_id of a document and a $partition to look for it in, returns the document summary info if present and [] otherwise.
- getWordInfo() : array<string|int, mixed>
- Gets an array of posting list positions for each shard in the bundle $index_name for the word id $term_id
- invertOneSite() : string
- Used to create inverted index for one site and add its information to the current partition.
- isACldDocId() : mixed
- Checks if a doc_id $key is that of a Company level domain (cld) or www.cld.
- isAHostDocId() : mixed
- Checks if a doc_id $key is that of a host url.
- isAWikipediaPage() : mixed
- Checks if a doc_id $key is that of a Wikipedia page.
- isType() : bool
- Checks if a doc_id corresponds to a particular large scale type among external_link, internal_link, link (union of previous two), binary, feed, image, text, video, document (union of previous five)
- prepareIndexMap() : array<string|int, mixed>
- As pre-step to calculating the inverted index information for a partition this method groups documents and links to documents into single objects.
- setArchiveInfo() : mixed
- Sets the archive info struct for the web archive bundle associated with this bundle. This struct has fields like: DESCRIPTION (serialized store of global parameters of the crawl like seed sites, timestamp, etc).
- stopIndexing() : mixed
- Used when a crawl stops to perform final dictionary operations to produce a working stand-alone index.
- unpackPostings() : array<string|int, mixed>
- Given the postings as a string for a partition for a term, unpacks them into an array of postings, doing de-delta of doc_map_indices and de-delta of positions. Each posting represents the occurrences of a term in a document, so the frequency component is the number of occurrences of the term in the document. This method also computes the sum of these frequencies over all postings in the partition.
- updateDictionary() : mixed
- For every partition between next partition and save partition, adds the posting list information to the dictionary BPlusTree. At the end of this process next partition and save partition should be the same
Constants
ARCHIVE_INFO_FILE
File name used to store within the folder of the IndexDocumentBundle parameter/configuration information about the bundle
public
mixed
ARCHIVE_INFO_FILE
= "archive_info.txt"
DEFAULT_PARAMETERS
Default values for the configuration parameters of an IndexDocumentBundle
public
mixed
DEFAULT_PARAMETERS
= ["DESCRIPTION" => "", "VERSION" => self::DEFAULT_VERSION]
DEFAULT_VERSION
The version of this IndexDocumentBundle. The lowest format number is 3.0, as prior inverted index/document stores used IndexArchiveBundles
public
mixed
DEFAULT_VERSION
= "3.2"
DICTIONARY_FOLDER
Subfolder of IndexDocumentBundle to store the btree with term => posting list information (i.e., the inverted index)
public
mixed
DICTIONARY_FOLDER
= "dictionary"
DOC_MAP_FILENAME
Partition i in an IndexDocumentBundle has a subfolder i within self::POSITIONS_DOC_MAP_FOLDER. Within this subfolder i, self::DOC_MAP_FILENAME is the name of the file used to store the document map for the partition. The document map consists of a sequence of records associated with each doc_id of a document stored in the partition. The first record is ["POS" => $num_words, "SCORE" => floatval($global_score_for_document)]. The second record is ["POS" => $length_of_title_of_document, "SCORE" => floatval($num_description_scores)]. Here a description score is a score for the importance of a section of a document. Subsequent records list ["POS" => the length of the jth section of the document, "SCORE" => its score].
public
mixed
DOC_MAP_FILENAME
= "doc_map"
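The record sequence described for a single doc_id can be sketched as follows. This is only an illustration of the POS/SCORE layout: the function name buildDocMapRecords() and the sample numbers are hypothetical, and the sketch does not reproduce how PackedTableTools actually packs the records on disk.

```php
<?php
/**
 * Hypothetical helper: builds the sequence of POS/SCORE records for one
 * document, following the layout described for the doc_map file.
 *
 * @param int $num_words number of words in the document
 * @param float $global_score overall score of the document
 * @param int $title_length length of the document's title
 * @param array $sections list of [section length, section score] pairs
 * @return array sequence of ["POS" => ..., "SCORE" => ...] records
 */
function buildDocMapRecords(int $num_words, float $global_score,
    int $title_length, array $sections): array
{
    $records = [
        // first record: word count and global score
        ["POS" => $num_words, "SCORE" => floatval($global_score)],
        // second record: title length and number of description scores
        ["POS" => $title_length, "SCORE" => floatval(count($sections))],
    ];
    // subsequent records: one per scored section of the document
    foreach ($sections as [$length, $score]) {
        $records[] = ["POS" => $length, "SCORE" => floatval($score)];
    }
    return $records;
}

// made-up sample document with two scored sections
$records = buildDocMapRecords(120, 3.5, 7, [[40, 0.9], [73, 0.4]]);
```

Storing the number of description scores in the second record lets a reader know how many section records follow without a separate length field.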
DOCID_LEN
Length of DocIds used by this IndexDocumentBundle
public
mixed
DOCID_LEN
= 24
DOCID_PART_LEN
DocIds are made of three parts: hash of url, hash of document, hash of url hostname. Each of these hashes is DOCID_PART_LEN long
public
mixed
DOCID_PART_LEN
= 8
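A minimal sketch of how three DOCID_PART_LEN-byte hashes add up to DOCID_LEN bytes. The hash function used here (truncated SHA-256) and the helper names sketchPart() and sketchDocId() are assumptions made for illustration; the class's own computeDocId() may use a different hash and may encode additional information, such as a document type, within the id.

```php
<?php
const DOCID_PART_LEN = 8;
const DOCID_LEN = 24;

/**
 * Hypothetical stand-in hash: the first DOCID_PART_LEN raw bytes of SHA-256.
 */
function sketchPart(string $input): string
{
    return substr(hash('sha256', $input, true), 0, DOCID_PART_LEN);
}

/**
 * Concatenates hash of url, hash of document, and hash of url hostname.
 */
function sketchDocId(string $url, string $document): string
{
    // (string) turns a missing or unparsable host into ""
    $host = (string) parse_url($url, PHP_URL_HOST);
    return sketchPart($url) . sketchPart($document) . sketchPart($host);
}

$doc_id = sketchDocId("https://www.example.com/a/b", "<html>hello</html>");
// split the id back into its three fixed-width parts
$parts = str_split($doc_id, DOCID_PART_LEN);
```

Because every part has a fixed width, the host portion of two ids can be compared with a simple substr() on the last DOCID_PART_LEN bytes.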
DOCUMENTS_FOLDER
Folder used to store the partition data of this IndexDocumentBundle. These consist of .txt.gz files for each partition, which store summaries of documents and the actual documents (web pages), and .ix files, which store each doc_id and the associated offsets to its summary and actual document within the .txt.gz file
public
mixed
DOCUMENTS_FOLDER
= "documents"
LAST_ENTRIES_FILENAME
Name of the last entries file used to help compute difference lists for doc_map_index and position list offsets used in postings for the partition. This file is also used to track the total number of occurrences of a term in a partition
public
mixed
LAST_ENTRIES_FILENAME
= "last_entries"
NEXT_PARTITION_FILE
The name of the file used to keep track of the integer giving the next partition with documents that can be added to this IndexDocumentBundle's dictionary, i.e., it should hold that next_partition <= save_partition
public
mixed
NEXT_PARTITION_FILE
= "next_partition.txt"
PARTITION_FILENAMES
Names for the files which appear within a partition sub-folder
public
mixed
PARTITION_FILENAMES
= [self::DOC_MAP_FILENAME, self::LAST_ENTRIES_FILENAME, self::POSITIONS_FILENAME, self::POSTINGS_FILENAME]
POSITIONS_DOC_MAP_FOLDER
Name of the folder used to hold position lists and document maps. Within this folder there is a subfolder for each partition which contains a doc_map file, postings file for the docs within the partition, position lists file for those postings, and a last_entries file used in the computation of difference list for doc_map_index and position list offsets, as well as number of occurrences of terms.
public
mixed
POSITIONS_DOC_MAP_FOLDER
= "positions_doc_maps"
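The folder layout described above can be sketched by composing the constants into paths. The function sketchPartitionFiles() is hypothetical, and both the "/" separator and the absence of file extensions are assumptions; the class's own getPartitionBaseFolder() is the authoritative source for the real partition path.

```php
<?php
const POSITIONS_DOC_MAP_FOLDER = "positions_doc_maps";
const PARTITION_FILENAMES = ["doc_map", "last_entries", "positions",
    "postings"];

/**
 * Hypothetical helper: lists the files expected in the subfolder of
 * partition $partition beneath POSITIONS_DOC_MAP_FOLDER.
 */
function sketchPartitionFiles(string $dir_name, int $partition): array
{
    $base = $dir_name . "/" . POSITIONS_DOC_MAP_FOLDER . "/" . $partition;
    $paths = [];
    foreach (PARTITION_FILENAMES as $name) {
        $paths[$name] = $base . "/" . $name;
    }
    return $paths;
}

// "/tmp/bundle" is a made-up bundle folder
$paths = sketchPartitionFiles("/tmp/bundle", 3);
```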
POSITIONS_FILENAME
Name of the file within a partition's positions_doc_maps folder used to contain the partition's position lists for all terms in the partition.
public
mixed
POSITIONS_FILENAME
= "positions"
POSTINGS_BUFFER_SIZE
Number of bytes of postings to buffer before writing during addPartitionPostingsDictionary()
public
mixed
POSTINGS_BUFFER_SIZE
= 1000000
POSTINGS_FILENAME
Name of the file within a partition's positions_doc_maps folder with posting information for all terms in that partition. This consists of key value pairs term_id => posting records for all documents with that term.
public
mixed
POSTINGS_FILENAME
= "postings"
TEMP_POSTINGS_FILENAME
Temporary name for postings from a POSTINGS_FILENAME file while they are being compressed.
public
mixed
TEMP_POSTINGS_FILENAME
= "temp_postings"
TERMID_LEN
Length of TermIds used by this IndexDocumentBundle
public
mixed
TERMID_LEN
= 16
Properties
$archive_info
Holds property value pairs concerning the configuration of the current IndexDocumentBundle
public
array<string|int, mixed>
$archive_info
$description
A short text name for this IndexDocumentBundle
public
string
$description
$dictionary
IndexDictionary for all shards in the IndexArchiveBundle. This contains entries of the form (word, num_shards with word, posting list info for the 0th shard containing the word, posting list info for the 1st shard containing the word, ...)
public
object
$dictionary
$dir_name
Folder name to use for this IndexDocumentBundle
public
string
$dir_name
$doc_map
Associative array of docid=>doc_record pairs
public
array<string|int, mixed>
$doc_map
$doc_map_counter
Keeps track of the number of documents present in the current partition
public
int
$doc_map_counter
$doc_map_tools
Used to read and write data to the $doc_map array
public
PackedTableTools
$doc_map_tools
$documents
PartitionDocumentBundle for web page documents
public
object
$documents
$extract_phrase_time
Holds the total time needed to extract phrases (sequences of adjacent words) from site descriptions for a partition
public
int
$extract_phrase_time
$last_entries
Used to keep track of the previous values of posting quantities so difference lists can be computed, for example, the previous $doc_map_index and the previous position list offset. It also tracks the total number of occurrences of a term within a partition.
public
array<string|int, mixed>
$last_entries
$last_entries_tools
Used to read and write data to the $last_entries array
public
PackedTableTools
$last_entries_tools
$next_partition_to_add
Structure containing information about the current partition
public
array<string|int, mixed>
$next_partition_to_add
$positions
A string consisting of a concatenated sequence of term position information for each document in turn and, within this, for each term in that document.
public
string
$positions
$postings
Associative array $term_id => posting list records for that term in the partition.
public
array<string|int, mixed>
$postings
$postings_tools
Used to read and write data to the $postings array
public
PackedTableTools
$postings_tools
$unpack_len_map
Array of the string lengths that each of the $unpack_map codes consumes
public
array<string|int, mixed>
$unpack_len_map
$unpack_map
Map from int -> three character unpack string used to unpack posting info
public
array<string|int, mixed>
$unpack_map
Methods
__construct()
Makes or initializes an IndexDocumentBundle with the provided parameters
public
__construct(string $dir_name[, bool $read_only_archive = true ][, string $description = null ][, int $num_docs_per_partition = CNUM_DOCS_PER_PARTITION ][, int $max_keys = BPlusTree::MAX_KEYS ]) : mixed
Parameters
- $dir_name : string
-
folder name to store this bundle
- $read_only_archive : bool = true
-
whether to open archive only for reading or reading and writing
- $description : string = null
-
a text name/serialized info about this IndexDocumentBundle
- $num_docs_per_partition : int = CNUM_DOCS_PER_PARTITION
-
the number of documents to be stored in a single partition
- $max_keys : int = BPlusTree::MAX_KEYS
-
the maximum number of keys used by the BPlusTree used for the inverted index
Return values
mixed
addPages()
Add the array of $pages to the documents PartitionDocumentBundle
public
addPages(array<string|int, mixed> $pages, int $visited_urls_count) : bool
Parameters
- $pages : array<string|int, mixed>
-
data to store
- $visited_urls_count : int
-
number to add to the count of visited urls (visited urls is a smaller number than the total count of objects stored in the index).
Return values
bool —success or failure of adding the pages
addPartitionPostingsDictionary()
Adds the previously constructed inverted index $partition to the inverted index of the whole bundle
public
addPartitionPostingsDictionary([int $partition = -1 ][, string $taking_too_long_touch = null ]) : mixed
Parameters
- $partition : int = -1
-
which partition's inverted index to add; by default, the current save partition
- $taking_too_long_touch : string = null
-
a filename of a file to touch so its last modified time becomes the current time. In a typical Yioop crawl this is done for the CrawlConstants::crawl_status_file file to prevent Yioop's web interface from stopping the crawl because it has seen no recent progress activity on a crawl.
Return values
mixed
addScoresDocMap()
Used to add a doc_id => doc_record to the current partition's document map ($this->doc_map). A doc record records the number of words in the document, an overall length of the document, the length of its title, scores for each of the sentences included in the summary for the document, and classifier scores for each classifier that was used by the crawl.
public
addScoresDocMap(string $doc_id, int $num_words, float $score, int $host_keywords_end_pos, int $title_end_pos, int $path_keywords_end_pos, array<string|int, mixed> $description_scores, array<string|int, mixed> $user_ranks) : mixed
Parameters
- $doc_id : string
-
new document id to add a record for
- $num_words : int
-
number of terms in the document associated with the doc-id
- $score : float
-
overall score for the importance of this document
- $host_keywords_end_pos : int
-
end of the portion of the document summary containing terms coming from the hostname
- $title_end_pos : int
-
end of the portion of the document summary containing terms in the title
- $path_keywords_end_pos : int
-
length of the portion of the document summary containing terms in the url path
- $description_scores : array<string|int, mixed>
-
pairs of the form (length of summary portion, score for that portion)
- $user_ranks : array<string|int, mixed>
-
for each user defined classifier for this crawl the float score of the classifier on this document
Return values
mixed
addTermPostingLists()
Adds posting records associated to a document to the posting lists for a partition.
public
addTermPostingLists(int $position_offset, int $doc_length, array<string|int, mixed> $word_lists, array<string|int, mixed> $meta_ids, int $doc_map_index) : mixed
Parameters
- $position_offset : int
-
number of header bytes that might be used before including any position data in the file that positions will eventually be stored.
- $doc_length : int
-
length of document in terms for the document for which we are adding posting data.
- $word_lists : array<string|int, mixed>
-
term => positions within current document of that term for the document whose posting data we are adding
- $meta_ids : array<string|int, mixed>
-
meta terms associated with the document we are adding. An example, meta term might be "media:news"
- $doc_map_index : int
-
which document within the partition is the one we are adding. I.e., 5 would mean there were 5 earlier documents whose postings we have already added.
Return values
mixed
buildInvertedIndexPartition()
Builds an inverted index shard for a documents PartitionDocumentBundle partition.
public
buildInvertedIndexPartition([int $partition = -1 ][, string $taking_too_long_touch = null ][, mixed $just_stats = false ]) : mixed
Parameters
- $partition : int = -1
-
partition to build the index for
- $taking_too_long_touch : string = null
-
a filename of a file to touch so its last modified time becomes the current time. In a typical Yioop crawl this is done for the CrawlConstants::crawl_status_file file to prevent Yioop's web interface from stopping the crawl because it has seen no recent progress activity on a crawl.
- $just_stats : mixed = false
Return values
mixed —whether job executed to completion (true or false) if !$just_stats, otherwise, an array with NUM_DOCS, NUM_LINKS, and TERM_STATISTICS (the latter having term frequency info)
computeDocId()
Given a $site array of information about a web page/document, uses the CrawlConstants::URL and CrawlConstants::HASH fields to compute a unique doc id for the array.
public
static computeDocId(array<string|int, mixed> $site) : string
Parameters
- $site : array<string|int, mixed>
-
site to compute doc_id for
Return values
string —the computed doc_id
deDeltaPostingsSumFrequencies()
Within postings, DOC_MAP_INDEX and POSITION_OFFSETS to position lists are stored as delta lists (differences over previous values). This method undoes the delta encoding to restore the actual DELTA_DOC_MAP_INDEX and POSITION_OFFSETS values. It also computes the sum of the frequencies of items within the list of postings. This method is currently only used for the active partition in an index (the one whose terms haven't yet been added to the B+-tree).
public
deDeltaPostingsSumFrequencies(array<string|int, mixed> &$postings) : int
Parameters
- $postings : array<string|int, mixed>
-
a reference to an array of posting lists for a term (this will be changed by this method)
Return values
int —sum of the frequencies of term occurrences as given by the above postings
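The de-delta step can be illustrated with a small, self-contained sketch. The function name sketchDeDelta(), the array keys POSITION_OFFSET and FREQUENCY, and the convention that the first posting is stored as a difference from 0 are assumptions made for this example; they are not taken from the class's source.

```php
<?php
/**
 * Hypothetical sketch: converts delta-encoded postings into absolute values
 * in place and returns the sum of their frequencies.
 *
 * @param array &$postings postings with delta-encoded DOC_MAP_INDEX and
 *      POSITION_OFFSET fields (modified by this function)
 * @return int sum of the FREQUENCY fields
 */
function sketchDeDelta(array &$postings): int
{
    $doc_map_index = 0;
    $position_offset = 0;
    $sum_frequencies = 0;
    foreach ($postings as &$posting) {
        // a running sum restores the absolute value from each difference
        $doc_map_index += $posting["DOC_MAP_INDEX"];
        $position_offset += $posting["POSITION_OFFSET"];
        $posting["DOC_MAP_INDEX"] = $doc_map_index;
        $posting["POSITION_OFFSET"] = $position_offset;
        $sum_frequencies += $posting["FREQUENCY"];
    }
    unset($posting); // break the reference left over from the loop
    return $sum_frequencies;
}

// made-up delta-encoded postings for one term
$postings = [
    ["DOC_MAP_INDEX" => 2, "POSITION_OFFSET" => 10, "FREQUENCY" => 3],
    ["DOC_MAP_INDEX" => 5, "POSITION_OFFSET" => 12, "FREQUENCY" => 1],
    ["DOC_MAP_INDEX" => 1, "POSITION_OFFSET" => 4, "FREQUENCY" => 2],
];
$total = sketchDeDelta($postings);
```

Delta lists keep the stored integers small, which generally helps variable-length integer encodings use fewer bytes per posting.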
findNumSlashes()
Finds number of '/' in the url after the hostname represented by doc_id $key.
public
static findNumSlashes(string $key) : mixed
Parameters
- $key : string
-
doc_id for which to find the '/' count
Return values
mixed
forceSave()
Forces the current shard to be saved
public
forceSave() : mixed
Return values
mixed
getArchiveInfo()
Gets the description, count of documents, and number of partitions of the document store in the supplied directory. If the file arc_description.txt exists, this is viewed as a dummy index archive for the sole purpose of allowing conversions of downloaded data such as arc files into Yioop! format.
public
static getArchiveInfo(string $dir_name) : array<string|int, mixed>
Parameters
- $dir_name : string
-
path to a directory containing a documents IndexDocumentBundle
Return values
array<string|int, mixed> —summary of the given archive
getCachePage()
Given the $doc_id of a document and a $partition to look for it in, returns the cached page of the document if present and [] otherwise
public
getCachePage(string $doc_id, int $partition) : array<string|int, mixed>
Parameters
- $doc_id : string
-
of document to look up
- $partition : int
-
to look for document in
Return values
array<string|int, mixed> —desired page cache or [] if look up failed
getParamModifiedTime()
Returns the last time the archive info of the bundle was modified.
public
static getParamModifiedTime(string $dir_name) : mixed
Parameters
- $dir_name : string
-
folder with archive bundle
Return values
mixed
getPartitionBaseFolder()
Gets the file path corresponding to the partition with index $partition
public
getPartitionBaseFolder(int $partition) : string
Parameters
- $partition : int
-
desired partition index
Return values
string —file path to where this partition's index data is stored (not the original documents, which are stored in the PartitionDocumentBundle)
getPostingsString()
Gets the postings stored in a partition's postings file from $offset to $offset + $len and removes the 255 encoding.
public
getPostingsString(int $partition, int $offset, int $len) : string
Parameters
- $partition : int
-
partition to retrieve posting from
- $offset : int
-
byte offset into the partition's postings file at which to look for them
- $len : int
-
length of the posting list to retrieve.
Return values
string —encoded posting list data: the vbyte-encoded number of postings, followed by the posting data in PackedTableTools format
getSummary()
Given the $doc_id of a document and a $partition to look for it in, returns the document summary info if present and [] otherwise.
public
getSummary(string $doc_id, int $partition) : array<string|int, mixed>
Parameters
- $doc_id : string
-
of document to look up
- $partition : int
-
to look for document in
Return values
array<string|int, mixed> —desired summary or [] if look up failed
getWordInfo()
Gets an array of posting list positions for each shard in the bundle $index_name for the word id $term_id
public
getWordInfo(string $term_id[, int $threshold = -1 ], mixed $offset[, mixed $num_partitions = -1 ][, bool $with_remaining_total = false ]) : array<string|int, mixed>
Parameters
- $term_id : string
-
id of phrase or word to look up in bundle dictionary
- $threshold : int = -1
-
after the number of results exceeds this amount stop looking for more dictionary entries.
- $offset : mixed
- $num_partitions : mixed = -1
- $with_remaining_total : bool = false
-
whether to also return the total number of postings found
Return values
array<string|int, mixed> —either [total, sequence of four tuples] or sequence of four tuples: (index_shard generation, posting_list_offset, length, exact id that match $term_id)
invertOneSite()
Used to create inverted index for one site and add its information to the current partition.
public
invertOneSite(array<string|int, mixed> $site, array<string|int, mixed> $url_info, int &$link_cnt) : string
Parameters
- $site : array<string|int, mixed>
-
site to invert
- $url_info : array<string|int, mixed>
-
collection of urls and hashes of documents which map to the same document
- $link_cnt : int
-
current count of number of links discovered so far
Return values
string —$site_url canonical url for site
isACldDocId()
Checks if a doc_id $key is that of a Company level domain (cld) or www.cld.
public
static isACldDocId(string $key) : mixed
I.e., a url https://yahoo.com/ or https://www.yahoo.com/ as opposed to https://foo.yahoo.com/
Parameters
- $key : string
-
doc_id to check
Return values
mixed
isAHostDocId()
Checks if a doc_id $key is that of a host url.
public
static isAHostDocId(string $key) : mixed
I.e., a url https://www.yahoo.com/ as opposed to https://www.yahoo.com/foo
Parameters
- $key : string
-
doc_id to check
Return values
mixed
isAWikipediaPage()
Checks if a doc_id $key is that of a Wikipedia page.
public
static isAWikipediaPage(string $key) : mixed
Parameters
- $key : string
-
to check if Wikipedia page or not
Return values
mixed
isType()
Checks if a doc_id corresponds to a particular large scale type among external_link, internal_link, link (union of previous two), binary, feed, image, text, video, document (union of previous five)
public
static isType(string $key, mixed $types) : bool
Parameters
- $key : string
-
doc_id to check
- $types : mixed
Return values
bool —true if the doc_id corresponds to one of the given types
prepareIndexMap()
As pre-step to calculating the inverted index information for a partition this method groups documents and links to documents into single objects.
public
prepareIndexMap(int $partition[, array<string|int, mixed> $test_index = [] ]) : array<string|int, mixed>
It also does simple deduplication of documents that have the same hash. It then returns an array of the grouped document data. Grouping is done by giving a score to each document based on (number of doc in index - order doc added). For two entries with the same hash_url, a document will be chosen over a link as the representative; otherwise, the one with higher score will be chosen as the representative. The representative document is given the sum of the scores of its constituents. A second phase where documents are grouped by hash of the text body is also done. Finally, the returned documents are sorted by their scores. So the order of documents from this process is roughly in the order of importance.
Parameters
- $partition : int
-
index of partition to do deduplication for in the case that test index is empty
- $test_index : array<string|int, mixed> = []
-
is non-empty only when testing what this method does, in which case it should consist of an array of $doc_id => string representing a possible record for that doc. As deduplication is done entirely based on components of the doc_id (hash_url, doc_type, hash_doc, hash_host), the string doesn't matter too much.
Return values
array<string|int, mixed> —groups doc_id => records associated with that doc_id
setArchiveInfo()
Sets the archive info struct for the web archive bundle associated with this bundle. This struct has fields like: DESCRIPTION (serialized store of global parameters of the crawl like seed sites, timestamp, etc).
public
static setArchiveInfo(string $dir_name, array<string|int, mixed> $update_info) : mixed
Parameters
- $dir_name : string
-
folder with archive bundle
- $update_info : array<string|int, mixed>
-
struct with above fields
Return values
mixed
stopIndexing()
Used when a crawl stops to perform final dictionary operations to produce a working stand-alone index.
public
stopIndexing() : mixed
Return values
mixed
unpackPostings()
Given the postings as a string for a partition for a term, unpacks them into an array of postings, doing de-delta of doc_map_indices and de-delta of positions. Each posting represents the occurrences of a term in a document, so the frequency component is the number of occurrences of the term in the document. This method also computes the sum of these frequencies over all postings in the partition.
public
unpackPostings(string $postings_string) : array<string|int, mixed>
Parameters
- $postings_string : string
-
compressed string representation of a set of postings for a term
Return values
array<string|int, mixed> —a pair [array of unpacked postings, sum of frequencies of all the postings]
updateDictionary()
For every partition between next partition and save partition, adds the posting list information to the dictionary BPlusTree. At the end of this process next partition and save partition should be the same
public
updateDictionary([string $taking_too_long_touch = null ][, bool $till_equal = true ]) : mixed
Parameters
- $taking_too_long_touch : string = null
-
a filename of a file to touch so its last modified time becomes the current time. In a typical Yioop crawl this is done for the CrawlConstants::crawl_status_file file to prevent Yioop's web interface from stopping the crawl because it has seen no recent progress activity on a crawl.
- $till_equal : bool = true
-
if set to true, keeps adding partitions up to the save partition; if set to false, adds only one partition
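The relationship between the next partition, the save partition, and $till_equal can be sketched as a simple loop. The function sketchUpdateDictionary() and its callback parameter are hypothetical stand-ins; the real method reads and writes the partition counters from the bundle's own files and adds postings to the dictionary BPlusTree rather than invoking a callback.

```php
<?php
/**
 * Hypothetical sketch of the partition-advance loop.
 *
 * @param int $next_partition first partition not yet in the dictionary
 * @param int $save_partition partition currently being written to
 * @param callable $add_partition called with each partition index to add
 * @param bool $till_equal if false, stop after adding a single partition
 * @return int the new value of next_partition
 */
function sketchUpdateDictionary(int $next_partition, int $save_partition,
    callable $add_partition, bool $till_equal = true): int
{
    while ($next_partition < $save_partition) {
        $add_partition($next_partition);
        $next_partition++;
        if (!$till_equal) {
            break;
        }
    }
    return $next_partition;
}

// record which partitions would have been added
$added = [];
$record = function (int $partition) use (&$added): void {
    $added[] = $partition;
};
$result_all = sketchUpdateDictionary(2, 5, $record);
$result_one = sketchUpdateDictionary(2, 5, $record, false);
```

With $till_equal left at true the loop ends with the two counters equal, which matches the invariant next_partition <= save_partition stated for NEXT_PARTITION_FILE.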