Yioop_V9.5_Source_Code

IndexDocumentBundle

in package Application

implements CrawlConstants

Encapsulates a set of web page documents and an inverted word-index of terms from these documents, which allows one to search for documents containing a particular word.

Interfaces, Classes, Traits and Enums

CrawlConstants: Shared constants and enums used by components that are involved in the crawling process

ARCHIVE_INFO_FILE = "archive_info.txt": Name of the file, stored within the folder of the IndexDocumentBundle, that holds parameter/configuration information about the bundle
DEFAULT_PARAMETERS = ["DESCRIPTION" => "", "VERSION" => self::DEFAULT_VERSION]: Default values for the configuration parameters of an IndexDocumentBundle
DEFAULT_VERSION = "3.2": The version of this IndexDocumentBundle. The lowest format number is 3.0, as prior inverted index/document stores used IndexArchiveBundles
DICTIONARY_FOLDER = "dictionary": Subfolder of IndexDocumentBundle to store the btree with term => posting list information (i.e., the inverted index)
DOC_MAP_FILENAME = "doc_map": Partition i in an IndexDocumentBundle has a subfolder i within self::POSITIONS_DOC_MAP_FOLDER. Within this subfolder i, self::DOC_MAP_FILENAME is the name of the file used to store the document map for the partition. The document map consists of a sequence of records associated with each doc_id of a document stored in the partition. The first record is ["POS" => $num_words, "SCORE" => floatval($global_score_for_document)]. The second record is ["POS" => $length_of_title_of_document, "SCORE" => floatval($num_description_scores)]. Here a description score is a score for the importance of a section of a document. Subsequent records list ["POS" => the length of the jth section of the document, "SCORE" => its score].
DOCID_LEN = 24: Length of DocIds used by this IndexDocumentBundle
DOCID_PART_LEN = 8: DocIds are made of three parts: hash of url, hash of document, hash of url hostname. Each of these hashes is DOCID_PART_LEN long
DOCUMENTS_FOLDER = "documents": Folder used to store the partition data of this IndexDocumentBundle. This will consist of .txt.gz files for each partition, which are used to store summaries of documents and actual documents (web pages), and .ix files, which are used to store each doc_id and the associated offsets to its summary and actual document within the .txt.gz file
LAST_ENTRIES_FILENAME = "last_entries": Name of the last entries file used to help compute difference lists for doc_map_index and position list offsets used in postings for the partition. This file is also used to track the total number of occurrences of a term in a partition
NEXT_PARTITION_FILE = "next_partition.txt": The name of the file used to keep track of the integer giving the next partition with documents that can be added to this IndexDocumentBundle's dictionary. I.e., it should be that next_partition <= save_partition
PARTITION_FILENAMES = [self::DOC_MAP_FILENAME, self::LAST_ENTRIES_FILENAME, self::POSITIONS_FILENAME, self::POSTINGS_FILENAME]: Names for the files which appear within a partition sub-folder
POSITIONS_DOC_MAP_FOLDER = "positions_doc_maps": Name of the folder used to hold position lists and document maps. Within this folder there is a subfolder for each partition which contains a doc_map file, postings file for the docs within the partition, position lists file for those postings, and a last_entries file used in the computation of difference list for doc_map_index and position list offsets, as well as number of occurrences of terms.
POSITIONS_FILENAME = "positions": Name of the file within a partition's positions_doc_maps folder used to contain the partition's position lists for all terms in the partition.
POSTINGS_BUFFER_SIZE = 1000000: How many bytes of postings to buffer before writing when addPartitionPostingsDictionary() is run
POSTINGS_FILENAME = "postings": Name of the file within a partition's positions_doc_maps folder with posting information for all terms in that partition. This consists of key value pairs term_id => posting records for all documents with that term.
TEMP_POSTINGS_FILENAME = "temp_postings": Temporary name for postings from a POSTINGS_FILENAME file while they are being compressed.
TERMID_LEN = 16: Length of TermIds used by this IndexDocumentBundle
$archive_info : array<string|int, mixed>: Holds property value pairs concerning the configuration of the current IndexDocumentBundle
$description : string: A short text name for this IndexDocumentBundle
$dictionary : object: IndexDictionary for all shards in the IndexArchiveBundle. This contains entries of the form (word, num_shards with word, posting list info 0th shard containing the word, posting list info 1st shard containing the word, ...)
$dir_name : string: Folder name to use for this IndexDocumentBundle
$doc_map : array<string|int, mixed>: Associative array of docid=>doc_record pairs
$doc_map_counter : int: Keeps track of the number of documents present in the current partition
$doc_map_tools : PackedTableTools: Used to read and write data to the $doc_map array
$documents : object: PartitionDocumentBundle for web page documents
$extract_phrase_time : int: Holds the total time needed to extract phrases (sequences of adjacent words) from site descriptions for a partition
$last_entries : array<string|int, mixed>: Used to keep track of the previous values of posting quantities so difference lists can be computed. For example, the previous $doc_map_index and the previous position list offset. It also tracks the total number of occurrences of a term within a partition.
$last_entries_tools : PackedTableTools: Used to read and write data to the $last_entries array
$next_partition_to_add : array<string|int, mixed>: Structure containing info about the current partition
$positions : string: A string consisting of a concatenated sequence of term position information for each document in turn and, within this, for each term in that document.
$postings : array<string|int, mixed>: Associative array $term_id => posting list records for that term in the partition.
$postings_tools : PackedTableTools: Used to read and write data to the $postings array
$unpack_len_map : array<string|int, mixed>: Array of the string lengths that each of $unpack_map's codes consumes
$unpack_map : array<string|int, mixed>: Map from int -> three character unpack string used to unpack posting info
__construct() : mixed: Makes or initializes an IndexDocumentBundle with the provided parameters
addPages() : bool: Add the array of $pages to the documents PartitionDocumentBundle
addPartitionPostingsDictionary() : mixed: Adds the previously constructed inverted index $partition to the inverted index of the whole bundle
addScoresDocMap() : mixed: Used to add a doc_id => doc_record to the current partition's document map ($this->doc_map). A doc record records the number of words in the document, an overall length of the document, the length of its title, scores for each of the sentences included in the summary for the document, and classifier scores for each classifier that was used by the crawl.
addTermPostingLists() : mixed: Adds posting records associated to a document to the posting lists for a partition.
buildInvertedIndexPartition() : mixed: Builds an inverted index shard for a documents PartitionDocumentBundle partition.
computeDocId() : string: Given a $site array of information about a web page/document, uses the CrawlConstants::URL and CrawlConstants::HASH fields to compute a unique doc id for the array.
deDeltaPostingsSumFrequencies() : int: Within postings, DOC_MAP_INDEX and POSITION_OFFSETS to position lists are stored as delta lists (differences over previous values). This method undoes the delta encoding to restore the actual DELTA_DOC_MAP_INDEX and POSITION_OFFSETS values. It also computes the sum of the frequencies of items within the list of postings. This method is currently only used for the active partition in an index (the one whose terms haven't yet been added to the B+-tree).
findNumSlashes() : mixed: Finds number of '/' in the url after the hostname represented by doc_id $key.
forceSave() : mixed: Forces the current shard to be saved
getArchiveInfo() : array<string|int, mixed>: Gets the description, count of documents, and number of partitions of the document store in the supplied directory. If the file arc_description.txt exists, this is viewed as a dummy index archive for the sole purpose of allowing conversions of downloaded data such as arc files into Yioop! format.
getCachePage() : array<string|int, mixed>: Given the $doc_id of a document and a $partition to look for it in, returns the cached page of the document if present and [] otherwise
getParamModifiedTime() : mixed: Returns the last time the archive info of the bundle was modified.
getPartitionBaseFolder() : string: Gets the file path corresponding to the partition with index $partition
getPostingsString() : string: Gets the postings stored in the postings file of a partition from $offset to $offset + $len and removes the 255 encoding.
getSummary() : array<string|int, mixed>: Given the $doc_id of a document and a $partition to look for it in, returns the document summary info if present and [] otherwise.
getWordInfo() : array<string|int, mixed>: Gets an array of posting list positions for each shard in the bundle $index_name for the word id $term_id
invertOneSite() : string: Used to create inverted index for one site and add its information to the current partition.
isACldDocId() : mixed: Checks if a doc_id $key is that of a Company level domain (cld) or www.cld.
isAHostDocId() : mixed: Checks if a doc_id $key is that of a host url.
isAWikipediaPage() : mixed: Checks if a doc_id $key is that of a Wikipedia page.
isType() : bool: Checks if a doc_id corresponds to a particular large scale type among external_link, internal_link, link (union of previous two), binary, feed, image, text, video, document (union of previous five)
prepareIndexMap() : array<string|int, mixed>: As pre-step to calculating the inverted index information for a partition this method groups documents and links to documents into single objects.
setArchiveInfo() : mixed: Sets the archive info struct for the web archive bundle associated with this bundle. This struct has fields like: DESCRIPTION (serialized store of global parameters of the crawl like seed sites, timestamp, etc).
stopIndexing() : mixed: Used when a crawl stops to perform final dictionary operations to produce a working stand-alone index.
unpackPostings() : array<string|int, mixed>: Given the postings for a term in a partition as a string, unpacks them into an array of postings, doing de-delta of doc_map_indices and de-delta of positions. Each posting represents the occurrences of a term in a document, so the frequency component is the number of occurrences of the term in the document. This method also computes the sum of these frequencies over all postings in the partition.
updateDictionary() : mixed: For every partition between next partition and save partition, adds the posting list information to the dictionary BPlusTree. At the end of this process, next partition and save partition should be the same.

ARCHIVE_INFO_FILE

Name of the file, stored within the folder of the IndexDocumentBundle, that holds parameter/configuration information about the bundle


    public
        mixed
    ARCHIVE_INFO_FILE
    = "archive_info.txt"

DEFAULT_PARAMETERS

Default values for the configuration parameters of an IndexDocumentBundle


    public
        mixed
    DEFAULT_PARAMETERS
    = ["DESCRIPTION" => "", "VERSION" => self::DEFAULT_VERSION]

DEFAULT_VERSION

The version of this IndexDocumentBundle. The lowest format number is 3.0, as prior inverted index/document stores used IndexArchiveBundles


    public
        mixed
    DEFAULT_VERSION
    = "3.2"

DICTIONARY_FOLDER

Subfolder of IndexDocumentBundle to store the btree with term => posting list information (i.e., the inverted index)


    public
        mixed
    DICTIONARY_FOLDER
    = "dictionary"

DOC_MAP_FILENAME

Partition i in an IndexDocumentBundle has a subfolder i within self::POSITIONS_DOC_MAP_FOLDER. Within this subfolder i, self::DOC_MAP_FILENAME is the name of the file used to store the document map for the partition. The document map consists of a sequence of records associated with each doc_id of a document stored in the partition. The first record is ["POS" => $num_words, "SCORE" => floatval($global_score_for_document)]. The second record is ["POS" => $length_of_title_of_document, "SCORE" => floatval($num_description_scores)]. Here a description score is a score for the importance of a section of a document. Subsequent records list ["POS" => the length of the jth section of the document, "SCORE" => its score].


    public
        mixed
    DOC_MAP_FILENAME
    = "doc_map"
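
As a rough illustration of the record layout described above, the following sketch builds the sequence of records for one document as plain PHP arrays. The helper buildDocMapRecords() and its parameter names are hypothetical and not part of Yioop; the actual class stores these records in packed form via PackedTableTools rather than as nested arrays.

```php
<?php
/**
 * Hypothetical helper (not part of Yioop): builds the list of
 * ["POS" => ..., "SCORE" => ...] records described for one doc_id.
 *
 * @param int $num_words number of words in the document
 * @param float $global_score overall score for the document
 * @param int $title_length length of the document's title
 * @param array $description_scores pairs [section length, section score]
 * @return array sequence of doc map records
 */
function buildDocMapRecords(int $num_words, float $global_score,
    int $title_length, array $description_scores): array
{
    $records = [];
    // first record: word count and global document score
    $records[] = ["POS" => $num_words, "SCORE" => floatval($global_score)];
    // second record: title length and the number of description scores
    $records[] = ["POS" => $title_length,
        "SCORE" => floatval(count($description_scores))];
    // subsequent records: one per scored section of the document
    foreach ($description_scores as [$section_length, $section_score]) {
        $records[] = ["POS" => $section_length,
            "SCORE" => floatval($section_score)];
    }
    return $records;
}

// Example: a 120-word document with a 6-word title and two scored sections
$records = buildDocMapRecords(120, 0.75, 6, [[40, 0.5], [74, 0.25]]);
print_r($records);
```

Storing the number of description scores in the second record lets a reader know how many section records follow without needing a separate terminator.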

DOCID_LEN

Length of DocIds used by this IndexDocumentBundle


    public
        mixed
    DOCID_LEN
    = 24

DOCID_PART_LEN

DocIds are made of three parts: hash of url, hash of document, hash of url hostname. Each of these hashes is DOCID_PART_LEN long


    public
        mixed
    DOCID_PART_LEN
    = 8
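
The three-part layout can be sketched as follows. This is only an illustration under stated assumptions: sketchHashPart() and sketchDocId() are hypothetical names, truncated SHA-256 stands in for whatever hash function Yioop actually uses, and any document-type information the real doc_id may encode is ignored.

```php
<?php
const DOCID_PART_LEN = 8;
const DOCID_LEN = 24;

/**
 * Hypothetical stand-in hash: the first DOCID_PART_LEN raw bytes of SHA-256.
 * Yioop's own hash function is not shown in this document.
 */
function sketchHashPart(string $data): string
{
    return substr(hash("sha256", $data, true), 0, DOCID_PART_LEN);
}

/**
 * Concatenates hash of url, hash of document, and hash of url hostname.
 */
function sketchDocId(string $url, string $page_contents): string
{
    // parse_url returns null or false when no host can be found
    $host = parse_url($url, PHP_URL_HOST) ?: "";
    return sketchHashPart($url) . sketchHashPart($page_contents) .
        sketchHashPart($host);
}

$doc_id = sketchDocId("https://www.example.com/foo", "<html>hello</html>");
echo strlen($doc_id), "\n"; // prints 24, i.e., DOCID_LEN
```

Because the hostname hash occupies a fixed position, all documents from one host share the same final DOCID_PART_LEN bytes in this sketch.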

DOCUMENTS_FOLDER

Folder used to store the partition data of this IndexDocumentBundle. This will consist of .txt.gz files for each partition, which are used to store summaries of documents and actual documents (web pages), and .ix files, which are used to store each doc_id and the associated offsets to its summary and actual document within the .txt.gz file


    public
        mixed
    DOCUMENTS_FOLDER
    = "documents"

LAST_ENTRIES_FILENAME

Name of the last entries file used to help compute difference lists for doc_map_index and position list offsets used in postings for the partition. This file is also used to track the total number of occurrences of a term in a partition


    public
        mixed
    LAST_ENTRIES_FILENAME
    = "last_entries"

NEXT_PARTITION_FILE

The name of the file used to keep track of the integer giving the next partition with documents that can be added to this IndexDocumentBundle's dictionary. I.e., it should be that next_partition <= save_partition


    public
        mixed
    NEXT_PARTITION_FILE
    = "next_partition.txt"

PARTITION_FILENAMES

Names for the files which appear within a partition sub-folder


    public
        mixed
    PARTITION_FILENAMES
    = [self::DOC_MAP_FILENAME, self::LAST_ENTRIES_FILENAME, self::POSITIONS_FILENAME, self::POSTINGS_FILENAME]

POSITIONS_DOC_MAP_FOLDER

Name of the folder used to hold position lists and document maps. Within this folder there is a subfolder for each partition which contains a doc_map file, postings file for the docs within the partition, position lists file for those postings, and a last_entries file used in the computation of difference list for doc_map_index and position list offsets, as well as number of occurrences of terms.


    public
        mixed
    POSITIONS_DOC_MAP_FOLDER
    = "positions_doc_maps"
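
The folder layout described above can be sketched by composing paths from the constants on this page. The function sketchPartitionPaths() is hypothetical, and joining with "/" is an assumption; the real class exposes getPartitionBaseFolder() for the per-partition folder.

```php
<?php
const POSITIONS_DOC_MAP_FOLDER = "positions_doc_maps";
const PARTITION_FILENAMES = ["doc_map", "last_entries", "positions",
    "postings"];

/**
 * Hypothetical helper: lists where each per-partition index file would
 * live, assuming <bundle dir>/positions_doc_maps/<partition>/<file name>.
 */
function sketchPartitionPaths(string $dir_name, int $partition): array
{
    $base = $dir_name . "/" . POSITIONS_DOC_MAP_FOLDER . "/" . $partition;
    $paths = [];
    foreach (PARTITION_FILENAMES as $name) {
        $paths[$name] = $base . "/" . $name;
    }
    return $paths;
}

$paths = sketchPartitionPaths("/tmp/bundle", 3);
print_r($paths);
```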

POSITIONS_FILENAME

Name of the file within a partition's positions_doc_maps folder used to contain the partition's position lists for all terms in the partition.


    public
        mixed
    POSITIONS_FILENAME
    = "positions"

POSTINGS_BUFFER_SIZE

How many bytes of postings to buffer before writing when addPartitionPostingsDictionary() is run


    public
        mixed
    POSTINGS_BUFFER_SIZE
    = 1000000

POSTINGS_FILENAME

Name of the file within a partition's positions_doc_maps folder with posting information for all terms in that partition. This consists of key value pairs term_id => posting records for all documents with that term.


    public
        mixed
    POSTINGS_FILENAME
    = "postings"

TEMP_POSTINGS_FILENAME

Temporary name for postings from a POSTINGS_FILENAME file while they are being compressed.


    public
        mixed
    TEMP_POSTINGS_FILENAME
    = "temp_postings"

TERMID_LEN

Length of TermIds used by this IndexDocumentBundle


    public
        mixed
    TERMID_LEN
    = 16

$archive_info

Holds property value pairs concerning the configuration of the current IndexDocumentBundle


    public
        array<string|int, mixed>
    $archive_info

$description

A short text name for this IndexDocumentBundle


    public
        string
    $description

$dictionary

IndexDictionary for all shards in the IndexArchiveBundle. This contains entries of the form (word, num_shards with word, posting list info 0th shard containing the word, posting list info 1st shard containing the word, ...)


    public
        object
    $dictionary

$dir_name

Folder name to use for this IndexDocumentBundle


    public
        string
    $dir_name

$doc_map

Associative array of docid=>doc_record pairs


    public
        array<string|int, mixed>
    $doc_map

$doc_map_counter

Keeps track of the number of documents present in the current partition


    public
        int
    $doc_map_counter

$doc_map_tools

Used to read and write data to the $doc_map array


    public
        PackedTableTools
    $doc_map_tools

$documents

PartitionDocumentBundle for web page documents


    public
        object
    $documents

$extract_phrase_time

Holds the total time needed to extract phrases (sequences of adjacent words) from site descriptions for a partition


    public
        int
    $extract_phrase_time

$last_entries

Used to keep track of the previous values of posting quantities so difference lists can be computed. For example, the previous $doc_map_index and the previous position list offset. It also tracks the total number of occurrences of a term within a partition.


    public
        array<string|int, mixed>
    $last_entries

$last_entries_tools

Used to read and write data to the $last_entries array


    public
        PackedTableTools
    $last_entries_tools

$next_partition_to_add

Structure containing info about the current partition


    public
        array<string|int, mixed>
    $next_partition_to_add

$positions

A string consisting of a concatenated sequence of term position information for each document in turn and, within this, for each term in that document.


    public
        string
    $positions

$postings

Associative array $term_id => posting list records for that term in the partition.


    public
        array<string|int, mixed>
    $postings

$postings_tools

Used to read and write data to the $postings array


    public
        PackedTableTools
    $postings_tools

$unpack_len_map

Array of the string lengths that each of $unpack_map's codes consumes


    public
        array<string|int, mixed>
    $unpack_len_map

$unpack_map

Map from int -> three character unpack string used to unpack posting info


    public
        array<string|int, mixed>
    $unpack_map

__construct()

Makes or initializes an IndexDocumentBundle with the provided parameters


    public
                    __construct(string $dir_name[, bool $read_only_archive = true ][, string $description = null ][, int $num_docs_per_partition = CNUM_DOCS_PER_PARTITION ][, int $max_keys = BPlusTree::MAX_KEYS ]) : mixed

Parameters

$dir_name : string: folder name to store this bundle
$read_only_archive : bool = true: whether to open archive only for reading or reading and writing
$description : string = null: a text name/serialized info about this IndexDocumentBundle
$num_docs_per_partition : int = CNUM_DOCS_PER_PARTITION: the number of documents to be stored in a single partition
$max_keys : int = BPlusTree::MAX_KEYS: the maximum number of keys used by the BPlusTree used for the inverted index

Return values

mixed —

addPages()

Add the array of $pages to the documents PartitionDocumentBundle


    public
                    addPages(array<string|int, mixed> $pages, int $visited_urls_count) : bool

Parameters

$pages : array<string|int, mixed>: data to store
$visited_urls_count : int: number to add to the count of visited urls (visited urls is a smaller number than the total count of objects stored in the index).

Return values

bool —

success or failure of adding the pages

addPartitionPostingsDictionary()

Adds the previously constructed inverted index $partition to the inverted index of the whole bundle


    public
                    addPartitionPostingsDictionary([int $partition = -1 ][, string $taking_too_long_touch = null ]) : mixed

Parameters

$partition : int = -1: which partition's inverted index to add, by default the current save partition
$taking_too_long_touch : string = null: a filename of a file to touch so its last modified time becomes the current time. In a typical Yioop crawl this is done for the CrawlConstants::crawl_status_file file to prevent Yioop's web interface from stopping the crawl because it has seen no recent progress activity on a crawl.

Return values

mixed —
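
The role of $taking_too_long_touch can be illustrated with a minimal sketch of a long-running loop that periodically refreshes a file's modification time. The function sketchLongJob() and the $touch_every interval are hypothetical; only PHP's built-in touch() is assumed, and how often Yioop actually touches the file is not stated here.

```php
<?php
/**
 * Hypothetical long-running job: processes $items and, every
 * $touch_every items, touches $taking_too_long_touch (if given) so that
 * a watcher checking the file's modified time sees recent activity.
 */
function sketchLongJob(array $items, ?string $taking_too_long_touch = null,
    int $touch_every = 100): int
{
    $processed = 0;
    foreach ($items as $item) {
        // ... real work on $item would go here ...
        $processed++;
        if ($taking_too_long_touch !== null &&
            $processed % $touch_every == 0) {
            // sets the file's last modified time to the current time
            touch($taking_too_long_touch);
        }
    }
    return $processed;
}

// a temporary file stands in for the crawl status file
$status_file = tempnam(sys_get_temp_dir(), "crawl_status");
$processed = sketchLongJob(range(1, 250), $status_file, 100);
```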

addScoresDocMap()

Used to add a doc_id => doc_record to the current partition's document map ($this->doc_map). A doc record records the number of words in the document, an overall length of the document, the length of its title, scores for each of the sentences included in the summary for the document, and classifier scores for each classifier that was used by the crawl.


    public
                    addScoresDocMap(string $doc_id, int $num_words, float $score, int $host_keywords_end_pos, int $title_end_pos, int $path_keywords_end_pos, array<string|int, mixed> $description_scores, array<string|int, mixed> $user_ranks) : mixed

Parameters

$doc_id : string: new document id to add a record for
$num_words : int: number of terms in the document associated with the doc-id
$score : float: overall score for the importance of this document
$host_keywords_end_pos : int: end of the portion of the document summary containing terms coming from the hostname
$title_end_pos : int: end of the portion of the document summary containing terms in the title
$path_keywords_end_pos : int: length of the portion of the document summary containing terms in the url path
$description_scores : array<string|int, mixed>: pairs of the form (length of summary portion, score for that portion)
$user_ranks : array<string|int, mixed>: for each user defined classifier for this crawl the float score of the classifier on this document

Return values

mixed —

addTermPostingLists()

Adds posting records associated to a document to the posting lists for a partition.


    public
                    addTermPostingLists(int $position_offset, int $doc_length, array<string|int, mixed> $word_lists, array<string|int, mixed> $meta_ids, int $doc_map_index) : mixed

Parameters

$position_offset : int: number of header bytes that might be used before including any position data in the file in which positions will eventually be stored.
$doc_length : int: length of document in terms for the document for which we are adding posting data.
$word_lists : array<string|int, mixed>: term => positions within current document of that term for the document whose posting data we are adding
$meta_ids : array<string|int, mixed>: meta terms associated with the document we are adding. An example meta term might be "media:news"
$doc_map_index : int: which document within the partition is the one we are adding. I.e., 5 would mean there were 5 earlier documents whose postings we have already added.

Return values

mixed —

buildInvertedIndexPartition()

Builds an inverted index shard for a documents PartitionDocumentBundle partition.


    public
                    buildInvertedIndexPartition([int $partition = -1 ][, string $taking_too_long_touch = null ][, mixed $just_stats = false ]) : mixed

Parameters

$partition : int = -1: to build index for
$taking_too_long_touch : string = null: a filename of a file to touch so its last modified time becomes the current time. In a typical Yioop crawl this is done for the CrawlConstants::crawl_status_file file to prevent Yioop's web interface from stopping the crawl because it has seen no recent progress activity on a crawl.
$just_stats : mixed = false

Return values

mixed —

whether job executed to completion (true or false) if !$just_stats, otherwise, an array with NUM_DOCS, NUM_LINKS, and TERM_STATISTICS (the latter having term frequency info)

computeDocId()

Given a $site array of information about a web page/document, uses the CrawlConstants::URL and CrawlConstants::HASH fields to compute a unique doc id for the array.


    public
            static        computeDocId(array<string|int, mixed> $site) : string

Parameters

$site : array<string|int, mixed>: site to compute doc_id for

Return values

string —

the computed doc_id

deDeltaPostingsSumFrequencies()

Within postings, DOC_MAP_INDEX and POSITION_OFFSETS to position lists are stored as delta lists (differences over previous values). This method undoes the delta encoding to restore the actual DELTA_DOC_MAP_INDEX and POSITION_OFFSETS values. It also computes the sum of the frequencies of items within the list of postings. This method is currently only used for the active partition in an index (the one whose terms haven't yet been added to the B+-tree).


    public
                    deDeltaPostingsSumFrequencies(array<string|int, mixed> &$postings) : int

Parameters

$postings : array<string|int, mixed>: a reference to an array of posting lists for a term (this will be changed by this method)

Return values

int —

sum of the frequencies of term occurrences as given by the above postings
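
A minimal sketch of undoing delta lists while summing frequencies is given below. It is not the class's implementation: sketchDeDelta() is a hypothetical name, and the field names "DOC_MAP_INDEX", "POSITION_OFFSET", and "FREQUENCY" are assumed for illustration.

```php
<?php
/**
 * Hypothetical de-delta: replaces each stored difference by the running
 * total so far, and returns the sum of the FREQUENCY fields.
 *
 * @param array &$postings delta-encoded postings (modified in place)
 * @return int sum of frequencies over all postings
 */
function sketchDeDelta(array &$postings): int
{
    $sum_frequencies = 0;
    $doc_map_index = 0;
    $position_offset = 0;
    foreach ($postings as $i => $posting) {
        // each stored value is a difference from the previous actual value
        $doc_map_index += $posting["DOC_MAP_INDEX"];
        $position_offset += $posting["POSITION_OFFSET"];
        $postings[$i]["DOC_MAP_INDEX"] = $doc_map_index;
        $postings[$i]["POSITION_OFFSET"] = $position_offset;
        $sum_frequencies += $posting["FREQUENCY"];
    }
    return $sum_frequencies;
}

$postings = [
    ["DOC_MAP_INDEX" => 2, "POSITION_OFFSET" => 10, "FREQUENCY" => 3],
    ["DOC_MAP_INDEX" => 5, "POSITION_OFFSET" => 7, "FREQUENCY" => 1],
    ["DOC_MAP_INDEX" => 1, "POSITION_OFFSET" => 4, "FREQUENCY" => 2],
];
$total = sketchDeDelta($postings);
// actual doc map indices are now 2, 7, 8; offsets 10, 17, 21; $total is 6
```

Delta lists keep the stored integers small, which is what makes a variable-length byte encoding of postings compact.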

findNumSlashes()

Finds number of '/' in the url after the hostname represented by doc_id $key.


    public
            static        findNumSlashes(string $key) : mixed

Parameters

$key : string: to find '/' count

Return values

mixed —

forceSave()

Forces the current shard to be saved


    public
                    forceSave() : mixed

Return values

mixed —

getArchiveInfo()

Gets the description, count of documents, and number of partitions of the document store in the supplied directory. If the file arc_description.txt exists, this is viewed as a dummy index archive for the sole purpose of allowing conversions of downloaded data such as arc files into Yioop! format.


    public
            static        getArchiveInfo(string $dir_name) : array<string|int, mixed>

Parameters

$dir_name : string: path to a directory containing a documents IndexDocumentBundle

Return values

array<string|int, mixed> —

summary of the given archive

getCachePage()

Given the $doc_id of a document and a $partition to look for it in, returns the cached page of the document if present and [] otherwise


    public
                    getCachePage(string $doc_id, int $partition) : array<string|int, mixed>

Parameters

$doc_id : string: of document to look up
$partition : int: to look for document in

Return values

array<string|int, mixed> —

desired page cache or [] if look up failed

getParamModifiedTime()

Returns the last time the archive info of the bundle was modified.


    public
            static        getParamModifiedTime(string $dir_name) : mixed

Parameters

$dir_name : string: folder with archive bundle

Return values

mixed —

getPartitionBaseFolder()

Gets the file path corresponding to the partition with index $partition


    public
                    getPartitionBaseFolder(int $partition) : string

Parameters

$partition : int: desired partition index

Return values

string —

file path to where this partition's index data is stored (not the original documents, which are stored in the PartitionDocumentBundle)

getPostingsString()

Gets the postings stored in the postings file of a partition from $offset to $offset + $len and removes the 255 encoding.


    public
                    getPostingsString(int $partition, int $offset, int $len) : string

Parameters

$partition : int: partition to retrieve posting from
$offset : int: byte offset into the partition's postings file at which to look for them
$len : int: length of the posting list to retrieve.

Return values

string —

encoded posting list data: vbyte encoded number of postings, followed by the posting data in PackedTableTools format

getSummary()

Given the $doc_id of a document and a $partition to look for it in, returns the document summary info if present and [] otherwise.


    public
                    getSummary(string $doc_id, int $partition) : array<string|int, mixed>

Parameters

$doc_id : string: of document to look up
$partition : int: to look for document in

Return values

array<string|int, mixed> —

desired summary or [] if look up failed

getWordInfo()

Gets an array of posting list positions for each shard in the bundle $index_name for the word id $term_id


    public
                    getWordInfo(string $term_id[, int $threshold = -1 ], mixed $offset[, mixed $num_partitions = -1 ][, bool $with_remaining_total = false ]) : array<string|int, mixed>

Parameters

$term_id : string: id of phrase or word to look up in bundle dictionary
$threshold : int = -1: after the number of results exceeds this amount stop looking for more dictionary entries.
$offset : mixed
$num_partitions : mixed = -1
$with_remaining_total : bool = false: whether to total number of postings found as well or not

Return values

array<string|int, mixed> —

either [total, sequence of four tuples] or sequence of four tuples: (index_shard generation, posting_list_offset, length, exact id that match $term_id)

invertOneSite()

Used to create inverted index for one site and add its information to the current partition.


    public
                    invertOneSite(array<string|int, mixed> $site, array<string|int, mixed> $url_info, int &$link_cnt) : string

Parameters

$site : array<string|int, mixed>: site to invert
$url_info : array<string|int, mixed>: collection of urls and hashes of documents which map to the same document
$link_cnt : int: current count of number of links discovered so far

Return values

string —

$site_url canonical url for site

isACldDocId()

Checks if a doc_id $key is that of a Company level domain (cld) or www.cld.


    public
            static        isACldDocId(string $key) : mixed

I.e., a url https://yahoo.com/ or https://www.yahoo.com/ as opposed to https://foo.yahoo.com/

Parameters

$key : string: to check if doc or not

Return values

mixed —

isAHostDocId()

Checks if a doc_id $key is that of a host url.


    public
            static        isAHostDocId(string $key) : mixed

I.e., a url https://www.yahoo.com/ as opposed to https://www.yahoo.com/foo

Parameters

$key : string: to check if doc or not

Return values

mixed —

isAWikipediaPage()

Checks if a doc_id $key is that of a Wikipedia page.


    public
            static        isAWikipediaPage(string $key) : mixed

Parameters

$key : string: to check if Wikipedia page or not

Return values

mixed —

isType()

Checks if a doc_id corresponds to a particular large scale type among external_link, internal_link, link (union of previous two), binary, feed, image, text, video, document (union of previous five)


    public
            static        isType(string $key, mixed $types) : bool

Parameters

$key : string: to check if doc or not
$types : mixed

Return values

bool —

true if a document

prepareIndexMap()

As pre-step to calculating the inverted index information for a partition this method groups documents and links to documents into single objects.


    public
                    prepareIndexMap(int $partition[, array<string|int, mixed> $test_index = [] ]) : array<string|int, mixed>

It also does simple deduplication of documents that have the same hash. It then returns an array of the grouped document data. Grouping is done by giving a score to each document based on (number of docs in index - order in which the doc was added). For two entries with the same hash_url, a document will be chosen over a link as the representative; otherwise, the one with the higher score will be chosen as the representative. The representative document is given the sum of the scores of its constituents. A second phase where documents are grouped by the hash of the text body is also done. Finally, the returned documents are sorted by their scores, so the order of documents from this process is roughly the order of importance.

Parameters

$partition : int: index of partition to do deduplication for in the case that test index is empty
$test_index : array<string|int, mixed> = []: is non-empty only when testing what this method does. In that case, it should consist of an array of $doc_id => string, where the string represents a possible record for that doc. As deduplication is done entirely based on components of the doc_id (hash_url, doc_type, hash_doc, hash_host), the string doesn't matter too much.

Return values

array<string|int, mixed> —

groups doc_id => records associated with that doc_id
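
The first grouping phase described above can be sketched roughly as follows. This is a simplified illustration, not the bundle's actual code: groupByHashUrl() and the entry fields hash_url and is_doc are hypothetical stand-ins for information the real method reads out of packed doc_ids, and the second phase (grouping by hash of the text body) is omitted.

```php
<?php
/**
 * Simplified sketch of grouping by hash_url. Each entry is a hypothetical
 * array ['hash_url' => string, 'is_doc' => bool] listed in the order it was
 * added to the index.
 */
function groupByHashUrl(array $entries): array
{
    $num_entries = count($entries);
    $groups = [];
    foreach (array_values($entries) as $order => $entry) {
        // score is (number of docs in index - order doc added)
        $score = $num_entries - $order;
        $key = $entry['hash_url'];
        if (!isset($groups[$key])) {
            $groups[$key] = ['REP' => $entry, 'REP_SCORE' => $score,
                'SCORE' => $score];
            continue;
        }
        // the group accumulates the scores of all of its constituents
        $groups[$key]['SCORE'] += $score;
        $rep = $groups[$key]['REP'];
        $doc_beats_link = $entry['is_doc'] && !$rep['is_doc'];
        $same_kind_higher = $entry['is_doc'] === $rep['is_doc']
            && $score > $groups[$key]['REP_SCORE'];
        if ($doc_beats_link || $same_kind_higher) {
            $groups[$key]['REP'] = $entry;
            $groups[$key]['REP_SCORE'] = $score;
        }
    }
    // highest total score first, so output is roughly in order of importance
    uasort($groups, fn($a, $b) => $b['SCORE'] <=> $a['SCORE']);
    return $groups;
}
```

Keeping the representative's own score separate from the group's summed score lets the "higher score wins" comparison use individual scores while the final sort uses the totals.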

setArchiveInfo()

Sets the archive info struct for the web archive bundle associated with this bundle. This struct has fields like: DESCRIPTION (serialized store of global parameters of the crawl like seed sites, timestamp, etc).


    public
            static        setArchiveInfo(string $dir_name, array<string|int, mixed> $update_info) : mixed

Parameters

$dir_name : string: folder with archive bundle
$update_info : array<string|int, mixed>: struct with above fields

Return values

mixed —

stopIndexing()

Used when a crawl stops to perform final dictionary operations to produce a working stand-alone index.


    public
                    stopIndexing() : mixed

Return values

mixed —

unpackPostings()

Given the postings as a string for a partition for a term, unpacks them into an array of postings, doing de-delta of doc_map_indices and de-delta of positions. Each posting represents the occurrences of a term in a document, so the frequency component is the number of occurrences of the term in the document. This method also computes the sum of these frequencies over all postings in the partition.


    public
                    unpackPostings(string $postings_string) : array<string|int, mixed>

Parameters

$postings_string : string: compressed string representation of a set of postings for a term

Return values

array<string|int, mixed> —

a pair [array of unpacked postings, sum of frequencies of all the postings]
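
The de-delta step can be sketched on already-decoded data. The following is a minimal illustration under stated assumptions: deDeltaPostings() and the field names DOC_DELTA, POSITION_DELTAS, DOC_MAP_INDEX, FREQUENCY, and POSITIONS are hypothetical, the first gap is assumed to be relative to 0, and the compressed string format that unpackPostings() actually parses is not reproduced.

```php
<?php
/**
 * Hypothetical sketch: turns gap-encoded postings into absolute doc map
 * indices and positions, and sums the term frequencies.
 */
function deDeltaPostings(array $delta_postings): array
{
    $postings = [];
    $sum_frequencies = 0;
    $doc_index = 0;
    foreach ($delta_postings as $delta_posting) {
        // doc map indices are stored as gaps from the previous posting
        $doc_index += $delta_posting['DOC_DELTA'];
        $positions = [];
        $position = 0;
        foreach ($delta_posting['POSITION_DELTAS'] as $gap) {
            // positions are stored as gaps within a single document
            $position += $gap;
            $positions[] = $position;
        }
        // frequency is the number of occurrences of the term in the document
        $frequency = count($positions);
        $sum_frequencies += $frequency;
        $postings[] = ['DOC_MAP_INDEX' => $doc_index,
            'FREQUENCY' => $frequency, 'POSITIONS' => $positions];
    }
    return [$postings, $sum_frequencies];
}
```

The return value mirrors the documented pair: the unpacked postings followed by the sum of their frequencies.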

updateDictionary()

For every partition between next partition and save partition, adds the posting list information to the dictionary BPlusTree. At the end of this process, next partition and save partition should be the same.


    public
                    updateDictionary([string $taking_too_long_touch = null ][, bool $till_equal = true ]) : mixed

Parameters

$taking_too_long_touch : string = null: a filename of a file to touch so its last modified time becomes the current time. In a typical Yioop crawl this is done for the CrawlConstants::crawl_status_file file to prevent Yioop's web interface from stopping the crawl because it has seen no recent progress activity on a crawl.
$till_equal : bool = true: if set to true, keeps adding partitions up to the save partition; if set to false, only adds one partition

Return values

mixed —
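
The control flow suggested by the two parameters might look like the sketch below. It is only an illustration: catchUpPartitions() and its $add_partition callback are hypothetical, and the real method operates on the bundle's own next partition and save partition state rather than on arguments.

```php
<?php
/**
 * Hypothetical sketch of the catch-up loop: processes partitions from
 * $next_partition up to (not including) $save_partition and returns the
 * new value of next partition.
 */
function catchUpPartitions(int $next_partition, int $save_partition,
    callable $add_partition, ?string $taking_too_long_touch = null,
    bool $till_equal = true): int
{
    while ($next_partition < $save_partition) {
        if ($taking_too_long_touch !== null) {
            // refresh the file's modified time to signal recent progress
            touch($taking_too_long_touch);
        }
        // stand-in for adding this partition's postings to the dictionary
        $add_partition($next_partition);
        $next_partition++;
        if (!$till_equal) {
            break; // only one partition was requested
        }
    }
    return $next_partition;
}
```

With $till_equal left at true the returned value equals $save_partition; with false it advances by at most one.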
