FeedDocumentBundle
extends IndexDocumentBundle
in package
Subclass of IndexDocumentBundle that uses Bloom filters to make it easy to check whether a news feed item is already in the bundle before adding it
Table of Contents
- ARCHIVE_INFO_FILE = "archive_info.txt"
- File name used to store within the folder of the IndexDocumentBundle parameter/configuration information about the bundle
- DEFAULT_PARAMETERS = ["DESCRIPTION" => "", "VERSION" => self::DEFAULT_VERSION]
- Default values for the configuration parameters of an IndexDocumentBundle
- DEFAULT_VERSION = "3.2"
- The version of this IndexDocumentBundle. The lowest format number is 3.0, as earlier inverted index/document stores used IndexArchiveBundles
- DICTIONARY_FOLDER = "dictionary"
- Subfolder of IndexDocumentBundle to store the btree with term => posting list information (i.e., the inverted index)
- DOC_MAP_FILENAME = "doc_map"
- Partition i in an IndexDocumentBundle has a subfolder i within self::POSITIONS_DOC_MAP_FOLDER. Within this subfolder i, self::DOC_MAP_FILENAME is the name of the file used to store the document map for the partition. The document map consists of a sequence of records associated with each doc_id of a document stored in the partition. The first record is ["POS" => $num_words, "SCORE" => floatval($global_score_for_document)]. The second record is ["POS" => $length_of_title_of_document, "SCORE" => floatval($num_description_scores)]. Here a description score is a score for the importance of a section of a document. Subsequent records list ["POS" => the length of the jth section of the document, "SCORE" => its score].
- DOCID_LEN = 24
- Length of DocIds used by this IndexDocumentBundle
- DOCID_PART_LEN = 8
- DocIds are made of three parts: hash of url, hash of document, hash of url hostname. Each of these hashes is DOCID_PART_LEN long
- DOCUMENTS_FOLDER = "documents"
- Folder used to store the partition data of this IndexDocumentBundle. It consists of .txt.gz files for each partition, which store summaries of documents and the actual documents (web pages), and .ix files, which store each doc_id and the associated offsets to its summary and actual document within the .txt.gz file
- LAST_ENTRIES_FILENAME = "last_entries"
- Name of the last entries file used to help compute difference lists for doc_map_index, and position list offsets used in postings for the partition. This file is also used to track the total number of occurrences of term in a partition
- NEXT_PARTITION_FILE = "next_partition.txt"
- The name of a file used to keep track of the integer giving the next partition with documents that can be added to this IndexDocumentBundle's dictionary. That is, next_partition <= save_partition should hold
- OLD_ITEM_TIME = 4 * \seekquarry\yioop\configs\ONE_WEEK
- How long, in seconds, before a feed item expires
- PARTITION_FILENAMES = [self::DOC_MAP_FILENAME, self::LAST_ENTRIES_FILENAME, self::POSITIONS_FILENAME, self::POSTINGS_FILENAME]
- Names for the files which appear within a partition sub-folder
- POSITIONS_DOC_MAP_FOLDER = "positions_doc_maps"
- Name of the folder used to hold position lists and document maps. Within this folder there is a subfolder for each partition which contains a doc_map file, postings file for the docs within the partition, position lists file for those postings, and a last_entries file used in the computation of difference list for doc_map_index and position list offsets, as well as number of occurrences of terms.
- POSITIONS_FILENAME = "positions"
- Name of the file within a partition's positions_doc_maps folder used to contain the partition's position list for all terms in the partition.
- POSTINGS_BUFFER_SIZE = 1000000
- How many bytes of postings to buffer before writing in addPartitionPostingsDictionary
- POSTINGS_FILENAME = "postings"
- Name of the file within a partition's positions_doc_maps folder with posting information for all terms in that partition. This consists of key value pairs term_id => posting records for all documents with that term.
- TEMP_POSTINGS_FILENAME = "temp_postings"
- Temporary name for postings from a POSTINGS_FILENAME file while they are being compressed.
- TERMID_LEN = 16
- Length of TermIds used by this IndexDocumentBundle
- $archive_info : array<string|int, mixed>
- Holds property value pairs concerning the configuration of the current IndexDocumentBundle
- $db : DatasourceManager
- Reference to a DatasourceManager to communicate with the database to get a list of search sources (news feeds) associated with this feed bundle
- $description : string
- A short text name for this IndexDocumentBundle
- $dictionary : object
- IndexDictionary for all shards in the IndexArchiveBundle. This contains entries of the form (word, num_shards with word, posting list info 0th shard containing the word, posting list info 1st shard containing the word, ...)
- $dir_name : string
- Folder name to use for this IndexDocumentBundle
- $doc_map : array<string|int, mixed>
- Associative array of docid=>doc_record pairs
- $doc_map_counter : int
- Keeps track of the number of documents present in the current partition
- $doc_map_tools : PackedTableTools
- Used to read and write data to the $doc_map array
- $documents : object
- PartitionDocumentBundle for web page documents
- $extract_phrase_time : int
- Holds the total time needed to extract phrases (sequences of adjacent words) from site descriptions for a partition
- $feeds : array<string|int, mixed>
- Array of information about the search sources (news feeds) that were used to collect news items stored in this bundle
- $filter_a : BloomFilterFile
- Used to store unique identifiers of feed items that have been stored in this FeedArchiveBundle. filter_a is used for checking whether items are already in the archive. Once it holds URL_FILTER_SIZE/2 items, new items are added to filter_b as well as filter_a. When filter_a reaches URL_FILTER_SIZE items, filter_a is deleted, filter_b is renamed to filter_a, and the process repeats.
- $filter_b : BloomFilterFile
- Auxiliary BloomFilterFile used in checking if feed items are in this archive or not.
- $last_entries : array<string|int, mixed>
- Used to keep track of the previous values posting quantities so difference lists can be computed. For example, previous $doc_map_index, previous position list offset. It also tracks the total number of occurrences of a term within a partition.
- $last_entries_tools : PackedTableTools
- Used to read and write data to the $last_entries array
- $next_partition_to_add : array<string|int, mixed>
- structure contains info about the current partition
- $positions : string
- A string consisting of a concatenated sequence term position information for each document in turn and within this for each term in that document.
- $postings : array<string|int, mixed>
- Associative array $term_id => posting list records for that term in the partition.
- $postings_tools : PackedTableTools
- Used to read and write data to the $postings array
- $unpack_len_map : array<string|int, mixed>
- Array of the string lengths that each of $unpack_map's codes consumes
- $unpack_map : array<string|int, mixed>
- Map from int -> three character unpack string used to unpack posting info
- __construct() : mixed
- Makes or initializes a FeedArchiveBundle with the provided parameters
- addFilters() : mixed
- Adds the key (often a GUID) of a feed item to the Bloom filter pair associated with this archive. The key is always added to filter a; if filter a is more than half full, it is also added to filter b. If filter a is full, it is deleted and filter b is renamed filter a, and the process continues, with a new filter b created when the new filter a becomes half full.
- addPages() : bool
- Add the array of $pages to the documents PartitionDocumentBundle
- addPagesAndSeenKeys() : bool
- Adds pages of feed items to the document bundle and adds their unique hashes (guids) to the Bloom filters so they are not reindexed
- addPartitionPostingsDictionary() : mixed
- Adds the previously constructed inverted index $partition to the inverted index of the whole bundle
- addScoresDocMap() : mixed
- Used to add a doc_id => doc_record to the current partition's document map ($this->doc_map). A doc record records the number of words in the document, an overall length of the document, the length of its title, scores for each of the sentences included in the summary for the document, and classifier scores for each classifier that was used by the crawl.
- addTermCountsTrendingTable() : mixed
- Updates TRENDING_TERM, hourly, daily, and weekly top term occurrences.
- addTermPostingLists() : mixed
- Adds posting records associated to a document to the posting lists for a partition.
- buildInvertedIndexPartition() : mixed
- Copies all feed items newer than $age to a new shard, then deletes the old index shard and database entries older than $age. Finally sets the copied shard to be active. If this method is going to take max_execution_time/2, it returns false so that an additional job can be scheduled; otherwise it returns true
- calculateMetas() : array<string|int, mixed>
- Used to calculate the meta words for RSS feed items
- computeDocId() : string
- Given a $site array of information about a web page/document. Use CrawlConstant::URL and CrawlConstant::HASH fields to compute a unique doc id for the array.
- contains() : bool
- Whether the active filter for this feed contains the feed item with the supplied key
- deDeltaPostingsSumFrequencies() : int
- Within postings, DOC_MAP_INDEX and POSITION_OFFSETS to position lists are stored as delta lists (differences over previous values). This method undoes the delta list to restore the actual DELTA_DOC_MAP_INDEX and POSITION_OFFSETS values. It also computes the sum of the frequencies of items within the list of postings. This method is currently only used for the active partition in an index (the one whose terms haven't yet been added to the B+-tree).
- findNumSlashes() : mixed
- Finds number of '/' in the url after the hostname represented by doc_id $key.
- forceSave() : mixed
- Forces the current shard to be saved
- getArchiveInfo() : array<string|int, mixed>
- Gets the description, count of documents, and number of partitions of the document store in the supplied directory. If the file arc_description.txt exists, this is viewed as a dummy index archive for the sole purpose of allowing conversions of downloaded data such as arc files into Yioop! format.
- getCachePage() : array<string|int, mixed>
- Given the $doc_id of a document and a $partition to look for it in, returns the cached page of the document if present and [] otherwise
- getParamModifiedTime() : mixed
- Returns the last time the archive info of the bundle was modified.
- getPartitionBaseFolder() : string
- Gets the file path corresponding to the partition with index $partition
- getPostingsString() : string
- Gets the postings stored in the postings file of a partition from $offset to $offset + $len, removing the 255 encoding.
- getSummary() : array<string|int, mixed>
- Given the $doc_id of a document and a $partition to look for it in, returns the document summary info if present and [] otherwise.
- getWordInfo() : array<string|int, mixed>
- Gets an array of posting list positions for each shard in the bundle $index_name for the word id $term_id
- invertOneSite() : string
- Used to create inverted index for one site and add its information to the current partition.
- isACldDocId() : mixed
- Checks if a doc_id $key is that of a Company level domain (cld) or www.cld.
- isAHostDocId() : mixed
- Checks if a doc_id $key is that of a host url.
- isAWikipediaPage() : mixed
- Checks if a doc_id $key is that of a Wikipedia page.
- isType() : bool
- Checks if a doc_id corresponds to a particular large scale type among external_link, internal_link, link (union of previous two), binary, feed, image, text, video, document (union of previous five)
- prepareIndexMap() : array<string|int, mixed>
- As pre-step to calculating the inverted index information for a partition this method groups documents and links to documents into single objects.
- setArchiveInfo() : mixed
- Sets the archive info struct for the web archive bundle associated with this bundle. This struct has fields like: DESCRIPTION (serialized store of global parameters of the crawl like seed sites, timestamp, etc).
- stopIndexing() : mixed
- Used when a crawl stops to perform final dictionary operations to produce a working stand-alone index.
- unpackPostings() : array<string|int, mixed>
- Given the postings as a string for a partition for a term, unpacks them into an array of postings, doing de-delta of doc_map_indices and de-delta of positions. Each posting represents occurrence of a term in a documents, so the frequency component is the number of occurrences of the term in the document. This method also computes the sum of these frequencies over all postings in partition.
- updateDictionary() : mixed
- For every partition between next partition and save partition, adds the posting list information to the dictionary BPlusTree. At the end of this process next partition and save partition should be the same
- updateTrendingTermCounts() : mixed
- Updates trending term counts based on the string from the current feed item.
Constants
ARCHIVE_INFO_FILE
File name used to store within the folder of the IndexDocumentBundle parameter/configuration information about the bundle
public
mixed
ARCHIVE_INFO_FILE
= "archive_info.txt"
DEFAULT_PARAMETERS
Default values for the configuration parameters of an IndexDocumentBundle
public
mixed
DEFAULT_PARAMETERS
= ["DESCRIPTION" => "", "VERSION" => self::DEFAULT_VERSION]
DEFAULT_VERSION
The version of this IndexDocumentBundle. The lowest format number is 3.0, as earlier inverted index/document stores used IndexArchiveBundles
public
mixed
DEFAULT_VERSION
= "3.2"
DICTIONARY_FOLDER
Subfolder of IndexDocumentBundle to store the btree with term => posting list information (i.e., the inverted index)
public
mixed
DICTIONARY_FOLDER
= "dictionary"
DOC_MAP_FILENAME
Partition i in an IndexDocumentBundle has a subfolder i within self::POSITIONS_DOC_MAP_FOLDER. Within this subfolder i, self::DOC_MAP_FILENAME is the name of the file used to store the document map for the partition. The document map consists of a sequence of records associated with each doc_id of a document stored in the partition. The first record is ["POS" => $num_words, "SCORE" => floatval($global_score_for_document)]. The second record is ["POS" => $length_of_title_of_document, "SCORE" => floatval($num_description_scores)]. Here a description score is a score for the importance of a section of a document. Subsequent records list ["POS" => the length of the jth section of the document, "SCORE" => its score].
public
mixed
DOC_MAP_FILENAME
= "doc_map"
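The record sequence described for DOC_MAP_FILENAME can be illustrated with a small sketch. It is written in Python purely for illustration and is not the Yioop PHP implementation. The function name `build_doc_record` and its arguments are hypothetical, and it follows only the description above, ignoring any further fields that addScoresDocMap() may store.

```python
def build_doc_record(num_words, global_score, title_length, description_scores):
    """Build the list of POS/SCORE records for one document.

    description_scores is a list of (section_length, score) pairs.
    This layout is an assumption based only on the constant's description.
    """
    record = [
        # First record: number of words and the document's global score.
        {"POS": num_words, "SCORE": float(global_score)},
        # Second record: title length and the number of description scores.
        {"POS": title_length, "SCORE": float(len(description_scores))},
    ]
    # Subsequent records: one per section, giving its length and its score.
    for section_length, score in description_scores:
        record.append({"POS": section_length, "SCORE": float(score)})
    return record


record = build_doc_record(120, 0.75, 6, [(40, 0.9), (74, 0.4)])
```

Here the second record's SCORE holds the count of section records that follow, which is how a reader of the map could know where one document's records end.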
DOCID_LEN
Length of DocIds used by this IndexDocumentBundle
public
mixed
DOCID_LEN
= 24
DOCID_PART_LEN
DocIds are made of three parts: hash of url, hash of document, hash of url hostname. Each of these hashes is DOCID_PART_LEN long
public
mixed
DOCID_PART_LEN
= 8
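As a rough illustration of how three DOCID_PART_LEN-byte hashes add up to a DOCID_LEN-byte id, the following Python sketch concatenates three truncated hashes. The hash function (truncated MD5), the helper names `part_hash` and `sketch_doc_id`, and the use of raw document text are assumptions made for the example; the hashing that computeDocId() actually performs is not specified here.

```python
import hashlib
from urllib.parse import urlparse

DOCID_PART_LEN = 8
DOCID_LEN = 24


def part_hash(text):
    # Stand-in hash: the first DOCID_PART_LEN bytes of an MD5 digest (assumption).
    return hashlib.md5(text.encode("utf-8")).digest()[:DOCID_PART_LEN]


def sketch_doc_id(url, document_text):
    # Parts in the documented order: hash of url, hash of document, hash of hostname.
    host = urlparse(url).netloc
    return part_hash(url) + part_hash(document_text) + part_hash(host)


doc_id = sketch_doc_id("https://example.com/news/1", "Example item body")
```

Because the hostname hash occupies a fixed position at the end of the id, documents from the same host share their final DOCID_PART_LEN bytes in this sketch.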
DOCUMENTS_FOLDER
Folder used to store the partition data of this IndexDocumentBundle. It consists of .txt.gz files for each partition, which store summaries of documents and the actual documents (web pages), and .ix files, which store each doc_id and the associated offsets to its summary and actual document within the .txt.gz file
public
mixed
DOCUMENTS_FOLDER
= "documents"
LAST_ENTRIES_FILENAME
Name of the last entries file used to help compute difference lists for doc_map_index, and position list offsets used in postings for the partition. This file is also used to track the total number of occurrences of term in a partition
public
mixed
LAST_ENTRIES_FILENAME
= "last_entries"
NEXT_PARTITION_FILE
The name of a file used to keep track of the integer giving the next partition with documents that can be added to this IndexDocumentBundle's dictionary. That is, next_partition <= save_partition should hold
public
mixed
NEXT_PARTITION_FILE
= "next_partition.txt"
OLD_ITEM_TIME
How long, in seconds, before a feed item expires
public
mixed
OLD_ITEM_TIME
= 4 * \seekquarry\yioop\configs\ONE_WEEK
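To make the expiry window concrete, here is a small Python sketch of the arithmetic. It assumes that ONE_WEEK is the number of seconds in seven days, which this page does not state, and `is_expired` is a hypothetical helper rather than a method of this class.

```python
ONE_WEEK = 7 * 24 * 60 * 60   # assumed value of the ONE_WEEK config constant
OLD_ITEM_TIME = 4 * ONE_WEEK  # four weeks, in seconds


def is_expired(pubdate, now):
    # A feed item counts as expired once it is older than OLD_ITEM_TIME seconds.
    return now - pubdate > OLD_ITEM_TIME


now = 1_700_000_000
fresh = is_expired(now - ONE_WEEK, now)
stale = is_expired(now - 5 * ONE_WEEK, now)
```

Under that assumption OLD_ITEM_TIME comes to 2,419,200 seconds.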
PARTITION_FILENAMES
Names for the files which appear within a partition sub-folder
public
mixed
PARTITION_FILENAMES
= [self::DOC_MAP_FILENAME, self::LAST_ENTRIES_FILENAME, self::POSITIONS_FILENAME, self::POSTINGS_FILENAME]
POSITIONS_DOC_MAP_FOLDER
Name of the folder used to hold position lists and document maps. Within this folder there is a subfolder for each partition which contains a doc_map file, postings file for the docs within the partition, position lists file for those postings, and a last_entries file used in the computation of difference list for doc_map_index and position list offsets, as well as number of occurrences of terms.
public
mixed
POSITIONS_DOC_MAP_FOLDER
= "positions_doc_maps"
POSITIONS_FILENAME
Name of the file within a partition's positions_doc_maps folder used to contain the partition's position list for all terms in the partition.
public
mixed
POSITIONS_FILENAME
= "positions"
POSTINGS_BUFFER_SIZE
How many bytes of postings to buffer before writing in addPartitionPostingsDictionary
public
mixed
POSTINGS_BUFFER_SIZE
= 1000000
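The idea of buffering postings up to a byte threshold before writing can be sketched as follows. This is a generic Python illustration of threshold-based buffered writing, not the logic of addPartitionPostingsDictionary(); the function `write_buffered` and its return value are hypothetical.

```python
import io

POSTINGS_BUFFER_SIZE = 1000000


def write_buffered(chunks, out, buffer_size=POSTINGS_BUFFER_SIZE):
    """Append chunks to a buffer and write it out once it reaches buffer_size."""
    buffer = bytearray()
    flushes = 0
    for chunk in chunks:
        buffer += chunk
        if len(buffer) >= buffer_size:
            out.write(bytes(buffer))
            buffer.clear()
            flushes += 1
    if buffer:
        # Write whatever remains after the last chunk.
        out.write(bytes(buffer))
        flushes += 1
    return flushes


out = io.BytesIO()
flushes = write_buffered([b"a" * 600000, b"b" * 600000, b"c" * 10], out)
```

Batching writes this way trades a bounded amount of memory for fewer, larger writes to the postings file.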
POSTINGS_FILENAME
Name of the file within a partition's positions_doc_maps folder with posting information for all terms in that partition. This consists of key value pairs term_id => posting records for all documents with that term.
public
mixed
POSTINGS_FILENAME
= "postings"
TEMP_POSTINGS_FILENAME
Temporary name for postings from a POSTINGS_FILENAME file while they are being compressed.
public
mixed
TEMP_POSTINGS_FILENAME
= "temp_postings"
TERMID_LEN
Length of TermIds used by this IndexDocumentBundle
public
mixed
TERMID_LEN
= 16
Properties
$archive_info
Holds property value pairs concerning the configuration of the current IndexDocumentBundle
public
array<string|int, mixed>
$archive_info
$db
Reference to a DatasourceManager to communicate with the database to get a list of search sources (news feeds) associated with this feed bundle
public
DatasourceManager
$db
$description
A short text name for this IndexDocumentBundle
public
string
$description
$dictionary
IndexDictionary for all shards in the IndexArchiveBundle. This contains entries of the form (word, num_shards with word, posting list info 0th shard containing the word, posting list info 1st shard containing the word, ...)
public
object
$dictionary
$dir_name
Folder name to use for this IndexDocumentBundle
public
string
$dir_name
$doc_map
Associative array of docid=>doc_record pairs
public
array<string|int, mixed>
$doc_map
$doc_map_counter
Keeps track of the number of documents present in the current partition
public
int
$doc_map_counter
$doc_map_tools
Used to read and write data to the $doc_map array
public
PackedTableTools
$doc_map_tools
$documents
PartitionDocumentBundle for web page documents
public
object
$documents
$extract_phrase_time
Holds the total time needed to extract phrases (sequences of adjacent words) from site descriptions for a partition
public
int
$extract_phrase_time
$feeds
Array of information about the search sources (news feeds) that were used to collect news items stored in this bundle
public
array<string|int, mixed>
$feeds
$filter_a
Used to store unique identifiers of feed items that have been stored in this FeedArchiveBundle. filter_a is used for checking whether items are already in the archive. Once it holds URL_FILTER_SIZE/2 items, new items are added to filter_b as well as filter_a. When filter_a reaches URL_FILTER_SIZE items, filter_a is deleted, filter_b is renamed to filter_a, and the process repeats.
public
BloomFilterFile
$filter_a
$filter_b
Auxiliary BloomFilterFile used in checking if feed items are in this archive or not.
public
BloomFilterFile
$filter_b
@see $filter_a
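The rotation between the two filters described for $filter_a can be sketched in a few lines. The Python below uses ordinary sets as stand-ins for BloomFilterFile objects and a hypothetical `FilterPair` class with a `capacity` argument playing the role of URL_FILTER_SIZE. It illustrates the scheme only and is not the class's actual code.

```python
class FilterPair:
    """Two rotating membership filters; sets stand in for Bloom filters."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.filter_a = set()
        self.filter_b = None

    def add(self, key):
        # Every key goes into filter a.
        self.filter_a.add(key)
        # Once filter a is more than half full, keys also go into filter b.
        if len(self.filter_a) > self.capacity // 2:
            if self.filter_b is None:
                self.filter_b = set()
            self.filter_b.add(key)
        # When filter a is full, drop it and promote filter b in its place.
        if len(self.filter_a) >= self.capacity:
            self.filter_a = self.filter_b
            self.filter_b = None

    def contains(self, key):
        # Membership is checked against the active filter only.
        return key in self.filter_a


pair = FilterPair(capacity=4)
for key in ["k0", "k1", "k2", "k3"]:
    pair.add(key)
```

After the fourth add the filters rotate, so the pair still recognizes the most recent keys while the oldest ones are forgotten. This keeps the filter size bounded at the cost of possibly re-admitting very old items.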
$last_entries
Used to keep track of the previous values posting quantities so difference lists can be computed. For example, previous $doc_map_index, previous position list offset. It also tracks the total number of occurrences of a term within a partition.
public
array<string|int, mixed>
$last_entries
$last_entries_tools
Used to read and write data to the $last_entries array
public
PackedTableTools
$last_entries_tools
$next_partition_to_add
structure contains info about the current partition
public
array<string|int, mixed>
$next_partition_to_add
$positions
A string consisting of a concatenated sequence term position information for each document in turn and within this for each term in that document.
public
string
$positions
$postings
Associative array $term_id => posting list records for that term in the partition.
public
array<string|int, mixed>
$postings
$postings_tools
Used to read and write data to the $postings array
public
PackedTableTools
$postings_tools
$unpack_len_map
Array of the string lengths that each of $unpack_map's codes consumes
public
array<string|int, mixed>
$unpack_len_map
$unpack_map
Map from int -> three character unpack string used to unpack posting info
public
array<string|int, mixed>
$unpack_map
Methods
__construct()
Makes or initializes a FeedArchiveBundle with the provided parameters
public
__construct(string $dir_name, mixed $db[, bool $read_only_archive = true ][, string $description = null ][, int $num_docs_per_partition = CNUM_DOCS_PER_PARTITION ]) : mixed
Parameters
- $dir_name : string
-
folder name to store this bundle
- $db : mixed
- $read_only_archive : bool = true
-
whether to open archive only for reading or reading and writing
- $description : string = null
-
a text name/serialized info about this IndexDocumentBundle
- $num_docs_per_partition : int = CNUM_DOCS_PER_PARTITION
-
the number of pages to be stored in a single shard
Return values
mixed —
addFilters()
Adds the key (often a GUID) of a feed item to the Bloom filter pair associated with this archive. The key is always added to filter a; if filter a is more than half full, it is also added to filter b. If filter a is full, it is deleted and filter b is renamed filter a, and the process continues, with a new filter b created when the new filter a becomes half full.
public
addFilters(string $key) : mixed
Parameters
- $key : string
-
unique identifier of a feed item
Return values
mixed —
addPages()
Add the array of $pages to the documents PartitionDocumentBundle
public
addPages(array<string|int, mixed> $pages, int $visited_urls_count) : bool
Parameters
- $pages : array<string|int, mixed>
-
data to store
- $visited_urls_count : int
-
number to add to the count of visited urls (visited urls is a smaller number than the total count of objects stored in the index).
Return values
bool —success or failure of adding the pages
addPagesAndSeenKeys()
Adds pages of feed items to the document bundle and adds their unique hashes (guids) to the Bloom filters so they are not reindexed
public
addPagesAndSeenKeys(array<string|int, mixed> $pages, int $visited_urls_count) : bool
Parameters
- $pages : array<string|int, mixed>
-
array of feed items
- $visited_urls_count : int
-
number of feed items
Return values
bool —whether or not succeeded in adding pages
addPartitionPostingsDictionary()
Adds the previously constructed inverted index $partition to the inverted index of the whole bundle
public
addPartitionPostingsDictionary([int $partition = -1 ][, string $taking_too_long_touch = null ]) : mixed
Parameters
- $partition : int = -1
-
which partitions inverted index to add, by default the current save partition
- $taking_too_long_touch : string = null
-
a filename of a file to touch so its last modified time becomes the current time. In a typical Yioop crawl this is done for the CrawlConstants::crawl_status_file file to prevent Yioop's web interface from stopping the crawl because it has seen no recent progress activity on a crawl.
Return values
mixed —
addScoresDocMap()
Used to add a doc_id => doc_record to the current partition's document map ($this->doc_map). A doc record records the number of words in the document, an overall length of the document, the length of its title, scores for each of the sentences included in the summary for the document, and classifier scores for each classifier that was used by the crawl.
public
addScoresDocMap(string $doc_id, int $num_words, float $score, int $host_keywords_end_pos, int $title_end_pos, int $path_keywords_end_pos, array<string|int, mixed> $description_scores, array<string|int, mixed> $user_ranks) : mixed
Parameters
- $doc_id : string
-
new document id to add a record for
- $num_words : int
-
number of terms in the document associated with the doc-id
- $score : float
-
overall score for the importance of this document
- $host_keywords_end_pos : int
-
end of the portion of the document summary containing terms coming from the hostname
- $title_end_pos : int
-
end of the portion of the document summary containing terms in the title
- $path_keywords_end_pos : int
-
length of the portion of the document summary containing terms in the url path
- $description_scores : array<string|int, mixed>
-
pairs of the form (length of summary portion, score for that portion)
- $user_ranks : array<string|int, mixed>
-
for each user defined classifier for this crawl the float score of the classifier on this document
Return values
mixed —
addTermCountsTrendingTable()
Updates TRENDING_TERM, hourly, daily, and weekly top term occurrences.
public
addTermCountsTrendingTable(array<string|int, mixed> $term_counts) : mixed
Removes entries older than a week
Parameters
- $term_counts : array<string|int, mixed>
-
for the most recent update of the feed index, it should be an array [$lang => [$term => $occurrences]] for the top NUM_TRENDING terms per language
Return values
mixed —
addTermPostingLists()
Adds posting records associated to a document to the posting lists for a partition.
public
addTermPostingLists(int $position_offset, int $doc_length, array<string|int, mixed> $word_lists, array<string|int, mixed> $meta_ids, int $doc_map_index) : mixed
Parameters
- $position_offset : int
-
number of header bytes that might be used before including any position data in the file that positions will eventually be stored.
- $doc_length : int
-
length of document in terms for the document for which we are adding posting data.
- $word_lists : array<string|int, mixed>
-
term => positions within current document of that term for the document whose posting data we are adding
- $meta_ids : array<string|int, mixed>
-
meta terms associated with the document we are adding. An example, meta term might be "media:news"
- $doc_map_index : int
-
which document within the partition is the one we are adding. I.e., 5 would mean there were 5 earlier documents whose postings we have already added.
Return values
mixed —
buildInvertedIndexPartition()
Copies all feed items newer than $age to a new shard, then deletes the old index shard and database entries older than $age. Finally sets the copied shard to be active. If this method is going to take max_execution_time/2, it returns false so that an additional job can be scheduled; otherwise it returns true
public
buildInvertedIndexPartition([int $partition = -1 ][, string $taking_too_long_touch = null ][, bool $just_stats = false ]) : mixed
Parameters
- $partition : int = -1
-
bundle partition to build inverted index for
- $taking_too_long_touch : string = null
-
name of a file to touch if building the inverted index takes too long. Whether SCHEDULES_DIR/ . "/{$this->channel}-" . CrawlConstants::crawl_status_file has been recently modified is used in crawling to see if the crawl has run out of new data and can be stopped.
- $just_stats : bool = false
-
whether to just compute stats on the inverted index or to actually save the results
Return values
mixed —whether job executed to completion (true or false) if !$just_stats, otherwise, an array with NUM_DOCS, NUM_LINKS, and TERM_STATISTICS (the latter having term frequency info)
calculateMetas()
Used to calculate the meta words for RSS feed items
public
calculateMetas(string $lang, int $pubdate, string $source_name, string $guid[, string $media_category = "news" ]) : array<string|int, mixed>
Parameters
- $lang : string
-
the locale_tag of the feed item
- $pubdate : int
-
UNIX timestamp publication date of item
- $source_name : string
-
the name of the feed
- $guid : string
-
the guid of the item
- $media_category : string = "news"
-
determines what media: metas to inject. Default is news.
Return values
array<string|int, mixed> —$meta_ids meta words found
computeDocId()
Given a $site array of information about a web page/document. Use CrawlConstant::URL and CrawlConstant::HASH fields to compute a unique doc id for the array.
public
static computeDocId(array<string|int, mixed> $site) : string
Parameters
- $site : array<string|int, mixed>
-
site to compute doc_id for
Return values
string —doc_id
contains()
Whether the active filter for this feed contains the feed item with the supplied key
public
contains(string $key) : bool
Parameters
- $key : string
-
the feed item id to check if in archive
Return values
bool —true if it is in the archive, false otherwise
deDeltaPostingsSumFrequencies()
Within postings, DOC_MAP_INDEX and POSITION_OFFSETS to position lists are stored as delta lists (differences over previous values). This method undoes the delta list to restore the actual DELTA_DOC_MAP_INDEX and POSITION_OFFSETS values. It also computes the sum of the frequencies of items within the list of postings. This method is currently only used for the active partition in an index (the one whose terms haven't yet been added to the B+-tree).
public
deDeltaPostingsSumFrequencies(array<string|int, mixed> &$postings) : int
Parameters
- $postings : array<string|int, mixed>
-
a reference to an array of posting lists for a term (this will be changed by this method)
Return values
int —sum of the frequencies of term occurrences as given by the above postings
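Undoing a delta list amounts to taking a running sum. The Python sketch below shows that idea on a list of posting dictionaries, modifying them in place and returning the frequency total, as the description states. The key names "DOC_MAP_INDEX", "POSITION_OFFSET" and "FREQUENCY", and the assumption that each stored value is a plain difference from the previous one, are illustrative guesses and not the actual Yioop record format.

```python
def de_delta_postings_sum_frequencies(postings):
    """Replace delta-encoded fields with absolute values; return total frequency."""
    doc_map_index = 0
    position_offset = 0
    total_frequency = 0
    for posting in postings:
        # Each stored value is assumed to be the difference from the previous one.
        doc_map_index += posting["DOC_MAP_INDEX"]
        position_offset += posting["POSITION_OFFSET"]
        posting["DOC_MAP_INDEX"] = doc_map_index
        posting["POSITION_OFFSET"] = position_offset
        total_frequency += posting["FREQUENCY"]
    return total_frequency


postings = [
    {"DOC_MAP_INDEX": 3, "POSITION_OFFSET": 10, "FREQUENCY": 2},
    {"DOC_MAP_INDEX": 2, "POSITION_OFFSET": 15, "FREQUENCY": 1},
    {"DOC_MAP_INDEX": 4, "POSITION_OFFSET": 7, "FREQUENCY": 5},
]
total = de_delta_postings_sum_frequencies(postings)
```

Storing differences keeps the numbers small, which tends to compress better than storing the absolute indices and offsets.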
findNumSlashes()
Finds number of '/' in the url after the hostname represented by doc_id $key.
public
static findNumSlashes(string $key) : mixed
Parameters
- $key : string
-
to find '/' count
Return values
mixed —
forceSave()
Forces the current shard to be saved
public
forceSave() : mixed
Return values
mixed —
getArchiveInfo()
Gets the description, count of documents, and number of partitions of the document store in the supplied directory. If the file arc_description.txt exists, this is viewed as a dummy index archive for the sole purpose of allowing conversions of downloaded data such as arc files into Yioop! format.
public
static getArchiveInfo(string $dir_name) : array<string|int, mixed>
Parameters
- $dir_name : string
-
path to a directory containing a documents IndexDocumentBundle
Return values
array<string|int, mixed> —summary of the given archive
getCachePage()
Given the $doc_id of a document and a $partition to look for it in, returns the cached page of the document if present and [] otherwise
public
getCachePage(string $doc_id, int $partition) : array<string|int, mixed>
Parameters
- $doc_id : string
-
of document to look up
- $partition : int
-
to look for document in
Return values
array<string|int, mixed> —desired page cache or [] if look up failed
getParamModifiedTime()
Returns the last time the archive info of the bundle was modified.
public
static getParamModifiedTime(string $dir_name) : mixed
Parameters
- $dir_name : string
-
folder with archive bundle
Return values
mixed —
getPartitionBaseFolder()
Gets the file path corresponding to the partition with index $partition
public
getPartitionBaseFolder(int $partition) : string
Parameters
- $partition : int
-
desired partition index
Return values
string —file path to where this partitions index data is stored (Not the original documents which are stored in the PartitionDocumentBundle)
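Based on the constants documented on this page, the layout of a partition's index folder can be sketched as below. The Python helpers `partition_base_folder` and `partition_files` are hypothetical, and the exact path that getPartitionBaseFolder() returns is an assumption inferred from the POSITIONS_DOC_MAP_FOLDER and PARTITION_FILENAMES descriptions.

```python
import posixpath

POSITIONS_DOC_MAP_FOLDER = "positions_doc_maps"
PARTITION_FILENAMES = ["doc_map", "last_entries", "positions", "postings"]


def partition_base_folder(dir_name, partition):
    # Partition i is assumed to live in subfolder i of POSITIONS_DOC_MAP_FOLDER.
    return posixpath.join(dir_name, POSITIONS_DOC_MAP_FOLDER, str(partition))


def partition_files(dir_name, partition):
    # The files expected inside a partition's subfolder.
    base = partition_base_folder(dir_name, partition)
    return [posixpath.join(base, name) for name in PARTITION_FILENAMES]


base = partition_base_folder("feeds", 3)
files = partition_files("feeds", 3)
```

The original documents themselves are not under this folder; per the description above they are kept in the PartitionDocumentBundle.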
getPostingsString()
Gets the postings stored in the postings file of a partition from $offset to $offset + $len, removing the 255 encoding.
public
getPostingsString(int $partition, int $offset, int $len) : string
Parameters
- $partition : int
-
partition to retrieve postings from
- $offset : int
-
byte offset into the partition's postings file at which to look for them
- $len : int
-
length of the posting list to retrieve.
Return values
string —encoded posting list data -- vbyte encoded number of postings, followed by the posting data in PackedTableTools format
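As a rough illustration of reading the leading vbyte-encoded count from such a string, here is a minimal sketch. The helper decodeVByte() is hypothetical, and it assumes a common variable-byte layout (seven payload bits per byte, least significant group first, high bit set when more bytes follow). The byte layout Yioop actually uses may differ.

```php
<?php
/**
 * Hypothetical helper, not part of Yioop: decodes one variable-byte integer
 * starting at $offset. Assumes 7 payload bits per byte, least significant
 * group first, with the high bit marking that another byte follows.
 *
 * @return array pair [decoded value, offset of the first byte after it]
 */
function decodeVByte(string $data, int $offset = 0): array
{
    $value = 0;
    $shift = 0;
    $len = strlen($data);
    while ($offset < $len) {
        $byte = ord($data[$offset]);
        $offset++;
        // keep the low seven bits and move them into place
        $value |= ($byte & 0x7F) << $shift;
        if (($byte & 0x80) === 0) {
            break; // high bit clear: this was the last byte of the number
        }
        $shift += 7;
    }
    return [$value, $offset];
}

// 300 = 0b100101100 is stored as 0xAC (low 7 bits plus continue bit), 0x02
$encoded = "\xAC\x02" . "rest-of-posting-data";
[$num_postings, $next_offset] = decodeVByte($encoded);
```

In this sketch $num_postings is the decoded count and $next_offset is where the posting data would begin.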
getSummary()
Given the $doc_id of a document and a $partition to look for it in, returns the document summary info if present and [] otherwise.
public
getSummary(string $doc_id, int $partition) : array<string|int, mixed>
Parameters
- $doc_id : string
-
of document to look up
- $partition : int
-
to look for document in
Return values
array<string|int, mixed> —desired summary or [] if the lookup failed
getWordInfo()
Gets an array of posting list positions for each shard in the bundle $index_name for the word id $term_id
public
getWordInfo(string $term_id[, int $threshold = -1 ], mixed $offset[, mixed $num_partitions = -1 ][, bool $with_remaining_total = false ]) : array<string|int, mixed>
Parameters
- $term_id : string
-
id of phrase or word to look up in bundle dictionary
- $threshold : int = -1
-
after the number of results exceeds this amount stop looking for more dictionary entries.
- $offset : mixed
- $num_partitions : mixed = -1
- $with_remaining_total : bool = false
-
whether to also return the total number of postings found
Return values
array<string|int, mixed> —either [total, sequence of four tuples] or sequence of four tuples: (index_shard generation, posting_list_offset, length, exact id that matches $term_id)
invertOneSite()
Used to create inverted index for one site and add its information to the current partition.
public
invertOneSite(array<string|int, mixed> $site, array<string|int, mixed> $url_info, int &$link_cnt) : string
Parameters
- $site : array<string|int, mixed>
-
site to invert
- $url_info : array<string|int, mixed>
-
collection of urls and hashes of documents which map to the same document
- $link_cnt : int
-
current count of number of links discovered so far
Return values
string —$site_url canonical url for site
isACldDocId()
Checks if a doc_id $key is that of a Company level domain (cld) or www.cld.
public
static isACldDocId(string $key) : mixed
I.e., a url https://yahoo.com/ or https://www.yahoo.com/ as opposed to https://foo.yahoo.com/
Parameters
- $key : string
-
doc_id to check if it is that of a cld or not
Return values
mixed
isAHostDocId()
Checks if a doc_id $key is that of a host url.
public
static isAHostDocId(string $key) : mixed
I.e., a url https://www.yahoo.com/ as opposed to https://www.yahoo.com/foo
Parameters
- $key : string
-
doc_id to check if it is that of a host url or not
Return values
mixed
isAWikipediaPage()
Checks if a doc_id $key is that of a Wikipedia page.
public
static isAWikipediaPage(string $key) : mixed
Parameters
- $key : string
-
to check if Wikipedia page or not
Return values
mixed
isType()
Checks if a doc_id corresponds to a particular large scale type among external_link, internal_link, link (union of previous two), binary, feed, image, text, video, document (union of previous five)
public
static isType(string $key, mixed $types) : bool
Parameters
- $key : string
-
to check if doc or not
- $types : mixed
Return values
bool —true if a document
prepareIndexMap()
As a pre-step to calculating the inverted index information for a partition, this method groups documents and links to documents into single objects.
public
prepareIndexMap(int $partition[, array<string|int, mixed> $test_index = [] ]) : array<string|int, mixed>
It also does simple deduplication of documents that have the same hash. It then returns an array of the grouped document data. Grouping is done by giving a score to each document based on (number of doc in index - order doc added). For two entries with the same hash_url, a document will be chosen over a link as the representative; otherwise, the one with higher score will be chosen as the representative. The representative document is given the sum of the scores of its constituents. A second phase where documents are grouped by hash of the text body is also done. Finally, the returned documents are sorted by their scores. So the order of documents from this process is roughly in the order of importance.
Parameters
- $partition : int
-
index of partition to do deduplication for in the case that test index is empty
- $test_index : array<string|int, mixed> = []
-
is non-empty only when testing what this method does, in which case it should consist of an array of $doc_id => string representing a possible record for that doc. As deduplication is done entirely based on components of the doc_id (hash_url, doc_type, hash_doc, hash_host), the string doesn't matter too much.
Return values
array<string|int, mixed> —groups doc_id => records associated with that doc_id
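The first grouping phase described above can be sketched roughly as follows. This is only an illustration of the scoring rule, not the method's actual code: the function groupByHashUrl() and the entry keys ID, HASH_URL, and IS_DOC are hypothetical, and the second phase (grouping by hash of the text body) is omitted.

```php
<?php
/**
 * Hypothetical sketch of grouping entries that share a hash_url. Each entry
 * is an array with illustrative keys ID, HASH_URL, and IS_DOC; this is not
 * the record layout used by the bundle.
 */
function groupByHashUrl(array $entries): array
{
    $num_entries = count($entries);
    $groups = [];
    foreach (array_values($entries) as $order => $entry) {
        // score is (number of docs in index - order doc added)
        $entry['OWN_SCORE'] = $num_entries - $order;
        $entry['SCORE'] = $entry['OWN_SCORE'];
        $key = $entry['HASH_URL'];
        if (!isset($groups[$key])) {
            $groups[$key] = $entry;
            continue;
        }
        $current = $groups[$key];
        if ($current['IS_DOC'] !== $entry['IS_DOC']) {
            // a document is preferred over a link as the representative
            $winner = $entry['IS_DOC'] ? $entry : $current;
        } else {
            // otherwise the entry with the higher own score represents
            $winner = ($entry['OWN_SCORE'] > $current['OWN_SCORE']) ?
                $entry : $current;
        }
        // the representative carries the sum of its constituents' scores
        $winner['SCORE'] = $current['SCORE'] + $entry['SCORE'];
        $groups[$key] = $winner;
    }
    // most important groups first
    uasort($groups, fn($a, $b) => $b['SCORE'] <=> $a['SCORE']);
    return $groups;
}

$grouped = groupByHashUrl([
    ['ID' => 'link_a', 'HASH_URL' => 'a', 'IS_DOC' => false],
    ['ID' => 'doc_b', 'HASH_URL' => 'b', 'IS_DOC' => true],
    ['ID' => 'doc_a', 'HASH_URL' => 'a', 'IS_DOC' => true],
]);
```

Here the link and the document for hash_url a merge into one group represented by doc_a with score 3 + 1 = 4, which sorts ahead of doc_b with score 2.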
setArchiveInfo()
Sets the archive info struct for the web archive bundle associated with this bundle. This struct has fields like: DESCRIPTION (serialized store of global parameters of the crawl like seed sites, timestamp, etc).
public
static setArchiveInfo(string $dir_name, array<string|int, mixed> $update_info) : mixed
Parameters
- $dir_name : string
-
folder with archive bundle
- $update_info : array<string|int, mixed>
-
struct with above fields
Return values
mixed
stopIndexing()
Used when a crawl stops to perform final dictionary operations to produce a working stand-alone index.
public
stopIndexing() : mixed
Return values
mixed
unpackPostings()
Given the postings as a string for a partition for a term, unpacks them into an array of postings, doing de-delta of doc_map_indices and de-delta of positions. Each posting represents the occurrences of a term in a document, so the frequency component is the number of occurrences of the term in the document. This method also computes the sum of these frequencies over all postings in the partition.
public
unpackPostings(string $postings_string) : array<string|int, mixed>
Parameters
- $postings_string : string
-
compressed string representation of a set of postings for a term
Return values
array<string|int, mixed> —a pair [array of unpacked postings, sum of frequencies of all the postings]
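To illustrate what de-delta means here, the following minimal sketch works on postings that have already been decoded into integer arrays. The function names and the keys DOC_INDEX_DELTA, POSITION_DELTAS, DOC_MAP_INDEX, FREQUENCY, and POSITIONS are hypothetical and chosen for illustration; the real method starts from the packed string and its record layout may differ.

```php
<?php
/**
 * Hypothetical helper: turns a list of gaps back into absolute values by
 * keeping a running sum.
 */
function deDelta(array $deltas): array
{
    $values = [];
    $current = 0;
    foreach ($deltas as $delta) {
        $current += $delta;
        $values[] = $current;
    }
    return $values;
}

/**
 * Hypothetical sketch: undoes delta coding of doc map indices and positions,
 * and totals the term frequencies across all postings.
 *
 * @return array pair [unpacked postings, sum of frequencies]
 */
function unpackDeltaPostings(array $packed): array
{
    $postings = [];
    $total_frequency = 0;
    $doc_index = 0;
    foreach ($packed as $posting) {
        // doc map indices are stored as gaps from the previous posting
        $doc_index += $posting['DOC_INDEX_DELTA'];
        $positions = deDelta($posting['POSITION_DELTAS']);
        // frequency is the number of occurrences in this document
        $frequency = count($positions);
        $total_frequency += $frequency;
        $postings[] = ['DOC_MAP_INDEX' => $doc_index,
            'FREQUENCY' => $frequency, 'POSITIONS' => $positions];
    }
    return [$postings, $total_frequency];
}

[$postings, $total] = unpackDeltaPostings([
    ['DOC_INDEX_DELTA' => 3, 'POSITION_DELTAS' => [2, 5, 1]],
    ['DOC_INDEX_DELTA' => 4, 'POSITION_DELTAS' => [7]],
]);
```

With this input the first posting refers to doc map index 3 with positions 2, 7, 8, the second to doc map index 7 with position 7, and the frequency total is 4.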
updateDictionary()
For every partition between next partition and save partition, adds the posting list information to the dictionary BPlusTree. At the end of this process next partition and save partition should be the same
public
updateDictionary([string $taking_too_long_touch = null ][, bool $till_equal = true ]) : mixed
Parameters
- $taking_too_long_touch : string = null
-
a filename of a file to touch so its last modified time becomes the current time. In a typical Yioop crawl this is done for the CrawlConstants::crawl_status_file file to prevent Yioop's web interface from stopping the crawl because it has seen no recent progress activity on a crawl.
- $till_equal : bool = true
-
if set to true, keeps adding partitions up to the save partition; if set to false, only adds one partition
Return values
mixed
updateTrendingTermCounts()
Updates trending term counts based on the string from the current feed item.
public
updateTrendingTermCounts(array<string|int, mixed> &$term_counts, string $source_phrase, array<string|int, mixed> $word_or_phrase_list, string $media_category, string $source_name, string $lang, int $pubdate[, string $source_stop_regex = "" ]) : mixed
Parameters
- $term_counts : array<string|int, mixed>
-
lang => [term => occurrences]
- $source_phrase : string
-
original non-stemmed phrase from feed item to adjust $term_counts with. Used to remember non-stemmed terms. We assume we have already extracted position lists from
- $word_or_phrase_list : array<string|int, mixed>
-
associative array of stemmed_word_or_phrase => positions in the feed item where it occurs
- $media_category : string
-
of the feed source the item came from. Trending counts are grouped by media category
- $source_name : string
-
of the feed source the item came from. We exclude from counts the name of the feed source
- $lang : string
-
locale_tag for this feed item
- $pubdate : int
-
timestamp when string was published (used in weighting)
- $source_stop_regex : string = ""
-
a regex to remove terms which occur frequently for this particular source