ArcTool
extends DictionaryUpdater
in package
implements
CrawlConstants
Command line program that allows one to examine the content of the WebArchiveBundles and IndexArchiveBundles of Yioop crawls.
To see all of the available commands, run it from the command line with a syntax like:
php ArcTool.php
Interfaces, Classes, Traits and Enums
- CrawlConstants
- Shared constants and enums used by components that are involved in the crawling process
Table of Contents
- MAX_BUFFER_DOCS = 200
- The maximum number of documents the ArcTool list function will read into memory in one go.
- MAX_REBUILD_DOCS = 8000
- The maximum number of documents the ArcTool will rebuild/migrate in one go
- __construct() : mixed
- Initializes the ArcTool, for now does nothing
- badFormatMessageAndExit() : mixed
- Outputs the "hey, this isn't a known bundle" message and then exits.
- checkFilter() : mixed
- Outputs to the terminal whether the Bloom filter $filter_path contains the string $item
- fixPartitionIndexes() : mixed
- Recomputes the hash index (.ix) files for a range of partitions from start_partition to end_partition in the documents subfolder of an IndexDocumentBundle. An ix file contains a sequence of compressed 4-tuples (doc_id, summary_offset, summary_length, cache_length) corresponding to a partition file (these end in .txt.gz and are a sequence of compressed document summaries followed by original documents).
- getArchiveKind() : string
- Given a folder name, determines the kind of bundle (if any) it holds.
- getArchiveName() : string
- Given a complete path to an archive returns its filename
- inject() : mixed
- Adds a list of URLs as an upcoming schedule for a given queue bundle.
- instantiateIterator() : mixed
- Used to create an archive_bundle_iterator for a non-yioop archive. As these iterators sometimes make use of a folder to store savepoints, we create a temporary folder for this purpose in the current directory. This should be garbage collected elsewhere.
- makeFilter() : mixed
- Makes a BloomFilterFile object from a dictionary file $dict_file which has items listed one per line, or items listed as some column of a CSV file. The result is output to $filter_path.
- migrateIndexArchive() : mixed
- Copies an IndexArchiveBundle (or derived class) on disk into an IndexDocumentBundle (on disk). The new bundle will be at the old bundle's location while the old bundle is renamed after the process to "Old" . name_of_old_bundle
- outputArchiveList() : mixed
- Lists the Web or IndexArchives in the crawl directory
- outputCountBundle() : mixed
- Counts and outputs the number of docs and links in each shard in the archive supplied in $archive_path as well as an overall count
- outputDictInfo() : mixed
- Prints the dictionary records for a word in an IndexDocumentBundle
- outputDocLookup() : array<string|int, mixed>
- Outputs the $doc_map_index'th document from the $partition partition of the IndexDocumentBundle in folder $index_name
- outputInfo() : mixed
- Determines whether the supplied path is a WebArchiveBundle, an IndexArchiveBundle, DoubleIndexBundle, or non-Yioop Archive.
- outputInfoDoubleIndexBundle() : mixed
- Outputs to stdout header information for a DoubleIndexBundle bundle.
- outputInfoFeedDocumentBundle() : mixed
- Outputs to stdout header information for a FeedDocumentBundle bundle.
- outputInfoIndexDocumentBundle() : mixed
- Outputs to stdout header information for an IndexDocumentBundle bundle.
- outputPartitionInfo() : mixed
- Prints information about the number of words and frequencies of words within the $num'th partition in the bundle
- outputShowPages() : mixed
- Used to list out the pages/summaries stored in a bundle at $archive_path. It lists to stdout $num many documents starting at $start.
- rebuildIndexBundle() : mixed
- Used to recompute both the index shards and the dictionary of an index archive. The first step involves re-extracting the words into an inverted index from the summaries' web_archives.
- run() : mixed
- The main code for the dictionary updater. Updates the dictionary for the IndexDocumentBundle at $bundle_path running on channel $channel from its current next_partition to process to the current save partition. Partitions are groups of documents that have been downloaded, but whose words have not necessarily been added to the dictionary for the bundle.
- start() : mixed
- Runs the ArcTool on the supplied command line arguments
- usageMessageAndExit() : mixed
- Outputs the "how to use this tool" message and then exits.
Constants
MAX_BUFFER_DOCS
The maximum number of documents the ArcTool list function will read into memory in one go.
public
mixed
MAX_BUFFER_DOCS
= 200
MAX_REBUILD_DOCS
The maximum number of documents the ArcTool will rebuild/migrate in one go
public
mixed
MAX_REBUILD_DOCS
= 8000
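To illustrate what a per-pass cap such as MAX_BUFFER_DOCS implies for a listing request, here is a minimal sketch that splits a range of documents into bounded batches. The function name batchRanges and the loop are hypothetical and are not taken from ArcTool's source; only the value 200 comes from the constant above.

```php
<?php
// Hypothetical stand-in for ArcTool::MAX_BUFFER_DOCS (value from the docs).
const MAX_BUFFER_DOCS = 200;

/**
 * Splits a request for $num documents starting at $start into
 * [offset, count] pairs, none of which exceeds $limit documents.
 */
function batchRanges(int $start, int $num, int $limit = MAX_BUFFER_DOCS): array
{
    $ranges = [];
    $offset = $start;
    $remaining = $num;
    while ($remaining > 0) {
        $count = min($limit, $remaining);
        $ranges[] = [$offset, $count];
        $offset += $count;
        $remaining -= $count;
    }
    return $ranges;
}

// Usage: 450 documents are read as batches of 200, 200 and 50.
foreach (batchRanges(0, 450) as [$offset, $count]) {
    echo "read $count docs starting at $offset\n";
}
```

Capping each pass keeps memory use bounded regardless of how many documents are requested.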
Methods
__construct()
Initializes the ArcTool, for now does nothing
public
__construct() : mixed
Return values
mixed
badFormatMessageAndExit()
Outputs the "hey, this isn't a known bundle" message and then exits.
public
badFormatMessageAndExit(string $archive_name[, string $allowed_archives = "web or index" ]) : mixed
Parameters
- $archive_name : string
-
name or path to what was supposed to be an archive
- $allowed_archives : string = "web or index"
-
a string list of archives types that $archive_name could belong to
Return values
mixed
checkFilter()
Outputs to the terminal whether the Bloom filter $filter_path contains the string $item
public
checkFilter(string $filter_path, string $item) : mixed
Parameters
- $filter_path : string
-
name of the Bloom filter file to check for the item
- $item : string
-
item to check for in the Bloom filter
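For readers unfamiliar with Bloom filters, the membership check can be sketched with a small in-memory version. TinyBloomFilter below is a hypothetical class written for illustration; it is not Yioop's BloomFilterFile, and the md5-based hashing and sizes are assumptions.

```php
<?php
// Illustrative in-memory Bloom filter; NOT Yioop's BloomFilterFile.
final class TinyBloomFilter
{
    /** @var bool[] one flag per bit position */
    private $bits;
    private $size;
    private $numHashes;

    public function __construct(int $size = 1024, int $numHashes = 3)
    {
        $this->size = $size;
        $this->numHashes = $numHashes;
        $this->bits = array_fill(0, $size, false);
    }

    /** Bit positions for $item: one salted md5 prefix per hash function. */
    private function positions(string $item): array
    {
        $positions = [];
        for ($i = 0; $i < $this->numHashes; $i++) {
            // 7 hex digits = 28 bits, so the value is a non-negative int
            $hash = hexdec(substr(md5($i . ":" . $item), 0, 7));
            $positions[] = $hash % $this->size;
        }
        return $positions;
    }

    public function add(string $item): void
    {
        foreach ($this->positions($item) as $pos) {
            $this->bits[$pos] = true;
        }
    }

    /** False means definitely absent; true means possibly present. */
    public function contains(string $item): bool
    {
        foreach ($this->positions($item) as $pos) {
            if (!$this->bits[$pos]) {
                return false;
            }
        }
        return true;
    }
}

$filter = new TinyBloomFilter();
$filter->add("example.com");
echo $filter->contains("example.com") ? "contains\n" : "does not contain\n";
```

A Bloom filter never reports an added item as absent, but it can report an item that was never added as present, so a positive answer should be read as "probably".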
Return values
mixed
fixPartitionIndexes()
Recomputes the hash index (.ix) files for a range of partitions from start_partition to end_partition in the documents subfolder of an IndexDocumentBundle. An ix file contains a sequence of compressed 4-tuples (doc_id, summary_offset, summary_length, cache_length) corresponding to a partition file (these end in .txt.gz and are a sequence of compressed document summaries followed by original documents).
public
fixPartitionIndexes(string $archive_path, int $start_partition[, int $end_partition = -1 ]) : mixed
Parameters
- $archive_path : string
-
the path of a directory that holds an IndexDocumentBundle
- $start_partition : int
-
first partition to recompute
- $end_partition : int = -1
-
last partition to recompute (inclusive)
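The idea of a sequence of fixed-shape 4-tuples can be sketched with PHP's pack() and unpack(). The byte layout below (an 8-byte doc_id followed by three unsigned 32-bit big-endian integers, with no compression) is purely an assumption for illustration; the actual .ix encoding is not specified on this page, and the helper names are hypothetical.

```php
<?php
// ASSUMED record layout for illustration only: an 8-byte doc_id followed by
// three unsigned 32-bit big-endian integers. The real .ix format may differ.
const IX_RECORD_FORMAT =
    "a8doc_id/Nsummary_offset/Nsummary_length/Ncache_length";
const IX_RECORD_LEN = 20; // 8 + 4 + 4 + 4 bytes

/** Serializes one (doc_id, summary_offset, summary_length, cache_length). */
function packIxRecord(string $doc_id, int $summary_offset,
    int $summary_length, int $cache_length): string
{
    return pack("a8NNN", $doc_id, $summary_offset, $summary_length,
        $cache_length);
}

/** Splits $data into fixed-length records and decodes each one. */
function unpackIxRecords(string $data): array
{
    $records = [];
    $len = strlen($data);
    for ($pos = 0; $pos + IX_RECORD_LEN <= $len; $pos += IX_RECORD_LEN) {
        $records[] = unpack(IX_RECORD_FORMAT,
            substr($data, $pos, IX_RECORD_LEN));
    }
    return $records;
}

// Usage: two records laid end to end, then decoded back into tuples.
$data = packIxRecord("doc00001", 0, 120, 900)
    . packIxRecord("doc00002", 1020, 80, 400);
print_r(unpackIxRecords($data));
```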
Return values
mixed
getArchiveKind()
Given a folder name, determines the kind of bundle (if any) it holds.
public
static getArchiveKind(string $archive_path) : string
It does this based on the expected location of the description.txt file, or arc_description.ini (in the case of a non-yioop archive)
Parameters
- $archive_path : string
-
the path to archive folder
Return values
string —the archive bundle type, either: WebArchiveBundle or IndexArchiveBundle
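The file-location check described above might be sketched as follows. The helper guessArchiveFamily is hypothetical, it assumes the description file sits directly inside the folder, and it only distinguishes Yioop from non-Yioop archives rather than returning a specific bundle class name as getArchiveKind() does.

```php
<?php
/**
 * Hypothetical helper: reports which description file, if any, is present
 * directly inside $archive_path. It mirrors only the file-location idea.
 */
function guessArchiveFamily(string $archive_path): string
{
    $archive_path = rtrim($archive_path, "/");
    if (is_file($archive_path . "/description.txt")) {
        return "yioop";
    }
    if (is_file($archive_path . "/arc_description.ini")) {
        return "non-yioop";
    }
    return "unknown";
}

// Usage: build a throwaway folder, probe it, then clean up.
$dir = sys_get_temp_dir() . "/arc_sketch_" . uniqid();
mkdir($dir);
echo guessArchiveFamily($dir), "\n";
file_put_contents($dir . "/arc_description.ini", "arc_type = 'demo'\n");
echo guessArchiveFamily($dir), "\n";
unlink($dir . "/arc_description.ini");
rmdir($dir);
```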
getArchiveName()
Given a complete path to an archive returns its filename
public
getArchiveName(string $archive_path) : string
Parameters
- $archive_path : string
-
a path to a yioop or non-yioop archive
Return values
string —its filename
inject()
Adds a list of URLs as an upcoming schedule for a given queue bundle.
public
inject(string $timestamp, string $url_file_name) : mixed
Can be used to make a closed schedule startable
Parameters
- $timestamp : string
-
timestamp of the queue bundle to add URLs to
- $url_file_name : string
-
name of a file consisting of URLs to inject into the given crawl
Return values
mixed
instantiateIterator()
Used to create an archive_bundle_iterator for a non-yioop archive. As these iterators sometimes make use of a folder to store savepoints, we create a temporary folder for this purpose in the current directory. This should be garbage collected elsewhere.
public
instantiateIterator(string $archive_path, string $iterator_type) : mixed
Parameters
- $archive_path : string
-
path to non-yioop archive
- $iterator_type : string
-
name of archive_bundle_iterator used to iterate over archive.
Return values
mixed
makeFilter()
Makes a BloomFilterFile object from a dictionary file $dict_file which has items listed one per line, or items listed as some column of a CSV file. The result is output to $filter_path.
public
makeFilter(string $dict_file, string $filter_path[, int $column_num = -1 ]) : mixed
Parameters
- $dict_file : string
-
dictionary file to make the BloomFilterFile from
- $filter_path : string
-
path of the file to serialize the BloomFilterFile to
- $column_num : int = -1
-
if negative, assumes $dict_file has one entry per line; if >= 0, it is the index of the column in a CSV file to use for items
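The two input modes selected by $column_num can be sketched as below. The helper extractItems is hypothetical and works on an array of lines rather than a file; skipping blank lines and rows that lack the requested column are assumptions of this sketch, not documented behavior.

```php
<?php
/**
 * Hypothetical helper showing the two input modes described for $column_num:
 * a negative value means one item per line; otherwise that CSV column is used.
 */
function extractItems(array $lines, int $column_num = -1): array
{
    $items = [];
    foreach ($lines as $line) {
        $line = trim($line);
        if ($line === "") {
            continue; // assumption: blank lines carry no item
        }
        if ($column_num < 0) {
            $items[] = $line;
            continue;
        }
        // delimiter, enclosure and escape are passed explicitly
        $fields = str_getcsv($line, ",", '"', "\\");
        if (isset($fields[$column_num])) {
            $items[] = trim($fields[$column_num]);
        }
    }
    return $items;
}

// Usage: take column 1 of each CSV row as the item to add to a filter.
$csv = ["1,example.com,news", "2,example.org,blog"];
print_r(extractItems($csv, 1));
```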
Return values
mixed
migrateIndexArchive()
Copies an IndexArchiveBundle (or derived class) on disk into an IndexDocumentBundle (on disk). The new bundle will be at the old bundle's location while the old bundle is renamed after the process to "Old" . name_of_old_bundle
public
migrateIndexArchive(string $archive_path) : mixed
Parameters
- $archive_path : string
-
file path to an IndexArchiveBundle
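The renaming rule in the description ("Old" . name_of_old_bundle) can be sketched as a path computation. The helper oldBundlePath is hypothetical, the example path is made up, and the sketch assumes the renamed bundle stays in the same parent folder.

```php
<?php
// Hypothetical helper: where the old bundle would end up after migration,
// following the "Old" . name_of_old_bundle rule in the description.
// Assumes the renamed bundle stays in the same parent folder.
function oldBundlePath(string $archive_path): string
{
    $archive_path = rtrim($archive_path, "/");
    return dirname($archive_path) . "/Old" . basename($archive_path);
}

// Usage with a made-up path.
echo oldBundlePath("/tmp/archives/IndexData123"), "\n";
```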
Return values
mixed
outputArchiveList()
Lists the Web or IndexArchives in the crawl directory
public
outputArchiveList() : mixed
Return values
mixed
outputCountBundle()
Counts and outputs the number of docs and links in each shard in the archive supplied in $archive_path as well as an overall count
public
outputCountBundle(string $archive_path[, bool $set_count = false ]) : mixed
Parameters
- $archive_path : string
-
path of the archive to count
- $set_count : bool = false
-
flag that controls whether, after computing the count, to write it back into the archive
Return values
mixed
outputDictInfo()
Prints the dictionary records for a word in an IndexDocumentBundle
public
outputDictInfo(string $archive_path, string $word, int $start_record, int $num_records, bool $details) : mixed
Parameters
- $archive_path : string
-
the path of a directory that holds an IndexArchiveBundle
- $word : string
-
word to look up dictionary records for
- $start_record : int
-
first record to list out
- $num_records : int
-
maximum number of records to list out
- $details : bool
-
whether to show posting list details or not
Return values
mixed
outputDocLookup()
Outputs the $doc_map_index'th document from the $partition partition of the IndexDocumentBundle in folder $index_name
public
outputDocLookup(string $index_name, int $partition, int $doc_map_index) : array<string|int, mixed>
Parameters
- $index_name : string
-
folder containing an IndexDocumentBundle, DoubleIndexBundle, or FeedDocumentBundle
- $partition : int
-
which partition to do the lookup in
- $doc_map_index : int
-
index of which document to lookup
Return values
array<string|int, mixed> —associative array of field => values associated with document
outputInfo()
Determines whether the supplied path is a WebArchiveBundle, an IndexArchiveBundle, DoubleIndexBundle, or non-Yioop Archive.
public
outputInfo(string $archive_path) : mixed
Then outputs to stdout header information about the bundle by calling the appropriate sub-function.
Parameters
- $archive_path : string
-
The path of a directory that holds WebArchiveBundle, IndexArchiveBundle, or non-Yioop archive data
Return values
mixed
outputInfoDoubleIndexBundle()
Outputs to stdout header information for a DoubleIndexBundle bundle.
public
outputInfoDoubleIndexBundle(array<string|int, mixed> $info, string $archive_path) : mixed
Parameters
- $info : array<string|int, mixed>
-
header info that has already been read from the description.txt file
- $archive_path : string
-
file path of the folder containing the bundle
Return values
mixed
outputInfoFeedDocumentBundle()
Outputs to stdout header information for a FeedDocumentBundle bundle.
public
outputInfoFeedDocumentBundle(array<string|int, mixed> $info, string $archive_path[, string $alternate_description = "" ][, bool $only_storage_info = false ][, bool $only_crawl_params = false ]) : mixed
Parameters
- $info : array<string|int, mixed>
-
header info that has already been read from the description.txt file
- $archive_path : string
-
file path of the folder containing the bundle
- $alternate_description : string = ""
-
used as the text for description rather than what's given in $info
- $only_storage_info : bool = false
-
output only info about storage statistics; don't output info about crawl parameters
- $only_crawl_params : bool = false
-
output only info about crawl parameters, not storage statistics
Return values
mixed
outputInfoIndexDocumentBundle()
Outputs to stdout header information for an IndexDocumentBundle bundle.
public
outputInfoIndexDocumentBundle(array<string|int, mixed> $info, string $archive_path[, string $alternate_description = "" ][, bool $only_storage_info = false ][, bool $only_crawl_params = false ]) : mixed
Parameters
- $info : array<string|int, mixed>
-
header info that has already been read from the description.txt file
- $archive_path : string
-
file path of the folder containing the bundle
- $alternate_description : string = ""
-
used as the text for description rather than what's given in $info
- $only_storage_info : bool = false
-
output only info about storage statistics; don't output info about crawl parameters
- $only_crawl_params : bool = false
-
output only info about crawl parameters, not storage statistics
Return values
mixed
outputPartitionInfo()
Prints information about the number of words and frequencies of words within the $num'th partition in the bundle
public
outputPartitionInfo(string $archive_path, int $num) : mixed
Parameters
- $archive_path : string
-
the path of a directory that holds an IndexDocumentBundle
- $num : int
-
number of the partition to show info for
Return values
mixed
outputShowPages()
Used to list out the pages/summaries stored in a bundle at $archive_path. It lists to stdout $num many documents starting at $start.
public
outputShowPages(string $archive_path, int $start, int $num) : mixed
Parameters
- $archive_path : string
-
path to bundle to list documents for
- $start : int
-
first document to list
- $num : int
-
number of documents to list
Return values
mixed
rebuildIndexBundle()
Used to recompute both the index shards and the dictionary of an index archive. The first step involves re-extracting the words into an inverted index from the summaries' web_archives.
public
rebuildIndexBundle(string $archive_path, mixed $start_generation) : mixed
Then a reindex is done.
Parameters
- $archive_path : string
-
file path to an IndexArchiveBundle
- $start_generation : mixed
-
which web archive generation to start rebuild from. If 'continue' then keeps going from where last attempt at a rebuild was.
Return values
mixed
run()
The main code for the dictionary updater. Updates the dictionary for the IndexDocumentBundle at $bundle_path running on channel $channel from its current next_partition to process to the current save partition. Partitions are groups of documents that have been downloaded, but whose words have not necessarily been added to the dictionary for the bundle.
public
static run(int $channel, string $bundle_path) : mixed
Parameters
- $channel : int
-
the channel the crawl is running on. Used in naming lock files
- $bundle_path : string
-
the path to the IndexDocumentBundle or FeedDocumentBundle we are adding dictionary info for
Return values
mixed
start()
Runs the ArcTool on the supplied command line arguments
public
start() : mixed
Return values
mixed
usageMessageAndExit()
Outputs the "how to use this tool" message and then exits.
public
usageMessageAndExit() : mixed