Yioop_V9.5_Source_Code

ArcTool extends DictionaryUpdater in package Application implements CrawlConstants

Command line program that allows one to examine the content of the WebArchiveBundles and IndexArchiveBundles of Yioop crawls.

To see all of the available commands, run it from the command line with a syntax like:

php ArcTool.php

Interfaces, Classes, Traits and Enums

CrawlConstants: Shared constants and enums used by components that are involved in the crawling process

MAX_BUFFER_DOCS = 200: The maximum number of documents the ArcTool list function will read into memory in one go.
MAX_REBUILD_DOCS = 8000: The maximum number of documents the ArcTool will rebuild/migrate in one go
__construct() : mixed: Initializes the ArcTool. For now, it does nothing.
badFormatMessageAndExit() : mixed: Outputs the "hey, this isn't a known bundle" message and then exits.
checkFilter() : mixed: Outputs to the terminal whether the bloom filter $filter_path contains the string $item
fixPartitionIndexes() : mixed: Recomputes the hash index (.ix) files for a range of partitions from start_partition to end_partition in the documents subfolder of an IndexDocumentBundle. An ix file contains a sequence of compressed 4-tuples (doc_id, summary_offset, summary_length, cache_length) corresponding to a partition file (these end in .txt.gz and are a sequence of compressed document summaries followed by original documents).
getArchiveKind() : string: Given a folder name, determines the kind of bundle (if any) it holds.
getArchiveName() : string: Given a complete path to an archive returns its filename
inject() : mixed: Adds a list of urls as an upcoming schedule for a given queue bundle.
instantiateIterator() : mixed: Used to create an archive_bundle_iterator for a non-yioop archive. As these iterators sometimes make use of a folder to store savepoints, we create a temporary folder for this purpose in the current directory. This should be garbage collected elsewhere.
makeFilter() : mixed: Makes a BloomFilterFile object from a dictionary file $dict_file which has items listed one per line, or items listed as some column of a CSV file. The result is output to $filter_path
migrateIndexArchive() : mixed: Copies an IndexArchiveBundle (or derived class) on disk into an IndexDocumentBundle (on disk). The new bundle will be at the old bundle's location while the old bundle is renamed after the process to "Old" . name_of_old_bundle
outputArchiveList() : mixed: Lists the Web or IndexArchives in the crawl directory
outputCountBundle() : mixed: Counts and outputs the number of docs and links in each shard in the archive supplied in $archive_path as well as an overall count
outputDictInfo() : mixed: Prints the dictionary records for a word in an IndexDocumentBundle
outputDocLookup() : array<string|int, mixed>: Outputs the $doc_map_index'th document from the $partition partition of the IndexDocumentBundle in folder $index_name
outputInfo() : mixed: Determines whether the supplied path is a WebArchiveBundle, an IndexArchiveBundle, DoubleIndexBundle, or non-Yioop Archive.
outputInfoDoubleIndexBundle() : mixed: Outputs to stdout header information for a DoubleIndexBundle bundle.
outputInfoFeedDocumentBundle() : mixed: Outputs to stdout header information for a FeedDocumentBundle bundle.
outputInfoIndexDocumentBundle() : mixed: Outputs to stdout header information for an IndexDocumentBundle bundle.
outputPartitionInfo() : mixed: Prints information about the number of words and frequencies of words within the $num'th partition in the bundle
outputShowPages() : mixed: Used to list out the pages/summaries stored in a bundle at $archive_path. It lists to stdout $num many documents starting at $start.
rebuildIndexBundle() : mixed: Used to recompute both the index shards and the dictionary of an index archive. The first step involves re-extracting the words into an inverted index from the summaries' web_archives.
run() : mixed: The main code for the dictionary updater. Updates the dictionary for the IndexDocumentBundle at $bundle_path running on channel $channel, from its current next_partition to process up to the current save partition. Partitions are groups of documents that have been downloaded, but whose words have not necessarily been added to the dictionary for the bundle.
start() : mixed: Runs the ArcTool on the supplied command line arguments
usageMessageAndExit() : mixed: Outputs the "how to use this tool" message and then exits.

MAX_BUFFER_DOCS

The maximum number of documents the ArcTool list function will read into memory in one go.


    public mixed MAX_BUFFER_DOCS = 200

MAX_REBUILD_DOCS

The maximum number of documents the ArcTool will rebuild/migrate in one go


    public mixed MAX_REBUILD_DOCS = 8000

__construct()

Initializes the ArcTool. For now, it does nothing.


    public __construct() : mixed

Return values

mixed —

badFormatMessageAndExit()

Outputs the "hey, this isn't a known bundle" message and then exits.


    public badFormatMessageAndExit(string $archive_name[, string $allowed_archives = "web or index" ]) : mixed

Parameters

$archive_name : string: name or path to what was supposed to be an archive
$allowed_archives : string = "web or index": a string list of archives types that $archive_name could belong to

Return values

mixed —

checkFilter()

Outputs to the terminal whether the bloom filter $filter_path contains the string $item


    public checkFilter(string $filter_path, string $item) : mixed

Parameters

$filter_path : string: name of the bloom filter file to check for the item
$item : string: item to check for in the bloom filter

Return values

mixed —

fixPartitionIndexes()

Recomputes the hash index (.ix) files for a range of partitions from start_partition to end_partition in the documents subfolder of an IndexDocumentBundle. An ix file contains a sequence of compressed 4-tuples (doc_id, summary_offset, summary_length, cache_length) corresponding to a partition file (these end in .txt.gz and are a sequence of compressed document summaries followed by original documents).


    public fixPartitionIndexes(string $archive_path, int $start_partition[, int $end_partition = -1 ]) : mixed

Parameters

$archive_path : string: the path of a directory that holds an IndexDocumentBundle
$start_partition : int: first partition to recompute
$end_partition : int = -1: last partition to recompute (inclusive)

Return values

mixed —
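
To make the 4-tuple idea concrete, here is a minimal Python sketch of packing and recomputing such records. It is not Yioop's on-disk format: the fixed-width layout (an 8-byte doc_id plus three 32-bit integers), the use of zlib, the assumption that each summary is immediately followed by its cached document, and the names pack_index, unpack_index and rebuild_index are all hypothetical choices made for illustration.

```python
import struct
import zlib

# Hypothetical fixed-width layout, chosen only for illustration:
# an 8-byte doc_id followed by three unsigned 32-bit big-endian integers.
RECORD = struct.Struct(">8sIII")


def pack_index(records):
    """Pack (doc_id, summary_offset, summary_length, cache_length)
    tuples into one compressed byte string."""
    raw = b"".join(RECORD.pack(*rec) for rec in records)
    return zlib.compress(raw)


def unpack_index(blob):
    """Inverse of pack_index: return the list of 4-tuples."""
    raw = zlib.decompress(blob)
    return [RECORD.unpack_from(raw, pos)
            for pos in range(0, len(raw), RECORD.size)]


def rebuild_index(docs):
    """Recompute the 4-tuples from (doc_id, summary, cache) triples,
    assuming each summary is stored right before its cached document."""
    records, offset = [], 0
    for doc_id, summary, cache in docs:
        records.append((doc_id, offset, len(summary), len(cache)))
        offset += len(summary) + len(cache)
    return records
```

Under these assumptions, rebuilding an index amounts to walking the partition's documents in order, accumulating offsets, and writing the packed result next to the partition file.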

getArchiveKind()

Given a folder name, determines the kind of bundle (if any) it holds.


    public static getArchiveKind(string $archive_path) : string

It does this based on the expected location of the description.txt file, or arc_description.ini (in the case of a non-yioop archive)

Parameters

$archive_path : string: the path to archive folder

Return values

string —

the archive bundle type, either: WebArchiveBundle or IndexArchiveBundle
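
As a rough illustration of detecting a bundle from its description file, the following Python sketch checks only the top level of the folder. The function name guess_archive_kind and the returned labels are hypothetical, and the sketch does not attempt to tell Yioop bundle kinds apart by where description.txt sits, which is what the method above is described as doing.

```python
import os


def guess_archive_kind(archive_path):
    """Return a coarse label for the folder at archive_path, or None
    if neither description file is found at its top level."""
    # arc_description.ini marks a non-yioop archive per the text above.
    if os.path.isfile(os.path.join(archive_path, "arc_description.ini")):
        return "non-yioop archive"
    # description.txt marks a Yioop bundle; its exact kind is not resolved here.
    if os.path.isfile(os.path.join(archive_path, "description.txt")):
        return "yioop bundle"
    return None
```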

getArchiveName()

Given a complete path to an archive returns its filename


    public getArchiveName(string $archive_path) : string

Parameters

$archive_path : string: a path to a yioop or non-yioop archive

Return values

string —

its filename

inject()

Adds a list of urls as an upcoming schedule for a given queue bundle.


    public inject(string $timestamp, string $url_file_name) : mixed

Can be used to make a closed schedule startable

Parameters

$timestamp : string: timestamp of the queue bundle to add urls to
$url_file_name : string: name of a file consisting of urls to inject into the given crawl

Return values

mixed —

instantiateIterator()

Used to create an archive_bundle_iterator for a non-yioop archive. As these iterators sometimes make use of a folder to store savepoints, we create a temporary folder for this purpose in the current directory. This should be garbage collected elsewhere.


    public instantiateIterator(string $archive_path, string $iterator_type) : mixed

Parameters

$archive_path : string: path to non-yioop archive
$iterator_type : string: name of archive_bundle_iterator used to iterate over archive.

Return values

mixed —

makeFilter()

Makes a BloomFilterFile object from a dictionary file $dict_file which has items listed one per line, or items listed as some column of a CSV file. The result is output to $filter_path


    public makeFilter(string $dict_file, string $filter_path[, int $column_num = -1 ]) : mixed

Parameters

$dict_file : string: dictionary file to make the BloomFilterFile from
$filter_path : string: path of the file to serialize the BloomFilterFile to
$column_num : int = -1: if negative, assumes $dict_file has one entry per line; if >= 0, it is the index of the column in a CSV to use for items

Return values

mixed —
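
The one-entry-per-line versus CSV-column behaviour can be sketched in Python as follows. This is a generic in-memory Bloom filter, not Yioop's BloomFilterFile: the class BloomFilter, its sizes, the SHA-256 based hashing and the function make_filter are assumptions for illustration, and the sketch does not serialize the filter to a file.

```python
import csv
import hashlib


class BloomFilter:
    """Small in-memory Bloom filter (illustrative sizes)."""

    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def make_filter(dict_file, column_num=-1):
    """Build a filter from one item per line (column_num < 0) or from
    the given zero-based CSV column (column_num >= 0)."""
    bloom = BloomFilter()
    with open(dict_file, newline="", encoding="utf-8") as handle:
        if column_num < 0:
            items = (line.strip() for line in handle)
        else:
            items = (row[column_num] for row in csv.reader(handle)
                     if len(row) > column_num)
        for item in items:
            if item:
                bloom.add(item)
    return bloom
```

A membership check in the spirit of checkFilter() is then `item in bloom`. A Bloom filter can report false positives but never false negatives, so a "contained" answer means "probably present".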

migrateIndexArchive()

Copies an IndexArchiveBundle (or derived class) on disk into an IndexDocumentBundle (on disk). The new bundle will be at the old bundle's location while the old bundle is renamed after the process to "Old" . name_of_old_bundle


    public migrateIndexArchive(string $archive_path) : mixed

Parameters

$archive_path : string: file path to an IndexArchiveBundle

Return values

mixed —
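
The renaming rule, "Old" concatenated with the old bundle's name, can be illustrated with a short Python helper. The function name old_bundle_path is hypothetical, and the sketch only computes the path; it performs no copying or renaming.

```python
import os


def old_bundle_path(archive_path):
    """Return the path the old bundle would be renamed to:
    same parent folder, with "Old" prefixed to the bundle's name."""
    parent, name = os.path.split(os.path.normpath(archive_path))
    return os.path.join(parent, "Old" + name)
```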

outputArchiveList()

Lists the Web or IndexArchives in the crawl directory


    public outputArchiveList() : mixed

Return values

mixed —

outputCountBundle()

Counts and outputs the number of docs and links in each shard in the archive supplied in $archive_path as well as an overall count


    public outputCountBundle(string $archive_path[, bool $set_count = false ]) : mixed

Parameters

$archive_path : string: path of archive to count
$set_count : bool = false: flag that controls whether to write the count back into the archive after computing it

Return values

mixed —

outputDictInfo()

Prints the dictionary records for a word in an IndexDocumentBundle


    public outputDictInfo(string $archive_path, string $word, int $start_record, int $num_records, bool $details) : mixed

Parameters

$archive_path : string: the path of a directory that holds an IndexArchiveBundle
$word : string: to look up dictionary record for
$start_record : int: first record to list out
$num_records : int: max records to list out
$details : bool: whether to show posting list details or not

Return values

mixed —

outputDocLookup()

Outputs the $doc_map_index'th document from the $partition partition of the IndexDocumentBundle in folder $index_name


    public outputDocLookup(string $index_name, int $partition, int $doc_map_index) : array<string|int, mixed>

Parameters

$index_name : string: folder containing an IndexDocumentBundle, DoubleIndexBundle, or FeedDocumentBundle
$partition : int: which partition to do the lookup in
$doc_map_index : int: index of which document to lookup

Return values

array<string|int, mixed> —

associative array of field => values associated with document

outputInfo()

Determines whether the supplied path is a WebArchiveBundle, an IndexArchiveBundle, DoubleIndexBundle, or non-Yioop Archive.


    public outputInfo(string $archive_path) : mixed

Then outputs to stdout header information about the bundle by calling the appropriate sub-function.

Parameters

$archive_path : string: The path of a directory that holds WebArchiveBundle, IndexArchiveBundle, or non-Yioop archive data

Return values

mixed —

outputInfoDoubleIndexBundle()

Outputs to stdout header information for a DoubleIndexBundle bundle.


    public outputInfoDoubleIndexBundle(array<string|int, mixed> $info, string $archive_path) : mixed

Parameters

$info : array<string|int, mixed>: header info that has already been read from the description.txt file
$archive_path : string: file path of the folder containing the bundle

Return values

mixed —

outputInfoFeedDocumentBundle()

Outputs to stdout header information for a FeedDocumentBundle bundle.


    public outputInfoFeedDocumentBundle(array<string|int, mixed> $info, string $archive_path[, string $alternate_description = "" ][, bool $only_storage_info = false ][, bool $only_crawl_params = false ]) : mixed

Parameters

$info : array<string|int, mixed>: header info that has already been read from the description.txt file
$archive_path : string: file path of the folder containing the bundle
$alternate_description : string = "": used as the text for description rather than what's given in $info
$only_storage_info : bool = false: output only info about storage statistics, not info about crawl parameters
$only_crawl_params : bool = false: output only info about crawl parameters, not storage statistics

Return values

mixed —

outputInfoIndexDocumentBundle()

Outputs to stdout header information for an IndexDocumentBundle bundle.


    public outputInfoIndexDocumentBundle(array<string|int, mixed> $info, string $archive_path[, string $alternate_description = "" ][, bool $only_storage_info = false ][, bool $only_crawl_params = false ]) : mixed

Parameters

$info : array<string|int, mixed>: header info that has already been read from the description.txt file
$archive_path : string: file path of the folder containing the bundle
$alternate_description : string = "": used as the text for description rather than what's given in $info
$only_storage_info : bool = false: output only info about storage statistics, not info about crawl parameters
$only_crawl_params : bool = false: output only info about crawl parameters, not storage statistics

Return values

mixed —

outputPartitionInfo()

Prints information about the number of words and frequencies of words within the $num'th partition in the bundle


    public outputPartitionInfo(string $archive_path, int $num) : mixed

Parameters

$archive_path : string: the path of a directory that holds an IndexDocumentBundle
$num : int: number of the partition to show info for

Return values

mixed —

outputShowPages()

Used to list out the pages/summaries stored in a bundle at $archive_path. It lists to stdout $num many documents starting at $start.


    public outputShowPages(string $archive_path, int $start, int $num) : mixed

Parameters

$archive_path : string: path to bundle to list documents for
$start : int: first document to list
$num : int: number of documents to list

Return values

mixed —
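
Listing $num documents from $start while reading at most MAX_BUFFER_DOCS into memory at a time can be sketched in Python like this. The function show_pages and its iterator-based interface are assumptions for illustration; only the value 200 comes from the MAX_BUFFER_DOCS constant documented on this page.

```python
from itertools import islice

MAX_BUFFER_DOCS = 200  # value of the constant documented on this page


def show_pages(documents, start, num, buffer_size=MAX_BUFFER_DOCS):
    """Yield lists of at most buffer_size documents that together cover
    positions start .. start + num - 1 of the documents iterable."""
    window = islice(documents, start, start + num)
    while True:
        batch = list(islice(window, buffer_size))
        if not batch:
            return
        yield batch
```

For example, asking for 450 documents starting at position 10 yields batches of 200, 200 and 50 documents, so memory use stays bounded by the buffer size regardless of $num.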

rebuildIndexBundle()

Used to recompute both the index shards and the dictionary of an index archive. The first step involves re-extracting the words into an inverted index from the summaries' web_archives.


    public rebuildIndexBundle(string $archive_path, mixed $start_generation) : mixed

Then a reindex is done.

Parameters

$archive_path : string: file path to an IndexArchiveBundle
$start_generation : mixed: which web archive generation to start rebuild from. If 'continue' then keeps going from where last attempt at a rebuild was.

Return values

mixed —

run()

The main code for the dictionary updater. Updates the dictionary for the IndexDocumentBundle at $bundle_path running on channel $channel, from its current next_partition to process up to the current save partition. Partitions are groups of documents that have been downloaded, but whose words have not necessarily been added to the dictionary for the bundle.


    public static run(int $channel, string $bundle_path) : mixed

Parameters

$channel : int: the channel the crawl is running on. Used in naming lock files
$bundle_path : string: the path to the IndexDocumentBundle or FeedDocumentBundle we are adding dictionary info for

Return values

mixed —

start()

Runs the ArcTool on the supplied command line arguments


    public start() : mixed

Return values

mixed —

usageMessageAndExit()

Outputs the "how to use this tool" message and then exits.


    public usageMessageAndExit() : mixed

Return values

mixed —
