Yioop_V9.5_Source_Code_Documentation

ArcTool extends DictionaryUpdater
implements CrawlConstants

Command line program that allows one to examine the content of the WebArchiveBundles and IndexArchiveBundles of Yioop crawls.

To see all of the available commands, run it from the command line with a syntax like:

php ArcTool.php

Tags
author

Chris Pollett (non-yioop archive code derived from earlier stuff by Shawn Tice)

Interfaces, Classes, Traits and Enums

CrawlConstants
Shared constants and enums used by components that are involved in the crawling process

Table of Contents

MAX_BUFFER_DOCS  = 200
The maximum number of documents the ArcTool list function will read into memory in one go.
MAX_REBUILD_DOCS  = 8000
The maximum number of documents the ArcTool will rebuild/migrate in one go
__construct()  : mixed
Initializes the ArcTool, for now does nothing
badFormatMessageAndExit()  : mixed
Outputs the "hey, this isn't a known bundle" message and then exits.
checkFilter()  : mixed
Outputs to the terminal whether the bloom filter $filter_path contains the string $item
fixPartitionIndexes()  : mixed
Recomputes the hash index (.ix) files for a range of partitions from start_partition to end_partition in the documents subfolder of an IndexDocumentBundle. An ix file contains a sequence of compressed 4-tuples (doc_id, summary_offset, summary_length, cache_length) corresponding to a partition file (these end in .txt.gz and are a sequence of compressed document summaries followed by original documents).
getArchiveKind()  : string
Given a folder name, determines the kind of bundle (if any) it holds.
getArchiveName()  : string
Given a complete path to an archive returns its filename
inject()  : mixed
Adds a list of urls as an upcoming schedule for a given queue bundle.
instantiateIterator()  : mixed
Used to create an archive_bundle_iterator for a non-yioop archive. As these iterators sometimes make use of a folder to store savepoints, a temporary folder is created for this purpose in the current directory. This should be garbage collected elsewhere.
makeFilter()  : mixed
Makes a BloomFilterFile object from a dictionary file $dict_file which has items listed one per line, or items listed as some column of a CSV file. The result is output to $filter_path
migrateIndexArchive()  : mixed
Copies an IndexArchiveBundle (or derived class) on disk into an IndexDocumentBundle (on disk). The new bundle will be at the old bundle's location while the old bundle is renamed after the process to "Old" . name_of_old_bundle
outputArchiveList()  : mixed
Lists the Web or IndexArchives in the crawl directory
outputCountBundle()  : mixed
Counts and outputs the number of docs and links in each shard in the archive supplied in $archive_path as well as an overall count
outputDictInfo()  : mixed
Prints the dictionary records for a word in an IndexDocumentBundle
outputDocLookup()  : array<string|int, mixed>
Outputs the $doc_map_index'th document from the $partition partition of the IndexDocumentBundle in folder $index_name
outputInfo()  : mixed
Determines whether the supplied path is a WebArchiveBundle, an IndexArchiveBundle, DoubleIndexBundle, or non-Yioop Archive.
outputInfoDoubleIndexBundle()  : mixed
Outputs to stdout header information for a DoubleIndexBundle bundle.
outputInfoFeedDocumentBundle()  : mixed
Outputs to stdout header information for a FeedDocumentBundle bundle.
outputInfoIndexDocumentBundle()  : mixed
Outputs to stdout header information for an IndexDocumentBundle bundle.
outputPartitionInfo()  : mixed
Prints information about the number of words and frequencies of words within the $num'th partition in the bundle
outputShowPages()  : mixed
Used to list out the pages/summaries stored in a bundle at $archive_path. It lists to stdout $num many documents starting at $start.
rebuildIndexBundle()  : mixed
Used to recompute both the index shards and the dictionary of an index archive. The first step involves re-extracting the words into an inverted index from the summaries' web_archives.
run()  : mixed
The main code for the dictionary updater. Updates the dictionary for the IndexDocumentBundle at $bundle_path running on channel $channel from its current next_partition to process to the current save partition. Partitions are groups of documents that have been downloaded, but whose words have not necessarily been added to the dictionary for the bundle.
start()  : mixed
Runs the ArcTool on the supplied command line arguments
usageMessageAndExit()  : mixed
Outputs the "how to use this tool" message and then exits.

Constants

MAX_BUFFER_DOCS

The maximum number of documents the ArcTool list function will read into memory in one go.

public mixed MAX_BUFFER_DOCS = 200

MAX_REBUILD_DOCS

The maximum number of documents the ArcTool will rebuild/migrate in one go

public mixed MAX_REBUILD_DOCS = 8000

Methods

__construct()

Initializes the ArcTool, for now does nothing

public __construct() : mixed
Return values
mixed

badFormatMessageAndExit()

Outputs the "hey, this isn't a known bundle" message and then exits.

public badFormatMessageAndExit(string $archive_name[, string $allowed_archives = "web or index" ]) : mixed
Parameters
$archive_name : string

name or path to what was supposed to be an archive

$allowed_archives : string = "web or index"

a string listing the archive types that $archive_name could belong to

Return values
mixed

checkFilter()

Outputs to the terminal whether the bloom filter $filter_path contains the string $item

public checkFilter(string $filter_path, string $item) : mixed
Parameters
$filter_path : string

name of the bloom filter file to check for the item

$item : string

item to check for in the bloom filter

Return values
mixed

fixPartitionIndexes()

Recomputes the hash index (.ix) files for a range of partitions from start_partition to end_partition in the documents subfolder of an IndexDocumentBundle. An ix file contains a sequence of compressed 4-tuples (doc_id, summary_offset, summary_length, cache_length) corresponding to a partition file (these end in .txt.gz and are a sequence of compressed document summaries followed by original documents).

public fixPartitionIndexes(string $archive_path, int $start_partition[, int $end_partition = -1 ]) : mixed
Parameters
$archive_path : string

the path of a directory that holds an IndexDocumentBundle

$start_partition : int

first partition to recompute

$end_partition : int = -1

last partition to recompute (inclusive)

Return values
mixed
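
The relationship between a partition file and its index of 4-tuples can be illustrated with a minimal Python sketch. It is not Yioop's implementation: the function names, the use of byte offsets, and the choice to gzip each summary and document separately are assumptions made only for illustration.

```python
import gzip


def build_partition(records):
    """Return (partition_bytes, index) for (doc_id, summary, page) records.

    Assumption: each summary and each original page is compressed on its
    own and the two blobs are stored back to back.
    """
    partition = bytearray()
    index = []
    for doc_id, summary, page in records:
        summary_blob = gzip.compress(summary)
        page_blob = gzip.compress(page)
        # (doc_id, summary_offset, summary_length, cache_length)
        index.append((doc_id, len(partition), len(summary_blob), len(page_blob)))
        partition += summary_blob + page_blob
    return bytes(partition), index


def lookup(partition, index, doc_id):
    """Use the index tuples to recover one document's summary and page."""
    for entry_id, offset, summary_len, cache_len in index:
        if entry_id == doc_id:
            summary = gzip.decompress(partition[offset:offset + summary_len])
            page_start = offset + summary_len
            page = gzip.decompress(partition[page_start:page_start + cache_len])
            return summary, page
    return None


records = [
    ("doc-a", b"summary of a", b"<html>page a</html>"),
    ("doc-b", b"summary of b", b"<html>page b</html>"),
]
partition, index = build_partition(records)
found = lookup(partition, index, "doc-b")
```

Because the index can be derived entirely from the records in the partition, it can be recomputed when it is lost or damaged, which is the situation fixPartitionIndexes() addresses.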

getArchiveKind()

Given a folder name, determines the kind of bundle (if any) it holds.

public static getArchiveKind(string $archive_path) : string

It does this based on the expected location of the description.txt file, or arc_description.ini (in the case of a non-yioop archive)

Parameters
$archive_path : string

the path to archive folder

Return values
string

the archive bundle type, either: WebArchiveBundle or IndexArchiveBundle
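
As a rough Python sketch of detection by marker file: the helper name, the returned labels, and the assumption that both files sit at the top level of the folder are hypothetical. The real method distinguishes bundle kinds by where description.txt is expected, which this sketch does not reproduce.

```python
import os
import tempfile


def guess_archive_kind(archive_path):
    """Classify a folder by which description file it contains (illustrative)."""
    if os.path.isfile(os.path.join(archive_path, "arc_description.ini")):
        return "non-yioop archive"
    if os.path.isfile(os.path.join(archive_path, "description.txt")):
        return "yioop bundle"
    return None


with tempfile.TemporaryDirectory() as root:
    bundle_dir = os.path.join(root, "bundle")
    foreign_dir = os.path.join(root, "foreign")
    empty_dir = os.path.join(root, "empty")
    for folder in (bundle_dir, foreign_dir, empty_dir):
        os.mkdir(folder)
    # Create empty marker files so the folders can be told apart.
    open(os.path.join(bundle_dir, "description.txt"), "w").close()
    open(os.path.join(foreign_dir, "arc_description.ini"), "w").close()
    kinds = [guess_archive_kind(d) for d in (bundle_dir, foreign_dir, empty_dir)]
```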

getArchiveName()

Given a complete path to an archive returns its filename

public getArchiveName(string $archive_path) : string
Parameters
$archive_path : string

a path to a yioop or non-yioop archive

Return values
string

its filename

inject()

Adds a list of urls as an upcoming schedule for a given queue bundle.

public inject(string $timestamp, string $url_file_name) : mixed

Can be used to make a closed schedule startable

Parameters
$timestamp : string

timestamp of the queue bundle to add urls to

$url_file_name : string

name of a file consisting of urls to inject into the given crawl

Return values
mixed

instantiateIterator()

Used to create an archive_bundle_iterator for a non-yioop archive. As these iterators sometimes make use of a folder to store savepoints, a temporary folder is created for this purpose in the current directory. This should be garbage collected elsewhere.

public instantiateIterator(string $archive_path, string $iterator_type) : mixed
Parameters
$archive_path : string

path to non-yioop archive

$iterator_type : string

name of archive_bundle_iterator used to iterate over archive.

Return values
mixed

makeFilter()

Makes a BloomFilterFile object from a dictionary file $dict_file which has items listed one per line, or items listed as some column of a CSV file. The result is output to $filter_path

public makeFilter(string $dict_file, string $filter_path[, int $column_num = -1 ]) : mixed
Parameters
$dict_file : string

dictionary file to make the BloomFilterFile from

$filter_path : string

path of the file to serialize the BloomFilterFile to

$column_num : int = -1

if negative, assumes $dict_file has one entry per line; if >= 0, it is the index of the column in a CSV file to use for items

Return values
mixed
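
The two input layouts and the membership check can be sketched in Python as follows. This is a toy Bloom filter, not Yioop's BloomFilterFile: the class name TinyBloomFilter, the helper read_items, the bit-array size, and the SHA-256 based hashing are all assumptions chosen for illustration.

```python
import csv
import hashlib


class TinyBloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def contains(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def read_items(lines, column_num=-1):
    """Negative column_num: one item per line; otherwise pick a CSV column."""
    if column_num < 0:
        return [line.strip() for line in lines if line.strip()]
    return [row[column_num] for row in csv.reader(lines) if len(row) > column_num]


plain_lines = ["apple\n", "banana\n"]
csv_lines = ["1,apple\n", "2,banana\n"]

bloom = TinyBloomFilter()
for item in read_items(csv_lines, column_num=1):
    bloom.add(item)
```

A lookup such as bloom.contains("apple") then plays the role that checkFilter() plays for a serialized filter. A positive answer only means the item is probably present.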

migrateIndexArchive()

Copies an IndexArchiveBundle (or derived class) on disk into an IndexDocumentBundle (on disk). The new bundle will be at the old bundle's location while the old bundle is renamed after the process to "Old" . name_of_old_bundle

public migrateIndexArchive(string $archive_path) : mixed
Parameters
$archive_path : string

file path to an IndexArchiveBundle

Return values
mixed
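
The final renaming step can be pictured with a short Python sketch. It assumes, hypothetically, that the new bundle is first built in a separate folder and then swapped into place. The function name and folder names are illustrative and do not come from Yioop.

```python
import os
import tempfile


def swap_in_migrated(old_path, new_path):
    """Rename the old bundle to "Old" + name and move the new one into its place."""
    parent, name = os.path.split(old_path.rstrip(os.sep))
    backup_path = os.path.join(parent, "Old" + name)
    os.rename(old_path, backup_path)
    os.rename(new_path, old_path)
    return backup_path


with tempfile.TemporaryDirectory() as root:
    old_bundle = os.path.join(root, "ExampleIndex")   # hypothetical bundle name
    new_bundle = os.path.join(root, "migrating")      # hypothetical work folder
    os.mkdir(old_bundle)
    os.mkdir(new_bundle)
    # A marker file lets us confirm which folder ended up where.
    open(os.path.join(new_bundle, "marker.txt"), "w").close()
    backup = swap_in_migrated(old_bundle, new_bundle)
    backup_name = os.path.basename(backup)
    new_in_place = os.path.isfile(os.path.join(old_bundle, "marker.txt"))
    old_kept = os.path.isdir(backup)
```

Keeping the old bundle under the "Old" prefix means the migration can be undone by hand if the new bundle turns out to be unusable.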

outputArchiveList()

Lists the Web or IndexArchives in the crawl directory

public outputArchiveList() : mixed
Return values
mixed

outputCountBundle()

Counts and outputs the number of docs and links in each shard in the archive supplied in $archive_path as well as an overall count

public outputCountBundle(string $archive_path[, bool $set_count = false ]) : mixed
Parameters
$archive_path : string

path of archive to count

$set_count : bool = false

flag that controls whether to write the count back into the archive after computing it

Return values
mixed
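
A minimal Python sketch of the per-shard and overall tally follows. The shard representation as (docs, links) pairs, the info dictionary, and its key names are assumptions for illustration, not Yioop's data structures.

```python
def count_bundle(shards, info=None, set_count=False):
    """Return report lines for each shard plus a total; optionally store totals."""
    lines = []
    total_docs = 0
    total_links = 0
    for number, (num_docs, num_links) in enumerate(shards):
        lines.append(f"Shard {number}: {num_docs} docs, {num_links} links")
        total_docs += num_docs
        total_links += num_links
    lines.append(f"Total: {total_docs} docs, {total_links} links")
    if set_count and info is not None:
        # Hypothetical keys standing in for the bundle's stored counts.
        info["doc_count"] = total_docs
        info["link_count"] = total_links
    return lines


info = {}
report = count_bundle([(3, 10), (5, 7)], info, set_count=True)
```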

outputDictInfo()

Prints the dictionary records for a word in an IndexDocumentBundle

public outputDictInfo(string $archive_path, string $word, int $start_record, int $num_records, bool $details) : mixed
Parameters
$archive_path : string

the path of a directory that holds an IndexArchiveBundle

$word : string

to look up dictionary record for

$start_record : int

first record to list out

$num_records : int

max records to list out

$details : bool

whether to show posting list details or not

Return values
mixed

outputDocLookup()

Outputs the $doc_map_index'th document from the $partition partition of the IndexDocumentBundle in folder $index_name

public outputDocLookup(string $index_name, int $partition, int $doc_map_index) : array<string|int, mixed>
Parameters
$index_name : string

folder containing an IndexDocumentBundle, DoubleIndexBundle, or FeedDocumentBundle

$partition : int

which partition to do the lookup in

$doc_map_index : int

index of which document to lookup

Return values
array<string|int, mixed>

associative array of field => values associated with document

outputInfo()

Determines whether the supplied path is a WebArchiveBundle, an IndexArchiveBundle, DoubleIndexBundle, or non-Yioop Archive.

public outputInfo(string $archive_path) : mixed

Then outputs to stdout header information about the bundle by calling the appropriate sub-function.

Parameters
$archive_path : string

The path of a directory that holds WebArchiveBundle, IndexArchiveBundle, or non-Yioop archive data

Return values
mixed

outputInfoDoubleIndexBundle()

Outputs to stdout header information for a DoubleIndexBundle bundle.

public outputInfoDoubleIndexBundle(array<string|int, mixed> $info, string $archive_path) : mixed
Parameters
$info : array<string|int, mixed>

header info that has already been read from the description.txt file

$archive_path : string

file path of the folder containing the bundle

Return values
mixed

outputInfoFeedDocumentBundle()

Outputs to stdout header information for a FeedDocumentBundle bundle.

public outputInfoFeedDocumentBundle(array<string|int, mixed> $info, string $archive_path[, string $alternate_description = "" ][, bool $only_storage_info = false ][, bool $only_crawl_params = false ]) : mixed
Parameters
$info : array<string|int, mixed>

header info that has already been read from the description.txt file

$archive_path : string

file path of the folder containing the bundle

$alternate_description : string = ""

used as the text for description rather than what's given in $info

$only_storage_info : bool = false

output only info about storage statistics; don't output info about crawl parameters

$only_crawl_params : bool = false

output only info about crawl parameters not storage statistics

Return values
mixed

outputInfoIndexDocumentBundle()

Outputs to stdout header information for an IndexDocumentBundle bundle.

public outputInfoIndexDocumentBundle(array<string|int, mixed> $info, string $archive_path[, string $alternate_description = "" ][, bool $only_storage_info = false ][, bool $only_crawl_params = false ]) : mixed
Parameters
$info : array<string|int, mixed>

header info that has already been read from the description.txt file

$archive_path : string

file path of the folder containing the bundle

$alternate_description : string = ""

used as the text for description rather than what's given in $info

$only_storage_info : bool = false

output only info about storage statistics; don't output info about crawl parameters

$only_crawl_params : bool = false

output only info about crawl parameters not storage statistics

Return values
mixed

outputPartitionInfo()

Prints information about the number of words and frequencies of words within the $num'th partition in the bundle

public outputPartitionInfo(string $archive_path, int $num) : mixed
Parameters
$archive_path : string

the path of a directory that holds an IndexDocumentBundle

$num : int

number of the partition to show info for

Return values
mixed

outputShowPages()

Used to list out the pages/summaries stored in a bundle at $archive_path. It lists to stdout $num many documents starting at $start.

public outputShowPages(string $archive_path, int $start, int $num) : mixed
Parameters
$archive_path : string

path to bundle to list documents for

$start : int

first document to list

$num : int

number of documents to list

Return values
mixed
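
Listing a window of documents in bounded batches can be sketched in Python as below. That this method is the "list function" governed by MAX_BUFFER_DOCS is an assumption, as are the function name and the use of a plain iterable in place of a bundle iterator.

```python
from itertools import islice

MAX_BUFFER_DOCS = 200  # value taken from the constant documented above


def show_pages(documents, start, num, buffer_size=MAX_BUFFER_DOCS):
    """Return batches covering num documents from position start.

    No batch holds more than buffer_size documents, so memory use stays
    bounded regardless of how large num is.
    """
    iterator = iter(documents)
    # Skip the first `start` documents without keeping them.
    for _ in islice(iterator, start):
        pass
    batches = []
    remaining = num
    while remaining > 0:
        batch = list(islice(iterator, min(buffer_size, remaining)))
        if not batch:
            break  # ran out of documents before reaching num
        batches.append(batch)
        remaining -= len(batch)
    return batches


batches = show_pages(range(1000), start=10, num=450)
batch_sizes = [len(batch) for batch in batches]
```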

rebuildIndexBundle()

Used to recompute both the index shards and the dictionary of an index archive. The first step involves re-extracting the words into an inverted index from the summaries' web_archives.

public rebuildIndexBundle(string $archive_path, mixed $start_generation) : mixed

Then a reindex is done.

Parameters
$archive_path : string

file path to an IndexArchiveBundle

$start_generation : mixed

which web archive generation to start the rebuild from. If 'continue', then keeps going from where the last attempt at a rebuild stopped.

Return values
mixed

run()

The main code for the dictionary updater. Updates the dictionary for the IndexDocumentBundle at $bundle_path running on channel $channel from its current next_partition to process to the current save partition. Partitions are groups of documents that have been downloaded, but whose words have not necessarily been added to the dictionary for the bundle.

public static run(int $channel, string $bundle_path) : mixed
Parameters
$channel : int

the channel the crawl is running on. Used in naming lock files

$bundle_path : string

the path to the IndexDocumentBundle or FeedDocumentBundle we are adding dictionary info for

Return values
mixed

start()

Runs the ArcTool on the supplied command line arguments

public start() : mixed
Return values
mixed

usageMessageAndExit()

Outputs the "how to use this tool" message and then exits.

public usageMessageAndExit() : mixed
Return values
mixed
