Yioop_V9.5_Source_Code_Documentation

ArchiveBundleIterator
in package
implements CrawlConstants

Abstract class used to model iterating over documents indexed in a WebArchiveBundle or a set of such bundles.

Tags
author

Chris Pollett

see
WebArchiveBundle

Interfaces, Classes, Traits and Enums

CrawlConstants
Shared constants and enums used by components that are involved in the crawling process

Table of Contents

$bz2_iterator  : object
Used to iterate over contents in a bzipped file
$compression  : string
Name of the compression method the iterator should use when reading the archive.
$delimiter  : string
If the archive uses a string of some kind to separate records, then delimiter is a regular expression which will match the separator
$encoding  : string
Default character encoding used by records in the archive. For example, UTF-8
$end_of_iterator  : bool
Whether or not the iterator still has more documents
$header  : array<string|int, mixed>
Stores the metadata fields (such as base_address, ip_address, lang, etc.) needed to build the header information for each record processed.
$iterate_timestamp  : int
Timestamp of the archive that is being iterated over
$result_dir  : string
The path to the directory where the iteration status is stored.
$result_timestamp  : int
Timestamp of the archive in which results are stored
nextPages()  : array<string|int, mixed>
Gets the next $num docs from the iterator
reset()  : mixed
Resets the iterator to the start of the archive bundle
restoreCheckpoint()  : array<string|int, mixed>
Restores the internal state from the file iterate_status.txt in the result dir such that the next call to nextPages will pick up from just after the last checkpoint. Each iterator should make a call to restoreCheckpoint at the end of the constructor method after the instance members have been initialized.
saveCheckpoint()  : mixed
Stores the current progress to the file iterate_status.txt in the result dir such that a new instance of the iterator could be constructed and return the next set of pages without having to process all of the pages that came before. Each iterator should make a call to saveCheckpoint after extracting a batch of pages.
seekPage()  : mixed
Advances the iterator to page $limit, with as little additional processing as possible
weight()  : mixed
Estimates the importance of the site according to the weighting of the particular archive iterator

Properties

$compression

Name of the compression method the iterator should use when reading the archive.

public string $compression

For example, gzip, bzip, etc.

$delimiter

If the archive uses a string of some kind to separate records, then delimiter is a regular expression which will match the separator

public string $delimiter
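
As an illustration of how a regular-expression delimiter can separate records, the sketch below splits an in-memory string with PHP's preg_split. The sample text and the pattern are invented for this example; only the idea that $delimiter is a regular expression matching the separator comes from the description above.

```php
<?php
// Hypothetical archive contents: records separated by a line of dashes.
$archive_text = "first record\n-----\nsecond record\n---\nthird record";

// Assumed delimiter pattern (three or more dashes on their own line);
// a real iterator would define its own pattern.
$delimiter = '/\n-{3,}\n/';

// Split on every match of the delimiter, dropping empty pieces.
$records = preg_split($delimiter, $archive_text, -1, PREG_SPLIT_NO_EMPTY);

foreach ($records as $index => $record) {
    echo $index, ": ", $record, "\n";
}
```

Using a pattern rather than a fixed string lets one delimiter cover separators that vary slightly in length or whitespace.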

$encoding

Default character encoding used by records in the archive. For example, UTF-8

public string $encoding

$end_of_iterator

Whether or not the iterator still has more documents

public bool $end_of_iterator

$header

Stores the metadata fields (such as base_address, ip_address, lang, etc.) needed to build the header information for each record processed.

public array<string|int, mixed> $header

$iterate_timestamp

Timestamp of the archive that is being iterated over

public int $iterate_timestamp

$result_dir

The path to the directory where the iteration status is stored.

public string $result_dir

$result_timestamp

Timestamp of the archive in which results are stored

public int $result_timestamp

Methods

nextPages()

Gets the next $num docs from the iterator

public abstract nextPages(int $num[, bool $no_process = false ]) : array<string|int, mixed>
Parameters
$num : int

number of docs to get

$no_process : bool = false

do not do any processing on page data

Return values
array<string|int, mixed>

associative arrays for $num pages
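
A minimal sketch of the batching contract described above, assuming an in-memory list of records. InMemoryIterator, its private members, and the 'page' key are hypothetical names chosen for this example; the class does not extend the real ArchiveBundleIterator and is not how any Yioop iterator is implemented.

```php
<?php
/**
 * Hypothetical stand-in for a concrete iterator, used only to show
 * how nextPages($num, $no_process) might be consumed in a loop.
 */
class InMemoryIterator
{
    public bool $end_of_iterator = false;
    private array $records;
    private int $position = 0;

    public function __construct(array $records)
    {
        $this->records = $records;
        $this->end_of_iterator = count($records) === 0;
    }

    public function nextPages(int $num, bool $no_process = false): array
    {
        // Take at most $num records starting at the current position.
        $batch = array_slice($this->records, $this->position, $num);
        $this->position += count($batch);
        $this->end_of_iterator = $this->position >= count($this->records);
        // Assumed "processing" step: trim whitespace unless told not to.
        return array_map(
            fn ($raw) => ['page' => $no_process ? $raw : trim($raw)],
            $batch
        );
    }
}

$iterator = new InMemoryIterator([" a ", "bb", " ccc"]);
while (!$iterator->end_of_iterator) {
    $pages = $iterator->nextPages(2);
    echo count($pages), " page(s) in this batch\n";
}
```

The final batch may hold fewer than $num entries, so callers in this sketch check $end_of_iterator rather than the batch size.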

reset()

Resets the iterator to the start of the archive bundle

public abstract reset() : mixed
Return values
mixed

restoreCheckpoint()

Restores the internal state from the file iterate_status.txt in the result dir such that the next call to nextPages will pick up from just after the last checkpoint. Each iterator should make a call to restoreCheckpoint at the end of the constructor method after the instance members have been initialized.

public restoreCheckpoint() : array<string|int, mixed>
Return values
array<string|int, mixed>

the data serialized when saveCheckpoint was called

saveCheckpoint()

Stores the current progress to the file iterate_status.txt in the result dir such that a new instance of the iterator could be constructed and return the next set of pages without having to process all of the pages that came before. Each iterator should make a call to saveCheckpoint after extracting a batch of pages.

public saveCheckpoint([array<string|int, mixed> $info = [] ]) : mixed
Parameters
$info : array<string|int, mixed> = []

any extra info a subclass wants to save

Return values
mixed
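
The save and restore cycle described for saveCheckpoint and restoreCheckpoint can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: CheckpointSketch and its $position counter are hypothetical, and PHP's serialize/unserialize is assumed as the storage format. Only the file name iterate_status.txt, the $result_dir property, the $info parameter, and the advice to restore at the end of the constructor come from the descriptions above.

```php
<?php
class CheckpointSketch
{
    public string $result_dir;
    public bool $end_of_iterator = false;
    public int $position = 0; // hypothetical progress counter

    public function __construct(string $result_dir)
    {
        $this->result_dir = $result_dir;
        if (!is_dir($result_dir)) {
            mkdir($result_dir, 0777, true);
        }
        // Restore last, once the instance members are initialized.
        $this->restoreCheckpoint();
    }

    public function saveCheckpoint(array $info = []): void
    {
        // Add the iterator's own state to any extra subclass info.
        $info['end_of_iterator'] = $this->end_of_iterator;
        $info['position'] = $this->position;
        file_put_contents($this->result_dir . "/iterate_status.txt",
            serialize($info));
    }

    public function restoreCheckpoint(): array
    {
        $file = $this->result_dir . "/iterate_status.txt";
        if (!file_exists($file)) {
            return []; // nothing saved yet, start from the beginning
        }
        $info = unserialize(file_get_contents($file));
        if (!is_array($info)) {
            return [];
        }
        $this->end_of_iterator = $info['end_of_iterator'] ?? false;
        $this->position = $info['position'] ?? 0;
        return $info;
    }
}

$dir = sys_get_temp_dir() . "/checkpoint_sketch_" . uniqid();
$first = new CheckpointSketch($dir);
$first->position = 40; // pretend a batch of 40 pages was extracted
$first->saveCheckpoint(['note' => 'extra subclass data']);

// A fresh instance resumes from the saved state.
$second = new CheckpointSketch($dir);
echo $second->position, "\n";
```

Returning the full array from restoreCheckpoint lets a subclass recover whatever extra fields it passed in through $info.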

seekPage()

Advances the iterator to page $limit, with as little additional processing as possible

public seekPage($limit) : mixed
Parameters
$limit :

page to advance to

Return values
mixed
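
One plausible way to seek with little processing is to pull the skipped records through nextPages with $no_process set to true. The sketch below assumes that approach; SeekSketch, its $position and $processed counters, and the strtoupper "processing" step are hypothetical, and the real method may work differently.

```php
<?php
class SeekSketch
{
    public bool $end_of_iterator = false;
    public int $position = 0;  // hypothetical index of the next record
    public int $processed = 0; // counts records that were fully processed
    private array $records;

    public function __construct(array $records)
    {
        $this->records = $records;
        $this->end_of_iterator = count($records) === 0;
    }

    public function nextPages(int $num, bool $no_process = false): array
    {
        $batch = array_slice($this->records, $this->position, $num);
        $this->position += count($batch);
        $this->end_of_iterator = $this->position >= count($this->records);
        if ($no_process) {
            return $batch; // skip the (assumed) expensive processing
        }
        $this->processed += count($batch);
        return array_map('strtoupper', $batch);
    }

    public function seekPage(int $limit): void
    {
        // Skip forward without processing the records passed over.
        $remaining = $limit - $this->position;
        if ($remaining > 0 && !$this->end_of_iterator) {
            $this->nextPages($remaining, true);
        }
    }
}

$iterator = new SeekSketch(["a", "b", "c", "d", "e"]);
$iterator->seekPage(3);
$pages = $iterator->nextPages(2);
print_r($pages);
```

In this sketch only the two records read after the seek are processed; the three skipped records are read but left untouched.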

weight()

Estimates the importance of the site according to the weighting of the particular archive iterator

public abstract weight(&$site) : mixed
Parameters
$site :

an associative array containing info about a web page

Return values
mixed

a 4-bit number, or false if the iterator doesn't use the default ranking method


        
