ArchiveBundleIterator
implements CrawlConstants
Abstract class used to model iterating over documents indexed in a WebArchiveBundle or a set of such bundles.
Interfaces, Classes, Traits and Enums
- CrawlConstants
- Shared constants and enums used by components that are involved in the crawling process
Table of Contents
- $bz2_iterator : object
- Used to iterate over contents in a bzipped file
- $compression : string
- Name of the compression method to use when iterating.
- $delimiter : string
- If the archive uses a string of some kind to separate records, then delimiter is a regular expression that matches the separator
- $encoding : string
- Default character encoding used by records in the archive. For example, UTF-8
- $end_of_iterator : bool
- Whether or not the iterator still has more documents
- $header : array<string|int, mixed>
- Stores the meta-information fields (such as base_address, ip_address, and lang) needed to build the header for each record processed.
- $iterate_timestamp : int
- Timestamp of the archive that is being iterated over
- $result_dir : string
- The path to the directory where the iteration status is stored.
- $result_timestamp : int
- Timestamp of the archive in which results are stored
- nextPages() : array<string|int, mixed>
- Gets the next $num docs from the iterator
- reset() : mixed
- Resets the iterator to the start of the archive bundle
- restoreCheckpoint() : array<string|int, mixed>
- Restores the internal state from the file iterate_status.txt in the result dir such that the next call to nextPages will pick up from just after the last checkpoint. Each iterator should make a call to restoreCheckpoint at the end of the constructor method after the instance members have been initialized.
- saveCheckpoint() : mixed
- Stores the current progress to the file iterate_status.txt in the result dir such that a new instance of the iterator could be constructed and return the next set of pages without having to process all of the pages that came before. Each iterator should make a call to saveCheckpoint after extracting a batch of pages.
- seekPage() : mixed
- Advances the iterator to the $limit page, with as little additional processing as possible
- weight() : mixed
- Estimates the importance of the site according to the weighting of the particular archive iterator
Properties
$bz2_iterator
Used to iterate over contents in a bzipped file
public
object
$bz2_iterator
$compression
Name of the compression method to use when iterating.
public
string
$compression
For example, gzip, bzip, etc.
$delimiter
If the archive uses a string of some kind to separate records, then delimiter is a regular expression that matches the separator
public
string
$delimiter
$encoding
Default character encoding used by records in the archive. For example, UTF-8
public
string
$encoding
$end_of_iterator
Whether or not the iterator still has more documents
public
bool
$end_of_iterator
$header
Stores the meta-information fields (such as base_address, ip_address, and lang) needed to build the header for each record processed.
public
array<string|int, mixed>
$header
$iterate_timestamp
Timestamp of the archive that is being iterated over
public
int
$iterate_timestamp
$result_dir
The path to the directory where the iteration status is stored.
public
string
$result_dir
$result_timestamp
Timestamp of the archive in which results are stored
public
int
$result_timestamp
Methods
nextPages()
Gets the next $num docs from the iterator
public
abstract nextPages(int $num[, bool $no_process = false ]) : array<string|int, mixed>
Parameters
- $num : int
-
number of docs to get
- $no_process : bool = false
-
do not do any processing on page data
Return values
array<string|int, mixed> —associative arrays for $num pages
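As a rough illustration of how a caller might consume nextPages() in batches, the sketch below drains a toy iterator. ToyArchiveIterator and drainIterator are hypothetical names, not part of the documented class, and the sketch assumes that $end_of_iterator becomes true once no documents remain, which this page does not state explicitly.

```php
<?php
// Hypothetical stand-in for a concrete iterator. Only nextPages() and
// $end_of_iterator mirror names from this page; everything else is assumed.
class ToyArchiveIterator
{
    /** @var bool assumed semantics: true once every record has been returned */
    public $end_of_iterator = false;
    /** @var array in-memory records standing in for an archive */
    private $records;
    /** @var int index of the next record to return */
    private $position = 0;

    public function __construct(array $records)
    {
        $this->records = $records;
        $this->end_of_iterator = empty($records);
    }

    // $no_process is accepted to match the documented signature but is
    // ignored here, because the toy records need no processing.
    public function nextPages($num, $no_process = false)
    {
        $pages = array_slice($this->records, $this->position, $num);
        $this->position += count($pages);
        if ($this->position >= count($this->records)) {
            $this->end_of_iterator = true;
        }
        return $pages;
    }
}

// Collects every page by asking for $batch_size docs at a time.
function drainIterator($iterator, $batch_size)
{
    $all = [];
    while (!$iterator->end_of_iterator) {
        $batch = $iterator->nextPages($batch_size);
        if (empty($batch)) {
            break; // defensive: avoid looping forever on an empty batch
        }
        foreach ($batch as $page) {
            $all[] = $page;
        }
    }
    return $all;
}

$iterator = new ToyArchiveIterator([
    ['url' => 'http://example.com/a'],
    ['url' => 'http://example.com/b'],
    ['url' => 'http://example.com/c'],
]);
$pages = drainIterator($iterator, 2);
```

A real subclass would read records from an archive rather than from an in-memory array.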
reset()
Resets the iterator to the start of the archive bundle
public
abstract reset() : mixed
Return values
mixed
restoreCheckpoint()
Restores the internal state from the file iterate_status.txt in the result dir such that the next call to nextPages will pick up from just after the last checkpoint. Each iterator should make a call to restoreCheckpoint at the end of the constructor method after the instance members have been initialized.
public
restoreCheckpoint() : array<string|int, mixed>
Return values
array<string|int, mixed> —the data serialized when saveCheckpoint was called
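The checkpoint round trip described for restoreCheckpoint() and saveCheckpoint() might look roughly like the following. This is a minimal sketch, not the class's actual implementation: ToyCheckpointIterator and $current_offset are hypothetical, and storing the state with PHP's serialize() is an assumption about the format of iterate_status.txt.

```php
<?php
// Hypothetical iterator that "reads" consecutive integers as its pages.
class ToyCheckpointIterator
{
    public $result_dir;
    public $end_of_iterator = false;
    /** @var int hypothetical progress marker; not a documented property */
    public $current_offset = 0;

    public function __construct($result_dir)
    {
        $this->result_dir = $result_dir;
        if (!is_dir($result_dir)) {
            mkdir($result_dir, 0777, true);
        }
        // As this page recommends, restore at the end of the constructor,
        // after the instance members have been initialized.
        $this->restoreCheckpoint();
    }

    private function statusFile()
    {
        return $this->result_dir . "/iterate_status.txt";
    }

    // Assumed format: a serialized array holding progress plus any extra
    // info supplied by the caller.
    public function saveCheckpoint($info = [])
    {
        $info['end_of_iterator'] = $this->end_of_iterator;
        $info['current_offset'] = $this->current_offset;
        file_put_contents($this->statusFile(), serialize($info));
    }

    public function restoreCheckpoint()
    {
        if (!file_exists($this->statusFile())) {
            return []; // nothing saved yet, start from the beginning
        }
        $info = unserialize(file_get_contents($this->statusFile()));
        if (!is_array($info)) {
            return [];
        }
        $this->end_of_iterator = $info['end_of_iterator'] ?? false;
        $this->current_offset = $info['current_offset'] ?? 0;
        return $info;
    }

    public function nextPages($num)
    {
        $pages = [];
        for ($i = 0; $i < $num; $i++) {
            $pages[] = $this->current_offset++;
        }
        // Save after extracting a batch, as this page recommends.
        $this->saveCheckpoint();
        return $pages;
    }
}

$dir = sys_get_temp_dir() . "/toy_iterator_" . uniqid();
$first = new ToyCheckpointIterator($dir);
$first->nextPages(3);                      // consumes pages 0, 1, 2
$second = new ToyCheckpointIterator($dir); // new instance resumes from disk
$resumed = $second->nextPages(2);          // continues with pages 3, 4

// Remove the temporary checkpoint created by this sketch.
unlink($dir . "/iterate_status.txt");
rmdir($dir);
```

Because the second instance restores its state in the constructor, it continues from the saved position without reprocessing earlier pages.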
saveCheckpoint()
Stores the current progress to the file iterate_status.txt in the result dir such that a new instance of the iterator could be constructed and return the next set of pages without having to process all of the pages that came before. Each iterator should make a call to saveCheckpoint after extracting a batch of pages.
public
saveCheckpoint([array<string|int, mixed> $info = [] ]) : mixed
Parameters
- $info : array<string|int, mixed> = []
-
any extra info a subclass wants to save
Return values
mixed
seekPage()
Advances the iterator to the $limit page, with as little additional processing as possible
public
seekPage($limit) : mixed
Parameters
Return values
mixed
weight()
Estimates the importance of the site according to the weighting of the particular archive iterator
public
abstract weight(&$site) : mixed
Parameters
Return values
mixed: a 4-bit number, or false if the iterator doesn't use the default ranking method