WebArchiveBundleIterator
extends ArchiveBundleIterator
Class used to model iterating over the documents indexed in a WebArchiveBundle. This would typically be for the purpose of re-indexing these documents.
Table of Contents
- $archive : object
- The web archive bundle being iterated over
- $bz2_iterator : object
- Used to iterate over contents in a bzipped file
- $compression : string
- Stores the name of the compression scheme that the iterator should use.
- $count : int
- Number of documents in the web archive bundle being iterated over
- $current_partition_num : int
- Index of web archive in the web archive bundle that the iterator is currently getting results from
- $delimiter : string
- If the archive uses a string of some kind to separate records, then the delimiter is a regular expression that matches the separator
- $encoding : string
- Default character encoding used by records in the archive. For example, UTF-8
- $end_of_iterator : bool
- Whether or not the iterator still has more documents
- $fetcher_prefix : string
- The fetcher prefix associated with this archive.
- $header : array<string|int, mixed>
- Stores the metadata fields (such as base_address, ip_address, lang, etc.) needed to build the header information for each record processed.
- $iterate_timestamp : int
- Timestamp of the archive that is being iterated over
- $num_partitions : int
- Number of web archive objects in this web archive bundle
- $overall_index : int
- Current position of the iterator, as an index between 0 and $this->count
- $partition : int
- The current web archive in the bundle that is being iterated over
- $partition_index : int
- The item within the current partition to be returned next
- $result_dir : string
- The path to the directory where the iteration status is stored.
- $result_timestamp : int
- Timestamp of the archive in which results are being stored
- __construct() : mixed
- Creates a web archive iterator with the given parameters.
- getArchiveName() : string
- Returns the path to an archive given its timestamp.
- nextPages() : array<string|int, mixed>
- Gets the next $num many docs from the iterator
- reset() : mixed
- Resets the iterator to the start of the archive bundle
- restoreCheckpoint() : array<string|int, mixed>
- Restores state from a previous instantiation, after the last batch of pages extracted.
- saveCheckpoint() : mixed
- Saves the current state so that a new instantiation can pick up just after the last batch of pages extracted.
- seekPage() : mixed
- Advances the iterator to page $limit, with as little additional processing as possible
- weight() : bool
- Estimates the importance of the site according to the weighting of the particular archive iterator
Properties
$archive
The web archive bundle being iterated over
public
object
$archive
$bz2_iterator
Used to iterate over contents in a bzipped file
public
object
$bz2_iterator
$compression
Stores the name of the compression scheme that the iterator should use.
public
string
$compression
For example, gzip, bzip, etc.
$count
Number of documents in the web archive bundle being iterated over
public
int
$count
$current_partition_num
Index of web archive in the web archive bundle that the iterator is currently getting results from
public
int
$current_partition_num
$delimiter
If the archive uses a string of some kind to separate records, then the delimiter is a regular expression that matches the separator
public
string
$delimiter
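As a rough illustration of how a regular-expression delimiter can separate records, the sketch below splits a small in-memory string with PHP's preg_split(). The sample data and the separator pattern are hypothetical and are not taken from the library; the actual archive formats and patterns may differ.

```php
<?php
// Hypothetical raw archive contents: three records separated by a marker
// line. Neither the data nor the pattern comes from the library.
$raw = "first record\n---RECORD---\nsecond record\n---RECORD---\nthird record";

// A regular expression that matches the separator, playing the role
// that $delimiter plays in the iterator.
$delimiter = '/\n---RECORD---\n/';

// PREG_SPLIT_NO_EMPTY drops empty strings that a leading or trailing
// separator would otherwise produce.
$records = preg_split($delimiter, $raw, -1, PREG_SPLIT_NO_EMPTY);

foreach ($records as $i => $record) {
    echo $i, ": ", $record, "\n";
}
```

Storing the separator as a pattern rather than a fixed string lets one iterator handle archives whose record boundaries vary slightly, for instance in whitespace.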
$encoding
Default character encoding used by records in the archive. For example, UTF-8
public
string
$encoding
$end_of_iterator
Whether or not the iterator still has more documents
public
bool
$end_of_iterator
$fetcher_prefix
The fetcher prefix associated with this archive.
public
string
$fetcher_prefix
$header
Stores the metadata fields (such as base_address, ip_address, lang, etc.) needed to build the header information for each record processed.
public
array<string|int, mixed>
$header
$iterate_timestamp
Timestamp of the archive that is being iterated over
public
int
$iterate_timestamp
$num_partitions
Number of web archive objects in this web archive bundle
public
int
$num_partitions
$overall_index
Current position of the iterator, as an index between 0 and $this->count
public
int
$overall_index
$partition
The current web archive in the bundle that is being iterated over
public
int
$partition
$partition_index
The item within the current partition to be returned next
public
int
$partition_index
$result_dir
The path to the directory where the iteration status is stored.
public
string
$result_dir
$result_timestamp
Timestamp of the archive in which results are being stored
public
int
$result_timestamp
Methods
__construct()
Creates a web archive iterator with the given parameters.
public
__construct(string $prefix, string $iterate_timestamp, string $result_timestamp) : mixed
Parameters
- $prefix : string
-
fetcher number this bundle is associated with
- $iterate_timestamp : string
-
timestamp of the web archive bundle to iterate over the pages of
- $result_timestamp : string
-
timestamp of the web archive bundle results are being stored in
Return values
mixed
getArchiveName()
Returns the path to an archive given its timestamp.
public
getArchiveName(string $timestamp) : string
Parameters
- $timestamp : string
-
the archive timestamp
Return values
string —the path to the archive, based off of the fetcher prefix used when this iterator was constructed
nextPages()
Gets the next $num many docs from the iterator
public
nextPages(int $num[, bool $no_process = false ]) : array<string|int, mixed>
Parameters
- $num : int
-
number of docs to get
- $no_process : bool = false
-
this flag is inherited from base class but does not do anything in this case
Return values
array<string|int, mixed> —associative arrays for $num pages
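The interplay between $partition, $partition_index, $overall_index, $count, and $end_of_iterator when fetching a batch can be sketched with a toy class. ToyBundleIterator below is a hypothetical stand-in that iterates over nested PHP arrays; it is not the library's implementation, and it reuses the property names only to show one plausible way the bookkeeping could work.

```php
<?php
/**
 * Toy stand-in for a bundle iterator (hypothetical, not the library class).
 * Each inner array plays the role of one web archive (partition).
 */
class ToyBundleIterator
{
    public $partitions;
    public $partition = 0;        // partition currently being read
    public $partition_index = 0;  // next item within that partition
    public $overall_index = 0;    // number of items returned so far
    public $count;                // total items across all partitions
    public $end_of_iterator = false;

    public function __construct(array $partitions)
    {
        $this->partitions = $partitions;
        $this->count = array_sum(array_map('count', $partitions));
        $this->end_of_iterator = ($this->count === 0);
    }

    /** Returns up to $num items, crossing partition boundaries as needed. */
    public function nextPages(int $num): array
    {
        $pages = [];
        while (count($pages) < $num && !$this->end_of_iterator) {
            $current = $this->partitions[$this->partition];
            if ($this->partition_index >= count($current)) {
                // Current partition is exhausted: move on to the next one.
                $this->partition++;
                $this->partition_index = 0;
                continue;
            }
            $pages[] = $current[$this->partition_index++];
            $this->overall_index++;
            if ($this->overall_index >= $this->count) {
                $this->end_of_iterator = true;
            }
        }
        return $pages;
    }
}

// Two partitions holding three hypothetical page records in total.
$it = new ToyBundleIterator([
    [['url' => 'a'], ['url' => 'b']],
    [['url' => 'c']],
]);
$first = $it->nextPages(2);   // both records of the first partition
$second = $it->nextPages(2);  // only one record remains
```

In this sketch a batch may come back shorter than $num once the bundle runs out, so a caller would check $end_of_iterator rather than rely on the batch size.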
reset()
Resets the iterator to the start of the archive bundle
public
reset() : mixed
Return values
mixed
restoreCheckpoint()
Restores state from a previous instantiation, after the last batch of pages extracted.
public
restoreCheckpoint() : array<string|int, mixed>
Return values
array<string|int, mixed> —the data serialized when saveCheckpoint was called
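The checkpoint round trip performed by saveCheckpoint() and restoreCheckpoint() can be pictured as serializing the iterator's position to a file under $result_dir and reading it back later. The sketch below is a minimal illustration under stated assumptions: the function names, the status file name, and the saved fields are all hypothetical and may not match what the library actually writes.

```php
<?php
// Hypothetical helpers; the real class keeps this logic in its own methods.
function saveCheckpointSketch(string $result_dir, array $state, array $info = []): void
{
    // Store caller-supplied data alongside the iterator position.
    $state['info'] = $info;
    // The file name is an assumption made for this sketch.
    file_put_contents($result_dir . "/iterate_status.txt", serialize($state));
}

function restoreCheckpointSketch(string $result_dir): array
{
    $file = $result_dir . "/iterate_status.txt";
    if (!file_exists($file)) {
        return []; // nothing has been saved yet
    }
    $state = unserialize(file_get_contents($file));
    return is_array($state) ? $state : [];
}

// Use a scratch directory in place of a real $result_dir.
$dir = sys_get_temp_dir() . "/checkpoint_sketch_" . uniqid();
mkdir($dir);

saveCheckpointSketch(
    $dir,
    ['overall_index' => 40, 'partition' => 2, 'partition_index' => 5],
    ['note' => 'batch 4']
);
$restored = restoreCheckpointSketch($dir);

// Clean up the scratch files.
unlink($dir . "/iterate_status.txt");
rmdir($dir);
```

A new iterator instance could copy the restored values back into its own properties and continue from the batch after the last one extracted.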
saveCheckpoint()
Saves the current state so that a new instantiation can pick up just after the last batch of pages extracted.
public
saveCheckpoint([array<string|int, mixed> $info = [] ]) : mixed
Parameters
- $info : array<string|int, mixed> = []
-
data needed to restore where we are in the process of iterating through the archive.
Return values
mixed
seekPage()
Advances the iterator to page $limit, with as little additional processing as possible
public
seekPage($limit) : mixed
Parameters
Return values
mixed
weight()
Estimates the importance of the site according to the weighting of the particular archive iterator
public
weight(&$site) : bool
Parameters
Return values
bool —always false: the files are assumed to have been crawled roughly in order of page importance, so the default estimate of doc rank is used