MixArchiveBundleIterator
extends ArchiveBundleIterator
Used to do an archive crawl based on the results of a crawl mix.
The query terms for this crawl mix will have site:any raw 1 appended to them.
Table of Contents
- $bz2_iterator : object
- Used to iterate over contents in a bzipped file
- $compression : string
- Name of the compression scheme the iterator should use.
- $delimiter : string
- If the archive uses a string of some kind to separate records, then delimiter is a regular expression which will match the separator
- $encoding : string
- Default character encoding used by records in the archive. For example, UTF-8
- $end_of_iterator : bool
- Whether or not the iterator still has more documents
- $header : array<string|int, mixed>
- Fields of meta information (such as base_address, ip_address, lang, etc.) needed to build the header information for each record processed.
- $iterate_timestamp : int
- Timestamp of the archive that is being iterated over
- $limit : int
- Count of how far out into the crawl mix we've gone.
- $mix_timestamp : int
- Timestamp of the crawl mix being iterated over
- $result_dir : string
- The path to the directory where the iteration status is stored.
- $result_timestamp : int
- Timestamp of the index archive bundle that holds the output results
- __construct() : mixed
- Creates a web archive iterator with the given parameters.
- getArchiveName() : mixed
- Gets the filename of the file that stores information about the current archive iterator (such as whether the end of the iterator has been reached)
- nextPages() : array<string|int, mixed>
- Gets the next $num many docs from the iterator
- reset() : mixed
- Resets the iterator to the start of the archive bundle
- restoreCheckpoint() : array<string|int, mixed>
- Restores state from a previous instantiation, after the last batch of pages extracted.
- saveCheckpoint() : mixed
- Saves the current state so that a new instantiation can pick up just after the last batch of pages extracted.
- seekPage() : mixed
- Advances the iterator to the $limit page, with as little additional processing as possible
- weight() : bool
- Estimates the importance of the site according to the weighting of the particular archive iterator
Properties
$bz2_iterator
Used to iterate over contents in a bzipped file
public
object
$bz2_iterator
$compression
Name of the compression scheme the iterator should use.
public
string
$compression
For example, gzip, bzip, etc.
$delimiter
If the archive uses a string of some kind to separate records, then delimiter is a regular expression which will match the separator
public
string
$delimiter
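To illustrate how a regular-expression delimiter can separate records, here is a minimal sketch. The sample text and the pattern are assumptions made up for this example; an actual archive iterator would set its own value for $delimiter, and this page does not say whether the stored pattern includes the enclosing slashes.

```php
<?php
// Hypothetical sample data: three records separated by lines of dashes.
$archive_text = "first record\n-----\nsecond record\n---\nthird record";

// Assumed delimiter pattern (not taken from the library): a line made of
// three or more dashes.
$delimiter = '/\n-{3,}\n/';

// preg_split cuts the text wherever the delimiter pattern matches.
$records = preg_split($delimiter, $archive_text);

foreach ($records as $index => $record) {
    echo $index . ": " . $record . "\n";
}
```

Because the separator is matched as a pattern rather than a fixed string, runs of dashes of different lengths are treated as the same record boundary.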
$encoding
Default character encoding used by records in the archive. For example, UTF-8
public
string
$encoding
$end_of_iterator
Whether or not the iterator still has more documents
public
bool
$end_of_iterator
$header
Fields of meta information (such as base_address, ip_address, lang, etc.) needed to build the header information for each record processed.
public
array<string|int, mixed>
$header
$iterate_timestamp
Timestamp of the archive that is being iterated over
public
int
$iterate_timestamp
$limit
Count of how far out into the crawl mix we've gone.
public
int
$limit
$mix_timestamp
Timestamp of the crawl mix being iterated over
public
int
$mix_timestamp
$result_dir
The path to the directory where the iteration status is stored.
public
string
$result_dir
$result_timestamp
Timestamp of the index archive bundle that holds the output results
public
int
$result_timestamp
Methods
__construct()
Creates a web archive iterator with the given parameters.
public
__construct(string $mix_timestamp, string $result_timestamp) : mixed
Parameters
- $mix_timestamp : string
-
timestamp of the crawl mix whose pages are iterated over
- $result_timestamp : string
-
timestamp of the web archive bundle in which results are stored
Return values
mixed
getArchiveName()
Gets the filename of the file that stores information about the current archive iterator (such as whether the end of the iterator has been reached)
public
getArchiveName(int $timestamp) : mixed
Parameters
- $timestamp : int
-
timestamp of the current archive crawl
Return values
mixed
nextPages()
Gets the next $num many docs from the iterator
public
nextPages(int $num[, bool $no_process = false ]) : array<string|int, mixed>
Parameters
- $num : int
-
number of docs to get
- $no_process : bool = false
-
this flag is inherited from the base class but has no effect in this class
Return values
array<string|int, mixed> —associative arrays for $num pages
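A typical consumer would call nextPages() in batches until the iterator reports that it is exhausted. The sketch below shows that loop against a hypothetical stand-in class, StubIterator, which mimics only the nextPages() method and the $end_of_iterator and $limit members documented on this page. It is not the real class, and the page arrays are invented sample data.

```php
<?php
// Hypothetical stand-in for illustration only; it is not the library class.
class StubIterator
{
    public $end_of_iterator = false;
    public $limit = 0;
    private $pages;

    public function __construct(array $pages)
    {
        $this->pages = $pages;
    }

    // Returns up to $num pages and advances $limit; $no_process is
    // accepted for signature compatibility and ignored.
    public function nextPages($num, $no_process = false)
    {
        $batch = array_slice($this->pages, $this->limit, $num);
        $this->limit += count($batch);
        if ($this->limit >= count($this->pages)) {
            $this->end_of_iterator = true;
        }
        return $batch;
    }
}

// Invented sample pages, each an associative array.
$iterator = new StubIterator([
    ['url' => 'https://example.com/a'],
    ['url' => 'https://example.com/b'],
    ['url' => 'https://example.com/c'],
    ['url' => 'https://example.com/d'],
    ['url' => 'https://example.com/e'],
]);

// Pull batches of two pages until the iterator is exhausted.
$batches = [];
while (!$iterator->end_of_iterator) {
    $batches[] = $iterator->nextPages(2);
}
echo count($batches) . " batches fetched\n";
```

The last batch may hold fewer than $num pages, so callers should rely on the size of the returned array rather than on the requested count.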
reset()
Resets the iterator to the start of the archive bundle
public
reset() : mixed
Return values
mixed
restoreCheckpoint()
Restores state from a previous instantiation, after the last batch of pages extracted.
public
restoreCheckpoint() : array<string|int, mixed>
Return values
array<string|int, mixed> —the data serialized when saveCheckpoint was called
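The checkpoint round trip between saveCheckpoint() and restoreCheckpoint() can be pictured as serializing a small state array to a file and reading it back. The following is a minimal sketch under stated assumptions: the helper names saveCheckpointSketch and restoreCheckpointSketch, the temporary file location, and the use of LIMIT and END_OF_ITERATOR as string keys are all hypothetical, and the library may store its state differently.

```php
<?php
// Hypothetical helper: writes the iteration state to $file.
function saveCheckpointSketch(string $file, int $limit, bool $end, array $info = []): void
{
    // Assumed key names, modeled on the fields named in this page.
    $info['LIMIT'] = $limit;
    $info['END_OF_ITERATOR'] = $end;
    file_put_contents($file, serialize($info));
}

// Hypothetical helper: reads the state back, or returns [] if none exists.
function restoreCheckpointSketch(string $file): array
{
    if (!file_exists($file)) {
        return [];
    }
    $info = unserialize(file_get_contents($file));
    return is_array($info) ? $info : [];
}

// Round trip through a temporary file (the location is an assumption).
$file = tempnam(sys_get_temp_dir(), 'ckpt');
saveCheckpointSketch($file, 40, false);
$state = restoreCheckpointSketch($file);
unlink($file);

echo "resume at " . $state['LIMIT'] . "\n";
```

A new instantiation could then use the restored LIMIT value to resume just after the last batch of pages extracted.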
saveCheckpoint()
Saves the current state so that a new instantiation can pick up just after the last batch of pages extracted.
public
saveCheckpoint([array<string|int, mixed> $info = [] ]) : mixed
Parameters
- $info : array<string|int, mixed> = []
-
data needed to restore where we are in the process of iterating through the archive. By default, the fields LIMIT and END_OF_ITERATOR are saved.
Return values
mixed
seekPage()
Advances the iterator to the $limit page, with as little additional processing as possible
public
seekPage($limit) : mixed
Parameters
Return values
mixed
weight()
Estimates the importance of the site according to the weighting of the particular archive iterator
public
weight(&$site) : bool
Parameters
Return values
bool —false; files are assumed to have been crawled roughly in order of page importance, so the default estimate of doc rank is used