Yioop_V9.5_Source_Code_Documentation

WebArchiveBundleIterator extends ArchiveBundleIterator

Class used to model iterating over the documents indexed in a WebArchiveBundle. This would typically be for the purpose of re-indexing these documents.

Tags
author

Chris Pollett

see
WebArchiveBundle

Table of Contents

$archive  : object
The web archive bundle being iterated over
$bz2_iterator  : object
Used to iterate over contents in a bzipped file
$compression  : string
Used to store the name of the compression scheme that the iterator should use.
$count  : int
Number of documents in the web archive bundle being iterated over
$current_partition_num  : int
Index of web archive in the web archive bundle that the iterator is currently getting results from
$delimiter  : string
If the archive uses a string of some kind to separate records, then $delimiter is a regular expression that matches the separator
$encoding  : string
Default character encoding used by records in the archive. For example, UTF-8
$end_of_iterator  : bool
Whether or not the iterator still has more documents
$fetcher_prefix  : string
The fetcher prefix associated with this archive.
$header  : array<string|int, mixed>
Used to store fields of meta information needed to build the header information for each record processed (such as base_address, ip_address, lang, etc.)
$iterate_timestamp  : int
Timestamp of the archive that is being iterated over
$num_partitions  : int
Number of web archive objects in this web archive bundle
$overall_index  : int
Index, between 0 and $this->count, of the iterator's current position
$partition  : int
The current web archive in the bundle that is being iterated over
$partition_index  : int
The item within the current partition to be returned next
$result_dir  : string
The path to the directory where the iteration status is stored.
$result_timestamp  : int
Timestamp of the archive that is being used to store results in
__construct()  : mixed
Creates a web archive iterator with the given parameters.
getArchiveName()  : string
Returns the path to an archive given its timestamp.
nextPages()  : array<string|int, mixed>
Gets the next $num many docs from the iterator
reset()  : mixed
Resets the iterator to the start of the archive bundle
restoreCheckpoint()  : array<string|int, mixed>
Restores state from a previous instantiation, after the last batch of pages extracted.
saveCheckpoint()  : mixed
Saves the current state so that a new instantiation can pick up just after the last batch of pages extracted.
seekPage()  : mixed
Advances the iterator to the $limit page, with as little additional processing as possible
weight()  : bool
Estimates the importance of the site according to the weighting of the particular archive iterator

Properties

$compression

Used to store the name of the compression scheme that the iterator should use.

public string $compression

For example, gzip, bzip, etc.

$current_partition_num

Index of web archive in the web archive bundle that the iterator is currently getting results from

public int $current_partition_num

$delimiter

If the archive uses a string of some kind to separate records, then $delimiter is a regular expression that matches the separator

public string $delimiter
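
As a rough illustration of how a regular-expression delimiter can separate records, the standalone sketch below splits a raw string with PHP's preg_split. The separator pattern and the sample data are invented for this example; it does not reproduce how Yioop itself parses an archive.

```php
<?php
// Hypothetical delimiter: a line consisting only of "--RECORD--".
// This pattern is an assumption for illustration, not one Yioop uses.
$delimiter = '/^--RECORD--$/m';

// Made-up archive contents with three records.
$raw = "first page\n--RECORD--\nsecond page\n--RECORD--\nthird page\n";

// Split on the separator, trim surrounding whitespace,
// and drop any empty pieces left at the edges.
$pieces = array_map('trim', preg_split($delimiter, $raw));
$records = array_values(array_filter($pieces, function ($piece) {
    return $piece !== '';
}));
```

Here $records ends up holding one string per record, in the order the records appear in $raw.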

$encoding

Default character encoding used by records in the archive. For example, UTF-8

public string $encoding

$end_of_iterator

Whether or not the iterator still has more documents

public bool $end_of_iterator

$header

Used to store fields of meta information needed to build the header information for each record processed (such as base_address, ip_address, lang, etc.)

public array<string|int, mixed> $header

$iterate_timestamp

Timestamp of the archive that is being iterated over

public int $iterate_timestamp

$result_dir

The path to the directory where the iteration status is stored.

public string $result_dir

$result_timestamp

Timestamp of the archive that is being used to store results in

public int $result_timestamp

Methods

__construct()

Creates a web archive iterator with the given parameters.

public __construct(string $prefix, string $iterate_timestamp, string $result_timestamp) : mixed
Parameters
$prefix : string

fetcher number this bundle is associated with

$iterate_timestamp : string

timestamp of the web archive bundle to iterate over the pages of

$result_timestamp : string

timestamp of the web archive bundle results are being stored in

Return values
mixed

getArchiveName()

Returns the path to an archive given its timestamp.

public getArchiveName(string $timestamp) : string
Parameters
$timestamp : string

the archive timestamp

Return values
string

the path to the archive, based on the fetcher prefix used when this iterator was constructed

nextPages()

Gets the next $num many docs from the iterator

public nextPages(int $num[, bool $no_process = false ]) : array<string|int, mixed>
Parameters
$num : int

number of docs to get

$no_process : bool = false

this flag is inherited from base class but does not do anything in this case

Return values
array<string|int, mixed>

associative arrays for $num pages
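
The batch-oriented calling pattern that nextPages() suggests can be sketched with a stand-in class. FakeArchiveIterator below is hypothetical and is not the Yioop implementation: it only mimics the documented nextPages() signature and the $end_of_iterator flag, and it assumes the flag becomes true once no documents remain. The 'url' field in each page array is likewise an assumption.

```php
<?php
// Hypothetical stand-in for an archive iterator; NOT the Yioop class.
class FakeArchiveIterator
{
    public $end_of_iterator = false;
    private $docs;
    private $overall_index = 0;

    public function __construct(array $docs)
    {
        $this->docs = $docs;
        $this->end_of_iterator = empty($docs);
    }

    // Returns up to $num docs; $no_process is accepted but ignored,
    // mirroring the note in the parameter description above.
    public function nextPages($num, $no_process = false)
    {
        $pages = array_slice($this->docs, $this->overall_index, $num);
        $this->overall_index += count($pages);
        if ($this->overall_index >= count($this->docs)) {
            $this->end_of_iterator = true;
        }
        return $pages;
    }
}

// Made-up documents; the 'url' key is an assumed field name.
$iterator = new FakeArchiveIterator([
    ['url' => 'http://a.example/'],
    ['url' => 'http://b.example/'],
    ['url' => 'http://c.example/'],
]);

// Pull documents in batches of two until the iterator is exhausted.
$batches = [];
while (!$iterator->end_of_iterator) {
    $batches[] = $iterator->nextPages(2);
}
```

With three documents and a batch size of two, the loop collects one full batch and one partial batch.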

reset()

Resets the iterator to the start of the archive bundle

public reset() : mixed
Return values
mixed

restoreCheckpoint()

Restores state from a previous instantiation, after the last batch of pages extracted.

public restoreCheckpoint() : array<string|int, mixed>
Return values
array<string|int, mixed>

the data serialized when saveCheckpoint was called

saveCheckpoint()

Saves the current state so that a new instantiation can pick up just after the last batch of pages extracted.

public saveCheckpoint([array<string|int, mixed> $info = [] ]) : mixed
Parameters
$info : array<string|int, mixed> = []

data needed to restore where we are in the process of iterating through the archive.

Return values
mixed
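
The round trip between saveCheckpoint() and restoreCheckpoint() can be pictured as serializing an array of state to a file in the result directory and reading it back later. The sketch below is a minimal standalone version under that assumption. The function names, the file name iterate_status.txt, and the keys stored in the array are all hypothetical; the actual on-disk format used by Yioop is not described in this page.

```php
<?php
// Hypothetical helpers; the file name and array keys are assumptions.
function saveCheckpointSketch(string $result_dir, array $info): void
{
    $file = $result_dir . '/iterate_status.txt';
    file_put_contents($file, serialize($info));
}

function restoreCheckpointSketch(string $result_dir): array
{
    $file = $result_dir . '/iterate_status.txt';
    if (!file_exists($file)) {
        return []; // nothing has been saved yet
    }
    $info = unserialize(file_get_contents($file));
    return is_array($info) ? $info : [];
}

// Use a throwaway directory so the example has no side effects elsewhere.
$dir = sys_get_temp_dir() . '/checkpoint_sketch_' . uniqid();
mkdir($dir);

// Save the iterator position after a batch, then restore it
// as a fresh instantiation would.
saveCheckpointSketch($dir, [
    'overall_index' => 40,
    'partition' => 1,
    'partition_index' => 15,
]);
$restored = restoreCheckpointSketch($dir);
```

Returning an empty array when no checkpoint file exists lets a caller treat a first run and a resumed run uniformly.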

seekPage()

Advances the iterator to the $limit page, with as little additional processing as possible

public seekPage($limit) : mixed
Parameters
$limit :

page to advance to

Return values
mixed
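
One way to advance to a given page without touching the documents in between is to skip whole partitions by their document counts. The sketch below illustrates that idea only: locatePage() and the per-partition $counts array are hypothetical, and nothing here is taken from Yioop's seekPage() code. The returned keys borrow the names of the $partition, $partition_index, and $overall_index properties listed above.

```php
<?php
// Hypothetical helper: $counts[i] is assumed to hold the number of
// documents in partition i. Returns where page number $limit falls.
function locatePage(array $counts, int $limit): array
{
    $remaining = $limit;
    foreach ($counts as $i => $count) {
        if ($remaining < $count) {
            return [
                'partition' => $i,
                'partition_index' => $remaining,
                'overall_index' => $limit,
            ];
        }
        // Skip this whole partition without reading its documents.
        $remaining -= $count;
    }
    // $limit lies past the last document: park at the end.
    return [
        'partition' => count($counts),
        'partition_index' => 0,
        'overall_index' => array_sum($counts),
    ];
}

// Three partitions holding 10, 10, and 5 documents.
$position = locatePage([10, 10, 5], 23);
$past_end = locatePage([10, 10, 5], 99);
```

Page 23 falls in the third partition (index 2) at offset 3, because the first two partitions account for pages 0 through 19.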

weight()

Estimates the importance of the site according to the weighting of the particular archive iterator

public weight(&$site) : bool
Parameters
$site :

an associative array containing info about a web page

Return values
bool

false; we assume files were crawled roughly according to page importance, so we use the default estimate of doc rank


        
