Yioop_V9.5_Source_Code

WebArchiveBundleIterator extends ArchiveBundleIterator
in package

Application

Class used to model iterating over the documents indexed in a WebArchiveBundle. This would typically be for the purpose of re-indexing these documents.

$archive

The web archive bundle being iterated over


    public
        object
    $archive

$bz2_iterator

Used to iterate over contents in a bzipped file


    public
        object
    $bz2_iterator

$compression

Used to store the name of the compression scheme that the iterator should use.


    public
        string
    $compression

For example, gzip, bzip, etc.

$count

Number of documents in the web archive bundle being iterated over


    public
        int
    $count

$current_partition_num

Index of web archive in the web archive bundle that the iterator is currently getting results from


    public
        int
    $current_partition_num

$delimiter

If the archive uses a string of some kind to separate records, then delimiter is a regular expression that matches the separator


    public
        string
    $delimiter
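
As a minimal sketch of how a regular-expression delimiter can split a raw buffer into records (this is not Yioop's code; the splitRecords helper, the sample delimiter, and the sample data are all assumptions made for illustration):

```php
<?php
/**
 * Hypothetical helper (not part of Yioop): splits a raw buffer into
 * records using a regular-expression delimiter, dropping empty pieces.
 */
function splitRecords(string $buffer, string $delimiter): array
{
    // PREG_SPLIT_NO_EMPTY discards empty strings produced by leading,
    // trailing, or adjacent separators.
    $records = preg_split($delimiter, $buffer, -1, PREG_SPLIT_NO_EMPTY);
    // preg_split returns false on a bad pattern; treat that as no records
    return ($records === false) ? [] : $records;
}

// Assumed example: records separated by a line consisting of "=====".
$delimiter = '/\n=====\n/';
$buffer = "first record\n=====\nsecond record\n=====\nthird record";
$records = splitRecords($buffer, $delimiter);
```

Storing the separator as a pattern rather than a literal string lets one property cover archives whose separators vary slightly from record to record.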

$encoding

Default character encoding used by records in the archive. For example, UTF-8


    public
        string
    $encoding

$end_of_iterator

Whether or not the iterator still has more documents


    public
        bool
    $end_of_iterator

$fetcher_prefix

The fetcher prefix associated with this archive.


    public
        string
    $fetcher_prefix

$header

Used to store fields of meta information (such as base_address, ip_address, lang, etc.) needed to make the header information for each record processed.


    public
        array<string|int, mixed>
    $header

$iterate_timestamp

Timestamp of the archive that is being iterated over


    public
        int
    $iterate_timestamp

$num_partitions

Number of web archive objects in this web archive bundle


    public
        int
    $num_partitions

$overall_index

Index between 0 and $this->count giving the iterator's current position


    public
        int
    $overall_index

$partition

The current web archive in the bundle that is being iterated over


    public
        int
    $partition

$partition_index

The item within the current partition to be returned next


    public
        int
    $partition_index

$result_dir

The path to the directory where the iteration status is stored.


    public
        string
    $result_dir

$result_timestamp

Timestamp of the archive that is being used to store results in


    public
        int
    $result_timestamp

__construct()

Creates a web archive iterator with the given parameters.


    public
                    __construct(string $prefix, string $iterate_timestamp, string $result_timestamp) : mixed

Parameters

$prefix : string: fetcher number this bundle is associated with
$iterate_timestamp : string: timestamp of the web archive bundle whose pages are to be iterated over
$result_timestamp : string: timestamp of the web archive bundle results are being stored in

Return values

mixed —

getArchiveName()

Returns the path to an archive given its timestamp.


    public
                    getArchiveName(string $timestamp) : string

Parameters

$timestamp : string: the archive timestamp

Return values

string —

the path to the archive, based off of the fetcher prefix used when this iterator was constructed

nextPages()

Gets the next $num many docs from the iterator


    public
                    nextPages(int $num[, bool $no_process = false ]) : array<string|int, mixed>

Parameters

$num : int: number of docs to get
$no_process : bool = false: this flag is inherited from base class but does not do anything in this case

Return values

array<string|int, mixed> —

associative arrays for $num pages
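
A typical way to consume an iterator of this shape is to request pages in batches until it reports that it is exhausted. The sketch below uses a hypothetical StubIterator that mimics only the documented nextPages() signature and the $end_of_iterator, $overall_index, and $count properties; it is not the Yioop class. The 'url' field and the reading of $end_of_iterator as "true once no documents remain" are assumptions.

```php
<?php
/**
 * Hypothetical stand-in for illustration only. It mimics the documented
 * nextPages() method and a few public properties; it is NOT Yioop code.
 */
class StubIterator
{
    public $end_of_iterator = false;
    public $overall_index = 0;
    public $count;
    private $pages;

    public function __construct(array $pages)
    {
        $this->pages = $pages;
        $this->count = count($pages);
    }

    public function nextPages(int $num, bool $no_process = false): array
    {
        // Return up to $num pages starting at the current position
        $batch = array_slice($this->pages, $this->overall_index, $num);
        $this->overall_index += count($batch);
        // Assumption: the flag becomes true once every page was returned
        $this->end_of_iterator = ($this->overall_index >= $this->count);
        return $batch;
    }
}

$iterator = new StubIterator([
    ['url' => 'https://example.com/a'],
    ['url' => 'https://example.com/b'],
    ['url' => 'https://example.com/c'],
]);
$seen = [];
// Pull batches of two pages until the iterator reports it is finished
while (!$iterator->end_of_iterator) {
    foreach ($iterator->nextPages(2) as $page) {
        $seen[] = $page['url'];
    }
}
```

The last batch may hold fewer than $num pages, so callers should loop over whatever is returned rather than assume a full batch.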

reset()

Resets the iterator to the start of the archive bundle


    public
                    reset() : mixed

Return values

mixed —

restoreCheckpoint()

Restores state from a previous instantiation, after the last batch of pages extracted.


    public
                    restoreCheckpoint() : array<string|int, mixed>

Return values

array<string|int, mixed> —

the data serialized when saveCheckpoint was called

saveCheckpoint()

Saves the current state so that a new instantiation can pick up just after the last batch of pages extracted.


    public
                    saveCheckpoint([array<string|int, mixed> $info = [] ]) : mixed

Parameters

$info : array<string|int, mixed> = []: data needed to restore where we are in the process of iterating through the archive.

Return values

mixed —
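
The save and restore pair can be pictured as a serialize-to-file round trip. The following is a sketch under stated assumptions, not Yioop's implementation: the helper names, the status.txt file name, and the particular fields saved are hypothetical, and only PHP built-ins are used.

```php
<?php
/**
 * Hypothetical checkpoint helpers for illustration only. The file name
 * and the fields stored are assumptions, not Yioop's actual format.
 */
function saveCheckpointSketch(string $result_dir, array $info): void
{
    if (!is_dir($result_dir)) {
        mkdir($result_dir, 0777, true);
    }
    // Persist the iterator state so a later process can resume from it
    file_put_contents($result_dir . "/status.txt", serialize($info));
}

function restoreCheckpointSketch(string $result_dir): array
{
    $file = $result_dir . "/status.txt";
    if (!file_exists($file)) {
        // No checkpoint yet: the caller starts from the beginning
        return [];
    }
    $info = unserialize(file_get_contents($file));
    return is_array($info) ? $info : [];
}

$result_dir = sys_get_temp_dir() . "/iterator_checkpoint_" . uniqid();
$state = [
    'overall_index' => 40,
    'current_partition_num' => 2,
    'partition_index' => 5,
    'end_of_iterator' => false,
];
saveCheckpointSketch($result_dir, $state);
$restored = restoreCheckpointSketch($result_dir);
```

Saving after each batch means that an interrupted run repeats at most one batch of work when it resumes.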

seekPage()

Advances the iterator to page $limit, with as little additional processing as possible


    public
                    seekPage( $limit) : mixed

Parameters

$limit :: page to advance to

Return values

mixed —
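
One way to seek cheaply, assuming the number of documents in each partition is known in advance, is to skip whole partitions arithmetically instead of reading their records. The locatePage helper and the sample counts below are hypothetical and only illustrate that idea; the source does not say how seekPage() itself is implemented.

```php
<?php
/**
 * Hypothetical helper (not part of Yioop): maps an overall document
 * index to a [partition number, index within that partition] pair,
 * given the number of documents held by each partition.
 */
function locatePage(array $partition_counts, int $limit): array
{
    $partition_num = 0;
    foreach ($partition_counts as $count) {
        if ($limit < $count) {
            // The target lies inside this partition
            return [$partition_num, $limit];
        }
        // Skip the whole partition without touching its records
        $limit -= $count;
        $partition_num++;
    }
    // $limit is at or past the end of the bundle
    return [$partition_num, 0];
}

// Assumed example: three partitions holding 100, 100, and 50 documents
$partition_counts = [100, 100, 50];
list($partition_num, $partition_index) = locatePage($partition_counts, 230);
```

In this example document 230 falls in the third partition (number 2) at offset 30, so the first two partitions never need to be opened.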

weight()

Estimates the importance of the site according to the weighting of the particular archive iterator


    public
                    weight( &$site) : bool

Parameters

$site :: an associative array containing info about a web page

Return values

bool —

false; we assume files were crawled roughly in order of page importance, so the default estimate of doc rank is used
