Yioop_V9.5_Source_Code

ArchiveBundleIterator
in package

Application

implements CrawlConstants

Abstract class used to model iterating over documents indexed in a WebArchiveBundle or a set of such bundles.

Interfaces, Classes, Traits and Enums

CrawlConstants: Shared constants and enums used by components that are involved in the crawling process

$bz2_iterator : object: Used to iterate over contents in a bzipped file
$compression : string: Used to store the name of the compression that should be used when iterating.
$delimiter : string: If the archive uses a string of some kind to separate records, then delimiter is a regular expression which will match the separator
$encoding : string: Default character encoding used by records in the archive. For example, UTF-8
$end_of_iterator : bool: Whether or not the iterator still has more documents
$header : array<string|int, mixed>: Used to store fields of meta information (such as base_address, ip_address, lang, etc.) needed to make header information for each record processed.
$iterate_timestamp : int: Timestamp of the archive that is being iterated over
$result_dir : string: The path to the directory where the iteration status is stored.
$result_timestamp : int: Timestamp of the archive in which results are being stored
nextPages() : array<string|int, mixed>: Gets the next $num many docs from the iterator
reset() : mixed: Resets the iterator to the start of the archive bundle
restoreCheckpoint() : array<string|int, mixed>: Restores the internal state from the file iterate_status.txt in the result dir such that the next call to nextPages will pick up from just after the last checkpoint. Each iterator should make a call to restoreCheckpoint at the end of the constructor method after the instance members have been initialized.
saveCheckpoint() : mixed: Stores the current progress to the file iterate_status.txt in the result dir such that a new instance of the iterator could be constructed and return the next set of pages without having to process all of the pages that came before. Each iterator should make a call to saveCheckpoint after extracting a batch of pages.
seekPage() : mixed: Advances the iterator to the $limit page, with as little additional processing as possible
weight() : mixed: Estimates the importance of the site according to the weighting of the particular archive iterator

$bz2_iterator

Used to iterate over contents in a bzipped file


    public object $bz2_iterator

$compression

Used to store the name of the compression that should be used when iterating.


    public string $compression

For example, gzip, bzip, etc.

$delimiter

If the archive uses a string of some kind to separate records, then delimiter is a regular expression which will match the separator


    public string $delimiter
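As a rough illustration of a delimiter that is a regular expression rather than a fixed string, the sketch below splits a small in-memory archive with preg_split. The pattern, the sample text, and the use of preg_split are all assumptions made for this example; the documentation does not say how Yioop applies $delimiter or whether the stored value includes the enclosing slashes.

```php
<?php
// Hypothetical delimiter: a line made of three or more dashes.
// Assumption: the pattern is stored with its PCRE delimiters.
$delimiter = '/\n-{3,}\n/';

// Hypothetical archive contents with separators of varying length.
$archive_text = "first record\n---\nsecond record\n-----\nthird record";

// Because the separator is matched by a regex, both separator
// lengths above split the text into records.
$records = preg_split($delimiter, $archive_text);
```

A regular expression lets one iterator handle archives whose separators vary slightly from record to record, which a fixed string could not do.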

$encoding

Default character encoding used by records in the archive. For example, UTF-8


    public string $encoding

$end_of_iterator

Whether or not the iterator still has more documents


    public bool $end_of_iterator

$header

Used to store fields of meta information (such as base_address, ip_address, lang, etc.) needed to make header information for each record processed.


    public array<string|int, mixed> $header

$iterate_timestamp

Timestamp of the archive that is being iterated over


    public int $iterate_timestamp

$result_dir

The path to the directory where the iteration status is stored.


    public string $result_dir

$result_timestamp

Timestamp of the archive in which results are being stored


    public int $result_timestamp

nextPages()

Gets the next $num many docs from the iterator


    public abstract nextPages(int $num[, bool $no_process = false ]) : array<string|int, mixed>

Parameters

$num : int: number of docs to get
$no_process : bool = false: do not do any processing on page data

Return values

array<string|int, mixed> —

associative arrays for $num pages

reset()

Resets the iterator to the start of the archive bundle


    public abstract reset() : mixed

Return values

mixed —

restoreCheckpoint()

Restores the internal state from the file iterate_status.txt in the result dir such that the next call to nextPages will pick up from just after the last checkpoint. Each iterator should make a call to restoreCheckpoint at the end of the constructor method after the instance members have been initialized.


    public restoreCheckpoint() : array<string|int, mixed>

Return values

array<string|int, mixed> —

the data serialized when saveCheckpoint was called

saveCheckpoint()

Stores the current progress to the file iterate_status.txt in the result dir such that a new instance of the iterator could be constructed and return the next set of pages without having to process all of the pages that came before. Each iterator should make a call to saveCheckpoint after extracting a batch of pages.


    public saveCheckpoint([array<string|int, mixed> $info = [] ]) : mixed

Parameters

$info : array<string|int, mixed> = []: any extra info a subclass wants to save

Return values

mixed —
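The checkpoint convention described for restoreCheckpoint() and saveCheckpoint() can be sketched with a small stand-in class. This is not Yioop's implementation: SketchIterator, its $position counter, the in-memory $records array, and the use of PHP's serialize format for iterate_status.txt are all assumptions made for illustration. Only the file name, the result directory, and the two calling conventions (restore at the end of the constructor, save after each batch) come from the descriptions above.

```php
<?php
// Hypothetical stand-in, NOT Yioop's ArchiveBundleIterator. It only
// mimics the documented save/restore pattern around iterate_status.txt.
class SketchIterator
{
    public string $result_dir;
    public bool $end_of_iterator = false;
    public int $position = 0; // assumed progress counter, not a documented member
    private array $records;   // assumed in-memory stand-in for an archive

    public function __construct(array $records, string $result_dir)
    {
        $this->records = $records;
        $this->result_dir = $result_dir;
        // Documented convention: restore at the end of the constructor,
        // once the instance members have been initialized.
        $this->restoreCheckpoint();
    }

    public function saveCheckpoint(array $info = []): void
    {
        $info['end_of_iterator'] = $this->end_of_iterator;
        $info['position'] = $this->position;
        // Assumption: the status file holds a PHP-serialized array.
        file_put_contents($this->result_dir . "/iterate_status.txt",
            serialize($info));
    }

    public function restoreCheckpoint(): array
    {
        $file = $this->result_dir . "/iterate_status.txt";
        if (!file_exists($file)) {
            return []; // no checkpoint yet, so start from the beginning
        }
        $info = unserialize(file_get_contents($file));
        $this->end_of_iterator = $info['end_of_iterator'];
        $this->position = $info['position'];
        return $info;
    }

    public function nextPages(int $num): array
    {
        $pages = array_slice($this->records, $this->position, $num);
        $this->position += count($pages);
        $this->end_of_iterator = $this->position >= count($this->records);
        // Documented convention: checkpoint after extracting a batch.
        $this->saveCheckpoint();
        return $pages;
    }
}

$dir = sys_get_temp_dir() . "/sketch_iter_" . uniqid();
mkdir($dir);
$docs = ["a", "b", "c", "d", "e"];

$first = new SketchIterator($docs, $dir);
$batch_one = $first->nextPages(2);

// A fresh instance resumes from the checkpoint, not from the start.
$second = new SketchIterator($docs, $dir);
$batch_two = $second->nextPages(2);
```

In this sketch the second instance returns the third and fourth records without touching the first two, which is the behavior the checkpoint is meant to provide.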

seekPage()

Advances the iterator to the $limit page, with as little additional processing as possible


    public seekPage($limit) : mixed

Parameters

$limit : page to advance to

Return values

mixed —
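One plausible way to advance "with as little additional processing as possible" is to reset the iterator and then read $limit pages through nextPages() with $no_process set to true. The sketch below shows that idea on a stand-in class. SeekSketch, its $position and $processed counters, and the strtoupper call standing in for page processing are hypothetical, and the documentation does not confirm that Yioop implements seekPage() this way.

```php
<?php
// Hypothetical stand-in used only to illustrate seeking via $no_process.
class SeekSketch
{
    public bool $end_of_iterator = false;
    public int $position = 0;  // assumed progress counter
    public int $processed = 0; // counts records that took the costly path
    private array $records;

    public function __construct(array $records)
    {
        $this->records = $records;
    }

    public function reset(): void
    {
        $this->position = 0;
        $this->end_of_iterator = count($this->records) == 0;
    }

    public function nextPages(int $num, bool $no_process = false): array
    {
        $raw = array_slice($this->records, $this->position, $num);
        $this->position += count($raw);
        $this->end_of_iterator = $this->position >= count($this->records);
        if ($no_process) {
            return $raw; // skip the per-page work entirely
        }
        $this->processed += count($raw);
        // strtoupper stands in for whatever processing a real iterator does.
        return array_map('strtoupper', $raw);
    }

    public function seekPage(int $limit): void
    {
        // Start over, then skip $limit pages without processing them.
        $this->reset();
        $this->nextPages($limit, true);
    }
}

$it = new SeekSketch(["p0", "p1", "p2", "p3"]);
$it->seekPage(2);
$pages = $it->nextPages(2);
```

Here only the two pages read after the seek go through the processing step; the two skipped pages are read but never processed.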

weight()

Estimates the importance of the site according to the weighting of the particular archive iterator


    public abstract weight(&$site) : mixed

Parameters

$site : an associative array containing info about a web page

Return values

mixed —

a 4-bit number, or false if the iterator doesn't use the default ranking method
