Yioop_V9.5_Source_Code

MixArchiveBundleIterator extends ArchiveBundleIterator
in package Application

Used to do an archive crawl based on the results of a crawl mix.

The query terms for this crawl mix will have site:any raw 1 appended to them.

$bz2_iterator

Used to iterate over the contents of a bzipped file.


    public
        object
    $bz2_iterator

$compression

Used to store the name of the compression scheme that should be used when iterating.


    public
        string
    $compression

For example, gzip, bzip, etc.

$delimiter

If the archive uses a string of some kind to separate records, then the delimiter is a regular expression which will match the separator.


    public
        string
    $delimiter
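
As an illustration of how a regular-expression delimiter can separate records, the following is a minimal sketch using PHP's preg_split. The sample data and the pattern (a line made of three or more dashes) are assumptions made up for this example; the delimiters actually used by Yioop archives are not given on this page.

```php
<?php
// Hypothetical delimiter: a line consisting of three or more dashes.
$delimiter = '/\n-{3,}\n/';

// Hypothetical raw archive contents holding three records.
$raw = "first record\n---\nsecond record\n-----\nthird record";

// Split the raw text wherever the delimiter pattern matches,
// dropping any empty pieces.
$records = preg_split($delimiter, $raw, -1, PREG_SPLIT_NO_EMPTY);

print_r($records);
```

Because the separator is matched as a pattern rather than a fixed string, separators of slightly different lengths (as in the sample) still split cleanly.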

$encoding

Default character encoding used by records in the archive. For example, UTF-8


    public
        string
    $encoding

$end_of_iterator

Whether or not the iterator still has more documents


    public
        bool
    $end_of_iterator

$header

Used to store fields of meta information (such as base_address, ip_address, lang, etc.) needed to make the header information for each record processed.


    public
        array<string|int, mixed>
    $header

$iterate_timestamp

Timestamp of the archive that is being iterated over


    public
        int
    $iterate_timestamp

$limit

Count of how far into the crawl mix we have gone.


    public
        int
    $limit

$mix_timestamp

Used to hold the timestamp of the crawl mix being iterated over.


    public
        int
    $mix_timestamp

$result_dir

The path to the directory where the iteration status is stored.


    public
        string
    $result_dir

$result_timestamp

Used to hold timestamp of the index archive bundle of output results


    public
        int
    $result_timestamp

__construct()

Creates a web archive iterator with the given parameters.


    public
                    __construct(string $mix_timestamp, string $result_timestamp) : mixed

Parameters

$mix_timestamp : string: timestamp of the crawl mix whose pages are to be iterated over
$result_timestamp : string: timestamp of the web archive bundle results are being stored in

Return values

mixed —

getArchiveName()

Gets the name of the file that stores information about the current archive iterator (such as whether the end of the iterator has been reached).


    public
                    getArchiveName(int $timestamp) : mixed

Parameters

$timestamp : int: timestamp of the current archive crawl

Return values

mixed —

nextPages()

Gets the next $num many docs from the iterator


    public
                    nextPages(int $num[, bool $no_process = false ]) : array<string|int, mixed>

Parameters

$num : int: number of docs to get
$no_process : bool = false: this flag is inherited from base class but does not do anything in this case

Return values

array<string|int, mixed> —

associative arrays for $num pages
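
The batch-oriented pattern behind nextPages() and $end_of_iterator can be sketched with a toy stand-in. ToyBatchIterator, its in-memory document list, and the 'url' key are all hypothetical names for this illustration; the real class draws its pages from a crawl mix, and its internals are not shown on this page.

```php
<?php
// Toy stand-in for illustration only; not the Yioop implementation.
class ToyBatchIterator
{
    /** @var int how many docs have been handed out so far */
    public int $limit = 0;
    /** @var bool true once every doc has been returned */
    public bool $end_of_iterator = false;
    /** @var array in-memory docs standing in for crawl mix results */
    private array $docs;

    public function __construct(array $docs)
    {
        $this->docs = $docs;
        $this->end_of_iterator = ($docs === []);
    }

    /** Returns up to $num docs and advances the position. */
    public function nextPages(int $num, bool $no_process = false): array
    {
        $pages = array_slice($this->docs, $this->limit, $num);
        $this->limit += count($pages);
        if ($this->limit >= count($this->docs)) {
            $this->end_of_iterator = true;
        }
        return $pages;
    }
}

$iterator = new ToyBatchIterator([
    ['url' => 'https://example.com/a'],
    ['url' => 'https://example.com/b'],
    ['url' => 'https://example.com/c'],
]);

// Pull batches of two until the iterator reports it is exhausted.
$batches = [];
while (!$iterator->end_of_iterator) {
    $batches[] = $iterator->nextPages(2);
}
```

A caller typically loops in this way, asking for a fixed batch size and accepting a shorter final batch.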

reset()

Resets the iterator to the start of the archive bundle


    public
                    reset() : mixed

Return values

mixed —

restoreCheckpoint()

Restores state from a previous instantiation, after the last batch of pages extracted.


    public
                    restoreCheckpoint() : array<string|int, mixed>

Return values

array<string|int, mixed> —

the data serialized when saveCheckpoint was called

saveCheckpoint()

Saves the current state so that a new instantiation can pick up just after the last batch of pages extracted.


    public
                    saveCheckpoint([array<string|int, mixed> $info = [] ]) : mixed

Parameters

$info : array<string|int, mixed> = []: data needed to restore where we are in the process of iterating through the archive. By default, the fields LIMIT and END_OF_ITERATOR are saved.

Return values

mixed —
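
The save and restore cycle described for saveCheckpoint() and restoreCheckpoint() can be sketched as follows. This is a simplified illustration, assuming the state is serialized to a single status file; the function names, the temporary file, and the use of serialize() are assumptions for this example, and only the LIMIT and END_OF_ITERATOR field names come from the documentation above.

```php
<?php
// Hypothetical helper: writes the iteration state to a status file.
function saveCheckpointSketch(
    string $file,
    int $limit,
    bool $end_of_iterator,
    array $info = []
): void {
    // Always record the two default fields alongside any extra data.
    $info['LIMIT'] = $limit;
    $info['END_OF_ITERATOR'] = $end_of_iterator;
    file_put_contents($file, serialize($info));
}

// Hypothetical helper: reads the state back, or [] if none was saved.
function restoreCheckpointSketch(string $file): array
{
    if (!file_exists($file)) {
        return [];
    }
    $info = unserialize(file_get_contents($file));
    return is_array($info) ? $info : [];
}

// Stand-in for a status file inside $result_dir.
$file = tempnam(sys_get_temp_dir(), 'mix_status_');

saveCheckpointSketch($file, 250, false);
$restored = restoreCheckpointSketch($file);
unlink($file);
```

A new instance could then copy the restored LIMIT and END_OF_ITERATOR values into its own properties to resume just after the last batch.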

seekPage()

Advances the iterator to the $limit page, with as little additional processing as possible


    public
                    seekPage( $limit) : mixed

Parameters

$limit :: page to advance to

Return values

mixed —
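
One way to seek with little additional processing is to skip over records without decoding them. The sketch below assumes a hypothetical store with one JSON document per line in a memory stream; seekPageSketch and this storage layout are invented for illustration and do not reflect how the real class locates pages in a crawl mix.

```php
<?php
// Hypothetical record store: one JSON document per line.
$stream = fopen('php://memory', 'w+');
foreach (['a', 'b', 'c', 'd'] as $name) {
    fwrite($stream, json_encode(['title' => $name]) . "\n");
}
rewind($stream);

/**
 * Moves the stream to record number $limit, reading lines but never
 * decoding them. Returns the position actually reached.
 */
function seekPageSketch($stream, int $limit): int
{
    rewind($stream);
    $position = 0;
    while ($position < $limit && fgets($stream) !== false) {
        $position++;
    }
    return $position;
}

$position = seekPageSketch($stream, 2);

// Only the record at the new position is decoded.
$next = json_decode(fgets($stream), true);
fclose($stream);
```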

weight()

Estimates the importance of the site according to the weighting of the particular archive iterator


    public
                    weight( &$site) : bool

Parameters

$site :: an associative array containing info about a web page

Return values

bool —

false. We assume files were crawled roughly in order of page importance, so the default estimate of doc rank is used.
