Yioop_V9.5_Source_Code

DatabaseBundleIterator extends ArchiveBundleIterator
in package Application

Used to iterate through the records that result from an SQL query to a database

$bz2_iterator

Used to iterate over the contents of a bzipped file


    public
        object
    $bz2_iterator

$column_separator

DB Records are imported as a text string where column_separator is used to delimit the end of a column


    public
        string
    $column_separator

$compression

Stores the name of the compression method the iterator should use.


    public
        string
    $compression

For example, gzip, bzip, etc.

$db

File handle for current arc file


    public
        resource
    $db

$dbinfo

Information about the database connection (DBMS, DB_HOST, DB_USER, DB_PASSWORD, DB_NAME)


    public
        array<string|int, mixed>
    $dbinfo

$delimiter

If the archive uses a string of some kind to separate records, then delimiter is a regular expression that matches the separator


    public
        string
    $delimiter

$encoding

Character encoding used for the DB records


    public
        string
    $encoding

$end_of_iterator

Whether or not the iterator still has more documents


    public
        bool
    $end_of_iterator

$field_value_separator

For a given DB record each column is converted to a string: name_of_column field_value_separator value_of_column


    public
        string
    $field_value_separator
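
The way the two separators combine can be illustrated with a minimal sketch. The helper name rowToRecord and the separator values "=" and "\n" are assumptions made for illustration; they are not taken from the Yioop source, and the real iterator may build its records differently.

```php
<?php
// Hypothetical helper, not part of Yioop: flattens one DB row into text.
// The separator values used in the example call are illustrative assumptions.
function rowToRecord(array $row, string $field_value_separator,
    string $column_separator): string
{
    $record = "";
    foreach ($row as $column_name => $value) {
        // name_of_column field_value_separator value_of_column
        $record .= $column_name . $field_value_separator . $value;
        // column_separator marks the end of each column
        $record .= $column_separator;
    }
    return $record;
}

$row = ["id" => 7, "title" => "Hello"];
$record = rowToRecord($row, "=", "\n");
echo $record;
```

With these assumed separators, each column of the row ends up on its own line as a name=value pair.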

$header

Stores the metadata fields (such as base_address, ip_address, lang, etc.) needed to build the header information for each record processed.


    public
        array<string|int, mixed>
    $header

$iterate_dir

The path to the directory containing the archive partitions to be iterated over.


    public
        string
    $iterate_dir

$iterate_timestamp

Timestamp of the archive that is being iterated over


    public
        int
    $iterate_timestamp

$limit

Result row of the query that the iterator has processed up to


    public
        int
    $limit

$result_dir

The path to the directory where the iteration status is stored.


    public
        string
    $result_dir

$result_timestamp

Timestamp of the archive that is being used to store results in


    public
        int
    $result_timestamp

$sql

SQL query whose records are being indexed


    public
        string
    $sql

__construct()

Creates a database archive iterator with the given parameters. This kind of iterator is used to cycle through the results of an SQL query to a database, so that the results can be indexed by Yioop.


    public
                    __construct(string $iterate_timestamp, string $iterate_dir, string $result_timestamp, string $result_dir) : mixed

Parameters

$iterate_timestamp : string: timestamp of the arc archive bundle to iterate over the pages of
$iterate_dir : string: folder of files to iterate over
$result_timestamp : string: timestamp of the arc archive bundle results are being stored in
$result_dir : string: where to write last position checkpoints to

Return values

mixed —

nextPages()

Gets at most the next $num docs from the iterator. It might return fewer than $num documents if the partition changes or the end of the bundle is reached.


    public
                    nextPages(int $num[, bool $no_process = false ]) : array<string|int, mixed>

Parameters

$num : int: number of docs to get
$no_process : bool = false: do not do any processing on page data

Return values

array<string|int, mixed> —

associative arrays for $num pages
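
The batching contract described above (at most $num items per call, possibly fewer at the end) can be sketched with an in-memory stand-in. The class ArrayRowIterator is hypothetical and only mimics the documented behaviour over a plain array; it does not query a database and is not how Yioop implements nextPages.

```php
<?php
// Hypothetical stand-in for a query result; not Yioop's implementation.
class ArrayRowIterator
{
    public array $rows;
    public int $limit = 0;                // index of the next row to read
    public bool $end_of_iterator = false; // true once all rows are consumed

    public function __construct(array $rows)
    {
        $this->rows = $rows;
        $this->end_of_iterator = count($rows) === 0;
    }

    // Returns at most $num rows, fewer if the end of the data is reached.
    public function nextPages(int $num): array
    {
        $pages = array_slice($this->rows, $this->limit, $num);
        $this->limit += count($pages);
        $this->end_of_iterator = $this->limit >= count($this->rows);
        return $pages;
    }
}

$iterator = new ArrayRowIterator(["a", "b", "c", "d", "e"]);
$batch_sizes = [];
while (!$iterator->end_of_iterator) {
    $batch_sizes[] = count($iterator->nextPages(2));
}
// Five rows read two at a time: the last batch holds a single row.
```

A caller therefore should not assume that a short batch is an error; checking end_of_iterator (or an equivalent flag) is what decides whether to keep reading.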

reset()

Resets the iterator to the start of the archive bundle


    public
                    reset() : mixed

Return values

mixed —

restoreCheckpoint()

Restores the internal state from the file iterate_status.txt in the result dir such that the next call to nextPages will pick up from just after the last checkpoint. Each iterator should make a call to restoreCheckpoint at the end of the constructor method after the instance members have been initialized.


    public
                    restoreCheckpoint() : array<string|int, mixed>

Return values

array<string|int, mixed> —

the data serialized when saveCheckpoint was called

restoreCheckPoint()

Restores the internal state from the file iterate_status.txt in the result dir such that the next call to nextPages will pick up from just after the last checkpoint.


    public
                    restoreCheckPoint() : array<string|int, mixed>

Return values

array<string|int, mixed> —

the data serialized when saveCheckpoint was called

saveCheckpoint()

Stores the current progress to the file iterate_status.txt in the result dir such that a new instance of the iterator could be constructed and return the next set of pages without having to process all of the pages that came before. Each iterator should make a call to saveCheckpoint after extracting a batch of pages.


    public
                    saveCheckpoint([array<string|int, mixed> $info = [] ]) : mixed

Parameters

$info : array<string|int, mixed> = []: any extra info a subclass wants to save

Return values

mixed —

saveCheckPoint()

Used to save the result row we are at so that the iterator can start from that row the next time it is invoked.


    public
                    saveCheckPoint([array<string|int, mixed> $info = [] ]) : mixed

Parameters

$info : array<string|int, mixed> = []: any extra info a subclass wants to save

Return values

mixed —
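
The checkpoint round trip can be sketched as follows. Only the file name iterate_status.txt and the fact that the saved data is serialized come from the descriptions above; the function names, the stored keys, and the temporary directory are assumptions for illustration.

```php
<?php
// Hypothetical helpers, not Yioop's methods. They write and read
// iterate_status.txt in a given result directory using PHP serialization.
function saveCheckpointSketch(string $result_dir, array $info): void
{
    file_put_contents($result_dir . "/iterate_status.txt", serialize($info));
}

function restoreCheckpointSketch(string $result_dir): array
{
    $path = $result_dir . "/iterate_status.txt";
    if (!file_exists($path)) {
        return []; // no checkpoint yet: start from the beginning
    }
    $info = unserialize(file_get_contents($path));
    return is_array($info) ? $info : [];
}

// Assumed scratch directory standing in for the result dir.
$result_dir = sys_get_temp_dir() . "/checkpoint_sketch_" . uniqid();
mkdir($result_dir);
saveCheckpointSketch($result_dir, ["limit" => 200, "end_of_iterator" => false]);
$state = restoreCheckpointSketch($result_dir);
```

A freshly constructed iterator could copy the restored values (here the assumed limit key) back into its members, so the next batch starts after the last saved row.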

seekPage()

Advances the iterator to the $limit page, with as little additional processing as possible


    public
                    seekPage( $limit) : mixed

Parameters

$limit :: page to advance to

Return values

mixed —

weight()

Estimates the importance of the site according to the weighting of the particular archive iterator


    public
                    weight( &$site) : bool

Parameters

$site :: an associative array containing info about a web page

Return values

bool —

false; we assume arc files were crawled according to OPIC, so the default doc_depth is used to estimate page importance
