Yioop_V9.5_Source_Code_Documentation

DatabaseBundleIterator extends ArchiveBundleIterator
in package

Used to iterate through the records that result from an SQL query to a database

Tags
author

Chris Pollett

see
WebArchiveBundle

Table of Contents

$bz2_iterator  : object
Used to iterate over contents in a bzipped file
$column_separator  : string
DB Records are imported as a text string where column_separator is used to delimit the end of a column
$compression  : string
Stores the name of the compression that should be used when iterating.
$db  : resource
File handle for current arc file
$dbinfo  : array<string|int, mixed>
Information about the database connection (DBMS, DB_HOST, DB_USER, DB_PASSWORD, DB_NAME)
$delimiter  : string
If the archive uses a string of some kind to separate records, then delimiter is a regular expression which will match the separator
$encoding  : string
What character encoding is used for the DB records
$end_of_iterator  : bool
Whether or not the iterator still has more documents
$field_value_separator  : string
For a given DB record each column is converted to a string: name_of_column field_value_separator value_of_column
$header  : array<string|int, mixed>
Used to store fields of meta information which is needed in making header information for each record processed. (such as base_address, ip_address, lang, etc.)
$iterate_dir  : string
The path to the directory containing the archive partitions to be iterated over.
$iterate_timestamp  : int
Timestamp of the archive that is being iterated over
$limit  : int
Result row of the query that the iterator has processed up to
$result_dir  : string
The path to the directory where the iteration status is stored.
$result_timestamp  : int
Timestamp of the archive that is being used to store results in
$sql  : string
SQL query whose records we are indexing
__construct()  : mixed
Creates a database archive iterator with the given parameters. This kind of iterator is used to cycle through the results of a SQL query to a database, so that the results might be indexed by Yioop.
nextPages()  : array<string|int, mixed>
Gets at most the next $num docs from the iterator. It might return fewer than $num documents if the partition changes or the end of the bundle is reached.
reset()  : mixed
Resets the iterator to the start of the archive bundle
restoreCheckpoint()  : array<string|int, mixed>
Restores the internal state from the file iterate_status.txt in the result dir such that the next call to nextPages will pick up from just after the last checkpoint. Each iterator should make a call to restoreCheckpoint at the end of the constructor method after the instance members have been initialized.
restoreCheckPoint()  : array<string|int, mixed>
Restores the internal state from the file iterate_status.txt in the result dir such that the next call to nextPages will pick up from just after the last checkpoint.
saveCheckpoint()  : mixed
Stores the current progress to the file iterate_status.txt in the result dir such that a new instance of the iterator could be constructed and return the next set of pages without having to process all of the pages that came before. Each iterator should make a call to saveCheckpoint after extracting a batch of pages.
saveCheckPoint()  : mixed
Used to save the result row we are at so that the iterator can start from that row the next time it is invoked.
seekPage()  : mixed
Advances the iterator to the $limit page, with as little additional processing as possible
weight()  : bool
Estimates the importance of the site according to the weighting of the particular archive iterator

Properties

$column_separator

DB Records are imported as a text string where column_separator is used to delimit the end of a column

public string $column_separator

$compression

Stores the name of the compression that should be used when iterating.

public string $compression

For example, gzip, bzip, etc.

$dbinfo

Information about the database connection (DBMS, DB_HOST, DB_USER, DB_PASSWORD, DB_NAME)

public array<string|int, mixed> $dbinfo

$delimiter

If the archive uses a string of some kind to separate records, then delimiter is a regular expression which will match the separator

public string $delimiter
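
As an illustration of a regular-expression record delimiter, the following minimal sketch splits a text blob into records with PHP's preg_split. The pattern, the sample text, and the variable names are assumptions made for this example; the page does not say what value $delimiter actually holds or how the class applies it.

```php
<?php
// Assumed example pattern: a line made only of three or more dashes
// separates records. This is not a value taken from Yioop.
$delimiter = "/\n-{3,}\n/";

// Sample archive text, invented for illustration
$archive = "record one\n---\nrecord two\n-----\nrecord three";

// Split on every match of the delimiter, dropping empty pieces
$records = preg_split($delimiter, $archive, -1, PREG_SPLIT_NO_EMPTY);

print_r($records);
```

Using a pattern rather than a fixed string lets one delimiter cover separators of varying length, such as the two dash lines above.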

$end_of_iterator

Whether or not the iterator still has more documents

public bool $end_of_iterator

$field_value_separator

For a given DB record each column is converted to a string: name_of_column field_value_separator value_of_column

public string $field_value_separator
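
The two separator properties can be read together as a recipe for flattening a row into text. The sketch below shows one way that could look; rowToText is a hypothetical helper, not a Yioop function, and the separator values "=" and "\n" are assumed purely for illustration.

```php
<?php
// Hypothetical helper (not part of Yioop): flattens one associative
// database row into a single text string.
function rowToText(array $row, string $field_value_separator,
    string $column_separator): string
{
    $text = "";
    foreach ($row as $name => $value) {
        // name_of_column field_value_separator value_of_column
        $text .= $name . $field_value_separator . $value;
        // column_separator marks the end of each column
        $text .= $column_separator;
    }
    return $text;
}

// Invented sample row and assumed separator values
$row = ["ID" => 7, "TITLE" => "Hello"];
echo rowToText($row, "=", "\n");
```

With these assumed separators each column lands on its own line as name=value.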

$header

Used to store fields of meta information which is needed in making header information for each record processed. (such as base_address, ip_address, lang, etc.)

public array<string|int, mixed> $header

$iterate_dir

The path to the directory containing the archive partitions to be iterated over.

public string $iterate_dir

$iterate_timestamp

Timestamp of the archive that is being iterated over

public int $iterate_timestamp

$result_dir

The path to the directory where the iteration status is stored.

public string $result_dir

$result_timestamp

Timestamp of the archive that is being used to store results in

public int $result_timestamp

Methods

__construct()

Creates a database archive iterator with the given parameters. This kind of iterator is used to cycle through the results of a SQL query to a database, so that the results might be indexed by Yioop.

public __construct(string $iterate_timestamp, string $iterate_dir, string $result_timestamp, string $result_dir) : mixed
Parameters
$iterate_timestamp : string

timestamp of the arc archive bundle whose pages are to be iterated over

$iterate_dir : string

folder of files to iterate over

$result_timestamp : string

timestamp of the arc archive bundle results are being stored in

$result_dir : string

where to write last position checkpoints to

Return values
mixed

nextPages()

Gets at most the next $num docs from the iterator. It might return fewer than $num documents if the partition changes or the end of the bundle is reached.

public nextPages(int $num[, bool $no_process = false ]) : array<string|int, mixed>
Parameters
$num : int

number of docs to get

$no_process : bool = false

do not do any processing on page data

Return values
array<string|int, mixed>

associative arrays for $num pages
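
To make the batching behaviour concrete, here is a minimal sketch in which a plain array stands in for the SQL result set. ArrayRowIterator is a hypothetical class written for this example, not Yioop code. It assumes that $limit counts the rows already handed out and that $end_of_iterator becomes true once no rows remain; the real class may differ on both points.

```php
<?php
// Hypothetical stand-in: an array plays the role of the query results.
class ArrayRowIterator
{
    public array $rows;
    public int $limit = 0;                // rows processed so far (assumed)
    public bool $end_of_iterator = false; // true once exhausted (assumed)

    public function __construct(array $rows)
    {
        $this->rows = $rows;
    }

    // Returns at most $num rows; fewer if the end is reached first
    public function nextPages(int $num): array
    {
        // Comparable in spirit to appending an offset and row count
        // to the SQL query (an assumption about the real class)
        $batch = array_slice($this->rows, $this->limit, $num);
        $this->limit += count($batch);
        if ($this->limit >= count($this->rows)) {
            $this->end_of_iterator = true;
        }
        return $batch;
    }

    // Goes back to the first row
    public function reset(): void
    {
        $this->limit = 0;
        $this->end_of_iterator = false;
    }
}

$it = new ArrayRowIterator(["a", "b", "c", "d", "e"]);
while (!$it->end_of_iterator) {
    $batch = $it->nextPages(2);
    echo count($batch), " rows, now at row ", $it->limit, "\n";
}
```

The last batch holds a single row, which mirrors the note above that a call may return fewer than $num documents.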

reset()

Resets the iterator to the start of the archive bundle

public reset() : mixed
Return values
mixed

restoreCheckpoint()

Restores the internal state from the file iterate_status.txt in the result dir such that the next call to nextPages will pick up from just after the last checkpoint. Each iterator should make a call to restoreCheckpoint at the end of the constructor method after the instance members have been initialized.

public restoreCheckpoint() : array<string|int, mixed>
Return values
array<string|int, mixed>

the data serialized when saveCheckpoint was called

restoreCheckPoint()

Restores the internal state from the file iterate_status.txt in the result dir such that the next call to nextPages will pick up from just after the last checkpoint.

public restoreCheckPoint() : array<string|int, mixed>
Return values
array<string|int, mixed>

the data serialized when saveCheckpoint was called

saveCheckpoint()

Stores the current progress to the file iterate_status.txt in the result dir such that a new instance of the iterator could be constructed and return the next set of pages without having to process all of the pages that came before. Each iterator should make a call to saveCheckpoint after extracting a batch of pages.

public saveCheckpoint([array<string|int, mixed> $info = [] ]) : mixed
Parameters
$info : array<string|int, mixed> = []

any extra info a subclass wants to save

Return values
mixed

saveCheckPoint()

Used to save the result row we are at so that the iterator can start from that row the next time it is invoked.

public saveCheckPoint([array<string|int, mixed> $info = [] ]) : mixed
Parameters
$info : array<string|int, mixed> = []

any extra info a subclass wants to save

Return values
mixed
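
The save and restore pair can be pictured as writing a small state array to iterate_status.txt and reading it back. The sketch below assumes PHP's serialize format and a "limit" key; the function names, the key, and the temporary directory are hypothetical and are not taken from the Yioop source.

```php
<?php
// Hypothetical helpers (not Yioop code) that persist iterator progress.
function saveCheckpointSketch(string $result_dir, array $info): void
{
    // Assumption: the state is stored with PHP's serialize()
    file_put_contents($result_dir . "/iterate_status.txt",
        serialize($info));
}

function restoreCheckpointSketch(string $result_dir): array
{
    $file = $result_dir . "/iterate_status.txt";
    if (!file_exists($file)) {
        return []; // no checkpoint yet, start from the beginning
    }
    $info = unserialize(file_get_contents($file));
    return is_array($info) ? $info : [];
}

// Temporary directory used only for this demonstration
$result_dir = sys_get_temp_dir() . "/checkpoint_sketch_" . getmypid();
if (!is_dir($result_dir)) {
    mkdir($result_dir, 0777, true);
}

// Save the row reached after a batch, then read it back
saveCheckpointSketch($result_dir, ["limit" => 40]);
$state = restoreCheckpointSketch($result_dir);
echo "resume from row ", $state["limit"], "\n";
```

A fresh iterator instance could read the saved row number in its constructor and continue from there instead of reprocessing earlier rows.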

seekPage()

Advances the iterator to the $limit page, with as little additional processing as possible

public seekPage( $limit) : mixed
Parameters
$limit :

page to advance to

Return values
mixed

weight()

Estimates the importance of the site according to the weighting of the particular archive iterator

public weight( &$site) : bool
Parameters
$site :

an associative array containing info about a web page

Return values
bool

false; arc files are assumed to have been crawled according to OPIC, so the default doc_depth is used to estimate page importance


        
