DatabaseBundleIterator
extends ArchiveBundleIterator
Used to iterate through the records that result from an SQL query to a database
Table of Contents
- $bz2_iterator : object
- Used to iterate over contents in a bzipped file
- $column_separator : string
- DB Records are imported as a text string where column_separator is used to delimit the end of a column
- $compression : string
- Used to store the name of the compression scheme the iterator should use.
- $db : resource
- File handle for current arc file
- $dbinfo : array<string|int, mixed>
- Information about the database connection (DBMS, DB_HOST, DB_USER, DB_PASSWORD, DB_NAME)
- $delimiter : string
- If the archive uses a string of some kind to separate records, then delimiter is a regular expression which will match the separator
- $encoding : string
- What character encoding is used for the DB records
- $end_of_iterator : bool
- Whether or not the iterator still has more documents
- $field_value_separator : string
- For a given DB record each column is converted to a string: name_of_column field_value_separator value_of_column
- $header : array<string|int, mixed>
- Used to store fields of meta information which is needed in making header information for each record processed. (such as base_address, ip_address, lang, etc.)
- $iterate_dir : string
- The path to the directory containing the archive partitions to be iterated over.
- $iterate_timestamp : int
- Timestamp of the archive that is being iterated over
- $limit : int
- The result row of the query that the iterator has processed up to
- $result_dir : string
- The path to the directory where the iteration status is stored.
- $result_timestamp : int
- Timestamp of the archive that is being used to store results in
- $sql : string
- SQL query whose records we are indexing
- __construct() : mixed
- Creates a database archive iterator with the given parameters. This kind of iterator is used to cycle through the results of a SQL query to a database, so that the results might be indexed by Yioop.
- nextPages() : array<string|int, mixed>
- Gets at most the next $num docs from the iterator. It might return fewer than $num documents if the partition changes or the end of the bundle is reached.
- reset() : mixed
- Resets the iterator to the start of the archive bundle
- restoreCheckpoint() : array<string|int, mixed>
- Restores the internal state from the file iterate_status.txt in the result dir such that the next call to nextPages will pick up from just after the last checkpoint. Each iterator should make a call to restoreCheckpoint at the end of the constructor method after the instance members have been initialized.
- restoreCheckPoint() : array<string|int, mixed>
- Restores the internal state from the file iterate_status.txt in the result dir such that the next call to nextPages will pick up from just after the last checkpoint.
- saveCheckpoint() : mixed
- Stores the current progress to the file iterate_status.txt in the result dir such that a new instance of the iterator could be constructed and return the next set of pages without having to process all of the pages that came before. Each iterator should make a call to saveCheckpoint after extracting a batch of pages.
- saveCheckPoint() : mixed
- Used to save the result row we are at so that the iterator can start from that row the next time it is invoked.
- seekPage() : mixed
- Advances the iterator to the $limit page, with as little additional processing as possible
- weight() : bool
- Estimates the importance of the site according to the weighting of the particular archive iterator
Properties
$bz2_iterator
Used to iterate over contents in a bzipped file
public
object
$bz2_iterator
$column_separator
DB Records are imported as a text string where column_separator is used to delimit the end of a column
public
string
$column_separator
$compression
Used to store the name of the compression scheme the iterator should use.
public
string
$compression
For example, gzip, bzip, etc.
$db
File handle for current arc file
public
resource
$db
$dbinfo
Information about the database connection (DBMS, DB_HOST, DB_USER, DB_PASSWORD, DB_NAME)
public
array<string|int, mixed>
$dbinfo
$delimiter
If the archive uses a string of some kind to separate records, then delimiter is a regular expression which will match the separator
public
string
$delimiter
$encoding
What character encoding is used for the DB records
public
string
$encoding
$end_of_iterator
Whether or not the iterator still has more documents
public
bool
$end_of_iterator
$field_value_separator
For a given DB record each column is converted to a string: name_of_column field_value_separator value_of_column
public
string
$field_value_separator
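The way $field_value_separator and $column_separator combine to turn a database row into text can be illustrated with a minimal sketch. The function name recordToString and the separator values below are hypothetical and chosen only for illustration; they are not taken from Yioop's source.

```php
<?php
/**
 * Hypothetical helper: flattens one database row into a single string.
 * The name and the separator values used below are assumptions.
 */
function recordToString(array $row, string $field_value_separator,
    string $column_separator): string
{
    $out = "";
    foreach ($row as $name => $value) {
        // name_of_column field_value_separator value_of_column,
        // then column_separator marks the end of the column
        $out .= $name . $field_value_separator . $value . $column_separator;
    }
    return $out;
}

// Example row with illustrative separators ": " and a newline
$row = ["id" => 7, "title" => "Hello"];
echo recordToString($row, ": ", "\n");
```

With these assumed separators, each column ends up on its own line in the form name: value.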
$header
Used to store fields of meta information which is needed in making header information for each record processed. (such as base_address, ip_address, lang, etc.)
public
array<string|int, mixed>
$header
$iterate_dir
The path to the directory containing the archive partitions to be iterated over.
public
string
$iterate_dir
$iterate_timestamp
Timestamp of the archive that is being iterated over
public
int
$iterate_timestamp
$limit
The result row of the query that the iterator has processed up to
public
int
$limit
$result_dir
The path to the directory where the iteration status is stored.
public
string
$result_dir
$result_timestamp
Timestamp of the archive that is being used to store results in
public
int
$result_timestamp
$sql
SQL query whose records we are indexing
public
string
$sql
Methods
__construct()
Creates a database archive iterator with the given parameters. This kind of iterator is used to cycle through the results of a SQL query to a database, so that the results might be indexed by Yioop.
public
__construct(string $iterate_timestamp, string $iterate_dir, string $result_timestamp, string $result_dir) : mixed
Parameters
- $iterate_timestamp : string
-
timestamp of the arc archive bundle to iterate over the pages of
- $iterate_dir : string
-
folder of files to iterate over
- $result_timestamp : string
-
timestamp of the arc archive bundle results are being stored in
- $result_dir : string
-
where to write last position checkpoints to
Return values
mixed
nextPages()
Gets at most the next $num docs from the iterator. It might return fewer than $num documents if the partition changes or the end of the bundle is reached.
public
nextPages(int $num[, bool $no_process = false ]) : array<string|int, mixed>
Parameters
- $num : int
-
number of docs to get
- $no_process : bool = false
-
do not do any processing on page data
Return values
array<string|int, mixed> —associative arrays for $num pages
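The batching behaviour of nextPages (at most $num results per call, fewer once the results run out) can be sketched with a small stand-in class. ArrayRowIterator is hypothetical and uses an in-memory array in place of a SQL result set; it is not Yioop's implementation, and it assumes $end_of_iterator becomes true once no rows remain.

```php
<?php
/**
 * Hypothetical stand-in for a row-offset based iterator; not Yioop's code.
 * An in-memory array plays the role of the SQL result set.
 */
class ArrayRowIterator
{
    /** @var array rows standing in for the query results */
    private $rows;
    /** @var int result row the iterator has processed up to */
    public $limit = 0;
    /** @var bool assumed to be true once no more rows remain */
    public $end_of_iterator = false;

    public function __construct(array $rows)
    {
        $this->rows = $rows;
    }

    /** Returns at most $num rows, fewer if the rows run out */
    public function nextPages(int $num): array
    {
        $pages = array_slice($this->rows, $this->limit, $num);
        $this->limit += count($pages);
        if ($this->limit >= count($this->rows)) {
            $this->end_of_iterator = true;
        }
        return $pages;
    }
}

// Drain five rows in batches of two; the last batch is short
$iterator = new ArrayRowIterator(["a", "b", "c", "d", "e"]);
while (!$iterator->end_of_iterator) {
    $batch = $iterator->nextPages(2);
    echo count($batch) . " rows, now at row " . $iterator->limit . "\n";
}
```

A caller would typically keep requesting batches until $end_of_iterator is set, as the loop above does.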
reset()
Resets the iterator to the start of the archive bundle
public
reset() : mixed
Return values
mixed
restoreCheckpoint()
Restores the internal state from the file iterate_status.txt in the result dir such that the next call to nextPages will pick up from just after the last checkpoint. Each iterator should make a call to restoreCheckpoint at the end of the constructor method after the instance members have been initialized.
public
restoreCheckpoint() : array<string|int, mixed>
Return values
array<string|int, mixed> —the data serialized when saveCheckpoint was called
restoreCheckPoint()
Restores the internal state from the file iterate_status.txt in the result dir such that the next call to nextPages will pick up from just after the last checkpoint.
public
restoreCheckPoint() : array<string|int, mixed>
Return values
array<string|int, mixed> —the data serialized when saveCheckpoint was called
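The save and restore cycle described for the checkpoint methods can be sketched as follows. The function names saveCheckpointSketch and restoreCheckpointSketch, the use of serialize, and the "limit" key are assumptions made for illustration; only the file name iterate_status.txt and its location in the result dir come from the descriptions above.

```php
<?php
/**
 * Hypothetical sketch of checkpointing; the on-disk format is an assumption.
 * Writes the given state to iterate_status.txt inside $result_dir.
 */
function saveCheckpointSketch(string $result_dir, array $info): void
{
    file_put_contents($result_dir . "/iterate_status.txt", serialize($info));
}

/**
 * Reads back the state written by saveCheckpointSketch, or returns an
 * empty array if no checkpoint exists yet.
 */
function restoreCheckpointSketch(string $result_dir): array
{
    $file = $result_dir . "/iterate_status.txt";
    if (!file_exists($file)) {
        return []; // no checkpoint yet, start from the beginning
    }
    $info = unserialize(file_get_contents($file));
    return is_array($info) ? $info : [];
}

// Round trip in a throwaway directory: save row 40, then resume from it
$result_dir = sys_get_temp_dir() . "/checkpoint_sketch_" . uniqid();
mkdir($result_dir);
saveCheckpointSketch($result_dir, ["limit" => 40]);
$state = restoreCheckpointSketch($result_dir);
echo "resume from row " . $state["limit"] . "\n";
```

Saving after each batch and restoring in the constructor, as the method descriptions recommend, lets a new iterator instance continue without reprocessing earlier rows.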
saveCheckpoint()
Stores the current progress to the file iterate_status.txt in the result dir such that a new instance of the iterator could be constructed and return the next set of pages without having to process all of the pages that came before. Each iterator should make a call to saveCheckpoint after extracting a batch of pages.
public
saveCheckpoint([array<string|int, mixed> $info = [] ]) : mixed
Parameters
- $info : array<string|int, mixed> = []
-
any extra info a subclass wants to save
Return values
mixed
saveCheckPoint()
Used to save the result row we are at so that the iterator can start from that row the next time it is invoked.
public
saveCheckPoint([array<string|int, mixed> $info = [] ]) : mixed
Parameters
- $info : array<string|int, mixed> = []
-
any extra info a subclass wants to save
Return values
mixed
seekPage()
Advances the iterator to the $limit page, with as little additional processing as possible
public
seekPage($limit) : mixed
Parameters
Return values
mixed
weight()
Estimates the importance of the site according to the weighting of the particular archive iterator
public
weight(&$site) : bool
Parameters
Return values
bool —false we assume arc files were crawled according to OPIC and so we use the default doc_depth to estimate page importance