Yioop V9.5 Source Code Documentation

MixArchiveBundleIterator extends ArchiveBundleIterator
in package

Used to do an archive crawl based on the results of a crawl mix.

The query terms for this crawl mix will have "site:any raw 1" appended to them.
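A minimal usage sketch follows. The namespace and the timestamp values are assumptions for illustration (this page does not name the package); the constructor, $end_of_iterator, and nextPages() are used as documented below.

    <?php
    // Namespace is an assumption based on Yioop's usual source layout.
    use seekquarry\yioop\library\archive_bundle_iterators\MixArchiveBundleIterator;

    // Placeholder timestamps: the crawl mix to iterate over and the index
    // archive bundle the output results are stored in.
    $iterator = new MixArchiveBundleIterator("1600000000", "1600000100");

    // Fetch documents in batches of 100 until the crawl mix is exhausted.
    while (!$iterator->end_of_iterator) {
        $pages = $iterator->nextPages(100);
        foreach ($pages as $page) {
            // Each $page is an associative array describing one document.
        }
    }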

Tags
author

Chris Pollett

Table of Contents

$bz2_iterator  : object
Used to iterate over the contents of a bzip2-compressed file
$compression  : string
Stores the name of the compression scheme to use when iterating.
$delimiter  : string
If the archive uses a string of some kind to separate records, then $delimiter is a regular expression that matches that separator
$encoding  : string
Default character encoding used by records in the archive. For example, UTF-8
$end_of_iterator  : bool
Whether or not the iterator still has more documents
$header  : array<string|int, mixed>
Stores fields of meta information needed to build the header information for each record processed (such as base_address, ip_address, lang, etc.)
$iterate_timestamp  : int
Timestamp of the archive that is being iterated over
$limit  : int
Count of how far into the crawl mix the iteration has gone.
$mix_timestamp  : int
Used to hold timestamp of the crawl mix being used to iterate over
$result_dir  : string
The path to the directory where the iteration status is stored.
$result_timestamp  : int
Used to hold timestamp of the index archive bundle of output results
__construct()  : mixed
Creates a web archive iterator with the given parameters.
getArchiveName()  : mixed
Get the name of the file that stores information about the current archive iterator (such as whether the end of the iterator has been reached)
nextPages()  : array<string|int, mixed>
Gets the next $num docs from the iterator
reset()  : mixed
Resets the iterator to the start of the archive bundle
restoreCheckpoint()  : array<string|int, mixed>
Restores state from a previous instantiation, after the last batch of pages extracted.
saveCheckpoint()  : mixed
Saves the current state so that a new instantiation can pick up just after the last batch of pages extracted.
seekPage()  : mixed
Advances the iterator to the $limit page, with as little additional processing as possible
weight()  : bool
Estimates the importance of the site according to the weighting of the particular archive iterator

Properties

$compression

Stores the name of the compression scheme to use when iterating.

public string $compression

For example, gzip, bzip, etc.

$delimiter

If the archive uses a string of some kind to separate records, then $delimiter is a regular expression that matches that separator

public string $delimiter

$encoding

Default character encoding used by records in the archive. For example, UTF-8

public string $encoding

$end_of_iterator

Whether or not the iterator still has more documents

public bool $end_of_iterator

$header

Stores fields of meta information needed to build the header information for each record processed (such as base_address, ip_address, lang, etc.)

public array<string|int, mixed> $header
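Since $header is public, callers can read these fields once they have been populated. A small illustrative read (the key name is just one of the examples listed above, not a guaranteed schema):

    // Fall back to English if the archive did not record a language.
    $lang = $iterator->header["lang"] ?? "en";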

$iterate_timestamp

Timestamp of the archive that is being iterated over

public int $iterate_timestamp

$mix_timestamp

Used to hold timestamp of the crawl mix being used to iterate over

public int $mix_timestamp

$result_dir

The path to the directory where the iteration status is stored.

public string $result_dir

$result_timestamp

Used to hold timestamp of the index archive bundle of output results

public int $result_timestamp

Methods

__construct()

Creates a web archive iterator with the given parameters.

public __construct(string $mix_timestamp, string $result_timestamp) : mixed
Parameters
$mix_timestamp : string

timestamp of the crawl mix whose pages are to be iterated over

$result_timestamp : string

timestamp of the web archive bundle that results are being stored in

Return values
mixed

getArchiveName()

Get the name of the file that stores information about the current archive iterator (such as whether the end of the iterator has been reached)

public getArchiveName(int $timestamp) : mixed
Parameters
$timestamp : int

timestamp of the current archive crawl

Return values
mixed
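For instance, assuming $iterator is an already constructed MixArchiveBundleIterator and using a placeholder crawl timestamp, the status file for an archive crawl could be checked like this (the return type is documented as mixed, but per the description it is a filename):

    // Name of the file holding status info for this archive crawl.
    $status_file = $iterator->getArchiveName(1600000200);
    if (file_exists($status_file)) {
        // Iterator status for this crawl has been written at least once.
    }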

nextPages()

Gets the next $num docs from the iterator

public nextPages(int $num[, bool $no_process = false ]) : array<string|int, mixed>
Parameters
$num : int

number of docs to get

$no_process : bool = false

this flag is inherited from the base class but has no effect in this case

Return values
array<string|int, mixed>

associative arrays for $num pages
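A short sketch, assuming $iterator has already been constructed; the second argument is shown only to illustrate that $no_process is accepted but, as noted above, has no effect for this iterator:

    // Request the next 50 documents from the crawl mix.
    $pages = $iterator->nextPages(50, false);
    foreach ($pages as $page) {
        // $page is an associative array of fields for one document.
    }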

reset()

Resets the iterator to the start of the archive bundle

public reset() : mixed
Return values
mixed
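For example, to start iterating the crawl mix again from the beginning (assuming $iterator already exists):

    // Rewind to the start of the archive bundle, then read the first batch.
    $iterator->reset();
    $first_batch = $iterator->nextPages(10);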

restoreCheckpoint()

Restores state from a previous instantiation, after the last batch of pages extracted.

public restoreCheckpoint() : array<string|int, mixed>
Return values
array<string|int, mixed>

the data serialized when saveCheckpoint was called

saveCheckpoint()

Saves the current state so that a new instantiation can pick up just after the last batch of pages extracted.

public saveCheckpoint([array<string|int, mixed> $info = [] ]) : mixed
Parameters
$info : array<string|int, mixed> = []

data needed to restore where we are in the process of iterating through the archive. By default the LIMIT and END_OF_ITERATOR fields are saved.

Return values
mixed
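A sketch of the checkpoint pattern, pairing saveCheckpoint() with restoreCheckpoint(); the timestamps and the extra field passed to saveCheckpoint() are illustrative, and since this page does not say whether a new instance restores its state automatically in the constructor, the restore call is shown explicitly:

    // After processing a batch, persist the iteration position (LIMIT and
    // END_OF_ITERATOR are saved by default; extra fields may be included).
    $iterator->saveCheckpoint(["BATCH_NOTE" => "after batch 3"]);

    // Later, a new instantiation can pick up just after that batch.
    $resumed = new MixArchiveBundleIterator("1600000000", "1600000100");
    $info = $resumed->restoreCheckpoint();
    // $info holds the data serialized by the earlier saveCheckpoint() call.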

seekPage()

Advances the iterator to the $limit page, with as little additional processing as possible

public seekPage( $limit) : mixed
Parameters
$limit :

page to advance to

Return values
mixed
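For instance, to skip ahead in the crawl mix without processing every intermediate batch (the page offset is a placeholder and $iterator is assumed to exist):

    // Advance to page 500, then read a small batch from that point.
    $iterator->seekPage(500);
    $pages = $iterator->nextPages(10);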

weight()

Estimates the importance of the site according to the weighting of the particular archive iterator

public weight( &$site) : bool
Parameters
$site :

an associative array containing info about a web page

Return values
bool

always false; files are assumed to have been crawled roughly in order of page importance, so the default estimate of doc rank is used
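A minimal call, assuming $iterator exists; the key in the site array is purely illustrative, since this page does not specify the array's schema:

    $site = ["URL" => "https://www.example.com/"]; // illustrative site array
    $result = $iterator->weight($site); // false: default doc rank is kept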


        
