Yioop_V9.5_Source_Code_Documentation

WarcArchiveBundleIterator extends TextArchiveBundleIterator

Used to iterate through the records of a collection of WARC files stored in a WebArchiveBundle folder. WARC is the newer file format used by the Internet Archive and others for digital preservation: http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml http://archive-access.sourceforge.net/warc/ Iteration is done for the purpose of making an index of these records.

Tags
author

Chris Pollett

see
WebArchiveBundle

Table of Contents

BUFFER_SIZE  = 16384000
How many bytes at a time should be read from the current archive file into the buffer file. The value is 2000 * 8192, where 8192 = BZip2BlockIterator::BLOCK_SIZE
MAX_RECORD_SIZE  = 49152
Estimate of the maximum size of a record stored in a text archive. Data in archives is split into chunks of BUFFER_SIZE plus two record sizes. The extra bytes provide a two-record overlap between successive chunks, which ensures that records crossing the basic chunk boundary of BUFFER_SIZE will still be processed.
$buffer  : string
Used to buffer data from the currently opened file
$buffer_block_num  : int
Which block of self::BUFFER_SIZE from the current archive file is stored in the file $this->buffer_filename
$buffer_fh  : resource
If gzip is being used a buffer file is also employed to try to reduce the number of calls to gzseek. $buffer_fh is a filehandle for the buffer file
$buffer_filename  : string
Name of a buffer file to be used to reduce gzseek calls in the case where gzip compression is being used
$bz2_iterator  : object
Used to iterate over contents in a bzipped file
$compression  : string
Used to store the name of the compression scheme the iterator should use when reading archive files.
$current_offset  : int
current byte offset into the current arc file
$current_page_num  : int
current number of pages into the current arc file
$current_partition_num  : int
Counting in glob order for this arc archive bundle directory, the number of the arc file currently being processed.
$delimiter  : string
If the archive uses a string of some kind to separate records, then delimiter is a regular expression which will match the separator
$encoding  : string
Default character encoding used by records in the archive. For example, UTF-8
$end_delimiter  : string
Ending delimiters for records
$end_of_iterator  : bool
Whether or not the iterator still has more documents
$fh  : resource
File handle for current archive file
$header  : array<string|int, mixed>
Used to store fields of meta information which is needed in making header information for each record processed. (such as base_address, ip_address, lang, etc.)
$ini  : array<string|int, mixed>
Contains basic parameters of how this iterator works: compression, start and stop delimiters. Typically, this data is read from the arc_description.ini file
$iterate_dir  : string
The path to the directory containing the archive partitions to be iterated over.
$iterate_timestamp  : int
Timestamp of the archive that is being iterated over
$num_partitions  : int
The number of arc files in this arc archive bundle
$partitions  : array<string|int, mixed>
Array of filenames of arc files in this directory (glob order)
$remainder  : string
$result_dir  : string
The path to the directory where the iteration status is stored.
$result_timestamp  : int
Timestamp of the archive that is being used to store results in
$start_delimiter  : string
Starting delimiters for records
$status_filename  : string
File name to write this archive iterator status messages to
$switch_partition_callback_name  : string
Name of the function to be called whenever the partition that the iterator is reading changes. The point of the callback is to read meta information at the start of the new partition
__construct()  : mixed
Creates a WARC archive iterator with the given parameters.
checkEof()  : bool
Checks if this object's archive's current partition is at an end of file
checkFileHandle()  : bool
Checks whether there is a valid handle to this object's archive's current partition
fileClose()  : mixed
Wrapper around the fclose function of the particular compression scheme in use
fileGets()  : string
Acts as gzgets(), hiding the fact that buffering of the archive_file is being done to a buffer file
fileOpen()  : mixed
Wrapper around the fopen function of the particular compression scheme in use
fileRead()  : string
Acts as gzread($archive_file, $num_bytes), hiding the fact that buffering of the archive_file is being done to a buffer file
fileTell()  : int
Returns the current position in the current iterator partition file for the given compression scheme.
getFileBlock()  : mixed
Reads and returns the block of data from the current partition
getNextTagData()  : string
Used to extract data between two tags. After operation $this->buffer has contents after the close tag.
getNextTagsData()  : array<string|int, mixed>
Used to extract data between two tags for the first tag found amongst the array of tags $tags. After operation $this->buffer has contents after the close tag.
getRecordStart()  : mixed
Used to advance the file pointer to the start of a WARC record
getWarcHeaders()  : array<string|int, mixed>
Used to parse the header portion of a WARC record
makeBuffer()  : mixed
Reads in block $this->buffer_block_num of size self::BUFFER_SIZE from the archive file
nextChunk()  : array<string|int, mixed>
Called to get the next chunk of BUFFER_SIZE + 2 * MAX_RECORD_SIZE bytes of data from the text archive. This data is returned unprocessed in self::ARC_DATA together with ini and header information about the archive. This method is typically called in the name server setting from FetchController.
nextPage()  : array<string|int, mixed>
Gets the next doc from the iterator
nextPages()  : array<string|int, mixed>
Gets the next $num many docs from the iterator
reset()  : mixed
Resets the iterator to the start of the archive bundle
restoreCheckPoint()  : array<string|int, mixed>
Restores the internal state from the file iterate_status.txt in the result dir such that the next call to nextPages will pick up from just after the last checkpoint. Text archive bundle iterator takes the unserialized data from the last check point and calls the compression specific restore checkpoint to further set up the iterator according to the given compression scheme.
restoreCheckpoint()  : array<string|int, mixed>
Restores the internal state from the file iterate_status.txt in the result dir such that the next call to nextPages will pick up from just after the last checkpoint. Each iterator should make a call to restoreCheckpoint at the end of the constructor method after the instance members have been initialized.
saveCheckPoint()  : mixed
Stores the current progress to the file iterate_status.txt in the result dir such that a new instance of the iterator could be constructed and return the next set of pages without having to process all of the pages that came before. Each iterator should make a call to saveCheckpoint after extracting a batch of pages.
saveCheckpoint()  : mixed
Stores the current progress to the file iterate_status.txt in the result dir such that a new instance of the iterator could be constructed and return the next set of pages without having to process all of the pages that came before. Each iterator should make a call to saveCheckpoint after extracting a batch of pages.
seekPage()  : mixed
Advances the iterator to the $limit page, with as little additional processing as possible
setIniInfo()  : mixed
Mutator method for controlling how this text archive iterator behaves. Normally, data on compression and the start and stop delimiters is read from an ini file. This method reads it from the supplied array instead.
updateBuffer()  : bool
If reading from a gzbuffer file goes off the end of the current buffer, reads in the next block from archive file.
updatePartition()  : mixed
Helper function for nextChunk to advance the partition if we are at the end of the current archive file
weight()  : mixed
Estimates the importance of the site according to the weighting of the particular archive iterator

Constants

BUFFER_SIZE

How many bytes at a time should be read from the current archive file into the buffer file. The value is 2000 * 8192, where 8192 = BZip2BlockIterator::BLOCK_SIZE

public mixed BUFFER_SIZE = 16384000

MAX_RECORD_SIZE

Estimate of the maximum size of a record stored in a text archive. Data in archives is split into chunks of BUFFER_SIZE plus two record sizes. The extra bytes provide a two-record overlap between successive chunks, which ensures that records crossing the basic chunk boundary of BUFFER_SIZE will still be processed.

public mixed MAX_RECORD_SIZE = 49152
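
The chunk arithmetic described above can be checked with a short sketch. The constant values are the ones documented here; the helper chunkRange() is hypothetical, and the assumption that chunk n starts at n * BUFFER_SIZE is an illustration rather than a statement about the class's internals.

```php
<?php
// Values copied from the constants documented for this class.
const BUFFER_SIZE = 16384000;
const MAX_RECORD_SIZE = 49152;

/**
 * Hypothetical helper: byte range [start, end) covered by chunk $n,
 * assuming chunks start every BUFFER_SIZE bytes and each one extends
 * two record sizes past its nominal end.
 */
function chunkRange(int $n): array
{
    $start = $n * BUFFER_SIZE;
    $end = $start + BUFFER_SIZE + 2 * MAX_RECORD_SIZE;
    return [$start, $end];
}

[$start0, $end0] = chunkRange(0);
[$start1, $end1] = chunkRange(1);
$overlap = $end0 - $start1; // bytes shared by chunks 0 and 1

echo "chunk length: ", $end0 - $start0, "\n"; // 16482304
echo "overlap: ", $overlap, "\n";             // 98304 = 2 * MAX_RECORD_SIZE
```

Under these assumptions, a record that begins just before a BUFFER_SIZE boundary and is no longer than MAX_RECORD_SIZE always fits entirely inside the chunk in which it starts.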

Properties

$buffer_block_num

Which block of self::BUFFER_SIZE from the current archive file is stored in the file $this->buffer_filename

public int $buffer_block_num
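
As a rough illustration of block numbering, the sketch below splits an absolute byte offset into a block number and a position within that block. The helper locateInBuffer() is hypothetical, and it assumes blocks are numbered from zero in units of BUFFER_SIZE.

```php
<?php
const BUFFER_SIZE = 16384000; // value documented for this class

/**
 * Hypothetical helper: map an absolute offset in the archive file to
 * [block number, offset inside that block].
 */
function locateInBuffer(int $offset): array
{
    return [intdiv($offset, BUFFER_SIZE), $offset % BUFFER_SIZE];
}

[$block, $within] = locateInBuffer(40000000);
echo "block $block, offset $within\n"; // block 2, offset 7232000
```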

$buffer_fh

If gzip is being used a buffer file is also employed to try to reduce the number of calls to gzseek. $buffer_fh is a filehandle for the buffer file

public resource $buffer_fh

$buffer_filename

Name of a buffer file to be used to reduce gzseek calls in the case where gzip compression is being used

public string $buffer_filename

$compression

Used to store the name of the compression scheme the iterator should use when reading archive files.

public string $compression

For example, gzip, bzip, etc.

$current_partition_num

Counting in glob order for this arc archive bundle directory, the number of the arc file currently being processed.

public int $current_partition_num

$delimiter

If the archive uses a string of some kind to separate records, then delimiter is a regular expression which will match the separator

public string $delimiter

$encoding

Default character encoding used by records in the archive. For example, UTF-8

public string $encoding

$end_of_iterator

Whether or not the iterator still has more documents

public bool $end_of_iterator

$header

Used to store fields of meta information which is needed in making header information for each record processed. (such as base_address, ip_address, lang, etc.)

public array<string|int, mixed> $header

$ini

Contains basic parameters of how this iterator works: compression, start and stop delimiters. Typically, this data is read from the arc_description.ini file

public array<string|int, mixed> $ini
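
A minimal sketch of how such settings might be loaded into an array follows. The ini key names, the sample values, and the fallback defaults are all assumptions chosen to mirror the properties documented on this page; parse_ini_string() and array_merge() are standard PHP functions.

```php
<?php
// Hypothetical contents of an arc_description.ini file.
$ini_text = <<<'INI'
compression = "gzip"
start_delimiter = "WARC/1.0"
end_delimiter = ""
encoding = "UTF-8"
INI;

// Assumed fallback values, used when a key is missing from the file.
$defaults = [
    'compression' => 'plain',
    'start_delimiter' => '',
    'end_delimiter' => '',
    'encoding' => 'UTF-8',
];

// Settings read from the ini text override the defaults.
$ini = array_merge($defaults, parse_ini_string($ini_text));

echo $ini['compression'], " / ", $ini['encoding'], "\n";
```

An array of this shape is the kind of value that could be handed to setIniInfo() when no ini file is available.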

$iterate_dir

The path to the directory containing the archive partitions to be iterated over.

public string $iterate_dir

$iterate_timestamp

Timestamp of the archive that is being iterated over

public int $iterate_timestamp

$partitions

Array of filenames of arc files in this directory (glob order)

public array<string|int, mixed> $partitions

$result_dir

The path to the directory where the iteration status is stored.

public string $result_dir

$result_timestamp

Timestamp of the archive that is being used to store results in

public int $result_timestamp

$status_filename

File name to write this archive iterator status messages to

public string $status_filename

$switch_partition_callback_name

Name of the function to be called whenever the partition that the iterator is reading changes. The point of the callback is to read meta information at the start of the new partition

public string $switch_partition_callback_name = null

Methods

__construct()

Creates a WARC archive iterator with the given parameters.

public __construct(string $iterate_timestamp, string $iterate_dir, string $result_timestamp, string $result_dir) : mixed
Parameters
$iterate_timestamp : string

timestamp of the arc archive bundle to iterate over the pages of

$iterate_dir : string

folder of files to iterate over

$result_timestamp : string

timestamp of the arc archive bundle results are being stored in

$result_dir : string

where to write last position checkpoints to

Return values
mixed

checkEof()

Checks if this object's archive's current partition is at an end of file

public checkEof() : bool
Return values
bool

whether end of file has been reached (true if it has)

checkFileHandle()

Checks whether there is a valid handle to this object's archive's current partition

public checkFileHandle() : bool
Return values
bool

whether it has a valid handle or not (true if it has)

fileClose()

Wrapper around the fclose function of the particular compression scheme in use

public fileClose() : mixed
Return values
mixed

fileGets()

Acts as gzgets(), hiding the fact that buffering of the archive_file is being done to a buffer file

public fileGets() : string
Return values
string

from archive file up to next line ending or eof

fileOpen()

Wrapper around the fopen function of the particular compression scheme in use

public fileOpen(string $filename[, bool $make_buffer_if_needed = true ]) : mixed
Parameters
$filename : string

name of file to open

$make_buffer_if_needed : bool = true
Return values
mixed
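
The idea of wrapping a compression-specific open call can be sketched as below. The function openArchive() and the scheme names 'gzip', 'bzip2', and 'plain' are assumptions for illustration; gzopen(), bzopen(), and fopen() are standard PHP functions, with the first two requiring the zlib and bz2 extensions respectively.

```php
<?php
/**
 * Hypothetical dispatcher: open $filename for reading with the handle
 * type that matches the named compression scheme.
 */
function openArchive(string $filename, string $compression)
{
    switch ($compression) {
        case 'gzip':
            return gzopen($filename, 'rb'); // needs the zlib extension
        case 'bzip2':
            return bzopen($filename, 'r');  // needs the bz2 extension
        default:
            return fopen($filename, 'rb');  // uncompressed file
    }
}

// Demonstration with an uncompressed temporary file.
$path = tempnam(sys_get_temp_dir(), 'warc');
file_put_contents($path, "WARC/1.0\n");
$fh = openArchive($path, 'plain');
$first_line = fgets($fh);
fclose($fh);
unlink($path);

echo $first_line;
```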

fileRead()

Acts as gzread($archive_file, $num_bytes), hiding the fact that buffering of the archive_file is being done to a buffer file

public fileRead(int $num_bytes) : string
Parameters
$num_bytes : int

to read from archive file

Return values
string

of length up to $num_bytes (less if eof occurs)

fileTell()

Returns the current position in the current iterator partition file for the given compression scheme.

public fileTell() : int
Return values
int

a position into the currently being processed file of the iterator

getFileBlock()

Reads and returns the block of data from the current partition

public getFileBlock() : mixed
Return values
mixed

an uncompressed string from the current partition, or null if the iterator is not set up, or false if EOF has been reached.

getNextTagData()

Used to extract data between two tags. After operation $this->buffer has contents after the close tag.

public getNextTagData(string $tag) : string
Parameters
$tag : string

tag name to look for

Return values
string

the data consisting of the start tag, contents, and close tag for the tag named $tag
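
The behavior described for this method can be approximated on an in-memory string. The function extractTag() below is a hypothetical stand-in, not the class's implementation: it does not read further data from the archive when the close tag is missing, and returning false in that case is an assumption.

```php
<?php
/**
 * Hypothetical stand-in for tag extraction: returns the first
 * <$tag ...>...</$tag> span found in $buffer and trims $buffer so that
 * it holds only what follows the close tag. Returns false if no
 * complete element is present. Note that the simple prefix search
 * would also match a longer tag name that begins with $tag.
 */
function extractTag(string &$buffer, string $tag)
{
    $start = strpos($buffer, "<$tag");
    if ($start === false) {
        return false;
    }
    $close = "</$tag>";
    $end = strpos($buffer, $close, $start);
    if ($end === false) {
        return false;
    }
    $end += strlen($close);
    $data = substr($buffer, $start, $end - $start);
    $buffer = substr($buffer, $end); // keep only text after the close tag
    return $data;
}

$buffer = 'junk<page id="1">hello</page><page>next</page>';
$first = extractTag($buffer, 'page');

echo $first, "\n";  // <page id="1">hello</page>
echo $buffer, "\n"; // <page>next</page>
```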

getNextTagsData()

Used to extract data between two tags for the first tag found amongst the array of tags $tags. After operation $this->buffer has contents after the close tag.

public getNextTagsData(array<string|int, mixed> $tags) : array<string|int, mixed>
Parameters
$tags : array<string|int, mixed>

array of tagnames to look for

Return values
array<string|int, mixed>

of two elements: the first element is a string consisting of start tag contents close tag of first tag found, the second has the name of the tag amongst $tags found

getRecordStart()

Used to advance the file pointer to the start of a WARC record

public getRecordStart() : mixed
Return values
mixed

getWarcHeaders()

Used to parse the header portion of a WARC record

public getWarcHeaders() : array<string|int, mixed>
Return values
array<string|int, mixed>

fields of the WARC record mapped to their Yioop equivalents. Also returns 'line', the last line read, and 'warc-type', the kind of record.
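
To make the header layout concrete, here is a minimal sketch of parsing the header block of a single WARC record held in a string. The function parseWarcHeaderBlock() is hypothetical and does not map fields to Yioop equivalents; it only assumes the usual layout of a version line followed by "Name: value" lines and a blank line.

```php
<?php
/**
 * Hypothetical parser for the header block of one WARC record.
 * Reads lines until the first blank line and returns the fields keyed
 * by lower-cased field name; the version line is stored as 'version'.
 */
function parseWarcHeaderBlock(string $record): array
{
    $headers = [];
    foreach (preg_split('/\r?\n/', $record) as $line) {
        if (trim($line) === '') {
            break; // a blank line ends the header block
        }
        if (strpos($line, ':') === false) {
            $headers['version'] = trim($line); // e.g. "WARC/1.0"
            continue;
        }
        [$name, $value] = explode(':', $line, 2);
        $headers[strtolower(trim($name))] = trim($value);
    }
    return $headers;
}

$record = "WARC/1.0\r\n" .
    "WARC-Type: response\r\n" .
    "WARC-Target-URI: http://example.com/\r\n" .
    "Content-Length: 5\r\n" .
    "\r\n" .
    "hello";
$headers = parseWarcHeaderBlock($record);

echo $headers['warc-type'], " ", $headers['warc-target-uri'], "\n";
```

Splitting on the first colon only keeps URL values such as the target URI intact.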

makeBuffer()

Reads in block $this->buffer_block_num of size self::BUFFER_SIZE from the archive file

public makeBuffer([string $buffer = "" ][, bool $return_string = false ]) : mixed
Parameters
$buffer : string = ""
$return_string : bool = false
Return values
mixed

whether successfully read in block or not

nextChunk()

Called to get the next chunk of BUFFER_SIZE + 2 * MAX_RECORD_SIZE bytes of data from the text archive. This data is returned unprocessed in self::ARC_DATA together with ini and header information about the archive. This method is typically called in the name server setting from FetchController.

public nextChunk() : array<string|int, mixed>
Return values
array<string|int, mixed>

with contents as described above

nextPage()

Gets the next doc from the iterator

public nextPage([bool $no_process = false ]) : array<string|int, mixed>
Parameters
$no_process : bool = false

do not do any processing on page data

Return values
array<string|int, mixed>

associative array for the doc, or a string if $no_process is true

nextPages()

Gets the next $num many docs from the iterator

public abstract nextPages(int $num[, bool $no_process = false ]) : array<string|int, mixed>
Parameters
$num : int

number of docs to get

$no_process : bool = false

do not do any processing on page data

Return values
array<string|int, mixed>

associative arrays for $num pages

reset()

Resets the iterator to the start of the archive bundle

public abstract reset() : mixed
Return values
mixed

restoreCheckPoint()

Restores the internal state from the file iterate_status.txt in the result dir such that the next call to nextPages will pick up from just after the last checkpoint. Text archive bundle iterator takes the unserialized data from the last check point and calls the compression specific restore checkpoint to further set up the iterator according to the given compression scheme.

public restoreCheckPoint() : array<string|int, mixed>
Return values
array<string|int, mixed>

the data serialized when saveCheckpoint was called

restoreCheckpoint()

Restores the internal state from the file iterate_status.txt in the result dir such that the next call to nextPages will pick up from just after the last checkpoint. Each iterator should make a call to restoreCheckpoint at the end of the constructor method after the instance members have been initialized.

public restoreCheckpoint() : array<string|int, mixed>
Return values
array<string|int, mixed>

the data serialized when saveCheckpoint was called

saveCheckPoint()

Stores the current progress to the file iterate_status.txt in the result dir such that a new instance of the iterator could be constructed and return the next set of pages without having to process all of the pages that came before. Each iterator should make a call to saveCheckpoint after extracting a batch of pages.

public saveCheckPoint([array<string|int, mixed> $info = [] ]) : mixed
Parameters
$info : array<string|int, mixed> = []

any extra info a subclass wants to save

Return values
mixed

saveCheckpoint()

Stores the current progress to the file iterate_status.txt in the result dir such that a new instance of the iterator could be constructed and return the next set of pages without having to process all of the pages that came before. Each iterator should make a call to saveCheckpoint after extracting a batch of pages.

public saveCheckpoint([array<string|int, mixed> $info = [] ]) : mixed
Parameters
$info : array<string|int, mixed> = []

any extra info a subclass wants to save

Return values
mixed
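
The save and restore cycle described above can be sketched as a serialize/unserialize round trip. The helpers saveStatus() and restoreStatus() and the particular fields stored are hypothetical; only the file name iterate_status.txt and the use of serialized data come from the descriptions on this page.

```php
<?php
/** Hypothetical helper: write the iterator state to the result dir. */
function saveStatus(string $result_dir, array $info): void
{
    file_put_contents("$result_dir/iterate_status.txt", serialize($info));
}

/** Hypothetical helper: read the state back, or [] if none was saved. */
function restoreStatus(string $result_dir): array
{
    $file = "$result_dir/iterate_status.txt";
    if (!file_exists($file)) {
        return []; // no checkpoint yet: start from the beginning
    }
    $info = unserialize(file_get_contents($file));
    return is_array($info) ? $info : [];
}

$dir = sys_get_temp_dir() . "/warc_checkpoint_demo";
if (!is_dir($dir)) {
    mkdir($dir);
}
saveStatus($dir, [
    'current_partition_num' => 2,
    'current_offset' => 4096,
    'current_page_num' => 17,
    'end_of_iterator' => false,
]);
$restored = restoreStatus($dir);

echo "resume at partition ", $restored['current_partition_num'],
    ", offset ", $restored['current_offset'], "\n";
```

A new iterator instance that reads this state in its constructor can continue from the saved partition and offset instead of reprocessing earlier pages.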

seekPage()

Advances the iterator to the $limit page, with as little additional processing as possible

public seekPage( $limit) : mixed
Parameters
$limit :

page to advance to

Return values
mixed

setIniInfo()

Mutator method for controlling how this text archive iterator behaves. Normally, data on compression and the start and stop delimiters is read from an ini file. This method reads it from the supplied array instead.

public setIniInfo(array<string|int, mixed> $ini) : mixed
Parameters
$ini : array<string|int, mixed>

configuration settings for this archive iterator

Return values
mixed

updateBuffer()

If reading from a gzbuffer file goes off the end of the current buffer, reads in the next block from archive file.

public updateBuffer([string $buffer = "" ][, bool $return_string = false ]) : bool
Parameters
$buffer : string = ""
$return_string : bool = false
Return values
bool

whether successfully read in next block or not

updatePartition()

Helper function for nextChunk to advance the partition if we are at the end of the current archive file

public updatePartition(array<string|int, mixed> &$info) : mixed
Parameters
$info : array<string|int, mixed>

a struct with data about the current chunk. This method will update its start partition flag

Return values
mixed

weight()

Estimates the importance of the site according to the weighting of the particular archive iterator

public abstract weight( &$site) : mixed
Parameters
$site :

an associative array containing info about a web page

Return values
mixed

a 4-bit number, or false if the iterator doesn't use the default ranking method


        
