Yioop_V9.5_Source_Code

WarcArchiveBundleIterator extends TextArchiveBundleIterator
in package

Application

Used to iterate through the records of a collection of WARC files stored in a WebArchiveBundle folder. WARC is the newer file format used by the Internet Archive and others for digital preservation: http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml http://archive-access.sourceforge.net/warc/ Iteration is done for the purpose of making an index of these records.

BUFFER_SIZE

How many bytes at a time should be read from the current archive file into the buffer file. The value 16384000 equals 2000 * 8192, where 8192 = BZip2BlockIterator::BLOCK_SIZE.


    public
        mixed
    BUFFER_SIZE
    = 16384000

MAX_RECORD_SIZE

Estimate of the maximum size of a record stored in a text archive. Data in archives is split into chunks of BUFFER_SIZE plus two record sizes. This provides a two-record overlap between successive chunks, which in turn ensures that records that cross the basic chunk boundary of BUFFER_SIZE will still be processed.


    public
        mixed
    MAX_RECORD_SIZE
    = 49152
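As a rough illustration of the sizing described above, the following sketch computes the chunk length implied by the two constants. The constant values are copied from this page; the helper name chunkSize is hypothetical and not part of Yioop.

```php
<?php
// Values copied from the constants documented on this page.
const BUFFER_SIZE = 16384000;
const MAX_RECORD_SIZE = 49152;

/**
 * Hypothetical helper (not part of Yioop): size of one chunk when a
 * two-record overlap is added to the basic buffer size.
 */
function chunkSize(int $buffer_size, int $max_record_size): int
{
    return $buffer_size + 2 * $max_record_size;
}

$chunk_size = chunkSize(BUFFER_SIZE, MAX_RECORD_SIZE);
echo $chunk_size, "\n"; // 16384000 + 2 * 49152
```

With the documented values this comes to 16482304 bytes per chunk.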

$buffer

Used to buffer data from the currently opened file


    public
        string
    $buffer

$buffer_block_num

Which block of size self::BUFFER_SIZE from the current archive file is stored in the file $this->buffer_filename


    public
        int
    $buffer_block_num
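One plausible reading of this property is that the archive file is treated as consecutive blocks of self::BUFFER_SIZE bytes, numbered from zero. Under that assumption, the sketch below maps a byte offset to a block number and a position inside the block. The helper locateInBlocks is hypothetical and is not taken from Yioop.

```php
<?php
const BUFFER_SIZE = 16384000; // value documented on this page

/**
 * Hypothetical helper (an assumption, not Yioop code): returns the
 * zero-based block number that holds byte $offset, together with the
 * position of that byte inside the block.
 */
function locateInBlocks(int $offset, int $block_size = BUFFER_SIZE): array
{
    return [intdiv($offset, $block_size), $offset % $block_size];
}

[$block_num, $within] = locateInBlocks(40000000);
echo "block $block_num, offset $within\n";
```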

$buffer_fh

If gzip is being used, a buffer file is also employed to try to reduce the number of calls to gzseek. $buffer_fh is a file handle for the buffer file.


    public
        resource
    $buffer_fh

$buffer_filename

Name of a buffer file to be used to reduce gzseek calls in the case where gzip compression is being used


    public
        string
    $buffer_filename

$bz2_iterator

Used to iterate over the contents of a bzipped file


    public
        object
    $bz2_iterator

$compression

Name of the compression scheme that the iterator should use. For example, gzip, bzip, etc.


    public
        string
    $compression


$current_offset

Current byte offset into the current arc file


    public
        int
    $current_offset

$current_page_num

Current number of pages into the current arc file


    public
        int
    $current_page_num

$current_partition_num

Counting in glob order for this arc archive bundle directory, the file number of the arc file currently being processed.


    public
        int
    $current_partition_num

$delimiter

If the archive uses a string of some kind to separate records, then $delimiter is a regular expression that matches the separator


    public
        string
    $delimiter

$encoding

Default character encoding used by records in the archive. For example, UTF-8


    public
        string
    $encoding

$end_delimiter

Ending delimiters for records


    public
        string
    $end_delimiter

$end_of_iterator

Whether or not the iterator still has more documents


    public
        bool
    $end_of_iterator

$fh

File handle for current archive file


    public
        resource
    $fh

$header

Used to store the fields of meta information that are needed to make header information for each record processed (such as base_address, ip_address, lang, etc.)


    public
        array<string|int, mixed>
    $header

$ini

Contains the basic parameters of how this iterator works: compression, start and stop delimiter. Typically, this data is read from the arc_description.ini file


    public
        array<string|int, mixed>
    $ini

$iterate_dir

The path to the directory containing the archive partitions to be iterated over.


    public
        string
    $iterate_dir

$iterate_timestamp

Timestamp of the archive that is being iterated over


    public
        int
    $iterate_timestamp

$num_partitions

The number of arc files in this arc archive bundle


    public
        int
    $num_partitions

$partitions

Array of filenames of arc files in this directory (glob order)


    public
        array<string|int, mixed>
    $partitions

$remainder


    public
        string
    $remainder

$result_dir

The path to the directory where the iteration status is stored.


    public
        string
    $result_dir

$result_timestamp

Timestamp of the archive that is being used to store results in


    public
        int
    $result_timestamp

$start_delimiter

Starting delimiters for records


    public
        string
    $start_delimiter

$status_filename

File name to write this archive iterator status messages to


    public
        string
    $status_filename

$switch_partition_callback_name

Name of the function to be called whenever the partition that the iterator is reading changes. The point of the callback is to read meta information at the start of the new partition


    public
        string
    $switch_partition_callback_name
     = null

__construct()

Creates a WARC archive iterator with the given parameters.


    public
                    __construct(string $iterate_timestamp, string $iterate_dir, string $result_timestamp, string $result_dir) : mixed

Parameters

$iterate_timestamp : string: timestamp of the arc archive bundle to iterate over the pages of
$iterate_dir : string: folder of files to iterate over
$result_timestamp : string: timestamp of the arc archive bundle results are being stored in
$result_dir : string: where to write last position checkpoints to

Return values

mixed —

checkEof()

Checks if this object's archive's current partition is at an end of file


    public
                    checkEof() : bool

Return values

bool —

whether end of file has been reached (true if it has)

checkFileHandle()

Checks whether this object has a valid handle to its archive's current partition


    public
                    checkFileHandle() : bool

Return values

bool —

whether it has or not (true if it has)

fileClose()

Wrapper around the fclose function of the particular compression scheme


    public
                    fileClose() : mixed

Return values

mixed —

fileGets()

Acts as gzgets(), hiding the fact that buffering of the archive_file is being done to a buffer file


    public
                    fileGets() : string

Return values

string —

from archive file up to next line ending or eof

fileOpen()

Wrapper around the fopen function of the particular compression scheme


    public
                    fileOpen(string $filename[, bool $make_buffer_if_needed = true ]) : mixed

Parameters

$filename : string: name of file to open
$make_buffer_if_needed : bool = true

Return values

mixed —

fileRead()

Acts as gzread($archive_file, $num_bytes), hiding the fact that buffering of the archive_file is being done to a buffer file


    public
                    fileRead(int $num_bytes) : string

Parameters

$num_bytes : int: to read from archive file

Return values

string —

of length up to $num_bytes (less if eof occurs)

fileTell()

Returns the current position in the current iterator partition file for the given compression scheme.


    public
                    fileTell() : int

Return values

int —

a position into the currently being processed file of the iterator

getFileBlock()

Reads and returns the block of data from the current partition


    public
                    getFileBlock() : mixed

Return values

mixed —

an uncompressed string from the current partition, null if the iterator is not set up, or false if EOF was reached.

getNextTagData()

Used to extract data between two tags. After operation $this->buffer has contents after the close tag.


    public
                    getNextTagData(string $tag) : string

Parameters

$tag : string: tag name to look for

Return values

string —

data consisting of the start tag, contents, and close tag for the tag named $tag
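The behaviour described for getNextTagData() can be sketched on a plain string. This is a minimal stand-alone illustration, not Yioop's implementation: the function extractTagData and the by-reference $buffer argument are assumptions made so the example runs without the iterator or an archive file.

```php
<?php
/**
 * Hypothetical sketch (not Yioop's code): returns the first
 * <$tag ...>...</$tag> span found in $buffer and trims $buffer so that
 * it holds only what follows the close tag. Returns false when no
 * complete span is present. Matching is by simple prefix, so "<page"
 * would also match a tag such as "<pages".
 */
function extractTagData(string &$buffer, string $tag)
{
    $start = strpos($buffer, "<$tag");
    if ($start === false) {
        return false;
    }
    $close = "</$tag>";
    $end = strpos($buffer, $close, $start);
    if ($end === false) {
        return false;
    }
    $end += strlen($close);
    $data = substr($buffer, $start, $end - $start);
    $buffer = (string) substr($buffer, $end);
    return $data;
}

$buffer = "junk<page id=\"1\">hello</page><page>next</page>";
$data = extractTagData($buffer, "page");
echo $data, "\n", $buffer, "\n";
```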

getNextTagsData()

Used to extract data between two tags for the first tag found amongst the array of tags $tags. After operation $this->buffer has contents after the close tag.


    public
                    getNextTagsData(array<string|int, mixed> $tags) : array<string|int, mixed>

Parameters

$tags : array<string|int, mixed>: array of tagnames to look for

Return values

array<string|int, mixed> —

of two elements: the first element is a string consisting of the start tag, contents, and close tag of the first tag found; the second is the name of the tag from $tags that was found
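A comparable sketch for the multi-tag case follows. It assumes that "first tag found" means the tag whose start tag occurs earliest in the buffer; that interpretation, the function name extractFirstOfTags, and the by-reference $buffer argument are all assumptions rather than details confirmed by this page.

```php
<?php
/**
 * Hypothetical sketch (not Yioop's code): among $tags, picks the tag
 * whose start tag appears earliest in $buffer, returns
 * [span, tag name], and leaves $buffer holding the text after the
 * close tag. Returns false if no complete span is found.
 */
function extractFirstOfTags(string &$buffer, array $tags)
{
    $best_tag = null;
    $best_pos = false;
    foreach ($tags as $tag) {
        $pos = strpos($buffer, "<$tag");
        if ($pos !== false && ($best_pos === false || $pos < $best_pos)) {
            $best_pos = $pos;
            $best_tag = $tag;
        }
    }
    if ($best_tag === null) {
        return false;
    }
    $close = "</$best_tag>";
    $end = strpos($buffer, $close, $best_pos);
    if ($end === false) {
        return false;
    }
    $end += strlen($close);
    $data = substr($buffer, $best_pos, $end - $best_pos);
    $buffer = (string) substr($buffer, $end);
    return [$data, $best_tag];
}

$buffer = "x<b>bold</b><a>link</a>";
[$data, $tag] = extractFirstOfTags($buffer, ["a", "b"]);
echo $tag, ": ", $data, "\n";
```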

getRecordStart()

Used to advance the file pointer to the start of a WARC record


    public
                    getRecordStart() : mixed

Return values

mixed —

getWarcHeaders()

Used to parse the header portion of a WARC record


    public
                    getWarcHeaders() : array<string|int, mixed>

Return values

array<string|int, mixed> —

fields of the WARC record mapped to their Yioop equivalents. Also returns 'line', the last line, and 'warc-type', the kind of record.
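To make the header-parsing step concrete, here is a minimal sketch that parses the header block of a single WARC record from a string. It relies only on the general WARC layout (a version line, then "Name: value" fields, ended by a blank line). The function parseWarcHeaderBlock is hypothetical, and the mapping of fields to Yioop equivalents that getWarcHeaders() performs is not reproduced here.

```php
<?php
/**
 * Hypothetical sketch (not Yioop's getWarcHeaders()): parses the header
 * block of one WARC record into an array keyed by lowercased field
 * name. The version line is stored under the key 'version'.
 */
function parseWarcHeaderBlock(string $block): array
{
    $headers = [];
    foreach (preg_split('/\r?\n/', $block) as $line) {
        if ($line === '') {
            break; // a blank line ends the header block
        }
        if (strpos($line, 'WARC/') === 0) {
            $headers['version'] = trim($line);
            continue;
        }
        // Split on the first colon only, since values may contain colons
        $parts = explode(':', $line, 2);
        if (count($parts) == 2) {
            $headers[strtolower(trim($parts[0]))] = trim($parts[1]);
        }
    }
    return $headers;
}

$block = "WARC/1.0\r\nWARC-Type: response\r\n" .
    "WARC-Target-URI: http://www.example.com/\r\n" .
    "Content-Length: 1024\r\n\r\n";
$headers = parseWarcHeaderBlock($block);
print_r($headers);
```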

makeBuffer()

Reads in block $this->buffer_block_num of size self::BUFFER_SIZE from the archive file


    public
                    makeBuffer([string $buffer = "" ][, bool $return_string = false ]) : mixed

Parameters

$buffer : string = ""
$return_string : bool = false

Return values

mixed —

whether successfully read in block or not

nextChunk()

Called to get the next chunk of BUFFER_SIZE + 2 * MAX_RECORD_SIZE bytes of data from the text archive. This data is returned unprocessed in self::ARC_DATA together with ini and header information about the archive. This method is typically called in the name server setting from FetchController.


    public
                    nextChunk() : array<string|int, mixed>

Return values

array<string|int, mixed> —

with contents as described above
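The overlap between successive chunks can be illustrated with toy sizes. The sketch below assumes that each chunk starts BUFFER_SIZE bytes after the previous one and carries up to 2 * MAX_RECORD_SIZE extra bytes; this is a reading of the description, not a copy of nextChunk(), and the function overlappingChunks is hypothetical.

```php
<?php
/**
 * Hypothetical sketch (not Yioop's nextChunk()): splits $data into
 * chunks whose start positions advance by $buffer_size bytes, where
 * each chunk also includes up to 2 * $max_record_size bytes of overlap
 * with the chunk that follows it.
 */
function overlappingChunks(
    string $data,
    int $buffer_size,
    int $max_record_size
): array {
    $chunks = [];
    $length = strlen($data);
    for ($offset = 0; $offset < $length; $offset += $buffer_size) {
        $chunks[] = substr(
            $data,
            $offset,
            $buffer_size + 2 * $max_record_size
        );
    }
    return $chunks;
}

// Toy sizes: a 4-byte buffer and 1-byte records give 6-byte chunks
$chunks = overlappingChunks("abcdefghijkl", 4, 1);
print_r($chunks);
```

In this toy run a record that straddles the boundary after byte 4 (for example "de") still appears whole in the first chunk.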

nextPage()

Gets the next doc from the iterator


    public
                    nextPage([bool $no_process = false ]) : array<string|int, mixed>

Parameters

$no_process : bool = false: do not do any processing on page data

Return values

array<string|int, mixed> —

associative array for the doc, or a string if $no_process is true

nextPages()

Gets the next $num many docs from the iterator


    public
    abstract                nextPages(int $num[, bool $no_process = false ]) : array<string|int, mixed>

Parameters

$num : int: number of docs to get
$no_process : bool = false: do not do any processing on page data

Return values

array<string|int, mixed> —

associative arrays for $num pages

reset()

Resets the iterator to the start of the archive bundle


    public
    abstract                reset() : mixed

Return values

mixed —

restoreCheckPoint()

Restores the internal state from the file iterate_status.txt in the result dir such that the next call to nextPages will pick up from just after the last checkpoint. Text archive bundle iterator takes the unserialized data from the last check point and calls the compression specific restore checkpoint to further set up the iterator according to the given compression scheme.


    public
                    restoreCheckPoint() : array<string|int, mixed>

Return values

array<string|int, mixed> —

the data serialized when saveCheckpoint was called

restoreCheckpoint()

Restores the internal state from the file iterate_status.txt in the result dir such that the next call to nextPages will pick up from just after the last checkpoint. Each iterator should make a call to restoreCheckpoint at the end of the constructor method after the instance members have been initialized.


    public
                    restoreCheckpoint() : array<string|int, mixed>

Return values

array<string|int, mixed> —

the data serialized when saveCheckpoint was called

saveCheckPoint()

Stores the current progress to the file iterate_status.txt in the result dir such that a new instance of the iterator could be constructed and return the next set of pages without having to process all of the pages that came before. Each iterator should make a call to saveCheckpoint after extracting a batch of pages.


    public
                    saveCheckPoint([array<string|int, mixed> $info = [] ]) : mixed

Parameters

$info : array<string|int, mixed> = []: any extra info a subclass wants to save

Return values

mixed —

saveCheckpoint()


    public
                    saveCheckpoint([array<string|int, mixed> $info = [] ]) : mixed

Parameters

$info : array<string|int, mixed> = []: any extra info a subclass wants to save

Return values

mixed —
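The checkpoint round trip described in the entries above can be sketched with PHP's serialize() and unserialize(). The file name iterate_status.txt comes from this page. The functions saveProgress and restoreProgress, the keys stored, and the temporary directory are hypothetical and only stand in for the iterator's own state.

```php
<?php
/**
 * Hypothetical sketch (not Yioop's code): writes progress information
 * to iterate_status.txt in the given result directory.
 */
function saveProgress(string $result_dir, array $info): void
{
    file_put_contents(
        $result_dir . "/iterate_status.txt",
        serialize($info)
    );
}

/**
 * Hypothetical sketch: reads back what saveProgress() stored, or an
 * empty array if no checkpoint exists yet.
 */
function restoreProgress(string $result_dir): array
{
    $file = $result_dir . "/iterate_status.txt";
    if (!file_exists($file)) {
        return [];
    }
    $info = unserialize(file_get_contents($file));
    return is_array($info) ? $info : [];
}

// Demo directory under the system temp folder (an assumption)
$result_dir = sys_get_temp_dir() . "/warc_checkpoint_demo_" . getmypid();
if (!is_dir($result_dir)) {
    mkdir($result_dir);
}
saveProgress(
    $result_dir,
    ["current_partition_num" => 2, "current_offset" => 4096]
);
$restored = restoreProgress($result_dir);
print_r($restored);
```

A new process could call restoreProgress() on the same directory and resume from partition 2 at offset 4096 without rereading earlier pages.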

seekPage()

Advances the iterator to the $limit page, with as little additional processing as possible


    public
                    seekPage( $limit) : mixed

Parameters

$limit :: page to advance to

Return values

mixed —

setIniInfo()

Mutator method for controlling how this text archive iterator behaves. Normally, data on compression and on start and stop delimiters is read from an ini file. This method reads it from the supplied array instead.


    public
                    setIniInfo(array<string|int, mixed> $ini) : mixed

Parameters

$ini : array<string|int, mixed>: configuration settings for this archive iterator

Return values

mixed —
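As a small illustration of supplying settings from an array rather than a file, the sketch below copies recognised keys onto an object. The class IniConfigDemo, its default values, and the key names are assumptions modeled on the properties listed on this page; they are not a verified description of arc_description.ini or of Yioop's setIniInfo().

```php
<?php
/**
 * Hypothetical stand-in for an iterator's configurable fields.
 * The defaults here are assumptions made for the demo.
 */
class IniConfigDemo
{
    public $compression = "plain";
    public $start_delimiter = "";
    public $end_delimiter = "";
    public $encoding = "UTF-8";

    /**
     * Copies only the known settings from $ini onto this object,
     * leaving defaults in place for any key that is absent.
     */
    public function setIniInfo(array $ini): void
    {
        $keys = ["compression", "start_delimiter", "end_delimiter",
            "encoding"];
        foreach ($keys as $key) {
            if (isset($ini[$key])) {
                $this->$key = $ini[$key];
            }
        }
    }
}

// The array could equally come from parse_ini_file() on a real file
$ini = parse_ini_string("compression = \"gzip\"\nencoding = \"UTF-8\"\n");
$demo = new IniConfigDemo();
$demo->setIniInfo($ini);
echo $demo->compression, "\n";
```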

updateBuffer()

If reading from a gzbuffer file goes off the end of the current buffer, reads in the next block from archive file.


    public
                    updateBuffer([string $buffer = "" ][, bool $return_string = false ]) : bool

Parameters

$buffer : string = ""
$return_string : bool = false

Return values

bool —

whether successfully read in next block or not

updatePartition()

Helper function for nextChunk to advance the partition if we are at the end of the current archive file


    public
                    updatePartition(array<string|int, mixed> &$info) : mixed

Parameters

$info : array<string|int, mixed>: a struct with data about the current chunk; its start partition flag will be updated

Return values

mixed —

weight()

Estimates the importance of the site according to the weighting of the particular archive iterator


    public
    abstract                weight( &$site) : mixed

Parameters

$site :: an associative array containing info about a web page

Return values

mixed —

a 4-bit number, or false if the iterator does not use the default ranking method
