WarcArchiveBundleIterator
extends TextArchiveBundleIterator
Used to iterate through the records of a collection of WARC files stored in a WebArchiveBundle folder. WARC is the newer file format used by the Internet Archive and others for digital preservation: http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml http://archive-access.sourceforge.net/warc/ Iteration is done for the purpose of making an index of these records.
Table of Contents
- BUFFER_SIZE = 16384000
- How many bytes at a time should be read from the current archive file into the buffer file. Note that 8192 = BZip2BlockIterator::BLOCK_SIZE, and BUFFER_SIZE = 2000 * 8192.
- MAX_RECORD_SIZE = 49152
- Estimate of the maximum size of a record stored in a text archive. Data in archives is split into chunks of buffer size plus two record sizes. This provides a two record overlap between successive chunks, which ensures that records that cross the basic chunk boundary of BUFFER_SIZE will still be processed.
- $buffer : string
- Used to buffer data from the currently opened file
- $buffer_block_num : int
- Which block of self::BUFFER_SIZE from the current archive file is stored in the file $this->buffer_filename
- $buffer_fh : resource
- If gzip is being used a buffer file is also employed to try to reduce the number of calls to gzseek. $buffer_fh is a filehandle for the buffer file
- $buffer_filename : string
- Name of a buffer file to be used to reduce gzseek calls in the case where gzip compression is being used
- $bz2_iterator : object
- Used to iterate over contents in a bzipped file
- $compression : string
- Used to store the name of the compression scheme that the iterator should use.
- $current_offset : int
- current byte offset into the current arc file
- $current_page_num : int
- current number of pages into the current arc file
- $current_partition_num : int
- Counting in glob order for this arc archive bundle directory, the current active file number of the arc file being processed.
- $delimiter : string
- If the archive uses a string of some kind to separate records, then delimiter is a regular expression which will match the separator
- $encoding : string
- Default character encoding used by records in the archive. For example, UTF-8
- $end_delimiter : string
- Ending delimiters for records
- $end_of_iterator : bool
- Whether or not the iterator still has more documents
- $fh : resource
- File handle for current archive file
- $header : array<string|int, mixed>
- Used to store fields of meta information which is needed in making header information for each record processed. (such as base_address, ip_address, lang, etc.)
- $ini : array<string|int, mixed>
- Contains basic parameters of how this iterator works: compression, start and stop delimiter. Typically, this data is read from the arc_description.ini file
- $iterate_dir : string
- The path to the directory containing the archive partitions to be iterated over.
- $iterate_timestamp : int
- Timestamp of the archive that is being iterated over
- $num_partitions : int
- The number of arc files in this arc archive bundle
- $partitions : array<string|int, mixed>
- Array of filenames of arc files in this directory (glob order)
- $remainder : string
- $result_dir : string
- The path to the directory where the iteration status is stored.
- $result_timestamp : int
- Timestamp of the archive that is being used to store results in
- $start_delimiter : string
- Starting delimiters for records
- $status_filename : string
- File name to write this archive iterator status messages to
- $switch_partition_callback_name : string
- Name of the function to be called whenever the partition that the iterator is reading changes. The point of the callback is to read meta information at the start of the new partition
- __construct() : mixed
- Creates a WARC archive iterator with the given parameters.
- checkEof() : bool
- Checks if this object's archive's current partition is at an end of file
- checkFileHandle() : bool
- Checks if there is a valid handle to this object's archive's current partition
- fileClose() : mixed
- Wrapper around particular compression scheme fclose function
- fileGets() : string
- Acts as gzgets(), hiding the fact that buffering of the archive_file is being done to a buffer file
- fileOpen() : mixed
- Wrapper around particular compression scheme fopen function
- fileRead() : string
- Acts as gzread($num_bytes, $archive_file), hiding the fact that buffering of the archive_file is being done to a buffer file
- fileTell() : int
- Returns the current position in the current iterator partition file for the given compression scheme.
- getFileBlock() : mixed
- Reads and returns the block of data from the current partition
- getNextTagData() : string
- Used to extract data between two tags. After operation $this->buffer has contents after the close tag.
- getNextTagsData() : array<string|int, mixed>
- Used to extract data between two tags for the first tag found amongst the array of tags $tags. After operation $this->buffer has contents after the close tag.
- getRecordStart() : mixed
- Used to advance the file pointer to the start of a WARC record
- getWarcHeaders() : array<string|int, mixed>
- Used to parse the header portion of a WARC record
- makeBuffer() : mixed
- Reads in block $this->buffer_block_num of size self::BUFFER_SIZE from the archive file
- nextChunk() : array<string|int, mixed>
- Called to get the next chunk of BUFFER_SIZE + 2 * MAX_RECORD_SIZE bytes of data from the text archive. This data is returned unprocessed in self::ARC_DATA together with ini and header information about the archive. This method is typically called in the name server setting from FetchController.
- nextPage() : array<string|int, mixed>
- Gets the next doc from the iterator
- nextPages() : array<string|int, mixed>
- Gets the next $num many docs from the iterator
- reset() : mixed
- Resets the iterator to the start of the archive bundle
- restoreCheckPoint() : array<string|int, mixed>
- Restores the internal state from the file iterate_status.txt in the result dir such that the next call to nextPages will pick up from just after the last checkpoint. Text archive bundle iterator takes the unserialized data from the last check point and calls the compression specific restore checkpoint to further set up the iterator according to the given compression scheme.
- restoreCheckpoint() : array<string|int, mixed>
- Restores the internal state from the file iterate_status.txt in the result dir such that the next call to nextPages will pick up from just after the last checkpoint. Each iterator should make a call to restoreCheckpoint at the end of the constructor method after the instance members have been initialized.
- saveCheckPoint() : mixed
- Stores the current progress to the file iterate_status.txt in the result dir such that a new instance of the iterator could be constructed and return the next set of pages without having to process all of the pages that came before. Each iterator should make a call to saveCheckpoint after extracting a batch of pages.
- saveCheckpoint() : mixed
- Stores the current progress to the file iterate_status.txt in the result dir such that a new instance of the iterator could be constructed and return the next set of pages without having to process all of the pages that came before. Each iterator should make a call to saveCheckpoint after extracting a batch of pages.
- seekPage() : mixed
- Advances the iterator to the $limit page, with as little additional processing as possible
- setIniInfo() : mixed
- Mutator method for controlling how this text archive iterator behaves. Normally, data on compression and on start and stop delimiters is read from an ini file. This method reads it from the supplied array.
- updateBuffer() : bool
- If reading from a gzbuffer file goes off the end of the current buffer, reads in the next block from archive file.
- updatePartition() : mixed
- Helper function for nextChunk to advance the partition if we are at the end of the current archive file
- weight() : mixed
- Estimates the importance of the site according to the weighting of the particular archive iterator
Constants
BUFFER_SIZE
How many bytes at a time should be read from the current archive file into the buffer file. Note that 8192 = BZip2BlockIterator::BLOCK_SIZE, and BUFFER_SIZE = 2000 * 8192.
public
mixed
BUFFER_SIZE
= 16384000
MAX_RECORD_SIZE
Estimate of the maximum size of a record stored in a text archive. Data in archives is split into chunks of buffer size plus two record sizes. This provides a two record overlap between successive chunks, which ensures that records that cross the basic chunk boundary of BUFFER_SIZE will still be processed.
public
mixed
MAX_RECORD_SIZE
= 49152
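The chunk arithmetic described by these two constants can be illustrated with a short sketch. The helper `chunkRange()` is hypothetical and not part of the documented class; it assumes that chunk number n starts at byte n * BUFFER_SIZE, which the source does not state explicitly.

```php
<?php
// Values copied from the constants documented above.
const BUFFER_SIZE = 16384000;
const MAX_RECORD_SIZE = 49152;

/**
 * Hypothetical helper: byte range [start, end) covered by chunk $n,
 * assuming chunk $n begins at $n * BUFFER_SIZE.
 */
function chunkRange(int $n): array
{
    $start = $n * BUFFER_SIZE;
    // Each chunk carries two extra record sizes past the basic boundary.
    $end = $start + BUFFER_SIZE + 2 * MAX_RECORD_SIZE;
    return [$start, $end];
}

[$start0, $end0] = chunkRange(0);
[$start1, $end1] = chunkRange(1);

$chunk_length = $end0 - $start0; // 16482304 bytes per chunk
$overlap = $end0 - $start1;      // 98304 bytes shared by successive chunks
```

Under that assumption, a record that begins just before the end of chunk 0's basic BUFFER_SIZE region still lies entirely within chunk 0, provided it is no larger than the MAX_RECORD_SIZE estimate.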
Properties
$buffer
Used to buffer data from the currently opened file
public
string
$buffer
$buffer_block_num
Which block of self::BUFFER_SIZE from the current archive file is stored in the file $this->buffer_filename
public
int
$buffer_block_num
$buffer_fh
If gzip is being used a buffer file is also employed to try to reduce the number of calls to gzseek. $buffer_fh is a filehandle for the buffer file
public
resource
$buffer_fh
$buffer_filename
Name of a buffer file to be used to reduce gzseek calls in the case where gzip compression is being used
public
string
$buffer_filename
$bz2_iterator
Used to iterate over contents in a bzipped file
public
object
$bz2_iterator
$compression
Used to store the name of the compression scheme that the iterator should use.
public
string
$compression
For example, gzip, bzip, etc.
$current_offset
current byte offset into the current arc file
public
int
$current_offset
$current_page_num
current number of pages into the current arc file
public
int
$current_page_num
$current_partition_num
Counting in glob order for this arc archive bundle directory, the current active file number of the arc file being processed.
public
int
$current_partition_num
$delimiter
If the archive uses a string of some kind to separate records, then delimiter is a regular expression which will match the separator
public
string
$delimiter
$encoding
Default character encoding used by records in the archive. For example, UTF-8
public
string
$encoding
$end_delimiter
Ending delimiters for records
public
string
$end_delimiter
$end_of_iterator
Whether or not the iterator still has more documents
public
bool
$end_of_iterator
$fh
File handle for current archive file
public
resource
$fh
$header
Used to store fields of meta information which is needed in making header information for each record processed. (such as base_address, ip_address, lang, etc.)
public
array<string|int, mixed>
$header
$ini
Contains basic parameters of how this iterator works: compression, start and stop delimiter. Typically, this data is read from the arc_description.ini file
public
array<string|int, mixed>
$ini
$iterate_dir
The path to the directory containing the archive partitions to be iterated over.
public
string
$iterate_dir
$iterate_timestamp
Timestamp of the archive that is being iterated over
public
int
$iterate_timestamp
$num_partitions
The number of arc files in this arc archive bundle
public
int
$num_partitions
$partitions
Array of filenames of arc files in this directory (glob order)
public
array<string|int, mixed>
$partitions
$remainder
public
string
$remainder
$result_dir
The path to the directory where the iteration status is stored.
public
string
$result_dir
$result_timestamp
Timestamp of the archive that is being used to store results in
public
int
$result_timestamp
$start_delimiter
Starting delimiters for records
public
string
$start_delimiter
$status_filename
File name to write this archive iterator status messages to
public
string
$status_filename
$switch_partition_callback_name
Name of the function to be called whenever the partition that the iterator is reading changes. The point of the callback is to read meta information at the start of the new partition
public
string
$switch_partition_callback_name
= null
Methods
__construct()
Creates a WARC archive iterator with the given parameters.
public
__construct(string $iterate_timestamp, string $iterate_dir, string $result_timestamp, string $result_dir) : mixed
Parameters
- $iterate_timestamp : string
-
timestamp of the arc archive bundle to iterate over the pages of
- $iterate_dir : string
-
folder of files to iterate over
- $result_timestamp : string
-
timestamp of the arc archive bundle results are being stored in
- $result_dir : string
-
where to write last position checkpoints to
Return values
mixed
checkEof()
Checks if this object's archive's current partition is at an end of file
public
checkEof() : bool
Return values
bool —whether end of file has been reached (true -it has)
checkFileHandle()
Checks if there is a valid handle to this object's archive's current partition
public
checkFileHandle() : bool
Return values
bool —whether it has or not (true -it has)
fileClose()
Wrapper around particular compression scheme fclose function
public
fileClose() : mixed
Return values
mixed
fileGets()
Acts as gzgets(), hiding the fact that buffering of the archive_file is being done to a buffer file
public
fileGets() : string
Return values
string —from archive file up to next line ending or eof
fileOpen()
Wrapper around particular compression scheme fopen function
public
fileOpen(string $filename[, bool $make_buffer_if_needed = true ]) : mixed
Parameters
- $filename : string
-
name of file to open
- $make_buffer_if_needed : bool = true
Return values
mixed
fileRead()
Acts as gzread($num_bytes, $archive_file), hiding the fact that buffering of the archive_file is being done to a buffer file
public
fileRead(int $num_bytes) : string
Parameters
- $num_bytes : int
-
to read from archive file
Return values
string —of length up to $num_bytes (less if eof occurs)
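The relationship between a byte offset in the uncompressed archive data and the buffered block that holds it can be sketched as follows. This is only an illustration of the block numbering implied by $buffer_block_num and BUFFER_SIZE; `locateInBuffer()` is a hypothetical helper, and the documented class may track positions differently.

```php
<?php
// Value copied from the BUFFER_SIZE constant documented above.
const BUFFER_SIZE = 16384000;

/**
 * Hypothetical helper: maps an offset in the uncompressed archive data
 * to [block number, position inside that block].
 */
function locateInBuffer(int $offset): array
{
    return [intdiv($offset, BUFFER_SIZE), $offset % BUFFER_SIZE];
}

[$block_num, $position] = locateInBuffer(40000000);
// $block_num is 2 and $position is 7232000
```

A read that runs past the end of the current block would need the next block loaded, which is the situation updateBuffer() is documented to handle.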
fileTell()
Returns the current position in the current iterator partition file for the given compression scheme.
public
fileTell() : int
Return values
int —a position into the currently being processed file of the iterator
getFileBlock()
Reads and returns the block of data from the current partition
public
getFileBlock() : mixed
Return values
mixed —an uncompressed string from the current partition, or null if the iterator is not set up, or false if EOF is reached.
getNextTagData()
Used to extract data between two tags. After operation $this->buffer has contents after the close tag.
public
getNextTagData(string $tag) : string
Parameters
- $tag : string
-
tag name to look for
Return values
string —the start tag, contents, and close tag for the tag named $tag
getNextTagsData()
Used to extract data between two tags for the first tag found amongst the array of tags $tags. After operation $this->buffer has contents after the close tag.
public
getNextTagsData(array<string|int, mixed> $tags) : array<string|int, mixed>
Parameters
- $tags : array<string|int, mixed>
-
array of tagnames to look for
Return values
array<string|int, mixed> —of two elements: the first element is a string consisting of start tag contents close tag of first tag found, the second has the name of the tag amongst $tags found
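The behaviour described for the two tag extraction methods can be sketched with a stand-alone function. `nextTagsData()` below is a hypothetical illustration that works on a string passed by reference rather than on $this->buffer; it does not handle nested tags of the same name, and it does not refill the buffer from the archive file as the real methods presumably must.

```php
<?php
/**
 * Hypothetical stand-alone sketch: returns [start tag + contents +
 * close tag, tag name] for whichever of $tags opens first in $buffer,
 * and trims $buffer to what follows the close tag.
 * Returns [false, false] if no complete tag is found.
 */
function nextTagsData(string &$buffer, array $tags): array
{
    $best_pos = false;
    $best_tag = false;
    foreach ($tags as $tag) {
        $pattern = '/<' . preg_quote($tag, '/') . '[\s>]/';
        if (preg_match($pattern, $buffer, $match, PREG_OFFSET_CAPTURE)) {
            $pos = $match[0][1];
            if ($best_pos === false || $pos < $best_pos) {
                $best_pos = $pos;
                $best_tag = $tag;
            }
        }
    }
    if ($best_tag === false) {
        return [false, false];
    }
    $close = "</$best_tag>";
    $end = strpos($buffer, $close, $best_pos);
    if ($end === false) {
        return [false, false];
    }
    $end += strlen($close);
    $data = substr($buffer, $best_pos, $end - $best_pos);
    // Leave only the contents after the close tag in the buffer.
    $buffer = (string) substr($buffer, $end);
    return [$data, $best_tag];
}

$buffer = "junk<page><title>A</title></page><doc>B</doc>tail";
[$data, $tag] = nextTagsData($buffer, ['doc', 'page']);
```

Here the page tag opens before the doc tag, so it is the one returned even though doc is listed first in the array.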
getRecordStart()
Used to advance the file pointer to the start of a WARC record
public
getRecordStart() : mixed
Return values
mixed
getWarcHeaders()
Used to parse the header portion of a WARC record
public
getWarcHeaders() : array<string|int, mixed>
Return values
array<string|int, mixed> —fields of the WARC record mapped to their Yioop equivalents. Also returns 'line', the last line read, and 'warc-type', the kind of record.
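As a rough illustration of what parsing a WARC record header involves, the sketch below splits a header block into named fields. `parseWarcHeaderBlock()` is hypothetical: it works on a string rather than on the archive file handle, it keys fields by their lower-cased WARC names, and it does not perform the mapping to Yioop field names mentioned above.

```php
<?php
/**
 * Hypothetical helper (not part of the documented class): parses the
 * header lines of one WARC record into a field => value array.
 */
function parseWarcHeaderBlock(string $block): array
{
    $fields = [];
    foreach (preg_split('/\r?\n/', $block) as $line) {
        if ($line === '') {
            break; // a blank line ends the header block
        }
        if (strpos($line, 'WARC/') === 0) {
            $fields['version'] = trim($line); // e.g. "WARC/1.0"
            continue;
        }
        // Split on the first colon only, since values such as URLs
        // may themselves contain colons.
        $parts = explode(':', $line, 2);
        if (count($parts) === 2) {
            $fields[strtolower(trim($parts[0]))] = trim($parts[1]);
        }
    }
    return $fields;
}

$sample = "WARC/1.0\r\nWARC-Type: response\r\n" .
    "WARC-Target-URI: http://www.example.com/\r\n" .
    "Content-Length: 1024\r\n\r\nignored body";
$headers = parseWarcHeaderBlock($sample);
```

The sample record is made up for the example; real records carry further fields such as WARC-Date and WARC-Record-ID.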
makeBuffer()
Reads in block $this->buffer_block_num of size self::BUFFER_SIZE from the archive file
public
makeBuffer([string $buffer = "" ][, bool $return_string = false ]) : mixed
Parameters
- $buffer : string = ""
- $return_string : bool = false
Return values
mixed —whether successfully read in block or not
nextChunk()
Called to get the next chunk of BUFFER_SIZE + 2 * MAX_RECORD_SIZE bytes of data from the text archive. This data is returned unprocessed in self::ARC_DATA together with ini and header information about the archive. This method is typically called in the name server setting from FetchController.
public
nextChunk() : array<string|int, mixed>
Return values
array<string|int, mixed> —with contents as described above
nextPage()
Gets the next doc from the iterator
public
nextPage([bool $no_process = false ]) : array<string|int, mixed>
Parameters
- $no_process : bool = false
-
do not do any processing on page data
Return values
array<string|int, mixed> —associative array for doc or string if no_process true
nextPages()
Gets the next $num many docs from the iterator
public
abstract nextPages(int $num[, bool $no_process = false ]) : array<string|int, mixed>
Parameters
- $num : int
-
number of docs to get
- $no_process : bool = false
-
do not do any processing on page data
Return values
array<string|int, mixed> —associative arrays for $num pages
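The way a caller might drain an iterator in batches can be sketched with an in-memory stand-in. `ArrayPageIterator` is hypothetical and only mimics the nextPages() method and the $end_of_iterator and $current_page_num properties documented here; it reads from an array instead of archive files.

```php
<?php
/** Hypothetical in-memory stand-in for an archive bundle iterator. */
class ArrayPageIterator
{
    public $end_of_iterator = false;
    public $current_page_num = 0;
    private $pages;

    public function __construct(array $pages)
    {
        $this->pages = $pages;
        $this->end_of_iterator = count($pages) === 0;
    }

    /** Returns up to $num pages and advances the iterator. */
    public function nextPages(int $num): array
    {
        $batch = array_slice($this->pages, $this->current_page_num, $num);
        $this->current_page_num += count($batch);
        $this->end_of_iterator =
            $this->current_page_num >= count($this->pages);
        return $batch;
    }
}

$iterator = new ArrayPageIterator(['p1', 'p2', 'p3', 'p4', 'p5']);
$batch_sizes = [];
while (!$iterator->end_of_iterator) {
    $batch = $iterator->nextPages(2);
    $batch_sizes[] = count($batch);
    // A real consumer would index $batch at this point.
}
// $batch_sizes is [2, 2, 1]
```

The last batch can be shorter than requested, so callers should not assume that exactly $num documents come back.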
reset()
Resets the iterator to the start of the archive bundle
public
abstract reset() : mixed
Return values
mixed
restoreCheckPoint()
Restores the internal state from the file iterate_status.txt in the result dir such that the next call to nextPages will pick up from just after the last checkpoint. Text archive bundle iterator takes the unserialized data from the last check point and calls the compression specific restore checkpoint to further set up the iterator according to the given compression scheme.
public
restoreCheckPoint() : array<string|int, mixed>
Return values
array<string|int, mixed> —the data serialized when saveCheckpoint was called
restoreCheckpoint()
Restores the internal state from the file iterate_status.txt in the result dir such that the next call to nextPages will pick up from just after the last checkpoint. Each iterator should make a call to restoreCheckpoint at the end of the constructor method after the instance members have been initialized.
public
restoreCheckpoint() : array<string|int, mixed>
Return values
array<string|int, mixed> —the data serialized when saveCheckpoint was called
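The checkpoint round trip described in this section (serialize the progress to a status file, then unserialize it in a new instance) can be sketched as below. `saveState()` and `restoreState()` are hypothetical free functions, the file name is a temporary one chosen for the example, and the saved fields are borrowed from the property list above; the actual set of fields written to iterate_status.txt is not specified here.

```php
<?php
/** Hypothetical helper: writes iterator state to a status file. */
function saveState(string $status_file, array $state): void
{
    file_put_contents($status_file, serialize($state));
}

/** Hypothetical helper: reads state back, or [] if none was saved. */
function restoreState(string $status_file): array
{
    if (!file_exists($status_file)) {
        return [];
    }
    $state = unserialize(file_get_contents($status_file));
    return is_array($state) ? $state : [];
}

$status_file = sys_get_temp_dir() . "/iterate_status_" . uniqid() . ".txt";

$before = restoreState($status_file); // nothing saved yet, so []

saveState($status_file, [
    'current_partition_num' => 2,
    'current_offset' => 4096,
    'current_page_num' => 17,
    'end_of_iterator' => false,
]);

// A fresh process could now resume from the saved position.
$restored = restoreState($status_file);
unlink($status_file);
```

Saving after each batch, as the method descriptions recommend, bounds the amount of work repeated after a restart to at most one batch.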
saveCheckPoint()
Stores the current progress to the file iterate_status.txt in the result dir such that a new instance of the iterator could be constructed and return the next set of pages without having to process all of the pages that came before. Each iterator should make a call to saveCheckpoint after extracting a batch of pages.
public
saveCheckPoint([array<string|int, mixed> $info = [] ]) : mixed
Parameters
- $info : array<string|int, mixed> = []
-
any extra info a subclass wants to save
Return values
mixed
saveCheckpoint()
Stores the current progress to the file iterate_status.txt in the result dir such that a new instance of the iterator could be constructed and return the next set of pages without having to process all of the pages that came before. Each iterator should make a call to saveCheckpoint after extracting a batch of pages.
public
saveCheckpoint([array<string|int, mixed> $info = [] ]) : mixed
Parameters
- $info : array<string|int, mixed> = []
-
any extra info a subclass wants to save
Return values
mixed
seekPage()
Advances the iterator to the $limit page, with as little additional processing as possible
public
seekPage( $limit) : mixed
Parameters
Return values
mixed
setIniInfo()
Mutator method for controlling how this text archive iterator behaves. Normally, data on compression and on start and stop delimiters is read from an ini file. This method reads it from the supplied array.
public
setIniInfo(array<string|int, mixed> $ini) : mixed
Parameters
- $ini : array<string|int, mixed>
-
configuration settings for this archive iterator
Return values
mixed
updateBuffer()
If reading from a gzbuffer file goes off the end of the current buffer, reads in the next block from archive file.
public
updateBuffer([string $buffer = "" ][, bool $return_string = false ]) : bool
Parameters
- $buffer : string = ""
- $return_string : bool = false
Return values
bool —whether successfully read in next block or not
updatePartition()
Helper function for nextChunk to advance the partition if we are at the end of the current archive file
public
updatePartition(array<string|int, mixed> &$info) : mixed
Parameters
- $info : array<string|int, mixed>
-
a struct with data about the current chunk; its start partition flag will be updated
Return values
mixed
weight()
Estimates the importance of the site according to the weighting of the particular archive iterator
public
abstract weight( &$site) : mixed
Parameters
Return values
mixed —a 4-bit number or false if the iterator doesn't use the default ranking method