Yioop_V9.5_Source_Code_Documentation

MediaWikiArchiveBundleIterator extends TextArchiveBundleIterator

Used to iterate through a collection of .xml.bz2 MediaWiki files stored in a WebArchiveBundle folder. These MediaWiki files contain the kinds of documents used by Wikipedia. Iteration is for the purpose of making an index of these records.

Tags
author

Chris Pollett

see
WebArchiveBundle

Table of Contents

BUFFER_SIZE  = 16384000
How many bytes at a time should be read from the current archive file into the buffer file (16384000 = 2000 * 8192, where 8192 = BZip2BlockIterator::BLOCK_SIZE).
MAX_RECORD_SIZE  = 49152
Estimate of the maximum size of a record stored in a text archive. Data in archives is split into chunks of buffer size plus two record sizes. This provides a two record overlap between successive chunks and ensures that records that cross the basic chunk boundary of BUFFER_SIZE will still be processed.
WIKI_PAGE_STYLES  = <<<EOD <style> table.wikitable { background:white; border:1px #aaa solid; border-collapse: scollapse margin:1em 0; } table.wikitable > tr > th,table.wikitable > tr > td, table.wikitable > * > tr > th,table.wikitable > * > tr > td { border:1px #aaa solid; padding:0.2em; } table.wikitable > tr > th, table.wikitable > * > tr > th { text-align:center; background:white; font-weight:bold } table.wikitable > caption { font-weight:bold; } </style> EOD
Used to define the styles we put on cache wiki pages
$buffer  : string
Used to buffer data from the currently opened file
$buffer_block_num  : int
Which block of self::BUFFER_SIZE from the current archive file is stored in the file $this->buffer_filename
$buffer_fh  : resource
If gzip is being used a buffer file is also employed to try to reduce the number of calls to gzseek. $buffer_fh is a filehandle for the buffer file
$buffer_filename  : string
Name of a buffer file to be used to reduce gzseek calls in the case where gzip compression is being used
$bz2_iterator  : object
Used to iterate over contents in a bzipped file
$compression  : string
Used to store the name of the compression scheme that should be used when iterating.
$current_offset  : int
current byte offset into the current arc file
$current_page_num  : int
current number of pages into the current arc file
$current_partition_num  : int
Counting in glob order for this arc archive bundle directory, the file number of the arc file currently being processed.
$delimiter  : string
If the archive uses a string of some kind to separate records, then delimiter is a regular expression which will match the separator
$encoding  : string
Default character encoding used by records in the archive. For example, UTF-8
$end_delimiter  : string
Ending delimiters for records
$end_of_iterator  : bool
Whether or not the iterator still has more documents
$fh  : resource
File handle for current archive file
$header  : array<string|int, mixed>
Used to store fields of meta information which is needed in making header information for each record processed. (such as base_address, ip_address, lang, etc.)
$ini  : array<string|int, mixed>
Contains basic parameters of how this iterator works: compression, start and stop delimiters. Typically, this data is read from the arc_description.ini file
$iterate_dir  : string
The path to the directory containing the archive partitions to be iterated over.
$iterate_timestamp  : int
Timestamp of the archive that is being iterated over
$num_partitions  : int
The number of arc files in this arc archive bundle
$parser  : object
Used to hold a WikiParser object that will be used for parsing
$partitions  : array<string|int, mixed>
Array of filenames of arc files in this directory (glob order)
$remainder  : string
$result_dir  : string
The path to the directory where the iteration status is stored.
$result_timestamp  : int
Timestamp of the archive that is being used to store results in
$start_delimiter  : string
Starting delimiters for records
$status_filename  : string
File name to write this archive iterator status messages to
$switch_partition_callback_name  : string
Name of the function to be called whenever the partition that the iterator is reading changes. The point of the callback is to read meta information at the start of the new partition
__construct()  : mixed
Creates a media wiki archive iterator with the given parameters.
checkEof()  : bool
Checks if this object's archive's current partition is at an end of file
checkFileHandle()  : bool
Checks if we have a valid handle to this object's archive's current partition
fileClose()  : mixed
Wrapper around the fclose function of the particular compression scheme
fileGets()  : string
Acts as gzgets(), hiding the fact that buffering of the archive_file is being done to a buffer file
fileOpen()  : mixed
Wrapper around the fopen function of the particular compression scheme
fileRead()  : string
Acts as gzread($num_bytes, $archive_file), hiding the fact that buffering of the archive_file is being done to a buffer file
fileTell()  : int
Returns the current position in the current iterator partition file for the given compression scheme.
getFileBlock()  : mixed
Reads and returns the block of data from the current partition
getNextTagData()  : string
Used to extract data between two tags. After operation $this->buffer has contents after the close tag.
getNextTagsData()  : array<string|int, mixed>
Used to extract data between two tags for the first tag found amongst the array of tags $tags. After operation $this->buffer has contents after the close tag.
getTextContent()  : string
Gets the text content of the first dom node satisfying the xpath expression $path in the dom document $dom
initializeSubstitutions()  : mixed
Used to initialize the arrays of match/replacements used to format wikimedia syntax into HTML (not perfectly since we are only doing regexes)
makeBuffer()  : mixed
Reads in block $this->buffer_block_num of size self::BUFFER_SIZE from the archive file
nextChunk()  : array<string|int, mixed>
Called to get the next chunk of BUFFER_SIZE + 2 * MAX_RECORD_SIZE bytes of data from the text archive. This data is returned unprocessed in self::ARC_DATA together with ini and header information about the archive. This method is typically called in the name server setting from FetchController.
nextPage()  : array<string|int, mixed>
Gets the next doc from the iterator
nextPages()  : array<string|int, mixed>
Gets the next $num many docs from the iterator
readMediaWikiHeader()  : mixed
Reads the siteinfo tag of the MediaWiki XML file and extracts data that will be used in constructing page summaries.
reset()  : mixed
Resets the iterator to the start of the archive bundle
restoreCheckPoint()  : array<string|int, mixed>
Restores the internal state from the file iterate_status.txt in the result dir such that the next call to nextPages will pick up from just after the last checkpoint. We also set up our regex substitutions again.
restoreCheckpoint()  : array<string|int, mixed>
Restores the internal state from the file iterate_status.txt in the result dir such that the next call to nextPages will pick up from just after the last checkpoint. Each iterator should make a call to restoreCheckpoint at the end of the constructor method after the instance members have been initialized.
saveCheckPoint()  : mixed
Stores the current progress to the file iterate_status.txt in the result dir such that a new instance of the iterator could be constructed and return the next set of pages without having to process all of the pages that came before. Each iterator should make a call to saveCheckpoint after extracting a batch of pages.
saveCheckpoint()  : mixed
Stores the current progress to the file iterate_status.txt in the result dir such that a new instance of the iterator could be constructed and return the next set of pages without having to process all of the pages that came before. Each iterator should make a call to saveCheckpoint after extracting a batch of pages.
seekPage()  : mixed
Advances the iterator to the $limit page, with as little additional processing as possible
setIniInfo()  : mixed
Mutator method for controlling how this text archive iterator behaves. Normally, the data on compression and on start and stop delimiters is read from an ini file. This method reads it from the supplied array instead.
updateBuffer()  : bool
If reading from a gzbuffer file goes off the end of the current buffer, reads in the next block from archive file.
updatePartition()  : mixed
Helper function for nextChunk to advance the partition if we are at the end of the current archive file
weight()  : int
Estimates the importance of the site according to the weighting of the particular archive iterator

Constants

BUFFER_SIZE

How many bytes at a time should be read from the current archive file into the buffer file (16384000 = 2000 * 8192, where 8192 = BZip2BlockIterator::BLOCK_SIZE).

public mixed BUFFER_SIZE = 16384000

MAX_RECORD_SIZE

Estimate of the maximum size of a record stored in a text archive. Data in archives is split into chunks of buffer size plus two record sizes. This provides a two record overlap between successive chunks and ensures that records that cross the basic chunk boundary of BUFFER_SIZE will still be processed.

public mixed MAX_RECORD_SIZE = 49152
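
As a rough illustration of the chunk arithmetic described for BUFFER_SIZE and MAX_RECORD_SIZE, the sketch below computes the size of one chunk and the overlap between two successive chunks. It assumes that chunk n starts at byte n * BUFFER_SIZE, which is an assumption made for the example; the helper chunkRange is hypothetical and not part of Yioop.

```php
<?php
// Values copied from the constants documented above.
const BUFFER_SIZE = 16384000;
const MAX_RECORD_SIZE = 49152;

/**
 * Hypothetical helper (not part of Yioop): returns the [start, end) byte
 * range of chunk number $n, assuming chunk $n starts at $n * BUFFER_SIZE
 * and extends two record sizes past the basic chunk boundary.
 */
function chunkRange(int $n): array
{
    $start = $n * BUFFER_SIZE;
    $end = $start + BUFFER_SIZE + 2 * MAX_RECORD_SIZE;
    return [$start, $end];
}

[$start0, $end0] = chunkRange(0);
[$start1, $end1] = chunkRange(1);
// Bytes shared by chunk 0 and chunk 1: the two record overlap.
$overlap = $end0 - $start1;
echo "chunk size: ", $end0 - $start0, "\n";
echo "overlap: ", $overlap, "\n";
```

Under these assumptions a record that starts just before a BUFFER_SIZE boundary still lies wholly inside the earlier chunk, provided it is no longer than the MAX_RECORD_SIZE estimate.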

WIKI_PAGE_STYLES

Used to define the styles we put on cache wiki pages

public mixed WIKI_PAGE_STYLES = <<<EOD <style> table.wikitable { background:white; border:1px #aaa solid; border-collapse: scollapse margin:1em 0; } table.wikitable > tr > th,table.wikitable > tr > td, table.wikitable > * > tr > th,table.wikitable > * > tr > td { border:1px #aaa solid; padding:0.2em; } table.wikitable > tr > th, table.wikitable > * > tr > th { text-align:center; background:white; font-weight:bold } table.wikitable > caption { font-weight:bold; } </style> EOD

Properties

$buffer_block_num

Which block of self::BUFFER_SIZE from the current archive file is stored in the file $this->buffer_filename

public int $buffer_block_num

$buffer_fh

If gzip is being used a buffer file is also employed to try to reduce the number of calls to gzseek. $buffer_fh is a filehandle for the buffer file

public resource $buffer_fh

$buffer_filename

Name of a buffer file to be used to reduce gzseek calls in the case where gzip compression is being used

public string $buffer_filename

$compression

Used to store the name of the compression scheme that should be used when iterating.

public string $compression

For example, gzip, bzip, etc.

$current_partition_num

Counting in glob order for this arc archive bundle directory, the file number of the arc file currently being processed.

public int $current_partition_num

$delimiter

If the archive uses a string of some kind to separate records, then delimiter is a regular expression which will match the separator

public string $delimiter

$encoding

Default character encoding used by records in the archive. For example, UTF-8

public string $encoding

$end_of_iterator

Whether or not the iterator still has more documents

public bool $end_of_iterator

$header

Used to store fields of meta information which is needed in making header information for each record processed. (such as base_address, ip_address, lang, etc.)

public array<string|int, mixed> $header

$ini

Contains basic parameters of how this iterator works: compression, start and stop delimiters. Typically, this data is read from the arc_description.ini file

public array<string|int, mixed> $ini

$iterate_dir

The path to the directory containing the archive partitions to be iterated over.

public string $iterate_dir

$iterate_timestamp

Timestamp of the archive that is being iterated over

public int $iterate_timestamp

$partitions

Array of filenames of arc files in this directory (glob order)

public array<string|int, mixed> $partitions

$result_dir

The path to the directory where the iteration status is stored.

public string $result_dir

$result_timestamp

Timestamp of the archive that is being used to store results in

public int $result_timestamp

$status_filename

File name to write this archive iterator status messages to

public string $status_filename

$switch_partition_callback_name

Name of the function to be called whenever the partition that the iterator is reading changes. The point of the callback is to read meta information at the start of the new partition

public string $switch_partition_callback_name = null

Methods

__construct()

Creates a media wiki archive iterator with the given parameters.

public __construct(string $iterate_timestamp, string $iterate_dir, string $result_timestamp, string $result_dir) : mixed
Parameters
$iterate_timestamp : string

timestamp of the arc archive bundle to iterate over the pages of

$iterate_dir : string

folder of files to iterate over

$result_timestamp : string

timestamp of the arc archive bundle results are being stored in

$result_dir : string

where to write last position checkpoints to

Return values
mixed

checkEof()

Checks if this object's archive's current partition is at an end of file

public checkEof() : bool
Return values
bool

whether end of file has been reached (true if it has)

checkFileHandle()

Checks if we have a valid handle to this object's archive's current partition

public checkFileHandle() : bool
Return values
bool

whether it has or not (true if it has)

fileClose()

Wrapper around the fclose function of the particular compression scheme

public fileClose() : mixed
Return values
mixed

fileGets()

Acts as gzgets(), hiding the fact that buffering of the archive_file is being done to a buffer file

public fileGets() : string
Return values
string

from archive file up to next line ending or eof

fileOpen()

Wrapper around the fopen function of the particular compression scheme

public fileOpen(string $filename[, bool $make_buffer_if_needed = true ]) : mixed
Parameters
$filename : string

name of file to open

$make_buffer_if_needed : bool = true
Return values
mixed

fileRead()

Acts as gzread($num_bytes, $archive_file), hiding the fact that buffering of the archive_file is being done to a buffer file

public fileRead(int $num_bytes) : string
Parameters
$num_bytes : int

to read from archive file

Return values
string

of length up to $num_bytes (less if eof occurs)

fileTell()

Returns the current position in the current iterator partition file for the given compression scheme.

public fileTell() : int
Return values
int

a position into the currently being processed file of the iterator

getFileBlock()

Reads and returns the block of data from the current partition

public getFileBlock() : mixed
Return values
mixed

an uncompressed string from the current partition, or null if the iterator is not set up, or false if EOF is reached.

getNextTagData()

Used to extract data between two tags. After operation $this->buffer has contents after the close tag.

public getNextTagData(string $tag) : string
Parameters
$tag : string

tag name to look for

Return values
string

the data consisting of the start tag, contents, and close tag for the tag named $tag
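
The behaviour described for getNextTagData() can be sketched on a plain string as follows. This is a minimal illustration, not Yioop's implementation: the function nextTagData is hypothetical, it takes the buffer by reference instead of using $this->buffer, it never refills the buffer from the archive file, and returning null when no complete tag is found is an assumption.

```php
<?php
/**
 * Hypothetical stand-in for getNextTagData(): works on a string buffer
 * passed by reference and does not read more data from any archive file.
 * Returns null (an assumption) when no complete tag is present.
 */
function nextTagData(string &$buffer, string $tag): ?string
{
    // Simple prefix match on the start tag; good enough for a sketch.
    $start = strpos($buffer, "<" . $tag);
    if ($start === false) {
        return null;
    }
    $close = "</" . $tag . ">";
    $end = strpos($buffer, $close, $start);
    if ($end === false) {
        return null;
    }
    $end += strlen($close);
    $data = substr($buffer, $start, $end - $start);
    // Leave only the contents after the close tag in the buffer.
    $buffer = substr($buffer, $end);
    return $data;
}

$buffer = "<siteinfo><base>x</base></siteinfo>"
    . "<page><title>A</title></page><page>";
$page = nextTagData($buffer, "page");
echo $page, "\n";
echo $buffer, "\n";
```

After the call, $page holds the first complete page element and $buffer holds whatever followed its close tag, matching the post-condition stated in the description.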

getNextTagsData()

Used to extract data between two tags for the first tag found amongst the array of tags $tags. After operation $this->buffer has contents after the close tag.

public getNextTagsData(array<string|int, mixed> $tags) : array<string|int, mixed>
Parameters
$tags : array<string|int, mixed>

array of tagnames to look for

Return values
array<string|int, mixed>

of two elements: the first element is a string consisting of start tag contents close tag of first tag found, the second has the name of the tag amongst $tags found
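
A sketch of the multi-tag variant, under the same caveats: nextTagsData is a hypothetical function operating on a string passed by reference, and the null return for "nothing found" is an assumption rather than documented behaviour. It picks whichever of the given tags starts earliest in the buffer.

```php
<?php
/**
 * Hypothetical stand-in for getNextTagsData(): finds the tag amongst
 * $tags that opens first in $buffer and returns [data, tag name].
 */
function nextTagsData(string &$buffer, array $tags): ?array
{
    $first_tag = null;
    $first_pos = null;
    foreach ($tags as $tag) {
        $pos = strpos($buffer, "<" . $tag);
        if ($pos !== false && ($first_pos === null || $pos < $first_pos)) {
            $first_pos = $pos;
            $first_tag = $tag;
        }
    }
    if ($first_tag === null) {
        return null;
    }
    $close = "</" . $first_tag . ">";
    $end = strpos($buffer, $close, $first_pos);
    if ($end === false) {
        return null;
    }
    $end += strlen($close);
    $data = substr($buffer, $first_pos, $end - $first_pos);
    // The buffer keeps only what comes after the close tag.
    $buffer = substr($buffer, $end);
    return [$data, $first_tag];
}

$buffer = "junk<redirect>B</redirect><page>A</page>";
[$data, $found_tag] = nextTagsData($buffer, ["page", "redirect"]);
echo $found_tag, ": ", $data, "\n";
```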

getTextContent()

Gets the text content of the first dom node satisfying the xpath expression $path in the dom document $dom

public getTextContent(object $dom,  $path) : string
Parameters
$dom : object

DOMDocument to get the text from

$path :

xpath expression to find node with text

Return values
string

text content of the given node if it exists
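
One plausible way to obtain the text of the first node matching an XPath expression uses PHP's DOMXPath class. The function textContent below is a hypothetical illustration, and returning an empty string when no node matches is an assumption, since the description only says what is returned when the node exists.

```php
<?php
/**
 * Hypothetical sketch: text content of the first node in $dom matching
 * the XPath expression $path, or "" (an assumption) if nothing matches.
 */
function textContent(DOMDocument $dom, string $path): string
{
    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query($path);
    if ($nodes === false || $nodes->length === 0) {
        return "";
    }
    return $nodes->item(0)->textContent;
}

$dom = new DOMDocument();
$dom->loadXML(
    "<page><title>Alan Turing</title><title>Other</title></page>"
);
$title = textContent($dom, "/page/title");
$missing = textContent($dom, "/page/revision");
echo $title, "\n";
```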

initializeSubstitutions()

Used to initialize the arrays of match/replacements used to format wikimedia syntax into HTML (not perfectly since we are only doing regexes)

public initializeSubstitutions(string $base_address) : mixed
Parameters
$base_address : string

base url for link substitutions

Return values
mixed
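
To give a feel for regex-based match/replacement arrays, the sketch below converts a few common wiki constructs to HTML with preg_replace. The function makeSubstitutions and the four rules shown are illustrative assumptions; the actual rules built by initializeSubstitutions() are not listed on this page and are certainly more extensive.

```php
<?php
/**
 * Hypothetical sketch: builds parallel arrays of regex patterns and
 * replacements for a handful of wiki constructs. Not Yioop's rule set.
 */
function makeSubstitutions(string $base_address): array
{
    $matches = [
        "/'''(.+?)'''/",                   // '''bold'''
        "/''(.+?)''/",                     // ''italic''
        '/\[\[([^\]|]+)\|([^\]]+)\]\]/',   // [[Target|label]]
        '/\[\[([^\]|]+)\]\]/',             // [[Target]]
    ];
    $replaces = [
        '<b>$1</b>',
        '<i>$1</i>',
        '<a href="' . $base_address . '$1">$2</a>',
        '<a href="' . $base_address . '$1">$1</a>',
    ];
    return [$matches, $replaces];
}

[$matches, $replaces] = makeSubstitutions("https://en.wikipedia.org/wiki/");
$wiki = "'''Yioop''' is a ''search engine''; see [[PHP]] "
    . "and [[Crawler|crawlers]].";
// Patterns are applied in array order, so bold runs before italic.
$html = preg_replace($matches, $replaces, $wiki);
echo $html, "\n";
```

Ordering matters with this approach: the bold rule has to run before the italic rule, because the italic pattern would otherwise consume part of the triple-quote markers.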

makeBuffer()

Reads in block $this->buffer_block_num of size self::BUFFER_SIZE from the archive file

public makeBuffer([string $buffer = "" ][, bool $return_string = false ]) : mixed
Parameters
$buffer : string = ""
$return_string : bool = false
Return values
mixed

whether successfully read in block or not
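
Reading "block number n of a fixed size" amounts to seeking to n times the block size and reading one block. The sketch below shows this on an uncompressed temporary file with a tiny block size; readBlock is a hypothetical helper, and the real method works against the compressed archive and its buffer file rather than a plain file.

```php
<?php
/**
 * Hypothetical sketch: reads block $block_num of $block_size bytes from
 * an open plain file handle. Returns null if the block is empty or the
 * seek fails.
 */
function readBlock($fh, int $block_num, int $block_size): ?string
{
    if (fseek($fh, $block_num * $block_size) !== 0) {
        return null;
    }
    $data = fread($fh, $block_size);
    return ($data === false || $data === "") ? null : $data;
}

$fh = tmpfile();
fwrite($fh, "abcdefghij");
$block1 = readBlock($fh, 1, 4);
$block2 = readBlock($fh, 2, 4); // last block is shorter than 4 bytes
$block3 = readBlock($fh, 3, 4); // past the end of the file
fclose($fh);
echo $block1, " ", $block2, "\n";
```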

nextChunk()

Called to get the next chunk of BUFFER_SIZE + 2 * MAX_RECORD_SIZE bytes of data from the text archive. This data is returned unprocessed in self::ARC_DATA together with ini and header information about the archive. This method is typically called in the name server setting from FetchController.

public nextChunk() : array<string|int, mixed>
Return values
array<string|int, mixed>

with contents as described above

nextPage()

Gets the next doc from the iterator

public nextPage([bool $no_process = false ]) : array<string|int, mixed>
Parameters
$no_process : bool = false

do not do any processing on page data

Return values
array<string|int, mixed>

associative array for doc or string if no_process true

nextPages()

Gets the next $num many docs from the iterator

public abstract nextPages(int $num[, bool $no_process = false ]) : array<string|int, mixed>
Parameters
$num : int

number of docs to get

$no_process : bool = false

do not do any processing on page data

Return values
array<string|int, mixed>

associative arrays for $num pages

readMediaWikiHeader()

Reads the siteinfo tag of the MediaWiki XML file and extracts data that will be used in constructing page summaries.

public readMediaWikiHeader() : mixed
Return values
mixed

reset()

Resets the iterator to the start of the archive bundle

public abstract reset() : mixed
Return values
mixed

restoreCheckPoint()

Restores the internal state from the file iterate_status.txt in the result dir such that the next call to nextPages will pick up from just after the last checkpoint. We also set up our regex substitutions again.

public restoreCheckPoint() : array<string|int, mixed>
Return values
array<string|int, mixed>

the data serialized when saveCheckpoint was called

restoreCheckpoint()

Restores the internal state from the file iterate_status.txt in the result dir such that the next call to nextPages will pick up from just after the last checkpoint. Each iterator should make a call to restoreCheckpoint at the end of the constructor method after the instance members have been initialized.

public restoreCheckpoint() : array<string|int, mixed>
Return values
array<string|int, mixed>

the data serialized when saveCheckpoint was called

saveCheckPoint()

Stores the current progress to the file iterate_status.txt in the result dir such that a new instance of the iterator could be constructed and return the next set of pages without having to process all of the pages that came before. Each iterator should make a call to saveCheckpoint after extracting a batch of pages.

public saveCheckPoint([array<string|int, mixed> $info = [] ]) : mixed
Parameters
$info : array<string|int, mixed> = []

any extra info a subclass wants to save

Return values
mixed

saveCheckpoint()

Stores the current progress to the file iterate_status.txt in the result dir such that a new instance of the iterator could be constructed and return the next set of pages without having to process all of the pages that came before. Each iterator should make a call to saveCheckpoint after extracting a batch of pages.

public saveCheckpoint([array<string|int, mixed> $info = [] ]) : mixed
Parameters
$info : array<string|int, mixed> = []

any extra info a subclass wants to save

Return values
mixed
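
The checkpoint round trip described for the save and restore methods can be sketched with serialize() and unserialize(). The functions saveStatus and restoreStatus are hypothetical, and the particular fields stored are assumptions chosen from the properties documented on this page; only the file name iterate_status.txt and the use of serialization come from the descriptions.

```php
<?php
/** Hypothetical sketch: writes iterator progress to iterate_status.txt. */
function saveStatus(string $result_dir, array $info): void
{
    file_put_contents($result_dir . "/iterate_status.txt", serialize($info));
}

/** Hypothetical sketch: reads progress back, or null if none was saved. */
function restoreStatus(string $result_dir): ?array
{
    $file = $result_dir . "/iterate_status.txt";
    if (!file_exists($file)) {
        return null;
    }
    $info = unserialize(file_get_contents($file));
    return is_array($info) ? $info : null;
}

// Use a throwaway directory so the example cleans up after itself.
$result_dir = sys_get_temp_dir() . "/checkpoint_demo_" . uniqid();
mkdir($result_dir);
$before = restoreStatus($result_dir);
saveStatus($result_dir, [
    "current_partition_num" => 2,
    "current_page_num" => 1500,
    "current_offset" => 8192,
]);
$restored = restoreStatus($result_dir);
unlink($result_dir . "/iterate_status.txt");
rmdir($result_dir);
echo $restored["current_page_num"], "\n";
```

A fresh iterator instance that loads such a file can resume from the stored partition and offset instead of re-reading every earlier page.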

seekPage()

Advances the iterator to the $limit page, with as little additional processing as possible

public seekPage( $limit) : mixed
Parameters
$limit :

page to advance to

Return values
mixed

setIniInfo()

Mutator method for controlling how this text archive iterator behaves. Normally, the data on compression and on start and stop delimiters is read from an ini file. This method reads it from the supplied array instead.

public setIniInfo(array<string|int, mixed> $ini) : mixed
Parameters
$ini : array<string|int, mixed>

configuration settings for this archive iterator

Return values
mixed

updateBuffer()

If reading from a gzbuffer file goes off the end of the current buffer, reads in the next block from archive file.

public updateBuffer([string $buffer = "" ][, bool $return_string = false ]) : bool
Parameters
$buffer : string = ""
$return_string : bool = false
Return values
bool

whether successfully read in next block or not

updatePartition()

Helper function for nextChunk to advance the partition if we are at the end of the current archive file

public updatePartition(array<string|int, mixed> &$info) : mixed
Parameters
$info : array<string|int, mixed>

a struct with data about the current chunk; this method will update its start partition flag

Return values
mixed

weight()

Estimates the importance of the site according to the weighting of the particular archive iterator

public weight( &$site) : int
Parameters
$site :

an associative array containing info about a web page

Return values
int

a 4-bit number based on the log_2 size - 10 of the wiki entry (@see nextPage).
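
One way to read "a 4-bit number based on the log_2 size - 10" is: take the floor of log base 2 of the entry size in bytes, subtract 10, and clamp the result to the range 0 to 15. The sketch below implements that reading; sizeWeight is a hypothetical function, and both the flooring and the clamping are assumptions rather than documented behaviour.

```php
<?php
/**
 * Hypothetical sketch: maps a size in bytes to a 4-bit weight using
 * floor(log_2(size)) - 10, clamped to 0..15 (clamping is an assumption).
 */
function sizeWeight(int $size): int
{
    if ($size < 1) {
        return 0;
    }
    // Integer floor(log_2($size)) via repeated halving, avoiding
    // floating point rounding issues.
    $log2 = 0;
    while ($size > 1) {
        $size >>= 1;
        $log2++;
    }
    return max(0, min(15, $log2 - 10));
}

$small = sizeWeight(500);       // under 1 KiB
$kib = sizeWeight(1024);        // exactly 2^10 bytes
$medium = sizeWeight(65536);    // 2^16 bytes
$huge = sizeWeight(1 << 30);    // 2^30 bytes
echo "$small $kib $medium $huge\n";
```

Under this reading, entries of about a kilobyte or less all receive weight 0, and the weight rises by one each time the entry size doubles.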
