Yioop_V9.5_Source_Code_Documentation

WebArchiveBundle
in package

A web archive bundle is a collection of web archives that are managed together. It is useful to split data across several archive files rather than store it all in one, both for read efficiency and to keep file sizes from getting too big. In some places 4-byte ints are used to store file offsets, which restricts the size of the files that can be used for web archives.
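To make the file-size restriction concrete, here is a minimal sketch of storing a byte offset in 4 bytes. The helper names `packOffset()` and `unpackOffset()` and the constant `MAX_FOUR_BYTE_OFFSET` are hypothetical and are not part of this class; the sketch also assumes an unsigned, big-endian encoding, which this page does not specify. Under that assumption no offset at or beyond 2^32 bytes (4 GiB) can be recorded, which is one reason to spread data over several partitions.

```php
<?php
// Largest value an unsigned 4-byte integer can hold: 2^32 - 1.
// Assumes a 64-bit PHP build, where this literal is an int rather than a float.
const MAX_FOUR_BYTE_OFFSET = 0xFFFFFFFF;

/** Packs a byte offset into exactly 4 bytes (big-endian); throws if it does not fit. */
function packOffset(int $offset): string
{
    if ($offset < 0 || $offset > MAX_FOUR_BYTE_OFFSET) {
        throw new RangeException("Offset $offset does not fit in 4 bytes");
    }
    return pack("N", $offset);
}

/** Recovers the byte offset from its 4-byte form. */
function unpackOffset(string $bytes): int
{
    return unpack("N", $bytes)[1];
}

$packed = packOffset(123456);
echo strlen($packed), " bytes, round trip: ", unpackOffset($packed), "\n";
```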

Tags
author

Chris Pollett

Table of Contents

$compressor  : object
Compressor object used to compress/uncompress data stored in the bundle
$count  : int
Total number of page objects stored by this WebArchiveBundle
$description  : string
A short text name for this WebArchiveBundle
$dir_name  : string
Folder name to use for this WebArchiveBundle
$partition  : array<string|int, mixed>
Used to contain the WebArchive partitions of the bundle
$read_only_archive  : bool
Controls whether the archive was opened in read only mode
$version  : int
What version of web archive bundle this is
$write_partition  : int
The index of the partition to which new documents will be added
__construct()  : mixed
Makes a new WebArchiveBundle, or initializes an existing one, with the given characteristics
addCount()  : mixed
Updates the description file with the current count for the number of items in the WebArchiveBundle. If $field is given, counts of additional properties (say, visited URLs versus total URLs) can be maintained.
addPages()  : int
Adds the array of $pages to the WebArchiveBundle. The pages are stored in the partition given by the write partition, and the resulting offsets are recorded in the field named by $offset_field.
getArchiveInfo()  : array<string|int, mixed>
Gets information about a WebArchiveBundle out of its description.txt file
getPage()  : array<string|int, mixed>
Gets a page from WebArchive $partition using the provided byte $offset, reusing an existing $file_handle if possible.
getParamModifiedTime()  : mixed
Returns the last time the archive info of the bundle was modified.
getPartition()  : object
Gets an object encapsulating the WebArchive partition at position $index in this bundle.
initCountIfNotExists()  : mixed
Creates a new counter to be maintained in the description.txt file if the counter doesn't exist; otherwise leaves it unchanged
setArchiveInfo()  : mixed
Sets the archive info (DESCRIPTION, COUNT, NUM_DOCS_PER_PARTITION) for this web archive
setWritePartition()  : mixed
Sets the write partition to the provided value and, if this is not a read-only archive, stores this value persistently to the archive info

Properties

$compressor

Compressor object used to compress/uncompress data stored in the bundle

public object $compressor

$count

Total number of page objects stored by this WebArchiveBundle

public int $count

$description

A short text name for this WebArchiveBundle

public string $description

$dir_name

Folder name to use for this WebArchiveBundle

public string $dir_name

$partition

Used to contain the WebArchive partitions of the bundle

public array<string|int, mixed> $partition = []

$read_only_archive

Controls whether the archive was opened in read only mode

public bool $read_only_archive

$write_partition

The index of the partition to which new documents will be added

public int $write_partition

Methods

__construct()

Makes a new WebArchiveBundle, or initializes an existing one, with the given characteristics

public __construct(string $dir_name[, bool $read_only_archive = true ][, int $num_docs_per_partition = C\NUM_DOCS_PER_PARTITION ][, string $description = null ][, string $compressor = "GzipCompressor" ]) : mixed
Parameters
$dir_name : string

folder name of the bundle

$read_only_archive : bool = true

whether to open the archive in a read-only mode suitable for obtaining search results, or to open it in a read-write mode as used during a crawl

$num_docs_per_partition : int = C\NUM_DOCS_PER_PARTITION

number of documents to store in a partition before switching to a new web archive

$description : string = null

a short text name/description of this WebArchiveBundle

$compressor : string = "GzipCompressor"

name of the Compressor used to compress/uncompress data stored in the bundle

Return values
mixed

addCount()

Updates the description file with the current count for the number of items in the WebArchiveBundle. If $field is given, counts of additional properties (say, visited URLs versus total URLs) can be maintained.

public addCount(int $num[, string $field = "COUNT" ]) : mixed
Parameters
$num : int

number of items to add to current count

$field : string = "COUNT"

field of the info struct whose count should be increased

Return values
mixed

addPages()

Adds the array of $pages to the WebArchiveBundle. The pages are stored in the partition given by the write partition, and the resulting offsets are recorded in the field named by $offset_field.

public addPages(string $offset_field, array<string|int, mixed> &$pages) : int
Parameters
$offset_field : string

field used to record offsets after storing

$pages : array<string|int, mixed>

data to store

Return values
int

the write_partition the pages were stored in

getArchiveInfo()

Gets information about a WebArchiveBundle out of its description.txt file

public static getArchiveInfo(string $dir_name) : array<string|int, mixed>
Parameters
$dir_name : string

folder name of the WebArchiveBundle to get info for

Return values
array<string|int, mixed>

containing the name (description) of the WebArchiveBundle, the number of items stored in it, and the number of WebArchive file partitions it uses.

getPage()

Gets a page from WebArchive $partition using the provided byte $offset, reusing an existing $file_handle if possible.

public getPage(int $offset, int $partition) : array<string|int, mixed>
Parameters
$offset : int

byte offset of page data

$partition : int

which WebArchive to look in

Return values
array<string|int, mixed>

desired page
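The relationship between addPages() and getPage() can be illustrated with a toy in-memory model. This is only a sketch and is not Yioop's implementation: the class `ToyBundle`, its length-prefixed record layout, the use of `serialize()`, and the rule that a full partition triggers a rollover to the next one are all assumptions made for illustration. What it does mirror from the signatures above is that addPages() writes each page's offset back into the caller's array under `$offset_field` and returns the partition index, and that getPage() needs both values to find the page again.

```php
<?php
/**
 * Hypothetical in-memory stand-in for a bundle; not Yioop's implementation.
 * Each partition is a byte string of length-prefixed, serialized pages.
 */
class ToyBundle
{
    /** @var string[] raw bytes of each partition */
    public $partitions = [""];
    /** @var int[] number of pages held by each partition */
    public $counts = [0];
    /** @var int index of the partition new pages go to */
    public $write_partition = 0;
    /** @var int assumed rollover threshold */
    public $num_docs_per_partition;

    public function __construct(int $num_docs_per_partition)
    {
        $this->num_docs_per_partition = $num_docs_per_partition;
    }

    public function addPages(string $offset_field, array &$pages): int
    {
        // Assumption: start a fresh partition once the current one is full.
        if ($this->counts[$this->write_partition] >=
            $this->num_docs_per_partition) {
            $this->write_partition++;
            $this->partitions[$this->write_partition] = "";
            $this->counts[$this->write_partition] = 0;
        }
        $i = $this->write_partition;
        foreach ($pages as &$page) {
            // Record where this page will start before appending it.
            $page[$offset_field] = strlen($this->partitions[$i]);
            $record = serialize($page);
            $this->partitions[$i] .= pack("N", strlen($record)) . $record;
        }
        unset($page); // break the reference left behind by foreach
        $this->counts[$i] += count($pages);
        return $i;
    }

    public function getPage(int $offset, int $partition): array
    {
        $data = $this->partitions[$partition];
        // Read the 4-byte length prefix, then the serialized page after it.
        $length = unpack("N", substr($data, $offset, 4))[1];
        return unserialize(substr($data, $offset + 4, $length));
    }
}

$bundle = new ToyBundle(2);
$pages = [["url" => "https://example.com/a"],
    ["url" => "https://example.com/b"]];
$partition = $bundle->addPages("offset", $pages);
$page = $bundle->getPage($pages[1]["offset"], $partition);
echo $page["url"], "\n";
```

In this model a caller has to keep the (partition, offset) pair for every page it wants to retrieve later, which is why addPages() both returns the partition and fills in the offset field.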

getParamModifiedTime()

Returns the last time the archive info of the bundle was modified.

public static getParamModifiedTime(string $dir_name) : mixed
Parameters
$dir_name : string

folder with archive bundle

Return values
mixed

getPartition()

Gets an object encapsulating the WebArchive partition at position $index in this bundle.

public getPartition(int $index[, bool $fast_construct = true ]) : object
Parameters
$index : int

the number of the partition within this bundle to return

$fast_construct : bool = true

tells the constructor of the WebArchive to avoid reading in its info block.

Return values
object

the WebArchive file which was requested

initCountIfNotExists()

Creates a new counter to be maintained in the description.txt file if the counter doesn't exist; otherwise leaves it unchanged

public initCountIfNotExists([string $field = "COUNT" ]) : mixed
Parameters
$field : string = "COUNT"

field of info struct to add a counter for

Return values
mixed
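The counter behaviour described for addCount() and initCountIfNotExists() can be sketched on a plain array standing in for the info struct. The functions `initCounter()` and `bumpCounter()` and the field name `VISITED_URLS_COUNT` are hypothetical, and the sketch leaves out the step of writing the struct back to description.txt. It also assumes that adding to a missing counter starts it from zero, which this page does not state.

```php
<?php
/** Creates $info[$field] with value 0 only if it is missing. */
function initCounter(array &$info, string $field = "COUNT"): void
{
    if (!isset($info[$field])) {
        $info[$field] = 0;
    }
}

/** Adds $num to $info[$field], starting the counter at 0 if needed. */
function bumpCounter(array &$info, int $num, string $field = "COUNT"): void
{
    initCounter($info, $field); // assumption: a missing counter starts at 0
    $info[$field] += $num;
}

$info = ["DESCRIPTION" => "demo crawl", "COUNT" => 10];
initCounter($info);                            // COUNT exists, left unchanged
bumpCounter($info, 5);                         // total items
bumpCounter($info, 3, "VISITED_URLS_COUNT");   // a second, independent counter
print_r($info);
```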

setArchiveInfo()

Sets the archive info (DESCRIPTION, COUNT, NUM_DOCS_PER_PARTITION) for this web archive

public static setArchiveInfo(string $dir_name, array<string|int, mixed> $info) : mixed
Parameters
$dir_name : string

folder with archive bundle

$info : array<string|int, mixed>

struct with above fields

Return values
mixed
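As a rough picture of how getArchiveInfo(), setArchiveInfo(), and getParamModifiedTime() relate, the sketch below round-trips an info struct through a description.txt file and reads the file's modification time. The helpers `writeInfo()`, `readInfo()`, and `infoModifiedTime()` are hypothetical, and storing the struct with `serialize()` is an assumption; this page does not say how description.txt is encoded.

```php
<?php
/** Writes the info struct to description.txt (assumed: PHP-serialized). */
function writeInfo(string $dir_name, array $info): void
{
    file_put_contents($dir_name . "/description.txt", serialize($info));
}

/** Reads the info struct back, or returns [] if the file is missing or bad. */
function readInfo(string $dir_name): array
{
    $file = $dir_name . "/description.txt";
    if (!file_exists($file)) {
        return [];
    }
    $info = unserialize(file_get_contents($file));
    return is_array($info) ? $info : [];
}

/** Returns the Unix time description.txt was last modified, or false. */
function infoModifiedTime(string $dir_name)
{
    $file = $dir_name . "/description.txt";
    clearstatcache(true, $file); // avoid a stale cached timestamp
    return file_exists($file) ? filemtime($file) : false;
}

// Demo in a throwaway directory.
$dir = sys_get_temp_dir() . "/toy_bundle_" . uniqid();
mkdir($dir);
writeInfo($dir, ["DESCRIPTION" => "demo crawl", "COUNT" => 0,
    "NUM_DOCS_PER_PARTITION" => 50000]);
$info = readInfo($dir);
echo $info["DESCRIPTION"], " modified at ", infoModifiedTime($dir), "\n";
unlink($dir . "/description.txt");
rmdir($dir);
```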

setWritePartition()

Sets the write partition to the provided value and, if this is not a read-only archive, stores this value persistently to the archive info

public setWritePartition(int $i) : mixed
Parameters
$i : int

the number of the current write partition

Return values
mixed

        
