WebArchiveBundle
A web archive bundle is a collection of web archives which are managed together. It is useful to split data across several archive files rather than store it all in one, both for read efficiency and to keep file sizes from getting too big. In some places 4-byte ints are used to store file offsets, which restricts the size of the files that can be used for web archives.
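To make the file-size restriction concrete, the following minimal sketch packs a byte offset into 4 bytes with PHP's pack(). The unsigned big-endian "N" format and the helper names packOffset() and unpackOffset() are assumptions for illustration; this page does not say which format the archive code uses. With an unsigned format the largest offset is 4294967295 (just under 4 GiB); a signed format would halve that. The sketch assumes a 64-bit PHP build.

```php
<?php
// Assumption: offsets are stored as unsigned 32-bit big-endian integers.
// The real storage format is not stated in this documentation.
function packOffset(int $offset): string
{
    if ($offset < 0 || $offset > 0xFFFFFFFF) {
        throw new RangeException("offset does not fit in 4 bytes");
    }
    return pack("N", $offset);
}

// Reverses packOffset(): turns 4 bytes back into an integer offset
function unpackOffset(string $bytes): int
{
    return unpack("N", $bytes)[1];
}

$max_offset = 0xFFFFFFFF; // largest offset a 4-byte unsigned int can hold
$packed = packOffset($max_offset);
echo strlen($packed), "\n";       // bytes used to store the offset
echo unpackOffset($packed), "\n"; // the offset read back
```

Any file larger than the maximum offset would contain records that such a field cannot address, which is one reason to cap the size of each archive file.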
Table of Contents
- $compressor : object
- The Compressor object used to compress/uncompress data stored in the bundle
- $count : int
- Total number of page objects stored by this WebArchiveBundle
- $description : string
- A short text name for this WebArchiveBundle
- $dir_name : string
- Folder name to use for this WebArchiveBundle
- $partition : array<string|int, mixed>
- Used to contain the WebArchive partitions of the bundle
- $read_only_archive : bool
- Controls whether the archive was opened in read only mode
- $version : int
- What version of web archive bundle this is
- $write_partition : int
- The index of the partition to which new documents will be added
- __construct() : mixed
- Makes a new WebArchiveBundle or initializes an existing one with the given characteristics
- addCount() : mixed
- Updates the description file with the current count for the number of items in the WebArchiveBundle. If the $field argument is used, counts of additional properties (say, visited URLs versus total URLs) can be maintained.
- addPages() : int
- Adds the array of $pages to the WebArchiveBundle. Pages are stored in the partition given by the write partition, and the resulting offsets are recorded in the field named by $offset_field.
- getArchiveInfo() : array<string|int, mixed>
- Gets information about a WebArchiveBundle out of its description.txt file
- getPage() : array<string|int, mixed>
- Gets a page from the WebArchive given by $partition using the provided byte $offset, reusing an existing $file_handle if possible.
- getParamModifiedTime() : mixed
- Returns the last time the archive info of the bundle was modified.
- getPartition() : object
- Gets an object encapsulating the WebArchive partition with the given $index in this bundle.
- initCountIfNotExists() : mixed
- Creates a new counter to be maintained in the description.txt file if the counter doesn't exist; otherwise leaves it unchanged
- setArchiveInfo() : mixed
- Sets the archive info (DESCRIPTION, COUNT, NUM_DOCS_PER_PARTITION) for this web archive
- setWritePartition() : mixed
- Sets the write partition to the provided value and, if this is not a read-only archive, stores this value persistently in the archive info
Properties
$compressor
The Compressor object used to compress/uncompress data stored in the bundle
public
object
$compressor
$count
Total number of page objects stored by this WebArchiveBundle
public
int
$count
$description
A short text name for this WebArchiveBundle
public
string
$description
$dir_name
Folder name to use for this WebArchiveBundle
public
string
$dir_name
$partition
Used to contain the WebArchive partitions of the bundle
public
array<string|int, mixed>
$partition
= []
$read_only_archive
Controls whether the archive was opened in read only mode
public
bool
$read_only_archive
$version
What version of web archive bundle this is
public
int
$version
$write_partition
The index of the partition to which new documents will be added
public
int
$write_partition
Methods
__construct()
Makes a new WebArchiveBundle or initializes an existing one with the given characteristics
public
__construct(string $dir_name[, bool $read_only_archive = true ][, int $num_docs_per_partition = CNUM_DOCS_PER_PARTITION ][, string $description = null ][, string $compressor = "GzipCompressor" ]) : mixed
Parameters
- $dir_name : string
-
folder name of the bundle
- $read_only_archive : bool = true
-
whether to open the archive in a read-only mode suitable for obtaining search results, or in a read-write mode as used during a crawl
- $num_docs_per_partition : int = CNUM_DOCS_PER_PARTITION
-
number of documents stored in a partition before writing moves on to a new web archive
- $description : string = null
-
a short text name/description of this WebArchiveBundle
- $compressor : string = "GzipCompressor"
-
the Compressor object used to compress/uncompress data stored in the bundle
Return values
mixed
addCount()
Updates the description file with the current count for the number of items in the WebArchiveBundle. If the $field argument is used, counts of additional properties (say, visited URLs versus total URLs) can be maintained.
public
addCount(int $num[, string $field = "COUNT" ]) : mixed
Parameters
- $num : int
-
number of items to add to current count
- $field : string = "COUNT"
-
field of the info struct whose count is to be increased
Return values
mixed
addPages()
Adds the array of $pages to the WebArchiveBundle. Pages are stored in the partition given by the write partition, and the resulting offsets are recorded in the field named by $offset_field.
public
addPages(string $offset_field, array<string|int, mixed> &$pages) : int
Parameters
- $offset_field : string
-
field used to record offsets after storing
- $pages : array<string|int, mixed>
-
data to store
Return values
int —the write_partition the pages were stored in
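The way offsets are written back into $pages can be sketched as follows. This is a simplified stand-in and not the library's implementation: the partition is an in-memory string, records are plain serialized arrays, and the function name storePages() is hypothetical. It shows why $pages is taken by reference: after the call, each page carries its storage offset in the field named by $offset_field.

```php
<?php
// Simplified stand-in, not the library's code: the "partition" is an
// in-memory string and each record is a serialized page array.
function storePages(string &$partition_data, string $offset_field,
    array &$pages): void
{
    foreach ($pages as $i => $page) {
        // A new record starts where the partition data currently ends
        $offset = strlen($partition_data);
        $partition_data .= serialize($page);
        // $pages is passed by reference, so the caller sees this field
        $pages[$i][$offset_field] = $offset;
    }
}

$partition_data = "";
$pages = [
    ["url" => "https://example.com/a", "title" => "A"],
    ["url" => "https://example.com/b", "title" => "B"],
];
storePages($partition_data, "offset", $pages);
print_r(array_column($pages, "offset"));
```

The first page is recorded at offset 0, and each later page starts at the total length of the records stored before it.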
getArchiveInfo()
Gets information about a WebArchiveBundle out of its description.txt file
public
static getArchiveInfo(string $dir_name) : array<string|int, mixed>
Parameters
- $dir_name : string
-
folder name of the WebArchiveBundle to get info for
Return values
array<string|int, mixed> —containing the name (description) of the WebArchiveBundle, the number of items stored in it, and the number of WebArchive file partitions it uses.
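As an illustration of keeping bundle information and counters in a description.txt file, here is a minimal sketch. It assumes the file holds a PHP-serialized array, which this page does not confirm. The helpers writeInfo(), readInfo() and addToCount() and the VISITED_URLS_COUNT field are hypothetical names; only DESCRIPTION, COUNT and NUM_DOCS_PER_PARTITION come from the documentation.

```php
<?php
// Assumption: description.txt holds a PHP-serialized array. The file's
// real format is not stated here, so treat this layout as illustrative.
function writeInfo(string $dir_name, array $info): void
{
    file_put_contents($dir_name . "/description.txt", serialize($info));
}

function readInfo(string $dir_name): array
{
    return unserialize(file_get_contents($dir_name . "/description.txt"));
}

// Adds $num to a counter field, starting the counter at zero if missing
function addToCount(string $dir_name, int $num,
    string $field = "COUNT"): void
{
    $info = readInfo($dir_name);
    $info[$field] = ($info[$field] ?? 0) + $num;
    writeInfo($dir_name, $info);
}

// Use a throwaway folder so the example can run on its own
$dir_name = sys_get_temp_dir() . "/bundle_sketch_" . uniqid();
mkdir($dir_name);
writeInfo($dir_name, ["DESCRIPTION" => "demo bundle", "COUNT" => 0,
    "NUM_DOCS_PER_PARTITION" => 1000]);
addToCount($dir_name, 25);
addToCount($dir_name, 10, "VISITED_URLS_COUNT"); // hypothetical field
$info = readInfo($dir_name);
echo $info["COUNT"], " ", $info["VISITED_URLS_COUNT"], "\n";
unlink($dir_name . "/description.txt");
rmdir($dir_name);
```

Keeping every counter in the same info array means one read and one write of the file per update, at the cost of rewriting the whole file each time.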
getPage()
Gets a page from the WebArchive given by $partition using the provided byte $offset, reusing an existing $file_handle if possible.
public
getPage(int $offset, int $partition) : array<string|int, mixed>
Parameters
- $offset : int
-
byte offset of page data
- $partition : int
-
which WebArchive to look in
Return values
array<string|int, mixed> —desired page
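Reading a page by byte offset can be sketched with fseek() and fread(). The record layout below (a 4-byte big-endian length followed by a serialized page array) and the function name readPageAt() are assumptions for illustration; the real archive format and any compression step are not described on this page.

```php
<?php
// Hypothetical record layout, not the library's real format:
// a 4-byte big-endian length followed by a serialized page array.
function readPageAt(string $file, int $offset): array
{
    $fh = fopen($file, "rb");
    fseek($fh, $offset);                     // jump straight to the record
    $length = unpack("N", fread($fh, 4))[1]; // size of the record body
    $page = unserialize(fread($fh, $length));
    fclose($fh);
    return $page;
}

// Build a small file with two records so the example runs on its own
$file = tempnam(sys_get_temp_dir(), "partition");
$offsets = [];
$data = "";
foreach ([["title" => "first"], ["title" => "second"]] as $record_page) {
    $offsets[] = strlen($data);
    $record = serialize($record_page);
    $data .= pack("N", strlen($record)) . $record;
}
file_put_contents($file, $data);

$page = readPageAt($file, $offsets[1]);
echo $page["title"], "\n";
unlink($file);
```

Because the offset points directly at the record, the read does not need to scan earlier records in the file.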
getParamModifiedTime()
Returns the last time the archive info of the bundle was modified.
public
static getParamModifiedTime(string $dir_name) : mixed
Parameters
- $dir_name : string
-
folder with archive bundle
Return values
mixed
getPartition()
Gets an object encapsulating the WebArchive partition with the given $index in this bundle.
public
getPartition(int $index[, bool $fast_construct = true ]) : object
Parameters
- $index : int
-
the number of the partition within this bundle to return
- $fast_construct : bool = true
-
tells the constructor of the WebArchive to avoid reading in its info block.
Return values
object —the WebArchive file which was requested
initCountIfNotExists()
Creates a new counter to be maintained in the description.txt file if the counter doesn't exist; otherwise leaves it unchanged
public
initCountIfNotExists([string $field = "COUNT" ]) : mixed
Parameters
- $field : string = "COUNT"
-
field of info struct to add a counter for
Return values
mixed
setArchiveInfo()
Sets the archive info (DESCRIPTION, COUNT, NUM_DOCS_PER_PARTITION) for this web archive
public
static setArchiveInfo(string $dir_name, array<string|int, mixed> $info) : mixed
Parameters
- $dir_name : string
-
folder with archive bundle
- $info : array<string|int, mixed>
-
struct with above fields
Return values
mixed
setWritePartition()
Sets the write partition to the provided value and, if this is not a read-only archive, stores this value persistently in the archive info
public
setWritePartition(int $i) : mixed
Parameters
- $i : int
-
the number of the current write partition
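How the write partition might advance as documents accumulate can be sketched as below. The rollover rule is an assumption for illustration, since this page does not spell out when the write partition changes: if a batch would push the current partition past $num_docs_per_partition, the next partition is chosen. The function name chooseWritePartition() is hypothetical.

```php
<?php
// Hypothetical rollover rule, assumed for illustration: when a batch would
// push the current partition past $num_docs_per_partition, move on to the
// next partition before storing the batch.
function chooseWritePartition(int $write_partition, int $docs_in_partition,
    int $batch_size, int $num_docs_per_partition): int
{
    if ($docs_in_partition + $batch_size > $num_docs_per_partition) {
        return $write_partition + 1;
    }
    return $write_partition;
}

echo chooseWritePartition(0, 900, 50, 1000), "\n";  // batch fits: stays 0
echo chooseWritePartition(0, 900, 200, 1000), "\n"; // overflow: moves to 1
```

In a read-write bundle the chosen index would then be persisted, as setWritePartition() is documented to do, so that a restarted crawl resumes in the same partition.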