Yioop_V9.5_Source_Code_Documentation

AnalyticsJob extends MediaJob
A media job used to periodically calculate summary statistics about group, thread, page, and query impressions.

Table of Contents

NUM_TIMES_INTERVAL  = 50
For size and time distributions, the number of multiples of the minimal recorded interval (DOWNLOAD_SIZE_INTERVAL for size) at which to check for pages with that size or download time
STATISTIC_REFRESH_RATE  = \seekquarry\yioop\configs\ANALYTICS_UPDATE_INTERVAL / 2
While computing the statistics page, the number of seconds until a page refresh occurs and the progress so far is saved
$controller  : object
If the MediaJob was instantiated in the web app, the controller that instantiated it
$crawl_model  : object
Used to get crawl seed info
$impression_model  : object
Used to get statistics from DBMS about wiki and thread views
$machine_model  : object
Used to determine which queue servers are available and which might have information about a crawl
$media_updater  : object
If the MediaJob was instantiated in a MediaUpdater, this is a reference to that updater
$name_server_does_client_tasks  : bool
Whether to run the job's client tasks on the name server in addition to prepareTasks and finishTasks
$name_server_does_client_tasks_only  : bool
Whether this MediaJob performs name server only tasks
$phrase_model  : object
Used to get crawl statistics, such as the number of each kind of HTTP response seen during a crawl
$tasks  : array<string|int, mixed>
The tasks for this MediaJob most recently received from the name server
$update_time  : int
Time in current epoch when analytics were last updated
__construct()  : mixed
Instantiates the MediaJob with a reference to the object that instantiated it
checkPrerequisites()  : bool
Only update if it's been more than an hour since the last update
computeCrawlStatistics()  : mixed
Runs the queries necessary to determine the HTTP response code distribution, filetype distribution, number of hosts, language distribution, OS distribution, server distribution, site distribution, file size distribution, download time distribution, etc., for a web crawl for which statistics have been requested but not yet computed.
countQuery()  : int
Performs the provided $query of a web crawl (potentially distributed across queue servers). Returns the count of the number of results that would be returned by that query.
doTasks()  : mixed
Calls ImpressionModel to actually calculate various impression totals since the last update
execNameServer()  : array<string|int, mixed>
Executes a method on the name server's JobController.
finishTasks()  : mixed
This method is called on the name server to finish processing any data returned by MediaUpdater clients.
getCurrentMachine()  : string
Returns a hash of the url of the current machine based on the value saved to self::current_machine_info_file by a machine statuses request
getJobName()  : string
Gets the class name (less namespace and the word Job) of the current MediaJob
getTasks()  : array<string|int, mixed>
Method called from JobController when a MediaUpdater client contacts the name server's web app. This method is supposed to marshal any data on the name server that the requesting client should process.
init()  : mixed
Initializes the time of last analytics update
nondistributedTasks()  : mixed
For now, the analytics update is only done on the name server, as Yioop currently only supports one DBMS at a time.
prepareTasks()  : mixed
This method is called on the name server to prepare data for any MediaUpdater clients.
putTasks()  : array<string|int, mixed>
After a MediaUpdater client is done with the task given to it by the name server's media updater, the client contacts the name server's web app. The JobController of the name server's web app then calls this method to receive the data on the name server.
run()  : mixed
Method executed by MediaUpdater to perform the MediaJob. This method shouldn't need to be overridden. Instead, the various callbacks it calls (listed in the class description) should be overridden.

Constants

NUM_TIMES_INTERVAL

For size and time distributions, the number of multiples of the minimal recorded interval (DOWNLOAD_SIZE_INTERVAL for size) at which to check for pages with that size or download time

public mixed NUM_TIMES_INTERVAL = 50
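
To make the role of NUM_TIMES_INTERVAL more concrete, the following is a minimal sketch of how a size distribution might be divided into buckets. It is not taken from the Yioop source: the value given to DOWNLOAD_SIZE_INTERVAL, the helper name sizeBucketLowerBounds, and the choice to start counting at zero are all assumptions made for illustration.

```php
<?php
// Assumed value for illustration only; the real DOWNLOAD_SIZE_INTERVAL
// lives in Yioop's configuration and may differ.
const DOWNLOAD_SIZE_INTERVAL = 5000;
const NUM_TIMES_INTERVAL = 50;

/**
 * Hypothetical helper: returns the lower bound (in bytes) of each size
 * bucket that a distribution with $num_times buckets would check.
 */
function sizeBucketLowerBounds(int $interval, int $num_times): array
{
    $bounds = [];
    for ($i = 0; $i < $num_times; $i++) {
        $bounds[] = $i * $interval;
    }
    return $bounds;
}

$bounds = sizeBucketLowerBounds(DOWNLOAD_SIZE_INTERVAL, NUM_TIMES_INTERVAL);
echo count($bounds), " buckets, last one starts at ", end($bounds), " bytes\n";
```

Under these assumptions, each bucket would correspond to one count query for pages whose size falls in that range.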

STATISTIC_REFRESH_RATE

While computing the statistics page, the number of seconds until a page refresh occurs and the progress so far is saved

public mixed STATISTIC_REFRESH_RATE = \seekquarry\yioop\configs\ANALYTICS_UPDATE_INTERVAL / 2

Properties

$controller

If the MediaJob was instantiated in the web app, the controller that instantiated it

public object $controller

$crawl_model

Used to get crawl seed info

public object $crawl_model

$impression_model

Used to get statistics from DBMS about wiki and thread views

public object $impression_model

$machine_model

Used to determine which queue servers are available and which might have information about a crawl

public object $machine_model

$media_updater

If the MediaJob was instantiated in a MediaUpdater, this is a reference to that updater

public object $media_updater

$name_server_does_client_tasks

Whether to run the job's client tasks on the name server in addition to prepareTasks and finishTasks

public bool $name_server_does_client_tasks

$name_server_does_client_tasks_only

Whether this MediaJob performs name server only tasks

public bool $name_server_does_client_tasks_only

$phrase_model

Used to get crawl statistics, such as the number of each kind of HTTP response seen during a crawl

public object $phrase_model

$tasks

The tasks for this MediaJob most recently received from the name server

public array<string|int, mixed> $tasks

$update_time

Time in current epoch when analytics were last updated

public int $update_time

Methods

__construct()

Instantiates the MediaJob with a reference to the object that instantiated it

public __construct([object $media_updater = null ][, object $controller = null ]) : mixed
Parameters
$media_updater : object = null

a reference to the media updater that instantiated this object (if being run in MediaUpdater)

$controller : object = null

a reference to the controller that instantiated this object (if being run in the web app)

Return values
mixed

checkPrerequisites()

Only update if it's been more than an hour since the last update

public checkPrerequisites() : bool
Return values
bool

whether it's been more than an hour since the last update
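
The check described above amounts to a simple elapsed-time gate. The sketch below shows the idea in isolation; the function name shouldUpdate, the constant ONE_HOUR, and passing the current time in as a parameter are assumptions made for illustration, not the signature used by AnalyticsJob.

```php
<?php
// Assumed interval: the description only says "more than an hour".
const ONE_HOUR = 3600;

/**
 * Hypothetical helper: true when strictly more than $interval seconds
 * have passed between $last_update_time and $now.
 */
function shouldUpdate(int $last_update_time, int $now, int $interval = ONE_HOUR): bool
{
    return ($now - $last_update_time) > $interval;
}

$last = time() - 2 * ONE_HOUR; // pretend the last update was two hours ago
echo shouldUpdate($last, time()) ? "update\n" : "skip\n";
```

Taking the current time as an argument keeps the check easy to test without waiting for real time to pass.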

computeCrawlStatistics()

Runs the queries necessary to determine the HTTP response code distribution, filetype distribution, number of hosts, language distribution, OS distribution, server distribution, site distribution, file size distribution, download time distribution, etc., for a web crawl for which statistics have been requested but not yet computed.

public computeCrawlStatistics() : mixed

If these queries take too long, it saves partial results and returns.

Return values
mixed
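
The "save partial results and return" behaviour can be pictured as a resumable loop with a time budget. The following is only a sketch of that general pattern: the function computeWithBudget, the statistic names, and the injected clock are hypothetical and do not come from the Yioop source.

```php
<?php
/**
 * Hypothetical helper: computes each statistic in $steps that is not
 * already present in $partial, stopping once $budget seconds (as
 * measured by $clock) have elapsed. Returns the updated partial
 * results so a later call can resume where this one stopped.
 */
function computeWithBudget(array $steps, array $partial, float $budget, callable $clock): array
{
    $start = $clock();
    foreach ($steps as $name => $step) {
        if (array_key_exists($name, $partial)) {
            continue; // computed during an earlier pass
        }
        if ($clock() - $start >= $budget) {
            break; // out of time: keep what has been computed so far
        }
        $partial[$name] = $step();
    }
    return $partial;
}

// Fake clock that advances one "second" per call, so the demo is deterministic.
$ticks = 0;
$clock = function () use (&$ticks) {
    return $ticks++;
};
// Made-up statistic names with constant results.
$steps = [
    'CODE' => function () { return 1; },
    'FILETYPE' => function () { return 2; },
    'LANG' => function () { return 3; },
];
$first = computeWithBudget($steps, [], 2, $clock);
$second = computeWithBudget($steps, $first, 2, $clock);
$third = computeWithBudget($steps, $second, 2, $clock);
echo implode(",", array_keys($third)), "\n";
```

In this sketch each call makes some progress and hands back its results, which is roughly what a periodic refresh governed by STATISTIC_REFRESH_RATE would need.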

countQuery()

Performs the provided $query of a web crawl (potentially distributed across queue servers). Returns the count of the number of results that would be returned by that query.

public countQuery(string $query, string $index_timestamp, array<string|int, mixed> $machine_urls) : int
Parameters
$query : string

query to run and count the results of

$index_timestamp : string

timestamp of index to compute query count for

$machine_urls : array<string|int, mixed>

queue servers on which the count is to be computed

Return values
int

number of results that would be returned by the given query
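
One way to think about a count over a crawl distributed across queue servers is as a sum of per-machine counts. The sketch below illustrates only that idea; it does not reproduce how countQuery talks to queue servers. The function sumCountsAcrossMachines, the callback, and the example URLs are all hypothetical.

```php
<?php
/**
 * Hypothetical helper: asks $count_on_machine for the number of
 * matching results on each machine URL and returns the total.
 */
function sumCountsAcrossMachines(array $machine_urls, callable $count_on_machine): int
{
    $total = 0;
    foreach ($machine_urls as $url) {
        $total += (int) $count_on_machine($url);
    }
    return $total;
}

// Stand-in for real network requests: fixed counts per made-up machine URL.
$fake_counts = [
    'http://qs1.example/' => 120,
    'http://qs2.example/' => 80,
];
$total = sumCountsAcrossMachines(
    array_keys($fake_counts),
    function (string $url) use ($fake_counts): int {
        return $fake_counts[$url];
    }
);
echo $total, "\n";
```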

doTasks()

Calls ImpressionModel to actually calculate various impression totals since the last update

public doTasks(array<string|int, mixed> $tasks) : mixed
Parameters
$tasks : array<string|int, mixed>

array of info that came from getTasks (in this case, nothing)

Return values
mixed

the result of carrying out that processing

execNameServer()

Executes a method on the name server's JobController.

public static execNameServer(string $command[, string $args = null ]) : array<string|int, mixed>

It will typically execute either getTask or putTask for a specific MediaJob, or getUpdateProperties to find out how the current MediaUpdater should be configured.

Parameters
$command : string

the method to invoke on the name server

$args : string = null

additional arguments to be passed to the name server

Return values
array<string|int, mixed>

data returned by the name server.

finishTasks()

This method is called on the name server to finish processing any data returned by MediaUpdater clients.

public finishTasks() : mixed
Return values
mixed

getCurrentMachine()

Returns a hash of the url of the current machine based on the value saved to self::current_machine_info_file by a machine statuses request

public static getCurrentMachine() : string
Return values
string

hash of current machine url

getJobName()

Gets the class name (less namespace and the word Job) of the current MediaJob

public static getJobName() : string
Return values
string

name of the current job
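
As an illustration of what "less namespace and the word Job" means, here is a small standalone sketch. The function jobNameFromClass is hypothetical and is not Yioop's implementation, and the namespace in the example is only an assumed one.

```php
<?php
/**
 * Hypothetical helper: strips any namespace prefix and a trailing
 * "Job" from a fully qualified class name.
 */
function jobNameFromClass(string $class_name): string
{
    $pos = strrpos($class_name, '\\');
    $short = ($pos === false) ? $class_name : substr($class_name, $pos + 1);
    if (substr($short, -3) === 'Job') {
        $short = substr($short, 0, -3);
    }
    return $short;
}

// The namespace below is an assumption used only for the demo.
echo jobNameFromClass('seekquarry\\yioop\\library\\media_jobs\\AnalyticsJob'), "\n";
```

With this input, the sketch would yield Analytics as the job name.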

getTasks()

Method called from JobController when a MediaUpdater client contacts the name server's web app. This method is supposed to marshal any data on the name server that the requesting client should process.

public getTasks(int $machine_id[, array<string|int, mixed> $data = null ]) : array<string|int, mixed>
Parameters
$machine_id : int

id of client requesting data

$data : array<string|int, mixed> = null

any additional info about data being requested

Return values
array<string|int, mixed>

work for the client to process

init()

Initializes the time of last analytics update

public init() : mixed
Return values
mixed

nondistributedTasks()

For now, the analytics update is only done on the name server, as Yioop currently only supports one DBMS at a time.

public nondistributedTasks() : mixed
Return values
mixed

prepareTasks()

This method is called on the name server to prepare data for any MediaUpdater clients.

public prepareTasks() : mixed
Return values
mixed

putTasks()

After a MediaUpdater client is done with the task given to it by the name server's media updater, the client contacts the name server's web app. The JobController of the name server's web app then calls this method to receive the data on the name server.

public putTasks(int $machine_id, mixed $data) : array<string|int, mixed>
Parameters
$machine_id : int

id of client that is sending data to name server

$data : mixed

results of computation done by client

Return values
array<string|int, mixed>

any response information to send back to the client

run()

Method executed by MediaUpdater to perform the MediaJob. This method shouldn't need to be overridden. Instead, the various callbacks it calls (listed in the class description) should be overridden.

public run() : mixed
Return values
mixed
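
To show how the callbacks documented on this page might fit together, the sketch below uses a stand-in class that only records which callbacks would fire. The ordering and branching are an assumption inferred from the method and property descriptions here, not a copy of MediaJob::run(). The class SketchJob and its $is_name_server flag are hypothetical.

```php
<?php
// Illustrative stand-in, not Yioop's MediaJob: it logs callback names
// instead of doing any real work.
class SketchJob
{
    public $log = [];
    public $name_server_does_client_tasks = false;
    public $name_server_does_client_tasks_only = false;
    private $is_name_server;

    public function __construct(bool $is_name_server)
    {
        $this->is_name_server = $is_name_server;
    }

    public function checkPrerequisites(): bool
    {
        return true; // a real job would test its update interval here
    }

    public function run(): void
    {
        if (!$this->checkPrerequisites()) {
            return;
        }
        if ($this->name_server_does_client_tasks_only) {
            // Assumed: name-server-only jobs do their work in one callback.
            if ($this->is_name_server) {
                $this->log[] = 'nondistributedTasks';
            }
            return;
        }
        if ($this->is_name_server) {
            $this->log[] = 'prepareTasks';
            $this->log[] = 'finishTasks';
        }
        if (!$this->is_name_server || $this->name_server_does_client_tasks) {
            $this->log[] = 'getTasks'; // a client would ask the name server
            $this->log[] = 'doTasks';
            $this->log[] = 'putTasks'; // results go back to the name server
        }
    }
}

$analytics_like = new SketchJob(true);
$analytics_like->name_server_does_client_tasks_only = true;
$analytics_like->run();
echo implode(" -> ", $analytics_like->log), "\n";
```

Under these assumptions, a job such as AnalyticsJob, which the description says runs only on the name server, would reach just the nondistributedTasks branch.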

        
