AnalyticsJob
extends MediaJob
in package
A media job used to periodically calculate summary statistics about group, thread, page, and query impressions.
Table of Contents
- NUM_TIMES_INTERVAL = 50
- For size and time distributions, the number of multiples of the minimal recorded interval (DOWNLOAD_SIZE_INTERVAL for size) to check for pages with that size/download time
- STATISTIC_REFRESH_RATE = \seekquarry\yioop\configs\ANALYTICS_UPDATE_INTERVAL / 2
- While computing the statistics page, the number of seconds until a page refresh and a save of progress so far occur
- $controller : object
- If the MediaJob was instantiated in the web app, the controller that instantiated it
- $crawl_model : object
- Used to get crawl seed info
- $impression_model : object
- Used to get statistics from DBMS about wiki and thread views
- $machine_model : object
- Used to determine which queue servers are available and which might have information about a crawl
- $media_updater : object
- If the MediaJob was instantiated in a MediaUpdater, this is a reference to that updater
- $name_server_does_client_tasks : bool
- Whether to run the job's client tasks on the name server in addition to prepareTasks and finishTasks
- $name_server_does_client_tasks_only : bool
- Whether this MediaJob performs name server only tasks
- $phrase_model : object
- Used to get crawl statistics about the number of various HTTP response requests seen during a crawl
- $tasks : array<string|int, mixed>
- The tasks for this MediaJob most recently received from the name server
- $update_time : int
- Time in current epoch when analytics last updated
- __construct() : mixed
- Instantiates the MediaJob with a reference to the object that instantiated it
- checkPrerequisites() : bool
- Only update if it's been more than an hour since the last update
- computeCrawlStatistics() : mixed
- Runs the queries necessary to determine the HTTP code distribution, filetype distribution, number of hosts, language distribution, OS distribution, server distribution, site distribution, file size distribution, download time distribution, etc., for a web crawl for which statistics have been requested but not yet computed.
- countQuery() : int
- Performs the provided $query of a web crawl (potentially distributed across queue servers). Returns the count of the number of results that would be returned by that query.
- doTasks() : mixed
- Calls ImpressionModel to actually calculate various impression totals since the last update
- execNameServer() : array<string|int, mixed>
- Executes a method on the name server's JobController.
- finishTasks() : mixed
- This method is called on the name server to finish processing any data returned by MediaUpdater clients.
- getCurrentMachine() : string
- Returns a hash of the url of the current machine based on the value saved to self::current_machine_info_file by a machine statuses request
- getJobName() : string
- Gets the class name (less the namespace and the word Job) of the current MediaJob
- getTasks() : array<string|int, mixed>
- Method called from JobController when a MediaUpdater client contacts the name server's web app. This method is supposed to marshal any data on the name server that the requesting client should process.
- init() : mixed
- Initializes the time of last analytics update
- nondistributedTasks() : mixed
- For now, the analytics update is only done on the name server, as Yioop currently only supports one DBMS at a time.
- prepareTasks() : mixed
- This method is called on the name server to prepare data for any MediaUpdater clients.
- putTasks() : array<string|int, mixed>
- After a MediaUpdater client is done with the task given to it by the name server's media updater, the client contacts the name server's web app. The JobController of the name server's web app then calls this method to receive the data on the name server.
- run() : mixed
- Method executed by MediaUpdater to perform the MediaJob. This method shouldn't need to be overridden. Instead, the various callbacks it calls (listed in the class description) should be overridden.
Constants
NUM_TIMES_INTERVAL
For size and time distributions, the number of multiples of the minimal recorded interval (DOWNLOAD_SIZE_INTERVAL for size) to check for pages with that size/download time
public
mixed
NUM_TIMES_INTERVAL
= 50
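As a rough illustration of how this constant could bound a distribution, the sketch below lists bucket lower bounds as multiples of the minimal interval. The helper `distributionBuckets()` and the interval value of 5000 are hypothetical and not part of Yioop; only the count of 50 comes from the constant above, and the real code may index its buckets differently.

```php
<?php
/**
 * Hypothetical helper (not part of Yioop): lists the lower bounds of the
 * buckets examined for a size or download time distribution.
 */
function distributionBuckets(int $interval, int $num_times): array
{
    $buckets = [];
    for ($i = 0; $i < $num_times; $i++) {
        // each bucket starts at a multiple of the minimal recorded interval
        $buckets[] = $i * $interval;
    }
    return $buckets;
}

// 5000 is an assumed interval; 50 mirrors NUM_TIMES_INTERVAL
$buckets = distributionBuckets(5000, 50);
echo count($buckets), " buckets, last starts at ", end($buckets), "\n";
// prints "50 buckets, last starts at 245000"
```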
STATISTIC_REFRESH_RATE
While computing the statistics page, the number of seconds until a page refresh and a save of progress so far occur
public
mixed
STATISTIC_REFRESH_RATE
= \seekquarry\yioop\configs\ANALYTICS_UPDATE_INTERVAL / 2
Properties
$controller
If the MediaJob was instantiated in the web app, the controller that instantiated it
public
object
$controller
$crawl_model
Used to get crawl seed info
public
object
$crawl_model
$impression_model
Used to get statistics from DBMS about wiki and thread views
public
object
$impression_model
$machine_model
Used to determine which queue servers are available and which might have information about a crawl
public
object
$machine_model
$media_updater
If the MediaJob was instantiated in a MediaUpdater, this is a reference to that updater
public
object
$media_updater
$name_server_does_client_tasks
Whether to run the job's client tasks on the name server in addition to prepareTasks and finishTasks
public
bool
$name_server_does_client_tasks
$name_server_does_client_tasks_only
Whether this MediaJob performs name server only tasks
public
bool
$name_server_does_client_tasks_only
$phrase_model
Used to get crawl statistics about the number of various HTTP response requests seen during a crawl
public
object
$phrase_model
$tasks
The tasks for this MediaJob most recently received from the name server
public
array<string|int, mixed>
$tasks
$update_time
Time in current epoch when analytics last updated
public
int
$update_time
Methods
__construct()
Instantiates the MediaJob with a reference to the object that instantiated it
public
__construct([object $media_updater = null ][, object $controller = null ]) : mixed
Parameters
- $media_updater : object = null
-
a reference to the media updater that instantiated this object (if being run in MediaUpdater)
- $controller : object = null
-
a reference to the controller that instantiated this object (if being run in the web app)
Return values
mixed
checkPrerequisites()
Only update if it's been more than an hour since the last update
public
checkPrerequisites() : bool
Return values
bool —whether it's been more than an hour since the last update
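The elapsed-time test described here can be sketched as follows. This is a minimal stand-in, not the method's actual body: the function `hourHasElapsed()` and the constant `SECONDS_PER_HOUR` are hypothetical names, and the real method presumably compares against the `$update_time` property.

```php
<?php
const SECONDS_PER_HOUR = 3600; // assumed threshold, taken from the description

/**
 * Hypothetical stand-in for the prerequisite check: true once more than an
 * hour has passed between $update_time and $now (both epoch seconds).
 */
function hourHasElapsed(int $update_time, int $now): bool
{
    return $now - $update_time > SECONDS_PER_HOUR;
}

// Example: was the last update (two hours ago) long enough ago?
var_dump(hourHasElapsed(time() - 7200, time())); // bool(true)
```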
computeCrawlStatistics()
Runs the queries necessary to determine the HTTP code distribution, filetype distribution, number of hosts, language distribution, OS distribution, server distribution, site distribution, file size distribution, download time distribution, etc., for a web crawl for which statistics have been requested but not yet computed.
public
computeCrawlStatistics() : mixed
If these queries take too long, it saves partial results and returns.
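The save-partial-results behavior can be pictured as a time-boxed loop. The sketch below is only an illustration under assumptions: `runWithBudget()` is a hypothetical helper, the queries are modeled as plain callables, and the time budget is a parameter (it might correspond to STATISTIC_REFRESH_RATE, but this page does not confirm that).

```php
<?php
/**
 * Hypothetical sketch (not Yioop code): runs callables in order until a
 * time budget is spent, returning finished results plus the names of the
 * queries that are still pending.
 */
function runWithBudget(array $queries, float $budget_seconds): array
{
    $start = microtime(true);
    $results = [];
    $pending = array_keys($queries);
    foreach ($queries as $name => $query) {
        if (microtime(true) - $start >= $budget_seconds) {
            break; // out of time: keep partial results, resume later
        }
        $results[$name] = $query();
        array_shift($pending); // this query is no longer pending
    }
    return ["results" => $results, "pending" => $pending];
}

// Usage with two stand-in queries and a generous budget
$out = runWithBudget(
    ["num_hosts" => fn() => 12, "num_pages" => fn() => 340], 60.0);
```

A caller could persist `$out["results"]` and retry only the names left in `$out["pending"]` on the next run.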
Return values
mixed
countQuery()
Performs the provided $query of a web crawl (potentially distributed across queue servers). Returns the count of the number of results that would be returned by that query.
public
countQuery(string $query, string $index_timestamp, array<string|int, mixed> $machine_urls) : int
Parameters
- $query : string
-
query to run and count the results of
- $index_timestamp : string
-
timestamp of index to compute query count for
- $machine_urls : array<string|int, mixed>
-
queue servers on which the count is to be computed
Return values
int —number of results that would be returned by the given query
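One plausible way to combine counts from several queue servers is to sum the per-machine totals. The sketch below assumes exactly that; `totalCount()` and the `$fetch_count` callable are hypothetical and stand in for whatever network call the real method makes, which this page does not describe.

```php
<?php
/**
 * Hypothetical sketch (not Yioop code): sums the result counts reported
 * by each queue server for one query. $fetch_count stands in for the call
 * that asks a single machine for its count.
 */
function totalCount(string $query, array $machine_urls,
    callable $fetch_count): int
{
    $total = 0;
    foreach ($machine_urls as $url) {
        $total += (int) $fetch_count($url, $query);
    }
    return $total;
}

// Usage with canned per-machine counts in place of network requests
$canned = ["http://a.example/" => 7, "http://b.example/" => 5];
$count = totalCount("media:text", array_keys($canned),
    fn(string $url, string $query) => $canned[$url]);
echo $count, "\n"; // prints 12
```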
doTasks()
Calls ImpressionModel to actually calculate various impression totals since the last update
public
doTasks(array<string|int, mixed> $tasks) : mixed
Parameters
- $tasks : array<string|int, mixed>
-
array of info that came from getTasks (in this case, nothing)
Return values
mixed —the result of carrying out that processing
execNameServer()
Executes a method on the name server's JobController.
public
static execNameServer(string $command[, string $args = null ]) : array<string|int, mixed>
It will typically execute either getTask or putTask for a specific MediaJob, or getUpdateProperties to find out how the current MediaUpdater should be configured.
Parameters
- $command : string
-
the method to invoke on the name server
- $args : string = null
-
additional arguments to be passed to the name server
Return values
array<string|int, mixed> —data returned by the name server.
finishTasks()
This method is called on the name server to finish processing any data returned by MediaUpdater clients.
public
finishTasks() : mixed
Return values
mixed
getCurrentMachine()
Returns a hash of the url of the current machine based on the value saved to self::current_machine_info_file by a machine statuses request
public
static getCurrentMachine() : string
Return values
string —hash of current machine url
getJobName()
Gets the class name (less the namespace and the word Job) of the current MediaJob
public
static getJobName() : string
Return values
string —name of the current job
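The name derivation can be sketched as below. `jobNameFromClass()` is a hypothetical re-implementation for illustration, not the method's actual body, and the fully qualified class name in the usage line is an assumed example.

```php
<?php
/**
 * Hypothetical re-implementation (not Yioop code): drops the namespace and
 * a trailing "Job" from a fully qualified class name.
 */
function jobNameFromClass(string $class_name): string
{
    $pos = strrpos($class_name, "\\");
    // keep only the part after the last namespace separator, if any
    $short = ($pos === false) ? $class_name : substr($class_name, $pos + 1);
    return preg_replace('/Job$/', '', $short);
}

// The namespace below is assumed for the example
echo jobNameFromClass('seekquarry\yioop\library\media_jobs\AnalyticsJob');
// prints "Analytics"
```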
getTasks()
Method called from JobController when a MediaUpdater client contacts the name server's web app. This method is supposed to marshal any data on the name server that the requesting client should process.
public
getTasks(int $machine_id[, array<string|int, mixed> $data = null ]) : array<string|int, mixed>
Parameters
- $machine_id : int
-
id of client requesting data
- $data : array<string|int, mixed> = null
-
any additional info about data being requested
Return values
array<string|int, mixed> —work for the client to process
init()
Initializes the time of last analytics update
public
init() : mixed
Return values
mixed
nondistributedTasks()
For now, the analytics update is only done on the name server, as Yioop currently only supports one DBMS at a time.
public
nondistributedTasks() : mixed
Return values
mixed
prepareTasks()
This method is called on the name server to prepare data for any MediaUpdater clients.
public
prepareTasks() : mixed
Return values
mixed
putTasks()
After a MediaUpdater client is done with the task given to it by the name server's media updater, the client contacts the name server's web app. The JobController of the name server's web app then calls this method to receive the data on the name server.
public
putTasks(int $machine_id, mixed $data) : array<string|int, mixed>
Parameters
- $machine_id : int
-
id of client that is sending data to name server
- $data : mixed
-
results of computation done by client
Return values
array<string|int, mixed> —any response information to send back to the client
run()
Method executed by MediaUpdater to perform the MediaJob. This method shouldn't need to be overridden. Instead, the various callbacks it calls (listed in the class description) should be overridden.
public
run() : mixed