CrawlController
extends Controller
in package
implements
CrawlConstants
Controller used to manage networked installations of Yioop where there might be multiple QueueServers and a NameServer. Commands sent to the nameserver web page are mapped out to queue_servers using this controller. Each method of the controller essentially mimics one method of CrawlModel, PhraseModel, or in general anything that extends ParallelModel, and is used to proxy that information through a result web page back to the name_server.
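Several activities of this controller are described as writing their result as "url encoded, serialized php data" in the HTTP response body. A minimal sketch of that round trip follows, assuming PHP's built-in urlencode and serialize are a fair stand-in for the encoding Yioop actually uses. The helper names are hypothetical and are not part of Yioop.

```php
<?php
// Hypothetical helpers for illustration only; these names are not Yioop's.

/** Encodes an array as a URL-encoded, serialized PHP string. */
function sketchEncodeResponse(array $info): string
{
    return urlencode(serialize($info));
}

/** Decodes such a body, refusing to instantiate objects from remote data. */
function sketchDecodeResponse(string $body): array
{
    $data = unserialize(urldecode($body), ["allowed_classes" => false]);
    // unserialize returns false on malformed input; fall back to an empty array
    return is_array($data) ? $data : [];
}

// What a queue server might send, and what a name server would recover
$body = sketchEncodeResponse(["stalled" => false, "timestamp" => 1700000000]);
$info = sketchDecodeResponse($body);
```

Passing allowed_classes => false is a defensive choice in this sketch, since the decoded data comes from another machine.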
Interfaces, Classes, Traits and Enums
- CrawlConstants
- Shared constants and enums used by components that are involved in the crawling process
Table of Contents
- $activities : array<string|int, mixed>
- These are the activities supported by this controller
- $activity_component : array<string|int, mixed>
- Associative array of activity => component the activity is on, used by the Controller::call method to actually invoke a given activity on a given component
- $component_activities : array<string|int, mixed>
- Associative array of $components activities for this controller. Components are collections of activities (a little like traits) which can be reused.
- $component_instances : array<string|int, mixed>
- Array of instances of components used by this controller
- $model_instances : array<string|int, mixed>
- Array of instances of models used by this controller
- $plugin_instances : array<string|int, mixed>
- Array of instances of indexing_plugins used by this controller
- $view_instances : array<string|int, mixed>
- Array of instances of views used by this controller
- $web_site : WebSite
- Stores a reference to the web server when Yioop runs in CLI mode; in non-CLI mode it acts as a request router.
- __construct() : mixed
- Sets up component activities, instance array, and plugins.
- addDifferentialPrivacy() : int
- Adds to an integer, $actual_value, epsilon-noise taken from an L_1 gaussian source centered at $actual_value to get an epsilon-private integer value.
- call() : mixed
- Used to invoke an activity method of the current controller or one of its components
- checkCSRFTime() : bool
- Checks if the timestamp in $_REQUEST[$token_name] matches the timestamp of the last CSRF token accessed by this user for the kind of activity for which there might be a conflict.
- checkCSRFToken() : bool
- Checks if the form CSRF (cross-site request forgery preventing) token matches the given user and has not expired (tokens expire after 1 hour)
- checkRequest() : bool
- Checks whether a request is for a valid activity and whether it uses the correct authorization key
- clean() : string
- Used to clean strings that might be tainted because they originate from the user
- clearFeedData() : mixed
- Wrapper call to the source model method that deletes the news feed and trending data stored in this Yioop instance
- clearQuerySavePoint() : mixed
- A save point is used to store to disk a sequence of generation-doc-offset pairs of a particular mix query when doing an archive crawl of a crawl mix. This is used so that the mix can remember where it was the next time it is invoked by the web app on the machine in question.
- combinedCrawlInfo() : mixed
- Handles a request for the combined crawl list, stalled, and status data from a remote name server and retrieves the statistics about this that are held by the local queue server. Outputs this info back as the body of the HTTP response (URL-encoded, serialized PHP data).
- component() : mixed
- Dynamic loader for Component objects which might live on the current Controller
- convertArrayLines() : string
- Converts an array of lines of strings into a single string with proper newlines, each line having been trimmed and potentially cleaned
- convertStringCleanArray() : array<string|int, mixed>
- Cleans a string consisting of lines, typically of urls, into an array of clean lines. This is used in handling data from the crawl options text areas. # is treated as a comment.
- crawlStalled() : mixed
- Handles a request for whether or not the crawl is stalled on the given local server (which means no fetcher has spoken to it in a while). Outputs this info back as the body of the HTTP response (URL-encoded, serialized PHP data).
- crawlStatus() : mixed
- Handles a request for the crawl status (memory use, recent fetchers, crawl rate, etc.) data from a remote name server and retrieves the statistics about this that are held by the local queue server. Outputs this info back as the body of the HTTP response (URL-encoded, serialized PHP data).
- deleteCrawl() : mixed
- Receives a request to delete a crawl from a remote name server and then deletes the crawl on the local queue server
- displayView() : mixed
- Send the provided view to output, drawing it with the given data variable, using the current locale for translation, and writing mode
- generateCSRFToken() : string
- Generates a cross site request forgery preventing token based on the provided user name, the current time and the hidden AUTH_KEY
- getAccessModifiers() : array<string|int, mixed>
- Returns an array of the possible modifiers to the access to the activity in question.
- getCrawlItems() : mixed
- Receives a request to get crawl summary data for an array of urls from a remote name server and then looks these up on the local queue server
- getCrawlList() : mixed
- Handles a request for the crawl list (which crawls are stored on the machine) data from a remote name server and retrieves the statistics about this that are held by the local queue server. Outputs this info back as the body of the HTTP response (URL-encoded, serialized PHP data).
- getCrawlSeedInfo() : mixed
- Handles a request for the starting parameters of a crawl of a given timestamp and retrieves that information from the bundle held by the local queue server. Outputs this info back as the body of the HTTP response (URL-encoded, serialized PHP data).
- getCSRFTime() : int
- Used to return just the timestamp portion of the CSRF token
- getIndexingPluginList() : mixed
- Used to get a list of all available indexing plugins for this Yioop instance.
- getInfoTimestamp() : mixed
- Handles a request for information about a crawl with a given timestamp from a remote name server and retrieves statistics about this crawl that are held by the local queue server (number of pages, name, etc.). Outputs this info back as the body of the HTTP response (URL-encoded, serialized PHP data).
- initializeAdFields() : mixed
- If external source advertisements are present in the output of this controller, this function can be used to initialize the field variables used to write the appropriate JavaScript
- injectUrlsCurrentCrawl() : mixed
- Receives a request to inject new urls into the active crawl from a remote name server and then does this for the local queue server
- model() : mixed
- Dynamic loader for Model objects which might live on the current Controller
- pagingLogic() : mixed
- When an activity involves displaying tabular data (such as rows of users, groups, etc.), this method might be called to set up $data fields for next, prev, and page links. It also makes the call to the model to get the row data sorted and restricted as desired. For some data sources, rather than directly making a call to the model to get the data, the data might be passed directly to this method.
- parsePageHeadVars() : array<string|int, mixed>
- Used to parse head meta variables out of a data string provided either from a wiki page or a static page. Meta data is stored in lines before the first occurrence of END_HEAD_VARS. Head variables are name=value pairs. An example of a head variable might be "title = This web page's title". Anything after a semi-colon on a line in the head section is treated as a comment.
- parsePageHeadVarsView() : mixed
- Used to set up the head variables and page_data of a wiki or static page associated with a view.
- plugin() : mixed
- Dynamic loader for Plugin objects which might live on the current Controller
- processRequest() : mixed
- Checks that the request seems to be coming from a legitimate fetcher then determines which activity the fetcher is requesting and calls that activity for processing.
- recordViewSession() : mixed
- Used to store in a session which media list items have been viewed so we can put an indicator by them when the media list is rendered
- redirectLocation() : mixed
- Method to perform a 301 redirect to $location under both web server and CLI settings
- redirectWithMessage() : mixed
- Does a 301 redirect to the given location and sets a session variable to display a message on arrival.
- sendStartCrawlMessage() : mixed
- Receives a request to start a crawl from a remote name server and then starts the crawl process on the local queue server
- sendStopCrawlMessage() : mixed
- Receives a request to stop a crawl from a remote name server and then stops the current crawl on the local queue server
- setCrawlSeedInfo() : mixed
- Handles a request to change the parameters of a crawl of a given timestamp on the local machine (does nothing if crawl doesn't exist)
- setupGraphicalCaptchaViewData() : mixed
- Sets up the graphical captcha view. Draws the string for the graphical captcha.
- view() : mixed
- Dynamic loader for View objects which might live on the current Controller
Properties
$activities
These are the activities supported by this controller
public
array<string|int, mixed>
$activities
= ["clearFeedData", "clearQuerySavePoint", "crawlStalled", "crawlStatus", "deleteCrawl", "injectUrlsCurrentCrawl", "combinedCrawlInfo", "getInfoTimestamp", "getCrawlItems", "getCrawlList", "getCrawlSeedInfo", "sendStartCrawlMessage", "sendStopCrawlMessage", "setCrawlSeedInfo"]
$activity_component
Associative array of activity => component the activity is on, used by the Controller::call method to actually invoke a given activity on a given component
public
array<string|int, mixed>
$activity_component
= []
$component_activities
Associative array of $components activities for this controller. Components are collections of activities (a little like traits) which can be reused.
public
static array<string|int, mixed>
$component_activities
= []
$component_instances
Array of instances of components used by this controller
public
array<string|int, mixed>
$component_instances
$model_instances
Array of instances of models used by this controller
public
array<string|int, mixed>
$model_instances
$plugin_instances
Array of instances of indexing_plugins used by this controller
public
array<string|int, mixed>
$plugin_instances
$view_instances
Array of instances of views used by this controller
public
array<string|int, mixed>
$view_instances
= []
$web_site
Stores a reference to the web server when Yioop runs in CLI mode; in non-CLI mode it acts as a request router.
public
WebSite
$web_site
In CLI mode, it is useful for caching files in RAM as they are read.
Methods
__construct()
Sets up component activities, instance array, and plugins.
public
__construct([WebSite $web_site = null ]) : mixed
Parameters
- $web_site : WebSite = null
-
the web server when Yioop runs in CLI mode; in non-CLI mode it acts as a request router. In CLI mode, it is useful for caching files in RAM as they are read.
Return values
mixed
addDifferentialPrivacy()
Adds to an integer, $actual_value, epsilon-noise taken from an L_1 gaussian source centered at $actual_value to get an epsilon-private integer value.
public
addDifferentialPrivacy(int $actual_value) : int
Parameters
- $actual_value : int
-
the number to make private
Return values
int —$fuzzy_value number after noise added
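The "L_1 gaussian source" in the description above reads like a Laplace distribution, which is the usual noise source for epsilon-differential privacy. A minimal sketch under that assumption follows. The function name, the $epsilon parameter, and its default are hypothetical; the documented method takes only $actual_value, and its internals may differ.

```php
<?php
/**
 * Hypothetical sketch: adds Laplace noise with scale 1/$epsilon to an
 * integer and rounds the result. Not Yioop's implementation.
 */
function sketchAddLaplaceNoise(int $actual_value, float $epsilon = 0.5): int
{
    // Uniform sample strictly inside (-0.5, 0.5) so the log below is finite
    $max = mt_getrandmax();
    $u = (mt_rand(1, $max - 1) / $max) - 0.5;
    // Inverse-CDF sampling: X = mu - b * sgn(u) * ln(1 - 2|u|), with b = 1/epsilon
    $sign = ($u < 0) ? -1 : 1;
    $noise = -(1.0 / $epsilon) * $sign * log(1 - 2 * abs($u));
    return (int) round($actual_value + $noise);
}

$fuzzy_value = sketchAddLaplaceNoise(100);
```

A smaller $epsilon gives a wider noise distribution and therefore stronger privacy at the cost of accuracy.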
call()
Used to invoke an activity method of the current controller or one of its components
public
call(string $activity[, string $modifiers = [] ]) : mixed
Parameters
- $activity : string
-
method to invoke
- $modifiers : string = []
-
access modifiers to executing this method
Return values
mixed
checkCSRFTime()
Checks if the timestamp in $_REQUEST[$token_name] matches the timestamp of the last CSRF token accessed by this user for the kind of activity for which there might be a conflict.
public
checkCSRFTime(string $token_name[, string $action = "" ]) : bool
This is to avoid accidental replays of postings, etc., if the back button is used.
Parameters
- $token_name : string
-
name of a $_REQUEST field used to hold a CSRF_TOKEN
- $action : string = ""
-
name of current action to check for conflicts
Return values
bool —whether a conflicting action has occurred.
checkCSRFToken()
Checks if the form CSRF (cross-site request forgery preventing) token matches the given user and has not expired (tokens expire after 1 hour)
public
checkCSRFToken(string $token_name, string $user_id[, bool $use_name_as_passed = false ]) : bool
Parameters
- $token_name : string
-
attribute of $_REQUEST containing CSRFToken
- $user_id : string
-
user id of the user to check the token for
- $use_name_as_passed : bool = false
-
whether to use $token_name as the token (if true) or to use $_REQUEST[$token_name]
Return values
bool —whether the CSRF token was valid
checkRequest()
Checks whether a request is for a valid activity and whether it uses the correct authorization key
public
checkRequest() : bool
Return values
bool —whether the request was valid or not
clean()
Used to clean strings that might be tainted because they originate from the user
public
clean(mixed $value, mixed $type[, mixed $default = null ]) : string
Parameters
- $value : mixed
-
tainted data
- $type : mixed
-
type of data in $value. It can be one of the following strings: bool, color, double, float, int, hash, string, or web-url; or it can be an array listing allowed values. In the latter case, if the value is not in the array and $default is null, the cleaned value will be the first element of the array.
- $default : mixed = null
-
if $value is not set, the default value is returned. This isn't used much since if error_reporting is E_ALL or -1 you would still get a Notice.
Return values
string —the clean input matching the type provided
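The type-driven behavior described for $type can be sketched roughly as follows. This is an illustration covering only a few of the listed types (bool, int, float/double, string, and the allowed-values array), with sanitizing choices that are assumptions of this sketch rather than Yioop's actual rules. The function name is hypothetical.

```php
<?php
/**
 * Hypothetical sketch of type-based cleaning. Returns mixed here for
 * clarity, whereas the documented method is declared to return string.
 */
function sketchClean($value, $type, $default = null)
{
    if (is_array($type)) {
        // Allowed-values list: fall back to the first element when no default
        if (in_array($value, $type, true)) {
            return $value;
        }
        return ($default === null) ? array_values($type)[0] : $default;
    }
    if ($value === null) {
        return $default;
    }
    switch ($type) {
        case "bool":
            return filter_var($value, FILTER_VALIDATE_BOOLEAN);
        case "int":
            return intval($value);
        case "float":
        case "double":
            return floatval($value);
        case "string":
            // Assumed sanitizing step: escape HTML special characters
            return htmlspecialchars((string) $value, ENT_QUOTES, "UTF-8");
        default:
            return $default;
    }
}

$limit = sketchClean("25abc", "int");            // 25
$order = sketchClean("sideways", ["asc", "desc"]); // "asc"
```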
clearFeedData()
Wrapper call to the source model method that deletes the news feed and trending data stored in this Yioop instance
public
clearFeedData() : mixed
Return values
mixed
clearQuerySavePoint()
A save point is used to store to disk a sequence of generation-doc-offset pairs of a particular mix query when doing an archive crawl of a crawl mix. This is used so that the mix can remember where it was the next time it is invoked by the web app on the machine in question.
public
clearQuerySavePoint() : mixed
This function deletes such a save point associated with a timestamp
Return values
mixed
combinedCrawlInfo()
Handles a request for the combined crawl list, stalled, and status data from a remote name server and retrieves the statistics about this that are held by the local queue server. Outputs this info back as the body of the HTTP response (URL-encoded, serialized PHP data).
public
combinedCrawlInfo() : mixed
Return values
mixed
component()
Dynamic loader for Component objects which might live on the current Controller
public
component(string $component) : mixed
Parameters
- $component : string
-
name of component to return
Return values
mixed
convertArrayLines()
Converts an array of lines of strings into a single string with proper newlines, each line having been trimmed and potentially cleaned
public
convertArrayLines(array<string|int, mixed> $arr[, string $endline_string = "\n" ][, bool $clean = false ]) : string
Parameters
- $arr : array<string|int, mixed>
-
the array of lines to be processed
- $endline_string : string = "\n"
-
what string should be used to indicate the end of a line
- $clean : bool = false
-
whether to clean each line
Return values
string —a concatenated string of cleaned lines
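As a rough illustration of the behavior described, here is a minimal sketch that trims each line, optionally cleans it, and joins the lines with the end-of-line string. The function name is hypothetical, the cleaning step (HTML escaping) is an assumption, and whether the real method appends a trailing end-of-line string is not stated in this page.

```php
<?php
/** Hypothetical sketch; not Yioop's implementation. */
function sketchArrayToLines(array $arr, string $endline_string = "\n",
    bool $clean = false): string
{
    $lines = [];
    foreach ($arr as $line) {
        $line = trim((string) $line);
        if ($clean) {
            // Assumed cleaning step: escape HTML special characters
            $line = htmlspecialchars($line, ENT_QUOTES, "UTF-8");
        }
        $lines[] = $line;
    }
    // This sketch joins lines without a trailing end-of-line string
    return implode($endline_string, $lines);
}

$text = sketchArrayToLines(["  http://a.example/ ", "b<"], "\n", true);
```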
convertStringCleanArray()
Cleans a string consisting of lines, typically of urls, into an array of clean lines. This is used in handling data from the crawl options text areas. # is treated as a comment.
public
convertStringCleanArray(string $str[, string $line_type = "url" ]) : array<string|int, mixed>
Parameters
- $str : string
-
contains the url data
- $line_type : string = "url"
-
does additional cleaning depending on the type of the lines. For instance, if it is "url" then a line not beginning with a url scheme will have http:// prepended.
Return values
array<string|int, mixed> —$lines an array of clean lines
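The cleaning steps described above can be sketched as follows. This sketch assumes that a comment is a line whose first non-blank character is #, and that blank lines are dropped; the page does not say either explicitly. The function name and the scheme-detection pattern are hypothetical.

```php
<?php
/** Hypothetical sketch; not Yioop's implementation. */
function sketchStringToCleanLines(string $str, string $line_type = "url"): array
{
    $lines = [];
    foreach (preg_split('/\R/', $str) as $line) {
        $line = trim($line);
        // Assumption: skip blank lines and lines that start with #
        if ($line === "" || $line[0] === "#") {
            continue;
        }
        // For url lines, prepend http:// when no scheme is present
        if ($line_type === "url"
            && !preg_match('/^[a-z][a-z0-9+.\-]*:\/\//i', $line)) {
            $line = "http://" . $line;
        }
        $lines[] = $line;
    }
    return $lines;
}

$urls = sketchStringToCleanLines("example.com\n# a comment\n\nhttps://a.org/x ");
```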
crawlStalled()
Handles a request for whether or not the crawl is stalled on the given local server (which means no fetcher has spoken to it in a while). Outputs this info back as the body of the HTTP response (URL-encoded, serialized PHP data).
public
crawlStalled() : mixed
Return values
mixed
crawlStatus()
Handles a request for the crawl status (memory use, recent fetchers, crawl rate, etc.) data from a remote name server and retrieves the statistics about this that are held by the local queue server. Outputs this info back as the body of the HTTP response (URL-encoded, serialized PHP data).
public
crawlStatus() : mixed
Return values
mixed
deleteCrawl()
Receives a request to delete a crawl from a remote name server and then deletes the crawl on the local queue server
public
deleteCrawl() : mixed
Return values
mixed
displayView()
Send the provided view to output, drawing it with the given data variable, using the current locale for translation, and writing mode
public
displayView(string $view, array<string|int, mixed> $data) : mixed
Parameters
- $view : string
-
the name of the view to draw
- $data : array<string|int, mixed>
-
an array of values to use in drawing the view
Return values
mixed
generateCSRFToken()
Generates a cross site request forgery preventing token based on the provided user name, the current time and the hidden AUTH_KEY
public
generateCSRFToken(string $user) : string
Parameters
- $user : string
-
username to use to generate token
Return values
string —a csrf token
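To make the relationship between generateCSRFToken(), getCSRFTime(), and checkCSRFToken() concrete, here is a minimal sketch of a time-stamped token scheme. Everything in it is an assumption for illustration: the token layout (hash, then "*", then timestamp), the use of HMAC-SHA256, the constant names, and the function names. Only the inputs (user, current time, a secret key) and the one-hour expiry come from the descriptions on this page.

```php
<?php
// Hypothetical stand-ins; Yioop's AUTH_KEY and token format may differ.
const SKETCH_AUTH_KEY = "not-the-real-key";
const SKETCH_TOKEN_LIFETIME = 3600; // one hour, in seconds

/** Builds a token from the user, a timestamp, and the secret key. */
function sketchGenerateToken(string $user, ?int $time = null): string
{
    $time = $time ?? time();
    $hash = hash_hmac("sha256", $user . "|" . $time, SKETCH_AUTH_KEY);
    return $hash . "*" . $time;
}

/** Returns just the timestamp portion of a token, or 0 if malformed. */
function sketchTokenTime(string $token): int
{
    $parts = explode("*", $token);
    return (count($parts) == 2) ? (int) $parts[1] : 0;
}

/** Checks that the token matches the user and has not expired. */
function sketchCheckToken(string $token, string $user, ?int $now = null): bool
{
    $now = $now ?? time();
    $time = sketchTokenTime($token);
    if ($time <= 0 || $now - $time >= SKETCH_TOKEN_LIFETIME) {
        return false;
    }
    // Recompute the token for this user and timestamp; compare in constant time
    return hash_equals(sketchGenerateToken($user, $time), $token);
}

$token = sketchGenerateToken("alice", 1700000000);
$ok = sketchCheckToken($token, "alice", 1700000100);
```

Embedding the timestamp in the token is what lets a check like checkCSRFTime() compare it against the last token a user accessed.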
getAccessModifiers()
Returns an array of the possible modifiers to the access to the activity in question.
public
getAccessModifiers(string $activity) : array<string|int, mixed>
Parameters
- $activity : string
-
method to get access modifier list for
Return values
array<string|int, mixed> —of string names => translated names of the access modifiers for the method in question (if any exist).
getCrawlItems()
Receives a request to get crawl summary data for an array of urls from a remote name server and then looks these up on the local queue server
public
getCrawlItems() : mixed
Return values
mixed
getCrawlList()
Handles a request for the crawl list (which crawls are stored on the machine) data from a remote name server and retrieves the statistics about this that are held by the local queue server. Outputs this info back as the body of the HTTP response (URL-encoded, serialized PHP data).
public
getCrawlList() : mixed
Return values
mixed
getCrawlSeedInfo()
Handles a request for the starting parameters of a crawl of a given timestamp and retrieves that information from the bundle held by the local queue server. Outputs this info back as the body of the HTTP response (URL-encoded, serialized PHP data).
public
getCrawlSeedInfo() : mixed
Return values
mixed
getCSRFTime()
Used to return just the timestamp portion of the CSRF token
public
getCSRFTime(string $token_name) : int
Parameters
- $token_name : string
-
name of a $_REQUEST field used to hold a CSRF_TOKEN
Return values
int —the timestamp portion of the CSRF_TOKEN
getIndexingPluginList()
Used to get a list of all available indexing plugins for this Yioop instance.
public
getIndexingPluginList() : mixed
Return values
mixed
getInfoTimestamp()
Handles a request for information about a crawl with a given timestamp from a remote name server and retrieves statistics about this crawl that are held by the local queue server (number of pages, name, etc.). Outputs this info back as the body of the HTTP response (URL-encoded, serialized PHP data).
public
getInfoTimestamp() : mixed
Return values
mixed
initializeAdFields()
If external source advertisements are present in the output of this controller, this function can be used to initialize the field variables used to write the appropriate JavaScript
public
initializeAdFields(array<string|int, mixed> &$data[, bool $ads_off = false ]) : mixed
Parameters
- $data : array<string|int, mixed>
-
data to be used in drawing the view
- $ads_off : bool = false
-
whether or not ads are turned off so that this method should do nothing
Return values
mixed
injectUrlsCurrentCrawl()
Receives a request to inject new urls into the active crawl from a remote name server and then does this for the local queue server
public
injectUrlsCurrentCrawl() : mixed
Return values
mixed
model()
Dynamic loader for Model objects which might live on the current Controller
public
model(string $model) : mixed
Parameters
- $model : string
-
name of model to return
Return values
mixed
pagingLogic()
When an activity involves displaying tabular data (such as rows of users, groups, etc.), this method might be called to set up $data fields for next, prev, and page links. It also makes the call to the model to get the row data sorted and restricted as desired. For some data sources, rather than directly making a call to the model to get the data, the data might be passed directly to this method.
public
pagingLogic(array<string|int, mixed> &$data, mixed $field_or_model, string $output_field, int $default_show[, array<string|int, mixed> $search_array = [] ][, string $var_prefix = "" ][, array<string|int, mixed> $args = null ]) : mixed
Parameters
- $data : array<string|int, mixed>
-
used to send data to the view will be updated by this method with row and paging data
- $field_or_model : mixed
-
if an object, this is assumed to be a model and so the getRows method of this model is called to get row data, sorted and restricted according to $search_array; if a string then the row data is assumed to be in $data[$field_or_model] and pagingLogic itself does the sorting and restricting.
- $output_field : string
-
output rows for the view will be stored in $data[$output_field]
- $default_show : int
-
if not specified by $_REQUEST, then this will be used to determine the maximum number of rows that will be written to $data[$output_field]
- $search_array : array<string|int, mixed> = []
-
used to sort and restrict in the getRows call or the data from $data[$field_or_model]. Each element of this is a quadruple name of a field, what comparison to perform, a value to check, and an order (ascending/descending) to sort by
- $var_prefix : string = ""
-
if there are multiple uses of pagingLogic presented on the same view, then $var_prefix can be prepended to the $data field variables like num_show, start_row, end_row to distinguish between them
- $args : array<string|int, mixed> = null
-
additional arguments that are passed to getRows and in turn to selectCallback, fromCallback, and whereCallback that might provide user_id, etc to further control which rows are returned
Return values
mixed
parsePageHeadVars()
Used to parse head meta variables out of a data string provided either from a wiki page or a static page. Meta data is stored in lines before the first occurrence of END_HEAD_VARS. Head variables are name=value pairs. An example of a head variable might be "title = This web page's title". Anything after a semi-colon on a line in the head section is treated as a comment.
public
parsePageHeadVars(string $page_data[, mixed $with_body = false ]) : array<string|int, mixed>
Parameters
- $page_data : string
-
this is the actual content of a wiki or static page
- $with_body : mixed = false
Return values
array<string|int, mixed> —the associative array of head variables or pair [head vars, page body]
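The head-variable format described above can be sketched as a small parser. This is an illustration under stated assumptions: a page with no END_HEAD_VARS marker is treated as having no head variables, and leading whitespace is stripped from the body. The function name is hypothetical and the real method may handle edge cases differently.

```php
<?php
/** Hypothetical sketch; not Yioop's implementation. */
function sketchParseHeadVars(string $page_data, bool $with_body = false): array
{
    $marker = "END_HEAD_VARS";
    $pos = strpos($page_data, $marker);
    if ($pos === false) {
        // Assumption: without the marker, the whole page is body
        $head = "";
        $body = $page_data;
    } else {
        $head = substr($page_data, 0, $pos);
        $body = ltrim(substr($page_data, $pos + strlen($marker)));
    }
    $vars = [];
    foreach (preg_split('/\R/', $head) as $line) {
        // Anything after a semi-colon is a comment
        $semi = strpos($line, ";");
        if ($semi !== false) {
            $line = substr($line, 0, $semi);
        }
        $eq = strpos($line, "=");
        if ($eq === false) {
            continue;
        }
        $name = trim(substr($line, 0, $eq));
        if ($name !== "") {
            $vars[$name] = trim(substr($line, $eq + 1));
        }
    }
    return $with_body ? [$vars, $body] : $vars;
}

$page = "title = This web page's title ; a comment\nauthor=Me\n"
    . "END_HEAD_VARS\nBody text";
list($head_vars, $page_body) = sketchParseHeadVars($page, true);
```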
parsePageHeadVarsView()
Used to set up the head variables and page_data of a wiki or static page associated with a view.
public
parsePageHeadVarsView(object $view, string $page_name, string $page_data) : mixed
Parameters
- $view : object
-
View on which page data will be rendered
- $page_name : string
-
a string name/id to associate with page. For example, might have 404 for a page about 404 errors
- $page_data : string
-
this is the actual content of a wiki or static page
Return values
mixed
plugin()
Dynamic loader for Plugin objects which might live on the current Controller
public
plugin(string $plugin) : mixed
Parameters
- $plugin : string
-
name of Plugin to return
Return values
mixed
processRequest()
Checks that the request seems to be coming from a legitimate fetcher then determines which activity the fetcher is requesting and calls that activity for processing.
public
processRequest() : mixed
Return values
mixed
recordViewSession()
Used to store in a session which media list items have been viewed so we can put an indicator by them when the media list is rendered
public
recordViewSession(int $page_id, string $sub_path, string $media_name) : mixed
Parameters
- $page_id : int
-
the id of page with media list
- $sub_path : string
-
the resource folder on that page
- $media_name : string
-
item to store an indicator into the session for
Return values
mixed
redirectLocation()
Method to perform a 301 redirect to $location under both web server and CLI settings
public
redirectLocation(string $location) : mixed
Parameters
- $location : string
-
url to redirect to
Return values
mixed
redirectWithMessage()
Does a 301 redirect to the given location and sets a session variable to display a message on arrival.
public
redirectWithMessage(string $message[, string $copy_fields = false ][, bool $restart = false ][, bool $use_base_url = false ]) : mixed
Parameters
- $message : string
-
message to write
- $copy_fields : string = false
-
$_REQUEST fields to copy for redirect
- $restart : bool = false
-
whether to restart the server when Yioop is being run as its own server rather than under Apache.
- $use_base_url : bool = false
-
set to true if the base_url should be included in the redirect
Return values
mixed
sendStartCrawlMessage()
Receives a request to start a crawl from a remote name server and then starts the crawl process on the local queue server
public
sendStartCrawlMessage() : mixed
Return values
mixed
sendStopCrawlMessage()
Receives a request to stop a crawl from a remote name server and then stops the current crawl on the local queue server
public
sendStopCrawlMessage() : mixed
Return values
mixed
setCrawlSeedInfo()
Handles a request to change the parameters of a crawl of a given timestamp on the local machine (does nothing if crawl doesn't exist)
public
setCrawlSeedInfo() : mixed
Return values
mixed
setupGraphicalCaptchaViewData()
Sets up the graphical captcha view. Draws the string for the graphical captcha.
public
setupGraphicalCaptchaViewData(array<string|int, mixed> &$data) : mixed
Parameters
- $data : array<string|int, mixed>
-
used by the view to draw any dynamic content. In this case, a field "CAPTCHA_IMAGE" is appended with a data url of the captcha to draw.
Return values
mixed
view()
Dynamic loader for View objects which might live on the current Controller
public
view(string $view) : mixed
Parameters
- $view : string
-
name of view to return