CrawlComponent
extends Component
implements
CrawlConstants
This component is used to provide activities for the admin controller related to configuring and performing a web or archive crawl
Interfaces, Classes, Traits and Enums
- CrawlConstants
- Shared constants and enums used by components that are involved in the crawling process
Table of Contents
- MAX_MIX_FRAGMENTS = 10
- Maximum number of search result fragments in a crawl mix
- $parent : object
- Reference to the controller this component lives on
- __construct() : mixed
- Sets up this component by storing in its parent field a reference to the controller this component lives on
- crawlStatistics() : mixed
- Called from manageCrawls() to read in the file with statistics information about a crawl. This file is computed by AnalyticsJob.
- editClassifier() : mixed
- Handles the particulars of editing a classifier, which includes changing its label and adding training examples.
- editCrawlOption() : mixed
- Called from manageCrawls() to edit the parameters for the next crawl (or current crawl) to be carried out by the machines $machine_urls. Updates the $data array to be supplied to AdminView.
- editMix() : mixed
- Handles admin requests related to the edit crawl mix activity
- getCrawlParametersFromSeedInfo() : mixed
- Reads the parameters for a crawl from an array gotten from a crawl.ini file
- initCrawlBadges() : mixed
- Used to compute statistics for the badges on the manage crawls, mix crawls, and manage machines buttons, which are typically shown to admin accounts
- initializeWikiEditor() : mixed
- Called to include the JavaScript Wiki Editor (wiki.js) on a page and to send any localizations needed from PHP to JavaScript. It is used by both CrawlComponent and SocialComponent.
- initSocialBadges() : mixed
- Used to compute the impression statistics for badges on the social controls button for $user_id. These badges display the number of unread messages, the number of unread group posts, and the number of groups the user belongs to.
- manageClassifiers() : mixed
- Handles admin requests for creating, editing, and deleting classifiers.
- manageCrawls() : array<string|int, mixed>
- Used to handle the manage crawl activity.
- mixCrawls() : array<string|int, mixed>
- Handles admin request related to the crawl mix activity
- pageOptions() : mixed
- Handles admin request related to controlling file options to be used in a crawl
- resultsEditor() : array<string|int, mixed>
- Handles admin request related to the search filter activity
- scrapers() : array<string|int, mixed>
- Handles admin request related to the Scrapers activity
- searchSources() : array<string|int, mixed>
- Handles admin request related to the search sources activity
- startCrawl() : mixed
- Called from manageCrawls() to start a new crawl on the machines $machine_urls. Updates the $data array with a crawl start message.
- mapSiteConstants() : array<string|int, mixed>
- Given an array whose keys are CrawlConstants values, returns an associative array, sorted by key, whose keys are the string names of those CrawlConstants. So if the input has a field [CrawlConstants::PAGE] => some page, the output has a field PAGE => some page.
Constants
MAX_MIX_FRAGMENTS
Maximum number of search result fragments in a crawl mix
public
mixed
MAX_MIX_FRAGMENTS
= 10
Properties
$parent
Reference to the controller this component lives on
public
object
$parent
= null
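The relationship between a component and its controller can be sketched as follows. This is a minimal illustration only: DemoController and DemoComponent are hypothetical names invented for the example and are not Yioop classes.

```php
<?php
// Hypothetical stand-in for a controller; not Yioop's actual controller class.
class DemoController
{
    public $name = "admin";
}

// Hypothetical component that keeps a reference to the controller it lives on.
class DemoComponent
{
    /** @var object controller this component lives on */
    public $parent = null;

    public function __construct($parent_controller)
    {
        $this->parent = $parent_controller;
    }
}

$controller = new DemoController();
$component = new DemoComponent($controller);
// PHP passes object handles, so the component sees later changes
// made to the controller.
$controller->name = "renamed";
echo $component->parent->name, "\n";
```

Because $parent is public in the sketch, as it is in the property listing above, code outside the component can also reach the controller through it.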
Methods
__construct()
Sets up this component by storing in its parent field a reference to the controller this component lives on
public
__construct(object $parent_controller) : mixed
Parameters
- $parent_controller : object
-
reference to the controller this component lives on
Return values
mixed
crawlStatistics()
Called from manageCrawls() to read in the file with statistics information about a crawl. This file is computed by AnalyticsJob.
public
crawlStatistics(array<string|int, mixed> &$data, array<string|int, mixed> $machine_urls) : mixed
Parameters
- $data : array<string|int, mixed>
-
an array of info to supply to AdminView
- $machine_urls : array<string|int, mixed>
-
string urls of the machines being used in the crawl, managed by this Yioop name server
Return values
mixed
editClassifier()
Handles the particulars of editing a classifier, which includes changing its label and adding training examples.
public
editClassifier(array<string|int, mixed> &$data, array<string|int, mixed> $classifiers, array<string|int, mixed> $machine_urls) : mixed
This activity directly handles changing the class label, but not adding training examples. The latter activity is done interactively without reloading the page via XmlHttpRequests, coordinated by the classifier controller dedicated to that task.
Parameters
- $data : array<string|int, mixed>
-
data to be passed on to the view
- $classifiers : array<string|int, mixed>
-
map from class labels to their associated classifiers
- $machine_urls : array<string|int, mixed>
-
string urls of machines managed by this Yioop name server
Return values
mixed
editCrawlOption()
Called from manageCrawls() to edit the parameters for the next crawl (or current crawl) to be carried out by the machines $machine_urls. Updates the $data array to be supplied to AdminView.
public
editCrawlOption(array<string|int, mixed> &$data, array<string|int, mixed> $machine_urls) : mixed
Parameters
- $data : array<string|int, mixed>
-
an array of info to supply to AdminView
- $machine_urls : array<string|int, mixed>
-
string urls of machines managed by this Yioop name server on which to perform the crawl
Return values
mixed
editMix()
Handles admin requests related to the edit crawl mix activity
public
editMix(array<string|int, mixed> &$data) : mixed
Parameters
- $data : array<string|int, mixed>
-
info about the fragments and their contents for a particular crawl mix (changed by this method)
Return values
mixed
getCrawlParametersFromSeedInfo()
Reads the parameters for a crawl from an array gotten from a crawl.ini file
public
getCrawlParametersFromSeedInfo(array<string|int, mixed> &$crawl_params, array<string|int, mixed> $seed_info) : mixed
Parameters
- $crawl_params : array<string|int, mixed>
-
parameters to write to queue_server
- $seed_info : array<string|int, mixed>
-
data from crawl.ini file
Return values
mixed
initCrawlBadges()
Used to compute statistics for the badges on the manage crawls, mix crawls, and manage machines buttons, which are typically shown to admin accounts
public
initCrawlBadges(int $user_id, array<string|int, mixed> &$data) : mixed
Parameters
- $user_id : int
-
id of user, used to determine the mix crawl list
- $data : array<string|int, mixed>
-
associative array of data to send to the view. This method adds the new fields NUM_MIXES, CRAWL_MANAGER, NUM_MACHINES, CRAWLS_RUNNING, and NUM_CLOSED_CRAWLS
Return values
mixed
initializeWikiEditor()
Called to include the JavaScript Wiki Editor (wiki.js) on a page and to send any localizations needed from PHP to JavaScript. It is used by both CrawlComponent and SocialComponent.
public
initializeWikiEditor(array<string|int, mixed> &$data[, $id = "" ]) : mixed
Parameters
- $data : array<string|int, mixed>
-
an associative array of data to be used by the view and layout that the wiki editor will be drawn on. This method tacks on to INCLUDE_SCRIPTS to make the layout load wiki.js.
- $id : = ""
-
if "" then all textareas on the page will get editor buttons; if -1 then sets up translations but does not add any buttons; otherwise, adds buttons to the textarea with id $id. (This method can be called multiple times if more than one, but not all, textareas should get buttons.)
Return values
mixed
initSocialBadges()
Used to compute the impression statistics for badges on the social controls button for $user_id. These badges display the number of unread messages, the number of unread group posts, and the number of groups the user belongs to.
public
initSocialBadges(int $user_id, array<string|int, mixed> &$data) : mixed
Parameters
- $user_id : int
-
id of user to compute statistics for
- $data : array<string|int, mixed>
-
associative array of data to send to the view. This method adds three new fields: NUM_GROUPS, UNREAD_POSTS, and UNREAD_MESSAGES
Return values
mixed
manageClassifiers()
Handles admin requests for creating, editing, and deleting classifiers.
public
manageClassifiers() : mixed
This activity implements the logic for the page that lists existing classifiers, including the actions that can be performed on them.
Return values
mixed
manageCrawls()
Used to handle the manage crawl activity.
public
manageCrawls() : array<string|int, mixed>
This activity allows new crawls to be started and statistics about old crawls to be seen. It allows a user to stop the current crawl or restart an old crawl. It also allows a user to configure the options by which a crawl is conducted.
Return values
array<string|int, mixed> —$data information and statistics about crawls in the system as well as status messages on performing a given sub activity
mixCrawls()
Handles admin request related to the crawl mix activity
public
mixCrawls() : array<string|int, mixed>
The crawl mix activity allows a user to create/edit crawl mixes: weighted combinations of search indexes
Return values
array<string|int, mixed> —$data info about available crawl mixes and changes to them as well as any messages about the success or failure of a sub activity.
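One way to read "weighted combinations of search indexes" is sketched below. This is an assumption made for illustration, not Yioop's mixing logic: the function combineWeighted(), the 'weight' and 'results' array keys, and the example hosts are all hypothetical.

```php
<?php
/**
 * Hypothetical helper: merges results from several indexes, scaling each
 * result's score by the weight of the index it came from.
 */
function combineWeighted(array $components)
{
    $merged = [];
    foreach ($components as $component) {
        foreach ($component['results'] as $url => $score) {
            // A URL found in several indexes accumulates its weighted scores.
            $merged[$url] = ($merged[$url] ?? 0)
                + $component['weight'] * $score;
        }
    }
    arsort($merged); // highest combined score first, keys preserved
    return $merged;
}

// Two made-up indexes, the first weighted twice as heavily as the second.
$mix = [
    ['weight' => 2, 'results' => ['a.example' => 1, 'b.example' => 3]],
    ['weight' => 1, 'results' => ['b.example' => 1, 'c.example' => 10]],
];
$ranked = combineWeighted($mix);
print_r($ranked);
```

In this sketch a heavily weighted index can still be outranked by a very high score from a lightly weighted one, which is the usual trade-off of a linear combination.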
pageOptions()
Handles admin request related to controlling file options to be used in a crawl
public
pageOptions() : mixed
This activity allows a user to specify the page range size to be used during a crawl as well as which file types can be downloaded
Return values
mixed
resultsEditor()
Handles admin request related to the search filter activity
public
resultsEditor() : array<string|int, mixed>
This activity allows a user to specify hosts whose web pages are to be filtered out of the search results
Return values
array<string|int, mixed> —$data info about the groups and their contents for a particular crawl mix
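Filtering results by host, as the search filter activity describes, could look roughly like the following. This is a sketch under stated assumptions: filterByHost() and the 'url' result field are hypothetical and do not come from Yioop; only parse_url() is a standard PHP function.

```php
<?php
/**
 * Hypothetical helper: drops results whose URL host is in $blocked_hosts.
 */
function filterByHost(array $results, array $blocked_hosts)
{
    // Flip to a lookup table so each host check is a single isset().
    $blocked = array_flip(array_map('strtolower', $blocked_hosts));
    $kept = [];
    foreach ($results as $result) {
        $host = parse_url($result['url'], PHP_URL_HOST);
        // parse_url() does not return a string when no host is found;
        // this sketch keeps such results rather than guessing.
        if (is_string($host) && isset($blocked[strtolower($host)])) {
            continue;
        }
        $kept[] = $result;
    }
    return $kept;
}

$results = [
    ['url' => 'https://www.example.com/page'],
    ['url' => 'https://spam.example.net/offer'],
];
$visible = filterByHost($results, ['spam.example.net']);
print_r($visible);
```

The comparison here is an exact, case-insensitive host match; whether subdomains of a filtered host should also be dropped is a design choice the sketch leaves open.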
scrapers()
Handles admin request related to the Scrapers activity
public
scrapers() : array<string|int, mixed>
This activity allows a user to specify the configuration for the ways we detect Scrapers
Return values
array<string|int, mixed> —$data info about the Scraper settings
searchSources()
Handles admin request related to the search sources activity
public
searchSources() : array<string|int, mixed>
The search sources activity allows a user to add/delete search sources for news and podcasts. It also allows a user to control which subsearches appear on the SearchView page
Return values
array<string|int, mixed> —$data info about current search sources, and current sub-searches
startCrawl()
Called from manageCrawls() to start a new crawl on the machines $machine_urls. Updates the $data array with a crawl start message.
public
startCrawl(array<string|int, mixed> &$data, array<string|int, mixed> $request_fields) : mixed
Parameters
- $data : array<string|int, mixed>
-
an array of info to supply to AdminView
- $request_fields : array<string|int, mixed>
-
if starting the crawl fails, this is a list of request fields to preserve in the redirect message
Return values
mixed
mapSiteConstants()
Given an array whose keys are CrawlConstants values, returns an associative array, sorted by key, whose keys are the string names of those CrawlConstants. So if the input has a field [CrawlConstants::PAGE] => some page, the output has a field PAGE => some page.
private
mapSiteConstants(array<string|int, mixed> $site[, array<string|int, mixed> $exclude_fields = [self::PAGE] ]) : array<string|int, mixed>
Parameters
- $site : array<string|int, mixed>
-
the input array with CrawlConstant fields
- $exclude_fields : array<string|int, mixed> = [self::PAGE]
-
list of fields not to include in output array
Return values
array<string|int, mixed> —converted array with string names of CrawlConstants
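The key-renaming behavior described for mapSiteConstants() can be approximated with PHP reflection. The sketch below is not the method's actual source: DemoConstants, its constant values, and mapConstantNames() are hypothetical stand-ins, and skipping keys that match no constant is an assumption of the example.

```php
<?php
// Hypothetical stand-in for CrawlConstants; names and values are made up.
interface DemoConstants
{
    const URL = 'u';
    const TITLE = 't';
    const PAGE = 'p';
}

/**
 * Replaces constant-valued keys in $site with the constants' names,
 * skips keys listed in $exclude_fields, and sorts the result by key.
 */
function mapConstantNames(array $site,
    array $exclude_fields = [DemoConstants::PAGE])
{
    $constants = (new ReflectionClass('DemoConstants'))->getConstants();
    // getConstants() maps name => value; flip it to look up names by value.
    $names = array_flip($constants);
    $out = [];
    foreach ($site as $key => $value) {
        // Assumption: keys matching no constant are dropped, not copied.
        if (in_array($key, $exclude_fields, true) || !isset($names[$key])) {
            continue;
        }
        $out[$names[$key]] = $value;
    }
    ksort($out);
    return $out;
}

$site = [
    'u' => 'https://example.com/',
    'p' => '<html>...</html>',
    't' => 'Example',
];
$mapped = mapConstantNames($site);
print_r($mapped);
```

With the default $exclude_fields, the PAGE entry is left out, so the result holds only the TITLE and URL fields, in that order.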