Yioop_V9.5_Source_Code_Documentation

CrawlComponent extends Component
in package
implements CrawlConstants

This component provides activities for the admin controller related to configuring and performing a web or archive crawl.

Tags
author

Chris Pollett

Interfaces, Classes, Traits and Enums

CrawlConstants
Shared constants and enums used by components that are involved in the crawling process

Table of Contents

MAX_MIX_FRAGMENTS  = 10
Maximum number of search result fragments in a crawl mix
$parent  : object
Reference to the controller this component lives on
__construct()  : mixed
Sets up this component by storing in its parent field a reference to the controller this component lives on
crawlStatistics()  : mixed
Called from manageCrawls() to read in the file with statistics information about a crawl. This file is computed by AnalyticsJob.
editClassifier()  : mixed
Handles the particulars of editing a classifier, which includes changing its label and adding training examples.
editCrawlOption()  : mixed
Called from manageCrawls() to edit the parameters for the next crawl (or current crawl) to be carried out by the machines $machine_urls. Updates the $data array to be supplied to AdminView.
editMix()  : mixed
Handles admin request related to the edit crawl mix activity
getCrawlParametersFromSeedInfo()  : mixed
Reads the parameters for a crawl from an array obtained from a crawl.ini file
initCrawlBadges()  : mixed
Used to compute statistics for badges related to the manage crawls, mix crawls, and manage machines buttons typically shown to admin accounts
initializeWikiEditor()  : mixed
Called to include the JavaScript wiki editor (wiki.js) on a page and to send any localizations needed from PHP to JavaScript. It is used by both CrawlComponent and SocialComponent.
initSocialBadges()  : mixed
Used to compute the impression statistics for badges on the social controls button for $user_id. These badges display the number of unread messages, the number of unread group posts, and the number of groups the user belongs to.
manageClassifiers()  : mixed
Handles admin requests for creating, editing, and deleting classifiers.
manageCrawls()  : array<string|int, mixed>
Used to handle the manage crawl activity.
mixCrawls()  : array<string|int, mixed>
Handles admin request related to the crawl mix activity
pageOptions()  : mixed
Handles admin request related to controlling file options to be used in a crawl
resultsEditor()  : array<string|int, mixed>
Handles admin request related to the search filter activity
scrapers()  : array<string|int, mixed>
Handles admin request related to the Scrapers activity
searchSources()  : array<string|int, mixed>
Handles admin request related to the search sources activity
startCrawl()  : mixed
Called from manageCrawls() to start a new crawl on the machines $machine_urls. Updates the $data array with a crawl start message.
mapSiteConstants()  : array<string|int, mixed>
Given an array whose keys are CrawlConstants values, returns an associative array, sorted by key, whose keys are the string names of those constants. For example, if the input has a field [CrawlConstants::PAGE] => some page, the output has a field PAGE => some page.

Constants

MAX_MIX_FRAGMENTS

Maximum number of search result fragments in a crawl mix

public mixed MAX_MIX_FRAGMENTS = 10

Properties

$parent

Reference to the controller this component lives on

public object $parent = null

Methods

__construct()

Sets up this component by storing in its parent field a reference to the controller this component lives on

public __construct(object $parent_controller) : mixed
Parameters
$parent_controller : object

reference to the controller this component lives on

Return values
mixed
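
The constructor behavior described above can be sketched as follows. This is a minimal illustration and not Yioop's actual code: DemoController and DemoComponent are hypothetical stand-ins for a controller and a component.

```php
<?php
// Hypothetical stand-in for a controller; not part of Yioop.
class DemoController
{
    public $name = "admin";
}

// Hypothetical component mirroring the documented constructor behavior.
class DemoComponent
{
    // Reference to the controller this component lives on
    public $parent = null;

    public function __construct($parent_controller)
    {
        // Keep a handle to the owning controller for later use.
        $this->parent = $parent_controller;
    }
}

$controller = new DemoController();
$component = new DemoComponent($controller);
// Objects are passed by handle in PHP, so both names refer to one controller.
var_dump($component->parent === $controller);
```

Because the component holds a handle rather than a copy, state changed on the controller is visible through $parent.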

crawlStatistics()

Called from manageCrawls() to read in the file with statistics information about a crawl. This file is computed by AnalyticsJob.

public crawlStatistics(array<string|int, mixed> &$data, array<string|int, mixed> $machine_urls) : mixed
Parameters
$data : array<string|int, mixed>

an array of info to supply to AdminView

$machine_urls : array<string|int, mixed>

string URLs of the machines managed by this Yioop name server that are being used in the crawl

Return values
mixed

editClassifier()

Handles the particulars of editing a classifier, which includes changing its label and adding training examples.

public editClassifier(array<string|int, mixed> &$data, array<string|int, mixed> $classifiers, array<string|int, mixed> $machine_urls) : mixed

This activity directly handles changing the class label, but not adding training examples. The latter activity is done interactively without reloading the page via XmlHttpRequests, coordinated by the classifier controller dedicated to that task.

Parameters
$data : array<string|int, mixed>

data to be passed on to the view

$classifiers : array<string|int, mixed>

map from class labels to their associated classifiers

$machine_urls : array<string|int, mixed>

string urls of machines managed by this Yioop name server

Return values
mixed

editCrawlOption()

Called from manageCrawls() to edit the parameters for the next crawl (or current crawl) to be carried out by the machines $machine_urls. Updates the $data array to be supplied to AdminView.

public editCrawlOption(array<string|int, mixed> &$data, array<string|int, mixed> $machine_urls) : mixed
Parameters
$data : array<string|int, mixed>

an array of info to supply to AdminView

$machine_urls : array<string|int, mixed>

string urls of machines managed by this Yioop name server on which to perform the crawl

Return values
mixed

editMix()

Handles admin request related to the edit crawl mix activity

public editMix(array<string|int, mixed> &$data) : mixed
Parameters
$data : array<string|int, mixed>

info about the fragments and their contents for a particular crawl mix (changed by this method)

Return values
mixed

getCrawlParametersFromSeedInfo()

Reads the parameters for a crawl from an array obtained from a crawl.ini file

public getCrawlParametersFromSeedInfo(array<string|int, mixed> &$crawl_params, array<string|int, mixed> $seed_info) : mixed
Parameters
$crawl_params : array<string|int, mixed>

parameters to write to queue_server

$seed_info : array<string|int, mixed>

data from crawl.ini file

Return values
mixed

initCrawlBadges()

Used to compute statistics for badges related to the manage crawls, mix crawls, and manage machines buttons typically shown to admin accounts

public initCrawlBadges(int $user_id, array<string|int, mixed> &$data) : mixed
Parameters
$user_id : int

id of the user, used to determine the mix crawl list

$data : array<string|int, mixed>

associative array of data to send to the view. This method adds the fields NUM_MIXES, CRAWL_MANAGER, NUM_MACHINES, CRAWLS_RUNNING, and NUM_CLOSED_CRAWLS.

Return values
mixed

initializeWikiEditor()

Called to include the JavaScript wiki editor (wiki.js) on a page and to send any localizations needed from PHP to JavaScript. It is used by both CrawlComponent and SocialComponent.

public initializeWikiEditor(array<string|int, mixed> &$data[,  $id = "" ]) : mixed
Parameters
$data : array<string|int, mixed>

an associative array of data to be used by the view and layout that the wiki editor will be drawn on. This method tacks on to INCLUDE_SCRIPTS to make the layout load wiki.js.

$id : = ""

if "" then all textareas on the page will get editor buttons; if -1 then sets up translations but does not add any buttons; otherwise, adds buttons to the textarea with id $id. (This method can be called multiple times if more than one, but not all, textareas should get buttons.)

Return values
mixed

initSocialBadges()

Used to compute the impression statistics for badges on the social controls button for $user_id. These badges display the number of unread messages, the number of unread group posts, and the number of groups the user belongs to.

public initSocialBadges(int $user_id, array<string|int, mixed> &$data) : mixed
Parameters
$user_id : int

id of user to compute statistics for

$data : array<string|int, mixed>

associative array of data to send to the view. This method adds three new fields: NUM_GROUPS, UNREAD_POSTS, and UNREAD_MESSAGES.

Return values
mixed
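
Several methods on this page take $data by reference and add fields to it instead of returning a value. A minimal sketch of that convention, using the three field names documented for initSocialBadges(), might look like the following. The helper name addSocialBadgesSketch, the $counts parameter, and the ELEMENT key are hypothetical: the real method takes a $user_id and would look the counts up itself, which is not shown in this document.

```php
<?php
// Hypothetical helper: $counts stands in for values the real method
// would compute for a user; that lookup is not documented here.
function addSocialBadgesSketch(array $counts, array &$data): void
{
    foreach (["NUM_GROUPS", "UNREAD_POSTS", "UNREAD_MESSAGES"] as $field) {
        // Missing counts default to 0 (an assumption of this sketch).
        $data[$field] = (int) ($counts[$field] ?? 0);
    }
}

// ELEMENT is an illustrative pre-existing key, not a documented field.
$data = ["ELEMENT" => "demo"];
addSocialBadgesSketch(["NUM_GROUPS" => 3, "UNREAD_POSTS" => 12], $data);
print_r($data);
```

Since $data is passed by reference, the caller sees the added fields without needing a return value, and existing keys are left untouched.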

manageClassifiers()

Handles admin requests for creating, editing, and deleting classifiers.

public manageClassifiers() : mixed

This activity implements the logic for the page that lists existing classifiers, including the actions that can be performed on them.

Return values
mixed

manageCrawls()

Used to handle the manage crawl activity.

public manageCrawls() : array<string|int, mixed>

This activity allows new crawls to be started and statistics about old crawls to be viewed. It allows a user to stop the current crawl or restart an old crawl. It also allows a user to configure the options by which a crawl is conducted.

Return values
array<string|int, mixed>

$data information and statistics about crawls in the system as well as status messages on performing a given sub activity

mixCrawls()

Handles admin request related to the crawl mix activity

public mixCrawls() : array<string|int, mixed>

The crawl mix activity allows a user to create/edit crawl mixes: weighted combinations of search indexes

Return values
array<string|int, mixed>

$data info about available crawl mixes and changes to them as well as any messages about the success or failure of a sub activity.

pageOptions()

Handles admin request related to controlling file options to be used in a crawl

public pageOptions() : mixed

This activity allows a user to specify the page range size to be used during a crawl as well as which file types can be downloaded

Return values
mixed

resultsEditor()

Handles admin request related to the search filter activity

public resultsEditor() : array<string|int, mixed>

This activity allows a user to specify hosts whose web pages are to be filtered out of the search results

Return values
array<string|int, mixed>

$data info about the groups and their contents for a particular crawl mix

scrapers()

Handles admin request related to the Scrapers activity

public scrapers() : array<string|int, mixed>

This activity allows a user to specify the configuration for the ways we detect Scrapers

Return values
array<string|int, mixed>

$data info about the Scraper settings

searchSources()

Handles admin request related to the search sources activity

public searchSources() : array<string|int, mixed>

The search sources activity allows a user to add/delete search sources for news and podcasts. It also allows a user to control which subsearches appear on the SearchView page.

Return values
array<string|int, mixed>

$data info about current search sources, and current sub-searches

startCrawl()

Called from manageCrawls() to start a new crawl on the machines $machine_urls. Updates the $data array with a crawl start message.

public startCrawl(array<string|int, mixed> &$data, array<string|int, mixed> $request_fields) : mixed
Parameters
$data : array<string|int, mixed>

an array of info to supply to AdminView

$request_fields : array<string|int, mixed>

if start crawl fails this is a list of request fields to preserve in the redirect message

Return values
mixed

mapSiteConstants()

Given an array whose keys are CrawlConstants values, returns an associative array, sorted by key, whose keys are the string names of those constants. For example, if the input has a field [CrawlConstants::PAGE] => some page, the output has a field PAGE => some page.

private mapSiteConstants(array<string|int, mixed> $site[, array<string|int, mixed> $exclude_fields = [self::PAGE] ]) : array<string|int, mixed>
Parameters
$site : array<string|int, mixed>

the input array with CrawlConstant fields

$exclude_fields : array<string|int, mixed> = [self::PAGE]

list of fields not to include in output array

Return values
array<string|int, mixed>

converted array with string names of CrawlConstants
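
The key conversion described for mapSiteConstants() can be sketched as below. This is an illustration under stated assumptions and not the method's actual implementation: DemoCrawlConstants is a hypothetical stand-in for CrawlConstants with made-up values, the function name mapSiteConstantsSketch is hypothetical, the use of reflection is one possible technique, and silently dropping keys that match no constant is an assumption of this sketch.

```php
<?php
// Hypothetical stand-in for CrawlConstants; the real values may differ.
interface DemoCrawlConstants
{
    const PAGE = 'a';
    const TITLE = 'b';
    const URL = 'c';
}

function mapSiteConstantsSketch(array $site,
    array $exclude_fields = [DemoCrawlConstants::PAGE]): array
{
    // Build a lookup from constant value to constant name.
    $reflection = new ReflectionClass(DemoCrawlConstants::class);
    $names = array_flip($reflection->getConstants());
    $out = [];
    foreach ($site as $key => $value) {
        // Skip excluded fields and keys that match no constant.
        if (in_array($key, $exclude_fields, true) || !isset($names[$key])) {
            continue;
        }
        $out[$names[$key]] = $value;
    }
    // Sort the result by the constant names.
    ksort($out);
    return $out;
}

$site = [
    DemoCrawlConstants::URL => 'http://example.com/',
    DemoCrawlConstants::PAGE => '<html></html>',
    DemoCrawlConstants::TITLE => 'Example',
];
print_r(mapSiteConstantsSketch($site));
```

With the default $exclude_fields the PAGE entry is omitted from the result; passing an empty array as the second argument keeps it.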


        
