Yioop_V9.5_Source_Code_Documentation

FeedsUpdateJob extends MediaJob
in package

A media job to download and index feeds from various search sources (RSS, HTML scraper, etc). Idea is that this job runs once an hour to get the latest news, movies, weather from those sources.

Table of Contents

MAX_FEEDS_ONE_GO  = 100
Maximum number of feeds to download in one try
MAX_THUMBS_ONE_GO  = 100
Maximum number of thumb_urls to download in one try
OLD_ITEM_TIME  = 4 * \seekquarry\yioop\configs\ONE_WEEK
How long, in seconds, before a feed item expires
SINGLE_SOURCE_FACTOR  = 1.2
For a given feed update, the factor by which the number of items from a single source may exceed the average number of items per source.
$controller  : object
If MediaJob was instantiated in the web app, the controller that instantiated it
$db  : object
Datasource object used to run db queries related to feed items (for storing and updating them)
$found_items  : array<string|int, mixed>
News Feed Items found from the current feed
$index_archive  : FeedDocumentBundle
The FeedDocumentBundle to put feed items into periodically
$media_updater  : object
If the MediaJob was instantiated in a MediaUpdater, this is a reference to that updater
$media_urls  : array<string|int, mixed>
Used to keep track of image urls of thumbnails to download for feed items
$name_server_does_client_tasks  : bool
Whether to run the job's client tasks on the name server in addition to prepareTasks and finishTasks
$name_server_does_client_tasks_only  : bool
Whether this MediaJob performs name server only tasks
$tasks  : array<string|int, mixed>
The tasks most recently received from the name server for this MediaJob
$update_time  : int
Time in the current epoch when feeds were last updated
__construct()  : mixed
Instantiates the MediaJob with a reference to the object that instantiated it
addFeedItemIfNew()  : bool
Adds $item to feed index bundle if it isn't already there
addFoundItemsPartition()  : bool
Add found feed items that have not previously been seen to the current partition. Found feed items are assumed to be in $this->found_items.
checkPrerequisites()  : bool
Only update if it has been more than an hour since the last update
convertJsonDecodeToTags()  : string
Converts an associative array obtained from a json_decode'd string into an HTML string in which the JSON fields have become tags prefixed with "json". This can then be handled in the rest of the feeds updater like an HTML feed.
doTasks()  : mixed
For each feed source, downloads the feed, checks which items are new, and makes an array of them. Then calls the method to add these items to the IndexDocumentBundle for feeds.
execNameServer()  : array<string|int, mixed>
Executes a method on the name server's JobController.
finishTasks()  : mixed
This method is called on the name server to finish processing any data returned by MediaUpdater clients.
getCurrentMachine()  : string
Returns a hash of the url of the current machine based on the value saved to self::current_machine_info_file by a machine statuses request
getFeedBundle()  : mixed
Sets the value of $this->index_archive to point to the FeedDocumentBundle associated to feeds on this instance of Yioop
getJobName()  : string
Gets the class name (less the namespace and the word Job) of the current MediaJob
getTasks()  : array<string|int, mixed>
Handles the request to get the array of feed sources that hash to a particular value, i.e., those that match the index of the requesting machine's hashed url/name in the array of available machine hashes
init()  : mixed
Initializes the last update time to far in the past so that feeds will get updated immediately. Sets up the connection to the DB to store feed items, and makes it so the same media job runs on both the name server and client Media Updaters
nondistributedTasks()  : mixed
Get the media sources from the local database and use those to run the same task as in the distributed setting
parseFeedAuxInfo()  : mixed
Information about how to parse feeds that are neither RSS nor Atom is stored in the AUX_INFO column of the MEDIA_SOURCE table. When a feed is read from this table, this method parses that column into additional fields that are easier to use for manipulating feed data. Example feed types for which this parsing is used are html, json and regex feeds.
prepareTasks()  : mixed
This method is called on the name server to prepare data for any MediaUpdater clients.
putTasks()  : array<string|int, mixed>
After a MediaUpdater client is done with the task given to it by the name server's media updater, the client contacts the name server's web app. The JobController of the name server's web app then calls this method to receive the data on the name server
run()  : mixed
Method executed by MediaUpdater to perform the MediaJob. This method shouldn't need to be overridden. Instead, the various callbacks it calls (listed in the class description) should be overridden.
updateFoundItemsOneGo()  : mixed
Downloads one batch of $feeds_one_go feed items for updateFeedItems. For each feed source, downloads the feed, checks which items are not in the database, and adds them. This method does not update the inverted index shard.
getThumbs()  : mixed
Download images and create thumbnails for a list of image urls.

Constants

MAX_FEEDS_ONE_GO

Maximum number of feeds to download in one try

public mixed MAX_FEEDS_ONE_GO = 100
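
As a rough illustration of how a per-try limit like this is commonly applied, the sketch below splits a list of feeds into batches of at most MAX_FEEDS_ONE_GO entries. The helper name batchFeeds and the SOURCE_URL key are assumptions made for the example; the actual batching loop in FeedsUpdateJob may look different.

```php
<?php
// Local stand-in for FeedsUpdateJob::MAX_FEEDS_ONE_GO (documented value: 100).
const MAX_FEEDS_ONE_GO = 100;

/**
 * Splits feed descriptions into batches of at most $limit entries.
 * Hypothetical helper, not part of Yioop.
 */
function batchFeeds(array $feeds, int $limit = MAX_FEEDS_ONE_GO): array
{
    return array_chunk($feeds, $limit);
}

// Build 250 placeholder feeds; the SOURCE_URL key is an assumed field name.
$feeds = [];
for ($i = 0; $i < 250; $i++) {
    $feeds[] = ['SOURCE_URL' => "https://example.com/feed/$i"];
}
$batches = batchFeeds($feeds);
echo count($batches), " batches\n";
```

With 250 feeds this produces two full batches and one partial batch, so no single try handles more than the limit.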

MAX_THUMBS_ONE_GO

Maximum number of thumb_urls to download in one try

public mixed MAX_THUMBS_ONE_GO = 100

OLD_ITEM_TIME

How long, in seconds, before a feed item expires

public mixed OLD_ITEM_TIME = 4 * \seekquarry\yioop\configs\ONE_WEEK
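
To make the expiry arithmetic concrete, here is a minimal sketch that assumes ONE_WEEK is the number of seconds in seven days and that an item counts as expired once its age exceeds OLD_ITEM_TIME. The isExpired helper and the local constants are illustrative stand-ins, not Yioop code.

```php
<?php
// Assumption: ONE_WEEK is the number of seconds in seven days.
const ONE_WEEK = 7 * 24 * 60 * 60;
// Mirrors the documented definition 4 * ONE_WEEK.
const OLD_ITEM_TIME = 4 * ONE_WEEK;

/**
 * Returns true if an item published at $pubdate is older than $max_age
 * seconds at time $now. Hypothetical helper for illustration only.
 */
function isExpired(int $pubdate, int $now, int $max_age = OLD_ITEM_TIME): bool
{
    return ($now - $pubdate) > $max_age;
}

$now = 1700000000; // fixed timestamp so the example is reproducible
var_dump(isExpired($now - 5 * ONE_WEEK, $now)); // item five weeks old
var_dump(isExpired($now - ONE_WEEK, $now));     // item one week old
```

Under these assumptions the expiry window is 2,419,200 seconds (28 days).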

SINGLE_SOURCE_FACTOR

For a given feed update, the factor by which the number of items from a single source may exceed the average number of items per source.

public mixed SINGLE_SOURCE_FACTOR = 1.2
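
One plausible reading of this factor is a per-source cap: the average number of items per source, scaled by SINGLE_SOURCE_FACTOR. The sketch below works through that arithmetic; the singleSourceCap helper, its parameters, and the rounding choice are assumptions rather than the way FeedsUpdateJob necessarily computes the limit.

```php
<?php
// Local stand-in for FeedsUpdateJob::SINGLE_SOURCE_FACTOR.
const SINGLE_SOURCE_FACTOR = 1.2;

/**
 * Computes an assumed per-source item cap: the average number of items
 * per source multiplied by $factor, rounded up. Hypothetical helper.
 */
function singleSourceCap(int $total_items, int $num_sources,
    float $factor = SINGLE_SOURCE_FACTOR): int
{
    if ($num_sources <= 0) {
        return 0; // no sources means nothing can be kept
    }
    $average = $total_items / $num_sources;
    return (int) ceil($factor * $average);
}

// 1000 items spread over 20 sources: the average is 50 items per source.
echo singleSourceCap(1000, 20), "\n";
```

With an average of 50 items per source, a factor of 1.2 lets any one source contribute up to 20% more than the average.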

Properties

$controller

If MediaJob was instantiated in the web app, the controller that instantiated it

public object $controller

$db

Datasource object used to run db queries related to feed items (for storing and updating them)

public object $db

$found_items

News Feed Items found from the current feed

public array<string|int, mixed> $found_items

$media_updater

If the MediaJob was instantiated in a MediaUpdater, this is a reference to that updater

public object $media_updater

$media_urls

Used to keep track of image urls of thumbnails to download for feed items

public array<string|int, mixed> $media_urls

$name_server_does_client_tasks

Whether to run the job's client tasks on the name server in addition to prepareTasks and finishTasks

public bool $name_server_does_client_tasks

$name_server_does_client_tasks_only

Whether this MediaJob performs name server only tasks

public bool $name_server_does_client_tasks_only

$tasks

The tasks most recently received from the name server for this MediaJob

public array<string|int, mixed> $tasks

$update_time

Time in the current epoch when feeds were last updated

public int $update_time

Methods

__construct()

Instantiates the MediaJob with a reference to the object that instantiated it

public __construct([object $media_updater = null ][, object $controller = null ]) : mixed
Parameters
$media_updater : object = null

a reference to the media updater that instantiated this object (if being run in MediaUpdater)

$controller : object = null

a reference to the controller that instantiated this object (if being run in the web app)

Return values
mixed

addFeedItemIfNew()

Adds $item to feed index bundle if it isn't already there

public addFeedItemIfNew(array<string|int, mixed> $item, string $source_name, string $lang, int $age, mixed $unique_fields) : bool
Parameters
$item : array<string|int, mixed>

data from a single feed item

$source_name : string

string name of the feed $item was found on

$lang : string

locale-tag of the feed

$age : int

age in seconds beyond which records should be ignored

$unique_fields : mixed
Return values
bool

whether an item was added
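
The "add only if new" behavior can be sketched as a duplicate check combined with an age cutoff. The example below assumes items carry link and pubdate fields and that uniqueness is decided by hashing the values named in $unique_fields; these names, and the in-memory $seen array standing in for the feed index bundle, are hypothetical.

```php
<?php
/**
 * Records $item in $seen unless it is too old or already present.
 * Hypothetical stand-in: $seen plays the role of the feed index bundle.
 */
function addItemIfNew(array $item, array &$seen, int $age, int $now,
    array $unique_fields = ['link']): bool
{
    // Ignore items older than $age seconds.
    if ($now - ($item['pubdate'] ?? $now) > $age) {
        return false;
    }
    // Build a key from the fields assumed to identify an item.
    $parts = [];
    foreach ($unique_fields as $field) {
        $parts[] = (string) ($item[$field] ?? '');
    }
    $key = hash('sha256', implode("\n", $parts));
    if (isset($seen[$key])) {
        return false; // already stored
    }
    $seen[$key] = true;
    return true;
}

$seen = [];
$now = 1700000000;
$item = ['link' => 'https://example.com/a', 'pubdate' => $now - 60];
var_dump(addItemIfNew($item, $seen, 3600, $now)); // first sighting
var_dump(addItemIfNew($item, $seen, 3600, $now)); // same item again
```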

addFoundItemsPartition()

Add found feed items that have not previously been seen to the current partition. Found feed items are assumed to be in $this->found_items.

public addFoundItemsPartition() : bool

After processing is complete, $this->found_items is reset to [].

Return values
bool

whether or not items were added

checkPrerequisites()

Only update if it has been more than an hour since the last update

public checkPrerequisites() : bool
Return values
bool

whether it has been an hour since the last update
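
A check of this kind usually reduces to comparing the current time against the stored update time. The following sketch assumes an hour means 3600 seconds and that the comparison is strict; shouldUpdate is a hypothetical free function, not the method itself.

```php
<?php
// Assumption: one hour expressed in seconds.
const ONE_HOUR = 3600;

/**
 * Returns true when more than an hour separates $update_time and $now.
 * Hypothetical helper mirroring the documented prerequisite.
 */
function shouldUpdate(int $update_time, int $now): bool
{
    return ($now - $update_time) > ONE_HOUR;
}

$now = time();
// An update time far in the past (as init() is documented to set) passes.
var_dump(shouldUpdate(0, $now));
// An update ten minutes ago does not.
var_dump(shouldUpdate($now - 600, $now));
```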

convertJsonDecodeToTags()

Converts an associative array obtained from a json_decode'd string into an HTML string in which the JSON fields have become tags prefixed with "json". This can then be handled in the rest of the feeds updater like an HTML feed.

public convertJsonDecodeToTags(array<string|int, mixed> $json_decode) : string
Parameters
$json_decode : array<string|int, mixed>

associative array coming from a json_decode'd string

Return values
string

result of converting array to an html string
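
To give a feel for the transformation, the sketch below turns a decoded JSON array into nested tags whose names are "json" followed by the field name. The exact tag naming, the handling of numeric keys, and the escaping shown here are assumptions; jsonArrayToTags is an illustrative function and may not match the output of convertJsonDecodeToTags.

```php
<?php
/**
 * Recursively converts a decoded JSON array into a tag string.
 * Assumed naming rule: tag name is "json" followed by the key.
 */
function jsonArrayToTags(array $data): string
{
    $out = "";
    foreach ($data as $key => $value) {
        // Replace characters that are unsafe in a tag name.
        $tag = "json" . preg_replace('/[^A-Za-z0-9_]/', '_', (string) $key);
        $inner = is_array($value)
            ? jsonArrayToTags($value)
            : htmlspecialchars((string) $value);
        $out .= "<$tag>$inner</$tag>";
    }
    return $out;
}

$decoded = json_decode('{"title":"Rain & wind","items":[{"id":1}]}', true);
echo jsonArrayToTags($decoded), "\n";
```

Once the data is in this form, path-style extraction written for HTML feeds can address fields such as jsontitle in the same way it addresses ordinary tags.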

doTasks()

For each feed source, downloads the feed, checks which items are new, and makes an array of them. Then calls the method to add these items to the IndexDocumentBundle for feeds.

public doTasks(array<string|int, mixed> $tasks) : mixed
Parameters
$tasks : array<string|int, mixed>

array of feed info (url to download, paths to extract etc)

Return values
mixed

the result of carrying out that processing

execNameServer()

Executes a method on the name server's JobController.

public static execNameServer(string $command[, string $args = null ]) : array<string|int, mixed>

It will typically execute either getTask or putTask for a specific MediaJob, or getUpdateProperties to find out how the current MediaUpdater should be configured.

Parameters
$command : string

the method to invoke on the name server

$args : string = null

additional arguments to be passed to the name server

Return values
array<string|int, mixed>

data returned by the name server.

finishTasks()

This method is called on the name server to finish processing any data returned by MediaUpdater clients.

public finishTasks() : mixed
Return values
mixed

getCurrentMachine()

Returns a hash of the url of the current machine based on the value saved to self::current_machine_info_file by a machine statuses request

public static getCurrentMachine() : string
Return values
string

hash of current machine url

getFeedBundle()

Sets the value of $this->index_archive to point to the FeedDocumentBundle associated to feeds on this instance of Yioop

public getFeedBundle() : mixed
Return values
mixed

getJobName()

Gets the class name (less the namespace and the word Job) of the current MediaJob

public static getJobName() : string
Return values
string

name of the current job
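
The name derivation can be illustrated with plain string handling. The sketch assumes that only a trailing "Job" is removed and uses a made-up namespace; jobNameFromClass is a hypothetical helper rather than the method's actual body.

```php
<?php
/**
 * Strips the namespace and a trailing "Job" from a class name.
 * Hypothetical helper; assumes only a trailing "Job" is dropped.
 */
function jobNameFromClass(string $class_name): string
{
    $pos = strrpos($class_name, "\\");
    $short = ($pos === false) ? $class_name : substr($class_name, $pos + 1);
    if (substr($short, -3) === 'Job') {
        $short = substr($short, 0, -3);
    }
    return $short;
}

// The namespace below is a placeholder chosen for the example.
echo jobNameFromClass('example\\media_jobs\\FeedsUpdateJob'), "\n";
```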

getTasks()

Handles the request to get the array of feed sources that hash to a particular value, i.e., those that match the index of the requesting machine's hashed url/name in the array of available machine hashes

public getTasks(int $machine_id[, array<string|int, mixed> $data = null ]) : array<string|int, mixed>
Parameters
$machine_id : int

id of machine making request for feeds

$data : array<string|int, mixed> = null

not used but inherited from the base MediaJob class as a parameter (so will always be null in this case)

Return values
array<string|int, mixed>

of feed urls and paths to extract from them
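
The hash-based assignment described for getTasks() can be sketched as follows: each feed is mapped to a machine index by hashing its name modulo the number of machines, and a machine receives the feeds whose index matches its own position in the list of machine hashes. The use of crc32, md5, and the NAME key are assumptions made for this example and may not match Yioop's hashing.

```php
<?php
/**
 * Returns the feeds assigned to the machine identified by $machine_hash.
 * Hypothetical sketch; assumes a 64-bit PHP build, where crc32() is
 * non-negative.
 */
function feedsForMachine(array $feeds, array $machine_hashes,
    string $machine_hash): array
{
    $machine_hashes = array_values($machine_hashes);
    $index = array_search($machine_hash, $machine_hashes, true);
    if ($index === false) {
        return []; // unknown machine gets no feeds
    }
    $num_machines = count($machine_hashes);
    $assigned = [];
    foreach ($feeds as $feed) {
        if (crc32($feed['NAME']) % $num_machines === $index) {
            $assigned[] = $feed;
        }
    }
    return $assigned;
}

$machines = [md5('http://machine-a.example/'),
    md5('http://machine-b.example/')];
$feeds = [['NAME' => 'World News'], ['NAME' => 'Weather'],
    ['NAME' => 'Movies']];
$total = 0;
foreach ($machines as $hash) {
    $total += count(feedsForMachine($feeds, $machines, $hash));
}
echo $total, "\n"; // every feed lands on exactly one machine
```

Because the index depends only on the feed name and the machine count, each machine can work out its share without coordinating with the others.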

init()

Initializes the last update time to far in the past so that feeds will get updated immediately. Sets up the connection to the DB to store feed items, and makes it so the same media job runs on both the name server and client Media Updaters

public init() : mixed
Return values
mixed

nondistributedTasks()

Get the media sources from the local database and use those to run the same task as in the distributed setting

public nondistributedTasks() : mixed
Return values
mixed

parseFeedAuxInfo()

Information about how to parse feeds that are neither RSS nor Atom is stored in the AUX_INFO column of the MEDIA_SOURCE table. When a feed is read from this table, this method parses that column into additional fields that are easier to use for manipulating feed data. Example feed types for which this parsing is used are html, json and regex feeds.

public static parseFeedAuxInfo(array<string|int, mixed> &$feed) : mixed

In the case of an rss or atom feed this method assumes the AUX_INFO field just contains an xpath expression for finding a feed_item's image, and so just parses the AUX_INFO field into an IMAGE_XPATH field.

Parameters
$feed : array<string|int, mixed>

associative array of data about one particular feed

Return values
mixed
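
For the rss/atom case described above, the by-reference update might look like the sketch below, which copies AUX_INFO into an IMAGE_XPATH field. The TYPE key, the trimming, and the function name parseRssAuxInfo are assumptions; how the html, json and regex cases split AUX_INFO is not shown because the column format is not given here.

```php
<?php
/**
 * For rss/atom feeds, treats AUX_INFO as an image xpath and stores it in
 * IMAGE_XPATH. Hypothetical sketch; the TYPE field name is assumed.
 */
function parseRssAuxInfo(array &$feed): void
{
    $type = $feed['TYPE'] ?? '';
    if ($type === 'rss' || $type === 'atom') {
        $feed['IMAGE_XPATH'] = trim($feed['AUX_INFO'] ?? '');
    }
}

$feed = ['TYPE' => 'rss', 'AUX_INFO' => ' //media:thumbnail/@url '];
parseRssAuxInfo($feed); // $feed is modified in place
echo $feed['IMAGE_XPATH'], "\n";
```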

prepareTasks()

This method is called on the name server to prepare data for any MediaUpdater clients.

public prepareTasks() : mixed
Return values
mixed

putTasks()

After a MediaUpdater client is done with the task given to it by the name server's media updater, the client contacts the name server's web app. The JobController of the name server's web app then calls this method to receive the data on the name server

public putTasks(int $machine_id, mixed $data) : array<string|int, mixed>
Parameters
$machine_id : int

id of client that is sending data to name server

$data : mixed

results of computation done by client

Return values
array<string|int, mixed>

any response information to send back to the client

run()

Method executed by MediaUpdater to perform the MediaJob. This method shouldn't need to be overridden. Instead, the various callbacks it calls (listed in the class description) should be overridden.

public run() : mixed
Return values
mixed

updateFoundItemsOneGo()

Downloads one batch of $feeds_one_go feed items for updateFeedItems. For each feed source, downloads the feed, checks which items are not in the database, and adds them. This method does not update the inverted index shard.

public updateFoundItemsOneGo(array<string|int, mixed> $feeds[, int $age = \seekquarry\yioop\configs\ONE_WEEK ][, bool $test_mode = false ]) : mixed
Parameters
$feeds : array<string|int, mixed>

list of feeds to download

$age : int = \seekquarry\yioop\configs\ONE_WEEK

age in seconds beyond which records should be ignored

$test_mode : bool = false

if true, then rather than updating items in the database, returns the found feed items for the given feeds as a string

Return values
mixed

either true, or if $test_mode is true then the results as a string of downloading the feeds and extracting the feed items

getThumbs()

Download images and create thumbnails for a list of image urls.

private getThumbs(array<string|int, mixed> $thumb_sites) : mixed
Parameters
$thumb_sites : array<string|int, mixed>

array of arrays. Each sub-array should contain a field CrawlConstants::THUMB_URL with the url to download. After download, the thumbnail is saved in the file CrawlConstants::FILE_NAME.

Return values
mixed

        
