FeedsUpdateJob
extends MediaJob
A media job to download and index feeds from various search sources (RSS, HTML scraper, etc.). The idea is that this job runs once an hour to get the latest news, movies, and weather from those sources.
Table of Contents
- MAX_FEEDS_ONE_GO = 100
- Maximum number of feeds to download in one try
- MAX_THUMBS_ONE_GO = 100
- Maximum number of thumb_urls to download in one try
- OLD_ITEM_TIME = 4 * \seekquarry\yioop\configs\ONE_WEEK
- How long, in seconds, before a feed item expires
- SINGLE_SOURCE_FACTOR = 1.2
- For a given feed update, the factor by which the number of items from a single source may exceed the average number of items per source.
- $controller : object
- If the MediaJob was instantiated in the web app, the controller that instantiated it
- $db : object
- Datasource object used to run db queries related to feed items (for storing and updating them)
- $found_items : array<string|int, mixed>
- News Feed Items found from the current feed
- $index_archive : FeedDocumentBundle
- The FeedDocumentBundle to put feed items into periodically
- $media_updater : object
- If the MediaJob was instantiated in a MediaUpdater, this is a reference to that updater
- $media_urls : array<string|int, mixed>
- Used to keep track of image urls of thumbnails to download for feed items
- $name_server_does_client_tasks : bool
- Whether to run the job's client tasks on the name server in addition to prepareTasks and finishTasks
- $name_server_does_client_tasks_only : bool
- Whether this MediaJob performs name server only tasks
- $tasks : array<string|int, mixed>
- The tasks for this MediaJob most recently received from the name server
- $update_time : int
- Time in current epoch when feeds were last updated
- __construct() : mixed
- Instantiates the MediaJob with a reference to the object that instantiated it
- addFeedItemIfNew() : bool
- Adds $item to feed index bundle if it isn't already there
- addFoundItemsPartition() : bool
- Add found feed items that have not previously been seen to the current partition. Found feed items are assumed to be in $this->found_items.
- checkPrerequisites() : bool
- Only update if it's been more than an hour since the last update
- convertJsonDecodeToTags() : string
- Converts an associative array produced by json_decode into an HTML string in which the JSON fields have become tags prefixed with "json". The result can then be handled by the rest of the feeds updater like an HTML feed.
- doTasks() : mixed
- For each feed source, downloads the feeds, checks which items are new, and makes an array of them. Then calls the method to add these items to the IndexDocumentBundle for feeds.
- execNameServer() : array<string|int, mixed>
- Executes a method on the name server's JobController.
- finishTasks() : mixed
- This method is called on the name server to finish processing any data returned by MediaUpdater clients.
- getCurrentMachine() : string
- Returns a hash of the url of the current machine based on the value saved to self::current_machine_info_file by a machine statuses request
- getFeedBundle() : mixed
- Sets the value of $this->index_archive to point to the FeedDocumentBundle associated to feeds on this instance of Yioop
- getJobName() : string
- Gets the class name (less namespace and the word Job ) of the current MediaJob
- getTasks() : array<string|int, mixed>
- Handles the request to get the array of feed sources that hash to a particular value, i.e., those matching the index of the requesting machine's hashed url/name in the array of available machine hashes
- init() : mixed
- Initializes the last update time to far in the past so that feeds will get updated immediately. Sets up the connection to the DB used to store feed items, and makes it so the same media job runs on both the name server and client Media Updaters
- nondistributedTasks() : mixed
- Get the media sources from the local database and use those to run the same task as in the distributed setting
- parseFeedAuxInfo() : mixed
- Information about how to parse feeds other than RSS and Atom feeds is stored in the AUX_INFO column of the MEDIA_SOURCE table. When a feed is read from this table, this method parses that column into additional fields that are easier to use when manipulating feed data. Example feed types for which this parsing is used are html, json, and regex feeds.
- prepareTasks() : mixed
- This method is called on the name server to prepare data for any MediaUpdater clients.
- putTasks() : array<string|int, mixed>
- After a MediaUpdater client is done with the task given to it by the name server's media updater, the client contacts the name server's web app. The name server web app's JobController then calls this method to receive the data on the name server
- run() : mixed
- Method executed by MediaUpdater to perform the MediaJob. This method shouldn't need to be overridden. Instead, the various callbacks it calls (listed in the class description) should be overridden.
- updateFoundItemsOneGo() : mixed
- Downloads one batch of $feeds_one_go feed items for updateFeedItems (see updateFeedItems). For each feed source, downloads the feeds, checks which items are not in the database, and adds them. This method does not update the inverted index shard.
- getThumbs() : mixed
- Download images and create thumbnails for a list of image urls.
Constants
MAX_FEEDS_ONE_GO
Maximum number of feeds to download in one try
public
mixed
MAX_FEEDS_ONE_GO
= 100
MAX_THUMBS_ONE_GO
Maximum number of thumb_urls to download in one try
public
mixed
MAX_THUMBS_ONE_GO
= 100
OLD_ITEM_TIME
How long, in seconds, before a feed item expires
public
mixed
OLD_ITEM_TIME
= 4 * \seekquarry\yioop\configs\ONE_WEEK
SINGLE_SOURCE_FACTOR
For a given feed update, the factor by which the number of items from a single source may exceed the average number of items per source.
public
mixed
SINGLE_SOURCE_FACTOR
= 1.2
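To make the role of SINGLE_SOURCE_FACTOR concrete, here is a minimal sketch of how such a per-source cap could be computed. It is not taken from the Yioop source: the helper name maxItemsPerSource and the choice to round up are assumptions made for illustration only.

```php
<?php
// Value copied from the constant documented above.
const SINGLE_SOURCE_FACTOR = 1.2;

/**
 * Hypothetical helper (not part of FeedsUpdateJob): returns the largest
 * number of items one source may contribute in a single update.
 */
function maxItemsPerSource(int $total_items, int $num_sources): int
{
    if ($num_sources <= 0) {
        return 0; // nothing to cap when there are no sources
    }
    $average = $total_items / $num_sources;
    // Assumption: the cap is the scaled average, rounded up.
    return (int) ceil(SINGLE_SOURCE_FACTOR * $average);
}
```

Under these assumptions, an update with 100 items spread over 10 sources would let any single source contribute at most 12 items.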
Properties
$controller
If the MediaJob was instantiated in the web app, the controller that instantiated it
public
object
$controller
$db
Datasource object used to run db queries related to feed items (for storing and updating them)
public
object
$db
$found_items
News Feed Items found from the current feed
public
array<string|int, mixed>
$found_items
$index_archive
The FeedDocumentBundle to put feed items into periodically
public
FeedDocumentBundle
$index_archive
$media_updater
If the MediaJob was instantiated in a MediaUpdater, this is a reference to that updater
public
object
$media_updater
$media_urls
Used to keep track of image urls of thumbnails to download for feed items
public
array<string|int, mixed>
$media_urls
$name_server_does_client_tasks
Whether to run the job's client tasks on the name server in addition to prepareTasks and finishTasks
public
bool
$name_server_does_client_tasks
$name_server_does_client_tasks_only
Whether this MediaJob performs name server only tasks
public
bool
$name_server_does_client_tasks_only
$tasks
The tasks for this MediaJob most recently received from the name server
public
array<string|int, mixed>
$tasks
$update_time
Time in current epoch when feeds were last updated
public
int
$update_time
Methods
__construct()
Instantiates the MediaJob with a reference to the object that instantiated it
public
__construct([object $media_updater = null ][, object $controller = null ]) : mixed
Parameters
- $media_updater : object = null
-
a reference to the media updater that instantiated this object (if being run in MediaUpdater)
- $controller : object = null
-
a reference to the controller that instantiated this object (if being run in the web app)
Return values
mixed
addFeedItemIfNew()
Adds $item to feed index bundle if it isn't already there
public
addFeedItemIfNew(array<string|int, mixed> $item, string $source_name, string $lang, int $age, mixed $unique_fields) : bool
Parameters
- $item : array<string|int, mixed>
-
data from a single feed item
- $source_name : string
-
string name of the feed $item was found on
- $lang : string
-
locale-tag of the feed
- $age : int
-
age in seconds beyond which records should be ignored
- $unique_fields : mixed
Return values
bool —whether an item was added
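The add-if-new behavior described above can be sketched as follows. This is an illustration under stated assumptions, not the actual implementation: it assumes $unique_fields is a list of field names whose values identify an item, that each item carries a hypothetical 'pubdate' timestamp, and that an in-memory array stands in for the feed index bundle. The function name addItemIfNew is likewise hypothetical.

```php
<?php
/**
 * Hypothetical sketch: adds $item to $seen unless it is too old or an
 * item with the same unique-field values is already present.
 */
function addItemIfNew(array &$seen, array $item, int $age,
    array $unique_fields, int $now): bool
{
    // Assumption: items older than $age seconds are ignored.
    if (isset($item['pubdate']) && $now - $item['pubdate'] > $age) {
        return false;
    }
    // Build a key from the fields assumed to identify the item.
    $key_parts = [];
    foreach ($unique_fields as $field) {
        $key_parts[] = $item[$field] ?? "";
    }
    $key = md5(implode("\x00", $key_parts));
    if (isset($seen[$key])) {
        return false; // already stored
    }
    $seen[$key] = $item;
    return true;
}
```

The boolean result mirrors the documented return value: true only when the item was actually added.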
addFoundItemsPartition()
Add found feed items that have not previously been seen to the current partition. Found feed items are assumed to be in $this->found_items.
public
addFoundItemsPartition() : bool
After processing is complete, $this->found_items is reset to [].
Return values
bool —whether or not items were added
checkPrerequisites()
Only update if it's been more than an hour since the last update
public
checkPrerequisites() : bool
Return values
bool —whether it's been an hour since the last update
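The hourly check can be pictured with the short sketch below. It is only an illustration: the constant ONE_HOUR and the function hourHasPassed are assumed names, and the real method reads and updates $update_time on the job object rather than taking arguments.

```php
<?php
// Assumed value for illustration: seconds in one hour.
const ONE_HOUR = 3600;

/**
 * Hypothetical stand-in for the prerequisite check: true when more than
 * an hour separates $now from the last recorded update time.
 */
function hourHasPassed(int $last_update, int $now): bool
{
    return ($now - $last_update) > ONE_HOUR;
}
```

Because init() sets the last update time far in the past, a check of this form would pass on the first run.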
convertJsonDecodeToTags()
Converts an associative array produced by json_decode into an HTML string in which the JSON fields have become tags prefixed with "json". The result can then be handled by the rest of the feeds updater like an HTML feed.
public
convertJsonDecodeToTags(array<string|int, mixed> $json_decode) : string
Parameters
- $json_decode : array<string|int, mixed>
-
associative array coming from a json_decode'd string
Return values
string —result of converting array to an html string
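As a rough illustration of the conversion described above, the sketch below turns a decoded JSON array into nested tags whose names start with "json". The exact tag naming used by Yioop is not given in this page, so the handling of numeric keys (rendered here as "jsonitem") and of unusual characters in keys is an assumption, as is the function name jsonArrayToTags.

```php
<?php
/**
 * Hypothetical sketch: converts an associative array from
 * json_decode($string, true) into an HTML-like string.
 */
function jsonArrayToTags(array $json_decode): string
{
    $out = "";
    foreach ($json_decode as $key => $value) {
        // Assumption: list entries get the generic name "item"; other
        // keys are reduced to characters that are safe in a tag name.
        $name = is_int($key) ? "item" :
            preg_replace('/[^A-Za-z0-9_]/', '_', $key);
        $tag = "json" . $name;
        // Recurse into nested arrays; escape scalar values.
        $inner = is_array($value) ? jsonArrayToTags($value) :
            htmlspecialchars((string) $value);
        $out .= "<$tag>$inner</$tag>";
    }
    return $out;
}
```

Once the data is in this form, XPath-style queries written for HTML feeds can be applied to JSON feeds as well.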
doTasks()
For each feed source, downloads the feeds, checks which items are new, and makes an array of them. Then calls the method to add these items to the IndexDocumentBundle for feeds.
public
doTasks(array<string|int, mixed> $tasks) : mixed
Parameters
- $tasks : array<string|int, mixed>
-
array of feed info (url to download, paths to extract etc)
Return values
mixed —the result of carrying out that processing
execNameServer()
Executes a method on the name server's JobController.
public
static execNameServer(string $command[, string $args = null ]) : array<string|int, mixed>
It will typically execute either getTask or putTask for a specific MediaJob, or getUpdateProperties to find out how the current MediaUpdater should be configured.
Parameters
- $command : string
-
the method to invoke on the name server
- $args : string = null
-
additional arguments to be passed to the name server
Return values
array<string|int, mixed> —data returned by the name server.
finishTasks()
This method is called on the name server to finish processing any data returned by MediaUpdater clients.
public
finishTasks() : mixed
Return values
mixed
getCurrentMachine()
Returns a hash of the url of the current machine based on the value saved to self::current_machine_info_file by a machine statuses request
public
static getCurrentMachine() : string
Return values
string —hash of current machine url
getFeedBundle()
Sets the value of $this->index_archive to point to the FeedDocumentBundle associated to feeds on this instance of Yioop
public
getFeedBundle() : mixed
Return values
mixed
getJobName()
Gets the class name (less namespace and the word Job ) of the current MediaJob
public
static getJobName() : string
Return values
string —name of the current job
getTasks()
Handles the request to get the array of feed sources that hash to a particular value, i.e., those matching the index of the requesting machine's hashed url/name in the array of available machine hashes
public
getTasks(int $machine_id[, array<string|int, mixed> $data = null ]) : array<string|int, mixed>
Parameters
- $machine_id : int
-
id of machine making request for feeds
- $data : array<string|int, mixed> = null
-
not used but inherited from the base MediaJob class as a parameter (so will always be null in this case)
Return values
array<string|int, mixed> —of feed urls and paths to extract from them
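The idea of handing each machine the feed sources that hash to its index can be sketched as below. This is a simplified illustration, not the Yioop implementation: it assumes the machine's index among the available machines is already known, that each feed has a 'NAME' field, and that an md5-based hash is used. The function name feedsForMachine is hypothetical.

```php
<?php
/**
 * Hypothetical sketch: returns the feeds whose name hashes to the given
 * machine index, so each feed is handled by exactly one machine.
 */
function feedsForMachine(array $feeds, int $machine_index,
    int $num_machines): array
{
    if ($num_machines <= 0) {
        return [];
    }
    $assigned = [];
    foreach ($feeds as $feed) {
        // Seven hex digits always fit in a PHP integer.
        $hash = hexdec(substr(md5($feed['NAME']), 0, 7));
        if ($hash % $num_machines === $machine_index) {
            $assigned[] = $feed;
        }
    }
    return $assigned;
}
```

With a scheme like this, the lists for all machine indexes together cover every feed exactly once, which is what lets the name server split the work without coordination between clients.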
init()
Initializes the last update time to far in the past so that feeds will get updated immediately. Sets up the connection to the DB used to store feed items, and makes it so the same media job runs on both the name server and client Media Updaters
public
init() : mixed
Return values
mixed
nondistributedTasks()
Get the media sources from the local database and use those to run the same task as in the distributed setting
public
nondistributedTasks() : mixed
Return values
mixed
parseFeedAuxInfo()
Information about how to parse feeds other than RSS and Atom feeds is stored in the AUX_INFO column of the MEDIA_SOURCE table. When a feed is read from this table, this method parses that column into additional fields that are easier to use when manipulating feed data. Example feed types for which this parsing is used are html, json, and regex feeds.
public
static parseFeedAuxInfo(array<string|int, mixed> &$feed) : mixed
In the case of an rss or atom feed this method assumes the AUX_INFO field just contains an xpath expression for finding a feed_item's image, and so just parses the AUX_INFO field into an IMAGE_XPATH field.
Parameters
- $feed : array<string|int, mixed>
-
associative array of data about one particular feed
Return values
mixed
prepareTasks()
This method is called on the name server to prepare data for any MediaUpdater clients.
public
prepareTasks() : mixed
Return values
mixed
putTasks()
After a MediaUpdater client is done with the task given to it by the name server's media updater, the client contacts the name server's web app. The name server web app's JobController then calls this method to receive the data on the name server
public
putTasks(int $machine_id, mixed $data) : array<string|int, mixed>
Parameters
- $machine_id : int
-
id of client that is sending data to name server
- $data : mixed
-
results of computation done by client
Return values
array<string|int, mixed> —any response information to send back to the client
run()
Method executed by MediaUpdater to perform the MediaJob. This method shouldn't need to be overridden. Instead, the various callbacks it calls (listed in the class description) should be overridden.
public
run() : mixed
Return values
mixed
updateFoundItemsOneGo()
Downloads one batch of $feeds_one_go feed items for updateFeedItems (see updateFeedItems). For each feed source, downloads the feeds, checks which items are not in the database, and adds them. This method does not update the inverted index shard.
public
updateFoundItemsOneGo(array<string|int, mixed> $feeds[, int $age = C\ONE_WEEK ][, bool $test_mode = false ]) : mixed
Parameters
- $feeds : array<string|int, mixed>
-
list of feeds to download
- $age : int = C\ONE_WEEK
-
age in seconds beyond which records should be ignored
- $test_mode : bool = false
-
if true, then rather than updating items in the database, returns the found feed items for the given feeds as a string
Return values
mixed —either true, or if $test_mode is true then the results as a string of downloading the feeds and extracting the feed items
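Processing feeds one batch at a time, as this method does, can be illustrated with PHP's array_chunk. The sketch below only shows the batching step. The helper name feedBatches is hypothetical, and how the caller loops over batches in Yioop is not described on this page.

```php
<?php
// Value copied from the MAX_FEEDS_ONE_GO constant documented above.
const MAX_FEEDS_ONE_GO = 100;

/**
 * Hypothetical helper: splits the full feed list into batches that can
 * each be passed to a one-go update.
 */
function feedBatches(array $feeds,
    int $batch_size = MAX_FEEDS_ONE_GO): array
{
    // array_chunk requires a size of at least 1.
    return array_chunk($feeds, max(1, $batch_size));
}
```

For example, 250 feeds would be split into batches of 100, 100, and 50, so no single download attempt handles more than MAX_FEEDS_ONE_GO feeds.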
getThumbs()
Download images and create thumbnails for a list of image urls.
private
getThumbs(array<string|int, mixed> $thumb_sites) : mixed
Parameters
- $thumb_sites : array<string|int, mixed>
-
array of arrays. Each sub-array should contain a field CrawlConstants::THUMB_URL with the url to download. After download, the thumbnail is saved in the file CrawlConstants::FILE_NAME.