RecommendationJob
extends MediaJob
in package
Recommendation Job recommends trending threads, as well as threads and groups that are relevant based on the user's viewing history
Table of Contents
- CONTEXT_WINDOW_LENGTH = 5
- Length of context window for calculating term embeddings
- DESCRIPTION_STOP_WORDS = ["author", "authors", "plot", "genre", "genres", "star", "stars", "credits", "rating", "ratings", "year", "director", "cast", "runtime"]
- Stop words to exclude from the descriptions fetched by the DescriptionUpdate media job
- HASH_ALGORITHM = "md5"
- Hash algorithm to be used for calculating hash in Hash2Vec embedding
- MAX_BATCH_SIZE = 200
- Maximum number of resources used in making resource recommendations; also the maximum number of group items to hold in memory at one time
- MAX_GROUP_ITEMS = 50000
- Maximum number of group items used in making recommendations
- MAX_TERM_EMBEDDINGS = 500
- Maximum number of term embeddings fetched from the database to initialize the LRUCache
- MAX_TERMS = 20000
- Maximum number of terms used in making recommendations
- RECOMMENDATION_FILE = \seekquarry\yioop\configs\APP_DIR . "/resources/recommendation.txt"
- File containing paths to description folders of wiki page resources that should be used to create data corpus for computing recommendations
- SIGN_HASH_ALGORITHM = "crc32"
- Hash algorithm to be used for calculating sign in Hash2Vec term embedding
- UPDATE_PERIOD = \seekquarry\yioop\configs\ONE_MONTH
- Update period to consider when fetching records from the ITEM_IMPRESSION_SUMMARY table
- $active_time : int
- Used to track what is the active recommendation timestamp
- $controller : object
- If the MediaJob was instantiated in the web app, the controller that instantiated it
- $cron_model : object
- Model used for timing when things were computed
- $db : object
- Datasource object used to run db queries related to recommendation items (for storing and updating them)
- $item_idf : array<string|int, mixed>
- Associative array of the number of items a term appears in
- $lru_cache : mixed
- LRUCache for term embeddings
- $media_updater : object
- If the MediaJob was instantiated in a MediaUpdater, this is a reference to that updater
- $name_server_does_client_tasks : bool
- Whether to run the job's client tasks on the name server in addition to prepareTasks and finishTasks
- $name_server_does_client_tasks_only : bool
- Whether this MediaJob performs name server only tasks
- $tasks : array<string|int, mixed>
- The tasks most recently received from the name server for this MediaJob
- $update_time : int
- Time in the current epoch when analytics were last updated
- $user_idf : array<string|int, mixed>
- Associative array of the number of user views a term appears in
- __construct() : mixed
- Instantiates the MediaJob with a reference to the object that instantiated it
- checkPrerequisites() : bool
- Only update if it's been more than an hour since the last update
- cleanRemoveStopWords() : array<string|int, mixed>
- Splits the given text into terms, cleans the terms by removing non-alphanumeric characters, and removes the stop terms to reduce noise when calculating the embeddings
- computeGroupEmbeddings() : array<string|int, mixed>
- Computes the group embeddings using the item embeddings for the items in a group. Additionally fetches the existing group embeddings from the database and updates them if the item embeddings have been updated
- computeGroupUserEmbeddings() : array<string|int, mixed>
- Computes the user embeddings based on the embeddings of groups for which the user either has impressions in the ITEM_IMPRESSION_SUMMARY table within the defined UPDATE_PERIOD or is a member
- computeGroupUserRecommendations() : array<string|int, mixed>
- Computes the group recommendations for a user based on the cosine similarity between user embeddings and group embeddings. Recommendations are calculated for groups which the user has not yet interacted with and is not a member of
- computeItemEmbeddings() : array<string|int, mixed>
- Computes the item embeddings for individual items (main thread only and not comments) in group feeds using the term embeddings for their terms.
- computeItemTermEmbeddings() : array<string|int, mixed>
- Computes the term embeddings for individual items (main thread only and not comments) in group feeds for the terms in their title and description text. Processes at most MAX_GROUP_ITEMS items which are either newly created or recently edited
- computeItemUserEmbeddings() : array<string|int, mixed>
- Computes the user embeddings based on the embeddings of items for which the user has impressions in the ITEM_IMPRESSION_SUMMARY table within the defined UPDATE_PERIOD
- computeItemUserRecommendations() : array<string|int, mixed>
- Computes the item recommendations for a user based on the cosine similarity between user embeddings and item embeddings. Recommendations are calculated for items the user has not yet interacted with, drawn from groups where the user is already a member
- computeThreadGroupRecommendations() : mixed
- Manages the whole process of computing thread and group recommendations for users. Makes a series of calls to handle parts of this computation before synthesizing the result
- computeWikiResourceEmbeddings() : array<string|int, mixed>
- Computes the embeddings for wiki page resources using the calculated term embeddings and adds the metadata details separately to the embeddings
- computeWikiResourceRecommendations() : mixed
- Manages the whole process of computing wiki resource recommendations for users. Makes a series of calls to handle parts of this computation before synthesizing the result
- computeWikiTermEmbeddings() : array<string|int, mixed>
- Computes the embedding for new terms in the description of wiki resources and updates the embedding of existing terms using the Hash2Vec approach
- computeWikiUserEmbeddings() : array<string|int, mixed>
- Computes user embeddings for wiki resources based on the user's resource impressions logged in the ITEM_IMPRESSION_SUMMARY table for the defined update period
- computeWikiUserRecommendations() : mixed
- Computes the wiki resource recommendations based on cosine similarity between resource embeddings and user embeddings
- doTasks() : mixed
- This method is run on a MediaUpdater client with data obtained from the name server by getTasks. The idea is that the client is supposed to then process this information and, if need be, send the results back to the name server
- execNameServer() : array<string|int, mixed>
- Executes a method on the name server's JobController.
- finishTasks() : mixed
- This method is called on the name server to finish processing any data returned by MediaUpdater clients.
- getCurrentMachine() : string
- Returns a hash of the url of the current machine based on the value saved to self::current_machine_info_file by a machine statuses request
- getDescriptionFiles() : array<string|int, mixed>
- Returns all the resource description files in a given thumb folder, recursively scanning through subfolders if any
- getJobName() : string
- Gets the class name (less namespace and the word Job) of the current MediaJob
- getTasks() : array<string|int, mixed>
- Method called from JobController when a MediaUpdater client contacts the name server's web app. This method is supposed to marshal any data on the name server that the requesting client should process.
- getTermEmbedding() : string
- Returns the term embedding either from LRU cache or database
- getWikiResourceDescriptions() : array<string|int, mixed>
- Fetches the description for the eligible wiki resources having the root folder path captured in RECOMMENDATION_FILE
- init() : mixed
- Sets up the database connection so the job can access tables related to recommendations. Initializes timing info related to the job.
- initializeNewUserRecommendations() : mixed
- Computes recommendations for users who have yet to receive any recommendation of the given type based on what is the most popular recommendation
- nondistributedTasks() : mixed
- For now analytics update is only done on name server as Yioop currently only supports one DBMS at a time.
- prepareTasks() : mixed
- This method is called on the name server to prepare data for any MediaUpdater clients.
- putTasks() : array<string|int, mixed>
- After a MediaUpdater client is done with the task given to it by the name server's media updater, the client contacts the name server's web app. The name server's web app's JobController then calls this method to receive the data on the name server
- run() : mixed
- Method executed by MediaUpdater to perform the MediaJob. This method shouldn't need to be overridden. Instead, the various callbacks it calls (listed in the class description) should be overridden.
- saveTermEmbeddingsCacheToDb() : mixed
- Writes the cached term embeddings back to the database and frees up memory
- updateTermEmbeddingCache() : mixed
- Updates the LRU cache of term embeddings and saves the evicted embedding back to the database
Constants
CONTEXT_WINDOW_LENGTH
Length of context window for calculating term embeddings
public
mixed
CONTEXT_WINDOW_LENGTH
= 5
DESCRIPTION_STOP_WORDS
Stop words to exclude from the descriptions fetched by the DescriptionUpdate media job
public
mixed
DESCRIPTION_STOP_WORDS
= ["author", "authors", "plot", "genre", "genres", "star", "stars", "credits", "rating", "ratings", "year", "director", "cast", "runtime"]
HASH_ALGORITHM
Hash algorithm to be used for calculating hash in Hash2Vec embedding
public
mixed
HASH_ALGORITHM
= "md5"
MAX_BATCH_SIZE
Maximum number of resources used in making resource recommendations; also the maximum number of group items to hold in memory at one time
public
mixed
MAX_BATCH_SIZE
= 200
MAX_GROUP_ITEMS
Maximum number of group items used in making recommendations
public
mixed
MAX_GROUP_ITEMS
= 50000
MAX_TERM_EMBEDDINGS
Maximum number of term embeddings fetched from the database to initialize the LRUCache
public
mixed
MAX_TERM_EMBEDDINGS
= 500
MAX_TERMS
Maximum number of terms used in making recommendations
public
mixed
MAX_TERMS
= 20000
RECOMMENDATION_FILE
File containing paths to description folders of wiki page resources that should be used to create data corpus for computing recommendations
public
mixed
RECOMMENDATION_FILE
= \seekquarry\yioop\configs\APP_DIR . "/resources/recommendation.txt"
SIGN_HASH_ALGORITHM
Hash algorithm to be used for calculating sign in Hash2Vec term embedding
public
mixed
SIGN_HASH_ALGORITHM
= "crc32"
UPDATE_PERIOD
Update period to consider when fetching records from the ITEM_IMPRESSION_SUMMARY table
public
mixed
UPDATE_PERIOD
= \seekquarry\yioop\configs\ONE_MONTH
Properties
$active_time
Used to track what is the active recommendation timestamp
public
int
$active_time
$controller
If the MediaJob was instantiated in the web app, the controller that instantiated it
public
object
$controller
$cron_model
Model used for timing when things were computed
public
object
$cron_model
$db
Datasource object used to run db queries related to recommendation items (for storing and updating them)
public
object
$db
$item_idf
Associative array of the number of items a term appears in
public
array<string|int, mixed>
$item_idf
$lru_cache
LRUCache for term embeddings
public
mixed
$lru_cache
$media_updater
If the MediaJob was instantiated in a MediaUpdater, this is a reference to that updater
public
object
$media_updater
$name_server_does_client_tasks
Whether to run the job's client tasks on the name server in addition to prepareTasks and finishTasks
public
bool
$name_server_does_client_tasks
$name_server_does_client_tasks_only
Whether this MediaJob performs name server only tasks
public
bool
$name_server_does_client_tasks_only
$tasks
The tasks most recently received from the name server for this MediaJob
public
array<string|int, mixed>
$tasks
$update_time
Time in the current epoch when analytics were last updated
public
int
$update_time
$user_idf
Associative array of the number of user views a term appears in
public
array<string|int, mixed>
$user_idf
Methods
__construct()
Instantiates the MediaJob with a reference to the object that instantiated it
public
__construct([object $media_updater = null ][, object $controller = null ]) : mixed
Parameters
- $media_updater : object = null
-
a reference to the media updater that instantiated this object (if being run in MediaUpdater)
- $controller : object = null
-
a reference to the controller that instantiated this object (if being run in the web app)
Return values
mixed
checkPrerequisites()
Only update if it's been more than an hour since the last update
public
checkPrerequisites() : bool
Return values
bool —whether it's been an hour since the last update
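The hourly gate described above can be sketched as a plain timestamp comparison. This is only an illustration: the helper name `moreThanAnHourSince`, the constant `SKETCH_ONE_HOUR`, and the injectable `$now` argument are assumptions made for the example, not part of this class, and the real method presumably reads its last-update time from `$cron_model` or `$update_time`.

```php
<?php
// Hypothetical constant for this sketch: one hour, in seconds.
const SKETCH_ONE_HOUR = 3600;

/**
 * Returns true when more than an hour has elapsed since $last_update.
 * $now can be injected so the check is deterministic in tests;
 * when omitted, the current time is used.
 */
function moreThanAnHourSince(int $last_update, ?int $now = null): bool
{
    $now = $now ?? time();
    return ($now - $last_update) > SKETCH_ONE_HOUR;
}
```

Passing the current time as a parameter keeps the comparison free of hidden state, which makes the boundary case (exactly one hour) easy to check.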
cleanRemoveStopWords()
Splits the given text into terms, cleans the terms by removing non-alphanumeric characters, and removes the stop terms to reduce noise when calculating the embeddings
public
cleanRemoveStopWords(string $text[, bool $description_stop_word_flag = false ]) : array<string|int, mixed>
Parameters
- $text : string
-
the text which needs to be processed
- $description_stop_word_flag : bool = false
-
whether to remove words present in DESCRIPTION_STOP_WORDS
Return values
array<string|int, mixed> —$terms [term_id, term], where term_id is calculated using the md5 hash of the term
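A minimal sketch of the split, clean, and filter steps follows. The function name `sketchCleanRemoveStopWords` and the short general stop-word list are assumptions for illustration (the general stop words used by the real method are not shown in this reference); the description stop words are copied from DESCRIPTION_STOP_WORDS above. Returning the md5 hex digest as the term id is also an assumption about the id's exact form.

```php
<?php
// Copied from the DESCRIPTION_STOP_WORDS constant documented above.
const SKETCH_DESCRIPTION_STOP_WORDS = ["author", "authors", "plot", "genre",
    "genres", "star", "stars", "credits", "rating", "ratings", "year",
    "director", "cast", "runtime"];
// Hypothetical general stop words, used only for this example.
const SKETCH_GENERAL_STOP_WORDS = ["a", "an", "and", "the", "of", "in", "is"];

/**
 * Lowercases and splits $text on whitespace, strips non-alphanumeric
 * characters from each piece, drops stop words, and returns a list of
 * [term_id, term] pairs where term_id is the md5 digest of the term.
 */
function sketchCleanRemoveStopWords(string $text,
    bool $description_stop_word_flag = false): array
{
    $stop_words = SKETCH_GENERAL_STOP_WORDS;
    if ($description_stop_word_flag) {
        $stop_words = array_merge($stop_words, SKETCH_DESCRIPTION_STOP_WORDS);
    }
    $terms = [];
    $pieces = preg_split('/\s+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($pieces as $piece) {
        $term = preg_replace('/[^a-z0-9]/', '', $piece);
        if ($term === '' || in_array($term, $stop_words, true)) {
            continue;
        }
        $terms[] = [md5($term), $term];
    }
    return $terms;
}
```

With the flag set, a word such as "plot" is filtered out along with the general stop words; with the flag unset it is kept as an ordinary term.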
computeGroupEmbeddings()
Computes the group embeddings using the item embeddings for the items in a group. Additionally fetches the existing group embeddings from the database and updates them if the item embeddings have been updated
public
computeGroupEmbeddings(array<string|int, mixed> $item_embeddings) : array<string|int, mixed>
Parameters
- $item_embeddings : array<string|int, mixed>
-
embedding for the items
Return values
array<string|int, mixed> —$updated_group_embeddings containing embeddings for groups
computeGroupUserEmbeddings()
Computes the user embeddings based on the embeddings of groups for which the user either has impressions in the ITEM_IMPRESSION_SUMMARY table within the defined UPDATE_PERIOD or is a member
public
computeGroupUserEmbeddings(array<string|int, mixed> $group_embeddings) : array<string|int, mixed>
Parameters
- $group_embeddings : array<string|int, mixed>
-
embedding vectors of groups
Return values
array<string|int, mixed> —[$group_user_embedding, $user_groups] user embeddings for groups and the ids of the groups in which the user has membership
computeGroupUserRecommendations()
Computes the group recommendations for a user based on the cosine similarity between user embeddings and group embeddings. Recommendations are calculated for groups which the user has not yet interacted with and is not a member of
public
computeGroupUserRecommendations(array<string|int, mixed> $group_embeddings, array<string|int, mixed> $group_user_embeddings, array<string|int, mixed> $user_groups, mixed $user_group_impression) : array<string|int, mixed>
Parameters
- $group_embeddings : array<string|int, mixed>
-
embedding vectors for groups
- $group_user_embeddings : array<string|int, mixed>
-
embedding vectors for users
- $user_groups : array<string|int, mixed>
-
ids of the groups in which the user has membership
- $user_group_impression : mixed
Return values
array<string|int, mixed> —$user_group_impression group ids which the user has seen
computeItemEmbeddings()
Computes the item embeddings for individual items (main thread only and not comments) in group feeds using the term embeddings for their terms.
public
computeItemEmbeddings(array<string|int, mixed> $item_terms) : array<string|int, mixed>
Additionally fetches the existing item embeddings from the database and updates them if the term embeddings for their terms have been updated
Parameters
- $item_terms : array<string|int, mixed>
-
terms in each item
Return values
array<string|int, mixed> —$updated_item_embeddings containing embeddings for items
computeItemTermEmbeddings()
Computes the term embeddings for individual items (main thread only and not comments) in group feeds for the terms in their title and description text. Processes at most MAX_GROUP_ITEMS items which are either newly created or recently edited
public
computeItemTermEmbeddings() : array<string|int, mixed>
Return values
array<string|int, mixed> —$item_terms terms in each item
computeItemUserEmbeddings()
Computes the user embeddings based on the embeddings of items for which the user has impressions in the ITEM_IMPRESSION_SUMMARY table within the defined UPDATE_PERIOD
public
computeItemUserEmbeddings(array<string|int, mixed> $item_embeddings) : array<string|int, mixed>
Parameters
- $item_embeddings : array<string|int, mixed>
-
embedding vectors of items
Return values
array<string|int, mixed> —[$item_user_embedding, $user_items] user embeddings for items and the ids of the items for which the user has impressions
computeItemUserRecommendations()
Computes the item recommendations for a user based on the cosine similarity between user embeddings and item embeddings. Recommendations are calculated for items the user has not yet interacted with, drawn from groups where the user is already a member
public
computeItemUserRecommendations(array<string|int, mixed> $item_embeddings, array<string|int, mixed> $item_user_embeddings, array<string|int, mixed> $user_items) : array<string|int, mixed>
Parameters
- $item_embeddings : array<string|int, mixed>
-
embedding vectors for items
- $item_user_embeddings : array<string|int, mixed>
-
embedding vectors for users
- $user_items : array<string|int, mixed>
-
ids of the user's items in the impression table
Return values
array<string|int, mixed> —$user_groups group ids where the user is a member
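The cosine-similarity ranking used by the recommendation methods can be sketched as below. The helper names `cosineSimilarity` and `rankUnseenItems`, the plain numeric-array embedding format, and the `$limit` parameter are assumptions for illustration; the sketch also leaves out the group-membership filter described above and only excludes items the user has already seen.

```php
<?php
/**
 * Cosine similarity between two numeric vectors of equal length.
 * Returns 0.0 if either vector has zero length (norm).
 */
function cosineSimilarity(array $a, array $b): float
{
    $dot = 0.0;
    $norm_a = 0.0;
    $norm_b = 0.0;
    foreach ($a as $i => $value) {
        $dot += $value * ($b[$i] ?? 0.0);
        $norm_a += $value * $value;
    }
    foreach ($b as $value) {
        $norm_b += $value * $value;
    }
    if ($norm_a == 0.0 || $norm_b == 0.0) {
        return 0.0;
    }
    return $dot / (sqrt($norm_a) * sqrt($norm_b));
}

/**
 * Scores every item the user has not seen against the user's embedding
 * and returns up to $limit item_id => score pairs, best score first.
 */
function rankUnseenItems(array $user_embedding, array $item_embeddings,
    array $seen_item_ids, int $limit = 3): array
{
    $scores = [];
    foreach ($item_embeddings as $item_id => $embedding) {
        if (in_array($item_id, $seen_item_ids, true)) {
            continue;
        }
        $scores[$item_id] = cosineSimilarity($user_embedding, $embedding);
    }
    arsort($scores); // sort by score descending, keeping item ids as keys
    return array_slice($scores, 0, $limit, true);
}
```

Because cosine similarity ignores vector length, a user with many impressions and a user with few are compared to items on the same scale.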
computeThreadGroupRecommendations()
Manages the whole process of computing thread and group recommendations for users. Makes a series of calls to handle parts of this computation before synthesizing the result
public
computeThreadGroupRecommendations() : mixed
Return values
mixed
computeWikiResourceEmbeddings()
Computes the embeddings for wiki page resources using the calculated term embeddings and adds the metadata details separately to the embeddings
public
computeWikiResourceEmbeddings(array<string|int, mixed> $resource_terms, array<string|int, mixed> $meta_details_terms) : array<string|int, mixed>
Parameters
- $resource_terms : array<string|int, mixed>
-
of processed terms from resource description
- $meta_details_terms : array<string|int, mixed>
-
of raw resource descriptions
Return values
array<string|int, mixed> —$updated_item_embeddings array of updated wiki resource embeddings
computeWikiResourceRecommendations()
Manages the whole process of computing wiki resource recommendations for users. Makes a series of calls to handle parts of this computation before synthesizing the result
public
computeWikiResourceRecommendations() : mixed
Return values
mixed
computeWikiTermEmbeddings()
Computes the embedding for new terms in the description of wiki resources and updates the embedding of existing terms using the Hash2Vec approach
public
computeWikiTermEmbeddings(array<string|int, mixed> $descriptions) : array<string|int, mixed>
Parameters
- $descriptions : array<string|int, mixed>
-
of resources
Return values
array<string|int, mixed> —[$resource_terms, $meta_details_term]
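As a rough sketch of the Hash2Vec idea, each term accumulates contributions from the words in its context window: one hash of the context word picks a vector slot and a second hash picks the sign. The code below uses `md5` and `crc32` to mirror HASH_ALGORITHM and SIGN_HASH_ALGORITHM, and a window of 5 to mirror CONTEXT_WINDOW_LENGTH. The embedding size of 16, the 1/distance weighting, how the digests are reduced to an index and a sign, and all function names are assumptions for illustration, not details confirmed by this reference.

```php
<?php
const SKETCH_CONTEXT_WINDOW_LENGTH = 5; // mirrors CONTEXT_WINDOW_LENGTH
const SKETCH_EMBEDDING_SIZE = 16;       // hypothetical vector length

/** Maps a term to a vector slot using the leading hex digits of its md5. */
function hashToIndex(string $term): int
{
    return hexdec(substr(hash("md5", $term), 0, 7)) % SKETCH_EMBEDDING_SIZE;
}

/** Maps a term to +1 or -1 using the parity of its crc32's last hex digit. */
function hashToSign(string $term): int
{
    return (hexdec(substr(hash("crc32", $term), -1)) % 2 === 0) ? 1 : -1;
}

/**
 * Builds one embedding per distinct term. For each occurrence, every
 * context word within the window adds a signed weight to the slot
 * chosen by hashing that context word.
 */
function hash2VecSketch(array $terms): array
{
    $terms = array_values($terms);
    $count = count($terms);
    $embeddings = [];
    foreach ($terms as $i => $term) {
        if (!isset($embeddings[$term])) {
            $embeddings[$term] = array_fill(0, SKETCH_EMBEDDING_SIZE, 0.0);
        }
        $start = max(0, $i - SKETCH_CONTEXT_WINDOW_LENGTH);
        $end = min($count - 1, $i + SKETCH_CONTEXT_WINDOW_LENGTH);
        for ($j = $start; $j <= $end; $j++) {
            if ($j === $i) {
                continue;
            }
            $context = $terms[$j];
            // Assumed weighting: nearer context words count for more.
            $weight = 1.0 / abs($i - $j);
            $embeddings[$term][hashToIndex($context)] +=
                hashToSign($context) * $weight;
        }
    }
    return $embeddings;
}
```

Since the slot and sign come from hashes rather than a learned vocabulary, new terms can be embedded and existing ones updated incrementally without retraining.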
computeWikiUserEmbeddings()
Computes user embeddings for wiki resources based on the user's resource impressions logged in the ITEM_IMPRESSION_SUMMARY table for the defined update period
public
computeWikiUserEmbeddings(array<string|int, mixed> $item_embeddings) : array<string|int, mixed>
Parameters
- $item_embeddings : array<string|int, mixed>
-
of wiki page resources embedding
Return values
array<string|int, mixed> —[$user_embeddings, $user_items] of user embeddings for wiki resources and the user resource impression
computeWikiUserRecommendations()
Computes the wiki resource recommendations based on cosine similarity between resource embeddings and user embeddings
public
computeWikiUserRecommendations(array<string|int, mixed> $item_embeddings, array<string|int, mixed> $user_embeddings, array<string|int, mixed> $user_items, mixed $resource_metadata) : mixed
Parameters
- $item_embeddings : array<string|int, mixed>
-
of wiki resources embeddings
- $user_embeddings : array<string|int, mixed>
-
of users consumed wiki resources embeddings
- $user_items : array<string|int, mixed>
-
of users consumed wiki resources
- $resource_metadata : mixed
Return values
mixed
doTasks()
This method is run on a MediaUpdater client with data obtained from the name server by getTasks. The idea is that the client is supposed to then process this information and, if need be, send the results back to the name server
public
doTasks(array<string|int, mixed> $tasks) : mixed
Parameters
- $tasks : array<string|int, mixed>
-
data that the MediaJob running on a client MediaUpdater needs to process
Return values
mixed —the result of carrying out that processing
execNameServer()
Executes a method on the name server's JobController.
public
static execNameServer(string $command[, string $args = null ]) : array<string|int, mixed>
It will typically execute either getTask or putTask for a specific MediaJob, or getUpdateProperties to find out how the current MediaUpdater should be configured.
Parameters
- $command : string
-
the method to invoke on the name server
- $args : string = null
-
additional arguments to be passed to the name server
Return values
array<string|int, mixed> —data returned by the name server.
finishTasks()
This method is called on the name server to finish processing any data returned by MediaUpdater clients.
public
finishTasks() : mixed
Return values
mixed
getCurrentMachine()
Returns a hash of the url of the current machine based on the value saved to self::current_machine_info_file by a machine statuses request
public
static getCurrentMachine() : string
Return values
string —hash of current machine url
getDescriptionFiles()
Returns all the resource description files in a given thumb folder, recursively scanning through subfolders if any
public
getDescriptionFiles(string $thumb_folder) : array<string|int, mixed>
Parameters
- $thumb_folder : string
-
path of a thumb folder
Return values
array<string|int, mixed> —$files list of description files path in given folder
getJobName()
Gets the class name (less namespace and the word Job) of the current MediaJob
public
static getJobName() : string
Return values
string —name of the current job
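The name derivation described above amounts to dropping the namespace prefix and a trailing "Job". A small sketch, assuming a hypothetical helper `jobNameFromClass` that takes a fully qualified class name (the real method is static and presumably inspects the calling class itself):

```php
<?php
/**
 * Strips the namespace and a trailing "Job" from a class name.
 * Class names that do not end in "Job" are returned without the
 * namespace but otherwise unchanged.
 */
function jobNameFromClass(string $class_name): string
{
    $pos = strrpos($class_name, "\\");
    $short_name = ($pos === false) ? $class_name
        : substr($class_name, $pos + 1);
    if (substr($short_name, -3) === "Job") {
        return substr($short_name, 0, -3);
    }
    return $short_name;
}
```

For this class, the result would be "Recommendation". The namespace used in the checks for this sketch is made up.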
getTasks()
Method called from JobController when a MediaUpdater client contacts the name server's web app. This method is supposed to marshal any data on the name server that the requesting client should process.
public
getTasks(int $machine_id[, array<string|int, mixed> $data = null ]) : array<string|int, mixed>
Parameters
- $machine_id : int
-
id of client requesting data
- $data : array<string|int, mixed> = null
-
any additional info about data being requested
Return values
array<string|int, mixed> —work for the client to process
getTermEmbedding()
Returns the term embedding either from LRU cache or database
public
getTermEmbedding(int $term_id, int $item_type[, bool $update = false ]) : string
Parameters
- $term_id : int
- $item_type : int
- $update : bool = false
-
indicates whether to update the cache
Return values
string —$term_embedding
getWikiResourceDescriptions()
Fetches the description for the eligible wiki resources having the root folder path captured in RECOMMENDATION_FILE
public
getWikiResourceDescriptions() : array<string|int, mixed>
Return values
array<string|int, mixed> —$descriptions of resources
init()
Sets up the database connection so the job can access tables related to recommendations. Initializes timing info related to the job.
public
init() : mixed
Return values
mixed
initializeNewUserRecommendations()
Computes recommendations for users who have yet to receive any recommendation of the given type based on what is the most popular recommendation
public
initializeNewUserRecommendations() : mixed
Return values
mixed
nondistributedTasks()
For now analytics update is only done on name server as Yioop currently only supports one DBMS at a time.
public
nondistributedTasks() : mixed
Return values
mixed
prepareTasks()
This method is called on the name server to prepare data for any MediaUpdater clients.
public
prepareTasks() : mixed
Return values
mixed
putTasks()
After a MediaUpdater client is done with the task given to it by the name server's media updater, the client contacts the name server's web app. The name server's web app's JobController then calls this method to receive the data on the name server
public
putTasks(int $machine_id, mixed $data) : array<string|int, mixed>
Parameters
- $machine_id : int
-
id of client that is sending data to name server
- $data : mixed
-
results of computation done by client
Return values
array<string|int, mixed> —any response information to send back to the client
run()
Method executed by MediaUpdater to perform the MediaJob. This method shouldn't need to be overridden. Instead, the various callbacks it calls (listed in the class description) should be overridden.
public
run() : mixed
Return values
mixed
saveTermEmbeddingsCacheToDb()
Writes the cached term embeddings back to the database and frees up memory
public
saveTermEmbeddingsCacheToDb(int $item_type) : mixed
Parameters
- $item_type : int
-
value for ITEM_TYPE column
Return values
mixed
updateTermEmbeddingCache()
Updates the LRU cache of term embeddings and saves the evicted embedding back to the database
public
updateTermEmbeddingCache(int $term_id, string $term_embedding, int $item_type[, mixed $message = "" ]) : mixed
Parameters
- $term_id : int
- $term_embedding : string
- $item_type : int
- $message : mixed = ""
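The cache-then-write-back behaviour described for updateTermEmbeddingCache() and saveTermEmbeddingsCacheToDb() can be sketched with a small least-recently-used cache. Everything here is illustrative: `LruCacheSketch` and `updateCacheSketch` are hypothetical names, a plain PHP array stands in for the term-embedding table, and the real LRUCache class and database calls are not shown in this reference.

```php
<?php
/** A tiny LRU cache that relies on PHP arrays preserving insertion order. */
class LruCacheSketch
{
    private array $items = [];
    private int $capacity;

    public function __construct(int $capacity)
    {
        $this->capacity = $capacity;
    }

    /** Returns the cached value (marking it most recent) or null. */
    public function get($key)
    {
        if (!array_key_exists($key, $this->items)) {
            return null;
        }
        $value = $this->items[$key];
        unset($this->items[$key]);
        $this->items[$key] = $value; // re-insert at the most-recent end
        return $value;
    }

    /** Stores a value; returns the evicted [key, value] pair or null. */
    public function put($key, $value): ?array
    {
        unset($this->items[$key]);
        $this->items[$key] = $value;
        if (count($this->items) <= $this->capacity) {
            return null;
        }
        $oldest_key = array_key_first($this->items);
        $evicted = [$oldest_key, $this->items[$oldest_key]];
        unset($this->items[$oldest_key]);
        return $evicted;
    }
}

/**
 * Puts an embedding in the cache and, if that evicts an older entry,
 * writes the evicted entry to $database (an array standing in for a table).
 */
function updateCacheSketch(LruCacheSketch $cache, array &$database,
    int $term_id, string $term_embedding): void
{
    $evicted = $cache->put($term_id, $term_embedding);
    if ($evicted !== null) {
        [$old_id, $old_embedding] = $evicted;
        $database[$old_id] = $old_embedding;
    }
}
```

Writing only on eviction keeps database traffic low while a batch is processed; a final pass over whatever remains in the cache would then play the role of saveTermEmbeddingsCacheToDb().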