IndexingPlugin
Base indexing plugin class. An indexing plugin lets a developer do additional processing on web pages during a crawl and then, once the crawl is over, post-process the additional data that was collected. For example, during a crawl a plugin might analyze web pages and mark those that contain recipes with the meta word recipe:all; after the crawl, it might cluster the recipes found and add further meta words so that recipes can be retrieved by principal ingredient.
Yioop ships with two example subclasses of IndexingPlugin to illustrate how to write plugins: recipe_plugin.php and word_filter.php.
Subclasses of IndexingPlugin typically override some of the following four methods:
static getProcessors() -- returns an array of the names of the page processors with which the plugin should be used. For example, a plugin might want to alter the summary whenever an HtmlProcessor is used on a page, so this array should contain HtmlProcessor; if the plugin has nothing to alter when the JpgProcessor is in use, the returned array shouldn't contain JpgProcessor.
pageProcessing($page, $url) -- called by a page processor when a page is being processed. It returns additional subdoc page summary info, which is then handed back to the fetcher (see the pageProcessing method below for more info).
pageSummaryProcessing(&$summary, $url) -- called by a page processor in a fetcher after the initial summary has been generated (by the processor itself and all plugins associated with the processor). This method can be used to further modify the summary.
getAdditionalMetaWords() -- called when meta words are extracted from a query at search time. This allows the plugin to specify its own meta words to be extracted from the query. See getAdditionalMetaWords below for more details on the return type of this method.
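As a rough sketch of how the two static methods fit into a subclass, the following defines a hypothetical GlossaryPlugin. So that it runs on its own, it declares a stand-in IndexingPlugin base class; in Yioop you would extend the real base class instead. The plugin name, the glossary: meta word, the length 2000, and the pluginApplies() helper are all illustrative assumptions, not part of Yioop.

```php
<?php
// Stand-in for Yioop's base class so this sketch runs by itself.
abstract class IndexingPlugin
{
    public static function getProcessors() { return []; }
    public static function getAdditionalMetaWords() { return []; }
    public function pageProcessing($page, $url) { return []; }
    public function pageSummaryProcessing(&$summary, $url) {}
}

// Hypothetical plugin; the name and meta word are made up.
class GlossaryPlugin extends IndexingPlugin
{
    // Run only with the HTML page processor, not with image processors.
    public static function getProcessors()
    {
        return ["HtmlProcessor"];
    }
    // Meta word => maximum snippet length shown for it in results.
    public static function getAdditionalMetaWords()
    {
        return ["glossary:" => 2000];
    }
    // pageProcessing() and pageSummaryProcessing() are inherited here
    // as no-ops; override them only if the plugin needs those hooks.
}

// Illustrative helper (not part of Yioop): does a plugin want to be
// used with the named page processor?
function pluginApplies($plugin_class, $processor_name)
{
    return in_array($processor_name, $plugin_class::getProcessors());
}

var_dump(pluginApplies("GlossaryPlugin", "HtmlProcessor")); // bool(true)
var_dump(pluginApplies("GlossaryPlugin", "JpgProcessor"));  // bool(false)
```

Because both methods are static, a caller can query them from the class name alone, without constructing a plugin instance.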
If you would like to write a plugin which can be configured on the Admin > Page Options page, then you need to write four other methods:
loadConfiguration() -- reads plugin configuration data from persistent storage on the name server into an array or object when a crawl is started. This data is then automatically serialized and sent to queue servers as part of starting a crawl.
setConfiguration() -- takes a configuration array or object and uses it to initialize an instance of the plugin on a queue_server or on a fetcher.
configureHandler(&$data) -- called by the AdminController pageOptions activity method to let the plugin handle any configuration $_REQUEST data sent by this activity with regard to the plugin, and also to let the plugin modify the $data which might be sent to the plugin's view. This method would typically be called on the name server, and so can be used to save (or to call a method which saves) any configuration data extracted from the request.
configureView(&$data) -- called to draw the HTML configure screen used by the plugin, given the information in $data. This might display a form that a user would use to alter the behavior of the plugin.
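The four configuration methods above can be sketched as a round trip: save settings on the name server, load them when a crawl starts, and apply them on a fetcher. This is a minimal sketch under stated assumptions, not how Yioop's bundled plugins do it: the StopWordPlugin class, the stop_words field, and the JSON file in the system temp directory are all hypothetical, and the IndexingPlugin base class is left out so the example runs alone.

```php
<?php
// Hypothetical configurable plugin. In Yioop this class would extend
// IndexingPlugin; the base class is omitted so the sketch runs alone.
class StopWordPlugin
{
    // Words to strip; filled in by setConfiguration().
    public $stop_words = [];

    // Assumed storage location; a real plugin may store settings elsewhere.
    public function configFile()
    {
        return sys_get_temp_dir() . "/stop_word_plugin.json";
    }
    // Name server: read saved settings when a crawl starts.
    public function loadConfiguration()
    {
        $file = $this->configFile();
        if (!file_exists($file)) {
            return ["stop_words" => []];
        }
        $config = json_decode(file_get_contents($file), true);
        return is_array($config) ? $config : ["stop_words" => []];
    }
    // Queue server or fetcher: initialize from the received settings.
    public function setConfiguration($configuration)
    {
        $this->stop_words = $configuration["stop_words"] ?? [];
    }
    // Name server: handle the submitted form and save the settings.
    public function configureHandler(&$data)
    {
        if (isset($_REQUEST["stop_words"])) {
            $words = preg_split('/\s+/', trim($_REQUEST["stop_words"]),
                -1, PREG_SPLIT_NO_EMPTY);
            file_put_contents($this->configFile(),
                json_encode(["stop_words" => $words]));
        }
        $data["stop_words"] = $this->loadConfiguration()["stop_words"];
    }
    // Draw a minimal configuration form from $data.
    public function configureView(&$data)
    {
        $words = htmlspecialchars(implode(" ", $data["stop_words"]),
            ENT_QUOTES);
        echo "<form method='post'><input name='stop_words' value='" .
            $words . "'><button>Save</button></form>\n";
    }
}

// Simulate a form submission on the name server.
$_REQUEST["stop_words"] = "the  of and";
$name_server_plugin = new StopWordPlugin();
$data = [];
$name_server_plugin->configureHandler($data);
$name_server_plugin->configureView($data);

// Simulate a fetcher receiving the loaded configuration.
$fetcher_plugin = new StopWordPlugin();
$fetcher_plugin->setConfiguration($name_server_plugin->loadConfiguration());
print_r($fetcher_plugin->stop_words);
```

In a real deployment the serialization and transfer between loadConfiguration() and setConfiguration() is done by Yioop; the direct call in the last lines only imitates that hand-off.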
Subclasses of IndexingPlugin stored in APP_DIR/library/indexing_plugins will be detected by Yioop, so you can add your own plugins there. This makes it easier to upgrade Yioop: your site-specific code stays in the work directory, and you merely need to replace the Yioop folder when upgrading.
Table of Contents
- $db : object
- Reference to a database object that might be used by models on this plugin
- $index_archive : object
- The IndexArchiveBundle object that this indexing plugin might make changes to in its postProcessing method
- __construct() : mixed
- Builds an IndexingPlugin object. Loads in the appropriate models for the given plugin object
- getAdditionalMetaWords() : array<string|int, mixed>
- Returns an associative array of meta word => description length pairs for each meta word injected by this plugin into an index. The description length specifies the maximum length of the web snippet shown in search results for this meta word.
- getProcessors() : array<string|int, mixed>
- Returns a list of page processors that can use this plugin
- pageProcessing() : array<string|int, mixed>
- This method is called by a PageProcessor in its handle() method just after it has processed a web page. This method allows an indexing plugin to do additional processing on the page such as adding sub-documents, before the page summary is handed back to the fetcher.
- pageSummaryProcessing() : mixed
- Optionally modifies the page summary array produced by the PageProcessor handle method in place. This hook provides a way to easily modify the title, description, and meta words of a page. Only the PAGE, CRAWL_DELAY, ROBOT_PATHS, ROBOT_METAS, AGENT_LIST, TITLE, DESCRIPTION, META_WORDS, LANG, LINKS, and THUMB fields of the summary will be respected. If you add custom meta words, then you must define them in the getAdditionalMetaWords function for this plugin, or they will not be recognized in queries.
- postProcessing() : mixed
- This method is called by the queue_server with the name of a completed index. This allows the indexing plugin to perform searches on the index and, using the results, inject new page/index data into the index before it becomes available for end use.
Properties
$db
Reference to a database object that might be used by models on this plugin
public
object
$db
$index_archive
The IndexArchiveBundle object that this indexing plugin might make changes to in its postProcessing method
public
object
$index_archive
Methods
__construct()
Builds an IndexingPlugin object. Loads in the appropriate models for the given plugin object
public
__construct() : mixed
Return values
mixed
getAdditionalMetaWords()
Returns an associative array of meta word => description length pairs for each meta word injected by this plugin into an index. The description length specifies the maximum length of the web snippet shown in search results for this meta word.
public
static getAdditionalMetaWords() : array<string|int, mixed>
Return values
array<string|int, mixed> —meta words => description length pairs
getProcessors()
Returns a list of page processors that can use this plugin
public
static getProcessors() : array<string|int, mixed>
Return values
array<string|int, mixed> —string names of page processors that this plugin associates with
pageProcessing()
This method is called by a PageProcessor in its handle() method just after it has processed a web page. This method allows an indexing plugin to do additional processing on the page such as adding sub-documents, before the page summary is handed back to the fetcher.
public
pageProcessing(string $page, string $url) : array<string|int, mixed>
Parameters
- $page : string
- web-page contents
- $url : string
- the url where the page contents came from, used to canonicalize relative links
Return values
array<string|int, mixed> —consisting of a sequence of subdoc arrays found on the given page. Each subdoc array has a self::TITLE and a self::DESCRIPTION
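A minimal sketch of a pageProcessing() override that returns such a sequence of subdoc arrays follows. The SectionPlugin class and the rule "one subdoc per h2 heading followed by a paragraph" are hypothetical, and the TITLE and DESCRIPTION constants are placeholders standing in for the ones the real base class provides; $url is accepted but unused because this sketch does not resolve links.

```php
<?php
// Hypothetical plugin class; in Yioop it would extend IndexingPlugin and
// use the TITLE and DESCRIPTION constants the base class provides.
class SectionPlugin
{
    const TITLE = "title";             // placeholder value
    const DESCRIPTION = "description"; // placeholder value

    // Return one subdoc per <h2> heading followed by a <p> paragraph.
    public function pageProcessing($page, $url)
    {
        $subdocs = [];
        $pattern = '/<h2[^>]*>(.*?)<\/h2>\s*<p[^>]*>(.*?)<\/p>/is';
        if (preg_match_all($pattern, $page, $matches, PREG_SET_ORDER)) {
            foreach ($matches as $match) {
                $subdocs[] = [
                    self::TITLE => trim(strip_tags($match[1])),
                    self::DESCRIPTION => trim(strip_tags($match[2])),
                ];
            }
        }
        return $subdocs;
    }
}

$page = "<h2>Pancakes</h2><p>Mix <b>flour</b> and milk.</p>" .
    "<h2>Omelette</h2>\n<p>Beat two eggs.</p>";
$plugin = new SectionPlugin();
$subdocs = $plugin->pageProcessing($page, "https://example.com/recipes");
print_r($subdocs);
```

A regular expression keeps the example short; a plugin that has to cope with real-world markup would more likely walk a parsed DOM.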
pageSummaryProcessing()
Optionally modifies the page summary array produced by the PageProcessor handle method in place. This hook provides a way to easily modify the title, description, and meta words of a page. Only the PAGE, CRAWL_DELAY, ROBOT_PATHS, ROBOT_METAS, AGENT_LIST, TITLE, DESCRIPTION, META_WORDS, LANG, LINKS, and THUMB fields of the summary will be respected. If you add custom meta words, then you must define them in the getAdditionalMetaWords function for this plugin, or they will not be recognized in queries.
public
pageSummaryProcessing(array<string|int, mixed> &$summary, string $url) : mixed
Parameters
- $summary : array<string|int, mixed>
- the summary data produced by the relevant page processor's handle method; modified in-place.
- $url : string
- the url where the summary contents came from
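The in-place modification described for this method might look like the sketch below. It is illustrative only: the RecipeTagPlugin class is hypothetical, the constants are placeholders for the summary field constants of the real base class, storing META_WORDS as an array of strings is an assumption, and the "mentions ingredients" test is a deliberately crude heuristic. The custom meta word is also declared in getAdditionalMetaWords() so that it can be recognized in queries.

```php
<?php
// Hypothetical plugin class; in Yioop it would extend IndexingPlugin and
// use the summary field constants the base class provides.
class RecipeTagPlugin
{
    const TITLE = "title";             // placeholder value
    const DESCRIPTION = "description"; // placeholder value
    const META_WORDS = "meta_words";   // placeholder value

    // Declare the custom meta word so queries can recognize it.
    public static function getAdditionalMetaWords()
    {
        return ["recipe:" => 2000]; // length chosen arbitrarily
    }
    // Tidy the title and tag summaries that mention ingredients.
    public function pageSummaryProcessing(&$summary, $url)
    {
        $title = trim($summary[self::TITLE] ?? "");
        // Fall back to the host name when the page has no usable title.
        $summary[self::TITLE] = ($title !== "") ? $title :
            (string) parse_url($url, PHP_URL_HOST);
        $description = $summary[self::DESCRIPTION] ?? "";
        if (stripos($description, "ingredients") !== false) {
            // Assumes META_WORDS holds an array of meta word strings.
            $summary[self::META_WORDS][] = "recipe:all";
        }
    }
}

$summary = ["title" => "  ", "description" => "Ingredients: flour, milk"];
$plugin = new RecipeTagPlugin();
$plugin->pageSummaryProcessing($summary, "https://example.com/pancakes");
print_r($summary);
```

Because $summary is passed by reference, the method returns nothing; the caller sees the changed title and the added meta word in its own array.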
Return values
mixed
postProcessing()
This method is called by the queue_server with the name of a completed index. This allows the indexing plugin to perform searches on the index and, using the results, inject new page/index data into the index before it becomes available for end use.
public
postProcessing(string $index_name) : mixed
Parameters
- $index_name : string
- the name/timestamp of an IndexArchiveBundle to do post processing for