WordfilterPlugin
extends IndexingPlugin
implements
CrawlConstants
WordFilterPlugin is used to filter documents by terms during a crawl.
When this plugin is in use, each document summary generated by a TextProcessor or one of its subclasses during a crawl is further processed by the plugin's pageSummaryProcessing method. First, a set of applicable rules is computed based on the url the summary came from (see the documentation in the factory example for more on how the applicable rules are determined). Then the summary's title and description are sent to the method checkFilter, where they are compared against the array of rules $this->filter_rules. Each rule has a PRECONDITIONS and an ACTIONS field. Actions can either be directives that might appear within a ROBOTS meta tag of an HTML document (NOINDEX, NOFOLLOW, NOCACHE, NOARCHIVE, NOODP, NOYDIR, NONE) or one of the words NOPROCESS, JUSTFOLLOW, NOTCONTAIN. The preconditions are checked in the function checkFilter. Details on what constitutes a legal precondition are given in the $filter_rules and $rules_string documentation.
Usually, if checkFilter returns true, pageSummaryProcessing adds the meta tags to the document summary and returns. If one of the actions is NOTCONTAIN, the meta tags are added only if checkFilter returns false. The crawl makes use of the meta word info when performing indexing. If the actions contain NOPROCESS, the summary returned from pageSummaryProcessing will be false, which prevents any indexing of the document from occurring at all. If the actions contain JUSTFOLLOW, the document won't be stored in the index but links from it will be followed.
JUSTFOLLOW has slightly different semantics than NOINDEX. When NOINDEX is used, the document is actually stored in the index (unlike with JUSTFOLLOW). If another document links to this document, this can be detected. If at search time a NOINDEX document or a link to a NOINDEX document is about to be returned, the NOINDEX is detected and the result won't be returned. With JUSTFOLLOW, since the data is not stored in the index, we can't tell whether a link points to a page that just hasn't been crawled yet or to a JUSTFOLLOW page, so links to JUSTFOLLOW pages might appear in the index. One can see this effect by doing a search on site:any. The link that found the p7.html page shows up.
This plugin has been created with a dummy list of filter rules. By doing a crawl on the test site contained in the archive tests/word-filter-test-crawl.zip one can test how it behaves on those terms. To make use of this plugin on real web data one probably wants to alter the choice of words. This can be done from the Admin > Page Options > Crawl Time tab by clicking on the Configure link next to the plugin. Alternatively, one could subclass this plugin in APP_DIR/library/indexing_plugins with a different $filter_rules array. To get a more sophisticated filtering process than a precondition checker, one would override checkFilter. One can also directly modify the code below to achieve these effects, but altering code under the BASE_DIR makes it slightly harder to upgrade to newer versions of Yioop as they come out.
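The way actions are applied, as described above, can be illustrated with a small standalone sketch. It is not the plugin's code: the function name sketchApplyActions and its return convention (false for a dropped summary, otherwise the list of directives to attach) are assumptions made for illustration, as is the choice to let NOPROCESS take effect only when the rule fires.

```php
<?php
/**
 * Hypothetical sketch of how one rule's actions might be applied.
 * Not taken from the plugin; names and return convention are assumptions.
 *
 * @param array $actions action words of a single rule
 * @param bool $filter_matched what a precondition check returned
 * @return array|bool false if the summary should be dropped, otherwise
 *      the robots-style directives to attach (possibly empty)
 */
function sketchApplyActions(array $actions, bool $filter_matched)
{
    // NOTCONTAIN inverts the test: the rule fires when the check fails
    $fires = in_array("NOTCONTAIN", $actions, true)
        ? !$filter_matched : $filter_matched;
    if (!$fires) {
        return [];
    }
    // NOPROCESS means no summary is returned, so nothing gets indexed
    if (in_array("NOPROCESS", $actions, true)) {
        return false;
    }
    // NOTCONTAIN is a modifier rather than a directive, so leave it out
    return array_values(array_diff($actions, ["NOTCONTAIN"]));
}

var_dump(sketchApplyActions(["NOINDEX", "NOFOLLOW"], true));
var_dump(sketchApplyActions(["NOTCONTAIN", "NOINDEX"], false));
var_dump(sketchApplyActions(["NOPROCESS"], true));
```

In this sketch an empty array means the rule did not fire, so the summary would be left unchanged.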
Interfaces, Classes, Traits and Enums
- CrawlConstants
- Shared constants and enums used by components that are involved in the crawling process
Table of Contents
- $db : object
- Reference to a database object that might be used by models on this plugin
- $default_rules_string : string
- Default rule string to be used if no other rules string is present
- $filter_rules : array<string|int, mixed>
- An array of rules. A rule is itself an array with two fields, PRECONDITIONS and ACTIONS. ACTIONS is an array with elements from NOINDEX, NOFOLLOW, NOCACHE, NOARCHIVE, NOODP, NOYDIR, NONE, NOTCONTAIN, JUSTFOLLOW, and NOPROCESS, which are to be followed if the PRECONDITIONS for the rule are met. PRECONDITIONS is an array of term => frequency pairs: term is a term to check for in the document, and frequency indicates how often the term must appear for the condition to hold. An integer frequency greater than or equal to 1 is treated as the raw count of occurrences required; a value between 0 and 1 is treated as the fraction of the document that must be made up of occurrences of that term. The array in $this->filter_rules is typically created by calling $this->parseRules(), which converts the string in $this->rules_string into the format described above.
- $index_archive : object
- The IndexArchiveBundle object that this indexing plugin might make changes to in its postProcessing method
- $rules_string : string
- A string containing a parsable set of filter_rules to be used by the WordFilterPlugin. The format of these rules is described in the default value of this rule string below.
- __construct() : mixed
- Sets up the default word string for the word plugin
- checkFilter() : bool
- Used to check if $preconditions are met by a supplied string.
- configureHandler() : mixed
- Behaves as a "controller" for the configuration page of the plugin.
- configureView() : mixed
- Used to draw the HTML configure screen for the word filter plugin.
- getAdditionalMetaWords() : array<string|int, mixed>
- Returns an associative array of meta words => description length for each meta word injected by this plugin into an index. The description length specifies the maximum length of the web snippet shown in search results for this meta word.
- getProcessors() : array<string|int, mixed>
- Which mime type page processors this plugin should do additional processing for
- loadConfiguration() : array<string|int, mixed>
- Reads plugin configuration data from data/word_filter_plugin.txt on the name server into $this->rules_string, then parses this string into $this->filter_rules, the format used by $this->pageSummaryProcessing(&$summary)
- loadDefaultConfiguration() : array<string|int, mixed>
- Reads plugin configuration data from the default setting of this plugin, then parses this string into $this->filter_rules, the format used by $this->pageSummaryProcessing(&$summary)
- pageProcessing() : array<string|int, mixed>
- This method is called by a PageProcessor in its handle() method just after it has processed a web page. This method allows an indexing plugin to do additional processing on the page such as adding sub-documents, before the page summary is handed back to the fetcher.
- pageSummaryProcessing() : mixed
- This method adds robots metas to, or removes entirely, a summary produced by a text page processor or its subclasses, depending on whether the summary title and description satisfy various rules in $this->filter_rules
- parseRules() : mixed
- Parses the rules in the string $this->rules_string into the array format stored in $this->filter_rules. $this->filter_rules is used when $this->pageSummaryProcessing(&$summary) is called.
- postProcessing() : mixed
- This method is called by the queue_server with the name of a completed index. This allows the indexing plugin to perform searches on the index and using the results, inject new page/index data into the index before it becomes available for end use.
- saveConfiguration() : mixed
- Saves $this->rules_string, the field containing the string form of the rules being used with this plugin, to a file
- serializeRules() : mixed
- This is used to convert the array in $this->filter_rules into a string format in $this->rules_string which would be suitable for saving to disk or displaying on the configuration page.
- setConfiguration() : mixed
- Takes a configuration array of rules and sets them as the rules for this instance of the plugin. Typically used on a queue_server or on a fetcher. It first sets the value of $this->filter_rules; then, in case saveConfiguration() is called later, it also calls serializeRules to store the serial format in $this->rules_string
Properties
$db
Reference to a database object that might be used by models on this plugin
public
object
$db
$default_rules_string
Default rule string to be used if no other rules string is present
public
string
$default_rules_string
$filter_rules
An array of rules. A rule is itself an array with two fields, PRECONDITIONS and ACTIONS. ACTIONS is an array with elements from NOINDEX, NOFOLLOW, NOCACHE, NOARCHIVE, NOODP, NOYDIR, NONE, NOTCONTAIN, JUSTFOLLOW, and NOPROCESS, which are to be followed if the PRECONDITIONS for the rule are met. PRECONDITIONS is an array of term => frequency pairs: term is a term to check for in the document, and frequency indicates how often the term must appear for the condition to hold. An integer frequency greater than or equal to 1 is treated as the raw count of occurrences required; a value between 0 and 1 is treated as the fraction of the document that must be made up of occurrences of that term. The array in $this->filter_rules is typically created by calling $this->parseRules(), which converts the string in $this->rules_string into the format described above.
public
array<string|int, mixed>
$filter_rules
= []
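As an illustration of the format just described, the following sketch builds a two-rule array by hand and reports how each frequency would be interpreted. The terms, thresholds, and the use of plain string keys "PRECONDITIONS" and "ACTIONS" are assumptions made for this example; the plugin itself may key these fields with constants, and it normally produces the array through parseRules().

```php
<?php
// Hypothetical example data: terms, thresholds, and the plain string keys
// are illustrative assumptions, not rules shipped with the plugin.
$filter_rules = [
    [
        // integer >= 1: "casino" must occur at least 3 times
        "PRECONDITIONS" => ["casino" => 3],
        "ACTIONS" => ["NOINDEX", "NOFOLLOW"],
    ],
    [
        // value between 0 and 1: "lottery" must make up 5% of the document
        "PRECONDITIONS" => ["lottery" => 0.05],
        "ACTIONS" => ["JUSTFOLLOW"],
    ],
];

foreach ($filter_rules as $rule) {
    foreach ($rule["PRECONDITIONS"] as $term => $frequency) {
        // Mirror the documented reading of the frequency value
        $kind = ($frequency >= 1) ? "raw count" : "fraction of document";
        echo "$term => $frequency ($kind); actions: " .
            implode(", ", $rule["ACTIONS"]) . "\n";
    }
}
```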
$index_archive
The IndexArchiveBundle object that this indexing plugin might make changes to in its postProcessing method
public
object
$index_archive
$rules_string
A string containing a parsable set of filter_rules to be used by the WordFilterPlugin. The format of these rules is described in the default value of this rule string below.
public
string
$rules_string
= ""
Methods
__construct()
Sets up the default word string for the word plugin
public
__construct() : mixed
Return values
mixed
checkFilter()
Used to check if $preconditions are met by a supplied string.
public
checkFilter(string $preconditions, string $test_string) : bool
See $filter_rules for what constitutes a valid precondition.
Parameters
- $preconditions : string
- the terms and their frequencies to search for
- $test_string : string
- string in which to check whether the preconditions are met
Return values
bool —whether the summary should be filtered or not
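A precondition check of this kind might look roughly like the sketch below. It is a standalone stand-in, not the plugin's implementation: the function name sketchCheckFilter is hypothetical, $preconditions is assumed to be an array of term => frequency pairs (as in the $filter_rules description), all pairs are assumed to be required, matching is done by case-insensitive substring, and document size is assumed to be measured in whitespace-separated words.

```php
<?php
/**
 * Hypothetical stand-in for a precondition check; not the plugin's code.
 *
 * @param array $preconditions assumed term => frequency pairs
 * @param string $test_string text to check the preconditions against
 * @return bool true if every pair holds (an assumption of this sketch)
 */
function sketchCheckFilter(array $preconditions, string $test_string): bool
{
    $text = strtolower($test_string);
    // Assumption: document size is counted in whitespace-separated words
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    $num_words = count($words);
    foreach ($preconditions as $term => $frequency) {
        $term = strtolower((string) $term);
        if ($term === "") {
            continue; // nothing to search for
        }
        $count = substr_count($text, $term);
        if ($frequency >= 1) {
            // Frequency >= 1: raw number of occurrences required
            if ($count < $frequency) {
                return false;
            }
        } else {
            // Frequency between 0 and 1: required fraction of the document
            if ($num_words === 0 || $count / $num_words < $frequency) {
                return false;
            }
        }
    }
    return true;
}

var_dump(sketchCheckFilter(["spam" => 2], "spam and more spam"));
var_dump(sketchCheckFilter(["spam" => 0.6], "spam and more spam"));
```

Overriding checkFilter in a subclass would be the place to substitute a stricter notion of matching, such as whole-word or stemmed comparison.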
configureHandler()
Behaves as a "controller" for the configuration page of the plugin.
public
configureHandler(array<string|int, mixed> &$data) : mixed
It is called by the AdminController pageOptions activity method to let the plugin handle any configuration $_REQUEST data sent by this activity with regard to the plugin. This method sees if the $_REQUEST has word filter plugin configuration data, and if so cleans and saves it. It then modifies $data so that if the plugin's configuration view is drawn it makes use of the current plugin configuration info.
Parameters
- $data : array<string|int, mixed>
- info to be used by the admin view to draw itself.
Return values
mixed
configureView()
Used to draw the HTML configure screen for the word filter plugin.
public
configureView(array<string|int, mixed> &$data) : mixed
Parameters
- $data : array<string|int, mixed>
- contains configuration data to be used in drawing the view
Return values
mixed
getAdditionalMetaWords()
Returns an associative array of meta words => description length for each meta word injected by this plugin into an index. The description length specifies the maximum length of the web snippet shown in search results for this meta word.
public
static getAdditionalMetaWords() : array<string|int, mixed>
Return values
array<string|int, mixed> —meta words => description length pairs
getProcessors()
Which mime type page processors this plugin should do additional processing for
public
static getProcessors() : array<string|int, mixed>
Return values
array<string|int, mixed> —an array of page processors
loadConfiguration()
Reads plugin configuration data from data/word_filter_plugin.txt on the name server into $this->rules_string, then parses this string into $this->filter_rules, the format used by $this->pageSummaryProcessing(&$summary)
public
loadConfiguration() : array<string|int, mixed>
Return values
array<string|int, mixed> —configuration associative array
loadDefaultConfiguration()
Reads plugin configuration data from the default setting of this plugin, then parses this string into $this->filter_rules, the format used by $this->pageSummaryProcessing(&$summary)
public
loadDefaultConfiguration() : array<string|int, mixed>
Return values
array<string|int, mixed> —configuration associative array
pageProcessing()
This method is called by a PageProcessor in its handle() method just after it has processed a web page. This method allows an indexing plugin to do additional processing on the page such as adding sub-documents, before the page summary is handed back to the fetcher.
public
pageProcessing(string $page, string $url) : array<string|int, mixed>
Parameters
- $page : string
- web-page contents
- $url : string
- the url where the page contents came from, used to canonicalize relative links
Return values
array<string|int, mixed> —consisting of a sequence of subdoc arrays found on the given page. Each subdoc array has a self::TITLE and a self::DESCRIPTION
pageSummaryProcessing()
This method adds robots metas to, or removes entirely, a summary produced by a text page processor or its subclasses, depending on whether the summary title and description satisfy various rules in $this->filter_rules
public
pageSummaryProcessing(array<string|int, mixed> &$summary, string $url) : mixed
Parameters
- $summary : array<string|int, mixed>
- the summary data produced by the relevant page processor's handle method; modified in-place.
- $url : string
- the url where the summary contents came from
Return values
mixed
parseRules()
Parses the rules in the string $this->rules_string into the array format stored in $this->filter_rules. $this->filter_rules is used when $this->pageSummaryProcessing(&$summary) is called.
public
parseRules() : mixed
Return values
mixed
postProcessing()
This method is called by the queue_server with the name of a completed index. This allows the indexing plugin to perform searches on the index and using the results, inject new page/index data into the index before it becomes available for end use.
public
postProcessing(string $index_name) : mixed
Parameters
- $index_name : string
- the name/timestamp of an IndexArchiveBundle to do post processing for
Return values
mixed
saveConfiguration()
Saves $this->rules_string, the field containing the string form of the rules being used with this plugin, to a file
public
saveConfiguration() : mixed
Return values
mixed
serializeRules()
This is used to convert the array in $this->filter_rules into a string format in $this->rules_string which would be suitable for saving to disk or displaying on the configuration page.
public
serializeRules() : mixed
Return values
mixed
setConfiguration()
Takes a configuration array of rules and sets them as the rules for this instance of the plugin. Typically used on a queue_server or on a fetcher. It first sets the value of $this->filter_rules; then, in case saveConfiguration() is called later, it also calls serializeRules to store the serial format in $this->rules_string
public
setConfiguration(array<string|int, mixed> $configuration) : mixed
Parameters
- $configuration : array<string|int, mixed>