PageRuleParser
in package
implements
CrawlConstants
Has methods to parse user-defined page rules and apply them to documents that are to be indexed.
There are two types of statements that a user can define: command statements and assignment statements.
A command statement takes a key field argument for the page associative array and makes a function call to manipulate that page. Command statements have the following syntax:
addMetaWords(field) ;add the field and field value to the META_WORD array for the page
addKeywordLink(field) ;split the field on a comma, view this as a search keywords => link text association, and add this to the KEYWORD_LINKS array
setStack(field) ;set which field value should be used as a stack
pushStack(field) ;add the field value for field to the top of the stack
popStack(field) ;pop the top of the stack into the field value for field
setOutputFolder(dir) ;if auxiliary output, rather than just output to a Yioop index, is being done, then set the folder for this output to be dir
setOutputFormat(format) ;format of auxiliary output, either CSV or SQL; SQL means that writeOutput will write an insert statement
setOutputTable(table) ;if output is SQL, then which table to use for the insert statements
toArray(field) ;split the field value for field on a comma and assign the resulting array as the field value
toString(field) ;if the field value is an array, then implode that array using a comma and store the result in the field value
unset(field) ;unset that field value
writeOutput(field) ;use the contents of the field value, viewed as an array, to fill in the columns of a SQL insert statement or CSV row
Assignments can either be straight assignments with '=' or concatenation assignments with '.='. There are the following kinds of values that one can assign:
field = some_other_field ;sets $page['field'] = $page['some_other_field']
field = "some_string" ;sets $page['field'] to "some_string"
field = /some_regex/ ;checks whether $page['field'] matches some_regex; if it does, sets $page['field'] to 1, otherwise sets it to 0
field = /some_regex/replacement_where_dollar_vars_allowed/ ;replaces matches of some_regex in $page['field'] with replacement_where_dollar_vars_allowed
field = /some_regex/g ;sets $page['field'] to the array of all matches of some_regex in $page['field']
Each of the above assignments can be written with ".=" instead of "=" to make it a concatenation assignment. With /g, ".=" appends the matches to an array.
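The assignment forms above can be approximated in plain PHP with the preg_* functions. The following is only an illustrative sketch of the documented semantics, not the PageRuleParser implementation; the $page array and its field names are made up for the example.

```php
<?php
// Illustrative sketch only: mimics the documented assignment semantics.
// The field names and values are hypothetical.
$page = [
    'title' => 'Yioop 9.5 released',
    'flag' => 'Yioop 9.5',
    'body' => 'cats and dogs and cats',
];

// copy = title
$page['copy'] = $page['title'];

// label = "news"
$page['label'] = "news";

// label .= "-item" (concatenation assignment)
$page['label'] .= "-item";

// flag = /\d+\.\d+/ : the field becomes 1 if it matches, otherwise 0
$page['flag'] = preg_match('/\d+\.\d+/', $page['flag']) ? 1 : 0;

// title = /(\d+)\.(\d+)/v$1-$2/ : the replacement may use $1, $2, ...
$page['title'] = preg_replace('/(\d+)\.(\d+)/', 'v$1-$2', $page['title']);

// body = /cats|dogs/g : the field becomes the array of all matches
preg_match_all('/cats|dogs/', $page['body'], $matches);
$page['body'] = $matches[0];
```

Note that the regex forms read from and write to the same field, which is why each one above overwrites the value it tests.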
Interfaces, Classes, Traits and Enums
- CrawlConstants
- Shared constants and enums used by components that are involved in the crawling process
Table of Contents
- $output_folder : string
- When auxiliary file output is being done, the current folder to use for that output
- $output_format : string
- When auxiliary file output is being done, the current file format to output with (either SQL or CSV)
- $output_table : string
- When auxiliary file output is being done and the current file format is SQL, the table to output insert statements for
- $rule_trees : array<string|int, mixed>
- Used to store parse trees that this parser executes
- $stack : string
- Name of the field which will be used as a stack for pushing and popping other fields' values
- __construct() : mixed
- Constructs a PageRuleParser using the supplied page_rules
- addKeywordLink() : mixed
- Adds a $keywords => $link_text pair to the KEYWORD_LINKS array for this page based on the value of $field on the page. The pair is extracted by splitting on a comma. The KEYWORD_LINKS array can be used when a cached version of a page is displayed to show a list of links from the cached page in the header. These links correspond to searches in Yioop. For example, the value "madonna, rock star" would add a link to the top of the cache page with text "rock star" which, when clicked, would perform a Yioop search on madonna.
- addMetaWord() : mixed
- Adds a meta word u:$field:$page_data[$field_name] to the array of meta words for this page
- executeAssignmentRule() : mixed
- Used to execute a single assignment rule on $page_data
- executeFunctionRule() : mixed
- Used to execute a single command rule on $page_data
- executeRuleTrees() : mixed
- Executes either the internal $rule_trees or the passed $rule_trees on the provided $page_data associative array
- getVarField() : string
- Either returns $var_name or the value of the CrawlConstant with name $var_name.
- parseRules() : array<string|int, mixed>
- Parses a string of page rules into parse trees that can be executed later
- popStack() : mixed
- Pops the top of the current stack into the field value for field
- pushStack() : mixed
- Pushes an element or items in an array stored in field onto the current stack
- setOutputFolder() : mixed
- Set output folder
- setOutputFormat() : mixed
- Set output format
- setOutputTable() : mixed
- Set output table
- setStack() : mixed
- Set field variable to be used as a stack
- toArray() : mixed
- If $page_data[$field] is a string, splits it into an array on comma, trims leading and trailing spaces from each item and stores the result back into $page_data[$field]
- toString() : mixed
- If $page_data[$field] is an array, implodes it into a string on comma and stores the result back into $page_data[$field]
- unsetVariable() : mixed
- Unsets the key $field (or the crawl constant it corresponds to) in $page_data. If it is a crawl constant, it is not unset; it is just set to the empty string
- writeOutput() : mixed
- Writes the value of a field to the output folder in the current format. If the field is not set, nothing is written
Properties
$output_folder
When auxiliary file output is being done, the current folder to use for that output
public
string
$output_folder
= ""
$output_format
When auxiliary file output is being done, the current file format to output with (either SQL or CSV)
public
string
$output_format
= ""
$output_table
When auxiliary file output is being done and the current file format is SQL, the table to output insert statements for
public
string
$output_table
= ""
$rule_trees
Used to store parse trees that this parser executes
public
array<string|int, mixed>
$rule_trees
$stack
Name of the field which will be used as a stack for pushing and popping other fields' values
public
string
$stack
Methods
__construct()
Constructs a PageRuleParser using the supplied page_rules
public
__construct([string $page_rules = "" ]) : mixed
Parameters
- $page_rules : string = ""
-
a sequence of lines with page rules as described in the class comments
Return values
mixed
addKeywordLink()
Adds a $keywords => $link_text pair to the KEYWORD_LINKS array for this page based on the value of $field on the page. The pair is extracted by splitting on a comma. The KEYWORD_LINKS array can be used when a cached version of a page is displayed to show a list of links from the cached page in the header. These links correspond to searches in Yioop. For example, the value "madonna, rock star" would add a link to the top of the cache page with text "rock star" which, when clicked, would perform a Yioop search on madonna.
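A minimal sketch of the splitting behaviour described above, written as a free function. The constant SKETCH_KEYWORD_LINKS and the function name are hypothetical stand-ins, and splitting only on the first comma is an assumption, not something the documentation states.

```php
<?php
// Hypothetical stand-in for the KEYWORD_LINKS crawl constant.
const SKETCH_KEYWORD_LINKS = 'keyword_links';

function sketchAddKeywordLink(string $field, array &$page_data): void
{
    if (!isset($page_data[$field]) || !is_string($page_data[$field])) {
        return;
    }
    // Assumption: split on the first comma only, "keywords, link text".
    $parts = explode(',', $page_data[$field], 2);
    if (count($parts) < 2) {
        return;
    }
    $keywords = trim($parts[0]);
    $link_text = trim($parts[1]);
    $page_data[SKETCH_KEYWORD_LINKS][$keywords] = $link_text;
}

$page_data = ['link_rule' => 'madonna, rock star'];
sketchAddKeywordLink('link_rule', $page_data);
// $page_data['keyword_links'] now maps 'madonna' to 'rock star'.
```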
public
addKeywordLink( $field, array<string|int, mixed> &$page_data) : mixed
Parameters
- $field :
-
the key in $page_data to use
- $page_data : array<string|int, mixed>
-
an associative array containing summary info of a web page/record
Return values
mixed
addMetaWord()
Adds a meta word u:$field:$page_data[$field_name] to the array of meta words for this page
public
addMetaWord( $field, array<string|int, mixed> &$page_data) : mixed
Parameters
- $field :
-
the key in $page_data to use
- $page_data : array<string|int, mixed>
-
an associative array containing summary info of a web page/record
Return values
mixed
executeAssignmentRule()
Used to execute a single assignment rule on $page_data
public
executeAssignmentRule(array<string|int, mixed> $tree, array<string|int, mixed> &$page_data) : mixed
Parameters
- $tree : array<string|int, mixed>
-
annotated syntax tree of an assignment rule
- $page_data : array<string|int, mixed>
-
an associative array containing summary info of a web page/record (will be changed by this operation)
Return values
mixed
executeFunctionRule()
Used to execute a single command rule on $page_data
public
executeFunctionRule(array<string|int, mixed> $tree, array<string|int, mixed> &$page_data) : mixed
Parameters
- $tree : array<string|int, mixed>
-
annotated syntax tree of a function call rule
- $page_data : array<string|int, mixed>
-
an associative array containing summary info of a web page/record (will be changed by this operation)
Return values
mixed
executeRuleTrees()
Executes either the internal $rule_trees or the passed $rule_trees on the provided $page_data associative array
public
executeRuleTrees(array<string|int, mixed> &$page_data[, array<string|int, mixed> $rule_trees = null ]) : mixed
Parameters
- $page_data : array<string|int, mixed>
-
an associative array containing summary info of a web page/record (will be changed by this operation)
- $rule_trees : array<string|int, mixed> = null
-
an array of annotated syntax trees for the rules used to update $page_data
Return values
mixed
getVarField()
Either returns $var_name or the value of the CrawlConstant with name $var_name.
public
getVarField(string $var_name) : string
Parameters
- $var_name : string
-
field to look up
Return values
string —looked up value
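One way such a lookup could work is with PHP's defined() and constant() functions, both of which accept the "Class::NAME" form. The sketch below assumes a made-up SketchConstants interface in place of CrawlConstants; its constant names and values are invented for illustration.

```php
<?php
// Hypothetical constants standing in for CrawlConstants.
interface SketchConstants
{
    const TITLE = 'sketch_title';
    const DESCRIPTION = 'sketch_description';
}

function sketchGetVarField(string $var_name): string
{
    $qualified = 'SketchConstants::' . $var_name;
    // If a constant with this name exists, return its value.
    if (defined($qualified)) {
        return constant($qualified);
    }
    // Otherwise the name itself is used as the field key.
    return $var_name;
}

$known = sketchGetVarField('TITLE');
$plain = sketchGetVarField('my_field');
```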
parseRules()
Parses a string of page rules into parse trees that can be executed later
public
parseRules(string $page_rules) : array<string|int, mixed>
Parameters
- $page_rules : string
-
a sequence of lines with page rules as described in the class comments
Return values
array<string|int, mixed> —of parse trees which can be executed in sequence
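As a rough illustration of the parsing step, the sketch below classifies each rule line as a command or an assignment using regular expressions. The array shape it returns is invented for this example and is not the parse tree format PageRuleParser produces; the comment stripping is also simplified.

```php
<?php
// Rough sketch: one small array per rule line. Hypothetical shape.
function sketchParseRules(string $page_rules): array
{
    $trees = [];
    foreach (preg_split('/\R/', $page_rules) as $line) {
        // Drop a trailing ";comment". Simplification: a ';' inside a
        // quoted string or regex would be cut here too.
        $line = trim(preg_replace('/;.*$/', '', $line));
        if ($line === '') {
            continue;
        }
        if (preg_match('/^(\w+)\s*\(\s*([^)]*?)\s*\)$/', $line, $m)) {
            $trees[] = ['type' => 'command', 'name' => $m[1],
                'arg' => $m[2]];
        } elseif (preg_match('/^(\w+)\s*(\.?=)\s*(.+)$/', $line, $m)) {
            $trees[] = ['type' => 'assignment', 'field' => $m[1],
                'op' => $m[2], 'value' => $m[3]];
        }
    }
    return $trees;
}

$rules = <<<'EOD'
summary = description ;copy one field to another
summary .= " (archived)"
toArray(tags) ;split tags on commas
EOD;
$trees = sketchParseRules($rules);
```

Separating parsing from execution in this way means the rules only need to be parsed once and can then be run against many pages.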
popStack()
Pops the top of the current stack into the field value for field
public
popStack( $field, array<string|int, mixed> &$page_data) : mixed
Parameters
- $field :
-
which field receives the value popped from the current stack
- $page_data : array<string|int, mixed>
-
an associative array containing summary info of a web page/record
Return values
mixed
pushStack()
Pushes an element or items in an array stored in field onto the current stack
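The stack commands can be pictured with the following sketch. The function names are hypothetical, and the stack field name is passed as a parameter here for simplicity, whereas PageRuleParser keeps it in its $stack property.

```php
<?php
// Sketch of push and pop on a field that acts as a stack.
function sketchPushStack(string $stack, string $field, array &$page_data): void
{
    if (!isset($page_data[$field])) {
        return;
    }
    if (!isset($page_data[$stack]) || !is_array($page_data[$stack])) {
        $page_data[$stack] = [];
    }
    // An array value is pushed item by item; a scalar is pushed as is.
    foreach ((array) $page_data[$field] as $item) {
        $page_data[$stack][] = $item;
    }
}

function sketchPopStack(string $stack, string $field, array &$page_data): void
{
    // Pop only when the stack field holds a non-empty array.
    if (!empty($page_data[$stack]) && is_array($page_data[$stack])) {
        $page_data[$field] = array_pop($page_data[$stack]);
    }
}

$page_data = ['links' => ['a', 'b'], 'title' => 'Home'];
sketchPushStack('todo', 'links', $page_data);
sketchPushStack('todo', 'title', $page_data);
sketchPopStack('todo', 'current', $page_data);
```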
public
pushStack( $field, array<string|int, mixed> &$page_data) : mixed
Parameters
- $field :
-
what field to get data from to push onto the current stack
- $page_data : array<string|int, mixed>
-
an associative array containing summary info of a web page/record
Return values
mixed
setOutputFolder()
Set output folder
public
setOutputFolder( $dir, array<string|int, mixed> &$page_data) : mixed
Parameters
- $dir :
-
output directory in which to write data.txt files containing the contents of some fields after writeOutput commands
- $page_data : array<string|int, mixed>
-
an associative array containing summary info of a web page/record
Return values
mixed
setOutputFormat()
Set output format
public
setOutputFormat( $format, array<string|int, mixed> &$page_data) : mixed
Parameters
- $format :
-
can be either csv or sql
- $page_data : array<string|int, mixed>
-
an associative array containing summary info of a web page/record
Return values
mixed
setOutputTable()
Set output table
public
setOutputTable( $table, array<string|int, mixed> &$page_data) : mixed
Parameters
- $table :
-
table to use if output format is sql
- $page_data : array<string|int, mixed>
-
an associative array containing summary info of a web page/record
Return values
mixed
setStack()
Set field variable to be used as a stack
public
setStack( $field, array<string|int, mixed> &$page_data) : mixed
Parameters
- $field :
-
what field variable to use for current stack
- $page_data : array<string|int, mixed>
-
an associative array containing summary info of a web page/record
Return values
mixed
toArray()
If $page_data[$field] is a string, splits it into an array on comma, trims leading and trailing spaces from each item and stores the result back into $page_data[$field]
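The behaviour described above amounts to an explode followed by a trim of each item. A minimal sketch, with a hypothetical function name; note that PHP's trim() also strips tabs and newlines, which goes slightly beyond the spaces the description mentions.

```php
<?php
function sketchToArray(string $field, array &$page_data): void
{
    // Only string values are converted; anything else is left alone.
    if (isset($page_data[$field]) && is_string($page_data[$field])) {
        $page_data[$field] = array_map(
            'trim',
            explode(',', $page_data[$field])
        );
    }
}

$page_data = ['tags' => ' php , search engine,crawler '];
sketchToArray('tags', $page_data);
```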
public
toArray( $field, array<string|int, mixed> &$page_data) : mixed
Parameters
- $field :
-
the key in $page_data to use
- $page_data : array<string|int, mixed>
-
an associative array containing summary info of a web page/record
Return values
mixed
toString()
If $page_data[$field] is an array, implodes it into a string on comma and stores the result back into $page_data[$field]
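The reverse conversion can be sketched with implode(). The function name is hypothetical, and the exact separator is an assumption: a bare comma is used here, but the real method might add a space.

```php
<?php
function sketchToString(string $field, array &$page_data): void
{
    // Only array values are converted; strings are left unchanged.
    if (isset($page_data[$field]) && is_array($page_data[$field])) {
        // Assumption: a bare "," separator.
        $page_data[$field] = implode(',', $page_data[$field]);
    }
}

$page_data = ['tags' => ['php', 'crawler'], 'title' => 'Home'];
sketchToString('tags', $page_data);
sketchToString('title', $page_data);
```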
public
toString( $field, array<string|int, mixed> &$page_data) : mixed
Parameters
- $field :
-
the key in $page_data to use
- $page_data : array<string|int, mixed>
-
an associative array containing summary info of a web page/record
Return values
mixed
unsetVariable()
Unsets the key $field (or the crawl constant it corresponds to) in $page_data. If it is a crawl constant, it is not unset; it is just set to the empty string
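The two cases described above could look roughly like this. SketchCrawlConstants is a made-up interface standing in for CrawlConstants, and the function name is hypothetical.

```php
<?php
// Hypothetical constant standing in for a real crawl constant.
interface SketchCrawlConstants
{
    const TITLE = 'sketch_title';
}

function sketchUnsetVariable(string $field, array &$page_data): void
{
    $qualified = 'SketchCrawlConstants::' . $field;
    if (defined($qualified)) {
        // Crawl constant: keep the key, blank the value.
        $page_data[constant($qualified)] = '';
    } else {
        // Ordinary user field: remove the key entirely.
        unset($page_data[$field]);
    }
}

$page_data = ['sketch_title' => 'Home', 'scratch' => 'tmp'];
sketchUnsetVariable('TITLE', $page_data);
sketchUnsetVariable('scratch', $page_data);
```

Keeping the key for crawl constants presumably protects later code that expects those fields to exist, though the documentation does not state the reason.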
public
unsetVariable( $field, array<string|int, mixed> &$page_data) : mixed
Parameters
- $field :
-
the key in $page_data to use
- $page_data : array<string|int, mixed>
-
an associative array containing summary info of a web page/record
Return values
mixed
writeOutput()
Writes the value of a field to the output folder in the current format. If the field is not set, nothing is written
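To give a feel for the two output formats, the sketch below builds the text of one row from an array of column values. It leaves out all file handling, the function name is hypothetical, and the escaping is deliberately simple; it should not be read as the escaping PageRuleParser performs.

```php
<?php
// Sketch: format one row as a SQL insert statement or a CSV line.
function sketchFormatRow(array $values, string $format, string $table = ''): string
{
    if (strtolower($format) === 'sql') {
        // Quote each value and double any embedded single quotes.
        $quoted = array_map(
            fn($v) => "'" . str_replace("'", "''", (string) $v) . "'",
            $values
        );
        return "INSERT INTO $table VALUES (" . implode(', ', $quoted) . ");";
    }
    // CSV: quote every column and double any embedded double quotes.
    $quoted = array_map(
        fn($v) => '"' . str_replace('"', '""', (string) $v) . '"',
        $values
    );
    return implode(',', $quoted);
}

$sql = sketchFormatRow(["O'Reilly", 'PHP'], 'sql', 'books');
$csv = sketchFormatRow(["O'Reilly", 'PHP'], 'csv');
```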
public
writeOutput( $field, array<string|int, mixed> &$page_data) : mixed
Parameters
- $field :
-
the key in $page_data to use
- $page_data : array<string|int, mixed>
-
an associative array containing summary info of a web page/record