Yioop_V9.5_Source_Code_Documentation

PageRuleParser
in package
implements CrawlConstants

Has methods to parse user-defined page rules and apply them to documents to be indexed.

There are two types of statements that a user can define: command statements and assignment statements.

A command statement takes a key field argument for the page associative array and does a function call to manipulate that page. These have the syntax:

  addMetaWords(field)      ;add the field and field value to the META_WORD
                           ;array for the page
  addKeywordLink(field)    ;split the field on a comma, view this as a search
                           ;keywords => link text association, and add this to
                           ;the KEYWORD_LINKS array.
  setStack(field)          ;set which field value should be used as a stack
  pushStack(field)         ;add the field value for field to the top of stack
  popStack(field)          ;pop the top of the stack into the field value for
                           ;field
  setOutputFolder(dir)     ;if auxiliary output, rather than just to a Yioop
                           ;index, is being done, then set the folder for
                           ;this output to be dir
  setOutputFormat(format)  ;format of auxiliary output, either CSV or SQL.
                           ;SQL means that writeOutput will write an insert
                           ;statement
  setOutputTable(table)    ;if output is SQL then what table to use for the
                           ;insert statements
  toArray(field)           ;splits field value for field on a comma and
                           ;assigns field value to be the resulting array
  toString(field)          ;if field value is an array then implode that
                           ;array using comma and store the result in field
                           ;value
  unset(field)             ;unset that field value
  writeOutput(field)       ;use the contents of field value viewed as an array
                           ;to fill in the columns of a SQL insert statement
                           ;or CSV row

Assignments can either be straight assignments with '=' or concatenation assignments with '.='. There are the following kinds of values that one can assign:

  field = some_other_field  ;sets $page['field'] = $page['some_other_field']
  field = "some_string"     ;sets $page['field'] to "some_string"
  field = /some_regex/      ;sees if $page['field'] matches some_regex. If it
                            ;does, this sets $page['field'] to 1; if it
                            ;doesn't, it sets $page['field'] to 0.
  field = /some_regex/replacement_where_dollar_vars_allowed/
                            ;computes the results of replacing matches to
                            ;some_regex in $page['field'] with
                            ;replacement_where_dollar_vars_allowed
  field = /some_regex/g     ;sets $page['field'] to the array of all matches
                            ;of some_regex in $page['field']

For each of the above assignments we could have used ".=" instead of "=". For the /g form, ".=" appends the matches to an array.
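The assignment forms above can be illustrated with a small standalone sketch. It is written in Python rather than taken from Yioop's PHP source, and the tuple encoding of the right-hand side, the function name assign, and the concat flag are all hypothetical names invented for this illustration. Only string concatenation is modeled for ".="; how ".=" combines with the regex forms is not specified closely enough here to sketch.

```python
import re


def assign(page, field, rhs, concat=False):
    """Apply one assignment rule to the dict `page` (illustrative sketch only).

    `rhs` is a hypothetical encoding of the right-hand side:
      ("field", name)                 field = some_other_field
      ("string", text)                field = "some_string"
      ("match", regex)                field = /some_regex/
      ("replace", regex, replacement) field = /some_regex/replacement/
      ("all", regex)                  field = /some_regex/g
    Python replacements use \\1 where the rule language uses dollar variables.
    """
    kind = rhs[0]
    if concat and kind not in ("field", "string"):
        # Assumption: ".=" with regex forms is left out of this sketch.
        raise NotImplementedError("'.=' is only sketched for plain values")
    current = page.get(field, "")
    if kind == "field":
        result = page.get(rhs[1], "")
    elif kind == "string":
        result = rhs[1]
    elif kind == "match":
        result = 1 if re.search(rhs[1], current) else 0
    elif kind == "replace":
        result = re.sub(rhs[1], rhs[2], current)
    elif kind == "all":
        result = re.findall(rhs[1], current)
    else:
        raise ValueError(f"unknown right-hand side kind: {kind}")
    if concat:
        # ".=" on plain values: concatenate onto the existing value.
        result = str(current) + str(result)
    page[field] = result
```

For instance, calling assign(page, "copy", ("field", "title")) copies the title field, and a following assign(page, "copy", ("string", "!"), concat=True) appends to it.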

Tags
author

Chris Pollett

Interfaces, Classes, Traits and Enums

CrawlConstants
Shared constants and enums used by components that are involved in the crawling process

Table of Contents

$output_folder  : string
If outputting to auxiliary file is being done, the current folder to use for such output
$output_format  : string
If outputting to auxiliary file is being done, the current file format to output with (either SQL or CSV)
$output_table  : string
If outputting to auxiliary file is being done, and the current file format is SQL then what table to output insert statements for
$rule_trees  : array<string|int, mixed>
Used to store parse trees that this parser executes
$stack  : string
Name of the field that will be used as a stack for pushing and popping other fields' values
__construct()  : mixed
Constructs a PageRuleParser using the supplied page_rules
addKeywordLink()  : mixed
Adds a $keywords => $link_text pair to the KEYWORD_LINKS array for this page based on the value of $field on the page. The pair is extracted by splitting on a comma. The KEYWORD_LINKS array can be used when a cached version of a page is displayed to show a list of links from the cached page in the header. These links correspond to searches in Yioop. For example, the value "madonna, rock star" would add a link to the top of the cache page with text "rock star" which, when clicked, would perform a Yioop search on madonna.
addMetaWord()  : mixed
Adds a meta word u:$field:$page_data[$field] to the array of meta words for this page
executeAssignmentRule()  : mixed
Used to execute a single assignment rule on $page_data
executeFunctionRule()  : mixed
Used to execute a single command rule on $page_data
executeRuleTrees()  : mixed
Executes either the internal $rule_trees or the passed $rule_trees on the provided $page_data associative array
getVarField()  : string
Either returns $var_name or the value of the CrawlConstant with name $var_name.
parseRules()  : array<string|int, mixed>
Parses a string of page rules into parse trees that can be executed later
popStack()  : mixed
Pops the top of the current stack into the field value for field
pushStack()  : mixed
Pushes an element or items in an array stored in field onto the current stack
setOutputFolder()  : mixed
Set output folder
setOutputFormat()  : mixed
Set output format
setOutputTable()  : mixed
Set output table
setStack()  : mixed
Set field variable to be used as a stack
toArray()  : mixed
If $page_data[$field] is a string, splits it into an array on comma, trims leading and trailing spaces from each item and stores the result back into $page_data[$field]
toString()  : mixed
If $page_data[$field] is an array, implode it into a string on comma, and stores the result back into $page_data[$field]
unsetVariable()  : mixed
Unsets the key $field (or the crawl constant it corresponds to) in $page_data. If it is a crawl constant, it is not unset but set to the empty string
writeOutput()  : mixed
Write the value of a field to the output folder in the current format. If the field is not set nothing is written

Properties

$output_folder

If outputting to auxiliary file is being done, the current folder to use for such output

public string $output_folder = ""

$output_format

If outputting to auxiliary file is being done, the current file format to output with (either SQL or CSV)

public string $output_format = ""

$output_table

If outputting to auxiliary file is being done, and the current file format is SQL then what table to output insert statements for

public string $output_table = ""

$rule_trees

Used to store parse trees that this parser executes

public array<string|int, mixed> $rule_trees

$stack

Name of the field that will be used as a stack for pushing and popping other fields' values

public string $stack

Methods

__construct()

Constructs a PageRuleParser using the supplied page_rules

public __construct([string $page_rules = "" ]) : mixed
Parameters
$page_rules : string = ""

a sequence of lines with page rules as described in the class comments

Return values
mixed

addKeywordLink()

Adds a $keywords => $link_text pair to the KEYWORD_LINKS array for this page based on the value of $field on the page. The pair is extracted by splitting on a comma. The KEYWORD_LINKS array can be used when a cached version of a page is displayed to show a list of links from the cached page in the header. These links correspond to searches in Yioop. For example, the value "madonna, rock star" would add a link to the top of the cache page with text "rock star" which, when clicked, would perform a Yioop search on madonna.

public addKeywordLink( $field, array<string|int, mixed> &$page_data) : mixed
Parameters
$field :

the key in $page_data to use

$page_data : array<string|int, mixed>

an associative array containing summary info of a web page/record

Return values
mixed
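The keyword-link behavior described for addKeywordLink() can be sketched as follows. This is a standalone Python illustration, not Yioop's PHP code; it assumes the value is split on the first comma only and uses the plain string "KEYWORD_LINKS" as a stand-in for whatever key the crawl constant resolves to.

```python
def add_keyword_link(field, page_data):
    """Record a keywords => link text pair taken from page_data[field]."""
    value = page_data.get(field)
    if not isinstance(value, str) or "," not in value:
        return  # nothing to split, so leave the page untouched
    # Assumption: only the first comma separates keywords from link text.
    keywords, link_text = (part.strip() for part in value.split(",", 1))
    # "KEYWORD_LINKS" is a stand-in key for this sketch.
    page_data.setdefault("KEYWORD_LINKS", {})[keywords] = link_text
```

With page_data = {"star": "madonna, rock star"}, calling add_keyword_link("star", page_data) records the search keywords "madonna" with the link text "rock star".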

addMetaWord()

Adds a meta word u:$field:$page_data[$field] to the array of meta words for this page

public addMetaWord( $field, array<string|int, mixed> &$page_data) : mixed
Parameters
$field :

the key in $page_data to use

$page_data : array<string|int, mixed>

an associative array containing summary info of a web page/record

Return values
mixed
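As a rough illustration of the meta word format, the sketch below builds the u:field:value string and appends it to a list on the page. It is standalone Python, not the PHP implementation; the key "META_WORD" is borrowed from the class description as a stand-in, and skipping unset fields is an assumption.

```python
def add_meta_word(field, page_data):
    """Append the meta word u:<field>:<value> for page_data[field]."""
    if field not in page_data:
        return  # assumption: an unset field adds no meta word
    meta_word = f"u:{field}:{page_data[field]}"
    # "META_WORD" is a stand-in key for this sketch.
    page_data.setdefault("META_WORD", []).append(meta_word)
```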

executeAssignmentRule()

Used to execute a single assignment rule on $page_data

public executeAssignmentRule(array<string|int, mixed> $tree, array<string|int, mixed> &$page_data) : mixed
Parameters
$tree : array<string|int, mixed>

annotated syntax tree of an assignment rule

$page_data : array<string|int, mixed>

an associative array containing summary info of a web page/record (will be changed by this operation)

Return values
mixed

executeFunctionRule()

Used to execute a single command rule on $page_data

public executeFunctionRule(array<string|int, mixed> $tree, array<string|int, mixed> &$page_data) : mixed
Parameters
$tree : array<string|int, mixed>

annotated syntax tree of a function call rule

$page_data : array<string|int, mixed>

an associative array containing summary info of a web page/record (will be changed by this operation)

Return values
mixed

executeRuleTrees()

Executes either the internal $rule_trees or the passed $rule_trees on the provided $page_data associative array

public executeRuleTrees(array<string|int, mixed> &$page_data[, array<string|int, mixed> $rule_trees = null ]) : mixed
Parameters
$page_data : array<string|int, mixed>

an associative array containing summary info of a web page/record (will be changed by this operation)

$rule_trees : array<string|int, mixed> = null

an array of annotated syntax trees for the rules used to update $page_data

Return values
mixed

getVarField()

Either returns $var_name or the value of the CrawlConstant with name $var_name.

public getVarField(string $var_name) : string
Parameters
$var_name : string

field to look up

Return values
string

looked up value

parseRules()

Parses a string of page rules into parse trees that can be executed later

public parseRules(string $page_rules) : array<string|int, mixed>
Parameters
$page_rules : string

a sequence of lines with page rules as described in the class comments

Return values
array<string|int, mixed>

of parse trees which can be executed in sequence

popStack()

Pops the top of the current stack into the field value for field

public popStack( $field, array<string|int, mixed> &$page_data) : mixed
Parameters
$field :

what field to store the value popped from the current stack in

$page_data : array<string|int, mixed>

an associative array containing summary info of a web page/record

Return values
mixed

pushStack()

Pushes an element or items in an array stored in field onto the current stack

public pushStack( $field, array<string|int, mixed> &$page_data) : mixed
Parameters
$field :

what field to get data to push onto the current stack

$page_data : array<string|int, mixed>

an associative array containing summary info of a web page/record

Return values
mixed
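How setStack, pushStack, and popStack interact can be sketched as below. This is a standalone Python illustration under stated assumptions, not the PHP implementation: the class name StackSketch is hypothetical, the stack is stored as a list in the page under the chosen field, and the top of the stack is taken to be the end of that list.

```python
class StackSketch:
    """Illustrative model of a field of the page being used as a stack."""

    def __init__(self):
        self.stack = None  # name of the field currently used as the stack

    def set_stack(self, field, page_data):
        self.stack = field

    def push_stack(self, field, page_data):
        if self.stack is None or field not in page_data:
            return
        stack = page_data.setdefault(self.stack, [])
        value = page_data[field]
        if isinstance(value, list):
            stack.extend(value)  # push each item of an array value
        else:
            stack.append(value)  # push a single element

    def pop_stack(self, field, page_data):
        stack = page_data.get(self.stack)
        if stack:
            # Assumption: the top of the stack is the last list item.
            page_data[field] = stack.pop()
```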

setOutputFolder()

Set output folder

public setOutputFolder( $dir, array<string|int, mixed> &$page_data) : mixed
Parameters
$dir :

output directory in which to write data.txt files containing the contents of some fields after writeOutput commands

$page_data : array<string|int, mixed>

an associative array containing summary info of a web page/record

Return values
mixed

setOutputFormat()

Set output format

public setOutputFormat( $format, array<string|int, mixed> &$page_data) : mixed
Parameters
$format :

can be either csv or sql

$page_data : array<string|int, mixed>

an associative array containing summary info of a web page/record

Return values
mixed

setOutputTable()

Set output table

public setOutputTable( $table, array<string|int, mixed> &$page_data) : mixed
Parameters
$table :

table to use if output format is sql

$page_data : array<string|int, mixed>

an associative array containing summary info of a web page/record

Return values
mixed

setStack()

Set field variable to be used as a stack

public setStack( $field, array<string|int, mixed> &$page_data) : mixed
Parameters
$field :

what field variable to use for current stack

$page_data : array<string|int, mixed>

an associative array containing summary info of a web page/record

Return values
mixed

toArray()

If $page_data[$field] is a string, splits it into an array on comma, trims leading and trailing spaces from each item and stores the result back into $page_data[$field]

public toArray( $field, array<string|int, mixed> &$page_data) : mixed
Parameters
$field :

the key in $page_data to use

$page_data : array<string|int, mixed>

an associative array containing summary info of a web page/record

Return values
mixed

toString()

If $page_data[$field] is an array, implode it into a string on comma, and stores the result back into $page_data[$field]

public toString( $field, array<string|int, mixed> &$page_data) : mixed
Parameters
$field :

the key in $page_data to use

$page_data : array<string|int, mixed>

an associative array containing summary info of a web page/record

Return values
mixed
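The toArray() and toString() conversions can be illustrated with a short standalone Python sketch. It mirrors the documented behavior (split on comma and trim spaces, join on comma) but is not the PHP implementation; joining with a bare "," and no following space is an assumption.

```python
def to_array(field, page_data):
    """If page_data[field] is a string, split it on commas and trim spaces."""
    value = page_data.get(field)
    if isinstance(value, str):
        page_data[field] = [item.strip(" ") for item in value.split(",")]


def to_string(field, page_data):
    """If page_data[field] is a list, join its items with commas."""
    value = page_data.get(field)
    if isinstance(value, list):
        # Assumption: items are joined with "," and no extra whitespace.
        page_data[field] = ",".join(str(item) for item in value)
```

Both functions leave the field untouched when it is missing or already has the target type, so applying to_array and then to_string normalizes the spacing around commas.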

unsetVariable()

Unsets the key $field (or the crawl constant it corresponds to) in $page_data. If it is a crawl constant, it is not unset but set to the empty string

public unsetVariable( $field, array<string|int, mixed> &$page_data) : mixed
Parameters
$field :

the key in $page_data to use

$page_data : array<string|int, mixed>

an associative array containing summary info of a web page/record

Return values
mixed
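The difference between unsetting an ordinary field and a crawl constant, together with the lookup described for getVarField(), can be sketched as below. This is standalone Python, not the PHP implementation; the CRAWL_CONSTANTS table and its entries are hypothetical stand-ins for the constants defined by CrawlConstants.

```python
# Hypothetical name => value table standing in for CrawlConstants.
CRAWL_CONSTANTS = {"TITLE": "t", "DESCRIPTION": "d"}


def get_var_field(var_name):
    """Return the crawl constant's value if var_name names one, else var_name."""
    return CRAWL_CONSTANTS.get(var_name, var_name)


def unset_variable(field, page_data):
    """Remove an ordinary field; blank out a field named by a crawl constant."""
    if field in CRAWL_CONSTANTS:
        page_data[get_var_field(field)] = ""  # kept, but emptied
    else:
        page_data.pop(field, None)  # removed entirely
```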

writeOutput()

Write the value of a field to the output folder in the current format. If the field is not set nothing is written

public writeOutput( $field, array<string|int, mixed> &$page_data) : mixed
Parameters
$field :

the key in $page_data to use

$page_data : array<string|int, mixed>

an associative array containing summary info of a web page/record

Return values
mixed
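To make the two output formats concrete, the sketch below turns a field value, viewed as an array, into either a CSV row or a SQL insert statement. It is a standalone Python illustration, not the PHP implementation: the function name format_output is hypothetical, the quoting rules are assumptions, and it returns the line instead of writing it to a file in the output folder.

```python
import csv
import io


def format_output(values, output_format, table=""):
    """Format a list of column values as one CSV row or SQL insert statement."""
    kind = output_format.upper()
    if kind == "CSV":
        buffer = io.StringIO()
        csv.writer(buffer, lineterminator="\n").writerow(values)
        return buffer.getvalue()
    if kind == "SQL":
        # Assumption: every column is written as a single-quoted string,
        # with embedded single quotes doubled.
        quoted = ", ".join(
            "'" + str(value).replace("'", "''") + "'" for value in values
        )
        return f"INSERT INTO {table} VALUES ({quoted});\n"
    raise ValueError(f"unsupported output format: {output_format}")
```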

        
