ScraperModel
extends Model
in package
Used to manage data related to the SCRAPER database table.
This table stores web scrapers, tools for scraping the important content from pages which might have been generated by a content management system.
Table of Contents
- DEFAULT_DESCRIPTION_LENGTH = 150
- Default maximum character length of a search summary
- MAX_SNIPPET_TITLE_LENGTH = 20
- MIN_SNIPPET_LENGTH = 100
- SNIPPET_LENGTH_LEFT = 20
- SNIPPET_LENGTH_RIGHT = 40
- SNIPPET_TITLE_LENGTH = 20
- $any_fields : array<string|int, mixed>
- Fields that, if present in $search_array (see getRows()) with the value "-1", are left out of the WHERE clause but are still used for the ORDER BY clause
- $cache : object
- Cache object to be used if we are doing caching
- $db : object
- Reference to a DatasourceManager
- $db_name : string
- Name of the search engine database
- $edited_page_summaries : array<string|int, mixed>
- Associative array of page summaries which might be used to override default page summaries if set.
- $private_db : object
- Reference to a private DatasourceManager
- $private_db_name : string
- Name of the private search engine database
- $search_table_column_map : array<string|int, mixed>
- Associations of the form name of field for web forms => database column names/abbreviations
- $web_site : object
- Reference to a WebSite object in use to serve pages (if any)
- __construct() : mixed
- Sets up the database manager that will be used and name of the search engine database
- add() : mixed
- Used to add a new scraper to Yioop
- boldKeywords() : string
- Given a string, wraps in bold html tags a set of key words it contains.
- createIfNecessaryDirectory() : int
- Creates a directory and sets it to world permission if it doesn't already exist
- delete() : mixed
- Deletes the scraper with the provided id
- fileGetContents() : string
- Either a wrapper for file_get_contents or, if a WebSite object is being used to serve pages, reads the file using blocking I/O file_get_contents() and caches it before returning its string contents.
- filePutContents() : mixed
- Either a wrapper for file_put_contents, or if a WebSite object is being used to serve pages, writes $data to the persistent file with name $filename. Saves a copy in the RAM cache if there is a copy already there.
- formatSinglePageResult() : array<string|int, mixed>
- Given a page summary, extracts snippets which are related to a set of search words. For each snippet, bold faces the search terms, and then creates a new summary array.
- fromCallback() : string
- Controls which tables and the names of tables underlie the given model and should be used in a getRows call
- get() : array<string|int, mixed>
- Returns the scraper with the given id
- getAllScrapers() : array<string|int, mixed>
- Return the contents of the SCRAPER table
- getDbmsList() : array<string|int, mixed>
- Gets a list of all DBMS that work with the search engine
- getRows() : array<string|int, mixed>
- Gets a range of rows which match the provided search criteria from the provided table
- getSnippets() : string
- Given a string, extracts snippets of text related to a given set of key words. For a given word, a snippet is a window of characters to its left and right that is less than a maximum total number of characters.
- getUserId() : string
- Gets the user_id associated with a given username. (Defined in the base class because both the signin and user models use it as an internal method.)
- isSingleLocalhost() : bool
- Used to determine if an action involves just one yioop instance on the current local machine or not
- loginDbms() : bool
- Returns whether the provided dbms needs a login and password or not (sqlite and sqlite3 do not)
- postQueryCallback() : array<string|int, mixed>
- Called after getRows has retrieved all the rows that it would retrieve but before they are returned to give one last place where they could be further manipulated. For example, in MachineModel this callback is used to make parallel network calls to get the status of each machine returned by getRows. The default for this method is to leave the rows that would be returned unchanged
- rowCallback() : array<string|int, mixed>
- Called after a row is retrieved by getRows from the database to perform some manipulation that would be useful for this model.
- searchArrayToWhereOrderClauses() : array<string|int, mixed>
- Creates the WHERE and ORDER BY clauses for a query of a Yioop table such as USERS, ROLE, GROUP, which have associated search web forms. Searches are case insensitive
- selectCallback() : string
- Controls which columns and the names of those columns from the tables underlying the given model should be returned from a getRows call.
- translateDb() : mixed
- Used to get the translation of a string_id stored in the database to the given locale.
- update() : mixed
- Used to update the fields stored in a SCRAPER row according to an array holding new values
- whereCallback() : string
- Controls the WHERE clause of the SQL query that underlies the given model and should be used in a getRows call.
Constants
DEFAULT_DESCRIPTION_LENGTH
Default maximum character length of a search summary
public
mixed
DEFAULT_DESCRIPTION_LENGTH
= 150
MAX_SNIPPET_TITLE_LENGTH
public
mixed
MAX_SNIPPET_TITLE_LENGTH
= 20
MIN_SNIPPET_LENGTH
public
mixed
MIN_SNIPPET_LENGTH
= 100
SNIPPET_LENGTH_LEFT
public
mixed
SNIPPET_LENGTH_LEFT
= 20
SNIPPET_LENGTH_RIGHT
public
mixed
SNIPPET_LENGTH_RIGHT
= 40
SNIPPET_TITLE_LENGTH
public
mixed
SNIPPET_TITLE_LENGTH
= 20
Properties
$any_fields
Fields that, if present in $search_array (see getRows()) with the value "-1", are left out of the WHERE clause but are still used for the ORDER BY clause
public
array<string|int, mixed>
$any_fields
= []
$cache
Cache object to be used if we are doing caching
public
static object
$cache
$db
Reference to a DatasourceManager
public
object
$db
$db_name
Name of the search engine database
public
string
$db_name
$edited_page_summaries
Associative array of page summaries which might be used to override default page summaries if set.
public
array<string|int, mixed>
$edited_page_summaries
= null
$private_db
Reference to a private DatasourceManager
public
object
$private_db
$private_db_name
Name of the private search engine database
public
string
$private_db_name
$search_table_column_map
Associations of the form name of field for web forms => database column names/abbreviations
public
array<string|int, mixed>
$search_table_column_map
= []
$web_site
Reference to a WebSite object in use to serve pages (if any)
public
object
$web_site
Methods
__construct()
Sets up the database manager that will be used and name of the search engine database
public
__construct([string $db_name = CDB_NAME ][, bool $connect = true ][, mixed $web_site = null ]) : mixed
Parameters
- $db_name : string = CDB_NAME
-
the name of the database for the search engine
- $connect : bool = true
-
whether to connect to the database by default after making the datasource class
- $web_site : mixed = null
Return values
mixed
add()
Used to add a new scraper to Yioop
public
add(string $name, string $signature, int $priority, string $text_path, string $delete_paths, string $extract_fields) : mixed
Parameters
- $name : string
-
name of the scraper to add
- $signature : string
-
the xpath to query the html of a web document to see if a scrape rule should be applied
- $priority : int
-
priority used to choose this scrape rule as opposed to other scrape rules
- $text_path : string
-
the xpath string used to find the main dom container for the important text in the html document
- $delete_paths : string
-
xpath strings of dom elements to be removed from the dom after the dom has been restricted to just the $text_path content. These are used to remove extraneous info from the main text contents. The xpaths should be separated from each other by a new line.
- $extract_fields : string
-
a string of lines, where each line consists of a summary field name followed by = followed by an xpath. The intended meaning of such a line is to evaluate the xpath and create a new field in a document summary with either the concatenated, trimmed text value of the nodes of the results of the xpath
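The line-oriented formats of $delete_paths and $extract_fields can be illustrated with a small parser. This is only a sketch based on the parameter descriptions above: parseDeletePaths() and parseExtractFields() are hypothetical helper names, not Yioop functions, and how Yioop itself parses these strings is not shown in this page.

```php
<?php
/**
 * Hypothetical helper: splits a $delete_paths string into one XPath per
 * non-empty line.
 */
function parseDeletePaths(string $delete_paths): array
{
    $paths = [];
    foreach (preg_split('/\R/', $delete_paths) as $line) {
        $line = trim($line);
        if ($line !== '') {
            $paths[] = $line;
        }
    }
    return $paths;
}

/**
 * Hypothetical helper: turns an $extract_fields string into an array of
 * summary field name => XPath.
 */
function parseExtractFields(string $extract_fields): array
{
    $fields = [];
    foreach (preg_split('/\R/', $extract_fields) as $line) {
        $line = trim($line);
        if ($line === '' || strpos($line, '=') === false) {
            continue; // skip blank or malformed lines
        }
        // Split on the first "=" only, since an XPath predicate may
        // itself contain "=".
        [$name, $xpath] = explode('=', $line, 2);
        $fields[trim($name)] = trim($xpath);
    }
    return $fields;
}
```

Splitting on the first = only is a deliberate choice here, so that a line such as author=//span[@class='by'] keeps its predicate intact.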
Return values
mixed
boldKeywords()
Given a string, wraps in bold html tags a set of key words it contains.
public
boldKeywords(string $text, array<string|int, mixed> $words) : string
Parameters
- $text : string
-
haystack string to look for the key words
- $words : array<string|int, mixed>
-
an array of words to bold face
Return values
string —the resulting string after boldfacing has been applied
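As a rough illustration of the behavior described for boldKeywords(), the sketch below wraps every occurrence of each key word in bold tags. It is not Yioop's implementation: the function name boldKeywordsSketch() is hypothetical, and case-insensitive matching and the use of b tags are assumptions.

```php
<?php
/**
 * Hypothetical sketch: wraps each occurrence of any of $words in <b> tags.
 * Matching is case-insensitive (an assumption) and the original casing of
 * the matched text is preserved.
 */
function boldKeywordsSketch(string $text, array $words): string
{
    $quoted = [];
    foreach ($words as $word) {
        if ($word !== '') {
            // Escape regex metacharacters so words are matched literally.
            $quoted[] = preg_quote($word, '/');
        }
    }
    if (!$quoted) {
        return $text;
    }
    // One combined pattern, so tags inserted for one word are never
    // re-scanned when looking for another word.
    $pattern = '/(' . implode('|', $quoted) . ')/iu';
    return preg_replace($pattern, '<b>$1</b>', $text);
}
```

Using a single alternation pattern rather than one replacement pass per word avoids accidentally bolding text inside tags added by an earlier pass.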
createIfNecessaryDirectory()
Creates a directory and sets it to world permission if it doesn't already exist
public
createIfNecessaryDirectory(string $directory) : int
Parameters
- $directory : string
-
name of directory to create
Return values
int —-1 on failure, 0 if already existed, 1 if created
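The three return codes can be sketched as follows. The function name createIfNecessaryDirectorySketch() is hypothetical, and reading "world permission" as mode 0777 is an assumption rather than something this page states.

```php
<?php
/**
 * Hypothetical sketch of the documented return codes:
 * -1 on failure, 0 if the directory already existed, 1 if it was created.
 */
function createIfNecessaryDirectorySketch(string $directory): int
{
    if (file_exists($directory)) {
        return 0;
    }
    // @ suppresses the warning mkdir() emits on failure; the failure is
    // reported through the return code instead.
    if (!@mkdir($directory)) {
        return -1;
    }
    // Assumption: "world permission" means readable, writable and
    // searchable by everyone.
    chmod($directory, 0777);
    return 1;
}
```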
delete()
Deletes the scraper with the provided id
public
delete(int $id) : mixed
Parameters
- $id : int
-
id of the scraper to be deleted
Return values
mixed
fileGetContents()
Either a wrapper for file_get_contents or, if a WebSite object is being used to serve pages, reads the file using blocking I/O file_get_contents() and caches it before returning its string contents.
public
fileGetContents(string $filename[, bool $force_read = false ]) : string
Note: this function assumes that only the web server is performing I/O with this file. filemtime() can be used to see if a file on disk has been changed, after which $force_read = true can be used to force re-reading the file into the cache.
Parameters
- $filename : string
-
name of file to get contents of
- $force_read : bool = false
-
whether to force the file to be read from persistent storage rather than the cache
Return values
string —contents of the file given by $filename
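The read-through caching described here, together with the write behavior documented for filePutContents(), might look roughly like the sketch below. FileCacheSketch and its methods are hypothetical names; the real methods delegate to a WebSite object whose API is not shown in this page, so a plain in-memory array stands in for the RAM cache.

```php
<?php
/**
 * Hypothetical sketch of a RAM cache in front of file reads and writes.
 */
final class FileCacheSketch
{
    /** @var array<string, string> cached file contents keyed by filename */
    private array $cache = [];

    public function get(string $filename, bool $force_read = false): string
    {
        if (!$force_read && isset($this->cache[$filename])) {
            return $this->cache[$filename]; // served from RAM, disk untouched
        }
        $data = @file_get_contents($filename);
        if ($data === false) {
            throw new RuntimeException("Could not read $filename");
        }
        $this->cache[$filename] = $data;
        return $data;
    }

    public function put(string $filename, string $data)
    {
        $result = file_put_contents($filename, $data);
        // Only refresh the cached copy if one is already there.
        if (isset($this->cache[$filename])) {
            $this->cache[$filename] = $data;
        }
        return $result;
    }
}
```

Because the cache is only consulted by name, a change made to the file by another process goes unnoticed until get() is called with $force_read set to true, which is the situation the note above warns about.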
filePutContents()
Either a wrapper for file_put_contents, or if a WebSite object is being used to serve pages, writes $data to the persistent file with name $filename. Saves a copy in the RAM cache if there is a copy already there.
public
filePutContents(string $filename, string $data) : mixed
Parameters
- $filename : string
-
name of file to write to persistent storage
- $data : string
-
string of data to store in file
Return values
mixed
formatSinglePageResult()
Given a page summary, extracts snippets which are related to a set of search words. For each snippet, bold faces the search terms, and then creates a new summary array.
public
formatSinglePageResult(array<string|int, mixed> $page[, array<string|int, mixed> $words = null ][, int $description_length = self::DEFAULT_DESCRIPTION_LENGTH ]) : array<string|int, mixed>
Parameters
- $page : array<string|int, mixed>
-
a single search result summary
- $words : array<string|int, mixed> = null
-
keywords (typically what was searched on)
- $description_length : int = self::DEFAULT_DESCRIPTION_LENGTH
-
length of the description
Return values
array<string|int, mixed> —$page which has been snippified and bold faced
fromCallback()
Controls which tables and the names of tables underlie the given model and should be used in a getRows call
public
fromCallback([string $args = null ]) : string
Parameters
- $args : string = null
-
unused here; its value does not matter.
Return values
string —which table to use
get()
Returns the scraper with the given id
public
get(int $id) : array<string|int, mixed>
Parameters
- $id : int
-
id of the scraper to look up
Return values
array<string|int, mixed> —associative array with ID, NAME, SIGNATURE, PRIORITY, TEXT_PATH, DELETE_PATHS, EXTRACT_FIELDS of a scraper
getAllScrapers()
Return the contents of the SCRAPER table
public
getAllScrapers() : array<string|int, mixed>
Return values
array<string|int, mixed> —array of associative rows with ID, NAME, SIGNATURE, PRIORITY, TEXT_PATH, DELETE_PATHS, EXTRACT_FIELDS, one for each scraper
getDbmsList()
Gets a list of all DBMS that work with the search engine
public
getDbmsList() : array<string|int, mixed>
Return values
array<string|int, mixed> —Names of available data sources
getRows()
Gets a range of rows which match the provided search criteria from the provided table
public
getRows(int $limit, int $num, int &$total[, array<string|int, mixed> $search_array = [] ][, array<string|int, mixed> $args = null ]) : array<string|int, mixed>
Parameters
- $limit : int
-
starting row from the potential results to return
- $num : int
-
number of rows after start row to return
- $total : int
-
gets set with the total number of rows that can be returned by the given database query
- $search_array : array<string|int, mixed> = []
-
each element is a quadruple: the name of a field, the comparison to perform, a value to check, and an order (ascending/descending) to sort by
- $args : array<string|int, mixed> = null
-
additional values which may be used to get rows (what these are will typically depend on the subclass implementation)
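The roles of $limit, $num and the by-reference $total can be shown with an in-memory stand-in. This sketch does not query a database: getRowsSketch() and its $table argument are hypothetical and exist only to illustrate the paging semantics described in the parameter list.

```php
<?php
/**
 * Hypothetical sketch: pages through an in-memory list of rows the way the
 * getRows() parameters are described. $limit is the starting offset, $num
 * the number of rows to return, and $total receives the full match count.
 */
function getRowsSketch(array $table, int $limit, int $num, int &$total): array
{
    $total = count($table);
    return array_slice($table, $limit, $num);
}

// Usage: $total must be initialised before the call, because an unset
// variable would arrive as null and fail the int type check.
$total = 0;
$rows = getRowsSketch(['a', 'b', 'c', 'd', 'e'], 2, 2, $total);
// $rows now holds ['c', 'd'] and $total holds 5.
```

Returning the page of rows while reporting the overall count through $total lets a caller render both the current page and the pagination controls from one call.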
Return values
array<string|int, mixed>
getSnippets()
Given a string, extracts snippets of text related to a given set of key words. For a given word, a snippet is a window of characters to its left and right that is less than a maximum total number of characters.
public
getSnippets(string $text, array<string|int, mixed> $words, string $description_length) : string
There is also a rule that a snippet should avoid ending in the middle of a word
Parameters
- $text : string
-
haystack to extract snippet from
- $words : array<string|int, mixed>
-
keywords to look for in the haystack
- $description_length : string
-
length of the description desired
Return values
string —a concatenation of the extracted snippets of each word
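The windowing idea for a single key word can be sketched as below. This is not Yioop's algorithm: snippetAround() is a hypothetical name, using the SNIPPET_LENGTH_LEFT (20) and SNIPPET_LENGTH_RIGHT (40) values as default window sizes is an assumption, and the byte-oriented string functions mean multibyte text is not handled.

```php
<?php
/**
 * Hypothetical sketch: returns a window of text around the first
 * occurrence of $word, trimmed so it neither starts nor ends inside a word.
 */
function snippetAround(string $text, string $word, int $left = 20,
    int $right = 40): string
{
    if ($word === '') {
        return '';
    }
    $pos = stripos($text, $word);
    if ($pos === false) {
        return '';
    }
    $word_end = $pos + strlen($word);
    $start = max(0, $pos - $left);
    $end = min(strlen($text), $word_end + $right);
    // If the window starts mid-word, move forward to the next word start.
    if ($start > 0 && $text[$start - 1] !== ' ') {
        $space = strpos($text, ' ', $start);
        $start = ($space === false || $space > $pos) ? $pos : $space + 1;
    }
    // If the window ends mid-word, move back to the previous space.
    if ($end < strlen($text) && $text[$end] !== ' ') {
        $space = strrpos(substr($text, 0, $end), ' ');
        if ($space !== false && $space >= $word_end) {
            $end = $space;
        }
    }
    return substr($text, $start, $end - $start);
}
```

With a window of 8 characters on each side, looking for "fox" in "The quick brown fox jumps over the lazy dog" yields "brown fox jumps": the partial words "k" on the left and "ov" on the right are dropped.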
getUserId()
Gets the user_id associated with a given username. (Defined in the base class because both the signin and user models use it as an internal method.)
public
getUserId(string $username) : string
Parameters
- $username : string
-
the username to look up
Return values
string —the corresponding userid
isSingleLocalhost()
Used to determine if an action involves just one yioop instance on the current local machine or not
public
isSingleLocalhost(array<string|int, mixed> $machine_urls[, string $index_timestamp = -1 ]) : bool
Parameters
- $machine_urls : array<string|int, mixed>
-
urls of yioop instances to which the action applies
- $index_timestamp : string = -1
-
if a timestamp is given, checks whether the index has declared itself to be a no-network index.
Return values
bool —whether it involves a single local yioop instance (true) or not (false)
loginDbms()
Returns whether the provided dbms needs a login and password or not (sqlite and sqlite3 do not)
public
loginDbms(string $dbms) : bool
Parameters
- $dbms : string
-
the name of a database management system
Return values
bool —true if needs a login and password; false otherwise
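One plausible reading of this check is sketched below, on the assumption that sqlite and sqlite3 are the only systems that need no credentials. loginDbmsSketch() is a hypothetical name, and lowercasing the input is also an assumption.

```php
<?php
/**
 * Hypothetical sketch: file-based SQLite databases need no login, every
 * other DBMS name is treated as needing a login and password.
 */
function loginDbmsSketch(string $dbms): bool
{
    return !in_array(strtolower($dbms), ['sqlite', 'sqlite3'], true);
}
```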
postQueryCallback()
Called after getRows has retrieved all the rows that it would retrieve but before they are returned to give one last place where they could be further manipulated. For example, in MachineModel this callback is used to make parallel network calls to get the status of each machine returned by getRows. The default for this method is to leave the rows that would be returned unchanged
public
postQueryCallback(array<string|int, mixed> $rows) : array<string|int, mixed>
Parameters
- $rows : array<string|int, mixed>
-
rows that have been calculated so far by getRows
Return values
array<string|int, mixed> —$rows after this final manipulation
rowCallback()
Called after a row is retrieved by getRows from the database to perform some manipulation that would be useful for this model.
public
rowCallback(array<string|int, mixed> $row, mixed $args) : array<string|int, mixed>
For example, in CrawlModel, after a row representing a crawl mix has been gotten, this is used to perform an additional query to marshal its components. By default this method just returns this row unchanged.
Parameters
- $row : array<string|int, mixed>
-
row as retrieved from database query
- $args : mixed
-
additional arguments that might be used by this callback
Return values
array<string|int, mixed> —$row after callback manipulation
searchArrayToWhereOrderClauses()
Creates the WHERE and ORDER BY clauses for a query of a Yioop table such as USERS, ROLE, GROUP, which have associated search web forms. Searches are case insensitive
public
searchArrayToWhereOrderClauses(array<string|int, mixed> $search_array[, array<string|int, mixed> $any_fields = ['status'] ]) : array<string|int, mixed>
Parameters
- $search_array : array<string|int, mixed>
-
each element is a quadruple: the name of a field, the comparison to perform, a value to check, and an order (ascending/descending) to sort by
- $any_fields : array<string|int, mixed> = ['status']
-
fields that, if present in the search array with the value "-1", are left out of the WHERE clause but are still used for the ORDER BY clause
Return values
array<string|int, mixed> —string for where clause, string for order by clause
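A simplified version of this translation is sketched below. It is not Yioop's code: searchArrayToClausesSketch() is a hypothetical name, the comparison names "=" and "CONTAINS" and the direction values "ASC" and "DESC" are assumptions, and the quote-doubling used for escaping is a stand-in for whatever escaping the real datasource provides.

```php
<?php
/**
 * Hypothetical sketch: builds case-insensitive WHERE and ORDER BY clauses
 * from quadruples of [field, comparison, value, direction].
 */
function searchArrayToClausesSketch(array $search_array,
    array $any_fields = ['status']): array
{
    $where_parts = [];
    $order_parts = [];
    foreach ($search_array as [$field, $comparison, $value, $direction]) {
        $value = (string)$value;
        // An "any" field with value -1 contributes no WHERE condition.
        $skip_where = in_array($field, $any_fields, true) && $value === '-1';
        if (!$skip_where && $value !== '') {
            // Lowercase both sides so the search is case insensitive.
            $quoted = str_replace("'", "''", strtolower($value));
            if ($comparison === 'CONTAINS') {
                $where_parts[] = "LOWER($field) LIKE '%$quoted%'";
            } elseif ($comparison === '=') {
                $where_parts[] = "LOWER($field) = '$quoted'";
            }
        }
        if ($direction === 'ASC' || $direction === 'DESC') {
            $order_parts[] = "$field $direction";
        }
    }
    $where = $where_parts ? 'WHERE ' . implode(' AND ', $where_parts) : '';
    $order_by = $order_parts ? 'ORDER BY ' . implode(', ', $order_parts) : '';
    return [$where, $order_by];
}
```

In this sketch a status field with value "-1" still takes part in sorting, which mirrors the role described for $any_fields. LIKE wildcards inside the value are not escaped, so real code would need stricter handling.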
selectCallback()
Controls which columns and the names of those columns from the tables underlying the given model should be returned from a getRows call.
public
selectCallback([mixed $args = null ]) : string
This defaults to *, but in general will be overridden in subclasses of Model
Parameters
- $args : mixed = null
-
any additional arguments which should be used to determine the columns
Return values
string —a comma separated list of columns suitable for a SQL query
translateDb()
Used to get the translation of a string_id stored in the database to the given locale.
public
translateDb(string $string_id, string $locale_tag) : mixed
Parameters
- $string_id : string
-
id to translate
- $locale_tag : string
-
locale to translate to
Return values
mixed —translation if found; $string_id otherwise
update()
Used to update the fields stored in a SCRAPER row according to an array holding new values
public
update(array<string|int, mixed> $scraper_info) : mixed
Parameters
- $scraper_info : array<string|int, mixed>
-
updated values for scraper
Return values
mixed
whereCallback()
Controls the WHERE clause of the SQL query that underlies the given model and should be used in a getRows call.
public
whereCallback([mixed $args = null ]) : string
This defaults to an empty WHERE clause.
Parameters
- $args : mixed = null
-
additional arguments that might be used to construct the WHERE clause.
Return values
string —a SQL WHERE clause
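The callback methods documented on this page (selectCallback(), fromCallback(), whereCallback(), rowCallback() and postQueryCallback()) read like hooks that getRows() calls in turn. The sketch below shows that pattern with stand-in classes. ModelSketch, ScraperModelSketch, buildSql() and finishRows() are hypothetical, the default table name is invented, and treating the whereCallback() result as a complete clause including the WHERE keyword is an assumption the source does not confirm.

```php
<?php
/**
 * Hypothetical base class showing how the documented hooks could fit
 * together. No database is involved.
 */
class ModelSketch
{
    public function selectCallback($args = null): string
    {
        return '*'; // documented default
    }

    public function fromCallback($args = null): string
    {
        return 'SOME_TABLE'; // invented placeholder
    }

    public function whereCallback($args = null): string
    {
        return ''; // documented default: empty WHERE clause
    }

    public function rowCallback(array $row, $args): array
    {
        return $row; // documented default: row unchanged
    }

    public function postQueryCallback(array $rows): array
    {
        return $rows; // documented default: rows unchanged
    }

    /** Assembles the query text from the three SQL hooks. */
    public function buildSql($args = null): string
    {
        $sql = 'SELECT ' . $this->selectCallback($args) .
            ' FROM ' . $this->fromCallback($args);
        $where = $this->whereCallback($args);
        return $where === '' ? $sql : $sql . ' ' . $where;
    }

    /** Applies the per-row hook, then the final whole-result hook. */
    public function finishRows(array $fetched, $args = null): array
    {
        $rows = [];
        foreach ($fetched as $row) {
            $rows[] = $this->rowCallback($row, $args);
        }
        return $this->postQueryCallback($rows);
    }
}

/** Hypothetical subclass overriding only the hooks it cares about. */
class ScraperModelSketch extends ModelSketch
{
    public function fromCallback($args = null): string
    {
        return 'SCRAPER';
    }

    public function rowCallback(array $row, $args): array
    {
        $row['NAME'] = trim($row['NAME']);
        return $row;
    }
}
```

A subclass that overrides nothing still works, since every hook has a pass-through default. That is the main appeal of structuring a shared getRows() this way.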