CrawlModel
extends ParallelModel
in package
This class is used to handle getting/setting crawl parameters, CRUD operations on current crawls, starting, stopping, and checking the status of crawls, getting cache files out of crawls, determining which index should be used by default, marshalling/unmarshalling crawl mixes, and handling data from suggest-a-url forms
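A minimal construction sketch follows. It assumes a configured Yioop installation whose autoloader (or bootstrap script) has already been required, and that the class lives under the usual seekquarry\yioop\models namespace; treat both as assumptions of the sketch rather than guarantees of this page. The usage sketches further down the page reuse the $crawl_model built here.

    <?php
    // Assumed: Yioop's autoloader has been required and model classes live
    // in the usual seekquarry\yioop\models namespace.
    use seekquarry\yioop\models\CrawlModel;

    $crawl_model = new CrawlModel();  // connect to the default search engine DB
    // Point the model at the index archive currently used for search results.
    $crawl_model->index_name = $crawl_model->getCurrentIndexDatabaseName();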
Table of Contents
- DEFAULT_DESCRIPTION_LENGTH = 150
- Default maximum character length of a search summary
- MAX_SNIPPET_TITLE_LENGTH = 20
- MIN_DESCRIPTION_LENGTH = 100
- the minimum length of a description before we stop appending additional link doc summaries
- MIN_SNIPPET_LENGTH = 100
- SNIPPET_LENGTH_LEFT = 20
- SNIPPET_LENGTH_RIGHT = 40
- SNIPPET_TITLE_LENGTH = 20
- $any_fields : array<string|int, mixed>
- These fields if present in $search_array (used by @see getRows() ), but with value "-1", will be skipped as part of the where clause but will be used for order by clause
- $cache : object
- Cache object to be used if we are doing caching
- $current_machine : int
- If known the id of the queue_server this belongs to
- $db : object
- Reference to a DatasourceManager
- $db_name : string
- Name of the search engine database
- $edited_page_summaries : array<string|int, mixed>
- Associative array of page summaries which might be used to override default page summaries if set.
- $index_name : string
- Stores the name of the current index archive to use to get search results from
- $private_db : object
- Reference to a private DatasourceManager
- $private_db_name : string
- Name of the private search engine database
- $search_table_column_map : array<string|int, mixed>
- Used to map between search crawl mix form variables and database columns
- $suggest_url_file : string
- File to be used to store suggest-a-url form data
- $web_site : object
- Reference to a WebSite object in use to serve pages (if any)
- __construct() : mixed
- Sets up the database manager that will be used and name of the search engine database
- aggregateCrawlList() : array<string|int, mixed>
- When @see getCrawlList() is used in a multi-queue-server setting, this method is used to integrate the crawl lists received from the different machines
- aggregateStalled() : array<string|int, mixed>
- When @see crawlStalled() is used in a multi-queue-server setting, this method is used to integrate the stalled information received from the different machines
- aggregateStatuses() : array<string|int, mixed>
- When @see crawlStatus() is used in a multi-queue-server setting, this method is used to integrate the status information received from the different machines
- appendSuggestSites() : string
- Adds new distinct urls to those already saved in the suggest_url_file. If the supplied url is not new, or the file size exceeds MAX_SUGGEST_URL_FILE_SIZE, then it is not added.
- boldKeywords() : string
- Given a string, wraps in bold html tags a set of key words it contains.
- clearCrawlCaches() : mixed
- Clears several memory and file caches related to crawls and networking.
- clearQuerySavePoint() : mixed
- A save point is used to store to disk a sequence of generation-doc-offset pairs of a particular mix query when doing an archive crawl of a crawl mix. This is used so that the mix can remember where it was the next time it is invoked by the web app on the machine in question.
- clearSuggestSites() : mixed
- Resets the suggest_url_file to be the empty file
- combinedCrawlInfo() : array<string|int, mixed>
- This method is used to reduce the number of network requests needed by the crawlStatus method of admin_controller. It returns an array containing the results of @see crawlStalled, @see crawlStatus, and @see getCrawlList
- crawlStalled() : bool
- Determines if the length of time since any of the fetchers has spoken with any of the queue servers has exceeded CRAWL_TIMEOUT. If so, typically the caller of this method would do something such as officially stop the crawl.
- crawlStatus() : array<string|int, mixed>
- Returns data about current crawl such as DESCRIPTION, TIMESTAMP, peak memory of various processes, most recent fetcher, most recent urls, urls seen, urls visited, etc.
- createIfNecessaryDirectory() : int
- Creates a directory and sets it to world permission if it doesn't already exist
- deleteCrawl() : mixed
- Deletes the crawl with the supplied timestamp if it exists. Also deletes any crawl mixes making use of this crawl
- deleteCrawlMix() : mixed
- Deletes from the DB the crawl mix and its associated components and fragments
- deleteCrawlMixIteratorState() : mixed
- Deletes the archive iterator and savepoint files created during the process of iterating through a crawl mix.
- execMachines() : array<string|int, mixed>
- This method is invoked by other ParallelModel (@see CrawlModel for examples) methods when they want to have their method performed on an array of other Yioop instances. The results returned can then be aggregated. The invocation sequence is: crawlModelMethodA invokes execMachine with a list of urls of other Yioop instances; execMachine makes REST requests of those instances with the given command and optional arguments; each request is handled by a CrawlController, which in turn calls crawlModelMethodA on the given Yioop instance, serializes the result, and gives it back to execMachine and then back to the originally calling function.
- fileGetContents() : string
- Either a wrapper for file_get_contents, or, if a WebSite object is being used to serve pages, it reads the file in using blocking I/O file_get_contents() and caches it before returning its string contents.
- filePutContents() : mixed
- Either a wrapper for file_put_contents, or if a WebSite object is being used to serve pages, writes $data to the persistent file with name $filename. Saves a copy in the RAM cache if there is a copy already there.
- formatSinglePageResult() : array<string|int, mixed>
- Given a page summary, extracts snippets which are related to a set of search words. For each snippet, bold faces the search terms, and then creates a new summary array.
- fromCallback() : string
- {@inheritDoc}
- getChannel() : int
- Gets the channel of the crawl with the given timestamp
- getCrawlItem() : array<string|int, mixed>
- Gets a summary of a document based on its url, the active machines, and the index we want to look it up in.
- getCrawlItems() : array<string|int, mixed>
- Gets summaries for a set of documents by their urls, or by a group of 5-tuples of the form (machine, key, index, generation, offset).
- getCrawlList() : array<string|int, mixed>
- Gets a list of all index archives of crawls that have been conducted
- getCrawlMix() : array<string|int, mixed>
- Retrieves the weighting component of the requested crawl mix
- getCrawlMixTimestamp() : mixed
- Returns the timestamp associated with a mix name;
- getCrawlSeedInfo() : array<string|int, mixed>
- Returns the crawl parameters that were used during a given crawl
- getCurrentIndexDatabaseName() : string
- Gets the name (aka timestamp) of the current index archive to be used to handle search queries
- getDbmsList() : array<string|int, mixed>
- Gets a list of all DBMS that work with the search engine
- getDeltaFileInfo() : array<string|int, mixed>
- Returns all the files in $dir or its subdirectories with modified times more recent than timestamp. Files which have in their path or name a string from the $excludes array will be excluded
- getInfoTimestamp() : array<string|int, mixed>
- Get a description associated with a Web Crawl or Crawl Mix
- getMixList() : array<string|int, mixed>
- Gets a list of all mixes of available crawls
- getRows() : array<string|int, mixed>
- Gets a range of rows which match the provided search criteria from the provided table
- getSeedInfo() : array<string|int, mixed>
- Returns the initial sites that a new crawl will start with along with crawl parameters such as crawl order, allowed and disallowed crawl sites
- getSnippets() : string
- Given a string, extracts snippets of text related to a given set of key words. For a given word, a snippet is a window of characters to its left and right that is less than a maximum total number of characters.
- getSuggestSites() : array<string|int, mixed>
- Returns an array of urls which were stored via the suggest-a-url form in suggest_view.php
- getUserId() : string
- Gets the user_id associated with a given username (in the base class as it is used as an internal method in both the signin and user models)
- injectUrlsCurrentCrawl() : mixed
- Add the provided urls to the schedule directory of URLs that will be crawled
- isCrawlMix() : bool
- Returns whether the supplied timestamp corresponds to a crawl mix
- isMixOwner() : bool
- Returns whether there is a mix with the given $timestamp that $user_id owns. Currently, mix ownership is ignored and this is set to always return true
- isSingleLocalhost() : bool
- Used to determine if an action involves just one yioop instance on the current local machine or not
- loginDbms() : bool
- Returns whether the provided dbms needs a login and password or not (sqlite or sqlite3)
- lookupSummaryOffsetGeneration() : array<string|int, mixed>
- Determines the offset into the summaries WebArchiveBundle and generation of the provided url (or hash_url) so that the info:url (info:base64_hash_url) summary can be retrieved. This assumes of course that the info:url meta word has been stored.
- networkGetCrawlItems() : array<string|int, mixed>
- In a multiple queue server setting, gets summaries for a set of documents by their urls, or by a group of 5-tuples of the form (machine, key, index, generation, offset). This makes an execMachines call to make a network request to the CrawlControllers on each machine, which in turn call getCrawlItems (and thence nonNetworkGetCrawlItems) on each machine. The results are then sent back to networkGetCrawlItems and aggregated.
- nonNetworkGetCrawlItems() : array<string|int, mixed>
- Gets summaries on a particular machine for a set of documents by their urls, or by a group of 5-tuples of the form (machine, key, index, generation, offset). This may be used in the single queue_server setting, or it may be called indirectly by a particular machine's CrawlController as part of fulfilling a network-based getCrawlItems request. $lookups contains items which are to be grouped (as they came from the same url or from a site with the same cache), so this function aggregates their descriptions.
- postQueryCallback() : array<string|int, mixed>
- Called after getRows has retrieved all the rows that it would retrieve but before they are returned to give one last place where they could be further manipulated. For example, in MachineModel this callback is used to make parallel network calls to get the status of each machine returned by getRows. The default for this method is to leave the rows that would be returned unchanged
- rowCallback() : array<string|int, mixed>
- {@inheritDoc}
- searchArrayToWhereOrderClauses() : array<string|int, mixed>
- Creates the WHERE and ORDER BY clauses for a query of a Yioop table such as USERS, ROLE, GROUP, which have associated search web forms. Searches are case insensitive
- selectCallback() : string
- Controls which columns, and the names of those columns, from the tables underlying the given model should be returned from a getRows call.
- sendStartCrawlMessage() : mixed
- Used to send a message to the queue servers to start a crawl
- sendStopCrawlMessage() : mixed
- Used to send a message to the queue servers to stop a crawl
- setCrawlMix() : mixed
- Stores in DB the supplied crawl mix object
- setCrawlSeedInfo() : mixed
- Changes the crawl parameters of an existing crawl (this can be done while crawling). Not all fields are allowed to be updated
- setCurrentIndexDatabaseName() : mixed
- Sets the IndexArchive that will be used for search results
- setSeedInfo() : mixed
- Writes a crawl.ini file with the provided data to the user's WORK_DIRECTORY
- startQueueServerFetchers() : bool
- Used to start QueueServers and Fetchers on current machine when it is detected that someone tried to start a crawl but hadn't started any queue servers or fetchers.
- translateDb() : mixed
- Used to get the translation of a string_id stored in the database to the given locale.
- whereCallback() : string
- Controls the WHERE clause of the SQL query that underlies the given model and should be used in a getRows call.
Constants
DEFAULT_DESCRIPTION_LENGTH
Default maximum character length of a search summary
public
mixed
DEFAULT_DESCRIPTION_LENGTH
= 150
MAX_SNIPPET_TITLE_LENGTH
public
mixed
MAX_SNIPPET_TITLE_LENGTH
= 20
MIN_DESCRIPTION_LENGTH
the minimum length of a description before we stop appending additional link doc summaries
public
mixed
MIN_DESCRIPTION_LENGTH
= 100
MIN_SNIPPET_LENGTH
public
mixed
MIN_SNIPPET_LENGTH
= 100
SNIPPET_LENGTH_LEFT
public
mixed
SNIPPET_LENGTH_LEFT
= 20
SNIPPET_LENGTH_RIGHT
public
mixed
SNIPPET_LENGTH_RIGHT
= 40
SNIPPET_TITLE_LENGTH
public
mixed
SNIPPET_TITLE_LENGTH
= 20
Properties
$any_fields
These fields if present in $search_array (used by @see getRows() ), but with value "-1", will be skipped as part of the where clause but will be used for order by clause
public
array<string|int, mixed>
$any_fields
= []
$cache
Cache object to be used if we are doing caching
public
static object
$cache
$current_machine
If known the id of the queue_server this belongs to
public
int
$current_machine
$db
Reference to a DatasourceManager
public
object
$db
$db_name
Name of the search engine database
public
string
$db_name
$edited_page_summaries
Associative array of page summaries which might be used to override default page summaries if set.
public
array<string|int, mixed>
$edited_page_summaries
= null
$index_name
Stores the name of the current index archive to use to get search results from
public
string
$index_name
$private_db
Reference to a private DatasourceManager
public
object
$private_db
$private_db_name
Name of the private search engine database
public
string
$private_db_name
$search_table_column_map
Used to map between search crawl mix form variables and database columns
public
array<string|int, mixed>
$search_table_column_map
= ["name" => "NAME", "owner_id" => "OWNER_ID"]
$suggest_url_file
File to be used to store suggest-a-url form data
public
string
$suggest_url_file
$web_site
Reference to a WebSite object in use to serve pages (if any)
public
object
$web_site
Methods
__construct()
Sets up the database manager that will be used and name of the search engine database
public
__construct([string $db_name = CDB_NAME ][, bool $connect = true ]) : mixed
Parameters
- $db_name : string = CDB_NAME
-
the name of the database for the search engine
- $connect : bool = true
-
whether to connect to the database by default after making the datasource class
Return values
mixed
aggregateCrawlList()
When @see getCrawlList() is used in a multi-queue-server setting, this method is used to integrate the crawl lists received from the different machines
public
aggregateCrawlList(array<string|int, mixed> $list_strings[, string $data_field = null ]) : array<string|int, mixed>
Parameters
- $list_strings : array<string|int, mixed>
-
serialized crawl list data from different queue servers
- $data_field : string = null
-
field of $list_strings to use for data
Return values
array<string|int, mixed> —list of crawls and their meta data
aggregateStalled()
When @see crawlStalled() is used in a multi-queue-server setting, this method is used to integrate the stalled information received from the different machines
public
aggregateStalled(array<string|int, mixed> $stall_statuses[, string $data_field = null ]) : array<string|int, mixed>
Parameters
- $stall_statuses : array<string|int, mixed>
-
contains web-encoded serialized data, one field of which has the boolean data concerning stalled status
- $data_field : string = null
-
field of $stall_statuses to use for data if null then each element of $stall_statuses is a web encoded serialized boolean
Return values
array<string|int, mixed>
aggregateStatuses()
When @see crawlStatus() is used in a multi-queue-server setting, this method is used to integrate the status information received from the different machines
public
aggregateStatuses(array<string|int, mixed> $status_strings[, string $data_field = null ]) : array<string|int, mixed>
Parameters
- $status_strings : array<string|int, mixed>
- $data_field : string = null
-
field of $status_strings to use for data
Return values
array<string|int, mixed> —associative array of DESCRIPTION, TIMESTAMP, peak memory of various processes, most recent fetcher, most recent urls, urls seen, urls visited, etc.
appendSuggestSites()
Adds new distinct urls to those already saved in the suggest_url_file. If the supplied url is not new, or the file size exceeds MAX_SUGGEST_URL_FILE_SIZE, then it is not added.
public
appendSuggestSites(string $url) : string
Parameters
- $url : string
-
to add
Return values
string —true if the url was added or already existed in the file; false otherwise
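A short round-trip sketch of the suggest-a-url store, reusing the $crawl_model from the construction sketch near the top of this page (getSuggestSites() and clearSuggestSites() are documented further below):

    $added = $crawl_model->appendSuggestSites("https://www.example.com/");
    if ($added) {
        // getSuggestSites() returns every url currently stored in the file
        foreach ($crawl_model->getSuggestSites() as $url) {
            echo $url . "\n";
        }
    }
    $crawl_model->clearSuggestSites();  // reset suggest_url_file to empty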
boldKeywords()
Given a string, wraps in bold html tags a set of key words it contains.
public
boldKeywords(string $text, array<string|int, mixed> $words) : string
Parameters
- $text : string
-
haystack string to look for the key words
- $words : array<string|int, mixed>
-
an array of words to bold face
Return values
string —the resulting string after boldfacing has been applied
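For example, a small sketch reusing the $crawl_model from the construction sketch above (the exact tag emitted is whatever the method produces; <b> is shown only as an illustration):

    $text = "Yioop is an open source search engine written in PHP";
    echo $crawl_model->boldKeywords($text, ["search", "engine"]);
    // Matched words come back wrapped in bold HTML tags, for example
    // "... open source <b>search</b> <b>engine</b> written in PHP"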
clearCrawlCaches()
Clears several memory and file caches related to crawls and networking.
public
clearCrawlCaches() : mixed
Return values
mixed
clearQuerySavePoint()
A save point is used to store to disk a sequence of generation-doc-offset pairs of a particular mix query when doing an archive crawl of a crawl mix. This is used so that the mix can remember where it was the next time it is invoked by the web app on the machine in question.
public
clearQuerySavePoint(int $save_timestamp[, array<string|int, mixed> $machine_urls = null ]) : mixed
This function deletes such a save point associated with a timestamp
Parameters
- $save_timestamp : int
-
timestamp of save point to delete
- $machine_urls : array<string|int, mixed> = null
-
machines on which to try to delete savepoint
Return values
mixed
clearSuggestSites()
Resets the suggest_url_file to be the empty file
public
clearSuggestSites() : mixed
Return values
mixed
combinedCrawlInfo()
This method is used to reduce the number of network requests needed by the crawlStatus method of admin_controller. It returns an array containing the results of @see crawlStalled, @see crawlStatus, and @see getCrawlList
public
combinedCrawlInfo([array<string|int, mixed> $machine_urls = null ][, bool $use_cache = false ]) : array<string|int, mixed>
Parameters
- $machine_urls : array<string|int, mixed> = null
-
an array of urls of yioop queue servers
- $use_cache : bool = false
-
whether to try to use a cached version of the crawl info or to always recompute it.
Return values
array<string|int, mixed> —containing three components one for each of the three kinds of results listed above
crawlStalled()
Determines if the length of time since any of the fetchers has spoken with any of the queue servers has exceeded CRAWL_TIMEOUT. If so, typically the caller of this method would do something such as officially stop the crawl.
public
crawlStalled([array<string|int, mixed> $machine_urls = null ]) : bool
Parameters
- $machine_urls : array<string|int, mixed> = null
-
an array of urls of yioop queue servers
Return values
bool —whether the current crawl is stalled or not
crawlStatus()
Returns data about current crawl such as DESCRIPTION, TIMESTAMP, peak memory of various processes, most recent fetcher, most recent urls, urls seen, urls visited, etc.
public
crawlStatus([array<string|int, mixed> $machine_urls = null ]) : array<string|int, mixed>
Parameters
- $machine_urls : array<string|int, mixed> = null
-
an array of urls of yioop queue servers on which the crawl is being conducted
Return values
array<string|int, mixed> —associative array of the said data
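A polling sketch reusing $crawl_model from the construction sketch above; the DESCRIPTION and TIMESTAMP keys are the ones named in the description, and $machine_urls can stay null on a single queue-server install:

    $status = $crawl_model->crawlStatus();                  // single queue server
    // $status = $crawl_model->crawlStatus($machine_urls);  // multi-server setting
    if (!empty($status["DESCRIPTION"])) {
        echo "Crawl '" . $status["DESCRIPTION"] . "' started at " .
            $status["TIMESTAMP"] . "\n";
    }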
createIfNecessaryDirectory()
Creates a directory and sets it to world permission if it doesn't already exist
public
createIfNecessaryDirectory(string $directory) : int
Parameters
- $directory : string
-
name of directory to create
Return values
int —-1 on failure, 0 if already existed, 1 if created
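A sketch using the documented return codes (the path is a placeholder):

    $result = $crawl_model->createIfNecessaryDirectory("/tmp/yioop_scratch");
    if ($result < 0) {
        echo "Could not create directory\n";   // -1 on failure
    } else if ($result == 0) {
        echo "Directory already existed\n";    // 0 if it was already there
    } else {
        echo "Directory created\n";            // 1 if newly created
    }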
deleteCrawl()
Deletes the crawl with the supplied timestamp if it exists. Also deletes any crawl mixes making use of this crawl
public
deleteCrawl(string $timestamp[, array<string|int, mixed> $machine_urls = null ]) : mixed
Parameters
- $timestamp : string
-
a Unix timestamp
- $machine_urls : array<string|int, mixed> = null
-
an array of urls of yioop queue servers
Return values
mixed
deleteCrawlMix()
Deletes from the DB the crawl mix and its associated components and fragments
public
deleteCrawlMix(int $timestamp) : mixed
Parameters
- $timestamp : int
-
of the mix to delete
Return values
mixed
deleteCrawlMixIteratorState()
Deletes the archive iterator and savepoint files created during the process of iterating through a crawl mix.
public
deleteCrawlMixIteratorState(int $timestamp) : mixed
Parameters
- $timestamp : int
-
The timestamp of the crawl mix
Return values
mixed
execMachines()
This method is invoked by other ParallelModel (@see CrawlModel for examples) methods when they want to have their method performed on an array of other Yioop instances. The results returned can then be aggregated. The invocation sequence is: crawlModelMethodA invokes execMachine with a list of urls of other Yioop instances; execMachine makes REST requests of those instances with the given command and optional arguments; each request is handled by a CrawlController, which in turn calls crawlModelMethodA on the given Yioop instance, serializes the result, and gives it back to execMachine and then back to the originally calling function.
public
execMachines(string $command, array<string|int, mixed> $machine_urls[, string $arg = null ], int $num_machines[, bool $send_specs = false ][, int $fetcher_queue_server_ratio = 1 ]) : array<string|int, mixed>
Parameters
- $command : string
-
the ParallelModel method to invoke on the remote Yioop instances
- $machine_urls : array<string|int, mixed>
-
machines to invoke this command on
- $arg : string = null
-
additional arguments to be passed to the remote machine
- $num_machines : int
-
the integer to be used in calculating partition
- $send_specs : bool = false
-
whether to send the queue_server, num fetcher info for given machine
- $fetcher_queue_server_ratio : int = 1
-
the maximum of 1 and the number of active fetchers currently running across all yioop instances divided by the number of queue servers
Return values
array<string|int, mixed> —a list of outputs from each machine that was called.
fileGetContents()
Either a wrapper for file_get_contents, or, if a WebSite object is being used to serve pages, it reads the file in using blocking I/O file_get_contents() and caches it before returning its string contents.
public
fileGetContents(string $filename[, bool $force_read = false ]) : string
Note this function assumes that only the web server is performing I/O with this file. filemtime() can be used to see if a file on disk has been changed, and then you can use $force_read = true below to force re-reading the file into the cache
Parameters
- $filename : string
-
name of file to get contents of
- $force_read : bool = false
-
whether to force the file to be read from persistent storage rather than the cache
Return values
string —contents of the file given by $filename
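A sketch pairing this method with filePutContents() (documented next); the path is a placeholder and $crawl_model comes from the construction sketch above:

    $scratch = "/tmp/yioop_scratch/settings.txt";   // placeholder path
    $crawl_model->filePutContents($scratch, "channel=0\n");
    // If another process may have changed the file on disk, pass true to
    // bypass any RAM-cached copy held by a WebSite object.
    echo $crawl_model->fileGetContents($scratch, true);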
filePutContents()
Either a wrapper for file_put_contents, or if a WebSite object is being used to serve pages, writes $data to the persistent file with name $filename. Saves a copy in the RAM cache if there is a copy already there.
public
filePutContents(string $filename, string $data) : mixed
Parameters
- $filename : string
-
name of file to write to persistent storages
- $data : string
-
string of data to store in file
Return values
mixed
formatSinglePageResult()
Given a page summary, extracts snippets which are related to a set of search words. For each snippet, bold faces the search terms, and then creates a new summary array.
public
formatSinglePageResult(array<string|int, mixed> $page[, array<string|int, mixed> $words = null ][, int $description_length = self::DEFAULT_DESCRIPTION_LENGTH ]) : array<string|int, mixed>
Parameters
- $page : array<string|int, mixed>
-
a single search result summary
- $words : array<string|int, mixed> = null
-
keywords (typically what was searched on)
- $description_length : int = self::DEFAULT_DESCRIPTION_LENGTH
-
length of the description
Return values
array<string|int, mixed> —$page which has been snippified and bold faced
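A sketch that snippifies a stored summary. The summary is fetched with getCrawlItem() (documented below), the url is a placeholder, and the search words are arbitrary:

    $page = $crawl_model->getCrawlItem("https://www.example.com/");
    if ($page) {
        $page = $crawl_model->formatSinglePageResult($page,
            ["example", "domain"], CrawlModel::DEFAULT_DESCRIPTION_LENGTH);
        print_r($page);  // summary with snippets and bold-faced search terms
    }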
fromCallback()
{@inheritDoc}
public
fromCallback([mixed $args = null ]) : string
Parameters
- $args : mixed = null
-
any additional arguments which should be used to determine these tables (in this case none)
Return values
string —a comma separated list of tables suitable for a SQL query
getChannel()
Gets the channel of the crawl with the given timestamp
public
getChannel(int $timestamp) : int
Parameters
- $timestamp : int
-
of crawl to get channel for
Return values
int —$channel used by that crawl
getCrawlItem()
Gets a summary of a document based on its url, the active machines, and the index we want to look it up in.
public
getCrawlItem(string $url[, array<string|int, mixed> $machine_urls = null ][, string $index_name = "" ]) : array<string|int, mixed>
Parameters
- $url : string
-
of summary we are trying to look-up
- $machine_urls : array<string|int, mixed> = null
-
an array of urls of yioop queue servers
- $index_name : string = ""
-
timestamp of the index to do the lookup in
Return values
array<string|int, mixed> —summary data of the matching document
getCrawlItems()
Gets summaries for a set of documents by their urls, or by a group of 5-tuples of the form (machine, key, index, generation, offset).
public
getCrawlItems(string $lookups[, array<string|int, mixed> $machine_urls = null ][, array<string|int, mixed> $exclude_fields = [] ][, array<string|int, mixed> $format_words = null ][, int $description_length = self::DEFAULT_DESCRIPTION_LENGTH ]) : array<string|int, mixed>
For Version >= 3 indexes, the offset is the code "PDB", as a lookup can be done using the first four items.
Parameters
- $lookups : string
-
things whose summaries we are trying to look up
- $machine_urls : array<string|int, mixed> = null
-
an array of urls of yioop queue servers
- $exclude_fields : array<string|int, mixed> = []
-
an array of fields which might be in the crawlItem but which should be excluded from the result. This will make the result smaller and so hopefully faster to transmit
- $format_words : array<string|int, mixed> = null
-
words which should be highlighted in search snippets returned
- $description_length : int = self::DEFAULT_DESCRIPTION_LENGTH
-
length of snippets to be returned for each search result
Return values
array<string|int, mixed> —of summary data for the matching documents
getCrawlList()
Gets a list of all index archives of crawls that have been conducted
public
getCrawlList([bool $return_arc_bundles = false ][, bool $return_recrawls = false ][, array<string|int, mixed> $machine_urls = null ][, bool $cache = false ]) : array<string|int, mixed>
Parameters
- $return_arc_bundles : bool = false
-
whether index bundles used for indexing arc or other archive bundles should be included in the list
- $return_recrawls : bool = false
-
whether index archive bundles generated as a result of recrawling should be included in the result
- $machine_urls : array<string|int, mixed> = null
-
an array of urls of yioop queue servers
- $cache : bool = false
-
whether to try to get/set the data to a cache file
Return values
array<string|int, mixed> —available IndexArchiveBundle directories and their meta information; this meta information includes the time of the crawl, its description, the number of pages downloaded, and the number of partitions used in storing the inverted index
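A listing sketch reusing $crawl_model from the construction sketch above. The per-entry key names used below (CRAWL_TIME, DESCRIPTION) are an assumption of this example; the signature only promises the meta information described above:

    $crawls = $crawl_model->getCrawlList(false, false, null, true);
    // On a multi-machine install pass an array of queue-server urls
    // instead of null as the third argument.
    foreach ($crawls as $crawl) {
        // Key names below are assumed, not guaranteed by this page.
        echo $crawl["CRAWL_TIME"] . "\t" . $crawl["DESCRIPTION"] . "\n";
    }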
getCrawlMix()
Retrieves the weighting component of the requested crawl mix
public
getCrawlMix(string $timestamp[, bool $just_components = false ]) : array<string|int, mixed>
Parameters
- $timestamp : string
-
of the requested crawl mix
- $just_components : bool = false
-
says whether to find the mix name or just the components array.
Return values
array<string|int, mixed> —the crawls and their weights that make up the requested crawl mix.
getCrawlMixTimestamp()
Returns the timestamp associated with a mix name;
public
getCrawlMixTimestamp(string $mix_name) : mixed
Parameters
- $mix_name : string
-
name to lookup
Return values
mixed —timestamp associated with the name if it exists; false otherwise
getCrawlSeedInfo()
Returns the crawl parameters that were used during a given crawl
public
getCrawlSeedInfo(string $timestamp[, array<string|int, mixed> $machine_urls = null ]) : array<string|int, mixed>
Parameters
- $timestamp : string
-
timestamp of the crawl to load the crawl parameters of
- $machine_urls : array<string|int, mixed> = null
-
an array of urls of yioop queue servers
Return values
array<string|int, mixed> —the first sites to crawl during the next crawl, together with restrict_by_url, allowed, and disallowed_sites settings
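A sketch of tweaking a live crawl's parameters with setCrawlSeedInfo() (documented below). The timestamp is a placeholder and the crawl.ini-style array layout used here is an assumption of the example:

    $timestamp = "1602891243";  // placeholder crawl timestamp
    $info = $crawl_model->getCrawlSeedInfo($timestamp);
    $info["allowed_sites"]["url"][] = "domain:example.org";  // assumed layout
    $crawl_model->setCrawlSeedInfo($timestamp, $info);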
getCurrentIndexDatabaseName()
Gets the name (aka timestamp) of the current index archive to be used to handle search queries
public
getCurrentIndexDatabaseName() : string
Return values
string —the timestamp of the archive
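A get/set sketch (setCurrentIndexDatabaseName() is documented below; the timestamp passed in is a placeholder):

    echo "Search queries currently use index " .
        $crawl_model->getCurrentIndexDatabaseName() . "\n";
    $crawl_model->setCurrentIndexDatabaseName("1602891243");  // placeholder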
getDbmsList()
Gets a list of all DBMS that work with the search engine
public
getDbmsList() : array<string|int, mixed>
Return values
array<string|int, mixed> —Names of available data sources
getDeltaFileInfo()
Returns all the files in $dir or its subdirectories with modified times more recent than timestamp. Files which have in their path or name a string from the $excludes array will be excluded
public
getDeltaFileInfo(string $dir, int $timestamp, array<string|int, mixed> $excludes) : array<string|int, mixed>
Parameters
- $dir : string
-
a directory to traverse
- $timestamp : int
-
used to check modified times against
- $excludes : array<string|int, mixed>
-
an array of path substrings to exclude
Return values
array<string|int, mixed> —of file structs consisting of name, modified time and size.
getInfoTimestamp()
Get a description associated with a Web Crawl or Crawl Mix
public
getInfoTimestamp(int $timestamp[, array<string|int, mixed> $machine_urls = null ]) : array<string|int, mixed>
Parameters
- $timestamp : int
-
of crawl or mix in question
- $machine_urls : array<string|int, mixed> = null
-
an array of urls of yioop queue servers
Return values
array<string|int, mixed> —associative array containing item DESCRIPTION
getMixList()
Gets a list of all mixes of available crawls
public
getMixList(int $user_id[, bool $with_components = false ]) : array<string|int, mixed>
Parameters
- $user_id : int
-
user that we are getting a list of mixes for. We have disabled mix sharing, so for now this is all mixes
- $with_components : bool = false
-
if false then don't load the factors that make up the crawl mix, just load the name of the mixes and their timestamps; otherwise, if true loads everything
Return values
array<string|int, mixed> —list of available crawls
getRows()
Gets a range of rows which match the provided search criteria from the provided table
public
getRows(int $limit, int $num, int &$total[, array<string|int, mixed> $search_array = [] ][, array<string|int, mixed> $args = null ]) : array<string|int, mixed>
Parameters
- $limit : int
-
starting row from the potential results to return
- $num : int
-
number of rows after start row to return
- $total : int
-
gets set with the total number of rows that can be returned by the given database query
- $search_array : array<string|int, mixed> = []
-
each element of this is a quadruple name of a field, what comparison to perform, a value to check, and an order (ascending/descending) to sort by
- $args : array<string|int, mixed> = null
-
additional values which may be used to get rows (what these are will typically depend on the subclass implementation)
Return values
array<string|int, mixed>
getSeedInfo()
Returns the initial sites that a new crawl will start with along with crawl parameters such as crawl order, allowed and disallowed crawl sites
public
getSeedInfo([bool $use_default = false ]) : array<string|int, mixed>
Parameters
- $use_default : bool = false
-
whether or not to use the Yioop! default crawl.ini file rather than the one created by the user.
Return values
array<string|int, mixed> —the first sites to crawl during the next crawl, together with restrict_by_url, allowed, and disallowed_sites settings
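A sketch of editing the next crawl's configuration with setSeedInfo() (documented below). The section and field names mirror crawl.ini and are an assumption of this example:

    $seed_info = $crawl_model->getSeedInfo();
    // Assumed crawl.ini-style layout: add a seed site and an allowed domain.
    $seed_info["seed_sites"]["url"][] = "https://www.example.com/";
    $seed_info["allowed_sites"]["url"][] = "domain:example.com";
    $crawl_model->setSeedInfo($seed_info);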
getSnippets()
Given a string, extracts snippets of text related to a given set of key words. For a given word, a snippet is a window of characters to its left and right that is less than a maximum total number of characters.
public
getSnippets(string $text, array<string|int, mixed> $words, string $description_length) : string
There is also a rule that a snippet should avoid ending in the middle of a word
Parameters
- $text : string
-
haystack to extract snippet from
- $words : array<string|int, mixed>
-
keywords to look for in the haystack
- $description_length : string
-
length of the description desired
Return values
string —a concatenation of the extracted snippets of each word
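For example, a sketch with an inline filler haystack and the class's own default length constant:

    $haystack = str_repeat("Filler sentence about nothing in particular. ", 10) .
        "Yioop crawls and indexes the public web. " .
        str_repeat("More filler text to pad out the description. ", 10);
    echo $crawl_model->getSnippets($haystack, ["crawls", "indexes"],
        CrawlModel::DEFAULT_DESCRIPTION_LENGTH);
    // Prints windows of text around "crawls" and "indexes", concatenated and
    // trimmed so a snippet does not end in the middle of a word.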
getSuggestSites()
Returns an array of urls which were stored via the suggest-a-url form in suggest_view.php
public
getSuggestSites() : array<string|int, mixed>
Return values
array<string|int, mixed> —urls that have been suggested
getUserId()
Gets the user_id associated with a given username (in the base class as it is used as an internal method in both the signin and user models)
public
getUserId(string $username) : string
Parameters
- $username : string
-
the username to look up
Return values
string —the corresponding userid
injectUrlsCurrentCrawl()
Add the provided urls to the schedule directory of URLs that will be crawled
public
injectUrlsCurrentCrawl(string $timestamp, array<string|int, mixed> $inject_urls[, array<string|int, mixed> $machine_urls = null ]) : mixed
Parameters
- $timestamp : string
-
Unix timestamp of the crawl whose schedule the urls should be added to
- $inject_urls : array<string|int, mixed>
-
urls to be added to the schedule of the active crawl
- $machine_urls : array<string|int, mixed> = null
-
an array of urls of yioop queue servers
Return values
mixed
isCrawlMix()
Returns whether the supplied timestamp corresponds to a crawl mix
public
isCrawlMix(string $timestamp) : bool
Parameters
- $timestamp : string
-
of the requested crawl mix
Return values
bool —true if it does; false otherwise
isMixOwner()
Returns whether there is a mix with the given $timestamp that $user_id owns. Currently, mix ownership is ignored and this is set to always return true
public
isMixOwner(string $timestamp, string $user_id) : bool
Parameters
- $timestamp : string
-
to see if exists
- $user_id : string
-
id of would be owner
Return values
bool —true if owner; false otherwise
isSingleLocalhost()
Used to determine if an action involves just one yioop instance on the current local machine or not
public
isSingleLocalhost(array<string|int, mixed> $machine_urls[, string $index_timestamp = -1 ]) : bool
Parameters
- $machine_urls : array<string|int, mixed>
-
urls of yioop instances to which the action applies
- $index_timestamp : string = -1
-
if timestamp exists checks if the index has declared itself to be a no network index.
Return values
bool —whether it involves a single local yioop instance (true) or not (false)
loginDbms()
Returns whether the provided dbms needs a login and password or not (sqlite or sqlite3)
public
loginDbms(string $dbms) : bool
Parameters
- $dbms : string
-
the name of a database management system
Return values
bool —true if needs a login and password; false otherwise
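A small sketch combining this with getDbmsList() (documented above):

    foreach ($crawl_model->getDbmsList() as $dbms) {
        echo $dbms . " needs credentials: " .
            ($crawl_model->loginDbms($dbms) ? "yes" : "no") . "\n";
    }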
lookupSummaryOffsetGeneration()
Determines the offset into the summaries WebArchiveBundle and generation of the provided url (or hash_url) so that the info:url (info:base64_hash_url) summary can be retrieved. This assumes of course that the info:url meta word has been stored.
public
lookupSummaryOffsetGeneration(string $url_or_key[, string $index_name = "" ][, bool $is_key = false ]) : array<string|int, mixed>
Parameters
- $url_or_key : string
-
either info:base64_hash_url or just a url to lookup
- $index_name : string = ""
-
index into which to do the lookup
- $is_key : bool = false
-
whether the string is info:base64_hash_url or just a url
Return values
array<string|int, mixed> —(offset, generation) into the web archive bundle
networkGetCrawlItems()
In a multiple queue server setting, gets summaries for a set of documents by their urls, or by a group of 5-tuples of the form (machine, key, index, generation, offset). This makes an execMachines call to make a network request to the CrawlControllers on each machine, which in turn call getCrawlItems (and thence nonNetworkGetCrawlItems) on each machine. The results are then sent back to networkGetCrawlItems and aggregated.
public
networkGetCrawlItems(string $lookups, array<string|int, mixed> $machine_urls[, array<string|int, mixed> $exclude_fields = [] ][, array<string|int, mixed> $format_words = null ][, int $description_length = self::DEFAULT_DESCRIPTION_LENGTH ]) : array<string|int, mixed>
Parameters
- $lookups : string
-
things whose summaries we are trying to look up
- $machine_urls : array<string|int, mixed>
-
an array of urls of yioop queue servers
- $exclude_fields : array<string|int, mixed> = []
-
an array of fields which might be in the crawlItem but which should be excluded from the result. This will make the result smaller and so hopefully faster to transmit
- $format_words : array<string|int, mixed> = null
-
words which should be highlighted in search snippets returned
- $description_length : int = self::DEFAULT_DESCRIPTION_LENGTH
-
length of snippets to be returned for each search result
Return values
array<string|int, mixed> —of summary data for the matching documents
nonNetworkGetCrawlItems()
Gets summaries on a particular machine for a set of documents by their urls, or by a group of 5-tuples of the form (machine, key, index, generation, offset). This may be used in the single queue_server setting, or it may be called indirectly by a particular machine's CrawlController as part of fulfilling a network-based getCrawlItems request. $lookups contains items which are to be grouped (as they came from the same url or from a site with the same cache), so this function aggregates their descriptions.
public
nonNetworkGetCrawlItems(string $lookups[, array<string|int, mixed> $exclude_fields = [] ][, array<string|int, mixed> $format_words = null ][, int $description_length = self::DEFAULT_DESCRIPTION_LENGTH ]) : array<string|int, mixed>
Parameters
- $lookups : string
-
things whose summaries we are trying to look up
- $exclude_fields : array<string|int, mixed> = []
-
an array of fields which might be in the crawlItem but which should be excluded from the result. This will make the result smaller and so hopefully faster to transmit
- $format_words : array<string|int, mixed> = null
-
words which should be highlighted in search snippets returned
- $description_length : int = self::DEFAULT_DESCRIPTION_LENGTH
-
length of snippets to be returned for each search result
Return values
array<string|int, mixed> —of summary data for the matching documents
postQueryCallback()
Called after getRows has retrieved all the rows that it would retrieve but before they are returned to give one last place where they could be further manipulated. For example, in MachineModel this callback is used to make parallel network calls to get the status of each machine returned by getRows. The default for this method is to leave the rows that would be returned unchanged
public
postQueryCallback(array<string|int, mixed> $rows) : array<string|int, mixed>
Parameters
- $rows : array<string|int, mixed>
-
that have been calculated so far by getRows
Return values
array<string|int, mixed> —$rows after this final manipulation
rowCallback()
{@inheritDoc}
public
rowCallback(array<string|int, mixed> $row, mixed $args) : array<string|int, mixed>
Parameters
- $row : array<string|int, mixed>
-
row as retrieved from database query
- $args : mixed
-
additional arguments that might be used by this callback. In this case, should be a boolean flag that says whether or not to add information about the components of the crawl mix
Return values
array<string|int, mixed> —$row after callback manipulation
searchArrayToWhereOrderClauses()
Creates the WHERE and ORDER BY clauses for a query of a Yioop table such as USERS, ROLE, GROUP, which have associated search web forms. Searches are case insensitive
public
searchArrayToWhereOrderClauses(array<string|int, mixed> $search_array[, array<string|int, mixed> $any_fields = ['status'] ]) : array<string|int, mixed>
Parameters
- $search_array : array<string|int, mixed>
-
each element of this is a quadruple name of a field, what comparison to perform, a value to check, and an order (ascending/descending) to sort by
- $any_fields : array<string|int, mixed> = ['status']
-
these fields if present in search array but with value "-1" will be skipped as part of the where clause but will be used for order by clause
Return values
array<string|int, mixed> —string for where clause, string for order by clause
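A sketch of the quadruple format described above, using a field name from $search_table_column_map; the comparison and direction literals shown are assumptions of this example, and the destructuring order follows the return description:

    // Each entry: field name, comparison to perform, value to check, sort direction.
    $search_array = [
        ["name", "CONTAINS", "news", "ASC"],   // literals here are assumed
    ];
    list($where, $order_by) =
        $crawl_model->searchArrayToWhereOrderClauses($search_array);
    echo $where . "\n" . $order_by . "\n";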
selectCallback()
Controls which columns, and the names of those columns, from the tables underlying the given model should be returned from a getRows call.
public
selectCallback([mixed $args = null ]) : string
This defaults to *, but in general will be overridden in subclasses of Model
Parameters
- $args : mixed = null
-
any additional arguments which should be used to determine the columns
Return values
string —a comma separated list of columns suitable for a SQL query
sendStartCrawlMessage()
Used to send a message to the queue servers to start a crawl
public
sendStartCrawlMessage(array<string|int, mixed> $crawl_params[, array<string|int, mixed> $seed_info = null ][, array<string|int, mixed> $machine_urls = null ], int $num_fetchers[, int $fetcher_queue_server_ratio = 1 ]) : mixed
Parameters
- $crawl_params : array<string|int, mixed>
-
has info like the time of the crawl, whether starting a new crawl or resuming an old one, etc.
- $seed_info : array<string|int, mixed> = null
-
what urls to crawl, etc as from the crawl.ini file
- $machine_urls : array<string|int, mixed> = null
-
an array of urls of yioop queue servers
- $num_fetchers : int
-
number of fetchers on machine to start. This parameter and $channel are used to start the daemons running on the machines if they aren't already running
- $fetcher_queue_server_ratio : int = 1
-
the maximum of 1 and the number of active fetchers currently running across all yioop instances divided by the number of queue servers
Return values
mixed
sendStopCrawlMessage()
Used to send a message to the queue servers to stop a crawl
public
sendStopCrawlMessage( $channel[, array<string|int, mixed> $machine_urls = null ]) : mixed
Parameters
- $channel :
-
of crawl to stop
- $machine_urls : array<string|int, mixed> = null
-
an array of urls of yioop queue servers
Return values
mixed
setCrawlMix()
Stores in DB the supplied crawl mix object
public
setCrawlMix(array<string|int, mixed> $mix) : mixed
Parameters
- $mix : array<string|int, mixed>
-
an associative array representing the crawl mix object
Return values
mixed
setCrawlSeedInfo()
Changes the crawl parameters of an existing crawl (this can be done while crawling). Not all fields are allowed to be updated
public
setCrawlSeedInfo(string $timestamp, array<string|int, mixed> $new_info[, array<string|int, mixed> $machine_urls = null ]) : mixed
Parameters
- $timestamp : string
-
timestamp of the crawl to change
- $new_info : array<string|int, mixed>
-
the new parameters
- $machine_urls : array<string|int, mixed> = null
-
an array of urls of yioop queue servers
Return values
mixed
setCurrentIndexDatabaseName()
Sets the IndexArchive that will be used for search results
public
setCurrentIndexDatabaseName( $timestamp) : mixed
Parameters
- $timestamp :
-
the timestamp of the index archive. The timestamp is when the crawl was started. Currently, the timestamp appears as substring of the index archives directory name
Return values
mixed
setSeedInfo()
Writes a crawl.ini file with the provided data to the user's WORK_DIRECTORY
public
setSeedInfo(array<string|int, mixed> $info) : mixed
Parameters
- $info : array<string|int, mixed>
-
an array containing information about the crawl
Return values
mixed
startQueueServerFetchers()
Used to start QueueServers and Fetchers on current machine when it is detected that someone tried to start a crawl but hadn't started any queue servers or fetchers.
public
startQueueServerFetchers(int $channel, int $num_fetchers) : bool
Parameters
- $channel : int
-
channel of crawl to start
- $num_fetchers : int
-
the number of fetchers on the current machine
Return values
bool —whether any processes were started
translateDb()
Used to get the translation of a string_id stored in the database to the given locale.
public
translateDb(string $string_id, string $locale_tag) : mixed
Parameters
- $string_id : string
-
id to translate
- $locale_tag : string
-
to translate to
Return values
mixed —translation if found; $string_id otherwise
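For example (the string id shown is a hypothetical placeholder; the locale tag uses Yioop's tag format):

    // Returns the translation for the fr-FR locale if one is stored,
    // otherwise the id itself comes back unchanged.
    echo $crawl_model->translateDb("db_activity_manage_crawls", "fr-FR");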
whereCallback()
Controls the WHERE clause of the SQL query that underlies the given model and should be used in a getRows call.
public
whereCallback([mixed $args = null ]) : string
This defaults to an empty WHERE clause.
Parameters
- $args : mixed = null
-
additional arguments that might be used to construct the WHERE clause.
Return values
string —a SQL WHERE clause