Yioop_V9.5_Source_Code

PhraseParser
in package

Application

Library of functions used to manipulate words and phrases

CONTROL_WORD_INDICATOR

Indicates the control word for programming languages


    public
        mixed
    CONTROL_WORD_INDICATOR
    = ':'

REGEX_INITIAL_POSITION

Indicates the control word for programming languages


    public
        mixed
    REGEX_INITIAL_POSITION
    = 1

SAFE_PHRASE_THRESHOLD

Threshold to use for a string to be considered "safe" (not X-rated)


    public
        mixed
    SAFE_PHRASE_THRESHOLD
    = 0.035

TOKENIZER

Constant storing the string 'Tokenizer'


    public
        mixed
    TOKENIZER
    = 'Tokenizer'

$meta_words_list

A list of meta words that might be extracted from a query


    public
    static    array<string|int, mixed>
    $meta_words_list
     = ['\\-i:', '\\-index:', '\\-', 'class:', 'class-score:', 'cld:', 'code:', 'color:', 'date:', 'dns:', 'duration:', 'filetype:', 'guid:', 'hash:', 'host:', 'i:', 'info:', 'index:', 'ip:', 'link:', 'lang:', 'layout:', 'location:', 'media:', 'modified:', 'numlinks:', 'os:', 'path:', 'pubdate:', 'robot:', 'safe:', 'server:', 'site:', 'size:', 'time:', 'u:', 'version:', 'weight:', 'w:']
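
As a rough illustration of how a list like this can be used, the sketch below separates query terms that start with a meta word from ordinary terms. The function name `splitMetaTerms` and the shortened list are hypothetical and are not part of Yioop. Entries such as '\\-' appear to be regex-escaped, so the real code probably matches with regular expressions; this sketch uses a plain prefix comparison instead.

```php
<?php
// Hypothetical helper, not part of Yioop: separates query terms that begin
// with a known meta word prefix from ordinary terms.
function splitMetaTerms(string $query, array $meta_words): array
{
    $metas = [];
    $plain = [];
    $terms = preg_split('/\s+/', trim($query), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($terms as $term) {
        $is_meta = false;
        foreach ($meta_words as $meta_word) {
            // Plain prefix comparison (an assumption of this sketch)
            if (strncmp($term, $meta_word, strlen($meta_word)) === 0) {
                $is_meta = true;
                break;
            }
        }
        if ($is_meta) {
            $metas[] = $term;
        } else {
            $plain[] = $term;
        }
    }
    return ['meta' => $metas, 'plain' => $plain];
}

// A shortened, illustrative subset of the list above
$subset = ['site:', 'lang:', 'filetype:'];
$result = splitMetaTerms('open source site:example.com lang:en', $subset);
```

Here `$result['meta']` holds `site:example.com` and `lang:en`, while `$result['plain']` holds `open` and `source`.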

$programming_language_map

Map from short codes for programming languages (for example 'py') to language names (for example 'python')


    public
    static    array<string|int, mixed>
    $programming_language_map
     = ['java' => 'java', 'py' => 'python']

$tokenizers

Tokenizer objects that have been loaded so far


    public
    static    mixed
    $tokenizers
     = []

calculateLinkMetas()

Used to compute all the meta ids for a given link with $url and $link_text that was on a site with $site_url.


    public
            static        calculateLinkMetas(string $url, string $link_host, string $link_text, string $site_url[, array<string|int, mixed> $url_info = [] ][, array<string|int, mixed> $link_word_lists = [] ]) : array<string|int, mixed>

Parameters

$url : string: url of the link
$link_host : string: url of the host name of the link
$link_text : string: text of the anchor tag the link came from
$site_url : string: url of the page the link was on
$url_info : array<string|int, mixed> = []: key value pairs which may have been generated as part of the page processor
$link_word_lists : array<string|int, mixed> = []: list of words used in anchor text associated with this link and their positions in the anchor text

Return values

array<string|int, mixed> —

meta words associated with the link

calculateMetas()

Calculates the meta words to be associated with a given downloaded document. These words (for example, server:apache) will be associated with the document in the index even if the document itself did not contain them.


    public
            static        calculateMetas(array<string|int, mixed> &$site[, bool $with_link_metas = true ]) : array<string|int, mixed>

Parameters

$site : array<string|int, mixed>: associative array containing info about a downloaded (or read from archive) document.
$with_link_metas : bool = true: whether to extract link: meta tags too

Return values

array<string|int, mixed> —

array of meta words to be associated with this document
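
The idea of attaching meta words that never occur in the page text can be sketched as follows. This is only an illustration: the function name `buildMetaWordsSketch` and the field names `server`, `lang`, and `filetype` are assumptions, and the real $site array uses Yioop's own keys.

```php
<?php
// Hypothetical sketch, not Yioop's code: turns a few document properties
// into "prefix:value" meta words.
function buildMetaWordsSketch(array $site): array
{
    // Assumed field name => meta word prefix pairs
    $fields = [
        'server' => 'server:',
        'lang' => 'lang:',
        'filetype' => 'filetype:',
    ];
    $metas = [];
    foreach ($fields as $field => $prefix) {
        if (!empty($site[$field])) {
            // Lower-case the value so that "Apache" yields server:apache
            $metas[] = $prefix . strtolower($site[$field]);
        }
    }
    return $metas;
}

$metas = buildMetaWordsSketch(['server' => 'Apache', 'lang' => 'en']);
```

With the input shown, `$metas` contains `server:apache` and `lang:en`, so a query for server:apache could match the document even though the page body never mentions it.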

canonicalizePunctuatedTerms()

This method tries to convert acronyms, e-mail addresses, urls, etc. into a format that does not involve punctuation that would be stripped as we extract phrases.


    public
            static        canonicalizePunctuatedTerms(string &$string[,  $lang = null ]) : mixed

Parameters

$string : string: a string of words, etc which might involve such terms
$lang : = null: a language tag to use as part of the canonicalization process (not used right now)

Return values

mixed —
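
A minimal sketch of this kind of canonicalization is given below. The function name `canonicalizeSketch`, the two patterns, and the placeholder tokens `_at_` and `_d_` are all assumptions chosen for illustration; the actual patterns and replacements used by Yioop may differ.

```php
<?php
// Hypothetical sketch, not Yioop's code: rewrites acronyms and e-mail
// addresses so that no punctuation is left to be stripped later.
function canonicalizeSketch(string $text): string
{
    // Acronyms such as U.S.A. lose their periods
    $text = preg_replace_callback(
        '/\b(?:[A-Za-z]\.){2,}/',
        function ($matches) {
            return str_replace('.', '', $matches[0]);
        },
        $text
    );
    // E-mail addresses: replace @ and . with placeholder tokens
    $text = preg_replace_callback(
        '/[\w.+-]+@[\w-]+(?:\.[\w-]+)+/',
        function ($matches) {
            return str_replace(['@', '.'], ['_at_', '_d_'], $matches[0]);
        },
        $text
    );
    return $text;
}

$out = canonicalizeSketch('Mail bob@example.com about the U.S.A. trip');
```

For this input, `$out` is `Mail bob_at_example_d_com about the USA trip`.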

charGramTerms()

Given an array of pre_terms, returns the character n-grams for the given terms, where n is the length Yioop uses for the language in question. If a stemmer is used for the language, then n-gramming is not done and this just returns an empty array. This method differs from getCharGramsTerm in that it may check certain words and decline to char gram them. For example, it won't char gram urls.


    public
            static        charGramTerms(array<string|int, mixed> $pre_terms, string $lang) : array<string|int, mixed>

Parameters

$pre_terms : array<string|int, mixed>: the terms to make n-grams for
$lang : string: locale tag to determine n to be used for n-gramming

Return values

array<string|int, mixed> —

the n-grams for the terms in question

compressSentence()

Calls the appropriate tokenizer's sentence compression method


    public
            static        compressSentence(string $sentence_to_compress[, string $lang = null ]) : string

Parameters

$sentence_to_compress : string: the sentence to compress
$lang : string = null: locale tag for stemming

Return values

string —

the compressed sentence

computeSafeSearchScore()

Scores documents according to the presence or absence of sexually explicit terms. Tries to work for several languages. Very crude classifier.


    public
            static        computeSafeSearchScore(string $phrase[, string $url = "" ]) : int

Parameters

$phrase : string: to check for X-ratedness
$url : string = "": optional url that the word list came from, used to check against known porn sites

Return values

int —

score between 0 and 1 of how explicit the phrase is
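
One simple way to get a score between 0 and 1 is to take the fraction of terms that appear on a list of flagged terms and compare it with a threshold such as SAFE_PHRASE_THRESHOLD. The sketch below does only that; the function name `flaggedTermFraction`, the placeholder term list, and the formula itself are assumptions, and Yioop's scoring may be computed differently.

```php
<?php
// Same value as SAFE_PHRASE_THRESHOLD above; redefined here so the sketch
// is self-contained.
const SKETCH_SAFE_THRESHOLD = 0.035;

// Hypothetical sketch, not Yioop's formula: fraction of terms in $phrase
// that appear in the $flagged list.
function flaggedTermFraction(string $phrase, array $flagged): float
{
    $terms = preg_split('/\s+/', strtolower(trim($phrase)), -1,
        PREG_SPLIT_NO_EMPTY);
    if (count($terms) === 0) {
        return 0.0;
    }
    // array_intersect keeps every term of $terms that occurs in $flagged
    $hits = count(array_intersect($terms, $flagged));
    return $hits / count($terms);
}

$flagged = ['flagged1', 'flagged2']; // placeholder terms
$score = flaggedTermFraction('aaa bbb flagged1 ccc', $flagged);
$is_safe = $score < SKETCH_SAFE_THRESHOLD;
```

One of the four terms is flagged, so `$score` is 0.25 and `$is_safe` is false under this threshold.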

extractPhrases()

Extracts all phrases (sequences of adjacent words) from $string. Does not extract terms within those phrases. The array key indicates the position of the phrase.


    public
            static        extractPhrases(string $string[, string $lang = null ][, string $index_name = null ][, bool $exact_match = false ][, int $threshold = CMIN_RESULTS_TO_GROUP ]) : array<string|int, mixed>

Parameters

$string : string: subject to extract phrases from
$lang : string = null: locale tag for stemming
$index_name : string = null: name of index to be used as a reference when extracting phrases
$exact_match : bool = false: whether the match has to be exact or not
$threshold : int = CMIN_RESULTS_TO_GROUP: roughly, stops the extraction of further phrases once $threshold is exceeded (more than $threshold phrases may still be returned, since extraction only stops when the excess is detected)

Return values

array<string|int, mixed> —

array of phrases

extractPhrasesAndCount()

Extracts all phrases (sequences of adjacent words) from $string. Does not extract terms within those phrases. Returns an associative array of phrase => number of occurrences of phrase.


    public
            static        extractPhrasesAndCount(string $string[, string $lang = null ]) : array<string|int, mixed>

Parameters

$string : string: subject to extract phrases from
$lang : string = null: locale tag for stemming

Return values

array<string|int, mixed> —

pairs of the form (phrase, number of occurrences)
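
The shape of the returned array can be illustrated with a much simpler counter. The function `countTermsSketch` below is hypothetical: it only splits on non-alphanumeric characters and counts single lower-cased terms, with no stemming, char-gramming, or multi-word phrases.

```php
<?php
// Hypothetical sketch, not Yioop's code: term => number of occurrences.
function countTermsSketch(string $text): array
{
    // Split on runs of characters that are neither letters nor digits
    $terms = preg_split('/[^\p{L}\p{N}]+/u', strtolower($text), -1,
        PREG_SPLIT_NO_EMPTY);
    return array_count_values($terms);
}

$counts = countTermsSketch('The cat saw the dog.');
```

Here `$counts['the']` is 2 and each of `cat`, `saw`, and `dog` has count 1.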

extractPhrasesInLists()

Extracts all phrases (sequences of adjacent words) from $string. Also extracts the terms within those phrases.


    public
            static        extractPhrasesInLists(string $string[, string $lang = null ]) : array<string|int, mixed>

Parameters

$string : string: subject to extract phrases from
$lang : string = null: locale tag for stemming and other phrase-related processing

Return values

array<string|int, mixed> —

word => list of positions at which the word occurred in the document
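
A word => positions structure of this kind can be sketched as below. The function `termPositionListsSketch` is hypothetical, uses zero-based positions (an assumption), and skips the stemming and phrase handling that the real method performs.

```php
<?php
// Hypothetical sketch, not Yioop's code: maps each term to the list of
// positions at which it occurs.
function termPositionListsSketch(string $text): array
{
    $terms = preg_split('/[^\p{L}\p{N}]+/u', strtolower($text), -1,
        PREG_SPLIT_NO_EMPTY);
    $lists = [];
    foreach ($terms as $position => $term) {
        // Appending creates the list the first time a term is seen
        $lists[$term][] = $position;
    }
    return $lists;
}

$lists = termPositionListsSketch('to be or not to be');
```

For this input, `to` maps to positions 0 and 4, `be` to 1 and 5, `or` to 2, and `not` to 3.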

extractTermPositions()

Extracts from $string an associative array of terms and the positions of those terms within $string


    public
            static        extractTermPositions(string $string, string $lang) : array<string|int, mixed>

Parameters

$string : string: text to extract terms and their positions from
$lang : string: locale of text

Return values

array<string|int, mixed> —

associative array of terms and positions

extractTermSentencePositionsTags()

Splits string according to punctuation and white space, then extracts the terms (as stems or char grams) and makes a position list for them. Then splits string according to sentences and makes a position list for the sentences.


    public
            static        extractTermSentencePositionsTags(string $string[, string $lang = null ][, bool $extract_sentences = false ]) : array<string|int, mixed>

Parameters

$string : string: to extract terms from
$lang : string = null: IANA tag to look up stemmer under
$extract_sentences : bool = false: whether to extract sentences to be used by question answering system

Return values

array<string|int, mixed> —

of terms and n word grams in the order they appeared in string

extractWordStringPageSummary()

Converts a summary of a web page into a string of space separated words


    public
            static        extractWordStringPageSummary(array<string|int, mixed> $page) : string

Parameters

$page : array<string|int, mixed>: associative array of page summary data. Contains title, description, and links fields

Return values

string —

the concatenated words extracted from the page summary

getCharGramsTerm()

Returns the character n-grams for the given terms, where n is the length Yioop uses for the language in question. If a stemmer is used for the language, then n-gramming is not done and this just returns an empty array.


    public
            static        getCharGramsTerm(array<string|int, mixed> $terms, string $lang) : array<string|int, mixed>

Parameters

$terms : array<string|int, mixed>: the terms to make n-grams for
$lang : string: locale tag to determine n to be used for n-gramming

Return values

array<string|int, mixed> —

the n-grams for the terms in question

getNGramsTerm()

Returns the character n-grams of length n for the given terms.


    public
            static        getNGramsTerm(array<string|int, mixed> $terms, string $n) : array<string|int, mixed>

Parameters

$terms : array<string|int, mixed>: the terms to make n-grams for
$n : string: the n to use in n-gramming

Return values

array<string|int, mixed> —

the n-grams for the terms in question
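
Character n-gramming itself can be sketched in a few lines. The function `charNGramsSketch` below is hypothetical; in particular, keeping terms of length n or less whole is an assumption of this sketch and may not match what Yioop does.

```php
<?php
// Hypothetical sketch, not Yioop's code: character n-grams of each term.
function charNGramsSketch(array $terms, int $n): array
{
    $ngrams = [];
    foreach ($terms as $term) {
        // Split into UTF-8 characters rather than bytes
        $chars = preg_split('//u', $term, -1, PREG_SPLIT_NO_EMPTY);
        $len = count($chars);
        if ($len <= $n) {
            // Assumption: terms no longer than n are kept whole
            $ngrams[] = $term;
            continue;
        }
        // Slide a window of n characters across the term
        for ($i = 0; $i <= $len - $n; $i++) {
            $ngrams[] = implode('', array_slice($chars, $i, $n));
        }
    }
    return $ngrams;
}

$grams = charNGramsSketch(['search', 'it'], 3);
```

With n = 3, `search` yields `sea`, `ear`, `arc`, `rch`, and the short term `it` is kept as is.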

getTokenizer()

Loads and instantiates a tokenizer object for a language, if one exists


    public
            static        getTokenizer(string $lang) : object

Parameters

$lang : string: IANA tag to look up stemmer under

Return values

object —

tokenizer with methods to process strings for a language
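
Loading a tokenizer once and reusing it from a static array, as the $tokenizers property suggests, might look roughly like this. All class names, the `$classes` mapping, and the toy `stem` rule are hypothetical stand-ins, not Yioop's classes.

```php
<?php
// Hypothetical stand-in for a language tokenizer class
class EnglishTokenizerSketch
{
    public function stem(string $word): string
    {
        // Toy rule for illustration only: strip one trailing "s"
        return preg_replace('/s$/', '', $word);
    }
}

// Hypothetical registry showing load-once-and-cache by language tag
class TokenizerRegistrySketch
{
    /** @var array loaded tokenizers keyed by language tag */
    public static $tokenizers = [];
    /** @var array assumed language tag => class name mapping */
    public static $classes = ['en' => 'EnglishTokenizerSketch'];

    public static function get(string $lang): ?object
    {
        if (!array_key_exists($lang, self::$tokenizers)) {
            $class = self::$classes[$lang] ?? null;
            // Cache null too, so a missing tokenizer is looked up only once
            self::$tokenizers[$lang] = $class ? new $class() : null;
        }
        return self::$tokenizers[$lang];
    }
}

$first = TokenizerRegistrySketch::get('en');
$second = TokenizerRegistrySketch::get('en');
$missing = TokenizerRegistrySketch::get('xx');
```

Both calls for `en` return the same object, and a language with no tokenizer yields null, so callers need to check the result before using it.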

hyphenateEntities()

Given a string, hyphenates words in the string which appear in a bloom filter for the given locale as phrases.


    public
            static        hyphenateEntities(string &$string[,  $lang = null ]) : mixed

Parameters

$string : string: a string of words, etc which might involve such terms
$lang : = null: a language tag to use as part of the canonicalization process

Return values

mixed —
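
The effect of hyphenating known phrases can be sketched as follows. Here a plain array stands in for the locale's bloom filter and only adjacent word pairs are checked; the function name `hyphenateKnownPhrasesSketch` and both simplifications are assumptions of this sketch.

```php
<?php
// Hypothetical sketch, not Yioop's code: joins adjacent words with a
// hyphen when the pair is a known phrase.
function hyphenateKnownPhrasesSketch(string $text, array $known_phrases): string
{
    $known = array_flip($known_phrases);
    $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    $num = count($words);
    $out = [];
    $i = 0;
    while ($i < $num) {
        $pair = ($i + 1 < $num) ?
            strtolower($words[$i] . ' ' . $words[$i + 1]) : null;
        if ($pair !== null && isset($known[$pair])) {
            // Known phrase: emit both words as one hyphenated token
            $out[] = $words[$i] . '-' . $words[$i + 1];
            $i += 2;
        } else {
            $out[] = $words[$i];
            $i++;
        }
    }
    return implode(' ', $out);
}

$result = hyphenateKnownPhrasesSketch('flights to New York today',
    ['new york']);
```

For this input, `$result` is `flights to New-York today`.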

javaTokenizer()

Given a string, tokenizes it into Java tokens


    public
            static        javaTokenizer(string $string, string $lang) : array<string|int, mixed>

Parameters

$string : string: what to extract terms from
$lang : string: indicates programming language

Return values

array<string|int, mixed> —

the terms computed from the string

oneWord()

Checks if a given word guess is a single word with respect to a word detection bloom filter and regexes


    public
            static        oneWord(string $word_guess, string $locale, array<string|int, mixed> $additional_regexes) : bool

Parameters

$word_guess : string: word guess to be checked if a single word
$locale : string: language for which to check whether the guess is a word
$additional_regexes : array<string|int, mixed>: used in checking for this locale if something should be considered a word

Return values

bool —

true if a single word, false otherwise

pythonTokenizer()

Given a string, tokenizes it into Python tokens


    public
            static        pythonTokenizer(string $string, string $lang) : array<string|int, mixed>

Parameters

$string : string: what to extract terms from
$lang : string: indicates programming language

Return values

array<string|int, mixed> —

the terms computed from the string

reverseMaximalMatch()

Used to split a string of text in the language given by $locale into space separated words. Ex: "acontinuousstringofwords" becomes "a continuous string of words". It operates by scanning from the end of the string to the front and splitting on the longest segment that is a word.


    public
            static        reverseMaximalMatch(string $segment, string $locale[, array<string|int, mixed> $additional_regexes = [] ]) : string

Parameters

$segment : string: string to make into a string of space separated words
$locale : string: IANA tag used to look up dictionary filter to use to do this segmenting
$additional_regexes : array<string|int, mixed> = []: regexes whose matches should be treated as a suffix

Return values

string —

space separated words
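
The scan-from-the-end idea described above can be sketched as below. A plain array stands in for the locale's dictionary filter, regexes are ignored, and a single character is peeled off when no dictionary word ends at the current position; the function name `reverseMaximalMatchSketch` and these simplifications are assumptions.

```php
<?php
// Hypothetical sketch, not Yioop's code: segments text by repeatedly
// taking the longest dictionary word that ends at the current position.
function reverseMaximalMatchSketch(string $segment, array $dictionary): string
{
    $known = array_flip($dictionary);
    $chars = preg_split('//u', $segment, -1, PREG_SPLIT_NO_EMPTY);
    $end = count($chars);
    $words = [];
    while ($end > 0) {
        // Fallback: a single character if no dictionary word ends here
        $start = $end - 1;
        // Try the longest candidate first, shrinking it from the left
        for ($i = 0; $i < $end - 1; $i++) {
            $candidate = implode('', array_slice($chars, $i, $end - $i));
            if (isset($known[$candidate])) {
                $start = $i;
                break;
            }
        }
        $word = implode('', array_slice($chars, $start, $end - $start));
        // Words are found back to front, so prepend each one
        array_unshift($words, $word);
        $end = $start;
    }
    return implode(' ', $words);
}

$dictionary = ['a', 'continuous', 'string', 'of', 'words'];
$out = reverseMaximalMatchSketch('acontinuousstringofwords', $dictionary);
```

With this dictionary, `$out` is `a continuous string of words`, matching the example in the description.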

segmentSegment()

Given a string to segment into words (where the string might not contain spaces), this function segments it according to the given locale's segmenter


    public
            static        segmentSegment(string $segment, string $lang) : mixed

Note: this method is not used when trying to extract keywords from urls. Instead, UrlParser::getWordsInHostUrl($url) is used.

Parameters

$segment : string: string to split into terms
$lang : string: IANA tag to look up segmenter under from some other language

Return values

mixed —

stemCharGramSegment()

Given a string, splits it into terms by running any applicable segmenters, chargrammers, or stemmers of the given locale


    public
            static        stemCharGramSegment(string $string, string $lang[, bool $to_string = false ]) : mixed

Parameters

$string : string: what to extract terms from
$lang : string: locale tag to determine which stemming, chargramming, and segmentation need to be done.
$to_string : bool = false: if the result should be imploded on space to a single string or left as an array of terms

Return values

mixed —

either an array of the terms computed from the string or a string where this array has been imploded on space

stemTerms()

Splits the supplied string on white space, then stems each term according to the stemmer for $lang, if one exists


    public
            static        stemTerms(mixed $string_or_array, string $lang) : array<string|int, mixed>

Parameters

$string_or_array : mixed: to extract stemmed terms from
$lang : string: IANA tag to look up stemmer under

Return values

array<string|int, mixed> —

stemmed terms if stemmer; terms otherwise

stemTermsK()

Splits the supplied string on white space, then stems each term according to the stemmer for $lang, if one exists


    public
            static        stemTermsK(mixed $string_or_array, string $lang, string $keep_empties) : array<string|int, mixed>

Parameters

$string_or_array : mixed: to extract stemmed terms from
$lang : string: IANA tag to look up stemmer under
$keep_empties : string: whether to keep empty sentences or not

Return values

array<string|int, mixed> —

stemmed terms if stemmer; terms otherwise
