

Library of functions used to manipulate words and phrases


Chris Pollett

Table of Contents

CONTROL_WORD_INDICATOR  = ':'
Indicates the control word for programming languages
Indicates the control word for programming languages
SAFE_PHRASE_THRESHOLD  = 0.035
Threshold to use for a string to be considered "safe" (not X-rated)
TOKENIZER  = 'Tokenizer'
Constant storing the string 'Tokenizer'
$meta_words_list  : array<string|int, mixed>
A list of meta words that might be extracted from a query
$programming_language_map  : array<string|int, mixed>
A map from a short language key (for example, 'py') to the name of the programming language it denotes (for example, 'python')
$tokenizers  : mixed
Tokenizer objects that have been loaded so far
calculateLinkMetas()  : array<string|int, mixed>
Used to compute all the meta ids for a given link with $url and $link_text that was on a site with $site_url.
calculateMetas()  : array<string|int, mixed>
Calculates the meta words to be associated with a given downloaded document. These words (for example, server:apache) will be associated with the document in the index even if the document itself did not contain them.
canonicalizePunctuatedTerms()  : mixed
This method tries to convert acronyms, e-mail addresses, URLs, etc. into a format that does not involve punctuation that would be stripped as we extract phrases.
charGramTerms()  : array<string|int, mixed>
Given an array of pre_terms, returns the character n-grams for the given terms, where n is the length Yioop uses for the language in question. If a stemmer is used for the language, then n-gramming is not done and this just returns an empty array. This method differs from getCharGramsTerm in that it may check certain words and not char gram them. For example, it won't char gram URLs.
compressSentence()  : string
Call the appropriate tokenizer sentence compression method
computeSafeSearchScore()  : int
Scores documents according to the presence or absence of sexually explicit terms. Tries to work for several languages. Very crude classifier.
extractPhrases()  : array<string|int, mixed>
Extracts all phrases (sequences of adjacent words) from $string. Does not extract terms within those phrases. The array key indicates the position of the phrase.
extractPhrasesAndCount()  : array<string|int, mixed>
Extracts all phrases (sequences of adjacent words) from $string. Does not extract terms within those phrases. Returns an associative array of phrase => number of occurrences of phrase.
extractPhrasesInLists()  : array<string|int, mixed>
Extracts all phrases (sequences of adjacent words) from $string. Does extract terms within those phrases.
extractTermPositions()  : array<string|int, mixed>
Extracts from a $string an associative array of terms and position within $string of those terms
extractTermSentencePositionsTags()  : array<string|int, mixed>
Splits the string according to punctuation and white space, then extracts terms (stemmed or char grammed) and records their positions. Then splits the string into sentences and makes a position list for the sentences.
extractWordStringPageSummary()  : string
Converts a summary of a web page into a string of space separated words
getCharGramsTerm()  : array<string|int, mixed>
Returns the character n-grams for the given terms, where n is the length Yioop uses for the language in question. If a stemmer is used for the language, then n-gramming is not done and this just returns an empty array.
getNGramsTerm()  : array<string|int, mixed>
Returns the character n-grams for the given terms, where n is the given length.
getTokenizer()  : object
Loads and instantiates a tokenizer object for a language, if one exists
hyphenateEntities()  : mixed
Given a string, hyphenates words in the string which appear in a bloom filter for the given locale as phrases.
javaTokenizer()  : array<string|int, mixed>
Given a string tokenizes into Java tokens
oneWord()  : bool
Checks if a given word guess is a single word with respect to a word detection bloom filter and regexes
pythonTokenizer()  : array<string|int, mixed>
Given a string tokenizes into Python tokens
reverseMaximalMatch()  : string
Used to split a string of text in the language given by $locale into space separated words. Ex: "acontinuousstringofwords" becomes "a continuous string of words". It operates by scanning from the end of the string to the front and splitting on the longest segment that is a word.
segmentSegment()  : mixed
Given a string to segment into words (where strings might not contain spaces), this function segments them according to the given locale's segmenter
stemCharGramSegment()  : mixed
Given a string splits it into terms by running any applicable segmenters, chargrammers, or stemmers of the given locale
stemTerms()  : array<string|int, mixed>
Splits the supplied string based on white space, then stems each term according to the stemmer for $lang, if one exists
stemTermsK()  : array<string|int, mixed>
Splits the supplied string based on white space, then stems each term according to the stemmer for $lang, if one exists



Indicates the control word for programming languages

public mixed CONTROL_WORD_INDICATOR = ':'


Indicates the control word for programming languages



Threshold to use for a string to be considered "safe" (not X-rated)

public mixed SAFE_PHRASE_THRESHOLD = 0.035


Constant storing the string 'Tokenizer'

public mixed TOKENIZER = 'Tokenizer'



A list of meta words that might be extracted from a query

public static array<string|int, mixed> $meta_words_list = ['\\-i:', '\\-index:', '\\-', 'class:', 'class-score:', 'cld:', 'code:', 'color:', 'date:', 'dns:', 'duration:', 'filetype:', 'guid:', 'hash:', 'host:', 'i:', 'info:', 'index:', 'ip:', 'link:', 'lang:', 'layout:', 'location:', 'media:', 'modified:', 'numlinks:', 'os:', 'path:', 'pubdate:', 'robot:', 'safe:', 'server:', 'site:', 'size:', 'time:', 'u:', 'version:', 'weight:', 'w:']
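
To illustrate what "meta words that might be extracted from a query" could mean in practice, here is a minimal Python sketch. It is not Yioop's PHP code: the function name split_query is hypothetical, only a few of the prefixes from $meta_words_list are used, and matching by simple prefix test is an assumption.

```python
# A few prefixes taken from $meta_words_list; the full list is longer.
META_PREFIXES = ("site:", "lang:", "filetype:", "date:", "server:")


def split_query(query):
    """Separate meta words (tokens with a known prefix) from plain terms.

    Hypothetical helper for illustration only.
    """
    metas, terms = [], []
    for token in query.split():
        # str.startswith accepts a tuple of candidate prefixes
        if token.startswith(META_PREFIXES):
            metas.append(token)
        else:
            terms.append(token)
    return metas, terms
```

For example, split_query("yioop search lang:en site:example.com") separates the two meta words from the two ordinary terms.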


A map from a short language key (for example, 'py') to the name of the programming language it denotes (for example, 'python')

public static array<string|int, mixed> $programming_language_map = ['java' => 'java', 'py' => 'python']


Tokenizer objects that have been loaded so far

public static mixed $tokenizers = []

@var array



Used to compute all the meta ids for a given link with $url and $link_text that was on a site with $site_url.

public static calculateLinkMetas(string $url, string $link_host, string $link_text, string $site_url[, array<string|int, mixed> $url_info = [] ][, array<string|int, mixed> $link_word_lists = [] ]) : array<string|int, mixed>
$url : string

url of the link

$link_host : string

url of the host name of the link

$link_text : string

text of the anchor tag link came from

$site_url : string

url of the page link was on

$url_info : array<string|int, mixed> = []

key value pairs which may have been generated as part of the page processor

$link_word_lists : array<string|int, mixed> = []

list of words used in anchor text associated with this link and their positions in the anchor text

Return values
array<string|int, mixed>

meta words associated with the link


Calculates the meta words to be associated with a given downloaded document. These words (for example, server:apache) will be associated with the document in the index even if the document itself did not contain them.

public static calculateMetas(array<string|int, mixed> &$site[, bool $with_link_metas = true ]) : array<string|int, mixed>
$site : array<string|int, mixed>

associative array containing info about a downloaded (or read from archive) document.

$with_link_metas : bool = true

whether to extract link: meta tags too

Return values
array<string|int, mixed>

array of meta words to be associated with this document


This method tries to convert acronyms, e-mail addresses, URLs, etc. into a format that does not involve punctuation that would be stripped as we extract phrases.

public static canonicalizePunctuatedTerms(string &$string[,  $lang = null ]) : mixed
$string : string

a string of words, etc which might involve such terms

$lang : = null

a language tag to use as part of the canonicalization process (not used right now)

Return values


Given an array of pre_terms, returns the character n-grams for the given terms, where n is the length Yioop uses for the language in question. If a stemmer is used for the language, then n-gramming is not done and this just returns an empty array. This method differs from getCharGramsTerm in that it may check certain words and not char gram them. For example, it won't char gram URLs.

public static charGramTerms(array<string|int, mixed> $pre_terms, string $lang) : array<string|int, mixed>
$pre_terms : array<string|int, mixed>

the terms to make n-grams for

$lang : string

locale tag to determine n to be used for n-gramming

Return values
array<string|int, mixed>

the n-grams for the terms in question
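
The behavior described above can be sketched in Python as follows. This is an illustration under stated assumptions, not Yioop's implementation: the parameters n and has_stemmer stand in for the per-language lookup done via $lang, the URL test is a simple guess, and keeping URLs and short terms whole (rather than dropping them) is also an assumption.

```python
import re

# Assumed, simplistic test for URL-like terms.
URL_LIKE = re.compile(r"^(https?://|www\.)", re.IGNORECASE)


def char_gram_terms(pre_terms, n, has_stemmer=False):
    """Return character n-grams of the terms, leaving URL-like terms whole."""
    if has_stemmer:
        # Per the description: no n-gramming when the language has a stemmer.
        return []
    grams = []
    for term in pre_terms:
        if URL_LIKE.match(term) or len(term) <= n:
            grams.append(term)  # assumption: kept as a single term
        else:
            # sliding window of width n over the term
            grams.extend(term[i:i + n] for i in range(len(term) - n + 1))
    return grams
```

With n = 4, the term "index" yields "inde" and "ndex", while "http://a.example/x" is passed through untouched.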


Call the appropriate tokenizer sentence compression method

public static compressSentence(string $sentence_to_compress[, string $lang = null ]) : string
$sentence_to_compress : string

the sentence to compress

$lang : string = null

locale tag for stemming

Return values

compressed sentence


Scores documents according to the presence or absence of sexually explicit terms. Tries to work for several languages. Very crude classifier.

public static computeSafeSearchScore(string $phrase[, string $url = "" ]) : int
$phrase : string

to check for X-ratedness

$url : string = ""

optional URL that the phrase came from, used to check against known porn sites

Return values

score between 0 and 1 of how explicit the phrase is
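
One plausible way to obtain a score between 0 and 1 and compare it with SAFE_PHRASE_THRESHOLD is sketched below in Python. How Yioop actually computes the score is not stated here, so the fraction-of-flagged-words formula, the function names, and the rule "safe when the score is below the threshold" are all assumptions.

```python
SAFE_PHRASE_THRESHOLD = 0.035  # value of the constant documented above


def explicit_score(phrase, flagged_terms):
    """Fraction of words in the phrase that appear in flagged_terms.

    Assumed scoring rule, for illustration only.
    """
    words = phrase.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in flagged_terms)
    return hits / len(words)


def is_probably_safe(phrase, flagged_terms):
    """Assumption: a phrase counts as safe when its score is below the threshold."""
    return explicit_score(phrase, flagged_terms) < SAFE_PHRASE_THRESHOLD
```

Under this rule a threshold of 0.035 tolerates roughly one flagged word in every 29 words of text.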


Extracts all phrases (sequences of adjacent words) from $string. Does not extract terms within those phrases. The array key indicates the position of the phrase.

public static extractPhrases(string $string[, string $lang = null ][, string $index_name = null ][, bool $exact_match = false ][, int $threshold = CMIN_RESULTS_TO_GROUP ]) : array<string|int, mixed>
$string : string

subject to extract phrases from

$lang : string = null

locale tag for stemming

$index_name : string = null

name of index to be used as a reference when extracting phrases

$exact_match : bool = false

whether the match has to be exact or not

$threshold : int = CMIN_RESULTS_TO_GROUP

roughly, extraction of further phrases stops once the number found exceeds $threshold (more than $threshold phrases may still be returned, since extraction stops only when the excess is detected)

Return values
array<string|int, mixed>

of phrases


Extracts all phrases (sequences of adjacent words) from $string. Does not extract terms within those phrases. Returns an associative array of phrase => number of occurrences of phrase.

public static extractPhrasesAndCount(string $string[, string $lang = null ]) : array<string|int, mixed>
$string : string

subject to extract phrases from

$lang : string = null

locale tag for stemming

Return values
array<string|int, mixed>

pairs of the form (phrase, number of occurrences)
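
The shape of the returned value (phrase => number of occurrences) can be illustrated with a short Python sketch. It assumes lowercase, single-word "phrases" split on non-word characters and does no stemming or char gramming, so it only mirrors the output format, not Yioop's processing.

```python
import re
from collections import Counter


def phrases_and_count(text):
    """Map each lowercase word in text to its number of occurrences."""
    # \w+ tokenization is an assumption made for this sketch
    return dict(Counter(re.findall(r"\w+", text.lower())))
```

For "the cat and the hat" this gives a count of 2 for "the" and 1 for each other word.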


Extracts all phrases (sequences of adjacent words) from $string. Does extract terms within those phrases.

public static extractPhrasesInLists(string $string[, string $lang = null ]) : array<string|int, mixed>
$string : string

subject to extract phrases from

$lang : string = null

locale tag for stemming and other phrase processing related stuff

Return values
array<string|int, mixed>

word => list of positions at which the word occurred in the document


Extracts from a $string an associative array of terms and position within $string of those terms

public static extractTermPositions(string $string, string $lang) : array<string|int, mixed>
$string : string

text to extract terms and their positions from

$lang : string

locale of text

Return values
array<string|int, mixed>

associative array of terms and positions
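
As a rough picture of the returned structure, the following Python sketch builds a term => positions map. Positions here are 0-based word offsets and tokenization is a plain \w+ split; both are assumptions, and the real method also applies locale-specific processing that is omitted.

```python
import re


def term_positions(text):
    """Map each lowercase term to the list of word offsets where it occurs."""
    positions = {}
    for offset, term in enumerate(re.findall(r"\w+", text.lower())):
        positions.setdefault(term, []).append(offset)
    return positions
```

For "to be or not to be", the term "to" maps to offsets 0 and 4 and "be" to offsets 1 and 5.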


Splits the string according to punctuation and white space, then extracts terms (stemmed or char grammed) and records their positions. Then splits the string into sentences and makes a position list for the sentences.

public static extractTermSentencePositionsTags(string $string[, string $lang = null ][, bool $extract_sentences = false ]) : array<string|int, mixed>
$string : string

to extract terms from

$lang : string = null

IANA tag to look up stemmer under

$extract_sentences : bool = false

whether to extract sentences to be used by question answering system

Return values
array<string|int, mixed>

of terms and n word grams in the order they appeared in string


Converts a summary of a web page into a string of space separated words

public static extractWordStringPageSummary(array<string|int, mixed> $page) : string
$page : array<string|int, mixed>

associative array of page summary data. Contains title, description, and links fields

Return values

the concatenated words extracted from the page summary


Returns the character n-grams for the given terms, where n is the length Yioop uses for the language in question. If a stemmer is used for the language, then n-gramming is not done and this just returns an empty array.

public static getCharGramsTerm(array<string|int, mixed> $terms, string $lang) : array<string|int, mixed>
$terms : array<string|int, mixed>

the terms to make n-grams for

$lang : string

locale tag to determine n to be used for n-gramming

Return values
array<string|int, mixed>

the n-grams for the terms in question


Returns the character n-grams for the given terms, where n is the given length.

public static getNGramsTerm(array<string|int, mixed> $terms, string $n) : array<string|int, mixed>
$terms : array<string|int, mixed>

the terms to make n-grams for

$n : string

the n to use in n-gramming

Return values
array<string|int, mixed>

the n-grams for the terms in question


Loads and instantiates a tokenizer object for a language, if one exists

public static getTokenizer(string $lang) : object
$lang : string

IANA tag to look up tokenizer under

Return values

tokenizer with methods to process strings for a language


Given a string, hyphenates words in the string which appear in a bloom filter for the given locale as phrases.

public static hyphenateEntities(string &$string[,  $lang = null ]) : mixed
$string : string

a string of words, etc which might involve such terms

$lang : = null

a language tag to use as part of the canonicalization process

Return values
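
The idea of hyphenating known multi-word entities can be sketched in Python as below. A plain set stands in for the locale's bloom filter, and the longest-match-first scan, the max_words limit, and the function name are assumptions made for illustration. Unlike the PHP method, which takes $string by reference, this sketch returns a new string.

```python
def hyphenate_entities(text, known_phrases, max_words=3):
    """Join runs of words that form a known phrase with hyphens."""
    words = text.split()
    out = []
    i = 0
    while i < len(words):
        # try the longest candidate first, down to two words
        for size in range(min(max_words, len(words) - i), 1, -1):
            candidate = " ".join(words[i:i + size])
            if candidate.lower() in known_phrases:
                out.append("-".join(words[i:i + size]))
                i += size
                break
        else:
            # no known phrase starts here; keep the single word
            out.append(words[i])
            i += 1
    return " ".join(out)
```

With the phrases "new york" and "new york city" both known, the longer match wins, so "visit new york city today" becomes "visit new-york-city today".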


Given a string tokenizes into Java tokens

public static javaTokenizer(string $string, string $lang) : array<string|int, mixed>
$string : string

what to extract terms from

$lang : string

indicates programming language

Return values
array<string|int, mixed>

the terms computed from the string


Checks if a given word guess is a single word with respect to a word detection bloom filter and regexes

public static oneWord(string $word_guess, string $locale, array<string|int, mixed> $additional_regexes) : bool
$word_guess : string

word guess to be checked if a single word

$locale : string

language to check if is word for

$additional_regexes : array<string|int, mixed>

used in checking for this locale if something should be considered a word

Return values

true if a single word false otherwise


Given a string tokenizes into Python tokens

public static pythonTokenizer(string $string, string $lang) : array<string|int, mixed>
$string : string

what to extract terms from

$lang : string

indicates programming language

Return values
array<string|int, mixed>

the terms computed from the string
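
For comparison, Python's standard library can tokenize Python source itself. The sketch below uses the tokenize module and drops layout tokens; it shows what "Python tokens" look like but is not the algorithm used by pythonTokenizer, and the function name is hypothetical.

```python
import io
import tokenize

# Token types that carry layout rather than content.
LAYOUT_TOKENS = {
    tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
    tokenize.DEDENT, tokenize.ENDMARKER, tokenize.COMMENT,
}


def python_tokens(source):
    """Return the token strings of Python source, skipping layout tokens."""
    reader = io.StringIO(source).readline
    return [tok.string for tok in tokenize.generate_tokens(reader)
            if tok.type not in LAYOUT_TOKENS]
```

For the source line "x = foo(1)" this yields the tokens x, =, foo, (, 1 and ).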


Used to split a string of text in the language given by $locale into space separated words. Ex: "acontinuousstringofwords" becomes "a continuous string of words". It operates by scanning from the end of the string to the front and splitting on the longest segment that is a word.

public static reverseMaximalMatch(string $segment, string $locale[, array<string|int, mixed> $additional_regexes = [] ]) : string
$segment : string

string to make into a string of space separated words

$locale : string

IANA tag used to look up dictionary filter to use to do this segmenting

$additional_regexes : array<string|int, mixed> = []

additional regexes for strings which should be treated as a suffix

Return values

space separated words
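
The scan described above (from the end of the string toward the front, taking the longest segment that is a word) can be sketched in Python as follows. A set of words stands in for the locale's dictionary filter, $additional_regexes is not modeled, and falling back to a single character when nothing matches is an assumption.

```python
def reverse_maximal_match(segment, dictionary):
    """Split segment into space separated words, scanning right to left."""
    words = []
    end = len(segment)
    while end > 0:
        start = 0
        # shrink the candidate from the left until it is a known word,
        # or until only one character remains (assumed fallback)
        while start < end - 1 and segment[start:end] not in dictionary:
            start += 1
        words.append(segment[start:end])
        end = start
    return " ".join(reversed(words))
```

Using the dictionary {"a", "continuous", "string", "of", "words"}, the input "acontinuousstringofwords" is split into "a continuous string of words", matching the example in the description.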


Given a string to segment into words (where strings might not contain spaces), this function segments them according to the given locale's segmenter

public static segmentSegment(string $segment, string $lang) : mixed

Note: this method is not used when trying to extract keywords from urls. Instead, UrlParser::getWordsInHostUrl($url) is used.

$segment : string

string to split into terms

$lang : string

IANA tag to look up segmenter under from some other language

Return values


Given a string splits it into terms by running any applicable segmenters, chargrammers, or stemmers of the given locale

public static stemCharGramSegment(string $string, string $lang[, bool $to_string = false ]) : mixed
$string : string

what to extract terms from

$lang : string

locale tag to determine which stemmers, chargramming and segmentation needs to be done.

$to_string : bool = false

if the result should be imploded on space to a single string or left as an array of terms

Return values

either an array of the terms computed from the string or a string where this array has been imploded on space


Splits the supplied string based on white space, then stems each term according to the stemmer for $lang, if one exists

public static stemTerms(mixed $string_or_array, string $lang) : array<string|int, mixed>
$string_or_array : mixed

to extract stemmed terms from

$lang : string

IANA tag to look up stemmer under

Return values
array<string|int, mixed>

stemmed terms if stemmer; terms otherwise
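
The "stemmed terms if stemmer; terms otherwise" contract can be sketched in Python as below. The toy suffix stripper is deliberately crude and is not one of Yioop's stemmers; passing the stemmer as a function, rather than looking it up from $lang, is also an assumption of this sketch.

```python
SUFFIXES = ("ing", "ed", "s")  # toy rule set, for illustration only


def toy_stem(word):
    """Strip one common English suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def stem_terms(string_or_list, stemmer=None):
    """Split on white space (if given a string) and stem each term."""
    if isinstance(string_or_list, str):
        terms = string_or_list.split()
    else:
        terms = list(string_or_list)
    if stemmer is None:
        return terms  # no stemmer for the language: terms are returned as-is
    return [stemmer(term) for term in terms]
```

With toy_stem, "jumps jumping jumped" becomes three copies of "jump"; without a stemmer the three words come back unchanged.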


Splits the supplied string based on white space, then stems each term according to the stemmer for $lang, if one exists

public static stemTermsK(mixed $string_or_array, string $lang, string $keep_empties) : array<string|int, mixed>
$string_or_array : mixed

to extract stemmed terms from

$lang : string

IANA tag to look up stemmer under

$keep_empties : string

whether to keep empty sentences or not

Return values
array<string|int, mixed>

stemmed terms if stemmer; terms otherwise

