PhraseParser
Library of functions used to manipulate words and phrases
Table of Contents
- CONTROL_WORD_INDICATOR = ':'
- Indicates the control word for programming languages
- REGEX_INITIAL_POSITION = 1
- Indicates the control word for programming languages
- SAFE_PHRASE_THRESHOLD = 0.035
- Threshold to use for a string to be considered "safe" (not X-rated)
- TOKENIZER = 'Tokenizer'
- Constant storing the string 'Tokenizer'
- $meta_words_list : array<string|int, mixed>
- A list of meta words that might be extracted from a query
- $programming_language_map : array<string|int, mixed>
- Maps a short programming language key (for example, 'py') to the language name (for example, 'python')
- $tokenizers : mixed
- Tokenizer objects that have been loaded so far
- calculateLinkMetas() : array<string|int, mixed>
- Computes all the meta ids for a link with URL $url and anchor text $link_text that appeared on a site with URL $site_url.
- calculateMetas() : array<string|int, mixed>
- Calculates the meta words to be associated with a given downloaded document. These words (for example, server:apache) will be associated with the document in the index even if the document itself did not contain them.
- canonicalizePunctuatedTerms() : mixed
- Tries to convert acronyms, e-mail addresses, URLs, etc. into a format that does not involve punctuation, since punctuation is stripped as phrases are extracted.
- charGramTerms() : array<string|int, mixed>
- Given an array of pre_terms, returns the character n-grams for the given terms, where n is the length Yioop uses for the language in question. If a stemmer is used for the language, then n-gramming is not done and this just returns an empty array. This method differs from getCharGramsTerm in that it may check certain words and not char gram them. For example, it won't char gram URLs.
- compressSentence() : string
- Calls the sentence compression method of the appropriate tokenizer
- computeSafeSearchScore() : float
- Scores documents according to the presence or absence of sexually explicit terms. Tries to work for several languages. Very crude classifier.
- extractPhrases() : array<string|int, mixed>
- Extracts all phrases (sequences of adjacent words) from $string. Does not extract terms within those phrases. The array key indicates the position of the phrase.
- extractPhrasesAndCount() : array<string|int, mixed>
- Extracts all phrases (sequences of adjacent words) from $string. Does not extract terms within those phrases. Returns an associative array of phrase => number of occurrences of phrase.
- extractPhrasesInLists() : array<string|int, mixed>
- Extracts all phrases (sequences of adjacent words) from $string. Also extracts the terms within those phrases.
- extractTermPositions() : array<string|int, mixed>
- Extracts from $string an associative array of terms and the positions of those terms within $string
- extractTermSentencePositionsTags() : array<string|int, mixed>
- Splits a string according to punctuation and white space, then extracts the terms (stemmed or char-grammed) and makes a position list for them. Then splits the string according to sentences and makes a position list for the sentences.
- extractWordStringPageSummary() : string
- Converts a summary of a web page into a string of space separated words
- getCharGramsTerm() : array<string|int, mixed>
- Returns the character n-grams for the given terms, where n is the length Yioop uses for the language in question. If a stemmer is used for the language, then n-gramming is not done and this just returns an empty array.
- getNGramsTerm() : array<string|int, mixed>
- Returns the character n-grams for the given terms, where n is the supplied length $n.
- getTokenizer() : object
- Loads and instantiates a tokenizer object for a language, if one exists
- hyphenateEntities() : mixed
- Given a string, hyphenates words in the string which appear in a bloom filter for the given locale as phrases.
- javaTokenizer() : array<string|int, mixed>
- Tokenizes a given string into Java tokens
- oneWord() : bool
- Checks if a given word guess is a single word with respect to a word detection bloom filter and regexes
- pythonTokenizer() : array<string|int, mixed>
- Tokenizes a given string into Python tokens
- reverseMaximalMatch() : string
- Used to split a string of text in the language given by $locale into space separated words. Ex: "acontinuousstringofwords" becomes "a continuous string of words". It operates by scanning from the end of the string to the front and splitting on the longest segment that is a word.
- segmentSegment() : mixed
- Given a string to segment into words (where strings might not contain spaces), this function segments them according to the given locale's segmenter
- stemCharGramSegment() : mixed
- Given a string, splits it into terms by running any applicable segmenters, chargrammers, or stemmers of the given locale
- stemTerms() : array<string|int, mixed>
- Splits the supplied string on white space, then stems each term according to the stemmer for $lang, if one exists
- stemTermsK() : array<string|int, mixed>
- Splits the supplied string on white space, then stems each term according to the stemmer for $lang, if one exists
Constants
CONTROL_WORD_INDICATOR
Indicates the control word for programming languages
public
mixed
CONTROL_WORD_INDICATOR
= ':'
REGEX_INITIAL_POSITION
Indicates the control word for programming languages
public
mixed
REGEX_INITIAL_POSITION
= 1
SAFE_PHRASE_THRESHOLD
Threshold to use for a string to be considered "safe" (not X-rated)
public
mixed
SAFE_PHRASE_THRESHOLD
= 0.035
TOKENIZER
Constant storing the string 'Tokenizer'
public
mixed
TOKENIZER
= 'Tokenizer'
Properties
$meta_words_list
A list of meta words that might be extracted from a query
public
static array<string|int, mixed>
$meta_words_list
= ['\\-i:', '\\-index:', '\\-', 'class:', 'class-score:', 'cld:', 'code:', 'color:', 'date:', 'dns:', 'duration:', 'filetype:', 'guid:', 'hash:', 'host:', 'i:', 'info:', 'index:', 'ip:', 'link:', 'lang:', 'layout:', 'location:', 'media:', 'modified:', 'numlinks:', 'os:', 'path:', 'pubdate:', 'robot:', 'safe:', 'server:', 'site:', 'size:', 'time:', 'u:', 'version:', 'weight:', 'w:']
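The prefixes in this list can be used to tell meta words apart from ordinary query terms. The following is a minimal sketch of that idea, not PhraseParser's own code: the helper name splitMetaWords is hypothetical, the query is assumed to be whitespace separated, and only a subset of the prefixes above is passed in.

```php
<?php
// Hypothetical helper, not part of PhraseParser: separates tokens that start
// with a known meta-word prefix from ordinary search terms.
function splitMetaWords(string $query, array $meta_prefixes): array
{
    $metas = [];
    $terms = [];
    $tokens = preg_split('/\s+/', trim($query), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($tokens as $token) {
        $is_meta = false;
        foreach ($meta_prefixes as $prefix) {
            // a token is a meta word if it begins with one of the prefixes
            if (strncmp($token, $prefix, strlen($prefix)) === 0) {
                $is_meta = true;
                break;
            }
        }
        if ($is_meta) {
            $metas[] = $token;
        } else {
            $terms[] = $token;
        }
    }
    return ['metas' => $metas, 'terms' => $terms];
}

// Subset of $meta_words_list, chosen for illustration only
$result = splitMetaWords('site:example.com lang:en search engines',
    ['site:', 'lang:', 'filetype:']);
print_r($result);
```

Here the tokens site:example.com and lang:en land in 'metas', while search and engines land in 'terms'.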
$programming_language_map
Maps a short programming language key (for example, 'py') to the language name (for example, 'python')
public
static array<string|int, mixed>
$programming_language_map
= ['java' => 'java', 'py' => 'python']
$tokenizers
Tokenizer objects that have been loaded so far
public
static array
$tokenizers
= []
Methods
calculateLinkMetas()
Computes all the meta ids for a link with URL $url and anchor text $link_text that appeared on a site with URL $site_url.
public
static calculateLinkMetas(string $url, string $link_host, string $link_text, string $site_url[, array<string|int, mixed> $url_info = [] ][, array<string|int, mixed> $link_word_lists = [] ]) : array<string|int, mixed>
Parameters
- $url : string
-
url of the link
- $link_host : string
-
url of the host name of the link
- $link_text : string
-
text of the anchor tag link came from
- $site_url : string
-
url of the page link was on
- $url_info : array<string|int, mixed> = []
-
key value pairs which may have been generated as part of the page processor
- $link_word_lists : array<string|int, mixed> = []
-
list of words used in anchor text associated with this link and their positions in the anchor text
Return values
array<string|int, mixed> —meta words associated with the link
calculateMetas()
Calculates the meta words to be associated with a given downloaded document. These words (for example, server:apache) will be associated with the document in the index even if the document itself did not contain them.
public
static calculateMetas(array<string|int, mixed> &$site[, bool $with_link_metas = true ]) : array<string|int, mixed>
Parameters
- $site : array<string|int, mixed>
-
associative array containing info about a downloaded (or read from archive) document.
- $with_link_metas : bool = true
-
whether to extract link: meta tags too
Return values
array<string|int, mixed> —array of meta words to be associated with this document
canonicalizePunctuatedTerms()
Tries to convert acronyms, e-mail addresses, URLs, etc. into a format that does not involve punctuation, since punctuation is stripped as phrases are extracted.
public
static canonicalizePunctuatedTerms(string &$string[, $lang = null ]) : mixed
Parameters
- $string : string
-
a string of words, etc which might involve such terms
- $lang : = null
-
a language tag to use as part of the canonicalization process (not used right now)
Return values
mixed
charGramTerms()
Given an array of pre_terms, returns the character n-grams for the given terms, where n is the length Yioop uses for the language in question. If a stemmer is used for the language, then n-gramming is not done and this just returns an empty array. This method differs from getCharGramsTerm in that it may check certain words and not char gram them. For example, it won't char gram URLs.
public
static charGramTerms(array<string|int, mixed> $pre_terms, string $lang) : array<string|int, mixed>
Parameters
- $pre_terms : array<string|int, mixed>
-
the terms to make n-grams for
- $lang : string
-
locale tag to determine n to be used for n-gramming
Return values
array<string|int, mixed> —the n-grams for the terms in question
compressSentence()
Calls the sentence compression method of the appropriate tokenizer
public
static compressSentence(string $sentence_to_compress[, string $lang = null ]) : string
Parameters
- $sentence_to_compress : string
-
the sentence to compress
- $lang : string = null
-
locale tag for stemming
Return values
string —the compressed sentence
computeSafeSearchScore()
Scores documents according to the presence or absence of sexually explicit terms. Tries to work for several languages. Very crude classifier.
public
static computeSafeSearchScore(string $phrase[, string $url = "" ]) : float
Parameters
- $phrase : string
-
to check for X-ratedness
- $url : string = ""
-
optional URL that the phrase came from, used to check against known porn sites
Return values
float —score between 0 and 1 of how explicit the phrase is
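To show how a score between 0 and 1 might be compared against SAFE_PHRASE_THRESHOLD, here is a minimal sketch. It is not Yioop's classifier: the scoring rule (fraction of terms found in a flagged list), the function name sketchSafeScore, the placeholder flagged list, and the assumption that a score below the threshold counts as safe are all illustrative.

```php
<?php
// Value taken from the SAFE_PHRASE_THRESHOLD constant documented above
const SAFE_PHRASE_THRESHOLD = 0.035;

// Hypothetical scoring rule: the fraction of whitespace-separated terms
// that appear in a list of flagged terms. Returns a value between 0 and 1.
function sketchSafeScore(string $phrase, array $flagged_terms): float
{
    $terms = preg_split('/\s+/', strtolower(trim($phrase)), -1,
        PREG_SPLIT_NO_EMPTY);
    if (count($terms) === 0) {
        return 0.0; // nothing to score
    }
    $hits = 0;
    foreach ($terms as $term) {
        if (in_array($term, $flagged_terms, true)) {
            $hits++;
        }
    }
    return $hits / count($terms);
}

$flagged = ['flaggedword']; // placeholder list, an assumption
$score = sketchSafeScore('a harmless sentence about search engines', $flagged);
// Assumption: a phrase is treated as safe when its score is below the threshold
$is_safe = $score < SAFE_PHRASE_THRESHOLD;
var_dump($score, $is_safe);
```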
extractPhrases()
Extracts all phrases (sequences of adjacent words) from $string. Does not extract terms within those phrases. The array key indicates the position of the phrase.
public
static extractPhrases(string $string[, string $lang = null ][, string $index_name = null ][, bool $exact_match = false ][, int $threshold = CMIN_RESULTS_TO_GROUP ]) : array<string|int, mixed>
Parameters
- $string : string
-
subject to extract phrases from
- $lang : string = null
-
locale tag for stemming
- $index_name : string = null
-
name of index to be used as a reference when extracting phrases
- $exact_match : bool = false
-
whether the match has to be exact or not
- $threshold : int = CMIN_RESULTS_TO_GROUP
-
roughly, stops the extraction of more phrases once $threshold is exceeded (more than $threshold phrases might still be returned, since extraction stops only when the excess is detected)
Return values
array<string|int, mixed> —of phrases
extractPhrasesAndCount()
Extracts all phrases (sequences of adjacent words) from $string. Does not extract terms within those phrases. Returns an associative array of phrase => number of occurrences of phrase.
public
static extractPhrasesAndCount(string $string[, string $lang = null ]) : array<string|int, mixed>
Parameters
- $string : string
-
subject to extract phrases from
- $lang : string = null
-
locale tag for stemming
Return values
array<string|int, mixed> —pairs of the form (phrase, number of occurrences)
extractPhrasesInLists()
Extracts all phrases (sequences of adjacent words) from $string. Also extracts the terms within those phrases.
public
static extractPhrasesInLists(string $string[, string $lang = null ]) : array<string|int, mixed>
Parameters
- $string : string
-
subject to extract phrases from
- $lang : string = null
-
locale tag for stemming and other phrase processing related stuff
Return values
array<string|int, mixed> —word => list of positions at which the word occurred in the document
extractTermPositions()
Extracts from $string an associative array of terms and the positions of those terms within $string
public
static extractTermPositions(string $string, string $lang) : array<string|int, mixed>
Parameters
- $string : string
-
text to extract terms and their positions from
- $lang : string
-
locale of text
Return values
array<string|int, mixed> —associative array of terms and positions
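The shape of a term-to-positions array can be sketched as follows. This is a simplified stand-in, not the library's implementation: sketchTermPositions is a hypothetical name, no stemming or char-gramming is done, and positions are assumed to be 0-based word offsets (the documentation above does not say how positions are counted).

```php
<?php
// Hypothetical stand-in for illustration: lowercases the text, splits it on
// runs of characters that are neither letters nor digits, and records the
// word offset of every occurrence of each term.
function sketchTermPositions(string $string): array
{
    $terms = preg_split('/[^\p{L}\p{N}]+/u', strtolower($string), -1,
        PREG_SPLIT_NO_EMPTY);
    $positions = [];
    foreach ($terms as $index => $term) {
        // assumption: positions are 0-based word offsets
        $positions[$term][] = $index;
    }
    return $positions;
}

$positions = sketchTermPositions('To be, or not to be');
print_r($positions);
```

Each key is a term and each value lists every offset at which that term occurred, so a repeated word such as "to" gets two entries.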
extractTermSentencePositionsTags()
Splits a string according to punctuation and white space, then extracts the terms (stemmed or char-grammed) and makes a position list for them. Then splits the string according to sentences and makes a position list for the sentences.
public
static extractTermSentencePositionsTags(string $string[, string $lang = null ][, bool $extract_sentences = false ]) : array<string|int, mixed>
Parameters
- $string : string
-
to extract terms from
- $lang : string = null
-
IANA tag to look up stemmer under
- $extract_sentences : bool = false
-
whether to extract sentences to be used by question answering system
Return values
array<string|int, mixed> —of terms and n word grams in the order they appeared in string
extractWordStringPageSummary()
Converts a summary of a web page into a string of space separated words
public
static extractWordStringPageSummary(array<string|int, mixed> $page) : string
Parameters
- $page : array<string|int, mixed>
-
associative array of page summary data. Contains title, description, and links fields
Return values
string —the concatenated words extracted from the page summary
getCharGramsTerm()
Returns the character n-grams for the given terms, where n is the length Yioop uses for the language in question. If a stemmer is used for the language, then n-gramming is not done and this just returns an empty array.
public
static getCharGramsTerm(array<string|int, mixed> $terms, string $lang) : array<string|int, mixed>
Parameters
- $terms : array<string|int, mixed>
-
the terms to make n-grams for
- $lang : string
-
locale tag to determine n to be used for n-gramming
Return values
array<string|int, mixed> —the n-grams for the terms in question
getNGramsTerm()
Returns the character n-grams for the given terms, where n is the supplied length $n.
public
static getNGramsTerm(array<string|int, mixed> $terms, string $n) : array<string|int, mixed>
Parameters
- $terms : array<string|int, mixed>
-
the terms to make n-grams for
- $n : string
-
the n to use in n-gramming
Return values
array<string|int, mixed> —the n-grams for the terms in question
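Character n-gramming itself can be illustrated with a short sketch. The function below is hypothetical and only shows the general technique of sliding a window of n characters over each term. How PhraseParser treats terms shorter than n is not stated above, so keeping them whole is an assumption here.

```php
<?php
// Hypothetical illustration of character n-gramming: slides a window of
// $n characters over each term and collects every window.
function sketchCharNGrams(array $terms, int $n): array
{
    $ngrams = [];
    foreach ($terms as $term) {
        // split into UTF-8 characters rather than bytes
        $chars = preg_split('//u', $term, -1, PREG_SPLIT_NO_EMPTY);
        $length = count($chars);
        if ($length === 0) {
            continue; // skip empty terms
        }
        if ($length <= $n) {
            $ngrams[] = $term; // assumption: short terms are kept whole
            continue;
        }
        for ($i = 0; $i + $n <= $length; $i++) {
            $ngrams[] = implode('', array_slice($chars, $i, $n));
        }
    }
    return $ngrams;
}

$grams = sketchCharNGrams(['search', 'ab'], 3);
print_r($grams);
```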
getTokenizer()
Loads and instantiates a tokenizer object for a language, if one exists
public
static getTokenizer(string $lang) : object
Parameters
- $lang : string
-
IANA tag to look up stemmer under
Return values
object —tokenizer with methods to process strings for a language
hyphenateEntities()
Given a string, hyphenates words in the string which appear in a bloom filter for the given locale as phrases.
public
static hyphenateEntities(string &$string[, $lang = null ]) : mixed
Parameters
- $string : string
-
a string of words, etc which might involve such terms
- $lang : = null
-
a language tag to use as part of the canonicalization process
Return values
mixed
javaTokenizer()
Tokenizes a given string into Java tokens
public
static javaTokenizer(string $string, string $lang) : array<string|int, mixed>
Parameters
- $string : string
-
what to extract terms from
- $lang : string
-
indicates programming language
Return values
array<string|int, mixed> —the terms computed from the string
oneWord()
Checks if a given word guess is a single word with respect to a word detection bloom filter and regexes
public
static oneWord(string $word_guess, string $locale, array<string|int, mixed> $additional_regexes) : bool
Parameters
- $word_guess : string
-
word guess to be checked if a single word
- $locale : string
-
locale for which to check whether the guess is a word
- $additional_regexes : array<string|int, mixed>
-
used in checking for this locale if something should be considered a word
Return values
bool —true if a single word, false otherwise
pythonTokenizer()
Tokenizes a given string into Python tokens
public
static pythonTokenizer(string $string, string $lang) : array<string|int, mixed>
Parameters
- $string : string
-
what to extract terms from
- $lang : string
-
indicates programming language
Return values
array<string|int, mixed> —the terms computed from the string
reverseMaximalMatch()
Used to split a string of text in the language given by $locale into space separated words. Ex: "acontinuousstringofwords" becomes "a continuous string of words". It operates by scanning from the end of the string to the front and splitting on the longest segment that is a word.
public
static reverseMaximalMatch(string $segment, string $locale[, array<string|int, mixed> $additional_regexes = [] ]) : string
Parameters
- $segment : string
-
string to make into a string of space separated words
- $locale : string
-
IANA tag used to look up dictionary filter to use to do this segmenting
- $additional_regexes : array<string|int, mixed> = []
-
additional regexes for text which should be treated as a suffix
Return values
string —space separated words
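The scan from the end of the string toward the front can be sketched as below. This is an illustration of the general reverse maximal match technique, not Yioop's code: sketchReverseMaximalMatch is a hypothetical name, a plain PHP array stands in for the locale's dictionary filter, offsets are byte based (so the sketch assumes ASCII input), and falling back to a single character when no word matches is an assumption.

```php
<?php
// Hypothetical illustration of reverse maximal matching. $dictionary is a
// plain list of known words standing in for a dictionary filter.
function sketchReverseMaximalMatch(string $segment, array $dictionary): string
{
    $lookup = array_flip($dictionary); // word => index, for fast isset checks
    $words = [];
    $end = strlen($segment); // byte offsets: assumes ASCII input
    while ($end > 0) {
        // assumption: if nothing matches, split off a single character
        $match_start = $end - 1;
        for ($start = 0; $start < $end; $start++) {
            $candidate = substr($segment, $start, $end - $start);
            if (isset($lookup[$candidate])) {
                // the smallest $start gives the longest word ending at $end
                $match_start = $start;
                break;
            }
        }
        array_unshift($words,
            substr($segment, $match_start, $end - $match_start));
        $end = $match_start;
    }
    return implode(' ', $words);
}

$split = sketchReverseMaximalMatch('acontinuousstringofwords',
    ['a', 'continuous', 'string', 'of', 'words']);
echo $split, "\n";
```

With the five-word dictionary shown, the example string from the description is split back into "a continuous string of words".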
segmentSegment()
Given a string to segment into words (where strings might not contain spaces), this function segments them according to the given locale's segmenter
public
static segmentSegment(string $segment, string $lang) : mixed
Note: this method is not used when trying to extract keywords from urls. Instead, UrlParser::getWordsInHostUrl($url) is used.
Parameters
- $segment : string
-
string to split into terms
- $lang : string
-
IANA tag to look up segmenter under from some other language
Return values
mixed
stemCharGramSegment()
Given a string, splits it into terms by running any applicable segmenters, chargrammers, or stemmers of the given locale
public
static stemCharGramSegment(string $string, string $lang[, bool $to_string = false ]) : mixed
Parameters
- $string : string
-
what to extract terms from
- $lang : string
-
locale tag to determine which stemmers, chargramming and segmentation needs to be done.
- $to_string : bool = false
-
if the result should be imploded on space to a single string or left as an array of terms
Return values
mixed —either an array of the terms computed from the string or a string where this array has been imploded on space
stemTerms()
Splits the supplied string on white space, then stems each term according to the stemmer for $lang, if one exists
public
static stemTerms(mixed $string_or_array, string $lang) : array<string|int, mixed>
Parameters
- $string_or_array : mixed
-
to extract stemmed terms from
- $lang : string
-
IANA tag to look up stemmer under
Return values
array<string|int, mixed> —stemmed terms if stemmer; terms otherwise
stemTermsK()
Splits the supplied string on white space, then stems each term according to the stemmer for $lang, if one exists
public
static stemTermsK(mixed $string_or_array, string $lang, string $keep_empties) : array<string|int, mixed>
Parameters
- $string_or_array : mixed
-
to extract stemmed terms from
- $lang : string
-
IANA tag to look up stemmer under
- $keep_empties : string
-
whether to keep empty sentences or not
Return values
array<string|int, mixed> —stemmed terms if stemmer; terms otherwise