Yioop_V9.5_Source_Code_Documentation

Tokenizer

This class has a collection of methods for English locale-specific tokenization. In particular, it has a stemmer, a stop word remover (for use mainly in word cloud creation), and a part-of-speech tagger (for question answering). The stemmer is my stab at implementing the Porter Stemmer algorithm presented at http://tartarus.org/~martin/PorterStemmer/def.txt. The code is based on the non-thread-safe C version given by Martin Porter. Since PHP is single-threaded, this should be okay.

Here, given a word, its stem is that part of the word that is common to all its inflected variants. For example, tall is common to tall, taller, tallest. A stemmer takes a word and tries to produce its stem.

Tags
author

Chris Pollett

Table of Contents

$adjective_type  : mixed
List of adjective-like parts of speech that might appear in lexicon file
$adverb_type  : mixed
List of adverb-like parts of speech that might appear in lexicon file
$conjunction_type  : mixed
List of conjunction-like parts of speech that might appear in lexicon file
$determiner_type  : mixed
List of determiner-like parts of speech that might appear in lexicon file
$no_stem_list  : array<string|int, mixed>
Words we don't want to be stemmed
$noun_type  : mixed
List of noun-like parts of speech that might appear in lexicon file
$question_token  : mixed
Any unique identifier corresponding to the component of a triplet which can be answered using a question answer list
$semantic_rewrites  : array<string|int, mixed>
Phrases we would like yioop to rewrite before performing a query
$stop_words  : mixed
A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries
$verb_type  : mixed
List of verb-like parts of speech that might appear in lexicon file
$buffer  : string
storage used in computing the stem
$j  : int
Index to start of the suffix of the word being considered for manipulation
$k  : int
Index of the current end of the word at the current state of computing its stem
__construct()  : mixed
Do any global set up for tokenizer (none in the case of en-US)
canonicalizePunctuatedTerms()  : mixed
This method tries to handle punctuation in terms specific to the English language, such as abbreviations.
compressSentence()  : string
Take in a sentence and try to compress it to a smaller version that "retains the most important information and remains grammatically correct" (Jing 2000).
extractDeepestSpeechPartPhrase()  : string
Takes phrase tree $tree and a part-of-speech $pos returns the deepest $pos only path in tree.
extractObjectParseTree()  : array<string|int, mixed>
Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW, the former holding the object of the original phrase (as a string) and the latter holding the important parts of the object.
extractPredicateParseTree()  : array<string|int, mixed>
Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW, the former holding the predicate of the original phrase (as a string) and the latter holding the important parts of the predicate.
extractSubjectParseTree()  : array<string|int, mixed>
Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW, the former holding the subject of the original phrase (as a string) and the latter holding the important parts of the subject.
extractTripletByType()  : array<string|int, mixed>
Takes a triplets array with subject, predicate, object fields with CONCISE, RAW subfields and produces a triplets array with a $type subfield (where $type is one of CONCISE and RAW) and with subject, predicate, object, and QUESTION_ANSWER_LIST subfields.
extractTripletsParseTree()  : array<string|int, mixed>
Takes a parse tree of a phrase and computes subject, predicate, and object arrays. Each of these arrays consists of two components, CONCISE and RAW, CONCISE corresponding to something more similar to the words in the original phrase and RAW to the case where extraneous words have been removed.
extractTripletsPhrases()  : array<string|int, mixed>
Scans a word list for phrases. For phrases found, generates a list of question and answer pairs at two levels of granularity: CONCISE (using all terms in the original phrase) and RAW (removing adjectives, etc.).
isQuestion()  : bool
Takes a phrase query entered by a user and returns true if it is a question and false if not.
parseAdjective()  : array<string|int, mixed>
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for an adjective if possible
parseAuxClause()  : array<string|int, mixed>
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for an auxiliary clause if possible
parseDeterminer()  : array<string|int, mixed>
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a determiner if possible
parseNoun()  : array<string|int, mixed>
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a noun if possible
parseNounPhrase()  : array<string|int, mixed>
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a noun phrase if possible
parsePrepositionalPhrases()  : array<string|int, mixed>
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a sequence of prepositional phrases if possible
parseTypeList()  : string
Starting at the $cur_node in a $tagged_phrase parse tree for an English sentence, create a phrase string for each of the next nodes which belong to part of speech group $type.
parseVerb()  : array<string|int, mixed>
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a verb if possible
parseVerbPhrase()  : array<string|int, mixed>
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a verb phrase if possible
parseWholePhrase()  : array<string|int, mixed>
Given a part-of-speech tagged phrase array, generates a parse tree for the phrase using a recursive descent parser.
parseWhoQuestion()  : array<string|int, mixed>
Takes a tagged question string that starts with Who and returns the question triplet from the question string.
parseWHPlusQuestion()  : array<string|int, mixed>
Takes a tagged question string that starts with a Wh+ word other than Who and returns the question triplet from the question string. Unlike the WHO case, here we assume there is an auxiliary verb followed by a noun phrase, then the rest of the verb phrase. For example, Where is soccer played?
questionParser()  : array<string|int, mixed>
Takes any question that starts with a WH question word and returns the triplet from the question.
rearrangeTripletsByType()  : array<string|int, mixed>
Takes a triplets array with subject, predicate, object fields with CONCISE and RAW subfields and rearranges it to have two fields CONCISE and RAW with subject, predicate, object, and QUESTION_ANSWER_LIST subfields
segment()  : string
Stub function which could be used for a word segmenter.
stem()  : string
Computes the stem of an English word
stopwordsRemover()  : mixed
Removes the stop words from the page (used for Word Cloud generation)
tagPartsOfSpeechPhrase()  : string
Takes a phrase and tags each term in it with its part of speech.
tagTokenizePartOfSpeech()  : array<string|int, mixed>
Splits input text into terms and outputs an array with one element per term, that element consisting of an array with the term token and the part-of-speech tag.
cons()  : bool
Checks to see if the ith character in the buffer is a consonant
cvc()  : bool
Checks whether the letters at the indices $i-2, $i-1, $i in the buffer have the form consonant - vowel - consonant and also if the second c is not w, x, or y. This is used when trying to restore an e at the end of a short word. e.g.
doublec()  : bool
Checks if $j,($j-1) contain a double consonant.
ends()  : bool
Checks if the buffer currently ends with the string $s
m()  : mixed
m() measures the number of consonant sequences between 0 and j. If c is a consonant sequence and v a vowel sequence, and [.] indicates arbitrary presence, then [c][v] gives 0, [c]vc[v] gives 1, [c]vcvc[v] gives 2, [c]vcvcvc[v] gives 3, and so on.
r()  : mixed
Sets the ending in the buffer to $s if the number of consonant sequences between 0 and $j, as measured by m(), is positive.
setto()  : mixed
setto($s) sets (j+1),...k to the characters in the string $s, readjusting k.
stemPhrase()  : string
Given an English phrase produces a phrase where each of the terms has been stemmed
step1ab()  : mixed
step1ab() gets rid of plurals and -ed or -ing. e.g.
step1c()  : mixed
step1c() turns terminal y to i when there is another vowel in the stem.
step2()  : mixed
step2() maps double suffixes to single ones. So -ization ( = -ize plus -ation) maps to -ize etc. Note that the string before the suffix must give m() > 0.
step3()  : mixed
step3() deals with -ic-, -full, -ness etc. It uses a similar strategy to step2.
step4()  : mixed
step4() takes off -ant, -ence etc., in context <c>vcvc<v>.
step5()  : mixed
step5() removes a final -e if m() > 1, and changes -ll to -l if m() > 1.
taggedPartOfSpeechTokensToString()  : string
Takes an array of pairs (token, tag) that came from a phrase and builds a new phrase where terms look like token~tag.
vowelinstem()  : bool
Checks if 0,...$j contains a vowel

Properties

$adjective_type

List of adjective-like parts of speech that might appear in lexicon file

public static mixed $adjective_type = ["JJ", "JJR", "JJS"]
Tags
array

$adverb_type

List of adverb-like parts of speech that might appear in lexicon file

public static mixed $adverb_type = ["RB", "RBR", "RBS"]
Tags
array

$conjunction_type

List of conjunction-like parts of speech that might appear in lexicon file

public static mixed $conjunction_type = ["CC"]
Tags
array

$determiner_type

List of determiner-like parts of speech that might appear in lexicon file

public static mixed $determiner_type = ["DT", "PDT"]
Tags
array

$no_stem_list

Words we don't want to be stemmed

public static array<string|int, mixed> $no_stem_list = ["titanic", "programming", "fishing", 'ins', "blues", "factorial", "pbs"]

$noun_type

List of noun-like parts of speech that might appear in lexicon file

public static mixed $noun_type = ["NN", "NNS", "NNP", "NNPS", "PRP"]
Tags
array

$question_token

Any unique identifier corresponding to the component of a triplet which can be answered using a question answer list

public static mixed $question_token = "qqq"
Tags
string

$semantic_rewrites

Phrases we would like yioop to rewrite before performing a query

public static array<string|int, mixed> $semantic_rewrites = ["ins" => 'uscis', "mimetype" => 'mime', "military" => 'armed forces', 'full metal alchemist' => 'fullmetal alchemist', 'bruce schnier' => 'bruce schneier', 'dragonball' => 'dragon ball']
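
The rewrite table above maps a phrase to the text that should replace it before a query runs. The following is a minimal sketch of one way such a table could be applied. The function applySemanticRewrites() is a hypothetical helper, not part of Yioop, and the whole-word, case-insensitive matching is an assumption made for illustration.

```php
<?php
// Hypothetical helper, not part of Yioop: applies a rewrite table to a query.
function applySemanticRewrites(string $query, array $rewrites): string
{
    foreach ($rewrites as $phrase => $replacement) {
        // \b keeps "ins" from matching inside "insurance"; "i" ignores case
        $pattern = '/\b' . preg_quote((string) $phrase, '/') . '\b/i';
        $query = preg_replace($pattern, $replacement, $query);
    }
    return $query;
}

// A few entries copied from $semantic_rewrites
$rewrites = [
    "ins" => 'uscis',
    "mimetype" => 'mime',
    'dragonball' => 'dragon ball',
];
echo applySemanticRewrites("ins forms and insurance", $rewrites), "\n";
// prints "uscis forms and insurance"
```

Matching on word boundaries is the main design choice in this sketch: a plain substring replacement would also rewrite the "ins" inside "insurance".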

$stop_words

A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries

public static mixed $stop_words = ['a', 'able', 'about', 'above', 'abst', 'accordance', 'according', 'based', 'accordingly', 'across', 'act', 'actually', 'added', 'adj', 'affected', 'affecting', 'affects', 'after', 'afterwards', 'again', 'against', 'ah', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'an', 'and', 'announce', 'another', 'any', 'anybody', 'anyhow', 'anymore', 'anyone', 'anything', 'anyway', 'anyways', 'anywhere', 'apparently', 'approximately', 'are', 'aren', 'arent', 'arise', 'around', 'as', 'aside', 'ask', 'asking', 'at', 'auth', 'available', 'away', 'awfully', 'b', 'back', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'begin', 'beginning', 'beginnings', 'begins', 'behind', 'being', 'believe', 'below', 'beside', 'besides', 'between', 'beyond', 'biol', 'both', 'brief', 'briefly', 'but', 'by', 'c', 'ca', 'came', 'can', 'cannot', 'cant', 'cause', 'causes', 'certain', 'certainly', 'click', 'co', 'com', 'come', 'comment', 'comments', 'comes', 'contain', 'containing', 'contains', 'could', 'couldnt', 'd', 'date', 'did', 'didnt', 'different', 'do', 'does', 'doesnt', 'doing', 'done', 'dont', 'down', 'downwards', 'due', 'during', 'e', 'each', 'ed', 'edu', 'effect', 'eg', 'eight', 'eighty', 'either', 'else', 'elsewhere', 'end', 'ending', 'enough', 'especially', 'et', 'et-al', 'etc', 'even', 'ever', 'every', 'everybody', 'everyone', 'everything', 'everywhere', 'ex', 'except', 'f', 'far', 'few', 'ff', 'fifth', 'first', 'five', 'fix', 'followed', 'following', 'follows', 'for', 'former', 'formerly', 'forth', 'found', 'four', 'from', 'further', 'furthermore', 'g', 'gave', 'get', 'gets', 'getting', 'give', 'given', 'gives', 'giving', 'go', 'goes', 'gone', 'got', 'gotten', 'h', 'had', 'happens', 'hardly', 'has', 'hasnt', 'have', 'havent', 'having', 'he', 'hed', 'hence', 'her', 'here', 'hereafter', 'hereby', 'herein', 'heres', 'hereupon', 'hers', 'herself', 'hes', 
'hi', 'hid', 'him', 'himself', 'his', 'hither', 'home', 'how', 'howbeit', 'however', 'http', 'https', 'hundred', 'i', 'id', 'ie', 'if', 'ill', 'im', 'immediate', 'immediately', 'importance', 'important', 'in', 'inc', 'indeed', 'index', 'information', 'instead', 'into', 'invention', 'inward', 'is', 'isnt', 'it', 'itd', 'itll', 'its', 'itself', 'ive', 'j', 'just', 'k', 'keep', 'keeps', 'kept', 'kg', 'km', 'know', 'known', 'knows', 'l', 'largely', 'last', 'lately', 'later', 'latter', 'latterly', 'least', 'less', 'lest', 'let', 'lets', 'like', 'liked', 'likely', 'line', 'little', 'll', 'look', 'looking', 'looks', 'ltd', 'm', 'made', 'mainly', 'make', 'makes', 'many', 'may', 'maybe', 'me', 'mean', 'means', 'meantime', 'meanwhile', 'merely', 'mg', 'might', 'million', 'miss', 'ml', 'more', 'moreover', 'most', 'mostly', 'mr', 'mrs', 'much', 'mug', 'must', 'my', 'myself', 'n', 'na', 'name', 'namely', 'nay', 'nd', 'near', 'nearly', 'necessarily', 'necessary', 'need', 'needs', 'neither', 'never', 'nevertheless', 'new', 'next', 'nine', 'ninety', 'no', 'nobody', 'non', 'none', 'nonetheless', 'noone', 'nor', 'normally', 'nos', 'not', 'noted', 'nothing', 'now', 'nowhere', 'o', 'obtain', 'obtained', 'obviously', 'of', 'off', 'often', 'oh', 'ok', 'okay', 'old', 'omitted', 'on', 'once', 'one', 'ones', 'only', 'onto', 'or', 'ord', 'other', 'others', 'otherwise', 'ought', 'our', 'ours', 'ourselves', 'out', 'outside', 'over', 'overall', 'owing', 'own', 'p', 'page', 'pages', 'part', 'particular', 'particularly', 'past', 'per', 'perhaps', 'placed', 'please', 'plus', 'poorly', 'possible', 'possibly', 'potentially', 'pp', 'predominantly', 'present', 'previously', 'primarily', 'probably', 'promptly', 'proud', 'provides', 'put', 'q', 'que', 'quickly', 'quite', 'qv', 'quot', 'r', 'ran', 'rather', 'rd', 're', 'readily', 'really', 'recent', 'recently', 'ref', 'refs', 'regarding', 'regardless', 'regards', 'related', 'relatively', 'research', 'respectively', 'resulted', 'resulting', 'results', 
'right', 'run', 's', 'said', 'same', 'saw', 'say', 'saying', 'says', 'sec', 'section', 'see', 'seeing', 'seem', 'seemed', 'seeming', 'seems', 'seen', 'self', 'selves', 'sent', 'seven', 'several', 'shall', 'she', 'shed', 'shell', 'shes', 'should', 'shouldnt', 'show', 'showed', 'shown', 'showns', 'shows', 'significant', 'significantly', 'similar', 'similarly', 'since', 'six', 'slightly', 'so', 'some', 'somebody', 'somehow', 'someone', 'somethan', 'something', 'sometime', 'sometimes', 'somewhat', 'somewhere', 'soon', 'sorry', 'specifically', 'specified', 'specify', 'specifying', 'still', 'stop', 'strongly', 'sub', 'substantially', 'successfully', 'such', 'sufficiently', 'suggest', 'sup', 'sure', 't', 'take', 'taken', 'taking', 'tell', 'tends', 'th', 'than', 'thank', 'thanks', 'thanx', 'that', 'thatll', 'thats', 'thatve', 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'thence', 'there', 'thereafter', 'thereby', 'thered', 'therefore', 'therein', 'therell', 'thereof', 'therere', 'theres', 'thereto', 'thereupon', 'thereve', 'these', 'they', 'theyd', 'theyll', 'theyre', 'theyve', 'think', 'this', 'those', 'thou', 'though', 'thoughh', 'thousand', 'throug', 'through', 'throughout', 'thru', 'thus', 'til', 'till', 'tip', 'to', 'together', 'too', 'took', 'toward', 'towards', 'tried', 'tries', 'truly', 'try', 'trying', 'ts', 'twice', 'two', 'u', 'un', 'under', 'unfortunately', 'unless', 'unlike', 'unlikely', 'until', 'unto', 'up', 'upon', 'ups', 'us', 'use', 'used', 'useful', 'usefully', 'usefulness', 'uses', 'using', 'usually', 'v', 'value', 'various', 've', 'very', 'via', 'viz', 'vol', 'vols', 'vs', 'w', 'want', 'wants', 'was', 'wasnt', 'way', 'we', 'wed', 'welcome', 'well', 'went', 'were', 'werent', 'weve', 'what', 'whatever', 'whatll', 'whats', 'when', 'whence', 'whenever', 'where', 'whereafter', 'whereas', 'whereby', 'wherein', 'wheres', 'whereupon', 'wherever', 'whether', 'which', 'while', 'whim', 'whither', 'who', 'whod', 'whoever', 'whole', 'wholl', 'whom', 
'whomever', 'whos', 'whose', 'why', 'widely', 'will', 'willing', 'wish', 'with', 'within', 'without', 'wont', 'words', 'world', 'would', 'wouldnt', 'www', 'x', 'y', 'yes', 'yet', 'you', 'youd', 'youll', 'your', 'youre', 'yours', 'yourself', 'yourselves', 'youve', 'z', 'zero']
Tags
array

$verb_type

List of verb-like parts of speech that might appear in lexicon file

public static mixed $verb_type = ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ"]
Tags
array

$buffer

storage used in computing the stem

private static string $buffer

$j

Index to start of the suffix of the word being considered for manipulation

private static int $j

$k

Index of the current end of the word at the current state of computing its stem

private static int $k

Methods

__construct()

Do any global set up for tokenizer (none in the case of en-US)

public __construct() : mixed
Return values
mixed

canonicalizePunctuatedTerms()

This method tries to handle punctuation in terms specific to the English language, such as abbreviations.

public canonicalizePunctuatedTerms(string &$string) : mixed
Parameters
$string : string

a string of words, etc which might involve such terms

Return values
mixed

compressSentence()

Take in a sentence and try to compress it to a smaller version that "retains the most important information and remains grammatically correct" (Jing 2000).

public static compressSentence(string $sentence_to_compress) : string
Parameters
$sentence_to_compress : string

the sentence to compress

Return values
string

the compressed sentence

extractDeepestSpeechPartPhrase()

Takes phrase tree $tree and a part-of-speech $pos returns the deepest $pos only path in tree.

public static extractDeepestSpeechPartPhrase(array<string|int, mixed> $tree, string $pos) : string
Parameters
$tree : array<string|int, mixed>

phrase to extract type from

$pos : string

the part of speech to extract

Return values
string

the label of deepest $pos only path in $tree

extractObjectParseTree()

Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW, the former holding the object of the original phrase (as a string) and the latter holding the important parts of the object.

public static extractObjectParseTree(mixed $tree) : array<string|int, mixed>
Parameters
$tree : mixed
Return values
array<string|int, mixed>

with two fields CONCISE and RAW as described above

extractPredicateParseTree()

Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW, the former holding the predicate of the original phrase (as a string) and the latter holding the important parts of the predicate.

public static extractPredicateParseTree(mixed $tree) : array<string|int, mixed>
Parameters
$tree : mixed
Return values
array<string|int, mixed>

with two fields CONCISE and RAW as described above

extractSubjectParseTree()

Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW, the former holding the subject of the original phrase (as a string) and the latter holding the important parts of the subject.

public static extractSubjectParseTree(mixed $tree) : array<string|int, mixed>
Parameters
$tree : mixed
Return values
array<string|int, mixed>

with two fields CONCISE and RAW as described above

extractTripletByType()

Takes a triplets array with subject, predicate, object fields with CONCISE, RAW subfields and produces a triplets array with a $type subfield (where $type is one of CONCISE and RAW) and with subject, predicate, object, and QUESTION_ANSWER_LIST subfields.

public static extractTripletByType(array<string|int, mixed> $sub_pred_obj_triplets, string $type) : array<string|int, mixed>
Parameters
$sub_pred_obj_triplets : array<string|int, mixed>

in format described above

$type : string

either CONCISE or RAW

Return values
array<string|int, mixed>

$triplets in format described above

extractTripletsParseTree()

Takes a parse tree of a phrase and computes subject, predicate, and object arrays. Each of these arrays consists of two components, CONCISE and RAW, CONCISE corresponding to something more similar to the words in the original phrase and RAW to the case where extraneous words have been removed.

public static extractTripletsParseTree(array<string|int, mixed> $tree) : array<string|int, mixed>
Parameters
$tree : array<string|int, mixed>

a parse tree for a sentence

Return values
array<string|int, mixed>

triplet array

extractTripletsPhrases()

Scans a word list for phrases. For phrases found, generates a list of question and answer pairs at two levels of granularity: CONCISE (using all terms in the original phrase) and RAW (removing adjectives, etc.).

public static extractTripletsPhrases(array<string|int, mixed> $word_and_phrase_list) : array<string|int, mixed>
Parameters
$word_and_phrase_list : array<string|int, mixed>

of statements

Return values
array<string|int, mixed>

with two fields: QUESTION_LIST consisting of triplets (SUBJECT, PREDICATES, OBJECT) where one of the components has been replaced with a question marker.
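
To make the idea of replacing one component with a question marker concrete, here is a small sketch. The function sketchQuestionAnswerList() is hypothetical, and the exact layout Yioop uses for its question and answer lists is not given in this documentation, so the key format below is an assumption. Only the subject and object are replaced in this sketch.

```php
<?php
// Hypothetical illustration of question/answer pairs; the real list layout
// used by Yioop may differ.
function sketchQuestionAnswerList(string $subject, string $predicate,
    string $object, string $question_token = "qqq"): array
{
    return [
        // question with the subject unknown => the subject answers it
        "$question_token $predicate $object" => $subject,
        // question with the object unknown => the object answers it
        "$subject $predicate $question_token" => $object,
    ];
}

print_r(sketchQuestionAnswerList("tower", "casts", "shadow"));
// keys: "qqq casts shadow" => "tower", "tower casts qqq" => "shadow"
```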

isQuestion()

Takes a phrase query entered by a user and returns true if it is a question and false if not.

public isQuestion( $phrase) : bool
Parameters
$phrase :

any statement

Return values
bool

returns true if statement is question

parseAdjective()

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for an adjective if possible

public static parseAdjective(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>
Parameters
$tagged_phrase : array<string|int, mixed>

an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)

$tree : array<string|int, mixed>

that consists of ["cur_node" => current parse position in $tagged_phrase]

Return values
array<string|int, mixed>

has fields "cur_node", the index of how far we parsed $tagged_phrase, and "JJ", a subarray with a token node for the adjective that was parsed
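
The parse methods in this class share one calling pattern: they receive the tagged phrase plus a tree holding "cur_node", consume what they can, and return the tree with "cur_node" advanced. The sketch below illustrates that pattern for adjectives. The function sketchParseAdjective() is a hypothetical stand-in, not Yioop's code, and storing the consumed tokens as a plain list under "JJ" is an assumption about the node shape.

```php
<?php
// Hypothetical stand-in for the parse step described above; not Yioop's code.
function sketchParseAdjective(array $tagged_phrase, array $tree): array
{
    $adjective_tags = ["JJ", "JJR", "JJS"]; // same values as $adjective_type
    $cur_node = $tree["cur_node"];
    $tokens = [];
    // consume consecutive adjective-tagged terms starting at cur_node
    while (isset($tagged_phrase[$cur_node]) &&
        in_array($tagged_phrase[$cur_node]["tag"], $adjective_tags, true)) {
        $tokens[] = $tagged_phrase[$cur_node]["token"];
        $cur_node++;
    }
    if ($tokens !== []) {
        $tree["JJ"] = $tokens; // assumed shape: list of the tokens consumed
    }
    $tree["cur_node"] = $cur_node;
    return $tree;
}

$tagged_phrase = [
    ["token" => "the", "tag" => "DT"],
    ["token" => "tallest", "tag" => "JJS"],
    ["token" => "red", "tag" => "JJ"],
    ["token" => "tower", "tag" => "NN"],
];
$result = sketchParseAdjective($tagged_phrase, ["cur_node" => 1]);
print_r($result);
// "cur_node" is now 3 and "JJ" holds ["tallest", "red"]
```

If no adjective is found at the starting position, the tree comes back unchanged, which lets a caller such as a noun phrase parser simply try the next kind of term.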

parseAuxClause()

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for an auxiliary clause if possible

public static parseAuxClause(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>
Parameters
$tagged_phrase : array<string|int, mixed>

an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)

$tree : array<string|int, mixed>

that consists of ["cur_node" => current parse position in $tagged_phrase]

Return values
array<string|int, mixed>

has the field "cur_node", the index of how far we parsed $tagged_phrase

parseDeterminer()

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a determiner if possible

public static parseDeterminer(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>
Parameters
$tagged_phrase : array<string|int, mixed>

an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)

$tree : array<string|int, mixed>

that consists of ["cur_node" => current parse position in $tagged_phrase]

Return values
array<string|int, mixed>

has fields "cur_node", the index of how far we parsed $tagged_phrase, and "DT", a subarray with a token node for the determiner that was parsed

parseNoun()

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a noun if possible

public static parseNoun(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>
Parameters
$tagged_phrase : array<string|int, mixed>

an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)

$tree : array<string|int, mixed>

that consists of ["cur_node" => current parse position in $tagged_phrase]

Return values
array<string|int, mixed>

has fields "cur_node", the index of how far we parsed $tagged_phrase, and "NN", a subarray with a token node for the noun string that was parsed

parseNounPhrase()

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a noun phrase if possible

public static parseNounPhrase(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>
Parameters
$tagged_phrase : array<string|int, mixed>

an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)

$tree : array<string|int, mixed>

that consists of ["cur_node" => current parse position in $tagged_phrase]

Return values
array<string|int, mixed>

has fields "cur_node", the index of how far we parsed $tagged_phrase, and "NP", a subarray with possible fields "DT" with value a determiner subtree, "JJ" with value an adjective subtree, and "NN" with value a noun tree

parsePrepositionalPhrases()

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a sequence of prepositional phrases if possible

public static parsePrepositionalPhrases(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree[, int $index = 1 ]) : array<string|int, mixed>
Parameters
$tagged_phrase : array<string|int, mixed>

an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)

$tree : array<string|int, mixed>

that consists of ["cur_node" => current parse position in $tagged_phrase]

$index : int = 1

which term in $tagged_phrase to start to try to parse a preposition from

Return values
array<string|int, mixed>

has fields "cur_node", the index of how far we parsed $tagged_phrase, followed by additional possible fields (here i represents the ith clause found): "IN_i" with value a preposition subtree, "DT_i" with value a determiner subtree, "JJ_i" with value an adjective subtree, and "NN_i" with value an additional noun subtree

parseTypeList()

Starting at the $cur_node in a $tagged_phrase parse tree for an English sentence, create a phrase string for each of the next nodes which belong to part of speech group $type.

public static parseTypeList(array<string|int, mixed> &$cur_node, array<string|int, mixed> $tagged_phrase, string $type) : string
Parameters
$cur_node : array<string|int, mixed>

node within parse tree

$tagged_phrase : array<string|int, mixed>

parse tree for phrase

$type : string

self::$noun_type, self::$verb_type, etc

Return values
string

phrase string involving only terms of that $type

parseVerb()

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a verb if possible

public static parseVerb(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>
Parameters
$tagged_phrase : array<string|int, mixed>

an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)

$tree : array<string|int, mixed>

that consists of ["cur_node" => current parse position in $tagged_phrase]

Return values
array<string|int, mixed>

has fields "cur_node", the index of how far we parsed $tagged_phrase, and "VB", a subarray with a token node for the verb string that was parsed

parseVerbPhrase()

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a verb phrase if possible

public static parseVerbPhrase(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>
Parameters
$tagged_phrase : array<string|int, mixed>

an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)

$tree : array<string|int, mixed>

that consists of ["cur_node" => current parse position in $tagged_phrase]

Return values
array<string|int, mixed>

has fields "cur_node", the index of how far we parsed $tagged_phrase, and "VP", a subarray with possible fields "VB" with value a verb subtree and "NP" with value a noun phrase subtree

parseWholePhrase()

Given a part-of-speech tagged phrase array, generates a parse tree for the phrase using a recursive descent parser.

public static parseWholePhrase(array<string|int, mixed> $tagged_phrase,  $tree) : array<string|int, mixed>
Parameters
$tagged_phrase : array<string|int, mixed>

an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)

$tree :

that consists of ["cur_node" => current parse position in $tagged_phrase]

Return values
array<string|int, mixed>

used to represent a tree. The array has up to three fields: $tree["cur_node"], the index of how far we parsed our $tagged_phrase; $tree["NP"], which contains a subtree for a noun phrase; and $tree["VP"], which contains a subtree for a verb phrase

parseWhoQuestion()

Takes a tagged question string that starts with Who and returns the question triplet from the question string.

public static parseWhoQuestion(string $tagged_question, int $index) : array<string|int, mixed>
Parameters
$tagged_question : string

part-of-speech tagged question

$index : int

current index in statement

Return values
array<string|int, mixed>

parsed triplet

parseWHPlusQuestion()

Takes a tagged question string that starts with a Wh+ word other than Who and returns the question triplet from the question string. Unlike the WHO case, here we assume there is an auxiliary verb followed by a noun phrase, then the rest of the verb phrase. For example, Where is soccer played?

public static parseWHPlusQuestion(string $tagged_question,  $index) : array<string|int, mixed>
Parameters
$tagged_question : string

part-of-speech tagged question

$index :

current index in statement

Return values
array<string|int, mixed>

parsed triplet suitable for query look-up

questionParser()

Takes any question that starts with a WH question word and returns the triplet from the question.

public static questionParser(string $question) : array<string|int, mixed>
Parameters
$question : string

question to parse

Return values
array<string|int, mixed>

question triplet

rearrangeTripletsByType()

Takes a triplets array with subject, predicate, object fields with CONCISE and RAW subfields and rearranges it to have two fields CONCISE and RAW with subject, predicate, object, and QUESTION_ANSWER_LIST subfields

public static rearrangeTripletsByType(array<string|int, mixed> $sub_pred_obj_triplets) : array<string|int, mixed>
Parameters
$sub_pred_obj_triplets : array<string|int, mixed>

in format described above

Return values
array<string|int, mixed>

$processed_triplets in format described above
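
The regrouping described for this method amounts to swapping the two levels of array keys. A minimal sketch follows. The function sketchRearrangeByType() is hypothetical, the lowercase key names "subject", "predicate", and "object" are assumptions, and building the QUESTION_ANSWER_LIST subfield is left out.

```php
<?php
// Hypothetical sketch of the regrouping only; QUESTION_ANSWER_LIST is omitted.
function sketchRearrangeByType(array $sub_pred_obj_triplets): array
{
    $processed = [];
    foreach (["CONCISE", "RAW"] as $type) {
        foreach (["subject", "predicate", "object"] as $part) {
            if (isset($sub_pred_obj_triplets[$part][$type])) {
                // move from [part][type] to [type][part]
                $processed[$type][$part] =
                    $sub_pred_obj_triplets[$part][$type];
            }
        }
    }
    return $processed;
}

$triplets = [
    "subject" => ["CONCISE" => "the tall tower", "RAW" => "tower"],
    "predicate" => ["CONCISE" => "casts", "RAW" => "casts"],
    "object" => ["CONCISE" => "a long shadow", "RAW" => "shadow"],
];
$by_type = sketchRearrangeByType($triplets);
echo $by_type["RAW"]["object"], "\n"; // prints "shadow"
```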

segment()

Stub function which could be used for a word segmenter.

public static segment(string $pre_segment) : string

Such a segmenter on input thisisabunchofwords would output this is a bunch of words

Parameters
$pre_segment : string

before segmentation

Return values
string

should return a string with words separated by spaces; in this case it does nothing
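
For readers curious what a non-stub segmenter might look like, here is a rough sketch using greedy longest-match against a word list. This is not how Yioop segments text (the en-US method does nothing). The function sketchSegment() and its $dictionary parameter are hypothetical, and greedy matching is only one of several possible strategies.

```php
<?php
// Hypothetical segmenter sketch; Yioop's en-US segment() is a stub.
function sketchSegment(string $pre_segment, array $dictionary): string
{
    $words = array_flip($dictionary); // word => index, for fast lookups
    $max_len = $dictionary === [] ? 1 : max(array_map('strlen', $dictionary));
    $out = [];
    $pos = 0;
    $len = strlen($pre_segment);
    while ($pos < $len) {
        $match = $pre_segment[$pos]; // fall back to a single character
        // try the longest candidate first, then shorter ones
        for ($size = min($max_len, $len - $pos); $size > 1; $size--) {
            $candidate = substr($pre_segment, $pos, $size);
            if (isset($words[$candidate])) {
                $match = $candidate;
                break;
            }
        }
        $out[] = $match;
        $pos += strlen($match);
    }
    return implode(" ", $out);
}

echo sketchSegment("thisisabunchofwords",
    ["this", "is", "a", "bunch", "of", "words"]), "\n";
// prints "this is a bunch of words"
```

Greedy matching can pick a long word that blocks a better split later in the string, so a practical segmenter would usually score alternative splits rather than commit to the first match.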

stem()

Computes the stem of an English word

public static stem(string $word) : string

For example, jumps, jumping, jumpy, all have jump as a stem

Parameters
$word : string

the string to stem

Return values
string

the stem of $word

stopwordsRemover()

Removes the stop words from the page (used for Word Cloud generation)

public static stopwordsRemover(mixed $data) : mixed
Parameters
$data : mixed

either a string or an array of strings to remove stop words from

Return values
mixed

$data with no stop words
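As an illustration of the string-or-array behavior described above, here is a small Python sketch (not the PHP implementation). The `STOP_WORDS` set is a hypothetical stand-in for the class's $stop_words list, and matching on whitespace-separated, lower-cased words is an assumption.

```python
# Hypothetical stop word list; the real one lives in $stop_words.
STOP_WORDS = {"a", "an", "and", "the", "of", "is"}


def remove_stop_words(data):
    """Remove stop words from a string, or from each string in a list."""
    if isinstance(data, str):
        kept = [w for w in data.split() if w.lower() not in STOP_WORDS]
        return " ".join(kept)
    # Otherwise treat data as a list of strings and clean each element.
    return [remove_stop_words(item) for item in data]
```

Accepting both shapes lets a caller pass either a whole page or a list of already split fragments.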

tagPartsOfSpeechPhrase()

Takes a phrase and tags each term in it with its part of speech.

public static tagPartsOfSpeechPhrase(string $phrase[, bool $with_tokens = true ]) : string

So each term in the original phrase gets mapped to term~part_of_speech. This tagger is based on a Brill tagger. It makes use of a lexicon consisting of words from the Brown corpus together with a list of the part of speech tags that each word had in the Brown corpus. These are used to get an initial part of speech (if a word is not present in the lexicon, then we assume it is a noun). From this, a fixed set of rules is used to modify the initial tag if necessary.

Parameters
$phrase : string

text to add part of speech tags to

$with_tokens : bool = true

whether to include the terms and the tags in the output string or just the part of speech tags

Return values
string

$tagged_phrase phrase where each term has ~part_of_speech appended ($with_tokens == true) or just space separated part_of_speech (!$with_tokens)

tagTokenizePartOfSpeech()

Splits the input text into terms and outputs an array with one element per term, each element being an array with the term token and the part of speech tag.

public static tagTokenizePartOfSpeech(string $text) : array<string|int, mixed>
Parameters
$text : string

string to tag and tokenize

Return values
array<string|int, mixed>

array of pairs of the form ("token" => token_for_term, "tag" => part_of_speech_tag_for_term), one for each token in $text
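The shape of the returned pairs can be sketched in Python as follows. This is only an illustration under stated assumptions: the `LEXICON` entries and tag names are hypothetical, splitting on whitespace stands in for the real tokenization, and the rule-based correction pass of the Brill-style tagger is omitted. The fallback to a noun tag follows the description under tagPartsOfSpeechPhrase().

```python
# Hypothetical lexicon: word -> list of tags, first tag used as initial guess.
LEXICON = {"the": ["DT"], "dog": ["NN"], "barks": ["VBZ"]}


def tag_tokenize(text):
    """Return one {"token": ..., "tag": ...} pair per whitespace-separated term."""
    pairs = []
    for token in text.split():
        tags = LEXICON.get(token.lower())
        # Words missing from the lexicon are assumed to be nouns.
        tag = tags[0] if tags else "NN"
        pairs.append({"token": token, "tag": tag})
    return pairs
```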

cons()

Checks to see if the ith character in the buffer is a consonant

private static cons(int $i) : bool
Parameters
$i : int

index in the buffer of the character to check

Return values
bool

whether the ith character is a consonant
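In the Porter algorithm the only subtle case is y, which counts as a consonant at the start of a word or after a vowel, and as a vowel after a consonant. A minimal Python sketch of that rule, passing the buffer explicitly instead of using the class's static $buffer (the name `is_consonant` is hypothetical):

```python
def is_consonant(buffer, i):
    """Return True if buffer[i] is a consonant under Porter's definition."""
    ch = buffer[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y is a consonant at position 0, otherwise only when it follows a vowel.
        return i == 0 or not is_consonant(buffer, i - 1)
    return True
```

So the y in toy is a consonant, while each y in syzygy is a vowel.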

cvc()

Checks whether the letters at the indices $i-2, $i-1, $i in the buffer have the form consonant - vowel - consonant and also whether the second consonant is not w, x or y. This is used when trying to restore an e at the end of a short word, e.g.

private static cvc(int $i) : bool
  cav(e), lov(e), hop(e), crim(e), but
  snow, box, tray.
Parameters
$i : int

position to check in buffer for consonant-vowel-consonant

Return values
bool

whether the letters at indices have the given form
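The check can be sketched in Python along the lines of Porter's C reference. The buffer is passed explicitly here rather than read from the class's static $buffer, and the function names are hypothetical.

```python
def is_consonant(buffer, i):
    """Porter's rule: y is a vowel only when it follows a consonant."""
    ch = buffer[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(buffer, i - 1)
    return True


def cvc(buffer, i):
    """True if buffer[i-2..i] is consonant-vowel-consonant and buffer[i] is not w, x or y."""
    if i < 2:
        return False
    if not is_consonant(buffer, i):
        return False
    if is_consonant(buffer, i - 1):
        return False
    if not is_consonant(buffer, i - 2):
        return False
    return buffer[i] not in "wxy"
```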

doublec()

Checks whether positions $j-1 and $j of the buffer contain a double consonant.

private static doublec(int $j) : bool
Parameters
$j : int

position to check in buffer for double consonant

Return values
bool

whether there is a double consonant at that position
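A small Python sketch of the double consonant test, following Porter's C reference: the two characters must be equal and must be consonants. The buffer is passed explicitly and the names are hypothetical.

```python
def is_consonant(buffer, i):
    """Porter's rule: y is a vowel only when it follows a consonant."""
    ch = buffer[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(buffer, i - 1)
    return True


def doublec(buffer, j):
    """True if buffer[j-1] and buffer[j] are the same consonant."""
    if j < 1:
        return False
    if buffer[j] != buffer[j - 1]:
        return False
    return is_consonant(buffer, j)
```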

ends()

Checks if the buffer currently ends with the string $s

private static ends(string $s) : bool
Parameters
$s : string

string to use for check

Return values
bool

whether buffer currently ends with $s

m()

m() measures the number of consonant sequences between 0 and j. If c is a consonant sequence and v a vowel sequence, and [.] indicates arbitrary presence,

   [c][v]        gives 0
   [c]vc[v]      gives 1
   [c]vcvc[v]    gives 2
   [c]vcvcvc[v]  gives 3
   ....

private static m() : mixed
Return values
mixed
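One way to picture the measure is as the number of places where a vowel is immediately followed by a consonant, which equals the number of vc groups in the pattern. The Python sketch below takes the stem as an argument instead of reading the class's static buffer and $j. The function names are hypothetical, and counting transitions is a simplification of the loop in Porter's C code that gives the same count.

```python
def is_consonant(word, i):
    """Porter's rule: y is a vowel only when it follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count vowel-to-consonant transitions, i.e. the number of vc groups."""
    count = 0
    for i in range(1, len(stem)):
        if is_consonant(stem, i) and not is_consonant(stem, i - 1):
            count += 1
    return count
```

For instance, tree and by have measure 0, trouble and oats have measure 1, and private and oaten have measure 2.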

r()

Sets the ending in the buffer to $s if the number of consonant sequences between 0 and $j, as measured by m(), is positive.

private static r(string $s) : mixed
Parameters
$s : string

what to change the suffix to

Return values
mixed

setto()

setto($s) sets (j+1),...k to the characters in the string $s, readjusting k.

private static setto(string $s) : mixed
Parameters
$s : string

string to modify the end of buffer with

Return values
mixed
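The helpers ends(), r() and setto() cooperate through the shared buffer and the indices $j and $k. The Python sketch below models that interplay with a small class. It follows Porter's C reference, in which a successful ends() also records j and r() calls setto() only when the measure of the stem is positive. The class and method names are hypothetical, and the PHP code may differ in detail.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


class StemBuffer:
    def __init__(self, word):
        self.b = word
        self.k = len(word) - 1  # index of the last character of the word
        self.j = self.k         # index of the last character of the stem

    def measure(self):
        """Number of vowel-to-consonant transitions in b[0..j]."""
        stem = self.b[: self.j + 1]
        return sum(
            1
            for i in range(1, len(stem))
            if is_consonant(stem, i) and not is_consonant(stem, i - 1)
        )

    def ends(self, s):
        """If b[0..k] ends with s, set j just before the suffix and return True."""
        if not self.b[: self.k + 1].endswith(s):
            return False
        self.j = self.k - len(s)
        return True

    def setto(self, s):
        """Replace b[j+1..k] with s and readjust k."""
        self.b = self.b[: self.j + 1] + s
        self.k = self.j + len(s)

    def r(self, s):
        """Replace the suffix with s only if the stem has positive measure."""
        if self.measure() > 0:
            self.setto(s)

    def word(self):
        return self.b[: self.k + 1]
```

With this model, relational becomes relate after ends("ational") and r("ate"), while rational is left alone because its stem r has measure 0.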

stemPhrase()

Given an English phrase produces a phrase where each of the terms has been stemmed

private static stemPhrase(string $phrase) : string
Parameters
$phrase : string

phrase to stem

Return values
string

in which each term has been stemmed according to the English stemmer

step1ab()

step1ab() gets rid of plurals and -ed or -ing. e.g.

private static step1ab() : mixed
   caresses  ->  caress
   ponies    ->  poni
   ties      ->  ti
   caress    ->  caress
   cats      ->  cat

   feed      ->  feed
   agreed    ->  agree
   disabled  ->  disable

   matting   ->  mat
   mating    ->  mate
   meeting   ->  meet
   milling   ->  mill
   messing   ->  mess

   meetings  ->  meet
Return values
mixed
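The examples above can be reproduced with the following Python sketch of step 1a/1b. It is written against plain strings rather than the class's static buffer and indices, follows the rules in Porter's C reference, and uses hypothetical function names. It is an illustration, not the Yioop implementation.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    return sum(
        1
        for i in range(1, len(stem))
        if is_consonant(stem, i) and not is_consonant(stem, i - 1)
    )


def has_vowel(stem):
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def ends_cvc(stem):
    """Consonant-vowel-consonant ending whose last letter is not w, x or y."""
    i = len(stem) - 1
    if i < 2:
        return False
    pattern = (is_consonant(stem, i - 2), is_consonant(stem, i - 1), is_consonant(stem, i))
    return pattern == (True, False, True) and stem[i] not in "wxy"


def tidy(stem):
    """Repair a stem after -ed or -ing has been removed."""
    if stem.endswith(("at", "bl", "iz")):
        return stem + "e"
    if len(stem) >= 2 and stem[-1] == stem[-2] and is_consonant(stem, len(stem) - 1):
        # Undouble the consonant unless it is l, s or z.
        return stem if stem[-1] in "lsz" else stem[:-1]
    if measure(stem) == 1 and ends_cvc(stem):
        return stem + "e"
    return stem


def step1ab(word):
    # Step 1a: plurals.
    if word.endswith("sses"):
        word = word[:-2]
    elif word.endswith("ies"):
        word = word[:-3] + "i"
    elif word.endswith("s") and len(word) >= 2 and word[-2] != "s":
        word = word[:-1]
    # Step 1b: -eed, -ed, -ing.
    if word.endswith("eed"):
        if measure(word[:-3]) > 0:
            word = word[:-1]
    else:
        for suffix in ("ed", "ing"):
            if word.endswith(suffix):
                stem = word[: -len(suffix)]
                if has_vowel(stem):
                    word = tidy(stem)
                break
    return word
```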

step1c()

step1c() turns terminal y to i when there is another vowel in the stem.

private static step1c() : mixed
Return values
mixed
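A short Python sketch of this step on plain strings, using hypothetical function names and Porter's definition of a vowel. The stem here is everything before the final y.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def has_vowel(stem):
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def step1c(word):
    """Turn a terminal y into i when the rest of the word contains a vowel."""
    if word.endswith("y") and has_vowel(word[:-1]):
        return word[:-1] + "i"
    return word
```

Under this rule happy becomes happi, while sky is unchanged because sk has no vowel.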

step2()

step2() maps double suffixes to single ones, so -ization ( = -ize plus -ation) maps to -ize etc. Note that the string before the suffix must give m() > 0.

private static step2() : mixed
Return values
mixed
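The suffix mapping with its m() > 0 guard can be sketched in Python as below. `STEP2_RULES` is a deliberately partial table taken from the Porter algorithm's step 2 list, the function names are hypothetical, and the code works on plain strings rather than the class's buffer.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    return sum(
        1
        for i in range(1, len(stem))
        if is_consonant(stem, i) and not is_consonant(stem, i - 1)
    )


# Partial rule table (assumption: only a few of Porter's step 2 rules shown).
# Longer suffixes come first so that -ational wins over -tional.
STEP2_RULES = [
    ("ational", "ate"),
    ("ization", "ize"),
    ("iveness", "ive"),
    ("fulness", "ful"),
    ("ousness", "ous"),
    ("tional", "tion"),
]


def step2(word):
    for suffix, replacement in STEP2_RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            # The first matching suffix decides; rewrite only if m(stem) > 0.
            return stem + replacement if measure(stem) > 0 else word
    return word
```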

step3()

step3() deals with -ic-, -full, -ness etc., using a strategy similar to step2.

private static step3() : mixed
Return values
mixed

step4()

step4() takes off -ant, -ence etc., in context <c>vcvc<v>.

private static step4() : mixed
Return values
mixed

step5()

step5() removes a final -e if m() > 1, and changes -ll to -l if m() > 1.

private static step5() : mixed
Return values
mixed
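A Python sketch of this final step on plain strings, with hypothetical function names. Beyond the summary above, it also includes the extra case from Porter's C reference in which a final -e is removed when the measure is exactly 1 and the letters before the e do not form a consonant-vowel-consonant pattern. Whether the PHP code keeps that case is an assumption.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    return sum(
        1
        for i in range(1, len(stem))
        if is_consonant(stem, i) and not is_consonant(stem, i - 1)
    )


def cvc(word, i):
    if i < 2:
        return False
    pattern = (is_consonant(word, i - 2), is_consonant(word, i - 1), is_consonant(word, i))
    return pattern == (True, False, True) and word[i] not in "wxy"


def step5(word):
    if word.endswith("e"):
        m = measure(word)  # a trailing vowel does not change the measure
        if m > 1 or (m == 1 and not cvc(word, len(word) - 2)):
            word = word[:-1]
    if word.endswith("ll") and measure(word) > 1:
        word = word[:-1]
    return word
```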

taggedPartOfSpeechTokensToString()

Takes an array of pairs (token, tag) that came from phrase and builds a new phrase where terms look like token~tag.

private static taggedPartOfSpeechTokensToString(array<string|int, mixed> $tagged_tokens[, bool $with_tokens = true ]) : string
Parameters
$tagged_tokens : array<string|int, mixed>

array of pairs as might come from tagTokenizePartOfSpeech()

$with_tokens : bool = true

whether to include the terms and the tags in the output string or just the part of speech tags

Return values
string

$tagged_phrase a phrase with terms in the format token~tag ($with_tokens == true) or space separated tags (!$with_tokens).
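The two output formats can be illustrated with a short Python sketch. The pair layout with "token" and "tag" keys follows the description of tagTokenizePartOfSpeech(). The function name and the tags in the usage lines are hypothetical, and the exact spacing of the PHP output is an assumption.

```python
def tagged_tokens_to_string(tagged_tokens, with_tokens=True):
    """Join (token, tag) pairs as token~tag terms, or as tags only."""
    parts = []
    for pair in tagged_tokens:
        if with_tokens:
            parts.append(pair["token"] + "~" + pair["tag"])
        else:
            parts.append(pair["tag"])
    return " ".join(parts)


pairs = [{"token": "dogs", "tag": "NNS"}, {"token": "bark", "tag": "VB"}]
with_terms = tagged_tokens_to_string(pairs)
tags_only = tagged_tokens_to_string(pairs, with_tokens=False)
```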

vowelinstem()

Checks whether positions 0,...,$j of the buffer contain a vowel

private static vowelinstem() : bool
Return values
bool

whether it does


        
