Tokenizer
This class has a collection of methods for English-locale-specific tokenization. In particular, it has a stemmer, a stop word remover (mainly for use in word cloud creation), and a part-of-speech tagger (for question answering). The stemmer is my stab at implementing the Porter Stemmer algorithm presented at http://tartarus.org/~martin/PorterStemmer/def.txt. The code is based on the non-thread-safe C version given by Martin Porter.
Since PHP is single-threaded, this should be okay. Here, given a word, its stem is the part of the word that is common to all its inflected variants. For example, tall is common to tall, taller, and tallest. A stemmer takes a word and tries to produce its stem.
Table of Contents
- $adjective_type : mixed
- List of adjective-like parts of speech that might appear in lexicon file
- $adverb_type : mixed
- List of adverb-like parts of speech that might appear in lexicon file
- $conjunction_type : mixed
- List of conjunction-like parts of speech that might appear in lexicon file
- $determiner_type : mixed
- List of determiner-like parts of speech that might appear in lexicon file
- $no_stem_list : array<string|int, mixed>
- Words we don't want to be stemmed
- $noun_type : mixed
- List of noun-like parts of speech that might appear in lexicon file
- $question_token : mixed
- Any unique identifier corresponding to the component of a triplet which can be answered using a question answer list
- $semantic_rewrites : array<string|int, mixed>
- Phrases we would like Yioop to rewrite before performing a query
- $stop_words : mixed
- A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries
- $verb_type : mixed
- List of verb-like parts of speech that might appear in lexicon file
- $buffer : string
- storage used in computing the stem
- $j : int
- Index to start of the suffix of the word being considered for manipulation
- $k : int
- Index of the current end of the word at the current state of computing its stem
- __construct() : mixed
- Do any global set up for tokenizer (none in the case of en-US)
- canonicalizePunctuatedTerms() : mixed
- This method tries to handle punctuation in terms specific to the English language, such as abbreviations.
- compressSentence() : string
- Take in a sentence and try to compress it to a smaller version that "retains the most important information and remains grammatically correct" (Jing 2000).
- extractDeepestSpeechPartPhrase() : string
- Takes a phrase tree $tree and a part of speech $pos and returns the deepest $pos-only path in the tree.
- extractObjectParseTree() : array<string|int, mixed>
- Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW, the former having the object of the original phrase (as a string), the latter having the important parts of the object
- extractPredicateParseTree() : array<string|int, mixed>
- Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW, the former having the predicate of the original phrase (as a string), the latter having the important parts of the predicate
- extractSubjectParseTree() : array<string|int, mixed>
- Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW, the former having the subject of the original phrase (as a string), the latter having the important parts of the subject
- extractTripletByType() : array<string|int, mixed>
- Takes a triplets array with subject, predicate, object fields with CONCISE, RAW subfields and produces a triplet with a $type subfield (where $type is one of CONCISE and RAW) and with subject, predicate, object, and QUESTION_ANSWER_LIST subfields
- extractTripletsParseTree() : array<string|int, mixed>
- Takes a parse tree of a phrase and computes subject, predicate, and object arrays. Each of these arrays consists of two components, CONCISE and RAW: CONCISE corresponds to something more similar to the words in the original phrase, and RAW to the case where extraneous words have been removed
- extractTripletsPhrases() : array<string|int, mixed>
- Scans a word list for phrases. For phrases found, generates a list of question and answer pairs at two levels of granularity: CONCISE (using all terms in the original phrase) and RAW (removing adjectives, etc.).
- isQuestion() : bool
- Takes a phrase query entered by the user and returns true if it is a question and false if not
- parseAdjective() : array<string|int, mixed>
- Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for an adjective if possible
- parseAuxClause() : array<string|int, mixed>
- Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for an auxiliary clause if possible
- parseDeterminer() : array<string|int, mixed>
- Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a determiner if possible
- parseNoun() : array<string|int, mixed>
- Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a noun if possible
- parseNounPhrase() : array<string|int, mixed>
- Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a noun phrase if possible
- parsePrepositionalPhrases() : array<string|int, mixed>
- Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a sequence of prepositional phrases if possible
- parseTypeList() : string
- Starting at the $cur_node in a $tagged_phrase parse tree for an English sentence, create a phrase string for each of the next nodes which belong to part of speech group $type.
- parseVerb() : array<string|int, mixed>
- Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a verb if possible
- parseVerbPhrase() : array<string|int, mixed>
- Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a verb phrase if possible
- parseWholePhrase() : array<string|int, mixed>
- Given a part-of-speech tagged phrase array, generates a parse tree for the phrase using a recursive descent parser.
- parseWhoQuestion() : array<string|int, mixed>
- Takes a tagged question string that starts with Who and returns the question triplet from the question string
- parseWHPlusQuestion() : array<string|int, mixed>
- Takes a tagged question string that starts with a Wh+ word other than Who and returns the question triplet from the question string. Unlike the WHO case, here we assume there is an auxiliary verb followed by a noun phrase, then the rest of the verb phrase. For example, Where is soccer played?
- questionParser() : array<string|int, mixed>
- Takes any question that starts with a WH question word and returns the triplet from the question
- rearrangeTripletsByType() : array<string|int, mixed>
- Takes a triplets array with subject, predicate, object fields with CONCISE and RAW subfields and rearranges it to have two fields CONCISE and RAW with subject, predicate, object, and QUESTION_ANSWER_LIST subfields
- segment() : string
- Stub function which could be used for a word segmenter.
- stem() : string
- Computes the stem of an English word
- stopwordsRemover() : mixed
- Removes the stop words from the page (used for Word Cloud generation)
- tagPartsOfSpeechPhrase() : string
- Takes a phrase and tags each term in it with its part of speech.
- tagTokenizePartOfSpeech() : array<string|int, mixed>
- Split input text into terms and output an array with one element per term, that element consisting of array with the term token and the part of speech tag.
- cons() : bool
- Checks to see if the ith character in the buffer is a consonant
- cvc() : bool
- Checks whether the letters at the indices $i-2, $i-1, $i in the buffer have the form consonant - vowel - consonant and also if the second c is not w, x, or y. This is used when trying to restore an e at the end of a short word. e.g.
- doublec() : bool
- Checks if $j,($j-1) contain a double consonant.
- ends() : bool
- Checks if the buffer currently ends with the string $s
- m() : mixed
- m() measures the number of consonant sequences between 0 and j. If c is a consonant sequence and v a vowel sequence, and [.] indicates arbitrary presence, then [c][v] gives 0, [c]vc[v] gives 1, [c]vcvc[v] gives 2, [c]vcvcvc[v] gives 3, and so on.
- r() : mixed
- Sets the ending in the buffer to $s if the number of consonant sequences between $k and $j is positive.
- setto() : mixed
- setto($s) sets (j+1),...k to the characters in the string $s, readjusting k.
- stemPhrase() : string
- Given an English phrase produces a phrase where each of the terms has been stemmed
- step1ab() : mixed
- step1ab() gets rid of plurals and -ed or -ing. e.g.
- step1c() : mixed
- step1c() turns terminal y to i when there is another vowel in the stem.
- step2() : mixed
- step2() maps double suffixes to single ones, so -ization ( = -ize plus -ation) maps to -ize etc. Note that the string before the suffix must give m() > 0.
- step3() : mixed
- step3() deals with -ic-, -full, -ness etc. It uses a similar strategy to step2.
- step4() : mixed
- step4() takes off -ant, -ence etc., in context <c>vcvc<v>.
- step5() : mixed
- step5() removes a final -e if m() > 1, and changes -ll to -l if m() > 1.
- taggedPartOfSpeechTokensToString() : string
- Takes an array of pairs (token, tag) that came from phrase and builds a new phrase where terms look like token~tag.
- vowelinstem() : bool
- Checks if 0,...$j contains a vowel
Properties
$adjective_type
List of adjective-like parts of speech that might appear in lexicon file
public
static mixed
$adjective_type
= ["JJ", "JJR", "JJS"]
$adverb_type
List of adverb-like parts of speech that might appear in lexicon file
public
static mixed
$adverb_type
= ["RB", "RBR", "RBS"]
$conjunction_type
List of conjunction-like parts of speech that might appear in lexicon file
public
static mixed
$conjunction_type
= ["CC"]
$determiner_type
List of determiner-like parts of speech that might appear in lexicon file
public
static mixed
$determiner_type
= ["DT", "PDT"]
$no_stem_list
Words we don't want to be stemmed
public
static array<string|int, mixed>
$no_stem_list
= ["titanic", "programming", "fishing", 'ins', "blues", "factorial", "pbs"]
$noun_type
List of noun-like parts of speech that might appear in lexicon file
public
static mixed
$noun_type
= ["NN", "NNS", "NNP", "NNPS", "PRP"]
$question_token
Any unique identifier corresponding to the component of a triplet which can be answered using a question answer list
public
static mixed
$question_token
= "qqq"
$semantic_rewrites
Phrases we would like Yioop to rewrite before performing a query
public
static array<string|int, mixed>
$semantic_rewrites
= ["ins" => 'uscis', "mimetype" => 'mime', "military" => 'armed forces', 'full metal alchemist' => 'fullmetal alchemist', 'bruce schnier' => 'bruce schneier', 'dragonball' => 'dragon ball']
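As a rough illustration of how a rewrite map of this shape could be applied to a query, here is a minimal sketch. The helper name `rewriteQuery` and the whole-word matching rule are assumptions made for this example; the page does not say how Yioop itself applies the rewrites.

```php
<?php
// Hypothetical helper (not part of the documented class): replaces whole
// phrases in a query using a map shaped like $semantic_rewrites.
function rewriteQuery(string $query, array $rewrites): string
{
    foreach ($rewrites as $phrase => $replacement) {
        // \b keeps "ins" from matching inside words such as "insurance"
        $pattern = '/\b' . preg_quote($phrase, '/') . '\b/i';
        $query = preg_replace($pattern, $replacement, $query);
    }
    return $query;
}

// A few entries copied from the default value above
$rewrites = ["ins" => 'uscis', "mimetype" => 'mime',
    'dragonball' => 'dragon ball'];
echo rewriteQuery("ins forms and insurance", $rewrites), "\n";
// prints "uscis forms and insurance"
```

Matching on word boundaries is a design choice of the sketch: a plain substring replace would also rewrite the "ins" inside unrelated words.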
$stop_words
A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries
public
static mixed
$stop_words
= ['a', 'able', 'about', 'above', 'abst', 'accordance', 'according', 'based', 'accordingly', 'across', 'act', 'actually', 'added', 'adj', 'affected', 'affecting', 'affects', 'after', 'afterwards', 'again', 'against', 'ah', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'an', 'and', 'announce', 'another', 'any', 'anybody', 'anyhow', 'anymore', 'anyone', 'anything', 'anyway', 'anyways', 'anywhere', 'apparently', 'approximately', 'are', 'aren', 'arent', 'arise', 'around', 'as', 'aside', 'ask', 'asking', 'at', 'auth', 'available', 'away', 'awfully', 'b', 'back', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'begin', 'beginning', 'beginnings', 'begins', 'behind', 'being', 'believe', 'below', 'beside', 'besides', 'between', 'beyond', 'biol', 'both', 'brief', 'briefly', 'but', 'by', 'c', 'ca', 'came', 'can', 'cannot', 'cant', 'cause', 'causes', 'certain', 'certainly', 'click', 'co', 'com', 'come', 'comment', 'comments', 'comes', 'contain', 'containing', 'contains', 'could', 'couldnt', 'd', 'date', 'did', 'didnt', 'different', 'do', 'does', 'doesnt', 'doing', 'done', 'dont', 'down', 'downwards', 'due', 'during', 'e', 'each', 'ed', 'edu', 'effect', 'eg', 'eight', 'eighty', 'either', 'else', 'elsewhere', 'end', 'ending', 'enough', 'especially', 'et', 'et-al', 'etc', 'even', 'ever', 'every', 'everybody', 'everyone', 'everything', 'everywhere', 'ex', 'except', 'f', 'far', 'few', 'ff', 'fifth', 'first', 'five', 'fix', 'followed', 'following', 'follows', 'for', 'former', 'formerly', 'forth', 'found', 'four', 'from', 'further', 'furthermore', 'g', 'gave', 'get', 'gets', 'getting', 'give', 'given', 'gives', 'giving', 'go', 'goes', 'gone', 'got', 'gotten', 'h', 'had', 'happens', 'hardly', 'has', 'hasnt', 'have', 'havent', 'having', 'he', 'hed', 'hence', 'her', 'here', 'hereafter', 'hereby', 'herein', 'heres', 'hereupon', 'hers', 'herself', 'hes', 'hi', 'hid', 'him', 'himself', 
'his', 'hither', 'home', 'how', 'howbeit', 'however', 'http', 'https', 'hundred', 'i', 'id', 'ie', 'if', 'ill', 'im', 'immediate', 'immediately', 'importance', 'important', 'in', 'inc', 'indeed', 'index', 'information', 'instead', 'into', 'invention', 'inward', 'is', 'isnt', 'it', 'itd', 'itll', 'its', 'itself', 'ive', 'j', 'just', 'k', 'keep', 'keeps', 'kept', 'kg', 'km', 'know', 'known', 'knows', 'l', 'largely', 'last', 'lately', 'later', 'latter', 'latterly', 'least', 'less', 'lest', 'let', 'lets', 'like', 'liked', 'likely', 'line', 'little', 'll', 'look', 'looking', 'looks', 'ltd', 'm', 'made', 'mainly', 'make', 'makes', 'many', 'may', 'maybe', 'me', 'mean', 'means', 'meantime', 'meanwhile', 'merely', 'mg', 'might', 'million', 'miss', 'ml', 'more', 'moreover', 'most', 'mostly', 'mr', 'mrs', 'much', 'mug', 'must', 'my', 'myself', 'n', 'na', 'name', 'namely', 'nay', 'nd', 'near', 'nearly', 'necessarily', 'necessary', 'need', 'needs', 'neither', 'never', 'nevertheless', 'new', 'next', 'nine', 'ninety', 'no', 'nobody', 'non', 'none', 'nonetheless', 'noone', 'nor', 'normally', 'nos', 'not', 'noted', 'nothing', 'now', 'nowhere', 'o', 'obtain', 'obtained', 'obviously', 'of', 'off', 'often', 'oh', 'ok', 'okay', 'old', 'omitted', 'on', 'once', 'one', 'ones', 'only', 'onto', 'or', 'ord', 'other', 'others', 'otherwise', 'ought', 'our', 'ours', 'ourselves', 'out', 'outside', 'over', 'overall', 'owing', 'own', 'p', 'page', 'pages', 'part', 'particular', 'particularly', 'past', 'per', 'perhaps', 'placed', 'please', 'plus', 'poorly', 'possible', 'possibly', 'potentially', 'pp', 'predominantly', 'present', 'previously', 'primarily', 'probably', 'promptly', 'proud', 'provides', 'put', 'q', 'que', 'quickly', 'quite', 'qv', 'quot', 'r', 'ran', 'rather', 'rd', 're', 'readily', 'really', 'recent', 'recently', 'ref', 'refs', 'regarding', 'regardless', 'regards', 'related', 'relatively', 'research', 'respectively', 'resulted', 'resulting', 'results', 'right', 'run', 's', 'said', 
'same', 'saw', 'say', 'saying', 'says', 'sec', 'section', 'see', 'seeing', 'seem', 'seemed', 'seeming', 'seems', 'seen', 'self', 'selves', 'sent', 'seven', 'several', 'shall', 'she', 'shed', 'shell', 'shes', 'should', 'shouldnt', 'show', 'showed', 'shown', 'showns', 'shows', 'significant', 'significantly', 'similar', 'similarly', 'since', 'six', 'slightly', 'so', 'some', 'somebody', 'somehow', 'someone', 'somethan', 'something', 'sometime', 'sometimes', 'somewhat', 'somewhere', 'soon', 'sorry', 'specifically', 'specified', 'specify', 'specifying', 'still', 'stop', 'strongly', 'sub', 'substantially', 'successfully', 'such', 'sufficiently', 'suggest', 'sup', 'sure', 't', 'take', 'taken', 'taking', 'tell', 'tends', 'th', 'than', 'thank', 'thanks', 'thanx', 'that', 'thatll', 'thats', 'thatve', 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'thence', 'there', 'thereafter', 'thereby', 'thered', 'therefore', 'therein', 'therell', 'thereof', 'therere', 'theres', 'thereto', 'thereupon', 'thereve', 'these', 'they', 'theyd', 'theyll', 'theyre', 'theyve', 'think', 'this', 'those', 'thou', 'though', 'thoughh', 'thousand', 'throug', 'through', 'throughout', 'thru', 'thus', 'til', 'till', 'tip', 'to', 'together', 'too', 'took', 'toward', 'towards', 'tried', 'tries', 'truly', 'try', 'trying', 'ts', 'twice', 'two', 'u', 'un', 'under', 'unfortunately', 'unless', 'unlike', 'unlikely', 'until', 'unto', 'up', 'upon', 'ups', 'us', 'use', 'used', 'useful', 'usefully', 'usefulness', 'uses', 'using', 'usually', 'v', 'value', 'various', 've', 'very', 'via', 'viz', 'vol', 'vols', 'vs', 'w', 'want', 'wants', 'was', 'wasnt', 'way', 'we', 'wed', 'welcome', 'well', 'went', 'were', 'werent', 'weve', 'what', 'whatever', 'whatll', 'whats', 'when', 'whence', 'whenever', 'where', 'whereafter', 'whereas', 'whereby', 'wherein', 'wheres', 'whereupon', 'wherever', 'whether', 'which', 'while', 'whim', 'whither', 'who', 'whod', 'whoever', 'whole', 'wholl', 'whom', 'whomever', 'whos', 'whose', 
'why', 'widely', 'will', 'willing', 'wish', 'with', 'within', 'without', 'wont', 'words', 'world', 'would', 'wouldnt', 'www', 'x', 'y', 'yes', 'yet', 'you', 'youd', 'youll', 'your', 'youre', 'yours', 'yourself', 'yourselves', 'youve', 'z', 'zero']
$verb_type
List of verb-like parts of speech that might appear in lexicon file
public
static mixed
$verb_type
= ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ"]
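The `*_type` lists group individual part-of-speech tags into coarser classes. The sketch below shows one way such lists could be used to map a tag to its class. The `$groups` array and the `tagGroup` helper are hypothetical names introduced for this example; only the tag values come from the property defaults above.

```php
<?php
// Tag groups copied from the property defaults listed above.
$groups = [
    "adjective" => ["JJ", "JJR", "JJS"],
    "adverb" => ["RB", "RBR", "RBS"],
    "conjunction" => ["CC"],
    "determiner" => ["DT", "PDT"],
    "noun" => ["NN", "NNS", "NNP", "NNPS", "PRP"],
    "verb" => ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ"],
];

// Hypothetical helper: returns the group name for a tag, or null if the
// tag belongs to none of the groups.
function tagGroup(string $tag, array $groups): ?string
{
    foreach ($groups as $name => $tags) {
        if (in_array($tag, $tags, true)) {
            return $name;
        }
    }
    return null;
}

echo tagGroup("VBZ", $groups), "\n"; // prints "verb"
```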
$buffer
storage used in computing the stem
private
static string
$buffer
$j
Index to start of the suffix of the word being considered for manipulation
private
static int
$j
$k
Index of the current end of the word at the current state of computing its stem
private
static int
$k
Methods
__construct()
Do any global set up for tokenizer (none in the case of en-US)
public
__construct() : mixed
Return values
mixed
canonicalizePunctuatedTerms()
This method tries to handle punctuation in terms specific to the English language, such as abbreviations.
public
canonicalizePunctuatedTerms(string &$string) : mixed
Parameters
- $string : string
-
a string of words, etc., which might involve such terms
Return values
mixed
compressSentence()
Take in a sentence and try to compress it to a smaller version that "retains the most important information and remains grammatically correct" (Jing 2000).
public
static compressSentence(string $sentence_to_compress) : string
Parameters
- $sentence_to_compress : string
-
the sentence to compress
Return values
string —the compressed sentence
extractDeepestSpeechPartPhrase()
Takes a phrase tree $tree and a part of speech $pos and returns the deepest $pos-only path in the tree.
public
static extractDeepestSpeechPartPhrase(array<string|int, mixed> $tree, string $pos) : string
Parameters
- $tree : array<string|int, mixed>
-
phrase to extract type from
- $pos : string
-
the part of speech to extract
Return values
string —the label of the deepest $pos-only path in $tree
extractObjectParseTree()
Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW, the former having the object of the original phrase (as a string), the latter having the important parts of the object
public
static extractObjectParseTree(mixed $tree) : array<string|int, mixed>
Parameters
- $tree : mixed
Return values
array<string|int, mixed> —with two fields CONCISE and RAW as described above
extractPredicateParseTree()
Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW, the former having the predicate of the original phrase (as a string), the latter having the important parts of the predicate
public
static extractPredicateParseTree(mixed $tree) : array<string|int, mixed>
Parameters
- $tree : mixed
Return values
array<string|int, mixed> —with two fields CONCISE and RAW as described above
extractSubjectParseTree()
Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW, the former having the subject of the original phrase (as a string), the latter having the important parts of the subject
public
static extractSubjectParseTree(mixed $tree) : array<string|int, mixed>
Parameters
- $tree : mixed
Return values
array<string|int, mixed> —with two fields CONCISE and RAW as described above
extractTripletByType()
Takes a triplets array with subject, predicate, object fields with CONCISE, RAW subfields and produces a triplet with a $type subfield (where $type is one of CONCISE and RAW) and with subject, predicate, object, and QUESTION_ANSWER_LIST subfields
public
static extractTripletByType(array<string|int, mixed> $sub_pred_obj_triplets, string $type) : array<string|int, mixed>
Parameters
- $sub_pred_obj_triplets : array<string|int, mixed>
-
in format described above
- $type : string
-
either CONCISE or RAW
Return values
array<string|int, mixed> —$triplets in format described above
extractTripletsParseTree()
Takes a parse tree of a phrase and computes subject, predicate, and object arrays. Each of these arrays consists of two components, CONCISE and RAW: CONCISE corresponds to something more similar to the words in the original phrase, and RAW to the case where extraneous words have been removed
public
static extractTripletsParseTree(array $tree) : array<string|int, mixed>
Parameters
- $tree : array
-
a parse tree for a sentence
Return values
array<string|int, mixed> —triplet array
extractTripletsPhrases()
Scans a word list for phrases. For phrases found, generates a list of question and answer pairs at two levels of granularity: CONCISE (using all terms in the original phrase) and RAW (removing adjectives, etc.).
public
static extractTripletsPhrases(array<string|int, mixed> $word_and_phrase_list) : array<string|int, mixed>
Parameters
- $word_and_phrase_list : array<string|int, mixed>
-
of statements
Return values
array<string|int, mixed> —with two fields: QUESTION_LIST consisting of triplets (SUBJECT, PREDICATES, OBJECT) where one of the components has been replaced with a question marker.
isQuestion()
Takes a phrase query entered by the user and returns true if it is a question and false if not
public
isQuestion($phrase) : bool
Parameters
Return values
bool —true if the statement is a question
parseAdjective()
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for an adjective if possible
public
static parseAdjective(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>
Parameters
- $tagged_phrase : array<string|int, mixed>
-
an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
- $tree : array<string|int, mixed>
-
that consists of ["cur_node" => current parse position in $tagged_phrase]
Return values
array<string|int, mixed> —has fields "cur_node", the index of how far we parsed $tagged_phrase, and "JJ", a subarray with a token node for the adjective that was parsed
parseAuxClause()
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for an auxiliary clause if possible
public
static parseAuxClause(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>
Parameters
- $tagged_phrase : array<string|int, mixed>
-
an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
- $tree : array<string|int, mixed>
-
that consists of ["cur_node" => current parse position in $tagged_phrase]
Return values
array<string|int, mixed> —has fields "cur_node", the index of how far we parsed $tagged_phrase
parseDeterminer()
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a determiner if possible
public
static parseDeterminer(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>
Parameters
- $tagged_phrase : array<string|int, mixed>
-
an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
- $tree : array<string|int, mixed>
-
that consists of ["cur_node" => current parse position in $tagged_phrase]
Return values
array<string|int, mixed> —has fields "cur_node", the index of how far we parsed $tagged_phrase, and "DT", a subarray with a token node for the determiner that was parsed
parseNoun()
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a noun if possible
public
static parseNoun(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>
Parameters
- $tagged_phrase : array<string|int, mixed>
-
an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
- $tree : array<string|int, mixed>
-
that consists of ["cur_node" => current parse position in $tagged_phrase]
Return values
array<string|int, mixed> —has fields "cur_node", the index of how far we parsed $tagged_phrase, and "NN", a subarray with a token node for the noun string that was parsed
parseNounPhrase()
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a noun phrase if possible
public
static parseNounPhrase(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>
Parameters
- $tagged_phrase : array<string|int, mixed>
-
an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
- $tree : array<string|int, mixed>
-
that consists of ["cur_node" => current parse position in $tagged_phrase]
Return values
array<string|int, mixed> —has fields "cur_node", the index of how far we parsed $tagged_phrase, and "NP", a subarray with possible fields "DT" (a determiner subtree), "JJ" (an adjective subtree), and "NN" (a noun subtree)
parsePrepositionalPhrases()
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a sequence of prepositional phrases if possible
public
static parsePrepositionalPhrases(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree[, int $index = 1 ]) : array<string|int, mixed>
Parameters
- $tagged_phrase : array<string|int, mixed>
-
an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
- $tree : array<string|int, mixed>
-
that consists of ["cur_node" => current parse position in $tagged_phrase]
- $index : int = 1
-
which term in $tagged_phrase to start to try to parse a preposition from
Return values
array<string|int, mixed> —has fields "cur_node", the index of how far we parsed $tagged_phrase, followed by additional possible fields (here i represents the ith clause found): "IN_i" with value a preposition subtree, "DT_i" with value a determiner subtree, "JJ_i" with value an adjective subtree, "NN_i" with value an additional noun subtree
parseTypeList()
Starting at the $cur_node in a $tagged_phrase parse tree for an English sentence, create a phrase string for each of the next nodes which belong to part of speech group $type.
public
static parseTypeList(array<string|int, mixed> &$cur_node, array<string|int, mixed> $tagged_phrase, string $type) : string
Parameters
- $cur_node : array<string|int, mixed>
-
node within parse tree
- $tagged_phrase : array<string|int, mixed>
-
parse tree for phrase
- $type : string
-
self::$noun_type, self::$verb_type, etc
Return values
string —phrase string involving only terms of that $type
parseVerb()
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a verb if possible
public
static parseVerb(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>
Parameters
- $tagged_phrase : array<string|int, mixed>
-
an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
- $tree : array<string|int, mixed>
-
that consists of ["cur_node" => current parse position in $tagged_phrase]
Return values
array<string|int, mixed> —has fields "cur_node", the index of how far we parsed $tagged_phrase, and "VB", a subarray with a token node for the verb string that was parsed
parseVerbPhrase()
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a verb phrase if possible
public
static parseVerbPhrase(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>
Parameters
- $tagged_phrase : array<string|int, mixed>
-
an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
- $tree : array<string|int, mixed>
-
that consists of ["cur_node" => current parse position in $tagged_phrase]
Return values
array<string|int, mixed> —has fields "cur_node", the index of how far we parsed $tagged_phrase, and "VP", a subarray with possible fields "VB" (a verb subtree) and "NP" (a noun phrase subtree)
parseWholePhrase()
Given a part-of-speech tagged phrase array, generates a parse tree for the phrase using a recursive descent parser.
public
static parseWholePhrase(array<string|int, mixed> $tagged_phrase, $tree) : array<string|int, mixed>
Parameters
- $tagged_phrase : array<string|int, mixed>
-
an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
- $tree :
-
that consists of ["cur_node" => current parse position in $tagged_phrase]
Return values
array<string|int, mixed> —used to represent a tree. The array has up to three fields: $tree["cur_node"], the index of how far we parsed our $tagged_phrase; $tree["NP"], a subtree for a noun phrase; and $tree["VP"], a subtree for a verb phrase
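To make the recursive descent idea concrete, here is a minimal sketch over a deliberately simplified grammar (S -> NP VP, NP -> [DT] [JJ] NN, VP -> VB [NP]). It reuses the "token"/"tag" pair format and the "cur_node" convention described above, but the function names, the `TAG_GROUPS` constant, and the grammar itself are assumptions for illustration. The documented parser also handles auxiliary clauses and prepositional phrases, which this sketch omits.

```php
<?php
// Hypothetical tag groups for the simplified grammar used in this sketch.
const TAG_GROUPS = [
    "DT" => ["DT", "PDT"],
    "JJ" => ["JJ", "JJR", "JJS"],
    "NN" => ["NN", "NNS", "NNP", "NNPS", "PRP"],
    "VB" => ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ"],
];

// Consumes consecutive tokens whose tag is in group $label, starting at
// $tree["cur_node"], and advances "cur_node" past them.
function parseGroup(array $tagged, array $tree, string $label): array
{
    $i = $tree["cur_node"];
    $tokens = [];
    while ($i < count($tagged)
        && in_array($tagged[$i]["tag"], TAG_GROUPS[$label], true)) {
        $tokens[] = $tagged[$i]["token"];
        $i++;
    }
    if ($tokens) {
        $tree[$label] = implode(" ", $tokens);
    }
    $tree["cur_node"] = $i;
    return $tree;
}

// NP -> [DT] [JJ] NN
function parseNP(array $tagged, array $tree): array
{
    $np = ["cur_node" => $tree["cur_node"]];
    foreach (["DT", "JJ", "NN"] as $label) {
        $np = parseGroup($tagged, $np, $label);
    }
    if (!isset($np["NN"])) {
        return $tree; // no noun found: consume nothing
    }
    $tree["cur_node"] = $np["cur_node"];
    unset($np["cur_node"]);
    $tree["NP"] = $np;
    return $tree;
}

// VP -> VB [NP]
function parseVP(array $tagged, array $tree): array
{
    $vp = parseGroup($tagged, ["cur_node" => $tree["cur_node"]], "VB");
    if (!isset($vp["VB"])) {
        return $tree; // no verb found: consume nothing
    }
    $vp = parseNP($tagged, $vp);
    $tree["cur_node"] = $vp["cur_node"];
    unset($vp["cur_node"]);
    $tree["VP"] = $vp;
    return $tree;
}

// S -> NP VP
function parseSentence(array $tagged): array
{
    return parseVP($tagged, parseNP($tagged, ["cur_node" => 0]));
}

$tagged = [
    ["token" => "the", "tag" => "DT"],
    ["token" => "dog", "tag" => "NN"],
    ["token" => "chased", "tag" => "VBD"],
    ["token" => "a", "tag" => "DT"],
    ["token" => "cat", "tag" => "NN"],
];
print_r(parseSentence($tagged));
```

Each parse function takes the position from "cur_node", tries to match its rule, and either returns the tree unchanged or returns it with a new subtree and an advanced position, which is what lets the callers chain them.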
parseWhoQuestion()
Takes a tagged question string that starts with Who and returns the question triplet from the question string
public
static parseWhoQuestion(string $tagged_question, int $index) : array<string|int, mixed>
Parameters
- $tagged_question : string
-
part-of-speech tagged question
- $index : int
-
current index in statement
Return values
array<string|int, mixed> —parsed triplet
parseWHPlusQuestion()
Takes a tagged question string that starts with a Wh+ word other than Who and returns the question triplet from the question string. Unlike the WHO case, here we assume there is an auxiliary verb followed by a noun phrase, then the rest of the verb phrase. For example, Where is soccer played?
public
static parseWHPlusQuestion(string $tagged_question, $index) : array<string|int, mixed>
Parameters
Return values
array<string|int, mixed> —parsed triplet suitable for query look-up
questionParser()
Takes any question that starts with a WH question word and returns the triplet from the question
public
static questionParser(string $question) : array<string|int, mixed>
Parameters
- $question : string
-
question to parse
Return values
array<string|int, mixed> —question triplet
rearrangeTripletsByType()
Takes a triplets array with subject, predicate, object fields with CONCISE and RAW subfields and rearranges it to have two fields CONCISE and RAW with subject, predicate, object, and QUESTION_ANSWER_LIST subfields
public
static rearrangeTripletsByType(array<string|int, mixed> $sub_pred_obj_triplets) : array<string|int, mixed>
Parameters
- $sub_pred_obj_triplets : array<string|int, mixed>
-
in format described above
Return values
array<string|int, mixed> —$processed_triplets in format described above
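The rearrangement described here is essentially a transposition of a two-level array. The sketch below shows only that step. The key names ("subject", "predicate", "object") and the `transposeTriplet` helper are assumptions, and the QUESTION_ANSWER_LIST subfield is left out because this page does not describe how it is built.

```php
<?php
// Hypothetical helper: turns part => type => value into
// type => part => value.
function transposeTriplet(array $triplet): array
{
    $by_type = [];
    foreach ($triplet as $part => $by_granularity) {
        foreach ($by_granularity as $type => $value) {
            $by_type[$type][$part] = $value;
        }
    }
    return $by_type;
}

// Example input with assumed key names
$triplet = [
    "subject" => ["CONCISE" => "the brown dog", "RAW" => "dog"],
    "predicate" => ["CONCISE" => "quickly chased", "RAW" => "chased"],
    "object" => ["CONCISE" => "a small cat", "RAW" => "cat"],
];
print_r(transposeTriplet($triplet));
```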
segment()
Stub function which could be used for a word segmenter.
public
static segment(string $pre_segment) : string
Such a segmenter on input thisisabunchofwords would output this is a bunch of words
Parameters
- $pre_segment : string
-
before segmentation
Return values
string —should return a string with words separated by spaces; in this case it does nothing
stem()
Computes the stem of an English word
public
static stem(string $word) : string
For example, jumps, jumping, jumpy, all have jump as a stem
Parameters
- $word : string
-
the string to stem
Return values
string —the stem of $word
stopwordsRemover()
Removes the stop words from the page (used for Word Cloud generation)
public
static stopwordsRemover(mixed $data) : mixed
Parameters
- $data : mixed
-
either a string or an array of strings to remove stop words from
Return values
mixed —$data with no stop words
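As an illustration of accepting either a string or an array of strings, stop word removal could be sketched as below. This is not the class's actual code: the function name removeStopWords, the whitespace tokenization, the case-insensitive matching, and the four-word stop list are all assumptions made for the example.

```php
<?php
/**
 * Hypothetical sketch of stop word removal for a string or array of strings.
 * The real method uses the class's own $stop_words list.
 */
function removeStopWords($data, array $stop_words)
{
    if (is_array($data)) {
        $out = [];
        foreach ($data as $key => $text) {
            // Process each string of the array independently
            $out[$key] = removeStopWords($text, $stop_words);
        }
        return $out;
    }
    $terms = preg_split('/\s+/', trim($data), -1, PREG_SPLIT_NO_EMPTY);
    $kept = array_filter($terms, function ($term) use ($stop_words) {
        return !in_array(strtolower($term), $stop_words, true);
    });
    return implode(' ', $kept);
}

$stop_words = ['the', 'a', 'of', 'is']; // assumed sample list
$cleaned = removeStopWords('The stem of a word is short', $stop_words);
// $cleaned is 'stem word short'
$cleaned_list = removeStopWords(['a cat', 'the dog'], $stop_words);
// $cleaned_list is ['cat', 'dog']
```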
tagPartsOfSpeechPhrase()
Takes a phrase and tags each term in it with its part of speech.
public
static tagPartsOfSpeechPhrase(string $phrase[, bool $with_tokens = true ]) : string
So each term in the original phrase gets mapped to term~part_of_speech. This tagger is based on a Brill tagger. It makes use of a lexicon consisting of words from the Brown corpus together with a list of part of speech tags that each word had in the Brown Corpus. These are used to get an initial part of speech (if a word was not present then we assume it is a noun). From this, a fixed set of rules is used to modify the initial tag if necessary.
Parameters
- $phrase : string
-
text to add part of speech tags to
- $with_tokens : bool = true
-
whether to include the terms and the tags in the output string or just the part of speech tags
Return values
string —$tagged_phrase phrase where each term has ~part_of_speech appended ($with_tokens == true) or just space separated part_of_speech (!$with_tokens)
tagTokenizePartOfSpeech()
Split input text into terms and output an array with one element per term, that element consisting of array with the term token and the part of speech tag.
public
static tagTokenizePartOfSpeech(string $text) : array<string|int, mixed>
Parameters
- $text : string
-
string to tag and tokenize
Return values
array<string|int, mixed> —of pairs of the form ("token" => token_for_term, "tag" => part_of_speech_tag_for_term), one for each token in $text
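The initial tagging step of a lexicon-based tagger, with the default-to-noun fallback, could look roughly like the sketch below. It covers only the lexicon look-up and not the Brill-style rewrite rules. The function name tagTokenizeSketch, the whitespace tokenization, the tag names (WRB, VBZ, VBN, VBD, NN), and the tiny lexicon are assumptions made for illustration.

```php
<?php
/**
 * Hypothetical sketch: assigns each token the first tag listed for it in a
 * lexicon, defaulting to a noun tag when the word is unknown.
 */
function tagTokenizeSketch(string $text, array $lexicon): array
{
    $tokens = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    $result = [];
    foreach ($tokens as $token) {
        $key = strtolower($token);
        // Unknown words are assumed to be nouns ('NN' is an assumed tag name)
        $tag = $lexicon[$key][0] ?? 'NN';
        $result[] = ['token' => $token, 'tag' => $tag];
    }
    return $result;
}

// Assumed sample lexicon: word => list of tags seen for that word
$lexicon = ['where' => ['WRB'], 'is' => ['VBZ'], 'played' => ['VBN', 'VBD']];
$pairs = tagTokenizeSketch('Where is soccer played', $lexicon);
// $pairs[2] is ['token' => 'soccer', 'tag' => 'NN'] since soccer is not in the lexicon
```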
cons()
Checks to see if the ith character in the buffer is a consonant
private
static cons(int $i) : bool
Parameters
- $i : int
-
index of the character in the buffer to check
Return values
bool —whether the ith character is a consonant
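For reference, the consonant test in Porter's published definition treats a, e, i, o, u as vowels and treats y as a consonant only at the start of the word or after a vowel. A minimal standalone sketch follows. It takes the buffer as an explicit argument, which is an assumption made to keep the example self-contained (the method above reads the class's $buffer), and the name isConsonant is hypothetical.

```php
<?php
/**
 * Sketch of the Porter consonant test with the buffer passed explicitly.
 */
function isConsonant(string $buffer, int $i): bool
{
    switch ($buffer[$i]) {
        case 'a':
        case 'e':
        case 'i':
        case 'o':
        case 'u':
            return false;
        case 'y':
            // y is a consonant at position 0 or when preceded by a vowel
            return ($i == 0) ? true : !isConsonant($buffer, $i - 1);
        default:
            return true;
    }
}

$y_in_toy = isConsonant('toy', 2);     // true: y follows the vowel o
$y_in_happy = isConsonant('happy', 4); // false: y follows the consonant p
```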
cvc()
Checks whether the letters at the indices $i-2, $i-1, $i in the buffer have the form consonant - vowel - consonant and also whether the second consonant is not w, x, or y. This is used when trying to restore an e at the end of a short word, e.g.
private
static cvc(int $i) : bool
cav(e), lov(e), hop(e), crim(e), but snow, box, tray.
Parameters
- $i : int
-
position to check in buffer for consonant-vowel-consonant
Return values
bool —whether the letters at indices have the given form
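The consonant-vowel-consonant check can be tried on the example words given above with a small standalone sketch. The buffer is passed explicitly and the names isConsonant and isCvc are hypothetical; the logic follows Porter's reference C code rather than anything confirmed on this page.

```php
<?php
/** Porter consonant test (y is a consonant at index 0 or after a vowel). */
function isConsonant(string $buffer, int $i): bool
{
    if (strpos('aeiou', $buffer[$i]) !== false) {
        return false;
    }
    if ($buffer[$i] === 'y') {
        return ($i == 0) ? true : !isConsonant($buffer, $i - 1);
    }
    return true;
}

/** True if positions $i-2, $i-1, $i are c-v-c and the last c is not w, x, y. */
function isCvc(string $buffer, int $i): bool
{
    if ($i < 2 || !isConsonant($buffer, $i) || isConsonant($buffer, $i - 1)
        || !isConsonant($buffer, $i - 2)) {
        return false;
    }
    return !in_array($buffer[$i], ['w', 'x', 'y'], true);
}

$words = ['cav', 'lov', 'hop', 'crim', 'snow', 'box', 'tray'];
$restorable = array_values(array_filter($words, function ($word) {
    return isCvc($word, strlen($word) - 1);
}));
// $restorable is ['cav', 'lov', 'hop', 'crim']
```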
doublec()
Checks if $j,($j-1) contain a double consonant.
private
static doublec(int $j) : bool
Parameters
- $j : int
-
position to check in buffer for double consonant
Return values
bool —whether it does
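A double consonant means the same consonant appears at positions $j-1 and $j. The sketch below follows Porter's reference C code; passing the buffer as an argument and the names isConsonant and isDoubleConsonant are assumptions made so the example stands alone.

```php
<?php
/** Porter consonant test (y is a consonant at index 0 or after a vowel). */
function isConsonant(string $buffer, int $i): bool
{
    if (strpos('aeiou', $buffer[$i]) !== false) {
        return false;
    }
    if ($buffer[$i] === 'y') {
        return ($i == 0) ? true : !isConsonant($buffer, $i - 1);
    }
    return true;
}

/** True if positions $j-1 and $j hold the same consonant. */
function isDoubleConsonant(string $buffer, int $j): bool
{
    if ($j < 1 || $buffer[$j] !== $buffer[$j - 1]) {
        return false;
    }
    return isConsonant($buffer, $j);
}

$matt = isDoubleConsonant('matt', 3); // true: tt
$meet = isDoubleConsonant('meet', 2); // false: ee is a double vowel
```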
ends()
Checks if the buffer currently ends with the string $s
private
static ends(string $s) : bool
Parameters
- $s : string
-
string to use for check
Return values
bool —whether buffer currently ends with $s
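In Porter's reference C code, a successful suffix match also records the index just before the matched suffix so that later steps can measure or rewrite the stem. Whether this class does the same is not stated here. The sketch below models that idea by returning the index on a match and null otherwise; the function name endsWith and the explicit $buffer and $k arguments are hypothetical.

```php
<?php
/**
 * Hypothetical sketch: checks whether $buffer[0..$k] ends with $s.
 * Returns the index of the last character before the suffix on a match
 * (-1 if the suffix is the whole word), or null when there is no match.
 */
function endsWith(string $buffer, int $k, string $s): ?int
{
    $length = strlen($s);
    if ($length > $k + 1) {
        return null; // suffix longer than the word
    }
    if (substr($buffer, $k - $length + 1, $length) !== $s) {
        return null;
    }
    return $k - $length;
}

$j = endsWith('caresses', 7, 'sses'); // 3, so the stem part is 'care'
$none = endsWith('cats', 3, 'sses');  // null
```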
m()
m() measures the number of consonant sequences between 0 and j. If c is a consonant sequence and v a vowel sequence, and [.] indicates arbitrary presence:
[c][v] gives 0
[c]vc[v] gives 1
[c]vcvc[v] gives 2
[c]vcvcvc[v] gives 3
....
private
static m() : mixed
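One way to compute this measure is to count how many times a vowel is immediately followed by a consonant in positions 0 through j, which gives the number of vc pairs in the [c](vc)...[v] form. The sketch below uses that formulation rather than the loop structure of Porter's C code. The names isConsonant and measure and the explicit $buffer and $j arguments are assumptions made for a standalone example.

```php
<?php
/** Porter consonant test (y is a consonant at index 0 or after a vowel). */
function isConsonant(string $buffer, int $i): bool
{
    if (strpos('aeiou', $buffer[$i]) !== false) {
        return false;
    }
    if ($buffer[$i] === 'y') {
        return ($i == 0) ? true : !isConsonant($buffer, $i - 1);
    }
    return true;
}

/** Counts vowel-to-consonant transitions in $buffer[0..$j]. */
function measure(string $buffer, int $j): int
{
    $count = 0;
    $previous_was_vowel = false;
    for ($i = 0; $i <= $j; $i++) {
        $is_consonant = isConsonant($buffer, $i);
        if ($is_consonant && $previous_was_vowel) {
            $count++; // a vowel sequence just ended in a consonant
        }
        $previous_was_vowel = !$is_consonant;
    }
    return $count;
}

$m_tree = measure('tree', 3);       // 0
$m_trouble = measure('trouble', 6); // 1
$m_private = measure('private', 6); // 2
```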
Return values
mixed
r()
Sets the ending in the buffer to $s if the number of consonant sequences between 0 and $j is positive, that is, if m() > 0.
private
static r(string $s) : mixed
Parameters
- $s : string
-
what to change the suffix to
Return values
mixed
setto()
setto($s) sets (j+1),...k to the characters in the string $s, readjusting k.
private
static setto(string $s) : mixed
Parameters
- $s : string
-
string to modify the end of buffer with
Return values
mixed
stemPhrase()
Given an English phrase produces a phrase where each of the terms has been stemmed
private
static stemPhrase(string $phrase) : string
Parameters
- $phrase : string
-
phrase to stem
Return values
string —in which each term has been stemmed according to the English stemmer
step1ab()
step1ab() gets rid of plurals and -ed or -ing. e.g.
private
static step1ab() : mixed
caresses -> caress
ponies -> poni
ties -> ti
caress -> caress
cats -> cat
feed -> feed
agreed -> agree
disabled -> disable
matting -> mat
mating -> mate
meeting -> meet
milling -> mill
messing -> mess
meetings -> meet
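The plural-handling half of this step (often called step 1a) can be sketched on its own with plain string functions. This covers only the first five examples above. The -ed and -ing half depends on the measure, vowel-in-stem, double consonant, and cvc checks and is not shown. The function name step1aSketch is hypothetical.

```php
<?php
/**
 * Sketch of Porter step 1a (plurals only), operating on a whole word
 * instead of the class's shared buffer.
 */
function step1aSketch(string $word): string
{
    if (substr($word, -1) !== 's') {
        return $word;
    }
    if (substr($word, -4) === 'sses') {
        return substr($word, 0, -2); // sses -> ss
    }
    if (substr($word, -3) === 'ies') {
        return substr($word, 0, -2); // ies -> i
    }
    if (substr($word, -2) === 'ss') {
        return $word;                // ss stays as it is
    }
    return substr($word, 0, -1);     // drop a lone final s
}

$stems = array_map('step1aSketch', ['caresses', 'ponies', 'ties', 'caress', 'cats']);
// $stems is ['caress', 'poni', 'ti', 'caress', 'cat']
```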
Return values
mixed
step1c()
step1c() turns terminal y to i when there is another vowel in the stem.
private
static step1c() : mixed
Return values
mixed
step2()
step2() maps double suffixes to single ones, so -ization ( = -ize plus -ation) maps to -ize etc. Note that the string before the suffix must give m() > 0.
private
static step2() : mixed
Return values
mixed
step3()
step3() deals with -ic-, -full, -ness etc. using a similar strategy to step2.
private
static step3() : mixed
Return values
mixed
step4()
step4() takes off -ant, -ence etc., in context <c>vcvc<v>.
private
static step4() : mixed
Return values
mixed
step5()
step5() removes a final -e if m() > 1, and changes -ll to -l if m() > 1.
private
static step5() : mixed
Return values
mixed
taggedPartOfSpeechTokensToString()
Takes an array of pairs (token, tag) that came from phrase and builds a new phrase where terms look like token~tag.
private
static taggedPartOfSpeechTokensToString(array<string|int, mixed> $tagged_tokens[, bool $with_tokens = true ]) : string
Parameters
- $tagged_tokens : array<string|int, mixed>
-
array of pairs as might come from tagTokenizePartOfSpeech
- $with_tokens : bool = true
-
whether to include the terms and the tags in the output string or just the part of speech tags
Return values
string —$tagged_phrase a phrase with terms in the format token~tag ($with_tokens == true) or space separated tags (!$with_tokens).
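The two output formats can be illustrated with a short sketch. The "token" and "tag" keys follow the pair format documented for tagTokenizePartOfSpeech(); the function name taggedTokensToStringSketch and the sample tags are assumptions.

```php
<?php
/**
 * Hypothetical sketch: joins (token, tag) pairs into token~tag terms,
 * or into tags only when $with_tokens is false.
 */
function taggedTokensToStringSketch(array $tagged_tokens, bool $with_tokens = true): string
{
    $parts = [];
    foreach ($tagged_tokens as $pair) {
        $parts[] = $with_tokens
            ? $pair['token'] . '~' . $pair['tag']
            : $pair['tag'];
    }
    return implode(' ', $parts);
}

$tagged = [
    ['token' => 'soccer', 'tag' => 'NN'],
    ['token' => 'played', 'tag' => 'VBN'],
];
$with = taggedTokensToStringSketch($tagged);          // 'soccer~NN played~VBN'
$without = taggedTokensToStringSketch($tagged, false); // 'NN VBN'
```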
vowelinstem()
Checks if 0,...$j contains a vowel
private
static vowelinstem() : bool
Return values
bool —whether it does
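A vowel-in-stem check is a linear scan over positions 0 through $j using the consonant test. The sketch below passes the buffer and $j explicitly, and the names isConsonant and vowelInStem are hypothetical.

```php
<?php
/** Porter consonant test (y is a consonant at index 0 or after a vowel). */
function isConsonant(string $buffer, int $i): bool
{
    if (strpos('aeiou', $buffer[$i]) !== false) {
        return false;
    }
    if ($buffer[$i] === 'y') {
        return ($i == 0) ? true : !isConsonant($buffer, $i - 1);
    }
    return true;
}

/** True if any position in $buffer[0..$j] is a vowel. */
function vowelInStem(string $buffer, int $j): bool
{
    for ($i = 0; $i <= $j; $i++) {
        if (!isConsonant($buffer, $i)) {
            return true;
        }
    }
    return false;
}

$plaster = vowelInStem('plastered', 6); // true: stem 'plaster' has vowels
$bl = vowelInStem('bled', 1);           // false: stem 'bl' has no vowel
```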