Yioop_V9.5_Source_Code

Tokenizer
in package

Application

This class has a collection of methods for English locale-specific tokenization. In particular, it has a stemmer, a stop word remover (used mainly in word cloud creation), and a part-of-speech tagger (for question answering). The stemmer is my stab at implementing the Porter Stemmer algorithm presented at http://tartarus.org/~martin/PorterStemmer/def.txt. The code is based on the non-thread-safe C version given by Martin Porter.

Since PHP is single-threaded, this should be okay. Given a word, its stem is the part of the word that is common to all its inflected variants. For example, tall is common to tall, taller, tallest. A stemmer takes a word and tries to produce its stem.

$adjective_type

List of adjective-like parts of speech that might appear in lexicon file


    public
    static    mixed
    $adjective_type
     = ["JJ", "JJR", "JJS"]

$adverb_type

List of adverb-like parts of speech that might appear in lexicon file


    public
    static    mixed
    $adverb_type
     = ["RB", "RBR", "RBS"]

$conjunction_type

List of conjunction-like parts of speech that might appear in lexicon file


    public
    static    mixed
    $conjunction_type
     = ["CC"]

$determiner_type

List of determiner-like parts of speech that might appear in lexicon file


    public
    static    mixed
    $determiner_type
     = ["DT", "PDT"]

$no_stem_list

Words we don't want to be stemmed


    public
    static    array<string|int, mixed>
    $no_stem_list
     = ["titanic", "programming", "fishing", 'ins', "blues", "factorial", "pbs"]
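
A minimal sketch of how such a list could guard a stemmer. The function name stemUnlessListedSketch and the toy stemmer are hypothetical and are not part of this class; how stem() actually consults $no_stem_list is an assumption here.

```php
<?php
// Hypothetical helper, not part of the documented Tokenizer class.
// Returns the word untouched if it is listed, otherwise stems it.
function stemUnlessListedSketch(string $word, array $no_stem_list,
    callable $stemmer): string
{
    if (in_array(strtolower($word), $no_stem_list, true)) {
        return $word;
    }
    return $stemmer($word);
}

// List copied from $no_stem_list above.
$no_stem_list = ["titanic", "programming", "fishing", 'ins', "blues",
    "factorial", "pbs"];
// Toy stand-in for a real stemmer: it only strips a trailing "ing".
$toy_stemmer = function (string $word): string {
    return preg_replace('/ing$/', '', $word);
};

echo stemUnlessListedSketch("fishing", $no_stem_list, $toy_stemmer), "\n";
echo stemUnlessListedSketch("jumping", $no_stem_list, $toy_stemmer), "\n";
```

With the toy stemmer, the first call leaves fishing alone because it is listed, while the second reduces jumping to jump.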

$noun_type

List of noun-like parts of speech that might appear in lexicon file


    public
    static    mixed
    $noun_type
     = ["NN", "NNS", "NNP", "NNPS", "PRP"]

$question_token

Any unique identifier corresponding to the component of a triplet which can be answered using a question answer list


    public
    static    mixed
    $question_token
     = "qqq"

$semantic_rewrites

Phrases we would like Yioop to rewrite before performing a query


    public
    static    array<string|int, mixed>
    $semantic_rewrites
     = ["ins" => 'uscis', "mimetype" => 'mime', "military" => 'armed forces', 'full metal alchemist' => 'fullmetal alchemist', 'bruce schnier' => 'bruce schneier', 'dragonball' => 'dragon ball']
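
As an illustration only, a rewrite map like this could be applied to a query as sketched here. The helper name applySemanticRewritesSketch is hypothetical, and whole-word, case-insensitive matching is an assumption rather than documented behavior.

```php
<?php
// Hypothetical helper, not part of the documented Tokenizer class.
function applySemanticRewritesSketch(string $query, array $rewrites): string
{
    foreach ($rewrites as $phrase => $replacement) {
        // Assumption: match whole words or phrases only, ignoring case.
        $pattern = '/\b' . preg_quote($phrase, '/') . '\b/i';
        $query = preg_replace($pattern, $replacement, $query);
    }
    return $query;
}

// Map copied from $semantic_rewrites above.
$rewrites = ["ins" => 'uscis', "mimetype" => 'mime',
    "military" => 'armed forces',
    'full metal alchemist' => 'fullmetal alchemist',
    'bruce schnier' => 'bruce schneier', 'dragonball' => 'dragon ball'];

echo applySemanticRewritesSketch("dragonball episodes", $rewrites), "\n";
```

The word-boundary anchors keep a short key such as ins from rewriting the inside of a longer word such as insurance.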

$stop_words

A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries


    public
    static    mixed
    $stop_words
     = ['a', 'able', 'about', 'above', 'abst', 'accordance', 'according', 'based', 'accordingly', 'across', 'act', 'actually', 'added', 'adj', 'affected', 'affecting', 'affects', 'after', 'afterwards', 'again', 'against', 'ah', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'an', 'and', 'announce', 'another', 'any', 'anybody', 'anyhow', 'anymore', 'anyone', 'anything', 'anyway', 'anyways', 'anywhere', 'apparently', 'approximately', 'are', 'aren', 'arent', 'arise', 'around', 'as', 'aside', 'ask', 'asking', 'at', 'auth', 'available', 'away', 'awfully', 'b', 'back', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'begin', 'beginning', 'beginnings', 'begins', 'behind', 'being', 'believe', 'below', 'beside', 'besides', 'between', 'beyond', 'biol', 'both', 'brief', 'briefly', 'but', 'by', 'c', 'ca', 'came', 'can', 'cannot', 'cant', 'cause', 'causes', 'certain', 'certainly', 'click', 'co', 'com', 'come', 'comment', 'comments', 'comes', 'contain', 'containing', 'contains', 'could', 'couldnt', 'd', 'date', 'did', 'didnt', 'different', 'do', 'does', 'doesnt', 'doing', 'done', 'dont', 'down', 'downwards', 'due', 'during', 'e', 'each', 'ed', 'edu', 'effect', 'eg', 'eight', 'eighty', 'either', 'else', 'elsewhere', 'end', 'ending', 'enough', 'especially', 'et', 'et-al', 'etc', 'even', 'ever', 'every', 'everybody', 'everyone', 'everything', 'everywhere', 'ex', 'except', 'f', 'far', 'few', 'ff', 'fifth', 'first', 'five', 'fix', 'followed', 'following', 'follows', 'for', 'former', 'formerly', 'forth', 'found', 'four', 'from', 'further', 'furthermore', 'g', 'gave', 'get', 'gets', 'getting', 'give', 'given', 'gives', 'giving', 'go', 'goes', 'gone', 'got', 'gotten', 'h', 'had', 'happens', 'hardly', 'has', 'hasnt', 'have', 'havent', 'having', 'he', 'hed', 'hence', 'her', 'here', 'hereafter', 'hereby', 'herein', 'heres', 'hereupon', 'hers', 'herself', 'hes', 'hi', 'hid', 'him', 'himself', 
'his', 'hither', 'home', 'how', 'howbeit', 'however', 'http', 'https', 'hundred', 'i', 'id', 'ie', 'if', 'ill', 'im', 'immediate', 'immediately', 'importance', 'important', 'in', 'inc', 'indeed', 'index', 'information', 'instead', 'into', 'invention', 'inward', 'is', 'isnt', 'it', 'itd', 'itll', 'its', 'itself', 'ive', 'j', 'just', 'k', 'keep', 'keeps', 'kept', 'kg', 'km', 'know', 'known', 'knows', 'l', 'largely', 'last', 'lately', 'later', 'latter', 'latterly', 'least', 'less', 'lest', 'let', 'lets', 'like', 'liked', 'likely', 'line', 'little', 'll', 'look', 'looking', 'looks', 'ltd', 'm', 'made', 'mainly', 'make', 'makes', 'many', 'may', 'maybe', 'me', 'mean', 'means', 'meantime', 'meanwhile', 'merely', 'mg', 'might', 'million', 'miss', 'ml', 'more', 'moreover', 'most', 'mostly', 'mr', 'mrs', 'much', 'mug', 'must', 'my', 'myself', 'n', 'na', 'name', 'namely', 'nay', 'nd', 'near', 'nearly', 'necessarily', 'necessary', 'need', 'needs', 'neither', 'never', 'nevertheless', 'new', 'next', 'nine', 'ninety', 'no', 'nobody', 'non', 'none', 'nonetheless', 'noone', 'nor', 'normally', 'nos', 'not', 'noted', 'nothing', 'now', 'nowhere', 'o', 'obtain', 'obtained', 'obviously', 'of', 'off', 'often', 'oh', 'ok', 'okay', 'old', 'omitted', 'on', 'once', 'one', 'ones', 'only', 'onto', 'or', 'ord', 'other', 'others', 'otherwise', 'ought', 'our', 'ours', 'ourselves', 'out', 'outside', 'over', 'overall', 'owing', 'own', 'p', 'page', 'pages', 'part', 'particular', 'particularly', 'past', 'per', 'perhaps', 'placed', 'please', 'plus', 'poorly', 'possible', 'possibly', 'potentially', 'pp', 'predominantly', 'present', 'previously', 'primarily', 'probably', 'promptly', 'proud', 'provides', 'put', 'q', 'que', 'quickly', 'quite', 'qv', 'quot', 'r', 'ran', 'rather', 'rd', 're', 'readily', 'really', 'recent', 'recently', 'ref', 'refs', 'regarding', 'regardless', 'regards', 'related', 'relatively', 'research', 'respectively', 'resulted', 'resulting', 'results', 'right', 'run', 's', 'said', 
'same', 'saw', 'say', 'saying', 'says', 'sec', 'section', 'see', 'seeing', 'seem', 'seemed', 'seeming', 'seems', 'seen', 'self', 'selves', 'sent', 'seven', 'several', 'shall', 'she', 'shed', 'shell', 'shes', 'should', 'shouldnt', 'show', 'showed', 'shown', 'showns', 'shows', 'significant', 'significantly', 'similar', 'similarly', 'since', 'six', 'slightly', 'so', 'some', 'somebody', 'somehow', 'someone', 'somethan', 'something', 'sometime', 'sometimes', 'somewhat', 'somewhere', 'soon', 'sorry', 'specifically', 'specified', 'specify', 'specifying', 'still', 'stop', 'strongly', 'sub', 'substantially', 'successfully', 'such', 'sufficiently', 'suggest', 'sup', 'sure', 't', 'take', 'taken', 'taking', 'tell', 'tends', 'th', 'than', 'thank', 'thanks', 'thanx', 'that', 'thatll', 'thats', 'thatve', 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'thence', 'there', 'thereafter', 'thereby', 'thered', 'therefore', 'therein', 'therell', 'thereof', 'therere', 'theres', 'thereto', 'thereupon', 'thereve', 'these', 'they', 'theyd', 'theyll', 'theyre', 'theyve', 'think', 'this', 'those', 'thou', 'though', 'thoughh', 'thousand', 'throug', 'through', 'throughout', 'thru', 'thus', 'til', 'till', 'tip', 'to', 'together', 'too', 'took', 'toward', 'towards', 'tried', 'tries', 'truly', 'try', 'trying', 'ts', 'twice', 'two', 'u', 'un', 'under', 'unfortunately', 'unless', 'unlike', 'unlikely', 'until', 'unto', 'up', 'upon', 'ups', 'us', 'use', 'used', 'useful', 'usefully', 'usefulness', 'uses', 'using', 'usually', 'v', 'value', 'various', 've', 'very', 'via', 'viz', 'vol', 'vols', 'vs', 'w', 'want', 'wants', 'was', 'wasnt', 'way', 'we', 'wed', 'welcome', 'well', 'went', 'were', 'werent', 'weve', 'what', 'whatever', 'whatll', 'whats', 'when', 'whence', 'whenever', 'where', 'whereafter', 'whereas', 'whereby', 'wherein', 'wheres', 'whereupon', 'wherever', 'whether', 'which', 'while', 'whim', 'whither', 'who', 'whod', 'whoever', 'whole', 'wholl', 'whom', 'whomever', 'whos', 'whose', 
'why', 'widely', 'will', 'willing', 'wish', 'with', 'within', 'without', 'wont', 'words', 'world', 'would', 'wouldnt', 'www', 'x', 'y', 'yes', 'yet', 'you', 'youd', 'youll', 'your', 'youre', 'yours', 'yourself', 'yourselves', 'youve', 'z', 'zero']

$verb_type

List of verb-like parts of speech that might appear in lexicon file


    public
    static    mixed
    $verb_type
     = ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ"]

$buffer

Storage used in computing the stem


    private
    static    string
    $buffer

$j

Index to start of the suffix of the word being considered for manipulation


    private
    static    int
    $j

$k

Index of the current end of the word at the current state of computing its stem


    private
    static    int
    $k

__construct()

Do any global set up for tokenizer (none in the case of en-US)


    public
                    __construct() : mixed

Return values

mixed —

canonicalizePunctuatedTerms()

This method tries to handle punctuation in terms specific to the English language, such as abbreviations.


    public
                    canonicalizePunctuatedTerms(string &$string) : mixed

Parameters

$string : string: a string of words, etc which might involve such terms

Return values

mixed —

compressSentence()

Take in a sentence and try to compress it to a smaller version that "retains the most important information and remains grammatically correct" (Jing 2000).


    public
            static        compressSentence(string $sentence_to_compress) : string

Parameters

$sentence_to_compress : string: the sentence to compress

Return values

string —

the compressed sentence

extractDeepestSpeechPartPhrase()

Takes a phrase tree $tree and a part of speech $pos and returns the deepest $pos-only path in the tree.


    public
            static        extractDeepestSpeechPartPhrase(array<string|int, mixed> $tree, string $pos) : string

Parameters

$tree : array<string|int, mixed>: phrase to extract type from
$pos : string: the part of speech to extract

Return values

string —

the label of deepest $pos only path in $tree

extractObjectParseTree()

Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW, the former having the object of the original phrase (as a string), the latter having the important parts of the object.


    public
            static        extractObjectParseTree(mixed $tree) : array<string|int, mixed>

Parameters

$tree : mixed

Return values

array<string|int, mixed> —

with two fields CONCISE and RAW as described above

extractPredicateParseTree()

Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW, the former having the predicate of the original phrase (as a string), the latter having the important parts of the predicate.


    public
            static        extractPredicateParseTree(mixed $tree) : array<string|int, mixed>

Parameters

$tree : mixed

Return values

array<string|int, mixed> —

with two fields CONCISE and RAW as described above

extractSubjectParseTree()

Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW, the former having the subject of the original phrase (as a string), the latter having the important parts of the subject.


    public
            static        extractSubjectParseTree(mixed $tree) : array<string|int, mixed>

Parameters

$tree : mixed

Return values

array<string|int, mixed> —

with two fields CONCISE and RAW as described above

extractTripletByType()

Takes a triplets array with subject, predicate, object fields with CONCISE, RAW subfields and produces a triplets array with a $type subfield (where $type is one of CONCISE and RAW) and with subject, predicate, object, and QUESTION_ANSWER_LIST subfields.


    public
            static        extractTripletByType(array<string|int, mixed> $sub_pred_obj_triplets, string $type) : array<string|int, mixed>

Parameters

$sub_pred_obj_triplets : array<string|int, mixed>: in format described above
$type : string: either CONCISE or RAW

Return values

array<string|int, mixed> —

$triplets in format described above

extractTripletsParseTree()

Takes a parse tree of a phrase and computes subject, predicate, and object arrays. Each of these arrays consists of two components, CONCISE and RAW: CONCISE corresponds to something more similar to the words in the original phrase, and RAW to the case where extraneous words have been removed.


    public
            static        extractTripletsParseTree(array<string|int, mixed> $tree) : array<string|int, mixed>

Parameters

$tree : array<string|int, mixed>: a parse tree for a sentence

Return values

array<string|int, mixed> —

triplet array

extractTripletsPhrases()

Scans a word list for phrases. For phrases found, generates a list of question and answer pairs at two levels of granularity: CONCISE (using all terms in the original phrase) and RAW (removing adjectives, etc.).


    public
            static        extractTripletsPhrases(array<string|int, mixed> $word_and_phrase_list) : array<string|int, mixed>

Parameters

$word_and_phrase_list : array<string|int, mixed>: of statements

Return values

array<string|int, mixed> —

with two fields: QUESTION_LIST consisting of triplets (SUBJECT, PREDICATES, OBJECT) where one of the components has been replaced with a question marker.

isQuestion()

Takes a phrase query entered by a user and returns true if it is a question and false if not.


    public
                    isQuestion($phrase) : bool

Parameters

$phrase : any statement

Return values

bool —

true if the statement is a question
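
A rough sketch of one way such a check could work. The function name looksLikeQuestionSketch and the list of question words are assumptions; the actual patterns used by isQuestion() are not given on this page.

```php
<?php
// Hypothetical stand-alone check, not the documented isQuestion() method.
function looksLikeQuestionSketch(string $phrase): bool
{
    // Assumption: a question starts with one of these English words.
    $pattern = '/^\s*(who|whom|whose|what|which|when|where|why|how)\b/i';
    return preg_match($pattern, $phrase) === 1;
}

var_dump(looksLikeQuestionSketch("Where is soccer played?"));
var_dump(looksLikeQuestionSketch("Soccer is played in Brazil"));
```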

parseAdjective()

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for an adjective if possible


    public
            static        parseAdjective(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>

Parameters

$tagged_phrase : array<string|int, mixed>: an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
$tree : array<string|int, mixed>: that consists of ["cur_node" => current parse position in $tagged_phrase]

Return values

array<string|int, mixed> —

has fields "cur_node", the index of how far we parsed $tagged_phrase, and "JJ", a subarray with a token node for the adjective that was parsed

parseAuxClause()

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for an auxiliary clause if possible


    public
            static        parseAuxClause(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>

Parameters

$tagged_phrase : array<string|int, mixed>: an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
$tree : array<string|int, mixed>: that consists of ["cur_node" => current parse position in $tagged_phrase]

Return values

array<string|int, mixed> —

has fields "cur_node" index of how far we parsed $tagged_phrase

parseDeterminer()

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a determiner if possible


    public
            static        parseDeterminer(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>

Parameters

$tagged_phrase : array<string|int, mixed>: an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
$tree : array<string|int, mixed>: that consists of ["cur_node" => current parse position in $tagged_phrase]

Return values

array<string|int, mixed> —

has fields "cur_node", the index of how far we parsed $tagged_phrase, and "DT", a subarray with a token node for the determiner that was parsed

parseNoun()

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a noun if possible


    public
            static        parseNoun(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>

Parameters

$tagged_phrase : array<string|int, mixed>: an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
$tree : array<string|int, mixed>: that consists of ["cur_node" => current parse position in $tagged_phrase]

Return values

array<string|int, mixed> —

has fields "cur_node", the index of how far we parsed $tagged_phrase, and "NN", a subarray with a token node for the noun string that was parsed

parseNounPhrase()

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a noun phrase if possible


    public
            static        parseNounPhrase(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>

Parameters

$tagged_phrase : array<string|int, mixed>: an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
$tree : array<string|int, mixed>: that consists of ["cur_node" => current parse position in $tagged_phrase]

Return values

array<string|int, mixed> —

has fields "cur_node", the index of how far we parsed $tagged_phrase, and "NP", a subarray with possible fields "DT" with value a determiner subtree, "JJ" with value an adjective subtree, and "NN" with value a noun tree

parsePrepositionalPhrases()

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a sequence of prepositional phrases if possible


    public
            static        parsePrepositionalPhrases(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree[, int $index = 1 ]) : array<string|int, mixed>

Parameters

$tagged_phrase : array<string|int, mixed>: an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
$tree : array<string|int, mixed>: that consists of ["cur_node" => current parse position in $tagged_phrase]
$index : int = 1: which term in $tagged_phrase to start to try to parse a preposition from

Return values

array<string|int, mixed> —

has fields "cur_node", the index of how far we parsed $tagged_phrase, followed by additional possible fields (here i represents the ith clause found): "IN_i" with value a preposition subtree, "DT_i" with value a determiner subtree, "JJ_i" with value an adjective subtree, "NN_i" with value an additional noun subtree

parseTypeList()

Starting at the $cur_node in a $tagged_phrase parse tree for an English sentence, create a phrase string for each of the next nodes which belong to part of speech group $type.


    public
            static        parseTypeList(array<string|int, mixed> &$cur_node, array<string|int, mixed> $tagged_phrase, string $type) : string

Parameters

$cur_node : array<string|int, mixed>: node within parse tree
$tagged_phrase : array<string|int, mixed>: parse tree for phrase
$type : string: self::$noun_type, self::$verb_type, etc

Return values

string —

phrase string involving only terms of that $type
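
The scan described above can be sketched as follows. The function name collectTypeRunSketch is hypothetical; treating $cur_node as an integer position and $type as an array of tags (such as self::$noun_type) is an assumption made to keep the example self-contained.

```php
<?php
// Hypothetical stand-alone version, not the documented parseTypeList().
// Collects consecutive tokens whose tag belongs to $type_tags and
// advances $cur_node past them.
function collectTypeRunSketch(int &$cur_node, array $tagged_phrase,
    array $type_tags): string
{
    $tokens = [];
    while (isset($tagged_phrase[$cur_node]) &&
        in_array($tagged_phrase[$cur_node]["tag"], $type_tags, true)) {
        $tokens[] = $tagged_phrase[$cur_node]["token"];
        $cur_node++;
    }
    return implode(" ", $tokens);
}

// Tag list copied from $noun_type above.
$noun_type = ["NN", "NNS", "NNP", "NNPS", "PRP"];
$tagged = [
    ["token" => "soccer", "tag" => "NN"],
    ["token" => "fans", "tag" => "NNS"],
    ["token" => "cheer", "tag" => "VBP"],
];
$position = 0;
echo collectTypeRunSketch($position, $tagged, $noun_type), "\n";
echo $position, "\n";
```

Here the two leading nouns are joined into one phrase string and the position stops at the verb.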

parseVerb()

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a verb if possible


    public
            static        parseVerb(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>

Parameters

$tagged_phrase : array<string|int, mixed>: an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
$tree : array<string|int, mixed>: that consists of ["cur_node" => current parse position in $tagged_phrase]

Return values

array<string|int, mixed> —

has fields "cur_node", the index of how far we parsed $tagged_phrase, and "VB", a subarray with a token node for the verb string that was parsed

parseVerbPhrase()

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a verb phrase if possible


    public
            static        parseVerbPhrase(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>

Parameters

$tagged_phrase : array<string|int, mixed>: an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
$tree : array<string|int, mixed>: that consists of ["cur_node" => current parse position in $tagged_phrase]

Return values

array<string|int, mixed> —

has fields "cur_node", the index of how far we parsed $tagged_phrase, and "VP", a subarray with possible fields "VB" with value a verb subtree and "NP" with value a noun phrase subtree

parseWholePhrase()

Given a part-of-speech tagged phrase array, generates a parse tree for the phrase using a recursive descent parser.


    public
            static        parseWholePhrase(array<string|int, mixed> $tagged_phrase,  $tree) : array<string|int, mixed>

Parameters

$tagged_phrase : array<string|int, mixed>: an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
$tree : that consists of ["cur_node" => current parse position in $tagged_phrase]

Return values

array<string|int, mixed> —

used to represent a tree. The array has up to three fields: $tree["cur_node"], the index of how far we parsed our $tagged_phrase; $tree["NP"], which contains a subtree for a noun phrase; and $tree["VP"], which contains a subtree for a verb phrase
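
A simplified sketch of a recursive descent parse over a tagged phrase, with one function per grammar rule. All function names ending in Sketch, the constant SKETCH_TAGS, and the tiny grammar (a noun phrase followed by a verb phrase) are assumptions for illustration; the documented parser also handles prepositional phrases and stores token nodes rather than plain strings.

```php
<?php
// Tag groups copied from the properties documented above.
const SKETCH_TAGS = [
    "DT" => ["DT", "PDT"],
    "JJ" => ["JJ", "JJR", "JJS"],
    "NN" => ["NN", "NNS", "NNP", "NNPS", "PRP"],
    "VB" => ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ"],
];
// Consumes consecutive tokens whose tag is in the group $label.
function parseRunSketch(array $tagged_phrase, int $start,
    string $label): array
{
    $cur = $start;
    $tokens = [];
    while (isset($tagged_phrase[$cur]) &&
        in_array($tagged_phrase[$cur]["tag"], SKETCH_TAGS[$label], true)) {
        $tokens[] = $tagged_phrase[$cur]["token"];
        $cur++;
    }
    $tree = ["cur_node" => $cur];
    if ($tokens) {
        $tree[$label] = implode(" ", $tokens);
    }
    return $tree;
}
// NP -> [DT] [JJ] NN; returns no "NP" field if no noun is found.
function parseNounPhraseSketch(array $tagged_phrase, int $start): array
{
    $cur = $start;
    $np = [];
    foreach (["DT", "JJ", "NN"] as $label) {
        $sub = parseRunSketch($tagged_phrase, $cur, $label);
        $cur = $sub["cur_node"];
        if (isset($sub[$label])) {
            $np[$label] = $sub[$label];
        }
    }
    if (!isset($np["NN"])) {
        return ["cur_node" => $start];
    }
    return ["cur_node" => $cur, "NP" => $np];
}
// VP -> VB [NP]
function parseVerbPhraseSketch(array $tagged_phrase, int $start): array
{
    $verb = parseRunSketch($tagged_phrase, $start, "VB");
    if (!isset($verb["VB"])) {
        return ["cur_node" => $start];
    }
    $vp = ["VB" => $verb["VB"]];
    $object = parseNounPhraseSketch($tagged_phrase, $verb["cur_node"]);
    if (isset($object["NP"])) {
        $vp["NP"] = $object["NP"];
    }
    return ["cur_node" => $object["cur_node"], "VP" => $vp];
}
// S -> NP VP
function parseWholePhraseSketch(array $tagged_phrase): array
{
    $tree = parseNounPhraseSketch($tagged_phrase, 0);
    $vp = parseVerbPhraseSketch($tagged_phrase, $tree["cur_node"]);
    $tree["cur_node"] = $vp["cur_node"];
    if (isset($vp["VP"])) {
        $tree["VP"] = $vp["VP"];
    }
    return $tree;
}

$tagged = [
    ["token" => "the", "tag" => "DT"], ["token" => "tall", "tag" => "JJ"],
    ["token" => "player", "tag" => "NN"], ["token" => "kicks", "tag" => "VBZ"],
    ["token" => "a", "tag" => "DT"], ["token" => "ball", "tag" => "NN"],
];
print_r(parseWholePhraseSketch($tagged));
```

Each rule returns how far it got in "cur_node", so the caller can continue from there or fall back to its starting position when the rule does not apply.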

parseWhoQuestion()

Takes a tagged question string that starts with Who and returns a question triplet from the question string.


    public
            static        parseWhoQuestion(string $tagged_question, int $index) : array<string|int, mixed>

Parameters

$tagged_question : string: part-of-speech tagged question
$index : int: current index in statement

Return values

array<string|int, mixed> —

parsed triplet

parseWHPlusQuestion()

Takes a tagged question string that starts with a Wh+ word other than Who and returns a question triplet from the question string. Unlike the WHO case, here we assume there is an auxiliary verb followed by a noun phrase, then the rest of the verb phrase. For example, Where is soccer played?


    public
            static        parseWHPlusQuestion(string $tagged_question,  $index) : array<string|int, mixed>

Parameters

$tagged_question : string: part-of-speech tagged question
$index : current index in statement

Return values

array<string|int, mixed> —

parsed triplet suitable for query look-up

questionParser()

Takes any question that starts with a WH question word and returns the triplet from the question.


    public
            static        questionParser(string $question) : array<string|int, mixed>

Parameters

$question : string: question to parse

Return values

array<string|int, mixed> —

question triplet

rearrangeTripletsByType()

Takes a triplets array with subject, predicate, object fields with CONCISE and RAW subfields and rearranges it to have two fields CONCISE and RAW with subject, predicate, object, and QUESTION_ANSWER_LIST subfields


    public
            static        rearrangeTripletsByType(array<string|int, mixed> $sub_pred_obj_triplets) : array<string|int, mixed>

Parameters

$sub_pred_obj_triplets : array<string|int, mixed>: in format described above

Return values

array<string|int, mixed> —

$processed_triplets in format described above

segment()

Stub function which could be used for a word segmenter.


    public
            static        segment(string $pre_segment) : string

Such a segmenter on input thisisabunchofwords would output this is a bunch of words

Parameters

$pre_segment : string: before segmentation

Return values

string —

should return a string with words separated by spaces; in this case it does nothing
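
Since segment() is only a stub, the following is a sketch of what a simple dictionary-based segmenter could look like, not a description of Yioop's behavior. The function name segmentSketch and the dictionary parameter are hypothetical.

```php
<?php
// Hypothetical dictionary-based segmenter; the documented segment()
// is a stub that returns its input unchanged.
function segmentSketch(string $pre_segment, array $dictionary): string
{
    $words = array_flip($dictionary);
    $n = strlen($pre_segment);
    // $best[$i] holds one segmentation of the first $i characters,
    // or null if no segmentation into dictionary words exists.
    $best = array_fill(0, $n + 1, null);
    $best[0] = [];
    for ($i = 1; $i <= $n; $i++) {
        for ($j = 0; $j < $i; $j++) {
            $piece = substr($pre_segment, $j, $i - $j);
            if ($best[$j] !== null && isset($words[$piece])) {
                $best[$i] = array_merge($best[$j], [$piece]);
                break;
            }
        }
    }
    // Like the stub, fall back to the unchanged input on failure.
    return $best[$n] === null ? $pre_segment : implode(" ", $best[$n]);
}

$dictionary = ["this", "is", "a", "bunch", "of", "words"];
echo segmentSketch("thisisabunchofwords", $dictionary), "\n";
```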

stem()

Computes the stem of an English word


    public
            static        stem(string $word) : string

For example, jumps, jumping, jumpy, all have jump as a stem

Parameters

$word : string: the string to stem

Return values

string —

the stem of $word

stopwordsRemover()

Removes the stop words from the page (used for Word Cloud generation)


    public
            static        stopwordsRemover(mixed $data) : mixed

Parameters

$data : mixed: either a string or an array of strings to remove stop words from

Return values

mixed —

$data with no stop words
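
A minimal sketch of stop word removal for either a string or an array of strings. The function name removeStopWordsSketch is hypothetical, the stop word list is a small subset of $stop_words, and splitting on whitespace with case-insensitive comparison is an assumption.

```php
<?php
// Hypothetical helper, not the documented stopwordsRemover() method.
function removeStopWordsSketch($data, array $stop_words)
{
    if (is_array($data)) {
        // Apply the same filtering to each string in the array.
        return array_map(function ($item) use ($stop_words) {
            return removeStopWordsSketch($item, $stop_words);
        }, $data);
    }
    $lookup = array_flip($stop_words); // for fast membership tests
    $terms = preg_split('/\s+/', trim($data), -1, PREG_SPLIT_NO_EMPTY);
    $kept = [];
    foreach ($terms as $term) {
        if (!isset($lookup[strtolower($term)])) {
            $kept[] = $term;
        }
    }
    return implode(" ", $kept);
}

// A few entries taken from $stop_words above.
$subset = ['a', 'is', 'of', 'the'];
echo removeStopWordsSketch("The stem of a word", $subset), "\n";
```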

tagPartsOfSpeechPhrase()

Takes a phrase and tags each term in it with its part of speech.


    public
            static        tagPartsOfSpeechPhrase(string $phrase[, bool $with_tokens = true ]) : string

So each term in the original phrase gets mapped to term~part_of_speech. This tagger is based on a Brill tagger. It uses a lexicon consisting of words from the Brown corpus together with a list of the part-of-speech tags that each word had in the Brown corpus. These are used to get an initial part of speech (if a word is not present, we assume it is a noun). From this, a fixed set of rules is used to modify the initial tag if necessary.

Parameters

$phrase : string: text to add part-of-speech tags to
$with_tokens : bool = true: whether to include the terms and the tags in the output string or just the part of speech tags

Return values

string —

$tagged_phrase phrase where each term has ~part_of_speech appended ($with_tokens == true) or just space separated part_of_speech (!$with_tokens)
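
The initial tagging step and the two output formats can be sketched as follows. The function name tagPhraseSketch and the three-word lexicon are hypothetical, and the sketch leaves out the rule pass that a Brill-style tagger applies after the lexicon lookup.

```php
<?php
// Hypothetical helper covering only the lexicon lookup and formatting.
function tagPhraseSketch(string $phrase, array $lexicon,
    bool $with_tokens = true): string
{
    $terms = preg_split('/\s+/', trim($phrase), -1, PREG_SPLIT_NO_EMPTY);
    $out = [];
    foreach ($terms as $term) {
        // Words missing from the lexicon are assumed to be nouns.
        $tag = $lexicon[strtolower($term)] ?? "NN";
        $out[] = $with_tokens ? $term . "~" . $tag : $tag;
    }
    return implode(" ", $out);
}

// Toy lexicon for illustration; the real one comes from a lexicon file.
$toy_lexicon = ["the" => "DT", "plays" => "VBZ", "quickly" => "RB"];
echo tagPhraseSketch("the striker plays quickly", $toy_lexicon), "\n";
echo tagPhraseSketch("the striker plays quickly", $toy_lexicon, false), "\n";
```

The first call prints term~tag pairs and the second prints only the space-separated tags, mirroring the $with_tokens switch described above.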

tagTokenizePartOfSpeech()

Split input text into terms and output an array with one element per term, that element consisting of array with the term token and the part of speech tag.


    public
            static        tagTokenizePartOfSpeech(string $text) : array<string|int, mixed>

Parameters

$text : string: string to tag and tokenize

Return values

array<string|int, mixed> —

of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term), one for each token in $text

cons()

Checks to see if the ith character in the buffer is a consonant


    private
            static        cons(int $i) : bool

Parameters

$i : int: index of the character to check

Return values

bool —

whether the ith character is a consonant
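
The consonant test in the Porter algorithm treats y specially: it counts as a consonant at the start of a word or after a vowel. A stand-alone sketch follows; the function name isConsonantSketch is hypothetical, and the word is passed explicitly instead of being read from the shared buffer.

```php
<?php
// Hypothetical stand-alone version of the consonant test.
function isConsonantSketch(string $word, int $i): bool
{
    switch ($word[$i]) {
        case 'a':
        case 'e':
        case 'i':
        case 'o':
        case 'u':
            return false;
        case 'y':
            // y is a consonant at the start of the word or after a vowel.
            return ($i == 0) ? true : !isConsonantSketch($word, $i - 1);
        default:
            return true;
    }
}

var_dump(isConsonantSketch("toy", 2));    // y follows the vowel o
var_dump(isConsonantSketch("syzygy", 1)); // y follows the consonant s
```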

cvc()

Checks whether the letters at the indices $i-2, $i-1, $i in the buffer have the form consonant - vowel - consonant and also whether the second consonant is not w, x, or y. This is used when trying to restore an e at the end of a short word, e.g.


    private
            static        cvc(int $i) : bool

  cav(e), lov(e), hop(e), crim(e), but
  snow, box, tray.

Parameters

$i : int: position to check in buffer for consonant-vowel-consonant

Return values

bool —

whether the letters at indices have the given form
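
The check can be sketched on a plain string, using the example words listed above. The names cvcSketch and isConsonantForCvc are hypothetical; the documented method works on the shared buffer rather than on a word argument.

```php
<?php
// Hypothetical stand-alone helpers, not the documented private methods.
function isConsonantForCvc(string $word, int $i): bool
{
    if (strpos("aeiou", $word[$i]) !== false) {
        return false;
    }
    if ($word[$i] === 'y') {
        // y is a consonant at the start of the word or after a vowel.
        return $i == 0 ? true : !isConsonantForCvc($word, $i - 1);
    }
    return true;
}
// True if positions $i-2, $i-1, $i are consonant, vowel, consonant
// and the final consonant is not w, x, or y.
function cvcSketch(string $word, int $i): bool
{
    if ($i < 2 || !isConsonantForCvc($word, $i) ||
        isConsonantForCvc($word, $i - 1) ||
        !isConsonantForCvc($word, $i - 2)) {
        return false;
    }
    return !in_array($word[$i], ['w', 'x', 'y'], true);
}

foreach (["cav", "lov", "hop", "crim", "snow", "box", "tray"] as $word) {
    $result = cvcSketch($word, strlen($word) - 1);
    echo $word, ": ", var_export($result, true), "\n";
}
```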

doublec()

Checks if $j,($j-1) contain a double consonant.


    private
            static        doublec(int $j) : bool

Parameters

$j : int: position to check in buffer for double consonant

Return values

bool —

whether it does or not

ends()

Checks if the buffer currently ends with the string $s


    private
            static        ends(string $s) : bool

Parameters

$s : string: string to use for check

Return values

bool —

whether buffer currently ends with $s

m()

m() measures the number of consonant sequences between 0 and j. If c is a consonant sequence and v a vowel sequence, and [.] indicates arbitrary presence,

   [c][v]        gives 0
   [c]vc[v]      gives 1
   [c]vcvc[v]    gives 2
   [c]vcvcvc[v]  gives 3
   ....


    private
            static        m() : mixed

Return values

mixed —
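
One way to compute this measure is to count the places where a vowel is immediately followed by a consonant. The sketch below takes that approach on a plain string; the names measureSketch and isConsonantAt are hypothetical, and the documented method reads the shared buffer and $j instead of taking arguments.

```php
<?php
// Hypothetical stand-alone helpers, not the documented private methods.
function isConsonantAt(string $word, int $i): bool
{
    if (strpos("aeiou", $word[$i]) !== false) {
        return false;
    }
    if ($word[$i] === 'y') {
        // y is a consonant at the start of the word or after a vowel.
        return $i == 0 ? true : !isConsonantAt($word, $i - 1);
    }
    return true;
}
// Counts vc pairs in positions 0..$j (inclusive) of $word.
function measureSketch(string $word, int $j): int
{
    $count = 0;
    for ($i = 1; $i <= $j; $i++) {
        // Each vowel directly followed by a consonant closes one vc pair.
        if (!isConsonantAt($word, $i - 1) && isConsonantAt($word, $i)) {
            $count++;
        }
    }
    return $count;
}

foreach (["tree", "trouble", "private"] as $word) {
    echo $word, ": ", measureSketch($word, strlen($word) - 1), "\n";
}
```

For the three sample words this gives measures 0, 1, and 2 respectively.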

r()

Sets the ending in the buffer to $s if m() > 0, that is, if the number of consonant sequences between 0 and $j is positive.


    private
            static        r(string $s) : mixed

Parameters

$s : string: what to change the suffix to

Return values

mixed —

setto()

setto($s) sets (j+1),...k to the characters in the string $s, readjusting k.


    private
            static        setto(string $s) : mixed

Parameters

$s : string: string to modify the end of buffer with

Return values

mixed —

stemPhrase()

Given an English phrase produces a phrase where each of the terms has been stemmed


    private
            static        stemPhrase(string $phrase) : string

Parameters

$phrase : string: phrase to stem

Return values

string —

in which each term has been stemmed according to the English stemmer

step1ab()

step1ab() gets rid of plurals and -ed or -ing. e.g.


    private
            static        step1ab() : mixed

   caresses  ->  caress
   ponies    ->  poni
   ties      ->  ti
   caress    ->  caress
   cats      ->  cat

   feed      ->  feed
   agreed    ->  agree
   disabled  ->  disable

   matting   ->  mat
   mating    ->  mate
   meeting   ->  meet
   milling   ->  mill
   messing   ->  mess

   meetings  ->  meet

Return values

mixed —
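
The plural-handling part of this step can be sketched on a plain string. The function name step1aSketch is hypothetical, and the sketch does not cover the -ed and -ing rules that the documented method also applies.

```php
<?php
// Hypothetical stand-alone version of the plural rules only.
function step1aSketch(string $word): string
{
    if (strlen($word) < 3) {
        return $word; // very short words are left alone
    }
    if (substr($word, -4) === "sses") {
        return substr($word, 0, -2);        // caresses -> caress
    }
    if (substr($word, -3) === "ies") {
        return substr($word, 0, -3) . "i";  // ponies -> poni
    }
    if (substr($word, -1) === "s" && substr($word, -2) !== "ss") {
        return substr($word, 0, -1);        // cats -> cat
    }
    return $word;                           // caress -> caress
}

foreach (["caresses", "ponies", "ties", "caress", "cats"] as $word) {
    echo $word, " -> ", step1aSketch($word), "\n";
}
```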

step1c()

step1c() turns terminal y to i when there is another vowel in the stem.


    private
            static        step1c() : mixed

Return values

mixed —

step2()

step2() maps double suffixes to single ones, so -ization ( = -ize plus -ation) maps to -ize, etc. Note that the string before the suffix must give m() > 0.


    private
            static        step2() : mixed

Return values

mixed —

step3()

step3() deals with -ic-, -ful, -ness, etc., using a strategy similar to step2().


    private
            static        step3() : mixed

Return values

mixed —

step4()

step4() takes off -ant, -ence etc., in context <c>vcvc<v>.


    private
            static        step4() : mixed

Return values

mixed —

step5()

step5() removes a final -e if m() > 1, and changes -ll to -l if m() > 1.


    private
            static        step5() : mixed

Return values

mixed —

taggedPartOfSpeechTokensToString()

Takes an array of pairs (token, tag) that came from phrase and builds a new phrase where terms look like token~tag.


    private
            static        taggedPartOfSpeechTokensToString(array<string|int, mixed> $tagged_tokens[, bool $with_tokens = true ]) : string

Parameters

$tagged_tokens : array<string|int, mixed>: array of pairs as might come from tagTokenizePartOfSpeech()
$with_tokens : bool = true: whether to include the terms and the tags in the output string or just the part of speech tags

Return values

string —

$tagged_phrase a phrase with terms in the format token~tag ($with_tokens == true) or space-separated tags (!$with_tokens).

vowelinstem()

Checks if 0,...$j contains a vowel


    private
            static        vowelinstem() : bool

Return values

bool —

whether it does or not
