Tokenizer
Chinese-specific tokenization code. Typically, tokenizer.php either contains a stemmer for the language in question or specifies how many characters are in a char gram.
Table of Contents
- $adjective_type : array<string|int, mixed>
- List of adjective-like parts of speech that might appear in lexicon file Predicative adjective: VA other noun-modifier: JJ
- $adverb_type : array<string|int, mixed>
- List of adverb-like parts of speech that might appear in lexicon file
- $conjunction_type : array<string|int, mixed>
- List of conjunction-like parts of speech that might appear in lexicon file Coordinating conjunction: CC Subordinating conjunction: CS
- $determiner_type : mixed
- List of determiner-like parts of speech that might appear in lexicon file Determiner: DT Cardinal Number: CD Ordinal Number: OD Measure word: M
- $dot : string
- Dots used in Chinese Numbers
- $exception_list : array<string|int, mixed>
- Words that the regexes in isCardinalNumber, isOrdinalNumber, and isDate match but that should not be treated as numbers or dates. For example, "十分" usually means "very", but the functions would read it as "10 minutes", so it must be excluded.
- $non_char_preg : string
- Regular expression that matches a term when none of its characters belong to the current language.
- $noun_type : array<string|int, mixed>
- List of noun-like parts of speech that might appear in lexicon file Proper Noun: NR Temporal Noun: NT Other Noun: NN Pronoun: PN
- $num_dict : string
- Characters that can be used in Chinese numbers
- $num_end : string
- Characters that can appear at the end of a number
- $particle_type : array<string|int, mixed>
- List of particle-like parts of speech that might appear in lexicon file No meaning words that can appear anywhere
- $punctuation_preg : string
- Regular expression for characters that can be used as Chinese punctuation
- $question_token : string
- Any unique identifier corresponding to the component of a triplet which can be answered using a question answer list
- $question_words : array<string|int, mixed>
- Words array that determine if a sentence passed in is a question
- $stop_words : array<string|int, mixed>
- A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries. This is also used for language detection
- $verb_type : array<string|int, mixed>
- List of verb-like parts of speech that might appear in lexicon file Copula: VC you3 as the main verb: VE Other verb: VV Short passive voice: SB Long passive voice: LB
- $named_entity_tagger : object
- Named entity tagger instance used to recognize noun entities in Chinese text
- $pos_tagger : object
- PartOfSpeechContextTagger instance used in adding part of speech annotations to Chinese text
- $stochastic_term_segmenter : object
- StochasticTermSegmenter instance used for segmenting Chinese text
- $traditional_simplified_map : array<string|int, mixed>
- Holds an associative array whose keys are traditional characters and whose values are the corresponding simplified characters.
- extractDeepestSpeechPartPhrase() : string
- Takes phrase tree $tree and a part-of-speech $pos returns the deepest $pos only path in tree.
- extractObjectParseTree() : array<string|int, mixed>
- Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW: the former holds the object of the original phrase (as a string), the latter the important parts of the object
- extractPredicateParseTree() : array<string|int, mixed>
- Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW: the former holds the predicate of the original phrase (as a string), the latter the important parts of the predicate
- extractSubjectParseTree() : array<string|int, mixed>
- Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW: the former holds the subject of the original phrase (as a string), the latter the important parts of the subject
- extractTripletByType() : array<string|int, mixed>
- Takes a triplets array with subject, predicate, object fields with CONCISE, RAW subfields and produces a triplet with $type subfield (where $type is one of CONCISE and RAW) and with subject, predicate, object, and QUESTION_ANSWER_LIST subfields
- extractTripletsParseTree() : array<string|int, mixed>
- Takes a parse tree of a phrase and computes subject, predicate, and object arrays. Each of these array consists of two components CONCISE and RAW, CONCISE corresponding to something more similar to the words in the original phrase and RAW to the case where extraneous words have been removed
- extractTripletsPhrases() : array<string|int, mixed>
- Scans a word list for phrases. For each phrase found, generates a list of question and answer pairs at two levels of granularity: CONCISE (using all terms in the original phrase) and RAW (removing adjectives, etc.).
- getNamedEntityTagger() : NamedEntityContextTagger
- Get the named entity tagger instance
- getPosKey() : string
- Determines the part of speech tag of a term using simple rules if possible
- getPosKeyList() : array<string|int, mixed>
- Possible tags a term can have that can be determined by a simple rule
- getPosTagger() : PartOfSpeechContextTagger
- Get the part-of-speech tagger instance
- getPosUnknownTagsList() : array<string|int, mixed>
- Return list of possible tags that an unknown term can have
- getStochasticTermSegmenter() : StochasticTermSegmenter
- Get the segmenter instance, instantiating it if necessary
- isCardinalNumber() : bool
- Check if the term passed in is a Cardinal Number
- isDate() : mixed
- isNotCurrentLang() : bool
- Check whether none of the characters in the term belong to the current language
- isOrdinalNumber() : mixed
- isPunctuation() : mixed
- Check whether the term is punctuation
- isQuestion() : bool
- Takes a phrase query entered by user and return true if it is question and false if not
- normalize() : string
- Converts traditional Chinese characters to simplified characters
- parseAdjective() : array<string|int, mixed>
- Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for an adjective if possible
- parseDeterminer() : array<string|int, mixed>
- Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a determiner if possible
- parseNoun() : array<string|int, mixed>
- Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a noun if possible
- parseNounPhrase() : array<string|int, mixed>
- Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a noun phrase if possible
- parsePrepositionalPhrases() : array<string|int, mixed>
- Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a sequence of prepositional phrases if possible
- parseQuestion() : array<string|int, mixed>
- Takes a tagged question string that starts with Who and returns the question triplet from the question string
- parseTypeList() : string
- Starting at the $cur_node in a $tagged_phrase parse tree for an English sentence, create a phrase string for each of the next nodes which belong to part of speech group $type.
- parseVerb() : array<string|int, mixed>
- Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a verb if possible
- parseVerbPhrase() : array<string|int, mixed>
- Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a verb phrase if possible
- parseWholePhrase() : array<string|int, mixed>
- Given a part-of-speech tagged phrase array, generates a parse tree for the phrase using a recursive descent parser.
- questionParser() : array<string|int, mixed>
- Takes any question that starts with a WH question word and returns the triplet from the question
- questionType() : mixed
- Helper function for isQuestion
- rearrangeTripletsByType() : array<string|int, mixed>
- Takes a triplets array with subject, predicate, object fields with CONCISE and RAW subfields and rearranges it to have two fields CONCISE and RAW with subject, predicate, object, and QUESTION_ANSWER_LIST subfields
- segment() : string
- A word segmenter.
- stopwordsRemover() : mixed
- Removes the stop words from the page (used for Word Cloud generation and language detection)
- tagTokenizePartOfSpeech() : array<string|int, mixed>
- Split input text into terms and output an array with one element per term, that element consisting of array with the term token and the part of speech tag.
Properties
$adjective_type
List of adjective-like parts of speech that might appear in lexicon file Predicative adjective: VA other noun-modifier: JJ
public
static array<string|int, mixed>
$adjective_type
= ["VA", "JJ"]
$adverb_type
List of adverb-like parts of speech that might appear in lexicon file
public
static array<string|int, mixed>
$adverb_type
= ["AD"]
$conjunction_type
List of conjunction-like parts of speech that might appear in lexicon file Coordinating conjunction: CC Subordinating conjunction: CS
public
static array<string|int, mixed>
$conjunction_type
= ["CC", "CS"]
$determiner_type
List of determiner-like parts of speech that might appear in lexicon file Determiner: DT Cardinal Number: CD Ordinal Number: OD Measure word: M
public
static mixed
$determiner_type
= ["DT", "CD", "OD", "M"]
$dot
Dots used in Chinese Numbers
public
static string
$dot
= "\\..点"
$exception_list
Words that the regexes in isCardinalNumber, isOrdinalNumber, and isDate match but that should not be treated as numbers or dates. For example, "十分" usually means "very", but the functions would read it as "10 minutes", so it must be excluded.
public
static array<string|int, mixed>
$exception_list
= ["十分", "一", "一点", "千万", "万一", "一一", "拾", "一时", "千千", "万万", "陆"]
$non_char_preg
Regular expression that matches a term when none of its characters belong to the current language.
public
static string
$non_char_preg
= "/^[^\\p{Han}]+\$/u"
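As a rough illustration of how a pattern like $non_char_preg can be applied, the sketch below wraps it in a standalone function. Only the pattern comes from the property above; the constant NON_CHAR_PREG and the function isNotHanSketch are hypothetical names used for this example and are not part of the Tokenizer class.

```php
<?php
// Pattern copied from $non_char_preg: one or more characters, none of them Han.
const NON_CHAR_PREG = '/^[^\p{Han}]+$/u';

/**
 * Hypothetical helper (assumption, not the class's own method): returns true
 * when no character of $term belongs to the Han script.
 */
function isNotHanSketch(string $term): bool
{
    // preg_match returns 1 on a match, 0 on no match, false on error.
    return preg_match(NON_CHAR_PREG, $term) === 1;
}
```

Because the pattern requires at least one character, an empty string does not match, and a term that mixes Han and non-Han characters is still treated as belonging to the current language.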
$noun_type
List of noun-like parts of speech that might appear in lexicon file Proper Noun: NR Temporal Noun: NT Other Noun: NN Pronoun: PN
public
static array<string|int, mixed>
$noun_type
= ["NR", "NT", "NN", "PN"]
$num_dict
Characters that can be used in Chinese numbers
public
static string
$num_dict
= "1234567890○〇零一二两三四五六七八九十百千万亿" . "0123456789壹贰叁肆伍陆柒捌玖拾廿卅卌佰仟萬億"
$num_end
Characters that can appear at the end of a number
public
static string
$num_end
= "%%"
$particle_type
List of particle-like parts of speech that might appear in lexicon file No meaning words that can appear anywhere
public
static array<string|int, mixed>
$particle_type
= ["AS", "ETC", "DEC", "DEG", "DEV", "MSP", "DER", "SP", "IJ", "FW"]
$punctuation_preg
Regular expression for characters that can be used as Chinese punctuation
public
static string
$punctuation_preg
= "/^([\\x{2000}-\\x{206F}\\x{3000}-\\x{303F}\\x{FF00}-\\x{FF0F}" . "\\x{FF1A}-\\x{FF20}\\x{FF3B}-\\x{FF40}\\x{FF5B}-\\x{FF65}" . "\\x{FFE0}-\\x{FFEE}\\x{21}-\\x{2F}\\x{21}-\\x{2F}" . "\\x{3A}-\\x{40}\\x{5B}-\\x{60}\\x{25cf}])\\1*\$/u"
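The pattern above captures one punctuation character and then allows only repetitions of that same character (the \1* back-reference). A minimal sketch of applying it follows; the constant PUNCTUATION_PREG and the function isPunctuationSketch are hypothetical names for this example, and the pattern is written single-quoted so the backslashes are literal.

```php
<?php
// Pattern copied from $punctuation_preg, including its repeated \x{21}-\x{2F} range.
const PUNCTUATION_PREG = '/^([\x{2000}-\x{206F}\x{3000}-\x{303F}\x{FF00}-\x{FF0F}'
    . '\x{FF1A}-\x{FF20}\x{FF3B}-\x{FF40}\x{FF5B}-\x{FF65}'
    . '\x{FFE0}-\x{FFEE}\x{21}-\x{2F}\x{21}-\x{2F}'
    . '\x{3A}-\x{40}\x{5B}-\x{60}\x{25cf}])\1*$/u';

/**
 * Hypothetical helper (assumption): true when $term is a single punctuation
 * character, optionally repeated, as the pattern above describes.
 */
function isPunctuationSketch(string $term): bool
{
    return preg_match(PUNCTUATION_PREG, $term) === 1;
}
```

Under this pattern a run such as "。。。" counts as punctuation, while a mixed run such as "!?" does not, because the second character differs from the captured one.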
$question_token
Any unique identifier corresponding to the component of a triplet which can be answered using a question answer list
public
static string
$question_token
= "qqq"
$question_words
Words array that determine if a sentence passed in is a question
public
static array<string|int, mixed>
$question_words
= ["any" => ["谁" => "who", "哪儿|哪里" => "where", "哪个" => "which", "哪些" => "list", "哪" => ["after" => ["1|一" => "which", "[2-9]|[1-9][0-9]+" => "list"], "other" => "where"], "什么|啥|咋" => ["after" => ["地方" => "where", "地点" => "where", "时\\w*" => "when"], "other" => "what"], "怎么|怎样|怎么样|如何" => "how", "为什么" => "why", "多少" => "how many", "几\\w*" => ["any" => ["吗|\\?|?" => "how many"], "other" => false], "多久" => "how long", "多大" => "how big"], "other" => ["any" => ["吗" => "yesno", "呢" => "what about"], "other" => ["other" => false, "any" => ["\\?|?" => "yesno"]]]]
$stop_words
A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries. This is also used for language detection
public
static array<string|int, mixed>
$stop_words
= ['一', '人', '里', '会', '没', '她', '吗', '去', '也', '有', '这', '那', '不', '什', '个', '来', '要', '就', '我', '你', '的', '是', '了', '他', '么', '们', '在', '说', '为', '好', '吧', '知道', '我的', '和', '你的', '想', '只', '很', '都', '对', '把', '啊', '怎', '得', '还', '过', '不是', '到', '样', '飞', '远', '身', '任何', '生活', '够', '号', '兰', '瑞', '达', '或', '愿', '蒂', '別', '军', '正', '是不是', '证', '不用', '三', '乐', '吉', '男人', '告訴', '路', '搞', '可是', '与', '次', '狗', '决', '金', '史', '姆', '部', '正在', '活', '刚', '回家', '贝', '如何', '须', '战', '不會', '夫', '喂', '父', '亚', '肯定', '女孩', '世界']
$verb_type
List of verb-like parts of speech that might appear in lexicon file Copula: VC you3 as the main verb: VE Other verb: VV Short passive voice: SB Long passive voice: LB
public
static array<string|int, mixed>
$verb_type
= ["VC", "VE", "VV", "SB", "LB"]
$named_entity_tagger
Named entity tagger instance used to recognize noun entities in Chinese text
private
static object
$named_entity_tagger
$pos_tagger
PartOfSpeechContextTagger instance used in adding part of speech annotations to Chinese text
private
static object
$pos_tagger
$stochastic_term_segmenter
StochasticTermSegmenter instance used for segmenting Chinese text
private
static object
$stochastic_term_segmenter
$traditional_simplified_map
Holds an associative array whose keys are traditional characters and whose values are the corresponding simplified characters.
private
static array<string|int, mixed>
$traditional_simplified_map
Methods
extractDeepestSpeechPartPhrase()
Takes phrase tree $tree and a part-of-speech $pos returns the deepest $pos only path in tree.
public
static extractDeepestSpeechPartPhrase(array<string|int, mixed> $tree, string $pos) : string
Parameters
- $tree : array<string|int, mixed>
-
phrase to extract type from
- $pos : string
-
the part of speech to extract
Return values
string —the label of deepest $pos only path in $tree
extractObjectParseTree()
Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW: the former holds the object of the original phrase (as a string), the latter the important parts of the object
public
static extractObjectParseTree(mixed $tree) : array<string|int, mixed>
Parameters
- $tree : mixed
Return values
array<string|int, mixed> —with two fields CONCISE and RAW as described above
extractPredicateParseTree()
Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW: the former holds the predicate of the original phrase (as a string), the latter the important parts of the predicate
public
static extractPredicateParseTree(mixed $tree) : array<string|int, mixed>
Parameters
- $tree : mixed
Return values
array<string|int, mixed> —with two fields CONCISE and RAW as described above
extractSubjectParseTree()
Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW: the former holds the subject of the original phrase (as a string), the latter the important parts of the subject
public
static extractSubjectParseTree(mixed $tree) : array<string|int, mixed>
Parameters
- $tree : mixed
Return values
array<string|int, mixed> —with two fields CONCISE and RAW as described above
extractTripletByType()
Takes a triplets array with subject, predicate, object fields with CONCISE, RAW subfields and produces a triplet with $type subfield (where $type is one of CONCISE and RAW) and with subject, predicate, object, and QUESTION_ANSWER_LIST subfields
public
static extractTripletByType(array<string|int, mixed> $sub_pred_obj_triplets, string $type) : array<string|int, mixed>
Parameters
- $sub_pred_obj_triplets : array<string|int, mixed>
-
in format described above
- $type : string
-
either CONCISE or RAW
Return values
array<string|int, mixed> —$triplets in format described above
extractTripletsParseTree()
Takes a parse tree of a phrase and computes subject, predicate, and object arrays. Each of these array consists of two components CONCISE and RAW, CONCISE corresponding to something more similar to the words in the original phrase and RAW to the case where extraneous words have been removed
public
static extractTripletsParseTree(array<string|int, mixed> $tree) : array<string|int, mixed>
Parameters
- $tree : array<string|int, mixed>
-
a parse tree for a sentence
Return values
array<string|int, mixed> —triplet array
extractTripletsPhrases()
Scans a word list for phrases. For each phrase found, generates a list of question and answer pairs at two levels of granularity: CONCISE (using all terms in the original phrase) and RAW (removing adjectives, etc.).
public
static extractTripletsPhrases(array<string|int, mixed> $word_and_phrase_list) : array<string|int, mixed>
Parameters
- $word_and_phrase_list : array<string|int, mixed>
-
of statements
Return values
array<string|int, mixed> —with two fields: QUESTION_LIST consisting of triplets (SUBJECT, PREDICATES, OBJECT) where one of the components has been replaced with a question marker.
getNamedEntityTagger()
Get the named entity tagger instance
public
static getNamedEntityTagger() : NamedEntityContextTagger
Return values
NamedEntityContextTagger —for Chinese
getPosKey()
Determines the part of speech tag of a term using simple rules if possible
public
static getPosKey(string $term) : string
Parameters
- $term : string
-
to see if can get a part of speech for via a rule
Return values
string —part of speech tag, or $term if one cannot be determined
getPosKeyList()
Possible tags a term can have that can be determined by a simple rule
public
static getPosKeyList() : array<string|int, mixed>
Return values
array<string|int, mixed>
getPosTagger()
Get the part-of-speech tagger instance
public
static getPosTagger() : PartOfSpeechContextTagger
Return values
PartOfSpeechContextTagger —for Chinese
getPosUnknownTagsList()
Return list of possible tags that an unknown term can have
public
static getPosUnknownTagsList() : array<string|int, mixed>
Return values
array<string|int, mixed>
getStochasticTermSegmenter()
Get the segmenter instance, instantiating it if necessary
public
static getStochasticTermSegmenter() : StochasticTermSegmenter
Return values
StochasticTermSegmenter
isCardinalNumber()
Check if the term passed in is a Cardinal Number
public
static isCardinalNumber(string $term) : bool
Parameters
- $term : string
-
to check if a cardinal number or not
Return values
bool —whether it is a cardinal or not
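To show how $num_dict and $exception_list could work together, here is a deliberately simplified sketch. It treats a term as a cardinal number when every character comes from the number dictionary and the term is not a listed exception. This is an assumption made for illustration: it ignores $dot and $num_end, the function name isCardinalNumberSketch is hypothetical, and the real method may apply different rules.

```php
<?php
// Values copied from the $num_dict and $exception_list properties.
const NUM_DICT = "1234567890○〇零一二两三四五六七八九十百千万亿"
    . "0123456789壹贰叁肆伍陆柒捌玖拾廿卅卌佰仟萬億";
const EXCEPTION_LIST = ["十分", "一", "一点", "千万", "万一", "一一", "拾",
    "一时", "千千", "万万", "陆"];

/**
 * Hypothetical, simplified check (assumption, not the class's own logic).
 */
function isCardinalNumberSketch(string $term): bool
{
    // Exception words look numeric but usually carry another meaning.
    if (in_array($term, EXCEPTION_LIST, true)) {
        return false;
    }
    // Build a character class from the number dictionary.
    $pattern = '/^[' . preg_quote(NUM_DICT, '/') . ']+$/u';
    return preg_match($pattern, $term) === 1;
}
```

With this sketch "三百" is accepted, while "万一" is rejected even though both of its characters appear in the number dictionary.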
isDate()
public
static isDate(mixed $term) : mixed
Parameters
- $term : mixed
Return values
mixed
isNotCurrentLang()
Check whether none of the characters in the term belong to the current language
public
static isNotCurrentLang(string $term) : bool
Parameters
- $term : string
-
the string to be checked
Return values
bool —true if none of the characters in $term belong to the current language, false otherwise
isOrdinalNumber()
public
static isOrdinalNumber(mixed $term) : mixed
Parameters
- $term : mixed
Return values
mixed
isPunctuation()
Check whether the term is punctuation
public
static isPunctuation(mixed $term) : mixed
Parameters
- $term : mixed
Return values
mixed
isQuestion()
Takes a phrase query entered by the user and returns true if it is a question and false if not
public
static isQuestion( $phrase) : bool
Parameters
Return values
bool —returns question word if statement is question
normalize()
Converts traditional Chinese characters to simplified characters
public
static normalize(string $text) : string
Parameters
- $text : string
-
a string of Chinese characters
Return values
string —normalized form of the text
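A conversion like this can be sketched as a table lookup over the text. The example below assumes a small hand-written map; the constant TRADITIONAL_SIMPLIFIED_SAMPLE and the function normalizeSketch are hypothetical, and the class's private $traditional_simplified_map presumably covers far more characters.

```php
<?php
// Hypothetical sample map (assumption): traditional character => simplified character.
const TRADITIONAL_SIMPLIFIED_SAMPLE = [
    "萬" => "万",
    "億" => "亿",
    "訴" => "诉",
    "會" => "会",
];

/**
 * Hypothetical helper: replaces every traditional character found in the
 * sample map with its simplified counterpart and leaves the rest untouched.
 */
function normalizeSketch(string $text): string
{
    // strtr() with an array performs all replacements in a single pass.
    return strtr($text, TRADITIONAL_SIMPLIFIED_SAMPLE);
}
```

Characters that are absent from the map, including ones that are already simplified, pass through unchanged.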
parseAdjective()
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for an adjective if possible
public
static parseAdjective(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>
Parameters
- $tagged_phrase : array<string|int, mixed>
-
an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
- $tree : array<string|int, mixed>
-
that consists of ["cur_node" => current parse position in $tagged_phrase]
Return values
array<string|int, mixed> —has fields "cur_node" index of how far we parsed $tagged_phrase "JJ" a subarray with a token node for the adjective that was parsed
parseDeterminer()
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a determiner if possible
public
static parseDeterminer(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>
Parameters
- $tagged_phrase : array<string|int, mixed>
-
an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
- $tree : array<string|int, mixed>
-
that consists of ["cur_node" => current parse position in $tagged_phrase]
Return values
array<string|int, mixed> —has fields "cur_node" index of how far we parsed $tagged_phrase "DT" a subarray with a token node for the determiner that was parsed
parseNoun()
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a noun if possible
public
static parseNoun(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>
Parameters
- $tagged_phrase : array<string|int, mixed>
-
an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
- $tree : array<string|int, mixed>
-
that consists of ["cur_node" => current parse position in $tagged_phrase]
Return values
array<string|int, mixed> —has fields "cur_node" index of how far we parsed $tagged_phrase "NN" a subarray with a token node for the noun string that was parsed
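The $tagged_phrase and $tree shapes described above can be made concrete with a small sketch. It assumes a noun parser that gathers consecutive noun-tagged tokens starting at "cur_node" and joins them with spaces. The function name parseNounSketch and the space-joined "NN" value are assumptions for illustration, not the class's confirmed behavior.

```php
<?php
// Tags copied from the $noun_type property.
const NOUN_TYPE = ["NR", "NT", "NN", "PN"];

/**
 * Hypothetical sketch: collects consecutive noun-like tokens starting at
 * $tree["cur_node"] and records them under "NN".
 */
function parseNounSketch(array $tagged_phrase, array $tree): array
{
    $cur_node = $tree["cur_node"];
    $tokens = [];
    while (isset($tagged_phrase[$cur_node])
        && in_array($tagged_phrase[$cur_node]["tag"], NOUN_TYPE, true)) {
        $tokens[] = $tagged_phrase[$cur_node]["token"];
        $cur_node++;
    }
    if ($tokens !== []) {
        // Advance the parse position only when a noun was actually found.
        $tree["cur_node"] = $cur_node;
        $tree["NN"] = implode(" ", $tokens);
    }
    return $tree;
}

// Usage: a pronoun and a noun followed by a verb.
$phrase = [
    ["token" => "我", "tag" => "PN"],
    ["token" => "朋友", "tag" => "NN"],
    ["token" => "来", "tag" => "VV"],
];
$result = parseNounSketch($phrase, ["cur_node" => 0]);
```

When the token at "cur_node" is not noun-like, the sketch returns the tree unchanged, which lets a caller detect that no noun was parsed at that position.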
parseNounPhrase()
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a noun phrase if possible
public
static parseNounPhrase(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>
Parameters
- $tagged_phrase : array<string|int, mixed>
-
an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
- $tree : array<string|int, mixed>
-
that consists of ["cur_node" => current parse position in $tagged_phrase]
Return values
array<string|int, mixed> —has fields "cur_node" index of how far we parsed $tagged_phrase "NP" a subarray with possible fields "DT" with value a determiner subtree "JJ" with value an adjective subtree "NN" with value a noun tree
parsePrepositionalPhrases()
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a sequence of prepositional phrases if possible
public
static parsePrepositionalPhrases(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree[, int $index = 1 ]) : array<string|int, mixed>
Parameters
- $tagged_phrase : array<string|int, mixed>
-
an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
- $tree : array<string|int, mixed>
-
that consists of ["cur_node" => current parse position in $tagged_phrase]
- $index : int = 1
-
which term in $tagged_phrase to start to try to parse a preposition from
Return values
array<string|int, mixed> —has fields "cur_node" index of how far we parsed $tagged_phrase, followed by additional possible fields (here i represents the ith clause found): "IN_i" with value a preposition subtree "DT_i" with value a determiner subtree "JJ_i" with value an adjective subtree "NN_i" with value an additional noun subtree
parseQuestion()
Takes a tagged question string that starts with Who and returns the question triplet from the question string
public
static parseQuestion(string $tagged_question, int $index, string $question_word) : array<string|int, mixed>
Parameters
- $tagged_question : string
-
part-of-speech tagged question
- $index : int
-
current index in statement
- $question_word : string
-
the question word that needs to be replaced
Return values
array<string|int, mixed> —parsed triplet
parseTypeList()
Starting at the $cur_node in a $tagged_phrase parse tree for an English sentence, create a phrase string for each of the next nodes which belong to part of speech group $type.
public
static parseTypeList(array<string|int, mixed> &$cur_node, array<string|int, mixed> $tagged_phrase, string $type) : string
Parameters
- $cur_node : array<string|int, mixed>
-
node within parse tree
- $tagged_phrase : array<string|int, mixed>
-
parse tree for phrase
- $type : string
-
self::$noun_type, self::$verb_type, etc
Return values
string —phrase string involving only terms of that $type
parseVerb()
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a verb if possible
public
static parseVerb(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>
Parameters
- $tagged_phrase : array<string|int, mixed>
-
an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
- $tree : array<string|int, mixed>
-
that consists of ["cur_node" => current parse position in $tagged_phrase]
Return values
array<string|int, mixed> —has fields "cur_node" index of how far we parsed $tagged_phrase "VB" a subarray with a token node for the verb string that was parsed
parseVerbPhrase()
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a verb phrase if possible
public
static parseVerbPhrase(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>
Parameters
- $tagged_phrase : array<string|int, mixed>
-
an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
- $tree : array<string|int, mixed>
-
that consists of ["cur_node" => current parse position in $tagged_phrase]
Return values
array<string|int, mixed> —has fields "cur_node" index of how far we parsed $tagged_phrase "VP" a subarray with possible fields "VB" with value a verb subtree "NP" with value a noun phrase subtree
parseWholePhrase()
Given a part-of-speech tagged phrase array, generates a parse tree for the phrase using a recursive descent parser.
public
static parseWholePhrase(array<string|int, mixed> $tagged_phrase, $tree[, $tree_np_pre = [] ]) : array<string|int, mixed>
Parameters
- $tagged_phrase : array<string|int, mixed>
-
an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
- $tree :
-
that consists of ["cur_node" => current parse position in $tagged_phrase]
- $tree_np_pre : = []
-
subject found from previous sub-sentence
Return values
array<string|int, mixed> —used to represent a tree. The array has up to three fields: $tree["cur_node"], the index of how far we parsed our $tagged_phrase; $tree["NP"], which contains a subtree for a noun phrase; $tree["VP"], which contains a subtree for a verb phrase
questionParser()
Takes any question that starts with a WH question word and returns the triplet from the question
public
static questionParser(string $question) : array<string|int, mixed>
Parameters
- $question : string
-
question to parse
Return values
array<string|int, mixed> —question triplet
questionType()
Helper function for isQuestion
public
static questionType( $term_array, $type_list) : mixed
Parameters
- $term_array :
-
segmented Chinese terms
- $type_list :
-
current trace of self::$question_words; returns ["ques_words"=>ques_words,"types"=>types]
Return values
mixed
rearrangeTripletsByType()
Takes a triplets array with subject, predicate, object fields with CONCISE and RAW subfields and rearranges it to have two fields CONCISE and RAW with subject, predicate, object, and QUESTION_ANSWER_LIST subfields
public
static rearrangeTripletsByType(array<string|int, mixed> $sub_pred_obj_triplets) : array<string|int, mixed>
Parameters
- $sub_pred_obj_triplets : array<string|int, mixed>
-
in format described above
Return values
array<string|int, mixed> —$processed_triplets in format described above
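The regrouping described here amounts to swapping the two levels of a nested array. The sketch below shows only that step and omits the QUESTION_ANSWER_LIST subfield. The lowercase keys "subject", "predicate", and "object" and the function name rearrangeTripletsSketch are assumptions made for this example.

```php
<?php
/**
 * Hypothetical sketch: turns [part => [type => value]] into
 * [type => [part => value]] for the CONCISE and RAW types.
 */
function rearrangeTripletsSketch(array $sub_pred_obj_triplets): array
{
    $processed = [];
    foreach (["CONCISE", "RAW"] as $type) {
        foreach (["subject", "predicate", "object"] as $part) {
            // Missing pieces default to an empty string in this sketch.
            $processed[$type][$part] = $sub_pred_obj_triplets[$part][$type] ?? "";
        }
    }
    return $processed;
}

// Usage with a hand-made triplet (sample data, not tokenizer output).
$triplets = [
    "subject" => ["CONCISE" => "小 狗", "RAW" => "狗"],
    "predicate" => ["CONCISE" => "吃", "RAW" => "吃"],
    "object" => ["CONCISE" => "大 骨头", "RAW" => "骨头"],
];
$by_type = rearrangeTripletsSketch($triplets);
```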
segment()
A word segmenter.
public
static segment(string $pre_segment[, string $method = "STS" ]) : string
Such a segmenter on input thisisabunchofwords would output this is a bunch of words
Parameters
- $pre_segment : string
-
before segmentation
- $method : string = "STS"
-
indicates which method to use
Return values
string —with words separated by space
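To make the input and output contract concrete, the sketch below segments text by greedy forward maximum matching against a word list. This is a simpler technique swapped in for illustration; it is not the stochastic method that the "STS" default presumably refers to. The function segmentSketch, its dictionary argument, and the $max_len limit are all hypothetical.

```php
<?php
/**
 * Hypothetical sketch: at each position take the longest dictionary word
 * (up to $max_len characters); fall back to a single character otherwise.
 */
function segmentSketch(string $text, array $dictionary, int $max_len = 5): string
{
    $lookup = array_flip($dictionary);
    $terms = [];
    $length = mb_strlen($text, "UTF-8");
    $pos = 0;
    while ($pos < $length) {
        // Default to one character when no dictionary word matches.
        $match = mb_substr($text, $pos, 1, "UTF-8");
        for ($len = min($max_len, $length - $pos); $len > 1; $len--) {
            $candidate = mb_substr($text, $pos, $len, "UTF-8");
            if (isset($lookup[$candidate])) {
                $match = $candidate;
                break;
            }
        }
        $terms[] = $match;
        $pos += mb_strlen($match, "UTF-8");
    }
    return implode(" ", $terms);
}

// Usage: the example input from the description above.
$words = ["this", "is", "a", "bunch", "of", "words"];
$segmented = segmentSketch("thisisabunchofwords", $words);
```

Greedy matching is easy to follow but can pick a wrong split when a long dictionary word overlaps the correct boundary, which is one reason a statistical segmenter may be preferred in practice.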
stopwordsRemover()
Removes the stop words from the page (used for Word Cloud generation and language detection)
public
static stopwordsRemover(mixed $data) : mixed
Parameters
- $data : mixed
-
either a string or an array of string to remove stop words from
Return values
mixed —$data with no stop words
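A minimal sketch of stop word removal for either input type follows. It assumes the text has already been segmented into space-separated terms, which is an assumption for this example rather than a documented requirement. The function removeStopWordsSketch and its explicit $stop_words argument are hypothetical.

```php
<?php
/**
 * Hypothetical sketch: drops whole terms that appear in $stop_words from a
 * space-separated string, or from each string of an array.
 */
function removeStopWordsSketch($data, array $stop_words)
{
    if (is_array($data)) {
        // Apply the same filtering to every string in the array.
        return array_map(
            fn($item) => removeStopWordsSketch($item, $stop_words),
            $data
        );
    }
    $terms = preg_split('/\s+/u', trim($data), -1, PREG_SPLIT_NO_EMPTY);
    $kept = array_filter(
        $terms,
        fn($term) => !in_array($term, $stop_words, true)
    );
    return implode(" ", $kept);
}
```

Filtering whole terms rather than substrings avoids damaging longer words that merely contain a stop word character, such as "一时" when "一" is a stop word.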
tagTokenizePartOfSpeech()
Split input text into terms and output an array with one element per term, that element consisting of array with the term token and the part of speech tag.
public
static tagTokenizePartOfSpeech(string $text) : array<string|int, mixed>
Parameters
- $text : string
-
string to tag and tokenize
Return values
array<string|int, mixed> —of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term), one for each token in $text