Chinese-specific tokenization code. Typically, tokenizer.php either contains a stemmer for the language in question or specifies how many characters are in a char gram.
Table of Contents
- $adjective_type : array<string|int, mixed>
- List of adjective-like parts of speech that might appear in the lexicon file. Predicative adjective: VA; other noun-modifier: JJ.
- $adverb_type : array<string|int, mixed>
- List of adverb-like parts of speech that might appear in lexicon file
- $conjunction_type : array<string|int, mixed>
- List of conjunction-like parts of speech that might appear in the lexicon file. Coordinating conjunction: CC; subordinating conjunction: CS.
- $determiner_type : mixed
- List of determiner-like parts of speech that might appear in the lexicon file. Determiner: DT; cardinal number: CD; ordinal number: OD; measure word: M.
- $dot : string
- Dots used in Chinese Numbers
- $exception_list : array<string|int, mixed>
- Exception words for the regular expressions used by isCardinalNumber, isOrdinalNumber, and isDate. For example, "十分" usually means "very", but those functions would read it as "10 minutes", so it has to be excluded.
- $non_char_preg : string
- Regular expression used to determine whether none of the characters in a term belong to the current language.
- $noun_type : array<string|int, mixed>
- List of noun-like parts of speech that might appear in the lexicon file. Proper noun: NR; temporal noun: NT; other noun: NN; pronoun: PN.
- $num_dict : string
- The dictionary of characters that can be used as Chinese numbers
- $num_end : string
- A list of characters that can be used at the end of numbers
- $particle_type : array<string|int, mixed>
- List of particle-like parts of speech that might appear in the lexicon file: words with no meaning of their own that can appear anywhere
- $punctuation_preg : string
- A list of characters that can be used as Chinese punctuation
- $question_token : string
- Any unique identifier corresponding to the component of a triplet which can be answered using a question answer list
- $question_words : array<string|int, mixed>
- Array of words used to determine whether a sentence passed in is a question
- $stop_words : array<string|int, mixed>
- A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries. This is also used for language detection
- $verb_type : array<string|int, mixed>
- List of verb-like parts of speech that might appear in the lexicon file. Copula: VC; you3 as the main verb: VE; other verb: VV; short passive voice: SB; long passive voice: LB.
- $named_entity_tagger : object
- Named entity tagger instance used to recognize noun entities in Chinese text
- $pos_tagger : object
- PartOfSpeechContextTagger instance used in adding part of speech annotations to Chinese text
- $stochastic_term_segmenter : object
- StochasticTermSegmenter instance used for segmenting Chinese
- $traditional_simplified_map : array<string|int, mixed>
- Holds an associative array whose keys are traditional characters and whose values are the corresponding simplified characters.
- extractDeepestSpeechPartPhrase() : string
- Takes a phrase tree $tree and a part of speech $pos and returns the deepest $pos-only path in the tree.
- extractObjectParseTree() : array<string|int, mixed>
- Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW, the former holding the object of the original phrase (as a string) and the latter holding the important parts of the object
- extractPredicateParseTree() : array<string|int, mixed>
- Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW, the former holding the predicate of the original phrase (as a string) and the latter holding the important parts of the predicate
- extractSubjectParseTree() : array<string|int, mixed>
- Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW, the former holding the subject of the original phrase (as a string) and the latter holding the important parts of the subject
- extractTripletByType() : array<string|int, mixed>
- Takes a triplets array with subject, predicate, object fields with CONCISE, RAW subfields and produces a triplets array with a $type subfield (where $type is one of CONCISE and RAW) and with subject, predicate, object, and QUESTION_ANSWER_LIST subfields
- extractTripletsParseTree() : array<string|int, mixed>
- Takes a parse tree of a phrase and computes subject, predicate, and object arrays. Each of these arrays consists of two components, CONCISE and RAW: CONCISE corresponds to something closer to the words in the original phrase, and RAW to the case where extraneous words have been removed
- extractTripletsPhrases() : array<string|int, mixed>
- Scans a word list for phrases. For the phrases found, generates a list of question and answer pairs at two levels of granularity: CONCISE (using all terms in the original phrase) and RAW (removing adjectives, etc.).
- getNamedEntityTagger() : NamedEntityContextTagger
- Get the named entity tagger instance
- getPosKey() : string
- Determines the part of speech tag of a term using simple rules if possible
- getPosKeyList() : array<string|int, mixed>
- Possible tags a term can have that can be determined by a simple rule
- getPosTagger() : PartOfSpeechContextTagger
- Get the part-of-speech tagger instance
- getPosUnknownTagsList() : array<string|int, mixed>
- Return list of possible tags that an unknown term can have
- getStochasticTermSegmenter() : StochasticTermSegmenter
- Get the segmenter instance, instantiating it if necessary
- isCardinalNumber() : bool
- Check if the term passed in is a Cardinal Number
- isDate() : mixed
- isNotCurrentLang() : bool
- Check whether none of the characters in the term belong to the current language
- isOrdinalNumber() : mixed
- isPunctuation() : mixed
- Check if the term is punctuation
- isQuestion() : bool
- Takes a phrase query entered by the user and returns true if it is a question and false if not
- normalize() : string
- Converts traditional Chinese characters to simplified characters
- parseAdjective() : array<string|int, mixed>
- Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for an adjective if possible
- parseDeterminer() : array<string|int, mixed>
- Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a determiner if possible
- parseNoun() : array<string|int, mixed>
- Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a noun if possible
- parseNounPhrase() : array<string|int, mixed>
- Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a noun phrase if possible
- parsePrepositionalPhrases() : array<string|int, mixed>
- Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a sequence of prepositional phrases if possible
- parseQuestion() : array<string|int, mixed>
- Takes a tagged question string that starts with Who and returns the question triplet from the question string
- parseTypeList() : string
- Starting at the $cur_node in a $tagged_phrase parse tree for an English sentence, create a phrase string for each of the next nodes which belong to part of speech group $type.
- parseVerb() : array<string|int, mixed>
- Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a verb if possible
- parseVerbPhrase() : array<string|int, mixed>
- Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a verb phrase if possible
- parseWholePhrase() : array<string|int, mixed>
- Given a part-of-speech tagged phrase array, generates a parse tree for the phrase using a recursive descent parser.
- questionParser() : array<string|int, mixed>
- Takes any question that starts with a WH question word and returns the triplet from the question
- questionType() : mixed
- Helper function for isQuestion
- rearrangeTripletsByType() : array<string|int, mixed>
- Takes a triplets array with subject, predicate, object fields with CONCISE and RAW subfields and rearranges it to have two fields CONCISE and RAW with subject, predicate, object, and QUESTION_ANSWER_LIST subfields
- segment() : string
- A word segmenter.
- stopwordsRemover() : mixed
- Removes the stop words from the page (used for Word Cloud generation and language detection)
- tagTokenizePartOfSpeech() : array<string|int, mixed>
- Split input text into terms and output an array with one element per term, that element consisting of array with the term token and the part of speech tag.
Takes a phrase tree $tree and a part of speech $pos and returns the deepest $pos-only path in the tree.
static extractDeepestSpeechPartPhrase(array<string|int, mixed> $tree, string $pos) : string
- $tree : array<string|int, mixed>
phrase to extract type from
- $pos : string
the part of speech to extract
Return values
string —the label of the deepest $pos-only path in $tree
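The descent described above can be illustrated with a minimal sketch. The helper name `deepestPosLabel` and the shape of the toy tree are assumptions made for this example, not the class's actual code: the function follows nested keys equal to $pos until it reaches a leaf label.

```php
<?php
// Hypothetical helper, not the library's implementation.
function deepestPosLabel(array $tree, string $pos): string
{
    // Keep descending while the current node has a child keyed by $pos.
    while (isset($tree[$pos])) {
        if (!is_array($tree[$pos])) {
            // Reached a leaf: this is the deepest $pos-only label.
            return (string) $tree[$pos];
        }
        $tree = $tree[$pos];
    }
    // The chain of $pos keys ended without reaching a leaf label.
    return "";
}

// Toy tree (shape assumed for illustration): 一只 大 狗
$tree = ["NP" => ["DT" => "一只", "NP" => ["JJ" => "大", "NP" => "狗"]]];
echo deepestPosLabel($tree, "NP"), "\n"; // prints 狗
```

Asking the same tree for "VP" returns an empty string, because no "VP" path exists.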
Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW, the former holding the object of the original phrase (as a string) and the latter holding the important parts of the object
static extractObjectParseTree(mixed $tree) : array<string|int, mixed>
- $tree : mixed
Return values
array<string|int, mixed> —with two fields CONCISE and RAW as described above
Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW, the former holding the predicate of the original phrase (as a string) and the latter holding the important parts of the predicate
static extractPredicateParseTree(mixed $tree) : array<string|int, mixed>
- $tree : mixed
Return values
array<string|int, mixed> —with two fields CONCISE and RAW as described above
Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW, the former holding the subject of the original phrase (as a string) and the latter holding the important parts of the subject
static extractSubjectParseTree(mixed $tree) : array<string|int, mixed>
- $tree : mixed
Return values
array<string|int, mixed> —with two fields CONCISE and RAW as described above
Takes a triplets array with subject, predicate, object fields with CONCISE, RAW subfields and produces a triplets array with a $type subfield (where $type is one of CONCISE and RAW) and with subject, predicate, object, and QUESTION_ANSWER_LIST subfields
static extractTripletByType(array<string|int, mixed> $sub_pred_obj_triplets, string $type) : array<string|int, mixed>
- $sub_pred_obj_triplets : array<string|int, mixed>
in format described above
- $type : string
either CONCISE or RAW
Return values
array<string|int, mixed> —$triplets in format described above
Takes a parse tree of a phrase and computes subject, predicate, and object arrays. Each of these arrays consists of two components, CONCISE and RAW: CONCISE corresponds to something closer to the words in the original phrase, and RAW to the case where extraneous words have been removed
static extractTripletsParseTree(array<string|int, mixed> $tree) : array<string|int, mixed>
- $tree : array<string|int, mixed>
a parse tree for a sentence
Return values
array<string|int, mixed> —triplet array
Scans a word list for phrases. For the phrases found, generates a list of question and answer pairs at two levels of granularity: CONCISE (using all terms in the original phrase) and RAW (removing adjectives, etc.).
static extractTripletsPhrases(array<string|int, mixed> $word_and_phrase_list) : array<string|int, mixed>
- $word_and_phrase_list : array<string|int, mixed>
of statements
Return values
array<string|int, mixed> —with two fields: QUESTION_LIST consisting of triplets (SUBJECT, PREDICATES, OBJECT) where one of the components has been replaced with a question marker.
Get the named entity tagger instance
static getNamedEntityTagger() : NamedEntityContextTagger
Return values
NamedEntityContextTagger —for Chinese
Determines the part of speech tag of a term using simple rules if possible
static getPosKey(string $term) : string
- $term : string
term to try to assign a part of speech to via a rule
Return values
string —part-of-speech tag, or $term if it can't be determined
Possible tags a term can have that can be determined by a simple rule
static getPosKeyList() : array<string|int, mixed>
Return values
array<string|int, mixed>
Get the part-of-speech tagger instance
static getPosTagger() : PartOfSpeechContextTagger
Return values
PartOfSpeechContextTagger —for Chinese
Return list of possible tags that an unknown term can have
static getPosUnknownTagsList() : array<string|int, mixed>
Return values
array<string|int, mixed>
Get the segmenter instance, instantiating it if necessary
static getStochasticTermSegmenter() : StochasticTermSegmenter
Return values
StochasticTermSegmenter
Check if the term passed in is a Cardinal Number
static isCardinalNumber(string $term) : bool
- $term : string
term to check for being a cardinal number
Return values
bool —whether it is a cardinal or not
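As a rough illustration of how a numeral dictionary, a dot character, an end-of-number list, and an exception list could combine in such a check, here is a minimal sketch. The constants below are small hypothetical stand-ins for $num_dict, $dot, $num_end, and $exception_list, and the function `looksLikeCardinalNumber` is an assumption for this example, not the class's actual rule.

```php
<?php
// Hypothetical stand-ins; the real property values are not shown in this page.
const NUM_CHARS = '0-9０-９零〇一二两三四五六七八九十百千万亿';
const DOT_CHARS = '.点';
const NUM_END = '分多余';
const EXCEPTIONS = ['十分'];

function looksLikeCardinalNumber(string $term): bool
{
    // Words that match the pattern but usually mean something else.
    if (in_array($term, EXCEPTIONS, true)) {
        return false;
    }
    // Numerals, an optional fractional part, and an optional end character.
    $pattern = '/^[' . NUM_CHARS . ']+(?:[' . DOT_CHARS . '][' . NUM_CHARS .
        ']+)?[' . NUM_END . ']?$/u';
    return preg_match($pattern, $term) === 1;
}

var_dump(looksLikeCardinalNumber("三百二十")); // bool(true)
var_dump(looksLikeCardinalNumber("十分"));     // bool(false), excluded as an exception
```

Without the exception check, "十分" would match the pattern (numeral followed by an end character), which is the kind of false positive the exception list is meant to prevent.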
static isDate(mixed $term) : mixed
- $term : mixed
Return values
mixed
Check whether none of the characters in the term belong to the current language
static isNotCurrentLang(string $term) : bool
- $term : string
string to be checked
Return values
bool —true if none of the characters in $term belong to the current language, false otherwise
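One way to express this kind of check is with a Unicode script property in a PCRE pattern. The sketch below is an assumption for illustration: it uses `\p{Han}` rather than the class's own $non_char_preg, whose value is not shown here, and the function name is hypothetical.

```php
<?php
// Hypothetical helper: true when $term contains no Han (Chinese) characters.
function hasNoChineseCharacters(string $term): bool
{
    // preg_match returns 0 when the pattern is not found anywhere in $term.
    return preg_match('/\p{Han}/u', $term) === 0;
}

var_dump(hasNoChineseCharacters("hello"));   // bool(true)
var_dump(hasNoChineseCharacters("PHP语言")); // bool(false)
```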
static isOrdinalNumber(mixed $term) : mixed
- $term : mixed
Return values
mixed
Check if the term is punctuation
static isPunctuation(mixed $term) : mixed
- $term : mixed
Return values
mixed
Takes a phrase query entered by the user and returns true if it is a question and false if not
static isQuestion($phrase) : bool
Return values
bool —returns the question word if the statement is a question
Converts traditional Chinese characters to simplified characters
static normalize(string $text) : string
- $text : string
a string of Chinese characters
Return values
string —normalized form of the text
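A map-based conversion of this kind can be sketched with PHP's `strtr`. The four-entry map below is a tiny hypothetical stand-in for $traditional_simplified_map, and `toSimplified` is an illustrative name, not the class's method.

```php
<?php
// Hypothetical sample of a traditional => simplified character map.
const TRAD_TO_SIMP = ["漢" => "汉", "語" => "语", "電" => "电", "腦" => "脑"];

function toSimplified(string $text): string
{
    // strtr with an array replaces every occurrence of each key in one pass;
    // characters without a map entry are left unchanged.
    return strtr($text, TRAD_TO_SIMP);
}

echo toSimplified("漢語電腦"), "\n"; // prints 汉语电脑
```

Because the replacement happens in a single pass, already-simplified or non-Chinese text passes through untouched.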
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for an adjective if possible
static parseAdjective(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>
- $tagged_phrase : array<string|int, mixed>
an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
- $tree : array<string|int, mixed>
that consists of ["cur_node" => current parse position in $tagged_phrase]
Return values
array<string|int, mixed> —has fields "cur_node", the index of how far we parsed $tagged_phrase, and "JJ", a subarray with a token node for the adjective that was parsed
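The "cur_node" convention shared by the parse* methods can be shown with a simplified sketch. Everything here is an assumption for illustration: the function name, the tag list, and especially the shape of the "JJ" value, which is flattened to a plain string rather than the subarray the real method returns.

```php
<?php
// Adjective-like tags listed for $adjective_type in this page.
const ADJECTIVE_TAGS = ["VA", "JJ"];

// Hypothetical, simplified parse step; not the class's implementation.
function parseAdjectiveSketch(array $tagged_phrase, array $tree): array
{
    $cur = $tree["cur_node"];
    $tokens = [];
    // Consume consecutive adjective-tagged terms starting at cur_node.
    while (isset($tagged_phrase[$cur]) &&
        in_array($tagged_phrase[$cur]["tag"], ADJECTIVE_TAGS, true)) {
        $tokens[] = $tagged_phrase[$cur]["token"];
        $cur++;
    }
    if ($tokens === []) {
        // Nothing parsed: hand the tree back with cur_node unchanged.
        return $tree;
    }
    return ["cur_node" => $cur, "JJ" => implode(" ", $tokens)];
}

$tagged = [["token" => "红", "tag" => "JJ"], ["token" => "花", "tag" => "NN"]];
print_r(parseAdjectiveSketch($tagged, ["cur_node" => 0]));
```

The caller reads the returned "cur_node" to know where the next parse step (for example, a noun) should begin.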
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a determiner if possible
static parseDeterminer(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>
- $tagged_phrase : array<string|int, mixed>
an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
- $tree : array<string|int, mixed>
that consists of ["cur_node" => current parse position in $tagged_phrase]
Return values
array<string|int, mixed> —has fields "cur_node", the index of how far we parsed $tagged_phrase, and "DT", a subarray with a token node for the determiner that was parsed
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a noun if possible
static parseNoun(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>
- $tagged_phrase : array<string|int, mixed>
an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
- $tree : array<string|int, mixed>
that consists of ["cur_node" => current parse position in $tagged_phrase]
Return values
array<string|int, mixed> —has fields "cur_node", the index of how far we parsed $tagged_phrase, and "NN", a subarray with a token node for the noun string that was parsed
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a noun phrase if possible
static parseNounPhrase(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>
- $tagged_phrase : array<string|int, mixed>
an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
- $tree : array<string|int, mixed>
that consists of ["cur_node" => current parse position in $tagged_phrase]
Return values
array<string|int, mixed> —has fields "cur_node", the index of how far we parsed $tagged_phrase, and "NP", a subarray with possible fields "DT" (a determiner subtree), "JJ" (an adjective subtree), and "NN" (a noun tree)
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a sequence of prepositional phrases if possible
static parsePrepositionalPhrases(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree[, int $index = 1 ]) : array<string|int, mixed>
- $tagged_phrase : array<string|int, mixed>
an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
- $tree : array<string|int, mixed>
that consists of ["cur_node" => current parse position in $tagged_phrase]
- $index : int = 1
which term in $tagged_phrase to start to try to parse a preposition from
Return values
array<string|int, mixed> —has field "cur_node", the index of how far we parsed $tagged_phrase, followed by additional possible fields (here i represents the ith clause found): "IN_i" (a preposition subtree), "DT_i" (a determiner subtree), "JJ_i" (an adjective subtree), and "NN_i" (an additional noun subtree)
Takes a tagged question string that starts with Who and returns the question triplet from the question string
static parseQuestion(string $tagged_question, int $index, string $question_word) : array<string|int, mixed>
- $tagged_question : string
part-of-speech tagged question
- $index : int
current index in statement
- $question_word : string
the question word that needs to be replaced
Return values
array<string|int, mixed> —parsed triplet
Starting at the $cur_node in a $tagged_phrase parse tree for an English sentence, create a phrase string for each of the next nodes which belong to part of speech group $type.
static parseTypeList(array<string|int, mixed> &$cur_node, array<string|int, mixed> $tagged_phrase, string $type) : string
- $cur_node : array<string|int, mixed>
node within parse tree
- $tagged_phrase : array<string|int, mixed>
parse tree for phrase
- $type : string
self::$noun_type, self::$verb_type, etc
Return values
string —phrase string involving only terms of that $type
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a verb if possible
static parseVerb(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>
- $tagged_phrase : array<string|int, mixed>
an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
- $tree : array<string|int, mixed>
that consists of ["cur_node" => current parse position in $tagged_phrase]
Return values
array<string|int, mixed> —has fields "cur_node", the index of how far we parsed $tagged_phrase, and "VB", a subarray with a token node for the verb string that was parsed
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a verb phrase if possible
static parseVerbPhrase(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>
- $tagged_phrase : array<string|int, mixed>
an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
- $tree : array<string|int, mixed>
that consists of ["cur_node" => current parse position in $tagged_phrase]
Return values
array<string|int, mixed> —has fields "cur_node", the index of how far we parsed $tagged_phrase, and "VP", a subarray with possible fields "VB" (a verb subtree) and "NP" (a noun phrase subtree)
Given a part-of-speech tagged phrase array, generates a parse tree for the phrase using a recursive descent parser.
static parseWholePhrase(array<string|int, mixed> $tagged_phrase, $tree[, $tree_np_pre = [] ]) : array<string|int, mixed>
- $tagged_phrase : array<string|int, mixed>
an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
- $tree :
that consists of ["cur_node" => current parse position in $tagged_phrase]
- $tree_np_pre : = []
subject found from previous sub-sentence
Return values
array<string|int, mixed> —used to represent a tree. The array has up to three fields: $tree["cur_node"], the index of how far we parsed $tagged_phrase; $tree["NP"], a subtree for a noun phrase; and $tree["VP"], a subtree for a verb phrase
Takes any question that starts with a WH question word and returns the triplet from the question
static questionParser(string $question) : array<string|int, mixed>
- $question : string
question to parse
Return values
array<string|int, mixed> —question triplet
Helper function for isQuestion
static questionType($term_array, $type_list) : mixed
- $term_array :
segmented Chinese terms
- $type_list :
current trace of self::$question_words
Return values
mixed —["ques_words" => ques_words, "types" => types]
Takes a triplets array with subject, predicate, object fields with CONCISE and RAW subfields and rearranges it to have two fields CONCISE and RAW with subject, predicate, object, and QUESTION_ANSWER_LIST subfields
static rearrangeTripletsByType(array<string|int, mixed> $sub_pred_obj_triplets) : array<string|int, mixed>
- $sub_pred_obj_triplets : array<string|int, mixed>
in format described above
Return values
array<string|int, mixed> —$processed_triplets in format described above
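The regrouping described here amounts to swapping the two levels of a nested array. A minimal sketch follows, assuming lowercase "subject", "predicate", and "object" keys and omitting the QUESTION_ANSWER_LIST subfield; the function name `regroupByType` is hypothetical.

```php
<?php
// Hypothetical helper: turns part => type => value into type => part => value.
function regroupByType(array $triplets): array
{
    $out = [];
    foreach (["CONCISE", "RAW"] as $type) {
        foreach (["subject", "predicate", "object"] as $part) {
            // A missing entry becomes an empty string in this sketch.
            $out[$type][$part] = $triplets[$part][$type] ?? "";
        }
    }
    return $out;
}

$triplets = [
    "subject" => ["CONCISE" => "那只 大 狗", "RAW" => "狗"],
    "predicate" => ["CONCISE" => "喜欢", "RAW" => "喜欢"],
    "object" => ["CONCISE" => "红 苹果", "RAW" => "苹果"],
];
print_r(regroupByType($triplets)["RAW"]);
```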
A word segmenter.
static segment(string $pre_segment[, string $method = "STS" ]) : string
Such a segmenter on input thisisabunchofwords would output this is a bunch of words
- $pre_segment : string
before segmentation
- $method : string = "STS"
indicates which method to use
Return values
string —with words separated by space
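To make the input and output format concrete, here is a sketch of a segmenter using forward maximum matching against a toy dictionary. This is a deliberately simpler, swapped-in technique, not the "STS" (stochastic term segmenter) method named above; the function name, dictionary, and $max_len parameter are assumptions for this example.

```php
<?php
// Hypothetical greedy segmenter: at each position take the longest
// dictionary word, falling back to a single character.
function segmentSketch(string $text, array $dictionary, int $max_len = 4): string
{
    // Split into Unicode characters without relying on the mbstring extension.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $n = count($chars);
    $words = [];
    $i = 0;
    while ($i < $n) {
        $len = min($max_len, $n - $i);
        // Shrink the candidate until it is a dictionary word or one character.
        while ($len > 1 &&
            !isset($dictionary[implode("", array_slice($chars, $i, $len))])) {
            $len--;
        }
        $words[] = implode("", array_slice($chars, $i, $len));
        $i += $len;
    }
    return implode(" ", $words);
}

// Dictionary keyed by word so lookups can use isset().
$dict = array_flip(["我们", "喜欢", "搜索", "引擎", "搜索引擎"]);
echo segmentSketch("我们喜欢搜索引擎", $dict), "\n"; // prints 我们 喜欢 搜索引擎
```

Greedy matching can pick the wrong split when words overlap, which is one reason a probabilistic segmenter may be preferred in practice.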
Removes the stop words from the page (used for Word Cloud generation and language detection)
static stopwordsRemover(mixed $data) : mixed
- $data : mixed
either a string or an array of strings to remove stop words from
Return values
mixed —$data with no stop words
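The string-or-array behavior can be sketched as follows. The three-word stop list is a hypothetical sample of $stop_words, the function name is illustrative, and the sketch assumes its input has already been segmented into space-separated terms.

```php
<?php
// Hypothetical sample of a stop word list.
const STOP_WORDS = ["的", "了", "是"];

// Accepts a string or an array of strings, mirroring the mixed parameter.
function removeStopWords($data)
{
    if (is_array($data)) {
        // Apply the same filtering to every string in the array.
        return array_map('removeStopWords', $data);
    }
    $terms = preg_split('/\s+/u', trim($data), -1, PREG_SPLIT_NO_EMPTY);
    $kept = array_filter($terms, fn($term) => !in_array($term, STOP_WORDS, true));
    return implode(" ", $kept);
}

echo removeStopWords("我 是 学生"), "\n"; // prints 我 学生
```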
Split input text into terms and output an array with one element per term, that element consisting of array with the term token and the part of speech tag.
static tagTokenizePartOfSpeech(string $text) : array<string|int, mixed>
- $text : string
string to tag and tokenize
Return values
array<string|int, mixed> —of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term), one for each token in $text