Yioop_V9.5_Source_Code

Tokenizer
in package Application

Chinese-specific tokenization code. Typically, tokenizer.php either contains a stemmer for the language in question or specifies how many characters are in a char gram.

$adjective_type

List of adjective-like parts of speech that might appear in the lexicon file. Predicative adjective: VA; other noun modifier: JJ.


    public
    static    array<string|int, mixed>
    $adjective_type
     = ["VA", "JJ"]

$adverb_type

List of adverb-like parts of speech that might appear in lexicon file


    public
    static    array<string|int, mixed>
    $adverb_type
     = ["AD"]

$conjunction_type

List of conjunction-like parts of speech that might appear in the lexicon file. Coordinating conjunction: CC; subordinating conjunction: CS.


    public
    static    array<string|int, mixed>
    $conjunction_type
     = ["CC", "CS"]

$determiner_type

List of determiner-like parts of speech that might appear in the lexicon file. Determiner: DT; cardinal number: CD; ordinal number: OD; measure word: M.


    public
    static    mixed
    $determiner_type
     = ["DT", "CD", "OD", "M"]

$dot

Dots used in Chinese Numbers


    public
    static    string
    $dot
     = "\\.．点"

$exception_list

Exception words for the regular expressions used by the functions isCardinalNumber, isOrdinalNumber, and isDate. For example, "十分" usually means "very", but the function would classify it as "10 minutes", so it needs to be excluded.


    public
    static    array<string|int, mixed>
    $exception_list
     = ["十分", "一", "一点", "千万", "万一", "一一", "拾", "一时", "千千", "万万", "陆"]

Array of strings.

$non_char_preg

Regular expression used to determine whether none of the characters in a term belong to the current language (that is, the term contains no Han characters).


    public
    static    string
    $non_char_preg
     = "/^[^\\p{Han}]+\$/u"
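As a small illustration of what this pattern accepts, the sketch below applies it to a few terms. The helper name is_not_han() is hypothetical and is not part of the Tokenizer class; it only mirrors what isNotCurrentLang() is documented to return.

```php
<?php
// Pattern copied from $non_char_preg: one or more characters, none of them Han.
$non_char_preg = "/^[^\\p{Han}]+\$/u";

// Hypothetical helper (an assumption for illustration, not Yioop code).
function is_not_han(string $term, string $pattern): bool
{
    return preg_match($pattern, $term) === 1;
}

var_dump(is_not_han("hello", $non_char_preg)); // bool(true)
var_dump(is_not_han("你好", $non_char_preg));   // bool(false)
var_dump(is_not_han("abc好", $non_char_preg));  // bool(false), one Han character is enough
```

Note that the pattern requires at least one character, so an empty string does not match.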

$noun_type

List of noun-like parts of speech that might appear in the lexicon file. Proper noun: NR; temporal noun: NT; other noun: NN; pronoun: PN.


    public
    static    array<string|int, mixed>
    $noun_type
     = ["NR", "NT", "NN", "PN"]

$num_dict

Dictionary of characters that can be used in Chinese numbers.


    public
    static    string
    $num_dict
     = "1234567890○〇零一二两三四五六七八九十百千万亿" . "０１２３４５６７８９壹贰叁肆伍陆柒捌玖拾廿卅卌佰仟萬億"

$num_end

Characters that can appear at the end of a number.


    public
    static    string
    $num_end
     = "％%"
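One way $num_dict, $dot, $num_end, and $exception_list could fit together is sketched below. This is only an assumption about how such a check might be written: the function looks_like_cardinal() and its regular expression are hypothetical and are not the actual isCardinalNumber() implementation.

```php
<?php
// Values copied from the properties documented above.
$num_dict = "1234567890○〇零一二两三四五六七八九十百千万亿" .
    "０１２３４５６７８９壹贰叁肆伍陆柒捌玖拾廿卅卌佰仟萬億";
$dot = "\\.．点";
$num_end = "％%";
$exception_list = ["十分", "一", "一点", "千万", "万一", "一一", "拾",
    "一时", "千千", "万万", "陆"];

// Hypothetical helper: digits, an optional dot followed by more digits,
// and an optional percent sign, unless the term is a listed exception.
function looks_like_cardinal(string $term, string $num_dict, string $dot,
    string $num_end, array $exception_list): bool
{
    if (in_array($term, $exception_list, true)) {
        return false; // e.g. "千万" is built only from number characters
    }
    $pattern = "/^[{$num_dict}]+(?:[{$dot}][{$num_dict}]+)?[{$num_end}]?\$/u";
    return preg_match($pattern, $term) === 1;
}

$args = [$num_dict, $dot, $num_end, $exception_list];
var_dump(looks_like_cardinal("三百五十", ...$args)); // bool(true)
var_dump(looks_like_cardinal("3.5%", ...$args));     // bool(true)
var_dump(looks_like_cardinal("千万", ...$args));     // bool(false), listed exception
var_dump(looks_like_cardinal("你好", ...$args));     // bool(false)
```

The exception check runs first because a term such as "千万" consists entirely of number characters and would otherwise pass the pattern.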

$particle_type

List of particle-like parts of speech that might appear in the lexicon file. These are words with no meaning of their own that can appear anywhere.


    public
    static    array<string|int, mixed>
    $particle_type
     = ["AS", "ETC", "DEC", "DEG", "DEV", "MSP", "DER", "SP", "IJ", "FW"]

$punctuation_preg

Regular expression matching characters that can be used as Chinese punctuation.


    public
    static    string
    $punctuation_preg
     = "/^([\\x{2000}-\\x{206F}\\x{3000}-\\x{303F}\\x{FF00}-\\x{FF0F}" . "\\x{FF1A}-\\x{FF20}\\x{FF3B}-\\x{FF40}\\x{FF5B}-\\x{FF65}" . "\\x{FFE0}-\\x{FFEE}\\x{21}-\\x{2F}\\x{21}-\\x{2F}" . "\\x{3A}-\\x{40}\\x{5B}-\\x{60}\\x{25cf}])\\1*\$/u"
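The pattern captures a single punctuation character and then allows only repetitions of that same character (the \1* back-reference). The sketch below demonstrates this behaviour directly with preg_match(); the variable $samples is introduced here purely for illustration.

```php
<?php
// Pattern copied from $punctuation_preg.
$punctuation_preg = "/^([\\x{2000}-\\x{206F}\\x{3000}-\\x{303F}\\x{FF00}-\\x{FF0F}" .
    "\\x{FF1A}-\\x{FF20}\\x{FF3B}-\\x{FF40}\\x{FF5B}-\\x{FF65}" .
    "\\x{FFE0}-\\x{FFEE}\\x{21}-\\x{2F}\\x{21}-\\x{2F}" .
    "\\x{3A}-\\x{40}\\x{5B}-\\x{60}\\x{25cf}])\\1*\$/u";

// Illustrative inputs (assumptions, not taken from the Yioop test suite).
$samples = ["。", "。。。", "!!!", "！？", "你"];
$results = [];
foreach ($samples as $sample) {
    $results[$sample] = preg_match($punctuation_preg, $sample) === 1;
    echo $sample, " => ", $results[$sample] ? "true" : "false", "\n";
}
// "。", "。。。" and "!!!" match; "！？" mixes two different marks and
// "你" is not punctuation, so neither of those matches.
```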

$question_token

Any unique identifier corresponding to the component of a triplet which can be answered using a question answer list


    public
    static    string
    $question_token
     = "qqq"

$question_words

Array of words used to determine whether a sentence passed in is a question


    public
    static    array<string|int, mixed>
    $question_words
     = ["any" => ["谁" => "who", "哪儿|哪里" => "where", "哪个" => "which", "哪些" => "list", "哪" => ["after" => ["1|一" => "which", "[2-9]|[1-9][0-9]+" => "list"], "other" => "where"], "什么|啥|咋" => ["after" => ["地方" => "where", "地点" => "where", "时\\w*" => "when"], "other" => "what"], "怎么|怎样|怎么样|如何" => "how", "为什么" => "why", "多少" => "how many", "几\\w*" => ["any" => ["吗|\\?|？" => "how many"], "other" => false], "多久" => "how long", "多大" => "how big"], "other" => ["any" => ["吗" => "yesno", "呢" => "what about"], "other" => ["other" => false, "any" => ["\\?|？" => "yesno"]]]]

$stop_words

A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries. This is also used for language detection


    public
    static    array<string|int, mixed>
    $stop_words
     = ['一', '人', '里', '会', '没', '她', '吗', '去', '也', '有', '这', '那', '不', '什', '个', '来', '要', '就', '我', '你', '的', '是', '了', '他', '么', '们', '在', '说', '为', '好', '吧', '知道', '我的', '和', '你的', '想', '只', '很', '都', '对', '把', '啊', '怎', '得', '还', '过', '不是', '到', '样', '飞', '远', '身', '任何', '生活', '够', '号', '兰', '瑞', '达', '或', '愿', '蒂', '別', '军', '正', '是不是', '证', '不用', '三', '乐', '吉', '男人', '告訴', '路', '搞', '可是', '与', '次', '狗', '决', '金', '史', '姆', '部', '正在', '活', '刚', '回家', '贝', '如何', '须', '战', '不會', '夫', '喂', '父', '亚', '肯定', '女孩', '世界']

$verb_type

List of verb-like parts of speech that might appear in the lexicon file. Copula: VC; you3 as the main verb: VE; other verb: VV; short passive voice: SB; long passive voice: LB.


    public
    static    array<string|int, mixed>
    $verb_type
     = ["VC", "VE", "VV", "SB", "LB"]

$named_entity_tagger

Named entity tagger instance used to recognize noun entities in Chinese text


    private
    static    object
    $named_entity_tagger

$pos_tagger

PartOfSpeechContextTagger instance used in adding part of speech annotations to Chinese text


    private
    static    object
    $pos_tagger

$stochastic_term_segmenter

StochasticTermSegmenter instance used for segmenting Chinese text


    private
    static    object
    $stochastic_term_segmenter

$traditional_simplified_map

Holds an associative array whose keys are traditional characters and whose values are the corresponding simplified characters.


    private
    static    array<string|int, mixed>
    $traditional_simplified_map

extractDeepestSpeechPartPhrase()

Takes a phrase tree $tree and a part of speech $pos and returns the deepest $pos-only path in the tree.


    public
            static        extractDeepestSpeechPartPhrase(array<string|int, mixed> $tree, string $pos) : string

Parameters

$tree : array<string|int, mixed>: phrase to extract type from
$pos : string: the part of speech to extract

Return values

string —

the label of deepest $pos only path in $tree

extractObjectParseTree()

Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW: the former has the object of the original phrase (as a string), the latter has the important parts of the object.


    public
            static        extractObjectParseTree(mixed $tree) : array<string|int, mixed>

Parameters

$tree : mixed

Return values

array<string|int, mixed> —

with two fields CONCISE and RAW as described above

extractPredicateParseTree()

Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW: the former has the predicate of the original phrase (as a string), the latter has the important parts of the predicate.


    public
            static        extractPredicateParseTree(mixed $tree) : array<string|int, mixed>

Parameters

$tree : mixed

Return values

array<string|int, mixed> —

with two fields CONCISE and RAW as described above

extractSubjectParseTree()

Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW: the former has the subject of the original phrase (as a string), the latter has the important parts of the subject.


    public
            static        extractSubjectParseTree(mixed $tree) : array<string|int, mixed>

Parameters

$tree : mixed

Return values

array<string|int, mixed> —

with two fields CONCISE and RAW as described above

extractTripletByType()

Takes a triplets array with subject, predicate, and object fields, each with CONCISE and RAW subfields, and produces a triplets array with a $type subfield (where $type is either CONCISE or RAW) that has subject, predicate, object, and QUESTION_ANSWER_LIST subfields.


    public
            static        extractTripletByType(array<string|int, mixed> $sub_pred_obj_triplets, string $type) : array<string|int, mixed>

Parameters

$sub_pred_obj_triplets : array<string|int, mixed>: in format described above
$type : string: either CONCISE or RAW

Return values

array<string|int, mixed> —

$triplets in format described above

extractTripletsParseTree()

Takes a parse tree of a phrase and computes subject, predicate, and object arrays. Each of these arrays consists of two components, CONCISE and RAW: CONCISE corresponds to something more similar to the words in the original phrase, and RAW to the case where extraneous words have been removed.


    public
            static        extractTripletsParseTree(array<string|int, mixed> $tree) : array<string|int, mixed>

Parameters

$tree : array<string|int, mixed>: a parse tree for a sentence

Return values

array<string|int, mixed> —

triplet array

extractTripletsPhrases()

Scans a word list for phrases. For the phrases found, generates a list of question and answer pairs at two levels of granularity: CONCISE (using all terms in the original phrase) and RAW (removing adjectives, etc.).


    public
            static        extractTripletsPhrases(array<string|int, mixed> $word_and_phrase_list) : array<string|int, mixed>

Parameters

$word_and_phrase_list : array<string|int, mixed>: of statements

Return values

array<string|int, mixed> —

with two fields: QUESTION_LIST consisting of triplets (SUBJECT, PREDICATES, OBJECT) where one of the components has been replaced with a question marker.

getNamedEntityTagger()

Get the named entity tagger instance


    public
            static        getNamedEntityTagger() : NamedEntityContextTagger

Return values

NamedEntityContextTagger —

for Chinese

getPosKey()

Determines the part of speech tag of a term using simple rules if possible


    public
            static        getPosKey(string $term) : string

Parameters

$term : string: term for which to try to determine a part of speech via a rule

Return values

string —

part of speech tag, or $term if it can't be determined

getPosKeyList()

Possible tags a term can have that can be determined by a simple rule


    public
            static        getPosKeyList() : array<string|int, mixed>

Return values

array<string|int, mixed> —

getPosTagger()

Get the part-of-speech tagger instance


    public
            static        getPosTagger() : PartOfSpeechContextTagger

Return values

PartOfSpeechContextTagger —

for Chinese

getPosUnknownTagsList()

Return list of possible tags that an unknown term can have


    public
            static        getPosUnknownTagsList() : array<string|int, mixed>

Return values

array<string|int, mixed> —

getStochasticTermSegmenter()

Get the segmenter instance, instantiating it if necessary


    public
            static        getStochasticTermSegmenter() : StochasticTermSegmenter

Return values

StochasticTermSegmenter —

isCardinalNumber()

Check if the term passed in is a Cardinal Number


    public
            static        isCardinalNumber(string $term) : bool

Parameters

$term : string: to check if a cardinal number or not

Return values

bool —

whether it is a cardinal or not

isDate()


    public
            static        isDate(mixed $term) : mixed

Parameters

$term : mixed

Return values

mixed —

isNotCurrentLang()

Check whether none of the characters in the term belong to the current language


    public
            static        isNotCurrentLang(string $term) : bool

Parameters

$term : string: the string to be checked

Return values

bool —

true if none of the characters in $term belong to the current language, false otherwise

isOrdinalNumber()


    public
            static        isOrdinalNumber(mixed $term) : mixed

Parameters

$term : mixed

Return values

mixed —

isPunctuation()

Check if the term is punctuation


    public
            static        isPunctuation(mixed $term) : mixed

Parameters

$term : mixed

Return values

mixed —

isQuestion()

Takes a phrase query entered by a user and returns true if it is a question and false if not


    public
            static        isQuestion( $phrase) : bool

Parameters

$phrase :: any statement

Return values

bool —

returns question word if statement is question

normalize()

Converts traditional Chinese characters to simplified characters


    public
            static        normalize(string $text) : string

Parameters

$text : string: a string of Chinese characters

Return values

string —

normalized form of the text
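A minimal sketch of this kind of mapping, assuming a plain character-to-character lookup: the three-entry $map and the function normalize_sketch() below are hypothetical stand-ins for $traditional_simplified_map and normalize(), whose actual contents and implementation are not shown on this page.

```php
<?php
// Tiny illustrative map (an assumption); keys are traditional characters,
// values are their simplified counterparts.
$map = ["馬" => "马", "語" => "语", "學" => "学"];

// Hypothetical helper: strtr() replaces every key found in the text with
// its value and leaves all other characters untouched.
function normalize_sketch(string $text, array $map): string
{
    return strtr($text, $map);
}

echo normalize_sketch("學語言", $map), "\n"; // 学语言
```

Characters without an entry in the map, such as "言" here, pass through unchanged.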

parseAdjective()

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for an adjective if possible


    public
            static        parseAdjective(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>

Parameters

$tagged_phrase : array<string|int, mixed>: an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
$tree : array<string|int, mixed>: that consists of ["cur_node" => current parse position in $tagged_phrase]

Return values

array<string|int, mixed> —

has fields "cur_node", the index of how far we parsed $tagged_phrase, and "JJ", a subarray with a token node for the adjective that was parsed

parseDeterminer()

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a determiner if possible


    public
            static        parseDeterminer(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>

Parameters

$tagged_phrase : array<string|int, mixed>: an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
$tree : array<string|int, mixed>: that consists of ["cur_node" => current parse position in $tagged_phrase]

Return values

array<string|int, mixed> —

has fields "cur_node", the index of how far we parsed $tagged_phrase, and "DT", a subarray with a token node for the determiner that was parsed

parseNoun()

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a noun if possible


    public
            static        parseNoun(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>

Parameters

$tagged_phrase : array<string|int, mixed>: an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
$tree : array<string|int, mixed>: that consists of ["cur_node" => current parse position in $tagged_phrase]

Return values

array<string|int, mixed> —

has fields "cur_node", the index of how far we parsed $tagged_phrase, and "NN", a subarray with a token node for the noun string that was parsed

parseNounPhrase()

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a noun phrase if possible


    public
            static        parseNounPhrase(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>

Parameters

$tagged_phrase : array<string|int, mixed>: an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
$tree : array<string|int, mixed>: that consists of ["cur_node" => current parse position in $tagged_phrase]

Return values

array<string|int, mixed> —

has fields "cur_node", the index of how far we parsed $tagged_phrase, and "NP", a subarray with possible fields "DT" with value a determiner subtree, "JJ" with value an adjective subtree, and "NN" with value a noun tree

parsePrepositionalPhrases()

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a sequence of prepositional phrases if possible


    public
            static        parsePrepositionalPhrases(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree[, int $index = 1 ]) : array<string|int, mixed>

Parameters

$tagged_phrase : array<string|int, mixed>: an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
$tree : array<string|int, mixed>: that consists of ["cur_node" => current parse position in $tagged_phrase]
$index : int = 1: which term in $tagged_phrase to start to try to parse a preposition from

Return values

array<string|int, mixed> —

has the field "cur_node", the index of how far we parsed $tagged_phrase, followed by additional possible fields (here i represents the ith clause found): "IN_i" with value a preposition subtree, "DT_i" with value a determiner subtree, "JJ_i" with value an adjective subtree, and "NN_i" with value an additional noun subtree

parseQuestion()

Takes a tagged question string that starts with "Who" and returns the question triplet from the question string


    public
            static        parseQuestion(string $tagged_question, int $index, string $question_word) : array<string|int, mixed>

Parameters

$tagged_question : string: part-of-speech tagged question
$index : int: current index in statement
$question_word : string: the question word that needs to be replaced

Return values

array<string|int, mixed> —

parsed triplet

parseTypeList()

Starting at the $cur_node in a $tagged_phrase parse tree for a sentence, create a phrase string for each of the next nodes which belong to part of speech group $type.


    public
            static        parseTypeList(array<string|int, mixed> &$cur_node, array<string|int, mixed> $tagged_phrase, string $type) : string

Parameters

$cur_node : array<string|int, mixed>: node within parse tree
$tagged_phrase : array<string|int, mixed>: parse tree for phrase
$type : string: self::$noun_type, self::$verb_type, etc

Return values

string —

phrase string involving only terms of that $type
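The idea of collecting consecutive terms of one part-of-speech group can be sketched as follows. This is a simplified illustration under stated assumptions: parse_type_list_sketch() is a hypothetical function, it takes an integer position and an array of tags rather than the parameter types documented above, and the sample tagged phrase is invented.

```php
<?php
// Tag group copied from $noun_type.
$noun_type = ["NR", "NT", "NN", "PN"];

// Hypothetical helper: advance $cur_node (passed by reference) while the
// current tag belongs to $type, joining the matching tokens with spaces.
function parse_type_list_sketch(int &$cur_node, array $tagged_phrase,
    array $type): string
{
    $terms = [];
    while (isset($tagged_phrase[$cur_node]) &&
        in_array($tagged_phrase[$cur_node]["tag"], $type, true)) {
        $terms[] = $tagged_phrase[$cur_node]["token"];
        $cur_node++;
    }
    return implode(" ", $terms);
}

// Invented example in the ("token" => ..., "tag" => ...) format.
$tagged_phrase = [
    ["token" => "红", "tag" => "JJ"],
    ["token" => "苹果", "tag" => "NN"],
    ["token" => "树", "tag" => "NN"],
    ["token" => "是", "tag" => "VC"],
];
$cur_node = 1;
$phrase = parse_type_list_sketch($cur_node, $tagged_phrase, $noun_type);
echo $phrase, " (next position: ", $cur_node, ")\n"; // 苹果 树 (next position: 3)
```

Passing the position by reference lets the caller continue parsing from where the run of matching tags ended.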

parseVerb()

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a verb if possible


    public
            static        parseVerb(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>

Parameters

$tagged_phrase : array<string|int, mixed>: an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
$tree : array<string|int, mixed>: that consists of ["cur_node" => current parse position in $tagged_phrase]

Return values

array<string|int, mixed> —

has fields "cur_node", the index of how far we parsed $tagged_phrase, and "VB", a subarray with a token node for the verb string that was parsed

parseVerbPhrase()

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a verb phrase if possible


    public
            static        parseVerbPhrase(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>

Parameters

$tagged_phrase : array<string|int, mixed>: an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
$tree : array<string|int, mixed>: that consists of ["cur_node" => current parse position in $tagged_phrase]

Return values

array<string|int, mixed> —

has fields "cur_node", the index of how far we parsed $tagged_phrase, and "VP", a subarray with possible fields "VB" with value a verb subtree and "NP" with value a noun phrase subtree

parseWholePhrase()

Given a part-of-speech tagged phrase array, generates a parse tree for the phrase using a recursive descent parser.


    public
            static        parseWholePhrase(array<string|int, mixed> $tagged_phrase,  $tree[,  $tree_np_pre = [] ]) : array<string|int, mixed>

Parameters

$tagged_phrase : array<string|int, mixed>: an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
$tree :: that consists of ["cur_node" => current parse position in $tagged_phrase]
$tree_np_pre : = []: subject found from previous sub-sentence

Return values

array<string|int, mixed> —

used to represent a tree. The array has up to three fields: $tree["cur_node"], the index of how far we parsed our $tagged_phrase; $tree["NP"], which contains a subtree for a noun phrase; and $tree["VP"], which contains a subtree for a verb phrase

questionParser()

Takes any question that starts with a WH question word and returns the triplet from the question


    public
            static        questionParser(string $question) : array<string|int, mixed>

Parameters

$question : string: question to parse

Return values

array<string|int, mixed> —

question triplet

questionType()

Helper function for isQuestion


    public
            static        questionType( $term_array,  $type_list) : mixed

Parameters

$term_array :: segmented Chinese terms
$type_list :: current trace of self::$question_words; returns ["ques_words" => ques_words, "types" => types]

Return values

mixed —

rearrangeTripletsByType()

Takes a triplets array with subject, predicate, object fields with CONCISE and RAW subfields and rearranges it to have two fields CONCISE and RAW with subject, predicate, object, and QUESTION_ANSWER_LIST subfields


    public
            static        rearrangeTripletsByType(array<string|int, mixed> $sub_pred_obj_triplets) : array<string|int, mixed>

Parameters

$sub_pred_obj_triplets : array<string|int, mixed>: in format described above

Return values

array<string|int, mixed> —

$processed_triplets in format described above

segment()

A word segmenter.


    public
            static        segment(string $pre_segment[, string $method = "STS" ]) : string

Such a segmenter on input thisisabunchofwords would output this is a bunch of words

Parameters

$pre_segment : string: before segmentation
$method : string = "STS": indicates which method to use

Return values

string —

with words separated by space
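To make the input and output shape concrete, the sketch below segments a string with forward maximum matching against a small dictionary. This is a deliberately simpler technique than the default "STS" (StochasticTermSegmenter) method and is not how segment() is implemented; the function segment_sketch(), its $max_len parameter, and the four-word dictionary are all assumptions for illustration.

```php
<?php
// Hypothetical helper: at each position take the longest dictionary word
// (up to $max_len characters), falling back to a single character.
function segment_sketch(string $text, array $dictionary, int $max_len = 4): string
{
    // Split the UTF-8 string into an array of characters.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $n = count($chars);
    $words = [];
    $i = 0;
    while ($i < $n) {
        $matched = $chars[$i];
        $step = 1;
        for ($len = min($max_len, $n - $i); $len > 1; $len--) {
            $candidate = implode("", array_slice($chars, $i, $len));
            if (isset($dictionary[$candidate])) {
                $matched = $candidate;
                $step = $len;
                break;
            }
        }
        $words[] = $matched;
        $i += $step;
    }
    return implode(" ", $words);
}

// Invented dictionary; array_flip() turns the words into keys for fast lookup.
$dictionary = array_flip(["我们", "喜欢", "搜索", "引擎"]);
echo segment_sketch("我们喜欢搜索引擎", $dictionary), "\n"; // 我们 喜欢 搜索 引擎
```

A greedy matcher like this cannot resolve ambiguous boundaries, which is one reason a statistical segmenter is preferable in practice.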

stopwordsRemover()

Removes the stop words from the page (used for Word Cloud generation and language detection)


    public
            static        stopwordsRemover(mixed $data) : mixed

Parameters

$data : mixed: either a string or an array of strings to remove stop words from

Return values

mixed —

$data with no stop words
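A rough sketch of stop word removal for both accepted input types, assuming that string input has already been segmented into space-separated terms. The function remove_stop_words_sketch() and the four-word stop list are hypothetical; the real stopwordsRemover() may work differently.

```php
<?php
// Small subset of $stop_words, used here only for illustration.
$stop_words = ['我', '的', '是', '了'];

// Hypothetical helper: drops terms that appear in the stop word list.
// Arrays are filtered element by element; strings are split on whitespace,
// filtered, and joined back with single spaces.
function remove_stop_words_sketch($data, array $stop_words)
{
    $stop = array_flip($stop_words);
    $keep = fn($term) => !isset($stop[$term]);
    if (is_array($data)) {
        return array_values(array_filter($data, $keep));
    }
    $terms = preg_split('/\s+/u', trim($data), -1, PREG_SPLIT_NO_EMPTY);
    return implode(" ", array_filter($terms, $keep));
}

echo remove_stop_words_sketch("我 的 书 是 新 的", $stop_words), "\n"; // 书 新
print_r(remove_stop_words_sketch(["我", "喜欢", "书"], $stop_words)); // ["喜欢", "书"]
```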

tagTokenizePartOfSpeech()

Split input text into terms and output an array with one element per term, that element consisting of array with the term token and the part of speech tag.


    public
            static        tagTokenizePartOfSpeech(string $text) : array<string|int, mixed>

Parameters

$text : string: string to tag and tokenize

Return values

array<string|int, mixed> —

of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term), one for each token in $text
