Yioop_V9.5_Source_Code_Documentation

Tokenizer
in package

Chinese-specific tokenization code. Typically, tokenizer.php either contains a stemmer for the language in question or it specifies how many characters are in a char gram.

Tags
author

Chris Pollett

Table of Contents

$adjective_type  : array<string|int, mixed>
List of adjective-like parts of speech that might appear in the lexicon file. Predicative adjective: VA; other noun modifier: JJ.
$adverb_type  : array<string|int, mixed>
List of adverb-like parts of speech that might appear in lexicon file
$conjunction_type  : array<string|int, mixed>
List of conjunction-like parts of speech that might appear in the lexicon file. Coordinating conjunction: CC; subordinating conjunction: CS.
$determiner_type  : mixed
List of determiner-like parts of speech that might appear in the lexicon file. Determiner: DT; cardinal number: CD; ordinal number: OD; measure word: M.
$dot  : string
Dots used in Chinese Numbers
$exception_list  : array<string|int, mixed>
Exception words for the regexes used by isCardinalNumber, isOrdinalNumber, and isDate. For example, "十分" usually means "very", but those functions would classify it as "10 minutes", so it needs to be excluded.
$non_char_preg  : string
Regular expression used to determine whether none of the characters in a term belong to the current language.
$noun_type  : array<string|int, mixed>
List of noun-like parts of speech that might appear in the lexicon file. Proper noun: NR; temporal noun: NT; other noun: NN; pronoun: PN.
$num_dict  : string
The dictionary of characters that can be used as Chinese numbers
$num_end  : string
A list of characters that can be used at the end of numbers
$particle_type  : array<string|int, mixed>
List of particle-like parts of speech that might appear in the lexicon file. These are words with no meaning of their own that can appear anywhere.
$punctuation_preg  : string
A list of characters that can be used as Chinese punctuation
$question_token  : string
Any unique identifier corresponding to the component of a triplet which can be answered using a question answer list
$question_words  : array<string|int, mixed>
Array of words used to determine whether a sentence passed in is a question
$stop_words  : array<string|int, mixed>
A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries. This is also used for language detection
$verb_type  : array<string|int, mixed>
List of verb-like parts of speech that might appear in the lexicon file. Copula: VC; you3 as the main verb: VE; other verb: VV; short passive voice: SB; long passive voice: LB.
$named_entity_tagger  : object
Named entity tagger instance used to recognize noun entities in Chinese text
$pos_tagger  : object
PartOfSpeechContextTagger instance used in adding part of speech annotations to Chinese text
$stochastic_term_segmenter  : object
StochasticTermSegmenter instance used for segmenting Chinese
$traditional_simplified_map  : array<string|int, mixed>
Holds an associative array whose keys are traditional characters and whose values are the corresponding simplified characters.
extractDeepestSpeechPartPhrase()  : string
Takes a phrase tree $tree and a part of speech $pos and returns the deepest $pos-only path in the tree.
extractObjectParseTree()  : array<string|int, mixed>
Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW: the former has the object of the original phrase (as a string), the latter has the important parts of the object.
extractPredicateParseTree()  : array<string|int, mixed>
Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW: the former has the predicate of the original phrase (as a string), the latter has the important parts of the predicate.
extractSubjectParseTree()  : array<string|int, mixed>
Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW: the former has the subject of the original phrase (as a string), the latter has the important parts of the subject.
extractTripletByType()  : array<string|int, mixed>
Takes a triplets array with subject, predicate, and object fields, each with CONCISE and RAW subfields, and produces a triplets array with a $type subfield (where $type is one of CONCISE and RAW) and with subject, predicate, object, and QUESTION_ANSWER_LIST subfields.
extractTripletsParseTree()  : array<string|int, mixed>
Takes a parse tree of a phrase and computes subject, predicate, and object arrays. Each of these arrays consists of two components, CONCISE and RAW: CONCISE corresponds to something closer to the words in the original phrase, and RAW to the case where extraneous words have been removed.
extractTripletsPhrases()  : array<string|int, mixed>
Scans a word list for phrases. For the phrases found, generates a list of question and answer pairs at two levels of granularity: CONCISE (using all terms in the original phrase) and RAW (removing adjectives, etc.).
getNamedEntityTagger()  : NamedEntityContextTagger
Get the named entity tagger instance
getPosKey()  : string
Determines the part of speech tag of a term using simple rules if possible
getPosKeyList()  : array<string|int, mixed>
Possible tags a term can have that can be determined by a simple rule
getPosTagger()  : PartOfSpeechContextTagger
Get the part-of-speech tagger instance
getPosUnknownTagsList()  : array<string|int, mixed>
Return list of possible tags that an unknown term can have
getStochasticTermSegmenter()  : StochasticTermSegmenter
Get the segmenter instance, instantiating it if necessary
isCardinalNumber()  : bool
Check if the term passed in is a Cardinal Number
isDate()  : mixed
isNotCurrentLang()  : bool
Check whether none of the characters in the term belong to the current language
isOrdinalNumber()  : mixed
isPunctuation()  : mixed
Check if the term is a punctuation mark
isQuestion()  : bool
Takes a phrase query entered by a user and returns true if it is a question and false if not
normalize()  : string
Converts traditional Chinese characters to simplified characters
parseAdjective()  : array<string|int, mixed>
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for an adjective if possible
parseDeterminer()  : array<string|int, mixed>
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a determiner if possible
parseNoun()  : array<string|int, mixed>
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a noun if possible
parseNounPhrase()  : array<string|int, mixed>
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a noun phrase if possible
parsePrepositionalPhrases()  : array<string|int, mixed>
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a sequence of prepositional phrases if possible
parseQuestion()  : array<string|int, mixed>
Takes a tagged question string that starts with Who and returns the question triplet from the question string
parseTypeList()  : string
Starting at the $cur_node in a $tagged_phrase parse tree for an English sentence, create a phrase string for each of the next nodes which belong to part of speech group $type.
parseVerb()  : array<string|int, mixed>
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a verb if possible
parseVerbPhrase()  : array<string|int, mixed>
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a verb phrase if possible
parseWholePhrase()  : array<string|int, mixed>
Given a part-of-speech tagged phrase array, generates a parse tree for the phrase using a recursive descent parser.
questionParser()  : array<string|int, mixed>
Takes any question that starts with a WH question word and returns the triplet from the question
questionType()  : mixed
Helper function for isQuestion
rearrangeTripletsByType()  : array<string|int, mixed>
Takes a triplets array with subject, predicate, object fields with CONCISE and RAW subfields and rearranges it to have two fields CONCISE and RAW with subject, predicate, object, and QUESTION_ANSWER_LIST subfields
segment()  : string
A word segmenter.
stopwordsRemover()  : mixed
Removes the stop words from the page (used for Word Cloud generation and language detection)
tagTokenizePartOfSpeech()  : array<string|int, mixed>
Split input text into terms and output an array with one element per term, that element consisting of array with the term token and the part of speech tag.

Properties

$adjective_type

List of adjective-like parts of speech that might appear in the lexicon file. Predicative adjective: VA; other noun modifier: JJ.

public static array<string|int, mixed> $adjective_type = ["VA", "JJ"]

$adverb_type

List of adverb-like parts of speech that might appear in lexicon file

public static array<string|int, mixed> $adverb_type = ["AD"]

$conjunction_type

List of conjunction-like parts of speech that might appear in the lexicon file. Coordinating conjunction: CC; subordinating conjunction: CS.

public static array<string|int, mixed> $conjunction_type = ["CC", "CS"]

$determiner_type

List of determiner-like parts of speech that might appear in the lexicon file. Determiner: DT; cardinal number: CD; ordinal number: OD; measure word: M.

public static mixed $determiner_type = ["DT", "CD", "OD", "M"]
Tags
array

$dot

Dots used in Chinese Numbers

public static string $dot = "\\..点"

$exception_list

Exception words for the regexes used by isCardinalNumber, isOrdinalNumber, and isDate. For example, "十分" usually means "very", but those functions would classify it as "10 minutes", so it needs to be excluded.

public static array<string|int, mixed> $exception_list = ["十分", "一", "一点", "千万", "万一", "一一", "拾", "一时", "千千", "万万", "陆"]

of string

$non_char_preg

Regular expression used to determine whether none of the characters in a term belong to the current language.

public static string $non_char_preg = "/^[^\\p{Han}]+\$/u"

$noun_type

List of noun-like parts of speech that might appear in the lexicon file. Proper noun: NR; temporal noun: NT; other noun: NN; pronoun: PN.

public static array<string|int, mixed> $noun_type = ["NR", "NT", "NN", "PN"]

$num_dict

The dictionary of characters that can be used as Chinese numbers

public static string $num_dict = "1234567890○〇零一二两三四五六七八九十百千万亿" . "0123456789壹贰叁肆伍陆柒捌玖拾廿卅卌佰仟萬億"

$num_end

A list of characters that can be used at the end of numbers

public static string $num_end = "%%"

$particle_type

List of particle-like parts of speech that might appear in the lexicon file. These are words with no meaning of their own that can appear anywhere.

public static array<string|int, mixed> $particle_type = ["AS", "ETC", "DEC", "DEG", "DEV", "MSP", "DER", "SP", "IJ", "FW"]

$punctuation_preg

A list of characters that can be used as Chinese punctuation

public static string $punctuation_preg = "/^([\\x{2000}-\\x{206F}\\x{3000}-\\x{303F}\\x{FF00}-\\x{FF0F}" . "\\x{FF1A}-\\x{FF20}\\x{FF3B}-\\x{FF40}\\x{FF5B}-\\x{FF65}" . "\\x{FFE0}-\\x{FFEE}\\x{21}-\\x{2F}\\x{21}-\\x{2F}" . "\\x{3A}-\\x{40}\\x{5B}-\\x{60}\\x{25cf}])\\1*\$/u"

$question_token

Any unique identifier corresponding to the component of a triplet which can be answered using a question answer list

public static string $question_token = "qqq"

$question_words

Array of words used to determine whether a sentence passed in is a question

public static array<string|int, mixed> $question_words = ["any" => ["谁" => "who", "哪儿|哪里" => "where", "哪个" => "which", "哪些" => "list", "哪" => ["after" => ["1|一" => "which", "[2-9]|[1-9][0-9]+" => "list"], "other" => "where"], "什么|啥|咋" => ["after" => ["地方" => "where", "地点" => "where", "时\\w*" => "when"], "other" => "what"], "怎么|怎样|怎么样|如何" => "how", "为什么" => "why", "多少" => "how many", "几\\w*" => ["any" => ["吗|\\?|?" => "how many"], "other" => false], "多久" => "how long", "多大" => "how big"], "other" => ["any" => ["吗" => "yesno", "呢" => "what about"], "other" => ["other" => false, "any" => ["\\?|?" => "yesno"]]]]

$stop_words

A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries. This is also used for language detection

public static array<string|int, mixed> $stop_words = ['一', '人', '里', '会', '没', '她', '吗', '去', '也', '有', '这', '那', '不', '什', '个', '来', '要', '就', '我', '你', '的', '是', '了', '他', '么', '们', '在', '说', '为', '好', '吧', '知道', '我的', '和', '你的', '想', '只', '很', '都', '对', '把', '啊', '怎', '得', '还', '过', '不是', '到', '样', '飞', '远', '身', '任何', '生活', '够', '号', '兰', '瑞', '达', '或', '愿', '蒂', '別', '军', '正', '是不是', '证', '不用', '三', '乐', '吉', '男人', '告訴', '路', '搞', '可是', '与', '次', '狗', '决', '金', '史', '姆', '部', '正在', '活', '刚', '回家', '贝', '如何', '须', '战', '不會', '夫', '喂', '父', '亚', '肯定', '女孩', '世界']

$verb_type

List of verb-like parts of speech that might appear in the lexicon file. Copula: VC; you3 as the main verb: VE; other verb: VV; short passive voice: SB; long passive voice: LB.

public static array<string|int, mixed> $verb_type = ["VC", "VE", "VV", "SB", "LB"]

$named_entity_tagger

Named entity tagger instance used to recognize noun entities in Chinese text

private static object $named_entity_tagger

$pos_tagger

PartOfSpeechContextTagger instance used in adding part of speech annotations to Chinese text

private static object $pos_tagger

$stochastic_term_segmenter

StochasticTermSegmenter instance used for segmenting Chinese

private static object $stochastic_term_segmenter

$traditional_simplified_map

Holds an associative array whose keys are traditional characters and whose values are the corresponding simplified characters.

private static array<string|int, mixed> $traditional_simplified_map

Methods

extractDeepestSpeechPartPhrase()

Takes a phrase tree $tree and a part of speech $pos and returns the deepest $pos-only path in the tree.

public static extractDeepestSpeechPartPhrase(array<string|int, mixed> $tree, string $pos) : string
Parameters
$tree : array<string|int, mixed>

phrase to extract type from

$pos : string

the part of speech to extract

Return values
string

the label of deepest $pos only path in $tree

extractObjectParseTree()

Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW: the former has the object of the original phrase (as a string), the latter has the important parts of the object.

public static extractObjectParseTree(mixed $tree) : array<string|int, mixed>
Parameters
$tree : mixed
Return values
array<string|int, mixed>

with two fields CONCISE and RAW as described above

extractPredicateParseTree()

Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW: the former has the predicate of the original phrase (as a string), the latter has the important parts of the predicate.

public static extractPredicateParseTree(mixed $tree) : array<string|int, mixed>
Parameters
$tree : mixed
Return values
array<string|int, mixed>

with two fields CONCISE and RAW as described above

extractSubjectParseTree()

Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW: the former has the subject of the original phrase (as a string), the latter has the important parts of the subject.

public static extractSubjectParseTree(mixed $tree) : array<string|int, mixed>
Parameters
$tree : mixed
Return values
array<string|int, mixed>

with two fields CONCISE and RAW as described above

extractTripletByType()

Takes a triplets array with subject, predicate, and object fields, each with CONCISE and RAW subfields, and produces a triplets array with a $type subfield (where $type is one of CONCISE and RAW) and with subject, predicate, object, and QUESTION_ANSWER_LIST subfields.

public static extractTripletByType(array<string|int, mixed> $sub_pred_obj_triplets, string $type) : array<string|int, mixed>
Parameters
$sub_pred_obj_triplets : array<string|int, mixed>

in format described above

$type : string

either CONCISE or RAW

Return values
array<string|int, mixed>

$triplets in format described above

extractTripletsParseTree()

Takes a parse tree of a phrase and computes subject, predicate, and object arrays. Each of these arrays consists of two components, CONCISE and RAW: CONCISE corresponds to something closer to the words in the original phrase, and RAW to the case where extraneous words have been removed.

public static extractTripletsParseTree(array $tree) : array<string|int, mixed>
Parameters
$tree : array

a parse tree for a sentence

Return values
array<string|int, mixed>

triplet array

extractTripletsPhrases()

Scans a word list for phrases. For the phrases found, generates a list of question and answer pairs at two levels of granularity: CONCISE (using all terms in the original phrase) and RAW (removing adjectives, etc.).

public static extractTripletsPhrases(array<string|int, mixed> $word_and_phrase_list) : array<string|int, mixed>
Parameters
$word_and_phrase_list : array<string|int, mixed>

of statements

Return values
array<string|int, mixed>

with two fields: QUESTION_LIST consisting of triplets (SUBJECT, PREDICATES, OBJECT) where one of the components has been replaced with a question marker.
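
The question-marker idea can be illustrated with a minimal sketch. It assumes a flat triplet with string components and uses the documented $question_token value "qqq"; the function name and the shape of the returned list are hypothetical and are not taken from the Yioop source.

```php
<?php
// Documented value of Tokenizer::$question_token.
$question_token = "qqq";

/**
 * Hypothetical helper: for each component of a triplet, build a question
 * string in which that component is replaced by the question token, and
 * map it to the replaced component (the answer).
 */
function questionAnswerListSketch(array $triplet, string $question_token): array
{
    $question_answer_list = [];
    foreach (array_keys($triplet) as $slot) {
        $question = $triplet;
        $question[$slot] = $question_token;
        $question_answer_list[implode(" ", $question)] = $triplet[$slot];
    }
    return $question_answer_list;
}

// Illustrative triplet, not taken from any lexicon.
$triplet = ["subject" => "李白", "predicate" => "写", "object" => "诗"];
print_r(questionAnswerListSketch($triplet, $question_token));
```

Each key is a question with one slot masked and each value is the term that was masked, so a later query containing the token can be matched against the keys.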

getNamedEntityTagger()

Get the named entity tagger instance

public static getNamedEntityTagger() : NamedEntityContextTagger
Return values
NamedEntityContextTagger

for Chinese

getPosKey()

Determines the part of speech tag of a term using simple rules if possible

public static getPosKey(string $term) : string
Parameters
$term : string

term for which to try to get a part of speech via a rule

Return values
string

part of speech tag, or $term if it can't be determined

getPosKeyList()

Possible tags a term can have that can be determined by a simple rule

public static getPosKeyList() : array<string|int, mixed>
Return values
array<string|int, mixed>

getPosTagger()

Get the part-of-speech tagger instance

public static getPosTagger() : PartOfSpeechContextTagger
Return values
PartOfSpeechContextTagger

for Chinese

getPosUnknownTagsList()

Return list of possible tags that an unknown term can have

public static getPosUnknownTagsList() : array<string|int, mixed>
Return values
array<string|int, mixed>

getStochasticTermSegmenter()

Get the segmenter instance, instantiating it if necessary

public static getStochasticTermSegmenter() : StochasticTermSegmenter
Return values
StochasticTermSegmenter
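
"Instantiating it if necessary" describes lazy initialization of a private static property. A minimal sketch of that pattern follows; SegmenterStub and TokenizerSketch are hypothetical stand-ins, and the real constructor arguments of StochasticTermSegmenter are not shown in this documentation.

```php
<?php
// Hypothetical stand-in for StochasticTermSegmenter; counts constructions.
final class SegmenterStub
{
    public static $created = 0;

    public function __construct()
    {
        self::$created++;
    }
}

final class TokenizerSketch
{
    /** Cached instance, created on first request. */
    private static $stochastic_term_segmenter = null;

    public static function getStochasticTermSegmenter(): SegmenterStub
    {
        if (self::$stochastic_term_segmenter === null) {
            self::$stochastic_term_segmenter = new SegmenterStub();
        }
        return self::$stochastic_term_segmenter;
    }
}

$first = TokenizerSketch::getStochasticTermSegmenter();
$second = TokenizerSketch::getStochasticTermSegmenter();
var_dump($first === $second); // both calls return the same object
```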

isCardinalNumber()

Check if the term passed in is a Cardinal Number

public static isCardinalNumber(string $term) : bool
Parameters
$term : string

to check if a cardinal number or not

Return values
bool

whether it is a cardinal or not
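
The sketch below shows one way the documented $num_dict, $dot, $num_end, and $exception_list properties could combine into a cardinal-number check. The property values are copied from this page; the regular expression itself and the class name are assumptions, since the actual pattern used by isCardinalNumber is not documented here.

```php
<?php
final class NumberRulesSketch
{
    // Values copied from the documented static properties.
    public static $num_dict = "1234567890○〇零一二两三四五六七八九十百千万亿"
        . "0123456789壹贰叁肆伍陆柒捌玖拾廿卅卌佰仟萬億";
    public static $dot = "\\..点";
    public static $num_end = "%%";
    public static $exception_list = ["十分", "一", "一点", "千万", "万一",
        "一一", "拾", "一时", "千千", "万万", "陆"];

    /**
     * Assumed rule: digits, optionally a dot followed by more digits,
     * optionally one trailing character from $num_end, unless the term
     * is listed as an exception.
     */
    public static function isCardinalNumber(string $term): bool
    {
        if (in_array($term, self::$exception_list, true)) {
            return false;
        }
        $digits = "[" . self::$num_dict . "]";
        $pattern = "/^" . $digits . "+([" . self::$dot . "]" . $digits
            . "+)?[" . self::$num_end . "]?\$/u";
        return preg_match($pattern, $term) === 1;
    }
}

var_dump(NumberRulesSketch::isCardinalNumber("三千万")); // number characters only
var_dump(NumberRulesSketch::isCardinalNumber("千万"));   // listed as an exception
```

Checking the exception list first keeps common words made entirely of number characters from being tagged as numbers.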

isDate()

public static isDate(mixed $term) : mixed
Parameters
$term : mixed
Return values
mixed

isNotCurrentLang()

Check whether none of the characters in the term belong to the current language

public static isNotCurrentLang(string $term) : bool
Parameters
$term : string

the string to be checked

Return values
bool

true if none of the characters in $term belong to the current language, false otherwise
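
A minimal sketch of this check, assuming it amounts to matching the term against the documented $non_char_preg pattern. The function name is hypothetical; only the pattern is taken from this page.

```php
<?php
// Pattern copied from the documented $non_char_preg property.
$non_char_preg = "/^[^\\p{Han}]+\$/u";

/**
 * Hypothetical stand-in for isNotCurrentLang(): true when the term is
 * non-empty and contains no Han characters at all.
 */
function isNotCurrentLangSketch(string $term, string $pattern): bool
{
    return preg_match($pattern, $term) === 1;
}

var_dump(isNotCurrentLangSketch("hello", $non_char_preg)); // no Han characters
var_dump(isNotCurrentLangSketch("你好", $non_char_preg));   // only Han characters
var_dump(isNotCurrentLangSketch("abc中", $non_char_preg));  // mixed
```

A single Han character is enough to make the match fail, so mixed terms are treated as belonging to the current language.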

isOrdinalNumber()

public static isOrdinalNumber(mixed $term) : mixed
Parameters
$term : mixed
Return values
mixed

isPunctuation()

Check if the term is a punctuation mark

public static isPunctuation(mixed $term) : mixed
Parameters
$term : mixed
Return values
mixed
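
As a hedged illustration, the documented $punctuation_preg pattern can be applied directly with preg_match. The wrapper function below is hypothetical. Note that the pattern ends with a backreference, so it accepts one punctuation character repeated any number of times but not a mix of different punctuation characters.

```php
<?php
// Pattern copied from the documented $punctuation_preg property.
$punctuation_preg = "/^([\\x{2000}-\\x{206F}\\x{3000}-\\x{303F}\\x{FF00}-\\x{FF0F}"
    . "\\x{FF1A}-\\x{FF20}\\x{FF3B}-\\x{FF40}\\x{FF5B}-\\x{FF65}"
    . "\\x{FFE0}-\\x{FFEE}\\x{21}-\\x{2F}\\x{21}-\\x{2F}"
    . "\\x{3A}-\\x{40}\\x{5B}-\\x{60}\\x{25cf}])\\1*\$/u";

/** Hypothetical stand-in for isPunctuation(), returning a bool. */
function isPunctuationSketch(string $term, string $pattern): bool
{
    return preg_match($pattern, $term) === 1;
}

var_dump(isPunctuationSketch("。", $punctuation_preg));   // CJK full stop
var_dump(isPunctuationSketch("!!!", $punctuation_preg));  // one mark repeated
var_dump(isPunctuationSketch("!?", $punctuation_preg));   // two different marks
```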

isQuestion()

Takes a phrase query entered by a user and returns true if it is a question and false if not

public static isQuestion( $phrase) : bool
Parameters
$phrase :

any statement

Return values
bool

returns question word if statement is question
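
The following sketch shows a simplified form of question detection over already segmented terms. It uses only entries of the documented $question_words array whose values are plain strings and ignores the nested "after" and "other" rules, so it is an approximation under those assumptions rather than the lookup isQuestion and questionType perform. The function name is hypothetical.

```php
<?php
// Subset of the documented $question_words["any"] entries with string values.
$flat_question_words = [
    "谁" => "who",
    "哪儿|哪里" => "where",
    "哪个" => "which",
    "怎么|怎样|怎么样|如何" => "how",
    "为什么" => "why",
    "多少" => "how many",
    "多久" => "how long",
    "多大" => "how big",
];

/**
 * Hypothetical helper: returns the question type of the first term that
 * fully matches one of the patterns, or false if no term matches.
 */
function questionTypeSketch(array $terms, array $question_words)
{
    foreach ($terms as $term) {
        foreach ($question_words as $pattern => $type) {
            if (preg_match("/^(" . $pattern . ")\$/u", $term) === 1) {
                return $type;
            }
        }
    }
    return false;
}

var_dump(questionTypeSketch(["你", "是", "谁"], $flat_question_words));
var_dump(questionTypeSketch(["我", "喜欢", "苹果"], $flat_question_words));
```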

normalize()

Converts traditional Chinese characters to simplified characters

public static normalize(string $text) : string
Parameters
$text : string

a string of Chinese characters

Return values
string

normalized form of the text
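
A minimal sketch of map-based normalization, assuming the conversion is a per-character substitution. The four-entry map is an illustrative excerpt only; the contents of the private $traditional_simplified_map are not listed in this documentation, and the function name is hypothetical.

```php
<?php
// Illustrative excerpt: traditional character => simplified character.
$traditional_simplified_map = [
    "萬" => "万",
    "億" => "亿",
    "會" => "会",
    "訴" => "诉",
];

/** Hypothetical stand-in for normalize(). */
function normalizeSketch(string $text, array $map): string
{
    // strtr with an array substitutes every key with its value in one pass
    return strtr($text, $map);
}

echo normalizeSketch("不會告訴", $traditional_simplified_map), "\n";
```

Characters that are not keys of the map, including text that is already simplified, pass through unchanged.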

parseAdjective()

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for an adjective if possible

public static parseAdjective(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>
Parameters
$tagged_phrase : array<string|int, mixed>

an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)

$tree : array<string|int, mixed>

that consists of ["cur_node" => current parse position in $tagged_phrase]

Return values
array<string|int, mixed>

has fields: "cur_node", the index of how far we parsed $tagged_phrase; "JJ", a subarray with a token node for the adjective that was parsed

parseDeterminer()

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a determiner if possible

public static parseDeterminer(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>
Parameters
$tagged_phrase : array<string|int, mixed>

an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)

$tree : array<string|int, mixed>

that consists of ["cur_node" => current parse position in $tagged_phrase]

Return values
array<string|int, mixed>

has fields: "cur_node", the index of how far we parsed $tagged_phrase; "DT", a subarray with a token node for the determiner that was parsed

parseNoun()

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a noun if possible

public static parseNoun(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>
Parameters
$tagged_phrase : array<string|int, mixed>

an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)

$tree : array<string|int, mixed>

that consists of ["cur_node" => current parse position in $tagged_phrase]

Return values
array<string|int, mixed>

has fields: "cur_node", the index of how far we parsed $tagged_phrase; "NN", a subarray with a token node for the noun string that was parsed

parseNounPhrase()

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a noun phrase if possible

public static parseNounPhrase(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>
Parameters
$tagged_phrase : array<string|int, mixed>

an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)

$tree : array<string|int, mixed>

that consists of ["cur_node" => current parse position in $tagged_phrase]

Return values
array<string|int, mixed>

has fields: "cur_node", the index of how far we parsed $tagged_phrase; "NP", a subarray with possible fields "DT" with value a determiner subtree, "JJ" with value an adjective subtree, "NN" with value a noun tree
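
The parse functions share one convention: they receive the tagged phrase plus a tree holding "cur_node", and return a tree whose "cur_node" has advanced past whatever was consumed. The sketch below illustrates that convention for a noun phrase, using the tag lists documented in $determiner_type, $adjective_type, and $noun_type. It is a simplification under stated assumptions: subtrees are stored as plain strings, and returning the input tree unchanged when no noun is found is a guess, not the documented behavior.

```php
<?php
/**
 * Hypothetical noun-phrase step: consume determiners, then adjectives,
 * then nouns, starting at $tree["cur_node"].
 */
function parseNounPhraseSketch(array $tagged_phrase, array $tree): array
{
    // Tag groups copied from the documented static properties.
    $groups = [
        "DT" => ["DT", "CD", "OD", "M"],
        "JJ" => ["VA", "JJ"],
        "NN" => ["NR", "NT", "NN", "PN"],
    ];
    $cur_node = $tree["cur_node"];
    $noun_phrase = [];
    foreach ($groups as $label => $tags) {
        $tokens = [];
        while (isset($tagged_phrase[$cur_node]) &&
            in_array($tagged_phrase[$cur_node]["tag"], $tags, true)) {
            $tokens[] = $tagged_phrase[$cur_node]["token"];
            $cur_node++;
        }
        if ($tokens) {
            $noun_phrase[$label] = implode(" ", $tokens);
        }
    }
    if (!isset($noun_phrase["NN"])) {
        return $tree; // nothing parsed, position unchanged
    }
    return ["cur_node" => $cur_node, "NP" => $noun_phrase];
}

// Illustrative tagged phrase in the documented token/tag pair format.
$tagged_phrase = [
    ["token" => "这", "tag" => "DT"],
    ["token" => "红", "tag" => "JJ"],
    ["token" => "苹果", "tag" => "NN"],
    ["token" => "好吃", "tag" => "VA"],
];
print_r(parseNounPhraseSketch($tagged_phrase, ["cur_node" => 0]));
```

Because each step reports how far it got, a caller such as a verb-phrase parser can resume from the returned "cur_node" without rescanning the phrase.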

parsePrepositionalPhrases()

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a sequence of prepositional phrases if possible

public static parsePrepositionalPhrases(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree[, int $index = 1 ]) : array<string|int, mixed>
Parameters
$tagged_phrase : array<string|int, mixed>

an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)

$tree : array<string|int, mixed>

that consists of ["cur_node" => current parse position in $tagged_phrase]

$index : int = 1

which term in $tagged_phrase to start to try to parse a preposition from

Return values
array<string|int, mixed>

has fields: "cur_node", the index of how far we parsed $tagged_phrase, followed by additional possible fields (here i represents the ith clause found): "IN_i" with value a preposition subtree; "DT_i" with value a determiner subtree; "JJ_i" with value an adjective subtree; "NN_i" with value an additional noun subtree

parseQuestion()

Takes a tagged question string that starts with Who and returns the question triplet from the question string

public static parseQuestion(string $tagged_question, int $index, string $question_word) : array<string|int, mixed>
Parameters
$tagged_question : string

part-of-speech tagged question

$index : int

current index in statement

$question_word : string

the question word that needs to be replaced

Return values
array<string|int, mixed>

parsed triplet

parseTypeList()

Starting at the $cur_node in a $tagged_phrase parse tree for an English sentence, create a phrase string for each of the next nodes which belong to part of speech group $type.

public static parseTypeList(array<string|int, mixed> &$cur_node, array<string|int, mixed> $tagged_phrase, string $type) : string
Parameters
$cur_node : array<string|int, mixed>

node within parse tree

$tagged_phrase : array<string|int, mixed>

parse tree for phrase

$type : string

self::$noun_type, self::$verb_type, etc

Return values
string

phrase string involving only terms of that $type

parseVerb()

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a verb if possible

public static parseVerb(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>
Parameters
$tagged_phrase : array<string|int, mixed>

an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)

$tree : array<string|int, mixed>

that consists of ["cur_node" => current parse position in $tagged_phrase]

Return values
array<string|int, mixed>

has fields: "cur_node", the index of how far we parsed $tagged_phrase; "VB", a subarray with a token node for the verb string that was parsed

parseVerbPhrase()

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a verb phrase if possible

public static parseVerbPhrase(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>
Parameters
$tagged_phrase : array<string|int, mixed>

an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)

$tree : array<string|int, mixed>

that consists of ["cur_node" => current parse position in $tagged_phrase]

Return values
array<string|int, mixed>

has fields: "cur_node", the index of how far we parsed $tagged_phrase; "VP", a subarray with possible fields "VB" with value a verb subtree, "NP" with value a noun phrase subtree

parseWholePhrase()

Given a part-of-speech tagged phrase array, generates a parse tree for the phrase using a recursive descent parser.

public static parseWholePhrase(array<string|int, mixed> $tagged_phrase,  $tree[,  $tree_np_pre = [] ]) : array<string|int, mixed>
Parameters
$tagged_phrase : array<string|int, mixed>

an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)

$tree :

that consists of ["cur_node" => current parse position in $tagged_phrase]

$tree_np_pre : = []

subject found from previous sub-sentence

Return values
array<string|int, mixed>

used to represent a tree. The array has up to three fields: $tree["cur_node"], the index of how far we parsed $tagged_phrase; $tree["NP"], which contains a subtree for a noun phrase; $tree["VP"], which contains a subtree for a verb phrase

questionParser()

Takes any question that starts with a WH question word and returns the triplet from the question

public static questionParser(string $question) : array<string|int, mixed>
Parameters
$question : string

question to parse

Return values
array<string|int, mixed>

question triplet

questionType()

Helper function for isQuestion

public static questionType( $term_array,  $type_list) : mixed
Parameters
$term_array :

segmented Chinese terms

$type_list :

current trace of self::$question_words

Return values
mixed

["ques_words" => ques_words, "types" => types]

rearrangeTripletsByType()

Takes a triplets array with subject, predicate, object fields with CONCISE and RAW subfields and rearranges it to have two fields CONCISE and RAW with subject, predicate, object, and QUESTION_ANSWER_LIST subfields

public static rearrangeTripletsByType(array<string|int, mixed> $sub_pred_obj_triplets) : array<string|int, mixed>
Parameters
$sub_pred_obj_triplets : array<string|int, mixed>

in format described above

Return values
array<string|int, mixed>

$processed_triplets in format described above
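
The rearrangement is essentially a transposition of a two-level array: from part => type to type => part. The sketch below shows that step only. The lowercase part keys and the function name are assumptions, and the QUESTION_ANSWER_LIST subfield mentioned above is omitted because its construction is not specified on this page.

```php
<?php
/**
 * Hypothetical transposition from
 *   [part => ["CONCISE" => ..., "RAW" => ...]]
 * to
 *   ["CONCISE" => [part => ...], "RAW" => [part => ...]].
 */
function rearrangeTripletsByTypeSketch(array $sub_pred_obj_triplets): array
{
    $processed_triplets = [];
    foreach (["CONCISE", "RAW"] as $type) {
        foreach (["subject", "predicate", "object"] as $part) {
            $processed_triplets[$type][$part] =
                $sub_pred_obj_triplets[$part][$type] ?? "";
        }
    }
    return $processed_triplets;
}

// Illustrative input: RAW drops the modifiers kept in CONCISE.
$sub_pred_obj_triplets = [
    "subject" => ["CONCISE" => "这 红 苹果", "RAW" => "苹果"],
    "predicate" => ["CONCISE" => "是", "RAW" => "是"],
    "object" => ["CONCISE" => "好 水果", "RAW" => "水果"],
];
print_r(rearrangeTripletsByTypeSketch($sub_pred_obj_triplets));
```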

segment()

A word segmenter.

public static segment(string $pre_segment[, string $method = "STS" ]) : string

Such a segmenter, on input "thisisabunchofwords", would output "this is a bunch of words".

Parameters
$pre_segment : string

before segmentation

$method : string = "STS"

indicates which method to use

Return values
string

with words separated by space
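
To make the input and output contract concrete, here is a sketch that uses forward maximum matching against a small dictionary. This is a deliberately swapped-in, simpler technique and is not the stochastic method selected by the default "STS" value; the function name, the dictionary, and the $max_len parameter are all hypothetical.

```php
<?php
/**
 * Forward maximum matching: at each position take the longest dictionary
 * entry (up to $max_len characters), else fall back to one character.
 */
function segmentSketch(string $pre_segment, array $dictionary, int $max_len = 5): string
{
    // Split into UTF-8 characters so multi-byte text is handled correctly.
    $chars = preg_split('//u', $pre_segment, -1, PREG_SPLIT_NO_EMPTY);
    $num_chars = count($chars);
    $terms = [];
    $i = 0;
    while ($i < $num_chars) {
        $matched = $chars[$i];
        $matched_len = 1;
        for ($len = min($max_len, $num_chars - $i); $len > 1; $len--) {
            $candidate = implode("", array_slice($chars, $i, $len));
            if (isset($dictionary[$candidate])) {
                $matched = $candidate;
                $matched_len = $len;
                break;
            }
        }
        $terms[] = $matched;
        $i += $matched_len;
    }
    return implode(" ", $terms);
}

// array_flip turns the word list into a set-like lookup table.
$dictionary = array_flip(["我们", "喜欢", "搜索", "引擎", "搜索引擎"]);
echo segmentSketch("我们喜欢搜索引擎", $dictionary), "\n";
```

Greedy matching prefers "搜索引擎" over "搜索" plus "引擎"; a statistical segmenter can choose differently when a shorter split is more probable in context.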

stopwordsRemover()

Removes the stop words from the page (used for Word Cloud generation and language detection)

public static stopwordsRemover(mixed $data) : mixed
Parameters
$data : mixed

either a string or an array of strings to remove stop words from

Return values
mixed

$data with no stop words
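
A minimal sketch of stop word removal that accepts either a string or an array of strings, as described above. It assumes the input is already segmented into space-separated terms and drops whole terms only; whether the real method works at the term level or the character level is not stated on this page. The function name is hypothetical and the stop word list is a short excerpt of the documented $stop_words.

```php
<?php
// Excerpt of the documented $stop_words list.
$stop_words = ["我", "在", "的", "他", "是", "了"];

/**
 * Hypothetical stand-in for stopwordsRemover(): removes whole terms that
 * appear in $stop_words from space-separated text.
 */
function stopwordsRemoverSketch($data, array $stop_words)
{
    $remove = function (string $text) use ($stop_words): string {
        $terms = preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
        $kept = array_filter($terms, function ($term) use ($stop_words) {
            return !in_array($term, $stop_words, true);
        });
        return implode(" ", $kept);
    };
    return is_array($data) ? array_map($remove, $data) : $remove($data);
}

echo stopwordsRemoverSketch("我 在 北京 的 家", $stop_words), "\n";
```

Filtering whole terms avoids stripping a stop character out of a longer word that merely contains it.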

tagTokenizePartOfSpeech()

Split input text into terms and output an array with one element per term, that element consisting of array with the term token and the part of speech tag.

public static tagTokenizePartOfSpeech(string $text) : array<string|int, mixed>
Parameters
$text : string

string to tag and tokenize

Return values
array<string|int, mixed>

of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term), one for each token in $text


        
