Tokenizer
in package
Hindi-specific tokenization code. In particular, it has a stemmer, which is my stab at porting the Java stemming algorithm by Ljiljana Dolamic (University of Neuchatel, www.unine.ch/info/clef/): http://members.unine.ch/jacques.savoy/clef/HindiStemmerLight.java.txt Here, given a word, its stem is the part of the word that is common to all its inflected variants. For example, tall is common to tall, taller, and tallest. A stemmer takes a word and tries to produce its stem.
Table of Contents
- $adjective_type : array<string|int, mixed>
- List of adjective-like parts of speech that might appear in lexicon
- $no_stem_list : array<string|int, mixed>
- Words we don't want to be stemmed
- $noun_type : array<string|int, mixed>
- List of noun-like parts of speech that might appear in lexicon
- $postpositional_type : array<string|int, mixed>
- List of postpositional-like parts of speech that might appear in lexicon
- $question_pattern : string
- Regular expression matching Hindi question words
- $question_token : string
- Any unique identifier corresponding to the component of a triplet which can be answered using a question answer list
- $stop_words : mixed
- A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries. This is also used for language detection
- $verb_type : array<string|int, mixed>
- List of verb-like parts of speech that might appear in lexicon
- extractDeepestSpeechPartPhrase() : string
- Takes a phrase tree $tree and a part of speech $pos and returns the deepest $pos-only path in the tree.
- extractObjectParseTree() : array<string|int, mixed>
- Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW, the former holding the object of the original phrase (as a string), the latter holding the important parts of the object
- extractPredicateParseTree() : array<string|int, mixed>
- Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW, the former holding the predicate of the original phrase (as a string), the latter holding the important parts of the predicate
- extractSubjectParseTree() : array<string|int, mixed>
- Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW, the former holding the subject of the original phrase (as a string), the latter holding the important parts of the subject
- extractTripletByType() : array<string|int, mixed>
- Takes a triplets array with subject, predicate, object fields with CONCISE, RAW subfields and produces triplets with $type subfield where $type is one of CONCISE and RAW and with subject, predicate, object and QUESTION_ANSWER_LIST subfields
- extractTripletsParseTree() : array<string|int, mixed>
- Takes a parse tree of a phrase and computes subject, predicate, and object arrays. Each of these arrays consists of two components, CONCISE and RAW: CONCISE corresponds to something more similar to the words in the original phrase, and RAW to the case where extraneous words have been removed
- extractTripletsPhrases() : array<string|int, mixed>
- Scans a word list for phrases. For each phrase found, generates a list of question and answer pairs at two levels of granularity: CONCISE (using all terms in the original phrase) and RAW (removing adjectives, etc.).
- isQuestion() : bool
- Takes a phrase query entered by the user and returns true if it is a question and false if not
- parseAdjective() : array<string|int, mixed>
- Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for an adjective if possible
- parseNoun() : array<string|int, mixed>
- Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a noun if possible
- parseNounPhrase() : array<string|int, mixed>
- Takes a part-of-speech tagged phrase and parse-tree with a parse-from position and builds a parse tree for a noun phrase if possible
- parsePostpositionPhrase() : array<string|int, mixed>
- Takes a part-of-speech tagged phrase and parse-tree with a parse-from position and builds a parse tree for a sequence of postpositional phrases if possible
- parseQuestion() : array<string|int, mixed>
- Takes a tagged question string that starts with Who and returns the question triplet extracted from it
- parseTypeList() : string
- Starting at the $cur_node in a $tagged_phrase parse tree for a Hindi sentence, create a phrase string for each of the next nodes which belong to part of speech group $type.
- parseVerb() : array<string|int, mixed>
- Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a verb if possible
- parseVerbPhrase() : array<string|int, mixed>
- Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a verb phrase if possible
- parseWholePhrase() : array<string|int, mixed>
- Given a part-of-speech tagged phrase array, generates a parse tree for the phrase using a recursive descent parser.
- questionParser() : array<string|int, mixed>
- Takes questions and returns the triplet from the question
- rearrangeTripletsByType() : array<string|int, mixed>
- Takes a triplets array with subject, predicate, object fields with CONCISE and RAW subfields and rearranges it to have two fields CONCISE and RAW with subject, predicate, object, and QUESTION_ANSWER_LIST subfields
- segment() : string
- Stub function which could be used for a word segmenter.
- stem() : string
- Computes the stem of a Hindi word
- stopwordsRemover() : mixed
- Removes the stop words from the page (used for Word Cloud generation and language detection)
- taggedPartOfSpeechTokensToString() : string
- This method is used to simplify the different tags of speech to a common form
- tagPartsOfSpeechPhrase() : string
- The method takes as input a phrase and returns a string with each term tagged with a part of speech.
- tagTokenizePartOfSpeech() : string
- Uses the lexicon to assign a tag to each token and then uses a rule based approach to assign the most likely of tags to each token
- tagUnknownWords() : array<string|int, mixed>
- This method tags the remaining words in a partially tagged text array.
- removeSuffix() : string
- Removes common Hindi suffixes
Properties
$adjective_type
List of adjective-like parts of speech that might appear in lexicon
public
static array<string|int, mixed>
$adjective_type
= ["JJ", "JJR", "JJS"]
$no_stem_list
Words we don't want to be stemmed
public
static array<string|int, mixed>
$no_stem_list
= []
$noun_type
List of noun-like parts of speech that might appear in lexicon
public
static array<string|int, mixed>
$noun_type
= ["NN", "NNS", "NNP", "NNPS", "DT"]
$postpositional_type
List of postpositional-like parts of speech that might appear in lexicon
public
static array<string|int, mixed>
$postpositional_type
= ["IN", "inj", "PREP", "proNN", "CONJ", "INT", "particle", "case", "PSP", "direct_DT", "PRP"]
$question_pattern
Regular expression matching Hindi question words
public
static string
$question_pattern
= "/\\b[क्या|कब|कहा|क्यों|कौन|जिसे|जिसका|कहाँ|कहां]\\b/ui"
$question_token
Any unique identifier corresponding to the component of a triplet which can be answered using a question answer list
public
static string
$question_token
= "qqq"
$stop_words
A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries. This is also used for language detection
public
static mixed
$stop_words
= ['जैसा', 'मैं', 'उसके', 'कि', 'वह', 'था', 'के', 'लिए', 'पर', 'हैं', 'साथ', 'वे', 'हो', 'पर', 'एक', 'है', 'इस', 'से', 'द्वारा', 'गरम', 'शब्द', 'लेकिन', 'क्या', 'कुछ', 'है', 'यह', 'आप', 'या', 'था', 'की', 'तक', 'और', 'एक', 'में', 'हम', 'कर', 'सकते', 'हैं', 'बाहर', 'अन्य', 'थे', 'जो', 'कर', 'उनके', 'समय', 'अगर', 'होगा', 'कैसे', 'कहा', 'एक', 'प्रत्येक', 'बता', 'करता', 'है', 'सेट', 'तीन', 'चाहते हैं', 'हवा', 'अच्छी तरह से', 'भी', 'खेलने', 'छोटे', 'अंत', 'डाल', 'घर', 'पढ़ा', 'हाथ', 'बंदरगाह', 'बड़ा', 'जादू', 'जोड़', 'और', 'भी', 'भूमि', 'यहाँ', 'चाहिए', 'बड़ा', 'उच्च', 'ऐसा', 'का', 'पालन', 'करें', 'अधिनियम', 'क्यों', 'पूछना', 'पुरुषों', 'परिवर्तन', 'चला', 'गया', 'प्रकाश', 'तरह', 'बंद', 'आवश्यकता', 'घर', 'तस्वीर', 'कोशिश', 'हमें', 'फिर', 'पशु', 'बिंदु', 'मां', 'दुनिया', 'निकट', 'बनाना', 'आत्म', 'पृथ्वी', 'पिता']
$verb_type
List of verb-like parts of speech that might appear in lexicon
public
static array<string|int, mixed>
$verb_type
= ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "RB"]
Methods
extractDeepestSpeechPartPhrase()
Takes a phrase tree $tree and a part of speech $pos and returns the deepest $pos-only path in the tree.
public
static extractDeepestSpeechPartPhrase(array<string|int, mixed> $tree, string $pos) : string
Parameters
- $tree : array<string|int, mixed>
-
phrase to extract type from
- $pos : string
-
the part of speech to extract
Return values
string —the label of deepest $pos only path in $tree
extractObjectParseTree()
Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW, the former holding the object of the original phrase (as a string), the latter holding the important parts of the object
public
static extractObjectParseTree(mixed $tree) : array<string|int, mixed>
Parameters
- $tree : mixed
Return values
array<string|int, mixed> —with two fields CONCISE and RAW as described above
extractPredicateParseTree()
Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW, the former holding the predicate of the original phrase (as a string), the latter holding the important parts of the predicate
public
static extractPredicateParseTree(mixed $tree) : array<string|int, mixed>
Parameters
- $tree : mixed
Return values
array<string|int, mixed> —with two fields CONCISE and RAW as described above
extractSubjectParseTree()
Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW, the former holding the subject of the original phrase (as a string), the latter holding the important parts of the subject
public
static extractSubjectParseTree(mixed $tree) : array<string|int, mixed>
Parameters
- $tree : mixed
Return values
array<string|int, mixed> —with two fields CONCISE and RAW as described above
extractTripletByType()
Takes a triplets array with subject, predicate, object fields with CONCISE, RAW subfields and produces triplets with $type subfield where $type is one of CONCISE and RAW and with subject, predicate, object and QUESTION_ANSWER_LIST subfields
public
static extractTripletByType(array<string|int, mixed> $sub_pred_obj_triplets, string $type) : array<string|int, mixed>
Parameters
- $sub_pred_obj_triplets : array<string|int, mixed>
-
in format described above
- $type : string
-
either CONCISE or RAW
Return values
array<string|int, mixed> —$triplets in format described above
extractTripletsParseTree()
Takes a parse tree of a phrase and computes subject, predicate, and object arrays. Each of these arrays consists of two components, CONCISE and RAW: CONCISE corresponds to something more similar to the words in the original phrase, and RAW to the case where extraneous words have been removed
public
static extractTripletsParseTree(array<string|int, mixed> $parse_tree) : array<string|int, mixed>
Parameters
- $parse_tree : array<string|int, mixed>
-
a parse tree for a sentence
Return values
array<string|int, mixed> —triplet array
extractTripletsPhrases()
Scans a word list for phrases. For each phrase found, generates a list of question and answer pairs at two levels of granularity: CONCISE (using all terms in the original phrase) and RAW (removing adjectives, etc.).
public
static extractTripletsPhrases(array<string|int, mixed> $word_and_phrase_list) : array<string|int, mixed>
Parameters
- $word_and_phrase_list : array<string|int, mixed>
-
of statements
Return values
array<string|int, mixed> —with two fields: QUESTION_LIST consisting of (SUBJECT, COMPLEMENT) where one of the components has been replaced with a question marker.
isQuestion()
Takes a phrase query entered by the user and returns true if it is a question and false if not
public
isQuestion($phrase) : bool
Parameters
- $phrase
-
phrase query entered by the user
Return values
bool —returns true if statement is question
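The following is a minimal sketch of how such a check could work, not the class's implementation. It assumes whitespace tokenization and a plain word list (the words that appear in the $question_pattern default on this page) instead of the regular expression itself. The name sketchIsQuestion is hypothetical.

```php
<?php
/**
 * Hypothetical helper: returns true if any token of $phrase is a known
 * question word. Tokens are split on whitespace, "?" and the danda "।".
 */
function sketchIsQuestion(string $phrase, array $question_words): bool
{
    $tokens = preg_split('/[\s?।]+/u', $phrase, -1, PREG_SPLIT_NO_EMPTY);
    if ($tokens === false) {
        return false; // e.g. the input was not valid UTF-8
    }
    foreach ($tokens as $token) {
        if (in_array($token, $question_words, true)) {
            return true;
        }
    }
    return false;
}

// Question words copied from the $question_pattern default on this page.
$question_words = ["क्या", "कब", "कहा", "क्यों", "कौन", "जिसे", "जिसका",
    "कहाँ", "कहां"];
var_dump(sketchIsQuestion("आप कौन हैं?", $question_words)); // bool(true)
var_dump(sketchIsQuestion("यह किताब है", $question_words)); // bool(false)
```

Comparing whole tokens avoids relying on how word boundaries behave around Devanagari combining marks in a regular expression.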
parseAdjective()
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for an adjective if possible
public
static parseAdjective(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>
Parameters
- $tagged_phrase : array<string|int, mixed>
-
an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
- $tree : array<string|int, mixed>
-
that consists of ["cur_node" => current parse position in $tagged_phrase]
Return values
array<string|int, mixed> —has fields "cur_node", the index of how far we parsed $tagged_phrase, and "JJ", a subarray with a token node for the adjective that was parsed
parseNoun()
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a noun if possible
public
static parseNoun(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>
Parameters
- $tagged_phrase : array<string|int, mixed>
-
an array of pairs of the form ("token" => token_for_term, "tag" => part_of_speech_tag_for_term)
- $tree : array<string|int, mixed>
-
that consists of ["cur_node" => current parse position in $tagged_phrase]
Return values
array<string|int, mixed> —has fields "cur_node", the index of how far we parsed $tagged_phrase, and "NN", a subarray with a token node for the noun string that was parsed
parseNounPhrase()
Takes a part-of-speech tagged phrase and parse-tree with a parse-from position and builds a parse tree for a noun phrase if possible
public
static parseNounPhrase(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>
Parameters
- $tagged_phrase : array<string|int, mixed>
-
an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
- $tree : array<string|int, mixed>
-
that consists of ["cur_node" => current parse position in $tagged_phrase]
Return values
array<string|int, mixed> —has fields "cur_node", the index of how far we parsed $tagged_phrase, "JJ", with value an adjective subtree, and "NN", with value a noun subtree
parsePostpositionPhrase()
Takes a part-of-speech tagged phrase and parse-tree with a parse-from position and builds a parse tree for a sequence of postpositional phrases if possible
public
static parsePostpositionPhrase(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree[, int $index = 1 ]) : array<string|int, mixed>
Parameters
- $tagged_phrase : array<string|int, mixed>
-
an array of pairs of the form ("token" => token_for_term, "tag" => part_of_speech_tag_for_term)
- $tree : array<string|int, mixed>
-
that consists of ["cur_node" => current parse position in $tagged_phrase]
- $index : int = 1
-
position in array to start from
Return values
array<string|int, mixed> —has field "cur_node", the index of how far we parsed $tagged_phrase
parseQuestion()
Takes a tagged question string that starts with Who and returns the question triplet extracted from it
public
static parseQuestion(string $tagged_question, int $index) : array<string|int, mixed>
Parameters
- $tagged_question : string
-
part-of-speech tagged question
- $index : int
-
current index in statement
Return values
array<string|int, mixed> —parsed triplet
parseTypeList()
Starting at the $cur_node in a $tagged_phrase parse tree for a Hindi sentence, create a phrase string for each of the next nodes which belong to part of speech group $type.
public
static parseTypeList(array<string|int, mixed> &$cur_node, array<string|int, mixed> $tagged_phrase, string $type) : string
Parameters
- $cur_node : array<string|int, mixed>
-
node within parse tree
- $tagged_phrase : array<string|int, mixed>
-
parse tree for phrase
- $type : string
-
self::$noun_type, self::$verb_type, etc
Return values
string —phrase string involving only terms of that $type
parseVerb()
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a verb if possible
public
static parseVerb(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>
Parameters
- $tagged_phrase : array<string|int, mixed>
-
an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
- $tree : array<string|int, mixed>
-
that consists of ["cur_node" => current parse position in $tagged_phrase]
Return values
array<string|int, mixed> —has fields "cur_node", the index of how far we parsed $tagged_phrase, and "VB", a subarray with a token node for the verb string that was parsed
parseVerbPhrase()
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a verb phrase if possible
public
static parseVerbPhrase(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>
Parameters
- $tagged_phrase : array<string|int, mixed>
-
an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
- $tree : array<string|int, mixed>
-
that consists of ["cur_node" => current parse position in $tagged_phrase]
Return values
array<string|int, mixed> —has fields "cur_node", the index of how far we parsed $tagged_phrase, and "VP", a subarray with possible field "VB" with value a verb subtree
parseWholePhrase()
Given a part-of-speech tagged phrase array, generates a parse tree for the phrase using a recursive descent parser.
public
static parseWholePhrase(array<string|int, mixed> $tagged_phrase[, $tree = [] ]) : array<string|int, mixed>
Parameters
- $tagged_phrase : array<string|int, mixed>
-
an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
- $tree : = []
-
this parameter is ignored but kept so as to match other methods in the recursive descent parser, such as parseNounPhrase
Return values
array<string|int, mixed> —used to represent a tree. The array has up to four fields: $tree["cur_node"], the index of how far we parsed our $tagged_phrase; $tree["NP"], which contains a subtree for a subject phrase; $tree["POST"], which contains a subtree for an object phrase; and $tree["VP"], which contains a subtree for a predicate phrase
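The sketch below illustrates the descent style described here: one function per grammar rule, each taking the tagged phrase plus a tree carrying "cur_node" and returning a tree with an updated "cur_node". It is a simplified stand-in, not the class's code. All sketch* function names are hypothetical, subtrees are reduced to plain strings, the "POST" branch is omitted, and the grammar (optional adjectives, then nouns, then verbs) is an assumption made for the example.

```php
<?php
// Tag groups copied from the property defaults documented on this page.
const ADJECTIVE_TAGS = ["JJ", "JJR", "JJS"];
const NOUN_TAGS = ["NN", "NNS", "NNP", "NNPS", "DT"];
const VERB_TAGS = ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "RB"];

/**
 * Hypothetical helper: consumes consecutive tokens whose tag is in $tags,
 * starting at $tree["cur_node"], and stores them under $label.
 */
function sketchParseRun(array $tagged_phrase, array $tree, array $tags,
    string $label): array
{
    $start = $tree["cur_node"];
    $cur = $start;
    $tokens = [];
    while (isset($tagged_phrase[$cur]) &&
        in_array($tagged_phrase[$cur]["tag"], $tags, true)) {
        $tokens[] = $tagged_phrase[$cur]["token"];
        $cur++;
    }
    if ($tokens === []) {
        return ["cur_node" => $start]; // nothing matched, position unchanged
    }
    return ["cur_node" => $cur, $label => implode(" ", $tokens)];
}

/** Noun phrase := optional adjectives followed by at least one noun. */
function sketchParseNounPhrase(array $tagged_phrase, array $tree): array
{
    $adjective = sketchParseRun($tagged_phrase, $tree, ADJECTIVE_TAGS, "JJ");
    $noun = sketchParseRun($tagged_phrase, $adjective, NOUN_TAGS, "NN");
    if (!isset($noun["NN"])) {
        return ["cur_node" => $tree["cur_node"]]; // no noun: backtrack
    }
    // Array union keeps "cur_node" and "NN" from $noun and adds "JJ" if set.
    return $noun + $adjective;
}

/** Whole phrase := noun phrase, then a run of verbs. */
function sketchParseWholePhrase(array $tagged_phrase): array
{
    $tree = ["cur_node" => 0];
    foreach (["NP", "VP"] as $branch) {
        $subtree = ($branch === "NP")
            ? sketchParseNounPhrase($tagged_phrase, $tree)
            : sketchParseRun($tagged_phrase, $tree, VERB_TAGS, "VB");
        $tree["cur_node"] = $subtree["cur_node"];
        unset($subtree["cur_node"]);
        if ($subtree !== []) {
            $tree[$branch] = $subtree;
        }
    }
    return $tree;
}

$tagged_phrase = [
    ["token" => "छोटा", "tag" => "JJ"],
    ["token" => "बच्चा", "tag" => "NN"],
    ["token" => "सोया", "tag" => "VB"],
];
$tree = sketchParseWholePhrase($tagged_phrase);
print_r($tree);
```

For the three-token phrase above, the returned tree has "cur_node" equal to 3, an "NP" branch holding the adjective and noun, and a "VP" branch holding the verb.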
questionParser()
Takes questions and returns the triplet from the question
public
static questionParser(string $question) : array<string|int, mixed>
Parameters
- $question : string
-
question to parse
Return values
array<string|int, mixed> —question triplet
rearrangeTripletsByType()
Takes a triplets array with subject, predicate, object fields with CONCISE and RAW subfields and rearranges it to have two fields CONCISE and RAW with subject, predicate, object, and QUESTION_ANSWER_LIST subfields
public
static rearrangeTripletsByType(array<string|int, mixed> $sub_pred_obj_triplets) : array<string|int, mixed>
Parameters
- $sub_pred_obj_triplets : array<string|int, mixed>
-
in format described above
Return values
array<string|int, mixed> —$processed_triplets in format described above
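As an illustration of the rearrangement described here, the sketch below regroups a triplet from "component first, type second" to "type first, component second" and attaches a question and answer list. The lowercase key names, the question string layout, and the sketch* function names are assumptions made for the example; only CONCISE, RAW, QUESTION_ANSWER_LIST, and the "qqq" token come from this page.

```php
<?php
/**
 * Hypothetical helper: for each slot of a flat triplet, builds a question
 * string in which that slot is replaced by $question_token, mapped to the
 * text that was replaced (the answer).
 */
function sketchQuestionAnswerList(array $triplet,
    string $question_token = "qqq"): array
{
    $list = [];
    foreach (["subject", "predicate", "object"] as $slot) {
        $parts = $triplet;
        $answer = $parts[$slot];
        $parts[$slot] = $question_token;
        $question = implode(" ",
            [$parts["subject"], $parts["predicate"], $parts["object"]]);
        $list[$question] = $answer;
    }
    return $list;
}

/** Regroups [component][type] into [type][component]. */
function sketchRearrangeTripletsByType(array $sub_pred_obj_triplets): array
{
    $processed_triplets = [];
    foreach (["CONCISE", "RAW"] as $type) {
        $triplet = [
            "subject" => $sub_pred_obj_triplets["subject"][$type],
            "predicate" => $sub_pred_obj_triplets["predicate"][$type],
            "object" => $sub_pred_obj_triplets["object"][$type],
        ];
        $processed_triplets[$type] = $triplet +
            ["QUESTION_ANSWER_LIST" => sketchQuestionAnswerList($triplet)];
    }
    return $processed_triplets;
}

$input = [
    "subject" => ["CONCISE" => "छोटा बच्चा", "RAW" => "बच्चा"],
    "predicate" => ["CONCISE" => "खाता है", "RAW" => "खाता है"],
    "object" => ["CONCISE" => "मीठा आम", "RAW" => "आम"],
];
$processed = sketchRearrangeTripletsByType($input);
print_r($processed["RAW"]);
```

In this sketch, the RAW entry maps the question "qqq खाता है आम" to the answer "बच्चा".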
segment()
Stub function which could be used for a word segmenter.
public
static segment(string $pre_segment) : string
Such a segmenter on input thisisabunchofwords would output this is a bunch of words
Parameters
- $pre_segment : string
-
before segmentation
Return values
string —should return a string with words separated by spaces; in this case it does nothing
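Since the method is a stub, the following is only a sketch of what a segmenter of this kind might do, using greedy longest-match against a word list. The function name and the dictionary are hypothetical, and the code works on bytes, so it assumes single-byte (ASCII) input such as the example from the description.

```php
<?php
/**
 * Hypothetical greedy segmenter: at each position, takes the longest
 * dictionary word that matches, or a single byte if none does.
 */
function sketchSegment(string $pre_segment, array $dictionary): string
{
    $lookup = array_flip($dictionary);
    $max_len = ($dictionary === []) ? 1
        : max(array_map('strlen', $dictionary));
    $words = [];
    $pos = 0;
    $total = strlen($pre_segment);
    while ($pos < $total) {
        $match = $pre_segment[$pos]; // fallback: one byte
        for ($len = min($max_len, $total - $pos); $len > 1; $len--) {
            $candidate = substr($pre_segment, $pos, $len);
            if (isset($lookup[$candidate])) {
                $match = $candidate;
                break;
            }
        }
        $words[] = $match;
        $pos += strlen($match);
    }
    return implode(" ", $words);
}

$dictionary = ["this", "is", "a", "bunch", "of", "words"];
echo sketchSegment("thisisabunchofwords", $dictionary), "\n";
// prints "this is a bunch of words"
```

Greedy matching is the simplest choice and can pick the wrong split when one dictionary word is a prefix of another; it is shown only to make the intended input and output concrete.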
stem()
Computes the stem of a Hindi word
public
static stem(string $word) : string
Parameters
- $word : string
-
the string to stem
Return values
string —the stem of $word
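To make the idea of light stemming concrete, here is a minimal suffix-stripping sketch. It is not the ported algorithm: the function name is hypothetical, and the two-entry suffix list is an illustrative assumption rather than the suffix set used by this class.

```php
<?php
/**
 * Hypothetical light stemmer: strips the longest matching suffix from
 * $word, unless the word is listed in $no_stem_list.
 */
function sketchStem(string $word, array $suffixes,
    array $no_stem_list = []): string
{
    if (in_array($word, $no_stem_list, true)) {
        return $word;
    }
    // Try longer suffixes first so the longest match wins.
    usort($suffixes, function ($a, $b) {
        return strlen($b) <=> strlen($a);
    });
    foreach ($suffixes as $suffix) {
        $suffix_len = strlen($suffix);
        // Byte-wise comparison is safe for valid UTF-8: a suffix that
        // matches at the byte level starts on a character boundary.
        if ($suffix_len > 0 && strlen($word) > $suffix_len &&
            substr($word, -$suffix_len) === $suffix) {
            return substr($word, 0, -$suffix_len);
        }
    }
    return $word;
}

// Illustrative plural endings, written as code point escapes:
// U+0947 U+0902 (vowel sign E + anusvara), U+094B U+0902 (O + anusvara).
$suffixes = ["\u{0947}\u{0902}", "\u{094B}\u{0902}"];
echo sketchStem("किताबें", $suffixes), "\n"; // किताब
echo sketchStem("किताबों", $suffixes), "\n"; // किताब
```

Both inflected forms reduce to the same string, in the same way that taller and tallest reduce to tall in the class description.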
stopwordsRemover()
Removes the stop words from the page (used for Word Cloud generation and language detection)
public
static stopwordsRemover(mixed $data) : mixed
Parameters
- $data : mixed
-
either a string or an array of strings to remove stop words from
Return values
mixed —$data with no stop words
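A minimal sketch of stop word removal for both accepted input shapes follows. The function name is hypothetical and the implementation is an assumption: it filters whitespace-separated tokens, so multi-word entries in $stop_words (such as 'अच्छी तरह से') would not be matched by it. The list used is a small subset of the $stop_words default on this page.

```php
<?php
/**
 * Hypothetical helper: removes stop words from a string, or from every
 * string in an array, by filtering whitespace-separated tokens.
 */
function sketchStopwordsRemover($data, array $stop_words)
{
    if (is_array($data)) {
        return array_map(function ($item) use ($stop_words) {
            return sketchStopwordsRemover($item, $stop_words);
        }, $data);
    }
    $tokens = preg_split('/\s+/u', (string) $data, -1, PREG_SPLIT_NO_EMPTY);
    if ($tokens === false) {
        return $data; // e.g. invalid UTF-8: leave the input untouched
    }
    $kept = array_filter($tokens, function ($token) use ($stop_words) {
        return !in_array($token, $stop_words, true);
    });
    return implode(" ", $kept);
}

$stop_words = ["यह", "एक", "है", "और"];
echo sketchStopwordsRemover("यह एक किताब है", $stop_words), "\n"; // किताब
print_r(sketchStopwordsRemover(["यह किताब", "एक आम"], $stop_words));
```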
taggedPartOfSpeechTokensToString()
This method is used to simplify the different tags of speech to a common form
public
static taggedPartOfSpeechTokensToString(array<string|int, mixed> $tagged_tokens[, bool $with_tokens = true ]) : string
Parameters
- $tagged_tokens : array<string|int, mixed>
-
which is an array of tokens assigned tags.
- $with_tokens : bool = true
-
whether to include the terms and the tags in the output string or just the part of speech tags
Return values
string —$tagged_phrase which is a string of the form token~pos
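The token~pos output format can be illustrated with the sketch below. It only joins tokens and tags and does not attempt the tag simplification this method performs. The function name, the space separator, and the ("token", "tag") pair layout (borrowed from the parse methods on this page) are assumptions.

```php
<?php
/**
 * Hypothetical helper: renders tagged tokens as "token~tag" items, or as
 * bare tags when $with_tokens is false.
 */
function sketchTaggedTokensToString(array $tagged_tokens,
    bool $with_tokens = true): string
{
    $parts = [];
    foreach ($tagged_tokens as $pair) {
        $parts[] = $with_tokens
            ? $pair["token"] . "~" . $pair["tag"]
            : $pair["tag"];
    }
    return implode(" ", $parts);
}

$tagged_tokens = [
    ["token" => "बच्चा", "tag" => "NN"],
    ["token" => "सोया", "tag" => "VB"],
];
echo sketchTaggedTokensToString($tagged_tokens), "\n";        // बच्चा~NN सोया~VB
echo sketchTaggedTokensToString($tagged_tokens, false), "\n"; // NN VB
```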
tagPartsOfSpeechPhrase()
The method takes as input a phrase and returns a string with each term tagged with a part of speech.
public
static tagPartsOfSpeechPhrase(string $phrase[, bool $with_tokens = true ]) : string
Parameters
- $phrase : string
-
text to add part-of-speech tags to
- $with_tokens : bool = true
-
whether to include the terms and the tags in the output string or just the part of speech tags
Return values
string —$tagged_phrase which is a string of format term~pos
tagTokenizePartOfSpeech()
Uses the lexicon to assign a tag to each token and then uses a rule based approach to assign the most likely of tags to each token
public
static tagTokenizePartOfSpeech(string $text) : string
Parameters
- $text : string
-
input phrase which is to be tagged
Return values
string —$result which is an array of token => tag
tagUnknownWords()
This method tags the remaining words in a partially tagged text array.
public
static tagUnknownWords(array<string|int, mixed> $partially_tagged_text) : array<string|int, mixed>
Parameters
- $partially_tagged_text : array<string|int, mixed>
-
term array representing a text passage. Each element in the array is in turn an associative array [token => token_value, tag => tag_value (may be empty)]
Return values
array<string|int, mixed> —text passage array where all empty tags now have values
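The two-stage tagging described on this page (lexicon lookup first, then filling in what the lexicon missed) might be sketched as follows. Everything specific here is an assumption: the function names, the tiny lexicon, and especially the fallback rule of tagging every unknown token as "NN", which stands in for whatever rules the class actually applies.

```php
<?php
/**
 * Hypothetical first stage: looks each token up in $lexicon and leaves
 * the tag empty when the token is unknown.
 */
function sketchTagWithLexicon(string $text, array $lexicon): array
{
    $tokens = preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $tagged = [];
    foreach ($tokens as $token) {
        $tagged[] = ["token" => $token, "tag" => $lexicon[$token] ?? ""];
    }
    return $tagged;
}

/**
 * Hypothetical second stage: gives every still-untagged entry a fallback
 * tag (assumed here to be "NN").
 */
function sketchTagUnknownWords(array $partially_tagged_text,
    string $fallback_tag = "NN"): array
{
    foreach ($partially_tagged_text as $i => $entry) {
        if ($entry["tag"] === "") {
            $partially_tagged_text[$i]["tag"] = $fallback_tag;
        }
    }
    return $partially_tagged_text;
}

$lexicon = ["आम" => "NN", "खाता" => "VB"];
$partial = sketchTagWithLexicon("मोहन आम खाता", $lexicon);
$tagged = sketchTagUnknownWords($partial);
print_r($tagged);
```

Here the first token is missing from the lexicon, so it comes out of the first stage with an empty tag and receives the fallback tag in the second.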
removeSuffix()
Removes common Hindi suffixes
private
static removeSuffix(string $word) : string
Parameters
- $word : string
-
to remove suffixes from
Return values
string —result of suffix removal