Yioop_V9.5_Source_Code

Tokenizer

in package Application

Hindi-specific tokenization code. In particular, it has a stemmer, which is my stab at porting Ljiljana Dolamic's (University of Neuchatel, www.unine.ch/info/clef/) Java stemming algorithm (http://members.unine.ch/jacques.savoy/clef/HindiStemmerLight.java.txt). Here, given a word, its stem is the part of the word that is common to all its inflected variants. For example, tall is common to tall, taller, and tallest. A stemmer takes a word and tries to produce its stem.
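As an illustration of the idea of a stem only, here is a minimal suffix-stripping sketch. It is not the ported Hindi algorithm: the function name toyStem, the English suffix list, and the minimum stem length are all assumptions chosen to reproduce the tall/taller/tallest example above.

```php
<?php
/**
 * Toy suffix stripper (hypothetical helper, not Yioop's stem()).
 * Removes the first matching suffix, provided at least
 * $min_stem_length bytes of the word remain afterwards.
 */
function toyStem(string $word, array $suffixes = ["est", "er"],
    int $min_stem_length = 2): string
{
    foreach ($suffixes as $suffix) {
        $stem_length = strlen($word) - strlen($suffix);
        if ($stem_length >= $min_stem_length &&
            substr($word, -strlen($suffix)) === $suffix) {
            return substr($word, 0, $stem_length);
        }
    }
    // No listed suffix applies, so the word is treated as its own stem.
    return $word;
}

foreach (["tall", "taller", "tallest"] as $word) {
    echo $word . " -> " . toyStem($word) . "\n";
}
```

Longer suffixes are listed first so that tallest loses est rather than leaving a partial ending behind. Lengths here are counted in bytes, which is enough for this ASCII example.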

$adjective_type

List of adjective-like parts of speech that might appear in lexicon


    public
    static    array<string|int, mixed>
    $adjective_type
     = ["JJ", "JJR", "JJS"]

$no_stem_list

Words we don't want to be stemmed


    public
    static    array<string|int, mixed>
    $no_stem_list
     = []

$noun_type

List of noun-like parts of speech that might appear in lexicon


    public
    static    array<string|int, mixed>
    $noun_type
     = ["NN", "NNS", "NNP", "NNPS", "DT"]

$postpositional_type

List of postpositional-like parts of speech that might appear in lexicon


    public
    static    array<string|int, mixed>
    $postpositional_type
     = ["IN", "inj", "PREP", "proNN", "CONJ", "INT", "particle", "case", "PSP", "direct_DT", "PRP"]

$question_pattern

Regular expression used to recognize Hindi question words


    public
    static    string
    $question_pattern
     = "/\\b[क्या|कब|कहा|क्यों|कौन|जिसे|जिसका|कहाँ|कहां]\\b/ui"

$question_token

Any unique identifier corresponding to the component of a triplet which can be answered using a question answer list


    public
    static    string
    $question_token
     = "qqq"

$stop_words

A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries. This is also used for language detection


    public
    static    mixed
    $stop_words
     = ['जैसा', 'मैं', 'उसके', 'कि', 'वह', 'था', 'के', 'लिए', 'पर', 'हैं', 'साथ', 'वे', 'हो', 'पर', 'एक', 'है', 'इस', 'से', 'द्वारा', 'गरम', 'शब्द', 'लेकिन', 'क्या', 'कुछ', 'है', 'यह', 'आप', 'या', 'था', 'की', 'तक', 'और', 'एक', 'में', 'हम', 'कर', 'सकते', 'हैं', 'बाहर', 'अन्य', 'थे', 'जो', 'कर', 'उनके', 'समय', 'अगर', 'होगा', 'कैसे', 'कहा', 'एक', 'प्रत्येक', 'बता', 'करता', 'है', 'सेट', 'तीन', 'चाहते हैं', 'हवा', 'अच्छी तरह से', 'भी', 'खेलने', 'छोटे', 'अंत', 'डाल', 'घर', 'पढ़ा', 'हाथ', 'बंदरगाह', 'बड़ा', 'जादू', 'जोड़', 'और', 'भी', 'भूमि', 'यहाँ', 'चाहिए', 'बड़ा', 'उच्च', 'ऐसा', 'का', 'पालन', 'करें', 'अधिनियम', 'क्यों', 'पूछना', 'पुरुषों', 'परिवर्तन', 'चला', 'गया', 'प्रकाश', 'तरह', 'बंद', 'आवश्यकता', 'घर', 'तस्वीर', 'कोशिश', 'हमें', 'फिर', 'पशु', 'बिंदु', 'मां', 'दुनिया', 'निकट', 'बनाना', 'आत्म', 'पृथ्वी', 'पिता']

$verb_type

List of verb-like parts of speech that might appear in lexicon


    public
    static    array<string|int, mixed>
    $verb_type
     = ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "RB"]

extractDeepestSpeechPartPhrase()

Takes a phrase tree $tree and a part of speech $pos, and returns the deepest path in the tree that consists only of $pos nodes.


    public
            static        extractDeepestSpeechPartPhrase(array<string|int, mixed> $tree, string $pos) : string

Parameters

$tree : array<string|int, mixed>: phrase to extract type from
$pos : string: the part of speech to extract

Return values

string —

the label of the deepest $pos-only path in $tree

extractObjectParseTree()

Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW: the former holds the object of the original phrase (as a string), the latter the important parts of the object.


    public
            static        extractObjectParseTree(mixed $tree) : array<string|int, mixed>

Parameters

$tree : mixed

Return values

array<string|int, mixed> —

with two fields CONCISE and RAW as described above

extractPredicateParseTree()

Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW: the former holds the predicate of the original phrase (as a string), the latter the important parts of the predicate.


    public
            static        extractPredicateParseTree(mixed $tree) : array<string|int, mixed>

Parameters

$tree : mixed

Return values

array<string|int, mixed> —

with two fields CONCISE and RAW as described above

extractSubjectParseTree()

Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW: the former holds the subject of the original phrase (as a string), the latter the important parts of the subject.


    public
            static        extractSubjectParseTree(mixed $tree) : array<string|int, mixed>

Parameters

$tree : mixed

Return values

array<string|int, mixed> —

with two fields CONCISE and RAW as described above

extractTripletByType()

Takes a triplets array with subject, predicate, and object fields, each with CONCISE and RAW subfields, and produces a triplets array for the given $type, where $type is one of CONCISE and RAW, with subject, predicate, object, and QUESTION_ANSWER_LIST subfields.


    public
            static        extractTripletByType(array<string|int, mixed> $sub_pred_obj_triplets, string $type) : array<string|int, mixed>

Parameters

$sub_pred_obj_triplets : array<string|int, mixed>: in format described above
$type : string: either CONCISE or RAW

Return values

array<string|int, mixed> —

$triplets in format described above

extractTripletsParseTree()

Takes a parse tree of a phrase and computes subject, predicate, and object arrays. Each of these arrays consists of two components, CONCISE and RAW: CONCISE corresponds to something closer to the words in the original phrase, and RAW to the case where extraneous words have been removed.


    public
            static        extractTripletsParseTree(array<string|int, mixed> $parse_tree) : array<string|int, mixed>

Parameters

$parse_tree : array<string|int, mixed>: a parse tree for a sentence

Return values

array<string|int, mixed> —

triplet array

extractTripletsPhrases()

Scans a word list for phrases. For each phrase found, generates a list of question and answer pairs at two levels of granularity: CONCISE (using all terms in the original phrase) and RAW (removing adjectives, etc.).


    public
            static        extractTripletsPhrases(array<string|int, mixed> $word_and_phrase_list) : array<string|int, mixed>

Parameters

$word_and_phrase_list : array<string|int, mixed>: of statements

Return values

array<string|int, mixed> —

with two fields: QUESTION_LIST consisting of (SUBJECT, COMPLEMENT) where one of the components has been replaced with a question marker.
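To make the question marker idea concrete, the following sketch replaces one component of a triplet at a time with the marker and records the replaced component as the answer. The function name, the word order of the generated question strings, and the flat question => answer array are assumptions for illustration; the actual layout produced by extractTripletsPhrases() may differ.

```php
<?php
/**
 * Hypothetical sketch: builds question => answer pairs from one triplet
 * by swapping the subject, then the object, for a question marker.
 */
function sketchQuestionAnswerList(string $subject, string $predicate,
    string $object, string $question_token = "qqq"): array
{
    return [
        // Question about the subject; the subject is the answer.
        "$question_token $predicate $object" => $subject,
        // Question about the object; the object is the answer.
        "$subject $predicate $question_token" => $object,
    ];
}

$pairs = sketchQuestionAnswerList("Ravi", "likes", "tea");
foreach ($pairs as $question => $answer) {
    echo $question . " => " . $answer . "\n";
}
```

The default marker "qqq" matches the value of $question_token documented above.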

isQuestion()

Takes a phrase query entered by a user and returns true if it is a question and false if not


    public
                    isQuestion($phrase) : bool

Parameters

$phrase : any statement

Return values

bool —

true if the statement is a question

parseAdjective()

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for an adjective if possible


    public
            static        parseAdjective(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>

Parameters

$tagged_phrase : array<string|int, mixed>: an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
$tree : array<string|int, mixed>: that consists of ["cur_node" => current parse position in $tagged_phrase]

Return values

array<string|int, mixed> —

has fields "cur_node", the index of how far we parsed $tagged_phrase, and "JJ", a subarray with a token node for the adjective that was parsed

parseNoun()

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a noun if possible


    public
            static        parseNoun(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>

Parameters

$tagged_phrase : array<string|int, mixed>: an array of pairs of the form ("token" => token_for_term, "tag" => part_of_speech_tag_for_term)
$tree : array<string|int, mixed>: that consists of ["cur_node" => current parse position in $tagged_phrase]

Return values

array<string|int, mixed> —

has fields "cur_node", the index of how far we parsed $tagged_phrase, and "NN", a subarray with a token node for the noun string that was parsed

parseNounPhrase()

Takes a part-of-speech tagged phrase and parse-tree with a parse-from position and builds a parse tree for a noun phrase if possible


    public
            static        parseNounPhrase(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>

Parameters

$tagged_phrase : array<string|int, mixed>: an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
$tree : array<string|int, mixed>: that consists of ["cur_node" => current parse position in $tagged_phrase]

Return values

array<string|int, mixed> —

has fields "cur_node", the index of how far we parsed $tagged_phrase; "JJ", with value an adjective subtree; and "NN", with value a noun subtree
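The way the parse methods thread a "cur_node" position through each other can be sketched as follows. This is a simplified illustration under stated assumptions, not Yioop's code: the function names are hypothetical, the "JJ" and "NN" values are plain strings rather than subtrees, and a noun phrase is assumed to be an optional run of adjectives followed by a required run of nouns.

```php
<?php
// Tag groups copied from $adjective_type and $noun_type above.
const SKETCH_ADJECTIVE_TAGS = ["JJ", "JJR", "JJS"];
const SKETCH_NOUN_TAGS = ["NN", "NNS", "NNP", "NNPS", "DT"];

/**
 * Hypothetical helper: consumes consecutive tokens whose tag is in $tags,
 * starting at $tree["cur_node"], and stores them under $label.
 */
function sketchParseTagged(array $tagged_phrase, array $tree, array $tags,
    string $label): array
{
    $position = $tree["cur_node"];
    $tokens = [];
    while (isset($tagged_phrase[$position]) &&
        in_array($tagged_phrase[$position]["tag"], $tags, true)) {
        $tokens[] = $tagged_phrase[$position]["token"];
        $position++;
    }
    if (empty($tokens)) {
        return ["cur_node" => $tree["cur_node"]]; // nothing matched
    }
    return ["cur_node" => $position, $label => implode(" ", $tokens)];
}

/**
 * Hypothetical noun phrase rule: optional adjectives, then nouns.
 */
function sketchParseNounPhrase(array $tagged_phrase, array $tree): array
{
    $adjective = sketchParseTagged($tagged_phrase, $tree,
        SKETCH_ADJECTIVE_TAGS, "JJ");
    $noun = sketchParseTagged($tagged_phrase,
        ["cur_node" => $adjective["cur_node"]], SKETCH_NOUN_TAGS, "NN");
    if (!isset($noun["NN"])) {
        // No noun found, so report that nothing was parsed.
        return ["cur_node" => $tree["cur_node"]];
    }
    $phrase = ["cur_node" => $noun["cur_node"], "NN" => $noun["NN"]];
    if (isset($adjective["JJ"])) {
        $phrase["JJ"] = $adjective["JJ"];
    }
    return $phrase;
}

$tagged_phrase = [
    ["token" => "सुंदर", "tag" => "JJ"],
    ["token" => "घर", "tag" => "NN"],
    ["token" => "है", "tag" => "VB"],
];
print_r(sketchParseNounPhrase($tagged_phrase, ["cur_node" => 0]));
```

Returning the unchanged "cur_node" on failure lets a caller detect that no noun phrase was found and try another rule from the same position.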

parsePostpositionPhrase()

Takes a part-of-speech tagged phrase and parse-tree with a parse-from position and builds a parse tree for a sequence of postpositional phrases if possible


    public
            static        parsePostpositionPhrase(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree[, int $index = 1 ]) : array<string|int, mixed>

Parameters

$tagged_phrase : array<string|int, mixed>: an array of pairs of the form ("token" => token_for_term, "tag" => part_of_speech_tag_for_term)
$tree : array<string|int, mixed>: that consists of ["cur_node" => current parse position in $tagged_phrase]
$index : int = 1: position in array to start from

Return values

array<string|int, mixed> —

has fields "cur_node" index of how far we parsed $tagged_phrase

parseQuestion()

Takes a tagged question string that starts with Who and returns the question triplet from the question string


    public
            static        parseQuestion(string $tagged_question, int $index) : array<string|int, mixed>

Parameters

$tagged_question : string: part-of-speech tagged question
$index : int: current index in statement

Return values

array<string|int, mixed> —

parsed triplet

parseTypeList()

Starting at the $cur_node in a $tagged_phrase parse tree for a Hindi sentence, create a phrase string for each of the next nodes which belong to part of speech group $type.


    public
            static        parseTypeList(array<string|int, mixed> &$cur_node, array<string|int, mixed> $tagged_phrase, string $type) : string

Parameters

$cur_node : array<string|int, mixed>: node within parse tree
$tagged_phrase : array<string|int, mixed>: parse tree for phrase
$type : string: self::$noun_type, self::$verb_type, etc

Return values

string —

phrase string involving only terms of that $type
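A minimal sketch of this kind of scan is shown below. It assumes the position is an integer index passed by reference, that the tag group is an array such as the value of $noun_type, and that terms are joined with single spaces; the function name is hypothetical and the real parseTypeList() may represent these differently.

```php
<?php
/**
 * Hypothetical sketch: collects consecutive terms whose tag belongs to
 * $tag_group, advancing $position past the terms it consumed.
 */
function sketchCollectTypeRun(int &$position, array $tagged_phrase,
    array $tag_group): string
{
    $terms = [];
    while (isset($tagged_phrase[$position]) &&
        in_array($tagged_phrase[$position]["tag"], $tag_group, true)) {
        $terms[] = $tagged_phrase[$position]["token"];
        $position++;
    }
    return implode(" ", $terms);
}

$tagged_phrase = [
    ["token" => "big", "tag" => "JJ"],
    ["token" => "red", "tag" => "JJ"],
    ["token" => "ball", "tag" => "NN"],
];
$position = 0;
$adjectives = sketchCollectTypeRun($position, $tagged_phrase,
    ["JJ", "JJR", "JJS"]);
echo $adjectives . " (next position: " . $position . ")\n";
```

Because the position is updated in place, a second call with the noun tag group would continue from where the adjective scan stopped.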

parseVerb()

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a verb if possible


    public
            static        parseVerb(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>

Parameters

$tagged_phrase : array<string|int, mixed>: an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
$tree : array<string|int, mixed>: that consists of ["cur_node" => current parse position in $tagged_phrase]

Return values

array<string|int, mixed> —

has fields "cur_node", the index of how far we parsed $tagged_phrase, and "VB", a subarray with a token node for the verb string that was parsed

parseVerbPhrase()

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a verb phrase if possible


    public
            static        parseVerbPhrase(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>

Parameters

$tagged_phrase : array<string|int, mixed>: an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
$tree : array<string|int, mixed>: that consists of ["cur_node" => current parse position in $tagged_phrase]

Return values

array<string|int, mixed> —

has fields "cur_node", the index of how far we parsed $tagged_phrase, and "VP", a subarray with possible fields "VB" with value a verb subtree

parseWholePhrase()

Given a part-of-speech tagged phrase array, generates a parse tree for the phrase using a recursive descent parser.


    public
            static        parseWholePhrase(array<string|int, mixed> $tagged_phrase[,  $tree = [] ]) : array<string|int, mixed>

Parameters

$tagged_phrase : array<string|int, mixed>: an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)
$tree : = []: this parameter is ignored but kept so as to match other methods in the recursive descent parser, such as parseNounPhrase

Return values

array<string|int, mixed> —

used to represent a tree. The array has a $tree["cur_node"] field, the index of how far we parsed $tagged_phrase, and up to three subtree fields: $tree["NP"] contains a subtree for a subject phrase, $tree["POST"] contains a subtree for an object phrase, and $tree["VP"] contains a subtree for a predicate phrase

questionParser()

Takes a question and returns the triplet extracted from it


    public
            static        questionParser(string $question) : array<string|int, mixed>

Parameters

$question : string: question to parse

Return values

array<string|int, mixed> —

question triplet

rearrangeTripletsByType()

Takes a triplets array with subject, predicate, object fields with CONCISE and RAW subfields and rearranges it to have two fields CONCISE and RAW with subject, predicate, object, and QUESTION_ANSWER_LIST subfields


    public
            static        rearrangeTripletsByType(array<string|int, mixed> $sub_pred_obj_triplets) : array<string|int, mixed>

Parameters

$sub_pred_obj_triplets : array<string|int, mixed>: in format described above

Return values

array<string|int, mixed> —

$processed_triplets in format described above
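The regrouping described above can be sketched as a pair of nested loops. This is only an illustration of the change in array shape: the function name is hypothetical, missing entries are assumed to default to an empty string, and the QUESTION_ANSWER_LIST subfield is left out.

```php
<?php
/**
 * Hypothetical sketch: turns [part => [type => value]] into
 * [type => [part => value]] for the CONCISE and RAW types.
 */
function sketchRearrangeByType(array $sub_pred_obj_triplets): array
{
    $rearranged = [];
    foreach (["CONCISE", "RAW"] as $type) {
        foreach (["subject", "predicate", "object"] as $part) {
            $rearranged[$type][$part] =
                $sub_pred_obj_triplets[$part][$type] ?? "";
        }
    }
    return $rearranged;
}

$triplets = [
    "subject" => ["CONCISE" => "the old man", "RAW" => "man"],
    "predicate" => ["CONCISE" => "slowly reads", "RAW" => "reads"],
    "object" => ["CONCISE" => "a long book", "RAW" => "book"],
];
print_r(sketchRearrangeByType($triplets));
```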

segment()

Stub function which could be used for a word segmenter.


    public
            static        segment(string $pre_segment) : string

Such a segmenter, on input thisisabunchofwords, would output: this is a bunch of words

Parameters

$pre_segment : string: before segmentation

Return values

string —

should return a string with words separated by spaces; in this case it does nothing

stem()

Computes the stem of a Hindi word


    public
            static        stem(string $word) : string

Parameters

$word : string: the string to stem

Return values

string —

the stem of $word

stopwordsRemover()

Removes the stop words from the page (used for Word Cloud generation and language detection)


    public
            static        stopwordsRemover(mixed $data) : mixed

Parameters

$data : mixed: either a string or an array of strings to remove stop words from

Return values

mixed —

$data with no stop words
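One simple way to get this behavior is to split on whitespace and drop terms found in the stop word list, as in the sketch below. It is an assumption-laden illustration rather than the method's implementation: the function name is hypothetical, the stop word list is passed in explicitly, and multi-word entries such as 'चाहते हैं' are not handled because matching is done one term at a time.

```php
<?php
/**
 * Hypothetical sketch: removes stop words from a string, or from each
 * string in an array, by whole-term comparison.
 */
function sketchRemoveStopWords($data, array $stop_words)
{
    if (is_array($data)) {
        return array_map(function ($text) use ($stop_words) {
            return sketchRemoveStopWords($text, $stop_words);
        }, $data);
    }
    // The u modifier makes the pattern treat $data as UTF-8.
    $terms = preg_split('/\s+/u', trim($data), -1, PREG_SPLIT_NO_EMPTY);
    if ($terms === false) {
        return $data; // invalid UTF-8, leave the input untouched
    }
    $kept = array_filter($terms, function ($term) use ($stop_words) {
        return !in_array($term, $stop_words, true);
    });
    return implode(" ", $kept);
}

echo sketchRemoveStopWords("घर में पानी है", ["में", "है"]) . "\n";
```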

taggedPartOfSpeechTokensToString()

This method is used to simplify the different tags of speech to a common form


    public
            static        taggedPartOfSpeechTokensToString(array<string|int, mixed> $tagged_tokens[, bool $with_tokens = true ]) : string

Parameters

$tagged_tokens : array<string|int, mixed>: which is an array of tokens assigned tags.
$with_tokens : bool = true: whether to include the terms and the tags in the output string or just the part of speech tags

Return values

string —

$tagged_phrase, which is a string of the form token~pos
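The token~pos output format can be illustrated with the sketch below. It covers only the serialization step, not any simplification of tags; the function name and the use of a single space between entries are assumptions.

```php
<?php
/**
 * Hypothetical sketch: joins tagged tokens as "token~tag" entries, or
 * as bare tags when $with_tokens is false.
 */
function sketchTaggedTokensToString(array $tagged_tokens,
    bool $with_tokens = true): string
{
    $parts = [];
    foreach ($tagged_tokens as $tagged_token) {
        $parts[] = $with_tokens
            ? $tagged_token["token"] . "~" . $tagged_token["tag"]
            : $tagged_token["tag"];
    }
    return implode(" ", $parts);
}

$tagged_tokens = [
    ["token" => "राम", "tag" => "NN"],
    ["token" => "गया", "tag" => "VB"],
];
echo sketchTaggedTokensToString($tagged_tokens) . "\n";
echo sketchTaggedTokensToString($tagged_tokens, false) . "\n";
```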

tagPartsOfSpeechPhrase()

The method takes as input a phrase and returns a string with each term tagged with a part of speech.


    public
            static        tagPartsOfSpeechPhrase(string $phrase[, bool $with_tokens = true ]) : string

Parameters

$phrase : string: text to add part-of-speech tags to
$with_tokens : bool = true: whether to include the terms and the tags in the output string or just the part of speech tags

Return values

string —

$tagged_phrase which is a string of format term~pos

tagTokenizePartOfSpeech()

Uses the lexicon to assign a tag to each token and then uses a rule-based approach to assign the most likely tag to each token


    public
            static        tagTokenizePartOfSpeech(string $text) : string

Parameters

$text : string: input phrase which is to be tagged

Return values

string —

$result which is an array of token => tag

tagUnknownWords()

This method tags the remaining words in a partially tagged text array.


    public
            static        tagUnknownWords(array<string|int, mixed> $partially_tagged_text) : array<string|int, mixed>

Parameters

$partially_tagged_text : array<string|int, mixed>: term array representing a text passage. Each element in the array is in turn an associative array [token => token_value, tag => tag_value (may be empty)]

Return values

array<string|int, mixed> —

text passage array where all empty tags now have values

removeSuffix()

Removes common Hindi suffixes


    private
            static        removeSuffix(string $word) : string

Parameters

$word : string: to remove suffixes from

Return values

string —

result of suffix removal
