Yioop_V9.5_Source_Code_Documentation

Tokenizer
in package

Hindi-specific tokenization code. In particular, it has a stemmer. The stemmer is my stab at porting Ljiljana Dolamic's (University of Neuchatel, www.unine.ch/info/clef/) Java stemming algorithm: http://members.unine.ch/jacques.savoy/clef/HindiStemmerLight.java.txt Here, given a word, its stem is the part of the word that is common to all its inflected variants. For example, tall is common to tall, taller, and tallest. A stemmer takes a word and tries to produce its stem.
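As a rough illustration of the stem idea described above, the following sketch strips a suffix from a word. It is not the ported Hindi algorithm: the function name toyStem and its English suffix list are hypothetical, chosen only to reproduce the tall, taller, tallest example, and the mbstring extension is assumed to be available.

```php
<?php
/**
 * Toy suffix-stripping stemmer (illustration only, not the Yioop code).
 * The default suffix list is a hypothetical English one chosen to match
 * the tall/taller/tallest example.
 */
function toyStem(string $word, array $suffixes = ["est", "er"]): string
{
    foreach ($suffixes as $suffix) {
        $stem_length = mb_strlen($word) - mb_strlen($suffix);
        // Only strip if a stem of at least two characters would remain
        if ($stem_length >= 2 &&
            mb_substr($word, $stem_length) === $suffix) {
            return mb_substr($word, 0, $stem_length);
        }
    }
    // No listed suffix matched, so the word is treated as its own stem
    return $word;
}

foreach (["tall", "taller", "tallest"] as $word) {
    echo $word, " => ", toyStem($word), "\n";
}
```

All three inputs map to tall. A real light stemmer for Hindi would use a list of Hindi suffixes and would first check the word against a list of exceptions such as $no_stem_list.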

Tags
author

Chris Pollett

Table of Contents

$adjective_type  : array<string|int, mixed>
List of adjective-like parts of speech that might appear in lexicon
$no_stem_list  : array<string|int, mixed>
Words we don't want to be stemmed
$noun_type  : array<string|int, mixed>
List of noun-like parts of speech that might appear in lexicon
$postpositional_type  : array<string|int, mixed>
List of postpositional-like parts of speech that might appear in lexicon
$question_pattern  : array<string|int, mixed>
List of questions in Hindi
$question_token  : string
Any unique identifier corresponding to the component of a triplet which can be answered using a question answer list
$stop_words  : mixed
A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries. This is also used for language detection
$verb_type  : array<string|int, mixed>
List of verb-like parts of speech that might appear in lexicon
extractDeepestSpeechPartPhrase()  : string
Takes a phrase tree $tree and a part-of-speech $pos and returns the deepest $pos-only path in the tree.
extractObjectParseTree()  : array<string|int, mixed>
Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW, the former having the object of the original phrase (as a string) and the latter having the important parts of the object
extractPredicateParseTree()  : array<string|int, mixed>
Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW, the former having the predicate of the original phrase (as a string) and the latter having the important parts of the predicate
extractSubjectParseTree()  : array<string|int, mixed>
Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW, the former having the subject of the original phrase (as a string) and the latter having the important parts of the subject
extractTripletByType()  : array<string|int, mixed>
Takes a triplets array with subject, predicate, object fields with CONCISE, RAW subfields and produces triplets with $type subfield where $type is one of CONCISE and RAW and with subject, predicate, object and QUESTION_ANSWER_LIST subfields
extractTripletsParseTree()  : array<string|int, mixed>
Takes a parse tree of a phrase and computes subject, predicate, and object arrays. Each of these arrays consists of two components, CONCISE and RAW, with CONCISE corresponding to something more similar to the words in the original phrase and RAW to the case where extraneous words have been removed
extractTripletsPhrases()  : array<string|int, mixed>
Scans a word list for phrases. For each phrase found, generates a list of question and answer pairs at two levels of granularity: CONCISE (using all terms in the original phrase) and RAW (removing adjectives, etc.).
isQuestion()  : bool
Takes a phrase query entered by the user and returns true if it is a question and false if not
parseAdjective()  : array<string|int, mixed>
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for an adjective if possible
parseNoun()  : array<string|int, mixed>
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a noun if possible
parseNounPhrase()  : array<string|int, mixed>
Takes a part-of-speech tagged phrase and parse-tree with a parse-from position and builds a parse tree for a noun phrase if possible
parsePostpositionPhrase()  : array<string|int, mixed>
Takes a part-of-speech tagged phrase and parse-tree with a parse-from position and builds a parse tree for a sequence of postpositional phrases if possible
parseQuestion()  : array<string|int, mixed>
Takes a tagged question string that starts with Who and returns the question triplet from the question string
parseTypeList()  : string
Starting at the $cur_node in a $tagged_phrase parse tree for a Hindi sentence, create a phrase string for each of the next nodes which belong to part of speech group $type.
parseVerb()  : array<string|int, mixed>
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a verb if possible
parseVerbPhrase()  : array<string|int, mixed>
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a verb phrase if possible
parseWholePhrase()  : array<string|int, mixed>
Given a part-of-speech tagged phrase array, generates a parse tree for the phrase using a recursive descent parser.
questionParser()  : array<string|int, mixed>
Takes questions and returns the triplet from the question
rearrangeTripletsByType()  : array<string|int, mixed>
Takes a triplets array with subject, predicate, object fields with CONCISE and RAW subfields and rearranges it to have two fields CONCISE and RAW with subject, predicate, object, and QUESTION_ANSWER_LIST subfields
segment()  : string
Stub function which could be used for a word segmenter.
stem()  : string
Computes the stem of a Hindi word
stopwordsRemover()  : mixed
Removes the stop words from the page (used for Word Cloud generation and language detection)
taggedPartOfSpeechTokensToString()  : string
This method is used to simplify the different tags of speech to a common form
tagPartsOfSpeechPhrase()  : string
The method takes as input a phrase and returns a string with each term tagged with a part of speech.
tagTokenizePartOfSpeech()  : string
Uses the lexicon to assign a tag to each token and then uses a rule based approach to assign the most likely of tags to each token
tagUnknownWords()  : array<string|int, mixed>
This method tags the remaining words in a partially tagged text array.
removeSuffix()  : string
Removes common Hindi suffixes

Properties

$adjective_type

List of adjective-like parts of speech that might appear in lexicon

public static array<string|int, mixed> $adjective_type = ["JJ", "JJR", "JJS"]

$no_stem_list

Words we don't want to be stemmed

public static array<string|int, mixed> $no_stem_list = []

$noun_type

List of noun-like parts of speech that might appear in lexicon

public static array<string|int, mixed> $noun_type = ["NN", "NNS", "NNP", "NNPS", "DT"]

$postpositional_type

List of postpositional-like parts of speech that might appear in lexicon

public static array<string|int, mixed> $postpositional_type = ["IN", "inj", "PREP", "proNN", "CONJ", "INT", "particle", "case", "PSP", "direct_DT", "PRP"]

$question_pattern

List of questions in Hindi

public static array<string|int, mixed> $question_pattern = "/\\b[क्या|कब|कहा|क्यों|कौन|जिसे|जिसका|कहाँ|कहां]\\b/ui"

$question_token

Any unique identifier corresponding to the component of a triplet which can be answered using a question answer list

public static string $question_token = "qqq"

$stop_words

A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries. This is also used for language detection

public static mixed $stop_words = ['जैसा', 'मैं', 'उसके', 'कि', 'वह', 'था', 'के', 'लिए', 'पर', 'हैं', 'साथ', 'वे', 'हो', 'पर', 'एक', 'है', 'इस', 'से', 'द्वारा', 'गरम', 'शब्द', 'लेकिन', 'क्या', 'कुछ', 'है', 'यह', 'आप', 'या', 'था', 'की', 'तक', 'और', 'एक', 'में', 'हम', 'कर', 'सकते', 'हैं', 'बाहर', 'अन्य', 'थे', 'जो', 'कर', 'उनके', 'समय', 'अगर', 'होगा', 'कैसे', 'कहा', 'एक', 'प्रत्येक', 'बता', 'करता', 'है', 'सेट', 'तीन', 'चाहते हैं', 'हवा', 'अच्छी तरह से', 'भी', 'खेलने', 'छोटे', 'अंत', 'डाल', 'घर', 'पढ़ा', 'हाथ', 'बंदरगाह', 'बड़ा', 'जादू', 'जोड़', 'और', 'भी', 'भूमि', 'यहाँ', 'चाहिए', 'बड़ा', 'उच्च', 'ऐसा', 'का', 'पालन', 'करें', 'अधिनियम', 'क्यों', 'पूछना', 'पुरुषों', 'परिवर्तन', 'चला', 'गया', 'प्रकाश', 'तरह', 'बंद', 'आवश्यकता', 'घर', 'तस्वीर', 'कोशिश', 'हमें', 'फिर', 'पशु', 'बिंदु', 'मां', 'दुनिया', 'निकट', 'बनाना', 'आत्म', 'पृथ्वी', 'पिता']
Tags
array

$verb_type

List of verb-like parts of speech that might appear in lexicon

public static array<string|int, mixed> $verb_type = ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "RB"]

Methods

extractDeepestSpeechPartPhrase()

Takes a phrase tree $tree and a part-of-speech $pos and returns the deepest $pos-only path in the tree.

public static extractDeepestSpeechPartPhrase(array<string|int, mixed> $tree, string $pos) : string
Parameters
$tree : array<string|int, mixed>

phrase to extract type from

$pos : string

the part of speech to extract

Return values
string

the label of deepest $pos only path in $tree

extractObjectParseTree()

Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW, the former having the object of the original phrase (as a string) and the latter having the important parts of the object

public static extractObjectParseTree(mixed $tree) : array<string|int, mixed>
Parameters
$tree : mixed
Return values
array<string|int, mixed>

with two fields CONCISE and RAW as described above

extractPredicateParseTree()

Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW, the former having the predicate of the original phrase (as a string) and the latter having the important parts of the predicate

public static extractPredicateParseTree(mixed $tree) : array<string|int, mixed>
Parameters
$tree : mixed
Return values
array<string|int, mixed>

with two fields CONCISE and RAW as described above

extractSubjectParseTree()

Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW, the former having the subject of the original phrase (as a string) and the latter having the important parts of the subject

public static extractSubjectParseTree(mixed $tree) : array<string|int, mixed>
Parameters
$tree : mixed
Return values
array<string|int, mixed>

with two fields CONCISE and RAW as described above

extractTripletByType()

Takes a triplets array with subject, predicate, object fields with CONCISE, RAW subfields and produces triplets with $type subfield where $type is one of CONCISE and RAW and with subject, predicate, object and QUESTION_ANSWER_LIST subfields

public static extractTripletByType(array<string|int, mixed> $sub_pred_obj_triplets, string $type) : array<string|int, mixed>
Parameters
$sub_pred_obj_triplets : array<string|int, mixed>

in format described above

$type : string

either CONCISE or RAW

Return values
array<string|int, mixed>

$triplets in format described above

extractTripletsParseTree()

Takes a parse tree of a phrase and computes subject, predicate, and object arrays. Each of these arrays consists of two components, CONCISE and RAW, with CONCISE corresponding to something more similar to the words in the original phrase and RAW to the case where extraneous words have been removed

public static extractTripletsParseTree(array<string|int, mixed> $parse_tree) : array<string|int, mixed>
Parameters
$parse_tree : array<string|int, mixed>

a parse tree for a sentence

Return values
array<string|int, mixed>

triplet array

extractTripletsPhrases()

Scans a word list for phrases. For each phrase found, generates a list of question and answer pairs at two levels of granularity: CONCISE (using all terms in the original phrase) and RAW (removing adjectives, etc.).

public static extractTripletsPhrases(array<string|int, mixed> $word_and_phrase_list) : array<string|int, mixed>
Parameters
$word_and_phrase_list : array<string|int, mixed>

of statements

Return values
array<string|int, mixed>

with two fields: QUESTION_LIST consisting of (SUBJECT, COMPLEMENT) where one of the components has been replaced with a question marker.

isQuestion()

Takes a phrase query entered by the user and returns true if it is a question and false if not

public isQuestion($phrase) : bool
Parameters
$phrase :

any statement

Return values
bool

true if the statement is a question
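A minimal sketch of question detection follows. It is not the documented implementation: instead of applying $question_pattern as a regular expression, it compares whole whitespace-separated terms against the question words that appear in that pattern. The function name looksLikeQuestion is hypothetical.

```php
<?php
/**
 * Sketch only: reports whether any term of $phrase is a Hindi question
 * word. The word list is copied from the alternatives that appear in
 * $question_pattern; whole-term comparison is a swapped-in technique.
 */
function looksLikeQuestion(string $phrase): bool
{
    $question_words = ["क्या", "कब", "कहा", "क्यों", "कौन", "जिसे",
        "जिसका", "कहाँ", "कहां"];
    // Split on whitespace, question marks, and the danda (।)
    $terms = preg_split('/[\s?।]+/u', $phrase, -1, PREG_SPLIT_NO_EMPTY);
    return count(array_intersect($terms, $question_words)) > 0;
}

var_dump(looksLikeQuestion("ताजमहल कहाँ है?"));
var_dump(looksLikeQuestion("ताजमहल आगरा में है"));
```

The first phrase contains the question word कहाँ and is reported as a question; the second contains none and is not.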

parseAdjective()

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for an adjective if possible

public static parseAdjective(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>
Parameters
$tagged_phrase : array<string|int, mixed>

an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)

$tree : array<string|int, mixed>

that consists of ["cur_node" => current parse position in $tagged_phrase]

Return values
array<string|int, mixed>

has fields "cur_node", the index of how far we parsed $tagged_phrase, and "JJ", a subarray with a token node for the adjective that was parsed
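The parse methods in this class share one calling convention: a $tagged_phrase array of token/tag pairs goes in together with a tree holding "cur_node", and a tree with an advanced "cur_node" comes out. The sketch below illustrates that convention under simplifying assumptions. The helper parseRun is hypothetical, and it stores a plain string under the label where the documented methods store a subarray.

```php
<?php
/**
 * Sketch of a single recursive-descent step (not the Yioop code):
 * starting at $tree["cur_node"], consume consecutive tokens whose tag
 * is in $types and store them under $label.
 */
function parseRun(array $tagged_phrase, array $tree, array $types,
    string $label): array
{
    $cur = $tree["cur_node"];
    $tokens = [];
    while (isset($tagged_phrase[$cur]) &&
        in_array($tagged_phrase[$cur]["tag"], $types, true)) {
        $tokens[] = $tagged_phrase[$cur]["token"];
        $cur++;
    }
    $result = ["cur_node" => $cur];
    if ($tokens) {
        // Simplification: a string rather than a token-node subarray
        $result[$label] = implode(" ", $tokens);
    }
    return $result;
}

$tagged_phrase = [
    ["token" => "सुंदर", "tag" => "JJ"],
    ["token" => "घर", "tag" => "NN"],
];
// Parse an adjective first, then continue from where it stopped
$adjective = parseRun($tagged_phrase, ["cur_node" => 0],
    ["JJ", "JJR", "JJS"], "JJ");
$noun = parseRun($tagged_phrase, ["cur_node" => $adjective["cur_node"]],
    ["NN", "NNS", "NNP", "NNPS", "DT"], "NN");
print_r($adjective);
print_r($noun);
```

Passing the returned "cur_node" into the next call is what lets a phrase-level parser chain the word-level parsers together.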

parseNoun()

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a noun if possible

public static parseNoun(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>
Parameters
$tagged_phrase : array<string|int, mixed>

an array of pairs of the form ("token" => token_for_term, "tag" => part_of_speech_tag_for_term)

$tree : array<string|int, mixed>

that consists of ["cur_node" => current parse position in $tagged_phrase]

Return values
array<string|int, mixed>

has fields "cur_node", the index of how far we parsed $tagged_phrase, and "NN", a subarray with a token node for the noun string that was parsed

parseNounPhrase()

Takes a part-of-speech tagged phrase and parse-tree with a parse-from position and builds a parse tree for a noun phrase if possible

public static parseNounPhrase(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>
Parameters
$tagged_phrase : array<string|int, mixed>

an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)

$tree : array<string|int, mixed>

that consists of ["cur_node" => current parse position in $tagged_phrase]

Return values
array<string|int, mixed>

has fields "cur_node", the index of how far we parsed $tagged_phrase; "JJ", with value an adjective subtree; and "NN", with value a noun subtree

parsePostpositionPhrase()

Takes a part-of-speech tagged phrase and parse-tree with a parse-from position and builds a parse tree for a sequence of postpositional phrases if possible

public static parsePostpositionPhrase(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree[, int $index = 1 ]) : array<string|int, mixed>
Parameters
$tagged_phrase : array<string|int, mixed>

an array of pairs of the form ("token" => token_for_term, "tag" => part_of_speech_tag_for_term)

$tree : array<string|int, mixed>

that consists of ["cur_node" => current parse position in $tagged_phrase]

$index : int = 1

position in array to start from

Return values
array<string|int, mixed>

has fields "cur_node" index of how far we parsed $tagged_phrase

parseQuestion()

Takes a tagged question string that starts with Who and returns the question triplet from the question string

public static parseQuestion(string $tagged_question, int $index) : array<string|int, mixed>
Parameters
$tagged_question : string

part-of-speech tagged question

$index : int

current index in statement

Return values
array<string|int, mixed>

parsed triplet

parseTypeList()

Starting at the $cur_node in a $tagged_phrase parse tree for a Hindi sentence, create a phrase string for each of the next nodes which belong to part of speech group $type.

public static parseTypeList(array<string|int, mixed> &$cur_node, array<string|int, mixed> $tagged_phrase, string $type) : string
Parameters
$cur_node : array<string|int, mixed>

node within parse tree

$tagged_phrase : array<string|int, mixed>

parse tree for phrase

$type : string

self::$noun_type, self::$verb_type, etc

Return values
string

phrase string involving only terms of that $type

parseVerb()

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a verb if possible

public static parseVerb(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>
Parameters
$tagged_phrase : array<string|int, mixed>

an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)

$tree : array<string|int, mixed>

that consists of ["cur_node" => current parse position in $tagged_phrase]

Return values
array<string|int, mixed>

has fields "cur_node", the index of how far we parsed $tagged_phrase, and "VB", a subarray with a token node for the verb string that was parsed

parseVerbPhrase()

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a verb phrase if possible

public static parseVerbPhrase(array<string|int, mixed> $tagged_phrase, array<string|int, mixed> $tree) : array<string|int, mixed>
Parameters
$tagged_phrase : array<string|int, mixed>

an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)

$tree : array<string|int, mixed>

that consists of ["cur_node" => current parse position in $tagged_phrase]

Return values
array<string|int, mixed>

has fields "cur_node", the index of how far we parsed $tagged_phrase, and "VP", a subarray with possible fields "VB", with value a verb subtree

parseWholePhrase()

Given a part-of-speech tagged phrase array, generates a parse tree for the phrase using a recursive descent parser.

public static parseWholePhrase(array<string|int, mixed> $tagged_phrase[, $tree = [] ]) : array<string|int, mixed>
Parameters
$tagged_phrase : array<string|int, mixed>

an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)

$tree : = []

this parameter is ignored but kept so as to match other methods in the recursive descent parser, such as parseNounPhrase()

Return values
array<string|int, mixed>

used to represent a tree. Besides $tree["cur_node"], the index of how far we parsed $tagged_phrase, the array has up to three fields: $tree["NP"] contains a subtree for a subject phrase, $tree["POST"] contains a subtree for an object phrase, and $tree["VP"] contains a subtree for a predicate phrase

questionParser()

Takes questions and returns the triplet from the question

public static questionParser(string $question) : array<string|int, mixed>
Parameters
$question : string

question to parse

Return values
array<string|int, mixed>

question triplet

rearrangeTripletsByType()

Takes a triplets array with subject, predicate, object fields with CONCISE and RAW subfields and rearranges it to have two fields CONCISE and RAW with subject, predicate, object, and QUESTION_ANSWER_LIST subfields

public static rearrangeTripletsByType(array<string|int, mixed> $sub_pred_obj_triplets) : array<string|int, mixed>
Parameters
$sub_pred_obj_triplets : array<string|int, mixed>

in format described above

Return values
array<string|int, mixed>

$processed_triplets in format described above
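The rearrangement described above amounts to swapping the two levels of keys. The sketch below shows this for a single triplet. The lowercase key names and the helper name transposeTriplet are assumptions, and the QUESTION_ANSWER_LIST subfield that the documented method also produces is left out.

```php
<?php
/**
 * Sketch only: turns [part => [type => text]] into
 * [type => [part => text]] for one triplet. Key names are assumed and
 * QUESTION_ANSWER_LIST is not generated here.
 */
function transposeTriplet(array $triplet): array
{
    $by_type = [];
    foreach (["CONCISE", "RAW"] as $type) {
        foreach (["subject", "predicate", "object"] as $part) {
            // Missing entries become empty strings in this sketch
            $by_type[$type][$part] = $triplet[$part][$type] ?? "";
        }
    }
    return $by_type;
}

$triplet = [
    "subject" => ["CONCISE" => "the tall man", "RAW" => "man"],
    "predicate" => ["CONCISE" => "quickly ate", "RAW" => "ate"],
    "object" => ["CONCISE" => "a red apple", "RAW" => "apple"],
];
$processed = transposeTriplet($triplet);
print_r($processed["RAW"]);
```

English text is used here only to keep the shape of the data easy to read; the structure is the same for Hindi phrases.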

segment()

Stub function which could be used for a word segmenter.

public static segment(string $pre_segment) : string

Such a segmenter on input thisisabunchofwords would output this is a bunch of words

Parameters
$pre_segment : string

before segmentation

Return values
string

should return a string with words separated by spaces; in this case it does nothing

stem()

Computes the stem of a Hindi word

public static stem(string $word) : string
Parameters
$word : string

the string to stem

Return values
string

the stem of $word

stopwordsRemover()

Removes the stop words from the page (used for Word Cloud generation and language detection)

public static stopwordsRemover(mixed $data) : mixed
Parameters
$data : mixed

either a string or an array of strings to remove stop words from

Return values
mixed

$data with no stop words
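A minimal sketch of stop-word removal for both accepted input shapes follows. It is not the documented implementation: the function name removeStopWords is hypothetical, the stop-word list is passed in rather than read from $stop_words, and splitting on whitespace means multi-word entries such as 'चाहते हैं' are not handled.

```php
<?php
/**
 * Sketch only: removes stop words from a string or from each string in
 * an array. Terms are compared whole after splitting on whitespace.
 */
function removeStopWords($data, array $stop_words)
{
    if (is_array($data)) {
        // Apply the same removal to every element of the array
        return array_map(function ($item) use ($stop_words) {
            return removeStopWords($item, $stop_words);
        }, $data);
    }
    $terms = preg_split('/\s+/u', trim($data), -1, PREG_SPLIT_NO_EMPTY);
    $kept = array_filter($terms, function ($term) use ($stop_words) {
        return !in_array($term, $stop_words, true);
    });
    return implode(" ", $kept);
}

$stop_words = ['है', 'में', 'और', 'का'];
echo removeStopWords("दिल्ली भारत में है", $stop_words), "\n";
```

With this short example list, the terms में and है are dropped and the remaining terms are rejoined with single spaces.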

taggedPartOfSpeechTokensToString()

This method is used to simplify the different tags of speech to a common form

public static taggedPartOfSpeechTokensToString(array<string|int, mixed> $tagged_tokens[, bool $with_tokens = true ]) : string
Parameters
$tagged_tokens : array<string|int, mixed>

which is an array of tokens assigned tags.

$with_tokens : bool = true

whether to include the terms and the tags in the output string or just the part of speech tags

Return values
string

$tagged_phrase which is a string of the form token~pos
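The token~pos output format can be sketched as below. This is an assumption-laden illustration, not the documented method: the function name tokensToString is hypothetical, the "token" and "tag" key names and the space separator are assumed, and no simplification of tags to a common form is attempted.

```php
<?php
/**
 * Sketch only: joins tagged tokens into a token~pos string, or into a
 * string of tags alone when $with_tokens is false.
 */
function tokensToString(array $tagged_tokens,
    bool $with_tokens = true): string
{
    $parts = [];
    foreach ($tagged_tokens as $pair) {
        $parts[] = $with_tokens ?
            $pair["token"] . "~" . $pair["tag"] : $pair["tag"];
    }
    // Assumed separator: a single space between entries
    return implode(" ", $parts);
}

$tagged_tokens = [
    ["token" => "घर", "tag" => "NN"],
    ["token" => "है", "tag" => "VB"],
];
echo tokensToString($tagged_tokens), "\n";
echo tokensToString($tagged_tokens, false), "\n";
```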

tagPartsOfSpeechPhrase()

The method takes as input a phrase and returns a string with each term tagged with a part of speech.

public static tagPartsOfSpeechPhrase(string $phrase[, bool $with_tokens = true ]) : string
Parameters
$phrase : string

text to add part-of-speech tags to

$with_tokens : bool = true

whether to include the terms and the tags in the output string or just the part of speech tags

Return values
string

$tagged_phrase which is a string of format term~pos

tagTokenizePartOfSpeech()

Uses the lexicon to assign a tag to each token and then uses a rule based approach to assign the most likely of tags to each token

public static tagTokenizePartOfSpeech(string $text) : string
Parameters
$text : string

input phrase which is to be tagged

Return values
string

$result which is an array of token => tag

tagUnknownWords()

This method tags the remaining words in a partially tagged text array.

public static tagUnknownWords(array<string|int, mixed> $partially_tagged_text) : array<string|int, mixed>
Parameters
$partially_tagged_text : array<string|int, mixed>

term array representing a text passage. Each element in the array is in turn an associative array [token => token_value, tag => tag_value (may be empty)]

Return values
array<string|int, mixed>

text passage array where all empty tags now have values

removeSuffix()

Removes common Hindi suffixes

private static removeSuffix(string $word) : string
Parameters
$word : string

to remove suffixes from

Return values
string

result of suffix removal

