Yioop_V9.5_Source_Code_Documentation

NWordGrams
in package

Library of functions used to create and extract n word grams

Tags
author

Ravi Dhillon (Bigram Version), Chris Pollett (ngrams + rewrite + support for page count dumps)

Table of Contents

AUX_SUFFIX  = "_aux_grams.txt"
Suffix of the auxiliary file of n-grams to add to the filter
BLOCK_SIZE  = 8192
How many bytes to read in one go from wiki file when creating filter
FILTER_SUFFIX  = "_word_grams.ftr"
Suffix appended to language tag to create the filter file name containing bigrams.
PAGE_COUNT_WIKIPEDIA  = 2
PAGE_COUNT_WIKTIONARY  = 3
TEXT_SUFFIX  = "_word_grams.txt"
Suffix appended to language tag to create the text file name containing bigrams.
WIKI_DUMP_REDIRECT  = 0
WIKI_DUMP_TITLE  = 1
$ngrams  : object
Static copy of n-grams files
makeNWordGramsFilterFile()  : none
Creates a Bloom filter file from an n word gram text file. The path of the n word gram text file used is based on the input $lang.
makeNWordGramsTextFile()  : int
Generates an n word grams text file from an input Wikipedia XML file.
makeSegmentFilterFile()  : mixed
Used to create a filter file suitable for use in word segmentation (splitting text like "thiscontainsnospaces" into "this contains no spaces"). Used by token_tool.php
ngramsContains()  : true
Says whether or not phrase exists in the N word gram Bloom Filter

Constants

AUX_SUFFIX

Suffix of the auxiliary file of n-grams to add to the filter

public mixed AUX_SUFFIX = "_aux_grams.txt"

BLOCK_SIZE

How many bytes to read in one go from wiki file when creating filter

public mixed BLOCK_SIZE = 8192
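
Reading a large dump in fixed-size blocks keeps memory use bounded. The following is a minimal sketch of that idea, assuming a plain uncompressed file. The function name readLinesInBlocks and the way a partial line is carried over to the next block are illustrative assumptions, not taken from the Yioop source.

```php
<?php
/**
 * Hypothetical helper: reads $path in chunks of $block_size bytes and
 * returns the complete lines found. A partial line at the end of a block
 * is carried over and prepended to the next block.
 */
function readLinesInBlocks(string $path, int $block_size = 8192): array
{
    $lines = [];
    $carry = "";
    $fh = fopen($path, "r");
    if ($fh === false) {
        return $lines;
    }
    while (!feof($fh)) {
        $block = fread($fh, $block_size);
        if ($block === false || $block === "") {
            break;
        }
        $parts = explode("\n", $carry . $block);
        // the last element may be an incomplete line; keep it for the next pass
        $carry = array_pop($parts);
        foreach ($parts as $line) {
            $lines[] = $line;
        }
    }
    fclose($fh);
    if ($carry !== "") {
        $lines[] = $carry;
    }
    return $lines;
}
```

A small block size (for example 4) is handy when testing that lines spanning block boundaries are reassembled correctly.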

FILTER_SUFFIX

Suffix appended to language tag to create the filter file name containing bigrams.

public mixed FILTER_SUFFIX = "_word_grams.ftr"

PAGE_COUNT_WIKIPEDIA

public mixed PAGE_COUNT_WIKIPEDIA = 2

PAGE_COUNT_WIKTIONARY

public mixed PAGE_COUNT_WIKTIONARY = 3

TEXT_SUFFIX

Suffix appended to language tag to create the text file name containing bigrams.

public mixed TEXT_SUFFIX = "_word_grams.txt"

WIKI_DUMP_REDIRECT

public mixed WIKI_DUMP_REDIRECT = 0

WIKI_DUMP_TITLE

public mixed WIKI_DUMP_TITLE = 1

Properties

$ngrams

Static copy of n-grams files

protected static object $ngrams = null

Methods

makeNWordGramsFilterFile()

Creates a Bloom filter file from an n word gram text file. The path of the n word gram text file used is based on the input $lang.

public static makeNWordGramsFilterFile(string $lang, string $num_gram, int $num_ngrams_found[, int $max_gram_len = 2 ]) : none

The name of the output filter file is based on $lang and the number n. Its size is based on the input number of n word grams. The n word grams are read from the text file, stemmed if a stemmer is available for $lang, and then stored in the filter file.

Parameters
$lang : string

locale to be used to stem n grams.

$num_gram : string

value of n in n-gram (how many words in sequence should constitute a gram)

$num_ngrams_found : int

count of n word grams in text file.

$max_gram_len : int = 2

value n of longest n gram to be added.

Return values
none
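
To make the filter-building step concrete, here is a minimal sketch of a Bloom filter filled from a list of n word grams. ToyBloomFilter and buildToyFilter are hypothetical names introduced for illustration. The class is a toy stand-in, not the filter class Yioop uses. The sizing rule (about 10 bits per expected item) and the md5-based hashing are assumptions, and stemming is omitted.

```php
<?php
/** Toy Bloom filter for illustration only; not Yioop's filter class. */
class ToyBloomFilter
{
    private string $bits;
    private int $num_bits;
    private int $num_hashes;

    public function __construct(int $expected_items, int $num_hashes = 3)
    {
        // assumed sizing: roughly 10 bits per expected item, at least 64 bits
        $this->num_bits = max(64, 10 * $expected_items);
        $this->num_hashes = $num_hashes;
        $this->bits = str_repeat("\0", intdiv($this->num_bits + 7, 8));
    }

    /** Bit positions for $item, derived from salted md5 digests. */
    private function positions(string $item): array
    {
        $positions = [];
        for ($i = 0; $i < $this->num_hashes; $i++) {
            $hash = hexdec(substr(md5($i . ":" . $item), 0, 8));
            $positions[] = $hash % $this->num_bits;
        }
        return $positions;
    }

    public function add(string $item): void
    {
        foreach ($this->positions($item) as $pos) {
            $byte = intdiv($pos, 8);
            $this->bits[$byte] = chr(ord($this->bits[$byte]) | (1 << ($pos % 8)));
        }
    }

    /** False means definitely absent; true means probably present. */
    public function contains(string $item): bool
    {
        foreach ($this->positions($item) as $pos) {
            if ((ord($this->bits[intdiv($pos, 8)]) & (1 << ($pos % 8))) === 0) {
                return false;
            }
        }
        return true;
    }
}

/** Hypothetical: builds a filter from the lines of an n word gram text file. */
function buildToyFilter(array $ngram_lines): ToyBloomFilter
{
    $filter = new ToyBloomFilter(count($ngram_lines));
    foreach ($ngram_lines as $line) {
        $ngram = strtolower(trim($line));
        if ($ngram !== "") {
            $filter->add($ngram);
        }
    }
    return $filter;
}
```

Sizing the filter from the n word gram count up front matters because a Bloom filter cannot grow after creation, and too few bits per item raises the false positive rate.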

makeNWordGramsTextFile()

Generates an n word grams text file from an input Wikipedia XML file.

public static makeNWordGramsTextFile(string $wiki_file, string $lang, string $locale[, int $num_gram = 2 ][, int $ngram_type = self::PAGE_COUNT_WIKIPEDIA ][, int $max_terms = -1 ]) : int

The input file can be bz2 compressed or uncompressed. The input XML file is parsed line by line and searched for the n word gram pattern. Each n word gram found is added to an array. After the complete file is parsed, duplicate n word grams are removed and the array is sorted. The resulting array is written to the text file. The function returns the number of n word grams stored in the text file.

Parameters
$wiki_file : string

compressed or uncompressed wikipedia XML file path to be used to extract bigrams. This can also be a folder containing such files

$lang : string

Language to be used to create n grams.

$locale : string

Locale to be used to store results.

$num_gram : int = 2

number of words in grams we are looking for

$ngram_type : int = self::PAGE_COUNT_WIKIPEDIA

where in Wiki Dump to extract grams from

$max_terms : int = -1

maximum number of n-grams to compute and put in file

Return values
int

$num_ngrams_found count of n-grams in text file.
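
The parse, deduplicate, and sort steps described for this method can be sketched as follows. This is a simplified illustration under stated assumptions: the function name collectTitleNGrams is hypothetical, the `<title>` pattern is an assumed stand-in for whatever pattern the real method searches for, and truncating the sorted list to $max_terms is a simplification of how a term limit might be applied.

```php
<?php
/**
 * Hypothetical sketch: collects titles that consist of exactly $num_gram
 * words from lines of a wiki XML dump, then removes duplicates and sorts.
 * The <title> pattern is an assumption made for illustration.
 */
function collectTitleNGrams(array $lines, int $num_gram = 2, int $max_terms = -1): array
{
    $ngrams = [];
    foreach ($lines as $line) {
        if (!preg_match('/<title>([^<]+)<\/title>/', $line, $matches)) {
            continue;
        }
        $title = strtolower(trim($matches[1]));
        $words = preg_split('/\s+/', $title, -1, PREG_SPLIT_NO_EMPTY);
        if (count($words) === $num_gram) {
            $ngrams[] = implode(" ", $words);
        }
    }
    // drop duplicates, reindex, and sort as plain strings
    $ngrams = array_values(array_unique($ngrams));
    sort($ngrams, SORT_STRING);
    if ($max_terms > 0) {
        // simplification: keep the first $max_terms entries of the sorted list
        $ngrams = array_slice($ngrams, 0, $max_terms);
    }
    return $ngrams;
}
```

The returned array could then be written out with file_put_contents($path, implode("\n", $ngrams)), with count($ngrams) serving as the returned number of n word grams.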

makeSegmentFilterFile()

Used to create a filter file suitable for use in word segmentation (splitting text like "thiscontainsnospaces" into "this contains no spaces"). Used by token_tool.php

public static makeSegmentFilterFile(string $dict_file, string $lang) : mixed
Parameters
$dict_file : string

file to use as a dictionary to make filter from

$lang : string

locale tag of locale we are building the filter for

Return values
mixed
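
To show what a segmentation filter is for, the sketch below splits unspaced text into dictionary words using dynamic programming. It is an illustration only: the function name segmentWithDictionary is hypothetical, an exact in-memory word set stands in for the filter file, and preferring the segmentation with the fewest words is an assumed tie-breaking rule.

```php
<?php
/**
 * Hypothetical sketch: splits $text into dictionary words using dynamic
 * programming, preferring the segmentation with the fewest words.
 * Returns null when no full segmentation exists.
 */
function segmentWithDictionary(string $text, array $dictionary): ?array
{
    $words = array_flip($dictionary); // word => index, for constant-time lookups
    $max_len = 0;
    foreach ($dictionary as $word) {
        $max_len = max($max_len, strlen($word));
    }
    $len = strlen($text);
    // $best[$i] holds the best segmentation of the first $i bytes, if any
    $best = [0 => []];
    for ($end = 1; $end <= $len; $end++) {
        for ($start = max(0, $end - $max_len); $start < $end; $start++) {
            if (!isset($best[$start])) {
                continue;
            }
            $piece = substr($text, $start, $end - $start);
            if (!isset($words[$piece])) {
                continue;
            }
            $candidate = array_merge($best[$start], [$piece]);
            if (!isset($best[$end]) || count($candidate) < count($best[$end])) {
                $best[$end] = $candidate;
            }
        }
    }
    return $best[$len] ?? null;
}

$dictionary = ["this", "contains", "no", "spaces", "is", "on"];
echo implode(" ", segmentWithDictionary("thiscontainsnospaces", $dictionary));
// prints "this contains no spaces"
```

The sketch works on bytes, so it suits ASCII input; text in languages that actually need segmentation would call for multibyte-aware string functions.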

ngramsContains()

Says whether or not phrase exists in the N word gram Bloom Filter

public static ngramsContains( $phrase, string $lang[, string $filter_prefix = 2 ]) : true
Parameters
$phrase :

the phrase to check for membership in the filter

$lang : string

language of bigrams file

$filter_prefix : string = 2

either the word "segment", "all", or number n of the number of words in an ngram in filter.

Return values
true

if the phrase is found in the filter, or false otherwise
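
A lookup of this kind typically loads each filter once and keeps it in a static property for later calls, which is the role a static copy such as $ngrams would play. The sketch below illustrates that caching pattern under stated assumptions: ToyNGramLookup, its load() method, and the hard-coded data are hypothetical, and an exact in-memory set stands in for a Bloom filter read from disk.

```php
<?php
/** Toy stand-in for a per-language filter cache; not the Yioop class. */
class ToyNGramLookup
{
    /** Loaded sets keyed by "lang:prefix". */
    protected static array $loaded = [];
    /** Counts loads so the caching effect can be observed. */
    public static int $load_count = 0;

    /** Hypothetical loader; a real one would read a filter file from disk. */
    protected static function load(string $lang, string $filter_prefix): array
    {
        self::$load_count++;
        $data = [
            "en:2" => ["new york" => true, "hong kong" => true],
        ];
        return $data[$lang . ":" . $filter_prefix] ?? [];
    }

    public static function contains(string $phrase, string $lang,
        string $filter_prefix = "2"): bool
    {
        $key = $lang . ":" . $filter_prefix;
        if (!isset(self::$loaded[$key])) {
            // first request for this language and prefix: load and remember
            self::$loaded[$key] = self::load($lang, $filter_prefix);
        }
        return isset(self::$loaded[$key][strtolower($phrase)]);
    }
}
```

Repeated calls for the same language and prefix reuse the cached set, so the loader runs once per key rather than once per phrase.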


        
