NWordGrams
in package
Library of functions used to create and extract n word grams
Table of Contents
- AUX_SUFFIX = "_aux_grams.txt"
- Suffix of the auxiliary file containing n-grams to add to the filter
- BLOCK_SIZE = 8192
- How many bytes to read in one go from wiki file when creating filter
- FILTER_SUFFIX = "_word_grams.ftr"
- Suffix appended to language tag to create the filter file name containing bigrams.
- PAGE_COUNT_WIKIPEDIA = 2
- PAGE_COUNT_WIKTIONARY = 3
- TEXT_SUFFIX = "_word_grams.txt"
- Suffix appended to language tag to create the text file name containing bigrams.
- WIKI_DUMP_REDIRECT = 0
- WIKI_DUMP_TITLE = 1
- $ngrams : object
- Static copy of n-grams files
- makeNWordGramsFilterFile() : none
- Creates a Bloom filter file from an n word gram text file. The path of the n word gram text file used is based on the input $lang.
- makeNWordGramsTextFile() : int
- Generates an n word grams text file from an input Wikipedia XML file.
- makeSegmentFilterFile() : mixed
- Used to create a filter file suitable for use in word segmentation (splitting text like "thiscontainsnospaces" into "this contains no spaces"). Used by token_tool.php.
- ngramsContains() : true
- Says whether or not phrase exists in the N word gram Bloom Filter
Constants
AUX_SUFFIX
Suffix of the auxiliary file containing n-grams to add to the filter
public
mixed
AUX_SUFFIX
= "_aux_grams.txt"
BLOCK_SIZE
How many bytes to read in one go from wiki file when creating filter
public
mixed
BLOCK_SIZE
= 8192
FILTER_SUFFIX
Suffix appended to language tag to create the filter file name containing bigrams.
public
mixed
FILTER_SUFFIX
= "_word_grams.ftr"
PAGE_COUNT_WIKIPEDIA
public
mixed
PAGE_COUNT_WIKIPEDIA
= 2
PAGE_COUNT_WIKTIONARY
public
mixed
PAGE_COUNT_WIKTIONARY
= 3
TEXT_SUFFIX
Suffix appended to language tag to create the text file name containing bigrams.
public
mixed
TEXT_SUFFIX
= "_word_grams.txt"
WIKI_DUMP_REDIRECT
public
mixed
WIKI_DUMP_REDIRECT
= 0
WIKI_DUMP_TITLE
public
mixed
WIKI_DUMP_TITLE
= 1
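The suffix constants above are described as being appended to a language tag to form a file name. The following minimal sketch illustrates that naming scheme only. The helper gramFileName() and the directory /tmp/resources are hypothetical and not part of the documented class, and the real code may add further components (such as the value of n) to the name.

```php
<?php
// Values copied from the constants documented above.
const TEXT_SUFFIX = "_word_grams.txt";
const FILTER_SUFFIX = "_word_grams.ftr";
const AUX_SUFFIX = "_aux_grams.txt";

/**
 * Hypothetical helper (an assumption for illustration): joins a directory,
 * a language tag, and one of the suffixes into a file path.
 */
function gramFileName(string $dir, string $lang, string $suffix): string
{
    return rtrim($dir, "/") . "/" . $lang . $suffix;
}

echo gramFileName("/tmp/resources", "en", TEXT_SUFFIX), "\n";
// prints /tmp/resources/en_word_grams.txt
echo gramFileName("/tmp/resources", "en", FILTER_SUFFIX), "\n";
// prints /tmp/resources/en_word_grams.ftr
```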
Properties
$ngrams
Static copy of n-grams files
protected
static object
$ngrams
= null
Methods
makeNWordGramsFilterFile()
Creates a Bloom filter file from an n word gram text file. The path of the n word gram text file used is based on the input $lang.
public
static makeNWordGramsFilterFile(string $lang, string $num_gram, int $num_ngrams_found[, int $max_gram_len = 2 ]) : none
The name of the output filter file is based on $lang and the number n. The size of the filter is based on the input count of n word grams. The n word grams are read from the text file, stemmed if a stemmer is available for $lang, and then stored in the filter file.
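As a rough illustration of the steps described above (size a filter from the gram count, normalize each gram, store it), here is a minimal sketch. ToyBloomFilter and buildToyFilter() are hypothetical names, not the filter class this library uses. Lowercasing stands in for stemming, and the bits-per-item and hash-count defaults are assumptions.

```php
<?php
/**
 * Toy Bloom filter for illustration only; NOT the library's filter class.
 */
class ToyBloomFilter
{
    private array $bytes;
    private int $num_bits;
    private int $num_hashes;

    public function __construct(int $expected_items,
        int $bits_per_item = 10, int $num_hashes = 7)
    {
        // Size the bit array from the expected number of items.
        $this->num_bits = max(8, $expected_items * $bits_per_item);
        $this->num_hashes = $num_hashes;
        $this->bytes = array_fill(0, intdiv($this->num_bits + 7, 8), 0);
    }

    /** Derives $num_hashes bit positions from salted MD5 digests. */
    private function positions(string $item): array
    {
        $positions = [];
        for ($i = 0; $i < $this->num_hashes; $i++) {
            $digest = md5($i . ":" . $item);
            $positions[] = hexdec(substr($digest, 0, 7)) % $this->num_bits;
        }
        return $positions;
    }

    public function add(string $item): void
    {
        foreach ($this->positions($item) as $pos) {
            $this->bytes[intdiv($pos, 8)] |= (1 << ($pos % 8));
        }
    }

    public function contains(string $item): bool
    {
        foreach ($this->positions($item) as $pos) {
            if (($this->bytes[intdiv($pos, 8)] & (1 << ($pos % 8))) === 0) {
                return false; // a cleared bit means the item was never added
            }
        }
        return true; // may be a false positive, never a false negative
    }
}

/** Hypothetical helper: builds a filter sized from the number of grams. */
function buildToyFilter(array $grams): ToyBloomFilter
{
    $filter = new ToyBloomFilter(count($grams));
    foreach ($grams as $gram) {
        // Lowercasing is a stand-in for the stemming step (an assumption).
        $filter->add(strtolower(trim($gram)));
    }
    return $filter;
}

$filter = buildToyFilter(["New York", "Hong Kong", "San Francisco"]);
var_dump($filter->contains("new york")); // bool(true)
```

A Bloom filter can report a phrase as present when it was never added, so a lookup result of true is a strong hint rather than a guarantee. A result of false is always reliable.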
Parameters
- $lang : string
-
locale to be used to stem n grams.
- $num_gram : string
-
value of n in n-gram (how many words in sequence should constitute a gram)
- $num_ngrams_found : int
-
count of n word grams in text file.
- $max_gram_len : int = 2
-
value n of longest n gram to be added.
Return values
none
makeNWordGramsTextFile()
Generates an n word grams text file from an input Wikipedia XML file.
public
static makeNWordGramsTextFile(string $wiki_file, string $lang, string $locale[, int $num_gram = 2 ][, int $ngram_type = self::PAGE_COUNT_WIKIPEDIA ][, int $max_terms = -1 ]) : int
The input file can be bz2 compressed or uncompressed. The input XML file is parsed line by line and searched for the n word gram pattern. Each n word gram found is added to an array. After the complete file is parsed, duplicate n word grams are removed and the rest are sorted. The resulting array is written to the text file. The function returns the number of n word grams stored in the text file.
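The parse, deduplicate, sort, and count steps described above can be sketched as follows. This is a simplified illustration, not the library's implementation: extractTitleGrams() is a hypothetical helper, the choice of `<title>` lines as the pattern is an assumption, and bz2 handling and the $max_terms limit are left out.

```php
<?php
/**
 * Hypothetical helper: collects page titles made of exactly $num_gram
 * words, then removes duplicates and sorts the result.
 */
function extractTitleGrams(array $lines, int $num_gram = 2): array
{
    $grams = [];
    foreach ($lines as $line) {
        // Assumed pattern: a <title> element on a single line.
        if (!preg_match('/<title>([^<]+)<\/title>/', $line, $match)) {
            continue;
        }
        $words = preg_split('/\s+/', trim($match[1]), -1,
            PREG_SPLIT_NO_EMPTY);
        if (count($words) === $num_gram) {
            $grams[] = strtolower(implode(" ", $words));
        }
    }
    $grams = array_values(array_unique($grams));
    sort($grams, SORT_STRING);
    return $grams;
}

$lines = [
    "<title>New York</title>",
    "<title>Paris</title>",
    "<title>Hong Kong</title>",
    "<title>new york</title>",
];
$grams = extractTitleGrams($lines, 2);

// One gram per line, mirroring the text file output described above.
$out_file = tempnam(sys_get_temp_dir(), "grams");
file_put_contents($out_file, implode("\n", $grams) . "\n");

echo count($grams), "\n"; // prints 2
```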
Parameters
- $wiki_file : string
-
path of the compressed or uncompressed Wikipedia XML file from which to extract n word grams. This can also be a folder containing such files.
- $lang : string
-
Language to be used to create n grams.
- $locale : string
-
Locale to be used to store results.
- $num_gram : int = 2
-
number of words in grams we are looking for
- $ngram_type : int = self::PAGE_COUNT_WIKIPEDIA
-
where in Wiki Dump to extract grams from
- $max_terms : int = -1
-
maximum number of n-grams to compute and put in file
Return values
int —$num_ngrams_found count of n-grams in text file.
makeSegmentFilterFile()
Used to create a filter file suitable for use in word segmentation (splitting text like "thiscontainsnospaces" into "this contains no spaces"). Used by token_tool.php.
public
static makeSegmentFilterFile(string $dict_file, string $lang) : mixed
Parameters
- $dict_file : string
-
file to use as a dictionary to make filter from
- $lang : string
-
locale tag of locale we are building the filter for
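To show what word segmentation against a dictionary means in practice, here is a minimal dynamic-programming sketch. It is an illustration under stated assumptions, not the segmentation code that consumes the filter file: segmentText() is a hypothetical function, a plain PHP array stands in for the filter, and preferring the fewest words is an assumed tie-breaking rule.

```php
<?php
/**
 * Hypothetical segmenter: splits $text into dictionary words, preferring
 * the segmentation with the fewest words. Returns null if none exists.
 * $dictionary maps each known word to any non-null value.
 */
function segmentText(string $text, array $dictionary,
    int $max_word_len = 20): ?array
{
    $len = strlen($text);
    // $best[$i] is a segmentation of the first $i bytes, or null if none.
    $best = array_fill(0, $len + 1, null);
    $best[0] = [];
    for ($end = 1; $end <= $len; $end++) {
        for ($start = max(0, $end - $max_word_len); $start < $end; $start++) {
            if ($best[$start] === null) {
                continue;
            }
            $word = substr($text, $start, $end - $start);
            if (!isset($dictionary[$word])) {
                continue;
            }
            $candidate = array_merge($best[$start], [$word]);
            if ($best[$end] === null ||
                count($candidate) < count($best[$end])) {
                $best[$end] = $candidate;
            }
        }
    }
    return $best[$len];
}

$dictionary = array_flip(["this", "contains", "no", "spaces", "is", "on"]);
$words = segmentText("thiscontainsnospaces", $dictionary);
echo implode(" ", $words), "\n"; // prints this contains no spaces
```

With a Bloom filter in place of the array, the membership test could occasionally accept a non-word, so a real segmenter would need to tolerate false positives.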
Return values
mixed
ngramsContains()
Says whether or not phrase exists in the N word gram Bloom Filter
public
static ngramsContains($phrase, string $lang[, string $filter_prefix = 2 ]) : true
Parameters
- $phrase :
-
phrase to test for membership in the n word gram filter
- $lang : string
-
language of bigrams file
- $filter_prefix : string = 2
-
either the word "segment", the word "all", or the number n of words per n-gram in the filter.
Return values
true if the filter reports the phrase as present, false otherwise
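The static $ngrams property suggests that loaded filters are kept between calls. A minimal sketch of that lazy-load-and-cache pattern follows. ToyGramLookup, its loadGrams() method, and the hard-coded data are hypothetical. A plain array stands in for the Bloom filter, and keying the cache on language plus prefix is an assumption.

```php
<?php
class ToyGramLookup
{
    /** Loaded gram sets, keyed by "lang:prefix" (assumed cache layout). */
    protected static array $grams = [];

    /** Counts loads, only so the caching can be observed. */
    public static int $load_count = 0;

    /** Stand-in for reading a filter file from disk. */
    protected static function loadGrams(string $lang,
        string $filter_prefix): array
    {
        self::$load_count++;
        $data = [
            "en:2" => ["new york", "hong kong"],
            "en:3" => ["new york city"],
        ];
        return array_fill_keys($data["$lang:$filter_prefix"] ?? [], true);
    }

    public static function contains(string $phrase, string $lang,
        string $filter_prefix = "2"): bool
    {
        $key = "$lang:$filter_prefix";
        if (!isset(self::$grams[$key])) {
            // First lookup for this language and prefix: load once.
            self::$grams[$key] = self::loadGrams($lang, $filter_prefix);
        }
        return isset(self::$grams[$key][strtolower(trim($phrase))]);
    }
}

var_dump(ToyGramLookup::contains("New York", "en"));           // bool(true)
var_dump(ToyGramLookup::contains("new york city", "en", "3")); // bool(true)
```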