Yioop_V9.5_Source_Code

NWordGrams

in package Application

Library of functions used to create and extract n word grams

AUX_SUFFIX

Suffix of the auxiliary file whose n-grams are added to the filter.


    public mixed AUX_SUFFIX = "_aux_grams.txt"

BLOCK_SIZE

Number of bytes to read at a time from the wiki file when creating the filter.


    public mixed BLOCK_SIZE = 8192
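
As a rough illustration of block-wise reading, the sketch below reads a stream 8192 bytes at a time and reassembles complete lines across block boundaries. The helper name `readLinesInBlocks` is hypothetical and not part of Yioop; how the library itself handles partial lines is not stated here, so treat this as one possible approach.

```php
<?php
// Hypothetical helper, not part of Yioop: reads a stream in fixed-size blocks
// and yields complete lines, carrying a partial last line into the next block.
function readLinesInBlocks($handle, int $block_size = 8192): Generator
{
    $carry = "";
    while (!feof($handle)) {
        $block = fread($handle, $block_size);
        if ($block === false || $block === "") {
            break;
        }
        $lines = explode("\n", $carry . $block);
        // The last element may be an incomplete line; keep it for later.
        $carry = array_pop($lines);
        foreach ($lines as $line) {
            yield $line;
        }
    }
    if ($carry !== "") {
        yield $carry;
    }
}

// Usage: an in-memory stream stands in for a wiki dump file, and a tiny
// block size forces lines to straddle block boundaries.
$handle = fopen("php://memory", "w+");
fwrite($handle, "first line\nsecond line\nthird");
rewind($handle);
$lines = iterator_to_array(readLinesInBlocks($handle, 8), false);
fclose($handle);
print_r($lines);
```

A fixed block size keeps memory use bounded regardless of how large the dump file is.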

FILTER_SUFFIX

Suffix appended to language tag to create the filter file name containing bigrams.


    public mixed FILTER_SUFFIX = "_word_grams.ftr"

PAGE_COUNT_WIKIPEDIA


    public mixed PAGE_COUNT_WIKIPEDIA = 2

PAGE_COUNT_WIKTIONARY


    public mixed PAGE_COUNT_WIKTIONARY = 3

TEXT_SUFFIX

Suffix appended to language tag to create the text file name containing bigrams.


    public mixed TEXT_SUFFIX = "_word_grams.txt"

WIKI_DUMP_REDIRECT


    public mixed WIKI_DUMP_REDIRECT = 0

WIKI_DUMP_TITLE


    public mixed WIKI_DUMP_TITLE = 1

$ngrams

Static copy of n-grams files


    protected static object $ngrams = null

makeNWordGramsFilterFile()

Creates a Bloom filter file from an n word gram text file. The path of the n word gram text file is derived from the input $lang.


    public static makeNWordGramsFilterFile(string $lang, string $num_gram, int $num_ngrams_found[, int $max_gram_len = 2 ]) : none

The name of the output filter file is based on $lang and the number n. The filter's size is based on the number of n word grams given as input. The n word grams are read from the text file, stemmed if a stemmer is available for $lang, and then stored in the filter file.

Parameters

$lang : string: locale to be used to stem n grams.
$num_gram : string: value of n in n-gram (how many words in sequence should constitute a gram)
$num_ngrams_found : int: count of n word grams in text file.
$max_gram_len : int = 2: value n of longest n gram to be added.

Return values

none
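
To make the filter-building step concrete, here is a minimal sketch of a Bloom filter sized from the number of n word grams, with an optional stemming callback applied before insertion. `ToyBloomFilter` and `buildFilter` are hypothetical names, the md5-based hashing and the bits-per-item sizing are assumptions, and writing the filter to disk is omitted; none of this is Yioop's own filter class.

```php
<?php
// Toy Bloom filter for illustration only; it is not Yioop's implementation.
class ToyBloomFilter
{
    private string $bits;
    private int $num_bits;
    private int $num_hashes;

    public function __construct(int $num_items, int $bits_per_item = 10,
        int $num_hashes = 7)
    {
        // Size the bit array from the expected number of items.
        $this->num_bits = max(8, $num_items * $bits_per_item);
        $this->num_hashes = $num_hashes;
        $this->bits = str_repeat("\0", intdiv($this->num_bits + 7, 8));
    }

    private function positions(string $item): array
    {
        $positions = [];
        for ($i = 0; $i < $this->num_hashes; $i++) {
            // Salt the item with the hash index to get several hash values.
            $hash = hexdec(substr(md5($i . ":" . $item), 0, 7));
            $positions[] = $hash % $this->num_bits;
        }
        return $positions;
    }

    public function add(string $item): void
    {
        foreach ($this->positions($item) as $pos) {
            $byte = intdiv($pos, 8);
            $this->bits[$byte] = chr(ord($this->bits[$byte]) | (1 << ($pos % 8)));
        }
    }

    public function contains(string $item): bool
    {
        foreach ($this->positions($item) as $pos) {
            if ((ord($this->bits[intdiv($pos, 8)]) & (1 << ($pos % 8))) === 0) {
                return false;
            }
        }
        return true;
    }
}

// Hypothetical builder: lowercases each gram, stems each word when a
// stemmer callback is supplied, then adds the result to the filter.
function buildFilter(array $ngrams, ?callable $stemmer = null): ToyBloomFilter
{
    $filter = new ToyBloomFilter(count($ngrams));
    foreach ($ngrams as $ngram) {
        $words = explode(" ", strtolower(trim($ngram)));
        if ($stemmer !== null) {
            $words = array_map($stemmer, $words);
        }
        $filter->add(implode(" ", $words));
    }
    return $filter;
}

$filter = buildFilter(["new york", "hong kong", "san francisco"]);
var_dump($filter->contains("new york"));
```

A Bloom filter never reports a stored item as missing, but it can occasionally report an item that was never added, which is why the size is tied to the number of grams.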

makeNWordGramsTextFile()

Generates an n word grams text file from an input Wikipedia XML file.


    public static makeNWordGramsTextFile(string $wiki_file, string $lang, string $locale[, int $num_gram = 2 ][, int $ngram_type = self::PAGE_COUNT_WIKIPEDIA ][, int $max_terms = -1 ]) : int

The input file can be bz2 compressed or uncompressed. The input XML file is parsed line by line and searched for the n word gram pattern. Each n word gram found is added to an array. After the complete file has been parsed, duplicate n word grams are removed and the array is sorted. The resulting array is written to the text file. The function returns the number of n word grams stored in the text file.

Parameters

$wiki_file : string: path to the compressed or uncompressed Wikipedia XML file from which to extract n word grams. This can also be a folder containing such files.
$lang : string: Language to be used to create n grams.
$locale : string: Locale to be used to store results.
$num_gram : int = 2: number of words in grams we are looking for
$ngram_type : int = self::PAGE_COUNT_WIKIPEDIA: where in Wiki Dump to extract grams from
$max_terms : int = -1: maximum number of n-grams to compute and put in file

Return values

int: $num_ngrams_found, the count of n-grams in the text file.
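
The parse, deduplicate, sort, and write steps described for this method can be sketched as follows. The function name `makeTitleGramsFile` is hypothetical. Matching on `<title>` elements, lowercasing, and treating `$max_terms = -1` as "no limit" are assumptions made for illustration rather than documented behavior.

```php
<?php
// Hypothetical sketch: collect n-word page titles from wiki-dump-like XML
// lines, then deduplicate, sort, and write them one per line.
function makeTitleGramsFile(iterable $xml_lines, string $out_file,
    int $num_gram = 2, int $max_terms = -1): int
{
    $ngrams = [];
    foreach ($xml_lines as $line) {
        if (!preg_match('/<title>([^<]+)<\/title>/u', $line, $match)) {
            continue;
        }
        // strtolower only lowercases ASCII letters; enough for this sketch.
        $title = strtolower(trim($match[1]));
        $words = preg_split('/\s+/u', $title, -1, PREG_SPLIT_NO_EMPTY);
        if (count($words) === $num_gram) {
            $ngrams[] = implode(" ", $words);
        }
    }
    // Remove duplicates, then sort as plain strings.
    $ngrams = array_values(array_unique($ngrams));
    sort($ngrams, SORT_STRING);
    if ($max_terms > 0) {
        $ngrams = array_slice($ngrams, 0, $max_terms);
    }
    file_put_contents($out_file, implode("\n", $ngrams));
    return count($ngrams);
}

// Usage: a few in-memory lines stand in for a wiki dump.
$lines = [
    "<page>",
    "  <title>New York</title>",
    "  <title>Paris</title>",
    "  <title>Hong Kong</title>",
    "  <title>new york</title>",
];
$out_file = tempnam(sys_get_temp_dir(), "grams");
$count = makeTitleGramsFile($lines, $out_file);
$stored = file($out_file, FILE_IGNORE_NEW_LINES);
unlink($out_file);
print_r($stored);
```

Here the single-word title is skipped and the two spellings of "New York" collapse into one entry, so the returned count matches the number of lines written.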

makeSegmentFilterFile()

Creates a filter file suitable for use in word segmentation (splitting text like "thiscontainsnospaces" into "this contains no spaces"). Used by token_tool.php.


    public static makeSegmentFilterFile(string $dict_file, string $lang) : mixed

Parameters

$dict_file : string: file to use as a dictionary to make filter from
$lang : string: locale tag of locale we are building the filter for

Return values

mixed
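
To show how a dictionary-backed filter could drive word segmentation, the sketch below splits unspaced text with a simple dynamic program. The function `segmentText` is hypothetical, an array keyed by word stands in for the filter file, and the code works on bytes, so it assumes ASCII input. It is not the algorithm Yioop uses.

```php
<?php
// Hypothetical sketch: split unspaced text into dictionary words using
// dynamic programming. An array keyed by word stands in for the filter.
function segmentText(string $text, array $dictionary,
    int $max_word_len = 20): ?array
{
    $len = strlen($text);
    // $best[$i] holds a segmentation of the first $i bytes, or null if none.
    $best = array_fill(0, $len + 1, null);
    $best[0] = [];
    for ($end = 1; $end <= $len; $end++) {
        // Try the longest candidate word ending at $end first.
        for ($start = max(0, $end - $max_word_len); $start < $end; $start++) {
            if ($best[$start] === null) {
                continue;
            }
            $word = substr($text, $start, $end - $start);
            if (isset($dictionary[$word])) {
                $best[$end] = array_merge($best[$start], [$word]);
                break;
            }
        }
    }
    return $best[$len];
}

$dictionary = array_flip(["this", "contains", "no", "spaces", "a", "is"]);
$words = segmentText("thiscontainsnospaces", $dictionary);
echo implode(" ", $words), "\n";
```

With a real Bloom filter in place of the array, an occasional false positive could accept a string that is not a dictionary word, so the segmentation would be approximate.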

ngramsContains()

Reports whether a phrase exists in the n word gram Bloom filter.


    public static ngramsContains($phrase, string $lang[, string $filter_prefix = 2 ]) : true

Parameters

$phrase : the phrase to check, for example to test whether it is a bigram
$lang : string: language of the bigrams file
$filter_prefix : string = 2: either the word "segment", the word "all", or the number n of words in each n-gram in the filter.

Return values

true if the phrase is found in the filter, false otherwise
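
The static $ngrams property suggests that loaded filters are kept in memory between lookups. The sketch below illustrates that idea under stated assumptions: `ToyGramLookup`, its `$loader` callback, the cache key format, and the `$loads` counter are all hypothetical, and a plain array stands in for the Bloom filter.

```php
<?php
// Hypothetical sketch: a lookup that loads each word-gram set once and keeps
// it in a static cache, keyed by language and filter prefix.
class ToyGramLookup
{
    /** @var array<string, array<string, bool>> loaded gram sets by cache key */
    protected static array $grams = [];
    /** Number of times a set was loaded; only here to show the caching. */
    public static int $loads = 0;

    public static function contains(string $phrase, string $lang,
        string $filter_prefix, callable $loader): bool
    {
        $key = $lang . "_" . $filter_prefix;
        if (!isset(self::$grams[$key])) {
            // First lookup for this language and prefix: load and cache.
            self::$loads++;
            self::$grams[$key] =
                array_fill_keys($loader($lang, $filter_prefix), true);
        }
        return isset(self::$grams[$key][strtolower($phrase)]);
    }
}

// The loader stands in for reading a filter file from disk.
$loader = fn(string $lang, string $prefix): array => ["new york", "hong kong"];
$first = ToyGramLookup::contains("New York", "en-US", "2", $loader);
$second = ToyGramLookup::contains("big apple", "en-US", "2", $loader);
var_dump($first, $second, ToyGramLookup::$loads);
```

Both lookups share one load, which is the benefit of holding the set in a static property when many phrases are checked in a row.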
