Yioop_V9.5_Source_Code_Documentation

TokenTool.php

SeekQuarry/Yioop -- Open Source Pure PHP Search Engine, Crawler, and Indexer

Copyright (C) 2009 - 2023 Chris Pollett chris@pollett.org

LICENSE:

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.

END LICENSE

TokenTool is used to create suggest word dictionaries and 'n' word gram filter files for the Yioop! search engine.

A description of its usage is given in the $usage global variable

Tags
author

Ravi Dhillon ravi.dhillon@yahoo.com, Chris Pollett (modified for n ngrams, added more functionality)

license

https://www.gnu.org/licenses/ GPL3

link
https://www.seekquarry.com/
copyright

2009 - 2023

filesource

Table of Contents

getTrainingFileNames()  : array<string|int, mixed>
Returns an array of filenames to be used for training the current task in TokenTool
makeKwikiEntriesGetSeedSites()  : mixed
Generates knowledge wiki callouts for search results pages based on the first paragraph of a Wikipedia page that matches a given query.
getNextPage()  : mixed
Gets the next wiki page from a file handle pointing to the wiki dump file
removeTags()  : string
Removes all occurrences of an open/close tag pair from $text
getBraceTag()  : array<string|int, mixed>
Get a substring offset pair matching the input open close brace tag pattern
getTagOffsetPage()  : mixed
Gets the outer contents of an xml open/close tag pair from a text source, together with the new offset location just after the pair
getTopPages()  : array<string|int, mixed>
Returns title and page counts of the top $max_pages many entries in a $page_count_file for a locale $locale_tag
smartOpen()  : array<string|int, mixed>
Gets a read file handle for $file_name appropriate for whether it is uncompressed, bz2 compressed, or gz compressed. It also returns function pointers to the functions needed to do reading and closing for the file handle.
translateLocale()  : mixed
Translates Yioop web app strings to a given locale ($locale_tag) and writes the LOCALE_DIR/$locale_tag/configure.ini file for these translations.
wikiHeaderPageToString()  : string
Converts an array of wiki header information and a wiki page contents string into a string suitable to be stored in the GROUP_PAGE_HISTORY database table.
translatePhrase()  : mixed
Translates a string from English to a given locale using an online translation tool.
makeNWordGramsFiles()  : mixed
Makes an n or all word gram Bloom filter based on the supplied arguments and writes it into the resources folder of the given locale. Wikipedia files are assumed to have been placed in the PREP_DIR before this is run.
makeSuggestTrie()  : mixed
Makes a trie that can be used to make word suggestions as someone enters terms into the Yioop! search box. Outputs the result into the file suggest_trie.txt.gz in the supplied locale dir
fileWithTrim()  : array<string|int, mixed>
Reads a file into an array, or outputs file not found. Each entry in the array is trimmed, and any blank lines are deleted.

Functions

getTrainingFileNames()

Returns an array of filenames to be used for training the current task in TokenTool

getTrainingFileNames(array<string|int, mixed> $command_line_args[, int $start_index = 4 ]) : array<string|int, mixed>
Parameters
$command_line_args : array<string|int, mixed>

supplied to TokenTool.php. Assume array of the format: [ ... max_file_names_to_consider, file_glob1, file_glob2, ...]

$start_index : int = 4

index in $command_line_args of max_file_names_to_consider

Return values
array<string|int, mixed>

$file_names of files with training data
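
The argument layout described above can be illustrated with a minimal sketch. The function name trainingFileNamesSketch is hypothetical, and the sketch assumes that globs are expanded exactly as given and that a non-positive maximum means "no limit"; neither detail is confirmed by this documentation.

```php
<?php
/**
 * Hypothetical sketch, not the TokenTool implementation: expands the globs
 * that follow the max_file_names_to_consider entry of $args and returns at
 * most that many matching paths.
 */
function trainingFileNamesSketch(array $args, int $start_index = 4): array
{
    $max_files = (int)($args[$start_index] ?? 0);
    $file_names = [];
    foreach (array_slice($args, $start_index + 1) as $pattern) {
        // glob() may return false on error, so fall back to an empty list
        foreach (glob($pattern) ?: [] as $file_name) {
            $file_names[] = $file_name;
        }
    }
    // Remove paths matched by more than one glob and reindex from 0
    $file_names = array_values(array_unique($file_names));
    // Assumption: a non-positive maximum means "no limit"
    return ($max_files > 0) ? array_slice($file_names, 0, $max_files) :
        $file_names;
}
```

With the default $start_index of 4, the entry at index 4 is read as the maximum and every later entry is treated as a file glob.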

makeKwikiEntriesGetSeedSites()

Generates knowledge wiki callouts for search results pages based on the first paragraph of a Wikipedia page that matches a given query.

makeKwikiEntriesGetSeedSites(string $locale_tag, string $page_count_file, string $wiki_dump_file, int $max_entries, int $max_seed_sites) : mixed

Also generates an initial list of potential seed sites for a crawl based off urls scraped from the wiki pages.

Parameters
$locale_tag : string

the IANA language tag of the locale to create knowledge wiki entries and seed sites for

$page_count_file : string

the file name of a wiki page count dump file (or folder of such files). Such a file contains the names of wiki pages and how many times they were accessed

$wiki_dump_file : string

a dump of wikipedia pages and meta pages

$max_entries : int

maximum number of kwiki entries to create. The entries with the highest counts in $page_count_file are chosen

$max_seed_sites : int

maximum number of seed sites to add to Yioop's set of seed sites. Again chooses those with highest page count score

Return values
mixed

getNextPage()

Gets the next wiki page from a file handle pointing to the wiki dump file

getNextPage(resource $fr, function $read, int $block_size, mixed &$input_buffer) : mixed
Parameters
$fr : resource

file handle (might be a compressed file handle, for example, one returned by gzopen or bzopen)

$read : function

a function for reading from the given file handle

$block_size : int

size of blocks to use when reading

$input_buffer : mixed
Return values
mixed
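
A minimal sketch of block-wise page extraction, matching the parameters listed above, might look as follows. The name nextPageSketch is hypothetical. The sketch assumes that pages are delimited by <page> ... </page> elements, that $input_buffer is a string holding data already read but not yet returned, and that false signals the end of input; the documentation above does not confirm these details.

```php
<?php
/**
 * Hypothetical sketch, not the TokenTool implementation: returns the next
 * "<page>...</page>" chunk, reading more blocks into $input_buffer only
 * while the buffer lacks a complete page. Returns false when no complete
 * page remains.
 */
function nextPageSketch($fr, callable $read, int $block_size,
    string &$input_buffer)
{
    $close_tag = "</page>";
    while (($end = strpos($input_buffer, $close_tag)) === false) {
        $block = $read($fr, $block_size);
        if ($block === false || $block === "") {
            return false; // input exhausted before a page was completed
        }
        $input_buffer .= $block;
    }
    $end += strlen($close_tag);
    $start = strpos($input_buffer, "<page");
    $page = ($start === false || $start > $end) ? "" :
        substr($input_buffer, $start, $end - $start);
    // Keep the unread remainder for the next call
    $input_buffer = substr($input_buffer, $end);
    return $page;
}
```

Because the buffer is passed by reference, repeated calls with the same variable walk through the dump one page at a time, whatever read function (for example fread, gzread, or bzread) matches the handle.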

removeTags()

Removes all occurrences of an open/close tag pair from $text

removeTags(string $text, string $open, string $close) : string
Parameters
$text : string

to remove tag pair from

$open : string

string pattern for open tag

$close : string

string pattern for close tag

Return values
string

text after tag removed
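
As an illustration of removing tag pairs, here is a minimal sketch. The name removeTagsSketch is hypothetical, and the sketch assumes that $open and $close are literal strings (not regular expressions) and that tag pairs are not nested; the real function may differ on both points.

```php
<?php
/**
 * Hypothetical sketch, not the TokenTool implementation: deletes every
 * region that starts with the literal $open and ends with the next
 * literal $close.
 */
function removeTagsSketch(string $text, string $open, string $close): string
{
    $offset = 0;
    while (($start = strpos($text, $open, $offset)) !== false) {
        $end = strpos($text, $close, $start + strlen($open));
        if ($end === false) {
            break; // unmatched open tag: leave the rest untouched
        }
        // Splice out everything from the open tag through the close tag
        $text = substr($text, 0, $start) .
            substr($text, $end + strlen($close));
        $offset = $start; // resume scanning where the removed region began
    }
    return $text;
}
```

For instance, calling it with "<ref>" and "</ref>" strips each reference element, including the text between the tags.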

getBraceTag()

Get a substring offset pair matching the input open close brace tag pattern

getBraceTag(string $page, string $brace_open, string $brace_close, string $tag, int $offset) : array<string|int, mixed>
Parameters
$page : string

source text to search for the tag in. For example, lala {{infobox {{blah yoyoy}} }} dada.

$brace_open : string

character sequence starting the tag region. For example {{

$brace_close : string

character sequence ending the tag region. For example }}

$tag : string

tag that might be associated with the opening of the sequence. For example infobox.

$offset : int

offset to start searching from

Return values
array<string|int, mixed>

ordered pair [substring containing the brace tag, offset after the tag]. For example, given the input "lala {{infobox {{blah yoyoy}} }} dada" and a search on {{, }}, infobox, 0, the result would be ["{{infobox {{blah yoyoy}}", 31]
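
Matching nested brace regions is commonly done with a depth counter, and a minimal sketch of that technique follows. The name braceTagSketch is hypothetical. The sketch assumes the tag immediately follows the opening sequence, returns the whole balanced region together with the index just past it, and returns ["", $offset] when nothing matches; these conventions are assumptions and need not coincide with the values the real function returns.

```php
<?php
/**
 * Hypothetical sketch, not the TokenTool implementation: finds
 * $brace_open . $tag at or after $offset and returns the balanced region
 * it starts, plus the index just past that region.
 */
function braceTagSketch(string $page, string $brace_open,
    string $brace_close, string $tag, int $offset): array
{
    $start = strpos($page, $brace_open . $tag, $offset);
    if ($start === false) {
        return ["", $offset]; // assumed "not found" convention
    }
    $open_len = strlen($brace_open);
    $close_len = strlen($brace_close);
    $depth = 0;
    $pos = $start;
    while (true) {
        $next_open = strpos($page, $brace_open, $pos);
        $next_close = strpos($page, $brace_close, $pos);
        if ($next_close === false) {
            return ["", $offset]; // unbalanced braces
        }
        if ($next_open !== false && $next_open < $next_close) {
            $depth++; // entering a nested region
            $pos = $next_open + $open_len;
        } else {
            $depth--; // leaving a region
            $pos = $next_close + $close_len;
            if ($depth == 0) {
                return [substr($page, $start, $pos - $start), $pos];
            }
        }
    }
}
```

The sketch expects distinct opening and closing sequences and an $offset no larger than the length of $page.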

getTagOffsetPage()

Gets the outer contents of an xml open/close tag pair from a text source, together with the new offset location just after the pair

getTagOffsetPage(string $page, string $tag, int $offset) : mixed
Parameters
$page : string

text source to search the tag pair in

$tag : string

the xml tag to look for

$offset : int

offset to start searching after for the open/close pair

Return values
mixed
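
One plausible shape for this kind of lookup is sketched below. The name tagOffsetSketch is hypothetical. The sketch assumes the result is a [contents, new offset] pair, that false is returned when the pair is not found, and that same-named tags are not nested; the documentation above leaves the return value unspecified.

```php
<?php
/**
 * Hypothetical sketch, not the TokenTool implementation: returns the text
 * from "<$tag" through "</$tag>" found at or after $offset, together with
 * the offset just past the close tag.
 */
function tagOffsetSketch(string $page, string $tag, int $offset)
{
    $open = "<" . $tag;
    $close = "</" . $tag . ">";
    $start = strpos($page, $open, $offset);
    if ($start === false) {
        return false; // assumed "not found" convention
    }
    $end = strpos($page, $close, $start);
    if ($end === false) {
        return false;
    }
    $end += strlen($close);
    return [substr($page, $start, $end - $start), $end];
}
```

Feeding the returned offset back in as $offset lets a caller step through several elements of one page in order.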

getTopPages()

Returns title and page counts of the top $max_pages many entries in a $page_count_file for a locale $locale_tag

getTopPages(string $page_count_file, string $locale_tag, int $max_pages[, array<string|int, mixed> $title_counts = [] ]) : array<string|int, mixed>
Parameters
$page_count_file : string

page count file to use to search for title counts with respect to a locale

$locale_tag : string

locale to get top pages for

$max_pages : int

number of pages

$title_counts : array<string|int, mixed> = []

title counts that might have come from analyzing a previous file. These will be in the output and contribute to $max_pages

Return values
array<string|int, mixed>

$title_counts wiki page titles => num_views associative array
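
The way earlier $title_counts can feed into the $max_pages limit is sketched below. The name topCountsSketch and the $new_counts parameter are hypothetical; parsing of the page count file is left out because its format is not described here, and summing the counts of a title seen in more than one file is an assumption.

```php
<?php
/**
 * Hypothetical sketch, not the TokenTool implementation: merges counts
 * from one already-parsed file into $title_counts and keeps the
 * $max_pages most viewed titles.
 */
function topCountsSketch(array $new_counts, int $max_pages,
    array $title_counts = []): array
{
    foreach ($new_counts as $title => $views) {
        // Assumption: counts for a repeated title are added together
        $title_counts[$title] = ($title_counts[$title] ?? 0) + $views;
    }
    arsort($title_counts); // sort by views, descending, keeping the keys
    // preserve_keys = true keeps numeric titles such as "2023" intact
    return array_slice($title_counts, 0, $max_pages, true);
}
```

Passing the result of one call as $title_counts in the next call carries the running top list across several page count files.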

smartOpen()

Gets a read file handle for $file_name appropriate for whether it is uncompressed, bz2 compressed, or gz compressed. It also returns function pointers to the functions needed to do reading and closing for the file handle.

smartOpen(string $file_name) : array<string|int, mixed>
Parameters
$file_name : string

name of the file to get a read file handle for

Return values
array<string|int, mixed>

[file_handle, read_function_ptr, close_function_ptr]
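
A minimal sketch of returning a handle with matching read and close functions is shown below. The name smartOpenSketch is hypothetical, and choosing the opener from the file extension is an assumption about how the compression type is detected. The bz2 and gz branches need PHP's bz2 and zlib extensions.

```php
<?php
/**
 * Hypothetical sketch, not the TokenTool implementation: picks the open,
 * read, and close functions from the file extension.
 */
function smartOpenSketch(string $file_name): array
{
    $extension = strtolower(pathinfo($file_name, PATHINFO_EXTENSION));
    if ($extension == "bz2") {
        return [bzopen($file_name, "r"), "bzread", "bzclose"];
    } else if ($extension == "gz") {
        return [gzopen($file_name, "r"), "gzread", "gzclose"];
    }
    return [fopen($file_name, "r"), "fread", "fclose"];
}

/**
 * Usage: the caller never needs to know which kind of file it has.
 */
function readFirstBlockSketch(string $file_name, int $block_size = 8192)
{
    list($fr, $read, $close) = smartOpenSketch($file_name);
    $data = $read($fr, $block_size);
    $close($fr);
    return $data;
}
```

Returning the function names alongside the handle keeps the compression check in one place, so code that reads blocks can stay the same for all three file types.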

translateLocale()

Translates Yioop web app strings to a given locale ($locale_tag) and writes the LOCALE_DIR/$locale_tag/configure.ini file for these translations.

translateLocale(string $locale_tag, int $with_wiki_pages) : mixed

Currently, translations are done using the Yandex.translate (https://translate.yandex.com/) API.

Parameters
$locale_tag : string

of locale to translate

$with_wiki_pages : int

if this is <= 0, public and help wiki pages are not translated. If it is 1, they are translated to the locale only if the locale does not already have a translation. If it is > 1, translation to the locale is forced.

Return values
mixed

wikiHeaderPageToString()

Converts an array of wiki header information and a wiki page contents string into a string suitable to be stored in the GROUP_PAGE_HISTORY database table.

wikiHeaderPageToString(array<string|int, mixed> $wiki_header, string $wiki_page_data) : string
Parameters
$wiki_header : array<string|int, mixed>

of wiki header information

$wiki_page_data : string

mediawiki data

Return values
string

suitable to be stored in GROUP_PAGE_HISTORY

translatePhrase()

Translates a string from English to a given locale using an online translation tool.

translatePhrase(string $translate_text, string $locale_tag) : mixed
Parameters
$translate_text : string

text to be translated

$locale_tag : string

locale to translate to

Return values
mixed

translated string on success, false otherwise

makeNWordGramsFiles()

Makes an n or all word gram Bloom filter based on the supplied arguments and writes it into the resources folder of the given locale. Wikipedia files are assumed to have been placed in the PREP_DIR before this is run.

makeNWordGramsFiles(array<string|int, mixed> $args) : mixed
Parameters
$args : array<string|int, mixed>

command line arguments with first two elements of $argv removed. For details on which arguments do what see the $usage variable

Return values
mixed

makeSuggestTrie()

Makes a trie that can be used to make word suggestions as someone enters terms into the Yioop! search box. Outputs the result into the file suggest_trie.txt.gz in the supplied locale dir

makeSuggestTrie(string $dict_file, string $locale, string $end_marker) : mixed
Parameters
$dict_file : string

where the word list is stored, one word per line

$locale : string

which locale to write the suggest file to

$end_marker : string

used to indicate end of word in the trie

Return values
mixed
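
The role of $end_marker can be seen in a minimal in-memory trie sketch. The name buildTrieSketch is hypothetical. Representing the trie as nested arrays and storing the marker as both key and value are assumptions, and the sketch does not reproduce the on-disk format of suggest_trie.txt.gz.

```php
<?php
/**
 * Hypothetical sketch, not the TokenTool implementation: builds a nested
 * array trie in which $end_marker flags nodes that complete a word.
 */
function buildTrieSketch(array $words, string $end_marker): array
{
    $trie = [];
    foreach ($words as $word) {
        $node = &$trie;
        // Split on Unicode characters rather than bytes
        $chars = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY) ?: [];
        foreach ($chars as $char) {
            if (!isset($node[$char])) {
                $node[$char] = [];
            }
            $node = &$node[$char]; // descend one level
        }
        $node[$end_marker] = $end_marker; // a word ends at this node
        unset($node); // break the reference before the next word
    }
    return $trie;
}
```

The marker should be a string that cannot occur as a character in the dictionary, since it shares a level with ordinary character keys.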

fileWithTrim()

Reads a file into an array, or outputs file not found. Each entry in the array is trimmed, and any blank lines are deleted.

fileWithTrim( $file_name) : array<string|int, mixed>
Parameters
$file_name :

file to read into array

Return values
array<string|int, mixed>

of trimmed lines
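
The behavior described above can be sketched in a few lines. The name fileWithTrimSketch is hypothetical, and returning an empty array after printing the not found message is an assumption; the real function may handle a missing file differently.

```php
<?php
/**
 * Hypothetical sketch, not the TokenTool implementation: reads a file
 * into an array of trimmed, non-blank lines.
 */
function fileWithTrimSketch(string $file_name): array
{
    if (!file_exists($file_name)) {
        echo "$file_name not found\n";
        return []; // assumed behavior for a missing file
    }
    // file() returns false on failure, so fall back to an empty list
    $lines = array_map('trim', file($file_name) ?: []);
    // Drop blank lines and reindex from 0
    return array_values(array_filter($lines, function ($line) {
        return $line !== "";
    }));
}
```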
