TokenTool.php
SeekQuarry/Yioop -- Open Source Pure PHP Search Engine, Crawler, and Indexer
Copyright (C) 2009 - 2023 Chris Pollett chris@pollett.org
LICENSE:
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.
END LICENSE
TokenTool is used to create suggest word dictionaries and 'n' word gram filter files for the Yioop! search engine.
A description of its usage is given in the $usage global variable
Table of Contents
- getTrainingFileNames() : array<string|int, mixed>
- Returns an array of filenames to be used for training the current task in TokenTool
- makeKwikiEntriesGetSeedSites() : mixed
- Generates knowledge wiki callouts for search results pages based on the first paragraph of a Wikipedia page that matches a given query.
- getNextPage() : mixed
- Gets the next wiki page from a file handle pointing to the wiki dump file
- removeTags() : string
- Removes all occurrences of an open/close tag pair from $text
- getBraceTag() : array<string|int, mixed>
- Get a substring offset pair matching the input open close brace tag pattern
- getTagOffsetPage() : mixed
- Get the outer contents of an xml open/close tag pair from a text source together with a new offset location after
- getTopPages() : array<string|int, mixed>
- Returns title and page counts of the top $max_pages many entries in a $page_count_file for a locale $locale_tag
- smartOpen() : array<string|int, mixed>
- Gets a read file handle for $file_name appropriate for whether it is uncompressed, bz2 compressed, or gz compressed. It also returns function pointers to the functions needed to read from and close the file handle.
- translateLocale() : mixed
- Translates Yioop web app strings to a given locale ($locale_tag) and writes the LOCALE_DIR/$locale_tag/configure.ini file for these translations.
- wikiHeaderPageToString() : string
- Converts an array of wiki header information and a wiki page contents string into a string suitable to be stored in the GROUP_PAGE_HISTORY database table.
- translatePhrase() : mixed
- Translates a string from English to a given locale using an online translation tool.
- makeNWordGramsFiles() : mixed
- Makes an n-word-gram or all-word-gram Bloom filter based on the supplied arguments and writes it into the resources folder of the given locale. The Wikipedia files are assumed to have been placed in PREP_DIR before this is run.
- makeSuggestTrie() : mixed
- Makes a trie that can be used to make word suggestions as someone enters terms into the Yioop! search box. Outputs the result into the file suggest_trie.txt.gz in the supplied locale dir
- fileWithTrim() : array<string|int, mixed>
- Reads a file into an array, or outputs a file-not-found message. Each entry of the array is trimmed and any blank lines are deleted.
Functions
getTrainingFileNames()
Returns an array of filenames to be used for training the current task in TokenTool
getTrainingFileNames(array<string|int, mixed> $command_line_args[, int $start_index = 4 ]) : array<string|int, mixed>
Parameters
- $command_line_args : array<string|int, mixed>
-
command line arguments supplied to TokenTool.php, assumed to be an array of the format: [ ... max_file_names_to_consider, file_glob1, file_glob2, ...]
- $start_index : int = 4
-
index in $command_line_args of max_file_names_to_consider
Return values
array<string|int, mixed> —$file_names of files with training data
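A minimal sketch of how the argument layout above could be consumed, assuming the entry at $start_index is an upper bound on the number of names and every later entry is a glob pattern. The name sketchTrainingFileNames is hypothetical, and treating a non-positive bound as "no files" is an assumption of this sketch, not documented behavior.

```php
<?php
/**
 * Hypothetical sketch: expands the file globs found after $start_index
 * and keeps at most the requested number of file names.
 */
function sketchTrainingFileNames(array $command_line_args,
    int $start_index = 4): array
{
    // Assumption: the entry at $start_index is the maximum number of names
    $max_names = intval($command_line_args[$start_index] ?? 0);
    $patterns = array_slice($command_line_args, $start_index + 1);
    $file_names = [];
    foreach ($patterns as $pattern) {
        // glob() returns false on error, so fall back to an empty list
        foreach (glob($pattern) ?: [] as $name) {
            if (count($file_names) >= $max_names) {
                break 2; // limit reached, stop scanning all patterns
            }
            $file_names[] = $name;
        }
    }
    return $file_names;
}
```

Because glob() sorts its matches alphabetically by default, the names kept under the limit are the alphabetically first ones for each pattern.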
makeKwikiEntriesGetSeedSites()
Generates knowledge wiki callouts for search results pages based on the first paragraph of a Wikipedia page that matches a given query.
makeKwikiEntriesGetSeedSites(string $locale_tag, string $page_count_file, string $wiki_dump_file, int $max_entries, int $max_seed_sites) : mixed
Also generates an initial list of potential seed sites for a crawl based off urls scraped from the wiki pages.
Parameters
- $locale_tag : string
-
the IANA language tag of the locale to create knowledge wiki entries and seed sites for
- $page_count_file : string
-
the file name of a wiki page count dump file (or folder of such files). Such a file contains the names of wiki pages and how many times they were accessed
- $wiki_dump_file : string
-
a dump of wikipedia pages and meta pages
- $max_entries : int
-
maximum number of kwiki entries to create. Will pick the one with the highest counts in $page_count_file
- $max_seed_sites : int
-
maximum number of seed sites to add to Yioop's set of seed sites. Again chooses those with highest page count score
Return values
mixed
getNextPage()
Gets the next wiki page from a file handle pointing to the wiki dump file
getNextPage(resource $fr, function $read, int $block_size, mixed &$input_buffer) : mixed
Parameters
- $fr : resource
-
file handle (might be a compressed file handle, for example, one corresponding to gzopen or bzopen)
- $read : function
-
a function for reading from the given file handle
- $block_size : int
-
size of blocks to use when reading
- $input_buffer : mixed
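The parameters above suggest a buffered reader. The following is only a sketch of that idea, assuming pages are delimited by `<page>` and `</page>` elements as in MediaWiki XML dumps. The name sketchNextPage is hypothetical, and returning false once no complete page remains is an assumption of this sketch.

```php
<?php
/**
 * Hypothetical sketch: returns the next <page>...</page> element from a
 * stream, reading $block_size bytes at a time into a caller-owned buffer.
 */
function sketchNextPage($fr, callable $read, int $block_size,
    string &$input_buffer)
{
    $close_tag = "</page>";
    // Keep reading blocks until the buffer holds a complete page
    while (($end = strpos($input_buffer, $close_tag)) === false) {
        $block = $read($fr, $block_size);
        if ($block === false || $block === "") {
            return false; // assumption: signal that no complete page is left
        }
        $input_buffer .= $block;
    }
    $end += strlen($close_tag);
    $start = strpos($input_buffer, "<page>");
    if ($start === false || $start > $end) {
        $start = 0; // no opening tag before the close: return what we have
    }
    $page = substr($input_buffer, $start, $end - $start);
    // Leave everything after the page in the buffer for the next call
    $input_buffer = substr($input_buffer, $end);
    return $page;
}
```

Passing the buffer by reference lets a partially read page survive between calls, which is presumably why $input_buffer is a reference parameter in the signature above.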
Return values
mixed
removeTags()
Removes all occurrences of an open/close tag pair from $text
removeTags(string $text, string $open, string $close) : string
Parameters
- $text : string
-
to remove tag pair from
- $open : string
-
string pattern for open tag
- $close : string
-
string pattern for close tag
Return values
string —text after tag removed
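As an illustration of the behavior described above, here is a minimal sketch that treats $open and $close as literal strings and does not handle nested pairs. Both choices are assumptions, as is the hypothetical name sketchRemoveTags. The documented function may interpret its patterns differently.

```php
<?php
/**
 * Hypothetical sketch: deletes every region of $text that starts with
 * $open and ends with the next following $close (non-nested).
 */
function sketchRemoveTags(string $text, string $open, string $close): string
{
    if ($open === "" || $close === "") {
        return $text; // nothing sensible to remove
    }
    $out = "";
    $pos = 0;
    while (($start = strpos($text, $open, $pos)) !== false) {
        $end = strpos($text, $close, $start + strlen($open));
        if ($end === false) {
            break; // unmatched open tag: keep the remainder unchanged
        }
        // Copy the text before the tag pair and skip past the close tag
        $out .= substr($text, $pos, $start - $pos);
        $pos = $end + strlen($close);
    }
    return $out . substr($text, $pos);
}
```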
getBraceTag()
Get a substring offset pair matching the input open close brace tag pattern
getBraceTag(string $page, string $brace_open, string $brace_close, string $tag, int $offset) : array<string|int, mixed>
Parameters
- $page : string
-
source text to search for the tag in. For example, lala {{infobox {{blah yoyoy}} }} dada.
- $brace_open : string
-
character sequence starting the tag region. For example {{
- $brace_close : string
-
character sequence ending the tag region. For example }}
- $tag : string
-
tag that might be associated with the opening of the sequence. For example infobox.
- $offset : int
-
offset to start searching from
Return values
array<string|int, mixed> —ordered pair [substring containing the brace tag, offset after the tag]. For example, given the input "lala {{infobox {{blah yoyoy}} }} dada" and a search on {{, }}, infobox, 0, the result would be ["{{infobox {{blah yoyoy}}", 31]
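Matching nested brace regions is usually done with a depth counter. The sketch below shows that technique under stated assumptions: the name sketchBraceTag is hypothetical, the returned offset is the index of the first character after the closing sequence, and an empty string with the unchanged offset is returned when nothing matches. The documented function may use different conventions for the offset and the returned substring.

```php
<?php
/**
 * Hypothetical sketch: finds $brace_open . $tag at or after $offset and
 * returns the balanced region it opens, plus the offset just past it.
 */
function sketchBraceTag(string $page, string $brace_open,
    string $brace_close, string $tag, int $offset): array
{
    if ($brace_open === "" || $brace_close === "") {
        return ["", $offset];
    }
    $start = strpos($page, $brace_open . $tag, $offset);
    if ($start === false) {
        return ["", $offset];
    }
    $open_len = strlen($brace_open);
    $close_len = strlen($brace_close);
    $depth = 0;
    $pos = $start;
    $len = strlen($page);
    while ($pos < $len) {
        if (substr($page, $pos, $open_len) === $brace_open) {
            $depth++; // entering a (possibly nested) region
            $pos += $open_len;
        } elseif (substr($page, $pos, $close_len) === $brace_close) {
            $depth--; // leaving a region
            $pos += $close_len;
            if ($depth === 0) {
                return [substr($page, $start, $pos - $start), $pos];
            }
        } else {
            $pos++;
        }
    }
    return ["", $offset]; // braces never balanced
}
```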
getTagOffsetPage()
Get the outer contents of an xml open/close tag pair from a text source together with a new offset location after
getTagOffsetPage(string $page, string $tag, int $offset) : mixed
Parameters
- $page : string
-
text source to search the tag pair in
- $tag : string
-
the xml tag to look for
- $offset : int
-
offset to start searching after for the open/close pair
Return values
mixed
getTopPages()
Returns title and page counts of the top $max_pages many entries in a $page_count_file for a locale $locale_tag
getTopPages(string $page_count_file, string $locale_tag, int $max_pages[, array<string|int, mixed> $title_counts = [] ]) : array<string|int, mixed>
Parameters
- $page_count_file : string
-
page count file to use to search for title counts with respect to a locale
- $locale_tag : string
-
locale to get top pages for
- $max_pages : int
-
number of pages
- $title_counts : array<string|int, mixed> = []
-
title counts that might have come from analyzing a previous file. These will be in the output and contribute to $max_pages
Return values
array<string|int, mixed> —$title_counts wiki page titles => num_views associative array
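To make the selection step concrete, here is a sketch that works on an array of lines rather than on a file. It assumes each line has the form `project title views bytes` (the layout used by Wikimedia page count dumps), that the project field equals the locale tag, and that counts for a repeated title are summed. The name sketchTopPages and these choices are assumptions, not the documented implementation.

```php
<?php
/**
 * Hypothetical sketch: accumulates view counts for one locale and keeps
 * the $max_pages titles with the highest counts.
 */
function sketchTopPages(array $lines, string $locale_tag, int $max_pages,
    array $title_counts = []): array
{
    foreach ($lines as $line) {
        $parts = explode(" ", trim($line));
        // Skip malformed lines and lines belonging to other locales
        if (count($parts) < 3 || $parts[0] !== $locale_tag) {
            continue;
        }
        [, $title, $views] = $parts;
        $title_counts[$title] = ($title_counts[$title] ?? 0) + intval($views);
    }
    arsort($title_counts); // highest view counts first, keys preserved
    return array_slice($title_counts, 0, $max_pages, true);
}
```

Passing the result back in as $title_counts on the next call lets counts from several files compete for the same $max_pages slots, which matches the description of that parameter above.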
smartOpen()
Gets a read file handle for $file_name appropriate for whether it is uncompressed, bz2 compressed, or gz compressed. It also returns function pointers to the functions needed to read from and close the file handle.
smartOpen(string $file_name) : array<string|int, mixed>
Parameters
- $file_name : string
-
name of the file to get a read file handle for
Return values
array<string|int, mixed> —[file_handle, read_function_ptr, close_function_ptr]
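A sketch of how such a triple could be produced, assuming the compression type is inferred from the file extension and that PHP's zlib and bz2 extensions are available for the compressed cases. The name sketchSmartOpen is hypothetical, and the documented function may detect compression differently.

```php
<?php
/**
 * Hypothetical sketch: picks open/read/close functions by file extension
 * and returns [file_handle, read_function_name, close_function_name].
 */
function sketchSmartOpen(string $file_name): array
{
    $extension = strtolower(pathinfo($file_name, PATHINFO_EXTENSION));
    if ($extension === "gz") {
        // Requires the zlib extension
        return [gzopen($file_name, "rb"), "gzread", "gzclose"];
    }
    if ($extension === "bz2") {
        // Requires the bz2 extension; bzopen only accepts "r" or "w"
        return [bzopen($file_name, "r"), "bzread", "bzclose"];
    }
    return [fopen($file_name, "rb"), "fread", "fclose"];
}
```

The caller can then read without knowing the compression type, for example `[$fh, $read, $close] = sketchSmartOpen($path); $block = $read($fh, 8192); $close($fh);`.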
translateLocale()
Translates Yioop web app strings to a given locale ($locale_tag) and writes the LOCALE_DIR/$locale_tag/configure.ini file for these translations.
translateLocale(string $locale_tag, int $with_wiki_pages) : mixed
Currently, translations are done using the Yandex.translate (https://translate.yandex.com/) API.
Parameters
- $locale_tag : string
-
tag of the locale to translate
- $with_wiki_pages : int
-
If this is <= 0, public and help wiki pages are not translated. If it is 1, they are translated to the locale only if the locale does not already have a translation. If it is > 1, translation to the locale is forced.
Return values
mixed
wikiHeaderPageToString()
Converts an array of wiki header information and a wiki page contents string into a string suitable to be stored in the GROUP_PAGE_HISTORY database table.
wikiHeaderPageToString(array<string|int, mixed> $wiki_header, string $wiki_page_data) : string
Parameters
- $wiki_header : array<string|int, mixed>
-
of wiki header information
- $wiki_page_data : string
-
mediawiki data
Return values
string —suitable to be stored in GROUP_PAGE_HISTORY
translatePhrase()
Translates a string from English to a given locale using an online translation tool.
translatePhrase(string $translate_text, string $locale_tag) : mixed
Parameters
- $translate_text : string
-
text to be translated
- $locale_tag : string
-
locale to translate to
Return values
mixed —translated string on success, false otherwise
makeNWordGramsFiles()
Makes an n-word-gram or all-word-gram Bloom filter based on the supplied arguments and writes it into the resources folder of the given locale. The Wikipedia files are assumed to have been placed in PREP_DIR before this is run.
makeNWordGramsFiles(array<string|int, mixed> $args) : mixed
Parameters
- $args : array<string|int, mixed>
-
command line arguments with first two elements of $argv removed. For details on which arguments do what see the $usage variable
Return values
mixed
makeSuggestTrie()
Makes a trie that can be used to make word suggestions as someone enters terms into the Yioop! search box. Outputs the result into the file suggest_trie.txt.gz in the supplied locale dir
makeSuggestTrie(string $dict_file, string $locale, string $end_marker) : mixed
Parameters
- $dict_file : string
-
where the word list is stored, one word per line
- $locale : string
-
which locale to write the suggest file to
- $end_marker : string
-
used to indicate end of word in the trie
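The role of $end_marker is easiest to see in a small trie built from nested arrays. This is only a sketch: the name sketchBuildTrie and the nested-array representation are assumptions, and it does not attempt to reproduce the format of suggest_trie.txt.gz.

```php
<?php
/**
 * Hypothetical sketch: builds a character trie as nested arrays, storing
 * $end_marker as a key wherever a complete word ends.
 */
function sketchBuildTrie(array $words, string $end_marker): array
{
    $trie = [];
    foreach ($words as $word) {
        $node = &$trie;
        // Split into UTF-8 characters so multi-byte letters stay intact
        foreach (preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY) as $char) {
            if (!isset($node[$char])) {
                $node[$char] = [];
            }
            $node = &$node[$char];
        }
        $node[$end_marker] = $end_marker; // marks "a word ends here"
        unset($node); // drop the reference before the next word
    }
    return $trie;
}
```

Without the marker, a prefix that is itself a word (such as "to" inside "top") could not be told apart from a prefix that is not a word.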
Return values
mixed
fileWithTrim()
Reads a file into an array, or outputs a file-not-found message. Each entry of the array is trimmed and any blank lines are deleted.
fileWithTrim($file_name) : array<string|int, mixed>
Parameters
Return values
array<string|int, mixed> —of trimmed lines
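The described behavior can be sketched in a few lines. The name sketchFileWithTrim is hypothetical, and returning an empty array after printing the not-found message is an assumption, since the documentation does not say what is returned in that case.

```php
<?php
/**
 * Hypothetical sketch: reads a file into an array of trimmed,
 * non-blank lines.
 */
function sketchFileWithTrim(string $file_name): array
{
    if (!is_file($file_name)) {
        echo "$file_name not found\n";
        return []; // assumption: empty result when the file is missing
    }
    // file() returns false on failure, so fall back to an empty list
    $lines = array_map('trim', file($file_name) ?: []);
    // Drop blank lines and reindex so keys run 0, 1, 2, ...
    return array_values(array_filter($lines, fn($line) => $line !== ""));
}
```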