diff --git a/en-US/pages/documentation.thtml b/en-US/pages/documentation.thtml
index 48fe758..831f931 100755
--- a/en-US/pages/documentation.thtml
+++ b/en-US/pages/documentation.thtml
@@ -41,10 +41,10 @@
</li>
<li><a href="#yioop-sites"><b>Building Sites with Yioop</b></a>
<ul>
- <li><a href="#localizing">Localizing Yioop to a New Language</a></li>
- <li><a href="#framework">Building a Site using Yioop as Framework</a>
+ <li><a href="#framework">Building a Site using Yioop as a Framework</a>
</li>
<li><a href="#embedding">Embedding Yioop in an Existing Site</a></li>
+ <li><a href="#localizing">Localizing Yioop to a New Language</a></li>
</ul>
</li>
<li><a href="#advanced-topics"><b>Advanced Topics</b></a>
@@ -2689,220 +2689,7 @@ terms or meta-word could be used to look up this document in a Yioop index.</p>
and red otherwise. A similar On/Off switch is present to turn on
and off mirroring on a machine that is acting as a mirror.</p>
<h2 id='yioop-sites'>Building Sites with Yioop</h2>
- <h3 id='localizing'>Localizing Yioop to a New Language</h3>
- <p>The Manage Locales activity can be used to configure Yioop
- for use with different languages and for different regions. If you decide
- to customize your Yioop installation by adding files to
- WORK_DIRECTORY/app as described in the <a href="framework">Building a
- Site using Yioop as a Framework</a> section, then the localization
- tools described in this section can also be used to localize your custom
- site. Clicking the Manage Locales activity one sees a page like:</p>
-<img src='resources/ManagingLocales.png' alt='The Manage Locales form'/>
- <p>
- The first form on this activity allows you to create a new locale --
- an object representing a language and a region. The first field
- on this form should be filled in with a name for the locale in the
- language of the locale. So for French you would put Français. The
- locale tag should be the <a h
- ref='http://en.wikipedia.org/wiki/IANA_language_tag'>IETF language tag</a>.
- The last field on the form is to specify how the language is written.
- There are four options: lr-tb -- from left-to-write from the top of
- the page to the bottom as in English, rl-tb from right-to-left from the
- top the page to the bottom as in Hebrew and Arabic, tb-rl from the top
- of the page to the bottom from right-to-left as in Classical Chinese, and
- finally, tb-lr from the top of the page to the bottom from left-to-right
- as in non-cyrillic Mongolian. lr-tb and rl-tb support work better
- than the vertical language support. As of this writing, only
- Internet Explorer has some vertical language support and the Yioop
- stylesheets for vertical languages still need some tweaking.
- </p>
- <p>The second form for this activity allows one to delete an existing
- locale. Beneath this form is a table with the currently available
- locales. Each row consists of four items: a link with the name of
- a locale which can be used to edit the translations for the locale,
- the IETF language tag for the locale, its writing mode, and finally
- the percentage of strings ids that have already been translated
- for the locale. To translate string ids for a locale click on its
- link. This should display the following form:</p>
-<img src='resources/EditingLocaleStrings.png' alt='The Edit Locales form'/>
- <p>In the above case, the link for English was clicked. The Back link
- in the corner can be used to written to the previous form.
- The Static Pages download has a list of all the static pages (.thtml files)
- which are in either the folder WORK_DIRECTORY/locale/current-tag/pages
- (in this case, current-tag is en-US) or the folder
- WORK_DIRECTORY/locale/default-tag/pages where default-tag is the IANA tag
- for the default language of the Yioop installation. Selecting a page
- allows one to edit it within Yioop. The idea is that one might have
- a couple of static pages you have created in the default locale pages folder
- and a localizer can use this interface to see what is written in these
- files. Yioop automatically creates these files in the directory the
- localizer is localizing for, and the localizer can translate their contents
- into the appropriate language. Beneath this dropdown, the
- Edit Locale page mainly consists of a two column table: the right column
- being string ids, the left column containing what should be their
- translation into the given locale. If no translation exists yet,
- the field will be displayed in red. String ids are extracted by Yioop
- automatically from controller, view, helper, layout, and element class files
- which are either in the Yioop Installation itself or in the installation
- WORK_DIRECTORY/app folder. Yioop looks for tl() function calls to extract
- ids from these files, for example, on seeing tl('search_view_query_results')
- Yioop would extract the id search_view_query_results; on seeing
- tl('search_view_calculated', $data['ELAPSED_TIME']) Yioop would extract
- the id, 'search_view_calculated'. In the second case, the translation is
- expected the translation to have a %s in it for the value of
- $data['ELAPSED_TIME']. Note %s is used regardless of the the type, say
- int, float, string, etc., of $data['ELAPSED_TIME']. tl() can handle
- additional arguments, whenever an additional argument is supplied an
- additional %s would be expected somewhere in the translation string.
- If you make a set of translations, be sure to submit the form associated
- with this table by scrolling to the bottom of the page and clicking the
- Submit link. This saves your translations; otherwise, your work will be
- lost if you navigate away from this page. One aid to translating is if you
- hover your mouse over a field that needs translation, then its translation
- in the default locale (usually English) is displayed. If you want to find
- where in the source code a string id comes from the ids follow
- the rough convention file_name_approximate_english_translation.
- So you would expect to find admin_controller_login_successful
- in the file controllers/admin_controller.php . String ids with the
- prefix db_ (such as the names of activities) are stored in the database.
- So you cannot find these ids in the source code. The tooltip trick
- mentioned above does not work for database string ids.</p>
- <h4>Adding a stemmer, segmenter or supporting character
- n-gramming for your language</h4>
- <p>Depending on the language you are localizing to, it may make sense
- to write a stemmer for words that will be inserted into the index.
- A stemmer takes inflected or sometimes derived words and reduces
- them to their stem. For instance, jumps and jumping would be reduced to
- jump in English. As Yioop crawls it attempts to detect the language of
- a given web page it is processing. If a stemmer exists for this language
- it will call the Tokenizer class's stem($word) method on each word it
- extracts from the document before inserting information about it into the
- index. Similarly, if an end-user is entering a simple conjunctive search
- query and a stemmer exists for his language settings, then the query terms will
- be stemmed before being looked up in the index. Currently, Yioop comes
- with only an English and Italian language stemmers. The English stemmer
- uses the Porter Stemming Algorithm [<a href="#P1980">P1980</a>], the
- Italian Stemmer is based on the algorithm presented at
- <a href="http://snowball.tartoros.org/">snowball.tartoros.org</a>.
- Stemmers should be written as a static method located in the
- file WORK_DIRECTORY/locale/en-US/resources/tokenizer.php .
- The snowball.tartoros.org link
- points to a site that has source code for stemmers for many other languages
- (unfortunately, not written in PHP). It would not be hard to port these
- to PHP and then add modify the tokenizer.php file of the
- appropriate locale folder. For instance, one
- could modify the file
- WORK_DIRECTORY/locale/fr-FR/resources/tokenizer.php
- to contain a class FrTokenizer with a static method
- stem($word) if one wanted to add a stemmer for French.
- </p>
- <p>The class inside tokenizer.php can also be used by Yioop to
- do word segmentation. This is the process of splitting a string of words
- without spaces in some language into its component words. Yioop
- comes with an example segmenter for the zh-CN (Chinese) locale. It works
- by starting at the ned of the string and trying to greedily find the
- longest word that can be matched with the portion of the suffix of the
- string that has been processed yet (reverse maximal match). To do this
- it makes use of a word Bloom filter as part of how it detects if a string
- is a word or not. We describe how to make such filter using token_tool.php
- in a moment.</p>
- <p>In addition to supporting the ability to add stemmers and segmenters,
- Yioop also supports a default technique which can be used in lieu of a
- stemmer called character n-grams. When used this technique segments text
- into sequences of n characters which are then stored in Yioop as a term.
- For instance if n were 3 then the word "thunder" would be split
- into "thu", "hun", "und", "nde", and "der" and each of these would be
- asscociated with the document that contained the word thunder.
- N-grams are useful for languages like Chinese and Japanese in which
- words in the text are often not separated with spaces. It is also
- useful for languages like German which can have long compound words.
- The drawback of n-grams is that they tend to make the index larger.
- For Yioop built-in locales that do not have stemmer the file, the file
- WORK_DIRECTORY/locale/LOCALE-TAG/resources/tokenizer.php has a line
- of the form $CHARGRAMS['LOCALE_TAG'] = SOME_NUMBER; This number is
- the length of string to use in doing char-gramming. If you add a
- language to Yioop and want to use char gramming merely add a tokenizer.php
- to the corresponding locale folder with such a line in it.</p>
- <h4 id="token_tool">Using token_tool.php to improve search performance and
- relevance for your language</h4>
- <p>configs/token_tool.php is used to create suggest word dictionaries and
- word filter files for the Yioop search engine. To create either of
- these items, the user puts a source file in Yioop's WORK_DIRECTORY/prepare
- folder. Suggest word dictionaries are used to supply the content of the
- dropdown of search terms that appears as a user is entering a query in
- Yioop. They are also used to do spell correction suggestions after a
- search has been performed. To make a suggest dictionary one can use a
- command like:</p>
- <pre>
- php token_tool.php dictionary filename locale endmarker
- </pre>
- <p>
- Here <i>filename</i> should be in the current folder or PREP_DIR, locale is
- the locale this suggest (for example, en-US)
- file is being made for and where a file suggest_trie.txt.gz will be written,
- and endmarker is the end of word symbol to use in the trie. For example,
- $ works pretty well. The format of <i>filename</i> should be a sequence of
- line, each line containing a word or phrase followed by a space followed by
- a frequency count. i.e., the last thing on the line should be a number.
- Given a corpus of documents a frequency for a word would be the number of
- occurences of that word in the document.
- </p>
- <p>
- token_tool.php can also be used to make filter files used by a word
- segmenter. To make a filter file
- token_tool.php is run from the command line as:
- </p>
- <pre>
- php token_tool.php segment-filter dictionary_file locale
- </pre>
- <p>
- Here dictionary_file should be a text file with one word/line,
-locale is the IANA language tag of the locale to store the results for.
- </p>
-
- <h4>Obtaining data sets for token_tool.php</h4>
- <p>
- Many word lists with frequencies are obtainable on the web for free
- with Creative Commons licenses. A good starting point is:</p>
- <pre>
- <a href="http://en.wiktionary.org/wiki/Wiktionary:Frequency_lists"
- >http://en.wiktionary.org/wiki/Wiktionary:Frequency_lists</a>
- </pre>
- <p>A little script-fu can generally take such a list and output it with the
- line format of "word/phrase space frequency" needed by
- token_tool.php and as the word/line format used for filter files.</p>
- <h4>Spell correction and romanized input with locale.js</h4>
- <p>Yioop supports the ability to suggest alternative queries
- after a search is performed. These queries are mainly restricted to
- fixing typos in the original query. In order to calculate
- these spelling corrections, Yioop takes the query and for each query term
- computes each possible single character change to that term. For each
- of these it looks up in the given locale's suggest_trie.txt.gz
- a frequency count of that variant, if it exists. If the best suggestion
- is some multiple better than the frequency count of the original query
- then Yioop suggests this alternative query. In order for this to
- work, Yioop needs to know what constitutes a single character in the
- original query. The file locale.js in the
- WORK_DIRECTORY/locale/LOCALE_TAG/resources folder can be used
- to specify this for the locale given by LOCALE_TAG. To do this,
- all you need to do is specify a Javascript variable alpha. For example,
- for French (fr-FR) this looks like:</p>
- <pre>
-var alpha = "aåàbcçdeéêfghiîïjklmnoôpqrstuûvwxyz";
- </pre>
- <p>The letters do not have to be in any alphabetical order, but should be
- comprehensive of the non-punctuation symbols of the language in question.
- </p>
- <p>Another thing locale.js can be used for is to given mappings
- between roman letters and other scripts for use in the Yioop's autosuggest
- dropdown that appears as you type a query. As you type,
- scripts/suggest.js function onTypeTerm is called. This in turn
- will cause a particular locale's locale.js function transliterate(query)
- if it exists. This function should return a string with the result
- of the transliteration. An example of doing this is given for the
- Telugu locale in Yioop.</p>
- <p><a href="#toc">Return to table of contents</a>.</p>
<h3 id='framework'>Building a Site using Yioop as Framework</h3>
<p>The Yioop code base can serve as the code base for new custom search
web sites. The web-app portion of Yioop uses a model-view-controller (MVC)
@@ -3190,6 +2977,220 @@ xmlns:atom="http://www.w3.org/2005/Atom"
these methods as well as how to extract results from what is returned
can be found in the file examples/search_api.php .</p>
<p><a href="#toc">Return to table of contents</a>.</p>
+ <h3 id='localizing'>Localizing Yioop to a New Language</h3>
+ <p>The Manage Locales activity can be used to configure Yioop
+ for use with different languages and for different regions. If you decide
+ to customize your Yioop installation by adding files to
+ WORK_DIRECTORY/app as described in the <a href="#framework">Building a
+ Site using Yioop as a Framework</a> section, then the localization
+ tools described in this section can also be used to localize your custom
+ site. Clicking on the Manage Locales activity, one sees a page like:</p>
+<img src='resources/ManagingLocales.png' alt='The Manage Locales form'/>
+ <p>
+ The first form on this activity allows you to create a new locale --
+ an object representing a language and a region. The first field
+ on this form should be filled in with a name for the locale in the
+ language of the locale. So for French you would put Français. The
+ locale tag should be the <a
+ href='http://en.wikipedia.org/wiki/IANA_language_tag'>IETF language tag</a>.
+ The last field on the form is to specify how the language is written.
+ There are four options: lr-tb -- from left-to-right from the top of
+ the page to the bottom as in English, rl-tb from right-to-left from the
+ top of the page to the bottom as in Hebrew and Arabic, tb-rl from the top
+ of the page to the bottom from right-to-left as in Classical Chinese, and
+ finally, tb-lr from the top of the page to the bottom from left-to-right
+ as in non-Cyrillic Mongolian. Support for lr-tb and rl-tb works better
+ than the vertical language support. As of this writing, only
+ Internet Explorer has some vertical language support and the Yioop
+ stylesheets for vertical languages still need some tweaking.
+ </p>
+ <p>The second form for this activity allows one to delete an existing
+ locale. Beneath this form is a table with the currently available
+ locales. Each row consists of four items: a link with the name of
+ a locale which can be used to edit the translations for the locale,
+ the IETF language tag for the locale, its writing mode, and finally
+ the percentage of string ids that have already been translated
+ for the locale. To translate string ids for a locale, click on its
+ link. This should display the following form:</p>
+<img src='resources/EditingLocaleStrings.png' alt='The Edit Locales form'/>
+ <p>In the above case, the link for English was clicked. The Back link
+ in the corner can be used to return to the previous form.
+ The Static Pages dropdown has a list of all the static pages (.thtml files)
+ which are in either the folder WORK_DIRECTORY/locale/current-tag/pages
+ (in this case, current-tag is en-US) or the folder
+ WORK_DIRECTORY/locale/default-tag/pages where default-tag is the IANA tag
+ for the default language of the Yioop installation. Selecting a page
+ allows one to edit it within Yioop. The idea is that you might have
+ created a couple of static pages in the default locale pages folder
+ and a localizer can use this interface to see what is written in these
+ files. Yioop automatically creates these files in the directory the
+ localizer is localizing for, and the localizer can translate their contents
+ into the appropriate language. Beneath this dropdown, the
+ Edit Locale page mainly consists of a two column table: the right column
+ being string ids, the left column containing what should be their
+ translation into the given locale. If no translation exists yet,
+ the field will be displayed in red. String ids are extracted by Yioop
+ automatically from controller, view, helper, layout, and element class files
+ which are either in the Yioop Installation itself or in the installation
+ WORK_DIRECTORY/app folder. Yioop looks for tl() function calls to extract
+ ids from these files, for example, on seeing tl('search_view_query_results')
+ Yioop would extract the id search_view_query_results; on seeing
+ tl('search_view_calculated', $data['ELAPSED_TIME']) Yioop would extract
+ the id, 'search_view_calculated'. In the second case, the translation is
+ expected to have a %s in it for the value of
+ $data['ELAPSED_TIME']. Note %s is used regardless of the type, say
+ int, float, string, etc., of $data['ELAPSED_TIME']. tl() can handle
+ additional arguments; whenever an additional argument is supplied, an
+ additional %s would be expected somewhere in the translation string.
+ If you make a set of translations, be sure to submit the form associated
+ with this table by scrolling to the bottom of the page and clicking the
+ Submit link. This saves your translations; otherwise, your work will be
+ lost if you navigate away from this page. One aid to translating is if you
+ hover your mouse over a field that needs translation, then its translation
+ in the default locale (usually English) is displayed. If you want to find
+ where in the source code a string id comes from, the ids follow
+ the rough convention file_name_approximate_english_translation.
+ So you would expect to find admin_controller_login_successful
+ in the file controllers/admin_controller.php . String ids with the
+ prefix db_ (such as the names of activities) are stored in the database.
+ So you cannot find these ids in the source code. The tooltip trick
+ mentioned above does not work for database string ids.</p>
+
+ <h4>Adding a stemmer, segmenter or supporting character
+ n-gramming for your language</h4>
+ <p>Depending on the language you are localizing to, it may make sense
+ to write a stemmer for words that will be inserted into the index.
+ A stemmer takes inflected or sometimes derived words and reduces
+ them to their stem. For instance, jumps and jumping would be reduced to
+ jump in English. As Yioop crawls it attempts to detect the language of
+ a given web page it is processing. If a stemmer exists for this language
+ it will call the Tokenizer class's stem($word) method on each word it
+ extracts from the document before inserting information about it into the
+ index. Similarly, if an end-user is entering a simple conjunctive search
+ query and a stemmer exists for their language settings, then the query terms will
+ be stemmed before being looked up in the index. Currently, Yioop comes
+ with only English and Italian language stemmers. The English stemmer
+ uses the Porter Stemming Algorithm [<a href="#P1980">P1980</a>]; the
+ Italian stemmer is based on the algorithm presented at
+ <a href="http://snowball.tartarus.org/">snowball.tartarus.org</a>.
+ Stemmers should be written as a static method located in the
+ file WORK_DIRECTORY/locale/en-US/resources/tokenizer.php .
+ The snowball.tartarus.org link
+ points to a site that has source code for stemmers for many other languages
+ (unfortunately, not written in PHP). It would not be hard to port these
+ to PHP and then modify the tokenizer.php file of the
+ appropriate locale folder. For instance, one
+ could modify the file
+ WORK_DIRECTORY/locale/fr-FR/resources/tokenizer.php
+ to contain a class FrTokenizer with a static method
+ stem($word) if one wanted to add a stemmer for French.
+ </p>
+ <p>The class inside tokenizer.php can also be used by Yioop to
+ do word segmentation. This is the process of splitting a string of words
+ without spaces in some language into its component words. Yioop
+ comes with an example segmenter for the zh-CN (Chinese) locale. It works
+ by starting at the end of the string and trying to greedily find the
+ longest word that can be matched with the portion of the suffix of the
+ string that has not been processed yet (reverse maximal match). To do this
+ it makes use of a word Bloom filter as part of how it detects if a string
+ is a word or not. We describe how to make such a filter using token_tool.php
+ in a moment.</p>
+ <p>In addition to supporting the ability to add stemmers and segmenters,
+ Yioop also supports a default technique which can be used in lieu of a
+ stemmer called character n-grams. When used, this technique segments text
+ into sequences of n characters which are then stored in Yioop as a term.
+ For instance, if n were 3 then the word "thunder" would be split
+ into "thu", "hun", "und", "nde", and "der" and each of these would be
+ associated with the document that contained the word thunder.
+ N-grams are useful for languages like Chinese and Japanese in which
+ words in the text are often not separated with spaces. It is also
+ useful for languages like German which can have long compound words.
+ The drawback of n-grams is that they tend to make the index larger.
+ For Yioop built-in locales that do not have a stemmer, the file
+ WORK_DIRECTORY/locale/LOCALE_TAG/resources/tokenizer.php has a line
+ of the form $CHARGRAMS['LOCALE_TAG'] = SOME_NUMBER; This number is
+ the length of string to use in doing char-gramming. If you add a
+ language to Yioop and want to use char-gramming, merely add a tokenizer.php
+ to the corresponding locale folder with such a line in it.</p>
+ <h4 id="token_tool">Using token_tool.php to improve search performance and
+ relevance for your language</h4>
+ <p>configs/token_tool.php is used to create suggest word dictionaries and
+ word filter files for the Yioop search engine. To create either of
+ these items, the user puts a source file in Yioop's WORK_DIRECTORY/prepare
+ folder. Suggest word dictionaries are used to supply the content of the
+ dropdown of search terms that appears as a user is entering a query in
+ Yioop. They are also used to do spell correction suggestions after a
+ search has been performed. To make a suggest dictionary one can use a
+ command like:</p>
+ <pre>
+ php token_tool.php dictionary filename locale endmarker
+ </pre>
+ <p>
+ Here <i>filename</i> should be in the current folder or PREP_DIR, locale is
+ the locale (for example, en-US) this suggest
+ file is being made for and where a file suggest_trie.txt.gz will be written,
+ and endmarker is the end of word symbol to use in the trie. For example,
+ $ works pretty well. The format of <i>filename</i> should be a sequence of
+ lines, each line containing a word or phrase followed by a space followed by
+ a frequency count, i.e., the last thing on the line should be a number.
+ Given a corpus of documents, a frequency for a word would be the number of
+ occurrences of that word in the corpus.
+ </p>
+ <p>
+ token_tool.php can also be used to make filter files used by a word
+ segmenter. To make a filter file
+ token_tool.php is run from the command line as:
+ </p>
+ <pre>
+ php token_tool.php segment-filter dictionary_file locale
+ </pre>
+ <p>
+ Here dictionary_file should be a text file with one word per line, and
+locale is the IANA language tag of the locale to store the results for.
+ </p>
+
+ <h4>Obtaining data sets for token_tool.php</h4>
+ <p>
+ Many word lists with frequencies are obtainable on the web for free
+ with Creative Commons licenses. A good starting point is:</p>
+ <pre>
+ <a href="http://en.wiktionary.org/wiki/Wiktionary:Frequency_lists"
+ >http://en.wiktionary.org/wiki/Wiktionary:Frequency_lists</a>
+ </pre>
+ <p>A little script-fu can generally take such a list and output it in the
+ line format of "word/phrase space frequency" needed by
+ token_tool.php or in the one word per line format used for filter files.</p>
+ <h4>Spell correction and romanized input with locale.js</h4>
+ <p>Yioop supports the ability to suggest alternative queries
+ after a search is performed. These queries are mainly restricted to
+ fixing typos in the original query. In order to calculate
+ these spelling corrections, Yioop takes the query and for each query term
+ computes each possible single character change to that term. For each
+ of these it looks up in the given locale's suggest_trie.txt.gz
+ a frequency count of that variant, if it exists. If the best suggestion
+ is some multiple better than the frequency count of the original query
+ then Yioop suggests this alternative query. In order for this to
+ work, Yioop needs to know what constitutes a single character in the
+ original query. The file locale.js in the
+ WORK_DIRECTORY/locale/LOCALE_TAG/resources folder can be used
+ to specify this for the locale given by LOCALE_TAG. To do this,
+ all you need to do is specify a JavaScript variable alpha. For example,
+ for French (fr-FR) this looks like:</p>
+ <pre>
+var alpha = "aåàbcçdeéêfghiîïjklmnoôpqrstuûvwxyz";
+ </pre>
+ <p>The letters do not have to be in any alphabetical order, but should be
+ comprehensive of the non-punctuation symbols of the language in question.
+ </p>
+ <p>Another thing locale.js can be used for is to give mappings
+ between Roman letters and other scripts for use in Yioop's autosuggest
+ dropdown that appears as you type a query. As you type, the
+ scripts/suggest.js function onTypeTerm is called. This in turn
+ will call a particular locale's locale.js function transliterate(query)
+ if it exists. This function should return a string with the result
+ of the transliteration. An example of doing this is given for the
+ Telugu locale in Yioop.</p>
+ <p><a href="#toc">Return to table of contents</a>.</p>
<h2 id="advanced-topics">Advanced Topics</h2>
<h3 id='customizing-code'>Modifying Yioop Code</h3>
<p>One advantage of an open-source project is that you have complete