diff --git a/configs/public_help_pages.php b/configs/public_help_pages.php new file mode 100644 index 0000000..8e458d2 --- /dev/null +++ b/configs/public_help_pages.php @@ -0,0 +1,7664 @@ +<?php +/** + * + * Default Public Wiki Pages + * + * This file should be generated using export_public_help_db.php + */ +$public_pages = array(); +$public_pages["en-US"]["404"] = <<< 'EOD' +title=Page Not Found +description=The page you requested cannot be found on our server +END_HEAD_VARS +==The page you requested cannot be found.== +EOD; +$public_pages["en-US"]["409"] = <<< 'EOD' +title=Conflict + +description=Your request would result in an edit conflict. +END_HEAD_VARS +==Your request would result in an edit conflict, so it will not be processed.== +EOD; +$public_pages["en-US"]["About"] = <<< 'EOD' +page_type=standard + +page_alias= + +page_border=solid-border + +toc=true + +title=About + +author=Chris Pollett + +robots= + +description= + +page_header=main_header + +page_footer=main_footer + +END_HEAD_VARS==About SeekQuarry/Yioop== + +SeekQuarry, LLC is the company responsible for the [[http://www.yioop.com/|Yioop PHP Search Engine]] project. SeekQuarry is owned by, and Yioop was mainly written by, me, [[http://www.cs.sjsu.edu/faculty/pollett|Chris Pollett]]. Development of Yioop began in Nov. 2009, and Yioop was first publicly released in August 2010. SeekQuarry maintains the documentation and official public code repository for Yioop. It is also responsible for the SeekQuarry and Yioop servers. SeekQuarry LLC receives revenue from [[http://www.seekquarry.com/?c=main&p=downloads#consulting|consulting services]] related to Yioop and from [[http://www.seekquarry.com/?c=main&p=downloads#contribute|contributions]] from people interested in the continued development of the Yioop Search Engine Software and in the documentary resources the SeekQuarry website provides.
==The Yioop and SeekQuarry Names== +When looking for names for my search engine, I was originally thinking about using the name SeekQuarry, whose domain name hadn't been registered. After deciding that I would use Yioop for the name of my search engine site, I decided I would use SeekQuarry as a site to publish the software that is used in the Yioop engine. That is, yioop.com is a live site that demonstrates the open source search engine software distributed on the seekquarry.com site. + +<br> + +The name Yioop has the following history: I was looking for names that hadn't already been registered. My wife is Vietnamese, so I thought I might have better luck with Vietnamese words since all the English ones seemed to have been taken. I started with the word giup, which is the way to spell 'help' in Vietnamese if you remove the accents. It was already taken. Then I tried yoop, which is my lame way of pronouncing how giup sounds in English. It was already taken. So then I combined the two to get Yioop. + +==Dictionary Data== + +The [[https://en.wikipedia.org/wiki/Bloom_Filter|Bloom filter]] for Chinese word segmentation was developed using the word list [[http://www.mdbg.net/chindict/chindict.php?page=cedict|http://www.mdbg.net/chindict/chindict.php?page=cedict]], which has a Creative Commons Share Alike 3.0 Unported License. [[https://en.wikipedia.org/wiki/Trie|Trie]]s for word suggestion for all languages other than Vietnamese were built using the [[https://en.wiktionary.org/wiki/Wiktionary:Frequency_lists|Wiktionary Frequency Lists]]. These are also available under a [[https://creativecommons.org/licenses/by-sa/3.0/|Creative Commons Share Alike 3.0 Unported License]] as described on [[https://en.wikipedia.org/wiki/Wikipedia:Database_download|Wikipedia's Download page]]. The derived data files (if they were created for that language) for a language with IANA tag locale-tag can be found in the locale/locale-tag/resources folder of the Yioop project.
These are also licensed using the same license. For Vietnamese, I used the following [[http://www.informatik.uni-leipzig.de/~duc/software/misc/wordlist.html|Vietnamese Word List]] obtained with permission from [[http://www.informatik.uni-leipzig.de/~duc/|Ho Ngoc Duc]]. + +<br> + +The English part-of-speech tagging algorithm in Yioop was originally coded by Shailesh Padave using, with permission, the article on Brill tagging by Ian Barber. This kind of tagging in turn makes use of the [[https://en.wikipedia.org/wiki/Brown_Corpus|Brown Corpus]]. Part-of-speech tagging is used in Yioop only if thesaurus lookup is being used in final result reordering. In this case, to generate the thesaurus, [[http://wordnet.princeton.edu/wordnet/|WordNet]] is used. + +==Additional Credits== + +Several people helped with localization: Mary Pollett, Jonathan Ben-David, Ismail.B, Andrea Brunetti, Thanh Bui, Sujata Dongre, Animesh Dutta, Aida Khosroshahi, Youn Kim, Radha Kotipalli, Akshat Kukreti, Vijeth Patil, Chao-Hsin Shih, Ahmed Kamel Taha, and Sugi Widjaja. Several of my former students have contributed code which appears in the Yioop repository. They are: Mangesh Dahale, Ravi Dhillon, Priya Gangaraju, Akshat Kukreti, Sreenidhi Pundi Muralidharan, Nakul Natu, Shailesh Padave, Vijaya Pamidi, Snigdha Parvatneni, Akash Patel, Vijeth Patil, Mallika Perepa, Tarun Pepira, Eswara Rajesh Pinapala, Tamayee Potluri, Shawn Tice, and Sandhya Vissapragada. In addition, several of my students have created projects based on Yioop. Information about these projects can be found on their [[http://www.cs.sjsu.edu/faculty/pollett/masters/|student pages]].
+ +EOD; +$public_pages["en-US"]["Coding"] = <<< 'EOD' +page_type=standard + +page_alias= + +page_border=solid-border + +toc=true + +title=Open Source Search Engine Software - Seekquarry :: Coding + +author=Chris Pollett + +robots= + +description=Describes coding guidelines for the Yioop Search Engine + +page_header=main_header + +page_footer=main_footer + +END_HEAD_VARS{{id='contents' +=Coding Guidelines for Yioop= +}} +==Introduction== + +In order to understand a software project, it helps to understand its organization and conventions. To encourage people to dive in and help improve Yioop, and to ensure contributions are easily understood within the context of Yioop's current standards, this article describes the coding conventions, issue tracking, and commit process for Yioop. It first describes the coding styles to be used for various languages within Yioop. It then describes some guidelines for what kind of code should go into which kind of files in Yioop. Finally, it concludes with a discussion of how issues should be submitted to the issue tracker, how to make patches for Yioop, and how commit messages should be written. The rules below might seem like a lot to follow. If you want to get started coding, it suffices to look at the general section, then skim the sections one is likely to need for the programming task at hand, and finally to look at the section on making patches. + +[[Coding#contents|Return to table of contents]]. + +==General== + +#One of the design goals of Yioop was to minimize dependencies on other projects and libraries. When coming up with a solution to a problem, preference should be given to solutions which do not introduce new dependencies on external projects or libraries. Also, one should be on the lookout for eliminating existing dependencies, configuration requirements, etc. +#The coding language for Yioop is English. This means all comments within the source code should be in English.
+#All data that will be written to the web interface should be localizable. That means easily translatable to any text representation of a human language. The section on [[Coding#localization|localization]] discusses facilities in Yioop for doing this. +#Information written as log messages to log files and profiling information about queries (made available by the query info checkbox in Configure), which are not intended for end-users, do not need to be localized. +#Project file names should be lowercase words. Multi-word file names should separate words with an underscore. For example, index_manager.php +#Each project file should begin with the [[http://www.gnu.org/licenses/|GPL3 license]] as a comment in the appropriate format for the file in question. For example, for a PHP file, this might look like: + /** + * SeekQuarry/Yioop -- + * Open Source Pure PHP Search Engine, Crawler, and Indexer + * + * Copyright (C) 2009 - 2014 Chris Pollett chris@pollett.org + * + * LICENSE: + * + * This program is free software: you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation, either version 3 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program. If not, see <http://www.gnu.org/licenses/>. + * + * END LICENSE + * + * @author Chris Pollett chris@pollett.org + * @package seek_quarry + * @subpackage bin + * @license http://www.gnu.org/licenses/ GPL3 + * @link http://www.seekquarry.com/ + * @copyright 2009 - 2014 + * @filesource + */ +Here the subpackage might vary. +#All non-binary files in Yioop should be [[http://en.wikipedia.org/wiki/Utf-8|UTF-8 encoded]].
Files should not have a [[http://en.wikipedia.org/wiki/Byte_order_mark|byte order mark]]. +#All non-binary files in Yioop should follow the convention of using four spaces for tabs (rather than tab characters). Further, all lines should be less than or equal to 80 columns in length. Lines should not end with trailing white-space characters. It is recommended to use an editor which can display white-space characters and which can display a bar marking the 80th column. For example, one can use [[http://projects.gnome.org/gedit/|gEdit]] or [[http://www.vim.org/|vim]]. +#One should use one space before and after assignment, boolean, binary, and comparison operators. A single space should be used after, but not before, commas and semi-colons. A space should not be used before increment, decrement, and sign operators: + if($i == 0 && $j > 5 * $x) { /* some statements*/} + $i = 7; + $i += 3; + $a = array(1, 2, 3, 4); + for($i = 0; $i < $num; $i++) { + } +#Some leeway may be given on this if it helps make a line under 80 characters -- provided being under 80 characters helps program clarity. +#Do not use unstable code layouts such as: + $something1 = 25; + //... + $something10 = 25; + //... + $something100 = 27; +Although the equal signs are aligned, the spacing is unstable under changes of variable names. Do not have multiple statements on one line such as: + $a=1; $b=6; $c=7; +#Braces on class declarations, interface declarations, function declarations, switch statements, and CSS declaration groups should be vertically aligned. For example, + class MyClass + { + //code for class + } + + interface MyInterface + { + //code for interface + } + + function myFun() + { + //some code + } + + .my-selector + { + //some css + } + + switch($my_var) + { + //some cases + } +#Braces for conditionals, loops, etc. 
should roughly follow the one true brace convention (1TBS): + if(cond) { /* single statement should still use braces */ } + + if(cond) { + //some statements + } else if(another_cond) { + // another condition + } else { + // yet another condition + } + + while(something) { + //do something + } + + for($i = 0; $i < $num; $i++) { + } +#The bodies of conditionals, loops, and other code blocks should be indented 4 spaces. Code should not appear on the same line as an opening brace or on the same line as a closing brace: + class MyClass + { function myFun() //not allowed + { + } + } + + if(something) { + $i++; + $j++; } // not allowed + + if(something) { + $i++; + $j++; + } // good +An exception is allowed for single-line code blocks: + if(something) { $i++; } // is allowed + if(something) { + $i++; + } //is preferred +#When a non-compound statement is split across several lines, all lines after the first should be indented four spaces: + //a long function call + setlocale(LC_ALL, $locale_tag, $locale_tag.'.UTF-8', + $locale_tag.'.UTF8', $locale_tag.".TCVN", $locale_tag.".VISCII", + $locale_tag_parts[0], $locale_tag_parts[0].'.UTF-8', + $locale_tag_parts[0].'.UTF8', $locale_tag_parts[0].".TCVN"); + + // a case where the conditional of an if is long + if(!file_exists("$tag_prefix/statistics.txt") || + filemtime("$tag_prefix/statistics.txt") < + filemtime("$tag_prefix/configure.ini")) { + //code + } + +[[Coding#contents|Return to table of contents]]. + +==PHP== + +Most of the code for Yioop is written in [[http://www.php.net|PHP]]. Here are some conventions that Yioop programmers should follow with regards to this language: + +#Files should have require/include statements at the top. Dynamic (that is, based on a variable) require/include statements should be avoided. They are allowed in base-classes only in the web-app.
+#Classes should be organized as: + class MyClass + { + // Variable Declarations + var $some_var; + + // Constant Declarations + const SOME_CONSTANT = 'some value'; + + // Constructor + function __construct() + { + // code + } + + // abstract member functions, if any + /* + abstract function someAbstractMethod($arg1, $arg2); + */ + + //non static member functions + function someFunction($arg) + { + // code + } + + // static member functions + static function someStaticFunction($arg) + { + // code + } + } +#PHP can make visibility distinctions on variables and member functions using the keywords: public, protected, private. At this point, Yioop classes are not written with this feature. +#Except for loop variables, where $i, $j, $k may be used, preference should be given to variable names which are full words: $queue rather than $q, for example. Some common abbreviations, such as $dir (for directory), $db (for database), and $str (for string), are permissible, but in general abbreviations should be avoided. +#Variable names should be descriptive. If this entails multi-word variable names, then the words should be separated by underscores. For example, $crawl_order. +#Defines, class constants, and global variables (used in more than one file) should be written in all-caps. All other variables should be lowercase only. Some example defines in Yioop are: BASE_DIR, NAME_SERVER, USER_AGENT_SHORT. Some example global variables are: $INDEXED_FILE_TYPES, $IMAGE_TYPES, $PAGE_PROCESSORS. Some example class constants in Yioop are: CrawlConstants::GOT_ROBOT_TXT, CrawlConstants::INVERTED_INDEX, IndexDictionary::DICT_BLOCK_SIZE. +#Function and member function names should be camel-cased beginning with a lowercase letter. For example, insert, crawlHash, getEntry, extractWordStringPageSummary. +#Class and interface names should be camel-cased beginning with an uppercase letter. For example, CrawlConstants, IndexShard, WebArchiveBundle.
Class names involved in the web-app portion of Yioop (controllers, elements, helpers, layouts, models, and views) should begin with an uppercase letter; subsequent words should be lowercase, except the last. For example, SearchfiltersModel, MachinestatusView. This facilitates Yioop's auto-loading mechanism. +#Yioop code should not use PHP [[http://php.net/manual/en/language.generators.overview.php|generators]], [[http://us3.php.net/manual/en/language.namespaces.php|namespaces]], [[http://php.net/manual/en/language.oop5.traits.php|traits]], or [[http://php.net/manual/en/functions.anonymous.php|anonymous functions]], to maintain backward compatibility with PHP 5.2 as much as possible. +#Each require/include, define, global variable, function, class, interface, field, constant, and member function should have a [[http://www.phpdoc.org/|phpDoc]] docblock. These comments look like /** some comment */. +#The GPL license should be included in a phpDoc (page-level) docblock which includes @author, @package, @subpackage, @license, @link http://www.seekquarry.com/, @copyright, and @filesource tags. See the example in the [[Coding#General|General guidelines section]]. +#Field variable (PHP property) docblocks should use @var to say the type of the field. For example, + /** + * Number of days between resets of the page url filter + * If nonpositive, then never reset filter + * @var int + */ + var $page_recrawl_frequency; +#Multi-line phpDocs should have a vertical line of *'s. For example, + /** + * First line of a phpDoc is a short summary, should not rehash function name + * + * Then a blank comment line, followed by + * a longer description. This in turn is followed by any @tags + * + * @param type $var_name description of variable + * @return type description of returned value + */ +#Each parameter of a function/member function should be documented with an @param tag. The return value of a function/member function should be documented with an @return tag.
For example, + /** + * Subtracts the two values $value1 and $value2 + * + * This function is intended to be used as a callback function for sorting + * + * @param float $value1 a value to take the difference between + * @param float $value2 the other value + * @return float the difference + */ + function difference($value1, $value2) + { + return $value1 - $value2; + } +Notice the type of the argument/return value is given after the @tag. This could be NULL, int, float, string, array, object, resource, or mixed -- mixed is used for return values which might have more than one type. +#Multi-line comments within the body of a function or method should not use // such as: + // first line + // second line +C-style comments /* */ should be used instead. +#Multi-line comments within the body of a function or method should not have a vertical stripe of stars. This prevents fragile layout problems with comments. For example, a good multi-line comment within a function might look like: + /* + This loop's end condition + will be satisfied by something clever. + */ + +[[Coding#contents|Return to table of contents]]. + +==Javascript== + +#Variable names should not begin with $'s, to avoid confusion with PHP. Except for this, they should follow the same conventions as PHP variable names described earlier. Here are some example Javascript variable names: i, j, k, request, message_tag. +#Function names should be camel-cased beginning with a lowercase letter. For example, elt, redrawGroup, drawCrawlSelect. +#Function docblock comments have the same format as PHP ones, but rather than use /** */ use /* */. For example, + /* + * Make an AJAX request for a url and put the results as inner HTML of a tag + * + * @param Object tag a DOM element to put the results of the AJAX request + * @param String url web page to fetch using AJAX + */ + function getPage(tag, url) + { + //code + } +#Within functions, comments follow the same conventions as PHP.
+#One should avoid echoing Javascript within PHP code and instead move such code as much as possible to an external .js file. +#Javascript should be included/inlined at the end of web pages, not at the beginning. This allows browsers to begin rendering pages rather than blocking while scripts load. +#Javascript output via PHP in a controller should be output in the $data['SCRIPT'] field sent in the $data variable to a view. +#Localization needed by Javascript should be passed from PHP controllers using the $data['SCRIPT'] field sent in the $data variable to a view. For example, in PHP one might have: + $data["MESSAGE"] = tl('admin_controller_configure_no_set_config'); + $data['SCRIPT'] .= + "doMessage('<h1 class=\"red\" >". + $data["MESSAGE"] . "</h1>');" . + "setTimeout('window.location.href= ". + "window.location.href', 3000);"; +The PHP function tl is used here to provide the translation, which will be used in the Javascript function call. +#Javascript output by a PHP View should be output as much as possible outside of PHP tags <?php ... ?> rather than with echo or similar statements. +#External Javascript files (.js files) should not contain any PHP code. +#External Javascript files should be included using the $data['INCLUDE_SCRIPTS'] array. For example, + $data['INCLUDE_SCRIPTS'] = array("script1", "script2"); +would include script1.js and script2.js from the Yioop script folder. + +[[Coding#contents|Return to table of contents]]. + +==CSS== + +#CSS should [[http://validator.w3.org/|W3C validate]] as either CSS 2 or CSS 3. CSS 3 styles should fail gracefully on non-supported browsers. Use of browser-specific extensions such as -ms, -moz, -o, and -webkit selectors should only be for CSS 3 effects not yet supported by the given browser.
+#A [[http://www.w3.org/TR/CSS21/syndata.html#rule-sets|CSS Rule Set]] in Yioop should follow one of the following formats: + /* single selector case */ + selector + { + property1: value1; /* notice there should be a single space after the : */ + property2: value2; /* all property-value pairs should be terminated with a + semi-colon */ + ... + } + + /* multiple selector case */ + selector1, + selector2, + ... + { + property1: value1; + property2: value2; + ... + } +#Selectors should be written on one line. For example: + .html-rtl .user-nav ul li +Notice a single space is used between the parts of this selector. +#If an element should look different in a right-to-left language than in a left-to-right language, then the .html-ltr and .html-rtl class selectors should be used. For example, + .html-ltr .user-nav + { + margin: 0 0.5in 0 0; + min-width: 10in; + padding: 0; + text-align: right; + } + + .html-rtl .user-nav + { + margin: 0 0 0 0.5in; + min-width: 10in; + padding: 0; + text-align: left; + } +For vertically written languages, one can use the selectors: .html-rl-tb, .html-lr-tb, .html-tb-rl, .html-tb-lr. Finally, if an element needs to be formatted differently for mobile devices, the .mobile selector should be used: + .mobile .user-nav + { + font-size: 11pt; + min-width: 0; + left: 0; + padding: 0; + position: absolute; + right: 0; + top: -10px; + width: 320px; + } +#To increase clarity, left-to-right, right-to-left, and mobile variants of the otherwise same selector should appear near each other in the given stylesheet file. +#Class and ID selectors should be lowercase. Multi-word selector names should have the words separated by a hyphen: + .mobile + #message + #more-menu + .user-nav +#Multiple selectors should be listed in alphabetical order. Properties in a rule-set should be listed alphabetically.
For example, + .html-ltr .role-table, + .html-ltr .role-table td, + .html-ltr .role-table th + { + border: 1px solid black; + margin-left: 0.2in; + padding: 1px; + } +An exception to this is that a browser-specific property should be grouped next to its CSS3 equivalent. + +[[Coding#contents|Return to table of contents]]. + +==HTML== +#Any web page output by Yioop should validate as [[http://www.w3.org/TR/html5/|HTML5]]. This can be checked at the site [[http://validator.w3.org/|http://validator.w3.org/]]. +#Any web page output by Yioop should pass the Web accessibility checks of the [[http://wave.webaim.org/|WAVE Tool]]. +#Web pages should render reasonably similarly in any version of Chrome, Firefox, Internet Explorer, Opera, or Safari released since 2009. To test this, it generally suffices to test a 2009 version of each of these browsers together with a current version. +#All tags in a document should be closed, but short forms of tags are allowed. i.e., a tag like <p> must have a corresponding close tag </p>; for a void element like br, it is permissible to use the short open-close form <br />. +#All tag attributes should have their values in single or double quotes: + <tag attribute1='value1' attribute2='value2' > + not + <tag attribute1=value1 attribute2=value2 > +#For those still using Internet Explorer 6... For any given tag, name attribute values should be different from their id attribute values. For multi-word name attribute values, separate words with underscores; for id attributes, separate them with hyphens. For example, + <input id="some-form-field" name="some_form_field" type="text" /> +#HTML code is output in views, elements, helpers, and layouts in Yioop. This code might be seen in one of two contexts: Either by directly looking at the source code of Yioop (so one can see the PHP code, etc.) or in a browser or other client when one uses the client's "View Source" feature.
Code should look reasonably visually appealing in either context, but with preference given to how it looks as source code. Client-side HTML is often a useful tool for debugging, however, so it should not be entirely neglected. +#Generating code dynamically all on one line should be avoided. Client-side HTML should avoid lines longer than 80 characters as well. +#Although not as strictly followed as for braces, an attempt should be made to align block-level elements. For such an element, one should often place the starting and ending tag on a line by itself and nest the contents by four spaces, if possible. This is not required if the indentation level would be too deep to easily read the line. Inline elements can be more free-form: + <ol> + <li>Although not as strictly followed as for braces, an attempt + should be made to align block-level elements. For such an element, one + should often place the starting and ending tag on a line by itself and nest + the contents by <b>four spaces</b>, if possible. This is not + required if the indentation level would be too deep to easily read the line. + Inline elements can be more free-form: + </li> + </ol> +Notice we indent for the ol tag. Since starting text on a separate line for an li tag might affect appearance, adding a space to the output, we don't do it. We do, however, put the close tag on a line by itself. In the above, the b tag is inlined. +#Here are some examples of splitting long lines in HTML: + <!-- Long open tags --> + + <!-- case where content start and end spacing affects output --> + <tag attr1="value1" attr2="value2" + attr3="value3">contents</tag> + + <!-- or, if it doesn't affect output: --> + <tag attr1="value1" attr2="value2" + attr3="value3"> + contents + </tag> + + <!-- Long urls should be split near '/', '?', '&'.
Most browsers + ignore a carriage return (without spaces) at such places in a url + --> + <a href="http://www.cs.sjsu.edu/faculty/ + pollett/masters/Semesters/Fall10/vijaya/index.shtml">Vijaya Pamidi's + master's pages</a> +#Urls appearing in HTML should make use of the HTML entity for ampersand: &amp; rather than just a plain & character. Browsers will treat these the same, and this can often help with validation issues. + +[[Coding#contents|Return to table of contents]]. + +==SQL== + +SQL in Yioop typically appears embedded in PHP code. This section briefly describes some minor issues with the formatting of SQL, and, in general, how Yioop code should interact with databases. + +#Except in subclasses of DatasourceManager, Yioop PHP code should not directly call native PHP database functions. That is, functions with names beginning with db2_, mysql_, mysqli_, pg_, orcl_, sqlite_, etc., or similar PHP classes. A DatasourceManager object exists as the $db field variable of any subclass of Model. +#SQL should not appear in Yioop in any functions or classes other than subclasses of Model. +#SQL code should be in uppercase. An example PHP string of SQL code might look like: + $sql = "SELECT LOCALE_NAME, WRITING_MODE ". + " FROM LOCALE WHERE LOCALE_TAG = ?"; +#New table names and field names created for Yioop should also be uppercase only. +#Multi-word names should be separated by an underscore: LOCALE_NAME, WRITING_MODE, etc. +#New tables added to Yioop should maintain its [[http://en.wikipedia.org/wiki/Boyce%E2%80%93Codd_normal_form|BCNF normalization]]. [[http://en.wikipedia.org/wiki/Denormalization|Denormalization]] should be avoided. +#Yioop's DatasourceManager class does have a facility for prepared statements. Using prepared statements should be preferred over escaping query parameters.
Below is an example of prepared statements in Yioop called from a model: + $sql = "INSERT INTO CRAWL_MIXES VALUES (?, ?, ?, ?)"; + $this->db->execute($sql, array($timestamp, $mix['NAME'], + $mix['OWNER_ID'], $mix['PARENT'])); +Notice how the values that are to be filled in for the ?'s are listed in order in the array. execute caches the last statement it has seen, so internally if you call $db->execute twice with the same statement it doesn't do the lower level prepare call to the database the second time. You can also use named parameters, as in the following example: + $sql = "UPDATE VISITOR SET DELAY=:delay, END_TIME=:end_time, + FORGET_AGE=:forget_age, ACCESS_COUNT=:access_count + WHERE ADDRESS=:ip_address AND PAGE_NAME=:page_name"; + $this->db->execute($sql, array( + ":delay" => $delay, ":end_time" => $end_time, + ":forget_age" => $forget_age, + ":access_count" => $access_count, + ":ip_address" => $ip_address, ":page_name" => $page_name)); +#In the rare case where a non-prepared statement is used, strings should +be properly escaped using DatasourceManager::escapeString. For example, + $sql = "INSERT INTO LOCALE". + "(LOCALE_NAME, LOCALE_TAG, WRITING_MODE) VALUES". + "('".$this->db->escapeString($locale_name). + "', '".$this->db->escapeString($locale_tag) . + "', '".$this->db->escapeString($writing_mode)."')"; + +[[Coding#contents|Return to table of contents]]. + +==Localization== + +Details on how Yioop can be translated into different languages can be found in the [[Documentation#Localizing%20Yioop%20to%20a%20New%20Language|Yioop Localization Documentation]]. As a coder, what things should be localized are given in the [[Coding#General|general considerations]] section of this document. In this section, we describe a little about what constitutes a good translation, and then talk a little about, as a coder, how you should add new strings to be localized.
We also make some remarks on how localization patches should be created before posting them to the issue tracker. This section describes how Yioop should be localized. The seekquarry.com site is also localizable. If you are interested in translating the Yioop documentation or pages on seekquarry.com, drop me a line at: chris@pollett.org . + +#It can take quite a long time to translate all the strings in Yioop. Translations of only some of the missing strings for some locale are welcome! Preference should be given to strings that an end-user is likely to see. In order of priority, one should translate string ids beginning with search_view_, pagination_helper_, search_controller_, signin_element_, settings_view_, settings_controller_, web_layout_, signin_view_, static_view_, statistics_view_. +#For static pages, there are two versions -- those included with the Yioop download, and those on the seekquarry.com site. The order of translation should be: privacy.thtml, bot.thtml, 404.thtml, and 409.thtml. For translations of the privacy statement for yioop.com, you should add a sentence saying the meaning of the English statement takes precedence over any translations. +#Localization should be done by a native (or close to native) speaker of the language Yioop is being translated to. Automated translations using things like [[http://translate.google.com/|Google Translate]] should be avoided. If used, such translations should be verified by a native speaker before being used. +#There are three main kinds of text which might need to be localized in Yioop: static strings, dynamic strings, and static pages. +#Text that has the potential to be output by the Yioop web interface should only appear in views, elements, helpers, layouts, or controllers. Controllers should only pass the string to be translated to a view, which in turn outputs it, rather than directly output it.
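To make the string-id lookup idea concrete, here is a small, self-contained sketch. It is not Yioop's actual tl() implementation (which reads compiled locale data); the function name and the translations array here are purely illustrative stand-ins for a locale's configure.ini strings:

```php
<?php
// Illustrative sketch only, not Yioop's real tl(): a plain array stands
// in for a locale's configure.ini string-id => translation pairs.
function tlSketch($string_id, $translations)
{
    // Fall back to the string id itself when no translation exists,
    // which makes missing translations easy to spot in the interface
    return isset($translations[$string_id]) ?
        $translations[$string_id] : $string_id;
}
$translations = array(
    "signin_view_password" => "Password",
    "signin_view_username" => "Username",
);
echo tlSketch("signin_view_password", $translations); // Password
?>
```

In the real system, a controller would place the looked-up string into $data and the view would echo it, keeping output out of the controller as the guideline above requires.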
+#If you need Javascript to output a translatable string, use a PHP controller to output a Javascript variable into $data['SCRIPT'], then have your Javascript make use of this variable to provide translation on the client. External .js files should not contain PHP code. An example of using this mechanism is given by the files mix.js and admin_controller.php's editMix member function. +#String ids should be all lowercase, with an underscore used to separate words. They should follow the convention: file_name_approximate_english_translation. For example, signin_view_password is a string id which appears in the views/signin_view.php file, and in English is translated as Password. +#Dynamic string ids are string ids stored in the database which may be added by administrators after downloading Yioop. String ids for these strings should all be in the format: db_use_case_translation. For example, db_activity_manage_locales or db_subsearch_images. +#All suggested localizations posted to the issue tracker should be [[http://en.wikipedia.org/wiki/UTF-8|UTF-8 encoded]]. +#If the only string ids you have translated are static ones, you can just make a new issue in the [[http://www.seekquarry.com/mantis/|issue tracker]] and post the relevant configure.ini file. These files should be located in the Yioop Work Directory/locale/locale_in_question. Ideally, you should add strings through Manage Locales, which will modify this file for you. +#For dynamic string translations, just cut-and-paste the relevant line from Edit Locales into a new note for your issue. +#For more extensive translations, including static pages, please make a git patch and post that. + +[[Coding#contents|Return to table of contents]]. + +==Code-base Organization== + +This section describes what code should be put where when writing new code for Yioop. It can serve as a rough guide as to where to find stuff. Also, coding organization is used to ensure the security of the overall Yioop software.
Some of the material in this section overlaps with what is described in the [[Documentation#Summary%20of%20Files%20and%20Folders|Summary of Files and Folders]] and the [[Documentation#Building%20a%20Site%20using%20Yioop%20as%20Framework|Building a Site using Yioop as Framework]] sections of the main Yioop documentation. All folder paths listed in this section are with respect to the Yioop base folder unless otherwise indicated. + +#There are two main categories of apps in Yioop: the command line tools and programs, and the Yioop web app. +#Core libraries common to both kinds of apps should be put in the lib folder. One exception to this is the subclasses of DatasourceManager. DatasourceManager has database and filesystem functions which might be useful to both kinds of apps. It is contained in models/datasources. The easiest way to create an instance of this class is with a line like: + $model = new Model(); // $model->db will be a DatasourceManager +#Some command-line programs such as bin/fetcher.php and bin/queue_server.php communicate with the web app either through curl requests or by file-based message passing. As a crude way to check the integrity of these messages, as well as to reduce the size of serializations of the messages sent, the CrawlConstants interface defines a large number of shared class constants. This interface is then implemented by all classes that need this kind of message passing. CrawlConstants is defined in the file lib/crawl_constants.php . +#Command-line tools useful for general Yioop configuration together with the Yioop configuration files config.php and local_config.php should be put in the configs folder. Some examples are: configure_tool.php and createdb.php . +#All non-configuration command-line tools should be in the bin folder. +#Example scripts such as the file search.php which demonstrates the Yioop search API should be in the examples folder.
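The shared class constant technique behind CrawlConstants can be sketched generically as follows. The interface, class, and constant names here are invented for illustration; they are not the actual members of lib/crawl_constants.php:

```php
<?php
// Generic sketch of the shared class constant idea described above:
// both message-passing parties implement one interface of constants,
// so they agree on key names, and the short constant values keep the
// serialized messages small. Names below are hypothetical, not Yioop's.
interface MessageConstants
{
    const URL = 'a';    // short values shrink serialized messages
    const STATUS = 'b';
}
class ToyFetcher implements MessageConstants
{
    function makeMessage($url)
    {
        return serialize([self::URL => $url, self::STATUS => 'queued']);
    }
}
class ToyQueueServer implements MessageConstants
{
    function readUrl($msg)
    {
        $data = unserialize($msg);
        return $data[self::URL]; // both sides agree on the key names
    }
}
$fetcher = new ToyFetcher();
$server = new ToyQueueServer();
echo $server->readUrl($fetcher->makeMessage("http://example.com/"));
// prints: http://example.com/
```

A message built with a typo'd key would fail the lookup on the receiving side, which is the crude integrity check the text mentions.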
+#External Javascripts should be in the scripts folder, CSS should be in the css folder, images should be in the resources folder, and sqlite3 databases in the data folder. +#Code (PHP and Javascript) related to a particular locale should be in the folder locale/locale-tag/resources. Examples of this are the files: locale/en-US/resources/locale.js and locale/en-US/resources/tokenizer.php . +#Unit tests and coding experiments (the latter might test different aspects about speed and memory usage of PHP or Javascript constructs) should be in the tests folder. Auxiliary files for these tests and experiments should be put in tests/test_files. +#Unit tests should be written for any new lib folder files. Unit tests should be a subclass of UnitTest, which can be found in lib/unit_test.php. The file name for a unit test should end in _test.php to facilitate its detection by tests/index.php, which is used to run the tests. As much as possible, unit tests should be written for bin folder programs and the web app as well. +#Command-line tools should have a check that they are not being run from the web such as: + // if the command-line program does not have a unit test + if(php_sapi_name() != 'cli') {echo "BAD REQUEST"; exit();} + + // if the command-line program has a unit test + if(!defined("UNIT_TEST_MODE")) { + if(php_sapi_name() != 'cli') {echo "BAD REQUEST"; exit();} + } +#Files other than command line programs, ./index.php, and ./tests/index.php should not define the BASE_DIR or UNIT_TEST_MODE constants. To ensure ./index.php and ./tests/index.php are the only web-facing entry points to Yioop, all other php files should have a line: + if(!defined('BASE_DIR')) {echo "BAD REQUEST"; exit();} +Minimizing entry points helps with security. Both of these files output the HTTP header: + header("X-FRAME-OPTIONS: DENY"); +to try to prevent [[http://en.wikipedia.org/wiki/Clickjacking|clickjacking]].
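Putting the unit test conventions above together, a new lib class might get a test sketched like the following. This is only an illustration: the real base class lives in lib/unit_test.php, so the stub base class, the test method naming, and the assertEqual signature below are assumptions, not copied from the Yioop source:

```php
<?php
// Illustrative sketch of a Yioop-style unit test. The stub UnitTest
// class below only stands in for lib/unit_test.php's real base class
// so this sketch is self-contained; its API is assumed, not Yioop's.
class UnitTest // stand-in for the base class in lib/unit_test.php
{
    function assertEqual($x, $y, $description = "")
    {
        echo ($x === $y ? "PASS " : "FAIL ") . $description . "\n";
    }
}
// In Yioop this class would go in a file named something like
// tests/string_helper_test.php so tests/index.php can detect it
// by its _test.php suffix.
class StringHelperTest extends UnitTest
{
    function reverseWordsTestCase() // hypothetical test of a lib class
    {
        $reversed = implode(" ", array_reverse(explode(" ", "a b")));
        $this->assertEqual($reversed, "b a",
            "reversing words in a two word string");
    }
}
$test = new StringHelperTest();
$test->reverseWordsTestCase(); // prints a PASS line
```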
+#The Yioop web app has the following kinds of files: controllers, models, views (these are the main three); and components, elements, helpers, and layouts (lesser ones). These should be put respectively into the folders: controllers, models, views, controllers/components, views/elements, views/helpers, views/layouts. Filenames for these files should end with their type: i.e., a view should end with _view.php, for example, my_view.php . +#A view roughly corresponds to one web page, a layout is used to render common page headers and footers for several views, an element is used for a relatively static portion of a web page which might appear in more than one view, and a helper is used to dynamically render a web page element such as a select tag according to passed PHP variables. +#Views, elements, and layouts should contain minimal PHP and be mostly HTML. In these classes, for, while, etc. loops should be avoided. PHP in these classes should be restricted to simple conditionals and echos of $data variable fields. +#Control logic involving conditionals, loops, etc. should be put in controllers or components. Components are collections of related methods which might be used by several controllers (a little like PHP 5.4 traits). A component has a $parent field that allows access to the controller it currently lives on. +#In the web app, only models should access the file system or a database. +#Variables whose values come from a web client should be cleaned before being used by a view or a model. Subclasses of Controller have a clean() member function for this purpose. Further, DatasourceManagers have an escapeString method which should be used on strings before inserting them into a database in a Model. +#Models, views, elements, helpers, and layouts should not use the $_GET, $_POST, $_REQUEST super-globals. Controllers should not use $_GET and $_POST; at most they should use $_REQUEST.
This helps facilitate changing whether HTTP GET or POST is used -- also, using the same variable name for both a GET and POST variable is evil -- this restriction may (or may not) help in catching such errors. +#For controllers which use the $_SESSION super-global, the integrity of the session against [[http://en.wikipedia.org/wiki/Csrf|cross-site request forgery]] should be checked. This should be done in the processRequest method using code like: + if(isset($_SESSION['USER_ID'])) { + $user = $_SESSION['USER_ID']; + } else { + $user = $_SERVER['REMOTE_ADDR']; + } + + $data[CSRF_TOKEN] = $this->generateCSRFToken($user); + $token_okay = $this->checkCSRFToken(CSRF_TOKEN, $user); + if($token_okay) { + //now can do stuff + } +#When creating a new release of Yioop, one should check if any required database or locale changes were made since the last version. If database changes have been made, then configs/createdb.php should be updated. Also, lib/upgrade_functions should have a new upgradeDatabaseVersion function added. If locale changes need to be pushed from BASE_DIR/locale files to WORK_DIRECTORY/locale files when people upgrade, then one should change the version number on the view_locale_version string id. i.e., view_locale_version0 as a string id might become view_locale_version1. This string id is in views/view.php. It is not actually output anywhere in the UI -- it is used only for this purpose. Some Javascript files have a version number variable which controls whether client-side, HTML5 localStorage related to the previous release will still work with the new release. If it won't work, then this version number should be updated. An example of such a variable is SUGGEST_VERSION_NO in suggest.js. + +[[Coding#contents|Return to table of contents]]. + +==Issue Tracking/Making Patches/Commit Messages== + +In this section, we discuss the Yioop issue tracker and using the git version control system to make and apply patches for Yioop.
+#If you would like to contribute code to Yioop, but don't yet have an account on the issue tracker, you can [[http://www.seekquarry.com/mantis/signup_page.php|sign up for an account]]. +#After one has an account and is logged in, one can click the [[http://www.seekquarry.com/mantis/bug_report_page.php|Report Issue]] link to report an issue. Be sure to fill in as many report fields as possible and give as much detail as you can. In particular, you should select a Product Version. +#The Upload File fieldset lets you upload files to an issue and the Add Note fieldset allows you to add new notes. This is where you could upload a patch. By default, a new account is a Reporter level account. This won't let you set people besides yourself to monitor (get email about) the issue. However, the administrator will be aware the issue was created. +#A developer level account will allow you to change the status of issues, update/delete issues, set who is monitoring an issue, and assign issues to individuals. This can be done through the fieldset just beneath Attached Files. +#Information about [[http://en.wikipedia.org/wiki/Git_%28software%29|Git]], Git clients, etc. can be obtained from: [[http://git-scm.com/|http://git-scm.com/]]. Here we talk about a typical workflow for coding Yioop using Git. +#After installing git, make sure to configure your user name and email address: + % git config --global user.name "Chris Pollett" + % git config --global user.email "chris@pollett.org" +You should of course change the values above to your name and email. To see your current configuration settings you can type: + % git config -l +If you want to remove any settings you can type: + % git config --unset some.setting.you.dont.want +Setting the user name and email will ensure that you receive credit/blame for any changes that end up in the main git repository. To see who is responsible for which lines in a file, one can use the git blame command.
For example: + % git blame yioopbar.xml + ad3c397c (Chris Pollett 2010-12-28 00:27:38 -0800 1) <?xml version="1.0" e + ad3c397c (Chris Pollett 2010-12-28 00:27:38 -0800 2) <OpenSearchDescriptio + ad3c397c (Chris Pollett 2010-12-28 00:27:38 -0800 3) <ShortName>Yioop< + ad3c397c (Chris Pollett 2010-12-28 00:27:38 -0800 4) <Description>Quickly + ad3c397c (Chris Pollett 2010-12-28 00:27:38 -0800 5) <InputEncoding>UTF-8 + ad3c397c (Chris Pollett 2010-12-28 00:27:38 -0800 6) <Image width="16" hei + 774eb50d (Chris Pollett 2012-12-31 10:47:57 -0800 7) <Url type="text/html" + 774eb50d (Chris Pollett 2012-12-31 10:47:57 -0800 8) template="http:// + ad3c397c (Chris Pollett 2010-12-28 00:27:38 -0800 9) </Url> + 774eb50d (Chris Pollett 2012-12-31 10:47:57 -0800 10) </OpenSearchDescripti +#To make a new copy of the most recent version of Yioop one can run the git clone command: + % git clone https://seekquarry.com/git/yioop.git yioop +This would create a copy of the Yioop repository into a folder yioop in the current directory. Thereafter, to bring this copy up to date with the most recent version of Yioop one can issue the command: + % git pull +#Once one has a git clone of Yioop -- or done a git pull of the most recent changes to Yioop -- one can start coding! After coding a while, you should run git status to see what files you have changed. For example, + % git status + # On branch master + # Your branch is behind 'origin/master' by 1 commit, and can be fast-forwarded. + # + # Untracked files: + # (use "git add <file>..." to include in what will be committed) + # + # tmp.php + nothing added to commit but untracked files present (use "git add" to track) +This says there has been one commit to the main repository since your clone / last git pull. It also says we could bring things up to date by just doing a git pull.
In this case, however, it says that there was an untracked file in the repository. If this file was a file we made with the intention of adding it to Yioop, we should type git add to add it. For example, + % git add tmp.php + + + Now we could try to do a git pull. Suppose we get the message... + + Updating e3e4f20..a9a8ed9 + error: Your local changes to the following files would be overwritten by merge: + tmp.php + Please, commit your changes or stash them before you can merge. + Aborting +What this means is that someone else has also added tmp.php and there are conflicts between these two versions. To merge these two versions, we first commit our version: + % git commit -a -m "Fixes Issue 987, Yioop needs a tmp.php file, a=chris" + [master 3afe055] Fixes Issue 987, Yioop needs a tmp.php file, a=chris + 1 file changed, 4 insertions(+) + create mode 100644 tmp.php +The option -a tells git to put in the commit all changes done to staged files (those that we have git add'd) since the last commit. The option -m is used to give an inline message. The general format of such a message in Yioop is: which issue number in the issue tracker is being fixed, a brief English summary of that issue, and under whose authority the commit is being done. This last part will be in the format a=chris, where a means approved and the person who approved is of sufficient seniority to commit unreviewed things, or in the format r=someone, where someone is the person asked in the issue to review your commits before they are pushed. Often for administrator commits, there won't be an associated issue tracker issue, in which case the format reduces to: some useful English description of the change, a=username of administrator. Now that we have done the above commit, we can try again to do a git pull: + % git pull + Auto-merging tmp.php + CONFLICT (add/add): Merge conflict in tmp.php + Automatic merge failed; fix conflicts and then commit the result.
+ % cat tmp.php + <?php + <<<<<<< HEAD + echo "hello"; + echo "good bye"; + ======= + >>>>>>> a9a8ed990108598d06334e29c0eb37d98f0845aa + ?> +The listing of the tmp.php file above has blocks of the form: <<<<<<< HEAD, =======, >>>>>>> a9a8ed990108598d06334e29c0eb37d98f0845aa. In this case, there is only one such block; in general, there could be many. The stuff before the ======= in the block is in the local repository, the stuff after the ======= is in the remote repository. So in the local copy, there are the two lines: + echo "hello"; + echo "good bye"; +not in the remote repository. On the other hand, there is nothing in the remote repository not in the local copy. So we could fix this conflict by editing this block to look like: + <?php + echo "hello"; + echo "good bye"; + ?> +In general, we should fix each conflict block if there is more than one. Conflicts can also be in more than one file, so we might have to fix each file with conflicts. Once this is done, to tell git we have resolved the conflict, we can type: + % git add tmp.php + % git commit + [master e5ebf9f] Merge branch 'master' of https://seekquarry.com/git/yioop +Here we didn't use -m, so we were dropped into the vi text editor, where we left the default commit message. Now we can go back to editing our local copy of Yioop. If we do a git pull at this point, we will get the message: "Already up-to-date." +#The "opposite command" to git pull is git push. Most casual developers for Yioop don't have push privileges on the main Yioop repository. If one did, a possible development workflow would be: pull the master copy of Yioop to a local branch, make your changes, and post a patch to the Bug/Issue in question on the issue tracker asking someone to review it (probably the administrator, which is me, Chris Pollett). The reviewer gives a thumbs up or down. If it is a thumbs up, you push your changes back to the master branch. Otherwise, you revise your patch and try again.
To configure git so git push works you can either make a ~/.netrc file with + machine seekquarry.com + login <username> + password <password> +in it, chmod it to 600, and type: + % git config remote.upload.url https://seekquarry.com/git/yioop.git +or you can just type the command: + % git config remote.upload.url \ + https://<username>@seekquarry.com/git/yioop.git +After this, you should be able to use the command: + % git push upload master +This pushes your local changes back to the repository. In the second method, you will be prompted for your password. Another common setting that you might want to change is http.sslVerify. If you are getting error messages such as + error: server certificate verification failed. CAfile: + /etc/ssl/certs/ca-certificates.crt CRLfile: none + while accessing https://seekquarry.com/git/yioop.git/info/refs + + + you might want to use the command: + % git config --global --add http.sslVerify false +#In the workflow above, the changes we make to our local repository should be reviewed before we do a push back to the Yioop repository. To do this review, we need to make a patch, upload the patch to the issue tracker, and add someone to this issue's monitor list who could review it, asking them to do a review. These last two steps require the user to have at least a developer account on the issue tracker. Anyone who registers for the issue tracker initially gets a reporter account. If you would like to code for Yioop and have already made a patch, you can send an email to chris@pollett.org to request that your account be upgraded to a developer account. New developers do not get push access on the Yioop repository. For such a developer, the workflow is: create a patch, post it to an issue on the issue tracker, get it approved by an administrator reviewer, then the reviewer pushes the result to the main Yioop repository.
+#After coding, but before making a patch you should run bin/code_tool.php to remove any stray tab characters, or spaces at the end of lines. This program can be run either on a single file or on a folder. For example, one could type: + % php bin/code_tool.php clean tmp.php +This assumes you were in the Yioop base directory and that was also the location of tmp.php. You should also run the command: + % php bin/code_tool.php longlines tmp.php +to check for lines over 80 characters. +#To make a patch, we start with an up-to-date copy of Yioop obtained by either doing a fresh clone or by doing a git pull. Suppose we create a couple new files, add them to our local repository, do a commit, delete one of these files, make a few more changes, and commit the result. This might look on a Mac or Linux system like: + % ed test1.php + test1.php: No such file or directory + a + <?php + ?> + . + wq + 9 + % ed test2.php + test2.php: No such file or directory + a + <?php + ?> + . + wq + 9 + % git add test1.php + % git add test2.php + % git commit -a -m "Adding test1.php and test2.php to the repository" + [master 100f787] Adding test1.php and test2.php to the repository + 2 files changed, 4 insertions(+) + create mode 100644 test1.php + create mode 100644 test2.php + % ed test1.php + 9 + 1 + <?php + a + phpinfo(); + . + wq + 24 + % git rm test2.php + rm 'test2.php' + % ls + ./ README* data/ locale/ search_filters/ + ../ bin/ error.php* models/ test1.php + .DS_Store* blog.php* examples/ my.patch tests/ + .git/ bot.php* extensions/ privacy.php* views/ + .gitignore configs/ favicon.ico resources/ yioopbar.xml + INSTALL* controllers/ index.php* robots.txt + LICENSE* css/ lib/ scripts/ + % git commit -a -m "Adding phpinfo to test1.php, removing test2.php" + [master 7e64648] Adding phpinfo to test1.php, removing test2.php + 2 files changed, 1 insertion(+), 2 deletions(-) + delete mode 100644 test2.php +Presumably, you will use a less ancient editor than ed. 
ed though does have the virtue of not clearing the screen, making it easy to cut and paste what we did. We now want to make a patch consisting of all the commits since we did the git pull. First, we get the name of the commit before we started modifying stuff by doing git log -3 to list out the information about the last three commits. If you had done more commits or less commits since the git pull then -3 would be different. We see the name is e3e4f20674cf19cf5840f431066de0bccd1b226c. The first eight or so characters of this uniquely identify this commit, so we copy them. To make a patch with git, one uses the format-patch command. By default this will make a separate patch file for each commit after the starting commit we choose. To instead make one patch file we use the --stdout option and redirect the stream to my.patch. We can use the cat command to list out the contents of the file my.patch. This sequence of commands looks like the following... + % git log -3 + commit 7e646486faa35f69d7322a8e4fca12fb6b457b8f + Author: Chris Pollett <chris@pollett.org> + Date: Tue Jan 1 17:32:00 2013 -0800 + + Adding phpinfo to test1.php, removing test2.php + + commit 100f7870221d453720c90dcce3cef76c0d475cc8 + Author: Chris Pollett <chris@pollett.org> + Date: Tue Jan 1 16:35:02 2013 -0800 + + Adding test1.php and test2.php to the repository + + commit e3e4f20674cf19cf5840f431066de0bccd1b226c + Author: Chris Pollett <chris@pollett.org> + Date: Tue Jan 1 15:48:34 2013 -0800 + + modify string id in settings_view, remove _REQUEST variable from + machinelog_element, a=chris + % git format-patch e3e4f2067 --stdout > my.patch + % cat my.patch + From 100f7870221d453720c90dcce3cef76c0d475cc8 Mon Sep 17 00:00:00 2001 + From: Chris Pollett <chris@pollett.org> + Date: Tue, 1 Jan 2013 16:35:02 -0800 + Subject: [PATCH 1/2] Adding test1.php and test2.php to the repository + + --- + test1.php | 2 ++ + test2.php | 2 ++ + 2 files changed, 4 insertions(+) + create mode 100644 test1.php + create 
mode 100644 test2.php + + diff --git a/test1.php b/test1.php + new file mode 100644 + index 0000000..acb6c35 + --- /dev/null + +++ b/test1.php + @@ -0,0 +1,2 @@ + +<?php + +?> + diff --git a/test2.php b/test2.php + new file mode 100644 + index 0000000..acb6c35 + --- /dev/null + +++ b/test2.php + @@ -0,0 +1,2 @@ + +<?php + +?> + -- + 1.7.10.2 (Apple Git-33) + + + From 7e646486faa35f69d7322a8e4fca12fb6b457b8f Mon Sep 17 00:00:00 2001 + From: Chris Pollett <chris@pollett.org> + Date: Tue, 1 Jan 2013 17:32:00 -0800 + Subject: [PATCH 2/2] Adding phpinfo to test1.php, removing test2.php + + --- + test1.php | 1 + + test2.php | 2 -- + 2 files changed, 1 insertion(+), 2 deletions(-) + delete mode 100644 test2.php + + diff --git a/test1.php b/test1.php + index acb6c35..e2b4c37 100644 + --- a/test1.php + +++ b/test1.php + @@ -1,2 +1,3 @@ + <?php + + phpinfo(); + ?> + diff --git a/test2.php b/test2.php + deleted file mode 100644 + index acb6c35..0000000 + --- a/test2.php + +++ /dev/null + @@ -1,2 +0,0 @@ + -<?php + -?> + -- + 1.7.10.2 (Apple Git-33) +#One should always list out the patch as we did above before posting it to the issue tracker. It can happen that we accidentally find that we have more things in the patch than we should. Also, it is useful to do one last check that the Yioop coding guidelines seem to be followed within the patch. +#The last step before uploading the patch to the issue tracker is to just check that the patch in fact works. To do this make a fresh clone of Yioop. Copy my.patch into your clone folder. To see what files the patch affects, we can type: + % git apply --stat my.patch + test1.php | 2 ++ + test2.php | 2 ++ + test1.php | 1 + + test2.php | 2 -- + 4 files changed, 5 insertions(+), 2 deletions(-) +Since there are two concatenated patches in my.patch, it first lists the two files affected by the first patch, then the two files affected by the second patch. 
To do a check to see if the patch will cause any problems before applying it, one can type: + % git apply --check my.patch +Finally, to apply the patch we can type: + % git am --signoff < my.patch + Applying: Adding test1.php and test2.php to the repository + Applying: Adding phpinfo to test1.php, removing test2.php +The am command applies patches from a mailbox-style file; the --signoff option says to write a commit message with your email saying you approved this commit. From the above we see each patch within my.patch was applied in turn. To see what this signoff looks like, we can do: + commit aca40730c41fafe9a21d4f0d765d3695f20cc5aa + Author: Chris Pollett <chris@pollett.org> + Date: Tue Jan 1 17:32:00 2013 -0800 + + Adding phpinfo to test1.php, removing test2.php + + Signed-off-by: Chris Pollett <chris@pollett.org> + + commit d0d13d9cf3059450ee6b1b4a51db0d0fae18256c + Author: Chris Pollett <chris@pollett.org> + Date: Tue Jan 1 16:35:02 2013 -0800 + + Adding test1.php and test2.php to the repository + + Signed-off-by: Chris Pollett <chris@pollett.org> + + commit e3e4f20674cf19cf5840f431066de0bccd1b226c + Author: Chris Pollett <chris@pollett.org> + Date: Tue Jan 1 15:48:34 2013 -0800 + + modify string id in settings_view, remove _REQUEST variable from + machinelog_element, a=chris +At this point the patch seems good to go, so we can upload it to the issue tracker! + +[[Coding#contents|Return to table of contents]]. + +==New Version Quality Assurance Checklist== + +The following should be checked before creating a new release of Yioop: + +#All unit tests pass. +#Included sqlite database default.db is up-to-date. +#Install guides are up to date and installation can be performed as described for each of the major platforms (Linux variants, Mac OS X, Windows, HipHop). +#Upgrade functions successfully upgrade Yioop from the last version. Upgrade functions need only be written from the last official release to the current official release.
+#Yioop can perform a regular and archive crawl on each of the platforms for which an install guide has been made. +#Each kind of archive iterator has been tested on the development platform and still works. +#Multi-queue server crawls should be tested. Mirror and Media updater processes should be tested. +#Documentation reflects changes since the last version of Yioop, and screenshots have been updated. +#Source code documentation has been updated. The current command used to do this is + phpdoc -d ./yioop -t ./yioop_docs --defaultpackagename="seek_quarry" --sourcecode \ + --title="Yioop Vcur_version Source Code Documentation" +This should be executed from one directory up from the Yioop source code. +#Each admin panel activity and each setting within each activity works as described. +#Web app still appears correctly in the major browsers: Firefox, Chrome, Internet Explorer, and Safari versions released in the last two years. + +[[Coding#contents|Return to table of contents]]. +EOD; +$public_pages["en-US"]["Demos"] = <<< 'EOD' +page_type=standard + +page_alias= + +page_border=solid-border + +toc= + +title=Demos + +author=Chris Pollett + +robots= + +description=Information about various sites that demo Yioop software. + +page_header=main_header + +page_footer=main_footer + +END_HEAD_VARS==Sites which use Yioop Software== + +* '''[[http://yioop.com/|yioop.com]]''' is the main demo site for Yioop. Currently, the default crawl is set to one from slightly over a year ago. Under Settings, you can select the current crawl, which is slightly smaller but fresher. +* '''[[http://seekquarry.com|seekquarry.com]]'''. SeekQuarry itself was created using the wikis within a vanilla install of Yioop. The download Yioop software form was a simple customization added to the app directory. +* '''[[http://findcan.ca|findcan.ca]]''' demonstrates doing a niche crawl of just Canadian websites.
[[http://findcan.ca/?c=group&a=groupFeeds&just_thread=12| This how to do a niche crawl blog entry]] explains how this crawl was created. This site also demonstrates using Yioop software's built in support for banner and skyscraper ads. + +EOD; +$public_pages["en-US"]["Discussion"] = <<< 'EOD' +page_type=standard + +page_alias= + +page_border=solid-border + +toc=true + +title=Discussion + +author=Chris Pollett + +robots= + +description=Links to Yioop related discussion boards. + +page_header=main_header + +page_footer=main_footer + +END_HEAD_VARS==Yioop Software Discussion== + +*[[https://yioop.com/?c=group&just_group_id=212|Yioop Discussion on yioop.com]] +*[[https://www.seekquarry.com/phpBB|Old Discussion Archive]] + + +EOD; +$public_pages["en-US"]["DocDev"] = <<< 'EOD' +page_type=standard + +page_alias= + +page_border=dashed-border + +toc=true + +title=Docs Under Dev + +author= + +robots= + +description= + +page_header=main_header + +page_footer= + +END_HEAD_VARS{{id="contents" +=Yioop Documentation v 2.0= +}} + +==Overview== +===Getting Started=== + +This document serves as a detailed reference for the Yioop search engine. If you want to get started using Yioop now, you probably want to first read the [[Install|Installation Guides]] page and look at the [[http://www.yioop.com/?c=group&group_id=20&arg=read&a=wiki&page_name=Main|Yioop Video Tutorials Wiki]]. If you cannot find your particular machine configuration there, you can check the Yioop [[Documentation#Requirements|Requirements]] section followed by the more general Installation and Configuration instructions. + +[[Yioop.com]], the demo site for Yioop software, allows people to register accounts. Once registered, if you have questions about Yioop and its installation, you can join the +[[https://yioop.com/?c=group&just_group_id=212|Yioop Software Help]] group and post your questions there. This group is frequently checked by the creators of Yioop, and you will likely get a quick response. 
+ +Having a Yioop account also allows you to experiment with some of the features of Yioop beyond search, such as Yioop Groups, Wikis, and Crawl Mixes, without needing to install the software yourself. The [[Documentation#Search%20and%20User%20Interface|Search and User Interface]], [[Documentation#Managing%20Users,%20Roles,%20and%20Groups|Managing Users, Roles, and Groups]], [[Documentation#Feeds%20and%20Wikis|Feeds and Wikis]], and [[Documentation#Mixing%20Crawl%20Indexes|Mixing Crawl Indexes]] sections below could serve as a guide to testing the portion of the site general users have access to on Yioop.com. + +If you are using Yioop software and do not understand a feature, make sure to also check out the integrated help system throughout Yioop. Clicking on a question mark icon will reveal an additional blue column on a page with help information, as seen below: +{{class="docs" +((resource:Documentation:IntegratedHelp.png|Integrated Help Example)) +}} + +===Introduction=== + +The Yioop search engine is designed to allow users to produce indexes of a web-site or a collection of web-sites. The number of pages a Yioop index can handle ranges from small sites to those containing tens or hundreds of millions of pages. In contrast, a search-engine like Google maintains an index of tens of billions of pages. Nevertheless, since you, the user, have control over the exact sites which are being indexed with Yioop, you have much better control over the kinds of results that a search will return. Yioop provides a traditional web interface to do queries, an RSS API, and a function API. It also supports many common features of a search portal such as user discussion groups, blogs, wikis, and a news aggregator. In this section we discuss some of the different search engine technologies which exist today, how Yioop fits into this eco-system, and when Yioop might be the right choice for your search engine needs.
In the remainder of this document after the introduction, we discuss how to get and install Yioop; the files and folders used in Yioop; the various crawl, search, social portal, and administration facilities of Yioop; localization in the Yioop system; building a site using the Yioop framework; embedding Yioop in an existing web-site; customizing Yioop; and the Yioop command-line tools. + +Since the mid-1990s a wide variety of search engine technologies have been explored. Understanding some of this history is useful in understanding Yioop's capabilities. In 1994, Web Crawler, one of the earliest still widely-known search engines, only had an index of about 50,000 pages, which was stored in an Oracle database. Today, databases are still used to create indexes for small to medium size sites. An example of such a search engine written in PHP is [[http://www.sphider.eu/|Sphider]]. Given that a database is being used, one common way to associate a word with a document is to use a table with columns like word id, document id, score. Even if one is only extracting about a hundred unique words per page, this table's size would need to be in the hundreds of millions for even a million page index. This edges towards the limits of the capabilities of database systems, although techniques like table sharding can help to some degree. The Yioop engine uses a database to manage some things like users and roles, but uses its own web archive format and indexing technologies to handle crawl data. This is one of the reasons that Yioop can scale to larger indexes. + +When a site that is being indexed consists of dynamic pages rather than the largely static page situation considered above, and those dynamic pages get most of their text content from a table column or columns, different search index approaches are often used.
Many database management systems, like [[http://www.mysql.com/|MySQL]]/[[https://mariadb.org/|MariaDB]], support the ability to create full text indexes for text columns. A faster, more robust approach is to use a stand-alone full text index server such as [[http://www.sphinxsearch.com/|Sphinx]]. However, for these approaches to work, the text you are indexing needs to be in a database column or columns, or have an easy to define "XML mapping". Nevertheless, these approaches illustrate another common thread in the development of search systems: search as an appliance, where you have a separate search server and access it either through a web-based API or through function calls.

Yioop has both a search function API and a web API that can return [[http://www.opensearch.org/|Open Search RSS]] results or a JSON variant. These can be used to embed Yioop within your existing site. If you want to create a new search engine site, Yioop provides all the basic features of a web search portal. It has its own account management system with the ability to set up groups that have both discussion boards and wikis with various levels of access control. The built-in Public group's wiki, together with the GUI configure page, can be used to completely customize the look and feel of Yioop. Third-party display ads can also be added through the GUI interface. If you want further customization, Yioop offers a web-based, model-view-adapter (a variation on model-view-controller) framework with a web interface for localization.

By 1997 commercial sites like Inktomi and AltaVista already had tens or hundreds of millions of pages in their indexes [ [[Documentation#P1994|P1994]] ] [ [[Documentation#P1997a|P1997a]] ] [ [[Documentation#P1997b|P1997b]] ]. Google [ [[Documentation#BP1998|BP1998]] ], circa 1998, in comparison had an index of about 25 million pages. These systems used many machines, each working on parts of the search engine problem.
On each machine there would, in addition, be several search-related processes, and for crawling, hundreds of simultaneous threads would be active to manage open connections to remote machines. Without threading, downloading millions of pages would be very slow. Yioop is written in [[http://www.php.net/|PHP]]. This language is the 'P' in the very popular [[http://en.wikipedia.org/wiki/LAMP_%28software_bundle%29|LAMP]] web platform. This is one of the reasons PHP was chosen as the language of Yioop. Unfortunately, PHP does not have built-in threads. However, the PHP language does have a multi-curl library (implemented in C) which uses threading to support many simultaneous page downloads. This is what Yioop uses. Like these early systems, Yioop also supports the ability to distribute the task of downloading web pages to several machines. Since the problem of managing many machines becomes more difficult as their number grows, Yioop further has a web interface for turning on and off the processes related to crawling on remote machines managed by Yioop.

There are several aspects of a search engine besides downloading web pages that benefit from a distributed computational model. One of the reasons Google was able to produce high quality results was that it was able to accurately rank the importance of web pages. The computation of this page rank involves repeatedly applying Google's normalized variant of the web adjacency matrix to an initial guess of the page ranks. This problem naturally decomposes into rounds. Within a round, the Google matrix is applied to the current page rank estimates of a set of sites. This operation is reasonably easy to distribute to many machines. Computing how relevant a word is to a document is another task that benefits from multi-round, distributed computation.
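The multi-curl approach to simultaneous downloads mentioned above might be used along the following lines; this is an illustrative sketch, not Yioop's actual fetcher code:

```php
<?php
// Illustrative use of PHP's multi-curl to download several pages at once.
// A simplified sketch, not Yioop's actual fetcher implementation.
function fetch_all(array $urls, int $timeout = 10): array
{
    $multi = curl_multi_init();
    $handles = [];
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
        curl_multi_add_handle($multi, $ch);
        $handles[$url] = $ch;
    }
    // Drive all transfers concurrently; the C-level curl library manages
    // the simultaneous connections, so PHP needs no threads of its own.
    do {
        $status = curl_multi_exec($multi, $active);
        if ($active) {
            curl_multi_select($multi, 1.0); // wait for socket activity
        }
    } while ($active && $status === CURLM_OK);
    $pages = [];
    foreach ($handles as $url => $ch) {
        $pages[$url] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($multi, $ch);
        curl_close($ch);
    }
    curl_multi_close($multi);
    return $pages; // map from url to downloaded page body
}
```

A fetcher built this way downloads a whole batch of urls in roughly the time of the slowest single download, rather than the sum of all of them.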
When a document is processed by indexers on multiple machines, words are extracted and a stemming algorithm such as [ [[Documentation#P1980|P1980]] ] or a character n-gramming technique might be employed (a stemmer would extract the word jump from words such as jumps, jumping, etc.; converting jumping to 3-grams would make terms of length 3, i.e., jum, ump, mpi, pin, ing). For some languages like Chinese, where spaces between words are not always used, a segmenting algorithm like reverse maximal match might be used. Next, a statistic such as BM25F [ [[Documentation#ZCTSR2004|ZCTSR2004]] ] (or at least the non-query time part of it) is computed to determine the importance of that word in that document compared to that word amongst all other documents. To do this calculation, one needs to compute global statistics concerning all documents seen, such as their average length, how often a term appears in a document, etc. If the crawling is distributed, it might take one or more merge rounds to compute these statistics based on partial computations on many machines. Hence, each of these computations benefits from allowing distributed computation to be multi-round. Infrastructure such as the Google File System [ [[Documentation#GGL2003|GGL2003]] ], the MapReduce model [ [[Documentation#DG2004|DG2004]] ], and the Sawzall language [ [[Documentation#PDGQ2006|PDGQ2006]] ] was built to make these multi-round distributed computation tasks easier. In the open source community, the [[http://hadoop.apache.org/docs/hdfs/current/hdfs_design.html|Hadoop Distributed File System]], [[http://hadoop.apache.org/docs/mapreduce/current/index.html|Hadoop MapReduce]], and [[http://hadoop.apache.org/pig/|Pig]] play an analogous role [ [[Documentation#W2009|W2009]] ].
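The character n-gramming step just described is simple to state in code; the helper below is a hypothetical illustration, not Yioop's actual term extraction code:

```php
<?php
// A minimal character n-gramming sketch: slide a window of length $n
// across a term. (Illustrative only, not Yioop's term extraction code.)
function char_grams(string $term, int $n): array
{
    $grams = [];
    for ($i = 0; $i + $n <= strlen($term); $i++) {
        $grams[] = substr($term, $i, $n);
    }
    return $grams;
}
// As in the example in the text,
// char_grams("jumping", 3) yields jum, ump, mpi, pin, ing
```

Because every word is reduced to fixed-length fragments, this technique needs no language-specific stemming rules, which is why it is a reasonable fallback for languages without a stemmer.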
Recently, a theoretical framework has begun to be developed for which algorithms can be carried out as rounds of: map inputs to sequences of key-value pairs, shuffle pairs with the same keys to the same nodes, and reduce the key-value pairs at each node by some computation [ [[Documentation#KSV2010|KSV2010]] ]. This framework shows the MapReduce model is capable of solving quite general cloud computing problems -- more than is needed just to deploy a search engine.

Infrastructure such as this is not trivial for a small-scale business or individual to deploy. On the other hand, most small businesses and homes do have available several machines not all of whose computational abilities are being fully exploited. So the capability to do distributed crawling and indexing in this setting exists. Further, high-speed internet for homes and small businesses is steadily getting better. Since the original Google paper, techniques to rank pages have been simplified [ [[Documentation#APC2003|APC2003]] ]. It is also possible to approximate some of the global statistics needed in BM25F using suitably large samples. More details on the exact ranking mechanisms used by Yioop can be found on the [[Ranking|Yioop Ranking Mechanisms]] page.

Yioop tries to exploit these advances to use a simplified distributed model which might be easier to deploy in a smaller setting. Each node in a Yioop system is assumed to have a web server running. One of the Yioop nodes' web apps is configured to act as a coordinator for crawls. It is called the '''name server'''. In addition to the name server, one might have several processes called '''queue servers''' that perform scheduling and indexing jobs, as well as '''fetcher''' processes which are responsible for downloading pages and for page processing such as the stemming, char-gramming, and segmenting mentioned above. Through the name server's web app, users can send messages to the queue servers and fetchers.
This interface writes message files that the queue servers periodically look for. Fetcher processes periodically ping the name server to find the name of the current crawl as well as a list of queue servers. Fetcher programs then periodically make requests in a round-robin fashion to the queue servers for messages and schedules. A schedule is data to process and a message has control information about what kind of processing should be done. A given queue_server is responsible for generating schedule files for data with a certain hash value; for example, it schedules the crawling of urls whose host names hash to that queue_server's id. As a fetcher processes a schedule, it periodically POSTs the result of its computation back to the responsible queue server's web server. The data is then written to a set of received files. The queue_server, as part of its loop, looks for received files and merges their results into the index so far. So the model is in a sense one round: URLs are sent to the fetchers, summaries of downloaded pages are sent back to the queue servers and merged into their indexes. As soon as the crawl is over, one can do text search on the crawl. Deploying this computation model is relatively simple: The web server software needs to be installed on each machine, the Yioop software (which has the fetcher, queue_server, and web app components) is copied to the desired location under the web server's document folder, each instance of Yioop is configured to know who the name server is, and finally, the fetcher programs and queue server programs are started.

As an example of how this scales, a 2010 Mac Mini running a queue server program can schedule and index about 100,000 pages/hour. This corresponds to the work of about 7 fetcher processes (which may be on different machines -- roughly, you want 1 GB and 1 core per fetcher).
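The hash-based division of urls among queue servers described above can be sketched as follows; the particular hash function here is chosen purely for illustration and may differ from what Yioop actually uses:

```php
<?php
// Illustrates partitioning urls among queue servers by hashing host names.
// The hash used here (crc32) is for illustration only; Yioop's actual
// partition function may differ.
function queue_server_for(string $url, int $num_queue_servers): int
{
    $host = parse_url($url, PHP_URL_HOST);
    // crc32 gives a deterministic integer hash of the host name
    return abs(crc32($host)) % $num_queue_servers;
}
// Every url from the same host maps to the same queue server, so that
// one server alone schedules and throttles crawling for that host.
```

Partitioning by host rather than by whole url is what lets a single queue server enforce per-site politeness rules, like crawl delays and quotas, without coordinating with the other queue servers.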
The checks by fetchers on the name server are lightweight, so adding another machine with a queue server and the corresponding additional fetchers allows one to effectively double this speed. This also has the benefit of speeding up query processing: when a query comes in, it gets split into queries for each of the queue servers' web apps, but each query only "looks" slightly more than half as far into the posting list as would occur in a single queue server setting. To further increase query throughput, the number of queries that can be handled at a given time, Yioop installations can also be configured as "mirrors" which keep an exact copy of the data stored in the site being mirrored. When a query request comes into a Yioop node, either it or any of its mirrors might handle it. Query processing for multi-word queries can actually be a major bottleneck if you don't have many machines and you do have a large index. To further speed this up, Yioop uses a hybrid inverted index/suffix tree approach to store word lookups. The suffix tree ideas were motivated by [ [[Documentation#PTSHVC2011|PTSHVC2011]] ].

Since a multi-million page crawl involves downloading from the web rapidly over several days, Yioop supports the ability to dynamically change its crawl parameters as a crawl is going on. This allows a web admin, on request, to disallow Yioop from continuing to crawl a site or to restrict the number of urls/hour crawled from a site without having to stop the overall crawl. One can also, through a web interface, inject new seed sites into the crawl while it is occurring. This can help if someone suggests a site that might otherwise not be found by Yioop given its original list of seed sites. Crawling at high speed can cause a website to become congested and unresponsive. As of Version 0.84, if Yioop detects a site is becoming congested it can automatically slow down the crawling of that site.
Finally, crawling at high speed can cause your domain name server (the server that maps www.yioop.com to 173.13.143.74) to become slow. To reduce the effect of this, Yioop supports domain name caching.

Despite its simpler one-round model, Yioop does a number of things to improve the quality of its search results. While indexing, Yioop can make use of Lasso regression classifiers [ [[Documentation#GLM2007|GLM2007]] ] using data from earlier crawls to help label and/or rank documents in the active crawl. Yioop also takes advantage of the link structure that might exist between documents in a one-round way: For each link extracted from a page, Yioop creates a micropage which it adds to its index. This includes relevancy calculations for each word in the link as well as an [ [[Documentation#APC2003|APC2003]] ]-based ranking of how important the link was. Yioop supports a number of iterators which can be thought of as implementing a stripped-down relational algebra geared towards word-document indexes (this is much the same idea as Pig). One of these operators allows one to make results from unions of stored crawls. This allows one to do many smaller topic-specific crawls and combine them, with one's own weighting scheme, into a larger crawl. A second useful operator allows you to display a certain number of results from a given subquery, then go on to display results from other subqueries. This allows you to make a crawl presentation like: the first result should come from the open crawl results, the second result from Wikipedia results, the next result should be an image, and any remaining results should come from the open search results. Yioop comes with a GUI facility to make the creation of these crawl mixes easy. To speed up query processing for these crawl mixes, one can also create materialized versions of crawl mix results, which makes a separate index of crawl mix results. Another useful operator Yioop supports allows one to perform groupings of document results.
In the search results displayed, grouping by url allows all links and documents associated with a url to be grouped as one object. Scoring of this group is a sum of all these scores. Thus, link text is used in the score of a document. How much weight a word from a link gets also depends on the link's rank. So a low-ranked link with the word "stupid" to a given site would tend not to show up early in the results for the word "stupid". Grouping is also used to handle deduplication: It might be the case that the pages of many different URLs have essentially the same content. Yioop creates a hash of the web page content of each downloaded url. Amongst urls with the same hash, only the one that is linked to the most will be returned after grouping. Finally, if a user wants to do more sophisticated post-processing such as clustering or computing page rank, Yioop supports a straightforward architecture for indexing plugins.

There are several open source crawlers which can scale to crawls in the millions to hundreds of millions of pages. Most of these are written in Java, C, C++, or C#, not PHP. Three important ones are [[http://nutch.apache.org/|Nutch]]/ [[http://lucene.apache.org/|Lucene]]/ [[http://lucene.apache.org/solr/|Solr]] [ [[Documentation#KC2004|KC2004]] ], [[http://www.yacy.net/|YaCy]], and [[http://crawler.archive.org/|Heritrix]] [ [[Documentation#MKSR2004|MKSR2004]] ]. Nutch is the original application for which the Hadoop infrastructure described above was developed. Nutch is a crawler, Lucene is for indexing, and Solr is a search engine front end. The YaCy project uses an interesting distributed hash table peer-to-peer approach to crawling, indexing, and search. Heritrix is a web crawler developed at the [[http://www.archive.org/|Internet Archive]]. It was designed to do archival quality crawls of the web. Its ARC file format inspired the use of WebArchive objects in Yioop.
WebArchives are Yioop's container file format for storing web pages, web summary data, url lists, and other kinds of data used by Yioop. A WebArchive is essentially a linked-list of compressed, serialized PHP objects, the last element in this list containing a header object with information like version number and a total count of objects stored. The compression format can be chosen to suit the kind of objects being stored. The header can be used to store auxiliary data structures into the list if desired. One nice aspect of serialized PHP objects versus serialized Java objects is that they are human-readable text strings. The main purpose of WebArchives is to allow one to store many small files compressed as one big file. They also make data from web crawls very portable, making them easy to copy from one location to another. Like Nutch and Heritrix, Yioop also has a command-line tool for quickly looking at the contents of such archive objects.

The [[http://www.archive.org/web/researcher/ArcFileFormat.php|ARC format]] is one example of an archival file format for web data. Besides at the Internet Archive, ARC and its successor [[http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml|WARC format]] are often used by TREC conferences to store test data sets such as [[http://ir.dcs.gla.ac.uk/test_collections/|GOV2]] and the [[http://lemurproject.org/clueweb09/|ClueWeb 2009]] / [[http://lemurproject.org/clueweb12/|ClueWeb 2012]] Datasets. In addition, it was used by grub.org (hopefully, only on a temporary hiatus), a distributed, open-source, search engine project in C#. Another important format for archiving web pages is the XML format used by Wikipedia for archiving MediaWiki wikis. [[http://www.wikipedia.org/|Wikipedia]] offers [[http://en.wikipedia.org/wiki/Wikipedia:Database_download|creative common-licensed downloads]] of their site in this format.
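The storage idea behind a WebArchive record, serializing a PHP object and then compressing it, can be illustrated in a few lines; the real format adds the linked-list structure and trailing header object described above:

```php
<?php
// Sketch of the idea behind a single WebArchive record: a PHP object is
// serialized to a human-readable string, then compressed for storage.
// The real WebArchive format has more structure (a linked list of records
// plus a trailing header object) than this illustration.
$summary = ["url" => "http://www.example.com/",
    "title" => "Example", "description" => "An example page summary"];
$record = gzcompress(serialize($summary));
// Reading a record back is the reverse operation:
$restored = unserialize(gzuncompress($record));
// Before compression, serialize() output, unlike many binary formats,
// is readable text, beginning here with something like:
//   a:3:{s:3:"url";s:23:"http://www.example.com/"; ...
```

Stored one after another, many such small records become one big compressed file that is easy to copy between machines.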
The [[http://www.dmoz.org/|Open Directory Project]] makes available its [[http://www.dmoz.org/rdf.html|ODP data set]] in an RDF-like format licensed using the Open Directory License. Thus, we see that there are many large scale useful data sets that can be easily licensed. Raw data dumps do not contain indexes of the data, though. This makes sense because indexing technology is constantly improving and it is always possible to re-index old data. Yioop supports importing and indexing data from ARC, WARC, database query results, MediaWiki XML dumps, and Open Directory RDF. Yioop further has a generic text importer which can be used to index log records, mail, Usenet posts, etc. Yioop also supports re-indexing of old Yioop data files created after version 0.66, and indexing crawl mixes. This means that using Yioop you can have searchable access to many data sets as well as the ability to maintain your data going forward. When displaying caches of web pages, Yioop's interface further supports the ability to display a history of all cached copies of that page, in a similar fashion to the Internet Archive's interface.

Another important aspect of creating a modern search engine is the ability to display various media sources in an appropriate way. Yioop comes with built-in subsearch abilities for images, where results are displayed as image strips; video, where thumbnails for video are shown; and news, where news items are grouped together and a configurable set of news/twitter feeds can be set to be updated on an hourly basis.

This concludes the discussion of how Yioop fits into the current and historical landscape of search engines and indexes.

===Feature List===

Here is a summary of the features of Yioop:

*'''General'''
**Yioop is an open-source, distributed crawler and search engine written in PHP.
**Crawling, indexing, and serving search results can be done on a single machine or distributed across several machines.
**The fetcher/queue_server processes on several machines can be managed through the web interface of a main Yioop instance.
**Yioop installations can be created with a variety of topologies: one queue_server and many fetchers or several queue_servers and many fetchers.
**Using web archives, crawls can be mirrored amongst several machines to speed up serving search results. This can be further sped up by using memcache or filecache.
**Yioop can be used to create web sites via its own built-in wiki system. For more complicated customizations, Yioop's model-view-adapter framework is designed to be easily extensible. This framework also comes with a GUI which makes it easy to localize strings and static pages.
**Yioop search result and feed pages can be configured to display banner or skyscraper ads through a Site Admin GUI (if desired).
**Yioop has been optimized to work well with smart phone web browsers and with tablet devices.
*'''Social and User Interface'''
**Yioop can be configured to allow or not to allow users to register for accounts.
**If allowed, user accounts can create discussion groups, blogs, and wikis.
** Blogs and wikis support attaching images, videos, and files and also support including math using LaTeX or AsciiMathML.
** Yioop comes with two built-in groups: Help and Public. Help's wiki pages allow one to customize the integrated help throughout the Yioop system. The Public group's discussion can be used as a site blog, and its wiki page can be used to customize the look-and-feel of the overall Yioop site without having to do programming.
** Wiki pages support different types such as standard wiki page, page alias, media gallery, and slide presentation.
** Video on wiki pages and in discussion posts is served using HTTP pseudo-streaming so users can scrub through video files. For uploaded video files below a configurable size limit, videos are automatically converted to web-friendly mp4 and webm formats.
** Wiki pages can be configured to have auto-generated tables of contents, to make use of common headers and footers, and to output meta tags for SEO purposes.
**Users can share their own mixes of crawls that exist in the Yioop system.
**If user accounts are enabled, Yioop has a search tools page on which people can suggest urls to crawl.
**Yioop has three different captcha mechanisms that can be used in account registration and for suggesting urls: a standard graphics-based captcha, a text-based captcha, and a hash-cash-like captcha.
**Password authentication can be configured to either use a standard password-hash-based system, or make use of Fiat-Shamir zero-knowledge authentication.
*'''Search'''
**Yioop supports subsearches geared towards presenting certain kinds of media such as images, video, and news. The list of video and news sites can be configured through the GUI. Yioop has a media_updater process which can be used to automatically update news feeds hourly.
**News feeds can either be RSS or Atom feeds or can be scraped from an HTML page using XPath queries. What image is used for a news feed item can also be configured using XPath queries.
**Yioop determines search results using a number of iterators which can be combined like a simplified relational algebra.
**Yioop can be configured to display word suggestions as a user types a query. It can also suggest spell corrections for mis-typed queries. This feature can be localized.
**Yioop can also make use of a thesaurus facility such as provided by WordNet to suggest related queries.
**Yioop supports the ability to filter out urls from search results after a crawl has been performed. It also has the ability to edit summary information that will be displayed for urls.
**A given Yioop installation might have several saved crawls and it is very quick to switch between any of them and immediately start doing text searches.
**Besides the standard output of a web page with ten links, Yioop can output query results in Open Search RSS format or a JSON variant of this format, and one can also query Yioop data via a function API.
*'''Indexing'''
**Yioop is capable of indexing small sites up to sites or collections of sites containing low hundreds of millions of documents.
**Yioop uses a hybrid inverted index/suffix tree approach for word lookup to make multi-word queries faster on disk bound machines.
**Yioop indexes are positional rather than bag-of-words indexes, and an index compression scheme called Modified9 is used.
**Yioop has a web interface which makes it easy to combine results from several crawl indexes to create unique result presentations. These combinations can be done in a conditional manner using "if:" meta words.
**Yioop supports the indexing of many different filetypes including: HTML, Atom, BMP, DOC, DOCX, ePub, GIF, JPG, PDF, PPT, PPTX, PNG, RSS, RTF, sitemaps, SVG, XLSX, and XML. It has a web interface for controlling which amongst these filetypes (or all of them) you want to index. It also supports attempting to extract information from unknown filetypes.
**Yioop supports extracting data from zipped formats like DOCX even if it only did a partial download of the file.
**Yioop has a simple page rule language for controlling what content should be extracted from a page or record.
**Yioop has two different kinds of text summarizers which can be used to further affect what words are indexed: a basic web page scraper, and a centroid algorithm summarizer. The latter can be used to generate word clouds of crawled documents.
**Indexing occurs as crawling happens, so when a crawl is stopped, it is ready to be used to handle search queries immediately.
**Yioop indexes can be used to create classifiers which then can be used in labeling and ranking future indexes.
**Yioop comes with stemmers for English, French, German, Italian, and Russian, and a word segmenter for Chinese.
It uses char-gramming for other languages. Yioop has a simple architecture for adding stemmers for other languages.
**Yioop uses a web archive file format which makes it easy to copy crawl results amongst different machines. It has a command-line tool for inspecting these archives if they need to be examined in a non-web setting. It also supports command-line search querying of these archives.
**Yioop supports an indexing plugin architecture to make it possible to write one's own indexing modules that do further post-processing.
*'''Web and Archive Crawling'''
**Yioop supports open web crawls, but through its web interface one can also configure it to crawl only specific sites, domains, or collections of sites and domains. One can customize a crawl using regexes in disallowed directives to crawl a site to a fixed depth.
**Yioop uses multi-curl to support many simultaneous downloads of pages.
**Yioop obeys robots.txt files including Google and Bing extensions such as the Crawl-delay and Sitemap directives as well as * and $ in allow and disallow. It further supports the robots meta tag directives NONE, NOINDEX, NOFOLLOW, NOARCHIVE, and NOSNIPPET and the link tag directive rel="canonical". It also supports anchor tags with rel="nofollow" attributes, as well as X-Robots-Tag HTTP headers. Finally, it tries to detect if a robots.txt became a redirect due to congestion.
**Yioop comes with a word indexing plugin which can be used to control how Yioop crawls based on words on the page and the domain. This is useful for creating niche subject-specific indexes.
**Yioop has its own DNS caching mechanism and it adjusts the number of simultaneous downloads it does in one go based on the number of lookups it will need to do.
** Yioop can crawl over the HTTP, HTTPS, and Gopher protocols.
**Yioop supports crawling TOR networks (.onion urls).
**Yioop supports crawling through a list of proxy servers.
**Yioop supports crawling Git repositories and can index Java and Python code.
**Yioop supports crawl quotas for web sites. I.e., one can control the number of urls/hour downloaded from a site.
**Yioop can detect website congestion and slow down crawling a site that it detects as congested.
**Yioop supports dynamically changing the allowed and disallowed sites while a crawl is in progress. Yioop also supports dynamically injecting new seed sites via a web interface into the active crawl.
**Yioop has a web form that allows a user to control the recrawl frequency for a page during a crawl.
**Yioop keeps track of ETag: and Expires: HTTP headers to avoid downloading content it already has in its index.
**Yioop supports importing data from ARC, WARC, database queries, MediaWiki XML, and ODP RDF files. It has a generic importing facility to import text records such as access logs, mail logs, usenet posts, etc., which are either not compressed, or compressed using gzip or bzip2. It also supports re-indexing of data from WebArchives.

[[Documentation#contents|Return to table of contents]].

==Set-up==
===Requirements===

The Yioop search engine requires: (1) a web server, (2) PHP 5.3 or better (Yioop used only to serve search results from a pre-built index has been tested to work in PHP 5.2), (3) Curl libraries for downloading web pages. To be a little more specific, Yioop has been tested with Apache 2.2, and I've been told Version 0.82 or newer works with lighttpd. It should work with other web servers, although it might take some finessing. For PHP, you need a build of PHP that incorporates multi-byte string (mb_ prefixed) functions, Curl, Sqlite (or at least PDO with a Sqlite driver), the GD graphics library, and the command-line interface. If you are using Mac OSX Snow Leopard or newer, the versions of Apache2 and PHP that come with it suffice.
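One quick way to verify that your PHP build meets these requirements is to check for the needed extensions from the command-line interface; the short script below is an illustrative helper, not part of Yioop:

```php
<?php
// Checks that the current PHP build has the extensions Yioop relies on:
// multi-byte strings, Curl, Sqlite via PDO, and the GD graphics library.
// Run from the command line: php check_requirements.php
function missing_extensions(): array
{
    $needed = ["mbstring", "curl", "pdo_sqlite", "gd"];
    $missing = [];
    foreach ($needed as $ext) {
        if (!extension_loaded($ext)) {
            $missing[] = $ext;
        }
    }
    return $missing;
}
$missing = missing_extensions();
echo $missing ? "Missing: " . implode(", ", $missing) . "\n"
    : "All required extensions present\n";
```

If anything is reported missing, install or enable the corresponding extension as described in the platform-specific notes below.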
For Windows, Mac, and Linux, another easy way to get the required software is to download an Apache/PHP/MySQL suite such as [[http://www.apachefriends.org/en/xampp.html|XAMPP]]. On Windows machines, find the php.ini file under the php folder in your Xampp folder and change the line:
 ;extension=php_curl.dll
to
 extension=php_curl.dll
The php.ini file has a post_max_size setting you might want to change. You might want to change it to:
 post_max_size = 32M
Yioop will work with post_max_size set to as little as two megabytes, but will be faster with the larger post capacity. If you intend to make use of Yioop Discussion Groups and Wikis and their ability to upload documents, you might want to consider also adjusting the value of the variable ''upload_max_filesize''. This value should be set to at most what you set post_max_size to.

If you are using WAMP, similar changes as with XAMPP must be made, but be aware that WAMP has two php.ini files and both of these must be changed.

If you are using the Ubuntu-variant of Linux, the following lines would get the software you need:
 sudo apt-get install curl
 sudo apt-get install apache2
 sudo apt-get install php5
 sudo apt-get install php5-cli
 sudo apt-get install php5-sqlite
 sudo apt-get install php5-curl
 sudo apt-get install php5-gd
For both Mac and Linux, you might want to alter the post_max_size variable in your php.ini file as in the Windows case above.

In addition to the minimum installation requirements above, if you want to use the [[Documentation#GUI%20for%20Managing%20Machines%20and%20Servers|Manage Machines]] feature in Yioop, you might need to do some additional configuration. The Manage Machines activity allows you, through a web interface, to start/stop and look at the log files for each of the queue_servers and fetchers that you want Yioop to manage. If it is not configured, then these tasks would need to be done via the command line.
'''Also, if you do not use the Manage Machines interface, your Yioop site can make use of only one queue_server.'''

As a final step, after installing the necessary software, '''make sure to start/restart your web server and verify that it is running.'''

====Memory Requirements====

In addition to the prerequisite software listed above, Yioop specifies for its processes certain upper bounds on the amounts of memory they can use. By default, bin/queue_server.php's limit is set to 2500MB, and bin/fetcher.php's limit is set to 1200MB. You can expect that index.php might need up to 500MB. These values are set near the top of each of these files with a line like:
 ini_set("memory_limit","2500M");
For the index.php file, you may need to set the limit as well in your php.ini file for the instance of PHP used by your web server. If the value is too low for the index.php web app, you might see messages in the Fetcher logs that begin with: "Trouble sending to the scheduler at url..."

Often in a VM setting these requirements are somewhat steep. It is possible to get Yioop to work in environments like EC2 (be aware this might violate your service agreement). To reduce these memory requirements, one can manually adjust the variables NUM_DOCS_PER_GENERATION, SEEN_URLS_BEFORE_UPDATE_SCHEDULER, NUM_URLS_QUEUE_RAM, MAX_FETCH_SIZE, and URL_FILTER_SIZE in the configs/config.php file. Experimenting with these values, you should be able to trade off memory requirements for speed.

[[Documentation#contents|Return to table of contents]].

===Installation and Configuration===

The Yioop application can be obtained using [[Download|the download page at seekquarry.com]]. After downloading and unzipping it, move the Yioop search engine into some folder under your web server's document root. Yioop makes use of an auxiliary folder to store profile/crawl data. Before Yioop will run, you must configure this directory.
This can be done in one of two ways: either through the web interface (the preferred way), as we will now describe, or by using the configs/configure_tool.php script (which is harder, but might be suitable for some VPS settings), which will be described in the [[Documentation#Yioop%20Command-line%20Tools|command line tools section]]. From the web interface, to configure this directory, point your web browser to where your Yioop folder is located; a configuration page should appear and let you set the path to the auxiliary folder (Search Engine Work Directory). This page looks like:
{{class="docs"
((resource:Documentation:ConfigureScreenForm1.png|The work directory form))
}}
For this step, as a security precaution, you must connect via localhost. If you are in a web hosting environment (for example, if you are using cPanel to set up Yioop) where it is difficult to connect using localhost, you can add a file, configs/local_config.php, with the following content:
 <?php
 define('NO_LOCAL_CHECK', 'true');
 ?>
Returning to our installation discussion, notice that under the text field there is a heading "Component Check" with red text under it. This section is used to indicate any requirements that Yioop has that might not yet be met on your machine. In the case above, the web server needs permissions on the file configs/config.php to write in the value of the directory you choose in the form for the Work Directory. Another common message asks you to make sure the web server has permissions on the place where this auxiliary folder needs to be created. When filling out the form on this page, on both *nix-like and Windows machines, you should use forward slashes for the folder location. For example,

 /Applications/XAMPP/xamppfiles/htdocs #Mac, Linux system similar
 c:/xampp/htdocs/yioop_data #Windows

Once you have set the folder, you should see a second Profile Settings form beneath the Search Engine Work Directory form.
If you are asked to sign in before this, and you have not previously created accounts in this Work Directory, then the default account has login root and an empty password. Once you see it, the Profile Settings form allows you to configure the debug, search access, database, queue server, and robot settings. It will look something like:

{{class="docs"
((resource:Documentation:ConfigureScreenForm2.png|Basic configure form))
}}

These settings suffice if you are only doing single machine crawling. The '''Crawl Robot Set-up''' fieldset is used to provide websites that you crawl with information about who is crawling them. The field Crawl Robot Name is used to give the name of your robot. You should choose a common name for all of the fetchers in your set-up, but the name should be unique to your web-site. It is bad form to pretend to be someone else's robot, for example, the googlebot. As Yioop crawls, it sends each web-site it crawls a User-Agent string; this string contains the url back to the bot.php file in the Yioop folder. bot.php is supposed to provide a detailed description of your robot. The contents of the textarea Robot Description are supposed to provide this description and are inserted between the <body> </body> tags on the bot.php page.

You might need to click {{id="advance" '''Toggle Advance Settings'''}} if you are doing Yioop development or if you are crawling in a multi-machine setting. The advanced settings look like:

{{class="docs"
((resource:Documentation:ConfigureScreenForm3.png|Advanced configure form))
}}

The '''Debug Display''' fieldset has three checkboxes: Error Info, Query Info, and Test Info. Checking Error Info will mean that when the Yioop web app runs, any PHP Errors, Warnings, or Notices will be displayed on web pages. This is useful if you need to do debugging, but should not be set in a production environment. The second checkbox, Query Info, when checked, will cause statistics about the time, etc.
of database queries to be displayed at the bottom of each web page. The last checkbox, Test Info, says whether or not to display automated tests of some of the system's library classes if the browser is navigated to http://YIOOP_INSTALLATION/tests/. None of these debug settings should be checked in a production environment.

The '''Search Access''' fieldset has three checkboxes: Web, RSS, and API. These control whether a user can use the web interface to get query results, whether RSS responses to queries are permitted, and whether or not the function-based search API is available. Using the Web Search interface and formatting a query url to get an RSS response are described in the Yioop [[Documentation#Search%20and%20User%20Interface|Search and User Interface]] section. The Yioop Search Function API is described in the section [[Documentation#Embedding%20Yioop%20in%20an%20Existing%20Site|Embedding Yioop]]; you can also look in the examples folder at the file search_api.php to see an example of how to use it. '''If you intend to use Yioop in a configuration with multiple queue servers (not fetchers), then the RSS checkbox needs to be checked.'''

The '''Site Customizations''' fieldset lets you configure the overall look and feel of a Yioop instance. The '''Use Wiki Public Main Page as Landing Page''' checkbox lets you set the main page of the Public wiki to be the landing page of the whole Yioop site rather than the default centered search box landing page. Several of the text fields in Site Customizations control various colors used in drawing the Yioop interface. These include '''Background Color''', '''Foreground Color''', '''Top Bar Color''', and '''Side Bar Color'''. The values of these fields can be any legitimate style-sheet color, such as a # followed by red, green, and blue hex values (digits 0-9, A-F), or a color word such as yellow, cyan, etc.
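As a concrete illustration, any of the following would be a legitimate value for one of these color fields (the particular colors chosen here are arbitrary examples, not defaults):

```css
#2E8B57   /* hex notation: two hex digits each for red, green, blue */
#F80      /* shorthand hex: one hex digit each for red, green, blue */
seagreen  /* a standard CSS color word */
```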
If you would like to use a background image, you can either use the picker link or drag and drop one into the rounded square next to the '''Background Image''' label. Various other images such as the '''Site Logo''', '''Mobile Logo''' (the logo used for mobile devices), and '''Favicon''' (the little logo that appears in the title tab of a page or in the url bar) can similarly be chosen or dragged-and-dropped.

A '''Search Toolbar''' is a short file that can be used to add your search engine to the search bar of a browser. You can drag such a file into the gray area next to this label and click save to set this for your site. The link to install the search bar is visible on the Settings page. There is also a link tag on every page of the Yioop site that allows a browser to auto-discover this as well. As a starting point, one can try tweaking the default Yioop search bar, yioopbar.xml, in the base folder of Yioop.

The three fields '''Timezone''', '''Web Cookie Name''', and '''Web Token Name''' control, respectively, the timezone used by Yioop when it does date conversions, the name of the cookie it sets in a browser's cookie cache, and the name used for the tokens that prevent cross-site request forgery and appear in Yioop URLs when one is logged in.

Finally, if one knows cascading stylesheets (CSS) and wants greater control of the look and feel of the site, then one can enter standard stylesheet commands in the '''Auxiliary Style Directives''' textarea.

===Optional Server and Security Configurations===
The configuration activity just described suffices to set up Yioop for a single server crawl.
If that is what you are interested in, you may want to skip ahead to the section on the [[Documentation#Search%20and%20User%20Interface|Yioop Search Interface]] to learn about the different search features available in Yioop, or you may want to skip ahead to [[Documentation#Performing%20and%20Managing%20Crawls|Performing and Managing Crawls]] to learn how to perform a crawl. In this section, we describe the Server Settings and Security activities, which might be useful in a multi-machine, multi-user setting and which might also be useful for crawling hidden websites or crawling through proxies.

The Server Settings activity looks like:

{{class="docs"
((resource:Documentation:ServerSettings.png|The Server Settings Activity))
}}

The '''Name Server Set-up''' fieldset is used to tell Yioop which machine is going to act as a name server during a crawl and what secret string to use to make sure that communication is being done between legitimate queue_servers and fetchers of your installation. You can choose anything for your secret string as long as you use the same string amongst all of the machines in your Yioop installation. The reason why you have to set the name server url is that each machine that is going to run a fetcher to download web pages needs to know who the queue servers are so they can request a batch of urls to download. There are a few different ways this can be set up:

#If the particular instance of Yioop is only being used to display search results from crawls that you have already done, then this fieldset can be filled in however you want.
#If you are doing crawling on only one machine, you can put http://localhost/path_to_yioop/ or http://127.0.0.1/path_to_yioop/, where you appropriately modify "path_to_yioop".
#Otherwise, if you are doing a crawl on multiple machines, use the url of Yioop on the machine that will act as the name server.

In communicating between the fetcher and the server, Yioop uses curl.
Curl can be particular about redirects in the case where posted data is involved; i.e., if a redirect happens, it does not send posted data to the redirected site. For this reason, Yioop insists on a trailing slash on your queue server url. Beneath the Queue Server Url field is a Memcached checkbox and a Filecache checkbox. Only one of these can be checked at a time. The Memcached checkbox only shows if you have [[http://php.net/manual/en/book.memcache.php|PHP Memcache]] installed. Checking the Memcached checkbox allows you to specify memcached servers that, if specified, will be used to cache in memory search query results as well as index pages that have been accessed. Checking the Filecache box tells Yioop to cache search query results in temporary files. Memcache probably gives a better performance boost than Filecaching, but not all hosting environments have Memcache available.

The '''Database Set-up''' fieldset is used to specify what database management system should be used, how it should be connected to, and what user name and password should be used for the connection. At present, [[http://www.php.net/manual/en/intro.pdo.php|PDO]] (PHP's generic DBMS interface), sqlite3, and Mysql databases are supported. The database is used to store information about what users are allowed to use the admin panel and what activities and roles these users have. Unlike many database systems, if an sqlite3 database is being used then the connection is always a file on the current filesystem and there is no notion of login and password, so in this case only the name of the database is asked for. For sqlite, the database is stored in WORK_DIRECTORY/data. For single user settings with a limited number of news feeds, sqlite is probably the most convenient database system to use with Yioop. If you think you are going to make use of Yioop's social functionality and have many users, feeds, and crawl mixes, using a system like Mysql or Postgres might be more appropriate.

If you would like to use a different DBMS than Sqlite or Mysql, then the easiest way is to select PDO as the Database System and for the Hostname given use the DSN with the appropriate DBMS driver. For example, for Postgres one might have something like:
 pgsql:host=localhost;port=5432;dbname=test;user=bruce;password=mypass
You can put the username and password either in the DSN or in the Username and Password fields. The database name field must be filled in with the name of the database you want to connect to. It also needs to be included in the DSN, as in the example above. PDO and Yioop have been tested to work with Postgres and sqlite; for other DBMS's it might take some tinkering to get things to work.

When switching database information, Yioop checks first if a usable database with the user supplied data exists. If it does, then it uses it; otherwise, it tries to create a new database. Yioop comes with a small sqlite demo database in the data directory and this is used to populate the installation database in this case. This database has one account, root, with no password, which has privileges on all activities. Since different databases associated with a Yioop installation might have different user accounts set up, after changing database information you might have to sign in again.

The '''Account Registration''' fieldset is used to control how users can obtain accounts on a Yioop installation. The dropdown at the start of this fieldset allows one to select one of four possibilities: Disable Registration, users cannot register themselves, only the root account can add users; No Activation, user accounts are immediately activated once a user signs up; Email Activation, after registering, users must click on a link which comes in a separate email to activate their accounts; and Admin Activation, after registering, an admin account must activate the user before the user is allowed to use their account.
When Disable Registration is selected, the Suggest A Url form and link on the tool.php page is disabled as well; for all other registration types this link is enabled. If Email Activation is chosen, then the rest of this fieldset can be used to specify the email address from which the activation email is sent to the user. The checkbox Use PHP mail() function controls whether to use the mail function in PHP to send the mail; this only works if mail can be sent from the local machine. Alternatively, if this is not checked, as in the image above, one can configure an outgoing SMTP server to send the email through.

The '''Proxy Server''' fieldset is used to control which proxies to use while crawling. By default, Yioop does not use any proxies while crawling. A Tor Proxy can serve as a gateway to the Tor Network. Yioop can use this proxy to download .onion URLs on the [[https://en.wikipedia.org/wiki/Tor_%28anonymity_network%29|Tor network]]. The configuration given in the example above works with the Tor Proxy that comes with the [[https://www.torproject.org/projects/torbrowser.html|Tor Browser]]. This proxy needs to be running, though, for Yioop to make use of it. Beneath the Tor Proxy input field is a checkbox labelled Crawl via Proxies. Checking this box will reveal a textarea labelled Proxy Servers. You can enter the address:port or address:port:proxytype of proxy servers you would like to crawl through. If proxy servers are used, Yioop will make any requests to download pages to a randomly chosen server on the list, which will proxy the request to the site which has the page to download. To some degree this can make the download site think the request is coming from a different IP (and potentially location) than it actually is. In practice, servers can often use HTTP headers to guess that a proxy is being used.
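For example, the Proxy Servers textarea might be filled in with one proxy per line, in either of the two formats just described (the addresses and the SOCKS5 proxy type below are made-up illustrations, not defaults):

```text
192.0.2.10:8080
198.51.100.7:3128
203.0.113.5:9050:SOCKS5
```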

The '''Ad Server Configuration''' fieldset can be used to specify advertising scripts (such as Google Ad Words, Bidvertiser, Adspeed, etc.) which are to be added on search result pages or on discussion thread pages. There are four possible placements of ads: None -- don't display advertising at all; Top -- display banner ads beneath the search bar but above search results; Side -- display skyscraper ads in a column beside the search results; and Both -- display both banner and skyscraper ads. Choosing any option other than None reveals text areas where one can insert the Javascript one would get from the ad network. The '''Global Ad Script''' text area is used for any Javascript or HTML the ad provider wants you to include in the HTML head tag for the web page (many advertisers don't need this).

The Security activity looks like:

{{class="docs"
((resource:Documentation:Security.png|The Security Activity))
}}

The '''Authentication Type''' fieldset is used to control the protocol used to log people into Yioop. This can either be Normal Authentication, where passwords are checked against stored salted hashes of the password; or ZKP (zero knowledge protocol) authentication, where the server picks challenges at random and sends these to the browser the person is logging in from, and the browser computes, based on the password, an appropriate response according to the [[https://en.wikipedia.org/wiki/Feige%E2%80%93Fiat%E2%80%93Shamir_identification_scheme|Fiat Shamir]] protocol. The password is never sent over the internet and is not stored on the server. These are the main advantages of ZKP; its drawback is that it is slower than Normal Authentication, as proving who you are with a low probability of error requires several browser-server exchanges. You should choose which authentication scheme you want before you create many users, since if you switch, everyone will need to get a new password.
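To give a feel for how such a zero knowledge protocol works, below is a minimal sketch of a Feige-Fiat-Shamir style challenge-response round, with toy numbers. This illustrates only the general idea; the tiny modulus, the secret value, and the round count are made-up examples and are not Yioop's actual implementation.

```python
import random

# Toy public modulus n = p*q; real implementations use primes
# hundreds of digits long, these small values are just for illustration.
p, q = 547, 719
n = p * q

secret = 2021              # prover's password-derived secret s (kept private)
v = (secret * secret) % n  # public value v = s^2 mod n, stored server side

def prove_one_round():
    """One challenge-response round; returns True if the verifier accepts."""
    r = random.randrange(1, n)        # prover picks a random commitment
    x = (r * r) % n                   # prover sends x = r^2 mod n
    e = random.randint(0, 1)          # verifier sends a random challenge bit
    y = (r * pow(secret, e, n)) % n   # prover responds y = r * s^e mod n
    return (y * y) % n == (x * pow(v, e, n)) % n  # verifier checks y^2 = x*v^e

# An honest prover passes every round; an impostor passes each round
# with probability about 1/2, which is why many rounds are exchanged.
assert all(prove_one_round() for _ in range(20))
```

An impostor who does not know the secret can prepare for only one of the two possible challenge bits in advance, so each additional round roughly halves the chance of fooling the verifier; this is the source of the extra browser-server exchanges mentioned above.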

The '''Captcha Type''' fieldset controls what kind of [[https://en.wikipedia.org/wiki/Captcha|captcha]] will be used during account registration, password recovery, and when a user wants to suggest a url. The captcha type only has an effect if, under the Server Settings activity, Account Registration is not set to Disable Registration. The choices for captcha are: Text Captcha, the user has to select from a series of dropdown answers to questions of the form: Which in the following list is the most/largest/etc.? or Which in the following list is the least/smallest/etc.?; Graphic Captcha, the user needs to enter a sequence of characters from a distorted image; and Hash Captcha, the user's browser (the user doesn't need to do anything) needs to extend a random string with additional characters to get a string whose hash begins with a certain lead set of characters. Of these, Hash Captcha is probably the least intrusive, but it requires Javascript and might run slowly on older browsers. A text captcha might be used to test domain expertise of the people who are registering for an account. Finally, the graphic captcha is probably the one people are most familiar with.

The Captcha and Recovery Questions section of the Security activity provides links to edit the Text Captcha and Recovery Questions for the current locale (you can change the current locale in Settings). In both cases, there is a fixed list of tests you can localize. A single test consists of a more question, a less question, and a comma-separated list of possibilities. For example,
 Which lives or lasts the longest?
 Which lives or lasts the shortest?
 lightning,bacteria,ant,dog,horse,person,oak tree,planet,star,galaxy
When challenging a user, Yioop picks a subset of tests. For each test, it randomly chooses between the more and the less question. It then picks a subset of the ordered list of choices, randomly permutes them, and presents them to the user in a dropdown.
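Returning to the Hash Captcha option described above, it is essentially a small proof-of-work: the browser extends the server's random string until the hash of the combined string begins with a required prefix. A rough sketch of the idea (the use of sha1 and the two-zero prefix here are assumptions for illustration, not Yioop's exact parameters):

```python
import hashlib
import itertools

def solve_hash_captcha(challenge, prefix="00"):
    """Browser side: extend challenge with extra characters until the
    hex digest of the combined string begins with the required prefix."""
    for counter in itertools.count():
        attempt = challenge + str(counter)
        if hashlib.sha1(attempt.encode()).hexdigest().startswith(prefix):
            return attempt

def check_hash_captcha(answer, challenge, prefix="00"):
    """Server side: a single hash suffices to verify the work was done."""
    return (answer.startswith(challenge) and
            hashlib.sha1(answer.encode()).hexdigest().startswith(prefix))

answer = solve_hash_captcha("a random server string")
assert check_hash_captcha(answer, "a random server string")
```

The server verifies an answer with one hash computation, while producing an answer takes many attempts on average; lengthening the required prefix increases the browser's work, which is why this scheme can run slowly on older browsers.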

Yioop's captcha system tries to prevent attacks where a machine quickly tries several possible answers to a captcha. Yioop has an IP-address-based timeout system (implemented in models/visitor_model.php). Initially, a timeout of one second between requests involving a captcha is in place. An error screen shows up if multiple requests from the same IP address for a captcha page are made within the timeout period. Every mistaken entry of a captcha doubles this timeout period. The timeout period for an IP address is reset on a daily basis back to one second.

[[Documentation#contents|Return to table of contents]].

===Upgrading Yioop===

If you have an older version of Yioop that you would like to upgrade, make sure to back up your data. Then download the latest version of Yioop and unzip it to the location you would like. Set the Search Engine Work Directory by the same method and to the same value as your old Yioop installation. See the Installation section above for instructions on this, if you have forgotten how you did this. Knowing the old Work Directory location should allow Yioop to complete the upgrade process, or instruct you how to complete it.

[[Documentation#contents|Return to table of contents]].

===Summary of Files and Folders===

The Yioop search engine consists of three main scripts:

;'''bin/fetcher.php''': Used to download batches of urls provided by the queue_server.
;'''bin/queue_server.php''': Maintains a queue of urls that are going to be scheduled to be seen. It also keeps track of what has been seen and robots.txt info. Its last responsibility is to create the index_archive that is used by the search front end.
;'''index.php''': Acts as the search engine web page. It is also used to handle message passing between the fetchers (multiple machines can act as fetchers) and the queue_server.

The file index.php is used when you browse to an installation of a Yioop website.
The description of how to use a Yioop web site is given in the sections starting from The Yioop User Interface section. The files fetcher.php and queue_server.php are only connected with crawling the web. If one already has a stored crawl of the web, then these programs no longer need to be run or used. For instance, you might obtain a crawl of the web on your home machine and upload the crawl to an instance of Yioop at the ISP hosting your website. This website could serve search results without making use of either fetcher.php or queue_server.php. To perform a web crawl, however, you need to use both of these programs as well as the Yioop web site. This is explained in detail in the section on [[Documentation#Performing%20and%20Managing%20Crawls|Performing and Managing Crawls]].

The Yioop folder itself consists of several files and sub-folders. The file index.php, as mentioned above, is the main entry point into the Yioop web application. yioopbar.xml is the xml file specifying how to access Yioop as an Open Search Plugin. favicon.ico is used to display the little icon in the url bar of a browser when someone browses to the Yioop site. A URL to the file bot.php is given by the Yioop robot as it crawls websites so that website owners can find out information about who is crawling their sites. Here is a rough guide to what the Yioop folder's various sub-folders contain:

;'''bin''': This folder is intended to hold command-line scripts and daemons which are used in conjunction with Yioop. In addition to the fetcher.php and queue_server.php scripts already mentioned, it contains: '''arc_tool.php''', '''classifier_tool.php''', '''classifier_trainer.php''', '''code_tool.php''', '''mirror.php''', '''media_updater.php''', and '''query_tool.php'''. arc_tool.php can be used to examine the contents of WebArchiveBundles and IndexArchiveBundles from the command line.
classifier_tool.php is a command line tool for creating a classifier; it can be used to perform some of the tasks that can also be done through the [[Documentation#Classifying Web Pages|Web Classifier Interface]]. classifier_trainer.php is a daemon used in the finalization stage of building a classifier. code_tool.php is for use by developers to maintain the Yioop code-base in various ways. mirror.php can be used if you would like to create a mirror/copy of a Yioop installation. media_updater.php can be used to do hourly updates of news feed search sources in Yioop. It also does video conversions of video files into web formats. Finally, query_tool.php can be used to run queries from the command-line.
;{{id="configs" '''configs'''}} : This folder contains configuration files. You will probably not need to edit any of these files directly, as you can set the most common configuration settings from within the admin panel of Yioop. The file '''config.php''' controls a number of parameters about how data is stored, how, and how often, the queue_server and fetchers communicate, and which file types are supported by Yioop. '''configure_tool.php''' is a command-line tool which can perform some of the configurations needed to get a Yioop installation running. It is only necessary in some virtual private server settings -- the preferred way to configure Yioop is through the web interface. '''createdb.php''' can be used to create a bare instance of the Yioop database with a root admin user having no password. This script is not strictly necessary, as the database should be creatable via the admin panel; however, it can be useful if the database isn't working for some reason. createdb.php includes the file '''public_help_pages.php''' from WORK_DIRECTORY/app/configs if present, or from BASE_DIR/configs if not. This file contains the initial rows for the Public and Help group wikis.
When upgrading, it is useful to export the changes you have made to these wikis to WORK_DIRECTORY/app/configs/public_help_pages.php. This can be done by running the file '''export_public_help_db.php''' which is in the configs folder.
Also in the configs folder is the file default_crawl.ini. This file is copied to WORK_DIRECTORY after you set this folder in the admin/configure panel. There it is renamed as '''crawl.ini''' and serves as the initial set of sites to crawl until you decide to change these. The file '''token_tool.php''' is a tool which can be used to help in term extraction during crawls and for making tries which can be used for word suggestions for a locale. To help word extraction, this tool can generate in a locale folder (see below) a word bloom filter. This filter can be used to segment strings into words for languages such as Chinese that don't use spaces to separate words in sentences. For trie and segmenter filter construction, this tool can use a file that lists words one per line.
;'''controllers''': The controllers folder contains all the controller classes used by the web component of the Yioop search engine. Most requests coming into Yioop go through the top level index.php file. The query string (the component of the url after the ?) then says who is responsible for handling the request. In this query string there is a part which reads c= ... This says which controller should be used. The controller uses the rest of the query string, such as the a= variable for the activity function to call and the arg= variable, to determine which data must be retrieved from which models, and finally which view, with what elements on it, should be displayed back to the user. Within the controllers folder is a sub-folder components; a component is a collection of activities which may be added to a controller so that it can handle a request.
;'''css''': This folder contains the stylesheets used to control how web page tags should look for the Yioop site when rendered in a browser.
;'''data''': This folder contains a default sqlite database for a new Yioop installation. Whenever the WORK_DIRECTORY is changed, it is this database which is initially copied into the WORK_DIRECTORY to serve as the database of allowed users for the Yioop system.
;'''examples''': This folder contains a file search_api.php whose code gives an example of how to use the Yioop search function api.
;'''lib''': This folder is short for library. It contains all the common classes for things like indexing, storing data to files, parsing urls, etc. lib contains six subfolders: ''archive_bundle_iterators'', ''classifiers'', ''compressors'', ''index_bundle_iterators'', ''indexing_plugins'', and ''processors''. The ''archive_bundle_iterators'' folder has iterators for iterating over the objects of various kinds of web archive file formats, such as arc, wiki-media, etc. These iterators are used to iterate over such archives during a recrawl. The ''classifiers'' folder contains code for training classifiers used by Yioop. The ''compressors'' folder contains classes that might be used to compress objects in a web_archive. The ''index_bundle_iterators'' folder contains a variety of iterators useful for iterating over lists of documents which might be returned during a query to the search engine. The ''processors'' folder contains processors to extract page summaries for a variety of different mimetypes.
;'''locale''': This folder contains the default locale data which comes with the Yioop system. A locale encapsulates data associated with a language and region. A locale is specified by an [[http://en.wikipedia.org/wiki/IANA_language_tag|IETF language tag]]. So, for instance, within the locale folder there is a folder en-US for the locale consisting of English in the United States.
Within a given locale tag folder there is a file configure.ini which contains translations of string ids to strings in the language of the locale. This approach is the same idea as used in [[http://en.wikipedia.org/wiki/Gettext|Gettext]] .po files. Yioop's approach does not require a compilation step nor a restart of the webserver for translations to appear. On the other hand, it is slower than the Gettext approach, but this could be easily mitigated using a memory cache such as [[http://memcached.org/|memcached]] or apc. Besides the file configure.ini, there is a statistics.txt file which has info about what percentage of the ids have been translated. In addition to configure.ini and statistics.txt, the locale folder for a language contains two sub-folders: pages, containing static html (with extension .thtml) files which might need to be translated, and resources. The resources folder contains the files: ''locale.js'', which contains locale-specific Javascript code such as the variable alpha, which is used to list out the letters in the alphabet for the language in question for spell check purposes, and roman_array, for mapping between the roman alphabet and the character system of the locale in question; ''suggest-trie.txt.gz'', a Trie data structure used for search bar word suggestions; and ''tokenizer.php'', which can specify the number of characters for this language to constitute a char gram, and might contain a segmenter to split strings into words for this language, a stemmer class used to stem terms for this language, a stopword remover for the centroid summarizer, a part of speech tagger, or a thesaurus lookup procedure for the locale.
;'''models''': This folder contains the subclasses of Model used by Yioop. Models are used to encapsulate access to secondary storage, i.e., accesses to databases or the filesystem.
They are responsible for marshalling/de-marshalling objects that might be stored in more than one table or across several files. The models folder has within it a datasources folder. A datasource is an abstraction layer for the particular filesystem and database system that is being used by a Yioop installation. At present, datasources have been defined for PDO (PHP's generic DBMS interface), sqlite3, and mysql databases.
;'''resources''': Used to store binary resources such as graphics, video, or audio. For now, it just stores the Yioop logo.
;'''scripts''': This folder contains the Javascript files used by Yioop.
;'''tests''': This folder contains UnitTests and JavascriptUnitTests for various lib and script components. Yioop comes with its own minimal UnitTest and JavascriptUnitTest classes, which are defined in lib/unit_test.php and lib/javascript_unit_test.php. It also contains a few files used for experiments. For example, string_cat_experiment.php was used to test which was the faster way to do string concatenation in PHP. many_user_experiment.php can be used to create a test Yioop installation with many users, roles, and groups. Some unit testing of the wiki Help system makes use of [[http://phantomjs.org/|PhantomJS]]. If PhantomJS is not configured, these tests will be skipped. To configure PhantomJS, you simply add a define for your path to PhantomJS to your local_config.php file. For example, one might add the define:
 define("PHANTOM_JS", "/usr/local/bin/phantomjs");
;'''views''': This folder contains View subclasses as well as folders for elements, helpers, and layouts. A View is responsible for taking data given to it by a controller and formatting it in a suitable way. Most Views output a web page; however, some of the views responsible for communication between the fetchers and the queue_server output serialized objects. The elements folder contains Element classes which are typically used to output portions of web pages.
For example, the html that allows one to choose an Activity in the Admin portion of the website is rendered by an ActivityElement. The helpers folder contains Helper subclasses. A Helper is used to automate the task of outputting certain kinds of web tags. For instance, the OptionsHelper, when given an array, can be used to output select tags and option tags using data from the array. The layouts folder contains Layout subclasses. A Layout encapsulates the header and footer information for the kind of document a View lives on. For example, web pages on the Yioop site all use the WebLayout class as their Layout. The WebLayout class has a render method for outputting the doctype, open html tag, head of the document including links for style sheets, etc. This method then calls the render methods of the current View, and finally outputs scripts and the necessary closing document tags.

In addition to the Yioop application folder, Yioop makes use of a WORK DIRECTORY. The location of this directory is set during the configuration of a Yioop installation. Yioop stores crawls and other data local to a particular Yioop installation in files and folders in this directory. In the event that you upgrade your Yioop installation, you should only need to replace the Yioop application folder and, in the configuration process of Yioop, tell it where your WORK DIRECTORY is. Of course, it is always recommended to back up one's data before performing an upgrade. Within the WORK DIRECTORY, Yioop stores four main files: profile.php, crawl.ini, bot.txt, and robot_table.txt. Here is a rough guide to what the WORK DIRECTORY's sub-folders contain:

;'''app''': This folder is used to contain your overrides to the views, controllers, models, resources, locale etc. For example, if you wanted to change how the search results were rendered, you could add a views/search_view.php file to the app folder and Yioop would use it rather than the one in the Yioop base directory's views folder.
Using the app dir makes it easier to have customizations that won't get messed up when you upgrade Yioop.
;'''cache''': This directory is used to store folders of the form ArchiveUNIX_TIMESTAMP, IndexDataUNIX_TIMESTAMP, and QueueBundleUNIX_TIMESTAMP. ArchiveUNIX_TIMESTAMP folders hold complete caches of web pages that have been crawled. These folders will appear on machines which are running fetcher.php. IndexDataUNIX_TIMESTAMP folders hold a word document index as well as summaries of pages crawled. A folder of this type is needed by the web app portion of Yioop to serve search results. These folders can be moved to whichever machine you want to serve results from. QueueBundleUNIX_TIMESTAMP folders are used to maintain the priority queue during the crawling process. The queue_server.php program is responsible for creating both IndexDataUNIX_TIMESTAMP and QueueBundleUNIX_TIMESTAMP folders.
;'''data''': If an sqlite or sqlite3 (rather than, say, MySQL) database is being used then a seek_quarry.db file is stored in the data folder. In Yioop, the database is used to manage users, roles, locales, and crawls. Data for crawls themselves are NOT stored in the database. Suggest-a-url data is stored in this folder in the file suggest_url.txt, certain cron information about machines is saved in cron_time.txt, and plugin configuration information can also be stored in this folder.
;'''locale''': This is generally a copy of the locale folder mentioned earlier. In fact, it is the version that Yioop will try to use first. It contains any customizations that have been done to locale for this instance of Yioop. If you are using a version of Yioop after Yioop 2.0, this folder has been moved to app/locale.
;'''log''': When the fetcher and queue_server are run as daemon processes, log messages are written to log files in this folder. Log rotation is also done. These log files can be opened in a text editor or console app.
;'''query''': This folder is used to store caches of already performed queries when file caching is being used.
;'''schedules''': This folder has four kinds of subfolders: media_convert, IndexDataUNIX_TIMESTAMP, RobotDataUNIX_TIMESTAMP, and ScheduleDataUNIX_TIMESTAMP. The easiest to explain is the media_convert folder. It is used by media_updater.php to store job information about video files that need to be converted. For the other folders, when a fetcher communicates with the web app to say what it has just crawled, the web app writes data into these folders to be processed later by the queue_server. The UNIX_TIMESTAMP is used to keep track of which crawl the data is destined for. IndexData folders contain mini-inverted indexes (word document records) which are to be added to the global inverted index (called the dictionary) for that crawl. RobotData folders contain information that came from robots.txt files. Finally, ScheduleData folders have data about found urls that could eventually be scheduled to crawl. Within each of these three kinds of folders there are typically many sub-folders, one for each day data arrived, and within these subfolders there are files containing the respective kinds of data.
;'''search_filters''': This folder is used to store text files containing global after-crawl search filter and summary data. The global search filter allows a user to specify, after a crawl is done, that certain urls be removed from the search results. The global summary data can be used to edit the summaries for a small number of web pages whose summaries seem inaccurate or inappropriate. For example, some sites like Facebook only allow big search engines like Google to crawl them. Still, there are many links to Facebook, so Facebook on an open web crawl will appear, but with a somewhat confused summary based only on link text; the results editor allows one to give a meaningful summary for Facebook.
;'''temp''': This is used for storing temporary files that Yioop creates during the crawl process. For example, temporary files used while making thumbnails. Each fetcher has its own temp folder, so you might also see folders 0-temp, 1-temp, etc.

[[Documentation#contents|Return to table of contents]].

==Search and User Interface==

At this point one hopefully has installed Yioop. If you used one of the [[Install|install guides]], you may also have performed a simple crawl. We are now going to describe some of the basic search features of Yioop as well as the Yioop administration interface. We will describe how to perform crawls with Yioop in more detail in the [[Documentation#Crawling%20and%20Customizing%20Results|Crawling and Customizing Results]] chapter. If you do not have a crawl available, you can test some of these features on the [[http://www.yioop.com/|Yioop Demo Site]].

===Search Basics===
The main search form for Yioop looks like:

{{class='docs width-three-quarter'
((resource:Documentation:SearchScreen.png|The Search form))
}}

The HTML for this form is in views/search_view.php and the icon is stored in resources/yioop.png. You may want to modify these to incorporate Yioop search into your site. For more general ways to modify the look of these pages, consult the [[Documentation#Building%20a%20Site%20using%20Yioop%20as%20Framework|Building a site using Yioop]] documentation. The Yioop logo on any screen in the Yioop interface is clickable and returns the user to the main search screen. One performs a search by typing a query into the search form field and clicking the Search button.
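Since the search form submits its query in the q URL parameter (and, as described under Result Formats below, the f parameter selects RSS or JSON output), a site incorporating Yioop search can simply construct the corresponding URLs. A minimal sketch, where the host name is a placeholder for a real installation:

```python
from urllib.parse import urlencode

def yioop_search_url(host, query, out_format=None):
    """Build a Yioop query URL; q holds the search terms and the
    optional f parameter ("rss" or "json") selects a result format."""
    params = {"q": query}
    if out_format:
        params["f"] = out_format
    return "http://" + host + "/?" + urlencode(params)

# "my-yioop-instance-host" stands in for a real Yioop installation
print(yioop_search_url("my-yioop-instance-host", "chris pollett", "json"))
# → http://my-yioop-instance-host/?q=chris+pollett&f=json
```

A request to such a URL could then be issued with any HTTP client, or from browser-side AJAX code.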
As one is typing, Yioop suggests possible queries; you can click one of these suggestions, or use the up and down arrows to select one, to perform that search.

{{class='docs width-three-quarter'
((resource:Documentation:Autosuggest.png|Example suggestions as you type))
}}

For some non-roman alphabet scripts such as Telugu you can enter words by typing how they sound using roman letters and get suggestions in the script in question:

{{class="docs"
((resource:Documentation:TeluguAutosuggest.png|Telugu suggestions for roman text))
}}

The [More Statistics] link only shows if under the Admin control panel you clicked on more statistics for the crawl. This link goes to a page showing many global statistics about the web crawl. Beneath this link are the Blog and Privacy links (as well as a link back to the SeekQuarry site). These two links are to static pages which can be customized through the Manage Locale activity. Typical search results might look like:

{{class="docs"
((resource:Documentation:SearchResults.png|Example Search Results))
}}

Thesaurus results might appear to one side and suggest alternative queries based on a thesaurus look up (for English, this is based on WordNet). The terms next to Words: are a word cloud of important terms in the document. These are created if the indexer uses the centroid summarizer. Hovering over the Score of a search result reveals its component scores. These might include: Rank, Relevance, Proximity, as well as any Use to Rank Classifier scores and WordNet scores (if installed).
{{class="docs"
((resource:Documentation:ScoreToolTip.png|Example Score Components Tool Tip))
}}

If one slightly mistypes a query term, Yioop can sometimes suggest a spelling correction:

{{class="docs"
((resource:Documentation:SearchSpellCorrect.png|Example Search Results with a spelling correction))
}}

Each result back from the query consists of several parts: First comes a title, which is a link to the page that matches the query term. This is followed by a brief summary of that page with the query words in bold. Then the document rank, relevancy, proximity, and overall scores are listed. Each of these numbers is a grouped statistic -- several "micro index entries" are grouped together/summed to create each. So even though a given "micro index entry" might have a document rank between 1 and 10, their sum could be a larger value. Further, the overall score is a generalized inner product of the scores of the "micro index entries", so the separated scores will not typically sum to the overall score. After these scores there are three links: Cached, Similar, and Inlinks. Clicking on Cached will display Yioop's downloaded copy of the page in question. We will describe this in more detail in a moment. Clicking on Similar causes Yioop to locate the five words with the highest relevancy scores for that document and then to perform a search on those words. Clicking on Inlinks will take you to a page consisting of all the links that Yioop found to the document in question. Finally, clicking on an IP address link returns all documents that were crawled from that IP address.

{{class="docs"
((resource:Documentation:Cache.png|Example Cache Results))
}}

As the above illustrates, on a cache link click, Yioop will display a cached version of the page. The cached version has a link to the original version and the download time at the top. Next there is a link to display all caches of this page that Yioop has in any index.
This is followed by a link for extracted summaries; then, in the body of the cached document, the query terms are highlighted. Links within the body of a cached document first target a cached version of the page that is linked to which is as near in the future of the current cached page as possible. If Yioop doesn't have a cache for a link target then it goes to the location pointed to by that target. Clicking on the history toggle produces the following interface:

{{class="docs"
((resource:Documentation:CacheHistory.png|Example Cache History UI))
}}

This lets you select different caches of the page in question.

Clicking the "Toggle extracted summary" link will show the title, summary, and links that were extracted from the full page and indexed. No other terms on the page are used to locate the page via a search query. This can be viewed as an "SEO" view of the page.

{{class="docs"
((resource:Documentation:CacheSEO.png|Example Cache SEO Results))
}}

It should be noted that cached copies of web pages are stored on the fetcher which originally downloaded the page. The IndexArchive associated with a crawl is stored on the queue server and can be moved around to any location by simply moving the folder. However, if an archive is moved off the network on which the fetcher lives, then the look up of a cached page might fail.

In addition to a straightforward web search, one can also do image, video, and news searches by clicking on the Images, Video, or News links in the top bar of Yioop search pages. Below are some examples of what these look like for a search on "Obama":

{{class="docs"
((resource:Documentation:ImageSearch.png|Example Image Search Results))
((resource:Documentation:VideoSearch.png|Example Video Search Results))
((resource:Documentation:NewsSearch.png|Example News Search Results))
}}

When Yioop crawls a page it adds one of the following meta words to the page: media:text, media:image, or media:video.
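These meta words behave like ordinary query terms, so restricting a query to one media kind just means appending the right term to the user's query. A minimal sketch (the helper function is hypothetical; the media kinds are the ones named in the text, plus media:news, which is attached to feed items):

```python
def restrict_to_media(query, kind):
    """Append a media: meta word to a conjunctive query so only
    results of that media kind are returned."""
    allowed = {"text", "image", "video", "news"}
    if kind not in allowed:
        raise ValueError("unknown media kind: " + kind)
    return query + " media:" + kind

print(restrict_to_media("obama", "image"))  # → obama media:image
```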
RSS (or Atom) feed sources that have been added to Media Sources under the [[Documentation#Search%20Sources|Search Sources]] activity are downloaded each hour. Each RSS item on such a downloaded page has the meta word media:news added to it. A usual web search just takes the search terms provided to perform a search. An Images, Video, or News search tacks on to the search terms media:image, media:video, or media:news, respectively. Detection of images is done via mimetype at initial page download time. At this time a thumbnail is generated. When search results are presented it is this cached thumbnail that is shown. So image search does not leak information to third party sites. On any search results page with images, Yioop tries to group the images into a thumbnail strip. This is true of both normal and image search result pages. In the case of image search result pages, except for not-yet-downloaded pages, this results in almost all of the results being the thumbnail strip. Video page detection is not done through mimetype as popular sites like YouTube, Vimeo, and others vary in how they use Flash or video tags to embed video on a web page. Yioop uses the Video Media sources that have been added in the Search Sources activity to detect whether a link is in the format of a video page. To get a thumbnail for the video it again uses the method for rewriting the video url to an image link specified for the particular site in question in Search Sources. I.e., the thumbnail will be downloaded from the original site. '''This could leak information to third party sites about your search.'''

The format of News search results is somewhat different from usual search results. News search results can appear during a normal web search, in which case they will appear clustered together, with a leading link "News results for ...".
No snippets will be shown for these links, but the original media source for the link will be displayed, along with the time at which the item first appeared. On the News subsearch page, underneath the link to the item, the complete RSS description of the news item is displayed. In both settings, it is possible to click on the media source name next to the news item link. This will take one to a page of search results listing all articles from that media source. For instance, if one were to click on the Yahoo News text above one would go to results for all Yahoo News articles. This is equivalent to doing a search on: media:news:Yahoo+News . If one clicks on the News subsearch, not having specified a query yet, then all stored news items in the current language will be displayed, roughly ranked by recency. If one has RSS media sources which are set to be from different locales, then this will be taken into account on this blank query News page.

[[Documentation#contents|Return to table of contents]].


===Search Tools Page===

As one can see from the image of the main search form shown previously, the footer of each search and search result page has several links. Blog takes one to the group feed of the built-in PUBLIC group, which is editable from the root account; Privacy takes one to the Yioop installation's privacy policy; and Terms takes one to the Yioop installation's terms of service. The YioopBot link takes one to a page describing the installation's web crawler. These static pages are all Wiki pages of the PUBLIC group and can be edited by the root account. The Tools link takes one to the following page:

{{class="docs"
((resource:Documentation:SearchTools.png|Search Tools Page))
}}

Beneath the Other Search Sources section is a complete listing of all the search sources that were created using [[Documentation#Search%20Sources|Search Sources]].
This might be more than just the Images, Video, and News that come by default with Yioop. The My Account section of this page gives another set of links for signing in to, modifying the settings of, and creating an account. The Other Tools section has a link to the form below where users can suggest links for the current or future crawls.

{{class="docs"
((resource:Documentation:SuggestAUrl.png|Suggest A Url Form))
}}

This link only appears if under Server Settings, Account Registration is not set to Disable registration. The Wiki Pages link under Other Tools takes one to a searchable list of all Wiki pages of the default PUBLIC group.

[[Documentation#contents|Return to table of contents]].

===Search Operators===

Turning now to the topic of how to enter a query in Yioop: A basic query to the Yioop search form is typically a sequence of words separated by whitespace. This will cause Yioop to compute a "conjunctive query"; it will look up only those documents which contain all of the terms listed. Yioop also supports a variety of other search box commands and query types:

* '''#num#''' in a query are treated as query presentation markers. When a query is first parsed, it is split into columns with ''#num#'' as the column boundary. For example, bob #2# bob sally #3# sally #1#. A given column is used to present ''#num#'' results, where ''#num#'' is what is between the hash marks immediately after it. So in the query above, the subquery ''bob'' is used for the first two search results, then the subquery ''bob sally'' is used for the next three results; finally, the last column is always used for any remaining results. In this case, the subquery ''sally'' would be used for all remaining results even though its ''#num#'' is 1. If a query does not have any #num#'s it is assumed that it has only one column.
* Separating query terms with a vertical bar | results in a disjunctive query. These are parsed after the presentation markers above.
So a search on: ''Chris | Pollett'' would return pages that have either the word ''Chris'' or the word ''Pollett'' or both.
* Putting the query in quotes, for example "Chris Pollett", will cause Yioop to perform an exact match search. Yioop in this case would only return documents that have the string "Chris Pollett" rather than just the words Chris and Pollett possibly not next to each other in the document. Also, using the quote syntax, you can perform searches such as "Chris * Homepage" which would return documents which have the word Chris followed by some text followed by the word Homepage.
* If the query has at least one word not prefixed by -, then adding a `-' in front of a word in a query means search for results not containing that term. So a search on: ''of -the'' would return results containing the word "of" but not containing the word "the".
* Searches of the forms: '''related:url''', '''cache:url''', '''link:url''', '''ip:ip_address''' are equivalent to having clicked on the Similar, Cached, InLinks, IP address links, respectively, on a summary with that url and ip address.

The remaining query types we list in alphabetical order:

;'''code&#58;http_error_code''' : returns the summaries of all documents downloaded with that HTTP response code. For example, code:404 would return all summaries where the response was a Page Not Found error.
;'''date&#58;Y, date&#58;Y-m, date&#58;Y-m-d, date&#58;Y-m-d-H, date&#58;Y-m-d-H-i, date&#58;Y-m-d-H-i-s''' : returns summaries of all documents crawled on the given date. For example, ''date:2011-01'' returns all documents crawled in January, 2011. As one can see, detail goes down to the second level, so one can have an idea about how frequently the crawler is hitting a given site at a given time.
;'''dns&#58;num_seconds''' : returns summaries of all documents whose DNS lookup time was between num_seconds and num_seconds + 0.5 seconds. For example, dns:0.5.
;'''filetype&#58;extension''': returns summaries of all documents found with the given extension. So a search: Chris Pollett filetype&#58;pdf would return all documents containing the words Chris and Pollett and with extension pdf.
;'''host&#58;all''': returns summaries of all domain level pages (pages where the path was /).
;'''index&#58;timestamp or i&#58;timestamp''' : causes the search to make use of the IndexArchive with the given timestamp. So a search like: ''Chris Pollett i&#58;1283121141 | Chris Pollett'' takes results from the index with timestamp 1283121141 for Chris Pollett and unions them with results for Chris Pollett in the default index.
;'''if&#58;keyword!add_keywords_on_true!add_keywords_on_false''' : checks the current conjunctive query clause for "keyword"; if present, it adds "add_keywords_on_true" to the clause, else it adds the keywords "add_keywords_on_false". This meta word is typically used as part of a crawl mix. The else condition does not need to be present. As an example, ''if&#58;oracle!info&#58;http://oracle.com/!site&#58;none'' might be added to a crawl mix so that if a query had the keyword oracle then the site http://oracle.com/ would be returned by the given query clause. As part of a larger crawl mix this could be used to make oracle's homepage appear at the top of the query results. If you would like to inject multiple keywords then separate the keywords using plus rather than white space. For example, if:corvette!fast+car.
;'''info&#58;url''' : returns the summary in the Yioop index for the given url only. For example, one could type info:http://www.yahoo.com/ or info:www.yahoo.com to get the summary for just the main Yahoo! page. This is useful for checking if a particular page is in the index.
;'''lang&#58;IETF_language_tag''' : returns summaries of all documents whose language can be determined to match the given language tag. For example, ''lang:en-US''.
;'''media&#58;kind''' : returns summaries of all documents found of the given media kind. Currently, text, image, news, and video are the four supported media kinds. So one can add to the search terms ''media:image'' to get only image results matching the query keywords.
;'''mix&#58;name or m&#58;name''' : tells Yioop to use the crawl mix "name" when computing the results of the query. The section on mixing crawl indexes has more details about crawl mixes. If the name of the original mix had spaces, for example, cool mix, then to use the mix you would need to replace the spaces with plusses, ''m:cool+mix''.
;'''modified&#58;Y, modified&#58;Y-M, modified&#58;Y-M-D''' : returns summaries of all documents which were last modified on the given date. For example, modified:2010-02 returns all documents which were last modified in February, 2010.
;'''no&#58;some_command''' : is used to tell Yioop not to perform some default transformation of the search terms. For example, ''no:guess'' tells Yioop not to try to guess the semantics of the search before doing the search. This would mean, for instance, that Yioop would not rewrite the query ''yahoo.com'' into ''site:yahoo.com''. ''no:network'' tells Yioop to only return search results from the current machine and not to send the query to all machines in the Yioop instance. ''no:cache'' says to recompute the query and not to make use of memcache or file cache.
;'''numlinks&#58;some_number''': returns summaries of all documents which had some_number of outgoing links. For example, numlinks:5.
;'''os&#58;operating_system''': returns summaries of all documents served on servers using the given operating system. For example, ''os:centos''; make sure to use lowercase.
;'''path&#58;path_component_of_url''': returns summaries of all documents whose path component begins with path_component_of_url.
For example, ''path:/phpBB'' would return all documents whose path started with phpBB; ''path:/robots.txt'' would return summaries for all robots.txt files.
;'''raw&#58;number''' : controls whether or not Yioop tries to do deduplication on results and whether links and pages for the same url should be grouped. Any number greater than zero says don't do deduplication.
;'''robot&#58;user_agent_name''' : returns robots.txt pages that contained that user_agent_name (after lower casing). For example, ''robot:yioopbot'' would return all robots.txt pages explicitly having a rule for YioopBot.
;'''safe&#58;boolean_value''' : is used to provide "safe" or "unsafe" search results. Yioop has a crude, "hand-tuned", linear classifier for whether a site contains pornographic content. If one adds safe:true to a search, only those pages found which were deemed non-pornographic will be returned. Adding safe:false has the opposite effect.
;'''server&#58;web_server_name''' : returns summaries of all documents served on that kind of web server. For example, ''server:apache''.
;'''site&#58;url, site&#58;host, or site&#58;domain''': returns all of the summaries of pages found at that url, host, or domain. As an example, ''site:http://prints.ucanbuyart.com/lithograph_art.html'', ''site:http://prints.ucanbuyart.com/'', ''site:prints.ucanbuyart.com'', ''site:.ucanbuyart.com'', site:ucanbuyart.com, site:com, will all return results with decreasing specificity. To return all pages and links to pages in the Yioop index, you can do ''site:any''. To return all pages (as opposed to pages and links to pages) listed in a Yioop index you can do ''site:all''. ''site:all'' doesn't return any links, so you can't group links to urls and pages of that url together. If you want all sites where one has a page in the index as well as links to that site, then you can do ''site:doc''.
;'''size&#58;num_bytes''': returns summaries of all documents whose download size was between num_bytes and num_bytes + 5000.
num_bytes must be a multiple of 5000. For example, ''size:15000''.
;'''time&#58;num_seconds''' : returns summaries of all documents whose download time, excluding DNS lookup time, was between num_seconds and num_seconds + 0.5 seconds. For example, ''time:1.5''.
;'''version&#58;version_number''' : returns summaries of all documents served on web servers with the given version number. For example, one might have a query ''server:apache version:2.2.9''.
;'''weight&#58;some_number or w&#58;some_number''' : has the effect of multiplying all scores for this portion of a query by some_number. For example, ''Chris Pollett | Chris Pollett site:wikipedia.org w:5'' would multiply scores satisfying ''Chris Pollett'' and on ''wikipedia.org'' by 5 and union these with those satisfying ''Chris Pollett''.

Although we didn't say it next to each query form above, if it makes sense, there is usually an ''all'' variant to a form. For example, ''os:all'' returns all documents from servers for which os information appeared in the headers.

===Result Formats===
In addition to using the search form interface to query Yioop, it is also possible to query Yioop and get results in Open Search RSS format. To do that you can either directly type a URL into your browser of the form:
 http://my-yioop-instance-host/?f=rss&q=query+terms
Or you can write AJAX code that makes requests of URLs in this format. Although there is no official Open Search JSON format, one can get a JSON object with the same structure as the RSS search results using a query to Yioop such as:
 http://my-yioop-instance-host/?f=json&q=query+terms

[[Documentation#contents|Return to table of contents]].

===Settings===

In the corner of the page with the main search form is a Settings-Signin element:

{{class="docs"
((resource:Documentation:SettingsSignin.png|Settings Sign-in Element))
}}

This element provides access for a user to change their search settings by clicking Settings.
The Sign In link provides access to the Admin and User Accounts panels for the website. Clicking the Sign In link also takes one to a page where one can register for an account if Yioop is set up to allow user registration.

{{class="docs"
((resource:Documentation:Settings.png|The Settings Form))
}}

On the Settings page, there are currently three items which can be adjusted: the number of results per page when doing a search, the language Yioop should use, and the particular search index Yioop should use. When a user clicks Save, the data is stored by Yioop. The user can then click "Return to Yioop" to go back to the search page. Thereafter, interaction with Yioop will make use of any settings changes. Data is stored in Yioop and associated with a given user via a cookies mechanism. In order for this to work, the user's browser must allow cookies to be set. This is usually the default for most browsers; however, it can sometimes be disabled, in which case the browser option must be changed back to the default for Settings to work correctly. It is possible to control some of these settings by adding parameters to the URL. For instance, adding &l=fr-FR to the URL query string (the portion of the URL after the question mark) would tell Yioop to use the French from France locale for outputting text. You can also add &its= followed by the Unix timestamp of the search index you want.

[[Documentation#contents|Return to table of contents]].

===Mobile Interface===

Yioop's user interface is designed to display reasonably well on tablet devices such as the iPad. For smart phones, such as iPhone, Android, Blackberry, or Windows Phone, Yioop has a separate user interface.
For search, settings, and login, this looks fairly similar to the non-mobile user interface:

{{class="docs"
((resource:Documentation:MobileSearch.png|Mobile Search Landing Page))
((resource:Documentation:MobileSettings.png|Mobile Settings Page))
((resource:Documentation:MobileSignin.png|Mobile Admin Panel Login))
}}

For Admin pages, each activity is controlled in an analogous fashion to the non-mobile setting, but the Activity element has been replaced with a dropdown:

{{class="docs"
((resource:Documentation:MobileAdmin.png|Example Mobile Admin Activity))
}}

We now resume our discussion of how to use each of the Yioop admin activities for the default, non-mobile, setting, simply noting that except for the above minor changes, these instructions will also apply to the mobile setting.

[[Documentation#contents|Return to table of contents]].

==User Accounts and Social Features==
===Registration and Signin===

Clicking on the Sign In link in the corner of the Yioop web site will bring up the following form:

{{class="docs"
((resource:Documentation:SigninScreen.png|Admin Panel Login))
}}

Correctly entering a username and password will then bring the user to the User Account portion of the Yioop website. Each Account page has on it an Activity element as well as a main panel where the current activity is displayed. The Activity element allows the user to choose what is the current activity for the session. The choices available on the Activity element depend on the roles the user has. A default installation of Yioop comes with two predefined roles: Admin and User.
If someone has the Admin role then the Activity element looks like:

{{class="docs"
((resource:Documentation:AdminActivityElement.png|Admin Activity Element))
}}

On the other hand, if someone just has the User role, then their Activity element looks like:

{{class="docs"
((resource:Documentation:UserActivityElement.png|User Activity Element))
}}

Over the next several sections we will discuss each of the Yioop account activities in turn. Before we do that, we make a couple of remarks about using Yioop from a mobile device.

[[Documentation#contents|Return to table of contents]].

===Managing Accounts===

By default, when a user first signs in to the Yioop admin panel, the current activity is the Manage Account activity. This activity just lets users change their account information using the form pictured below. It also has summary information about Crawls and Indexes (Admin account only), Groups and Feeds, and Crawl mixes. There are also helpful links from each of these sections to a related activity for managing them.

{{class="docs"
((resource:Documentation:ManageAccount.png|Manage Account Page))
}}

Initially, the Account Details fields are grayed out. To edit them, or to edit the user icon next to them, click the Edit link next to Account Details. This will allow a user to change information using their account password. A new user icon can either be selected by clicking the Choose link underneath it, or by dragging and dropping an icon into the image area. The user's password must be entered correctly into the password field for changes to take effect when Save is clicked. Clicking the Lock link will cause these details to be grayed out and not editable again.

{{class="docs"
((resource:Documentation:ChangeAccountInfo.png|Change Account Information Form))
}}

If a user wants to change their password, they can click the Password link label for the password field.
This reveals the following additional form fields where the password can be changed: + +{{class="docs" +((resource:Documentation:ChangePassword.png|Change Password Form)) +}} + +[[Documentation#contents|Return to table of contents]]. + +===Managing Users, Roles, and Groups=== + +The Manage Users, Manage Groups, and Manage Roles activities have similar-looking forms as well as related functions. All three of these activities are available to accounts with the Admin role, but only Manage Groups is available to those with a standard User role. To describe these activities, let's start at the beginning... Users are people who have accounts to connect with a Yioop installation. Users, once logged in, may engage in various Yioop activities such as Manage Crawls, Mix Crawls, and so on. A user is not directly assigned which activities they have permissions on. Instead, they derive their permissions from which roles they have been directly assigned and from which groups they belong to. When first launched, the Manage Users activity looks like: + +{{class="docs" +((resource:Documentation:AddUser.png|The Add User form)) +}} + +The purpose of this activity is to allow an administrator to add, monitor, and modify the accounts of users of a Yioop installation. At the top of the activity is the "Add User" form. This allows an administrator to add a new user to the Yioop system. Most of the fields on this form are self-explanatory except the Status field, which we will describe in a moment. Beneath this is a User List table. At the top of this table is a dropdown used to control how many users to display at one time. If there are more than that many users, there will be arrow links to page through the user list. There is also a Search link which can be used to bring up the following Search User form: + +{{class="docs" +((resource:Documentation:SearchUser.png|The Search User form)) +}} + +This form can be used to find and sort various users out of the complete User List.
If we look at the User List, the first four columns, Username, First Name, Last Name, and Email Address, are pretty self-explanatory. The Status column has a dropdown for each user row; this dropdown also appears in the Add User form. It represents the current status of the user and can be either Inactive, Active, or Banned. An Inactive user is typically a user that has used the Yioop registration form to sign up for an account, but who hasn't had the account activated by the administrator, nor had the account activated by using an email link. Such a user can't create or post to groups or log in. On the other hand, such a user has reserved that username so that other people can't use it. A Banned user is a user who has been banned from logging in, but might have groups or posts that the administrator wants to retain. Selecting a different dropdown value changes that user's status. Next to the Status column are two action columns which can be used to edit a user or to delete a user. Deleting a user deletes their account, any groups that the user owns, and any posts the user made. The Edit User form looks like: + +{{class="docs" +((resource:Documentation:EditUser.png|The Edit User form)) +}} + +This form lets you modify some of the attributes of a user. There are also two links on it: one with the number of roles that a user has, the other with the number of groups that a user has. Here the word "role" means a set of activities. Clicking on one of these links brings up a paged listing of the particular roles/groups the user has/belongs to. It will also let you add or delete roles/groups. Adding a role to a user means that the user can do the set of activities that the role contains; adding a group to the user means the user can read that group, and if the privileges for non-owners allow posting, then they can also post or comment to that group's feed and edit the group's wiki. This completes the description of the Manage +User activity.
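Since, as noted above, a user is not directly assigned activities, their effective permissions are simply the union of the activities of all their roles. A toy sketch of this derivation (plain Python for illustration; the role and activity names below are invented, and this is not Yioop code):

```python
# Toy model of how a user's permitted activities derive from their roles:
# a user may perform an activity if at least one of their roles lists it.
# Role and activity names here are illustrative, not Yioop's actual data.
ROLES = {
    "User": {"Manage Account", "Manage Groups", "Mix Crawls"},
    "Localizer": {"Manage Account", "Manage Locales"},
}

def allowed_activities(user_roles):
    """Return the union of the activity sets of all of a user's roles."""
    allowed = set()
    for role in user_roles:
        allowed |= ROLES.get(role, set())
    return allowed
```

Under this model, assigning a hypothetical Localizer role alongside User adds Manage Locales without removing any of the User role's activities.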
+ +Roles are managed through the Manage Roles activity, which looks like: + +{{class="docs" +((resource:Documentation:AddRole.png|The Add Role form)) +}} + +Similar to the Manage Users activity, at the top of this activity there is an Add Role form, and beneath this a Role List. The controls of the Role List operate in much the same fashion as those of the User List described earlier. Clicking on the Edit link of a role brings up a form which looks like: + +{{class="docs" +((resource:Documentation:EditRole.png|The Edit Role form)) +}} + +In the above, we have a Localizer role. We might have created this role, then used the Select Activity dropdown to add all the activities of the User role. A localizer is a person who can localize Yioop to a new language. So we might then want to use the Select dropdown to add Manage Locales to the list of activities. Once we have created a role that we like, we can then assign users that role and they will be able to perform all of the activities listed on it. If a user has more than one role, then they can perform an activity as long as it is listed in at least one role. + +Groups are collections of users that have access to a group feed and a set of wiki pages. Groups are managed through the Manage Groups activity, which looks like: + +{{class="docs" +((resource:Documentation:ManageGroups.png|The Manage Groups form)) +}} + +Unlike Manage Users and Manage Roles, the Manage Groups activity belongs to the standard User role, allowing any user to create and manage groups. As one can see from the image above, the Create/Join Group form takes the name of a group. If you enter a name that does not currently exist, the following form will appear: + +{{class="docs" +((resource:Documentation:CreateGroup.png|The Create Group form)) +}} + +The user who creates a group is set as the initial group owner.
+ +The '''Register dropdown''' says how other users are allowed to join the group: '''No One''' means no other user can join the group (you can still invite other users); '''By Request''' means that other users can ask the group owner to let them join the group, but the group is not publicly visible in the browsable group directory; '''Public Request''' is the same as By Request, but the group is publicly visible in the browsable group directory; and '''Anyone''' means all users are allowed to join the group and the group appears in the browsable directory of groups. It should be noted that the root account can always join and browse for any group. The root account can also always take over ownership of any group. + +The '''Access dropdown''' controls how users who belong/subscribe to a group, other than the owner, can access that group. The possibilities are: '''No Read''' means that non-members of the group cannot read the group feed or wiki, while a non-owner member of the group can read, but not write, the group news feed and wiki; '''Read''' means that a non-member of the group can read the group news feed and the group's wiki pages, but non-owners cannot write the feed or wiki; '''Read Comment''' means that a non-owner member of the group can read the group feed and wiki and can comment on any existing threads, but cannot start new ones; '''Read Write''' means that a non-owner member of the group can start new threads and comment on existing ones in the group feed, but cannot edit the group's wiki; finally, '''Read Write Wiki''' is Read Write except that a non-owner member can also edit the group's wiki. The access to a group can be changed by the owner after a group is created. No Read and Read are often suitable if a group's owner wants to perform some kind of moderation. Read and Read Comment groups are often suitable if someone wants to use a Yioop Group as a blog. Read Write makes sense for a more traditional bulletin board.
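The access levels just described amount to a small permission table. The sketch below models those rules (plain Python, purely for illustration; the function and constant names are invented here and this is not code from Yioop):

```python
# Illustrative model of the Access dropdown levels described above.
# This is NOT Yioop's implementation; the names are made up for the sketch.
ACCESS_RIGHTS = {
    # level: (non-member may read, member may comment, start threads, edit wiki)
    "No Read":         (False, False, False, False),
    "Read":            (True,  False, False, False),
    "Read Comment":    (True,  True,  False, False),
    "Read Write":      (True,  True,  True,  False),
    "Read Write Wiki": (True,  True,  True,  True),
}

def can(action, access_level, is_member=False, is_owner=False):
    """Check what a user may do in a group under the rules sketched above."""
    if is_owner:
        return True  # the owner can always read, post, and edit
    nonmember_read, comment, post, wiki = ACCESS_RIGHTS[access_level]
    if action == "read":
        # Even under No Read, members may read; non-members may not.
        return is_member or nonmember_read
    if not is_member:
        return False  # non-members can at most read
    return {"comment": comment, "post": post, "edit_wiki": wiki}[action]
```

So, for instance, in a Read Comment group a non-owner member may comment on an existing thread but may not start a new one.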
+ +The '''Voting dropdown''' controls to what degree users can vote on posts. '''No Voting''' means group feed posts cannot be voted on; '''+ Voting''' means that a post can be voted up but not down; and '''+/- Voting''' means a post can be voted up or down. Yioop restricts a user to at most one vote per post. + +The '''Post Lifetime dropdown''' controls how long a group feed post is retained by the Yioop system before it is automatically deleted. The possible values are '''Never Expires''', '''One Hour''', '''One Day''', or '''One Month'''. + +A default installation of Yioop has two built-in groups owned by root: '''Public''' and '''Help'''. Public has Read access; all users are automatically subscribed to it and cannot unsubscribe from it. It is useful for general announcements, and its wiki can be used as part of building a site for Yioop. The Help group's wiki is used to maintain all the wiki pages related to Yioop's integrated help system. When a user clicks on the help icon [?], the page that is presented in blue comes from this wiki. This group's registration is by default Public Request and its access is Read Write Wiki. + +If, on the Create/Join Group form, the name of a group entered already exists but is not joinable, then an error message that the group's name is in use is displayed. If either anyone can join the group or the group can be joined by request, then that group will be added to the list of subscribed-to groups. If membership is by request, then initially in the list of groups it will show up with access Request Join. + +Beneath the Create/Join Group form is the Groups List table. This lists all the groups that a user is currently subscribed to: + +{{class="docs" +((resource:Documentation:GroupsList.png|Groups List Table)) +}} + +The controls at the top of this table are similar in functionality to the controls we have already discussed for the User List table of Manage Users and the Role List table of Manage Roles.
This table lets a user manage their existing groups, but does not let a user see what groups already exist. If one looks back at the Create/Join Group form, though, one can see next to it a link "Browse". Clicking this link takes one to the Discover Groups form and the Not Subscribed to Groups table: + +{{class="docs" +((resource:Documentation:BrowseGroups.png|The Browse Groups form)) +}} + +If a group is subscribable, then the Join link in the Actions column of the Not Subscribed to Groups table should be clickable. Let's now briefly consider the other columns of either the Groups List or Not Subscribed to Groups table. The Name column gives the name of the group. Group names are unique identifiers for a group on a Yioop system. In the Groups List table the name is clickable and takes you to the group feed for that group. The Owner column gives the username of the owner of the group. If you are the root account or if you are the owner of the group, then this field should be a clickable link that takes you to the following form: + +{{class="docs" +((resource:Documentation:TransferGroup.png|The Transfer Group form)) +}} + +that can be used to transfer the ownership of a group. The next two columns give the register and access information for the group. If you are the owner of the group, these will be dropdowns allowing you to change these settings. We have already explained what the Join link does in the actions column.
Other links which can appear in the actions column are Unsubscribe, which lets you leave a group which you have joined but are not the owner of; Delete, which, if you are the owner of a group, lets you delete the group, its feed, and all its wiki pages; and Edit, which displays the following form: + +{{class="docs" +((resource:Documentation:EditGroup.png|The Edit Group form)) +}} + +The Register, Access, Voting, and Post Lifetime dropdowns let one modify the registration, group access, voting, and post lifetime properties for the group, which we have already described. Next to the Members table header is a link with the number of current members of the group. Clicking this link expands this area into a listing of users in the group as seen above. This allows one to change the access of different members to the group, for example, approving a join request or banning a user. It also allows one to delete a member from a group. Beneath the user listing is a link which can take one to a form to invite more users. + +===Feeds and Wikis=== + +The initial screen of the Feeds and Wikis page has an integrated list of all the recent posts to any groups to which a user subscribes: + +{{class="docs" +((resource:Documentation:FeedsWikis.png|The Main Feeds and Wiki Page)) +}} + +The arrow icon next to the feed allows one to collapse the Activities element to free up screen real estate for the feed. Once collapsed, an arrow icon pointing in the opposite direction will appear to let you show the Activities element again.
Next to the Group Activity header at the top of the page are two icons: + +{{class="docs" +((resource:Documentation:GroupingIcons.png|The Group Icons)) +}} + +These control whether the Feed View above has posts in order of time or if posts are arranged by group as below: + +{{class="docs" +((resource:Documentation:FeedsWikis2.png|The Grouped Feeds and Wiki Page)) +}} + + +Going back to the original feed view above, notice posts are displayed with the most recent post at the top. If there has been very recent activity (within the last five minutes), this page will refresh every 15 seconds for up to twenty minutes, checking for new posts. Each post has a title which links to a thread for that post. This is followed by the time when the post first appeared and the group title. This title, although gray, can be clicked to go to that particular group feed. If the user has the ability to start new threads in a group and one is in single feed mode, an icon with a plus-sign and a pencil appears next to the group name, which when clicked allows a user to start a new thread in that group. Beneath the title of the post is the username of the person who posted. Again, this is clickable and will take you to a page of all recent posts of that person. Beneath the username is the content of the post. On the opposite side of the post box may appear links to Edit or X (delete) the post, as well as a link to comment on a post. The Edit and X (delete) links only appear if you are the poster or the owner of the group the post was made in. The Comment link lets you make a follow-up post to that particular thread in that group. For example, for the "I just learned an interesting thing!" post above, the current user could start a new thread by clicking the plus-pencil icon or comment on this post by clicking the Comment link. If you are not the owner of a group, then the Comment and Start a New Thread links only appear if you have the necessary privileges on that group.
+ +The image below shows what happens when one clicks on a group link, in this case, the Chris Blog link. + +{{class="docs" +((resource:Documentation:SingleGroupFeed.png|A Single Group Feed)) +}} + +On the opposite side of the screen there is a link, My Group Feeds, which lets one go back to the previous screen. At the top of this screen is the clickable title of the group, in this case Chris Blog; this takes one to the Manage Groups activity, where properties of this group could be examined. Next we see a toggle between Feed and Wiki. Currently, on the group feed page, clicking Wiki would take one to the Main page of the wiki. Posts in the single group view are grouped by thread, with the thread containing the most recent activity at the top. Notice next to each thread link there is a count of the number of posts to that thread. The content of the thread post is the content of the starting post to the thread; to see later comments one has to click the thread link. There is now a Start New Thread button at the top of the single group feed, as it is clear which group the thread will be started in. Clicking this button would reveal the following form: + +{{class="docs" +((resource:Documentation:StartNewThread.png|Starting a new thread from a group feed)) +}} + +Adding a Subject and Post to this form and clicking Save would start a new thread. Posts can make use of [[Syntax|Yioop's Wiki Syntax]] to achieve effects like bolding text, etc. The icons above the text area can be used to quickly add this mark-up to selected text. Although the icons are relatively standard, hovering over an icon will display a tooltip which should aid in figuring out what it does. Beneath the text area is a dark gray area with instructions on how to add resources to a page such as images, videos, or other documents. For images and videos these will appear embedded in the text of the post when it is saved; for other media a link to the resource will appear when the post is saved.
The size allowed for uploaded media is determined by your PHP instance's php.ini configuration file's values for post_max_size and upload_max_filesize. Yioop uses the value of the constant MAX_VIDEO_CONVERT_SIZE set in a configs/local_config.php or from configs/config.php to determine if a video should be automatically converted to the two web-friendly formats mp4 and webm. This conversion only happens if, in addition, [[http://ffmpeg.org/|FFMPEG]] has been installed and the path to it has been given as a constant FFMPEG in either a configs/local_config.php or in configs/config.php. + +Clicking the comment link of any existing thread reveals the following form to add a comment to that thread: + +{{class="docs" +((resource:Documentation:AddComment.png|Adding a comment to an existing thread)) +}} + +Below we see an example of the feed page we get after clicking on the My First Blog Post thread in the Chris Blog group: + +{{class="docs" +((resource:Documentation:FeedThread.png|A Group Feed Thread)) +}} + +Since we are now within a single thread, there is no Start New Thread button at the top. Instead, we have a Comment button at the top and bottom of the page. The starting post of the thread is listed first and the most recent post is listed last (paging buttons, both on the group and single thread page, let one jump to the last post). The next image below is an example of the feed page one gets when one clicks on a username link, in this case, cpollett: + +{{class="docs" +((resource:Documentation:UserFeed.png|User Feed)) +}} + +Single Group, Thread, and User feeds of groups which anyone can join (i.e., public groups) all have RSS feeds which could be used in a news aggregator or crawled by Yioop. To see what the link would be for the item you are interested in, first collapse the activity element if it's not collapsed (i.e., click the [<<] link at the top of the page). Take the URL in the browser's URL bar, and add &f=rss to it.
It is okay to remove the YIOOP_TOKEN= variable from this URL. Doing this for the cpollett user feed, one gets the url: + http://www.yioop.com/?c=group&a=groupFeeds&just_user_id=4&f=rss +whose RSS feed looks like: + +{{class="docs" +((resource:Documentation:UserRssFeed.png|User Rss Feed)) +}} + +As we mentioned above when we described the single group feed page, if we click on the Wiki link at the top we go to the Main wiki page of the group, where we could read that page. If the Main wiki page (or, for that matter, any wiki page we go to) does not exist, then we would get a page like the following: + +{{class="docs" +((resource:Documentation:NonexistantPage.png|Screenshot of going to the location of a non-existent wiki page)) +}} + +This page might be slightly different depending on whether the user has write access to the given group. The [[Syntax|Wiki Syntax Guide]] link in the above takes one to a page that describes how to write wiki pages. The Edit link referred to in the above looks slightly different and is in a slightly different location depending on whether we are viewing the page with the Activity element collapsed or not. If the Activity element is not collapsed, then it appears as one of three links within the current activity: + +{{class="docs" +((resource:Documentation:AdminHeading.png|Read Edit Page Headings on Admin view)) +}} + +On the other hand, if the Activity element is collapsed, then it appears on the navigation bar at the top of the screen as: + +{{class="docs" +((resource:Documentation:GroupHeading.png|Read Edit Page Headings on Group view)) +}} + +Notice that besides editing a page there is a link to read the page and a Pages link. The Pages link takes us to a screen where we can see all the pages that have been created for a group: + +{{class="docs" +((resource:Documentation:WikiPageList.png|List of Group's Wiki Pages)) +}} + +The search bar can be used to search within the titles of wiki pages of this group for a particular page.
Suppose now we clicked on Test Page in the above; then we would go to that page, initially in Read view: + +{{class="docs" +((resource:Documentation:WikiPage.png|Example Viewing a Wiki Page)) +}} + +If we have write access, and we click the Edit link for this page, we would see the following edit page form: + +{{class="docs" +((resource:Documentation:EditWikiPage.png|Editing a Wiki Page)) +}} + +This page is written using Wiki mark-up whose syntax, as we mentioned above, can be +found in the [[Syntax|Yioop Wiki Syntax Guide]]. So, for example, the heading at the top of the page is written as +<nowiki> + =Test Page= +</nowiki> +in this mark-up. The buttons above the textarea can help you insert the mark-up you need without having to remember it. Also, as mentioned above, the dark gray area below the textarea describes how to associate images, video, and other media to the document. Unlike with posts, a complete list of currently associated media can be found at the bottom of the document under the '''Page Resources''' heading. Links to Rename, Add a resource to the page, and Delete each resource can also be found here. Clicking on the icon next to a resource lets you look at the resource on a page by itself. This icon will be a thumbnail of the resource for images and videos. In the case of videos, the thumbnail is only generated if the FFMPEG software mentioned earlier is installed and the FFMPEG constant is defined. In this case, as with posts, if the video is less than MAX_VIDEO_CONVERT_SIZE, Yioop will automatically try to convert it to mp4 and webm so that it can be streamed by Yioop using HTTP pseudo-streaming.
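To make the conversion settings above concrete, a hypothetical configs/local_config.php fragment might look like the following sketch. The constant names come from the text above, but the particular path and size are made-up examples you would adjust for your own system:

```php
<?php
// Hypothetical configs/local_config.php fragment -- values are examples only.
// Largest video (in bytes) Yioop will try to convert to mp4 and webm.
define('MAX_VIDEO_CONVERT_SIZE', 100000000);
// Path to the ffmpeg binary; needed for conversion and video thumbnails.
define('FFMPEG', '/usr/local/bin/ffmpeg');
```

With both constants defined and FFMPEG installed, uploads under the size limit are converted; without them, videos are stored but not transcoded or thumbnailed.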
+ +Clicking the '''Settings Link''' next to the wiki page name reveals the following additional form elements: + +{{class="docs" +((resource:Documentation:WikiPageSettings.png|Wiki Page Settings)) +}} + +The meaning of these various settings is described in the [[Syntax#Page%20Settings,%20Page%20Type|Page Settings, Page Type]] section of the Yioop Wiki Syntax Guide. + + +The '''Discuss link''' takes you to a thread in the current group where the contents of the wiki page should be discussed. Underneath the textarea above is a Save button. Every time one clicks the Save button, a new version of the page is saved, but the old one is not discarded. We can use the Edit Reason field to provide a reason for the changes between versions. When we read a page, it is the most recent version that is displayed. However, by clicking the History link above, we can see a history of prior versions. For example: + +{{class="docs" +((resource:Documentation:HistoryPage.png|An example History Page of a Wiki Page)) +}} + +The Revert links on this history page can be used to change the current wiki page to a prior version. The time link for each version can be clicked to view that prior version without reverting. The First and Second links next to a version can be used to set either the first field or second field at the top of the history page, which is labeled Difference:. Clicking the Go button for the Difference form computes the change set between the two selected versions of a wiki document. This might look like: + +{{class="docs" +((resource:Documentation:DiffPage.png|An example diff page of two versions of a Wiki Page)) +}} + +This completes the description of group feeds and wiki pages. + +[[Documentation#contents|Return to table of contents]].
+ +==Crawling and Customizing Results== +===Performing and Managing Crawls=== + +The Manage Crawl activity in Yioop looks like: + +{{class="docs" +((resource:Documentation:ManageCrawl.png|Manage Crawl Form)) +}} + +This activity will actually list slightly different kinds of peak memory usages depending on whether the queue_servers are run from a terminal or through the web interface. The screenshot above was done when a single queue_server was being run from the terminal. The first form in this activity allows you to name and start a new web crawl. Next to the Start New Crawl button is an Options link, which allows one to set the parameters under which the crawl will execute. We will return to what the Options page looks like in a moment. When a crawl is executing, under the start crawl form appear statistics about the crawl as well as a Stop Crawl button. Crawling continues until this Stop Crawl button is pressed or until no new sites can be found. As a crawl occurs, a sequence of IndexShards is written. These keep track of which words appear in which documents for groups of 50,000 or so documents. In addition, an IndexDictionary of which words appear in which shard is written to a separate folder and subfolders. When the Stop button is clicked, the "tiers" of data in this dictionary need to be logarithmically merged; this process can take a couple of minutes, so after clicking Stop do not kill the queue_server (if you were going to) until after it says it is waiting for messages again. Beneath this stop button line is a link which allows you to change the crawl options of the currently active crawl. Changing the options on an active crawl may take some time to fully take effect, as the currently processing queue of urls needs to flush. At the bottom of the page is a table listing previously run crawls. Next to each previously run crawl are three links. The first link lets you resume this crawl, if this is possible, and says Closed otherwise.
Resume will cause Yioop to look for unprocessed fetcher data regarding that crawl, and try to load that into a fresh priority queue of to-crawl urls. If it can do this, crawling will continue. The second link lets you set this crawl's result as the default index. In the above picture there were only two saved crawls, the second of which was set as the default index. When someone comes to your Yioop installation and does not adjust their settings, the default index is used to compute search results. The final link allows one to Delete the crawl. For both resuming a crawl and deleting a crawl, it might take a little while before you see the process being reflected in the display. This is because communication might need to be done with the various fetchers, and because the on-screen display refreshes only every 20 seconds or so. + +{{id='prerequisites' +====Prerequisites for Crawling==== +}} + +Before you can start a new crawl, you need to run at least one queue_server.php script and at least one fetcher.php script. These can be run either from the same Yioop installation or from separate machines or folders with Yioop installed. Each installation of Yioop that is going to participate in a crawl should be configured with the same name server and server key. Running these scripts can be done either via the command line or through a web interface. As described in the Requirements section, you might need to do some additional initial set up if you want to take the web interface approach. On the other hand, the command-line approach only works if you are using only one queue server. You can still have more than one fetcher, but the crawl speed in this case probably won't improve beyond ten to twelve fetchers. Also, in the command-line approach the queue server and name server should be the same instance of Yioop.
In the remainder of this section we describe how to start the queue_server.php and fetcher.php scripts via the command line; the GUI for Managing Machines and Servers section describes how to do it via a web interface. To begin, open a command shell and cd into the bin subfolder of the Yioop folder. To start a queue_server type: + + php queue_server.php terminal + +To start a fetcher type: + + php fetcher.php terminal + +The above lines are under the assumption that the path to php has been properly set in your PATH environment variable. If this is not the case, you would need to type the full path to the php executable followed by the rest of the line. If you want to stop these programs after starting them, simply type CTRL-C. Assuming you have done the additional configuration mentioned above that is needed for the GUI approach to managing these programs, it is also possible to run the queue_server and fetcher programs as daemons. To do this one could type respectively: + + php queue_server.php start + +or + + php fetcher.php start + +When run as a daemon, messages from these programs are written into log files in the log subfolder of the WORK_DIRECTORY folder. To stop these daemons one types: + + php queue_server.php stop + +or + + php fetcher.php stop + +Once the queue_server is running and at least one fetcher is running, the Start New Crawl button should work to commence a crawl. Again, it may take up to a minute or so for information about a running crawl to show up in the Currently Processing fieldset. During a crawl, it is possible for a fetcher or the queue server to crash. This usually occurs due to lack of memory for one of these programs. It also can sometimes happen for a fetcher due to flakiness in multi-curl. If this occurs, simply restart the fetcher in question and the crawl can continue. A queue server crash should be much rarer. If it occurs, all of the urls to crawl that reside in memory will be lost.
To continue crawling, you would need to resume the crawl through the web interface. If there are no unprocessed schedules for the given crawl (which usually means you haven't been crawling very long), it is not possible to resume the crawl. Having described what is necessary to perform a crawl, we now return to how to set the options for how the crawl is conducted. + +====Common Crawl and Search Configurations==== + +When testing Yioop, it is quite common just to have one instance of the fetcher and one instance of the queue_server running, both on the same machine and same installation of Yioop. In this subsection we wish to briefly describe some other configurations which are possible, as well as some configs/config.php settings that can affect the crawl and search speed. The most obvious config.php setting which can affect the crawl speed is NUM_MULTI_CURL_PAGES. A fetcher, when performing downloads, opens this many simultaneous connections, gets the pages corresponding to them, processes them, then proceeds to download the next batch of NUM_MULTI_CURL_PAGES pages. Yioop uses the fact that there are gaps in this loop where no downloading is being done to ensure robots.txt Crawl-delay directives are being honored (a Crawl-delayed host will only be scheduled to at most one fetcher at a time). The downside of this is that your internet connection might not be used to its fullest ability to download pages. Thus, rather than increasing NUM_MULTI_CURL_PAGES, it can make sense to run multiple copies of the Yioop fetcher on a machine. To do this one can either install the Yioop software multiple times or give an instance number when one starts a fetcher. For example: + + php fetcher.php start 5 + +would start instance 5 of the fetcher program. + +Once a crawl is complete, one can see its contents in the folder WORK_DIRECTORY/cache/IndexDataUNIX_TIMESTAMP.
In the multi-queue server setting, each queue server machine would have such a folder containing the data for the hosts that queue server crawled. Putting the WORK_DIRECTORY on a solid-state drive can, as you might expect, greatly speed up how fast search results will be served. Unfortunately, if a given queue server is storing ten million or so pages, the corresponding IndexDataUNIX_TIMESTAMP folder might be around 200 GB. Two main sub-folders of IndexDataUNIX_TIMESTAMP largely determine the search performance of Yioop handling queries from a crawl. These are the dictionary subfolder and the posting_doc_shards subfolder, where the former has the greater influence. For the ten million page situation these might be 5GB and 30GB respectively. It is completely possible to copy these subfolders to a SSD and use symlinks to them under the original crawl directory to enhance Yioop's search performance. + +====Specifying Crawl Options and Modifying Options of the Active Crawl==== + +As we pointed out above, next to the Start Crawl button is an Options link. Clicking on this link lets you set various aspects of how the next crawl should be conducted. If there is a currently processing crawl, there will be an options link under its stop button. Both of these links lead to similar pages; however, for an active crawl, fewer parameters can be changed. So we will only describe the first link. We do mention here, though, that under the active crawl options page it is possible to inject new seed urls into the crawl as it is progressing. In the case of clicking the Options link next to the start button, the user should be taken to an activity screen which looks like: + +{{class="docs" +((resource:Documentation:WebCrawlOptions.png|Web Crawl Options Form)) +}} + +The Back link in the corner returns one to the previous activity.
There are two kinds of crawls that can be performed by Yioop: either a crawl of sites on the web or a crawl of data that has been previously stored in a supported archive format. Supported archive sources include data that was crawled by Versions 0.66 and above of Yioop, data coming from a database or text archive via Yioop's importing methods described below, [[http://www.archive.org/web/researcher/ArcFileFormat.php|Internet Archive ARC files]], [[http://archive-access.sourceforge.net/warc/|ISO WARC files]], [[http://en.wikipedia.org/wiki/Wikipedia:Database_download|MediaWiki xml dumps]], and [[http://rdf.dmoz.org/|Open Directory Project RDF files]]. In the next subsection, we describe new web crawls, and we return to archive crawls in the subsection after that. Finally, we have a short section on some advanced crawl options which can only be set in config.php or local_config.php. You will probably not need these features, but we mention them for completeness.

=====Web Crawl Options=====

On the web crawl tab, the first form field, "Get Crawl Options From", allows one to read in crawl options either from the default_crawl.ini file or from the crawl options used in a previous crawl. The rest of the form allows the user to change the existing crawl options. The second form field is labeled Crawl Order. This can be set to either Breadth First or Page Importance. It specifies the order in which pages will be crawled. In breadth-first crawling, roughly all the seed sites are visited first, followed by sites linked directly from seed sites, followed by sites linked directly from sites linked directly from seed sites, and so on. Page Importance is our modification of [ [[Documentation#APC2003|APC2003]]]. In this order, each seed site starts with a certain quantity of money. When a site is crawled, it distributes its money equally amongst the sites it links to. When picking sites to crawl next, one chooses those that currently have the most money.
Additional rules are added to handle things like the fact that some sites might have no outgoing links. Also, in our set-up we don't revisit already seen sites. To handle these situations, we take a different tack from the original paper. This crawl order roughly approximates crawling according to page rank.

The next checkbox is labelled Restrict Sites by Url. If it is checked, then a textarea with label Allowed To Crawl Sites appears. If one checks Restrict Sites by Url, then only pages on those sites and domains listed in the Allowed To Crawl Sites textarea can be crawled. We will say how to specify domains and sites in a moment; first, let's discuss the last two textareas on the Options form. The Disallowed Sites textarea allows you to specify sites that you do not want the crawler to crawl under any circumstance. There are many reasons you might not want a crawler to crawl a site. For instance, some sites might not have a good robots.txt file, but will ban you from interacting with their site if they get too much traffic from you.

Just above the Seed Sites textarea are two links labeled "Add User Suggest Data". If on the Server Settings activity Account Registration is set to anything other than Disable Registration, it is possible for a search site user to suggest urls to crawl. This can be done by going to the [[Documentation#Search%20Tools%20Page|Search Tools Page]] and clicking on the Suggest a Url link. Suggested links are stored in WORK_DIRECTORY/data/suggest_url.txt. Clicking Add User Suggest Data adds any suggested urls in this file into the Seed Sites textarea, then deletes the contents of this file. The suggested urls which are not already in the seed site list are added after comment lines (lines starting with #) which give the time at which the urls were added. Adding Suggest data can be done either for new crawls or to inject urls into currently running crawls.

The Seed Sites textarea allows you to specify a list of urls that the crawl should start from.
The crawl will begin using these urls. This list can include ".onion" urls if you want to crawl [[http://en.wikipedia.org/wiki/Tor_network|TOR networks]].

The format for sites, domains, and urls is the same for each of these textareas, except that the Seed Sites area can only take urls (or urls with title/descriptions), and in the Disallowed Sites/Sites with Quotas area one can give a url followed by #. Otherwise, in this common format, there should be one site, url, or domain per line. You should not separate sites and domains with commas or other punctuation. White space is ignored. A domain can be specified as:
 domain:.sjsu.edu
Urls like:
 http://www.sjsu.edu/
 https://www.sjsu.edu/gape/
 http://bob.cs.sjsu.edu/index.html
would all fall under this domain. The word domain above is a slight misnomer, as domain:sjsu.edu, without the leading period, also matches a site like http://mysjsu.edu/. A site can be specified as scheme://domain/path. Currently, Yioop recognizes three schemes: http, https, and gopher (an older web protocol). For example, https://www.somewhere.com/foo/ . Such a site includes https://www.somewhere.com/foo/anything_more . Yioop also recognizes * and $ within urls. So http://my.site.com/*/*/ would match http://my.site.com/subdir1/subdir2/rest and http://my.site.com/*/*/$ would require the last symbol in the url to be '/'. This kind of pattern matching can be useful to restrict a crawl of a site to a certain fixed depth -- you can allow crawling a site, but disallow the downloading of pages with more than a certain number of '/' in them.

In the Disallowed Sites/Sites with Quotas textarea, a number after a # sign indicates that at most that many pages should be downloaded from that site in any given hour. For example,
 http://www.ucanbuyart.com/#100
indicates that at most 100 pages are to be downloaded from http://www.ucanbuyart.com/ per hour.
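To make the * and $ behavior concrete, here is a minimal sketch of how such site patterns could be turned into regular expressions. This is our own illustration, not Yioop's actual matching code; in particular, the assumption that * does not cross '/' boundaries and that un-anchored patterns match as prefixes is ours.

```python
import re

def url_pattern_to_regex(pattern):
    """Convert a Yioop-style site pattern into a Python regex.
    Assumed semantics (not taken from Yioop's source): '*' matches a
    run of non-'/' characters, a trailing '$' anchors the end of the
    url, and otherwise the pattern is a prefix match."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    regex = re.escape(pattern).replace(r"\*", "[^/]*")
    return "^" + regex + ("$" if anchored else "")

def matches(pattern, url):
    # re.match anchors at the start, giving the prefix behavior.
    return re.match(url_pattern_to_regex(pattern), url) is not None
```

Under these assumptions, http://my.site.com/*/*/ matches http://my.site.com/subdir1/subdir2/rest, while http://my.site.com/*/*/$ rejects it because the url does not end in '/'.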
In the Seed Sites area, one can specify titles and page descriptions for pages that Yioop would otherwise be forbidden to crawl by the robots.txt file. For example,
 http://www.facebook.com/###!Facebook###!A%20famous%20social%20media%20site
tells Yioop to generate a placeholder page for http://www.facebook.com/ with title "Facebook" and description "A famous social media site", rather than to attempt to download the page. The [[Documentation#Results%20Editor|Results Editor]] activity can only be used to affect pages which are in a Yioop index. This technique allows one to add arbitrary pages to the index.

When configuring a new instance of Yioop, the file default_crawl.ini is copied to WORK_DIRECTORY/crawl.ini and contains the initial settings for the Options form.

{{id="archive"
=====Archive Crawl Options=====
}}

We now consider how to do crawls of previously obtained archives. From the initial crawl options screen, clicking on the Archive Crawl tab gives one the following form:

{{class="docs"
((resource:Documentation:ArchiveCrawlOptions.png|Archive Crawl Options Form))
}}

The dropdown lists all previously done crawls that are available for recrawl.

{{class="docs"
((resource:Documentation:ArchiveCrawlDropDown.png|Archive Crawl dropdown))
}}

These include previously done Yioop crawls, previously done recrawls (prefixed with RECRAWL::), Yioop Crawl Mixes (prefixed with MIX::), and crawls of other file formats, such as arc, warc, database data, MediaWiki XML, and ODP RDF, which have been appropriately prepared in the PROFILE_DIR/cache folder (prefixed with ARCFILE::). In addition, Yioop also has a generic text file archive importer (also prefixed with ARCFILE::).
You might want to re-crawl an existing Yioop crawl if you want to add new meta-words, new cache page links, extract fields in a different manner, or if you are migrating a crawl from an older version of Yioop whose index isn't readable by your newer version of Yioop. For similar reasons, you might want to recrawl a previously re-crawled crawl. When you archive crawl a crawl mix, Yioop does a search on the keyword site:any using the crawl mix in question. The results are then indexed into a new archive. This new archive might have considerably better query performance (in terms of speed) compared to queries performed on the original crawl mix. How to make a crawl mix is described in the [[Documentation#Mixing%20Crawl%20Indexes|Crawl Mixes]] section. You might want to do an archive crawl of other file formats if you want Yioop to be able to provide search results of their content. Once you have selected the archive you want to crawl, you can add meta words as discussed in the Crawl Time Tab Page Rule portion of the [[Documentation#Page%20Indexing%20and%20Search%20Options|Page Options]] section. Afterwards, go back to the Create Crawl screen to start your crawl. As with a Web Crawl, for an archive crawl you need both the queue_server running and at least one fetcher running to perform a crawl.

To re-crawl a previously created web archive that was made using several fetchers, each of the fetchers used in the creation process should be running. This is because the data used in the recrawl will come locally from the machine of that fetcher. For other kinds of archive crawls and mix crawls, which fetchers one uses doesn't matter, because archive crawl data comes through the name server. You might also notice that the number of pages in a web archive re-crawl is actually larger than the initial crawl.
This can happen because during the initial crawl data was stored in the fetcher's archive bundle, and a partial index of this data was sent to the appropriate queue_servers but not yet processed by them. So it was waiting in a schedules folder to be processed in the event the crawl was resumed.

To get Yioop to detect arc, database data, MediaWiki, ODP RDF, or generic text archive files, you need to create a PROFILE_DIR/cache/archives folder on the name server machine. Yioop checks subfolders of this folder for files with the name arc_description.ini. For example, to do a MediaWiki archive crawl, one could make a subfolder PROFILE_DIR/cache/archives/my_wiki_media_files and put in it a file arc_description.ini in the format to be discussed in a moment. In addition to the arc_description.ini, you would also put in this folder all the archive files (or links to them) that you would like to index. When indexing, Yioop will process each archive file in turn. Returning to the arc_description.ini file, its contents are used to give a description of the archive crawl that will be displayed in the archive dropdown, as well as to specify the kind of archives the folder contains and how to extract them. An example arc_description.ini for MediaWiki dumps might look like:

 arc_type = 'MediaWikiArchiveBundle';
 description = 'English Wikipedia 2012';

In the Archive Crawl dropdown the description will appear with the prefix ARCFILE:: and you can then select it as the source to crawl. Currently, the supported arc_types are: ArcArchiveBundle, DatabaseBundle, MediaWikiArchiveBundle, OdpRdfArchiveBundle, TextArchiveBundle, and WarcArchiveBundle. For the ArcArchiveBundle, OdpRdfArchiveBundle, MediaWikiArchiveBundle, and WarcArchiveBundle arc_types, a two-line arc_description.ini file like the above generally suffices. We now describe how to import from the other kinds of formats in a little more detail.
In general, the arc_description.ini tells Yioop how to get string items (in an associative array, with a minimal amount of additional information) from the archive in question. Processing of these string items can then be controlled using Page Rules, described in the [[Documentation#Page%20Indexing%20and%20Search%20Options|Page Options]] section.

An example arc_description.ini where the arc_type is DatabaseBundle might be:
 arc_type = 'DatabaseBundle';
 description = 'DB Records';
 dbms = "mysql";
 db_host = "localhost";
 db_name = "MYGREATDB";
 db_user = "someone";
 db_password = "secret";
 encoding = "UTF-8";
 sql = "SELECT MYCOL1, MYCOL2 FROM MYTABLE1 M1, MYTABLE2 M2 WHERE M1.FOO=M2.BAR";
 field_value_separator = '|';
 column_separator = '##';

Here is a specific example that gets the rows out of the TRANSLATION table of Yioop, where the database was stored in a Postgres DBMS. In the comments I indicate how to alter it for other DBMSs.

 arc_type = 'DatabaseBundle';
 description = 'DB Records';
 ;sqlite3 specific
 ;dbms ="sqlite3";
 ;mysql specific
 ;dbms = "mysql";
 ;db_host = "localhost";
 ;db_user = "root";
 ;db_password = "";
 dbms = "pdo";
 ;below is for postgres; similar if want db2 or oracle
 db_host = "pgsql:host=localhost;port=5432;dbname=seek_quarry"
 db_name = "seek_quarry";
 db_user = "cpollett";
 db_password = "";
 encoding = "UTF-8";
 sql = "SELECT * from TRANSLATION";
 field_value_separator = '|';
 column_separator = '##';

Possible values for dbms are pdo, mysql, and sqlite3. If pdo is chosen, then db_host should be a PHP DSN specifying which DBMS driver to use. db_name is the name of the database you would like to connect to, db_user is the database username, db_password is the password for that user, and encoding is the character set of rows that the database query will return.

The sql variable is used to give a query whose result rows will be the items indexed by Yioop.
Yioop indexes string "pages", so to make these rows into a string, each column result is made into a string of the form: field field_value_separator value. Here field is the name of the column and value is the value for that column in the given result row. Columns are concatenated together, separated by the value of column_separator. The resulting string is then sent to Yioop's TextProcessor page processor.

We next give a few examples of arc_description.ini files where the arc_type is TextArchiveBundle. First, suppose we wanted to index access log file records that look like:
 127.0.0.1 - - [21/Dec/2012:09:03:01 -0800] "POST /git/yioop2/ HTTP/1.1" 200 - \
 "-" "Mozilla/5.0 (compatible; YioopBot; \
 +http://localhost/git/yioop/bot.php)"
Here each record is delimited by a newline and the character encoding is UTF-8. The records are stored in files with the extension .log, and these files are uncompressed. We then might use the following arc_description.ini file:
 arc_type = 'TextArchiveBundle';
 description = 'Log Files';
 compression = 'plain';
 file_extension = 'log';
 end_delimiter = "\n";
 encoding = "UTF-8";
In addition to compression = 'plain', Yioop supports gzip and bzip2. The end_delimiter is a regular expression indicating how to know when a record ends. To process a TextArchiveBundle, Yioop needs either an end_delimiter or a start_delimiter (or both) to be specified.
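As an illustration of what an end_delimiter does, the following sketch splits the text of an uncompressed log archive into records on a delimiter regex. This is hypothetical demonstration code, not Yioop's TextArchiveBundle implementation, and the sample log lines are fabricated.

```python
import re

def split_records(text, end_delimiter):
    # Treat end_delimiter as a regular expression terminating each
    # record, and drop empty pieces (a simplified sketch).
    return [r for r in re.split(end_delimiter, text) if r.strip()]

# Two made-up access-log records delimited by newlines, as in the
# arc_description.ini example above.
log = ('127.0.0.1 - - [21/Dec/2012:09:03:01 -0800] "POST / HTTP/1.1" 200 -\n'
       '10.0.0.2 - - [21/Dec/2012:09:03:02 -0800] "GET /a HTTP/1.1" 200 -\n')
records = split_records(log, r"\n")
```

Each element of records would then be handed off as one string "page" for indexing.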
As another example, for a mail.log file with entries of the form:
 From pollett@mathcs.sjsu.edu Wed Aug 7 10:59:04 2002 -0700
 Date: Wed, 7 Aug 2002 10:59:04 -0700 (PDT)
 From: Chris Pollett <pollett@mathcs.sjsu.edu>
 X-Sender: pollett@eniac.cs.sjsu.edu
 To: John Doe <johndoe@mail.com>
 Subject: Re: a message
 In-Reply-To: <5.1.0.14.0.20020723093456.00ac9c00@mail.com>
 Message-ID: <Pine.GSO.4.05.10208071057420.9463-100000@eniac.cs.sjsu.edu>
 MIME-Version: 1.0
 Content-Type: TEXT/PLAIN; charset=US-ASCII
 Status: O
 X-Status:
 X-Keywords:
 X-UID: 17

 Hi John,

 I got your mail.

 Chris
The following might be used:

 arc_type = 'TextArchiveBundle';
 description = 'Mail Logs';
 compression = 'plain';
 file_extension = 'log';
 start_delimiter = "\n\nFrom\s";
 encoding = "ASCII";

Notice here we are splitting records using a start delimiter. Also, we have chosen ASCII as the character encoding. As a final example, we show how to import tar gzip files of Usenet records, as found in the [[http://archive.org/details/utzoo-wiseman-usenet-archive|UTzoo Usenet Archive 1981-1991]]. Further discussion of how to process this collection is given in the [[Documentation#Page%20Indexing%20and%20Search%20Options|Page Options]] section.

 arc_type = 'TextArchiveBundle';
 description = 'Utzoo Usenet Archive';
 compression = 'gzip';
 file_extension = 'tgz';
 start_delimiter = "\0\0\0\0Path:";
 end_delimiter = "\n\0\0\0\0";
 encoding = "ASCII";

Notice in the above we set the compression to gzip and then have Yioop act on the raw tar file. In tar files, content objects are separated by long paddings of nulls. Usenet posts begin with Path, so to keep things simple we grab records which begin with a sequence of nulls followed by Path and end with another sequence of nulls.
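The combined use of start_delimiter and end_delimiter in the Usenet example can be pictured with the following sketch, which pulls out the text lying between a start match and the next end match. Again, this is illustrative code with fabricated data, not Yioop's implementation.

```python
import re

def extract_records(data, start_delimiter, end_delimiter):
    # Return the text between each start_delimiter match and the
    # following end_delimiter match; (?s) lets '.' span newlines.
    return re.findall("(?s)" + start_delimiter + "(.*?)" + end_delimiter, data)

# Two made-up null-padded Usenet-style records in the style of the
# raw tar contents described above.
raw = ("\0\0\0\0Path: news.example!alice\nbody one\n\0\0\0\0"
       "\0\0\0\0Path: news.example!bob\nbody two\n\0\0\0\0")
posts = extract_records(raw, "\0\0\0\0Path:", "\n\0\0\0\0")
```

Each extracted post would then be processed as a separate text record.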
As a final reminder for this section, remember that, in addition to the arc_description.ini file, the subfolder should also contain instances of the files in question that you would like to archive crawl. So for arc files, these would be files of extension .arc.gz; for MediaWiki, files of extension .xml.bz2; and for ODP-RDF, files of extension .rdf.u8.gz .

=====Crawl Options of config.php or local_config.php=====

There are a couple of flags which can be set in config.php or in a local_config.php file that affect web crawling, which we now mention for completeness. As was mentioned before, when Yioop is crawling, it makes use of Etag: and Expires: HTTP headers received during web page download to determine when a page can be recrawled. This assumes one has not completely turned off recrawling under the [[Documentation#Page%20Indexing%20and%20Search%20Options|Page Indexing and Search Options]] activity. To turn Etag and Expires checking off, one can add to a local_config.php file the line:

 define("USE_ETAG_EXPIRES", false);

Yioop can be run using the [[https://github.com/facebook/hhvm/|Hip Hop Virtual Machine from Facebook]]. This will tend to make Yioop run faster and use less memory than running it under the standard PHP interpreter. Hip Hop can be used on various Linux flavors, and to some degree runs under OS X (the queue server and fetcher will run, but the web app doesn't). If you want to use Hip Hop on Mac OS X, and you install it via Homebrew, then you will need to set a force variable and the path for Hip Hop in your local_config.php file with lines like:
 define('FORCE_HHVM', true);
 define('HHVM_PATH', '/usr/local/bin');
The above lines are only needed on OS X to run Hip Hop.

[[Documentation#contents|Return to table of contents]]

===Mixing Crawl Indexes===

Once you have performed a few crawls with Yioop, you can use the Mix Crawls activity to create mixtures of your crawls.
This activity is available to users who have either the Admin role or just the standard User role. This section describes how to create crawl mixes, which are processed when a query comes in to Yioop. Once one has created such a crawl mix, an admin user can make a new index which consists of the results of the crawl mix ("materialize it") by doing an archive crawl of the crawl mix. The [[Documentation#archive|Archive Crawl Options]] subsection has more details on how to do this latter operation. The main Mix Crawls activity looks like:

{{class="docs"
((resource:Documentation:ManageMixes.png|The Manage Mixes form))
}}

The first form allows you to name and create a new crawl mixture. Clicking "Create" sends you to a second page where you can provide information about how the mixture should be built. Beneath the Create mix form is a table listing all the previously created crawl mixes. Above this listing, but below the Create form, is a standard set of nav elements for selecting which mixes will be displayed in this table. A crawl mix is "owned" by the user who creates it, and the table only lists crawl mixes "owned" by the user. The first column has the name of the mix, the second column says how the mix is built out of component crawls, and the actions column allows you to edit the mix, set it as the default index for Yioop search results, or delete the mix. You can also append "m:name+of+mix" or "mix:name+of+mix" to a query to use that mix without having to set it as the index. When you create a new mix, and are logged in so Yioop knows the mix belongs to you, your mix will also show up on the Settings page. The "Share" column pops up a link where you can share a crawl mix with a Yioop Group. This will post a message with a link to that group so that others can import your mix into their lists of mixes.
Creating a new mix or editing an existing mix sends you to a second page:

{{class="docs"
((resource:Documentation:EditMix.png|The Edit Mixes form))
}}

Using the "Back" link on this page will take you to the prior screen. The first text field on the edit page lets you rename your mix if you so desire. Beneath this is an "Add Groups" button. A group is a weighted list of crawls. If only one group were present, then search results would come from any crawl listed for this group. A given result's score would be the weighted sum of the scores of the crawls in the group it appears in. Search results are displayed in descending order according to this total score. If more than one group is present, then the number of results field for that group determines how many of the displayed results should come from that group. For the crawl mix displayed above, there are three groups: the first group is used to display the first result, the second group is used to display the second result, and the last group is used to display any remaining search results.

The UI for groups works as follows: The top row has three columns. To add new components to a group, use the dropdown in the first column. The second column controls for how many results the particular crawl group should be used. Different groups' results are presented in the order they appear in the crawl mix. The last group is always used to display any remaining results for a search. The delete group link in the third column can be used to delete a group. Beneath the first row of a group, there is one row for each crawl that belongs to the group. The first link for a crawl says how its scores should be weighted in the search results for that group. The second column is the name of the crawl. The third column is a space-separated list of words to add to the query when obtaining results for that crawl.
So, for example, in the first group above, there are two indexes which will be unioned: Default Crawl with a weight of 1, and CanCrawl Test with a weight of 2. For the Default Crawl, we inject two keywords, media:text and Canada, into the query we get from the user. media:text means we will get whatever results from this crawl consisted of text rather than image pages. Keywords can be used to make a particular component of a crawl mix behave in a conditional manner by using the "if:" meta word described in the search and user interface section. The last link in a crawl row allows you to delete a crawl from a crawl group. For changes on this page to take effect, the "Save" button beneath this dropdown must be clicked.

[[Documentation#contents|Return to table of contents]].

===Classifying Web Pages===

Sometimes searching for text that occurs within a page isn't enough to find what one is looking for. For example, the relevant set of documents may have many terms in common, with only a small subset showing up on any particular page, so that one would have to search for many disjoint terms in order to find all relevant pages. Or one may not know which terms are relevant, making it hard to formulate an appropriate query. Or the relevant documents may share many key terms with irrelevant documents, making it difficult to formulate a query that fetches one but not the other. Under these circumstances (among others), it would be useful to have meta words already associated with the relevant documents, so that one could just search for the meta word. The Classifiers activity provides a way to train classifiers that recognize classes of documents; these classifiers can then be used during a crawl to add appropriate meta words to pages determined to belong to one or more classes.
Clicking on the Classifiers activity displays a text field where you can create a new classifier, and a table of existing classifiers, where each row corresponds to a classifier and provides some statistics and action links. A classifier is identified by its class label, which is also used to form the meta word that will be attached to documents. Each classifier can only be trained to recognize instances of a single target class, so the class label should be a short description of that class, containing only alphanumeric characters and underscores (e.g., "spam", "homepage", or "menu"). Typing a new class label into the text box and hitting the Create button initializes a new classifier, which will then show up in the table.

{{class="docs"
((resource:Documentation:ManageClassifiers.png|The Manage Classifiers page))
}}

Once you have a fresh classifier, the natural thing to do is edit it by clicking on the Edit action link. If you made a mistake, however, or no longer want a classifier for some reason, then you can click on the Delete action link to delete it; this cannot be undone. The Finalize action link is used to prepare a classifier to classify new web pages, which cannot be done until you've added some training examples. We'll discuss how to add new examples next, then return to the Finalize link.

====Editing a Classifier====

Clicking on the Edit action link takes you to a new page where you can change a classifier's class label, view some statistics, and provide examples of positive and negative instances of the target class. The first two options should be self-explanatory, but the last is somewhat involved. A classifier needs labeled training examples in order to learn to recognize instances of a particular class, and you help provide these by picking out example pages from previous crawls and telling the classification system whether or not they belong to the class.
The Add Examples section of the Edit Classifier page lets you select an existing crawl to draw potential examples from, and optionally narrow down the examples to those that satisfy a query. Once you've done this, clicking the Load button will send a request to the server to load some pages from the crawl and choose the next one to receive a label. You'll be presented with a record representing the selected document, similar to a search result, with several action links along the side that let you mark this document as either a positive or negative example of the target class, or skip this document and move on to the next one:

{{class="docs"
((resource:Documentation:ClassifiersEdit.png|The Classifiers edit page))
}}

When you select any of the action buttons, your choice is sent back to the server, and a new example to label is sent back (so long as there are more examples in the selected index). The old example record is shifted down the page and its background color updated to reflect your decision: green for a positive example, red for a negative one, and gray for a skip; the statistics at the top of the page are updated accordingly. The new example record replaces the old one, and the process repeats. Each time a new label is sent to the server, it is added to the training set that will ultimately be used to prepare the classifier to classify new web pages during a crawl. Each time you label a set number of new examples (10 by default), the classifier will also estimate its current accuracy by splitting the current training set into training and testing portions, training a simple classifier on the training portion, and testing on the remainder (checking the classifier output against the known labels). The new estimated accuracy, calculated as the proportion of the test pages classified correctly, is displayed under the Statistics section.
You can also manually request an updated accuracy estimate by clicking the Update action link next to the Accuracy field. Doing this will send a request to the server that initiates the same process described previously and, after a delay, displays the new estimate.

All of this happens without reloading the page, so avoid using the web browser's Back button. If you do end up reloading the page somehow, then the current example record and the list of previously-labeled examples will be gone, but none of your progress toward building the training set will be lost.

====Finalizing a Classifier====

Editing a classifier adds new labeled examples to the training set, providing the classifier with a more complete picture of the kinds of documents it can expect to see in the future. In order to take advantage of an expanded training set, though, you need to finalize the classifier. This is broken out into a separate step because it involves optimizing a function over the entire training set, which can be slow even for a few hundred example documents. It wouldn't be practical to wait for the classifier to re-train each time you add a new example, so you have to explicitly tell the classifier that you're done adding examples for now by clicking the Finalize action link, either next to the Load button on the edit classifier page or next to the given classifier's name on the classifier management page.

Clicking this link will kick off a separate process that trains the classifier in the background. When the page reloads, the Finalize link should have changed to text that reads "Finalizing..." (but if the training set is very small, training may complete almost immediately). After starting finalization, it's fine to walk away for a bit, reload the page, or carry out some unrelated task for the user account. You should not, however, make further changes to the classifier's training set, or start a new crawl that makes use of the classifier.
When the classifier finishes its training phase, the Finalizing message will be replaced by one that reads "Finalized", indicating that the classifier is ready for use.

====Using a Classifier====

Using a classifier is as simple as checking the "Use to Classify" or "Use to Rank" checkboxes next to the classifier's label on the [[Documentation#Page%20Indexing%20and%20Search%20Options|Page Options]] activity, under the "Classifiers and Rankers" heading. When the next crawl starts, the classifier (and any other selected classifiers) will be applied to each fetched page. If "Use to Rank" is checked, then the classifier score for that page will be recorded. If "Use to Classify" is checked and a page is determined to belong to a target class, it will have several meta words added. As an example, if the target class is "spam", and a page is determined to belong to the class with probability .79, then the page will have the following meta words added:

*class:spam
*class:spam:50plus
*class:spam:60plus
*class:spam:70plus
*class:spam:70

These meta words allow one to search for all pages classified as spam at any probability over the preset threshold of .50 (with class:spam), at any probability over a specific multiple of .1 (e.g., over .6 with class:spam:60plus), or within a specific range (e.g., .60–.69 with class:spam:60). Note that no meta words are added if the probability falls below the threshold, so no page will ever have the meta words class:spam:10plus, class:spam:20plus, class:spam:20, and so on.

[[Documentation#contents|Return to table of contents]].

===Page Indexing and Search Options===

Several properties of how web pages are indexed and how pages are looked up at search time can be controlled by clicking on Page Options. There are three tabs for this activity: Crawl Time, Search Time, and Test Options. We will discuss each of these in turn.
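The meta words in the spam example above follow a simple bucketing rule. The sketch below is our own illustration of that rule as documented, not code taken from Yioop:

```python
def class_meta_words(label, probability, threshold=0.5):
    # Below the threshold no meta words are added at all.
    if probability < threshold:
        return []
    words = ["class:" + label]
    decile = int(probability * 10)  # e.g. 0.79 falls in the 70s bucket
    # One NNplus word for each multiple of .1 from the threshold up to
    # the page's own bucket, then the exact-range word.
    for d in range(int(threshold * 10), decile + 1):
        words.append("class:%s:%d0plus" % (label, d))
    words.append("class:%s:%d0" % (label, decile))
    return words
```

For a "spam" page with probability .79, this yields exactly the five meta words listed above; below .50 it yields none.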
====Crawl Time Tab====

Clicking on Page Options leads to the default Crawl Time Tab:

{{class="docs"
((resource:Documentation:PageOptionsCrawl.png|The Page Options Crawl form))
}}

This tab controls some aspects of how a page is processed and indexed at crawl time. The form elements before Page Field Extraction Rules are relatively straightforward, and we will discuss them briefly below. The Page Rules textarea allows you to specify additional commands for how you would like text to be extracted from a page document summary. The description of this language will take up the remainder of this subsection.

The Get Options From dropdown allows one to load in crawl time options that were used in a previous crawl. Beneath this, the Byte Range to Download dropdown controls how many bytes out of any given web page should be downloaded. Smaller numbers reduce the disk space needed for a crawl; bigger numbers tend to improve the search results. If whole pages are being cached, these downloaded bytes are stored in archives with the fetcher. The Summarizer dropdown controls which summarizer is used on a page during page processing. Yioop uses a summarizer to control what portions of a page will be put into the index and are available at search time for snippets. The two available summarizers are Basic, which picks the page's meta title, meta description, h1 tags, etc. in a fixed order until the summary size is reached; and Centroid, which computes an "average sentence" for the document and adds phrases from the actual document according to nearness to this average. If the Centroid summarizer is used, Yioop also generates a word cloud for each document. Centroid tends to produce slightly better results than Basic, but is slower. How to tweak the Centroid summarizer for a particular locale is described in the [[Documentation#Localizing%20Yioop%20to%20a%20New%20Language|Localizing Yioop]] section.
The Max Page Summary Length in Bytes controls how many of the total bytes can be used to make a page summary which is sent to the queue server. It is only words in this summary which can actually be looked up in search results. Care should be taken in making this value larger, as it can increase both the RAM requirements while crawling (you might have to change the memory_limit variable at the start of queue_server.php to prevent crashing) and slow the crawl process down. The Cache whole crawled pages checkbox controls whether, when crawling, Yioop keeps both the whole downloaded web page and the summary extracted from it (checked) or just the page summary (unchecked). The next dropdown, Allow Page Recrawl After, controls for how many days Yioop keeps track of the URLs it has downloaded. For instance, if one sets this dropdown to 7, then after seven days Yioop will clear the Bloom filter files used to store which urls have been downloaded, and it would be allowed to recrawl these urls again if they happen to appear in links. It should be noted that all of the information from before the seven days will still be in the index, just that now Yioop will be able to recrawl pages that it had previously crawled. Besides letting Yioop get a fresher version of a page it already has, this also has the benefit of speeding up longer crawls, as Yioop doesn't need to check as many Bloom filter files. In particular, it might just use one and keep it in memory. + +The Page File Types to Crawl checkboxes allow you to decide which file extensions you want Yioop to download during a crawl. This check is done before any download is attempted, so at that point Yioop can only guess the [[http://en.wikipedia.org/wiki/MIME|MIME Type]], as it hasn't received this information from the server yet. An example of a url with a file extension is: + http://flickr.com/humans.txt +which has the extension txt.
So if txt is unchecked, then Yioop won't try to download this page even though Yioop can process plain text files. A url like: + http://flickr.com/ +has no file extension and will be assumed to have an html extension. To crawl sites which have a file extension that is not in the above list, check the unknown checkbox in the upper left of this list. + +The Classifiers and Rankers checkboxes allow you to select the classifiers that will be used to classify or rank pages. Each classifier (see the [[Documentation#Classifying%20Web%20Pages|Classifiers]] section for details) is represented in the list by its class label and two checkboxes. Checking the box under "Use to Classify" indicates that the associated classifier should be used (made active) during the next crawl for classifying; checking the box under "Use to Rank" indicates that the classifier should be used (made active) and its score for the document stored so that it can be used as part of the search time score. Each active classifier is run on each page downloaded during a crawl. If "Use to Classify" was checked and the page is determined to belong to the class that the classifier has been trained to recognize, then a meta word like "class:label", where label is the class label, is added to the page summary. For faster access to pages that contain a single term and a label, for example, pages that contain "rich" and are labeled as "non-spam", Yioop actually uses the first character of the label "non-spam" and embeds it as part of the term ID of "rich" on pages labeled "non-spam". To ensure this speed-up can be used, it is useful to make sure one's classifier labels begin with different first characters. If "Use to Rank" is checked, then when a classifier is run on the page, the score from the classifier is recorded. When a search is done that might retrieve this page, this score is then used as one component of the overall score that this page receives for the query.
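The term-ID speed-up just described can be pictured roughly as follows. This is an invented illustration of the idea only; Yioop's real term identifiers are computed differently, and the hash and lengths here are arbitrary:

```python
import hashlib

def term_id(term):
    """Illustrative 8-hex-digit term identifier derived from a hash."""
    return hashlib.md5(term.encode("utf-8")).hexdigest()[:8]

def labeled_term_id(term, label):
    """Embed the first character of a classifier label in a term's ID,
    so that, e.g., "rich" on pages labeled "non-spam" gets its own
    directly searchable identifier. Two labels sharing a first
    character would collide here, which is why the text above
    recommends labels that begin with different first characters."""
    return label[0] + term_id(term)[1:]
```

With this scheme, a query for "rich" restricted to "non-spam" pages can be answered by a single posting-list lookup on the combined identifier rather than by intersecting two lists.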
+ +The Indexing Plugins checkboxes allow you to select which plugins to use during the crawl. Yioop comes with three built-in plugins: AddressesPlugin, RecipePlugin, and WordFilterPlugin. One can also write or download additional plugins. If the plugin can be configured, next to the checkbox will be a link to a configuration screen. Let's briefly look at each of these plugins in turn... + +Checking the AddressesPlugin enables Yioop during a crawl to try to calculate addresses for each page summary it creates. When Yioop processes a page, it by default creates a summary of the page with a TITLE and a DESCRIPTION as well as a few other fields. With the addresses plugin activated, it will try to extract data for three additional fields: EMAILS, PHONE_NUMBERS, and ADDRESSES. If you want to test out how these behave, pick some web page, view source on the web page, copy the source, and then paste it into the Test Options Tab on the page options page (the Test Options Tab is described later in this section). + +Clicking the RecipePlugin checkbox causes Yioop during a crawl to run the code in indexing_plugins/recipe_plugin.php. This code tries to detect pages which are food recipes and separately extracts these recipes and clusters them by ingredient. It then adds the search meta words ingredient: and recipe:all to allow one to search recipes by ingredient or only documents containing recipes. + +Checking the WordFilterPlugin causes Yioop to run code in indexing_plugins/wordfilter_plugin.php on each downloaded page. +The [[http://www.yioop.com/?c=group&a=wiki&group_id=20&arg=media&page_id=26&n=04%20Niche%20or%20Subject%20Specific%20Crawling%20With%20Yioop.mp4| Niche Crawling Video Tutorial]] has information about how to use this plugin to create subject-specific crawls of the web. This code checks if the downloaded page has one of the words listed in the textarea one finds on the plugin's configure page.
If it does, then the plugin follows the actions listed for pages that contain that term. Below is an example WordFilterPlugin configure page: + +{{class="docs" +((resource:Documentation:WordFilterConfigure.png|Word Filter Configure Page)) +}} + +Lines in this configure file either specify a url or domain using a syntax like [url_or_domain] or specify a rule or a comment. Whitespace is ignored and everything after a semi-colon on a line is treated as a comment. The rules immediately following a url or domain line, up till the next url or domain line, are in effect if one is crawling a page with that url or domain. Each '''rule line''' in the textarea consists of a comma separated list of literals followed by a colon followed by a comma separated list of what to do if the literal condition is satisfied. A single literal in the list of literals is an optional + or - followed by a sequence of non-space characters. After the + or -, up until a # symbol, is called the term in the literal. If the literal sign is + or if no sign is present, then the literal holds for a document if it contains the term; if the literal sign is -, then the literal holds for a document if it does not contain the term. If there is a decimal number between 0 and 1, say x, after the # up to a comma or the first white-space character, then this is modified so the literal holds only if at least the fraction x of the document's length comes from the literal's term. If, rather than a decimal, x is a positive natural number, then the term would need to occur x times. If all the literals in the comma separated list hold, then the rule is said to hold, and the actions will apply. The line -term0:JUSTFOLLOW says that if the downloaded page does not contain the word "term0" then do not index the page, but do follow outgoing links from the page. The line term1:NOPROCESS says if the document has the word "term1" then do not index it or follow links from it.
The last line +term2:NOFOLLOW,NOSNIPPET says if the page contains "term2" then do not follow any outgoing links. NOSNIPPET means that if the page is returned from search results, the link to the page should not have a snippet of text from that page beneath it. As an example of a more complicated rule, consider: + + surfboard#2,bikini#0.02:NOINDEX, NOFOLLOW + +Here, for the rule to hold, the condition surfboard#2 requires that the term surfboard occur at least twice in the document, and the condition bikini#0.02 requires that at least the fraction 0.02 (that is, 2%) of the document's total length come from copies of the word bikini. In addition to the commands just mentioned, WordFilterPlugin supports standard robots.txt directives such as: NOINDEX, NOCACHE, NOARCHIVE, NOODP, NOYDIR, and NONE. More details about how indexing plugins work and how to write your own indexing plugin can be found in the [[Documentation#Modifying%20Yioop%20Code|Modifying Yioop]] section. + +====Page Field Extraction Language==== + +We now return to the Page Field Extraction Rules textarea of the Page Options - Crawl Time tab. Commands in this area allow a user to control what data is extracted from a summary of a page. The textarea allows you to do things like modify the summary, title, and other fields extracted from a page summary; extract new meta words from a summary; and add links which will appear when a cache of a page is shown. Page Rules are especially useful for extracting data from generic text archives and database archives. How to import such archives is described in the Archive Crawls sub-section of +[[Documentation#Performing%20and%20Managing%20Crawls|Performing and Managing Crawls]]. The input to the page rule processor is an associative array that results from Yioop doing initial processing on a page. To see what this array looks like one can take a web page and paste it into the form on the Test Options tab.
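Stepping back briefly to the WordFilterPlugin: the literal semantics described in the previous subsection can be sketched as a small checker (an illustrative Python sketch of the semantics as described, not the plugin's actual PHP code):

```python
def literal_holds(literal, words):
    """Check one WordFilterPlugin-style literal, e.g. '-term0',
    'surfboard#2', or 'bikini#0.02', against a document given as a
    list of words."""
    sign = '+'
    if literal[0] in '+-':
        sign, literal = literal[0], literal[1:]
    term, _, bound = literal.partition('#')
    count = words.count(term)
    if bound == '':
        holds = count > 0                        # plain containment
    elif '.' in bound:                           # fraction of document length
        holds = count >= float(bound) * len(words)
    else:                                        # absolute occurrence count
        holds = count >= int(bound)
    return holds if sign == '+' else not holds
```

A rule line then holds when every literal in its comma separated list holds, at which point the listed actions (NOINDEX, NOFOLLOW, and so on) would apply.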
There are two types of page rule statements that a user can define: command statements and assignment statements. In addition, a semicolon ';' can be used to indicate the rest of a line is a comment. Although the initial textarea for rules might appear small, most modern browsers allow one to resize this area by dragging on the lower right hand corner of the area. This makes it relatively easy to see large sets of rules. + +A command statement takes a key field argument for the page associative array and does a function call to manipulate that page. Below is a list of currently supported commands followed by comments on what they do: + + addMetaWords(field) ;add the field and field value to the META_WORD + ;array for the page + addKeywordLink(field) ;split the field on a comma, view this as a search + ;keywords => link text association, and add this to + ;the KEYWORD_LINKS array. + setStack(field) ;set which field value should be used as a stack + pushStack(field) ;add the field value for field to the top of stack + popStack(field) ;pop the top of the stack into the field value for + ;field + setOutputFolder(dir) ;if auxiliary output, rather than output just to + ;a yioop index, is being done, then set the folder + ;for this output to be dir + setOutputFormat(format) ;set the format of auxiliary output.
+ + ;Should be either CSV or SQL + ;SQL means that writeOutput will write an insert + ;statement + setOutputTable(table) ;if output is SQL then what table to use for the + ;insert statements + toArray(field) ;splits field value for field on a comma and + ;assigns field value to be the resulting array + toString(field) ;if field value is an array then implode that + ;array using comma and store the result in field + ;value + unset(field) ;unset that field value + writeOutput(field) ;use the contents of field value viewed as an array + ;to fill in the columns of a SQL insert statement + ;or CSV row + + +Page rule assignments can either be straight assignments with '=' or concatenation assignments with '.='. Let $page indicate the associative array that Yioop supplies the page rule processor. There are four kinds of values that one can assign: + field = some_other_field ; sets $page['field'] = $page['some_other_field'] + field = "some_string" ; sets $page['field'] to "some string" + field = /some_regex/replacement_where_dollar_vars_allowed/ + ; computes the results of replacing matches to some_regex + ; in $page['field'] with replacement_where_dollar_vars_allowed + field = /some_regex/g ;sets $page['field'] to the array of all matches + ; of some_regex in $page['field'] +For each of the above assignments we could have used ".=" instead of "=". We next give a simple example followed by a couple of more complicated examples of page rules and the contexts in which they were used: + +In the first example, we just want to extract meaningful titles for mail log records that were read in using a TextArchiveBundleIterator. Here, after initial page processing, a whole email would end up in the DESCRIPTION field of the $page associative array given to the page rule processor.
So we use the following two rules: + TITLE = DESCRIPTION + TITLE = /(.|\n|\Z)*?Subject:[\t ](.+?)\n(.|\n|\Z)*/$2/ +We initially set the TITLE to be the whole record, then use a regex to extract out the correct portion of the subject line. The pattern between the first two slashes matches the whole record, with the pattern inside the second pair of parentheses (.+?) capturing the subject text. The $2 in the replacement part says to replace the value of TITLE with just this captured portion. + +The next example was used to do a quick first pass processing of records from the [[http://archive.org/details/utzoo-wiseman-usenet-archive|UTzoo Archive of Usenet Posts from 1981-1991]]. What each block does is described in the comments below + ; + ; Set the UI_FLAGS variable. This variable in a summary controls + ; which of the header elements should appear on cache pages. + ; UI_FLAGS should be set to a string with a comma separated list + ; of the options one wants. In this case, we use: yioop_nav, says that + ; we do want to display header; version, says that we want to display + ; when a cache item was crawled by Yioop; and summaries, says to display + ; the toggle extracted summaries link and associated summary data.
+ ; Other possible UI_FLAGS are history, whether to display the history + ; dropdown to other cached versions of item; highlight, whether search + ; keywords should be highlighted in cached items + ; + UI_FLAGS = "yioop_nav,version,summaries" + ; + ; Use Post Subject line for title + ; + TITLE = DESCRIPTION + TITLE = /(.|\n)*?Subject:([^\n]+)\n(.|\n)*/$2/ + ; + ; Add a link with a blank keyword search so cache pages have + ; link back to yioop + ; + link_yioop = ",Yioop" + addKeywordLink(link_yioop) + unset(link_yioop) ;using unset so don't have link_yioop in final summary + ; + ; Extract y-M and y-M-j dates as meta word u:date:y-M and u:date:y-M-j + ; + date = DESCRIPTION + date = /(.|\n)*?Date:([^\n]+)\n(.|\n)*/$2/ + date = /.*,\s*(\d*)-(\w*)-(\d*)\s*.*/$3-$2-$1/ + addMetaWord(date) + date = /(\d*)-(\w*)-.*/$1-$2/ + addMetaWord(date) + ; + ; Add a link to articles containing u:date:y-M meta word. The link text + ; is Date:y-M + ; + link_date = "u:date:" + link_date .= date + link_date .= ",Date:" + link_date .= date + addKeywordLink(link_date) + ; + ; Add u:date:y meta-word + ; + date = /(\d*)-.*/$1/ + addMetaWord(date) + ; + ; Get the first three words of subject ignoring re: separated by underscores + ; + subject = TITLE + subject = /(\s*(RE:|re:|rE:|Re:)\s*)?(.*)/$3/ + subject_word1 = subject + subject_word1 = /\s*([^\s]*).*/$1/ + subject_word2 = subject + subject_word2 = /\s*([^\s]*)\s*([^\s]*).*/$2/ + subject_word3 = subject + subject_word3 = /\s*([^\s]*)\s*([^\s]*)\s*([^\s]*).*/$3/ + subject = subject_word1 + unset(subject_word1) + subject .= "_" + subject .= subject_word2 + unset(subject_word2) + subject .= "_" + subject .= subject_word3 + unset(subject_word3) + ; + ; Get the first newsgroup listed in the Newsgroup: line, add a meta-word + ; u:newsgroup:this-newgroup. 
Add a link to cache page for a search + ; on this meta word + ; + newsgroups = DESCRIPTION + newsgroups = /(.|\n)*?Newsgroups:([^\n]+)\n(.|\n)*/$2/ + newsgroups = /\s*((\w|\.)+).*/$1/ + addMetaWord(newsgroups) + link_news = "u:newsgroups:" + link_news .= newsgroups + link_news .= ",Newsgroup: " + link_news .= newsgroups + addKeywordLink(link_news) + unset(link_news) + ; + ; Makes a thread meta u:thread:newsgroup-three-words-from-subject. + ; Adds a link to cache page to search on this meta word + ; + thread = newsgroups + thread .= ":" + thread .= subject + addMetaWord(thread) + unset(newsgroups) + link_thread = "u:thread:" + link_thread .= thread + link_thread .= ",Current Thread" + addKeywordLink(link_thread) + unset(subject) + unset(thread) + unset(link_thread) +As a last example of page rules, suppose we wanted to crawl the web and whenever we detected a page had an address we wanted to write that address as a SQL insert statement to a series of text files. We can do this using page rules and the AddressesPlugin. First, we would check the AddressesPlugin and then we might use page rules like: + summary = ADDRESSES + setStack(summary) + pushStack(DESCRIPTION) + pushStack(TITLE) + setOutputFolder(/Applications/MAMP/htdocs/crawls/data) + setOutputFormat(sql) + setOutputTable(SUMMARY); + writeOutput(summary) +The first line says copy the contents of the ADDRESSES field of the page into a new summary field. The next line says use the summary field as the current stack. At this point the stack would be an array with all the addresses found on the given page. So you could use the command like popStack(first_address) to copy the first address in this array over to a new variable first_address. In the above case what we do instead is push the contents of the DESCRIPTION field onto the top of the stack. Then we push the contents of the TITLE field. 
The line + setOutputFolder(/Applications/MAMP/htdocs/crawls/data) +sets /Applications/MAMP/htdocs/crawls/data as the folder that any auxiliary output from the page_processor should go to. setOutputFormat(sql) says we want to output sql; the other possibility is csv. The line setOutputTable(SUMMARY); says the table name to use for INSERT statements should be SUMMARY. Finally, the line writeOutput(summary) would use the contents of the array entries of the summary field as the column values for an INSERT statement into the SUMMARY table. This writes a line to the file data.txt in /Applications/MAMP/htdocs/crawls/data. If data.txt exceeds 10MB, it is compressed into a file data.txt.0.gz and a new data.txt file is started. + +====Search Time Tab==== + +The Page Options Search Time tab looks like: + +{{class="docs" +((resource:Documentation:PageOptionsSearch.png|The Page Options Search form)) +}} + +The Search Page Elements and Links control group is used to tell which elements and links you would like to have presented on the search landing and search results pages. The Word Suggest checkbox controls whether a dropdown of word suggestions should be presented by Yioop when a user starts typing in the Search box. It also controls whether spelling correction and thesaurus suggestions will appear. The Subsearch checkbox controls whether the links for Image, Video, and News search appear in the top bar of Yioop. You can actually configure what these links are in the [[Documentation#Search%20Sources|Search Sources]] activity. The checkbox here is a global setting for displaying them or not. In addition, if this is unchecked then the hourly activity of downloading any RSS media sources for the News subsearch will be turned off. The Signin checkbox controls whether to display the link to the page for users to sign in to Yioop. The Cache checkbox toggles whether a link to the cache of a search item should be displayed as part of each search result.
The Similar checkbox toggles whether a link to similar search items should be displayed as part of each search result. The Inlinks checkbox toggles whether a link for inlinks to a search item should be displayed as part of each search result. Finally, the IP address checkbox toggles whether a link for pages with the same ip address should be displayed as part of each search result. + +The Search Ranking Factors group of controls (the Title Weight, Description Weight, and Link Weight fields) is used by Yioop to decide how to weigh each portion of a document when it returns query results to you. + +When Yioop ranks search results it searches out in its postings list until it finds a certain number of qualifying documents. It then sorts these by their score, returning usually the top 10 results. In a multi-queue-server setting the query is simultaneously asked by the name server machine of each of the queue server machines and the results are aggregated. The Search Results Grouping controls allow you to affect this behavior. Minimum Results to Group controls the number of results the name server wants to have before sorting of results is done. When the name server requests documents from each queue server, it requests alpha*(Minimum Results to Group)/(Number of Queue Servers) documents. Server Alpha controls the number alpha. For instance, with Minimum Results to Group set to 200, five queue servers, and alpha equal to 1.25, each queue server would be asked for 1.25*200/5 = 50 documents. + +The Save button of course saves any changes you make on this form. + +====Test Options Tab==== + +The Page Options Test Options tab looks like: + +{{class="docs" +((resource:Documentation:PageOptionsTest.png|The Page Options Test form)) +}} + +In the Type dropdown one can select a [[http://en.wikipedia.org/wiki/Internet_media_type|MIME Type]] used to select the page processor Yioop uses to extract text from the data you type or paste into the textarea on this page. Test Options lets you see how Yioop would process a web page and add summary data to its index.
After filling in the textarea with a page, clicking Test Process Page will show the $summary associative array Yioop would create from the page after the appropriate page processor is applied. Beneath it shows the $summary array that would result after user-defined page rules from the crawl time tab are applied. Yioop stores a serialized form of this array in an IndexArchiveBundle for a crawl. Beneath this array is an array of terms (or character n-grams) that were extracted from the page together with their positions in the document. Finally, a list of meta words that the document has is listed. Either extracted terms or meta-words could be used to look up this document in a Yioop index. + +===Results Editor=== + +Sometimes after a large crawl one finds that there are some results that appear that one does not want in the crawl or that the summary for some result is lacking. The Results Editor activity allows one to fix these issues without having to do a completely new crawl. It has three main forms: an edited urls form, a url editing form, and a filter websites form. + +If one has already edited the summary for a url, then the dropdown in the edited urls form will list this url. One can select it and click load to get it to display in the url editing form. The purpose of the url editing form is to allow a user to change the title and description for a url that appears on a search results page. Filling out the three fields of the url editing form, or loading values into them through the previous form and changing them, and then clicking save, updates the appearance of the summary for that url. To return to using the default summary, one only fills out the url field, leaves the other two blank, and saves. This form does not affect whether the page is looked up for a given query, only its final appearance. It can only be used to edit the appearance of pages which appear in the index, not to add pages to the index.
Also, the edit will affect the appearance of that page for all indexes managed by Yioop. If you know there is a page that won't be crawled by Yioop, but would like it to appear in an index, please look at the crawl options section of the [[Documentation#Performing%20and%20Managing%20Crawls|Manage Crawls]] documentation. + +To understand the filter websites form, recall the disallowed sites crawl option allows a user to specify they don't want Yioop to crawl a given web site. After a crawl is done, though, one might be asked to remove a website from the crawl results, or one might want to remove a website from the crawl results because it has questionable content. A large crawl can take days to redo. To make such filtering faster while one is waiting for a replacement crawl in which the site has been disallowed, one can use a search filter. + +{{class="docs" +((resource:Documentation:ResultsEditor.png|The Results Editor form)) +}} + +Using the filter websites form one can specify a list of hosts which should be excluded from the search results. The sites listed in the Sites to Filter textarea are required to be hostnames. Using a filter, any web page with the same host name as one listed in the Sites to Filter will not appear in the search results. So for example, the filter settings in the example image above contain the line http://www.cs.sjsu.edu/, so given these settings, the web page http://www.cs.sjsu.edu/faculty/pollett/ would not appear in search results. + +[[Documentation#contents|Return to table of contents]]. + +===Search Sources=== + +The Search Sources activity is used to manage the media sources available to Yioop, and also to control the subsearch links displayed on the top navigation bar. The Search Sources activity looks like: + +{{class="docs" +((resource:Documentation:SearchSources.png|The Search Sources form)) +}} + +The top form is used to add a media source to Yioop. Currently, the Media Kind can be either Video, RSS, or HTML.
'''Video Media''' sources are used to help Yioop recognize links to videos on a web video site such as YouTube. This helps both in tagging such pages with the meta word media:video in a Yioop index and in being able to render a thumbnail of the video in the search results. When the media kind is set to video, this form has three fields: Name, which should be a short familiar name for the video site (for example, YouTube); URL, which should consist of a url pattern by which to recognize a video on that site; and Thumb, which consists of a url pattern to replace the original pattern by to find the thumbnail for that video. For example, the value of URL for YouTube is: + http://www.youtube.com/watch?v={}& +This will match any url which begins with http://www.youtube.com/watch?v= followed by some string followed by & followed by another string. The {} indicates that from v= to the & should be treated as the identifier for the video. The Thumb url in the case of YouTube is: + http://img.youtube.com/vi/{}/2.jpg +If the identifier in the first video link was yv0zA9kN6L8, then using the above, when displaying a thumb for the video, Yioop would use the image source: + http://img.youtube.com/vi/{yv0zA9kN6L8}/2.jpg +Some video sites have more complicated APIs for specifying thumbnails. In that case, you can still do media:video tagging but display a blank thumbnail rather than suggest a thumbnail link. To do this one uses the thumb url: + http://www.yioop.com/resources/blank.png?{} +If one selects the media kind to be '''RSS''' (Really Simple Syndication, a kind of news feed; you can also use Atom feeds as sources), then the media sources form has four fields: '''Name''', again a short familiar name for the RSS feed; '''URL''', the url of the RSS feed; '''Language''', what language the RSS feed is in; and Image XPath, an optional field which allows you to specify an XPath, relative to an RSS item, to an image url if one is present in the item.
This Language element is used to control whether or not a news item will display given the current language settings of Yioop. If under Manage Machines the Media Updater on the Name Server is turned on, then these RSS feeds will be downloaded hourly. If under the Search Time screen of the [[Documentation#Page%20Indexing%20and%20Search%20Options|Page Options]] activity the subsearch checkbox is checked, then there will be a link to News which appears on the top of the search page. Clicking on this link will display news items in order of recency. + +An '''HTML Feed''' is a web page that has news articles, like an RSS page, that you want the Media Updater to scrape on an hourly basis. To specify where in the HTML page the news items appear, you specify different XPath information. For example, +<pre> + Name: Cape Breton Post + URL: http://www.capebretonpost.com/News/Local-1968 + Channel: //div[contains(@class, "channel")] + Item: //article + Title: //a + Description: //div[contains(@class, "dek")] + Link: //a +</pre> +The Channel field is used to specify the tag that encloses all the news items. Relative to this as the root tag, //article says the path to an individual news item. Then relative to an individual news item, //a gets the title, etc. Link extracts the href attribute of that same //a . + +Returning again to Image XPath, which is a field of both the RSS form and the HTML Feed form. Not all RSS feeds use the same tag to specify the image associated with a news item. The Image XPath allows you to specify, relative to a news item (either RSS or HTML), where an image thumbnail exists. If a site does not use such thumbnails, one can prefix the path with ^ to give the path relative to the root of the whole file to where a thumbnail for the news source exists. Yioop automatically removes escaping from RSS containing escaped HTML when computing this.
For example, the following works for the feed: +<pre> + http://feeds.wired.com/wired/index + //description/div[contains(@class, + "rss_thumbnail")]/img/@src +</pre> + +Beneath this media sources form is a table listing all the currently added media sources, their urls, and links that allow one to edit or delete sources. + +The second form on the page is the Add a Subsearch form. The form allows you to add a new specialized search link which may appear at the top of the search page. If more than three of these subsearches are added, or if one is seeing the page on a mobile platform, one instead gets a "More" link. This links to the tool.php page which then lists out all possible specialized searches, some account links, and other useful Yioop tools. The Add a Subsearch form has three fields: Folder Name is a short familiar name for the subsearch; it will appear as part of the query string when the given subsearch is being performed. For example, if the folder name was news, then s=news will appear as part of the query string when a news subsearch is being done. Folder Name is also used to make the localization identifier used in translating the subsearch's name into different languages. This identifier will have the format db_subsearch_identifier. For example, db_subsearch_news. Index Source, the second form element, is used to specify a crawl or a crawl mix that the given subsearch should use in returning results. Results per Page, the last form element, controls the number of search results which should appear when using this kind of subsearch. + +Beneath this form is a table listing all the currently added subsearches and their properties. The actions column at the end of this table lets one either edit, localize, or delete a given subsearch. Clicking localize takes one to the Manage Locales page for the default locale and that particular subsearch localization identifier, so that you can fill in a value for it.
Remembering the name of this identifier, one can then in Manage Locales navigate to other locales and fill in translations for them as well, if desired. + +[[Documentation#contents|Return to table of contents]]. + +===GUI for Managing Machines and Servers=== + +Rather than use the command line as described in the [[Documentation#prerequisites|Prerequisites for Crawling]] section, it is possible to start/stop and view the log files of queue servers and fetchers through the Manage Machines activity. In order to do this, the additional requirements for this activity mentioned in the [[Documentation#Requirements|Requirements]] section must have been met. The Manage Machines activity looks like: + +{{class="docs" +((resource:Documentation:ManageMachines.png|The Manage Machines form)) +}} + +The Add machine form at the top of the page allows one to add a new machine to be controlled by this Yioop instance. The Machine Name field lets you give this machine an easy to remember name. The Machine URL field should be filled in with the URL of the installed Yioop instance. The Mirror checkbox says whether you want the given Yioop installation to act as a mirror for another Yioop installation. Checking it will reveal a Parent Name textfield that allows you to choose which installation, amongst the previously entered machine names (not urls), you want to mirror. The Has Queue Server checkbox is used to say whether the given Yioop installation will be running a queue server or not. Finally, the Number of Fetchers dropdown allows you to say how many fetcher instances you want to be able to manage for that machine. Beneath the Add machine form is the Machine Information listing. This shows the machines currently known to this Yioop instance. This list always begins with the Name Server itself and a toggle to control whether or not the Media Updater process is running on the Name Server.
This allows you to control whether or not Yioop attempts to update its RSS (or Atom) search sources on an hourly basis. There is also a link to the log file of the Media Updater process. Under the Name Server information is a dropdown that can be used to control the number of current machine statuses that are displayed for all other machines that have been added. It also might have next and previous arrow links to go through the currently available machines. + +Beneath this dropdown is a set of boxes for each machine you have added to Yioop. In the far corner of this box is a link to Delete that machine from the list of known machines, if desired. Besides this, each box lists the queue server, if any, and each of the fetchers you requested to be able to manage on that machine. Next to these there is a link to the log file for that server/fetcher, and below this there is an On/Off switch for starting and stopping the server/fetcher. This switch is green if the server/fetcher is running and red otherwise. A similar On/Off switch is present to turn on and off mirroring on a machine that is acting as a mirror. It is possible for a switch to be yellow if the machine has crashed but might be automatically restarted by Yioop without your intervention. + +==Building Sites with Yioop== + +===Building a Site using Yioop's Wiki System=== + +As was mentioned in the Configure Activity [[Documentation#advance|Toggle Advance Settings]] section of the documentation, background color, icons, title, and SEO meta information for a Yioop instance can all be configured from the Configure Activity. Adding advertisements such as banner and skyscraper ads can be done using the form on the [[Documentation#Optional%20Server%20and%20Security%20Configurations|Server Settings]] activity.
If you would like a site with a more custom landing page, then one can check '''Use Wiki Public Main Page as Landing Page''' under Toggle Advance +Settings : Site Customizations. The Public Main page will then be the page you see when you first go to your site. You can then build out your site using the wiki system for the public group. Common headers and footers can be specified for pages on your site using each wiki page's Settings attributes. More advanced styling of pages can be done by specifying the auxiliary css data under Toggle Advance Settings. As wiki pages can be set to be galleries or slide presentations, and as Yioop supports including images, video, and embedding search bars and math equations on pages using [[Syntax|Yioop's Wiki Syntax]], one can develop quite advanced sites using just this approach. The video tutorial [[https://yioop.com/?c=group&a=wiki&group_id=20&arg=media&page_id=26&n=03%20Building%20Web%20Sites%20with%20Yioop.mp4|Building Websites Using Yioop]] explains how the Seekquarry.com site was built using Yioop Software in this way. + +===Building a Site using Yioop as Framework=== + +For more advanced, dynamic websites than the wiki approach described above, the Yioop code base can still serve as the code base for new custom search web sites. The web-app portion of Yioop uses a [[https://en.wikipedia.org/wiki/Model-view-adapter|model-view-adapter (MVA) framework]]. This is a common, web-suitable variant on the more well-known Model View Controller design pattern. In this set-up, sub-classes of the Model class should handle file I/O and database functions, sub-classes of View should be responsible for rendering outputs, and sub-classes of the Controller class do calculations on data received from the web and from the models to give the views the data they finally need to render.
In the remainder of this section we describe how this framework is implemented in Yioop and how to add code to the WORK_DIRECTORY/app folder to customize things for your site. In this discussion we will use APP_DIR to refer to WORK_DIRECTORY/app and BASE_DIR to refer to the directory where Yioop is installed. + +The index.php script is the first script run by the Yioop web app. It has an array $available_controllers which lists the controllers available to the script. The names of the controllers in this array are lowercase. Based on whether the $_REQUEST['c'] variable is in this array, index.php either loads the file {$_REQUEST['c']}_controller.php or loads whatever the default controller is. index.php also checks for the existence of APP_DIR/index.php and loads it if it exists. This gives the app developer a chance to change the available controllers and which controller is set for a given request. A controller file should contain a class which extends the class Controller. Controller files should always have names of the form somename_controller.php and the class inside them should be named SomenameController. Notice it is Somename rather than SomeName. These general naming conventions are used for models, views, etc. Any Controller subclass has methods component($name), model($name), view($name), and indexing_plugin($name). These methods load, instantiate, and return a class with the given name. For example, $my_controller->model("crawl"); checks to see if a CrawlModel has already been instantiated; if so, it returns it; if not, it does a require_once on model/crawl_model.php, then instantiates a CrawlModel, saves a reference to it, and returns it. + +When a require_once is needed, Yioop first looks in APP_DIR. For example, $my_controller->view("search") would first look for a file APP_DIR/views/search_view.php to include; if it cannot find such a file then it tries to include BASE_DIR/views/search_view.php.
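The lazy-load pattern behind model($name) can be sketched as follows. This is not Yioop's actual implementation (which also resolves APP_DIR before BASE_DIR); it only illustrates the naming convention and the instantiate-once behavior described above. The class name Controller_Sketch is hypothetical.

```php
<?php
// Sketch of $controller->model($name): "crawl" maps to
// model/crawl_model.php and class CrawlModel, instantiated at most once.
class Controller_Sketch
{
    private $model_instances = [];

    public function model($name)
    {
        if (!isset($this->model_instances[$name])) {
            $class_name = ucfirst($name) . "Model"; // crawl -> CrawlModel
            // Real code: require_once checks APP_DIR first, then BASE_DIR:
            // require_once "model/{$name}_model.php";
            $this->model_instances[$name] = new $class_name();
        }
        return $this->model_instances[$name]; // cached on later calls
    }
}
```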
So to change the behavior of an existing BASE_DIR file one just has a modified copy of the file in the appropriate place in your APP_DIR. This holds in general for other program files such as components, models, and plugins. It doesn't hold for resources such as images -- we'll discuss those in a moment. Notice that because it looks in APP_DIR first, you can go ahead and create new controllers, models, views, etc. which don't exist in BASE_DIR and get Yioop to load them. +A Controller must implement the abstract method processRequest. The index.php script, after finishing its bootstrap process, calls the processRequest method of the Controller it chose to load. If this was your controller, the code in your controller should make use of data gotten out of the loaded models as well as data from the web request to do some calculations. Typically, to determine the calculation performed, the controller cleans and looks at $_REQUEST['a'], the request activity, and uses the method call($activity) to call a method that can handle the activity. When a controller is constructed it makes use of the global variable $COMPONENT_ACTIVITIES defined in configs/config.php to know which components have what activities. The call method checks if there is a Component responsible for the requested activity; if there is, it calls that Component's $activity method; otherwise, the method that handles $activity is assumed to come from the controller itself. The results of the calculations done in $activity would typically be put into an associative array $data. After the call method completes, processRequest would typically take $data and call the base Controller method displayView($view, $data). Here $view is whichever loaded view object you would like to display. + +To complete the picture of how Yioop eventually produces a web page or other output, we now describe how subclasses of the View class work. Subclasses of View have a field $layout and two methods helper($name) and element($name).
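Before turning to views, the processRequest()/call() dispatch just described can be sketched as below. The class name and activity names are hypothetical, and the component lookup against $COMPONENT_ACTIVITIES is reduced to a comment; this only illustrates the clean-then-route flow.

```php
<?php
// Sketch of the controller dispatch flow: clean $_REQUEST['a'],
// route it through call(), and collect a $data array for the view.
class MySketchController
{
    public $activities = ["query", "related"]; // hypothetical activities

    public function processRequest()
    {
        // Clean the requested activity; fall back to a default.
        $activity = isset($_REQUEST['a'])
            && in_array($_REQUEST['a'], $this->activities)
            ? $_REQUEST['a'] : "query";
        $data = $this->call($activity);
        return $data; // real code would end with $this->displayView($view, $data);
    }

    public function call($activity)
    {
        // Real Yioop first checks $COMPONENT_ACTIVITIES for a component
        // responsible for $activity; here the controller handles it itself.
        return $this->$activity();
    }

    public function query()   { return ["ACTIVITY" => "query"]; }
    public function related() { return ["ACTIVITY" => "related"]; }
}
```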
A subclass of View has at most one Layout and it is used for rendering the header and footer of the page. It is included and instantiated by setting $layout to be the name of the layout one wants to load. For example, $layout="web"; would load either the file APP_DIR/views/layouts/web_layout.php or BASE_DIR/views/layouts/web_layout.php. This file is expected to have in it a class WebLayout extending Layout. The constructor of a Layout takes as argument a view which it sets to an instance variable. The way Layouts get drawn is as follows: When the controller calls displayView($view, $data), this method does some initialization and then calls the render($data) method of the base View class. This in turn calls the render($data) method of whatever Layout was on the view. This render method then draws the header, then calls $this->view->renderView($data); to draw the view, and finally draws the footer. + +The methods helper($name) and element($name) of View load, instantiate if necessary, and return the Helper or Element $name in a similar fashion to the model($name) method of Controller. Elements have render($data) methods and can be used to draw out portions of pages which may be common across Views. Helpers, on the other hand, are typically used to render UI elements. For example, OptionsHelper has a render($id, $name, $options, $selected) method and is used to draw select dropdowns. + +When rendering a View or Element one often has css, scripts, images, videos, objects, etc. In BASE_DIR, the targets of these tags would typically be stored in the css, scripts, or resources folders. The APP_DIR/css, APP_DIR/scripts, and APP_DIR/resources folders are a natural place for them in your customized site. One wrinkle, however, is that APP_DIR, unlike BASE_DIR, doesn't have to be under your web server's DOCUMENT_ROOT. So how does one refer in a link to these folders?
To do this, one uses Yioop's ResourceController class, which can be invoked by a link like: + <img src="?c=resource&a=get&n=myicon.png&f=resources" /> +Here c=resource specifies the controller, a=get specifies the activity -- to get a file, n=myicon.png specifies we want the file myicon.png -- the value of n is cleaned to make sure it is a filename before being used, and f=resources specifies the folder -- f is allowed to be one of css, script, or resources. This would get the file APP_DIR/resources/myicon.png . + +This completes our description of the Yioop framework and how to build a new site using it. It should be pointed out that code in the APP_DIR can be localized using the same mechanism as in BASE_DIR. More details on this can be found in the section on [[Documentation#Localizing%20Yioop%20to%20a%20New%20Language|Localizing Yioop]]. + +[[Documentation#contents|Return to table of contents]]. + +===Embedding Yioop in an Existing Site=== + +One use-case for Yioop is to serve search results for your existing site. There are three common ways to do this: (1) On your site have a web-form or links with your installation of Yioop as their target and let Yioop format the results. (2) Use the same kind of form or links, but request an OpenSearch RSS Response from Yioop and then you format the results and display them within your site. (3) Your site makes function calls of the Yioop Search API and gets either PHP arrays or a string back and then does what it wants with the results. For access methods (1) and (2) it is possible to have Yioop on a different machine so that it doesn't consume your main web-site's machine resources. As we mentioned in the configuration section, it is possible to disable each of these access paths from within the Admin portion of the web-site. This might be useful, for instance, if you are using access methods (2) or (3) and don't want users to be able to access the Yioop search results via its built-in web form.
We will now spend a moment to look at each of these access methods in more detail... + +====Accessing Yioop via a Web Form==== + +A very minimal code snippet for such a form would be: + <form method="get" action='YIOOP_LOCATION'> + <input type="hidden" name="its" value="TIMESTAMP_OF_CRAWL_YOU_WANT" /> + <input type="hidden" name="l" value="LOCALE_TAG" /> + <input type="text" name="q" value="" /> + <button type="submit">Search</button> + </form> +In the above form, you should change YIOOP_LOCATION to your instance of Yioop's web location, TIMESTAMP_OF_CRAWL_YOU_WANT should be the Unix timestamp that appears in the name of the IndexArchive folder that you want Yioop to serve results from, and LOCALE_TAG should be the locale you want results displayed in, for example, en-US for American English. In addition to embedding this form on some page on your site, you would probably want to change the resources/yioop.png image to something more representative of your site. You might also want to edit the file views/search_view.php to give a link back to your site from the search results. + +If you had a form such as above, clicking Search would take you to the URL: + YIOOP_LOCATION?its=TIMESTAMP_OF_CRAWL_YOU_WANT&l=LOCALE_TAG&q=QUERY +where QUERY was what was typed in the search form. Yioop supports two other kinds of queries: Related sites queries and cache look-up queries. The related query format is: + YIOOP_LOCATION?its=TIMESTAMP_OF_CRAWL_YOU_WANT&l=LOCALE_TAG&a=related&arg=URL +where URL is the url that you are looking up related URLs for. To do a look-up of the Yioop cache of a web page, the url format is: + YIOOP_LOCATION?its=TIMESTAMP_OF_CRAWL_YOU_WANT&l=LOCALE_TAG&q=QUERY&a=cache&arg=URL +Here the terms listed in QUERY will be styled in different colors in the web page that is returned; URL is the url of the web page you want to look up in the cache.
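The three URL forms above can be built programmatically with PHP's http_build_query, which also takes care of url-encoding the arg parameter. The base string and timestamp below are placeholders, not real values.

```php
<?php
// Sketch: building the search, related, and cache query URLs described
// above. "YIOOP_LOCATION" and the timestamp are placeholder values.
$base   = "YIOOP_LOCATION";
$common = ["its" => 1317152828, "l" => "en-US"];

$search  = $base . "?" . http_build_query($common + ["q" => "art"]);
$related = $base . "?" . http_build_query($common + ["a" => "related",
    "arg" => "http://www.ucanbuyart.com/"]);
$cache   = $base . "?" . http_build_query($common + ["q" => "art",
    "a" => "cache", "arg" => "http://www.ucanbuyart.com/"]);

echo $search, "\n";
// YIOOP_LOCATION?its=1317152828&l=en-US&q=art
```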
+ +====Accessing Yioop and Getting an OpenSearch RSS or JSON Response==== + +The same basic urls as above can return RSS or JSON results simply by appending to the end of them &f=rss or &f=json. This of course only makes sense for usual and related url queries -- cache queries return web pages, not a list of search results. Here is an example of what a portion of an RSS result might look like: + + <?xml version="1.0" encoding="UTF-8" ?> + <rss version="2.0" xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/" + xmlns:atom="http://www.w3.org/2005/Atom" + > + <channel> + <title>PHP Search Engine - Yioop : art</title> + <language>en-US</language> + <link>http://localhost/git/yioop/?f=rss&amp;q=art&amp;its=1317152828</link> + <description>Search results for: art</description> + <opensearch:totalResults>1105</opensearch:totalResults> + <opensearch:startIndex>0</opensearch:startIndex> + <opensearch:itemsPerPage>10</opensearch:itemsPerPage> + <atom:link rel="search" type="application/opensearchdescription+xml" + href="http://localhost/git/yioop/yioopbar.xml"/> + <opensearch:Query role="request" searchTerms="art"/> + + <item> + <title> An Online Fine Art Gallery U Can Buy Art - + Buy Fine Art Online</title> + + <link>http://www.ucanbuyart.com/</link> + <description> UCanBuyArt.com is an online art gallery + and dealer designed... art gallery and dealer designed for art + sales of high quality and original... art sales of high quality + and original art from renowned artists. Art</description> + </item> + ... + ... + </channel> + </rss> + +Notice the opensearch: tags tell us the totalResults, startIndex and itemsPerPage. The opensearch:Query tag tells us what the search terms were. + +====Accessing Yioop via the Function API==== + +The last way we will consider to get search results out of Yioop is via its function API.
The Yioop Function API consists of the following three methods in controllers/search_controller.php : + function queryRequest($query, $results_per_page, $limit = 0) + + function relatedRequest($url, $results_per_page, $limit = 0, + $crawl_time = 0) + + function cacheRequest($url, $highlight=true, $terms ="", + $crawl_time = 0) +These methods handle basic queries, related queries, and cache of web page requests respectively. The arguments of the first two are reasonably self-explanatory. The $highlight and $terms arguments to cacheRequest are to specify whether or not you want syntax highlighting of any of the words in the returned cached web-page. If wanted, $terms should be a space separated list of terms. + +An example script showing what needs to be set-up before invoking these methods, as well as how to extract results from what is returned, can be found in the file examples/search_api.php . + +[[Documentation#contents|Return to table of contents]]. + +===Localizing Yioop to a New Language=== + +The Manage Locales activity can be used to configure Yioop for use with different languages and for different regions. If you decide to customize your Yioop installation by adding files to WORK_DIRECTORY/app as described in the [[Documentation#Building%20a%20Site%20using%20Yioop%20as%20Framework|Building a Site using Yioop as a Framework]] section, then the localization tools described in this section can also be used to localize your custom site. Clicking on the Manage Locales activity, one sees a page like: + +{{class="docs" +((resource:Documentation:ManageLocales.png|The Manage Locales form)) +}} + +The first form on this activity allows you to create a new locale -- an object representing a language and a region. The first field on this form should be filled in with a name for the locale in the language of the locale. So for French you would put Français. The locale tag should be the IETF language tag.
The '''Writing Mode''' element on the form is to specify how the language is written. There are four options: lr-tb -- from left-to-right, from the top of the page to the bottom, as in English; rl-tb -- from right-to-left, from the top of the page to the bottom, as in Hebrew and Arabic; tb-rl -- from the top of the page to the bottom, from right-to-left, as in Classical Chinese; and finally, tb-lr -- from the top of the page to the bottom, from left-to-right, as in non-cyrillic Mongolian or American Sign Language. lr-tb and rl-tb support works better than the vertical language support. As of this writing, Internet Explorer and WebKit based browsers (Chrome/Safari) have some vertical language support and the Yioop stylesheets for vertical languages still need some tweaking. For information on the status in Firefox check out this [[https://bugzilla.mozilla.org/show_bug.cgi?id=writing-mode|writing mode bug]]. Finally, the '''Locale Enabled''' checkbox controls whether or not to present the locale on the Settings Page. This allows you to choose only the locales you want for your website without having to delete the locale data for other locales you don't want now, but may want in the future as more translated strings become available. + +Beneath the Add Locale form is a table, alphabetical by locale tag, listing some of the current locales. The Show dropdown lets you control how many of these locales are displayed in one go. The Search link lets you bring up an advanced search form to search for particular locales and also allows you to control the direction of the listing. The first column of the Locale List table has a link with the name of the locale. Clicking on this link brings up a page where one can edit the strings for that locale. The next two columns of the Locale List table give the locale tag and writing direction of the locale; this is followed by the percent of strings translated.
Clicking the Edit link in the column lets one edit the locale tag and text direction of a locale. Finally, clicking the Delete link lets one delete a locale and all its strings. + +To translate string ids for a locale click on its name link. This should display the following forms and a table of string ids and their translated values: + +{{class="docs" +((resource:Documentation:EditLocaleStrings.png|The Edit Locales form)) +}} + +In the above case, the link for English was clicked. The Back link in the corner can be used to return to the previous form. The drop down controls whether to display all localizable strings or just those missing translations. The Filter field can be used to restrict the list of string ids presented to just those matching what is in this field. Beneath this dropdown, the Edit Locale page mainly consists of a two column table: the right column being string ids, the left column containing what should be their translation into the given locale. If no translation exists yet, the field will be displayed in red. String ids are extracted by Yioop automatically from controller, view, helper, layout, and element class files which are either in the Yioop Installation itself or in the installation WORK_DIRECTORY/app folder. Yioop looks for tl() function calls to extract ids from these files; for example, on seeing tl('search_view_query_results') Yioop would extract the id search_view_query_results; on seeing tl('search_view_calculated', $data['ELAPSED_TIME']) Yioop would extract the id 'search_view_calculated'. In the second case, the translation is expected to have a %s in it for the value of $data['ELAPSED_TIME']. Note %s is used regardless of the type, say int, float, string, etc., of $data['ELAPSED_TIME']. tl() can handle additional arguments; whenever an additional argument is supplied, an additional %s would be expected somewhere in the translation string.
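The %s substitution behavior just described can be sketched with a simplified, hypothetical stand-in for tl(). The translation table below is faked in memory; real Yioop looks translations up per locale. Note how vsprintf applies %s regardless of argument type.

```php
<?php
// Simplified sketch (not Yioop's tl()) of id lookup plus %s substitution.
function tl_sketch($string_id, ...$args)
{
    $translations = [ // stand-in for the current locale's translations
        "search_view_query_results" => "Query Results",
        "search_view_calculated" => "Results calculated in %s seconds",
    ];
    $msg = isset($translations[$string_id]) ? $translations[$string_id]
        : $string_id; // fall back to the id itself if untranslated
    return vsprintf($msg, $args); // one %s consumed per extra argument
}

echo tl_sketch("search_view_calculated", 0.42), "\n";
// Results calculated in 0.42 seconds
```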
If you make a set of translations, be sure to submit the form associated with this table by scrolling to the bottom of the page and clicking the Submit link. This saves your translations; otherwise, your work will be lost if you navigate away from this page. One aid to translating is that if you hover your mouse over a field that needs translation, then its translation in the default locale (usually English) is displayed. If you want to find where in the source code a string id comes from, note that ids follow the rough convention file_name_approximate_english_translation. So you would expect to find admin_controller_login_successful in the file controllers/admin_controller.php . String ids with the prefix db_ (such as the names of activities) are stored in the database, so you cannot find these ids in the source code. The tooltip trick mentioned above does not work for database string ids. + +====Localizing Wiki Pages==== +When a user goes to a wiki page with a URL such as + YIOOP_LOCATION?c=group&group_id=some_integer&a=wiki&arg=read&page_name=Some_Page_Name +or + YIOOP_LOCATION?c=admin&group_id=some_integer&a=wiki&arg=read&page_name=Some_Page_Name +or, for the public group, possibly with + YIOOP_LOCATION?c=static&p=Some_Page_Name +the page that is displayed is in the locale that has been most recently set for the user. If no locale was set, then Yioop tries to determine the locale based on browser header info, and if this fails, falls back to the Default Locale set when Yioop was configured. When one edits a wiki page, the locale that one is editing the page for is displayed under the page name, such as en-US in the image below: +{{class="docs" +((resource:Documentation:LocaleOnWikiPage.png|Locale on a wiki page)) +}} +To edit the page for a different locale, choose the locale you want using the Settings page while logged in and then navigate to the wiki page you would like to edit (using the same name from the original language).
Suppose you were editing the Dental_Floss page in the en-US locale. To make the French page, you click Settings on the top bar of Yioop, go to your account settings, and choose French (fr-FR) as the language. Now one would navigate back to the wiki you were on, to the Dental_Floss page, which doesn't exist for French. You could click Edit now and make the French page at this location, but this would be sub-optimal as the French word for dental floss is dentrifice. So instead, on the fr-FR Dental_Floss edit page, you edit the page Settings to make this page a Page Alias for Dentrifice, and then create and edit the French Dentrifice article. If a user then starts on the English version of the page and switches locales to French they will end up on the Dentrifice page. You should also set up the page alias in the reverse direction as well, to handle when someone starts on the French Dentrifice page and switches to the en-US Dentrifice. + +====Adding a stemmer, segmenter or supporting character n-gramming for your language==== + +Depending on the language you are localizing to, it may make sense to write a stemmer for words that will be inserted into the index. A stemmer takes inflected or sometimes derived words and reduces them to their stem. For instance, jumps and jumping would be reduced to jump in English. As Yioop crawls, it attempts to detect the language of a given web page it is processing. If a stemmer exists for this language, it will call the Tokenizer class's stem($word) method on each word it extracts from the document before inserting information about it into the index. Similarly, if an end-user is entering a simple conjunctive search query and a stemmer exists for their language settings, then the query terms will be stemmed before being looked up in the index. Currently, Yioop comes with stemmers for English, French, German, Italian, and Russian.
The English stemmer uses the Porter Stemming Algorithm [ [[Documentation#P1980|P1980]]], the other stemmers are based on the algorithms presented at snowball.tartarus.org. Stemmers should be written as a static method located in the file WORK_DIRECTORY/locale/en-US/resources/tokenizer.php . The [[snowball.tartarus.org]] link points to a site that has source code for stemmers for many other languages (unfortunately, not written in PHP). It would not be hard to port these to PHP and then modify the tokenizer.php file of the appropriate locale folder. For instance, one could modify the file WORK_DIRECTORY/locale/pt/resources/tokenizer.php to contain a class PtTokenizer with a static method stem($word) if one wanted to add a stemmer for Portuguese. + +The class inside tokenizer.php can also be used by Yioop to do word segmentation. This is the process of splitting a string of words without spaces in some language into its component words. Yioop comes with an example segmenter for the zh-CN (Chinese) locale. It works by starting at the end of the string and trying to greedily find the longest word that can be matched with the portion of the suffix of the string that has not been processed yet (reverse maximal match). To do this it makes use of a word Bloom filter as part of how it detects if a string is a word or not. We describe how to make such a filter using token_tool.php in a moment. + +In addition to supporting the ability to add stemmers and segmenters, Yioop also supports a default technique which can be used in lieu of a stemmer called character n-grams. When used, this technique segments text into sequences of n characters which are then stored in Yioop as terms. For instance, if n were 3 then the word "thunder" would be split into "thu", "hun", "und", "nde", and "der" and each of these would be associated with the document that contained the word thunder.
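The splitting just described is simple to sketch in PHP. The function below is single-byte only for brevity; a real implementation would have to be UTF-8 aware (e.g., via mb_ functions) to handle Chinese or Japanese text.

```php
<?php
// Sketch of character n-gramming: split a word into all of its
// length-$n substrings. With $n = 3, "thunder" gives the five
// trigrams listed in the text above.
function charGrams($word, $n)
{
    $grams = [];
    $len = strlen($word); // single-byte sketch; real code must be UTF-8 aware
    for ($i = 0; $i + $n <= $len; $i++) {
        $grams[] = substr($word, $i, $n);
    }
    return $grams;
}

print_r(charGrams("thunder", 3));
// ["thu", "hun", "und", "nde", "der"]
```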
N-grams are useful for languages like Chinese and Japanese in which words in the text are often not separated with spaces. They are also useful for languages like German which can have long compound words. The drawback of n-grams is that they tend to make the index larger. For Yioop built-in locales that do not have a stemmer, the file WORK_DIRECTORY/locale/LOCALE-TAG/resources/tokenizer.php has a line of the form $CHARGRAMS['LOCALE_TAG'] = SOME_NUMBER; This number is the length of string to use in doing char-gramming. If you add a language to Yioop and want to use char gramming, merely add a tokenizer.php to the corresponding locale folder with such a line in it. + +{{id='token_tool' +====Using token_tool.php to improve search performance and relevance for your language==== +}} + +configs/token_tool.php is used to create suggest word dictionaries and word filter files for the Yioop search engine. To create either of these items, the user puts a source file in Yioop's WORK_DIRECTORY/prepare folder. Suggest word dictionaries are used to supply the content of the dropdown of search terms that appears as a user is entering a query in Yioop. They are also used to do spell correction suggestions after a search has been performed. To make a suggest dictionary one can use a command like: + + php token_tool.php dictionary filename locale endmarker + +Here ''filename'' should be in the current folder or PREP_DIR, locale is the locale (for example, en-US) this suggest file is being made for and where a file suggest_trie.txt.gz will be written, and endmarker is the end of word symbol to use in the trie. For example, $ works pretty well. The format of ''filename'' should be a sequence of lines, each line containing a word or phrase followed by a space followed by a frequency count. i.e., the last thing on the line should be a number. Given a corpus of documents, a frequency for a word would be the number of occurrences of that word in the corpus.
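A file in the "word frequency" line format token_tool.php expects can be produced with a few lines of PHP. The tiny in-memory corpus below is illustrative; a real corpus would be read from files placed in WORK_DIRECTORY/prepare.

```php
<?php
// Sketch: counting word frequencies in a (toy) corpus and emitting the
// "word frequency" lines token_tool.php's dictionary command consumes.
$corpus = "the cat sat on the mat the end";
$counts = array_count_values(str_word_count(strtolower($corpus), 1));
arsort($counts); // most frequent words first

$lines = "";
foreach ($counts as $word => $freq) {
    $lines .= "$word $freq\n"; // e.g., "the 3"
}
echo $lines;
```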
+ +token_tool.php can also be used to make filter files used by a word segmenter. To make a filter file token_tool.php is run from the command line as: + php token_tool.php segment-filter dictionary_file locale + +Here dictionary_file should be a text file with one word/line, locale is the IANA language tag of the locale to store the results for. +====Obtaining data sets for token_tool.php==== + +Many word lists with frequencies are obtainable on the web for free with Creative Commons licenses. A good starting point is: + [[http://en.wiktionary.org/wiki/Wiktionary:Frequency_lists]] +A little script-fu can generally take such a list and output it with the line format of "word/phrase space frequency" needed by token_tool.php and as the word/line format used for filter files. + +====Spell correction and romanized input with locale.js==== + +Yioop supports the ability to suggest alternative queries after a search is performed. These queries are mainly restricted to fixing typos in the original query. In order to calculate these spelling corrections, Yioop takes the query and for each query term computes each possible single character change to that term. For each of these it looks up in the given locale's suggest_trie.txt.gz a frequency count of that variant, if it exists. If the best suggestion is some multiple better than the frequency count of the original query then Yioop suggests this alternative query. In order for this to work, Yioop needs to know what constitutes a single character in the original query. The file locale.js in the WORK_DIRECTORY/locale/LOCALE_TAG/resources folder can be used to specify this for the locale given by LOCALE_TAG. To do this, all you need to do is specify a Javascript variable alpha. For example, for French (fr-FR) this looks like: + var alpha = "aåàbcçdeéêfghiîïjklmnoôpqrstuûvwxyz"; +The letters do not have to be in any alphabetical order, but should be comprehensive of the non-punctuation symbols of the language in question. 
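Combining an alpha string with suggest-dictionary frequency counts, the spelling-correction idea described above can be sketched as follows. This is a simplified stand-in, not Yioop's code: it tries single-character substitutions only (Yioop's notion of a single character change may be broader, and it also requires the best variant to be some multiple better), it is single-byte, and the frequency table is faked.

```php
<?php
// Sketch: generate single-character substitution variants of a term
// and keep the variant with the highest (toy) dictionary frequency.
function bestVariant($term, $alpha, $freq)
{
    $best = $term;
    $best_count = isset($freq[$term]) ? $freq[$term] : 0;
    for ($i = 0; $i < strlen($term); $i++) {       // each position...
        foreach (str_split($alpha) as $c) {        // ...each alpha char
            $variant = substr_replace($term, $c, $i, 1);
            if (isset($freq[$variant]) && $freq[$variant] > $best_count) {
                $best = $variant;
                $best_count = $freq[$variant];
            }
        }
    }
    return $best; // real code also checks a "some multiple better" threshold
}

$freq = ["art" => 1105, "ant" => 40]; // toy frequency counts
echo bestVariant("azt", "abcdefghijklmnopqrstuvwxyz", $freq), "\n"; // art
```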
+ +Another thing locale.js can be used for is to give mappings between roman letters and other scripts for use in Yioop's autosuggest dropdown that appears as you type a query. As you type, the scripts/suggest.js function onTypeTerm is called. This in turn will call a particular locale's locale.js function transliterate(query), if it exists. This function should return a string with the result of the transliteration. An example of doing this is given for the Telugu locale in Yioop. + +====Thesaurus Results and Part of Speech Tagging==== + +As mentioned in the [[Documentation#Search%20Basics|Search Basics]] topic, for some queries Yioop displays a list of related queries to one side of the search results. These are obtained from a "computer thesaurus". In this subsection, we describe how to enable this facility for English and how you could add this functionality for other languages. If enabled, the thesaurus can also be used to modify search ranking as described in the [[Ranking#Final%20Reordering|Final Reordering]] section of the Yioop Ranking Mechanisms document. + +In order to generate suggested related queries, Yioop first tags the original query terms according to part of speech. For en-US, this is done by calling the method tagTokenizePartOfSpeech($text) in WORK_DIRECTORY/locale/en-US/resources/tokenizer.php. For en-US, a simple Brill tagger (see the Ranking document for more info) is implemented to do this. After this method is called, the terms in $text should have a suffix ~part-of-speech, where part-of-speech is one of NN for noun, VB for verb, AJ for adjective, AV for adverb, or some other value (which would be ignored by Yioop). For example, the noun dog might become dog~NN after tagging. To localize to another language, this method in the corresponding tokenizer.php file would need to be implemented.
+ +The second method needed for Thesaurus results is scoredThesaurusMatches($term, $word_type, $whole_query), which should also be in tokenizer.php for the desired locale. Here $term is a term (without a part-of-speech tag), $word_type is the part of speech (one of the ones listed above), and $whole_query is the original query. The output of this method should be an array of (score => array of thesaurus terms) associations, each score representing one word sense of the term. In the case of English, this method is implemented using [[http://wordnet.princeton.edu/|WordNet]]. So for thesaurus results to work for English, WordNet needs to be installed, and in either the config.php file or local_config.php you need to define the constant WORDNET_EXEC to be the path to the WordNet executable on your file system. On a Linux or OSX system, this might be something like: /usr/local/bin/wn . + +====Using Stop Words to improve Centroid Summarization==== + +While crawling, Yioop makes use of a summarizer to extract the important portions of the web page both for indexing and for search result snippet purposes. There are two summarizers that come with Yioop: a Basic summarizer, which uses an ad hoc approach to finding the most important parts of the document, and a centroid summarizer, which tries to compute an "average sentence" for the document and uses this to pick representative sentences based on nearness to this average. The summarizer that is used can be set under the Crawl Time tab of [[Documentation#Page%20Indexing%20and%20Search%20Options|Page Options]]. This latter summarizer works better if certain common words (stop words) from the document's language are removed. When using the centroid summarizer, Yioop checks to see if tokenizer.php for the current locale contains a method stopwordsRemover($page). If it does, it calls it; this method takes a string of words and returns a string with all the stop words removed.
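A stopwordsRemover implementation can be quite simple. Here is an illustrative Python sketch with a tiny made-up stop word set; the real en-US tokenizer.php method works on a much larger list:

```python
# Tiny sample stop word set; a real locale would use a much larger list.
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "to", "is"}

def stop_words_remover(page):
    """Return the text of page with all stop words removed, as the
    tokenizer.php method stopwordsRemover($page) is expected to do."""
    return " ".join(word for word in page.split()
                    if word.lower() not in STOP_WORDS)
```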
This method exists for en-US, but, if desired, could also be implemented for other locales to improve centroid summarization. + +[[Documentation#contents|Return to table of contents]]. + +==Advanced Topics== +===Modifying Yioop Code=== +One advantage of an open-source project is that you have complete access to the source code. Thus, Yioop can be modified to fit in with your existing project. You can also freely add new features onto Yioop. In this section, we look a little bit at some of the common ways you might try to modify Yioop, as well as ways to examine the output of a crawl in a more technical manner. If you decide to modify the source code, it is recommended that you look at the [[Documentation#Summary%20of%20Files%20and%20Folders|Summary of Files and Folders]] above again, as well as at the [[http://www.seekquarry.com/yioop-docs/|online Yioop code documentation]]. + +====Handling new File Types==== +One relatively easy enhancement to Yioop is to change the way it processes an existing file type or to get it to process new file types. Yioop was written from scratch without dependencies on existing projects, so the PHP processors for Microsoft file formats and for PDF are only approximate. These processors can be found in lib/processors. To write your own processor, you should extend either the TextProcessor or ImageProcessor class. You then need to write in your subclass a static method process($page, $url). Here $page is a string representation of a downloaded document of the file type you are going to handle and $url is the canonical url from which this page was downloaded. This method should return an array of the format: + $summary['TITLE'] = a title for the document + $summary['DESCRIPTION'] = a text summary extracted from the document + $summary['LINKS'] = an array of links (canonical not relative) extracted + from the document. +A good reference implementation of a TextProcessor subclass can be found in html_processor.php.
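To make the expected return format concrete, here is an illustrative Python sketch of a process-style summary extractor for a hypothetical plain-text format. Yioop's real processors are PHP classes extending TextProcessor or ImageProcessor; this only mirrors the shape of the summary array:

```python
def process(page, url):
    """Extract a summary (TITLE, DESCRIPTION, LINKS) from a downloaded
    plain-text page: first non-blank line as title, the rest as the
    description, and any http(s) words as links."""
    lines = [line for line in page.strip().splitlines() if line.strip()]
    return {
        "TITLE": lines[0] if lines else url,
        "DESCRIPTION": " ".join(lines[1:])[:500],
        "LINKS": [word for word in page.split()
                  if word.startswith("http://") or word.startswith("https://")],
    }
```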
If you are trying to support a new file type, then to get Yioop to use your processor you need to add lines to some global variables at the top of the file. You should add the extension of the file type you are going to process to the array $INDEXED_FILE_TYPES. You will also need to add an entry $PAGE_PROCESSORS["new_mime_type_handle"] = "NewProcessor". As an example, these are the relevant lines at the top of ppt_processor.php: + $INDEXED_FILE_TYPES[] = "ppt"; + $PAGE_PROCESSORS["application/vnd.ms-powerpoint"] = "PptProcessor"; +If your processor is cool, only relies on code you wrote, and you want to contribute it back to Yioop, please feel free to e-mail it to chris@pollett.org . + +====Writing an Indexing Plugin==== +An indexing plugin provides a way that an advanced end-user can extend the indexing capabilities of Yioop. Bundled with Yioop are three example indexing plugins. These are found in the lib/indexing_plugins folder. We will discuss the code for the recipe and word filter plugins here. The code for the address plugin, used to extract snail mail addresses from web pages, follows the same kind of structure. If you decide to write your own plugin or want to install a third-party plugin, you can put it in the folder: WORK_DIRECTORY/app/lib/indexing_plugins. The recipe indexing plugin can serve as a guide for writing your own plugin if you don't need your plugin to have a configure screen. The recipe plugin is used to detect food recipes which occur on pages during a crawl. It creates "micro-documents" associated with found recipes. These are stored in the index during the crawl under the meta-word "recipe:all". After the crawl is over, the recipe plugin's postProcessing method is called. It looks up all the documents associated with the word "recipe:all". It extracts ingredients from these and does clustering of recipes based on ingredient.
It finally injects new meta-words of the form "ingredient:some_food_ingredient", which can be used to retrieve recipes most closely associated with a given ingredient. As it is written, the recipe plugin assumes that all the recipes can be read into memory in one go, but one could easily imagine reading through the list of recipes in batches of an amount that could fit in memory. + +The recipe plugin illustrates the kinds of things that can be written using indexing plugins. To make your own plugin, you would need to write a subclass of the class IndexingPlugin with a file name of the form mypluginname_plugin.php. Then you would need to put this file in the folder WORK_DIRECTORY/app/lib/indexing_plugins. RecipePlugin subclasses IndexingPlugin and implements the following four methods: pageProcessing($page, $url), postProcessing($index_name), getProcessors(), and getAdditionalMetaWords(), so that they do not have the default behavior of returning NULL. We explain what each of these is for in a moment. During a web crawl, after a fetcher has downloaded a batch of web pages, it uses a page's mimetype to determine a page processor class to extract summary data from that page. The page processors that Yioop implements can be found in the folder lib/processors. They have file names of the form someprocessorname_processor.php. As a crawl proceeds, your plugin will typically be called to do further processing of a page only for pages handled by some subset of these processors. The static method getProcessors() should return an array of the form array( "someprocessorname1", "someprocessorname2", ...), listing the processors that your plugin will do additional processing of documents for. A page processor has a method handle($page, $url) called by Yioop with a string $page of a downloaded document and a string $url of where it was downloaded from.
This method first calls the process($page, $url) method of the processor to do initial summary extraction and then calls the method pageProcessing($page, $url) of each indexing plugin associated with the given processor. A pageProcessing($page, $url) method is expected to return an array of subdoc arrays found on the given page. Each subdoc array should have a CrawlConstants::TITLE and a CrawlConstants::DESCRIPTION. The handle method of a processor will add to each subdoc the fields: CrawlConstants::LANG, CrawlConstants::LINKS, CrawlConstants::PAGE, CrawlConstants::SUBDOCTYPE. The SUBDOCTYPE is the name of the plugin. The resulting "micro-document" is inserted by Yioop into the index under the word nameofplugin:all . After the crawl is over, Yioop will call the postProcessing($index_name) method of each indexing plugin that was in use. Here $index_name is the timestamp of the crawl. Your plugin can do whatever post processing it wants in this method. For example, the recipe plugin does searches of the index and uses the results of these searches to inject new meta-words into the index. In order for Yioop to be aware of the meta-words you are adding, you need to implement the method getAdditionalMetaWords(). Also, the web snippet you might want in the search results for things like recipes might be longer or shorter than a typical result snippet. The getAdditionalMetaWords() method also tells Yioop this information. For example, for the recipe plugin, getAdditionalMetaWords() returns the associative array: + array("recipe:" => HtmlProcessor::MAX_DESCRIPTION_LEN, + "ingredient:" => HtmlProcessor::MAX_DESCRIPTION_LEN); +The WordFilterPlugin illustrates how one can write an indexing plugin with a configure screen.
It overrides the base class' pageSummaryProcessing(&$summary) and getProcessors() methods as well as implements the methods saveConfiguration($configuration), loadConfiguration(), setConfiguration($configuration), configureHandler(&$data), and configureView(&$data). The purpose of getProcessors() was already mentioned under the recipe plugin description above. pageSummaryProcessing(&$summary) is called by a page processor after a page has been processed and a summary generated. WordFilterPlugin uses this callback to check if the title or the description in this summary has any of the words the filter is filtering for and, if so, takes the appropriate action. loadConfiguration, saveConfiguration($configuration), and setConfiguration are three methods to handle persistence for any plugin data that the user can change. The first two operate on the name server; the last might operate on a queue_server or a fetcher. loadConfiguration is called by configureHandler(&$data) to read in any current configuration, unserialize it, and modify it according to any data sent by the user. saveConfiguration($configuration) would then be called by configureHandler(&$data) to serialize and write any $configuration data that needs to be stored by the plugin. For WordFilterPlugin, a list of filter terms and actions is what is saved by saveConfiguration($configuration) and loaded by loadConfiguration. When a crawl is started or when a fetcher contacts the name server, plugin configuration data is sent by the name server. The method setConfiguration($configuration) is used to initialize the local copy of a fetcher's or queue_server's process with the configuration settings from the name server. For WordFilterPlugin, the filter terms and actions are stored in a field variable by this function. + +As has already been hinted at by the configuration discussion above, configureHandler(&$data) plays the role of a controller for an index plugin.
It is in fact called by the AdminController activity pageOptions if the configure link for a plugin has been clicked. In addition to managing the load and save configuration process, it also sets up any data needed by configureView(&$data). For WordFilterPlugin, this involves setting a variable $data["filter_words"] so that configureView(&$data) has access to a list of filter words and actions to draw. Finally, the last method of the WordFilterPlugin we describe, configureView(&$data), outputs using $data the HTML that will be seen in the configure screen. This HTML will appear in a div tag on the final page. It is initially styled so that it is not displayed. Clicking on the configure link will cause the div tag data to be displayed in a light box in the center of the screen. For WordFilterPlugin, this method draws a title and a textarea form with the currently filtered terms in it. It makes use of Yioop's tl() functions so that the text of the title can be localized to different languages. This form has the hidden fields c=admin, a=pageOptions, and option-type=crawl_time, so that the AdminController will know to call pageOptions, and pageOptions will know in turn to give the plugin's configureHandler method a chance to handle this data. + +[[Documentation#contents|Return to table of contents]]. + +===Yioop Command-line Tools=== +In addition to [[Documentation#token_tool|token_tool.php]], which we describe in the section on localization, and to [[Documentation#configs|export_public_help_db.php]], which we describe in the section on the Yioop folder structure, Yioop comes with several useful command-line tools and utilities.
We next describe these in roughly their order of likely utility: + +* [[Documentation#configure_tool|bin/configure_tool.php]]: Used to configure Yioop from the command-line +* [[Documentation#arc_tool|bin/arc_tool.php]]: Used to examine the contents of WebArchiveBundles and IndexArchiveBundles +* [[Documentation#query_tool|bin/query_tool.php]]: Used to query an index from the command-line +* [[Documentation#code_tool|bin/code_tool.php]]: Used to help code Yioop and to help make clean patches for Yioop. +* [[Documentation#classifier_tool|bin/classifier_tool.php]]: Used to make a Yioop classifier from the command line rather than using the GUI interface. + +{{id='configure_tool' +====Configuring Yioop from the Command-line==== +}} + +In a multiple queue server and fetcher setting, one might have web access only to the name server machine -- all the other machines might be on virtual private servers to which one has only command-line access. Hence, it is useful to be able to set up a work directory and configure Yioop through the command-line. To do this, one can use the script configs/configure_tool.php. One can run it from the command-line within the configs folder, with a line like: + php configure_tool.php +When launched, this program will display a menu like: + Yioop CONFIGURATION TOOL + +++++++++++++++++++++++++ + + Checking Yioop configuration... + =============================== + Check Passed. + Using configs/local_config.php so changing work directory above may not work. + =============================== + + Available Options: + ================== + (1) Create/Set Work Directory + (2) Change root password + (3) Set Default Locale + (4) Debug Display Set-up + (5) Search Access Set-up + (6) Search Page Elements and Links + (7) Name Server Set-up + (8) Crawl Robot Set-up + (9) Exit program + + Please choose an option: +Except for the Change root password option, these correspond to the different fieldsets on the Configure activity.
The command-line forms one gets from selecting one of these choices let one set the same values as were described earlier in the [[Documentation#Installation%20and%20Configuration|Installation]] section. The change root password option lets one set the account password for root, i.e., the main admin user. On a non-nameserver machine, it is probably simpler to go with a sqlite database, rather than hit a global mysql database from each machine. Such a barebones local database set-up would typically only have one user, root. + +Another thing to consider when configuring a collection of Yioop machines in such a setting is that, by default, under Search Access Set-up, subsearch is unchecked. This means the RSS feeds won't be downloaded hourly on such machines. If one checks it, they will be. + +{{id='arc_tool' +====Examining the contents of WebArchiveBundles and IndexArchiveBundles==== +}} + +The command-line script bin/arc_tool.php can be used to examine and manipulate the contents of a WebArchiveBundle or an IndexArchiveBundle. Below is a summary of the different command-line uses of arc_tool.php: + +;'''php arc_tool.php count bundle_name''' + or +'''php arc_tool.php count bundle_name save''' : returns the counts of docs and links for each shard in the bundle as well as an overall total. The second command saves the just computed count into the index description (can be used to fix the index count if it gets screwed up). +; '''php arc_tool.php dict bundle_name word''' : returns index dictionary records for word stored in index archive bundle. +; '''php arc_tool.php info bundle_name''' : returns info about documents stored in archive. +; '''php arc_tool.php inject timestamp file''' : injects the urls in file as a schedule into crawl of given timestamp. This can be used to make a closed index unclosed and to allow for continued crawling. +; '''php arc_tool.php list''' : returns a list of all the archives in the Yioop! crawl directory, including non-Yioop!
archives in the /archives sub-folder. +; '''php arc_tool.php mergetiers bundle_name max_tier''' : merges tiers of the word dictionary into one tier up to max_tier +; '''php arc_tool.php posting bundle_name generation offset''' + or +'''php arc_tool.php posting bundle_name generation offset num''' : returns info about the posting (num many postings) in bundle_name at the given generation and offset +; '''php arc_tool.php rebuild bundle_name''' : Re-extracts words from summaries files in bundle_name into index shards, then builds a new dictionary +; '''php arc_tool.php reindex bundle_name''' : Reindexes the word dictionary in bundle_name using existing index shards +; '''php arc_tool.php shard bundle_name generation''' : Prints information about the number of words and frequencies of words within the generation'th index shard in the bundle +; '''php arc_tool.php show bundle_name start num''' : outputs items start through num from bundle_name or the named non-Yioop archive crawl folder + +The bundle name can be a full path name, a relative path from the current directory, or just the bundle directory's file name, in which case WORK_DIRECTORY/cache will be assumed to be the bundle's location. The following are some examples of using arc_tool. Recall a backslash in a Unix/OSX terminal is the line continuation character, so we can imagine lines where it is indicated below as being all on one line. They are not all from the same session: + |chris-polletts-macbook-pro:bin:108>php arc_tool.php list + Found Yioop Archives: + ===================== + 0-Archive1334468745 + 0-Archive1336527560 + IndexData1334468745 + IndexData1336527560 + + Found Non-Yioop Archives: + ========================= + english-wikipedia2012 + chris-polletts-macbook-pro:bin:109> + + ...
+ + |chris-polletts-macbook-pro:bin:158>php arc_tool.php info \ + /Applications/XAMPP/xamppfiles/htdocs/crawls/cache/IndexData1293767731 + + Bundle Name: IndexData1293767731 + Bundle Type: IndexArchiveBundle + Description: test + Number of generations: 1 + Number of stored links and documents: 267260 + Number of stored documents: 16491 + Crawl order was: Page Importance + Seed sites: + http://www.ucanbuyart.com/ + http://www.ucanbuyart.com/fine_art_galleries.html + http://www.ucanbuyart.com/indexucba.html + Sites allowed to crawl: + domain:ucanbuyart.com + domain:ucanbuyart.net + Sites not allowed to be crawled: + domain:arxiv.org + domain:ask.com + Meta Words: + http://www.ucanbuyart.com/(.+)/(.+)/(.+)/(.+)/ + + |chris-polletts-macbook-pro:bin:159> + |chris-polletts-macbook-pro:bin:202>php arc_tool.php show \ + /Applications/XAMPP/xamppfiles/htdocs/crawls/cache/Archive1293767731 0 3 + + BEGIN ITEM, LENGTH:21098 + [URL] + http://www.ucanbuyart.com/robots.txt + [HTTP RESPONSE CODE] + 404 + [MIMETYPE] + text/html + [CHARACTER ENCODING] + ASCII + [PAGE DATA] + <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" + "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> + + <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"> + + <head> + <base href="http://www.ucanbuyart.com/" /> + </pre> + ... + + |chris-polletts-macbook-pro:bin:117>php arc_tool.php reindex IndexData1317414152 + + Shard 0 + [Sat, 01 Oct 2011 11:05:17 -0700] Adding shard data to dictionary files... + [Sat, 01 Oct 2011 11:05:28 -0700] Merging tiers of dictionary + + Final Merge Tiers + + Reindex complete!! + +The mergetiers command is like a partial reindex. It assumes all the shard words have been added to the dictionary, but that the dictionary still has more than one tier (tiers are the result of incremental log-merges which are made during the crawling process). The mergetiers command merges these tiers into one large tier which is then usable by Yioop for query processing. 
+ +{{id='query_tool' +====Querying an Index from the command-line==== +}} + +The command-line script bin/query_tool.php can be used to query indices in the Yioop WORK_DIRECTORY/cache. This tool can be used on an index regardless of whether or not Apache is running. It can be used for long-running queries that might time out when run within a browser, to put their results into memcache or filecache. The command-line arguments for the query tool are: + php query_tool.php query num_results start_num lang_tag +The default num_results is 10, start_num is 0, and lang_tag is en-US. The following shows how one could do a query on "Chris Pollett": + + |chris-polletts-macbook-pro:bin:141>php query_tool.php "Chris Pollett" + + ============ + TITLE: ECCC - Pointers to + URL: http://eccc.hpi-web.de/static/pointers/personal_www_home_pages_of_complexity_theorists/ + IPs: 141.89.225.3 + DESCRIPTION: Homepage of the Electronic Colloquium on Computational Complexity located + at the Hasso Plattner Institute of Potsdam, Germany Personal WWW pages of + complexity people 2011 2010 2009 2011...1994 POINTE + Rank: 3.9551158411 + Relevance: 0.492443777769 + Proximity: 1 + Score: 4.14 + ============ + + ============ + TITLE: ECCC - Pointers to + URL: http://www.eccc.uni-trier.de/static/pointers/personal_www_home_pages_of_complexity_theorists/ + IPs: 141.89.225.3 + DESCRIPTION: Homepage of the Electronic Colloquium on Computational Complexity located + at the Hasso Plattner Institute of Potsdam, Germany Personal WWW pages of + complexity people 2011 2010 2009 2011...1994 POINTE + Rank: 3.886318974 + Relevance: 0.397622570289 + Proximity: 1 + Score: 4.03 + ============ + + ..... + +The index the results are returned from is the default index; however, all of the Yioop meta words should work, so you can do queries like "my_query i:timestamp_of_index_want".
Query results depend on the kind of language stemmer/char-gramming being used, so French results might be better if one specifies fr-FR than if one relies on the default en-US. + +{{id='code_tool' +====A Tool for Coding and Making Patches for Yioop==== +}} + +'''bin/code_tool.php''' can perform several useful tasks to help developers program for the Yioop environment. Below is a brief summary of its functionality: + +;'''php code_tool.php clean path''' : Replaces all tabs with four spaces and trims all whitespace off ends of lines in the folder or file path. +;'''php code_tool.php copyright path''' : Adjusts all lines in the files in the folder at path (or if path is a file, just that) of the form 2009 - \d\d\d\d to the form 2009 - this_year where this_year is the current year. +;'''php code_tool.php longlines path''' : Prints out all lines in files in the folder or file path which are longer than 80 characters. +;'''php code_tool.php replace path pattern replace_string''' + or +'''php code_tool.php replace path pattern replace_string effect''' : Prints all lines matching the regular expression pattern followed by the result of replacing pattern with replace_string in the folder or file path. Does not change files. +;'''php code_tool.php replace path pattern replace_string interactive''' : Prints each line matching the regular expression pattern followed by the result of replacing pattern with replace_string in the folder or file path. Then it asks if you want to update the line. Lines you choose for updating will be modified in the files. +;'''php code_tool.php replace path pattern replace_string change''' : Each line matching the regular expression pattern is updated by replacing pattern with replace_string in the folder or file path. This format does not echo anything; it does a global replace without interaction. +;'''php code_tool.php search path pattern''' : Prints all lines matching the regular expression pattern in the folder or file path.
+ +{{id='classifier_tool' +====A Command-line Tool for making Yioop Classifiers==== +}} + +'''bin/classifier_tool.php''' is used to automate the building and testing of classifiers, providing an alternative to the web interface when a labeled training set is available. + +'''classifier_tool.php''' takes an activity to perform, the name of a dataset to use, and a label for the constructed classifier. The activity is the name of one of the 'run*' functions implemented by this class, without the common 'run' prefix (e.g., 'TrainAndTest'). The dataset is specified as the common prefix of two indexes that have the suffixes "Pos" and "Neg", respectively. So if the prefix were "DATASET", then this tool would look for the two existing indexes "DATASET Pos" and "DATASET Neg" from which to draw positive and negative examples. Each document in these indexes should be a positive or negative example of the target class, according to whether it's in the "Pos" or "Neg" index. Finally, the label is just the label to be used for the constructed classifier. + +Beyond these options (set with the -a, -d, and -l flags), a number of other options may be set to alter parameters used by an activity or a classifier. These options are set using the -S, -I, -F, and -B flags, which correspond to string, integer, float, and boolean parameters respectively. These flags may be used repeatedly, and each expects an argument of the form NAME=VALUE, where NAME is the name of a parameter, and VALUE is a value parsed according to the flag. The NAME should match one of the keys of the options member of this class, where a period ('.') may be used to specify nesting. 
For example: + -I debug=1 # set the debug level to 1 + -B cls.use_nb=0 # tell the classifier to use Naive Bayes +To build and evaluate a classifier for the label 'spam', trained using the two indexes "DATASET Neg" and "DATASET Pos", and a maximum of the top 25 most informative features: + php bin/classifier_tool.php -a TrainAndTest -d 'DATASET' -l 'spam' + -I cls.chi2.max=25 + +==References== + +;{{id='APC2003' '''[APC2003]'''}} : Serge Abiteboul and Mihai Preda and Gregory Cobena. [[http://leo.saclay.inria.fr/publifiles/gemo/GemoReport-290.pdf|Adaptive on-line page importance computation]]. In: Proceedings of the 12th international conference on World Wide Web. pp.280-290. 2003. +;{{id='B1970' '''[B1970]'''}} : Bloom, Burton H. [[http://www.lsi.upc.edu/~diaz/p422-bloom.pdf|Space/time trade-offs in hash coding with allowable errors]]. Communications of the ACM Volume 13 Issue 7. pp. 422–426. 1970. +;{{id='BSV2004' '''[BSV2004]'''}} : Paolo Boldi and Massimo Santini and Sebastiano Vigna. [[http://vigna.di.unimi.it/ftp/papers/ParadoxicalPageRank.pdf|Do Your Worst to Make the Best: Paradoxical Effects in PageRank Incremental Computations]]. Algorithms and Models for the Web-Graph. pp. 168–180. 2004. +;{{id='BP1998' '''[BP1998]'''}} : Brin, S. and Page, L. [[http://infolab.stanford.edu/~backrub/google.html|The Anatomy of a Large-Scale Hypertextual Web Search Engine]]. In: Seventh International World-Wide Web Conference (WWW 1998). April 14-18, 1998. Brisbane, Australia. 1998. +;{{id='BCC2010' '''[BCC2010]'''}} : S. Büttcher, C. L. A. Clarke, and G. V. Cormack. [[http://mitpress.mit.edu/books/information-retrieval|Information Retrieval: Implementing and Evaluating Search Engines]]. MIT Press. 2010. +;{{id='DG2004' '''[DG2004]'''}} : Jeffrey Dean and Sanjay Ghemawat. [[http://research.google.com/archive/mapreduce-osdi04.pdf|MapReduce: Simplified Data Processing on Large Clusters]]. OSDI'04: Sixth Symposium on Operating System Design and Implementation. 
2004 +;{{id='GGL2003' '''[GGL2003]'''}} : Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. [[http://research.google.com/archive/gfs-sosp2003.pdf|The Google File System]]. 19th ACM Symposium on Operating Systems Principles. 2003. +;{{id='GLM2007' '''[GLM2007]'''}} : A. Genkin, D. Lewis, and D. Madigan. [[http://www.stat.columbia.edu/~madigan/PAPERS/techno.pdf|Large-scale Bayesian logistic regression for text categorization]]. Technometrics. Volume 49. Issue 3. pp. 291--304, 2007. +;{{id='H2002' '''[H2002]'''}} : T. Haveliwala. [[http://infolab.stanford.edu/~taherh/papers/topic-sensitive-pagerank.pdf|Topic-Sensitive PageRank]]. Proceedings of the Eleventh International World Wide Web Conference (Honolulu, Hawaii). 2002. +;{{id='KSV2010' '''[KSV2010]'''}} : Howard Karloff, Siddharth Suri, and Sergei Vassilvitskii. [[http://theory.stanford.edu/~sergei/papers/soda10-mrc.pdf|A Model of Computation for MapReduce]]. Proceedings of the ACM Symposium on Discrete Algorithms. 2010. pp. 938-948. +;{{id='KC2004' '''[KC2004]'''}} : Rohit Khare and Doug Cutting. [[http://www.master.netseven.it/files/262-Nutch.pdf|Nutch: A flexible and scalable open-source web search engine]]. CommerceNet Labs Technical Report 04. 2004. +;{{id='LDH2010' '''[LDH2010]'''}} : Jimmy Lin and Chris Dyer. [[http://www.umiacs.umd.edu/~jimmylin/MapReduce-book-final.pdf|Data-Intensive Text Processing with MapReduce]]. Synthesis Lectures on Human Language Technologies. Morgan and Claypool Publishers. 2010. +;{{id='LM2006' '''[LM2006]'''}} : Amy N. Langville and Carl D. Meyer. [[http://press.princeton.edu/titles/8216.html|Google's PageRank and Beyond]]. Princeton University Press. 2006. +;{{id='MRS2008' '''[MRS2008]'''}} : C. D. Manning, P. Raghavan and H. Schütze. [[http://nlp.stanford.edu/IR-book/information-retrieval-book.html|Introduction to Information Retrieval]]. Cambridge University Press. 2008. +;{{id='MKSR2004' '''[MKSR2004]'''}} : G. Mohr, M. Kimpton, M. Stack, and I. Ranitovic.
[[http://iwaw.europarchive.org/04/Mohr.pdf|Introduction to Heritrix, an archival quality web crawler]]. 4th International Web Archiving Workshop. 2004. +;{{id='PTSHVC2011' '''[PTSHVC2011]'''}} : Manish Patil, Sharma V. Thankachan, Rahul Shah, Wing-Kai Hon, Jeffrey Scott Vitter, Sabrina Chandrasekaran. [[http://www.ittc.ku.edu/~jsv/Papers/PTS11.InvertedIndexSIGIR.pdf|Inverted indexes for phrases and strings]]. Proceedings of the 34th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. pp 555--564. 2011. +;{{id='P1997a' '''[P1997a]'''}} : J. Peek. [[http://www.usenix.org/publications/library/proceedings/ana97/summaries/monier.html|Summary of the talk: The AltaVista Web Search Engine]] by Louis Monier. USENIX Annual Technical Conference. Anaheim, California. ;login: Volume 22. Number 2. April 1997. +;{{id='P1997b' '''[P1997b]'''}} : J. Peek. [[http://www.usenix.org/publications/library/proceedings/ana97/summaries/brewer.html|Summary of the talk: The Inktomi Search Engine by Louis Monier]]. USENIX Annual Technical Conference. Anaheim, California. ;login: Volume 22. Number 2. April 1997. +;{{id='P1994' '''[P1994]'''}} : B. Pinkerton. [[http://web.archive.org/web/20010904075500/http://archive.ncsa.uiuc.edu/SDG/IT94/Proceedings/Searching/pinkerton/WebCrawler.html|Finding what people want: Experiences with the WebCrawler]]. In Proceedings of the First World Wide Web Conference, Geneva, Switzerland. 1994. +;{{id='P1980' '''[P1980]'''}} : M.F. Porter. [[http://tartarus.org/~martin/PorterStemmer/def.txt|An algorithm for suffix stripping]]. Program. Volume 14 Issue 3. 1980. pp 130−137. On the same website, there are [[http://snowball.tartarus.org/|stemmers for many other languages]]. +;{{id='PDGQ2006' '''[PDGQ2006]'''}} : Rob Pike, Sean Dorward, Robert Griesemer, Sean Quinlan. [[http://research.google.com/archive/sawzall-sciprog.pdf|Interpreting the Data: Parallel Analysis with Sawzall]]. Scientific Programming Journal.
Special Issue on Grids and Worldwide Computing Programming Models and Infrastructure. Volume 13. Issue 4. 2006. pp. 227--298. +;{{id='W2009' '''[W2009]'''}} : Tom White. [[http://www.amazon.com/gp/product/1449389732/ref=pd_lpo_k2_dp_sr_1?pf_rd_p=486539851&pf_rd_s=lpo-top-stripe-1&pf_rd_t=201&pf_rd_i=0596521979&pf_rd_m=ATVPDKIKX0DER&pf_rd_r=0N5VCGFDA7V7MJXH69G6|Hadoop: The Definitive Guide]]. O'Reilly. 2009. +;{{id='ZCTSR2004' '''[ZCTSR2004]'''}} : Hugo Zaragoza, Nick Craswell, Michael Taylor, Suchi Saria, and Stephen Robertson. [[http://trec.nist.gov/pubs/trec13/papers/microsoft-cambridge.web.hard.pdf|Microsoft Cambridge at TREC-13: Web and HARD tracks]]. In Proceedings of the 13th Annual Text Retrieval Conference. 2004. + +[[Documentation#contents|Return to table of contents]]. +EOD; +$public_pages["en-US"]["Documentation"] = <<< 'EOD' +page_type=standard + +page_alias= + +page_border=solid-border + +toc=true + +title=Open Source Search Engine Software - Seekquarry :: Documentation + +author=Chris Pollett + +robots= + +description= + +page_header=main_header + +page_footer=main_footer + +END_HEAD_VARS{{id="contents" +=Yioop Documentation v 2.0= +}} + +==Overview== +===Getting Started=== + +This document serves as a detailed reference for the Yioop search engine. If you want to get started using Yioop now, you probably want to first read the [[Install|Installation Guides]] page and look at the [[http://www.yioop.com/?c=group&group_id=20&arg=read&a=wiki&page_name=Main|Yioop Video Tutorials Wiki]]. If you cannot find your particular machine configuration there, you can check the Yioop [[Documentation#Requirements|Requirements]] section followed by the more general Installation and Configuration instructions. + +[[Yioop.com]], the demo site for Yioop software, allows people to register accounts. 
Once registered, if you have questions about Yioop and its installation, you can join the +[[https://yioop.com/?c=group&just_group_id=212|Yioop Software Help]] group and post your questions there. This group is frequently checked by the creators of Yioop, and you will likely get a quick response. + +Having a Yioop account also allows you to experiment with some of the features of Yioop beyond search, such as Yioop Groups, Wikis, and Crawl Mixes, without needing to install the software yourself. The [[Documentation#Search%20and%20User%20Interface|Search and User Interface]], [[Documentation#Managing%20Users,%20Roles,%20and%20Groups|Managing Users, Roles, and Groups]], [[Documentation#Feeds%20and%20Wikis|Feeds and Wikis]], and [[Documentation#Mixing%20Crawl%20Indexes|Mixing Crawl Indexes]] sections below could serve as a guide to testing the portion of the site general users have access to on Yioop.com. + +When using the Yioop software, if you do not understand a feature, make sure to also check out the integrated help system throughout Yioop. Clicking on a question mark icon will reveal an additional blue column on a page with help information as seen below: +{{class="docs" +((resource:Documentation:IntegratedHelp.png|Integrated Help Example)) +}} + +===Introduction=== + +The Yioop search engine is designed to allow users to produce indexes of a web-site or a collection of web-sites. The number of pages a Yioop index can handle ranges from a small site to collections containing tens or hundreds of millions of pages. In contrast, a search engine like Google maintains an index of tens of billions of pages. Nevertheless, since you, the user, have control over the exact sites which are being indexed with Yioop, you have much better control over the kinds of results that a search will return. Yioop provides a traditional web interface to do queries, an RSS API, and a function API. 
It also supports many common features of a search portal such as user discussion groups, blogs, wikis, and a news aggregator. In this section we discuss some of the different search engine technologies which exist today, how Yioop fits into this eco-system, and when Yioop might be the right choice for your search engine needs. In the remainder of this document after the introduction, we discuss how to get and install Yioop; the files and folders used in Yioop; the various crawl, search, social portal, and administration facilities in Yioop; localization in the Yioop system; building a site using the Yioop framework; embedding Yioop in an existing web-site; customizing Yioop; and the Yioop command-line tools. + +Since the mid-1990s a wide variety of search engine technologies have been explored. Understanding some of this history is useful in understanding Yioop's capabilities. In 1994, WebCrawler, one of the earliest still widely-known search engines, only had an index of about 50,000 pages which was stored in an Oracle database. Today, databases are still used to create indexes for small to medium size sites. An example of such a search engine written in PHP is [[http://www.sphider.eu/|Sphider]]. Given that a database is being used, one common way to associate a word with a document is to use a table with columns like word id, document id, score. Even if one is only extracting about a hundred unique words per page, this table would need hundreds of millions of rows for even a million page index. This edges towards the limits of the capabilities of database systems, although techniques like table sharding can help to some degree. The Yioop engine uses a database to manage some things like users and roles, but uses its own web archive format and indexing technologies to handle crawl data. This is one of the reasons that Yioop can scale to larger indexes. 
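As an illustration of the database approach just described (not Yioop's own storage, which uses web archives instead), a minimal (word id, document id, score) posting table might look like the following sketch. The table and column names are hypothetical, and Python/SQLite is used purely for illustration:

```python
import sqlite3

# Hypothetical posting table of the kind a database-backed engine like
# Sphider might use; one row associates a word with a document and a score.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE postings (
    word_id INTEGER, doc_id INTEGER, score REAL,
    PRIMARY KEY (word_id, doc_id))""")
# "index" a tiny crawl: word 7 appears in two documents, word 9 in one
conn.executemany("INSERT INTO postings VALUES (?, ?, ?)",
                 [(7, 1, 0.8), (7, 2, 0.3), (9, 1, 0.5)])
# a single-word query is then a lookup ordered by score
rows = conn.execute(
    "SELECT doc_id FROM postings WHERE word_id = 7 ORDER BY score DESC"
).fetchall()
print([r[0] for r in rows])  # -> [1, 2]
```

At roughly a hundred unique words per page, a million-page index already puts on the order of a hundred million rows in this table, which is where the scaling pressure mentioned above comes from.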
+ +When a site that is being indexed consists of dynamic pages rather than the largely static page situation considered above, and those dynamic pages get most of their text content from a table column or columns, different search index approaches are often used. Many database management systems like [[http://www.mysql.com/|MySQL]]/[[https://mariadb.org/|MariaDB]] support the ability to create full text indexes for text columns. A faster, more robust approach is to use a stand-alone full text index server such as [[http://www.sphinxsearch.com/|Sphinx]]. However, for these approaches to work, the text you are indexing needs to be in a database column or columns, or have an easy to define "XML mapping". Nevertheless, these approaches illustrate another common thread in the development of search systems: Search as an appliance, where you have a separate search server and access it through either a web-based API or function calls. + +Yioop has both a search function API as well as a web API that can return [[http://www.opensearch.org/|Open Search RSS]] results or a JSON variant. These can be used to embed Yioop within your existing site. If you want to create a new search engine site, Yioop +provides all the basic features of a web search portal. It has its own account management system with the ability to set up groups that have both discussion boards and wikis with various levels of access control. The built-in Public group's wiki together with the GUI configure page can be used to completely customize the look and feel of Yioop. Third party display ads can also be added through the GUI interface. If you want further customization, Yioop +offers a web-based, model-view-adapter (a variation on model-view-controller) framework with a web-interface for localization. 
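As a sketch of how such a web API might be consumed, the following snippet builds a query URL for a hypothetical Yioop instance. The base URL and the parameter names (`q` for the query terms, `f` for the output format) are assumptions for illustration only; consult your instance's actual web API documentation for the real interface:

```python
from urllib.parse import urlencode

def results_url(base, query, fmt="json"):
    """Build a URL requesting Open Search RSS results or the JSON variant.
    The parameter names here are illustrative assumptions, not a spec."""
    return base + "?" + urlencode({"q": query, "f": fmt})

print(results_url("https://www.example.com/yioop/", "open source search"))
```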
By 1997 commercial sites like Inktomi and AltaVista already had tens or hundreds of millions of pages in their indexes [ [[Documentation#P1994|P1994]] ] [ [[Documentation#P1997a|P1997a]] ] [ [[Documentation#P1997b|P1997b]] ]. Google [ [[Documentation#BP1998|BP1998]] ] circa 1998 in comparison had an index of about 25 million pages. These systems used many machines each working on parts of the search engine problem. On each machine there would, in addition, be several search-related processes, and for crawling, hundreds of simultaneous threads would be active to manage open connections to remote machines. Without threading, downloading millions of pages would be very slow. Yioop is written in [[http://www.php.net/|PHP]]. This language is the `P' in the very popular [[http://en.wikipedia.org/wiki/LAMP_%28software_bundle%29|LAMP]] web platform, which is one of the reasons PHP was chosen as the language of Yioop. Unfortunately, PHP does not have built-in threads. However, the PHP language does have a multi-curl library (implemented in C) which uses threading to support many simultaneous page downloads. This is what Yioop uses. Like these early systems, Yioop also supports the ability to distribute the task of downloading web pages to several machines. Since the problem of managing many machines becomes more difficult as the number of machines grows, Yioop further has a web interface for turning on and off the processes related to crawling on remote machines managed by Yioop. 
Within a round the Google matrix is applied to the current page rank estimates of a set of sites. This operation is reasonably easy to distribute to many machines. Computing how relevant a word is to a document is another task that benefits from multi-round, distributed computation. When a document is processed by indexers on multiple machines, words are extracted and a stemming algorithm such as [ [[Documentation#P1980|P1980]] ] or a character n-gramming technique might be employed (a stemmer would extract the word jump from words such as jumps, jumping, etc.; converting jumping to 3-grams would make terms of length 3, i.e., jum, ump, mpi, pin, ing). For some languages like Chinese, where spaces between words are not always used, a segmenting algorithm like reverse maximal match might be used. Next a statistic such as BM25F [ [[Documentation#ZCTSR2004|ZCTSR2004]] ] (or at least the non-query time part of it) is computed to determine the importance of that word in that document compared to that word amongst all other documents. To do this calculation, one needs to compute global statistics concerning all documents seen, such as their average length, how often a term appears in a document, etc. If the crawling is distributed it might take one or more merge rounds to compute these statistics based on partial computations on many machines. Hence, each of these computations benefits from allowing distributed computation to be multi-round. Infrastructure such as the Google File System [ [[Documentation#GGL2003|GGL2003]] ], the MapReduce model [ [[Documentation#DG2004|DG2004]] ], and the Sawzall language [ [[Documentation#PDGQ2006|PDGQ2006]] ] were built to make these multi-round distributed computation tasks easier. 
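The character 3-gramming example above (jumping becomes jum, ump, mpi, pin, ing) can be sketched as follows. Yioop itself is written in PHP, so this Python version is purely illustrative:

```python
def char_ngrams(term, n=3):
    """Slide a window of width n over the term, emitting one gram per offset."""
    return [term[i:i + n] for i in range(len(term) - n + 1)]

print(char_ngrams("jumping"))  # -> ['jum', 'ump', 'mpi', 'pin', 'ing']
```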
In the open source community, the [[http://hadoop.apache.org/docs/hdfs/current/hdfs_design.html|Hadoop Distributed File System]], [[http://hadoop.apache.org/docs/mapreduce/current/index.html|Hadoop MapReduce]], and [[http://hadoop.apache.org/pig/|Pig]] play an analogous role [ [[Documentation#W2009|W2009]] ]. Recently, a theoretical framework has begun to be developed [ [[Documentation#KSV2010|KSV2010]] ] for what algorithms can be carried out as rounds of mapping inputs to sequences of key-value pairs, shuffling pairs with the same keys to the same nodes, and reducing the key-value pairs at each node by some computation. This framework shows the map reduce model is capable of solving quite general cloud computing problems -- more than is needed just to deploy a search engine. + +Infrastructure such as this is not trivial for a small-scale business or individual to deploy. On the other hand, most small businesses and homes do have available several machines not all of whose computational abilities are being fully exploited. So the capability to do distributed crawling and indexing in this setting exists. Further, high-speed internet for homes and small businesses is steadily getting better. Since the original Google paper, techniques to rank pages have been simplified [ [[Documentation#APC2003|APC2003]] ]. It is also possible to approximate some of the global statistics needed in BM25F using suitably large samples. More details on the exact ranking mechanisms used by Yioop can be found on the [[Ranking|Yioop Ranking Mechanisms]] page. + +Yioop tries to exploit these advances to use a simplified distributed model which might be easier to deploy in a smaller setting. Each node in a Yioop system is assumed to have a web server running. One of the Yioop nodes' web apps is configured to act as a coordinator for crawls. It is called the '''name server'''. 
In addition to the name server, one might have several processes called '''queue servers''' that perform scheduling and indexing jobs, as well as '''fetcher''' processes which are responsible for downloading pages and the page processing such as stemming, char-gramming, and segmenting mentioned above. Through the name server's web app, users can send messages to the queue servers and fetchers. This interface writes message files that queue servers periodically look for. Fetcher processes periodically ping the name server to find the name of the current crawl as well as a list of queue servers. Fetcher programs then periodically make requests in a round-robin fashion to the queue servers for messages and schedules. A schedule is data to process and a message has control information about what kind of processing should be done. A given queue_server is responsible for generating schedule files for data with a certain hash value, for example, the urls to crawl whose host names hash to that queue server's id. As a fetcher processes a schedule, it periodically POSTs the result of its computation back to the responsible queue server's web server. The data is then written to a set of received files. The queue_server, as part of its loop, looks for received files and merges their results into the index so far. So the model is in a sense one round: URLs are sent to the fetchers, summaries of downloaded pages are sent back to the queue servers and merged into their indexes. As soon as the crawl is over, one can do text search on the crawl. Deploying this computation model is relatively simple: The web server software needs to be installed on each machine, the Yioop software (which has the fetcher, queue_server, and web app components) is copied to the desired location under the web server's document folder, each instance of Yioop is configured to know who the name server is, and finally, the fetcher programs and queue server programs are started. 
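The host-name-based partitioning of urls amongst queue servers described above can be sketched as follows. This is an illustrative Python sketch, not Yioop's actual PHP code, and the choice of md5 as the hash function is an assumption made for the example:

```python
import hashlib
from urllib.parse import urlparse

def queue_server_for(url, num_queue_servers):
    """Route a url to the queue server responsible for its host name's
    hash bucket, so every url on a host lands on the same server."""
    host = urlparse(url).netloc
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_queue_servers

# all urls on the same host are scheduled by the same queue server
print(queue_server_for("http://example.com/a", 3))
```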
+ +As an example of how this scales, a 2010 Mac Mini running a queue server program can schedule and index about 100,000 pages/hour. This corresponds to the work of about 7 fetcher processes (which may be on different machines -- roughly, you want 1 GB and 1 core per fetcher). The checks by fetchers on the name server are lightweight, so adding another machine with a queue server and the corresponding additional fetchers allows one to effectively double this speed. This also has the benefit of speeding up query processing: when a query comes in, it gets split into queries for each of the queue servers' web apps, but each such query only "looks" slightly more than half as far into the posting list as would occur in a single queue server setting. To further increase query throughput, that is, the number of queries that can be handled at a given time, Yioop installations can also be configured as "mirrors" which keep an exact copy of the data stored in the site being mirrored. When a query request comes into a Yioop node, either it or any of its mirrors might handle it. Query processing for multi-word queries can actually be a major bottleneck if you don't have many machines and you do have a large index. To further speed this up, Yioop uses a hybrid inverted index/suffix tree approach to store word lookups. The suffix tree ideas are motivated by [ [[Documentation#PTSHVC2011|PTSHVC2011]] ]. + +Since a multi-million page crawl involves downloading from the web rapidly over several days, Yioop supports the ability to dynamically change its crawl parameters as a crawl is going on. This allows a web admin, on a user's request, to disallow Yioop from continuing to crawl a site or to restrict the number of urls/hour crawled from a site without having to stop the overall crawl. One can also, through a web interface, inject new seed sites into the active crawl while it is occurring. 
This can help if someone suggests to you a site that might otherwise not be found by Yioop given its original list of seed sites. Crawling at high speed can cause a website to become congested and unresponsive. As of Version 0.84, if Yioop detects a site is becoming congested, it can automatically slow down the crawling of that site. Finally, crawling at high speed can cause your domain name server (the server that maps www.yioop.com to 173.13.143.74) to become slow. To reduce the effect of this, Yioop supports domain name caching. + +Despite its simpler one-round model, Yioop does a number of things to improve the quality of its search results. While indexing, Yioop can make use of Lasso regression classifiers [ [[Documentation#GLM2007|GLM2007]] ] using data from earlier crawls to help label and/or rank documents in the active crawl. Yioop also takes advantage of the link structure that might exist between documents in a one-round way: For each link extracted from a page, Yioop creates a micropage which it adds to its index. This includes relevancy calculations for each word in the link as well as an [ [[Documentation#APC2003|APC2003]] ]-based ranking of how important the link was. Yioop supports a number of iterators which can be thought of as implementing a stripped-down relational algebra geared towards word-document indexes (this is much the same idea as Pig). One of these operators allows one to make results from unions of stored crawls. This allows one to do many smaller topic-specific crawls and combine them with one's own weighting scheme into a larger crawl. A second useful operator allows you to display a certain number of results from a given subquery, then go on to display results from other subqueries. This allows you to make a crawl presentation like: the first result should come from the open crawl results, the second result from Wikipedia results, the next result should be an image, and any remaining results should come from the open search results. 
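The subquery presentation operator just described can be sketched as follows. This is an illustrative Python snippet, not Yioop's actual iterator API, and the result lists are made up for the example:

```python
def mix_results(subquery_results, counts):
    """Take counts[i] results from subquery i in turn, as in the crawl
    presentation described above (one open crawl result, then one
    Wikipedia result, then an image, ...)."""
    mixed = []
    for results, n in zip(subquery_results, counts):
        mixed.extend(results[:n])
    return mixed

open_crawl = ["o1", "o2", "o3"]
wikipedia = ["w1", "w2"]
images = ["i1"]
print(mix_results([open_crawl, wikipedia, images], [1, 1, 1]))
# -> ['o1', 'w1', 'i1']
```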
Yioop comes with a GUI facility to make the creation of these crawl mixes easy. To speed up query processing for these crawl mixes, one can also create materialized versions of crawl mix results, which makes a separate index of crawl mix results. Another useful operator Yioop supports allows one to perform groupings of document results. In the search results displayed, grouping by url allows all links and documents associated with a url to be grouped as one object. Scoring of this group is a sum of all these scores. Thus, link text is used in the score of a document. How much weight a word from a link gets also depends on the link's rank. So a low-ranked link with the word "stupid" to a given site would tend not to show up early in the results for the word "stupid". Grouping is also used to handle deduplication: It might be the case that the pages of many different URLs have essentially the same content. Yioop creates a hash of the web page content of each downloaded url. Amongst urls with the same hash, only the one that is linked to the most will be returned after grouping. Finally, if a user wants to do more sophisticated post-processing such as clustering or computing page rank, Yioop supports a straightforward architecture for indexing plugins. + +There are several open source crawlers which can scale to crawls in the millions to hundreds of millions of pages. Most of these are written in Java, C, C++, or C#, not PHP. Three important ones are [[http://nutch.apache.org/|Nutch]]/ [[http://lucene.apache.org/|Lucene]]/ [[http://lucene.apache.org/solr/|Solr]] [ [[Documentation#KC2004|KC2004]] ], [[http://www.yacy.net/|YaCy]], and [[http://crawler.archive.org/|Heritrix]] [ [[Documentation#MKSR2004|MKSR2004]] ]. Nutch is the original application for which the Hadoop infrastructure described above was developed. Nutch is a crawler, Lucene is for indexing, and Solr is a search engine front end. 
The YaCy project uses an interesting distributed hash table peer-to-peer approach to crawling, indexing, and search. Heritrix is a web crawler developed at the [[http://www.archive.org/|Internet Archive]]. It was designed to do archival-quality crawls of the web. Its ARC file format inspired the use of WebArchive objects in Yioop. WebArchives are Yioop's container file format for storing web pages, web summary data, url lists, and other kinds of data used by Yioop. A WebArchive is essentially a linked list of compressed, serialized PHP objects, the last element in this list containing a header object with information like version number and a total count of objects stored. The compression format can be chosen to suit the kind of objects being stored. The header can be used to store auxiliary data structures into the list if desired. One nice aspect of serialized PHP objects versus serialized Java objects is that they are human-readable text strings. The main purpose of WebArchives is to allow one to store many small files compressed as one big file. They also make data from web crawls very portable, making them easy to copy from one location to another. Like Nutch and Heritrix, Yioop also has a command-line tool for quickly looking at the contents of such archive objects. + +The [[http://www.archive.org/web/researcher/ArcFileFormat.php|ARC format]] is one example of an archival file format for web data. Besides its use at the Internet Archive, the ARC format and its successor, the [[http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml|WARC format]], are often used by TREC conferences to store test data sets such as [[http://ir.dcs.gla.ac.uk/test_collections/|GOV2]] and the [[http://lemurproject.org/clueweb09/|ClueWeb 2009]] / [[http://lemurproject.org/clueweb12/|ClueWeb 2012]] datasets. In addition, it was used by grub.org (hopefully, only on a temporary hiatus), a distributed, open-source search engine project in C#. 
Another important format for archiving web pages is the XML format used by Wikipedia for archiving MediaWiki wikis. [[http://www.wikipedia.org/|Wikipedia]] offers [[http://en.wikipedia.org/wiki/Wikipedia:Database_download|Creative Commons-licensed downloads]] of their site in this format. The [[http://www.dmoz.org/|Open Directory Project]] makes available its [[http://www.dmoz.org/rdf.html|ODP data set]] in an RDF-like format licensed using the Open Directory License. Thus, we see that there are many large-scale, useful data sets that can be easily licensed. Raw data dumps do not contain indexes of the data though. This makes sense because indexing technology is constantly improving and it is always possible to re-index old data. Yioop supports importing and indexing data from ARC, WARC, database query results, MediaWiki XML dumps, and Open Directory RDF. Yioop further has a generic text importer which can be used to index log records, mail, Usenet posts, etc. Yioop also supports re-indexing of old Yioop data files created after version 0.66, and indexing crawl mixes. This means that, using Yioop, you can have searchable access to many data sets as well as the ability to maintain your data going forward. When displaying caches of web pages in Yioop, the interface further supports the ability to display a history of all cached copies of that page, in a similar fashion to the Internet Archive's interface. + +Another important aspect of creating a modern search engine is the ability to display various media sources in an appropriate way. Yioop comes with built-in subsearch abilities for images, where results are displayed as image strips; video, where thumbnails for video are shown; and news, where news items are grouped together and a configurable set of news/twitter feeds can be set to be updated on an hourly basis. + +This concludes the discussion of how Yioop fits into the current and historical landscape of search engines and indexes. 
+ +===Feature List=== + +Here is a summary of the features of Yioop: + +*'''General''' +**Yioop is an open-source, distributed crawler and search engine written in PHP. +**Crawling, indexing, and serving search results can be done on a single machine or distributed across several machines. +**The fetcher/queue_server processes on several machines can be managed through the web interface of a main Yioop instance. +**Yioop installations can be created with a variety of topologies: one queue_server and many fetchers or several queue_servers and many fetchers. +**Using web archives, crawls can be mirrored amongst several machines to speed up serving search results. This can be further sped up by using memcache or filecache. +**Yioop can be used to create web sites via its own built-in wiki system. For more complicated customizations, Yioop's model-view-adapter framework is designed to be easily extendible. This framework also comes with a GUI which makes it easy to localize strings and static pages. +**Yioop search result and feed pages can be configured to display banner or skyscraper ads through a Site Admin GUI (if desired). +**Yioop has been optimized to work well with smart phone web browsers and with tablet devices. +*'''Social and User Interface''' +**Yioop can be configured to allow or not to allow users to register for accounts. +**If allowed, user accounts can create discussion groups, blogs, and wikis. +** Blogs and wikis support attaching images, videos, and files and also support including math using LaTeX or AsciiMathML. +** Yioop comes with two built-in groups: Help and Public. Help's wiki pages allow one to customize the integrated help throughout the Yioop system. The Public group's discussion can be used as a site blog; its wiki page can be used to customize the look-and-feel of the overall Yioop site without having to do programming. +** Wiki pages support different types such as standard wiki page, page alias, media gallery, and slide presentation. 
+** Video on wiki pages and in discussion posts is served using HTTP pseudo-streaming so users can scrub through video files. For uploaded video files below a configurable size limit, videos are automatically converted to web-friendly mp4 and webm formats. +** Wiki pages can be configured to have auto-generated tables of contents, to make use of common headers and footers, and to output meta tags for SEO purposes. +**Users can share their own mixes of crawls that exist in the Yioop system. +**If user accounts are enabled, Yioop has a search tools page on which people can suggest urls to crawl. +**Yioop has three different captcha mechanisms that can be used in account registration and for suggesting urls: a standard graphics-based captcha, a text-based captcha, and a hashcash-like captcha. +**Password authentication can be configured to either use a standard password-hash-based system, or make use of Fiat-Shamir zero-knowledge authentication. +*'''Search''' +**Yioop supports subsearches geared towards presenting certain kinds of media such as images, video, and news. The list of video and news sites can be configured through the GUI. Yioop has a media_updater process which can be used to automatically update news feeds hourly. +**News feeds can be either RSS or Atom feeds or can be scraped from an HTML page using XPath queries. What image is used for a news feed item can also be configured using XPath queries. +**Yioop determines search results using a number of iterators which can be combined like a simplified relational algebra. +**Yioop can be configured to display word suggestions as a user types a query. It can also suggest spell corrections for mis-typed queries. This feature can be localized. +**Yioop can also make use of a thesaurus facility such as that provided by WordNet to suggest related queries. +**Yioop supports the ability to filter out urls from search results after a crawl has been performed. 
It also has the ability to edit summary information that will be displayed for urls. +**A given Yioop installation might have several saved crawls and it is very quick to switch between any of them and immediately start doing text searches. +**Besides the standard output of a web page with ten links, Yioop can output query results in Open Search RSS format or a JSON variant of this format, and one can also query Yioop data via a function API. +*'''Indexing''' +**Yioop is capable of indexing small sites to sites or collections of sites containing low hundreds of millions of documents. +**Yioop uses a hybrid inverted index/suffix tree approach for word lookup to make multi-word queries faster on disk-bound machines. +**Yioop indexes are positional rather than bag-of-words indexes, and an index compression scheme called Modified9 is used. +**Yioop has a web interface which makes it easy to combine results from several crawl indexes to create unique result presentations. These combinations can be done in a conditional manner using "if:" meta words. +**Yioop supports the indexing of many different filetypes including: HTML, Atom, BMP, DOC, DOCX, ePub, GIF, JPG, PDF, PPT, PPTX, PNG, RSS, RTF, sitemaps, SVG, XLSX, and XML. It has a web interface for controlling which amongst these filetypes (or all of them) you want to index. It also supports attempting to extract information from unknown filetypes. +**Yioop supports extracting data from zipped formats like DOCX even if it only did a partial download of the file. +**Yioop has a simple page rule language for controlling what content should be extracted from a page or record. +**Yioop has two different kinds of text summarizers which can be used to further affect what words are indexed: a basic web page scraper, and a centroid algorithm summarizer. The latter can be used to generate word clouds of crawled documents. 
+**Indexing occurs as crawling happens, so when a crawl is stopped, it is ready to be used to handle search queries immediately. +**Yioop indexes can be used to create classifiers which then can be used in labeling and ranking future indexes. +**Yioop comes with stemmers for English, French, German, Italian, and Russian, and a word segmenter for Chinese. It uses char-gramming for other languages. Yioop has a simple architecture for adding stemmers for other languages. +**Yioop uses a web archive file format which makes it easy to copy crawl results amongst different machines. It has a command-line tool for inspecting these archives if they need to be examined in a non-web setting. It also supports command-line search querying of these archives. +**Yioop supports an indexing plugin architecture to make it possible to write one's own indexing modules that do further post-processing. +*'''Web and Archive Crawling''' +**Yioop supports open web crawls, but through its web interface one can also configure it to crawl only specific sites, domains, or collections of sites and domains. One can customize a crawl using regexes in disallow directives to crawl a site to a fixed depth. +**Yioop uses multi-curl to support many simultaneous downloads of pages. +**Yioop obeys robots.txt files including Google and Bing extensions such as the Crawl-delay and Sitemap directives as well as * and $ in allow and disallow. It further supports the robots meta tag directives NONE, NOINDEX, NOFOLLOW, NOARCHIVE, and NOSNIPPET and the link tag directive rel="canonical". It also supports anchor tags with rel="nofollow" attributes and X-Robots-Tag HTTP headers. Finally, it tries to detect if a robots.txt became a redirect due to congestion. +**Yioop comes with a word indexing plugin which can be used to control how Yioop crawls based on words on the page and the domain. This is useful for creating niche, subject-specific indexes. 
**Yioop has its own DNS caching mechanism, and it adjusts the number of simultaneous downloads it does in one go based on the number of lookups it will need to do.
**Yioop can crawl over the HTTP, HTTPS, and Gopher protocols.
**Yioop supports crawling Tor networks (.onion urls).
**Yioop supports crawling through a list of proxy servers.
**Yioop supports crawling Git repositories and can index Java and Python code.
**Yioop supports crawl quotas for web sites. I.e., one can control the number of urls/hour downloaded from a site.
**Yioop can detect website congestion and slow down crawling a site that it detects as congested.
**Yioop supports dynamically changing the allowed and disallowed sites while a crawl is in progress. Yioop also supports dynamically injecting new seed sites via a web interface into the active crawl.
**Yioop has a web form that allows a user to control the recrawl frequency for a page during a crawl.
**Yioop keeps track of ETag: and Expires: HTTP headers to avoid downloading content it already has in its index.
**Yioop supports importing data from ARC, WARC, database queries, MediaWiki XML, and ODP RDF files. It has a generic importing facility to import text records such as access logs, mail logs, usenet posts, etc., which are either not compressed, or compressed using gzip or bzip2. It also supports re-indexing of data from WebArchives.

[[Documentation#contents|Return to table of contents]].

==Set-up==
===Requirements===

The Yioop search engine requires: (1) a web server, (2) PHP 5.3 or better (Yioop used only to serve search results from a pre-built index has been tested to work in PHP 5.2), (3) Curl libraries for downloading web pages. To be a little more specific, Yioop has been tested with Apache 2.2, and I've been told Version 0.82 or newer works with lighttpd. It should work with other webservers, although it might take some finessing.
For PHP, you need a build of PHP that incorporates multi-byte string (mb_ prefixed) functions, Curl, Sqlite (or at least PDO with the Sqlite driver), the GD graphics library, and the command-line interface. If you are using Mac OSX Snow Leopard or newer, the versions of Apache2 and PHP that come with it suffice. For Windows, Mac, and Linux, another easy way to get the required software is to download an Apache/PHP/MySQL suite such as [[http://www.apachefriends.org/en/xampp.html|XAMPP]]. On Windows machines, find the php.ini file under the php folder in your Xampp folder and change the line:
 ;extension=php_curl.dll
to
 extension=php_curl.dll
The php.ini file has a post_max_size setting you might want to change. You might want to change it to:
 post_max_size = 32M
Yioop will work with post_max_size set to as little as two megabytes, but will be faster with the larger post capacity. If you intend to make use of Yioop Discussion Groups and Wiki and their ability to upload documents, you might want to consider also adjusting the value of the variable ''upload_max_filesize''. This value should be set to at most what you set post_max_size to.

If you are using WAMP, similar changes as with XAMPP must be made, but be aware that WAMP has two php.ini files and both of these must be changed.

If you are using the Ubuntu-variant of Linux, the following lines would get the software you need:
 sudo apt-get install curl
 sudo apt-get install apache2
 sudo apt-get install php5
 sudo apt-get install php5-cli
 sudo apt-get install php5-sqlite
 sudo apt-get install php5-curl
 sudo apt-get install php5-gd
For both Mac and Linux, you might want to alter the post_max_size variable in your php.ini file as in the Windows case above.
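On a Linux system, the php.ini edits above can also be scripted. The sketch below operates on a demo copy of php.ini so it is safe to try; for a real install, point PHP_INI at your actual php.ini (its location varies by installation) and drop the line that creates the demo file:

```shell
# Demo php.ini standing in for the real file; replace with your own path.
PHP_INI=./php.ini.demo
printf ';extension=php_curl.dll\npost_max_size = 8M\nupload_max_filesize = 2M\n' > "$PHP_INI"
# Uncomment the curl extension line (Windows-style dll line, as in the text).
sed -i 's/^;extension=php_curl.dll/extension=php_curl.dll/' "$PHP_INI"
# Raise post_max_size to 32M, and keep upload_max_filesize at most that value.
sed -i 's/^post_max_size *=.*/post_max_size = 32M/' "$PHP_INI"
sed -i 's/^upload_max_filesize *=.*/upload_max_filesize = 32M/' "$PHP_INI"
cat "$PHP_INI"
```

Remember to restart your web server after changing php.ini so the new values take effect.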

In addition to the minimum installation requirements above, if you want to use the [[Documentation#GUI%20for%20Managing%20Machines%20and%20Servers|Manage Machines]] feature in Yioop, you might need to do some additional configuration. The Manage Machines activity allows you, through a web interface, to start/stop and look at the log files for each of the queue_servers and fetchers that you want Yioop to manage. If it is not configured, then these tasks would need to be done via the command line. '''Also, if you do not use the Manage Machine interface your Yioop site can make use of only one queue_server.'''

As a final step, after installing the necessary software, '''make sure to start/restart your web server and verify that it is running.'''

====Memory Requirements====

In addition to the prerequisite software listed above, Yioop specifies certain upper bounds on the amounts of memory its processes can use. By default, bin/queue_server.php's limit is set to 2500MB and bin/fetcher.php's limit is set to 1200MB. You can expect that index.php might need up to 500MB. These values are set near the tops of each of these files in turn with a line like:
 ini_set("memory_limit","2500M");
For the index.php file, you may need to set the limit as well in your php.ini file for the instance of PHP used by your web server. If the value is too low for the index.php web app, you might see messages in the Fetcher logs that begin with: "Trouble sending to the scheduler at url..."

Often in a VM setting these requirements are somewhat steep. It is possible to get Yioop to work in environments like EC2 (be aware this might violate your service agreement). To reduce these memory requirements, one can manually adjust the variables NUM_DOCS_PER_GENERATION, SEEN_URLS_BEFORE_UPDATE_SCHEDULER, NUM_URLS_QUEUE_RAM, MAX_FETCH_SIZE, and URL_FILTER_SIZE in the configs/config.php file.
Experimenting with these values, you should be able to trade off memory requirements for speed.

[[Documentation#contents|Return to table of contents]].

===Installation and Configuration===

The Yioop application can be obtained using [[Download|the download page at seekquarry.com]]. After downloading and unzipping it, move the Yioop search engine into some folder under your web server's document root. Yioop makes use of an auxiliary folder to store profile/crawl data. Before Yioop will run, you must configure this directory. This can be done in one of two ways: either through the web interface (the preferred way), as we will now describe, or using the configs/configure_tool.php script (which is harder, but might be suitable for some VPS settings), which is described in the [[Documentation#Yioop%20Command-line%20Tools|command line tools section]]. From the web interface, to configure this directory, point your web browser to where your Yioop folder is located; a configuration page should appear and let you set the path to the auxiliary folder (Search Engine Work Directory). This page looks like:
{{class="docs"
((resource:Documentation:ConfigureScreenForm1.png|The work directory form))
}}
For this step, as a security precaution, you must connect via localhost. If you are in a web hosting environment (for example, if you are using cPanel to set up Yioop) where it is difficult to connect using localhost, you can add a file, configs/local_config.php, with the following content:
 <?php
 define('NO_LOCAL_CHECK', 'true');
 ?>
Returning to our installation discussion, notice under the text field there is a heading "Component Check" with red text under it. This section is used to indicate any requirements that Yioop has that might not be met yet on your machine. In the case above, the web server needs permissions on the file configs/config.php to write in the value of the directory you choose in the form for the Work Directory.
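What fixing such a permissions complaint looks like depends on your system. The sketch below uses a demo directory layout and plain chmod; on a real install you would run the equivalent commands against your actual Yioop folder and would typically chown the files to your web server's user (for example, www-data on Ubuntu) rather than widening the permission bits:

```shell
# Demo layout standing in for a Yioop install; substitute your real paths.
mkdir -p yioop_demo/configs
touch yioop_demo/configs/config.php
mkdir -p yioop_data   # the Search Engine Work Directory
# Let the web server write config.php and create files in the work directory.
# (On a real server: sudo chown -R www-data yioop_demo/configs yioop_data)
chmod 664 yioop_demo/configs/config.php
chmod 775 yioop_data
stat -c '%a %n' yioop_demo/configs/config.php yioop_data
```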
Another common message asks you to make sure the web server has permissions on the place where this auxiliary folder needs to be created. When filling out the form on this page, on both *nix-like and Windows machines, you should use forward slashes for the folder location. For example,

 /Applications/XAMPP/xamppfiles/htdocs #Mac, Linux systems similar
 c:/xampp/htdocs/yioop_data #Windows

Once you have set the folder, you should see a second Profile Settings form beneath the Search Engine Work Directory form. If you are asked to sign in before this, and you have not previously created accounts in this Work Directory, then the default account has login root and an empty password. Once you see it, the Profile Settings form allows you to configure the debug, search access, database, queue server, and robot settings. It will look something like:

{{class="docs"
((resource:Documentation:ConfigureScreenForm2.png|Basic configure form))
}}

These settings suffice if you are only doing single machine crawling. The '''Crawl Robot Set-up''' fieldset is used to provide websites that you crawl with information about who is crawling them. The field Crawl Robot Name is used to give the name of your robot. You should choose a common name for all of the fetchers in your set-up, but the name should be unique to your web-site. It is bad form to pretend to be someone else's robot, for example, the googlebot. As Yioop crawls, it sends the web-sites it crawls a User-Agent string; this string contains the url back to the bot.php file in the Yioop folder. bot.php is supposed to provide a detailed description of your robot. The contents of the textarea Robot Description provide this description and are inserted between the <body> </body> tags on the bot.php page.

You might need to click {{id="advance" '''Toggle Advance Settings'''}} if you are doing Yioop development or if you are crawling in a multi-machine setting.
The advanced settings look like:

{{class="docs"
((resource:Documentation:ConfigureScreenForm3.png|Advanced configure form))
}}

The '''Debug Display''' fieldset has three checkboxes: Error Info, Query Info, and Test Info. Checking Error Info will mean that when the Yioop web app runs, any PHP Errors, Warnings, or Notices will be displayed on web pages. This is useful if you need to do debugging, but should not be set in a production environment. The second checkbox, Query Info, when checked, will cause statistics about the time, etc. of database queries to be displayed at the bottom of each web page. The last checkbox, Test Info, says whether or not to display automated tests of some of the system's library classes if the browser is navigated to http://YIOOP_INSTALLATION/tests/. None of these debug settings should be checked in a production environment.

The '''Search Access''' fieldset has three checkboxes: Web, RSS, and API. These control whether a user can use the web interface to get query results, whether RSS responses to queries are permitted, and whether or not the function based search API is available. Using the Web Search interface and formatting a query url to get an RSS response are described in the Yioop [[Documentation#Search%20and%20User%20Interface|Search and User Interface]] section. The Yioop Search Function API is described in the section [[Documentation#Embedding%20Yioop%20in%20an%20Existing%20Site|Embedding Yioop]]; you can also look in the examples folder at the file search_api.php to see an example of how to use it. '''If you intend to use Yioop in a configuration with multiple queue servers (not fetchers), then the RSS checkbox needs to be checked.'''

The '''Site Customizations''' fieldset lets you configure the overall look and feel of a Yioop instance.
The '''Use Wiki Public Main Page as Landing Page''' checkbox lets you set the main page of the Public wiki to be the landing page of the whole Yioop site rather than the default centered search box landing page. Several of the text fields in Site Customizations control various colors used in drawing the Yioop interface. These include '''Background Color''', '''Foreground Color''', '''Top Bar Color''', and '''Side Bar Color'''. The values of these fields can be any legitimate style-sheet color, such as a # followed by red, green, and blue values (hex digits 0-9, A-F), or a color word such as: yellow, cyan, etc. If you would like to use a background image, you can either use the picker link or drag and drop one into the rounded square next to the '''Background Image''' label. Various other images such as the '''Site Logo''', '''Mobile Logo''' (the logo used for mobile devices), and '''Favicon''' (the little logo that appears in the title tab of a page or in the url bar) can similarly be chosen or dragged-and-dropped.

A '''Search Toolbar''' is a short file that can be used to add your search engine to the search bar of a browser. You can drag such a file into the gray area next to this label and click save to set this for your site. The link to install the search bar is visible on the Settings page. There is also a link tag on every page of the Yioop site that allows a browser to auto-discover this as well. As a starting point, one can try tweaking the default Yioop search bar, yioopbar.xml, in the base folder of Yioop.

The three fields '''Timezone''', '''Web Cookie Name''', and '''Web Token Name''' control, respectively, the timezone used by Yioop when it does date conversions, the name of the cookie it sets in a browser's cookie cache, and the name used for the tokens that prevent cross-site request forgery which appear in Yioop URLs when one is logged in.

Finally, if one knows cascading stylesheets (CSS) and wants greater control of the look and feel of the site, then one can enter standard stylesheet commands in the '''Auxiliary Style Directives''' textarea.

===Optional Server and Security Configurations===
The configuration activity just described suffices to set up Yioop for a single server crawl. If that is what you are interested in, you may want to skip ahead to the section on the [[Documentation#Search%20and%20User%20Interface|Yioop Search Interface]] to learn about the different search features available in Yioop, or to [[Documentation#Performing%20and%20Managing%20Crawls|Performing and Managing Crawls]] to learn about how to perform a crawl. In this section, we describe the Server Settings and Security activities, which might be useful in a multi-machine, multi-user setting and which might also be useful for crawling hidden websites or crawling through proxies.

The Server Settings activity looks like:

{{class="docs"
((resource:Documentation:ServerSettings.png|The Server Settings Activity))
}}

The '''Name Server Set-up''' fieldset is used to tell Yioop which machine is going to act as a name server during a crawl and what secret string to use to make sure that communication is being done between legitimate queue_servers and fetchers of your installation. You can choose anything for your secret string as long as you use the same string amongst all of the machines in your Yioop installation. The reason why you have to set the name server url is that each machine that is going to run a fetcher to download web pages needs to know who the queue servers are so they can request a batch of urls to download. There are a few different ways this can be set up:

#If the particular instance of Yioop is only being used to display search results from crawls that you have already done, then this fieldset can be filled in however you want.
#If you are doing crawling on only one machine, you can put http://localhost/path_to_yioop/ or http://127.0.0.1/path_to_yioop/, where you appropriately modify "path_to_yioop".
#Otherwise, if you are doing a crawl on multiple machines, use the url of Yioop on the machine that will act as the name server.

In communicating between the fetcher and the server, Yioop uses curl. Curl can be particular about redirects in the case where posted data is involved; i.e., if a redirect happens, it does not send posted data to the redirected site. For this reason, Yioop insists on a trailing slash on your queue server url. Beneath the Queue Server Url field is a Memcached checkbox and a Filecache checkbox. Only one of these can be checked at a time. The Memcached checkbox only shows if you have [[http://php.net/manual/en/book.memcache.php|PHP Memcache]] installed. Checking the Memcached checkbox allows you to specify memcached servers that, if specified, will be used to cache in memory search query results as well as index pages that have been accessed. Checking the Filecache box tells Yioop to cache search query results in temporary files. Memcache probably gives a better performance boost than Filecaching, but not all hosting environments have Memcache available.

The '''Database Set-up''' fieldset is used to specify what database management system should be used, how it should be connected to, and what user name and password should be used for the connection. At present, [[http://www.php.net/manual/en/intro.pdo.php|PDO]] (PHP's generic DBMS interface), sqlite3, and Mysql databases are supported. The database is used to store information about what users are allowed to use the admin panel and what activities and roles these users have. Unlike many database systems, if an sqlite3 database is being used, then the connection is always a file on the current filesystem and there is no notion of login and password, so in this case only the name of the database is asked for.
For sqlite, the database is stored in WORK_DIRECTORY/data. For single user settings with a limited number of news feeds, sqlite is probably the most convenient database system to use with Yioop. If you think you are going to make use of Yioop's social functionality and have many users, feeds, and crawl mixes, using a system like Mysql or Postgres might be more appropriate.

If you would like to use a different DBMS than Sqlite or Mysql, then the easiest way is to select PDO as the Database System and, for the Hostname, give the DSN with the appropriate DBMS driver. For example, for Postgres one might have something like:
 pgsql:host=localhost;port=5432;dbname=test;user=bruce;password=mypass
You can put the username and password either in the DSN or in the Username and Password fields. The database name field must be filled in with the name of the database you want to connect to. It also needs to be included in the DSN, as in the example above. PDO support in Yioop has been tested to work with Postgres and sqlite; for other DBMSs it might take some tinkering to get things to work.

When switching database information, Yioop checks first if a usable database with the user supplied data exists. If it does, then it uses it; otherwise, it tries to create a new database. Yioop comes with a small sqlite demo database in the data directory and this is used to populate the installation database in this case. This database has one account, root, with no password, which has privileges on all activities. Since different databases associated with a Yioop installation might have different user accounts set up, after changing database information you might have to sign in again.

The '''Account Registration''' fieldset is used to control how users can obtain accounts on a Yioop installation.
The dropdown at the start of this fieldset allows one to select one of four possibilities: Disable Registration, users cannot register themselves, only the root account can add users; No Activation, user accounts are immediately activated once a user signs up; Email Activation, after registering, users must click on a link which comes in a separate email to activate their accounts; and Admin Activation, after registering, an admin account must activate the user before the user is allowed to use their account. When Disable Registration is selected, the Suggest A Url form and link on the tool.php page is disabled as well; for all other registration types this link is enabled. If Email Activation is chosen, then the rest of this fieldset can be used to specify the email address from which the activation email will be sent to the user. The checkbox Use PHP mail() function controls whether to use the mail function in PHP to send the mail; this only works if mail can be sent from the local machine. Alternatively, if this is not checked, as in the image above, one can configure an outgoing SMTP server to send the email through.

The '''Proxy Server''' fieldset is used to control which proxies to use while crawling. By default, Yioop does not use any proxies while crawling. A Tor Proxy can serve as a gateway to the Tor Network. Yioop can use this proxy to download .onion URLs on the [[https://en.wikipedia.org/wiki/Tor_%28anonymity_network%29|Tor network]]. The configuration given in the example above works with the Tor Proxy that comes with the
[[https://www.torproject.org/projects/torbrowser.html|Tor Browser]]. Obviously, this proxy needs to be running for Yioop to make use of it. Beneath the Tor Proxy input field is a checkbox labelled Crawl via Proxies. Checking this box will reveal a textarea labelled Proxy Servers. You can enter the address:port or address:port:proxytype of proxy servers you would like to crawl through.
If proxy servers are used, Yioop will make any requests to download pages to a randomly chosen server on the list, which will proxy the request to the site which has the page to download. To some degree this can make the download site think the request is coming from a different ip (and potentially location) than it actually is. In practice, servers can often use HTTP headers to guess that a proxy is being used.

The '''Ad Server Configuration''' fieldset can be used to specify advertising scripts (such as Google Ad Words, Bidvertiser, Adspeed, etc.) which are to be added on search result pages or on discussion thread pages. There are four possible placements of ads: None -- don't display advertising at all; Top -- display banner ads beneath the search bar but above search results; Side -- display skyscraper ads in a column beside the search results; and Both -- display both banner and skyscraper ads. Choosing any option other than None reveals text areas where one can insert the Javascript one would get from the ad network. The '''Global Ad Script''' text area is used for any Javascript or HTML the ad provider wants you to include in the HTML head tag for the web page (many advertisers don't need this).

The Security activity looks like:

{{class="docs"
((resource:Documentation:Security.png|The Security Activity))
}}

The '''Authentication Type''' fieldset is used to control the protocol used to log people into Yioop. This can either be Normal Authentication, in which passwords are checked against stored salted hashes of the password, or ZKP (zero knowledge protocol) authentication, in which the server picks challenges at random and sends these to the browser the person is logging in from, and the browser computes, based on the password, an appropriate response according to the [[https://en.wikipedia.org/wiki/Feige%E2%80%93Fiat%E2%80%93Shamir_identification_scheme|Fiat Shamir]] protocol.
The password is never sent over the internet and is not stored on the server. These are the main advantages of ZKP; its drawback is that it is slower than Normal Authentication, since proving who you are with a low probability of error requires several browser-server exchanges. You should choose which authentication scheme you want before you create many users, because if you switch, everyone will need to get a new password.

The '''Captcha Type''' fieldset controls what kind of [[https://en.wikipedia.org/wiki/Captcha|captcha]] will be used during account registration, password recovery, and when a user wants to suggest a url. The captcha type only has an effect if, under the Server Settings activity, Account Registration is not set to Disable Registration. The choices for captcha are: Text Captcha, the user has to select from a series of dropdown answers to questions of the form: Which in the following list is the most/largest/etc.? or Which in the following list is the least/smallest/etc.?; Graphic Captcha, the user needs to enter a sequence of characters from a distorted image; and Hash Captcha, the user's browser (the user doesn't need to do anything) needs to extend a random string with additional characters to get a string whose hash begins with a certain lead set of characters. Of these, Hash Captcha is probably the least intrusive, but it requires Javascript and might run slowly on older browsers. A text captcha might be used to test domain expertise of the people who are registering for an account. Finally, the graphic captcha is probably the one people are most familiar with.

The Captcha and Recovery Questions section of the Security activity provides links to edit the Text Captcha and Recovery Questions for the current locale (you can change the current locale in Settings). In both cases, there is a fixed list of tests you can localize. A single test consists of a more question, a less question, and a comma-separated list of possibilities.
For example,
 Which lives or lasts the longest?
 Which lives or lasts the shortest?
 lightning,bacteria,ant,dog,horse,person,oak tree,planet,star,galaxy
When challenging a user, Yioop picks a subset of tests. For each test, it randomly chooses between the more and less questions. It then picks a subset of the ordered list of choices, randomly permutes them, and presents them to the user in a dropdown.

Yioop's captcha-ing system tries to prevent attacks where a machine quickly tries several possible answers to a captcha. Yioop has an IP address based timeout system (implemented in models/visitor_model.php). Initially, a timeout of one second between requests involving a captcha is in place. An error screen shows up if multiple requests from the same IP address for a captcha page are made within the timeout period. Every mistaken entry of a captcha doubles this timeout period. The timeout period for an IP address is reset on a daily basis back to one second.

[[Documentation#contents|Return to table of contents]].

===Upgrading Yioop===

If you have an older version of Yioop that you would like to upgrade, make sure to back up your data. Then download the latest version of Yioop and unzip it to the location you would like. Set the Search Engine Work Directory by the same method and to the same value as your old Yioop installation. See the Installation section above for instructions on this, if you have forgotten how you did this. Knowing the old Work Directory location should allow Yioop to complete, or instruct you how to complete, the upgrade process.

[[Documentation#contents|Return to table of contents]].

===Summary of Files and Folders===

The Yioop search engine consists of three main scripts:

;'''bin/fetcher.php''': Used to download batches of urls provided by the queue_server.
;'''bin/queue_server.php''': Maintains a queue of urls that are going to be scheduled to be seen. It also keeps track of what has been seen and robots.txt info.
Its last responsibility is to create the index_archive that is used by the search front end.
;'''index.php''': Acts as the search engine web page. It is also used to handle message passing between the fetchers (multiple machines can act as fetchers) and the queue_server.

The file index.php is used when you browse to an installation of a Yioop website. The description of how to use a Yioop web site is given in the sections starting from The Yioop User Interface section. The files fetcher.php and queue_server.php are only connected with crawling the web. If one already has a stored crawl of the web, then you no longer need to run or use these programs. For instance, you might obtain a crawl of the web on your home machine and upload the crawl to an instance of Yioop on the ISP hosting your website. This website could serve search results without making use of either fetcher.php or queue_server.php. To perform a web crawl, however, you need to use both of these programs as well as the Yioop web site. This is explained in detail in the section on [[Documentation#Performing%20and%20Managing%20Crawls|Performing and Managing Crawls]].

The Yioop folder itself consists of several files and sub-folders. The file index.php, as mentioned above, is the main entry point into the Yioop web application. yioopbar.xml is the xml file specifying how to access Yioop as an Open Search Plugin. favicon.ico is used to display the little icon in the url bar of a browser when someone browses to the Yioop site. A URL to the file bot.php is given by the Yioop robot as it crawls websites so that website owners can find out information about who is crawling their sites. Here is a rough guide to what the Yioop folder's various sub-folders contain:

;'''bin''': This folder is intended to hold command-line scripts and daemons which are used in conjunction with Yioop.
In addition to the fetcher.php and queue_server.php scripts already mentioned, it contains: '''arc_tool.php''', '''classifier_tool.php''', '''classifier_trainer.php''', '''code_tool.php''', '''mirror.php''', '''media_updater.php''', and '''query_tool.php'''. arc_tool.php can be used to examine the contents of WebArchiveBundle's and IndexArchiveBundle's from the command line. classifier_tool.php is a command line tool for creating a classifier; it can be used to perform some of the tasks that can also be done through the [[Documentation#Classifying%20Web%20Pages|Web Classifier Interface]]. classifier_trainer.php is a daemon used in the finalization stage of building a classifier. code_tool.php is for use by developers to maintain the Yioop code-base in various ways. mirror.php can be used if you would like to create a mirror/copy of a Yioop installation. media_updater.php can be used to do hourly updates of news feed search sources in Yioop. It also does video conversions of video files into web formats. Finally, query_tool.php can be used to run queries from the command-line.
;{{id="configs" '''configs'''}} : This folder contains configuration files. You will probably not need to edit any of these files directly as you can set the most common configuration settings from within the admin panel of Yioop. The file '''config.php''' controls a number of parameters about how data is stored, how, and how often, the queue_server and fetchers communicate, and which file types are supported by Yioop. '''configure_tool.php''' is a command-line tool which can perform some of the configurations needed to get a Yioop installation running. It is only necessary in some virtual private server settings -- the preferred way to configure Yioop is through the web interface. '''createdb.php''' can be used to create a bare instance of the Yioop database with a root admin user having no password.
This script is not strictly necessary, as the database should be creatable via the admin panel; however, it can be useful if the database isn't working for some reason. createdb.php includes the file '''public_help_pages.php''' from WORK_DIRECTORY/app/configs if present, or from BASE_DIR/configs if not. This file contains the initial rows for the Public and Help group wikis. When upgrading, it is useful to export the changes you have made to these wikis to WORK_DIRECTORY/app/configs/public_help_pages.php. This can be done by running the file '''export_public_help_db.php''', which is in the configs folder.
Also in the configs folder is the file default_crawl.ini. This file is copied to WORK_DIRECTORY after you set this folder in the admin/configure panel. There it is renamed as '''crawl.ini''' and serves as the initial set of sites to crawl until you decide to change these. The file '''token_tool.php''' is a tool which can be used to help in term extraction during crawls and for making tries which can be used for word suggestions for a locale. To help word extraction, this tool can generate in a locale folder (see below) a word bloom filter. This filter can be used to segment strings into words for languages such as Chinese that don't use spaces to separate words in sentences. For trie and segmenter filter construction, this tool can use a file that lists words one per line.
;'''controllers''': The controllers folder contains all the controller classes used by the web component of the Yioop search engine. Most requests coming into Yioop go through the top level index.php file. The query string (the component of the url after the ?) then says who is responsible for handling the request. In this query string there is a part which reads c= ... This says which controller should be used.
The controller uses the rest of the query string, such as the a= variable giving the activity function to call and the arg= variable, to determine which data must be retrieved from which models, and finally which view with what elements on it should be displayed back to the user. Within the controllers folder is a sub-folder, components; a component is a collection of activities which may be added to a controller so that it can handle a request.
+;'''css''': This folder contains the stylesheets used to control how web page tags should look for the Yioop site when rendered in a browser.
+;'''data''': This folder contains a default sqlite database for a new Yioop installation. Whenever the WORK_DIRECTORY is changed it is this database which is initially copied into the WORK_DIRECTORY to serve as the database of allowed users for the Yioop system.
+;'''examples''': This folder contains a file search_api.php whose code gives an example of how to use the Yioop search function API.
+;'''lib''': This folder is short for library. It contains all the common classes for things like indexing, storing data to files, parsing urls, etc. lib contains six subfolders: ''archive_bundle_iterators'', ''classifiers'', ''compressors'', ''index_bundle_iterators'', ''indexing_plugins'', ''processors''. The ''archive_bundle_iterators'' folder has iterators for iterating over the objects of various kinds of web archive file formats, such as arc, wiki-media, etc. These iterators are used to iterate over such archives during a recrawl. The ''classifiers'' folder contains code for training classifiers used by Yioop. The ''compressors'' folder contains classes that might be used to compress objects in a web_archive. The ''index_bundle_iterators'' folder contains a variety of iterators useful for iterating over lists of documents which might be returned during a query to the search engine. The ''processors'' folder contains processors to extract page summaries for a variety of different mimetypes.
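The processors idea, picking a summary extractor based on a page's mimetype, can be illustrated with a small sketch. This is hypothetical Python, not Yioop's PHP processor classes, and the extraction functions are crude stand-ins for Yioop's much richer logic:

```python
# Illustrative sketch: dispatch to a summary extractor by mimetype.
# The extractors and the mimetype table are made up for this example.

def html_summary(page):
    # Strip a couple of tags and truncate; real HTML processing is far richer.
    return page.replace("<p>", "").replace("</p>", "")[:80]

def text_summary(page):
    return page[:80]

PROCESSORS = {
    "text/html":  html_summary,
    "text/plain": text_summary,
}

def summarize(page, mimetype):
    # Fall back to plain-text handling for unknown mimetypes.
    processor = PROCESSORS.get(mimetype, text_summary)
    return processor(page)

print(summarize("<p>Yioop is a PHP search engine.</p>", "text/html"))
# Yioop is a PHP search engine.
```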
+
;'''locale''': This folder contains the default locale data which comes with the Yioop system. A locale encapsulates data associated with a language and region. A locale is specified by an [[http://en.wikipedia.org/wiki/IANA_language_tag|IETF language tag]]. So, for instance, within the locale folder there is a folder en-US for the locale consisting of English in the United States. Within a given locale tag folder there is a file configure.ini which contains translations of string ids to strings in the language of the locale. This approach is the same idea as used in [[http://en.wikipedia.org/wiki/Gettext|Gettext]] .po files. Yioop's approach does not require a compilation step nor a restart of the webserver for translations to appear. On the other hand, it is slower than the Gettext approach, but this could be easily mitigated using a memory cache such as [[http://memcached.org/|memcached]] or APC. Besides the file configure.ini, there is a statistics.txt file which has info about what percentage of the ids have been translated. In addition to configure.ini and statistics.txt, the locale folder for a language contains two sub-folders: pages, containing static html (with extension .thtml) files which might need to be translated, and resources.
The resources folder contains the files: ''locale.js'', which contains locale-specific Javascript code such as the variable alpha, which is used to list out the letters in the alphabet for the language in question for spell check purposes, and roman_array, for mapping between the roman alphabet and the character system of the locale in question; ''suggest-trie.txt.gz'', a Trie data structure used for search bar word suggestions; and ''tokenizer.php'', which can specify the number of characters for this language that constitute a char gram, and might contain a segmenter to split strings into words for this language, a stemmer class used to stem terms for this language, a stopword remover for the centroid summarizer, a part of speech tagger, or a thesaurus lookup procedure for the locale.
+;'''models''': This folder contains the subclasses of Model used by Yioop. Models are used to encapsulate access to secondary storage, i.e., accesses to databases or the filesystem. They are responsible for marshalling/de-marshalling objects that might be stored in more than one table or across several files. The models folder has within it a datasources folder. A datasource is an abstraction layer for the particular filesystem and database system that is being used by a Yioop installation. At present, datasources have been defined for PDO (PHP's generic DBMS interface), sqlite3, and mysql databases.
+;'''resources''': Used to store binary resources such as graphics, video, or audio. For now, just stores the Yioop logo.
+;'''scripts''': This folder contains the Javascript files used by Yioop.
+;'''tests''': This folder contains UnitTests and JavascriptUnitTests for various lib and script components. Yioop comes with its own minimal UnitTest and JavascriptUnitTest classes, which are defined in lib/unit_test.php and lib/javascript_unit_test.php. It also contains a few files used for experiments.
For example, string_cat_experiment.php was used to test which was the faster way to do string concatenation in PHP. many_user_experiment.php can be used to create a test Yioop installation with many users, roles, and groups. Some unit testing of the wiki Help system makes use of [[http://phantomjs.org/|PhantomJS]]. If PhantomJS is not configured, these tests will be skipped. To configure PhantomJS you simply add a define for your path to PhantomJS to your local_config.php file. For example, one might add the define:
+define("PHANTOM_JS", "/usr/local/bin/phantomjs");
+;'''views''': This folder contains View subclasses as well as folders for elements, helpers, and layouts. A View is responsible for taking data given to it by a controller and formatting it in a suitable way. Most Views output a web page; however, some of the views responsible for communication between the fetchers and the queue_server output serialized objects. The elements folder contains Element classes which are typically used to output portions of web pages. For example, the html that allows one to choose an Activity in the Admin portion of the website is rendered by an ActivityElement. The helpers folder contains Helper subclasses. A Helper is used to automate the task of outputting certain kinds of web tags. For instance, the OptionsHelper when given an array can be used to output select tags and option tags using data from the array. The layout folder contains Layout subclasses. A Layout encapsulates the header and footer information for the kind of document a View lives on. For example, web pages on the Yioop site all use the WebLayout class as their Layout. The WebLayout class has a render method for outputting the doctype, open html tag, head of the document including links for style sheets, etc. This method then calls the render methods of the current View, and finally outputs scripts and the necessary closing document tags.
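The Layout/View relationship described above, where a Layout's render method outputs the document head, delegates the body to the current View, then closes the document, can be sketched as follows. This is an illustrative Python sketch of the pattern only, not Yioop's actual PHP classes:

```python
# Minimal sketch of the Layout-wraps-View rendering pattern described above.
# Class and field names here are invented for illustration.

class View:
    def render(self, data):
        # A real View would format much richer data into page markup.
        return "<p>%s</p>" % data["message"]

class WebLayout:
    def __init__(self, view):
        self.view = view

    def render(self, data):
        # Output doctype and head, delegate the body to the View,
        # then emit the closing document tags.
        head = "<!DOCTYPE html><html><head><title>%s</title></head><body>" % \
            data["title"]
        return head + self.view.render(data) + "</body></html>"

page = WebLayout(View()).render({"title": "Search", "message": "Hello"})
print(page)
```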
+

In addition to the Yioop application folder, Yioop makes use of a WORK DIRECTORY. The location of this directory is set during the configuration of a Yioop installation. Yioop stores crawls and other data local to a particular Yioop installation in files and folders in this directory. In the event that you upgrade your Yioop installation you should only need to replace the Yioop application folder and in the configuration process of Yioop tell it where your WORK DIRECTORY is. Of course, it is always recommended to back up one's data before performing an upgrade. Within the WORK DIRECTORY, Yioop stores four main files: profile.php, crawl.ini, bot.txt, and robot_table.txt. Here is a rough guide to what the WORK DIRECTORY's sub-folders contain:

;'''app''': This folder is used to contain your overrides to the views, controllers, models, resources, locale etc. For example, if you wanted to change how the search results were rendered, you could add a views/search_view.php file to the app folder and Yioop would use it rather than the one in the Yioop base directory's views folder. Using the app dir makes it easier to have customizations that won't get messed up when you upgrade Yioop.
;'''cache''': This directory is used to store folders of the form ArchiveUNIX_TIMESTAMP, IndexDataUNIX_TIMESTAMP, and QueueBundleUNIX_TIMESTAMP. ArchiveUNIX_TIMESTAMP folders hold complete caches of web pages that have been crawled. These folders will appear on machines which are running fetcher.php. IndexDataUNIX_TIMESTAMP folders hold a word document index as well as summaries of pages crawled. A folder of this type is needed by the web app portion of Yioop to serve search results. These folders can be moved to whichever machine you want to serve results from. QueueBundleUNIX_TIMESTAMP folders are used to maintain the priority queue during the crawling process.
The queue_server.php program is responsible for creating both IndexDataUNIX_TIMESTAMP and QueueBundleUNIX_TIMESTAMP folders.
+;'''data''': If an sqlite or sqlite3 (rather than say MySQL) database is being used then a seek_quarry.db file is stored in the data folder. In Yioop, the database is used to manage users, roles, locales, and crawls. Data for crawls themselves are NOT stored in the database. Suggest-a-url data is stored in data in the file suggest_url.txt, certain cron information about machines is saved in cron_time.txt, and plugin configuration information can also be stored in this folder.
+;'''locale''': This is generally a copy of the locale folder mentioned earlier. In fact, it is the version that Yioop will try to use first. It contains any customizations that have been done to locales for this instance of Yioop. If you are using a version of Yioop after Yioop 2.0, this folder has been moved to app/locale.
+;'''log''': When the fetcher and queue_server are run as daemon processes, log messages are written to log files in this folder. Log rotation is also done. These log files can be opened in a text editor or console app.
+;'''query''': This folder is used to store caches of already performed queries when file caching is being used.
+;'''schedules''': This folder has four kinds of subfolders: media_convert, IndexDataUNIX_TIMESTAMP, RobotDataUNIX_TIMESTAMP, and ScheduleDataUNIX_TIMESTAMP. The easiest to explain is the media_convert folder. It is used by media_updater.php to store job information about video files that need to be converted. For the other folders, when a fetcher communicates with the web app to say what it has just crawled, the web app writes data into these folders to be processed later by the queue_server. The UNIX_TIMESTAMP is used to keep track of which crawl the data is destined for.
IndexData folders contain mini-inverted indexes (word document records) which are to be added to the global inverted index (called the dictionary) for that crawl. RobotData folders contain information that came from robots.txt files. Finally, ScheduleData folders have data about found urls that could eventually be scheduled to crawl. Within each of these three kinds of folders there are typically many sub-folders, one for each day data arrived, and within these subfolders there are files containing the respective kinds of data.
+;'''search_filters''': This folder is used to store text files containing global after-crawl search filter and summary data. The global search filter allows a user to specify after a crawl is done that certain urls be removed from the search results. The global summary data can be used to edit the summaries for a small number of web pages whose summaries seem inaccurate or inappropriate. For example, some sites like Facebook only allow big search engines like Google to crawl them. Still there are many links to Facebook, so Facebook on an open web crawl will appear, but with a somewhat confused summary based only on link text; the results editor allows one to give a meaningful summary for Facebook.
+;'''temp''': This is used for storing temporary files that Yioop creates during the crawl process, for example, temporary files used while making thumbnails. Each fetcher has its own temp folder, so you might also see folders 0-temp, 1-temp, etc.
+
[[Documentation#contents|Return to table of contents]].

==Search and User Interface==

At this point one hopefully has installed Yioop. If you used one of the [[Install|install guides]], you may also have performed a simple crawl. We are now going to describe some of the basic search features of Yioop as well as the Yioop administration interface.
We will describe how to perform crawls with Yioop in more detail in the [[Documentation#Crawling%20and%20Customizing%20Results|Crawling and Customizing Results]] chapter. If you do not have a crawl available, you can test some of these features on the [[http://www.yioop.com/|Yioop Demo Site]].

===Search Basics===
The main search form for Yioop looks like:

{{class='docs width-three-quarter'
((resource:Documentation:SearchScreen.png|The Search form))
}}

The HTML for this form is in views/search_view.php and the icon is stored in resources/yioop.png. You may want to modify these to incorporate Yioop search into your site. For more general ways to modify the look of these pages, consult the [[Documentation#Building%20a%20Site%20using%20Yioop%20as%20Framework|Building a site using Yioop]] documentation. The Yioop logo on any screen in the Yioop interface is clickable and returns the user to the main search screen. One performs a search by typing a query into the search form field and clicking the Search button. As one is typing, Yioop suggests possible queries; you can click one of these suggestions, or use the up and down arrows to select one, to perform that search.

{{class='docs width-three-quarter'
((resource:Documentation:Autosuggest.png|Example suggestions as you type))
}}

For some non-roman alphabet scripts such as Telugu you can enter words according to how they sound, using roman letters, and get suggestions in the script in question:

{{class="docs"
((resource:Documentation:TeluguAutosuggest.png|Telugu suggestions for roman text))
}}

The [More Statistics] link only shows if under the Admin control panel you clicked on more statistics for the crawl. This link goes to a page showing many global statistics about the web crawl. Beneath this link are the Blog and Privacy links (as well as a link back to the SeekQuarry site). These two links are to static pages which can be customized through the Manage Locale activity.
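The suggestions-as-you-type behavior is backed by a trie, like the suggest-trie.txt.gz files described in the locale resources section. The following is a minimal illustrative sketch of prefix-based suggestion in Python; Yioop's actual trie format, lookup code, and word lists differ:

```python
# Toy trie for prefix word suggestions. Nodes are plain dicts; the "$" key
# marks the end of a complete word. Word list below is made up.

def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True  # marks end of a word
    return root

def suggest(root, prefix, limit=5):
    # Walk down to the node for prefix, then collect words beneath it.
    node = root
    for ch in prefix:
        if ch not in node:
            return []
        node = node[ch]
    results = []
    def walk(node, sofar):
        if len(results) >= limit:
            return
        if "$" in node:
            results.append(prefix + sofar)
        for ch in sorted(k for k in node if k != "$"):
            walk(node[ch], sofar + ch)
    walk(node, "")
    return results

trie = build_trie(["search", "seek", "seekquarry", "settings"])
print(suggest(trie, "see"))  # ['seek', 'seekquarry']
```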
Typical search results might look like:

{{class="docs"
((resource:Documentation:SearchResults.png|Example Search Results))
}}

Thesaurus results might appear to one side and suggest alternative queries based on a thesaurus look up (for English, this is based on Wordnet). The terms next to Words: are a word cloud of important terms in the document. These are created if the indexer uses the centroid summarizer. Hovering over the Score of a search result reveals its component scores. These might include: Rank, Relevance, Proximity, as well as any Use to Rank Classifier scores and Word Net scores (if installed).

{{class="docs"
((resource:Documentation:ScoreToolTip.png|Example Score Components Tool Tip))
}}

If one slightly mistypes a query term, Yioop can sometimes suggest a spelling correction:

{{class="docs"
((resource:Documentation:SearchSpellCorrect.png|Example Search Results with a spelling correction))
}}

Each result back from the query consists of several parts: First comes a title, which is a link to the page that matches the query term. This is followed by a brief summary of that page with the query words in bold. Then the document rank, relevancy, proximity, and overall scores are listed. Each of these numbers is a grouped statistic -- several "micro index entries" are grouped together/summed to create each. So even though a given "micro index entry" might have a document rank between 1 and 10, their sum could be a larger value. Further, the overall score is a generalized inner product of the scores of the "micro index entries", so the separated scores will not typically sum to the overall score. After these scores there are three links: Cached, Similar, and Inlinks. Clicking on Cached will display Yioop's downloaded copy of the page in question. We will describe this in more detail in a moment.
Clicking on Similar causes Yioop to locate the five words with the highest relevancy scores for that document and then to perform a search on those words. Clicking on Inlinks will take you to a page consisting of all the links that Yioop found to the document in question. Finally, clicking on an IP address link returns all documents that were crawled from that IP address.

{{class="docs"
((resource:Documentation:Cache.png|Example Cache Results))
}}

As the above illustrates, on a cache link click, Yioop will display a cached version of the page. The cached version has a link to the original version and the download time at the top. Next there is a link to display all caches of this page that Yioop has in any index. This is followed by a link for extracted summaries, then in the body of the cached document the query terms are highlighted. Links within the body of a cached document first target a cached version of the page that is linked to which is as near into the future of the current cached page as possible. If Yioop doesn't have a cache for a link target then it goes to the location pointed to by that target. Clicking on the history toggle produces the following interface:

{{class="docs"
((resource:Documentation:CacheHistory.png|Example Cache History UI))
}}

This lets you select different caches of the page in question.

Clicking the "Toggle extracted summary" link will show the title, summary, and links that were extracted from the full page and indexed. No other terms on the page are used to locate the page via a search query. This can be viewed as an "SEO" view of the page.

{{class="docs"
((resource:Documentation:CacheSEO.png|Example Cache SEO Results))
}}

It should be noted that cached copies of web pages are stored on the fetcher which originally downloaded the page. The IndexArchive associated with a crawl is stored on the queue server and can be moved around to any location by simply moving the folder.
However, if an archive is moved off the network on which the fetcher lives, then the look up of a cached page might fail.

In addition to a straightforward web search, one can also do image, video, and news searches by clicking on the Images, Video, or News links in the top bar of Yioop search pages. Below are some examples of what these look like for a search on "Obama":

{{class="docs"
((resource:Documentation:ImageSearch.png|Example Image Search Results))
((resource:Documentation:VideoSearch.png|Example Video Search Results))
((resource:Documentation:NewsSearch.png|Example News Search Results))
}}

When Yioop crawls a page it adds one of the following meta words to the page: media:text, media:image, or media:video. RSS (or Atom) feed sources that have been added to Media Sources under the [[Documentation#Search%20Sources|Search Sources]] activity are downloaded each hour. Each RSS item on such downloaded pages has the meta word media:news added to it. A usual web search just takes the search terms provided to perform a search. An Images, Video, or News search tacks media:image, media:video, or media:news onto the search terms. Detection of images is done via mimetype at initial page download time. At this time a thumbnail is generated. When search results are presented it is this cached thumbnail that is shown. So image search does not leak information to third party sites. On any search results page with images, Yioop tries to group the images into a thumbnail strip. This is true of both normal and image search result pages. In the case of image search result pages, except for not-yet-downloaded pages, this results in almost all of the results being the thumbnail strip. Video page detection is not done through mimetype as popular sites like YouTube, Vimeo, and others vary in how they use Flash or video tags to embed video on a web page.
Yioop uses the Video Media sources that have been added in the Search Sources activity to detect whether a link is in the format of a video page. To get a thumbnail for the video it again uses the method for rewriting the video url to an image link specified for the particular site in question in Search Sources. i.e., the thumbnail will be downloaded from the original site. '''This could leak information to third party sites about your search.'''

The format of News search results is somewhat different from usual search results. News search results can appear during a normal web search, in which case they will appear clustered together, with a leading link "News results for ...". No snippets will be shown for these links, but the original media source for the link will be displayed and the time at which the item first appeared will be displayed. On the News subsearch page, underneath the link to the item, the complete RSS description of the news item is displayed. In both settings, it is possible to click on the media source name next to the news item link. This will take one to a page of search results listing all articles from that media source. For instance, if one were to click on the Yahoo News text above one would go to results for all Yahoo News articles. This is equivalent to doing a search on: media:news:Yahoo+News . If one clicks on the News subsearch, not having specified a query yet, then all stored news items in the current language will be displayed, roughly ranked by recentness. If one has RSS media sources which are set to be from different locales, then this will be taken into account on this blank query News page.

[[Documentation#contents|Return to table of contents]].


===Search Tools Page===

As one can see from the image of the main search form shown previously, the footer of each search and search result page has several links.
Blog takes one to the group feed of the built in PUBLIC group which is editable from the root account, Privacy takes one to the Yioop installation's privacy policy, and Terms takes one to the Yioop installation's terms of service. The YioopBot link takes one to a page describing the installation's web crawler. These static pages are all Wiki pages of the PUBLIC group and can be edited by the root account. The Tools link takes one to the following page:

{{class="docs"
((resource:Documentation:SearchTools.png|Search Tools Page))
}}

Beneath the Other Search Sources section is a complete listing of all the search sources that were created using [[Documentation#Search%20Sources|Search Sources]]. This might be more than just the Images, Video, and News that come by default with Yioop. The My Account section of this page gives another set of links for signing into, modifying the settings of, and creating an account. The Other Tools section has a link to the form below where users can suggest links for the current or future crawls.

{{class="docs"
((resource:Documentation:SuggestAUrl.png|Suggest A Url Form))
}}

This link only appears if under Server Settings, Account Registration is not set to Disable registration. The Wiki Pages link under Other Tools takes one to a searchable list of all Wiki pages of the default PUBLIC group.

[[Documentation#contents|Return to table of contents]].

===Search Operators===

Turning now to the topic of how to enter a query in Yioop: A basic query to the Yioop search form is typically a sequence of words separated by whitespace. This will cause Yioop to compute a "conjunctive query"; it will look up only those documents which contain all of the terms listed. Yioop also supports a variety of other search box commands and query types:

* '''#num#''' in a query are treated as query presentation markers. When a query is first parsed, it is split into columns with ''#num#'' as the column boundary.
For example, bob #2# bob sally #3# sally #1#. A given column is used to present ''#num#'' results, where ''#num#'' is what is between the hash marks immediately after it. So in the query above, the subquery ''bob'' is used for the first two search results, then the subquery ''bob sally'' is used for the next three results, and finally the last column is always used for any remaining results. In this case, the subquery ''sally'' would be used for all remaining results even though its ''#num#'' is 1. If a query does not have any #num#'s it is assumed that it has only one column.
+* Separating query terms with a vertical bar | results in a disjunctive query. These are parsed for after the presentation markers above. So a search on: ''Chris | Pollett'' would return pages that have either the word ''Chris'' or the word ''Pollett'' or both.
+* Putting the query in quotes, for example "Chris Pollett", will cause Yioop to perform an exact match search. Yioop in this case would only return documents that have the string "Chris Pollett" rather than just the words Chris and Pollett possibly not next to each other in the document. Also, using the quote syntax, you can perform searches such as "Chris * Homepage" which would return documents which have the word Chris followed by some text followed by the word Homepage.
+* If the query has at least one word not prefixed by -, then adding a '-' in front of a word in a query means search for results not containing that term. So a search on: ''of -the'' would return results containing the word "of" but not containing the word "the".
+* Searches of the forms: '''related:url''', '''cache:url''', '''link:url''', '''ip:ip_address''' are equivalent to having clicked on the Similar, Cached, InLinks, IP address links, respectively, on a summary with that url and ip address.
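The column-splitting behavior of the #num# presentation markers can be sketched as follows. This Python sketch is illustrative only and is not Yioop's query parser; it just shows how a query such as the one above splits into (subquery, result count) columns, with the final column handling any remaining results:

```python
# Split a query into presentation columns at #num# markers.
import re

def parse_columns(query):
    # With a capturing group, re.split keeps the matched numbers in the
    # result, so parts alternates: subquery, num, subquery, num, ...
    parts = re.split(r'#(\d+)#', query)
    columns = []
    for i in range(0, len(parts) - 1, 2):
        columns.append((parts[i].strip(), int(parts[i + 1])))
    if parts[-1].strip():  # trailing query text after the last marker
        columns.append((parts[-1].strip(), None))
    return columns

print(parse_columns("bob #2# bob sally #3# sally #1#"))
# [('bob', 2), ('bob sally', 3), ('sally', 1)]
```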
+

The remaining query types we list in alphabetical order:

;'''code&#58;http_error_code''' : returns the summaries of all documents downloaded with that HTTP response code. For example, code:404 would return all summaries where the response was a Page Not Found error.
;'''date&#58;Y, date&#58;Y-m, date&#58;Y-m-d, date&#58;Y-m-d-H, date&#58;Y-m-d-H-i, date&#58;Y-m-d-H-i-s''' : returns summaries of all documents crawled on the given date. For example, ''date:2011-01'' returns all documents crawled in January, 2011. As one can see, detail goes down to the second level, so one can have an idea about how frequently the crawler is hitting a given site at a given time.
;'''dns&#58;num_seconds''' : returns summaries of all documents whose DNS lookup time was between num_seconds and num_seconds + 0.5 seconds. For example, dns:0.5.
;'''filetype&#58;extension''': returns summaries of all documents found with the given extension. So a search: Chris Pollett filetype&#58;pdf would return all documents containing the words Chris and Pollett and with extension pdf.
;'''host&#58;all''': returns summaries of all domain level pages (pages where the path was /).
;'''index&#58;timestamp or i&#58;timestamp''' : causes the search to make use of the IndexArchive with the given timestamp. So a search like: ''Chris Pollett i&#58;1283121141 | Chris Pollett'' takes results from the index with timestamp 1283121141 for Chris Pollett and unions them with results for Chris Pollett in the default index.
;'''if&#58;keyword!add_keywords_on_true!add_keywords_on_false''' : checks the current conjunctive query clause for "keyword"; if present, it adds "add_keywords_on_true" to the clause, else it adds the keywords "add_keywords_on_false". This meta word is typically used as part of a crawl mix. The else condition does not need to be present.
As an example, ''if&#58;oracle!info&#58;http://oracle.com/!site&#58;none'' might be added to a crawl mix so that if a query had the keyword oracle then the site http://oracle.com/ would be returned by the given query clause. As part of a larger crawl mix this could be used to make oracle's homepage appear at the top of the query results. If you would like to inject multiple keywords then separate the keywords using plus rather than white space. For example, if:corvette!fast+car.
+;'''info&#58;url''' : returns the summary in the Yioop index for the given url only. For example, one could type info:http://www.yahoo.com/ or info:www.yahoo.com to get the summary for just the main Yahoo! page. This is useful for checking if a particular page is in the index.
+;'''lang&#58;IETF_language_tag''' : returns summaries of all documents whose language can be determined to match the given language tag. For example, ''lang:en-US''.
+;'''media&#58;kind''' : returns summaries of all documents found of the given media kind. Currently, text, image, news, and video are the four supported media kinds. So one can add to the search terms ''media:image'' to get only image results matching the query keywords.
+;'''mix&#58;name or m&#58;name''' : tells Yioop to use the crawl mix "name" when computing the results of the query. The section on mixing crawl indexes has more details about crawl mixes. If the name of the original mix had spaces, for example, cool mix, then to use the mix you would need to replace the spaces with plusses, ''m:cool+mix''.
+;'''modified&#58;Y, modified&#58;Y-M, modified&#58;Y-M-D''' : returns summaries of all documents which were last modified on the given date. For example, modified:2010-02 returns all documents which were last modified in February, 2010.
+;'''no&#58;some_command''' : is used to tell Yioop not to perform some default transformation of the search terms.
For example, ''no:guess'' tells Yioop not to try to guess the semantics of the search before doing the search. This would mean, for instance, that Yioop would not rewrite the query ''yahoo.com'' into ''site:yahoo.com''. ''no:network'' tells Yioop to only return search results from the current machine and not to send the query to all machines in the Yioop instance. ''no:cache'' says to recompute the query and not to make use of memcache or file cache.
+;'''numlinks&#58;some_number''': returns summaries of all documents which had some_number of outgoing links. For example, numlinks:5.
+;'''os&#58;operating_system''': returns summaries of all documents served on servers using the given operating system. For example, ''os:centos''; make sure to use lowercase.
+;'''path&#58;path_component_of_url''': returns summaries of all documents whose path component begins with path_component_of_url. For example, ''path:/phpBB'' would return all documents whose path started with phpBB, ''path:/robots.txt'' would return summaries for all robots.txt files.
+;'''raw&#58;number''' : controls whether or not Yioop tries to do deduplication on results and whether links and pages for the same url should be grouped. Any number greater than zero says don't do deduplication.
+;'''robot&#58;user_agent_name''' : returns robots.txt pages that contained that user_agent_name (after lower casing). For example, ''robot:yioopbot'' would return all robots.txt pages explicitly having a rule for YioopBot.
+;'''safe&#58;boolean_value''' : is used to provide "safe" or "unsafe" search results. Yioop has a crude, "hand-tuned", linear classifier for whether a site contains pornographic content. If one adds safe:true to a search, only those pages found which were deemed non-pornographic will be returned. Adding safe:false has the opposite effect.
+;'''server&#58;web_server_name''' : returns summaries of all documents served on that kind of web server. For example, ''server:apache''.
+
;'''site&#58;url, site&#58;host, or site&#58;domain''': returns all of the summaries of pages found at that url, host, or domain. As an example, ''site:http://prints.ucanbuyart.com/lithograph_art.html'', ''site:http://prints.ucanbuyart.com/'', ''site:prints.ucanbuyart.com'', ''site:.ucanbuyart.com'', site:ucanbuyart.com, site:com, will all return results, with decreasing specificity. To return all pages and links to pages in the Yioop index, you can do ''site:any''. To return all pages (as opposed to pages and links to pages) listed in a Yioop index you can do ''site:all''. ''site:all'' doesn't return any links, so you can't group links to urls and pages of that url together. If you want all sites where one has a page in the index as well as links to that site, then you can do ''site:doc''.
;'''size&#58;num_bytes''': returns summaries of all documents whose download size was between num_bytes and num_bytes + 5000. num_bytes must be a multiple of 5000. For example, ''size:15000''.
;'''time&#58;num_seconds''' : returns summaries of all documents whose download time excluding DNS lookup time was between num_seconds and num_seconds + 0.5 seconds. For example, ''time:1.5''.
;'''version&#58;version_number''' : returns summaries of all documents served on web servers with the given version number. For example, one might have a query ''server:apache version:2.2.9''.
;'''weight&#58;some_number or w&#58;some_number''' : has the effect of multiplying all scores for this portion of a query by some_number. For example, ''Chris Pollett | Chris Pollett site:wikipedia.org w:5'' would multiply scores satisfying ''Chris Pollett'' and on ''wikipedia.org'' by 5 and union these with those satisfying ''Chris Pollett''.

Although we didn't say it next to each query form above, if it makes sense, there is usually an ''all'' variant to a form. For example, ''os:all'' returns all documents from servers for which os information appeared in the headers.
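The conjunctive and disjunctive semantics described at the start of this section can be illustrated with a toy inverted index. This Python sketch is illustrative only; the index contents are made up, and Yioop's actual evaluation (scoring, grouping, meta words) is far more involved:

```python
# Toy inverted index: term -> set of document ids (contents invented).
index = {
    "chris":   {1, 2, 3},
    "pollett": {2, 3, 4},
    "yioop":   {3, 5},
}

def conjunctive(terms):
    # A plain query intersects the posting lists of all its terms.
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def disjunctive(clauses):
    # Clauses separated by | union the results of each conjunctive clause.
    result = set()
    for clause in clauses:
        result |= conjunctive(clause.split())
    return result

print(sorted(conjunctive(["chris", "pollett"])))       # [2, 3]
print(sorted(disjunctive(["chris pollett", "yioop"])))  # [2, 3, 5]
```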
+
+===Result Formats===
+In addition to using the search form interface to query Yioop, it is also possible to query Yioop and get results in Open Search RSS format. To do that you can either directly type a URL into your browser of the form:
+ http://my-yioop-instance-host/?f=rss&q=query+terms
+Or you can write AJAX code that makes requests of URLs in this format. Although there is no official Open Search JSON format, one can get a JSON object with the same structure as the RSS search results using a query to Yioop such as:
+ http://my-yioop-instance-host/?f=json&q=query+terms
+
+[[Documentation#contents|Return to table of contents]].
+
+===Settings===
+
+In the corner of the page with the main search form is a Settings-Signin element:
+
+{{class="docs"
+((resource:Documentation:SettingsSignin.png|Settings Sign-in Element))
+}}
+
+This element provides access for a user to change their search settings by clicking Settings. The Sign In link provides access to the Admin and User Accounts panels for the website. Clicking the Sign In link also takes one to a page where one can register for an account if Yioop is set up to allow user registration.
+
+{{class="docs"
+((resource:Documentation:Settings.png|The Settings Form))
+}}
+
+On the Settings page, there are currently three items which can be adjusted: the number of results per page when doing a search, the language Yioop should use, and the particular search index Yioop should use. When a user clicks save, the data is stored by Yioop. The user can then click "Return to Yioop" to go back to the search page. Thereafter, interaction with Yioop will make use of any changes to these settings. Data is stored in Yioop and associated with a given user via a cookie mechanism. In order for this to work, the user's browser must allow cookies to be set. This is usually the default for most browsers; however, it can sometimes be disabled, in which case the browser option must be changed back to the default for Settings to work correctly.
It is possible to control some of these settings by adding query parameters to the URL. For instance, adding &l=fr-FR to the URL query string (the portion of the URL after the question mark) would tell Yioop to use French (as spoken in France) for outputting text. You can also add &its= followed by the Unix timestamp of the search index you want.
+
+[[Documentation#contents|Return to table of contents]].
+
+===Mobile Interface===
+
+Yioop's user interface is designed to display reasonably well on tablet devices such as the iPad. For smart phones, such as iPhone, Android, Blackberry, or Windows Phone, Yioop has a separate user interface. For search, settings, and login, this looks fairly similar to the non-mobile user interface:
+
+{{class="docs"
+((resource:Documentation:MobileSearch.png|Mobile Search Landing Page))
+((resource:Documentation:MobileSettings.png|Mobile Settings Page))
+((resource:Documentation:MobileSignin.png|Mobile Admin Panel Login))
+}}
+
+For Admin pages, each activity is controlled in an analogous fashion to the non-mobile setting, but the Activity element has been replaced with a dropdown:
+
+{{class="docs"
+((resource:Documentation:MobileAdmin.png|Example Mobile Admin Activity))
+}}
+
+We now resume our discussion of how to use each of the Yioop admin activities for the default, non-mobile, setting, simply noting that except for the above minor changes, these instructions will also apply to the mobile setting.
+
+[[Documentation#contents|Return to table of contents]].
+
+==User Accounts and Social Features==
+===Registration and Signin===
+
+Clicking on the Sign In link on the corner of the Yioop web site will bring up the following form:
+
+{{class="docs"
+((resource:Documentation:SigninScreen.png|Admin Panel Login))
+}}
+
+Correctly entering a username and password will then bring the user to the User Account portion of the Yioop website. Each Account page has on it an Activity element as well as a main panel where the current activity is displayed.
The Activity element allows the user to choose what is the current activity for the session. The choices available on the Activity element depend on the roles the user has. A default installation of Yioop comes with two predefined roles: Admin and User. If someone has the Admin role, then the Activity element looks like:
+
+{{class="docs"
+((resource:Documentation:AdminActivityElement.png|Admin Activity Element))
+}}
+
+On the other hand, if someone just has the User role, then their Activity element looks like:
+
+{{class="docs"
+((resource:Documentation:UserActivityElement.png|User Activity Element))
+}}
+
+Over the next several sections we will discuss each of the Yioop account activities in turn. Before we do that, we make a couple of remarks about using Yioop from a mobile device.
+
+[[Documentation#contents|Return to table of contents]].
+
+===Managing Accounts===
+
+By default, when a user first signs in to the Yioop admin panel, the current activity is the Manage Account activity. This activity just lets users change their account information using the form pictured below. It also has summary information about Crawls and Indexes (Admin account only), Groups and Feeds, and Crawl mixes. There are also helpful links from each of these sections to a related activity for managing them.
+
+{{class="docs"
+((resource:Documentation:ManageAccount.png|Manage Account Page))
+}}
+
+Initially, the Account Details fields are grayed out. To edit them, or to edit the user icon next to them, click the Edit link next to Account Details. A new user icon can either be selected by clicking the Choose link underneath it, or by dragging and dropping an icon into the image area. The user's password must be entered correctly into the password field for changes to take effect when Save is clicked. Clicking the Lock link will cause these details to be grayed out and not editable again.
+
+{{class="docs"
+((resource:Documentation:ChangeAccountInfo.png|Change Account Information Form))
+}}
+
+If a user wants to change their password, they can click the Password link label for the password field. This reveals the following additional form fields where the password can be changed:
+
+{{class="docs"
+((resource:Documentation:ChangePassword.png|Change Password Form))
+}}
+
+[[Documentation#contents|Return to table of contents]].
+
+===Managing Users, Roles, and Groups===
+
+The Manage Users, Manage Groups, and Manage Roles activities have similar looking forms as well as related functions. All three of these activities are available to accounts with the Admin role, but only Manage Groups is available to those with a standard User role. To describe these activities, let's start at the beginning... Users are people who have accounts to connect with a Yioop installation. Users, once logged in, may engage in various Yioop activities such as Manage Crawls, Mix Crawls, and so on. A user is not directly assigned which activities they have permissions on. Instead, they derive their permissions from which roles they have been directly assigned and from which groups they belong to. When first launched, the Manage Users activity looks like:
+
+{{class="docs"
+((resource:Documentation:AddUser.png|The Add User form))
+}}
+
+The purpose of this activity is to allow an administrator to add, monitor, and modify the accounts of users of a Yioop installation. At the top of the activity is the "Add User" form. This allows an administrator to add a new user to the Yioop system. Most of the fields on this form are self-explanatory except the Status field, which we will describe in a moment. Beneath this is a User List table. At the top of this table is a dropdown used to control how many users to display at one time. If there are more than that many users, there will be arrow links to page through the user list.
There is also a search link which can be used to bring up the following Search User form:
+
+{{class="docs"
+((resource:Documentation:SearchUser.png|The Search User form))
+}}
+
+This form can be used to find and sort various users out of the complete User List. If we look at the User List, the first four columns, Username, First Name, Last Name, and Email Address, are pretty self-explanatory. The Status column has a dropdown for each user row; this dropdown also appears in the Add User form. It represents the current status of the user and can be either Inactive, Active, or Banned. An Inactive user is typically a user that has used the Yioop registration form to sign up for an account, but who hasn't had the account activated by the administrator, nor had the account activated by using an email link. Such a user can't create or post to groups or log in. On the other hand, such a user has reserved that username so that other people can't use it. A Banned user is a user who has been banned from logging in, but might have groups or posts that the administrator wants to retain. Selecting a different dropdown value changes that user's status. Next to the Status column are two action columns which can be used to edit a user or to delete a user. Deleting a user deletes their account, any groups that the user owns, and any posts the user made. The Edit User form looks like:
+
+{{class="docs"
+((resource:Documentation:EditUser.png|The Edit User form))
+}}
+
+This form lets you modify some of the attributes of a user. There are also two links on it: one with the number of roles that a user has, the other with the number of groups that a user has. Here the word "role" means a set of activities. Clicking on one of these links brings up a paged listing of the particular roles/groups the user has/belongs to. It will also let you add or delete roles/groups.
Adding a role to a user means that the user can do the set of activities that the role contains; adding a group to the user means the user can read that group, and, if the privileges for non-owners allow posting, the user can also post or comment to that group's feed and edit the group's wiki. This completes the description of the Manage Users activity.
+
+Roles are managed through the Manage Roles activity, which looks like:
+
+{{class="docs"
+((resource:Documentation:AddRole.png|The Add Role form))
+}}
+
+Similar to the Manage Users activity, at the top of this activity there is an Add Role form, and beneath this a Role List. The controls of the Role List operate in much the same fashion as those of the User List described earlier. Clicking on the Edit link of a role brings up a form which looks like:
+
+{{class="docs"
+((resource:Documentation:EditRole.png|The Edit Role form))
+}}
+
+In the above, we have a Localizer role. We might have created this role, then used the Select Activity dropdown to add all the activities of the User role. A localizer is a person who can localize Yioop to a new language. So we might then want to use the Select dropdown to add Manage Locales to the list of activities. Once we have created a role that we like, we can then assign users that role and they will be able to perform all of the activities listed on it. If a user has more than one role, then they can perform an activity as long as it is listed in at least one role.
+
+Groups are collections of users that have access to a group feed and a set of wiki pages. Groups are managed through the Manage Groups activity, which looks like:
+
+{{class="docs"
+((resource:Documentation:ManageGroups.png|The Manage Groups form))
+}}
+
+Unlike Manage Users and Manage Roles, the Manage Groups activity belongs to the standard User role, allowing any user to create and manage groups. As one can see from the image above, the Create/Join Group form takes the name of a group.
If you enter a name that currently does not exist, the following form will appear:
+
+{{class="docs"
+((resource:Documentation:CreateGroup.png|The Create Group form))
+}}
+
+The user who creates a group is set as the initial group owner.
+
+The '''Register dropdown''' says how other users are allowed to join the group: '''No One''' means no other user can join the group (you can still invite other users); '''By Request''' means that other users can request that the group owner let them join the group, but the group is not publicly visible in the browsable group directory; '''Public Request''' is the same as By Request, but the group is publicly visible in the browsable group directory; and '''Anyone''' means all users are allowed to join the group and the group appears in the browsable directory of groups. It should be noted that the root account can always join and browse for any group. The root account can also always take over ownership of any group.
+
+The '''Access dropdown''' controls how users other than the owner who belong/subscribe to a group can access that group. The possibilities are: '''No Read''' means that non-members of the group cannot read the group feed or wiki, while a non-owner member of the group can read but not write the group news feed and wiki; '''Read''' means that a non-member of the group can read the group news feed and the group's wiki page, but non-owners cannot write the feed or wiki; '''Read Comment''' means that a non-owner member of the group can read the group feed and wikis and can comment on any existing threads, but cannot start new ones; '''Read Write''' means that a non-owner member of the group can start new threads and comment on existing ones in the group feed, but cannot edit the group's wiki; finally, '''Read Write Wiki''' is like Read Write, except a non-owner member can also edit the group's wiki. The access to a group can be changed by the owner after a group is created.
No Read and Read are often suitable if a group's owner wants to perform some kind of moderation. Read and Read Comment groups are often suitable if someone wants to use a Yioop group as a blog. Read Write makes sense for a more traditional bulletin board.
+
+The '''Voting dropdown''' controls to what degree users can vote on posts. '''No Voting''' means group feed posts cannot be voted on; '''+ Voting''' means that a post can be voted up but not down; and '''+/- Voting''' means a post can be voted up or down. Yioop restricts a user to at most one vote per post.
+
+The '''Post Lifetime dropdown''' controls how long a group feed post is retained by the Yioop system before it is automatically deleted. The possible values are '''Never Expires''', '''One Hour''', '''One Day''', or '''One Month'''.
+
+A default installation of Yioop has two built-in groups, '''Public''' and '''Help''', both owned by root. Public has Read access; all users are automatically subscribed to it and cannot unsubscribe from it. It is useful for general announcements, and its wiki can be used as part of building a site for Yioop. The Help group's wiki is used to maintain all the wiki pages related to Yioop's integrated help system. When a user clicks on the help icon [?], the page that is presented in blue comes from this wiki. This group's registration is by default Public Request and its access is Read Write Wiki.
+
+If, on the Create/Join Group form, the name of a group entered already exists but is not joinable, then an error message that the group's name is in use is displayed. If either anyone can join the group or the group can be joined by request, then that group will be added to the list of subscribed-to groups. If membership is by request, then initially in the list of groups it will show up with access Request Join.
+
+Beneath the Create/Join Group form is the Groups List table.
This lists all the groups that a user is currently subscribed to:
+
+{{class="docs"
+((resource:Documentation:GroupsList.png|Groups List Table))
+}}
+
+The controls at the top of this table are similar in functionality to the controls we have already discussed for the User List table of Manage Users and the Role List table of Manage Roles. This table lets a user manage their existing groups, but does not let a user see what groups already exist. If one looks back at the Create/Join Group form though, one can see next to it there is a link "Browse". Clicking this link takes one to the Discover Groups form and the Not Subscribed to Groups table:
+
+{{class="docs"
+((resource:Documentation:BrowseGroups.png|The Browse Groups form))
+}}
+
+If a group is subscribable, then the Join link in the Actions column of the Not Subscribed to Groups table should be clickable. Let's briefly now consider the other columns of either the Groups List or Not Subscribed to Groups table. The Name column gives the name of the group. Group names are unique identifiers for a group on a Yioop system. In the Groups List table the name is clickable and takes you to the group feed for that group. The Owner column gives the username of the owner of the group. If you are the root account or the owner of the group, then this field should be a clickable link that takes you to the following form:
+
+{{class="docs"
+((resource:Documentation:TransferGroup.png|The Transfer Group form))
+}}
+
+that can be used to transfer the ownership of a group. The next two columns give the register and access information for the group. If you are the owner of the group, these will be dropdowns allowing you to change these settings. We have already explained what the Join link does in the actions column.
Other links which can appear in the actions column are Unsubscribe, which lets you leave a group which you have joined but are not the owner of; Delete, which, if you are the owner of a group, lets you delete the group, its feed, and all its wiki pages; and Edit, which displays the following form:
+
+{{class="docs"
+((resource:Documentation:EditGroup.png|The Edit Group form))
+}}
+
+The Register, Access, Voting, and Post Lifetime dropdowns let one modify the registration, group access, voting, and post lifetime properties for the group, which we have already described before. Next to the Members table header is a link with the number of current members of the group. Clicking this link expands this area into a listing of users in the group as seen above. This allows one to change the access of different members to the group, for example, approving a join request or banning a user. It also allows one to delete a member from a group. Beneath the user listing is a link which can take one to a form to invite more users.
+
+===Feeds and Wikis===
+
+The initial screen of the Feeds and Wikis page has an integrated list of all the recent posts to any groups to which a user subscribes:
+
+{{class="docs"
+((resource:Documentation:FeedsWikis.png|The Main Feeds and Wiki Page))
+}}
+
+The arrow icon next to the feed allows one to collapse the Activities element to free up screen real estate for the feed. Once collapsed, an arrow icon pointing in the opposite direction will appear to let you show the Activities element again.
Next to the Group Activity header at the top of the page are two icons:
+
+{{class="docs"
+((resource:Documentation:GroupingIcons.png|The Group Icons))
+}}
+
+These control whether the Feed View above has posts in order of time or if posts are arranged by group as below:
+
+{{class="docs"
+((resource:Documentation:FeedsWikis2.png|The Grouped Feeds and Wiki Page))
+}}
+
+
+Going back to the original feed view above, notice posts are displayed with the most recent post at the top. If there has been very recent activity (within the last five minutes), this page will refresh every 15 seconds for up to twenty minutes, checking for new posts. Each post has a title which links to a thread for that post. This is followed by the time when the post first appeared and the group title. This title, although gray, can be clicked to go to that particular group feed. If the user has the ability to start new threads in a group and one is in single feed mode, an icon with a plus-sign and a pencil appears next to the group name, which when clicked allows a user to start a new thread in that group. Beneath the title of the post is the username of the person who posted. Again, this is clickable and will take you to a page of all recent posts of that person. Beneath the username is the content of the post. On the opposite side of the post box may appear links to Edit or X (delete) the post, as well as a link to comment on a post. The Edit and X (delete) links only appear if you are the poster or the owner of the group the post was made in. The Comment link lets you make a follow-up post to that particular thread in that group. For example, for the "I just learned an interesting thing!" post above, the current user could start a new thread by clicking the plus-pencil icon or comment on this post by clicking the Comment link. If you are not the owner of a group, then the Comment and Start a New Thread links only appear if you have the necessary privileges on that group.
+
+The image below shows what happens when one clicks on a group link, in this case, the Chris Blog link.
+
+{{class="docs"
+((resource:Documentation:SingleGroupFeed.png|A Single Group Feed))
+}}
+
+On the opposite side of the screen there is a link, My Group Feeds, which lets one go back to the previous screen. At the top of this screen is the clickable title of the group, in this case, Chris Blog; this takes one to the Manage Groups activity, where properties of this group could be examined. Next we see a toggle between Feed and Wiki. Currently, we are on the group feed page; clicking Wiki would take one to the Main page of the wiki. Posts in the single group view are grouped by thread, with the thread containing the most recent activity at the top. Notice next to each thread link there is a count of the number of posts to that thread. The content of the thread post is the content of the starting post to the thread; to see later comments one has to click the thread link. There is now a Start New Thread button at the top of the single group feed, as it is clear which group the thread will be started in. Clicking this button would reveal the following form:
+
+{{class="docs"
+((resource:Documentation:StartNewThread.png|Starting a new thread from a group feed))
+}}
+
+Adding a Subject and Post to this form and clicking Save would start a new thread. Posts can make use of [[Syntax|Yioop's Wiki Syntax]] to achieve effects like bolding text, etc. The icons above the text area can be used to quickly add this mark-up to selected text. Although the icons are relatively standard, hovering over an icon will display a tooltip which should aid in figuring out what it does. Beneath the text area is a dark gray area with instructions on how to add resources to a page, such as images, videos, or other documents. Images and videos will appear embedded in the text of the post when it is saved; for other media, a link to the resource will appear when the post is saved.
The size allowed for uploaded media is determined by your PHP instance's php.ini configuration file's values for post_max_size and upload_max_filesize. Yioop uses the value of the constant MAX_VIDEO_CONVERT_SIZE, set in configs/local_config.php or configs/config.php, to determine if a video should be automatically converted to the two web-friendly formats mp4 and webm. This conversion only happens if, in addition, [[http://ffmpeg.org/|FFMPEG]] has been installed and the path to it has been given as a constant FFMPEG in either configs/local_config.php or configs/config.php.
+
+Clicking the comment link of any existing thread reveals the following form to add a comment to that thread:
+
+{{class="docs"
+((resource:Documentation:AddComment.png|Adding a comment to an existing thread))
+}}
+
+Below we see an example of the feed page we get after clicking on the My First Blog Post thread in the Chris Blog group:
+
+{{class="docs"
+((resource:Documentation:FeedThread.png|A Group Feed Thread))
+}}
+
+Since we are now within a single thread, there is no Start New Thread button at the top. Instead, we have a Comment button at the top and bottom of the page. The starting post of the thread is listed first and the most recent post is listed last (paging buttons, on both the group and single thread pages, let one jump to the last post). The next image below is an example of the feed page one gets when one clicks on a username link, in this case, cpollett:
+
+{{class="docs"
+((resource:Documentation:UserFeed.png|User Feed))
+}}
+
+Single Group, Thread, and User feeds of groups which anyone can join (i.e., public groups) all have RSS feeds which could be used in a news aggregator or crawled by Yioop. To see what the link would be for the item you are interested in, first collapse the activity element if it's not collapsed (i.e., click the [<<] link at the top of the page). Take the URL in the browser's url bar, and add &f=rss to it.
It is okay to remove the YIOOP_TOKEN= variable from this URL. Doing this for the cpollett user feed, one gets the url:
+ http://www.yioop.com/?c=group&a=groupFeeds&just_user_id=4&f=rss
+whose RSS feed looks like:
+
+{{class="docs"
+((resource:Documentation:UserRssFeed.png|User Rss Feed))
+}}
+
+As we mentioned above when we described the single group feed page, if we click on the Wiki link at the top, we go to the Main wiki page of the group, where we could read that page. If the Main wiki page (or, for that matter, any wiki page we go to) does not exist, then we would see something like the following:
+
+{{class="docs"
+((resource:Documentation:NonexistantPage.png|Screenshot of going to the location of a non-existent wiki page))
+}}
+
+This page might be slightly different depending on whether the user has write access to the given group. The [[Syntax|Wiki Syntax Guide]] link in the above takes one to a page that describes how to write wiki pages. The Edit link referred to in the above looks slightly different and is in a slightly different location depending on whether we are viewing the page with the Activity element collapsed or not. If the Activity element is not collapsed, then it appears as one of three links within the current activity as:
+
+{{class="docs"
+((resource:Documentation:AdminHeading.png|Read Edit Page Headings on Admin view))
+}}
+
+On the other hand, if the Activity element is collapsed, then it appears on the navigation bar at the top of the screen as:
+
+{{class="docs"
+((resource:Documentation:GroupHeading.png|Read Edit Page Headings on Group view))
+}}
+
+Notice that besides editing a page there is a link to read the page and a link Pages. The Pages link takes us to a screen where we can see all the pages that have been created for a group:
+
+{{class="docs"
+((resource:Documentation:WikiPageList.png|List of Groups Wiki Pages))
+}}
+
+The search bar can be used to search within the titles of wiki pages of this group for a particular page.
Suppose now we clicked on Test Page in the above; then we would go to that page, initially in Read view:
+
+{{class="docs"
+((resource:Documentation:WikiPage.png|Example Viewing a Wiki Page))
+}}
+
+If we have write access, and we click the Edit link for this page, we would see the following edit page form:
+
+{{class="docs"
+((resource:Documentation:EditWikiPage.png|Editing a Wiki Page))
+}}
+
+This page is written using wiki mark-up, whose syntax, as we mentioned above, can be found in the [[Syntax|Yioop Wiki Syntax Guide]]. So for example, the heading at the top of the page is written as
+<nowiki>
+ =Test Page=
+</nowiki>
+in this mark-up. The buttons above the textarea can help you insert the mark-up you need without having to remember it. Also, as mentioned above, the dark gray area below the textarea describes how to associate images, video, and other media with the document. Unlike with posts, a complete list of currently associated media can be found at the bottom of the document under the '''Page Resources''' heading. Links to Rename, Add a resource to the page, and Delete each resource can also be found here. Clicking on the icon next to a resource lets you look at the resource on a page by itself. This icon will be a thumbnail of the resource for images and videos. In the case of videos, the thumbnail is only generated if the FFMPEG software mentioned earlier is installed and the FFMPEG constant is defined. In this case, as with posts, if the video is less than MAX_VIDEO_CONVERT_SIZE, Yioop will automatically try to convert it to mp4 and webm so that it can be streamed by Yioop using HTTP pseudo-streaming.
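The conversion decision just described can be pictured with a small sketch. Everything below is illustrative rather than Yioop's actual code: the ffmpeg path, the size limit, and the file name are made-up stand-ins for the FFMPEG and MAX_VIDEO_CONVERT_SIZE constants, and the script only prints the commands it would issue instead of executing them:

```shell
# Illustrative stand-ins for the FFMPEG and MAX_VIDEO_CONVERT_SIZE constants.
FFMPEG="/usr/local/bin/ffmpeg"
MAX_VIDEO_CONVERT_SIZE=2000000000   # bytes; made-up value
SRC="lecture.mov"
SRC_SIZE=150000000                  # pretend this came from stat
if [ "$SRC_SIZE" -lt "$MAX_VIDEO_CONVERT_SIZE" ]; then
  # A video under the size limit gets converted to both
  # web-friendly formats; print rather than run the commands.
  echo "$FFMPEG -i $SRC ${SRC%.*}.mp4"
  echo "$FFMPEG -i $SRC ${SRC%.*}.webm"
fi
```

A basic `ffmpeg -i input output.mp4` invocation like the one printed here lets ffmpeg infer the target codecs from the output extension; Yioop's real invocation may pass additional options.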
+
+Clicking the '''Settings Link''' next to the wiki page name reveals the following additional form elements:
+
+{{class="docs"
+((resource:Documentation:WikiPageSettings.png|Wiki Page Settings))
+}}
+
+The meaning of these various settings is described in the [[Syntax#Page%20Settings,%20Page%20Type|Page Settings, Page Type]] section of the Yioop Wiki Syntax Guide.
+
+
+The '''Discuss link''' takes you to a thread in the current group where the contents of the wiki page should be discussed. Underneath the textarea above is a Save button. Every time one clicks the Save button, a new version of the page is saved, but the old one is not discarded. We can use the Edit Reason field to provide a reason for the changes between versions. When we read a page, it is the most recent version that is displayed. However, by clicking the History link above, we can see a history of prior versions. For example:
+
+{{class="docs"
+((resource:Documentation:HistoryPage.png|An example History Page of a Wiki Page))
+}}
+
+The Revert links on this history page can be used to change the current wiki page to a prior version. The time link for each version can be clicked to view that prior version without reverting. The First and Second links next to a version can be used to set either the first or second field at the top of the history page, which is labelled Difference:. Clicking the Go button for the Difference form computes the change set between the two selected versions of a wiki document. This might look like:
+
+{{class="docs"
+((resource:Documentation:DiffPage.png|An example diff page of two versions of a Wiki Page))
+}}
+
+This completes the description of group feeds and wiki pages.
+
+[[Documentation#contents|Return to table of contents]].
+
+==Crawling and Customizing Results==
+===Performing and Managing Crawls===
+
+The Manage Crawl activity in Yioop looks like:
+
+{{class="docs"
+((resource:Documentation:ManageCrawl.png|Manage Crawl Form))
+}}
+
+This activity will actually list slightly different kinds of peak memory usages depending on whether the queue_servers are run from a terminal or through the web interface. The screenshot above was taken when a single queue_server was being run from the terminal. The first form in this activity allows you to name and start a new web crawl. Next to the Start New Crawl button is an Options link, which allows one to set the parameters under which the crawl will execute. We will return to what the Options page looks like in a moment. When a crawl is executing, statistics about the crawl appear under the start crawl form, along with a Stop Crawl button. Crawling continues until this Stop Crawl button is pressed or until no new sites can be found. As a crawl occurs, a sequence of IndexShards is written. These keep track of which words appear in which documents for groups of 50,000 or so documents. In addition, an IndexDictionary of which words appear in which shard is written to a separate folder and subfolders. When the Stop button is clicked, the "tiers" of data in this dictionary need to be logarithmically merged; this process can take a couple of minutes, so after clicking Stop do not kill the queue_server (if you were going to) until after it says it is waiting for messages again. Beneath this stop button line is a link which allows you to change the crawl options of the currently active crawl. Changing the options on an active crawl may take some time to fully take effect, as the currently processing queue of urls needs to flush. At the bottom of the page is a table listing previously run crawls. Next to each previously run crawl are three links. The first link lets you resume this crawl, if this is possible, and says Closed otherwise.
Resume will cause Yioop to look for unprocessed fetcher data regarding that crawl, and try to load that into a fresh priority queue of to-crawl urls. If it can do this, crawling continues. The second link lets you set this crawl's result as the default index. In the above picture there were only two saved crawls, the second of which was set as the default index. When someone comes to your Yioop installation and does not adjust their settings, the default index is used to compute search results. The final link allows one to Delete the crawl. For both resuming a crawl and deleting a crawl, it might take a little while before you see the process reflected in the display. This is because communication might need to be done with the various fetchers, and because the on-screen display refreshes only every 20 seconds or so.
+
+{{id='prerequisites'
+====Prerequisites for Crawling====
+}}
+
+Before you can start a new crawl, you need to run at least one queue_server.php script and at least one fetcher.php script. These can be run either from the same Yioop installation or from separate machines or folders with Yioop installed. Each installation of Yioop that is going to participate in a crawl should be configured with the same name server and server key. Running these scripts can be done either via the command line or through a web interface. As described in the Requirements section, you might need to do some additional initial set up if you want to take the web interface approach. On the other hand, the command-line approach only works if you are using a single queue server. You can still have more than one fetcher, but the crawl speed in this case probably won't go faster after ten to twelve fetchers. Also, in the command-line approach the queue server and name server should be the same instance of Yioop. 
In the remainder of this section we describe how to start the queue_server.php and fetcher.php scripts via the command line; the GUI for Managing Machines and Servers section describes how to do it via a web interface. To begin, open a command shell and cd into the bin subfolder of the Yioop folder. To start a queue_server type:

 php queue_server.php terminal

To start a fetcher type:

 php fetcher.php terminal

The above lines are under the assumption that the path to php has been properly set in your PATH environment variable. If this is not the case, you would need to type the full path to the php executable followed by the rest of the line. If you want to stop these programs after starting them, simply type CTRL-C. Assuming you have done the additional configuration mentioned above that is needed for the GUI approach to managing these programs, it is also possible to run the queue_server and fetcher programs as daemons. To do this one could type respectively:

 php queue_server.php start

or

 php fetcher.php start

When run as a daemon, messages from these programs are written into log files in the log subfolder of the WORK_DIRECTORY folder. To stop these daemons one types:

 php queue_server.php stop

or

 php fetcher.php stop

Once the queue_server is running and at least one fetcher is running, the Start New Crawl button should work to commence a crawl. Again, it may take up to a minute or so for information about a running crawl to show up in the Currently Processing fieldset. During a crawl, it is possible for a fetcher or the queue server to crash. This usually occurs due to lack of memory for one of these programs. It also can sometimes happen for a fetcher due to flakiness in multi-curl. If this occurs, simply restart the fetcher in question and the crawl can continue. A queue server crash should be much rarer. If it occurs, all of the urls to crawl that reside in memory will be lost. 
To continue crawling, you would need to resume the crawl through the web interface. If there are no unprocessed schedules for the given crawl (which usually means you haven't been crawling very long), it is not possible to resume the crawl. Having described what is necessary to perform a crawl, we now return to how to set the options under which the crawl is conducted.
+
+====Common Crawl and Search Configurations====
+
+When testing Yioop, it is quite common just to have one instance of the fetcher and one instance of the queue_server running, both on the same machine and same installation of Yioop. In this subsection we wish to briefly describe some other configurations which are possible, as well as some configs/config.php settings that can affect the crawl and search speed. The most obvious config.php setting which can affect the crawl speed is NUM_MULTI_CURL_PAGES. A fetcher, when performing downloads, opens this many simultaneous connections, gets the pages corresponding to them, processes them, then proceeds to download the next batch of NUM_MULTI_CURL_PAGES pages. Yioop uses the fact that there are gaps in this loop where no downloading is being done to ensure robots.txt Crawl-delay directives are being honored (a Crawl-delayed host will only be scheduled to at most one fetcher at a time). The downside of this is that your internet connection might not be used to its fullest ability to download pages. Thus, it can make sense, rather than increasing NUM_MULTI_CURL_PAGES, to run multiple copies of the Yioop fetcher on a machine. To do this one can either install the Yioop software multiple times or give an instance number when one starts a fetcher. For example:
+
+ php fetcher.php start 5
+
+would start instance 5 of the fetcher program.
+
+Once a crawl is complete, one can see its contents in the folder WORK_DIRECTORY/cache/IndexDataUNIX_TIMESTAMP. 
In the multi-queue server setting, each queue server machine would have such a folder containing the data for the hosts that queue server crawled. Putting the WORK_DIRECTORY on a solid-state drive can, as you might expect, greatly speed up how fast search results will be served. Unfortunately, if a given queue server is storing ten million or so pages, the corresponding IndexDataUNIX_TIMESTAMP folder might be around 200 GB. Two main sub-folders of IndexDataUNIX_TIMESTAMP largely determine the search performance of Yioop handling queries from a crawl. These are the dictionary subfolder and the posting_doc_shards subfolder, where the former has the greater influence. For the ten million page situation these might be 5GB and 30GB respectively. It is completely possible to copy these subfolders to a SSD and use symlinks to them under the original crawl directory to enhance Yioop's search performance.
+
+====Specifying Crawl Options and Modifying Options of the Active Crawl====
+
+As we pointed out above, next to the Start Crawl button is an Options link. Clicking on this link lets you set various aspects of how the next crawl should be conducted. If there is a currently processing crawl, there will be an options link under its stop button. Both of these links lead to similar pages; however, for an active crawl fewer parameters can be changed. So we will only describe the first link. We do mention here, though, that under the active crawl options page it is possible to inject new seed urls into the crawl as it is progressing. In the case of clicking the Options link next to the start button, the user should be taken to an activity screen which looks like:
+
+{{class="docs"
+((resource:Documentation:WebCrawlOptions.png|Web Crawl Options Form))
+}}
+
+The Back link in the corner returns one to the previous activity. 
+
+There are two kinds of crawls that can be performed by Yioop: either a crawl of sites on the web or a crawl of data that has been previously stored in a supported archive format, such as data that was crawled by Versions 0.66 and above of Yioop, data coming from a database or text archive via Yioop's importing methods described below, [[http://www.archive.org/web/researcher/ArcFileFormat.php|Internet Archive ARC files]], [[http://archive-access.sourceforge.net/warc/|ISO WARC Files]], [[http://en.wikipedia.org/wiki/Wikipedia:Database_download|MediaWiki xml dumps]], and [[http://rdf.dmoz.org/|Open Directory Project RDF files]]. In the next subsection, we describe new web crawls and then return to archive crawls in the subsection after that. Finally, we have a short section on some advanced crawl options which can only be set in config.php or local_config.php. You will probably not need these features, but we mention them for completeness.
+
+=====Web Crawl Options=====
+
+On the web crawl tab, the first form field, "Get Crawl Options From", allows one to read in crawl options either from the default_crawl.ini file or from the crawl options used in a previous crawl. The rest of the form allows the user to change the existing crawl options. The second form field is labeled Crawl Order. This can be set to either Breadth First or Page Importance. It specifies the order in which pages will be crawled. In breadth first crawling, roughly all the seed sites are visited first, followed by sites linked directly from seed sites, followed by sites linked directly from sites linked directly from seed sites, etc. Page Importance is our modification of [ [[Documentation#APC2003|APC2003]]]. In this order, each seed site starts with a certain quantity of money. When a site is crawled, it distributes its money equally amongst the sites it links to. When picking sites to crawl next, one chooses those that currently have the most money. 
Additional rules are added to handle things like the fact that some sites might have no outgoing links. Also, in our set-up we don't revisit already seen sites. To handle these situations we take a different tack from the original paper. This crawl order roughly approximates crawling according to page rank.
+
+The next checkbox is labelled Restrict Sites by Url. If it is checked, then a textarea with label Allowed To Crawl Sites appears. If one checks Restrict Sites by Url, then only pages on those sites and domains listed in the Allowed To Crawl Sites textarea can be crawled. We will say how to specify domains and sites in a moment; first, let's discuss the last two textareas on the Options form. The Disallowed Sites textarea allows you to specify sites that you do not want the crawler to crawl under any circumstance. There are many reasons you might not want a crawler to crawl a site. For instance, some sites might not have a good robots.txt file, but will ban you from interacting with their site if they get too much traffic from you.
+
+Just above the Seed Sites textarea is a link "Add User Suggest Data". If on the Server Settings activity Account Registration is set to anything other than Disable Registration, it is possible for a search site user to suggest urls to crawl. This can be done by going to the [[Documentation#Search%20Tools%20Page|Search Tools Page]] and clicking on the Suggest a Url link. Suggested links are stored in WORK_DIRECTORY/data/suggest_url.txt. Clicking Add User Suggest Data adds any suggested urls in this file into the Seed Sites textarea, then deletes the contents of this file. The suggested urls which are not already in the seed site list are added after comment lines (lines starting with #) which give the time at which the urls were added. Adding Suggest data can be done either for new crawls or to inject urls into currently running crawls.
+The Seed Sites textarea allows you to specify a list of urls that the crawl should start from. 
The crawl will begin using these urls. This list can include ".onion" urls if you want to crawl [[http://en.wikipedia.org/wiki/Tor_network|TOR networks]].
+
+The format for sites, domains, and urls is the same for each of these textareas, except that the Seed Sites area can only take urls (or urls with title/descriptions) and in the Disallowed Sites/Sites with Quotas area one can give a url followed by #. Otherwise, in this common format, there should be one site, url, or domain per line. You should not separate sites and domains with commas or other punctuation. White space is ignored. A domain can be specified as:
+ domain:.sjsu.edu
+Urls like:
+ http://www.sjsu.edu/
+ https://www.sjsu.edu/gape/
+ http://bob.cs.sjsu.edu/index.html
+would all fall under this domain. The word domain above is a slight misnomer, as domain:sjsu.edu, without the leading period, also matches a site like http://mysjsu.edu/. A site can be specified as scheme://domain/path. Currently, Yioop recognizes three schemes: http, https, and gopher (an older web protocol). For example, https://www.somewhere.com/foo/ . Such a site includes https://www.somewhere.com/foo/anything_more . Yioop also recognizes * and $ within urls. So http://my.site.com/*/*/ would match http://my.site.com/subdir1/subdir2/rest and http://my.site.com/*/*/$ would require the last symbol in the url to be '/'. This kind of pattern matching can be useful to restrict a crawl within a site to a certain fixed depth -- you can allow crawling a site, but disallow the downloading of pages with more than a certain number of '/' in them.
+
+In the Disallowed Sites/Sites with Quotas area, a number after a # sign indicates that at most that many pages should be downloaded from that site in any given hour. For example,
+ http://www.ucanbuyart.com/#100
+indicates that at most 100 pages are to be downloaded from http://www.ucanbuyart.com/ per hour. 
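+
+The site and url matching rules just described can be mimicked in a few lines of Python. This is only an illustrative sketch of the matching semantics (Yioop's actual matcher is written in PHP, and these function names are made up); it assumes * matches within one path segment and that a pattern without $ is a prefix match:

```python
import re

def url_pattern_to_regex(pattern):
    """Turn a crawl pattern with * wildcards and an optional trailing $
    into a regular expression for use with re.match (which anchors at
    the start of the url, giving prefix-match semantics)."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # escape literal pieces; * is assumed to match one path segment
    body = "[^/]*".join(re.escape(part) for part in pattern.split("*"))
    return body + ("$" if anchored else "")

def domain_matches(spec, host):
    """domain:sjsu.edu is a suffix match on the host name, so it also
    matches mysjsu.edu; domain:.sjsu.edu requires the leading dot."""
    return host.endswith(spec)
```

+Under these assumptions, http://my.site.com/*/*/$ matches http://my.site.com/a/b/ but not http://my.site.com/subdir1/subdir2/rest, matching the examples above.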
+
+In the seed site area one can specify titles and page descriptions for pages that Yioop would otherwise be forbidden to crawl by the robots.txt file. For example,
+ http://www.facebook.com/###!Facebook###!A%20famous%20social%20media%20site
+tells Yioop to generate a placeholder page for http://www.facebook.com/ with title "Facebook" and description "A famous social media site" rather than to attempt to download the page. The [[Documentation#Results%20Editor|Results Editor]] activity can only be used to affect pages which are in a Yioop index. This technique allows one to add arbitrary pages to the index.
+
+When configuring a new instance of Yioop, the file default_crawl.ini is copied to WORK_DIRECTORY/crawl.ini and contains the initial settings for the Options form.
+
+{{id="archive"
+=====Archive Crawl Options=====
+}}
+
+We now consider how to do crawls of previously obtained archives. From the initial crawl options screen, clicking on the Archive Crawl tab gives one the following form:
+
+{{class="docs"
+((resource:Documentation:ArchiveCrawlOptions.png|Archive Crawl Options Form))
+}}
+
+The dropdown lists all previously done crawls that are available for recrawl.
+
+{{class="docs"
+((resource:Documentation:ArchiveCrawlDropDown.png|Archive Crawl dropdown))
+}}
+
+These include previously done Yioop crawls, previously done recrawls (prefixed with RECRAWL::), Yioop Crawl Mixes (prefixed with MIX::), and crawls of other file formats such as arc, warc, database data, MediaWiki XML, and ODP RDF, which have been appropriately prepared in the PROFILE_DIR/cache folder (prefixed with ARCFILE::). In addition, Yioop also has a generic text file archive importer (also prefixed with ARCFILE::). 
+
+You might want to re-crawl an existing Yioop crawl if you want to add new meta-words, new cache page links, extract fields in a different manner, or if you are migrating a crawl from an older version of Yioop whose index isn't readable by your newer version of Yioop. For similar reasons, you might want to recrawl a previously re-crawled crawl. When you archive crawl a crawl mix, Yioop does a search on the keyword site:any using the crawl mix in question. The results are then indexed into a new archive. This new archive might have considerably better query performance (in terms of speed) as compared to queries performed on the original crawl mix. How to make a crawl mix is described in the [[Documentation#Mixing%20Crawl%20Indexes|Crawl Mixes]] section. You might want to do an archive crawl of other file formats if you want Yioop to be able to provide search results of their content. Once you have selected the archive you want to crawl, you can add meta words as discussed in the Crawl Time Tab Page Rule portion of the [[Documentation#Page%20Indexing%20and%20Search%20Options|Page Options]] section. Afterwards, go back to the Create Crawl screen to start your crawl. As with a Web Crawl, for an archive crawl you need both the queue_server running and at least one fetcher running to perform a crawl.
+
+To re-crawl a previously created web archive that was made using several fetchers, each of the fetchers that was used in the creation process should be running. This is because the data used in the recrawl will come locally from the machine of that fetcher. For other kinds of archive crawls and mix crawls, which fetchers one uses doesn't matter, because archive crawl data comes through the name server. You might also notice that the number of pages in a web archive re-crawl is actually larger than the initial crawl. 
This can happen because during the initial crawl data was stored in the fetcher's archive bundle and a partial index of this data was sent to the appropriate queue_servers, but was not yet processed by these queue servers. So it was waiting in a schedules folder to be processed in the event the crawl was resumed.
+
+To get Yioop to detect arc, database data, MediaWiki, ODP RDF, or generic text archive files, you need to create a PROFILE_DIR/cache/archives folder on the name server machine. Yioop checks subfolders of this for files with the name arc_description.ini. For example, to do a Wikimedia archive crawl, one could make a subfolder PROFILE_DIR/cache/archives/my_wiki_media_files and put in it a file arc_description.ini in the format to be discussed in a moment. In addition to the arc_description.ini, you would also put in this folder all the archive files (or links to them) that you would like to index. When indexing, Yioop will process each archive file in turn. Returning to the arc_description.ini file, its contents are used to give a description of the archive crawl that will be displayed in the archive dropdown, as well as to specify the kind of archives the folder contains and how to extract them. An example arc_description.ini for MediaWiki dumps might look like:
+
+ arc_type = 'MediaWikiArchiveBundle';
+ description = 'English Wikipedia 2012';
+
+In the Archive Crawl dropdown the description will appear with the prefix ARCFILE:: and you can then select it as the source to crawl. Currently, the supported arc_types are: ArcArchiveBundle, DatabaseBundle, MediaWikiArchiveBundle, OdpRdfArchiveBundle, TextArchiveBundle, and WarcArchiveBundle. For the ArcArchiveBundle, OdpRdfArchiveBundle, MediaWikiArchiveBundle, and WarcArchiveBundle arc_types, generally a two line arc_description.ini file like the above suffices. We now describe how to import from the other kinds of formats in a little more detail. 
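+
+Since the arc_description.ini format is simple (key = 'value'; lines, with lines beginning with ; treated as comments), the following Python sketch shows how such a file could be read. It is only meant to clarify the format; Yioop itself parses these files in PHP, and this function name is made up:

```python
def parse_arc_description(text):
    """Minimal reader for the arc_description.ini format shown above:
    one key = 'value'; setting per line, ;-prefixed comment lines."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";"):
            continue  # skip blank lines and commented-out settings
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().rstrip(";").strip("'\"")
    return settings
```

+For the two line MediaWiki example above, this would yield an arc_type of MediaWikiArchiveBundle and a description of English Wikipedia 2012.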
In general, the arc_description.ini will tell Yioop how to get string items (in an associative array with a minimal amount of additional information) from the archive in question. Processing of these string items can then be controlled using Page Rules, described in the [[Documentation#Page%20Indexing%20and%20Search%20Options|Page Options]] section.
+
+An example arc_description.ini where the arc_type is DatabaseBundle might be:
+ arc_type = 'DatabaseBundle';
+ description = 'DB Records';
+ dbms = "mysql";
+ db_host = "localhost";
+ db_name = "MYGREATDB";
+ db_user = "someone";
+ db_password = "secret";
+ encoding = "UTF-8";
+ sql = "SELECT MYCOL1, MYCOL2 FROM MYTABLE1 M1, MYTABLE2 M2 WHERE M1.FOO=M2.BAR";
+ field_value_separator = '|';
+ column_separator = '##';
+
+Here is a specific example that gets the rows out of the TRANSLATION table of Yioop where the database was stored in a Postgres DBMS. In the comments, I indicate how to alter it for other DBMS's.
+
+ arc_type = 'DatabaseBundle';
+ description = 'DB Records';
+ ;sqlite3 specific
+ ;dbms ="sqlite3";
+ ;mysql specific
+ ;dbms = "mysql";
+ ;db_host = "localhost";
+ ;db_user = "root";
+ ;db_password = "";
+ dbms = "pdo";
+ ;below is for postgres; similar if want db2 or oracle
+ db_host = "pgsql:host=localhost;port=5432;dbname=seek_quarry";
+ db_name = "seek_quarry";
+ db_user = "cpollett";
+ db_password = "";
+ encoding = "UTF-8";
+ sql = "SELECT * from TRANSLATION";
+ field_value_separator = '|';
+ column_separator = '##';
+
+Possible values for dbms are pdo, mysql, and sqlite3. If pdo is chosen, then db_host should be a PHP DSN specifying which DBMS driver to use. db_name is the name of the database you would like to connect to, db_user is the database username, db_password is the password for that user, and encoding is the character set of rows that the database query will return.
+
+The sql variable is used to give a query whose result rows will be the items indexed by Yioop. 
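+
+To make the role of field_value_separator and column_separator concrete, here is an illustrative Python sketch (the function name is hypothetical; Yioop performs this conversion in PHP) of how one result row could become one string item:

```python
def row_to_page(row, field_value_separator="|", column_separator="##"):
    """Turn one SQL result row (an ordered mapping of column name to
    value) into a string 'page': field|value pairs joined by ##."""
    return column_separator.join(
        "%s%s%s" % (field, field_value_separator, value)
        for field, value in row.items())
```

+With the default separators above, a row whose MYCOL1 is a and whose MYCOL2 is b would become the string MYCOL1|a##MYCOL2|b.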
Yioop indexes string "pages", so to make these rows into a string, each column result will be made into a string: field field_value_separator value. Here field is the name of the column and value is the value for that column in the given result row. Columns are concatenated together, separated by the value of column_separator. The resulting string is then sent to Yioop's TextProcessor page processor.
+
+We next give a few examples of arc_description.ini files where the arc_type is TextArchiveBundle. First, suppose we wanted to index access log file records that look like:
+ 127.0.0.1 - - [21/Dec/2012:09:03:01 -0800] "POST /git/yioop2/ HTTP/1.1" 200 - \
+ "-" "Mozilla/5.0 (compatible; YioopBot; \
+ +http://localhost/git/yioop/bot.php)"
+Here each record is delimited by a newline and the character encoding is UTF-8. The records are stored in files with the extension .log and these files are uncompressed. We then might use the following arc_description.ini file:
+ arc_type = 'TextArchiveBundle';
+ description = 'Log Files';
+ compression = 'plain';
+ file_extension = 'log';
+ end_delimiter = "\n";
+ encoding = "UTF-8";
+In addition to compression = 'plain', Yioop supports gzip and bzip2. The end_delimiter is a regular expression indicating how to know when a record ends. To process a TextArchiveBundle, Yioop needs either an end_delimiter or a start_delimiter (or both) to be specified. 
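+
+The effect of these delimiter settings can be sketched in Python as follows. This is an illustrative simplification only; Yioop's importer is PHP and also handles compression and buffering of records across file boundaries:

```python
import re

def split_records(text, start_delimiter=None, end_delimiter=None):
    """Split a text archive into records using a start and/or end
    delimiter regular expression, as in the TextArchiveBundle settings
    above; blank records are dropped."""
    if end_delimiter:
        pieces = re.split(end_delimiter, text)
    elif start_delimiter:
        # anything before the first record start is discarded
        pieces = re.split(start_delimiter, text)[1:]
    else:
        raise ValueError("need a start_delimiter or an end_delimiter")
    return [piece for piece in pieces if piece.strip()]
```

+For the access log example, end_delimiter = "\n" yields one record per line; a start_delimiter instead begins a new record at each match.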
As another example, for a mail.log file with entries of the form:
+ From pollett@mathcs.sjsu.edu Wed Aug 7 10:59:04 2002 -0700
+ Date: Wed, 7 Aug 2002 10:59:04 -0700 (PDT)
+ From: Chris Pollett <pollett@mathcs.sjsu.edu>
+ X-Sender: pollett@eniac.cs.sjsu.edu
+ To: John Doe <johndoe@mail.com>
+ Subject: Re: a message
+ In-Reply-To: <5.1.0.14.0.20020723093456.00ac9c00@mail.com>
+ Message-ID: <Pine.GSO.4.05.10208071057420.9463-100000@eniac.cs.sjsu.edu>
+ MIME-Version: 1.0
+ Content-Type: TEXT/PLAIN; charset=US-ASCII
+ Status: O
+ X-Status:
+ X-Keywords:
+ X-UID: 17
+
+ Hi John,
+
+ I got your mail.
+
+ Chris
+The following might be used:
+
+ arc_type = 'TextArchiveBundle';
+ description = 'Mail Logs';
+ compression = 'plain';
+ file_extension = 'log';
+ start_delimiter = "\n\nFrom\s";
+ encoding = "ASCII";
+
+Notice here we are splitting records using a start delimiter. Also, we have chosen ASCII as the character encoding. As a final example, we show how to import tar gzip files of Usenet records as found in the [[http://archive.org/details/utzoo-wiseman-usenet-archive|UTzoo Usenet Archive 1981-1991]]. Further discussion on how to process this collection is given in the [[Documentation#Page%20Indexing%20and%20Search%20Options|Page Options]] section.
+
+ arc_type = 'TextArchiveBundle';
+ description = 'Utzoo Usenet Archive';
+ compression = 'gzip';
+ file_extension = 'tgz';
+ start_delimiter = "\0\0\0\0Path:";
+ end_delimiter = "\n\0\0\0\0";
+ encoding = "ASCII";
+
+Notice in the above we set the compression to be gzip and then have Yioop act on the raw tar file. In tar files, content objects are separated by long paddings of nulls. Usenet posts begin with Path:, so to keep things simple we grab records which begin with a sequence of nulls followed by Path: and end with another sequence of nulls. 
+
+As a final reminder for this section, remember that, in addition to the arc_description.ini file, the subfolder should also contain instances of the files in question that you would like to archive crawl. So for arc files, these would be files of extension .arc.gz; for MediaWiki, files of extension .xml.bz2; and for ODP-RDF, files of extension .rdf.u8.gz .
+
+=====Crawl Options of config.php or local_config.php=====
+
+There are a couple of flags which can be set in the config.php or in a local_config.php file that affect web crawling, which we now mention for completeness. As was mentioned before, when Yioop is crawling it makes use of Etag: and Expires: HTTP headers received during web page download to determine when a page can be recrawled. This assumes one has not completely turned off recrawling under the [[Documentation#Page%20Indexing%20and%20Search%20Options|Page Indexing and Search Options]] activity. To turn Etag and Expires checking off, one can add to a local_config.php file the line:
+
+define("USE_ETAG_EXPIRES", false);
+
+
+Yioop can be run using the [[https://github.com/facebook/hhvm/|Hip Hop Virtual Machine from Facebook]]. This will tend to make Yioop run faster and use less memory than running it under the standard PHP interpreter. Hip Hop can be used on various Linux flavors and to some degree runs under OSX (the queue server and fetcher will run, but the web app doesn't). If you want to use Hip Hop on Mac OSX, and if you install it via Homebrew, then you will need to set a force variable and set the path for Hip Hop in your local_config.php file with lines like:
+ define('FORCE_HHVM', true);
+ define('HHVM_PATH', '/usr/local/bin');
+The above lines are only needed on OSX to run Hip Hop.
+
+[[Documentation#contents|Return to table of contents]]
+
+===Mixing Crawl Indexes===
+
+Once you have performed a few crawls with Yioop, you can use the Mix Crawls activity to create mixtures of your crawls. 
This activity is available to users who have either the Admin role or just the standard User role. This section describes how to create crawl mixes, which are processed when a query comes in to Yioop. Once one has created such a crawl mix, an admin user can make a new index which consists of the results of the crawl mix ("materialize it") by doing an archive crawl of the crawl mix. The [[Documentation#archive|Archive Crawl Options]] subsection has more details on how to do this latter operation. The main Mix Crawls activity looks like:
+
+{{class="docs"
+((resource:Documentation:ManageMixes.png|The Manage Mixes form))
+}}
+
+The first form allows you to name and create a new crawl mixture. Clicking "Create" sends you to a second page where you can provide information about how the mixture should be built. Beneath the Create mix form is a table listing all the previously created crawl mixes. Above this listing, but below the Create form, is a standard set of nav elements for selecting which mixes will be displayed in this table. A crawl mix is "owned" by the user who creates that mix. The table only lists crawl mixes "owned" by the user. The first column has the name of the mix, the second column says how the mix is built out of component crawls, and the actions column allows you to edit the mix, set it as the default index for Yioop search results, or delete the mix. You can also append "m:name+of+mix" or "mix:name+of+mix" to a query to use that mix without having to set it as the index. When you create a new mix, and are logged in so Yioop knows the mix belongs to you, your mix will also show up on the Settings page. The "Share" column pops up a link with which you can share a crawl mix with a Yioop Group. This will post a message with a link to that group so that others can import your mix into their lists of mixes. 
Creating a new mix or editing an existing mix sends you to a second page:
+
+{{class="docs"
+((resource:Documentation:EditMix.png|The Edit Mixes form))
+}}
+
+Using the "Back" link on this page will take you to the prior screen. The first text field on the edit page lets you rename your mix if you so desire. Beneath this is an "Add Groups" button. A group is a weighted list of crawls. If only one group were present, then search results would come from any crawl listed for this group. A given result's score would be the weighted sum of the scores of the crawls in the group it appears in. Search results are displayed in descending order according to this total score. If more than one group is present, then the number of results field for that group determines how many of the displayed results should come from that group. For the Crawl Mix displayed above, there are three groups: the first group is used to display the first result, the second group is used to display the second result, and the last group is used to display any remaining search results.
+
+The UI for groups works as follows: The top row has three columns. To add new components to a group, use the dropdown in the first column. The second column controls for how many results the particular crawl group should be used. Different groups' results are presented in the order they appear in the crawl mix. The last group is always used to display any remaining results for a search. The delete group link in the third column can be used to delete a group. Beneath the first row of a group, there is one row for each crawl that belongs to the group. The first link for a crawl says how its scores should be weighted in the search results for that group. The second column is the name of the crawl. The third column is a space separated list of words to add to the query when obtaining results for that crawl. 
So for example, in the first group above, there are two indexes which will be unioned: Default Crawl with a weight of 1, and CanCrawl Test with a weight of 2. For the Default Crawl we inject two keywords, media:text and Canada, into the query we get from the user. media:text means we will get whatever results from this crawl consisted of text rather than image pages. Keywords can be used to make a particular component of a crawl mix behave in a conditional manner by using the "if:" meta word described in the search and user interface section. The last link in a crawl row allows you to delete a crawl from a crawl group. For changes on this page to take effect, the "Save" button beneath this dropdown must be clicked.
+
+[[Documentation#contents|Return to table of contents]].
+
+===Classifying Web Pages===
+
+Sometimes searching for text that occurs within a page isn't enough to find what one is looking for. For example, the relevant set of documents may have many terms in common, with only a small subset showing up on any particular page, so that one would have to search for many disjoint terms in order to find all relevant pages. Or one may not know which terms are relevant, making it hard to formulate an appropriate query. Or the relevant documents may share many key terms with irrelevant documents, making it difficult to formulate a query that fetches one but not the other. Under these circumstances (among others), it would be useful to have meta words already associated with the relevant documents, so that one could just search for the meta word. The Classifiers activity provides a way to train classifiers that recognize classes of documents; these classifiers can then be used during a crawl to add appropriate meta words to pages determined to belong to one or more classes. 
+ +Clicking on the Classifiers activity displays a text field where you can create a new classifier, and a table of existing classifiers, where each row corresponds to a classifier and provides some statistics and action links. A classifier is identified by its class label, which is also used to form the meta word that will be attached to documents. Each classifier can only be trained to recognize instances of a single target class, so the class label should be a short description of that class, containing only alphanumeric characters and underscores (e.g., "spam", "homepage", or "menu"). Typing a new class label into the text box and hitting the Create button initializes a new classifier, which will then show up in the table. + +{{class="docs" +((resource:Documentation:ManageClassifiers.png|The Manage Classifiers page)) +}} + +Once you have a fresh classifier, the natural thing to do is edit it by clicking on the Edit action link. If you made a mistake, however, or no longer want a classifier for some reason, then you can click on the Delete action link to delete it; this cannot be undone. The Finalize action link is used to prepare a classifier to classify new web pages, which cannot be done until you've added some training examples. We'll discuss how to add new examples next, then return to the Finalize link. + +====Editing a Classifier==== + +Clicking on the Edit action link takes you to a new page where you can change a classifier's class label, view some statistics, and provide examples of positive and negative instances of the target class. The first two options should be self-explanatory, but the last is somewhat involved. A classifier needs labeled training examples in order to learn to recognize instances of a particular class, and you help provide these by picking out example pages from previous crawls and telling the classification system whether they belong to the class or do not belong to the class. 
The Add Examples section of the Edit Classifier page lets you select an existing crawl to draw potential examples from, and optionally narrow down the examples to those that satisfy a query. Once you've done this, clicking the Load button will send a request to the server to load some pages from the crawl and choose the next one to receive a label. You'll be presented with a record representing the selected document, similar to a search result, with several action links along the side that let you mark this document as either a positive or negative example of the target class, or skip this document and move on to the next one: + +{{class="docs" +((resource:Documentation:ClassifiersEdit.png|The Classifiers edit page)) +}} + +When you select any of the action buttons, your choice is sent back to the server, and a new example to label is sent back (so long as there are more examples in the selected index). The old example record is shifted down the page and its background color updated to reflect your decision—green for a positive example, red for a negative one, and gray for a skip; the statistics at the top of the page are updated accordingly. The new example record replaces the old one, and the process repeats. Each time a new label is sent to the server, it is added to the training set that will ultimately be used to prepare the classifier to classify new web pages during a crawl. Each time you label a set number of new examples (10 by default), the classifier will also estimate its current accuracy by splitting the current training set into training and testing portions, training a simple classifier on the training portion, and testing on the remainder (checking the classifier output against the known labels). The new estimated accuracy, calculated as the proportion of the test pages classified correctly, is displayed under the Statistics section. 
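The periodic accuracy estimate described above (split the labeled set, train on one portion, test on the rest) can be sketched roughly as follows. This is an illustrative Python sketch only, not Yioop's actual PHP classifier code; the toy term-overlap "classifier" and the 80/20 split are assumptions made purely for the illustration:

```python
# Illustrative sketch of the holdout accuracy estimate described above
# (not Yioop's actual classifier code). The "classifier" here is a toy
# term-overlap rule; Yioop trains a real classifier on the split.
import random

def estimate_accuracy(labeled, train_fraction=0.8, seed=42):
    """labeled: list of (set_of_terms, label) pairs, label 1 or 0."""
    rng = random.Random(seed)
    data = labeled[:]
    rng.shuffle(data)
    cut = int(len(data) * train_fraction)
    train, test = data[:cut], data[cut:]
    # "Train": remember which terms occurred with each label
    pos_terms, neg_terms = set(), set()
    for terms, label in train:
        (pos_terms if label else neg_terms).update(terms)
    def classify(terms):
        return 1 if len(terms & pos_terms) >= len(terms & neg_terms) else 0
    # Accuracy = fraction of held-out examples classified correctly
    correct = sum(1 for terms, label in test if classify(terms) == label)
    return correct / len(test) if test else 0.0

labeled = ([({"buy", "cheap", "pills"}, 1)] * 5 +
           [({"news", "weather", "sports"}, 0)] * 5)
print(estimate_accuracy(labeled))  # 1.0 on this cleanly separable data
```

On real, noisier training sets the estimate will of course be below 1.0, which is why re-estimating as you label more examples is useful.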
You can also manually request an updated accuracy estimate by clicking the Update action link next to the Accuracy field. Doing this will send a request to the server that will initiate the same process described previously, and after a delay, display the new estimate. + +All of this happens without reloading the page, so avoid using the web browser's Back button. If you do end up reloading the page somehow, then the current example record and the list of previously-labeled examples will be gone, but none of your progress toward building the training set will be lost. + +====Finalizing a Classifier==== + +Editing a classifier adds new labeled examples to the training set, providing the classifier with a more complete picture of the kinds of documents it can expect to see in the future. In order to take advantage of an expanded training set, though, you need to finalize the classifier. This is broken out into a separate step because it involves optimizing a function over the entire training set, which can be slow for even a few hundred example documents. It wouldn't be practical to wait for the classifier to re-train each time you add a new example, so you have to explicitly tell the classifier that you're done adding examples for now by clicking on the Finalize action link either next to the Load button on the edit classifier page or next to the given classifier's name on the classifier management page. + +Clicking this link will kick off a separate process that trains the classifier in the background. When the page reloads, the Finalize link should have changed to text that reads "Finalizing..." (but if the training set is very small, training may complete almost immediately). After starting finalization, it's fine to walk away for a bit, reload the page, or carry out some unrelated task for the user account. You should not, however, make further changes to the classifier's training set, or start a new crawl that makes use of the classifier.
When the classifier finishes its training phase, the Finalizing message will be replaced by one that reads "Finalized" indicating that the classifier is ready for use. + +====Using a Classifier==== + +Using a classifier is as simple as checking the "Use to Classify" or "Use to Rank" checkboxes next to the classifier's label on the [[Documentation#Page%20Indexing%20and%20Search%20Options|Page Options]] activity, under the "Classifiers and Rankers" heading. When the next crawl starts, the classifier (and any other selected classifiers) will be applied to each fetched page. If "Use to Rank" is checked then the classifier score for that page will be recorded. If "Use to Classify" is checked and if a page is determined to belong to a target class, it will have several meta words added. As an example, if the target class is "spam", and a page is determined to belong to the class with probability .79, then the page will have the following meta words added: + +*class:spam +*class:spam:50plus +*class:spam:60plus +*class:spam:70plus +*class:spam:70 + +These meta words allow one to search for all pages classified as spam at any probability over the preset threshold of .50 (with class:spam), at any probability over a specific multiple of .1 (e.g., over .6 with class:spam:60plus), or within a specific range (e.g., .60–.69 with class:spam:60). Note that no meta words are added if the probability falls below the threshold, so no page will ever have the meta words class:spam:10plus, class:spam:20plus, class:spam:20, and so on. + +[[Documentation#contents|Return to table of contents]]. + +===Page Indexing and Search Options=== + +Several properties about how web pages are indexed and how pages are looked up at search time can be controlled by clicking on Page Options. There are three tabs for this activity: Crawl Time, Search Time, and Test Options. We will discuss each of these in turn. 
+ +====Crawl Time Tab==== + +Clicking on Page Options leads to the default Crawl Time Tab: + +{{class="docs" +((resource:Documentation:PageOptionsCrawl.png|The Page Options Crawl form)) +}} + +This tab controls some aspects about how a page is processed and indexed at crawl time. The form elements before Page Field Extraction Rules are relatively straightforward and we will discuss these briefly below. The Page Rules textarea allows you to specify additional commands for how you would like text to be extracted from a page document summary. The description of this language will take the remainder of this subsection. + +The Get Options From dropdown allows one to load in crawl time options that were used in a previous crawl. Beneath this, the Byte Range to Download dropdown controls how many bytes out of any given web page should be downloaded. Smaller numbers reduce the requirements on disk space needed for a crawl; bigger numbers would tend to improve the search results. If whole pages are being cached, these downloaded bytes are stored in archives with the fetcher. The Summarizer dropdown controls which summarizer is used on a page during page processing. Yioop uses a summarizer to control what portions of a page will be put into the index and are available at search time for snippets. The two available summarizers are Basic, which picks the page's meta title, meta description, h1 tags, etc. in a fixed order until the summary size is reached; and Centroid, which computes an "average sentence" for the document and adds phrases from the actual document according to nearness to this average. If the Centroid summarizer is used, Yioop also generates a word cloud for each document. Centroid tends to produce slightly better results than Basic but is slower. How to tweak the Centroid summarizer for a particular locale is described in the [[Documentation#Localizing%20Yioop%20to%20a%20New%20Language|Localizing Yioop]] section.
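The "average sentence" idea behind the Centroid summarizer can be pictured with the following rough Python sketch. This is illustrative only, not Yioop's actual summarizer code; the cosine-similarity scoring and word-splitting details are assumptions made for the sketch:

```python
# Rough sketch of the Centroid idea (illustrative; not Yioop's code):
# build an "average sentence" as a term-frequency centroid, then keep
# the sentences nearest to that centroid, preserving document order.
from collections import Counter
import math

def centroid_summary(sentences, max_sentences=2):
    bags = [Counter(s.lower().split()) for s in sentences]
    centroid = Counter()
    for bag in bags:
        centroid.update(bag)  # sum of all sentence term frequencies
    def cosine(a, b):
        dot = sum(a[t] * b[t] for t in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0
    ranked = sorted(range(len(sentences)),
                    key=lambda i: -cosine(bags[i], centroid))
    chosen = sorted(ranked[:max_sentences])  # restore document order
    return [sentences[i] for i in chosen]

summary = centroid_summary([
    "the cat sat on the mat",
    "the cat chased the dog",
    "stocks fell sharply today",
])
# the two "cat" sentences dominate the centroid and form the summary
```

A real summarizer would additionally respect a byte budget rather than a sentence count, and strip stop words before computing nearness.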
The Max Page Summary Length in Bytes controls how many of the total bytes can be used to make a page summary which is sent to the queue server. Only words in this summary can actually be looked up in search results. Care should be taken in making this value larger, as it can increase the RAM requirements while crawling (you might have to change the memory_limit variable at the start of queue_server.php to prevent crashing) and it can slow the crawl process down. The Cache whole crawled pages checkbox controls whether, when crawling, to keep both the whole downloaded web page as well as the summary extracted from it (checked) or just the page summary (unchecked). The next dropdown, Allow Page Recrawl After, controls how many days Yioop keeps track of the URLs that it has downloaded. For instance, if one sets this dropdown to 7, then after seven days Yioop will clear its Bloom Filter files used to store which urls have been downloaded, and it would be allowed to recrawl these urls again if they happen to appear in links. It should be noted that all of the information from before the seven days will still be in the index, just that now Yioop will be able to recrawl pages that it had previously crawled. Besides letting Yioop get a fresher version of pages it already has, this also has the benefit of speeding up longer crawls, as Yioop doesn't need to check as many Bloom filter files. In particular, it might just use one and keep it in memory. + +The Page File Types to Crawl checkboxes allow you to decide which file extensions you want Yioop to download during a crawl. This check is done before any download is attempted, so Yioop at that point can only guess the [[http://en.wikipedia.org/wiki/MIME|MIME Type]], as it hasn't received this information from the server yet. An example of a url with a file extension is: + http://flickr.com/humans.txt +which has the extension txt.
So if txt is unchecked, then Yioop won't try to download this page even though Yioop can process plain text files. A url like: + http://flickr.com/ +has no file extension and will be assumed to have an html extension. To crawl sites which have a file extension that is not in the above list, check the unknown checkbox in the upper left of this list. + +The Classifiers and Rankers checkboxes allow you to select the classifiers that will be used to classify or rank pages. Each classifier (see the [[Documentation#Classifying%20Web%20Pages|Classifiers]] section for details) is represented in the list by its class label and two checkboxes. Checking the box under "Use to Classify" indicates that the associated classifier should be used (made active) during the next crawl for classifying; checking the box under "Use to Rank" indicates that the classifier should be used (made active) and its score for the document should be stored so that it can be used as part of the search time score. Each active classifier is run on each page downloaded during a crawl. If "Use to Classify" was checked and the page is determined to belong to the class that the classifier has been trained to recognize, then a meta word like "class:label", where label is the class label, is added to the page summary. For faster access to pages that contain a single term and a label, for example, pages that contain "rich" and are labeled as "non-spam", Yioop actually uses the first character of the label "non-spam" and embeds it as part of the term ID of "rich" on such pages. To ensure this speed-up can be used, it is useful to make sure one's classifier labels begin with different first characters. If "Use to Rank" is checked, then when a classifier is run on the page, the score from the classifier is recorded. When a search is done that might retrieve this page, this score is then used as one component of the overall score that this page receives for the query.
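To see why labels with distinct first characters matter for this speed-up, here is a toy Python illustration. The `labeled_term_id` function and its format are entirely hypothetical and are not Yioop's real term-ID scheme; the point is only that folding a single label character into a term's ID makes labels that share that character indistinguishable:

```python
# Toy illustration (NOT Yioop's real term-ID format) of why classifier
# labels should start with distinct characters: if a term's ID has the
# label's first character folded into it, labels sharing that character
# yield colliding IDs, defeating the single-term-plus-label speed-up.
import hashlib

def labeled_term_id(term, label):
    # hypothetical scheme: first char of label + short hash of the term
    digest = hashlib.md5(term.encode()).hexdigest()[:8]
    return label[0] + digest

id_nonspam = labeled_term_id("rich", "non-spam")
id_notjunk = labeled_term_id("rich", "not-junk")  # also starts with 'n'
id_spam = labeled_term_id("rich", "spam")
print(id_nonspam == id_notjunk, id_nonspam == id_spam)  # True False
```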
+ +The Indexing Plugins checkboxes allow you to select which plugins to use during the crawl. Yioop comes with three built-in plugins: AddressesPlugin, RecipePlugin, and WordFilterPlugin. One can also write or download additional plugins. If the plugin can be configured, next to the checkbox will be a link to a configuration screen. Let's briefly look at each of these plugins in turn... + +Checking the AddressesPlugin enables Yioop during a crawl to try to calculate addresses for each page summary it creates. When Yioop processes a page, by default it creates a summary of the page with a TITLE and a DESCRIPTION as well as a few other fields. With the addresses plugin activated, it will try to extract data into three additional fields: EMAILS, PHONE_NUMBERS, and ADDRESSES. If you want to test out how these behave, pick some web page, view source on the web page, copy the source, and then paste it into the Test Options Tab on the page options page (the Test Options Tab is described later in this section). + +Clicking the RecipePlugin checkbox causes Yioop during a crawl to run the code in indexing_plugins/recipe_plugin.php. This code tries to detect pages which are food recipes, and separately extracts these recipes and clusters them by ingredient. It then adds the search meta words ingredient: and recipe:all to allow one to search recipes by ingredient or to find only documents containing recipes. + +Checking the WordFilterPlugin causes Yioop to run code in indexing_plugins/wordfilter_plugin.php on each downloaded page. +The [[http://www.yioop.com/?c=group&a=wiki&group_id=20&arg=media&page_id=26&n=04%20Niche%20or%20Subject%20Specific%20Crawling%20With%20Yioop.mp4| Niche Crawling Video Tutorial]] has information about how to use this plugin to create subject-specific crawls of the web. This code checks if the downloaded page has one of the words listed in the textarea one finds on the plugin's configure page.
If it does, then the plugin follows the actions listed for pages that contain that term. Below is an example WordFilterPlugin configure page: + +{{class="docs" +((resource:Documentation:WordFilterConfigure.png|Word Filter Configure Page)) +}} + +Lines in this configure file either specify a url or domain using a syntax like [url_or_domain], or specify a rule or a comment. Whitespace is ignored, and everything after a semi-colon on a line is treated as a comment. The rules immediately following a url or domain line, up till the next url or domain line, are in effect if one is crawling a page with that url or domain. Each '''rule line''' in the textarea consists of a comma separated list of literals, followed by a colon, followed by a comma separated list of what to do if the literal condition is satisfied. A single literal in the list of literals is an optional + or - followed by a sequence of non-space characters. After the + or -, everything up until a # symbol (or the end of the literal) is called the term of the literal. If the literal sign is + or if no sign is present, then the literal holds for a document if it contains the term; if the literal sign is -, then the literal holds for a document if it does not contain the term. If there is a decimal number between 0 and 1, say x, after the # up to a comma or the first white-space character, then this is modified so the literal holds only if at least an x fraction of the document's length comes from the literal's term. If, rather than a decimal, x were a positive natural number, then the term would need to occur at least x times. If all the literals in the comma separated list hold, then the rule is said to hold, and the actions will apply. The line -term0:JUSTFOLLOW says that if the downloaded page does not contain the word "term0" then do not index the page, but do follow outgoing links from the page. The line term1:NOPROCESS says if the document has the word "term1" then do not index it or follow links from it.
The last line +term2:NOFOLLOW,NOSNIPPET says if the page contains "term2" then do not follow any outgoing links. NOSNIPPET means that if the page is returned from search results, the link to the page should not have a snippet of text from that page beneath it. As an example of a more complicated rule, consider: + + surfboard#2,bikini#0.02:NOINDEX, NOFOLLOW + +Here, for the rule to hold, the condition surfboard#2 requires that the term surfboard occur at least twice in the document, and the condition bikini#0.02 requires that copies of the word bikini make up at least a 0.02 fraction of the document's total length. In addition to the commands just mentioned, WordFilterPlugin supports standard robots.txt directives such as: NOINDEX, NOCACHE, NOARCHIVE, NOODP, NOYDIR, and NONE. More details about how indexing plugins work and how to write your own indexing plugin can be found in the [[Documentation#Modifying%20Yioop%20Code|Modifying Yioop]] section. + +====Page Field Extraction Language==== + +We now return to the Page Field Extraction Rules textarea of the Page Options - Crawl Time tab. Commands in this area allow a user to control what data is extracted from a summary of a page. The textarea allows you to do things like modify the summary, title, and other fields extracted from a page summary; extract new meta words from a summary; and add links which will appear when a cache of a page is shown. Page Rules are especially useful for extracting data from generic text archives and database archives. How to import such archives is described in the Archive Crawls sub-section of +[[Documentation#Performing%20and%20Managing%20Crawls|Performing and Managing Crawls]]. The input to the page rule processor is an associative array that results from Yioop doing initial processing on a page. To see what this array looks like one can take a web page and paste it into the form on the Test Options tab.
There are two types of page rule statements that a user can define: command statements and assignment statements. In addition, a semicolon ';' can be used to indicate the rest of a line is a comment. Although the initial textarea for rules might appear small, most modern browsers allow one to resize this area by dragging on its lower right hand corner. This makes it relatively easy to see large sets of rules. + +A command statement takes a key field argument for the page associative array and does a function call to manipulate that page. Below is a list of currently supported commands followed by comments on what they do: + + addMetaWords(field) ;add the field and field value to the META_WORD + ;array for the page + addKeywordLink(field) ;split the field on a comma, view this as a search + ;keywords => link text association, and add this to + ;the KEYWORD_LINKS array. + setStack(field) ;set which field value should be used as a stack + pushStack(field) ;add the field value for field to the top of stack + popStack(field) ;pop the top of the stack into the field value for + ;field + setOutputFolder(dir) ;if auxiliary output, rather than just to + ;a yioop index, is being done, then set the folder + ;for this output to be dir + setOutputFormat(format) ;set the format of auxiliary output.
+ ;Should be either CSV or SQL + ;SQL mean that writeOutput will write an insert + ;statement + setOutputTable(table) ;if output is SQL then what table to use for the + ;insert statements + toArray(field) ;splits field value for field on a comma and + ;assign field value to be the resulting array + toString(field) ;if field value is an array then implode that + ;array using comma and store the result in field + ;value + unset(field) ;unset that field value + writeOutput(field) ;use the contents of field value viewed as an array + ;to fill in the columns of a SQL insert statement + ;or CSV row + + +Page rule assignments can either be straight assignments with '=' or concatenation assignments with '.='. Let $page indicate the associative array that Yioop supplies the page rule processor. There are four kinds of values that one can assign: + field = some_other_field ; sets $page['field'] = $page['some_other_field'] + field = "some_string" ; sets $page['field'] to "some string" + field = /some_regex/replacement_where_dollar_vars_allowed/ + ; computes the results of replacing matches to some_regex + ; in $page['field'] with replacement_where_dollar_vars_allowed + field = /some_regex/g ;sets $page['field'] to the array of all matches + ; of some regex in $page['field'] +For each of the above assignments we could have used ".=" instead of "=". We next give a simple example and followed by a couple more complicated examples of page rules and the context in which they were used: + +In the first example, we just want to extract meaningful titles for mail log records that were read in using a TextArchiveBundleIterator. Here after initial page processing a whole email would end up in the DESCRIPTION field of the $page associative array given tot the page rule processor. 
So we use the following two rules: + TITLE = DESCRIPTION + TITLE = /(.|\n|\Z)*?Subject:[\t ](.+?)\n(.|\n|\Z)*/$2/ +We initially set the TITLE to be the whole record, then use a regex to extract out the correct portion of the subject line. The pattern between the first two slashes matches the whole record; within it, the subpattern inside the second pair of parentheses (.+?) matches the subject text. The $2 between the second and third slashes says to replace the value of TITLE with just this portion. + +The next example was used to do a quick first pass processing of records from the [[http://archive.org/details/utzoo-wiseman-usenet-archive|UTzoo Archive of Usenet Posts from 1981-1991]]. What each block does is described in the comments below: + ; + ; Set the UI_FLAGS variable. This variable in a summary controls + ; which of the header elements should appear on cache pages. + ; UI_FLAGS should be set to a string with a comma separated list + ; of the options one wants. In this case, we use: yioop_nav, says that + ; we do want to display the header; version, says that we want to display + ; when a cache item was crawled by Yioop; and summaries, says to display + ; the toggle extracted summaries link and associated summary data.
+ ; Other possible UI_FLAGS are history, whether to display the history + ; dropdown to other cached versions of item; highlight, whether search + ; keywords should be highlighted in cached items + ; + UI_FLAGS = "yioop_nav,version,summaries" + ; + ; Use Post Subject line for title + ; + TITLE = DESCRIPTION + TITLE = /(.|\n)*?Subject:([^\n]+)\n(.|\n)*/$2/ + ; + ; Add a link with a blank keyword search so cache pages have + ; link back to yioop + ; + link_yioop = ",Yioop" + addKeywordLink(link_yioop) + unset(link_yioop) ;using unset so don't have link_yioop in final summary + ; + ; Extract y-M and y-M-j dates as meta word u:date:y-M and u:date:y-M-j + ; + date = DESCRIPTION + date = /(.|\n)*?Date:([^\n]+)\n(.|\n)*/$2/ + date = /.*,\s*(\d*)-(\w*)-(\d*)\s*.*/$3-$2-$1/ + addMetaWord(date) + date = /(\d*)-(\w*)-.*/$1-$2/ + addMetaWord(date) + ; + ; Add a link to articles containing u:date:y-M meta word. The link text + ; is Date:y-M + ; + link_date = "u:date:" + link_date .= date + link_date .= ",Date:" + link_date .= date + addKeywordLink(link_date) + ; + ; Add u:date:y meta-word + ; + date = /(\d*)-.*/$1/ + addMetaWord(date) + ; + ; Get the first three words of subject ignoring re: separated by underscores + ; + subject = TITLE + subject = /(\s*(RE:|re:|rE:|Re:)\s*)?(.*)/$3/ + subject_word1 = subject + subject_word1 = /\s*([^\s]*).*/$1/ + subject_word2 = subject + subject_word2 = /\s*([^\s]*)\s*([^\s]*).*/$2/ + subject_word3 = subject + subject_word3 = /\s*([^\s]*)\s*([^\s]*)\s*([^\s]*).*/$3/ + subject = subject_word1 + unset(subject_word1) + subject .= "_" + subject .= subject_word2 + unset(subject_word2) + subject .= "_" + subject .= subject_word3 + unset(subject_word3) + ; + ; Get the first newsgroup listed in the Newsgroup: line, add a meta-word + ; u:newsgroup:this-newgroup. 
Add a link to cache page for a search + ; on this meta word + ; + newsgroups = DESCRIPTION + newsgroups = /(.|\n)*?Newsgroups:([^\n]+)\n(.|\n)*/$2/ + newsgroups = /\s*((\w|\.)+).*/$1/ + addMetaWord(newsgroups) + link_news = "u:newsgroups:" + link_news .= newsgroups + link_news .= ",Newsgroup: " + link_news .= newsgroups + addKeywordLink(link_news) + unset(link_news) + ; + ; Makes a thread meta u:thread:newsgroup-three-words-from-subject. + ; Adds a link to cache page to search on this meta word + ; + thread = newsgroups + thread .= ":" + thread .= subject + addMetaWord(thread) + unset(newsgroups) + link_thread = "u:thread:" + link_thread .= thread + link_thread .= ",Current Thread" + addKeywordLink(link_thread) + unset(subject) + unset(thread) + unset(link_thread) +As a last example of page rules, suppose we wanted to crawl the web and whenever we detected a page had an address we wanted to write that address as a SQL insert statement to a series of text files. We can do this using page rules and the AddressesPlugin. First, we would check the AddressesPlugin and then we might use page rules like: + summary = ADDRESSES + setStack(summary) + pushStack(DESCRIPTION) + pushStack(TITLE) + setOutputFolder(/Applications/MAMP/htdocs/crawls/data) + setOutputFormat(sql) + setOutputTable(SUMMARY); + writeOutput(summary) +The first line says copy the contents of the ADDRESSES field of the page into a new summary field. The next line says use the summary field as the current stack. At this point the stack would be an array with all the addresses found on the given page. So you could use a command like popStack(first_address) to copy the first address in this array over to a new variable first_address. In the above case what we do instead is push the contents of the DESCRIPTION field onto the top of the stack. Then we push the contents of the TITLE field.
The line + setOutputFolder(/Applications/MAMP/htdocs/crawls/data) +sets /Applications/MAMP/htdocs/crawls/data as the folder that any auxiliary output from the page_processor should go to. setOutputFormat(sql) says we want to output sql; the other possibility is csv. The line setOutputTable(SUMMARY); says the table name to use for INSERT statements should be called SUMMARY. Finally, the line writeOutput(summary) would use the contents of the array entries of the summary field as the column values for an INSERT statement into the SUMMARY table. This writes a line to the file data.txt in /Applications/MAMP/htdocs/crawls/data. If data.txt exceeds 10MB, it is compressed into a file data.txt.0.gz and a new data.txt file is started. + +====Search Time Tab==== + +The Page Options Search Time tab looks like: + +{{class="docs" +((resource:Documentation:PageOptionsSearch.png|The Page Options Search form)) +}} + +The Search Page Elements and Links control group is used to tell which elements and links you would like to have presented on the search landing and search results pages. The Word Suggest checkbox controls whether a dropdown of word suggestions should be presented by Yioop when a user starts typing in the Search box. It also controls whether spelling correction and thesaurus suggestions will appear. The Subsearch checkbox controls whether the links for Image, Video, and News search appear in the top bar of Yioop. You can actually configure what these links are in the [[Documentation#Search%20Sources|Search Sources]] activity. The checkbox here is a global setting for displaying them or not. In addition, if this is unchecked then the hourly activity of downloading any RSS media sources for the News subsearch will be turned off. The Signin checkbox controls whether to display the link to the page for users to sign in to Yioop. The Cache checkbox toggles whether a link to the cache of a search item should be displayed as part of each search result.
The Similar checkbox toggles whether a link to similar search items should be displayed as part of each search result. The Inlinks checkbox toggles whether a link for inlinks to a search item should be displayed as part of each search result. Finally, the IP address checkbox toggles whether a link for pages with the same IP address should be displayed as part of each search result. + +The Search Ranking Factors group of controls (Title Weight, Description Weight, and Link Weight) is used by Yioop to decide how to weigh each portion of a document when it returns query results to you. + +When Yioop ranks search results it searches out in its posting lists until it finds a certain number of qualifying documents. It then sorts these by their score, returning usually the top 10 results. In a multi-queue-server setting the query is simultaneously asked by the name server machine of each of the queue server machines and the results are aggregated. The Search Results Grouping controls allow you to affect this behavior. Minimum Results to Group controls the number of results the name server wants to have before sorting of results is done. When the name server requests documents from each queue server, it requests alpha*(Minimum Results to Group)/(Number of Queue Servers) documents. Server Alpha controls the number alpha. + +The Save button of course saves any changes you make on this form. + +====Test Options Tab==== + +The Page Options Test Options tab looks like: + +{{class="docs" +((resource:Documentation:PageOptionsTest.png|The Page Options Test form)) +}} + +In the Type dropdown one can select a [[http://en.wikipedia.org/wiki/Internet_media_type|MIME Type]] used to select the page processor Yioop uses to extract text from the data you type or paste into the textarea on this page. Test Options lets you see how Yioop would process a web page and add summary data to its index.
After filling in the textarea with a page, clicking Test Process Page will show the $summary associative array Yioop would create from the page after the appropriate page processor is applied. Beneath it shows the $summary array that would result after user-defined page rules from the crawl time tab are applied. Yioop stores a serialized form of this array in an IndexArchiveBundle for a crawl. Beneath this array is an array of terms (or character n-grams) that were extracted from the page together with their positions in the document. Finally, the meta words that the document has are listed. Either extracted terms or meta words could be used to look up this document in a Yioop index. + +===Results Editor=== + +Sometimes after a large crawl one finds that there are some results that appear that one does not want in the crawl or that the summary for some result is lacking. The Results Editor activity allows one to fix these issues without having to do a completely new crawl. It has three main forms: an edited urls form, a url editing form, and a filter websites form. + +If one has already edited the summary for a url, then the dropdown in the edited urls form will list this url. One can select it and click load to get it to display in the url editing form. The purpose of the url editing form is to allow a user to change the title and description for a url that appears on a search results page. Filling out the three fields of the url editing form, or loading values into them through the previous form and changing them, and then clicking save, updates the appearance of the summary for that url. To return to using the default summary, one only fills out the url field, leaves the other two blank, and saves. This form does not affect whether the page is looked up for a given query, only its final appearance. It can only be used to edit the appearance of pages which appear in the index, not to add pages to the index.
Also, the edit will affect the appearance of that page for all indexes managed by Yioop. If you know there is a page that won't be crawled by Yioop, but would like it to appear in an index, please look at the crawl options section of the [[Documentation#Performing%20and%20Managing%20Crawls|Manage Crawls]] documentation. + +To understand the filter websites form, recall the disallowed sites crawl option allows a user to specify they don't want Yioop to crawl a given web site. After a crawl is done, though, one might be asked to remove a website from the crawl results, or one might want to remove a website from the crawl results because it has questionable content. A large crawl can take days to replace, so to make the job of such filtering faster while one is waiting for a replacement crawl in which the site has been disallowed, one can use a search filter. + +{{class="docs" +((resource:Documentation:ResultsEditor.png|The Results Editor form)) +}} + +Using the filter websites form one can specify a list of hosts which should be excluded from the search results. The sites listed in the Sites to Filter textarea are required to be hostnames. Using a filter, any web page with the same host name as one listed in the Sites to Filter will not appear in the search results. So for example, the filter settings in the example image above contain the line http://www.cs.sjsu.edu/, so given these settings, the web page http://www.cs.sjsu.edu/faculty/pollett/ would not appear in search results. + +[[Documentation#contents|Return to table of contents]]. + +===Search Sources=== + +The Search Sources activity is used to manage the media sources available to Yioop, and also to control the subsearch links displayed on the top navigation bar. The Search Sources activity looks like: + +{{class="docs" +((resource:Documentation:SearchSources.png|The Search Sources form)) +}} + +The top form is used to add a media source to Yioop. Currently, the Media Kind can be either Video, RSS, or HTML.
'''Video Media''' sources are used to help Yioop recognize links which are videos on a web video site such as YouTube. This helps both in tagging such pages with the meta word media:video in a Yioop index and in being able to render a thumbnail of the video in the search results. When the media kind is set to video, this form has three fields: Name, which should be a short familiar name for the video site (for example, YouTube); URL, which should consist of a url pattern by which to recognize a video on that site; and Thumb, which consists of a url pattern used in place of the original pattern to find the thumbnail for that video. For example, the value of URL for YouTube is:
+ http://www.youtube.com/watch?v={}&
+This will match any url which begins with http://www.youtube.com/watch?v= followed by some string followed by & followed by another string. The {} indicates that everything from v= to the & should be treated as the identifier for the video. The Thumb url in the case of YouTube is:
+ http://img.youtube.com/vi/{}/2.jpg
+If the identifier in the first video link was yv0zA9kN6L8, then using the above, when displaying a thumb for the video, Yioop would use the image source:
+ http://img.youtube.com/vi/yv0zA9kN6L8/2.jpg
+Some video sites have more complicated APIs for specifying thumbnails. In such cases, you can still do media:video tagging but display a blank thumbnail rather than suggest a thumbnail link. To do this, one uses the Thumb url:
+ http://www.yioop.com/resources/blank.png?{}
+If one selects the media kind to be '''RSS''' (Really Simple Syndication, a kind of news feed; you can also use Atom feeds as sources), then the media sources form has four fields: '''Name''', again a short familiar name for the RSS feed; '''URL''', the url of the RSS feed; '''Language''', the language of the RSS feed; and '''Image XPath''', an optional field which allows you to specify an XPath, relative to an RSS item, for an image url if one is present in the item.
The Language element is used to control whether or not a news item will display given the current language settings of Yioop. If under Manage Machines the Media Updater on the Name Server is turned on, then these RSS feeds will be downloaded hourly. If under the Search Time screen of the [[Documentation#Page%20Indexing%20and%20Search%20Options|Page Options]] activity the subsearch checkbox is checked, then there will be a link to News which appears on the top of the search page. Clicking on this link will display news items in order of recency.
+
+An '''HTML Feed''' is a web page that has news articles, like an RSS page, that you want the Media Updater to scrape on an hourly basis. To specify where in the HTML page the news items appear, you specify different XPath information. For example,
+<pre>
+ Name: Cape Breton Post
+ URL: http://www.capebretonpost.com/News/Local-1968
+ Channel: //div[contains(@class, "channel")]
+ Item: //article
+ Title: //a
+ Description: //div[contains(@class, "dek")]
+ Link: //a
+</pre>
+The Channel field is used to specify the tag that encloses all the news items. Relative to this as the root tag, //article gives the path to an individual news item. Then, relative to an individual news item, //a gets the title, etc. Link extracts the href attribute of that same //a .
+
+Returning again to Image XPath, a field of both the RSS form and the HTML Feed form: not all RSS feeds use the same tag to specify the image associated with a news item. The Image XPath allows you to specify, relative to a news item (either RSS or HTML), where an image thumbnail exists. If a site does not use such thumbnails, one can prefix the path with ^ to give the path, relative to the root of the whole file, to where a thumbnail for the news source exists. Yioop automatically removes escaping from RSS containing escaped HTML when computing this.
For example, the following works for the feed:
+<pre>
+ http://feeds.wired.com/wired/index
+ //description/div[contains(@class,
+ "rss_thumbnail")]/img/@src
+</pre>
+
+Beneath this media sources form is a table listing all the currently added media sources, their urls, and links that allow one to edit or delete sources.
+
+The second form on the page is the Add a Subsearch form. The form allows you to add a new specialized search link which may appear at the top of the search page. If more than three of these subsearches are added, or if one is seeing the page on a mobile platform, one instead gets a "More" link. This links to the tool.php page, which then lists out all possible specialized searches, some account links, and other useful Yioop tools. The Add a Subsearch form has three fields: Folder Name is a short familiar name for the subsearch; it will appear as part of the query string when the given subsearch is being performed. For example, if the folder name was news, then s=news will appear as part of the query string when a news subsearch is being done. Folder Name is also used to make the localization identifier used in translating the subsearch's name into different languages. This identifier will have the format db_subsearch_identifier. For example, db_subsearch_news. Index Source, the second form element, is used to specify a crawl or a crawl mix that the given subsearch should use in returning results. Results per Page, the last form element, controls the number of search results which should appear when using this kind of subsearch.
+
+Beneath this form is a table listing all the currently added subsearches and their properties. The actions column at the end of this table lets one either edit, localize, or delete a given subsearch. Clicking localize takes one to the Manage Locales page for the default locale and that particular subsearch localization identifier, so that you can fill in a value for it.
Remembering the name of this identifier, one can then in Manage Locales navigate to other locales and fill in translations for them as well, if desired.
+
+[[Documentation#contents|Return to table of contents]].
+
+===GUI for Managing Machines and Servers===
+
+Rather than use the command line as described in the [[Documentation#prerequisites|Prerequisites for Crawling]] section, it is possible to start/stop and view the log files of queue servers and fetchers through the Manage Machines activity. In order to do this, the additional requirements for this activity mentioned in the [[Documentation#Requirements|Requirements]] section must have been met. The Manage Machines activity looks like:
+
+{{class="docs"
+((resource:Documentation:ManageMachines.png|The Manage Machines form))
+}}
+
+The Add machine form at the top of the page allows one to add a new machine to be controlled by this Yioop instance. The Machine Name field lets you give this machine an easy to remember name. The Machine URL field should be filled in with the URL to the installed Yioop instance. The Mirror checkbox says whether you want the given Yioop installation to act as a mirror for another Yioop installation. Checking it will reveal a Parent Name textfield that allows you to choose which installation, amongst the previously entered machine names (not urls), you want to mirror. The Has Queue Server checkbox is used to say whether the given Yioop installation will be running a queue server or not. Finally, the Number of Fetchers dropdown allows you to say how many fetcher instances you want to be able to manage for that machine. Beneath the Add machine form are the Machine Information listings. These show the machines currently known to this Yioop instance. This list always begins with the Name Server itself and a toggle to control whether or not the Media Updater process is running on the Name Server.
This allows you to control whether or not Yioop attempts to update its RSS (or Atom) search sources on an hourly basis. There is also a link to the log file of the Media Updater process. Under the Name Server information is a dropdown that can be used to control the number of current machine statuses that are displayed for all other machines that have been added. It also might have next and previous arrow links to go through the currently available machines.
+
+Beneath this dropdown is a set of boxes for each machine you have added to Yioop. In the far corner of each box is a link to Delete that machine from the list of known machines, if desired. Besides this, each box lists the queue server, if any, and each of the fetchers you requested to be able to manage on that machine. Next to these there is a link to the log file for that server/fetcher, and below this there is an On/Off switch for starting and stopping the server/fetcher. This switch is green if the server/fetcher is running and red otherwise. A similar On/Off switch is present to turn on and off mirroring on a machine that is acting as a mirror. A switch can be yellow if the process has crashed but might be automatically restarted by Yioop without your intervention.
+
+==Building Sites with Yioop==
+
+===Building a Site using Yioop's Wiki System===
+
+As was mentioned in the Configure Activity [[Documentation#advance|Toggle Advance Settings]] section of the documentation, background color, icons, title, and SEO meta information for a Yioop instance can all be configured from the Configure Activity. Adding advertisements such as banner and skyscraper ads can be done using the form on the [[Documentation#Optional%20Server%20and%20Security%20Configurations|Server Settings]] activity.
If you would like a site with a more custom landing page, then you can check '''Use Wiki Public Main Page as Landing Page''' under Toggle Advance
+Settings : Site Customizations. The Public Main page will then be the page you see when you first go to your site. You can then build out your site using the wiki system for the public group. Common headers and footers can be specified for pages on your site using each wiki page's Settings attributes. More advanced styling of pages can be done by specifying the auxiliary css data under Toggle Advance Settings. As wiki pages can be set to be galleries or slide presentations, and as Yioop supports including images and video and embedding search bars and math equations on pages using [[Syntax|Yioop's Wiki Syntax]], one can develop quite advanced sites using just this approach. The video tutorial [[https://yioop.com/?c=group&a=wiki&group_id=20&arg=media&page_id=26&n=03%20Building%20Web%20Sites%20with%20Yioop.mp4|Building Websites Using Yioop]] explains how the Seekquarry.com site was built using Yioop software in this way.
+
+===Building a Site using Yioop as Framework===
+
+For more advanced, dynamic websites than the wiki approach described above, the Yioop code base can still serve as the code base for new custom search web sites. The web-app portion of Yioop uses a [[https://en.wikipedia.org/wiki/Model-view-adapter|model-view-adapter (MVA) framework]]. This is a common, web-suitable variant on the more well-known Model View Controller design pattern. In this set-up, sub-classes of the Model class should handle file I/O and database functions, sub-classes of View should be responsible for rendering outputs, and sub-classes of the Controller class do calculations on data received from the web and from the models to give the views the data they finally need to render.
In the remainder of this section we describe how this framework is implemented in Yioop and how to add code to the WORK_DIRECTORY/app folder to customize things for your site. In this discussion we will use APP_DIR to refer to WORK_DIRECTORY/app and BASE_DIR to refer to the directory where Yioop is installed.
+
+The index.php script is the first script run by the Yioop web app. It has an array $available_controllers which lists the controllers available to the script. The names of the controllers in this array are lowercase. Based on whether the $_REQUEST['c'] variable is in this array, index.php either loads the file {$_REQUEST['c']}_controller.php or loads whatever the default controller is. index.php also checks for the existence of APP_DIR/index.php and loads it if it exists. This gives the app developer a chance to change the available controllers and which controller is set for a given request. A controller file should have in it a class which extends the class Controller. Controller files should always have names of the form somename_controller.php and the class inside them should be named SomenameController. Notice it is Somename rather than SomeName. These general naming conventions are used for models, views, etc. Any Controller subclass has methods component($name), model($name), view($name), and indexing_plugin($name). These methods load, instantiate, and return a class with the given name. For example, $my_controller->model("crawl"); checks to see if a CrawlModel has already been instantiated; if so, it returns it; if not, it does a require_once on model/crawl_model.php, then instantiates a CrawlModel, saves a reference to it, and returns it.
+
+If a require_once is needed, Yioop first looks in APP_DIR. For example, $my_controller->view("search") would first look for a file APP_DIR/views/search_view.php to include; if it cannot find such a file, then it tries to include BASE_DIR/views/search_view.php.
So to change the behavior of an existing BASE_DIR file, one just has a modified copy of the file in the appropriate place in APP_DIR. This holds in general for other program files such as components, models, and plugins. It doesn't hold for resources such as images -- we'll discuss those in a moment. Notice that because it looks in APP_DIR first, you can go ahead and create new controllers, models, views, etc. which don't exist in BASE_DIR and get Yioop to load them.
+A Controller must implement the abstract method processRequest. The index.php script, after finishing its bootstrap process, calls the processRequest method of the Controller it chose to load. If this was your controller, the code in your controller should make use of data gotten out of the loaded models as well as data from the web request to do some calculations. Typically, to determine the calculation performed, the controller cleans and looks at $_REQUEST['a'], the request activity, and uses the method call($activity) to call a method that can handle the activity. When a controller is constructed it makes use of the global variable $COMPONENT_ACTIVITIES defined in configs/config.php to know which components have which activities. The call method checks if there is a Component responsible for the requested activity; if there is, it calls that Component's $activity method; otherwise, the method that handles $activity is assumed to come from the controller itself. The results of the calculations done in $activity would typically be put into an associative array $data. After the call method completes, processRequest would typically take $data and call the base Controller method displayView($view, $data). Here $view is whichever loaded view object you would like to display.
+
+To complete the picture of how Yioop eventually produces a web page or other output, we now describe how subclasses of the View class work. Subclasses of View have a field $layout and two methods, helper($name) and element($name).
A subclass of View has at most one Layout, and it is used for rendering the header and footer of the page. It is included and instantiated by setting $layout to be the name of the layout one wants to load. For example, $layout="web"; would load either the file APP_DIR/views/layouts/web_layout.php or BASE_DIR/views/layouts/web_layout.php. This file is expected to have in it a class WebLayout extending Layout. The constructor of a Layout takes as argument a view which it sets to an instance variable. The way Layouts get drawn is as follows: When the controller calls displayView($view, $data), this method does some initialization and then calls the render($data) method of the base View class. This in turn calls the render($data) method of whatever Layout was on the view. This render method then draws the header, then calls $this->view->renderView($data); to draw the view, and finally draws the footer.
+
+The methods helper($name) and element($name) of View load, instantiate (if necessary), and return the Helper or Element $name in a similar fashion to the model($name) method of Controller. Elements have render($data) methods and can be used to draw out portions of pages which may be common across Views. Helpers, on the other hand, are typically used to render UI elements. For example, OptionsHelper has a render($id, $name, $options, $selected) method and is used to draw select dropdowns.
+
+When rendering a View or Element one often has css, scripts, images, videos, objects, etc. In BASE_DIR, the targets of these tags would typically be stored in the css, scripts, or resources folders. The APP_DIR/css, APP_DIR/scripts, and APP_DIR/resources folders are the natural places for them in your customized site. One wrinkle, however, is that APP_DIR, unlike BASE_DIR, doesn't have to be under your web server's DOCUMENT_ROOT. So how does one refer in a link to these folders?
To do this one uses Yioop's ResourceController class, which can be invoked by a link like:
+ <img src="?c=resource&a=get&n=myicon.png&f=resources" />
+Here c=resource specifies the controller, a=get specifies the activity -- to get a file, n=myicon.png specifies we want the file myicon.png -- the value of n is cleaned to make sure it is a filename before being used, and f=resources specifies the folder -- f is allowed to be one of css, script, or resources. This would get the file APP_DIR/resources/myicon.png .
+
+This completes our description of the Yioop framework and how to build a new site using it. It should be pointed out that code in the APP_DIR can be localized using the same mechanism as in BASE_DIR. More details on this can be found in the section on [[Documentation#Localizing%20Yioop%20to%20a%20New%20Language|Localizing Yioop]].
+
+[[Documentation#contents|Return to table of contents]].
+
+===Embedding Yioop in an Existing Site===
+
+One use-case for Yioop is to serve search results for your existing site. There are three common ways to do this: (1) On your site have a web-form or links with your installation of Yioop as their target and let Yioop format the results. (2) Use the same kind of form or links, but request an OpenSearch RSS Response from Yioop and then you format the results and display them within your site. (3) Your site makes function calls of the Yioop Search API and gets either PHP arrays or a string back and then does what it wants with the results. For access methods (1) and (2) it is possible to have Yioop on a different machine so that it doesn't consume your main web-site's machine's resources. As we mentioned in the configuration section, it is possible to disable each of these access paths from within the Admin portion of the web-site. This might be useful, for instance, if you are using access methods (2) or (3) and don't want users to be able to access the Yioop search results via its built-in web form.
We will now spend a moment to look at each of these access methods in more detail...
+
+====Accessing Yioop via a Web Form====
+
+A very minimal code snippet for such a form would be:
+ <form method="get" action='YIOOP_LOCATION'>
+ <input type="hidden" name="its" value="TIMESTAMP_OF_CRAWL_YOU_WANT" />
+ <input type="hidden" name="l" value="LOCALE_TAG" />
+ <input type="text" name="q" value="" />
+ <button type="submit">Search</button>
+ </form>
+In the above form, you should change YIOOP_LOCATION to your instance of Yioop's web location, TIMESTAMP_OF_CRAWL_YOU_WANT should be the Unix timestamp that appears in the name of the IndexArchive folder that you want Yioop to serve results from, and LOCALE_TAG should be the locale you want results displayed in, for example, en-US for American English. In addition to embedding this form on some page on your site, you would probably want to change the resources/yioop.png image to something more representative of your site. You might also want to edit the file views/search_view.php to give a link back to your site from the search results.
+
+If you had a form such as above, clicking Search would take you to the URL:
+ YIOOP_LOCATION?its=TIMESTAMP_OF_CRAWL_YOU_WANT&l=LOCALE_TAG&q=QUERY
+where QUERY is what was typed in the search form. Yioop supports two other kinds of queries: related sites queries and cache look-up queries. The related query format is:
+ YIOOP_LOCATION?its=TIMESTAMP_OF_CRAWL_YOU_WANT&l=LOCALE_TAG&a=related&arg=URL
+where URL is the url that you are looking up related URLs for. To do a look-up of the Yioop cache of a web page the url format is:
+ YIOOP_LOCATION?its=TIMESTAMP_OF_CRAWL_YOU_WANT&l=LOCALE_TAG&q=QUERY&a=cache&arg=URL
+Here the terms listed in QUERY will be styled in different colors in the web page that is returned; URL is the url of the web page you want to look up in the cache.
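Since these are ordinary GET urls, they can also be generated programmatically. The sketch below (in Python, for illustration; Yioop itself is PHP, and the function name and the location/timestamp values are invented placeholders) builds the three url forms just described:

```python
from urllib.parse import urlencode

def yioop_query_url(yioop_location, crawl_timestamp, locale_tag,
                    query=None, activity=None, arg=None):
    """Build a Yioop search url of the kinds described above.
    yioop_location and crawl_timestamp stand in for your installation's
    url and the Unix timestamp of the IndexArchive you want."""
    params = {"its": crawl_timestamp, "l": locale_tag}
    if query is not None:
        params["q"] = query
    if activity is not None:   # "related" or "cache" queries
        params["a"] = activity
        params["arg"] = arg
    return yioop_location + "?" + urlencode(params)

# A plain query and a related-sites query:
plain = yioop_query_url("http://localhost/yioop/", 1317152828, "en-US",
                        query="art")
related = yioop_query_url("http://localhost/yioop/", 1317152828, "en-US",
                          activity="related", arg="http://www.ucanbuyart.com/")
```

Note that urlencode percent-escapes the arg url, which is what a browser submitting the form above would also do.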
+
+====Accessing Yioop and getting an OpenSearch RSS or JSON Response====
+
+The same basic urls as above can return RSS or JSON results simply by appending &f=rss or &f=json to the end of them. This of course only makes sense for usual and related url queries -- cache queries return web pages, not a list of search results. Here is an example of what a portion of an RSS result might look like:
+
+ <?xml version="1.0" encoding="UTF-8" ?>
+ <rss version="2.0" xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/"
+ xmlns:atom="http://www.w3.org/2005/Atom"
+ >
+ <channel>
+ <title>PHP Search Engine - Yioop : art</title>
+ <language>en-US</language>
+ <link>http://localhost/git/yioop/?f=rss&q=art&its=1317152828</link>
+ <description>Search results for: art</description>
+ <opensearch:totalResults>1105</opensearch:totalResults>
+ <opensearch:startIndex>0</opensearch:startIndex>
+ <opensearch:itemsPerPage>10</opensearch:itemsPerPage>
+ <atom:link rel="search" type="application/opensearchdescription+xml"
+ href="http://localhost/git/yioop/yioopbar.xml"/>
+ <opensearch:Query role="request" searchTerms="art"/>
+
+ <item>
+ <title> An Online Fine Art Gallery U Can Buy Art -
+ Buy Fine Art Online</title>
+
+ <link>http://www.ucanbuyart.com/</link>
+ <description> UCanBuyArt.com is an online art gallery
+ and dealer designed... art gallery and dealer designed for art
+ sales of high quality and original... art sales of high quality
+ and original art from renowned artists. Art</description>
+ </item>
+ ...
+ ...
+ </channel>
+ </rss>
+
+Notice the opensearch: tags tell us the totalResults, startIndex, and itemsPerPage. The opensearch:Query tag tells us what the search terms were.
+
+====Accessing Yioop via the Function API====
+
+The last way we will consider to get search results out of Yioop is via its function API.
The Yioop Function API consists of the following three methods in controllers/search_controller.php :
+ function queryRequest($query, $results_per_page, $limit = 0)
+
+ function relatedRequest($url, $results_per_page, $limit = 0,
+ $crawl_time = 0)
+
+ function cacheRequest($url, $highlight=true, $terms ="",
+ $crawl_time = 0)
+These methods handle basic queries, related queries, and cache of web page requests respectively. The arguments of the first two are reasonably self-explanatory. The $highlight and $terms arguments to cacheRequest are to specify whether or not you want syntax highlighting of any of the words in the returned cached web-page. If highlighting is wanted, then $terms should be a space separated list of terms.
+
+An example script showing what needs to be set up before invoking these methods, as well as how to extract results from what is returned, can be found in the file examples/search_api.php .
+
+[[Documentation#contents|Return to table of contents]].
+
+===Localizing Yioop to a New Language===
+
+The Manage Locales activity can be used to configure Yioop for use with different languages and for different regions. If you decide to customize your Yioop installation by adding files to WORK_DIRECTORY/app as described in the [[Documentation#Building%20a%20Site%20using%20Yioop%20as%20Framework|Building a Site using Yioop as a Framework]] section, then the localization tools described in this section can also be used to localize your custom site. Clicking the Manage Locales activity one sees a page like:
+
+{{class="docs"
+((resource:Documentation:ManageLocales.png|The Manage Locales form))
+}}
+
+The first form on this activity allows you to create a new locale -- an object representing a language and a region. The first field on this form should be filled in with a name for the locale in the language of the locale. So for French you would put Français. The locale tag should be the IETF language tag.
The '''Writing Mode''' element on the form is to specify how the language is written. There are four options: lr-tb -- from left-to-right from the top of the page to the bottom as in English; rl-tb -- from right-to-left from the top of the page to the bottom as in Hebrew and Arabic; tb-rl -- from the top of the page to the bottom from right-to-left as in Classical Chinese; and finally, tb-lr -- from the top of the page to the bottom from left-to-right as in non-cyrillic Mongolian or American Sign Language. lr-tb and rl-tb are better supported than the vertical writing modes. As of this writing, Internet Explorer and WebKit-based browsers (Chrome/Safari) have some vertical language support, and the Yioop stylesheets for vertical languages still need some tweaking. For information on the status in Firefox check out this [[https://bugzilla.mozilla.org/show_bug.cgi?id=writing-mode|writing mode bug]]. Finally, the '''Locale Enabled''' checkbox controls whether or not to present the locale on the Settings Page. This allows you to choose only the locales you want for your website without having to delete the locale data for other locales you don't want now, but may want in the future as more translated strings become available.
+
+Beneath the Add Locale form is a table, alphabetical in the locale tag, listing some of the current locales. The Show dropdown lets you control how many of these locales are displayed in one go. The Search link lets you bring up an advanced search form to search for particular locales and also allows you to control the direction of the listing. The first column of the Locale List table has a link with the name of the locale. Clicking on this link brings up a page where one can edit the strings for that locale. The next two columns of the Locale List table give the locale tag and writing direction of the locale; this is followed by the percent of strings translated.
Clicking the Edit link in the column lets one edit the locale tag and text direction of a locale. Finally, clicking the Delete link lets one delete a locale and all its strings.
+
+To translate string ids for a locale, click on its name link. This should display the following forms and a table of string ids and their translated values:
+
+{{class="docs"
+((resource:Documentation:EditLocaleStrings.png|The Edit Locales form))
+}}
+
+In the above case, the link for English was clicked. The Back link in the corner can be used to return to the previous form. The dropdown controls whether to display all localizable strings or just those missing translations. The Filter field can be used to restrict the list of string ids presented to just those matching what is in this field. Beneath this dropdown, the Edit Locale page mainly consists of a two column table: the right column being string ids, the left column containing what should be their translation into the given locale. If no translation exists yet, the field will be displayed in red. String ids are extracted by Yioop automatically from controller, view, helper, layout, and element class files which are either in the Yioop installation itself or in the installation WORK_DIRECTORY/app folder. Yioop looks for tl() function calls to extract ids from these files; for example, on seeing tl('search_view_query_results') Yioop would extract the id search_view_query_results; on seeing tl('search_view_calculated', $data['ELAPSED_TIME']) Yioop would extract the id 'search_view_calculated'. In the second case, the translation is expected to have a %s in it for the value of $data['ELAPSED_TIME']. Note %s is used regardless of the type, say int, float, string, etc., of $data['ELAPSED_TIME']. tl() can handle additional arguments; whenever an additional argument is supplied, an additional %s would be expected somewhere in the translation string.
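To make the %s convention concrete, here is a minimal sketch (in Python, for illustration only; Yioop's real tl() is a PHP function, and the translation table below is invented for the example):

```python
def tl(translations, string_id, *args):
    """Look up string_id in a translation table and substitute each
    extra argument for one %s placeholder, mimicking the behavior of
    Yioop's tl() described above."""
    translation = translations.get(string_id)
    if translation is None:
        return string_id   # an untranslated id falls through unchanged
    # each additional argument replaces one %s, regardless of its type
    for arg in args:
        translation = translation.replace("%s", str(arg), 1)
    return translation

# hypothetical translation table for one locale
translations = {"search_view_calculated": "Results calculated in %s seconds"}
print(tl(translations, "search_view_calculated", 0.42))
# prints: Results calculated in 0.42 seconds
```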
If you make a set of translations, be sure to submit the form associated with this table by scrolling to the bottom of the page and clicking the Submit link. This saves your translations; otherwise, your work will be lost if you navigate away from this page. One aid to translating is that if you hover your mouse over a field that needs translation, then its translation in the default locale (usually English) is displayed. If you want to find where in the source code a string id comes from, note that ids follow the rough convention file_name_approximate_english_translation. So you would expect to find admin_controller_login_successful in the file controllers/admin_controller.php . String ids with the prefix db_ (such as the names of activities) are stored in the database, so you cannot find these ids in the source code. The tooltip trick mentioned above does not work for database string ids.
+
+====Localizing Wiki Pages====
+When a user goes to a wiki page with a URL such as
+ YIOOP_LOCATION?c=group&group_id=some_integer&a=wiki&arg=read&page_name=Some_Page_Name
+or
+ YIOOP_LOCATION?c=admin&group_id=some_integer&a=wiki&arg=read&page_name=Some_Page_Name
+or, for the public group, possibly with
+ YIOOP_LOCATION?c=static&p=Some_Page_Name
+the page that is displayed is in the locale that has been most recently set for the user. If no locale was set, then Yioop tries to determine the locale based on browser header info, and if this fails, falls back to the Default Locale set when Yioop was configured. When one edits a wiki page, the locale that one is editing the page for is displayed under the page name, such as en-US in the image below:
+{{class="docs"
+((resource:Documentation:LocaleOnWikiPage.png|Locale on a wiki page))
+}}
+To edit the page for a different locale, choose the locale you want using the Settings page while logged in, and then navigate to the wiki page you would like to edit (using the same name from the original language).
Suppose you were editing the Dental_Floss page in the en-US locale. To make the French page, you click Settings on the top bar of Yioop, go to your account settings, and choose French (fr-FR) as the language. Now one would navigate back to the wiki you were on, to the Dental_Floss page, which doesn't exist for French. You could click Edit now and make the French page at this location, but this would be sub-optimal as the French word for dental floss is dentrifice. So instead, on the fr-FR Dental_Floss edit page, you edit the page Settings to make this page a Page Alias for Dentrifice, and then create and edit the French Dentrifice article. If a user then starts on the English version of the page and switches locales to French, they will end up on the Dentrifice page. You should also set up the page alias in the reverse direction as well, to handle when someone starts on the French Dentrifice page and switches to the en-US Dentrifice.
+
+====Adding a stemmer, segmenter or supporting character n-gramming for your language====
+
+Depending on the language you are localizing to, it may make sense to write a stemmer for words that will be inserted into the index. A stemmer takes inflected or sometimes derived words and reduces them to their stem. For instance, jumps and jumping would be reduced to jump in English. As Yioop crawls, it attempts to detect the language of a given web page it is processing. If a stemmer exists for this language, it will call the Tokenizer class's stem($word) method on each word it extracts from the document before inserting information about it into the index. Similarly, if an end-user is entering a simple conjunctive search query and a stemmer exists for his language settings, then the query terms will be stemmed before being looked up in the index. Currently, Yioop comes with stemmers for English, French, German, Italian, and Russian.
The English stemmer uses the Porter Stemming Algorithm [ [[Documentation#P1980|P1980]]]; the other stemmers are based on the algorithms presented at snowball.tartarus.org. Stemmers should be written as a static method located in the file WORK_DIRECTORY/locale/en-US/resources/tokenizer.php . The [[snowball.tartarus.org]] link points to a site that has source code for stemmers for many other languages (unfortunately, not written in PHP). It would not be hard to port these to PHP and then modify the tokenizer.php file of the appropriate locale folder. For instance, one could modify the file WORK_DIRECTORY/locale/pt/resources/tokenizer.php to contain a class PtTokenizer with a static method stem($word) if one wanted to add a stemmer for Portuguese.

The class inside tokenizer.php can also be used by Yioop to do word segmentation. This is the process of splitting a string of words without spaces in some language into its component words. Yioop comes with an example segmenter for the zh-CN (Chinese) locale. It works by starting at the end of the string and trying to greedily find the longest word that can be matched with the portion of the suffix of the string that has not yet been processed (reverse maximal match). To do this it makes use of a word Bloom filter as part of how it detects whether a string is a word or not. We describe how to make such a filter using token_tool.php in a moment.

In addition to supporting the ability to add stemmers and segmenters, Yioop also supports a default technique which can be used in lieu of a stemmer called character n-grams. When used, this technique segments text into sequences of n characters which are then stored in Yioop as terms. For instance, if n were 3, then the word "thunder" would be split into "thu", "hun", "und", "nde", and "der" and each of these would be associated with the document that contained the word thunder.
N-grams are useful for languages like Chinese and Japanese in which words in the text are often not separated with spaces. They are also useful for languages like German which can have long compound words. The drawback of n-grams is that they tend to make the index larger. For Yioop built-in locales that do not have a stemmer, the file WORK_DIRECTORY/locale/LOCALE-TAG/resources/tokenizer.php has a line of the form $CHARGRAMS['LOCALE_TAG'] = SOME_NUMBER; This number is the length of string to use in doing char-gramming. If you add a language to Yioop and want to use char-gramming, merely add a tokenizer.php to the corresponding locale folder with such a line in it.

{{id='token_tool'
====Using token_tool.php to improve search performance and relevance for your language====
}}

configs/token_tool.php is used to create suggest word dictionaries and word filter files for the Yioop search engine. To create either of these items, the user puts a source file in Yioop's WORK_DIRECTORY/prepare folder. Suggest word dictionaries are used to supply the content of the dropdown of search terms that appears as a user is entering a query in Yioop. They are also used to do spell correction suggestions after a search has been performed. To make a suggest dictionary one can use a command like:

 php token_tool.php dictionary filename locale endmarker

Here ''filename'' should be in the current folder or PREP_DIR, locale is the locale (for example, en-US) this suggest file is being made for and where a file suggest_trie.txt.gz will be written, and endmarker is the end of word symbol to use in the trie. For example, $ works pretty well. The format of ''filename'' should be a sequence of lines, each line containing a word or phrase followed by a space followed by a frequency count. I.e., the last thing on the line should be a number. Given a corpus of documents, a frequency for a word would be the number of occurrences of that word in the corpus.
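A little code can produce a file in this format from a raw text corpus. The following sketch is hypothetical and not part of Yioop; the file names corpus.txt and frequency_list.txt are made up for illustration:

```php
<?php
// Hypothetical helper (not part of Yioop): build "word frequency" lines
// in the format token_tool.php's dictionary command expects.
function computeTermFrequencies($text)
{
    // Split on runs of non-letter characters; requires the mbstring
    // extension for mb_strtolower
    $terms = preg_split('/[^\p{L}]+/u', mb_strtolower($text), -1,
        PREG_SPLIT_NO_EMPTY);
    $counts = array_count_values($terms);
    arsort($counts); // most frequent terms first
    return $counts;
}
$counts = computeTermFrequencies(file_get_contents("corpus.txt"));
$out = "";
foreach ($counts as $term => $count) {
    $out .= "$term $count\n"; // word/phrase, space, frequency count
}
file_put_contents("frequency_list.txt", $out);
```

The resulting frequency_list.txt could then be passed to token_tool.php as the ''filename'' argument, or stripped to one word per line for use as a filter-file source.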

token_tool.php can also be used to make filter files used by a word segmenter. To make a filter file, token_tool.php is run from the command line as:
 php token_tool.php segment-filter dictionary_file locale

Here dictionary_file should be a text file with one word/line, and locale is the IANA language tag of the locale to store the results for.

====Obtaining data sets for token_tool.php====

Many word lists with frequencies are obtainable on the web for free with Creative Commons licenses. A good starting point is:
 [[http://en.wiktionary.org/wiki/Wiktionary:Frequency_lists]]
A little script-fu can generally take such a list and output it with the line format of "word/phrase space frequency" needed by token_tool.php and as the word/line format used for filter files.

====Spell correction and romanized input with locale.js====

Yioop supports the ability to suggest alternative queries after a search is performed. These queries are mainly restricted to fixing typos in the original query. In order to calculate these spelling corrections, Yioop takes the query and, for each query term, computes each possible single character change to that term. For each of these, it looks up in the given locale's suggest_trie.txt.gz a frequency count of that variant, if it exists. If the best suggestion is some multiple better than the frequency count of the original query, then Yioop suggests this alternative query. In order for this to work, Yioop needs to know what constitutes a single character in the original query. The file locale.js in the WORK_DIRECTORY/locale/LOCALE_TAG/resources folder can be used to specify this for the locale given by LOCALE_TAG. To do this, all you need to do is specify a Javascript variable alpha. For example, for French (fr-FR) this looks like:
 var alpha = "aåàbcçdeéêfghiîïjklmnoôpqrstuûvwxyz";
The letters do not have to be in any alphabetical order, but should be comprehensive of the non-punctuation symbols of the language in question.

Another thing locale.js can be used for is to give mappings between roman letters and other scripts for use in Yioop's autosuggest dropdown that appears as you type a query. As you type, the scripts/suggest.js function onTypeTerm is called. This in turn will call a particular locale's locale.js function transliterate(query), if it exists. This function should return a string with the result of the transliteration. An example of doing this is given for the Telugu locale in Yioop.

====Thesaurus Results and Part of Speech Tagging====

As mentioned in the [[Documentation#Search%20Basics|Search Basics]] topic, for some queries Yioop displays a list of related queries to one side of the search results. These are obtained from a "computer thesaurus". In this subsection, we describe how to enable this facility for English and how you could add this functionality for other languages. If enabled, the thesaurus can also be used to modify search ranking, as described in the [[Ranking#Final%20Reordering|Final Reordering]] section of the Yioop Ranking Mechanisms document.

In order to generate suggested related queries, Yioop first tags the original query terms according to part of speech. For en-US, this is done by calling a method: tagTokenizePartOfSpeech($text) in WORK_DIRECTORY/locale/en-US/resources/tokenizer.php. For en-US, a simple Brill tagger (see the Ranking document for more info) is implemented to do this. After this method is called, the terms in $text should have a suffix ~part-of-speech, where part-of-speech is one of NN for noun, VB for verb, AJ for adjective, AV for adverb, or some other value (which would be ignored by Yioop). For example, the noun dog might become dog~NN after tagging. To localize to another language, this method in the corresponding tokenizer.php file would need to be implemented.
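To make the tagging convention concrete, here is a toy sketch of the expected output format. This is not Yioop's actual Brill tagger; the tiny lexicon below is invented purely for illustration:

```php
<?php
// Toy illustration (not Yioop's real tagger) of the output convention
// used by tagTokenizePartOfSpeech($text): each term gets a
// ~part-of-speech suffix. The lexicon here is hypothetical.
function toyTagPartOfSpeech($text)
{
    $lexicon = ["dog" => "NN", "runs" => "VB", "quickly" => "AV"];
    $tagged = [];
    foreach (explode(" ", $text) as $term) {
        // Terms with tags Yioop does not recognize are simply ignored
        $tag = isset($lexicon[$term]) ? $lexicon[$term] : "UNKNOWN";
        $tagged[] = $term . "~" . $tag;
    }
    return implode(" ", $tagged);
}
```

With this sketch, toyTagPartOfSpeech("dog runs quickly") returns "dog~NN runs~VB quickly~AV", matching the dog~NN example above.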

The second method needed for Thesaurus results is scoredThesaurusMatches($term, $word_type, $whole_query), which should also be in tokenizer.php for the desired locale. Here $term is a term (without a part-of-speech tag), $word_type is the part of speech (one of the ones listed above), and $whole_query is the original query. The output of this method should be an array of (score => array of thesaurus terms) associations, each score representing one word sense of the term. In the case of English, this method is implemented using [[http://wordnet.princeton.edu/|WordNet]]. So for thesaurus results to work for English, WordNet needs to be installed, and in either the config.php file or local_config.php you need to define the constant WORDNET_EXEC to the path to the WordNet executable on your file system. On a Linux or OSX system, this might be something like: /usr/local/bin/wn .

====Using Stop Words to improve Centroid Summarization====

While crawling, Yioop makes use of a summarizer to extract the important portions of the web page, both for indexing and for search result snippet purposes. There are two summarizers that come with Yioop: a Basic summarizer, which uses an ad hoc approach to finding the most important parts of the document, and a centroid summarizer, which tries to compute an "average sentence" for the document and uses this to pick representative sentences based on nearness to this average. The summarizer that is used can be set under the Crawl Time tab of [[Documentation#Page%20Indexing%20and%20Search%20Options|Page Options]]. This latter summarizer works better if certain common words (stop words) from the document's language are removed. When using the centroid summarizer, Yioop checks to see if tokenizer.php for the current locale contains a method stopwordsRemover($page). If it does, it calls it; this method takes a string of words and returns a string with all the stop words removed.
This method exists for en-US, but, if desired, could also be implemented for other locales to improve centroid summarization.

[[Documentation#contents|Return to table of contents]].

==Advanced Topics==
===Modifying Yioop Code===
One advantage of an open-source project is that you have complete access to the source code. Thus, Yioop can be modified to fit in with your existing project. You can also freely add new features onto Yioop. In this section, we look a little bit at some of the common ways you might try to modify Yioop, as well as ways to examine the output of a crawl in a more technical manner. If you decide to modify the source code, it is recommended that you look at the [[Documentation#Summary%20of%20Files%20and%20Folders|Summary of Files and Folders]] above again, as well as look at the [[http://www.seekquarry.com/yioop-docs/|online Yioop code documentation]].

====Handling new File Types====
One relatively easy enhancement to Yioop is to enhance the way it processes an existing file type or to get it to process new file types. Yioop was written from scratch without dependencies on existing projects. So the PHP processors for Microsoft file formats and for PDF are only approximate. These processors can be found in lib/processors. To write your own processor, you should extend either the TextProcessor or ImageProcessor class. You then need to write in your subclass a static method process($page, $url). Here $page is a string representation of a downloaded document of the file type you are going to handle and $url is the canonical url from which this page was downloaded. This method should return an array of the format:
 $summary['TITLE'] = a title for the document
 $summary['DESCRIPTION'] = a text summary extracted from the document
 $summary['LINKS'] = an array of links (canonical not relative) extracted
 from the document.
A good reference implementation of a TextProcessor subclass can be found in html_processor.php.
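As a rough sketch of the shape such a subclass takes, consider the following. The class name MyFormatProcessor and its extraction logic are invented for illustration; a real processor would parse its format properly, and Yioop's TextProcessor base class is assumed to be loaded:

```php
<?php
// Hypothetical skeleton of a processor for a new file type. The names
// and the crude extraction choices below are illustrative only; see
// html_processor.php for a real implementation.
class MyFormatProcessor extends TextProcessor
{
    static function process($page, $url)
    {
        $summary = NULL;
        if (is_string($page)) {
            $lines = explode("\n", $page);
            // Crude choices, just to show the expected summary fields
            $summary['TITLE'] = trim($lines[0]);
            $summary['DESCRIPTION'] = substr($page, 0, 400);
            $summary['LINKS'] = []; // canonical (not relative) urls found
        }
        return $summary;
    }
}
```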
If you are trying to support a new file type, then to get Yioop to use your processor you need to add lines to some global variables at the top of the file. You should add the extension of the file type you are going to process to the array $INDEXED_FILE_TYPES. You will also need to add an entry $PAGE_PROCESSORS["new_mime_type_handle"] = "NewProcessor". As an example, these are the relevant lines at the top of ppt_processor.php:
 $INDEXED_FILE_TYPES[] = "ppt";
 $PAGE_PROCESSORS["application/vnd.ms-powerpoint"] = "PptProcessor";
If your processor is cool, only relies on code you wrote, and you want to contribute it back to Yioop, please feel free to e-mail it to chris@pollett.org .

====Writing an Indexing Plugin====
An indexing plugin provides a way that an advanced end-user can extend the indexing capabilities of Yioop. Bundled with Yioop are three example indexing plugins. These are found in the lib/indexing_plugins folder. We will discuss the code for the recipe and word filter plugins here. The code for the address plugin, used to extract snail mail addresses from web pages, follows the same kind of structure. If you decide to write your own plugin or want to install a third-party plugin, you can put it in the folder: WORK_DIRECTORY/app/lib/indexing_plugins. The recipe indexing plugin can serve as a guide for writing your own plugin if you don't need your plugin to have a configure screen. The recipe plugin is used to detect food recipes which occur on pages during a crawl. It creates "micro-documents" associated with found recipes. These are stored in the index during the crawl under the meta-word "recipe:all". After the crawl is over, the recipe plugin's postProcessing method is called. It looks up all the documents associated with the word "recipe:all". It extracts ingredients from these and does clustering of recipes based on ingredient.
It finally injects new meta-words of the form "ingredient:some_food_ingredient", which can be used to retrieve recipes most closely associated with a given ingredient. As it is written, the recipe plugin assumes that all the recipes can be read into memory in one go, but one could easily imagine reading through the list of recipes in batches of the amount that could fit in memory at once.

The recipe plugin illustrates the kinds of things that can be written using indexing plugins. To make your own plugin, you would need to write a subclass of the class IndexingPlugin with a file name of the form mypluginname_plugin.php. Then you would need to put this file in the folder WORK_DIRECTORY/app/lib/indexing_plugins. RecipePlugin subclasses IndexingPlugin and implements the following four methods: pageProcessing($page, $url), postProcessing($index_name), getProcessors(), and getAdditionalMetaWords(), so that they don't have their default return-NULL behavior. We explain what each of these is for in a moment. During a web crawl, after a fetcher has downloaded a batch of web pages, it uses a page's mimetype to determine a page processor class to extract summary data from that page. The page processors that Yioop implements can be found in the folder lib/processors. They have file names of the form someprocessorname_processor.php. As a crawl proceeds, your plugin will typically be called to do further processing of a page in addition to some of these processors. The static method getProcessors() should return an array of the form array( "someprocessorname1", "someprocessorname2", ...), listing the processors that your plugin will do additional processing of documents for. A page processor has a method handle($page, $url) called by Yioop with a string $page of a downloaded document and a string $url of where it was downloaded from.
This method first calls the process($page, $url) method of the processor to do initial summary extraction and then calls the method pageProcessing($page, $url) of each indexing plugin associated with the given processor. A pageProcessing($page, $url) method is expected to return an array of subdoc arrays found on the given page. Each subdoc array should have a CrawlConstants::TITLE and a CrawlConstants::DESCRIPTION. The handle method of a processor will add to each subdoc the fields: CrawlConstants::LANG, CrawlConstants::LINKS, CrawlConstants::PAGE, CrawlConstants::SUBDOCTYPE. The SUBDOCTYPE is the name of the plugin. The resulting "micro-document" is inserted by Yioop into the index under the word nameofplugin:all . After the crawl is over, Yioop will call the postProcessing($index_name) method of each indexing plugin that was in use. Here $index_name is the timestamp of the crawl. Your plugin can do whatever post processing it wants in this method. For example, the recipe plugin does searches of the index and uses the results of these searches to inject new meta-words into the index. In order for Yioop to be aware of the meta-words you are adding, you need to implement the method getAdditionalMetaWords(). Also, the web snippet you might want in the search results for things like recipes might be longer or shorter than a typical result snippet. The getAdditionalMetaWords() method also tells Yioop this information. For example, for the recipe plugin, getAdditionalMetaWords() returns the associative array:
 array("recipe:" => HtmlProcessor::MAX_DESCRIPTION_LEN,
     "ingredient:" => HtmlProcessor::MAX_DESCRIPTION_LEN);
The WordFilterPlugin illustrates how one can write an indexing plugin with a configure screen.
It overrides the base class's pageSummaryProcessing(&$summary) and getProcessors() methods, as well as implements the methods saveConfiguration($configuration), loadConfiguration(), setConfiguration($configuration), configureHandler(&$data), and configureView(&$data). The purpose of getProcessors() was already mentioned in the recipe plugin description above. pageSummaryProcessing(&$summary) is called by a page processor after a page has been processed and a summary generated. WordFilterPlugin uses this callback to check if the title or the description in this summary has any of the words the filter is filtering for, and if so, takes the appropriate action. loadConfiguration, saveConfiguration($configuration), and setConfiguration are three methods to handle persistence for any plugin data that the user can change. The first two operate on the name server; the last might operate on a queue_server or a fetcher. loadConfiguration is called by configureHandler(&$data) to read in any current configuration, unserialize it, and modify it according to any data sent by the user. saveConfiguration($configuration) would then be called by configureHandler(&$data) to serialize and write any $configuration data that needs to be stored by the plugin. For WordFilterPlugin, a list of filter terms and actions is what is saved by saveConfiguration($configuration) and loaded by loadConfiguration. When a crawl is started or when a fetcher contacts the name server, plugin configuration data is sent by the name server. The method setConfiguration($configuration) is used to initialize the local copy of a fetcher's or queue_server's process with the configuration settings from the name server. For WordFilterPlugin, the filter terms and actions are stored in a field variable by this function.

As has already been hinted at by the configuration discussion above, configureHandler(&$data) plays the role of a controller for an index plugin.
It is in fact called by the AdminController activity pageOptions if the configure link for a plugin has been clicked. In addition to managing the load and save configuration process, it also sets up any data needed by configureView(&$data). For WordFilterPlugin, this involves setting a variable $data["filter_words"] so that configureView(&$data) has access to a list of filter words and actions to draw. Finally, the last method of the WordFilterPlugin we describe, configureView(&$data), outputs using $data the HTML that will be seen in the configure screen. This HTML will appear in a div tag on the final page. It is initially styled so that it is not displayed. Clicking on the configure link will cause the div tag data to be displayed in a light box in the center of the screen. For WordFilterPlugin, this method draws a title and a textarea form with the currently filtered terms in it. It makes use of Yioop's tl() functions so that the text of the title can be localized to different languages. This form has hidden fields c=admin, a=pageOptions, option-type=crawl_time, so that the AdminController will know to call pageOptions, and pageOptions will know in turn to let the plugin's configureHandler method get a chance to handle this data.

[[Documentation#contents|Return to table of contents]].

===Yioop Command-line Tools===
In addition to [[Documentation#token_tool|token_tool.php]], which we describe in the section on localization, and to [[Documentation#configs|export_public_help_db.php]], which we describe in the section on the Yioop folder structure, Yioop comes with several useful command-line tools and utilities.
We next describe these in roughly their order of likely utility:

* [[Documentation#configure_tool|bin/configure_tool.php]]: Used to configure Yioop from the command-line
* [[Documentation#arc_tool|bin/arc_tool.php]]: Used to examine the contents of WebArchiveBundles and IndexArchiveBundles
* [[Documentation#query_tool|bin/query_tool.php]]: Used to query an index from the command-line
* [[Documentation#code_tool|bin/code_tool.php]]: Used to help code Yioop and to help make clean patches for Yioop.
* [[Documentation#classifier_tool|bin/classifier_tool.php]]: Used to make a Yioop classifier from the command line rather than using the GUI interface.

{{id='configure_tool'
====Configuring Yioop from the Command-line====
}}

In a multiple queue server and fetcher setting, one might have web access only to the name server machine -- all the other machines might be on virtual private servers to which one has only command-line access. Hence, it is useful to be able to set up a work directory and configure Yioop through the command-line. To do this one can use the script configs/configure_tool.php. One can run it from the command-line within the configs folder, with a line like:
 php configure_tool.php
When launched, this program will display a menu like:
 Yioop CONFIGURATION TOOL
 +++++++++++++++++++++++++

 Checking Yioop configuration...
 ===============================
 Check Passed.
 Using configs/local_config.php so changing work directory above may not work.
 ===============================

 Available Options:
 ==================
 (1) Create/Set Work Directory
 (2) Change root password
 (3) Set Default Locale
 (4) Debug Display Set-up
 (5) Search Access Set-up
 (6) Search Page Elements and Links
 (7) Name Server Set-up
 (8) Crawl Robot Set-up
 (9) Exit program

 Please choose an option:
Except for the Change root password option, these correspond to the different fieldsets on the Configure activity.
The command-line forms one gets from selecting one of these choices let one set the same values as were described earlier in the [[Documentation#Installation%20and%20Configuration|Installation]] section. The Change root password option lets one set the account password for root, i.e., the main admin user. On a non-nameserver machine, it is probably simpler to go with a sqlite database, rather than hit on a global mysql database from each machine. Such a barebones local database set-up would typically only have one user, root.

Another thing to consider when configuring a collection of Yioop machines in such a setting is that, by default, under Search Access Set-up, subsearch is unchecked. This means the RSS feeds won't be downloaded hourly on such machines. If one checks this, they will be.

{{id='arc_tool'
====Examining the contents of WebArchiveBundles and IndexArchiveBundles====
}}

The command-line script bin/arc_tool.php can be used to examine and manipulate the contents of a WebArchiveBundle or an IndexArchiveBundle. Below is a summary of the different command-line uses of arc_tool.php:

;'''php arc_tool.php count bundle_name'''
 or
'''php arc_tool.php count bundle_name save''' : returns the counts of docs and links for each shard in the bundle as well as an overall total. The second command saves the just computed count into the index description (can be used to fix the index count if it gets screwed up).
; '''php arc_tool.php dict bundle_name word''' : returns index dictionary records for word stored in the index archive bundle.
; '''php arc_tool.php info bundle_name''' : returns info about documents stored in the archive.
; '''php arc_tool.php inject timestamp file''' : injects the urls in file as a schedule into the crawl of the given timestamp. This can be used to make a closed index unclosed and to allow for continued crawling.
; '''php arc_tool.php list''' : returns a list of all the archives in the Yioop! crawl directory, including non-Yioop! archives in the /archives sub-folder.
; '''php arc_tool.php mergetiers bundle_name max_tier''' : merges tiers of the word dictionary into one tier up to max_tier
; '''php arc_tool.php posting bundle_name generation offset'''
 or
'''php arc_tool.php posting bundle_name generation offset num''' : returns info about the posting (num many postings) in bundle_name at the given generation and offset
; '''php arc_tool.php rebuild bundle_name''' : re-extracts words from summaries files in bundle_name into index shards, then builds a new dictionary
; '''php arc_tool.php reindex bundle_name''' : reindexes the word dictionary in bundle_name using existing index shards
; '''php arc_tool.php shard bundle_name generation''' : prints information about the number of words and frequencies of words within the generation'th index shard in the bundle
; '''php arc_tool.php show bundle_name start num''' : outputs items start through num from bundle_name or the name of a non-Yioop archive crawl folder

The bundle name can be a full path name, a relative path from the current directory, or it can be just the bundle directory's file name, in which case WORK_DIRECTORY/cache will be assumed to be the bundle's location. The following are some examples of using arc tool. Recall a backslash in a Unix/OSX terminal is the line continuation character, so we can imagine lines where it is indicated below as being all on one line. They are not all from the same session:
 |chris-polletts-macbook-pro:bin:108>php arc_tool.php list
 Found Yioop Archives:
 =====================
 0-Archive1334468745
 0-Archive1336527560
 IndexData1334468745
 IndexData1336527560

 Found Non-Yioop Archives:
 =========================
 english-wikipedia2012
 chris-polletts-macbook-pro:bin:109>

 ...

 |chris-polletts-macbook-pro:bin:158>php arc_tool.php info \
 /Applications/XAMPP/xamppfiles/htdocs/crawls/cache/IndexData1293767731

 Bundle Name: IndexData1293767731
 Bundle Type: IndexArchiveBundle
 Description: test
 Number of generations: 1
 Number of stored links and documents: 267260
 Number of stored documents: 16491
 Crawl order was: Page Importance
 Seed sites:
    http://www.ucanbuyart.com/
    http://www.ucanbuyart.com/fine_art_galleries.html
    http://www.ucanbuyart.com/indexucba.html
 Sites allowed to crawl:
    domain:ucanbuyart.com
    domain:ucanbuyart.net
 Sites not allowed to be crawled:
    domain:arxiv.org
    domain:ask.com
 Meta Words:
    http://www.ucanbuyart.com/(.+)/(.+)/(.+)/(.+)/

 |chris-polletts-macbook-pro:bin:159>
 |chris-polletts-macbook-pro:bin:202>php arc_tool.php show \
 /Applications/XAMPP/xamppfiles/htdocs/crawls/cache/Archive1293767731 0 3

 BEGIN ITEM, LENGTH:21098
 [URL]
 http://www.ucanbuyart.com/robots.txt
 [HTTP RESPONSE CODE]
 404
 [MIMETYPE]
 text/html
 [CHARACTER ENCODING]
 ASCII
 [PAGE DATA]
 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

 <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">

 <head>
 <base href="http://www.ucanbuyart.com/" />
 ...

 |chris-polletts-macbook-pro:bin:117>php arc_tool.php reindex IndexData1317414152

 Shard 0
 [Sat, 01 Oct 2011 11:05:17 -0700] Adding shard data to dictionary files...
 [Sat, 01 Oct 2011 11:05:28 -0700] Merging tiers of dictionary

 Final Merge Tiers

 Reindex complete!!

The mergetiers command is like a partial reindex. It assumes all the shard words have been added to the dictionary, but that the dictionary still has more than one tier (tiers are the result of incremental log-merges which are made during the crawling process). The mergetiers command merges these tiers into one large tier which is then usable by Yioop for query processing.

{{id='query_tool'
====Querying an Index from the command-line====
}}

The command-line script bin/query_tool.php can be used to query indices in the Yioop WORK_DIRECTORY/cache. This tool can be used on an index regardless of whether or not Apache is running. It can be used for long-running queries that might time out when run within a browser, to put their results into memcache or filecache. The command-line arguments for the query tool are:
 php query_tool.php query num_results start_num lang_tag
The default num_results is 10, start_num is 0, and lang_tag is en-US. The following shows how one could do a query on "Chris Pollett":

 |chris-polletts-macbook-pro:bin:141>php query_tool.php "Chris Pollett"

 ============
 TITLE: ECCC - Pointers to
 URL: http://eccc.hpi-web.de/static/pointers/personal_www_home_pages_of_complexity_theorists/
 IPs: 141.89.225.3
 DESCRIPTION: Homepage of the Electronic Colloquium on Computational Complexity located
 at the Hasso Plattner Institute of Potsdam, Germany Personal WWW pages of
 complexity people 2011 2010 2009 2011...1994 POINTE
 Rank: 3.9551158411
 Relevance: 0.492443777769
 Proximity: 1
 Score: 4.14
 ============

 ============
 TITLE: ECCC - Pointers to
 URL: http://www.eccc.uni-trier.de/static/pointers/personal_www_home_pages_of_complexity_theorists/
 IPs: 141.89.225.3
 DESCRIPTION: Homepage of the Electronic Colloquium on Computational Complexity located
 at the Hasso Plattner Institute of Potsdam, Germany Personal WWW pages of
 complexity people 2011 2010 2009 2011...1994 POINTE
 Rank: 3.886318974
 Relevance: 0.397622570289
 Proximity: 1
 Score: 4.03
 ============

 .....

The index the results are returned from is the default index; however, all of the Yioop meta words should work, so you can do queries like "my_query i:timestamp_of_index_want".
Query results depend on the kind of language stemmer/char-gramming being used, so French results might be better if one specifies fr-FR than if one relies on the default en-US.

{{id='code_tool'
====A Tool for Coding and Making Patches for Yioop====
}}

'''bin/code_tool.php''' can perform several useful tasks to help developers program for the Yioop environment. Below is a brief summary of its functionality:

;'''php code_tool.php clean path''' : Replaces all tabs with four spaces and trims all whitespace off the ends of lines in the folder or file path.
;'''php code_tool.php copyright path''' : Adjusts all lines in the files in the folder at path (or if path is a file, just that file) of the form 2009 - \d\d\d\d to the form 2009 - this_year, where this_year is the current year.
;'''php code_tool.php longlines path''' : Prints out all lines in files in the folder or file path which are longer than 80 characters.
;'''php code_tool.php replace path pattern replace_string'''
 or
'''php code_tool.php replace path pattern replace_string effect''' : Prints all lines matching the regular expression pattern followed by the result of replacing pattern with replace_string in the folder or file path. Does not change files.
;'''php code_tool.php replace path pattern replace_string interactive''' : Prints each line matching the regular expression pattern followed by the result of replacing pattern with replace_string in the folder or file path. Then it asks if you want to update the line. Lines you choose for updating will be modified in the files.
;'''php code_tool.php replace path pattern replace_string change''' : Each line matching the regular expression pattern is updated by replacing pattern with replace_string in the folder or file path. This format does not echo anything; it does a global replace without interaction.
;'''php code_tool.php search path pattern''' : Prints all lines matching the regular expression pattern in the folder or file path.
+ +{{id='classifier_tool' +====A Command-line Tool for making Yioop Classifiers==== +}} + +'''bin/classifier_tool.php''' is used to automate the building and testing of classifiers, providing an alternative to the web interface when a labeled training set is available. + +'''classifier_tool.php''' takes an activity to perform, the name of a dataset to use, and a label for the constructed classifier. The activity is the name of one of the 'run*' functions implemented by this class, without the common 'run' prefix (e.g., 'TrainAndTest'). The dataset is specified as the common prefix of two indexes that have the suffixes "Pos" and "Neg", respectively. So if the prefix were "DATASET", then this tool would look for the two existing indexes "DATASET Pos" and "DATASET Neg" from which to draw positive and negative examples. Each document in these indexes should be a positive or negative example of the target class, according to whether it's in the "Pos" or "Neg" index. Finally, the label is just the label to be used for the constructed classifier. + +Beyond these options (set with the -a, -d, and -l flags), a number of other options may be set to alter parameters used by an activity or a classifier. These options are set using the -S, -I, -F, and -B flags, which correspond to string, integer, float, and boolean parameters respectively. These flags may be used repeatedly, and each expects an argument of the form NAME=VALUE, where NAME is the name of a parameter, and VALUE is a value parsed according to the flag. The NAME should match one of the keys of the options member of this class, where a period ('.') may be used to specify nesting. 
For example: + -I debug=1 # set the debug level to 1 + -B cls.use_nb=0 # tell the classifier to use Naive Bayes +To build and evaluate a classifier for the label 'spam', trained using the two indexes "DATASET Neg" and "DATASET Pos", and a maximum of the top 25 most informative features: + php bin/classifier_tool.php -a TrainAndTest -d 'DATASET' -l 'spam' + -I cls.chi2.max=25 + +==References== + +;{{id='APC2003' '''[APC2003]'''}} : Serge Abiteboul and Mihai Preda and Gregory Cobena. [[http://leo.saclay.inria.fr/publifiles/gemo/GemoReport-290.pdf|Adaptive on-line page importance computation]]. In: Proceedings of the 12th international conference on World Wide Web. pp.280-290. 2003. +;{{id='B1970' '''[B1970]'''}} : Bloom, Burton H. [[http://www.lsi.upc.edu/~diaz/p422-bloom.pdf|Space/time trade-offs in hash coding with allowable errors]]. Communications of the ACM Volume 13 Issue 7. pp. 422–426. 1970. +;{{id='BSV2004' '''[BSV2004]'''}} : Paolo Boldi and Massimo Santini and Sebastiano Vigna. [[http://vigna.di.unimi.it/ftp/papers/ParadoxicalPageRank.pdf|Do Your Worst to Make the Best: Paradoxical Effects in PageRank Incremental Computations]]. Algorithms and Models for the Web-Graph. pp. 168–180. 2004. +;{{id='BP1998' '''[BP1998]'''}} : Brin, S. and Page, L. [[http://infolab.stanford.edu/~backrub/google.html|The Anatomy of a Large-Scale Hypertextual Web Search Engine]]. In: Seventh International World-Wide Web Conference (WWW 1998). April 14-18, 1998. Brisbane, Australia. 1998. +;{{id='BCC2010' '''[BCC2010]'''}} : S. Büttcher, C. L. A. Clarke, and G. V. Cormack. [[http://mitpress.mit.edu/books/information-retrieval|Information Retrieval: Implementing and Evaluating Search Engines]]. MIT Press. 2010. +;{{id='DG2004' '''[DG2004]'''}} : Jeffrey Dean and Sanjay Ghemawat. [[http://research.google.com/archive/mapreduce-osdi04.pdf|MapReduce: Simplified Data Processing on Large Clusters]]. OSDI'04: Sixth Symposium on Operating System Design and Implementation. 
2004 +;{{id='GGL2003' '''[GGL2003]'''}} : Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. [[http://research.google.com/archive/gfs-sosp2003.pdf|The Google File System]]. 19th ACM Symposium on Operating Systems Principles. 2003. +;{{id='GLM2007' '''[GLM2007]'''}} : A. Genkin, D. Lewis, and D. Madigan. [[http://www.stat.columbia.edu/~madigan/PAPERS/techno.pdf|Large-scale Bayesian logistic regression for text categorization]]. Technometrics. Volume 49. Issue 3. pp. 291--304, 2007. +;{{id='H2002' '''[H2002]'''}} : T. Haveliwala. [[http://infolab.stanford.edu/~taherh/papers/topic-sensitive-pagerank.pdf|Topic-Sensitive PageRank]]. Proceedings of the Eleventh International World Wide Web Conference (Honolulu, Hawaii). 2002. +;{{id='KSV2010' '''[KSV2010]'''}} : Howard Karloff, Siddharth Suri, and Sergei Vassilvitskii. [[http://theory.stanford.edu/~sergei/papers/soda10-mrc.pdf|A Model of Computation for MapReduce]]. Proceedings of the ACM Symposium on Discrete Algorithms. 2010. pp. 938-948. +;{{id='KC2004' '''[KC2004]'''}} : Rohit Khare and Doug Cutting. [[http://www.master.netseven.it/files/262-Nutch.pdf|Nutch: A flexible and scalable open-source web search engine]]. CommerceNet Labs Technical Report 04. 2004. +;{{id='LDH2010' '''[LDH2010]'''}} : Jimmy Lin and Chris Dyer. [[http://www.umiacs.umd.edu/~jimmylin/MapReduce-book-final.pdf|Data-Intensive Text Processing with MapReduce]]. Synthesis Lectures on Human Language Technologies. Morgan and Claypool Publishers. 2010. +;{{id='LM2006' '''[LM2006]'''}} : Amy N. Langville and Carl D. Meyer. [[http://press.princeton.edu/titles/8216.html|Google's PageRank and Beyond]]. Princeton University Press. 2006. +;{{id='MRS2008' '''[MRS2008]'''}} : C. D. Manning, P. Raghavan and H. Schütze. [[http://nlp.stanford.edu/IR-book/information-retrieval-book.html|Introduction to Information Retrieval]]. Cambridge University Press. 2008. +;{{id='MKSR2004' '''[MKSR2004]'''}} : G. Mohr, M. Kimpton, M. Stack, and I. Ranitovic. 
[[http://iwaw.europarchive.org/04/Mohr.pdf|Introduction to Heritrix, an archival quality web crawler]]. 4th International Web Archiving Workshop. 2004. +;{{id='PTSHVC2011' '''[PTSHVC2011]'''}} : Manish Patil, Sharma V. Thankachan, Rahul Shah, Wing-Kai Hon, Jeffrey Scott Vitter, Sabrina Chandrasekaran. [[http://www.ittc.ku.edu/~jsv/Papers/PTS11.InvertedIndexSIGIR.pdf|Inverted indexes for phrases and strings]]. Proceedings of the 34th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 555-564. 2011. +;{{id='P1997a' '''[P1997a]'''}} : J. Peek. [[http://www.usenix.org/publications/library/proceedings/ana97/summaries/monier.html|Summary of the talk: The AltaVista Web Search Engine]] by Louis Monier. USENIX Annual Technical Conference. Anaheim, California. ;login: Volume 22. Number 2. April 1997. +;{{id='P1997b' '''[P1997b]'''}} : J. Peek. [[http://www.usenix.org/publications/library/proceedings/ana97/summaries/brewer.html|Summary of the talk: The Inktomi Search Engine by Louis Monier]]. USENIX Annual Technical Conference. Anaheim, California. ;login: Volume 22. Number 2. April 1997. +;{{id='P1994' '''[P1994]'''}} : B. Pinkerton. [[http://web.archive.org/web/20010904075500/http://archive.ncsa.uiuc.edu/SDG/IT94/Proceedings/Searching/pinkerton/WebCrawler.html|Finding what people want: Experiences with the WebCrawler]]. In Proceedings of the First World Wide Web Conference, Geneva, Switzerland. 1994. +;{{id='P1980' '''[P1980]'''}} : M.F. Porter. [[http://tartarus.org/~martin/PorterStemmer/def.txt|An algorithm for suffix stripping]]. Program. Volume 14 Issue 3. 1980. pp. 130-137. On the same website, there are [[http://snowball.tartarus.org/|stemmers for many other languages]]. +;{{id='PDGQ2006' '''[PDGQ2006]'''}} : Rob Pike, Sean Dorward, Robert Griesemer, Sean Quinlan. [[http://research.google.com/archive/sawzall-sciprog.pdf|Interpreting the Data: Parallel Analysis with Sawzall]]. Scientific Programming Journal. 
Special Issue on Grids and Worldwide Computing Programming Models and Infrastructure. Volume 13. Issue 4. 2006. pp. 227-298. +;{{id='W2009' '''[W2009]'''}} : Tom White. [[http://www.amazon.com/gp/product/1449389732/ref=pd_lpo_k2_dp_sr_1?pf_rd_p=486539851&pf_rd_s=lpo-top-stripe-1&pf_rd_t=201&pf_rd_i=0596521979&pf_rd_m=ATVPDKIKX0DER&pf_rd_r=0N5VCGFDA7V7MJXH69G6|Hadoop: The Definitive Guide]]. O'Reilly. 2009. +;{{id='ZCTSR2004' '''[ZCTSR2004]'''}} : Hugo Zaragoza, Nick Craswell, Michael Taylor, Suchi Saria, and Stephen Robertson. [[http://trec.nist.gov/pubs/trec13/papers/microsoft-cambridge.web.hard.pdf|Microsoft Cambridge at TREC-13: Web and HARD tracks]]. In Proceedings of the 13th Annual Text Retrieval Conference. 2004. + +[[Documentation#contents|Return to table of contents]]. +EOD; +$public_pages["en-US"]["Download_Sent"] = <<< 'EOD' +page_type=standard + +page_alias= + +page_border=solid-border + +toc=true + +title=Download Sent Page + +author=Chris Pollett + +robots=NOINDEX, NOFOLLOW + +description= + +page_header=main_header + +page_footer=main_footer + +END_HEAD_VARS{{center| +==Download Email Sent!== + +You should receive your email to download Yioop software shortly. + +Before checking your email, please take this opportunity to use the button below to make a $10 contribution to support the Yioop project and its continued development at Seekquarry.com. 
+}} + +{{center| +[[https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&hosted_button_id=3B94XKR9GTPNG|((resource:Downloads:btn_donateCC_LG.gif|PayPal - The safer, easier way to pay online!))]] + +}} +EOD; +$public_pages["en-US"]["Downloads"] = <<< 'EOD' +page_type=standard + +page_alias= + +page_border=solid-border + +toc= + +title=Open Source Search Engine Software - Seekquarry :: Downloads + +author=Chris Pollett + +robots= + +description= + +page_header=main_header + +page_footer=main_footer + +END_HEAD_VARS=Downloads= +==Yioop Releases== + +The two most recent versions of Yioop are: + +*[[https://seekquarry.com/?c=main&a=download&version=2.00|Version 2.00]] +*[[https://seekquarry.com/?c=main&a=download&version=1.00|Version 1.00]] + +==Support Services / Support Yioop== + +Too busy to set up or upgrade a Yioop search engine yourself? Or are you interested in paying for a new feature? Please write chris@pollett.org for a quote. We charge a flat rate for a single machine install; however, we do need access to the machine where you'd like stuff installed. New feature prices depend on the scope of the feature and whether you allow the feature to be incorporated back into and licensed under the current Yioop GPLv3 license. + +Seekquarry, LLC is a company owned by Chris Pollett, the principal developer of Yioop. If you like Yioop and would like to show support for this project, please consider making a contribution. + +{{center| +[[https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&hosted_button_id=3B94XKR9GTPNG|((resource:btn_donateCC_LG.gif|PayPal - The safer, easier way to pay online!))]] + +}} + +==Installation== + +The [[Install|Install Guides]] explain how to get Yioop to work in some common settings. The documentation page has information about the [[Documentation#Requirements|requirements]] of and [[Documentation#Installation%20and%20Configuration|installation]] procedure for Yioop. + +==Upgrading== + +Before upgrading, make sure to back up your data. 
Then download the latest version of Yioop and unzip it to the location you would like. Set the Search Engine Work Directory by the same method and to the same value as your old Yioop installation. See the Installation section above for links to instructions on this, if you have forgotten how you did this. Knowing the old Work Directory location should allow Yioop to complete the upgrade process. + +==Git Repository / Contributing Code== + +The Yioop git repository allows anonymous read-only access. If you would like to contribute to Yioop, just do a clone of the most recent code, make your changes, do a pull, and make a patch. For example, to clone the repository, assuming you have the git version control software installed, just type: + + +'''git clone https://seekquarry.com/git/yioop.git''' + +The [[Coding|Yioop Coding Guidelines]] explain the form your code should be in when making a patch as well as how to create patches. You can create/update an issue in the [[http://www.seekquarry.com/mantis/|Yioop issue tracker]] describing what your patch solves and upload your patch. To contribute localizations, you can use the GUI interface in your own copy of Yioop to enter your localizations. Next, in the locale folder of your Yioop work directory, locate the subfolder for the locale tag of the language you added translations for. Within this folder is a configure.ini file; make an issue in the issue tracker and upload this file there. + +EOD; +$public_pages["en-US"]["Install"] = <<< 'EOD' +page_type=standard + +page_alias= + +page_border=solid-border + +toc=true + +title=Yioop Install Guides + +author=Chris Pollett + +robots= + +description= + +page_header=main_header + +page_footer=main_footer + +END_HEAD_VARS=Installation Guides= + +==Demo Install Video== +A half-hour demo of installing Yioop is available at yioop.com: [[http://www.yioop.com/?c=group&a=wiki&group_id=20&arg=media&page_id=26&n=01%20Installing%20Yioop%20Demo.mp4|Yioop Install Demo]]. 
On the Yioop.com website the [[http://www.yioop.com/?c=group&group_id=20&arg=read&a=wiki&page_name=Main|Yioop Tutorials Wiki]] has video tutorials for several of Yioop's features. This wiki also illustrates the ability of Yioop software to do video streaming. + +==XAMPP on Windows== + +#Download [[http://www.apachefriends.org/en/xampp-windows.html|Xampp]] (Note: Yioop! 0.9 or higher works on latest version; Yioop! 0.88 or lower works up till Xampp 1.7.7). +#Install Xampp. +#In Xampp 1.8.1 and higher, PHP curl seems to be enabled by default. For earlier versions, edit the file + C:\xampp\php\php.ini +in Notepad. Search on curl. Change the line: + ;extension=php_curl.dll +to + extension=php_curl.dll +#Open Control Panel. Go to System => Advanced system settings => Advanced. Click on Environment Variables. Look under System Variables and select Path. Click Edit. Tack onto the end of Variable Values: + ;C:\xampp\php; +Click OK a bunch of times to get rid of the windows. Close the Control Panel window. Reopen it and go to the same place to make sure the path variable really was changed. +#Download [[Downloads|Yioop]] (You should choose a version ≥ 0.94 or the latest version). Unzip it into + C:\xampp\htdocs +Rename the downloaded folder yioop (so you now have a folder C:\xampp\htdocs\yioop). Point your browser at: + http://localhost/yioop/ +#Under "Search Engine Work Directory", enter the path <br><tt>C:/xampp/htdocs/yioop_data</tt><br>. It will ask you to log into Yioop. Log in with username root and empty password. +#In Yioop's Configure screen continue filling out your settings: + Default Language: English + + Crawl Robot Name: TestBot + Robot Description: This bot is for test purposes. It respects robots.txt + If you are having problems with it please feel free to ban it. +#Crawl robot name is what will appear together with a url to a bot.php page in the web server log files of sites you crawl. The bot.php page will display what you write in robot description. 
This should give contact information in case your robot misbehaves. Obviously, you should customize the above to what you want to say. +#Go to Manage Machines and add a single machine under Add Machine: + Machine Name: Local + Machine Url: http://localhost/yioop/ + Is Mirror: (uncheck) + Has Queue Server: (check) + Number of Fetchers 1 + Submit +You might need to restart your computer to get the next steps to work. +#In Manage Machines, click ON on the queue server and on your fetcher. For your queue server and your fetcher, click on the log file link and make sure that after at most two minutes you are seeing new log entries. +#Now go to Manage Crawls. Click on Options. Set the options you would like for your crawl. Click Save. +#Type the name of the crawl and start crawl. Let it crawl for a while, until you see the Total URLs Seen > 1. +#Click stop crawl and wait for the crawl to appear in the previous crawls list. Set it as the default crawl. Then you can search using this index. + +==Wamp== + +#Download [[http://www.wampserver.com/en/|WampServer]] (Note: Yioop! 0.9 or higher works with PHP 5.4) +#Download [[Downloads|Yioop]] (you should choose some version ≥ 0.94 or the latest). Unzip it into + C:\wamp\www +#Rename the downloaded folder yioop (so you should now have a folder C:\wamp\www\yioop). +#Edit php.ini to enable multicurl. To do this use the Wamp dock tool and navigate to wamp => php => extension. Turn on curl. Next navigate to wamp => php => php.ini . +#Wamp has two php.ini files. The one we just edited is in + C:\wamp\bin\apache\Apache2.2.21\bin +You need to also edit the php.ini in + C:\wamp\bin\php\php5.4.3 +Depending on your version of Wamp the PHP version number may be different. Open this php.ini in Notepad, search on curl, and uncomment the line. Note that you might want to choose an earlier or later version of Wamp than the particular one above, because out of the box its php_curl.dll did not work. 
I had to go to [[http://www.anindya.com/php-5-4-3-and-php-5-3-13-x64-64-bit-for-windows/|Anindya.com]], download php_curl-5.4.3-VC9-x64.zip under fixed curl extensions, then move it to + C:\wamp\bin\php\php5.4.3\ext +to get it to work. +#Next go to control panel => system => advanced system settings => advanced => environment variables => system variables => path. Click edit and add to the path variable: + ;C:\wamp\bin\php\php5.4.3; +Exit control panel, then re-enter to double-check that the path really was added to the end. +#In Yioop's Configure screen continue filling out your settings: + Default Language: English + + Crawl Robot Name: TestBot + Robot Description: This bot is for test purposes. It respects robots.txt + If you are having problems with it please feel free to ban it. +#Crawl robot name is what will appear together with a url to a bot.php page in web server log files of sites you crawl. The bot.php page will display what you write in robot description. This should give contact information in case your robot misbehaves. Obviously, you should customize the above to what you want to say. +#Go to Manage Machines. Add a single machine under Add Machine using the settings: + Machine Name: Local + Machine Url: http://localhost/yioop/ + Is Mirror: (uncheck) + Has Queue Server: (check) + Number of Fetchers 1 + Submit +#Under Machine Information turn the Queue Server and Fetcher On. +#Go to Manage Crawls. Click on the options to set up where you want to crawl. Type in a name for the crawl and click start crawl. +#Let it crawl for a while, until you see the Total URLs Seen > 1. Then click Stop Crawl and wait for the crawl to appear in the previous crawls list. Set it as the default crawl. You should be able to search using this index. 
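+ +The Unix-flavored guides that follow each create a world-writable crawl work directory by hand. The pattern can be sketched as below; /tmp/yioop_data is just an example path, and the guides use locations such as /Library/WebServer/Documents/yioop_data or /var/www/yioop_data instead:

```shell
# Create a crawl work directory the web server can write to, and confirm
# its permissions. /tmp/yioop_data is an example path only; substitute the
# location appropriate for your platform from the steps that follow.
mkdir -p /tmp/yioop_data
chmod 777 /tmp/yioop_data
stat -c '%a' /tmp/yioop_data   # prints 777
```

Mode 777 is the brute-force choice these guides use so the web server user can write crawl data; on a shared machine, a narrower mode owned by the web server user would be safer.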
+ +==Mac OSX / Mac OSX Server== + +The instructions given here are for OSX Mountain Lion (10.8) and OSX Mavericks (10.9). Apple changes the locations where files can be found slightly between versions, so you might have to do a little exploration to find things for earlier OSX versions. + +#Turn on Apache with PHP enabled. +#Not OSX Server: Traditionally, on (pre-Mountain Lion) OSX, one could go to System Preferences => Sharing, and turn on Web Sharing to get the web server running. This option was removed in Mountain Lion; however, from the command line (Terminal), one can type: + sudo apachectl start +to start the web server, and similarly, + sudo apachectl stop +to stop it. Alternatively, to make it so the web server starts each time the machine is turned on, one can type: + sudo defaults write /System/Library/LaunchDaemons/org.apache.httpd Disabled -bool false +#By default, document root is /Library/WebServer/Documents. The configuration files for Apache in this setting are located in /etc/apache2. If you want to tweak document root or other Apache settings, look in the folder /etc/apache2/other and edit appropriate files such as httpd-vhosts.conf or httpd-ssl.conf . Before turning on Web Sharing / the web server, you need to edit the file /etc/apache2/httpd.conf. Replace + #LoadModule php5_module libexec/apache2/libphp5.so +with + LoadModule php5_module libexec/apache2/libphp5.so +OSX Server: Pre-Mountain Lion, OSX Server used /etc/apache2 to store its configuration files. Since Mountain Lion these files are in /Library/Server/Web/Config/apache2 . Within this folder, the sites folder holds Apache directives for specific virtual hosts. OSX Server comes with Server.app which will actively fight any direct tweaking to configuration files. From Server.app, to get the web server running, click on Websites. Make sure "Enable PHP web applications" is checked and Websites is On. 
The default web site is + /Library/Server/Web/Data/Sites/Default , +you probably want to click on + under websites and specify document root to be as you like. +#For the remainder of this guide, we assume document root for the web server is: /Library/WebServer/Documents. [[Downloads|Download Yioop]], unpack it into /Library/WebServer/Documents, and rename the Yioop folder to yioop. +#Make a folder for your crawl data: + sudo mkdir /Library/WebServer/Documents/yioop_data + sudo chmod 777 /Library/WebServer/Documents/yioop_data +#You probably want to make sure Spotlight (Mac's built-in file and folder indexer) doesn't index this folder -- especially during a crawl -- or your system might really slow down. To prevent this, open System Preferences, choose Spotlight, select the Privacy tab, and add the above folder to the list of folders Spotlight shouldn't index. If you are storing crawls on an external drive, you might want to make sure that drive gets automounted without a login. This is useful in the event of a power failure that exceeds your backup power supply time. To do this you can write the preference: + sudo defaults write /Library/Preferences/SystemConfiguration/autodiskmount \ + AutomountDisksWithoutUserLogin -bool true +#This will mean the hard drive becomes available when the power comes back. To make your Mac restart when the power is back, under System Preferences => Energy Saver there is a check box next to "Start up automatically after a power failure". Check it. +#In a browser, go to the page http://localhost/yioop/ . You should see a configure screen where you can enter /Library/WebServer/Documents/yioop_data for the Work Directory. It will ask you to re-login. Use the login: root and no password. Now go to Yioop => Configure and input the following settings: + Search Engine Work Directory: /Library/WebServer/Documents/yioop_data + Default Language: English + Crawl Robot Name: TestBot + Robot Description: This bot is for test purposes. 
It respects robots.txt + If you are having problems with it please feel free to ban it. +#Crawl robot name is what will appear together with a url to a bot.php page in web server log files of sites you crawl. The bot.php page will display what you write in robot description. This should give contact information in case your robot misbehaves. Obviously, you should customize the above to what you want to say. +#Go to Manage Machines. Add a single machine under Add Machine using the settings: + Machine Name: Local + Machine Url: http://localhost/yioop/ + Is Mirror: (uncheck) + Has Queue Server: (check) + Number of Fetchers 1 + Submit +#Under Machine Information turn the Queue Server and Fetcher On. +#Go to Manage Crawls. Click on the options to set up where you want to crawl. Type in a name for the crawl and click start crawl. +#Let it crawl for a while, until you see the Total URLs Seen > 1. +#Then click Stop Crawl and wait for the crawl to appear in the previous crawls list. Set it as the default crawl. You should be able to search using this index. + +==Ubuntu Linux / Debian (with Suhosin Hardening Patch)== + +The instructions described here have been tested on Ubuntu 12.04 LTS and Ubuntu 14.04 LTS. + +#Get PHP and Apache set up by running the following commands as needed (you might have already done some): + sudo apt-get install curl + sudo apt-get install apache2 + sudo apt-get install php5 + sudo apt-get install php5-cli + sudo apt-get install php5-sqlite + sudo apt-get install php5-curl + sudo apt-get install php5-gd +#After this sequence, the files /etc/apache2/mods-enabled/php5.conf and /etc/apache2/mods-enabled/php5.load should exist and link to the corresponding files in /etc/apache2/mods-available. The configuration files for PHP are /etc/php5/apache2/php.ini (for the Apache module) and /etc/php5/cli/php.ini (for the command-line interpreter). You want to make changes to both configurations. 
To get a feel for the changes you can make, in a text editor (ed, vi, nano, gedit, etc.) modify the line: + post_max_size = 8M +to + post_max_size = 32M +This change is not strictly necessary, but will improve performance. +#Debian's (not Ubuntu's) PHP version has the Suhosin hardening patch enabled by default. On Yioop before Version 0.941, this caused problems because Yioop made mt_srand calls which were ignored. To fix this you should add to the end of both php.ini files listed above (alternatively, you could add to /etc/php5/apache2/conf.d/suhosin.ini and /etc/php5/cli/conf.d/suhosin.ini): + suhosin.srand.ignore = Off + suhosin.mt_srand.ignore = Off +This modification is not needed for Version 0.941 and higher. Suhosin hardening also entails a second place where HTTP post requests are limited. You should also set suhosin.post.max_value_length to the same value you set for post_max_size. +#Looking in the folders /etc/php5/apache2/conf.d and /etc/php5/cli/conf.d you can see which extensions are being loaded by PHP. Look for files curl.ini, gd.ini, sqlite.ini to know these extensions will be loaded. +#Restart the web server after making your changes: + sudo apachectl stop + sudo apachectl start +The DocumentRoot for web sites (virtual hosts) served by an Ubuntu Linux machine is typically specified by files in /etc/apache2/sites-enabled. In this example, it was given in a file 000-default and specified to be /var/www/. +#[[Downloads|Download Yioop]], unpack it into /var/www and use mv to rename the Yioop folder to yioop. +#Make a folder for your crawl data: + sudo mkdir /var/www/yioop_data + sudo chmod 777 /var/www/yioop_data +#Next set the permissions on config.php so that the web server can set the work directory location. We'll brute-force this as: + sudo chmod 777 /var/www/yioop/configs/config.php +#In a browser, go to the page http://localhost/yioop/ . You should see a configure screen where you can enter /var/www/yioop_data for the Work Directory. 
It will ask you to re-login. Use the login: root and no password. Now go to Yioop => Configure and input the following settings: + Search Engine Work Directory: /var/www/yioop_data + Default Language: English + Crawl Robot Name: TestBot + Robot Description: This bot is for test purposes. It respects robots.txt + If you are having problems with it please feel free to ban it. +Crawl robot name is what will appear together with a url to a bot.php page in web server log files of sites you crawl. The bot.php page will display what you write in robot description. This should give contact information in case your robot misbehaves. Obviously, you should customize the above to what you want to say. +#Go to Manage Machines. Add a single machine under Add Machine using the settings: + Machine Name: Local + Machine Url: http://localhost/yioop/ + Is Mirror: (uncheck) + Has Queue Server: (check) + Number of Fetchers 1 + Submit +#Under Machine Information turn the Queue Server and Fetcher On. +#Go to Manage Crawls. Click on the options to set up where you want to crawl. Type in a name for the crawl and click start crawl. +#Let it crawl for a while, until you see the Total URLs Seen > 1. +#Then click Stop Crawl and wait for the crawl to appear in the previous crawls list. Set it as the default crawl. You should be able to search using this index. + +==Centos Linux== + +These instructions were tested running a [[http://virtualboxes.org/images/centos/|Centos 6.3 image]] in [[https://www.virtualbox.org/|VirtualBox]]. The keyboard settings for the particular image on the VirtualBox site are Italian, so you will have to tweak them to get an American keyboard or the keyboard you are most comfortable with. Also, in this virtual setting the memory available is somewhat low so you might need to tweak values in configs/config.php to reduce the memory needs of Yioop. To get started, log in, launch a terminal window, and su root. 
+ +#The image we were using doesn't have Apache installed or the nano editor. These can be installed with the commands: + yum install httpd + yum install nano +#If you didn't su root, then you will need to put sudo before all commands in this guide, and you will have to make sure the user you are running under is in the list of sudoers. +#Apache's configuration files are in the /etc/httpd directory. To get rid of the default web landing page, we switch into the conf.d subfolder and disable welcome.conf. To do this, first type the commands: + cd /etc/httpd/conf.d + nano welcome.conf +Then using the editor put #'s at the start of each line and save the result. +#Next we install git, PHP, and the various PHP extensions we need: + yum install git + yum install php + yum install php-mbstring + yum install php-sqlite3 + yum install gd + yum install php-gd +#The default Apache DocumentRoot under Centos is /var/www/html. We will install Yioop in a folder /var/www/html/yioop. This can be accessed by pointing a browser at http://127.0.0.1/yioop/ . To download Yioop to /var/www/html/yioop and to create a work directory, we run the commands: + cd /var/www/html + git clone http://seekquarry.com/git/yioop.git yioop + mkdir yioop_data + chmod 777 yioop_data +Restart/start the web server: + service httpd stop + service httpd start +Tell Yioop where its work directory is: + cd /var/www/html/yioop/configs + php configure_tool.php +#Select option (1) Create/Set Work Directory +#Enter /var/www/html/yioop_data +#Then select option (1) to confirm the change. +#Exit the program. +#In a browser, go to the page http://localhost/yioop/ . You should see a configure screen where you can enter /var/www/html/yioop_data for the Work Directory. It will ask you to re-login. Use the login: root and no password. 
Now go to Yioop => Configure and input the following settings: + Search Engine Work Directory: /var/www/html/yioop_data + Default Language: English + Crawl Robot Name: TestBot + Robot Description: This bot is for test purposes. It respects robots.txt + If you are having problems with it please feel free to ban it. +Crawl robot name is what will appear together with a url to a bot.php page in web server log files of sites you crawl. The bot.php page will display what you write in robot description. This should give contact information in case your robot misbehaves. Obviously, you should customize the above to what you want to say. +#Go to Manage Machines. Add a single machine under Add Machine using the settings: + Machine Name: Local + Machine Url: http://localhost/yioop/ + Is Mirror: (uncheck) + Has Queue Server: (check) + Number of Fetchers 1 + Submit +#Under Machine Information turn the Queue Server and Fetcher On. +#Go to Manage Crawls. Click on the options to set up where you want to crawl. Type in a name for the crawl and click start crawl. +#Let it crawl for a while, until you see the Total URLs Seen > 1. +#Then click Stop Crawl and wait for the crawl to appear in the previous crawls list. Set it as the default crawl. You should be able to search using this index. + +==CPanel== + +Generally, it is not practical to do your crawling in a cPanel hosted website. However, cPanel works perfectly fine for hosting the results of a crawl you did elsewhere. Here we briefly describe how to do this. In capacity planning your installation, as a rule of thumb, you should expect your index to be of comparable size (number of bytes) to the sum of the sizes of the pages you downloaded. + +#Download Yioop (you should choose some version > 0.88 or the latest) to your local machine. +#In cPanel go to File Manager and navigate to the place on your server you want to serve Yioop from. Click upload and choose your zip file so as to upload it to that location. 
#Select the uploaded file and click extract to extract the zip file to a folder. Reload the page. Rename the extracted folder, if necessary.
#For the rest of these instructions, let's assume the site where the testing is being done is mysite. If at this point one browsed to:
 http://mysite.my/yioop/
One would see:
 SERVICE AVAILABLE ONLY VIA LOCALHOST UNTIL CONFIGURED
Browse to the yioop/configs folder. Create a new file local_config.php and add the code:
 <?php
 define('NO_LOCAL_CHECK', 'true');
 ?>
Now if you browse to:
 http://mysite.my/yioop/
you should see a place to enter a work directory path.
#The work directory must be an absolute path. In the cPanel File Manager, at the top of the directory tree on the left-hand side of the screen, it lists a file path such as
 /public_html/mysite.my/yioop/configs
(if we still happened to be in the configs directory). You want to make this a full path. Typically, this means tacking on /home/username (what you log in with) to the path so far. To keep things simple set the work directory to be:
 /home/username/public_html/mysite.my/yioop_data
Here username should be your user name. After filling this in as the Work Directory, click Load or Create. You will see it briefly display a complete profile page, then log you out, saying you must login with username root and a blank password, and re-login.
#Go to Manage Account and give yourself a better login and password.
#Go to Configure. Many cPanel installations still use PHP 5.2, so you might see:
 The following required items were missing:
 PHP Version 5.3 or Newer
This means you won't be able to crawl from within cPanel, but you will still be able to serve search results. To do this, perform a crawl elsewhere, for instance on your laptop.
#After performing a crawl, go to Manage Crawls on the machine where you performed the crawl. Look under Previous Crawls and locate the crawl you want to upload. Note its timestamp.
#Go to THIS_MACHINES_WORK_DIRECTORY/cache .
Locate the folder IndexDatatimestamp, where timestamp is the timestamp of the crawl you want. ZIP this folder.
#In FileManager, under cPanel on the machine you want to host your crawl, navigate to
 yioop_data/cache.
#Upload the ZIP and extract it.
#Go to Manage Crawls on this instance of Yioop, locate this crawl under Previous Crawls, and set it as the default crawl. You should now be able to search and get results from the crawl.

You will probably want to uncheck Cache in the Configure activity, as in this hosted setting it is somewhat hard to get the cache page feature (where it lets users see complete caches of web pages by clicking a link) of Yioop to work.

==HipHop==

[[https://github.com/facebook/hiphop-php/wiki|HipHop]] is Facebook's open-source virtual machine for executing PHP. It can offer a significant speed-up in performance over running the traditional PHP interpreter. Yioop runs under HipHop with the following limitations: (1) The Yioop page processors for epub, pptx, and xslx files make use of the ZipArchive class, which is not supported by HipHop. (2) The Yioop recipe plugin makes use of SplHeap, which is not supported by HipHop. In the former case you should uncheck these file extensions in Page Options. In the latter case, Yioop will automatically disable the recipe plugin, so you don't need to make any changes -- if you had crawled something using this plugin elsewhere, you can still serve the results using HipHop though. For the remainder of this section, we describe how to get Yioop up and running under Ubuntu 12.04 LTS using HipHop.

#To begin, get HipHop from GitHub.
To do this, add the HipHop repository to the apt sources list by adding to the file /etc/apt/sources.list the line:
 deb http://dl.hiphop-php.com/ubuntu precise main
then update the package index:
 sudo apt-get update
and install HipHop using apt-get:
 sudo apt-get install hiphop-php
#Set up a HipHop configuration file /etc/hhvm.hdf:
 Server {
 Port = 8080
 SourceRoot = /var/www/yioop
 }

 Eval {
 Jit = true
 }
 Log {
 Level = Error
 UseLogFile = true
 File = /var/log/hhvm/error.log
 Access {
 * {
 File = /var/log/hhvm/access.log
 Format = %h %l %u %t \"%r\" %>s %b
 }
 }
 }

 VirtualHost {
 * {
 Pattern = .*
 RewriteRules {
 dirindex {
 pattern = ^/(.*)/$|^/$
 to = $1/index.php
 qsa = true
 }
 }
 }
 }

 StaticFile {
 FilesMatch {
 * {
 pattern = .*\.(dll|exe)
 headers {
 * = Content-Disposition: attachment
 }
 }
 }
 Extensions {
 css = text/css
 gif = image/gif
 html = text/html
 jpe = image/jpeg
 jpeg = image/jpeg
 jpg = image/jpeg
 png = image/png
 tif = image/tiff
 tiff = image/tiff
 txt = text/plain
 }
 }
Notice this is running on port 8080 -- when I was testing this, I had something else running on port 80. If you want to use the more common port 80, modify the above accordingly. When figuring out error issues, it is often convenient to look at the error.log file by running:
 tail -n 500 /var/log/hhvm/error.log
This is the location specified by the configuration file; however, the directory /var/log/hhvm does not exist by default, so you should create it:
 sudo mkdir /var/log/hhvm
Most of the configuration file above comes from the [[http://www.hiphop-php.com/wp/?p=113|HipHop Blog Entry for WordPress Installation]]. I tweaked the rewrite rules for the default index files.
#Start the HipHop virtual machine daemon:
 sudo hhvm --mode daemon --user web --config /etc/hhvm.hdf
#[[Downloads|Download Yioop]], unpack it into /var/www .
If you didn't install apache2 then you might need to do mkdir to make this folder. Next use mv to rename the Yioop folder to yioop.
#Make a folder for your crawl data:
 sudo mkdir /var/www/yioop_data
 sudo chmod 777 /var/www/yioop_data
#Tell Yioop where its work directory is:
 cd /var/www/yioop/configs
 sudo hhvm -f configure_tool.php
#Select option (1) Create/Set Work Directory
#Enter /var/www/yioop_data
#then select option (1) to confirm the change.
#Exit the program.
Notice that to run the PHP program above we did not have to install PHP; we ran it directly using HipHop from the command line. The -f option specifies the file we'd like to run.
#In a browser, go to the page http://localhost:8080/ . You should see a configure screen where you can enter /var/www/yioop_data for the Work Directory. It will ask you to re-login. Use the login: root and no password. You can safely ignore the warning:
 The following required items were missing: PHP Version 5.3 or Newer
#Now go to Yioop => Configure and input the following settings:
 Search Engine Work Directory: /var/www/yioop_data
 Default Language: English
 Crawl Robot Name: TestBot
 Robot Description: This bot is for test purposes. It respects robots.txt
 If you are having problems with it please feel free to ban it.
Crawl robot name is what will appear together with a url to a bot.php page in web server log files of sites you crawl. The bot.php page will display what you write in robot description. This should give contact information in case your robot misbehaves. Obviously, you should customize the above to what you want to say.
#Click [Toggle Advanced Settings] on the configure page. For the Name Server URL set it to:
 http://localhost:8080/
If you didn't use port 8080, but instead the usual port 80, you would not have to do this step.
#Go to Manage Machines.
Add a single machine under Add Machine using the settings:
 Machine Name: Local
 Machine Url: http://localhost:8080/
 Is Mirror: (uncheck)
 Has Queue Server: (check)
 Number of Fetchers: 1
 Submit
#Under Machine Information turn the Queue Server and Fetcher On.
#Go to Manage Crawls. Click on the options to set up where you want to crawl. Type in a name for the crawl and click start crawl.
#Let it crawl for a while, until you see the Total URLs Seen > 1.
#Then click Stop Crawl and wait for the crawl to appear in the previous crawls list. Set it as the default crawl. You should be able to search using this index.
#If you prefer to run the fetchers and queue_servers from the command line, make sure to use hhvm rather than php if you want to use HipHop. I.e.,
 cd /var/www/yioop/bin
 hhvm -f fetcher.php terminal

==Systems with Multiple Queue Servers==

This section assumes you have already successfully installed and performed crawls with Yioop in the single queue_server setting and have succeeded in using Manage Machines to start and stop a queue_server and fetcher. If not, you should consult one of the installation guides above or the general [[Documentation|Yioop Documentation]].

Before we begin, what are the advantages in using more than one queue_server?

#If the queue_servers are running on different processors then they can each be indexing part of the crawl data independently, and so this can speed up indexing.
#After the crawl is done, the index will typically exist on multiple machines, and each needs to search a smaller amount of data before sending it to the name server for final merging. So queries can be faster.

For the purposes of this note we will consider the case of two queue_servers; the same idea works for more. To keep things especially simple, we have both of these queue_servers on the same laptop.
Advantages (1) and (2) will likely not apply in this case, but we are describing this for testing purposes -- you can take the same idea and have the queue servers on different machines after going through this tutorial.

#Download and install Yioop as you would in the single queue_server case, but do this twice. For example, on your machine, under document root you might have two subfolders
 git/yioop1
and
 git/yioop2
each with a complete copy of Yioop. We will use the copy git/yioop1 as an instance of Yioop with both a name_server and a queue_server; the git/yioop2 will be an instance with just a queue_server.
#On the Configure element of the git/yioop1 instance, set the work directory to be something like
 /Applications/XAMPP/xamppfiles/htdocs/crawls1
For the git/yioop2 instance we set it to be
 /Applications/XAMPP/xamppfiles/htdocs/crawls2
I.e., the work directories of these two instances should be different! For each crawl in the multiple queue_server setting, each instance will have a copy of those documents it is responsible for. So if we did a crawl with timestamp 10, each instance would have a WORK_DIR/cache/IndexData10 folder, and these folders would be disjoint from those of any other instance.
#Click Toggle Advanced Settings to see the additional configuration fields needed for what follows.
#Continuing down on the Configure element for each instance, make sure under the Search Access fieldset that Web, RSS, and API are checked.
#Next click on Server Settings. Make sure the name server and server key are the same for both instances. I.e., in the Name Server Set-up fieldset, one might set:
 Server Key: 123
 Name Server URL: http://localhost/git/yioop1/
The Crawl Robot Name should also be the same for the two instances, say:
 TestBotFeelFreeToBan
but we want the Robot Instance to be different, say 1 and 2.
#Go to the Manage Machine element for git/yioop1, which is the name server.
Only the name server needs to manage machines, so we won't do this for git/yioop2 (or for any other queue servers if we had them).
#Add machines for each Yioop instance we want to manage with the name server. In this particular case, fill out and submit the Add Machine form twice, the first time with:
 Machine Name: Local1
 Machine Url: http://localhost/git/yioop1/
 Is Mirror: unchecked
 Has Queue Server: checked
 Num Fetchers: 1
the second time with:
 Machine Name: Local2
 Machine Url: http://localhost/git/yioop2/
 Is Mirror: unchecked
 Has Queue Server: checked
 Num Fetchers: 1
#The Machine Name should be different for each Yioop instance, but can otherwise be whatever you want. Is Mirror controls whether this is a replica of some other node -- I'll save that for a different install guide at some point. If we wanted to run more fetchers we could have chosen a bigger number for Num Fetchers (fetchers are the processes that download web pages).
#After the above steps, there should be two machines listed under Machine Information. Click the On button on the queue server and the fetcher of both of them. They should turn green. If you click the log link you should start seeing new messages (it refreshes once every 30 seconds) after at most a minute or so.
#At this point you are ready to crawl in the multiple queue server setting. You can use Manage Crawl to set up, start, and stop a crawl exactly as in the single queue_server setting.
#Perform a crawl and set it as the default index. You can then turn off all the queue servers and fetchers in Manage Machines, if you like.
#If you type a query into the search bar of the name server (git/yioop1), you should be getting merged results from both queue servers. To check that this is working:
Under Configure on the name server (git/yioop1), make sure Query Info is checked and that Use Memcache and Use FileCache are not checked -- the latter two are left off while testing; we can check them later when we know things are working. When you perform a query now, at the bottom of the page you should see a horizontal rule followed by Query Statistics, followed by all the queries performed in calculating results. One of these should be PHRASE QUERY. Underneath it you should see Lookup Offset Times and beneath this Machine Subtimes: ID_0 and ID_1. If these appear you know it's working.

When a query is typed into the name server, it tacks no:network onto it and asks it of all the queue servers. It then merges the results. So if you type "hello" as the search, i.e., if you go to the url
 http://localhost/git/yioop1/?q=hello
the git/yioop1 script will make in parallel the curl requests
 http://localhost/git/yioop1/?q=hello&network=false&raw=1
 (raw=1 means no grouping)
 http://localhost/git/yioop2/?q=hello&network=false&raw=1
get the results back, and merge them. Finally, it returns the result to the user. The network=false tells http://localhost/git/yioop1/ to actually do the query lookup rather than make a network request.
EOD;
$public_pages["en-US"]["Main"] = <<< 'EOD'
page_type=standard

page_alias=

page_border=solid-border

toc=true

title=Open Source Search Engine Software - Seekquarry

author=Chris Pollett

robots=

description=SeekQuarry provides open source search technologies

page_header=main_header

page_footer=main_footer

END_HEAD_VARS=Open Source Search Engine Software!=

SeekQuarry is the parent site for [[https://www.yioop.com/|Yioop]]. Yioop is a [[http://gplv3.fsf.org/|GPLv3]], open source, PHP search engine.

==What can Yioop do?==

Yioop software provides many of the same features as larger search portals:

*'''Search Results.''' Yioop comes with a crawler which can be used to crawl the open web or a selection of URLs of your choice. It can also index popular archive formats like Wikipedia XML-dumps, arc, warc, Open Directory Project-RDF, as well as dumps of emails or databases. Once you have created Yioop indexes of your desired data sources, Yioop can serve as a search engine for your data. It supports "crawl mixes" of different data sources. Yioop also provides tools to classify and sculpt your data before it is used in search results.
*'''News Service.''' News is best when it is still fresh. Yioop has a media updater process that can be used to re-index RSS and Atom feeds on an hourly basis. This more timely information can then be incorporated into Yioop search results.
*'''Social Groups, Blogs, and Wikis.''' Yioop can be configured to allow users to create discussion groups, blogs, and wikis. If Yioop is configured to allow multiple users, then users can share mixes of crawls they create. Blogs and discussion groups can be made public or private, and posts can be made to expire if desired. Public ones have public RSS feeds, and the better amongst these can be chosen for incorporation into what Yioop's news service indexes. Each group also comes with its own wiki. Images and video can be uploaded to both feeds and wiki pages, and Yioop can be configured to automatically convert video to web viewable formats.
*'''Web Sites.''' Yioop's wiki mechanism can be used to build websites. It also has a [[http://en.wikipedia.org/wiki/Model%E2%80%93view%E2%80%93adapter|Model View Adapter]] framework which can be easily extended to build customized search portal websites. Yioop can also be integrated into existing sites to provide search functionality either through an API, Open Search RSS, or JSON services.
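Since the last item mentions Open Search RSS integration, here is a minimal Python sketch of a client consuming an Open Search RSS result feed of the kind Yioop can serve. The XML sample below is illustrative only, not actual Yioop output; a real client would fetch the feed over HTTP from a Yioop instance.

```python
# Sketch of parsing an Open Search RSS result feed. The sample feed is
# a hand-written, hypothetical example of the format, not Yioop output.
import xml.etree.ElementTree as ET

sample_rss = """<?xml version="1.0"?>
<rss version="2.0" xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">
  <channel>
    <title>Search results</title>
    <opensearch:totalResults>2</opensearch:totalResults>
    <item><title>First hit</title><link>http://example.com/a</link></item>
    <item><title>Second hit</title><link>http://example.com/b</link></item>
  </channel>
</rss>"""

def parse_results(rss_text):
    # Each RSS <item> is one search result: (title, link).
    channel = ET.fromstring(rss_text).find("channel")
    return [(item.findtext("title"), item.findtext("link"))
            for item in channel.findall("item")]

results = parse_results(sample_rss)
```

A site embedding Yioop search this way only needs to render the returned (title, link) pairs in its own templates.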

==Requirements==

The software and hardware requirements for Yioop are relatively low. At a minimum, you only need a web server such as Apache and PHP 5.3 or better. A test set-up consisting of three 2011 Mac Minis, each with 8GB RAM, with a single name server and five fetchers, can add 100 million pages to its index every four weeks.

EOD;
$public_pages["en-US"]["Ranking"] = <<< 'EOD'
page_type=standard

page_border=solid-border

toc=true

title=Open Source Search Engine Software - Seekquarry :: Ranking

author=Chris Pollett

robots=

description=

page_header=main_header

page_footer=main_footer

END_HEAD_VARS{{id='contents'
=Yioop Search Engine Ranking Mechanisms=
}}
==Introduction==

A typical query to Yioop is a collection of terms without the use of the OR operator, '|', or the use of the exact match operator, double quotes. On such a query, called a '''conjunctive query''', Yioop tries to return documents which contain all of the query terms. If the sequence of words is particularly common, Yioop will try to return results which have that string with the same word order. Yioop further tries to return these documents in descending order of score. Most users only look at the first ten of the results returned. This article tries to explain the different factors which influence whether a page that has all the terms will make it into the top ten. To keep things simple we will assume that the query is being performed on a single Yioop index rather than a crawl mix of several indexes. We will also ignore how news feed items get incorporated into results.

At its heart, Yioop relies on three main scores for a document: Doc Rank (DR), Relevance (Rel), and Proximity (Prox). Proximity scores are only used if the query has two or more terms. We will describe later how these three scores are calculated.
For now one can think that the Doc Rank roughly indicates how important the document as a whole is, Relevance measures how important the search terms are to the document, and Proximity measures how close the search terms appear to each other on the document. In addition to these three basic scores, a user might select when they perform a crawl that a classifier be used for ranking purposes. After our initial discussion, we will say how we incorporate classifier scores. + +On a given query, Yioop does not scan its whole posting lists to find every document that satisfies the query. Instead, it scans until it finds a fixed number of documents, say n, satisfying the query or until a timeout is exceeded. In the case of a timeout, n is just the number of documents found by the timeout. It then computes the three scores for each of these n documents. For a document d from these n documents, it determines the rank of d with respect to the Doc Rank score, the rank of d with respect to the Relevance score, and the rank of d with respect to the Proximity score. It finally computes a score for each of these n documents using these three rankings and '''reciprocal rank fusion (RRF)''': + +{{center| +`mbox(RRF)(d) := 200(frac{1}{59 + mbox(Rank)_(mbox(DR))(d)} + frac{1}{59 + mbox(Rank)_(mbox(Rel))(d)} + frac{1}{59 + mbox(Rank)_(mbox(Prox))(d)})` +}} + +This formula essentially comes from Cormack et al. [ [[Ranking#CCB2009|CCB2009]] ]. They do not use the factor 200 and use 60 rather than 59. `mbox(RRF)(d)` is known to do a decent job of combining scores, although there are some recent techniques such as LambdaRank [ [[Ranking#VLZ2012|VLZ2012]] ], which do significantly better at the expense of being harder to compute. To return results, Yioop computes the top ten of these n documents with respect to `mbox(RRF)(d)` and returns these documents. 
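To make the fusion step concrete, here is a minimal Python sketch (not Yioop's actual PHP implementation) of computing `mbox(RRF)(d)` from the three per-document ranks; the document ids and ranks are made up for illustration:

```python
# Reciprocal rank fusion as in the formula above: each document's score
# is 200 times the sum of 1/(59 + rank) over its Doc Rank, Relevance,
# and Proximity ranks. `ranks` maps a document id to its three ranks.
def rrf(ranks):
    return {doc: 200 * sum(1.0 / (59 + r) for r in doc_ranks)
            for doc, doc_ranks in ranks.items()}

# d1 ranked first on all three component scores; d2 ranked first on one
# and 200th on the other two.
scores = rrf({"d1": (1, 1, 1), "d2": (1, 200, 200)})
# scores["d1"] is 200 * 3/60 = 10.0
```

As the d2 case illustrates, a top rank on only one component cannot by itself produce a top overall score.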

It is relatively straightforward to extend the RRF(d) formula to handle scores coming from classifiers: One just adds additional reciprocal terms for each classifier score. For example, if `CL_1,...,CL_n` were the scores from the classifiers being used for ranking, then the formula would become:

`mbox(RRF)(d) := frac{600}{n+3}(frac{1}{59 + mbox(Rank)_(mbox(DR))(d)} + frac{1}{59 + mbox(Rank)_(mbox(Rel))(d)} +`
`\qquad\qquad\qquad frac{1}{59 + mbox(Rank)_(mbox(Prox))(d)} + sum_{i=1}^nfrac{1}{59 + mbox(Rank)_(mbox(CL)_i)(d)}).`

To get a feeling for how the `mbox(RRF)(d)` formula works, let's return to the non-classifiers case and consider some particular example situations: If a document ranked 1 with respect to each score, then `mbox(RRF)(d) = 200(3/(59+1)) = 10`. If a document ranked n for each score, then `mbox(RRF)(d) = 200(3/(59+n)) = 600/(59 + n)`. As `n -> infty`, this goes to `0`. A value `n = 200` is often used with Yioop. For this `n`, `600/(59 + n) approx 2.32`. If a document ranked 1 on one of the three scores, but ranked `n` on the other two, `mbox(RRF)(d) = 200/60 + 400/(59 +n) approx 3.33 + 400/(59 + n)`. The last term again goes to 0 as `n` gets larger, giving a maximum score of `3.33`. For the `n=200` case, one gets a score of `4.88`. So because the three component scores are converted to ranks, and then reciprocal rank fusion is used, one cannot solely use a good score on one of the three components to get a good score overall.

An underlying assumption used by Yioop is that the first `n` matching documents in Yioop's posting lists contain the 10 most important documents with respect to our scoring function. For this assumption to be valid, our posting lists must be roughly sorted according to score. For Yioop though, the first `n` documents will in fact most likely be the first `n` documents that Yioop indexed. This does not contradict the assumption, provided we are indexing documents according to their importance.
To do this, Yioop tries to index according to Doc Rank and assumes the effects of relevance and proximity are not too drastic. That is, they might be able to move the 100th document into the top 10, but not, say, the 1000th document into the top 10.

To see how it is possible to roughly index according to document importance, we next examine how data is acquired during a Yioop web crawl (the process for an archive crawl is somewhat different). This is not only important for determining the Doc Rank of a page, but the text extraction that occurs after the page is downloaded also affects the Relevance and Proximity scores. Once we are done describing these crawl/indexing time factors affecting scores, we will then consider search time factors which affect the scoring of documents and the actual formulas for Doc Rank, Relevance, and Proximity.

[[Ranking#contents|Return to table of contents]].

==Crawl Time Ranking Factors==
===Crawl Processes===

To understand how crawl and indexing time factors affect search ranking, let's begin by first fixing in our minds how a crawl works in Yioop. A Yioop crawl has three types of processes that play a role in this:

#A Name Server, which acts as an overall coordinator for the crawl, and which is responsible for starting and stopping the crawl.
#One or more Queue Servers, each of which maintains a priority queue of what to download next.
#One or more Fetchers, which actually download pages and do initial page processing.

A crawl is started through the Yioop Web app on the Name Server. For each url in the list of starting urls (Seed Sites), its hostname is computed, a hash of the hostname is computed, and based on this hash, that url is sent to a given queue server -- all urls with the same hostname will be handled by the same queue server. Fetchers periodically check the Name Server to see if there is an active crawl, and if so, what its timestamp is.
If there is an active crawl, a Fetcher would then pick a Queue Server and request a schedule of urls to download. By default, this can be as many as DOWNLOAD_SIZE_INTERVAL (defaults to 5000) urls.

===Fetchers and their Effect on Search Ranking===

Let's examine the fetcher's role in determining what terms get indexed, and hence, what documents can be retrieved using those terms. After receiving a schedule of urls, the fetcher downloads pages in batches of a hundred at a time. When the fetcher requests a URL for download, it sends a range request header asking for the first PAGE_RANGE_REQUEST (defaults to 50000) many bytes. Only the data in these bytes has any chance of becoming terms which are indexed. The reason for choosing a fixed, relatively small size is so that one can index a large number of documents even with a relatively small amount of disk space. Some servers do not know how many bytes they will send before sending; they might operate in "chunked" mode. So after receiving the page, the fetcher discards any data after the first PAGE_RANGE_REQUEST many bytes -- this data won't be indexed. Constants that we mention such as PAGE_RANGE_REQUEST can be found in configs/config.php. This particular constant can actually be set from the admin panel under Page Options - Crawl Time. For each page in the batch of a hundred urls downloaded, the fetcher proceeds through a sequence of processing steps to:

#Determine the page mimetype and choose a page processor.
#Use the page processor to extract a summary for the document.
#Apply any indexing plugins for the page processor to generate auxiliary summaries and/or modify the extracted summary.
#Run classifiers on the summary and add any class labels and rank scores.
#Calculate a hash from the downloaded page minus tags and non-word characters to be used for deduplication.
#Prune the number of links extracted from the document down to MAX_LINKS_PER_PAGE (defaults to 50).
#Apply any user-defined page rules to the summary extracted.
#Store a full cache of the page to disk, and add the location of the full cache to the summary. Full cache pages are stored in folders in WORK_DIRECTORY/cache/FETCHER_PREFIX-ArchiveCRAWL_TIMESTAMP. These folders contain gzipped text files, web archives, each made up of the concatenation of up to NUM_DOCS_PER_GENERATION many cache pages. The class representing this whole structure is called a WebArchiveBundle (lib/web_archive_bundle.php). The class for a single file is called a WebArchive (lib/web_archive.php).
#Keep summaries in fetcher memory until they are shipped off to the appropriate queue server in a process we'll describe later.

After these steps, the fetcher checks the name server to see if any crawl parameters have changed or if the crawl has stopped, before proceeding to download the next batch of a hundred urls. It proceeds in this fashion until it has downloaded and processed four to five hundred urls. It then builds a "mini-inverted index" of the documents it has downloaded and sends the inverted index, the summaries, any discovered urls, and any robots.txt data it has downloaded back to the queue server. It also sends back information on which of the hosts the queue server is responsible for are generating more than DOWNLOAD_ERROR_THRESHOLD (10) HTTP errors in a given schedule. These hosts will automatically be crawl-delayed by the queue server. Sending all of this data allows the fetcher to clear some of its memory and continue processing its batch of 5000 urls until it has downloaded all of them. At this point, the fetcher picks another queue server and requests a schedule of urls to download from it, and so on.

Page rules, which can greatly affect the summary extracted for a page, are described in more detail in the [[Documentation#Page%20Indexing%20and%20Search%20Options|Page Options Section]] of the Yioop documentation.
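The download-process-ship cycle described above can be sketched in Python as follows. All names here (run_schedule, download, process, ship, build_mini_inverted_index) are hypothetical stand-ins rather than Yioop functions, and the index built is a toy version of the mini-inverted index discussed later; the numbers mirror the defaults mentioned above:

```python
# Hypothetical sketch of the fetcher cycle: work through a schedule of
# urls in batches of 100, and after every four hundred or so processed
# summaries, build a mini-inverted index and ship everything to the
# queue server so fetcher memory can be cleared.
def run_schedule(schedule_urls, download, process, ship):
    summaries = []
    for i in range(0, len(schedule_urls), 100):
        batch = schedule_urls[i:i + 100]       # a hundred urls at a time
        summaries.extend(process(page) for page in download(batch))
        if len(summaries) >= 400:              # four to five hundred docs
            ship(build_mini_inverted_index(summaries), summaries)
            summaries = []                     # clear fetcher memory
    if summaries:                              # ship any leftover work
        ship(build_mini_inverted_index(summaries), summaries)

def build_mini_inverted_index(summaries):
    # Toy index: term -> list of (summary number, position in summary).
    index = {}
    for n, summary in enumerate(summaries):
        for pos, term in enumerate(summary.split()):
            index.setdefault(term, []).append((n, pos))
    return index
```

In the sketch, download and ship stand in for the HTTP fetching and the POST back to the queue server that the real fetcher performs.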
Before describing how the "mini-inverted index" processing step is done, let's examine Steps 1, 2, and 6 above in a little more detail, as they are very important in determining what actually is indexed. Based usually on the HTTP headers, a [[http://en.wikipedia.org/wiki/Internet_media_type|mimetype]] for each page is found. The mimetype determines which summary extraction processor, in Yioop terminology a page processor, is applied to the page. As an example of the key role that the page processor plays in what eventually ends up in a Yioop index, we list what the HTML page processor extracts from a page and how it does this extraction:

;'''Language''':
Document language is used to determine how to make terms from the words in a document. For example, if the language is English, Yioop uses the English stemmer on a document. So the word "jumping" in the document will get indexed as "jump". On the other hand, if the language was determined to be Italian, then a different stemmer would be used and "jumping" would remain "jumping". The HTML processor determines the language by first looking for a lang attribute on the <html> tag in the document. If none is found, it checks if the frequency of characters is close enough to English to guess the document is English. If this fails, it leaves the value blank.
;'''Title''':
When search results are displayed, the extracted document title is used as the link text. Words in the title are also given a higher value when Yioop calculates its relevance statistic. The HTML processor uses the contents of the <title> tag as its default title. If this tag is not present or is empty, Yioop then concatenates the contents of the <h1> to <h6> tags in the document. The HTML processor keeps only the first hundred (HtmlProcessor::MAX_TITLE_LEN) characters of the title.
;'''Description''':
The description is used when search results are displayed to generate the snippets beneath the result link.
Besides the title, it holds the remainder of the words on the page that are used to identify the document. The HTML processor can obtain a description using one of two algorithms that can be set in Page Options. When using the basic summarizer, it first takes the value of the content attribute of any <meta> tag whose name attribute is "description" in some capitalization. To this it concatenates the non-tag contents of the first four <p> and <div> tags, followed by the content of <td>, <li>, <dt>, <dd>, and <a> tags, until it reaches a maximum of HtmlProcessor::MAX_DESCRIPTION_LEN (2000) characters. These items are added from the one with the most characters to the one with the least. The HTML processor can also obtain a description using a centroid summarizer. Here it removes all tags from the document and splits the document into sentences. Ignoring common words (stop words), an average sentence vector is calculated. The components of this vector are terms, and the value for a component represents the likelihood that a sentence in this document has that term. Then the distance between each sentence and this centroid is calculated, and the closest sentences are added to the summary one by one until HtmlProcessor::MAX_DESCRIPTION_LEN (2000) characters have been reached.
;'''Links''':
Links are used by Yioop to obtain new pages to download. They are also treated by Yioop as "mini-documents". The url of such a mini-document is the target website of the link, and the link text is used as a description. As we will see during searching, these mini-documents get combined with the summary of the site linked to. The HTML processor extracts links from <a>, <frame>, <iframe>, and <img> tags. It extracts up to 300 links per document. When it extracts links, it canonicalizes relative links. If a <base> tag was present, it uses it as part of the canonicalization process. Link text is extracted from <a> tag contents and from alt attributes of <img>'s.
In addition, rel attributes are examined for robot directives such as nofollow.
+;'''Robot Metas''':
+This is used to keep track of any robot directives that occurred in meta tags in the document. These directives are things such as NOFOLLOW, NOINDEX, NOARCHIVE, and NOSNIPPET. These can affect what links are extracted from the page, whether the page is indexed, whether cached versions of the page will be displayable from the Yioop interface, and whether snippets can appear beneath the link on a search result page. The HTML processor does a case insensitive match on <meta> tags that contain the string "robot" (so it will treat such tags that contain robot and robots the same). It then extracts the directives from the content attribute of such a tag.
+
+The page processors for other mimetypes extract similar fields but look at different components of their respective document types.
+
+After the page processor is done with a page, pages which are neither robots.txt pages nor sitemap pages then pass through a pruneLinks method. This culls the up to 300 links that might have been extracted down to 50. To do this, for each link, the link text is gzipped and the length of the resulting string is determined. The 50 unique links of longest length are then kept. The idea is that we want to keep links whose text carries the most information. Gzipping is a crude way to eliminate text with lots of redundancies. The length then measures how much useful text is left. Having more useful text means that the link is more likely to be helpful in finding the document.
+
+Now that we have finished discussing Steps 1, 2, and 6, let's describe what happens when building a mini-inverted index. For the four to five hundred summaries that we have at the start of the mini-inverted index step, we make associative arrays of the form:
+ term_id_1 => ...
+ term_id_2 => ...
+ ...
+ term_id_i =>
+ ((summary_map_1, (positions in summary 1 that term i appeared) ),
+ (summary_map_2, (positions in summary 2 that term i appeared) ),
+ ...)
+ ...
+Term IDs are 20 byte strings. Terms might represent a single word or might represent phrases. The first 8 bytes of a term ID is the first 8 bytes of the md5 hash of the first word in the word or phrase. The next byte is used to indicate whether the term is a word or a phrase. If it is a word, the remaining bytes are used to encode what kind of page the word occurs on (media:text, media:image, ... safe:true, safe:false, and some classifier labels if relevant). If it is a phrase, the remaining bytes encode various length hashes of the remaining words in the phrase. Summary map numbers are offsets into a table which can be used to look up a summary. These numbers are in increasing order of when the page was put into the mini-inverted index. To calculate a position of a term, the summary is viewed as a single string consisting of words extracted from the url concatenated with the summary title concatenated with the summary description. One counts the number of words from the start of this string. Phrases start at the position of their first word. Let's consider the case where we only have words and no phrases and we are ignoring the meta word info such as media: and safe:. Then suppose we had two summaries:
+ Summary 1:
+ URL: http://test.yioop.com/
+ Title: Fox Story
+ Description: The quick brown fox jumped over the lazy dog.
+
+ Summary 2:
+ URL: http://test.yioop2.com/
+ Title: Troll Story
+ Description: Once there was a lazy troll, P&A, who lived on my
+ discussion board.
+The mini-inverted index might look like:
+ (
+ [test] => ( (1, (0)), (2, (0)) )
+ [yioop] => ( (1, (1)) )
+ [yioop2] => ( (2, (1)) )
+ [fox] => ( (1, (2, 7)) )
+ [stori] => ( (1, (3)), (2, (3)) )
+ [the] => ( (1, (4, 10)) )
+ [quick] => ( (1, (5)) )
+ [brown] => ( (1, (6)) )
+ [jump] => ( (1, (8)) )
+ [over] => ( (1, (9)) )
+ [lazi] => ( (1, (11)), (2, (8)) )
+ [dog] => ( (1, (12)) )
+ [troll] => ( (2, (2, 9)) )
+ [onc] => ( (2, (4)) )
+ [there] => ( (2, (5)) )
+ [wa] => ( (2, (6)) )
+ [a] => ( (2, (7)) )
+ [p_and_a] => ( (2, (10)) )
+ [who] => ( (2, (11)) )
+ [live] => ( (2, (12)) )
+ [on] => ( (2, (13)) )
+ [my] => ( (2, (14)) )
+ [discuss] => ( (2, (15)) )
+ [board] => ( (2, (16)) )
+ )
+The list associated with a term is called a '''posting list''' and an entry in this list is called a '''posting'''. Notice terms are stemmed when put into the mini-inverted index. Also, observe acronyms, abbreviations, emails, and urls, such as P&A, will be manipulated before being put into the index. For some languages such as Japanese where spaces might not be placed between words, char-gramming is done instead. If two character char-gramming is used, the string 源氏物語 (Tale of Genji) becomes 源氏 氏物 物語. A user query 源氏物 will, before look-up, be converted to the conjunctive query 源氏 氏物 and so would match a document containing 源氏物語.
+
+The effect of the meta word portion of a term ID in the single word term case is to split the space of documents containing a word like "dog" into disjoint subsets. This can be used to speed up queries like "dog media:image", "dog media:video". The media tag for a page can only be one of media:text, media:image, media:video; it can't be more than one. A query of just "dog" will actually be calculated as a disjoint union of the fixed, finitely many single word term IDs which begin with the same 8 byte hash as "dog".
A query of "dog media:image" will look up all term IDs with the same "dog" hash and "media:image" hash portion of the term ID. These term IDs will correspond to disjoint sets of documents which are processed in order of doc offset.
+
+In the worst case, doing a conjunctive query takes time proportional to the length of the shortest posting list. To try to get a better guarantee on the runtime of queries, Yioop uses term IDs for phrases to speed up multi-word queries. On a query like "earthquake soccer", Yioop uses these term IDs to see how many documents have this exact phrase. If this is greater than a threshold (10), Yioop just does an exact phrase look up using these term IDs. If the number of query words is greater than three, Yioop always uses this mechanism to do the look up. If the threshold is not met, Yioop checks if the threshold is met by all but the last word, or by all but the first word. If so, it does the simpler conjunctive query of the phrase plus the single word.
+
+Yioop does not store phrase term IDs for every phrase it has ever found on some document in its index. Instead, it follows the basic approach of [ [[Ranking#PTSHVC2011|PTSHVC2011]] ]. The main difference is that it stores data directly in its inverted index rather than their two ID approach. To get the idea of this approach, consider the stemmed document:
+ jack be nimbl jack be quick jack jump the candlestick
+The words that immediately follow each occurrence of "jack be" (nimbl, quick) in this document are not all the same. Phrases with this property are called '''maximal'''. The whole document "jack be nimbl jack be quick jack jump the candlestick" is also maximal and there is no prefix of it larger than "jack be" which is maximal. We would call this string '''conditionally maximal''' for "jack be".
When processing a document, Yioop builds a [[http://en.wikipedia.org/wiki/Suffix_tree|suffix tree]] for it in linear time using Ukkonen's algorithm [ [[Ranking#U1995|U1995]] ]. It uses this tree to quickly build a list of maximal phrases of up to 12 words and any prefixes for which they are conditionally maximal. Only such maximal phrases will be given term IDs and stored in the index. The term ID for such a phrase begins with the 8 byte hash of the prefix for which it is maximal. This is followed by hashes of various lengths for the remaining terms. The format used is specified in the documentation of utility.php's crawlHashPath function. To do an exact lookup of a phrase like "jack be nimbl", it suffices to look up phrase term IDs whose first 8 bytes are either the hash of "jack", "jack be", or "jack be nimbl". Yioop only uses phrase term IDs for lookup of documents, not for calculations like proximity, where it uses the actual words that make up the phrase to get a score.
+
+It should be recalled that links are treated as their own little documents and so will be treated as separate documents when making the mini-inverted index. The url of a link is what it points to, not the page it is on. So the hostname of the machine that it points to might not be a hostname handled by the queue server from which the schedule was downloaded. In reality, the fetcher actually partitions link documents according to the queue server that will handle each link, and builds separate mini-inverted indexes for each queue server. After building mini-inverted indexes, it sends the inverted index data, summary data, host error data, robots.txt data, and discovered links data destined for that server to the queue server the schedule was downloaded from. It keeps in memory all the other inverted index data destined for other machines. It will send this data to the appropriate queue servers later -- the next time it downloads and processes data for these servers.
To make sure this scales, the fetcher checks its memory usage; if it is getting low, it might send some of this data to other queue servers early.
+
+===Queue Servers and their Effect on Search Ranking===
+
+It is back on a queue server that the building blocks for the Doc Rank, Relevance, and Proximity scores are assembled. To see how this happens we continue to follow the flow of the data through the web crawl process.
+
+To communicate with a queue server, a fetcher posts data to the web app of the queue server. The web app writes mini-inverted index and summary data into a file in the WORK_DIRECTORY/schedules/IndexDataCRAWL_TIMESTAMP folder. Similarly, robots.txt data from a batch of 400-500 pages is written to WORK_DIRECTORY/schedules/RobotDataCRAWL_TIMESTAMP, and "to crawl" urls are written to WORK_DIRECTORY/schedules/ScheduleDataCRAWL_TIMESTAMP. The Queue Server periodically checks these folders for new files to process. It is often the case that files can be written to these folders faster than the Queue Server can process them.
+
+A queue server consists of two separate sub-processes:
+
+;'''An Indexer''':
+The indexer is responsible for reading IndexData files and building a Yioop index.
+;'''A Scheduler''':
+The scheduler maintains a priority queue of what urls to download next. It is responsible for reading ScheduleData files to update its priority queue and for making sure that urls forbidden by RobotData files do not enter the queue.
+
+When the Indexer processes a schedules/IndexData file, it saves the data in an IndexArchiveBundle (lib/index_archive_bundle). These objects are serialized to folders with names of the form: WORK_DIRECTORY/cache/IndexDataCRAWL_TIMESTAMP . IndexArchiveBundle's have the following components:
+
+;'''summaries''':
+This is a WebArchiveBundle folder containing the summaries of pages read from fetcher-sent IndexData files.
+;'''posting_doc_shards''':
+This contains a sequence of inverted index files, shardNUM, called IndexShard's. shardX holds the posting lists for the Xth block of NUM_DOCS_PER_GENERATION many summaries. NUM_DOCS_PER_GENERATION defaults to 40000 if the queue server is on a machine with at least 2Gb of memory. shardX also has postings for the link documents that were acquired while acquiring these summaries.
+;'''generation.txt''':
+Contains a serialized PHP object which says which shard is the active one -- the X such that shardX will receive newly acquired posting list data.
+;'''dictionary''':
+The dictionary contains a sequence of subfolders used to hold, for each term in a Yioop index, the offset and length in each IndexShard where the posting list for that term is stored.
+
+Of these components, posting_doc_shards are the most important with regard to page scoring. When a schedules/IndexData file is read, the mini-inverted index in it is appended to the active IndexShard. To do this append, all the summary map offsets need to be adjusted so that they point to locations at the end of the summary map of the IndexShard to which data is being appended. These offsets thus provide information about when a document was indexed during the crawl process. The maximum number of links per document is usually 50 for normal documents and 300 for [[http://www.sitemaps.org/|sitemaps]]. Empirically, it has been observed that a typical index shard has offsets for around 24 times as many link summary map entries as document summary map entries.
So roughly, if a newly added summary or link, d, has index DOC_INDEX(d) in the active shard, and the active shard is the GENERATION(d) shard, the newly added object will have
+
+<blockquote>
+\begin{eqnarray}
+\mbox{RANK}(d) &=& (\mbox{DOC_INDEX}(d) + 1) + (\mbox{AVG_LINKS_PER_PAGE} + 1) \times\\
+&&\mbox{NUM_DOCS_PER_GENERATION} \times \mbox{GENERATION}(d)\\
+&=& (\mbox{DOC_INDEX}(d) + 1) + 25 \times \\
+&&\mbox{NUM_DOCS_PER_GENERATION} \times \mbox{GENERATION}(d)
+\end{eqnarray}
+</blockquote>
+
+To make this a score out of `10`, we can use logarithms:
+
+{{center|
+`mbox(DR)(d) = 10 - log_(10)(mbox(RANK)(d)).`
+}}
+
+Here `mbox(DR)(d)` is the Doc Rank for one link or summary item stored in a Yioop index. However, as we will see, this does not give us the complete value of Doc Rank for an item when computed at query time. There are also some things to note about this formula:
+
+#Unlike PageRank [ [[Ranking#BP1998|BP1998]] ], it is not some kind of logarithm of a probability, it is the logarithm of a rank. A log probability would preserve information about the relative importance of two pages. I.e., it could say something about how far apart things like the number 1 page was compared to the number 2 page. Doc Rank as measured so far does not do that.
+#The Doc Rank is a positive number and less than 10 provided the index of the given queue server has fewer than 10 billion items. Since to index 10 billion items using Yioop, you would probably want multiple queue servers, Doc Ranks likely remain positive for larger indexes.
+#If we imagined that Yioop indexed the web as a balanced 25-ary tree starting from some seed node, where RANK(i) labels the node i of the tree enumerated level-wise, then `log_(25)(mbox(RANK)(d)) = log_(10)(mbox(RANK)(d))/log_(10)(25)` would be an estimate of the depth of a node in this tree. So Doc Rank can be viewed as an estimate of how far we are away from the root, with 10 being at the root.
+#Doc Rank is computed by different queue servers independently of each other for the same index. So it is possible for two summaries to have the same Doc Rank in the same index, provided they are stored on different queue servers.
+#For Doc Ranks to be comparable with each other for the same index on different queue servers, it is assumed that queue servers are indexing at roughly the same speed.
+
+Besides Doc Rank, index shards are important for determining relevance and proximity scores as well. An index shard stores the number of summaries seen, the number of links seen, the sum of the lengths of all summaries, and the sum of the lengths of all links. From these statistics, we can derive average summary lengths and average link lengths. From a posting, the number of occurrences of a term in a document can be calculated. These will all be useful statistics when we compute relevance. As we will see, when we compute relevance, we use the average values obtained for the particular shard the summary occurs in as a proxy for their value throughout all shards. The fact that a posting contains a position list of the locations of a term within a document will be used when we calculate proximity scores.
+
+We next turn to the role of a queue server's Scheduler process in the computation of a page's Doc Rank. One easy way, which is supported by Yioop, for a Scheduler to determine what to crawl next is to use a simple queue. This would yield roughly a breadth-first traversal of the web starting from the seed sites. Since high quality pages are often a small number of hops from any page on the web, there is some evidence [ [[Ranking#NW2001|NW2001]] ] that this lazy strategy is not too bad for crawling according to document importance. However, there are better strategies.
When Page Importance is chosen in the Crawl Order dropdown for a Yioop crawl, the Scheduler on each queue server works harder to make schedules so that the next pages to crawl are always the most important pages not yet seen.
+
+One well-known algorithm for doing this kind of scheduling is called OPIC (Online Page Importance Computation) [ [[Ranking#APC2003|APC2003]] ]. The idea of OPIC is that at the start of a crawl one divides up an initial dollar of cash equally among the starting seed sites. One then picks a site with highest cash value to crawl next. If this site had `alpha` cash value, then when we crawl it and extract links, we divide up the cash and give it equally to each link. So if there were `n` links, each link would receive from the site `alpha/n` cash. Some of these sites might already have been in the queue, in which case we add to their cash total. For URLs not in the queue, we add them to the queue with initial value `alpha/n`. Each site has two scores: its current cash on hand, and the total earnings the site has ever received. When a page is crawled, its cash on hand is reset to `0`. We always choose the next page to crawl from amongst the pages with the most cash (there might be ties). OPIC can be used to get an estimate of the importance of a page, by taking its total earnings and dividing them by the total earnings received by all pages in the course of a crawl.
+
+In the experiments conducted in the original paper, OPIC was shown to crawl in a better approximation to page rank order than breadth-first search. Bidoki and Yazdani [ [[Ranking#BY2008|BY2008]] ] have more recently proposed a new page importance measure, DistanceRank; they also confirm that OPIC does better than breadth-first search, but show that the computationally more expensive Partial PageRank and Partial DistanceRank perform even better. Yioop uses a modified version of OPIC to choose which page to crawl next.
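The unmodified OPIC idea just described can be sketched in a few lines. The following is an illustrative Python simulation under simplifying assumptions (an in-memory link graph, no ties handling beyond dictionary order), not Yioop's PHP Scheduler; it omits the robots.txt handling, crawl delays, and renormalization discussed below.

```python
def opic_crawl(seeds, links, steps):
    """Crawl `steps` pages, always picking the page with the most cash."""
    cash = {url: 1.0 / len(seeds) for url in seeds}  # initial dollar, split equally
    earnings = {url: 0.0 for url in seeds}           # historical totals per page
    order = []
    for _ in range(steps):
        url = max(cash, key=cash.get)                # page with the most cash
        amount = cash[url]
        earnings[url] = earnings.get(url, 0.0) + amount
        cash[url] = 0.0                              # cash on hand resets on crawl
        order.append(url)
        out = links.get(url, [])
        for target in out:                           # split cash equally among links
            cash[target] = cash.get(target, 0.0) + amount / len(out)
            earnings.setdefault(target, 0.0)
    return order, earnings

# Tiny example graph: a links to b and c, b links to c
graph = {"a": ["b", "c"], "b": ["c"], "c": []}
order, earnings = opic_crawl(["a"], graph, 3)
```

In this toy run, c accumulates cash from both a and b before being crawled, so it is scheduled after b even though both start with the same cash from a.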
+
+To save a fair bit of crawling overhead, Yioop does not keep for each site crawled historical totals of all earnings a page has received. The cash-based approach is only used for scheduling. Here are some of the issues addressed in the OPIC-based algorithm employed by Yioop:
+
+#A Scheduler must ensure robots.txt files are crawled before any other page on the host. To do this, robots.txt files are inserted into the queue before any page from that site. Until the robots.txt file for a page is crawled, the robots.txt file receives cash whenever a page on that host receives cash.
+#A fraction `alpha` of the cash that a robots.txt file receives is divided amongst any sitemap links on that page. Not all of the cash is given. This is to prevent sitemaps from "swamping" the queue. Currently, `alpha` is set to 0.25. Nevertheless, together with the last bullet point, the fact that we do share some cash means cash totals no longer sum to one.
+#Cash might go missing for several reasons: (a) An image page, or any other page, might be downloaded with no outgoing links. (b) A page might receive cash and later the Scheduler receives robots.txt information saying it cannot be crawled. (c) Round-off errors due to floating point precision. For these reasons, the Scheduler periodically renormalizes the total amount of cash.
+#A robots.txt file or a slow host might cause the Scheduler to crawl-delay all the pages on the host. These pages might receive sufficient cash to be scheduled earlier, but won't be, because there must be a minimum time gap between requests to that host.
+#When a schedule is made with a crawl-delayed host, URLs from that host cannot be scheduled until the fetcher that was processing them completes its schedule. If a Scheduler receives a "to crawl" url from a crawl-delayed host, and there are already MAX_WAITING_HOSTS many crawl-delayed hosts in the queue, then Yioop discards the url.
+#The Scheduler has a maximum in-memory queue size based on NUM_URLS_QUEUE_RAM (320,000 urls in a 2Gb memory configuration). It will wait on reading new "to crawl" schedule files from fetchers if reading in the file would mean going over this count. For a typical web crawl, this means the "to crawl" files build up much like a breadth-first queue on disk.
+#To make a schedule, the Scheduler starts processing the queue from highest priority to lowest. The up to 5000 urls in the schedule are split into slots of 100, where each slot of 100 will be required by the fetcher to take a MINIMUM_FETCH_LOOP_TIME (5 seconds). Urls are inserted into the schedule at the earliest available position. If a URL is crawl-delayed, it is inserted at the earliest position in the slot sufficiently far from any previous url for that host to ensure that the crawl-delay condition is met.
+#If a Scheduler's queue is full, yet after going through all of the urls in the queue it cannot find any to write to a schedule, it goes into a reset mode. It dumps its current urls back to schedule files, starts with a fresh queue (but preserving robots.txt info), and starts reading in schedule files. This can happen if too many urls of crawl-delayed sites start clogging a queue.
+
+The actual giving of a page's cash to its urls is done in the Fetcher. We discuss it in the section on the queue server because it directly affects the order of queue processing. Cash distribution in Yioop's algorithm is handled differently than in the OPIC paper. It is further handled differently for sitemap pages versus all other web pages. For a sitemap page with `n` links, let
+{{center|
+`gamma = sum_(j=1)^n 1/j^2`.
+}}
+Let `C` denote the cash that the sitemap has to distribute. Then the `i`th link on the sitemap page receives cash
+{{center|
+`C_i = C/(gamma cdot i^2)`.
+}}
+One can verify that `sum_(i=1)^n C_i = C`.
This weighting tends to favor links early in the sitemap and prevents crawling of sitemap links from clustering together too much. For a non-sitemap page, we split the cash by making use of the notion of a company level domain (cld). This is a slight simplification of the notion of a pay level domain (pld) defined in [ [[Ranking#LLWL2009|LLWL2009]] ]. For a host of the form something.2chars.2chars or blah.something.2chars.2chars, the company level domain is something.2chars.2chars. For example, for www.yahoo.co.uk, the company level domain is yahoo.co.uk. For any other url, stuff.2ndlevel.tld, the company level domain is 2ndlevel.tld. For example, for www.yahoo.com, the company level domain is yahoo.com. To distribute cash to the links on a page, we first compute the company level domain for the hostname of the url of the page, then for each link we compute its company level domain. Let `n` denote the number of links on the page and let `s` denote the number of links with the same company level domain as the page. If the cld of a link is the same as that of the page, and the page had cash `C`, then the link will receive cash:
+{{center|
+`frac{C}{2n}`
+}}
+Notice this is half what it would get under usual OPIC. On the other hand, links to a different cld will receive cash:
+{{center|
+`frac{C - s times C/(2n)}{n-s}`
+}}
+The idea is to avoid link farms with a lot of internal links. As long as there is at least one link to a different cld, the payout of a page to its links will sum to `C`. If no links go out of the cld, then cash will be lost. In the case where someone is deliberately doing a crawl of only one site, this lost cash will get replaced during normalization, and the above scheme essentially reduces to usual OPIC.
+
+We conclude this section by mentioning that the Scheduler only affects when a URL is written to a schedule, which will then be used by a fetcher.
It is entirely possible that two fetchers get consecutive schedules from the same Scheduler and return data to the Indexers not in the order in which they were scheduled, in which case they would be indexed out of order and their Doc Ranks would not reflect the order in which they were scheduled. The scheduling and indexing process is only approximately correct; we rely on query time manipulations to try to improve the accuracy.
+
+[[Ranking#contents|Return to table of contents]].
+
+==Search Time Ranking Factors==
+===Looking up Initial Links===
+
+We are at last in a position to describe how Yioop calculates the three scores Doc Rank, Relevance, and Proximity at query time. When a query comes into Yioop, it goes through the following stages before an actual look up is performed against an index.
+
+#Control words are calculated. Control words are terms like m: or i: which can be used to select a mix or index to use. They are also commands like raw:, which says what level of grouping to use, or no: commands, which say not to use a standard processing technique. For example, no:guess (affects whether the next processing step is done), no:network, etc. For the remainder, we will assume the query does not contain control words.
+#An attempt is made to guess the semantics of the query. This matches keywords in the query and rewrites them to other query terms. For example, a query term which is in the form of a domain name will be rewritten to the form of a meta word, site:domain. So the query will return only pages from the domain. Currently, this processing is in a nascent stage. As another example, if you do a search only on "D", it will rewrite the search to be "letter D".
+#Stemming or character `n`-gramming is done on the query, and acronyms and abbreviations are rewritten. This is the same kind of operation that we did after generating summaries to extract terms.
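The control word pass in step 1 above might be sketched as follows. This is a hypothetical simplification, not Yioop's actual parser; the prefix list and the function name `split_control_words` are ours, chosen just to illustrate peeling control terms off a query.

```python
# Hypothetical sketch of the control-word pass: peel off m:, i:, raw:,
# and no: terms, leaving the remaining terms for the later stages
# (semantic guessing, stemming, index look-up).
CONTROL_PREFIXES = ("m:", "i:", "raw:", "no:")

def split_control_words(query):
    controls, terms = [], []
    for term in query.split():
        # a control word is any term starting with one of the known prefixes
        if term.lower().startswith(CONTROL_PREFIXES):
            controls.append(term.lower())
        else:
            terms.append(term)
    return controls, terms
```

So `split_control_words("raw:1 no:guess lazy dog")` would separate out the control words `["raw:1", "no:guess"]` and leave `["lazy", "dog"]` for the remaining stages.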
+
+After going through the above steps, Yioop builds an iterator object from the resulting terms to iterate over summaries and link entries that contain all of the terms. As described in the section [[Ranking#Fetchers%20and%20their%20Effect%20on%20Search%20Ranking|Fetchers and their Effect on Search Ranking]], some or all of these terms might be whole phrases, which reduces the need to compute expensive conjunctive queries. In the single queue server setting, one iterator would be built for each term and these iterators would be added to an intersect iterator that returns documents on which all the terms appear. This intersect iterator has a timer associated with it to prevent it from running too long in the case of a conjunctive query of terms with long posting lists but a small intersection. These iterators are then fed into a grouping iterator, which groups links and summaries that refer to the same document url. Recall that after downloading pages on the fetcher, we calculated a hash from the downloaded page minus tags. Documents with the same hash are also grouped together by the group iterator. The value `n=200` posting list entries that Yioop scans out on a query, referred to in the introduction, is actually the number of results the group iterator requests before grouping. This number can be controlled from the Yioop admin pages under Page Options > Search Time > Minimum Results to Group. The number `200` was chosen because on a single machine it was found to give decent results without the queries taking too long.
+
+In the multiple queue server setting, when the query comes in to the name server, a network iterator is built. This iterator poses the query to each queue server being administered by the name server. If `n=200`, the name server multiplies this value by the value of Page Options > Search Time > Server Alpha, which we'll denote `alpha`. This defaults to 1.6, so the total is 320. It then divides this by the number of queue servers.
So if there were 4 queue servers, one would have 80. It then requests the first 80 results for the query from each queue server. The queue servers don't do grouping, but just send the results of their intersect iterators to the name server, which does the grouping.
+
+In both the networked and non-networked case, after the grouping phase, Doc Rank, Relevance, and Proximity scores for each of the grouped results will have been determined. We then combine these three scores into a single score using the reciprocal rank fusion technique described in the introduction. Results are then sorted in descending order of score and output. What we have left to describe is how the scores are calculated in the various iterators mentioned above.
+
+To fix an example to describe this process, suppose we have a group `G'` of items `i_j'`, either pages or links, that all refer to the same url. A page in this group means that at some point we downloaded the url and extracted a summary. It is possible for there to be multiple pages in a group because we might re-crawl a page. If we have another group `G' '` of items `i_k' '` of this kind such that the hash of the most recent page matches that of `G'`, then the two groups are merged. While we are grouping, we are computing a temporary overall score for a group. The temporary score is used to determine which page's (or link's, if no pages are present) summary in a group should be used as the source of url, title, and snippets. Let `G` be the group one gets by performing this process after all groups with the same hash as `G'` have been merged. We now describe how the individual items in `G` have their scores computed, and finally, how these scores are combined.
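The reciprocal rank fusion step mentioned above can be sketched generically as follows. The constant `k=60` is the common default from the information retrieval literature and is an assumption here, not necessarily the value Yioop uses; the document ids are made up for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several rankings (lists of ids, best first) into one ranking.
    Each document scores sum over rankings of 1 / (k + its rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# One ranking per score: Doc Rank, Relevance, Proximity (hypothetical ids)
by_doc_rank  = ["d1", "d2", "d3"]
by_relevance = ["d2", "d1", "d3"]
by_proximity = ["d2", "d3", "d1"]
fused = reciprocal_rank_fusion([by_doc_rank, by_relevance, by_proximity])
```

Here d2 wins the fused ranking because it places first in two of the three component rankings, even though d1 leads the Doc Rank list.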
+
+The Doc Rank of an item `d`, `mbox(DR)(d)`, is calculated according to the formula mentioned in the [[Ranking#Queue%20Servers%20and%20their%20Effect%20on%20Search%20Ranking|queue servers]] subsection:
+\begin{eqnarray}
+\mbox{RANK}(d) &=& (\mbox{DOC_INDEX}(d) + 1) + (\mbox{AVG_LINKS_PER_PAGE} + 1) \times\\
+&&\mbox{NUM_DOCS_PER_GENERATION} \times \mbox{GENERATION}(d)\\
+&=& (\mbox{DOC_INDEX}(d) + 1) + 25 \times \\
+&&\mbox{NUM_DOCS_PER_GENERATION} \times \mbox{GENERATION}(d)\\
+\mbox{DR}(d) &=& 10 - \log_{10}(\mbox{RANK}(d))
+\end{eqnarray}
+To compute the relevance of an item, we use a variant of BM25F [ [[Ranking#ZCTSR2004|ZCTSR2004]] ].
+Suppose a query `q` is a set of terms `t`. View an item `d` as a bag of
+terms, let `f_(t,d)` denote the frequency of the term `t` in `d`,
+let `N` denote the total number of items in the index,
+let `N_t` denote the number of items containing `t` in the whole index
+(not just the group), let `l_d` denote the length of `d`, where length is
+the number of terms including repeats it contains, and
+let `l_{avg}` denote the average length of an item in the index. The basic
+BM25 formula is:
+\begin{eqnarray}
+\mbox{Score}_{\mbox{BM25}}(q, d) &=& \sum_{t \in q} \mbox{IDF}(t)
+\cdot \mbox{TF}_{\mbox{BM25}}(t,d), \mbox{ where }\\
+\mbox{IDF}(t) &=& \log(\frac{N}{N_t})\mbox{, and}\\
+\mbox{TF}_{\mbox{BM25}}(t,d) &=&
+\frac{f_{t,d}\cdot(k_1 +1)}{f_{t,d} + k_1\cdot ((1-b) + b\cdot(l_d / l_{avg}) )}
+\end{eqnarray}
+`mbox(IDF)(t)`, the inverse document frequency of `t`, in the above can be
+thought of as a measure of how much signal is provided by knowing that the term `t`
+appears in the document. For example, its value is zero if `t` is in every
+document; whereas the rarer the term is, the larger the value of
+`mbox(IDF)(t)`.
+`mbox(TF)_(mbox(BM25))` represents a normalized term frequency for `t`.
+Here `k_1 = 1.2` and `b=0.75` are tuned parameters which are set to values
+commonly used in the literature. `mbox(TF)_(mbox(BM25))` is normalized to
+prevent bias toward longer documents.
Also, if one spams a document, filling
+it with many copies of the term `t`, we approach the limiting situation
+`lim_(f_(t,d) -> infty) mbox(TF)_(mbox(BM25))(t,d) = k_1 +1`, which as one
+can see prevents the document score from being made arbitrarily large.
+
+Yioop computes a variant of BM25F, not BM25. This formula also
+needs to have values for things like `l_(avg)`, `N`, `N_t`. To keep the
+computation simple, at the loss of some accuracy, when Yioop needs these values
+it uses information from the statistics in the particular index shard of `d` as
+a stand-in. BM25F is essentially the same as BM25 except that it separates
+a document into components, computes the BM25 score of the document with
+respect to each component, and then takes a weighted sum of these scores.
+In the case of Yioop, if the item is a page, the two components
+are an ad hoc title and a description. Recall when making our position
+lists for a term in a document that we concatenated url keywords,
+followed by title, followed by summary. So the first terms in the result
+will tend to be from the title. We take the first AD_HOC_TITLE_LEN many terms
+from a document to be in the ad hoc title. We calculate an ad hoc title
+BM25 score for a term from a query being in the ad hoc title of an item.
+We multiply this by 2 and then compute a BM25 score of the term being in
+the rest of the summary. We add the two results. I.e.,
+{{center|
+`mbox(Rel)(q, d) = 2 times mbox(Score)_(mbox(BM25-Title))(q, d) + mbox(Score)_(mbox(BM25-Description))(q, d)`
+}}
+This score would be the relevance for a single page item `d` with respect
+to `q`. For link items we don't
+separate into title and description, but can weight the BM25 score differently
+than for a page (currently, though, the link weight is set to 1 by default).
+These three weights: title weight, description weight, and link weight, can
+be set in Page Options > Search Time > Search Rank Factors.
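The BM25 pieces above can be written out directly. This is a sketch using the stated parameters `k_1 = 1.2` and `b = 0.75` and the default title weight of 2; the helper name `relevance` is ours, not Yioop's, and a real implementation would pull `N`, `N_t`, and `l_(avg)` from shard statistics as described.

```python
import math

K1, B = 1.2, 0.75  # tuned parameters from the formulas above

def idf(N, N_t):
    """Inverse document frequency: zero if the term is in every item."""
    return math.log(N / N_t)

def tf_bm25(f_td, l_d, l_avg, k1=K1, b=B):
    """Normalized term frequency, bounded above by k1 + 1."""
    return f_td * (k1 + 1) / (f_td + k1 * ((1 - b) + b * l_d / l_avg))

def bm25(query_terms, doc_freqs, l_d, l_avg, N, df):
    """doc_freqs: term -> frequency in d; df: term -> items containing term."""
    return sum(idf(N, df[t]) * tf_bm25(doc_freqs.get(t, 0), l_d, l_avg)
               for t in query_terms if t in df)

def relevance(title_score, description_score):
    # Yioop-style default weighting: the ad hoc title BM25 score counts double
    return 2 * title_score + description_score
```

Note how `tf_bm25` captures the anti-spam limit discussed above: no matter how many copies of a term a document contains, its contribution never exceeds `k_1 + 1`.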
+
+To compute the proximity score of an item `d` with respect to
+a query `q` with more than one term, we use the notion of a '''span'''.
+A span is an interval `[u_i, v_i]` of positions within `d` which contains
+all the terms (including repeats) in `q` such that no smaller interval contains
+all the terms (including repeats). Given `d` we can calculate a proximity
+score as a sum of the inverse of the sizes of the spans:
+{{center|
+`mbox(Prox)(d) = sum(frac(1)(v_i - u_i + 1))`.
+}}
+This formula comes from Clarke et al. [ [[Ranking#CCT2000|CCT2000]]] except that they use covers, rather than spans, where covers ignore repeats. For a page item, Yioop calculates separate proximity scores with respect to its ad hoc title and the rest of a summary. It then adds them with the same weights as were used for the BM25F relevance score. Similarly, link item proximities also have a weight factor multiplied against them.
+
+Now that we have described how to compute Doc Rank, Relevance, and Proximity
+for each item in a group, we next describe how to get these three values
+for the whole group. First, for proximity we take the max over all
+the proximity scores in a group. The idea is that since we are going
+out typically 200 results before grouping, each group has a relatively
+small number of items in it. Of these there will typically be at most
+one or two page items, and the rest will be link items. We aren't
+doing document length normalization for proximity scores and it might
+not make sense to do so for link data where the whole link text is relatively
+short. Thus, the maximum score in the group is likely to be that of a
+page item, and when the user clicks through to the result, it is these
+spans the user will see.
+Let `[u]` denote all the items that would be grouped
+with url `u` in the grouping process, let `q` be a query.
+Let `Res(q)` denote results in the index satisfying query `q`, that is,
+having all the terms in the query.
Then the
+proximity of `[u]` with respect to `q` is:
+\begin{eqnarray}
+\mbox{Prox}(q, [u]) &=& \mbox{max}_{i \in [u], i \in Res(q)}(\mbox{Prox}(q,i)).
+\end{eqnarray}
+For Doc Rank and Relevance, we split a group into subgroups based on
+the host name of where a link came from. So links from
+http://www.yahoo.com/something1 and http://www.yahoo.com/something2
+to a url `u` would have the same hostname http://www.yahoo.com/. A link from
+http://www.google.com/something1 would have hostname http://www.google.com/.
+We will also use a weighting `wt(i)` which has value `2` if `i` is
+a page item and the url of `i` is a hostname, and 1 otherwise.
+Let `mbox(Host)(i)` denote the hostname of a page item `i`, and, in the case
+of a link item, the hostname of the page the link came from. Let
+{{center|
+`H([u]) = { h \quad | h = mbox(Host)(i) mbox( for some ) i in [u]}`.
+}}
+Let `[u]_h` be the items in `[u]` with hostname `h`.
+Let `([u]_h)_j` denote the `j`th element of `[u]_h` listed out in order of
+Doc Rank except that the first page item found is listed as `([u]_h)_0`.
+It seems reasonable that if a particular host tells us the site `u` is great
+multiple times, the likelihood that we would have our minds swayed diminishes
+with each repetition. This motivates our formulas for Doc Rank and Relevance,
+which we give now:
+\begin{eqnarray}
+\mbox{Rel}(q, [u]) &=& \sum_{h \in H([u])}
+\sum_{j=0}^{|[u]_h|}\frac{1}{2^j}wt(([u]_h)_j) \cdot \mbox{Rel}(q, ([u]_h)_j).\\
+\mbox{DR}(q, [u]) &=& \sum_{h \in H([u])}
+\sum_{j=0}^{|[u]_h|}\frac{1}{2^j}wt(([u]_h)_j) \cdot \mbox{DR}(q, ([u]_h)_j).
+\end{eqnarray}
+Now that we have described how Doc Rank, Relevance, and Proximity
+are calculated for groups, we have almost completed our description of the Yioop
+scoring mechanism in the conjunctive query case. After performing
+pre-processing steps on the query, Yioop retrieves the first `n`
+results from its index. Here `n` defaults to 200.
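As an illustration of the span idea used above, here is a small Python sketch in the spirit of the covers of Clarke et al., treating the query as a set of distinct terms (so repeats are ignored, as covers do); it is illustrative only, not Yioop's actual code:

```python
def covers(doc, query_terms):
    """Minimal intervals (u, v) of positions in doc containing every
    query term, such that no smaller such interval lies inside them."""
    terms = set(query_terms)
    last = {}        # most recent position of each query term seen so far
    found = []
    prev_u = -1
    for v, word in enumerate(doc):
        if word not in terms:
            continue
        last[word] = v
        if len(last) == len(terms):
            u = min(last.values())
            if u > prev_u:   # otherwise (u, v) contains an earlier interval
                found.append((u, v))
                prev_u = u
    return found

def proximity(doc, query_terms):
    # Prox(d): sum of the inverse sizes of the minimal intervals
    return sum(1.0 / (v - u + 1) for u, v in covers(doc, query_terms))
```

For example, for the term sequence a x b x a b and query {a, b}, the minimal intervals are positions (0,2), (2,4), and (4,5), so shorter, denser matches contribute larger proximity.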
It then groups the
+results and uses the formulas above to calculate the three scores
+for Doc Rank, Relevance, and Proximity. It then uses reciprocal rank
+fusion to combine these three scores into a single score, sorts the
+results by this score, and returns to the user the top 10 of these
+results.
+
+===Final Reordering===
+The top 10 results produced in the last section are what is presented
+in a basic configuration of Yioop. It is possible to configure Yioop
+to make use of a thesaurus to reorder these 10 results before final
+presentation. When this is done, these 10 results are retrieved as
+described above. Yioop then does part of speech tagging on the original
+query. This is done with a simplified
+[[https://en.wikipedia.org/wiki/Brill_tagger|Brill tagger]] [ [[Ranking#B1992|B1992]] ].
+Using the tagged version of the query, it looks up each term for its part of
+speech in a thesaurus for the current language. This is currently only
+implemented for English and the thesaurus used is
+[[http://wordnet.princeton.edu/|WordNet]]. Possible synonyms for a term in
+WordNet often have example sentences. If so, cosine or intersection rank
+scores of these sentences versus the original query are computed and the
+highest scoring synonym is selected. If there are no example sentences, then
+the first listed synonym is selected. To calculate a cosine score we view the
+original query
+and the sentence as binary vectors where the coordinates of the vectors are
+labeled by terms. So, for example, the "the" coordinate of the original
+query would be 1 if the original query contained the word "the". The dot
+product of these two vectors divided by their lengths then gives the cosine
+of the angle between them, the cosine score. This scoring is done for each
+term. Then for each term in the original query, the query is modified by
+swapping the term for its synonym. The number of documents for the modified
+query
+as a whole phrase is looked up in the index dictionary for the current index.
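The binary-vector cosine score described above amounts to the following small helper (illustrative only; whitespace tokenization is an assumption, and Yioop's actual code may differ):

```python
import math

def cosine_score(query, sentence):
    # Binary term vectors: cosine = |Q intersect S| / (sqrt(|Q|) * sqrt(|S|))
    q = set(query.lower().split())
    s = set(sentence.lower().split())
    if not q or not s:
        return 0.0
    return len(q & s) / (math.sqrt(len(q)) * math.sqrt(len(s)))
```

Identical term sets score 1, disjoint ones score 0, and partial overlap falls in between, which is all the reordering step needs.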
+
+The three phrases which occur most often in the dictionary are then
+selected. For each of the top 10 documents for the query, the sum of the
+cosine similarities of these three phrases with a document's summary is
+computed to get a thesaurus score. The ten documents are then sorted
+by this score and displayed.
+
+[[Ranking#contents|Return to table of contents]].
+
+==References==
+; {{id="APC2003" '''[APC2003]'''}}: Serge Abiteboul and Mihai Preda and Gregory Cobena. [[http://leo.saclay.inria.fr/publifiles/gemo/GemoReport-290.pdf|Adaptive on-line page importance computation]]. In: Proceedings of the 12th international conference on World Wide Web. pp. 280--290. 2003.
+; {{id="BY2008" '''[BY2008]'''}}: A. M. Z. Bidoki and Nasser Yazdani. [[http://goanna.cs.rmit.edu.au/~aht/tiger/DistanceRank.pdf|DistanceRank: An intelligent ranking algorithm for web pages]]. Information Processing and Management. Vol. 44. Iss. 2. pp. 877--892. March, 2008.
+; {{id="B1992" '''[B1992]'''}}: Eric Brill. 1992. [[http://anthology.aclweb.org//A/A92/A92-1021.pdf|A simple rule-based part of speech tagger]]. In Proceedings of the third conference on Applied natural language processing (ANLC '92). Association for Computational Linguistics. Stroudsburg, PA, USA. pp. 152--155.
+; {{id="BP1998" '''[BP1998]'''}}: Brin, S. and Page, L. [[http://infolab.stanford.edu/~backrub/google.html|The Anatomy of a Large-Scale Hypertextual Web Search Engine]]. In: Seventh International World-Wide Web Conference (WWW 1998). April 14-18, 1998. Brisbane, Australia. 1998.
+; {{id="CCT2000" '''[CCT2000]'''}}: Charles L. A. Clarke and Gordon V. Cormack and Elizabeth A. Tudhope. [[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.12.1615&rep=rep1&type=pdf|Relevance Ranking for One to Three Term Queries]]. In: Information Processing Management. Vol. 36. Iss. 2. pp. 291--311. 2000.
+; {{id="CCB2009" '''[CCB2009]'''}}: Gordon V. Cormack and Charles L. A. Clarke and Stefan Büttcher.
[[http://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf|Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods]]. In: Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 758--759. 2009.
+; {{id="LLWL2009" '''[LLWL2009]'''}}: H.-T. Lee, D. Leonard, X. Wang, D. Loguinov. [[http://irl.cs.tamu.edu/people/hsin-tsang/papers/tweb2009.pdf|IRLbot: Scaling to 6 Billion Pages and Beyond]]. ACM Transactions on the Web. Vol. 3. No. 3. June 2009.
+; {{id="NW2001" '''[NW2001]'''}}: Marc Najork and Janet L. Wiener. [[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.102.9301&rep=rep1&type=pdf|Breadth-First Search Crawling Yields High-Quality Pages]]. Proceedings of the 10th international conference on World Wide Web. pp. 114--118. 2001.
+; {{id="PTSHVC2011" '''[PTSHVC2011]'''}}: Manish Patil, Sharma V. Thankachan, Rahul Shah, Wing-Kai Hon, Jeffrey Scott Vitter, Sabrina Chandrasekaran. [[http://www.ittc.ku.edu/~jsv/Papers/PTS11.InvertedIndexSIGIR.pdf|Inverted indexes for phrases and strings]]. Proceedings of the 34th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 555--564. 2011.
+; {{id="U1995" '''[U1995]'''}}: Ukkonen, E. [[http://www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf|On-line construction of suffix trees]]. Algorithmica. Vol. 14. Iss. 3. pp. 249--260. 1995.
+; {{id="VLZ2012" '''[VLZ2012]'''}}: Maksims Volkovs, Hugo Larochelle, and Richard S. Zemel. [[http://www.cs.toronto.edu/~zemel/documents/cikm2012_paper.pdf|Learning to rank by aggregating expert preferences]]. 21st ACM International Conference on Information and Knowledge Management. pp. 843--851. 2012.
+; {{id="ZCTSR2004" '''[ZCTSR2004]'''}}: Hugo Zaragoza, Nick Craswell, Michael Taylor, Suchi Saria, and Stephen Robertson. [[http://trec.nist.gov/pubs/trec13/papers/microsoft-cambridge.web.hard.pdf|Microsoft Cambridge at TREC-13: Web and HARD tracks]].
In Proceedings of the 13th Annual Text Retrieval Conference. 2004.
+
+[[Ranking#contents|Return to table of contents]].
+EOD;
+$public_pages["en-US"]["Resources"] = <<< 'EOD'
+page_type=standard
+
+page_alias=
+
+page_border=solid-border
+
+toc=true
+
+title=Resources
+
+author=Chris Pollett
+
+robots=
+
+description=
+
+page_header=main_header
+
+page_footer=main_footer
+
+END_HEAD_VARS=Resources=
+
+==User Resources==
+* [[Discussion|Discussion Boards]]
+* [[Install|Install Guides]]
+* [[Ranking|Yioop Ranking Mechanisms]]
+* [[Syntax|Yioop's Wiki Syntax]]
+
+
+==Developer Resources==
+* [[Coding|Coding Guidelines]]
+* [[http://www.seekquarry.com/mantis/|Issue Tracking]]
+* [[http://www.seekquarry.com/yioop-docs/|PHPDocumentor docs for Yioop source code]]
+* [[http://www.seekquarry.com/viewgit/|View Git of Yioop repository]]
+
+EOD;
+$public_pages["en-US"]["Syntax"] = <<< 'EOD'
+page_type=standard
+
+page_alias=
+
+page_border=solid-border
+
+toc=true
+
+title=Yioop Wiki Syntax
+
+author=Chris Pollett
+
+robots=
+
+description=Describes the markup used by Yioop Software
+
+page_header=main_header
+
+page_footer=main_footer
+
+END_HEAD_VARS=Yioop Wiki Syntax=
+
+Wiki syntax is a lightweight way to markup a text document so that
+it can be formatted and drawn nicely by Yioop.
+This page briefly describes the wiki syntax supported by Yioop.
+
+==Headings==
+In wiki syntax headings of documents and sections are written as follows:
+
+<nowiki>
+=Level1=
+==Level2==
+===Level3===
+====Level4====
+=====Level5=====
+======Level6======
+</nowiki>
+
+and would look like:
+
+=Level1=
+==Level2==
+===Level3===
+====Level4====
+=====Level5=====
+======Level6======
+
+==Paragraphs==
+In Yioop two new lines indicate a new paragraph.
You can control +the indent of a paragraph by putting colons followed by a space in front of it: + +<nowiki> +: some indent + +:: a little more + +::: even more + +:::: that's sorta crazy +</nowiki> + +which looks like: + +: some indent + +:: a little more + +::: even more + +:::: that's sorta crazy + +==Horizontal Rule== +Sometimes it is convenient to separate paragraphs or sections with a horizontal +rule. This can be done by placing four hyphens on a line by themselves: +<nowiki> +---- +</nowiki> +This results in a line that looks like: +---- + +==Text Formatting Within Paragraphs== +Within a paragraph it is often convenient to make some text bold, italics, +underlined, etc. Below is a quick summary of how to do this: +===Wiki Markup=== +{| +|<nowiki>''italic''</nowiki>|''italic'' +|- +|<nowiki>'''bold'''</nowiki>|'''bold''' +|- +|<nowiki>'''''bold and italic'''''</nowiki>|'''''bold and italic''''' +|} + +===HTML Tags=== +Yioop also supports several html tags such as: +{| +|<nowiki><del>delete</del></nowiki>|<del>delete</del> +|- +|<nowiki><ins>insert</ins></nowiki>|<ins>insert</ins> +|- +|<nowiki><s>strike through</s> or +<strike>strike through</strike> </nowiki>|<s>strike through</s> +|- +|<nowiki><sup>superscript</sup> and +<sub>subscript</sub></nowiki>|<sup>superscript</sup> and +<sub>subscript</sub> +|- +|<nowiki><tt>typewriter</tt></nowiki>|<tt>typewriter</tt> +|- +|<nowiki><u>underline</u></nowiki>|<u>underline</u> +|} + +===Spacing within Paragraphs=== +The HTML entity +<nowiki>&nbsp;</nowiki> +can be used to create a non-breaking space. The tag +<nowiki><br></nowiki> +can be used to produce a line break. + +==Preformatted Text and Unformatted Text== +You can force text to be formatted as you typed it rather +than using the layout mechanism of the browser using the +<nowiki><pre>preformatted text tag.</pre></nowiki> +Alternatively, a sequence of lines all beginning with a +space character will also be treated as preformatted. 
+
+Wiki markup within pre tags is still parsed by Yioop.
+If you would like to add text that is not parsed, enclose
+it in `<`nowiki> `<`/nowiki> tags.
+
+==Styling Text Paragraphs==
+Yioop wiki syntax offers a number of templates for
+controlling the style and alignment of text for
+a paragraph or group of paragraphs:<br />
+`{{`left| some text`}}`,<br /> `{{`right| some text`}}`,<br />
+and<br />
+`{{`center| some text`}}`<br /> can be used to left-justify,
+right-justify, and center a block of text. For example,
+the last command would produce:
+{{center|
+some text
+}}
+If you know cascading style sheets (CSS), you can set
+a class or id selector for a block of text using:<br />
+`{{`class="my-class-selector" some text`}}`<br />and<br />
+`{{`id="my-id-selector" some text`}}`.<br />
+You can also apply inline styles to a block of text
+using the syntax:<br />
+`{{`style="inline styles" some text`}}`.<br />
+For example, `{{`style="color:red" some text`}}` looks
+like {{style="color:red" some text}}.
+
+==Lists==
+Yioop wiki syntax supports three ways of listing items:
+bulleted/unordered lists, numbered/ordered lists, and
+definition lists.
Below are some examples: + +===Unordered Lists=== +<nowiki> +* Item1 +** SubItem1 +** SubItem2 +*** SubSubItem1 +* Item 2 +* Item 3 +</nowiki> +would be drawn as: +* Item1 +** SubItem1 +** SubItem2 +*** SubSubItem1 +* Item 2 +* Item 3 + +===Ordered Lists=== +<nowiki> +# Item1 +## SubItem1 +## SubItem2 +### SubSubItem1 +# Item 2 +# Item 3 +</nowiki> +# Item1 +## SubItem1 +## SubItem2 +### SubSubItem1 +# Item 2 +# Item 3 + +===Mixed Lists=== +<nowiki> +# Item1 +#* SubItem1 +#* SubItem2 +#*# SubSubItem1 +# Item 2 +# Item 3 +</nowiki> +# Item1 +#* SubItem1 +#* SubItem2 +#*# SubSubItem1 +# Item 2 +# Item 3 + +===Definition Lists=== +<nowiki> +;Term 1: Definition of Term 1 +;Term 2: Definition of Term 2 +</nowiki> +;Term 1: Definition of Term 1 +;Term 2: Definition of Term 2 + +==Tables== +A table begins with {`|` and ends with `|`}. Cells are separated with | and +rows are separated with |- as can be seen in the following +example: +<nowiki> +{| +|a||b +|- +|c||d +|} +</nowiki> +{| +|a||b +|- +|c||d +|} +Headings for columns and rows can be made by using an exclamation point, !, +rather than a vertical bar |. For example, +<nowiki> +{| +!a!!b +|- +|c|d +|} +</nowiki> +{| +!a!!b +|- +|c|d +|} +Captions can be added using the + symbol: +<nowiki> +{| +|+ My Caption +!a!!b +|- +|c|d +|} +</nowiki> +{| +|+ My Caption +!a!!b +|- +|c|d +|} +Finally, you can put a CSS class or style attributes (or both) on the first line +of the table to further control how it looks: +<nowiki> +{| class="wikitable" +|+ My Caption +!a!!b +|- +|c|d +|} +</nowiki> +{| class="wikitable" +|+ My Caption +!a!!b +|- +|c|d +|} +Within a cell attributes like align, valign, styles, and class can be used. 
For
+example,
+<nowiki>
+{|
+| style="text-align:right;"| a| b
+|-
+| lalala | lalala
+|}
+</nowiki>
+{|
+| style="text-align:right;"| a| b
+|-
+| lalala | lalala
+|}
+
+==Math==
+
+Math can be included into a wiki document by either using the math tag:
+<nowiki>
+<math>
+\sum_{i=1}^{n} i = \frac{(n+1)(n)}{2}
+</math>
+</nowiki>
+
+<math>
+\sum_{i=1}^{n} i = \frac{(n+1)(n)}{2}
+</math>
+
+==Adding Resources to a Page==
+
+Yioop wiki syntax supports adding search bars, audio, images, and video to a
+page. The magnifying glass edit tool icon can be used to add a search bar via
+the GUI. This can also be added by hand with the syntax:
+<nowiki>
+{{search:default|size:small|placeholder:Search Placeholder Text}}
+</nowiki>
+This syntax is split into three parts each separated by a vertical bar |. The
+first part search:default means results from searches should come from the
+default search index. You can replace default with the timestamp of a specific
+index or mix if you do not want to use the default. The second group size:small
+indicates the size of the search bar to be drawn. Choices of size are small,
+medium, and large. Finally, placeholder:Search Placeholder Text indicates the
+grayed out background text in the search input before typing is done should
+read: Search Placeholder Text. Here is what the above code outputs:
+
+{{search:default|size:small|placeholder:Search Placeholder Text}}
+
+Image, video and other media resources can be associated with a page by dragging
+and dropping them in the edit textarea or by clicking on the click to select
+link in the gray box below the textarea. This would add wiki code such as
+
+<pre>
+( (resource:myphoto.jpg|Resource Description))
+</pre>
+
+to the page. Only saving the page will save this code and upload the resource to
+the server. In the above myphoto.jpg is the resource that will be inserted and
+Resource Description is the alternative text to use in case the viewing browser
+cannot display jpg files.
A list of media that have already been associated with
+a page appears under the Page Resource heading below the textarea. This
+table allows the user to rename and delete resources as well as insert the
+same resource at multiple locations within the same document. To add a resource
+from a different wiki page belonging to the same group to the current wiki
+page one can use a syntax like:
+
+<pre>
+( (resource:Documentation:ConfigureScreenForm1.png|The work directory form))
+</pre>
+
+Here Documentation would be the page and ConfigureScreenForm1.png the resource.
+
+==Page Settings, Page Type==
+
+In edit mode for a wiki page, next to the page name, is a link [Settings].
+Clicking this link expands a form which can be used to control global settings
+for a wiki page. This form contains a drop down for the page type, another
+drop down for the type of border for the page in non-logged in mode,
+a checkbox for whether a table of contents should be auto-generated from level
+two and level three headings and then text
+fields or areas for the page title, author, meta robots, and page description.
+Beneath this one can specify another wiki page to be used as a header for this
+page and also specify another wiki page to be used as a footer for this page.
+
+The contents of the page title are displayed in the browser title when the
+wiki page is accessed with the Activity Panel collapsed or when not logged in.
+Similarly, in the collapsed or not logged in mode, if one looks at the HTML
+page source for the page, in the head of document, <meta> tags for author,
+robots, and description are set according to these fields. These fields can
+be useful for search engine optimization. The robots meta tag can be
+used to control how search engine robots index the page. Wikipedia has more
+information on
+[[https://en.wikipedia.org/wiki/Meta_element|Meta Elements]].
+
+The '''Standard''' page type treats the page as a usual wiki page.
+
+'''Page Alias''' type redirects the current page to another page name. This can
+be used to handle things like different names for the same topic or to do
+localization of pages. For example, if you switch the locale from English to
+French and you were on the wiki page dental_floss when you switch to French
+the article dental_floss might redirect to the page dentifrice.
+
+'''Media List''' type means that the page, when read, should display just the
+resources in the page as a list of thumbnails and links. These links for the
+resources go to separate pages used to display these resources.
+This kind of page is useful for a gallery of
+images or a collection of audio or video files.
+
+'''Presentation''' type is for a wiki page whose purpose is a slide
+presentation. In this mode,
+....
+on a line by itself is used to separate one slide from the next. If the
+presentation type is selected, a new slide icon appears in the wiki edit bar
+allowing one to easily add new slides.
+When the Activity panel is not collapsed and you are reading a presentation,
+it just displays as a single page with all slides visible. Collapsing the
+Activity panel presents the slides as a typical slide presentation using the
+[[www.w3.org/Talks/Tools/Slidy2/Overview.html|Slidy]] javascript.
+EOD;
+$public_pages["en-US"]["bot"] = <<< 'EOD'
+title=Bot
+
+description=Describes the web crawler used with this
+web site
+END_HEAD_VARS
+==My Web Crawler==
+
+Please Describe Your Robot
+EOD;
+$public_pages["en-US"]["captcha_time_out"] = <<< 'EOD'
+title=Captcha/Recover Time Out
+END_HEAD_VARS
+==Account Timeout==
+
+A large number of captcha refreshes or recover password requests
+have been made from this IP address. Please wait until
+%s to try again.
+EOD; +$public_pages["en-US"]["coding"] = <<< 'EOD' +page_type=page_alias + +page_alias=coding + +page_border=solid-border + +toc=true + +title= + +author= + +robots= + +description= + +page_header= + +page_footer= + +END_HEAD_VARS +EOD; +$public_pages["en-US"]["documentation"] = <<< 'EOD' +page_type=page_alias + +page_alias=Documentation + +page_border=solid-border + +toc=true + +title= + +author= + +robots= + +description= + +page_header= + +page_footer= + +END_HEAD_VARS +EOD; +$public_pages["en-US"]["downloads"] = <<< 'EOD' +page_type=page_alias + +page_alias=Downloads + +page_border=solid-border + +toc=true + +title= + +author= + +robots= + +description= + +page_header= + +page_footer= + +END_HEAD_VARS +EOD; +$public_pages["en-US"]["install"] = <<< 'EOD' +page_type=page_alias + +page_alias=Install + +page_border=solid-border + +toc=true + +title= + +author= + +robots= + +description= + +page_header= + +page_footer= + +END_HEAD_VARS +EOD; +$public_pages["en-US"]["main_footer"] = <<< 'EOD' +page_type=standard + +page_alias= + +page_border=solid-border + +toc=true + +title= + +author= + +robots= + +description= + +page_header= + +page_footer= + +END_HEAD_VARS{{class="footer" +(c) 2015 Seekquarry, LLC - [[http://www.seekquarry.com/|Open Source Search Engine Software]]. [[About|About Seekquarry]]. 
+}} +EOD; +$public_pages["en-US"]["main_header"] = <<< 'EOD' +page_type=standard + +page_alias= + +page_border=solid-border + +toc=true + +title= + +author= + +robots= + +description= + +page_header= + +page_footer= + +END_HEAD_VARS{{class="logo" [[Main|((resource:SeekQuarry.png|SeekQuarry))]]}} {{class="nav inline medium-font" +* [[Demos]] +* [[Downloads]] +* [[Documentation]] +* [[Resources]] +* {{search:default|size:small|placeholder:Search}} + +}} +EOD; +$public_pages["en-US"]["privacy"] = <<< 'EOD' +title=Privacy Policy + +description=Describes what information this site collects and retains about +users and how it uses that information +END_HEAD_VARS +==We are concerned with your privacy== +EOD; +$public_pages["en-US"]["ranking"] = <<< 'EOD' +page_type=page_alias + +page_alias=Ranking + +page_border=solid-border + +toc=true + +title= + +author= + +robots= + +description= + +page_header= + +page_footer= + +END_HEAD_VARS +EOD; +$public_pages["en-US"]["register_time_out"] = <<< 'EOD' +title=Create/Recover Account + +END_HEAD_VARS + +==Account Timeout== + +A number of incorrect captcha responses or recover password requests +have been made from this IP address. Please wait until +%s to access this site. +EOD; +$public_pages["en-US"]["resources"] = <<< 'EOD' +page_type=page_alias + +page_alias=Resources + +page_border=solid-border + +toc=true + +title= + +author= + +robots= + +description= + +page_header= + +page_footer= + +END_HEAD_VARS +EOD; +$public_pages["en-US"]["suggest_day_exceeded"] = <<< 'EOD' + +EOD; +$public_pages["en-US"]["terms"] = <<< 'EOD' +=Terms of Service= + +Please write the terms for the services provided by this website. 
+
+EOD;
+//
+// Default Help Wiki Pages
+//
+$help_pages = array();
+$help_pages["en-US"]["Account_Registration"] = <<< EOD
+page_type=standard
+
+page_alias=
+
+page_border=solid-border
+
+toc=true
+
+title=Account Registration
+
+author=
+
+robots=
+
+description=
+
+page_header=
+
+page_footer=
+
+END_HEAD_VARSThe Account Registration field-set is used to control how users can obtain accounts on a Yioop installation.
+
+The dropdown at the start of this fieldset allows you to select one of four
+possibilities:
+* '''Disable Registration''', users cannot register themselves, only the root
+account can add users.
+When Disable Registration is selected, the Suggest A Url form and link on
+the tool.php page is disabled as well; for all other registration types this
+link is enabled.
+* '''No Activation''', user accounts are immediately activated once a user
+signs up.
+* '''Email Activation''', after registering, users must click on a link which
+comes in a separate email to activate their accounts.
+If Email Activation is chosen, then the rest of this field-set can be used
+to specify the email address from which the activation email is sent to the
+user. The checkbox Use
+PHP mail() function controls whether to use the mail function in PHP to send
+the mail; this only works if mail can be sent from the local machine.
+Alternatively, if this is not checked, as in the image above, one can
+configure an outgoing SMTP server to send the email through.
+* '''Admin Activation''', after registering, an admin account must activate
+the user before the user is allowed to use their account.
+EOD;
+$help_pages["en-US"]["Ad_Server"] = <<< EOD
+page_type=standard
+
+page_border=solid-border
+
+title=Ad Server
+
+END_HEAD_VARS* The Ad Server field-set is used to control whether, where,
+and what external advertisements should be displayed by this Yioop instance.
+
+EOD;
+$help_pages["en-US"]["Add_Locale"] = <<< EOD
+page_type=standard
+
+page_border=solid-border
+
+toc=true
+
+title=Add Locale
+
+description=Help article describing how to add a Locale.
+
+END_HEAD_VARS==Adding a Locale==
+
+The Manage Locales activity can be used to configure Yioop for use with
+different languages and for different regions.
+
+* The first form on this activity allows you to create a new &quot;Locale&quot;
+-- an object representing a language and a region.
+* The first field on this form should be filled in with a name for the locale in
+the language of the locale.
+* So for French you would put Fran&ccedil;ais. The locale tag should be the
+IETF language tag.
+EOD;
+$help_pages["en-US"]["Adding_Examples_to_a_Classifier"] = <<< EOD
+page_type=standard
+
+page_alias=
+
+page_border=solid-border
+
+toc=true
+
+title=
+
+author=
+
+robots=
+
+description=
+
+page_header=
+
+page_footer=
+
+END_HEAD_VARSTo train a classifier one needs to add positive and negative examples of the concept that is to be learned. One way to add positive (negative) examples is to select an existing crawl and then mark that all (respectively, none) of its pages are in the class using the drop down below.
+
+<br />
+
+Another way to give examples is to pick an existing crawl and leave the dropdown set to label by hand. Then type some keywords to search for in the crawl you picked using the '''Keyword''' textfield and click '''Load'''. This will bring up a list of search results together with links '''In Class''', '''Not in Class''', and '''Skip'''. These can then be used to add positive or negative examples.
+
+<br />
+
+When you are done adding examples, click '''Finalize''' to have Yioop actually build the classifier based on your training.
+
+EOD;
+$help_pages["en-US"]["Allowed_to_Crawl_Sites"] = <<< EOD
+page_type=standard
+
+page_alias=
+
+page_border=solid-border
+
+toc=true
+
+title=
+
+author=
+
+robots=
+
+description=
+
+page_header=
+
+page_footer=
+
+END_HEAD_VARS'''Allowed to Crawl Sites''' is a list of urls (one-per-line) and domains that the crawler is allowed to crawl. Only pages that are on sub-sites of the urls listed here will be crawled.
+
+<br />
+
+This textarea is only used in determining what can be crawled if '''Restrict Sites By Url''' is checked.
+
+<br />
+
+A line like:
+<pre>
+ http://www.somewhere.com/foo/
+</pre>
+would allow the url
+<pre>
+ http://www.somewhere.com/foo/goo.jpg
+</pre>
+to be crawled.
+
+<br />
+
+A line like:
+<pre>
+ domain:foo.com
+</pre>
+would allow the url
+<pre>
+ http://a.b.c.foo.com/blah/
+</pre>
+to be crawled.
+EOD;
+$help_pages["en-US"]["Arc_and_Re-crawls"] = <<< EOD
+page_type=standard
+
+page_alias=
+
+page_border=solid-border
+
+toc=true
+
+title=
+
+author=
+
+robots=
+
+description=
+
+page_header=
+
+page_footer=
+
+END_HEAD_VARS'''Crawl or Arc Folder to Re-index''' dropdown allows one to select a previous Yioop crawl or an archive to do another crawl of. Possible archives that can be indexed include Arc files, Warc files, Email, Database dumps, Open Directory RDF dumps, MediaWiki dumps, etc. Re-crawling an old crawl might be useful if you would like to do further processing of the records in the index. Besides containing previous crawls, the dropdown list is populated by looking at the WORK_DIRECTORY/archives folder for sub-folders containing an arc_description.ini file.
+
+<br />
+
+{{right|[[https://www.seekquarry.com/?c=static&p=Documentation#Archive%20Crawl%20Options| Learn More.]]}}
+
+EOD;
+$help_pages["en-US"]["Authentication_Type"] = <<< EOD
+page_type=standard
+
+page_border=solid-border
+
+title=Authentication Type
+
+END_HEAD_VARSThe Authentication Type field-set is used to control the protocol
+used to log people into Yioop.
+
+* Below is a list of Authentication types supported.
+** '''Normal Authentication''', passwords are checked against stored
+salted hashes of the password; or
+** '''ZKP (zero knowledge protocol) authentication''', the server picks
+challenges at random and sends these to the browser the person is logging in
+from; the browser computes, based on the password, an appropriate response
+according to the Fiat-Shamir protocol. The password is never sent over the
+internet and is not stored on the server. These are the main advantages of
+ZKP; its drawback is that it is slower than Normal Authentication, as proving
+who you are with a low probability of error requires several browser-server
+exchanges.
+
+* You should choose which authentication scheme you want before you create many
+users, because if you switch, everyone will need to get a new password.
+EOD;
+$help_pages["en-US"]["Browse_Groups"] = <<< EOD
+page_type=standard
+page_border=solid-border
+toc=true
+title=Browse Groups
+END_HEAD_VARS==Creating or Joining a Group==
+You can create or join a group all in one place using this text field.
+Simply enter the name of the group you want to create or join. If the group
+name already exists, you will simply join the group. If the group name doesn't
+exist, you will be presented with more options to customize and create your
+new group.
+==Browse Existing Groups==
+You can use the [Browse] hyperlink to browse the existing groups.
+You will then be presented with a web form to narrow your search followed by
+a list of all the groups visible to you beneath.
+{{right|[[https://www.seekquarry.com/?c=static&p=Documentation#Managing%20Users,%20Roles,%20and%20Groups| Learn More..]]}}
+EOD;
+$help_pages["en-US"]["Captcha_Type"] = <<< EOD
+page_type=standard
+
+page_alias=
+
+page_border=solid-border
+
+toc=true
+
+title=Captcha Type
+
+author=
+
+robots=
+
+description=
+
+page_header=
+
+page_footer=
+
+END_HEAD_VARSThe Captcha Type field-set controls what kind of
+[[https://en.wikipedia.org/wiki/CAPTCHA|captcha]] will be used during account
+registration, password recovery, and if a user wants to suggest a url.
+
+* The choices for captcha are:
+** '''Text Captcha''', the user has to select from a series of dropdown answers
+to questions of the form: ''Which in the following list is the most/largest/etc.?
+or Which in the following list is the least/smallest/etc.?''
+** '''Graphic Captcha''', the user needs to enter a sequence of characters from
+a distorted image;
+** '''Hash Captcha''', the user's browser (the user doesn't need to do anything)
+needs to extend a random string with additional characters to get a string
+whose hash begins with a certain lead set of characters.
+
+Of these, Hash Captcha is probably the least intrusive, but it requires
+Javascript and might run slowly on older browsers. A text captcha might be used
+to test domain expertise of the people who are registering for an account.
+Finally, the graphic captcha is probably the one people are most familiar with.
+EOD;
+$help_pages["en-US"]["Changing_the_Classifier_Label"] = <<< EOD
+page_type=standard
+
+page_alias=
+
+page_border=solid-border
+
+toc=true
+
+title=
+
+author=
+
+robots=
+
+description=
+
+page_header=
+
+page_footer=
+
+END_HEAD_VARSThe label of a classifier determines what meta-words will be added to pages that have that concept.
+
+<br />
+
+If the label is foo, and the foo classifier is used in a crawl, then pages which have the foo property
+will have the meta-word class:foo added to the list of words that are indexed.
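The labeling rule just described can be sketched in a few lines of PHP. This is only an illustration (the function name is made up, and Yioop's real indexing code is more involved):

```php
<?php
// Illustrative sketch only -- not Yioop's actual indexing code.
// If a page is judged to have the classifier's property, the
// meta-word class:label is appended to the page's indexed words.
function addClassifierMetaWord($label, $has_property, $words)
{
    if ($has_property) {
        $words[] = "class:" . $label;
    }
    return $words;
}
```

So, under this sketch, a page with the foo property whose indexed words were ['hello'] would come back as ['hello', 'class:foo'].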
+EOD;
+$help_pages["en-US"]["Crawl_Mixes"] = <<< EOD
+page_type=standard
+
+page_alias=
+
+page_border=solid-border
+
+toc=true
+
+title=
+
+author=
+
+robots=
+
+description=
+
+page_header=
+
+page_footer=
+
+END_HEAD_VARSA '''Crawl Mix''' allows one to combine several crawl indexes into one to better customize search results. This page allows one to either create a new crawl mix or find and edit an existing one. The list of crawl mixes is user dependent -- each user can create their own mixes of crawls that exist on the Yioop system.
+
+<br />
+
+Clicking '''Share''' on a crawl mix allows a user to post their crawl mix to a group's feed. Users of that group can then import this crawl mix into their own list of mixes by clicking on it.
+
+<br />
+
+Clicking '''Set as Index''' on a crawl mix means that by default the given crawl mix will be used to serve search results for this site.
+EOD;
+$help_pages["en-US"]["Crawl_Order"] = <<< EOD
+page_type=standard
+
+page_alias=
+
+page_border=solid-border
+
+toc=true
+
+title=
+
+author=
+
+robots=
+
+description=
+
+page_header=
+
+page_footer=
+
+END_HEAD_VARS'''Crawl Order''' controls how the crawl determines what to crawl next.
+
+<br />
+
+'''Breadth-first Search''' means that Yioop first crawls the seed sites, followed by those
+sites directly linked to the seed sites, followed by those directly linked to sites directly linked
+to seed sites, etc.
+
+<br />
+
+'''Page Importance''' gives each seed site an initial amount of cash. Yioop then crawls the seed sites. A given crawled page has its cash split amongst the sites that it links to, based on the link quality and whether each has been crawled yet. The sites with the most cash are crawled next, and this process is continued.
+EOD;
+$help_pages["en-US"]["Create_Group"] = <<< EOD
+page_type=standard
+
+page_border=solid-border
+
+title=Create Group
+
+END_HEAD_VARS''You will get to this form when the Group Name is available to
+create a new Group.
''
+----
+
+The '''Name''' field is used to specify the name of the new Group.
+<br />
+The '''Register''' dropdown says how other users are allowed to join the group:
+* <u>No One</u> means no other user can join the group (you can still invite
+other users).
+* <u>By Request</u> means that other users can request that the group owner let
+them join the group.
+* <u>Anyone</u> means all users are allowed to join the group.
+<br />
+The '''Access''' dropdown controls how users who belong/subscribe to a group
+other than the owner can access that group.
+* <u>No Read</u> means that a non-owner member of the group cannot read or
+write the group news feed and cannot read the group wiki.
+* <u>Read</u> means that a non-owner member of the group can read the group
+news feed and the group's wiki page.
+* <u>Read Comment</u> means that a non-owner member of the group can read the
+group feed and wikis and can comment on any existing threads, but cannot start
+new ones.
+* <u>Read Write</u> means that a non-owner member of the group can start new
+threads and comment on existing ones in the group feed and can edit and create
+wiki pages for the group's wiki.
+The '''Voting''' dropdown specifies the kind of voting allowed in the new group:
+* + voting allows users to vote posts up;
+* - voting allows users to vote posts down;
+* +/- voting allows users to vote posts both up and down.
+'''Post Life time''' specifies how long the posts should be kept.
+EOD;
+$help_pages["en-US"]["Database_Setup"] = <<< EOD
+page_type=standard
+
+page_border=solid-border
+
+title=Database Setup
+
+END_HEAD_VARSThe database is used to store information about what users are
+allowed to use the admin panel and what activities and roles these users have.
+* The Database Set-up field-set is used to specify what database management
+system should be used, how it should be connected to, and what user name and
+password should be used for the connection.
+
+* Supported Databases
+** PDO (PHP's generic DBMS interface).
+** Sqlite3 Database.
+** Mysql Database.
+
+* Unlike many database systems, if an sqlite3 database is being used then the
+connection is always a file on the current filesystem and there is no notion of
+login and password, so in this case only the name of the database is asked for.
+For sqlite, the database is stored in WORK_DIRECTORY/data.
+
+* For single user settings with a limited number of news feeds, sqlite is
+probably the most convenient database system to use with Yioop. If you think you
+are going to make use of Yioop's social functionality and have many users,
+feeds, and crawl mixes, using a system like Mysql or Postgres might be more
+appropriate.
+EOD;
+$help_pages["en-US"]["Disallowed_and_Sites_With_Quotas"] = <<< EOD
+page_type=standard
+
+page_alias=
+
+page_border=solid-border
+
+toc=true
+
+title=
+
+author=
+
+robots=
+
+description=
+
+page_header=
+
+page_footer=
+
+END_HEAD_VARS'''Disallowed to Crawl Sites''' are urls or domains (listed one-per-line) that Yioop should not crawl.
+
+<br />
+
+A line like:
+<pre>
+ http://www.somewhere.com/foo/
+</pre>
+would disallow the url
+<pre>
+ http://www.somewhere.com/foo/goo.jpg
+</pre>
+from being crawled.
+
+<br />
+
+A line like:
+<pre>
+ domain:foo.com
+</pre>
+would disallow the url
+<pre>
+ http://a.b.c.foo.com/blah/
+</pre>
+from being crawled.
+<br />
+
+'''Sites with Quotas''' are urls or domains that Yioop should crawl at most some fixed number of urls from in an hour. These are listed in the same text area as Disallowed to Crawl Sites. To indicate the quota, one lists after the url a fragment #some_number. For example,
+<pre>
+ http://www.yelp.com/#100
+</pre>
+would restrict crawling of urls from Yelp to 100/hour.
+EOD;
+$help_pages["en-US"]["Discover_Groups"] = <<< EOD
+page_type=standard
+
+page_border=solid-border
+
+toc=true
+
+title=Discover Groups
+
+END_HEAD_VARSThe '''Name''' field is used to specify the name of the Group to
+search for.
+The '''Owner''' field lets you search for a Group using its owner's name.
+<br />
+The '''Register''' dropdown says how other users are allowed to join the group:
+* <u>No One</u> means no other user can join the group (you can still invite
+other users).
+* <u>By Request</u> means that other users can request that the group owner let
+them join the group.
+* <u>Anyone</u> means all users are allowed to join the group.
+<br />
+''It should be noted that the root account can always join any group.
+The root account can also always take over ownership of any group.''
+<br />
+The '''Access''' dropdown controls how users who belong/subscribe to a group
+other than the owner can access that group.
+* <u>No Read</u> means that a non-owner member of the group cannot read or
+write the group news feed and cannot read the group wiki.
+* <u>Read</u> means that a non-owner member of the group can read the group
+news feed and the group's wiki page.
+* <u>Read Comment</u> means that a non-owner member of the group can read the
+group feed and wikis and can comment on any existing threads, but cannot start
+new ones.
+* <u>Read Write</u> means that a non-owner member of the group can start new
+threads and comment on existing ones in the group feed and can edit and create
+wiki pages for the group's wiki.
+<br />
+The access to a group can be changed by the owner after a group is created.
+* <u>No Read</u> and <u>Read</u> are often suitable if a group's owner wants to
+perform some kind of moderation.
+* <u>Read</u> and <u>Read Comment</u> groups are often suitable if someone wants
+to use a Yioop Group as a blog.
+* <u>Read Write</u> makes sense for a more traditional bulletin board.
+EOD;
+$help_pages["en-US"]["Editing_Locales"] = <<< EOD
+page_type=standard
+
+page_alias=
+
+page_border=solid-border
+
+toc=true
+
+title=
+
+author=
+
+robots=
+
+description=
+
+page_header=
+
+page_footer=
+
+END_HEAD_VARSThe '''Edit Locale''' form can be used to specify how various message strings in Yioop are translated in different languages.
+
+The table below has two columns: a column of string identifiers and a column of translations. A string identifier refers to a location in the code marked as needing to be translated; the corresponding translation in that row is how it should be translated for the current locale. Identifiers typically specify the code file in which the identifier occurs. For example, the identifier
+ serversettings_element_name_server
+would appear in the file views/elements/server_settings.php . To see where this identifier occurs one could open that file and search for this string.
+
+If no translation exists yet for an identifier, the translation value for that row will appear in red. Hovering the mouse over this red field will show the translation of this field in the default locale (usually English).
+
+The '''Show''' dropdown allows one to show either all identifiers or just those missing translations. The filter field lets one see only identifiers that contain the filter as a substring.
+EOD;
+$help_pages["en-US"]["Editing_a_Crawl_Mix"] = <<< EOD
+page_type=standard
+
+page_alias=
+
+page_border=solid-border
+
+toc=true
+
+title=
+
+author=
+
+robots=
+
+description=
+
+page_header=
+
+page_footer=
+
+END_HEAD_VARSA crawl mix is built out of a list of '''search result fragments'''.
+
+<br />
+
+A fragment has a '''Results Shown''' dropdown which specifies up to how many results that given fragment is responsible for. If one had three fragments, the first with this value set to 1, the next with it set to 5, and the last set to whatever, then on a query Yioop will try to get the first result from the first fragment, up to the next five results from the next fragment, and all remaining results from the last fragment. If a given fragment doesn't produce results, the search engine skips to the next fragment.
+
+<br />
+
+The '''Add Crawls''' dropdown can be used to add a crawl to the given fragment. Several crawl indexes can be added to a given fragment.
When search results are computed for the fragment, the search is performed on all of these indexes and a score for each result is determined. The '''Weight''' dropdown can then be set to specify how important a given index's score for a result should be in the total score of a search result. The top total scores are then returned by the fragment. If, when performing the search on a given index, you would like additional terms to be added to the query, these can be specified in the '''Keywords''' field.
+
+
+EOD;
+$help_pages["en-US"]["Filtering_Search_Results"] = <<< EOD
+page_type=standard
+
+page_alias=
+
+page_border=solid-border
+
+toc=true
+
+title=
+
+author=
+
+robots=
+
+description=
+
+page_header=
+
+page_footer=
+
+END_HEAD_VARS==Filter Websites From Results Form==
+The textarea in this form is used to list hosts, one per line, which are to be removed from any search result page in which they might appear. Lines in the textarea must be hostnames, not general urls. Listing a host name like:
+<pre>
+ http://www.cs.sjsu.edu/
+</pre>
+would prevent any urls from this site from appearing in search results. So, for example, the URL
+<pre>
+ http://www.cs.sjsu.edu/faculty/pollett/
+</pre>
+would be prevented from appearing in search results.
+EOD;
+$help_pages["en-US"]["Indexing_Plugins"] = <<< EOD
+page_type=standard
+
+page_alias=
+
+page_border=solid-border
+
+toc=true
+
+title=
+
+author=
+
+robots=
+
+description=
+
+page_header=
+
+page_footer=
+
+END_HEAD_VARS'''Indexing Plugins''' are additional indexing processors that a document can be made to go through during the indexing process. Users who know how to code can create their own plugins using the plugin API. Plugins can be used to extract new "micro-documents" from a given document, do clustering, or control the indexing or non-indexing of web pages based on their content.
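As a rough illustration of the kind of processing a plugin might do, consider the sketch below. The class and method names here are made up for illustration; they are not the actual Yioop plugin API.

```php
<?php
// Illustrative sketch only -- not the actual Yioop plugin API.
// A plugin examines a page during indexing and may extract
// "micro-documents" describing content found on the page.
class ExamplePlugin
{
    // Return extracted micro-documents for a page, if any.
    public function pageProcessing($page_text, $url)
    {
        $micro_documents = [];
        if (stripos($page_text, "ingredients") !== false) {
            // Treat pages mentioning ingredients as recipe documents
            $micro_documents[] = ["url" => $url, "type" => "recipe"];
        }
        return $micro_documents;
    }
}
```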
+
+<br />
+
+The table below allows a user to select and configure which plugins should be used in the current crawl.
+
+<br />
+
+
+{{right|[[http://www.seekquarry.com/?c=static&p=Documentation#Page%20Indexing%20and%20Search%20Options|Learn More..]]}}
+EOD;
+$help_pages["en-US"]["Kinds_of_Summarizers"] = <<< EOD
+page_type=standard
+
+page_alias=
+
+page_border=solid-border
+
+toc=true
+
+title=
+
+author=
+
+robots=
+
+description=
+
+page_header=
+
+page_footer=
+
+END_HEAD_VARSYioop uses a '''summarizer''' to extract, from a downloaded or otherwise acquired document, text that it will add to its index. This text is also used for search result snippet generation. Only terms which appear in this summary can be used to look up a document.
+
+<br />
+
+The <b>Basic</b> summarizer tries to pick text from an ad hoc list of presumed important places in a web document until it has gotten the desired amount of text for a summary. For example, it might try to get text from title tags, h1 tags, etc., before trying to get it from paragraph tags.
+
+<br />
+
+The <b>Centroid</b> summarizer splits a document into "sentence" units. It then computes an "average" sentence for the document. It then adds sentences to the summary in order of how close they are to this average until the desired amount of text has been acquired.
+EOD;
+$help_pages["en-US"]["Locale_List"] = <<< EOD
+page_type=standard
+
+page_alias=
+
+page_border=solid-border
+
+toc=true
+
+title=Locale List
+
+author=
+
+robots=
+
+description=
+
+page_header=
+
+page_footer=
+
+END_HEAD_VARSBeneath the Add Locale form is a table listing some of the current
+locales.
+
+
+* The Show dropdown lets you control how many of these locales are displayed in
+one go.
+* The Search link lets you bring up an advanced search form to search for
+particular locales and also allows you to control the direction of the listing.
+
+The Locale List table:
+* The first column in the table has a link with the name of the locale.
+Clicking on this link brings up a page where one can edit the strings for that
+locale.
+* The next three columns of the Locale List table give the locale tag,
+whether users can use that locale in Settings, and the writing
+direction of the locale; this is followed by the percent of strings translated.
+* The Edit link in the column lets you edit the locale tag, enabled status, and
+text direction of a locale.
+* Finally, clicking the Delete link lets one delete a locale and all
+its strings.
+EOD;
+$help_pages["en-US"]["Locale_Writing_Mode"] = <<< EOD
+page_type=standard
+
+page_border=solid-border
+
+title=Locale Writing Mode
+
+END_HEAD_VARSThe last field on the form is to specify how the language is
+written. There are four options:
+# lr-tb -- from left-to-right from the top of the page to the bottom as in
+English.
+# rl-tb -- from right-to-left from the top of the page to the bottom as in
+Hebrew and Arabic.
+# tb-rl -- from the top of the page to the bottom from right-to-left as in
+Classical Chinese.
+# tb-lr -- from the top of the page to the bottom from left-to-right as in
+non-cyrillic Mongolian or American Sign Language.
+
+''lr-tb and rl-tb support work better than the vertical language support. As of
+this writing, Internet Explorer and WebKit based browsers (Chrome/Safari) have
+some vertical language support, and the Yioop stylesheets for vertical languages
+still need some tweaking. For information on the status in Firefox, check out
+this writing mode bug.''
+EOD;
+$help_pages["en-US"]["Machine_Information"] = <<< EOD
+page_type=standard
+
+page_alias=
+
+page_border=solid-border
+
+toc=true
+
+title=
+
+author=
+
+robots=
+
+description=
+
+page_header=
+
+page_footer=
+
+END_HEAD_VARS'''Machine Information''' shows the machines currently known to this Yioop instance.
+
+<br />
+
+This list always begins with the '''Name Server''' itself and a toggle to control whether or not the Media Updater process is running on the Name Server.
This allows you to control whether or not Yioop attempts to update its RSS (or Atom) search sources on an hourly basis. Yioop also uses the Media Updater to convert videos that have been uploaded into mp4 and webm if ffmpeg is installed.
+
+<br />
+
+There is also a link to the log file of the Media Updater process. Under the Name Server information is a dropdown that can be used to control the number of current machine statuses that are displayed for all other machines that have been added. It also might have next and previous arrow links to go through the currently available machines.
+
+<br />
+
+{{right|[[https://www.seekquarry.com/?c=static&p=Documentation#GUI%20for%20Managing%20Machines%20and%20Servers| Learn More.]]}}
+EOD;
+$help_pages["en-US"]["Manage_Machines"] = <<< EOD
+page_type=standard
+
+page_alias=
+
+page_border=solid-border
+
+toc=true
+
+title=
+
+author=
+
+robots=
+
+description=
+
+page_header=
+
+page_footer=
+
+END_HEAD_VARS'''Add Machine''' allows you to add a new machine to be controlled by this Yioop instance.
+
+<br />
+
+The '''Machine Name''' field lets you give this machine an easy-to-remember name. The '''Machine URL''' field should be filled in with the URL of the installed Yioop instance.
+
+<br />
+
+The '''Mirror''' check-box says whether you want the given Yioop installation to act as a mirror for another Yioop installation. Checking it will reveal a drop-down menu that allows you to choose which installation amongst the previously entered machines you want to mirror.
+
+<br />
+
+The '''Has Queue Server''' check-box is used to say whether the given Yioop installation will be running a queue server or not.
+
+<br />
+
+Finally, the '''Number of Fetchers''' drop-down allows you to say how many fetcher instances you want to be able to manage for that machine.
+
+<br />
+
+{{right|[[https://www.seekquarry.com/?c=static&p=Documentation#GUI%20for%20Managing%20Machines%20and%20Servers|Learn More..]]}}
+EOD;
+$help_pages["en-US"]["Media_Sources"] = <<< EOD
+page_type=standard
+
+page_alias=
+
+page_border=solid-border
+
+toc=true
+
+title=
+
+author=
+
+robots=
+
+description=
+
+page_header=
+
+page_footer=
+
+END_HEAD_VARS'''Media Sources''' are used to specify how Yioop should handle video and news sites.
+
+<br />
+
+A '''Video source''' is used to specify where to find the thumbnail of a video given the url of the video on a website. This is used by Yioop when displaying search results containing the video link to show the thumbnail. For example, if the Url value is
+ http://www.youtube.com/watch?v={}
+and the Thumb value is
+ http://i1.ytimg.com/vi/{}/default.jpg,
+this tells Yioop that if a search result contains something like
+<pre>
+ https://www.youtube.com/watch?v=dQw4w9WgXcQ
+</pre>
+then the thumbnail can be found at
+<pre>
+ http://i1.ytimg.com/vi/dQw4w9WgXcQ/default.jpg
+</pre>
+
+An '''RSS media source''' can be used to add an RSS or Atom feed (it auto-detects which kind) to the list of feeds which are downloaded hourly when Yioop's Media Updater is turned on. Besides the name, you need to specify the URL of the feed in question.
+
+<br />
+
+An '''HTML media source''' is a web page that has news articles, like an RSS page, that you want the Media Updater to scrape on an hourly basis. To specify where in the HTML page the news items appear, you specify different XPath information. For example,
+<pre>
+ Name: Cape Breton Post
+ URL: http://www.capebretonpost.com/News/Local-1968
+ Channel: //div[contains(@class, "channel")]
+ Item: //article
+ Title: //a
+ Description: //div[contains(@class, "dek")]
+ Link: //a
+</pre>
+The Channel field is used to specify the tag that encloses all the news items. Relative to this as the root tag, //article says the path to an individual news item.
Then relative to an individual news item, //a gets the title, etc. Link extracts the href attribute of that same //a .
+
+<br />
+
+Not all RSS feeds use the same tag to specify the image associated with a news item. The Image XPath allows you to specify, relative to a news item (either RSS or HTML), where an image thumbnail exists. If a site does not use such a thumbnail, one can prefix the path with ^ to give the path, relative to the root of the whole file, to where a thumbnail for the news source exists. Yioop automatically removes escaping from RSS containing escaped HTML when computing this. For example, the following works for the feed:
+<pre>
+ http://feeds.wired.com/wired/index
+ //description/div[contains(@class,
+ "rss_thumbnail")]/img/@src
+</pre>
+EOD;
+$help_pages["en-US"]["Name_Server_Setup"] = <<< EOD
+page_type=standard
+
+page_alias=
+
+page_border=solid-border
+
+toc=true
+
+title=
+
+author=
+
+robots=
+
+description=
+
+page_header=
+
+page_footer=
+
+END_HEAD_VARSYioop can be run in a single machine or multi-machine setting. In a multi-machine setting, copies of the Yioop software would be on different machines. One machine, called the '''Name Server''', would be responsible for coordinating who crawls what between these machines. This fieldset allows the user to specify the url of the Name Server as well as a string (which should be the same amongst all machines using that name server) that will be used to verify that this machine is allowed to talk to the Name Server. In a single machine setting these settings can be left at their default values.
+
+<br />
+
+When someone enters a query into a Yioop set-up, they typically enter the query on the name server. The '''Use Filecache''' checkbox controls whether the query results are cached in a file so that they don't have to be recalculated when someone enters the same query again. The file cache is purged periodically so that it doesn't get too large.
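The idea behind a query file cache can be sketched as below. This is a minimal illustration of the concept only, not Yioop's actual implementation (which, as noted, also purges old entries); the function name and cache-file naming scheme are made up.

```php
<?php
// Minimal sketch of file-based query caching -- an illustration of
// the idea, not Yioop's actual implementation.
function cachedResults($query, $compute_results, $cache_dir)
{
    $file = $cache_dir . "/query_" . md5($query) . ".ser";
    if (file_exists($file)) {
        // Cache hit: reuse the previously computed results
        return unserialize(file_get_contents($file));
    }
    // Cache miss: compute the results and store them for next time
    $results = $compute_results($query);
    file_put_contents($file, serialize($results));
    return $results;
}
```

A second identical query then returns the stored file contents instead of recomputing the results.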
+EOD;
+$help_pages["en-US"]["Page_Byte_Ranges"] = <<< EOD
+page_type=standard
+
+page_alias=
+
+page_border=solid-border
+
+toc=true
+
+title=
+
+author=
+
+robots=
+
+description=
+
+page_header=
+
+page_footer=
+
+END_HEAD_VARS'''Byte Range to Download''' determines the maximum number of bytes that Yioop will download for a given page when crawling. Setting a maximum is important so that Yioop does not get stuck downloading very large files.
+
+<br />
+
+When Yioop shows the cached version of a URL, it shows only what it downloaded.
+EOD;
+$help_pages["en-US"]["Page_Classifiers"] = <<< EOD
+page_type=standard
+
+page_alias=
+
+page_border=solid-border
+
+toc=true
+
+title=
+
+author=
+
+robots=
+
+description=
+
+page_header=
+
+page_footer=
+
+END_HEAD_VARSClassifiers are used to say whether a page has or does not have a property. The '''Manage Classifiers''' activity lets you create and manage the classifiers for this Yioop system. Creating a classifier will take you to a page that lets you train the classifier against existing data, such as a crawl that has been indexed. Once you have a classifier, you can use it to add meta words for that concept to pages in future crawls by selecting it on the Page Options activity. You can also use classifiers to score documents for ranking purposes in search results; again, this can be done under the Page Options activity.
+EOD;
+$help_pages["en-US"]["Page_Grouping_Options"] = <<< EOD
+page_type=standard
+
+page_alias=
+
+page_border=solid-border
+
+toc=true
+
+title=
+
+author=
+
+robots=
+
+description=
+
+page_header=
+
+page_footer=
+
+END_HEAD_VARSThe '''Search Results Grouping''' controls allow you to control, for a search query, how many qualifying documents from an index to compute before trying to sort and rank them to find the top k results (here k is usually 10). In a multi-queue-server setting, the query is simultaneously asked by the name server machine of each of the queue server machines and the results are aggregated.
+
+<br />
+
+'''Minimum Results to Group''' controls the number of results the name server wants to have before sorting of results is done. When the name server requests documents from each queue server, it requests
+<br />
+&alpha; &times; (Minimum Results to Group)/(Number of Queue Servers) documents.
+
+<br />
+'''Server Alpha''' controls the number &alpha;.
+EOD;
+$help_pages["en-US"]["Page_Ranking_Factors"] = <<< EOD
+page_type=standard
+
+page_alias=
+
+page_border=solid-border
+
+toc=true
+
+title=
+
+author=
+
+robots=
+
+description=
+
+page_header=
+
+page_footer=
+
+END_HEAD_VARSIn computing the relevance of a word/term to a page, the fields on this form allow one to set the relative weight given to the word depending on whether it appears in the title, in a link, or anywhere
+else (the description).
+EOD;
+$help_pages["en-US"]["Page_Rules"] = <<< EOD
+page_type=standard
+
+page_alias=
+
+page_border=solid-border
+
+toc=true
+
+title=
+
+author=
+
+robots=
+
+description=
+
+page_header=
+
+page_footer=
+
+END_HEAD_VARS'''Page Field Extraction Rules''' are statements from a Yioop-specific indexing language which can be applied to the words in a summary page before it is stored in an index. Details on this language can be found in the [[http://www.seekquarry.com/?c=static&p=Documentation#Page%20Indexing%20and%20Search%20Options|Page Indexing and Search Options]] section of the Yioop Documentation.
+
+<br />
+
+The textarea below this heading can be used to list out which extraction rules should be used for the current crawl.
+EOD;
+$help_pages["en-US"]["Proxy_Server"] = <<< EOD
+page_type=standard
+
+page_alias=
+
+page_border=solid-border
+
+toc=true
+
+title=Proxy server
+
+author=
+
+robots=
+
+description=
+
+page_header=
+
+page_footer=
+
+END_HEAD_VARS* Yioop can make use of a proxy server to do web
+crawling.
+EOD;
+$help_pages["en-US"]["Proxy_server"] = <<< EOD
+page_type=standard
+
+page_border=solid-border
+
+title=Proxy server
+
+END_HEAD_VARS* The Proxy Server field-set is used to control which proxies to use while crawling. By default, Yioop does not use any proxies while crawling. A Tor Proxy can serve as a gateway to the Tor Network. Yioop can use this proxy to download .onion URLs on the [[https://en.wikipedia.org/wiki/Tor_%28anonymity_network%29|Tor network]].
+
+* Obviously, this proxy needs to be running for Yioop to make use of it. Beneath the Tor Proxy input field is a checkbox labelled '''Crawl via Proxies'''. Checking this box will reveal a text-area labelled Proxy Servers. You can enter the '''''address:port or address:port:proxytype''''' of proxy servers you would like to crawl through. If proxy servers are used, Yioop will make any requests to download pages to a randomly chosen server on the list, which will proxy the request to the site which has the page to download. To some degree this can make the download site think the request is coming from a different ip (and potentially location) than it actually is. In practice, servers can often use HTTP headers to guess that a proxy is being used.
+EOD;
+$help_pages["en-US"]["Search_Results_Editor"] = <<< EOD
+page_type=standard
+
+page_alias=
+
+page_border=solid-border
+
+toc=true
+
+title=
+
+author=
+
+robots=
+
+description=
+
+page_header=
+
+page_footer=
+
+END_HEAD_VARSThe '''Edit Result Page''' form can be used to change the title and snippet text associated with a given url if it appears in search results. The Edited Urls dropdown lets one see which URLs have been previously edited and allows one to load and re-edit these if desired. Edited words in the title and description of an edited URL are not indexed. Only the words from the page as originally appearing in the index are used for this.
This form only controls the title and snippet text of the URL when it appears in a search engine result page.
+EOD;
+$help_pages["en-US"]["Search_Results_Page_Elements"] = <<< EOD
+page_type=standard
+
+page_alias=
+
+page_border=solid-border
+
+toc=true
+
+title=
+
+author=
+
+robots=
+
+description=
+
+page_header=
+
+page_footer=
+
+END_HEAD_VARSThese checkboxes control whether various links and drop-downs on the search result and landing
+pages appear or not.
+
+; '''Word Suggest''': Controls whether the suggested query drop-down appears as a query is entered in the search bar and whether thesaurus results appear on search result pages.
+; '''Subsearch''' : Controls whether the links to subsearches such as Image, Video, and News search appear at the top of all search pages.
+; '''Signin''' : Controls whether the '''Sign In''' link appears at the top of the Yioop landing and search result pages.
+; '''Cache''', '''Similar''', '''Inlinks''', '''IP Address''': Control whether the corresponding links appear after each search result item.
+
+
+
+EOD;
+$help_pages["en-US"]["Seed_Sites_and_URL_Suggestions"] = <<< EOD
+page_type=standard
+
+page_alias=
+
+page_border=solid-border
+
+toc=true
+
+title=
+
+author=
+
+robots=
+
+description=
+
+page_header=
+
+page_footer=
+
+END_HEAD_VARS'''Seed Sites''' are a list of urls that Yioop should start a crawl from.
+
+<br />
+
+If, under Server Settings : Account Registration, users are allowed to register for Yioop accounts at some
+level other than completely disabled, then the Tools : Suggest a Url form will be enabled. URLs suggested through this form can be added to the seed sites by clicking the '''Add User Suggest data''' link. These URLs will appear at the end of the seed sites, each preceded by a timestamp of when it was added. Adding this data to the seed sites clears the list of suggested sites from where it is temporarily stored before being added.
+
+<br />
+
+Some sites' robots.txt files forbid crawling of the site. If you would like to create a placeholder page for such a site, so that a link to that site might still appear in the index but the site itself is not crawled by the crawler, you can use a syntax like:
+
+<nowiki>
+http://www.facebook.com/###!
+Facebook###!
+A%20famous%20social%20media%20site
+</nowiki>
+
+This should all be on one line. Here ###! is used as a separator and the format is url###!title###!description.
+EOD;
+$help_pages["en-US"]["Start_Crawl"] = <<< EOD
+page_type=standard
+
+page_alias=
+
+page_border=solid-border
+
+toc=true
+
+title=
+
+author=
+
+robots=
+
+description=
+
+page_header=
+
+page_footer=
+
+END_HEAD_VARSEnter a name for your crawl and click start to begin a new crawl. Previously completed crawls appear in the table below.
+
+<br />
+
+Before you start your crawl, be sure to start the queue servers and fetchers to be used for the crawl under '''Manage Machines'''.
+
+<br />
+
+The '''Options''' link lets you specify what web sites you want to crawl, or whether you want to do an archive crawl of previous crawls or other kinds of data sets.
+EOD;
+$help_pages["en-US"]["Subsearches"] = <<< EOD
+page_type=standard
+
+page_alias=
+
+page_border=solid-border
+
+toc=true
+
+title=
+
+author=
+
+robots=
+
+description=
+
+page_header=
+
+page_footer=
+
+END_HEAD_VARS'''Subsearches''' are specialized searches hosted on a Yioop site other than the default index. For example, a site might have a usual web search and also offer News and Images subsearches. This form lets you set up such a subsearch.
+
+<br />
+
+A list of links to all the current subsearches on a Yioop site appears at the
+ site_url?a=more
+page. Links to some of the subsearches may appear at the top left-hand side of the default landing page provided the Pages Options : Search Time : Subsearch checkbox is checked.
+ +<br /> + +The '''Folder Name''' of a subsearch is the name that appears as part of the query string when doing a search restricted to that subsearch. After creating a subsearch, the table below will have a '''Localize''' link next to its name. This lets you give names to your subsearch, as it appears on the More page mentioned above, in different languages. + +EOD; +$help_pages["en-US"]["Summary_Length"] = <<< EOD +page_type=standard + +page_alias= + +page_border=solid-border + +toc=true + +title= + +author= + +robots= + +description= + +page_header= + +page_footer= + +END_HEAD_VARSThis determines the maximum number of bytes that can appear in a summary generated for a document that Yioop has crawled. To have any effect, this value should be smaller than the byte range downloaded. +EOD; +$help_pages["en-US"]["Test_Indexing_a_Page"] = <<< EOD +page_type=standard + +page_alias= + +page_border=solid-border + +toc=true + +title= + +author= + +robots= + +description= + +page_header= + +page_footer= + +END_HEAD_VARSThe '''Test Page''' form is used to test how Yioop would process a given web page. To test a web page, one copies and pastes the source of the web page (obtainable by doing View Source in a browser) into the textarea. Then one selects the mimetype of the page (usually, text/html) and submits the form to see the processing results. +EOD; +$help_pages["en-US"]["Using_a_Classifier_or_Ranker"] = <<< EOD +page_type=standard + +page_alias= + +page_border=solid-border + +toc=true + +title= + +author= + +robots= + +description= + +page_header= + +page_footer= + +END_HEAD_VARSA <b>binary classifier</b> is used to say whether or not a page has a property (for example, being a spam page or not). Classifiers can be created using the Manage Classifiers activity. + +<br/> + +The classifiers that have been created in this Yioop instance are listed in the table below and can be used for future crawls.
Given a classifier named foo, selecting the '''Use to Classify''' checkbox for it tells Yioop to insert some subset of the following labels as meta-words when it indexes a page: +<pre> + class:foo + class:foo:10plus + class:foo:20plus + class:foo:30plus + class:foo:40plus + ... + class:foo:50 + ... +</pre> +When a document is scored against a classifier foo, it gets a score between 0 and 1, and if the score is greater than 0.5, the meta-word class:foo is added. A meta-word class:foo:XXplus indicates the document achieved a score of at least 0.XX with respect to the classifier, and a meta-word class:foo:XX indicates it had a score between 0.XX and 0.XX + 0.09. + +<br /> + +The '''Use to Rank''' checkbox indicates that Yioop should take the score between 0 and 1 and use this as one of the scores when ranking search results. +EOD; +$help_pages["en-US"]["Work_Directory"] = <<< EOD +page_type=standard + +page_alias= + +page_border=solid-border + +toc=true + +title= + +author= + +robots= + +description= + +page_header= + +page_footer= + +END_HEAD_VARSThe '''Work Directory''' is a folder used to store all the customizations of this instance of Yioop. +This field should be a complete file system path to a folder that exists. +It should use forward slashes. For example: + + /some_folder/some_subfolder/yioop_data +(more appropriate for Mac or Linux) or + c:/some_folder/some_subfolder/yioop_data +(more appropriate on a Windows system). + +If you decide to upgrade Yioop at some later date, you only have to replace the code folder +of Yioop and set the Work Directory path to the value from your pre-upgrade version. For this +reason, the Work Directory should not be a subfolder of the Yioop code folder.
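As a sketch only, the path requirements above (absolute path, forward slashes, existing folder) can be checked in PHP; the helper name is hypothetical and this function is not part of Yioop itself:

```php
<?php
// Hypothetical helper (not part of Yioop): checks that a proposed
// Work Directory value meets the requirements described above.
function is_valid_work_directory($path)
{
    // Forward slashes are required, so reject any backslashes outright.
    if (strpos($path, "\\") !== false) {
        return false;
    }
    // Require a complete (absolute) path: /... on Mac/Linux,
    // or a drive-letter form such as c:/... on Windows.
    if (!preg_match('#^([A-Za-z]:)?/#', $path)) {
        return false;
    }
    // The folder must already exist on the file system.
    return is_dir($path);
}
```

Under these rules, a relative path such as some_folder/yioop_data would be rejected, as would c:\some_folder\yioop_data because of its backslashes.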
+EOD; + +?> \ No newline at end of file diff --git a/resources/4Pa/4PaP2dQJZTE/FeedsWikis2.png b/resources/4Pa/4PaP2dQJZTE/FeedsWikis2.png new file mode 100644 index 0000000..045d0dd Binary files /dev/null and b/resources/4Pa/4PaP2dQJZTE/FeedsWikis2.png differ diff --git a/resources/4Pa/4PaP2dQJZTE/GroupingIcons.png b/resources/4Pa/4PaP2dQJZTE/GroupingIcons.png new file mode 100644 index 0000000..4e100fc Binary files /dev/null and b/resources/4Pa/4PaP2dQJZTE/GroupingIcons.png differ diff --git a/resources/4Pa/4PaP2dQJZTE/IntegratedHelp.png b/resources/4Pa/4PaP2dQJZTE/IntegratedHelp.png new file mode 100644 index 0000000..94466a4 Binary files /dev/null and b/resources/4Pa/4PaP2dQJZTE/IntegratedHelp.png differ diff --git a/resources/4Pa/4PaP2dQJZTE/LocaleOnWikiPage.png b/resources/4Pa/4PaP2dQJZTE/LocaleOnWikiPage.png new file mode 100644 index 0000000..1c244cd Binary files /dev/null and b/resources/4Pa/4PaP2dQJZTE/LocaleOnWikiPage.png differ diff --git a/resources/4Pa/4PaP2dQJZTE/WikiPageSettings.png b/resources/4Pa/4PaP2dQJZTE/WikiPageSettings.png new file mode 100644 index 0000000..9c1bb69 Binary files /dev/null and b/resources/4Pa/4PaP2dQJZTE/WikiPageSettings.png differ diff --git a/resources/xyq/xyqWsOS2HOY/FeedsWikis2.png.jpg b/resources/xyq/xyqWsOS2HOY/FeedsWikis2.png.jpg new file mode 100644 index 0000000..d65e84b Binary files /dev/null and b/resources/xyq/xyqWsOS2HOY/FeedsWikis2.png.jpg differ diff --git a/resources/xyq/xyqWsOS2HOY/GroupingIcons.png.jpg b/resources/xyq/xyqWsOS2HOY/GroupingIcons.png.jpg new file mode 100644 index 0000000..5ead720 Binary files /dev/null and b/resources/xyq/xyqWsOS2HOY/GroupingIcons.png.jpg differ diff --git a/resources/xyq/xyqWsOS2HOY/IntegratedHelp.png.jpg b/resources/xyq/xyqWsOS2HOY/IntegratedHelp.png.jpg new file mode 100644 index 0000000..16a2259 Binary files /dev/null and b/resources/xyq/xyqWsOS2HOY/IntegratedHelp.png.jpg differ diff --git a/resources/xyq/xyqWsOS2HOY/LocaleOnWikiPage.png.jpg 
b/resources/xyq/xyqWsOS2HOY/LocaleOnWikiPage.png.jpg new file mode 100644 index 0000000..b4b0b20 Binary files /dev/null and b/resources/xyq/xyqWsOS2HOY/LocaleOnWikiPage.png.jpg differ diff --git a/resources/xyq/xyqWsOS2HOY/WikiPageSettings.png.jpg b/resources/xyq/xyqWsOS2HOY/WikiPageSettings.png.jpg new file mode 100644 index 0000000..d3d2c6c Binary files /dev/null and b/resources/xyq/xyqWsOS2HOY/WikiPageSettings.png.jpg differ