diff --git a/en-US/pages/documentation.thtml b/en-US/pages/documentation.thtml
index edec05a..2ad6bf0 100755
--- a/en-US/pages/documentation.thtml
+++ b/en-US/pages/documentation.thtml
@@ -2,7 +2,7 @@
 <h1>Yioop Documentation v 0.94</h1>
 <h2 id='toc'>Table of Contents</h2>
 <ul>
-<li><a href="#quick">Preface: Quick Start Guides</a></li>
+<li><a href="#quick">Getting Started</a></li>
 <li><a href="#intro">Introduction</a></li>
 <li><a href="#features">Feature List</a></li>
 <li><a href="#requirements">Requirements</a></li>
@@ -26,12 +26,15 @@
 <li><a href="#commandline">Yioop Command-line Tools</a></li>
 <li><a href="#references">References</a></li>
 </ul>
-<h2 id="quick">Preface: Quick Start Guides</h2>
+<h2 id="quick">Getting Started</h2>
 <p>This document serves as a detailed reference for the
-Yioop search engine. If you want to get started using Yioop now,
-but perhaps in less detail, you might want to first read the
+Yioop search engine. If you want to get started using Yioop now,
+you probably want to first read the
 <a href="?c=main&p=install">Installation
-Guides</a> page.
+Guides</a> page. If you cannot find your particular machine configuration
+there, you can check the Yioop <a href="#requirements">Requirements</a>
+section followed by the more general <a
+href="#installation">Installation and Configuration</a> instructions.
 </p>
 <h2 id="intro">Introduction</h2>
 <p>The Yioop search engine is designed to allow users
@@ -317,7 +320,8 @@
 http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml">WARC format</a>
 are often used by TREC conferences to store test data sets such as
 <a href="http://ir.dcs.gla.ac.uk/test_collections/">GOV2</a> and the
-<a href="http://lemurproject.org/clueweb09/">ClueWeb Dataset</a>.
+<a href="http://lemurproject.org/clueweb09/">ClueWeb 2009</a> /
+<a href="http://lemurproject.org/clueweb12/">ClueWeb 2012</a> Datasets.
 In addition, it was used by grub.org (hopefully, only on a temporary
 hiatus), a distributed, open-source, search engine project in C#.
 Another important format for archiving web pages is the XML format used by
diff --git a/en-US/pages/ranking.thtml b/en-US/pages/ranking.thtml
index d9963fc..046a713 100644
--- a/en-US/pages/ranking.thtml
+++ b/en-US/pages/ranking.thtml
@@ -425,7 +425,8 @@ frac{1}{59 + mbox(Rank)_(Rel)(d)} + frac{1}{59 + mbox(Rank)_(mbox(Prox))(d)})`
 now point to locations at the end of the summary of the IndexShard.
 These offsets thus provide information about when a document was
 indexed during the crawl process. The maximum number of links per document
-is usually 50 for normal documents and 300 for sitemaps. Emperically,
+is usually 50 for normal documents and 300 for
+<a href="http://www.sitemaps.org/">sitemaps</a>. Empirically,
 it has been observed that a typical index shard has offsets for around
 24 times as many links summary maps as document summary maps. So roughly,
 if a newly added summary or link has index <i>DOC_INDEX</i>
@@ -566,10 +567,18 @@ frac{1}{59 + mbox(Rank)_(Rel)(d)} + frac{1}{59 + mbox(Rank)_(mbox(Prox))(d)})`
 a URL is crawl-delayed it is inserted at the earliest position in the
 slot sufficient far from any previous url for that host to ensure
 that the crawl-delay condition is met.</li>
+<li>If a Scheduler's queue is full, yet after going through all
+of the urls in the queue it cannot find any to write to a schedule,
+it goes into a reset mode. It dumps its current urls back to
+schedule files, starts with a fresh queue (but preserving robots.txt
+info), and starts reading in schedule files. This can happen if too many
+urls of crawl-delayed sites start clogging a queue.</li>
 </ul>
-<p>The actual giving of a page's cash to its urls is done in the Fetcher. This
-is actually done in a different manner than in the OPIC paper. It is further
-handled different for sitemap pages versus all other web pages. For a
+<p>The actual giving of a page's cash to its urls is done in the Fetcher.
+We discuss it in the section on the queue server because it directly
+affects the order of queue processing. Cash handling in Yioop's algorithm
+is done in a different manner than in the OPIC paper. It is further
+handled differently for sitemap pages versus all other web pages. For a
 sitemap page with `n` links, let<p>
 <p class="center">`\gamma = sum_(j=1)^n 1/j^2`.</p>
 Let `C` denote the cash that the sitemap has to distribute. Then the `i`th
@@ -603,14 +612,29 @@
 to its links will sum to C. If no links go out of the CLD, then cash
 will be lost. In the case where someone is deliberately doing a crawl of
 only one site, then this lost cash will get replaced during normalization,
 and the above scheme essentially reduces to usual OPIC.</p>
+<p>We conclude this section by mentioning that the Scheduler only
+affects when a URL is written to a schedule, which will then be
+used by a fetcher. It is entirely possible that two fetchers get consecutive
+schedules from the same Scheduler, and return data to the Indexers
+not in the order in which they were scheduled, in which case they would
+be indexed out of order and their Doc Ranks would not be in the order
+of when they were scheduled. The scheduling and indexing process is
+only approximately correct; we rely on query time manipulations to
+try to improve the accuracy.</p>
 <p><a href="#toc">Return to table of contents</a>.</p>
 <h2 id='search'>Search Time Ranking Factors</h2>
-calculateControlWords (SearchController)
-
-guessSemantics (PhraseModel)
+<p>We are at last in a position to describe how Yioop calculates
+the three scores Doc Rank, Relevance, and Proximity at query time. When
+a query comes into Yioop it goes through the following stages before an actual
+look up is performed against an index.
+</p>
+<ol>
+<li>Control words are calculated.</li>
+<li>An attempt is made to guess the semantics of the query.</li>
+<li>Stemming or character n-gramming is done on the query and acronyms
+and abbreviations are rewritten.</li>
+</ol>
-stemming word gramming
-special characters and acronyms
 Network Versus non network queries
 Grouping (links and documents) deduplication
 Conjunctive queries
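The sitemap cash-distribution scheme in the ranking.thtml hunks above defines `\gamma = sum_(j=1)^n 1/j^2` for a sitemap with `n` links and cash `C`. The sentence stating the `i`th link's share is cut off in the diff, but for the shares to sum to `C` the natural reading is that link `i` receives `C/(\gamma i^2)`. The following is an illustrative Python sketch under that assumption only; it is not Yioop's PHP code:

```python
def sitemap_cash_shares(cash, num_links):
    """Split a sitemap page's cash among its outgoing links.

    Assumed scheme (inferred from the truncated diff text): with
    gamma = sum_{j=1}^n 1/j^2, link i receives cash / (gamma * i^2),
    so earlier sitemap links receive quadratically more cash and the
    shares sum to the full amount, i.e. no cash is lost.
    """
    gamma = sum(1 / j ** 2 for j in range(1, num_links + 1))
    return [cash / (gamma * i ** 2) for i in range(1, num_links + 1)]

# Sitemaps may contribute up to around 300 links per the diff above.
shares = sitemap_cash_shares(1.0, 300)
assert abs(sum(shares) - 1.0) < 1e-9      # all cash is distributed
assert shares[0] > shares[1] > shares[2]  # earlier links get more cash
```

Because of the `1/i^2` weighting, the bulk of a large sitemap's cash goes to its first few links, which matches the intuition that sitemap ordering reflects importance.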
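The formula fragment repeated in the hunk headers, `frac{1}{59 + mbox(Rank)_(Rel)(d)} + frac{1}{59 + mbox(Rank)_(mbox(Prox))(d)})`, suggests that the query-time scores are blended reciprocal-rank style with a constant of 59. Assuming, hypothetically, that Doc Rank enters the sum the same way as Relevance and Proximity, a minimal sketch would be:

```python
def blended_score(rank_dr, rank_rel, rank_prox, k=59):
    """Reciprocal-rank style blend of Doc Rank, Relevance, and Proximity.

    Hypothetical reconstruction from the partial formula in the hunk
    headers: each of the three rank positions contributes 1/(k + rank)
    with k = 59, so a low (good) rank on any one list lifts the total
    score without letting any single list dominate.
    """
    return sum(1 / (k + rank) for rank in (rank_dr, rank_rel, rank_prox))

# A document ranked first on all three lists outscores one that is
# first on one list but tenth on the other two.
assert blended_score(1, 1, 1) > blended_score(1, 10, 10)
```

The large additive constant is the standard reciprocal-rank-fusion trick: it damps the difference between rank 1 and rank 2 so that agreement across lists matters more than winning any single list.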
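The new bullet about the Scheduler's reset mode describes a fallback for when every queued url is blocked, for example by crawl-delay: the queue is dumped back to schedule files and rebuilt fresh, keeping only the robots.txt information. A toy Python sketch of that control flow follows; all class and method names here are invented for illustration and do not correspond to Yioop's PHP source:

```python
class TinyScheduler:
    """Toy model of the reset-mode behavior described in the bullet above."""

    def __init__(self):
        self.queue = []           # urls waiting to be written to a schedule
        self.robots_info = {}     # robots.txt cache, preserved across resets
        self.schedule_files = []  # dumped queues, to be re-read later

    def is_schedulable(self, url):
        # Placeholder check; the real condition involves per-host
        # crawl-delay slots, not a simple flag.
        return not self.robots_info.get(url, {}).get("delayed", False)

    def next_schedule_url(self):
        """Return one schedulable url, or reset the queue if none exists."""
        for url in self.queue:
            if self.is_schedulable(url):
                self.queue.remove(url)
                return url
        # Reset mode: no queued url can be scheduled, so dump the urls
        # back to schedule files and start a fresh queue, keeping only
        # the robots.txt cache.
        if self.queue:
            self.schedule_files.append(self.queue)
            self.queue = []
        return None
```

In the real system the reset also triggers re-reading of schedule files, so the dumped urls eventually re-enter the queue once their hosts' crawl-delays permit scheduling.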