diff --git a/en-US/pages/ranking.thtml b/en-US/pages/ranking.thtml
new file mode 100644
index 0000000..55c09a4
--- /dev/null
+++ b/en-US/pages/ranking.thtml
@@ -0,0 +1,529 @@
+<div class="docs">
+<h1>Yioop Ranking Mechanisms</h1>
+ <h2 id='toc'>Table of Contents</h2>
+ <ul>
+ <li><a href="#intro">Introduction</a></li>
+ <li><a href="#crawl">Crawl Time Ranking Factors</a>
+ <ul>
+ <li><a href="#crawl-processes">Crawl Processes</a></li>
+ <li><a href="#fetchers">Fetchers and their Effect on Search
+ Ranking</a></li>
+ <li><a href="#queue-servers">Queue Servers and their Effect on
+ Search Ranking</a></li>
+ </ul>
+ </li>
+ <li><a href="#search">Search Time Ranking Factors</a></li>
+ <li><a href="#references">References</a></li>
+ </ul>
+ <h2 id='intro'>Introduction</h2>
+ <p>
+ A typical query to Yioop is a collection of terms without the use
+ of the OR operator, '|', or the use of the exact match operator, double
+ quotes around a phrase. On such a query, called a <b>conjunctive query</b>,
+ Yioop tries to return documents which contain all of the query terms.
+ Yioop further tries to return these documents in descending order of score.
+ Most users only look at the first ten of the results returned. This article
+ tries to explain the different factors which influence whether a page that
+ has all the terms will make it into the top ten. To keep things simple
+ we will assume that the query is being performed on a single Yioop
+ index rather than a crawl mix of several indexes. We will also ignore
+ how news feed search items get incorporated into results.
+ </p>
+ <p>At its heart, Yioop currently relies on three main scores
+ for a document: Doc Rank (DR), Relevance (Rel), and Proximity (Prox).
+ Proximity scores are only used if the query has two or more terms.
+ We will describe later how these three scores are calculated.
+    For now, one can think of Doc Rank as roughly indicating how important
+    the document as a whole is, Relevance as measuring how important the
+    search terms are to the document, and Proximity as measuring how close
+    the search terms appear to each other in the document.
+ </p>
+    <p>
+    On a given query, Yioop does not scan its posting lists in their
+    entirety to find
+ every document that satisfies the query. Instead, it scans until it finds
+ a fixed number of documents, say `n`, satisfying the query. It then
+ computes the three scores for each of these `n` documents. For a document
+ `d` from these `n` documents, it determines the rank of `d` with respect to
+ the Doc Rank score, the rank of `d` with respect to the Relevance score,
+ and the rank of `d` with respect
+ to the Proximity score. It finally computes a score for each of these
+ `n` documents using these three rankings and the so-called
+ <b>reciprocal rank fusion (RRF)</b>:</p>
+<p class="center">
+`\R\R\F(d) := 200(frac{1}{59 + mbox(Rank)_(DR)(d)} +
+frac{1}{59 + mbox(Rank)_(Rel)(d)} + frac{1}{59 + mbox(Rank)_(Prox)(d)})`
+</p><p>
+ This formula essentially comes from Cormack et al.
+ [<a href="#CCB2009">CCB2009</a>]. They do not
+ use the factor `200` and use `60` rather than `59`. `\R\R\F(d)` is known
+ to do a decent job of combining scores, although there are some
+ recent techniques such as LambdaRank [<a href="#VLZ2012">VLZ2012</a>],
+ which do significantly better at the
+ expense of being harder to compute. To return results,
+ Yioop computes the top ten of
+ these `n` documents with respect to `\R\R\F(d)` and returns these
+ documents.</p>
+ <p> To get a feeling for how the `\R\R\F(d)` formula works, consider some
+ particular example situations:
+ If a document ranked 1 with respect to each score, then
+ `\R\R\F(d) = 200(3/(59+1)) = 10`. If a document
+ ranked n for each score, then `\R\R\F(d) = 200(3/(59+n)) = 600/(59 + n)`.
+ As `n -> infty` this goes to 0. A value `n = 200` is often used with
+ Yioop. For this `n`, `600/(59 + n) approx 2.32`.
+ If a document
+ ranked 1 on one of the three scores, but ranked `n` on the other two,
+ `\R\R\F(d) = 200/60 + 400/(59 +n) approx 3.33 + 400/(59 + n)`. The last
+ term again goes to 0 as `n` gets larger, giving a maximum
+ score of `3.33`. For the `n=200` case, one gets a score of `4.88`.
+ So because the three component scores are converted to ranks,
+ and then reciprocal rank fusion is used, one cannot solely use a good score
+ on one of the three components to get a good score overall.</p>
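+    <p>To make this computation concrete, here is a minimal PHP sketch,
+    not code taken from Yioop itself, of the `\R\R\F(d)` formula applied
+    to the example ranks above:</p>
+    <pre>
+    // Reciprocal rank fusion score for a document, given its rank under
+    // each of the three component scores (illustrative sketch only).
+    function rrf($rankDR, $rankRel, $rankProx)
+    {
+        return 200 * (1 / (59 + $rankDR) + 1 / (59 + $rankRel) +
+            1 / (59 + $rankProx));
+    }
+    echo rrf(1, 1, 1);     // 10, the best possible score
+    echo rrf(1, 200, 200); // about 4.88, as computed above
+    </pre>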
+ <p>An underlying assumption used by Yioop is that the first `n` matching
+ documents in Yioop's posting lists contain the 10 most important documents
+    with respect to our scoring function. For this assumption to be valid,
+    our posting lists must be roughly sorted according to score. For Yioop,
+    though, the first `n` documents will in fact most likely be the first
+    `n` documents that Yioop indexed. This does not contradict the
+    assumption, provided we index documents in order of their importance.
+    To do this, Yioop tries to index according to Doc Rank and assumes the
+    effects of relevance and proximity are not too drastic. That is, they
+    might be able to move the 100th document into the top 10, but not, say,
+    the 1000th document.</p>
+ <p>To see how it is
+ possible to roughly index according to document importance, we next
+ examine how data is acquired during a Yioop web crawl (the process
+ for an archive crawl is somewhat different). This is not only important for
+ determining the Doc
+ Rank of a page, but the text extraction that occurs after the page is
+ downloaded also affects the Relevance and Proximity scores. Once we
+ are done describing these crawl/indexing time factors affecting scores,
+ we will then consider search time factors which affect the scoring
+ of documents.</p>
+ <p><a href="#toc">Return to table of contents</a>.</p>
+ <h2 id='crawl'>Crawl Time Ranking Factors</h2>
+ <h3 id='crawl-processes'>Crawl Processes</h3>
+ <p>A Yioop Crawl has three types of processes:</p>
+ <ol>
+ <li>A Name server, which acts as an overall coordinator for the crawl,
+ and which is responsible for starting and stopping the crawl</li>
+    <li>One or more Queue Servers, each of which maintains a priority queue
+    of what to download next.</li>
+ <li>One or more Fetchers, which actually download pages, and do initial
+ page processing.</li>
+ </ol>
+ <p>A crawl is started through the Yioop Web app on
+ the Name Server. For each url in the list of starting urls (Seed Sites),
+ its hostname is computed, a hash of the hostname is computed, and
+ based on this hash, that url is sent to a given queue server -- all
+ urls with the same hostname will be handled by the same queue server.
+    Fetchers periodically check the Name Server to see if there is an
+    active crawl, and if so, what its timestamp is. If there is an
+    active crawl, a Fetcher picks a Queue Server and requests
+    a schedule of urls to download. This can be as many as
+    DOWNLOAD_SIZE_INTERVAL (defaults to 5000) urls.</p>
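+    <p>A minimal PHP sketch of this partitioning idea follows. The function
+    name and the use of crc32 as the hash are illustrative assumptions, not
+    Yioop's exact code:</p>
+    <pre>
+    // Assign a url to one of $numQueueServers queue servers based on a
+    // hash of its hostname, so all urls on a host go to the same server.
+    function queueServerFor($url, $numQueueServers)
+    {
+        $host = parse_url($url, PHP_URL_HOST);
+        return crc32($host) % $numQueueServers;
+    }
+    // Both calls below return the same server index:
+    echo queueServerFor("http://test.yioop.com/a", 3);
+    echo queueServerFor("http://test.yioop.com/b", 3);
+    </pre>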
+ <h3 id='fetchers'>Fetchers and their Effect on Search Ranking</h3>
+    <p>After receiving a schedule of urls, the fetcher downloads pages in
+    batches of a hundred pages at a time. When the fetcher requests a URL for
+    download it sends a range request header asking for the first
+    PAGE_RANGE_REQUEST (defaults to 50000) many bytes. Some servers do not
+    know how many bytes they will send before sending; for instance, they
+    might operate in "chunked" mode. So after receiving the page, the fetcher
+    discards any data after the first PAGE_RANGE_REQUEST many bytes -- this
+    data won't be indexed. Constants that we mention such as
+    PAGE_RANGE_REQUEST can be found in configs/config.php.
+    For each page in the batch of a hundred urls downloaded, the
+    fetcher proceeds through a sequence of processing steps to:</p>
+ <ol>
+ <li>Determine page mimetype and choose a page processor.</li>
+ <li>Use the page processor to extract a summary for the document.</li>
+ <li>Apply any indexing plugins for the page processor to generate
+ auxiliary summaries and/or modify the extracted summary.</li>
+        <li>Calculate a hash from the downloaded page minus tags and
+        non-word characters to be used for deduplication (a sketch of this
+        appears after this list).</li>
+        <li>Prune the number of links extracted from the document down to
+        MAX_LINKS_PER_PAGE (defaults to 50).</li>
+ <li>Apply any user-defined page rules to the summary extracted.</li>
+ <li>Store full-cache of page to disk, add the location of full cache to
+ summary. Full cache pages are stored
+ in folders in WORK_DIRECTORY/cache/FETCHER_PREFIX-ArchiveCRAWL_TIMESTAMP.
+        These folders contain gzipped text files, web archives, each made up
+        of
+ the concatenation of up to NUM_DOCS_PER_GENERATION many cache pages.
+ The class representing this whole structure is called a
+ WebArchiveBundle (lib/web_archive_bundle.php). The class for a
+ single file is called a WebArchive (lib/web_archive.php).</li>
+ <li>Keep summaries in fetcher memory until they are shipped
+ off to the appropriate queue server in a process
+ we'll describe later.</li>
+ </ol>
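+    <p>As a rough illustration of the deduplication hash of Step 4, the
+    following PHP sketch, an assumption about rather than a copy of Yioop's
+    exact normalization, strips tags and non-word characters before
+    hashing:</p>
+    <pre>
+    // Hash used to detect near-identical pages: tags and non-word
+    // characters are removed so trivial markup differences don't matter.
+    function dedupHash($page)
+    {
+        $text = strip_tags($page);
+        $text = preg_replace('/\W+/u', '', $text); // drop non-word chars
+        return md5($text);
+    }
+    </pre>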
+ <p>
+ After these steps, the fetcher checks the name server to see
+ if any crawl parameters
+ have changed or if the crawl has stopped before proceeding to download
+ the next batch of a hundred urls. It proceeds in this fashion until it
+ has downloaded and processed four to five hundred urls. It then
+ builds a "mini-inverted index" of the documents it has downloaded and
+ sends the inverted index, the summaries, any discovered urls, and any
+ robots.txt data it has downloaded back
+    to the queue server. It also sends back information on which hosts,
+    among those the queue server is responsible for, are generating more than
+    DOWNLOAD_ERROR_THRESHOLD (10) HTTP errors in a given schedule.
+    These hosts will automatically be crawl-delayed by
+    the queue server. Sending all of this data
+    allows the fetcher to clear some of its memory and continue
+    processing its batch of 5000 urls until it has downloaded all of them.
+ At this point, the fetcher picks another queue server and requests
+ a schedule of urls to download from it and so on.
+ </p>
+ <p>
+    Page rules, which can greatly affect the summary extracted for a page,
+ are described in more detail in the <a
+ href="?c=main&p=documentation#page-options">Page Options Section</a>
+ of the Yioop documentation. Before describing how the
+ "mini-inverted index" processing step is done, let's examine
+    Steps 1, 2, and 5 above in a little more detail as they are very
+    important in determining what actually is indexed. Usually based on
+    the HTTP headers, a
+ <a href="http://en.wikipedia.org/wiki/Internet_media_type">mimetype</a>
+ for each page is found. The mimetype determines which summary extraction
+ processor, in Yioop terminology, a page processor, is applied to the page.
+ As an example of the key role that the page processor plays in what
+ eventually ends up in a Yioop index, we list what the HTML page processor
+ extracts from a page and how it does this extraction:
+ </p>
+ <dl>
+ <dt>Language</dt><dd>Document language is used to determine
+ how to make terms from the words in a document. For example, if the
+ language is English, Yioop uses the English stemmer on a
+ document. So the word "jumping" in the document will get indexed as
+ "jump". On the other hand, if the language was determined to be Italian
+ then a different stemmer would be used and "jumping" would remain
+ "jumping". The HTML processor determines the language by first looking
+ for a lang attribute on the <html> tag in the document. If
+        none is found, it checks if the frequency of characters is close
+        enough to English to guess the document is English. If this fails,
+        it leaves the value blank.</dd>
+ <dt>Title</dt><dd>When search results are displayed, the extracted
+ document title is used as the link text. Words in the title also
+ are given a higher value when Yioop calculates its relevance statistic.
+ The HTML processor uses the contents of the <title> tag
+ as its default title. If this tag is not present or is empty,
+ Yioop then concatenates the contents of the <h1> to <h6>
+ tags in the document. The HTML processor keeps only the
+ first hundred (HtmlProcessor::MAX_TITLE_LEN) characters of the title.
+ </dd>
+        <dt>Description</dt><dd>The description is used when search results
+        are displayed to generate the snippets beneath the result link.
+        Besides the title, it contains the remaining words on the page that
+        are used to identify a document. To obtain a description, the HTML
+        processor first takes the value of the content attribute of any
+        <meta> tag whose name attribute is some upper/lower case variant of
+        "description". To this it concatenates the non-tag
+        contents of the first four <p> and <div> tags,
+        followed by the content of <td>, <li>,
+        <dt>, <dd>, and <a> tags until it reaches
+        a maximum of HtmlProcessor::MAX_DESCRIPTION_LEN (2000) characters.
+        These items are added from the one with the most characters to the
+        one with the least (a sketch of this process appears after this
+        list).</dd>
+ <dt>Links</dt><dd>Links are used by Yioop to obtain new pages
+ to download. They are also treated by Yioop as "mini-documents".
+        The url of such a mini-document is the target of the
+        link, and the link text is used as its description. As we will see
+        during searching, these mini-documents get combined with the
+        summary of the site linked to. The HTML processor extracts
+ links from <a>, <frame>, <iframe>, and <img>
+ tags. It extracts up to 300 links per document. When it extracts
+ links it canonicalizes relative links. If a <base> tag was present
+ it uses it as part of the canonicalization process. Link text is
+ extracted from <a> tag contents and from alt attributes of
+ <img>'s. In addition, rel attributes are examined for robot
+ directives such as nofollow.</dd>
+ <dt>Robot Metas</dt><dd>This is used to keep track of
+ any robot directives that occurred in meta tags in the document.
+        These directives are things such as NOFOLLOW, NOINDEX, NOARCHIVE, and
+ NOSNIPPET. These can affect what links are extracted from the page,
+ whether the page is indexed, whether cached versions of the page
+ will be displayable from the Yioop interface, and whether snippets
+ can appear beneath the link on a search result page. The HTML
+        processor does a case insensitive match on <meta> tags
+        that contain the string "robot" (so it treats tags that contain
+        robot and robots the same). It then extracts the directives from
+ the content attribute of such a tag.</dd>
+ </dl>
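+    <p>
+    The following PHP sketch suggests how the description assembly described
+    above might look. It is simplified (for instance, it does not limit
+    itself to the first four <p> and <div> tags) and is not Yioop's actual
+    HtmlProcessor code:
+    </p>
+    <pre>
+    // Build a description: start with any meta description, then append
+    // the text of description-bearing tags, longest first, up to a cap.
+    function extractDescription(DOMDocument $dom, $maxLen = 2000)
+    {
+        $description = "";
+        foreach ($dom->getElementsByTagName('meta') as $meta) {
+            if (strcasecmp($meta->getAttribute('name'),
+                'description') == 0) {
+                $description = $meta->getAttribute('content');
+            }
+        }
+        $texts = array();
+        foreach (array('p', 'div', 'td', 'li', 'dt', 'dd', 'a') as $tag) {
+            foreach ($dom->getElementsByTagName($tag) as $node) {
+                $texts[] = trim($node->textContent);
+            }
+        }
+        usort($texts, function ($a, $b) {
+            return strlen($b) - strlen($a); // longest first
+        });
+        foreach ($texts as $text) {
+            if (strlen($description) >= $maxLen) { break; }
+            $description .= " " . $text;
+        }
+        return substr($description, 0, $maxLen);
+    }
+    </pre>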
+ <p>
+ The page processors for other mimetypes extract similar fields but
+ look at different components of their respective document types.
+ </p>
+    <p>After the page processor is done with a page, non-robot and
+    non-sitemap pages then pass through a pruneLinks method. This culls the
+    up to 300 links that might have been extracted down to 50. To do this,
+    for each link, the link text is gzipped and the length of the resulting
+    string is determined. The 50 unique links of longest length are then
+    kept. The idea is that we want to keep the links whose text carries the
+    most information. Gzipping is a crude way to eliminate text with lots of
+    redundancies, so the compressed length measures how much useful text is
+    left. Having more useful text means that the link is more likely to be
+    helpful in finding the document.</p>
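+    <p>A small PHP sketch of this idea, assuming $links maps each url to
+    its link text (again an illustration rather than Yioop's pruneLinks
+    code):</p>
+    <pre>
+    // Keep the $max links whose gzipped link text is longest; compressed
+    // length is used as a crude proxy for information content.
+    function pruneLinks($links, $max = 50)
+    {
+        $scores = array();
+        foreach ($links as $url => $text) {
+            $scores[$url] = strlen(gzcompress($text));
+        }
+        arsort($scores); // largest compressed length first
+        return array_slice(array_keys($scores), 0, $max);
+    }
+    </pre>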
+ <p>
+    Now that we have finished discussing Steps 1, 2, and 5, let's describe
+    what happens when building a mini-inverted index. For the four to five
+    hundred summaries that we have at the start of the mini-inverted index
+    step, we make associative arrays of the form:
+ </p>
+ <pre>
+ term_id_1 => ...
+ term_id_2 => ...
+ ...
+ term_id_i =>
+ ((summary_map_1, (positions in summary 1 that term i appeared) ),
+ (summary_map_2, (positions in summary 2 that term i appeared) ),
+ ...)
+ ...
+ </pre>
+ <p>Term IDs are 8 byte strings consisting of the XOR of the two halves
+ of the 16 byte md5 hash of the term. Summary map numbers are
+    offsets into a table which can be used to look up a summary. These
+    numbers are in increasing order of when the page was put into the
+    mini-inverted index. To calculate a position of a term, a string is made
+ from terms extracted from the url followed by the summary title
+ followed by the summary description. One counts
+ the number of terms from the start of this string. For example, suppose
+ we had two summaries:</p>
+ <pre>
+ Summary 1:
+ URL: http://test.yioop.com/
+ Title: Fox Story
+ Description: The quick brown fox jumped over the lazy dog.
+
+    Summary 2:
+    URL: http://test.yioop2.com/
+ Title: Troll Story
+ Description: Once there was a lazy troll, P&A, who lived on my
+ discussion board.
+ </pre>
+ <p>The mini-inverted index might look like:</p>
+ <pre>
+ (
+ [test] => ( (1, (0)), (2, (0)) )
+ [yioop] => ( (1, (1)) )
+ [yioop2] => ( (2, (1)) )
+ [fox] => ( (1, (2, 7)) )
+ [stori] => ( (1, (3)), (2, (3)) )
+ [the] => ( (1, (4, 10)) )
+ [quick] => ( (1, (5)) )
+ [brown] => ( (1, (6)) )
+ [jump] => ( (1, (8)) )
+ [over] => ( (1, (9)) )
+ [lazi] => ( (1, (11)), (2, (8)) )
+ [dog] => ( (1, (12)) )
+ [troll] => ( (2, (2, 9)) )
+ [onc] => ( (2, (4)) )
+ [there] => ( (2, (5)) )
+ [wa] => ( (2, (6)) )
+    [a] => ( (2, (7)) )
+ [p_and_a] => ( (2, (10)) )
+ [who] => ( (2, (11)) )
+ [live] => ( (2, (12)) )
+ [on] => ( (2, (13)) )
+ [my] => ( (2, (14)) )
+ [discuss] => ( (2, (15)) )
+ [board] => ( (2, (16)) )
+ )
+ </pre>
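+    <p>The term ID scheme described above, the XOR of the two halves of the
+    raw 16-byte md5 hash of a term, can be sketched in a couple of lines of
+    PHP:</p>
+    <pre>
+    // 8-byte term ID: XOR the two halves of the raw md5 hash of the term.
+    function termId($term)
+    {
+        $hash = md5($term, true); // 16 raw bytes
+        return substr($hash, 0, 8) ^ substr($hash, 8, 8);
+    }
+    echo bin2hex(termId("jump"));
+    </pre>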
+ <p>The list associated with a term is called a <b>posting list</b>
+ and an entry in this list is called a <b>posting</b>. Notice terms
+ are stemmed when put into the mini-inverted index.
+ Also, observe acronyms, abbreviations, emails, and urls, such as
+ P&A, will be manipulated before being put into the index. For
+ some Asian languages such as Chinese where spaces might not be placed
+ between words char-gramming is done instead. If two character
+ char-gramming is used, the string:
+ 您要不要吃? becomes 您要 要不 不要 要吃 吃? A user query 要不要 will,
+ before look-up, be converted to the conjunctive query 要不 不要 and so
+ would match a document containing 您要不要吃? Yioop can also be
+ <a href="?c=main&p=documentation#token_tool">configured to make use of a
+ Bloom filter</a> containing n-word grams for a language. This is typically
+    done for n-word grams coming from Wikipedia page titles. So, for example,
+    if a document had "Rolling Stones" beginning at position 7, this
+    would be recognized as an n-word gram in such a Bloom filter and
+    three terms would be extracted: [roll stone] at position 7, [roll] at
+    position 7, and [stone] at position 8. In this way, a query for just
+    roll will match this document, as will one for just stone. On the other
+    hand, a query for rolling stones will also match and will make use of
+    the position list for [roll stone], so only documents with these two
+    terms adjacent would be returned.
+ </p>
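+    <p>A minimal PHP sketch of two-character char-gramming (illustrative,
+    not Yioop's actual tokenizer):</p>
+    <pre>
+    // Split a string written without word spaces into overlapping
+    // two-character grams.
+    function charGrams($text, $n = 2)
+    {
+        $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
+        $grams = array();
+        for ($i = 0; $i + $n <= count($chars); $i++) {
+            $grams[] = implode('', array_slice($chars, $i, $n));
+        }
+        return $grams;
+    }
+    // charGrams("您要不要吃?") gives 您要, 要不, 不要, 要吃, 吃?
+    </pre>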
+ <p>It should be recalled that links are treated as their own little
+ documents and so will be treated as separate documents when making the
+    mini-inverted index. The url of a link is what it points to, not the page
+    it is on. So the hostname of the machine that it points to might not be
+    a hostname handled by the queue server from which the schedule was
+    downloaded. In reality, the fetcher partitions link documents according
+    to the queue server that will handle each link, and builds a separate
+    mini-inverted index for each queue server. After building mini-inverted
+    indexes, it sends to the queue server from which the schedule was
+    downloaded the inverted index data, summary data, host error data,
+    robots.txt data, and discovered links data that were destined for it. It
+    keeps in memory all the other inverted index data destined for other
+    machines. It will send this data to the appropriate queue servers later
+    -- the next time it downloads and processes data for those servers. To
+    make sure this scales, the fetcher checks its memory usage; if it is
+    running low on memory, it might send some of this data to the other
+    queue servers early.</p>
+
+ <h3 id='queue-servers'>Queue Servers and their Effect on Search Ranking</h3>
+
+ <p>It is back on a queue server that the building blocks for
+ the Doc Rank, Relevance and Proximity scores are assembled. To see
+ how this happens we continue to follow the flow of the data through
+ the web crawl process.
+ </p>
+ <p>
+ To communicate with a queue server, a fetcher posts data to the web app
+ of the queue server. The web app writes mini-inverted index and summary
+ data into a file in the WORK_DIRECTORY/schedules/IndexDataCRAWL_TIMESTAMP
+ folder. Similarly, robots.txt data from a batch of 400-500 pages
+ is written to WORK_DIRECTORY/schedules/RobotDataCRAWL_TIMESTAMP, and
+ "to crawl" urls are written to
+ WORK_DIRECTORY/schedules/ScheduleDataCRAWL_TIMESTAMP. The Queue Server
+ periodically checks these folders for new files to process. It is often
+ the case that files can be written to these folders faster than the
+ Queue Server can process them.
+ </p>
+ <p>A queue server consists of two separate sub-processes:</p>
+ <dl>
+ <dt>An Indexer</dt><dd>The indexer is responsible for reading Index Data
+ files and building a Yioop index.</dd>
+    <dt>A Scheduler</dt><dd>The scheduler maintains a priority queue of
+    what urls to download next. It is responsible for reading
+    ScheduleData files to update its priority queue, and it is
+    responsible for making sure urls forbidden by RobotData files do not
+    enter the queue.</dd>
+ </dl>
+ <p>When the Indexer processes a schedule IndexData file, it saves the
+ data in an IndexArchiveBundle (lib/index_archive_bundle). These objects
+ are serialized to folders with names of the form:
+    WORK_DIRECTORY/cache/IndexDataCRAWL_TIMESTAMP. IndexArchiveBundle's have
+    the following components:</p>
+ <dl>
+ <dt>summaries</dt><dd>This is a WebArchiveBundle folder containing
+ the summaries of pages read from fetcher-sent IndexData files.</dd>
+ <dt>posting_doc_shards</dt><dd>This contains a sequence of
+ inverted index files, shardNUM, called IndexShard's. shardX holds the
+ postings lists for the Xth block of NUM_DOCS_PER_GENERATION many
+    summaries. NUM_DOCS_PER_GENERATION defaults to 50000 if the Queue Server
+    is on a machine with at least 1GB of memory. shardX also has postings
+    for the link documents that were acquired while acquiring these
+    summaries.</dd>
+ <dt>generation.txt</dt><dd>Contains a serialized PHP object which
+ says what is the active shard -- the X such that shardX will receive
+ newly acquired posting list data.</dd>
+ <dt>dictionary</dt><dd>The dictionary contains a sequence of subfolders
+ used to hold for each term in a Yioop index the offsets and length in each
+ IndexShard where the posting list for that term are stored.</dd>
+ </dl>
+    <p>Of these components, posting_doc_shards are the most important
+    with regard to page scoring. When a schedules/IndexData file is read,
+    the mini-inverted index in it is appended to the active IndexShard.
+    To do this append, all the summary map offsets need to be adjusted so
+    they now point to locations at the end of the summary map of the
+    IndexShard. These offsets thus provide information about when a document
+    was indexed during the crawl process. The maximum number of links per
+    document is usually 50 for normal documents and 300 for sitemaps.
+    Empirically, it has been observed that a typical index shard has offsets
+    for around 24 times as many link summary maps as document summary maps.
+    So roughly, if a newly added summary or link has index <i>DOC_INDEX</i>
+    in the active shard, and the active shard is the GENERATIONth shard,
+    the newly added object will have
+    </p>
+ <p>
+    `mbox(RANK) = (mbox(DOC_INDEX) + 1) + (mbox(AVG_LINKS_PER_PAGE) + 1) times
+    mbox(NUM_DOCS_PER_GENERATION) times mbox(GENERATION)`
+    `\qquad = (mbox(DOC_INDEX) + 1) + 25 times
+    mbox(NUM_DOCS_PER_GENERATION) times mbox(GENERATION)`
+ </p>
+ <p>To make this a score out of 10, we can use logarithms:</p>
+ <p>`mbox(DOC_RANK) = 10 - log_(10)(mbox(RANK)).`</p>
+ <p>So this gives us a DOC_RANK for one link or summary item stored
+ in a Yioop index. However, as we will see, this does not give us the
+ complete value of DOC_RANK when computed at query time.</p>
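+    <p>As a worked example of these formulas, suppose
+    NUM_DOCS_PER_GENERATION is 50000 and a summary has DOC_INDEX 100 in
+    generation 2. Then</p>
+    <p class="center">
+    `mbox(RANK) = (100 + 1) + 25 times 50000 times 2 = 2500101,`
+    `\qquad mbox(DOC_RANK) = 10 - log_(10)(2500101) approx 3.6 .`
+    </p>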
+    <p>Index shards are important for determining relevance and proximity
+    scores as well. An index shard stores the number of docs seen,
+    the number of links seen, the sum of the lengths of all summaries, and
+    the sum of the lengths of all links. From these we can derive average
+    summary lengths and average link lengths. From a posting, the
+    number of occurrences of a term in a document can be calculated.
+    These will all be useful statistics when we compute relevance.
+    As we will see, when we compute relevance, we use the average values
+    obtained for the particular shard the summary occurs in as a proxy
+    for their value throughout all shards. The fact that a posting
+    contains a position list of the locations of a term within a
+    document will be used when we calculate proximity scores.</p>
+ <p>We next turn to the role of a Queue Server's Scheduler in
+ the computation of a page's Doc Rank.</p>
+    <p>In summary, the crawl time factors that influence ranking include:</p>
+    <ul>
+    <li>How data is split amongst Fetchers, Queue Servers, and Name
+    Servers</li>
+    <li>Web versus archive crawls</li>
+    <li>The order in which pages are crawled: OPIC or breadth-first</li>
+    <li>Company level domains</li>
+    <li>robots.txt crawl delay</li>
+    <li>Queue size in RAM; schedules on disk</li>
+    <li>Page range requests</li>
+    <li>Mimetype detection</li>
+    <li>Summary extraction: title, description, and link extraction
+    (what the important elements on an html page are)</li>
+    <li>Page rules</li>
+    <li>Statistics coming from mini-inverted indexes, not the whole
+    crawl</li>
+    <li>Stemming or char-gramming</li>
+    <li>The n-gram word filter</li>
+    <li>Special characters and acronyms</li>
+    </ul>
+ <p><a href="#toc">Return to table of contents</a>.</p>
+ <h2 id='search'>Search Time Ranking Factors</h2>
+    <p>The search time factors and components that influence ranking
+    include:</p>
+    <ul>
+    <li>calculateControlWords (SearchController)</li>
+    <li>guessSemantics (PhraseModel)</li>
+    <li>Stemming and word gramming</li>
+    <li>Special characters and acronyms</li>
+    <li>Network versus non-network queries</li>
+    <li>Grouping (links and documents) and deduplication</li>
+    <li>Conjunctive queries</li>
+    <li>Scores: BM25F, proximity, document rank</li>
+    <li>News articles, images, videos</li>
+    <li>How related queries work</li>
+    </ul>
+ <p><a href="#toc">Return to table of contents</a>.</p>
+ <h2 id="references">References</h2>
+ <dl>
+<dt id="APC2003">[APC2003]</dt>
+<dd>Serge Abiteboul and Mihai Preda and Gregory Cobena.
+<a href="http://leo.saclay.inria.fr/publifiles/gemo/GemoReport-290.pdf"
+>Adaptive on-line page importance computation</a>.
+In: Proceedings of the 12th international conference on World Wide Web.
+pp. 280-290. 2003.
+</dd>
+<dt id="CCB2009">[CCB2009]</dt>
+<dd>Gordon V. Cormack and Charles L. A. Clarke and Stefan Büttcher.
+<a href="http://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf"
+>Reciprocal Rank Fusion outperforms Condorcet and
+individual Rank Learning Methods</a>. In:
+Proceedings of the 32nd Annual International ACM SIGIR Conference on Research
+and Development in Information Retrieval. pp. 758-759. 2009.
+</dd>
+
+<dt id="LLWL2009">[LLWL2009]</dt>
+<dd>H.-T. Lee, D. Leonard, X. Wang, D. Loguinov.
+<a href="http://irl.cs.tamu.edu/people/hsin-tsang/papers/tweb2009.pdf"
+>IRLbot: Scaling to 6 Billion Pages and Beyond</a>.
+ACM Transactions on the Web. Vol. 3. No. 3. June 2009.
+</dd>
+<dt id="VLZ2012">[VLZ2012]</dt>
+<dd>Maksims Volkovs, Hugo Larochelle, and Richard S. Zemel.
+<a href="http://www.cs.toronto.edu/~zemel/documents/cikm2012_paper.pdf"
+>Learning to rank by aggregating expert preferences</a>.
+21st ACM International Conference on Information and Knowledge Management.
+pp. 843-851. 2012.
+</dd>
+</dl>
+
+ <p><a href="#toc">Return to table of contents</a>.</p>
+</div>
+<script type="text/javascript"
+ src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-MML-AM_HTMLorMML"></script>
+