diff --git a/en-US/pages/documentation.thtml b/en-US/pages/documentation.thtml index 2ad6bf0..208fd18 100755 --- a/en-US/pages/documentation.thtml +++ b/en-US/pages/documentation.thtml @@ -176,7 +176,10 @@ getting better. Since the original Google paper, techniques to rank pages have been simplified [<a href="#APC2003">APC2003</a>]. It is also possible to approximate some of the global statistics - needed in BM25F using suitably large samples.</p> + needed in BM25F using suitably large samples. More + details on the exact ranking mechanisms used by Yioop + can be found on the <a href="?c=main&p=ranking" + >Yioop Ranking Mechanisms</a> page.</p> <p>Yioop tries to exploit these advances to use a simplified distributed model which might be easier to deploy in a smaller setting. Each node in a Yioop system diff --git a/en-US/pages/ranking.thtml b/en-US/pages/ranking.thtml index 5e27b14..a1c7a65 100644 --- a/en-US/pages/ranking.thtml +++ b/en-US/pages/ranking.thtml @@ -19,7 +19,7 @@ <p> A typical query to Yioop is a collection of terms without the use of the OR operator, '|', or the use of the exact match operator, double - quotes around a phrase. On such a query, called a <b>conjunctive query</b>, + quotes. On such a query, called a <b>conjunctive query</b>, Yioop tries to return documents which contain all of the query terms. Yioop further tries to return these documents in descending order of score. Most users only look at the first ten of the results returned. This article @@ -47,7 +47,7 @@ the Doc Rank score, the rank of `d` with respect to the Relevance score, and the rank of `d` with respect to the Proximity score. It finally computes a score for each of these - `n` documents using these three rankings and the so-called + `n` documents using these three rankings and <b>reciprocal rank fusion (RRF)</b>:</p> <p class="center"> `mbox(RRF)(d) := 200(frac{1}{59 + mbox(Rank)_(mbox(DR))(d)} + @@ -69,7 +69,7 @@ frac{1}{59 + mbox(Rank)_(mbox(Rel))(d)} + frac{1}{59 + If a document ranked 1 with respect to each score, then `mbox(RRF)(d) = 200(3/(59+1)) = 10`. If a document ranked n for each score, then `mbox(RRF)(d) = 200(3/(59+n)) = 600/(59 + n)`. - As `n -> infty` this goes to 0. A value `n = 200` is often used with + As `n -> infty`, this goes to `0`. A value `n = 200` is often used with Yioop. For this `n`, `600/(59 + n) approx 2.32`. If a document ranked 1 on one of the three scores, but ranked `n` on the other two, @@ -86,7 +86,7 @@ frac{1}{59 + mbox(Rank)_(mbox(Rel))(d)} + frac{1}{59 + the first `n` documents will in fact most likely be the first `n` documents that Yioop indexed. This does not contradict the assumption provided we are indexing documents according to the importance of our - documents. To do this Yioop tries to index according to Doc Rank and assume + documents. To do this Yioop tries to index according to Doc Rank and assumes the effects of relevance and proximity are not too drastic. That is, they might be able to move the 100th document into the top 10, but not, say, the 1000th document into the top 10.</p> @@ -99,14 +99,18 @@ frac{1}{59 + mbox(Rank)_(mbox(Rel))(d)} + frac{1}{59 + downloaded also affects the Relevance and Proximity scores.
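<p>As a brief aside, the arithmetic in the reciprocal rank fusion examples above is easy to check by hand or with a few lines of code. The following Python snippet is purely illustrative (Yioop itself is written in PHP and this is not its code); it simply recomputes the values quoted earlier for some hypothetical rank triples.</p>
<pre>
# Sketch only: recompute the RRF values quoted above for a few
# hypothetical (Doc Rank, Relevance, Proximity) rank triples.
def rrf(rank_dr, rank_rel, rank_prox):
    # 200 times the sum of reciprocal ranks, each damped by the constant 59
    return 200 * sum(1.0 / (59 + r) for r in (rank_dr, rank_rel, rank_prox))

print(rrf(1, 1, 1))        # 10.0 -- ranked first on all three scores
print(rrf(200, 200, 200))  # about 2.32 -- ranked 200th on all three
print(rrf(1, 200, 200))    # about 4.88 -- first on one score, 200th on the other two
</pre>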
Once we are done describing these crawl/indexing time factors affecting scores, we will then consider search time factors which affect the scoring - of documents.</p> + of documents and the actual formulas for Doc Rank, Relevance and + Proximity.</p> <p><a href="#toc">Return to table of contents</a>.</p> <h2 id='crawl'>Crawl Time Ranking Factors</h2> <h3 id='crawl-processes'>Crawl Processes</h3> - <p>A Yioop Crawl has three types of processes:</p> + <p>To understand how crawl and indexing time factors affect + search ranking, let's begin by first fixing in our minds how a + crawl works in Yioop. A Yioop crawl has three types of processes + that play a role in this:</p> <ol> <li>A Name server, which acts as an overall coordinator for the crawl, - and which is responsible for starting and stopping the crawl</li> + and which is responsible for starting and stopping the crawl.</li> <li>One or more Queue Servers, each of which maintains a priority queue of what to download next.</li> <li>One or more Fetchers, which actually download pages, and do initial @@ -123,11 +127,18 @@ frac{1}{59 + mbox(Rank)_(mbox(Rel))(d)} + frac{1}{59 + a schedule of urls to download. By default, this can be as many as DOWNLOAD_SIZE_INTERVAL (defaults to 5000) urls.</p> <h3 id='fetchers'>Fetchers and their Effect on Search Ranking</h3> - <p>After receiving a batch of pages, the fetcher downloads pages in batches - of a hundred pages at a time. When the fetcher requests a URL for download - it sends a range request header asking for the first PAGE_RANGE_REQUEST - (defaults to 50000) many bytes. Some servers do not know how many bytes - they will send before sending, they might operate in "chunked" mode, + <p> Let's examine the fetcher's role in determining what terms get + indexed, and hence, what documents can be retrieved using those + terms. After receiving a batch of pages, the fetcher downloads pages in + batches of a hundred pages at a time. When the fetcher requests a URL for + download, it sends a range request header asking for the first + PAGE_RANGE_REQUEST (defaults to 50000) many bytes. Only the data in these + bytes has any chance of becoming terms which are indexed. The reason for + choosing a fixed, relatively small size is so that one can index a large + number of documents even with a relatively small amount of disk space. + Some servers do not + know how many bytes they will send before sending; they might operate in + "chunked" mode, so after receiving the page, the fetcher discards any data after the first PAGE_RANGE_REQUEST many bytes -- this data won't be indexed. Constants that we mention such as PAGE_RANGE_REQUEST can be found in @@ -200,7 +211,7 @@ frac{1}{59 + mbox(Rank)_(mbox(Rel))(d)} + frac{1}{59 + then a different stemmer would be used and "jumping" would remain "jumping". The HTML processor determines the language by first looking for a lang attribute on the <html> tag in the document. If - none is found it checks it the frequency of characters is close enough + none is found it checks if the frequency of characters is close enough to English to guess the document is English. If this fails it leaves the value blank.</dd> <dt>Title</dt><dd>When search results are displayed, the extracted @@ -227,13 +238,13 @@ frac{1}{59 + mbox(Rank)_(mbox(Rel))(d)} + frac{1}{59 + with the most characters to the one with the least.</dd> <dt>Links</dt><dd>Links are used by Yioop to obtain new pages to download. They are also treated by Yioop as "mini-documents".
- The url of such mini document is the target website of the + The url of such a mini-document is the target website of the link; the link text is used as a description. As we will see during searching, these mini-documents get combined with the summary of the site linked to. The HTML processor extracts links from <a>, <frame>, <iframe>, and <img> tags. It extracts up to 300 links per document. When it extracts - links it canonicalizes relative links. If a <base> tag was present + links it canonicalizes relative links. If a <base> tag was present, it uses it as part of the canonicalization process. Link text is extracted from <a> tag contents and from alt attributes of <img>'s. In addition, rel attributes are examined for robot @@ -254,8 +265,9 @@ frac{1}{59 + mbox(Rank)_(mbox(Rel))(d)} + frac{1}{59 + The page processors for other mimetypes extract similar fields but look at different components of their respective document types. </p> - <p>After the page processor is done with a page, non-robot and sitemap - pages then pass through a pruneLinks method. This culls the up to 300 links + <p>After the page processor is done with a page, pages which + are neither robots.txt pages nor sitemap + pages then pass through a pruneLinks method. This culls the up to 300 links that might have been extracted down to 50. To do this, for each link, the link text is gzipped and the length of the resulting string is determined. The 50 unique links of longest length are then kept. The idea @@ -283,10 +295,11 @@ frac{1}{59 + mbox(Rank)_(mbox(Rel))(d)} + frac{1}{59 + <p>Term IDs are 8 byte strings consisting of the XOR of the two halves of the 16 byte md5 hash of the term. Summary map numbers are offsets into a table which can be used to look up a summary. These - numbers are increasing order of when the page was put into the - mini-inverted index. To calculate a position of a term, a string is made - from terms extracted from the url followed by the summary title - followed by the summary description. One counts + numbers are in increasing order of when the page was put into the + mini-inverted index. To calculate a position of a term, the + summary is viewed as a single string consisting of + terms extracted from the url concatenated with the summary title + concatenated with the summary description. One counts the number of terms from the start of this string. For example, suppose we had two summaries:</p> <pre> @@ -335,7 +348,7 @@ frac{1}{59 + mbox(Rank)_(mbox(Rel))(d)} + frac{1}{59 + Also, observe acronyms, abbreviations, emails, and urls, such as P&A, will be manipulated before being put into the index. For some Asian languages such as Chinese where spaces might not be placed - between words char-gramming is done instead. If two character + between words, char-gramming is done instead. If two character char-gramming is used, the string: 您要不要吃? becomes 您要 要不 不要 要吃 吃? A user query 要不要 will, before look-up, be converted to the conjunctive query 要不 不要 and so @@ -348,7 +361,7 @@ frac{1}{59 + mbox(Rank)_(mbox(Rel))(d)} + frac{1}{59 + three terms would be extracted [roll stone] at position 7, [roll] at position 7, and [stone] at position 8. In this way, a query for just roll will match this document, as will one for just stone. On the other - a query for rolling stones will also match and will make use of + hand, a query for rolling stones will also match and will make use of the position list for [roll stone], so only documents with these two terms adjacent would be returned.
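<p>To make the position counting described above concrete, here is a small Python sketch. It is illustrative only, not Yioop's PHP implementation: it uses plain lowercased words instead of hashed 8 byte term IDs, and it skips stemming, bigrams, and char-gramming. Positions are counted from the start of the string formed by the url terms, then the title, then the description, as described above.</p>
<pre>
# Illustrative mini-inverted index: term maps to a list of
# (summary_number, position) pairs.
def mini_inverted_index(summaries):
    index = {}
    for doc_num, (url_terms, title, description) in enumerate(summaries):
        terms = url_terms + title.lower().split() + description.lower().split()
        for position, term in enumerate(terms):
            index.setdefault(term, []).append((doc_num, position))
    return index

# hypothetical summaries: (url terms, title, description)
summaries = [
    (["www", "example", "com"], "Example Title", "a short example description"),
    (["www", "test", "org"], "Test Page", "another example"),
]
print(mini_inverted_index(summaries)["example"])
# [(0, 1), (0, 3), (0, 7), (1, 6)]
</pre>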
</p> @@ -409,7 +422,7 @@ frac{1}{59 + mbox(Rank)_(mbox(Rel))(d)} + frac{1}{59 + <dt>posting_doc_shards</dt><dd>This contains a sequence of inverted index files, shardNUM, called IndexShard's. shardX holds the postings lists for the Xth block of NUM_DOCS_PER_GENERATION many - summaries. NUM_DOCS_PER_GENERATION default to 50000 if Queue Server is + summaries. NUM_DOCS_PER_GENERATION defaults to 50000 if the queue server is on a machine with at least 1Gb of memory. shardX also has postings for the link documents that were acquired while acquiring these summaries.</dd> <dt>generation.txt</dt><dd>Contains a serialized PHP object which @@ -423,16 +436,18 @@ frac{1}{59 + mbox(Rank)_(mbox(Rel))(d)} + frac{1}{59 + with regard to page scoring. When a schedules/IndexData file is read, the mini-inverted index in it is appended to the active IndexShard. To do this append, all the summary map offsets need to be adjusted so they - now point to locations at the end of the summary of the IndexShard. + now point to locations at the end of the summary of the IndexShard to + which data is being appended. These offsets thus provide information about when a document was indexed during the crawl process. The maximum number of links per document is usually 50 for normal documents and 300 for <a href="http://www.sitemaps.org/">sitemaps</a>. Empirically, it has been observed that a typical index shard has offsets for around - 24 times as many links summary maps as document summary maps. So - roughly, if a newly added summary or link, d, has index <i>DOC_INDEX(d)</i> + 24 times as many links summary map entries as document summary map + entries. So roughly, if a newly added summary or link, d, has index + <i>DOC_INDEX(d)</i> in the active shard, and the active shard is the GENERATION(d) shard, - the newly add object will have + the newly added object will have </p> <blockquote> <p> @@ -460,27 +475,28 @@ frac{1}{59 + mbox(Rank)_(mbox(Rel))(d)} + frac{1}{59 + page. Doc Rank as measured so far does not do that.</li> <li>The Doc Rank is a positive number and less than 10 provided the index of the given queue server has fewer than 10 billion items. Since - to index 10 billion items using Yioop you would probably want + to index 10 billion items using Yioop, you would probably want multiple queue servers, Doc Ranks likely remain positive for larger indexes.</li> - <li>If we imagined the Yioop indexed the web as a balanced tree starting - from some seed node where Rank labels the nodes of the tree level-wise, + <li>If we imagined that Yioop indexed the web as a balanced tree starting + from some seed node where RANK(`i`) labels the node `i` of the tree + enumerated level-wise, then `log_(25)(mbox(RANK)(d)) = (log_(10)(mbox(RANK)(d)))/(log_(10)(25))` - would be an estimate of the depth of a node in this tree. So DOC RANK + would be an estimate of the depth of a node in this tree. So Doc Rank can be viewed as an estimate of how far we are away from the root, with 10 being at the root.</li> <li>Doc Rank is computed by different queue servers independently of each other for the same index.
So it is possible for two summaries to - have the same Doc Rank in the same index if they are stored on different - queue server.</li> + have the same Doc Rank in the same index, provided they are stored on + different queue servers.</li> <li>For Doc Ranks to be comparable with each other for the same index on different queue servers, it is assumed that queue servers are indexing at roughly the same speed.</li> </ol> <p>Besides Doc Rank, Index shards are important for determining relevance and proximity scores as well. An index shard stores the number of summaries - seen, number of links seen, the sum of the lengths of all summaries, the - sum of the length of all links. From these we can derive average + seen, the number of links seen, the sum of the lengths of all summaries, the + sum of the length of all links. From these statistics, we can derive average summary lengths, and average link lengths. From a posting, the number of occurences of a term in a document can be calculated. These will all be useful statistics for when we compute relevance. @@ -489,18 +505,18 @@ frac{1}{59 + mbox(Rank)_(mbox(Rel))(d)} + frac{1}{59 + for their value throughout all shards. The fact that a posting contains a position list of the location of a term within a document will be use when we calculate proximity scores.</p> - <p>We next turn to the role of a Queue Server's Scheduler in + <p>We next turn to the role of a queue server's Scheduler process in the computation of a page's Doc Rank. One easy way, which is supported by Yioop, for a Scheduler to determine what to crawl next is to use a simple queue. This would yield roughly a breadth-first traversal of - the web starting from the seed sites. Since highly quality pages are often a - small number of hops, from any page on the web, there is some evidence + the web starting from the seed sites. Since high quality pages are often a + small number of hops from any page on the web, there is some evidence [<a href="NW2001">NW2001</a>] that this lazy strategy is not too - bad for crawling roughly according to document importance. However, there - are better strategies. When Page Importance is chosen as - the Crawl Order for a Yioop crawl, the Scheduler on each queue server works - harder to make schedules so that the next pages to crawl are always the - most important pages not yet seen.</p> + bad for crawling according to document importance. However, there + are better strategies. When Page Importance is chosen in the + Crawl Order dropdown for a Yioop crawl, the Scheduler on each queue server + works harder to make schedules so that the next pages to crawl are always + the most important pages not yet seen.</p> <p>One well-known algorithm for doing this kind of scheduling is called OPIC (Online Page Importance Computation) [<a href="#APC2003">APC2003</a>]. @@ -513,7 +529,7 @@ frac{1}{59 + mbox(Rank)_(mbox(Rel))(d)} + frac{1}{59 + sites might already have been in the queue in which case we add to their cash total. For URLs not in the queue, we add them to the queue with initial value `alpha/n`. Each site has two scores: Its current - cash on hand, and total earnings the site has ever received. When + cash on hand, and the total earnings the site has ever received. When a page is crawled, its cash on hand is reset to 0. We always choose as the next page to crawl from amongst the pages with the most cash (there might be ties). 
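<p>The toy Python simulation below may help fix this cycle in one's mind. It is a sketch under simplifying assumptions, not Yioop's Scheduler code: the link graph is small and static, no new urls are discovered, and every url starts with an equal share of cash.</p>
<pre>
# Toy OPIC sketch: crawling a url resets its cash to 0 and splits that
# cash evenly among the urls it links to; the next url crawled is always
# one with the most cash on hand.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}
cash = {url: 1.0 / len(links) for url in links}
earnings = {url: 0.0 for url in links}  # historical totals; Yioop itself does not keep these

for step in range(6):
    url = max(cash, key=cash.get)       # most cash on hand (ties broken arbitrarily)
    amount = cash[url]
    earnings[url] += amount
    cash[url] = 0.0
    for out_url in links[url]:
        cash[out_url] += amount / len(links[url])
    print(step, url, round(amount, 3))
</pre>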
OPIC can be used to get an estimate of the @@ -525,12 +541,12 @@ frac{1}{59 + mbox(Rank)_(mbox(Rel))(d)} + frac{1}{59 + and Yazdani [<a href="#BY2008">BY2008</a>] have more recently proposed a new page importance measure DistanceRank, they also confirm that OPIC does better than breadth-first, but show the - computationally more expensive PartialPageRank and Partial DistanceRank + computationally more expensive Partial PageRank and Partial DistanceRank perform even better. Yioop uses a modified version of OPIC to choose which page to crawl next.</p> <p>To save a fair bit of crawling overhead, Yioop does not keep for each site crawled historical totals of all - earnings a page has received, the cash-based approach is only used for + earnings a page has received. The cash-based approach is only used for scheduling. Here are some of the issues addressed in the OPIC-based algorithm employed by Yioop: </p> @@ -538,11 +554,11 @@ frac{1}{59 + mbox(Rank)_(mbox(Rel))(d)} + frac{1}{59 + <li>A Scheduler must ensure robots.txt files are crawled before any other page on the host. To do this, robots.txt files are inserted into the queue before any page from that site. Until the robots.txt - file for a page is crawled, it receives cash whenever a page on - that host receives cash.</li> + file for a page is crawled, the robots.txt file receives cash whenever + a page on that host receives cash.</li> <li>A fraction `alpha` of the cash that a robots.txt file receives is divided amongst any sitemap links on that page. Not all of the cash is - given to prevent sitemaps from "swamping" the queue. Currently, + given. This is to prevent sitemaps from "swamping" the queue. Currently, `alpha` is set 0.25. Nevertheless, together with the last bullet point, the fact that we do share some cash, means cash totals no longer sum to one.</li> @@ -555,7 +571,7 @@ frac{1}{59 + mbox(Rank)_(mbox(Rel))(d)} + frac{1}{59 + the total amount of cash..</li> <li>A robots.txt file or a slow host might cause the Scheduler to crawl-delay all the pages on the host. These pages might receive sufficient - cash to be scheduled earlier, but won't be because there must be a minimum + cash to be scheduled earlier, but won't be, because there must be a minimum time gap between requests to that host.</li> <li>When a schedule is made with a crawl-delayed host, URLs from that host cannot be scheduled until the @@ -566,7 +582,7 @@ frac{1}{59 + mbox(Rank)_(mbox(Rel))(d)} + frac{1}{59 + </li> <li>The Scheduler has a maximum, in-memory queue size based on NUM_URLS_QUEUE_RAM (320,000 urls in a 2Gb memory configuration). It - will wait on reading new "to crawl" schedule files from fetchers, + will wait on reading new "to crawl" schedule files from fetchers if reading in the file would mean going over this count. For a typical, web crawl this means the "to crawl" files build up much like a breadth-first queue on disk. @@ -580,7 +596,7 @@ frac{1}{59 + mbox(Rank)_(mbox(Rel))(d)} + frac{1}{59 + slot sufficient far from any previous url for that host to ensure that the crawl-delay condition is met.</li> <li>If a Scheduler's queue is full, yet after going through all - of the url's in the queue it cannot find any to write to a schedule + of the url's in the queue it cannot find any to write to a schedule, it goes into a reset mode. It dumps its current urls back to schedule files, starts with a fresh queue (but preserving robots.txt info) and starts reading in schedule files. 
This can happen if too many @@ -600,7 +616,7 @@ link on the sitemap page receives cash</p> links early in the sitemap and prevent crawling of sitemap links from clustering together too much. For a non-sitemap page, we split the cash by making use of the notion of a company level domain (cld). This is a slight -simplification of the notion of a pay level domain defined (pld) in [<a +simplification of the notion of a pay level domain (pld) defined in [<a href="#LLWL2009">LLWL2009</a>]. For a host of the form, something.2chars.2chars or blah.something.2chars.2chars, the company level domain is something.2chars.2chars. For example, for www.yahoo.co.uk, @@ -667,7 +683,7 @@ one iterator would be built for each term and these iterators would be added to an intersect iterator that would return documents on which all the terms appear. These iterators are then fed into a grouping iterator, which groups links and summaries that refer -to the same document. Recall after downloading pages on the fetcher +to the same document url. Recall that after downloading pages on the fetcher, we calculated a hash from the downloaded page minus tags. Documents with the same hash are also grouped together by the group iterator. The value `n=200` posting list entries that Yioop scans out on a query @@ -678,9 +694,10 @@ Search Time > Minimum Results to Group. The number 200 was chosen because on a single machine it was found to give decent results without the queries taking too long. </p> -<p>In the multiple queue server setting, the query comes in to - the name server a network iterator is built which poses the -query to each queue server. If `n=200`, the name server +<p>In the multiple queue server setting, when the query comes in to + the name server, a network iterator is built. This iterator poses the +query to each queue server being administered by the +name server. If `n=200`, the name server multiplies this value by the value Page Options > Search Time > Server Alpha, which we'll denote `alpha`. This defaults @@ -692,18 +709,18 @@ queue server. The queue servers don't do grouping, but just does the grouping.</p> <p>In both the networked and non-networked case, after the grouping phase Doc Rank, Relevance, and Proximity scores for each of the grouped results -have been determined. We then combine these three scores into a single +will have been determined. We then combine these three scores into a single score using the reciprocal rank fusion technique described in the introduction. Results are then sorted in descending order of score and output. What we have left to describe is how the scores are calculated in the various iterators mentioned above.</p> <p>To fix an example to describe this process, suppose we have a group `G'` of items `i_j'`, either pages or links that all refer to the same url. -A page in this group means at some point we downloaded the url and extracted -a summary. It is possible for there to be multiple pages in a group because -we might re-crawl a page. If we have another group `G''` of items `i_k''` of -this kind that such that the hash of the most recent page matches that -of `G'`, then the two groups are merged. While we are grouping we are +A page in this group means that at some point we downloaded the url and +extracted a summary. It is possible for there to be multiple pages in a group +because we might re-crawl a page. If we have another group `G''` of items +`i_k''` of this kind such that the hash of the most recent page matches +that of `G'`, then the two groups are merged. 
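<p>A rough Python sketch of this two stage grouping, by url and then by the hash of the most recently downloaded page, is given below. It only illustrates the bookkeeping; the item fields used are hypothetical simplifications of what Yioop actually stores in its postings and summaries.</p>
<pre>
# Sketch: each item is a dict with a url, an is_page flag, a content hash
# (pages only), and the order in which it was indexed.
def group_results(items):
    by_url = {}
    for item in items:
        by_url.setdefault(item["url"], []).append(item)
    # merge groups whose most recently indexed pages share a hash
    merged = {}
    for url, group in by_url.items():
        pages = [i for i in group if i["is_page"]]
        if not pages:
            merged[("no page", url)] = group
            continue
        newest = max(pages, key=lambda i: i["indexed_at"])
        merged.setdefault(newest["hash"], []).extend(group)
    return list(merged.values())

items = [
    {"url": "http://a.com/", "is_page": True, "hash": "h1", "indexed_at": 3},
    {"url": "http://a.com/", "is_page": False, "hash": None, "indexed_at": 5},
    {"url": "http://a.com/index.html", "is_page": True, "hash": "h1", "indexed_at": 9},
]
print(len(group_results(items)))  # 1: both urls end up in one group since their page hashes agree
</pre>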
While we are grouping, we are computing a temporary overall score for a group. The temporary score is used to determine which page's (or link's if no pages are present) summaries in a group should be used as the source of url, title, and snippets. Let `G` be the @@ -711,10 +728,9 @@ group one gets performing this process after all groups with the same hash as `G'` have been merged. We now describe how the individual items in `G` have their score computed, and finally, how these scores are combined. </p> -<p>The Doc Rank of an item d, `DR(d)`, is calculated according to the formula +<p>The Doc Rank of an item d, `mbox(DR)(d)`, is calculated according to the formula mentioned in the <a href="#queue-servers">queue servers subsection</a>:</p> -<blockquote> -<p> +<div> \begin{eqnarray} \mbox{RANK}(d) &=& (\mbox{DOC_INDEX}(d) + 1) + (\mbox{AVG_LINKS_PER_PAGE} + 1) \times\\ @@ -723,40 +739,40 @@ mentioned in the <a href="#queue-servers">queue servers subsection</a>:</p> \mbox{NUM_DOCS_PER_GENERATION} \times \mbox{GENERATION}(d))\\ \mbox{DR}(d) &=& 10 - \log_{10}(\mbox{RANK}(d)) \end{eqnarray} -</p> -</blockquote> +</div> <p>To compute the relevance of an item, we use a variant of BM25F [<a href="#ZCTSR2004">ZCTSR2004</a>]. Suppose a query `q` is a set -of terms `t`. View a item `d` as a bag of terms, let `f_(t,d)` denote +of terms `t`. View an item `d` as a bag of terms, let `f_(t,d)` denote the frequency of the term `t` in `d`, let `N_t` denote the number of items containing `t` in the whole index (not just the group), let `l_d` denote the length of `d`, where length is the number of terms including repeats it contains, and let `l_{avg}` denote the average length of an item in the index. The basic BM25 formula is:</p> -<blockquote> -<p> -`mbox(Score)_(mbox(BM25))(q, d) = sum_(t in q) mbox(IDF)(t) -cdot mbox(TF)_(mbox(BM25))(t,d)`, where<br /> -`mbox(IDF)(t) = log(frac(N)(N_t))`, and<br /> -`mbox(TF)_(mbox(BM25))(t,d) = -frac(f_(t,d)\cdot(k_1 +1))(f_(t,d) + k_1 cdot ((1-b) + b cdot(l_d / l_(avg)) ))` -</p> -</blockquote> +<div> +\begin{eqnarray} +\mbox{Score}_{\mbox{BM25}}(q, d) &=& \sum_{t \in q} \mbox{IDF}(t) +\cdot \mbox{TF}_{\mbox{BM25}}(t,d), \mbox{ where }\\ +\mbox{IDF}(t) &=& \log(\frac{N}{N_t})\mbox{, and}\\ +\mbox{TF}_{\mbox{BM25}}(t,d) &=& +\frac{f_{t,d}\cdot(k_1 +1)}{f_{t,d} + k_1\cdot ((1-b) + b\cdot(l_d / l_{avg}) )} +\end{eqnarray} +</div> <p>`mbox(IDF)(t)`, the inverse document frequency of `t`, in the above can be thought as measure of how much signal is provided by knowing that the term `t` appears in the document. For example, its value is zero if `t` is in every -document; whereas the more rare the term is the larger than value of +document; whereas, the more rare the term is the larger than value of `mbox(IDF)(t)`. `mbox(TF)_(mbox(BM25))` represents a normalized term frequency for `t`. Here `k_1 = 1.2` and `b=0.75` are tuned parameters which are set to values -commonly used in the literature. It is normalized to prevent bias toward longer -documents. Also, if one spams a document filling it with one the term `t`, -we have `lim_(f_(t,d) -> infty) mbox(TF)_(mbox(BM25))(t,d) = k_1 +1`, which -limits the ability to push the document score larger. +commonly used in the literature. `mbox(TF)_(mbox(BM25))` is normalized to +prevent bias toward longer documents. 
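<p>For readers who like to experiment, the hedged Python sketch below (again, not Yioop's PHP code; the counts and lengths are invented) evaluates the IDF and TF expressions above with the parameter values just mentioned.</p>
<pre>
import math

def idf(num_docs, num_docs_with_term):
    # more signal the rarer the term; zero if the term occurs in every document
    return math.log(num_docs / num_docs_with_term)

def tf_bm25(freq, doc_len, avg_len, k1=1.2, b=0.75):
    return (freq * (k1 + 1)) / (freq + k1 * ((1 - b) + b * (doc_len / avg_len)))

# a term occurring 3 times in a document of average length, in an index of
# 50000 items of which 1200 contain the term
print(idf(50000, 1200) * tf_bm25(3, 100, 100))
# repeating the term many more times barely helps: this is already close to k1 + 1 = 2.2
print(tf_bm25(1000, 100, 100))
</pre>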
Also, if one spams a document, filling +it with many copies of the term `t`, we approach the limiting situation +`lim_(f_(t,d) -> infty) mbox(TF)_(mbox(BM25))(t,d) = k_1 +1`, which as one +can see prevents the document score from being made arbitrarily larger. </p> <p>Yioop computes a variant of BM25F not BM25. This formula also -needs to have values for things like `L_(avg)`, `N`, `N_t`. To keep the +needs to have values for things like `l_(avg)`, `N`, `N_t`. To keep the computation simple at the loss of some accuracy when Yioop needs these values it uses information from the statistics in the particular index shard of `d` as a stand-in. BM25F is essentially the same as BM25 except that it separates @@ -775,88 +791,92 @@ the rest of the summary. We add the two results. I.e.,</p> `mbox(Rel)(q, d) = 2 times mbox(Score)_(mbox(BM25-Title))(q, d) + mbox(Score)_(mbox(BM25-Description))(q, d)`</p> <p> -This score would be the relevance for a single summary item `d` with respect +This score would be the relevance for a single page item `d` with respect to `q`. For link items we don't -separate into title and description but can weight the BM25 score different -than for a page (currently though link weight is set to 1 by default). +separate into title and description, but can weight the BM25 score different +than for a page (currently, though, the link weight is set to 1 by default). These three weights: title weight, description weight, and link weight can be set in Page Options > Search Time > Search Rank Factors . </p> <p>To compute the proximity score of an item `d` with respect to -a query `q` with more than one term. we use the notion of a <b>span</b>. +a query `q` with more than one term, we use the notion of a <b>span</b>. A span is an interval `[u_i, v_i]` of positions within `d` which contain all the terms (including repeats) in `q` such that no smaller interval contains -all the terms. Given `d` we can calculate a proximity score as a sum of -the inverse of the sizes of the spans:</p> +all the terms (including repeats) . Given `d` we can calculate a proximity +score as a sum of the inverse of the sizes of the spans:</p> <p class='center'> -`mbox(pscore)(d) = sum(frac(1)(v_i - u_i + 1))`. +`mbox(Prox)(d) = sum(frac(1)(v_i - u_i + 1))`. </p> <p>This formula comes from Clark et al. [<a href="#CCT2000">CCT2000</a>] -except they use covers, rather than spans, where covers ignore repeats. It is -the starting point of our proximity calculation. For a page item, Yioop -calculates separate pscores with respect to its ad hoc title and the rest of a -summary. It then adds them with the same weight as was done for the BM25F -relevance score. Similarly, link item pscores also have a weight factor -multiplied against them. Finally, Yioop normalizes the pscore calculated -with these weights by item length to get:</p> -<p class='center'> -`mbox(Prox)(d) = (100 times mbox(weighted-pscore)(d))/l_d`. +except that they use covers, rather than spans, where covers ignore repeats. +For a page item, Yioop calculates separate proximity scores with respect to its +ad hoc title and the rest of a summary. It then adds them with the same +weight as was done for the BM25F relevance score. Similarly, link item +proximities also have a weight factor multiplied against them. </p> <p>Now that we have described how to compute Doc Rank, Relevance, and Proximity for each item in a group, we now describe how to get these three values -for the whole group. 
Since both Relevance and Proximity as we have defined -have a normalization for document length, it is reasonable to take -a statistic such as median or average value to compute the Proximity -or Relevance for the group. An average has the drawback that a given -site might be able to skew the statistic, and spam the value for a group. -Since neither Relevance nor Proximity make -use of a notion of page importance, a straight median can also be spammed -- -a single domain with lots of pages could skew the median. To solve these -issue Yioop treats each domain within the group as its -own subgroup and computes an average proximity value and relevance value for -that subgroup, then it takes the median value of all the subgroup -values to get the group proximity value and relevance value. -One thing to notice about groups is that they -are query dependent: Which links to a page have all the query terms depends -on the query terms. So in coming up with a document rank for a group of -items we will have introduced a query dependence to our notion of -document rank. The scheduling algorithm, using company level domains, -of Yioop already makes an attempt at preventing Doc Rank from being easily -manipulated. So taking a weighted sum of the Doc Ranks of a group seems -reasonable. Yioop uses three different weights: We use a weight of 2 -if an item is the summary of domain name page, we use a weight of 1 for -any other summary page item, and a weight of 1/2 for a link item. The -justification for the slightly lower weights for links is that some -links have already contributed to the given url being crawled; whereas, -some have not, so the score of 1/2 was arbitrarily chosen to adjust for -this. +for the whole group. First, for proximity we take the max over all +the proximity scores in a group. The idea is that since we are going +out typically 200 results before grouping, each group has a relatively +small number of items in it. Of these there will typically be at most +one or two page items, and the rest will be link items. We aren't +doing document length normalization for proximity scores and it might +not make sense to do so for links data where the whole link text is relatively +short. Thus, the maximum score in the group is likely to be that of a +page item, and clicking the link it will be these spans the user will see. +Let `[u]` denote all the items that would be grouped +with url `u` in the grouping process, let `q` be a query. +Let `Res(q)` denote results in the index satisfying query `q`, that is, +having all the terms in the query. Then the +proximity of `[u]` with respect to `q` is: </p> -<p>Given a url `u` let `[u]` denote -the set of all items in an index that might be grouped with `u`. -For a query `q` many items in `[u]` might not contain all the terms -in `q` and so by Yioop's score mechanism not contribute to the score of -this result. Let `mbox(Dom)([u])` denote the distinct domain names of -urls in `[u]`. For a url `u'` and domain name `d`, write `u in d` if -the domain name of `u` is `d`. Let `mbox(type)(i)` be one of <i>domain</i>, -<i>page</i>, <i>link</i> and denote the type of an item. -Given this let `mbox(wt)(mbox(type)(i))` denote its weight. -Using these notations, we can summarize -how scores of group `[u]` from the score of its items are calculated -with the following equations: +<div> +\begin{eqnarray} +\mbox{Prox}(q, [u]) &=& \mbox{max}_{i \in [u], i \in Res(q)}(\mbox{Prox}(q,i)). 
+\end{eqnarray} +</div> +<p>For Doc Rank and Relevance, we split a group into subgroups based on +the host name of where a link came from. So links from +http://www.yahoo.com/something1 and http://www.yahoo.com/something2 +to a url `u` would have the same hostname http://www.yahoo.com/. A link from +http://www.google.com/something1 would have hostname http://www.google.com/. +We will also use a weighting `wt(i)` which has value `2` if `i` is +a page item and the url of i is a hostname, and 1 otherwise. +Let `mbox(Host)(i)` denote the set of hostnames for a page item `i`, and +let `mbox(Host)(i)` denote the hostnames of the page `i` came from in the case +of a link item. Let </p> +<p class='center'> +`H([u]) = { h \quad | h = mbox(Host)(i) \mbox ( for some ) i in [u]}`. +</p> +<p>Let `[u]_h` be the items in `[u]` with hostname `h`. +Let `([u]_h)_j` denote the `j`th element of `[u]_h` listed out in order of +Doc Rank except that the first page item found is listed as `i_0`. +It seems reasonable if a particular host tells us the site `u` is great +multiple times, the likelihood that we would have our minds swayed diminishes +with each repeating. This motivates our formulas for Doc Rank and Relevance +which we give now: </p> <p> \begin{eqnarray} -\mbox{Rel}(q, [u]) &=& \mbox{Median}_{d \in\mbox{Dom}([u]) }( - \mbox{Avg}_{i \in d}(\mbox{Rel}(q,i))).\\ -\mbox{Prox}(q, [u]) &=& \mbox{Median}_{d \in\mbox{Dom}([u]) }( - \mbox{Avg}_{i \in d}(\mbox{Prox}(q,i))).\\ -\mbox{DR}(q, [u]) &=& \sum_{i\in[u]}\mbox{DR}(i)\cdot \mbox{wt}(\mbox{type}(i)). +\mbox{Rel}(q, [u]) &=& \sum_{h \in H([u])} +\sum_{j=0}^{|[u]_h|}\frac{1}{2^j}wt(([u]_h)_j) \cdot \mbox{Rel}(q, ([u]_h)_j).\\ +\mbox{DR}(q, [u]) &=& \sum_{h \in H([u])} +\sum_{j=0}^{|[u]_h|}\frac{1}{2^j}wt(([u]_h)_j) \cdot \mbox{DR}(q, ([u]_h)_j). + \end{eqnarray} </p> -<p>In the above, median's and average's are taken only over the non-zero -elements of the respective sets. -This completes our description of the Yioop scoring mechanism -in the conjunctive query case. </p> +<p></p> +<p>Now that we have described how Doc Rank, Relevance, and Proximity +are calculated for groups, we have completed our description of the Yioop +scoring mechanism in the conjunctive query case. After performing +pre-processing steps on the query, Yioop retrieves the first `n` +results from its index. Here `n` defaults to 200. It then groups the +results and uses the formulas above to calculate the three scores +for Doc Rank, Relevance, and Proximity. It then uses reciprocal rank +fusion to combine these three scores into a single score, sorts the +results by this score and returns to the user the top 10 of these +results.</p> <p><a href="#toc">Return to table of contents</a>.</p> <h2 id="references">References</h2>