<p>When a query comes into Yioop it goes through the following stages before
an actual look up is performed against an index.</p>
<ol>
<li>Control words are calculated. Control words are terms like
m: or i: which can be used to select a mix or index to use.
They also include commands like raw:, which says what level of grouping to
use, and no: commands, which say not to use a standard processing technique.
For example, no:guess (which affects whether the next processing step is done),
no:network, etc. For the remainder, we will
assume the query does not contain control words.</li>
<li>An attempt is made to guess the semantics of the query. This step
matches keywords in the query and rewrites them to other query terms.
For example, a query term which is in the form of a domain name will
be rewritten to the form of a meta word, site:domain, so the query will
return only pages from that domain. Currently, this processing is
in a nascent stage. As another example, if you search only
on "D", the search will be rewritten to "letter D".</li>
<li>Stemming or character n-gramming is done on the query, and acronyms
and abbreviations are rewritten. This is the same kind of operation
that was done after generating summaries to extract terms.</li>
</ol>
<p>After going through the above steps, Yioop builds an iterator
object from the resulting terms to iterate over summaries and link
entries that contain all of the terms. In the single queue server setting,
one iterator is built for each term and these iterators
are added to an intersect iterator that returns only those documents
in which all the terms appear. This intersect iterator is then fed into
a grouping iterator, which groups links and summaries that refer
to the same document. Recall that after downloading pages, the fetcher
calculated a hash of each downloaded page with its tags removed. Documents
with the same hash are also grouped together by the group iterator.
The value `n=200` posting list entries that Yioop scans out on a query,
referred to in the introduction, is actually the number of results
the group iterator requests before grouping. This number can be
controlled from the Yioop admin pages under Page Options >
Search Time > Minimum Results to Group. The number 200 was chosen
because on a single machine it was found to give decent results without
the queries taking too long.
</p>
<p>In the multiple queue server setting, when the query comes in to
the name server, a network iterator is built which poses the
query to each queue server. If `n=200`, the name server
multiplies this value by the setting Page Options >
Search Time > Server Alpha, which we'll denote `alpha`. This defaults
to 1.6, so the total is 320. It then divides this total by the number
of queue servers. So if there were 4 queue servers, it would request
the first 80 results for the query from each
queue server. The queue servers don't do grouping, but just
send the results of their intersect iterators to the name server, which
does the grouping.</p>
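<p>As a small illustration of the arithmetic in the previous paragraph, the
following Python sketch (Yioop itself is written in PHP, so this is not its
actual code) computes how many results the name server requests from each
queue server. The function name and the use of ceiling rounding are assumptions
made for illustration; only the numbers 200 and 1.6 and the division by the
number of queue servers come from the description above.</p>
<pre>
from math import ceil

# Sketch only: how many results the name server asks each queue server for.
# n is the Minimum Results to Group setting, alpha is the Server Alpha setting.
def results_per_queue_server(n, alpha, num_queue_servers):
    return ceil(n * alpha / num_queue_servers)

# With the defaults above: 200 results to group, alpha = 1.6, and 4 queue
# servers, each queue server is asked for the first 80 results of the query.
print(results_per_queue_server(200, 1.6, 4))  # outputs 80
</pre>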
<p>In both the networked and non-networked case, after the grouping
phase, Doc Rank, Relevance, and Proximity scores for each of the grouped results
have been determined. We then combine these three scores into a single
score using the reciprocal rank fusion technique described in the introduction.
Results are then sorted in descending order of score and output.
What we have left to describe is how the scores are calculated in the
various iterators mentioned above.</p>
<p>To fix an example for describing this process, suppose we have a group
`G'` of items `i_j'`, either pages or links, that all refer to the same url.
A page in this group means that at some point we downloaded the url and extracted
a summary. It is possible for there to be multiple pages in a group because
we might re-crawl a page. If we have another group `G''` of items `i_k''` of
this kind such that the hash of its most recent page matches that
of `G'`, then the two groups are merged. While we are grouping, we are also
computing a temporary overall score for each group. The temporary score is used to
determine which page's (or link's, if no pages are present) summary in a group
should be used as the source of url, title, and snippets. Let `G` be the
group one gets by performing this process after all groups with the same hash
as `G'` have been merged. We now describe how the individual items in `G`
have their scores computed, and finally, how these scores are combined.
</p>
<p>The Doc Rank of an item is calculated according to the formula mentioned
in the <a href="#queue-servers">queue servers subsection</a>:</p>
<p>
`mbox(RANK) = (mbox(DOC_INDEX) + 1) + (mbox(AVG_LINKS_PER_PAGE) + 1) times
 mbox(NUM_DOCS_PER_GENERATION) times mbox(GENERATION)`<br />
`\qquad = (mbox(DOC_INDEX) + 1) + 25 times
 mbox(NUM_DOCS_PER_GENERATION) times mbox(GENERATION)`
</p>
<p>To compute the relevance of an item, we use a variant of
BM25F [<a href="#ZCTSR2004">ZCTSR2004</a>]. Suppose a query `q` is a set
of terms `t`. View an item `d` as a bag of terms, let `f_(t,d)` denote
the frequency of the term `t` in `d`, let `N` denote the number of items
in the whole index, let `N_t` denote the number of items
containing `t` in the whole index (not just the group), let `l_d` denote
the length of `d`, where length is
the number of terms including repeats it contains, and
let `l_(avg)` denote the average length of an item in the index. The basic
BM25 formula is:</p>
<p>
`mbox(Score)_(BM25)(q, d) = sum_(t in q) IDF(t) cdot TF_(BM25)(t,d)`, where<br />
`IDF(t) = log(frac(N)(N_t))`, and<br />
`TF_(BM25)(t,d) =
frac(f_(t,d) cdot (k_1 +1))(f_(t,d) + k_1 cdot ((1-b) + b cdot(l_d / l_(avg)) ))`
</p>
<p>`IDF(t)`, the inverse document frequency of `t`, in the above can be
thought of as a measure of how much signal is provided by knowing that the term `t`
appears in the document. For example, its value is zero if `t` is in every
document; the rarer the term is, the larger the value of `IDF(t)`.
`TF_(BM25)` represents a normalized term frequency for `t`. Here `k_1 = 1.2`
and `b=0.75` are tuned parameters which are set to values commonly used
in the literature. It is normalized to prevent bias toward longer
documents. Also, if one spams a document, filling it with the term `t`,
we have `lim_(f_(t,d) -> infty) TF_(BM25)(t,d) = k_1 +1`, which limits
the ability to push the document score higher.
</p>
<p>Yioop computes a variant of BM25F, not BM25. This formula also
needs values for quantities like `l_(avg)`, `N`, and `N_t`. To keep the
computation simple, at the loss of some accuracy, when Yioop needs these values
it uses statistics from the particular index shard of `d` as
a stand-in. BM25F is essentially the same as BM25 except that it separates
a document into components, computes the BM25 score of the document with
respect to each component, and then takes a weighted sum of these scores.
In the case of Yioop, if the item is a page, the two components
are an ad hoc title and a description. Recall that when making our position
lists for a term in a document we concatenated url keywords,
followed by title, followed by summary. So the first terms in the result
will tend to come from the title. We take the first AD_HOC_TITLE_LEN many terms
of a document to be its ad hoc title. We calculate an ad hoc title
BM25 score for a query term being in the ad hoc title of an item.
We multiply this by 2, then compute a BM25 score for the term being in
the rest of the summary, and add the two results. Link items are not
separated into two components, but their BM25 score can be weighted differently
than that of a page (currently, though, link weight is set to 1 by default).
These three weights (title weight, description weight, and link weight) can
be set in Page Options > Search Time > Search Rank Factors.
</p>
<p>To compute the proximity score of an item `d` with respect to
a query `q` with more than one term, we use the notion of a <b>cover</b>.
A cover is an interval `[u_i, v_i]` of positions within `d` which contains
all the terms in `q` and such that no smaller interval contains all the
terms. Given `d`, we can calculate a proximity score as a sum of
the inverses of the sizes of the covers:</p>
<p class='center'>
`mbox(score)(d) = sum_i(frac(1)(v_i - u_i + 1))`.
</p>
<p>For a page item, Yioop calculates separate proximity scores with
respect to its ad hoc title and the rest of its summary. It then adds
them with the same weights as were used for the BM25F relevance score.
Similarly, link item proximities also have a weight factor multiplied against
them.
</p>
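<p>To make the relevance calculation described earlier concrete, here is a
minimal Python sketch of the basic BM25 formula with `k_1 = 1.2` and `b = 0.75`.
It is an illustration only, not Yioop's implementation, and the index statistics
used in the example at the bottom are invented toy numbers.</p>
<pre>
import math

K1 = 1.2   # tuned parameter k_1
B = 0.75   # tuned parameter b

def idf(num_items, num_items_with_term):
    # IDF(t) = log(N / N_t)
    return math.log(num_items / num_items_with_term)

def tf_bm25(freq, doc_len, avg_len, k1=K1, b=B):
    # TF_BM25(t,d) = f_(t,d)(k_1 + 1) / (f_(t,d) + k_1((1 - b) + b l_d / l_avg))
    return freq * (k1 + 1) / (freq + k1 * ((1 - b) + b * doc_len / avg_len))

def score_bm25(query_terms, term_freqs, doc_len, avg_len, num_items, item_counts):
    # Score_BM25(q,d) = sum over t in q of IDF(t) * TF_BM25(t,d)
    return sum(idf(num_items, item_counts[t]) *
               tf_bm25(term_freqs.get(t, 0), doc_len, avg_len)
               for t in query_terms)

# Toy example: a two term query scored against a 300 term summary drawn from
# an index of 1000 items whose average item length is 250 terms.
print(score_bm25(["yioop", "ranking"], {"yioop": 3, "ranking": 1},
                 300, 250, 1000, {"yioop": 20, "ranking": 50}))
</pre>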
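<p>Similarly, the following Python sketch computes the cover-based proximity
score defined above for a single list of terms. It is again only an
illustration under the definitions given here, not Yioop's code, and the helper
names are made up.</p>
<pre>
def find_covers(doc_terms, query_terms):
    # Return the covers [u, v] (0-indexed, inclusive) of the query terms in
    # doc_terms: intervals containing every query term such that no smaller
    # interval inside them contains every query term.
    query = set(query_terms)
    counts = {t: 0 for t in query}
    missing = len(query)
    covers = []
    left = 0
    for right, term in enumerate(doc_terms):
        if term not in query:
            continue
        if counts[term] == 0:
            missing -= 1
        counts[term] += 1
        if missing > 0:
            continue
        # shrink the window from the left while it still covers all terms
        while doc_terms[left] not in query or counts[doc_terms[left]] > 1:
            if doc_terms[left] in query:
                counts[doc_terms[left]] -= 1
            left += 1
        # minimal only if the occurrence at position right is itself needed
        if counts[term] == 1:
            covers.append((left, right))
    return covers

def proximity_score(doc_terms, query_terms):
    # score(d) = sum over covers of 1 / (v_i - u_i + 1)
    return sum(1.0 / (v - u + 1) for u, v in find_covers(doc_terms, query_terms))

# The only cover of "yioop" and "ranking" below is [0, 2], of length 3,
# so the proximity score is 1/3.
print(proximity_score(["yioop", "search", "ranking", "overview"],
                      ["yioop", "ranking"]))
</pre>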
<p>Now that we have described how to compute Doc Rank, Relevance, and Proximity
for each item in a group, we next describe how to get these three values
for the whole group.
</p>
<p><a href="#toc">Return to table of contents</a>.</p>
<h2 id="references">References</h2>
<dl>
<dt id="VLZ2012">[VLZ2012]</dt>
<dd>Maksims Volkovs, Hugo Larochelle, and Richard S. Zemel.
<a href="http://www.cs.toronto.edu/~zemel/documents/cikm2012_paper.pdf"
>Learning to Rank by Aggregating Expert Preferences</a>.
21st ACM International Conference on Information and Knowledge Management.
pp. 843-851. 2012.
</dd>
<dt id="ZCTSR2004">[ZCTSR2004]</dt>
<dd>Hugo Zaragoza, Nick Craswell, Michael Taylor,
Suchi Saria, and Stephen Robertson.
<a
href="http://trec.nist.gov/pubs/trec13/papers/microsoft-cambridge.web.hard.pdf"
>Microsoft Cambridge at TREC-13: Web and HARD tracks</a>.
In Proceedings of the 13th Annual Text Retrieval Conference. 2004.</dd>
</dl>
<p><a href="#toc">Return to table of contents</a>.</p>