Yioop Search Engine Ranking Mechanisms
Introduction
A typical query to Yioop is a collection of terms without the use of the OR operator, '|', or the use of the exact match operator, the double quotes. On such a query, called a conjunctive query, Yioop tries to return documents which contain all of the query terms. If the sequence of words is particularly common, that is, common enough to have a Wikipedia page, Yioop will try to return results which have that string with the same word order. Yioop further tries to return these documents in descending order of score. Most users only look at the first ten of the results returned. This article tries to explain the different factors which influence whether a page that has all the terms will make it into the top ten. To keep things simple we will assume that the query is being performed on a single Yioop index rather than a crawl mix of several indexes. We will also ignore how news feed items get incorporated into results. In reality, even if a user only searches on a single term, several meta-keywords will be added to the query such as a language meta keyword, and a keyword for how safe the search results are allowed to be.
At its heart, Yioop relies on three main scores for a document: Doc Rank (DR), Relevance (Rel), and Proximity (Prox). Proximity scores are only used if the query has two or more non-meta keyword terms. We will describe later how these three scores are calculated. For now one can think that the Doc Rank roughly indicates how important the document as a whole is and Relevance measures how important the search terms are to the document. Finally, Proximity measures how frequently all of the query terms appear close together.
On a given query, Yioop does not scan its whole posting lists to find every document that satisfies the query. Instead, it scans until it finds a fixed number of documents, say `n`, satisfying the query or until a timeout is exceeded. In the case of a timeout, `n` is just the number of documents found by the timeout. It then computes the three scores for each of these `n` documents. This value `n` in Yioop can be set in Page Options - Search Time and defaults to 300. An underlying assumption used by Yioop is that the first `n` matching documents in Yioop's posting lists contain the 10 (or default number of results to display) most important documents with respect to our scoring function. For this assumption to be valid our posting lists must be roughly sorted according to score. For Yioop, though, the first `n` documents will in fact most likely be sorted in the order that Yioop indexed them. This does not contradict the assumption provided we are indexing documents according to the importance of our documents. To do this Yioop tries to index according to Doc Rank and assumes the effects of relevance and proximity are not too drastic. That is, they might be able to move the 100th document into the top 10, but not say the 1000th document into the top 10.
To see how it is possible to roughly index according to document importance, we next examine how data is acquired during a Yioop web crawl (the process for an archive crawl is somewhat different). This is not only important for determining the Doc Rank of a page, but the text extraction that occurs after the page is downloaded also affects the Relevance and Proximity scores. Once we are done describing these crawl/indexing time factors affecting scores, we will then consider search time factors which affect the scoring of documents and the actual formulas for Doc Rank, Relevance, and Proximity.
Crawl Time Ranking Factors
Crawl Processes
To understand how crawl and indexing time factors affect search ranking, let's begin by first fixing in our minds how a crawl works in Yioop. Yioop supports multiple simultaneous crawls each on its own channel, but we will focus on the case where there is a single active crawl. A Yioop crawl has three types of processes that play a role in this:
- A Name server, which acts as an overall coordinator for the crawl, and which is responsible for starting and stopping the crawl.
- One or more Queue Servers, each of which maintain a priority queue of what to download next.
- One or more Fetchers, which actually download pages, and do initial page processing.
A crawl is started through the Yioop Web app on the Name Server. For each url in the list of starting urls (Seed Sites), its hostname is computed, a hash of the hostname is computed, and based on this hash, that url is sent to a given queue server -- all urls with the same hostname will be handled by the same queue server. Fetchers periodically check the Name Server to see if there is an active crawl, and if so, what its timestamp is. If there is an active crawl, a Fetcher would then pick a Queue Server and request a schedule of urls to download. By default, this can be as many as DOWNLOAD_SIZE_INTERVAL (defaults to 5000) urls.
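To make the hostname-based partitioning concrete, here is a minimal sketch; it is not Yioop's actual code, and the function name and the use of crc32 are illustrative assumptions only:

```php
<?php
// Hypothetical sketch of hostname-based url partitioning; not Yioop's actual code.
// Every url with the same hostname hashes to the same queue server index.
function queueServerForUrl(string $url, int $num_queue_servers): int
{
    $host = parse_url($url, PHP_URL_HOST);
    $host = is_string($host) ? $host : "";
    // Any stable hash of the hostname works; crc32 is used only for illustration.
    return abs(crc32($host)) % $num_queue_servers;
}
// Both urls go to the same server because they share a hostname.
echo queueServerForUrl("https://www.example.com/a.html", 4), "\n";
echo queueServerForUrl("https://www.example.com/b/c.html", 4), "\n";
```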
Fetchers and their Effect on Search Ranking
Let's examine the fetcher's role in determining what terms get indexed, and hence, what documents can be retrieved using those terms. After receiving a schedule of urls, the fetcher downloads pages in batches of a hundred pages at a time. When the fetcher requests a URL for download it sends a range request header asking for the first PAGE_RANGE_REQUEST (defaults to 1000000) many bytes. Only the data in these bytes has any chance of becoming terms which are indexed. The reason for choosing a fixed size is so that one can index a large number of documents even with a relatively small amount of disk space. The smaller the size of PAGE_RANGE_REQUEST, the more documents can be indexed, but the less likely they will contain all of their data. As most images on the web are less than 1MB in size, this value was chosen as a compromise. Some servers do not know how many bytes they will send before sending; they might operate in "chunked" mode, so after receiving the page, the fetcher discards any data after the first PAGE_RANGE_REQUEST many bytes -- this data won't be indexed. Constants that we mention such as PAGE_RANGE_REQUEST can be found in configs/Config.php. This particular constant can actually be set from the admin panel under Page Options - Crawl Time. For each page in the batch of a hundred urls downloaded, the fetcher proceeds through a sequence of processing steps to:
- Determine page mimetype and choose a page processor (the source code for page processors is in src/library/processors). For HtmlProcessor, a scraper might in addition be chosen which is targeted toward particular kinds of Web pages/sites (for example, WordPress).
- Use the page processor to extract a summary for the document.
- Apply any indexing plugins for the page processor to generate auxiliary summaries and/or modify the extracted summary.
- Run classifiers on the summary and add any class labels and rank scores.
- Calculate a hash from the downloaded page minus tags and non-word characters to be used for deduplication.
- Prune the number of links extracted from the document down to MAX_LINKS_PER_PAGE (defaults to 50).
- Apply any user-defined page rules to the summary extracted.
- Keep summaries, and if configured to keep full caches of pages, these too, in fetcher memory until it has downloaded the complete schedule or SEEN_URLS_BEFORE_UPDATE_SCHEDULER many pages have been downloaded. At this point, summaries and caches are shipped off to the appropriate queue server in a process we'll describe later.
These steps are applied to groups of about 100 pages. After these steps, the fetcher checks the name server to see if any crawl parameters have changed or if the crawl has stopped before proceeding to download the next batch or continuing to process the current batch of urls. After the last step, if SEEN_URLS_BEFORE_UPDATE_SCHEDULER has been reached (usually every four to five hundred urls), it then sends the summaries, the page caches, any discovered urls, and any robots.txt data it has downloaded back to the queue server. It also sends back information on which hosts that the queue server is responsible for are generating more than DOWNLOAD_ERROR_THRESHOLD (50) HTTP errors in a given schedule. These hosts will automatically be crawl-delayed by the queue server. Sending all of this data allows the fetcher to clear some of its memory and continue processing its batch of 5000 urls until it has downloaded all of them. At this point, the fetcher picks another queue server and requests a schedule of urls to download from it and so on.
Page rules, which can greatly affect the summary extracted for a page, are described in more detail in the Page Options Section of the Yioop documentation. Before describing how the "mini-inverted index" processing step is done, let's examine Steps 1, 2, and 6 above in a little more detail as they are very important in determining what actually is indexed. Based usually on the HTTP headers, a mimetype for each page is found. The mimetype determines which summary extraction processor, in Yioop terminology, a page processor, is applied to the page. As an example of the key role that the page processor plays in what eventually ends up in a Yioop index, we list what the HTML page processor extracts from a page and how it does this extraction. To begin the extraction process, Yioop checks if the current HTML page matches the signature of any of its HTML Scrapers; these are user-defined objects which extract portions of a web page of a given type deemed to have the most useful content. Yioop comes with scrapers for Drupal, MediaWiki, VBulletin, WordPress, as well as for sites which say, using Open Graph meta tags, that they contain a video. If a signature matches, Yioop deletes from the body of the web page anything not in the Text XPath of the scraper. It then further deletes subtrees of this according to additional xpaths provided in the scraper. The scraper might also specify meta words to extract. If no scraper was applicable then the original web page is unchanged except that script and style tags are stripped. After this step the following items are extracted from the document by the HTML page processor:
- Language
-
Document language is used to determine how to make terms from the words in a document. For example, if the language is English, Yioop uses the English stemmer on a document. So the word "jumping" in the document will get indexed as "jump". On the other hand, if the language was determined to be Italian then a different stemmer would be used and "jumping" would remain "jumping". The HTML processor determines the language by first looking for a lang attribute on the <html> tag in the document or xml prologue or for a <meta > tag giving the language. If this is not en or en-US, then Yioop uses the value to determine the language. Since many websites have this attribute as en or en-US even if the website is some other language, if the lang attribute is one of these values or if it is not present, then an attempt to guess the language based on the url is done. For example, a url ending in .vn might be guessed as Vietnamese. If this test is inconclusive, then Yioop uses a text sample from the document, and records its initial length, say `L`. For each language supported by Yioop, Yioop always takes the original text and deletes from it all occurrences of the top 100 most frequent words from that language to get a document of some resulting length `L_i`. The language whose value of `L_i` is the least is then guessed to be the language of the document. This whole approach can be done using regular expressions, so is relatively fast. (A small illustrative sketch of this heuristic appears after this list.)
- Title
-
When search results are displayed, the extracted document title is used as the link text. Words in the title also are given a higher value when Yioop calculates its relevance statistic. The HTML processor uses the contents of the <title> tag as its default title. If this tag is not present or is empty, Yioop tries to use the first, non-empty <h1>, ..., <h5>, or <h6> tag in the document as the title. The HTML processor keeps only the first hundred (HtmlProcessor::MAX_TITLE_LEN) characters of the title.
- Description
-
The description is used when search results are displayed to generate the snippets beneath the result link. Besides the title, it contains the remainder of the words from the page that are indexed to identify a document. The HTML processor obtains a description using one of four algorithms that can be set in Page Options. Yioop's built-in Help wiki describes what each of these does in detail, but at a high level each of them splits the document into units we'll call sentences, assigns a score to these sentences, and then selects the top sentences up to the maximum page length specified under Page Options (defaults to 2000 characters). It then concatenates the selected sentences in the order they originally appeared in the document to make the summary. After summarization, Yioop keeps track of the scores it had for each sentence in the document. When later computing the importance of a term to a document, Yioop sums, over the sentences, the frequency of the term in the sentence times the score for that sentence.
- Links
-
Links are used by Yioop to obtain new pages to download. They are also treated by Yioop as "mini-documents". The url of such a mini-document is the target website of the link, the link text less useless stop-words such as "click" and "here" is used as a description. As we will see during searching, these mini-documents get combined with the summary of the site linked to. The HTML processor extracts links from <a>, <frame>, <iframe>, and <img> tags. It extracts up to 300 links per document. When it extracts links it canonicalizes relative links. If a <base> tag was present, it uses it as part of the canonicalization process. Link text is extracted from <a> tag contents and from alt attributes of <img>'s. In addition, rel attributes are examined for robot directives such as nofollow.
- Robot Metas
-
This is used to keep track of any robot directives that occurred in meta tags in the document. These directives are things such as NOFOLLOW, NOINDEX, NOARCHIVE, and NOSNIPPET. These can affect what links are extracted from the page, whether the page is indexed, whether cached versions of the page will be displayable from the Yioop interface, and whether snippets can appear beneath the link on a search result page. The HTML processor does a case insensitive match on <meta> tags that contain the string "robot" (so it will treat such tags that contain robot and robots the same). It then extracts the directives from the content attribute of such a tag.
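As an illustration of the stop-word deletion heuristic described under Language above, here is a minimal sketch. The tiny word lists and the function name are made up for illustration; Yioop's actual implementation works with the top 100 most frequent words per supported language.

```php
<?php
// Sketch of the stop-word deletion heuristic for language guessing described above.
// The word lists below are placeholders; this is not Yioop's actual implementation.
function guessLanguage(string $sample, array $stop_words_by_lang): string
{
    $best_lang = "";
    $best_length = PHP_INT_MAX;
    foreach ($stop_words_by_lang as $lang => $words) {
        // Delete all occurrences of this language's most frequent words.
        $pattern = '/\b(' . implode('|', array_map('preg_quote', $words)) . ')\b/ui';
        $remaining = preg_replace($pattern, '', $sample) ?? $sample;
        $length = mb_strlen($remaining);
        // The language that removes the most text (smallest remainder) wins.
        if ($length < $best_length) {
            $best_length = $length;
            $best_lang = $lang;
        }
    }
    return $best_lang;
}

$stop_words_by_lang = [
    "en" => ["the", "and", "of", "to", "is"],
    "it" => ["il", "di", "che", "e", "la"],
];
echo guessLanguage("la casa di carta e il mare", $stop_words_by_lang), "\n"; // it
echo guessLanguage("the cat and the hat", $stop_words_by_lang), "\n";        // en
```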
The page processors for other mimetypes extract similar fields but look at different components of their respective document types.
When a page processor is done with a page, for a page that isn't a robots.txt page and that isn't a sitemap page, the page processor calls a pruneLinks method with the links found on that page. This culls the up to 300 links that might have been extracted down to 50. To do this, the effective number of terms in each link is computed and the 50 links with the highest effective number of terms are returned. Here the effective number of terms is defined to be the number of terms in the link text times the ratio of the gzipped length of the link text to its ungzipped length. Gzipping is a crude way to eliminate repetitions and so obliquely captures the number of unique words.
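The following is a minimal sketch of this effective-number-of-terms idea, assuming plain whitespace tokenization; it is not Yioop's pruneLinks code.

```php
<?php
// Sketch of the "effective number of terms" idea used when pruning links:
// the term count is scaled by the gzip compression ratio of the link text,
// so highly repetitive text counts for less. Requires the zlib extension.
function effectiveNumberOfTerms(string $link_text): float
{
    $num_terms = count(preg_split('/\s+/', trim($link_text), -1, PREG_SPLIT_NO_EMPTY));
    if ($num_terms == 0) {
        return 0.0;
    }
    $ratio = strlen(gzcompress($link_text)) / strlen($link_text);
    return $num_terms * $ratio;
}
// Repetitive anchor text typically scores lower than varied anchor text.
echo effectiveNumberOfTerms("click here click here click here"), "\n";
echo effectiveNumberOfTerms("yioop open source search engine"), "\n";
```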
It should be recalled that links are treated as their own little documents and so will appear as separate documents when making the inverted index. The url of a link is what it points to, not the page it is on. So the hostname of the machine that it points to might not be a hostname handled by the queue server from which the schedule was downloaded. In reality, the fetcher actually partitions link documents according to the queue server that will handle each link. So the fetcher sends to the queue server the schedule was downloaded from only the summary data, host error data, robots.txt data, and discovered link data that was destined for it. It keeps in memory all the other inverted index data destined for other machines. It will send this data to the appropriate queue servers later -- the next time it downloads and processes data for these servers. To make sure this scales, the fetcher checks its memory usage; if free memory is getting low, it might send some of this data for other queue servers early.
Queue Servers and their Effect on Search Ranking
It is back on a queue server that the building blocks for the Doc Rank, Relevance and Proximity scores are assembled. To see how this happens we continue to follow the flow of the data through the web crawl process.
To communicate with a queue server, a fetcher posts data to the web app of the queue server. The web app writes mini-inverted index and summary data into a file in the WORK_DIRECTORY/schedules/IndexDataCRAWL_TIMESTAMP folder. Similarly, robots.txt data from a batch of 400-500 pages is written to WORK_DIRECTORY/schedules/RobotDataCRAWL_TIMESTAMP, and "to crawl" urls are written to WORK_DIRECTORY/schedules/ScheduleDataCRAWL_TIMESTAMP. The Queue Server periodically checks these folders for new files to process. It is often the case that files can be written to these folders faster than the Queue Server can process them.
A queue server consists of three separate sub-processes:
- An Indexer
-
The indexer is responsible for managing the overall indexing process. It adds Index Data file information to the active partition and periodically launches a DictionaryUpdater sub-sub-process.
- A DictionaryUpdater
-
When the active partition gets full, a new partition is started. The DictionaryUpdater sub-sub-process builds an inverted index for the old active partition and adds the result to the overall index.
- A Scheduler
-
The scheduler maintains a priority queue of what urls to download next. It is responsible for reading ScheduleData files to update its priority queue and for making sure that urls forbidden by RobotData files do not enter the queue.
When the Indexer processes a schedules/IndexData file, it saves the data in an IndexDocumentBundle (src/library/IndexDocumentBundle.php). These objects are serialized to folders with names of the form WORK_DIRECTORY/cache/IndexDataCRAWL_TIMESTAMP. These folders contain a variety of files with records in them. When Yioop reads or writes a record for any of these files it uses a variable length record format given in Yioop's src/library/PackedTableTool.php. Such a record consists of a preamble saying the number of bytes for individual columns, followed by the data. For example, for each int column this preamble would contain a 2 bit code to say whether the int should be stored using 1, 2, 4, or 8 bytes. Bytes with value 255 are eliminated from a record by coding 255 as the two bytes 254 254 and coding 254 as 254 253. The value 255 is then used as a record separator. Returning to our discussion, IndexDocumentBundle's have the following components:
- documents
-
This is a PartitionDocumentBundle folder (managing code is in src/library/PartitionDocumentBundle.php). It contains a sequence of file pairs called partitions: partition_SOME_INTEGER.ix, partition_SOME_INTEGER.txt.gz. Each partition typically stores around 100,000 documents on an 8GB RAM machine. The actual number is determined by the lesser of the number of documents that will fit in a 2GB file and 8 * NUMBER_OF_GB_OF_RAM * 10,000. A given partition_SOME_INTEGER.txt.gz file contains a sequence of gzip-compressed document summary, gzip-compressed document object pairs. The file partition_SOME_INTEGER.ix contains a sequence of records of the format (doc_id, offset in txt.gz file to summary, offset in txt.gz file to document, length of document object). A doc_id consists of (8 byte hash of url, 1 byte doctype code, 7 byte hash of document text, 8 byte hash of url hostname). Eight byte hashes used for id's are always md5 hashes where the first and last 8 bytes have been XOR'd; shorter, seven byte hashes use the first 7 bytes of this (a short sketch of this hashing scheme appears after this list). In addition to these files, the documents folder has a file pdb_parameters.txt which contains information about compression and record formats used, the max size in bytes for a partition, the maximum number of records for a partition, the index's ACTIVE_COUNT (typically, web page documents), and VISITED_URLS_COUNT (documents for links between pages -- recall hyperlinks are treated as documents in their own right).
- positions_doc_map
-
Consists of a sequence of integer numbered folders corresponding to partitions in the previously described documents folder. Each folder except the last contains three files: a doc_map file, a positions file, and a postings file. The last folder has in addition a last_entries file.

The doc_map file consists of a sequence of tuple pairs doc_id => position_score_list, where position_score_list is a list of pairs (position, score). The first such (position, score) is the offset to the document in the txt.gz partition file, followed by an overall score for the document in the partition. Records in the .ix file that had the same hash_url (either the actual web page, or a link to it) or the same hash of document text have at this point been grouped, and a sole representative's doc_id (usually the web page, not a link to it) has been chosen for the doc_map record. The score then is the sum of the grouped item scores, where individual item scores came from the crawl process as described in a moment. The (position, score) pairs after the first give term positions within the representative document and the scores for terms from that position back to the previous position. So a second tuple pair (10, 0.5) would indicate term locations 0, 1, 2, ..., 10 should each be weighted 0.5 when determining how important a term having one of these locations is. After this set of pairs, the position_score_list has an additional pair (0, user_score) for the score of the document with respect to each classifier being used for the crawl.

The positions file is a binary file used to store, for each term found amongst a partition's worth of documents, the locations of the term within each document that it occurred in. Such a list is stored using a gamma-code for the first value, followed by a Rice-code of a difference list of the remaining values.

The postings file has two similar formats, one for the partition new documents are being added to and one for all other partitions. For the new document partition, it contains an inverted index for that partition. That is, a sorted-and-grouped-by-term sequence of tuple pairs (term_id => posting_list_for_term), where posting_list_for_term consists of, for each document the term appears in, a tuple (index of document in doc_map file, frequency of term in document, offset of the term's position list in the positions file, length of the positions file entry). For all other partitions, the format is almost the same, except term_id's are not in the file as they are already stored in the B+-tree dictionary entry.

Finally, the last_entries file is used for record keeping for each term, to be able to output postings correctly. For a given term it consists of a tuple (term_id, last_doc_index, last_offset, num_occurrences). A term_id consists of the first seven characters of the term (padded to 7 characters using underscores if necessary), followed by an underscore, followed by an 8 byte hash of the term. Here num_occurrences is the total number of occurrences of the given term. For postings, the index of the document and the offset of the term's position list are stored as a difference from the previous value (a delta list value). The last_entries components last_doc_index and last_offset keep track of the last non-delta'd values for these components so that it is easy to compute the appropriate value for the next posting to be added.
- dictionary
-
The dictionary contains a sequence of subfolders used to implement a B+-tree containing key value pairs: term_id -> posting_list_for_term. In Yioop's B+-tree implementation, internal nodes are implemented as folders whose name is either start for the first node in a level, or the name of its least child. Leaf nodes consist of a pair of files: either start for the first leaf or iterm_id, where term_id is the least term_id in the leaf. In an internal node there is also a next file which contains the name of the next internal node. The files start and iterm_id consist of a sorted sequence of records: term_id -> sorted list of partition info records for term_id. A partition info record is a tuple (partition number, number of docs the term appeared in for the partition, total number of occurrences of the term in the partition, offset into the postings file where postings for the partition can be found, length of the posting data).
- next_partition.txt
-
Contains a single integer indicating the next partition that could be added to the dictionary but has not been added yet.
- archive_info.txt
-
This file contains information about the creation time of the archive, the crawl parameters used to obtain the data stored in the archive, and the archive version format.
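As promised above, here is a minimal sketch of the id hashing scheme used in doc_ids: 8 byte hashes are the XOR of the two halves of a raw md5, and 7 byte hashes are a prefix of that. The function names are illustrative, not Yioop's.

```php
<?php
// Sketch of the id hashing scheme described for the documents folder above.
// Not Yioop's actual code; function names are made up for illustration.
function eightByteHash(string $data): string
{
    $md5 = md5($data, true);                          // 16 raw bytes
    return substr($md5, 0, 8) ^ substr($md5, 8, 8);   // XOR first and last halves
}

function docId(string $url, string $doc_type_code, string $document_text): string
{
    // (8 byte hash of url, 1 byte doctype code, 7 byte hash of document text,
    //  8 byte hash of url hostname) = 24 bytes total
    $host = parse_url($url, PHP_URL_HOST) ?: "";
    return eightByteHash($url) . $doc_type_code[0] .
        substr(eightByteHash($document_text), 0, 7) . eightByteHash($host);
}

echo bin2hex(docId("https://www.example.com/", "H", "some page text")), "\n"; // 48 hex chars
```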
Before we proceed with a description of scoring, we briefly complete the picture of how indexing is done. When a schedules/IndexData file is read, the pages it contains are added to the end of the active partition in the documents sub-folder. To keep the search results fresh, the postings, doc_map, positions, and last_entries files in the partition sub-folder of positions_doc_map are periodically recomputed so that occurrences of terms in this most recent partition can be found. When the active partition becomes full, one final computation of these files is done, then the posting file data is added to the B+-tree, and it and the last_entries file are deleted. The lookup of a term at query time is done via the B+-tree and supplemented with any additional posting information in the active partition.
When a term is looked up, the list of documents containing that term is presented in order of partition, then within that partition in the order that the documents are listed in the doc_map for that partition. If one has two documents `A` and `B`, with `A` in partition `n` and `B` in partition `m`, where `n < m`, then `A` was crawled before `B`. We assume documents are crawled/discovered roughly in order of importance, an assumption which, at least for documents in different partitions, will tend to be correct most of the time. In creating a doc_map, we make groups of entries sharing the same hash of their urls. Each entry is given a score before grouping as
(num_doc_in_partition - entry_position_in_partition).
Such a group typically consists of a
document together with a collection of links to that document. After grouping, the document (or most important link if there is no document for that link in the partition) gets the sum of the scores of its constituents. In the next phase of grouping, documents whose text hashes to the same value are grouped. For example, https://www.yahoo.com/ and https://www.yahoo.com/?useless=1 have different urls but the same text, and so the same hash of text. The document in the group with the larger score is chosen as the representative for the combined group and the scores are added. Finally, the resulting list of groups is sorted by score and used to make the doc_map file. If one has two documents `A` and `B`, and `A` appears earlier than `B` in the same doc_map, `A` will tend to be more important than `B`, because to appear earlier it had to have a higher score by the above process. In all, this paragraph demonstrates that the pair (partition, doc_map_index) is a good measure of document importance. To make these into a single numeric value for a document `A`, we can take the number of documents in partitions after A's partition plus the number of documents after A's doc_map_index in its own partition. We denote this number_of_documents_after(A). For any crawl of fewer than 10,000,000,000 doc_map entries (roughly, documents or web pages), taking log base 10 of this number gives a value less than 10. So we define:
DOC_RANK(A) `= log_{10} mbox(number_of_documents_after(A))`
There are also some things to note about this formula:
- It is possible for the same hash_url or hash_of_text to be in multiple partitions, so the above index time grouping in the doc_map will miss grouping these items. At query time, a second round of grouping is done to capture this. The maximum of the grouped items' DOC_RANKs is then used as the group's DOC_RANK.
- The score of a document depends on the size of the index: a document found early will have a higher score in a bigger index than a document at the same location in a smaller index.
- The Doc Rank is a positive number and less than 10 provided the index of the given queue server has fewer than 10 billion items. Since to index 10 billion items using Yioop, you would probably want multiple queue servers, Doc Rank's likely remain below 10.
- Doc Rank is computed by different queue servers independent of each other for the same index. So it is possible for two summaries to have the same Doc Rank in the same index, provided they are stored on different queue servers.
- For Doc Ranks to be comparable with each other for the same index on different queue servers, it is assumed that queue servers are indexing at roughly the same speed.
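To make the Doc Rank computation concrete, here is a minimal sketch. The partition counts are made-up data, and whether the document itself is included in the count is a detail glossed over here; this is not Yioop's actual code.

```php
<?php
// Sketch of DOC_RANK as described above: log base 10 of the number of
// doc_map entries that come after document A in the index.
/**
 * @param int   $partition               partition of document A
 * @param int   $doc_map_index           index of A within its partition's doc_map
 * @param int[] $num_docs_per_partition  number of doc_map entries per partition
 */
function docRank(int $partition, int $doc_map_index, array $num_docs_per_partition): float
{
    // Documents after A in its own partition ...
    $after = $num_docs_per_partition[$partition] - $doc_map_index - 1;
    // ... plus all documents in later partitions.
    for ($p = $partition + 1; $p < count($num_docs_per_partition); $p++) {
        $after += $num_docs_per_partition[$p];
    }
    // max() guards the log for the very last document in the index.
    return log10(max(1, $after));
}

$num_docs_per_partition = [100000, 100000, 40000];
echo docRank(0, 10, $num_docs_per_partition), "\n";    // early document, rank about 5.4
echo docRank(2, 39990, $num_docs_per_partition), "\n"; // late document, rank near 0
```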
We next turn to the role of a queue server's Scheduler process in the computation of a page's Doc Rank. One easy way, which is supported by Yioop, for a Scheduler to determine what to crawl next is to use a simple queue. This would yield roughly a breadth-first traversal of the web starting from the seed sites. Since high quality pages are often a small number of hops from any page on the web, there is some evidence [NW2001] that this lazy strategy is not too bad for crawling according to document importance. However, there are better strategies. When Page Importance is chosen in the Crawl Order dropdown for a Yioop crawl, the Scheduler on each queue server works harder to make schedules so that the next pages to crawl are always the most important pages not yet seen.
One well-known algorithm for doing this kind of scheduling is called OPIC (Online Page Importance Computation) [APC2003]. The idea of OPIC is that at the start of a crawl one divides up an initial dollar of cash equally among the starting seed sites. One then picks a site with highest cash value to crawl next. If this site had `alpha` cash value, then when we crawl it and extract links, we divide up the cash and give it equally to each link. So if there were `n` links, each link would receive from the site `alpha/n` cash. Some of these sites might already have been in the queue, in which case we add to their cash total. For URLs not in the queue, we add them to the queue with initial value `alpha/n`. Each site has two scores: its current cash on hand, and the total earnings the site has ever received. When a page is crawled, its cash on hand is reset to `0`. We always choose the next page to crawl from amongst the pages with the most cash (there might be ties). OPIC can be used to get an estimate of the importance of a page, by taking its total earnings and dividing it by the total earnings received by all pages in the course of a crawl.
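A toy simulation may help fix the OPIC idea; the link graph below is invented and this is not Yioop's scheduler code:

```php
<?php
// Toy simulation of OPIC. Each page has cash on hand and total earnings;
// crawling a page distributes its cash equally to its out-links and resets
// its own cash to zero.
$links = [                 // made-up link graph: page => out-links
    "a" => ["b", "c"],
    "b" => ["c"],
    "c" => ["a"],
];
$cash = ["a" => 1.0];      // the initial dollar starts on the single seed site
$earnings = [];

for ($step = 0; $step < 10; $step++) {
    arsort($cash);                          // pick a page with the most cash
    $page = array_key_first($cash);
    $alpha = $cash[$page];
    $earnings[$page] = ($earnings[$page] ?? 0) + $alpha;
    $cash[$page] = 0;                       // cash resets when a page is crawled
    $out = $links[$page] ?? [];
    foreach ($out as $target) {             // split the cash among the out-links
        $cash[$target] = ($cash[$target] ?? 0) + $alpha / count($out);
    }
}
// Importance estimate: a page's share of all earnings so far.
$total = array_sum($earnings);
foreach ($earnings as $page => $earned) {
    printf("%s: %.3f\n", $page, $earned / $total);
}
```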
In the experiments conducted by the original paper, OPIC was shown to crawl in a better approximation to page rank order than breadth-first search. Bidoki and Yazdani [BY2008] have more recently proposed a new page importance measure, DistanceRank; they also confirm that OPIC does better than breadth-first, but show that the computationally more expensive Partial PageRank and Partial DistanceRank perform even better. Prior to Version 9.0, Yioop used a modified version of OPIC to choose which page to crawl next.
One drawback to OPIC is the need to keep track of a priority queue and to constantly adjust the weights of urls in the queue. Priority queues are much easier to implement in RAM, and so earlier versions of Yioop kept about 200,000 urls in this RAM based queue and queued, in files, in order of discovery, urls waiting to fit into this queue. This essentially meant that for larger scale crawls, Yioop was behaving more like a breadth-first crawler.
The current version of Yioop orders what to crawl next using a technique we call Host Budgeting, inspired by IRLBot [LLWL2009]. When a fetcher sends a queue server urls it has discovered, these are initially placed into a file in the WORK_DIRECTORY/schedules/ScheduleDataCRAWL_TIMESTAMP/DAY_DISCOVERED_STAMP/ subfolder, ordered by timestamp of when the urls were received. The Scheduler process periodically checks WORK_DIRECTORY/schedules/ScheduleDataCRAWL_TIMESTAMP and picks the day_folder/file of urls with the oldest timestamp for processing. These urls are sorted into files in subfolders of the crawl queue bundle folder WORK_DIRECTORY/cache/QueueBundleCRAWL_TIMESTAMP. The three main folders are: WaitRobotUrl, whose subfolders are organized as HASH_HOSTNAME/DAY_DISCOVERED_STAMP/compressed_files_of_urls; CrawlDelayedHosts, whose subfolders are organized as HASH_HOSTNAME/DAY_DISCOVERED_STAMP/compressed_files_of_urls; and UrlQueue, whose subfolders are organized as TierNUMBER/DAY_ADDED_STAMP/compressed_files_of_urls. When sorting the urls in a schedules file, if a robots.txt file has already been downloaded for a url, and the url is safe to crawl, it will be immediately added to some UrlQueue sub-folder. If the robots.txt for the url hasn't been downloaded, the url is sorted into the WaitRobotUrl folder and file for the hostname of that url, and the url for its robots.txt file is added to a UrlQueue folder. To make sure the same url is not added multiple times to the QueueBundle data, a Bloom filter (a kind of multi-key hash set) is used. Data for this Bloom filter is materialized to the folder WORK_DIRECTORY/cache/QueueBundleCRAWL_TIMESTAMP/UrlExistsFilterBundle. To determine which Tier a url gets put into when it is placed in a UrlQueue folder, a QueueBundle uses a subfolder CLDData (company level domain data) to maintain a linear hash table of
HASH_OF_COMPANY_LEVEL_DOMAIN => (SEEN_URLS, WEIGHTED_SEEN_URLS, WEIGHTED_INCOMING_URLS)
associations. For a host of the form something.2chars.2chars or blah.something.2chars.2chars, the company level domain is something.2chars.2chars. For example, for www.yahoo.co.uk, the company level domain is yahoo.co.uk. For any other url, stuff.2ndlevel.tld, the company level domain is 2ndlevel.tld. For example, for www.yahoo.com, the company level domain is yahoo.com. The idea is that such a domain usually needs to be paid for to obtain from an internet service provider. In the above, SEEN_URLS is just the raw count of the number of urls seen for the given domain. When links for a web page are processed by a fetcher, an initial pass to filter out uncrawlable links is done, and the links are sorted in order of importance by a crude estimate of the usefulness of the link text. The `i`th link according to this order will initially have a link weight of `i` when it arrives at the queue server. When the url of this link is added to UrlQueue, rather than adding 1 to WEIGHTED_SEEN_URLS, instead
`\qquad\qquad max(1, 1 + log(1 + i, 5))`
is added.
So a less important link will tend to add more to WEIGHTED_SEEN_URLS. WEIGHTED_INCOMING_URLS is only adjusted for urls that came from a link from a different CLD, in which case, if the linking page had weight `w` and was crawled after being scheduled in Tier `t`,
`\qquad\qquad 1/(1.1 + t + log(1 + w, 5))`
is added to WEIGHTED_INCOMING_URLS. Finally, after adjusting WEIGHTED_SEEN_URLS, WEIGHTED_INCOMING_URLS for its cld, a url would be written into a crawl file in Tier:
`\qquad\qquad \lfloor log_{10}(max(1, mbox(WEIGHTED_SEEN_URLS) - mbox(WEIGHTED_INCOMING_URLS))) \rfloor`
For urls to sitemaps a constant SITEMAP_TIER_PENALTY (4) is added to this score. For non-HTML pages, one is added to the tier. Notice the basic formula implies that the first few links added for a given CLD will tend to be in low numbered tiers. The subtraction of WEIGHTED_INCOMING_URLS in the above formula means that if a CLD has a lot of good incoming links, it will tend to be able to have more pages scheduled to lower numbered tiers. Also, notice that the higher numbered a tier is, the more likely it is that there will be a lot of urls waiting to actually be scheduled. When the scheduler produces a schedule for a fetcher, it chooses a tier to make the schedule from in a round robin fashion. It then selects the urls in the order they came into the tier. Because of our earlier observation, this means that the first few urls (and presumably important ones) from a site will tend to be crawled before less important (discovered later) urls, but are not guaranteed to be crawled first.
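A minimal sketch of the tier computation just described follows; the statistics passed in are invented, and this is not Yioop's actual QueueBundle code.

```php
<?php
// Sketch of the tier formula above: floor(log10(max(1, weighted_seen - weighted_incoming)))
// plus penalties for sitemaps and non-HTML pages. Constant name mirrors the text.
const SITEMAP_TIER_PENALTY = 4;

function urlTier(float $weighted_seen, float $weighted_incoming,
    bool $is_sitemap = false, bool $is_html = true): int
{
    $tier = (int) floor(log10(max(1, $weighted_seen - $weighted_incoming)));
    if ($is_sitemap) {
        $tier += SITEMAP_TIER_PENALTY;
    }
    if (!$is_html) {
        $tier += 1;
    }
    return $tier;
}

// A CLD with few urls seen and some good incoming links stays in a low tier;
// a CLD with a huge number of queued urls does not.
echo urlTier(50, 20), "\n";          // tier 1
echo urlTier(250000, 100), "\n";     // tier 5
echo urlTier(50, 20, true), "\n";    // sitemap url: tier 1 + 4 = 5
```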
Before we finish talking about how the Scheduler affects when a page is crawled and hence our Doc Rank, we mention a couple of wrinkles to the crawling process. The first is that a website's robots.txt file can indicate a crawl-delay between successive requests. Yioop also might determine a site is becoming overloaded and institute its own delay for that host. If a url is for a website whose robots.txt indicates a crawl-delay, then the url needs to be spaced in the schedule at least one request round (100 urls) from the previous url for the same host. If there is an outstanding schedule with urls for that host, or if there are no more slots in the current schedule for the crawl delayed url, it gets moved into a file in the CrawlDelayedHosts folder to wait for that schedule before being requeued into a UrlQueue folder. Crawl delay can be managed in the multi-queue server setting because urls are sorted to queue servers based on the hash of the hostname, so two urls with the same hostname will always be sent to the same queue server. Another wrinkle to the crawling process is that some settings of Yioop allow pages to be recrawled after a certain amount of time. Yioop maintains an EtagExpires folder of HTTP Etag and Expires headers for urls, which indicate whether to expect a url to have changed or not since the last time it was crawled.
We conclude this section by mentioning that the Scheduler only affects when a URL is written to a schedule, which will then be used by a fetcher. It is entirely possible that two fetchers get consecutive schedules from the same Scheduler and return data to the Indexer not in the order in which they were scheduled, in which case the urls would be indexed out of order and their Doc Ranks would not be in the order in which they were scheduled. The scheduling and indexing process is only approximately correct, and we rely on query time manipulations to try to improve the accuracy.
Search Time Ranking Factors
Looking up Initial Links
So far we have discussed how the Doc Rank for a search result is calculated. In this section, we describe the process of determining search results and how Yioop calculates Relevance and Proximity at search time. When a query comes into Yioop it goes through the following stages before an actual look up is performed against an index.
- Control words are calculated. Control words are terms like m: or i: which can be used to select a mix or index to use. They are also commands like raw:, which says what level of grouping to use, or no: commands, which say not to use a standard processing technique. For example, no:guess (affects whether the next processing step is done), no:network, etc. For the remainder, we will assume the query does not contain control words.
- An attempt is made to guess the semantics of the query. This matches keywords in the query and rewrites them to other query terms. For example, a query term which is in the form of a domain name, will be rewritten to the form of a meta word, site:domain. So the query will return only pages from the domain. Currently, this processing is in a nascent stage. As another example, if you do a search only on "D", it will rewrite the search to be "letter D".
- Stemming or character `n`-gramming is done on the query and acronyms and abbreviations are rewritten. This is the same kind of operation that we did after generating summaries to extract terms for document indexing.
After going through the above steps, Yioop builds an iterator object from the resulting terms to iterate over summaries and link entries that contain all of the terms. The source code for Yioop's iterators can be found in src/library/index_bundle_iterators. As described in the section Fetchers and their Effect on Search Ranking, some or all of these terms might be whole phrases to reduce the need for computing expensive conjunctive queries. In the single queue server setting one iterator would be built for each term and these iterators would be added to an intersect iterator that returns documents on which all the terms appear. This intersect iterator has a timer associated with it to prevent it from running too long in the case of a conjunctive query of terms with long posting lists but a small intersection. These iterators are then fed into a grouping iterator, which groups links and summaries that refer to the same document url (but might have been in different partitions and so not already grouped). Recall that after downloading pages on the fetcher, we calculated a hash from the downloaded page minus tags. Documents with the same hash are also grouped together by the group iterator. The number of posting list entries that Yioop scans out on a query, referred to in the introduction, is actually the number of results the group iterator requests before grouping; call it `n`. This number can be controlled from the Yioop admin pages under Page Options > Search Time > Minimum Results to Group. A value of around 200 was found on a single machine to give decent results without the queries taking too long. Once these results have been grouped, they are scored, sorted, and the top 10 are usually presented to the user. It should be noted that these 200 or so results come from the index in roughly Doc Rank order, so it is hoped that this order is good enough that it contains the top 10 best links overall.
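Yioop's actual iterators live in src/library/index_bundle_iterators and operate lazily over compressed postings; the following is only a simplified illustration of a conjunctive intersection over already-decoded, sorted document index lists, with a cutoff mirroring the minimum-results-to-group value:

```php
<?php
// Simplified illustration of a conjunctive (intersect) scan over posting lists.
// Each list is a sorted array of document indices; scanning stops after $limit matches.
function intersectPostings(array $posting_lists, int $limit): array
{
    $results = [];
    $pointers = array_fill(0, count($posting_lists), 0);
    while (count($results) < $limit) {
        // Find the largest current document index across all lists.
        $candidates = [];
        foreach ($posting_lists as $i => $list) {
            if ($pointers[$i] >= count($list)) {
                return $results;            // some list is exhausted, we are done
            }
            $candidates[$i] = $list[$pointers[$i]];
        }
        $max = max($candidates);
        $all_match = true;
        foreach ($posting_lists as $i => $list) {
            // Advance lists that are behind the current maximum.
            while ($pointers[$i] < count($list) && $list[$pointers[$i]] < $max) {
                $pointers[$i]++;
            }
            if ($pointers[$i] >= count($list) || $list[$pointers[$i]] != $max) {
                $all_match = false;
            }
        }
        if ($all_match) {
            $results[] = $max;              // document contains every query term
            foreach ($pointers as $i => $p) {
                $pointers[$i]++;            // move every list past the match
            }
        }
    }
    return $results;
}

print_r(intersectPostings([[1, 3, 5, 8, 9], [2, 3, 8, 9], [3, 4, 8, 10]], 2)); // 3, 8
```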
In the multiple queue server setting, when the query comes in to the name server, a network iterator is built. This iterator poses the query to each queue server being administered by the name server. If `n=200`, the name server multiplies this value by the value Page Options > Search Time > Server Alpha, which we'll denote `alpha`. This defaults to 1.6, so the total is 320. It then divides this by the number of queue servers. So if there were 4 queue servers, one would have 80. It then requests the first 80 results for the query from each queue server. The queue servers don't do grouping, but just send the results of their intersect iterators to the name server, which does the grouping.
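The fan-out arithmetic above can be summarized in a couple of lines; the numbers mirror the example in the text, and the variable names are ours:

```php
<?php
// Worked example of the per-queue-server request quota described above.
$n = 200;                  // results the group iterator wants
$alpha = 1.6;              // Page Options > Search Time > Server Alpha
$num_queue_servers = 4;

$per_server = (int) ceil($n * $alpha / $num_queue_servers);
echo $per_server, "\n";    // 80 results requested from each queue server
```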
We next look at how documents are scored. We have already described Doc Rank; in addition, Relevance and Proximity are directly added to the score, as are several bonus factors. For example, Yioop directly adds to the Doc Rank a bonus, configurable under Page Options, if the url is a domain name, and a slightly higher bonus if the url is a company level domain (these default to 0.5 and 2 respectively). There are three more bonus factors depending on whether query terms appear in the host portion of the url, in the title keyword text, or in the path portion of the url. These bonus factors are also configurable under Page Options and generally are scored as:
bonus_factor*(number_of_query_terms_in_item)/number_of_terms_in_item.
To compute the Relevance of a query to a document summary, the Divergence From Randomness (DFR) scoring function developed by Amati and Rijsbergen [AvR2002] is used:
`\qquad\qquad\sum_{t in q}\frac(log_2(1 + l_t/N) + f_(t,d) log_2(1 + N/l_t))(f_(t,d) + 1)`
In the above, `q` represents the set of query terms, `t` is a term in the query, `d` is the document being scored, `N` is the total number of indexed documents, `l_t` is the total number of occurrences of `t` across all indexed documents, and `f_(t,d)` is the number of occurrences of `t` in document `d`, normalized as we describe in a moment. The raw `f_(t,d)` is tweaked slightly in two ways: first, we weight occurrences of `t` by the weight of the region of text in which they appear. This weighting of text regions for a document is stored in the doc_map file which we described when we were discussing the positions_doc_map folder earlier. The frequency we get after this weighting is then normalized to adjust for differences in the lengths of documents in the index according to Amati and Rijsbergen's [AvR2002] formula:
`f'_(t,d) = f_(t,d) cdot log(1 + l_(avg)/l_d)`
where `l_d` is the number of terms in document `d`, and `l_(avg)` is the average number of terms in a document in the index.
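Putting the two formulas together, a hedged sketch of the relevance computation might look as follows; the inputs are invented, and in Yioop the frequencies would already include the text-region weighting described earlier:

```php
<?php
// Sketch of the DFR relevance formula above; not Yioop's actual scoring code.
/**
 * @param array $query_terms       list of terms in the query
 * @param array $freq_in_doc       term => (region-weighted) frequency in the document
 * @param array $total_occurrences term => total occurrences in the index (l_t)
 * @param int   $num_docs          total number of indexed documents (N)
 * @param float $doc_len           number of terms in the document (l_d)
 * @param float $avg_doc_len       average number of terms per document (l_avg)
 */
function dfrRelevance(array $query_terms, array $freq_in_doc, array $total_occurrences,
    int $num_docs, float $doc_len, float $avg_doc_len): float
{
    $score = 0.0;
    foreach ($query_terms as $t) {
        $l_t = $total_occurrences[$t] ?? 0;
        if ($l_t == 0) {
            continue;   // term never indexed, contributes nothing
        }
        // Length-normalized frequency f'_(t,d) = f_(t,d) * log(1 + l_avg / l_d)
        $f = ($freq_in_doc[$t] ?? 0) * log(1 + $avg_doc_len / $doc_len);
        $score += (log(1 + $l_t / $num_docs, 2) + $f * log(1 + $num_docs / $l_t, 2))
            / ($f + 1);
    }
    return $score;
}

echo dfrRelevance(["search", "engine"], ["search" => 5, "engine" => 2],
    ["search" => 12000, "engine" => 7000], 100000, 250, 900), "\n";
```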
To compute the Proximity of an item `d` with respect to a query `q` with more than one term, we use the notion of a span. A span is an interval `[u_i, v_i]` of positions within `d` which contains all the terms (including repeats) in `q`, such that no smaller interval contains all the terms (including repeats). Given `d` we can calculate an initial proximity score as a sum of the inverse of the sizes of the spans:
`\qquad\qquad\mbox(Prox)(d) = sum(frac(1)(v_i - u_i + 1))`.
To normalize this with respect to document length, we multiply it by the number of terms in the query and divide it by the number of positions within the document at which a span could start. To get a final score we then multiply this by a weight factor that can be set under Page Options.
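Below is a simplified sketch of the span-based proximity score. For simplicity it treats a span as a minimal window containing each distinct query term at least once, ignoring repeated query terms and the final configurable weight factor, so it is an approximation of the definition above rather than Yioop's actual code.

```php
<?php
// Simplified span-based proximity: sum of 1/size over minimal windows that
// contain every distinct query term, normalized by document length.
/**
 * @param array $positions term => sorted list of positions of that term in the document
 * @param int   $doc_len   number of term positions in the document
 * @return float normalized proximity score
 */
function proximityScore(array $positions, int $doc_len): float
{
    $num_terms = count($positions);
    // Flatten to a position-sorted list of (position, term) events.
    $events = [];
    foreach ($positions as $term => $list) {
        foreach ($list as $p) {
            $events[] = [$p, $term];
        }
    }
    usort($events, fn($a, $b) => $a[0] <=> $b[0]);

    $score = 0.0;
    $counts = [];
    $covered = 0;
    $left = 0;
    foreach ($events as [$pos, $term]) {
        if (($counts[$term] = ($counts[$term] ?? 0) + 1) == 1) {
            $covered++;
        }
        // Shrink from the left while the window still covers every term.
        while ($covered == $num_terms) {
            [$lpos, $lterm] = $events[$left];
            if ($counts[$lterm] == 1) {
                // Removing this occurrence would break coverage: a minimal span.
                $score += 1 / ($pos - $lpos + 1);
                $covered--;
            }
            $counts[$lterm]--;
            $left++;
        }
    }
    // Normalize: multiply by number of query terms, divide by possible span starts.
    return $num_terms * $score / max(1, $doc_len);
}

echo proximityScore(["yioop" => [3, 30], "ranking" => [5, 31], "engine" => [10, 29]], 200), "\n";
```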
To compute a final score for our document, we add the Doc Rank, bonus factors, Relevance, and Proximity. As we said before, we then sort the first 200 or so documents returned by our query iterator with respect to the score calculated in this fashion and then output the top 10 results as our initial query results, completing our description of how Yioop computes search results.
References
[AvR2002] Gianni Amati and Cornelis Joost Van Rijsbergen. Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Transactions on Information Systems. Vol. 20. Iss. 4. pp. 357-389. 2002.
[APC2003] Serge Abiteboul, Mihai Preda, and Gregory Cobena. Adaptive on-line page importance computation. In: Proceedings of the 12th International Conference on World Wide Web. pp. 280-290. 2003.
[BY2008] A. M. Z. Bidoki and Nasser Yazdani. DistanceRank: An intelligent ranking algorithm for web pages. Information Processing and Management. Vol. 44. Iss. 2. pp. 877-892. March, 2008.
[BP1998] Brin, S. and Page, L. The Anatomy of a Large-Scale Hypertextual Web Search Engine. In: Seventh International World-Wide Web Conference (WWW 1998). April 14-18, 1998. Brisbane, Australia.
[CCT2000] Charles L. A. Clarke, Gordon V. Cormack, and Elizabeth A. Tudhope. Relevance Ranking for One to Three Term Queries. Information Processing and Management. Vol. 36. Iss. 2. pp. 291-311. 2000.
[LLWL2009] H.-T. Lee, D. Leonard, X. Wang, and D. Loguinov. IRLbot: Scaling to 6 Billion Pages and Beyond. ACM Transactions on the Web. Vol. 3. No. 3. June 2009.
[NW2001] Marc Najork and Janet L. Wiener. Breadth-First Search Crawling Yields High-Quality Pages. In: Proceedings of the 10th International Conference on World Wide Web. pp. 114-118. 2001.
[ZCTSR2004] Hugo Zaragoza, Nick Craswell, Michael Taylor, Suchi Saria, and Stephen Robertson. Cambridge at TREC-13: Web and HARD tracks. In: Proceedings of the 13th Text Retrieval Conference. 2004.