Filename |
---|
en-US/pages/ranking.thtml |
```diff
diff --git a/en-US/pages/ranking.thtml b/en-US/pages/ranking.thtml
index 3d0b004..ee2d315 100644
--- a/en-US/pages/ranking.thtml
+++ b/en-US/pages/ranking.thtml
@@ -518,7 +518,7 @@ frac{1}{59 + mbox(Rank)_(Rel)(d)} + frac{1}{59 + mbox(Rank)_(mbox(Prox))(d)})`
 <p>To save a fair bit of crawling overhead, Yioop does not keep for each
 site crawled historical totals of all earnings a page has received,
 the cash-based approach is only used for
-scheduling. Here are some of the changes and issues addressed in the
+scheduling. Here are some of the issues addressed in the
 OPIC-based algorithm employed by Yioop:
 </p>
 <ul>
@@ -527,18 +527,45 @@ frac{1}{59 + mbox(Rank)_(Rel)(d)} + frac{1}{59 + mbox(Rank)_(mbox(Prox))(d)})`
 into the queue before any page from that site. Until the robots.txt
 file for a page is crawled, it receives cash whenever a page on that
 host receives cash.</li>
-<li>The cash that a robots.txt file receives is be divided amongst
-any sitemap links on that page. Together with the last point,
-this means cash totals no longer sum to one.</li>
+<li>A fraction `alpha` of the cash that a robots.txt file receives is
+divided amongst any sitemap links on that page. Not all of the cash is
+shared, to prevent sitemaps from "swamping" the queue. Currently,
+`alpha` is set to 0.25. Nevertheless, together with the last bullet point,
+the fact that we do share some cash means cash totals no longer
+sum to one.</li>
 <li>Cash might go missing for several reasons: (a) An image page,
 any other page, might be downloaded with no outgoing links. (b) A page
 might receive cash
-and later the scheduler receives robots.txt information saying it cannot
+and later the Scheduler receives robots.txt information saying it cannot
 be crawled. (c) Round-off errors due to floating point precision. For
 these reasons, the Scheduler periodically renormalizes the total amount
 of cash..</li>
-<li>A robots.txt file or a slow host might cause the scheduler to
-crawl-delay all the pages on the host.</li>
+<li>A robots.txt file or a slow host might cause the Scheduler to
+crawl-delay all the pages on the host. These pages might receive sufficient
+cash to be scheduled earlier, but won't be because there must be a minimum
+time gap between requests to that host.</li>
+<li>When a schedule is made with a
+crawl-delayed host, URLs from that host cannot be scheduled until the
+fetcher that was processing them completes its schedule. If a Scheduler
+receives a "to crawl" url from a crawl-delayed host, and there are
+already MAX_WAITING_HOSTS many crawl-delayed hosts in the queue,
+then Yioop discards the url.
+</li>
+<li>The Scheduler has a maximum, in-memory queue size based on
+NUM_URLS_QUEUE_RAM (320,000 urls in a 2Gb memory configuration). It
+will wait on reading new "to crawl" schedule files from fetchers
+if reading in the file would mean going over this count. For a typical
+web crawl this means the "to crawl" files build up much like a breadth-first
+queue on disk.
+</li>
+<li>To make a schedule, the Scheduler starts processing the queue
+from highest priority to lowest. The up to 5000 urls in the schedule
+are split into slots of 100, where each slot of 100 will be required by the
+fetcher to take at least MINIMUM_FETCH_LOOP_TIME (5 seconds). Urls
+are inserted into the schedule at the earliest available position. If
+a URL is crawl-delayed it is inserted at the earliest position in the
+slot sufficiently far from any previous url for that host to ensure that
+the crawl-delay condition is met.</li>
 </ul>
 How data is split amongst Fetchers, Queue Servers and Name Servers
 Web Versus Archive Crawl
```
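The cash-sharing rules added in this patch (a page's cash split among its out-links, and only a fraction `alpha` = 0.25 of a robots.txt file's cash shared with its sitemap links) can be sketched in a few lines of PHP. This is an illustration only, not Yioop's Scheduler code; the function name, its parameters, and the choice to simply withhold the remaining `1 - alpha` of the robots.txt cash are assumptions.

```php
<?php
// Illustrative sketch (not actual Yioop code) of the OPIC-style cash sharing
// described above; function and constant names are hypothetical.

const SITEMAP_ALPHA = 0.25; // fraction of a robots.txt file's cash shared with sitemaps

/**
 * Divide the cash a downloaded page accumulated among the URLs it links to.
 * Returns a map from URL to the amount of cash that URL earns.
 */
function distributeCash(float $cash, array $out_links, bool $is_robots = false,
    array $sitemap_links = []): array
{
    $earned = [];
    if ($is_robots && $sitemap_links) {
        // Only a fraction alpha of the robots.txt cash goes to sitemap links,
        // so sitemaps cannot swamp the queue. (What happens to the remaining
        // 1 - alpha is not spelled out above; here it is simply withheld and
        // left for the Scheduler's periodic renormalization to absorb.)
        $share = SITEMAP_ALPHA * $cash / count($sitemap_links);
        foreach ($sitemap_links as $url) {
            $earned[$url] = ($earned[$url] ?? 0) + $share;
        }
        return $earned;
    }
    if ($out_links) {
        // Classic OPIC step: split the page's cash evenly among its out-links.
        $share = $cash / count($out_links);
        foreach ($out_links as $url) {
            $earned[$url] = ($earned[$url] ?? 0) + $share;
        }
    }
    // A page with no out-links (an image, say) lets its cash go missing, which
    // is one of the reasons cash totals have to be renormalized periodically.
    return $earned;
}
```

Because pages without out-links drop their cash entirely, and robots.txt files only pass on part of theirs, the totals drift away from one over time, which is exactly why the periodic renormalization mentioned in the patch is needed.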
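The slot mechanics in the last bullet (up to 5000 urls per schedule, slots of 100 urls, each slot taking at least MINIMUM_FETCH_LOOP_TIME seconds, and crawl-delayed urls spaced out by host) might be sketched as follows. SLOT_SIZE, NUM_SLOTS, and the helper names are hypothetical; only MINIMUM_FETCH_LOOP_TIME and the 5000/100 figures come from the text above, and how Yioop actually computes the spacing is an assumption here.

```php
<?php
// Simplified, hypothetical sketch of slotting urls into a fetch schedule while
// honoring per-host crawl delays. Not Yioop's actual Scheduler code.

const MINIMUM_FETCH_LOOP_TIME = 5; // seconds a fetcher spends per slot (from the text)
const SLOT_SIZE = 100;             // urls per slot (from the text)
const NUM_SLOTS = 50;              // 50 slots * 100 urls = 5000 urls per schedule

/**
 * Place $url in the earliest slot that respects the host's crawl delay.
 * Returns the slot index used, or -1 if the url does not fit in this schedule.
 */
function scheduleUrl(string $url, int $crawl_delay, array &$schedule,
    array &$last_slot_for_host): int
{
    $host = parse_url($url, PHP_URL_HOST);
    // Consecutive requests to a crawl-delayed host must be far enough apart in
    // slots: each slot takes at least MINIMUM_FETCH_LOOP_TIME seconds to fetch.
    $gap = (int)ceil($crawl_delay / MINIMUM_FETCH_LOOP_TIME);
    $earliest = 0;
    if ($crawl_delay > 0 && isset($last_slot_for_host[$host])) {
        $earliest = $last_slot_for_host[$host] + max(1, $gap);
    }
    for ($i = $earliest; $i < count($schedule); $i++) {
        if (count($schedule[$i]) < SLOT_SIZE) {
            $schedule[$i][] = $url;
            $last_slot_for_host[$host] = $i;
            return $i;
        }
    }
    return -1; // no room left; the url has to wait for a later schedule
}

// Example: with a 10 second crawl delay and 5 second slots, urls from the same
// host land at least two slots apart.
$schedule = array_fill(0, NUM_SLOTS, []);
$last_slot_for_host = [];
scheduleUrl("http://example.com/a", 10, $schedule, $last_slot_for_host);
scheduleUrl("http://example.com/b", 10, $schedule, $last_slot_for_host);
```

In this sketch the two example urls end up in slots 0 and 2, so at least ten seconds of fetching separate requests to the same host, which is the crawl-delay condition the bullet describes.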