Filename |
---|
en-US/pages/documentation.thtml |
diff --git a/en-US/pages/documentation.thtml b/en-US/pages/documentation.thtml index 178a866..d495dc1 100755 --- a/en-US/pages/documentation.thtml +++ b/en-US/pages/documentation.thtml @@ -15,6 +15,7 @@ <li><a href="#userroles">Managing Users and Roles</a></li> <li><a href="#crawls">Managing Crawls</a></li> <li><a href="#mixes">Mixing Crawl Indexes</a></li> + <li><a href="#classifiers">Classifying Web Pages</a></li> <li><a href="#page-options">Page Indexing and Search Options</a></li> <li><a href="#editor">Results Editor</a></li> <li><a href="#sources">Search Sources</a></li> @@ -1925,11 +1926,154 @@ encoding = "ASCII"; be clicked. </p> <p><a href="#toc">Return to table of contents</a>.</p> + + <h2 id='classifiers'>Classifying Web Pages</h2> + <p>Sometimes searching for text that occurs within a page isn't enough to + find what one is looking for. For example, the relevant set of documents + may have many terms in common, with only a small subset showing up on any + particular page, so that one would have to search for many disjoint terms + in order to find all relevant pages. Or one may not know which terms are + relevant, making it hard to formulate an appropriate query. Or the relevant + documents may share many key terms with irrelevant documents, making it + difficult to formulate a query that fetches one but not the other. Under + these circumstances (among others), it would be useful to have meta words + already associated with the relevant documents, so that one could just + search for the meta word. 
The Classifiers activity provides a way to train + classifiers that recognize classes of documents; these classifiers can then + be used during a crawl to add appropriate meta words to pages determined to + belong to one or more classes.</p> + + <p>Clicking on the Classifiers activity displays a text field where you can + create a new classifier, and a table of existing classifiers, where each + row corresponds to a classifier and provides some statistics and action + links. A classifier is identified by its class label, which is also used to + form the meta word that will be attached to documents. Each classifier can + only be trained to recognize instances of a single target class, so the + class label should be a short description of that class, containing only + alphanumeric characters and underscores (e.g., "spam", + "homepage", or "menu"). Typing a new class label into + the text box and hitting the Create button initializes a new classifier, + which will then show up in the table.</p> + + <img src="resources/ClassifiersManage.png" + alt="The Classifiers manage page" /> + + <p>Once you have a fresh classifier, the natural thing to do is edit it by + clicking on the Edit action link. If you made a mistake, however, or no + longer want a classifier for some reason, then you can click on the Delete + action link to delete it; this cannot be undone. The Finalize action link + is used to prepare a classifier to classify new web pages, which cannot be + done until you've added some training examples. We'll discuss how to add + new examples next, then return to the Finalize link.</p> + + <h3>Editing a Classifier</h3> + + <p>Clicking on the Edit action link takes you to a new page where you can + change a classifier's class label, view some statistics, and provide + examples of positive and negative instances of the target class. The first + two options should be self-explanatory, but the last is somewhat involved. 
+ A classifier needs labeled training examples in order to learn to recognize + instances of a particular class, and you help provide these by picking out + example pages from previous crawls and telling the classification system + whether they belong to the class or do not belong to the class. The Add + Examples section of the Edit Classifier page lets you select an existing + crawl to draw potential examples from, and optionally narrow down the + examples to those that satisfy a query. Once you've done this, clicking the + Load button will send a request to the server to load some pages from the + crawl and choose the next one to receive a label. You'll be presented with + a record representing the selected document, similar to a search result, + with several action links along the side that let you mark this document as + either a positive or negative example of the target class, or skip this + document and move on to the next one:</p> + + <img src="resources/ClassifiersEdit.png" alt="The Classifiers edit page" /> + + <p>When you select any of the action buttons, your choice is sent back to + the server, and a new example to label is sent back (so long as there are + more examples in the selected index). The old example record is shifted + down the page and its background color updated to reflect your + decision—green for a positive example, red for a negative one, and + gray for a skip; the statistics at the top of the page are updated + accordingly. The new example record replaces the old one, and the process + repeats. Each time a new label is sent to the server, it is added to the + training set that will ultimately be used to prepare the classifier to + classify new web pages during a crawl. 
Each time you label a set number of + new examples (10 by default), the classifier will also estimate its current + accuracy by splitting the current training set into training and testing + portions, training a simple classifier on the training portion, and testing + on the remainder (checking the classifier output against the known labels). + The new estimated accuracy, calculated as the proportion of the test pages + classified correctly, is displayed under the Statistics section. You can + also manually request an updated accuracy estimate by clicking the Update + action link next to the Accuracy field. Doing this will send a request to + the server that will initiate the same process described previously, and + after a delay, display the new estimate.</p> + + <p>All of this happens without reloading the page, so avoid using the web + browser's Back button. If you do end up reloading the page somehow, then + the current example record and the list of previously-labeled examples will + be gone, but none of your progress toward building the training set will be + lost.</p> + + <h3>Finalizing a Classifier</h3> + + <p>Editing a classifier adds new labeled examples to the training set, + providing the classifier with a more complete picture of the kinds of + documents it can expect to see in the future. In order to take advantage of + an expanded training set, though, you need to <em>finalize</em> the + classifier. This is broken out into a separate step because it involves + optimizing a function over the entire training set, which can be slow for + even a few hundred example documents. It wouldn't be practical to wait for + the classifier to re-train each time you add a new example, so you have to + explicitly tell the classifier that you're done adding examples for now by + clicking on the Finalize action link on the classifier management page.</p> + + <p>Clicking this link will kick off a separate process that trains the + classifier in the background. 
When the page reloads, the Finalize link + should have changed to text that reads "Finalizing..." (but if + the training set is very small, training may complete almost immediately). + After starting finalization, it's fine to walk away for a bit, reload the + page, or carry out some unrelated task in the admin console. You shouldn't, + however, make further changes to the classifier's training set, or start a + new crawl that makes use of the classifier. When the classifier finishes + its training phase, the Finalizing message will be replaced by one that + reads "Finalized" (you'll have to reload the page, as it will not + update itself), indicating that the classifier is ready for use.</p> + + <h3>Using a Classifier</h3> + + <p>Using a classifier is as simple as selecting the classifier's label + on the Page Options activity, under the "Classifiers to Apply" + heading. When the next crawl starts, the classifier (and any other selected + classifiers) will be applied to each fetched page, and if a page is + determined to belong to a target class, it will have several meta words + added. As an example, if the target class is "spam", and a page + is determined to belong to the class with probability .79, then the + page will have the following meta words added:</p> + + <ul> + <li>class:spam</li> + <li>class:spam:50plus</li> + <li>class:spam:60plus</li> + <li>class:spam:70plus</li> + <li>class:spam:70</li> + </ul> + + <p>These meta words allow one to search for all pages classified as spam at + any probability over the preset threshold of .50 (with class:spam), at any + probability over a specific multiple of .1 (e.g., over .6 with + class:spam:60plus), or within a specific range (e.g., .60–.69 with + class:spam:60). 
Note that no meta words are added if the probability falls + below the threshold, so no page will ever have the meta words + class:spam:10plus, class:spam:20plus, class:spam:20, and so on.</p> + + <p><a href="#toc">Return to table of contents</a>.</p> + <h2 id='page-options'>Page Indexing and Search Options</h2> - <p>Several properties about how web pages are indexed and - how pages are looked up at search time can be controlled - by clicking on Page Options. There are three tabs for this activity: Crawl Time, - Search Time, and Test Options. We will discuss each of these in turn.</p> + <p>Several properties about how web pages are indexed and how pages are + looked up at search time can be controlled by clicking on Page Options. + There are three tabs for this activity: Crawl Time, Search Time, and Test + Options. We will discuss each of these in turn.</p> <h3>Crawl Time Tab</h3> <p>Clicking on Page Options leads to the default Crawl Time Tab:</p> <img src='resources/PageOptionsCrawl.png' alt='The Page Options Crawl form'/> @@ -1985,7 +2129,19 @@ encoding = "ASCII"; check the unknown checkbox in the upper left of this list. </p> <p> - The indexing plugins checkboxes, allow you to select which plugins + The Classifiers to Apply checkboxes allow you to select the classifiers + that will be used to classify pages during a crawl. Each classifier (see + the <a href="#classifiers">Classifiers</a> section for details) is + represented in the list by its class label and a checkbox. Checking the box + indicates that the associated classifier should be used (made active) + during the next crawl. Each active classifier is run on each page + downloaded during a crawl, and if the page is determined to belong to the + class that the classifier has been trained to recognize, then a meta word + like "class:<i>label</i>", where <i>label</i> is the class label, + is added to the page summary. 
+ </p> + <p> + The Indexing Plugins checkboxes allow you to select which plugins to use during the crawl. For instance, clicking the RecipePlugin checkbox would cause Yioop to run the code in indexing_plugins/recipe_plugin.php. This code tries to detect pages