An Introduction to Information Retrieval. Manning, Raghavan (2009) (811397), page 74
For example, 14 documents from grain were incorrectly assigned to wheat. Adapted from Picca et al. (2006).

... system. For example, to address the second largest error in Table 14.5 (14 in the row grain), one could attempt to introduce features that distinguish wheat documents from grain documents.

Exercise 14.5

Create a training set of 300 documents, 100 each from three different languages (e.g., English, French, Spanish). Create a test set by the same procedure, but also add 100 documents from a fourth language. Train (i) a one-of classifier (ii) an any-of classifier on this training set and evaluate it on the test set. (iii) Are there any interesting differences in how the two classifiers behave on this task?

14.6 The bias-variance tradeoff

Nonlinear classifiers are more powerful than linear classifiers. For some problems, there exists a nonlinear classifier with zero classification error, but no such linear classifier. Does that mean that we should always use nonlinear classifiers for optimal effectiveness in statistical text classification?

To answer this question, we introduce the bias-variance tradeoff in this section, one of the most important concepts in machine learning.
The tradeoff helps explain why there is no universally optimal learning method. Selecting an appropriate learning method is therefore an unavoidable part of solving a text classification problem.

Throughout this section, we use linear and nonlinear classifiers as prototypical examples of “less powerful” and “more powerful” learning, respectively. This is a simplification for a number of reasons. First, many nonlinear models subsume linear models as a special case. For instance, a nonlinear learning method like kNN will in some cases produce a linear classifier. Second, there are nonlinear models that are less complex than linear models. For instance, a quadratic polynomial with two parameters is less powerful than a 10,000-dimensional linear classifier. Third, the complexity of learning is not really a property of the classifier because there are many aspects of learning (such as feature selection (cf. Section 13.5, page 271), regularization, and constraints such as margin maximization in Chapter 15) that make a learning method either more powerful or less powerful without affecting the type of classifier that is the final result of learning – regardless of whether that classifier is linear or nonlinear. We refer the reader to the publications listed in Section 14.7 for a treatment of the bias-variance tradeoff that takes into account these complexities. In this section, linear and nonlinear classifiers will simply serve as proxies for weaker and stronger learning methods in text classification.

Online edition (c) 2009 Cambridge UP

We first need to state our objective in text classification more precisely.
In Section 13.1 (page 256), we said that we want to minimize classification error on the test set. The implicit assumption was that training documents and test documents are generated according to the same underlying distribution. We will denote this distribution $P(\langle d, c \rangle)$ where $d$ is the document and $c$ its label or class. Figures 13.4 and 13.5 were examples of generative models that decompose $P(\langle d, c \rangle)$ into the product of $P(c)$ and $P(d|c)$. Figures 14.10 and 14.11 depict generative models for $\langle d, c \rangle$ with $d \in \mathbb{R}^2$ and $c \in \{\text{square}, \text{solid circle}\}$.

In this section, instead of using the number of correctly classified test documents (or, equivalently, the error rate on test documents) as evaluation measure, we adopt an evaluation measure that addresses the inherent uncertainty of labeling. In many text classification problems, a given document representation can arise from documents belonging to different classes. This is because documents from different classes can be mapped to the same document representation. For example, the one-sentence documents “China sues France” and “France sues China” are mapped to the same document representation $d' = \{\text{China}, \text{France}, \text{sues}\}$ in a bag of words model. But only the latter document is relevant to the class $c'$ = legal actions brought by France (which might be defined, for example, as a standing query by an international trade lawyer).

To simplify the calculations in this section, we do not count the number of errors on the test set when evaluating a classifier, but instead look at how well the classifier estimates the conditional probability $P(c|d)$ of a document being in a class.
In the above example, we might have $P(c'|d') = 0.5$.

Our goal in text classification then is to find a classifier $\gamma$ such that, averaged over documents $d$, $\gamma(d)$ is as close as possible to the true probability $P(c|d)$. We measure this using mean squared error:

$$\mathrm{MSE}(\gamma) = E_d [\gamma(d) - P(c|d)]^2 \tag{14.6}$$

where $E_d$ is the expectation with respect to $P(d)$. The mean squared error term gives partial credit for decisions by $\gamma$ that are close if not completely right.

$$
\begin{aligned}
E[x - \alpha]^2 &= Ex^2 - 2Ex\alpha + \alpha^2 \\
&= (Ex)^2 - 2Ex\alpha + \alpha^2 + Ex^2 - 2(Ex)^2 + (Ex)^2 \\
&= [Ex - \alpha]^2 + Ex^2 - E\,2x(Ex) + E(Ex)^2 \\
&= [Ex - \alpha]^2 + E[x - Ex]^2
\end{aligned}
\tag{14.8}
$$

$$
\begin{aligned}
E_D E_d [\Gamma_D(d) - P(c|d)]^2 &= E_d E_D [\Gamma_D(d) - P(c|d)]^2 \\
&= E_d [\, [E_D \Gamma_D(d) - P(c|d)]^2 + E_D [\Gamma_D(d) - E_D \Gamma_D(d)]^2 \,]
\end{aligned}
\tag{14.9}
$$

Figure 14.13 Arithmetic transformations for the bias-variance decomposition. For the derivation of Equation (14.9), we set $\alpha = P(c|d)$ and $x = \Gamma_D(d)$ in Equation (14.8).

We define a classifier $\gamma$ to be optimal for a distribution $P(\langle d, c \rangle)$ if it minimizes $\mathrm{MSE}(\gamma)$.

Minimizing MSE is a desideratum for classifiers.
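As a minimal sketch of Equation (14.6), the following Python snippet evaluates MSE on a toy distribution. Only the China/France example comes from the text. The helper names (`bag_of_words`, `mse`), the second document, and all probabilities are hypothetical values chosen for illustration.

```python
def bag_of_words(text):
    """Map a document to its representation, modeled here as a set of terms."""
    return frozenset(text.split())


def mse(gamma, p_d, p_c_given_d):
    """Equation (14.6): expected squared gap between gamma(d) and P(c|d) under P(d)."""
    return sum(p_d[d] * (gamma(d) - p_c_given_d[d]) ** 2 for d in p_d)


# Both one-sentence documents collapse to the same representation d'.
d_prime = bag_of_words("France sues China")
same_representation = d_prime == bag_of_words("China sues France")

# Hypothetical second representation, assumed to always belong to the class.
d_other = bag_of_words("France files lawsuit")

# Assumed P(d) and P(c|d) for the class "legal actions brought by France".
p_d = {d_prime: 0.5, d_other: 0.5}
p_c_given_d = {d_prime: 0.5, d_other: 1.0}


def hard(d):
    """A classifier that always answers 'in class' with full confidence."""
    return 1.0


def soft(d):
    """A classifier that reports the true conditional probability."""
    return p_c_given_d[d]


mse_hard = mse(hard, p_d, p_c_given_d)  # 0.5 * (1.0 - 0.5)**2 + 0.5 * 0.0
mse_soft = mse(soft, p_d, p_c_given_d)  # every term is zero
print(same_representation, mse_hard, mse_soft)
```

Under these assumed numbers, the hard classifier is penalized only on the ambiguous representation, while the classifier that reports $P(c|d)$ itself reaches the minimum MSE of zero.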
We also need a criterion for learning methods. Recall that we defined a learning method $\Gamma$ as a function that takes a labeled training set $D$ as input and returns a classifier $\gamma$.

For learning methods, we adopt as our goal to find a $\Gamma$ that, averaged over training sets, learns classifiers $\gamma$ with minimal MSE. We can formalize this as minimizing learning error:

$$\text{learning-error}(\Gamma) = E_D [\mathrm{MSE}(\Gamma(D))] \tag{14.7}$$

where $E_D$ is the expectation over labeled training sets. To keep things simple, we can assume that training sets have a fixed size – the distribution $P(\langle d, c \rangle)$ then defines a distribution $P(D)$ over training sets.

We can use learning error as a criterion for selecting a learning method in statistical text classification. A learning method $\Gamma$ is optimal for a distribution $P(D)$ if it minimizes the learning error.

Writing $\Gamma_D$ for $\Gamma(D)$ for better readability, we can transform Equation (14.7) as follows:

$$
\begin{aligned}
\text{learning-error}(\Gamma) &= E_D [\mathrm{MSE}(\Gamma_D)] \\
&= E_D E_d [\Gamma_D(d) - P(c|d)]^2 && (14.10) \\
&= E_d [\text{bias}(\Gamma, d) + \text{variance}(\Gamma, d)] && (14.11)
\end{aligned}
$$

$$\text{bias}(\Gamma, d) = [P(c|d) - E_D \Gamma_D(d)]^2 \tag{14.12}$$

$$\text{variance}(\Gamma, d) = E_D [\Gamma_D(d) - E_D \Gamma_D(d)]^2 \tag{14.13}$$

where the equivalence between Equations (14.10) and (14.11) is shown in Equation (14.9) in Figure 14.13.
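The decomposition of learning error into bias and variance can be checked numerically on a small case. The sketch below is not from the text: it assumes a hypothetical joint distribution over two document representations and a binary label, uses a Laplace-smoothed relative-frequency estimate as the learning method, and fixes the training set size at 3 so that every training set can be enumerated exactly rather than sampled.

```python
from itertools import product

# Hypothetical joint distribution P(<d, c>); label 1 means "d is in class c".
P_JOINT = {
    ("d1", 1): 0.25, ("d1", 0): 0.25,  # P(c|d1) = 0.5
    ("d2", 1): 0.40, ("d2", 0): 0.10,  # P(c|d2) = 0.8
}
DOCS = ["d1", "d2"]
P_D = {d: P_JOINT[(d, 1)] + P_JOINT[(d, 0)] for d in DOCS}
P_C_GIVEN_D = {d: P_JOINT[(d, 1)] / P_D[d] for d in DOCS}


def learn(train):
    """Assumed learning method: Laplace-smoothed class frequency per document."""
    def gamma(d):
        n = sum(1 for x, _ in train if x == d)
        k = sum(1 for x, y in train if x == d and y == 1)
        return (k + 1) / (n + 2)
    return gamma


def training_sets(size):
    """Yield (D, P(D)) for every ordered training set of the given size."""
    for D in product(list(P_JOINT), repeat=size):
        p = 1.0
        for pair in D:
            p *= P_JOINT[pair]
        yield D, p


def decompose(size):
    """Return the learning error plus per-document bias and variance."""
    learned = [(learn(D), p) for D, p in training_sets(size)]
    error = 0.0
    for gamma, p in learned:
        error += p * sum(P_D[d] * (gamma(d) - P_C_GIVEN_D[d]) ** 2 for d in DOCS)
    bias, variance = {}, {}
    for d in DOCS:
        mean = sum(p * gamma(d) for gamma, p in learned)  # E_D of the prediction
        bias[d] = (P_C_GIVEN_D[d] - mean) ** 2
        variance[d] = sum(p * (gamma(d) - mean) ** 2 for gamma, p in learned)
    return error, bias, variance


learning_error, bias, variance = decompose(size=3)
decomposed = sum(P_D[d] * (bias[d] + variance[d]) for d in DOCS)
print(learning_error, decomposed)  # the two values agree up to rounding
```

The test document `d` is drawn from `P_D` separately from the training set, which mirrors the independence of $d$ and $D$ that the decomposition relies on.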
Note that $d$ and $D$ are independent of each other. In general, for a random document $d$ and a random training set $D$, $D$ does not contain a labeled instance of $d$.

Bias is the squared difference between $P(c|d)$, the true conditional probability of $d$ being in $c$, and $\Gamma_D(d)$, the prediction of the learned classifier, averaged over training sets. Bias is large if the learning method produces classifiers that are consistently wrong. Bias is small if (i) the classifiers are consistently right or (ii) different training sets cause errors on different documents or (iii) different training sets cause positive and negative errors on the same documents, but that average out to close to 0. If one of these three conditions holds, then $E_D \Gamma_D(d)$, the expectation over all training sets, is close to $P(c|d)$.

Linear methods like Rocchio and Naive Bayes have a high bias for nonlinear problems because they can only model one type of class boundary, a linear hyperplane. If the generative model $P(\langle d, c \rangle)$ has a complex nonlinear class boundary, the bias term in Equation (14.11) will be high because a large number of points will be consistently misclassified. For example, the circular enclave in Figure 14.11 does not fit a linear model and will be misclassified consistently by linear classifiers.

We can think of bias as resulting from our domain knowledge (or lack thereof) that we build into the classifier.
If we know that the true boundary between the two classes is linear, then a learning method that produces linear classifiers is more likely to succeed than a nonlinear method. But if the true class boundary is not linear and we incorrectly bias the classifier to be linear, then classification accuracy will be low on average.

Nonlinear methods like kNN have low bias. We can see in Figure 14.6 that the decision boundaries of kNN are variable – depending on the distribution of documents in the training set, learned decision boundaries can vary greatly. As a result, each document has a chance of being classified correctly for some training sets. The average prediction $E_D \Gamma_D(d)$ is therefore closer to $P(c|d)$ and bias is smaller than for a linear learning method.

Variance is the variation of the prediction of learned classifiers: the average squared difference between $\Gamma_D(d)$ and its average $E_D \Gamma_D(d)$. Variance is large if different training sets $D$ give rise to very different classifiers $\Gamma_D$. It is small if the training set has a minor effect on the classification decisions $\Gamma_D$ makes, be they correct or incorrect. Variance measures how inconsistent the decisions are, not whether they are correct or incorrect.

Linear learning methods have low variance because most randomly drawn training sets produce similar decision hyperplanes.
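The contrast between a linear method and kNN can be illustrated with a small simulation. This is a sketch under assumptions that are not in the text: documents are points on the interval [0, 1], the hypothetical `p_class` places the class on the right half plus a small enclave on the left (a one-dimensional stand-in for the enclave of Figure 14.11), the linear method is a Rocchio-style nearest-centroid rule, the nonlinear method is 1NN, and expectations over training sets are approximated by Monte Carlo sampling with a fixed seed.

```python
import random


def p_class(x):
    """Hypothetical P(c|x): high for x >= 0.5 and inside a small enclave."""
    return 0.9 if (x >= 0.5 or 0.1 <= x <= 0.2) else 0.1


def sample_training_set(rng, n):
    """Draw n labeled points from the assumed generative model."""
    xs = [rng.random() for _ in range(n)]
    return [(x, 1 if rng.random() < p_class(x) else 0) for x in xs]


def learn_centroid(train):
    """Rocchio-style linear method: assign x to the nearer class centroid."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    if not pos or not neg:  # degenerate sample: predict the only class seen
        label = 1.0 if pos else 0.0
        return lambda x: label
    mu_pos, mu_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    return lambda x: 1.0 if abs(x - mu_pos) < abs(x - mu_neg) else 0.0


def learn_1nn(train):
    """Nonlinear method: copy the label of the nearest training point."""
    return lambda x: float(min(train, key=lambda p: abs(p[0] - x))[1])


def bias_variance(learn, n_sets=100, n_train=100, n_test=100, seed=0):
    """Monte Carlo estimate of bias and variance, averaged over test points."""
    rng = random.Random(seed)  # same seed gives both methods the same training sets
    test_points = [(i + 0.5) / n_test for i in range(n_test)]
    classifiers = [learn(sample_training_set(rng, n_train)) for _ in range(n_sets)]
    bias = variance = 0.0
    for x in test_points:
        preds = [g(x) for g in classifiers]
        mean = sum(preds) / n_sets  # estimate of E_D of the prediction at x
        bias += (p_class(x) - mean) ** 2 / n_test
        variance += sum((p - mean) ** 2 for p in preds) / n_sets / n_test
    return bias, variance


lin_bias, lin_var = bias_variance(learn_centroid)
knn_bias, knn_var = bias_variance(learn_1nn)
print("centroid: bias=%.4f variance=%.4f" % (lin_bias, lin_var))
print("1NN:      bias=%.4f variance=%.4f" % (knn_bias, knn_var))
```

In this toy setting the centroid rule misclassifies the enclave for essentially every training set, which shows up as bias, while the 1NN predictions flip with the label noise of individual training points, which shows up as variance. The exact numbers depend on the assumed distribution, sample sizes, and seed.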