Most retrieval systems today contain multiple components that use some form of classifier. The classification task we will use as an example in this book is text classification.

A computer is not essential for classification. Many classification tasks have traditionally been solved manually. Books in a library are assigned Library of Congress categories by a librarian. But manual classification is expensive to scale. The multicore computer chips example illustrates one alternative approach: classification by the use of standing queries – which can be thought of as rules – most commonly written by hand.
As in our example (multicore OR multi-core) AND (chip OR processor OR microprocessor), rules are sometimes equivalent to Boolean expressions. A rule captures a certain combination of keywords that indicates a class. Hand-coded rules have good scaling properties, but creating and maintaining them over time is labor intensive. A technically skilled person (e.g., a domain expert who is good at writing regular expressions) can create rule sets that will rival or exceed the accuracy of the automatically generated classifiers we will discuss shortly; however, it can be hard to find someone with this specialized skill.
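To make the standing-query idea concrete, here is a minimal sketch (ours, not the book's) of the rule above as a Python predicate. The function name and the use of plain substring matching are illustrative simplifications; a production rule system would use proper tokenization or regular expressions.

    def multicore_rule(document: str) -> bool:
        """Hand-coded standing query for the class 'multicore computer chips':
        (multicore OR multi-core) AND (chip OR processor OR microprocessor).
        Simple substring matching stands in for real tokenization."""
        text = document.lower()
        has_multicore = "multicore" in text or "multi-core" in text
        has_chip_term = any(t in text for t in ("chip", "processor", "microprocessor"))
        return has_multicore and has_chip_term

    # Route an incoming article to the class if the rule fires.
    print(multicore_rule("AMD unveils a new multi-core processor"))  # True
    print(multicore_rule("Coffee harvest in Kenya"))                 # False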
Apart from manual classification and hand-crafted rules, there is a third approach to text classification, namely, machine learning-based text classification. It is the approach that we focus on in the next several chapters. In machine learning, the set of rules or, more generally, the decision criterion of the text classifier, is learned automatically from training data. This approach is also called statistical text classification if the learning method is statistical. In statistical text classification, we require a number of good example documents (or training documents) for each class. The need for manual classification is not eliminated because the training documents come from a person who has labeled them – where labeling refers to the process of annotating each document with its class. But labeling is arguably an easier task than writing rules. Almost anybody can look at a document and decide whether or not it is related to China. Sometimes such labeling is already implicitly part of an existing workflow. For instance, you may go through the news articles returned by a standing query each morning and give relevance feedback (cf. Chapter 9) by moving the relevant articles to a special folder like multicore-processors.

We begin this chapter with a general introduction to the text classification problem, including a formal definition (Section 13.1); we then cover Naive Bayes, a particularly simple and effective classification method (Sections 13.2–13.4). All of the classification algorithms we study represent documents in high-dimensional spaces.
To improve the efficiency of these algorithms, it is generally desirable to reduce the dimensionality of these spaces; to this end, a technique known as feature selection is commonly applied in text classification, as discussed in Section 13.5. Section 13.6 covers evaluation of text classification. In the following chapters, Chapters 14 and 15, we look at two other families of classification methods, vector space classifiers and support vector machines.

13.1 The text classification problem

In text classification, we are given a description d ∈ X of a document, where X is the document space, and a fixed set of classes C = {c_1, c_2, ..., c_J}. Classes are also called categories or labels. Typically, the document space X is some type of high-dimensional space, and the classes are human defined for the needs of an application, as in the examples China and documents that talk about multicore computer chips above. We are given a training set D of labeled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ X × C. For example:

(13.1)    ⟨d, c⟩ = ⟨Beijing joins the World Trade Organization, China⟩

for the one-sentence document Beijing joins the World Trade Organization and the class (or label) China.

Using a learning method or learning algorithm, we then wish to learn a classifier or classification function γ that maps documents to classes:

    γ : X → C

This type of learning is called supervised learning because a supervisor (the human who defines the classes and labels training documents) serves as a teacher directing the learning process. We denote the supervised learning method by Γ and write Γ(D) = γ.
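The relationship Γ(D) = γ is easy to picture as code. The sketch below is our illustration, not the book's: the names learn, Document, and Label are hypothetical, and the body is a deliberately trivial majority-class learner; the point is only the interface of a learning method as a function from a labeled training set to a classification function.

    from collections import Counter
    from typing import Callable, List, Tuple

    Document = str  # stand-in for the document space X
    Label = str     # stand-in for the set of classes C

    # A classifier gamma maps documents to classes: gamma : X -> C.
    Classifier = Callable[[Document], Label]

    def learn(D: List[Tuple[Document, Label]]) -> Classifier:
        """The learning method Gamma: consumes labeled pairs <d, c> and
        returns a classification function gamma, i.e. Gamma(D) = gamma.
        Placeholder logic: always predict the most frequent training class."""
        majority = Counter(c for _, c in D).most_common(1)[0][0]
        def gamma(d: Document) -> Label:
            return majority
        return gamma

    D = [("Beijing joins the World Trade Organization", "China")]
    gamma = learn(D)                               # Gamma(D) = gamma
    print(gamma("first private Chinese airline"))  # 'China'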
The learning method Γ takes the training set D as input and returns the learned classification function γ.

Most names for learning methods Γ are also used for classifiers γ. We talk about the Naive Bayes (NB) learning method Γ when we say that "Naive Bayes is robust," meaning that it can be applied to many different learning problems and is unlikely to produce classifiers that fail catastrophically. But when we say that "Naive Bayes had an error rate of 20%," we are describing an experiment in which a particular NB classifier γ (which was produced by the NB learning method) had a 20% error rate in an application.

Figure 13.1 shows an example of text classification from the Reuters-RCV1 collection, introduced in Section 4.2, page 69. There are six classes (UK, China, ..., sports), each with three training documents. We show a few mnemonic words for each document's content. The training set provides some typical examples for each class, so that we can learn the classification function γ. Once we have learned γ, we can apply it to the test set (or test data), for example, the new document first private Chinese airline whose class is unknown.

◮ Figure 13.1 Classes, training set, and test set in text classification. [The figure shows six classes – the regions UK and China, the industries poultry and coffee, and the subject areas elections and sports – each with three training documents represented by a few mnemonic words, and the test document d′ = first private Chinese airline, which is assigned γ(d′) = China.]

In Figure 13.1, the classification function assigns the new document to class γ(d′) = China, which is the correct assignment.

The classes in text classification often have some interesting structure such as the hierarchy in Figure 13.1.
There are two instances each of region categories, industry categories, and subject area categories. A hierarchy can be an important aid in solving a classification problem; see Section 15.3.2 for further discussion. Until then, we will make the assumption in the text classification chapters that the classes form a set with no subset relationships between them.

Definition (13.1) stipulates that a document is a member of exactly one class. This is not the most appropriate model for the hierarchy in Figure 13.1. For instance, a document about the 2008 Olympics should be a member of two classes: the China class and the sports class. This type of classification problem is referred to as an any-of problem and we will return to it in Section 14.5 (page 306).
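The distinction shows up directly in a classifier's type signature: a one-of classifier returns a single class, while an any-of classifier returns a subset of C. A small sketch (the type aliases are our own illustration):

    from typing import Callable, Set

    Document = str
    Label = str

    # One-of: each document is assigned exactly one class.
    OneOfClassifier = Callable[[Document], Label]

    # Any-of: each document is assigned a (possibly empty) set of classes;
    # e.g. an article on the 2008 Olympics could receive {"China", "sports"}.
    AnyOfClassifier = Callable[[Document], Set[Label]]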
For the time being, we only consider one-of problems where a document is a member of exactly one class.

Our goal in text classification is high accuracy on test data or new data – for example, the newswire articles that we will encounter tomorrow morning in the multicore chip example. It is easy to achieve high accuracy on the training set (e.g., we can simply memorize the labels).
But high accuracy on the training set in general does not mean that the classifier will work well on new data in an application. When we use the training set to learn a classifier for test data, we make the assumption that training data and test data are similar or from the same distribution. We defer a precise definition of this notion to Section 14.6 (page 308).

13.2 Naive Bayes text classification

The first supervised learning method we introduce is the multinomial Naive Bayes or multinomial NB model, a probabilistic learning method. The probability of a document d being in class c is computed as

(13.2)    P(c|d) ∝ P(c) ∏_{1 ≤ k ≤ n_d} P(t_k | c)

where P(t_k | c) is the conditional probability of term t_k occurring in a document of class c. We interpret P(t_k | c) as a measure of how much evidence t_k contributes that c is the correct class.
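To see how Equation (13.2) is used in practice, here is a small sketch (ours; the parameter values are invented) that scores one class c for a tokenized document d, where n_d is the number of tokens in d. The computation is done in log space because the product of many small probabilities underflows floating point; the estimates P(c) and P(t|c) are assumed to be given.

    import math
    from typing import Dict, List

    def nb_log_score(tokens: List[str], prior: float,
                     cond_prob: Dict[str, float]) -> float:
        """Compute log P(c) + sum over 1 <= k <= n_d of log P(t_k | c):
        the logarithm of the right-hand side of Equation (13.2) for one
        class c. `prior` is P(c); `cond_prob` maps each term t to P(t | c)."""
        score = math.log(prior)              # log P(c)
        for t in tokens:                     # one factor per token position k
            score += math.log(cond_prob[t])  # log P(t_k | c)
        return score

    # Toy, made-up parameters for a class 'China':
    prior_china = 0.3
    cond_china = {"first": 0.02, "private": 0.02,
                  "chinese": 0.06, "airline": 0.01}
    d_tokens = ["first", "private", "chinese", "airline"]
    print(nb_log_score(d_tokens, prior_china, cond_china))

The class assigned to d is then the one with the highest such score, computed by repeating this scoring for every class in C.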