Such classifiers partition the space of features into regions separated by linear decision hyperplanes, in a manner to be detailed below. Because of the bias-variance tradeoff (Section 14.6), more complex nonlinear models are not systematically better than linear models. Nonlinear models have more parameters to fit on a limited amount of training data and are more likely to make mistakes for small and noisy data sets.

Figure 14.2  Projections of small areas of the unit sphere preserve distances. Left: A projection of the 2D semicircle to 1D. For the points x1, x2, x3, x4, x5 at x coordinates −0.9, −0.2, 0, 0.2, 0.9 the distance |x2 x3| ≈ 0.201 only differs by 0.5% from |x2′ x3′| = 0.2; but |x1 x3| / |x1′ x3′| = d_true/d_projected ≈ 1.06/0.9 ≈ 1.18 is an example of a large distortion (18%) when projecting a large area. Right: The corresponding projection of the 3D hemisphere to 2D.

When applying two-class classifiers to problems with more than two classes, there are one-of tasks – a document must be assigned to exactly one of several mutually exclusive classes – and any-of tasks – a document can be assigned to any number of classes as we will explain in Section 14.5.
Two-class classifiers solve any-of problems and can be combined to solve one-of problems.

14.1 Document representations and measures of relatedness in vector spaces

As in Chapter 6, we represent documents as vectors in R^|V| in this chapter. To illustrate properties of document vectors in vector classification, we will render these vectors as points in a plane as in the example in Figure 14.1. In reality, document vectors are length-normalized unit vectors that point to the surface of a hypersphere. We can view the 2D planes in our figures as projections onto a plane of the surface of a (hyper-)sphere as shown in Figure 14.2.
Distances on the surface of the sphere and on the projection plane are approximately the same as long as we restrict ourselves to small areas of the surface and choose an appropriate projection (Exercise 14.1).

Decisions of many vector space classifiers are based on a notion of distance, e.g., when computing the nearest neighbors in kNN classification. We will use Euclidean distance in this chapter as the underlying distance measure. We observed earlier (Exercise 6.18, page 131) that there is a direct correspondence between cosine similarity and Euclidean distance for length-normalized vectors.
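A minimal illustration of this correspondence (my own sketch, not from the book): for unit vectors x and y, |x − y|² = 2(1 − cos(x, y)), so ranking by Euclidean distance and ranking by cosine similarity give the same ordering.

```python
import numpy as np

def normalize(v):
    """Length-normalize a vector so it lies on the unit hypersphere."""
    return v / np.linalg.norm(v)

x = normalize(np.array([3.0, 1.0, 0.0]))
y = normalize(np.array([1.0, 2.0, 2.0]))

cos_sim = float(x @ y)                  # cosine similarity of unit vectors
euclid = float(np.linalg.norm(x - y))   # Euclidean distance

# For length-normalized vectors: |x - y|^2 = 2 * (1 - cos(x, y)).
assert np.isclose(euclid ** 2, 2 * (1 - cos_sim))
```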
In vector space classification, it rarely matters whether the relatedness of two documents is expressed in terms of similarity or distance. However, in addition to documents, centroids or averages of vectors also play an important role in vector space classification. Centroids are not length-normalized. For unnormalized vectors, dot product, cosine similarity and Euclidean distance all have different behavior in general (Exercise 14.6). We will be mostly concerned with small local regions when computing the similarity between a document and a centroid, and the smaller the region the more similar the behavior of the three measures is.

Exercise 14.1  For small areas, distances on the surface of the hypersphere are approximated well by distances on its projection (Figure 14.2) because α ≈ sin α for small angles. For what size angle is the distortion α / sin(α) (i) 1.01, (ii) 1.05 and (iii) 1.1?

14.2 Rocchio classification

Figure 14.1 shows three classes, China, UK and Kenya, in a two-dimensional (2D) space. Documents are shown as circles, diamonds and X’s. The boundaries in the figure, which we call decision boundaries, are chosen to separate the three classes, but are otherwise arbitrary. To classify a new document, depicted as a star in the figure, we determine the region it occurs in and assign it the class of that region – China in this case.
Our task in vector space classification is to devise algorithms that compute good boundaries where "good" means high classification accuracy on data unseen during training. Perhaps the best-known way of computing good class boundaries is Rocchio classification, which uses centroids to define the boundaries. The centroid of a class c is computed as the vector average or center of mass of its members:

(14.1)    \vec{\mu}(c) = \frac{1}{|D_c|} \sum_{d \in D_c} \vec{v}(d)

where D_c is the set of documents in D whose class is c: D_c = {d : ⟨d, c⟩ ∈ D}. We denote the normalized vector of d by v(d) (Equation (6.11), page 122). Three example centroids are shown as solid circles in Figure 14.3.

Figure 14.3  Rocchio classification.

The boundary between two classes in Rocchio classification is the set of points with equal distance from the two centroids. For example, |a1| = |a2|, |b1| = |b2|, and |c1| = |c2| in the figure.
This set of points is always a line. The generalization of a line in M-dimensional space is a hyperplane, which we define as the set of points x that satisfy:

(14.2)    \vec{w}^{T}\vec{x} = b

where w is the M-dimensional normal vector¹ of the hyperplane and b is a constant. This definition of hyperplanes includes lines (any line in 2D can be defined by w1 x1 + w2 x2 = b) and 2-dimensional planes (any plane in 3D can be defined by w1 x1 + w2 x2 + w3 x3 = b).
A line divides a plane in two, a plane divides 3-dimensional space in two, and hyperplanes divide higher-dimensional spaces in two.

Thus, the boundaries of class regions in Rocchio classification are hyperplanes. The classification rule in Rocchio is to classify a point in accordance with the region it falls into. Equivalently, we determine the centroid µ(c) that the point is closest to and then assign it to c.
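This rule can be summarized in a short sketch (my own illustration with hypothetical helper names, not the book's Figure 14.4 pseudocode; it assumes documents are given as length-normalized NumPy vectors):

```python
import numpy as np

def train_rocchio(labeled_docs):
    """Compute one centroid per class, as in Equation (14.1).

    labeled_docs: list of (vector, class_label) pairs, where each vector
    is a length-normalized NumPy array.
    """
    by_class = {}
    for vec, label in labeled_docs:
        by_class.setdefault(label, []).append(vec)
    return {label: np.mean(vecs, axis=0) for label, vecs in by_class.items()}

def apply_rocchio(centroids, doc_vec):
    """Assign doc_vec to the class whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: np.linalg.norm(centroids[label] - doc_vec))
```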
As an example, consider the star in Figure 14.3. It is located in the China region of the space and Rocchio therefore assigns it to China. We show the Rocchio algorithm in pseudocode in Figure 14.4.

¹ Recall from basic linear algebra that v · w = v^T w, i.e., the dot product of v and w equals the product by matrix multiplication of the transpose of v and w.

Table 14.1  Vectors and class centroids (term weights) for the data in Table 13.1.

vector   Chinese  Japan  Tokyo  Macao  Beijing  Shanghai
d1       0        0      0      0      1.0      0
d2       0        0      0      0      0        1.0
d3       0        0      0      1.0    0        0
d4       0        0.71   0.71   0      0        0
d5       0        0.71   0.71   0      0        0
µ(c)     0        0      0      0.33   0.33     0.33
µ(c̄)     0        0.71   0.71   0      0        0

Example 14.1: Table 14.1 shows the tf-idf vector representations of the five documents in Table 13.1 (page 261), using the formula (1 + log10 tf_{t,d}) · log10(4/df_t) if tf_{t,d} > 0 (Equation (6.14), page 127).
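The weights in Table 14.1 can be reproduced with a short sketch (my own illustration; the raw texts are my restatement of the Table 13.1 collection, in which d1–d3 belong to class c = China and d4 does not):

```python
import math
from collections import Counter

# Restated raw texts of Table 13.1; d5 is the test document.
docs = {
    "d1": "Chinese Beijing Chinese",
    "d2": "Chinese Chinese Shanghai",
    "d3": "Chinese Macao",
    "d4": "Tokyo Japan Chinese",
    "d5": "Chinese Chinese Chinese Tokyo Japan",
}
train_ids = ["d1", "d2", "d3", "d4"]
vocab = ["Chinese", "Japan", "Tokyo", "Macao", "Beijing", "Shanghai"]

# Document frequencies over the N = 4 training documents.
df = {t: sum(t in docs[d].split() for d in train_ids) for t in vocab}

def tfidf_vector(doc_id):
    """Weight each term with (1 + log10 tf) * log10(4 / df), then length-normalize."""
    tf = Counter(docs[doc_id].split())
    w = [(1 + math.log10(tf[t])) * math.log10(4 / df[t]) if tf[t] > 0 else 0.0
         for t in vocab]
    norm = math.sqrt(sum(x * x for x in w))
    return [round(x / norm, 2) for x in w]

for d in docs:
    print(d, tfidf_vector(d))   # reproduces the document rows of Table 14.1
```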
The two class centroids are µ(c) = 1/3 · (d1 + d2 + d3) and µ(c̄) = 1/1 · (d4). The distances of the test document d5 from the centroids are |µ(c) − d5| ≈ 1.15 and |µ(c̄) − d5| = 0.0. Thus, Rocchio assigns d5 to c̄. The separating hyperplane in this case has the following parameters:

    \vec{w} \approx (0 \quad {-0.71} \quad {-0.71} \quad 1/3 \quad 1/3 \quad 1/3)^{T}, \qquad b = -1/3

See Exercise 14.15 for how to compute w and b. We can easily verify that this hyperplane separates the documents as desired: w^T d1 ≈ 0·0 + (−0.71)·0 + (−0.71)·0 + (1/3)·0 + (1/3)·1.0 + (1/3)·0 = 1/3 > b (and, similarly, w^T di > b for i = 2 and i = 3) and w^T d4 = −1 < b.
Thus, documents in c are above the hyperplane (w^T d > b) and documents in c̄ are below the hyperplane (w^T d < b).

The assignment criterion in Figure 14.4 is Euclidean distance (APPLYROCCHIO, line 1). An alternative is cosine similarity:

    Assign d to class c = \arg\max_{c'} \cos(\vec{\mu}(c'), \vec{v}(d))

As discussed in Section 14.1, the two assignment criteria will sometimes make different classification decisions. We present the Euclidean distance variant of Rocchio classification here because it emphasizes Rocchio's close correspondence to K-means clustering (Section 16.4, page 360).

Rocchio classification is a form of Rocchio relevance feedback (Section 9.1.1, page 178). The average of the relevant documents, corresponding to the most important component of the Rocchio vector in relevance feedback (Equation (9.3), page 182), is the centroid of the "class" of relevant documents. We omit the query component of the Rocchio formula in Rocchio classification since there is no query in text classification.
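The arithmetic of Example 14.1 can be checked with a short sketch (my own illustration; the vectors are the document and centroid rows of Table 14.1):

```python
import numpy as np

# Rows of Table 14.1: length-normalized tf-idf vectors over the vocabulary
# (Chinese, Japan, Tokyo, Macao, Beijing, Shanghai).
d1 = np.array([0, 0, 0, 0, 1.0, 0])
d2 = np.array([0, 0, 0, 0, 0, 1.0])
d3 = np.array([0, 0, 0, 1.0, 0, 0])
d4 = np.array([0, 0.71, 0.71, 0, 0, 0])
d5 = np.array([0, 0.71, 0.71, 0, 0, 0])    # test document

mu_c = (d1 + d2 + d3) / 3                  # centroid of class c (China)
mu_cbar = d4                               # centroid of class c-bar

print(np.linalg.norm(mu_c - d5))           # ~1.15
print(np.linalg.norm(mu_cbar - d5))        # 0.0, so Rocchio assigns d5 to c-bar

# Separating hyperplane of Example 14.1: w^T d > b for documents in c,
# w^T d < b for documents in c-bar.
w = np.array([0, -0.71, -0.71, 1/3, 1/3, 1/3])
b = -1/3
for name, d in [("d1", d1), ("d2", d2), ("d3", d3), ("d4", d4)]:
    print(name, round(float(w @ d), 2), "> b" if w @ d > b else "< b")
```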