Such work can and has been pursued using the structural SVM framework which we mentioned in Section 15.2.2, where the class being predicted is a ranking of results for a query, but here we will present the slightly simpler ranking SVM.

The construction of a ranking SVM proceeds as follows. We begin with a set of judged queries. For each training query $q$, we have a set of documents returned in response to the query, which have been totally ordered by a person for relevance to the query. We construct a vector of features $\psi_j = \psi(d_j, q)$ for each document/query pair, using features such as those discussed in Section 15.4.1, and many more. For two documents $d_i$ and $d_j$, we then form the vector of feature differences:

(15.18)   $\Phi(d_i, d_j, q) = \psi(d_i, q) - \psi(d_j, q)$

By hypothesis, one of $d_i$ and $d_j$ has been judged more relevant.
If $d_i$ is judged more relevant than $d_j$, denoted $d_i \prec d_j$ ($d_i$ should precede $d_j$ in the results ordering), then we will assign the vector $\Phi(d_i, d_j, q)$ the class $y_{ijq} = +1$; otherwise $-1$. The goal then is to build a classifier which will return

(15.19)   $\vec{w}^{T}\Phi(d_i, d_j, q) > 0$ iff $d_i \prec d_j$

This SVM learning task is formalized in a manner much like the other examples that we saw before:

(15.20)   Find $\vec{w}$, and $\xi_{i,j} \geq 0$ such that:

• $\frac{1}{2}\vec{w}^{T}\vec{w} + C \sum_{i,j} \xi_{i,j}$ is minimized
• and for all $\{\Phi(d_i, d_j, q) : d_i \prec d_j\}$, $\vec{w}^{T}\Phi(d_i, d_j, q) \geq 1 - \xi_{i,j}$

We can leave out $y_{ijq}$ in the statement of the constraint, since we only need to consider the constraint for document pairs ordered in one direction, since $\prec$ is antisymmetric.
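To make the construction concrete, here is a minimal sketch of ranking-SVM training, assuming scikit-learn's LinearSVC as an off-the-shelf soft-margin solver; the feature matrix psi and the judged ordering are invented placeholders rather than anything from the text. Since a stock two-class learner needs examples of both classes, the sketch feeds each difference vector in both directions with labels ±1, which is equivalent to the one-direction constraint set above.

```python
import numpy as np
from sklearn.svm import LinearSVC

def pairwise_examples(psi, ordering):
    """Build difference vectors Phi = psi_i - psi_j for every judged pair.

    psi      -- array of shape (n_docs, n_features), one row per document
    ordering -- document indices, most relevant first (d_i precedes d_j)
    """
    X, y = [], []
    for rank, i in enumerate(ordering):
        for j in ordering[rank + 1:]:
            X.append(psi[i] - psi[j])   # d_i more relevant: class +1
            y.append(+1)
            X.append(psi[j] - psi[i])   # mirrored pair: class -1
            y.append(-1)
    return np.array(X), np.array(y)

# Toy data: 4 documents with 3 features each, judged order d0 > d2 > d1 > d3.
psi = np.random.default_rng(0).normal(size=(4, 3))
X, y = pairwise_examples(psi, ordering=[0, 2, 1, 3])

# A soft-margin linear SVM on the difference vectors learns the weights w;
# fit_intercept=False keeps the classifier homogeneous, as in (15.19).
clf = LinearSVC(C=1.0, fit_intercept=False)
clf.fit(X, y)
w = clf.coef_.ravel()

# To rank documents for a query, sort by the learned score w . psi(d, q).
print(np.argsort(-(psi @ w)))   # document indices, best first
```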
These constraints are then solved, as before, to give a linear classifier which can rank pairs of documents. This approach has been used to build ranking functions which outperform standard hand-built ranking functions in IR evaluations on standard data sets; see the references for papers that present such results.

Both of the methods that we have just looked at use a linear weighting of document features that are indicators of relevance, as has most work in this area. It is therefore perhaps interesting to note that much of traditional IR weighting involves nonlinear scaling of basic measurements (such as log weighting of term frequency, or idf). At the present time, machine learning is very good at producing optimal weights for features in a linear combination (or other similar restricted model classes), but it is not good at coming up with good nonlinear scalings of basic measurements. This area remains the domain of human feature engineering.
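As a reminder of what such hand-engineered nonlinear scalings look like, here is a toy sketch, assuming the log-weighted tf and base-10 idf forms used earlier in the book; the specific numbers are invented:

```python
# A toy illustration (not the book's code) of the nonlinear scalings
# mentioned above, using base-10 logs as in the book's tf-idf examples.
import math

def sublinear_tf(tf):
    """Log-weighted term frequency: 1 + log10(tf) for tf > 0, else 0."""
    return 1.0 + math.log10(tf) if tf > 0 else 0.0

def idf(df, n_docs):
    """Inverse document frequency: log10(N / df)."""
    return math.log10(n_docs / df)

# A term occurring 100 times gets weight 3.0, not 100x the weight of a
# single occurrence -- a nonlinear scaling chosen by hand, not learned.
print(sublinear_tf(1), sublinear_tf(100))   # -> 1.0  3.0
print(idf(df=10, n_docs=10000))             # -> ~3.0
```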
The idea of learning ranking functions has been around for a number of years, but it is only very recently that sufficient machine learning knowledge, training document collections, and computational power have come together to make this method practical and exciting. It is thus too early to write something definitive on machine learning approaches to ranking in information retrieval, but there is every reason to expect the use and importance of machine learned ranking approaches to grow over time. While skilled humans can do a very good job at defining ranking functions by hand, hand tuning is difficult, and it has to be done again for each new document collection and class of users.

Exercise 15.7
Plot the first 7 rows of Table 15.3 in the α-ω plane to produce a figure like that in Figure 15.7.

Exercise 15.8
Write down the equation of a line in the α-ω plane separating the Rs from the Ns.

Exercise 15.9
Give a training example (consisting of values for α, ω and the relevance judgment) that when added to the training set makes it impossible to separate the Rs from the Ns using a line in the α-ω plane.

15.5 References and further reading

The somewhat quirky name support vector machine originates in the neural networks literature, where learning algorithms were thought of as architectures, and often referred to as “machines”.
The distinctive element of this model is that the decision boundary to use is completely decided (“supported”) by a few training data points, the support vectors.

For a more detailed presentation of SVMs, a good, well-known article-length introduction is (Burges 1998). Chen et al. (2005) introduce the more recent ν-SVM, which provides an alternative parameterization for dealing with inseparable problems, whereby rather than specifying a penalty C, you specify a parameter ν which bounds the number of examples which can appear on the wrong side of the decision surface.
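As a usage illustration (a hypothetical sketch, not from the book: it assumes scikit-learn, whose NuSVC class implements a ν-SVM in which ν upper-bounds the fraction of margin errors):

```python
# Hypothetical sketch of the nu-parameterization via scikit-learn's NuSVC.
from sklearn.datasets import make_classification
from sklearn.svm import NuSVC

X, y = make_classification(n_samples=200, random_state=0)

# nu in (0, 1] replaces the penalty C: it is an upper bound on the
# fraction of training examples allowed to be margin errors.
clf = NuSVC(nu=0.1, kernel="linear").fit(X, y)
print(clf.support_vectors_.shape)  # the few points "supporting" the boundary
```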
There are now also several books dedicated to SVMs, large margin learning, and kernels: (Cristianini and Shawe-Taylor 2000) and (Schölkopf and Smola 2001) are more mathematically oriented, while (Shawe-Taylor and Cristianini 2004) aims to be more practical. For the foundations by their originator, see (Vapnik 1998). Some recent, more general books on statistical learning, such as (Hastie et al. 2001), also give thorough coverage of SVMs.

The construction of multiclass SVMs is discussed in (Weston and Watkins 1999), (Crammer and Singer 2001), and (Tsochantaridis et al. 2005). The last reference provides an introduction to the general framework of structural SVMs.

The kernel trick was first presented in (Aizerman et al. 1964). For more about string kernels and other kernels for structured data, see (Lodhi et al. 2002) and (Gaertner et al. 2002). The Advances in Neural Information Processing Systems (NIPS) conferences have become the premier venue for theoretical machine learning work, such as on SVMs. Other venues such as SIGIR are much stronger on experimental methodology and using text-specific features to improve classifier effectiveness.

A recent comparison of most current machine learning classifiers (though on problems rather different from typical text problems) can be found in (Caruana and Niculescu-Mizil 2006).
(Li and Yang 2003), discussed in Section 13.6, is the most recent comparative evaluation of machine learning classifiers on text classification. Older examinations of classifiers on text problems can be found in (Yang 1999, Yang and Liu 1999, Dumais et al. 1998). Joachims (2002a) presents his work on SVMs applied to text problems in detail. Zhang and Oles (2001) present an insightful comparison of Naive Bayes, regularized logistic regression and SVM classifiers.

Joachims (1999) discusses methods of making SVM learning practical over large text data sets. Joachims (2006a) improves on this work.

A number of approaches to hierarchical classification have been developed in order to deal with the common situation where the classes to be assigned have a natural hierarchical organization (Koller and Sahami 1997, McCallum et al. 1998, Weigend et al. 1999, Dumais and Chen 2000). In a recent large study on scaling SVMs to the entire Yahoo! directory, Liu et al. (2005) conclude that hierarchical classification noticeably if still modestly outperforms flat classification. Classifier effectiveness remains limited by the very small number of training documents for many classes. For a more general approach that can be applied to modeling relations between classes, which may be arbitrary rather than simply the case of a hierarchy, see Tsochantaridis et al. (2005).

Moschitti and Basili (2004) investigate the use of complex nominals, proper nouns and word senses as features in text classification.

Dietterich (2002) overviews ensemble methods for classifier combination, while Schapire (2003) focuses particularly on boosting, which is applied to text classification in (Schapire and Singer 2000).

Chapelle et al. (2006) present an introduction to work in semi-supervised methods, including in particular chapters on using EM for semi-supervised text classification (Nigam et al. 2006) and on transductive SVMs (Joachims 2006b). Sindhwani and Keerthi (2006) present a more efficient implementation of a transductive SVM for large data sets.

Tong and Koller (2001) explore active learning with SVMs for text classification; Baldridge and Osborne (2004) point out that examples selected for annotation with one classifier in an active learning context may be no better than random examples when used with another classifier.

Machine learning approaches to ranking for ad hoc retrieval were pioneered in (Wong et al. 1988), (Fuhr 1992), and (Gey 1994). But limited training data and poor machine learning techniques meant that these pieces of work achieved only middling results, and hence they only had limited impact at the time.

Taylor et al. (2006) study using machine learning to tune the parameters of the BM25 family of ranking functions (Section 11.4.3, page 232) so as to maximize NDCG (Section 8.4, page 163). Machine learning approaches to ordinal regression appear in (Herbrich et al. 2000) and (Burges et al. 2005), and are applied to clickstream data in (Joachims 2002b).