With such a model it becomes natural to use least squares as a data criterion for model estimation as in (2.1). Simple modifications can be made to avoid the independence assumption; for example, we can have Var(Y|X = x) = σ(x), and now both the mean and variance depend on X. In general the conditional distribution Pr(Y|X) can depend on X in complicated ways, but the additive error model precludes these.

So far we have concentrated on the quantitative response. Additive error models are typically not used for qualitative outputs G; in this case the target function p(X) is the conditional density Pr(G|X), and this is modeled directly. For example, for two-class data, it is often reasonable to assume that the data arise from independent binary trials, with the probability of one particular outcome being p(X), and the other 1 − p(X).
Thus if Y is the 0–1 coded version of G, then E(Y|X = x) = p(x), but the variance depends on x as well: Var(Y|X = x) = p(x)[1 − p(x)].

2.6.2 Supervised Learning

Before we launch into more statistically oriented jargon, we present the function-fitting paradigm from a machine learning point of view. Suppose for simplicity that the errors are additive and that the model Y = f(X) + ε is a reasonable assumption. Supervised learning attempts to learn f by example through a teacher.
One observes the system under study, both the inputs and outputs, and assembles a training set of observations T = (x_i, y_i), i = 1, . . . , N. The observed input values to the system x_i are also fed into an artificial system, known as a learning algorithm (usually a computer program), which also produces outputs f̂(x_i) in response to the inputs. The learning algorithm has the property that it can modify its input/output relationship f̂ in response to differences y_i − f̂(x_i) between the original and generated outputs. This process is known as learning by example. Upon completion of the learning process the hope is that the artificial and real outputs will be close enough to be useful for all sets of inputs likely to be encountered in practice.

2.6.3 Function Approximation

The learning paradigm of the previous section has been the motivation for research into the supervised learning problem in the fields of machine learning (with analogies to human reasoning) and neural networks (with biological analogies to the brain).
The approach taken in applied mathematics and statistics has been from the perspective of function approximation and estimation. Here the data pairs {x_i, y_i} are viewed as points in a (p + 1)-dimensional Euclidean space. The function f(x) has domain equal to the p-dimensional input subspace, and is related to the data via a model such as y_i = f(x_i) + ε_i.
For convenience in this chapter we will assume the domain is IR^p, a p-dimensional Euclidean space, although in general the inputs can be of mixed type. The goal is to obtain a useful approximation to f(x) for all x in some region of IR^p, given the representations in T. Although somewhat less glamorous than the learning paradigm, treating supervised learning as a problem in function approximation encourages the geometrical concepts of Euclidean spaces and mathematical concepts of probabilistic inference to be applied to the problem. This is the approach taken in this book.

Many of the approximations we will encounter have associated a set of parameters θ that can be modified to suit the data at hand.
For example, the linear model f(x) = x^T β has θ = β. Another class of useful approximators can be expressed as linear basis expansions

    f_θ(x) = Σ_{k=1}^{K} h_k(x) θ_k,                                  (2.30)

where the h_k are a suitable set of functions or transformations of the input vector x. Traditional examples are polynomial and trigonometric expansions, where for example h_k might be x_1^2, x_1 x_2^2, cos(x_1) and so on. We also encounter nonlinear expansions, such as the sigmoid transformation common to neural network models,

    h_k(x) = 1 / (1 + exp(−x^T β_k)).                                 (2.31)

We can use least squares to estimate the parameters θ in f_θ as we did for the linear model, by minimizing the residual sum-of-squares

    RSS(θ) = Σ_{i=1}^{N} (y_i − f_θ(x_i))^2                           (2.32)

as a function of θ.
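As a minimal numerical sketch (not part of the text), the Python fragment below fits a one-dimensional basis expansion of the form (2.30) by minimizing the residual sum-of-squares (2.32); the cubic polynomial basis and the simulated data are illustrative assumptions.

```python
import numpy as np

# Simulated training data from an additive error model y = f(x) + eps (illustrative)
rng = np.random.default_rng(0)
N = 100
x = rng.uniform(-2, 2, size=N)
y = np.sin(2 * x) + rng.normal(scale=0.3, size=N)

def basis(x):
    # h_k(x): a cubic polynomial basis, K = 4 (an illustrative choice)
    return np.column_stack([np.ones_like(x), x, x**2, x**3])

# Least squares: theta minimizes RSS(theta) = sum_i (y_i - f_theta(x_i))^2
H = basis(x)                                   # N x K matrix with entries h_k(x_i)
theta, *_ = np.linalg.lstsq(H, y, rcond=None)  # least-squares solution for theta

f_hat = lambda x_new: basis(x_new) @ theta     # fitted f_theta(x)
print("RSS:", np.sum((y - H @ theta) ** 2))
print("f_hat(0) =", f_hat(np.array([0.0]))[0])
```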
This seems a reasonable criterion for an additive error model. In terms of function approximation, we imagine our parameterized function as a surface in p + 1 space, and what we observe are noisy realizations from it. This is easy to visualize when p = 2 and the vertical coordinate is the output y, as in Figure 2.10.
The noise is in the output coordinate, so we find the set of parameters such that the fitted surface gets as close to the observed points as possible, where close is measured by the sum of squared vertical errors in RSS(θ).

For the linear model we get a simple closed form solution to the minimization problem. This is also true for the basis function methods, if the basis functions themselves do not have any hidden parameters.
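To make the closed form explicit (a standard matrix identity, stated here for concreteness rather than taken from the text): let H denote the N × K matrix whose ik-th entry is h_k(x_i), and y the vector of outputs. Provided H^T H is nonsingular, the minimizer of (2.32) is

    θ̂ = (H^T H)^{-1} H^T y,

which reduces to the familiar least-squares solution β̂ = (X^T X)^{-1} X^T y when the basis functions are simply the input coordinates themselves.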
Otherwise the solution requires either iterative methods or numerical optimization.

FIGURE 2.10. Least squares fitting of a function of two inputs. The parameters of f_θ(x) are chosen so as to minimize the sum-of-squared vertical errors.

While least squares is generally very convenient, it is not the only criterion used and in some cases would not make much sense. A more general principle for estimation is maximum likelihood estimation.
Suppose we have a random sample y_i, i = 1, . . . , N from a density Pr_θ(y) indexed by some parameters θ. The log-probability of the observed sample is

    L(θ) = Σ_{i=1}^{N} log Pr_θ(y_i).                                 (2.33)

The principle of maximum likelihood assumes that the most reasonable values for θ are those for which the probability of the observed sample is largest. Least squares for the additive error model Y = f_θ(X) + ε, with ε ∼ N(0, σ^2), is equivalent to maximum likelihood using the conditional likelihood

    Pr(Y|X, θ) = N(f_θ(X), σ^2).                                      (2.34)

So although the additional assumption of normality seems more restrictive, the results are the same.
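To see this equivalence concretely, the sketch below (an added illustration, not part of the text) fits a simulated basis-expansion model both by least squares and by numerically maximizing the Gaussian likelihood (2.34) with σ held fixed; the two estimates of θ agree.

```python
import numpy as np
from scipy.optimize import minimize

# Simulated data and cubic basis, as in the earlier sketch (illustrative choices)
rng = np.random.default_rng(0)
N = 100
x = rng.uniform(-2, 2, size=N)
y = np.sin(2 * x) + rng.normal(scale=0.3, size=N)
H = np.column_stack([np.ones_like(x), x, x**2, x**3])

# Least-squares estimate of theta
theta_ls, *_ = np.linalg.lstsq(H, y, rcond=None)

# Negative Gaussian log-likelihood for fixed sigma:
# -L(theta) = (N/2) log(2*pi) + N log(sigma) + RSS(theta) / (2 sigma^2)
sigma = 1.0
def neg_loglik(theta):
    rss = np.sum((y - H @ theta) ** 2)
    return 0.5 * N * np.log(2 * np.pi) + N * np.log(sigma) + rss / (2 * sigma**2)

theta_ml = minimize(neg_loglik, x0=np.zeros(H.shape[1])).x
print(np.allclose(theta_ls, theta_ml, atol=1e-4))  # True: the two fits coincide
```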
The log-likelihood of the data is

    L(θ) = −(N/2) log(2π) − N log σ − (1/(2σ^2)) Σ_{i=1}^{N} (y_i − f_θ(x_i))^2,   (2.35)

and the only term involving θ is the last, which is RSS(θ) up to a scalar negative multiplier.

A more interesting example is the multinomial likelihood for the regression function Pr(G|X) for a qualitative output G. Suppose we have a model Pr(G = G_k|X = x) = p_{k,θ}(x), k = 1, . . . , K for the conditional probability of each class given X, indexed by the parameter vector θ.
Then the log-likelihood (also referred to as the cross-entropy) is

    L(θ) = Σ_{i=1}^{N} log p_{g_i,θ}(x_i),                            (2.36)

and when maximized it delivers values of θ that best conform with the data in this likelihood sense.

2.7 Structured Regression Models

We have seen that although nearest-neighbor and other local methods focus directly on estimating the function at a point, they face problems in high dimensions. They may also be inappropriate even in low dimensions in cases where more structured approaches can make more efficient use of the data. This section introduces classes of such structured approaches. Before we proceed, though, we discuss further the need for such classes.

2.7.1 Difficulty of the Problem

Consider the RSS criterion for an arbitrary function f,

    RSS(f) = Σ_{i=1}^{N} (y_i − f(x_i))^2.                            (2.37)

Minimizing (2.37) leads to infinitely many solutions: any function f̂ passing through the training points (x_i, y_i) is a solution.
Any particular solution chosen might be a poor predictor at test points different from the training points. If there are multiple observation pairs x_i, y_{iℓ}, ℓ = 1, . . . , N_i at each value of x_i, the risk is limited. In this case, the solutions pass through the average values of the y_{iℓ} at each x_i; see Exercise 2.6. The situation is similar to the one we have already visited in Section 2.4; indeed, (2.37) is the finite sample version of (2.11) on page 18.
If the sample size N were sufficiently large such that repeats were guaranteed and densely arranged, it would seem that these solutions might all tend to the limiting conditional expectation.

In order to obtain useful results for finite N, we must restrict the eligible solutions to (2.37) to a smaller set of functions. How to decide on the nature of the restrictions is based on considerations outside of the data. These restrictions are sometimes encoded via the parametric representation of f_θ, or may be built into the learning method itself, either implicitly or explicitly.
These restricted classes of solutions are the major topic of this book. One thing should be clear, though. Any restrictions imposed on f that lead to a unique solution to (2.37) do not really remove the ambiguity caused by the multiplicity of solutions. There are infinitely many possible restrictions, each leading to a unique solution, so the ambiguity has simply been transferred to the choice of constraint.

In general the constraints imposed by most learning methods can be described as complexity restrictions of one kind or another.
This usually means some kind of regular behavior in small neighborhoods of the input space. That is, for all input points x sufficiently close to each other in some metric, f̂ exhibits some special structure such as nearly constant, linear or low-order polynomial behavior.
The estimator is then obtained by averaging or polynomial fitting in that neighborhood.

The strength of the constraint is dictated by the neighborhood size. The larger the size of the neighborhood, the stronger the constraint, and the more sensitive the solution is to the particular choice of constraint. For example, a local constant fit in an infinitesimally small neighborhood is no constraint at all, while a local linear fit in a very large neighborhood is almost a globally linear model, and is very restrictive.

The nature of the constraint depends on the metric used.
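As a rough numerical illustration of this point (an added sketch on simulated data, not from the text), the local constant fit below averages the responses of the k nearest training points: with k = 1 the fit simply reproduces a single training response (essentially no constraint), while with k = N it collapses to the global mean of the outputs (a very strong constraint).

```python
import numpy as np

# Simulated one-dimensional training data (illustrative)
rng = np.random.default_rng(1)
N = 50
x_train = np.sort(rng.uniform(0, 1, size=N))
y_train = np.sin(4 * x_train) + rng.normal(scale=0.3, size=N)

def local_constant_fit(x0, k):
    # Average the responses of the k nearest neighbors of x0 (absolute-distance metric)
    idx = np.argsort(np.abs(x_train - x0))[:k]
    return y_train[idx].mean()

x0 = 0.5
for k in (1, 10, N):
    print(f"k = {k:2d}: f_hat({x0}) = {local_constant_fit(x0, k):.3f}")
# k = 1 reproduces one training response (no constraint);
# k = N returns the global mean of y (a very restrictive, constant fit).
```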