For small k the few closest neighbors will have values f(x_(ℓ)) close to f(x_0), so their average should be close to f(x_0). As k grows, the neighbors are further away, and then anything can happen.

[Figure 2.11 omitted: Test and training error as a function of model complexity; curves for Test Sample and Training Sample, x-axis Model Complexity (low to high), y-axis Prediction Error; high bias/low variance at the left, low bias/high variance at the right.]

The variance term is simply the variance of an average here, and decreases as the inverse of k.
So as k varies, there is a bias–variance tradeoff.

More generally, as the model complexity of our procedure is increased, the variance tends to increase and the squared bias tends to decrease. The opposite behavior occurs as the model complexity is decreased. For k-nearest neighbors, the model complexity is controlled by k.

Typically we would like to choose our model complexity to trade bias off with variance in such a way as to minimize the test error.
An obvious estimate of test error is the training error (1/N) Σ_i (y_i − ŷ_i)². Unfortunately training error is not a good estimate of test error, as it does not properly account for model complexity.

Figure 2.11 shows the typical behavior of the test and training error, as model complexity is varied. The training error tends to decrease whenever we increase the model complexity, that is, whenever we fit the data harder. However with too much fitting, the model adapts itself too closely to the training data, and will not generalize well (i.e., have large test error).
In that case the predictions f̂(x_0) will have large variance, as reflected in the last term of expression (2.46). In contrast, if the model is not complex enough, it will underfit and may have large bias, again resulting in poor generalization. In Chapter 7 we discuss methods for estimating the test error of a prediction method, and hence estimating the optimal amount of model complexity for a given prediction method and training set.
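As an illustration of this tradeoff, here is a minimal simulation sketch (not from the text; the data-generating function, noise level, and sample sizes are arbitrary choices). It fits k-nearest-neighbor regression for several values of k and reports training and test mean squared error: small k gives near-zero training error but inflated test error, while very large k pushes both errors up through bias, mirroring the shape of Figure 2.11.

import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Arbitrary smooth regression function for the illustration.
    return np.sin(2 * np.pi * x)

# Training and test samples: y = f(x) + Gaussian noise.
n_train, n_test, sigma = 100, 1000, 0.3
x_tr = rng.uniform(0, 1, n_train)
y_tr = f(x_tr) + sigma * rng.normal(size=n_train)
x_te = rng.uniform(0, 1, n_test)
y_te = f(x_te) + sigma * rng.normal(size=n_test)

def knn_predict(x0, x, y, k):
    # Average the k training responses nearest to each query point.
    idx = np.argsort(np.abs(x[None, :] - x0[:, None]), axis=1)[:, :k]
    return y[idx].mean(axis=1)

# Small k (complex model): very low training error, higher test error.
# Large k (simple model): both errors driven up by bias.
for k in (1, 5, 15, 50):
    err_tr = np.mean((y_tr - knn_predict(x_tr, x_tr, y_tr, k)) ** 2)
    err_te = np.mean((y_te - knn_predict(x_te, x_tr, y_tr, k)) ** 2)
    print(f"k={k:3d}  train MSE={err_tr:.3f}  test MSE={err_te:.3f}")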
Bibliographic Notes

Some good general books on the learning problem are Duda et al. (2000), Bishop (1995), Bishop (2006), Ripley (1996), Cherkassky and Mulier (2007) and Vapnik (1996). Parts of this chapter are based on Friedman (1994b).

Exercises

Ex. 2.1 Suppose each of K classes has an associated target t_k, which is a vector of all zeros, except a one in the kth position. Show that classifying to the largest element of ŷ amounts to choosing the closest target, min_k ||t_k − ŷ||, if the elements of ŷ sum to one.

Ex. 2.2 Show how to compute the Bayes decision boundary for the simulation example in Figure 2.5.

Ex. 2.3 Derive equation (2.24).

Ex. 2.4 The edge effect problem discussed on page 23 is not peculiar to uniform sampling from bounded domains. Consider inputs drawn from a spherical multinormal distribution X ∼ N(0, I_p).
The squared distance from any sample point to the origin has a χ²_p distribution with mean p. Consider a prediction point x_0 drawn from this distribution, and let a = x_0/||x_0|| be an associated unit vector. Let z_i = a^T x_i be the projection of each of the training points on this direction.

Show that the z_i are distributed N(0, 1) with expected squared distance from the origin 1, while the target point has expected squared distance p from the origin. Hence for p = 10, a randomly drawn test point is about 3.1 standard deviations from the origin, while all the training points are on average one standard deviation along direction a. So most prediction points see themselves as lying on the edge of the training set.
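A quick numerical check of the figures quoted in Ex. 2.4 (a sketch, not part of the exercise; the seed and sample size are arbitrary): for p = 10 the test point lies at expected squared distance p from the origin, i.e., roughly √10 ≈ 3.16 standard deviations, while the projections z_i = a^T x_i behave like N(0, 1) draws.

import numpy as np

rng = np.random.default_rng(1)
p, n = 10, 10000

# Training points and a test point, all drawn from N(0, I_p).
X = rng.normal(size=(n, p))
x0 = rng.normal(size=p)

a = x0 / np.linalg.norm(x0)      # unit vector toward the test point
z = X @ a                        # projections of the training points

print(np.linalg.norm(x0))        # typically near sqrt(10) ~ 3.16
print(z.mean(), z.var())         # roughly 0 and 1: z_i ~ N(0, 1)
print(np.mean(np.sum(X**2, 1)))  # mean squared distance ~ p = 10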
Ex. 2.5
(a) Derive equation (2.27). The last line makes use of (3.8) through a conditioning argument.
(b) Derive equation (2.28), making use of the cyclic property of the trace operator [trace(AB) = trace(BA)], and its linearity (which allows us to interchange the order of trace and expectation).

Ex. 2.6 Consider a regression problem with inputs x_i and outputs y_i, and a parameterized model f_θ(x) to be fit by least squares. Show that if there are observations with tied or identical values of x, then the fit can be obtained from a reduced weighted least squares problem.
Ex. 2.7 Suppose we have a sample of N pairs x_i, y_i drawn i.i.d. from the distribution characterized as follows:

x_i ∼ h(x), the design density
y_i = f(x_i) + ε_i, f is the regression function
ε_i ∼ (0, σ²) (mean zero, variance σ²)

We construct an estimator for f linear in the y_i,

f̂(x_0) = Σ_{i=1}^{N} ℓ_i(x_0; X) y_i,

where the weights ℓ_i(x_0; X) do not depend on the y_i, but do depend on the entire training sequence of x_i, denoted here by X.

(a) Show that linear regression and k-nearest-neighbor regression are members of this class of estimators. Describe explicitly the weights ℓ_i(x_0; X) in each of these cases.

(b) Decompose the conditional mean-squared error

E_{Y|X} (f(x_0) − f̂(x_0))²

into a conditional squared bias and a conditional variance component. Like X, Y represents the entire training sequence of y_i.

(c) Decompose the (unconditional) mean-squared error

E_{Y,X} (f(x_0) − f̂(x_0))²

into a squared bias and a variance component.

(d) Establish a relationship between the squared biases and variances in the above two cases.
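For part (a) of Ex. 2.7, a small numerical sketch (assumptions: one-dimensional synthetic data and an intercept column in the linear regression) that writes out the two weight vectors ℓ_i(x_0; X) explicitly: for least squares they come from the row vector (1, x_0)(B^T B)^{-1} B^T, with B the design matrix, and for k-nearest neighbors they equal 1/k on the k nearest x_i and zero elsewhere.

import numpy as np

rng = np.random.default_rng(2)
N = 20
x = rng.uniform(0, 1, N)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=N)
x0 = 0.5

# Linear regression weights: l(x0) = (1, x0) (B^T B)^{-1} B^T,
# where B is the design matrix with an intercept column.
B = np.column_stack([np.ones(N), x])
b0 = np.array([1.0, x0])
w_ls = b0 @ np.linalg.solve(B.T @ B, B.T)

# k-nearest-neighbor weights: 1/k on the k nearest x_i, 0 elsewhere.
k = 5
w_knn = np.zeros(N)
w_knn[np.argsort(np.abs(x - x0))[:k]] = 1.0 / k

# Both estimators are linear in y: fhat(x0) = sum_i l_i(x0) y_i.
print(w_ls @ y, w_knn @ y)
print(w_ls.sum(), w_knn.sum())   # both weight vectors sum to 1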
Ex. 2.8 Compare the classification performance of linear regression and k-nearest-neighbor classification on the zipcode data. In particular, consider only the 2's and 3's, and k = 1, 3, 5, 7 and 15. Show both the training and test error for each choice. The zipcode data are available from the book website www-stat.stanford.edu/ElemStatLearn.
Ex. 2.9 Consider a linear regression model with p parameters, fit by least squares to a set of training data (x_1, y_1), . . . , (x_N, y_N) drawn at random from a population. Let β̂ be the least squares estimate. Suppose we have some test data (x̃_1, ỹ_1), . . . , (x̃_M, ỹ_M) drawn at random from the same population as the training data. If

R_tr(β) = (1/N) Σ_{i=1}^{N} (y_i − β^T x_i)²   and   R_te(β) = (1/M) Σ_{i=1}^{M} (ỹ_i − β^T x̃_i)²,

prove that

E[R_tr(β̂)] ≤ E[R_te(β̂)],

where the expectations are over all that is random in each expression. [This exercise was brought to our attention by Ryan Tibshirani, from a homework assignment given by Andrew Ng.]
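Not a proof, but a quick Monte Carlo check of the claimed inequality (the dimensions, sample sizes, and true coefficients below are arbitrary choices): averaged over many repetitions, the training error of the least squares fit sits below its test error.

import numpy as np

rng = np.random.default_rng(4)
p, N, M, reps = 5, 20, 20, 2000

tr_err, te_err = [], []
for _ in range(reps):
    # Training and test sets drawn from the same population.
    X = rng.normal(size=(N, p))
    y = X @ np.ones(p) + rng.normal(size=N)
    Xt = rng.normal(size=(M, p))
    yt = Xt @ np.ones(p) + rng.normal(size=M)

    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    tr_err.append(np.mean((y - X @ beta_hat) ** 2))
    te_err.append(np.mean((yt - Xt @ beta_hat) ** 2))

# The average training error comes out below the average test error.
print(np.mean(tr_err), np.mean(te_err))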
3 Linear Methods for Regression

3.1 Introduction

A linear regression model assumes that the regression function E(Y|X) is linear in the inputs X_1, . . . , X_p. Linear models were largely developed in the precomputer age of statistics, but even in today's computer era there are still good reasons to study and use them. They are simple and often provide an adequate and interpretable description of how the inputs affect the output.
For prediction purposes they can sometimes outperform fancier nonlinear models, especially in situations with small numbers of training cases, low signal-to-noise ratio or sparse data. Finally, linear methods can be applied to transformations of the inputs and this considerably expands their scope. These generalizations are sometimes called basis-function methods, and are discussed in Chapter 5.

In this chapter we describe linear methods for regression, while in the next chapter we discuss linear methods for classification. On some topics we go into considerable detail, as it is our firm belief that an understanding of linear methods is essential for understanding nonlinear ones.
In fact, many nonlinear techniques are direct generalizations of the linear methods discussed here.

3.2 Linear Regression Models and Least Squares

As introduced in Chapter 2, we have an input vector X^T = (X_1, X_2, . . . , X_p), and want to predict a real-valued output Y.
The linear regression model has the form

f(X) = β_0 + Σ_{j=1}^{p} X_j β_j.        (3.1)

The linear model either assumes that the regression function E(Y|X) is linear, or that the linear model is a reasonable approximation. Here the β_j's are unknown parameters or coefficients, and the variables X_j can come from different sources:

• quantitative inputs;
• transformations of quantitative inputs, such as log, square-root or square;
• basis expansions, such as X_2 = X_1², X_3 = X_1³, leading to a polynomial representation;
• numeric or "dummy" coding of the levels of qualitative inputs.
For example, if G is a five-level factor input, we might create X_j, j = 1, . . . , 5, such that X_j = I(G = j). Together this group of X_j represents the effect of G by a set of level-dependent constants, since in Σ_{j=1}^{5} X_j β_j, one of the X_j's is one, and the others are zero (a small coding sketch follows this list).
• interactions between variables, for example, X_3 = X_1 · X_2.

No matter the source of the X_j, the model is linear in the parameters.
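A small sketch of the dummy coding described above (the factor values are made up): a five-level factor G is expanded into five indicator columns X_j = I(G = j), so exactly one of them equals one in each row.

import numpy as np

# A made-up five-level factor, coded 1..5.
G = np.array([2, 5, 1, 3, 3, 4])
levels = np.arange(1, 6)

# X[i, j] = I(G_i = level_j): one indicator column per level.
X = (G[:, None] == levels[None, :]).astype(float)

print(X)
print(X.sum(axis=1))   # each row sums to 1: exactly one level is active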
Typically we have a set of training data (x_1, y_1), . . . , (x_N, y_N) from which to estimate the parameters β. Each x_i = (x_{i1}, x_{i2}, . . . , x_{ip})^T is a vector of feature measurements for the ith case. The most popular estimation method is least squares, in which we pick the coefficients β = (β_0, β_1, . . . , β_p)^T to minimize the residual sum of squares

RSS(β) = Σ_{i=1}^{N} (y_i − f(x_i))² = Σ_{i=1}^{N} ( y_i − β_0 − Σ_{j=1}^{p} x_{ij} β_j )².        (3.2)

From a statistical point of view, this criterion is reasonable if the training observations (x_i, y_i) represent independent random draws from their population.
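To make (3.1)–(3.2) concrete, here is a minimal least squares sketch on made-up data (the dimensions, true coefficients, and noise level are arbitrary choices): the estimate β̂ minimizes RSS(β), computed here via the normal equations.

import numpy as np

rng = np.random.default_rng(3)
N, p = 50, 3

# Made-up training data from a linear model with noise.
X = rng.normal(size=(N, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = 2.0 + X @ beta_true + 0.1 * rng.normal(size=N)

# Augment with a column of ones so beta = (beta_0, beta_1, ..., beta_p).
Xa = np.column_stack([np.ones(N), X])

def rss(beta):
    # Residual sum of squares, as in equation (3.2).
    r = y - Xa @ beta
    return r @ r

# Least squares estimate via the normal equations Xa^T Xa beta = Xa^T y.
beta_hat = np.linalg.solve(Xa.T @ Xa, Xa.T @ y)

print(beta_hat)              # close to (2.0, 1.0, -2.0, 0.5)
print(rss(beta_hat))         # smaller than RSS at a perturbed beta:
print(rss(beta_hat + 0.1))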