Some methods, such as kernel and local regression and tree-based methods, directly specify the metric and size of the neighborhood. The nearest-neighbor methods discussed so far are based on the assumption that locally the function is constant; close to a target input $x_0$, the function does not change much, and so close outputs can be averaged to produce $\hat{f}(x_0)$.
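As a concrete illustration (not from the text; the function name `knn_average` and the simulated data are my own), a minimal Python sketch of this locally constant nearest-neighbor average might look as follows.

```python
import numpy as np

def knn_average(x0, X, y, k):
    """Locally constant k-nearest-neighbor estimate of f at x0:
    average the outputs of the k training points closest to x0."""
    dist = np.linalg.norm(X - x0, axis=1)   # Euclidean distances to x0
    nearest = np.argsort(dist)[:k]          # indices of the k closest points
    return y[nearest].mean()

# Usage: 200 points in two dimensions, estimate at a target point x0.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=200)
print(knn_average(np.array([0.2, -0.3]), X, y, k=15))
```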
Other methods such as splines, neural networks and basis-function methods implicitly define neighborhoods of local behavior. In Section 5.4.1 we discuss the concept of an equivalent kernel (see Figure 5.8 on page 157), which describes this local dependence for any method linear in the outputs. These equivalent kernels in many cases look just like the explicitly defined weighting kernels discussed above: peaked at the target point and falling away smoothly from it.

One fact should be clear by now. Any method that attempts to produce locally varying functions in small isotropic neighborhoods will run into problems in high dimensions (again the curse of dimensionality).
And conversely, all methods that overcome the dimensionality problems have an associated, and often implicit or adaptive, metric for measuring neighborhoods, which basically does not allow the neighborhood to be simultaneously small in all directions.

2.8 Classes of Restricted Estimators

The variety of nonparametric regression techniques or learning methods fall into a number of different classes depending on the nature of the restrictions imposed. These classes are not distinct, and indeed some methods fall in several classes. Here we give a brief summary, since detailed descriptions are given in later chapters. Each of the classes has associated with it one or more parameters, sometimes appropriately called smoothing parameters, that control the effective size of the local neighborhood. Here we describe three broad classes.

2.8.1 Roughness Penalty and Bayesian Methods

Here the class of functions is controlled by explicitly penalizing $\mathrm{RSS}(f)$ with a roughness penalty

$$\mathrm{PRSS}(f; \lambda) = \mathrm{RSS}(f) + \lambda J(f). \tag{2.38}$$

The user-selected functional $J(f)$ will be large for functions $f$ that vary too rapidly over small regions of input space.
For example, the popular cubic smoothing spline for one-dimensional inputs is the solution to the penalized least-squares criterion

$$\mathrm{PRSS}(f; \lambda) = \sum_{i=1}^{N} (y_i - f(x_i))^2 + \lambda \int [f''(x)]^2\, dx. \tag{2.39}$$

The roughness penalty here controls large values of the second derivative of $f$, and the amount of penalty is dictated by $\lambda \ge 0$. For $\lambda = 0$ no penalty is imposed, and any interpolating function will do, while for $\lambda = \infty$ only functions linear in $x$ are permitted.

Penalty functionals $J$ can be constructed for functions in any dimension, and special versions can be created to impose special structure. For example, additive penalties $J(f) = \sum_{j=1}^{p} J(f_j)$ are used in conjunction with additive functions $f(X) = \sum_{j=1}^{p} f_j(X_j)$ to create additive models with smooth coordinate functions. Similarly, projection pursuit regression models have $f(X) = \sum_{m=1}^{M} g_m(\alpha_m^T X)$ for adaptively chosen directions $\alpha_m$, and the functions $g_m$ can each have an associated roughness penalty.
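For illustration, here is a minimal Python sketch of the penalized criterion (2.39), under the simplifying assumption that $f$ is represented by its values on a grid and the integrated squared second derivative is replaced by a sum of squared second differences. It is a crude stand-in for the exact cubic smoothing spline of Chapter 5; the function name and test data are my own.

```python
import numpy as np

def penalized_fit(x, y, lam, grid_size=100):
    """Discrete analogue of PRSS(f; lambda) in (2.39): minimize
    ||y - B f||^2 + lam * ||D f||^2 / h^3 over the values of f on a grid,
    where (D f) / h^2 approximates f'' and dividing by h^3 turns the sum of
    squared second differences into an approximate integral."""
    grid = np.linspace(x.min(), x.max(), grid_size)
    h = grid[1] - grid[0]

    # B maps grid values to fitted values at the observed x's
    # (nearest grid point, for brevity).
    idx = np.argmin(np.abs(x[:, None] - grid[None, :]), axis=1)
    B = np.zeros((x.size, grid_size))
    B[np.arange(x.size), idx] = 1.0

    # Second-difference operator: rows of the form (1, -2, 1).
    D = np.diff(np.eye(grid_size), n=2, axis=0)

    A = B.T @ B + (lam / h**3) * (D.T @ D)
    f_grid = np.linalg.solve(A, B.T @ y)
    return grid, f_grid

# Usage: larger lam gives smoother fits, approaching a straight line.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 80))
y = np.sin(4 * np.pi * x) + rng.normal(scale=0.3, size=x.size)
grid, f_hat = penalized_fit(x, y, lam=1e-4)
```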
Penalty function, or regularization methods, express our prior belief that the type of functions we seek exhibit a certain type of smooth behavior, and indeed can usually be cast in a Bayesian framework. The penalty $J$ corresponds to a log-prior, and $\mathrm{PRSS}(f; \lambda)$ the log-posterior distribution, and minimizing $\mathrm{PRSS}(f; \lambda)$ amounts to finding the posterior mode. We discuss roughness-penalty approaches in Chapter 5 and the Bayesian paradigm in Chapter 8.

2.8.2 Kernel Methods and Local Regression

These methods can be thought of as explicitly providing estimates of the regression function or conditional expectation by specifying the nature of the local neighborhood, and of the class of regular functions fitted locally.
The local neighborhood is specified by a kernel function $K_\lambda(x_0, x)$ which assigns weights to points $x$ in a region around $x_0$ (see Figure 6.1 on page 192). For example, the Gaussian kernel has a weight function based on the Gaussian density function

$$K_\lambda(x_0, x) = \frac{1}{\lambda} \exp\left[-\frac{\|x - x_0\|^2}{2\lambda}\right] \tag{2.40}$$

and assigns weights to points that die exponentially with their squared Euclidean distance from $x_0$. The parameter $\lambda$ corresponds to the variance of the Gaussian density, and controls the width of the neighborhood. The simplest form of kernel estimate is the Nadaraya–Watson weighted average

$$\hat{f}(x_0) = \frac{\sum_{i=1}^{N} K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^{N} K_\lambda(x_0, x_i)}. \tag{2.41}$$

In general we can define a local regression estimate of $f(x_0)$ as $f_{\hat\theta}(x_0)$, where $\hat\theta$ minimizes

$$\mathrm{RSS}(f_\theta, x_0) = \sum_{i=1}^{N} K_\lambda(x_0, x_i)\,(y_i - f_\theta(x_i))^2, \tag{2.42}$$

and $f_\theta$ is some parameterized function, such as a low-order polynomial. Some examples are:

• $f_\theta(x) = \theta_0$, the constant function; this results in the Nadaraya–Watson estimate in (2.41) above.
• $f_\theta(x) = \theta_0 + \theta_1 x$ gives the popular local linear regression model.
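To make (2.40)–(2.42) concrete, the following Python sketch (function names and simulated data are my own, and no bandwidth selection is attempted) computes the Nadaraya–Watson average and a local linear fit at a single target point.

```python
import numpy as np

def gaussian_kernel(x0, x, lam):
    """Gaussian kernel weights as in (2.40); lam controls the neighborhood width."""
    return np.exp(-np.sum((x - x0) ** 2, axis=-1) / (2.0 * lam)) / lam

def nadaraya_watson(x0, X, y, lam):
    """Locally constant fit: the kernel-weighted average (2.41)."""
    w = gaussian_kernel(x0, X, lam)
    return np.sum(w * y) / np.sum(w)

def local_linear(x0, X, y, lam):
    """Locally linear fit: minimize the weighted RSS (2.42) with
    f_theta(x) = theta_0 + theta_1 * x, then evaluate at x0."""
    w = gaussian_kernel(x0, X, lam)
    B = np.column_stack([np.ones(len(X)), X])          # basis (1, x)
    W = np.diag(w)
    theta = np.linalg.solve(B.T @ W @ B, B.T @ W @ y)  # weighted least squares
    return np.array([1.0, *np.atleast_1d(x0)]) @ theta

# Usage on one-dimensional inputs stored as a column (shape (N, 1)).
rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0, 1, 100))[:, None]
y = np.sin(3 * X[:, 0]) + rng.normal(scale=0.2, size=100)
x0 = np.array([0.5])
print(nadaraya_watson(x0, X, y, lam=0.01), local_linear(x0, X, y, lam=0.01))
```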
Nearest-neighbor methods can be thought of as kernel methods having a more data-dependent metric. Indeed, the metric for $k$-nearest neighbors is

$$K_k(x, x_0) = I(\|x - x_0\| \le \|x_{(k)} - x_0\|),$$

where $x_{(k)}$ is the training observation ranked $k$th in distance from $x_0$, and $I(S)$ is the indicator of the set $S$.

These methods of course need to be modified in high dimensions, to avoid the curse of dimensionality. Various adaptations are discussed in Chapter 6.

2.8.3 Basis Functions and Dictionary Methods

This class of methods includes the familiar linear and polynomial expansions, but more importantly a wide variety of more flexible models.
The model for $f$ is a linear expansion of basis functions

$$f_\theta(x) = \sum_{m=1}^{M} \theta_m h_m(x), \tag{2.43}$$

where each of the $h_m$ is a function of the input $x$, and the term linear here refers to the action of the parameters $\theta$. This class covers a wide variety of methods. In some cases the sequence of basis functions is prescribed, such as a basis for polynomials in $x$ of total degree $M$.

For one-dimensional $x$, polynomial splines of degree $K$ can be represented by an appropriate sequence of $M$ spline basis functions, determined in turn by $M - K$ knots.
These produce functions that are piecewise polynomials of degree $K$ between the knots, and joined up with continuity of degree $K - 1$ at the knots. As an example consider linear splines, or piecewise linear functions. One intuitively satisfying basis consists of the functions $b_1(x) = 1$, $b_2(x) = x$, and $b_{m+2}(x) = (x - t_m)_+$, $m = 1, \ldots, M - 2$, where $t_m$ is the $m$th knot, and $z_+$ denotes positive part. Tensor products of spline bases can be used for inputs with dimensions larger than one (see Section 5.2, and the CART and MARS models in Chapter 9). The parameter $\theta$ can be the total degree of the polynomial or the number of knots in the case of splines.

Radial basis functions are symmetric $p$-dimensional kernels located at particular centroids,

$$f_\theta(x) = \sum_{m=1}^{M} K_{\lambda_m}(\mu_m, x)\, \theta_m; \tag{2.44}$$

for example, the Gaussian kernel $K_\lambda(\mu, x) = e^{-\|x - \mu\|^2 / 2\lambda}$ is popular.
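A small Python sketch of such prescribed bases is given below; it assumes one-dimensional inputs, fixed knots, and fixed centroids with a common scale (choices of mine, purely for illustration), and fits the coefficients $\theta$ in (2.43) by ordinary least squares.

```python
import numpy as np

def linear_spline_basis(x, knots):
    """Linear-spline basis from the text: b1(x)=1, b2(x)=x, b_{m+2}(x)=(x - t_m)_+."""
    return np.column_stack([np.ones_like(x), x]
                           + [np.maximum(x - t, 0.0) for t in knots])

def gaussian_rbf_basis(x, centroids, lam):
    """Gaussian radial basis functions exp(-||x - mu_m||^2 / (2*lam)), here with
    fixed centroids mu_m and a single shared scale lam for simplicity."""
    return np.exp(-(x[:, None] - centroids[None, :]) ** 2 / (2.0 * lam))

# With the basis fixed, (2.43) is linear in theta, so ordinary least squares applies.
rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 1, 150))
y = np.cos(5 * x) + rng.normal(scale=0.2, size=x.size)

H = linear_spline_basis(x, knots=np.linspace(0.1, 0.9, 8))  # or gaussian_rbf_basis(...)
theta, *_ = np.linalg.lstsq(H, y, rcond=None)
fitted = H @ theta
```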
Radial basis functions have centroids $\mu_m$ and scales $\lambda_m$ that have to be determined. The spline basis functions have knots. In general we would like the data to dictate them as well. Including these as parameters changes the regression problem from a straightforward linear problem to a combinatorially hard nonlinear problem. In practice, shortcuts such as greedy algorithms or two-stage processes are used. Section 6.7 describes some such approaches.

A single-layer feed-forward neural network model with linear output weights can be thought of as an adaptive basis function method.
The model has the form

$$f_\theta(x) = \sum_{m=1}^{M} \beta_m\, \sigma(\alpha_m^T x + b_m), \tag{2.45}$$

where $\sigma(x) = 1/(1 + e^{-x})$ is known as the activation function. Here, as in the projection pursuit model, the directions $\alpha_m$ and the bias terms $b_m$ have to be determined, and their estimation is the meat of the computation. Details are given in Chapter 11.

These adaptively chosen basis function methods are also known as dictionary methods, where one has available a possibly infinite set or dictionary $\mathcal{D}$ of candidate basis functions from which to choose, and models are built up by employing some kind of search mechanism.

2.9 Model Selection and the Bias–Variance Tradeoff

All the models described above and many others discussed in later chapters have a smoothing or complexity parameter that has to be determined:

• the multiplier of the penalty term;
• the width of the kernel;
• or the number of basis functions.

In the case of the smoothing spline, the parameter $\lambda$ indexes models ranging from a straight line fit to the interpolating model.
Similarly a local degree-$m$ polynomial model ranges between a degree-$m$ global polynomial when the window size is infinitely large, to an interpolating fit when the window size shrinks to zero. This means that we cannot use residual sum-of-squares on the training data to determine these parameters as well, since we would always pick those that gave interpolating fits and hence zero residuals. Such a model is unlikely to predict future data well at all.

The $k$-nearest-neighbor regression fit $\hat{f}_k(x_0)$ usefully illustrates the competing forces that affect the predictive ability of such approximations.
Suppose the data arise from a model $Y = f(X) + \varepsilon$, with $E(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$. For simplicity here we assume that the values of $x_i$ in the sample are fixed in advance (nonrandom). The expected prediction error at $x_0$, also known as test or generalization error, can be decomposed:

$$
\begin{aligned}
\mathrm{EPE}_k(x_0) &= E\bigl[(Y - \hat{f}_k(x_0))^2 \mid X = x_0\bigr] \\
&= \sigma^2 + \bigl[\mathrm{Bias}^2(\hat{f}_k(x_0)) + \mathrm{Var}_{\mathcal T}(\hat{f}_k(x_0))\bigr] \qquad\qquad (2.46) \\
&= \sigma^2 + \Bigl[f(x_0) - \frac{1}{k}\sum_{\ell=1}^{k} f(x_{(\ell)})\Bigr]^2 + \frac{\sigma^2}{k}. \qquad\qquad (2.47)
\end{aligned}
$$

The subscripts in parentheses $(\ell)$ indicate the sequence of nearest neighbors to $x_0$.

There are three terms in this expression.
The first term $\sigma^2$ is the irreducible error (the variance of the new test target) and is beyond our control, even if we know the true $f(x_0)$.

The second and third terms are under our control, and make up the mean squared error of $\hat{f}_k(x_0)$ in estimating $f(x_0)$, which is broken down into a bias component and a variance component. The bias term is the squared difference between the true mean $f(x_0)$ and the expected value of the estimate, $[E_{\mathcal T}(\hat{f}_k(x_0)) - f(x_0)]^2$, where the expectation averages the randomness in the training data. This term will most likely increase with $k$, if the true function is reasonably smooth.
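The decomposition (2.46)–(2.47) can be checked numerically. The following Python sketch (the choice of $f$, $\sigma$, $x_0$ and $k$ is mine, purely for illustration) holds the inputs fixed, simulates fresh training noise and a fresh test target many times, and compares the Monte Carlo expected prediction error with the closed-form right-hand side of (2.47).

```python
import numpy as np

# Monte Carlo check of (2.46)-(2.47): for fixed inputs x_i and a k-NN fit at x0,
# the simulated expected prediction error should match sigma^2 + bias^2 + sigma^2/k.
rng = np.random.default_rng(3)
f = lambda x: np.sin(2 * np.pi * x)
sigma, k, x0, n, reps = 0.3, 10, 0.5, 100, 20000

x = np.sort(rng.uniform(0, 1, n))            # fixed (nonrandom) inputs
nearest = np.argsort(np.abs(x - x0))[:k]     # the k nearest neighbors of x0

# Closed-form terms from (2.47).
bias2 = (f(x0) - f(x[nearest]).mean()) ** 2
variance = sigma**2 / k
epe_formula = sigma**2 + bias2 + variance

# Monte Carlo: new training noise and a new test target Y at x0 each repetition.
errors = np.empty(reps)
for r in range(reps):
    y = f(x) + rng.normal(scale=sigma, size=n)
    f_hat = y[nearest].mean()                 # k-NN estimate at x0
    y_test = f(x0) + rng.normal(scale=sigma)  # new test observation at x0
    errors[r] = (y_test - f_hat) ** 2

print(epe_formula, errors.mean())             # the two should agree closely
```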