For example, the truncated power basis has an equivalent B-spline basis for the same space of functions; the cancellation is exact in this case.

Kernel methods achieve flexibility by fitting simple models in a region local to the target point x0. Localization is achieved via a weighting kernel Kλ, and individual observations receive weights Kλ(x0, xi).

Radial basis functions combine these ideas, by treating the kernel functions Kλ(ξ, x) as basis functions. This leads to the model

    f(x) = \sum_{j=1}^{M} K_{\lambda_j}(\xi_j, x)\,\beta_j
         = \sum_{j=1}^{M} D\left( \frac{\|x - \xi_j\|}{\lambda_j} \right) \beta_j,        (6.28)

where each basis element is indexed by a location or prototype parameter ξj and a scale parameter λj.
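As a concrete (hypothetical) illustration of (6.28), the short sketch below evaluates such an expansion, taking D to be the standard Gaussian density; the prototypes, scales and coefficients are assumed given, and the function names are ours, not the book's.

import numpy as np

def gaussian_density(t):
    # The standard Gaussian density, a popular choice for D in (6.28).
    return np.exp(-0.5 * t ** 2) / np.sqrt(2.0 * np.pi)

def rbf_expansion(x, xi, lam, beta):
    """Evaluate f(x) = sum_j D(||x - xi_j|| / lam_j) * beta_j, as in (6.28).

    x    : (p,) query point
    xi   : (M, p) prototype locations xi_j
    lam  : (M,) scale parameters lam_j
    beta : (M,) coefficients beta_j
    """
    dist = np.linalg.norm(x - xi, axis=1)          # ||x - xi_j|| for each prototype
    return gaussian_density(dist / lam) @ beta     # weighted sum of basis functions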
A popular choice for D is the standard Gaussian density function. There are several approaches to learning the parameters {λj, ξj, βj}, j = 1, . . . , M. For simplicity we will focus on least squares methods for regression, and use the Gaussian kernel.

• Optimize the sum-of-squares with respect to all the parameters:

    \min_{\{\lambda_j,\xi_j,\beta_j\}_1^M} \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{M} \beta_j \exp\left\{ -\frac{(x_i-\xi_j)^T(x_i-\xi_j)}{\lambda_j^2} \right\} \right)^2 .        (6.29)

  This model is commonly referred to as an RBF network, an alternative to the sigmoidal neural network discussed in Chapter 11, with the ξj and λj playing the role of the weights. This criterion is nonconvex with multiple local minima, and the algorithms for optimization are similar to those used for neural networks.

• Estimate the {λj, ξj} separately from the βj. Given the former, the estimation of the latter is a simple least squares problem. Often the kernel parameters λj and ξj are chosen in an unsupervised way using the X distribution alone. One of the methods is to fit a Gaussian mixture density model to the training xi, which provides both the centers ξj and the scales λj.
  Other even more ad hoc approaches use clustering methods to locate the prototypes ξj, and treat λj = λ as a hyper-parameter. The obvious drawback of these approaches is that the conditional distribution Pr(Y|X), and in particular E(Y|X), has no say in where the action is concentrated.
  On the positive side, they are much simpler to implement; a short code sketch of this two-stage approach is given at the end of this section.

While it would seem attractive to reduce the parameter set and assume a constant value for λj = λ, this can have an undesirable side effect of creating holes (regions of IRp where none of the kernels has appreciable support), as illustrated in Figure 6.16 (upper panel). Renormalized radial basis functions,

    h_j(x) = \frac{D(\|x - \xi_j\|/\lambda)}{\sum_{k=1}^{M} D(\|x - \xi_k\|/\lambda)},        (6.30)

avoid this problem (lower panel).

FIGURE 6.16. Gaussian radial basis functions in IR with fixed width can leave holes (top panel). Renormalized Gaussian radial basis functions avoid this problem, and produce basis functions similar in some respects to B-splines.

The Nadaraya–Watson kernel regression estimator (6.2) in IRp can be viewed as an expansion in renormalized radial basis functions,

    \hat{f}(x_0) = \sum_{i=1}^{N} y_i \frac{K_\lambda(x_0, x_i)}{\sum_{i=1}^{N} K_\lambda(x_0, x_i)}
                 = \sum_{i=1}^{N} y_i h_i(x_0),        (6.31)
with a basis function hi located at every observation and coefficients yi; that is, ξi = xi, β̂i = yi, i = 1, . . . , N.

Note the similarity between the expansion (6.31) and the solution (5.50) on page 169 to the regularization problem induced by the kernel K. Radial basis functions form the bridge between the modern “kernel methods” and local fitting technology.
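The sketch below (ours, not the book's) illustrates the second, two-stage strategy described above under simplifying assumptions: the prototypes ξj are taken to be a random subset of the training inputs (k-means centers or the means of a fitted Gaussian mixture could be substituted, as the text suggests), a single width λ is treated as a tuning parameter, and the βj are then obtained by ordinary least squares. Setting renormalize=True uses the renormalized basis (6.30) rather than the raw Gaussian basis, which guards against the holes illustrated in Figure 6.16.

import numpy as np

def gaussian_basis(X, centers, lam):
    # Basis matrix H with H[i, j] = D(||x_i - xi_j|| / lam), D the standard Gaussian density.
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)   # (N, M) distances
    return np.exp(-0.5 * (d / lam) ** 2) / np.sqrt(2.0 * np.pi)

def fit_rbf(X, y, M=10, lam=1.0, renormalize=False, seed=0):
    """Two-stage RBF fit: unsupervised prototypes, then least squares for the beta_j."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=M, replace=False)]            # ad hoc prototypes xi_j
    H = gaussian_basis(X, centers, lam)
    if renormalize:                                                   # h_j(x) as in (6.30)
        H = H / (H.sum(axis=1, keepdims=True) + 1e-12)
    H = np.column_stack([np.ones(len(X)), H])                         # intercept beta_0
    beta, *_ = np.linalg.lstsq(H, y, rcond=None)                      # simple least squares
    return centers, beta

def predict_rbf(Xnew, centers, beta, lam=1.0, renormalize=False):
    H = gaussian_basis(Xnew, centers, lam)
    if renormalize:
        H = H / (H.sum(axis=1, keepdims=True) + 1e-12)
    return beta[0] + H @ beta[1:]

In practice λ (and the number of prototypes M) would be chosen by cross-validation; with a common width λ the unnormalized basis is exactly the situation in which holes can arise.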
6.8 Mixture Models for Density Estimation and Classification

The mixture model is a useful tool for density estimation, and can be viewed as a kind of kernel method. The Gaussian mixture model has the form

    f(x) = \sum_{m=1}^{M} \alpha_m \phi(x; \mu_m, \Sigma_m)        (6.32)

with mixing proportions αm, Σm αm = 1, and each Gaussian density has a mean µm and covariance matrix Σm. In general, mixture models can use any component densities in place of the Gaussian in (6.32): the Gaussian mixture model is by far the most popular.

The parameters are usually fit by maximum likelihood, using the EM algorithm as described in Chapter 8. Some special cases arise:

• If the covariance matrices are constrained to be scalar: Σm = σm I, then (6.32) has the form of a radial basis expansion.

• If in addition σm = σ > 0 is fixed, and M ↑ N, then the maximum likelihood estimate for (6.32) approaches the kernel density estimate (6.22) where α̂m = 1/N and µ̂m = xm.

Using Bayes’ theorem, separate mixture densities in each class lead to flexible models for Pr(G|X); this is taken up in some detail in Chapter 12.
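To make the EM fit concrete, here is a minimal sketch (ours; Chapter 8 gives the general algorithm) for a two-component univariate Gaussian mixture of the form (6.32), the setting used in the Age example that follows. The quartile-based starting values are an arbitrary choice of ours.

import numpy as np

def phi(x, mu, sigma2):
    # Univariate Gaussian density with mean mu and variance sigma2.
    return np.exp(-0.5 * (x - mu) ** 2 / sigma2) / np.sqrt(2.0 * np.pi * sigma2)

def em_two_component(x, n_iter=200):
    """EM for a two-component univariate Gaussian mixture (6.32).

    Returns mixing proportions alpha_m, means mu_m and variances sigma2_m.
    """
    alpha = np.array([0.5, 0.5])
    mu = np.array([np.percentile(x, 25), np.percentile(x, 75)])   # crude starting values
    sigma2 = np.array([x.var(), x.var()])
    for _ in range(n_iter):
        # E-step: responsibilities r[i, m], exactly the quantities in (6.33) below.
        dens = alpha * phi(x[:, None], mu, sigma2)                # (N, 2)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted proportions, means and variances.
        nm = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / nm
        sigma2 = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nm
        alpha = nm / len(x)
    return alpha, mu, sigma2

Note that no class labels enter the procedure; only the x values are used, consistent with the unsupervised fit described in the example below.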
Figure 6.17 shows an application of mixtures to the heart disease risk factor study. In the top row are histograms of Age for the no CHD and CHD groups separately, and then combined on the right. Using the combined data, we fit a two-component mixture of the form (6.32) with the (scalars) Σ1 and Σ2 not constrained to be equal. Fitting was done via the EM algorithm (Chapter 8): note that the procedure does not use knowledge of the CHD labels. The resulting estimates were

    µ̂1 = 36.4,    Σ̂1 = 157.7,    α̂1 = 0.7,
    µ̂2 = 58.0,    Σ̂2 = 15.6,     α̂2 = 0.3.

The component densities φ(µ̂1, Σ̂1) and φ(µ̂2, Σ̂2) are shown in the lower left and middle panels.
The lower-right panel shows these component densities (orange and blue) along with the estimated mixture density (green).

FIGURE 6.17. Application of mixtures to the heart disease risk-factor study. (Top row:) Histograms of Age for the no CHD and CHD groups separately, and combined. (Bottom row:) Estimated component densities from a Gaussian mixture model (bottom left, bottom middle); (bottom right:) estimated component densities (blue and orange) along with the estimated mixture density (green). The orange density has a very large standard deviation, and approximates a uniform density.

The mixture model also provides an estimate of the probability that observation i belongs to component m,

    \hat{r}_{im} = \frac{\hat{\alpha}_m \phi(x_i; \hat{\mu}_m, \hat{\Sigma}_m)}{\sum_{k=1}^{M} \hat{\alpha}_k \phi(x_i; \hat{\mu}_k, \hat{\Sigma}_k)},        (6.33)

where xi is Age in our example. Suppose we threshold each value r̂i2 and hence define δ̂i = I(r̂i2 > 0.5). Then we can compare the classification of each observation by CHD and the mixture model:

                 Mixture model
    CHD        δ̂ = 0    δ̂ = 1
    No           232       70
    Yes           76       84

Although the mixture model did not use the CHD labels, it has done a fair job in discovering the two CHD subpopulations. Linear logistic regression, using the CHD as a response, achieves the same error rate (32%) when fit to these data using maximum-likelihood (Section 4.4).
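As a small sketch (ours), the responsibilities (6.33) and the thresholded assignments δ̂i can be computed directly from the fitted values quoted above; the Age and CHD vectors themselves are not reproduced here, so the data-loading step is left as a placeholder.

import numpy as np

def phi(x, mu, sigma2):
    # Univariate Gaussian density; the fitted Sigma-hats play the role of variances here.
    return np.exp(-0.5 * (x - mu) ** 2 / sigma2) / np.sqrt(2.0 * np.pi * sigma2)

# Fitted values quoted in the text for the Age example.
alpha_hat = np.array([0.7, 0.3])
mu_hat = np.array([36.4, 58.0])
sigma2_hat = np.array([157.7, 15.6])

def responsibilities(age):
    """r_hat[i, m] as in (6.33) for a vector of Age values."""
    dens = alpha_hat * phi(age[:, None], mu_hat, sigma2_hat)   # (N, 2)
    return dens / dens.sum(axis=1, keepdims=True)

# age, chd = ...                                   # load the Age and CHD columns
# delta_hat = (responsibilities(age)[:, 1] > 0.5)  # delta_hat_i = I(r_hat_{i2} > 0.5)
# Cross-tabulating delta_hat against CHD gives a 2 x 2 table like the one shown above.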
6.9 Computational Considerations

Kernel and local regression and density estimation are memory-based methods: the model is the entire training data set, and the fitting is done at evaluation or prediction time. For many real-time applications, this can make this class of methods infeasible.

The computational cost to fit at a single observation x0 is O(N) flops, except in oversimplified cases (such as square kernels).
By comparison, an expansion in M basis functions costs O(M) for one evaluation, and typically M ∼ O(log N). Basis function methods have an initial cost of at least O(NM² + M³).

The smoothing parameter(s) λ for kernel methods are typically determined off-line, for example using cross-validation, at a cost of O(N²) flops.

Popular implementations of local regression, such as the loess function in S-PLUS and R and the locfit procedure (Loader, 1999), use triangulation schemes to reduce the computations. They compute the fit exactly at M carefully chosen locations (O(NM)), and then use blending techniques to interpolate the fit elsewhere (O(M) per evaluation).
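The sketch below (ours) conveys the flavor of this strategy in one dimension: a Nadaraya–Watson fit with a Gaussian kernel is computed exactly at M grid locations, and further evaluations are obtained by linear interpolation. Production implementations such as loess and locfit use more careful point placement and smoother blending than this.

import numpy as np

def nw_fit(x0, x, y, lam):
    """Nadaraya-Watson estimate at the points x0 (O(N) work per point)."""
    w = np.exp(-0.5 * ((x0[:, None] - x[None, :]) / lam) ** 2)   # kernel weights K_lambda(x0, x_i)
    return (w @ y) / w.sum(axis=1)

def fast_smoother(x, y, lam, M=100):
    """Fit exactly at M locations (O(N M) once), then interpolate for cheap later evaluations."""
    grid = np.linspace(x.min(), x.max(), M)
    fitted = nw_fit(grid, x, y, lam)
    return lambda xnew: np.interp(xnew, grid, fitted)            # piecewise-linear blending

# Example with simulated data:
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=2000)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)
smooth = fast_smoother(x, y, lam=0.5)
yhat = smooth(np.array([1.0, 2.5, 7.3]))                         # cheap evaluations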
Bibliographic Notes

There is a vast literature on kernel methods which we will not attempt to summarize. Rather we will point to a few good references that themselves have extensive bibliographies. Loader (1999) gives excellent coverage of local regression and likelihood, and also describes state-of-the-art software for fitting these models. Fan and Gijbels (1996) cover these models from a more theoretical aspect. Hastie and Tibshirani (1990) discuss local regression in the context of additive modeling. Silverman (1986) gives a good overview of density estimation, as does Scott (1992).

Exercises
Ex. 6.1 Show that the Nadaraya–Watson kernel smooth with fixed metric bandwidth λ and a Gaussian kernel is differentiable. What can be said for the Epanechnikov kernel? What can be said for the Epanechnikov kernel with adaptive nearest-neighbor bandwidth λ(x0)?

Ex. 6.2 Show that \sum_{i=1}^{N} (x_i - x_0) l_i(x_0) = 0 for local linear regression. Define b_j(x_0) = \sum_{i=1}^{N} (x_i - x_0)^j l_i(x_0). Show that b_0(x_0) = 1 for local polynomial regression of any degree (including local constants). Show that b_j(x_0) = 0 for all j ∈ {1, 2, . . . , k} for local polynomial regression of degree k. What are the implications of this on the bias?
Ex. 6.3 Show that ||l(x)|| (Section 6.1.2) increases with the degree of the local polynomial.

Ex. 6.4 Suppose that the p predictors X arise from sampling relatively smooth analog curves at p uniformly spaced abscissa values. Denote by Cov(X|Y) = Σ the conditional covariance matrix of the predictors, and assume this does not change much with Y. Discuss the nature of the Mahalanobis choice A = Σ⁻¹ for the metric in (6.14). How does this compare with A = I? How might you construct a kernel A that (a) downweights high-frequency components in the distance metric; (b) ignores them completely?
Ex. 6.5 Show that fitting a locally constant multinomial logit model of the form (6.19) amounts to smoothing the binary response indicators for each class separately using a Nadaraya–Watson kernel smoother with kernel weights Kλ(x0, xi).

Ex. 6.6 Suppose that all you have is software for fitting local regression, but you can specify exactly which monomials are included in the fit. How could you use this software to fit a varying-coefficient model in some of the variables?
Ex. 6.7 Derive an expression for the leave-one-out cross-validated residual sum-of-squares for local polynomial regression.

Ex. 6.8 Suppose that for continuous response Y and predictor X, we model the joint density of X, Y using a multivariate Gaussian kernel estimator. Note that the kernel in this case would be the product kernel φλ(X)φλ(Y). Show that the conditional mean E(Y|X) derived from this estimate is a Nadaraya–Watson estimator. Extend this result to classification by providing a suitable kernel for the estimation of the joint distribution of a continuous X and discrete Y.

Ex. 6.9 Explore the differences between the naive Bayes model (6.27) and a generalized additive logistic regression model, in terms of (a) model assumptions and (b) estimation. If all the variables Xk are discrete, what can you say about the corresponding GAM?

Ex.