FIGURE 5.14. (Left panel) The first 16 normalized eigenvectors of K, the 200 × 200 kernel matrix for the first coordinate of the mixture data. These are viewed as estimates φ̂ℓ of the eigenfunctions in (5.45), and are represented as functions in ℝ¹ with the observed values superimposed in color. They are arranged in rows, starting at the top left. (Right panel) Rescaled versions hℓ = √γ̂ℓ φ̂ℓ of the functions in the left panel, for which the kernel computes the “inner product.”

FIGURE 5.15. The largest 50 eigenvalues of K; all those beyond the 30th are effectively zero.
The corresponding feature space representation of the eigenfunctions is

\[
h_\ell(x) = \sqrt{\hat{\gamma}_\ell}\,\hat{\phi}_\ell(x), \qquad \ell = 1, \ldots, N.
\tag{5.65}
\]

Note that ⟨h(x_i), h(x_{i′})⟩ = K(x_i, x_{i′}). The scaling by the eigenvalues quickly shrinks most of the functions down to zero, leaving an effective dimension of about 12 in this case. The corresponding optimization problem is a standard ridge regression, as in (5.63). So although in principle the implicit feature space is infinite dimensional, the effective dimension is dramatically lower because of the relative amounts of shrinkage applied to each basis function. The kernel scale parameter ν plays a role here as well; larger ν implies more local k_m functions, and increases the effective dimension of the feature space. See Hastie and Zhu (2006) for more details.
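To make this concrete, here is a minimal NumPy sketch; the one-dimensional sample, the radial kernel exp(−ν‖x − y‖²), and the values of ν and λ are illustrative assumptions, not the book's mixture data. It eigendecomposes K, forms the rescaled features of (5.65), checks that their inner products reproduce K, and verifies that ridge regression on these features gives the same fit as the kernel ridge problem (5.63).

```python
import numpy as np

# Hypothetical 1-D sample standing in for the first coordinate of the mixture data.
rng = np.random.default_rng(0)
n = 200
x = np.sort(rng.uniform(-3, 3, size=n))

# Radial kernel K(x, y) = exp(-nu * (x - y)^2); nu = 1 is an illustrative choice.
nu = 1.0
K = np.exp(-nu * (x[:, None] - x[None, :]) ** 2)

# Eigen-decomposition K = Phi diag(gamma) Phi^T; reorder to descending eigenvalues.
gamma, Phi = np.linalg.eigh(K)
gamma, Phi = gamma[::-1], Phi[:, ::-1]
gamma = np.clip(gamma, 0.0, None)            # guard against tiny negative round-off

# Rescaled features h_l(x_i) = sqrt(gamma_l) * phi_l(x_i), as in (5.65).
H = Phi * np.sqrt(gamma)
print(np.allclose(H @ H.T, K))               # <h(x_i), h(x_i')> = K(x_i, x_i')

# Eigenvalues decay fast, so only a handful of the h_l are non-negligible.
print(int(np.sum(gamma > 1e-10 * gamma[0])))

# Ridge regression on the features H reproduces the kernel ridge fit of (5.63).
y = np.sin(3 * x) + 0.1 * rng.normal(size=n)
lam = 0.1
beta = np.linalg.solve(H.T @ H + lam * np.eye(n), H.T @ y)
print(np.allclose(H @ beta, K @ np.linalg.solve(K + lam * np.eye(n), y)))
```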
It is also known (Girosi et al., 1995) that a thin-plate spline (Section 5.7) is an expansion in radial basis functions, generated by the kernel

\[
K(x, y) = \|x - y\|^2 \log\|x - y\|.
\tag{5.66}
\]

Radial basis functions are discussed in more detail in Section 6.7.

Support Vector Classifiers

The support vector machines of Chapter 12 for a two-class classification problem have the form \( f(x) = \alpha_0 + \sum_{i=1}^{N} \alpha_i K(x, x_i) \), where the parameters are chosen to minimize

\[
\min_{\alpha_0, \alpha} \left\{ \sum_{i=1}^{N} [1 - y_i f(x_i)]_+ + \frac{\lambda}{2}\, \alpha^T K \alpha \right\},
\tag{5.67}
\]

where y_i ∈ {−1, 1}, and [z]_+ denotes the positive part of z. This can be viewed as a quadratic optimization problem with linear constraints, and requires a quadratic programming algorithm for its solution. The name support vector arises from the fact that typically many of the α̂_i = 0 [due to the piecewise-zero nature of the loss function in (5.67)], and so f̂ is an expansion in a subset of the K(·, x_i). See Section 12.3.3 for more details.
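As a rough, self-contained alternative to quadratic programming, the convex objective (5.67) can also be minimized by subgradient descent. The toy data, kernel bandwidth, λ, and step-size schedule below are all illustrative assumptions; at the exact optimum many α̂_i are exactly zero, while these iterates only approach zero.

```python
import numpy as np

# Toy one-dimensional two-class data with labels in {-1, +1}.
rng = np.random.default_rng(1)
n = 40
x = rng.normal(size=n)
y = np.where(x + 0.3 * rng.normal(size=n) > 0, 1.0, -1.0)

K = np.exp(-(x[:, None] - x[None, :]) ** 2)    # radial kernel matrix
lam = 0.5                                      # illustrative lambda

def objective(alpha0, alpha):
    f = alpha0 + K @ alpha
    return np.sum(np.maximum(0.0, 1 - y * f)) + 0.5 * lam * alpha @ K @ alpha

alpha0, alpha = 0.0, np.zeros(n)
for t in range(1, 5001):
    f = alpha0 + K @ alpha
    active = (1 - y * f) > 0                   # points inside the hinge
    # Subgradient of (5.67): -K (y * active) + lam * K alpha, plus the alpha_0 part.
    alpha -= (0.02 / np.sqrt(t)) * (-K @ (y * active) + lam * (K @ alpha))
    alpha0 -= (0.02 / np.sqrt(t)) * (-np.sum(y * active))

print("objective:", round(float(objective(alpha0, alpha)), 3))
print("near-zero alphas:", int(np.sum(np.abs(alpha) < 1e-2)), "of", n)
```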
5.9 Wavelet Smoothing

We have seen two different modes of operation with dictionaries of basis functions. With regression splines, we select a subset of the bases, using either subject-matter knowledge or automatic selection. The more adaptive procedures such as MARS (Chapter 9) can capture both smooth and nonsmooth behavior. With smoothing splines, we use a complete basis, but then shrink the coefficients toward smoothness.

FIGURE 5.16. Some selected wavelets at different translations and dilations for the Haar (left panel) and symmlet-8 (right panel) families. The functions have been scaled to suit the display.

Wavelets typically use a complete orthonormal basis to represent functions, but then shrink and select the coefficients toward a sparse representation.
Just as a smooth function can be represented by a few spline basis functions, a mostly flat function with a few isolated bumps can be represented with a few (bumpy) basis functions. Wavelet bases are very popular in signal processing and compression, since they are able to represent both smooth and locally bumpy functions in an efficient way: a phenomenon dubbed time and frequency localization. In contrast, the traditional Fourier basis allows only frequency localization.

Before we give details, let's look at the Haar wavelets in the left panel of Figure 5.16 to get an intuitive idea of how wavelet smoothing works. The vertical axis indicates the scale (frequency) of the wavelets, from low scale at the bottom to high scale at the top.
At each scale the wavelets are “packed in” side-by-side to completely fill the time axis; we have shown only a selected subset. Wavelet smoothing fits the coefficients for this basis by least squares, and then thresholds (discards, filters) the smaller coefficients. Since there are many basis functions at each scale, it can use bases where it needs them and discard the ones it does not need, achieving time and frequency localization. The Haar wavelets are simple to understand, but not smooth enough for most purposes.
The symmlet wavelets in the right panel of Figure 5.16 have the same orthonormal properties, but are smoother. Figure 5.17 displays an NMR (nuclear magnetic resonance) signal, which appears to be composed of smooth components and isolated spikes, plus some noise. The wavelet transform, using a symmlet basis, is shown in the lower left panel. The wavelet coefficients are arranged in rows, from lowest scale at the bottom to highest scale at the top. The length of each line segment indicates the size of the coefficient.
The bottom right panel shows the wavelet coefficients after they have been thresholded. The threshold procedure, given below in equation (5.69), is the same soft-thresholding rule that arises in the lasso procedure for linear regression (Section 3.4.2). Notice that many of the smaller coefficients have been set to zero. The green curve in the top panel shows the back-transform of the thresholded coefficients: this is the smoothed version of the original signal. In the next section we give the details of this process, including the construction of wavelets and the thresholding rule.
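The full pipeline just described (transform, soft-threshold, back-transform) can be sketched with the PyWavelets package as a stand-in for the S-PLUS waveshrink function mentioned in the caption of Figure 5.17. The toy signal, the decomposition level, and the universal threshold σ̂√(2 log N) are illustrative assumptions; the soft-threshold rule itself is the one given below as (5.69).

```python
import numpy as np
import pywt

# Toy signal standing in for the NMR data: a smooth trend, two isolated spikes, noise.
rng = np.random.default_rng(2)
n = 1024
t = np.linspace(0, 1, n)
truth = np.sin(4 * np.pi * t) + 2.0 * (np.abs(t - 0.3) < 0.01) + 3.0 * (np.abs(t - 0.7) < 0.01)
y = truth + 0.3 * rng.normal(size=n)

# Forward wavelet transform with a symmlet-8 basis, down to a coarse level.
coeffs = pywt.wavedec(y, "sym8", level=6)

# Estimate sigma from the finest-scale details (median absolute deviation), then
# soft-threshold every detail block at the universal level sigma * sqrt(2 log n).
sigma = np.median(np.abs(coeffs[-1])) / 0.6745
lam = sigma * np.sqrt(2 * np.log(n))
coeffs_thr = [coeffs[0]] + [pywt.threshold(c, lam, mode="soft") for c in coeffs[1:]]

# Back-transform of the thresholded coefficients: the smoothed version of the signal.
smooth = pywt.waverec(coeffs_thr, "sym8")
print("details set to zero:", sum(int(np.sum(c == 0.0)) for c in coeffs_thr[1:]))
print("RMS error vs truth:", round(float(np.sqrt(np.mean((smooth[:n] - truth) ** 2))), 3))
```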
5.9.1 Wavelet Bases and the Wavelet Transform

In this section we give details on the construction and filtering of wavelets. Wavelet bases are generated by translations and dilations of a single scaling function φ(x) (also known as the father). The red curves in Figure 5.18 are the Haar and symmlet-8 scaling functions. The Haar basis is particularly easy to understand, especially for anyone with experience in analysis of variance or trees, since it produces a piecewise-constant representation. Thus if φ(x) = I(x ∈ [0, 1]), then φ_{0,k}(x) = φ(x − k), k an integer, generates an orthonormal basis for functions with jumps at the integers. Call this reference space V_0. The dilations φ_{1,k}(x) = √2 φ(2x − k) form an orthonormal basis for a space V_1 ⊃ V_0 of functions piecewise constant on intervals of length 1/2. In fact, more generally we have ··· ⊃ V_1 ⊃ V_0 ⊃ V_{−1} ⊃ ···, where each V_j is spanned by φ_{j,k} = 2^{j/2} φ(2^j x − k).
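As a quick numerical sanity check (our illustration; approximating the inner-product integrals on a fine grid is the only assumption), the dilated Haar functions φ_{1,k} are indeed orthonormal:

```python
import numpy as np

def phi(x):
    # Haar father wavelet: the indicator of [0, 1).
    return ((x >= 0) & (x < 1)).astype(float)

def phi_jk(x, j, k):
    # phi_{j,k}(x) = 2^{j/2} phi(2^j x - k), the orthonormal basis of V_j.
    return 2.0 ** (j / 2) * phi(2.0 ** j * x - k)

x = np.linspace(0, 4, 40_001)     # fine grid covering the supports below
dx = x[1] - x[0]

# Gram matrix of phi_{1,0}, ..., phi_{1,7} via Riemann sums: close to the identity.
G = np.array([[np.sum(phi_jk(x, 1, k) * phi_jk(x, 1, kp)) * dx
               for kp in range(8)] for k in range(8)])
print(np.round(G, 3))
```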
Now to the definition of wavelets. In analysis of variance, we often represent a pair of means μ_1 and μ_2 by their grand mean μ = ½(μ_1 + μ_2), and then a contrast α = ½(μ_1 − μ_2). A simplification occurs if the contrast α is very small, because then we can set it to zero. In a similar manner we might represent a function in V_{j+1} by a component in V_j plus the component in the orthogonal complement W_j of V_j in V_{j+1}, written as V_{j+1} = V_j ⊕ W_j. The component in W_j represents detail, and we might wish to set some elements of this component to zero.
FIGURE 5.17. The top panel shows an NMR signal, with the wavelet-shrunk version superimposed in green. The lower left panel represents the wavelet transform of the original signal, down to V_4, using the symmlet-8 basis. Each coefficient is represented by the height (positive or negative) of the vertical bar. The lower right panel represents the wavelet coefficients after being shrunken using the waveshrink function in S-PLUS, which implements the SureShrink method of wavelet adaptation of Donoho and Johnstone.

FIGURE 5.18. The Haar and symmlet father (scaling) wavelet φ(x) and mother wavelet ψ(x).

It is easy to see that the functions ψ(x − k) generated by the mother wavelet ψ(x) = φ(2x) − φ(2x − 1) form an orthonormal basis for W_0 for the Haar family. Likewise ψ_{j,k} = 2^{j/2} ψ(2^j x − k) form a basis for W_j. Now V_{j+1} = V_j ⊕ W_j = V_{j−1} ⊕ W_{j−1} ⊕ W_j, so besides representing a function by its level-j detail and level-j rough components, the latter can be broken down to level-(j − 1) detail and rough, and so on. Finally we get a representation of the form V_J = V_0 ⊕ W_0 ⊕ W_1 ⊕ ··· ⊕ W_{J−1}. Figure 5.16 shows particular wavelets ψ_{j,k}(x).
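For the Haar family this split is concrete: projecting a function in V_{j+1} onto V_j and W_j amounts to pairwise averaging and differencing of its coefficients. A small self-contained sketch (our illustration, not code from the book) of one analysis step and its exact inverse:

```python
import numpy as np

def haar_step(v):
    """Split coordinates in V_{j+1} into a V_j (smooth) part and a W_j (detail) part.
    For the orthonormal Haar basis these are scaled pairwise averages and differences."""
    v = np.asarray(v, dtype=float)
    smooth = (v[0::2] + v[1::2]) / np.sqrt(2)
    detail = (v[0::2] - v[1::2]) / np.sqrt(2)
    return smooth, detail

def haar_step_inv(smooth, detail):
    """Exact inverse: merge the V_j and W_j coordinates back into V_{j+1}."""
    v = np.empty(2 * len(smooth))
    v[0::2] = (smooth + detail) / np.sqrt(2)
    v[1::2] = (smooth - detail) / np.sqrt(2)
    return v

v = np.array([4.0, 2.0, 5.0, 5.0, 1.0, 0.0, 3.0, 7.0])
s, d = haar_step(v)
print(s, d)
print(np.allclose(haar_step_inv(s, d), v))   # True: no information is lost
```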
Notice that since these spaces are orthogonal, all the basis functions are orthonormal. In fact, if the domain is discrete with N = 2^J (time) points, this is as far as we can go. There are 2^j basis elements at level j, and adding up, we have a total of 2^J − 1 elements in the W_j, and one in V_0. This structured orthonormal basis allows for a multiresolution analysis, which we illustrate in the next section.
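Iterating the averaging-and-differencing step on N = 2^J discrete points produces the full pyramid. The sketch below (again our own illustration in plain NumPy) confirms the counts just given, 2^j detail coefficients at level j plus the single V_0 coefficient, and checks orthonormality through energy preservation:

```python
import numpy as np

def haar_pyramid(v):
    """Full multiresolution analysis of a signal with N = 2^J samples.
    Returns the V_0 coefficient and the detail blocks W_0, ..., W_{J-1}."""
    v = np.asarray(v, dtype=float)
    details = []
    while len(v) > 1:
        details.append((v[0::2] - v[1::2]) / np.sqrt(2))   # W_j block, finest first
        v = (v[0::2] + v[1::2]) / np.sqrt(2)               # descend to V_j
    return v, details[::-1]                                # coarsest detail first

rng = np.random.default_rng(3)
x = rng.normal(size=2 ** 4)        # J = 4, so N = 16
v0, details = haar_pyramid(x)

# 2^j elements at level j: [1, 2, 4, 8]; with the one V_0 coefficient, 16 = N in total.
print([len(w) for w in details], "+ 1 =", sum(len(w) for w in details) + 1)

# The transform is orthonormal, so the signal's energy is preserved (Parseval).
print(np.isclose(np.sum(x ** 2), v0[0] ** 2 + sum(np.sum(w ** 2) for w in details)))
```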
While helpful for understanding the construction above, the Haar basis is often too coarse for practical purposes. Fortunately, many clever wavelet bases have been invented. Figures 5.16 and 5.18 include the Daubechies symmlet-8 basis. This basis has smoother elements than the corresponding Haar basis, but there is a tradeoff:

• Each wavelet has a support covering 15 consecutive time intervals, rather than one for the Haar basis.