\gamma_k^1 = \frac{\sum_{x_i \in R} (y_{ik} - p_{ik})}{\sum_{x_i \in R} p_{ik}(1 - p_{ik})}, \qquad k = 1, \ldots, K - 1,    (10.57)

where p_{ik} = \exp(f_k(x_i)) \big/ \sum_{\ell=1}^{K} \exp(f_\ell(x_i)).

(c) We prefer our update to sum to zero, as the current model does. Using symmetry arguments, show that

\hat{\gamma}_k = \frac{K-1}{K} \Bigl( \gamma_k^1 - \frac{1}{K} \sum_{\ell=1}^{K} \gamma_\ell^1 \Bigr), \qquad k = 1, \ldots, K    (10.58)

is an appropriate update, where γ_k^1 is defined as in (10.57) for all k = 1, . . . , K.

Ex. 10.9 Consider a K-class problem where the targets y_{ik} are coded as 1 if observation i is in class k and zero otherwise. Using the multinomial deviance loss function (10.22) and the symmetric logistic transform, use the arguments leading to the gradient boosting Algorithm 10.3 to derive Algorithm 10.4. Hint: See Exercise 10.8 for step 2(b)iii.

Ex. 10.10 Show that for K = 2 class classification, only one tree needs to be grown at each gradient-boosting iteration.
Ex. 10.11 Show how to compute the partial dependence function f_S(X_S) in (10.47) efficiently.

Ex. 10.12 Referring to (10.49), let S = {1} and C = {2}, with f(X_1, X_2) = X_1. Assume X_1 and X_2 are bivariate Gaussian, each with mean zero, variance one, and E(X_1 X_2) = ρ. Show that E(f(X_1, X_2) | X_2) = ρX_2, even though f is not a function of X_2.
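The symmetrized update (10.58) of Exercise 10.8 is what appears, one terminal node at a time, in step 2(b)iii of Algorithm 10.4 below. A minimal numerical check of its sum-to-zero property (the raw per-class Newton steps gamma1 below are arbitrary illustrative numbers, not taken from the text):

import numpy as np

# Hypothetical one-step Newton updates gamma_k^1 for a K = 4 class problem,
# computed class by class as in (10.57).
gamma1 = np.array([0.8, -0.3, 0.1, 0.5])
K = gamma1.size

# Symmetrized update (10.58): center across classes, then shrink by (K-1)/K.
gamma_hat = (K - 1) / K * (gamma1 - gamma1.mean())

print(gamma_hat)        # adjusted per-class updates
print(gamma_hat.sum())  # zero (up to floating-point error), as required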
Algorithm 10.4 Gradient Boosting for K-class Classification.

1. Initialize f_{k0}(x) = 0, k = 1, 2, . . . , K.

2. For m = 1 to M:

   (a) Set

       p_k(x) = \frac{e^{f_k(x)}}{\sum_{\ell=1}^{K} e^{f_\ell(x)}}, \qquad k = 1, 2, \ldots, K.

   (b) For k = 1 to K:

       i. Compute r_{ikm} = y_{ik} − p_k(x_i), i = 1, 2, . . . , N.

       ii. Fit a regression tree to the targets r_{ikm}, i = 1, 2, . . . , N, giving terminal regions R_{jkm}, j = 1, 2, . . . , J_m.

       iii. Compute

            \gamma_{jkm} = \frac{K-1}{K} \, \frac{\sum_{x_i \in R_{jkm}} r_{ikm}}{\sum_{x_i \in R_{jkm}} |r_{ikm}| (1 - |r_{ikm}|)}, \qquad j = 1, 2, \ldots, J_m.

       iv. Update f_{km}(x) = f_{k,m−1}(x) + \sum_{j=1}^{J_m} \gamma_{jkm} I(x \in R_{jkm}).

3. Output \hat{f}_k(x) = f_{kM}(x), k = 1, 2, . . . , K.
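To make the bookkeeping in Algorithm 10.4 concrete, here is a minimal NumPy/scikit-learn sketch, assuming least-squares trees grown with sklearn's DecisionTreeRegressor and no shrinkage; the function and parameter names (boost_k_class, n_rounds, max_leaf_nodes) are illustrative choices, not from the text:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_k_class(X, y, K, n_rounds=100, max_leaf_nodes=8):
    """Sketch of Algorithm 10.4. X: (N, p) array; y: (N,) integer labels in {0, ..., K-1}."""
    N = X.shape[0]
    Y = np.eye(K)[y]                         # indicator coding y_ik
    f = np.zeros((N, K))                     # step 1: f_k0(x_i) = 0
    models = []                              # one (tree, leaf-value table) per round and class
    for m in range(n_rounds):
        # step 2(a): current class probabilities p_k(x_i)
        p = np.exp(f - f.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        round_models = []
        for k in range(K):
            r = Y[:, k] - p[:, k]            # step 2(b)i: residuals r_ikm
            tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes)
            tree.fit(X, r)                   # step 2(b)ii: regression tree on the residuals
            leaf = tree.apply(X)             # terminal region R_jkm containing each x_i
            gamma = {}
            for j in np.unique(leaf):
                in_leaf = leaf == j
                num = r[in_leaf].sum()
                den = (np.abs(r[in_leaf]) * (1.0 - np.abs(r[in_leaf]))).sum()
                gamma[j] = (K - 1) / K * num / max(den, 1e-12)   # step 2(b)iii
            f[:, k] += np.array([gamma[j] for j in leaf])        # step 2(b)iv
            round_models.append((tree, gamma))
        models.append(round_models)
    return models, f                         # step 3: f_kM, here evaluated at the training points

Predicting at a new point x would apply each stored tree to x, add the corresponding leaf value gamma to the running f_k(x), and convert to class probabilities with the softmax of step 2(a).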
11 Neural Networks

11.1 Introduction

In this chapter we describe a class of learning methods that was developed separately in different fields—statistics and artificial intelligence—based on essentially identical models. The central idea is to extract linear combinations of the inputs as derived features, and then model the target as a nonlinear function of these features.
The result is a powerful learning method, with widespread applications in many fields. We first discuss the projection pursuit model, which evolved in the domain of semiparametric statistics and smoothing. The rest of the chapter is devoted to neural network models.

11.2 Projection Pursuit Regression

As in our generic supervised learning problem, assume we have an input vector X with p components, and a target Y. Let ω_m, m = 1, 2, . . . , M, be unit p-vectors of unknown parameters.
The projection pursuit regression (PPR) model has the form

f(X) = \sum_{m=1}^{M} g_m(\omega_m^T X).    (11.1)

This is an additive model, but in the derived features V_m = ω_m^T X rather than the inputs themselves. The functions g_m are unspecified and are estimated along with the directions ω_m using some flexible smoothing method (see below).

[FIGURE 11.1. Perspective plots of two ridge functions. (Left:) g(V) = 1/[1 + exp(−5(V − 0.5))], where V = (X1 + X2)/√2. (Right:) g(V) = (V + 0.1) sin(1/(V/3 + 0.1)), where V = X1.]

The function g_m(ω_m^T X) is called a ridge function in IR^p.
It varies only in the direction defined by the vector ω_m. The scalar variable V_m = ω_m^T X is the projection of X onto the unit vector ω_m, and we seek ω_m so that the model fits well, hence the name "projection pursuit." Figure 11.1 shows some examples of ridge functions. In the example on the left ω = (1/√2)(1, 1)^T, so that the function only varies in the direction X1 + X2. In the example on the right, ω = (1, 0).
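As a concrete illustration, the two ridge functions plotted in Figure 11.1 are easy to evaluate directly, and they can be combined into a two-term PPR model of the form (11.1). A minimal sketch (the particular two-term combination is chosen here only for illustration; it is not an example from the text):

import numpy as np

# Ridge functions from Figure 11.1: each depends on X only through omega^T X.
def g_left(v):
    return 1.0 / (1.0 + np.exp(-5.0 * (v - 0.5)))

def g_right(v):
    return (v + 0.1) * np.sin(1.0 / (v / 3.0 + 0.1))

omega1 = np.array([1.0, 1.0]) / np.sqrt(2.0)   # (1,1)/sqrt(2): varies only along X1 + X2
omega2 = np.array([1.0, 0.0])                  # (1,0): varies only along X1

def ppr(X):
    """Two-term PPR model f(X) = g1(omega1^T X) + g2(omega2^T X), as in (11.1)."""
    return g_left(X @ omega1) + g_right(X @ omega2)

X = np.random.default_rng(0).uniform(0.0, 1.0, size=(5, 2))  # five points in [0, 1]^2
print(ppr(X))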
The PPR model (11.1) is very general, since the operation of forming nonlinear functions of linear combinations generates a surprisingly large class of models. For example, the product X1 · X2 can be written as [(X1 + X2)^2 − (X1 − X2)^2]/4, and higher-order products can be represented similarly. In fact, if M is taken arbitrarily large, for appropriate choice of g_m the PPR model can approximate any continuous function in IR^p arbitrarily well. Such a class of models is called a universal approximator.
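For concreteness, the product example can itself be written as a two-term PPR model; one natural choice of unit directions and ridge functions (chosen here for illustration) is ω_1 = (1, 1)^T/√2, ω_2 = (1, −1)^T/√2, g_1(v) = v^2/2 and g_2(v) = −v^2/2, since

g_1(\omega_1^T X) + g_2(\omega_2^T X)
  = \frac{(X_1 + X_2)^2}{4} - \frac{(X_1 - X_2)^2}{4}
  = \frac{(X_1^2 + 2X_1X_2 + X_2^2) - (X_1^2 - 2X_1X_2 + X_2^2)}{4}
  = X_1 X_2 .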
However this generality comes at a price. Interpretation of the fitted model is usually difficult, because each input enters into the model in a complex and multifaceted way. As a result, the PPR model is most useful for prediction, and not very useful for producing an understandable model for the data. The M = 1 model, known as the single index model in econometrics, is an exception. It is slightly more general than the linear regression model, and offers a similar interpretation.

How do we fit a PPR model, given training data (x_i, y_i), i = 1, 2, . . . , N? We seek the approximate minimizers of the error function

\sum_{i=1}^{N} \Bigl[ y_i - \sum_{m=1}^{M} g_m(\omega_m^T x_i) \Bigr]^2    (11.2)

over functions g_m and direction vectors ω_m, m = 1, 2, . . . , M. As in other smoothing problems, we need either explicitly or implicitly to impose complexity constraints on the g_m, to avoid overfit solutions.

Consider just one term (M = 1, and drop the subscript). Given the direction vector ω, we form the derived variables v_i = ω^T x_i.
Then we have a one-dimensional smoothing problem, and we can apply any scatterplot smoother, such as a smoothing spline, to obtain an estimate of g.

On the other hand, given g, we want to minimize (11.2) over ω. A Gauss–Newton search is convenient for this task. This is a quasi-Newton method, in which the part of the Hessian involving the second derivative of g is discarded. It can be simply derived as follows.
Let ω_old be the current estimate for ω. We write

g(\omega^T x_i) \approx g(\omega_{\rm old}^T x_i) + g'(\omega_{\rm old}^T x_i)(\omega - \omega_{\rm old})^T x_i    (11.3)

to give

\sum_{i=1}^{N} \bigl[ y_i - g(\omega^T x_i) \bigr]^2 \approx \sum_{i=1}^{N} g'(\omega_{\rm old}^T x_i)^2 \Bigl[ \Bigl( \omega_{\rm old}^T x_i + \frac{y_i - g(\omega_{\rm old}^T x_i)}{g'(\omega_{\rm old}^T x_i)} \Bigr) - \omega^T x_i \Bigr]^2 .    (11.4)

To minimize the right-hand side, we carry out a least squares regression with target ω_old^T x_i + (y_i − g(ω_old^T x_i))/g'(ω_old^T x_i) on the input x_i, with weights g'(ω_old^T x_i)^2 and no intercept (bias) term. This produces the updated coefficient vector ω_new.

These two steps, estimation of g and ω, are iterated until convergence; a sketch of one such alternating fit is given below. With more than one term in the PPR model, the model is built in a forward stage-wise manner, adding a pair (ω_m, g_m) at each stage.
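A minimal sketch of this alternating procedure for a single term (M = 1), using scipy's UnivariateSpline as the scatterplot smoother and the weighted least squares regression implied by (11.4) for the ω-step. The fixed iteration count, the derivative guard, and the renormalization of ω to unit length are simplifying choices for illustration, not prescriptions from the text:

import numpy as np
from scipy.interpolate import UnivariateSpline

def fit_single_term_ppr(X, y, n_iter=10):
    """One PPR term: alternate a smoothing step for g with a Gauss-Newton step for omega."""
    N, p = X.shape
    omega = np.ones(p) / np.sqrt(p)                 # starting direction (unit vector)
    for _ in range(n_iter):
        v = X @ omega                               # derived variable v_i = omega^T x_i
        order = np.argsort(v)
        g = UnivariateSpline(v[order], y[order], k=3)    # g-step: scatterplot smoother
        # omega-step: weighted least squares of (11.4), no intercept
        gprime = g.derivative()(v)
        gprime = np.where(np.abs(gprime) < 1e-8, 1e-8, gprime)   # guard against a flat smoother
        target = v + (y - g(v)) / gprime
        w = gprime ** 2
        # solve  argmin_omega  sum_i w_i (target_i - omega^T x_i)^2
        omega, *_ = np.linalg.lstsq(X * np.sqrt(w)[:, None], target * np.sqrt(w), rcond=None)
        omega = omega / np.linalg.norm(omega)       # keep omega a unit vector
    return omega, g

A full PPR fit would wrap this routine in the forward stage-wise loop described in the text, adding one (ω_m, g_m) pair at a time.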
There are a number of implementation details.

• Although any smoothing method can in principle be used, it is convenient if the method provides derivatives. Local regression and smoothing splines are convenient.

• After each step the g_m's from previous steps can be readjusted using the backfitting procedure described in Chapter 9. While this may lead ultimately to fewer terms, it is not clear whether it improves prediction performance.

• Usually the ω_m are not readjusted (partly to avoid excessive computation), although in principle they could be as well.

• The number of terms M is usually estimated as part of the forward stage-wise strategy. The model building stops when the next term does not appreciably improve the fit of the model. Cross-validation can also be used to determine M.

There are many other applications, such as density estimation (Friedman et al., 1984; Friedman, 1987), where the projection pursuit idea can be used. In particular, see the discussion of ICA in Section 14.7 and its relationship with exploratory projection pursuit. However the projection pursuit regression model has not been widely used in the field of statistics, perhaps because at the time of its introduction (1981), its computational demands exceeded the capabilities of most readily available computers.
But it does represent an important intellectual advance, one that has blossomed in its reincarnation in the field of neural networks, the topic of the rest of this chapter.

11.3 Neural Networks

The term neural network has evolved to encompass a large class of models and learning methods. Here we describe the most widely used "vanilla" neural net, sometimes called the single hidden layer back-propagation network, or single layer perceptron. There has been a great deal of hype surrounding neural networks, making them seem magical and mysterious. As we make clear in this section, they are just nonlinear statistical models, much like the projection pursuit regression model discussed above.

A neural network is a two-stage regression or classification model, typically represented by a network diagram as in Figure 11.2.
This network applies to both regression and classification. For regression, typically K = 1 and there is only one output unit Y1 at the top. However, these networks can handle multiple quantitative responses in a seamless fashion, so we will deal with the general case.

For K-class classification, there are K units at the top, with the kth unit modeling the probability of class k.
There are K target measurements Y_k, k = 1, . . . , K, each being coded as a 0−1 variable for the kth class. Derived features Z_m are created from linear combinations of the inputs, and then the target Y_k is modeled as a function of linear combinations of the Z_m:

Z_m = \sigma(\alpha_{0m} + \alpha_m^T X), \quad m = 1, \ldots, M,
T_k = \beta_{0k} + \beta_k^T Z, \quad k = 1, \ldots, K,    (11.5)
f_k(X) = g_k(T), \quad k = 1, \ldots, K,

where Z = (Z_1, Z_2, . . . , Z_M), and T = (T_1, T_2, . . . , T_K).

The activation function σ(v) is usually chosen to be the sigmoid σ(v) = 1/(1 + e^{−v}); see Figure 11.3 for a plot of 1/(1 + e^{−v}). Sometimes Gaussian radial basis functions (Chapter 6) are used for the σ(v), producing what is known as a radial basis function network.

Neural network diagrams like Figure 11.2 are sometimes drawn with an additional bias unit feeding into every unit in the hidden and output layers.
[FIGURE 11.2. Schematic of a single hidden layer, feed-forward neural network.]

Thinking of the constant "1" as an additional input feature, this bias unit captures the intercepts α_{0m} and β_{0k} in model (11.5).

The output function g_k(T) allows a final transformation of the vector of outputs T. For regression we typically choose the identity function g_k(T) = T_k. Early work in K-class classification also used the identity function, but this was later abandoned in favor of the softmax function

g_k(T) = \frac{e^{T_k}}{\sum_{\ell=1}^{K} e^{T_\ell}} .    (11.6)

This is of course exactly the transformation used in the multilogit model (Section 4.4), and produces positive estimates that sum to one.
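To tie (11.5) and (11.6) together, here is a minimal sketch of the forward pass of such a single hidden layer network for K-class classification, with random weights standing in for fitted values; the array shapes and names are chosen only to mirror the notation above:

import numpy as np

rng = np.random.default_rng(0)
p, M, K = 4, 3, 2                      # inputs, hidden units, output classes

# Parameters of model (11.5): alpha_{0m}, alpha_m for the hidden layer,
# beta_{0k}, beta_k for the output layer (random stand-ins for fitted values).
alpha0, alpha = rng.normal(size=M), rng.normal(size=(M, p))
beta0, beta = rng.normal(size=K), rng.normal(size=(K, M))

def sigma(v):
    return 1.0 / (1.0 + np.exp(-v))    # sigmoid activation

def forward(X):
    Z = sigma(alpha0 + X @ alpha.T)    # Z_m = sigma(alpha_{0m} + alpha_m^T X)
    T = beta0 + Z @ beta.T             # T_k = beta_{0k} + beta_k^T Z
    E = np.exp(T - T.max(axis=1, keepdims=True))   # shift by the max for numerical stability
    return E / E.sum(axis=1, keepdims=True)        # softmax output g_k(T) of (11.6)

X = rng.normal(size=(5, p))            # five input vectors
probs = forward(X)
print(probs.sum(axis=1))               # each row of probabilities sums to one

For regression one would instead take the identity output function, returning T directly rather than its softmax.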