FIGURE 11.4. A neural network on the mixture example of Chapter 2. The upper panel uses no weight decay, and overfits the training data. The lower panel uses weight decay, and achieves close to the Bayes error rate (broken purple boundary). Both use the softmax activation function and cross-entropy error.

FIGURE 11.5. Heat maps of the estimated weights from the training of neural networks from Figure 11.4. The display ranges from bright green (negative) to bright red (positive).

…solution. At the outset it is best to standardize all inputs to have mean zero and standard deviation one. This ensures all inputs are treated equally in the regularization process, and allows one to choose a meaningful range for the random starting weights. With standardized inputs, it is typical to take random uniform weights over the range [−0.7, +0.7].
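As a concrete illustration of this recipe, here is a minimal sketch in Python with NumPy (the function names and the single-hidden-layer weight shapes are my own illustrative assumptions, not part of the text): standardize the inputs using training-set statistics, then draw uniform starting weights on [−0.7, +0.7].

```python
import numpy as np

rng = np.random.default_rng(0)

def standardize(X_train, X_test):
    # Center to mean zero and scale to standard deviation one,
    # using statistics computed on the training inputs only.
    mu = X_train.mean(axis=0)
    sd = X_train.std(axis=0)
    return (X_train - mu) / sd, (X_test - mu) / sd

def init_weights(n_in, n_hidden, n_out, scale=0.7):
    # With standardized inputs, random uniform starting weights
    # on [-0.7, +0.7] are a reasonable default.
    W1 = rng.uniform(-scale, scale, size=(n_in + 1, n_hidden))   # input-to-hidden (extra row for the bias)
    W2 = rng.uniform(-scale, scale, size=(n_hidden + 1, n_out))  # hidden-to-output (extra row for the bias)
    return W1, W2
```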
11.5.4 Number of Hidden Units and Layers

Generally speaking it is better to have too many hidden units than too few. With too few hidden units, the model might not have enough flexibility to capture the nonlinearities in the data; with too many hidden units, the extra weights can be shrunk toward zero if appropriate regularization is used.
Typically the number of hidden units is somewhere in the range of 5 to 100, with the number increasing with the number of inputs and number of training cases. It is most common to put down a reasonably large number of units and train them with regularization. Some researchers use cross-validation to estimate the optimal number, but this seems unnecessary if cross-validation is used to estimate the regularization parameter.

Choice of the number of hidden layers is guided by background knowledge and experimentation. Each layer extracts features of the input for regression or classification. Use of multiple hidden layers allows construction of hierarchical features at different levels of resolution. An example of the effective use of multiple layers is given in Section 11.6.
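A minimal sketch of this advice, assuming scikit-learn is available (the text itself prescribes no software, and the toy data below are purely illustrative): fix a reasonably generous single hidden layer and let cross-validation choose the weight-decay penalty (scikit-learn's alpha, an L2 penalty) rather than the number of units.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import GridSearchCV

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# Illustrative training data, in the spirit of the sum-of-sigmoids model of Section 11.6.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))
y = sigmoid(X @ [3.0, 3.0]) + sigmoid(X @ [3.0, -3.0]) + 0.3 * rng.standard_normal(100)

# A deliberately generous hidden layer; weight decay keeps the extra weights small.
net = MLPRegressor(hidden_layer_sizes=(10,), solver="lbfgs", max_iter=5000)

# Cross-validate the decay parameter instead of the number of hidden units.
grid = GridSearchCV(net, {"alpha": [1e-4, 1e-3, 1e-2, 1e-1, 1.0]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)
```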
11.5.5 Multiple Minima

The error function R(θ) is nonconvex, possessing many local minima. As a result, the final solution obtained is quite dependent on the choice of starting weights. One must at least try a number of random starting configurations, and choose the solution giving lowest (penalized) error. Probably a better approach is to use the average predictions over the collection of networks as the final prediction (Ripley, 1996). This is preferable to averaging the weights, since the nonlinearity of the model implies that this averaged solution could be quite poor. Another approach is via bagging, which averages the predictions of networks trained on randomly perturbed versions of the training data. This is described in Section 8.7.
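The restart-and-average idea can be sketched as follows (again assuming scikit-learn; the function name and settings are illustrative): fit the same architecture from several random starting configurations and average the predictions, not the weights.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def average_over_starts(X, y, X_new, n_starts=10, alpha=1e-3):
    # Refit from n_starts different random starting weights; each fit may
    # land in a different local minimum of the (penalized) error.
    preds = []
    for seed in range(n_starts):
        net = MLPRegressor(hidden_layer_sizes=(10,), alpha=alpha,
                           solver="lbfgs", max_iter=5000, random_state=seed)
        net.fit(X, y)
        preds.append(net.predict(X_new))
    # Average the predictions; averaging the weights themselves would not
    # respect the nonlinearity of the model.
    return np.mean(preds, axis=0)
```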
11.6 Example: Simulated Data

We generated data from two additive error models Y = f(X) + ε:

\[
\begin{aligned}
\text{Sum of sigmoids:}\quad Y &= \sigma(a_1^T X) + \sigma(a_2^T X) + \varepsilon_1; \\
\text{Radial:}\quad Y &= \prod_{m=1}^{10} \phi(X_m) + \varepsilon_2.
\end{aligned}
\]

Here X^T = (X_1, X_2, ..., X_p), each X_j being a standard Gaussian variate, with p = 2 in the first model, and p = 10 in the second. For the sigmoid model, a_1 = (3, 3), a_2 = (3, −3); for the radial model, φ(t) = (1/2π)^{1/2} exp(−t²/2). Both ε_1 and ε_2 are Gaussian errors, with variance chosen so that the signal-to-noise ratio

\[
\frac{\mathrm{Var}(\mathrm{E}(Y|X))}{\mathrm{Var}(Y - \mathrm{E}(Y|X))} = \frac{\mathrm{Var}(f(X))}{\mathrm{Var}(\varepsilon)}
\tag{11.18}
\]

is 4 in both models. We took a training sample of size 100 and a test sample of size 10,000. We fit neural networks with weight decay and various numbers of hidden units, and recorded the average test error E_Test(Y − f̂(X))² for each of 10 random starting weights. Only one training set was generated, but the results are typical for an "average" training set.
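A sketch of this data-generating process (the function and variable names are my own; the noise standard deviation is set from the sample variance of f(X), which approximates the population signal-to-noise ratio of 4 rather than matching it exactly):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def phi(t):
    # Standard Gaussian density.
    return np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)

def sum_of_sigmoids(n):
    # Y = sigma(a1'X) + sigma(a2'X) + eps, with a1 = (3, 3), a2 = (3, -3), p = 2.
    X = rng.standard_normal((n, 2))
    f = sigmoid(X @ [3.0, 3.0]) + sigmoid(X @ [3.0, -3.0])
    noise_sd = np.sqrt(f.var() / 4.0)   # Var(f(X)) / Var(eps) ~ 4
    return X, f + noise_sd * rng.standard_normal(n)

def radial(n):
    # Y = prod_{m=1}^{10} phi(X_m) + eps, with p = 10.
    X = rng.standard_normal((n, 10))
    f = phi(X).prod(axis=1)
    noise_sd = np.sqrt(f.var() / 4.0)
    return X, f + noise_sd * rng.standard_normal(n)

X_train, y_train = sum_of_sigmoids(100)    # training sample of size 100
X_test, y_test = sum_of_sigmoids(10000)    # test sample of size 10,000
```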