The Elements of Statistical Learning: Data Mining, Inference, and Prediction
The test errors are shown in Figure 11.6. Note that the zero hidden unit model refers to linear least squares regression. The neural network is perfectly suited to the sum of sigmoids model, and the two-unit model does perform the best, achieving an error close to the Bayes rate. (Recall that the Bayes rate for regression with squared error is the error variance; in the figures, we report test error relative to the Bayes error.)
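The multiple-random-starts protocol behind Figure 11.6 can be sketched as follows. This is a minimal illustration, not the authors' code: the data-generating coefficients a1 and a2, the noise level, the learning rate, and the epoch count are all assumptions chosen to keep the example small and fast.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

rng = np.random.default_rng(0)
# Hypothetical sum-of-sigmoids data; a1, a2 and the noise scale are assumptions
a1, a2 = np.array([3.0, 3.0]), np.array([3.0, -3.0])
Xtr, Xte = rng.standard_normal((100, 2)), rng.standard_normal((5000, 2))
truth = lambda X: sigmoid(X @ a1) + sigmoid(X @ a2)
ytr = truth(Xtr) + 0.3 * rng.standard_normal(100)
yte = truth(Xte)                 # noiseless targets: error measures the Bayes gap

def fit_net(M, seed, lam=0.0005, epochs=2000, lr=0.1):
    """Single hidden layer, M sigmoid units, linear output, weight decay lam."""
    r = np.random.default_rng(seed)
    W1, b1 = 0.5 * r.standard_normal((2, M)), np.zeros(M)
    w2, b2 = 0.5 * r.standard_normal(M), 0.0
    n = len(ytr)
    for _ in range(epochs):
        H = sigmoid(Xtr @ W1 + b1)                   # hidden activations
        resid = H @ w2 + b2 - ytr                    # output residuals
        gH = np.outer(resid, w2) * H * (1.0 - H)     # backprop through sigmoid
        w2 -= lr * (H.T @ resid / n + lam * w2)      # gradient step + weight decay
        b2 -= lr * resid.mean()
        W1 -= lr * (Xtr.T @ gH / n + lam * W1)
        b1 -= lr * gH.mean(axis=0)
    pred = sigmoid(Xte @ W1 + b1) @ w2 + b2
    return float(np.mean((pred - yte) ** 2))         # test error vs true surface

# Ten random starting-weight configurations for the two-unit model
errors = [fit_net(M=2, seed=s) for s in range(10)]
print(round(min(errors), 4), round(max(errors), 4))
```

The spread between the best and worst of the ten fits is the point of the exercise: the error surface is nonconvex, so some starts land in poor local minima.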
Notice, however, that with more hidden units, overfitting quickly creeps in, and with some starting weights the model does worse than the linear (zero hidden unit) model. Even with two hidden units, two of the ten starting weight configurations produced results no better than the linear model, confirming the importance of multiple starting values.

A radial function is in a sense the most difficult for the neural net, as it is spherically symmetric, with no preferred directions.

FIGURE 11.6. Boxplots of test error, for simulated data example, relative to the Bayes error (broken horizontal line). True function is a sum of two sigmoids on the left, and a radial function on the right. The test error is displayed for 10 different starting weights, for a single hidden layer neural network with the number of units as indicated. (Horizontal axes: Number of Hidden Units, 0–10.)

We see in the right panel of Figure 11.6 that it does poorly in this case, with the test error staying well above the Bayes error (note the different vertical scale from the left panel). In fact, since a constant fit (such as the sample average) achieves a relative error of 5 (when the SNR is 4), we see that the neural networks perform increasingly worse than the mean.

In this example we used a fixed weight decay parameter of 0.0005, representing a mild amount of regularization.
The results in the left panel of Figure 11.6 suggest that more regularization is needed with greater numbers of hidden units.

In Figure 11.7 we repeated the experiment for the sum of sigmoids model, with no weight decay in the left panel, and stronger weight decay (λ = 0.1) in the right panel. With no weight decay, overfitting becomes even more severe for larger numbers of hidden units. The weight decay value λ = 0.1 produces good results for all numbers of hidden units, and there does not appear to be overfitting as the number of units increases.
Finally, Figure 11.8 shows the test error for a ten hidden unit network, varying the weight decay parameter over a wide range. The value 0.1 is approximately optimal.

In summary, there are two free parameters to select: the weight decay λ and the number of hidden units M. As a learning strategy, one could fix either parameter at the value corresponding to the least constrained model, to ensure that the model is rich enough, and use cross-validation to choose the other parameter.
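The fix-one-parameter, cross-validate-the-other strategy can be sketched generically. As a simplification, a closed-form ridge (weight decay) fit stands in here for the neural network so that the loop runs instantly; the fold count, the λ grid, and the synthetic data are illustrative choices, not from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((120, 5))
beta = np.array([1.0, -2.0, 0.5, 0.0, 3.0])   # arbitrary true coefficients
y = X @ beta + rng.standard_normal(120)

def fit_predict(Xtr, ytr, Xval, lam):
    """Weight-decay (ridge) fit; stands in for the richer, fixed-size model."""
    p = Xtr.shape[1]
    w = np.linalg.solve(Xtr.T @ Xtr + lam * len(ytr) * np.eye(p), Xtr.T @ ytr)
    return Xval @ w

def cv_error(lam, K=5):
    """K-fold cross-validated squared error for one weight decay value."""
    folds = np.array_split(np.arange(len(y)), K)
    errs = []
    for k in range(K):
        val = folds[k]
        tr = np.setdiff1d(np.arange(len(y)), val)
        pred = fit_predict(X[tr], y[tr], X[val], lam)
        errs.append(np.mean((pred - y[val]) ** 2))
    return float(np.mean(errs))

# Model size is held at its least constrained value; only lambda is tuned
lams = [0.0, 0.001, 0.01, 0.1, 1.0]
best = min(lams, key=cv_error)
print(best)
```

The same loop applies verbatim if `fit_predict` trains a ten-hidden-unit network instead; only the inner fit changes.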
Here the least constrained values are zero weight decayand ten hidden units. Comparing the left panel of Figure 11.7 to Figure11.8, we see that the test error is less sensitive to the value of the weight11.6 Example: Simulated Data2.51.01.52.0Test Error2.01.01.5Test Error2.53.0Weight Decay=0.13.0No Weight Decay4030 1 2 3 4 5 6 7 8 9 100 1 2 3 4 5 6 7 8 9 10Number of Hidden UnitsNumber of Hidden UnitsFIGURE 11.7. Boxplots of test error, for simulated data example, relative to theBayes error. True function is a sum of two sigmoids. The test error is displayedfor ten different starting weights, for a single hidden layer neural network withthe number units as indicated. The two panels represent no weight decay (left)and strong weight decay λ = 0.1 (right).1.81.61.01.21.4Test Error2.02.2Sum of Sigmoids, 10 Hidden Unit Model0.000.020.040.060.080.100.120.14Weight Decay ParameterFIGURE 11.8.
Boxplots of test error, for simulated data example. True functionis a sum of two sigmoids. The test error is displayed for ten different startingweights, for a single hidden layer neural network with ten hidden units and weightdecay parameter value as indicated.404Neural NetworksFIGURE 11.9. Examples of training cases from ZIP code data. Each image isa 16 × 16 8-bit grayscale representation of a handwritten digit.decay parameter, and hence cross-validation of this parameter would bepreferred.11.7 Example: ZIP Code DataThis example is a character recognition task: classification of handwrittennumerals. This problem captured the attention of the machine learning andneural network community for many years, and has remained a benchmarkproblem in the field. Figure 11.9 shows some examples of normalized handwritten digits, automatically scanned from envelopes by the U.S.
PostalService. The original scanned digits are binary and of different sizes andorientations; the images shown here have been deslanted and size normalized, resulting in 16 × 16 grayscale images (Le Cun et al., 1990). These 256pixel values are used as inputs to the neural network classifier.A black box neural network is not ideally suited to this pattern recognition task, partly because the pixel representation of the images lack certaininvariances (such as small rotations of the image). Consequently early attempts with neural networks yielded misclassification rates around 4.5%on various examples of the problem.
In this section we show some of the pioneering efforts to handcraft the neural network to overcome some of these deficiencies (Le Cun, 1989), which ultimately led to the state of the art in neural network performance (Le Cun et al., 1998).¹

Although current digit datasets have tens of thousands of training and test examples, the sample size here is deliberately modest in order to emphasize the effects. The examples were obtained by scanning some actual hand-drawn digits, and then generating additional images by random horizontal shifts. Details may be found in Le Cun (1989). There are 320 digits in the training set, and 160 in the test set.

Five different networks were fit to the data:

Net-1: No hidden layer, equivalent to multinomial logistic regression.
Net-2: One hidden layer, 12 hidden units fully connected.
Net-3: Two hidden layers locally connected.
Net-4: Two hidden layers, locally connected with weight sharing.
Net-5: Two hidden layers, locally connected, two levels of weight sharing.

These are depicted in Figure 11.10. Net-1, for example, has 256 inputs, one each for the 16 × 16 input pixels, and ten output units, one for each of the digits 0–9.

FIGURE 11.10. Architecture of the five networks used in the ZIP code example.

¹The figures and tables in this example were recreated from Le Cun (1989).
The predicted value f̂k(x) represents the estimated probability that an image x has digit class k, for k = 0, 1, 2, . . . , 9.

FIGURE 11.11. Test performance curves, as a function of the number of training epochs, for the five networks of Table 11.1 applied to the ZIP code data (Le Cun, 1989). (Vertical axis: % Correct on Test Data; horizontal axis: Training Epochs, 0–30.)

The networks all have sigmoidal output units, and were all fit with the sum-of-squares error function. The first network has no hidden layer, and hence is nearly equivalent to a linear multinomial regression model (Exercise 11.4). Net-2 is a single hidden layer network with 12 hidden units, of the kind described above.

The training set error for all of the networks was 0%, since in all cases there are more parameters than training observations.
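A minimal sketch of Net-1's forward pass, together with the parameter counting that explains the 0% training error. The weights here are random placeholders, not trained values; only the layer shapes come from the text (256 inputs, 10 sigmoidal outputs, 12 hidden units for Net-2, 320 training cases).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.random(256)                      # one flattened 16x16 grayscale image

# Net-1: no hidden layer, 10 sigmoidal output units on the 256 pixel inputs.
# W and b are random placeholders standing in for trained weights.
W, b = 0.01 * rng.standard_normal((256, 10)), np.zeros(10)
f_hat = sigmoid(x @ W + b)               # f_hat[k] estimates P(digit = k | x)
digit = int(np.argmax(f_hat))            # classify to the largest output

# Weight counts (including biases): both exceed the 320 training cases,
# which is why every network can drive the training error to 0%.
net1_params = 256 * 10 + 10              # Net-1: 2570 parameters
net2_params = 256 * 12 + 12 + 12 * 10 + 10   # Net-2: 3214 parameters
print(digit, net1_params, net2_params)
```

With more parameters than observations, the training data can be interpolated exactly, so training error is uninformative and test curves like Figure 11.11 are needed.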
The evolution of the test error during the training epochs is shown in Figure 11.11. The linear network (Net-1) starts to overfit fairly quickly, while the test performance of the others levels off at successively superior values.

The other three networks have additional features which demonstrate the power and flexibility of the neural network paradigm. They introduce constraints on the network, natural for the problem at hand, which allow for more complex connectivity but fewer parameters.

Net-3 uses local connectivity: this means that each hidden unit is connected to only a small patch of units in the layer below.
In the first hidden layer (an 8 × 8 array), each unit takes inputs from a 3 × 3 patch of the input layer; for units in the first hidden layer that are one unit apart, their receptive fields overlap by one row or column, and hence are two pixels apart. In the second hidden layer, inputs are from a 5 × 5 patch, and again units that are one unit apart have receptive fields that are two units apart. The weights for all other connections are set to zero.
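The geometry of Net-3's first locally connected layer can be sketched as below. One caveat: the text does not say how patches are handled at the image border, so clipping them at the edge is an assumption of this sketch; the parameter comparison against a fully connected layer is the point.

```python
# Net-3 first layer geometry: 8x8 hidden units, each reading a 3x3 input
# patch, with receptive fields of neighboring units two pixels apart.
IN, H, K, STRIDE = 16, 8, 3, 2

def receptive_field(i, j):
    """Input pixels feeding hidden unit (i, j); clipped at the image border
    (border handling is an assumption, not specified in the text)."""
    rows = range(i * STRIDE, min(i * STRIDE + K, IN))
    cols = range(j * STRIDE, min(j * STRIDE + K, IN))
    return [(r, c) for r in rows for c in cols]

# Units one apart: fields two pixels apart, overlapping by one row/column
assert receptive_field(0, 0)[0] == (0, 0)
assert set(receptive_field(0, 0)) & set(receptive_field(0, 1))

# Local connectivity: at most K*K weights per unit; all other weights are zero
local_weights = sum(len(receptive_field(i, j))
                    for i in range(H) for j in range(H))
full_weights = (IN * IN) * (H * H)      # a fully connected layer, for contrast
print(local_weights, full_weights)
```

Even before any weight sharing, local connectivity cuts the first-layer weight count by more than an order of magnitude relative to full connectivity.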