TABLE 11.1. Test set performance of five different neural networks on a handwritten digit classification example (Le Cun, 1989).

          Network Architecture     Links   Weights   % Correct
  Net-1:  Single layer network      2570      2570       80.0%
  Net-2:  Two layer network         3214      3214       87.0%
  Net-3:  Locally connected         1226      1226       88.5%
  Net-4:  Constrained network 1     2266      1132       94.0%
  Net-5:  Constrained network 2     5194      1060       98.4%

Local connectivity makes each unit responsible for extracting local features from the layer below, and reduces considerably the total number of weights.
With many more hidden units than Net-2, Net-3 has fewer links and hence weights (1226 vs. 3214), and achieves similar performance.

Net-4 and Net-5 have local connectivity with shared weights. All units in a local feature map perform the same operation on different parts of the image, achieved by sharing the same weights. The first hidden layer of Net-4 has two 8 × 8 arrays, and each unit takes input from a 3 × 3 patch just like in Net-3. However, the units in a single 8 × 8 feature map all share the same set of nine weights (but each has its own bias parameter). This forces the extracted features in different parts of the image to be computed by the same linear functional, and consequently these networks are sometimes known as convolutional networks.
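As a concrete illustration of this weight sharing, the following is a minimal NumPy sketch (ours, not code from the book) of one such feature map: every unit applies the same nine weights to its own 3 × 3 patch of the image and adds its own bias. The stride-2 spacing, one-pixel zero padding, and tanh activation are assumptions chosen so that a 16 × 16 input image yields an 8 × 8 array of units, as in the first hidden layer of Net-4.

```python
import numpy as np

def shared_weight_feature_map(image, weights, bias, stride=2):
    """One feature map in which every unit applies the same 3x3 weights
    to its own local patch of the input image.

    image   : (16, 16) input, e.g. a ZIP-code digit
    weights : (3, 3) weights shared by all units in the map
    bias    : (8, 8) per-unit bias (each unit keeps its own bias)
    """
    padded = np.pad(image, 1)   # zero-pad so every stride-2 location has a 3x3 patch
    out = np.empty((8, 8))
    for i in range(8):
        for j in range(8):
            patch = padded[i * stride : i * stride + 3,
                           j * stride : j * stride + 3]
            # Same linear functional everywhere, plus the unit's own bias.
            out[i, j] = np.tanh(np.sum(weights * patch) + bias[i, j])
    return out

rng = np.random.default_rng(0)
img = rng.standard_normal((16, 16))          # stand-in for a digit image
w = rng.standard_normal((3, 3)) * 0.1        # the nine shared weights
b = np.zeros((8, 8))                         # one bias per unit
fmap = shared_weight_feature_map(img, w, b)  # -> (8, 8) feature map
```

Because all 64 units of the map reuse the same nine weights, the number of weights grows much more slowly than the number of links, which is the pattern visible for Net-4 and Net-5 in Table 11.1.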
The second hidden layer of Net-4 has no weight sharing, and is the same as in Net-3. The gradient of the error function R with respect to a shared weight is the sum of the gradients of R with respect to each connection controlled by the weights in question.
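In symbols (our notation, not the book's): if a single shared weight w_s ties together a set S of connections, each of which would carry its own weight w_c in an unconstrained network, then

\[
\frac{\partial R}{\partial w_s} \;=\; \sum_{c \in \mathcal{S}} \left. \frac{\partial R}{\partial w_c} \right|_{w_c = w_s},
\]

so back-propagation can proceed exactly as for an unconstrained network, with the tied gradient contributions summed afterwards.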
Table 11.1 gives the number of links, the number of weights and the optimal test performance for each of the networks. We see that Net-4 has more links but fewer weights than Net-3, and superior test performance. Net-5 has four 4 × 4 feature maps in the second hidden layer, each unit connected to a 5 × 5 local patch in the layer below. Weights are shared in each of these feature maps. We see that Net-5 does the best, having errors of only 1.6%, compared to 13% for the "vanilla" network Net-2. The clever design of network Net-5, motivated by the fact that features of handwriting style should appear in more than one part of a digit, was the result of many person-years of experimentation.
This and similar networks gave better performance on ZIP code problems than any other learning method at that time (early 1990s). This example also shows that neural networks are not a fully automatic tool, as they are sometimes advertised. As with all statistical models, subject matter knowledge can and should be used to improve their performance.

This network was later outperformed by the tangent distance approach (Simard et al., 1993) described in Section 13.3.3, which explicitly incorporates natural affine invariances.
At this point the digit recognition datasets became test beds for every new learning procedure, and researchers worked hard to drive down the error rates. As of this writing, the best error rates on a large database (60,000 training, 10,000 test observations), derived from standard NIST databases (the National Institute of Standards and Technology maintains large databases, including handwritten character databases; http://www.nist.gov/srd/), were reported to be the following (Le Cun et al., 1998):

• 1.1% for tangent distance with a 1-nearest neighbor classifier (Section 13.3.3);

• 0.8% for a degree-9 polynomial SVM (Section 12.3);

• 0.8% for LeNet-5, a more complex version of the convolutional network described here;

• 0.7% for boosted LeNet-4. Boosting is described in Chapter 10. LeNet-4 is a predecessor of LeNet-5.

Le Cun et al. (1998) report a much larger table of performance results, and it is evident that many groups have been working very hard to bring these test error rates down.
They report a standard error of 0.1% on the error estimates, which is based on a binomial average with N = 10,000 and p ≈ 0.01. This implies that error rates within 0.1–0.2% of one another are statistically equivalent. Realistically the standard error is even higher, since the test data has been implicitly used in the tuning of the various procedures.
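As a quick check on the 0.1% figure (our calculation, not from the text), the binomial standard error of an error-rate estimate based on N independent test cases is

\[
\widehat{\mathrm{SE}} \;=\; \sqrt{\frac{p(1-p)}{N}} \;\approx\; \sqrt{\frac{0.01 \times 0.99}{10{,}000}} \;\approx\; 0.001 \;=\; 0.1\%.
\]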
11.8 Discussion

Both projection pursuit regression and neural networks take nonlinear functions of linear combinations ("derived features") of the inputs. This is a powerful and very general approach for regression and classification, and has been shown to compete well with the best learning methods on many problems.

These tools are especially effective in problems with a high signal-to-noise ratio and settings where prediction without interpretation is the goal. They are less effective for problems where the goal is to describe the physical process that generated the data and the roles of individual inputs.
Each input enters into the model in many places, in a nonlinear fashion. Some authors (Hinton, 1989) plot a diagram of the estimated weights into each hidden unit, to try to understand the feature that each unit is extracting. This is limited, however, by the lack of identifiability of the parameter vectors α_m, m = 1, ..., M. Often there are solutions with α_m spanning the same linear space as the ones found during training, giving predicted values that are roughly the same.
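A small numerical illustration of this non-identifiability (a sketch of ours, not from the text): for a single-hidden-layer network with an odd activation such as tanh, permuting the hidden units, or flipping the sign of any α_m together with its output weight, changes the parameters but not the fitted function.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 4))        # five observations, four inputs

alpha = rng.standard_normal((3, 4))    # hidden-unit weight vectors alpha_m, m = 1, 2, 3
beta = rng.standard_normal(3)          # output-layer weights, one per hidden unit

def predict(X, alpha, beta):
    # f(x) = sum_m beta_m * tanh(alpha_m' x): a single-hidden-layer network
    return np.tanh(X @ alpha.T) @ beta

# Reorder the hidden units and flip every sign: the parameters differ,
# but the predictions are identical.
perm = np.array([2, 0, 1])
alpha2, beta2 = -alpha[perm], -beta[perm]

print(np.allclose(predict(X, alpha, beta), predict(X, alpha2, beta2)))  # True
```

Plots or summaries of the individual α_m therefore have to be read with care, since an equally good fit can present the weights in a quite different arrangement.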
Some authors suggest carrying out a principal component analysis of these weights, to try to find an interpretable solution. In general, the difficulty of interpreting these models has limited their use in fields like medicine, where interpretation of the model is very important.

There has been a great deal of research on the training of neural networks. Unlike methods like CART and MARS, neural networks are smooth functions of real-valued parameters. This facilitates the development of Bayesian inference for these models. The next section discusses a successful Bayesian implementation of neural networks.

11.9 Bayesian Neural Nets and the NIPS 2003 Challenge

A classification competition was held in 2003, in which five labeled training datasets were provided to participants.
It was organized for a Neural Information Processing Systems (NIPS) workshop. Each of the data sets constituted a two-class classification problem, with different sizes and from a variety of domains (see Table 11.2). Feature measurements for a validation dataset were also available.

Participants developed and applied statistical learning procedures to make predictions on the datasets, and could submit predictions to a website on the validation set for a period of 12 weeks. With this feedback, participants were then asked to submit predictions for a separate test set and they received their results. Finally, the class labels for the validation set were released and participants had one week to train their algorithms on the combined training and validation sets, and submit their final predictions to the competition website.
A total of 75 groups participated, with 20 and 16 eventually making submissions on the validation and test sets, respectively.

There was an emphasis on feature extraction in the competition. Artificial "probes" were added to the data: these are noise features with distributions resembling the real features, but independent of the class labels. The percentage of probes that were added to each dataset, relative to the total set of features, is shown in Table 11.2. Thus each learning algorithm had to figure out a way of identifying the probes and downweighting or eliminating them.
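To make the idea of probes concrete, here is a small sketch (ours; the challenge organizers used their own constructions) that appends probe columns to a feature matrix by permuting real feature columns across the cases. Permutation preserves each probe's marginal distribution while destroying any association with the class labels.

```python
import numpy as np

def add_probes(X, n_probes, rng):
    """Append noise 'probe' features that mimic the marginal distributions
    of real features but are independent of the class labels."""
    n, p = X.shape
    mimic = rng.integers(0, p, size=n_probes)      # which real feature each probe copies
    probes = np.column_stack(
        [rng.permutation(X[:, j]) for j in mimic]  # permute values across cases
    )
    return np.hstack([X, probes])

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))             # 100 cases, 20 real features
X_aug = add_probes(X, n_probes=10, rng=rng)    # 30 columns; the last 10 are probes
```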
A number of metrics were used to evaluate the entries, including the percentage correct on the test set, the area under the ROC curve, and a combined score that compared each pair of classifiers head-to-head.

TABLE 11.2. NIPS 2003 challenge data sets. The column labeled p is the number of features. For the Dorothea dataset the features are binary. N_tr, N_val and N_te are the number of training, validation and test cases, respectively.

  Dataset    Domain                Feature Type        p   Percent Probes   N_tr   N_val   N_te
  Arcene     Mass spectrometry     Dense          10,000               30    100     100    700
  Dexter     Text classification   Sparse         20,000               50    300     300   2000
  Dorothea   Drug discovery        Sparse        100,000               50    800     350    800
  Gisette    Digit recognition     Dense            5000               30   6000    1000   6500
  Madelon    Artificial            Dense             500               96   2000     600   1800

The results of the competition are very interesting and are detailed in Guyon et al. (2006). The most notable result: the entries of Neal and Zhang (2006) were the clear overall winners. In the final competition they finished first in three of the five datasets, and were 5th and 7th on the remaining two datasets.

In their winning entries, Neal and Zhang (2006) used a series of preprocessing feature-selection steps, followed by Bayesian neural networks, Dirichlet diffusion trees, and combinations of these methods.