Here we focus only on the Bayesian neural network approach, and try to discern which aspects of their approach were important for its success. We rerun their programs and compare the results to boosted neural networks and boosted trees, and other related methods.

11.9.1 Bayes, Boosting and Bagging

Let us first review briefly the Bayesian approach to inference and its application to neural networks. Given training data $X_{\rm tr}, y_{\rm tr}$, we assume a sampling model with parameters $\theta$; Neal and Zhang (2006) use a two-hidden-layer neural network, with output nodes the class probabilities $\Pr(Y|X,\theta)$ for the binary outcomes.
Given a prior distribution $\Pr(\theta)$, the posterior distribution for the parameters is
\[
\Pr(\theta \mid X_{\rm tr}, y_{\rm tr}) =
  \frac{\Pr(\theta)\,\Pr(y_{\rm tr} \mid X_{\rm tr}, \theta)}
       {\int \Pr(\theta)\,\Pr(y_{\rm tr} \mid X_{\rm tr}, \theta)\, d\theta}
  \qquad (11.19)
\]
For a test case with features $X_{\rm new}$, the predictive distribution for the label $Y_{\rm new}$ is
\[
\Pr(Y_{\rm new} \mid X_{\rm new}, X_{\rm tr}, y_{\rm tr}) =
  \int \Pr(Y_{\rm new} \mid X_{\rm new}, \theta)\,
       \Pr(\theta \mid X_{\rm tr}, y_{\rm tr})\, d\theta
  \qquad (11.20)
\]
(cf. equation 8.24).
Since the integral in (11.20) is intractable, sophisticated Markov chain Monte Carlo (MCMC) methods are used to sample from the posterior distribution $\Pr(Y_{\rm new} \mid X_{\rm new}, X_{\rm tr}, y_{\rm tr})$. A few hundred values $\theta$ are generated and then a simple average of these values estimates the integral.
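Written out for concreteness (with $\theta^{(\ell)}$ denoting the $\ell$th posterior draw, notation introduced here), the Monte Carlo estimate of (11.20) is simply
\[
\Pr(Y_{\rm new} \mid X_{\rm new}, X_{\rm tr}, y_{\rm tr}) \;\approx\;
\frac{1}{L} \sum_{\ell=1}^{L} \Pr(Y_{\rm new} \mid X_{\rm new}, \theta^{(\ell)}),
\qquad \theta^{(\ell)} \sim \Pr(\theta \mid X_{\rm tr}, y_{\rm tr}),
\]
a special case of the averaging form (11.21) given below, with $w_\ell = 1/L$.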
Neal and Zhang (2006) use diffuse Gaussian priors for all of the parameters. The particular MCMC approach that was used is called hybrid Monte Carlo, and may be important for the success of the method. It includes an auxiliary momentum vector and implements Hamiltonian dynamics in which the potential function is the target density. This is done to avoid random walk behavior; the successive candidates move across the sample space in larger steps. They tend to be less correlated and hence converge to the target distribution more rapidly.
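For background — this is the standard formulation of hybrid (Hamiltonian) Monte Carlo, not a description of Neal and Zhang's specific tuning — the parameter vector $\theta$ is augmented with a momentum vector $p$ of the same dimension, and candidate moves are obtained by simulating Hamiltonian dynamics for the total energy
\[
H(\theta, p) \;=\; -\log \Pr(\theta \mid X_{\rm tr}, y_{\rm tr}) \;+\; \tfrac{1}{2}\, p^T p
\]
(potential plus kinetic energy), followed by a Metropolis accept/reject step; the momentum is refreshed from a Gaussian between trajectories. Long trajectories are what allow the large, weakly correlated steps described above.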
Neal and Zhang (2006) also tried different forms of pre-processing of the features:

1. univariate screening using t-tests, and
2. automatic relevance determination.

In the latter method (ARD), the weights (coefficients) for the $j$th feature to each of the first hidden layer units all share a common prior variance $\sigma_j^2$, and prior mean zero. The posterior distributions for each variance $\sigma_j^2$ are computed, and the features whose posterior variance concentrates on small values are discarded.
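In symbols (a minimal sketch of the prior structure just described; $M_1$ denotes the number of first-hidden-layer units, and the hyperprior placed on each $\sigma_j^2$ is part of Neal's software and not specified here), the weight from input feature $j$ to hidden unit $m$ has
\[
w_{jm} \sim N(0, \sigma_j^2), \qquad m = 1, \ldots, M_1 ,
\]
so a single variance $\sigma_j^2$ controls how strongly feature $j$ can enter the model, and features whose posterior for $\sigma_j^2$ concentrates near zero are effectively switched off.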
There are thus three main features of this approach that could be important for its success:

(a) the feature selection and pre-processing,
(b) the neural network model, and
(c) the Bayesian inference for the model using MCMC.

According to Neal and Zhang (2006), feature screening in (a) is carried out purely for computational efficiency; the MCMC procedure is slow with a large number of features. There is no need to use feature selection to avoid overfitting. The posterior average (11.20) takes care of this automatically.

We would like to understand the reasons for the success of the Bayesian method. In our view, the power of modern Bayesian methods does not lie in their use as a formal inference procedure; most people would not believe that the priors in a high-dimensional, complex neural network model are actually correct. Rather, the Bayesian/MCMC approach gives an efficient way of sampling the relevant parts of model space, and then averaging the predictions for the high-probability models.

Bagging and boosting are non-Bayesian procedures that have some similarity to MCMC in a Bayesian model.
The Bayesian approach fixes the data and perturbs the parameters, according to the current estimate of the posterior distribution. Bagging perturbs the data in an i.i.d. fashion and then re-estimates the model to give a new set of model parameters. At the end, a simple average of the model predictions from different bagged samples is computed. Boosting is similar to bagging, but fits a model that is additive in the models of each individual base learner, which are learned using non-i.i.d. samples.
We can write all of these models in the form
\[
\hat f(x_{\rm new}) = \sum_{\ell=1}^{L} w_\ell\, \mathrm{E}(Y_{\rm new} \mid x_{\rm new}, \hat\theta_\ell)
\qquad (11.21)
\]
In all cases the $\hat\theta_\ell$ are a large collection of model parameters. For the Bayesian model the $w_\ell = 1/L$, and the average estimates the posterior mean (11.20) by sampling $\theta_\ell$ from the posterior distribution. For bagging, $w_\ell = 1/L$ as well, and the $\hat\theta_\ell$ are the parameters refit to bootstrap resamples of the training data. For boosting, the weights are all equal to 1, but the $\hat\theta_\ell$ are typically chosen in a nonrandom sequential fashion to constantly improve the fit.
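As a concrete illustration of the bagging case of (11.21), here is a minimal sketch using the nnet package that is described later in this section. It is illustrative only: the data frame train (with a two-class factor response y), the test data frame xnew, and the settings L, size, decay and maxit are all hypothetical placeholders.

    library(nnet)

    L <- 25                                 # number of bootstrap fits (illustrative)
    phat <- numeric(nrow(xnew))             # averaged predicted probabilities
    for (ell in 1:L) {
      ## perturb the data: an i.i.d. bootstrap resample of the training set
      boot <- train[sample(nrow(train), replace = TRUE), ]
      ## re-estimate the model, giving a new set of parameters (theta-hat_ell)
      fit <- nnet(y ~ ., data = boot, size = 4, decay = 0.01,
                  maxit = 200, trace = FALSE)
      ## w_ell = 1/L: a simple average of the predictions, as in (11.21)
      phat <- phat + as.vector(predict(fit, newdata = xnew, type = "raw")) / L
    }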
11.9.2 Performance Comparisons

Based on the similarities above, we decided to compare Bayesian neural networks to boosted trees, boosted neural networks, random forests and bagged neural networks on the five datasets in Table 11.2.
Bagging and boosting of neural networks are not methods that we have previously used in our work. We decided to try them here, because of the success of Bayesian neural networks in this competition, and the good performance of bagging and boosting with trees. We also felt that by bagging and boosting neural nets, we could assess both the choice of model as well as the model search strategy.

Here are the details of the learning methods that were compared:

Bayesian neural nets.
The results here are taken from Neal and Zhang (2006), using their Bayesian approach to fitting neural networks. The models had two hidden layers of 20 and 8 units. We re-ran some networks for timing purposes only.

Boosted trees. We used the gbm package (version 1.5-7) in the R language. Tree depth and shrinkage factors varied from dataset to dataset. We consistently bagged 80% of the data at each boosting iteration (the default is 50%).
Shrinkage was between 0.001 and 0.1. Tree depth was between 2 and 9.

Boosted neural networks. Since boosting is typically most effective with “weak” learners, we boosted a single hidden layer neural network with two or four units, fit with the nnet package (version 7.2-36) in R.

Random forests. We used the R package randomForest (version 4.5-16) with default settings for the parameters.

Bagged neural networks. We used the same architecture as in the Bayesian neural network above (two hidden layers of 20 and 8 units), fit using both Neal’s C language package “Flexible Bayesian Modeling” (2004-11-10 release), and the Matlab neural-net toolbox (version 5.1).

Representative R calls for the non-Bayesian methods are sketched below.
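The following is a minimal sketch of what such fits might look like; it is illustrative only, not the authors' actual scripts. The data frame train with two-class factor response y is a hypothetical placeholder, and the shrinkage, depth and iteration counts shown are arbitrary values within the ranges quoted above.

    library(gbm)
    library(randomForest)
    library(nnet)

    ## gbm's "bernoulli" loss expects a 0/1 numeric response, so recode the factor.
    train.gbm <- transform(train, y = as.numeric(y) - 1)

    ## Boosted trees: 80% of the data bagged at each boosting iteration,
    ## shrinkage in [0.001, 0.1] and tree depth between 2 and 9.
    fit.gbm <- gbm(y ~ ., data = train.gbm, distribution = "bernoulli",
                   n.trees = 1000, interaction.depth = 6,
                   shrinkage = 0.01, bag.fraction = 0.8)

    ## Random forest with default settings for the parameters.
    fit.rf <- randomForest(y ~ ., data = train)

    ## The "weak" base learner used for boosted neural networks:
    ## a single hidden layer with two (or four) units.
    fit.nn <- nnet(y ~ ., data = train, size = 2, maxit = 200, trace = FALSE)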
FIGURE 11.12. Performance of different learning methods on five problems, using both univariate screening of features (top panel) and a reduced feature set from automatic relevance determination. The error bars at the top of each plot have width equal to one standard error of the difference between two error rates. On most of the problems several competitors are within this error bound. [The two panels, “Univariate Screened Features” and “ARD Reduced Features”, plot the test error (%) of each method on the Arcene, Dexter, Dorothea, Gisette and Madelon datasets.]

This analysis was carried out by Nicholas Johnson, and full details may be found in Johnson (2008)³. The results are shown in Figure 11.12 and Table 11.3. The figure and table show Bayesian, boosted and bagged neural networks, boosted trees, and random forests, using both the screened and reduced feature sets.

³ We also thank Isabelle Guyon for help in preparing the results of this section.
TABLE 11.3. Performance of different methods. Values are average rank of test error across the five problems (low is good), and mean computation time and standard error of the mean, in minutes.

                                 Screened Features          ARD Reduced Features
    Method                       Average     Average        Average     Average
                                 Rank        Time           Rank        Time
    Bayesian neural networks     1.5         384 (138)      1.6         600 (186)
    Boosted trees                3.4         3.03 (2.5)     4.0         34.1 (32.4)
    Boosted neural networks      3.8         9.4 (8.6)      2.2         35.6 (33.5)
    Random forests               2.7         1.9 (1.7)      3.2         11.2 (9.3)
    Bagged neural networks       3.6         3.5 (1.1)      4.0         6.4 (4.4)

The error bars at the top of each plot indicate one standard error of the difference between two error rates. Bayesian neural networks again emerge as the winner, although for some datasets the differences between the test error rates are not statistically significant. Random forests performs the best among the competitors using the selected feature set, while the boosted neural networks perform best with the reduced feature set, and nearly match the Bayesian neural net.

The superiority of boosted neural networks over boosted trees suggests that the neural network model is better suited to these particular problems. Specifically, individual features might not be good predictors here and linear combinations of features work better.
However the impressive performance of random forests is at odds with this explanation, and came as a surprise to us.

Since the reduced feature sets come from the Bayesian neural network approach, only the methods that use the screened features are legitimate, self-contained procedures. However, this does suggest that better methods for internal feature selection might help the overall performance of boosted neural networks.

The table also shows the approximate training time required for each method.
Here the non-Bayesian methods show a clear advantage.

Overall, the superior performance of Bayesian neural networks here may be due to the fact that

(a) the neural network model is well suited to these five problems, and
(b) the MCMC approach provides an efficient way of exploring the important part of the parameter space, and then averaging the resulting models according to their quality.

The Bayesian approach works well for smoothly parametrized models like neural nets; it is not yet clear that it works as well for non-smooth models like trees.

11.10 Computational Considerations

With $N$ observations, $p$ predictors, $M$ hidden units and $L$ training epochs, a neural network fit typically requires $O(NpML)$ operations.
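As a rough illustration (the numbers here are hypothetical, not from the text): a single fit with $N = 10^4$ observations, $p = 100$ predictors, $M = 30$ hidden units and $L = 100$ epochs is of order
\[
N p M L = 10^4 \times 10^2 \times 30 \times 10^2 = 3 \times 10^9
\]
operations, and the bagged, boosted and Bayesian procedures of Section 11.9 repeat work of roughly this magnitude many times over.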
There are many packages available for fitting neural networks, probably many more than exist for mainstream statistical methods. Because the available software varies widely in quality, and the learning problem for neural networks is sensitive to issues such as input scaling, such software should be carefully chosen and tested.

Bibliographic Notes

Projection pursuit was proposed by Friedman and Tukey (1974), and specialized to regression by Friedman and Stuetzle (1981). Huber (1985) gives a scholarly overview, and Roosen and Hastie (1994) present a formulation using smoothing splines. The motivation for neural networks dates back to McCulloch and Pitts (1943), Widrow and Hoff (1960) (reprinted in Anderson and Rosenfeld (1988)) and Rosenblatt (1962).