Bishop C.M. Pattern Recognition and Machine Learning (2006) (811375), page 55
This figure also shows how individual hidden units work collaboratively to approximate the final function. The role of hidden units in a simple classification problem is illustrated in Figure 5.4 using the synthetic classification data set described in Appendix A.

5.1.1 Weight-space symmetries

One property of feed-forward networks, which will play a role when we consider Bayesian model comparison, is that multiple distinct choices for the weight vector $\mathbf{w}$ can all give rise to the same mapping function from inputs to outputs (Chen et al., 1993). Consider a two-layer network of the form shown in Figure 5.1 with $M$ hidden units having ‘tanh’ activation functions and full connectivity in both layers.
If we change the sign of all of the weights and the bias feeding into a particular hidden unit, then, for a given input pattern, the sign of the activation of the hidden unit will be reversed, because ‘tanh’ is an odd function, so that $\tanh(-a) = -\tanh(a)$. This transformation can be exactly compensated by changing the sign of all of the weights leading out of that hidden unit. Thus, by changing the signs of a particular group of weights (and a bias), the input–output mapping function represented by the network is unchanged, and so we have found two different weight vectors that give rise to the same mapping function.
For $M$ hidden units, there will be $M$ such ‘sign-flip’ symmetries, and thus any given weight vector will be one of a set of $2^M$ equivalent weight vectors.

5. NEURAL NETWORKS

Figure 5.4: Example of the solution of a simple two-class classification problem involving synthetic data using a neural network having two inputs, two hidden units with ‘tanh’ activation functions, and a single output having a logistic sigmoid activation function. The dashed blue lines show the $z = 0.5$ contours for each of the hidden units, and the red line shows the $y = 0.5$ decision surface for the network. For comparison, the green line denotes the optimal decision boundary computed from the distributions used to generate the data.

Similarly, imagine that we interchange the values of all of the weights (and the bias) leading both into and out of a particular hidden unit with the corresponding values of the weights (and bias) associated with a different hidden unit.
Again, this clearly leaves the network input–output mapping function unchanged, but it corresponds to a different choice of weight vector. For $M$ hidden units, any given weight vector will belong to a set of $M!$ equivalent weight vectors associated with this interchange symmetry, corresponding to the $M!$ different orderings of the hidden units. The network will therefore have an overall weight-space symmetry factor of $M!\,2^M$. For networks with more than two layers of weights, the total level of symmetry will be given by the product of such factors, one for each layer of hidden units.

It turns out that these factors account for all of the symmetries in weight space (except for possible accidental symmetries due to specific choices for the weight values).
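The sign-flip and interchange symmetries described above can be checked numerically. The following is a minimal sketch, not taken from the text: the function name `two_layer_tanh`, the sizes `D`, `M`, `K`, the random weights, and the choice of linear output units are all assumptions made for illustration.

```python
import math
import numpy as np

def two_layer_tanh(x, W1, b1, W2, b2):
    """Two-layer network with tanh hidden units and (assumed) linear outputs."""
    z = np.tanh(W1 @ x + b1)   # hidden unit activations
    return W2 @ z + b2         # network outputs

rng = np.random.default_rng(0)
D, M, K = 3, 4, 2              # inputs, hidden units, outputs (illustrative sizes)
W1 = rng.normal(size=(M, D))   # first-layer weights, one row per hidden unit
b1 = rng.normal(size=M)        # first-layer biases
W2 = rng.normal(size=(K, M))   # second-layer weights, one column per hidden unit
b2 = rng.normal(size=K)        # output biases
x = rng.normal(size=D)

y = two_layer_tanh(x, W1, b1, W2, b2)

# Sign-flip symmetry: negate everything feeding into hidden unit j,
# and compensate by negating every weight leading out of it.
j = 1
W1_flip, b1_flip, W2_flip = W1.copy(), b1.copy(), W2.copy()
W1_flip[j] *= -1
b1_flip[j] *= -1
W2_flip[:, j] *= -1
y_flip = two_layer_tanh(x, W1_flip, b1_flip, W2_flip, b2)

# Interchange symmetry: reorder the hidden units consistently in both layers.
perm = np.array([2, 0, 3, 1])
y_perm = two_layer_tanh(x, W1[perm], b1[perm], W2[:, perm], b2)

# Overall symmetry factor M! * 2^M for a single layer of hidden units.
n_equivalent = math.factorial(M) * 2**M
```

Both transformed weight sets give the same output as the original for any input `x`, which is the sense in which the weight vectors are equivalent.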
Furthermore, the existence of these symmetries is not a particular property of the ‘tanh’ function but applies to a wide range of activation functions (Kůrková and Kainen, 1994). In many cases, these symmetries in weight space are of little practical consequence, although in Section 5.7 we shall encounter a situation in which we need to take them into account.

5.2. Network Training

So far, we have viewed neural networks as a general class of parametric nonlinear functions from a vector $\mathbf{x}$ of input variables to a vector $\mathbf{y}$ of output variables.
A simple approach to the problem of determining the network parameters is to make an analogy with the discussion of polynomial curve fitting in Section 1.1, and therefore to minimize a sum-of-squares error function. Given a training set comprising a set of input vectors $\{\mathbf{x}_n\}$, where $n = 1, \ldots, N$, together with a corresponding set of target vectors $\{\mathbf{t}_n\}$, we minimize the error function

$$E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N} \|\mathbf{y}(\mathbf{x}_n, \mathbf{w}) - \mathbf{t}_n\|^2. \tag{5.11}$$

However, we can provide a much more general view of network training by first giving a probabilistic interpretation to the network outputs. We have already seen many advantages of using probabilistic predictions in Section 1.5.4. Here it will also provide us with a clearer motivation both for the choice of output unit nonlinearity and the choice of error function.

We start by discussing regression problems, and for the moment we consider a single target variable $t$ that can take any real value.
Following the discussions in Sections 1.2.5 and 3.1, we assume that $t$ has a Gaussian distribution with an $\mathbf{x}$-dependent mean, which is given by the output of the neural network, so that

$$p(t|\mathbf{x}, \mathbf{w}) = \mathcal{N}\left(t \,|\, y(\mathbf{x}, \mathbf{w}), \beta^{-1}\right) \tag{5.12}$$

where $\beta$ is the precision (inverse variance) of the Gaussian noise. Of course this is a somewhat restrictive assumption, and in Section 5.6 we shall see how to extend this approach to allow for more general conditional distributions. For the conditional distribution given by (5.12), it is sufficient to take the output unit activation function to be the identity, because such a network can approximate any continuous function from $\mathbf{x}$ to $y$.
Given a data set of $N$ independent, identically distributed observations $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$, along with corresponding target values $\mathsf{t} = \{t_1, \ldots, t_N\}$, we can construct the corresponding likelihood function

$$p(\mathsf{t}|\mathbf{X}, \mathbf{w}, \beta) = \prod_{n=1}^{N} p(t_n|\mathbf{x}_n, \mathbf{w}, \beta).$$

Taking the negative logarithm, we obtain the error function

$$\frac{\beta}{2}\sum_{n=1}^{N} \{y(\mathbf{x}_n, \mathbf{w}) - t_n\}^2 - \frac{N}{2}\ln\beta + \frac{N}{2}\ln(2\pi) \tag{5.13}$$

which can be used to learn the parameters $\mathbf{w}$ and $\beta$. In Section 5.7, we shall discuss the Bayesian treatment of neural networks, while here we consider a maximum likelihood approach. Note that in the neural networks literature, it is usual to consider the minimization of an error function rather than the maximization of the (log) likelihood, and so here we shall follow this convention.
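As a worked step, the negative logarithm of a single Gaussian factor of the likelihood can be written out explicitly, and summing over the $N$ observations then gives the error function (5.13). This sketch uses only the Gaussian assumption (5.12) and the standard form of the univariate Gaussian density.

```latex
\begin{align*}
-\ln p(t_n|\mathbf{x}_n, \mathbf{w}, \beta)
  &= -\ln\left[ \left(\frac{\beta}{2\pi}\right)^{1/2}
     \exp\left\{ -\frac{\beta}{2}\,\{y(\mathbf{x}_n, \mathbf{w}) - t_n\}^2 \right\} \right] \\
  &= \frac{\beta}{2}\{y(\mathbf{x}_n, \mathbf{w}) - t_n\}^2
     - \frac{1}{2}\ln\beta + \frac{1}{2}\ln(2\pi), \\[4pt]
-\ln p(\mathsf{t}|\mathbf{X}, \mathbf{w}, \beta)
  &= -\sum_{n=1}^{N} \ln p(t_n|\mathbf{x}_n, \mathbf{w}, \beta) \\
  &= \frac{\beta}{2}\sum_{n=1}^{N}\{y(\mathbf{x}_n, \mathbf{w}) - t_n\}^2
     - \frac{N}{2}\ln\beta + \frac{N}{2}\ln(2\pi).
\end{align*}
```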
Consider first the determination of $\mathbf{w}$. Maximizing the likelihood function is equivalent to minimizing the sum-of-squares error function given by

$$E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N} \{y(\mathbf{x}_n, \mathbf{w}) - t_n\}^2 \tag{5.14}$$

where we have discarded additive and multiplicative constants. The value of $\mathbf{w}$ found by minimizing $E(\mathbf{w})$ will be denoted $\mathbf{w}_{\mathrm{ML}}$ because it corresponds to the maximum likelihood solution. In practice, the nonlinearity of the network function $y(\mathbf{x}_n, \mathbf{w})$ causes the error $E(\mathbf{w})$ to be nonconvex, and so local maxima of the likelihood may be found, corresponding to local minima of the error function, as discussed in Section 5.2.1.

Having found $\mathbf{w}_{\mathrm{ML}}$, the value of $\beta$ can be found by minimizing the negative log likelihood to give

$$\frac{1}{\beta_{\mathrm{ML}}} = \frac{1}{N}\sum_{n=1}^{N} \{y(\mathbf{x}_n, \mathbf{w}_{\mathrm{ML}}) - t_n\}^2. \tag{5.15}$$

Note that this can be evaluated once the iterative optimization required to find $\mathbf{w}_{\mathrm{ML}}$ is completed.
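The two-stage procedure above (first minimize the sum-of-squares error over the weights, then read off the precision from the residuals) can be sketched as follows. The function names and the small data arrays are hypothetical. The array `y` stands in for network outputs already evaluated at the maximum likelihood weights, so no network training is performed here.

```python
import numpy as np

def sum_of_squares_error(y, t):
    """Sum-of-squares error (5.14) for outputs y and targets t."""
    return 0.5 * np.sum((y - t) ** 2)

def negative_log_likelihood(y, t, beta):
    """Error function (5.13) under Gaussian noise with precision beta."""
    n = t.size
    return (beta * sum_of_squares_error(y, t)
            - 0.5 * n * np.log(beta)
            + 0.5 * n * np.log(2.0 * np.pi))

def beta_ml(y, t):
    """Maximum likelihood precision (5.15): inverse of the mean squared residual."""
    return 1.0 / np.mean((y - t) ** 2)

# Hypothetical targets and network outputs at the trained weights.
t = np.array([0.9, 2.1, 2.9, 4.2])
y = np.array([1.0, 2.0, 3.0, 4.0])

b = beta_ml(y, t)
nll_at_ml = negative_log_likelihood(y, t, b)
```

Because (5.13) is convex in $\beta$ for fixed outputs, evaluating `negative_log_likelihood` at values slightly above or below `b` gives a larger error, which is a quick sanity check on (5.15).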
If we have multiple target variables, and we assume that they are independent conditional on $\mathbf{x}$ and $\mathbf{w}$ with shared noise precision $\beta$, then the conditional distribution of the target values is given by

$$p(\mathbf{t}|\mathbf{x}, \mathbf{w}) = \mathcal{N}\left(\mathbf{t} \,|\, \mathbf{y}(\mathbf{x}, \mathbf{w}), \beta^{-1}\mathbf{I}\right). \tag{5.16}$$

Following the same argument as for a single target variable, we see that the maximum likelihood weights are determined by minimizing the sum-of-squares error function (5.11) (Exercise 5.2). The noise precision is then given by

$$\frac{1}{\beta_{\mathrm{ML}}} = \frac{1}{NK}\sum_{n=1}^{N} \|\mathbf{y}(\mathbf{x}_n, \mathbf{w}_{\mathrm{ML}}) - \mathbf{t}_n\|^2 \tag{5.17}$$

where $K$ is the number of target variables. The assumption of independence can be dropped at the expense of a slightly more complex optimization problem (Exercise 5.3).

Recall from Section 4.3.6 that there is a natural pairing of the error function (given by the negative log likelihood) and the output unit activation function.
In the regression case, we can view the network as having an output activation function that is the identity, so that $y_k = a_k$. The corresponding sum-of-squares error function has the property

$$\frac{\partial E}{\partial a_k} = y_k - t_k \tag{5.18}$$

which we shall make use of when discussing error backpropagation in Section 5.3.

Now consider the case of binary classification in which we have a single target variable $t$ such that $t = 1$ denotes class $\mathcal{C}_1$ and $t = 0$ denotes class $\mathcal{C}_2$. Following the discussion of canonical link functions in Section 4.3.6, we consider a network having a single output whose activation function is a logistic sigmoid

$$y = \sigma(a) \equiv \frac{1}{1 + \exp(-a)} \tag{5.19}$$

so that $0 \leqslant y(\mathbf{x}, \mathbf{w}) \leqslant 1$. We can interpret $y(\mathbf{x}, \mathbf{w})$ as the conditional probability $p(\mathcal{C}_1|\mathbf{x})$, with $p(\mathcal{C}_2|\mathbf{x})$ given by $1 - y(\mathbf{x}, \mathbf{w})$.
The conditional distribution of targets given inputs is then a Bernoulli distribution of the form

$$p(t|\mathbf{x}, \mathbf{w}) = y(\mathbf{x}, \mathbf{w})^{t} \{1 - y(\mathbf{x}, \mathbf{w})\}^{1-t}. \tag{5.20}$$

If we consider a training set of independent observations, then the error function, which is given by the negative log likelihood, is a cross-entropy error function of the form

$$E(\mathbf{w}) = -\sum_{n=1}^{N} \{t_n \ln y_n + (1 - t_n)\ln(1 - y_n)\} \tag{5.21}$$

where $y_n$ denotes $y(\mathbf{x}_n, \mathbf{w})$. Note that there is no analogue of the noise precision $\beta$ because the target values are assumed to be correctly labelled. However, the model is easily extended to allow for labelling errors (Exercise 5.4).
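The cross-entropy error (5.21) for a logistic sigmoid output can be sketched as below. The function names and the example activations and labels are hypothetical. The sketch also checks by finite differences that the derivative of this error with respect to each output activation $a_n$ equals $y_n - t_n$, the same form as (5.18).

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid (5.19)."""
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(a, t):
    """Cross-entropy error (5.21), written in terms of output activations a_n.

    Note: this direct form is fine for moderate |a|; for very large |a| the
    logarithms can overflow and a numerically stable variant would be needed.
    """
    y = sigmoid(a)
    return -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))

# Hypothetical output activations and binary targets for three data points.
a = np.array([-1.5, 0.3, 2.0])
t = np.array([0.0, 1.0, 1.0])

# Analytic derivative dE/da_n = y_n - t_n.
analytic = sigmoid(a) - t

# Central finite-difference estimate of the same derivative.
eps = 1e-6
numeric = np.zeros_like(a)
for n in range(a.size):
    step = np.zeros_like(a)
    step[n] = eps
    numeric[n] = (cross_entropy(a + step, t) - cross_entropy(a - step, t)) / (2.0 * eps)
```

The agreement between `analytic` and `numeric` illustrates the pairing between the sigmoid output and the cross-entropy error.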
Simard et al. (2003) found that using the cross-entropy error function instead of the sum-of-squares for a classification problem leads to faster training as well as improved generalization.

If we have $K$ separate binary classifications to perform, then we can use a network having $K$ outputs each of which has a logistic sigmoid activation function. Associated with each output is a binary class label $t_k \in \{0, 1\}$, where $k = 1, \ldots, K$. If we assume that the class labels are independent, given the input vector, then the conditional distribution of the targets is

$$p(\mathbf{t}|\mathbf{x}, \mathbf{w}) = \prod_{k=1}^{K} y_k(\mathbf{x}, \mathbf{w})^{t_k} \left[1 - y_k(\mathbf{x}, \mathbf{w})\right]^{1-t_k}. \tag{5.22}$$

Taking the negative logarithm of the corresponding likelihood function then gives the following error function (Exercise 5.5)

$$E(\mathbf{w}) = -\sum_{n=1}^{N}\sum_{k=1}^{K} \{t_{nk} \ln y_{nk} + (1 - t_{nk})\ln(1 - y_{nk})\} \tag{5.23}$$

where $y_{nk}$ denotes $y_k(\mathbf{x}_n, \mathbf{w})$. Again, the derivative of the error function with respect to the activation for a particular output unit takes the form (5.18) just as in the regression case (Exercise 5.6).

It is interesting to contrast the neural network solution to this problem with the corresponding approach based on a linear classification model of the kind discussed in Chapter 4.
Suppose that we are using a standard two-layer network of the kind shown in Figure 5.1. We see that the weight parameters in the first layer of the network are shared between the various outputs, whereas in the linear model each classification problem is solved independently. The first layer of the network can be viewed as performing a nonlinear feature extraction, and the sharing of features between the different outputs can save on computation and can also lead to improved generalization.

Finally, we consider the standard multiclass classification problem in which each input is assigned to one of $K$ mutually exclusive classes. The binary target variables $t_k \in \{0, 1\}$ have a 1-of-$K$ coding scheme indicating the class, and the network outputs are interpreted as $y_k(\mathbf{x}, \mathbf{w}) = p(t_k = 1|\mathbf{x})$, leading to the following error function

$$E(\mathbf{w}) = -\sum_{n=1}^{N}\sum_{k=1}^{K} t_{kn} \ln y_k(\mathbf{x}_n, \mathbf{w}). \tag{5.24}$$
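The multiclass error (5.24) can be evaluated directly once the network outputs are available as class probabilities. The following is a minimal sketch with hypothetical names and numbers. It assumes the outputs are stored as an array `Y` with `Y[n, k]` holding $y_k(\mathbf{x}_n, \mathbf{w})$, and the targets as a 1-of-$K$ array `T` with the same layout. How the network produces normalized probabilities is outside the scope of this sketch.

```python
import numpy as np

def multiclass_cross_entropy(Y, T):
    """Error function (5.24) for 1-of-K targets.

    Y[n, k] is the network output y_k(x_n, w), assumed strictly positive,
    and T[n, k] is 1 for the true class of data point n and 0 otherwise.
    """
    return -np.sum(T * np.log(Y))

# Hypothetical outputs for N = 2 data points and K = 3 classes.
Y = np.array([[0.5, 0.25, 0.25],
              [0.1, 0.8, 0.1]])
# 1-of-K targets: first point is in class 1, second point is in class 2.
T = np.array([[1, 0, 0],
              [0, 1, 0]])

E = multiclass_cross_entropy(Y, T)

# Equivalent form: only the output for the true class of each point contributes.
labels = np.argmax(T, axis=1)
E_indexed = -np.sum(np.log(Y[np.arange(Y.shape[0]), labels]))
```

Because each row of `T` contains a single 1, the double sum reduces to one log-probability term per data point, which is what the indexed form computes.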