Bishop C.M. Pattern Recognition and Machine Learning (2006)
Our goal is to extend this model by making the basis functions $\phi_j(\mathbf{x})$ depend on parameters and then to allow these parameters to be adjusted, along with the coefficients $\{w_j\}$, during training. There are, of course, many ways to construct parametric nonlinear basis functions. Neural networks use basis functions that follow the same form as (5.1), so that each basis function is itself a nonlinear function of a linear combination of the inputs, where the coefficients in the linear combination are adaptive parameters. This leads to the basic neural network model, which can be described as a series of functional transformations.
First we construct $M$ linear combinations of the input variables $x_1, \ldots, x_D$ in the form
$$a_j = \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)} \tag{5.2}$$
where $j = 1, \ldots, M$, and the superscript (1) indicates that the corresponding parameters are in the first 'layer' of the network. We shall refer to the parameters $w_{ji}^{(1)}$ as weights and the parameters $w_{j0}^{(1)}$ as biases, following the nomenclature of Chapter 3. The quantities $a_j$ are known as activations.
Each of them is then transformed using a differentiable, nonlinear activation function $h(\cdot)$ to give
$$z_j = h(a_j). \tag{5.3}$$
These quantities correspond to the outputs of the basis functions in (5.1) that, in the context of neural networks, are called hidden units. The nonlinear functions $h(\cdot)$ are generally chosen to be sigmoidal functions such as the logistic sigmoid or the 'tanh' function. Following (5.1), these values are again linearly combined to give output unit activations
$$a_k = \sum_{j=1}^{M} w_{kj}^{(2)} z_j + w_{k0}^{(2)} \tag{5.4}$$
where $k = 1, \ldots, K$,
and $K$ is the total number of outputs. This transformation corresponds to the second layer of the network, and again the $w_{k0}^{(2)}$ are bias parameters. Finally, the output unit activations are transformed using an appropriate activation function to give a set of network outputs $y_k$. The choice of activation function is determined by the nature of the data and the assumed distribution of target variables and follows the same considerations as for linear models discussed in Chapters 3 and 4. Thus for standard regression problems, the activation function is the identity so that $y_k = a_k$. Similarly, for multiple binary classification problems, each output unit activation is transformed using a logistic sigmoid function so that
$$y_k = \sigma(a_k) \tag{5.5}$$
where
$$\sigma(a) = \frac{1}{1 + \exp(-a)}. \tag{5.6}$$
Finally, for multiclass problems, a softmax activation function of the form (4.62) is used. The choice of output unit activation function is discussed in detail in Section 5.2.

[Figure 5.1: Network diagram for the two-layer neural network corresponding to (5.7). The input, hidden, and output variables are represented by nodes, and the weight parameters are represented by links between the nodes, in which the bias parameters are denoted by links coming from additional input and hidden variables $x_0$ and $z_0$. Arrows denote the direction of information flow through the network during forward propagation.]

We can combine these various stages to give the overall network function that, for sigmoidal output unit activation functions, takes the form
$$y_k(\mathbf{x}, \mathbf{w}) = \sigma\!\left( \sum_{j=1}^{M} w_{kj}^{(2)} \, h\!\left( \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)} \right) + w_{k0}^{(2)} \right) \tag{5.7}$$
where the set of all weight and bias parameters have been grouped together into a vector $\mathbf{w}$.
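To make the flow of computation in (5.2)–(5.7) concrete, the following is a minimal NumPy sketch of the forward pass of this two-layer network; the array names (`W1`, `b1`, `W2`, `b2`) and the use of 'tanh' hidden units with logistic-sigmoid outputs are illustrative choices, not something fixed by the text.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid, equation (5.6)."""
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W1, b1, W2, b2):
    """Forward propagation for the two-layer network of (5.7).

    x  : input vector, shape (D,)
    W1 : first-layer weights w_ji^(1), shape (M, D)
    b1 : first-layer biases  w_j0^(1), shape (M,)
    W2 : second-layer weights w_kj^(2), shape (K, M)
    b2 : second-layer biases  w_k0^(2), shape (K,)
    """
    a = W1 @ x + b1          # first-layer activations, equation (5.2)
    z = np.tanh(a)           # hidden-unit outputs, equation (5.3)
    a_out = W2 @ z + b2      # output-unit activations, equation (5.4)
    return sigmoid(a_out)    # network outputs y_k, equations (5.5)-(5.7)

# Example: D = 3 inputs, M = 4 hidden units, K = 2 outputs.
rng = np.random.default_rng(0)
D, M, K = 3, 4, 2
W1, b1 = rng.standard_normal((M, D)), np.zeros(M)
W2, b2 = rng.standard_normal((K, M)), np.zeros(K)
y = forward(rng.standard_normal(D), W1, b1, W2, b2)   # shape (K,)
```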
Thus the neural network model is simply a nonlinear function from a set of input variables $\{x_i\}$ to a set of output variables $\{y_k\}$ controlled by a vector $\mathbf{w}$ of adjustable parameters. This function can be represented in the form of a network diagram as shown in Figure 5.1. The process of evaluating (5.7) can then be interpreted as a forward propagation of information through the network. It should be emphasized that these diagrams do not represent probabilistic graphical models of the kind to be considered in Chapter 8 because the internal nodes represent deterministic variables rather than stochastic ones.
For this reason, we have adopted a slightly different graphical notation for the two kinds of model. We shall see later how to give a probabilistic interpretation to a neural network.

As discussed in Section 3.1, the bias parameters in (5.2) can be absorbed into the set of weight parameters by defining an additional input variable $x_0$ whose value is clamped at $x_0 = 1$, so that (5.2) takes the form
$$a_j = \sum_{i=0}^{D} w_{ji}^{(1)} x_i. \tag{5.8}$$
We can similarly absorb the second-layer biases into the second-layer weights, so that the overall network function becomes
$$y_k(\mathbf{x}, \mathbf{w}) = \sigma\!\left( \sum_{j=0}^{M} w_{kj}^{(2)} \, h\!\left( \sum_{i=0}^{D} w_{ji}^{(1)} x_i \right) \right). \tag{5.9}$$
As can be seen from Figure 5.1, the neural network model comprises two stages of processing, each of which resembles the perceptron model of Section 4.1.7, and for this reason the neural network is also known as the multilayer perceptron, or MLP.
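The bias-absorption step in (5.8) and (5.9) amounts to prepending a constant $1$ to the input vector and a corresponding bias column to the weight matrix. The following is a small sketch, assuming NumPy; the helper name `absorb_bias` is invented for the example, and the final check simply confirms agreement with the explicit weight-plus-bias form.

```python
import numpy as np

def absorb_bias(W, b):
    """Return an augmented weight matrix W_tilde such that
    W_tilde @ [1, x] == W @ x + b, as in equation (5.8)."""
    return np.hstack([b[:, None], W])

rng = np.random.default_rng(1)
D, M = 3, 4
W1, b1 = rng.standard_normal((M, D)), rng.standard_normal(M)
x = rng.standard_normal(D)

x_aug = np.concatenate([[1.0], x])         # clamp x_0 = 1
a_explicit = W1 @ x + b1                   # equation (5.2)
a_absorbed = absorb_bias(W1, b1) @ x_aug   # equation (5.8)
assert np.allclose(a_explicit, a_absorbed)
```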
A key difference compared to the perceptron, however, is that the neural network uses continuous sigmoidal nonlinearities in the hidden units, whereas the perceptron uses step-function nonlinearities. This means that the neural network function is differentiable with respect to the network parameters, and this property will play a central role in network training.

If the activation functions of all the hidden units in a network are taken to be linear, then for any such network we can always find an equivalent network without hidden units.
This follows from the fact that the composition of successive linear transformations is itself a linear transformation. However, if the number of hidden units is smaller than either the number of input or output units, then the transformations that the network can generate are not the most general possible linear transformations from inputs to outputs because information is lost in the dimensionality reduction at the hidden units. In Section 12.4.2, we show that networks of linear units give rise to principal component analysis. In general, however, there is little interest in multilayer networks of linear units.

The network architecture shown in Figure 5.1 is the most commonly used one in practice. However, it is easily generalized, for instance by considering additional layers of processing each consisting of a weighted linear combination of the form (5.4) followed by an element-wise transformation using a nonlinear activation function.
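In code, this generalization to additional layers amounts to iterating the combination of (5.4) with an element-wise nonlinearity. The following is a minimal sketch, assuming NumPy; the layer sizes and the choice of 'tanh' are illustrative only.

```python
import numpy as np

def forward_deep(x, layers, h=np.tanh):
    """Evaluate a feed-forward network with an arbitrary number of layers.

    layers : list of (W, b) pairs; each layer computes a weighted linear
             combination of the form (5.4) followed by an element-wise
             nonlinearity h, except for the final (linear-output) layer.
    """
    z = x
    for W, b in layers[:-1]:
        z = h(W @ z + b)      # hidden layer: linear combination + nonlinearity
    W, b = layers[-1]
    return W @ z + b          # linear output units

# Example: three layers of adaptive weights, with 2 inputs and 1 output.
rng = np.random.default_rng(2)
sizes = [2, 5, 4, 1]
layers = [(rng.standard_normal((m, n)), np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]
y = forward_deep(rng.standard_normal(2), layers)
```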
Note that there is some confusion in the literature regarding the terminology for counting the number of layers in such networks. Thus the network in Figure 5.1 may be described as a 3-layer network (which counts the number of layers of units, and treats the inputs as units) or sometimes as a single-hidden-layer network (which counts the number of layers of hidden units). We recommend a terminology in which Figure 5.1 is called a two-layer network, because it is the number of layers of adaptive weights that is important for determining the network properties.

Another generalization of the network architecture is to include skip-layer connections, each of which is associated with a corresponding adaptive parameter.
For instance, in a two-layer network these would go directly from inputs to outputs. In principle, a network with sigmoidal hidden units can always mimic skip-layer connections (for bounded input values) by using a sufficiently small first-layer weight that, over its operating range, the hidden unit is effectively linear, and then compensating with a large weight value from the hidden unit to the output. In practice, however, it may be advantageous to include skip-layer connections explicitly. Furthermore, the network can be sparse, with not all possible connections within a layer being present.

[Figure 5.2: Example of a neural network having a general feed-forward topology. Note that each hidden and output unit has an associated bias parameter (omitted for clarity).]
We shall see an example of a sparse network architecture when we consider convolutional neural networks in Section 5.5.6.

Because there is a direct correspondence between a network diagram and its mathematical function, we can develop more general network mappings by considering more complex network diagrams. However, these must be restricted to a feed-forward architecture, in other words to one having no closed directed cycles, to ensure that the outputs are deterministic functions of the inputs. This is illustrated with a simple example in Figure 5.2. Each (hidden or output) unit in such a network computes a function given by
$$z_k = h\!\left( \sum_j w_{kj} z_j \right) \tag{5.10}$$
where the sum runs over all units that send connections to unit $k$ (and a bias parameter is included in the summation).
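As a concrete illustration of (5.10), the sketch below (assuming NumPy; the graph structure, unit names, and weight values are invented for the example) evaluates the units of a general feed-forward network by visiting them in an order consistent with the directed connections, which is possible precisely because the graph has no closed directed cycles.

```python
import numpy as np

def evaluate(inputs, units, h=np.tanh):
    """Evaluate a general feed-forward network via equation (5.10).

    inputs : dict mapping input-unit names to their clamped values
    units  : list of (name, incoming) pairs listed in a feed-forward order,
             where incoming maps predecessor names to weights and the key
             'bias' holds that unit's bias parameter.
    """
    z = dict(inputs)                    # activations computed so far
    for name, incoming in units:
        a = incoming.get('bias', 0.0)
        a += sum(w * z[src] for src, w in incoming.items() if src != 'bias')
        z[name] = h(a)                  # equation (5.10)
    return z

# A small sparse topology in the spirit of Figure 5.2:
# inputs x1, x2; hidden units z1, z2, z3; outputs y1, y2.
units = [
    ('z1', {'x1': 0.5, 'x2': -0.3, 'bias': 0.1}),
    ('z2', {'x2': 0.8, 'bias': -0.2}),
    ('z3', {'z1': 1.2, 'z2': -0.7, 'bias': 0.0}),
    ('y1', {'z1': 0.4, 'z3': 0.9, 'bias': 0.3}),
    ('y2', {'z2': -1.1, 'z3': 0.6, 'bias': 0.0}),
]
print(evaluate({'x1': 1.0, 'x2': -2.0}, units))
```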
For a given set of values applied to the inputs of the network, successive application of (5.10) allows the activations of all units in the network to be evaluated, including those of the output units.

The approximation properties of feed-forward networks have been widely studied (Funahashi, 1989; Cybenko, 1989; Hornik et al., 1989; Stinchecombe and White, 1989; Cotter, 1990; Ito, 1991; Hornik, 1991; Kreinovich, 1991; Ripley, 1996) and found to be very general.
Neural networks are therefore said to be universal approximators. For example, a two-layer network with linear outputs can uniformly approximate any continuous function on a compact input domain to arbitrary accuracy provided the network has a sufficiently large number of hidden units. This result holds for a wide range of hidden unit activation functions, but excluding polynomials. Although such theorems are reassuring, the key problem is how to find suitable parameter values given a set of training data, and in later sections of this chapter we will show that there exist effective solutions to this problem based on both maximum likelihood and Bayesian approaches.

The capability of a two-layer network to model a broad range of functions is illustrated in Figure 5.3.

[Figure 5.3: Illustration of the capability of a multilayer perceptron to approximate four different functions comprising (a) $f(x) = x^2$, (b) $f(x) = \sin(x)$, (c) $f(x) = |x|$, and (d) $f(x) = H(x)$ where $H(x)$ is the Heaviside step function. In each case, $N = 50$ data points, shown as blue dots, have been sampled uniformly in $x$ over the interval $(-1, 1)$ and the corresponding values of $f(x)$ evaluated. These data points are then used to train a two-layer network having 3 hidden units with 'tanh' activation functions and linear output units. The resulting network functions are shown by the red curves, and the outputs of the three hidden units are shown by the three dashed curves.]
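A small experiment in the spirit of Figure 5.3 can be sketched in a few lines of NumPy. The code below is illustrative only: it fits a two-layer network with 3 'tanh' hidden units and a linear output to $N = 50$ samples of $f(x) = \sin(x)$ on $(-1, 1)$ by plain gradient descent on a sum-of-squares error (error functions and gradient computation are discussed later in this chapter); the learning rate and iteration count are arbitrary choices, not the procedure used to produce the figure.

```python
import numpy as np

rng = np.random.default_rng(0)

# N = 50 points sampled uniformly in (-1, 1), targets f(x) = sin(x)
N, M = 50, 3
x = rng.uniform(-1.0, 1.0, size=(N, 1))
t = np.sin(x)

# Two-layer network: 1 input, M tanh hidden units, 1 linear output unit
W1 = rng.standard_normal((1, M)); b1 = np.zeros(M)
W2 = rng.standard_normal((M, 1)); b2 = np.zeros(1)

eta = 0.05
for _ in range(20000):
    # Forward pass (equations 5.2-5.4 with identity output activation)
    a = x @ W1 + b1              # (N, M) first-layer activations
    z = np.tanh(a)               # (N, M) hidden-unit outputs
    y = z @ W2 + b2              # (N, 1) network outputs

    # Gradients of the sum-of-squares error E = 0.5 * sum((y - t)^2)
    delta_out = y - t                            # (N, 1)
    delta_hid = (delta_out @ W2.T) * (1 - z**2)  # (N, M), tanh derivative

    W2 -= eta * (z.T @ delta_out) / N
    b2 -= eta * delta_out.mean(axis=0)
    W1 -= eta * (x.T @ delta_hid) / N
    b1 -= eta * delta_hid.mean(axis=0)

print("final RMS error:", np.sqrt(np.mean((y - t) ** 2)))
```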