Bishop C.M., Pattern Recognition and Machine Learning (2006)
The perceptron learning algorithm is illustrated in Figure 4.7.

If we consider the effect of a single update in the perceptron learning algorithm, we see that the contribution to the error from a misclassified pattern will be reduced because from (4.55) we have

$$
-\mathbf{w}^{(\tau+1)\mathrm{T}}\boldsymbol{\phi}_n t_n
= -\mathbf{w}^{(\tau)\mathrm{T}}\boldsymbol{\phi}_n t_n - (\boldsymbol{\phi}_n t_n)^{\mathrm{T}}\boldsymbol{\phi}_n t_n
< -\mathbf{w}^{(\tau)\mathrm{T}}\boldsymbol{\phi}_n t_n
\tag{4.56}
$$

where we have set $\eta = 1$, and made use of $\|\boldsymbol{\phi}_n t_n\|^2 > 0$. Of course, this does not imply that the contribution to the error function from the other misclassified patterns will have been reduced, as the short check below also makes clear.
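As a quick numerical illustration of (4.56), the following sketch (with arbitrary illustrative values for $\mathbf{w}^{(\tau)}$, $\boldsymbol{\phi}_n$ and $t_n$, not taken from the text) confirms that a single update with $\eta = 1$ reduces the error contribution of the pattern used for the update by exactly $\|\boldsymbol{\phi}_n t_n\|^2$.

```python
import numpy as np

# Illustrative (assumed) values: a pattern misclassified under the current weights.
w_old = np.array([0.5, -0.3, 0.2])   # current weight vector w^(tau)
phi_n = np.array([1.0, 2.0, -1.0])   # basis-function vector phi(x_n)
t_n = 1.0                            # target coded as +1 / -1

assert w_old @ phi_n * t_n < 0       # misclassified under w^(tau)

w_new = w_old + phi_n * t_n          # the update (4.55) with eta = 1
before = -(w_old @ phi_n) * t_n      # this pattern's contribution to the criterion
after = -(w_new @ phi_n) * t_n
assert np.isclose(after, before - phi_n @ phi_n)  # the equality in (4.56)
assert after < before                             # strict decrease
```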
Furthermore, the change in weight vector may have caused some previously correctly classified patterns to become misclassified. Thus the perceptron learning rule is not guaranteed to reduce the total error function at each stage.

However, the perceptron convergence theorem states that if there exists an exact solution (in other words, if the training data set is linearly separable), then the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of steps.
Proofs of this theorem can be found for example in Rosenblatt (1962), Block (1962), Nilsson (1965), Minsky and Papert (1969), Hertz et al. (1991), and Bishop (1995a). Note, however, that the number of steps required to achieve convergence could still be substantial, and in practice, until convergence is achieved, we will not be able to distinguish between a nonseparable problem and one that is simply slow to converge.

Even when the data set is linearly separable, there may be many solutions, and which one is found will depend on the initialization of the parameters and on the order of presentation of the data points. Furthermore, for data sets that are not linearly separable, the perceptron learning algorithm will never converge.
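The following is a minimal sketch of the sequential perceptron algorithm described above, using NumPy and a small hand-constructed linearly separable toy set; the function name, the data, and the maximum-epoch cap are illustrative choices, not part of the original text.

```python
import numpy as np

def perceptron_train(Phi, t, eta=1.0, max_epochs=1000):
    """Sequential perceptron learning (4.55): cycle through the patterns and,
    whenever phi_n is misclassified, add eta * phi_n * t_n to the weights.
    Targets t are coded as +1/-1; Phi already includes a bias feature."""
    w = np.zeros(Phi.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for phi_n, t_n in zip(Phi, t):
            if w @ phi_n * t_n <= 0:       # misclassified (or on the boundary)
                w += eta * phi_n * t_n
                errors += 1
        if errors == 0:                    # every pattern correctly classified
            return w
    return w                               # may not have converged

# Toy linearly separable data (assumed for illustration): identity basis
# functions plus a constant bias feature.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -0.5], [-2.0, -1.5]])
t = np.array([1.0, 1.0, -1.0, -1.0])
Phi = np.hstack([np.ones((X.shape[0], 1)), X])
w = perceptron_train(Phi, t)
print(np.sign(Phi @ w))   # matches t for this separable toy set
```

On a data set that is not linearly separable the loop simply exhausts max_epochs without the error count reaching zero, which reflects the point made above that the algorithm never converges in that case.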
Figure 4.7 Illustration of the convergence of the perceptron learning algorithm, showing data points from two classes (red and blue) in a two-dimensional feature space $(\phi_1, \phi_2)$. The top left plot shows the initial parameter vector $\mathbf{w}$ shown as a black arrow together with the corresponding decision boundary (black line), in which the arrow points towards the decision region which is classified as belonging to the red class. The data point circled in green is misclassified and so its feature vector is added to the current weight vector, giving the new decision boundary shown in the top right plot.
The bottom left plot shows the next misclassified point to be considered, indicated by the green circle, and its feature vector is again added to the weight vector, giving the decision boundary shown in the bottom right plot for which all data points are correctly classified.

Figure 4.8 Illustration of the Mark 1 perceptron hardware. The photograph on the left shows how the inputs were obtained using a simple camera system in which an input scene, in this case a printed character, was illuminated by powerful lights, and an image focussed onto a 20 × 20 array of cadmium sulphide photocells, giving a primitive 400 pixel image. The perceptron also had a patch board, shown in the middle photograph, which allowed different configurations of input features to be tried.
Often these were wired up at random to demonstrate the ability of the perceptron to learn without the need for precise wiring, in contrast to a modern digital computer. The photograph on the right shows one of the racks of adaptive weights. Each weight was implemented using a rotary variable resistor, also called a potentiometer, driven by an electric motor, thereby allowing the value of the weight to be adjusted automatically by the learning algorithm.

Aside from difficulties with the learning algorithm, the perceptron does not provide probabilistic outputs, nor does it generalize readily to $K > 2$ classes. The most important limitation, however, arises from the fact that (in common with all of the models discussed in this chapter and the previous one) it is based on linear combinations of fixed basis functions.
More detailed discussions of the limitations of perceptrons can be found in Minsky and Papert (1969) and Bishop (1995a).

Analogue hardware implementations of the perceptron were built by Rosenblatt, based on motor-driven variable resistors to implement the adaptive parameters $w_j$. These are illustrated in Figure 4.8.
The inputs were obtained from a simple camera system based on an array of photo-sensors, while the basis functions $\boldsymbol{\phi}$ could be chosen in a variety of ways, for example based on simple fixed functions of randomly chosen subsets of pixels from the input image. Typical applications involved learning to discriminate simple shapes or characters.

At the same time that the perceptron was being developed, a closely related system called the adaline, which is short for 'adaptive linear element', was being explored by Widrow and co-workers. The functional form of the model was the same as for the perceptron, but a different approach to training was adopted (Widrow and Hoff, 1960; Widrow and Lehr, 1990).
4.2. Probabilistic Generative Models

We turn next to a probabilistic view of classification and show how models with linear decision boundaries arise from simple assumptions about the distribution of the data. In Section 1.5.4, we discussed the distinction between the discriminative and the generative approaches to classification. Here we shall adopt a generative approach in which we model the class-conditional densities $p(\mathbf{x}|\mathcal{C}_k)$, as well as the class priors $p(\mathcal{C}_k)$, and then use these to compute posterior probabilities $p(\mathcal{C}_k|\mathbf{x})$ through Bayes' theorem.

Figure 4.9 Plot of the logistic sigmoid function $\sigma(a)$ defined by (4.59), shown in red, together with the scaled probit function $\Phi(\lambda a)$, for $\lambda^2 = \pi/8$, shown in dashed blue, where $\Phi(a)$ is defined by (4.114). The scaling factor $\pi/8$ is chosen so that the derivatives of the two curves are equal for $a = 0$.

Consider first of all the case of two classes.
The posterior probability for class $\mathcal{C}_1$ can be written as

$$
p(\mathcal{C}_1|\mathbf{x})
= \frac{p(\mathbf{x}|\mathcal{C}_1)p(\mathcal{C}_1)}{p(\mathbf{x}|\mathcal{C}_1)p(\mathcal{C}_1) + p(\mathbf{x}|\mathcal{C}_2)p(\mathcal{C}_2)}
= \frac{1}{1 + \exp(-a)} = \sigma(a)
\tag{4.57}
$$

where we have defined

$$
a = \ln\frac{p(\mathbf{x}|\mathcal{C}_1)p(\mathcal{C}_1)}{p(\mathbf{x}|\mathcal{C}_2)p(\mathcal{C}_2)}
\tag{4.58}
$$

and $\sigma(a)$ is the logistic sigmoid function defined by

$$
\sigma(a) = \frac{1}{1 + \exp(-a)}
\tag{4.59}
$$

which is plotted in Figure 4.9. The term 'sigmoid' means S-shaped. This type of function is sometimes also called a 'squashing function' because it maps the whole real axis into a finite interval.
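As a small numeric check of the equivalence expressed in (4.57) and (4.58), the sketch below (with arbitrary assumed values for the class-conditional densities and priors at a single point $\mathbf{x}$) confirms that the Bayes ratio and the sigmoid of the log ratio give the same posterior.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))   # (4.59)

# Assumed illustrative values of p(x|C1), p(x|C2) at some x, and of the priors.
px_c1, px_c2 = 0.4, 0.1
p_c1, p_c2 = 0.3, 0.7

posterior_bayes = px_c1 * p_c1 / (px_c1 * p_c1 + px_c2 * p_c2)  # first line of (4.57)
a = np.log((px_c1 * p_c1) / (px_c2 * p_c2))                     # (4.58)
assert np.isclose(posterior_bayes, sigmoid(a))                  # sigmoid form of (4.57)
```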
The logistic sigmoid has been encountered already in earlier chapters and plays an important role in many classification algorithms. It satisfies the following symmetry property

$$
\sigma(-a) = 1 - \sigma(a)
\tag{4.60}
$$

as is easily verified. The inverse of the logistic sigmoid is given by

$$
a = \ln\left(\frac{\sigma}{1 - \sigma}\right)
\tag{4.61}
$$

and is known as the logit function. It represents the log of the ratio of probabilities $\ln\left[p(\mathcal{C}_1|\mathbf{x})/p(\mathcal{C}_2|\mathbf{x})\right]$ for the two classes, also known as the log odds.
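The symmetry property (4.60) and the inverse relationship (4.61) are easy to confirm numerically; the following short sketch does so at a handful of arbitrary test points.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))   # (4.59)

def logit(s):
    return np.log(s / (1.0 - s))      # (4.61), the inverse of the logistic sigmoid

a = np.linspace(-5.0, 5.0, 11)        # arbitrary test points
assert np.allclose(sigmoid(-a), 1.0 - sigmoid(a))   # symmetry (4.60)
assert np.allclose(logit(sigmoid(a)), a)            # logit inverts the sigmoid
```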
Note that in (4.57) we have simply rewritten the posterior probabilities in an equivalent form, and so the appearance of the logistic sigmoid may seem rather vacuous. However, it will have significance provided $a(\mathbf{x})$ takes a simple functional form. We shall shortly consider situations in which $a(\mathbf{x})$ is a linear function of $\mathbf{x}$, in which case the posterior probability is governed by a generalized linear model.

For the case of $K > 2$ classes, we have

$$
p(\mathcal{C}_k|\mathbf{x})
= \frac{p(\mathbf{x}|\mathcal{C}_k)p(\mathcal{C}_k)}{\sum_j p(\mathbf{x}|\mathcal{C}_j)p(\mathcal{C}_j)}
= \frac{\exp(a_k)}{\sum_j \exp(a_j)}
\tag{4.62}
$$

which is known as the normalized exponential and can be regarded as a multiclass generalization of the logistic sigmoid.
Here the quantities $a_k$ are defined by

$$
a_k = \ln p(\mathbf{x}|\mathcal{C}_k)p(\mathcal{C}_k).
\tag{4.63}
$$

The normalized exponential is also known as the softmax function, as it represents a smoothed version of the 'max' function because, if $a_k \gg a_j$ for all $j \neq k$, then $p(\mathcal{C}_k|\mathbf{x}) \simeq 1$, and $p(\mathcal{C}_j|\mathbf{x}) \simeq 0$.
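Below is a minimal sketch of the normalized exponential (4.62) with $a_k$ given by (4.63); the joint values $p(\mathbf{x}|\mathcal{C}_k)p(\mathcal{C}_k)$ are assumed illustrative numbers, and subtracting the maximum before exponentiating is a standard numerical-stability device rather than something required by the formula.

```python
import numpy as np

def softmax(a):
    """Normalized exponential (4.62); subtracting the max leaves the result unchanged."""
    e = np.exp(a - np.max(a))
    return e / e.sum()

# With a_k = ln p(x|C_k)p(C_k), the softmax of the a_k reproduces Bayes' theorem.
joint = np.array([0.12, 0.07, 0.01])        # assumed values of p(x|C_k)p(C_k)
a = np.log(joint)                           # (4.63)
assert np.allclose(softmax(a), joint / joint.sum())

# 'Soft max': if one a_k dominates, the corresponding posterior approaches 1.
print(softmax(np.array([10.0, 1.0, 0.0])))  # roughly [0.9998, 0.0001, 0.0000]
```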
We now investigate the consequences of choosing specific forms for the class-conditional densities, looking first at continuous input variables $\mathbf{x}$ and then discussing briefly the case of discrete inputs.

4.2.1 Continuous inputs

Let us assume that the class-conditional densities are Gaussian and then explore the resulting form for the posterior probabilities. To start with, we shall assume that all classes share the same covariance matrix. Thus the density for class $\mathcal{C}_k$ is given by

$$
p(\mathbf{x}|\mathcal{C}_k)
= \frac{1}{(2\pi)^{D/2}}\frac{1}{|\boldsymbol{\Sigma}|^{1/2}}
\exp\left\{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_k)^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}_k)\right\}.
\tag{4.64}
$$

Consider first the case of two classes. From (4.57) and (4.58), we have

$$
p(\mathcal{C}_1|\mathbf{x}) = \sigma(\mathbf{w}^{\mathrm{T}}\mathbf{x} + w_0)
\tag{4.65}
$$

where we have defined

$$
\mathbf{w} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)
\tag{4.66}
$$

$$
w_0 = -\frac{1}{2}\boldsymbol{\mu}_1^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_1
+ \frac{1}{2}\boldsymbol{\mu}_2^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_2
+ \ln\frac{p(\mathcal{C}_1)}{p(\mathcal{C}_2)}.
\tag{4.67}
$$

We see that the quadratic terms in $\mathbf{x}$ from the exponents of the Gaussian densities have cancelled (due to the assumption of common covariance matrices) leading to a linear function of $\mathbf{x}$ in the argument of the logistic sigmoid.
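The following sketch evaluates (4.65) to (4.67) for assumed illustrative means, shared covariance and priors, and cross-checks the resulting sigmoid posterior against a direct application of Bayes' theorem using the Gaussian density (4.64).

```python
import numpy as np

# Assumed illustrative parameters: two Gaussian classes with a shared covariance.
mu1 = np.array([1.0, 1.0])
mu2 = np.array([-1.0, -0.5])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
p1, p2 = 0.6, 0.4                      # class priors p(C1), p(C2)

Sigma_inv = np.linalg.inv(Sigma)
w = Sigma_inv @ (mu1 - mu2)                                    # (4.66)
w0 = (-0.5 * (mu1 @ Sigma_inv @ mu1)
      + 0.5 * (mu2 @ Sigma_inv @ mu2)
      + np.log(p1 / p2))                                       # (4.67)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))                            # (4.59)

def gauss(x, mu):
    """Gaussian density (4.64) with the shared covariance Sigma."""
    d = x - mu
    norm = (2.0 * np.pi) ** (len(mu) / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * d @ Sigma_inv @ d) / norm

x = np.array([0.2, -0.1])
posterior_linear = sigmoid(w @ x + w0)                         # (4.65)
num = gauss(x, mu1) * p1
posterior_bayes = num / (num + gauss(x, mu2) * p2)             # two-class Bayes (4.57)
assert np.isclose(posterior_linear, posterior_bayes)
```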
This result is illustrated for the case of a two-dimensional input space $\mathbf{x}$ in Figure 4.10.

Figure 4.10 The left-hand plot shows the class-conditional densities for two classes, denoted red and blue. On the right is the corresponding posterior probability $p(\mathcal{C}_1|\mathbf{x})$, which is given by a logistic sigmoid of a linear function of $\mathbf{x}$. The surface in the right-hand plot is coloured using a proportion of red ink given by $p(\mathcal{C}_1|\mathbf{x})$ and a proportion of blue ink given by $p(\mathcal{C}_2|\mathbf{x}) = 1 - p(\mathcal{C}_1|\mathbf{x})$.

The resulting decision boundaries correspond to surfaces along which the posterior probabilities $p(\mathcal{C}_k|\mathbf{x})$ are constant and so will be given by linear functions of $\mathbf{x}$, and therefore the decision boundaries are linear in input space.
The prior probabilities $p(\mathcal{C}_k)$ enter only through the bias parameter $w_0$ so that changes in the priors have the effect of making parallel shifts of the decision boundary and, more generally, of the parallel contours of constant posterior probability.

For the general case of $K$ classes we have, from (4.62) and (4.63),

$$
a_k(\mathbf{x}) = \mathbf{w}_k^{\mathrm{T}}\mathbf{x} + w_{k0}
\tag{4.68}
$$

where we have defined

$$
\mathbf{w}_k = \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k
\tag{4.69}
$$

$$
w_{k0} = -\frac{1}{2}\boldsymbol{\mu}_k^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k + \ln p(\mathcal{C}_k).
\tag{4.70}
$$

We see that the $a_k(\mathbf{x})$ are again linear functions of $\mathbf{x}$ as a consequence of the cancellation of the quadratic terms due to the shared covariances. The resulting decision boundaries, corresponding to the minimum misclassification rate, will occur when two of the posterior probabilities (the two largest) are equal, and so will be defined by linear functions of $\mathbf{x}$, and so again we have a generalized linear model.

If we relax the assumption of a shared covariance matrix and allow each class-conditional density $p(\mathbf{x}|\mathcal{C}_k)$ to have its own covariance matrix $\boldsymbol{\Sigma}_k$, then the earlier cancellations will no longer occur, and we will obtain quadratic functions of $\mathbf{x}$, giving rise to a quadratic discriminant.
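As a closing illustration of (4.68) to (4.70), the sketch below (again with assumed illustrative parameters) builds the linear functions $a_k(\mathbf{x})$ for three classes sharing one covariance matrix and verifies that their softmax matches the posteriors obtained directly from Bayes' theorem; the terms common to all classes, including the quadratic term in $\mathbf{x}$, cancel inside the softmax.

```python
import numpy as np

# Assumed illustrative parameters for K = 3 Gaussian classes with one shared covariance.
mus = np.array([[1.0, 1.0], [-1.0, 0.5], [0.0, -1.5]])   # class means mu_k
priors = np.array([0.5, 0.3, 0.2])                        # class priors p(C_k)
Sigma = np.array([[1.0, 0.2], [0.2, 1.5]])
Sigma_inv = np.linalg.inv(Sigma)

W = mus @ Sigma_inv                                       # row k is w_k^T          (4.69)
w0 = -0.5 * np.einsum('kd,de,ke->k', mus, Sigma_inv, mus) + np.log(priors)        # (4.70)

def softmax(a):
    e = np.exp(a - np.max(a))                             # (4.62), stabilized
    return e / e.sum()

def posteriors(x):
    return softmax(W @ x + w0)                            # a_k(x) = w_k^T x + w_k0  (4.68)

def gauss(x, mu):
    """Gaussian density (4.64) with the shared covariance Sigma."""
    d = x - mu
    norm = (2.0 * np.pi) ** (len(mu) / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * d @ Sigma_inv @ d) / norm

x = np.array([0.3, -0.2])
joint = np.array([gauss(x, mu) * p for mu, p in zip(mus, priors)])
assert np.allclose(posteriors(x), joint / joint.sum())    # agrees with Bayes' theorem
```

Replacing the single Sigma by per-class matrices Sigma_k would reinstate the quadratic term $-\frac{1}{2}\mathbf{x}^{\mathrm{T}}\boldsymbol{\Sigma}_k^{-1}\mathbf{x}$, which differs across classes and no longer cancels, giving the quadratic discriminant mentioned above.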