Bishop C.M., Pattern Recognition and Machine Learning (2006), Section 1.2: Probability Theory (excerpt)
Let us suppose that in so doing we pick the red box 40% of the time and we pick the blue box 60% of the time, and that when we remove an item of fruit from a box we are equally likely to select any of the pieces of fruit in the box.

In this example, the identity of the box that will be chosen is a random variable, which we shall denote by $B$. This random variable can take one of two possible values, namely $r$ (corresponding to the red box) or $b$ (corresponding to the blue box). Similarly, the identity of the fruit is also a random variable and will be denoted by $F$.
It can take either of the values $a$ (for apple) or $o$ (for orange).

To begin with, we shall define the probability of an event to be the fraction of times that event occurs out of the total number of trials, in the limit that the total number of trials goes to infinity. Thus the probability of selecting the red box is $4/10$ and the probability of selecting the blue box is $6/10$. We write these probabilities as $p(B = r) = 4/10$ and $p(B = b) = 6/10$.

[Figure 1.9: We use a simple example of two coloured boxes each containing fruit (apples shown in green and oranges shown in orange) to introduce the basic ideas of probability.]

[Figure 1.10: We can derive the sum and product rules of probability by considering two random variables, $X$, which takes the values $\{x_i\}$ where $i = 1, \ldots, M$, and $Y$, which takes the values $\{y_j\}$ where $j = 1, \ldots, L$. In this illustration we have $M = 5$ and $L = 3$. If we consider a total number $N$ of instances of these variables, then we denote the number of instances where $X = x_i$ and $Y = y_j$ by $n_{ij}$, which is the number of points in the corresponding cell of the array. The number of points in column $i$, corresponding to $X = x_i$, is denoted by $c_i$, and the number of points in row $j$, corresponding to $Y = y_j$, is denoted by $r_j$.]
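To make this limiting-frequency definition concrete, here is a minimal simulation sketch (our own illustration, not from the book, assuming Python with NumPy): the empirical fraction of red-box picks approaches $4/10$ as the number of trials grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pick a box on each trial: 'r' with probability 4/10, 'b' with probability 6/10.
for n_trials in (100, 10_000, 1_000_000):
    boxes = rng.choice(["r", "b"], size=n_trials, p=[0.4, 0.6])
    frac_red = np.mean(boxes == "r")
    print(f"N = {n_trials:>9}: fraction of red-box picks = {frac_red:.4f}")
```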
Note that, by definition, probabilities must lie in the interval $[0, 1]$. Also, if the events are mutually exclusive and if they include all possible outcomes (for instance, in this example the box must be either red or blue), then we see that the probabilities for those events must sum to one.

We can now ask questions such as: “what is the overall probability that the selection procedure will pick an apple?”, or “given that we have chosen an orange, what is the probability that the box we chose was the blue one?”. We can answer questions such as these, and indeed much more complex questions associated with problems in pattern recognition, once we have equipped ourselves with the two elementary rules of probability, known as the sum rule and the product rule.
Having obtained these rules, we shall then return to our boxes of fruit example.

In order to derive the rules of probability, consider the slightly more general example shown in Figure 1.10 involving two random variables $X$ and $Y$ (which could for instance be the Box and Fruit variables considered above).
We shall suppose that $X$ can take any of the values $x_i$ where $i = 1, \ldots, M$, and $Y$ can take the values $y_j$ where $j = 1, \ldots, L$. Consider a total of $N$ trials in which we sample both of the variables $X$ and $Y$, and let the number of such trials in which $X = x_i$ and $Y = y_j$ be $n_{ij}$. Also, let the number of trials in which $X$ takes the value $x_i$ (irrespective of the value that $Y$ takes) be denoted by $c_i$, and similarly let the number of trials in which $Y$ takes the value $y_j$ be denoted by $r_j$.

The probability that $X$ will take the value $x_i$ and $Y$ will take the value $y_j$ is written $p(X = x_i, Y = y_j)$ and is called the joint probability of $X = x_i$ and $Y = y_j$. It is given by the number of points falling in the cell $i,j$ as a fraction of the total number of points, and hence

$$p(X = x_i, Y = y_j) = \frac{n_{ij}}{N}. \tag{1.5}$$

Here we are implicitly considering the limit $N \to \infty$. Similarly, the probability that $X$ takes the value $x_i$ irrespective of the value of $Y$ is written as $p(X = x_i)$ and is given by the fraction of the total number of points that fall in column $i$, so that

$$p(X = x_i) = \frac{c_i}{N}. \tag{1.6}$$

Because the number of instances in column $i$ in Figure 1.10 is just the sum of the number of instances in each cell of that column, we have $c_i = \sum_j n_{ij}$ and therefore, from (1.5) and (1.6), we have

$$p(X = x_i) = \sum_{j=1}^{L} p(X = x_i, Y = y_j) \tag{1.7}$$

which is the sum rule of probability. Note that $p(X = x_i)$ is sometimes called the marginal probability, because it is obtained by marginalizing, or summing out, the other variables (in this case $Y$).

If we consider only those instances for which $X = x_i$, then the fraction of such instances for which $Y = y_j$ is written $p(Y = y_j | X = x_i)$ and is called the conditional probability of $Y = y_j$ given $X = x_i$. It is obtained by finding the fraction of those points in column $i$ that fall in cell $i,j$ and hence is given by

$$p(Y = y_j | X = x_i) = \frac{n_{ij}}{c_i}. \tag{1.8}$$

From (1.5), (1.6), and (1.8), we can then derive the following relationship

$$p(X = x_i, Y = y_j) = \frac{n_{ij}}{N} = \frac{n_{ij}}{c_i} \cdot \frac{c_i}{N} = p(Y = y_j | X = x_i)\, p(X = x_i) \tag{1.9}$$

which is the product rule of probability.
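These count-based definitions translate directly into array operations. The following minimal sketch (our own illustration with a made-up 3-by-5 table of counts $n_{ij}$, assuming NumPy is available) checks (1.5) through (1.9) numerically:

```python
import numpy as np

# n[j, i] = number of trials with Y = y_j (rows) and X = x_i (columns),
# mirroring the layout of Figure 1.10 (L = 3 rows, M = 5 columns).
n = np.array([[3, 1, 4, 2, 0],
              [2, 5, 1, 3, 4],
              [1, 2, 2, 0, 3]])
N = n.sum()
c = n.sum(axis=0)                # column totals c_i

joint = n / N                    # p(X = x_i, Y = y_j), equation (1.5)
p_x = c / N                      # p(X = x_i) = c_i / N, equation (1.6)

# Sum rule (1.7): the marginal is the joint summed over Y.
assert np.allclose(joint.sum(axis=0), p_x)

# Conditional probability (1.8): p(Y = y_j | X = x_i) = n_ij / c_i.
cond_y_given_x = n / c

# Product rule (1.9): joint = conditional times marginal.
assert np.allclose(joint, cond_y_given_x * p_x)
```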
So far we have been quite careful to make a distinction between a random variable, such as the box $B$ in the fruit example, and the values that the random variable can take, for example $r$ if the box were the red one. Thus the probability that $B$ takes the value $r$ is denoted $p(B = r)$. Although this helps to avoid ambiguity, it leads to a rather cumbersome notation, and in many cases there will be no need for such pedantry. Instead, we may simply write $p(B)$ to denote a distribution over the random variable $B$, or $p(r)$ to denote the distribution evaluated for the particular value $r$, provided that the interpretation is clear from the context.

With this more compact notation, we can write the two fundamental rules of probability theory in the following form.

The Rules of Probability

$$\text{sum rule} \qquad p(X) = \sum_Y p(X, Y) \tag{1.10}$$

$$\text{product rule} \qquad p(X, Y) = p(Y|X)\, p(X). \tag{1.11}$$

Here $p(X, Y)$ is a joint probability and is verbalized as “the probability of $X$ and $Y$”. Similarly, the quantity $p(Y|X)$ is a conditional probability and is verbalized as “the probability of $Y$ given $X$”, whereas the quantity $p(X)$ is a marginal probability and is simply “the probability of $X$”. These two simple rules form the basis for all of the probabilistic machinery that we use throughout this book.

From the product rule, together with the symmetry property $p(X, Y) = p(Y, X)$, we immediately obtain the following relationship between conditional probabilities

$$p(Y|X) = \frac{p(X|Y)\, p(Y)}{p(X)} \tag{1.12}$$

which is called Bayes’ theorem and which plays a central role in pattern recognition and machine learning. Using the sum rule, the denominator in Bayes’ theorem can be expressed in terms of the quantities appearing in the numerator

$$p(X) = \sum_Y p(X|Y)\, p(Y). \tag{1.13}$$

We can view the denominator in Bayes’ theorem as being the normalization constant required to ensure that the sum of the conditional probability on the left-hand side of (1.12) over all values of $Y$ equals one.
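Bayes’ theorem and the normalization (1.13) can be checked the same way. In this sketch (again with made-up numbers of our own), we start from $p(X|Y)$ and the prior $p(Y)$, form the denominator by the sum rule, and confirm that the resulting posterior sums to one over $Y$:

```python
import numpy as np

p_y = np.array([0.3, 0.7])                 # prior p(Y), made-up numbers
p_x_given_y = np.array([[0.2, 0.5, 0.3],   # p(X | Y = y_1)
                        [0.6, 0.1, 0.3]])  # p(X | Y = y_2)

# Denominator of Bayes' theorem via the sum rule (1.13):
# p(X) = sum over Y of p(X|Y) p(Y).
p_x = p_x_given_y.T @ p_y

# Bayes' theorem (1.12): p(Y|X) = p(X|Y) p(Y) / p(X).
p_y_given_x = p_x_given_y * p_y[:, None] / p_x

# The posterior is normalized: summing over Y gives one for every value of X.
assert np.allclose(p_y_given_x.sum(axis=0), 1.0)
```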
In Figure 1.11, we show a simple example involving a joint distribution over two variables to illustrate the concept of marginal and conditional distributions. Here a finite sample of $N = 60$ data points has been drawn from the joint distribution and is shown in the top left. In the top right is a histogram of the fractions of data points having each of the two values of $Y$. From the definition of probability, these fractions would equal the corresponding probabilities $p(Y)$ in the limit $N \to \infty$. We can view the histogram as a simple way to model a probability distribution given only a finite number of points drawn from that distribution. Modelling distributions from data lies at the heart of statistical pattern recognition and will be explored in great detail in this book. The remaining two plots in Figure 1.11 show the corresponding histogram estimates of $p(X)$ and $p(X|Y = 1)$.

[Figure 1.11: An illustration of a distribution over two variables, $X$, which takes 9 possible values, and $Y$, which takes two possible values. The top left figure shows a sample of 60 points drawn from a joint probability distribution over these variables. The remaining figures show histogram estimates of the marginal distributions $p(X)$ and $p(Y)$, as well as the conditional distribution $p(X|Y = 1)$ corresponding to the bottom row in the top left figure.]
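The histogram construction of Figure 1.11 is easy to reproduce in outline. The sketch below (our own; the actual joint distribution behind Figure 1.11 is not specified in the text, so an arbitrary one is used) draws $N = 60$ points over 9 values of $X$ and 2 values of $Y$ and forms the fractions that estimate $p(Y)$, $p(X)$, and $p(X|Y = 1)$:

```python
import numpy as np

rng = np.random.default_rng(1)

# An arbitrary joint distribution over Y (2 values, rows) and X (9 values, columns).
joint = rng.random((2, 9))
joint /= joint.sum()

# Draw N = 60 (y, x) pairs from the joint distribution.
N = 60
flat = rng.choice(joint.size, size=N, p=joint.ravel())
y, x = np.unravel_index(flat, joint.shape)

# Histogram estimates: fractions of points, as in Figure 1.11.
p_y_est = np.bincount(y, minlength=2) / N        # estimates p(Y)
p_x_est = np.bincount(x, minlength=9) / N        # estimates p(X)
in_y1 = (y == 0)                                 # row 0 taken to represent Y = 1
p_x_given_y1 = np.bincount(x[in_y1], minlength=9) / in_y1.sum()

print(p_y_est, p_x_est, p_x_given_y1, sep="\n")
```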
Let us now return to our example involving boxes of fruit. For the moment, we shall once again be explicit about distinguishing between the random variables and their instantiations. We have seen that the probabilities of selecting either the red or the blue boxes are given by

$$p(B = r) = 4/10 \tag{1.14}$$

$$p(B = b) = 6/10 \tag{1.15}$$

respectively. Note that these satisfy $p(B = r) + p(B = b) = 1$.

Now suppose that we pick a box at random, and it turns out to be the blue box. Then the probability of selecting an apple is just the fraction of apples in the blue box, which is $3/4$, and so $p(F = a|B = b) = 3/4$. In fact, we can write out all four conditional probabilities for the type of fruit, given the selected box

$$p(F = a|B = r) = 1/4 \tag{1.16}$$
$$p(F = o|B = r) = 3/4 \tag{1.17}$$
$$p(F = a|B = b) = 3/4 \tag{1.18}$$
$$p(F = o|B = b) = 1/4. \tag{1.19}$$

Again, note that these probabilities are normalized so that

$$p(F = a|B = r) + p(F = o|B = r) = 1 \tag{1.20}$$

and similarly

$$p(F = a|B = b) + p(F = o|B = b) = 1. \tag{1.21}$$

We can now use the sum and product rules of probability to evaluate the overall probability of choosing an apple

$$p(F = a) = p(F = a|B = r)\, p(B = r) + p(F = a|B = b)\, p(B = b) = \frac{1}{4} \times \frac{4}{10} + \frac{3}{4} \times \frac{6}{10} = \frac{11}{20} \tag{1.22}$$

from which it follows, using the sum rule, that $p(F = o) = 1 - 11/20 = 9/20$.
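The same calculation can be written out in code. This small sketch (our own, using Python's standard fractions module for exact arithmetic) encodes (1.14) through (1.19) and reproduces $p(F = a) = 11/20$ by the sum and product rules:

```python
from fractions import Fraction

p_box = {"r": Fraction(4, 10), "b": Fraction(6, 10)}      # (1.14)-(1.15)
p_fruit_given_box = {                                     # (1.16)-(1.19)
    "r": {"a": Fraction(1, 4), "o": Fraction(3, 4)},
    "b": {"a": Fraction(3, 4), "o": Fraction(1, 4)},
}

# Sum and product rules: p(F = a) = sum over B of p(F = a | B) p(B).
p_apple = sum(p_fruit_given_box[box]["a"] * p_box[box] for box in p_box)
print(p_apple)       # 11/20
print(1 - p_apple)   # p(F = o) = 9/20
```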
Suppose instead we are told that a piece of fruit has been selected and it is an orange, and we would like to know which box it came from. This requires that we evaluate the probability distribution over boxes conditioned on the identity of the fruit, whereas the probabilities in (1.16)–(1.19) give the probability distribution over the fruit conditioned on the identity of the box. We can solve the problem of reversing the conditional probability by using Bayes’ theorem to give

$$p(B = r|F = o) = \frac{p(F = o|B = r)\, p(B = r)}{p(F = o)} = \frac{3}{4} \times \frac{4}{10} \times \frac{20}{9} = \frac{2}{3}. \tag{1.23}$$

From the sum rule, it then follows that $p(B = b|F = o) = 1 - 2/3 = 1/3$.

We can provide an important interpretation of Bayes’ theorem as follows.
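As a self-contained numerical check of (1.23), the following short sketch (our own, again using exact fractions) reverses the conditioning and reproduces the posterior values $2/3$ and $1/3$:

```python
from fractions import Fraction

p_box = {"r": Fraction(4, 10), "b": Fraction(6, 10)}
p_orange_given_box = {"r": Fraction(3, 4), "b": Fraction(1, 4)}  # from (1.17) and (1.19)

# Denominator via the sum rule: p(F = o) = 9/20.
p_orange = sum(p_orange_given_box[box] * p_box[box] for box in p_box)

# Bayes' theorem (1.23): p(B = r | F = o) = p(F = o | B = r) p(B = r) / p(F = o).
p_red_given_orange = p_orange_given_box["r"] * p_box["r"] / p_orange
print(p_red_given_orange)        # 2/3
print(1 - p_red_given_orange)    # p(B = b | F = o) = 1/3
```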