Bishop, C. M., Pattern Recognition and Machine Learning (2006)
We see that, for large D, this fraction tends to 1 even for small values of ϵ. Thus, in spaces of high dimensionality, most of the volume of a sphere is concentrated in a thin shell near the surface! As a further example, of direct relevance to pattern recognition, consider the behaviour of a Gaussian distribution in a high-dimensional space. If we transform from Cartesian to polar coordinates, and then integrate out the directional variables, we obtain an expression for the density p(r) as a function of radius r from the origin. Thus p(r)δr is the probability mass inside a thin shell of thickness δr located at radius r.
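Both statements follow from a short calculation, sketched here with the volume of a sphere of radius r in D dimensions written as V_D(r) = K_D r^D, where K_D is a constant depending only on D, and with σ denoting the standard deviation of the (zero-mean, isotropic) Gaussian. The fraction of the volume of the unit sphere lying between radii 1 − ϵ and 1 is

\frac{V_D(1) - V_D(1-\epsilon)}{V_D(1)} = 1 - (1-\epsilon)^D \;\longrightarrow\; 1 \qquad \text{as } D \to \infty

for any fixed ϵ > 0. Similarly, because the Gaussian density depends on x only through the radius, multiplying it by the surface area of a sphere of radius r, which grows like r^{D−1}, gives

p(r) \propto r^{D-1} \exp\!\left(-\frac{r^2}{2\sigma^2}\right),

and setting the derivative of ln p(r) to zero locates its single maximum at r̂ = σ√(D−1) ≃ σ√D for large D, so that almost all of the probability mass lies in a thin shell around this radius.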
This distribution is plotted, for various values of D, in Figure 1.23, and we see that for large D the probability mass of the Gaussian is concentrated in a thin shell.

The severe difficulty that can arise in spaces of many dimensions is sometimes called the curse of dimensionality (Bellman, 1961). In this book, we shall make extensive use of illustrative examples involving input spaces of one or two dimensions, because this makes it particularly easy to illustrate the techniques graphically. The reader should be warned, however, that not all intuitions developed in spaces of low dimensionality will generalize to spaces of many dimensions.

Figure 1.22  Plot of the fraction of the volume of a sphere lying in the range r = 1 − ϵ to r = 1, for various values of the dimensionality D.

Figure 1.23  Plot of the probability density with respect to radius r of a Gaussian distribution, for various values of the dimensionality D. In a high-dimensional space, most of the probability mass of a Gaussian is located within a thin shell at a specific radius.

Although the curse of dimensionality certainly raises important issues for pattern recognition applications, it does not prevent us from finding effective techniques applicable to high-dimensional spaces. The reasons for this are twofold. First, real data will often be confined to a region of the space having lower effective dimensionality, and in particular the directions over which important variations in the target variables occur may be so confined.
Second, real data will typically exhibit some smoothness properties (at least locally) so that for the most part small changes in the input variables will produce small changes in the target variables, and so we can exploit local interpolation-like techniques to allow us to make predictions of the target variables for new values of the input variables. Successful pattern recognition techniques exploit one or both of these properties. Consider, for example, an application in manufacturing in which images are captured of identical planar objects on a conveyor belt, and in which the goal is to determine their orientation.
Each image is a point in a high-dimensional space whose dimensionality is determined by the number of pixels. Because the objects can occur at different positions within the image and in different orientations, there are three degrees of freedom of variability between images, and a set of images will live on a three-dimensional manifold embedded within the high-dimensional space.
Due to the complex relationships between the object position or orientation and the pixel intensities, this manifold will be highly nonlinear. If the goal is to learn a model that can take an input image and output the orientation of the object irrespective of its position, then there is only one degree of freedom of variability within the manifold that is significant.

1.5. Decision Theory

We have seen in Section 1.2 how probability theory provides us with a consistent mathematical framework for quantifying and manipulating uncertainty.
Here we turn to a discussion of decision theory that, when combined with probability theory, allows us to make optimal decisions in situations involving uncertainty such as those encountered in pattern recognition.

Suppose we have an input vector x together with a corresponding vector t of target variables, and our goal is to predict t given a new value for x.
For regression problems, t will comprise continuous variables, whereas for classification problems t will represent class labels. The joint probability distribution p(x, t) provides a complete summary of the uncertainty associated with these variables. Determination of p(x, t) from a set of training data is an example of inference and is typically a very difficult problem whose solution forms the subject of much of this book. In a practical application, however, we must often make a specific prediction for the value of t, or more generally take a specific action based on our understanding of the values t is likely to take, and this aspect is the subject of decision theory.

Consider, for example, a medical diagnosis problem in which we have taken an X-ray image of a patient, and we wish to determine whether the patient has cancer or not.
In this case, the input vector x is the set of pixel intensities in the image, and the output variable t will represent the presence of cancer, which we denote by the class C1, or the absence of cancer, which we denote by the class C2. We might, for instance, choose t to be a binary variable such that t = 0 corresponds to class C1 and t = 1 corresponds to class C2. We shall see later that this choice of label values is particularly convenient for probabilistic models.
The general inference problem then involves determining the joint distribution p(x, Ck), or equivalently p(x, t), which gives us the most complete probabilistic description of the situation. Although this can be a very useful and informative quantity, in the end we must decide either to give treatment to the patient or not, and we would like this choice to be optimal in some appropriate sense (Duda and Hart, 1973). This is the decision step, and it is the subject of decision theory to tell us how to make optimal decisions given the appropriate probabilities. We shall see that the decision stage is generally very simple, even trivial, once we have solved the inference problem.

Here we give an introduction to the key ideas of decision theory as required for the rest of the book. Further background, as well as more detailed accounts, can be found in Berger (1985) and Bather (2000).

Before giving a more detailed analysis, let us first consider informally how we might expect probabilities to play a role in making decisions. When we obtain the X-ray image x for a new patient, our goal is to decide which of the two classes to assign to the image.
We are interested in the probabilities of the two classes given the image, which are given by p(Ck|x). Using Bayes' theorem, these probabilities can be expressed in the form

p(C_k|x) = \frac{p(x|C_k)\,p(C_k)}{p(x)}.        (1.77)

Note that any of the quantities appearing in Bayes' theorem can be obtained from the joint distribution p(x, Ck) by either marginalizing or conditioning with respect to the appropriate variables. We can now interpret p(Ck) as the prior probability for the class Ck, and p(Ck|x) as the corresponding posterior probability. Thus p(C1) represents the probability that a person has cancer, before we take the X-ray measurement. Similarly, p(C1|x) is the corresponding probability, revised using Bayes' theorem in light of the information contained in the X-ray.
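To see how these quantities interact numerically, here is a small illustrative calculation; the prior and likelihood values below are hypothetical, chosen only for illustration, and are not taken from the text:

    # Illustrative check of Bayes' theorem (1.77) with made-up numbers.
    prior_c1 = 0.01              # assumed p(C1): prior probability of cancer
    prior_c2 = 1.0 - prior_c1    # p(C2)
    lik_c1 = 0.9                 # assumed p(x|C1): likelihood of this image given cancer
    lik_c2 = 0.1                 # assumed p(x|C2): likelihood of this image given no cancer

    evidence = lik_c1 * prior_c1 + lik_c2 * prior_c2   # p(x), by the sum and product rules
    posterior_c1 = lik_c1 * prior_c1 / evidence        # p(C1|x)
    posterior_c2 = lik_c2 * prior_c2 / evidence        # p(C2|x)

    print(posterior_c1, posterior_c2)                  # approximately 0.083 and 0.917

Even though the image is nine times more probable under the cancer class than under the normal class in this made-up example, the small prior keeps the posterior probability of cancer below ten per cent, illustrating that both factors in (1.77) matter.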
If our aim is to minimize the chance of assigning x to the wrong class, then intuitively we would choose the class having the higher posterior probability. We now show that this intuition is correct, and we also discuss more general criteria for making decisions.

1.5.1 Minimizing the misclassification rate

Suppose that our goal is simply to make as few misclassifications as possible. We need a rule that assigns each value of x to one of the available classes. Such a rule will divide the input space into regions Rk called decision regions, one for each class, such that all points in Rk are assigned to class Ck.
The boundaries between decision regions are called decision boundaries or decision surfaces. Note that each decision region need not be contiguous but could comprise some number of disjoint regions. We shall encounter examples of decision boundaries and decision regions in later chapters. In order to find the optimal decision rule, consider first of all the case of two classes, as in the cancer problem for instance. A mistake occurs when an input vector belonging to class C1 is assigned to class C2 or vice versa. The probability of this occurring is given by

p(mistake) = p(x \in R_1, C_2) + p(x \in R_2, C_1)
           = \int_{R_1} p(x, C_2)\,dx + \int_{R_2} p(x, C_1)\,dx.        (1.78)

We are free to choose the decision rule that assigns each point x to one of the two classes.
Clearly to minimize p(mistake) we should arrange that each x is assigned to whichever class has the smaller value of the integrand in (1.78). Thus, if p(x, C1) > p(x, C2) for a given value of x, then we should assign that x to class C1. From the product rule of probability we have p(x, Ck) = p(Ck|x)p(x).
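Written out explicitly, this substitution shows that, for any x with p(x) > 0,

p(x, C_1) > p(x, C_2) \;\Longleftrightarrow\; p(C_1|x)\,p(x) > p(C_2|x)\,p(x) \;\Longleftrightarrow\; p(C_1|x) > p(C_2|x).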
Because the factor p(x) is common to both terms, we can restate this result as saying that the minimum probability of making a mistake is obtained if each value of x is assigned to the class for which the posterior probability p(Ck|x) is largest. This result is illustrated for two classes, and a single input variable x, in Figure 1.24.

Figure 1.24  Schematic illustration of the joint probabilities p(x, Ck) for each of two classes plotted against x, together with the decision boundary x = x̂. Values of x ≥ x̂ are classified as class C2 and hence belong to decision region R2, whereas points x < x̂ are classified as C1 and belong to R1. Errors arise from the blue, green, and red regions, so that for x < x̂ the errors are due to points from class C2 being misclassified as C1 (represented by the sum of the red and green regions), and conversely for points in the region x ≥ x̂ the errors are due to points from class C1 being misclassified as C2 (represented by the blue region). As we vary the location x̂ of the decision boundary, the combined areas of the blue and green regions remain constant, whereas the size of the red region varies. The optimal choice for x̂ is where the curves for p(x, C1) and p(x, C2) cross, corresponding to x̂ = x0, because in this case the red region disappears. This is equivalent to the minimum misclassification rate decision rule, which assigns each value of x to the class having the higher posterior probability p(Ck|x).

For the more general case of K classes, it is slightly easier to maximize the probability of being correct, which is given by

p(correct) = \sum_{k=1}^{K} p(x \in R_k, C_k)
           = \sum_{k=1}^{K} \int_{R_k} p(x, C_k)\,dx        (1.79)

which is maximized when the regions Rk are chosen such that each x is assigned to the class for which p(x, Ck) is largest.
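As a concrete sketch of this decision rule (using made-up class-conditional densities and prior probabilities purely for illustration; none of these numbers come from the text), the following code evaluates the joint densities p(x, Ck) for a two-class, one-dimensional problem on a grid, assigns each grid point to the class with the larger joint (equivalently, posterior) probability, and approximates p(mistake) in (1.78) and p(correct) in (1.79) numerically:

    import numpy as np

    def gauss_pdf(x, mu, sigma):
        # Univariate Gaussian density, used here as a hypothetical class-conditional p(x|Ck).
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

    # Hypothetical two-class problem: priors p(Ck) and Gaussian class-conditionals p(x|Ck).
    priors = np.array([0.6, 0.4])
    means = np.array([0.0, 2.0])
    stds = np.array([1.0, 1.0])

    x = np.linspace(-6.0, 8.0, 2001)   # grid covering essentially all of the probability mass
    dx = x[1] - x[0]
    joint = np.stack([priors[k] * gauss_pdf(x, means[k], stds[k]) for k in range(2)])  # p(x, Ck)

    # Minimum-misclassification-rate rule: assign each x to the class with the largest
    # p(x, Ck), which by the product rule is the class with the largest posterior p(Ck|x).
    decision = np.argmax(joint, axis=0)

    # Numerical approximations of (1.78) and (1.79): integrate the "wrong" density over
    # each decision region for p(mistake), and the "chosen" density for p(correct).
    p_mistake = np.sum(np.where(decision == 0, joint[1], joint[0])) * dx
    p_correct = np.sum(np.where(decision == 0, joint[0], joint[1])) * dx

    print(p_mistake, p_correct)        # the two estimates sum to approximately 1

Shifting the decision boundary away from the point where the two joint densities cross can only increase the resulting estimate of p(mistake), which is precisely the behaviour illustrated in Figure 1.24.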