Bishop C.M., Pattern Recognition and Machine Learning (2006)
During the 18th century, issues regarding probability arose in connection with gambling and with the new concept of insurance. One particularly important problem concerned so-called inverse probability. A solution was proposed by Thomas Bayes in his paper ‘Essay towards solving a problem in the doctrine of chances’, which was published in 1764, some three years after his death, in the Philosophical Transactions of the Royal Society.
In fact, Bayes only formulated his theory for the case of a uniform prior, and it was Pierre-Simon Laplace who independently rediscovered the theory in general form and who demonstrated its broad applicability.

Consider the example of polynomial curve fitting discussed in Section 1.1. It seems reasonable to apply the frequentist notion of probability to the random values of the observed variables tn. However, we would like to address and quantify the uncertainty that surrounds the appropriate choice for the model parameters w.
We shall see that, from a Bayesian perspective, we can use the machinery of probability theory to describe the uncertainty in model parameters such as w, or indeed in the choice of model itself.

Bayes’ theorem now acquires a new significance. Recall that in the boxes of fruit example, the observation of the identity of the fruit provided relevant information that altered the probability that the chosen box was the red one. In that example, Bayes’ theorem was used to convert a prior probability into a posterior probability by incorporating the evidence provided by the observed data.
As we shall see in detail later, we can adopt a similar approach when making inferences about quantities such as the parameters w in the polynomial curve fitting example. We capture our assumptions about w, before observing the data, in the form of a prior probability distribution p(w). The effect of the observed data D = {t1, . . . , tN} is expressed through the conditional probability p(D|w), and we shall see later, in Section 1.2.5, how this can be represented explicitly. Bayes’ theorem, which takes the form

    p(w|D) = p(D|w) p(w) / p(D)        (1.43)

then allows us to evaluate the uncertainty in w after we have observed D in the form of the posterior probability p(w|D). The quantity p(D|w) on the right-hand side of Bayes’ theorem is evaluated for the observed data set D and can be viewed as a function of the parameter vector w, in which case it is called the likelihood function.
It expresses how probable the observed data set is for different settings of the parameter vector w. Note that the likelihood is not a probability distribution over w, and its integral with respect to w does not (necessarily) equal one. Given this definition of likelihood, we can state Bayes’ theorem in words

    posterior ∝ likelihood × prior        (1.44)

where all of these quantities are viewed as functions of w. The denominator in (1.43) is the normalization constant, which ensures that the posterior distribution on the left-hand side is a valid probability density and integrates to one. Indeed, integrating both sides of (1.43) with respect to w, we can express the denominator in Bayes’ theorem in terms of the prior distribution and the likelihood function

    p(D) = ∫ p(D|w) p(w) dw.        (1.45)

In both the Bayesian and frequentist paradigms, the likelihood function p(D|w) plays a central role.
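The relationship posterior ∝ likelihood × prior, and the normalization (1.45), can be illustrated with a simple grid approximation. The following sketch uses a hypothetical data set (6 heads in 9 tosses of a coin with unknown bias w) and a Bernoulli likelihood; these specifics are assumptions for illustration, not taken from the text:

```python
import numpy as np

# Hypothetical setting: infer the bias w of a coin from D = 6 heads in 9
# tosses, by discretizing w and applying posterior ∝ likelihood × prior (1.44).
w = np.linspace(0.0, 1.0, 1001)       # grid over the parameter space
dw = w[1] - w[0]
prior = np.ones_like(w)               # uniform prior p(w)
likelihood = w**6 * (1.0 - w)**3      # p(D|w) for a Bernoulli model

unnormalized = likelihood * prior
evidence = (unnormalized * dw).sum()  # p(D) = ∫ p(D|w) p(w) dw, cf. (1.45)
posterior = unnormalized / evidence   # p(w|D) now integrates to one
```

Dividing by the evidence is exactly the role of the denominator in (1.43): it turns the unnormalized product into a valid density over w.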
However, the manner in which it is used is fundamentally different in the two approaches. In a frequentist setting, w is considered to be a fixed parameter, whose value is determined by some form of ‘estimator’, and error bars on this estimate are obtained by considering the distribution of possible data sets D. By contrast, from the Bayesian viewpoint there is only a single data set D (namely the one that is actually observed), and the uncertainty in the parameters is expressed through a probability distribution over w.

A widely used frequentist estimator is maximum likelihood, in which w is set to the value that maximizes the likelihood function p(D|w). This corresponds to choosing the value of w for which the probability of the observed data set is maximized. In the machine learning literature, the negative log of the likelihood function is called an error function.
Because the negative logarithm is a monotonically decreasing function, maximizing the likelihood is equivalent to minimizing the error. One approach to determining frequentist error bars is the bootstrap (Efron, 1979; Hastie et al., 2001), in which multiple data sets are created as follows.
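This equivalence can be checked directly on a grid; the Bernoulli likelihood below is an assumed example, not from the text:

```python
import numpy as np

# Because -log is monotonically decreasing, the w that maximizes the
# likelihood is exactly the w that minimizes the negative log-likelihood
# (the "error function").
w = np.linspace(0.01, 0.99, 99)
likelihood = w**6 * (1.0 - w)**3      # hypothetical Bernoulli likelihood p(D|w)
error = -np.log(likelihood)           # error function E(w) = -ln p(D|w)
```

Working with the log also replaces products of many small probabilities by sums, which is numerically much better behaved.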
Suppose our original data set consists of N data points X = {x1, . . . , xN}. We can create a new data set XB by drawing N points at random from X, with replacement, so that some points in X may be replicated in XB, whereas other points in X may be absent from XB. This process can be repeated L times to generate L data sets, each of size N and each obtained by sampling from the original data set X. The statistical accuracy of parameter estimates can then be evaluated by looking at the variability of predictions between the different bootstrap data sets.

One advantage of the Bayesian viewpoint is that the inclusion of prior knowledge arises naturally.
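The bootstrap procedure just described can be sketched as follows; the data set, the estimator (the sample mean), and the values of N and L are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical original data set of N points.
N = 50
X = rng.normal(loc=2.0, scale=1.0, size=N)

# Generate L bootstrap data sets, each of size N, by sampling from X
# with replacement, and record the estimate computed from each.
L = 1000
boot_means = np.empty(L)
for i in range(L):
    X_b = rng.choice(X, size=N, replace=True)  # some points repeat, some are absent
    boot_means[i] = X_b.mean()

# The variability between bootstrap data sets serves as an error bar.
std_error = boot_means.std()
```

For the mean of N roughly unit-variance points, this spread should come out near 1/√N, matching the classical standard-error formula.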
Suppose, for instance, that a fair-looking coin is tossed three times and lands heads each time. A classical maximum likelihood estimate of the probability of landing heads would give 1, implying that all future tosses will land heads! By contrast, a Bayesian approach with any reasonable prior will lead to a much less extreme conclusion.

There has been much controversy and debate associated with the relative merits of the frequentist and Bayesian paradigms, which have not been helped by the fact that there is no unique frequentist, or even Bayesian, viewpoint.
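The coin example above can be made concrete. The Beta(2, 2) prior used here is an assumed choice of "reasonable prior" (conjugacy of the Beta prior with the Bernoulli likelihood is covered in Chapter 2):

```python
# Three tosses, three heads: maximum likelihood says heads is certain,
# while a Beta(2, 2) prior (an assumption favouring fair coins) does not.
heads, tosses = 3, 3
ml_estimate = heads / tosses                     # = 1.0

# For a Beta(a, b) prior the posterior is Beta(a + heads, b + tails),
# whose mean is (a + heads) / (a + b + tosses).
a, b = 2, 2
posterior_mean = (a + heads) / (a + b + tosses)  # = 5/7, far from certainty
```

As more data arrive, the posterior mean moves toward the maximum likelihood estimate, so the influence of the prior fades.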
For instance, one common criticism of the Bayesian approach is that the prior distribution is often selected on the basis of mathematical convenience rather than as a reflection of any prior beliefs. Even the subjective nature of the conclusions through their dependence on the choice of prior is seen by some as a source of difficulty. Reducing the dependence on the prior is one motivation for so-called noninformative priors. However, these lead to difficulties when comparing different models, and indeed Bayesian methods based on poor choices of prior can give poor results with high confidence.
Frequentist evaluation methods offer some protection from such problems, and techniques such as cross-validation remain useful in areas such as model comparison.

This book places a strong emphasis on the Bayesian viewpoint, reflecting the huge growth in the practical importance of Bayesian methods in the past few years, while also discussing useful frequentist concepts as required.

Although the Bayesian framework has its origins in the 18th century, the practical application of Bayesian methods was for a long time severely limited by the difficulties in carrying through the full Bayesian procedure, particularly the need to marginalize (sum or integrate) over the whole of parameter space, which, as we shall see, is required in order to make predictions or to compare different models.
The development of sampling methods, such as Markov chain Monte Carlo (discussed in Chapter 11), along with dramatic improvements in the speed and memory capacity of computers, opened the door to the practical use of Bayesian techniques in an impressive range of problem domains. Monte Carlo methods are very flexible and can be applied to a wide range of models. However, they are computationally intensive and have mainly been used for small-scale problems.

More recently, highly efficient deterministic approximation schemes such as variational Bayes and expectation propagation (discussed in Chapter 10) have been developed. These offer a complementary alternative to sampling methods and have allowed Bayesian techniques to be used in large-scale applications (Blei et al., 2003).

1.2.4 The Gaussian distribution

We shall devote the whole of Chapter 2 to a study of various probability distributions and their key properties. It is convenient, however, to introduce here one of the most important probability distributions for continuous variables, called the normal or Gaussian distribution.
We shall make extensive use of this distribution in the remainder of this chapter and indeed throughout much of the book. For the case of a single real-valued variable x, the Gaussian distribution is defined by

    N(x|µ, σ²) = (1 / (2πσ²)^(1/2)) exp{ −(1/(2σ²)) (x − µ)² }        (1.46)

which is governed by two parameters: µ, called the mean, and σ², called the variance. The square root of the variance, given by σ, is called the standard deviation, and the reciprocal of the variance, written as β = 1/σ², is called the precision.
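Definition (1.46) translates directly into code; the function name below is of course an arbitrary choice:

```python
import math

def gaussian(x, mu, sigma2):
    """Evaluate the univariate Gaussian density N(x | mu, sigma^2) of (1.46)."""
    norm = 1.0 / math.sqrt(2.0 * math.pi * sigma2)   # normalization 1/(2*pi*sigma^2)^(1/2)
    return norm * math.exp(-((x - mu) ** 2) / (2.0 * sigma2))

# The density peaks at x = mu; for sigma^2 = 1 the peak value is 1/sqrt(2*pi).
peak = gaussian(0.0, 0.0, 1.0)
```

One could equally parameterize the function by the precision β = 1/σ², which is often more convenient in the Bayesian calculations that follow later in the book.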