Bishop, C.M. Pattern Recognition and Machine Learning (2006)
The notation (w1, . . . , wM) denotes a row vector with M elements, while the corresponding column vector is written as w = (w1, . . . , wM)T.

The notation [a, b] is used to denote the closed interval from a to b, that is, the interval including the values a and b themselves, while (a, b) denotes the corresponding open interval, that is, the interval excluding a and b. Similarly, [a, b) denotes an interval that includes a but excludes b. For the most part, however, there will be little need to dwell on such refinements as whether the end points of an interval are included or not.

The M × M identity matrix (also known as the unit matrix) is denoted IM, which will be abbreviated to I where there is no ambiguity about its dimensionality. It has elements Iij that equal 1 if i = j and 0 if i ≠ j.
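As a quick check of the identity-matrix convention, here is a minimal NumPy sketch (the value M = 3 and the vector x are arbitrary placeholders, not notation from the text):

    import numpy as np

    M = 3
    I_M = np.eye(M)                  # the M x M identity matrix, denoted I_M (or just I)
    assert I_M[0, 0] == 1.0          # I_ij equals 1 when i == j
    assert I_M[0, 1] == 0.0          # I_ij equals 0 when i != j

    x = np.array([1.0, 2.0, 3.0])
    assert np.allclose(I_M @ x, x)   # multiplying by the identity leaves a vector unchanged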
A functional is denoted f[y], where y(x) is some function. The concept of a functional is discussed in Appendix D.

The notation g(x) = O(f(x)) denotes that |g(x)/f(x)| is bounded as x → ∞. For instance, if g(x) = 3x² + 2, then g(x) = O(x²).

The expectation of a function f(x, y) with respect to a random variable x is denoted by Ex[f(x, y)]. In situations where there is no ambiguity as to which variable is being averaged over, this will be simplified by omitting the suffix, for instance E[x]. If the distribution of x is conditioned on another variable z, then the corresponding conditional expectation will be written Ex[f(x)|z]. Similarly, the variance is denoted var[f(x)], and for vector variables the covariance is written cov[x, y]. We shall also use cov[x] as a shorthand notation for cov[x, x]. The concepts of expectations and covariances are introduced in Section 1.2.2.
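To make the expectation and covariance notation concrete, the following is a minimal sketch (NumPy; the choice of distribution and of the function f is arbitrary, purely for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=100_000)     # samples of a random variable x ~ N(0, 1)
    f = lambda t: t**2               # an arbitrary function f(x)

    E_f   = np.mean(f(x))            # Monte Carlo estimate of E_x[f(x)]  (about 1 here)
    var_f = np.var(f(x))             # estimate of var[f(x)]              (about 2 here)

    y = 2.0 * x + rng.normal(size=x.shape)
    C = np.cov(np.stack([x, y]))     # 2 x 2 matrix: diagonal entries estimate var[x], var[y];
                                     # off-diagonal entries estimate cov[x, y]  (about 2 here)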
If we have N values x1, . . . , xN of a D-dimensional vector x = (x1, . . . , xD)T, we can combine the observations into a data matrix X in which the nth row of X corresponds to the row vector xnT. Thus the n, i element of X corresponds to the ith element of the nth observation xn. For the case of one-dimensional variables we shall denote such a matrix by x, which is a column vector whose nth element is xn. Note that x (which has dimensionality N) uses a different typeface to distinguish it from x (which has dimensionality D).
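A minimal NumPy sketch of this convention (array names and values are illustrative only): stacking the N observations as rows yields the N × D data matrix X, while one-dimensional data form a column vector with N elements:

    import numpy as np

    N, D = 5, 3
    observations = [np.arange(D, dtype=float) + n for n in range(N)]   # x_1, ..., x_N

    X = np.stack(observations)               # N x D data matrix; row n is xnT
    assert X.shape == (N, D)
    assert X[1, 2] == observations[1][2]     # the (n, i) element is the ith element of x_n

    x_col = X[:, :1]                         # one-dimensional case: a column vector of length N
    assert x_col.shape == (N, 1)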
Contents

Preface
Mathematical notation
Contents

1  Introduction
   1.1  Example: Polynomial Curve Fitting
   1.2  Probability Theory
        1.2.1  Probability densities
        1.2.2  Expectations and covariances
        1.2.3  Bayesian probabilities
        1.2.4  The Gaussian distribution
        1.2.5  Curve fitting re-visited
        1.2.6  Bayesian curve fitting
   1.3  Model Selection
   1.4  The Curse of Dimensionality
   1.5  Decision Theory
        1.5.1  Minimizing the misclassification rate
        1.5.2  Minimizing the expected loss
        1.5.3  The reject option
        1.5.4  Inference and decision
        1.5.5  Loss functions for regression
   1.6  Information Theory
        1.6.1  Relative entropy and mutual information
   Exercises

2  Probability Distributions
   2.1  Binary Variables
        2.1.1  The beta distribution
   2.2  Multinomial Variables
        2.2.1  The Dirichlet distribution
   2.3  The Gaussian Distribution
        2.3.1  Conditional Gaussian distributions
        2.3.2  Marginal Gaussian distributions
        2.3.3  Bayes' theorem for Gaussian variables
        2.3.4  Maximum likelihood for the Gaussian
        2.3.5  Sequential estimation
        2.3.6  Bayesian inference for the Gaussian
        2.3.7  Student's t-distribution
        2.3.8  Periodic variables
        2.3.9  Mixtures of Gaussians
   2.4  The Exponential Family
        2.4.1  Maximum likelihood and sufficient statistics
        2.4.2  Conjugate priors
        2.4.3  Noninformative priors
   2.5  Nonparametric Methods
        2.5.1  Kernel density estimators
        2.5.2  Nearest-neighbour methods
   Exercises

3  Linear Models for Regression
   3.1  Linear Basis Function Models
        3.1.1  Maximum likelihood and least squares
        3.1.2  Geometry of least squares
        3.1.3  Sequential learning
        3.1.4  Regularized least squares
        3.1.5  Multiple outputs
   3.2  The Bias-Variance Decomposition
   3.3  Bayesian Linear Regression
        3.3.1  Parameter distribution
        3.3.2  Predictive distribution
        3.3.3  Equivalent kernel
   3.4  Bayesian Model Comparison
   3.5  The Evidence Approximation
        3.5.1  Evaluation of the evidence function
        3.5.2  Maximizing the evidence function
        3.5.3  Effective number of parameters
   3.6  Limitations of Fixed Basis Functions
   Exercises

4  Linear Models for Classification
   4.1  Discriminant Functions
        4.1.1  Two classes
        4.1.2  Multiple classes
        4.1.3  Least squares for classification
        4.1.4  Fisher's linear discriminant
        4.1.5  Relation to least squares
        4.1.6  Fisher's discriminant for multiple classes
        4.1.7  The perceptron algorithm
   4.2  Probabilistic Generative Models
        4.2.1  Continuous inputs
        4.2.2  Maximum likelihood solution
        4.2.3  Discrete features
        4.2.4  Exponential family
   4.3  Probabilistic Discriminative Models
        4.3.1  Fixed basis functions
        4.3.2  Logistic regression
        4.3.3  Iterative reweighted least squares
        4.3.4  Multiclass logistic regression
        4.3.5  Probit regression
        4.3.6  Canonical link functions
   4.4  The Laplace Approximation
        4.4.1  Model comparison and BIC
   4.5  Bayesian Logistic Regression
        4.5.1  Laplace approximation
        4.5.2  Predictive distribution
   Exercises

5  Neural Networks
   5.1  Feed-forward Network Functions
        5.1.1  Weight-space symmetries
   5.2  Network Training
        5.2.1  Parameter optimization
        5.2.2  Local quadratic approximation
        5.2.3  Use of gradient information
        5.2.4  Gradient descent optimization
   5.3  Error Backpropagation
        5.3.1  Evaluation of error-function derivatives
        5.3.2  A simple example
        5.3.3  Efficiency of backpropagation
        5.3.4  The Jacobian matrix
   5.4  The Hessian Matrix
        5.4.1  Diagonal approximation
        5.4.2  Outer product approximation
        5.4.3  Inverse Hessian
        5.4.4  Finite differences
        5.4.5  Exact evaluation of the Hessian
        5.4.6  Fast multiplication by the Hessian
   5.5  Regularization in Neural Networks
        5.5.1  Consistent Gaussian priors
        5.5.2  Early stopping
        5.5.3  Invariances
        5.5.4  Tangent propagation
        5.5.5  Training with transformed data
        5.5.6  Convolutional networks
        5.5.7  Soft weight sharing
   5.6  Mixture Density Networks
   5.7  Bayesian Neural Networks
        5.7.1  Posterior parameter distribution
        5.7.2  Hyperparameter optimization
        5.7.3  Bayesian neural networks for classification
   Exercises

6  Kernel Methods
   6.1  Dual Representations