Bishop C.M., Pattern Recognition and Machine Learning (2006), page 5
The data for this example is generated from the function sin(2πx) with random noise included in the target values, as described in detail in Appendix A. Now suppose that we are given a training set comprising N observations of x, written x ≡ (x_1, . . . , x_N)^T, together with corresponding observations of the values of t, denoted t ≡ (t_1, . . . , t_N)^T.
Figure 1.2 shows a plot of a training set comprising N = 10 data points. The input data set x in Figure 1.2 was generated by choosing values of x_n, for n = 1, . . . , N, spaced uniformly in range [0, 1], and the target data set t was obtained by first computing the corresponding values of the function sin(2πx) and then adding a small level of random noise having a Gaussian distribution (the Gaussian distribution is discussed in Section 1.2.4) to each such point in order to obtain the corresponding value t_n. By generating data in this way, we are capturing a property of many real data sets, namely that they possess an underlying regularity, which we wish to learn, but that individual observations are corrupted by random noise. This noise might arise from intrinsically stochastic (i.e.
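The data-generating procedure described above can be sketched in Python. This is a minimal illustration, not the book's own code; the noise standard deviation of 0.3 is an assumed value, since this passage does not state the noise level used for Figure 1.2:

```python
import numpy as np

def make_dataset(N, noise_std=0.3, rng=None):
    """Sample N targets from sin(2*pi*x) plus Gaussian noise.

    noise_std = 0.3 is an assumption for illustration only; the text
    does not specify the noise level used to generate Figure 1.2.
    """
    if rng is None:
        rng = np.random.default_rng()
    x = np.linspace(0.0, 1.0, N)                  # uniformly spaced in [0, 1]
    t = np.sin(2 * np.pi * x) + rng.normal(0.0, noise_std, size=N)
    return x, t

x, t = make_dataset(10, rng=np.random.default_rng(0))
```

Fixing the random generator's seed makes the noisy data set reproducible, which is convenient when comparing fits of different orders later on.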
random) processes such as radioactive decay but more typically is due to there being sources of variability that are themselves unobserved. Our goal is to exploit this training set in order to make predictions of the value t of the target variable for some new value x of the input variable. As we shall see later, this involves implicitly trying to discover the underlying function sin(2πx). This is intrinsically a difficult problem as we have to generalize from a finite data set. Furthermore the observed data are corrupted with noise, and so for a given x there is uncertainty as to the appropriate value for t. Probability theory, discussed in Section 1.2, provides a framework for expressing such uncertainty in a precise and quantitative manner, and decision theory, discussed in Section 1.5, allows us to exploit this probabilistic representation in order to make predictions that are optimal according to appropriate criteria. For the moment, however, we shall proceed rather informally and consider a simple approach based on curve fitting.
In particular, we shall fit the data using a polynomial function of the form

y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \ldots + w_M x^M = \sum_{j=0}^{M} w_j x^j        (1.1)

where M is the order of the polynomial, and x^j denotes x raised to the power of j. The polynomial coefficients w_0, . . . , w_M are collectively denoted by the vector w. Note that, although the polynomial function y(x, w) is a nonlinear function of x, it is a linear function of the coefficients w.
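The model (1.1) can be written as a one-line function; a minimal sketch, using NumPy's `polyval` (which expects the highest-order coefficient first, hence the reversal):

```python
import numpy as np

def y(x, w):
    """Evaluate the polynomial (1.1): y(x, w) = sum_j w[j] * x**j.

    w[j] is the coefficient of x**j, so the order M is len(w) - 1.
    """
    return np.polyval(w[::-1], x)   # polyval wants highest power first

# y is nonlinear in x but linear in w: scaling w scales y(x, w)
w = np.array([1.0, 2.0, 3.0])       # 1 + 2x + 3x^2
print(y(0.5, w))                    # 1 + 1 + 0.75 = 2.75
```

The linearity in w noted in the text is easy to check numerically: doubling every coefficient doubles the output for any fixed x.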
Functions, such as the polynomial, which are linear in the unknown parameters have important properties and are called linear models and will be discussed extensively in Chapters 3 and 4. The values of the coefficients will be determined by fitting the polynomial to the training data. This can be done by minimizing an error function that measures the misfit between the function y(x, w), for any given value of w, and the training set data points. One simple choice of error function, which is widely used, is given by the sum of the squares of the errors between the predictions y(x_n, w) for each data point x_n and the corresponding target values t_n, so that we minimize

E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2        (1.2)

where the factor of 1/2 is included for later convenience.
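A direct translation of (1.2), as a sketch rather than a reference implementation:

```python
import numpy as np

def sum_of_squares_error(w, x, t):
    """Error function (1.2): half the sum of squared residuals."""
    predictions = np.polyval(w[::-1], x)   # highest power first for polyval
    return 0.5 * np.sum((predictions - t) ** 2)

# A polynomial passing exactly through every point gives zero error
x = np.array([0.0, 0.5, 1.0])
w = np.array([0.0, 1.0])                   # y = x
print(sum_of_squares_error(w, x, x))       # 0.0
print(sum_of_squares_error(w, x, x + 1))   # 0.5 * 3 * 1^2 = 1.5
```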
We shall discuss the motivation for this choice of error function later in this chapter. For the moment we simply note that it is a nonnegative quantity that would be zero if, and only if, the function y(x, w) were to pass exactly through each training data point.

[Figure 1.3: The error function (1.2) corresponds to (one half of) the sum of the squares of the displacements (shown by the vertical green bars) of each data point from the function y(x, w).]
The geometrical interpretation of the sum-of-squares error function is illustrated in Figure 1.3. [Exercise 1.1] We can solve the curve fitting problem by choosing the value of w for which E(w) is as small as possible. Because the error function is a quadratic function of the coefficients w, its derivatives with respect to the coefficients will be linear in the elements of w, and so the minimization of the error function has a unique solution, denoted by w⋆, which can be found in closed form. The resulting polynomial is given by the function y(x, w⋆). There remains the problem of choosing the order M of the polynomial, and as we shall see this will turn out to be an example of an important concept called model comparison or model selection.
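Because E(w) is quadratic in w, finding w⋆ amounts to a linear least-squares problem. A minimal sketch via the design matrix (this uses `np.linalg.lstsq` as the solver, which is not how the book derives w⋆ but gives the same unique minimizer):

```python
import numpy as np

def fit_polynomial(x, t, M):
    """Closed-form minimizer w* of E(w) for an order-M polynomial.

    A[n, j] = x_n ** j, so minimizing E(w) is the linear least-squares
    problem min_w ||A w - t||^2, solved here with np.linalg.lstsq.
    """
    A = np.vander(x, M + 1, increasing=True)
    w_star, *_ = np.linalg.lstsq(A, t, rcond=None)
    return w_star

# Noise-free sanity check: data from a cubic is recovered exactly at M = 3
x = np.linspace(0.0, 1.0, 10)
t = 1.0 - 2.0 * x + 3.0 * x ** 3
print(np.round(fit_polynomial(x, t, 3), 6))   # ~ [1, -2, 0, 3]
```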
In Figure 1.4, we show four examples of the results of fitting polynomials having orders M = 0, 1, 3, and 9 to the data set shown in Figure 1.2. We notice that the constant (M = 0) and first order (M = 1) polynomials give rather poor fits to the data and consequently rather poor representations of the function sin(2πx). The third order (M = 3) polynomial seems to give the best fit to the function sin(2πx) of the examples shown in Figure 1.4. When we go to a much higher order polynomial (M = 9), we obtain an excellent fit to the training data. In fact, the polynomial passes exactly through each data point and E(w⋆) = 0. However, the fitted curve oscillates wildly and gives a very poor representation of the function sin(2πx).
This latter behaviour is known as over-fitting. As we have noted earlier, the goal is to achieve good generalization by making accurate predictions for new data. We can obtain some quantitative insight into the dependence of the generalization performance on M by considering a separate test set comprising 100 data points generated using exactly the same procedure used to generate the training set points but with new choices for the random noise values included in the target values. For each choice of M, we can then evaluate the residual value of E(w⋆) given by (1.2) for the training data, and we can also evaluate E(w⋆) for the test data set.
[Figure 1.4: Plots of polynomials having various orders M (M = 0, 1, 3, 9), shown as red curves, fitted to the data set shown in Figure 1.2.]

It is sometimes more convenient to use the root-mean-square (RMS) error defined by

E_{\mathrm{RMS}} = \sqrt{2 E(\mathbf{w}^{\star}) / N}        (1.3)

in which the division by N allows us to compare different sizes of data sets on an equal footing, and the square root ensures that E_RMS is measured on the same scale (and in the same units) as the target variable t.
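The train/test comparison described above can be reproduced numerically. This is an illustrative sketch: the noise level (0.3) and the random seed are assumptions, so the printed numbers will not match the book's Figure 1.5, though the qualitative pattern (train error falls to zero at M = 9 while test error grows) is the same:

```python
import numpy as np

def rms_error(w, x, t):
    """E_RMS from (1.3): sqrt(2 E(w) / N), i.e. the root-mean-square residual."""
    residuals = np.vander(x, len(w), increasing=True) @ w - t
    return np.sqrt(np.mean(residuals ** 2))

rng = np.random.default_rng(1)                     # assumed seed and noise level
x_train = np.linspace(0.0, 1.0, 10)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.3, 10)
x_test = rng.uniform(0.0, 1.0, 100)
t_test = np.sin(2 * np.pi * x_test) + rng.normal(0.0, 0.3, 100)

for M in (0, 1, 3, 9):
    A = np.vander(x_train, M + 1, increasing=True)
    w = np.linalg.lstsq(A, t_train, rcond=None)[0]
    print(f"M={M}: train {rms_error(w, x_train, t_train):.3f}, "
          f"test {rms_error(w, x_test, t_test):.3f}")
```

Since 2E(w)/N is just the mean squared residual, E_RMS is computed here as the square root of that mean.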
Graphs of the training and test set RMS errors are shown, for various values of M, in Figure 1.5. The test set error is a measure of how well we are doing in predicting the values of t for new data observations of x. We note from Figure 1.5 that small values of M give relatively large values of the test set error, and this can be attributed to the fact that the corresponding polynomials are rather inflexible and are incapable of capturing the oscillations in the function sin(2πx). Values of M in the range 3 ≤ M ≤ 8 give small values for the test set error, and these also give reasonable representations of the generating function sin(2πx), as can be seen, for the case of M = 3, from Figure 1.4.
[Figure 1.5: Graphs of the root-mean-square error, defined by (1.3), evaluated on the training set and on an independent test set for various values of M.]

For M = 9, the training set error goes to zero, as we might expect because this polynomial contains 10 degrees of freedom corresponding to the 10 coefficients w_0, . . . , w_9, and so can be tuned exactly to the 10 data points in the training set. However, the test set error has become very large and, as we saw in Figure 1.4, the corresponding function y(x, w⋆) exhibits wild oscillations. This may seem paradoxical because a polynomial of given order contains all lower order polynomials as special cases.
The M = 9 polynomial is therefore capable of generating results at least as good as the M = 3 polynomial. Furthermore, we might suppose that the best predictor of new data would be the function sin(2πx) from which the data was generated (and we shall see later that this is indeed the case). We know that a power series expansion of the function sin(2πx) contains terms of all orders, so we might expect that results should improve monotonically as we increase M. We can gain some insight into the problem by examining the values of the coefficients w⋆ obtained from polynomials of various order, as shown in Table 1.1. We see that, as M increases, the magnitude of the coefficients typically gets larger. In particular for the M = 9 polynomial, the coefficients have become finely tuned to the data by developing large positive and negative values so that the correspond-

Table 1.1: Table of the coefficients w⋆ for polynomials of various order. Observe how the typical magnitude of the coefficients increases dramatically as the order of the polynomial increases.

            M = 0     M = 1     M = 6          M = 9
w0⋆          0.19      0.82      0.31           0.35
w1⋆                   -1.27      7.99         232.37
w2⋆                            -25.43       -5321.83
w3⋆                             17.37       48568.31
w4⋆                                       -231639.30
w5⋆                                        640042.26
w6⋆                                      -1061800.52
w7⋆                                       1042400.18
w8⋆                                       -557682.99
w9⋆                                        125201.43
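The growth of coefficient magnitudes with M in Table 1.1 is easy to observe on synthetic data. The seed and noise level below are assumptions, so the actual values will differ from Table 1.1, but the pattern — modest coefficients at low order, enormous positive and negative coefficients at M = 9 — reappears:

```python
import numpy as np

# Assumed data-generation settings (seed and noise level are illustrative)
rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, 10)

for M in (0, 1, 6, 9):
    A = np.vander(x, M + 1, increasing=True)
    w = np.linalg.lstsq(A, t, rcond=None)[0]
    print(f"M={M}: max |w_j| = {np.max(np.abs(w)):.2f}")
```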