Bishop C.M. Pattern Recognition and Machine Learning (2006), page 35
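Thus $\mathbf{x}$ will always appear in the set of conditioning variables, and so from now on we will drop the explicit $\mathbf{x}$ from expressions such as $p(\mathbf{t}|\mathbf{x}, \mathbf{w}, \beta)$ in order to keep the notation uncluttered. Taking the logarithm of the likelihood function, and making use of the standard form (1.46) for the univariate Gaussian, we have

$$\ln p(\mathbf{t}|\mathbf{w}, \beta) = \sum_{n=1}^{N} \ln \mathcal{N}\left(t_n \,|\, \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1}\right) = \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi) - \beta E_D(\mathbf{w}) \tag{3.11}$$

where the sum-of-squares error function is defined by

$$E_D(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N} \left\{t_n - \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_n)\right\}^2. \tag{3.12}$$

Having written down the likelihood function, we can use maximum likelihood to determine $\mathbf{w}$ and $\beta$. Consider first the maximization with respect to $\mathbf{w}$.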
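As observed already in Section 1.2.5, we see that maximization of the likelihood function under a conditional Gaussian noise distribution for a linear model is equivalent to minimizing a sum-of-squares error function given by $E_D(\mathbf{w})$. The gradient of the log likelihood function (3.11) with respect to $\mathbf{w}$ takes the form

$$\nabla \ln p(\mathbf{t}|\mathbf{w}, \beta) = \beta \sum_{n=1}^{N} \left\{t_n - \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_n)\right\} \boldsymbol{\phi}(\mathbf{x}_n)^{\mathrm{T}}. \tag{3.13}$$

Setting this gradient to zero, and noting that the factor $\beta > 0$ drops out, gives

$$0 = \sum_{n=1}^{N} t_n \boldsymbol{\phi}(\mathbf{x}_n)^{\mathrm{T}} - \mathbf{w}^{\mathrm{T}} \left( \sum_{n=1}^{N} \boldsymbol{\phi}(\mathbf{x}_n)\boldsymbol{\phi}(\mathbf{x}_n)^{\mathrm{T}} \right). \tag{3.14}$$

Solving for $\mathbf{w}$ we obtain

$$\mathbf{w}_{\mathrm{ML}} = \left(\boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi}\right)^{-1} \boldsymbol{\Phi}^{\mathrm{T}} \mathbf{t} \tag{3.15}$$

which are known as the normal equations for the least squares problem. Here $\boldsymbol{\Phi}$ is an $N \times M$ matrix, called the design matrix, whose elements are given by $\Phi_{nj} = \phi_j(\mathbf{x}_n)$, so that

$$\boldsymbol{\Phi} = \begin{pmatrix} \phi_0(\mathbf{x}_1) & \phi_1(\mathbf{x}_1) & \cdots & \phi_{M-1}(\mathbf{x}_1) \\ \phi_0(\mathbf{x}_2) & \phi_1(\mathbf{x}_2) & \cdots & \phi_{M-1}(\mathbf{x}_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_0(\mathbf{x}_N) & \phi_1(\mathbf{x}_N) & \cdots & \phi_{M-1}(\mathbf{x}_N) \end{pmatrix}. \tag{3.16}$$

The quantity

$$\boldsymbol{\Phi}^{\dagger} \equiv \left(\boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi}\right)^{-1} \boldsymbol{\Phi}^{\mathrm{T}} \tag{3.17}$$

is known as the Moore-Penrose pseudo-inverse of the matrix $\boldsymbol{\Phi}$ (Rao and Mitra, 1971; Golub and Van Loan, 1996).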
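The normal equations (3.15) and the pseudo-inverse (3.17) can be sketched numerically as follows. This is a minimal illustration, not part of the original text: the polynomial basis $\phi_j(x) = x^j$, the helper name `design_matrix`, and the synthetic data are all assumptions chosen for the demo.

```python
import numpy as np

def design_matrix(x, M):
    """Return the N x M design matrix for the basis phi_j(x) = x**j.

    The polynomial basis is an illustrative assumption; phi_0(x) = 1
    acts as the bias basis function.
    """
    x = np.asarray(x, dtype=float)
    # Columns are x**0, x**1, ..., x**(M-1), matching the layout of (3.16).
    return np.vander(x, M, increasing=True)

rng = np.random.default_rng(0)
N, M = 50, 4
x = rng.uniform(-1.0, 1.0, size=N)
w_true = np.array([0.5, -1.0, 2.0, 0.3])   # made-up weights for the demo
Phi = design_matrix(x, M)
t = Phi @ w_true + rng.normal(scale=0.1, size=N)

# Normal equations (3.15): solve (Phi^T Phi) w = Phi^T t instead of
# forming the inverse explicitly.
w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# The same estimate via the Moore-Penrose pseudo-inverse (3.17).
w_pinv = np.linalg.pinv(Phi) @ t

# At the solution the gradient (3.13) vanishes: Phi^T (t - Phi w) = 0.
grad = Phi.T @ (t - Phi @ w_ml)
```

Solving the linear system rather than inverting $\boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi}$ is a common numerical choice; both routes give the same weights here because this design matrix is well conditioned.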
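The pseudo-inverse can be regarded as a generalization of the notion of matrix inverse to nonsquare matrices. Indeed, if $\boldsymbol{\Phi}$ is square and invertible, then using the property $(\mathbf{A}\mathbf{B})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1}$ we have $(\boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^{\mathrm{T}} = \boldsymbol{\Phi}^{-1}(\boldsymbol{\Phi}^{\mathrm{T}})^{-1}\boldsymbol{\Phi}^{\mathrm{T}}$, and so we see that $\boldsymbol{\Phi}^{\dagger} \equiv \boldsymbol{\Phi}^{-1}$.

At this point, we can gain some insight into the role of the bias parameter $w_0$. If we make the bias parameter explicit, then the error function (3.12) becomes

$$E_D(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N} \Big\{ t_n - w_0 - \sum_{j=1}^{M-1} w_j \phi_j(\mathbf{x}_n) \Big\}^2. \tag{3.18}$$

Setting the derivative with respect to $w_0$ equal to zero, and solving for $w_0$, we obtain

$$w_0 = \bar{t} - \sum_{j=1}^{M-1} w_j \bar{\phi}_j \tag{3.19}$$

where we have defined

$$\bar{t} = \frac{1}{N}\sum_{n=1}^{N} t_n, \qquad \bar{\phi}_j = \frac{1}{N}\sum_{n=1}^{N} \phi_j(\mathbf{x}_n). \tag{3.20}$$

Thus the bias $w_0$ compensates for the difference between the averages (over the training set) of the target values and the weighted sum of the averages of the basis function values.

We can also maximize the log likelihood function (3.11) with respect to the noise precision parameter $\beta$, giving

$$\frac{1}{\beta_{\mathrm{ML}}} = \frac{1}{N}\sum_{n=1}^{N} \left\{ t_n - \mathbf{w}_{\mathrm{ML}}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_n) \right\}^2 \tag{3.21}$$

and so we see that the inverse of the noise precision is given by the residual variance of the target values around the regression function.

Figure 3.2: Geometrical interpretation of the least-squares solution, in an $N$-dimensional space whose axes are the values of $t_1, \ldots, t_N$. The least-squares regression function is obtained by finding the orthogonal projection of the data vector $\mathbf{t}$ onto the subspace spanned by the basis functions $\phi_j(\mathbf{x})$, in which each basis function is viewed as a vector $\boldsymbol{\varphi}_j$ of length $N$ with elements $\phi_j(\mathbf{x}_n)$. (Labels in the figure: $\mathcal{S}$, $\boldsymbol{\varphi}_1$, $\boldsymbol{\varphi}_2$, $\mathbf{t}$, $\mathbf{y}$.)

3.1.2 Geometry of least squares

At this point, it is instructive to consider the geometrical interpretation of the least-squares solution (see Exercise 3.2).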
To do this we consider an $N$-dimensional space whose axes are given by the $t_n$, so that $\mathbf{t} = (t_1, \ldots, t_N)^{\mathrm{T}}$ is a vector in this space. Each basis function $\phi_j(\mathbf{x}_n)$, evaluated at the $N$ data points, can also be represented as a vector in the same space, denoted by $\boldsymbol{\varphi}_j$, as illustrated in Figure 3.2. Note that $\boldsymbol{\varphi}_j$ corresponds to the $j$th column of $\boldsymbol{\Phi}$, whereas $\boldsymbol{\phi}(\mathbf{x}_n)$ corresponds to the $n$th row of $\boldsymbol{\Phi}$. If the number $M$ of basis functions is smaller than the number $N$ of data points, then the $M$ vectors $\boldsymbol{\varphi}_j$ will span a linear subspace $\mathcal{S}$ of dimensionality $M$.
We define $\mathbf{y}$ to be an $N$-dimensional vector whose $n$th element is given by $y(\mathbf{x}_n, \mathbf{w})$, where $n = 1, \ldots, N$. Because $\mathbf{y}$ is an arbitrary linear combination of the vectors $\boldsymbol{\varphi}_j$, it can live anywhere in the $M$-dimensional subspace. The sum-of-squares error (3.12) is then equal (up to a factor of 1/2) to the squared Euclidean distance between $\mathbf{y}$ and $\mathbf{t}$.
Thus the least-squares solution for $\mathbf{w}$ corresponds to that choice of $\mathbf{y}$ that lies in subspace $\mathcal{S}$ and that is closest to $\mathbf{t}$. Intuitively, from Figure 3.2, we anticipate that this solution corresponds to the orthogonal projection of $\mathbf{t}$ onto the subspace $\mathcal{S}$. This is indeed the case, as can easily be verified by noting that the solution for $\mathbf{y}$ is given by $\boldsymbol{\Phi}\mathbf{w}_{\mathrm{ML}}$, and then confirming that this takes the form of an orthogonal projection.

In practice, a direct solution of the normal equations can lead to numerical difficulties when $\boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi}$ is close to singular. In particular, when two or more of the basis vectors $\boldsymbol{\varphi}_j$ are collinear, or nearly so, the resulting parameter values can have large magnitudes. Such near degeneracies will not be uncommon when dealing with real data sets. The resulting numerical difficulties can be addressed using the technique of singular value decomposition, or SVD (Press et al., 1992; Bishop and Nabney, 2008).
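Both points above, the orthogonal-projection property and the use of an SVD-based solver for nearly degenerate basis vectors, can be checked with a short sketch. The data and the deliberately near-collinear third column are made up for illustration; `np.linalg.lstsq` is used here as one readily available SVD-based least-squares routine, not as the specific method of the cited references.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 30
x = rng.uniform(0.0, 1.0, size=N)

# Three basis vectors; the third is almost a copy of the second, so
# Phi^T Phi is close to singular (an artificial near-degeneracy).
Phi = np.column_stack([np.ones(N), x, x + 1e-7 * rng.normal(size=N)])
t = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=N)

# SVD-based least squares; Phi^T Phi is never formed explicitly.
w_svd, _, rank, sing_vals = np.linalg.lstsq(Phi, t, rcond=None)

y = Phi @ w_svd        # fitted vector, lies in the subspace S
residual = t - y       # orthogonal to every column of Phi at the solution

# Ratio of largest to smallest singular value of Phi; the condition
# number of Phi^T Phi is the square of this.
cond_Phi = sing_vals[0] / sing_vals[-1]
print("condition number of Phi:", cond_Phi)
print("weights:", w_svd)   # the two near-collinear weights are typically large
```

Because the residual is orthogonal to the subspace, $\|\mathbf{t}\|^2 = \|\mathbf{y}\|^2 + \|\mathbf{t} - \mathbf{y}\|^2$ holds up to rounding error, which is the Pythagorean signature of an orthogonal projection.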
Note that the addition of a regularization term ensures that the matrix being inverted is nonsingular, even in the presence of degeneracies.

3.1.3 Sequential learning

Batch techniques, such as the maximum likelihood solution (3.15), which involve processing the entire training set in one go, can be computationally costly for large data sets. As we have discussed in Chapter 1, if the data set is sufficiently large, it may be worthwhile to use sequential algorithms, also known as on-line algorithms, in which the data points are considered one at a time, and the model parameters are updated after each such presentation.
Sequential learning is also appropriate for real-time applications in which the data observations are arriving in a continuous stream, and predictions must be made before all of the data points are seen.

We can obtain a sequential learning algorithm by applying the technique of stochastic gradient descent, also known as sequential gradient descent, as follows. If the error function comprises a sum over data points $E = \sum_n E_n$, then after presentation of pattern $n$, the stochastic gradient descent algorithm updates the parameter vector $\mathbf{w}$ using

$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \nabla E_n \tag{3.22}$$

where $\tau$ denotes the iteration number, and $\eta$ is a learning rate parameter. We shall discuss the choice of value for $\eta$ shortly.
The value of $\mathbf{w}$ is initialized to some starting vector $\mathbf{w}^{(0)}$. For the case of the sum-of-squares error function (3.12), this gives

$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} + \eta \left( t_n - \mathbf{w}^{(\tau)\mathrm{T}} \boldsymbol{\phi}_n \right) \boldsymbol{\phi}_n \tag{3.23}$$

where $\boldsymbol{\phi}_n = \boldsymbol{\phi}(\mathbf{x}_n)$. This is known as least-mean-squares or the LMS algorithm. The value of $\eta$ needs to be chosen with care to ensure that the algorithm converges (Bishop and Nabney, 2008).

3.1.4 Regularized least squares

In Section 1.1, we introduced the idea of adding a regularization term to an error function in order to control over-fitting, so that the total error function to be minimized takes the form

$$E_D(\mathbf{w}) + \lambda E_W(\mathbf{w}) \tag{3.24}$$

where $\lambda$ is the regularization coefficient that controls the relative importance of the data-dependent error $E_D(\mathbf{w})$ and the regularization term $E_W(\mathbf{w})$. One of the simplest forms of regularizer is given by the sum-of-squares of the weight vector elements

$$E_W(\mathbf{w}) = \frac{1}{2}\mathbf{w}^{\mathrm{T}}\mathbf{w}. \tag{3.25}$$

If we also consider the sum-of-squares error function, identical to (3.12), given by

$$E_D(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N} \left\{ t_n - \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_n) \right\}^2 \tag{3.26}$$

then the total error function becomes

$$\frac{1}{2}\sum_{n=1}^{N} \left\{ t_n - \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_n) \right\}^2 + \frac{\lambda}{2}\mathbf{w}^{\mathrm{T}}\mathbf{w}. \tag{3.27}$$

This particular choice of regularizer is known in the machine learning literature as weight decay because in sequential learning algorithms, it encourages weight values to decay towards zero, unless supported by the data.
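The LMS update (3.23), together with the weight-decay behaviour just described, can be sketched as below. The function name `lms`, its parameters, and the synthetic data are hypothetical. In particular, applying the decay term at every single-pattern step with coefficient `weight_decay` is an assumption of this sketch; that coefficient is not the same quantity as $\lambda$ in (3.27), which multiplies a single regularizer for the whole data set.

```python
import numpy as np

def lms(Phi, t, eta=0.05, n_epochs=50, weight_decay=0.0, seed=0):
    """Sequential gradient descent for the sum-of-squares error.

    Implements update (3.23): w <- w + eta * (t_n - w^T phi_n) * phi_n.
    If weight_decay > 0, an extra term -eta * weight_decay * w is applied
    at every step (an assumption of this sketch, see the text above).
    """
    rng = np.random.default_rng(seed)
    N, M = Phi.shape
    w = np.zeros(M)                         # starting vector w^(0)
    for _ in range(n_epochs):
        for n in rng.permutation(N):        # present patterns one at a time
            phi_n = Phi[n]
            error = t[n] - w @ phi_n
            w = w + eta * (error * phi_n - weight_decay * w)
    return w

rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=200)
Phi = np.column_stack([np.ones_like(x), x])     # phi_0 = 1, phi_1 = x
t = 0.3 - 0.7 * x + rng.normal(scale=0.05, size=x.size)

w_lms = lms(Phi, t)                               # plain LMS
w_decay = lms(Phi, t, weight_decay=1.0)           # LMS with weight decay
w_batch = np.linalg.lstsq(Phi, t, rcond=None)[0]  # batch solution for reference
```

With a fixed learning rate the sequential estimate fluctuates around the batch solution rather than converging to it exactly, and too large a value of `eta` makes the iteration diverge, which is why the choice of $\eta$ needs care.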
In statistics, this regularizer provides an example of a parameter shrinkage method because it shrinks parameter values towards zero. It has the advantage that the error function remains a quadratic function of $\mathbf{w}$, and so its exact minimizer can be found in closed form. Specifically, setting the gradient of (3.27) with respect to $\mathbf{w}$ to zero, and solving for $\mathbf{w}$ as before, we obtain

$$\mathbf{w} = \left( \lambda \mathbf{I} + \boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi} \right)^{-1} \boldsymbol{\Phi}^{\mathrm{T}} \mathbf{t}. \tag{3.28}$$

This represents a simple extension of the least-squares solution (3.15).

Figure 3.3: Contours of the regularization term in (3.29) for various values of the parameter $q$ (panels: $q = 0.5$, $q = 1$, $q = 2$, $q = 4$).

A more general regularizer is sometimes used, for which the regularized error takes the form

$$\frac{1}{2}\sum_{n=1}^{N} \left\{ t_n - \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_n) \right\}^2 + \frac{\lambda}{2}\sum_{j=1}^{M} |w_j|^q \tag{3.29}$$

where $q = 2$ corresponds to the quadratic regularizer (3.27). See also Exercise 3.5.
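The closed-form solution (3.28) is short enough to sketch directly. The helper name `ridge_fit`, the polynomial basis, the sinusoidal toy data, and the particular values of $\lambda$ are assumptions made for illustration.

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    """Return the minimizer (3.28) of the regularized error (3.27)."""
    M = Phi.shape[1]
    # Solve (lam * I + Phi^T Phi) w = Phi^T t without forming the inverse.
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

rng = np.random.default_rng(3)
N, M = 15, 10
x = np.linspace(0.0, 1.0, N)
Phi = np.vander(x, M, increasing=True)   # polynomial basis (an assumption)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)

lams = (1e-3, 1.0, 10.0)
norms = []
for lam in lams:
    w = ridge_fit(Phi, t, lam)
    norms.append(np.linalg.norm(w))
    print(f"lambda = {lam:g}   ||w|| = {norms[-1]:.3g}")
```

The printed norms decrease as $\lambda$ grows, which is the shrinkage effect described above; none of the weights becomes exactly zero for $q = 2$.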
Figure 3.3 shows contours of the regularization function for different values of $q$.

The case of $q = 1$ is known as the lasso in the statistics literature (Tibshirani, 1996). It has the property that if $\lambda$ is sufficiently large, some of the coefficients $w_j$ are driven to zero, leading to a sparse model in which the corresponding basis functions play no role.
To see this, we first note that minimizing (3.29) is equivalent to minimizing the unregularized sum-of-squares error (3.12) subject to the constraint

$$\sum_{j=1}^{M} |w_j|^q \leqslant \eta \tag{3.30}$$

for an appropriate value of the parameter $\eta$, where the two approaches can be related using Lagrange multipliers (Appendix E). The origin of the sparsity can be seen from Figure 3.4, which shows the minimum of the error function subject to the constraint (3.30). As $\lambda$ is increased, an increasing number of parameters are driven to zero.

Regularization allows complex models to be trained on data sets of limited size without severe over-fitting, essentially by limiting the effective model complexity. However, the problem of determining the optimal model complexity is then shifted from one of finding the appropriate number of basis functions to one of determining a suitable value of the regularization coefficient $\lambda$. We shall return to the issue of model complexity later in this chapter.
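The sparsity produced by the $q = 1$ regularizer can be observed numerically. The text does not say how to minimize (3.29) for $q = 1$; the sketch below swaps in cyclic coordinate descent with soft thresholding, a standard technique chosen here as an assumption. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def soft_threshold(z, gamma):
    """Shrink z towards zero by gamma, clipping at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_cd(Phi, t, lam, n_sweeps=200):
    """Minimize (3.29) with q = 1 by cyclic coordinate descent.

    Objective: 0.5 * ||t - Phi w||^2 + 0.5 * lam * sum_j |w_j|.
    """
    N, M = Phi.shape
    w = np.zeros(M)
    col_sq = np.sum(Phi ** 2, axis=0)       # phi_j^T phi_j for each column
    r = t - Phi @ w                         # current residual
    for _ in range(n_sweeps):
        for j in range(M):
            r += Phi[:, j] * w[j]           # residual with column j removed
            rho = Phi[:, j] @ r
            w[j] = soft_threshold(rho, 0.5 * lam) / col_sq[j]
            r -= Phi[:, j] * w[j]           # restore with the updated weight
    return w

rng = np.random.default_rng(4)
N, M = 100, 8
Phi = rng.normal(size=(N, M))
w_true = np.array([3.0, 0.0, 0.0, -2.0, 0.0, 0.0, 0.0, 1.0])  # made-up weights
t = Phi @ w_true + rng.normal(scale=0.1, size=N)

w_ls = lasso_cd(Phi, t, 0.0)         # lam = 0: ordinary least squares
w_sparse = lasso_cd(Phi, t, 50.0)    # moderate lam: some weights exactly zero
w_zero = lasso_cd(Phi, t, 5000.0)    # very large lam: every weight is zero

for name, w in (("lam=0", w_ls), ("lam=50", w_sparse), ("lam=5000", w_zero)):
    print(name, "nonzero weights:", int(np.count_nonzero(w)))
```

Unlike the quadratic case, the soft-threshold step sets a weight to exactly zero whenever the correlation of its basis vector with the residual does not exceed $\lambda/2$ in magnitude, which is where the sparse solutions come from.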