Of course, in the case of the RVM we always have the option of starting with a smaller number of basis functions than N + 1. More significantly, in the relevance vector machine the parameters governing complexity and noise variance are determined automatically from a single training run, whereas in the support vector machine the parameters C and ε (or ν) are generally found using cross-validation, which involves multiple training runs. Furthermore, in the next section we shall derive an alternative procedure for training the relevance vector machine that improves training speed significantly.

7.2.2 Analysis of sparsity

We have noted earlier that the mechanism of automatic relevance determination causes a subset of parameters to be driven to zero. We now examine in more detail the mechanism of sparsity in the context of the relevance vector machine. In the process, we will arrive at a significantly faster procedure for optimizing the hyperparameters compared to the direct techniques given above.
Figure 7.10 Illustration of the mechanism for sparsity in a Bayesian linear regression model, showing a training set vector of target values given by t = (t1, t2)^T, indicated by the cross, for a model with one basis vector ϕ = (φ(x1), φ(x2))^T, which is poorly aligned with the target data vector t. On the left we see a model having only isotropic noise, so that C = β^{-1} I, corresponding to α = ∞, with β set to its most probable value. On the right we see the same model but with a finite value of α. In each case the red ellipse corresponds to unit Mahalanobis distance, with |C| taking the same value for both plots, while the dashed green circle shows the contribution arising from the noise term β^{-1}. We see that any finite value of α reduces the probability of the observed data, and so for the most probable solution the basis vector is removed.
Before proceeding with a mathematical analysis, we first give some informal insight into the origin of sparsity in Bayesian linear models. Consider a data set comprising N = 2 observations t1 and t2, together with a model having a single basis function φ(x), with hyperparameter α, along with isotropic noise having precision β. From (7.85), the marginal likelihood is given by p(t|α, β) = N(t|0, C), in which the covariance matrix takes the form

C = (1/β) I + (1/α) ϕ ϕ^T                                                 (7.92)

where ϕ denotes the N-dimensional vector (φ(x1), φ(x2))^T, and similarly t = (t1, t2)^T.
Notice that this is just a zero-mean Gaussian process model over t with covariance C. Given a particular observation for t, our goal is to find α and β by maximizing the marginal likelihood. We see from Figure 7.10 that, if there is a poor alignment between the direction of ϕ and that of the training data vector t, then the corresponding hyperparameter α will be driven to ∞, and the basis vector will be pruned from the model. This arises because any finite value of α will always assign a lower probability to the data, provided that β is set to its optimal value: a finite α elongates the distribution along the direction of ϕ, away from the data, thereby increasing the probability mass in regions away from the observed data and hence reducing the value of the density at the target data vector itself.
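The effect shown in Figure 7.10 is easy to reproduce numerically. The following sketch (illustrative values only, not taken from the text; β is simply held fixed rather than set to its most probable value, and the basis vectors are hypothetical) evaluates the log marginal likelihood ln N(t|0, C), with C given by (7.92), over a grid of α for a poorly aligned and a well aligned basis vector.

```python
import numpy as np
from scipy.stats import multivariate_normal

beta = 4.0                                   # assumed noise precision, held fixed here
t = np.array([1.0, 1.0])                     # observed target vector (the cross in Figure 7.10)
candidates = {
    "poorly aligned": np.array([1.0, -1.0]), # (nearly) orthogonal to t
    "well aligned":   np.array([1.0,  0.9]), # close to the direction of t
}

def log_marginal(alpha, phi):
    C = np.eye(2) / beta + np.outer(phi, phi) / alpha   # covariance (7.92)
    return multivariate_normal(np.zeros(2), C).logpdf(t)

log_pruned = multivariate_normal(np.zeros(2), np.eye(2) / beta).logpdf(t)  # alpha -> infinity
for name, phi in candidates.items():
    best_finite = max(log_marginal(a, phi) for a in np.logspace(-2, 4, 200))
    print(f"{name:15s}  best finite alpha: {best_finite:8.3f}   alpha = inf: {log_pruned:8.3f}")
```

For the poorly aligned vector no finite α exceeds the pruned (α = ∞) solution, whereas for the well aligned vector a finite α gives a strictly larger marginal likelihood, in line with the qi^2 versus si criterion derived below.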
For the more general case of M basis vectors ϕ1, . . . , ϕM a similar intuition holds, namely that if a particular basis vector is poorly aligned with the data vector t, then it is likely to be pruned from the model.

We now investigate the mechanism for sparsity from a more mathematical perspective, for a general case involving M basis functions. To motivate this analysis we first note that, in the result (7.87) for re-estimating the parameter αi, the terms on the right-hand side are themselves also functions of αi. These results therefore represent implicit solutions, and iteration would be required even to determine a single αi with all other αj for j ≠ i held fixed.

This suggests a different approach to solving the optimization problem for the RVM, in which we make explicit all of the dependence of the marginal likelihood (7.85) on a particular αi and then determine its stationary points explicitly (Faul and Tipping, 2002; Tipping and Faul, 2003).
To do this, we first pull out the contribution from αi in the matrix C defined by (7.86) to give

C = β^{-1} I + Σ_{j≠i} αj^{-1} ϕj ϕj^T + αi^{-1} ϕi ϕi^T
  = C−i + αi^{-1} ϕi ϕi^T                                                 (7.93)

where ϕi denotes the ith column of Φ, in other words the N-dimensional vector with elements (φi(x1), . . . , φi(xN))^T, in contrast to φn, which denotes the nth row of Φ. The matrix C−i represents the matrix C with the contribution from basis function i removed. Using the matrix identities (C.7) and (C.15), the determinant and inverse of C can then be written

|C| = |C−i| |1 + αi^{-1} ϕi^T C−i^{-1} ϕi|                                (7.94)

C^{-1} = C−i^{-1} − (C−i^{-1} ϕi ϕi^T C−i^{-1}) / (αi + ϕi^T C−i^{-1} ϕi)  (7.95)

Exercise 7.15
Using these results, we can then write the log marginal likelihood function (7.85) in the form

L(α) = L(α−i) + λ(αi)                                                     (7.96)

where L(α−i) is simply the log marginal likelihood with basis function ϕi omitted, and the quantity λ(αi) is defined by

λ(αi) = (1/2) [ ln αi − ln(αi + si) + qi^2 / (αi + si) ]                  (7.97)

and contains all of the dependence on αi.
Here we have introduced the two quantities

si = ϕi^T C−i^{-1} ϕi                                                     (7.98)

qi = ϕi^T C−i^{-1} t.                                                     (7.99)

Here si is called the sparsity and qi is known as the quality of ϕi, and as we shall see, a large value of si relative to the value of qi means that the basis function ϕi is more likely to be pruned from the model. The 'sparsity' measures the extent to which basis function ϕi overlaps with the other basis vectors in the model, and the 'quality' represents a measure of the alignment of the basis vector ϕi with the error between the training set values t = (t1, . . . , tN)^T and the vector y−i of predictions that would result from the model with the vector ϕi excluded (Tipping and Faul, 2003).

Figure 7.11 Plots of the log marginal likelihood λ(αi) versus ln αi, showing on the left the single maximum at a finite αi for qi^2 = 4 and si = 1 (so that qi^2 > si), and on the right the maximum at αi = ∞ for qi^2 = 1 and si = 2 (so that qi^2 < si).
The stationary points of the marginal likelihood with respect to αi occur when the derivative

dλ(αi)/dαi = [ αi^{-1} si^2 − (qi^2 − si) ] / [ 2 (αi + si)^2 ]           (7.100)

is equal to zero. There are two possible forms for the solution. Recalling that αi ≥ 0, we see that if qi^2 < si, then αi → ∞ provides a solution.
Conversely, if qi^2 > si, we can solve for αi to obtain

αi = si^2 / (qi^2 − si).                                                  (7.101)

Exercise 7.16
These two solutions are illustrated in Figure 7.11. We see that the relative size of the quality and sparsity terms determines whether a particular basis vector will be pruned from the model or not. A more complete analysis (Faul and Tipping, 2002), based on the second derivatives of the marginal likelihood, confirms these solutions are indeed the unique maxima of λ(αi).

Note that this approach has yielded a closed-form solution for αi, for given values of the other hyperparameters.
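As a quick sanity check, the following sketch (toy values chosen to match the two panels of Figure 7.11, not code from the text) evaluates λ(αi) from (7.97) on a grid and compares the location of its maximum with the closed-form solution (7.101).

```python
import numpy as np

def lam(alpha_i, s_i, q_i):
    # log marginal likelihood contribution lambda(alpha_i), equation (7.97)
    return 0.5 * (np.log(alpha_i) - np.log(alpha_i + s_i) + q_i**2 / (alpha_i + s_i))

grid = np.logspace(-3.0, 3.0, 2001)          # grid of candidate values for alpha_i

# Left panel of Figure 7.11: q_i^2 = 4 > s_i = 1, so a single finite maximum exists
s_i, q_i = 1.0, 2.0
alpha_star = s_i**2 / (q_i**2 - s_i)         # closed-form stationary point (7.101)
alpha_grid = grid[np.argmax(lam(grid, s_i, q_i))]
print(alpha_star, alpha_grid)                # both close to 1/3

# Right panel: q_i^2 = 1 < s_i = 2, so lambda keeps increasing towards alpha_i -> infinity
print(np.argmax(lam(grid, 2.0, 1.0)) == grid.size - 1)   # True: maximum at the edge of the grid
```

In the second case the grid maximum sits at the largest α value considered, consistent with the solution αi → ∞ and hence with pruning of the corresponding basis vector.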
As well as providing insight into the origin of sparsity in the RVM, this analysis also leads to a practical algorithm for optimizing the hyperparameters that has significant speed advantages. This uses a fixed set of candidate basis vectors, and then cycles through them in turn to decide whether each vector should be included in the model or not. The resulting sequential sparse Bayesian learning algorithm is described below.

Sequential Sparse Bayesian Learning Algorithm

1. If solving a regression problem, initialize β.
2. Initialize using one basis function ϕ1, with hyperparameter α1 set using (7.101), with the remaining hyperparameters αj for j ≠ 1 initialized to infinity, so that only ϕ1 is included in the model.
3. Evaluate Σ and m, along with qi and si for all basis functions.
4. Select a candidate basis function ϕi.
5. If qi^2 > si and αi < ∞, so that the basis vector ϕi is already included in the model, then update αi using (7.101).
6. If qi^2 > si and αi = ∞, then add ϕi to the model, and evaluate the hyperparameter αi using (7.101).
7. If qi^2 ≤ si and αi < ∞, then remove basis function ϕi from the model, and set αi = ∞.
8. If solving a regression problem, update β.
9. If converged, terminate; otherwise go to 3.

Note that if qi^2 ≤ si and αi = ∞, then the basis function ϕi is already excluded from the model and no action is required.
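As an illustration of the boxed algorithm, here is a compact sketch in Python. It is not a reference implementation: the noise precision β is held fixed (step 8 is omitted), the convergence test is replaced by a fixed number of sweeps, and C−i is rebuilt directly from (7.93) for every candidate, so this version does not enjoy the efficient O(M^3) cost of the update scheme described next. The kernel, data, and parameter values in the usage example are illustrative assumptions.

```python
import numpy as np

def sequential_rvm_regression(Phi, t, beta, n_sweeps=20):
    """Sketch of the sequential sparse Bayesian learning loop for regression (beta held fixed)."""
    N, M = Phi.shape
    alpha = np.full(M, np.inf)                # alpha_i = inf  <=>  basis function i currently pruned

    for sweep in range(n_sweeps):
        for i in range(M):                    # step 4: cycle through the candidate basis functions
            # C_{-i}: marginal covariance with the contribution of phi_i removed, as in (7.93)
            C_minus_i = np.eye(N) / beta
            for j in range(M):
                if j != i and np.isfinite(alpha[j]):
                    C_minus_i += np.outer(Phi[:, j], Phi[:, j]) / alpha[j]
            Cinv = np.linalg.inv(C_minus_i)
            s_i = Phi[:, i] @ Cinv @ Phi[:, i]       # sparsity, equation (7.98)
            q_i = Phi[:, i] @ Cinv @ t               # quality,  equation (7.99)
            if q_i**2 > s_i:
                alpha[i] = s_i**2 / (q_i**2 - s_i)   # steps 5 and 6: add or re-estimate via (7.101)
            else:
                alpha[i] = np.inf                    # step 7: prune the basis function
    return alpha

# Toy usage: noisy sinusoid, one Gaussian basis function centred on each input point.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 40)
t = np.sin(2.0 * np.pi * x) + 0.1 * rng.standard_normal(x.size)
Phi = np.exp(-0.5 * ((x[:, None] - x[None, :]) / 0.1) ** 2)
alpha = sequential_rvm_regression(Phi, t, beta=100.0)
print("active basis functions:", int(np.isfinite(alpha).sum()), "of", alpha.size)
```

Only a small fraction of the basis functions typically remain active, which is exactly the sparsity mechanism analysed above.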
In practice, it is convenient to evaluate the quantities

Qi = ϕi^T C^{-1} t                                                        (7.102)

Si = ϕi^T C^{-1} ϕi.                                                      (7.103)

The quality and sparseness variables can then be expressed in the form

qi = αi Qi / (αi − Si)                                                    (7.104)

si = αi Si / (αi − Si).                                                   (7.105)

Exercise 7.17
Note that when αi = ∞, we have qi = Qi and si = Si. Using (C.7), we can write

Qi = β ϕi^T t − β^2 ϕi^T Φ Σ Φ^T t                                        (7.106)

Si = β ϕi^T ϕi − β^2 ϕi^T Φ Σ Φ^T ϕi                                      (7.107)

where Φ and Σ involve only those basis vectors that correspond to finite hyperparameters αi. At each stage the required computations therefore scale like O(M^3), where M is the number of active basis vectors in the model and is typically much smaller than the number N of training patterns.

7.2.3 RVM for classification

We can extend the relevance vector machine framework to classification problems by applying the ARD prior over weights to a probabilistic linear classification model of the kind studied in Chapter 4. To start with, we consider two-class problems with a binary target variable t ∈ {0, 1}.