when a number of samples of the vector x are lost) and is observed as y. The observation y is the so-called incomplete data. Maximisation of the likelihood of the incomplete data, f_{Y;Θ}(y;θ), with respect to the parameter vector θ is often a difficult task, whereas maximisation of the likelihood of the complete data, f_{X;Θ}(x;θ), is relatively easy. Since the complete data is unavailable, the parameter estimate is obtained through maximisation of the conditional expectation of the log-likelihood of the complete data, defined as

E[ln f_{X;Θ}(x;θ) | y] = ∫_X f_{X|Y;Θ}(x|y;θ) ln f_{X;Θ}(x;θ) dx    (4.86)

In Equation (4.86), the computation of the term f_{X|Y;Θ}(x|y;θ) requires an estimate of the unknown parameter vector θ.
For this reason, the expectation of the likelihood function is maximised iteratively, starting with an initial estimate of θ and updating the estimate as described in the following.

[Figure 4.14 Illustration of the transformation of complete data to incomplete data: a signal process with parameter θ generates the complete data x, with pdf f_{X;Θ}(x;θ), which a non-invertible transformation maps to the incomplete data y, with pdf f_{Y;Θ}(y;θ).]

EM Algorithm

Step 1: Initialisation. Select an initial parameter estimate θ̂_0, and for i = 0, 1, ... until convergence:

Step 2: Expectation. Compute

U(θ, θ̂_i) = E[ln f_{X;Θ}(x;θ) | y; θ̂_i] = ∫_X f_{X|Y;Θ}(x|y; θ̂_i) ln f_{X;Θ}(x;θ) dx    (4.87)

Step 3: Maximisation. Select

θ̂_{i+1} = arg max_θ U(θ, θ̂_i)    (4.88)

Step 4: Convergence test. If not converged, go to Step 2.

A minimal worked example of this iteration is sketched below.
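As a concrete illustration of Steps 1–4, the following Python sketch runs the EM iteration on the classical multinomial example of Dempster et al. (1977), in which one observed cell count is the unresolved sum of two complete-data cells. The example, its counts and its parameterisation are taken from that paper rather than from this chapter, and serve only to make the loop structure explicit.

```python
# EM for the multinomial example of Dempster et al. (1977).
# Observed ("incomplete") counts fall in 4 cells with probabilities
#   (1/2 + p/4, (1-p)/4, (1-p)/4, p/4).
# The "complete" data splits the first cell into two sub-cells with
# probabilities 1/2 and p/4; the split is unobserved.

y1, y2, y3, y4 = 125, 18, 20, 34   # observed counts (Dempster et al.)
p = 0.5                            # Step 1: initial estimate

for i in range(100):
    # Step 2 (Expectation): expected count of the hidden sub-cell,
    # given the observations and the current estimate p.
    x2 = y1 * (p / 4) / (0.5 + p / 4)
    # Step 3 (Maximisation): ML estimate of p from the completed counts.
    p_new = (x2 + y4) / (x2 + y2 + y3 + y4)
    # Step 4 (Convergence test).
    if abs(p_new - p) < 1e-10:
        break
    p = p_new

print(p)
```

Step 2 replaces the unobserved split of the first cell by its conditional expectation under the current estimate, and Step 3 maximises the resulting complete-data likelihood in closed form; the estimate converges to p ≈ 0.6268.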
4.3.1 Convergence of the EM Algorithm

In this section, it is shown that the EM algorithm converges to a maximum of the likelihood of the incomplete data, f_{Y;Θ}(y;θ). The likelihood of the complete data can be written as

f_{X,Y;Θ}(x, y;θ) = f_{X|Y;Θ}(x|y;θ) f_{Y;Θ}(y;θ)    (4.89)

where f_{X,Y;Θ}(x,y;θ) is the likelihood of x and y with θ as a parameter. From Equation (4.89), the log-likelihood of the incomplete data is obtained as

ln f_{Y;Θ}(y;θ) = ln f_{X,Y;Θ}(x,y;θ) − ln f_{X|Y;Θ}(x|y;θ)    (4.90)

Using an estimate θ̂_i of the parameter vector θ, and taking the expectation of Equation (4.90) over the space of the complete signal x, we obtain

ln f_{Y;Θ}(y;θ) = U(θ; θ̂_i) − V(θ; θ̂_i)    (4.91)

where, for a given y, the expectation of ln f_{Y;Θ}(y;θ) is itself, and the function U(θ; θ̂_i) is the conditional expectation of ln f_{X,Y;Θ}(x,y;θ):

U(θ, θ̂_i) = E[ln f_{X,Y;Θ}(x,y;θ) | y; θ̂_i] = ∫_X f_{X|Y;Θ}(x|y; θ̂_i) ln f_{X,Y;Θ}(x,y;θ) dx    (4.92)

The function V(θ, θ̂_i) is the conditional expectation of ln f_{X|Y;Θ}(x|y;θ):

V(θ, θ̂_i) = E[ln f_{X|Y;Θ}(x|y;θ) | y; θ̂_i] = ∫_X f_{X|Y;Θ}(x|y; θ̂_i) ln f_{X|Y;Θ}(x|y;θ) dx    (4.93)

Now, from Equation (4.91), the log-likelihood of the incomplete data y with parameter estimate θ̂_i at iteration i is

ln f_{Y;Θ}(y; θ̂_i) = U(θ̂_i; θ̂_i) − V(θ̂_i; θ̂_i)    (4.94)

It can be shown (see Dempster et al., 1977) that the function V satisfies the inequality

V(θ̂_{i+1}; θ̂_i) ≤ V(θ̂_i; θ̂_i)    (4.95)

and in the maximisation step of EM we choose θ̂_{i+1} such that

U(θ̂_{i+1}; θ̂_i) ≥ U(θ̂_i; θ̂_i)    (4.96)

From Equation (4.94) and the inequalities (4.95) and (4.96), it follows that

ln f_{Y;Θ}(y; θ̂_{i+1}) ≥ ln f_{Y;Θ}(y; θ̂_i)    (4.97)

Therefore at every iteration of the EM algorithm the log-likelihood of the incomplete data does not decrease, and the estimate converges to a local maximum of the log-likelihood function ln f_{Y;Θ}(y;θ). This monotonic behaviour can be checked numerically, as in the sketch below.
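For the multinomial example given earlier, the incomplete-data log-likelihood ln f_{Y;Θ}(y; θ̂_i) can be written down directly (up to additive constants that do not depend on the parameter), so the inequality (4.97) can be verified numerically. The following is a minimal check under that set-up.

```python
import math

# Incomplete-data log-likelihood of the multinomial example of
# Dempster et al. (1977); constants independent of p are omitted.
def incomplete_loglik(p, y=(125, 18, 20, 34)):
    y1, y2, y3, y4 = y
    return (y1 * math.log(0.5 + p / 4)
            + (y2 + y3) * math.log((1 - p) / 4)
            + y4 * math.log(p / 4))

p, previous = 0.5, -math.inf
for i in range(20):
    current = incomplete_loglik(p)
    assert current >= previous - 1e-12       # inequality (4.97)
    previous = current
    x2 = 125 * (p / 4) / (0.5 + p / 4)       # expectation step
    p = (x2 + 34) / (x2 + 18 + 20 + 34)      # maximisation step
```

The assertion never fires: the log-likelihood increases monotonically towards its local maximum at p ≈ 0.6268.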
The EM algorithm is applied to the solution of a number of problems in this book. In Section 4.5 of this chapter, the estimation of the parameters of a mixture Gaussian model for the signal space of a recorded process is formulated in an EM framework. In Chapter 5, the EM algorithm is used for the estimation of the parameters of a hidden Markov model.

4.4 Cramer–Rao Bound on the Minimum Estimator Variance

An important measure of the performance of an estimator is the variance of the estimate over the varying values of the observation signal y and the parameter vector θ. The minimum estimation variance depends on the distributions of the parameter vector θ and on the observation signal y.
In this section, we first consider the lower bound on the variance of the estimates of a constant parameter, and then extend the results to random parameters.

The Cramer–Rao lower bound on the variance of the estimate of the ith coefficient θ_i of a parameter vector θ is given as

Var[θ̂_i(y)] ≥ (1 + ∂θ_Bias/∂θ_i)² / E[(∂ ln f_{Y|Θ}(y|θ)/∂θ_i)²]    (4.98)

An estimator that achieves the lower bound on the variance is called the minimum variance, or the most efficient, estimator.

Proof. The bias in the estimate θ̂_i(y) of the ith coefficient of the parameter vector θ, averaged over the observation space Y, is defined as

E[θ̂_i(y) − θ_i] = ∫_{−∞}^{∞} [θ̂_i(y) − θ_i] f_{Y|Θ}(y|θ) dy = θ_Bias    (4.99)

Differentiation of Equation (4.99) with respect to θ_i yields

∫_{−∞}^{∞} { [θ̂_i(y) − θ_i] ∂f_{Y|Θ}(y|θ)/∂θ_i − f_{Y|Θ}(y|θ) } dy = ∂θ_Bias/∂θ_i    (4.100)

For a probability density function we have

∫_{−∞}^{∞} f_{Y|Θ}(y|θ) dy = 1    (4.101)

Therefore Equation (4.100) can be written as

∫_{−∞}^{∞} [θ̂_i(y) − θ_i] ∂f_{Y|Θ}(y|θ)/∂θ_i dy = 1 + ∂θ_Bias/∂θ_i    (4.102)

Now, since the derivative of the integral of a pdf is zero, taking the derivative of Equation (4.101) and multiplying the result by θ_Bias yields

θ_Bias ∫_{−∞}^{∞} ∂f_{Y|Θ}(y|θ)/∂θ_i dy = 0    (4.103)

Substituting ∂f_{Y|Θ}(y|θ)/∂θ_i = f_{Y|Θ}(y|θ) ∂ln f_{Y|Θ}(y|θ)/∂θ_i into Equation (4.102), and using Equation (4.103), we obtain

∫_{−∞}^{∞} [θ̂_i(y) − θ_Bias − θ_i] (∂ln f_{Y|Θ}(y|θ)/∂θ_i) f_{Y|Θ}(y|θ) dy = 1 + ∂θ_Bias/∂θ_i    (4.104)

Now, squaring both sides of Equation (4.104), we obtain

( ∫_{−∞}^{∞} [θ̂_i(y) − θ_Bias − θ_i] (∂ln f_{Y|Θ}(y|θ)/∂θ_i) f_{Y|Θ}(y|θ) dy )² = (1 + ∂θ_Bias/∂θ_i)²    (4.105)

To the left-hand side of Equation (4.105), application of the Schwartz inequality

( ∫_{−∞}^{∞} f(y) g(y) dy )² ≤ ∫_{−∞}^{∞} (f(y))² dy × ∫_{−∞}^{∞} (g(y))² dy    (4.106)

(with f(y) = [θ̂_i(y) − θ_Bias − θ_i] f_{Y|Θ}^{1/2}(y|θ) and g(y) = (∂ln f_{Y|Θ}(y|θ)/∂θ_i) f_{Y|Θ}^{1/2}(y|θ)) yields

( ∫_{−∞}^{∞} [θ̂_i(y) − θ_Bias − θ_i] f_{Y|Θ}^{1/2}(y|θ) (∂ln f_{Y|Θ}(y|θ)/∂θ_i) f_{Y|Θ}^{1/2}(y|θ) dy )²
    ≤ ( ∫_{−∞}^{∞} [θ̂_i(y) − θ_Bias − θ_i]² f_{Y|Θ}(y|θ) dy ) × ( ∫_{−∞}^{∞} (∂ln f_{Y|Θ}(y|θ)/∂θ_i)² f_{Y|Θ}(y|θ) dy )    (4.107)

From Equations (4.105) and (4.107), we have

Var[θ̂_i(y)] × E[(∂ln f_{Y|Θ}(y|θ)/∂θ_i)²] ≥ (1 + ∂θ_Bias/∂θ_i)²    (4.108)

The Cramer–Rao inequality (4.98) results directly from the inequality (4.108). A simple numerical check of the bound (4.98) is sketched below.
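As a simple numerical check of the bound (4.98), consider the standard case (an illustrative example, not one worked in this chapter) of estimating the mean θ of a Gaussian observation of known variance σ² from N independent samples using the sample mean. The estimator is unbiased, so the bias term vanishes and the bound reduces to σ²/N; the Monte Carlo variance of the sample mean attains it.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, sigma, N, trials = 1.0, 2.0, 25, 200_000

# Sample-mean estimates of theta from `trials` independent experiments.
estimates = rng.normal(theta, sigma, size=(trials, N)).mean(axis=1)

empirical_var = estimates.var()
cramer_rao = sigma**2 / N        # E[(d ln f / d theta)^2] = N / sigma^2

print(empirical_var, cramer_rao)  # both close to 0.16
```

An estimator that is biased or less efficient than the sample mean would show an empirical variance above the bound.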
4.4.1 Cramer–Rao Bound for Random Parameters

For random parameters, the Cramer–Rao bound may be obtained using the same procedure as above, with the difference that in Equation (4.98), instead of the likelihood f_{Y|Θ}(y|θ), we use the joint pdf f_{Y,Θ}(y,θ), together with the logarithmic relation

∂ln f_{Y,Θ}(y,θ)/∂θ_i = ∂ln f_{Y|Θ}(y|θ)/∂θ_i + ∂ln f_Θ(θ)/∂θ_i    (4.109)

The Cramer–Rao bound for random parameters is obtained as

Var[θ̂_i(y)] ≥ (1 + ∂θ_Bias/∂θ_i)² / E[ (∂ln f_{Y|Θ}(y|θ)/∂θ_i)² + (∂ln f_Θ(θ)/∂θ_i)² ]    (4.110)

where the second term in the denominator of Equation (4.110) describes the effect of the prior pdf of θ. As expected, the use of the prior f_Θ(θ) can result in a decrease in the variance of the estimate. An alternative form of the minimum bound on the estimation variance can be obtained by using the likelihood relation

E[ (∂ln f_{Y,Θ}(y,θ)/∂θ_i)² ] = −E[ ∂²ln f_{Y,Θ}(y,θ)/∂θ_i² ]    (4.111)

as

Var[θ̂_i(y)] ≥ −(1 + ∂θ_Bias/∂θ_i)² / E[ ∂²ln f_{Y|Θ}(y|θ)/∂θ_i² + ∂²ln f_Θ(θ)/∂θ_i² ]    (4.112)

4.4.2 Cramer–Rao Bound for a Vector Parameter

For real-valued P-dimensional vector parameters, the Cramer–Rao bound for the covariance matrix of an unbiased estimator of θ is given by

Cov[θ̂] ≥ J^{−1}(θ)    (4.113)

where J is the P × P Fisher information matrix, with elements given by

[J(θ)]_{ij} = −E[ ∂²ln f_{Y,Θ}(y,θ)/∂θ_i ∂θ_j ]    (4.114)

The lower bound on the variance of the ith element of the vector θ is given by

Var(θ̂_i) ≥ [J^{−1}(θ)]_{ii} = 1 / E[ (∂ln f_{Y,Θ}(y,θ)/∂θ_i)² ]    (4.115)

where [J^{−1}(θ)]_{ii} is the ith diagonal element of the inverse of the Fisher matrix. A small numerical illustration of the Fisher matrix is sketched below.
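As a small numerical illustration of Equations (4.113)–(4.115), the sketch below estimates the Fisher information matrix for N independent Gaussian samples with parameter vector θ = (μ, σ²) by Monte Carlo averaging of outer products of the score vector, and compares the result with the well-known analytical form J = diag(N/σ², N/(2σ⁴)). The parameter is treated as fixed (only the likelihood contributes, with no prior term), and the sample sizes and values are illustrative assumptions, not taken from the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, var, N, trials = 0.0, 2.0, 10, 100_000

y = rng.normal(mu, np.sqrt(var), size=(trials, N))

# Score vector (gradient of the log-likelihood) for each trial.
score_mu  = ((y - mu) / var).sum(axis=1)
score_var = (-0.5 / var + (y - mu) ** 2 / (2 * var**2)).sum(axis=1)
scores = np.stack([score_mu, score_var], axis=1)

J_mc = scores.T @ scores / trials              # Monte Carlo Fisher matrix
J_exact = np.diag([N / var, N / (2 * var**2)])

print(J_mc)     # approximately equal to J_exact
print(J_exact)  # [[5, 0], [0, 1.25]]
```

The diagonal of the inverse of J then gives the bound (4.115) on the variance of unbiased estimates of μ and σ².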
[Figure 4.15 Illustration of probabilistic modelling of a two-dimensional signal space with a mixture of five bivariate Gaussian densities.]

4.5 Design of Mixture Gaussian Models

A practical method for the modelling of the probability density function of an arbitrary signal space is to fit (or "tile") the space with a mixture of a number of Gaussian probability density functions. Figure 4.15 illustrates the modelling of a two-dimensional signal space with a number of circular and elliptically shaped Gaussian processes. Note that the Gaussian densities can be overlapping, with the result that in an area of overlap a data point can be associated, with different probabilities, with different components of the Gaussian mixture.

A main advantage of the use of a mixture Gaussian model is that it results in mathematically tractable signal processing solutions. A mixture Gaussian pdf model for a process X is defined as

f_X(x) = Σ_{k=1}^{K} P_k N_k(x; μ_k, Σ_k)    (4.116)

where N_k(x; μ_k, Σ_k) denotes the kth component of the mixture Gaussian pdf, with mean vector μ_k and covariance matrix Σ_k. The parameter P_k is the prior probability of the kth mixture component, and it can be interpreted as the expected fraction of the number of vectors from the process X associated with the kth mixture. A minimal sketch of the evaluation of this mixture pdf is given below.
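As a minimal sketch of evaluating Equation (4.116) at a point x, the following assumes an illustrative two-dimensional model with K = 3 components; the weights, mean vectors and covariance matrices below are made-up values chosen only for the example.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative K = 3 mixture in two dimensions (made-up parameters).
P     = [0.5, 0.3, 0.2]                    # prior weights P_k, summing to 1
mu    = [np.zeros(2), np.array([3.0, 0.0]), np.array([0.0, 3.0])]
Sigma = [np.eye(2), np.diag([2.0, 0.5]), np.array([[1.0, 0.6], [0.6, 1.0]])]

def mixture_pdf(x):
    """Evaluate f_X(x) = sum_k P_k N_k(x; mu_k, Sigma_k), as in Eq. (4.116)."""
    return sum(P_k * multivariate_normal(mean=m, cov=S).pdf(x)
               for P_k, m, S in zip(P, mu, Sigma))

print(mixture_pdf(np.array([1.0, 1.0])))
```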
In general, there are an infinite number of different K-mixture Gaussian densities that can be used to "tile up" a signal space. Hence the modelling of a signal space with a K-mixture pdf can be regarded as a many-to-one mapping, and the expectation-maximisation (EM) method can be applied for the estimation of the parameters of the Gaussian pdf models.

4.5.1 The EM Algorithm for Estimation of Mixture Gaussian Densities

The EM algorithm, discussed in Section 4.3, is an iterative maximum-likelihood (ML) estimation method, and can be employed to calculate the parameters of a K-mixture Gaussian pdf model for a given data set. To apply the EM method, we first need to define the so-called complete and incomplete data sets.
As usual, the observation vectors [y(m), m = 0, ..., N−1] form the incomplete data. The complete data may be viewed as the observation vectors with a label attached to each vector y(m) to indicate the component of the mixture Gaussian model that generated the vector. Note that if each signal vector y(m) had a mixture component label attached, then the computation of the mean vector and the covariance matrix of each component of the mixture would be a relatively simple exercise. Therefore the complete and incomplete data can be defined as follows:

The incomplete data:  y(m),  m = 0, ..., N−1
The complete data:    x(m) = [y(m), k] = y_k(m),  m = 0, ..., N−1,  k ∈ (1, ..., K)

The probability of the complete data is the probability that an observation vector y(m) has a label k associating it with the kth component of the mixture density. The main step in the application of the EM method is to define the expectation of the log-likelihood of the complete data, given the observations and a current estimate of the parameter vector, as

U(Θ, Θ̂_i) = E[ln f_{Y,K;Θ}(y(m), k; Θ) | y(m); Θ̂_i]
           = Σ_{m=0}^{N−1} Σ_{k=1}^{K} [ f_{Y,K|Θ}(y(m), k | Θ̂_i) / f_{Y|Θ}(y(m) | Θ̂_i) ] ln f_{Y,K;Θ}(y(m), k; Θ)    (4.117)

where Θ = {θ_k = [P_k, μ_k, Σ_k], k = 1, ..., K} are the parameters of the Gaussian mixture, as in Equation (4.116).
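The re-estimation formulas obtained by maximising U(Θ, Θ̂_i) in Equation (4.117) are not reproduced in this excerpt; the sketch below therefore uses the standard EM updates for the mixture weights, mean vectors and covariance matrices, with the posterior component probabilities (the weighting term in Equation (4.117)) as the E-step quantities. The initialisation strategy, the small regularisation of the covariances and the synthetic data are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(y, K, iterations=50, seed=0):
    """One possible EM re-estimation loop for a K-mixture Gaussian model.

    y: (N, d) array of observation vectors y(m), m = 0, ..., N-1.
    Returns mixture weights P_k, mean vectors mu_k and covariances Sigma_k.
    """
    rng = np.random.default_rng(seed)
    N, d = y.shape
    P = np.full(K, 1.0 / K)                    # initial weights
    mu = y[rng.choice(N, K, replace=False)]    # initial means: random data points
    Sigma = np.array([np.cov(y.T) + 1e-6 * np.eye(d) for _ in range(K)])

    for _ in range(iterations):
        # E-step: posterior ("responsibility") of component k for y(m), i.e. the
        # weighting term f(y(m), k | current params) / f(y(m) | current params).
        r = np.column_stack([P[k] * multivariate_normal(mu[k], Sigma[k]).pdf(y)
                             for k in range(K)])
        r /= r.sum(axis=1, keepdims=True)

        # M-step: weighted re-estimation of the parameters of each component.
        Nk = r.sum(axis=0)
        P = Nk / N
        mu = (r.T @ y) / Nk[:, None]
        for k in range(K):
            diff = y - mu[k]
            Sigma[k] = (r[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)

    return P, mu, Sigma

# Usage on synthetic two-component data:
rng = np.random.default_rng(1)
data = np.vstack([rng.normal([0, 0], 1.0, (300, 2)),
                  rng.normal([4, 4], 0.5, (200, 2))])
P, mu, Sigma = em_gmm(data, K=2)
print(P)   # roughly [0.6, 0.4] (component order may differ)
```

In practice the iteration is stopped when the incomplete-data log-likelihood changes by less than a preset threshold, as in Step 4 of the EM algorithm.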















