This drawback has motivated the development of quasi-Newton methods, which converge somewhat less rapidly but require much less work per iteration (and are often more robust as well).

Many variants of Newton’s method have been developed to improve its reliability and reduce its overhead. These quasi-Newton methods have the form

    x_{k+1} = x_k − α_k B_k^{−1} ∇f(x_k),

where α_k is a line search parameter and B_k is some approximation to the Hessian matrix obtained in any of a number of ways, including secant updating, finite differences, periodic reevaluation, or neglecting some terms in the true Hessian of the objective function. Many quasi-Newton methods are more robust than the pure Newton method, are superlinearly convergent, and have considerably lower overhead per iteration.
For example, secant updating methods for this problem require no second derivative evaluations, require only one gradient evaluation per iteration, and solve the necessary linear system at each iteration by updating methods that require only O(n²) work rather than the O(n³) work that would be required by a matrix factorization at each step. This substantial savings in work per iteration more than offsets their somewhat slower convergence rate (generally superlinear but not quadratic), so that they usually take less total time to find a solution.
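To make the general scheme concrete, the following is a minimal sketch (not from the text) of a quasi-Newton iteration in which B_k is obtained by finite differences on the gradient, one of the options listed above; the step parameter α_k is simply held fixed instead of being chosen by a line search. The function name, tolerances, difference step, and the example gradient are illustrative assumptions.

```python
import numpy as np

def quasi_newton_fd(grad, x0, alpha=1.0, h=1e-6, tol=1e-8, max_iter=100):
    """Generic quasi-Newton iteration x_{k+1} = x_k - alpha * B_k^{-1} grad f(x_k),
    with B_k approximated by forward differences on the gradient.
    Illustrative sketch only: alpha is held fixed rather than chosen by a line search."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        # Build the approximate Hessian column by column:
        # B[:, j] ~ (grad f(x + h e_j) - grad f(x)) / h
        B = np.empty((n, n))
        for j in range(n):
            e = np.zeros(n)
            e[j] = h
            B[:, j] = (grad(x + e) - g) / h
        B = 0.5 * (B + B.T)                 # symmetrize the finite-difference approximation
        x = x - alpha * np.linalg.solve(B, g)
    return x

# Illustrative use on the quadratic f(x) = 0.5*x1**2 + 2.5*x2**2 that appears later in this section
grad = lambda x: np.array([x[0], 5.0 * x[1]])
print(quasi_newton_fd(grad, [5.0, 1.0]))    # converges to the minimum at the origin
```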
6.3.5 Secant Updating Methods

As with secant updating methods for solving nonlinear equations, the motivation for secant updating methods for minimization is to reduce the work per iteration of Newton’s method and possibly improve its robustness. One could simply use Broyden’s method to seek a zero of the gradient, but this approach would not preserve the symmetry of the Hessian matrix. Several secant updating formulas for unconstrained minimization have been developed that not only preserve symmetry in the approximate Hessian matrix but also preserve positive definiteness.
Symmetry reduces the amount of work required by about half, and positive definiteness guarantees that the quasi-Newton step will be a descent direction.

One of the most effective of these secant updating methods for minimization is called BFGS, after the initials of its four coinventors. Starting with an initial guess x_0 and a symmetric positive definite approximate Hessian matrix B_0, the following steps are repeated until convergence:

1. Solve B_k s_k = −∇f(x_k) for s_k.
2. x_{k+1} = x_k + s_k.
3. y_k = ∇f(x_{k+1}) − ∇f(x_k).
4. B_{k+1} = B_k + (y_k y_k^T)/(y_k^T s_k) − (B_k s_k s_k^T B_k)/(s_k^T B_k s_k).

In practice, a factorization of B_k is updated rather than B_k itself, so that the linear system for the quasi-Newton step s_k can be solved at a cost of O(n²) rather than O(n³) work. Note that unlike Newton’s method for minimization, no second derivatives are required. These methods are often started with B_0 = I, which means that the initial step is along the negative gradient (i.e., the direction of steepest descent); second derivative information is then gradually built up in the approximate Hessian matrix through successive iterations.
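The four steps above translate almost directly into code. The following is a minimal sketch rather than a robust implementation: it updates B_k itself and re-solves the linear system by factorization at each iteration (O(n³) per step) instead of updating a factorization as recommended above, and it takes the full step s_k with no line search. The function name and tolerances are illustrative; the usage line anticipates the setting of Example 6.7 below, the quadratic of Example 6.5 with x_0 = [5, 1]^T and B_0 = I.

```python
import numpy as np

def bfgs(grad, x0, B0=None, tol=1e-6, max_iter=100):
    """Minimal BFGS sketch following steps 1-4 above.
    For simplicity it updates B_k directly and refactors each iteration,
    rather than updating a factorization of B_k as suggested in the text."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    B = np.eye(n) if B0 is None else np.asarray(B0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        s = np.linalg.solve(B, -g)          # 1. solve B_k s_k = -grad f(x_k)
        x_new = x + s                       # 2. x_{k+1} = x_k + s_k (no line search here)
        y = grad(x_new) - g                 # 3. y_k = grad f(x_{k+1}) - grad f(x_k)
        Bs = B @ s
        # 4. BFGS update (a robust implementation would ensure y @ s > 0, e.g., via a line search)
        B = B + np.outer(y, y) / (y @ s) - np.outer(Bs, Bs) / (s @ Bs)
        x = x_new
    return x

# Reproducing the setting of Example 6.7: f(x) = 0.5*x1**2 + 2.5*x2**2, x0 = [5, 1], B0 = I
grad = lambda x: np.array([x[0], 5.0 * x[1]])
print(bfgs(grad, [5.0, 1.0]))               # approaches the minimum at the origin
```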
Like most secant updating methods, BFGS normally has a superlinear convergence rate, even though the approximate Hessian does not necessarily converge to the true Hessian. A line search can also be used to enhance the effectiveness of the method. Indeed, for a quadratic objective function, if an exact line search is performed at each iteration, then the BFGS method terminates at the exact solution in at most n iterations, where n is the dimension of the problem.

Example 6.7 BFGS Method. We illustrate the BFGS method by again minimizing the function of Example 6.5,

    f(x) = 0.5x_1² + 2.5x_2²,

whose gradient is given by

    ∇f(x) = [x_1, 5x_2]^T.

Starting with x_0 = [5, 1]^T and B_0 = I, the initial step is simply the negative gradient, so

    x_1 = x_0 + s_0 = [5, 1]^T + [−5, −5]^T = [0, −4]^T.

Updating the approximate Hessian according to the BFGS formula (here s_0 = [−5, −5]^T and y_0 = ∇f(x_1) − ∇f(x_0) = [−5, −25]^T), we get

    B_1 = [ 0.667  0.333 ]
          [ 0.333  4.667 ].

A new step is now computed and the process continued. The sequence of iterations is shown in the following table:

          x_k            f(x_k)         ∇f(x_k)
     5.000    1.000      15.000       5.000     5.000
     0.000   −4.000      40.000       0.000   −20.000
    −2.222    0.444       2.963      −2.222     2.222
     0.816    0.082       0.350       0.816     0.408
    −0.009   −0.015       0.001      −0.009    −0.077
    −0.001    0.001       0.000      −0.001     0.005

The increase in function value on the first iteration could have been avoided by using a line search.

6.3.6 Conjugate Gradient Method

The conjugate gradient method is another alternative to Newton’s method that does not require explicit second derivatives.
Indeed, unlike secant updating methods, the conjugate gradient method does not even store an approximation to the Hessian matrix, which makes it especially suitable for very large problems.

As we saw in Section 6.3.2, the steepest descent method tends to search in the same directions repeatedly, leading to very slow convergence. As its name suggests, the conjugate gradient method also uses gradients, but it avoids repeated searches by modifying the gradient at each step to remove components in previous directions. The resulting sequence of conjugate (i.e., orthogonal in some inner product) search directions implicitly accumulates information about the Hessian matrix as iterations proceed.
Theoretically, the method is exact after at most n iterations for a quadratic objective function in n dimensions, but it is usually quite effective for more general unconstrained minimization problems as well. The motivation for this algorithm is discussed in Section 11.5.5.

To minimize f starting from an initial guess x_0, we initialize g_0 = ∇f(x_0) and s_0 = −g_0; then the following steps are repeated until convergence:

1. x_{k+1} = x_k + α_k s_k, where α_k is determined by a line search.
2. g_{k+1} = ∇f(x_{k+1}).
3. β_{k+1} = (g_{k+1}^T g_{k+1})/(g_k^T g_k).
4. s_{k+1} = −g_{k+1} + β_{k+1} s_k.

The formula for β_{k+1} given above is due to Fletcher and Reeves. An alternative formula, due to Polak and Ribière, is

    β_{k+1} = ((g_{k+1} − g_k)^T g_{k+1})/(g_k^T g_k).

It is common to restart the algorithm after every n iterations, reinitializing to use the negative gradient at the current point.
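As a concrete illustration (a sketch, not code from the text), the steps above are written out below with the Fletcher-Reeves choice of β_{k+1}, specialized to a quadratic f(x) = 0.5 x^T A x so that the exact line search α_k = −(g_k^T s_k)/(s_k^T A s_k) has a closed form; a general implementation would instead determine α_k by a numerical line search. The function name and tolerances are illustrative, and the usage line corresponds to the quadratic of Example 6.8 below, for which A = diag(1, 5).

```python
import numpy as np

def conjugate_gradient_quadratic(A, x0, tol=1e-8, max_iter=None):
    """Fletcher-Reeves conjugate gradient sketch for the quadratic f(x) = 0.5 * x^T A x,
    whose gradient is A x.  The closed-form line search is specific to this quadratic."""
    x = np.asarray(x0, dtype=float)
    g = A @ x                                  # g_0 = grad f(x_0)
    s = -g                                     # s_0 = -g_0
    for _ in range(max_iter or 10 * x.size):
        if np.linalg.norm(g) < tol:
            break
        alpha = -(g @ s) / (s @ (A @ s))       # exact minimizer of f along s for this quadratic
        x = x + alpha * s                      # 1. x_{k+1} = x_k + alpha_k s_k
        g_new = A @ x                          # 2. g_{k+1} = grad f(x_{k+1})
        beta = (g_new @ g_new) / (g @ g)       # 3. Fletcher-Reeves beta_{k+1}
        s = -g_new + beta * s                  # 4. s_{k+1} = -g_{k+1} + beta_{k+1} s_k
        g = g_new
    return x

# The quadratic of Example 6.8 below: f(x) = 0.5*x1**2 + 2.5*x2**2, i.e., A = diag(1, 5)
A = np.diag([1.0, 5.0])
print(conjugate_gradient_quadratic(A, [5.0, 1.0]))   # reaches the origin in n = 2 steps
```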
Example 6.8 Conjugate Gradient Method. We illustrate the conjugate gradient method by using it to minimize the function

    f(x) = 0.5x_1² + 2.5x_2²,

whose gradient is given by

    ∇f(x) = [x_1, 5x_2]^T.

Starting with x_0 = [5, 1]^T, the initial search direction is the negative gradient,

    s_0 = −g_0 = −∇f(x_0) = [−5, −5]^T.

The exact minimum along this line is given by α_0 = 1/3, so that the next approximation is x_1 = [3.333, −0.667]^T, at which point we compute the new gradient,

    g_1 = ∇f(x_1) = [3.333, −3.333]^T.

So far there is no difference from the steepest descent method.
At this point, however, rather than search along the new negative gradient, we compute instead the quantity

    β_1 = (g_1^T g_1)/(g_0^T g_0) = 0.444,

which gives as the next search direction

    s_1 = −g_1 + β_1 s_0 = [−3.333, 3.333]^T + 0.444 [−5, −5]^T = [−5.556, 1.111]^T.

The minimum along this direction is given by α_1 = 0.6, which gives the exact solution at the origin. Thus, as expected for a quadratic function, the conjugate gradient method converges in n = 2 steps in this case.

6.3.7 Truncated Newton Methods

Despite its rapid asymptotic convergence, Newton’s method can be unattractive because of its high cost per iteration, especially for very large problems, for which storage requirements are also an important consideration.
Another way of potentially reducing the work per iteration is to solve the linear system for the Newton step,

    B_k s_k = −∇f(x_k),

where B_k is the true or approximate Hessian matrix, by an iterative method (see Section 11.5) rather than by a direct method based on factorization of B_k. One advantage is that only a few iterations of the iterative method may be sufficient to produce a step s_k that is almost as good as the true Newton step. Indeed, far from the minimum the true Newton step may offer no special advantage, yet can be very costly to compute exactly. Such an approach is called an inexact or truncated Newton method, since the linear system for the Newton step is solved inexactly by terminating the linear iterative solver before convergence.

A good choice for the linear iterative solver is the conjugate gradient method (see Section 11.5.5). The conjugate gradient method begins with the negative gradient vector and eventually converges to the true Newton step, so truncating the iterations produces a step that is intermediate between these two vectors and is always a descent direction provided B_k is positive definite. Moreover, since the conjugate gradient method requires only matrix-vector products, the Hessian matrix need not be formed explicitly, which can mean a substantial savings in storage. To supply the product B_k v, for example, the finite difference approximation

    B_k v ≈ (∇f(x_k + hv) − ∇f(x_k))/h

can be computed instead, without ever forming B_k.
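A minimal sketch of this idea is given below, under illustrative assumptions (the function name, the fixed inner iteration count, the difference step h, and the unit outer step are choices made here, not prescribed by the text): the inner solver is the linear conjugate gradient iteration applied to B_k s = −∇f(x_k), truncated after a few iterations, and each product B_k v is supplied by the finite difference formula above, so B_k is never formed.

```python
import numpy as np

def truncated_newton(grad, x0, inner_iters=5, h=1e-6, tol=1e-6, max_outer=50):
    """Sketch of a truncated (inexact) Newton iteration.
    The Newton system B_k s = -grad f(x_k) is solved approximately by a few linear
    conjugate gradient iterations, with each product B_k v obtained from the
    finite-difference approximation (grad f(x_k + h v) - grad f(x_k)) / h."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_outer):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        Bv = lambda v: (grad(x + h * v) - g) / h   # Hessian-vector product, B_k never formed
        # Linear CG on B_k s = -g, truncated after inner_iters iterations
        s = np.zeros_like(x)
        r = -g                    # residual of B_k s = -g at s = 0
        p = r.copy()
        for _ in range(inner_iters):
            Bp = Bv(p)
            alpha = (r @ r) / (p @ Bp)
            s = s + alpha * p
            r_new = r - alpha * Bp
            if np.linalg.norm(r_new) < tol:
                break
            beta = (r_new @ r_new) / (r @ r)
            p = r_new + beta * p
            r = r_new
        x = x + s                 # take the (approximate) Newton step; no line search here
    return x

# Illustrative use on the quadratic from the earlier examples
grad = lambda x: np.array([x[0], 5.0 * x[1]])
print(truncated_newton(grad, [5.0, 1.0]))
```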
In implementing a truncated Newton method, the termination criterion for the inner iteration must be chosen carefully to preserve the superlinear convergence rate of the outer iteration. In addition, special measures may be required if the matrix B_k is not positive definite. Nevertheless, truncated Newton methods are usually very effective in practice and are among the best methods available for large sparse problems.

6.4 Nonlinear Least Squares

Least squares data fitting can be viewed as an optimization problem. Given m data points (t_i, y_i), we wish to find the n-vector x of parameters that gives the best fit in the least squares sense to the model function f(t, x).