The convergence pattern of the program was also slightly different and had the following form:

   0    1   7727.41
   1   21   5.59179E-4
   2   22   5.59179E-4
   CONVERGED TO 5.59179E-4   # ITNS= 3   # EVALS= 29

In the above output, the quantities printed are the number of iterations (gradient evaluations), the number of function evaluations and the lowest function value found so far. The sensitivity of the gradient and the convergence pattern to relatively small changes in arithmetic is, in my experience, quite common for algorithms of this type.

Chapter 17

MINIMISING A NONLINEAR SUM OF SQUARES

17.1. INTRODUCTION

The mathematical problem to be considered here is that of minimising

   S(x) = Σ_{i=1}^{m} f_i^2(x)                                        (17.1)

with respect to the parameters x_j, j = 1, 2, . . . , n (collected for convenience as the vector x), where at least one of the functions f_i(x) is nonlinear in x.
Note that by collecting the m functions f_i(x), i = 1, 2, . . . , m, as a vector f, we get

   S(x) = f^T f.                                                      (17.2)

The minimisation of a nonlinear sum-of-squares function is a sufficiently widespread activity to have developed special methods for its solution.
The principal reason for this is that it arises whenever a least-squares criterion is used to fit a nonlinear model to data. For instance, let y_i represent the weight of some laboratory animal at week i after birth and suppose that it is desired to model this by some function of the week number i, which will be denoted y(i, x), where x is the set of parameters which will be varied to fit the model to the data. If the criterion of fit is that the sum of squared deviations from the data is to be minimised (least squares) then the objective is to minimise (17.1) where

   f_i(x) = y(i, x) - y_i                                             (17.3)

or, in the case that confidence weightings are available for each data point,

   f_i(x) = [y(i, x) - y_i] w_i                                       (17.4)

where w_i, i = 1, 2, . . . , m, are the weightings.
As a particular example of a growth function, consider the three-parameter logistic function (Oliver 1964)

   y(i, x) = y(i, x_1, x_2, x_3) = x_1/[1 + exp(x_2 + i x_3)].        (17.5)

Note that the form of the residuals chosen in (17.3) and (17.4) is the negative of the usual 'actual minus fitted' used in most of the statistical literature. The reason for this is to make the derivatives of f_i(x) coincide with those of y(i, x). The minimisation of S(x) could, of course, be approached by an algorithm for the minimisation of a general function of n variables. Bard (1970) suggests that this is not as efficient as methods which recognise the sum-of-squares form of S(x), though more recently Biggs (1975) and McKeown (1974) have found contrary results.
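To make (17.2)-(17.5) concrete, the following sketch evaluates the logistic residuals and the sum of squares. It is written in Python with NumPy rather than the languages used in this book, and the function names and the week/weight values in the usage lines are invented purely for illustration.

   import numpy as np

   def logistic_model(i, x):
       # Three-parameter logistic growth curve (17.5):
       # y(i, x) = x1 / (1 + exp(x2 + i*x3))
       x1, x2, x3 = x
       return x1 / (1.0 + np.exp(x2 + i * x3))

   def residuals(x, i, y_obs, w=None):
       # Residuals in the sign convention of (17.3)/(17.4):
       # f_i = y(i, x) - y_i, optionally scaled by weights w_i.
       f = logistic_model(i, x) - y_obs
       return f if w is None else f * w

   def sum_of_squares(x, i, y_obs, w=None):
       # S(x) = f^T f, equation (17.2).
       f = residuals(x, i, y_obs, w)
       return float(f @ f)

   # Hypothetical usage; the data values are made up for illustration only.
   weeks = np.arange(1.0, 7.0)
   y_obs = np.array([5.3, 7.2, 9.6, 12.9, 17.1, 23.2])
   print(sum_of_squares(np.array([200.0, 4.0, -0.2]), weeks, y_obs))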
In the paragraphs below, algorithms will be described which take explicit note of the sum-of-squares form of S(x), since these are relatively simple and, as building blocks, use algorithms for linear least-squares computations which have already been discussed in earlier chapters.

17.2. TWO METHODS

Almost immediately two possible routes to minimising S(x) suggest themselves.

The Cauchy steepest descents method

Find the gradient 2v(x) of S(x) and step downhill along it.
(The reason for the factor 2 will become apparent shortly.) Suppose that t represents the step length along the gradient; then for some t we have

   S(x - tv) < S(x)                                                   (17.6)

except at a local minimum or a saddle point. The steepest descents method replaces x by (x - tv) and repeats the process from the new point.
The iteration is continued until a t cannot be found for which (17.6) is satisfied. The method, which was suggested by Cauchy (1848), is then taken to have converged. It can be shown always to converge if S(x) is convex, that is, if

   S(c x_1 + (1 - c) x_2) < c S(x_1) + (1 - c) S(x_2)                 (17.7)

for 0 < c < 1. Even for non-convex functions which are bounded from below, the steepest descents method will find a local minimum or saddle point.
All the preceding results are, of course, subject to the provision that the function and gradient are computed exactly (an almost impossible requirement). In practice, however, convergence is so slow as to disqualify the method of steepest descents on its own as a candidate for minimising functions.

Often the cause of this slowness of convergence is the tendency of the method to take pairs of steps which are virtually opposites, and which are both essentially perpendicular to the direction in which the minimum is to be found. In a two-parameter example we may think of a narrow valley with the minimum somewhere along its length. Suppose our starting point is somewhere on the side of the valley but not near this minimum.
The gradient will be such that the direction of steepest descent is towards the floor of the valley. However, the step taken can easily traverse the valley. The situation is then similar to the original one, and it is possible to step back across the valley almost to our starting point with only a very slight motion along the valley toward the solution point. One can picture the process as following a path similar to that which would be followed by a marble or ball-bearing rolling over the valley-shaped surface.

To illustrate the slow convergence, a modified version of steepest descents was programmed in BASIC on a Data General NOVA minicomputer having machine precision 2^(-22).
The modification consisted of step doubling if a step is successful. The step length is divided by 4 if a step is unsuccessful. This reduction in step size is repeated until either a smaller sum of squares is found or the step is so small that none of the parameters change. As a test problem, consider the Rosenbrock banana-shaped valley

   S(x_1, x_2) = 100 (x_2 - x_1^2)^2 + (1 - x_1)^2

starting with

   S(-1·2, 1) = 24·1999

(as evaluated). The steepest descents program above required 232 computations of the derivative and 2248 evaluations of S(x) to find

   S(1·00144, 1·0029) = 2·1×10^(-6).

The program was restarted with this point and stopped manually after 468 derivative and 4027 sum-of-squares computations, where

   S(1·00084, 1·00168) = 7·1×10^(-7).

By comparison, the Marquardt method to be described below requires 24 derivative and 32 sum-of-squares evaluations to reach

   S(1, 1) = 1·4×10^(-14).

(There are some rounding errors in the display of x_1, x_2 or in the computation of S(x), since S(1, 1) = 0 is the solution to the Rosenbrock problem.)
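The step-doubling, step-quartering strategy just described can be sketched as follows. This is a Python/NumPy paraphrase of the idea rather than the BASIC program actually used; the iteration limit and starting step length are arbitrary choices of mine.

   import numpy as np

   def rosenbrock(x):
       # S(x1, x2) = 100*(x2 - x1^2)^2 + (1 - x1)^2
       return 100.0 * (x[1] - x[0]**2)**2 + (1.0 - x[0])**2

   def rosenbrock_grad(x):
       # Analytic gradient of S with respect to x1 and x2.
       return np.array([-400.0 * x[0] * (x[1] - x[0]**2) - 2.0 * (1.0 - x[0]),
                        200.0 * (x[1] - x[0]**2)])

   def steepest_descents(x, t=1.0, max_grads=200):
       S = rosenbrock(x)
       for _ in range(max_grads):          # one gradient evaluation per pass
           g = rosenbrock_grad(x)
           while True:
               x_new = x - t * g
               S_new = rosenbrock(x_new)
               if S_new < S:               # success: accept and double the step
                   x, S, t = x_new, S_new, 2.0 * t
                   break
               t = t / 4.0                 # failure: divide the step length by 4
               if np.all(x - t * g == x):  # step changes no parameter: stop
                   return x, S
       return x, S

   print(steepest_descents(np.array([-1.2, 1.0])))

If run, this sketch exhibits the zigzag behaviour described above: rapid initial progress followed by a long creep along the valley floor.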
The Gauss-Newton method

At the minimum the gradient v(x) must be null. The functions v_j(x), j = 1, 2, . . . , n, provide a set of n nonlinear functions in n unknowns x such that

   v(x) = 0                                                           (17.8)

the solution of which is a stationary point of the function S(x), that is, a local maximum or minimum or a saddle point. The particular form (17.1) or (17.2) of S(x) gives gradient components

   ∂S/∂x_j = 2 Σ_{i=1}^{m} f_i(x) ∂f_i/∂x_j                           (17.9)

which reduces to

   v_j(x) = Σ_{i=1}^{m} f_i(x) ∂f_i/∂x_j                              (17.10)

or

   v = J^T f                                                          (17.11)

by defining the Jacobian matrix J by

   J_{ij} = ∂f_i/∂x_j.                                                (17.12)
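As a numerical check of (17.9)-(17.12), the sketch below (Python/NumPy again, with the same invented logistic example as before) forms the Jacobian of the logistic residuals analytically and confirms that 2 J^T f agrees with a central-difference approximation to the gradient of S(x).

   import numpy as np

   def logistic_resid(x, i, y_obs):
       # f_i = y(i, x) - y_i with y the logistic model (17.5).
       return x[0] / (1.0 + np.exp(x[1] + i * x[2])) - y_obs

   def logistic_jac(x, i):
       # J_ij = partial f_i / partial x_j, equation (17.12).
       e = np.exp(x[1] + i * x[2])
       d = (1.0 + e)**2
       return np.column_stack([1.0 / (1.0 + e),     # df/dx1
                               -x[0] * e / d,       # df/dx2
                               -x[0] * i * e / d])  # df/dx3

   def S(x, i, y_obs):
       f = logistic_resid(x, i, y_obs)
       return float(f @ f)

   # Illustrative point and data (values invented, as before).
   weeks = np.arange(1.0, 7.0)
   y_obs = np.array([5.3, 7.2, 9.6, 12.9, 17.1, 23.2])
   x = np.array([200.0, 4.0, -0.2])

   v = logistic_jac(x, weeks).T @ logistic_resid(x, weeks, y_obs)   # (17.11)

   h = 1e-6
   grad = np.array([(S(x + h * np.eye(3)[j], weeks, y_obs)
                     - S(x - h * np.eye(3)[j], weeks, y_obs)) / (2.0 * h)
                    for j in range(3)])
   print(np.allclose(grad, 2.0 * v, rtol=1e-4))     # gradient of S is 2v (17.9)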
Some approximation must now be made to simplify the equations (17.8). Consider the Taylor expansion of v_j(x) about x

   v_j(x + q) = v_j(x) + Σ_{k=1}^{n} (∂v_j/∂x_k) q_k + terms in q^2.  (17.13)

If the terms in q^2 (that is, those involving q_k q_j for k, j = 1, 2, . . . , n) are assumed to be negligible and v_j(x + q) is taken as zero because it is assumed to be the solution, then

   Σ_{k=1}^{n} (∂v_j/∂x_k) q_k = -v_j(x)                              (17.14)

for each j = 1, 2, . . . , n. From (17.10) and (17.12), therefore, we have

   ∂v_j/∂x_k = Σ_{i=1}^{m} [(∂f_i/∂x_j)(∂f_i/∂x_k) + f_i ∂^2 f_i/(∂x_j ∂x_k)].   (17.15)

To apply the Newton-Raphson iteration (defined by equation (15.9)) to the solution of (17.8) thus requires the second derivatives of f with respect to the parameters, a total of mn^2 computations. On the grounds that near the minimum the functions should be 'small' (a very cavalier assumption), the second term in the summation (17.15) is neglected, so that the partial derivatives of v are approximated by the matrix J^T J, reducing (17.14) to

   J^T J q = -v = -J^T f.                                             (17.16)

These are simply normal equations for a linearised least-squares problem evaluated locally, and the Gauss-Newton iteration proceeds by replacing x by (x + q) and repeating until either q is smaller than some predetermined tolerance or

   S(x + q) > S(x).                                                   (17.17)
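A bare Gauss-Newton step is then a linear least-squares solve. The sketch below forms the normal equations (17.16) directly and uses a general linear solver; in this book the corresponding computation would instead be built from the linear least-squares algorithms of the earlier chapters. The function names and tolerances are illustrative choices of mine.

   import numpy as np

   def gauss_newton(x, resid, jac, tol=1e-8, max_iter=50):
       # Solve J^T J q = -J^T f (17.16), replace x by x + q, and stop when
       # q is small or the sum of squares increases, as in (17.17).
       for _ in range(max_iter):
           f = resid(x)
           J = jac(x)
           q = np.linalg.solve(J.T @ J, -(J.T @ f))
           if np.linalg.norm(q) <= tol * (np.linalg.norm(x) + tol):
               break
           if np.sum(resid(x + q)**2) > np.sum(f**2):
               break                       # (17.17): the full step is uphill
           x = x + q
       return x

   # Rosenbrock written as residuals: f1 = 10*(x2 - x1^2), f2 = 1 - x1.
   def rb_resid(x):
       return np.array([10.0 * (x[1] - x[0]**2), 1.0 - x[0]])

   def rb_jac(x):
       return np.array([[-20.0 * x[0], 10.0],
                        [-1.0,          0.0]])

   print(gauss_newton(np.array([-1.2, 1.0]), rb_resid, rb_jac))

Started from (-1·2, 1) on the Rosenbrock problem, the very first full step increases S, so this bare iteration stops immediately; the modification discussed next addresses exactly this kind of failure.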
17.3. HARTLEY'S MODIFICATION

As it has been stated, the Gauss-Newton iteration can quite easily fail since the second term in the summation (17.15) is often not negligible, especially in the case when the residuals f are large. However, if instead of replacing x by (x + q) we use (x + tq), where t is a step-length parameter, it is possible to show that if J is of full rank (J^T J is positive definite and hence non-singular), then this modified Gauss-Newton algorithm always proceeds towards a minimum. For in this case, we have

   q = -(J^T J)^(-1) J^T f                                            (17.18)

so that the inner product between q and the direction of the steepest descent -J^T f is

   -q^T v = f^T J (J^T J)^(-1) J^T f > 0                              (17.19)

since (J^T J)^(-1) is positive definite by virtue of the positive definiteness of J^T J.
Thus the cosine of the angle between q and -v is always positive, guaranteeing that the search is downhill. In practice, however, angles very close to 90° are observed, which may imply slow convergence.

The modified Gauss-Newton procedure above has been widely used to minimise sum-of-squares functions (Hartley 1961). Usually the one-dimensional search at each iteration is accomplished by fitting a quadratic polynomial to the sum-of-squares surface along the search direction and evaluating the function at the estimated minimum of this quadratic. This requires at least three sum-of-squares function evaluations per iteration and, furthermore, the apparent simplicity of this parabolic inverse interpolation to accomplish a linear search when stated in words hides a variety of stratagems which must be incorporated to prevent moves either in the wrong direction or to points where S(x+q) > S(x).
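One possible arrangement of such a parabolic inverse interpolation is sketched below. The choice of trial points at t = 0, t1 and 2 t1 and the particular safeguards are mine, not Hartley's, and a production implementation would need more care, as the text warns.

   import numpy as np

   def parabolic_step(S, x, q, t1=1.0):
       # Fit S(x + t*q) at t = 0, t1, 2*t1 with a quadratic in t and return
       # the t at the minimum of the fit, guarded against uphill or
       # absurdly long moves.
       s0, s1, s2 = S(x), S(x + t1 * q), S(x + 2.0 * t1 * q)
       a = (s2 - 2.0 * s1 + s0) / (2.0 * t1 * t1)   # quadratic coefficient
       b = (4.0 * s1 - s2 - 3.0 * s0) / (2.0 * t1)  # linear coefficient
       if a > 0.0:
           t = -b / (2.0 * a)          # minimiser of the fitted parabola
       else:
           t = 2.0 * t1                # no interior minimum: take the long step
       t = min(max(t, 0.0), 4.0 * t1)  # crude safeguards on the move
       while t > 1e-10 and S(x + t * q) > s0:
           t *= 0.5                    # still uphill: shrink until it is not
       return t

In Hartley's modified Gauss-Newton iteration the returned value of t would then be used to replace x by (x + tq) before the next solution of the normal equations.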