This property is shown in Fig. 5b, in which the ESPs are not at the global minimum of L_d((y, α)^T). Rather, CSA aims to implicitly minimize W(y) according to GSA [25,26]. That is, y∗ ∈ Y_opt corresponds to y∗ = (y∗, α^max)^T with the minimum W(y), and W((y∗, α^max)^T) < W((y, α)^T) for all y ≠ y∗ and all α, and for y = y∗ and all α ≠ α^max. The following theorem shows that CSA asymptotically converges to y∗ with probability one. (See the proof in Appendix B.)

Theorem 4 Given the inhomogeneous Markov chain modeling CSA with transition probability defined in (36) and the sequence of decreasing temperatures that satisfy (41), the Markov chain converges to a CGM_d with probability one as k → ∞.

Example 2 (cont'd) We illustrate the virtual energy W(y) of the Markov chain in Fig. 5a and the convergence behavior of CSA and random search.

J Glob Optim

[Fig. 6 Virtual energy of the Markov chain in Fig. 5a and the convergence behavior of CSA and random search at (1.0, 6)^T: a) virtual energy W(y); b) convergence probability Pc_k((1, 6)^T); c) reachability probability Pr_k((1, 6)^T)]

One approach to find W(y) that works well for a small problem is to enumerate all possible spanning trees rooted at y and to find the one with the minimum cost. A more efficient way, adopted in this example, is to compute W(y) using (44). This can be done by first numerically computing the stationary probability π_T(y) of the Markov chain at a given T using the one-step transition probability P_T(y, y′) in (36), where π_T evolves with iteration k as follows:

    Pc_{k+1} = Pc_k P_T   for any given initial convergence probability vector Pc_0,   (45)

until ||Pc_{k+1} − Pc_k|| ≤ ε.
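The fixed-point iteration in (45) is straightforward to carry out numerically. The sketch below is illustrative only: the 3-state transition matrix is hypothetical (not the chain of Fig. 5a), the function name is ours, and we use a looser tolerance than the 10^-16 used in the example.

```python
import numpy as np

def stationary_distribution(P_T, eps=1e-12, max_iter=1_000_000):
    """Iterate Pc_{k+1} = Pc_k @ P_T, as in (45), until
    ||Pc_{k+1} - Pc_k|| <= eps, and return the fixed point."""
    n = P_T.shape[0]
    Pc = np.full(n, 1.0 / n)  # uniform start: Pc_0(i) = 1/|S|
    for _ in range(max_iter):
        Pc_next = Pc @ P_T
        if np.linalg.norm(Pc_next - Pc) <= eps:
            return Pc_next
        Pc = Pc_next
    return Pc

# Hypothetical 3-state row-stochastic chain, for illustration only.
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
pi = stationary_distribution(P)  # satisfies pi ~= pi @ P
```

Because the chain is finite and aperiodic, the iteration converges to the unique stationary vector regardless of the initial vector Pc_0, which is why the uniform start is admissible.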
In this example, we set ε = 10^−16 as the stopping precision. Since π_T = lim_{k→∞} Pc_k, independent of the initial vector Pc_0, we set Pc_0(i) = 1/|S| for i = 1, . . . , |S|.

Figure 6a shows W((y, α)^T) of Fig. 5a. Clearly, L_d((y, α)^T) ≠ W((y, α)^T). For a given y, W((y, α)^T) is nonincreasing as α increases. For example, W((0.6, 3)^T) = 4.44 ≥ W((0.6, 4)^T) = 4.03, and W((0.8, 2)^T) = 4.05 ≥ W((0.8, 6)^T) = 3.14.
We also have W((y, α)^T) minimized at y = 1.0 when α = α^max = 6: W((0.6, 6)^T) = 3.37 ≥ W((0.8, 6)^T) = 3.14 ≥ W((1.0, 6)^T) = 0.097. Hence, W((y, α)^T) is minimized at (y∗, α^max)^T = (1.0, 6)^T, which is an ESP with the minimum objective value. In contrast, L_d((y, α)^T) is nondecreasing as α increases. In Fig. 5b, the minimum value of L_d((y, α)^T) is at (1.2, 2)^T, which is not a feasible point.

To illustrate the convergence of CSA to y∗ = 1.0, Fig. 6b plots Pc_k(y∗) as a function of k, where y∗ = (1.0, 6)^T.
In this example, we set T_0 = 1.0, N_T = 5, and κ = 0.9 (the cooling schedule in Fig. 1). Since this cooling schedule is more aggressive than the one required in Theorem 3, one would not expect the search to converge to a CGM_d with probability one, as is proved in Theorem 4 for the slower schedule. Nevertheless, as T approaches zero, W(y∗) approaches zero, and Pc_k(y∗) monotonically increases and approaches one. Similar figures can be drawn to show that Pc_k(y), y ≠ y∗, decreases to zero as T is reduced.
Therefore, CSA is more likely to find y∗ as the search progresses. In contrast, for random search, Pc_k(y∗) is constant, independent of k.

Note that it is not possible to demonstrate asymptotic convergence using only a finite number of iterations. Our example, however, shows that the probability of finding a CGM_d improves over time. Hence, it becomes more likely to find a CGM_d when more time is spent to solve the problem.

Last, Fig. 6c depicts the reachability probability Pr_k(y∗) of finding y∗ in any of the first k iterations.
Assuming all the iterations are independent, Pr_k(y∗) is defined as:

    Pr_k(y∗) = 1 − ∏_{i=0}^{k} (1 − P(y∗ found in the ith iteration)).   (46)

The figure shows that CSA has better reachability probabilities than random search over the 100 iterations evaluated, although the difference diminishes as the number of iterations is increased.

It is easy to show that CSA has asymptotic reachability [3] of y∗; that is, lim_{k→∞} Pr_k(y∗) = 1. Asymptotic reachability is weaker than asymptotic convergence because it only requires the algorithm to hit a global minimum sometime during a search and can be guaranteed if the algorithm is ergodic. (Ergodicity means that any two points in the search space can be reached from each other with a nonzero probability.) Asymptotic reachability can be accomplished in any ergodic search by keeping track of the best solution found during the search.
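The product in (46) is easy to evaluate numerically. The following minimal sketch (the function name and the per-iteration probabilities are our own illustrative choices) shows how reachability accumulates for a pure random search in which every probe independently hits y∗ with the same small probability:

```python
import numpy as np

def reachability(per_iteration_probs):
    """Pr_k(y*) from (46): probability that y* is found in at least
    one iteration, assuming the iterations are independent."""
    p = np.asarray(per_iteration_probs, dtype=float)
    return 1.0 - np.prod(1.0 - p)

# Hypothetical random search over 100 equally likely states: each
# probe hits y* with probability 0.01.
pr_100 = reachability([0.01] * 100)  # 1 - 0.99**100, about 0.634
```

As long as the per-iteration probabilities are bounded away from zero, the product tends to zero and Pr_k(y∗) tends to one, which is exactly the asymptotic-reachability property discussed above.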
In contrast, asymptotic convergence requires the algorithm to converge to a global minimum with probability one. Consequently, the probability that a probe hits the solution increases as the search progresses.

4.2 Asymptotic convergence of CPSA

By following an approach similar to that in the last section on proving the asymptotic convergence of CSA, we prove in this section the asymptotic convergence of CPSA to a CGM_d of P_d.

CPSA can be modeled by an inhomogeneous Markov chain that consists of a sequence of homogeneous Markov chains of finite length, each at a specific temperature in a given cooling schedule. The state space of the Markov chain can be described by state y = (y, α, γ)^T, where y ∈ D^w is the vector of problem variables and α and γ are the penalty vectors.

According to the generation probability G^(t)(y, y′) and the acceptance probability A_T(y, y′), the one-step transition probability matrix of the Markov chain for CPSA is P_T = [P_T(y, y′)], where:

    P_T(y, y′) =
        P_s(t) G^(t)(y, y′) A_T(y, y′),        if y′ ∈ N_p^(t)(y), t = 1, . . . , N,
        P_s(N + 1) G^(g)(y, y′) A_T(y, y′),    if y′ ∈ N_p^(g)(y),
        1 − Σ_{t=1}^{N} Σ_{y″ ∈ N_p^(t)(y)} P_T(y, y″) − Σ_{y″ ∈ N_p^(g)(y)} P_T(y, y″),   if y′ = y,
        0,                                      otherwise.   (47)

Let y_opt = {(y∗, α^max, γ^max)^T | y∗ ∈ Y_opt}, and let N_L be the maximum of the minimum number of transitions required to reach y_opt from all y ∈ S. Given {T_k, k = 0, 1, 2, . . .} that satisfies (41), and with N_T, the number of trials per temperature, set to N_L, a theorem similar to Theorem 3 can be proved [9]. This means that state y of the Markov chain has a unique stationary probability π_T(y).

Note that ΔL defined in Theorem 3 is the maximum difference between the penalty-function values of two neighboring states. Although this value depends on the user-defined neighborhood, it is usually smaller for CPSA than for CSA because CPSA has a partitioned neighborhood, and two neighboring states can differ by only a subset of the variables.
In contrast, two states in CSA can differ by more variables and have larger variations in their penalty-function values. According to (41), a smaller ΔL allows the temperature to be reduced faster in the convergence to a CGM_d.

Similar to CSA, (47) also fits into the framework of GSA if we define an irreducible Markov kernel P_T(y, y′) and its associated communication cost v(y, y′), where v: S × S → [0, +∞]:

    v(y, y′) = (L_d(y′) − L_d(y))^+,   if y′ = (y′, α, γ)^T,
               (L_d(y) − L_d(y′))^+,   if y′ = (y, α′, γ)^T or y′ = (y, α, γ′)^T.   (48)

In a way similar to that in CSA, we use the result that any process modeled by GSA minimizes an implicit virtual energy W(y) and converges to the global minimum of W(y) with probability one. The following theorem states the asymptotic convergence of CPSA to a CGM_d.
The proof in Appendix C shows that W(y) is minimized at (y∗, α^max, γ^max)^T for some α^max and γ^max.

Theorem 5 Given the inhomogeneous Markov chain modeling CPSA with transition probability defined in (47) and the sequence of decreasing temperatures that satisfy (41), the Markov chain converges to a CGM_d with probability one as k → ∞.

Again, the cooling schedule of CPSA in Fig. 3 is more aggressive than that in Theorem 5.

5 Experimental results on continuous constrained problems

In this section, we apply CSA and CPSA to solve some nonlinear continuous optimization benchmarks and compare their performance to that of other dynamic penalty methods.

5.1 Implementation details of CSA for solving continuous problems

In theory, any neighborhoods N_c1(x) and N_c2(α) that satisfy (21) and (22) can be used. In practice, however, appropriate neighborhoods must be chosen in any efficient implementation.

In generating trial point x′ = (x′, α)^T from x = (x, α)^T, where x′ ∈ N_c1(x), we choose x′ to differ from x in the ith element, where i is uniformly distributed in {1, 2, . . . , n}:

    x′ = x + θ ⊗ e_i = x + (θ_1 e_{i,1}, θ_2 e_{i,2}, . . . , θ_n e_{i,n})^T,   (49)

where ⊗ is the vector-product operator. Here, e_i is a vector whose ith element is 1 and whose other elements are 0, and θ is a vector whose ith element θ_i is Cauchy distributed with density f_d(x_i) = (1/π) σ_i/(σ_i^2 + x_i^2) and scale parameter σ_i.
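The trial-point generation just described can be sketched as follows. This is an illustrative fragment, not the paper's code: the function name and the fixed random seed are ours, and NumPy's `standard_cauchy` draw is scaled by σ_i to obtain the stated scale parameter.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_trial_x(x, sigma):
    """In the spirit of (49): perturb one uniformly chosen
    coordinate i of x by a Cauchy step with scale sigma[i]."""
    i = rng.integers(len(x))      # i uniform over {0, ..., n-1}
    x_new = x.copy()
    x_new[i] += sigma[i] * rng.standard_cauchy()
    return x_new, i

x = np.zeros(4)
sigma = np.ones(4)
x_trial, i = generate_trial_x(x, sigma)  # differs from x in one element
```

The heavy tails of the Cauchy distribution occasionally produce very long jumps, which helps the search escape local basins while most steps remain on the order of σ_i.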
Other distributions of θ_i studied include uniform and Gaussian [31]. During the course of CSA, we dynamically update σ_i using the following modified one-to-one rate rule [10] in order to balance the ratio between accepted and rejected configurations:

    σ_i ←  σ_i [1 + β_0 (p_i − p_u)] / (1 − p_u),   if p_i > p_u,
           σ_i / [1 + β_1 (p_v − p_i)/p_v],         if p_i < p_v,
           unchanged,                                otherwise,   (50)

where p_i is the fraction of trial points x′ accepted.
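The rate rule in (50) can be written as a short function. This is a minimal sketch with a hypothetical name; the defaults use the parameter values reported below (β_0 = 7, β_1 = 2, p_u = 0.3, p_v = 0.2).

```python
def update_sigma(sigma_i, p_i, beta0=7.0, beta1=2.0, p_u=0.3, p_v=0.2):
    """One-to-one rate rule (50): enlarge sigma_i when the acceptance
    fraction p_i is high, shrink it when p_i is low."""
    if p_i > p_u:
        return sigma_i * (1.0 + beta0 * (p_i - p_u)) / (1.0 - p_u)
    if p_i < p_v:
        return sigma_i / (1.0 + beta1 * (p_v - p_i) / p_v)
    return sigma_i
```

For instance, with these defaults an acceptance fraction of 0.1 halves σ_i, while a fraction between p_v and p_u leaves it unchanged.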
If p_i is low, then too many trial points x′ are rejected, and σ_i is reduced; otherwise, the trial points x′ are too close to x, and σ_i is increased. We set β_0 = 7, β_1 = 2, p_u = 0.3, and p_v = 0.2 after experimenting with different combinations of parameters [31]. Note that it is possible to get somewhat better convergence results when problem-specific parameters are used, although the results will not be general in that case.

Similarly, in generating trial point x′ = (x, α′)^T from x = (x, α)^T, where α′ ∈ N_c2(α), we choose α′ to differ from α in the jth element, where j is uniformly distributed in {1, 2, .