This variant therefore has slightly smaller constants in the operation count for n × n matrices. For the product (22.3) the formulae are

    S1 = A21 + A22,    M1 = S2 S6,      T1 = M1 + M2,
    S2 = S1 − A11,     M2 = A11 B11,    T2 = T1 + M4,
    S3 = A11 − A21,    M3 = A12 B21,
    S4 = A12 − S2,     M4 = S3 S7,      C11 = M2 + M3,
    S5 = B12 − B11,    M5 = S1 S5,      C12 = T1 + M5 + M6,      (22.6)
    S6 = B22 − S5,     M6 = S4 B22,     C21 = T2 − M7,
    S7 = B22 − B12,    M7 = A22 S8,     C22 = T2 + M5.
    S8 = S6 − B21,

Until the late 1980s there was a widespread view that Strassen's method was of theoretical interest only, because of its supposed large overheads for dimensions of practical interest (see, for example, [909, 1988]), and this view is still expressed by some [842, 1992].
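As a sanity check on (22.6), here is a minimal Python sketch (the function name is illustrative) of one level of the variant, written for a 2 × 2 product so that the blocks Aij, Bij reduce to scalars; since the formulae never rely on commutativity, the same sequence of operations applies verbatim when the entries are matrix blocks.

```python
def winograd_strassen_2x2(A, B):
    """One level of Winograd's variant (22.6): 7 multiplications, 15 additions."""
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B

    # Preliminary sums and differences S1..S8
    s1 = a21 + a22
    s2 = s1 - a11
    s3 = a11 - a21
    s4 = a12 - s2
    s5 = b12 - b11
    s6 = b22 - s5
    s7 = b22 - b12
    s8 = s6 - b21

    # The seven multiplications M1..M7
    m1 = s2 * s6
    m2 = a11 * b11
    m3 = a12 * b21
    m4 = s3 * s7
    m5 = s1 * s5
    m6 = s4 * b22
    m7 = a22 * s8

    t1 = m1 + m2
    t2 = t1 + m4

    return [[m2 + m3, t1 + m5 + m6],
            [t2 - m7, t2 + m5]]

print(winograd_strassen_2x2([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# [[19, 22], [43, 50]], the same as the conventional product
```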
However, in 1970 Brent implemented Strassen's algorithm in Algol-W on an IBM 360/67 and concluded that in this environment, and with just one level of recursion, the method runs faster than the conventional method for n > 110 [142, 1970]. In 1988, Bailey compared his Fortran implementation of Strassen's algorithm for the Cray-2 with the Cray library routine for matrix multiplication and observed speedup factors ranging from 1.45 for n = 128 to 2.01 for n = 2048 (although 35% of these speedups were due to Cray-specific techniques) [43, 1988]. These empirical results, together with more recent experience of various researchers, show that Strassen's algorithm is of practical interest, even for n in the hundreds. Indeed, Fortran codes for (Winograd's variant of) Strassen's method have been supplied with IBM's ESSL library [595, 1988] and Cray's UNICOS library [602, 1989] since the late 1980s.

Strassen's paper raised the question "what is the minimum exponent ω such that multiplication of n × n matrices can be done in O(n^ω) operations?" Clearly, ω ≥ 2, since each element of each matrix must partake in at least one operation.
It was 10 years before the exponent was reduced below Strassen's log₂ 7 ≈ 2.807. A flurry of publications, beginning in 1978 with Pan and his exponent 2.795 [815, 1978], resulted in reduction of the exponent to the current record 2.376, obtained by Coppersmith and Winograd in 1987 [245, 1987]. Figure 22.1 plots exponent versus time of publication (not all publications are represented in the graph); in principle, the graph should extend back to 1850!

Figure 22.1. Exponent versus time for matrix multiplication.

Some of the fast multiplication methods are based on a generalization of Strassen's idea to bilinear forms. Let A, B ∈ ℝ^(h×h). A bilinear noncommutative algorithm over ℝ for multiplying A and B with t "nonscalar multiplications" forms the product C = AB according to

    P_k = ( Σ_{i,j=1..h} u_ij^(k) a_ij ) ( Σ_{i,j=1..h} v_ij^(k) b_ij ),   k = 1: t,     (22.7a)

    c_ij = Σ_{k=1..t} w_ij^(k) P_k,                                                     (22.7b)

where the elements of the matrices W, U^(k), and V^(k) are constants. This algorithm can be used to multiply n × n matrices A and B, where n = h^k, as follows: partition A, B, and C into h² blocks Aij, Bij, and Cij of size h^(k−1), then compute C = AB by the bilinear algorithm, with the scalars aij, bij, and cij replaced by the corresponding matrix blocks. (The algorithm is applicable to matrices since, by assumption, the underlying formulae do not depend on commutativity.) To form the t products P_k of (n/h) × (n/h) matrices, partition them into h² blocks of dimension n/h² and apply the algorithm recursively. The total number of scalar multiplications required for the multiplication is t^k = n^α, where α = log_h t. Strassen's method has h = 2 and t = 7.
For 3 × 3 multiplication (h = 3), the smallest t obtained so far is 23 [683, 1976]; since log₃ 23 ≈ 2.854 > log₂ 7, this does not yield any improvement over Strassen's method. The method described in Pan's 1978 paper has h = 70 and t = 143,640, which yields α = log₇₀ 143,640 = 2.795...
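These exponents are easy to reproduce; a few lines of Python (the helper name `alpha` is mine) confirm the values α = log_h t quoted above:

```python
import math

def alpha(h, t):
    # Exponent of a bilinear algorithm using t multiplications on h x h matrices.
    return math.log(t) / math.log(h)

print(round(alpha(2, 7), 3))        # Strassen:   log2 7        ≈ 2.807
print(round(alpha(3, 23), 3))       # 3x3 scheme: log3 23       ≈ 2.854
print(round(alpha(70, 143640), 3))  # Pan 1978:   log70 143,640 ≈ 2.795
```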
In the methods that achieve exponents lower than 2.775, various intricate techniques are used. Laderman, Pan, and Sha [684, 1992] explain that for these methods "very large overhead constants are hidden in the 'O' notation", and that the methods "improve on Strassen's (and even the classical) algorithm only for immense numbers N".

A further method that is appropriate to discuss in the context of fast multiplication methods, even though it does not reduce the exponent, is a method for efficient multiplication of complex matrices. The clever formula

    (a + ib)(c + id) = ac − bd + i[(a + b)(c + d) − ac − bd]     (22.8)

computes the product of two complex numbers using three real multiplications instead of the usual four.
Since the formula does not rely on commutativity it extends to matrices. Let A = A1 + iA2 and B = B1 + iB2, where Aj, Bj ∈ ℝ^(n×n), and define C = C1 + iC2 = AB. Then C can be formed using three real matrix multiplications as

    T1 = A1 B1,
    T2 = A2 B2,
    C1 = T1 − T2,                                (22.9)
    C2 = (A1 + A2)(B1 + B2) − T1 − T2,

which we will refer to as the "3M method". This computation involves 3n³ scalar multiplications and 3n³ + 2n² scalar additions. Straightforward evaluation of the conventional formula C = A1B1 − A2B2 + i(A1B2 + A2B1) requires 4n³ multiplications and 4n³ − 2n² additions.
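A minimal pure-Python sketch of the 3M formulae (22.9), carrying each complex matrix as a real/imaginary pair (all helper names here are mine):

```python
def mat_mul(X, Y):
    """Conventional real matrix product."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def mat_sub(X, Y):
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def three_m(A1, A2, B1, B2):
    """Return (C1, C2) with C1 + i*C2 = (A1 + i*A2)(B1 + i*B2),
    using the three real products of (22.9)."""
    T1 = mat_mul(A1, B1)
    T2 = mat_mul(A2, B2)
    C1 = mat_sub(T1, T2)
    # C2 = (A1 + A2)(B1 + B2) - T1 - T2
    C2 = mat_sub(mat_sub(mat_mul(mat_add(A1, A2), mat_add(B1, B2)), T1), T2)
    return C1, C2

# A = [[1+2i, 0], [0, 1]],  B = [[3, 0], [0, 1+4i]]
C1, C2 = three_m([[1, 0], [0, 1]], [[2, 0], [0, 0]],
                 [[3, 0], [0, 1]], [[0, 0], [0, 4]])
print(C1, C2)  # [[3, 0], [0, 1]] [[6, 0], [0, 4]], i.e. C = [[3+6i, 0], [0, 1+4i]]
```

Replacing `mat_mul` by a Strassen-based multiply gives the combination used in the library routines discussed below.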
Thus, the 3M method requires strictly fewer arithmetic operations than the conventional means of multiplying complex matrices for n > 3, and it achieves a saving of about 25% for n > 30 (say). Similar savings occur in the important special case where A or B is triangular. This kind of clear-cut computational saving is rare in matrix computations!

IBM's ESSL library and Cray's UNICOS library both contain routines for complex matrix multiplication that apply the 3M method and use Strassen's method to evaluate the resulting three real matrix products.

22.2. Error Analysis

To be of practical use, a fast matrix multiplication method needs to be faster than conventional multiplication for reasonable dimensions without sacrificing numerical stability. The stability properties of a fast matrix multiplication method are much harder to predict than its practical efficiency, and need careful investigation.

The forward error bound (3.12) for conventional computation of C = AB, where A, B ∈ ℝ^(n×n), can be written

    |C − Ĉ| ≤ γ_n |A| |B|,                       (22.10)

where γ_n := nu/(1 − nu) and u is the unit roundoff. Miller [756, 1975] shows that any polynomial algorithm for multiplying n × n matrices that satisfies a bound of the form (22.10) (perhaps with a different constant) must perform at least n³ multiplications.
(A polynomial algorithm is essentially one that uses only scalar addition, subtraction, and multiplication.) Hence Strassen's method, and all other polynomial algorithms with an exponent less than 3, cannot satisfy (22.10). Miller also states, without proof, that any polynomial algorithm in which the multiplications are all of the form (Σ_{i,j} α_ij a_ij)(Σ_{i,j} β_ij b_ij) must satisfy a bound of the form

    ||C − Ĉ|| ≤ f_n u ||A|| ||B|| + O(u²).       (22.11)

It follows that any algorithm based on recursive application of a bilinear noncommutative algorithm satisfies (22.11); however, the all-important constant f_n is not specified. These general results are useful because they show us what types of results we can and cannot prove and thereby help to focus our efforts.

In the subsections below we analyse particular methods. Throughout the rest of this chapter an unsubscripted matrix norm denotes ||A|| := max_{i,j} |a_ij|. As noted in §6.2, this is not a consistent matrix norm, but we do have the bound ||AB|| ≤ n ||A|| ||B|| for n × n matrices.

22.2.1. Winograd's Method

Winograd's method does not satisfy the conditions required for the bound (22.11), since it involves multiplications with operands of the form a_ij + b_rs. However, it is straightforward to derive an error bound.

Theorem 22.1 (Brent). Let x, y ∈ ℝ^n, where n is even. The inner product computed by Winograd's method satisfies

    |x^T y − fl(x^T y)| ≤ γ_{n/2+4} Σ_{i=1..n/2} ( |x_{2i−1} + y_{2i}| |x_{2i} + y_{2i−1}| + |x_{2i−1} x_{2i}| + |y_{2i−1} y_{2i}| ).     (22.12)

Proof. A straightforward adaptation of the inner product error analysis in §3.1 produces the following analogue of (3.3):

    fl(x^T y) = Σ_{i=1..n/2} [ (x_{2i−1} + y_{2i})(x_{2i} + y_{2i−1})(1 + α_i) − x_{2i−1} x_{2i} (1 + β_i) − y_{2i−1} y_{2i} (1 + β_i′) ],

where the α_i, β_i, and β_i′ are all bounded in modulus by γ_{n/2+4}. Hence (22.12) follows on taking absolute values.

The analogue of (22.12) for matrix multiplication is ||AB − fl(AB)|| ≤ n γ_{n/2+4} (||A|| + ||B||)². Conventional evaluation of x^T y yields the bound (see (3.5))

    |x^T y − fl(x^T y)| ≤ γ_n |x|^T |y|.         (22.13)

The bound (22.12) for Winograd's method exceeds the bound (22.13) by a factor approximately max(||x||_∞/||y||_∞, ||y||_∞/||x||_∞). Therefore Winograd's method is stable if ||x||_∞ and ||y||_∞ have similar magnitude, but potentially unstable if they differ widely in magnitude. The underlying reason for the instability is that Winograd's method relies on cancellation of terms x_{2i−1} x_{2i} and y_{2i−1} y_{2i} that can be much larger than the final answer x^T y; therefore the intermediate rounding errors can swamp the desired inner product.

A simple way to avoid the instability is to scale x ← µx and y ← µ^{−1}y before applying Winograd's method, where µ, which in practice might be a power of the machine base to avoid roundoff, is chosen so that ||µx||_∞ ≈ ||µ^{−1}y||_∞, that is, µ ≈ (||y||_∞/||x||_∞)^(1/2). When using Winograd's method for a matrix multiplication AB it suffices to carry out a single scaling A ← µA and B ← µ^{−1}B such that ||A|| ≈ ||B||. If A and B are scaled so that τ^{−1} ≤ ||A||/||B|| ≤ τ, then

    ||AB − fl(AB)|| ≤ n γ_{n/2+4} (2 + τ + τ^{−1}) ||A|| ||B||.

22.2.2. Strassen's Method

Until recently there was a widespread belief that Strassen's method is numerically unstable. The next theorem, originally proved by Brent in 1970, shows that this belief is unfounded.

Theorem 22.2 (Brent).
Let A, B ∈ ℝ^(n×n), where n = 2^k. Suppose that C = AB is computed by Strassen's method and that n0 = 2^r is the threshold at which conventional multiplication is used. The computed product Ĉ satisfies

    ||C − Ĉ|| ≤ [ (n/n0)^(log₂ 12) (n0² + 5n0) − 5n ] u ||A|| ||B|| + O(u²).     (22.14)

Proof. We will use without comment the norm inequality ||AB|| ≤ n||A|| ||B|| = 2^k ||A|| ||B||. Assume that the computed product Ĉ = fl(AB) from Strassen's method satisfies

    Ĉ = AB + E,    ||E|| ≤ c_k u ||A|| ||B|| + O(u²),           (22.15)

where c_k is a constant. In view of (22.10), (22.15) certainly holds for n = n0, with c_r = n0². Our aim is to verify (22.15) inductively and at the same time to derive a recurrence for the unknown constant c_k.

Consider C11 in (22.4), and, in particular, its subterm P1. Accounting for the errors in matrix addition and invoking (22.15), we obtain

    P̂1 = (A11 + A22 + ΔA)(B11 + B22 + ΔB) + E1,

where

    |ΔA| ≤ u|A11 + A22|,    |ΔB| ≤ u|B11 + B22|,
    ||E1|| ≤ c_{k−1} u ||A11 + A22 + ΔA|| ||B11 + B22 + ΔB|| + O(u²)
          ≤ 4 c_{k−1} u ||A|| ||B|| + O(u²).

Hence

    P̂1 = P1 + F1,    ||F1|| ≤ (8·2^(k−1) + 4c_{k−1}) u ||A|| ||B|| + O(u²).

Similarly,

    P̂4 = A22 (B21 − B11 + ΔB) + E4,

where

    |ΔB| ≤ u|B21 − B11|,
    ||E4|| ≤ c_{k−1} u ||A22|| ||B21 − B11 + ΔB|| + O(u²),

which gives

    P̂4 = P4 + F4,    ||F4|| ≤ (2·2^(k−1) + 2c_{k−1}) u ||A|| ||B|| + O(u²).

Now

    Ĉ11 = fl(P̂1 + P̂4 − P̂5 + P̂7),

where P̂5 =: P5 + F5 and P̂7 =: P7 + F7 satisfy exactly the same error bounds as P̂4 and P̂1, respectively.