Chapter 12. Block LU Factorization

— ROBERT SCHREIBER, Block Algorithms for Parallel Machines (1988)

It should be realized that, with partial pivoting, any matrix has a triangular factorization. DECOMP actually works faster when zero pivots occur because they mean that the corresponding column is already in triangular form.
— GEORGE E. FORSYTHE, MICHAEL A. MALCOLM, and CLEVE B. MOLER, Computer Methods for Mathematical Computations (1977)

It was quite usual when dealing with very large matrices to perform an iterative process as follows: the original matrix would be read from cards and the reduced matrix punched without more than a single row of the original matrix being kept in store at any one time; then the output hopper of the punch would be transferred to the card reader and the iteration repeated.
— MARTIN CAMPBELL-KELLY, Programming the Pilot ACE (1981)
12.1. Block Versus Partitioned LU Factorization

As we noted in Chapter 9 (Notes and References), Gaussian elimination (GE) comprises three nested loops that can be ordered in six ways, each yielding a different algorithmic variant of the method. These variants involve different computational kernels: inner product and saxpy operations (level-1 BLAS), or outer product and gaxpy operations (level-2 BLAS). To introduce matrix–matrix operations (level-3 BLAS), which are beneficial for high-performance computing, further manipulation beyond loop reordering is needed. We will use the following terminology, which emphasises an important distinction.

A partitioned algorithm is a scalar (or point) algorithm in which the operations have been grouped and reordered into matrix operations.

A block algorithm is a generalization of a scalar algorithm in which the basic scalar operations become matrix operations ($\alpha + \beta$, $\alpha\beta$, and $\alpha/\beta$ become $A + B$, $AB$, and $AB^{-1}$), and a matrix property based on the nonzero structure becomes the corresponding property blockwise (in particular, the scalars 0 and 1 become the zero matrix and the identity matrix, respectively).
A block factorization is defined in a similar way and is usually what a block algorithm computes.

A partitioned version of the outer product form of LU factorization may be developed as follows. For $A \in \mathbb{R}^{n \times n}$ and a given block size $r$, write

$$\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} = \begin{bmatrix} L_{11} & 0 \\ L_{21} & I_{n-r} \end{bmatrix} \begin{bmatrix} U_{11} & U_{12} \\ 0 & B \end{bmatrix}, \qquad (12.1)$$

where $A_{11}$ is $r \times r$. One step of the algorithm consists of factoring $A_{11} = L_{11}U_{11}$, solving the multiple right-hand side triangular systems $L_{11}U_{12} = A_{12}$ and $L_{21}U_{11} = A_{21}$ for $U_{12}$ and $L_{21}$, respectively, and then forming $B = A_{22} - L_{21}U_{12}$; this procedure is repeated on $B$. The block operations defining $U_{12}$, $L_{21}$, and $B$ are level-3 BLAS operations. This partitioned algorithm does precisely the same arithmetic operations as any other variant of GE, but it does the operations in an order that permits them to be expressed as matrix operations.
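For concreteness, here is a minimal NumPy sketch of the partitioned outer product algorithm just described (an illustration of ours, not code from the book: the function names are invented, pivoting is omitted, and the unblocked kernel assumes nonzero pivots):

```python
import numpy as np
from scipy.linalg import solve_triangular

def lu_unblocked(A):
    # Pointwise outer product GE, no pivoting (assumes nonzero pivots).
    A = A.copy()
    for k in range(A.shape[0] - 1):
        A[k+1:, k] /= A[k, k]                              # multipliers: column of L
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])  # rank-1 update
    return A  # strict lower part holds L (unit diagonal), upper part holds U

def lu_partitioned(A, r):
    # Partitioned outer product LU: the same arithmetic as unblocked GE,
    # regrouped so that U12, L21, and B are level-3 BLAS operations.
    A = np.array(A, dtype=float)
    n = A.shape[0]
    for k in range(0, n, r):
        b = min(r, n - k)
        A[k:k+b, k:k+b] = lu_unblocked(A[k:k+b, k:k+b])    # A11 = L11 U11
        if k + b < n:
            L11 = np.tril(A[k:k+b, k:k+b], -1) + np.eye(b)
            U11 = np.triu(A[k:k+b, k:k+b])
            A[k:k+b, k+b:] = solve_triangular(             # L11 U12 = A12
                L11, A[k:k+b, k+b:], lower=True, unit_diagonal=True)
            A[k+b:, k:k+b] = solve_triangular(             # L21 U11 = A21
                U11, A[k+b:, k:k+b].T, trans='T').T
            A[k+b:, k+b:] -= A[k+b:, k:k+b] @ A[k:k+b, k+b:]  # B = A22 - L21 U12
    return np.tril(A, -1) + np.eye(n), np.triu(A)

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8)) + 8 * np.eye(8)  # diagonal dominance: safe pivots
L, U = lu_partitioned(A, r=3)
assert np.allclose(L @ U, A)
```

Since the sketch performs exactly the arithmetic of unblocked GE, its computed factors agree with those of the pointwise algorithm up to rounding.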
A genuine block algorithm computes a block LU factorization, which is a factorization $A = LU \in \mathbb{R}^{n \times n}$, where $L$ and $U$ are block triangular and $L$ has identity matrices on the diagonal:

$$L = \begin{bmatrix} I & & & \\ L_{21} & I & & \\ \vdots & & \ddots & \\ L_{m1} & L_{m2} & \cdots & I \end{bmatrix}, \qquad U = \begin{bmatrix} U_{11} & U_{12} & \cdots & U_{1m} \\ & U_{22} & \cdots & U_{2m} \\ & & \ddots & \vdots \\ & & & U_{mm} \end{bmatrix}.$$

In general, the blocks can be of different dimensions. Note that this factorization is not the same as a standard LU factorization, because $U$ is not triangular. However, the standard and block LU factorizations are related as follows: if $A = LU$ is a block LU factorization and each $U_{ii}$ has LU factorization $U_{ii} = \tilde L_{ii}\tilde U_{ii}$, then $A = \bigl(L\,\mathrm{diag}(\tilde L_{ii})\bigr)\bigl(\mathrm{diag}(\tilde L_{ii})^{-1}U\bigr)$ is an LU factorization, since the first factor is unit lower triangular and the second is upper triangular.

Conditions for the existence of a block LU factorization are easy to state.

Theorem 12.1. The matrix $A = (A_{ij})_{i,j=1}^{m} \in \mathbb{R}^{n \times n}$ has a unique block LU factorization if and only if the first $m - 1$ leading principal block submatrices of $A$ are nonsingular.

Proof. The proof is entirely analogous to the proof of Theorem 9.1.

This theorem makes clear that a block LU factorization may exist when an LU factorization does not.
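As a concrete illustration (our example, not one from the text): the matrix below has $a_{11} = 0$, so it has no LU factorization, yet with $2 \times 2$ blocks ($m = 2$) the leading block $A_{11}$ is nonsingular and a block LU factorization exists:

$$A = \begin{bmatrix} 0 & 1 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 2 & 1 \\ 0 & 1 & 1 & 2 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0 \\ 1 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 0 & 1 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 0 & 0 & 2 & 0 \\ 0 & 0 & 0 & 2 \end{bmatrix}.$$

Here $L_{21} = A_{21}A_{11}^{-1}$ and the (2,2) block of $U$ is the Schur complement $A_{22} - A_{21}A_{11}^{-1}A_{12}$.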
If $A_{11}$ is nonsingular we can write

$$A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} = \begin{bmatrix} I & 0 \\ A_{21}A_{11}^{-1} & I \end{bmatrix} \begin{bmatrix} A_{11} & A_{12} \\ 0 & S \end{bmatrix}, \qquad (12.2)$$

which describes one block step of an outer-product-based algorithm for computing a block LU factorization. Here, $S = A_{22} - A_{21}A_{11}^{-1}A_{12}$ is the Schur complement of $A_{11}$ in $A$. If the (1,1) block of $S$ of appropriate dimension is nonsingular then we can factorize $S$ in a similar manner, and this process can be continued recursively to obtain the complete block LU factorization. The overall algorithm can be expressed as follows.
Algorithm 12.2 (block LU factorization). This algorithm computes a block LU factorization $A = LU$, using the notation of (12.2).

1. $U_{11} = A_{11}$, $U_{12} = A_{12}$.
2. Solve $L_{21}A_{11} = A_{21}$ for $L_{21}$.
3. $S = A_{22} - L_{21}A_{12}$ (Schur complement).
4. Compute the block LU factorization of $S$, recursively.

Given a block LU factorization of $A$, the solution to a system $Ax = b$ can be obtained by solving $Ly = b$ by forward substitution (since $L$ is triangular) and solving $Ux = y$ by block back substitution. There is freedom in how step 2 of Algorithm 12.2 is accomplished, and how the linear systems with coefficient matrices $U_{ii}$ that arise in the block back substitution are solved. The two main possibilities are as follows.
Implementation 1: $A_{11}$ is factorized by GE with partial pivoting. Step 2 and the solution of linear systems with $U_{ii}$ are accomplished by substitution with the LU factors of $A_{11}$.

Implementation 2: $A_{11}^{-1}$ is computed explicitly, so that step 2 becomes a matrix multiplication and $Ux = y$ is solved entirely by matrix–vector multiplications. This approach is attractive for parallel machines.
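The following NumPy/SciPy sketch (ours; illustrative only, with a fixed block size and no safeguards against singular $A_{11}$) realizes Algorithm 12.2 with Implementation 1, together with the forward and block back substitution just described:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve, solve_triangular

def block_lu(A, r):
    # Algorithm 12.2 with a fixed block size r.  L is block unit lower
    # triangular (and genuinely triangular); U is block upper triangular
    # with full, non-triangular diagonal blocks U_ii.
    A = np.array(A, dtype=float)
    n = A.shape[0]
    L, U = np.eye(n), np.zeros((n, n))
    for k in range(0, n, r):
        b = min(r, n - k)
        U[k:k+b, k:] = A[k:k+b, k:]                 # step 1: U11 = A11, U12 = A12
        if k + b < n:
            lu_piv = lu_factor(A[k:k+b, k:k+b])     # Implementation 1: GEPP on A11
            # step 2: L21*A11 = A21, i.e. A11^T L21^T = A21^T.
            L[k+b:, k:k+b] = lu_solve(lu_piv, A[k+b:, k:k+b].T, trans=1).T
            A[k+b:, k+b:] -= L[k+b:, k:k+b] @ A[k:k+b, k+b:]  # step 3: Schur complement
    return L, U

def block_solve(L, U, rhs, r):
    # Ly = rhs by forward substitution, then Ux = y by block back
    # substitution, solving an r x r system U_ii x_i = d_i per block.
    n = rhs.size
    y = solve_triangular(L, rhs, lower=True, unit_diagonal=True)
    x = np.zeros(n)
    for k in range(((n - 1) // r) * r, -1, -r):
        b = min(r, n - k)
        d = y[k:k+b] - U[k:k+b, k+b:] @ x[k+b:]
        x[k:k+b] = np.linalg.solve(U[k:k+b, k:k+b], d)
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((10, 10)) + 10 * np.eye(10)
rhs = rng.standard_normal(10)
L, U = block_lu(A, r=4)
x = block_solve(L, U, rhs, r=4)
assert np.allclose(L @ U, A) and np.allclose(A @ x, rhs)
```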
What can be said about the numerical stability of partitioned and block LU factorization? Because the partitioned algorithm is just a rearrangement of standard GE, the standard error analysis applies if the matrix operations are computed in the conventional way. However, if fast matrix multiplication techniques are used (for example, Strassen's method), the standard results are not applicable. Standard results are, in any case, not applicable to block LU factorization; its stability can be very different from that of LU factorization. Therefore we need error analysis for both partitioned and block LU factorization based on general assumptions that permit the use of fast matrix multiplication.
We make two assumptions about the underlying level-3BLAS (matrix-matrix operations).then the computed approximationto(1) IfandC = AB satisfies(12.3)where c1 (m, n, p) denotes a constant depending on m, n and p.(2) The computed solutionto the triangular systems TX = B, whereandsatisfies(12.4)For conventional multiplication and substitution, conditions (12.3) and(12.4) hold with c1 (m, n, p) = n 2 and c2 (m,p) = m2. For implementationsbased on Strassen’s method, (12.3) and (12.4) hold with c1 and c2 rathercomplicated functions of the dimensions m, n, p and the threshold no thatdetermines the level of recursion (see Theorem 22.2 and [544, 1990]).12.2.
12.2. Error Analysis of Partitioned LU Factorization

An error analysis for partitioned LU factorization must answer two questions. The first is whether partitioned LU factorization becomes unstable in some fundamental way when fast matrix multiplication is used. The second is whether the constants in (12.3) and (12.4) are propagated stably into the final error bound (exponential growth of the constants would be disastrous). We will assume that the block level LU factorization is done in such a way that the computed LU factors of $A_{11} \in \mathbb{R}^{r \times r}$ satisfy

$$\hat L_{11}\hat U_{11} = A_{11} + \Delta A_{11}, \qquad \|\Delta A_{11}\| \le c_3(r)\,u\,\|\hat L_{11}\|\,\|\hat U_{11}\| + O(u^2). \qquad (12.5)$$

Theorem 12.3 (Demmel and Higham). Under the assumptions (12.3), (12.4), and (12.5), the LU factors of $A \in \mathbb{R}^{n \times n}$ computed using the partitioned outer product form of LU factorization with block size $r$ satisfy $\hat L\hat U = A + \Delta A$, where

$$\|\Delta A\| \le u\bigl(\delta(n,r)\|A\| + \theta(n,r)\|\hat L\|\,\|\hat U\|\bigr) + O(u^2), \qquad (12.6)$$

and where $\delta(n,r)$ and $\theta(n,r)$ are constants built additively from $c_1$, $c_2$, and $c_3$; they satisfy recurrences that emerge from the proof below.
Proof. The proof is essentially inductive. To save clutter we will omit "$+O(u^2)$" from each bound. For $n = r$, the result holds trivially. Consider the first block stage of the factorization, with the partitioning (12.1). The assumptions imply that

$$\hat L_{11}\hat U_{12} = A_{12} + \Delta A_{12}, \qquad \|\Delta A_{12}\| \le c_2(r, n-r)\,u\,\|\hat L_{11}\|\,\|\hat U_{12}\|, \qquad (12.7)$$

$$\hat L_{21}\hat U_{11} = A_{21} + \Delta A_{21}, \qquad \|\Delta A_{21}\| \le c_2(r, n-r)\,u\,\|\hat L_{21}\|\,\|\hat U_{11}\|. \qquad (12.8)$$

To obtain $B = A_{22} - L_{21}U_{12}$ we first compute $C = \hat L_{21}\hat U_{12}$, obtaining

$$\hat C = \hat L_{21}\hat U_{12} + \Delta C, \qquad \|\Delta C\| \le c_1(n-r, r, n-r)\,u\,\|\hat L_{21}\|\,\|\hat U_{12}\|,$$

and then subtract from $A_{22}$, obtaining

$$\hat B = A_{22} - \hat C + F, \qquad \|F\| \le u\bigl(\|A_{22}\| + \|\hat C\|\bigr). \qquad (12.9)$$

It follows that

$$\hat B = A_{22} - \hat L_{21}\hat U_{12} + \Delta B, \qquad (12.10a)$$

$$\|\Delta B\| \le u\bigl(\|A_{22}\| + (1 + c_1(n-r, r, n-r))\|\hat L_{21}\|\,\|\hat U_{12}\|\bigr). \qquad (12.10b)$$

The remainder of the algorithm consists of the computation of the LU factorization of $\hat B$, and by our inductive assumption (12.6), the computed LU factors $\hat L_B$, $\hat U_B$ satisfy

$$\hat L_B\hat U_B = \hat B + \Delta\hat B, \qquad (12.11a)$$

$$\|\Delta\hat B\| \le u\bigl(\delta(n-r, r)\|\hat B\| + \theta(n-r, r)\|\hat L_B\|\,\|\hat U_B\|\bigr). \qquad (12.11b)$$

Combining (12.10) and (12.11), and bounding $\|\hat B\|$ using (12.9), we obtain

$$\hat L_B\hat U_B = A_{22} - \hat L_{21}\hat U_{12} + \Delta A_{22}, \qquad \|\Delta A_{22}\| \le u\bigl((1 + \delta(n-r, r))\|A_{22}\| + (1 + c_1(n-r, r, n-r) + \delta(n-r, r))\|\hat L_{21}\|\,\|\hat U_{12}\| + \theta(n-r, r)\|\hat L_B\|\,\|\hat U_B\|\bigr). \qquad (12.12)$$

Collecting (12.5), (12.7), (12.8), and (12.12) we have $\hat L\hat U = A + \Delta A$, where bounds on $\|\Delta A_{ij}\|$ are given in the equations just mentioned.
These bounds for the blocks of $\Delta A$ can be weakened slightly and expressed together in the more succinct form (12.6).

The recurrences for $\delta(n,r)$ and $\theta(n,r)$ that result from collecting these bounds show that the basic error constants in assumptions (12.3), (12.4), and (12.5) combine additively at worst. Thus, the backward error analysis for the LU factorization is commensurate with the error analysis for the particular implementation of the BLAS3 employed in the partitioned factorization. In the case of the conventional BLAS3 we obtain a Wilkinson-style result for GE without pivoting, with $\theta(n,r) = O(n^3)$ (the growth factor is hidden in $\|\hat L\|$ and $\|\hat U\|$).
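To see Theorem 12.3 at work (our experiment; it reuses the hypothetical `lu_partitioned` sketch from Section 12.1), one can evaluate the scaled backward error $\|\hat L\hat U - A\| / \bigl(u(\|A\| + \|\hat L\|\,\|\hat U\|)\bigr)$ for several block sizes; it stays modest for all of them:

```python
import numpy as np

u = np.finfo(np.float64).eps / 2
norm = lambda X: np.abs(X).max()
rng = np.random.default_rng(3)
A = rng.standard_normal((96, 96)) + 96 * np.eye(96)   # well conditioned: modest growth
for r in (4, 8, 16, 32):
    L, U = lu_partitioned(A, r)   # the sketch from Section 12.1
    eta = norm(L @ U - A) / (u * (norm(A) + norm(L) * norm(U)))
    print(f"r = {r:2d}:  ||LU - A|| / (u (||A|| + ||L|| ||U||)) = {eta:.2f}")
```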
Although the above analysis is phrased in terms of the partitioned outer product form of LU factorization, the same result holds for other "ijk" partitioned forms (with slightly different constants), for example, the gaxpy or sdot forms. There is no difficulty in extending the analysis to cover partial pivoting and solution of $Ax = b$ using the computed LU factorization (see Problem 12.6).

12.3. Error Analysis of Block LU Factorization

Now we turn to block LU factorization. We assume that the computed matrices $\hat L_{21}$ from step 2 of Algorithm 12.2 satisfy

$$\hat L_{21}A_{11} = A_{21} + \Delta A_{21}, \qquad \|\Delta A_{21}\| \le c_4(n)\,u\,\|\hat L_{21}\|\,\|A_{11}\| + O(u^2). \qquad (12.13)$$

We also assume that when a system $U_{ii}x_i = d_i$ of order $r$ is solved, the computed solution $\hat x_i$ satisfies

$$(U_{ii} + \Delta U_{ii})\hat x_i = d_i, \qquad \|\Delta U_{ii}\| \le c_5(n)\,u\,\|U_{ii}\| + O(u^2). \qquad (12.14)$$

The assumptions (12.13) and (12.14) are satisfied for Implementation 1 of Algorithm 12.2 and are sufficient to prove the following result.

Theorem 12.4 (Demmel, Higham, and Schreiber).