supercomp91 (1158319), страница 4

Файл №1158319 supercomp91 (Раздаточные материалы) 4 страницаsupercomp91 (1158319) страница 42019-09-182019-09-18СтудИзба

Раздаточные материалы

Просмтор этого файла доступен только зарегистрированным пользователям. Но у нас супер быстрая регистрация: достаточно только электронной почты!

Регистрация/авторизация

Текст из файла (страница 4)

The rst parameter is the setof elements over which the second method is to be applied. In this case, our indexing scheme on the treeallows the tree levels to be described as a range ofvalues.We have used the power of C++ to overload theoperator \*" and \+=" to make it easy to expressthe dotproduct in the usual manner. Because elem is7more work associated with it than the original pointoperator, we now have enough parallel slack to maskboth the latencies of communication and other systemoverhead. Currently the process of replacing the basicelement in the collection by a matrix block is done byprogrammer. In the future we plan to have it done automatically by the compiler in most of the cases.

Wenow list the GP1000 times in the Table 2 for a matrixof size 1024 by 1024. Running this same blocked codeon the 12 processor I860 based Alliant Fx2800 we havethe results in the Table 3. It must be noted that wecheated slightly in reporting the speedup number forthe 2800. The actual speed of this code on one processor is 3.98 Mops when run with the I860 cache enabled. Unfortunately, in concurrent mode the cachesare all disabled so we have based the speedups on theone processor concurrent mode times.Many readers may be concerned that the Mopnumbers reported above are very small relative to theactual peak rate of the machine. In fact, for the BBNGP1000, the node processor is the old M68020 whichis not very fast.

The numbers above are very goodfor that machine. For the I860 we see two problems.First the Alliant code generator is not yet mature andthe cache problem must be xed before we can get better numbers. However there is another solution to thisproblem.

Because of the modular nature of the PC++code it is very easy to use assembly coded kernels foroperations such as the 32 by 32 block matrix multiply.To illustrate this point we have done exactly this forthe old Alliant FX/8 with 4 processors. The resultsare shown in the Table 4. These numbers are verynear the peak speeds of 32 Mops for this machine.We intend to build a library of hand coded assemblyroutines for the I860 processor to speed performanceon standard collections running on the Fx2800 and theIntel IPSC.declared as an element of matrix, it inherits thisindexfrom the distributedarray collection. Thisindex willgive the index of the current element. operator () isalso overloaded in the distributedarray for the accessof a particular element given indexes.Finally, we show the main program below:matrix<elem>A([MAXPROC],[M,M],[Block,Whole]);matrix<elem>B([MAXPROC],[M,M],[Block,Whole]);matrix<elem>C([MAXPROC],[M,M],[Block,Whole]);main() {init();A.matmul(&B,&C);}This version of matrix multiplication is known asa \point" algorithm, i.e., the basic element containsa single oating point value.

The experienced parallel programmer knows that point parallel algorithmperform very poorly on most machines. The reasonis that the total amount of communication relative toarithmetic is very high. Indeed, in our example, thereferences to B(i; j) and C(k; j) in the inner loop mayinvolve interprocessor communication. If we have twointerprocessor communications for a single scalar multiply and scalar add, you will not see very good performance. Taking this program almost as is (the onlychange involved substituting a function name for theoperator because overloading is not completely supported at the time of this writing) and running it onthe BBN GP1000 we see the problem with a point algorithm very clearly.

Table 1 shows the performanceof the matrix multiply in megaops and the speed-upover the execution on one processor. The size of thematrix was 48 by 48. The speed-up reported in thispaper is calculated with respect to the execution ofPC++ programs on a single node.Fortunately there has been a substantial amount ofrecent work on blocked algorithms for matrix computation that we can take advantage of. If we replaceour basic element in the matrix collection by a smallmatrix block, we can achieve a substantial improvement in performance.

The only dierence in the twocodes is that we have a element of the form5.2 PDE Computations: The ConjugateGradient MethodOne of the most frequently used methods of solving elliptic partial dierential equations is the methodof conjugate gradients (CG). In this example, we willlook at a simple version of CG for a two dimensionnite dierence operator. We begin by dening a collection called Grid to represent a simple M by M meshon the unit square. The nite dierence approximation to the partial dierential operator is given by theequationAu= fclass elem ElementTypeOf matrix {float x[32][32];friend elem operator *(elem *, elem *);elem operator +=( elem &);dotproduct(matrix *B,matrix *C);...};The main dierence is that the overloaded \*" operator is now written as a 32 by 32 subblock matrix multiply. Because this operator has a substantial amount8p=1p=2 p=3 p=4 p=6p=8p = 12 p = 16Mops0.0041 0.0073 0.0104 0.01298 0.01087 0.01435 0.0083 0.0081Speed-up 1.01.782.543.1652.653.52.021.97Table 1: Pointwise Matrix Multiply of size 48 by 48 on the GP1000p=1p = 2 p = 3 p = 4 p = 6 p = 8 p = 12 p = 16Mops0.0775 0.1547 0.2319 0.3091 0.4627 0.6166 0.9137 1.21187Speed-up 1.01.992.993.985.977.9511.815.6Table 2: Blocked Matrix Multiply of size 1024 by 1024 on the GP1000where u is a two dimensional array of unknowns and fis a two dimensional array of \right hand side" values.Each component of u and f is associated with onevertex in the grid.

The operator A is taken to be asimple nite dierence operator of the formint left_edge(){if (thisindex[1]==0)return 1;else return 0;}int right_edge(){if (thisindex[1]==M-1)return 1;else return 0;}double localdot(int, int);};A(ui;j ) = ui;j 0:25 (ui 1;j +ui+1;j +ui;j 1 +ui;j +1):The boundary conditions will be that u is zero alongthe boundary of our grid.

Our rst approach to thisproblem will be to have each element of the Grid collection correspond to one vertex in the grid. This vertex element will contain one component of each of u,f and four other arrays e, p, q and r needed by theCG algorithm. Because of local structure of the nite dierence operator, each vertex element must beable to access its neighbors to the north, south, eastand west. Fortunately, the Collection DistributedArray has, in two dimensions, the self(i,j) operator taketwo parameters corresponding to displacements alongthe two dimensions of the array. Consequently, theoperator A is almost a trivial generalization of the local average operator illustrated in section 4.1.

Theonly major dierence is that each element must knowwhen it is on the boundary of the collection so thatthe boundary conditions may be applied. Because thisis a property of the global topology of the grid, we willput a set of special boolean functions, top edge(), bottom edge(), left edge, right edge in the collection thatreturns 1 if an element is on the corresponding boundary. Our Grid collection now takes the formThe CG algorithm requires that we frequently usethe inner product operationXdotprod(u; f) = ui;j fi;j :i;jComputation of the dotprod function requires that weform the local products (which we compute using aelement function localdot() and do a global sum overthe entire Grid collection. (The protocol for usingthe dotprod function is to give each variable in thelocal element a unique integer identier.

The call dotprod(3,2) will cause the element method localdot(3,2)to be called which, in turn, should compute the products of the 3rd and 2nd variables to be multiplied. Amore elegant solution exists, but was not used in theexperiments described here.)At this point it is not hard to complete this \point"version of the CG algorithm, but, just as with thepoint version of matrix multiply, we will nd thatthe communication between elements on dierent processors will dominate the execution time. Again, toachieve better parallel slackness, we can block thecomputation as illustrated in Figure 3.

Each blockis in fact a small subgrid of size N by N. If we nowdene our basic element of the Grid collection to besuch a subgrid we haveCollection Grid: DistributedArray{public:Grid(vconstant *P,vconstant *G,vconstant *D);double dotprod(int, int);MethodOfElement:int top_edge(){if (thisindex[0]==0)return 1;else return 0;}int bottom_edge(){if (thisindex[0]==M-1)return 1;else return 0;}class SubGrid ElementTypeOf Grid{public:float_array f[N][N], u[N][N],e[N][N];float_array p[N][N], q[N][N], r[N][N];void Ap();double localdot(int, int);9p=1 p = 2 p = 3 p = 4 p = 6 p = 12Mops2.26 4.26 6.48 8.56 12.78 22.81Speed-up 1.0 1.88 2.86 3.78 5.65 10.09Table 3: Blocked Matrix Multiply of size 1024 by 1024 on the FX2800p=1 p = 2 p = 3 p = 4Mops7.91 15.69 23.11 30.74Speed-up 1.0 1.98 2.92 3.88Table 4: Blocked Matrix Multiply of size 1024 by 1024 on the FX/8 with assembly BLAS3In the case of the Alliant Fx2800 we have the resultsin the Table 6.

In this case, increasing the block sizedoes not help the Fx2800 because, unlike the matrixmultiply example, the CG algorithm must sweep overall the subblocks before one is reused. This causes theAlliant cache to overow and we are then limited bythe bus bandwidth to the speed-ups above.};where oat array is a C++ class that has all thebasic arithmetic operators overloaded for \vector" operations. Everything we have said about the pointalgorithm holds for the blocked \subgrid" algorithm,except our grid will actually be of size N M by N M.The body of the CG computation is shown below.Grid<SubGrid> G([MAXPROC], [M,M], [Block,Block]);conjugategradient(){int i;double gamma, alfa, tau, gamnew, beta;G.Ap(); // q = A*pG.r = G.f - G.q;gamma = G.dotprod(R,R);G.p = G.r;for(i = 0; i < M*N*M*N && gamma > 1.0e-12 ; i++){G.Ap(); // q = A*ptau = G.dotprod(P,Q);// tau = <p,q>if(tau == 0.0) {printf("tau = 0\n"); exit(1);}alfa = gamma/tau;G.r = G.r - alf*G.q;gamnew = G.dotprod(R,R); // gamnew = <r,r>beta = gamnew/gamma;G.u = G.u + alfa*G.p; G.p = G.r + beta*G.p;gamma = gamnew;}}The performance of this algorithm for the GP1000is shown in the Table 5 for dierent block sizes anddierent distribution schemes.

Характеристики

Тип файла

PDF-файл

Размер

257,78 Kb

Материал

Раздаточные материалы

Тип материала

Другое

Предмет

Модели параллельных вычислений и DVM технология разработки параллельных программ

Высшее учебное заведение

МГУ им. Ломоносова

Список файлов учебной работы

razdatochnye-materialy.rar

Раздаточные материалы

SAGE

stage-fin

2000_11_NT

doc

Installation guide (rus).doc

Installation guide.doc

User guide (rus).doc

User guide.doc

2000_11_UNIX

doc

Installation guide (rus).doc

Installation guide.doc

User guide (rus).doc

Полное содержание архива

Поделитесь ссылкой:

Ставлю 10/10
Все нравится, очень удобный сайт, помогает в учебе. Кроме этого, можно заработать самому, выставляя готовые учебные материалы на продажу здесь. Рейтинги и отзывы на преподавателей очень помогают сориентироваться в начале нового семестра. Спасибо за такую функцию. Ставлю максимальную оценку.

Лучшая платформа для успешной сдачи сессии
Познакомился со СтудИзбой благодаря своему другу, очень нравится интерфейс, количество доступных файлов, цена, в общем, все прекрасно. Даже сам продаю какие-то свои работы.

Студизба ван лав ❤
Очень офигенный сайт для студентов. Много полезных учебных материалов. Пользуюсь студизбой с октября 2021 года. Серьёзных нареканий нет. Хотелось бы, что бы ввели подписочную модель и сделали материалы дешевле 300 рублей в рамках подписки бесплатными.

Отличный сайт
Лично меня всё устраивает - и покупка, и продажа; и цены, и возможность предпросмотра куска файла, и обилие бесплатных файлов (в подборках по авторам, читай, ВУЗам и факультетам). Есть определённые баги, но всё решаемо, да и администраторы реагируют в течение суток.

Маленький отзыв о большом помощнике!
Студизба спасает в те моменты, когда сроки горят, а работ накопилось достаточно. Довольно удобный сайт с простой навигацией и огромным количеством материалов.

Студ. Изба как крупнейший сборник работ для студентов
Тут дофига бывает всего полезного. Печально, что бывают предметы по которым даже одного бесплатного решения нет, но это скорее вопрос к студентам. В остальном всё здорово.

Спасательный островок
Если уже не успеваешь разобраться или застрял на каком-то задание поможет тебе быстро и недорого решить твою проблему.

Всё и так отлично
Всё очень удобно. Особенно круто, что есть система бонусов и можно выводить остатки денег. Очень много качественных бесплатных файлов.

Отзыв о системе "Студизба"
Отличная платформа для распространения работ, востребованных студентами. Хорошо налаженная и качественная работа сайта, огромная база заданий и аудитория.

Отличный помощник
Отличный сайт с кучей полезных файлов, позволяющий найти много методичек / учебников / отзывов о вузах и преподователях.

Отлично помогает студентам в любой момент для решения трудных и незамедлительных задач
Хотелось бы больше конкретной информации о преподавателях. А так в принципе хороший сайт, всегда им пользуюсь и ни разу не было желания прекратить. Хороший сайт для помощи студентам, удобный и приятный интерфейс. Из недостатков можно выделить только отсутствия небольшого количества файлов.

Спасибо за шикарный сайт
Великолепный сайт на котором студент за не большие деньги может найти помощь с дз, проектами курсовыми, лабораторными, а также узнать отзывы на преподавателей и бесплатно скачать пособия.