The ILLIAC family supercomputer (ILLInois Automated Computer), page 2
However, each PE has its own independent memory (PEM). To achieve independence of addressing within the PE memories, each PE is provided with an index register. The contents of this index register can be specified to be added to the content of the address field. The simplest example of the utility of the index register lies in the fetching of a matrix that is stored skewed (see illustration below).
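The skewed-storage scheme can be modelled in a few lines. The following is a minimal Python sketch, not ILLIAC IV code: the 64-PE array, the local address layout, and the rotation convention (element A[i][j] placed in PE (i + j) mod 64 at local address i) are illustrative assumptions.

```python
# Sketch of skewed matrix storage across PE memories (illustrative, not ILLIAC code).
# A 64x64 matrix A is stored skewed: element A[i][j] lives in the memory of
# PE (i + j) % 64, at local address i. A row can then be fetched with no
# indexing (every PE reads the same local address), while a column is fetched
# by having each PE add a function of its own number to the address, as the
# index register allows.

N_PE = 64
A = [[i * 1000 + j for j in range(N_PE)] for i in range(N_PE)]

# pem[p][addr] is the word at local address addr in the memory of PE p.
pem = [[None] * N_PE for _ in range(N_PE)]
for i in range(N_PE):
    for j in range(N_PE):
        pem[(i + j) % N_PE][i] = A[i][j]

def fetch_row(i):
    # No indexing: all PEs read the same local address i.
    # PE p receives A[i][(p - i) % 64]; a rotation restores natural order.
    words = [pem[p][i] for p in range(N_PE)]
    return [words[(i + j) % N_PE] for j in range(N_PE)]

def fetch_column(j):
    # Indexed fetch: PE p adds (p - j) % 64, a function of its own number,
    # to the base address, so PE p reads A[(p - j) % 64][j].
    words = [pem[p][(p - j) % N_PE] for p in range(N_PE)]
    return [words[(i + j) % N_PE] for i in range(N_PE)]

assert fetch_row(5) == A[5]
assert fetch_column(4) == [A[i][4] for i in range(N_PE)]
```

The point of the skew is visible in the two fetch functions: both a full row and a full column touch all 64 PE memories at once, so neither access serializes on a single PEM.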
By not indexing, a matrix row is fetched. By inserting a function of the PE's own number into the index register, and then indexing the address, a column of the matrix is fetched into the PE's (column 4 is shown in black below left). Indexing at the PE level is also allowed on shift counts of shift instructions, and on the bit number of the bit-manipulation instructions, providing for versatility in nonnumerical processing. In double precision arithmetic, for example, the normalization correction, once developed, is inserted into the index register and used to control the double-length shift that accomplishes normalization.

BROADCASTING

Every PE instruction is sent to the PE's from the CU as 266 enable levels to each and all PE's.
Accompanying this set of enable levels is a 64-bit "common data bus" that also goes to all PE's. Depending on the specific instruction and variant thereof, the data on the common data bus is used within the PE's as a 64-bit literal, or the least significant 16 bits may be used as the address of an operand in memory, or to designate a register in the PE, or as a shift distance.
Memory addresses and shift distances may be further indexed in the PE, as described above. This facility to transmit 64-bit literals is used in the software to transmit or "broadcast" to the PE's those variables that are constant across all PE's, as in an expression of the form A + B(P), where P is the PE number. Variable A will be fetched to the CU and broadcast, being the same for all PE's; whereas B(P) is distinct in each PE and will be fetched out of the PE's memory.

PERFORMANCE

Algorithms for estimating the "speed", "power", or "throughput" of computers have been the most elusive and illusory in the industry.
Among the simpler means employed have been the clock rate, add time, and more recently, "Mips" (million instructions per second). Benchmark programs have also been used with more success. Presented here is a discussion of the clock rate, the instruction times for some typical instructions, and a discussion of Mips. Such oversimplifications are recommended for use on only the grossest basis for comparison among various computers, since the actual performance on any given application is dependent on factors such as the memory allocation algorithm used and the compiler efficiency, as well as the application itself and the efficiency of the application programs. Listed below are salient characteristics of ILLIAC IV which depict the "speed", "power", and "throughput" of the machine.
The following data are based on a 64-PU system. Future copies can also be built with 8, 16, or 32 PU's.

Data Rates

Between the 64 PE's and their 64 PEM's - 9,362,285,714 bits/sec (64 paths at 64 bits each, 312.5 to 437.5 nanoseconds)
Between array memory and parallel disks - 1,004,000,000 bits/sec (980,468 transfers/sec of 1024 bits each)
Word bus of the B 6700 - 40,000,000 bits/sec

Memory Sizes

Array Subsystem (may be expanded in future copies) - 8,388,608 bits, or 1,048,576 bytes, or 131,072 words of 64 bits, or 262,144 words of 32 bits
B 6700 Subsystem (modular in increments of 16,384 words) - 524,288 words of 48 bits
Parallel disk file (modular in increments of 78,643,200 bits) - 2,516,582,400 bits

Memory Times

Array - cycle time: 200 nanoseconds (bare memory); 312.5 nanoseconds (in system)
B 6700 - cycle time: 1.2 microseconds
Parallel disk system - access time (average access time for a single item): 19.6 milliseconds

Clock Rates (MHz)

[Clock-rate values for the Array Subsystem, I/O Subsystem, B 6700 Subsystem, and the clock track on the parallel disk file are not recoverable from this copy.]

FLOATING POINT ADD AND MULTIPLY TIMES

Floating point add time, including alignment of operands, addition, and normalization of the sum, is 4.9 nanoseconds per pair of 64-bit operands, and 2.9 nanoseconds per pair of 32-bit operands. This is based on the time to execute one add instruction simultaneously on 64 pairs of 64-bit operands in 312.5 nanoseconds, and on the execution of one add instruction simultaneously on 128 pairs of 32-bit operands in 375 nanoseconds.

Floating point multiply time, including normalization of the product, is 8.8 nanoseconds per pair of 64-bit operands and 4.9 nanoseconds per pair of 32-bit operands. This is based on the time to execute one multiply instruction simultaneously on 64 pairs of 64-bit operands in 563 nanoseconds, and on executing one multiply instruction simultaneously on 128 pairs of 32-bit operands in 625 nanoseconds.

These add and multiply times assume that the operands are already in place in the registers of the PE's. This was true for 50 percent of the add instructions and for over one third of the multiply instructions in the codes surveyed.
For the other 50 percent of the add instructions and the more than 60 percent of the multiply instructions, a memory access time of 375 nanoseconds must be added. Overlap in the Control Unit will hide an undetermined fraction of this memory access time.

ILLUSTRATIVE KERNELS

To illustrate ILLIAC IV's speed of operation, the times required to evaluate some simple arithmetic expressions are given. [Kernel timing table not recoverable from this copy: expressions evaluated for all j over 64 variables, including the method of nested polynomials, quoted at 0.875n.]

MIPS

"Mips" (million instructions per second) is a measure that has been widely used to describe the computing capabilities of various machines. A figure of 200 Mips is used for the ILLIAC IV, based on a large set of assumptions, which are necessary as there is no agreement in the industry as to what constitutes an instruction. These assumptions include:

- The instruction frequencies counted from 12 different codes are typical.
- The overlap in the Control Unit is 95 percent efficient, and the proportion of Control Unit instructions is less than 25 percent of the total.
- The code uses the full 64 Processing Elements efficiently.

For codes using only one of the 64 Processing Elements at a time, the figure of 4.5 Mips is used.

THE EFFECT OF PARALLELISM ON THROUGHPUT

As is well known, the throughput of ILLIAC IV is problem dependent. To oversimplify, assume some fraction, x, of a problem is stubbornly serial and that the rest of the problem is perfectly parallel and fits the 64-PE width. Also assume that the problem takes a time, Ts, on a hypothetical one-PE machine; then it takes a time, Tp, on the parallel machine:

    Tp = Ts (x + (1 - x)/64)

Thus, if 90 percent of the problem is perfectly parallel and 10 percent is stubbornly serial, one expects a speed improvement by a factor of 8.8 on ILLIAC IV.

The above analysis is an oversimplification because, in general, portions of problems are neither perfectly parallel nor perfectly stubborn about being serial. Among the items that can be done to allow some parallelism of operation in a problem that appears to be serial are:

- Adopt some variation in the algorithm that allows parallelism. In some cases the variant algorithm will be less efficient mathematically than the stubbornly serial algorithm. However, because it allows parallelism, it causes a net gain in efficiency on the ILLIAC IV. Matrix inversion and data transfers offer examples of this sort.
- Reallocate the memory so that the variables that were all in one PE, and therefore could only be treated one at a time, are spread across many PE's.

Thus, programs that are executed on the ILLIAC IV, in comparison to the same problems being solved on a hypothetical one-PE serial machine, run with an efficiency that is a combination of many effects. For example:

- Some small portion of the problem really is stubbornly serial, and runs with no speed-up over the one-PE comparison case.
- Some larger portion of the problem runs efficiently in parallel, approaching a 64:1 speed-up over the one-PE comparison case.
- Much of the program runs in parallel on the machine, with varying degrees of efficiency.
This variation is due to parallel operations that do not use 64 PE's and also to the cases in which the parallelism was bought at the price of a less efficient algorithm. For these sections of the code, the speed-up over the one-PE comparison case is much greater than 1:1 but less than 64:1.

Therefore, it can be seen that the speed-up factor available on the ILLIAC IV is dependent on the application and on the cleverness of the programmer; it will vary between 1:1 and 64:1 over a hypothetical machine with only one PE.
It is also true that better speed-up is expected in practice than that predicted by a naive separation of problems into "serial" vs. "parallel".

APPLICATIONS

The power of parallel processing is most effectively applied to solving problems when they are arranged in a parallel manner. To illustrate the methods involved, two illustrations (one computational, one non-computational) are given.

FOURIER TRANSFORMS

As an analytical technique, transformation to the frequency domain has too many applications to catalog. Fairly recent developments have produced a collection of "fast Fourier transform" methods. Perhaps the first widely read of these is the article by Cooley and Tukey.

Fast Fourier transforms turn out to be ideal for ILLIAC IV, operating at almost full efficiency, thereby opening up all those problems amenable to solution in the frequency domain to efficient attack. These include linear differential equations, filtering, simulation, and signal analysis.

The fast Fourier transform method is described as follows. Let Ak, where k runs from 0 to 2^m - 1, be the 2^m (= N) samples of some time function. Fragment the indices into their individual bits, setting j = j(m-1)·2^(m-1) + ... + j(0) and k = k(m-1)·2^(m-1) + ... + k(0). The method is based on the observation that W^jk is the product of factors related to the fragmented indices, and that these individual factors (such as those depending only on the bit j(m-1) or only on the bit k(0)) are repeated in a regular way in the 2^2m values of W^jk; there are only 2m of these factors.

The transformation process proceeds as follows. The Ak, multiplied by appropriate W factors, are combined pair-wise into a vector B1,k. The vector B1,k is combined pair-wise into a vector B2,k, and so on until a vector Bm,k is formed. The pair-wise combination of the elements of each succeeding vector involves elements of the vector that are successively closer by factors of two.

The vector Bm,k is the desired transform Xk, except for the ordering of the index. If we reverse the order of the bits in the index k (call that k'), then we find that Bm,k' = Xk.
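The scheme just described, with pair-wise combinations over successively halved distances and a final bit-reversal of the index, corresponds to what is now called a radix-2 decimation-in-frequency FFT. Here is a minimal serial Python sketch, with our own function names and the sign convention W = e^(-2*pi*i/N) assumed, checked against a direct DFT:

```python
import cmath

def fft_dif(a):
    # Decimation-in-frequency FFT: successive vectors B1, B2, ..., Bm are formed
    # by pair-wise combination of elements that are successively closer by
    # factors of two; the result comes out with its index in bit-reversed order.
    n = len(a)                       # n = 2**m samples
    b = list(a)
    span = n // 2
    while span >= 1:                 # m stages: B1 ... Bm
        w = cmath.exp(-2j * cmath.pi / (2 * span))
        for start in range(0, n, 2 * span):
            for i in range(span):
                p, q = b[start + i], b[start + i + span]
                b[start + i] = p + q
                b[start + i + span] = (p - q) * w ** i
        span //= 2
    # Undo the bit-reversed ordering: X[k] = Bm[k'], k' = bits of k reversed.
    m = n.bit_length() - 1
    rev = lambda k: int(format(k, f'0{m}b')[::-1], 2)
    return [b[rev(k)] for k in range(n)]

def dft(a):
    # Direct O(N^2) transform, used only as a correctness reference.
    n = len(a)
    return [sum(a[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

x = [1.0, 2.0, 0.0, -1.0, 3.0, 1.0, 0.5, 2.5]
assert all(abs(u - v) < 1e-9 for u, v in zip(fft_dif(x), dft(x)))
```

On ILLIAC IV each pair-wise combination stage would run across the PE array in lock step; the serial loop over `start` and `i` here stands in for that parallel step.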