Volume 1 Application Programming (794095), page 50
For details, see “Save and Restore State” on page 156.

4.11.2 Parameter Passing

128-bit media procedures can use MOVx instructions to pass data to other such procedures. This can be done directly, via the XMM registers, or indirectly by storing data on the procedure stack. When storing to the stack, software should use the rSP register for the memory address and, after the save, explicitly decrement rSP by 16 for each 128-bit XMM register parameter stored on the stack. Likewise, to load a 128-bit XMM register from the stack, software should increment rSP by 16 after the load. There is a choice of MOVx instructions designed for aligned and unaligned moves, as described in “Data Transfer” on page 135 and “Data Transfer” on page 157.

The processor does not check the data type of instruction operands prior to executing instructions. It only checks them at the point of execution.
For example, if the processor executes an arithmetic instruction that takes double-precision operands but is provided with single-precision operands by MOVx instructions, the processor will first convert the operands from single precision to double precision prior to executing the arithmetic operation, and the result will be correct. However, the required conversion may cause degradation of performance.

Because of this possibility of data-type mismatching between MOVx instructions used to pass parameters and the instructions in the called procedure that subsequently operate on the moved data, the calling procedure should save its own state prior to the call. The called procedure cannot determine the caller’s data types, and thus it cannot optimize its choice of instructions for storing a caller’s state. For further information, see the software optimization documentation for particular hardware implementations.

4.11.3 Accessing Operands in MMX™ Registers

Software may freely mix 128-bit media instructions (integer or floating-point) with 64-bit media instructions (integer or floating-point) and general-purpose instructions in a single procedure.
There are no restrictions on transitioning from 128-bit media procedures to x87 procedures, except when a 128-bit media procedure accesses an MMX register by means of a data-transfer or data-conversion instruction.

In such cases, software should separate such procedures or dynamic link libraries (DLLs) from x87 floating-point procedures or DLLs by clearing the MMX state with the EMMS instruction, as described in “Exit Media State” on page 209. For further details, see “Mixing Media Code with x87 Code” on page 233.

AMD64 Technology, 24592—Rev. 3.13—July 2007, 128-Bit Media and Scientific Programming

4.12 Performance Considerations

In addition to typical code optimization techniques, such as those affecting loops and the inlining of function calls, the following considerations may help improve the performance of application programs written with 128-bit media instructions.

These are implementation-independent performance considerations. Other considerations depend on the hardware implementation.
For information about such implementation-dependent considerations and for more information about application performance in general, see the data sheets and the software-optimization guides relating to particular hardware implementations.

4.12.1 Use Small Operand Sizes

The performance advantages available with 128-bit media operations are to some extent a function of the data sizes operated upon. The smaller the data size, the more data elements can be packed into a single 128-bit vector.
The parallelism of computation increases as the number of elements per vector increases.

4.12.2 Reorganize Data for Parallel Operations

Much of the performance benefit from the 128-bit media instructions comes from the parallelism inherent in vector operations. It can be advantageous to reorganize data before performing arithmetic operations so that its layout after reorganization maximizes the parallelism of the arithmetic operations.

The speed of memory access is particularly important for certain types of computation, such as graphics rendering, that depend on the regularity and locality of data-memory accesses.
For example, in matrix operations, performance is high when operating on the rows of the matrix, because row bytes are contiguous in memory, but lower when operating on the columns of the matrix, because column bytes are not contiguous in memory and accessing them can result in cache misses. To improve performance for operations on such columns, the matrix should first be transposed.
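As a rough illustration, a 4x4 single-precision matrix stored row by row can be transposed entirely in XMM registers. The sketch below is written in C with SSE intrinsics rather than assembly. The function name transpose4x4 is hypothetical, and the instruction sequence shown (UNPCKLPS and UNPCKHPS followed by MOVLHPS and MOVHLPS) is one common choice, not a sequence prescribed by this manual. The scalar fallback is there only so that the sketch also compiles on targets without SSE.

```c
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>   /* also provides the SSE (float) intrinsics */
#define SKETCH_HAVE_SSE 1
#endif

/* Hypothetical helper: out = transpose of in, both 4x4 row-major.
   The in and out arrays must not overlap. */
void transpose4x4(const float in[16], float out[16])
{
#ifdef SKETCH_HAVE_SSE
    __m128 r0 = _mm_loadu_ps(in + 0);     /* a0 a1 a2 a3 */
    __m128 r1 = _mm_loadu_ps(in + 4);     /* b0 b1 b2 b3 */
    __m128 r2 = _mm_loadu_ps(in + 8);     /* c0 c1 c2 c3 */
    __m128 r3 = _mm_loadu_ps(in + 12);    /* d0 d1 d2 d3 */

    __m128 t0 = _mm_unpacklo_ps(r0, r1);  /* UNPCKLPS: a0 b0 a1 b1 */
    __m128 t1 = _mm_unpacklo_ps(r2, r3);  /* UNPCKLPS: c0 d0 c1 d1 */
    __m128 t2 = _mm_unpackhi_ps(r0, r1);  /* UNPCKHPS: a2 b2 a3 b3 */
    __m128 t3 = _mm_unpackhi_ps(r2, r3);  /* UNPCKHPS: c2 d2 c3 d3 */

    _mm_storeu_ps(out + 0,  _mm_movelh_ps(t0, t1));  /* a0 b0 c0 d0 */
    _mm_storeu_ps(out + 4,  _mm_movehl_ps(t1, t0));  /* a1 b1 c1 d1 */
    _mm_storeu_ps(out + 8,  _mm_movelh_ps(t2, t3));  /* a2 b2 c2 d2 */
    _mm_storeu_ps(out + 12, _mm_movehl_ps(t3, t2));  /* a3 b3 c3 d3 */
#else
    /* Scalar fallback with the same result. */
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            out[4 * r + c] = in[4 * c + r];
#endif
}
```

After the call, each row of out holds what was a column of in, so column-oriented work can proceed on contiguous data.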
Such transpositions can, for example, be done using a sequence of unpacking or shuffle instructions.

4.12.3 Remove Branches

Branches can be replaced with 128-bit media instructions that simulate predicated execution or conditional moves, as described in “Branch Removal” on page 114.
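The idea can be sketched in C with SSE2 intrinsics: a packed compare produces an all-ones or all-zeros mask per element, and AND, AND-NOT, and OR then merge the two sources according to that mask, so no conditional branch is executed. The helper name select_greater_epi16 and the greater-than condition are assumptions chosen for illustration. The scalar fallback applies the same mask logic one element at a time and exists only so the sketch compiles on targets without SSE2.

```c
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define SKETCH_HAVE_SSE2 1
#endif

/* Hypothetical helper: for eight signed 16-bit elements,
   dst[i] = (a[i] > b[i]) ? a[i] : b[i], without branching on the data. */
void select_greater_epi16(const int16_t a[8], const int16_t b[8],
                          int16_t dst[8])
{
#ifdef SKETCH_HAVE_SSE2
    __m128i va   = _mm_loadu_si128((const __m128i *)a);
    __m128i vb   = _mm_loadu_si128((const __m128i *)b);
    __m128i mask = _mm_cmpgt_epi16(va, vb);       /* PCMPGTW: ones where a > b */
    __m128i pa   = _mm_and_si128(mask, va);       /* PAND:  a where mask is set */
    __m128i pb   = _mm_andnot_si128(mask, vb);    /* PANDN: b where mask is clear */
    _mm_storeu_si128((__m128i *)dst, _mm_or_si128(pa, pb));  /* POR: merge */
#else
    for (int i = 0; i < 8; i++) {
        int16_t mask = (int16_t)-(a[i] > b[i]);   /* 0xFFFF or 0x0000 */
        dst[i] = (int16_t)((mask & a[i]) | (~mask & b[i]));
    }
#endif
}
```

Any other packed compare can supply the mask. The merge step stays the same.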
Figure 4-10 on page 115 shows an example of a non-branching sequence that implements a two-way multiplexer.

Where possible, break long dependency chains into several shorter dependency chains that can be executed in parallel. This is especially important for floating-point instructions because of their longer latencies.

4.12.4 Use Streaming Stores

The MOVNTDQ and MASKMOVDQU instructions store streaming (non-temporal) data to memory. These instructions indicate to the processor that the data they reference will be used only once and is therefore not subject to cache-related overhead (such as write-allocation). A typical case benefitting from streaming stores occurs when data written by the processor is never read by the processor, such as data written to a graphics frame buffer.

4.12.5 Align Data

Data alignment is particularly important for performance when data written by one instruction is read by a subsequent instruction soon after the write, or when accessing streaming (non-temporal) data. These cases may occur frequently in 128-bit media procedures.

Accesses to data stored at unaligned locations may benefit from on-the-fly software alignment or from repetition of data at different alignment boundaries, as required by different loops that process the data.

4.12.6 Organize Data for Cacheability

Pack small data structures into cache-line-size blocks.
Organize frequently accessed constants and coefficients into cache-line-size blocks and prefetch them. Procedures that access data arranged in memory-bus-sized blocks, or memory-burst-sized blocks, can make optimum use of the available memory bandwidth.

For data that will be used only once in a procedure, consider using non-cacheable memory. Accesses to such memory are not burdened by the overhead of cache protocols.

4.12.7 Prefetch Data

Media applications typically operate on large data sets. Because of this, they make intensive use of the memory bus. Memory latency can be substantially reduced, especially for data that will be used only once, by prefetching such data into various levels of the cache hierarchy.
Software can use the PREFETCHx instructions very effectively in such cases, as described in “Cache and Memory Management” on page 66.

Some of the best places to use prefetch instructions are inside loops that process large amounts of data. If the loop goes through less than one cache line of data per iteration, partially unroll the loop. Try to use virtually all of the prefetched data. This usually requires unit-stride memory accesses, that is, accesses in which all references are to contiguous memory locations. Exactly one PREFETCHx instruction per cache line must be used. For further details, see the Optimization Guide for AMD Athlon™ 64 and AMD Opteron™ Processors, order# 25112.

4.12.8 Use 128-Bit Media Code for Moving Data

Movements of data between memory, GPR, XMM, and MMX registers can take advantage of the parallel vector operations supported by the 128-bit media MOVx instructions. Figure 4-6 on page 111 illustrates the range of move operations available.

4.12.9 Retain Intermediate Results in XMM Registers

Keep intermediate results in the XMM registers as much as possible, especially if the intermediate results are used shortly after they have been produced.
Avoid spilling intermediate results to memory and reusing them shortly thereafter. In 64-bit mode, the architecture’s 16 XMM registers offer twice the number of legacy XMM registers.

4.12.10 Replace GPR Code with 128-Bit Media Code

In 64-bit mode, the AMD64 architecture provides twice the number of general-purpose registers (GPRs) as the legacy x86 architecture, thereby reducing potential pressure on GPRs.
Nevertheless, general-purpose instructions do not operate in parallel on vectors of elements, as do 128-bit media instructions. Thus, 128-bit media code supports parallel operations and can perform better with algorithms and data that are organized for parallel operations.

4.12.11 Replace x87 Code with 128-Bit Media Code

One of the most useful advantages of 128-bit media instructions is the ability to intermix integer and floating-point instructions in the same procedure, using a register set that is separate from the GPR, MMX, and x87 register sets. Code written with 128-bit media floating-point instructions can operate in parallel on four times as many single-precision floating-point operands as can x87 floating-point code.
This achieves potentially four times the computational work of x87 instructions that take single-precision operands. Also, the higher density of 128-bit media floating-point operands may make it possible to remove local temporary variables that would otherwise be needed in x87 floating-point code. 128-bit media code is also easier to write than x87 floating-point code, because the XMM register file is flat, rather than stack-oriented, and in 64-bit mode there are twice the number of XMM registers as x87 registers.
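The four-operands-per-instruction point can be made concrete with a small C sketch using SSE intrinsics. Each pass of the vector loop performs four single-precision additions with one packed add (ADDPS), where scalar code would issue four separate additions. The function name add_f32 is hypothetical, and unaligned loads and stores are used so the sketch makes no assumption about the alignment of the caller's buffers. On targets without SSE the vector loop is compiled out and the scalar loop handles every element.

```c
#include <stddef.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>   /* also provides the SSE (float) intrinsics */
#define SKETCH_HAVE_SSE 1
#endif

/* Hypothetical helper: dst[i] = a[i] + b[i] for i in [0, n). */
void add_f32(const float *a, const float *b, float *dst, size_t n)
{
    size_t i = 0;
#ifdef SKETCH_HAVE_SSE
    /* Four single-precision additions per iteration. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);             /* MOVUPS load */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));  /* ADDPS, MOVUPS store */
    }
#endif
    /* Remaining elements (or all of them when SSE is unavailable). */
    for (; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

Whether the full fourfold gain appears in practice depends on memory bandwidth and on the hardware implementation, as the surrounding sections note.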