Volume 1 Application Programming (794095), page 33
See Section 3.3.14, “Semaphores,” on page 64. The use of cross-modifying code can result in performance degradation. Synchronization for cross-modifying code is not required for code that resides within an aligned 8-byte block of memory.

AMD64 Technology, 24592 Rev. 3.13, July 2007

4 128-Bit Media and Scientific Programming

This chapter describes the 128-bit media and scientific programming model. This model includes all instructions that access the 128-bit XMM registers, called the 128-bit media instructions. These instructions perform integer and floating-point operations primarily on vector operands (a few of the instructions take scalar operands).
They can speed up certain types of procedures (typically high-performance media and scientific procedures) by substantial factors, depending on data-element size and the regularity and locality of data accesses to memory.

4.1 Overview

4.1.1 Origins

The 128-bit media instruction set includes instructions originally introduced as the streaming SIMD extensions (SSE), and instructions added in subsequent extensions (SSE2, SSE3, and SSE4A).
For details on the instruction set origin of each instruction, see “Instruction Subsets vs. CPUID Feature Sets” in Volume 3.

4.1.2 Compatibility

The 128-bit media instructions can be executed in any of the architecture’s operating modes. Existing SSE, SSE2, SSE3, and SSE4A binary programs run in legacy and compatibility modes without modification. The support provided by the AMD64 architecture for such binaries is identical to that provided by legacy x86 architectures.

To run in 64-bit mode, legacy 128-bit media programs must be recompiled.
The recompilation has no side effects on such programs, other than to provide access to the following additional resources:
• Access to the eight extended XMM registers (for a total of 16 XMM registers).
• Access to the eight extended general-purpose registers (for a total of 16 GPRs).
• Access to the extended 64-bit width of all GPRs.
• Access to the 64-bit virtual address space.
• Access to the RIP-relative addressing mode.

The 128-bit media instructions use data registers, a control and status register (MXCSR), rounding control, and an exception reporting and response mechanism that are distinct from and functionally independent of those used by the x87 floating-point instructions. Because of this, 128-bit media programming support usually requires exception handlers that are distinct from those used for x87 exceptions.
This support is provided by virtually all legacy operating systems for the x86 architecture.

4.2 Capabilities

The 128-bit media instructions are designed to support media and scientific applications. The vector operands used by these instructions allow applications to operate in parallel on multiple elements of vectors.
The elements can be integers (from bytes to quadwords) or floating-point values (either single-precision or double-precision). Arithmetic operations produce signed, unsigned, and/or saturating results.

The availability of several types of vector move instructions and (in 64-bit mode) twice the legacy number of XMM registers (a total of 16 such registers) can eliminate substantial memory-access overhead, making a substantial difference in performance.

4.2.1 Types of Applications

Typical media applications well-suited to the 128-bit media programming model include a broad range of audio, video, and graphics programs. For example, music synthesis, speech synthesis, speech recognition, audio and video compression (encoding) and decompression (decoding), 2D and 3D graphics, streaming video (up to high-definition TV), and digital signal processing (DSP) kernels are all likely to experience higher performance using 128-bit media instructions than using other types of instructions in the AMD64 architecture.

Such applications commonly use small-sized integer or single-precision floating-point data elements in repetitive loops, in which the typical operations are inherently parallel.
For example, 8-bit and 16-bit data elements are commonly used for pixel information in graphics applications, in which each of the RGBA pixel components (red, green, blue, and alpha) is represented by an 8-bit or 16-bit integer. 16-bit data elements are also commonly used for audio sampling.

The 128-bit media instructions allow multiple data elements like these to be packed into 128-bit vector operands located in XMM registers or memory.
The instructions operate in parallel on each of the elements in these vectors. For example, 16 elements of 8-bit data can be packed into a 128-bit vector operand, so that all 16 byte elements are operated on simultaneously, and in pairs of source operands, by a single instruction.

The 128-bit media instructions also support a broad spectrum of scientific applications. For example, their ability to operate in parallel on double-precision floating-point vector elements makes them well suited to computations like dense systems of linear equations, including matrix and vector-space operations with real and complex numbers. In professional CAD applications, for example, high-performance physical-modeling algorithms can be implemented to simulate processes such as heat transfer or fluid dynamics.

4.2.2 Integer Vector Operations

Most of the 128-bit media arithmetic instructions perform parallel operations on pairs of vectors. Vector operations are also called packed or SIMD (single-instruction, multiple-data) operations.
They take vector operands consisting of multiple elements, and all elements are operated on in parallel. Figure 4-1 on page 107 shows an example of parallel operations on pairs of 16 byte-sized integers in the source operands. The result of the operation replaces the first source operand.
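The element-by-element behavior described above can be sketched with a scalar reference model. The following Python is only an illustration of the semantics, not SIMD code: the function names are hypothetical, and the mapping to the PADDB (wraparound) and PADDUSB (unsigned saturating) instructions is an assumption made for the example rather than something stated in this section.

```python
# Scalar reference model of a packed byte operation; real hardware
# processes all 16 lanes with a single instruction.
LANES = 16  # sixteen 8-bit elements fill one 128-bit operand


def _check(vec):
    if len(vec) != LANES or any(not 0 <= v <= 0xFF for v in vec):
        raise ValueError("operand must hold exactly 16 unsigned byte elements")


def packed_add_bytes(op1, op2):
    """Lane-wise byte add with wraparound (modulo 256)."""
    _check(op1)
    _check(op2)
    return [(a + b) & 0xFF for a, b in zip(op1, op2)]


def packed_add_bytes_saturating(op1, op2):
    """Lane-wise unsigned byte add that clamps at 255 instead of wrapping."""
    _check(op1)
    _check(op2)
    return [min(a + b, 0xFF) for a, b in zip(op1, op2)]


op1 = list(range(16))  # 0, 1, ..., 15
op2 = [250] * 16

wrapped = packed_add_bytes(op1, op2)
saturated = packed_add_bytes_saturating(op1, op2)

# The instruction overwrites its first source operand; rebinding op1
# models that here.
op1 = wrapped
```

Lanes whose sum exceeds 255 differ between the two variants: the wraparound form keeps only the low 8 bits, while the saturating form pins the lane at 255, which is usually the preferred behavior for pixel data.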
There are also instructions that operate on vectors of words, doublewords, or quadwords.

[Figure 4-1. Parallel Operations on Vectors of Integer Elements. Diagram: operand 1 and operand 2, each spanning bits 127 to 0, are combined element by element by the operation to produce the result, bits 127 to 0.]

4.2.3 Floating-Point Vector Operations

There are almost as many 128-bit floating-point instructions as integer instructions.
Figure 4-2 shows an example of parallel operations on vectors containing four 32-bit single-precision floating-point values. There are also instructions that operate on vectors containing two 64-bit double-precision floating-point values.

[Figure 4-2. Parallel Operations on Vectors of Floating-Point Elements. Diagram: operand 1 and operand 2 each hold four single-precision floating-point (FP single) elements in bits 127 to 0; the operation is applied to each pair of elements to produce the four FP single elements of the result.]

Integer and floating-point instructions can be freely intermixed in the same procedure. The floating-point instructions allow media applications such as 3D graphics to accelerate geometry, clipping, and lighting calculations. Pixel data are typically integer-based, although both integer and floating-point instructions are often required to operate completely on the data. For example, software can change the viewing perspective of a 3D scene through transformation matrices by using floating-point instructions in the same procedure that contains integer operations on other aspects of the graphics data.

It is typically much easier to write 128-bit media programs using floating-point instructions. Such programs perform better than x87 floating-point programs, because the XMM register file is flat rather than stack-oriented, there are twice as many registers (in 64-bit mode), and 128-bit media instructions can operate on two or four times the number of floating-point operands as can x87 instructions. This ability to operate in parallel on multiple pairs of floating-point elements often makes it possible to remove local temporary variables that would otherwise be needed in x87 floating-point code.

4.2.4 Data Conversion and Reordering

There are instructions that support data conversion of vector elements, including conversions between integer and floating-point data types (located in XMM registers, MMX™ registers, GPRs, or memory) and conversions of element-ordering or precision.
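As a rough illustration of an integer to floating-point conversion on vector elements, the sketch below models converting four signed 32-bit integers to four single-precision values. It is a scalar Python model with hypothetical function names; it assumes round-to-nearest behavior (the actual rounding is governed by the MXCSR rounding control) and is meant to resemble what an instruction such as CVTDQ2PS computes, not to reproduce it exactly.

```python
import struct


def to_single(value):
    """Round a number to the nearest single-precision value."""
    # Packing as a 32-bit float and unpacking again drops the extra
    # precision that a Python float (double precision) would keep.
    return struct.unpack("<f", struct.pack("<f", float(value)))[0]


def convert_int32x4_to_single(vec):
    """Convert four signed 32-bit integers to four single-precision values."""
    if len(vec) != 4:
        raise ValueError("expected four doubleword elements")
    for v in vec:
        if not -2**31 <= v < 2**31:
            raise ValueError("element does not fit in a signed 32-bit integer")
    return [to_single(v) for v in vec]


ints = [1, -2, 16777216, 16777217]
singles = convert_int32x4_to_single(ints)
# 16777217 (2**24 + 1) has no exact single-precision representation,
# so the last element is rounded and no longer equals its source integer.
```

The last element shows why such conversions are described as changing precision: a 32-bit integer can carry more significant bits than the 24-bit significand of a single-precision value.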
For example, the unpack instructions take two vector operands and interleave their low or high elements. Figure 4-3 shows an unpack and interleave operation on word-sized elements (PUNPCKLWD). If the left-hand source operand has elements whose value is zero, the operation converts each element in the low half of the right-hand operand to a data type of twice its original precision. This is useful, for example, in multiply operations in which results may overflow or underflow.

[Figure 4-3. Unpack and Interleave Operation. Diagram: operand 1 and operand 2, each spanning bits 127 to 0, and the interleaved result, bits 127 to 0.]

There are also pack instructions, such as PACKSSDW shown in Figure 4-4 on page 109, that convert each element in a pair of vectors to lower precision by selecting the elements in the low half of each vector. Vector-shift instructions are also supported.
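The unpack and pack operations can be sketched with a scalar model. The Python below is illustrative only: the function names are hypothetical, vectors are written as lists with the least-significant element first, and the element ordering follows the usual definitions of PUNPCKLWD (interleave low words) and PACKSSDW (narrow doublewords to words with signed saturation) as an assumption, since the figures are not reproduced here.

```python
def unpack_low_words(first, second):
    """Interleave the four low 16-bit elements of two 8-element vectors."""
    if len(first) != 8 or len(second) != 8:
        raise ValueError("each operand must hold eight word elements")
    out = []
    for a, b in zip(first[:4], second[:4]):
        out += [a, b]  # element from first lands in the lower position
    return out


def words_to_doublewords(words):
    """Reinterpret eight words as four doublewords (even index = low half)."""
    return [words[i] | (words[i + 1] << 16) for i in range(0, 8, 2)]


def pack_signed_saturate(first, second):
    """Narrow two vectors of four signed doublewords to eight signed words."""
    if len(first) != 4 or len(second) != 4:
        raise ValueError("each operand must hold four doubleword elements")

    def clamp(v):
        return max(-32768, min(32767, v))

    return [clamp(v) for v in list(first) + list(second)]


data = [1, 2, 65535, 4, 5, 6, 7, 8]
zeros = [0] * 8

# Interleaving with zeros in the upper position of each pair widens the
# four low words of `data` to zero-extended doublewords.
interleaved = unpack_low_words(data, zeros)
widened = words_to_doublewords(interleaved)

# Packing goes the other way; out-of-range values are clamped, not wrapped.
packed = pack_signed_saturate([100, -70000, 32767, 40000], [0, -1, -32768, 1])
```

Here `widened` holds the values 1, 2, 65535, and 4 as 32-bit elements, and `packed` clamps -70000 to -32768 and 40000 to 32767 while passing in-range values through unchanged.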