Volume 1: Application Programming
Moreover, when integer and floating-point instructions must be used together, 128-bit media floating-point instructions avoid the potential need to save and restore state between integer operations and floating-point procedures.

AMD64 Technology, 24592, Rev. 3.13, July 2007

5 64-Bit Media Programming

This chapter describes the 64-bit media programming model. This model includes all instructions that access the MMX™ registers, including the MMX and 3DNow!™ instructions, as well as some SSE and SSE2 instructions.

The 64-bit media instructions perform integer and floating-point operations primarily on vector operands (a few of the instructions take scalar operands).
The MMX integer operations produce signed, unsigned, and/or saturating results. The 3DNow! floating-point operations take single-precision operands and produce saturating results without generating floating-point exceptions. The instructions that take vector operands can speed up certain types of procedures by significant factors, depending on data-element size and the regularity and locality of data accesses to memory.

The term 64-bit is used in two different contexts within the AMD64 architecture: the 64-bit media instructions, described in this chapter, and the 64-bit operating mode, described in “64-Bit Mode” on page 6.

5.1 Origins

The 64-bit media instructions were introduced in the following extensions to the legacy x86 architecture:

• MMX Instructions: These are primarily integer instructions that take vector operands in 64-bit MMX registers or memory locations.
• 3DNow! Instructions: These are primarily floating-point instructions, most of which take vector operands in MMX registers or memory locations.
• SSE, SSE2, SSE3, and SSE4A Instructions: These are the streaming SIMD extensions (SSE), SSE2, SSE3, and SSE4A instructions.
Some of them perform conversions between operands in the 64-bit MMX register set and other register sets.

For details on the extension-set origin of each instruction, see “Instruction Subsets vs. CPUID Feature Sets” in Volume 3.

5.2 Compatibility

64-bit media instructions can be executed in any of the architecture’s operating modes. Existing MMX and 3DNow! binary programs run in legacy and compatibility modes without modification. The support provided by the AMD64 architecture for such binaries is identical to that provided by legacy x86 architectures.

To run in 64-bit mode, 64-bit media programs must be recompiled.
The recompilation has no side effects on such programs, other than to make available the extended general-purpose registers and 64-bit virtual address space.

The MMX and 3DNow! instructions introduce no additional registers, status bits, or other processor state to the legacy x86 architecture. Instead, they use the x87 floating-point registers that have long been a part of most x86 architectures. Because of this, 64-bit media procedures require no special operating-system support or exception handlers.
When state-saves are required between procedures, the same instructions that system software uses to save and restore x87 floating-point state also save and restore the 64-bit media-programming state.

AMD no longer recommends the use of 3DNow! instructions, which have been superseded by their more efficient 128-bit media counterparts. Relevant recommendations are provided below and in the AMD64 Programmer’s Manual Volume 4: 64-Bit Media and x87 Floating-Point Instructions.

5.3 Capabilities

The 64-bit media instructions are designed to support multimedia and communication applications that operate on vectors of small-sized data elements. For example, 8-bit and 16-bit integer data elements are commonly used for pixel information in graphics applications, and 16-bit integer data elements are used for audio sampling. The 64-bit media instructions allow multiple data elements like these to be packed into single 64-bit vector operands located in an MMX register or in memory.
The instructions operate in parallel on each of the elements in these vectors. For example, 8-bit integer data can be packed in vectors of eight elements in a single 64-bit register, so that a single instruction can operate on all eight byte elements simultaneously.

Typical applications of the 64-bit media integer instructions include music synthesis, speech synthesis, speech recognition, audio and video compression (encoding) and decompression (decoding), 2D and 3D graphics (including 3D texture mapping), and streaming video. Typical applications of the 64-bit media floating-point instructions include digital signal processing (DSP) kernels and front-end 3D graphics algorithms, such as geometry, clipping, and lighting.

These types of applications are referred to as media applications. Such applications commonly use small data elements in repetitive loops, in which the typical operations are inherently parallel. In 256-color video applications, for example, 8-bit operands in 64-bit MMX registers can be used to compute transformations on eight pixels per instruction.

5.3.1 Parallel Operations

Most of the 64-bit media instructions perform parallel operations on vectors of operands.
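As a rough illustration of what "one instruction, eight byte elements" means, the following portable C sketch models a packed byte add with wraparound. It is a scalar model rather than MMX code: the name packed_add_bytes, the constant BYTES_PER_VECTOR, and the array representation of a 64-bit operand are assumptions made for this example. Real hardware would perform all eight lane additions in a single instruction.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed representation: one 64-bit operand viewed as eight byte lanes. */
enum { BYTES_PER_VECTOR = 8 };

/* Hypothetical scalar model of a packed byte add with wraparound.
   The loop only spells out the per-lane arithmetic that a single
   packed instruction would apply to all lanes at once. */
void packed_add_bytes(const uint8_t a[BYTES_PER_VECTOR],
                      const uint8_t b[BYTES_PER_VECTOR],
                      uint8_t result[BYTES_PER_VECTOR])
{
    for (size_t i = 0; i < BYTES_PER_VECTOR; ++i) {
        /* Lanes are independent: a carry out of one byte never reaches
           its neighbor. The sum is truncated to 8 bits (modulo 256). */
        result[i] = (uint8_t)(a[i] + b[i]);
    }
}
```

The key point of the model is lane independence: an overflow in one byte element leaves the adjacent elements untouched, which is what distinguishes a packed add from an ordinary 64-bit integer add.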
Vector operations are also called packed or SIMD (single-instruction, multiple-data) operations. They take operands consisting of multiple elements and operate on all elements in parallel. Figure 5-1 on page 195 shows an example of an integer operation on two vectors, each containing 16-bit (word) elements. There are also 64-bit media instructions that operate on vectors of byte or doubleword elements.

[Figure 5-1. Parallel Integer Operations on Elements of Vectors]

5.3.2 Data Conversion and Reordering

The 64-bit media instructions support conversions of various integer data types to floating-point data types, and vice versa.

There are also instructions that reorder vector elements or change the bit width of vector elements. For example, the unpack instructions take two vector operands and interleave their low or high elements.
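The interleaving performed by an unpack of low-order words can be sketched in portable C. This is a scalar model under stated assumptions, not MMX code: the name unpack_low_words is hypothetical, and each 64-bit operand is represented as an array of four 16-bit words with index 0 as the least-significant word.

```c
#include <stdint.h>

/* Hypothetical scalar model of a low-word unpack-and-interleave.
   Assumption: index 0 holds the least-significant 16-bit word. */
void unpack_low_words(const uint16_t op1[4],
                      const uint16_t op2[4],
                      uint16_t result[4])
{
    /* Read the two low words of each source first, so the function
       still works if result aliases one of the inputs. */
    uint16_t a0 = op1[0], a1 = op1[1];
    uint16_t b0 = op2[0], b1 = op2[1];

    /* Interleave: words from operand 1 land in even positions,
       words from operand 2 in odd positions. */
    result[0] = a0;
    result[1] = b0;
    result[2] = a1;
    result[3] = b1;
}
```

The high words of both sources (indices 2 and 3) are ignored by the low-order variant; a high-order unpack would read those instead.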
Figure 5-2 on page 196 shows an unpack operation (PUNPCKLWD) that interleaves low-order elements of each source operand. If each element of operand 2 has the value zero, the operation zero-extends each element of operand 1 to twice its original width. This may be useful, for example, prior to an arithmetic operation in which the data-conversion result must be paired with another source operand containing vector elements that are twice the width of the pre-conversion (half-size) elements.

There are also pack instructions that convert each element of 2x size in a pair of vectors to elements of 1x size, with saturation at maximum and minimum values.

[Figure 5-2. Unpack and Interleave Operation]

Figure 5-3 shows a shuffle operation (PSHUFW), in which one of the operands provides vector data, and an immediate byte provides shuffle control for up to 256 permutations of the data.

[Figure 5-3. Shuffle Operation (1 of 256)]

5.3.3 Matrix Operations

Media applications often multiply and accumulate vector and matrix data.
In 3D graphics applications, for example, objects are typically represented by triangles, each of whose vertices is located in 3D space by a matrix of coordinate values, and matrix transforms are performed to simulate object movement.

64-bit media integer and floating-point instructions can perform several types of matrix-vector or matrix-matrix operations, such as addition, subtraction, multiplication, and accumulation. The integer instructions can also perform multiply-accumulate operations. Efficient matrix multiplication is further supported with instructions that can first transpose the elements of matrix rows and columns. These transpositions can make subsequent accesses to memory or cache more efficient when performing arithmetic matrix operations.

Figure 5-4 shows a vector multiply-add instruction (PMADDWD) that multiplies vectors of 16-bit integer elements to yield intermediate results of 32-bit elements, which are then summed pair-wise to yield two 32-bit elements.

[Figure 5-4. Multiply-Add Operation]

The operation shown in Figure 5-4 can be used together with transpose and vector-add operations (see “Addition” on page 216) to accumulate dot product results (also called inner or scalar products), which are used in many media algorithms.

5.3.4 Saturation

Several of the 64-bit media integer instructions and most of the 64-bit media floating-point instructions produce vector results in which each element saturates independently of the other elements in the result vector.
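Independent per-element saturation can be sketched with a portable C model of a signed saturating add on four 16-bit elements. This is an illustrative scalar model, not MMX code: the names saturate_to_int16 and packed_add_words_saturating, and the array representation of the operands, are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Clamp a 32-bit intermediate value to the signed 16-bit range. */
static int16_t saturate_to_int16(int32_t value)
{
    if (value > INT16_MAX) return INT16_MAX;
    if (value < INT16_MIN) return INT16_MIN;
    return (int16_t)value;
}

/* Hypothetical scalar model of a signed saturating add on a vector of
   four 16-bit words. Each element is clamped on its own; saturation in
   one element has no effect on the others. */
void packed_add_words_saturating(const int16_t a[4],
                                 const int16_t b[4],
                                 int16_t result[4])
{
    for (size_t i = 0; i < 4; ++i) {
        /* Widen to 32 bits so the true sum is available before clamping. */
        int32_t sum = (int32_t)a[i] + (int32_t)b[i];
        result[i] = saturate_to_int16(sum);
    }
}
```

In this model, adding 1 to an element that already holds 32767 leaves it at 32767 rather than wrapping to a negative value, while the neighboring elements that stay in range receive their exact sums.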