Volume 1 Application Programming (794095), page 34
They can scale each element in a vector to higher or lower values.

128-Bit Media and Scientific Programming
AMD64 Technology, 24592, Rev. 3.13, July 2007

Figure 4-4. Pack Operation

Figure 4-5 shows one of many types of shuffle operation (PSHUFD). Here, the second operand is a vector containing doubleword elements, and an immediate byte provides shuffle control for up to 256 permutations of the elements. Shuffles are useful, for example, in color imaging when computing alpha saturation of RGB values. In this case, a shuffle instruction can replicate an alpha value in a register so that parallel comparisons with three RGB values can be performed.

Figure 4-5. Shuffle Operation

There is an instruction that inserts a single word from a general-purpose register or memory into an XMM register, at a specified location, leaving the other words in the XMM register unmodified.
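The alpha-replication use of a shuffle described above can be sketched in C with SSE2 intrinsics. This is an illustration only: the `pixel32` layout (four 32-bit channels with alpha in element 3) and the function name `rgb_within_alpha` are assumptions made for the example, and channel values are assumed to be non-negative so that a signed compare is adequate.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical layout: one pixel as four 32-bit channels,
   element 0 = R, 1 = G, 2 = B, 3 = A. */
typedef struct { int32_t r, g, b, a; } pixel32;

/* Returns 1 if no colour channel exceeds alpha, otherwise 0. */
static int rgb_within_alpha(const pixel32 *p)
{
    __m128i px = _mm_loadu_si128((const __m128i *)p);

    /* PSHUFD: every result doubleword is taken from source element 3,
       so alpha is replicated across the whole register. */
    __m128i alpha = _mm_shuffle_epi32(px, _MM_SHUFFLE(3, 3, 3, 3));

    /* One parallel signed compare of all four elements against alpha.
       Element 3 compares alpha with itself and is therefore never set. */
    __m128i gt = _mm_cmpgt_epi32(px, alpha);

    /* PMOVMSKB collects the compare result; zero means nothing exceeded alpha. */
    return _mm_movemask_epi8(gt) == 0;
}
```

The immediate passed to the shuffle must be a compile-time constant, which is why the control value is written with the `_MM_SHUFFLE` macro and not passed in as a parameter.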
4.2.5 Block Operations

Move instructions, along with unpack instructions, are among the most frequently used instructions in 128-bit media procedures. Figure 4-6 on page 111 shows the combined set of move operations supported by the integer and floating-point move instructions. These instructions provide a fast way to copy large amounts of data between registers or between registers and memory. They support block copies and sequential processing of contiguous data.

When moving between XMM registers, or between an XMM register and memory, each integer move instruction can copy up to 16 bytes of data. When moving between an XMM register and an MMX register or a GPR, an integer move instruction can move up to 8 bytes of data.
The floating-point move instructions can copy vectors of four single-precision or two double-precision floating-point operands in parallel.

Streaming-store versions of the move instructions permit bypassing the cache when storing data that is accessed only once. This maximizes memory-bus utilization and minimizes cache pollution. There is also a streaming-store integer move-mask instruction that stores bytes from one vector, as selected by mask values in a second vector. Figure 4-7 on page 112 shows the MASKMOVDQU operation. It can be used, for example, to handle end cases in block copies and block fills based on streaming stores.

Figure 4-6. Move Operations (moves between XMM registers, memory, GPRs, and MMX registers)
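As a rough illustration of the block-fill use case mentioned above, the following C sketch fills a buffer with streaming stores and handles the final partial block with the masked streaming store. The function name `stream_fill` is hypothetical. The sketch assumes that `dst` is 16-byte aligned and that a full 16 bytes are addressable at the start of the tail, because fault behavior for unselected bytes of MASKMOVDQU is implementation-dependent.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Fills n bytes at dst with value, bypassing the cache.
   Assumptions: dst is 16-byte aligned, and 16 bytes are addressable
   at the start of the tail block even if fewer are written. */
static void stream_fill(unsigned char *dst, size_t n, unsigned char value)
{
    const __m128i v = _mm_set1_epi8((char)value);
    size_t i = 0;

    /* Whole 16-byte blocks: MOVNTDQ streaming store. */
    for (; i + 16 <= n; i += 16)
        _mm_stream_si128((__m128i *)(dst + i), v);

    size_t rest = n - i;  /* 0..15 bytes left over */
    if (rest != 0) {
        const __m128i idx = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7,
                                          8, 9, 10, 11, 12, 13, 14, 15);
        /* Mask byte is 0xFF where the byte index is below rest, else 0x00. */
        const __m128i mask = _mm_cmplt_epi8(idx, _mm_set1_epi8((char)rest));
        /* MASKMOVDQU: store only the bytes whose mask byte has its top bit set. */
        _mm_maskmoveu_si128(v, mask, (char *)(dst + i));
    }

    /* Order the weakly-ordered streaming stores before later stores. */
    _mm_sfence();
}
```

A call such as `stream_fill(buf, 21, 0xAB)` on an aligned buffer writes one full block and then five tail bytes, leaving the bytes after the 21st untouched.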
Figure 4-7. Move Mask Operation (bytes of operand 1 selected by operand 2 are stored to the memory address in rDI)

4.2.6 Matrix and Special Arithmetic Operations

The instruction set provides a broad assortment of vector add, subtract, multiply, divide, and square-root operations for use on matrices and other data structures common to media and scientific applications.
It also provides special arithmetic operations including multiply-add, average, sum-of-absolute-differences, reciprocal square-root, and reciprocal estimation.

Media applications often multiply and accumulate vector and matrix data. In 3D-graphics geometry, for example, objects are typically represented by triangles, each of whose vertices is located in 3D space by a matrix of coordinate values, and matrix transforms are performed to simulate object movement.

128-bit media integer and floating-point instructions can perform several types of matrix-vector or matrix-matrix operations, such as addition, subtraction, multiplication, and accumulation, to effect 3D transforms of vertices. Efficient matrix multiplication is further supported with instructions that can first transpose the elements of matrix rows and columns.
These transpositions can make subsequent accesses to memory or cache more efficient when performing arithmetic matrix operations.

Figure 4-8 on page 113 shows a vector multiply-add instruction (PMADDWD) that multiplies vectors of 16-bit integer elements to yield intermediate results of 32-bit elements, which are then summed pair-wise to yield four 32-bit elements. This operation can be used with one source operand (for example, a coefficient) taken from memory and the other source operand (for example, the data to be multiplied by that coefficient) taken from an XMM register.
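The multiply-add step can be sketched with the SSE2 intrinsic that maps to PMADDWD. In this hedged example the products are also accumulated with a vector add, which turns the operation into an integer dot product. The function name `dot_i16` is hypothetical, `n` is assumed to be a multiple of 8, and the 32-bit partial sums are assumed not to overflow.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Dot product of two arrays of signed 16-bit integers.
   Assumptions: n is a multiple of 8; 32-bit sums do not overflow. */
static int32_t dot_i16(const int16_t *coef, const int16_t *data, size_t n)
{
    __m128i acc = _mm_setzero_si128();

    for (size_t i = 0; i < n; i += 8) {
        __m128i c = _mm_loadu_si128((const __m128i *)(coef + i));
        __m128i d = _mm_loadu_si128((const __m128i *)(data + i));
        /* PMADDWD: eight 16x16 products, summed pair-wise into four
           32-bit elements. PADDD then accumulates them. */
        acc = _mm_add_epi32(acc, _mm_madd_epi16(c, d));
    }

    /* Horizontal sum of the four 32-bit elements of the accumulator. */
    acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 8));
    acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 4));
    return _mm_cvtsi128_si32(acc);
}
```

Keeping four partial sums in the register and reducing them only once after the loop avoids a horizontal add per iteration.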
PMADDWD can also be used together with a vector add operation to accumulate dot product results (also called inner or scalar products), which are used in many media algorithms such as those required for finite impulse response (FIR) filters, one of the commonly used DSP algorithms.

Figure 4-8. Multiply-Add Operation (eight 16-bit products form a 256-bit intermediate result, which is summed pair-wise into the 128-bit result)

There is also a sum-of-absolute-differences instruction (PSADBW), shown in Figure 4-9 on page 114. This is useful, for example, in computing motion-estimation algorithms for video compression.
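A minimal sketch of the sum-of-absolute-differences operation, using the SSE2 intrinsic that maps to PSADBW, might look as follows. The function name `sad16` is hypothetical. A motion-estimation routine would typically call something like it once per row of a candidate block and add up the results.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Sum of absolute differences over two blocks of 16 unsigned bytes. */
static unsigned sad16(const uint8_t *a, const uint8_t *b)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);

    /* PSADBW: each 64-bit half of the result holds the sum of the
       absolute differences of its eight byte pairs. */
    __m128i s = _mm_sad_epu8(va, vb);

    /* Add the high-order sum to the low-order sum. */
    s = _mm_add_epi32(s, _mm_srli_si128(s, 8));
    return (unsigned)_mm_cvtsi128_si32(s);
}
```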
Figure 4-9. Sum-of-Absolute-Differences Operation

There is an instruction for computing the average of unsigned bytes or words. The instruction is useful for MPEG decoding, in which motion compensation involves many byte-averaging operations between and within macroblocks. In addition to speeding up these operations, the instruction also frees up registers and makes it possible to unroll the averaging loops.

Some of the arithmetic and pack instructions produce vector results in which each element saturates independently of the other elements in the result vector. Such results are clamped (limited) to the maximum or minimum value representable by the destination data type when the true result exceeds that maximum or minimum representable value. Saturating data is useful for representing physical-world data, such as sound and color.
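The byte-averaging and saturating behavior described in the two paragraphs above can be sketched with SSE2 intrinsics. The function names `avg_u8x16` and `add_sat_u8x16` are hypothetical, and unsigned byte elements are chosen here only as a representative case.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* PAVGB: rounded average, (a + b + 1) >> 1, of 16 unsigned byte pairs. */
static void avg_u8x16(uint8_t *out, const uint8_t *a, const uint8_t *b)
{
    __m128i r = _mm_avg_epu8(_mm_loadu_si128((const __m128i *)a),
                             _mm_loadu_si128((const __m128i *)b));
    _mm_storeu_si128((__m128i *)out, r);
}

/* PADDUSB: unsigned add in which each byte saturates at 255
   independently of the other bytes. */
static void add_sat_u8x16(uint8_t *out, const uint8_t *a, const uint8_t *b)
{
    __m128i r = _mm_adds_epu8(_mm_loadu_si128((const __m128i *)a),
                              _mm_loadu_si128((const __m128i *)b));
    _mm_storeu_si128((__m128i *)out, r);
}
```

With a plain wrapping add, 200 + 100 in a byte would yield 44, which for pixel data shows up as a dark spot where a bright one was expected. The saturating form yields 255.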
Saturation is used, for example, when combining values for pixel coloring.

4.2.7 Branch Removal

Branching is a time-consuming operation that, unlike most 128-bit media vector operations, does not exhibit parallel behavior (there is only one branch target, not multiple targets, per branch instruction). In many media applications, a branch involves selecting between only a few (often only two) cases. Such branches can be replaced with 128-bit media vector compare and vector logical instructions that simulate predicated execution or conditional moves.

Figure 4-10 on page 115 shows an example of a non-branching sequence that implements a two-way multiplexer, one that is equivalent to the ternary operator “?:” in C and C++.
The comparable code sequence is explained in “Compare and Write Mask” on page 153.

The sequence in Figure 4-10 begins with a vector compare instruction that compares the elements of two source operands in parallel and produces a mask vector containing elements of all 1s or 0s. This mask vector is ANDed with one source operand and ANDed-Not with the other source operand to isolate the desired elements of both operands. These results are then ORed to select the relevant elements from each operand. A similar branch-removal operation can be done using floating-point source operands.

Figure 4-10. Branch-Removal Sequence

  operand 1:               a7   a6   a5   a4   a3   a2   a1   a0
  operand 2:               b7   b6   b5   b4   b3   b2   b1   b0
  Compare and Write Mask:  FFFF 0000 0000 FFFF FFFF 0000 0000 FFFF
  And (operand 1):         a7   0000 0000 a4   a3   0000 0000 a0
  And-Not (operand 2):     0000 b6   b5   0000 0000 b2   b1   0000
  Or (result):             a7   b6   b5   a4   a3   b2   b1   a0

The min/max compare instructions, for example, are useful for clamping, such as color clamping in 3D graphics, without the need for branching.
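The compare, AND, AND-NOT, OR sequence and the min/max clamping mentioned above can be sketched in C with SSE2 intrinsics. The function names are hypothetical, and the greater-than comparison on signed 16-bit elements is an assumption chosen for illustration, since the figure does not say which comparison produces its mask.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Per element: r = (a > b) ? a : b, without a branch. */
static __m128i select_gt_i16(__m128i a, __m128i b)
{
    __m128i mask   = _mm_cmpgt_epi16(a, b);      /* PCMPGTW: 0xFFFF or 0x0000 */
    __m128i from_a = _mm_and_si128(mask, a);     /* PAND: keep a where mask is set */
    __m128i from_b = _mm_andnot_si128(mask, b);  /* PANDN: (~mask) & b */
    return _mm_or_si128(from_a, from_b);         /* POR: merge the two halves */
}

/* Per element: clamp v to the range [lo, hi] using PMAXSW and PMINSW. */
static __m128i clamp_i16(__m128i v, int16_t lo, int16_t hi)
{
    __m128i floor_applied = _mm_max_epi16(v, _mm_set1_epi16(lo));
    return _mm_min_epi16(floor_applied, _mm_set1_epi16(hi));
}
```

Note the operand order of the AND-NOT intrinsic: it inverts its first argument, so the mask goes first and the operand to be kept goes second.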
Figure 4-11 on page 116 illustrates a move-mask instruction (PMOVMSKB) that copies sign bits to a general-purpose register (GPR). The instruction can extract bits from mask patterns, or zero values from quantized data, or sign bits, resulting in a 16-bit mask (one bit per byte) that can be used for data-dependent branching.

Figure 4-11. Move Mask Operation (the most-significant bits of the 16 bytes in the XMM register are concatenated into the GPR)

4.3 Registers

Operands for most 128-bit media instructions are located in XMM registers or memory.