Volume 1 Application Programming (794095), page 56
The instruction is useful in 3D rasterization, which operates on unsigned pixel values.

The PMULUDQ instruction, unlike the other PMULx instructions, preserves the full precision of the result. It multiplies 32-bit unsigned integer values in the first and second operands and writes the full 64-bit result to the destination.

See “Shift” on page 219 for shift instructions that can be used to perform multiplication and division by powers of 2.

64-Bit Media Programming 217
AMD64 Technology 24592—Rev. 3.13—July 2007

Multiply-Add

• PMADDWD—Packed Multiply Words and Add Doublewords

The PMADDWD instruction multiplies each 16-bit signed value in the first operand by the corresponding 16-bit signed value in the second operand. The instruction then adds the adjacent 32-bit intermediate results of each multiplication, and writes the 32-bit result of each addition into the corresponding doubleword of the destination.
PMADDWD thus performs two signed (16 × 16 = 32) + (16 × 16 = 32) multiply-adds in parallel. Figure 5-16 shows the PMADDWD operation.

The only case in which overflow can occur is when all four of the 16-bit source operands used to produce a 32-bit multiply-add result have the value 8000h. In this case, the result returned is 8000_0000h, because the maximum negative 16-bit value of 8000h multiplied by itself equals 4000_0000h, and 4000_0000h added to 4000_0000h equals 8000_0000h. The result of multiplying two negative numbers should be a positive number, but 8000_0000h is the maximum possible 32-bit negative number rather than a positive number.

Figure 5-16. PMADDWD Multiply-Add Operation

PMADDWD can be used with one source operand (for example, a coefficient) taken from memory and the other source operand (for example, the data to be multiplied by that coefficient) taken from an MMX register.
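As a rough illustration of the multiply-add behavior described above, the following C sketch models PMADDWD in scalar code. The function name pmaddwd_model and the array-based operand layout are assumptions made for this example; they are not part of the instruction set or of any library. The final addition is done in unsigned arithmetic so that the single overflow case (all four inputs equal to 8000h) wraps to 8000_0000h, as the text describes.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of PMADDWD (illustrative; pmaddwd_model is a hypothetical name).
   a and b each stand for one 64-bit operand holding four signed 16-bit words;
   out receives the two signed 32-bit results. */
static void pmaddwd_model(const int16_t a[4], const int16_t b[4], int32_t out[2])
{
    assert(a && b && out);
    for (int i = 0; i < 2; i++) {
        /* Each signed 16 x 16 product always fits in 32 bits. */
        int32_t p0 = (int32_t)a[2 * i] * (int32_t)b[2 * i];
        int32_t p1 = (int32_t)a[2 * i + 1] * (int32_t)b[2 * i + 1];
        /* Add as unsigned so the one overflow case wraps instead of being
           undefined behavior in C. The conversion back to int32_t wraps on
           two's-complement compilers such as GCC and Clang. */
        out[i] = (int32_t)((uint32_t)p0 + (uint32_t)p1);
    }
}
```

With a = {1, 2, -3, 4} and b = {5, 6, 7, -8}, the model produces {17, -53}. With every input word set to 8000h, both results are 8000_0000h.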
The instruction can also be used together with the PADDD instruction (page 216) to compute dot products, such as those required for finite impulse response (FIR) filters, one of the commonly used DSP algorithms. Scaling can be done, before or after the multiply, using a vector-shift instruction (page 219).

For floating-point multiplication operations, see the PFMUL instruction on page 225. For floating-point accumulation operations, see the PFACC, PFNACC, and PFPNACC instructions on page 226.

Average

• PAVGB—Packed Average Unsigned Bytes
• PAVGW—Packed Average Unsigned Words
• PAVGUSB—Packed Average Unsigned Packed Bytes

The PAVGx instructions compute the rounded average of each unsigned 8-bit (PAVGB) or 16-bit (PAVGW) integer value in the first operand and the corresponding, same-sized unsigned integer in the second operand.
The instructions then write each average in the corresponding, same-sized element of the destination. The rounded average is computed by adding each pair of operands, adding 1 to the temporary sum, and then right-shifting the temporary sum by one bit.

The PAVGB instruction is useful for MPEG decoding, in which motion compensation performs many byte-averaging operations between and within macroblocks.
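The rounded-average rule above (add the pair, add 1, shift right by one bit) can be sketched in portable C as follows. This is a scalar model only; pavgb_model is a hypothetical name chosen for this example, and the eight-element arrays stand in for the eight byte lanes of a 64-bit operand. The temporary sum is widened so that its ninth bit is not lost before the shift.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of PAVGB (illustrative; pavgb_model is a hypothetical name).
   Each array stands for the eight unsigned bytes of one 64-bit operand. */
static void pavgb_model(const uint8_t a[8], const uint8_t b[8], uint8_t out[8])
{
    assert(a && b && out);
    for (int i = 0; i < 8; i++) {
        /* The temporary sum needs 9 bits, so compute it in a wider type. */
        unsigned sum = (unsigned)a[i] + (unsigned)b[i] + 1u;
        out[i] = (uint8_t)(sum >> 1);
    }
}
```

For example, averaging 1 with 2 gives 2 (the result rounds up), and averaging 255 with 255 gives 255 because the widened sum does not overflow.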
In addition to speeding up these operations, PAVGB can free up registers and make it possible to unroll the averaging loops.

The PAVGUSB instruction (a 3DNow! instruction) performs a function identical to the PAVGB instruction, described on page 219, although the two instructions have different opcodes.

Sum of Absolute Differences

• PSADBW—Packed Sum of Absolute Differences of Bytes into a Word

The PSADBW instruction computes the absolute values of the differences of corresponding 8-bit unsigned integer values in the first and second operands.
The instruction then sums the differences and writes an unsigned 16-bit integer result in the low-order word of the destination. The remaining bytes in the destination are cleared to all 0s.

Sums of absolute differences are used to compute the L1 norm in motion-estimation algorithms for video compression.

5.6.7 Shift

The vector-shift instructions are useful for scaling vector elements to higher or lower precision, packing and unpacking vector elements, and multiplying and dividing vector elements by powers of 2.

Left Logical Shift

• PSLLW—Packed Shift Left Logical Words
• PSLLD—Packed Shift Left Logical Doublewords
• PSLLQ—Packed Shift Left Logical Quadwords

The PSLLx instructions left-shift each of the 16-bit (PSLLW), 32-bit (PSLLD), or 64-bit (PSLLQ) values in the first operand by the number of bits specified in the second operand. The instructions then write each shifted value into the corresponding, same-sized element of the destination.
The first and second operands are either an MMX register and another MMX register or 64-bit memory location, or an MMX register and an immediate-byte value. The low-order bits that are emptied by the shift operation are cleared to 0.

In integer arithmetic, left logical shifts effectively multiply unsigned operands by positive powers of 2.

Right Logical Shift

• PSRLW—Packed Shift Right Logical Words
• PSRLD—Packed Shift Right Logical Doublewords
• PSRLQ—Packed Shift Right Logical Quadwords

The PSRLx instructions right-shift each of the 16-bit (PSRLW), 32-bit (PSRLD), or 64-bit (PSRLQ) values in the first operand by the number of bits specified in the second operand. The instructions then write each shifted value into the corresponding, same-sized element of the destination.
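The word-sized logical shifts described above (PSLLW and PSRLW) can be modeled in scalar C as in the sketch below. The names psllw_model and psrlw_model are hypothetical and exist only for this example. In this model the vacated bits are filled with zeros, and a count above 15 produces 0 in every element; the latter is assumed here to match the hardware behavior for logical shifts and also keeps the result within 16-bit semantics.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar models of PSLLW and PSRLW (illustrative; both names are hypothetical).
   src stands for the four 16-bit words of one 64-bit operand. */
static void psllw_model(const uint16_t src[4], unsigned count, uint16_t out[4])
{
    assert(src && out);
    for (int i = 0; i < 4; i++) {
        /* Shift in a 32-bit type, then keep only the low 16 bits. */
        out[i] = (count > 15) ? 0 : (uint16_t)((uint32_t)src[i] << count);
    }
}

static void psrlw_model(const uint16_t src[4], unsigned count, uint16_t out[4])
{
    assert(src && out);
    for (int i = 0; i < 4; i++) {
        /* Unsigned right shift: the high-order bits are filled with zeros. */
        out[i] = (count > 15) ? 0 : (uint16_t)((uint32_t)src[i] >> count);
    }
}
```

Shifting left by 2 multiplies small unsigned values by 4 (3 becomes 12), while bits shifted past bit 15 are discarded (8001h becomes 0004h). Shifting right by 2 divides unsigned values by 4, discarding the remainder (13 becomes 3).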
The first and second operands are either an MMX register and another MMX register or 64-bit memory location, or an MMX register and an immediate-byte value. The high-order bits that are emptied by the shift operation are cleared to 0. In integer arithmetic, right logical shifts effectively divide unsigned operands or positive signed operands by positive powers of 2.

PSRLQ can be used to move the high 32 bits of an MMX register to the low 32 bits of the register.

Right Arithmetic Shift

• PSRAW—Packed Shift Right Arithmetic Words
• PSRAD—Packed Shift Right Arithmetic Doublewords

The PSRAx instructions right-shift each of the 16-bit (PSRAW) or 32-bit (PSRAD) values in the first operand by the number of bits specified in the second operand.
The instructions then write each shifted value into the corresponding, same-sized element of the destination. The high-order bits that are emptied by the shift operation are filled with the sign bit of the initial value.

In integer arithmetic, right arithmetic shifts effectively divide signed operands by positive powers of 2.

5.6.8 Compare

The integer vector-compare instructions compare two operands, and they either write a mask or they write the maximum or minimum value.

Compare and Write Mask

• PCMPEQB—Packed Compare Equal Bytes
• PCMPEQW—Packed Compare Equal Words
• PCMPEQD—Packed Compare Equal Doublewords
• PCMPGTB—Packed Compare Greater Than Signed Bytes
• PCMPGTW—Packed Compare Greater Than Signed Words
• PCMPGTD—Packed Compare Greater Than Signed Doublewords

The PCMPEQx and PCMPGTx instructions compare corresponding bytes, words, or doublewords in the first and second operands. The instructions then write a mask of all 1s or 0s for each compare into the corresponding, same-sized element of the destination.

For the PCMPEQx instructions, if the compared values are equal, the result mask is all 1s. If the values are not equal, the result mask is all 0s.
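The equality mask described above can be sketched for the word-sized form (PCMPEQW) as follows. This is a scalar C model under stated assumptions: pcmpeqw_model is a hypothetical name, and the four-element arrays stand for the four words of a 64-bit operand.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of PCMPEQW (illustrative; pcmpeqw_model is a hypothetical name).
   Each output word is all 1s (FFFFh) where the inputs match, else all 0s. */
static void pcmpeqw_model(const uint16_t a[4], const uint16_t b[4], uint16_t mask[4])
{
    assert(a && b && mask);
    for (int i = 0; i < 4; i++) {
        mask[i] = (a[i] == b[i]) ? 0xFFFFu : 0x0000u;
    }
}
```

Comparing {1, 2, 3, 4} with {1, 0, 3, 5} gives the mask {FFFFh, 0000h, FFFFh, 0000h}. Because the mask is produced per element, it can be combined with logical instructions to select data without branching.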
For the PCMPGTx instructions, if the signed value in the first operand is greater than the signed value in the second operand, the result mask is all 1s. If the value in the first operand is less than or equal to the value in the second operand, the result mask is all 0s.

PCMPEQx can be used to set the bits in an MMX register to all 1s by specifying the same register for both operands.

Figure 5-5 on page 198 shows an example of a non-branching sequence that implements a two-way multiplexer—one that is equivalent to the following sequence of ternary operators in C or C++:

    r0 = a0 > b0 ? a0 : b0
    r1 = a1 > b1 ? a1 : b1
    r2 = a2 > b2 ? a2 : b2
    r3 = a3 > b3 ? a3 : b3

Assuming mmx0 contains a, and mmx1 contains b, the above C sequence can be implemented with the following assembler sequence:

    MOVQ    mmx3, mmx0    ; mmx3 = copy of a
    PCMPGTW mmx3, mmx1    ; a > b ? 0xffff : 0
    PAND    mmx0, mmx3    ; a > b ? a : 0
    PANDN   mmx3, mmx1    ; a > b ? 0 : b
    POR     mmx0, mmx3    ; r = a > b ? a : b

In the above sequence, PCMPGTW, PAND, PANDN, and POR operate, in parallel, on all four elements of the vectors.

Compare and Write Minimum or Maximum

• PMAXUB—Packed Maximum Unsigned Bytes
• PMINUB—Packed Minimum Unsigned Bytes
• PMAXSW—Packed Maximum Signed Words
• PMINSW—Packed Minimum Signed Words

The PMAXUB and PMINUB instructions compare each of the 8-bit unsigned integer values in the first operand with the corresponding 8-bit unsigned integer values in the second operand.
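The non-branching two-way multiplexer given earlier in this subsection (PCMPGTW, PAND, PANDN, POR) can be modeled step by step in scalar C. The sketch below is an illustration only: select_greater_words is a hypothetical name, and each loop iteration stands for one of the four word lanes that the instructions process in parallel.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of the compare-and-select sequence (illustrative;
   select_greater_words is a hypothetical name).
   Computes r[i] = a[i] > b[i] ? a[i] : b[i] without a data-dependent branch
   in the select step, mirroring the mask-based instruction sequence. */
static void select_greater_words(const int16_t a[4], const int16_t b[4], int16_t r[4])
{
    assert(a && b && r);
    for (int i = 0; i < 4; i++) {
        uint16_t ua = (uint16_t)a[i];
        uint16_t ub = (uint16_t)b[i];
        /* PCMPGTW: signed compare producing an all-1s or all-0s mask. */
        uint16_t mask = (a[i] > b[i]) ? 0xFFFFu : 0x0000u;
        /* PAND: keep a where the mask is set. */
        uint16_t keep_a = (uint16_t)(ua & mask);
        /* PANDN: invert the mask, then keep b where the mask was clear. */
        uint16_t keep_b = (uint16_t)(~mask & ub);
        /* POR: merge the two halves. The conversion back to int16_t wraps
           on two's-complement compilers such as GCC and Clang. */
        r[i] = (int16_t)(keep_a | keep_b);
    }
}
```

For a = {5, -3, 100, -32768} and b = {2, -1, 100, 32767}, the model returns {5, -1, 100, 32767}. When the two inputs are equal, the mask is clear and the value from b is selected, which yields the same result.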