Volume 1 Basic Architecture (794100), page 76
PROGRAMMING WITH SSE3 AND SUPPLEMENTAL SSE3

To use FISTTP, software only needs to check support for SSE3.

In the initial implementation of MONITOR and MWAIT, these two instructions are available to ring 0 and conditionally available at ring level greater than 0. Before an application attempts to use the MONITOR and MWAIT instructions, the application should use the following steps:

1. Check that the processor supports MONITOR and MWAIT. If CPUID.01H:ECX.MONITOR[bit 3] = 1, MONITOR and MWAIT are available at ring 0.
2. Query the smallest and largest line size that MONITOR uses. Use CPUID.05H:EAX.smallest[bits 15:0];EBX.largest[bits 15:0]. Values are returned in bytes in EAX and EBX.
3. Ensure the memory address range(s) that will be supplied to MONITOR meet memory type requirements.

MONITOR and MWAIT are targeted for system software that supports efficient thread synchronization. See Chapter 12 in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A for details.

12.4.3 Enable FTZ and DAZ for SIMD Floating-Point Computation

Enabling the FTZ and DAZ flags in the MXCSR register is likely to accelerate SIMD floating-point computation where strict compliance to the IEEE standard 754-1985 is not required. The FTZ flag is available to Intel 64 and IA-32 processors that support SSE; DAZ is available to Intel 64 processors and to most IA-32 processors that support SSE/SSE2/SSE3.

Software can detect the presence of DAZ, modify the MXCSR register, and save and restore state information by following the techniques discussed in Section 11.6.3 through Section 11.6.6.

12.4.4 Programming SSE3 with SSE/SSE2 Extensions

SIMD instructions in SSE3 extensions are intended to complement the use of SSE/SSE2 in programming SIMD applications.
Application software that intends to use SSE3 instructions should also check for the availability of SSE/SSE2 instructions.

The FISTTP instruction in SSE3 is intended to accelerate x87 style programming where performance is limited by frequent floating-point conversion to integers; this happens when the x87 FPU control word is modified frequently. Use of FISTTP can eliminate the need to access the x87 FPU control word.

12.5 OVERVIEW OF SSSE3 INSTRUCTIONS

SSSE3 provides 32 instructions to accelerate a variety of multimedia and signal processing applications employing SIMD integer data.
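The availability checks in Sections 12.4.2 and 12.4.4 come down to testing bits in the registers that CPUID returns. The C sketch below only decodes register values that are assumed to have been read already; how CPUID is executed is deliberately left out. The struct and function names are hypothetical, and the bit positions for SSE, SSE2, SSE3 and SSSE3 are taken from the CPUID feature-flag definitions rather than from this section.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the four registers returned by one CPUID leaf. */
typedef struct {
    uint32_t eax, ebx, ecx, edx;
} cpuid_regs;

/* Step 1 of Section 12.4.2: CPUID.01H:ECX.MONITOR[bit 3]. */
bool has_monitor(const cpuid_regs *leaf1) {
    return (leaf1->ecx >> 3) & 1u;
}

/* Section 12.4.4: check SSE3 together with SSE and SSE2.
   Assumed bit positions: SSE3 = ECX bit 0, SSE = EDX bit 25, SSE2 = EDX bit 26. */
bool has_sse3_with_sse_sse2(const cpuid_regs *leaf1) {
    bool sse3 = (leaf1->ecx >> 0) & 1u;
    bool sse  = (leaf1->edx >> 25) & 1u;
    bool sse2 = (leaf1->edx >> 26) & 1u;
    return sse3 && sse && sse2;
}

/* SSSE3 support. Assumed bit position: CPUID.01H:ECX bit 9. */
bool has_ssse3(const cpuid_regs *leaf1) {
    return (leaf1->ecx >> 9) & 1u;
}

/* Step 2 of Section 12.4.2: CPUID.05H:EAX[15:0] is the smallest line size in bytes. */
uint16_t monitor_smallest_line(const cpuid_regs *leaf5) {
    return (uint16_t)(leaf5->eax & 0xFFFFu);
}

/* Step 2 of Section 12.4.2: CPUID.05H:EBX[15:0] is the largest line size in bytes. */
uint16_t monitor_largest_line(const cpuid_regs *leaf5) {
    return (uint16_t)(leaf5->ebx & 0xFFFFu);
}
```

With GCC or Clang the register values would typically be filled in by `__get_cpuid` from `<cpuid.h>`; that call is omitted here so the sketch stays portable. Step 3 (memory type requirements) is a system-software concern and is not modeled.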
See:

• Section 12.6, “SSSE3 Instructions,” provides an introduction to individual SSSE3 instructions.
• Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volumes 2A & 2B, provide detailed information on individual instructions.
• Chapter 12, “System Programming for Streaming SIMD Instruction Sets,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, gives guidelines for integrating SSE/SSE2/SSE3/SSSE3 extensions into an operating-system environment.

12.6 SSSE3 INSTRUCTIONS

SSSE3 instructions include:

• Twelve instructions that perform horizontal addition or subtraction operations.
• Six instructions that evaluate the absolute values.
• Two instructions that perform multiply and add operations and speed up the evaluation of dot products.
• Two instructions that accelerate packed-integer multiply operations and produce integer values with scaling.
• Two instructions that perform a byte-wise, in-place shuffle according to the second shuffle control operand.
• Six instructions that negate packed integers in the destination operand if the corresponding element in the source operand is less than zero.
• Two instructions that align data from the composite of two operands.

The operands of these instructions are packed integers of byte, word, or doubleword sizes.
The operands are stored as 64-bit or 128-bit data in MMX registers, XMM registers, or memory.

The instructions are discussed in more detail in the following paragraphs.

12.6.1 Horizontal Addition/Subtraction

In analogy to the packed, floating-point horizontal add and subtract instructions in SSE3, SSSE3 offers similar capabilities on packed integer data. Data elements of signed words and doublewords are supported. Saturating versions of horizontal add and subtract on signed words are also supported.
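To make the pairing concrete, here is a scalar C model of the doubleword horizontal add and subtract (128-bit form) and of the saturating word add (64-bit form, for brevity). It is a sketch of the semantics, not of the hardware: the function names are hypothetical, element 0 is the least significant element, and the placement of destination-derived results in the low half and source-derived results in the high half follows Figure 12-3 and the instruction reference.

```c
#include <stdint.h>

/* Wraparound arithmetic done in unsigned math so overflow is well defined.
   Converting the result back to int32_t assumes two's complement wrapping. */
static int32_t wrap_add32(int32_t a, int32_t b) {
    return (int32_t)((uint32_t)a + (uint32_t)b);
}
static int32_t wrap_sub32(int32_t a, int32_t b) {
    return (int32_t)((uint32_t)a - (uint32_t)b);
}

/* Clamp a wide intermediate value to the signed 16-bit range. */
static int16_t saturate16(int32_t v) {
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* PHADDD model: adjacent doublewords of dst fill the low half of the result,
   adjacent doublewords of src fill the high half. dst is overwritten. */
void phaddd_model(int32_t dst[4], const int32_t src[4]) {
    int32_t r[4];
    r[0] = wrap_add32(dst[0], dst[1]);
    r[1] = wrap_add32(dst[2], dst[3]);
    r[2] = wrap_add32(src[0], src[1]);
    r[3] = wrap_add32(src[2], src[3]);
    for (int i = 0; i < 4; i++) dst[i] = r[i];
}

/* PHSUBD model: in each pair, the more significant element is subtracted
   from the less significant one. */
void phsubd_model(int32_t dst[4], const int32_t src[4]) {
    int32_t r[4];
    r[0] = wrap_sub32(dst[0], dst[1]);
    r[1] = wrap_sub32(dst[2], dst[3]);
    r[2] = wrap_sub32(src[0], src[1]);
    r[3] = wrap_sub32(src[2], src[3]);
    for (int i = 0; i < 4; i++) dst[i] = r[i];
}

/* PHADDSW model, 64-bit form: four signed words per operand, saturated sums. */
void phaddsw_model64(int16_t dst[4], const int16_t src[4]) {
    int16_t r[4];
    r[0] = saturate16((int32_t)dst[0] + dst[1]);
    r[1] = saturate16((int32_t)dst[2] + dst[3]);
    r[2] = saturate16((int32_t)src[0] + src[1]);
    r[3] = saturate16((int32_t)src[2] + src[3]);
    for (int i = 0; i < 4; i++) dst[i] = r[i];
}
```

For example, with dst = {1, 2, 3, 4} and src = {10, 20, 30, 40}, the add model leaves {3, 7, 30, 70} in dst. The non-saturating word forms would wrap instead of clamping.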
The horizontal data movement of PHADD is shown in Figure 12-3.

[Figure 12-3. Horizontal Data Movement in PHADDD. Inputs X3 X2 X1 X0 and Y3 Y2 Y1 Y0 feed four ADD units; the result is Y2 + Y3, Y0 + Y1, X2 + X3, X0 + X1.]

There are six horizontal add instructions (represented by three mnemonics); three operate on 128-bit operands and three operate on 64-bit operands. The width of each data element is either 16 bits or 32 bits. The mnemonics are listed below.

• PHADDW adds two adjacent, signed 16-bit integers horizontally from the source and destination operands and packs the signed 16-bit results to the destination operand.
• PHADDSW adds two adjacent, signed 16-bit integers horizontally from the source and destination operands and packs the signed, saturated 16-bit results to the destination operand.
• PHADDD adds two adjacent, signed 32-bit integers horizontally from the source and destination operands and packs the signed 32-bit results to the destination operand.

There are six horizontal subtract instructions (represented by three mnemonics); three operate on 128-bit operands and three operate on 64-bit operands.
The width of each data element is either 16 bits or 32 bits. These are listed below.

• PHSUBW performs horizontal subtraction on each adjacent pair of 16-bit signed integers by subtracting the most significant word from the least significant word of each pair in the source and destination operands. The signed 16-bit results are packed and written to the destination operand.
• PHSUBSW performs horizontal subtraction on each adjacent pair of 16-bit signed integers by subtracting the most significant word from the least significant word of each pair in the source and destination operands. The signed, saturated 16-bit results are packed and written to the destination operand.
• PHSUBD performs horizontal subtraction on each adjacent pair of 32-bit signed integers by subtracting the most significant doubleword from the least significant doubleword of each pair in the source and destination operands. The signed 32-bit results are packed and written to the destination operand.

12.6.2 Packed Absolute Values

There are six packed-absolute-value instructions (represented by three mnemonics). Three operate on 128-bit operands and three operate on 64-bit operands. The widths of data elements are 8 bits, 16 bits or 32 bits.
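The result of a packed absolute value is an unsigned element of the same width, and a scalar sketch of the byte form shows why that matters: the most negative input has no positive counterpart in the signed range, but its magnitude fits in the unsigned element. The function name and the array-based interface below are hypothetical; the word and doubleword forms follow the same pattern with wider types.

```c
#include <stddef.h>
#include <stdint.h>

/* PABSB model: n is 8 for the 64-bit form and 16 for the 128-bit form.
   Each signed source byte is replaced by its magnitude, stored unsigned. */
void pabsb_model(uint8_t *dst, const int8_t *src, size_t n) {
    for (size_t i = 0; i < n; i++) {
        int v = src[i];                      /* widen first so -(-128) cannot overflow */
        dst[i] = (uint8_t)(v < 0 ? -v : v);  /* -128 becomes 128 (0x80) */
    }
}
```

Reading the result of an input of -128 back as a signed byte would give -128 again, which is why the destination has to be interpreted as unsigned.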
The absolute value of each data element of the source operand is stored as an UNSIGNED result in the destination operand.

• PABSB computes the absolute value of each signed byte data element.
• PABSW computes the absolute value of each signed 16-bit data element.
• PABSD computes the absolute value of each signed 32-bit data element.

12.6.3 Multiply and Add Packed Signed and Unsigned Bytes

There are two multiply-and-add-packed-signed-unsigned-byte instructions (represented by one mnemonic). One operates on 128-bit operands and the other operates on 64-bit operands. Multiplications are performed on each vertical pair of data elements.
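The multiply-and-add operation of this section can be modeled in scalar C as below. This is a sketch under stated assumptions: the function name and the separate output array are hypothetical, the destination bytes are read as unsigned and the source bytes as signed, and element 0 is the least significant. Each product fits in a signed 16-bit value, but the sum of two products may not, which is why the result is saturated.

```c
#include <stddef.h>
#include <stdint.h>

/* Clamp a wide intermediate value to the signed 16-bit range. */
static int16_t saturate16(int32_t v) {
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* PMADDUBSW model.
   dst_in: 2 * n_words unsigned bytes (the destination operand on input)
   src:    2 * n_words signed bytes   (the source operand)
   out:    n_words signed words       (what the destination holds afterwards)
   n_words is 4 for the 64-bit form and 8 for the 128-bit form. */
void pmaddubsw_model(int16_t *out, const uint8_t *dst_in,
                     const int8_t *src, size_t n_words) {
    for (size_t i = 0; i < n_words; i++) {
        /* Vertical multiplies: unsigned byte times signed byte. */
        int32_t lo = (int32_t)dst_in[2 * i]     * src[2 * i];
        int32_t hi = (int32_t)dst_in[2 * i + 1] * src[2 * i + 1];
        /* Horizontal add of the adjacent pair, with signed saturation. */
        out[i] = saturate16(lo + hi);
    }
}
```

For instance, bytes (1, 2) against (3, 4) give 1*3 + 2*4 = 11, while (255, 255) against (127, 127) would sum to 64770 and therefore saturates to 32767.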
The data elements in the source operand are signed byte values; the input data elements of the destination operand are unsigned byte values.

• PMADDUBSW multiplies each unsigned byte value with the corresponding signed byte value to produce an intermediate, 16-bit signed integer. Each adjacent pair of 16-bit signed values is added horizontally. The signed, saturated 16-bit results are packed to the destination operand.

12.6.4 Packed Multiply High with Round and Scale

There are two packed-multiply-high-with-round-and-scale instructions (represented by one mnemonic). One operates on 128-bit operands and the other operates on 64-bit operands.

• PMULHRSW multiplies vertically each signed 16-bit integer from the destination operand with the corresponding signed 16-bit integer of the source operand, producing intermediate, signed 32-bit integers. Each intermediate 32-bit integer is truncated to the 18 most significant bits. Rounding is always performed by adding 1 to the least significant bit of the 18-bit intermediate result. The final result is obtained by selecting the 16 bits immediately to the right of the most significant bit of each 18-bit intermediate result and packing them to the destination operand.

12.6.5 Packed Shuffle Bytes

There are two packed-shuffle-bytes instructions (represented by one mnemonic). One operates on 128-bit operands and the other operates on 64-bit operands.
The shuffle operations are performed bytewise on the destination operand using the source operand as a control mask.

• PSHUFB permutes each byte in place, according to a shuffle control mask. The least significant three or four bits of each shuffle control byte of the control mask form the shuffle index.
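Two of the operations above are easier to follow as scalar code: the round-and-scale step of Section 12.6.4 and the in-place byte shuffle. The C sketch below models both. The function names are hypothetical. The shuffle model also zeroes a result byte when bit 7 of its control byte is set; that behaviour is taken from the instruction reference and is not stated in the text above. The multiply model assumes that `>>` on a negative value is an arithmetic shift, which holds on mainstream compilers but is implementation-defined in ISO C.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* PMULHRSW model for one pair of elements. */
int16_t pmulhrsw_element(int16_t a, int16_t b) {
    int32_t product = (int32_t)a * (int32_t)b; /* signed 32-bit intermediate */
    int32_t top18 = product >> 14;             /* keep the 18 most significant bits */
    top18 += 1;                                /* round: add 1 at the least significant bit */
    return (int16_t)(top18 >> 1);              /* bits 16..1 of the 18-bit value */
}

/* PSHUFB model. n is 8 (64-bit form, 3-bit index) or 16 (128-bit form,
   4-bit index). dst is both the data being shuffled and the result. */
void pshufb_model(uint8_t *dst, const uint8_t *ctrl, size_t n) {
    uint8_t old[16];
    memcpy(old, dst, n);                       /* shuffle reads the original bytes */
    for (size_t i = 0; i < n; i++) {
        if (ctrl[i] & 0x80u) {
            dst[i] = 0;                        /* assumption from the instruction reference */
        } else {
            dst[i] = old[ctrl[i] & (n - 1)];   /* low 3 or 4 bits select the byte */
        }
    }
}
```

Treating the words as Q15 fixed-point fractions, `pmulhrsw_element(16384, 16384)` multiplies 0.5 by 0.5 and returns 8192, which is 0.25. A control mask of 15, 14, ..., 0 passed to `pshufb_model` with n = 16 reverses the byte order of the destination.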