Volume 1 Basic Architecture (794100), page 75
In 64-bit mode, eight additional XMM registers are accessible. Registers XMM8-XMM15 are accessed by using REX prefixes.

Memory operands are specified using the ModR/M, SIB encoding described in Section 3.7.5.

Some SSE3 instructions may be used to operate on general-purpose registers. Use the REX.W prefix to access 64-bit general-purpose registers. Note that if a REX prefix is used when it has no meaning, the prefix is ignored.

PROGRAMMING WITH SSE3 AND SUPPLEMENTAL SSE3

12.1.2 Compatibility of SSE3/SSSE3 with MMX Technology, the x87 FPU Environment, and SSE/SSE2 Extensions

SSE3/SSSE3 do not introduce any new state to the Intel 64 and IA-32 execution environments.

For SIMD and x87 programming, the FXSAVE and FXRSTOR instructions save and restore the architectural states of XMM, MXCSR, x87 FPU, and MMX registers.
The MONITOR and MWAIT instructions use general-purpose registers on input; they do not modify the content of those registers.

12.1.3 Horizontal and Asymmetric Processing

Many SSE/SSE2/SSE3/SSSE3 instructions accelerate SIMD data processing using a model referred to as vertical computation. Using this model, data flow is vertical between the data elements of the inputs and the output.

Figure 12-1 illustrates the asymmetric processing of the SSE3 instruction ADDSUBPD. Figure 12-2 illustrates the horizontal data movement of the SSE3 instruction HADDPD.

Figure 12-1. Asymmetric Processing in ADDSUBPD (inputs X1, X0 and Y1, Y0; ADDSUB produces X1 + Y1 and X0 - Y0)

Figure 12-2. Horizontal Data Movement in HADDPD (inputs X1, X0 and Y1, Y0; the two ADD operations produce Y0 + Y1 and X0 + X1)

12.2 OVERVIEW OF SSE3 INSTRUCTIONS

SSE3 extensions include 13 instructions. See:
• Section 12.3, “SSE3 Instructions,” provides an introduction to individual SSE3 instructions.
• Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volumes 2A & 2B, provide detailed information on individual instructions.
• Chapter 12, “System Programming for Streaming SIMD Instruction Sets,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, gives guidelines for integrating SSE/SSE2/SSE3 extensions into an operating system environment.

12.3 SSE3 INSTRUCTIONS

SSE3 instructions are grouped as follows:
• x87 FPU instruction
— One instruction that improves x87 FPU floating-point to integer conversion
• SIMD integer instruction
— One instruction that provides a specialized 128-bit unaligned data load
• SIMD floating-point instructions
— Three instructions that enhance LOAD/MOVE/DUPLICATE performance
— Two instructions that provide packed addition/subtraction
— Four instructions that provide horizontal addition/subtraction
• Thread synchronization instructions
— Two instructions that improve synchronization between multi-threaded agents

The instructions are discussed in more detail in the following paragraphs.

12.3.1 x87 FPU Instruction for Integer Conversion

The FISTTP instruction (x87 FPU Store Integer and Pop with Truncation) behaves like FISTP, but uses truncation regardless of what rounding mode is specified in the x87 FPU control word. The instruction converts the top of stack (ST0) to an integer by truncation (rounding toward zero) and pops the stack.

The FISTTP instruction is available in three precisions: short integer (word or 16-bit), integer (double word or 32-bit), and long integer (64-bit).
With FISTTP, applications no longer need to change the FCW when truncation is required.

12.3.2 SIMD Integer Instruction for Specialized 128-bit Unaligned Data Load

The LDDQU instruction is a special 128-bit unaligned load designed to avoid cache line splits. If the address of a 16-byte load is on a 16-byte boundary, LDDQU loads the bytes requested. If the address of the load is not aligned on a 16-byte boundary, LDDQU loads a 32-byte block starting at the 16-byte aligned address immediately below the load request.
It then extracts the requested 16 bytes.

The instruction provides significant performance improvement on 128-bit unaligned memory accesses at the cost of some usage model restrictions.

12.3.3 SIMD Floating-Point Instructions That Enhance LOAD/MOVE/DUPLICATE Performance

The MOVSHDUP instruction loads/moves 128 bits, duplicating the second and fourth 32-bit data elements.
• MOVSHDUP OperandA, OperandB
— OperandA (128 bits, four data elements): 3a, 2a, 1a, 0a
— OperandB (128 bits, four data elements): 3b, 2b, 1b, 0b
— Result (stored in OperandA): 3b, 3b, 1b, 1b

The MOVSLDUP instruction loads/moves 128 bits, duplicating the first and third 32-bit data elements.
• MOVSLDUP OperandA, OperandB
— OperandA (128 bits, four data elements): 3a, 2a, 1a, 0a
— OperandB (128 bits, four data elements): 3b, 2b, 1b, 0b
— Result (stored in OperandA): 2b, 2b, 0b, 0b

The MOVDDUP instruction loads/moves 64 bits, duplicating the 64 bits from the source.
• MOVDDUP OperandA, OperandB
— OperandA (128 bits, two data elements): 1a, 0a
— OperandB (64 bits, one data element): 0b
— Result (stored in OperandA): 0b, 0b

12.3.4 SIMD Floating-Point Instructions Provide Packed Addition/Subtraction

The ADDSUBPS instruction has two 128-bit operands. The instruction performs single-precision addition on the second and fourth pairs of 32-bit data elements within the operands, and single-precision subtraction on the first and third pairs.
• ADDSUBPS OperandA, OperandB
— OperandA (128 bits, four data elements): 3a, 2a, 1a, 0a
— OperandB (128 bits, four data elements): 3b, 2b, 1b, 0b
— Result (stored in OperandA): 3a+3b, 2a-2b, 1a+1b, 0a-0b

The ADDSUBPD instruction has two 128-bit operands.
The instruction performs double-precision addition on the second pair of quadwords, and double-precision subtraction on the first pair.
• ADDSUBPD OperandA, OperandB
— OperandA (128 bits, two data elements): 1a, 0a
— OperandB (128 bits, two data elements): 1b, 0b
— Result (stored in OperandA): 1a+1b, 0a-0b

12.3.5 SIMD Floating-Point Instructions Provide Horizontal Addition/Subtraction

Most SIMD instructions operate vertically.
This means that the result in position i is a function of the elements in position i of both operands. Horizontal addition/subtraction operates horizontally. This means that contiguous data elements in the same source operand are used to produce a result.

The HADDPS instruction performs a single-precision addition on contiguous data elements. The first data element of the result is obtained by adding the first and second elements of the first operand; the second element by adding the third and fourth elements of the first operand; the third by adding the first and second elements of the second operand; and the fourth by adding the third and fourth elements of the second operand.
• HADDPS OperandA, OperandB
— OperandA (128 bits, four data elements): 3a, 2a, 1a, 0a
— OperandB (128 bits, four data elements): 3b, 2b, 1b, 0b
— Result (stored in OperandA): 3b+2b, 1b+0b, 3a+2a, 1a+0a

The HSUBPS instruction performs a single-precision subtraction on contiguous data elements.
The first data element of the result is obtained by subtracting the second element of the first operand from the first element of the first operand; the second element by subtracting the fourth element of the first operand from the third element of the first operand; the third by subtracting the second element of the second operand from the first element of the second operand; and the fourth by subtracting the fourth element of the second operand from the third element of the second operand.
• HSUBPS OperandA, OperandB
— OperandA (128 bits, four data elements): 3a, 2a, 1a, 0a
— OperandB (128 bits, four data elements): 3b, 2b, 1b, 0b
— Result (stored in OperandA): 2b-3b, 0b-1b, 2a-3a, 0a-1a

The HADDPD instruction performs a double-precision addition on contiguous data elements.
The first data element of the result is obtained by adding the first and second elements of the first operand; the second element by adding the first and second elements of the second operand.
• HADDPD OperandA, OperandB
— OperandA (128 bits, two data elements): 1a, 0a
— OperandB (128 bits, two data elements): 1b, 0b
— Result (stored in OperandA): 1b+0b, 1a+0a

The HSUBPD instruction performs a double-precision subtraction on contiguous data elements. The first data element of the result is obtained by subtracting the second element of the first operand from the first element of the first operand; the second element by subtracting the second element of the second operand from the first element of the second operand.
• HSUBPD OperandA, OperandB
— OperandA (128 bits, two data elements): 1a, 0a
— OperandB (128 bits, two data elements): 1b, 0b
— Result (stored in OperandA): 0b-1b, 0a-1a
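
The element orderings listed in Sections 12.3.3 through 12.3.5 can be easier to check against a scalar model. The following is a minimal sketch in portable C, not the hardware instructions themselves: the type vec4f and the model_* function names are hypothetical, chosen only for illustration. Index 0 of the array is assumed to hold the lowest element (0a or 0b in the lists above), so the lists read from index 3 down to index 0.

```c
/* Scalar reference models of the single-precision element orderings in the text.
   Index 0 is the lowest element (0a / 0b); index 3 is the highest (3a / 3b).
   vec4f and the model_* names are illustrative assumptions, not an Intel API. */
typedef struct { float e[4]; } vec4f;

/* MOVSHDUP: result (high to low) = 3b, 3b, 1b, 1b */
static vec4f model_movshdup(vec4f b) {
    vec4f r = {{ b.e[1], b.e[1], b.e[3], b.e[3] }};
    return r;
}

/* MOVSLDUP: result (high to low) = 2b, 2b, 0b, 0b */
static vec4f model_movsldup(vec4f b) {
    vec4f r = {{ b.e[0], b.e[0], b.e[2], b.e[2] }};
    return r;
}

/* ADDSUBPS: result (high to low) = 3a+3b, 2a-2b, 1a+1b, 0a-0b */
static vec4f model_addsubps(vec4f a, vec4f b) {
    vec4f r = {{ a.e[0] - b.e[0], a.e[1] + b.e[1],
                 a.e[2] - b.e[2], a.e[3] + b.e[3] }};
    return r;
}

/* HADDPS: result (high to low) = 3b+2b, 1b+0b, 3a+2a, 1a+0a */
static vec4f model_haddps(vec4f a, vec4f b) {
    vec4f r = {{ a.e[0] + a.e[1], a.e[2] + a.e[3],
                 b.e[0] + b.e[1], b.e[2] + b.e[3] }};
    return r;
}

/* HSUBPS: result (high to low) = 2b-3b, 0b-1b, 2a-3a, 0a-1a */
static vec4f model_hsubps(vec4f a, vec4f b) {
    vec4f r = {{ a.e[0] - a.e[1], a.e[2] - a.e[3],
                 b.e[0] - b.e[1], b.e[2] - b.e[3] }};
    return r;
}
```

For example, with a.e = {1, 2, 3, 4} and b.e = {10, 20, 30, 40}, model_haddps(a, b) yields e = {3, 7, 30, 70}: the two low results come from OperandA and the two high results from OperandB. The double-precision forms follow the same pattern with two elements per operand.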
12.3.6 Two Thread Synchronization Instructions

The MONITOR instruction sets up an address range that is used to monitor write-back-stores.

MWAIT enables a logical processor to enter into an optimized state while waiting for a write-back-store to the address range set up by MONITOR. MONITOR and MWAIT require the use of general-purpose registers for their input. The registers used by MONITOR and MWAIT must be initialized properly; register content is not modified by these instructions.

12.4 WRITING APPLICATIONS WITH SSE3 EXTENSIONS

The following sections give guidelines for writing application programs and operating-system code that use SSE3 instructions.

12.4.1 Guidelines for Using SSE3 Extensions

The following guidelines describe how to maximize the benefits of using SSE3 extensions:
• Check that the processor supports SSE3 extensions.
— Applications may need to ensure that the target operating system supports SSE3. (Operating system support for the SSE extensions implies sufficient support for SSE2 extensions and SSE3 extensions.)
• Ensure your operating system supports MONITOR and MWAIT.
• Employ the optimization and scheduling techniques described in the Intel® 64 and IA-32 Architectures Optimization Reference Manual (see Section 1.4, “Related Literature”).

12.4.2 Checking for SSE3 Support

Before an application attempts to use the SIMD subset of SSE3 extensions, the application should follow the steps illustrated in Section 11.6.2, “Checking for SSE/SSE2 Support.” Next, use the additional step provided below:
• Check that the processor supports the SIMD and x87 SSE3 extensions (if CPUID.01H:ECX.SSE3[bit 0] = 1).

An operating system that provides application support for SSE and SSE2 also provides sufficient application support for SSE3.
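
The additional CPUID step might be sketched in C as follows. This is an illustration under stated assumptions, not a complete detection routine: it relies on the __get_cpuid helper from the <cpuid.h> header shipped with GCC and Clang, it reports "not supported" on any other compiler or architecture, and it does not perform the earlier steps from Section 11.6.2. The function names ecx_reports_sse3 and cpu_reports_sse3 are hypothetical.

```c
#include <stdbool.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>   /* GCC/Clang helper header; an assumption of this sketch */
#define HAVE_GNU_CPUID 1
#else
#define HAVE_GNU_CPUID 0
#endif

/* Tests the SSE3 feature flag, CPUID.01H:ECX bit 0, in a raw ECX value. */
static bool ecx_reports_sse3(unsigned int ecx) {
    return (ecx & 1u) != 0;
}

/* Queries CPUID leaf 01H and checks the SSE3 flag.
   Returns false when CPUID leaf 1 is unavailable or the helper is missing. */
static bool cpu_reports_sse3(void) {
#if HAVE_GNU_CPUID
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    return ecx_reports_sse3(ecx);
#else
    return false;
#endif
}
```

An application would typically call cpu_reports_sse3() once at startup, after the SSE/SSE2 checks, and select an SSE3 code path only if it returns true. Keeping the bit test in a separate function makes it possible to exercise the logic with known ECX values.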