Volume 1 Basic Architecture (794100), page 73
PROGRAMMING WITH STREAMING SIMD EXTENSIONS 2 (SSE2)

For example, an application can perform the majority of its floating-point computations in the XMM registers, using the packed and scalar floating-point instructions, and at the same time use the x87 FPU to perform trigonometric and other transcendental computations. Likewise, an application can perform packed 64-bit and 128-bit SIMD integer operations together without restrictions.

• Those SSE and SSE2 instructions that operate on MMX registers (such as the CVTPS2PI, CVTTPS2PI, CVTPI2PS, CVTPD2PI, CVTTPD2PI, CVTPI2PD, MOVDQ2Q, MOVQ2DQ, PADDQ, and PSUBQ instructions) can also be executed in the same instruction stream as 64-bit SIMD integer or x87 FPU instructions; however, here they are subject to the restrictions on the simultaneous use of MMX technology and x87 FPU instructions, which include:
  — Transition from x87 FPU to MMX technology instructions or to SSE or SSE2 instructions that operate on MMX registers should be preceded by saving the state of the x87 FPU.
  — Transition from MMX technology instructions or from SSE or SSE2 instructions that operate on MMX registers to x87 FPU instructions should be preceded by execution of the EMMS instruction.

11.6.8 Compatibility of SIMD and x87 FPU Floating-Point Data Types

SSE and SSE2 extensions operate on the same single-precision and double-precision floating-point data types that the x87 FPU operates on.
However, when operating on these data types, the SSE and SSE2 extensions operate on them in their native format (single-precision or double-precision), in contrast to the x87 FPU which extends them to double extended-precision floating-point format to perform computations and then rounds the result back to a single-precision or double-precision format before writing results to memory. Because the x87 FPU operates on a higher-precision format and then rounds the result to a lower-precision format, it may return a slightly different result when performing the same operation on the same single-precision or double-precision floating-point values than is returned by the SSE and SSE2 extensions.
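A minimal C sketch of this effect, under stated assumptions: `long double` stands in for the x87 double extended-precision format (it only needs to be wider than `float`), the default round-to-nearest mode is active, and the function names `sum_native` and `sum_extended` are hypothetical. The effect is easiest to show with a short chain of operations whose intermediate value either is rounded to single precision at each step or stays in the wider format until the end.

```c
#include <float.h>

/* Sum three floats, rounding every intermediate result to single
   precision, as scalar SSE arithmetic on single-precision data does. */
float sum_native(float a, float b, float c)
{
    volatile float t = a + b;  /* volatile forces a rounded float store */
    volatile float r = t + c;
    return r;
}

/* Sum three floats in a wider format and round only once at the end,
   a model of x87 evaluation followed by a single-precision store. */
float sum_extended(float a, float b, float c)
{
    long double t = (long double)a + (long double)b + (long double)c;
    return (float)t;
}
```

With a = 1.0f and b = c = FLT_EPSILON / 2, `sum_native` returns 1.0f, while `sum_extended` returns 1.0f + FLT_EPSILON, one unit in the last place higher.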
The difference occurs only in the least-significant bits of the significand.

11.6.9 Mixing Packed and Scalar Floating-Point and 128-Bit SIMD Integer Instructions and Data

SSE and SSE2 extensions define typed operations on packed and scalar floating-point data types and on 128-bit SIMD integer data types, but IA-32 processors do not enforce this typing at the architectural level.
They only enforce it at the microarchitectural level. Therefore, when a Pentium 4 or Intel Xeon processor loads a packed or scalar floating-point operand or a 128-bit packed integer operand from memory into an XMM register, it does not check that the actual data being loaded matches the data type specified in the instruction. Likewise, when the processor performs an arithmetic operation on the data in an XMM register, it does not check that the data being operated on matches the data type specified in the instruction.

As a general rule, because data typing of SIMD floating-point and integer data types is not enforced at the architectural level, it is the responsibility of the programmer, assembler, or compiler to ensure that code enforces data typing. Failure to enforce correct data typing can lead to computations that return unexpected results.

For example, in the following code sample, two packed single-precision floating-point operands are moved from memory into XMM registers (using MOVAPS instructions); then a double-precision packed add operation (using the ADDPD instruction) is performed on the operands:

    movaps xmm0, [eax]   ; EAX register contains pointer to packed
                         ; single-precision floating-point operand
    movaps xmm1, [ebx]
    addpd  xmm0, xmm1

Pentium 4 and Intel Xeon processors execute these instructions without generating an invalid-opcode exception (#UD) and will produce the expected results in register XMM0 (that is, the high and low 64 bits of each register will be treated as a double-precision floating-point value and the processor will operate on them accordingly). Because the data types operated on and the data type expected by the ADDPD instruction were inconsistent, the instruction may result in a SIMD floating-point exception (such as numeric overflow [#O] or invalid operation [#I]) being generated, but the actual source of the problem (inconsistent data types) is not detected.

The ability to operate on an operand that contains a data type that is inconsistent with the typing of the instruction being executed permits some valid operations to be performed.
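The ADDPD sample above can be mirrored in C with SSE2 intrinsics. This is a sketch under assumptions: an x86 compiler that provides `<emmintrin.h>`, hypothetical function names `add_typed` and `add_mismatched`, and unaligned loads and stores (which correspond to MOVUPS rather than MOVAPS) so that the arrays need no 16-byte alignment. The cast intrinsics only reinterpret register contents; the exact instructions emitted are up to the compiler.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; assumes an x86 compiler */

/* Correctly typed: single-precision loads followed by a packed
   single-precision add (ADDPS). */
void add_typed(const float *a, const float *b, float *out)
{
    __m128 x = _mm_loadu_ps(a);
    __m128 y = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(x, y));
}

/* Mismatched: the same single-precision data is added with a packed
   double-precision add (ADDPD). Each pair of floats is treated as one
   double, so the result is not the element-wise float sum. */
void add_mismatched(const float *a, const float *b, float *out)
{
    __m128d x = _mm_castps_pd(_mm_loadu_ps(a));  /* reinterpret only */
    __m128d y = _mm_castps_pd(_mm_loadu_ps(b));
    _mm_storeu_ps(out, _mm_castpd_ps(_mm_add_pd(x, y)));
}
```

For inputs {1, 2, 3, 4} and {10, 20, 30, 40}, `add_typed` yields {11, 22, 33, 44}, while `add_mismatched` runs without any fault and returns unrelated bit patterns. The same lack of type checking is what makes deliberate reinterpretation of register contents possible.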
For example, the following instructions load a packed double-precision floating-point operand from memory to register XMM0, and a mask to register XMM1; then they use XORPD to toggle the sign bits of the two packed values in register XMM0.

    movapd xmm0, [eax]   ; EAX register contains pointer to packed
                         ; double-precision floating-point operand
    movaps xmm1, [ebx]   ; EBX register contains pointer to packed
                         ; double-precision floating-point mask
    xorpd  xmm0, xmm1    ; XOR operation toggles sign bits using
                         ; the mask in xmm1

In this example, XORPS or PXOR can be used in place of XORPD and yield the same correct result. However, because of the type mismatch between the operand data type and the instruction data type, a latency penalty will be incurred because of how the instructions are implemented at the microarchitectural level.

Latency penalties can also be incurred by using move instructions of the wrong type. For example, MOVAPS and MOVAPD can both be used to move a packed single-precision operand from memory to an XMM register. However, if MOVAPD is used, a latency penalty will be incurred when a correctly typed instruction attempts to use the data in the register.

Note that these latency penalties are not incurred when moving data from XMM registers to memory.

11.6.10 Interfacing with SSE/SSE2 Procedures and Functions

SSE and SSE2 extensions allow direct access to XMM registers.
This means that all existing interface conventions between procedures and functions that apply to the use of the general-purpose registers (EAX, EBX, etc.) also apply to XMM register usage.

11.6.10.1 Passing Parameters in XMM Registers

The state of XMM registers is preserved across procedure (or function) boundaries. Parameters can be passed from one procedure to another using XMM registers.

11.6.10.2 Saving XMM Register State on a Procedure or Function Call

The state of XMM registers can be saved in two ways: using an FXSAVE instruction or a move instruction. FXSAVE saves the state of all XMM registers (along with the state of MXCSR and the x87 FPU registers).
This instruction is typically used for major changes in the context of the execution environment, such as a task switch. FXRSTOR restores the XMM, MXCSR, and x87 FPU registers stored with FXSAVE.

In cases where only XMM registers must be saved, or where selected XMM registers need to be saved, move instructions (MOVAPS, MOVUPS, MOVSS, MOVAPD, MOVUPD, MOVSD, MOVDQA, and MOVDQU) can be used. These instructions can also be used to restore the contents of XMM registers. To avoid performance degradation when saving XMM registers to memory or when loading XMM registers from memory, be sure to use the appropriately typed move instructions.

The move instructions can also be used to save the contents of XMM registers on the stack.
Here, the stack pointer (in the ESP register) can be used as the memory address of the next available byte in the stack. Note that the stack pointer is not automatically adjusted when using a move instruction (as it is with PUSH). A move-instruction procedure that saves the contents of an XMM register to the stack is responsible for decrementing the value in the ESP register by 16.
Likewise, a move-instruction procedure that loads an XMM register from the stack also needs to increment the ESP register by 16. To avoid performance degradation when moving the contents of XMM registers, use the appropriately typed move instructions.

Use the STMXCSR and LDMXCSR instructions to save and restore, respectively, the contents of the MXCSR register on a procedure call and return.

11.6.10.3 Caller-Save Requirement for Procedure and Function Calls

When making procedure (or function) calls from SSE or SSE2 code, a caller-save convention is recommended for saving the state of the calling procedure. Using this convention, any register whose content must survive intact across a procedure call must be stored in memory by the calling procedure prior to executing the call.

The primary reason for using the caller-save convention is to prevent performance degradation.
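As a small caller-side sketch of this convention, applied to the MXCSR register rather than to an XMM register, the C fragment below saves MXCSR before a call and restores it afterward. It assumes an x86 compiler that provides `<xmmintrin.h>`; `callee_changes_rounding` and `call_preserving_mxcsr` are hypothetical names chosen for illustration. The `_mm_getcsr` and `_mm_setcsr` intrinsics read and write MXCSR, which compilers typically implement with STMXCSR and LDMXCSR.

```c
#include <xmmintrin.h>  /* _mm_getcsr, _mm_setcsr; assumes an x86 compiler */

/* Hypothetical callee that modifies MXCSR by changing the
   rounding-control field to round toward zero. */
void callee_changes_rounding(void)
{
    _MM_SET_ROUNDING_MODE(_MM_ROUND_TOWARD_ZERO);
}

/* Caller-save wrapper: the caller stores MXCSR before the call and
   reloads the stored value after the callee returns. */
void call_preserving_mxcsr(void (*fn)(void))
{
    unsigned int saved = _mm_getcsr();  /* read MXCSR (save) */
    fn();
    _mm_setcsr(saved);                  /* write MXCSR (restore) */
}
```

Calling `call_preserving_mxcsr(callee_changes_rounding)` leaves the caller's rounding mode unchanged. Restoring the whole register also discards any exception flags the callee set, which may or may not be what a given caller wants.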