Volume 1 Basic Architecture (794100), page 69
Like the PADDQ instruction, PSUBQ can operate on either unsigned or signed (two’s complement notation) integer operands.

The PMULUDQ (multiply packed unsigned doubleword integers) instruction performs an unsigned multiply of unsigned doubleword integers and returns a quadword result. Both 64-bit and 128-bit versions of this instruction are available. The 64-bit version operates on two doubleword integers stored in the low doubleword of each source operand, and the quadword result is returned to an MMX register.
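As a rough illustration of the PMULUDQ arithmetic, the following Python sketch models the 64-bit form and treats the 128-bit form as the same operation applied to each quadword lane. It is a software model only, not the instruction itself; the function names `pmuludq64` and `pmuludq128` are hypothetical, and operands are represented as plain Python integers.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def pmuludq64(a, b):
    """Model of the 64-bit form: multiply the low doublewords of two
    64-bit operands. The upper doubleword of each operand is ignored."""
    # An unsigned 32 x 32 bit product always fits in 64 bits.
    return (a & MASK32) * (b & MASK32)


def pmuludq128(a, b):
    """Model of the 128-bit form: apply the 64-bit operation to the low
    and to the high quadword lane of two 128-bit operands."""
    lo = pmuludq64(a & MASK64, b & MASK64)
    hi = pmuludq64((a >> 64) & MASK64, (b >> 64) & MASK64)
    return (hi << 64) | lo
```

For example, `pmuludq64(0xDEADBEEFFFFFFFFF, 0x1234567800000002)` ignores the upper doublewords and returns `0x1FFFFFFFE`, the full 64-bit product of `0xFFFFFFFF` and `2`.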
The 128-bit version performs a packed multiply of two pairs of doubleword integers. Here, the doublewords are packed in the first and third doublewords of the source operands, and the quadword results are stored in the low and high quadwords of an XMM register.

The PSHUFLW (shuffle packed low words) instruction shuffles the word integers packed into the low quadword of the source operand and stores the shuffled result in the low quadword of the destination operand.
An 8-bit immediate operand specifies the shuffle order.

The PSHUFHW (shuffle packed high words) instruction shuffles the word integers packed into the high quadword of the source operand and stores the shuffled result in the high quadword of the destination operand. An 8-bit immediate operand specifies the shuffle order.

PROGRAMMING WITH STREAMING SIMD EXTENSIONS 2 (SSE2)

The PSHUFD (shuffle packed doubleword integers) instruction shuffles the doubleword integers packed into the source operand and stores the shuffled result in the destination operand.
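A minimal Python model of the PSHUFD shuffle may help make the role of its 8-bit immediate concrete. The sketch assumes the encoding given in the instruction's reference description (not spelled out in this section): each 2-bit field of the immediate, starting from the least significant bits, selects the source doubleword copied into the corresponding destination doubleword. The function name `pshufd` and the list representation (element 0 is the least significant doubleword) are illustrative choices.

```python
def pshufd(src, imm8):
    """Software model of a doubleword shuffle controlled by an 8-bit immediate.

    src  : list of four doublewords, element 0 being the least significant
    imm8 : shuffle order; bits [2*i+1 : 2*i] pick the source element
           that is copied to destination element i (assumed encoding)
    """
    if len(src) != 4:
        raise ValueError("src must hold exactly four doublewords")
    if not 0 <= imm8 <= 0xFF:
        raise ValueError("imm8 must fit in 8 bits")
    return [src[(imm8 >> (2 * i)) & 0b11] for i in range(4)]
```

Under this model, `pshufd([10, 11, 12, 13], 0x1B)` reverses the four doublewords, `0x00` broadcasts element 0 to every position, and `0xE4` leaves the order unchanged.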
An 8-bit immediate operand specifies the shuffle order.

The PSLLDQ (shift double quadword left logical) instruction shifts the contents of the source operand to the left by the number of bytes specified by an immediate operand. The empty low-order bytes are cleared (set to 0).

The PSRLDQ (shift double quadword right logical) instruction shifts the contents of the source operand to the right by the number of bytes specified by an immediate operand. The empty high-order bytes are cleared (set to 0).

The PUNPCKHQDQ (unpack high quadwords) instruction interleaves the high quadword of the source operand and the high quadword of the destination operand and writes them to the destination register.

The PUNPCKLQDQ (unpack low quadwords) instruction interleaves the low quadword of the source operand and the low quadword of the destination operand and writes them to the destination register.

Two additional SSE2 instructions enable data movement between the MMX registers and the XMM registers.

The MOVQ2DQ (move quadword integer from MMX to XMM registers) instruction moves the quadword integer from an MMX source register to an XMM destination register.

The MOVDQ2Q (move quadword integer from XMM to MMX registers) instruction moves the low quadword integer from an XMM source register to an MMX destination register.

11.4.3 128-Bit SIMD Integer Instruction Extensions

All of the 64-bit SIMD integer instructions introduced with MMX technology and SSE extensions (with the exception of the PSHUFW instruction) have been extended by SSE2 extensions to operate on 128-bit packed integer operands located in XMM registers.
The 128-bit versions of these instructions follow the same SIMD conventions regarding packed operands as the 64-bit versions. For example, where the 64-bit version of the PADDB instruction operates on 8 packed bytes, the 128-bit version operates on 16 packed bytes.

11.4.4 Cacheability Control and Memory Ordering Instructions

The SSE2 extensions that give programs more control over the caching, loading, and storing of data are described below.

11.4.4.1 FLUSH Cache Line

The CLFLUSH (flush cache line) instruction writes and invalidates the cache line associated with a specified linear address. The invalidation is for all levels of the processor’s cache hierarchy, and it is broadcast throughout the cache coherency domain.

NOTE
CLFLUSH was introduced with the SSE2 extensions.
However, the instruction can be implemented in IA-32 processors that do not implement the SSE2 extensions. Detect CLFLUSH support using its feature bit: the instruction is available if CPUID.01H:EDX.CLFSH[bit 19] = 1.

11.4.4.2 Cacheability Control Instructions

The following four instructions enable data from XMM and general-purpose registers to be stored to memory using a non-temporal hint. The non-temporal hint directs the processor to store data to memory without writing the data into the cache hierarchy whenever this is possible.
See Section 10.4.6.2, “Caching of Temporal vs. Non-Temporal Data,” for more information about non-temporal stores and hints.

The MOVNTDQ (store double quadword using non-temporal hint) instruction stores packed integer data from an XMM register to memory, using a non-temporal hint.

The MOVNTPD (store packed double-precision floating-point values using non-temporal hint) instruction stores packed double-precision floating-point data from an XMM register to memory, using a non-temporal hint.

The MOVNTI (store doubleword using non-temporal hint) instruction stores integer data from a general-purpose register to memory, using a non-temporal hint.

The MASKMOVDQU (store selected bytes of double quadword) instruction stores selected byte integers from an XMM register to memory, using a byte mask to selectively write the individual bytes. The memory location does not need to be aligned on a natural boundary.
This instruction also uses a non-temporal hint.

11.4.4.3 Memory Ordering Instructions

SSE2 extensions introduce two new fence instructions (LFENCE and MFENCE) as companions to the SFENCE instruction introduced with SSE extensions.

The LFENCE instruction establishes a memory fence for loads. It guarantees ordering between two loads and prevents speculative loads from passing the load fence (that is, no speculative loads are allowed until all loads specified before the load fence have been carried out).

The MFENCE instruction combines the functions of LFENCE and SFENCE by establishing a memory fence for both loads and stores. It guarantees that all loads and stores specified before the fence are globally observable prior to any loads or stores being carried out after the fence.
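Returning to the byte-masked store of MASKMOVDQU from Section 11.4.4.2, its selective write can be sketched in Python as follows. This is a software model under stated assumptions: the most significant bit of each mask byte is taken as the select bit (as in the instruction's reference description), memory is a `bytearray`, and the function name `maskmovdqu` is hypothetical. The model does not capture the non-temporal hint, which affects caching rather than the stored values.

```python
def maskmovdqu(memory, addr, src, mask):
    """Software model of a byte-masked 16-byte store.

    memory : bytearray standing in for memory
    addr   : byte offset of the destination (no alignment required)
    src    : 16 source bytes
    mask   : 16 mask bytes; a byte is stored only when the most
             significant bit of its mask byte is set (assumed encoding)
    """
    if len(src) != 16 or len(mask) != 16:
        raise ValueError("src and mask must each hold 16 bytes")
    for i in range(16):
        if mask[i] & 0x80:
            memory[addr + i] = src[i]
        # Unselected bytes leave the corresponding memory byte untouched.
```

A mask that alternates `0x80` and `0x7F` therefore writes every second source byte and leaves the bytes in between unchanged.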
11.4.4.4 Pause

The PAUSE instruction is provided to improve the performance of “spin-wait loops” executed on a Pentium 4 or Intel Xeon processor. On a Pentium 4 processor, it also provides the added benefit of reducing processor power consumption while executing a spin-wait loop.
It is recommended that a PAUSE instruction always be included in the code sequence for a spin-wait loop.

11.4.5 Branch Hints

SSE2 extensions designate two instruction prefixes (2EH and 3EH) to provide branch hints to the processor (see “Instruction Prefixes” in Chapter 2 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A).
These prefixes can only be used with the Jcc instruction and only at the machine code level (that is, there are no mnemonics for the branch hints).

11.5 SSE, SSE2, AND SSE3 EXCEPTIONS

SSE/SSE2/SSE3 extensions generate two general types of exceptions:

• Non-numeric exceptions
• SIMD floating-point exceptions¹

SSE/SSE2/SSE3 instructions can generate the same type of memory-access and non-numeric exceptions as other IA-32 architecture instructions. Existing exception handlers can generally handle these exceptions without any code modification.
See “Providing Non-Numeric Exception Handlers for Exceptions Generated by the SSE, SSE2 and SSE3 Instructions” in Chapter 12 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, for a list of the non-numeric exceptions that can be generated by SSE/SSE2/SSE3 instructions and for guidelines for handling these exceptions.

SSE/SSE2/SSE3 instructions do not generate numeric exceptions on packed integer operations; however, they can generate numeric (SIMD floating-point) exceptions on packed single-precision and double-precision floating-point operations.
These SIMD floating-point exceptions are defined in the IEEE Standard 754 for Binary Floating-Point Arithmetic and are the same exceptions that are generated for x87 FPU instructions. See Section 11.5.1, “SIMD Floating-Point Exceptions,” for a description of these exceptions.

1. The FISTTP instruction in SSE3 does not generate SIMD floating-point exceptions, but it can generate x87 FPU floating-point exceptions.

11.5.1 SIMD Floating-Point Exceptions

SIMD floating-point exceptions are those exceptions that can be generated by SSE/SSE2/SSE3 instructions that operate on packed or scalar floating-point operands.

Six classes of SIMD floating-point exceptions can be generated:

• Invalid operation (#I)
• Divide-by-zero (#Z)
• Denormal operand (#D)
• Numeric overflow (#O)
• Numeric underflow (#U)
• Inexact result (Precision) (#P)

All of these exceptions (except the denormal operand exception) are defined in IEEE Standard 754, and they are the same exceptions that are generated with the x87 floating-point instructions.
Section 4.9, “Overview of Floating-Point Exceptions,” gives a detailed description of these exceptions and of how and when they are generated. The following sections discuss the implementation of these exceptions in SSE/SSE2/SSE3 extensions.

All SIMD floating-point exceptions are precise and occur as soon as the instruction completes execution.

Each of the six exception conditions has a corresponding flag (IE, DE, ZE, OE, UE, and PE) and mask bit (IM, DM, ZM, OM, UM, and PM) in the MXCSR register (see Figure 10-3). The mask bits can be set with the LDMXCSR or FXRSTOR instruction; the mask and flag bits can be read with the STMXCSR or FXSAVE instruction.

The OSXMMEXCEPT flag (bit 10) of control register CR4 provides additional control over generation of SIMD floating-point exceptions by allowing the operating system to indicate whether or not it supports software exception handlers for SIMD floating-point exceptions.
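The pairing of flag and mask bits in MXCSR can be sketched with a small Python decoder that operates on a value such as one saved by STMXCSR. The bit positions used here (flags in bits 0 to 5, masks in bits 7 to 12, in the order listed above) follow the register layout of Figure 10-3, which is not reproduced in this section, so treat them as an assumption; the function names are hypothetical.

```python
FLAG_NAMES = ("IE", "DE", "ZE", "OE", "UE", "PE")  # assumed bits 0-5
MASK_NAMES = ("IM", "DM", "ZM", "OM", "UM", "PM")  # assumed bits 7-12
MASK_BASE = 7  # assumed position of the first mask bit (IM)


def decode_mxcsr(value):
    """Split an MXCSR value into two dicts: exception flags and mask bits."""
    flags = {name: bool((value >> bit) & 1)
             for bit, name in enumerate(FLAG_NAMES)}
    masks = {name: bool((value >> (MASK_BASE + bit)) & 1)
             for bit, name in enumerate(MASK_NAMES)}
    return flags, masks


def flags_set_and_unmasked(value):
    """Return the names of flags that are set while their mask bit is clear."""
    flags, masks = decode_mxcsr(value)
    return [f for f, m in zip(FLAG_NAMES, MASK_NAMES)
            if flags[f] and not masks[m]]
```

With this layout, the value `0x1F80` (the documented reset value of MXCSR) decodes to all six exceptions masked and no flags set.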