Volume 1 Basic Architecture (794100), page 65
Fencing ensures that all system agents have global visibility of the stored data; for instance, failure to fence may result in a written cache line staying within a processor and not being visible to other agents.

For processors that implement non-temporal stores by updating data in-place that already resides in the cache hierarchy, the destination region should also be mapped as WC. If mapped as WB or WT, there is the potential for speculative processor reads to bring the data into the caches; in this case, non-temporal stores would then update in place, and data would not be flushed from the processor by a subsequent fencing operation.

The memory type visible on the bus in the presence of memory type aliasing is implementation specific. As one possible example, the memory type written to the bus may reflect the memory type for the first store to this line, as seen in program order; other alternatives are possible.
This behavior should be considered reserved, and dependence on the behavior of any particular implementation risks future incompatibility.

10.4.6.3 PREFETCHh Instructions

The PREFETCHh instructions permit programs to load data into the processor at a suggested cache level, so that the data is closer to the processor’s load and store unit when it is needed. These instructions fetch 32 aligned bytes (or more, depending on the implementation) containing the addressed byte to a location in the cache hierarchy specified by the temporal locality hint (see Table 10-1). In this table, the first-level cache is closest to the processor and second-level cache is farther away from the processor than the first-level cache. The hints specify a prefetch of either temporal or non-temporal data (see Section 10.4.6.2, “Caching of Temporal vs. Non-Temporal Data”). Subsequent accesses to temporal data are treated like normal accesses, while those to non-temporal data will continue to minimize cache pollution.

If the data is already present at a level of the cache hierarchy that is closer to the processor, the PREFETCHh instruction will not result in any data movement. The PREFETCHh instructions do not affect functional behavior of the program.

PROGRAMMING WITH STREAMING SIMD EXTENSIONS (SSE)

See Section 11.6.13, “Cacheability Hint Instructions,” for additional information about the PREFETCHh instructions.

Table 10-1. PREFETCHh Instructions Caching Hints

PREFETCHh Instruction Mnemonic   Actions
PREFETCHT0                       Temporal data: fetch data into all levels of cache hierarchy:
                                 • Pentium III processor: 1st-level cache or 2nd-level cache
                                 • Pentium 4 and Intel Xeon processor: 2nd-level cache
PREFETCHT1                       Temporal data: fetch data into level 2 cache and higher
                                 • Pentium III processor: 2nd-level cache
                                 • Pentium 4 and Intel Xeon processor: 2nd-level cache
PREFETCHT2                       Temporal data: fetch data into level 2 cache and higher
                                 • Pentium III processor: 2nd-level cache
                                 • Pentium 4 and Intel Xeon processor: 2nd-level cache
PREFETCHNTA                      Non-temporal data: fetch data into location close to the processor, minimizing cache pollution
                                 • Pentium III processor: 1st-level cache
                                 • Pentium 4 and Intel Xeon processor: 2nd-level cache

10.4.6.4 SFENCE Instruction

The SFENCE (Store Fence) instruction controls write ordering by creating a fence for memory store operations.
This instruction guarantees that the result of every store instruction that precedes the store fence in program order is globally visible before any store instruction that follows the fence. The SFENCE instruction provides an efficient way of ensuring ordering between procedures that produce weakly-ordered data and procedures that consume that data.

10.5 FXSAVE AND FXRSTOR INSTRUCTIONS

The FXSAVE and FXRSTOR instructions were introduced into the IA-32 architecture in the Pentium II processor family (prior to the introduction of the SSE extensions). The original versions of these instructions performed a fast save and restore, respectively, of the x87 FPU register state.
(By saving the state of the x87 FPU data registers, the FXSAVE and FXRSTOR instructions implicitly save and restore the state of the MMX registers.)

The SSE extensions expanded the scope of these instructions to save and restore the states of the XMM registers and the MXCSR register, along with the x87 FPU and MMX state.

The FXSAVE and FXRSTOR instructions can be used in place of the FSAVE/FNSAVE and FRSTOR instructions; however, the operation of the FXSAVE and FXRSTOR instructions is not identical to the operation of FSAVE/FNSAVE and FRSTOR.

NOTE
The FXSAVE and FXRSTOR instructions are not considered part of the SSE instruction group. They have a separate CPUID feature bit to indicate whether they are present (if CPUID.01H:EDX.FXSR[bit 24] = 1). The CPUID feature bit for SSE extensions does not indicate the presence of FXSAVE and FXRSTOR.

10.6 HANDLING SSE INSTRUCTION EXCEPTIONS

See Section 11.5, “SSE, SSE2, and SSE3 Exceptions,” for a detailed discussion of the general and SIMD floating-point exceptions that can be generated with the SSE instructions and for guidelines for handling these exceptions when they occur.

10.7 WRITING APPLICATIONS WITH THE SSE EXTENSIONS

See Section 11.6, “Writing Applications with SSE/SSE2 Extensions,” for additional information about writing applications and operating-system code using the SSE extensions.
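As an illustration of the cacheability mechanisms described in Sections 10.4.6.3 and 10.4.6.4, the following is a minimal sketch in C using the compiler intrinsics that map to PREFETCHNTA, MOVNTPS, and SFENCE. The function name stream_copy_floats, the prefetch distance of 16 floats, and the alignment and length requirements are assumptions made for this example; they are not taken from the manual.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics: _mm_prefetch, _mm_stream_ps, _mm_sfence */

/* Hypothetical helper: copy n floats from src to dst using non-temporal stores.
 * Assumptions for this sketch: n is a multiple of 4 and dst is 16-byte aligned. */
static inline void stream_copy_floats(float *dst, const float *src, size_t n)
{
    assert(n % 4 == 0);
    assert(((uintptr_t)dst & 15u) == 0);

    for (size_t i = 0; i < n; i += 4) {
        /* Hint only: request source data a little ahead as non-temporal.
         * The distance (16 floats) is an arbitrary value chosen for illustration. */
        if (i + 16 < n)
            _mm_prefetch((const char *)(src + i + 16), _MM_HINT_NTA);

        __m128 v = _mm_loadu_ps(src + i);  /* unaligned load of 4 floats */
        _mm_stream_ps(dst + i, v);         /* non-temporal (weakly-ordered) store */
    }

    /* Store fence: the streaming stores above become globally visible
     * before any store that follows this point in program order. */
    _mm_sfence();
}
```

A producer routine would typically call such a helper and only afterwards write a "data ready" flag, so that a consumer that observes the flag also observes the streamed data.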
CHAPTER 11
PROGRAMMING WITH STREAMING SIMD EXTENSIONS 2 (SSE2)

The streaming SIMD extensions 2 (SSE2) were introduced into the IA-32 architecture in the Pentium 4 and Intel Xeon processors. These extensions enhance the performance of IA-32 processors for advanced 3-D graphics, video decoding/encoding, speech recognition, E-commerce, Internet, scientific, and engineering applications.

This chapter describes the SSE2 extensions and provides information to assist in writing application programs that use these and the SSE extensions.

11.1 OVERVIEW OF SSE2 EXTENSIONS

SSE2 extensions use the single instruction multiple data (SIMD) execution model that is used with MMX technology and SSE extensions.
They extend this model with support for packed double-precision floating-point values and for 128-bit packed integers.

If CPUID.01H:EDX.SSE2[bit 26] = 1, SSE2 extensions are present.

SSE2 extensions add the following features to the IA-32 architecture, while maintaining backward compatibility with all existing IA-32 processors, applications and operating systems.
• Six data types:
  - 128-bit packed double-precision floating-point (two IEEE Standard 754 double-precision floating-point values packed into a double quadword)
  - 128-bit packed byte integers
  - 128-bit packed word integers
  - 128-bit packed doubleword integers
  - 128-bit packed quadword integers
• Instructions to support the additional data types and extend existing SIMD integer operations:
  - Packed and scalar double-precision floating-point instructions
  - Additional 64-bit and 128-bit SIMD integer instructions
  - 128-bit versions of SIMD integer instructions introduced with the MMX technology and the SSE extensions
  - Additional cacheability-control and instruction-ordering instructions
• Modifications to existing IA-32 instructions to support SSE2 features:
  - Extensions and modifications to the CPUID instruction
  - Modifications to the RDPMC instruction
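The CPUID conditions quoted in the text (CPUID.01H:EDX.FXSR[bit 24] in Section 10.5 and CPUID.01H:EDX.SSE2[bit 26] above) each reduce to testing one bit of the EDX value returned by CPUID leaf 01H. The following C sketch shows one way this might be written. The names edx_has_feature and read_cpuid_01h_edx are hypothetical, and the wrapper assumes a GCC- or Clang-compatible compiler that provides <cpuid.h> on an x86 target.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions within CPUID.01H:EDX, as given in the text. */
enum {
    CPUID_EDX_FXSR = 24,  /* FXSAVE and FXRSTOR present */
    CPUID_EDX_SSE2 = 26   /* SSE2 extensions present */
};

/* Hypothetical helper: test a single feature bit in an EDX value. */
static inline bool edx_has_feature(uint32_t edx, unsigned bit)
{
    assert(bit < 32);
    return ((edx >> bit) & 1u) != 0;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>  /* GCC/Clang helper header (assumed available) */

/* Hypothetical wrapper: return EDX from CPUID leaf 01H, or 0 if the
 * leaf is not supported. */
static inline uint32_t read_cpuid_01h_edx(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return edx;
}
#endif
```

Because the two bits are independent, software that needs both FXSAVE/FXRSTOR and SSE2 would check each one separately, for example edx_has_feature(edx, CPUID_EDX_FXSR) && edx_has_feature(edx, CPUID_EDX_SSE2).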
These new features extend the IA-32 architecture’s SIMD programming model in three important ways:
• They provide the ability to perform SIMD operations on pairs of packed double-precision floating-point values. This permits higher precision computations to be carried out in XMM registers, which enhances processor performance in scientific and engineering applications and in applications that use advanced 3-D geometry techniques (such as ray tracing).
Additional flexibility is provided with instructions that operate on single (scalar) double-precision floating-point values located in the low quadword of an XMM register.
• They provide the ability to operate on 128-bit packed integers (bytes, words, doublewords, and quadwords) in XMM registers. This provides greater flexibility and greater throughput when performing SIMD operations on packed integers. The capability is particularly useful for applications such as RSA authentication and RC5 encryption. Using the full set of SIMD registers, data types, and instructions provided with the MMX technology and SSE/SSE2 extensions, programmers can develop algorithms that finely mix packed single- and double-precision floating-point data and 64- and 128-bit packed integer data.
• SSE2 extensions enhance the support introduced with SSE extensions for controlling the cacheability of SIMD data.
SSE2 cache control instructions provide the ability to stream data in and out of the XMM registers without polluting the caches and the ability to prefetch data before it is actually used.

SSE2 extensions are fully compatible with all software written for IA-32 processors. All existing software continues to run correctly, without modification, on processors that incorporate SSE2 extensions, as well as in the presence of applications that incorporate these extensions. Enhancements to the CPUID instruction permit detection of the SSE2 extensions. Also, because the SSE2 extensions use the same registers as the SSE extensions, no new operating-system support is required for saving and restoring program state during a context switch beyond that provided for the SSE extensions.

SSE2 extensions are accessible from all IA-32 execution modes: protected mode, real address mode, and virtual 8086 mode.

The following sections in this chapter describe the programming environment for SSE2 extensions including: the 128-bit XMM floating-point register set, data types, and SSE2 instructions.
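To make the two new data-type families concrete, the following C sketch uses SSE2 compiler intrinsics to add pairs of packed double-precision values and groups of four packed doubleword integers in XMM registers. The function names and the length restrictions are assumptions made for this example, and the unaligned load/store forms are used only to keep the sketch free of alignment requirements.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: out[i] = a[i] + b[i], two doubles per operation.
 * Assumes n is a multiple of 2. */
static inline void add_packed_doubles(double *out, const double *a,
                                      const double *b, size_t n)
{
    assert(n % 2 == 0);
    for (size_t i = 0; i < n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);  /* two packed doubles */
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_storeu_pd(out + i, _mm_add_pd(va, vb));
    }
}

/* Hypothetical helper: out[i] = a[i] + b[i], four 32-bit integers per
 * operation. Assumes n is a multiple of 4. */
static inline void add_packed_dwords(int32_t *out, const int32_t *a,
                                     const int32_t *b, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(out + i), _mm_add_epi32(va, vb));
    }
}
```

Both helpers operate on the same 128-bit register width; only the interpretation of the register contents (two quadword-sized doubles versus four doublewords) differs.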