Volume 1 Application Programming (794095), page 60
24592, Rev. 3.13, July 2007, AMD64 Technology

Table 5-6. Mapping Between Internal and Software-Visible Tag Bits

Architectural State                                         Internal State (1)
State                                    Binary Value
Valid                                    00                 Full (0)
Zero                                     01                 Full (0)
Special (NaN, infinity, denormal) (2)    10                 Full (0)
Empty                                    11                 Empty (1)

Note:
1. For a more detailed description of this mapping, see “Deriving FSAVE Tag Field from FXSAVE Tag Field” in Volume 2.
2. The 64-bit media floating-point (3DNow!™) instructions do not support NaNs, infinities, and denormals.

When the processor executes an FSAVE or FNSAVE (but not FXSAVE) instruction, it changes the internal 1-bit tag state to its 2-bit architectural tag by reading the data in all 80 bits of the physical data registers and using the mapping in Table 5-6.
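The mapping in Table 5-6 can be sketched in portable C. This is a minimal model, not the processor's actual logic: the names `x87_tag` and `derive_tag` are hypothetical, and the treatment of a value with a nonzero, non-maximum exponent and a clear integer bit (an unnormal) as Special is an assumption, since the table here lists only NaN, infinity, and denormal.

```c
#include <stdint.h>

/* Architectural (software-visible) 2-bit tag values from Table 5-6. */
enum x87_tag {
    TAG_VALID   = 0x0,  /* binary 00 */
    TAG_ZERO    = 0x1,  /* binary 01 */
    TAG_SPECIAL = 0x2,  /* binary 10: NaN, infinity, denormal */
    TAG_EMPTY   = 0x3   /* binary 11 */
};

/* Hypothetical helper: derive the 2-bit tag for one 80-bit register.
 * internal_empty: internal 1-bit tag state (0 = full, 1 = empty).
 * sign_exp:       high 16 bits of the value (sign bit + 15-bit exponent).
 * significand:    low 64 bits (explicit integer bit is bit 63). */
enum x87_tag derive_tag(unsigned internal_empty,
                        uint16_t sign_exp, uint64_t significand)
{
    unsigned exponent = sign_exp & 0x7FFFu;
    int integer_bit = (int)(significand >> 63);

    if (internal_empty)
        return TAG_EMPTY;
    if (exponent == 0x7FFFu)
        return TAG_SPECIAL;            /* NaN or infinity */
    if (exponent == 0)                 /* zero or denormal */
        return significand == 0 ? TAG_ZERO : TAG_SPECIAL;
    /* Assumption: a clear integer bit with a normal exponent is Special. */
    return integer_bit ? TAG_VALID : TAG_SPECIAL;
}
```

For instance, under this model the value 1.0 (`sign_exp` 0x3FFF, `significand` 0x8000000000000000) in a full register yields `TAG_VALID`, and any register whose internal tag is empty yields `TAG_EMPTY` regardless of its contents.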
For example, if the value in the high 16 bits of the 80-bit physical register indicates a NaN, the two tag bits for that register are changed to a binary value of 10 before the x87 status word is written to memory.

The tag bits have no effect on the execution of 64-bit media instructions or their interpretation of the contents of the MMX registers. However, the converse is not true: execution of 64-bit media instructions that write to an MMX register alters the tag bits and thus may affect execution of subsequent x87 floating-point instructions.

For a more detailed description of the mapping shown in Table 5-6, see “Deriving FSAVE Tag Field from FXSAVE Tag Field” in Volume 2 and its accompanying text.

5.13 Mixing Media Code with x87 Code

5.13.1 Mixing Code

Software may freely mix 64-bit media instructions (integer or floating-point) with 128-bit media instructions (integer or floating-point) and general-purpose instructions in a single procedure. However, before transitioning from a 64-bit media procedure (or a 128-bit media procedure that accesses an MMX™ register) to an x87 procedure, or to software that may eventually branch to an x87 procedure, software should clear the MMX state, as described immediately below.

5.13.2 Clearing MMX™ State

Software should separate 64-bit media procedures, 128-bit media procedures, or dynamic link libraries (DLLs) that access MMX registers from x87 floating-point procedures or DLLs by clearing the MMX state with the EMMS or FEMMS instruction before leaving a 64-bit media procedure, as described in “Exit Media State” on page 209.

The 64-bit media instructions and x87 floating-point instructions interpret the contents of their aliased MMX and x87 registers differently. Because of this, software should not exchange register data between 64-bit media and x87 floating-point procedures, or use conditional branches at the end of loops that might jump to code of the other type. Software must not rely on the contents of the aliased MMX and x87 registers across such code-type transitions. If a transition to an x87 procedure occurs from a 64-bit media procedure that does not clear the MMX state, the x87 stack may overflow.

5.14 State-Saving

5.14.1 Saving and Restoring State

In general, system software should save and restore MMX™ and x87 state between task switches or other interventions in the execution of 64-bit media procedures.
Virtually all modern operating systems running on x86 processors (including such systems as Windows NT™, UNIX, and OS/2) are preemptive multitasking operating systems that handle such saving and restoring of state properly across task switches, independently of hardware task-switch support.

No changes are needed to the x87 register-saving performed by 32-bit operating systems, exception handlers, or device drivers. The same support provided in a 32-bit operating system’s device-not-available (#NM) exception handler by any of the x87-register save/restore instructions described below also supports saving and restoring the MMX registers.

However, application procedures are also free to save and restore MMX and x87 state at any time they deem useful.

5.14.2 State-Saving Instructions

Software running at any privilege level may save 64-bit media and x87 state by executing the FSAVE, FNSAVE, or FXSAVE instruction, and restore it with the corresponding FRSTOR or FXRSTOR instruction.
Alternatively, software may use move instructions for saving only the contents of the MMX registers, rather than the complete 64-bit media and x87 state. For example, when saving MMX register values, use eight MOVQ instructions.

FSAVE/FNSAVE and FRSTOR Instructions. The FSAVE, FNSAVE, and FRSTOR instructions are described in “Save and Restore 64-Bit Media and x87 State” on page 223. After saving state with FSAVE or FNSAVE, the tag bits for all MMX and x87 registers are changed to empty, making the registers available for a new procedure. Thus, FSAVE and FNSAVE also perform the state-clearing function of EMMS or FEMMS.

FXSAVE and FXRSTOR Instructions. The FXSAVE and FXRSTOR instructions are described in “Save and Restore 128-Bit, 64-Bit, and x87 State” on page 223. The FXSAVE and FXRSTOR instructions execute faster than FSAVE/FNSAVE and FRSTOR because they do not save and restore the x87 error pointers (described in “Pointers and Opcode State” on page 247) except in the relatively rare cases in which the exception-summary (ES) bit in the x87 status word (register image for FXSAVE, memory image for FXRSTOR) is set to 1, indicating that an unmasked x87 exception has occurred.

Unlike FSAVE and FNSAVE, however, FXSAVE does not alter the tag bits (thus, it does not perform the state-clearing function of EMMS or FEMMS). The state of the saved MMX and x87 registers is retained, thus indicating that the registers may still be valid (or whatever other value the tag bits indicated prior to the save). To invalidate the contents of the MMX and x87 registers after FXSAVE, software must explicitly execute an FINIT instruction.
Also, FXSAVE (like FNSAVE) and FXRSTOR do not check for pending unmasked x87 floating-point exceptions. An FWAIT instruction can be used for this purpose.

For details about the FXSAVE and FXRSTOR memory formats, see “Media and x87 Processor State” in Volume 2.

5.15 Performance Considerations

In addition to typical code optimization techniques, such as those affecting loops and the inlining of function calls, the following considerations may help improve the performance of application programs written with 64-bit media instructions.

These are implementation-independent performance considerations. Other considerations depend on the hardware implementation.
For information about such implementation-dependent considerations and for more information about application performance in general, see the data sheets and the software-optimization guides relating to particular hardware implementations.

5.15.1 Use Small Operand Sizes

The performance advantage available with 64-bit media operations is to some extent a function of the data sizes operated upon. The smaller the data size, the more data elements can be packed into a single 64-bit vector.
The parallelism of computation increases as the number of elements per vector increases.

5.15.2 Reorganize Data for Parallel Operations

Much of the performance benefit from the 64-bit media instructions comes from the parallelism inherent in vector operations. It can be advantageous to reorganize data before performing arithmetic operations so that its layout after reorganization maximizes the parallelism of the arithmetic operations.

The speed of memory access is particularly important for certain types of computation, such as graphics rendering, that depend on the regularity and locality of data-memory accesses.
For example, in matrix operations, performance is high when operating on the rows of the matrix, because row bytes are contiguous in memory, but lower when operating on the columns of the matrix, because column bytes are not contiguous in memory and accessing them can result in cache misses. To improve performance for operations on such columns, the matrix should first be transposed. Such transpositions can, for example, be done using a sequence of unpacking or shuffle instructions.

5.15.3 Remove Branches

Branches can be replaced with 64-bit media instructions that simulate predicated execution or conditional moves, as described in “Branch Removal” on page 198.
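The compare-and-mask pattern behind this kind of branch removal can be sketched in portable C, one element at a time. This is an illustrative emulation under stated assumptions, not the manual's code: the function names are hypothetical, and the comments map each step to the MMX instructions (PCMPGTW, PAND, PANDN, POR) that would perform it on four 16-bit elements at once.

```c
#include <stddef.h>
#include <stdint.h>

/* Emulates one lane of a packed signed compare (PCMPGTW-style):
 * returns all ones (0xFFFF) if a > b, otherwise all zeros. */
uint16_t cmpgt_mask16(int16_t a, int16_t b)
{
    return (uint16_t)-(a > b);
}

/* Branch-free select, as PAND/PANDN/POR would do per lane:
 * (a AND mask) OR (b AND NOT mask). */
int16_t select16(uint16_t mask, int16_t a, int16_t b)
{
    return (int16_t)(((uint16_t)a & mask) | ((uint16_t)b & (uint16_t)~mask));
}

/* Element-wise maximum with no data-dependent branch in the loop body. */
void packed_max16(const int16_t *a, const int16_t *b, int16_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint16_t m = cmpgt_mask16(a[i], b[i]);
        out[i] = select16(m, a[i], b[i]);
    }
}
```

The mask takes the place of the conditional jump: both candidate values are always computed, and the mask selects which one survives, so the result does not depend on branch prediction.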
Where possible, break long dependency chains into several shorter dependency chains that can be executed in parallel. This is especially important for floating-point instructions because of their longer latencies.

5.15.4 Align Data

Data alignment is particularly important for performance when data written by one instruction is read by a subsequent instruction soon after the write, or when accessing streaming (non-temporal) data, that is, data that will not be reused and therefore should not be cached. These cases may occur frequently in 64-bit media procedures.

Accesses to data stored at unaligned locations may benefit from on-the-fly software alignment or from repetition of data at different alignment boundaries, as required by different loops that process the data.

5.15.5 Organize Data for Cacheability

Pack small data structures into cache-line-size blocks.
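One way to pack a small structure into a cache-line-size block is sketched below in C11. The 64-byte line size is an assumption (the actual size depends on the hardware implementation and should be taken from its data sheet), and `sample_block` and `is_line_aligned` are hypothetical names used only for illustration.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64  /* assumed; check the target processor's data sheet */

/* Hypothetical record sized and aligned to exactly one cache line. */
struct sample_block {
    alignas(CACHE_LINE_SIZE) int16_t samples[24]; /* 48 bytes of packed data */
    uint32_t timestamp;                           /*  4 bytes */
    uint32_t flags;                               /*  4 bytes */
    uint8_t  pad[8];                              /*  8 bytes of explicit padding */
};

/* Fail the build if the layout drifts away from one line per block. */
_Static_assert(sizeof(struct sample_block) == CACHE_LINE_SIZE,
               "sample_block must fill exactly one cache line");
_Static_assert(alignof(struct sample_block) == CACHE_LINE_SIZE,
               "sample_block must start on a cache-line boundary");

/* Returns nonzero if p lies on a cache-line boundary. */
int is_line_aligned(const void *p)
{
    return ((uintptr_t)p % CACHE_LINE_SIZE) == 0;
}
```

Because the size equals the alignment, every element of an array of `struct sample_block` starts on its own line, so processing one block never touches a second line.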