AMD64 Technology, Volume 1: Application Programming (24592, Rev. 3.13, July 2007)
If the operand specifies an invalid memory address, no exception occurs and the instruction has no effect. Attempts to prefetch data from non-cacheable memory, such as video frame buffers, or data from write-combining memory, are also ignored. The exact actions performed by the PREFETCHlevel instructions depend on the processor implementation. Current AMD processor families map all PREFETCHlevel instructions to a PREFETCH. Refer to the Optimization Guide for AMD Athlon™ 64 and AMD Opteron™ Processors, order# 25112, for details relating to a particular processor family, brand, or model.

• PREFETCHT0—Prefetches temporal data into the entire cache hierarchy.
• PREFETCHT1—Prefetches temporal data into the second-level (L2) and higher-level caches, but not into the L1 cache.
• PREFETCHT2—Prefetches temporal data into the third-level (L3) and higher-level caches, but not into the L1 or L2 cache.
• PREFETCHNTA—Prefetches non-temporal data into the processor, minimizing cache pollution.
The specific technique for minimizing cache pollution is implementation-dependent and can include such techniques as allocating space in a software-invisible buffer, allocating a cache line in a single cache or a specific way of a cache, etc.
• PREFETCH—(a 3DNow! instruction) Prefetches read data into the L1 data cache. Data can be written to such a cache line, but doing so can result in additional delay because the processor must signal externally to negotiate the right to change the cache line's cache-coherency state for the purpose of writing to it.
• PREFETCHW—(a 3DNow! instruction) Prefetches write data into the L1 data cache.
Data can be written to the cache line without additional delay, because the data is already prefetched in the modified cache-coherency state. Data can also be read from the cache line without additional delay. However, prefetching write data takes longer than prefetching read data if the processor must wait for another caching master to first write back its modified copy of the requested data to memory before the prefetch request is satisfied.

The PREFETCHW instruction provides a hint to the processor that the cache line is to be modified, and is intended for use when the cache line will be written to shortly after the prefetch is performed. The processor can place the cache line in the modified state when it is prefetched, but before it is actually written.
Doing so can save time compared to a PREFETCH instruction followed by a subsequent cache-state change due to a write.

To prevent a false store dependency from stalling a prefetch instruction, prefetched data should be located at least one cache line away from the address of any surrounding data write. For example, if the cache-line size is 32 bytes, avoid prefetching from data addresses within 32 bytes of the data address in a preceding write instruction.

Non-Temporal Stores. Non-temporal store instructions are provided to prevent memory writes from being stored in the cache, thereby reducing cache pollution. These non-temporal store instructions are specific to the type of register they write:
• GPR Non-Temporal Stores—MOVNTI.
• XMM Non-Temporal Stores—MASKMOVDQU, MOVNTDQ, MOVNTPD, and MOVNTPS.
• MMX Non-Temporal Stores—MASKMOVQ and MOVNTQ.

Removing Stale Cache Lines.
When cache data becomes stale, it occupies space in the cache that could be used to store frequently accessed data. Applications can use the CLFLUSH instruction to free a stale cache line for use by other data. CLFLUSH writes the contents of a cache line to memory and then invalidates the line in the cache and in all other caches in the cache hierarchy that contain the line. Once invalidated, the line is available for use by the processor and can be filled with other data.

3.10 Performance Considerations

In addition to typical code optimization techniques, such as those affecting loops and the inlining of function calls, the following considerations may help improve the performance of application programs written with general-purpose instructions.

These are implementation-independent performance considerations.
Other considerations depend on the hardware implementation. For information about such implementation-dependent considerations, and for more information about application performance in general, see the data sheets and the software-optimization guides relating to particular hardware implementations.

3.10.1 Use Large Operand Sizes

Loading, storing, and moving data with the largest relevant operand size maximizes the memory bandwidth of these instructions.

3.10.2 Use Short Instructions

Use the shortest possible form of an instruction (the form with the fewest opcode bytes).
This increases the number of instructions that can be decoded at any one time, and it reduces overall code size.

3.10.3 Align Data

Data alignment directly affects memory-access performance. Data alignment is particularly important when accessing streaming (also called non-temporal) data—data that will not be reused and therefore should not be cached. Data alignment is also important in cases where data written by one instruction is read by another instruction soon after the write.

3.10.4 Avoid Branches

Branching can be very time-consuming. If the body of a branch is small, the branch may be replaceable with conditional move (CMOVcc) instructions, or with 128-bit or 64-bit media instructions that simulate predicated parallel execution or parallel conditional moves.

3.10.5 Prefetch Data

Memory latency can be substantially reduced—especially for data that will be used multiple times—by prefetching such data into various levels of the cache hierarchy. Software can use the PREFETCHx instructions very effectively in such cases.
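As a sketch of the technique, the loop below prefetches ahead of a sequential sum using the GCC/Clang `__builtin_prefetch` builtin, which compiles to a PREFETCHx instruction on AMD64 processors. The 64-byte cache-line size and the eight-line prefetch distance are assumptions chosen for illustration, not recommendations from this manual; appropriate values depend on the processor implementation.

```c
/* Sketch only: one prefetch per cache line while summing a large array.
   Assumes a 64-byte cache line (16 ints) and a prefetch distance of
   8 lines (128 ints); both values are illustrative. */
#include <stddef.h>

long sum_with_prefetch(const int *data, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        /* Issue one prefetch per cache line, only at line boundaries,
           and only while the target is still inside the array. */
        if (i % 16 == 0 && i + 128 < n)
            __builtin_prefetch(&data[i + 128],
                               0,   /* 0 = prefetch for reading */
                               3);  /* 3 = high temporal locality */
        total += data[i];
    }
    return total;
}
```

On AMD64, a locality argument of 3 typically maps to PREFETCHT0 and 0 to PREFETCHNTA, and passing 1 as the second argument requests a write prefetch, paralleling the PREFETCHW discussion earlier in this chapter.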
One PREFETCHx should be used per cache line.

Some of the best places to use prefetch instructions are inside loops that process large amounts of data. If the loop goes through less than one cache line of data per iteration, partially unroll the loop. Try to use virtually all of the prefetched data. This usually requires unit-stride memory accesses—those in which all accesses are to contiguous memory locations.

For data that will be used only once in a procedure, consider using non-temporal accesses. Such accesses are not burdened by the overhead of cache protocols.

3.10.6 Keep Common Operands in Registers

Keep frequently used values in registers rather than in memory.
This avoids the comparatively long latencies for accessing memory.

3.10.7 Avoid True Dependencies

Spread out true dependencies (write-read or flow dependencies) to increase the opportunities for parallel execution. This spreading out is not necessary for anti-dependencies and output dependencies.

3.10.8 Avoid Store-to-Load Dependencies

Store-to-load dependencies occur when data is stored to memory, only to be read back shortly thereafter. Hardware implementations of the architecture may contain means of accelerating such store-to-load dependencies, allowing the load to obtain the store data before it has been written to memory.
However, this acceleration might be available only when the addresses and operand sizes of the store and the dependent load are matched, and when both memory accesses are aligned. Performance is typically optimized by avoiding such dependencies altogether and keeping the data, including temporary variables, in registers.

3.10.9 Optimize Stack Allocation

When allocating space on the stack for local variables and/or outgoing parameters within a procedure, adjust the stack pointer and use moves rather than pushes. This method of allocation allows random access to the outgoing parameters, so that they can be set up when they are calculated, instead of being held in a register or memory until the procedure call.
This method also reduces stack-pointer dependencies.

3.10.10 Consider Repeat-Prefix Setup Time

The repeat instruction prefixes have a setup overhead. If the repeated count is variable, the overhead can sometimes be avoided by substituting a simple loop to move or store the data. Repeated string instructions can be expanded into equivalent sequences of inline loads and stores. For details, see "Repeat Prefixes" in Volume 3.

3.10.11 Replace GPR with Media Instructions

Some integer-based programs can be made to run faster by using 128-bit media or 64-bit media instructions.
These instructions have their own register sets and therefore relieve register pressure on the GPR registers. For loads, stores, adds, shifts, and similar operations, media instructions may be good substitutes for general-purpose integer instructions: GPR registers are freed up, and the media instructions increase opportunities for parallel operations.

3.10.12 Organize Data in Memory Blocks

Organize frequently accessed constants and coefficients into cache-line-size blocks and prefetch them. Procedures that access data arranged in memory-bus-sized blocks, or memory-burst-sized blocks, can make optimum use of the available memory bandwidth.

3.11 Cross-Modifying Code

Software that writes into a code segment running simultaneously on another processor, with the intent that the other processor execute the written data as code, is classified as cross-modifying code. To avoid cache-coherency issues when using cross-modifying code, the processor doing the store should provide synchronization between the processors using a semaphore.
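The semaphore handshake described above can be sketched with C11 atomics. This is an illustration of the synchronization pattern only: an ordinary data buffer stands in for the written code bytes, executing the written bytes is out of scope, and all names (`publish_code`, `code_ready`, and so on) are hypothetical.

```c
/* Sketch of the cross-modifying-code handshake, using a C11 atomic flag
   as the semaphore. A plain data buffer stands in for the code region. */
#include <stdatomic.h>
#include <string.h>

static unsigned char code_buf[16];   /* stands in for the code segment */
static atomic_int code_ready;        /* the semaphore */

/* Run on the writing processor: store the new bytes, then release the
   flag so the stores become visible before the flag is observed set. */
void publish_code(void)
{
    memset(code_buf, 0x90, sizeof code_buf);  /* 0x90 = x86 NOP, e.g. */
    atomic_store_explicit(&code_ready, 1, memory_order_release);
}

/* Run on the executing processor: wait on the flag with acquire
   ordering before touching the freshly written region. */
int code_is_published(void)
{
    while (!atomic_load_explicit(&code_ready, memory_order_acquire))
        ;  /* spin until the writer releases the semaphore */
    return code_buf[0] == 0x90;
}
```

The release/acquire pair guarantees that the executing processor never observes the flag set without also observing the completed writes to the buffer, which is the property the semaphore in the text must provide.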