Volume 1 Application Programming (794095), page 61
Organize frequently accessed constants and coefficients into cache-line-size blocks and prefetch them. Procedures that access data arranged in memory-bus-sized blocks, or memory-burst-sized blocks, can make optimum use of the available memory bandwidth.

For data that will be used only once in a procedure, consider using non-cacheable memory. Accesses to such memory are not burdened by the overhead of cache protocols.

5.15.6 Prefetch Data

Media applications typically operate on large data sets.
Because of this, they make intensive use of the memory bus. Memory latency can be substantially reduced, especially for data that will be used only once, by prefetching such data into various levels of the cache hierarchy. Software can use the PREFETCHx instructions very effectively in such cases, as described in “Cache and Memory Management” on page 66.

Some of the best places to use prefetch instructions are inside loops that process large amounts of data. If the loop goes through less than one cache line of data per iteration, partially unroll the loop to obtain multiple iterations of the loop within a cache line. Try to use virtually all of the prefetched data.
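
The loop guidance above can be sketched in C. This is only an illustration under stated assumptions: a 64-byte cache line, a prefetch distance of four lines, and the GCC/Clang `__builtin_prefetch` hint (which the compiler may or may not lower to a PREFETCHx instruction). The function name `sum_with_prefetch` and both constants are hypothetical and should be tuned by measurement on the target.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values for illustration only: check the real cache-line size
   of the target and tune the prefetch distance by measurement. */
#define CACHE_LINE_BYTES     64
#define FLOATS_PER_LINE      (CACHE_LINE_BYTES / sizeof(float))
#define PREFETCH_LINES_AHEAD 4

/* Sums n floats with unit-stride accesses. One float is much less than
   a cache line, so the work is grouped so that each outer iteration
   consumes one whole (assumed) line and issues one prefetch hint. */
float sum_with_prefetch(const float *data, size_t n)
{
    assert(data != NULL || n == 0);
    float total = 0.0f;
    size_t i = 0;

    for (; i + FLOATS_PER_LINE <= n; i += FLOATS_PER_LINE) {
        size_t ahead = i + PREFETCH_LINES_AHEAD * FLOATS_PER_LINE;
#if defined(__GNUC__) || defined(__clang__)
        /* Hint only: read access (0), no temporal locality (0),
           matching data that is used once. */
        if (ahead < n)
            __builtin_prefetch(&data[ahead], 0, 0);
#else
        (void)ahead;  /* no prefetch hint available on this compiler */
#endif
        /* Fixed trip count: every element of the line is used. */
        for (size_t j = 0; j < FLOATS_PER_LINE; ++j)
            total += data[i + j];
    }

    for (; i < n; ++i)  /* remainder shorter than one line */
        total += data[i];

    return total;
}
```

The prefetch is a hint and does not change the result, so the function returns the same sum with or without it. Whether it helps depends on the hardware prefetcher and the chosen distance.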
Using virtually all of the prefetched data usually requires unit-stride memory accesses, that is, accesses to contiguous memory locations.

5.15.7 Retain Intermediate Results in MMX™ Registers

Keep intermediate results in the MMX registers as much as possible, especially if the intermediate results are used shortly after they have been produced. Avoid spilling intermediate results to memory and reusing them shortly thereafter.

AMD64 Technology, 24592, Rev. 3.13, July 2007

6 x87 Floating-Point Programming

This chapter describes the x87 floating-point programming model. This model supports all aspects of the legacy x87 floating-point model and complies with the IEEE 754 and 854 standards for binary floating-point arithmetic. In hardware implementations of the AMD64 architecture, support for specific features of the x87 programming model is indicated by the CPUID feature bits, as described in “Feature Detection” on page 279.

6.1 Overview

Floating-point software is typically written to manipulate numbers that are very large or very small, that require a high degree of precision, or that result from complex mathematical operations, such as transcendentals.
Applications that take advantage of floating-point operations include geometric calculations for graphics acceleration, scientific, statistical, and engineering applications, and process control.

6.1.1 Capabilities

The advantages of using x87 floating-point instructions include:

• Representation of all numbers in common IEEE-754/854 formats, ensuring replicability of results across all platforms that conform to IEEE-754/854 standards.
• Availability of separate floating-point registers. Depending on the hardware implementation of the architecture, this may allow execution of x87 floating-point instructions in parallel with execution of general-purpose and 128-bit media instructions.
• Availability of instructions that compute absolute value, change-of-sign, round-to-integer, partial remainder, and square root.
• Availability of instructions that compute transcendental values, including 2^x − 1, cosine, partial arctangent, partial tangent, sine, sine with cosine, y*log₂x, and y*log₂(x+1). The cosine, partial arctangent, sine, and sine with cosine instructions use angular values expressed in radians for operands and results.
• Availability of instructions that load common constants, such as log₂e, log₂10, log₁₀2, logₑ2, Pi, 1, and 0.

x87 instructions operate on data in three floating-point formats (32-bit single-precision, 64-bit double-precision, and 80-bit double-extended-precision, sometimes called extended precision) as well as integer and 80-bit packed-BCD formats.

x87 instructions carry out all computations using the 80-bit double-extended-precision format.
When an x87 instruction reads a number from memory in 80-bit double-extended-precision format, the number can be used directly in computations, without conversion. When an x87 instruction reads a number in a format other than double-extended-precision format, the processor first converts the number into double-extended-precision format. The processor can convert numbers back to specific formats, or leave them in double-extended-precision format when writing them to memory.

Most x87 operations for addition, subtraction, multiplication, and division specify two source operands, the first of which is replaced by the result. Instructions for subtraction and division have reverse forms that swap the ordering of operands.

6.1.2 Origins

In 1979, AMD introduced the first floating-point coprocessor for microprocessors, the AM9511 arithmetic circuit.
This coprocessor performed 32-bit floating-point operations under microprocessor control. In 1980, AMD introduced the AM9512, which performed 64-bit floating-point operations. These coprocessors were second-sourced as the 8231 and 8232 coprocessors. Before then, programmers working with general-purpose microprocessors had to use much slower, vendor-supplied software libraries for their floating-point needs.

In 1985, the Institute of Electrical and Electronics Engineers published the IEEE Standard for Binary Floating-Point Arithmetic, also referred to as the ANSI/IEEE Std 754-1985 standard, or IEEE 754. This standard defines the data types, operations, and exception-handling methods that are the basis for the x87 floating-point technology implemented in the legacy x86 architecture.
In 1987, the IEEE published a more general radix-independent version of that standard, called the ANSI/IEEE Std 854-1987 standard, or IEEE 854 for short. The AMD64 architecture complies with both the IEEE 754 and IEEE 854 standards.

6.1.3 Compatibility

x87 floating-point instructions can be executed in any of the architecture’s operating modes.
Existing x87 binary programs run in legacy and compatibility modes without modification. The support provided by the AMD64 architecture for such binaries is identical to that provided by legacy x86 architectures.

To run in 64-bit mode, x87 floating-point programs must be recompiled. The recompilation has no side effects on such programs, other than to make available the extended general-purpose registers and 64-bit virtual address space.

6.2 Registers

Operands for the x87 instructions are located in x87 registers or memory. Figure 6-1 on page 239 shows an overview of the x87 registers.

Figure 6-1. x87 Registers (diagram not reproduced: x87 data registers fpr0 through fpr7, instruction pointer (rIP), data pointer (rDP), opcode, control word, status word, and tag word)

These registers include eight 80-bit data registers, three 16-bit registers that hold the x87 control word, status word, and tag word, two 64-bit registers that hold instruction and data pointers, and an 11-bit register that holds a permutation of an x87 opcode.

6.2.1 x87 Data Registers

Figure 6-2 on page 240 shows the eight 80-bit data registers in more detail.
Typically, x87 instructions reference these registers as a stack. x87 instructions store operands only in these 80-bit registers or in memory. They do not (with two exceptions) access the GPR registers, and they do not access the XMM registers.

Figure 6-2. x87 Physical and Stack Registers (diagram not reproduced: with TOP = 2 in bits 13–11 of the x87 status word, ST(0) through ST(5) map to fpr2 through fpr7, and ST(6) and ST(7) map to fpr0 and fpr1)

Stack Organization. The bank of eight physical data registers, FPR0–FPR7, is organized internally as a stack, ST(0)–ST(7). The stack functions like a circular modulo-8 buffer. The stack top can be set by software to start at any register position in the bank. Many instructions access the top of stack as well as individual registers relative to the top of stack.

Stack Pointer. Bits 13–11 of the x87 status word (see “x87 Status Word Register (FSW)” on page 241) are the top-of-stack pointer (TOP). The TOP specifies the mapping of the stack registers onto the physical registers. The TOP contains the physical-register index of the location of the top of stack, ST(0). Instructions that load operands from memory into an x87 register first decrement the stack pointer and then copy the operand (often with conversion to the double-extended-precision format) from memory into the decremented top-of-stack register. Instructions that store operands from an x87 register to memory copy the operand (often with conversion from the double-extended-precision format) in the top-of-stack register to memory and then increment the stack pointer.

Figure 6-2 shows the mapping between stack registers and physical registers when the TOP has the value 2. Modulo-8 wraparound addressing is used. Pushing a new element onto this stack, for example with the FLDZ (floating-point load +0.0) instruction, decrements the TOP to 1, so that ST(0) refers to FPR1, and the new top-of-stack is loaded with +0.0.

The architecture provides alternative versions of many instructions that either modify or do not modify the TOP as a side effect.
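
The TOP arithmetic described above can be modeled in a few lines of C. This is a hypothetical software model for illustration only, not how hardware or any compiler implements it: the type `x87_model` and the `x87_*` function names are invented here, `long double` merely stands in for an 80-bit register, and the tag word and stack overflow or underflow checks are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the x87 register stack; all names are
   illustrative and not part of any AMD or compiler API. */
typedef struct {
    long double fpr[8];  /* stands in for physical registers FPR0-FPR7 */
    uint16_t    fsw;     /* status word; only TOP (bits 13-11) is modeled */
} x87_model;

/* Extract TOP from bits 13-11 of the status word. */
unsigned x87_top(const x87_model *m)
{
    return (m->fsw >> 11) & 7u;
}

/* Store a new TOP value, leaving the other status bits untouched. */
void x87_set_top(x87_model *m, unsigned top)
{
    m->fsw = (uint16_t)((m->fsw & ~(7u << 11)) | ((top & 7u) << 11));
}

/* Physical register index that ST(i) maps to, with modulo-8 wraparound. */
unsigned x87_phys_index(const x87_model *m, unsigned i)
{
    assert(i < 8);
    return (x87_top(m) + i) & 7u;
}

/* Load-style push: decrement TOP first, then write the new ST(0). */
void x87_push(x87_model *m, long double value)
{
    x87_set_top(m, (x87_top(m) + 7u) & 7u);  /* TOP - 1, modulo 8 */
    m->fpr[x87_top(m)] = value;
}

/* Store-and-pop: read ST(0), then increment TOP. */
long double x87_pop(x87_model *m)
{
    long double value = m->fpr[x87_top(m)];
    x87_set_top(m, (x87_top(m) + 1u) & 7u);  /* TOP + 1, modulo 8 */
    return value;
}
```

With TOP set to 2, `x87_phys_index(&m, 0)` gives 2 and `x87_phys_index(&m, 6)` gives 0, matching Figure 6-2. Calling `x87_push(&m, 0.0L)`, which mirrors FLDZ, moves TOP to 1 so that ST(0) refers to FPR1.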