Volume 3A System Programming Guide, Part 1 (794103), page 71
7-10 Vol. 3 MULTIPLE-PROCESSOR MANAGEMENT

The only enhancements in the Pentium 4, Intel Xeon, and P6 family processors are:
• Added support for speculative reads.
• Out-of-order stores from long string store and string move operations (see Section 7.2.3, “Out-of-Order Stores For String Operations,” below).
• Store-buffer forwarding, when a read passes a write to the same memory location.

Order of writes from individual processors (each processor is guaranteed to perform writes in program order):
Processor #1: Write A.1, Write B.1, Write C.1
Processor #2: Write A.2, Write B.2, Write C.2
Processor #3: Write A.3, Write B.3, Write C.3
Example of order of actual writes from all processors to memory (writes are in order with respect to individual processors; writes from all processors are not guaranteed to occur in a particular order):
Write A.1, Write B.1, Write A.2, Write A.3, Write C.1, Write B.2, Write C.2, Write B.3, Write C.3

Figure 7-1. Example of Write Ordering in Multiple-Processor Systems

NOTE
In the P6 processor family, store-buffer forwarding to reads of WC memory from streaming stores to the same address does not occur due to errata.

7.2.3 Out-of-Order Stores For String Operations

The Intel Core 2 Duo, Intel Core, Pentium 4, and P6 family processors modify the processor's operation during string store operations (initiated with the MOVS and STOS instructions) to maximize performance. Once the initial conditions for “fast string” operation are met (as described below), the processor, as seen from an external perspective, operates on the string one cache line at a time. The processor loops, issuing a cache-line read for the source address and an invalidation on the external bus for the destination address (knowing that all bytes in the destination cache line will be modified), for the length of the string. In this mode, interrupts are accepted by the processor only on cache-line boundaries.
It is possible in this mode that the destination line invalidations, and therefore the stores, will be issued on the external bus out of order.

Code that depends on sequential store ordering should not use the string operations for the entire data structure to be stored. Data and semaphores should be separated. Order-dependent code should write a discrete semaphore with its own store after any string operations, so that all processors see correctly ordered data.

Initial conditions for “fast string” operations:
• EDI and ESI must be 8-byte aligned for the Pentium III processor. EDI must be 8-byte aligned for the Pentium 4 processor.
• String operation must be performed in ascending address order.
• The memory type for both source and destination addresses must be either WB or WC.
• The initial operation counter (ECX) must be equal to or greater than 64.
• Source and destination must not overlap by less than a cache line (64 bytes for Intel Core 2 Duo, Intel Core, Pentium M, and Pentium 4 processors; 32 bytes for P6 family and Pentium processors).

7.2.4 Strengthening or Weakening the Memory Ordering Model

The Intel 64 and IA-32 architectures provide several mechanisms for strengthening or weakening the memory ordering model to handle special programming situations. These mechanisms include:
• The I/O instructions, locking instructions, the LOCK prefix, and serializing instructions force stronger ordering on the processor.
• The SFENCE instruction (introduced to the IA-32 architecture in the Pentium III processor) and the LFENCE and MFENCE instructions (introduced in the Pentium 4 processor) provide memory ordering and serialization capability for specific types of memory operations.
• The memory type range registers (MTRRs) can be used to strengthen or weaken memory ordering for specific areas of physical memory (see Section 10.11, “Memory Type Range Registers (MTRRs)”). MTRRs are available only in the Pentium 4, Intel Xeon, and P6 family processors.
• The page attribute table (PAT) can be used to strengthen memory ordering for a specific page or group of pages (see Section 10.12, “Page Attribute Table (PAT)”). The PAT is available only in the Pentium 4, Intel Xeon, and Pentium III processors.

These mechanisms can be used as follows.

Memory mapped devices and other I/O devices on the bus are often sensitive to the order of writes to their I/O buffers. I/O instructions (the IN and OUT instructions) can be used to impose strong write ordering on such accesses as follows.
Prior to executing an I/O instruction, the processor waits for all previous instructions in the program to complete and for all buffered writes to drain to memory. Only instruction fetches and page table walks can pass I/O instructions. Execution of subsequent instructions does not begin until the processor determines that the I/O instruction has been completed.

Synchronization mechanisms in multiple-processor systems may depend upon a strong memory-ordering model. Here, a program can use a locking instruction such as the XCHG instruction or the LOCK prefix to ensure that a read-modify-write operation on memory is carried out atomically.
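The atomic read-modify-write described above can be sketched in C. This is a minimal illustration that assumes a C11 compiler with <stdatomic.h>; the names demo_lock_t, demo_lock_try, demo_lock_acquire, and demo_lock_release are hypothetical and do not come from the manual. On x86 targets, compilers commonly translate atomic_exchange into an XCHG with a memory operand, but that mapping is a compiler detail, not something the text above specifies.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical lock type for illustration: 0 = free, 1 = held. */
typedef struct {
    atomic_int held;
} demo_lock_t;

static void demo_lock_init(demo_lock_t *lock) {
    atomic_init(&lock->held, 0);
}

/* One atomic read-modify-write: store 1 and fetch the previous value.
 * Returns 1 if the lock was free (the caller now owns it), 0 if it was held. */
static int demo_lock_try(demo_lock_t *lock) {
    return atomic_exchange_explicit(&lock->held, 1, memory_order_acquire) == 0;
}

/* Spin until the exchange observes the lock as free. */
static void demo_lock_acquire(demo_lock_t *lock) {
    while (!demo_lock_try(lock)) {
        /* busy-wait; production code would normally add a pause or backoff */
    }
}

/* Release store: accesses made inside the critical section are ordered
 * before the lock becomes visible as free to other threads. */
static void demo_lock_release(demo_lock_t *lock) {
    assert(atomic_load_explicit(&lock->held, memory_order_relaxed) == 1);
    atomic_store_explicit(&lock->held, 0, memory_order_release);
}
```

A caller would bracket its critical section with demo_lock_acquire and demo_lock_release. Because the exchange reads and writes the flag in one indivisible step, two threads cannot both observe the lock as free.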
Locking operations typically operate like I/O operations in that they wait for all previous instructions to complete and for all buffered writes to drain to memory (see Section 7.1.2, “Bus Locking”).

Program synchronization can also be carried out with serializing instructions (see Section 7.4). These instructions are typically used at critical procedure or task boundaries to force completion of all previous instructions before a jump to a new section of code or a context switch occurs. As with the I/O and locking instructions, the processor waits until all previous instructions have been completed and all buffered writes have been drained to memory before executing the serializing instruction.

The SFENCE, LFENCE, and MFENCE instructions provide a performance-efficient way of ensuring load and store memory ordering between routines that produce weakly ordered results and routines that consume that data. The functions of these instructions are as follows:
• SFENCE: Serializes all store (write) operations that occurred prior to the SFENCE instruction in the program instruction stream, but does not affect load operations.
• LFENCE: Serializes all load (read) operations that occurred prior to the LFENCE instruction in the program instruction stream, but does not affect store operations.
• MFENCE: Serializes all store and load operations that occurred prior to the MFENCE instruction in the program instruction stream.

Note that the SFENCE, LFENCE, and MFENCE instructions provide a more efficient method of controlling memory ordering than the CPUID instruction.

The MTRRs were introduced in the P6 family processors to define the cache characteristics for specified areas of physical memory.
The following are two examples of how memory types set up with MTRRs can be used to strengthen or weaken memory ordering for the Pentium 4, Intel Xeon, and P6 family processors:
• The strong uncached (UC) memory type forces a strong-ordering model on memory accesses. Here, all reads and writes to the UC memory region appear on the bus, and out-of-order or speculative accesses are not performed. This memory type can be applied to an address range dedicated to memory mapped I/O devices to force strong memory ordering.
• For areas of memory where weak ordering is acceptable, the write back (WB) memory type can be chosen. Here, reads can be performed speculatively and writes can be buffered and combined. For this type of memory, cache locking is performed on atomic (locked) operations that do not split across cache lines, which helps to reduce the performance penalty associated with the use of the typical synchronization instructions, such as XCHG, that lock the bus during the entire read-modify-write operation. With the WB memory type, the XCHG instruction locks the cache instead of the bus if the memory access is contained within a cache line.

The PAT was introduced in the Pentium III processor to enhance the caching characteristics that can be assigned to pages or groups of pages.
The PAT mechanism is typically used to strengthen caching characteristics at the page level with respect to the caching characteristics established by the MTRRs. Table 10-7 shows the interaction of the PAT with the MTRRs.

We recommend that software written to run on Intel Core 2 Duo, Intel Core Duo, Pentium 4, Intel Xeon, and P6 family processors assume the processor-ordering model or a weaker memory-ordering model. The Intel Core 2 Duo, Intel Core Duo, Pentium 4, Intel Xeon, and P6 family processors do not implement a strong memory-ordering model, except when using the UC memory type. Although the Pentium 4, Intel Xeon, and P6 family processors support processor ordering, Intel does not guarantee that future processors will support this model. To make software portable to future processors, it is recommended that operating systems provide critical region and resource control constructs and APIs (application program interfaces) based on I/O, locking, and/or serializing instructions, and that these be used to synchronize access to shared areas of memory in multiple-processor systems.
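As an application-level illustration of not relying on the hardware's ordering model, the sketch below states the required ordering explicitly and leaves the choice of instructions to the compiler. It assumes C11 atomics, which the manual does not discuss; mailbox_t, mailbox_publish, and mailbox_try_read are hypothetical names. It also follows the earlier advice in Section 7.2.3 to keep the data and its semaphore separate: the payload is bulk-copied first, then a discrete flag is written with its own store. On x86, a sequentially consistent atomic store is typically compiled to an XCHG or to a MOV followed by MFENCE, though this depends on the compiler.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define MAILBOX_WORDS 16

/* Hypothetical shared structure: payload plus a separate ready flag. */
typedef struct {
    int data[MAILBOX_WORDS];
    atomic_int ready;          /* 0 = empty, 1 = payload is valid */
} mailbox_t;

static void mailbox_init(mailbox_t *box) {
    assert(box != NULL);
    memset(box->data, 0, sizeof box->data);
    atomic_init(&box->ready, 0);
}

/* Producer: bulk-copy the payload, then publish it with one discrete,
 * sequentially consistent store to the flag. */
static void mailbox_publish(mailbox_t *box, const int *src) {
    assert(box != NULL && src != NULL);
    memcpy(box->data, src, sizeof box->data);
    atomic_store(&box->ready, 1);
}

/* Consumer: read the flag first; copy the payload only if it is published.
 * Returns 1 on success, 0 if nothing has been published yet. */
static int mailbox_try_read(mailbox_t *box, int *dst) {
    assert(box != NULL && dst != NULL);
    if (atomic_load(&box->ready) == 0) {
        return 0;
    }
    memcpy(dst, box->data, sizeof box->data);
    return 1;
}
```

A consumer that sees ready equal to 1 is guaranteed by the C11 memory model to see the complete payload, regardless of how strongly or weakly the underlying processor orders ordinary stores.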
Also, software should not depend on processor ordering in situations where the system hardware does not support this memory-ordering model.

7.3 PROPAGATION OF PAGE TABLE AND PAGE DIRECTORY ENTRY CHANGES TO MULTIPLE PROCESSORS

In a multiprocessor system, when one processor changes a page table or page directory entry, the changes must also be propagated to all other processors. This process is commonly referred to as “TLB shootdown.” The propagation of changes to page table or page directory entries can be done using memory-based semaphores and/or interprocessor interrupts (IPI).

For example, the following describes a simple TLB shootdown sequence for an Intel 64 or IA-32 processor:
1. Begin barrier: Stop all but one processor; that is, cause all but one to HALT or stop in a spin loop.
2. Let the active processor change the necessary PTEs and/or PDEs.
3. Let all processors invalidate the PTEs and PDEs modified in their TLBs.
4. End barrier: Resume all processors; resume general processing.

Alternate, performance-optimized TLB shootdown algorithms may be developed; however, developers must take care to ensure that one of the following conditions is met:
• Different TLB mappings are not used on different processors during the update process.
• The operating system is prepared to deal with the case where processors are using the stale mapping during the update process.

7.4 SERIALIZING INSTRUCTIONS

The Intel 64 and IA-32 architectures define several serializing instructions.
These instructions force the processor to complete all modifications to flags, registers, and memory by previous instructions and to drain all buffered writes to memory before the next instruction is fetched and executed. For example, when a MOV to control register instruction is used to load a new value into control register CR0 to enable protected mode, the processor must perform a serializing operation before it enters protected mode. This serializing operation ensures that all operations that were started while the processor was in real-address mode are completed before the switch to protected mode is made.

The concept of serializing instructions was introduced into the IA-32 architecture with the Pentium processor to support parallel instruction execution. Serializing instructions have no meaning for the Intel486 and earlier processors that do not implement parallel instruction execution.

It is important to note that the execution of serializing instructions on P6 and more recent processor families constrains speculative execution, because the results of speculatively executed instructions are discarded. The following instructions are serializing instructions:
• Privileged serializing instructions: MOV (to control register, with the exception of MOV CR8¹), MOV (to debug register), WRMSR, INVD, INVLPG, WBINVD, LGDT, LLDT, LIDT, and LTR.
• Non-privileged serializing instructions: CPUID, IRET, and RSM.

When the processor serializes instruction execution, it ensures that all pending memory transactions are completed (including writes stored in its store buffer) before it executes the next instruction.
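User-level code can request this behavior with CPUID, one of the non-privileged serializing instructions listed above. The sketch below assumes GCC or Clang extended inline assembly on an x86 target; demo_serialize and DEMO_HAVE_CPUID are hypothetical names, and on other architectures the function compiles to a stub that returns 0. The "memory" clobber is an added assumption of this sketch: it keeps the compiler from moving memory accesses across the instruction, which the processor-level serialization alone does not prevent.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#define DEMO_HAVE_CPUID 1
#else
#define DEMO_HAVE_CPUID 0
#endif

/* Executes CPUID leaf 0 as a serialization point.
 * Returns the highest basic CPUID leaf reported in EAX,
 * or 0 when built for a non-x86 target. */
static inline uint32_t demo_serialize(void) {
#if DEMO_HAVE_CPUID
    uint32_t eax = 0;      /* input: leaf 0; output: highest basic leaf */
    uint32_t ebx;
    uint32_t ecx = 0;      /* sub-leaf input, unused by leaf 0 */
    uint32_t edx;
    __asm__ __volatile__("cpuid"
                         : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                         :
                         : "memory");
    (void)ebx;
    (void)edx;
    assert(eax >= 1);      /* sanity check on the value CPUID returned */
    return eax;
#else
    return 0;
#endif
}
```

When only load and store ordering is required, the SFENCE, LFENCE, and MFENCE instructions described in Section 7.2.4 are the more efficient choice; CPUID is appropriate when full instruction serialization is wanted.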