Volume 3A System Programming Guide, Part 1 (794103), page 80
This information can be queried using the CPUID monitor leaf function (EAX = 05H). You will need the smallest and largest monitor line size:

•   To avoid missed wake-ups: make sure that the data structure used to monitor writes fits within the smallest monitor line size. Otherwise, the processor may not wake up after a write intended to trigger an exit from MWAIT.
•   To avoid false wake-ups: use the largest monitor line size to pad the data structure used to monitor writes. Software must make sure that beyond the data structure, no unrelated data variable exists in the triggering area for MWAIT. A pad may be needed to avoid this situation.

These two values bear no relationship to the cache line size in the system, and software should not make any assumptions to that effect. Within a single-cluster system, the two parameters should default to be the same (the size of the monitor triggering area is the same as the system coherence line size).

Based on the monitor line sizes returned by CPUID, the OS should dynamically allocate structures with appropriate padding.
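The two sizing rules above can be sketched in C. This is a minimal illustration, not a definitive implementation: it assumes a GCC or Clang toolchain (for the <cpuid.h> helper __get_cpuid), and every struct and function name in it is hypothetical. It reads the smallest and largest monitor line sizes from CPUID leaf 05H (EAX[15:0] and EBX[15:0]), refuses a payload that would not fit in the smallest line, and pads and aligns the allocation to the largest line. Treating the triggering area as an aligned block of the largest line size is an assumption of the sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>  /* GCC/Clang helper header; a toolchain assumption */
#endif

/* Monitor line sizes in bytes, as reported by CPUID leaf 05H. */
struct monitor_sizes {
    uint32_t smallest;  /* CPUID.05H:EAX[15:0] */
    uint32_t largest;   /* CPUID.05H:EBX[15:0] */
};

/* Extract the two 16-bit size fields from raw EAX/EBX register values. */
static struct monitor_sizes decode_monitor_leaf(uint32_t eax, uint32_t ebx)
{
    struct monitor_sizes s = { eax & 0xFFFFu, ebx & 0xFFFFu };
    return s;
}

/* Round a payload size up to a whole number of largest-size monitor lines. */
static size_t padded_size(size_t payload, size_t largest)
{
    assert(largest != 0);
    return ((payload + largest - 1) / largest) * largest;
}

/* Query the running CPU. Returns 0 on success, -1 if MONITOR/MWAIT or
 * leaf 05H is unavailable, or if the leaf reports zero sizes. */
static int query_monitor_sizes(struct monitor_sizes *out)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;
    /* CPUID.01H:ECX bit 3 is the MONITOR/MWAIT feature flag. */
    if (!__get_cpuid(0x01, &eax, &ebx, &ecx, &edx) || !(ecx & (1u << 3)))
        return -1;
    if (!__get_cpuid(0x05, &eax, &ebx, &ecx, &edx))
        return -1;
    *out = decode_monitor_leaf(eax, ebx);
    return (out->smallest != 0 && out->largest != 0) ? 0 : -1;
#else
    (void)out;
    return -1;
#endif
}

/* Allocate a zeroed buffer for a monitored variable. Returns NULL if the
 * payload would not fit in the smallest line (missed wake-ups), if the
 * largest size is not a power of two, or if allocation fails. */
static void *alloc_monitored(size_t payload, const struct monitor_sizes *s)
{
    if (s->smallest == 0 || s->largest == 0 || payload > s->smallest)
        return NULL;
    if ((s->largest & (s->largest - 1)) != 0)
        return NULL;  /* aligned_alloc needs a power-of-two alignment */
    size_t size = padded_size(payload, s->largest);
    void *p = aligned_alloc(s->largest, size);  /* C11 */
    if (p != NULL)
        memset(p, 0, size);
    return p;
}
```

A caller would run query_monitor_sizes once at start-up, then pass the result to alloc_monitored(sizeof(uint32_t), &sizes) for each wake-up flag and release it with free. Because the buffer starts on a largest-line boundary and spans a whole number of such lines, no unrelated variable shares it.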
If static data structures must be used by an OS, attempt to adapt the data structure and use a dynamically allocated data buffer for thread synchronization. When the latter technique is not possible, consider not using MONITOR/MWAIT when using static data structures.

7-50 Vol. 3
MULTIPLE-PROCESSOR MANAGEMENT

To set up the data structure correctly for MONITOR/MWAIT on multi-clustered systems, interaction between processors, chipsets, and the BIOS is required (the system coherence line size may depend on the chipset used in the system; the size could be different from the processor’s monitor triggering area). The BIOS is responsible for setting the correct value for the system coherence line size using the IA32_MONITOR_FILTER_LINE_SIZE MSR. Depending on the relative magnitude of the size of the monitor triggering area versus the value written into the IA32_MONITOR_FILTER_LINE_SIZE MSR, the smaller of the parameters will be reported as the Smallest Monitor Line Size.
The larger of the parameters will be reported as the Largest Monitor Line Size.

7.11.6 Required Operating System Support

This section describes changes that must be made to an operating system to run on processors supporting Hyper-Threading Technology. It also describes optimizations that can help an operating system make more efficient use of the logical processors sharing execution resources. The required changes and suggested optimizations are representative of the types of modifications that appear in Windows* XP and Linux* kernel 2.4.0 operating systems for Intel processors supporting Hyper-Threading Technology. Additional optimizations for processors supporting Hyper-Threading Technology are described in the Intel® 64 and IA-32 Architectures Optimization Reference Manual.

7.11.6.1 Use the PAUSE Instruction in Spin-Wait Loops

Intel recommends that a PAUSE instruction be placed in all spin-wait loops that run on Intel processors supporting Hyper-Threading Technology and multi-core processors.

Software routines that use spin-wait loops include multiprocessor synchronization primitives (spin-locks, semaphores, and mutex variables) and idle loops.
Such routines keep the processor core busy executing a load-compare-branch loop while a thread waits for a resource to become available. Including a PAUSE instruction in such a loop greatly improves efficiency (see Section 7.11.2, “PAUSE Instruction”). The following routine gives an example of a spin-wait loop that uses a PAUSE instruction:

    Spin_Lock:
        CMP lockvar, 0       ;Check if lock is free
        JE Get_Lock
        PAUSE                ;Short delay
        JMP Spin_Lock
    Get_Lock:
        MOV EAX, 1
        XCHG EAX, lockvar    ;Try to get lock
        CMP EAX, 0           ;Test if successful
        JNE Spin_Lock
    Critical_Section:
        <critical section code>
        MOV lockvar, 0
        ...
    Continue:

The spin-wait loop above uses a “test, test-and-set” technique for determining the availability of the synchronization variable.
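For readers working in C rather than assembly, the same pattern might be sketched with C11 atomics as follows. This is an illustration under stated assumptions, not code from the manual: the type and function names are hypothetical, _mm_pause() is the compiler intrinsic that emits PAUSE on x86, and the non-x86 fallback is a no-op so that the sketch still compiles elsewhere.

```c
#include <assert.h>
#include <stdatomic.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define cpu_relax() _mm_pause()  /* emits the PAUSE instruction */
#else
#define cpu_relax() ((void)0)    /* fallback for non-x86 builds */
#endif

/* Hypothetical lock type: 0 = free, 1 = held (mirrors lockvar above). */
struct spin_lock {
    atomic_int locked;
};

static void spin_lock_init(struct spin_lock *l)
{
    atomic_init(&l->locked, 0);
}

/* One test-and-set attempt; returns 1 if the lock was taken, 0 otherwise. */
static int spin_lock_try(struct spin_lock *l)
{
    return atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) == 0;
}

static void spin_lock_acquire(struct spin_lock *l)
{
    for (;;) {
        /* Test: spin on a plain load, with PAUSE, until the lock looks
         * free (the CMP / PAUSE / JMP part of the assembly routine). */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed) != 0)
            cpu_relax();
        /* Test-and-set: the atomic exchange plays the role of XCHG.
         * If another thread won the race, go back to spinning. */
        if (spin_lock_try(l))
            return;
    }
}

static void spin_lock_release(struct spin_lock *l)
{
    /* Corresponds to MOV lockvar, 0 at the end of the critical section. */
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

The inner loop only reads the lock variable, so waiting threads do not issue the locked exchange until the lock appears free; the exchange is attempted once per observed release.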
This technique is recommended when writing spin-wait loops.

In IA-32 processor generations earlier than the Pentium 4 processor, the PAUSE instruction is treated as a NOP instruction.

7.11.6.2 Potential Usage of MONITOR/MWAIT in C0 Idle Loops

An operating system may implement different handlers for different idle states. A typical OS idle loop on an ACPI-compatible OS is shown in Example 7-5:

Example 7-5. A Typical OS Idle Loop

    // WorkQueue is a memory location indicating there is a thread
    // ready to run. A non-zero value for WorkQueue is assumed to
    // indicate the presence of work to be scheduled on the processor.
    // The idle loop is entered with interrupts disabled.
    WHILE (1) {
        IF (WorkQueue) THEN {
            // Schedule work at WorkQueue.
        } ELSE {
            // No work to do - wait in appropriate C-state handler depending
            // on Idle time accumulated
            IF (IdleTime >= IdleTimeThreshhold) THEN {
                // Call appropriate C1, C2, C3 state handler, C1 handler
                // shown below
            }
        }
    }
    // C1 handler uses a Halt instruction
    VOID C1Handler()
    {
        STI
        HLT
    }

The MONITOR and MWAIT instructions may be considered for use in the C0 idle state loops, if MONITOR and MWAIT are supported.

Example 7-6. An OS Idle Loop with MONITOR/MWAIT in the C0 Idle Loop

    // WorkQueue is a memory location indicating there is a thread
    // ready to run. A non-zero value for WorkQueue is assumed to
    // indicate the presence of work to be scheduled on the processor.
    // The following example assumes that the necessary padding has been
    // added surrounding WorkQueue to eliminate false wakeups
    // The idle loop is entered with interrupts disabled.
    WHILE (1) {
        IF (WorkQueue) THEN {
            // Schedule work at WorkQueue.
        } ELSE {
            // No work to do - wait in appropriate C-state handler depending
            // on Idle time accumulated.
            IF (IdleTime >= IdleTimeThreshhold) THEN {
                // Call appropriate C1, C2, C3 state handler, C1
                // handler shown below
                MONITOR WorkQueue   // Setup of eax with WorkQueue
                                    // LinearAddress,
                                    // ECX, EDX = 0
                // Re-check after arming the monitor: wait only if
                // no work has arrived in the meantime.
                IF (WorkQueue == 0) THEN {
                    MWAIT
                }
            }
        }
    }
    // C1 handler uses a Halt instruction.
    VOID C1Handler()
    {
        STI
        HLT
    }

7.11.6.3 Halt Idle Logical Processors

If one of two logical processors is idle or in a spin-wait loop of long duration, explicitly halt that processor by means of a HLT instruction.

In an MP system, operating systems can place idle processors into a loop that continuously checks the run queue for runnable software tasks.
Logical processors that execute idle loops consume a significant amount of the core’s execution resources that might otherwise be used by the other logical processors in the physical package. For this reason, halting idle logical processors optimizes performance.[8] If all logical processors within a physical package are halted, the processor will enter a power-saving state.

[8] Excessive transitions into and out of the HALT state could also incur performance penalties. Operating systems should evaluate the performance trade-offs for their operating system.

7.11.6.4 Potential Usage of MONITOR/MWAIT in C1 Idle Loops

An operating system may also consider replacing HLT with MONITOR/MWAIT in its C1 idle loop. An example is shown in Example 7-7:

Example 7-7. An OS Idle Loop with MONITOR/MWAIT in the C1 Idle Loop

    // WorkQueue is a memory location indicating there is a thread
    // ready to run. A non-zero value for WorkQueue is assumed to
    // indicate the presence of work to be scheduled on the processor.
    // The following example assumes that the necessary padding has been
    // added surrounding WorkQueue to eliminate false wakeups
    // The idle loop is entered with interrupts disabled.
    WHILE (1) {
        IF (WorkQueue) THEN {
            // Schedule work at WorkQueue
        } ELSE {
            // No work to do - wait in appropriate C-state handler depending
            // on Idle time accumulated
            IF (IdleTime >= IdleTimeThreshhold) THEN {
                // Call appropriate C1, C2, C3 state handler, C1
                // handler shown below
            }
        }
    }
    // C1 handler uses MONITOR/MWAIT in place of a Halt instruction
    VOID C1Handler()
    {
        MONITOR WorkQueue   // Setup of eax with WorkQueue LinearAddress,
                            // ECX, EDX = 0
        // Re-check after arming the monitor: wait only if
        // no work has arrived in the meantime.
        IF (WorkQueue == 0) THEN {
            STI
            MWAIT           // EAX, ECX = 0
        }
    }

7.11.6.5 Guidelines for Scheduling Threads on Logical Processors Sharing Execution Resources

Because logical processors share execution resources, the order in which threads are dispatched to logical processors for execution can affect the overall efficiency of a system. The following guidelines are recommended for scheduling threads for execution.

•   Dispatch threads to one logical processor per processor core before dispatching threads to the other logical processor sharing execution resources in the same processor core.
•   In an MP system with two or more physical packages, distribute threads out over all the physical processors, rather than concentrate them in one or two physical processors.
•   Use processor affinity to assign a thread to a specific processor core or package, depending on the cache-sharing topology.
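The first two guidelines amount to an ordering over logical processors, which can be sketched as a sort. The following is only an illustration: the logical_cpu structure, its fields, and the function names are hypothetical, and a real scheduler would obtain the topology from CPUID or the OS rather than from a hand-built table. Logical processors are ordered so that the first logical processor of every core comes before any sibling, and consecutive entries alternate across physical packages.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical description of one logical processor. */
struct logical_cpu {
    int id;       /* OS-visible logical processor number */
    int package;  /* physical package index */
    int core;     /* core index within the package */
    int thread;   /* logical processor index within the core (0 = first) */
};

/* Dispatch-order comparison:
 *   1. lower thread index first, so one logical processor per core is
 *      used before any sibling sharing that core;
 *   2. then lower core index, so cores at the same position in each
 *      package are grouped together;
 *   3. then package index, so consecutive entries land on different
 *      packages and work is spread across them. */
static int dispatch_cmp(const void *a, const void *b)
{
    const struct logical_cpu *x = a;
    const struct logical_cpu *y = b;
    if (x->thread != y->thread)
        return x->thread - y->thread;
    if (x->core != y->core)
        return x->core - y->core;
    if (x->package != y->package)
        return x->package - y->package;
    return x->id - y->id;
}

/* Reorder the table in place into the preferred dispatch order. */
static void order_for_dispatch(struct logical_cpu *cpus, size_t n)
{
    qsort(cpus, n, sizeof cpus[0], dispatch_cmp);
}
```

With two packages of two cores and two logical processors per core, numbered package-major, the resulting order of ids is 0, 4, 2, 6, 1, 5, 3, 7: all four cores receive a thread before any core receives a second one. The third guideline (affinity chosen by cache-sharing topology) is not modeled here.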