Thus padding can be inserted between columns of an array (intraarray padding), or between arrays (interarray padding).

A further performance artifact, called cache miss jamming, can occur on machines that allow processing to continue during a cache miss: if cache misses are spread nonuniformly across the loop iterations, the asynchrony of the processor will not be exploited, and performance will be reduced. Jamming typically occurs when several arrays are accessed with the same stride and when all have the same alignment relative to cache line boundaries (that is, the same low address bits).
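The effect of alignment on how misses are distributed can be illustrated with a toy model. The sketch below is only an assumption-laden illustration, not a cache simulator: the line size, element size, base addresses, and the helper `misses_per_iteration` are all hypothetical, and each array is assumed to miss exactly when it first touches a new cache line.

```python
LINE = 32  # assumed cache line size in bytes (hypothetical)
ELEM = 8   # assumed element size in bytes (hypothetical)

def misses_per_iteration(bases, n_iter, line=LINE, elem=ELEM):
    """Count, per iteration, how many arrays touch a cache line for the first time.

    Toy model: unit-stride access, and an array misses at iteration i exactly
    when element i lies in a different line than element i - 1.
    """
    counts = []
    for i in range(n_iter):
        count = 0
        for base in bases:
            addr = base + i * elem
            if i == 0 or addr // line != (addr - elem) // line:
                count += 1
        counts.append(count)
    return counts

# Three arrays with identical low address bits: their misses coincide.
aligned = [0x1000, 0x2000, 0x3000]
# The same arrays with interarray padding that staggers their alignment.
padded = [0x1000, 0x2000 + 8, 0x3000 + 16]

aligned_counts = misses_per_iteration(aligned, 8)
padded_counts = misses_per_iteration(padded, 8)
```

In this model the aligned arrays all miss on the same iterations, while the padded arrays miss on different iterations once the first touch is past, which is the more uniform distribution the text describes.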

Bacon et al. [1994] describe cache miss jamming in detail and present a unified framework for inter- and intraarray padding to handle set conflicts and jamming.

The disadvantages of padding are that it increases memory consumption and makes the subscript calculations for operations over the whole array more complex, since the array has "holes." In particular, padding reduces the benefits of loop collapsing (see Section 6.3.4).

6.5.2 Scalar Expansion

Loops often contain variables that are used as temporaries within the loop body. Such variables will create an antidependence $S_2 \,\bar{\delta}_{(<)}\, S_1$ from one iteration to the next, and will have no other loop-carried dependence. Allocating one temporary for each iteration removes the dependence and makes the loop a candidate for parallelization [Padua et al. 1980; Wolfe 1989b], as shown in Figure 32. If the final value of c is used after the loop, c must be assigned the value of T[n].

    do i = 1, n
      c = b[i]
      a[i] = a[i] + c
    end do

    (a) original loop

    real T[n]
    do all i = 1, n
      T[i] = b[i]
      a[i] = a[i] + T[i]
    end do all

    (b) after scalar expansion

Figure 32. Scalar expansion.

Scalar expansion is a fundamental technique for vectorizing compilers, and was performed by the Burroughs Scientific Processor [Kuck and Stokes 1982] and Cray-1 [Russell 1978] compilers. An alternative for parallel machines is to use private variables, where each processor has its own instance of the variable; these may be introduced by the compiler (see Section 7.1.3) or, if the language supports private variables, by the programmer.

If the compiler vectorizes or parallelizes a loop, scalar expansion must be performed for any compiler-generated temporaries in a loop.
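The transformation of Figure 32 can be sketched in Python as follows. This is a minimal illustration, assuming 0-based lists in place of the figure's 1-based arrays; the function names are hypothetical, and the second loop is written sequentially even though its iterations no longer depend on one another.

```python
def original_loop(a, b):
    """Scalar temporary: every iteration overwrites the same variable c."""
    a = list(a)
    c = None
    for i in range(len(a)):
        c = b[i]
        a[i] = a[i] + c
    return a, c

def expanded_loop(a, b):
    """Expanded temporary: iteration i uses its own slot T[i]."""
    a = list(a)
    n = len(a)
    T = [None] * n
    for i in range(n):  # iterations are independent once c is expanded
        T[i] = b[i]
        a[i] = a[i] + T[i]
    # If c is live after the loop, it takes the last element of T
    # (T[n] in the figure's 1-based numbering).
    c = T[n - 1] if n else None
    return a, c

a0 = [1, 2, 3, 4]
b0 = [10, 20, 30, 40]
result_original = original_loop(a0, b0)
result_expanded = expanded_loop(a0, b0)
```

Both versions produce the same array and the same final value of c; the expanded version pays for this with n temporaries instead of one.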

To avoid creating unnecessarily large temporary arrays, a vectorizing compiler can perform scalar expansion after strip mining, expanding the temporary to the size of the vector strip.

Scalar expansion can also increase instruction-level parallelism by removing dependences.

6.5.3 Array Contraction

After transformation of a loop nest, it may be possible to contract scalars or arrays that have previously been expanded. It may also be possible to contract other arrays due to interchange or the use of redundant storage allocation by the programmer [Wolfe 1989b].

If the iteration variable of the pth loop in a loop nest is being used to index the kth dimension of an array x, then dimension k may be removed from x if (1) loop p is not parallel, (2) all distance vectors V involving x have $V_p = 0$, and (3) x is not used subsequently (that is, x is dead after the loop).
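The three conditions can be written as a small predicate, followed by a check that contracting a temporary leaves the result unchanged. This is a sketch under stated assumptions: loops are numbered from 0 (outermost), the helper names are hypothetical, and the two-deep nest with a serial outer loop and a parallel inner loop is an illustrative case rather than a general implementation.

```python
def can_contract(p, parallel_loops, distance_vectors, live_after_loop):
    """Return True if the dimension indexed by loop p may be removed."""
    if p in parallel_loops:                        # condition (1)
        return False
    if any(v[p] != 0 for v in distance_vectors):   # condition (2)
        return False
    return not live_after_loop                     # condition (3)

def full_temporary(a, b):
    """Temporary T has one element per (i, j) iteration."""
    n = len(a)
    b = [row[:] for row in b]
    T = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            T[i][j] = a[i][j] * 3
            b[i][j] = T[i][j] + b[i][j] / T[i][j]
    return b

def contracted_temporary(a, b):
    """Dimension indexed by the serial outer loop i has been removed from T."""
    n = len(a)
    b = [row[:] for row in b]
    T = [0.0] * n
    for i in range(n):
        for j in range(n):
            T[j] = a[i][j] * 3
            b[i][j] = T[j] + b[i][j] / T[j]
    return b

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[6.0, 6.0], [9.0, 12.0]]
# T is written and read in the same iteration, so its only distance vector is (0, 0).
ok_outer = can_contract(0, parallel_loops={1}, distance_vectors=[(0, 0)], live_after_loop=False)
ok_inner = can_contract(1, parallel_loops={1}, distance_vectors=[(0, 0)], live_after_loop=False)
same = full_temporary(a, b) == contracted_temporary(a, b)
```

Here the dimension indexed by the serial outer loop can be removed, while the dimension indexed by the parallel inner loop cannot, because each parallel iteration still needs its own element.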

The latter two conditions are true for compiler-expanded variables unless the loop structure of the program was changed after expansion. In particular, loop distribution can inhibit array contraction by causing the second condition to be violated.

ACM Computing Surveys, Vol. 26, No. 4, December 1994

378 • David F. Bacon et al.

    real T[n, n]
    do i = 1, n
      do all j = 1, n
        T[i, j] = a[i, j]*3
        b[i, j] = T[i, j] + b[i, j]/T[i, j]
      end do all
    end do

    (a) original code

    real T[n]
    do i = 1, n
      do all j = 1, n
        T[j] = a[i, j]*3
        b[i, j] = T[j] + b[i, j]/T[j]
      end do all
    end do

    (b) after array contraction

Figure 33. Array contraction.

Contraction reduces the amount of storage consumed by compiler-generated temporaries, as well as reducing the number of cache lines referenced. Other methods for reducing storage consumption by temporaries are strip mining (see Section 6.2.4) and dynamic allocation of temporaries, either from the heap or from a static block of memory reserved for temporaries.

6.5.4 Scalar Replacement

    do i = 1, n
      do j = 1, n
        total[i] = total[i] + a[i, j]
      end do
    end do

    (a) original loop nest

    do i = 1, n
      T = total[i]
      do j = 1, n
        T = T + a[i, j]
      end do
      total[i] = T
    end do

    (b) after scalar replacement

Figure 34. Scalar replacement.

Even when it is not possible to contract an array into a scalar, a similar optimization can be performed when a frequently referenced array element is invariant within the innermost loop or loops.

In this case, the array element can be loaded into a scalar (and presumably therefore a register) before the inner loop and, if it is modified, stored after the inner loop [Callahan et al. 1990].

Replacement multiplies Q for the array element by the number of iterations in the inner loop(s). It can also eliminate unnecessary subscript calculations, although that optimization is often done by loop-invariant code motion (see Section 6.1.3).
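The saving in array references can be made visible by counting loads and stores. The sketch below is illustrative only: `CountingArray` is a hypothetical helper that counts element accesses, and the loop accumulates each row of a two-dimensional list into one element of `total`, with a Python local standing in for the register.

```python
class CountingArray:
    """List wrapper that counts element loads and stores (illustrative helper)."""

    def __init__(self, values):
        self.values = list(values)
        self.loads = 0
        self.stores = 0

    def __getitem__(self, i):
        self.loads += 1
        return self.values[i]

    def __setitem__(self, i, value):
        self.stores += 1
        self.values[i] = value

def row_sums_original(total, a):
    """total[i] is loaded and stored on every inner iteration."""
    for i in range(len(a)):
        for j in range(len(a[i])):
            total[i] = total[i] + a[i][j]

def row_sums_replaced(total, a):
    """total[i] is loaded once before and stored once after the inner loop."""
    for i in range(len(a)):
        t = total[i]
        for j in range(len(a[i])):
            t = t + a[i][j]
        total[i] = t

a = [[1, 2, 3], [4, 5, 6]]
t1 = CountingArray([0, 0])
row_sums_original(t1, a)
t2 = CountingArray([0, 0])
row_sums_replaced(t2, a)
```

With three inner iterations per row, the original version performs three loads and three stores of each `total[i]`, and the replaced version performs one of each.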

Loop interchange can be used to enable or improve scalar replacement; Carr [1993] examines the combination of scalar replacement, interchange, and unroll-and-jam in the context of cache optimization.

An example of scalar replacement is shown in Figure 34; for a discussion of the interaction between replacement and loop interchange, see Section 6.

6.5.5 Code Collocation

Code collocation improves memory access behavior by placing related code in close proximity. The earliest work rearranged code (often at the granularity of a procedure) to improve paging behavior [Ferrari 1976; Hatfield and Gerald 1971].

More recent strategies focus on improving cache behavior by placing the most frequent successor to a basic block (or the most frequent callee of a procedure) immediately adjacent to it in instruction memory [Hwu and Chang 1989; Pettis and Hansen 1990].

An estimate is made of the frequency with which each arc in the control flow graph will be traversed during program execution (using either profiling information or static estimates). Procedures are grouped together using a greedy algorithm that always takes the pair of procedures (or procedure groups) with the largest number of calls between them. Within a procedure, basic blocks can be grouped in the same way (although the direction of the control flow must be taken into account), or a top-down algorithm can be used that starts from the procedure entry node.
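The greedy grouping of procedures can be sketched as below. This is a simplified illustration and not the published algorithm of any of the cited authors: the function name, the call counts, and the procedure names are hypothetical, and merged groups are simply concatenated, whereas a production implementation would also choose the orientation of each group when joining them.

```python
def greedy_procedure_layout(call_counts):
    """Group procedures by repeatedly merging the most heavily connected pair.

    call_counts maps an unordered pair (p, q) of procedure names to the
    estimated number of calls between p and q.
    Returns a list of groups; each group is an ordered list of procedures.
    """
    procs = sorted({p for pair in call_counts for p in pair})
    group_of = {p: p for p in procs}    # procedure -> name of its group
    members = {p: [p] for p in procs}   # group name -> ordered members

    while True:
        # Sum the call counts between every pair of distinct groups.
        weight = {}
        for (p, q), count in call_counts.items():
            gp, gq = group_of[p], group_of[q]
            if gp != gq:
                key = (min(gp, gq), max(gp, gq))
                weight[key] = weight.get(key, 0) + count
        if not weight:
            break
        # Take the pair with the most calls; ties are broken by name.
        (g1, g2), _ = min(weight.items(), key=lambda kv: (-kv[1], kv[0]))
        members[g1].extend(members.pop(g2))
        for p in members[g1]:
            group_of[p] = g1

    return [members[g] for g in sorted(members)]

calls = {
    ("main", "parse"): 100,
    ("parse", "lex"): 900,
    ("main", "report"): 5,
    ("report", "fmt"): 40,
}
layout = greedy_procedure_layout(calls)
```

The most frequently communicating pair (`parse` and `lex` in this made-up profile) is merged first and therefore ends up adjacent in the final order.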

Basic blocks with a frequency estimate of zero can be moved to a separate page to increase locality further. However, accessing that page may require long displacement jumps to be introduced (see the next subsection), creating the potential for performance loss if the basic blocks in question are actually executed.

Procedure inlining (see Section 6.8.5) can also affect code locality, and has been studied both in conjunction with [Hwu and Chang 1989] and independent of [McFarling 1991] code positioning. Inlining often improves performance by reducing overhead and increasing locality, but if a procedure is called more than once in a loop, inlining will often increase the number of cache misses because the procedure body will be loaded more than once.

6.5.6 Displacement Minimization

The target of a branch or a jump is usually specified relative to the current value of the program counter (PC).

The largest offset that can be specified varies among architectures; it can be as few as 4 bits. If control is transferred to a location outside of the range of the offset, a multiinstruction sequence or long-format instruction is required to perform the jump.

For instance, the S-DLX instruction BEQZ R4, error is only legal if error is within $2^{15}$ bytes. Otherwise, the instruction must be replaced with:

    BNEZ R4, cont         ; reversed test
    LI   R8, error        ; get low bits
    LUI  R8, error>>16    ; get high bits
    JR   R8               ; jump to target
    cont:

This sequence requires three extra instructions. Given the cost of long displacement jumps, the code should be organized to keep related sections close together in memory, in particular those sections executed most frequently [Szymanski 1978].

Displacement minimization can also be applied to data.
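The decision between the short branch and the long sequence amounts to a range check on the displacement, and the long sequence needs the target split into two halves. The sketch below is a hedged illustration: it assumes a signed 16-bit offset measured directly from the branch address (an actual instruction set may measure from the following instruction) and 32-bit addresses, and the function names are hypothetical.

```python
OFFSET_BITS = 16  # assumed width of the signed PC-relative offset field

def fits_short_branch(pc, target, bits=OFFSET_BITS):
    """True if target is reachable with a signed offset of the given width."""
    displacement = target - pc
    limit = 1 << (bits - 1)  # 2**15 for a 16-bit field
    return -limit <= displacement < limit

def split_address(addr):
    """Split a 32-bit address into the (low, high) 16-bit halves."""
    low = addr & 0xFFFF
    high = (addr >> 16) & 0xFFFF
    return low, high

def join_address(low, high):
    """Rebuild the address from its two halves."""
    return (high << 16) | low

near = fits_short_branch(0x1000, 0x1000 + 0x7FFF)
far = fits_short_branch(0x1000, 0x1000 + 0x8000)
low, high = split_address(0x12345678)
rebuilt = join_address(low, high)
```

A layout pass could use a check of this kind to estimate how many branches would need the longer form under a candidate ordering of the code.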

For instance, a base register may be allocated for a Fortran common block or group of blocks:

    common /big/ q, r, x[20000], y, z

If the array x contains word-sized elements, the common block is larger than the amount of memory indexable by the offset field in the load instruction ($2^{16}$ bytes on S-DLX). To address y and z, multiple-instruction sequences must be used in a manner analogous to the long jump sequences above. The problem is avoided if the layout of big is:

    common /big/ q, r, y, z, x[20000]

6.6 Partial Evaluation

Partial evaluation refers to the general technique of performing part of a computation at compile time. Most of the classical optimizations based on data-flow analysis are either a form of partial evaluation or of redundancy elimination (described in Section data-flow optimizations Section

