Architecture-dependent optimizations
Functional units, delay slots and
dependency analysis
RISC architectures
The pipeline structure of modern architectures requires
careful instruction scheduling.
If instruction I1 creates a value, there may be a latency
that has to elapse before another instruction I 2 can use
this value
If an instruction awaits the result of a previous
computation, the pipeline may have to stall until the
result becomes available
A branch instruction is time-consuming and affects the
contents of the instruction cache. Execution cannot start
at destination before one or more cycles have elapsed.
Instruction scheduling
Purpose: minimize stalls and delays, fill delay slots with
useful computations, minimize execution time of basic
block.
Tool: dependency analysis. Uncover legal reorderings of
instructions, available parallelism in basic blocks and
beyond
Applications:
Filling delay slots is important for all programs
Dependency analysis is critical for reordering of loop
computations on vector processors and others
Dependence relations
A data dependence is a constraint that arises from the flow of data
between statements. Violating a data dependence by reordering
may lead to incorrect results.
If S1 sets a value that S2 uses, this is flow dependence or true
dependence between S1 and S2
If S1 uses some variable’s value and S2 sets it, there is an
antidependence between them.
If both S1 and S2 set the value of some variable, there is an output
dependence between them.
If both S1 and S2 read the value of some variable there is an input
dependence between then. This does not impose an ordering.
The dependence DAG of
a basic block
There is an edge in the dependence dag if:
I writes a register or location that I uses:
I1 fd I2
1
2
I uses a register or location that I changes: I ad I
1
2
1
2
I and I write to the same register or location: I od I
1
2
1
2
I1 and I2 exhibit a structural hazard: a load followed by a store
cannot be interchanged unless the addresses are known to be
distinct:
X := A[I];
A[J] := Y;
-- cannot interchange, X might get Y
if there is an edge between I1 and I2, I2 must not start executing
until I1 has executed for some number of cycles.
Example
1: R3 = [R15]
2: R4 = [R15 + 4]
3: R2 = R3 – R4 -- needs R4: stall one cycle
4: R5 = [R12]
5: R12 = R12 + 4
1
2
6: R6 = R3 * R5
3
7
7: [R15+4] = R3
8: R5 = R6 + 2
4
5
6
8
Contention for resources
Functional unit is pipelined, consists of multiple resources. Instructions
through the pipeline may conflict on use of resources.
Eg: floating-point unit on MIPS:
A: Mantissa add
E: Exception test
M: Multiplier first stage
N: Multiplier second stage
R: Adder Round
S: Operand shift
U: unpack
Add uses successively U, S and A, A and R, R and S. (4 cycles)
Mul uses U, M, M, M, M, M and A, R. (7 cycles)
Conflict depends on relative starting time of two instructions.
Edges in dependency graph are labelled with latencies (>= 1).
Branch scheduling
Important use of dependency graph: fill delay slots
(branch takes two cycles to reach destination)
R2 = [R1]
R3 = [R1+4]
R4 = R2 + R3 (stall)
R5 = R2 -1
goto L1
nop
R2 = [R1]
R3 = [R1+4]
R5 = R2 -1
goto L1
R4 = R2 + R3
Conditional jumps and delay slots
Instruction in delay slot is executed while jump is in
progress. What if jump is not taken? Need mechanism to
annull instruction.
Branch prediction: assume target is known, fill delay slot
with first instruction in target block
If both destinations start with same instruction, ideal
choice for delay slot
Good heuristics for loops: assume that a backwards
conditional jump is usually taken. Move first instruction in
loop to delay slot for branch at end
Call instruction has delay slot: fill with parameter push
A greedy algorithm: list scheduling
Finding optimal schedule for DAG is NP-complete
Simple algorithm is O (N2) at worst, usually linear
Roots of DAG are instructions without predecessors
First pass: from leaves to roots: compute latest possible
starting time for each instruction to end of block
For leaf: execution time of instruction
For inner node: maximum delay imposed by successors
E.g. if In is followed by Im, Im can start at T – 4, and there is a
latency of 2 between In and Im, then In must start by T – 6.
List scheduling: second pass
Second pass: from roots to leaves: schedule instructions
with the greatest slack (farthest from block end) and that
can start as early as possible from now.
At each step:
D1: candidates with the largest remaining delay
D2: candidates with the earliest possible starting time (computed
from starting time of their predecessors)
Choose from D1 if unique, else from D2 if unique, else
use heuristics:
Choose earliest starting time, or
Choose instruction that uses least used pipeline, or
Choose instruction that frees register.
Procedure integration: inlining
Calls make optimizations harder. There is a large payoff
to local optimizations over large basic blocks: inlining
subprogram bodies is often very effective:
It exposes the values of the actuals in the body
It creates larger basic blocks
It saves the cost of the call
Can be done at the tree level or at the RTL level. In both
cases it can enable other optimizations.
Possible disadvantages: code size increase, debugging
is harder
Inlining as a tree transformation
Treat body of subprogram as a generic unit
Each inlined body needs its own local variables
Global references are captured at the point of definition
Inlining works like instantiation: replace formals with
actuals, complete analysis and expansion of inserted
body
Replace multiple return statements where needed
Introduce temporary to hold return value of function
Name capture: recognize global entities
function memo (x : integer) return integer is
local : integer := x **2;
begin
Saved := Saved + local + x; -- Saved is global
return Saved;
end memo;
…
Val := memo (15);
Becomes
declare
local : integer := 15 ** 2;
-- each inlining has its own
result : integer;
-- maybe superfluous if context is assignment
begin
Saved := Saved + local + 15l; -- Saved is the same entity in all inlinings
result := Saved;
end;
Val := result;
Handling return statements
Subprogram needs a label to serve a single exit point.
In a function: identify target of result, or create temporary
for it; replace return with assignment to target, followed
by goto to exit label
In a procedure: replace return with goto to exit label
Optimizations
if body of function is single return statement and context is
assignment, can replace right-hand side with expression
If procedure has no return statement, exit label is superfluous
Parameter passing
If actual is an expression, it is evaluated once: create temporary in
block and replace formal with temporary
Val2 := memo (x * f (z));
Becomes:
declare
C1 : integer := x * f (z);
local : integer := C1 ** 2;
begin
Saved := Saved + local + C1;
Val2 := Saved;
-- context is assignment
end;
Parameter passing: variables
An in-out parameter is a location, cannot create a
temporary for it: must use a renaming declaration.
procedure incr (x : in out integer) is
begin x := x + 1; end;
…
incr (a (i));
Becomes
declare
c1 : integer renames a (i);
begin
c1 := c1 + 1;
end;
Context includes more than global names
semantics of inlined call must be identical to original
program
If constraint checks are not suppressed in the body, they
must not be suppressed in the inlined block, even if
suppressed at the point of call.
Status of constraint checks is part of closure of body to
inline, must be applied when analyzing inlined block
Specialized inlining: loop unrolling
Create successive copies of body of loop: saves tests, makes bigger basic
block, increases instruction level parallelism
for j in 1 .. N loop
loop_body;
end loop;
Becomes
for k in 1 .. N / r loop
-- unroll r times
loop_ body
loop_body [j -> j +1]
-- replace loop variable for each unrolling
…
loop_ body [j -> j +r -1]
end;
for k in N / r +1.. N loop loop_body end loop; -- leftover iterations