Code Optimization Techniques
Make compiled code faster without changing its meaning. Covers the full spectrum of optimizations from simple peephole window transformations to powerful global and loop-level improvements including code motion, strength reduction, and cache-oriented loop optimizations.
Learning Goals
- Define code optimization and state the fundamental correctness constraint it must obey.
- Apply peephole optimization to identify and eliminate redundant loads, stores, and jumps.
- Identify loop-invariant computations and apply loop-invariant code motion.
- Apply strength reduction to replace expensive operations (multiplication) with cheaper ones (addition) inside loops.
- Explain induction variable elimination and its impact on loop performance.
- Explain the memory hierarchy and how loop interchange and loop tiling improve cache performance.
The Goal of Code Optimization
Code optimization is the process of transforming a program so that it executes more efficiently (faster) or consumes fewer resources (less memory/power) without changing its semantic meaning.
The Three Fundamental Constraints:
- Correctness: The optimized code must produce the exact same output as the original for all inputs.
- Efficiency: On average, the transformation must improve the program's performance.
- Feasibility: The optimization itself must not take so long that it makes the compilation process impractical.
Levels of Optimization
Optimizations are typically categorized by the scope of code they analyze:
| Level | Scope | Example |
|---|---|---|
| Local | Within a single Basic Block. | Algebraic simplification, constant folding. |
| Global | Across multiple basic blocks (the CFG). | Global dead code elimination, constant propagation. |
| Loop | Within loops (where programs spend 90% of time). | Code motion, strength reduction, loop tiling. |
| Peephole | A tiny "sliding window" of 2-3 instructions. | Redundant load/store elimination. |
Peephole optimization looks at a very small window of target instructions and replaces them with a shorter or faster sequence.
Common Patterns:
-
Redundant Load/Store:
1MOV R0, a 2MOV a, R0 <-- Redundant! Delete. -
Algebraic Simplification:
- Replace
x = x + 0orx = x * 1with nothing. - Replace
x = x * 2withx = x + xorx = x << 1(Faster).
- Replace
-
Unreachable Code:
1GOTO L2 2MOV R0, 5 <-- This can NEVER execute. Delete. 3L2: ... -
Flow of Control:
- Replace
GOTO L1 ... L1: GOTO L2withGOTO L2.
- Replace
-
Null Sequence Elimination:
- Remove instructions that have no effect, e.g.
x = x + 0,x = x * 1, or aGOTOthat jumps to the immediately following label.
- Remove instructions that have no effect, e.g.
-
Copy Propagation (Peephole-level):
1MOV R1, R2 ; R1 = R2 2ADD R3, R1 ; use of R1If R1 is only used once after the copy, replace
ADD R3, R1withADD R3, R2and delete the MOV. -
Cross-Jump Optimization: When two different code sequences end with the same sequence of instructions, merge them into a single shared block and jump to it, reducing code size.
1; Sequence A ending: 2ADD R1, R2 3SUB R3, R4 4; Sequence B ending: 5ADD R1, R2 6SUB R3, R4Both can jump to one shared
ADD; SUBblock.
Visualizing a Multi-Pass Optimization
- 1Step 1
1t1 = 10 2t2 = i * t1 3t3 = 5 + 5 4t4 = j + t3 5t5 = t2 * t4 - 2Step 2
The compiler identifies
5 + 5can be pre-calculated.1t1 = 10 2t2 = i * t1 3t3 = 10 <-- Folded 4t4 = j + t3 5t5 = t2 * t4 - 3Step 3
The values of
t1andt3are known constants, so we propagate them.1t1 = 10 2t2 = i * 10 <-- Propagated 3t3 = 10 4t4 = j + 10 <-- Propagated 5t5 = t2 * t4 - 4Step 4
After propagation, the assignments
t1 = 10andt3 = 10are no longer needed by any instruction.1t2 = i * 10 2t4 = j + 10 3t5 = t2 * t4Result: We reduced 5 instructions to 3 without changing the outcome.
Strength Reduction: A Complete Trace
- 1Step 1
1for(i=0; i<10; i++) { 2 y = i * 4; 3 a[y] = 0; 4}Each iteration computes
y = i * 4— a multiplication inside the loop. - 2Step 2
iis a basic induction variable (increments by 1 each iteration).y = i * 4is a derived induction variable (changes by a fixed constant each iteration:4). This makesyeligible for strength reduction. - 3Step 3
Before the loop, set
yto its initial value:y = 0 * 4 = 01y = 0; 2for(i=0; i<10; i++) { 3 a[y] = 0; 4} - 4Step 4
Instead of recalculating
y = i * 4, incrementyby the step constant4:1y = 0; 2for(i=0; i<10; i++) { 3 a[y] = 0; 4 y = y + 4; 5}Multiplication → addition: strength reduced.
- 5Step 5
If
iis only used to computey(and not after the loop), we can replace the loop test withy < 40and eliminatei:1y = 0; 2while(y < 40) { 3 a[y] = 0; 4 y = y + 4; 5}Final result: No multiplication, no
ivariable — 2 additions per iteration.
The Induction Variable Trap
In loop optimization, an Induction Variable is a variable that changes by a constant amount every time the loop executes (like i++).
Induction Variable Elimination is a powerful technique that removes the original index variable entirely if it is only used to compute other variables. However, the compiler must be careful: it can only eliminate i if i is not needed after the loop terminates!
Loop Optimization for Cache Memory
Modern CPUs are 100–200× faster than main memory. A loop that strides through memory inefficiently can spend most of its time waiting for cache rather than computing. Cache-aware loop transformations are therefore critical for performance.
The Memory Hierarchy
Each cache loads data in cache lines (typically 64 bytes). Two key principles:
- Temporal Locality: If a memory location is accessed now, it will likely be accessed again soon — keep it in cache.
- Spatial Locality: If a memory location is accessed, nearby locations will likely be accessed soon — a whole cache line is fetched together.
| Access Pattern | Locality Type | Cache Behavior |
|---|---|---|
Sequential a[i] traversal | Spatial | Excellent — one line serves ~16 ints |
| Re-accessing same element | Temporal | Good — stays in cache if used soon |
Stride-N a[j][i] column access | Neither | Poor — each access may fetch a new line |
Loop Interchange vs. Loop Tiling — Quick Comparison
| Transformation | What it does | When to use | Typical gain |
|---|---|---|---|
| Loop Interchange | Swap inner/outer loop order | Wrong traversal order (column vs row) | 2×–10× |
| Loop Tiling | Partition into cache-sized blocks | Reused data evicted between sweeps | 2×–5× |
Swapping the inner and outer loop to change the traversal order from column-major to row-major (or vice versa) so that consecutive iterations access adjacent memory locations in the same cache line.
Before — Column-major (cache misses):
1for (i = 0; i < N; i++) 2 for (j = 0; j < N; j++) 3 a[j][i] = b[j][i] + c[j][i];
C arrays are row-major: a[j][i] jumps N elements per i step → poor spatial locality.
After — Row-major (cache hits):
1for (j = 0; j < N; j++) 2 for (i = 0; i < N; i++) 3 a[j][i] = b[j][i] + c[j][i];
Now consecutive i values access adjacent elements of row j → one cache line serves ~16 iterations.
Why Cache Optimization Matters
A modern CPU can execute 3–5 instructions per nanosecond, but fetching data from DRAM takes ~100 ns. If a loop accesses memory with poor locality, the CPU stalls for 99% of the time waiting for data. Loop interchange and loop tiling can yield 2× to 10× speedups on real hardware — often more impact than any other compiler optimization.
Common Questions
Knowledge Check
Which optimization involves moving a computation out of a loop if its result doesn't change during iterations?