Code Optimization Techniques

50 mins

Make compiled code faster without changing its meaning. Covers the full spectrum of optimizations from simple peephole window transformations to powerful global and loop-level improvements including code motion, strength reduction, and cache-oriented loop optimizations.

Learning Goals

Define code optimization and state the fundamental correctness constraint it must obey.
Apply peephole optimization to identify and eliminate redundant loads, stores, and jumps.
Identify loop-invariant computations and apply loop-invariant code motion.
Apply strength reduction to replace expensive operations (multiplication) with cheaper ones (addition) inside loops.
Explain induction variable elimination and its impact on loop performance.
Explain the memory hierarchy and how loop interchange and loop tiling improve cache performance.

The Goal of Code Optimization

Code optimization is the process of transforming a program so that it executes more efficiently (faster) or consumes fewer resources (less memory/power) without changing its semantic meaning.

The Three Fundamental Constraints:

Correctness: The optimized code must produce the exact same output as the original for all inputs.
Efficiency: On average, the transformation must improve the program's performance.
Feasibility: The optimization itself must not take so long that it makes the compilation process impractical.

Levels of Optimization

Optimizations are typically categorized by the scope of code they analyze:

Level	Scope	Example
Local	Within a single Basic Block.	Algebraic simplification, constant folding.
Global	Across multiple basic blocks (the CFG).	Global dead code elimination, constant propagation.
Loop	Within loops (where programs spend 90% of time).	Code motion, strength reduction, loop tiling.
Peephole	A tiny "sliding window" of 2-3 instructions.	Redundant load/store elimination.

Peephole optimization looks at a very small window of target instructions and replaces them with a shorter or faster sequence.

Common Patterns:

Redundant Load/Store:

1MOV R0, a
2MOV a, R0  <-- Redundant! Delete.

Algebraic Simplification:
- Replace x = x + 0 or x = x * 1 with nothing.
- Replace x = x * 2 with x = x + x or x = x << 1 (Faster).

Unreachable Code:

1GOTO L2
2MOV R0, 5  <-- This can NEVER execute. Delete.
3L2: ...

Flow of Control:
- Replace GOTO L1 ... L1: GOTO L2 with GOTO L2.
Null Sequence Elimination:
- Remove instructions that have no effect, e.g. x = x + 0, x = x * 1, or a GOTO that jumps to the immediately following label.
Copy Propagation (Peephole-level):
```
1MOV R1, R2    ; R1 = R2
2ADD R3, R1    ; use of R1
```
If R1 is only used once after the copy, replace ADD R3, R1 with ADD R3, R2 and delete the MOV.
Cross-Jump Optimization: When two different code sequences end with the same sequence of instructions, merge them into a single shared block and jump to it, reducing code size.
```
1; Sequence A ending:
2ADD R1, R2
3SUB R3, R4
4; Sequence B ending:
5ADD R1, R2
6SUB R3, R4
```
Both can jump to one shared ADD; SUB block.

Visualizing a Multi-Pass Optimization

Step 1

1t1 = 10
2t2 = i * t1
3t3 = 5 + 5
4t4 = j + t3
5t5 = t2 * t4

2
Step 2
The compiler identifies 5 + 5 can be pre-calculated.

1t1 = 10 2t2 = i * t1 3t3 = 10 <-- Folded 4t4 = j + t3 5t5 = t2 * t4

Step 3

The values of t1 and t3 are known constants, so we propagate them.

1t1 = 10
2t2 = i * 10    <-- Propagated
3t3 = 10
4t4 = j + 10    <-- Propagated
5t5 = t2 * t4

4
Step 4
After propagation, the assignments t1 = 10 and t3 = 10 are no longer needed by any instruction.

1t2 = i * 10 2t4 = j + 10 3t5 = t2 * t4

Result: We reduced 5 instructions to 3 without changing the outcome.

Strength Reduction: A Complete Trace

1
Step 1
1for(i=0; i<10; i++) { 2 y = i * 4; 3 a[y] = 0; 4}

Each iteration computes y = i * 4 — a multiplication inside the loop.
2
Step 2
i is a basic induction variable (increments by 1 each iteration). y = i * 4 is a derived induction variable (changes by a fixed constant each iteration: 4). This makes y eligible for strength reduction.
3
Step 3
Before the loop, set y to its initial value: y = 0 * 4 = 0

1y = 0; 2for(i=0; i<10; i++) { 3 a[y] = 0; 4}
4
Step 4
Instead of recalculating y = i * 4, increment y by the step constant 4:

1y = 0; 2for(i=0; i<10; i++) { 3 a[y] = 0; 4 y = y + 4; 5}

Multiplication → addition: strength reduced.
5
Step 5
If i is only used to compute y (and not after the loop), we can replace the loop test with y < 40 and eliminate i:

1y = 0; 2while(y < 40) { 3 a[y] = 0; 4 y = y + 4; 5}

Final result: No multiplication, no i variable — 2 additions per iteration.

The Induction Variable Trap

In loop optimization, an Induction Variable is a variable that changes by a constant amount every time the loop executes (like i++).

Induction Variable Elimination is a powerful technique that removes the original index variable entirely if it is only used to compute other variables. However, the compiler must be careful: it can only eliminate i if i is not needed after the loop terminates!

Loop Optimization for Cache Memory

Modern CPUs are 100–200× faster than main memory. A loop that strides through memory inefficiently can spend most of its time waiting for cache rather than computing. Cache-aware loop transformations are therefore critical for performance.

The Memory Hierarchy

Each cache loads data in cache lines (typically 64 bytes). Two key principles:

Temporal Locality: If a memory location is accessed now, it will likely be accessed again soon — keep it in cache.
Spatial Locality: If a memory location is accessed, nearby locations will likely be accessed soon — a whole cache line is fetched together.

Access Pattern	Locality Type	Cache Behavior
Sequential `a[i]` traversal	Spatial	Excellent — one line serves ~16 ints
Re-accessing same element	Temporal	Good — stays in cache if used soon
Stride-N `a[j][i]` column access	Neither	Poor — each access may fetch a new line

Loop Interchange vs. Loop Tiling — Quick Comparison

Transformation	What it does	When to use	Typical gain
Loop Interchange	Swap inner/outer loop order	Wrong traversal order (column vs row)	2×–10×
Loop Tiling	Partition into cache-sized blocks	Reused data evicted between sweeps	2×–5×

Swapping the inner and outer loop to change the traversal order from column-major to row-major (or vice versa) so that consecutive iterations access adjacent memory locations in the same cache line.

Before — Column-major (cache misses):

1for (i = 0; i < N; i++)
2  for (j = 0; j < N; j++)
3    a[j][i] = b[j][i] + c[j][i];

C arrays are row-major: a[j][i] jumps N elements per i step → poor spatial locality.

After — Row-major (cache hits):

1for (j = 0; j < N; j++)
2  for (i = 0; i < N; i++)
3    a[j][i] = b[j][i] + c[j][i];

Now consecutive i values access adjacent elements of row j → one cache line serves ~16 iterations.

Why Cache Optimization Matters

A modern CPU can execute 3–5 instructions per nanosecond, but fetching data from DRAM takes ~100 ns. If a loop accesses memory with poor locality, the CPU stalls for 99% of the time waiting for data. Loop interchange and loop tiling can yield 2× to 10× speedups on real hardware — often more impact than any other compiler optimization.

Common Questions

Knowledge Check

Question 1 of 6

Q1Single choice

Which optimization involves moving a computation out of a loop if its result doesn't change during iterations?

Constant Folding

Strength Reduction

Code Motion

Peephole Optimization

Compiler Design - Code Optimization (TutorialsPoint)

web

Intermediate Code Forms — TAC, Quadruples and Triples

Data Flow Analysis