Coursify

Compiler Design (CD)

Register Allocation and Target Code Generation

45 mins

The final stages of a compiler. Learn how liveness information is used to allocate the scarce CPU registers via graph coloring, and how the compiler selects and emits the final sequence of target machine instructions.

Learning Goals

  • Explain why register allocation is critical for performance and why it is NP-complete in general.
  • Build a register interference graph from liveness information.
  • Apply graph coloring to allocate registers and identify variables that must be spilled to memory.
  • Describe the steps of a simple target code generation algorithm for a basic block.
  • Explain instruction scheduling and how it exploits CPU pipeline parallelism.

The Register Allocation Problem

Registers are the fastest storage locations in a CPU, but they are extremely scarce (typically only 8 to 32 general-purpose registers). Register Allocation is the process of deciding which values (variables and temporaries) should reside in registers at any given time to minimize slow memory accesses (loads and stores).

Key Definitions:

  1. Register Allocation: Deciding which variables will be held in registers.
  2. Register Assignment: Deciding which specific physical register (e.g., R0, R1) each variable will use.

If there are more active variables than available registers, the compiler must perform Spilling — moving a value from a register back to memory (the stack).


Graph Coloring: The Mathematical Solution

How does a compiler know which variables can share the same register? It uses Liveness Analysis to build a Register Interference Graph (RIG).

  • Nodes: Every variable in the program.
  • Edges: An edge exists between two nodes if those variables are live at the same time (they "interfere" with each other and cannot share a register).

The Task: Find a kk-coloring of the graph, where kk is the number of available registers. If you can color the graph so no two adjacent nodes have the same color, you have successfully allocated registers without conflicts!

Visualizing Register Allocation via Graph Coloring

  1. 1
    Step 1

    Analyze the code to find the 'live range' of every variable. Example variables: a, b, c, d, e. Suppose:

    • a interferes with b, c.
    • b interferes with a, c, d.
    • c interferes with a, b, e.
    • d interferes with b, e.
    • e interferes with c, d.
  2. 2
    Step 2

    Draw a graph where an edge represents an interference.

  3. 3
    Step 3

    Assume we have 3 registers: R1 (Red), R2 (Green), R3 (Blue).

    1. Color a Red (R1).
    2. b interferes with a, so color b Green (R2).
    3. c interferes with a and b, so color c Blue (R3).
    4. d interferes with b, so color d Red (R1).
    5. e interferes with c and d, so color e Green (R2).

    Success! Every variable is assigned a register, and no two interfering variables share one.

  4. 4
    Step 4

    If the graph cannot be kk-colored, the compiler must pick a node to Spill. It removes the node from the graph, generates code to store it in memory, and then tries to color the remaining graph.

Chaitin's Algorithm for Register Allocation [PYQ 2019, 2023]

  1. 1
    Step 1

    Begin with live ranges from liveness analysis. Each variable becomes a node in the graph. Draw an edge between two nodes if their live ranges overlap (they are simultaneously live).

    Example — 5 variables {a, b, c, d, e} with k=3 registers:

    Degree check: deg(a)=3, deg(b)=3, deg(c)=4, deg(d)=3, deg(e)=3. Every node has degree ≥ k=3.

  2. 2
    Step 2

    Core invariant: A node with degree < k can always be colored if the rest of the graph is k-colorable.

    Scan the graph. If any node has degree < k, push it onto the stack and remove it from the graph. Recalculate neighbor degrees and repeat.

    In our graph, every node has degree ≥ 3 = k. No node qualifies for simplification. The simplify phase cannot proceed — we must spill.

  3. 3
    Step 3

    When no low-degree node exists, Chaitin's algorithm selects a node to spill (mark for memory).

    Spill heuristic: Choose the node with the lowest (spill cost ÷ current degree) ratio.

    • Spill cost = Σ (uses + defs) × 10^(loopDepth)
    • Nodes used outside loops are preferred spill candidates

    Assuming equal spill costs, pick node c (highest degree = 4).

    Remove c from the graph → spill list: [c]

    Reduced graph after removing c:

    New degrees: deg(a)=2, deg(b)=2, deg(d)=2, deg(e)=2. All < k=3! Simplify can now proceed.

  4. 4
    Step 4

    With c removed, all remaining nodes have degree 2 < k. Push them onto the stack:

    1. Pick a (deg 2) → Stack: [a] → Remove a (neighbors b, e update)
    2. Pick b (deg 1 after removing a) → Stack: [a, b] → Remove b
    3. Pick d (deg 1 after removing b) → Stack: [a, b, d] → Remove d
    4. Pick e (deg 0, isolated) → Stack: [a, b, d, e] → Graph is empty

    Final stack (bottom→top): a, b, d, e | Spill list: [c]

  5. 5
    Step 5

    Pop nodes off the stack in LIFO order, reinsert into graph, assign registers (R1, R2, R3):

    Pop e: No colored neighbors in current graph → Color e = R1 Pop d: Neighbor e uses R1. Available: R2, R3 → Color d = R2 Pop b: Neighbors: d(R2), c(spilled—no color). Available: R1, R3 → Color b = R1 Pop a: Neighbors: b(R1), e(R1), c(spilled). Available: R2, R3 → Color a = R2

    Final assignment: a→R2, b→R1, c→MEMORY(spilled), d→R2, e→R1

    3 registers suffice after spilling one variable. The algorithm handles NP-complete graph coloring in O(n log n) average time.

  6. 6
    Step 6

    For spilled variable c, the generated code must:

    • Before each use of c: emit LOAD Rtmp, [c_stack_offset]
    • After each definition of c: emit STORE Rtmp, [c_stack_offset]

    This adds memory access overhead but ensures register pressure stays within the k-register budget. If spilling still fails after re-coloring, repeat the process with additional spill candidates.

    Key takeaway: Chaitin's algorithm proves that despite register allocation being NP-complete, practical heuristics deliver high-quality results on real programs.

Target Code Generation

The final step of the compiler is emitting the actual assembly instructions. A simple code generator works on one Basic Block at a time.

Key Data Structures:

  1. Register Descriptor: Keeps track of what is currently stored in each register (e.g., R1 contains variable 'x').
  2. Address Descriptor: Keeps track of where the current value of each variable is located (e.g., 'x' is in R1 and also in memory at offset 8).

The Goal: For an instruction x = y + z, the code generator must:

  • Find a register for y (load it if not already in a register).
  • Find a register for z.
  • Perform the addition.
  • Update the descriptors to show x is now in the result register.

The GetReg Function — Complete Algorithm [PYQ 2018, 2020]

The core of a simple code generator for basic blocks is the GetReg function. For each operand in an instruction x = y op z, GetReg determines which register to use, handling loads, stores, and spills as needed.

GetReg(I) — Pseudocode:

1function GetReg(operand I): 2 1. If I is already in some register R, return R (no work needed). 3 2. If there is an empty (free) register R, return R. 4 3. If there is a register R whose value is "clean" (its value is also 5 stored in memory, so it can be overwritten safely): 6 - Evict R's old value from the register descriptor. 7 - Return R. 8 4. If all registers are "dirty" (occupied by live values not yet stored 9 to memory), we must spill: 10 - Find the register R whose current value will be needed farthest 11 in the future (furthest next-use). 12 - Generate: STORE R, [var_addr] to save the spilled value. 13 - Mark the variable as now having a memory copy. 14 - Return R.

Register Descriptor — tracks contents of each register:

RegisterContentsDirty?
R1{a}Yes (a not saved to memory)
R2{b}No (b already saved to [fp-4])
R3empty

Address Descriptor — tracks all storage locations of each variable:

VariableLocations
aR1
bR2, [fp-4]
c[fp-8]

Step-by-Step Code Generation Trace:

Given three-address code for a basic block:

1(1) t1 = a + b 2(2) t2 = t1 + c 3(3) d = t2 + t1

Assume initially: a→[fp-4], b→[fp-8], c→[fp-12], d→[fp-16], t1,t2 are temporaries. All registers (R1,R2,R3) are empty.

Instruction (1): t1 = a + b

  • GetReg(a): free register R1 available → LOAD R1, [fp-4]. RegDesc: R1→{a}. AddrDesc: a→R1,[fp-4].
  • GetReg(b): free register R2 available → LOAD R2, [fp-8]. RegDesc: R2→{b}. AddrDesc: b→R2,[fp-8].
  • Emit: ADD R1, R1, R2
  • RegDesc: R1→{t1}. AddrDesc: t1→R1. a no longer in R1; AddrDesc: a→[fp-4].

Instruction (2): t2 = t1 + c

  • GetReg(t1): t1 is in R1 → R1 (no load needed)
  • GetReg(c): free register R2 available → LOAD R2, [fp-12]. RegDesc: R2→{c}. AddrDesc: c→R2,[fp-12].
  • Emit: ADD R1, R1, R2
  • RegDesc: R1→{t2}. AddrDesc: t2→R1. t1 value overwritten.

Instruction (3): d = t2 + t1

  • GetReg(t2): t2 is in R1 → R1
  • GetReg(t1): t1 was in R1 but got overwritten. Not in any register. Free register R2 available → Reuse R2. But c was in R2 and still live? Assuming c is dead after (2): RegDesc: R2→empty. LOAD R2, [t1_mem] — but t1 was never stored! The compiler should have generated STORE R1, [t1_addr] before overwriting R1 with t2's result. This illustrates why spilling/store decisions matter.
  • Once t1 is reloaded: ADD R1, R1, R2
  • RegDesc: R1→{d}. AddrDesc: d→R1,[fp-16].

When to Spill (Step 4 of GetReg): If t1, t2, and c are all simultaneously live in R1, R2, R3, and we need a register for another operand:

  • Compute next-use distances: which value is needed farthest in the future?
  • Spill that register: STORE Ri, [addr] then reuse Ri.

The power of Instruction Scheduling

Modern CPUs use Pipelining — they start executing the next instruction before the current one finishes. If the next instruction depends on the result of the current one, the CPU must 'stall' (wait).

Instruction Scheduling is a back-end optimization where the compiler reorders instructions to fill these stalls with independent work, ensuring the CPU pipeline stays full and the code runs at maximum speed.

Common Questions

Instruction Scheduling for Pipelining [PYQ 2024]

Modern CPUs use Pipelining to execute multiple instructions concurrently. An instruction is broken into stages (Fetch, Decode, Execute, Writeback). Ideally, a new instruction enters the pipeline every clock cycle.

Hazards and Stalls: If instruction B depends on the result of instruction A, B must wait (stall) until A's result is written back. This reduces performance.

Instruction Scheduling Algorithm: The compiler reorders instructions to ensure that the "delay" between a producer and a consumer is filled with independent instructions.

Example: Suppose a LOAD takes 2 cycles to complete.

  • Unoptimized:
    1. LOAD R1, a
    2. ADD R2, R1, R3 (Stall for 1 cycle!)
    3. MOV R4, 5
  • Optimized (Scheduled):
    1. LOAD R1, a
    2. MOV R4, 5 (Independent work fills the stall)
    3. ADD R2, R1, R3 (R1 is now ready)

Dynamic Programming for Code Generation [PYQ 2019]

For machines with complex instruction sets and multiple registers, the Sethi-Ullman Algorithm (based on Dynamic Programming) can be used to generate the "optimal" code sequence for expression trees.

  1. Labeling Phase: The algorithm labels each node in the tree with the minimum number of registers needed to evaluate it.
  2. Generation Phase: It traverses the tree, always evaluating the child that requires more registers first, to minimize the total registers used (and prevent spilling).

Sethi-Ullman Algorithm for Expression Trees [PYQ 2019]

  1. 1
    Step 1

    The Sethi-Ullman algorithm generates optimal code (minimum registers) for evaluating an expression tree on a machine where all operations are non-commutative and use two operands.

    Phase 1 (Labeling): Bottom-up labeling of each node with the minimum number of registers needed to evaluate its subtree. Phase 2 (Code Generation): Top-down code emission, always evaluating the 'heavier' child first.

  2. 2
    Step 2

    For a node n with left label l and right label r:

    Leaf node (variable or constant):

    • If n is a left leaf (left child of its parent): label = 1
    • If n is a right leaf (right child of its parent): label = 0

    Internal node (operator):

    • If l ≠ r: label(n) = max(l, r)
    • If l == r: label(n) = l + 1

    Intuition: When both children need the same number of registers, evaluating one will tie them all up, so we need +1 more register to hold that result while evaluating the other child.

  3. 3
    Step 3

    Expression: (a + b) * (c + d) - (e / f)

             (-)
            /   \
          (*)    (/)
         /  \   /  \
       (+)  (+) e   f
      / \   / \
    a   b  c  d
    

    Labeling (bottom-up):

    Leaves: a (left) → 1, b (right) → 0, c (left) → 1, d (right) → 0, e (left) → 1, f (right) → 0

    Node (+ a,b): l=1, r=0, l≠r → max(1,0) = 1 Node (+ c,d): l=1, r=0, l≠r → max(1,0) = 1 Node (*): l=1, r=1, l==r → 1+1 = 2 Node (/): l=1, r=0, l≠r → max(1,0) = 1 Node (-): l=2, r=1, l≠r → max(2,1) = 2

    Root label = 2 registers needed for the entire expression.

  4. 4
    Step 4

    Traverse recursively. For node n with labels l (left) and r (right):

    Rule 1: If l > r, evaluate left child first → its result stays in a register → evaluate right child using remaining registers → emit the operation.

    Rule 2: If r > l, evaluate right child first → hold its result → evaluate left child → emit operation.

    Rule 3: If l == r, evaluate left first (arbitrary), store result in one register, then evaluate right (which needs same number of registers), emit operation.

    Key principle: always evaluate the more register-intensive subtree first.

  5. 5
    Step 5

    Traversing the tree for ((a+b)*(c+d)) - (e/f) where root label=2 (left=2, right=1), we evaluate the heavier left child (*) first.

    At node (*): l=1, r=1, equal → evaluate left (+) first. At node (+ a,b): l=1(left leaf a), r=0(right leaf b) - LOAD R1, a ; evaluate left child (label 1 → needs register) - ADD R1, R1, b ; right leaf (label 0) can use memory operand → Result in R1 = a+b

    Back at (*), now evaluate right child: At node (+ c,d): l=1, r=0 - LOAD R2, c - ADD R2, R2, d → Result in R2 = c+d

    Back at (*): MUL R1, R1, R2 → R1 = (a+b)*(c+d)

    Back at (-), evaluate right child (/): At node (/): l=1, r=0 - LOAD R2, e - DIV R2, R2, f → Result in R2 = e/f

    At root (-): SUB R1, R1, R2 → R1 = final result

    Total: 8 instructions using only 2 registers.

  6. 6
    Step 6
    1LOAD R1, a ; R1 = a 2ADD R1, R1, b ; R1 = a + b 3LOAD R2, c ; R2 = c 4ADD R2, R2, d ; R2 = c + d 5MUL R1, R1, R2 ; R1 = (a+b)*(c+d) 6LOAD R2, e ; R2 = e 7DIV R2, R2, f ; R2 = e / f 8SUB R1, R1, R2 ; R1 = ((a+b)*(c+d)) - (e/f)

    Register usage: Only 2 registers (R1, R2) for the entire 8-instruction sequence. A naive left-to-right evaluation would have used 3+ registers.

    Complexity: O(n) time for an n-node expression tree. The algorithm is optimal for homogeneous registers on a non-commutative machine model.

Knowledge Check

Question 1 of 7
Q1Single choice

In a Register Interference Graph (RIG), what does an edge between two variables represent?