Coursify

Compiler Design (CD)

Bottom-Up Parsing: LR(0), SLR(1), LR(1), LALR(1)

50

Shift your perspective from Top-Down to Bottom-Up. Learn how Shift-Reduce parsers work, how to identify parsing conflicts, and understand the hierarchy of LR parsing algorithms used in modern compilers.

Learning Goals

  • Explain the core mechanism of Shift-Reduce parsing (Shift, Reduce, Accept, Error).
  • Identify and differentiate between Shift-Reduce and Reduce-Reduce conflicts.
  • Construct the Canonical Collection of LR(0) items and build a parsing state machine (DFA).
  • Describe the hierarchical relationship between LR(0), SLR(1), LALR(1), and LR(1) parsers.
  • Explain how SLR(1) uses FOLLOW sets to resolve conflicts present in LR(0) parsing tables.

Introduction to Bottom-Up Parsing

While Top-Down (LL) parsing starts at the root and attempts to derive the input string, Bottom-Up Parsing starts with the input string (the leaves) and attempts to compress it backwards into the Start Symbol (the root).

This process is mathematically equivalent to tracing a Rightmost Derivation in reverse—a core property frequently evaluated in advanced compiler design assessments. The most powerful class of bottom-up parsers are the LR Parsers.

What does LR(1) stand for?

  • L: Scans the input from Left to right.
  • R: Produces a Rightmost derivation (in reverse).
  • (1): Uses exactly 1 token of lookahead.

Handle Pruning: A 'handle' is a substring that matches the right side of a production rule and represents one step backwards in a rightmost derivation. The entire basis of bottom-up parsing is Handle Pruning—finding this handle on the stack and 'pruning' (reducing) it to its left-hand side non-terminal.

LR parsers are strictly more powerful than LL parsers. Almost all modern production compilers (and parser generators like Yacc/Bison) use LR parsing because it can handle a much larger class of grammars (including left-recursive ones) without requiring major structural rewrites.

An LR Parser maintains a Stack (which stores states and grammar symbols) and an Input Buffer. At any step, it looks up the top state on the stack and the current input token in its parsing table. It will perform one of four actions:

  1. Shift (sns_n): Push the current input token onto the stack, and transition to a new state nn. Advance the input pointer.
  2. Reduce (rkr_k): The parser has found a complete right-hand side (a 'handle') matching grammar rule kk (AβA \to \beta). It pops β\beta off the stack, and pushes the left-hand side AA.
  3. Accept: The parser has successfully reduced the entire input back to the Start Symbol and the input is empty.
  4. Error: The table entry is blank, meaning a syntax error was found in the input.

Parsing Conflicts: When the Parser gets confused

An LR parser is deterministic. If a cell in the parsing table contains more than one action, the grammar cannot be parsed by that specific algorithm. There are two types of conflicts:

  1. Shift-Reduce Conflict: The parser sees a valid handle it could reduce, but it also sees a valid terminal it could shift. It doesn't know whether to reduce now, or keep shifting to find a larger handle. (This often happens with the classic 'dangling-else' problem).

  2. Reduce-Reduce Conflict: The parser sees a handle on the stack that matches the right-hand side of two different grammar rules. It doesn't know which rule to reduce by.

Building the Foundation: LR(0) Items

  1. 1
    Step 1

    An LR(0) item is simply a grammar production with a dot (\bullet) inserted somewhere in the right side. The dot indicates how much of a rule the parser has seen so far.

    For the rule AXYZA \to XYZ, the possible items are:

    • AXYZA \to \bullet XYZ (Haven't seen anything yet)
    • AXYZA \to X \bullet YZ (Seen XX, expecting YY next)
    • AXYZA \to XY \bullet Z (Seen XYXY, expecting ZZ next)
    • AXYZA \to XYZ \bullet (Seen everything — ready to Reduce!)
  2. 2
    Step 2

    Before we start, we add a brand new start symbol SS' to the grammar, with a single rule pointing to the old start symbol SS:

    SSS' \to S

    This is called the Augmented Grammar. Its sole purpose is to give the parser exactly one clear state where it knows it should halt and Accept (when it reaches SSS' \to S \bullet).

  3. 3
    Step 3

    If there is a dot immediately before a non-terminal, it means we are expecting to see that non-terminal. Therefore, we must add all the rules for that non-terminal to our current state (with the dot at the beginning).

    Example: If state I0I_0 contains: ETEE \to \bullet T E' Since the dot is before TT, we must add all TT rules to I0I_0: TFTT \to \bullet F T' Now the dot is before FF, so we must add all FF rules: F(E)F \to \bullet ( E ) FidF \to \bullet id

  4. 4
    Step 4

    The GOTO(I,X)\text{GOTO}(I, X) function computes the new state the parser enters when it is in state II and sees grammar symbol XX (which can be a terminal or non-terminal).

    Algorithm:

    1. Look at all items in II where the dot is right before XX (e.g., AαXβA \to \alpha \bullet X \beta).
    2. Move the dot past XX: AαXβA \to \alpha X \bullet \beta.
    3. Create a new state with these items, and apply the CLOSURE()\text{CLOSURE}() operation to expand it.
  5. 5
    Step 5

    To build a CLR(1) or LALR(1) table, we must compute LR(1) Items. An LR(1) item looks like this: [Aαβ,a][A \to \alpha \bullet \beta, a], where aa is the strict lookahead terminal.

    The Golden Rule for LR(1) Closures: If you have an item [AαBβ,a][A \to \alpha \bullet B \beta, a] in your state, you must add the closure for BB. The lookahead for the new BB items will be exactly FIRST(βa)\text{FIRST}(\beta a).

    Example: Suppose State I0I_0 contains: [SCC,$][S \to \bullet CC, \char36]

    • The dot is before CC. We must add CcCC \to \bullet cC and CdC \to \bullet d.
    • What is the lookahead? Here, α\alpha is empty, BB is CC, β\beta is CC, and aa is $\char36.
    • Lookahead = FIRST(C$)\text{FIRST}(C\char36). Since CC starts with 'c' or 'd', FIRST(C$)\text{FIRST}(C\char36) is {c,d}\{c, d\}.
    • So we add: [CcC,c/d][C \to \bullet cC, c/d] and [Cd,c/d][C \to \bullet d, c/d].

Constructing a Canonical LR(1) Parsing Table

  1. 1
    Step 1

    Start with the augmented grammar by adding SSS' \to S. This ensures exactly one accepting state. We'll trace the classic grammar:

    SSS' \to S SCCS \to CC CcCdC \to cC \mid d

    Step 0: Begin with the initial item [SS,$][S' \to \bullet S, \char36] and apply CLOSURE.

  2. 2
    Step 2

    The LR(1) Closure Rule: If [AαBβ,a][A \to \alpha \bullet B \beta, a] is in the state, add all BB productions as [Bγ,b][B \to \bullet \gamma, b] where bFIRST(βa)b \in \text{FIRST}(\beta a).

    Start: [SS,$][S' \to \bullet S, \char36]

    • Dot is before SS (non-terminal), so add SS rules with lookahead = FIRST(ε$)={$}\text{FIRST}(\varepsilon \cdot \char36) = \{\char36\} → Add [SCC,$][S \to \bullet CC, \char36]
    • Now dot is before CC. Lookahead = FIRST(C$)\text{FIRST}(C\char36). Since CC derives c or d, FIRST(C$)={c,d}\text{FIRST}(C\char36) = \{c, d\}. → Add [CcC,c/d][C \to \bullet cC, c/d] and [Cd,c/d][C \to \bullet d, c/d]

    State I0I_0 = {[SS,$], [SCC,$], [CcC,c/d], [Cd,c/d]}\{[S' \to \bullet S, \char36],\ [S \to \bullet CC, \char36],\ [C \to \bullet cC, c/d],\ [C \to \bullet d, c/d]\}

  3. 3
    Step 3

    For each state, compute GOTO\text{GOTO} on every grammar symbol that appears after a dot. Apply CLOSURE\text{CLOSURE} after moving the dot.

    GOTO(I0I_0, SS) = I1I_1: {[SS,$]}\{[S' \to S \bullet, \char36]\} (No closure needed — dot at end)

    GOTO(I0I_0, CC) = I2I_2: Move dot past CC in [SCC,$][S \to \bullet CC, \char36][SCC,$][S \to C \bullet C, \char36]. Apply closure on CC with lookahead FIRST($)={$}\text{FIRST}(\char36) = \{\char36\}: I2={[SCC,$], [CcC,$], [Cd,$]}I_2 = \{[S \to C \bullet C, \char36],\ [C \to \bullet cC, \char36],\ [C \to \bullet d, \char36]\}

    GOTO(I0I_0, cc) = I3I_3: Move dot past c in [CcC,c/d][C \to \bullet cC, c/d][CcC,c/d][C \to c \bullet C, c/d]. Apply closure on CC with lookahead FIRST(c/d)={c,d}\text{FIRST}(c/d) = \{c, d\}: I3={[CcC,c/d], [CcC,c/d], [Cd,c/d]}I_3 = \{[C \to c \bullet C, c/d],\ [C \to \bullet cC, c/d],\ [C \to \bullet d, c/d]\}

    GOTO(I0I_0, dd) = I4I_4: {[Cd,c/d]}\{[C \to d \bullet, c/d]\} (Reduce state — dot at end)

  4. 4
    Step 4

    Continue computing GOTO\text{GOTO} for every state-symbol pair until no new states emerge.

    GOTO(I2I_2, CC) = I5I_5: {[SCC,$]}\{[S \to CC \bullet, \char36]\} (Reduce state)

    GOTO(I2I_2, cc) = I6I_6: {[CcC,$], [CcC,$], [Cd,$]}\{[C \to c \bullet C, \char36],\ [C \to \bullet cC, \char36],\ [C \to \bullet d, \char36]\}

    GOTO(I2I_2, dd) = I7I_7: {[Cd,$]}\{[C \to d \bullet, \char36]\}

    GOTO(I3I_3, CC) = I8I_8: {[CcC,c/d]}\{[C \to cC \bullet, c/d]\}

    GOTO(I3I_3, cc) = I3I_3 (loop on c)

    GOTO(I3I_3, dd) = I4I_4 (existing state)

    GOTO(I6I_6, CC) = I9I_9: {[CcC,$]}\{[C \to cC \bullet, \char36]\}

    GOTO(I6I_6, cc) = I6I_6 (loop), GOTO(I6I_6, dd) = I7I_7 (existing)

    Total: 10 distinct LR(1) states (I0I_0 through I9I_9). Contrast this with only 7 LR(0) states for the same grammar—the lookahead information splits states apart.

  5. 5
    Step 5

    The final LR(1) parsing table for grammar SCC, CcCdS \to CC,\ C \to cC \mid d:

    Statecd\char36SC
    I0I_0s3s_3s4s_412
    I1I_1acc
    I2I_2s6s_6s7s_75
    I3I_3s3s_3s4s_48
    I4I_4r4r_4r4r_4
    I5I_5r2r_2
    I6I_6s6s_6s7s_79
    I7I_7r4r_4
    I8I_8r3r_3r3r_3
    I9I_9r3r_3

    Key observation: State I4I_4 (reduce CdC \to d) has lookahead {c, d}, while I7I_7 (reduce CdC \to d) has lookahead {\char36}. LR(1) splits what LR(0) would merge into a single state.

Operator Precedence Parsing is a classical shift-reduce parsing technique designed specifically for operator grammars—grammars where no production has two adjacent non-terminals.

Instead of building a full DFA of states, operator precedence parsers define three precedence relations between every pair of terminal symbols:

  1. aba \lessdot b (yields precedence): Terminal aa has lower precedence than bb. When aa is on top of stack and bb is next in input, the parser must shift bb.
  2. aba \doteq b (equal precedence): aa and bb have the same precedence level. The parser continues shifting.
  3. aba \gtrdot b (takes precedence): Terminal aa has higher precedence than bb. When aa is on top of stack and bb is next, the parser must reduce the handle ending with aa before bb can be processed.

The $\char36 marker: The end-of-input marker $\char36 has the lowest possible precedence: $a\char36 \lessdot a and a$a \gtrdot \char36 for any terminal aa.

Comparing State Counts Across LR Variants

For a given grammar, the number of states in the canonical collection varies dramatically depending on the LR variant. Consider our running grammar SCC, CcCdS \to CC,\ C \to cC \mid d:

LR VariantNumber of StatesReason
LR(0)7Items track only core — no lookahead information.
SLR(1)7Uses LR(0) state machine; only FOLLOW\text{FOLLOW} sets differ in the table.
LALR(1)7Merges LR(1) states with identical cores back into fewer states.
Canonical LR(1)10Each state is split by distinct lookahead tokens.

For real programming languages, the state explosion is much more severe. A grammar producing ~300 SLR(1) states can easily generate 3,000–10,000 canonical LR(1) states.

Why LALR(1) is the Industry Sweet Spot

LALR(1) achieves the ideal engineering compromise:

  • Same table size as SLR(1): By merging LR(1) states with identical cores, the state machine is heavily compressed.
  • High Power: The lookaheads per merged state are the union of the lookaheads from the constituent LR(1) states. This precise context resolves the vast majority of conflicts that SLR(1) cannot.
  • The Caveat: Merging can theoretically introduce new Reduce-Reduce conflicts (if two non-overlapping lookaheads merge and overlap). However, Shift-Reduce conflicts are never introduced by merging.
  • Yacc and Bison default to LALR(1) because it fits cleanly in memory while handling virtually all practical language constructs.

The Power Hierarchy of LR Parsers

Not all LR parsers are created equal. As we attempt to resolve conflicts by integrating more specific lookahead data, the algorithms grow mathematically stronger:

The relationship forms a strict subset hierarchy: LR(0)SLR(1)LALR(1)LR(1)LR(0) \subset SLR(1) \subset LALR(1) \subset LR(1). (If a grammar is LR(0), it is automatically SLR(1), but not vice versa).

Power: Weakest. Table Size: Small.

How it reduces: If state II contains a completed item AαA \to \alpha \bullet, LR(0) blindly puts the reduce action rAr_A in every single column of that state's row in the parsing table.

The Problem: It doesn't look ahead at the next input token. It just assumes if a rule is complete, it must reduce. This naturally creates massive numbers of Shift-Reduce conflicts.

Common Questions

Knowledge Check

Question 1 of 7
Q1Single choice

Which of the following describes a Shift-Reduce conflict?

Heuristics for Exam Classification

How to quickly determine which LR class a grammar belongs to:

  1. Build the LR(0) automaton first. This DFA forms the backbone of ALL LR parsers.
  2. Check LR(0): If filling the table (placing 'reduce' in ALL columns for completed states) yields any cell with multiple actions → the grammar is NOT LR(0).
  3. Check SLR(1): Recompute reduce logic using FOLLOW sets. If conflicts vanish → the grammar IS SLR(1). If conflicts persist → it's NOT SLR(1).
  4. Check LR(1): Build the full canonical LR(1) DFA. If conflicts exist here → the grammar is fundamentally ambiguous.
  5. Check LALR(1): If LR(1) has no conflicts, merge states with identical cores. If the compressed table has no Reduce-Reduce conflicts → the grammar IS LALR(1).

Golden Rule regarding DFA states: States in LR(0)=States in SLR(1)=States in LALR(1)<States in LR(1)\text{States in LR(0)} = \text{States in SLR(1)} = \text{States in LALR(1)} < \text{States in LR(1)}