Bottom-Up Parsing: LR(0), SLR(1), LR(1), LALR(1)
Shift your perspective from Top-Down to Bottom-Up. Learn how Shift-Reduce parsers work, how to identify parsing conflicts, and understand the hierarchy of LR parsing algorithms used in modern compilers.
Learning Goals
- Explain the core mechanism of Shift-Reduce parsing (Shift, Reduce, Accept, Error).
- Identify and differentiate between Shift-Reduce and Reduce-Reduce conflicts.
- Construct the Canonical Collection of LR(0) items and build a parsing state machine (DFA).
- Describe the hierarchical relationship between LR(0), SLR(1), LALR(1), and LR(1) parsers.
- Explain how SLR(1) uses FOLLOW sets to resolve conflicts present in LR(0) parsing tables.
Introduction to Bottom-Up Parsing
While Top-Down (LL) parsing starts at the root and attempts to derive the input string, Bottom-Up Parsing starts with the input string (the leaves) and attempts to compress it backwards into the Start Symbol (the root).
This process is mathematically equivalent to tracing a Rightmost Derivation in reverse—a core property frequently evaluated in advanced compiler design assessments. The most powerful class of bottom-up parsers are the LR Parsers.
What does LR(1) stand for?
- L: Scans the input from Left to right.
- R: Produces a Rightmost derivation (in reverse).
- (1): Uses exactly 1 token of lookahead.
Handle Pruning: A 'handle' is a substring that matches the right side of a production rule and represents one step backwards in a rightmost derivation. The entire basis of bottom-up parsing is Handle Pruning—finding this handle on the stack and 'pruning' (reducing) it to its left-hand side non-terminal.
LR parsers are strictly more powerful than LL parsers. Almost all modern production compilers (and parser generators like Yacc/Bison) use LR parsing because it can handle a much larger class of grammars (including left-recursive ones) without requiring major structural rewrites.
An LR Parser maintains a Stack (which stores states and grammar symbols) and an Input Buffer. At any step, it looks up the top state on the stack and the current input token in its parsing table. It will perform one of four actions:
- Shift (): Push the current input token onto the stack, and transition to a new state . Advance the input pointer.
- Reduce (): The parser has found a complete right-hand side (a 'handle') matching grammar rule (). It pops off the stack, and pushes the left-hand side .
- Accept: The parser has successfully reduced the entire input back to the Start Symbol and the input is empty.
- Error: The table entry is blank, meaning a syntax error was found in the input.
Parsing Conflicts: When the Parser gets confused
An LR parser is deterministic. If a cell in the parsing table contains more than one action, the grammar cannot be parsed by that specific algorithm. There are two types of conflicts:
-
Shift-Reduce Conflict: The parser sees a valid handle it could reduce, but it also sees a valid terminal it could shift. It doesn't know whether to reduce now, or keep shifting to find a larger handle. (This often happens with the classic 'dangling-else' problem).
-
Reduce-Reduce Conflict: The parser sees a handle on the stack that matches the right-hand side of two different grammar rules. It doesn't know which rule to reduce by.
Building the Foundation: LR(0) Items
- 1Step 1
An LR(0) item is simply a grammar production with a dot () inserted somewhere in the right side. The dot indicates how much of a rule the parser has seen so far.
For the rule , the possible items are:
- (Haven't seen anything yet)
- (Seen , expecting next)
- (Seen , expecting next)
- (Seen everything — ready to Reduce!)
- 2Step 2
Before we start, we add a brand new start symbol to the grammar, with a single rule pointing to the old start symbol :
This is called the Augmented Grammar. Its sole purpose is to give the parser exactly one clear state where it knows it should halt and Accept (when it reaches ).
- 3Step 3
If there is a dot immediately before a non-terminal, it means we are expecting to see that non-terminal. Therefore, we must add all the rules for that non-terminal to our current state (with the dot at the beginning).
Example: If state contains: Since the dot is before , we must add all rules to : Now the dot is before , so we must add all rules:
- 4Step 4
The function computes the new state the parser enters when it is in state and sees grammar symbol (which can be a terminal or non-terminal).
Algorithm:
- Look at all items in where the dot is right before (e.g., ).
- Move the dot past : .
- Create a new state with these items, and apply the operation to expand it.
- 5Step 5
To build a CLR(1) or LALR(1) table, we must compute LR(1) Items. An LR(1) item looks like this: , where is the strict lookahead terminal.
The Golden Rule for LR(1) Closures: If you have an item in your state, you must add the closure for . The lookahead for the new items will be exactly .
Example: Suppose State contains:
- The dot is before . We must add and .
- What is the lookahead? Here, is empty, is , is , and is .
- Lookahead = . Since starts with 'c' or 'd', is .
- So we add: and .
Constructing a Canonical LR(1) Parsing Table
- 1Step 1
Start with the augmented grammar by adding . This ensures exactly one accepting state. We'll trace the classic grammar:
Step 0: Begin with the initial item and apply CLOSURE.
- 2Step 2
The LR(1) Closure Rule: If is in the state, add all productions as where .
Start:
- Dot is before (non-terminal), so add rules with lookahead = → Add
- Now dot is before . Lookahead = . Since derives
cord, . → Add and
State =
- 3Step 3
For each state, compute on every grammar symbol that appears after a dot. Apply after moving the dot.
GOTO(, ) = : (No closure needed — dot at end)
GOTO(, ) = : Move dot past in → . Apply closure on with lookahead :
GOTO(, ) = : Move dot past
cin → . Apply closure on with lookahead :GOTO(, ) = : (Reduce state — dot at end)
- 4Step 4
Continue computing for every state-symbol pair until no new states emerge.
GOTO(, ) = : (Reduce state)
GOTO(, ) = :
GOTO(, ) = :
GOTO(, ) = :
GOTO(, ) = (loop on
c)GOTO(, ) = (existing state)
GOTO(, ) = :
GOTO(, ) = (loop), GOTO(, ) = (existing)
Total: 10 distinct LR(1) states ( through ). Contrast this with only 7 LR(0) states for the same grammar—the lookahead information splits states apart.
- 5Step 5
The final LR(1) parsing table for grammar :
State c d \char36 S C — 1 2 — — acc — — — — 5 — — 8 — — — — — — — — — 9 — — — — — — — — — — — Key observation: State (reduce ) has lookahead
{c, d}, while (reduce ) has lookahead{\char36}. LR(1) splits what LR(0) would merge into a single state.
Operator Precedence Parsing is a classical shift-reduce parsing technique designed specifically for operator grammars—grammars where no production has two adjacent non-terminals.
Instead of building a full DFA of states, operator precedence parsers define three precedence relations between every pair of terminal symbols:
- (yields precedence): Terminal has lower precedence than . When is on top of stack and is next in input, the parser must shift .
- (equal precedence): and have the same precedence level. The parser continues shifting.
- (takes precedence): Terminal has higher precedence than . When is on top of stack and is next, the parser must reduce the handle ending with before can be processed.
The marker: The end-of-input marker has the lowest possible precedence: and for any terminal .
Comparing State Counts Across LR Variants
For a given grammar, the number of states in the canonical collection varies dramatically depending on the LR variant. Consider our running grammar :
| LR Variant | Number of States | Reason |
|---|---|---|
| LR(0) | 7 | Items track only core — no lookahead information. |
| SLR(1) | 7 | Uses LR(0) state machine; only sets differ in the table. |
| LALR(1) | 7 | Merges LR(1) states with identical cores back into fewer states. |
| Canonical LR(1) | 10 | Each state is split by distinct lookahead tokens. |
For real programming languages, the state explosion is much more severe. A grammar producing ~300 SLR(1) states can easily generate 3,000–10,000 canonical LR(1) states.
Why LALR(1) is the Industry Sweet Spot
LALR(1) achieves the ideal engineering compromise:
- Same table size as SLR(1): By merging LR(1) states with identical cores, the state machine is heavily compressed.
- High Power: The lookaheads per merged state are the union of the lookaheads from the constituent LR(1) states. This precise context resolves the vast majority of conflicts that SLR(1) cannot.
- The Caveat: Merging can theoretically introduce new Reduce-Reduce conflicts (if two non-overlapping lookaheads merge and overlap). However, Shift-Reduce conflicts are never introduced by merging.
- Yacc and Bison default to LALR(1) because it fits cleanly in memory while handling virtually all practical language constructs.
The Power Hierarchy of LR Parsers
Not all LR parsers are created equal. As we attempt to resolve conflicts by integrating more specific lookahead data, the algorithms grow mathematically stronger:
The relationship forms a strict subset hierarchy: . (If a grammar is LR(0), it is automatically SLR(1), but not vice versa).
Power: Weakest. Table Size: Small.
How it reduces: If state contains a completed item , LR(0) blindly puts the reduce action in every single column of that state's row in the parsing table.
The Problem: It doesn't look ahead at the next input token. It just assumes if a rule is complete, it must reduce. This naturally creates massive numbers of Shift-Reduce conflicts.
Common Questions
Knowledge Check
Which of the following describes a Shift-Reduce conflict?
Heuristics for Exam Classification
How to quickly determine which LR class a grammar belongs to:
- Build the LR(0) automaton first. This DFA forms the backbone of ALL LR parsers.
- Check LR(0): If filling the table (placing 'reduce' in ALL columns for completed states) yields any cell with multiple actions → the grammar is NOT LR(0).
- Check SLR(1): Recompute reduce logic using FOLLOW sets. If conflicts vanish → the grammar IS SLR(1). If conflicts persist → it's NOT SLR(1).
- Check LR(1): Build the full canonical LR(1) DFA. If conflicts exist here → the grammar is fundamentally ambiguous.
- Check LALR(1): If LR(1) has no conflicts, merge states with identical cores. If the compressed table has no Reduce-Reduce conflicts → the grammar IS LALR(1).
Golden Rule regarding DFA states: