Top-Down Parsing and LL(1) Grammars
Master the most exam-critical topic in syntax analysis. Learn to eliminate left recursion, apply left factoring, compute First and Follow sets, and build an LL(1) predictive parsing table from scratch.
Learning Goals
- Explain why left recursion causes infinite loops in recursive-descent parsers and eliminate it.
- Apply left factoring to remove common prefixes that prevent deterministic parsing.
- Compute the FIRST set for any grammar symbol or sentential form.
- Compute the FOLLOW set for any non-terminal in a grammar.
- Construct a Predictive Parsing Table (LL(1)) and detect grammar conflicts.
- Trace the execution of a table-driven LL(1) parser on a given input string using a stack.
Introduction to Top-Down Parsing
A Top-Down Parser starts at the root of the parse tree (the Start Symbol) and works its way down to the leaves (the tokens), attempting to find a Leftmost Derivation (LMD) that matches the input string.
The most efficient and widely used top-down parsing technique is LL(1) parsing. What does the name LL(1) stand for?
- L: Scans the input from Left to right.
- L: Produces a Leftmost derivation.
- (1): Uses exactly 1 token of lookahead to make parsing decisions.
Because it only looks 1 token ahead, an LL(1) parser must be deterministic—it must know exactly which grammar rule to apply without guessing or backtracking. To make a grammar deterministic and suitable for LL(1) parsing, we must first fix two fatal flaws that frequently appear in academic assessments: Left Recursion and Common Prefixes.
Recursive Descent is the simplest form of top-down parsing. It uses a set of recursive functions (one for each non-terminal) to process the input.
Backtracking: If the parser makes the wrong choice, it must undo its steps, reset the input pointer, and try the next rule.
Classical Evaluation Example:
Grammar: , .
Input string: cad
- Parser reads
c, matches and calls function . - tries first rule : matches
a, butbfails against inputd. (Backtrack 1) - resets, tries second rule : matches
a, butcfails against inputd. (Backtrack 2) - resets, tries third rule : matches
a. Returns to . - matches
d. String accepted!
LL(1) predictive parsing was invented specifically to eliminate this expensive backtracking!
The Standard Expression Grammar
For the rest of this module, we will deeply analyze the classic arithmetic expression grammar. It has already had left-recursion eliminated so it is structurally prepared for LL(1) parsing:
To build the deterministic parsing table, we must compute two critical sets for this grammar: FIRST and FOLLOW. Mastering these computations is the most essential skill in Syntax Analysis.
Algorithm: Computing FIRST Sets
- 1Step 1
For any string of grammar symbols , FIRST() is the set of all terminal symbols that can begin a string derived from .
Crucial rule: If is a terminal, then FIRST() is simply {}. If can derive the empty string, then is also in FIRST().
- 2Step 2
- If is a terminal, then FIRST() = {}.
- If there is a rule , add to FIRST().
- If is a non-terminal and is a rule:
- Add FIRST() (excluding ) to FIRST().
- If is in FIRST(), add FIRST(), and so on.
- If ALL can derive , add to FIRST().
- 3Step 3
-
FIRST(F): Looking at , the first symbols are clearly
(andid. Result: { (, id } -
FIRST(T'): Looking at , the first symbols are
*and\epsilon. *Result: { , ε } -
FIRST(T): Looking at , starts with . So FIRST(T) = FIRST(F). Result: { (, id }
-
FIRST(E'): Looking at , the first symbols are
+and\epsilon. Result: { +, ε } -
FIRST(E): Looking at , starts with . So FIRST(E) = FIRST(T). Result: { (, id }
-
Algorithm: Computing FOLLOW Sets
- 1Step 1
For any non-terminal , FOLLOW() is the set of all terminals that can appear immediately to the right of in some valid derivation.
⚠️ Crucial Rule:
FOLLOW()sets never contain . - 2Step 2
- Place
$(the end-of-input marker) in FOLLOW(), where is the start symbol. - If there is a rule , then everything in FIRST() (except ) is added to FOLLOW().
- If there is a rule , OR a rule where FIRST() contains , then everything in FOLLOW() is added to FOLLOW().
- Place
- 3Step 3
-
FOLLOW(E): is the start symbol, so add
$. also appears in . What follows ? A). Result: { $, ) } -
FOLLOW(E'): appears at the end of and . By Rule 3, FOLLOW(E') gets everything in FOLLOW(E). Result: { $, ) }
-
FOLLOW(T): appears in and . What follows is . So we add FIRST(E') without (which is
{+}). Because FIRST(E') contains , Rule 3 dictates we also add FOLLOW(E) and FOLLOW(E'). Result: { +, $, ) } -
FOLLOW(T'): Appears at the end of and . By Rule 3, gets FOLLOW(T). Result: { +, $, ) }
-
FOLLOW(F): Appears in and . What follows is . Add FIRST(T') without (
{*}). Since FIRST(T') has , add FOLLOW(T). *Result: { , +, $, ) }
-
The Golden Rules of FIRST and FOLLOW
In high-stakes exams, double-check these two rules to catch cascading calculation errors:
FIRSTsets CAN contain . If a non-terminal can vanish, must be in its FIRST set.FOLLOWsets CANNOT contain . It makes no mathematical sense for the 'empty string' to follow a variable. The symbol$is used instead to indicate the end of the file/input.
Building the LL(1) Predictive Parsing Table
With FIRST and FOLLOW computed, building the 2D parsing table is purely mechanical.
Rows = Non-terminals. Columns = Terminals (including $).
Algorithm to populate : For each rule :
- For each terminal in FIRST(), add to .
- If is in FIRST(), then for each terminal in FOLLOW(), add to .
Applying this to our grammar yields this completed matrix:
| Non-Terminal | id | + | * | ( | ) | $ |
|---|---|---|---|---|---|---|
| E | ||||||
| E' | ||||||
| T | ||||||
| T' | ||||||
| F |
Note: Every cell has at most one entry. This mathematically proves the grammar is strictly LL(1). If any cell had two conflicting entries, the grammar would be ambiguous or require left-factoring, and thus NOT be LL(1).
When Grammars Fail LL(1) Constraints
Not all grammars are LL(1). If you construct a parsing table and any cell contains more than one rule, the grammar is NOT LL(1).
The Classic Dangling-Else Conflict:
If you calculate FIRST and FOLLOW for this grammar, you will find:
- FIRST() = {
e, } - FOLLOW() = {
e,$}
Because , we must put this rule into the columns of FOLLOW(). So, in cell M[A, e], we must insert . But 'e' is also in FIRST(), so we must ALSO insert into M[A, e].
Because M[A, e] contains two rules, the parser doesn't know whether to match the else or let vanish. Therefore, this grammar is formally proven to be NOT LL(1).
Formal Tests: Proving a Grammar is LL(1)
- 1Step 1
For each production rule , compute the FIRST set of its right-hand side. Then compute FOLLOW() for every non-terminal in the grammar, verifying that
$is in the start symbol's FOLLOW set and that no exists in any FOLLOW set. - 2Step 2
For every non-terminal with competing rules :
Rule 1: FIRST() FIRST()
If both rules can start with the same terminal, the parser has no way to choose between them using just one token of lookahead.
- 3Step 3
If FIRST(), then additionally check:
Rule 2: FIRST() FOLLOW()
This catches the dangling-else problem: when can either produce a terminal (via ) or vanish (via ), you must ensure the parser never faces a token that fits both scenarios simultaneously.
- 4Step 4
Consider :
- FIRST() = {
+} - FIRST() = { }
- Rule 1 is automatic (only one non- rule).
- Rule 2: {
+} FOLLOW() = {+} {$,)} = ✅
All checks on our expression grammar pass, proving it is LL(1).
- FIRST() = {
Architecture of a Non-Recursive Predictive Parser
A common requirement in compiler design evaluations is outlining the architecture of the LL(1) engine. It relies on four interlinked components:
- Input Buffer: Holds the string to be parsed, ending with the special end-marker
$. - Stack: A LIFO data structure that holds a sequence of grammar symbols. It is initialized with
$at the bottom and the Start Symbol at the top. - Parsing Table (): The pre-computed 2D array acting as the system's brain.
- Predictive Parsing Program: The algorithm driving the system. It compares the stack top () with the current input token ():
- If , Match (pop stack, advance input).
- If is a non-terminal, Lookup and push the resulting rule onto the stack in reverse order.
- If is blank, throw a Syntax Error.
Tracing the LL(1) Parser
Here is the trace for parsing the string id + id through our expression grammar table:
| Stack | Input | Action (Lookup ) |
|---|---|---|
$E | id + id$ | Pop , Push (reverse order) |
$E'T | id + id$ | Pop , Push |
$E'T'F | id + id$ | Pop , Push |
$E'T'id | id + id$ | Match! Top of stack (id) matches input (id). Pop stack, advance input. |
$E'T'$ | + id$ | Pop , Push (i.e., just pop) |
$E'$ | + id$ | Pop , Push |
$E'T+$ | + id$ | Match! Pop +, advance input. |
$E'T | id$ | Pop , Push |
$E'T'F | id$ | Pop , Push |
$E'T'id | id$ | Match! Pop id, advance input. |
$E'T'$ | $ | M[T', \] \impliesT'\epsilon$ |
$E'$ | $ | M[E', \] \impliesE'\epsilon$ |
$ | $ | Stack empty, Input empty ACCEPT ✅ |
Why Reverse Order?
Because a stack is Last-In-First-Out (LIFO). If we apply the rule and we want the parser to process first (since we read input left-to-right), must be at the very top of the stack. Pushing first, and then pushing , ensures that ends up on top.
Error Recovery in Predictive Parsing
Even the best LL(1) parser will encounter syntax errors in real-world inputs. The standard approach is Panic-Mode Recovery.
When the parser encounters a blank entry in the parsing table (no valid rule for the current non-terminal and input token ), it enters panic mode. The parser skips (discards) input tokens until it finds a synchronizing token—a token that belongs to the FOLLOW set of the current non-terminal or some ancestor non-terminal on the stack.
Algorithm for panic-mode recovery:
- Pop entries from the stack until a non-terminal is on top whose FOLLOW set contains the current input token .
- If no such non-terminal exists on the stack, skip the current input token and try again.
- Once a synchronizing token is found, resume normal parsing.
Key insight for exams: Synchronizing tokens are chosen from FOLLOW sets because they are "natural delimiters"—terminals like ), ;, or end that logically signal the end of a construct. After skipping to such a delimiter, the parser has a realistic chance of resynchronizing with correct subsequent input.
A table-driven LL(1) parser uses an explicit stack and a pre-computed 2D parsing table to guide decisions.
Mechanism:
- The parsing table stores the rule to apply when non-terminal is on top of the stack and terminal is the input.
- At each step, if the stack top is a non-terminal, it looks up the table and pushes the RHS. If the stack top is a terminal, it matches.
Pros:
- Fully deterministic — no backtracking, guaranteed time.
- The table can be generated automatically by tools (ANTLR).
- Language changes only require regenerating the table.
Cons:
- Requires pre-computation of FIRST and FOLLOW sets.
- The table can consume significant memory.
- Many grammars cannot be made LL(1) even after left-factoring.
Common Questions
Knowledge Check
Which of the following statements is true regarding LL(1) grammars and left recursion?