Top-Down Parsing and LL(1) Grammars

Master the most exam-critical topic in syntax analysis. Learn to eliminate left recursion, apply left factoring, compute First and Follow sets, and build an LL(1) predictive parsing table from scratch.

Learning Goals

Explain why left recursion causes infinite loops in recursive-descent parsers and eliminate it.
Apply left factoring to remove common prefixes that prevent deterministic parsing.
Compute the FIRST set for any grammar symbol or sentential form.
Compute the FOLLOW set for any non-terminal in a grammar.
Construct a Predictive Parsing Table (LL(1)) and detect grammar conflicts.
Trace the execution of a table-driven LL(1) parser on a given input string using a stack.

Introduction to Top-Down Parsing

A Top-Down Parser starts at the root of the parse tree (the Start Symbol) and works its way down to the leaves (the tokens), attempting to find a Leftmost Derivation (LMD) that matches the input string.

The classic deterministic top-down parsing technique is LL(1) parsing. What does the name LL(1) stand for?

L: Scans the input from Left to right.
L: Produces a Leftmost derivation.
(1): Uses exactly 1 token of lookahead to make parsing decisions.

Because it looks only 1 token ahead, an LL(1) parser must be deterministic: it must know exactly which grammar rule to apply without guessing or backtracking. To make a grammar suitable for LL(1) parsing, we must first remove two obstacles that appear frequently in academic assessments: Left Recursion and Common Prefixes.

Recursive Descent is the simplest form of top-down parsing. It uses a set of recursive functions (one for each non-terminal) to process the input.

Backtracking: If the parser makes the wrong choice, it must undo its steps, reset the input pointer, and try the next rule.

Classical Evaluation Example: Grammar: $S \to cAd$ , $A \to ab \mid ac \mid a$ . Input string: cad

Parser reads c, matches $S \to c...$ and calls function $A()$ .
$A()$ tries first rule $A \to ab$ : matches a, but b fails against input d. (Backtrack 1)
$A()$ resets, tries second rule $A \to ac$ : matches a, but c fails against input d. (Backtrack 2)
$A()$ resets, tries third rule $A \to a$ : matches a. Returns to $S$ .
$S$ matches d. String accepted!
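
The walkthrough above can be reproduced with a minimal sketch of a backtracking parser. The function names (`parse_A`, `parse_S`) and the `log` list are illustrative choices, not part of the original material, and the grammar is hard-coded for this one example.

```python
# Grammar from the example: S -> c A d,  A -> ab | ac | a
A_ALTERNATIVES = ["ab", "ac", "a"]  # tried in this order


def parse_A(text, pos, log):
    """Yield every input position at which some alternative of A can end."""
    for alt in A_ALTERNATIVES:
        if text.startswith(alt, pos):
            yield pos + len(alt)
        else:
            # The alternative failed: the input pointer stays at `pos`
            # and the next alternative is tried (a backtrack).
            log.append(f"A -> {alt} failed at position {pos}, backtrack")


def parse_S(text):
    """Return (accepted, log) for S -> c A d using naive backtracking."""
    log = []
    if not text.startswith("c"):
        return False, log
    for end in parse_A(text, 1, log):
        # S still needs a 'd' followed by the end of the input.
        if text.startswith("d", end) and end + 1 == len(text):
            return True, log
        log.append(f"'d' expected at position {end}, backtrack into A")
    return False, log


accepted, log = parse_S("cad")
```

For the input `cad`, `log` records the two failed alternatives before `A -> a` succeeds, which mirrors Backtrack 1 and Backtrack 2 in the steps above.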

LL(1) predictive parsing was invented specifically to eliminate this expensive backtracking!

The Standard Expression Grammar

For the rest of this module, we will deeply analyze the classic arithmetic expression grammar. It has already had left-recursion eliminated so it is structurally prepared for LL(1) parsing:

$E \to T E'$
$E' \to + T E' \mid \epsilon$
$T \to F T'$
$T' \to * F T' \mid \epsilon$
$F \to ( E ) \mid \text{id}$
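
For reference, the standard rewrite for immediate left recursion turns $A \to A\alpha \mid \beta$ into $A \to \beta A'$ and $A' \to \alpha A' \mid \epsilon$. The sketch below applies it to $E \to E + T \mid T$, which is assumed here to be the left-recursive starting point (the usual textbook form; the source does not show it). The function name and the list-of-symbols encoding are illustrative.

```python
EPS = "ε"


def eliminate_immediate_left_recursion(head, alternatives):
    """Rewrite A -> A α | β as A -> β A' and A' -> α A' | ε.

    Each alternative is a list of grammar symbols.
    Only immediate (direct) left recursion is handled.
    """
    # α parts: alternatives that start with the head itself, head stripped off.
    recursive = [alt[1:] for alt in alternatives if alt and alt[0] == head]
    # β parts: all remaining alternatives.
    others = [alt for alt in alternatives if not alt or alt[0] != head]
    if not recursive:
        return {head: alternatives}  # nothing to do
    new_head = head + "'"
    return {
        head: [beta + [new_head] for beta in others],
        new_head: [alpha + [new_head] for alpha in recursive] + [[EPS]],
    }


# Assumed left-recursive original: E -> E + T | T
result = eliminate_immediate_left_recursion("E", [["E", "+", "T"], ["T"]])
```

Here `result` holds $E \to T E'$ and $E' \to + T E' \mid \epsilon$, matching the first two lines of the grammar above; the same rewrite applied to $T \to T * F \mid F$ would give the $T$ and $T'$ rules.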

To build the deterministic parsing table, we must compute two critical sets for this grammar: FIRST and FOLLOW. Mastering these computations is the most essential skill in Syntax Analysis.

Algorithm: Computing FIRST Sets

Step 1
For any string of grammar symbols $\alpha$ , FIRST( $\alpha$ ) is the set of all terminal symbols that can begin a string derived from $\alpha$ .

Crucial rule: If $x$ is a terminal, then FIRST( $x$ ) is simply { $x$ }. If $\alpha$ can derive the empty string, then $\epsilon$ is also in FIRST( $\alpha$ ).
Step 2
If $X$ is a terminal, then FIRST( $X$ ) = { $X$ }.

If there is a rule $X \to \epsilon$ , add $\epsilon$ to FIRST( $X$ ).

If $X$ is a non-terminal and $X \to Y_1 Y_2 ... Y_k$ is a rule:

Add FIRST( $Y_1$ ) (excluding $\epsilon$ ) to FIRST( $X$ ).

If $\epsilon$ is in FIRST( $Y_1$ ), add FIRST( $Y_2$ ), and so on.

If ALL $Y_i$ can derive $\epsilon$ , add $\epsilon$ to FIRST( $X$ ).
Step 3
FIRST(F): Looking at $F \to (E) \mid id$ , the first symbols are clearly ( and id. Result: { (, id }

FIRST(T'): Looking at $T' \to * F T' \mid \epsilon$ , the first symbols are * and $\epsilon$ . Result: { *, ε }

FIRST(T): Looking at $T \to F T'$ , $T$ starts with $F$ . So FIRST(T) = FIRST(F). Result: { (, id }

FIRST(E'): Looking at $E' \to + T E' \mid \epsilon$ , the first symbols are + and $\epsilon$ . Result: { +, ε }

FIRST(E): Looking at $E \to T E'$ , $E$ starts with $T$ . So FIRST(E) = FIRST(T). Result: { (, id }
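
The rules in Step 2 can be run to a fixed point mechanically. The following is a minimal sketch, assuming the grammar is encoded as a dictionary from non-terminal to a list of right-hand sides; the names `GRAMMAR`, `first_of_sequence`, and `compute_first` are illustrative.

```python
EPS = "ε"

# The expression grammar; every symbol that is not a key is a terminal.
GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], [EPS]],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], [EPS]],
    "F":  [["(", "E", ")"], ["id"]],
}


def first_of_sequence(symbols, first):
    """FIRST of a string of grammar symbols, given FIRST of each non-terminal."""
    result = set()
    for sym in symbols:
        if sym == EPS:
            continue
        # For a terminal x, FIRST(x) = {x}.
        sym_first = first[sym] if sym in first else {sym}
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result  # this symbol cannot vanish, so stop here
    result.add(EPS)  # every symbol in the sequence can derive ε
    return result


def compute_first(grammar):
    """Repeat the rules until no FIRST set grows any further."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, rules in grammar.items():
            for rhs in rules:
                new = first_of_sequence(rhs, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first


FIRST = compute_first(GRAMMAR)
```

The resulting `FIRST` dictionary agrees with the five results worked out by hand in Step 3.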

Algorithm: Computing FOLLOW Sets

Step 1
For any non-terminal $A$ , FOLLOW( $A$ ) is the set of all terminals that can appear immediately to the right of $A$ in some valid derivation.

⚠️ Crucial Rule: FOLLOW() sets never contain $\epsilon$ .
Step 2
Place $ (the end-of-input marker) in FOLLOW( $S$ ), where $S$ is the start symbol.

If there is a rule $A \to \alpha B \beta$ , then everything in FIRST( $\beta$ ) (except $\epsilon$ ) is added to FOLLOW( $B$ ).

If there is a rule $A \to \alpha B$ , OR a rule $A \to \alpha B \beta$ where FIRST( $\beta$ ) contains $\epsilon$ , then everything in FOLLOW( $A$ ) is added to FOLLOW( $B$ ).
Step 3
FOLLOW(E): $E$ is the start symbol, so add $. $E$ also appears in $F \to (E)$ . What follows $E$ ? A ). Result: { $, ) }

FOLLOW(E'): $E'$ appears at the end of $E \to T E'$ and $E' \to + T E'$ . By Rule 3, FOLLOW(E') gets everything in FOLLOW(E). Result: { $, ) }

FOLLOW(T): $T$ appears in $E \to T E'$ and $E' \to + T E'$ . What follows $T$ is $E'$ . So we add FIRST(E') without $\epsilon$ (which is {+}). Because FIRST(E') contains $\epsilon$ , Rule 3 dictates we also add FOLLOW(E) and FOLLOW(E'). Result: { +, $, ) }

FOLLOW(T'): Appears at the end of $T \to F T'$ and $T' \to * F T'$ . By Rule 3, gets FOLLOW(T). Result: { +, $, ) }

FOLLOW(F): Appears in $T \to F T'$ and $T' \to * F T'$ . What follows $F$ is $T'$ . Add FIRST(T') without $\epsilon$ ({*}). Since FIRST(T') has $\epsilon$ , add FOLLOW(T). Result: { *, +, $, ) }
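
The three FOLLOW rules can likewise be iterated until nothing changes. This is a sketch under the same assumed encoding as before; the FIRST sets are copied from the results above rather than recomputed, and the names `compute_follow` and `first_of_sequence` are illustrative.

```python
EPS, END = "ε", "$"

GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], [EPS]],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], [EPS]],
    "F":  [["(", "E", ")"], ["id"]],
}

# FIRST sets of the non-terminals, taken from the worked results above.
FIRST = {
    "E": {"(", "id"}, "E'": {"+", EPS},
    "T": {"(", "id"}, "T'": {"*", EPS},
    "F": {"(", "id"},
}


def first_of_sequence(symbols):
    """FIRST of a (possibly empty) string of grammar symbols."""
    result = set()
    for sym in symbols:
        sym_first = FIRST.get(sym, {sym})  # terminal x: FIRST(x) = {x}
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result
    result.add(EPS)  # empty or fully nullable sequence
    return result


def compute_follow(grammar, start):
    follow = {nt: set() for nt in grammar}
    follow[start].add(END)  # rule 1: $ follows the start symbol
    changed = True
    while changed:
        changed = False
        for head, rules in grammar.items():
            for rhs in rules:
                for i, sym in enumerate(rhs):
                    if sym not in grammar:
                        continue  # only non-terminals have FOLLOW sets
                    beta_first = first_of_sequence(rhs[i + 1:])
                    new = beta_first - {EPS}      # rule 2
                    if EPS in beta_first:
                        new |= follow[head]       # rule 3
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow


FOLLOW = compute_follow(GRAMMAR, "E")
```

No FOLLOW set produced this way contains $\epsilon$, because rule 2 strips it explicitly and rule 3 only copies from other FOLLOW sets.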

The Golden Rules of FIRST and FOLLOW

In high-stakes exams, double-check these two rules to catch cascading calculation errors:

FIRST sets CAN contain $\epsilon$ . If a non-terminal can vanish, $\epsilon$ must be in its FIRST set.
FOLLOW sets CANNOT contain $\epsilon$ . It makes no mathematical sense for the 'empty string' to follow a variable. The symbol $ is used instead to indicate the end of the file/input.

Building the LL(1) Predictive Parsing Table

With FIRST and FOLLOW computed, building the 2D parsing table $M$ is purely mechanical. Rows = Non-terminals. Columns = Terminals (including $).

Algorithm to populate $M[A, a]$ : For each rule $A \to \alpha$ :

For each terminal $a$ in FIRST( $\alpha$ ), add $A \to \alpha$ to $M[A, a]$ .
If $\epsilon$ is in FIRST( $\alpha$ ), then for each terminal $b$ in FOLLOW( $A$ ), add $A \to \alpha$ to $M[A, b]$ .

Applying this to our grammar yields this completed matrix:

Non-Terminal	`id`	`+`	`*`	`(`	`)`	`$`
E	$E \to TE'$			$E \to TE'$
E'		$E' \to +TE'$			$E' \to \epsilon$	$E' \to \epsilon$
T	$T \to FT'$			$T \to FT'$
T'		$T' \to \epsilon$	$T' \to *FT'$		$T' \to \epsilon$	$T' \to \epsilon$
F	$F \to id$			$F \to (E)$

Note: Every cell has at most one entry, which is exactly the condition for the grammar to be LL(1). If any cell held two or more entries, the grammar would NOT be LL(1); typical causes are ambiguity, left recursion, or a common prefix that still needs left factoring.
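
The two table-filling rules translate almost line for line into code. The sketch below assumes the FIRST and FOLLOW sets computed earlier (hard-coded here so the block runs on its own) and stores each cell as a list so that conflicts would be visible; `build_table` is an illustrative name.

```python
EPS, END = "ε", "$"

GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], [EPS]],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], [EPS]],
    "F":  [["(", "E", ")"], ["id"]],
}
FIRST = {"E": {"(", "id"}, "E'": {"+", EPS}, "T": {"(", "id"},
         "T'": {"*", EPS}, "F": {"(", "id"}}
FOLLOW = {"E": {END, ")"}, "E'": {END, ")"}, "T": {"+", END, ")"},
          "T'": {"+", END, ")"}, "F": {"*", "+", END, ")"}}


def first_of_sequence(symbols):
    """FIRST of a right-hand side."""
    result = set()
    for sym in symbols:
        sym_first = FIRST.get(sym, {sym})  # terminal x: FIRST(x) = {x}
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result
    result.add(EPS)
    return result


def build_table(grammar):
    """Map (non-terminal, terminal) to the list of applicable right-hand sides."""
    table = {}
    for head, rules in grammar.items():
        for rhs in rules:
            rhs_first = first_of_sequence(rhs)
            columns = rhs_first - {EPS}        # rule 1: terminals in FIRST(α)
            if EPS in rhs_first:
                columns |= FOLLOW[head]        # rule 2: terminals in FOLLOW(A)
            for terminal in columns:
                table.setdefault((head, terminal), []).append(rhs)
    return table


TABLE = build_table(GRAMMAR)
conflicts = {cell: rules for cell, rules in TABLE.items() if len(rules) > 1}
```

For this grammar `conflicts` is empty and `TABLE` has 13 filled cells, the same cells that appear in the matrix above.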

When Grammars Fail LL(1) Constraints

Not all grammars are LL(1). If you construct a parsing table and any cell contains more than one rule, the grammar is NOT LL(1).

The Classic Dangling-Else Conflict:

$S \to iEtSA \mid a$
$A \to eS \mid \epsilon$
$E \to b$

If you calculate FIRST and FOLLOW for this grammar, you will find:

FIRST( $A$ ) = { e, $\epsilon$ }
FOLLOW( $A$ ) = { e, $ }

Because $A \to \epsilon$ , we must put this rule into the columns of FOLLOW( $A$ ). So, in cell M[A, e], we must insert $A \to \epsilon$ . But 'e' is also in FIRST( $A$ ), so we must ALSO insert $A \to eS$ into M[A, e].

Because M[A, e] contains two rules, the parser doesn't know whether to match the else or let $A$ vanish. Therefore, this grammar is formally proven to be NOT LL(1).
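
The conflict can be reproduced by filling only row $A$ of the table. This is a small sketch that uses the sets quoted above; FIRST( $eS$ ) = { e } follows because that right-hand side begins with the terminal e. The variable names are illustrative.

```python
EPS, END = "ε", "$"

# Each rule for A paired with the FIRST set of its right-hand side.
A_RULES = {
    "A -> e S": {"e"},
    "A -> ε": {EPS},
}
FOLLOW_A = {"e", END}  # as stated in the text

row = {}  # terminal -> list of rules placed in M[A, terminal]
for rule, rhs_first in A_RULES.items():
    columns = rhs_first - {EPS}
    if EPS in rhs_first:
        columns |= FOLLOW_A  # an ε-rule goes into every FOLLOW(A) column
    for terminal in columns:
        row.setdefault(terminal, []).append(rule)

# Cells holding more than one rule are LL(1) conflicts.
conflicts = {t: rules for t, rules in row.items() if len(rules) > 1}
```

After the loop, `conflicts` contains the single key `"e"` with both rules for $A$, while the `"$"` column holds only the $\epsilon$-rule.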

Formal Tests: Proving a Grammar is LL(1)

Step 1
For each production rule $A \to \alpha$ , compute the FIRST set of its right-hand side. Then compute FOLLOW( $A$ ) for every non-terminal in the grammar, verifying that $ is in the start symbol's FOLLOW set and that no $\epsilon$ exists in any FOLLOW set.
Step 2
For every non-terminal $A$ with competing rules $A \to \alpha \mid \beta$ :

Rule 1: FIRST( $\alpha$ ) $\cap$ FIRST( $\beta$ ) $= \emptyset$

If both rules can start with the same terminal, the parser has no way to choose between them using just one token of lookahead.
Step 3
If $\epsilon \in$ FIRST( $\beta$ ), then additionally check:

Rule 2: FIRST( $\alpha$ ) $\cap$ FOLLOW( $A$ ) $= \emptyset$

This catches the dangling-else problem: when $A$ can either produce a terminal (via $\alpha$ ) or vanish (via $\beta$ ), you must ensure the parser never faces a token that fits both scenarios simultaneously.
Step 4
Consider $E' \to + T E' \mid \epsilon$ :

FIRST( $+ T E'$ ) = { + }

FIRST( $\epsilon$ ) = { $\epsilon$ }

Rule 1 holds trivially: { + } $\cap$ { $\epsilon$ } = $\emptyset$ .

Rule 2: { + } $\cap$ FOLLOW( $E'$ ) = { + } $\cap$ { $, ) } = $\emptyset$ ✅

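The same reasoning applies to $T' \to * F T' \mid \epsilon$ , where { * } $\cap$ FOLLOW( $T'$ ) = { * } $\cap$ { +, $, ) } = $\emptyset$ , and to $F \to ( E ) \mid \text{id}$ , where { ( } $\cap$ { id } = $\emptyset$ . All checks on our expression grammar pass, proving it is LL(1).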
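
Rule 1 and Rule 2 can also be checked pairwise without building the table. The sketch below assumes the FIRST and FOLLOW sets derived earlier and applies Rule 2 in both directions, since either alternative of a pair might be the nullable one; `ll1_violations` is an illustrative name.

```python
from itertools import combinations

EPS, END = "ε", "$"

GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], [EPS]],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], [EPS]],
    "F":  [["(", "E", ")"], ["id"]],
}
FIRST = {"E": {"(", "id"}, "E'": {"+", EPS}, "T": {"(", "id"},
         "T'": {"*", EPS}, "F": {"(", "id"}}
FOLLOW = {"E": {END, ")"}, "E'": {END, ")"}, "T": {"+", END, ")"},
          "T'": {"+", END, ")"}, "F": {"*", "+", END, ")"}}


def first_of_sequence(symbols):
    """FIRST of a right-hand side."""
    result = set()
    for sym in symbols:
        sym_first = FIRST.get(sym, {sym})  # terminal x: FIRST(x) = {x}
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result
    result.add(EPS)
    return result


def ll1_violations(grammar):
    """Return a list of (non-terminal, rule name, offending symbols)."""
    problems = []
    for head, rules in grammar.items():
        for alpha, beta in combinations(rules, 2):
            fa, fb = first_of_sequence(alpha), first_of_sequence(beta)
            if fa & fb:  # Rule 1: FIRST(α) ∩ FIRST(β) must be empty
                problems.append((head, "Rule 1", fa & fb))
            # Rule 2: if one side can vanish, the other side's FIRST set
            # must not overlap FOLLOW(A).
            for nullable, other in ((fb, fa), (fa, fb)):
                if EPS in nullable and other & FOLLOW[head]:
                    problems.append((head, "Rule 2", other & FOLLOW[head]))
    return problems


violations = ll1_violations(GRAMMAR)
```

For the expression grammar `violations` is an empty list, in line with the manual check above.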

Architecture of a Non-Recursive Predictive Parser

A common requirement in compiler design evaluations is outlining the architecture of the LL(1) engine. It relies on four interlinked components:

Input Buffer: Holds the string to be parsed, ending with the special end-marker $.
Stack: A LIFO data structure that holds a sequence of grammar symbols. It is initialized with $ at the bottom and the Start Symbol at the top.
Parsing Table ( $M$ ): The pre-computed 2D array acting as the system's brain.
Predictive Parsing Program: The algorithm driving the system. It compares the stack top ( $X$ ) with the current input token ( $a$ ):
- If $X$ and $a$ are both the end-marker, Accept.
- If $X == a$ for any other terminal, Match (pop stack, advance input).
- If $X$ is a non-terminal and $M[X, a]$ holds a rule, pop $X$ and push the right-hand side of that rule onto the stack in reverse order.
- If $X$ is a terminal that differs from $a$ , or $M[X, a]$ is blank, throw a Syntax Error.

Tracing the LL(1) Parser

Here is the trace for parsing the string id + id through our expression grammar table:

Stack	Input	Action (Lookup $M[\text{top}, \text{input}]$ )
`$E`	`id + id$`	$M[E, id] \implies$ Pop $E$ , Push $E'T$ (reverse order)
`$E'T`	`id + id$`	$M[T, id] \implies$ Pop $T$ , Push $T'F$
`$E'T'F`	`id + id$`	$M[F, id] \implies$ Pop $F$ , Push $id$
`$E'T'id`	`id + id$`	Match! Top of stack (`id`) matches input (`id`). Pop stack, advance input.
`$E'T'`	`+ id$`	$M[T', +] \implies$ Pop $T'$ , Push $\epsilon$ (i.e., just pop)
`$E'`	`+ id$`	$M[E', +] \implies$ Pop $E'$ , Push $E'T+$
`$E'T+`	`+ id$`	Match! Pop `+`, advance input.
`$E'T`	`id$`	$M[T, id] \implies$ Pop $T$ , Push $T'F$
`$E'T'F`	`id$`	$M[F, id] \implies$ Pop $F$ , Push $id$
`$E'T'id`	`id$`	Match! Pop `id`, advance input.
`$E'T'`	`$`	$M[T', \$] \implies$ Pop $T'$ , Push $\epsilon$
`$E'`	`$`	$M[E', \$] \implies$ Pop $E'$ , Push $\epsilon$
`$`	`$`	Only the end-marker remains on both the stack and the input $\implies$ ACCEPT ✅

Why Reverse Order?

Because a stack is Last-In-First-Out (LIFO). If we apply the rule $E \to T E'$ and we want the parser to process $T$ first (since we read input left-to-right), $T$ must be at the very top of the stack. Pushing $E'$ first, and then pushing $T$ , ensures that $T$ ends up on top.
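
The driver loop, including the reverse-order push, fits in a few lines. The following is a minimal sketch with the parsing table from this module hard-coded; $\epsilon$-rules are encoded as empty lists, and the names `TABLE`, `NONTERMINALS`, and `parse` are illustrative.

```python
END = "$"

# M[A, a] for the expression grammar; an empty list stands for an ε-rule.
TABLE = {
    ("E", "id"): ["T", "E'"], ("E", "("): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"], ("E'", ")"): [], ("E'", END): [],
    ("T", "id"): ["F", "T'"], ("T", "("): ["F", "T'"],
    ("T'", "+"): [], ("T'", "*"): ["*", "F", "T'"],
    ("T'", ")"): [], ("T'", END): [],
    ("F", "id"): ["id"], ("F", "("): ["(", "E", ")"],
}
NONTERMINALS = {"E", "E'", "T", "T'", "F"}


def parse(tokens, start="E"):
    """Return the productions applied, in order; raise SyntaxError on failure."""
    tokens = list(tokens) + [END]
    stack = [END, start]  # $ at the bottom, start symbol on top
    pos = 0
    applied = []
    while stack:
        top = stack.pop()
        tok = tokens[pos]
        if top in NONTERMINALS:
            rhs = TABLE.get((top, tok))
            if rhs is None:
                raise SyntaxError(f"no rule for M[{top}, {tok}]")
            applied.append((top, rhs))
            # Push in reverse so the leftmost symbol ends up on top.
            stack.extend(reversed(rhs))
        elif top == tok:
            if top == END:
                return applied  # end-marker matched end-marker: accept
            pos += 1            # terminal matched: advance the input
        else:
            raise SyntaxError(f"expected {top!r}, found {tok!r}")
    raise SyntaxError("stack emptied before end of input")


applied = parse(["id", "+", "id"])
```

For `id + id` the parser applies nine productions, one for each lookup row in the trace above; the three Match rows consume input without adding an entry.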

Error Recovery in Predictive Parsing

Even the best LL(1) parser will encounter syntax errors in real-world inputs. The standard approach is Panic-Mode Recovery.

When the parser encounters a blank entry in the parsing table $M[A, a]$ (no valid rule for the current non-terminal $A$ and input token $a$ ), it enters panic mode. The parser skips (discards) input tokens until it finds a synchronizing token—a token that belongs to the FOLLOW set of the current non-terminal or some ancestor non-terminal on the stack.

Algorithm for panic-mode recovery:

Pop entries from the stack until a non-terminal $X$ is on top whose FOLLOW set contains the current input token $a$ .
If no such non-terminal exists on the stack, skip the current input token and try again.
Once a synchronizing token is found, pop $X$ as well (the construct it represents is abandoned) and resume normal parsing.

Key insight for exams: Synchronizing tokens are chosen from FOLLOW sets because they are "natural delimiters"—terminals like ), ;, or end that logically signal the end of a construct. After skipping to such a delimiter, the parser has a realistic chance of resynchronizing with correct subsequent input.
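
One way to realise the recovery steps listed above is sketched below. It makes several assumptions that the text does not spell out: the synchronizing non-terminal itself is popped along with everything above it, a mismatched terminal on the stack is simply popped, and recovery stops if the end-marker is reached with nothing to synchronize on. Real parsers differ in these details, and `parse_with_recovery` is an illustrative name.

```python
END = "$"

# Parsing table of the expression grammar; an empty list is an ε-rule.
TABLE = {
    ("E", "id"): ["T", "E'"], ("E", "("): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"], ("E'", ")"): [], ("E'", END): [],
    ("T", "id"): ["F", "T'"], ("T", "("): ["F", "T'"],
    ("T'", "+"): [], ("T'", "*"): ["*", "F", "T'"],
    ("T'", ")"): [], ("T'", END): [],
    ("F", "id"): ["id"], ("F", "("): ["(", "E", ")"],
}
FOLLOW = {"E": {END, ")"}, "E'": {END, ")"}, "T": {"+", END, ")"},
          "T'": {"+", END, ")"}, "F": {"*", "+", END, ")"}}


def parse_with_recovery(tokens, start="E"):
    """Parse and return a list of error messages (empty if none)."""
    tokens = list(tokens) + [END]
    stack, pos, errors = [END, start], 0, []
    while stack:
        top, tok = stack[-1], tokens[pos]
        if top in FOLLOW:  # the keys of FOLLOW are the non-terminals
            rhs = TABLE.get((top, tok))
            if rhs is not None:
                stack.pop()
                stack.extend(reversed(rhs))
                continue
            errors.append(f"unexpected {tok!r} while parsing {top}")
            # Search from the top for a non-terminal whose FOLLOW has tok.
            sync = next((i for i in range(len(stack) - 1, -1, -1)
                         if stack[i] in FOLLOW and tok in FOLLOW[stack[i]]),
                        None)
            if sync is not None:
                del stack[sync:]  # assumption: abandon it and all above it
            elif tok == END:
                break             # nothing left to skip
            else:
                pos += 1          # discard the offending token
        elif top == tok:
            stack.pop()
            if top == END:
                break
            pos += 1
        else:
            errors.append(f"expected {top!r}, found {tok!r}")
            stack.pop()           # assumption: act as if it had been present
    return errors


errors = parse_with_recovery(["id", "+", "*", "id"])
```

On `id + * id` the sketch reports one error at the stray `*`, discards it because no non-terminal on the stack has `*` in its FOLLOW set, and then finishes parsing `id + id` normally.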

A table-driven LL(1) parser uses an explicit stack and a pre-computed 2D parsing table $M$ to guide decisions.

Mechanism:

The parsing table $M[A, a]$ stores the rule to apply when non-terminal $A$ is on top of the stack and terminal $a$ is the input.
At each step, if the stack top is a non-terminal, it looks up the table and pushes the RHS. If the stack top is a terminal, it matches.

Pros:

Fully deterministic — no backtracking, guaranteed $O(n)$ time.
The table can be generated automatically from the grammar by a parser-generator tool.
Language changes only require regenerating the table.

Cons:

Requires pre-computation of FIRST and FOLLOW sets.
The table can consume significant memory.
Many grammars cannot be made LL(1) even after left-factoring.

Knowledge Check

Which of the following statements is true regarding LL(1) grammars and left recursion?

Every left recursive grammar can be LL(1)

An LL(1) grammar can be ambiguous

Both statements are true

Neither statement is true
