Coursify

Compiler Design (CD)

Top-Down Parsing and LL(1) Grammars

55

Master the most exam-critical topic in syntax analysis. Learn to eliminate left recursion, apply left factoring, compute First and Follow sets, and build an LL(1) predictive parsing table from scratch.

Learning Goals

  • Explain why left recursion causes infinite loops in recursive-descent parsers and eliminate it.
  • Apply left factoring to remove common prefixes that prevent deterministic parsing.
  • Compute the FIRST set for any grammar symbol or sentential form.
  • Compute the FOLLOW set for any non-terminal in a grammar.
  • Construct a Predictive Parsing Table (LL(1)) and detect grammar conflicts.
  • Trace the execution of a table-driven LL(1) parser on a given input string using a stack.

Introduction to Top-Down Parsing

A Top-Down Parser starts at the root of the parse tree (the Start Symbol) and works its way down to the leaves (the tokens), attempting to find a Leftmost Derivation (LMD) that matches the input string.

The most efficient and widely used top-down parsing technique is LL(1) parsing. What does the name LL(1) stand for?

  • L: Scans the input from Left to right.
  • L: Produces a Leftmost derivation.
  • (1): Uses exactly 1 token of lookahead to make parsing decisions.

Because it only looks 1 token ahead, an LL(1) parser must be deterministic—it must know exactly which grammar rule to apply without guessing or backtracking. To make a grammar deterministic and suitable for LL(1) parsing, we must first fix two fatal flaws that frequently appear in academic assessments: Left Recursion and Common Prefixes.

Recursive Descent is the simplest form of top-down parsing. It uses a set of recursive functions (one for each non-terminal) to process the input.

Backtracking: If the parser makes the wrong choice, it must undo its steps, reset the input pointer, and try the next rule.

Classical Evaluation Example: Grammar: ScAdS \to cAd, AabacaA \to ab \mid ac \mid a. Input string: cad

  1. Parser reads c, matches Sc...S \to c... and calls function A()A().
  2. A()A() tries first rule AabA \to ab: matches a, but b fails against input d. (Backtrack 1)
  3. A()A() resets, tries second rule AacA \to ac: matches a, but c fails against input d. (Backtrack 2)
  4. A()A() resets, tries third rule AaA \to a: matches a. Returns to SS.
  5. SS matches d. String accepted!

LL(1) predictive parsing was invented specifically to eliminate this expensive backtracking!

The Standard Expression Grammar

For the rest of this module, we will deeply analyze the classic arithmetic expression grammar. It has already had left-recursion eliminated so it is structurally prepared for LL(1) parsing:

  1. ETEE \to T E'
  2. E+TEϵE' \to + T E' \mid \epsilon
  3. TFTT \to F T'
  4. TFTϵT' \to * F T' \mid \epsilon
  5. F(E)idF \to ( E ) \mid \text{id}

To build the deterministic parsing table, we must compute two critical sets for this grammar: FIRST and FOLLOW. Mastering these computations is the most essential skill in Syntax Analysis.

Algorithm: Computing FIRST Sets

  1. 1
    Step 1

    For any string of grammar symbols α\alpha, FIRST(α\alpha) is the set of all terminal symbols that can begin a string derived from α\alpha.

    Crucial rule: If xx is a terminal, then FIRST(xx) is simply {xx}. If α\alpha can derive the empty string, then ϵ\epsilon is also in FIRST(α\alpha).

  2. 2
    Step 2
    1. If XX is a terminal, then FIRST(XX) = {XX}.
    2. If there is a rule XϵX \to \epsilon, add ϵ\epsilon to FIRST(XX).
    3. If XX is a non-terminal and XY1Y2...YkX \to Y_1 Y_2 ... Y_k is a rule:
      • Add FIRST(Y1Y_1) (excluding ϵ\epsilon) to FIRST(XX).
      • If ϵ\epsilon is in FIRST(Y1Y_1), add FIRST(Y2Y_2), and so on.
      • If ALL YiY_i can derive ϵ\epsilon, add ϵ\epsilon to FIRST(XX).
  3. 3
    Step 3
    • FIRST(F): Looking at F(E)idF \to (E) \mid id, the first symbols are clearly ( and id. Result: { (, id }

    • FIRST(T'): Looking at TFTϵT' \to * F T' \mid \epsilon, the first symbols are * and \epsilon. *Result: { , ε }

    • FIRST(T): Looking at TFTT \to F T', TT starts with FF. So FIRST(T) = FIRST(F). Result: { (, id }

    • FIRST(E'): Looking at E+TEϵE' \to + T E' \mid \epsilon, the first symbols are + and \epsilon. Result: { +, ε }

    • FIRST(E): Looking at ETEE \to T E', EE starts with TT. So FIRST(E) = FIRST(T). Result: { (, id }

Algorithm: Computing FOLLOW Sets

  1. 1
    Step 1

    For any non-terminal AA, FOLLOW(AA) is the set of all terminals that can appear immediately to the right of AA in some valid derivation.

    ⚠️ Crucial Rule: FOLLOW() sets never contain ϵ\epsilon.

  2. 2
    Step 2
    1. Place $ (the end-of-input marker) in FOLLOW(SS), where SS is the start symbol.
    2. If there is a rule AαBβA \to \alpha B \beta, then everything in FIRST(β\beta) (except ϵ\epsilon) is added to FOLLOW(BB).
    3. If there is a rule AαBA \to \alpha B, OR a rule AαBβA \to \alpha B \beta where FIRST(β\beta) contains ϵ\epsilon, then everything in FOLLOW(AA) is added to FOLLOW(BB).
  3. 3
    Step 3
    • FOLLOW(E): EE is the start symbol, so add $. EE also appears in F(E)F \to (E). What follows EE? A ). Result: { $, ) }

    • FOLLOW(E'): EE' appears at the end of ETEE \to T E' and E+TEE' \to + T E'. By Rule 3, FOLLOW(E') gets everything in FOLLOW(E). Result: { $, ) }

    • FOLLOW(T): TT appears in ETEE \to T E' and E+TEE' \to + T E'. What follows TT is EE'. So we add FIRST(E') without ϵ\epsilon (which is {+}). Because FIRST(E') contains ϵ\epsilon, Rule 3 dictates we also add FOLLOW(E) and FOLLOW(E'). Result: { +, $, ) }

    • FOLLOW(T'): Appears at the end of TFTT \to F T' and TFTT' \to * F T'. By Rule 3, gets FOLLOW(T). Result: { +, $, ) }

    • FOLLOW(F): Appears in TFTT \to F T' and TFTT' \to * F T'. What follows FF is TT'. Add FIRST(T') without ϵ\epsilon ({*}). Since FIRST(T') has ϵ\epsilon, add FOLLOW(T). *Result: { , +, $, ) }

The Golden Rules of FIRST and FOLLOW

In high-stakes exams, double-check these two rules to catch cascading calculation errors:

  1. FIRST sets CAN contain ϵ\epsilon. If a non-terminal can vanish, ϵ\epsilon must be in its FIRST set.
  2. FOLLOW sets CANNOT contain ϵ\epsilon. It makes no mathematical sense for the 'empty string' to follow a variable. The symbol $ is used instead to indicate the end of the file/input.

Building the LL(1) Predictive Parsing Table

With FIRST and FOLLOW computed, building the 2D parsing table MM is purely mechanical. Rows = Non-terminals. Columns = Terminals (including $).

Algorithm to populate M[A,a]M[A, a]: For each rule AαA \to \alpha:

  1. For each terminal aa in FIRST(α\alpha), add AαA \to \alpha to M[A,a]M[A, a].
  2. If ϵ\epsilon is in FIRST(α\alpha), then for each terminal bb in FOLLOW(AA), add AαA \to \alpha to M[A,b]M[A, b].

Applying this to our grammar yields this completed matrix:

Non-Terminalid+*()$
EETEE \to TE'ETEE \to TE'
E'E+TEE' \to +TE'EϵE' \to \epsilonEϵE' \to \epsilon
TTFTT \to FT'TFTT \to FT'
T'TϵT' \to \epsilonTFTT' \to *FT'TϵT' \to \epsilonTϵT' \to \epsilon
FFidF \to idF(E)F \to (E)

Note: Every cell has at most one entry. This mathematically proves the grammar is strictly LL(1). If any cell had two conflicting entries, the grammar would be ambiguous or require left-factoring, and thus NOT be LL(1).

When Grammars Fail LL(1) Constraints

Not all grammars are LL(1). If you construct a parsing table and any cell contains more than one rule, the grammar is NOT LL(1).

The Classic Dangling-Else Conflict: SiEtSAaS \to iEtSA \mid a AeSϵA \to eS \mid \epsilon EbE \to b

If you calculate FIRST and FOLLOW for this grammar, you will find:

  • FIRST(AA) = { e, ϵ\epsilon }
  • FOLLOW(AA) = { e, $ }

Because AϵA \to \epsilon, we must put this rule into the columns of FOLLOW(AA). So, in cell M[A, e], we must insert AϵA \to \epsilon. But 'e' is also in FIRST(AA), so we must ALSO insert AeSA \to eS into M[A, e].

Because M[A, e] contains two rules, the parser doesn't know whether to match the else or let AA vanish. Therefore, this grammar is formally proven to be NOT LL(1).

Formal Tests: Proving a Grammar is LL(1)

  1. 1
    Step 1

    For each production rule AαA \to \alpha, compute the FIRST set of its right-hand side. Then compute FOLLOW(AA) for every non-terminal in the grammar, verifying that $ is in the start symbol's FOLLOW set and that no ϵ\epsilon exists in any FOLLOW set.

  2. 2
    Step 2

    For every non-terminal AA with competing rules AαβA \to \alpha \mid \beta:

    Rule 1: FIRST(α\alpha) \cap FIRST(β\beta) == \emptyset

    If both rules can start with the same terminal, the parser has no way to choose between them using just one token of lookahead.

  3. 3
    Step 3

    If ϵ\epsilon \in FIRST(β\beta), then additionally check:

    Rule 2: FIRST(α\alpha) \cap FOLLOW(AA) == \emptyset

    This catches the dangling-else problem: when AA can either produce a terminal (via α\alpha) or vanish (via β\beta), you must ensure the parser never faces a token that fits both scenarios simultaneously.

  4. 4
    Step 4

    Consider E+TEϵE' \to + T E' \mid \epsilon:

    • FIRST(+TE+ T E') = { + }
    • FIRST(ϵ\epsilon) = { ϵ\epsilon }
    • Rule 1 is automatic (only one non-ϵ\epsilon rule).
    • Rule 2: { + } \cap FOLLOW(EE') = { + } \cap { $, ) } = \emptyset

    All checks on our expression grammar pass, proving it is LL(1).

Architecture of a Non-Recursive Predictive Parser

A common requirement in compiler design evaluations is outlining the architecture of the LL(1) engine. It relies on four interlinked components:

  1. Input Buffer: Holds the string to be parsed, ending with the special end-marker $.
  2. Stack: A LIFO data structure that holds a sequence of grammar symbols. It is initialized with $ at the bottom and the Start Symbol at the top.
  3. Parsing Table (MM): The pre-computed 2D array acting as the system's brain.
  4. Predictive Parsing Program: The algorithm driving the system. It compares the stack top (XX) with the current input token (aa):
    • If X==aX == a, Match (pop stack, advance input).
    • If XX is a non-terminal, Lookup M[X,a]M[X, a] and push the resulting rule onto the stack in reverse order.
    • If M[X,a]M[X, a] is blank, throw a Syntax Error.

Tracing the LL(1) Parser

Here is the trace for parsing the string id + id through our expression grammar table:

StackInputAction (Lookup M[top,input]M[\text{top}, \text{input}])
$Eid + id$M[E,id]    M[E, id] \implies Pop EE, Push ETE'T (reverse order)
$E'Tid + id$M[T,id]    M[T, id] \implies Pop TT, Push TFT'F
$E'T'Fid + id$M[F,id]    M[F, id] \implies Pop FF, Push idid
$E'T'idid + id$Match! Top of stack (id) matches input (id). Pop stack, advance input.
$E'T'$+ id$M[T,+]    M[T', +] \implies Pop TT', Push ϵ\epsilon (i.e., just pop)
$E'$+ id$M[E,+]    M[E', +] \implies Pop EE', Push ET+E'T+
$E'T+$+ id$Match! Pop +, advance input.
$E'Tid$M[T,id]    M[T, id] \implies Pop TT, Push TFT'F
$E'T'Fid$M[F,id]    M[F, id] \implies Pop FF, Push idid
$E'T'idid$Match! Pop id, advance input.
$E'T'$$M[T', \] \impliesPopPopT',Push, Push \epsilon$
$E'$$M[E', \] \impliesPopPopE',Push, Push \epsilon$
$$Stack empty, Input empty     \implies ACCEPT

Why Reverse Order?

Because a stack is Last-In-First-Out (LIFO). If we apply the rule ETEE \to T E' and we want the parser to process TT first (since we read input left-to-right), TT must be at the very top of the stack. Pushing EE' first, and then pushing TT, ensures that TT ends up on top.

Error Recovery in Predictive Parsing

Even the best LL(1) parser will encounter syntax errors in real-world inputs. The standard approach is Panic-Mode Recovery.

When the parser encounters a blank entry in the parsing table M[A,a]M[A, a] (no valid rule for the current non-terminal AA and input token aa), it enters panic mode. The parser skips (discards) input tokens until it finds a synchronizing token—a token that belongs to the FOLLOW set of the current non-terminal or some ancestor non-terminal on the stack.

Algorithm for panic-mode recovery:

  1. Pop entries from the stack until a non-terminal XX is on top whose FOLLOW set contains the current input token aa.
  2. If no such non-terminal exists on the stack, skip the current input token and try again.
  3. Once a synchronizing token is found, resume normal parsing.

Key insight for exams: Synchronizing tokens are chosen from FOLLOW sets because they are "natural delimiters"—terminals like ), ;, or end that logically signal the end of a construct. After skipping to such a delimiter, the parser has a realistic chance of resynchronizing with correct subsequent input.

A table-driven LL(1) parser uses an explicit stack and a pre-computed 2D parsing table MM to guide decisions.

Mechanism:

  • The parsing table M[A,a]M[A, a] stores the rule to apply when non-terminal AA is on top of the stack and terminal aa is the input.
  • At each step, if the stack top is a non-terminal, it looks up the table and pushes the RHS. If the stack top is a terminal, it matches.

Pros:

  • Fully deterministic — no backtracking, guaranteed O(n)O(n) time.
  • The table can be generated automatically by tools (ANTLR).
  • Language changes only require regenerating the table.

Cons:

  • Requires pre-computation of FIRST and FOLLOW sets.
  • The table can consume significant memory.
  • Many grammars cannot be made LL(1) even after left-factoring.

Common Questions

Knowledge Check

Question 1 of 9
Q1Single choice

Which of the following statements is true regarding LL(1) grammars and left recursion?

Top-Down Parsing and LL(1) Grammars | Compiler Design (CD) | Coursify