Coursify

Compiler Design (CD)

Context-Free Grammars and Push-Down Automata

40

The formal foundation of syntax analysis. Understand how Context-Free Grammars define the legal sentence structure of a programming language, and how Push-Down Automata serve as the computational model for recognizing them.

Learning Goals

  • Define a Context-Free Grammar (CFG) using the formal 4-tuple notation.
  • Distinguish between terminals and non-terminals in a given grammar.
  • Perform Leftmost (LMD) and Rightmost (RMD) derivations for a given string.
  • Construct Concrete Parse Trees and differentiate them from Abstract Syntax Trees (AST).
  • Identify ambiguous grammars and explain why ambiguity is a critical failure in compiler design.

The Role of the Parser

In Phase 1, the Lexical Analyzer grouped characters into a flat stream of tokens. In Phase 2, the Syntax Analyzer (Parser) takes that token stream and determines if it forms a valid sentence according to the rules of the programming language.

Why regular expressions aren't enough: Regular expressions (and the Finite Automata that run them) cannot count or remember unbounded information. This means they cannot verify matched structures like balanced parentheses ((())) or nested blocks { { } }. Because every modern programming language uses heavily nested structures (nested if statements, nested loops, nested math expressions), we require a more powerful mathematical tool: the Context-Free Grammar (CFG).


Formal Definition of a Context-Free Grammar

A Context-Free Grammar is a set of rules describing how to form valid sentences. Mathematically, it is defined as a 4-tuple: G=(V,T,P,S)G = (V, T, P, S)

  1. VV (Variables / Non-terminals): Syntactic variables that represent sets of strings (e.g., exprexpr, stmtstmt, blockblock). They are placeholders that must be expanded further.
  2. TT (Terminals): The basic symbols from which strings are formed. In a compiler, the terminals are the tokens returned by the lexer (e.g., id, +, while, int).
  3. PP (Productions): The rules that define how non-terminals can be replaced. They always have the form: Non-terminal → Sequence of terminals and/or non-terminals (e.g., EE+EE \to E + E)
  4. SS (Start Symbol): A special non-terminal in VV that represents the entire program or the root concept we are trying to parse.

Trace: Leftmost and Rightmost Derivations

  1. 1
    Step 1

    A derivation is a sequence of rule applications that transforms the Start Symbol into a string of purely terminal symbols.

    Consider this standard expression grammar:

    1. EE+EE \to E + E
    2. EEEE \to E * E
    3. E(E)E \to ( E )
    4. EidE \to id

    Our Goal: Derive the string id + id * id from the start symbol EE.

  2. 2
    Step 2

    In a Leftmost Derivation, we always choose the leftmost non-terminal in the current string to expand next.

    EE EE\Rightarrow E * E (using rule 2) E+EE\Rightarrow E + E * E (expanding the left EE using rule 1) id+EE\Rightarrow id + E * E (expanding the left EE using rule 4) id+idE\Rightarrow id + id * E (expanding the left EE using rule 4) id+idid\Rightarrow id + id * id (expanding the left EE using rule 4)

    (Notice that at every step, we specifically targeted the EE that was furthest to the left.)

  3. 3
    Step 3

    In a Rightmost Derivation, we always choose the rightmost non-terminal to expand next.

    EE E+E\Rightarrow E + E (using rule 1) E+EE\Rightarrow E + E * E (expanding the right EE using rule 2) E+Eid\Rightarrow E + E * id (expanding the right EE using rule 4) E+idid\Rightarrow E + id * id (expanding the right EE using rule 4) id+idid\Rightarrow id + id * id (expanding the right EE using rule 4)

    (Rightmost derivations are also called canonical derivations, and they form the mathematical basis for Bottom-Up parsers like LR and Yacc.)

  4. 4
    Step 4

    A Parse Tree is a graphical representation of a derivation that ignores the order in which replacements were made. The root is the Start Symbol, the internal nodes are Non-terminals, and the leaves are Terminals.

    Whether you followed the LMD above or the RMD above, they both correspond to this exact same parse tree:

    Because mathematical operators are evaluated bottom-up, this tree correctly forces id * id to be evaluated before the + operation.

A Leftmost Derivation or LMDLMD is a derivation in which, at every step, the leftmost non-terminal in the current sentential form is chosen for expansion.

Formally, for a CFG GG, a derivation sequence is a Leftmost Derivation if the production is applied to the leftmost non-terminal. We denote it as: αlmβ\alpha \Rightarrow_{lm} \beta

Relation to Top-Down Parsing: Top-Down parsers (Recursive Descent, LL parsers) naturally construct a Leftmost Derivation. They start from the Start Symbol and repeatedly expand the leftmost pending non-terminal, matching terminals against the input from left to right. This is why LL parsers produce an LMDLMD — the first 'L' stands for "Left-to-right scan" and the second 'L' stands for "Leftmost derivation."

A Parse Tree perfectly reflects the grammar rules. Every non-terminal expanded during the derivation has a corresponding node in the tree.

Problem: It is overly verbose. Nodes like ETFidE \to T \to F \to id form long single-child chains that waste memory and slow down the compiler.

The Disaster of Ambiguity

A grammar is ambiguous if it can produce more than one valid parse tree for the same input string (equivalently: more than one Leftmost Derivation or more than one Rightmost Derivation). Identifying and resolving ambiguous grammars is a critical, frequently examined concept in compiler design.

If a grammar is ambiguous, it is fundamentally useless for a compiler. A compiler cannot guess what the programmer meant. If id + id * id yields two trees — one where + happens first, and one where * happens first — the compiler cannot safely generate machine code.

There is no general algorithm to determine if an arbitrary CFG is ambiguous (the problem is formally undecidable). Ambiguity must be resolved manually by rewriting the grammar to enforce strict precedence and associativity rules.

The Classic "Dangling-Else" Ambiguity

The most famous ambiguity in computer science occurs with if-then-else statements. Consider this grammar: Stmtif Expr then StmtStmt \to \text{if } Expr \text{ then } Stmt Stmtif Expr then Stmt else StmtStmt \to \text{if } Expr \text{ then } Stmt \text{ else } Stmt

Now parse this code:

1if x > 0 then 2 if y > 0 then print(y) 3else print(x)

Does the else belong to the first (outer) if, or the second (inner) if? Both are perfectly valid derivations under this grammar!

Most languages (like C and Java) resolve this informally in their spec: An else always binds to the closest unmatched if. However, to build an actual parser, the grammar itself must be rewritten to mathematically prevent the outer-bind interpretation.

Proving Ambiguity Formally (A common assessment requirement): If asked to justify if a grammar like SaSSaaS \to aS \mid Sa \mid a is ambiguous, you must pick a single test string (e.g., aaa) and draw two distinct Leftmost Derivations (or Parse Trees) for it.

Derivation 1 (Expanding Left first): SaSaaSaaaS \Rightarrow aS \Rightarrow aaS \Rightarrow aaa

Derivation 2 (Expanding Right first): SSaSaaaaaS \Rightarrow Sa \Rightarrow Saa \Rightarrow aaa

Because the exact same string aaa can be generated using two completely different leftmost derivations (resulting in two different parse trees), the grammar is formally proven to be ambiguous.

Push-Down Automata (PDA)

Just as Finite Automata are the theoretical machines that execute Regular Expressions, Push-Down Automata (PDA) are the theoretical machines that recognize Context-Free Grammars.

PDA=Finite Automaton+A Stack\text{PDA} = \text{Finite Automaton} + \text{A Stack}

The addition of a Last-In-First-Out (LIFO) stack gives the machine unbounded memory. When parsing balanced parentheses ((())), a PDA can push an item onto the stack for every (, and pop an item for every ). If the stack is empty at the end, the string is balanced.

Every parser generator (like Yacc or Bison) ultimately generates a deterministic Push-Down Automaton encoded as a transition table and a runtime stack.

Formalizing the Machine: The PDA Tuple

  1. 1
    Step 1

    A Push-Down Automaton (PDA) is formally defined as a 7-tuple: M=(Q,Σ,Γ,δ,q0,Z0,F)M = (Q, \Sigma, \Gamma, \delta, q_0, Z_0, F)

    SymbolNameDescription
    QQStatesA finite, non-empty set of states
    Σ\SigmaInput AlphabetA finite set of input symbols (in a compiler, these are the tokens)
    Γ\GammaStack AlphabetA finite set of symbols that can be stored on the stack
    δ\deltaTransition FunctionThe rules governing state changes and stack operations
    q0q_0Initial StateThe state the PDA starts in (q0Qq_0 \in Q)
    Z0Z_0Initial Stack SymbolMarks the bottom of the stack (Z0ΓZ_0 \in \Gamma)
    FFFinal StatesThe accepting states (FQF \subseteq Q)
  2. 2
    Step 2

    The heart of the PDA is δ\delta. A single transition does three things based on the tuple: (current state, input symbol, top-of-stack).

    It computes:

    1. A move to a new state.
    2. The optional consumption of an input symbol.
    3. Popping the top-of-stack and pushing zero or more new symbols.

    If it pushes ε\varepsilon, it essentially just pops the item. If it pushes XYXY, it puts XX and YY onto the stack.

  3. 3
    Step 3

    Consider the language L={(n)nn1}L = \{(^n)^n \mid n \geq 1\} — strings with nn opening parentheses followed by nn closing parentheses.

    Intuition:

    • In state q0q_0: for every '(' seen, push '(' onto the stack, then transition to state q1q_1.
    • In state q1q_1: continue pushing '(' for each '(' seen.
    • Transition to q2q_2 when a ')' is encountered and the top of stack is '(' (pop it).
    • In state q2q_2: continue popping '(' for each ')' seen.
    • Accept in qfq_f when input is exhausted and the top of the stack is exactly Z0Z_0.
  4. 4
    Step 4

    The transition function δ\delta is given by this table:

    Current StateInputStack TopNext StateStack Operation
    q0q_0((Z0Z_0q1q_1Push (Z0(Z_0
    q0q_0((((q1q_1Push ((((
    q1q_1((Z0Z_0q1q_1Push (Z0(Z_0
    q1q_1((((q1q_1Push ((((
    q1q_1))((q2q_2Pop
    q2q_2))((q2q_2Pop
    q2q_2ε\varepsilonZ0Z_0qfq_fAccept!

    Reading the table: In state q1q_1, when we see ')' in the input and '(' is on top of the stack, we transition to q2q_2 and pop the '(' from the stack.

  5. 5
    Step 5

    Let's trace the PDA as it processes the input string ((())):

    StepRemaining InputStateStack (top → bottom)Action
    0((()))((()))q0q_0Z0Z_0Initial configuration
    1(()))(()))q1q_1(Z0(Z_0Read '(', push '('
    2()))()))q1q_1((((Z_0$Read '(', push '('
    3))))))q1q_1((((((Z_0$Read '(', push '('
    4))))q2q_2((((Z_0$Read ')', pop '('
    5))q2q_2(Z0(Z_0Read ')', pop '('
    6ε\varepsilonq2q_2Z0Z_0Read ')', pop '('
    7ε\varepsilonqfq_fZ0Z_0ε\varepsilon-transition, accept!

    Result: The string is ACCEPTED. The PDA pushed 3 symbols, popped 3 symbols, and reached the final state perfectly.

CFG to PDA Conversion

There is a fundamental correspondence between Context-Free Grammars and Push-Down Automata: every CFG can be mechanically converted to an equivalent PDA, and every PDA defines a context-free language whose grammar can be reconstructed. This bi-directional equivalence is one of the most important theorems in formal language theory.

Construction Algorithm (CFG → PDA): Given a CFG G=(V,T,P,S)G = (V, T, P, S), we construct a PDA that accepts exactly L(G)L(G) by empty stack. The PDA only needs a single state qq, because the stack alone encodes the entire derivation state!

  1. Expand (simulate a production): For each production AαA \to \alpha in PP, add a transition: δ(q,ε,A)={(q,α)}\delta(q, \varepsilon, A) = \{(q, \alpha)\} (When AA is on top of the stack, pop it and push α\alpha. This simulates one derivation step without consuming any input.)

  2. Match (verify a terminal): For each terminal aTa \in T, add a transition: δ(q,a,a)={(q,ε)}\delta(q, a, a) = \{(q, \varepsilon)\} (When the input symbol matches the top-of-stack terminal, pop it and consume the input.)

How it works: The PDA starts with the Start Symbol SS on the stack. It non-deterministically expands non-terminals using the grammar rules. When a terminal appears on top of the stack, it must match the current input symbol. If all input is consumed and the stack is empty, the string is accepted.

Why PDA is the Theoretical Minimum for Syntax Parsing

Understanding how the Chomsky Hierarchy maps to compiler phases is a core learning objective often evaluated in formal language theory:

  • Type 3 (Regular Languages) → Finite Automata → Lexical Analysis: Regular expressions and DFAs can recognize tokens because tokens are simple patterns. They have no memory and cannot match nested structures.

  • Type 2 (Context-Free) → Push-Down Automata → Syntax Analysis: A PDA's stack provides the unbounded counting memory required to match nested constructs like parentheses, braces, and if-else chains. This is the weakest machine capable of parsing the syntax of a typical programming language.

  • Type 1 (Context-Sensitive) → Linear Bounded Automata → Semantic Analysis: Type checking, scope resolution, and rules that depend on context (e.g., 'this variable was declared earlier') require context-sensitive power.

Common Questions

Knowledge Check

Question 1 of 8
Q1Single choice

Which formal grammar type is inherently recognized by a push-down automaton?

Context-Free Grammars and Push-Down Automata | Compiler Design (CD) | Coursify