Context-Free Grammars and Push-Down Automata
The formal foundation of syntax analysis. Understand how Context-Free Grammars define the legal sentence structure of a programming language, and how Push-Down Automata serve as the computational model for recognizing them.
Learning Goals
- Define a Context-Free Grammar (CFG) using the formal 4-tuple notation.
- Distinguish between terminals and non-terminals in a given grammar.
- Perform Leftmost (LMD) and Rightmost (RMD) derivations for a given string.
- Construct Concrete Parse Trees and differentiate them from Abstract Syntax Trees (AST).
- Identify ambiguous grammars and explain why ambiguity is a critical failure in compiler design.
The Role of the Parser
In Phase 1, the Lexical Analyzer grouped characters into a flat stream of tokens. In Phase 2, the Syntax Analyzer (Parser) takes that token stream and determines if it forms a valid sentence according to the rules of the programming language.
Why regular expressions aren't enough:
Regular expressions (and the Finite Automata that run them) cannot count or remember unbounded information. This means they cannot verify matched structures like balanced parentheses ((())) or nested blocks { { } }. Because every modern programming language uses heavily nested structures (nested if statements, nested loops, nested math expressions), we require a more powerful mathematical tool: the Context-Free Grammar (CFG).
Formal Definition of a Context-Free Grammar
A Context-Free Grammar is a set of rules describing how to form valid sentences. Mathematically, it is defined as a 4-tuple:
- (Variables / Non-terminals): Syntactic variables that represent sets of strings (e.g., , , ). They are placeholders that must be expanded further.
- (Terminals): The basic symbols from which strings are formed. In a compiler, the terminals are the tokens returned by the lexer (e.g.,
id,+,while,int). - (Productions): The rules that define how non-terminals can be replaced. They always have the form:
Non-terminal → Sequence of terminals and/or non-terminals(e.g., ) - (Start Symbol): A special non-terminal in that represents the entire program or the root concept we are trying to parse.
Trace: Leftmost and Rightmost Derivations
- 1Step 1
A derivation is a sequence of rule applications that transforms the Start Symbol into a string of purely terminal symbols.
Consider this standard expression grammar:
Our Goal: Derive the string
id + id * idfrom the start symbol . - 2Step 2
In a Leftmost Derivation, we always choose the leftmost non-terminal in the current string to expand next.
(using rule 2) (expanding the left using rule 1) (expanding the left using rule 4) (expanding the left using rule 4) (expanding the left using rule 4)
(Notice that at every step, we specifically targeted the that was furthest to the left.)
- 3Step 3
In a Rightmost Derivation, we always choose the rightmost non-terminal to expand next.
(using rule 1) (expanding the right using rule 2) (expanding the right using rule 4) (expanding the right using rule 4) (expanding the right using rule 4)
(Rightmost derivations are also called canonical derivations, and they form the mathematical basis for Bottom-Up parsers like LR and Yacc.)
- 4Step 4
A Parse Tree is a graphical representation of a derivation that ignores the order in which replacements were made. The root is the Start Symbol, the internal nodes are Non-terminals, and the leaves are Terminals.
Whether you followed the LMD above or the RMD above, they both correspond to this exact same parse tree:
Because mathematical operators are evaluated bottom-up, this tree correctly forces
id * idto be evaluated before the+operation.
A Leftmost Derivation or is a derivation in which, at every step, the leftmost non-terminal in the current sentential form is chosen for expansion.
Formally, for a CFG , a derivation sequence is a Leftmost Derivation if the production is applied to the leftmost non-terminal. We denote it as:
Relation to Top-Down Parsing: Top-Down parsers (Recursive Descent, LL parsers) naturally construct a Leftmost Derivation. They start from the Start Symbol and repeatedly expand the leftmost pending non-terminal, matching terminals against the input from left to right. This is why LL parsers produce an — the first 'L' stands for "Left-to-right scan" and the second 'L' stands for "Leftmost derivation."
A Parse Tree perfectly reflects the grammar rules. Every non-terminal expanded during the derivation has a corresponding node in the tree.
Problem: It is overly verbose. Nodes like form long single-child chains that waste memory and slow down the compiler.
The Disaster of Ambiguity
A grammar is ambiguous if it can produce more than one valid parse tree for the same input string (equivalently: more than one Leftmost Derivation or more than one Rightmost Derivation). Identifying and resolving ambiguous grammars is a critical, frequently examined concept in compiler design.
If a grammar is ambiguous, it is fundamentally useless for a compiler. A compiler cannot guess what the programmer meant. If id + id * id yields two trees — one where + happens first, and one where * happens first — the compiler cannot safely generate machine code.
There is no general algorithm to determine if an arbitrary CFG is ambiguous (the problem is formally undecidable). Ambiguity must be resolved manually by rewriting the grammar to enforce strict precedence and associativity rules.
The Classic "Dangling-Else" Ambiguity
The most famous ambiguity in computer science occurs with if-then-else statements. Consider this grammar:
Now parse this code:
1if x > 0 then 2 if y > 0 then print(y) 3else print(x)
Does the else belong to the first (outer) if, or the second (inner) if? Both are perfectly valid derivations under this grammar!
Most languages (like C and Java) resolve this informally in their spec: An else always binds to the closest unmatched if. However, to build an actual parser, the grammar itself must be rewritten to mathematically prevent the outer-bind interpretation.
Proving Ambiguity Formally (A common assessment requirement):
If asked to justify if a grammar like is ambiguous, you must pick a single test string (e.g., aaa) and draw two distinct Leftmost Derivations (or Parse Trees) for it.
Derivation 1 (Expanding Left first):
Derivation 2 (Expanding Right first):
Because the exact same string aaa can be generated using two completely different leftmost derivations (resulting in two different parse trees), the grammar is formally proven to be ambiguous.
Push-Down Automata (PDA)
Just as Finite Automata are the theoretical machines that execute Regular Expressions, Push-Down Automata (PDA) are the theoretical machines that recognize Context-Free Grammars.
The addition of a Last-In-First-Out (LIFO) stack gives the machine unbounded memory. When parsing balanced parentheses ((())), a PDA can push an item onto the stack for every (, and pop an item for every ). If the stack is empty at the end, the string is balanced.
Every parser generator (like Yacc or Bison) ultimately generates a deterministic Push-Down Automaton encoded as a transition table and a runtime stack.
Formalizing the Machine: The PDA Tuple
- 1Step 1
A Push-Down Automaton (PDA) is formally defined as a 7-tuple:
Symbol Name Description States A finite, non-empty set of states Input Alphabet A finite set of input symbols (in a compiler, these are the tokens) Stack Alphabet A finite set of symbols that can be stored on the stack Transition Function The rules governing state changes and stack operations Initial State The state the PDA starts in () Initial Stack Symbol Marks the bottom of the stack () Final States The accepting states () - 2Step 2
The heart of the PDA is . A single transition does three things based on the tuple:
(current state, input symbol, top-of-stack).It computes:
- A move to a new state.
- The optional consumption of an input symbol.
- Popping the top-of-stack and pushing zero or more new symbols.
If it pushes , it essentially just pops the item. If it pushes , it puts and onto the stack.
- 3Step 3
Consider the language — strings with opening parentheses followed by closing parentheses.
Intuition:
- In state : for every '(' seen, push '(' onto the stack, then transition to state .
- In state : continue pushing '(' for each '(' seen.
- Transition to when a ')' is encountered and the top of stack is '(' (pop it).
- In state : continue popping '(' for each ')' seen.
- Accept in when input is exhausted and the top of the stack is exactly .
- 4Step 4
The transition function is given by this table:
Current State Input Stack Top Next State Stack Operation Push Push Push Push Pop Pop Accept! Reading the table: In state , when we see ')' in the input and '(' is on top of the stack, we transition to and pop the '(' from the stack.
- 5Step 5
Let's trace the PDA as it processes the input string
((())):Step Remaining Input State Stack (top → bottom) Action 0 Initial configuration 1 Read '(', push '(' 2 Z_0$ Read '(', push '(' 3 Z_0$ Read '(', push '(' 4 Z_0$ Read ')', pop '(' 5 Read ')', pop '(' 6 Read ')', pop '(' 7 -transition, accept! Result: The string is ACCEPTED. The PDA pushed 3 symbols, popped 3 symbols, and reached the final state perfectly.
CFG to PDA Conversion
There is a fundamental correspondence between Context-Free Grammars and Push-Down Automata: every CFG can be mechanically converted to an equivalent PDA, and every PDA defines a context-free language whose grammar can be reconstructed. This bi-directional equivalence is one of the most important theorems in formal language theory.
Construction Algorithm (CFG → PDA): Given a CFG , we construct a PDA that accepts exactly by empty stack. The PDA only needs a single state , because the stack alone encodes the entire derivation state!
-
Expand (simulate a production): For each production in , add a transition: (When is on top of the stack, pop it and push . This simulates one derivation step without consuming any input.)
-
Match (verify a terminal): For each terminal , add a transition: (When the input symbol matches the top-of-stack terminal, pop it and consume the input.)
How it works: The PDA starts with the Start Symbol on the stack. It non-deterministically expands non-terminals using the grammar rules. When a terminal appears on top of the stack, it must match the current input symbol. If all input is consumed and the stack is empty, the string is accepted.
Why PDA is the Theoretical Minimum for Syntax Parsing
Understanding how the Chomsky Hierarchy maps to compiler phases is a core learning objective often evaluated in formal language theory:
-
Type 3 (Regular Languages) → Finite Automata → Lexical Analysis: Regular expressions and DFAs can recognize tokens because tokens are simple patterns. They have no memory and cannot match nested structures.
-
Type 2 (Context-Free) → Push-Down Automata → Syntax Analysis: A PDA's stack provides the unbounded counting memory required to match nested constructs like parentheses, braces, and
if-elsechains. This is the weakest machine capable of parsing the syntax of a typical programming language. -
Type 1 (Context-Sensitive) → Linear Bounded Automata → Semantic Analysis: Type checking, scope resolution, and rules that depend on context (e.g., 'this variable was declared earlier') require context-sensitive power.
Common Questions
Knowledge Check
Which formal grammar type is inherently recognized by a push-down automaton?