Ambiguity, Operator Precedence, and YACC
Understand how to handle ambiguity using specialized parsers like Operator Precedence, and learn the architecture of industry-standard parser generators like YACC.
Learning Goals
- Define ambiguity formally and identify ambiguous grammars with multiple parse trees.
- Identify the characteristics of an Operator Grammar.
- Explain how Operator Precedence Parsers uniquely handle ambiguous grammars.
- Trace an Operator Precedence Parser step-by-step on a given input string.
- Describe the structure of a YACC file and its three sections.
- Explain how precedence directives (%left, %right, %nonassoc) resolve shift-reduce conflicts in YACC.
- Distinguish between YACC and Bison and their respective features.
- Illustrate the YACC-Lex integration pipeline and the role of yylex() and yyparse().
Ambiguity in Grammars [PYQ 2023, 2024]
A grammar is ambiguous if there exists at least one string that has two or more distinct parse trees (or equivalently, two or more distinct leftmost/rightmost derivations). Ambiguity is a property of the grammar, not the language -- an ambiguous grammar may generate a language that also has an unambiguous grammar.
Classic Example: Dangling Else [PYQ 2023]
Consider the grammar for conditional statements:
For the string if E then if E then a else a, there are two valid parse trees depending on whether the else associates with the inner or outer if. Most languages resolve this by adopting the rule that else associates with the nearest unmatched if.
Resolving Ambiguity: There are three common strategies:
- Rewrite the grammar to be unambiguous (often makes the grammar large and unnatural).
- Use precedence and associativity rules to impose an ordering on operators.
- Use specialized parsers like the Operator Precedence Parser that inherently encode precedence.
Two Parse Trees for an Ambiguous Expression
Consider the ambiguous grammar and the string .
The string has two parse trees: one where + is evaluated first (incorrect) and one where * is evaluated first (correct). This ambiguity is exactly what precedence rules resolve.
In the absence of precedence rules, both parse trees are equally valid, making the grammar ambiguous. The Operator Precedence Parser and YACC precedence directives both resolve this by ensuring * binds tighter than +.
Ambiguity vs LR Conflicts
Ambiguity in a grammar inevitably causes conflicts in LR parsers: a shift-reduce conflict corresponds to an ambiguity where one derivation continues (shift) and another completes (reduce). A reduce-reduce conflict means two different productions could both complete at the same point. These are not bugs in the parser -- they are signals that the grammar is ambiguous.
Operator Precedence Parser [PYQ 2023, 2024]
Most parsers (LL, LR) absolutely cannot parse ambiguous grammars. The Operator Precedence Parser is the exception [PYQ 2024]. It is a Bottom-Up (Shift-Reduce) parser specifically designed to evaluate mathematical expressions, even if the grammar is ambiguous (like ).
What is an Operator Grammar? A grammar is an operator grammar if it has two properties:
- No (empty) productions on the right-hand side of any rule.
- No two Non-Terminals are adjacent on the right-hand side of any rule (e.g., is invalid; is valid).
These restrictions ensure that every production contains at least one terminal (operator), which allows us to define precedence relations purely between terminals.
How it works: Instead of using complex DFA state machines, it uses mathematical precedence relations between terminal symbols (operators):
- : yields precedence to (the operator on the stack has lower precedence, so we shift)
- : takes precedence over (the operator on the stack has higher precedence, so we reduce)
- : has the same precedence as (same precedence level; associativity decides shift vs reduce)
If the parser sees , it compares + and *. Since + *, it shifts the * onto the stack, ensuring multiplication is evaluated (reduced) first.
Precedence Relation Table
For a grammar with operators +, *, terminals id, and the end marker $, the precedence relation table is:
| id | + | * | $ | |
|---|---|---|---|---|
| id | ||||
| + | ||||
| * | ||||
| $ |
Reading the table: To decide between shift or reduce, find the row for the topmost terminal on the stack and the column for the next input symbol. If the relation is , shift. If , reduce. If , shift or reduce depending on associativity. If no relation exists, it is a syntax error.
Note that + * (row +, column *) confirms that * has higher precedence -- when + is on the stack and * is next, we shift. Conversely, * + (row *, column +) confirms that when * is on the stack and + is next, we reduce the * first.
Tracing Operator Precedence Parser on id + id * id
- 1Step 1
Using the grammar with standard precedence (* > +, both left-associative), construct the precedence table as shown above. The end-marker $ is added to both ends of the input string.
- 2Step 2
Stack = The parser always begins with appended to the end of the input.
- 3Step 3
Stack top terminal is \lessdot id. So we shift. Stack =
- 4Step 4
Stack top terminal is id. Next input is +. From the table: id \gtrdot +. So we reduce id to E. Stack = \lessdot + (shift +). Stack =
- 5Step 5
Stack top terminal is +. Next input is id. From the table: + \lessdot id. So we shift id. Stack =
- 6Step 6
Stack top terminal is id. Next input is *. From the table: id \gtrdot *. Reduce id to E. Stack = E + E * Input = id $
- 7Step 7
Stack top terminal is *. Next input is id. From the table: * \lessdot id. Shift id. Stack =
- 8Step 8
Stack top terminal is id. Next input is . Reduce id to E. Stack = . Reduce E * E to E. Stack = . Reduce E + E to E. Stack = \doteq $. Accept. Parse complete.
Strengths:
- Can parse ambiguous grammars by encoding precedence directly
- Simple to implement for expression grammars
- Very efficient -- only compares adjacent operators
- Small memory footprint (just a precedence table)
Limitations:
- Only works on operator grammars (no adjacent non-terminals, no epsilon productions)
- Cannot handle general CFGs
- Cannot distinguish between operators with the same precedence but different semantics
- Limited error detection capabilities
Key Insight: Why Operator Precedence Parser Handles Ambiguity
The Operator Precedence Parser does not construct parse trees at all. It only decides the order of evaluation by comparing adjacent operators. Because it never builds a full derivation, the existence of multiple possible parse trees (ambiguity) does not matter -- the precedence table unambiguously decides whether to shift or reduce at every step. This is fundamentally different from LR parsers, which must enumerate all possible derivations and thus collide when multiple derivations exist.
YACC (Yet Another Compiler Compiler) [PYQ 2022]
YACC is an industry-standard UNIX tool used to automatically generate C code for a parser.
- Underlying Algorithm: YACC internally uses the LALR(1) parsing algorithm.
- Input/Output: It takes a file containing a Context-Free Grammar (usually with
.yextension) and outputs a C program (y.tab.c) containing the parsing table and the shift-reduce logic. - Lexical Integration: YACC does not scan tokens itself. It is designed to work perfectly alongside Lex (which outputs
lex.yy.c). YACC repeatedly callsyylex()to fetch the next token.
Structure of a YACC File:
A standard YACC program is strictly divided into three sections separated by %%:
1/* 1. DEFINITIONS SECTION */ 2%token ID NUM 3%left '+' '-' 4%left '*' '/' 5%% 6 7/* 2. RULES SECTION (The Context Free Grammar) */ 8expr : expr '+' expr { $$ = $1 + $3; } 9 | expr '*' expr { $$ = $1 * $3; } 10 | NUM { $$ = $1; } 11 ; 12%% 13 14/* 3. USER SUBROUTINES (C Code) */ 15int main() { 16 yyparse(); // Starts the parser 17 return 0; 18} 19void yyerror(char *s) { 20 printf("Syntax Error"); 21}
Building a Complete YACC Calculator Specification
- 1Step 1
- 2Step 2
Declare all token names using %token (ID, NUM, PLUS, STAR, LPAREN, RPAREN). Then declare precedence using %left and %right directives. Earlier declarations have lower precedence. For example:
%token NUM ID %left '+' '-' /* lowest precedence, left-associative */ %left '*' '/' /* higher precedence, left-associative */ %right UMINUS /* highest precedence, right-associative (for unary minus) */ - 3Step 3
Each rule has a production followed by a C semantic action enclosed in braces. The pseudo-variables $$, 2, $3 refer to the result and the values of the symbols on the right-hand side:
expr : expr '+' expr { $$ = $1 + $3; } | expr '-' expr { $$ = $1 - $3; } | expr '*' expr { $$ = $1 * $3; } | '(' expr ')' { $$ = $2; } | '-' expr %prec UMINUS { $$ = -$2; } | NUM { $$ = $1; } ; - 4Step 4
The third section contains standard C code. You must provide a main() function that calls yyparse() to start parsing, and an yyerror() function for error reporting:
int main() { yyparse(); return 0; } void yyerror(char *s) { fprintf(stderr, "Error: %s\ ", s); } - 5Step 5
The build process:
yacc -d calc.ygenerates y.tab.c and y.tab.h (the -d flag creates the header for Lex)lex calc.lgenerates lex.yy.ccc lex.yy.c y.tab.c -ly -ll -o calccompiles and links both files with the Lex and YACC libraries- Run
./calcto start the calculator
Precedence Directives in YACC
YACC provides three precedence directives to resolve shift-reduce conflicts:
| Directive | Meaning | Example Use |
|---|---|---|
%left | Left-associative; reduce on same precedence | %left '+' '-' -- a+b+c = (a+b)+c |
%right | Right-associative; shift on same precedence | %right '^' -- a^b^c = a^(b^c) |
%nonassoc | Non-associative; error on same precedence | %nonassoc '<' -- a<b<c is illegal |
Conflict Resolution Mechanism: When YACC encounters a shift-reduce conflict, it compares the precedence of:
- The terminal that would be shifted (the look-ahead token)
- The production that would be reduced (whose precedence is set by its rightmost terminal, or by a
%precoverride)
If the shift terminal has higher precedence, YACC shifts. If the reduce production has higher precedence, YACC reduces. If they have equal precedence, the associativity directive decides: %left causes a reduce, %right causes a shift, %nonassoc causes an error.
YACC-Lex Integration Pipeline
YACC and Lex work together in a tightly integrated pipeline. Lex handles lexical analysis (tokenization) and YACC handles syntactic analysis (parsing). The flow is:
YACC calls yylex() (defined in lex.yy.c) each time it needs a new token. The token value is returned as an integer, and any associated semantic value (like a number) is passed through the global variable yylval. The header file y.tab.h (generated with yacc -d) contains the token definitions that both Lex and YACC use.
Semantic Actions and Pseudo-Variables
In YACC rules, semantic actions are C code enclosed in { } that execute when a production is reduced. YACC provides pseudo-variables:
| Variable | Meaning |
|---|---|
$$ | The result value of the current production (left-hand side) |
$1 | Value of the first symbol on the right-hand side |
$2 | Value of the second symbol on the right-hand side |
$3 | Value of the third symbol on the right-hand side |
For example, in the rule expr : expr '+' expr { $$ = $1 + $3; }, $1 holds the value of the first expr, $3 holds the value of the second expr, and $$ is assigned the sum. The values are stored in a value stack parallel to the parsing stack, and their type is controlled by the %union declaration.
Frequently Asked Questions
Knowledge Check
Which of the following is NOT a required property of an Operator Grammar? [PYQ 2023, 2024]