Ambiguity, Operator Precedence, and YACC

40 mins

Understand how to handle ambiguity using specialized parsers like Operator Precedence, and learn the architecture of industry-standard parser generators like YACC.

Learning Goals

Define ambiguity formally and identify ambiguous grammars with multiple parse trees.
Identify the characteristics of an Operator Grammar.
Explain how Operator Precedence Parsers uniquely handle ambiguous grammars.
Trace an Operator Precedence Parser step-by-step on a given input string.
Describe the structure of a YACC file and its three sections.
Explain how precedence directives (%left, %right, %nonassoc) resolve shift-reduce conflicts in YACC.
Distinguish between YACC and Bison and their respective features.
Illustrate the YACC-Lex integration pipeline and the role of yylex() and yyparse().

Ambiguity in Grammars [PYQ 2023, 2024]

A grammar is ambiguous if there exists at least one string that has two or more distinct parse trees (or equivalently, two or more distinct leftmost/rightmost derivations). Ambiguity is a property of the grammar, not the language -- an ambiguous grammar may generate a language that also has an unambiguous grammar.

Classic Example: Dangling Else [PYQ 2023]

Consider the grammar for conditional statements:

$S \to \text{if } E \text{ then } S \mid \text{if } E \text{ then } S \text{ else } S \mid a$

For the string if E then if E then a else a, there are two valid parse trees depending on whether the else associates with the inner or outer if. Most languages resolve this by adopting the rule that else associates with the nearest unmatched if.

Resolving Ambiguity: There are three common strategies:

Rewrite the grammar to be unambiguous (often makes the grammar large and unnatural).
Use precedence and associativity rules to impose an ordering on operators.
Use specialized parsers like the Operator Precedence Parser that inherently encode precedence.

Two Parse Trees for an Ambiguous Expression

Consider the ambiguous grammar $E \to E + E \mid E * E \mid id$ and the string $id + id * id$ .

The string $id + id * id$ has two parse trees: one where + is evaluated first (incorrect) and one where * is evaluated first (correct). This ambiguity is exactly what precedence rules resolve.

In the absence of precedence rules, both parse trees are equally valid, making the grammar ambiguous. The Operator Precedence Parser and YACC precedence directives both resolve this by ensuring * binds tighter than +.

Ambiguity vs LR Conflicts

Ambiguity in a grammar inevitably causes conflicts in LR parsers: a shift-reduce conflict corresponds to an ambiguity where one derivation continues (shift) and another completes (reduce). A reduce-reduce conflict means two different productions could both complete at the same point. These are not bugs in the parser -- they are signals that the grammar is ambiguous.

Operator Precedence Parser [PYQ 2023, 2024]

Most parsers (LL, LR) absolutely cannot parse ambiguous grammars. The Operator Precedence Parser is the exception [PYQ 2024]. It is a Bottom-Up (Shift-Reduce) parser specifically designed to evaluate mathematical expressions, even if the grammar is ambiguous (like $E \to E + E \mid E * E \mid id$ ).

What is an Operator Grammar? A grammar is an operator grammar if it has two properties:

No $\epsilon$ (empty) productions on the right-hand side of any rule.
No two Non-Terminals are adjacent on the right-hand side of any rule (e.g., $A \to BC$ is invalid; $A \to B \text{ op } C$ is valid).

These restrictions ensure that every production contains at least one terminal (operator), which allows us to define precedence relations purely between terminals.

How it works: Instead of using complex DFA state machines, it uses mathematical precedence relations between terminal symbols (operators):

$a \lessdot b$ : $a$ yields precedence to $b$ (the operator on the stack has lower precedence, so we shift)
$a \gtrdot b$ : $a$ takes precedence over $b$ (the operator on the stack has higher precedence, so we reduce)
$a \doteq b$ : $a$ has the same precedence as $b$ (same precedence level; associativity decides shift vs reduce)

If the parser sees $id + id * id$ , it compares + and *. Since + $\lessdot$ *, it shifts the * onto the stack, ensuring multiplication is evaluated (reduced) first.

Precedence Relation Table

For a grammar with operators +, *, terminals id, and the end marker $, the precedence relation table is:

	id	+	*	$
id		$\gtrdot$	$\gtrdot$	$\gtrdot$
+	$\lessdot$	$\doteq$	$\lessdot$	$\gtrdot$
*	$\lessdot$	$\gtrdot$	$\doteq$	$\gtrdot$
$	$\lessdot$	$\lessdot$	$\lessdot$	$\doteq$

Reading the table: To decide between shift or reduce, find the row for the topmost terminal on the stack and the column for the next input symbol. If the relation is $\lessdot$ , shift. If $\gtrdot$ , reduce. If $\doteq$ , shift or reduce depending on associativity. If no relation exists, it is a syntax error.

Note that + $\lessdot$ * (row +, column *) confirms that * has higher precedence -- when + is on the stack and * is next, we shift. Conversely, * $\gtrdot$ + (row *, column +) confirms that when * is on the stack and + is next, we reduce the * first.

Tracing Operator Precedence Parser on id + id * id

1
Step 1
Using the grammar $E \to E + E \mid E * E \mid id$ with standard precedence (* > +, both left-associative), construct the precedence table as shown above. The end-marker $ is added to both ends of the input string.
2
Step 2
Stack = $Input = id + id * id$ The parser always begins with $on the stack and$ appended to the end of the input.
3
Step 3
Stack top terminal is $. Next input is id. From the table:$ \lessdot id. So we shift. Stack = $id Input = + id * id$
4
Step 4
Stack top terminal is id. Next input is +. From the table: id \gtrdot +. So we reduce id to E. Stack = $E Then we compare$ \lessdot + (shift +). Stack = $E + Input = id * id$
5
Step 5
Stack top terminal is +. Next input is id. From the table: + \lessdot id. So we shift id. Stack = $E + id Input = * id$
6
Step 6
Stack top terminal is id. Next input is *. From the table: id \gtrdot *. Reduce id to E. Stack = $E + E Then compare + \lessdot * (shift *). Stack =$ E + E * Input = id $
7
Step 7
Stack top terminal is *. Next input is id. From the table: * \lessdot id. Shift id. Stack = $E + E * id Input =$
8
Step 8
Stack top terminal is id. Next input is $. id \gtrdot$ . Reduce id to E. Stack = $E + E * E Next: * \gtrdot$ . Reduce E * E to E. Stack = $E + E Next: + \gtrdot$ . Reduce E + E to E. Stack = $E Next:$ \doteq $. Accept. Parse complete.

Strengths:

Can parse ambiguous grammars by encoding precedence directly
Simple to implement for expression grammars
Very efficient -- only compares adjacent operators
Small memory footprint (just a precedence table)

Limitations:

Only works on operator grammars (no adjacent non-terminals, no epsilon productions)
Cannot handle general CFGs
Cannot distinguish between operators with the same precedence but different semantics
Limited error detection capabilities

Key Insight: Why Operator Precedence Parser Handles Ambiguity

The Operator Precedence Parser does not construct parse trees at all. It only decides the order of evaluation by comparing adjacent operators. Because it never builds a full derivation, the existence of multiple possible parse trees (ambiguity) does not matter -- the precedence table unambiguously decides whether to shift or reduce at every step. This is fundamentally different from LR parsers, which must enumerate all possible derivations and thus collide when multiple derivations exist.

YACC (Yet Another Compiler Compiler) [PYQ 2022]

YACC is an industry-standard UNIX tool used to automatically generate C code for a parser.

Underlying Algorithm: YACC internally uses the LALR(1) parsing algorithm.
Input/Output: It takes a file containing a Context-Free Grammar (usually with .y extension) and outputs a C program (y.tab.c) containing the parsing table and the shift-reduce logic.
Lexical Integration: YACC does not scan tokens itself. It is designed to work perfectly alongside Lex (which outputs lex.yy.c). YACC repeatedly calls yylex() to fetch the next token.

Structure of a YACC File: A standard YACC program is strictly divided into three sections separated by %%:

1/* 1. DEFINITIONS SECTION */
2%token ID NUM
3%left '+' '-'
4%left '*' '/'
5%%
6
7/* 2. RULES SECTION (The Context Free Grammar) */
8expr : expr '+' expr  { $$ = $1 + $3; }
9     | expr '*' expr  { $$ = $1 * $3; }
10     | NUM            { $$ = $1; }
11     ;
12%%
13
14/* 3. USER SUBROUTINES (C Code) */
15int main() {
16    yyparse(); // Starts the parser
17    return 0;
18}
19void yyerror(char *s) {
20    printf("Syntax Error");
21}

Building a Complete YACC Calculator Specification

1
Step 1

Step 2

Declare all token names using %token (ID, NUM, PLUS, STAR, LPAREN, RPAREN). Then declare precedence using %left and %right directives. Earlier declarations have lower precedence. For example:

%token NUM ID
%left '+' '-'       /* lowest precedence, left-associative */
%left '*' '/'       /* higher precedence, left-associative */
%right UMINUS       /* highest precedence, right-associative (for unary minus) */

Step 3

Each rule has a production followed by a C semantic action enclosed in braces. The pseudo-variables $$, $1,$ 2, $3 refer to the result and the values of the symbols on the right-hand side:

expr : expr '+' expr   { $$ = $1 + $3; }
     | expr '-' expr   { $$ = $1 - $3; }
     | expr '*' expr   { $$ = $1 * $3; }
     | '(' expr ')'    { $$ = $2; }
     | '-' expr %prec UMINUS  { $$ = -$2; }
     | NUM             { $$ = $1; }
     ;

Step 4

The third section contains standard C code. You must provide a main() function that calls yyparse() to start parsing, and an yyerror() function for error reporting:

int main() {
    yyparse();
    return 0;
}
void yyerror(char *s) {
    fprintf(stderr, "Error: %s\
", s);
}

5
Step 5
The build process:

yacc -d calc.y generates y.tab.c and y.tab.h (the -d flag creates the header for Lex)

lex calc.l generates lex.yy.c

cc lex.yy.c y.tab.c -ly -ll -o calc compiles and links both files with the Lex and YACC libraries

Run ./calc to start the calculator

Precedence Directives in YACC

YACC provides three precedence directives to resolve shift-reduce conflicts:

Directive	Meaning	Example Use
`%left`	Left-associative; reduce on same precedence	`%left '+' '-'` -- `a+b+c` = `(a+b)+c`
`%right`	Right-associative; shift on same precedence	`%right '^'` -- `a^b^c` = `a^(b^c)`
`%nonassoc`	Non-associative; error on same precedence	`%nonassoc '<'` -- `a<b<c` is illegal

Conflict Resolution Mechanism: When YACC encounters a shift-reduce conflict, it compares the precedence of:

The terminal that would be shifted (the look-ahead token)
The production that would be reduced (whose precedence is set by its rightmost terminal, or by a %prec override)

If the shift terminal has higher precedence, YACC shifts. If the reduce production has higher precedence, YACC reduces. If they have equal precedence, the associativity directive decides: %left causes a reduce, %right causes a shift, %nonassoc causes an error.

YACC-Lex Integration Pipeline

YACC and Lex work together in a tightly integrated pipeline. Lex handles lexical analysis (tokenization) and YACC handles syntactic analysis (parsing). The flow is:

YACC calls yylex() (defined in lex.yy.c) each time it needs a new token. The token value is returned as an integer, and any associated semantic value (like a number) is passed through the global variable yylval. The header file y.tab.h (generated with yacc -d) contains the token definitions that both Lex and YACC use.

Semantic Actions and Pseudo-Variables

In YACC rules, semantic actions are C code enclosed in { } that execute when a production is reduced. YACC provides pseudo-variables:

Variable	Meaning
`$$`	The result value of the current production (left-hand side)
`$1`	Value of the first symbol on the right-hand side
`$2`	Value of the second symbol on the right-hand side
`$3`	Value of the third symbol on the right-hand side

For example, in the rule expr : expr '+' expr { $$ = $1 + $3; }, $1 holds the value of the first expr, $3 holds the value of the second expr, and $$ is assigned the sum. The values are stored in a value stack parallel to the parsing stack, and their type is controlled by the %union declaration.

Frequently Asked Questions

Knowledge Check

Question 1 of 8

Q1Single choice

Which of the following is NOT a required property of an Operator Grammar? [PYQ 2023, 2024]

No epsilon productions on the RHS of any rule

No two non-terminals adjacent on the RHS of any rule

Every production must have at least one terminal on the RHS

The grammar must be unambiguous

YACC and Lex: Compiler Construction Tools

web

Bottom-Up Parsing: LR(0), SLR(1), LR(1), LALR(1)

PYQ Analysis and Exam Preparation