Scanner Generators — Lex and Flex
From theory to practice: learn how tools like lex and flex automate the construction of a lexer from a simple specification file of regular expression rules, eliminating the need to hand-code a DFA.
Learning Goals
- Describe the structure of a lex/flex specification file (definitions, rules, user code sections).
- Write basic lex rules to recognize common token patterns.
- Explain how lex internally converts the rules into a DFA-based scanner.
- Understand the concept of longest-match (maximal munch) rule for resolving token ambiguity.
The Big Picture: Automating Lexical Analysis
In the previous sections, we learned the theory behind building a lexer: writing a regular expression, converting it to an NFA via Thompson's Construction, transforming it into a DFA via Subset Construction, and minimizing that DFA.
Doing this by hand for a real programming language with dozens of tokens is tedious and error-prone. Lex (and its modern, open-source counterpart Flex) automates this entire process. You simply provide a file containing regular expressions, and the tool generates a C program that implements the corresponding DFA.
The Lex Pipeline:
The generated file, always named lex.yy.c, contains:
- A massive 2D array representing the minimized DFA transition table.
- A core C function named
yylex()that reads characters from an input file, drives the DFA, and executes your custom C code whenever a token is recognized.
The 3-Section Structure of a Lex Specification
A Lex specification file (usually ending in .l) is strictly divided into three sections, separated by %% markers.
1%{ 2 /* ---------------------------------------------------- */ 3 /* SECTION 1a: C PROLOGUE */ 4 /* Code enclosed in %{ and %} is copied verbatim into */ 5 /* the generated C file. Used for #includes, variables. */ 6 /* ---------------------------------------------------- */ 7 #include <stdio.h> 8 int line_num = 1; 9%} 10 11/* ------------------------------------------------------ */ 12/* SECTION 1b: REGULAR DEFINITIONS */ 13/* Named regex patterns to keep the rules section clean. */ 14/* Format: NAME pattern */ 15/* ------------------------------------------------------ */ 16DIGIT [0-9] 17LETTER [a-zA-Z_] 18 19%% 20/* ------------------------------------------------------ */ 21/* SECTION 2: TRANSLATION RULES */ 22/* The core of the lexer. Format: pattern { action } */ 23/* ------------------------------------------------------ */ 24{LETTER}({LETTER}|{DIGIT})* { printf("Found identifier: %s\n", yytext); } 25{DIGIT}+ { printf("Found integer: %s\n", yytext); } 26\n { line_num++; } 27[ \t] { /* Ignore whitespace */ } 28. { printf("Error on line %d: %s\n", line_num, yytext); } 29%% 30 31/* ------------------------------------------------------ */ 32/* SECTION 3: USER SUBROUTINES */ 33/* Auxiliary C functions, often including a main() to */ 34/* drive the lexer for testing. Copied verbatim. */ 35/* ------------------------------------------------------ */ 36int main(void) { 37 yylex(); // Start the scanning process 38 return 0; 39} 40 41int yywrap(void) { 42 return 1; // Indicate end of input 43}
Key Global Variables Provided by Lex: When a rule matches, Lex automatically populates several global variables before executing the corresponding action code:
yytext— Achar *pointing to the actual matched string (the lexeme).yyleng— Anintcontaining the length of the matched string.yyin— AFILE *representing the input stream (defaults tostdin).
Building a Complete Lex Specification
- 1Step 1
We start by setting up the three sections. In the prologue, we include necessary headers. For this example, we assume we are generating tokens for a parser, so we might define some token constants (usually these come from a header file generated by Yacc/Bison, but we'll mock them here).
1%{ 2#include <stdio.h> 3#include <stdlib.h> 4 5// Mock token values 6#define TOK_IF 256 7#define TOK_WHILE 257 8#define TOK_ID 258 9#define TOK_INT 259 10#define TOK_ASSIGN 260 11%} 12 13%% 14 15%% - 2Step 2
To make our rules readable, we define named regular expressions in the first section (outside the
%{ ... %}block).1DIGIT [0-9] 2LETTER [a-zA-Z_] 3ID {LETTER}({LETTER}|{DIGIT})* 4INT {DIGIT}+ 5%%Notice how we use curly braces
{}to reference previously defined names like{LETTER}. - 3Step 3
The first rules in Section 2 should be for keywords. We match them literally.
1%% 2"if" { return TOK_IF; } 3"while" { return TOK_WHILE; }By placing keywords before the general identifier rule, we ensure they are matched correctly based on Lex's priority rules.
- 4Step 4
Now we add the rules for identifiers and integer literals using the definitions we created earlier.
1{ID} { 2 // We could copy yytext to a symbol table here 3 return TOK_ID; 4 } 5 6{INT} { 7 // We could convert yytext to an integer value here 8 return TOK_INT; 9 } - 5Step 5
We add rules for single characters or multi-character operators.
1"=" { return TOK_ASSIGN; } 2"==" { return 261; } // e.g., TOK_EQ 3"+" { return '+'; } // Return ASCII value for single chars - 6Step 6
Finally, we handle whitespace (which we usually want to silently ignore) and unmatched characters (which are lexical errors).
1[ \t\ 2]+ { /* Ignore whitespace */ } 3 4. { 5 fprintf(stderr, "Lexical Error: Unrecognized character '%s'\ 6", yytext); 7 // Do not return; lexer will continue scanning the next char 8 } 9%%The
.(dot) matches any single character except newline. Because it is the last rule, it acts as a catch-all for anything not matched by the rules above.
Here is the assembled specification file for our simple language:
1%{ 2#include <stdio.h> 3#define TOK_IF 256 4#define TOK_ID 257 5#define TOK_INT 258 6%} 7 8DIGIT [0-9] 9ID [a-zA-Z_][a-zA-Z0-9_]* 10 11%% 12"if" { return TOK_IF; } 13{ID} { return TOK_ID; } 14{DIGIT}+ { return TOK_INT; } 15"+" { return '+'; } 16"=" { return '='; } 17[ \t\ 18]+ { /* Ignore */ } 19. { printf("Error: %s\ 20", yytext); } 21%% 22 23int main() { 24 int tok; 25 while ((tok = yylex())) { 26 printf("Token: %d\ 27", tok); 28 } 29 return 0; 30} 31int yywrap() { return 1; }
Conflict Resolution: The Two Golden Rules
What happens when input text matches more than one rule? Lex resolves ambiguities using two strict rules, applied in order:
1. Longest Match Wins (Maximal Munch)
If the input is iff, it matches the rule for if (length 2) AND the rule for {ID} (length 3). The longest match wins, so it is treated as an identifier, not a keyword followed by an 'f'.
2. First Rule in File Wins
If the input is if, it matches the rule for if (length 2) AND the rule for {ID} (length 2). It's a tie! Lex breaks the tie by choosing the rule that appears first in the .l file.
This is why keyword rules must always be placed before identifier rules in a Lex specification.
The Default ECHO Trap
If Lex encounters a character that does not match any rule, its default built-in behavior is to execute ECHO — which prints the matched character to stdout and continues scanning.
This is usually terrible for a compiler! It means invalid characters silently pass through your lexer and pollute the console output without generating an actual error token.
Always include a catch-all rule at the very bottom of your rules section:
1. { fprintf(stderr, "Lexical error on line %d\ 2", yylineno); }
Because of the 'first rule wins' tiebreaker, this dot rule will only execute if absolutely no other rule matched.
Common Questions
Knowledge Check
A Lex specification file is divided into three sections separated by which symbol?