Scanner Generators — Lex and Flex

30 mins

From theory to practice: learn how tools like lex and flex automate the construction of a lexer from a simple specification file of regular expression rules, eliminating the need to hand-code a DFA.

Learning Goals

Describe the structure of a lex/flex specification file (definitions, rules, user code sections).
Write basic lex rules to recognize common token patterns.
Explain how lex internally converts the rules into a DFA-based scanner.
Understand the concept of longest-match (maximal munch) rule for resolving token ambiguity.

The Big Picture: Automating Lexical Analysis

In the previous sections, we learned the theory behind building a lexer: writing a regular expression, converting it to an NFA via Thompson's Construction, transforming it into a DFA via Subset Construction, and minimizing that DFA.

Doing this by hand for a real programming language with dozens of tokens is tedious and error-prone. Lex (and its modern, open-source counterpart Flex) automates this entire process. You simply provide a file containing regular expressions, and the tool generates a C program that implements the corresponding DFA.

The Lex Pipeline:

The generated file, always named lex.yy.c, contains:

A massive 2D array representing the minimized DFA transition table.
A core C function named yylex() that reads characters from an input file, drives the DFA, and executes your custom C code whenever a token is recognized.

The 3-Section Structure of a Lex Specification

A Lex specification file (usually ending in .l) is strictly divided into three sections, separated by %% markers.

1%{
2  /* ---------------------------------------------------- */
3  /* SECTION 1a: C PROLOGUE                               */
4  /* Code enclosed in %{ and %} is copied verbatim into   */
5  /* the generated C file. Used for #includes, variables. */
6  /* ---------------------------------------------------- */
7  #include <stdio.h>
8  int line_num = 1;
9%}
10
11/* ------------------------------------------------------ */
12/* SECTION 1b: REGULAR DEFINITIONS                        */
13/* Named regex patterns to keep the rules section clean.  */
14/* Format: NAME   pattern                                 */
15/* ------------------------------------------------------ */
16DIGIT    [0-9]
17LETTER   [a-zA-Z_]
18
19%%
20/* ------------------------------------------------------ */
21/* SECTION 2: TRANSLATION RULES                           */
22/* The core of the lexer. Format: pattern { action }      */
23/* ------------------------------------------------------ */
24{LETTER}({LETTER}|{DIGIT})*   { printf("Found identifier: %s\n", yytext); }
25{DIGIT}+                      { printf("Found integer: %s\n", yytext); }
26\n                            { line_num++; }
27[ \t]                         { /* Ignore whitespace */ }
28.                             { printf("Error on line %d: %s\n", line_num, yytext); }
29%%
30
31/* ------------------------------------------------------ */
32/* SECTION 3: USER SUBROUTINES                            */
33/* Auxiliary C functions, often including a main() to     */
34/* drive the lexer for testing. Copied verbatim.          */
35/* ------------------------------------------------------ */
36int main(void) {
37    yylex(); // Start the scanning process
38    return 0;
39}
40
41int yywrap(void) {
42    return 1; // Indicate end of input
43}

Key Global Variables Provided by Lex: When a rule matches, Lex automatically populates several global variables before executing the corresponding action code:

yytext — A char * pointing to the actual matched string (the lexeme).
yyleng — An int containing the length of the matched string.
yyin — A FILE * representing the input stream (defaults to stdin).

Building a Complete Lex Specification

1
Step 1
We start by setting up the three sections. In the prologue, we include necessary headers. For this example, we assume we are generating tokens for a parser, so we might define some token constants (usually these come from a header file generated by Yacc/Bison, but we'll mock them here).

1%{ 2#include <stdio.h> 3#include <stdlib.h> 4 5// Mock token values 6#define TOK_IF 256 7#define TOK_WHILE 257 8#define TOK_ID 258 9#define TOK_INT 259 10#define TOK_ASSIGN 260 11%} 12 13%% 14 15%%
2
Step 2
To make our rules readable, we define named regular expressions in the first section (outside the %{ ... %} block).

1DIGIT [0-9] 2LETTER [a-zA-Z_] 3ID {LETTER}({LETTER}|{DIGIT})* 4INT {DIGIT}+ 5%%

Notice how we use curly braces {} to reference previously defined names like {LETTER}.
3
Step 3
The first rules in Section 2 should be for keywords. We match them literally.

1%% 2"if" { return TOK_IF; } 3"while" { return TOK_WHILE; }

By placing keywords before the general identifier rule, we ensure they are matched correctly based on Lex's priority rules.

Step 4

Now we add the rules for identifiers and integer literals using the definitions we created earlier.

1{ID}      {
2            // We could copy yytext to a symbol table here
3            return TOK_ID; 
4          }
5
6{INT}     {
7            // We could convert yytext to an integer value here
8            return TOK_INT; 
9          }

Step 5

We add rules for single characters or multi-character operators.

1"="       { return TOK_ASSIGN; }
2"=="      { return 261; } // e.g., TOK_EQ
3"+"       { return '+'; } // Return ASCII value for single chars

6
Step 6
Finally, we handle whitespace (which we usually want to silently ignore) and unmatched characters (which are lexical errors).

1[ \t\ 2]+ { /* Ignore whitespace */ } 3 4. { 5 fprintf(stderr, "Lexical Error: Unrecognized character '%s'\ 6", yytext); 7 // Do not return; lexer will continue scanning the next char 8 } 9%%

The . (dot) matches any single character except newline. Because it is the last rule, it acts as a catch-all for anything not matched by the rules above.

Here is the assembled specification file for our simple language:

1%{
2#include <stdio.h>
3#define TOK_IF     256
4#define TOK_ID     257
5#define TOK_INT    258
6%}
7
8DIGIT    [0-9]
9ID       [a-zA-Z_][a-zA-Z0-9_]*
10
11%%
12"if"             { return TOK_IF; }
13{ID}             { return TOK_ID; }
14{DIGIT}+         { return TOK_INT; }
15"+"              { return '+'; }
16"="              { return '='; }
17[ \t\
18]+         { /* Ignore */ }
19.                { printf("Error: %s\
20", yytext); }
21%%
22
23int main() {
24    int tok;
25    while ((tok = yylex())) {
26        printf("Token: %d\
27", tok);
28    }
29    return 0;
30}
31int yywrap() { return 1; }

Conflict Resolution: The Two Golden Rules

What happens when input text matches more than one rule? Lex resolves ambiguities using two strict rules, applied in order:

1. Longest Match Wins (Maximal Munch) If the input is iff, it matches the rule for if (length 2) AND the rule for {ID} (length 3). The longest match wins, so it is treated as an identifier, not a keyword followed by an 'f'.

2. First Rule in File Wins If the input is if, it matches the rule for if (length 2) AND the rule for {ID} (length 2). It's a tie! Lex breaks the tie by choosing the rule that appears first in the .l file.

This is why keyword rules must always be placed before identifier rules in a Lex specification.

The Default ECHO Trap

If Lex encounters a character that does not match any rule, its default built-in behavior is to execute ECHO — which prints the matched character to stdout and continues scanning.

This is usually terrible for a compiler! It means invalid characters silently pass through your lexer and pollute the console output without generating an actual error token.

Always include a catch-all rule at the very bottom of your rules section:

1.  { fprintf(stderr, "Lexical error on line %d\
2", yylineno); }

Because of the 'first rule wins' tiebreaker, this dot rule will only execute if absolutely no other rule matched.

Common Questions

Knowledge Check

Question 1 of 5

Q1Single choice

A Lex specification file is divided into three sections separated by which symbol?

Flex - The Fast Lexical Analyzer (GNU Manual)

web

Finite Automata — NFA and DFA

PYQ Analysis and Exam Preparation