Coursify

Compiler Design (CD)

Scanner Generators — Lex and Flex

30 mins

From theory to practice: learn how tools like lex and flex automate the construction of a lexer from a simple specification file of regular expression rules, eliminating the need to hand-code a DFA.

Learning Goals

  • Describe the structure of a lex/flex specification file (definitions, rules, user code sections).
  • Write basic lex rules to recognize common token patterns.
  • Explain how lex internally converts the rules into a DFA-based scanner.
  • Understand the concept of longest-match (maximal munch) rule for resolving token ambiguity.

The Big Picture: Automating Lexical Analysis

In the previous sections, we learned the theory behind building a lexer: writing a regular expression, converting it to an NFA via Thompson's Construction, transforming it into a DFA via Subset Construction, and minimizing that DFA.

Doing this by hand for a real programming language with dozens of tokens is tedious and error-prone. Lex (and its modern, open-source counterpart Flex) automates this entire process. You simply provide a file containing regular expressions, and the tool generates a C program that implements the corresponding DFA.

The Lex Pipeline:

The generated file, always named lex.yy.c, contains:

  • A massive 2D array representing the minimized DFA transition table.
  • A core C function named yylex() that reads characters from an input file, drives the DFA, and executes your custom C code whenever a token is recognized.

The 3-Section Structure of a Lex Specification

A Lex specification file (usually ending in .l) is strictly divided into three sections, separated by %% markers.

1%{ 2 /* ---------------------------------------------------- */ 3 /* SECTION 1a: C PROLOGUE */ 4 /* Code enclosed in %{ and %} is copied verbatim into */ 5 /* the generated C file. Used for #includes, variables. */ 6 /* ---------------------------------------------------- */ 7 #include <stdio.h> 8 int line_num = 1; 9%} 10 11/* ------------------------------------------------------ */ 12/* SECTION 1b: REGULAR DEFINITIONS */ 13/* Named regex patterns to keep the rules section clean. */ 14/* Format: NAME pattern */ 15/* ------------------------------------------------------ */ 16DIGIT [0-9] 17LETTER [a-zA-Z_] 18 19%% 20/* ------------------------------------------------------ */ 21/* SECTION 2: TRANSLATION RULES */ 22/* The core of the lexer. Format: pattern { action } */ 23/* ------------------------------------------------------ */ 24{LETTER}({LETTER}|{DIGIT})* { printf("Found identifier: %s\n", yytext); } 25{DIGIT}+ { printf("Found integer: %s\n", yytext); } 26\n { line_num++; } 27[ \t] { /* Ignore whitespace */ } 28. { printf("Error on line %d: %s\n", line_num, yytext); } 29%% 30 31/* ------------------------------------------------------ */ 32/* SECTION 3: USER SUBROUTINES */ 33/* Auxiliary C functions, often including a main() to */ 34/* drive the lexer for testing. Copied verbatim. */ 35/* ------------------------------------------------------ */ 36int main(void) { 37 yylex(); // Start the scanning process 38 return 0; 39} 40 41int yywrap(void) { 42 return 1; // Indicate end of input 43}

Key Global Variables Provided by Lex: When a rule matches, Lex automatically populates several global variables before executing the corresponding action code:

  • yytext — A char * pointing to the actual matched string (the lexeme).
  • yyleng — An int containing the length of the matched string.
  • yyin — A FILE * representing the input stream (defaults to stdin).

Building a Complete Lex Specification

  1. 1
    Step 1

    We start by setting up the three sections. In the prologue, we include necessary headers. For this example, we assume we are generating tokens for a parser, so we might define some token constants (usually these come from a header file generated by Yacc/Bison, but we'll mock them here).

    1%{ 2#include <stdio.h> 3#include <stdlib.h> 4 5// Mock token values 6#define TOK_IF 256 7#define TOK_WHILE 257 8#define TOK_ID 258 9#define TOK_INT 259 10#define TOK_ASSIGN 260 11%} 12 13%% 14 15%%
  2. 2
    Step 2

    To make our rules readable, we define named regular expressions in the first section (outside the %{ ... %} block).

    1DIGIT [0-9] 2LETTER [a-zA-Z_] 3ID {LETTER}({LETTER}|{DIGIT})* 4INT {DIGIT}+ 5%%

    Notice how we use curly braces {} to reference previously defined names like {LETTER}.

  3. 3
    Step 3

    The first rules in Section 2 should be for keywords. We match them literally.

    1%% 2"if" { return TOK_IF; } 3"while" { return TOK_WHILE; }

    By placing keywords before the general identifier rule, we ensure they are matched correctly based on Lex's priority rules.

  4. 4
    Step 4

    Now we add the rules for identifiers and integer literals using the definitions we created earlier.

    1{ID} { 2 // We could copy yytext to a symbol table here 3 return TOK_ID; 4 } 5 6{INT} { 7 // We could convert yytext to an integer value here 8 return TOK_INT; 9 }
  5. 5
    Step 5

    We add rules for single characters or multi-character operators.

    1"=" { return TOK_ASSIGN; } 2"==" { return 261; } // e.g., TOK_EQ 3"+" { return '+'; } // Return ASCII value for single chars
  6. 6
    Step 6

    Finally, we handle whitespace (which we usually want to silently ignore) and unmatched characters (which are lexical errors).

    1[ \t\ 2]+ { /* Ignore whitespace */ } 3 4. { 5 fprintf(stderr, "Lexical Error: Unrecognized character '%s'\ 6", yytext); 7 // Do not return; lexer will continue scanning the next char 8 } 9%%

    The . (dot) matches any single character except newline. Because it is the last rule, it acts as a catch-all for anything not matched by the rules above.

Here is the assembled specification file for our simple language:

1%{ 2#include <stdio.h> 3#define TOK_IF 256 4#define TOK_ID 257 5#define TOK_INT 258 6%} 7 8DIGIT [0-9] 9ID [a-zA-Z_][a-zA-Z0-9_]* 10 11%% 12"if" { return TOK_IF; } 13{ID} { return TOK_ID; } 14{DIGIT}+ { return TOK_INT; } 15"+" { return '+'; } 16"=" { return '='; } 17[ \t\ 18]+ { /* Ignore */ } 19. { printf("Error: %s\ 20", yytext); } 21%% 22 23int main() { 24 int tok; 25 while ((tok = yylex())) { 26 printf("Token: %d\ 27", tok); 28 } 29 return 0; 30} 31int yywrap() { return 1; }

Conflict Resolution: The Two Golden Rules

What happens when input text matches more than one rule? Lex resolves ambiguities using two strict rules, applied in order:

1. Longest Match Wins (Maximal Munch) If the input is iff, it matches the rule for if (length 2) AND the rule for {ID} (length 3). The longest match wins, so it is treated as an identifier, not a keyword followed by an 'f'.

2. First Rule in File Wins If the input is if, it matches the rule for if (length 2) AND the rule for {ID} (length 2). It's a tie! Lex breaks the tie by choosing the rule that appears first in the .l file.

This is why keyword rules must always be placed before identifier rules in a Lex specification.

The Default ECHO Trap

If Lex encounters a character that does not match any rule, its default built-in behavior is to execute ECHO — which prints the matched character to stdout and continues scanning.

This is usually terrible for a compiler! It means invalid characters silently pass through your lexer and pollute the console output without generating an actual error token.

Always include a catch-all rule at the very bottom of your rules section:

1. { fprintf(stderr, "Lexical error on line %d\ 2", yylineno); }

Because of the 'first rule wins' tiebreaker, this dot rule will only execute if absolutely no other rule matched.

Common Questions

Knowledge Check

Question 1 of 5
Q1Single choice

A Lex specification file is divided into three sections separated by which symbol?

Scanner Generators — Lex and Flex | Compiler Design (CD) | Coursify