Lexical Analysis and Regular Languages
Understand how a scanner (lexer) reads a stream of characters and groups them into meaningful tokens. Master regular expressions as the formal tool for defining token patterns.
Learning Goals
- Define the terms token, lexeme, and pattern and explain how they relate.
- Write regular expressions for identifiers, integer literals, float literals, and keywords.
- Explain the role of the lexical analyzer in the compiler pipeline and what it returns to the parser.
- Describe how a lexer handles whitespace, comments, and errors.
The Role of the Lexical Analyzer
The Lexical Analyzer (also called the scanner or lexer) is the very first phase of a compiler. It acts as the interface between the raw source text and the parser. Instead of the parser reading characters one-by-one, it calls the lexer with a simple request: getNextToken(). The lexer does all the hard work of reading ahead, grouping characters, and returning a clean, structured token.
The lexer also interacts directly with the Symbol Table — when it encounters an identifier for the first time, it inserts an entry into the symbol table and returns a pointer to that entry as part of the token.
Token, Lexeme, and Pattern — The Three Core Definitions
These three terms are constantly confused in exams. Here is the precise distinction:
| Term | Definition | Example (from float position = 3.14) |
|---|---|---|
| Pattern | The rule (typically a regular expression) that describes a class of valid lexemes. | [0-9]+\.[0-9]+ |
| Lexeme | The actual sequence of characters in the source code that matches a pattern. | 3.14 |
| Token | The structured <token-name, attribute-value> pair returned to the parser. | <num, 3.14> |
A token has two parts:
- Token-name: A symbolic constant representing the category (e.g.,
id,num,keyword,relop). - Attribute-value: Additional data, usually a pointer into the symbol table or the literal value itself.
Example: For the source fragment position = initial + rate * 60:
| Lexeme | Pattern Matched | Token Returned |
|---|---|---|
position | identifier pattern | <id, ptr_to_position> |
= | assignment operator | <assign> |
initial | identifier pattern | <id, ptr_to_initial> |
+ | addition operator | <+> |
rate | identifier pattern | <id, ptr_to_rate> |
* | multiplication operator | <*> |
60 | integer literal pattern | <num, 60> |
Exam Question Example: Counting Tokens
A very common exam question (like PYQ 2024) asks you to count the total number of tokens in a statement.
Consider: while(count<=10) count = count + 1;
while(keyword)((punctuation)count(identifier)<=(relational operator — single token via maximal munch)10(integer literal))(punctuation)count(identifier)=(assignment operator)count(identifier)+(addition operator)1(integer literal);(punctuation) Total: 12 tokens. Note that whitespace is discarded and not counted.
Regular Expressions — The Language of Lexers
A Regular Expression (regex) is a compact notation for describing a set of strings, called a regular language. Every token class (identifiers, integers, keywords) can be described by a regular expression, making them the perfect formalism for specifying a lexer's rules.
Regular Expression Metacharacters
| Notation | Name | Meaning | Example |
|---|---|---|---|
a | Literal | Matches the exact character a. | a matches "a" |
ab | Concatenation | a followed by b. | ab matches "ab" |
a|b | Alternation (Union) | Matches a or b. | a|b matches "a" or "b" |
a* | Kleene Star | Zero or more repetitions of a. | a* matches "", "a", "aa", ... |
a+ | Plus | One or more repetitions of a. Equivalent to aa*. | a+ matches "a", "aa", ... |
a? | Optional | Zero or one occurrence of a. | a? matches "" or "a" |
[abc] | Character Class | Matches any single character in the set. | [abc] matches "a", "b", or "c" |
[a-z] | Range | Matches any character in the range. | [a-z] matches any lowercase letter |
[^abc] | Negated Class | Matches any character NOT in the set. | [^"] matches any char except " |
. | Wildcard | Matches any single character except newline. | .+ matches any non-empty string |
Operator Precedence (High to Low)
- Kleene Star / Plus / Optional (
*,+,?) — bind most tightly - Concatenation (implicit, no operator)
- Alternation (
|) — binds least tightly
So ab*|c parses as (a(b*))|c, not a(b*|c).
Building Token Patterns: From Digits to Full Identifiers
- 1Step 1
The most fundamental building block is a single decimal digit. We use a character class to match any one of the ten digit characters.
Pattern:
[0-9]Matches:
0,1,2, ...,9Does not match:a,10,(space)This is read as: 'any single character in the range from
0to9'. - 2Step 2
An integer is a sequence of one or more digits. We apply the
+quantifier to our digit pattern.Pattern:
[0-9]+Matches:
0,42,1000,007Does not match:(empty string),3.14,0xFFThe
+means 'one or more', which is equivalent to writing[0-9][0-9]*. - 3Step 3
A floating-point number has an integer part, a decimal point, and a fractional part.
Pattern:
[0-9]+\.[0-9]+Matches:
3.14,0.5,100.001Does not match:3(no decimal),.14(no integer part),3.(no fractional part)Note: the
.inside\.is escaped — without the backslash,.is a wildcard and would match any character. - 4Step 4
Real programs often write numbers in scientific notation (e.g.,
1.5e-10). We extend our pattern to make the exponent part optional.Pattern:
[0-9]+\.[0-9]+([eE][+-]?[0-9]+)?Breaking it down:
[0-9]+\.[0-9]+— the mandatory decimal part([eE]— optionally, the lettereorE[+-]?— optionally, a+or-sign[0-9]+)?— followed by one or more digits (the exponent value)
Matches:
3.14,1.5e10,2.7E-3,0.001e+100 - 5Step 5
Most languages define an identifier as a sequence that starts with a letter or underscore, followed by any number of letters, digits, or underscores.
Pattern:
[a-zA-Z_][a-zA-Z0-9_]*Breaking it down:
[a-zA-Z_]— the first character must be a letter (upper or lower case) or an underscore.[a-zA-Z0-9_]*— followed by zero or more letters, digits, or underscores.
Matches:
x,position,rate_1,_temp,MyClassDoes not match:1x(starts with digit),my-var(hyphen not allowed),(empty) - 6Step 6
A C-style string literal is any sequence of characters (except a double quote or newline) enclosed in double quotes.
Pattern:
"[^"\]*"Breaking it down:
"— opening double-quote (literal)[^"\]*— zero or more characters that are not a double-quote and not a newline"— closing double-quote (literal)
Matches:
"hello","x = 5",""(empty string) Does not match:"hello(unclosed),"li\ ne"(contains newline)
Pattern: [a-zA-Z_][a-zA-Z0-9_]*
| Lexeme | Valid? | Reason |
|---|---|---|
position | ✅ | Starts with letter, all alphanumeric |
_temp | ✅ | Underscore is a valid start character |
x1 | ✅ | Letter start, digit after is fine |
1x | ❌ | Cannot start with a digit |
my-var | ❌ | Hyphen (-) is not in [a-zA-Z0-9_] |
⚠️ Note: Keywords like if, while, and int also match this pattern. The lexer must check a keyword table after matching an identifier to resolve the conflict. Keywords always win.
The Keyword-Identifier Conflict
This is one of the most common exam questions. Keywords like if, while, int, and return all match the identifier pattern [a-zA-Z_][a-zA-Z0-9_]*. So how does the lexer tell them apart?
Two standard solutions:
- Keyword table lookup (most common): The lexer matches any sequence as an identifier first, then looks it up in a pre-populated keyword hash table. If found, it returns the keyword token instead of
<id>. - Separate regex rules: The lexer spec lists keywords as separate, higher-priority rules that are tried before the identifier rule.
Both solutions rely on the same principle: keywords always take priority over identifiers. A programmer cannot use if as a variable name because the lexer will always classify it as a keyword token first.
The Limits of Regular Expressions
Regular expressions are powerful, but they have a fundamental limitation: they cannot count. This means they cannot recognize:
- Balanced parentheses:
((()))with arbitrarily deep nesting - Matched HTML/XML tags:
<b>...</b> - Recursive structures of any kind
Formally, these languages are not regular — they are context-free. This is precisely why we need a separate Syntax Analyzer (Parser) in Phase 2 that uses a Context-Free Grammar and a push-down automaton to handle these nested structures. Regular expressions are exactly strong enough for token recognition — no more, no less.
Common Questions
Knowledge Check
The actual sequence of characters in the source code that matches a pattern (e.g., the string 'position') is called a: