Phases of Compilation
A complete walkthrough of the six phases of a modern compiler — from reading raw source text to emitting optimized machine code. Understand the data each phase produces and how they chain together.
Learning Goals
- Name and correctly order all six phases of a compiler.
- Describe the input and output of each phase with a concrete example.
- Explain the role of the Symbol Table and Error Handler as cross-cutting components.
- Differentiate between the front-end (analysis) and back-end (synthesis) of a compiler.
- Distinguish between a compiler, an interpreter, and an assembler.
- Describe the complete Language Processing System from preprocessor to loader.
- Compare single-pass and multi-pass compiler architectures.
What is a Compiler?
A compiler is a program that reads source code written in one language (the source language) and translates it into an equivalent program in another language (the target language), typically machine code. This translation process is performed in a well-defined sequence of six phases, each consuming the output of the previous one.
Before diving into the phases, it is important to understand where a compiler sits among similar tools:
| Tool | What it does | Example |
|---|---|---|
| Compiler | Translates the entire source program into target code before execution. | GCC, javac |
| Interpreter | Reads and executes the source program line-by-line, without producing a separate target program. | Python, Ruby |
| Assembler | A special-case compiler that translates assembly language (human-readable mnemonics) into machine code. | NASM, MASM |
A key practical distinction: a compiled program generally runs faster than an interpreted one, because all translation overhead is paid once at compile time. An interpreted program is typically more portable and easier to debug interactively.
The Language Processing System
A compiler does not work in isolation. In a real system, the source program passes through a chain of tools before and after the compiler. This complete chain is called the Language Processing System.
| Component | Role | Input → Output |
|---|---|---|
| Preprocessor | Expands macros (#define), includes header files (#include), removes comments | Source → Modified source |
| Compiler | Translates the high-level source to assembly language | Modified source → Assembly |
| Assembler | Converts assembly mnemonics to binary machine code (relocatable) | Assembly → Object code (.o) |
| Linker | Combines multiple object files and library files into a single executable, resolving cross-file references | Object files → Executable |
| Loader | Loads the executable into main memory (RAM) and sets up the process for the OS to run it | Executable → Running process |
This is a frequently asked exam question — draw the complete pipeline with all five components and their data flows.
The Compiler Pipeline
A compiler does not transform source code into machine code in a single step. Instead, it passes the program through a chain of six specialized phases. Two shared components — the Symbol Table and the Error Handler — provide services to all phases throughout the entire process.
The dashed arrows show that the Symbol Table is a shared database — every phase can look up or store information about identifiers (variables, functions, types) in it. The Error Handler collects and reports errors detected by the first three phases.
Trace: The Six Phases on a Single Statement
- 1Step 1
We will trace the single assignment statement
position = initial + rate * 60through all six phases of compilation. This is the canonical example from the Dragon Book, and it perfectly illustrates every phase's role.The Symbol Table at this point already contains entries for
position,initial, andrate(all declared as typefloatearlier in the program). - 2Step 2
The Lexical Analyzer (scanner) reads the source character stream and groups characters into meaningful units called tokens. Whitespace and comments are discarded.
For our statement, the scanner produces the following token stream:
1<id, 1> → 'position' (identifier, Symbol Table entry 1) 2<=> → assignment operator 3<id, 2> → 'initial' (identifier, Symbol Table entry 2) 4<+> → addition operator 5<id, 3> → 'rate' (identifier, Symbol Table entry 3) 6<*> → multiplication operator 7<60> → integer literalEach
<id, N>token does not store the name itself — it stores a pointer to the Symbol Table entry. This is how the compiler avoids duplicating strings. - 3Step 3
The Syntax Analyzer (parser) takes the token stream and checks that the sequence of tokens conforms to the grammatical rules of the language (the Context-Free Grammar). It produces a Parse Tree (or Abstract Syntax Tree) that represents the grammatical structure of the statement.
The tree captures the operator precedence correctly —
*is a child of+, meaning multiplication is performed before addition, exactly as per the grammar rules. - 4Step 4
The Semantic Analyzer checks the parse tree for semantic consistency — meaning that is correct beyond mere grammar. Its most critical job is type checking.
Looking up the Symbol Table, the analyzer determines:
position,initial, andrateare all of typefloat.- The literal
60is of typeint. - You cannot directly multiply a
float(rate) with anint(60) without a conversion.
The semantic analyzer inserts an explicit type coercion node (
inttoreal) into the tree:This is the annotated AST — the same tree structure, but enriched with type information at every node.
- 5Step 5
The Intermediate Code Generator traverses the annotated AST and produces a machine-independent intermediate representation. The most common form is Three-Address Code (TAC), where each instruction has at most one operator and three addresses (two sources, one destination).
For our statement, the TAC sequence is:
1t1 = inttoreal(60) # Convert integer 60 to float 60.0 2t2 = id3 * t1 # rate * 60.0 3t3 = id2 + t2 # initial + (rate * 60.0) 4id1 = t3 # position = resultt1,t2,t3are compiler-generated temporary variables. TAC is easy to generate from the AST and easy to optimize in the next phase. - 6Step 6
The Code Optimizer improves the TAC sequence to make the final program run faster or use less memory, without changing its meaning.
A simple but powerful optimization is constant folding: the compiler notices that
inttoreal(60)always produces the same constant60.0— so there is no need to compute it at runtime. It can be folded in at compile time.The optimized TAC:
1t1 = id3 * 60.0 # rate * 60.0 (60.0 is now a float constant) 2id1 = id2 + t1 # position = initial + t1Notice we went from 4 instructions to 2. The temporary
t2andt3were also merged and eliminated. - 7Step 7
The Code Generator translates the optimized TAC into actual target machine instructions. It must make decisions about register allocation — which CPU registers to use for which values.
A sample output for a simple register machine:
1LDF R2, id3 ; Load 'rate' into floating-point register R2 2MULF R2, #60.0 ; R2 = R2 * 60.0 3LDF R1, id2 ; Load 'initial' into R1 4ADDF R1, R2 ; R1 = R1 + R2 5STF id1, R1 ; Store R1 back to memory as 'position'This is the final target program — the machine code that will be executed by the CPU. The compilation of
position = initial + rate * 60is complete.
The front-end of a compiler consists of the first three phases. It is concerned entirely with understanding the source language. The front-end is generally machine-independent — the same front-end could produce intermediate code for any target.
| Phase | Input | Output |
|---|---|---|
| ① Lexical Analysis | Character stream | Token stream |
| ② Syntax Analysis | Token stream | Parse Tree |
| ③ Semantic Analysis | Parse Tree | Annotated AST |
Goal: Detect all errors in the source program and produce a clean, machine-independent intermediate representation.
The Two Silent Workers: Symbol Table & Error Handler
The Symbol Table and Error Handler are not phases — they are shared infrastructure that all six phases rely on simultaneously.
Symbol Table: A data structure (typically a hash table) that maps each identifier name (position, rate, etc.) to its attributes: its type, memory location, scope, and more. Every phase reads from or writes to it.
Error Handler: Whenever a phase detects a problem (e.g., an undeclared variable in semantic analysis, or a syntax error in parsing), it reports the error to the Error Handler. A good compiler uses error recovery to continue past the first error and detect as many errors as possible in a single compilation pass.
Common Questions
Knowledge Check
Which phase of a compiler is responsible for grouping characters into tokens and discarding whitespace?