Introduction to Compiler Design
Introduction
A compiler translates high-level language source code into target machine code, serving as the critical bridge between programmers and hardware. Understanding compiler design helps write more efficient code and provides the foundation for understanding language design, program optimization, and toolchains such as LLVM.
Related content: Formal Languages and Automata
1. Overall Compiler Structure
```mermaid
graph LR
    A[Source Code] --> B[Lexical Analysis]
    B --> C[Parsing]
    C --> D[Semantic Analysis]
    D --> E[IR Generation]
    E --> F[Optimization]
    F --> G[Code Generation]
    G --> H[Target Code]

    subgraph Frontend
        B
        C
        D
    end
    subgraph Middle-end
        E
        F
    end
    subgraph Backend
        G
    end
```
The compilation process is typically divided into a frontend (source-language dependent) and a backend (target-platform dependent), decoupled through an intermediate representation (IR).
2. Lexical Analysis
The lexical analyzer (lexer/scanner) converts the source code character stream into a token stream.
2.1 Example
Source code: `if (x >= 42) { y = x + 1; }`

Token stream:

```
IF LPAREN ID("x") GE NUM(42) RPAREN
LBRACE ID("y") ASSIGN ID("x") PLUS NUM(1) SEMICOLON RBRACE
```
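Rule order matters here: keywords must win over identifiers, and longer operators such as `>=` must be tried before `=`. A minimal tokenizer sketch in Python (token names follow the stream above; the rule set covers only what this example needs):

```python
import re

# Longest/most specific rules first: NUM before ID, GE before ASSIGN.
TOKEN_SPEC = [
    ("NUM",       r"\d+"),
    ("ID",        r"[a-zA-Z_][a-zA-Z0-9_]*"),
    ("GE",        r">="),
    ("ASSIGN",    r"="),
    ("PLUS",      r"\+"),
    ("LPAREN",    r"\("),
    ("RPAREN",    r"\)"),
    ("LBRACE",    r"\{"),
    ("RBRACE",    r"\}"),
    ("SEMICOLON", r";"),
    ("SKIP",      r"\s+"),
]
KEYWORDS = {"if": "IF"}

def tokenize(src):
    pattern = "|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC)
    tokens = []
    for m in re.finditer(pattern, src):
        kind, text = m.lastgroup, m.group()
        if kind == "SKIP":
            continue                      # drop whitespace
        if kind == "ID" and text in KEYWORDS:
            kind = KEYWORDS[text]         # keywords win over identifiers
        tokens.append((kind, text))
    return tokens
```

Running `tokenize('if (x >= 42) { y = x + 1; }')` reproduces the token stream shown above.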
2.2 Regular Expressions → DFA
Lexical rules are described using regular expressions and converted to deterministic finite automata (DFA) for execution:
Example: the regular expression for identifiers, `[a-zA-Z_][a-zA-Z0-9_]*`, corresponds to the DFA:

```
                              letter/digit/underscore
      letter/underscore                ┌──────┐
──→ (q0) ──────────────→ ((q1)) ◄─────┘      │
                            │                │
                            └────────────────┘
```
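The same DFA can be run directly as a transition function; a sketch, with q1 as the only accepting state:

```python
# DFA for [a-zA-Z_][a-zA-Z0-9_]*, written out as explicit state transitions.
def is_identifier(s):
    state = "q0"
    for ch in s:
        if state == "q0" and (ch.isalpha() or ch == "_"):
            state = "q1"          # first character: letter or underscore
        elif state == "q1" and (ch.isalnum() or ch == "_"):
            state = "q1"          # self-loop: letter, digit, or underscore
        else:
            return False          # no transition: reject immediately
    return state == "q1"          # q1 is the accepting state
```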
2.3 Lexical Analysis Tools
- Lex/Flex: classic lexer generators
- Regular expression engines: built into modern languages
3. Parsing (Syntax Analysis)
The parser organizes the token stream into a syntax tree according to a context-free grammar (CFG).
3.1 Context-Free Grammar
```
// Simple arithmetic expression grammar
E → E + T | E - T | T
T → T * F | T / F | F
F → ( E ) | num | id
```
3.2 Top-Down Parsing: LL Parsing
LL(k): scans left-to-right (L), leftmost derivation (L), looks ahead k tokens.
Recursive descent parsing is the most intuitive form of LL parsing:
```python
def parse_expr():
    """E → T (('+' | '-') T)*"""
    left = parse_term()
    while current_token in ('+', '-'):
        op = current_token
        advance()
        right = parse_term()
        left = BinOp(op, left, right)
    return left

def parse_term():
    """T → F (('*' | '/') F)*"""
    left = parse_factor()
    while current_token in ('*', '/'):
        op = current_token
        advance()
        right = parse_factor()
        left = BinOp(op, left, right)
    return left

def parse_factor():
    """F → '(' E ')' | num | id"""
    if current_token == '(':
        advance()
        node = parse_expr()
        expect(')')
        return node
    elif current_token.isdigit():
        node = Num(int(current_token))
        advance()
        return node
    else:  # id
        node = Var(current_token)
        advance()
        return node
```
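The skeleton above relies on an external tokenizer, AST classes (`BinOp`, `Num`, `Var`), and helpers (`advance`, `expect`). As a runnable illustration, here is the same grammar as a self-contained evaluator that computes values instead of building nodes (a sketch; splitting on whitespace stands in for a real lexer):

```python
def evaluate(src):
    # Pad operators with spaces so a simple split produces the token list.
    for ch in "()+-*/":
        src = src.replace(ch, f" {ch} ")
    tokens = src.split()
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def expr():                       # E → T (('+' | '-') T)*
        left = term()
        while peek() in ('+', '-'):
            op = peek(); advance()
            right = term()
            left = left + right if op == '+' else left - right
        return left

    def term():                       # T → F (('*' | '/') F)*
        left = factor()
        while peek() in ('*', '/'):
            op = peek(); advance()
            right = factor()
            left = left * right if op == '*' else left / right
        return left

    def factor():                     # F → '(' E ')' | num
        if peek() == '(':
            advance()
            value = expr()
            advance()                 # consume ')'
            return value
        value = int(peek())
        advance()
        return value

    return expr()
```

Note how `1 + 2 * 3` evaluates to 7, not 9: the grammar's layering (`term` inside `expr`) encodes operator precedence.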
LL parsing requires the grammar to be free of left recursion: a left-recursive rule such as E → E + T must first be rewritten, e.g. as E → T E′ with E′ → + T E′ | ε, which is exactly the loop form used in the recursive-descent code above.
3.3 Bottom-Up Parsing: LR Parsing
LR(k): scans left-to-right (L), rightmost derivation in reverse (R), looks ahead k tokens.
LR parsing is more powerful than LL and can handle a broader class of grammars.
| Method | Power | Complexity | Tools |
|---|---|---|---|
| LL(1) | Weaker | Simple | ANTLR, recursive descent |
| SLR(1) | Moderate | Moderate | Yacc variants |
| LALR(1) | Stronger | Moderate | Yacc/Bison |
| LR(1) | Strongest | Higher | Rarely used |
Core LR operations:
- Shift: read a token and push it onto the stack
- Reduce: when the top of the stack matches the right-hand side of a grammar rule, replace it with the left-hand side
4. Abstract Syntax Tree (AST)
The AST is a tree-structured representation of the source code, stripped of syntactic noise (such as parentheses and semicolons).
Source code: `x = 2 * (y + 3)`

AST:

```
         Assign
        /      \
    Var(x)     Mul
              /   \
          Num(2)  Add
                 /   \
             Var(y)  Num(3)
```
The AST is the fundamental data structure for subsequent analysis and code generation.
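In code, such a tree is typically built from small node classes; a sketch with illustrative names:

```python
from dataclasses import dataclass

@dataclass
class Num:
    value: int

@dataclass
class Var:
    name: str

@dataclass
class BinOp:
    op: str
    left: object
    right: object

@dataclass
class Assign:
    target: Var
    value: object

# x = 2 * (y + 3) — note that the parentheses leave no trace in the tree:
# grouping is encoded purely by its shape.
tree = Assign(Var("x"), BinOp("*", Num(2), BinOp("+", Var("y"), Num(3))))
```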
5. Semantic Analysis
Semantic analysis checks the semantic correctness of the AST. Main tasks include:
5.1 Type Checking
```c
int x = "hello";    // Type error: cannot assign string to int
float y = 3 + 2.5;  // OK: int implicitly promoted to float
```
Type checking verifies that every operator and function receives operands of the expected types, inserting implicit conversions where the language permits them.
5.2 Symbol Table
The symbol table records information about all identifiers in the program:
| Name | Type | Scope | Address/Offset |
|---|---|---|---|
| x | int | main | rbp-4 |
| y | float | main | rbp-8 |
| add | int→int→int | global | 0x4000a0 |
5.3 Scope Resolution
```c
int x = 1;          // global x

void foo() {
    int x = 2;      // local x, shadows global
    {
        int x = 3;  // inner scope, shadows outer
    }
}
```
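A common implementation is a chain of per-scope tables: lookup starts in the innermost scope and walks outward until it finds the name, which is exactly how shadowing resolves. A sketch:

```python
class SymbolTable:
    def __init__(self, parent=None):
        self.symbols = {}        # name → info for this scope only
        self.parent = parent     # enclosing scope, or None for global

    def define(self, name, info):
        self.symbols[name] = info

    def lookup(self, name):
        scope = self
        while scope is not None:
            if name in scope.symbols:
                return scope.symbols[name]
            scope = scope.parent
        return None              # undeclared identifier

# Mirrors the C example: global x, foo's x, and the inner block's x
globals_ = SymbolTable()
globals_.define("x", 1)
foo_scope = SymbolTable(parent=globals_)
foo_scope.define("x", 2)
inner = SymbolTable(parent=foo_scope)
inner.define("x", 3)
```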
6. Intermediate Representation (IR)
IR serves as the bridge between source language and target code, facilitating optimization.
6.1 Three-Address Code
Source code: `a = b * c + d`

Three-address code:

```
t1 = b * c
t2 = t1 + d
a = t2
```
Each instruction involves at most three addresses (two operands + one result).
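Generation is a post-order walk of the expression tree: both operands are translated first, then a fresh temporary receives the result. A sketch using nested tuples for operator nodes and strings for leaves:

```python
def compile_expr(tree):
    code = []
    counter = 0

    def emit(node):
        nonlocal counter
        if isinstance(node, str):
            return node                  # leaf: already an address
        op, l, r = node
        left, right = emit(l), emit(r)   # post-order: operands first
        counter += 1
        temp = f"t{counter}"
        code.append(f"{temp} = {left} {op} {right}")
        return temp

    return code, emit(tree)

# a = b * c + d
code, result = compile_expr(("+", ("*", "b", "c"), "d"))
code.append(f"a = {result}")
```

This reproduces exactly the three instructions shown above.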
6.2 SSA Form
Static Single Assignment (SSA): each variable is assigned exactly once.
```
// Original
x = 1
x = x + 1
y = x

// SSA form
x1 = 1
x2 = x1 + 1
y1 = x2
```
SSA simplifies dataflow analysis and optimization. At branch merge points, \(\phi\) functions are used:
```
// if (cond) x = 1; else x = 2; y = x;
    if cond goto L1 else L2
L1: x1 = 1; goto L3
L2: x2 = 2; goto L3
L3: x3 = φ(x1, x2)
    y1 = x3
```
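For straight-line code, SSA construction reduces to versioned renaming (the φ placement needed at branch merges is omitted here). A sketch over (target, right-hand-side tokens) pairs, a representation chosen just for this example:

```python
def to_ssa(stmts):
    version = {}                 # variable → current version number
    out = []
    for target, rhs in stmts:
        # Uses on the right-hand side refer to the *current* version.
        new_rhs = [f"{t}{version[t]}" if t in version else t for t in rhs]
        # The assignment itself creates a fresh version of the target.
        version[target] = version.get(target, 0) + 1
        out.append((f"{target}{version[target]}", new_rhs))
    return out

# The example above: x = 1; x = x + 1; y = x
ssa = to_ssa([("x", ["1"]), ("x", ["x", "+", "1"]), ("y", ["x"])])
```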
7. Optimization
7.1 Common Optimization Techniques
| Optimization | Description | Example |
|---|---|---|
| Constant folding | Evaluate constant expressions at compile time | 3+4 → 7 |
| Constant propagation | Replace variables with known constants | x=5; y=x+1 → y=6 |
| Dead code elimination | Remove unreachable/unused code | Delete unused variables |
| Common subexpression elimination | Avoid redundant computation | a*b+a*b → t=a*b; t+t |
| Loop-invariant code motion | Move invariant computations outside the loop | - |
| Loop unrolling | Reduce loop overhead | Unroll the loop body |
| Inline expansion | Replace function calls with the function body | Eliminate call overhead |
| Tail call optimization | Convert tail recursion to loops | Avoid stack overflow |
| Strength reduction | Replace expensive operations with cheaper ones | x*2 → x<<1 |
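Constant folding, the first entry in the table, is a bottom-up rewrite of the expression tree: fold the children, and if both are now constants, evaluate the node at compile time. A sketch, representing operator nodes as tuples and leaves as ints or variable-name strings:

```python
def fold(node):
    if not isinstance(node, tuple):
        return node                               # leaf: constant or variable
    op, l, r = node
    l, r = fold(l), fold(r)                       # fold children first
    if isinstance(l, int) and isinstance(r, int):
        return {"+": l + r, "*": l * r}[op]       # evaluate at compile time
    return (op, l, r)                             # keep the node otherwise

# (3 + 4) * x  →  7 * x
folded = fold(("*", ("+", 3, 4), "x"))
```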
7.2 Dataflow Analysis
Optimization relies on dataflow analysis, propagating information over the control flow graph (CFG):
- Reaching definitions: which assignment statements' definitions can reach a given program point
- Live variable analysis: whether a variable will be used later
- Available expression analysis: whether an expression has already been computed and its value has not changed
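Live-variable analysis runs backward: a variable is live before a statement if it is used there, or if it is live afterwards and not redefined. For straight-line code (no CFG joins), this is a single reverse pass; a sketch over (target, uses) pairs:

```python
def liveness(stmts, live_at_end=frozenset()):
    live = set(live_at_end)
    result = []
    for target, uses in reversed(stmts):
        # live-before = (live-after − defs) ∪ uses
        live = (live - {target}) | set(uses)
        result.append((target, uses, frozenset(live)))
    return list(reversed(result))

# a = b * c; e = a + d — with nothing live at the end,
# b, c, d must be live on entry, and a is live between the statements.
info = liveness([("a", ["b", "c"]), ("e", ["a", "d"])])
```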
8. Target Code Generation
8.1 Instruction Selection
Map IR to target machine instructions:
IR: `t1 = a + b`

x86-64:

```
mov eax, [rbp-4]     ; load a
add eax, [rbp-8]     ; add b
mov [rbp-12], eax    ; store to t1
```
8.2 Register Allocation
Registers are limited, so decisions must be made about which variables reside in registers:
- Graph coloring algorithm: build an interference graph (edges between simultaneously live variables), color with \(k\) colors (\(k\) = number of registers)
- Variables that cannot be colored are spilled to memory
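The coloring step can be sketched greedily (a real allocator orders nodes by degree and implements spilling; both are omitted here):

```python
def color(interference, k):
    colors = {}
    for var in sorted(interference):               # deterministic order
        # Colors already used by neighbors are unavailable.
        taken = {colors[n] for n in interference[var] if n in colors}
        free = [c for c in range(k) if c not in taken]
        if not free:
            return None                            # would require a spill
        colors[var] = free[0]
    return colors

# a interferes with b and c; b and c never live simultaneously,
# so two registers suffice: b and c can share one.
graph = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}}
regs = color(graph, k=2)
```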
8.3 Instruction Scheduling
Reorder instructions to exploit the pipeline and reduce stalls.
9. Interpreters, Compilers, and JIT
| Approach | How it Works | Advantages | Disadvantages |
|---|---|---|---|
| Interpreter | Executes source/bytecode statement by statement | Fast startup, easy debugging | Slow execution |
| AOT Compiler | Pre-compiles to machine code | Fastest execution | Slow compilation, platform-dependent |
| JIT Compiler | Compiles hot code at runtime | Runtime optimization | Slow startup, higher memory usage |
JIT compilation flow:
```mermaid
graph LR
    A[Bytecode] --> B{Hot spot?}
    B -->|No| C[Interpret]
    B -->|Yes| D[JIT compile to machine code]
    D --> E[Cache machine code]
    E --> F[Execute machine code directly]
    C --> B
```
10. LLVM Architecture
LLVM is a modern compiler infrastructure with a modular design.
```mermaid
graph TD
    subgraph Frontend
        A1[Clang - C/C++]
        A2[Rust Frontend]
        A3[Swift Frontend]
    end
    subgraph LLVM Core
        B[LLVM IR]
        C[Optimization Passes]
    end
    subgraph Backend
        D1[x86-64 CodeGen]
        D2[ARM CodeGen]
        D3[RISC-V CodeGen]
        D4[WASM CodeGen]
    end
    A1 --> B
    A2 --> B
    A3 --> B
    B --> C
    C --> D1
    C --> D2
    C --> D3
    C --> D4
```
10.1 LLVM IR Example
```llvm
define i32 @add(i32 %a, i32 %b) {
entry:
  %result = add i32 %a, %b
  ret i32 %result
}
```
LLVM IR characteristics:
- Typed: all values have types
- SSA form: each variable is assigned exactly once
- Three forms: in-memory data structures, human-readable .ll files, and compact bitcode .bc files
10.2 Impact of LLVM
- Clang: C/C++/Objective-C compiler
- Rust: uses LLVM as its backend
- Swift: Apple's language, built on LLVM
- Julia: scientific computing language, uses LLVM JIT
- Numba: Python numerical computation JIT
11. Compiler Construction Tools
| Tool | Purpose | Representatives |
|---|---|---|
| Lexer generators | Regular expressions → DFA | Lex, Flex, re2c |
| Parser generators | Grammar → Parser | Yacc, Bison, ANTLR |
| Compiler frameworks | Complete compiler infrastructure | LLVM, GCC |
| Parser combinators | Functional parsing | Parsec (Haskell), nom (Rust) |
References
- "Compilers: Principles, Techniques, and Tools" (Dragon Book) - Aho, Lam, Sethi, Ullman
- "Engineering a Compiler" - Cooper & Torczon
- LLVM Official Documentation: https://llvm.org/docs/