Introduction to Compiler Design
Introduction
A compiler translates high-level language source code into target machine code, serving as the critical bridge between programmers and hardware. Understanding compiler design helps write more efficient code and provides the foundation for understanding language design, program optimization, and toolchains such as LLVM.
Related content: Formal Languages and Automata
1. Overall Compiler Structure
```mermaid
graph LR
    A[Source Code] --> B[Lexical Analysis]
    B --> C[Parsing]
    C --> D[Semantic Analysis]
    D --> E[IR Generation]
    E --> F[Optimization]
    F --> G[Code Generation]
    G --> H[Target Code]

    subgraph Frontend
        B
        C
        D
    end
    subgraph Middle-end
        E
        F
    end
    subgraph Backend
        G
    end
```
The compilation process is typically divided into a frontend (source-language dependent) and a backend (target-platform dependent), decoupled through an intermediate representation (IR).
2. Lexical Analysis
The lexical analyzer (lexer/scanner) converts the source code character stream into a token stream.
2.1 Example
Source code: `if (x >= 42) { y = x + 1; }`

Token stream:

```
IF LPAREN ID("x") GE NUM(42) RPAREN
LBRACE ID("y") ASSIGN ID("x") PLUS NUM(1) SEMICOLON RBRACE
```
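Rule order matters here: keywords must win over identifiers, and longer operators such as `>=` must be tried before `=`. A minimal tokenizer sketch in Python (token names follow the stream above; the rule set covers only what this example needs):

```python
import re

# Longest/most specific rules first: NUM before ID, GE before ASSIGN.
TOKEN_SPEC = [
    ("NUM",       r"\d+"),
    ("ID",        r"[a-zA-Z_][a-zA-Z0-9_]*"),
    ("GE",        r">="),
    ("ASSIGN",    r"="),
    ("PLUS",      r"\+"),
    ("LPAREN",    r"\("),
    ("RPAREN",    r"\)"),
    ("LBRACE",    r"\{"),
    ("RBRACE",    r"\}"),
    ("SEMICOLON", r";"),
    ("SKIP",      r"\s+"),
]
KEYWORDS = {"if": "IF"}

def tokenize(src):
    pattern = "|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC)
    tokens = []
    for m in re.finditer(pattern, src):
        kind, text = m.lastgroup, m.group()
        if kind == "SKIP":
            continue                      # drop whitespace
        if kind == "ID" and text in KEYWORDS:
            kind = KEYWORDS[text]         # keywords win over identifiers
        tokens.append((kind, text))
    return tokens
```

Running `tokenize('if (x >= 42) { y = x + 1; }')` reproduces the token stream shown above.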
2.2 Regular Expressions → DFA
Lexical rules are described using regular expressions and converted to deterministic finite automata (DFA) for execution:
Example: the regular expression for identifiers, `[a-zA-Z_][a-zA-Z0-9_]*`, corresponds to the DFA:

```
                              letter/digit/underscore
      letter/underscore                ┌──────┐
──→ (q0) ──────────────→ ((q1)) ◄─────┘      │
                            │                │
                            └────────────────┘
```
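The same DFA can be run directly as a transition function; a sketch, with q1 as the only accepting state:

```python
# DFA for [a-zA-Z_][a-zA-Z0-9_]*, written out as explicit state transitions.
def is_identifier(s):
    state = "q0"
    for ch in s:
        if state == "q0" and (ch.isalpha() or ch == "_"):
            state = "q1"          # first character: letter or underscore
        elif state == "q1" and (ch.isalnum() or ch == "_"):
            state = "q1"          # self-loop: letter, digit, or underscore
        else:
            return False          # no transition: reject immediately
    return state == "q1"          # q1 is the accepting state
```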
2.3 Lexical Analysis Tools
- Lex/Flex: classic lexer generators
- Regular expression engines: built into modern languages
3. Parsing (Syntax Analysis)
The parser organizes the token stream into a syntax tree according to a context-free grammar (CFG).
3.1 Context-Free Grammar
```
// Simple arithmetic expression grammar
E → E + T | E - T | T
T → T * F | T / F | F
F → ( E ) | num | id
```
3.2 Top-Down Parsing: LL Parsing
LL(k): scans left-to-right (L), leftmost derivation (L), looks ahead k tokens.
Recursive descent parsing is the most intuitive form of LL parsing:
```python
def parse_expr():
    """E → T (('+' | '-') T)*"""
    left = parse_term()
    while current_token in ('+', '-'):
        op = current_token
        advance()
        right = parse_term()
        left = BinOp(op, left, right)
    return left

def parse_term():
    """T → F (('*' | '/') F)*"""
    left = parse_factor()
    while current_token in ('*', '/'):
        op = current_token
        advance()
        right = parse_factor()
        left = BinOp(op, left, right)
    return left

def parse_factor():
    """F → '(' E ')' | num | id"""
    if current_token == '(':
        advance()
        node = parse_expr()
        expect(')')
        return node
    elif current_token.isdigit():
        node = Num(int(current_token))
        advance()
        return node
    else:  # id
        node = Var(current_token)
        advance()
        return node
```
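The skeleton above relies on an external tokenizer, AST classes (`BinOp`, `Num`, `Var`), and helpers (`advance`, `expect`). As a runnable illustration, here is the same grammar as a self-contained evaluator that computes values instead of building nodes (a sketch; splitting on whitespace stands in for a real lexer):

```python
def evaluate(src):
    # Pad operators with spaces so a simple split produces the token list.
    for ch in "()+-*/":
        src = src.replace(ch, f" {ch} ")
    tokens = src.split()
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def expr():                       # E → T (('+' | '-') T)*
        left = term()
        while peek() in ('+', '-'):
            op = peek(); advance()
            right = term()
            left = left + right if op == '+' else left - right
        return left

    def term():                       # T → F (('*' | '/') F)*
        left = factor()
        while peek() in ('*', '/'):
            op = peek(); advance()
            right = factor()
            left = left * right if op == '*' else left / right
        return left

    def factor():                     # F → '(' E ')' | num
        if peek() == '(':
            advance()
            value = expr()
            advance()                 # consume ')'
            return value
        value = int(peek())
        advance()
        return value

    return expr()
```

Note how `1 + 2 * 3` evaluates to 7, not 9: the grammar's layering (`term` inside `expr`) encodes operator precedence.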
LL parsing requires the grammar to be free of left recursion: a left-recursive rule such as E → E + T must first be rewritten, e.g. as E → T E′ with E′ → + T E′ | ε, which is exactly the loop form used in the recursive-descent code above.
3.3 Bottom-Up Parsing: LR Parsing
LR(k): scans left-to-right (L), rightmost derivation in reverse (R), looks ahead k tokens.
LR parsing is more powerful than LL and can handle a broader class of grammars.
| Method | Power | Complexity | Tools |
|---|---|---|---|
| LL(1) | Weaker | Simple | ANTLR, recursive descent |
| SLR(1) | Moderate | Moderate | Yacc variants |
| LALR(1) | Stronger | Moderate | Yacc/Bison |
| LR(1) | Strongest | Higher | Rarely used |
Core LR operations:
- Shift: read a token and push it onto the stack
- Reduce: when the top of the stack matches the right-hand side of a grammar rule, replace it with the left-hand side
4. Abstract Syntax Tree (AST)
The AST is a tree-structured representation of the source code, stripped of syntactic noise (such as parentheses and semicolons).
Source code: `x = 2 * (y + 3)`

AST:

```
         Assign
        /      \
    Var(x)     Mul
              /   \
          Num(2)  Add
                 /   \
             Var(y)  Num(3)
```
The AST is the fundamental data structure for subsequent analysis and code generation.
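In code, such a tree is typically built from small node classes; a sketch with illustrative names:

```python
from dataclasses import dataclass

@dataclass
class Num:
    value: int

@dataclass
class Var:
    name: str

@dataclass
class BinOp:
    op: str
    left: object
    right: object

@dataclass
class Assign:
    target: Var
    value: object

# x = 2 * (y + 3) — note that the parentheses leave no trace in the tree:
# grouping is encoded purely by its shape.
tree = Assign(Var("x"), BinOp("*", Num(2), BinOp("+", Var("y"), Num(3))))
```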
5. Semantic Analysis
Semantic analysis checks the semantic correctness of the AST. Main tasks include:
5.1 Type Checking
```c
int x = "hello";    // Type error: cannot assign string to int
float y = 3 + 2.5;  // OK: int implicitly promoted to float
```
Type checking verifies that every operator and function receives operands of the expected types, inserting implicit conversions where the language permits them.
5.2 Symbol Table
The symbol table records information about all identifiers in the program:
| Name | Type | Scope | Address/Offset |
|---|---|---|---|
| x | int | main | rbp-4 |
| y | float | main | rbp-8 |
| add | int→int→int | global | 0x4000a0 |
5.3 Scope Resolution
```c
int x = 1;          // global x

void foo() {
    int x = 2;      // local x, shadows global
    {
        int x = 3;  // inner scope, shadows outer
    }
}
```
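A common implementation is a chain of per-scope tables: lookup starts in the innermost scope and walks outward until it finds the name, which is exactly how shadowing resolves. A sketch:

```python
class SymbolTable:
    def __init__(self, parent=None):
        self.symbols = {}        # name → info for this scope only
        self.parent = parent     # enclosing scope, or None for global

    def define(self, name, info):
        self.symbols[name] = info

    def lookup(self, name):
        scope = self
        while scope is not None:
            if name in scope.symbols:
                return scope.symbols[name]
            scope = scope.parent
        return None              # undeclared identifier

# Mirrors the C example: global x, foo's x, and the inner block's x
globals_ = SymbolTable()
globals_.define("x", 1)
foo_scope = SymbolTable(parent=globals_)
foo_scope.define("x", 2)
inner = SymbolTable(parent=foo_scope)
inner.define("x", 3)
```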
6. Intermediate Representation (IR)
IR serves as the bridge between source language and target code, facilitating optimization.
6.1 Three-Address Code
Source code: `a = b * c + d`

Three-address code:

```
t1 = b * c
t2 = t1 + d
a = t2
```
Each instruction involves at most three addresses (two operands + one result).
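Generation is a post-order walk of the expression tree: both operands are translated first, then a fresh temporary receives the result. A sketch using nested tuples for operator nodes and strings for leaves:

```python
def compile_expr(tree):
    code = []
    counter = 0

    def emit(node):
        nonlocal counter
        if isinstance(node, str):
            return node                  # leaf: already an address
        op, l, r = node
        left, right = emit(l), emit(r)   # post-order: operands first
        counter += 1
        temp = f"t{counter}"
        code.append(f"{temp} = {left} {op} {right}")
        return temp

    return code, emit(tree)

# a = b * c + d
code, result = compile_expr(("+", ("*", "b", "c"), "d"))
code.append(f"a = {result}")
```

This reproduces exactly the three instructions shown above.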
6.2 SSA Form
Static Single Assignment (SSA): each variable is assigned exactly once.
```
// Original
x = 1
x = x + 1
y = x

// SSA form
x1 = 1
x2 = x1 + 1
y1 = x2
```
SSA simplifies dataflow analysis and optimization. At branch merge points, \(\phi\) functions are used:
```
// if (cond) x = 1; else x = 2; y = x;
    if cond goto L1 else L2
L1: x1 = 1; goto L3
L2: x2 = 2; goto L3
L3: x3 = φ(x1, x2)
    y1 = x3
```
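For straight-line code, SSA construction reduces to versioned renaming (the φ placement needed at branch merges is omitted here). A sketch over (target, right-hand-side tokens) pairs, a representation chosen just for this example:

```python
def to_ssa(stmts):
    version = {}                 # variable → current version number
    out = []
    for target, rhs in stmts:
        # Uses on the right-hand side refer to the *current* version.
        new_rhs = [f"{t}{version[t]}" if t in version else t for t in rhs]
        # The assignment itself creates a fresh version of the target.
        version[target] = version.get(target, 0) + 1
        out.append((f"{target}{version[target]}", new_rhs))
    return out

# The example above: x = 1; x = x + 1; y = x
ssa = to_ssa([("x", ["1"]), ("x", ["x", "+", "1"]), ("y", ["x"])])
```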
7. Optimization
7.1 Common Optimization Techniques
| Optimization | Description | Example |
|---|---|---|
| Constant folding | Evaluate constant expressions at compile time | 3+4 → 7 |
| Constant propagation | Replace variables with known constants | x=5; y=x+1 → y=6 |
| Dead code elimination | Remove unreachable/unused code | Delete unused variables |
| Common subexpression elimination | Avoid redundant computation | a*b+a*b → t=a*b; t+t |
| Loop-invariant code motion | Move invariant computations outside the loop | - |
| Loop unrolling | Reduce loop overhead | Unroll the loop body |
| Inline expansion | Replace function calls with the function body | Eliminate call overhead |
| Tail call optimization | Convert tail recursion to loops | Avoid stack overflow |
| Strength reduction | Replace expensive operations with cheaper ones | x*2 → x<<1 |
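Constant folding, the first entry in the table, is a bottom-up rewrite of the expression tree: fold the children, and if both are now constants, evaluate the node at compile time. A sketch, representing operator nodes as tuples and leaves as ints or variable-name strings:

```python
def fold(node):
    if not isinstance(node, tuple):
        return node                               # leaf: constant or variable
    op, l, r = node
    l, r = fold(l), fold(r)                       # fold children first
    if isinstance(l, int) and isinstance(r, int):
        return {"+": l + r, "*": l * r}[op]       # evaluate at compile time
    return (op, l, r)                             # keep the node otherwise

# (3 + 4) * x  →  7 * x
folded = fold(("*", ("+", 3, 4), "x"))
```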
7.2 Dataflow Analysis
Optimization relies on dataflow analysis, propagating information over the control flow graph (CFG):
- Reaching definitions: which assignment statements' definitions can reach a given program point
- Live variable analysis: whether a variable will be used later
- Available expression analysis: whether an expression has already been computed and its value has not changed
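Live-variable analysis runs backward: a variable is live before a statement if it is used there, or if it is live afterwards and not redefined. For straight-line code (no CFG joins), this is a single reverse pass; a sketch over (target, uses) pairs:

```python
def liveness(stmts, live_at_end=frozenset()):
    live = set(live_at_end)
    result = []
    for target, uses in reversed(stmts):
        # live-before = (live-after − defs) ∪ uses
        live = (live - {target}) | set(uses)
        result.append((target, uses, frozenset(live)))
    return list(reversed(result))

# a = b * c; e = a + d — with nothing live at the end,
# b, c, d must be live on entry, and a is live between the statements.
info = liveness([("a", ["b", "c"]), ("e", ["a", "d"])])
```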
8. Target Code Generation
8.1 Instruction Selection
Map IR to target machine instructions:
IR: `t1 = a + b`

x86-64:

```
mov eax, [rbp-4]     ; load a
add eax, [rbp-8]     ; add b
mov [rbp-12], eax    ; store to t1
```
8.2 Register Allocation
Registers are limited, so decisions must be made about which variables reside in registers:
- Graph coloring algorithm: build an interference graph (edges between simultaneously live variables), color with \(k\) colors (\(k\) = number of registers)
- Variables that cannot be colored are spilled to memory
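The coloring step can be sketched greedily (a real allocator orders nodes by degree and implements spilling; both are omitted here):

```python
def color(interference, k):
    colors = {}
    for var in sorted(interference):               # deterministic order
        # Colors already used by neighbors are unavailable.
        taken = {colors[n] for n in interference[var] if n in colors}
        free = [c for c in range(k) if c not in taken]
        if not free:
            return None                            # would require a spill
        colors[var] = free[0]
    return colors

# a interferes with b and c; b and c never live simultaneously,
# so two registers suffice: b and c can share one.
graph = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}}
regs = color(graph, k=2)
```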
8.3 Instruction Scheduling
Reorder instructions to exploit the pipeline and reduce stalls.
9. Interpreters, Compilers, and JIT
| Approach | How it Works | Advantages | Disadvantages |
|---|---|---|---|
| Interpreter | Executes source/bytecode statement by statement | Fast startup, easy debugging | Slow execution |
| AOT Compiler | Pre-compiles to machine code | Fastest execution | Slow compilation, platform-dependent |
| JIT Compiler | Compiles hot code at runtime | Runtime optimization | Slow startup, higher memory usage |
JIT compilation flow:
```mermaid
graph LR
    A[Bytecode] --> B{Hot spot?}
    B -->|No| C[Interpret]
    B -->|Yes| D[JIT compile to machine code]
    D --> E[Cache machine code]
    E --> F[Execute machine code directly]
    C --> B
```
10. LLVM Architecture
LLVM is a modern compiler infrastructure with a modular design.
```mermaid
graph TD
    subgraph Frontend
        A1[Clang - C/C++]
        A2[Rust Frontend]
        A3[Swift Frontend]
    end
    subgraph LLVM Core
        B[LLVM IR]
        C[Optimization Passes]
    end
    subgraph Backend
        D1[x86-64 CodeGen]
        D2[ARM CodeGen]
        D3[RISC-V CodeGen]
        D4[WASM CodeGen]
    end
    A1 --> B
    A2 --> B
    A3 --> B
    B --> C
    C --> D1
    C --> D2
    C --> D3
    C --> D4
```
10.1 LLVM IR Example
```llvm
define i32 @add(i32 %a, i32 %b) {
entry:
  %result = add i32 %a, %b
  ret i32 %result
}
```
LLVM IR characteristics:
- Typed: all values have types
- SSA form: each variable is assigned exactly once
- Three forms: in-memory data structures, human-readable .ll files, and compact bitcode .bc files
10.2 Impact of LLVM
- Clang: C/C++/Objective-C compiler
- Rust: uses LLVM as its backend
- Swift: Apple's language, built on LLVM
- Julia: scientific computing language, uses LLVM JIT
- Numba: Python numerical computation JIT
11. Compiler Construction Tools
| Tool | Purpose | Representatives |
|---|---|---|
| Lexer generators | Regular expressions → DFA | Lex, Flex, re2c |
| Parser generators | Grammar → Parser | Yacc, Bison, ANTLR |
| Compiler frameworks | Complete compiler infrastructure | LLVM, GCC |
| Parser combinators | Functional parsing | Parsec (Haskell), nom (Rust) |
References
- "Compilers: Principles, Techniques, and Tools" (Dragon Book) - Aho, Lam, Sethi, Ullman
- "Engineering a Compiler" - Cooper & Torczon
- LLVM Official Documentation: https://llvm.org/docs/