
Introduction to Compiler Design

Introduction

A compiler translates high-level language source code into target machine code, serving as the critical bridge between programmers and hardware. Understanding compiler design helps write more efficient code and provides the foundation for understanding language design, program optimization, and toolchains such as LLVM.



1. Overall Compiler Structure

graph LR
    A[Source Code] --> B[Lexical Analysis]
    B --> C[Parsing]
    C --> D[Semantic Analysis]
    D --> E[IR Generation]
    E --> F[Optimization]
    F --> G[Code Generation]
    G --> H[Target Code]

    subgraph Frontend
        B
        C
        D
    end

    subgraph Middle-end
        E
        F
    end

    subgraph Backend
        G
    end

The compilation process is typically divided into a frontend (source-language dependent), a middle-end (IR-level optimization), and a backend (target-platform dependent), decoupled through a shared intermediate representation (IR).


2. Lexical Analysis

The lexical analyzer (lexer/scanner) converts the source code character stream into a token stream.

2.1 Example

Source code: if (x >= 42) { y = x + 1; }

Token stream:
  IF  LPAREN  ID("x")  GE  NUM(42)  RPAREN
  LBRACE  ID("y")  ASSIGN  ID("x")  PLUS  NUM(1)  SEMICOLON  RBRACE
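
A lexer for this fragment can be sketched with Python's re module. The token names mirror the stream above; the rule set is illustrative, not a complete language definition:

import re

# Order matters: '>=' must be tried before '='.
TOKEN_SPEC = [
    ("NUM",       r"\d+"),
    ("ID",        r"[A-Za-z_][A-Za-z0-9_]*"),
    ("GE",        r">="),
    ("ASSIGN",    r"="),
    ("PLUS",      r"\+"),
    ("LPAREN",    r"\("),
    ("RPAREN",    r"\)"),
    ("LBRACE",    r"\{"),
    ("RBRACE",    r"\}"),
    ("SEMICOLON", r";"),
    ("SKIP",      r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))
KEYWORDS = {"if": "IF"}   # keywords are matched as IDs, then reclassified

def tokenize(src):
    for m in MASTER.finditer(src):
        kind, text = m.lastgroup, m.group()
        if kind == "SKIP":
            continue
        if kind == "ID" and text in KEYWORDS:
            kind = KEYWORDS[text]
        yield (kind, text)

print(list(tokenize("if (x >= 42) { y = x + 1; }")))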

2.2 Regular Expressions → DFA

Lexical rules are described using regular expressions and converted to deterministic finite automata (DFA) for execution:

\[ \text{Regular Expression} \xrightarrow{\text{Thompson}} \text{NFA} \xrightarrow{\text{Subset Construction}} \text{DFA} \xrightarrow{\text{Minimization}} \text{Minimal DFA} \]

Example: regular expression for identifiers [a-zA-Z_][a-zA-Z0-9_]*

         letter/underscore    letter/digit/underscore
  ──→ (q0) ──────────────→ ((q1)) ◄─────────────────┐
                              │                       │
                              └───────────────────────┘
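
As a sketch, this minimal DFA can be executed as a table-driven loop; the state names follow the diagram, with q1 the only accepting state:

# Transitions of the identifier DFA: q0 → q1 on letter/underscore,
# q1 → q1 on letter/digit/underscore; any other character rejects.
def char_class(c):
    if c.isalpha() or c == "_":
        return "letter_"
    if c.isdigit():
        return "digit"
    return None   # no outgoing edge for this character

TRANSITIONS = {
    ("q0", "letter_"): "q1",
    ("q1", "letter_"): "q1",
    ("q1", "digit"):   "q1",
}

def is_identifier(s):
    state = "q0"
    for c in s:
        state = TRANSITIONS.get((state, char_class(c)))
        if state is None:
            return False
    return state == "q1"

print(is_identifier("_foo42"), is_identifier("9lives"))  # True False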

2.3 Lexical Analysis Tools

  • Lex/Flex: classic lexer generators
  • Regular expression engines: built into modern languages

3. Parsing (Syntax Analysis)

The parser organizes the token stream into a syntax tree according to a context-free grammar (CFG).

3.1 Context-Free Grammar

// Simple arithmetic expression grammar
E → E + T | E - T | T
T → T * F | T / F | F
F → ( E ) | num | id

3.2 Top-Down Parsing: LL Parsing

LL(k): scans the input left-to-right (the first L), produces a leftmost derivation (the second L), and looks ahead k tokens.

Recursive descent parsing is the most intuitive form of LL parsing:

def parse_expr():
    """E → T (('+' | '-') T)*"""
    left = parse_term()
    while current_token in ('+', '-'):
        op = current_token
        advance()
        right = parse_term()
        left = BinOp(op, left, right)
    return left

def parse_term():
    """T → F (('*' | '/') F)*"""
    left = parse_factor()
    while current_token in ('*', '/'):
        op = current_token
        advance()
        right = parse_factor()
        left = BinOp(op, left, right)
    return left

def parse_factor():
    """F → '(' E ')' | num | id"""
    if current_token == '(':
        advance()
        node = parse_expr()
        expect(')')
        return node
    elif current_token.isdigit():
        node = Num(int(current_token))
        advance()
        return node
    elif current_token.isidentifier():
        node = Var(current_token)
        advance()
        return node
    else:
        raise SyntaxError(f"unexpected token: {current_token!r}")

LL parsing requires the grammar to be free of left recursion, which must be eliminated:

\[ A \to A\alpha \mid \beta \quad \Rightarrow \quad A \to \beta A', \quad A' \to \alpha A' \mid \epsilon \]
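
Applying this to the expression grammar above (with \(\alpha\) = "+ T" or "- T" and \(\beta\) = T):

E  → T E'
E' → + T E' | - T E' | ε

This is exactly the shape the parse_expr loop above implements: one T, then zero or more (+|-) T continuations.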

3.3 Bottom-Up Parsing: LR Parsing

LR(k): scans the input left-to-right (L), produces a rightmost derivation in reverse (R), and looks ahead k tokens.

LR parsing is more powerful than LL and can handle a broader class of grammars.

Method    Power      Complexity   Tools
LL(1)     Weaker     Simple       ANTLR, recursive descent
SLR(1)    Moderate   Moderate     Yacc variants
LALR(1)   Stronger   Moderate     Yacc/Bison
LR(1)     Strongest  Higher       Rarely used

Core LR operations:

  • Shift: read a token and push it onto the stack
  • Reduce: when the top of the stack matches the right-hand side of a grammar rule, replace it with the left-hand side
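
For example, recognizing id + id under the expression grammar above proceeds as follows (the stack grows to the right; $ marks the ends):

Stack        Input       Action
$            id + id $   shift id
$ id         + id $      reduce F → id
$ F          + id $      reduce T → F
$ T          + id $      reduce E → T
$ E          + id $      shift +
$ E +        id $        shift id
$ E + id     $           reduce F → id
$ E + F      $           reduce T → F
$ E + T      $           reduce E → E + T
$ E          $           accept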

4. Abstract Syntax Tree (AST)

The AST is a tree-structured representation of the source code, stripped of syntactic noise (such as parentheses and semicolons).

Source code: x = 2 * (y + 3)

AST:
        Assign
       /      \
    Var(x)    Mul
             /    \
          Num(2)  Add
                 /    \
             Var(y)  Num(3)

The AST is the fundamental data structure for subsequent analysis and code generation.
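
A sketch of the node classes behind such a tree, using Python dataclasses; the names (Num, Var, BinOp, Assign) follow the diagrams used throughout this article:

from dataclasses import dataclass

@dataclass
class Num:
    value: int

@dataclass
class Var:
    name: str

@dataclass
class BinOp:
    op: str          # '+', '-', '*', '/'
    left: object     # any expression node
    right: object

@dataclass
class Assign:
    target: Var
    value: object

# x = 2 * (y + 3): the parentheses survive only as tree shape
tree = Assign(Var("x"), BinOp("*", Num(2), BinOp("+", Var("y"), Num(3))))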


5. Semantic Analysis

Semantic analysis checks the semantic correctness of the AST. Main tasks include:

5.1 Type Checking

int x = "hello";   // Type error: cannot assign string to int
float y = 3 + 2.5; // OK: int implicitly promoted to float

Type checking rules:

\[ \frac{\Gamma \vdash e_1 : \text{int} \quad \Gamma \vdash e_2 : \text{int}}{\Gamma \vdash e_1 + e_2 : \text{int}} \quad (\text{T-Add}) \]
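
The T-Add rule translates almost directly into a recursive checker. A minimal sketch over the illustrative node shapes from the AST section, where the env dictionary plays the role of \(\Gamma\):

from dataclasses import dataclass

@dataclass
class Num:
    value: int

@dataclass
class Var:
    name: str

@dataclass
class BinOp:
    op: str
    left: object
    right: object

def type_of(node, env):
    if isinstance(node, Num):
        return "int"
    if isinstance(node, Var):
        return env[node.name]        # look up x : τ in Γ
    if isinstance(node, BinOp) and node.op == "+":
        # T-Add: if e1 : int and e2 : int, then e1 + e2 : int
        if type_of(node.left, env) == "int" and type_of(node.right, env) == "int":
            return "int"
        raise TypeError("'+' requires int operands")
    raise TypeError(f"cannot type {node!r}")

print(type_of(BinOp("+", Num(3), Var("x")), {"x": "int"}))  # int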

5.2 Symbol Table

The symbol table records information about all identifiers in the program:

Name   Type          Scope    Address/Offset
x      int           main     rbp-4
y      float         main     rbp-8
add    int→int→int   global   0x4000a0

5.3 Scope Resolution

int x = 1;          // global x
void foo() {
    int x = 2;      // local x, shadows global
    {
        int x = 3;  // inner scope, shadows outer
    }
}
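
A common implementation is a stack of scopes with lookup walking outward, which yields exactly the shadowing behavior shown above. A minimal sketch:

class SymbolTable:
    def __init__(self):
        self.scopes = [{}]               # innermost scope is last

    def enter_scope(self):
        self.scopes.append({})

    def exit_scope(self):
        self.scopes.pop()

    def define(self, name, info):
        self.scopes[-1][name] = info

    def lookup(self, name):
        # Walk innermost → outermost: the nearest definition wins (shadowing)
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        raise NameError(f"undeclared identifier: {name}")

st = SymbolTable()
st.define("x", 1)          # global x
st.enter_scope()
st.define("x", 2)          # local x shadows global
print(st.lookup("x"))      # 2
st.exit_scope()
print(st.lookup("x"))      # 1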

6. Intermediate Representation (IR)

IR serves as the bridge between source language and target code, facilitating optimization.

6.1 Three-Address Code

Source code: a = b * c + d

Three-address code:
  t1 = b * c
  t2 = t1 + d
  a = t2

Each instruction involves at most three addresses (two operands + one result).
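
A sketch of lowering an expression tree to three-address code: each operator node gets a fresh temporary (node shapes are the illustrative ones from the AST section):

from dataclasses import dataclass
import itertools

@dataclass
class Var:
    name: str

@dataclass
class BinOp:
    op: str
    left: object
    right: object

temp_ids = itertools.count(1)

def gen(node, code):
    """Emit three-address instructions into code; return the operand name."""
    if isinstance(node, Var):
        return node.name
    left = gen(node.left, code)
    right = gen(node.right, code)
    t = f"t{next(temp_ids)}"
    code.append(f"{t} = {left} {node.op} {right}")
    return t

code = []
result = gen(BinOp("+", BinOp("*", Var("b"), Var("c")), Var("d")), code)
code.append(f"a = {result}")
print("\n".join(code))     # t1 = b * c / t2 = t1 + d / a = t2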

6.2 SSA Form

Static Single Assignment (SSA): each variable is assigned exactly once.

// Original
x = 1
x = x + 1
y = x

// SSA form
x1 = 1
x2 = x1 + 1
y1 = x2

SSA simplifies dataflow analysis and optimization. At branch merge points, \(\phi\) functions are used:

// if (cond) x = 1; else x = 2; y = x;
if cond goto L1 else L2
L1: x1 = 1; goto L3
L2: x2 = 2; goto L3
L3: x3 = φ(x1, x2)
    y1 = x3

7. Optimization

7.1 Common Optimization Techniques

Optimization                       Description                                      Example
Constant folding                   Evaluate constant expressions at compile time    3+4 → 7
Constant propagation               Replace variables with known constants           x=5; y=x+1 → y=6
Dead code elimination              Remove unreachable/unused code                   Delete unused variables
Common subexpression elimination   Avoid redundant computation                      a*b+a*b → t=a*b; t+t
Loop-invariant code motion         Move invariant computations outside the loop     -
Loop unrolling                     Reduce loop overhead                             Unroll the loop body
Inline expansion                   Replace function calls with the function body    Eliminate call overhead
Tail call optimization             Convert tail recursion to loops                  Avoid stack overflow
Strength reduction                 Replace expensive operations with cheaper ones   x*2 → x<<1
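
As a concrete sketch of the first technique, constant folding is a bottom-up AST rewrite (node shapes as in the AST section; only + and * are handled here):

from dataclasses import dataclass

@dataclass
class Num:
    value: int

@dataclass
class BinOp:
    op: str
    left: object
    right: object

OPS = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}

def fold(node):
    # Fold children first; then fold this node if both sides became constants
    if not isinstance(node, BinOp):
        return node
    left, right = fold(node.left), fold(node.right)
    if isinstance(left, Num) and isinstance(right, Num) and node.op in OPS:
        return Num(OPS[node.op](left.value, right.value))
    return BinOp(node.op, left, right)

print(fold(BinOp("+", Num(3), Num(4))))   # Num(value=7)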

7.2 Dataflow Analysis

Optimization relies on dataflow analysis, propagating information over the control flow graph (CFG):

  • Reaching definitions: which assignment statements' definitions can reach a given program point
  • Live variable analysis: whether a variable's current value may still be used later (a minimal sketch follows this list)
  • Available expression analysis: whether an expression has already been computed and its value has not changed
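
For straight-line code, live-variable analysis reduces to a single backward pass: a variable is live before an instruction if that instruction uses it, or if it is live afterwards and not redefined. The sketch below, with instructions simplified to (defined, used) pairs, walks the three-address code of Section 6.1 backwards:

# Three-address code for a = b * c + d, one (defined, used) tuple each
code = [
    ("t1", ["b", "c"]),    # t1 = b * c
    ("t2", ["t1", "d"]),   # t2 = t1 + d
    ("a",  ["t2"]),        # a = t2
]

live = set()               # live-out of the last instruction (assume empty)
for defined, used in reversed(code):
    live.discard(defined)  # the definition kills its target...
    live.update(used)      # ...and the operands become live (gen)
    print(f"live before '{defined} = ...': {sorted(live)}")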

8. Target Code Generation

8.1 Instruction Selection

Map IR to target machine instructions:

IR: t1 = a + b

x86-64:
  mov eax, [rbp-4]    ; load a
  add eax, [rbp-8]    ; add b
  mov [rbp-12], eax   ; store to t1

8.2 Register Allocation

Registers are limited, so decisions must be made about which variables reside in registers:

  • Graph coloring algorithm: build an interference graph (an edge connects variables that are live at the same time), then color it with \(k\) colors, where \(k\) is the number of registers; a greedy sketch follows this list
  • Variables that cannot be colored are spilled to memory
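
The coloring step in its simplest greedy form (real allocators, e.g. Chaitin-style, add simplification, coalescing, and spill heuristics; this shows only the core idea):

# Interference graph: an edge joins variables that are live simultaneously
interference = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}

k = 2                      # pretend the machine has two registers
colors = {}                # variable → register index, or "spill"
for var in interference:
    taken = {colors.get(n) for n in interference[var]}
    free = [r for r in range(k) if r not in taken]
    colors[var] = free[0] if free else "spill"   # no free register → memory

print(colors)              # {'a': 0, 'b': 1, 'c': 'spill', 'd': 1}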

8.3 Instruction Scheduling

Reorder instructions to exploit the pipeline and reduce stalls.


9. Interpreters, Compilers, and JIT

Approach       How it Works                                      Advantages                     Disadvantages
Interpreter    Executes source/bytecode statement by statement   Fast startup, easy debugging   Slow execution
AOT compiler   Pre-compiles to machine code                      Fastest execution              Slow compilation, platform-dependent
JIT compiler   Compiles hot code at runtime                      Runtime optimization           Slow startup, higher memory usage

JIT compilation flow:

graph LR
    A[Bytecode] --> B{Hot spot?}
    B -->|No| C[Interpret]
    B -->|Yes| D[JIT compile to machine code]
    D --> E[Cache machine code]
    E --> F[Execute machine code directly]
    C --> B

10. LLVM Architecture

LLVM is a modern compiler infrastructure with a modular design.

graph TD
    subgraph Frontend
        A1[Clang - C/C++]
        A2[Rust Frontend]
        A3[Swift Frontend]
    end

    subgraph LLVM Core
        B[LLVM IR]
        C[Optimization Passes]
    end

    subgraph Backend
        D1[x86-64 CodeGen]
        D2[ARM CodeGen]
        D3[RISC-V CodeGen]
        D4[WASM CodeGen]
    end

    A1 --> B
    A2 --> B
    A3 --> B
    B --> C
    C --> D1
    C --> D2
    C --> D3
    C --> D4

10.1 LLVM IR Example

define i32 @add(i32 %a, i32 %b) {
entry:
    %result = add i32 %a, %b
    ret i32 %result
}

LLVM IR characteristics:

  • Typed: all values have types
  • SSA form: each variable is assigned exactly once
  • Three forms: in-memory data structures, human-readable .ll files, compact bitcode .bc files
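
The three forms are convertible with the standard LLVM tools; a typical round trip looks like this (exact flags can vary slightly across LLVM versions):

clang -S -emit-llvm add.c -o add.ll   # C source → textual IR (.ll)
llvm-as add.ll -o add.bc              # textual IR → bitcode (.bc)
llc add.ll -o add.s                   # IR → target assembly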

10.2 Impact of LLVM

  • Clang: C/C++/Objective-C compiler
  • Rust: uses LLVM as its backend
  • Swift: Apple's language, built on LLVM
  • Julia: scientific computing language, uses LLVM JIT
  • Numba: Python numerical computation JIT

11. Compiler Construction Tools

Tool                  Purpose                            Representatives
Lexer generators      Regular expressions → DFA          Lex, Flex, re2c
Parser generators     Grammar → parser                   Yacc, Bison, ANTLR
Compiler frameworks   Complete compiler infrastructure   LLVM, GCC
Parser combinators    Functional parsing                 Parsec (Haskell), nom (Rust)

References

  • "Compilers: Principles, Techniques, and Tools" (Dragon Book) - Aho, Lam, Sethi, Ullman
  • "Engineering a Compiler" - Cooper & Torczon
  • LLVM Official Documentation: https://llvm.org/docs/
