Processor Microarchitecture
1. Overview
Microarchitecture is the concrete hardware implementation of an ISA. The same ISA can have multiple microarchitecture implementations that make different trade-offs among performance, power, and area.
Core objective: Maximize instruction throughput (IPC) under given power and area constraints.
2. Classic Five-Stage Pipeline
2.1 Pipeline Stages
Instruction execution is divided into five stages, each handled by independent hardware, allowing different stages of different instructions to overlap:
```mermaid
graph LR
  IF[IF<br/>Fetch] --> ID[ID<br/>Decode/Read Registers]
  ID --> EX[EX<br/>Execute/Compute Address]
  EX --> MEM[MEM<br/>Memory Access]
  MEM --> WB[WB<br/>Write Back]
```
| Stage | Full Name | Function |
|---|---|---|
| IF | Instruction Fetch | Fetch instruction from I-Cache, PC += 4 |
| ID | Instruction Decode | Decode instruction, read source registers |
| EX | Execute | ALU operation or address computation |
| MEM | Memory Access | Load/Store access D-Cache |
| WB | Write Back | Write result to destination register |
2.2 Pipeline Speedup
Ideally, a \(k\)-stage pipeline executing \(n\) instructions takes \(k + (n - 1)\) cycles (the first instruction fills all \(k\) stages, then one instruction completes per cycle), versus \(nk\) cycles unpipelined. The speedup is:

\[S = \frac{nk}{k + n - 1}\]

When \(n \gg k\), \(S \approx k\), i.e., the speedup approaches the number of pipeline stages.
Numerical Example
A 5-stage pipeline executing 100 instructions:

\[S = \frac{100 \times 5}{5 + 100 - 1} = \frac{500}{104} \approx 4.81\]

Close to the ideal speedup of 5.
In practice, due to the presence of hazards, the pipeline cannot always run at full capacity.
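Under the standard assumption that \(n\) instructions on a \(k\)-stage pipeline take \(k + n - 1\) cycles (versus \(nk\) unpipelined), the speedup is easy to check numerically; a minimal Python sketch:

```python
def speedup(n, k):
    """Ideal pipeline speedup: n*k unpipelined cycles vs. k + (n - 1) pipelined.

    Assumes no hazards: after the first instruction fills all k stages,
    one instruction completes every cycle.
    """
    return (n * k) / (k + n - 1)

print(speedup(100, 5))        # 500 / 104 ≈ 4.81, close to the ideal 5
print(speedup(1_000_000, 5))  # approaches k = 5 as n grows
```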
3. Pipeline Hazards
3.1 Data Hazards
A subsequent instruction depends on the result of a prior instruction, but the result has not yet been produced.
Types (classified by data dependence):
| Type | Meaning | Example |
|---|---|---|
| RAW (Read After Write) | True dependence | add x1,x2,x3; sub x4,x1,x5 |
| WAR (Write After Read) | Anti-dependence | May occur in out-of-order execution |
| WAW (Write After Write) | Output dependence | May occur in out-of-order execution |
Solutions:
- Forwarding/Bypassing: Route results from EX or MEM stage directly to the instruction that needs them, without waiting for WB
- Pipeline stalling: Insert bubbles, waiting for data to become ready
- Compiler scheduling: Reorder instructions to avoid hazards
3.2 Control Hazards
When the branch direction is undetermined, the pipeline does not know which path to fetch from next.
Impact: Branch instructions may cause 1-3 cycles of pipeline bubbles.
Solutions:
- Branch Delay Slot: The instruction after a branch always executes (MIPS approach, now obsolete)
- Branch Prediction: Predict the branch direction and speculatively execute
3.3 Structural Hazards
Multiple instructions simultaneously need the same hardware resource.
Solutions:
- Duplicate hardware resources (e.g., separate I-Cache and D-Cache)
- Pipeline stalling
4. Branch Prediction
Branch prediction is a critical technique in modern processors; prediction accuracy directly affects IPC.
4.1 Static Prediction
- Always Not Taken: Assume no jump
- Always Taken: Assume jump
- BTFN (Backward Taken, Forward Not Taken): Predict backward jumps as taken (loop scenario)
4.2 Dynamic Prediction
Branch History Table (BHT)
Indexed by low-order PC bits, recording past branch behavior:
- 1-bit predictor: Records whether the last branch was taken. Poor performance with nested loops (mispredicts twice at each loop boundary)
- 2-bit saturating counter: Requires two consecutive mispredictions to change the prediction direction
```
         Not Taken          Not Taken          Not Taken
   [11] ──────────► [10] ──────────► [01] ──────────► [00]
    ST  ◄────────── WT   ◄────────── WN   ◄────────── SN
         Taken              Taken              Taken

   Taken in [11] stays in [11]; Not Taken in [00] stays in [00].
   States [11]/[10] predict Taken; states [01]/[00] predict Not Taken.
```
States: ST (Strongly Taken), WT (Weakly Taken), WN (Weakly Not Taken), SN (Strongly Not Taken)
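The state machine above can be sketched as a single counter (a real BHT holds an array of these, indexed by low-order PC bits; the loop pattern below is an illustrative workload):

```python
class TwoBitPredictor:
    """One 2-bit saturating counter: 3=ST, 2=WT, 1=WN, 0=SN."""
    def __init__(self):
        self.state = 3              # start strongly taken

    def predict(self):
        return self.state >= 2      # the two upper states predict Taken

    def update(self, taken):
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

# A loop branch taken 9 times, then falling through once:
p = TwoBitPredictor()
mispredicts = 0
for taken in [True] * 9 + [False]:
    if p.predict() != taken:
        mispredicts += 1
    p.update(taken)
print(mispredicts)   # 1 -- only the loop exit; a 1-bit scheme would also
                     # mispredict the first iteration of the next loop run
```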
Branch Target Buffer (BTB)
- Caches the mapping from branch instruction addresses to jump target addresses
- Provides the target address at the IF stage, reducing penalty
Tournament Predictor
Combines multiple predictors with a meta-predictor that selects the currently more accurate one:
- Local predictor: Based on the history pattern of a single branch
- Global predictor: Based on the global history of all branches
- Meta selector: Tracks which predictor is more accurate for the current branch
Modern processors (e.g., Intel Alder Lake) achieve branch prediction accuracy exceeding 97%.
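The three components can be sketched together as follows (table sizes, counter widths, and the indexing scheme are simplifying assumptions of this sketch):

```python
class Tournament:
    """Toy tournament predictor: a PC-indexed local table, a global-history-
    indexed table, and a PC-indexed 2-bit meta counter choosing between them."""
    def __init__(self, bits=4):
        size = 1 << bits
        self.mask = size - 1
        self.local = [3] * size     # 2-bit counters indexed by PC low bits
        self.globl = [3] * size     # 2-bit counters indexed by global history
        self.meta  = [1] * size     # <2 prefers local, >=2 prefers global
        self.ghr = 0                # global history register (recent outcomes)

    def predict(self, pc):
        i, g = pc & self.mask, self.ghr & self.mask
        pl, pg = self.local[i] >= 2, self.globl[g] >= 2
        return pg if self.meta[i] >= 2 else pl

    def update(self, pc, taken):
        i, g = pc & self.mask, self.ghr & self.mask
        pl, pg = self.local[i] >= 2, self.globl[g] >= 2
        if pl != pg:                # reward whichever component was right
            d = 1 if pg == taken else -1
            self.meta[i] = min(3, max(0, self.meta[i] + d))
        def bump(table, j):
            table[j] = min(3, table[j] + 1) if taken else max(0, table[j] - 1)
        bump(self.local, i)
        bump(self.globl, g)
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask
```

On a strictly alternating branch the local counter hovers near 50%, the global table learns the pattern from the history register, and the meta counter migrates to the global predictor.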
4.3 Indirect Jump Prediction
- Function pointers, virtual function calls, switch-case -> target address is not fixed
- Uses Indirect Branch Target Array to record historical targets
5. Out-of-Order Execution
5.1 Motivation
In in-order execution, a long-latency instruction (such as a cache miss) blocks all subsequent unrelated instructions. Out-of-order execution allows ready instructions to execute first.
5.2 Tomasulo's Algorithm
Pioneered on the IBM System/360 Model 91 (1967). Core idea: register renaming plus distributed scheduling via reservation stations.
Key components:
| Component | Function |
|---|---|
| Reservation Station (RS) | Buffers instructions waiting for operands; issues when operands are ready |
| Common Data Bus (CDB) | Broadcasts computation results; all reservation stations waiting for a result receive it simultaneously |
| Reorder Buffer (ROB) | Ensures instructions commit in program order (precise exceptions) |
Execution flow:
- Issue: Instruction enters reservation station; register renaming eliminates WAR/WAW
- Execute: Execute in functional unit once operands are ready
- Write Result: Result is broadcast via CDB
- Commit: Instructions at the head of ROB commit in order to the register file
```mermaid
graph TB
  A[Instruction Queue] --> B[Issue]
  B --> C[Reservation Station RS]
  C --> D{Operands ready?}
  D -->|Yes| E[Functional Unit Execution]
  D -->|No| F[Wait for CDB broadcast]
  F --> D
  E --> G[CDB Broadcast Result]
  G --> C
  G --> H[Reorder Buffer ROB]
  H --> I[In-order Commit]
```
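The flow above can be sketched as a toy, non-cycle-accurate simulation (the data structures are illustrative assumptions, not the hardware's): each destination register is renamed to the issuing instruction's tag, reservation stations hold either an operand value or the tag of its producer, and a CDB broadcast wakes up all waiters.

```python
class RS:
    """Reservation station entry: holds (value, producer-tag) per operand."""
    def __init__(self, op, src1, src2, tag):
        self.op, self.tag = op, tag
        self.v1, self.q1 = src1     # exactly one of (value, tag) is None
        self.v2, self.q2 = src2

    def ready(self):
        return self.q1 is None and self.q2 is None

def run(program, regs):
    """program: list of (op, dst, src1, src2); regs: dict reg -> value."""
    ops = {"add": lambda a, b: a + b, "sub": lambda a, b: a - b,
           "mul": lambda a, b: a * b}
    rename, stations, results = {}, [], {}
    # Issue: rename destinations; sources grab a value or a producer tag
    for i, (op, dst, s1, s2) in enumerate(program):
        def operand(r):
            return (None, rename[r]) if r in rename else (regs[r], None)
        stations.append(RS(op, operand(s1), operand(s2), tag=i))
        rename[dst] = i             # later readers wait on tag i (kills WAR/WAW)
    # Execute any ready station (out of order); broadcast its result on the CDB
    while stations:
        rs = next(s for s in stations if s.ready())
        result = ops[rs.op](rs.v1, rs.v2)
        results[rs.tag] = result
        stations.remove(rs)
        for s in stations:          # CDB broadcast wakes up all waiters
            if s.q1 == rs.tag:
                s.v1, s.q1 = result, None
            if s.q2 == rs.tag:
                s.v2, s.q2 = result, None
    # Commit in program order (the ROB's job: precise architectural state)
    for i, (op, dst, _, _) in enumerate(program):
        regs[dst] = results[i]
    return regs

regs = {"x1": 0, "x2": 1, "x3": 2, "x4": 0, "x5": 3, "x6": 4, "x7": 5}
prog = [("add", "x1", "x2", "x3"),   # x1 = 3
        ("sub", "x4", "x1", "x5"),   # waits for tag 0 via the CDB
        ("mul", "x1", "x6", "x7")]   # WAW on x1 eliminated by renaming
print(run(prog, regs))
```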
5.3 Register Renaming
Physical registers >> architectural registers; renaming eliminates false dependencies:
```
Original code (WAW on x1):          After renaming:
  add x1, x2, x3                      add p10, p2, p3
  sub x4, x1, x5                      sub p11, p10, p5
  mul x1, x6, x7                      mul p12, p6, p7   <- x1 renamed to p12, independent of p10
```
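The rewrite above can be reproduced mechanically with a rename map and a free list of physical registers (a minimal sketch; the register counts and names are illustrative):

```python
def rename(program, num_arch=8, first_phys=10):
    """Rewrite (op, dst, src1, src2) instructions over architectural regs
    x0..x7 onto physical regs. Illustrative sizes; no map reclamation."""
    mapping = {f"x{i}": f"p{i}" for i in range(num_arch)}        # current map
    free = [f"p{i}" for i in range(first_phys, first_phys + 16)]  # free list
    out = []
    for op, dst, s1, s2 in program:
        ps1, ps2 = mapping[s1], mapping[s2]  # sources read the *current* map
        mapping[dst] = free.pop(0)           # fresh dest reg: WAR/WAW vanish
        out.append((op, mapping[dst], ps1, ps2))
    return out

prog = [("add", "x1", "x2", "x3"),
        ("sub", "x4", "x1", "x5"),
        ("mul", "x1", "x6", "x7")]
for inst in rename(prog):
    print(inst)
# ('add', 'p10', 'p2', 'p3')
# ('sub', 'p11', 'p10', 'p5')
# ('mul', 'p12', 'p6', 'p7')   <- second write to x1 gets its own p12
```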
6. Superscalar Processors
6.1 Basic Concept
Fetch, decode, issue, and execute multiple instructions per clock cycle.
Actual IPC is limited by data dependencies, branch mispredictions, and cache misses.
Typical configurations:
| Processor | Issue Width | Actual IPC |
|---|---|---|
| ARM Cortex-A78 | 4-wide | ~3 |
| Intel Golden Cove | 6-wide | ~4-5 |
| Apple Firestorm (M1) | 8-wide | ~5-6 |
6.2 SMT (Simultaneous Multithreading)
Execute multiple hardware threads simultaneously on the same physical core:
- Intel calls it Hyper-Threading
- Shares execution units and cache; each thread has independent architectural state (registers, PC)
- Improves functional unit utilization (when one thread stalls, another can use idle units)
- Typical benefit: ~20-30% throughput improvement per core
7. VLIW (Very Long Instruction Word)
7.1 Design Philosophy
Delegates the discovery of instruction-level parallelism to the compiler rather than hardware:
- The compiler packs multiple parallelizable operations into a single "very long instruction"
- Hardware requires no complex out-of-order scheduling logic -> simpler, lower power
7.2 Advantages and Disadvantages
| Advantage | Disadvantage |
|---|---|
| Simple hardware | Poor binary compatibility |
| Low power | Compiler optimization is challenging |
| Deterministic latency | NOP padding wastes bandwidth |
Representatives: Intel Itanium (IA-64, commercial failure), TI DSP (successful in embedded domain)
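A toy illustration of the compiler's job in this model (the 3-slot bundle format and the dependence rule are assumptions of this sketch): greedily pack independent operations into fixed-width bundles, padding leftover slots with NOPs, which is exactly where the wasted bandwidth comes from.

```python
def pack(program, width=3):
    """Pack (op, dst, src1, src2) ops into width-slot bundles. An op joins the
    current bundle only if it neither reads nor writes a register already
    written inside that bundle; otherwise the bundle is flushed with NOPs."""
    bundles, current, written = [], [], set()
    for op, dst, s1, s2 in program:
        if ({s1, s2, dst} & written) or len(current) == width:
            bundles.append(current + [("nop",)] * (width - len(current)))
            current, written = [], set()
        current.append((op, dst, s1, s2))
        written.add(dst)
    if current:
        bundles.append(current + [("nop",)] * (width - len(current)))
    return bundles

prog = [("add", "x1", "x2", "x3"),
        ("mul", "x4", "x5", "x6"),   # independent -> same bundle
        ("sub", "x7", "x1", "x2")]   # reads x1 -> forced into the next bundle
for bundle in pack(prog):
    print(bundle)
```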
8. Modern Microarchitecture Examples
Apple M Series (Firestorm + Icestorm)
- Big core Firestorm: 8-wide issue, 630+ ROB entries, 192 KB L1I, 128 KB L1D
- Small core Icestorm: 4-wide issue, low power
- Heterogeneous big/little design (analogous to Arm big.LITTLE): high-performance cores + high-efficiency cores
AMD Zen Series
- Front-end: Branch prediction -> Fetch -> Decode (4-wide) -> micro-op cache
- Back-end: 6 integer execution units + 4 floating-point/SIMD units
- 6-wide issue, 256-entry ROB
9. Performance Analysis: CPI Stack
The CPI stack decomposes total cycles per instruction into a base term plus additive stall contributions:

\[\text{CPI}_{\text{total}} = \text{CPI}_{\text{base}} + \text{CPI}_{\text{cache miss}} + \text{CPI}_{\text{branch mispredict}} + \text{CPI}_{\text{data hazard}}\]

Performance Optimization Strategies
- \(\text{CPI}_{\text{cache miss}}\): Optimize data locality, prefetching -> see Memory Hierarchy Design
- \(\text{CPI}_{\text{branch mispredict}}\): Improve branch predictors, reduce branches (branchless programming)
- \(\text{CPI}_{\text{data hazard}}\): Compiler instruction scheduling
- Overall: Increase issue width and out-of-order window size -> higher IPC
Navigation
- Previous: Instruction Set Architecture
- Next: Memory Hierarchy Design