
Processor Microarchitecture

1. Overview

Microarchitecture is the concrete hardware implementation of an ISA. The same ISA can have multiple microarchitecture implementations that make different trade-offs among performance, power, and area.

Core objective: Maximize instruction throughput (IPC) under given power and area constraints.

2. Classic Five-Stage Pipeline

2.1 Pipeline Stages

Instruction execution is divided into five stages, each handled by independent hardware, allowing different stages of different instructions to overlap:

```mermaid
graph LR
    IF[IF<br/>Fetch] --> ID[ID<br/>Decode/Read Registers]
    ID --> EX[EX<br/>Execute/Compute Address]
    EX --> MEM[MEM<br/>Memory Access]
    MEM --> WB[WB<br/>Write Back]
```

| Stage | Full Name | Function |
|---|---|---|
| IF | Instruction Fetch | Fetch instruction from the I-Cache, PC += 4 |
| ID | Instruction Decode | Decode the instruction, read source registers |
| EX | Execute | ALU operation or address computation |
| MEM | Memory Access | Load/Store access to the D-Cache |
| WB | Write Back | Write the result to the destination register |

2.2 Pipeline Speedup

Ideally, a \(k\)-stage pipeline executing \(n\) instructions achieves a speedup of

\[ S = \frac{n \cdot k}{n + k - 1} \]

When \(n \gg k\), \(S \approx k\), i.e., the speedup approaches the number of pipeline stages.

Numerical Example

A 5-stage pipeline executing 100 instructions:

\[S = \frac{100 \times 5}{100 + 5 - 1} = \frac{500}{104} \approx 4.81\]

Close to the ideal speedup of 5.
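
The same formula in a few lines of Python (a quick sketch; `pipeline_speedup` is just an illustrative name):

```python
def pipeline_speedup(n: int, k: int) -> float:
    """Ideal k-stage pipeline speedup for n instructions:
    S = n*k / (n + k - 1)."""
    return n * k / (n + k - 1)

# Speedup approaches k = 5 as n grows:
for n in (10, 100, 1000, 10**6):
    print(f"n={n:>7}: S = {pipeline_speedup(n, 5):.3f}")
# n=     10: S = 3.571
# n=    100: S = 4.808
# n=   1000: S = 4.980
# n=1000000: S = 5.000
```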

In practice, hazards keep the pipeline from running at full capacity all the time.

3. Pipeline Hazards

3.1 Data Hazards

A subsequent instruction depends on the result of a prior instruction, but the result has not yet been produced.

Types (classified by data dependence):

| Type | Meaning | Example / Note |
|---|---|---|
| RAW (Read After Write) | True dependence | `add x1,x2,x3; sub x4,x1,x5` |
| WAR (Write After Read) | Anti-dependence | May occur in out-of-order execution |
| WAW (Write After Write) | Output dependence | May occur in out-of-order execution |

Solutions:

  • Forwarding/Bypassing: Route results from the EX or MEM stage directly to the instruction that needs them, without waiting for WB (see the sketch after this list)
  • Pipeline stalling: Insert bubbles until the needed data is ready
  • Compiler scheduling: Reorder instructions to avoid hazards
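
A minimal sketch of how hazard-detection logic might classify adjacent instruction pairs in a 5-stage pipeline, assuming EX->EX and MEM->EX forwarding paths (the tuple encoding and names here are illustrative):

```python
# Each instruction: (op, dest, src1, src2); None where a field is unused.
prog = [
    ("lw",  "x1", "x2", None),   # load into x1
    ("add", "x4", "x1", "x5"),   # uses x1 immediately -> load-use hazard
    ("sub", "x6", "x4", "x7"),   # RAW on x4, resolved by EX->EX forwarding
]

def classify(prev, cur):
    """RAW check between adjacent instructions: forwarding covers ALU
    results (EX->EX), but a load's data is only available after MEM,
    so a dependent next instruction still stalls one cycle."""
    if prev[1] not in (cur[2], cur[3]):
        return "no hazard"
    if prev[0] == "lw":
        return "load-use: forward MEM->EX + 1 stall cycle"
    return "RAW: forward EX->EX, no stall"

for prev, cur in zip(prog, prog[1:]):
    print(f"{prev[0]} -> {cur[0]}: {classify(prev, cur)}")
# lw -> add: load-use: forward MEM->EX + 1 stall cycle
# add -> sub: RAW: forward EX->EX, no stall
```

The load-use case is why even a fully bypassed pipeline keeps a one-cycle stall after loads.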

3.2 Control Hazards

When the branch direction is undetermined, the pipeline does not know which path to fetch from next.

Impact: Branch instructions may cause 1-3 cycles of pipeline bubbles.

Solutions:

  • Branch Delay Slot: The instruction after a branch always executes (MIPS approach, now obsolete)
  • Branch Prediction: Predict the branch direction and speculatively execute

3.3 Structural Hazards

Multiple instructions simultaneously need the same hardware resource.

Solutions:

  • Duplicate hardware resources (e.g., separate I-Cache and D-Cache)
  • Pipeline stalling

4. Branch Prediction

Branch prediction is a critical technique in modern processors; prediction accuracy directly affects IPC.

4.1 Static Prediction

  • Always Not Taken: Assume no jump
  • Always Taken: Assume jump
  • BTFN (Backward Taken, Forward Not Taken): Predict backward branches (typically loop back-edges) as taken and forward branches as not taken

4.2 Dynamic Prediction

Branch History Table (BHT)

Indexed by low-order PC bits, recording past branch behavior:

  • 1-bit predictor: Records whether the last branch was taken. Poor performance with nested loops (mispredicts twice at each loop boundary)
  • 2-bit saturating counter: Requires two consecutive mispredictions to change the prediction direction

```
 [11] ──Not Taken──► [10] ──Not Taken──► [01] ──Not Taken──► [00]
  ST  ◄────Taken──── WT   ◄────Taken──── WN  ◄────Taken───── SN

 (Taken in ST stays in ST; Not Taken in SN stays in SN)
```

States: ST (Strongly Taken), WT (Weakly Taken), WN (Weakly Not Taken), SN (Strongly Not Taken)
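
A minimal Python sketch of the 2-bit scheme (class and variable names are illustrative):

```python
class TwoBitCounter:
    """2-bit saturating counter: 0=SN, 1=WN, 2=WT, 3=ST.
    Predict taken when the counter is >= 2; two consecutive
    mispredictions are needed to flip the predicted direction."""
    def __init__(self, state: int = 2):
        self.state = state

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

# A loop branch taken 9 times then not taken (loop exit), repeated:
ctr, hits = TwoBitCounter(), 0
history = ([True] * 9 + [False]) * 3
for taken in history:
    hits += ctr.predict() == taken
    ctr.update(taken)
print(f"accuracy: {hits}/{len(history)}")   # 27/30: only exits mispredict
```

On the same trace, a 1-bit predictor would mispredict twice around each loop boundary (at the exit and again at the following re-entry).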

Branch Target Buffer (BTB)

  • Caches the mapping from branch instruction addresses to jump target addresses
  • Provides the target address at the IF stage, reducing penalty
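
A toy direct-mapped BTB, assuming 4-byte instructions (the entry count and names are made up):

```python
BTB_ENTRIES = 16
btb = [None] * BTB_ENTRIES   # each entry: (branch_pc, target_pc)

def btb_lookup(pc: int):
    """At IF: on a hit, fetch can redirect to the cached target
    immediately instead of waiting for decode/execute."""
    entry = btb[(pc >> 2) % BTB_ENTRIES]
    return entry[1] if entry and entry[0] == pc else None

def btb_update(pc: int, target: int) -> None:
    btb[(pc >> 2) % BTB_ENTRIES] = (pc, target)

btb_update(0x1000, 0x2000)
print(hex(btb_lookup(0x1000) or 0))   # 0x2000 (hit)
print(btb_lookup(0x1004))             # None (miss)
```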

Tournament Predictor

Combines multiple predictors with a meta-predictor that selects the currently more accurate one:

  • Local predictor: Based on the history pattern of a single branch
  • Global predictor: Based on the global history of all branches
  • Meta selector: Tracks which predictor is more accurate for the current branch
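
A compact sketch of the idea, combining a per-branch (local) table, a gshare-style global table, and a chooser; table sizes and names are illustrative, not any shipping design:

```python
SIZE = 1024
local_t  = [2] * SIZE    # 2-bit counters, 2 = weakly taken
global_t = [2] * SIZE
chooser  = [2] * SIZE    # >= 2 -> trust global, else local
ghist = 0                # global branch history register

def sat(c: int, up: bool) -> int:    # 2-bit saturating update
    return min(3, c + 1) if up else max(0, c - 1)

def predict_and_update(pc: int, taken: bool) -> bool:
    global ghist
    li = pc % SIZE
    gi = (pc ^ ghist) % SIZE         # gshare: PC XOR global history
    lp, gp = local_t[li] >= 2, global_t[gi] >= 2
    pred = gp if chooser[li] >= 2 else lp
    if lp != gp:                     # train chooser only on disagreement
        chooser[li] = sat(chooser[li], gp == taken)
    local_t[li]  = sat(local_t[li], taken)
    global_t[gi] = sat(global_t[gi], taken)
    ghist = ((ghist << 1) | taken) % SIZE
    return pred

print(predict_and_update(0x40, True))   # first prediction for this branch
```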

Modern processors (e.g., Intel Alder Lake) achieve branch prediction accuracy exceeding 97%.

4.3 Indirect Jump Prediction

  • Function pointers, virtual function calls, switch-case -> the target address is not fixed
  • Uses an Indirect Branch Target Array to record historically observed targets

5. Out-of-Order Execution

5.1 Motivation

In in-order execution, a long-latency instruction (such as a cache miss) blocks all subsequent unrelated instructions. Out-of-order execution allows ready instructions to execute first.

5.2 Tomasulo's Algorithm

Pioneered by the IBM System/360 Model 91 (1967). Core idea: register renaming plus distributed scheduling via reservation stations.

Key components:

| Component | Function |
|---|---|
| Reservation Station (RS) | Buffers instructions waiting for operands; issues when operands are ready |
| Common Data Bus (CDB) | Broadcasts computation results; all reservation stations waiting for a result receive it simultaneously |
| Reorder Buffer (ROB) | Ensures instructions commit in program order (precise exceptions); a later extension, not part of the original 360/91 design |

Execution flow:

  1. Issue: Instruction enters reservation station; register renaming eliminates WAR/WAW
  2. Execute: Execute in functional unit once operands are ready
  3. Write Result: Result is broadcast via CDB
  4. Commit: Instructions at the head of ROB commit in order to the register file

```mermaid
graph TB
    A[Instruction Queue] --> B[Issue]
    B --> C[Reservation Station RS]
    C --> D{Operands ready?}
    D -->|Yes| E[Functional Unit Execution]
    D -->|No| F[Wait for CDB broadcast]
    F --> D
    E --> G[CDB Broadcast Result]
    G --> C
    G --> H[Reorder Buffer ROB]
    H --> I[In-order Commit]
```
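
A deliberately tiny dataflow sketch of issue / wakeup / broadcast (illustrative names; single-cycle units, unlimited CDB bandwidth, and no ROB or commit stage):

```python
regfile = {f"x{i}": i for i in range(8)}    # x0..x7 hold values 0..7
rename  = {}                                 # arch reg -> producing RS tag
rs = []                                      # reservation stations

prog = [
    ("mul", "x1", "x2", "x3"),
    ("add", "x4", "x1", "x5"),   # waits on the mul's tag
    ("sub", "x6", "x7", "x2"),   # independent: completes before the add
]
for i, (op, dst, s1, s2) in enumerate(prog):     # issue
    e = {"tag": f"RS{i}", "op": op, "dest": dst}
    for n, src in (("1", s1), ("2", s2)):
        if src in rename:                        # not ready: record the tag
            e["v" + n], e["q" + n] = None, rename[src]
        else:                                    # ready: copy the value now
            e["v" + n], e["q" + n] = regfile[src], None
    rename[dst] = e["tag"]                       # later readers wait on this
    rs.append(e)

ops = {"add": lambda a, b: a + b, "sub": lambda a, b: a - b,
       "mul": lambda a, b: a * b}
cycle = 0
while rs:                                        # execute / write result
    cycle += 1
    for e in [e for e in rs if e["q1"] is None and e["q2"] is None]:
        rs.remove(e)
        res = ops[e["op"]](e["v1"], e["v2"])
        print(f'cycle {cycle}: {e["tag"]} {e["op"]} -> {e["dest"]} = {res}')
        for w in rs:                             # CDB broadcast wakes waiters
            for n in ("1", "2"):
                if w["q" + n] == e["tag"]:
                    w["v" + n], w["q" + n] = res, None
        if rename.get(e["dest"]) == e["tag"]:    # newest writer updates reg
            regfile[e["dest"]] = res
            del rename[e["dest"]]
# cycle 1: RS0 mul -> x1 = 6
# cycle 1: RS2 sub -> x6 = 5     <- program-order third, finishes first
# cycle 2: RS1 add -> x4 = 11
```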

5.3 Register Renaming

Physical registers >> architectural registers; renaming eliminates false dependencies:

```
Original code (WAW and WAR on x1):   After renaming:
add x1, x2, x3                       add p10, p2, p3
sub x4, x1, x5                       sub p11, p10, p5
mul x1, x6, x7                       mul p12, p6, p7   <- x1 renamed to p12, independent of p10
```
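
The same example, mechanized as a sketch of the rename stage (free-list recycling is omitted; names are illustrative):

```python
from itertools import count

phys = count(10)                              # fresh physical regs p10, p11, ...
rat = {f"x{i}": f"p{i}" for i in range(8)}    # register alias table

def rename_inst(op, dst, s1, s2):
    srcs = rat[s1], rat[s2]           # read sources BEFORE remapping dst
    rat[dst] = f"p{next(phys)}"       # a new name removes WAR/WAW on dst
    return op, rat[dst], *srcs

for inst in [("add", "x1", "x2", "x3"),
             ("sub", "x4", "x1", "x5"),
             ("mul", "x1", "x6", "x7")]:
    print(*rename_inst(*inst))
# add p10 p2 p3
# sub p11 p10 p5
# mul p12 p6 p7    <- the second write to x1 gets its own register
```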

6. Superscalar Processors

6.1 Basic Concept

Fetch, decode, issue, and execute multiple instructions per clock cycle.

\[ \text{IPC}_{\text{ideal}} = \text{Issue Width} \]

Actual IPC is limited by data dependencies, branch mispredictions, and cache misses.

Typical configurations:

| Processor | Issue Width | Actual IPC |
|---|---|---|
| ARM Cortex-A78 | 4-wide | ~3 |
| Intel Golden Cove | 6-wide | ~4-5 |
| Apple Firestorm (M1) | 8-wide | ~5-6 |

6.2 SMT (Simultaneous Multithreading)

Execute multiple hardware threads simultaneously on the same physical core:

  • Intel calls it Hyper-Threading
  • Shares execution units and cache; each thread has independent architectural state (registers, PC)
  • Improves functional unit utilization (when one thread stalls, another can use idle units)
  • Typical benefit: ~20-30% throughput improvement per core

7. VLIW (Very Long Instruction Word)

7.1 Design Philosophy

Delegates the discovery of instruction-level parallelism to the compiler rather than hardware:

  • The compiler packs multiple parallelizable operations into a single "very long instruction"
  • Hardware requires no complex out-of-order scheduling logic -> simpler, lower power
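
An illustrative sketch of the compile-time side: greedily pack mutually independent operations into fixed-width bundles, padding empty slots with NOPs (`WIDTH`, `deps`, and the toy ISA are made up):

```python
WIDTH = 3   # slots per very long instruction word

def deps(a, b):
    """True if b depends on a: RAW (b reads a's dest), WAW (same dest),
    or WAR (a reads b's dest). Instructions are (op, dst, s1, s2)."""
    return (b[2] == a[1] or b[3] == a[1] or b[1] == a[1]
            or a[2] == b[1] or a[3] == b[1])

def bundle(prog):
    bundles, cur = [], []
    for inst in prog:
        if len(cur) == WIDTH or any(deps(p, inst) for p in cur):
            bundles.append(cur + [("nop",)] * (WIDTH - len(cur)))
            cur = []
        cur.append(inst)
    bundles.append(cur + [("nop",)] * (WIDTH - len(cur)))
    return bundles

prog = [("add", "x1", "x2", "x3"), ("mul", "x4", "x5", "x6"),
        ("sub", "x7", "x1", "x4"),     # depends on both ops above
        ("and", "x8", "x2", "x3")]
for b in bundle(prog):
    print(" | ".join(i[0] for i in b))
# add | mul | nop    <- only 2 independent ops found: 1 slot wasted
# sub | and | nop
```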

7.2 Advantages and Disadvantages

| Advantage | Disadvantage |
|---|---|
| Simple hardware | Poor binary compatibility |
| Low power | Compiler optimization is challenging |
| Deterministic latency | NOP padding wastes bandwidth |

Representatives: Intel Itanium (IA-64, commercial failure), TI DSP (successful in embedded domain)

8. Modern Microarchitecture Examples

Apple M Series (Firestorm + Icestorm)

  • Big core Firestorm: 8-wide issue, 630+ ROB entries, 192 KB L1I, 128 KB L1D
  • Small core Icestorm: 4-wide issue, low power
  • big.LITTLE heterogeneous: High-performance cores + high-efficiency cores

AMD Zen Series

  • Front-end: Branch prediction -> Fetch -> Decode (4-wide) -> micro-op cache
  • Back-end: 6 integer execution units + 4 floating-point/SIMD units
  • 6-wide issue, 256-entry ROB

9. Performance Analysis: CPI Stack

\[ \text{CPI} = \text{CPI}_{\text{base}} + \text{CPI}_{\text{stalls}} \]
\[ \text{CPI}_{\text{stalls}} = \text{CPI}_{\text{cache miss}} + \text{CPI}_{\text{branch mispredict}} + \text{CPI}_{\text{data hazard}} + \text{CPI}_{\text{structural}} \]

Performance Optimization Strategies

  • \(\text{CPI}_{\text{cache miss}}\): Optimize data locality, prefetching -> see Memory Hierarchy Design
  • \(\text{CPI}_{\text{branch mispredict}}\): Improve branch predictors, reduce branches (branchless programming)
  • \(\text{CPI}_{\text{data hazard}}\): Compiler instruction scheduling
  • Overall: Increase issue width and out-of-order window size -> higher IPC
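
A toy CPI-stack calculation with made-up event rates and penalties, just to show the accounting:

```python
cpi_base = 0.25                  # ideal CPI of a 4-wide machine
events = {                       # per-instruction rate * penalty (cycles)
    "cache miss":        0.02 * 20,
    "branch mispredict": 0.01 * 15,
    "data hazard":       0.05 * 1,
}
cpi = cpi_base + sum(events.values())
print(f"CPI = {cpi:.2f}  (IPC = {1 / cpi:.2f})")   # CPI = 0.85  (IPC = 1.18)
for name, cycles in events.items():
    print(f"  {name:>17}: {cycles:.2f} cycles/inst")
```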

