
Processor Microarchitecture

1. Overview

Microarchitecture is the concrete hardware implementation of an ISA. The same ISA can have multiple microarchitecture implementations that make different trade-offs among performance, power, and area.

Core objective: Maximize instruction throughput (IPC) under given power and area constraints.

2. Classic Five-Stage Pipeline

2.1 Pipeline Stages

Instruction execution is divided into five stages, each handled by independent hardware, allowing different stages of different instructions to overlap:

```mermaid
graph LR
    IF[IF<br/>Fetch] --> ID[ID<br/>Decode/Read Registers]
    ID --> EX[EX<br/>Execute/Compute Address]
    EX --> MEM[MEM<br/>Memory Access]
    MEM --> WB[WB<br/>Write Back]
```

| Stage | Full Name | Function |
|---|---|---|
| IF | Instruction Fetch | Fetch instruction from the I-Cache, PC += 4 |
| ID | Instruction Decode | Decode the instruction, read source registers |
| EX | Execute | ALU operation or address computation |
| MEM | Memory Access | Load/Store access to the D-Cache |
| WB | Write Back | Write the result to the destination register |

2.2 Pipeline Speedup

Ideally, a \(k\)-stage pipeline executing \(n\) instructions achieves a speedup of

\[ S = \frac{n \cdot k}{n + k - 1} \]

When \(n \gg k\), \(S \approx k\), i.e., the speedup approaches the number of pipeline stages.

Numerical Example

A 5-stage pipeline executing 100 instructions:

\[S = \frac{100 \times 5}{100 + 5 - 1} = \frac{500}{104} \approx 4.81\]

Close to the ideal speedup of 5.
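
The same formula in a few lines of Python (a quick sketch; `pipeline_speedup` is just an illustrative name):

```python
def pipeline_speedup(n: int, k: int) -> float:
    """Ideal k-stage pipeline speedup for n instructions:
    S = n*k / (n + k - 1)."""
    return n * k / (n + k - 1)

# Speedup approaches k = 5 as n grows:
for n in (10, 100, 1000, 10**6):
    print(f"n={n:>7}: S = {pipeline_speedup(n, 5):.3f}")
# n=     10: S = 3.571
# n=    100: S = 4.808
# n=   1000: S = 4.980
# n=1000000: S = 5.000
```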

In practice, hazards keep the pipeline from running at full capacity all the time.

3. Pipeline Hazards

3.1 Data Hazards

A subsequent instruction depends on the result of a prior instruction, but the result has not yet been produced.

Types (classified by data dependence):

| Type | Meaning | Example / Note |
|---|---|---|
| RAW (Read After Write) | True dependence | `add x1,x2,x3; sub x4,x1,x5` |
| WAR (Write After Read) | Anti-dependence | May occur in out-of-order execution |
| WAW (Write After Write) | Output dependence | May occur in out-of-order execution |

Solutions:

  • Forwarding/Bypassing: Route results from the EX or MEM stage directly to the instruction that needs them, without waiting for WB (see the sketch after this list)
  • Pipeline stalling: Insert bubbles until the needed data is ready
  • Compiler scheduling: Reorder instructions to avoid hazards
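
A minimal sketch of how hazard-detection logic might classify adjacent instruction pairs in a 5-stage pipeline, assuming EX->EX and MEM->EX forwarding paths (the tuple encoding and names here are illustrative):

```python
# Each instruction: (op, dest, src1, src2); None where a field is unused.
prog = [
    ("lw",  "x1", "x2", None),   # load into x1
    ("add", "x4", "x1", "x5"),   # uses x1 immediately -> load-use hazard
    ("sub", "x6", "x4", "x7"),   # RAW on x4, resolved by EX->EX forwarding
]

def classify(prev, cur):
    """RAW check between adjacent instructions: forwarding covers ALU
    results (EX->EX), but a load's data is only available after MEM,
    so a dependent next instruction still stalls one cycle."""
    if prev[1] not in (cur[2], cur[3]):
        return "no hazard"
    if prev[0] == "lw":
        return "load-use: forward MEM->EX + 1 stall cycle"
    return "RAW: forward EX->EX, no stall"

for prev, cur in zip(prog, prog[1:]):
    print(f"{prev[0]} -> {cur[0]}: {classify(prev, cur)}")
# lw -> add: load-use: forward MEM->EX + 1 stall cycle
# add -> sub: RAW: forward EX->EX, no stall
```

The load-use case is why even a fully bypassed pipeline keeps a one-cycle stall after loads.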

3.2 Control Hazards

When the branch direction is undetermined, the pipeline does not know which path to fetch from next.

Impact: Branch instructions may cause 1-3 cycles of pipeline bubbles.

Solutions:

  • Branch Delay Slot: The instruction after a branch always executes (MIPS approach, now obsolete)
  • Branch Prediction: Predict the branch direction and speculatively execute

3.3 Structural Hazards

Multiple instructions simultaneously need the same hardware resource.

Solutions:

  • Duplicate hardware resources (e.g., separate I-Cache and D-Cache)
  • Pipeline stalling

4. Branch Prediction

Branch prediction is a critical technique in modern processors; prediction accuracy directly affects IPC.

4.1 Static Prediction

  • Always Not Taken: Assume no jump
  • Always Taken: Assume jump
  • BTFN (Backward Taken, Forward Not Taken): Predict backward branches (typically loop back-edges) as taken and forward branches as not taken

4.2 Dynamic Prediction

Branch History Table (BHT)

Indexed by low-order PC bits, recording past branch behavior:

  • 1-bit predictor: Records whether the last branch was taken. Poor performance with nested loops (mispredicts twice at each loop boundary)
  • 2-bit saturating counter: Requires two consecutive mispredictions to change the prediction direction

```
 [11] ──Not Taken──► [10] ──Not Taken──► [01] ──Not Taken──► [00]
  ST  ◄────Taken──── WT   ◄────Taken──── WN  ◄────Taken───── SN

 (Taken in ST stays in ST; Not Taken in SN stays in SN)
```

States: ST (Strongly Taken), WT (Weakly Taken), WN (Weakly Not Taken), SN (Strongly Not Taken)
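
A minimal Python sketch of the 2-bit scheme (class and variable names are illustrative):

```python
class TwoBitCounter:
    """2-bit saturating counter: 0=SN, 1=WN, 2=WT, 3=ST.
    Predict taken when the counter is >= 2; two consecutive
    mispredictions are needed to flip the predicted direction."""
    def __init__(self, state: int = 2):
        self.state = state

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

# A loop branch taken 9 times then not taken (loop exit), repeated:
ctr, hits = TwoBitCounter(), 0
history = ([True] * 9 + [False]) * 3
for taken in history:
    hits += ctr.predict() == taken
    ctr.update(taken)
print(f"accuracy: {hits}/{len(history)}")   # 27/30: only exits mispredict
```

On the same trace, a 1-bit predictor would mispredict twice around each loop boundary (at the exit and again at the following re-entry).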

Branch Target Buffer (BTB)

  • Caches the mapping from branch instruction addresses to jump target addresses
  • Provides the target address at the IF stage, reducing penalty
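
A toy direct-mapped BTB, assuming 4-byte instructions (the entry count and names are made up):

```python
BTB_ENTRIES = 16
btb = [None] * BTB_ENTRIES   # each entry: (branch_pc, target_pc)

def btb_lookup(pc: int):
    """At IF: on a hit, fetch can redirect to the cached target
    immediately instead of waiting for decode/execute."""
    entry = btb[(pc >> 2) % BTB_ENTRIES]
    return entry[1] if entry and entry[0] == pc else None

def btb_update(pc: int, target: int) -> None:
    btb[(pc >> 2) % BTB_ENTRIES] = (pc, target)

btb_update(0x1000, 0x2000)
print(hex(btb_lookup(0x1000) or 0))   # 0x2000 (hit)
print(btb_lookup(0x1004))             # None (miss)
```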

Tournament Predictor

Combines multiple predictors with a meta-predictor that selects the currently more accurate one:

  • Local predictor: Based on the history pattern of a single branch
  • Global predictor: Based on the global history of all branches
  • Meta selector: Tracks which predictor is more accurate for the current branch
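
A compact sketch of the idea, combining a per-branch (local) table, a gshare-style global table, and a chooser; table sizes and names are illustrative, not any shipping design:

```python
SIZE = 1024
local_t  = [2] * SIZE    # 2-bit counters, 2 = weakly taken
global_t = [2] * SIZE
chooser  = [2] * SIZE    # >= 2 -> trust global, else local
ghist = 0                # global branch history register

def sat(c: int, up: bool) -> int:    # 2-bit saturating update
    return min(3, c + 1) if up else max(0, c - 1)

def predict_and_update(pc: int, taken: bool) -> bool:
    global ghist
    li = pc % SIZE
    gi = (pc ^ ghist) % SIZE         # gshare: PC XOR global history
    lp, gp = local_t[li] >= 2, global_t[gi] >= 2
    pred = gp if chooser[li] >= 2 else lp
    if lp != gp:                     # train chooser only on disagreement
        chooser[li] = sat(chooser[li], gp == taken)
    local_t[li]  = sat(local_t[li], taken)
    global_t[gi] = sat(global_t[gi], taken)
    ghist = ((ghist << 1) | taken) % SIZE
    return pred

print(predict_and_update(0x40, True))   # first prediction for this branch
```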

Modern processors (e.g., Intel Alder Lake) achieve branch prediction accuracy exceeding 97%.

4.3 Indirect Jump Prediction

  • Function pointers, virtual function calls, switch-case -> the target address is not fixed
  • Uses an Indirect Branch Target Array to record historically observed targets

5. Out-of-Order Execution

5.1 Motivation

In in-order execution, a long-latency instruction (such as a cache miss) blocks all subsequent unrelated instructions. Out-of-order execution allows ready instructions to execute first.

5.2 Tomasulo's Algorithm

Pioneered by the IBM System/360 Model 91 (1967). Core idea: register renaming plus distributed scheduling via reservation stations.

Key components:

| Component | Function |
|---|---|
| Reservation Station (RS) | Buffers instructions waiting for operands; issues when operands are ready |
| Common Data Bus (CDB) | Broadcasts computation results; all reservation stations waiting for a result receive it simultaneously |
| Reorder Buffer (ROB) | Ensures instructions commit in program order (precise exceptions); a later extension, not part of the original 360/91 design |

Execution flow:

  1. Issue: Instruction enters reservation station; register renaming eliminates WAR/WAW
  2. Execute: Execute in functional unit once operands are ready
  3. Write Result: Result is broadcast via CDB
  4. Commit: Instructions at the head of ROB commit in order to the register file

```mermaid
graph TB
    A[Instruction Queue] --> B[Issue]
    B --> C[Reservation Station RS]
    C --> D{Operands ready?}
    D -->|Yes| E[Functional Unit Execution]
    D -->|No| F[Wait for CDB broadcast]
    F --> D
    E --> G[CDB Broadcast Result]
    G --> C
    G --> H[Reorder Buffer ROB]
    H --> I[In-order Commit]
```
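
A deliberately tiny dataflow sketch of issue / wakeup / broadcast (illustrative names; single-cycle units, unlimited CDB bandwidth, and no ROB or commit stage):

```python
regfile = {f"x{i}": i for i in range(8)}    # x0..x7 hold values 0..7
rename  = {}                                 # arch reg -> producing RS tag
rs = []                                      # reservation stations

prog = [
    ("mul", "x1", "x2", "x3"),
    ("add", "x4", "x1", "x5"),   # waits on the mul's tag
    ("sub", "x6", "x7", "x2"),   # independent: completes before the add
]
for i, (op, dst, s1, s2) in enumerate(prog):     # issue
    e = {"tag": f"RS{i}", "op": op, "dest": dst}
    for n, src in (("1", s1), ("2", s2)):
        if src in rename:                        # not ready: record the tag
            e["v" + n], e["q" + n] = None, rename[src]
        else:                                    # ready: copy the value now
            e["v" + n], e["q" + n] = regfile[src], None
    rename[dst] = e["tag"]                       # later readers wait on this
    rs.append(e)

ops = {"add": lambda a, b: a + b, "sub": lambda a, b: a - b,
       "mul": lambda a, b: a * b}
cycle = 0
while rs:                                        # execute / write result
    cycle += 1
    for e in [e for e in rs if e["q1"] is None and e["q2"] is None]:
        rs.remove(e)
        res = ops[e["op"]](e["v1"], e["v2"])
        print(f'cycle {cycle}: {e["tag"]} {e["op"]} -> {e["dest"]} = {res}')
        for w in rs:                             # CDB broadcast wakes waiters
            for n in ("1", "2"):
                if w["q" + n] == e["tag"]:
                    w["v" + n], w["q" + n] = res, None
        if rename.get(e["dest"]) == e["tag"]:    # newest writer updates reg
            regfile[e["dest"]] = res
            del rename[e["dest"]]
# cycle 1: RS0 mul -> x1 = 6
# cycle 1: RS2 sub -> x6 = 5     <- program-order third, finishes first
# cycle 2: RS1 add -> x4 = 11
```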

5.3 Register Renaming

Physical registers >> architectural registers; renaming eliminates false dependencies:

```
Original code (WAW and WAR on x1):   After renaming:
add x1, x2, x3                       add p10, p2, p3
sub x4, x1, x5                       sub p11, p10, p5
mul x1, x6, x7                       mul p12, p6, p7   <- x1 renamed to p12, independent of p10
```
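
The same example, mechanized as a sketch of the rename stage (free-list recycling is omitted; names are illustrative):

```python
from itertools import count

phys = count(10)                              # fresh physical regs p10, p11, ...
rat = {f"x{i}": f"p{i}" for i in range(8)}    # register alias table

def rename_inst(op, dst, s1, s2):
    srcs = rat[s1], rat[s2]           # read sources BEFORE remapping dst
    rat[dst] = f"p{next(phys)}"       # a new name removes WAR/WAW on dst
    return op, rat[dst], *srcs

for inst in [("add", "x1", "x2", "x3"),
             ("sub", "x4", "x1", "x5"),
             ("mul", "x1", "x6", "x7")]:
    print(*rename_inst(*inst))
# add p10 p2 p3
# sub p11 p10 p5
# mul p12 p6 p7    <- the second write to x1 gets its own register
```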

6. Superscalar Processors

6.1 Basic Concept

Fetch, decode, issue, and execute multiple instructions per clock cycle.

\[ \text{IPC}_{\text{ideal}} = \text{Issue Width} \]

Actual IPC is limited by data dependencies, branch mispredictions, and cache misses.

Typical configurations:

| Processor | Issue Width | Actual IPC |
|---|---|---|
| ARM Cortex-A78 | 4-wide | ~3 |
| Intel Golden Cove | 6-wide | ~4-5 |
| Apple Firestorm (M1) | 8-wide | ~5-6 |

6.2 SMT (Simultaneous Multithreading)

Execute multiple hardware threads simultaneously on the same physical core:

  • Intel calls it Hyper-Threading
  • Shares execution units and cache; each thread has independent architectural state (registers, PC)
  • Improves functional unit utilization (when one thread stalls, another can use idle units)
  • Typical benefit: ~20-30% throughput improvement per core

7. VLIW (Very Long Instruction Word)

7.1 Design Philosophy

Delegates the discovery of instruction-level parallelism to the compiler rather than hardware:

  • The compiler packs multiple parallelizable operations into a single "very long instruction"
  • Hardware requires no complex out-of-order scheduling logic -> simpler, lower power
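
An illustrative sketch of the compile-time side: greedily pack mutually independent operations into fixed-width bundles, padding empty slots with NOPs (`WIDTH`, `deps`, and the toy ISA are made up):

```python
WIDTH = 3   # slots per very long instruction word

def deps(a, b):
    """True if b depends on a: RAW (b reads a's dest), WAW (same dest),
    or WAR (a reads b's dest). Instructions are (op, dst, s1, s2)."""
    return (b[2] == a[1] or b[3] == a[1] or b[1] == a[1]
            or a[2] == b[1] or a[3] == b[1])

def bundle(prog):
    bundles, cur = [], []
    for inst in prog:
        if len(cur) == WIDTH or any(deps(p, inst) for p in cur):
            bundles.append(cur + [("nop",)] * (WIDTH - len(cur)))
            cur = []
        cur.append(inst)
    bundles.append(cur + [("nop",)] * (WIDTH - len(cur)))
    return bundles

prog = [("add", "x1", "x2", "x3"), ("mul", "x4", "x5", "x6"),
        ("sub", "x7", "x1", "x4"),     # depends on both ops above
        ("and", "x8", "x2", "x3")]
for b in bundle(prog):
    print(" | ".join(i[0] for i in b))
# add | mul | nop    <- only 2 independent ops found: 1 slot wasted
# sub | and | nop
```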

7.2 Advantages and Disadvantages

| Advantage | Disadvantage |
|---|---|
| Simple hardware | Poor binary compatibility |
| Low power | Compiler optimization is challenging |
| Deterministic latency | NOP padding wastes bandwidth |

Representatives: Intel Itanium (IA-64, commercial failure), TI DSP (successful in embedded domain)

8. Modern Microarchitecture Examples

Apple M Series (Firestorm + Icestorm)

  • Big core Firestorm: 8-wide issue, 630+ ROB entries, 192 KB L1I, 128 KB L1D
  • Small core Icestorm: 4-wide issue, low power
  • big.LITTLE heterogeneous: High-performance cores + high-efficiency cores

AMD Zen Series

  • Front-end: Branch prediction -> Fetch -> Decode (4-wide) -> micro-op cache
  • Back-end: 6 integer execution units + 4 floating-point/SIMD units
  • 6-wide issue, 256-entry ROB

9. Performance Analysis: CPI Stack

\[ \text{CPI} = \text{CPI}_{\text{base}} + \text{CPI}_{\text{stalls}} \]
\[ \text{CPI}_{\text{stalls}} = \text{CPI}_{\text{cache miss}} + \text{CPI}_{\text{branch mispredict}} + \text{CPI}_{\text{data hazard}} + \text{CPI}_{\text{structural}} \]

Performance Optimization Strategies

  • \(\text{CPI}_{\text{cache miss}}\): Optimize data locality, prefetching -> see Memory Hierarchy Design
  • \(\text{CPI}_{\text{branch mispredict}}\): Improve branch predictors, reduce branches (branchless programming)
  • \(\text{CPI}_{\text{data hazard}}\): Compiler instruction scheduling
  • Overall: Increase issue width and out-of-order window size -> higher IPC
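
A toy CPI-stack calculation with made-up event rates and penalties, just to show the accounting:

```python
cpi_base = 0.25                  # ideal CPI of a 4-wide machine
events = {                       # per-instruction rate * penalty (cycles)
    "cache miss":        0.02 * 20,
    "branch mispredict": 0.01 * 15,
    "data hazard":       0.05 * 1,
}
cpi = cpi_base + sum(events.values())
print(f"CPI = {cpi:.2f}  (IPC = {1 / cpi:.2f})")   # CPI = 0.85  (IPC = 1.18)
for name, cycles in events.items():
    print(f"  {name:>17}: {cycles:.2f} cycles/inst")
```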

