
Parallel Architecture

1. Flynn's Taxonomy

Michael Flynn (1966) classified computer architectures into four categories based on the parallelism of instruction streams and data streams:

graph TB
    subgraph "Flynn's Taxonomy"
        direction TB
        A["<b>SISD</b><br/>Single Instruction<br/>Single Data<br/>Classic single-core CPU"]
        B["<b>SIMD</b><br/>Single Instruction<br/>Multiple Data<br/>Vector processing / GPU cores"]
        C["<b>MISD</b><br/>Multiple Instruction<br/>Single Data<br/>Fault-tolerant pipelines (rare)"]
        D["<b>MIMD</b><br/>Multiple Instruction<br/>Multiple Data<br/>Multi-core / Distributed"]
    end

    style A fill:#e1f5fe
    style B fill:#e8f5e9
    style C fill:#fff3e0
    style D fill:#f3e5f5
| Category | Instruction Streams | Data Streams | Representative |
|---|---|---|---|
| SISD | Single | Single | Classic single-core processor |
| SIMD | Single | Multiple | SSE/AVX, NEON, GPU cores |
| MISD | Multiple | Single | Aerospace fault-tolerant systems (extremely rare) |
| MIMD | Multiple | Multiple | Multi-core CPUs, clusters, supercomputers |

2. SIMD Parallelism

2.1 Vector Extensions (CPU SIMD)

A single instruction operates on multiple data elements simultaneously:

| Extension | Width | Architecture |
|---|---|---|
| SSE | 128-bit (4x float) | x86 |
| AVX2 | 256-bit (8x float) | x86 |
| AVX-512 | 512-bit (16x float) | x86 (server) |
| NEON | 128-bit | ARM |
| SVE/SVE2 | 128-2048 bit (scalable) | ARMv9 |
| RVV | Scalable (VLEN) | RISC-V |

Speedup upper bound (ignoring scalar overhead):

\[ S_{\text{SIMD}} = \frac{\text{Vector Width (bits)}}{\text{Element Width (bits)}} \]

For example, AVX-512 processing 32-bit floats: \(S = 512 / 32 = 16\times\).

2.2 SIMD Programming Models

  • Compiler auto-vectorization: Compiler detects vectorizable loops
  • Intrinsics: Manual invocation of SIMD instructions (e.g., _mm256_add_ps; see the sketch below)
  • Library abstractions: High-level library wrappers (e.g., Eigen, xsimd)
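
As a concrete illustration of the intrinsics approach, here is a minimal sketch of an element-wise array addition with AVX2 (`_mm256_add_ps`); it assumes an x86-64 CPU with AVX2 and uses unaligned loads so the arrays need no special alignment.

```cpp
#include <immintrin.h>
#include <cstddef>

// c[i] = a[i] + b[i] with 256-bit AVX2 vectors: one instruction processes 8 floats.
// Compile with -mavx2 (GCC/Clang); the scalar loop handles any remainder.
void add_avx2(const float* a, const float* b, float* c, std::size_t n) {
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);              // load 8 floats from a
        __m256 vb = _mm256_loadu_ps(b + i);              // load 8 floats from b
        _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));  // 8 additions per instruction
    }
    for (; i < n; ++i)
        c[i] = a[i] + b[i];                              // scalar tail
}
```

A good auto-vectorizer will usually emit equivalent code for the plain scalar loop, which is why intrinsics tend to be reserved for kernels the compiler cannot vectorize on its own.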

3. Multi-Core and Shared Memory

3.1 Memory Architecture

graph TB
    subgraph "UMA (Uniform Memory Access)"
        C1[Core 0] --> BUS1[Bus/Crossbar]
        C2[Core 1] --> BUS1
        C3[Core 2] --> BUS1
        BUS1 --> MEM1[Shared Memory<br/>Uniform latency]
    end

    subgraph "NUMA (Non-Uniform Memory Access)"
        C4[Core 0] --> L1[Local Memory<br/>Fast]
        C5[Core 1] --> L1
        C6[Core 2] --> L2[Local Memory<br/>Fast]
        C7[Core 3] --> L2
        L1 <-->|Interconnect<br/>Slow| L2
    end
| Architecture | Characteristics | Applicable Scale |
|---|---|---|
| UMA | Uniform memory access latency for all cores | Small scale (<=8 cores) |
| NUMA | Local memory is fast, remote memory is slow | Multi-socket servers (tens to hundreds of cores) |

NUMA-aware programming is critical: data should be allocated on the NUMA node local to the core that uses it.
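
A common way to get this placement on Linux is the first-touch policy: a page is mapped onto the NUMA node of the CPU that first writes it. The sketch below (assumptions: Linux default policy, OpenMP, helper names are illustrative) initializes the array from the same threads, with the same static schedule, that later compute on it.

```cpp
#include <cstdlib>
#include <cstddef>

// First-touch NUMA placement: allocate without initializing, then touch the pages
// from the worker threads so each page lands on the node of the thread that owns
// that index range under schedule(static).
double* numa_friendly_alloc(std::size_t n) {
    double* x = static_cast<double*>(std::malloc(n * sizeof(double)));
    #pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < n; ++i)
        x[i] = 0.0;                     // first touch happens on the owning thread
    return x;
}

void scale(double* x, std::size_t n, double alpha) {
    #pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < n; ++i)
        x[i] *= alpha;                  // same schedule -> mostly node-local accesses
}
```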

3.2 Cache Coherence

When multiple cores share memory, the same data in different private caches may become inconsistent.

Coherence requirements:

  1. Write Propagation: A core's write eventually becomes visible to all cores
  2. Write Serialization: All cores observe writes to the same address in the same order

3.3 MESI Protocol

The classic snooping-based coherence protocol; each cache line is in one of four states:

| State | Meaning |
|---|---|
| M (Modified) | Modified; only this cache holds a valid copy; main memory is stale |
| E (Exclusive) | Unmodified; only this cache holds a valid copy, consistent with main memory |
| S (Shared) | Unmodified; potentially multiple caches hold copies |
| I (Invalid) | Invalid; cannot be used |

State transition examples:

stateDiagram-v2
    [*] --> I
    I --> E: Local read (no other copy)
    I --> S: Local read (other copies exist)
    E --> M: Local write
    S --> M: Local write (other copies -> I)
    M --> S: Remote read (write back to memory)
    M --> I: Remote write
    E --> I: Remote write
    S --> I: Remote write
    E --> S: Remote read
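
To make these transitions concrete, here is a toy next-state function in C++ for a single cache line (a sketch of the table above, not a full protocol engine with bus transactions and write-backs):

```cpp
enum class Mesi { M, E, S, I };
enum class Event { LocalRead, LocalWrite, RemoteRead, RemoteWrite };

// Next MESI state of one cache line; `otherCopiesExist` only matters for a
// local read miss (Invalid -> Exclusive or Shared).
Mesi next(Mesi s, Event e, bool otherCopiesExist = false) {
    switch (e) {
        case Event::LocalRead:
            return (s == Mesi::I) ? (otherCopiesExist ? Mesi::S : Mesi::E) : s;
        case Event::LocalWrite:
            return Mesi::M;             // I/E/S -> M (from S, other copies are invalidated first)
        case Event::RemoteRead:
            return (s == Mesi::M || s == Mesi::E) ? Mesi::S : s;  // M also writes back
        case Event::RemoteWrite:
            return Mesi::I;             // any local copy is invalidated
    }
    return s;
}
```

For example, the sequence local read (no other copy) -> local write -> remote read walks the line through I -> E -> M -> S, exactly as in the diagram.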

MOESI / MESIF extensions:

  • MOESI (AMD): Adds Owned state, avoiding the need to write back dirty data to main memory
  • MESIF (Intel): Adds Forward state, designating which cache responds to shared requests

3.4 Directory Protocol

Snooping protocols incur high broadcast overhead at large scale. Directory protocols instead maintain a directory that records, for each cache line, which caches hold a copy:

  • Suitable for NUMA architectures and large-scale multi-core (>16 cores)
  • Point-to-point communication, no broadcasting needed
  • Directory entry: state + sharer bit vector
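
A directory entry can be pictured as a small per-block record; a minimal sketch (core count and state names are illustrative):

```cpp
#include <bitset>
#include <cstdint>
#include <cstddef>

constexpr std::size_t kNumCores = 64;

// One entry per memory block: coherence state plus a sharer bit vector.
// On a write, the home node sends invalidations only to the cores whose
// bits are set, instead of broadcasting to every core.
struct DirectoryEntry {
    enum class State : std::uint8_t { Uncached, Shared, Exclusive };
    State state = State::Uncached;
    std::bitset<kNumCores> sharers;   // which cores currently hold a copy
};
```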

3.5 False Sharing

False sharing occurs when different cores write to different variables that happen to reside in the same cache line; the line ping-pongs between caches through frequent coherence invalidations.

\[ \text{Performance degradation} \propto \text{coherence message frequency} \]

Solutions:

  • Cache line alignment (padding)
  • Per-thread independent data structures (thread-local)
  • alignas(64) to ensure independent variables reside in different cache lines
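
A minimal sketch of the padding fix (assuming a 64-byte cache line, typical on x86): each per-thread counter gets its own line via alignas(64), so the two threads below stop invalidating each other's cache line on every increment.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

// alignas(64) makes each counter start on its own 64-byte cache line (and pads
// sizeof to a multiple of 64), so array elements never share a line.
struct alignas(64) PaddedCounter {
    std::atomic<std::uint64_t> value{0};
};

PaddedCounter counters[2];   // without the alignment, both would likely share one line

void worker(int id, std::uint64_t iters) {
    for (std::uint64_t i = 0; i < iters; ++i)
        counters[id].value.fetch_add(1, std::memory_order_relaxed);
}

int main() {
    std::thread t0(worker, 0, 10'000'000ULL);
    std::thread t1(worker, 1, 10'000'000ULL);
    t0.join();
    t1.join();
}
```

C++17's std::hardware_destructive_interference_size expresses the same intent without hard-coding 64.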

4. GPU Computing Model

4.1 GPU vs CPU Design Philosophy

graph LR
    subgraph "CPU: Latency Optimized"
        A1[Large Cache] --- A2[Complex Control Logic]
        A2 --- A3[Few Powerful Cores]
        A3 --- A4[Out-of-Order Execution]
    end

    subgraph "GPU: Throughput Optimized"
        B1[Small Cache] --- B2[Simple Control Logic]
        B2 --- B3[Massive Weak Cores]
        B3 --- B4[Thread-Level Parallelism Hides Latency]
    end
| Feature | CPU | GPU |
|---|---|---|
| Core count | 4-128 (powerful cores) | Thousands to tens of thousands (CUDA cores) |
| Control logic | Complex (out-of-order, branch prediction) | Simple (in-order execution) |
| Cache | Large (L1+L2+L3 up to 100 MB) | Small (relies mainly on registers and shared memory) |
| Memory bandwidth | ~50-100 GB/s | ~1-3 TB/s (HBM) |
| Suitable workloads | Serial, branch-intensive | Massively data-parallel |

4.2 SIMT (Single Instruction Multiple Thread)

NVIDIA GPU execution model:

  • Thread: Smallest execution unit with independent registers and PC
  • Warp: A group of 32 threads that execute the same instruction in lockstep
  • Thread Block: Multiple warps sharing Shared Memory, executing on the same SM
  • Grid: Multiple Blocks distributed across SMs
Grid
├── Block (0,0)
│   ├── Warp 0: Thread 0-31
│   ├── Warp 1: Thread 32-63
│   └── ...
├── Block (0,1)
│   └── ...
└── ...
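
A minimal CUDA sketch of this hierarchy: each thread computes one array element, with its global index derived from its block and thread coordinates (the launch configuration below is illustrative).

```cuda
#include <cuda_runtime.h>

// One thread per element: blockIdx/blockDim/threadIdx map the Grid/Block/Thread
// hierarchy onto a flat index. Threads 0-31 of a block form warp 0, 32-63 warp 1, ...
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];   // guard: the last block may be partially full
}

void launch(const float* d_a, const float* d_b, float* d_c, int n) {
    int threadsPerBlock = 256;                        // 8 warps per block
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c, n);
}
```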

4.3 SM (Streaming Multiprocessor) Structure

Each SM contains:

| Component | Count (typical) | Function |
|---|---|---|
| CUDA cores (FP32) | 128 | Floating-point/integer arithmetic |
| Tensor cores | 4 | Matrix multiply-accumulate (AI acceleration) |
| Load/Store units | 32 | Memory access |
| SFU (Special Function Units) | 16 | sin, cos, exp, etc. |
| Warp schedulers | 4 | Select ready warps for issue each cycle |
| Register file | 256 KB | Thread-private registers |
| Shared Memory | 64-228 KB | Shared among threads within a Block |

4.4 Warp Divergence

When threads within a warp take different branch paths:

  • Must serially execute each path (masking off non-participating threads)
  • Worst case: Both paths execute, halving throughput
\[ \text{Efficiency} = \frac{\text{Active threads}}{32} \]

Optimization: Minimize warp divergence by grouping threads taking the same path into the same warp.
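
The pair of kernels below sketches the difference: branching on the element index makes every warp diverge, while branching on the warp index keeps all 32 lanes of a warp on the same path (kernels are illustrative).

```cuda
// Divergent: even and odd lanes of every warp take different paths, so the
// hardware runs both branches with half the lanes masked off each time.
__global__ void divergent(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0) x[i] = x[i] * 2.0f;
    else            x[i] = x[i] + 1.0f;
}

// Warp-uniform: all lanes of a warp share the same warpId, so each warp
// executes only one of the two branches.
__global__ void warpUniform(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int warpId = i / 32;
    if (warpId % 2 == 0) x[i] = x[i] * 2.0f;
    else                 x[i] = x[i] + 1.0f;
}
```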

4.5 Memory Hierarchy

| Memory Type | Scope | Latency | Bandwidth |
|---|---|---|---|
| Registers | Thread-private | 1 cycle | Highest |
| Shared Memory | Block-shared | ~5 cycles | Very high |
| L1 Cache | SM-level | ~30 cycles | High |
| L2 Cache | Global | ~200 cycles | Medium |
| Global Memory (HBM) | Global | ~300-500 cycles | ~3 TB/s |

Key optimization: Use Shared Memory as a manually managed cache to reduce global memory accesses.
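
A sketch of this idea for a 3-point stencil: each block stages a tile of the input (plus a one-element halo) in Shared Memory, so every input element is read from global memory once per block instead of up to three times. Tile size and boundary handling are illustrative, and blockDim.x is assumed to equal TILE.

```cuda
#define TILE 256

// out[i] = (in[i-1] + in[i] + in[i+1]) / 3, with the block's working set
// staged in shared memory (a manually managed cache).
__global__ void stencil3(const float* in, float* out, int n) {
    __shared__ float tile[TILE + 2];                 // TILE elements + 1-element halos

    int g = blockIdx.x * blockDim.x + threadIdx.x;   // global index
    int l = threadIdx.x + 1;                         // local index, shifted past left halo

    if (g < n) tile[l] = in[g];
    if (threadIdx.x == 0)              tile[0]        = (g > 0)     ? in[g - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1) tile[TILE + 1] = (g + 1 < n) ? in[g + 1] : 0.0f;
    __syncthreads();                                 // tile is now complete for the block

    if (g > 0 && g + 1 < n)
        out[g] = (tile[l - 1] + tile[l] + tile[l + 1]) / 3.0f;
}
```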

5. Interconnect Topologies

5.1 On-Chip Interconnects

| Topology | Structure | Applicable Scale |
|---|---|---|
| Bus | Shared medium | <=4 cores |
| Crossbar | Fully connected | <=16 cores (area \(O(N^2)\)) |
| Ring | Single/double ring | Intel consumer multi-core |
| Mesh | 2D grid | Many-core (Intel Xeon, GPU) |

5.2 Chip-to-Chip Interconnects

| Interconnect | Characteristics | Usage |
|---|---|---|
| PCIe | General-purpose, moderate bandwidth | CPU-GPU, CPU-SSD |
| NVLink | Ultra-high bandwidth (900 GB/s) | GPU-GPU (NVIDIA) |
| CXL | Cache-coherent extension | CPU-accelerator, memory pooling |
| Infinity Fabric | AMD inter-die interconnect | Chiplet connections |
| UCIe | Open chiplet interconnect standard | Heterogeneous chiplets |

6. Parallel Programming Models

| Model | Abstraction Level | Target Hardware |
|---|---|---|
| Pthreads / std::thread | Low-level threading | Multi-core CPU |
| OpenMP | Compiler directives (#pragma) | Multi-core CPU |
| CUDA / HIP | GPU parallel programming | NVIDIA / AMD GPU |
| SYCL / oneAPI | Heterogeneous parallelism (cross-device) | CPU + GPU + FPGA |
| MPI | Message passing (distributed memory) | Clusters, supercomputers |
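
As a small illustration of the directive-based end of this spectrum, an OpenMP dot product; the same loop could be written with std::thread (manual partitioning), CUDA (one thread per element), or MPI (explicit messages plus a reduction).

```cpp
#include <cstddef>

// Directive-based parallelism: the compiler splits the loop across cores and
// reduction(+:sum) combines the per-thread partial sums. Compile with -fopenmp.
double dot(const double* a, const double* b, std::size_t n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum) schedule(static)
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```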

7. Performance Metrics and Analysis

7.1 Speedup and Efficiency

\[ S(p) = \frac{T(1)}{T(p)}, \quad E(p) = \frac{S(p)}{p} \]
  • Ideal linear speedup: \(S(p) = p\), \(E(p) = 1\)
  • In practice limited by the serial fraction (Amdahl's law), communication overhead, and load imbalance (worked example below)
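
Worked example with illustrative numbers: by Amdahl's law, a program with serial fraction \(f = 0.05\) on \(p = 16\) cores achieves

\[ S(16) = \frac{1}{f + \frac{1 - f}{p}} = \frac{1}{0.05 + 0.95/16} \approx 9.1, \qquad E(16) = \frac{9.1}{16} \approx 0.57 \]

so even a 5% serial fraction costs over 40% of the ideal efficiency.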

7.2 Roofline Model

Measures whether a program is compute-bound or bandwidth-bound:

\[ \text{Performance} = \min(\text{Peak FLOPS}, \text{Bandwidth} \times \text{Operational Intensity}) \]

where Operational Intensity \(\text{OI} = \text{FLOPs} / \text{Bytes transferred}\).

  • \(\text{OI} < \text{Ridge Point}\): Bandwidth-bound (memory-bound)
  • \(\text{OI} > \text{Ridge Point}\): Compute-bound
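
Worked example with illustrative hardware numbers: the update \(y_i \leftarrow y_i + \alpha x_i\) on doubles performs 2 FLOPs per 24 bytes moved (load \(x_i\), load \(y_i\), store \(y_i\)), so

\[ \text{OI} = \frac{2\ \text{FLOPs}}{24\ \text{Bytes}} \approx 0.083\ \text{FLOP/Byte} \]

On a machine with 1 TFLOP/s peak and 100 GB/s bandwidth the ridge point is \(10\ \text{FLOP/Byte}\), so this kernel is firmly memory-bound, capped at roughly \(100\ \text{GB/s} \times 0.083 \approx 8.3\ \text{GFLOP/s}\).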

Optimization Strategies

  • Memory-bound: Improve data reuse, prefetching, compression, use higher-bandwidth memory
  • Compute-bound: Leverage SIMD, Tensor cores, algorithmic optimization to reduce computation

8. Heterogeneous Computing and Domain-Specific Architectures

8.1 Heterogeneous SoC

Modern SoCs integrate multiple types of compute units:

| Unit | Strengths |
|---|---|
| CPU big cores | Serial, control-intensive tasks |
| CPU little cores | Low-power background tasks |
| GPU | Graphics, massively data-parallel workloads |
| NPU/TPU | Neural network inference |
| DSP | Signal processing |
| ISP | Image processing |

8.2 Domain-Specific Accelerators

| Accelerator | Application | Key Features |
|---|---|---|
| Google TPU | Machine learning | Systolic array, BF16/INT8 |
| NVIDIA Tensor Core | Matrix multiply-accumulate | FP16/BF16/INT8/FP8 |
| Apple Neural Engine | On-device AI inference | High energy efficiency |
| FPGA | Reconfigurable computing | Flexible, but high development effort |
