
Parallel Architecture

1. Flynn's Taxonomy

Michael Flynn (1966) classified computer architectures into four categories based on the parallelism of instruction streams and data streams:

graph TB
    subgraph "Flynn's Taxonomy"
        direction TB
        A["<b>SISD</b><br/>Single Instruction<br/>Single Data<br/>Classic single-core CPU"]
        B["<b>SIMD</b><br/>Single Instruction<br/>Multiple Data<br/>Vector processing / GPU cores"]
        C["<b>MISD</b><br/>Multiple Instruction<br/>Single Data<br/>Fault-tolerant pipelines (rare)"]
        D["<b>MIMD</b><br/>Multiple Instruction<br/>Multiple Data<br/>Multi-core / Distributed"]
    end

    style A fill:#e1f5fe
    style B fill:#e8f5e9
    style C fill:#fff3e0
    style D fill:#f3e5f5
| Category | Instruction Streams | Data Streams | Representative |
|---|---|---|---|
| SISD | Single | Single | Classic single-core processor |
| SIMD | Single | Multiple | SSE/AVX, NEON, GPU cores |
| MISD | Multiple | Single | Aerospace fault-tolerant systems (extremely rare) |
| MIMD | Multiple | Multiple | Multi-core CPUs, clusters, supercomputers |

2. SIMD Parallelism

2.1 Vector Extensions (CPU SIMD)

A single instruction operates on multiple data elements simultaneously:

| Extension | Width | Architecture |
|---|---|---|
| SSE | 128-bit (4x float) | x86 |
| AVX2 | 256-bit (8x float) | x86 |
| AVX-512 | 512-bit (16x float) | x86 (server) |
| NEON | 128-bit | ARM |
| SVE/SVE2 | 128-2048 bit (scalable) | ARMv9 |
| RVV | Scalable (VLEN) | RISC-V |

Speedup upper bound (ignoring scalar overhead):

\[ S_{\text{SIMD}} = \frac{\text{Vector Width (bits)}}{\text{Element Width (bits)}} \]

For example, AVX-512 processing 32-bit floats: \(S = 512 / 32 = 16\times\).

2.2 SIMD Programming Models

  • Compiler auto-vectorization: Compiler detects vectorizable loops
  • Intrinsics: Manual invocation of SIMD instructions (e.g., _mm256_add_ps; see the sketch below)
  • Library abstractions: High-level library wrappers (e.g., Eigen, xsimd)
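
As a concrete illustration of the intrinsics approach, here is a minimal sketch of an element-wise array addition with AVX2 (`_mm256_add_ps`); it assumes an x86-64 CPU with AVX2 and uses unaligned loads so the arrays need no special alignment.

```cpp
#include <immintrin.h>
#include <cstddef>

// c[i] = a[i] + b[i] with 256-bit AVX2 vectors: one instruction processes 8 floats.
// Compile with -mavx2 (GCC/Clang); the scalar loop handles any remainder.
void add_avx2(const float* a, const float* b, float* c, std::size_t n) {
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);              // load 8 floats from a
        __m256 vb = _mm256_loadu_ps(b + i);              // load 8 floats from b
        _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));  // 8 additions per instruction
    }
    for (; i < n; ++i)
        c[i] = a[i] + b[i];                              // scalar tail
}
```

A good auto-vectorizer will usually emit equivalent code for the plain scalar loop, which is why intrinsics tend to be reserved for kernels the compiler cannot vectorize on its own.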

3. Multi-Core and Shared Memory

3.1 Memory Architecture

graph TB
    subgraph "UMA (Uniform Memory Access)"
        C1[Core 0] --> BUS1[Bus/Crossbar]
        C2[Core 1] --> BUS1
        C3[Core 2] --> BUS1
        BUS1 --> MEM1[Shared Memory<br/>Uniform latency]
    end

    subgraph "NUMA (Non-Uniform Memory Access)"
        C4[Core 0] --> L1[Local Memory<br/>Fast]
        C5[Core 1] --> L1
        C6[Core 2] --> L2[Local Memory<br/>Fast]
        C7[Core 3] --> L2
        L1 <-->|Interconnect<br/>Slow| L2
    end
| Architecture | Characteristics | Applicable Scale |
|---|---|---|
| UMA | Uniform memory access latency for all cores | Small scale (<=8 cores) |
| NUMA | Local memory is fast, remote memory is slow | Multi-socket servers (tens to hundreds of cores) |

NUMA-aware programming is critical: data should be allocated on the NUMA node local to the core that uses it.
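
A common way to get this placement on Linux is the first-touch policy: a page is mapped onto the NUMA node of the CPU that first writes it. The sketch below (assumptions: Linux default policy, OpenMP, helper names are illustrative) initializes the array from the same threads, with the same static schedule, that later compute on it.

```cpp
#include <cstdlib>
#include <cstddef>

// First-touch NUMA placement: allocate without initializing, then touch the pages
// from the worker threads so each page lands on the node of the thread that owns
// that index range under schedule(static).
double* numa_friendly_alloc(std::size_t n) {
    double* x = static_cast<double*>(std::malloc(n * sizeof(double)));
    #pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < n; ++i)
        x[i] = 0.0;                     // first touch happens on the owning thread
    return x;
}

void scale(double* x, std::size_t n, double alpha) {
    #pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < n; ++i)
        x[i] *= alpha;                  // same schedule -> mostly node-local accesses
}
```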

3.2 Cache Coherence

When multiple cores share memory, the same data in different private caches may become inconsistent.

Coherence requirements:

  1. Write Propagation: A core's write eventually becomes visible to all cores
  2. Write Serialization: All cores observe writes to the same address in the same order

3.3 MESI Protocol

The classic snooping-based coherence protocol; each cache line is in one of four states:

| State | Meaning |
|---|---|
| M (Modified) | Modified; only this cache holds a valid copy; main memory is stale |
| E (Exclusive) | Unmodified; only this cache holds a valid copy, consistent with main memory |
| S (Shared) | Unmodified; potentially multiple caches hold copies |
| I (Invalid) | Invalid; cannot be used |

State transition examples:

stateDiagram-v2
    [*] --> I
    I --> E: Local read (no other copy)
    I --> S: Local read (other copies exist)
    E --> M: Local write
    S --> M: Local write (other copies -> I)
    M --> S: Remote read (write back to memory)
    M --> I: Remote write
    E --> I: Remote write
    S --> I: Remote write
    E --> S: Remote read
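
To make these transitions concrete, here is a toy next-state function in C++ for a single cache line (a sketch of the table above, not a full protocol engine with bus transactions and write-backs):

```cpp
enum class Mesi { M, E, S, I };
enum class Event { LocalRead, LocalWrite, RemoteRead, RemoteWrite };

// Next MESI state of one cache line; `otherCopiesExist` only matters for a
// local read miss (Invalid -> Exclusive or Shared).
Mesi next(Mesi s, Event e, bool otherCopiesExist = false) {
    switch (e) {
        case Event::LocalRead:
            return (s == Mesi::I) ? (otherCopiesExist ? Mesi::S : Mesi::E) : s;
        case Event::LocalWrite:
            return Mesi::M;             // I/E/S -> M (from S, other copies are invalidated first)
        case Event::RemoteRead:
            return (s == Mesi::M || s == Mesi::E) ? Mesi::S : s;  // M also writes back
        case Event::RemoteWrite:
            return Mesi::I;             // any local copy is invalidated
    }
    return s;
}
```

For example, the sequence local read (no other copy) -> local write -> remote read walks the line through I -> E -> M -> S, exactly as in the diagram.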

MOESI / MESIF extensions:

  • MOESI (AMD): Adds Owned state, avoiding the need to write back dirty data to main memory
  • MESIF (Intel): Adds Forward state, designating which cache responds to shared requests

3.4 Directory Protocol

Snooping protocols incur high broadcast overhead at large scale. Directory protocols instead maintain a directory that records, for each cache line, which caches hold a copy:

  • Suitable for NUMA architectures and large-scale multi-core (>16 cores)
  • Point-to-point communication, no broadcasting needed
  • Directory entry: state + sharer bit vector
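
A directory entry can be pictured as a small per-block record; a minimal sketch (core count and state names are illustrative):

```cpp
#include <bitset>
#include <cstdint>
#include <cstddef>

constexpr std::size_t kNumCores = 64;

// One entry per memory block: coherence state plus a sharer bit vector.
// On a write, the home node sends invalidations only to the cores whose
// bits are set, instead of broadcasting to every core.
struct DirectoryEntry {
    enum class State : std::uint8_t { Uncached, Shared, Exclusive };
    State state = State::Uncached;
    std::bitset<kNumCores> sharers;   // which cores currently hold a copy
};
```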

3.5 False Sharing

False sharing occurs when different cores write to different variables that happen to reside in the same cache line; the line ping-pongs between caches through frequent coherence invalidations.

\[ \text{Performance degradation} \propto \text{coherence message frequency} \]

Solutions:

  • Cache line alignment (padding)
  • Per-thread independent data structures (thread-local)
  • alignas(64) to ensure independent variables reside in different cache lines
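
A minimal sketch of the padding fix (assuming a 64-byte cache line, typical on x86): each per-thread counter gets its own line via alignas(64), so the two threads below stop invalidating each other's cache line on every increment.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

// alignas(64) makes each counter start on its own 64-byte cache line (and pads
// sizeof to a multiple of 64), so array elements never share a line.
struct alignas(64) PaddedCounter {
    std::atomic<std::uint64_t> value{0};
};

PaddedCounter counters[2];   // without the alignment, both would likely share one line

void worker(int id, std::uint64_t iters) {
    for (std::uint64_t i = 0; i < iters; ++i)
        counters[id].value.fetch_add(1, std::memory_order_relaxed);
}

int main() {
    std::thread t0(worker, 0, 10'000'000ULL);
    std::thread t1(worker, 1, 10'000'000ULL);
    t0.join();
    t1.join();
}
```

C++17's std::hardware_destructive_interference_size expresses the same intent without hard-coding 64.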

4. GPU Computing Model

4.1 GPU vs CPU Design Philosophy

graph LR
    subgraph "CPU: Latency Optimized"
        A1[Large Cache] --- A2[Complex Control Logic]
        A2 --- A3[Few Powerful Cores]
        A3 --- A4[Out-of-Order Execution]
    end

    subgraph "GPU: Throughput Optimized"
        B1[Small Cache] --- B2[Simple Control Logic]
        B2 --- B3[Massive Weak Cores]
        B3 --- B4[Thread-Level Parallelism Hides Latency]
    end
| Feature | CPU | GPU |
|---|---|---|
| Core count | 4-128 (powerful cores) | Thousands to tens of thousands (CUDA cores) |
| Control logic | Complex (out-of-order, branch prediction) | Simple (in-order execution) |
| Cache | Large (L1+L2+L3 up to 100 MB) | Small (relies mainly on registers and shared memory) |
| Memory bandwidth | ~50-100 GB/s | ~1-3 TB/s (HBM) |
| Suitable workloads | Serial, branch-intensive | Massively data-parallel |

4.2 SIMT (Single Instruction Multiple Thread)

NVIDIA GPU execution model:

  • Thread: Smallest execution unit with independent registers and PC
  • Warp: A group of 32 threads that execute the same instruction in lockstep
  • Thread Block: Multiple warps sharing Shared Memory, executing on the same SM
  • Grid: Multiple Blocks distributed across SMs
Grid
├── Block (0,0)
│   ├── Warp 0: Thread 0-31
│   ├── Warp 1: Thread 32-63
│   └── ...
├── Block (0,1)
│   └── ...
└── ...
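
A minimal CUDA sketch of this hierarchy: each thread computes one array element, with its global index derived from its block and thread coordinates (the launch configuration below is illustrative).

```cuda
#include <cuda_runtime.h>

// One thread per element: blockIdx/blockDim/threadIdx map the Grid/Block/Thread
// hierarchy onto a flat index. Threads 0-31 of a block form warp 0, 32-63 warp 1, ...
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];   // guard: the last block may be partially full
}

void launch(const float* d_a, const float* d_b, float* d_c, int n) {
    int threadsPerBlock = 256;                        // 8 warps per block
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c, n);
}
```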

4.3 SM (Streaming Multiprocessor) Structure

Each SM contains:

| Component | Count (typical) | Function |
|---|---|---|
| CUDA cores (FP32) | 128 | Floating-point/integer arithmetic |
| Tensor cores | 4 | Matrix multiply-accumulate (AI acceleration) |
| Load/Store units | 32 | Memory access |
| SFU (Special Function Units) | 16 | sin, cos, exp, etc. |
| Warp schedulers | 4 | Select ready warps for issue each cycle |
| Register file | 256 KB | Thread-private registers |
| Shared Memory | 64-228 KB | Shared among threads within a Block |

4.4 Warp Divergence

When threads within a warp take different branch paths:

  • Must serially execute each path (masking off non-participating threads)
  • Worst case: Both paths execute, halving throughput
\[ \text{Efficiency} = \frac{\text{Active threads}}{32} \]

Optimization: Minimize warp divergence by grouping threads taking the same path into the same warp.
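
The pair of kernels below sketches the difference: branching on the element index makes every warp diverge, while branching on the warp index keeps all 32 lanes of a warp on the same path (kernels are illustrative).

```cuda
// Divergent: even and odd lanes of every warp take different paths, so the
// hardware runs both branches with half the lanes masked off each time.
__global__ void divergent(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0) x[i] = x[i] * 2.0f;
    else            x[i] = x[i] + 1.0f;
}

// Warp-uniform: all lanes of a warp share the same warpId, so each warp
// executes only one of the two branches.
__global__ void warpUniform(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int warpId = i / 32;
    if (warpId % 2 == 0) x[i] = x[i] * 2.0f;
    else                 x[i] = x[i] + 1.0f;
}
```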

4.5 Memory Hierarchy

| Memory Type | Scope | Latency | Bandwidth |
|---|---|---|---|
| Registers | Thread-private | 1 cycle | Highest |
| Shared Memory | Block-shared | ~5 cycles | Very high |
| L1 Cache | SM-level | ~30 cycles | High |
| L2 Cache | Global | ~200 cycles | Medium |
| Global Memory (HBM) | Global | ~300-500 cycles | ~3 TB/s |

Key optimization: Use Shared Memory as a manually managed cache to reduce global memory accesses.
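
A sketch of this idea for a 3-point stencil: each block stages a tile of the input (plus a one-element halo) in Shared Memory, so every input element is read from global memory once per block instead of up to three times. Tile size and boundary handling are illustrative, and blockDim.x is assumed to equal TILE.

```cuda
#define TILE 256

// out[i] = (in[i-1] + in[i] + in[i+1]) / 3, with the block's working set
// staged in shared memory (a manually managed cache).
__global__ void stencil3(const float* in, float* out, int n) {
    __shared__ float tile[TILE + 2];                 // TILE elements + 1-element halos

    int g = blockIdx.x * blockDim.x + threadIdx.x;   // global index
    int l = threadIdx.x + 1;                         // local index, shifted past left halo

    if (g < n) tile[l] = in[g];
    if (threadIdx.x == 0)              tile[0]        = (g > 0)     ? in[g - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1) tile[TILE + 1] = (g + 1 < n) ? in[g + 1] : 0.0f;
    __syncthreads();                                 // tile is now complete for the block

    if (g > 0 && g + 1 < n)
        out[g] = (tile[l - 1] + tile[l] + tile[l + 1]) / 3.0f;
}
```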

5. Interconnect Topologies

5.1 On-Chip Interconnects

| Topology | Structure | Applicable Scale |
|---|---|---|
| Bus | Shared medium | <=4 cores |
| Crossbar | Fully connected | <=16 cores (area \(O(N^2)\)) |
| Ring | Single/double ring | Intel consumer multi-core |
| Mesh | 2D grid | Many-core (Intel Xeon, GPU) |

5.2 Chip-to-Chip Interconnects

| Interconnect | Characteristics | Usage |
|---|---|---|
| PCIe | General-purpose, moderate bandwidth | CPU-GPU, CPU-SSD |
| NVLink | Ultra-high bandwidth (900 GB/s) | GPU-GPU (NVIDIA) |
| CXL | Cache-coherent extension | CPU-accelerator, memory pooling |
| Infinity Fabric | AMD inter-die interconnect | Chiplet connections |
| UCIe | Open chiplet interconnect standard | Heterogeneous chiplets |

6. Parallel Programming Models

| Model | Abstraction Level | Target Hardware |
|---|---|---|
| Pthreads / std::thread | Low-level threading | Multi-core CPU |
| OpenMP | Compiler directives (#pragma) | Multi-core CPU |
| CUDA / HIP | GPU parallel programming | NVIDIA / AMD GPU |
| SYCL / oneAPI | Heterogeneous parallelism (cross-device) | CPU + GPU + FPGA |
| MPI | Message passing (distributed memory) | Clusters, supercomputers |
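
As a small illustration of the directive-based end of this spectrum, an OpenMP dot product; the same loop could be written with std::thread (manual partitioning), CUDA (one thread per element), or MPI (explicit messages plus a reduction).

```cpp
#include <cstddef>

// Directive-based parallelism: the compiler splits the loop across cores and
// reduction(+:sum) combines the per-thread partial sums. Compile with -fopenmp.
double dot(const double* a, const double* b, std::size_t n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum) schedule(static)
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```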

7. Performance Metrics and Analysis

7.1 Speedup and Efficiency

\[ S(p) = \frac{T(1)}{T(p)}, \quad E(p) = \frac{S(p)}{p} \]
  • Ideal linear speedup: \(S(p) = p\), \(E(p) = 1\)
  • In practice limited by the serial fraction (Amdahl's law), communication overhead, and load imbalance (worked example below)
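
Worked example with illustrative numbers: by Amdahl's law, a program with serial fraction \(f = 0.05\) on \(p = 16\) cores achieves

\[ S(16) = \frac{1}{f + \frac{1 - f}{p}} = \frac{1}{0.05 + 0.95/16} \approx 9.1, \qquad E(16) = \frac{9.1}{16} \approx 0.57 \]

so even a 5% serial fraction costs over 40% of the ideal efficiency.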

7.2 Roofline Model

Measures whether a program is compute-bound or bandwidth-bound:

\[ \text{Performance} = \min(\text{Peak FLOPS}, \text{Bandwidth} \times \text{Operational Intensity}) \]

where Operational Intensity \(\text{OI} = \text{FLOPs} / \text{Bytes transferred}\).

  • \(\text{OI} < \text{Ridge Point}\): Bandwidth-bound (memory-bound)
  • \(\text{OI} > \text{Ridge Point}\): Compute-bound
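
Worked example with illustrative hardware numbers: the update \(y_i \leftarrow y_i + \alpha x_i\) on doubles performs 2 FLOPs per 24 bytes moved (load \(x_i\), load \(y_i\), store \(y_i\)), so

\[ \text{OI} = \frac{2\ \text{FLOPs}}{24\ \text{Bytes}} \approx 0.083\ \text{FLOP/Byte} \]

On a machine with 1 TFLOP/s peak and 100 GB/s bandwidth the ridge point is \(10\ \text{FLOP/Byte}\), so this kernel is firmly memory-bound, capped at roughly \(100\ \text{GB/s} \times 0.083 \approx 8.3\ \text{GFLOP/s}\).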

Optimization Strategies

  • Memory-bound: Improve data reuse, prefetching, compression, use higher-bandwidth memory
  • Compute-bound: Leverage SIMD, Tensor cores, algorithmic optimization to reduce computation

8. Heterogeneous Computing and Domain-Specific Architectures

8.1 Heterogeneous SoC

Modern SoCs integrate multiple types of compute units:

| Unit | Strengths |
|---|---|
| CPU big cores | Serial, control-intensive tasks |
| CPU little cores | Low-power background tasks |
| GPU | Graphics, massively data-parallel workloads |
| NPU/TPU | Neural network inference |
| DSP | Signal processing |
| ISP | Image processing |

8.2 Domain-Specific Accelerators

| Accelerator | Application | Key Features |
|---|---|---|
| Google TPU | Machine learning | Systolic array, BF16/INT8 |
| NVIDIA Tensor Core | Matrix multiply-accumulate | FP16/BF16/INT8/FP8 |
| Apple Neural Engine | On-device AI inference | High energy efficiency |
| FPGA | Reconfigurable computing | Flexible, but high development effort |
