Parallel Architecture
1. Flynn's Taxonomy
Michael Flynn (1966) classified computer architectures into four categories based on the parallelism of instruction streams and data streams:
```mermaid
graph TB
    subgraph "Flynn's Taxonomy"
        direction TB
        A["<b>SISD</b><br/>Single Instruction<br/>Single Data<br/>Classic single-core CPU"]
        B["<b>SIMD</b><br/>Single Instruction<br/>Multiple Data<br/>Vector processing / GPU cores"]
        C["<b>MISD</b><br/>Multiple Instruction<br/>Single Data<br/>Fault-tolerant pipelines (rare)"]
        D["<b>MIMD</b><br/>Multiple Instruction<br/>Multiple Data<br/>Multi-core / Distributed"]
    end
    style A fill:#e1f5fe
    style B fill:#e8f5e9
    style C fill:#fff3e0
    style D fill:#f3e5f5
```
| Category | Instruction Streams | Data Streams | Representative |
|---|---|---|---|
| SISD | Single | Single | Classic single-core processor |
| SIMD | Single | Multiple | SSE/AVX, NEON, GPU cores |
| MISD | Multiple | Single | Aerospace fault-tolerant systems (extremely rare) |
| MIMD | Multiple | Multiple | Multi-core CPUs, clusters, supercomputers |
2. SIMD Parallelism
2.1 Vector Extensions (CPU SIMD)
A single instruction operates on multiple data elements simultaneously:
| Extension | Width | Architecture |
|---|---|---|
| SSE | 128-bit (4x float) | x86 |
| AVX2 | 256-bit (8x float) | x86 |
| AVX-512 | 512-bit (16x float) | x86 (server) |
| NEON | 128-bit | ARM |
| SVE/SVE2 | 128-2048 bit (scalable) | ARMv9 |
| RVV | Scalable (VLEN) | RISC-V |
Speedup upper bound (ignoring scalar overhead): \(S = \text{vector width} / \text{element width}\).
For example, AVX-512 processing 32-bit floats: \(S = 512 / 32 = 16\times\).
2.2 SIMD Programming Models
- Compiler auto-vectorization: Compiler detects vectorizable loops and emits SIMD instructions automatically
- Intrinsics: Manual invocation of SIMD instructions (e.g., `_mm256_add_ps`); see the sketch below
- Library abstractions: High-level library wrappers (e.g., Eigen, xsimd)
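A minimal intrinsics sketch, assuming an x86 compiler with AVX2 enabled (e.g., `-mavx2`) and an array length that is a multiple of 8 (a real implementation would add a scalar tail loop):

```cpp
#include <immintrin.h>  // AVX2 intrinsics
#include <cstddef>

// Adds two float arrays using 256-bit vectors (8 floats per instruction).
// Assumes n is a multiple of 8.
void add_avx2(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);   // load 8 floats (unaligned load)
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));  // 8 adds at once
    }
}
```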
3. Multi-Core and Shared Memory
3.1 Memory Architecture
```mermaid
graph TB
    subgraph "UMA (Uniform Memory Access)"
        C1[Core 0] --> BUS1[Bus/Crossbar]
        C2[Core 1] --> BUS1
        C3[Core 2] --> BUS1
        BUS1 --> MEM1[Shared Memory<br/>Uniform latency]
    end
    subgraph "NUMA (Non-Uniform Memory Access)"
        C4[Core 0] --> L1[Local Memory<br/>Fast]
        C5[Core 1] --> L1
        C6[Core 2] --> L2[Local Memory<br/>Fast]
        C7[Core 3] --> L2
        L1 <-->|Interconnect<br/>Slow| L2
    end
```
| Architecture | Characteristics | Applicable Scale |
|---|---|---|
| UMA | Uniform memory access latency for all cores | Small-scale (<=8 cores) |
| NUMA | Local memory is fast, remote memory is slow | Multi-socket servers (tens to hundreds of cores) |
NUMA-aware programming is critical: data should be allocated on the node local to the core that uses it.
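A common idiom is first-touch placement: on Linux, a page is typically placed on the NUMA node of the thread that first writes it, so data should be initialized by the same threads that will later process it. A minimal OpenMP sketch, with illustrative sizes:

```cpp
#include <memory>

int main() {
    const long long n = 1LL << 26;
    // new double[n] default-initializes (no writes), so no pages are
    // touched yet; the parallel loop below performs the first touch.
    std::unique_ptr<double[]> data(new double[n]);

    // First touch: each thread writes the chunk it will later process,
    // so the OS places those pages on that thread's local NUMA node.
    #pragma omp parallel for schedule(static)
    for (long long i = 0; i < n; ++i)
        data[i] = 1.0;

    // The compute phase reuses the same static schedule, so each thread
    // mostly accesses memory resident on its own node.
    double sum = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (long long i = 0; i < n; ++i)
        sum += data[i];

    return sum > 0 ? 0 : 1;
}
```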
3.2 Cache Coherence
When multiple cores share memory, the same data in different private caches may become inconsistent.
Coherence requirements:
- Write Propagation: A core's write eventually becomes visible to all cores
- Write Serialization: All cores observe writes to the same address in the same order
3.3 MESI Protocol
The classic snooping-based coherence protocol; each cache line is in one of four states:
| State | Meaning |
|---|---|
| M (Modified) | Modified, only this cache has a valid copy, main memory is stale |
| E (Exclusive) | Unmodified, only this cache has a valid copy, consistent with main memory |
| S (Shared) | Unmodified, potentially multiple caches hold copies |
| I (Invalid) | Invalid, cannot be used |
State transition examples:
```mermaid
stateDiagram-v2
    [*] --> I
    I --> E: Local read (no other copy)
    I --> S: Local read (other copies exist)
    E --> M: Local write
    S --> M: Local write (other copies -> I)
    M --> S: Remote read (write back to memory)
    M --> I: Remote write
    E --> I: Remote write
    S --> I: Remote write
    E --> S: Remote read
```
MOESI / MESIF extensions:
- MOESI (AMD): Adds Owned state, avoiding the need to write back dirty data to main memory
- MESIF (Intel): Adds Forward state, designating which cache responds to shared requests
3.4 Directory Protocol
Snooping protocols have high broadcast overhead in large-scale systems. Directory protocols use a centralized directory recording sharer information for each cache line:
- Suitable for NUMA architectures and large-scale multi-core (>16 cores)
- Point-to-point communication, no broadcasting needed
- Directory entry: state + sharer bit vector
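A minimal sketch of what a directory entry might look like for a machine with up to 64 cores (field names and widths are illustrative, not tied to any specific implementation):

```cpp
#include <cstdint>

// Hypothetical directory entry for one cache line in a directory-based
// coherence protocol: a state field plus one presence bit per core.
enum class DirState : std::uint8_t { Uncached, Shared, Exclusive };

struct DirectoryEntry {
    DirState      state;    // coherence state of the line
    std::uint64_t sharers;  // bit i set => core i may hold a copy
    std::uint8_t  owner;    // meaningful when state == Exclusive (dirty owner)
};

// On a write miss, the home node consults the entry and sends point-to-point
// invalidations only to cores whose sharer bits are set, instead of
// broadcasting to every core as a snooping protocol would.
```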
3.5 False Sharing
Different cores access different variables that happen to reside in the same cache line, causing frequent coherence invalidations.
Solutions:
- Cache line alignment (padding): use `alignas(64)` to ensure independently written variables reside in different cache lines (see the sketch below)
- Per-thread independent data structures (thread-local), combined only after the parallel phase
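A small C++ sketch of the padding fix; the 64-byte line size is a common but architecture-dependent assumption (`std::hardware_destructive_interference_size` can be used where available):

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Each worker's counter occupies its own 64-byte cache line, so increments
// from different threads do not invalidate each other's lines.
struct alignas(64) PaddedCounter {
    std::atomic<long> value{0};
};

int main() {
    const int num_threads = 4;
    std::vector<PaddedCounter> counters(num_threads);

    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counters, t] {
            for (int i = 0; i < 1000000; ++i)
                counters[t].value.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto& w : workers) w.join();
    return 0;
}
```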
4. GPU Computing Model
4.1 GPU vs CPU Design Philosophy
```mermaid
graph LR
    subgraph "CPU: Latency Optimized"
        A1[Large Cache] --- A2[Complex Control Logic]
        A2 --- A3[Few Powerful Cores]
        A3 --- A4[Out-of-Order Execution]
    end
    subgraph "GPU: Throughput Optimized"
        B1[Small Cache] --- B2[Simple Control Logic]
        B2 --- B3[Many Simple Cores]
        B3 --- B4[Thread-Level Parallelism Hides Latency]
    end
```
| Feature | CPU | GPU |
|---|---|---|
| Core count | 4-128 (powerful cores) | Thousands to tens of thousands (CUDA cores) |
| Control logic | Complex (out-of-order, branch prediction) | Simple (in-order execution) |
| Cache | Large (L1+L2+L3 up to 100MB) | Small (relies mainly on registers and shared memory) |
| Memory bandwidth | ~50-100 GB/s | ~1-3 TB/s (HBM) |
| Suitable workloads | Serial, branch-intensive | Massively data-parallel |
4.2 SIMT (Single Instruction Multiple Thread)
NVIDIA GPU execution model:
- Thread: Smallest execution unit with independent registers and PC
- Warp: A group of 32 threads that executes the same instruction in lockstep
- Thread Block: Multiple warps sharing Shared Memory, executing on the same SM
- Grid: Multiple Blocks distributed across SMs
```
Grid
├── Block (0,0)
│   ├── Warp 0: Thread 0-31
│   ├── Warp 1: Thread 32-63
│   └── ...
├── Block (0,1)
│   └── ...
└── ...
```
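A minimal CUDA C++ sketch of this hierarchy: each thread computes one element of a vector sum, deriving its global index from its block and thread coordinates (kernel name and launch parameters are illustrative):

```cuda
#include <cuda_runtime.h>

// One thread per output element; consecutive threads in a warp access
// consecutive addresses, which also yields coalesced global-memory accesses.
__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the last block
        c[i] = a[i] + b[i];
}

void launch_vec_add(const float* a, const float* b, float* c, int n) {
    int threads_per_block = 256;  // 8 warps per block
    int blocks = (n + threads_per_block - 1) / threads_per_block;
    vec_add<<<blocks, threads_per_block>>>(a, b, c, n);
}
```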
4.3 SM (Streaming Multiprocessor) Structure
Each SM contains:
| Component | Count (typical) | Function |
|---|---|---|
| CUDA cores (FP32) | 128 | Floating-point/integer arithmetic |
| Tensor cores | 4 | Matrix multiply-accumulate (AI acceleration) |
| Load/Store units | 32 | Memory access |
| SFU (Special Function Units) | 16 | sin, cos, exp, etc. |
| Warp schedulers | 4 | Select ready warps for issue each cycle |
| Register file | 256 KB | Thread-private registers |
| Shared Memory | 64-228 KB | Shared among threads within a Block |
4.4 Warp Divergence
When threads within a warp take different branch paths:
- The warp must execute each path serially, masking off the threads not taking that path
- Worst case with two paths: both execute back-to-back, halving throughput
Optimization: Minimize warp divergence by grouping threads that take the same path into the same warp, as in the sketch below.
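An illustrative CUDA C++ comparison: in the first kernel the branch condition depends on thread-index parity, so every warp diverges; in the second the condition is uniform within each warp (the data layout that permits this is assumed):

```cuda
// Divergent: odd and even lanes of every warp take different paths,
// so each warp executes both branches serially.
__global__ void divergent(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0) x[i] = x[i] * 2.0f;
    else            x[i] = x[i] + 1.0f;
}

// Warp-uniform: threads 0-31 take one path, threads 32-63 the other,
// so every warp executes only a single branch.
__global__ void warp_uniform(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((i / 32) % 2 == 0) x[i] = x[i] * 2.0f;
    else                   x[i] = x[i] + 1.0f;
}
```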
4.5 Memory Hierarchy
| Memory Type | Scope | Latency | Bandwidth |
|---|---|---|---|
| Registers | Thread-private | 1 cycle | Highest |
| Shared Memory | Block-shared | ~5 cycles | Very high |
| L1 Cache | SM-level | ~30 cycles | High |
| L2 Cache | Global | ~200 cycles | Medium |
| Global Memory (HBM) | Global | ~300-500 cycles | ~3 TB/s |
Key optimization: Use Shared Memory as a manually managed cache to reduce global memory accesses.
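A simplified CUDA C++ sketch of this idea: a 3-point stencil stages each block's tile (plus a one-element halo) in Shared Memory, so elements reused by neighboring threads are read from global memory only once. The tile size is illustrative and `n` is assumed to be a multiple of it:

```cuda
#define TILE 256  // threads per block and tile width (illustrative)

__global__ void stencil3(const float* in, float* out, int n) {
    __shared__ float tile[TILE + 2];                  // tile + halo elements
    int g = blockIdx.x * blockDim.x + threadIdx.x;    // global index
    int l = threadIdx.x + 1;                          // local index (skip halo)

    tile[l] = in[g];                                  // stage the tile
    if (threadIdx.x == 0)
        tile[0] = (g > 0) ? in[g - 1] : 0.0f;             // left halo
    if (threadIdx.x == blockDim.x - 1)
        tile[l + 1] = (g + 1 < n) ? in[g + 1] : 0.0f;     // right halo
    __syncthreads();                                  // tile fully loaded

    // Each output reads three values from fast Shared Memory instead of
    // issuing three separate global-memory loads.
    out[g] = (tile[l - 1] + tile[l] + tile[l + 1]) / 3.0f;
}
```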
5. Interconnect Topologies
5.1 On-Chip Interconnects
| Topology | Structure | Applicable Scale |
|---|---|---|
| Bus | Shared medium | <=4 cores |
| Crossbar | Fully connected | <=16 cores (area \(O(N^2)\)) |
| Ring | Single/double ring | Intel consumer multi-core |
| Mesh | 2D grid | Many-core (Intel Xeon, GPU) |
5.2 Chip-to-Chip Interconnects
| Interconnect | Characteristics | Usage |
|---|---|---|
| PCIe | General-purpose, moderate bandwidth | CPU-GPU, CPU-SSD |
| NVLink | Ultra-high bandwidth (900 GB/s) | GPU-GPU (NVIDIA) |
| CXL | Cache-coherent extension | CPU-accelerator, memory pooling |
| Infinity Fabric | AMD inter-die interconnect | Chiplet connections |
| UCIe | Open chiplet interconnect standard | Heterogeneous chiplets |
6. Parallel Programming Models
| Model | Abstraction Level | Target Hardware |
|---|---|---|
| Pthreads / std::thread | Low-level threading | Multi-core CPU |
| OpenMP | Compiler directives (#pragma) | Multi-core CPU |
| CUDA / HIP | GPU parallel programming | NVIDIA / AMD GPU |
| SYCL / oneAPI | Heterogeneous parallel (cross-device) | CPU + GPU + FPGA |
| MPI | Message passing (distributed memory) | Clusters, supercomputers |
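As a concrete taste of the directive style, a minimal OpenMP sketch (built with an OpenMP-enabled compiler, e.g. `-fopenmp`; the loop body is illustrative) that parallelizes a reduction across CPU cores:

```cpp
#include <cstdio>
#include <omp.h>

int main() {
    const int n = 1 << 20;
    double sum = 0.0;

    // The directive splits loop iterations across the available threads;
    // the reduction clause gives each thread a private partial sum that
    // is combined when the loop finishes.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += 1.0 / (i + 1.0);

    std::printf("sum = %f, max threads = %d\n", sum, omp_get_max_threads());
    return 0;
}
```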
7. Performance Metrics and Analysis
7.1 Speedup and Efficiency
For \(p\) processors, speedup is \(S(p) = T_1 / T_p\) (serial time over parallel time) and efficiency is \(E(p) = S(p) / p\).
- Ideal linear speedup: \(S(p) = p\), \(E(p) = 1\)
- In practice limited by: the serial fraction (Amdahl's law), communication overhead, and load imbalance
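With a serial fraction \(s\), Amdahl's law bounds the achievable speedup; a short worked example with an illustrative \(s = 0.1\) and \(p = 16\):

\[
S(p) = \frac{1}{s + \frac{1 - s}{p}}, \qquad
S(16) = \frac{1}{0.1 + 0.9/16} \approx 6.4, \qquad
\lim_{p \to \infty} S(p) = \frac{1}{s} = 10
\]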
7.2 Roofline Model
Determines whether a program is compute-bound or bandwidth-bound. Attainable performance is

\[
P = \min\left(P_{\text{peak}},\ \text{OI} \times B_{\text{mem}}\right)
\]

where Operational Intensity \(\text{OI} = \text{FLOPs} / \text{Bytes transferred}\), \(P_{\text{peak}}\) is peak compute throughput, and \(B_{\text{mem}}\) is peak memory bandwidth. The Ridge Point is the OI at which the two terms are equal.
- \(\text{OI} < \text{Ridge Point}\): Bandwidth-bound (memory-bound)
- \(\text{OI} > \text{Ridge Point}\): Compute-bound
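A quick worked example with illustrative numbers: for an accelerator with \(P_{\text{peak}} = 20\) TFLOP/s and \(B_{\text{mem}} = 2\) TB/s, the Ridge Point is \(20 / 2 = 10\) FLOPs/byte; a kernel with \(\text{OI} = 0.25\) FLOPs/byte can attain at most \(0.25 \times 2 = 0.5\) TFLOP/s and is firmly memory-bound.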
Optimization Strategies
- Memory-bound: Improve data reuse, prefetching, compression, use higher-bandwidth memory
- Compute-bound: Leverage SIMD, Tensor cores, algorithmic optimization to reduce computation
8. Heterogeneous Computing and Domain-Specific Architectures
8.1 Heterogeneous SoC
Modern SoCs integrate multiple types of compute units:
| Unit | Strengths |
|---|---|
| CPU big cores | Serial, control-intensive tasks |
| CPU little cores | Low-power background tasks |
| GPU | Graphics, massively data-parallel |
| NPU/TPU | Neural network inference |
| DSP | Signal processing |
| ISP | Image processing |
8.2 Domain-Specific Accelerators
| Accelerator | Application | Key Features |
|---|---|---|
| Google TPU | Machine learning | Systolic array, BF16/INT8 |
| NVIDIA Tensor Core | Matrix multiply-accumulate | FP16/BF16/INT8/FP8 |
| Apple Neural Engine | On-device AI inference | High energy efficiency |
| FPGA | Reconfigurable computing | Flexible but high development difficulty |
Navigation
- Previous: Memory Hierarchy Design
- Series home: Architecture Overview