Parallel Computing
Parallel computing is a method of accelerating computation by decomposing tasks into multiple subtasks and executing them simultaneously across multiple processing units. It forms the theoretical and practical foundation of High-Performance Computing (HPC), directly underpinning modern scientific simulations, big data processing, and deep learning training.
Core questions include: how to effectively decompose tasks (the bottleneck described by Amdahl's Law), how to communicate and synchronize across processing units (MPI/OpenMP), how to leverage the massive parallelism of GPUs (CUDA), and how to apply these techniques to distributed deep learning training (data parallelism / model parallelism / pipeline parallelism). These notes systematically cover the core knowledge of parallel computing, from theoretical foundations to programming models to practical applications.
Theoretical Foundations of Parallel Computing
Amdahl's Law
Amdahl's Law describes the theoretical maximum speedup achievable by adding more processors for a fixed problem size.
Formula:

\[S(n) = \frac{1}{(1 - P) + \frac{P}{n}}\]
Where:
- \(S(n)\): Speedup when using \(n\) processors
- \(P\): The fraction of the program that is parallelizable (\(0 \le P \le 1\))
- \(n\): Number of processors
- \((1 - P)\): The fraction of the program that must be executed sequentially
Key Insight
When \(n \to \infty\), \(S = \frac{1}{1 - P}\). Even with infinitely many processors, the speedup is bounded by the sequential portion. For example, if \(P = 0.95\) (95% parallelizable), the maximum speedup is only \(\frac{1}{0.05} = 20\times\).
Speedup for different values of P and n:
| Parallelizable fraction P | n=2 | n=4 | n=8 | n=16 | n=64 | n=256 | n=∞ |
|---|---|---|---|---|---|---|---|
| 0.50 | 1.33 | 1.60 | 1.78 | 1.88 | 1.97 | 1.99 | 2.00 |
| 0.75 | 1.60 | 2.29 | 2.91 | 3.37 | 3.76 | 3.94 | 4.00 |
| 0.90 | 1.82 | 3.08 | 4.71 | 6.40 | 8.77 | 9.61 | 10.00 |
| 0.95 | 1.90 | 3.48 | 5.93 | 9.14 | 15.42 | 18.28 | 20.00 |
| 0.99 | 1.98 | 3.88 | 7.48 | 13.91 | 39.26 | 72.12 | 100.00 |
Implication
The sequential portion is the ultimate bottleneck for parallel speedup. When optimizing parallel programs, the priority should be reducing the sequential fraction rather than blindly adding more processors.
Gustafson's Law
Gustafson's Law considers parallel speedup from a different perspective: with a fixed execution time, increasing the number of processors allows solving larger-scale problems.
Formula:

\[S(n) = n - \alpha (n - 1)\]
Where:
- \(n\): Number of processors
- \(\alpha\): The fraction of the program that is sequential
Amdahl vs Gustafson: Different Perspectives
- Amdahl's Law: Fixed problem size, add processors → Strong Scaling
- Gustafson's Law: Fixed execution time, increase problem size → Weak Scaling
| Dimension | Amdahl's Law | Gustafson's Law |
|---|---|---|
| Fixed quantity | Problem size | Execution time |
| Scaling type | Strong Scaling | Weak Scaling |
| Assumption | Problem size stays the same; add processors | More processors solve larger problems |
| Conclusion | Speedup has an upper bound | Speedup can grow linearly with n |
| Applicable scenarios | Real-time systems, latency-sensitive tasks | Scientific simulations, large-scale data processing |
Performance Metrics
- Speedup: \(S(n) = \frac{T_1}{T_n}\), where \(T_1\) is the sequential execution time and \(T_n\) is the parallel execution time with \(n\) processors
- Efficiency: \(E(n) = \frac{S(n)}{n}\); ideally \(E = 1\) (i.e., 100%), but in practice \(E < 1\)
- Scalability: Whether a program can maintain good efficiency as the number of processors increases
| Metric | Ideal value | Meaning |
|---|---|---|
| Speedup | \(S(n) = n\) | Linear speedup: doubling processors doubles speed |
| Efficiency | \(E = 1\) | Every processor is fully utilized |
| Scalability | Efficiency does not decrease with n | The system can efficiently utilize more resources |
Parallel Programming Models
Shared Memory Model
Characteristics: Multiple threads share the same address space and communicate by reading and writing shared variables.
```
┌─────────────────────────────────────┐
│         Shared Memory Space         │
│   ┌───┐   ┌───┐   ┌───┐   ┌───┐     │
│   │ A │   │ B │   │ C │   │ D │ ... │
│   └───┘   └───┘   └───┘   └───┘     │
└────┬───────┬───────┬───────┬────────┘
     │       │       │       │
 Thread0 Thread1 Thread2 Thread3
```
OpenMP
OpenMP is a shared-memory parallel programming API based on compiler directives. It achieves parallelization through `#pragma` directives and is straightforward to use.
Vector addition example:
```c
#include <stdio.h>
#include <omp.h>

#define N 1000000

// Static storage: three arrays of N doubles would overflow a
// typical stack if declared inside main
static double a[N], b[N], c[N];

int main() {
    // Initialization
    for (int i = 0; i < N; i++) {
        a[i] = i * 1.0;
        b[i] = i * 2.0;
    }

    // OpenMP parallel vector addition
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
    }

    printf("c[0] = %f, c[N-1] = %f\n", c[0], c[N-1]);
    return 0;
}
```
Compilation
```bash
gcc -fopenmp vector_add.c -o vector_add
```
Common OpenMP directives:
| Directive | Function |
|---|---|
| `#pragma omp parallel` | Create a parallel region |
| `#pragma omp parallel for` | Parallelize a `for` loop |
| `#pragma omp critical` | Critical section: only one thread may execute at a time |
| `#pragma omp atomic` | Atomic operation |
| `#pragma omp barrier` | Synchronization barrier |
| `#pragma omp parallel for reduction(+:sum)` | Reduction clause (here: sum) |
Pros and cons:
| Pros | Cons |
|---|---|
| Simple to program; incremental parallelization | Limited to a single machine (single node) |
| Minimal code changes via compiler directives | Must guard against data races |
| Supports C/C++/Fortran | Scalability limited by the number of cores on a single machine |
Distributed Memory Model
Characteristics: Each node has its own independent memory; nodes communicate via message passing.
```
┌──────────┐   Message Passing   ┌──────────┐
│  Node 0  │◄───────────────────►│  Node 1  │
│ ┌──────┐ │      (Network)      │ ┌──────┐ │
│ │Memory│ │                     │ │Memory│ │
│ └──────┘ │                     │ └──────┘ │
└──────────┘                     └──────────┘
      ▲                                ▲
      │        Message Passing         │
      ▼                                ▼
┌──────────┐                     ┌──────────┐
│  Node 2  │◄───────────────────►│  Node 3  │
│ ┌──────┐ │                     │ ┌──────┐ │
│ │Memory│ │                     │ │Memory│ │
│ └──────┘ │                     │ └──────┘ │
└──────────┘                     └──────────┘
```
MPI (Message Passing Interface)
MPI is the standard interface for distributed-memory parallel programming, suitable for large-scale parallel computing across nodes.
Basic operations:
| Function | Purpose |
|---|---|
| `MPI_Init` | Initialize the MPI environment |
| `MPI_Comm_rank` | Get the rank of the current process |
| `MPI_Comm_size` | Get the total number of processes |
| `MPI_Send` | Send a message (blocking) |
| `MPI_Recv` | Receive a message (blocking) |
| `MPI_Bcast` | Broadcast (one-to-all) |
| `MPI_Reduce` | Reduce (all-to-one) |
| `MPI_Allreduce` | All-reduce (reduce, then deliver the result to all processes) |
| `MPI_Finalize` | Finalize the MPI environment |
MPI vector addition example:
```c
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define N 1000000

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int local_n = N / size;  // assumes size divides N evenly
    double *local_a = malloc(local_n * sizeof(double));
    double *local_b = malloc(local_n * sizeof(double));
    double *local_c = malloc(local_n * sizeof(double));

    // Each process initializes its own data chunk
    for (int i = 0; i < local_n; i++) {
        local_a[i] = (rank * local_n + i) * 1.0;
        local_b[i] = (rank * local_n + i) * 2.0;
    }

    // Each process computes its own portion
    for (int i = 0; i < local_n; i++) {
        local_c[i] = local_a[i] + local_b[i];
    }

    if (rank == 0) {
        printf("Process 0: local_c[0] = %f\n", local_c[0]);
    }

    free(local_a); free(local_b); free(local_c);
    MPI_Finalize();
    return 0;
}
```
Compilation and Execution
```bash
mpicc vector_add_mpi.c -o vector_add_mpi
mpirun -np 4 ./vector_add_mpi
```
Pros and cons:
| Pros | Cons |
|---|---|
| Cross-node capable with strong scalability | Complex programming; explicit communication management required |
| Suitable for large-scale clusters | Communication overhead can become a bottleneck |
| No data race issues | Difficult to debug |
GPU Parallelism (CUDA)
CUDA Programming Model
CUDA (Compute Unified Device Architecture) is NVIDIA's general-purpose GPU computing programming model.
Hierarchy:
```
Grid
├── Block (0,0)          Block (1,0)          Block (2,0)
│     ├── Thread (0,0)     ├── Thread (0,0)     ├── ...
│     ├── Thread (1,0)     ├── Thread (1,0)     │
│     ├── Thread (0,1)     ├── Thread (0,1)     │
│     └── ...              └── ...              └── ...
├── Block (0,1)          Block (1,1)          Block (2,1)
│     └── ...              └── ...              └── ...
```
- Grid: Composed of multiple Blocks; corresponds to a single kernel launch
- Block: Composed of multiple Threads; threads within a Block can cooperate
- Thread: The smallest unit of execution
Kernel Functions
A kernel is a function executed on the GPU, declared with the `__global__` qualifier and launched using the `<<<grid, block>>>` syntax.
GPU Memory Hierarchy
| Memory Type | Speed | Size | Scope | Lifetime |
|---|---|---|---|---|
| Registers | Fastest | Very small | Single Thread | Thread |
| Shared Memory | Very fast | Small (~48KB/Block) | All Threads within a Block | Block |
| Global Memory | Slower | Large (GB-scale) | All Threads | Managed by host |
| Constant Memory | Fast (cached) | 64KB | All Threads (read-only) | Managed by host |
CUDA Vector Addition Example
```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define N 1000000

// Kernel: each thread computes one element
__global__ void vector_add(float *a, float *b, float *c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

int main() {
    float *h_a, *h_b, *h_c;  // Host (CPU) memory
    float *d_a, *d_b, *d_c;  // Device (GPU) memory
    size_t bytes = N * sizeof(float);

    // Allocate host memory and initialize
    h_a = (float*)malloc(bytes);
    h_b = (float*)malloc(bytes);
    h_c = (float*)malloc(bytes);
    for (int i = 0; i < N; i++) {
        h_a[i] = i * 1.0f;
        h_b[i] = i * 2.0f;
    }

    // Allocate device memory
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    // Copy data from host to device
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch kernel: enough blocks to cover all N elements
    int blockSize = 256;
    int gridSize = (N + blockSize - 1) / blockSize;
    vector_add<<<gridSize, blockSize>>>(d_a, d_b, d_c, N);

    // Copy result from device back to host (implicitly synchronizes)
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f, c[N-1] = %f\n", h_c[0], h_c[N-1]);

    // Free memory
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```
Compilation
```bash
nvcc vector_add.cu -o vector_add
```
GPU vs CPU: Use Case Comparison
| Dimension | CPU | GPU |
|---|---|---|
| Core count | Few (single digits to tens) | Many (thousands) |
| Single-core performance | Strong (complex logic) | Weak (simple operations) |
| Parallelism granularity | Coarse-grained | Fine-grained (SIMT) |
| Suitable tasks | Control logic, branch-heavy workloads | Data-parallel, compute-intensive workloads |
| Typical applications | OS scheduling, compilation, databases | Matrix operations, image processing, deep learning |
| Memory bandwidth | Lower | Very high (HBM) |
| Latency | Low (fast for single tasks) | High (significant launch overhead) |
Selection Guidelines
- Tasks involving many independent, identical computations → use GPU
- Tasks involving complex branching logic and dependencies → use CPU
- Best practice: CPU + GPU heterogeneous cooperation
Parallel Algorithm Design
Common Parallel Patterns
Master-Worker Pattern
A single Master process handles task distribution and result collection, while multiple Worker processes perform the actual computation.
```
            ┌──────────┐
            │  Master  │
            │Task Alloc│
            └──┬──┬──┬─┘
               │  │  │
        ┌──────┘  │  └──────┐
        ▼         ▼         ▼
   ┌────────┐┌────────┐┌────────┐
   │Worker 0││Worker 1││Worker 2│
   │Compute ││Compute ││Compute │
   └───┬────┘└───┬────┘└───┬────┘
       │         │         │
       └────┐    │    ┌────┘
            ▼    ▼    ▼
            ┌──────────┐
            │  Master  │
            │ Collect  │
            └──────────┘
```
MapReduce
Inspired by functional programming, computation is divided into two phases: Map and Reduce. This is the core model for big data processing (e.g., Hadoop, Spark).
Input Data → [Map] → Intermediate Key-Value Pairs → [Shuffle] → [Reduce] → Output
Pipeline
Data flows sequentially through multiple processing stages, each executed by a different processor -- analogous to a factory assembly line.
```
Data Flow → [Stage 1] → [Stage 2] → [Stage 3] → [Stage 4] → Output
            Processor0   Processor1   Processor2   Processor3
```
Divide and Conquer
The problem is recursively split into subproblems, which are solved in parallel, and the results are then merged. Classic examples include parallel merge sort and parallel quicksort.
Communication Patterns
Point-to-Point
Direct message send/receive between two processes.
Process 0 ──── data ────► Process 1
Collective Communication
Communication operations involving a group of processes; this is the most commonly used communication pattern in parallel computing.
Broadcast: One process sends data to all processes.
```
P0 [A] ──►  P0 [A]
            P1 [A]
            P2 [A]
            P3 [A]
```

Scatter: One process partitions its data and distributes the chunks to each process.

```
P0 [A B C D] ──►  P0 [A]
                  P1 [B]
                  P2 [C]
                  P3 [D]
```

Gather: Each process sends its data to a single process, which collects them all.

```
P0 [A] ──►  P0 [A B C D]
P1 [B]
P2 [C]
P3 [D]
```

Reduce: Applies an operation (e.g., summation) to data from all processes, storing the result in one process.

```
P0 [1] ──►  P0 [1+2+3+4 = 10]
P1 [2]      (op = SUM)
P2 [3]
P3 [4]
```

AllReduce: Performs a Reduce and then broadcasts the result to all processes. Equivalent to Reduce + Broadcast.

```
P0 [1] ──►  P0 [10]
P1 [2]      P1 [10]
P2 [3]      P2 [10]
P3 [4]      P3 [10]
```
Importance of AllReduce
AllReduce is the most critical communication operation in distributed deep learning training, used to synchronize gradients across GPUs. Common implementations include Ring AllReduce and Tree AllReduce.
Synchronization and Load Balancing
Barrier Synchronization
All threads/processes must reach the barrier before any can proceed, ensuring that all parallel branches are aligned at a given point.
```
Thread 0: ──work──   │ ──work──
Thread 1: ──work─────│ ──work──
Thread 2: ──work──   │ ──work──
                  Barrier
```
Locks and Atomic Operations
- Mutex: Protects a critical section so that only one thread can access a shared resource at a time
- Spinlock: A thread busy-waits until it acquires the lock; suitable when the lock is held for short durations
- Atomic Operation: A hardware-level indivisible operation, such as `atomic_add` or `compare_and_swap`
Caveats
- Deadlock: Two or more threads each waiting for the other to release a lock
- Data Race: Multiple threads simultaneously reading and writing the same variable, leading to nondeterministic results
- Excessive synchronization can severely degrade parallel performance; minimize lock granularity and hold time
Load Balancing
| Strategy | Static Load Balancing | Dynamic Load Balancing |
|---|---|---|
| Allocation timing | Pre-assigned before execution | Adjusted dynamically at runtime based on load |
| Implementation complexity | Simple | More complex |
| Applicable scenarios | Uniform, predictable workloads | Non-uniform, unpredictable workloads |
| Overhead | Low | Scheduling overhead |
| Typical methods | Uniform partitioning, round-robin | Work stealing, task queues |
Impact of Load Imbalance
If workloads vary significantly across processors, faster processors must idle while waiting for slower ones, causing overall execution time to be determined by the slowest processor (the barrel effect / straggler problem). This significantly reduces parallel efficiency.
Applications of Parallel Computing in Deep Learning
Modern deep learning models (such as LLMs) are so large that a single GPU cannot complete training on its own. Distributed parallel training has become an essential technique.
Data Parallelism
Each GPU holds a complete copy of the model and processes different data batches (mini-batches). Gradients are then synchronized via AllReduce.
```
GPU 0: Model Copy + Data Batch 0 → Gradient 0 ─┐
GPU 1: Model Copy + Data Batch 1 → Gradient 1 ─┤ AllReduce → Avg Gradient → Update Model
GPU 2: Model Copy + Data Batch 2 → Gradient 2 ─┤
GPU 3: Model Copy + Data Batch 3 → Gradient 3 ─┘
```
- Pros: Simple to implement; fixed communication pattern
- Cons: Each GPU must be able to hold the entire model
Model Parallelism
Different layers or components of the model are distributed across different GPUs. This is suitable when a single GPU cannot accommodate the entire model.
```
GPU 0: Layer  1-10
GPU 1: Layer 11-20
GPU 2: Layer 21-30
GPU 3: Layer 31-40
```
- Pros: Enables training of very large models
- Cons: Data dependencies between GPUs can lead to idle GPUs (the bubble problem)
Pipeline Parallelism
After partitioning the model by layers, different micro-batches are fed through the stages in a pipelined fashion, reducing GPU idle time.
```
Time →   t0     t1     t2     t3     t4     t5     t6
GPU 0:  [MB0]  [MB1]  [MB2]  [MB3]
GPU 1:         [MB0]  [MB1]  [MB2]  [MB3]
GPU 2:                [MB0]  [MB1]  [MB2]  [MB3]
GPU 3:                       [MB0]  [MB1]  [MB2]  [MB3]
```
The Bubble Problem
During the startup and teardown phases of the pipeline, GPUs are idle -- these idle periods are called "bubbles." Increasing the number of micro-batches reduces the bubble ratio. GPipe and PipeDream are well-known pipeline parallelism frameworks.
Hybrid Parallelism Strategies
In practice, large-scale training typically combines multiple parallelism strategies:
| Parallelism Strategy | Partitioning Dimension | Typical Frameworks |
|---|---|---|
| Data Parallelism | Data (Batch) | PyTorch DDP, Horovod |
| Model Parallelism (Tensor Parallelism) | Intra-layer (weight matrices) | Megatron-LM |
| Pipeline Parallelism | Inter-layer | GPipe, PipeDream |
| Hybrid Parallelism (3D Parallelism) | Data + intra-layer + inter-layer simultaneously | Megatron-LM + DeepSpeed |
Common Frameworks
- PyTorch DDP (DistributedDataParallel): The standard approach for data parallelism
- DeepSpeed (Microsoft): Supports ZeRO optimization, significantly reducing GPU memory usage
- Megatron-LM (NVIDIA): A tensor parallelism solution designed specifically for very large Transformer models
- FSDP (Fully Sharded Data Parallel): PyTorch's native model-sharding approach
The Role of AllReduce in Distributed Training
In data-parallel training, each iteration involves the following key steps:
1. Each GPU performs forward propagation with its own data and computes the loss
2. Each GPU performs backpropagation to obtain local gradients
3. AllReduce averages the gradients across all GPUs
4. Each GPU updates the model parameters using the averaged gradients
Ring AllReduce is the most widely used implementation: \(n\) GPUs are arranged in a ring, and the all-reduce completes in \(2(n-1)\) communication steps. Each GPU sends a total of about \(2\frac{n-1}{n}\) times the gradient size, nearly independent of the number of GPUs, which gives good scalability.
```
      GPU 0
     ╱     ╲
GPU 3       GPU 1
     ╲     ╱
      GPU 2
```
Each GPU communicates only with its neighbors; AllReduce completes in 2(n-1) steps
Optimization Directions
- Gradient compression: Reduce communication data volume
- Overlapping computation and communication: While backpropagation computes gradients, layers that have already finished can begin AllReduce concurrently
- Mixed-precision training (FP16/BF16): Reduce GPU memory usage and communication volume
Relations to Other Topics
- See System Design for how throughput, capacity planning, and asynchronous execution appear at the system level
- See Cloud Services for GPU clusters, container orchestration, and cloud training infrastructure
- See Distributed Systems for the role of communication overhead, coordination, and fault tolerance
- See Architecture Overview for the hardware background behind software parallelism models
References
- Ananth Grama et al., Introduction to Parallel Computing, Addison-Wesley.
- PyTorch Distributed Overview
- NVIDIA NCCL Documentation
- DeepSpeed Documentation