
Parallel Computing

Parallel computing is a method of accelerating computation by decomposing tasks into multiple subtasks and executing them simultaneously across multiple processing units. It forms the theoretical and practical foundation of High-Performance Computing (HPC), directly underpinning modern scientific simulations, big data processing, and deep learning training.

Core questions include: how to effectively decompose tasks (the bottleneck described by Amdahl's Law), how to communicate and synchronize across processing units (MPI/OpenMP), how to leverage the massive parallelism of GPUs (CUDA), and how to apply these techniques to distributed deep learning training (data parallelism / model parallelism / pipeline parallelism). These notes systematically cover the core knowledge of parallel computing, from theoretical foundations to programming models to practical applications.


Theoretical Foundations of Parallel Computing

Amdahl's Law

Amdahl's Law describes the theoretical maximum speedup achievable by adding more processors for a fixed problem size.

Formula:

\[S(n) = \frac{1}{(1 - P) + \frac{P}{n}}\]

Where:

  • \(S(n)\): Speedup when using \(n\) processors
  • \(P\): The fraction of the program that is parallelizable (\(0 \le P \le 1\))
  • \(n\): Number of processors
  • \((1 - P)\): The fraction of the program that must be executed sequentially

Key Insight

When \(n \to \infty\), \(S = \frac{1}{1 - P}\). Even with infinitely many processors, the speedup is bounded by the sequential portion. For example, if \(P = 0.95\) (95% parallelizable), the maximum speedup is only \(\frac{1}{0.05} = 20\times\).

Speedup for different values of P and n:

| Parallelizable fraction P | n=2 | n=4 | n=8 | n=16 | n=64 | n=256 | n=∞ |
|---|---|---|---|---|---|---|---|
| 0.50 | 1.33 | 1.60 | 1.78 | 1.88 | 1.97 | 1.99 | 2.00 |
| 0.75 | 1.60 | 2.29 | 2.91 | 3.37 | 3.82 | 3.95 | 4.00 |
| 0.90 | 1.82 | 3.08 | 4.71 | 6.40 | 8.77 | 9.66 | 10.00 |
| 0.95 | 1.90 | 3.48 | 5.93 | 9.14 | 15.42 | 18.62 | 20.00 |
| 0.99 | 1.98 | 3.88 | 7.48 | 13.91 | 39.26 | 72.11 | 100.00 |
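
As a sanity check, a minimal C program can reproduce the table directly from the formula:

#include <stdio.h>

// Amdahl's Law: speedup for parallel fraction p on n processors
double amdahl(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void) {
    double fractions[] = {0.50, 0.75, 0.90, 0.95, 0.99};
    int procs[] = {2, 4, 8, 16, 64, 256};

    for (int i = 0; i < 5; i++) {
        printf("P=%.2f:", fractions[i]);
        for (int j = 0; j < 6; j++)
            printf(" %6.2f", amdahl(fractions[i], procs[j]));
        // Limit as n -> infinity: 1 / (1 - P)
        printf("  (limit %.2f)\n", 1.0 / (1.0 - fractions[i]));
    }
    return 0;
}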

Implication

The sequential portion is the ultimate bottleneck for parallel speedup. When optimizing parallel programs, the priority should be reducing the sequential fraction rather than blindly adding more processors.

Gustafson's Law

Gustafson's Law considers parallel speedup from a different perspective: with a fixed execution time, increasing the number of processors allows solving larger-scale problems.

Formula:

\[S(n) = n - \alpha(n - 1)\]

Where:

  • \(n\): Number of processors
  • \(\alpha\): The fraction of the program that is sequential
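
For example, with a sequential fraction \(\alpha = 0.05\) and \(n = 64\) processors:

\[S(64) = 64 - 0.05 \times (64 - 1) = 60.85\]

far above Amdahl's bound of 15.42 for the same 5% sequential fraction, because Gustafson assumes the parallel portion of the workload grows with \(n\).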

Amdahl vs Gustafson: Different Perspectives

  • Amdahl's Law: Fixed problem size, add processors → Strong Scaling
  • Gustafson's Law: Fixed execution time, increase problem size → Weak Scaling

| Dimension | Amdahl's Law | Gustafson's Law |
|---|---|---|
| Fixed quantity | Problem size | Execution time |
| Scaling type | Strong Scaling | Weak Scaling |
| Assumption | Problem size stays the same; add processors | More processors solve larger problems |
| Conclusion | Speedup has an upper bound | Speedup can grow linearly with n |
| Applicable scenarios | Real-time systems, latency-sensitive tasks | Scientific simulations, large-scale data processing |

Performance Metrics

  • Speedup: \(S(n) = \frac{T_1}{T_n}\), where \(T_1\) is the sequential execution time and \(T_n\) is the parallel execution time with \(n\) processors
  • Efficiency: \(E(n) = \frac{S(n)}{n}\); ideally \(E = 1\) (i.e., 100%), but in practice \(E < 1\)
  • Scalability: Whether a program can maintain good efficiency as the number of processors increases

| Metric | Ideal value | Meaning |
|---|---|---|
| Speedup | \(S(n) = n\) | Linear speedup: doubling processors doubles speed |
| Efficiency | \(E = 1\) | Every processor is fully utilized |
| Scalability | Efficiency does not decrease with n | The system can efficiently utilize more resources |
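
A quick worked example: if a program runs in \(T_1 = 100\,\mathrm{s}\) sequentially and \(T_8 = 20\,\mathrm{s}\) on 8 processors, then \(S(8) = 100/20 = 5\) and \(E(8) = 5/8 = 0.625\), i.e., 62.5% efficiency.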

Parallel Programming Models

Shared Memory Model

Characteristics: Multiple threads share the same address space and communicate by reading and writing shared variables.

┌─────────────────────────────────────┐
│         Shared Memory Space         │
│   ┌───┐  ┌───┐  ┌───┐  ┌───┐      │
│   │ A │  │ B │  │ C │  │ D │ ...   │
│   └───┘  └───┘  └───┘  └───┘      │
└────┬───────┬───────┬───────┬───────┘
     │       │       │       │
  Thread0  Thread1 Thread2 Thread3

OpenMP

OpenMP is a shared-memory parallel programming API based on compiler directives. It achieves parallelization through #pragma directives and is straightforward to use.

Vector addition example:

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main() {
    static double a[N], b[N], c[N];   // static storage: ~24 MB would overflow the stack as plain locals

    // Initialization
    for (int i = 0; i < N; i++) {
        a[i] = i * 1.0;
        b[i] = i * 2.0;
    }

    // OpenMP parallel vector addition
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
    }

    printf("c[0] = %f, c[N-1] = %f\n", c[0], c[N-1]);
    return 0;
}

Compilation

gcc -fopenmp vector_add.c -o vector_add

Common OpenMP directives:

| Directive | Function |
|---|---|
| #pragma omp parallel | Create a parallel region |
| #pragma omp parallel for | Parallelize a for loop |
| #pragma omp critical | Critical section: only one thread may execute at a time |
| #pragma omp atomic | Atomic operation |
| #pragma omp barrier | Synchronization barrier |
| reduction(+:sum) clause | Reduction: combine per-thread partial results (used with parallel/for) |
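
As an illustration of the reduction clause, a minimal parallel dot-product sketch:

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double a[N], b[N];   // static to avoid stack overflow
    for (int i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

    double sum = 0.0;
    // Each thread accumulates a private partial sum;
    // OpenMP combines them safely at the end of the loop
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i] * b[i];

    printf("dot = %f (expected %f)\n", sum, 2.0 * N);
    return 0;
}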

Pros and cons:

| Pros | Cons |
|---|---|
| Simple to program; incremental parallelization | Limited to a single machine (single node) |
| Minimal code changes via compiler directives | Must guard against data races |
| Supports C/C++/Fortran | Scalability limited by the number of cores on a single machine |

Distributed Memory Model

Characteristics: Each node has its own independent memory; nodes communicate via message passing.

┌──────────┐  Message Passing  ┌──────────┐
│  Node 0  │◄────────────────►│  Node 1  │
│ ┌──────┐ │    (Network)      │ ┌──────┐ │
│ │Memory│ │                   │ │Memory│ │
│ └──────┘ │                   │ └──────┘ │
└──────────┘                   └──────────┘
      ▲                              ▲
      │       Message Passing        │
      ▼                              ▼
┌──────────┐                   ┌──────────┐
│  Node 2  │◄────────────────►│  Node 3  │
│ ┌──────┐ │                   │ ┌──────┐ │
│ │Memory│ │                   │ │Memory│ │
│ └──────┘ │                   │ └──────┘ │
└──────────┘                   └──────────┘

MPI (Message Passing Interface)

MPI is the standard interface for distributed-memory parallel programming, suitable for large-scale parallel computing across nodes.

Basic operations:

| Function | Purpose |
|---|---|
| MPI_Init | Initialize the MPI environment |
| MPI_Comm_rank | Get the rank of the current process |
| MPI_Comm_size | Get the total number of processes |
| MPI_Send | Send a message (blocking) |
| MPI_Recv | Receive a message (blocking) |
| MPI_Bcast | Broadcast (one-to-all) |
| MPI_Reduce | Reduce (all-to-one) |
| MPI_Allreduce | All-reduce (reduce, with the result delivered to all) |
| MPI_Finalize | Finalize the MPI environment |

MPI vector addition example:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define N 1000000

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int local_n = N / size;   // assumes size divides N evenly (remainder handling omitted)
    double *local_a = malloc(local_n * sizeof(double));
    double *local_b = malloc(local_n * sizeof(double));
    double *local_c = malloc(local_n * sizeof(double));

    // Each process initializes its own data chunk
    for (int i = 0; i < local_n; i++) {
        local_a[i] = (rank * local_n + i) * 1.0;
        local_b[i] = (rank * local_n + i) * 2.0;
    }

    // Each process computes its own portion
    for (int i = 0; i < local_n; i++) {
        local_c[i] = local_a[i] + local_b[i];
    }

    if (rank == 0) {
        printf("Process 0: local_c[0] = %f\n", local_c[0]);
    }

    free(local_a); free(local_b); free(local_c);
    MPI_Finalize();
    return 0;
}

Compilation and Execution

mpicc vector_add_mpi.c -o vector_add_mpi
mpirun -np 4 ./vector_add_mpi
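
Building on the table of basic operations, a minimal self-contained sketch of MPI_Reduce: each process contributes its rank, and rank 0 receives the sum.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size, sum = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Sum every process's rank onto rank 0
    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)   // expect size*(size-1)/2, e.g. 6 for -np 4
        printf("sum of ranks = %d\n", sum);

    MPI_Finalize();
    return 0;
}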

Pros and cons:

| Pros | Cons |
|---|---|
| Cross-node capable with strong scalability | Complex programming; explicit communication management required |
| Suitable for large-scale clusters | Communication overhead can become a bottleneck |
| No shared-memory data races | Difficult to debug |

GPU Parallelism (CUDA)

CUDA Programming Model

CUDA (Compute Unified Device Architecture) is NVIDIA's general-purpose GPU computing programming model.

Hierarchy:

Grid
├── Block (0,0)          Block (1,0)          Block (2,0)
│   ├── Thread (0,0)     ├── Thread (0,0)     ├── ...
│   ├── Thread (1,0)     ├── Thread (1,0)     │
│   ├── Thread (0,1)     ├── Thread (0,1)     │
│   └── ...              └── ...              └── ...
├── Block (0,1)          Block (1,1)          Block (2,1)
│   └── ...              └── ...              └── ...

  • Grid: Composed of multiple Blocks; corresponds to a single kernel launch
  • Block: Composed of multiple Threads; threads within a Block can cooperate
  • Thread: The smallest unit of execution

Kernel Functions

A kernel is a function executed on the GPU, declared with the __global__ qualifier and launched using the <<<grid, block>>> syntax.

GPU Memory Hierarchy

| Memory Type | Speed | Size | Scope | Lifetime |
|---|---|---|---|---|
| Registers | Fastest | Very small | Single thread | Thread |
| Shared Memory | Very fast | Small (~48 KB/Block) | All threads within a Block | Block |
| Global Memory | Slower | Large (GB-scale) | All threads | Managed by host |
| Constant Memory | Fast (cached) | 64 KB | All threads (read-only) | Managed by host |
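
To make the shared-memory row concrete, the classic pattern is a block-level tree reduction staged through __shared__ memory. A minimal sketch, assuming the block size is a power of two:

// Each block sums up to 256 of its input elements in fast shared
// memory, then writes one partial sum to global memory.
__global__ void block_sum(const float *in, float *partial, int n) {
    __shared__ float buf[256];           // one slot per thread in the block
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;

    buf[tid] = (idx < n) ? in[idx] : 0.0f;
    __syncthreads();                     // make all loads visible block-wide

    // Tree reduction within the block (blockDim.x must be a power of two)
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        partial[blockIdx.x] = buf[0];
}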

CUDA Vector Addition Example

#include <stdio.h>
#include <cuda_runtime.h>

#define N 1000000

// Kernel: each thread computes one element
__global__ void vector_add(float *a, float *b, float *c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

int main() {
    float *h_a, *h_b, *h_c;           // Host (CPU) memory
    float *d_a, *d_b, *d_c;           // Device (GPU) memory

    size_t bytes = N * sizeof(float);

    // Allocate host memory and initialize
    h_a = (float*)malloc(bytes);
    h_b = (float*)malloc(bytes);
    h_c = (float*)malloc(bytes);
    for (int i = 0; i < N; i++) {
        h_a[i] = i * 1.0f;
        h_b[i] = i * 2.0f;
    }

    // Allocate device memory
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    // Copy data from host to device
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch kernel
    int blockSize = 256;
    int gridSize = (N + blockSize - 1) / blockSize;
    vector_add<<<gridSize, blockSize>>>(d_a, d_b, d_c, N);

    // Copy result from device back to host
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    printf("c[0] = %f, c[N-1] = %f\n", h_c[0], h_c[N-1]);

    // Free memory
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}

Compilation

nvcc vector_add.cu -o vector_add

GPU vs CPU: Use Case Comparison

| Dimension | CPU | GPU |
|---|---|---|
| Core count | Few (single digits to tens) | Many (thousands) |
| Single-core performance | Strong (complex logic) | Weak (simple operations) |
| Parallelism granularity | Coarse-grained | Fine-grained (SIMT) |
| Suitable tasks | Control logic, branch-heavy workloads | Data-parallel, compute-intensive workloads |
| Typical applications | OS scheduling, compilation, databases | Matrix operations, image processing, deep learning |
| Memory bandwidth | Lower | Very high (HBM) |
| Latency | Low (fast for single tasks) | High (significant launch overhead) |

Selection Guidelines

  • Tasks involving many independent, identical computations → use GPU
  • Tasks involving complex branching logic and dependencies → use CPU
  • Best practice: CPU + GPU heterogeneous cooperation

Parallel Algorithm Design

Common Parallel Patterns

Master-Worker Pattern

A single Master process handles task distribution and result collection, while multiple Worker processes perform the actual computation.

         ┌──────────┐
         │  Master  │
         │Task Alloc│
         └──┬──┬──┬─┘
            │  │  │
     ┌──────┘  │  └──────┐
     ▼         ▼         ▼
┌────────┐┌────────┐┌────────┐
│Worker 0││Worker 1││Worker 2│
│Compute ││Compute ││Compute │
└───┬────┘└───┬────┘└───┬────┘
    │         │         │
    └────┐    │    ┌────┘
         ▼    ▼    ▼
       ┌──────────┐
       │  Master  │
       │ Collect  │
       └──────────┘
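
A minimal MPI sketch of the pattern, where the squaring "computation" is a stand-in for real work:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                       // Master: distribute, then collect
        for (int w = 1; w < size; w++) {
            int task = w, result;
            MPI_Send(&task, 1, MPI_INT, w, 0, MPI_COMM_WORLD);
            MPI_Recv(&result, 1, MPI_INT, w, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("Result from worker %d: %d\n", w, result);
        }
    } else {                               // Worker: receive, compute, reply
        int task, result;
        MPI_Recv(&task, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        result = task * task;              // stand-in for real computation
        MPI_Send(&result, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}

A production master would overlap sends and receives (or use nonblocking calls) rather than serving workers one at a time as this sketch does.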

MapReduce

Inspired by functional programming, computation is divided into two phases: Map and Reduce. This is the core model for big data processing (e.g., Hadoop, Spark).

Input Data → [Map] → Intermediate Key-Value Pairs → [Shuffle] → [Reduce] → Output

Pipeline

Data flows sequentially through multiple processing stages, each executed by a different processor -- analogous to a factory assembly line.

Data Flow →  [Stage 1] → [Stage 2] → [Stage 3] → [Stage 4] → Output
             Processor0   Processor1   Processor2   Processor3

Divide and Conquer

The problem is recursively split into subproblems, which are solved in parallel, and the results are then merged. Classic examples include parallel merge sort and parallel quicksort.
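
A minimal divide-and-conquer sketch using OpenMP tasks, here a recursive parallel sum; the CUTOFF threshold is an illustrative choice:

#include <stdio.h>
#include <omp.h>

#define CUTOFF 10000   // below this size, recursion stops paying off

// Recursively split [lo, hi) and sum the halves in parallel tasks
long long psum(const int *a, int lo, int hi) {
    if (hi - lo < CUTOFF) {
        long long s = 0;
        for (int i = lo; i < hi; i++) s += a[i];
        return s;
    }
    int mid = lo + (hi - lo) / 2;
    long long left, right;
    #pragma omp task shared(left)
    left = psum(a, lo, mid);
    right = psum(a, mid, hi);          // current task handles one half itself
    #pragma omp taskwait               // wait for the spawned task
    return left + right;
}

int main(void) {
    enum { N2 = 1000000 };
    static int a[N2];
    for (int i = 0; i < N2; i++) a[i] = 1;

    long long total;
    #pragma omp parallel
    #pragma omp single                 // one thread starts the recursion
    total = psum(a, 0, N2);

    printf("sum = %lld\n", total);     // expect 1000000
    return 0;
}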

Communication Patterns

Point-to-Point

Direct message send/receive between two processes.

Process 0  ──── data ────►  Process 1

Collective Communication

Communication operations involving a group of processes; this is the most commonly used communication pattern in parallel computing.

Broadcast: One process sends data to all processes.

P0 [A]  ──►  P0 [A]
             P1 [A]
             P2 [A]
             P3 [A]

Scatter: One process partitions its data and distributes the chunks to each process.

P0 [A B C D]  ──►  P0 [A]
                    P1 [B]
                    P2 [C]
                    P3 [D]

Gather: Each process sends its data to a single process, which collects them all.

P0 [A]  ──►  P0 [A B C D]
P1 [B]
P2 [C]
P3 [D]

Reduce: Applies an operation (e.g., summation) to data from all processes, storing the result in one process.

P0 [1]  ──►  P0 [1+2+3+4 = 10]
P1 [2]       (op = SUM)
P2 [3]
P3 [4]

AllReduce: Performs a Reduce and delivers the result to every process; logically equivalent to a Reduce followed by a Broadcast.

P0 [1]  ──►  P0 [10]
P1 [2]       P1 [10]
P2 [3]       P2 [10]
P3 [4]       P3 [10]

Importance of AllReduce

AllReduce is the most critical communication operation in distributed deep learning training, used to synchronize gradients across GPUs. Common implementations include Ring AllReduce and Tree AllReduce.


Synchronization and Load Balancing

Barrier Synchronization

All threads/processes must reach the barrier before any can proceed, ensuring that all parallel branches are aligned at a given point.

Thread 0:  ──work──  │  ──work──
Thread 1:  ──work────│  ──work──
Thread 2:  ──work──  │  ──work──
                  Barrier

Locks and Atomic Operations

  • Mutex: Protects a critical section so that only one thread can access a shared resource at a time
  • Spinlock: A thread busy-waits until it acquires the lock; suitable when the lock is held for short durations
  • Atomic Operation: A hardware-level indivisible operation, such as atomic_add or compare_and_swap
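
A minimal sketch of an atomic update that would otherwise be a data race:

#include <stdio.h>
#include <omp.h>

int main(void) {
    long counter = 0;

    #pragma omp parallel for
    for (int i = 0; i < 1000000; i++) {
        // Without protection, counter++ would be a data race and the
        // final value would be nondeterministic.
        #pragma omp atomic
        counter++;
    }

    printf("counter = %ld\n", counter);   // always 1000000 with atomic
    return 0;
}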

Caveats

  • Deadlock: Two or more threads each waiting for the other to release a lock
  • Data Race: Multiple threads simultaneously reading and writing the same variable, leading to nondeterministic results
  • Excessive synchronization can severely degrade parallel performance; minimize lock granularity and hold time

Load Balancing

| Dimension | Static Load Balancing | Dynamic Load Balancing |
|---|---|---|
| Allocation timing | Pre-assigned before execution | Adjusted dynamically at runtime based on load |
| Implementation complexity | Simple | More complex |
| Applicable scenarios | Uniform, predictable workloads | Non-uniform, unpredictable workloads |
| Overhead | Low | Scheduling overhead |
| Typical methods | Uniform partitioning, round-robin | Work stealing, task queues |

Impact of Load Imbalance

If workloads vary significantly across processors, faster processors must idle while waiting for slower ones, so the overall execution time is set by the slowest processor (the straggler problem: the weakest link determines the pace). This significantly reduces parallel efficiency.
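
In OpenMP, the two strategies map onto the schedule clause. A minimal sketch with deliberately uneven per-iteration cost:

#include <stdio.h>
#include <omp.h>

// Simulated task whose cost grows with i, so a static split would
// leave early threads idle while late threads are still working
double work(int i) {
    double x = 0.0;
    for (int k = 0; k < i * 100; k++) x += k * 1e-9;
    return x;
}

int main(void) {
    double total = 0.0;
    // schedule(dynamic, 16): threads grab 16-iteration chunks from a
    // shared queue as they finish, a simple form of dynamic balancing
    #pragma omp parallel for schedule(dynamic, 16) reduction(+:total)
    for (int i = 0; i < 10000; i++)
        total += work(i);

    printf("total = %f\n", total);
    return 0;
}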


Applications of Parallel Computing in Deep Learning

Modern deep learning models (such as LLMs) are so large that a single GPU cannot complete training on its own. Distributed parallel training has become an essential technique.

Data Parallelism

Each GPU holds a complete copy of the model and processes different data batches (mini-batches). Gradients are then synchronized via AllReduce.

GPU 0: Model Copy + Data Batch 0  → Gradient 0 ─┐
GPU 1: Model Copy + Data Batch 1  → Gradient 1 ─┤ AllReduce → Avg Gradient → Update Model
GPU 2: Model Copy + Data Batch 2  → Gradient 2 ─┤
GPU 3: Model Copy + Data Batch 3  → Gradient 3 ─┘

  • Pros: Simple to implement; fixed communication pattern
  • Cons: Each GPU must be able to hold the entire model
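
The gradient synchronization step maps naturally onto MPI_Allreduce. A minimal sketch of data-parallel gradient averaging, with one process standing in for each GPU and dummy gradient values:

#include <stdio.h>
#include <mpi.h>

#define NPARAMS 4

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Each worker computes local gradients from its own mini-batch
    // (here: dummy values that depend on the rank)
    double grad[NPARAMS], avg[NPARAMS];
    for (int i = 0; i < NPARAMS; i++)
        grad[i] = (double)(rank + 1);

    // Sum gradients across all workers; every worker gets the result
    MPI_Allreduce(grad, avg, NPARAMS, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    for (int i = 0; i < NPARAMS; i++)
        avg[i] /= size;                 // average, then apply the update

    if (rank == 0)
        printf("avg gradient[0] = %f\n", avg[0]);
    MPI_Finalize();
    return 0;
}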

Model Parallelism

Different layers or components of the model are distributed across different GPUs. This is suitable when a single GPU cannot accommodate the entire model.

GPU 0: Layer 1-10
GPU 1: Layer 11-20
GPU 2: Layer 21-30
GPU 3: Layer 31-40

  • Pros: Enables training of very large models
  • Cons: Data dependencies between GPUs can lead to idle GPUs (the bubble problem)

Pipeline Parallelism

After partitioning the model by layers, different micro-batches are fed through the stages in a pipelined fashion, reducing GPU idle time.

Time →    t0    t1    t2    t3    t4    t5
GPU 0:   [MB0] [MB1] [MB2] [MB3]
GPU 1:         [MB0] [MB1] [MB2] [MB3]
GPU 2:               [MB0] [MB1] [MB2] [MB3]
GPU 3:                     [MB0] [MB1] [MB2] [MB3]

The Bubble Problem

During the startup and teardown phases of the pipeline, GPUs are idle -- these idle periods are called "bubbles." Increasing the number of micro-batches reduces the bubble ratio. GPipe and PipeDream are well-known pipeline parallelism frameworks.

Hybrid Parallelism Strategies

In practice, large-scale training typically combines multiple parallelism strategies:

| Parallelism Strategy | Partitioning Dimension | Typical Frameworks |
|---|---|---|
| Data Parallelism | Data (Batch) | PyTorch DDP, Horovod |
| Model Parallelism (Tensor Parallelism) | Intra-layer (weight matrices) | Megatron-LM |
| Pipeline Parallelism | Inter-layer | GPipe, PipeDream |
| Hybrid Parallelism (3D Parallelism) | Data + intra-layer + inter-layer simultaneously | Megatron-LM + DeepSpeed |

Common Frameworks

  • PyTorch DDP (DistributedDataParallel): The standard approach for data parallelism
  • DeepSpeed (Microsoft): Supports ZeRO optimization, significantly reducing GPU memory usage
  • Megatron-LM (NVIDIA): A tensor parallelism solution designed specifically for very large Transformer models
  • FSDP (Fully Sharded Data Parallel): PyTorch's native model-sharding approach

The Role of AllReduce in Distributed Training

In data-parallel training, each iteration involves the following key steps:

  1. Each GPU performs forward propagation with its own data and computes the loss
  2. Each GPU performs backpropagation to obtain local gradients
  3. AllReduce averages the gradients across all GPUs
  4. Each GPU updates the model parameters using the averaged gradients

Ring AllReduce is the most widely used implementation: \(n\) GPUs are arranged in a ring, and the all-reduce operation completes in \(2(n-1)\) communication steps. The per-GPU communication volume is nearly independent of the number of GPUs (it depends essentially only on the gradient size), providing good scalability.
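
To see why, note that for a gradient of \(K\) elements, Ring AllReduce performs \(n-1\) reduce-scatter steps followed by \(n-1\) all-gather steps, each transferring a chunk of \(K/n\) elements per GPU:

\[\text{data sent per GPU} = 2(n-1)\cdot\frac{K}{n} \approx 2K\]

which is approximately constant as \(n\) grows.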

      GPU 0
     ╱     ╲
   GPU 3   GPU 1
     ╲     ╱
      GPU 2

Each GPU communicates only with its neighbors; AllReduce completes in 2(n-1) steps

Optimization Directions

  • Gradient compression: Reduce communication data volume
  • Overlapping computation and communication: While backpropagation computes gradients, layers that have already finished can begin AllReduce concurrently
  • Mixed-precision training (FP16/BF16): Reduce GPU memory usage and communication volume

Relations to Other Topics

  • See System Design for how throughput, capacity planning, and asynchronous execution appear at the system level
  • See Cloud Services for GPU clusters, container orchestration, and cloud training infrastructure
  • See Distributed Systems for the role of communication overhead, coordination, and fault tolerance
  • See Architecture Overview for the hardware background behind software parallelism models
