Parallel Computing
Parallel computing is a method of accelerating computation by decomposing tasks into multiple subtasks and executing them simultaneously across multiple processing units. It forms the theoretical and practical foundation of High-Performance Computing (HPC), directly underpinning modern scientific simulations, big data processing, and deep learning training.
Core questions include: how to effectively decompose tasks (the bottleneck described by Amdahl's Law), how to communicate and synchronize across processing units (MPI/OpenMP), how to leverage the massive parallelism of GPUs (CUDA), and how to apply these techniques to distributed deep learning training (data parallelism / model parallelism / pipeline parallelism). These notes systematically cover the core knowledge of parallel computing, from theoretical foundations to programming models to practical applications.
Theoretical Foundations of Parallel Computing
Amdahl's Law
Amdahl's Law describes the theoretical maximum speedup achievable by adding more processors for a fixed problem size.
Formula:

\[S(n) = \frac{1}{(1 - P) + \frac{P}{n}}\]
Where:
- \(S(n)\): Speedup when using \(n\) processors
- \(P\): The fraction of the program that is parallelizable (\(0 \le P \le 1\))
- \(n\): Number of processors
- \((1 - P)\): The fraction of the program that must be executed sequentially
Key Insight
When \(n \to \infty\), \(S = \frac{1}{1 - P}\). Even with infinitely many processors, the speedup is bounded by the sequential portion. For example, if \(P = 0.95\) (95% parallelizable), the maximum speedup is only \(\frac{1}{0.05} = 20\times\).
Speedup for different values of P and n:
| Parallelizable fraction P | n=2 | n=4 | n=8 | n=16 | n=64 | n=256 | n=∞ |
|---|---|---|---|---|---|---|---|
| 0.50 | 1.33 | 1.60 | 1.78 | 1.88 | 1.97 | 1.99 | 2.00 |
| 0.75 | 1.60 | 2.29 | 2.91 | 3.37 | 3.76 | 3.94 | 4.00 |
| 0.90 | 1.82 | 3.08 | 4.71 | 6.40 | 8.77 | 9.61 | 10.00 |
| 0.95 | 1.90 | 3.48 | 5.93 | 9.14 | 15.42 | 18.28 | 20.00 |
| 0.99 | 1.98 | 3.88 | 7.48 | 13.91 | 39.26 | 72.12 | 100.00 |
Implication
The sequential portion is the ultimate bottleneck for parallel speedup. When optimizing parallel programs, the priority should be reducing the sequential fraction rather than blindly adding more processors.
Gustafson's Law
Gustafson's Law considers parallel speedup from a different perspective: with a fixed execution time, increasing the number of processors allows solving larger-scale problems.
Formula:

\[S(n) = n - \alpha (n - 1)\]
Where:
- \(n\): Number of processors
- \(\alpha\): The fraction of the program that is sequential
Amdahl vs Gustafson: Different Perspectives
- Amdahl's Law: Fixed problem size, add processors → Strong Scaling
- Gustafson's Law: Fixed execution time, increase problem size → Weak Scaling
| Dimension | Amdahl's Law | Gustafson's Law |
|---|---|---|
| Fixed quantity | Problem size | Execution time |
| Scaling type | Strong Scaling | Weak Scaling |
| Assumption | Problem size stays the same; add processors | More processors solve larger problems |
| Conclusion | Speedup has an upper bound | Speedup can grow linearly with n |
| Applicable scenarios | Real-time systems, latency-sensitive tasks | Scientific simulations, large-scale data processing |
Performance Metrics
- Speedup: \(S(n) = \frac{T_1}{T_n}\), where \(T_1\) is the sequential execution time and \(T_n\) is the parallel execution time with \(n\) processors
- Efficiency: \(E(n) = \frac{S(n)}{n}\); ideally \(E = 1\) (i.e., 100%), but in practice \(E < 1\)
- Scalability: Whether a program can maintain good efficiency as the number of processors increases
| Metric | Ideal value | Meaning |
|---|---|---|
| Speedup | \(S(n) = n\) | Linear speedup: doubling processors doubles speed |
| Efficiency | \(E = 1\) | Every processor is fully utilized |
| Scalability | Efficiency does not decrease with n | The system can efficiently utilize more resources |
Parallel Programming Models
Shared Memory Model
Characteristics: Multiple threads share the same address space and communicate by reading and writing shared variables.
```
┌─────────────────────────────────────┐
│         Shared Memory Space         │
│   ┌───┐   ┌───┐   ┌───┐   ┌───┐     │
│   │ A │   │ B │   │ C │   │ D │ ... │
│   └───┘   └───┘   └───┘   └───┘     │
└────┬───────┬───────┬───────┬────────┘
     │       │       │       │
 Thread0 Thread1 Thread2 Thread3
```
OpenMP
OpenMP is a shared-memory parallel programming API based on compiler directives. It achieves parallelization through `#pragma` directives and is straightforward to use.
Vector addition example:
```c
#include <stdio.h>
#include <omp.h>

#define N 1000000

// Static storage: three arrays of N doubles would overflow a
// typical stack if declared inside main
static double a[N], b[N], c[N];

int main() {
    // Initialization
    for (int i = 0; i < N; i++) {
        a[i] = i * 1.0;
        b[i] = i * 2.0;
    }

    // OpenMP parallel vector addition
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
    }

    printf("c[0] = %f, c[N-1] = %f\n", c[0], c[N-1]);
    return 0;
}
```
Compilation
```bash
gcc -fopenmp vector_add.c -o vector_add
```
Common OpenMP directives:
| Directive | Function |
|---|---|
| `#pragma omp parallel` | Create a parallel region |
| `#pragma omp parallel for` | Parallelize a `for` loop |
| `#pragma omp critical` | Critical section: only one thread may execute at a time |
| `#pragma omp atomic` | Atomic operation |
| `#pragma omp barrier` | Synchronization barrier |
| `#pragma omp parallel for reduction(+:sum)` | Reduction clause (here: sum) |
Pros and cons:
| Pros | Cons |
|---|---|
| Simple to program; incremental parallelization | Limited to a single machine (single node) |
| Minimal code changes via compiler directives | Must guard against data races |
| Supports C/C++/Fortran | Scalability limited by the number of cores on a single machine |
Distributed Memory Model
Characteristics: Each node has its own independent memory; nodes communicate via message passing.
```
┌──────────┐   Message Passing   ┌──────────┐
│  Node 0  │◄───────────────────►│  Node 1  │
│ ┌──────┐ │      (Network)      │ ┌──────┐ │
│ │Memory│ │                     │ │Memory│ │
│ └──────┘ │                     │ └──────┘ │
└──────────┘                     └──────────┘
      ▲                                ▲
      │        Message Passing         │
      ▼                                ▼
┌──────────┐                     ┌──────────┐
│  Node 2  │◄───────────────────►│  Node 3  │
│ ┌──────┐ │                     │ ┌──────┐ │
│ │Memory│ │                     │ │Memory│ │
│ └──────┘ │                     │ └──────┘ │
└──────────┘                     └──────────┘
```
MPI (Message Passing Interface)
MPI is the standard interface for distributed-memory parallel programming, suitable for large-scale parallel computing across nodes.
Basic operations:
| Function | Purpose |
|---|---|
| `MPI_Init` | Initialize the MPI environment |
| `MPI_Comm_rank` | Get the rank of the current process |
| `MPI_Comm_size` | Get the total number of processes |
| `MPI_Send` | Send a message (blocking) |
| `MPI_Recv` | Receive a message (blocking) |
| `MPI_Bcast` | Broadcast (one-to-all) |
| `MPI_Reduce` | Reduce (all-to-one) |
| `MPI_Allreduce` | All-reduce (reduce, then deliver the result to all processes) |
| `MPI_Finalize` | Finalize the MPI environment |
MPI vector addition example:
```c
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define N 1000000

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int local_n = N / size;  // assumes size divides N evenly
    double *local_a = malloc(local_n * sizeof(double));
    double *local_b = malloc(local_n * sizeof(double));
    double *local_c = malloc(local_n * sizeof(double));

    // Each process initializes its own data chunk
    for (int i = 0; i < local_n; i++) {
        local_a[i] = (rank * local_n + i) * 1.0;
        local_b[i] = (rank * local_n + i) * 2.0;
    }

    // Each process computes its own portion
    for (int i = 0; i < local_n; i++) {
        local_c[i] = local_a[i] + local_b[i];
    }

    if (rank == 0) {
        printf("Process 0: local_c[0] = %f\n", local_c[0]);
    }

    free(local_a); free(local_b); free(local_c);
    MPI_Finalize();
    return 0;
}
```
Compilation and Execution
```bash
mpicc vector_add_mpi.c -o vector_add_mpi
mpirun -np 4 ./vector_add_mpi
```
Pros and cons:
| Pros | Cons |
|---|---|
| Cross-node capable with strong scalability | Complex programming; explicit communication management required |
| Suitable for large-scale clusters | Communication overhead can become a bottleneck |
| No data race issues | Difficult to debug |
GPU Parallelism (CUDA)
CUDA Programming Model
CUDA (Compute Unified Device Architecture) is NVIDIA's general-purpose GPU computing programming model.
Hierarchy:
```
Grid
├── Block (0,0)          Block (1,0)          Block (2,0)
│     ├── Thread (0,0)     ├── Thread (0,0)     ├── ...
│     ├── Thread (1,0)     ├── Thread (1,0)     │
│     ├── Thread (0,1)     ├── Thread (0,1)     │
│     └── ...              └── ...              └── ...
├── Block (0,1)          Block (1,1)          Block (2,1)
│     └── ...              └── ...              └── ...
```
- Grid: Composed of multiple Blocks; corresponds to a single kernel launch
- Block: Composed of multiple Threads; threads within a Block can cooperate
- Thread: The smallest unit of execution
Kernel Functions
A kernel is a function executed on the GPU, declared with the `__global__` qualifier and launched using the `<<<grid, block>>>` syntax.
GPU Memory Hierarchy
| Memory Type | Speed | Size | Scope | Lifetime |
|---|---|---|---|---|
| Registers | Fastest | Very small | Single Thread | Thread |
| Shared Memory | Very fast | Small (~48KB/Block) | All Threads within a Block | Block |
| Global Memory | Slower | Large (GB-scale) | All Threads | Managed by host |
| Constant Memory | Fast (cached) | 64KB | All Threads (read-only) | Managed by host |
CUDA Vector Addition Example
```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define N 1000000

// Kernel: each thread computes one element
__global__ void vector_add(float *a, float *b, float *c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

int main() {
    float *h_a, *h_b, *h_c;  // Host (CPU) memory
    float *d_a, *d_b, *d_c;  // Device (GPU) memory
    size_t bytes = N * sizeof(float);

    // Allocate host memory and initialize
    h_a = (float*)malloc(bytes);
    h_b = (float*)malloc(bytes);
    h_c = (float*)malloc(bytes);
    for (int i = 0; i < N; i++) {
        h_a[i] = i * 1.0f;
        h_b[i] = i * 2.0f;
    }

    // Allocate device memory
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    // Copy data from host to device
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch kernel: enough blocks to cover all N elements
    int blockSize = 256;
    int gridSize = (N + blockSize - 1) / blockSize;
    vector_add<<<gridSize, blockSize>>>(d_a, d_b, d_c, N);

    // Copy result from device back to host (implicitly synchronizes)
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f, c[N-1] = %f\n", h_c[0], h_c[N-1]);

    // Free memory
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```
Compilation
```bash
nvcc vector_add.cu -o vector_add
```
GPU vs CPU: Use Case Comparison
| Dimension | CPU | GPU |
|---|---|---|
| Core count | Few (single digits to tens) | Many (thousands) |
| Single-core performance | Strong (complex logic) | Weak (simple operations) |
| Parallelism granularity | Coarse-grained | Fine-grained (SIMT) |
| Suitable tasks | Control logic, branch-heavy workloads | Data-parallel, compute-intensive workloads |
| Typical applications | OS scheduling, compilation, databases | Matrix operations, image processing, deep learning |
| Memory bandwidth | Lower | Very high (HBM) |
| Latency | Low (fast for single tasks) | High (significant launch overhead) |
Selection Guidelines
- Tasks involving many independent, identical computations → use GPU
- Tasks involving complex branching logic and dependencies → use CPU
- Best practice: CPU + GPU heterogeneous cooperation
Parallel Algorithm Design
Common Parallel Patterns
Master-Worker Pattern
A single Master process handles task distribution and result collection, while multiple Worker processes perform the actual computation.
```
            ┌──────────┐
            │  Master  │
            │Task Alloc│
            └──┬──┬──┬─┘
               │  │  │
        ┌──────┘  │  └──────┐
        ▼         ▼         ▼
   ┌────────┐┌────────┐┌────────┐
   │Worker 0││Worker 1││Worker 2│
   │Compute ││Compute ││Compute │
   └───┬────┘└───┬────┘└───┬────┘
       │         │         │
       └────┐    │    ┌────┘
            ▼    ▼    ▼
            ┌──────────┐
            │  Master  │
            │ Collect  │
            └──────────┘
```
MapReduce
Inspired by functional programming, computation is divided into two phases: Map and Reduce. This is the core model for big data processing (e.g., Hadoop, Spark).
Input Data → [Map] → Intermediate Key-Value Pairs → [Shuffle] → [Reduce] → Output
Pipeline
Data flows sequentially through multiple processing stages, each executed by a different processor -- analogous to a factory assembly line.
```
Data Flow → [Stage 1] → [Stage 2] → [Stage 3] → [Stage 4] → Output
            Processor0   Processor1   Processor2   Processor3
```
Divide and Conquer
The problem is recursively split into subproblems, which are solved in parallel, and the results are then merged. Classic examples include parallel merge sort and parallel quicksort.
Communication Patterns
Point-to-Point
Direct message send/receive between two processes.
Process 0 ──── data ────► Process 1
Collective Communication
Communication operations involving a group of processes; this is the most commonly used communication pattern in parallel computing.
Broadcast: One process sends data to all processes.
```
P0 [A] ──►  P0 [A]
            P1 [A]
            P2 [A]
            P3 [A]
```

Scatter: One process partitions its data and distributes the chunks to each process.

```
P0 [A B C D] ──►  P0 [A]
                  P1 [B]
                  P2 [C]
                  P3 [D]
```

Gather: Each process sends its data to a single process, which collects them all.

```
P0 [A] ──►  P0 [A B C D]
P1 [B]
P2 [C]
P3 [D]
```

Reduce: Applies an operation (e.g., summation) to data from all processes, storing the result in one process.

```
P0 [1] ──►  P0 [1+2+3+4 = 10]
P1 [2]      (op = SUM)
P2 [3]
P3 [4]
```

AllReduce: Performs a Reduce and then broadcasts the result to all processes. Equivalent to Reduce + Broadcast.

```
P0 [1] ──►  P0 [10]
P1 [2]      P1 [10]
P2 [3]      P2 [10]
P3 [4]      P3 [10]
```
Importance of AllReduce
AllReduce is the most critical communication operation in distributed deep learning training, used to synchronize gradients across GPUs. Common implementations include Ring AllReduce and Tree AllReduce.
Synchronization and Load Balancing
Barrier Synchronization
All threads/processes must reach the barrier before any can proceed, ensuring that all parallel branches are aligned at a given point.
```
Thread 0: ──work──   │ ──work──
Thread 1: ──work─────│ ──work──
Thread 2: ──work──   │ ──work──
                  Barrier
```
Locks and Atomic Operations
- Mutex: Protects a critical section so that only one thread can access a shared resource at a time
- Spinlock: A thread busy-waits until it acquires the lock; suitable when the lock is held for short durations
- Atomic Operation: A hardware-level indivisible operation, such as `atomic_add` or `compare_and_swap`
Caveats
- Deadlock: Two or more threads each waiting for the other to release a lock
- Data Race: Multiple threads simultaneously reading and writing the same variable, leading to nondeterministic results
- Excessive synchronization can severely degrade parallel performance; minimize lock granularity and hold time
Load Balancing
| Strategy | Static Load Balancing | Dynamic Load Balancing |
|---|---|---|
| Allocation timing | Pre-assigned before execution | Adjusted dynamically at runtime based on load |
| Implementation complexity | Simple | More complex |
| Applicable scenarios | Uniform, predictable workloads | Non-uniform, unpredictable workloads |
| Overhead | Low | Scheduling overhead |
| Typical methods | Uniform partitioning, round-robin | Work stealing, task queues |
Impact of Load Imbalance
If workloads vary significantly across processors, faster processors must idle while waiting for slower ones, causing overall execution time to be determined by the slowest processor (the barrel effect / straggler problem). This significantly reduces parallel efficiency.
Applications of Parallel Computing in Deep Learning
Modern deep learning models (such as LLMs) are so large that a single GPU cannot complete training on its own. Distributed parallel training has become an essential technique.
Data Parallelism
Each GPU holds a complete copy of the model and processes different data batches (mini-batches). Gradients are then synchronized via AllReduce.
```
GPU 0: Model Copy + Data Batch 0 → Gradient 0 ─┐
GPU 1: Model Copy + Data Batch 1 → Gradient 1 ─┤ AllReduce → Avg Gradient → Update Model
GPU 2: Model Copy + Data Batch 2 → Gradient 2 ─┤
GPU 3: Model Copy + Data Batch 3 → Gradient 3 ─┘
```
- Pros: Simple to implement; fixed communication pattern
- Cons: Each GPU must be able to hold the entire model
Model Parallelism
Different layers or components of the model are distributed across different GPUs. This is suitable when a single GPU cannot accommodate the entire model.
```
GPU 0: Layer  1-10
GPU 1: Layer 11-20
GPU 2: Layer 21-30
GPU 3: Layer 31-40
```
- Pros: Enables training of very large models
- Cons: Data dependencies between GPUs can lead to idle GPUs (the bubble problem)
Pipeline Parallelism
After partitioning the model by layers, different micro-batches are fed through the stages in a pipelined fashion, reducing GPU idle time.
```
Time →   t0     t1     t2     t3     t4     t5     t6
GPU 0:  [MB0]  [MB1]  [MB2]  [MB3]
GPU 1:         [MB0]  [MB1]  [MB2]  [MB3]
GPU 2:                [MB0]  [MB1]  [MB2]  [MB3]
GPU 3:                       [MB0]  [MB1]  [MB2]  [MB3]
```
The Bubble Problem
During the startup and teardown phases of the pipeline, GPUs are idle -- these idle periods are called "bubbles." Increasing the number of micro-batches reduces the bubble ratio. GPipe and PipeDream are well-known pipeline parallelism frameworks.
Hybrid Parallelism Strategies
In practice, large-scale training typically combines multiple parallelism strategies:
| Parallelism Strategy | Partitioning Dimension | Typical Frameworks |
|---|---|---|
| Data Parallelism | Data (Batch) | PyTorch DDP, Horovod |
| Model Parallelism (Tensor Parallelism) | Intra-layer (weight matrices) | Megatron-LM |
| Pipeline Parallelism | Inter-layer | GPipe, PipeDream |
| Hybrid Parallelism (3D Parallelism) | Data + intra-layer + inter-layer simultaneously | Megatron-LM + DeepSpeed |
Common Frameworks
- PyTorch DDP (DistributedDataParallel): The standard approach for data parallelism
- DeepSpeed (Microsoft): Supports ZeRO optimization, significantly reducing GPU memory usage
- Megatron-LM (NVIDIA): A tensor parallelism solution designed specifically for very large Transformer models
- FSDP (Fully Sharded Data Parallel): PyTorch's native model-sharding approach
The Role of AllReduce in Distributed Training
In data-parallel training, each iteration involves the following key steps:
1. Each GPU performs forward propagation with its own data and computes the loss
2. Each GPU performs backpropagation to obtain local gradients
3. AllReduce averages the gradients across all GPUs
4. Each GPU updates the model parameters using the averaged gradients
Ring AllReduce is the most widely used implementation: \(n\) GPUs are arranged in a ring, and the all-reduce completes in \(2(n-1)\) communication steps. Each GPU sends a total of about \(2\frac{n-1}{n}\) times the gradient size, nearly independent of the number of GPUs, which gives good scalability.
```
      GPU 0
     ╱     ╲
GPU 3       GPU 1
     ╲     ╱
      GPU 2
```
Each GPU communicates only with its neighbors; AllReduce completes in 2(n-1) steps
Optimization Directions
- Gradient compression: Reduce communication data volume
- Overlapping computation and communication: While backpropagation computes gradients, layers that have already finished can begin AllReduce concurrently
- Mixed-precision training (FP16/BF16): Reduce GPU memory usage and communication volume
Relations to Other Topics
- See System Design for how throughput, capacity planning, and asynchronous execution appear at the system level
- See Cloud Services for GPU clusters, container orchestration, and cloud training infrastructure
- See Distributed Systems for the role of communication overhead, coordination, and fault tolerance
- See Architecture Overview for the hardware background behind software parallelism models
References
- Ananth Grama et al., Introduction to Parallel Computing, Addison-Wesley.
- PyTorch Distributed Overview
- NVIDIA NCCL Documentation
- DeepSpeed Documentation