Building Models with PyTorch

This note covers the main approaches to building models in PyTorch. Unlike manually implementing low-level computations, PyTorch's modular design makes developing deep learning architectures remarkably straightforward, and as of 2026 it has become the de facto standard development framework.

Modular Development

Today, the low-level computational principles of deep learning are encapsulated in various abstraction libraries. Developers no longer think in terms of individual artificial neurons, but instead design networks from the perspective of layers and reason about coarser-grained blocks during architecture design.

Layers & Blocks are the cornerstones of code organization. To manage networks with hundreds or thousands of layers, engineers adopt object-oriented design principles.

Layer: The smallest computational unit (e.g., nn.Linear, nn.Conv2d). It encapsulates the weights \(W\), biases \(b\), and the corresponding mathematical operators.
Block/Module: A container. A block can contain multiple layers (e.g., a ResNet Residual Block).

Blocks can be nested within other blocks, forming a tree structure. When you call model.backward(), it recursively traverses this tree to compute all gradients. A block automatically identifies and collects the parameters from all layers within it for parameter management.

Tensor

A Tensor is a raw data array and the most fundamental structure in PyTorch.

We can build models directly using Tensors and functions without relying on Layers (which can be useful when researching highly novel algorithms):

# Pure manual implementation without nn.Linear Layer
import torch.nn.functional as F

weight = torch.randn(10, 5, requires_grad=True)
bias = torch.randn(5, requires_grad=True)

def manual_linear(x):
    return F.linear(x, weight, bias) # or simply x @ weight.t() + bias

Tensors are also the underlying data entities inside Layer containers. For example, nn.Linear is a Layer, and internally its weight is a Tensor. Its computation logic x @ weight + bias is composed of basic matrix multiplication and addition operators.

Note this point.

nn.Parameter

A Parameter is a variable marked as trainable. In other words, nn.Parameter represents the parameters that need to be learned.

A trainable parameter is essentially a tensor, but if we store it as a plain tensor, it will not be updated by the optimizer:

self.W = torch.empty(input_dim, output_dim)

Therefore, we need to wrap it with nn.Parameter:

self.W = nn.Parameter(torch.empty(input_dim, output_dim))

Here "dim" refers to dimension.

Layer

A Layer is the most basic building block in PyTorch, defining the atomic unit of data processing in a neural network. A layer receives input, performs a specific mathematical operation, and produces output.

Simply put, a layer = parameters + computation rules, typically consisting of weights W and b along with a forward rule.

Common abstraction layers include:

Fully connected layer (Linear/Dense)
Convolutional layer (Conv2d)
Activation function (ReLU)
Normalization layer (BatchNorm)

We can use predefined layers or define custom ones.

Predefined layers can be quickly created via torch.nn (which provides nearly all standard deep learning layers):

import torch
import torch.nn as nn

linear_layer = nn.Linear(in_features=10, out_features=5)
relu_layer = nn.ReLU()

We can also define custom layers:

class MyCustomLayer(nn.Module):
    def __init__(self, size):
        super().__init__()
        # Initialize a learnable parameter (weight)
        # nn.Parameter tells PyTorch: "this is a variable that needs to be trained and updated"
        self.weights = nn.Parameter(torch.randn(size, size))

    def forward(self, x):
        # Define the forward computation logic of the layer
        return x @ self.weights + 1 # e.g., matrix multiplication followed by adding 1

Custom layers are a more advanced operation; we will revisit them after becoming familiar with general development practices.

When we call a simple line of code like nn.Linear(10,5), PyTorch manages a great deal of detail behind the scenes, including:

Parameter management: weight, bias; automatic parameter initialization
Computation logic: matrix multiplication, convolution, addition, etc.
Gradient tracking: autograd automatically maintains the computation graph, computes parameter gradients during backpropagation, etc., eliminating the need to manually derive gradient formulas
State management: certain layers like dropout or batchnorm behave differently during training and inference; the layer encapsulates .train() and .eval() switches to automatically toggle behavior
Device management: CPU/GPU computation, moving parameter tensors to GPU, etc.

nn.Module

A Block is composed of multiple layers, essentially a composite structure wrapping several Layers for convenient reuse. ResNet's Residual Block, Transformer's Encoder Block, Inception Block, and others can all be implemented with ease.

In PyTorch, layers, blocks, and entire models are all fundamentally nn.Module.

nn.Module is the standard format for composable computational modules in PyTorch.

Model

By assembling Layers and Blocks together, we construct a Model.

Similarly, PyTorch provides the predefined nn.Sequential, which enables simple composition. For example, using the layers from the previous section:

import torch
import torch.nn as nn

linear_layer = nn.Linear(in_features=10, out_features=5)
relu_layer = nn.ReLU()

We can directly use nn.Sequential to wrap the previously defined linear_layer and relu_layer into a model:

# Directly place the instantiated layers inside
model = nn.Sequential(
    linear_layer,
    relu_layer,
    # More layers can be added...
    nn.Linear(5, 1)
)

# Now the model is a ready-to-use neural network
# input_data = torch.randn(1, 10)
# output = model(input_data)

We can of course also build models manually. In general, defining a class by inheriting from nn.Module is the most common approach found in paper implementations and GitHub repositories. For example:

class MyMLP(nn.Module):
    def __init__(self):
        super().__init__()
        # Define the components here
        self.hidden = linear_layer  # The linear layer created earlier
        self.act = relu_layer       # The activation layer created earlier
        self.output = nn.Linear(5, 1) # Assume a single output value

    def forward(self, x):
        # Define the data flow path (assembly logic)
        x = self.hidden(x)
        x = self.act(x)
        x = self.output(x)
        return x

# Instantiate the model
model = MyMLP()

Here, the MyMLP class — or equivalently, an nn.Sequential object — is both a Module and a Model. Conceptually, this is a Model. At this point, we have successfully built a deep learning model using PyTorch.

MLP Development Example

Let us walk through a complete MLP example. The MLP is the simplest neural network, and we use it here to become familiar with PyTorch's workflow.

Layer

First, we need to implement a fully connected layer, which consists of a linear transformation followed by an activation function:

\[ z=XW+b,y=activation(z) \]

Note that although Linear and ReLU are theoretically separate layers, they almost always appear together, so the combination is commonly referred to as a single Dense Layer.

We know that a neural network is composed of many hidden layers stacked together:

x -> layer1 -> layer2 -> layer3 -> output

We can abstract this layer structure into a class, making it possible to assemble any desired architecture. The general form is as follows:

layer(inputs X):
    z = XW + b
    y = activation(z)
    return y

If we want to build a three-layer MLP, the assembly looks like:

y1 = layer1(X)
y2 = layer2(y1)
y3 = layer3(y2)
return y3

We can implement this DenseLayer as follows:

class DenseLayer(nn.Module):
    """
    Custom fully connected layer (Dense Layer).
    Mathematical form:
        z = XW + b
        y = activation(z)

    Parameters:
    - W: shape (input_dim, output_dim)
    - b: shape (output_dim,)
    - activation: specifies the activation function type
    """
    def __init__(self, input_dim, output_dim, activation):
        super(DenseLayer, self).__init__()

        # Weight W: initialized with Xavier uniform, which generally leads to more stable training
        self.W = nn.Parameter(torch.empty(input_dim, output_dim))
        nn.init.xavier_uniform_(self.W)

        # Bias b: initialized to zero
        self.b = nn.Parameter(torch.zeros(output_dim))

        # Record the activation function type used by this layer
        self.activation = activation

    def forward(self, inputs):
        """
        Forward pass:
        1) Linear transformation: inputs @ W + b
        2) Activation function: select the appropriate nonlinearity based on activation
        """
        z = inputs @ self.W + self.b

        if self.activation == 'relu':
            outputs = torch.relu(z)
        elif self.activation == 'sigmoid':
            outputs = torch.sigmoid(z)
        elif self.activation == 'tanh':
            outputs = torch.tanh(z)
        elif self.activation == 'softmax':
            # Softmax normalizes along the class dimension (dim=1)
            outputs = torch.softmax(z, dim=1)
        else:
            # linear: no activation, output the raw linear result
            outputs = z

        return outputs

When using this, suppose we want to build a three-layer MLP:

Layer 1: input 10-dim vector -> output 32-dim vector (relu)
Layer 2: 32 -> 16 (relu)
Layer 3: 16 -> 3 (classification output)

We can chain three DenseLayers together:

layer1 = DenseLayer(10, 32, "relu")
layer2 = DenseLayer(32, 16, "relu")
layer3 = DenseLayer(16, 3,  "softmax")  # outputs probabilities

When using them, as specified by nn.Module, we can call a layer like a function. That is, Y = layer(X) is equivalent to Y = layer.forward(X). Suppose we have 4 data samples (batch_size = 4), each with 10 features, so X is a (4, 10) matrix. Since the first layer (layer1) expects a 10-dimensional input, we can feed X into it, performing the matrix multiplication X@W. This produces an output with batch_size = 4 and output_dim = 32. After adding b and applying ReLU activation, we obtain the output, which is then passed to layer2.

Model

By stacking the DenseLayers we just implemented in sequence, we obtain an MLP (also known as a Feedforward Network):

class Feedforward(nn.Module):
    """
    Feedforward Neural Network (MLP).
    Composed of several DenseLayers stacked sequentially.

    Construction:
    - depth specifies the total number of "connection layers"
    - hidden_sizes specifies the width of each hidden layer (length should be depth-1)
    - output_size determines the output layer dimension (number of classes for classification, typically 1 for regression)
    """
    def __init__(self, input_size, depth, hidden_sizes, output_size):
        super(Feedforward, self).__init__()

        # Structural validity check: number of hidden layers must be exactly depth - 1
        if not (depth - len(hidden_sizes)) == 1:
            raise Exception(
                "The depth (%d) of the network should be 1 larger than `hidden_sizes` (%d)." %
                (depth, len(hidden_sizes))
            )

        # sizes describes the input/output dimension sequence for each layer
        # e.g., input=10, hidden=[32,16], output=3 => sizes=[10,32,16,3]
        sizes = [input_size] + hidden_sizes + [output_size]

        layers = []

        # Build the first depth-1 layers: hidden layers with ReLU activation
        for i in range(depth - 1):
            layers.append(DenseLayer(sizes[i], sizes[i + 1], 'relu'))

        # Build the last layer: choose activation based on the task (output_size)
        # - output_size == 1: regression task, output a continuous value, use linear
        # - output_size > 1: classification task, use softmax to output class probabilities
        if output_size == 1:
            layers.append(DenseLayer(sizes[-2], sizes[-1], 'linear'))
        else:
            layers.append(DenseLayer(sizes[-2], sizes[-1], 'softmax'))

        # Use ModuleList to store layers, ensuring parameters are properly registered by PyTorch
        self.layers = nn.ModuleList(layers)

    def forward(self, inputs):
        """
        Forward pass: feed the input through each layer sequentially
        """
        outputs = inputs
        for layer in self.layers:
            outputs = layer(outputs)
        return outputs

.

Data Wrapping

.

class MyDataset(Dataset):
    """
    Wraps (x, y) data into a PyTorch Dataset for convenient batch loading via DataLoader.
    - x: converted to float32 Tensor
    - y: converted to long for classification (class indices) or float32 for regression (continuous values)
    """
    def __init__(self, x, y, pr_type):
        self.x = torch.tensor(x, dtype=torch.float32)
        self.y = torch.tensor(y, dtype=(torch.long if pr_type == "classification"
                                        else torch.float32))

    def __len__(self):
        # Return the number of samples in the dataset
        return self.x.size()[0]

    def __getitem__(self, idx):
        # Return a single sample (features, label)
        return self.x[idx], self.y[idx]

.

Training Function

.

def train(x_train, y_train, x_val, y_val, loss_type, model, num_train_epochs, batch_size, lr, weight_decay):
    """
    Training function: optimizes the model using mini-batch training, and evaluates on the
    validation set after each epoch while recording the training history.

    Arguments:
    - x_train, y_train: training data and labels (numpy)
    - x_val, y_val: validation data and labels (numpy)
    - loss_type:
        'CrossEntropy' -> classification task
        'SquaredError' -> regression task (uses MSELoss)
    - model: PyTorch model to be trained (Feedforward instance or compatible module)
    - num_train_epochs: number of training epochs
    - batch_size: batch size
    - lr: learning rate
    - weight_decay: weight decay (L2 regularization coefficient)
    """

    # 1) Determine the task type (affects label dtype and evaluation method)
    if loss_type == 'CrossEntropy':
        pr_type = 'classification'
    else:
        pr_type = 'regression'

    # 2) Create training/validation Datasets
    train_dataset = MyDataset(x_train, y_train, pr_type)
    val_dataset = MyDataset(x_val, y_val, pr_type)

    # 3) Create training DataLoader (shuffle=True improves generalization)
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

    # 4) Create optimizer (Adam) with weight_decay regularization
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)

    # 5) Select the loss function
    # Classification: CrossEntropyLoss
    # Regression: MSELoss
    if loss_type == 'CrossEntropy':
        criterion = nn.CrossEntropyLoss()
    else:
        criterion = nn.MSELoss()

    # 6) history: record the training process
    # - loss: average training loss
    # - val_loss: validation loss
    # - accuracy: accuracy for classification; val_loss (MSE) as the metric for regression
    history = {"loss": [], "val_loss": [], "accuracy": []}

    # 7) Training loop
    for epoch in range(num_train_epochs):

        # ---- Training phase ----
        model.train()
        total_loss = 0
        count = 0

        for x_batch, y_batch in train_loader:
            # Forward pass
            pred = model(x_batch)

            # Compute loss
            loss = criterion(pred, y_batch)

            # Backpropagation and parameter update
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # Accumulate total loss for this epoch (weighted by sample count) to compute the average
            total_loss += loss.item() * x_batch.size(0)
            count += x_batch.size(0)

        avg_train_loss = total_loss / count

        # ---- Validation phase ----
        model.eval()
        with torch.no_grad():
            # Forward pass on the validation set
            val_pred = model(val_dataset.x)

            # Compute validation loss
            val_loss = criterion(val_pred, val_dataset.y).item()

            # Compute metrics:
            # Classification: accuracy (use argmax to get predicted class)
            # Regression: use val_loss directly as the MSE metric
            if loss_type == 'CrossEntropy':
                acc = accuracy_score(y_val, val_pred.argmax(dim=1).numpy())
            else:
                acc = val_loss

        # Record history
        history["loss"].append(avg_train_loss)
        history["val_loss"].append(val_loss)
        history["accuracy"].append(acc)

        # Print training progress: output at the 1st epoch and every 10th epoch
        if (epoch + 1) % 10 == 0 or epoch == 0:
            if loss_type == 'CrossEntropy':
                print(f"Epoch [{epoch+1}/{num_train_epochs}], "
                      f"Train Loss: {avg_train_loss:.4f}, "
                      f"Val Loss: {val_loss:.4f}, "
                      f"Val Accuracy: {acc:.4f}")
            else:
                print(f"Epoch [{epoch+1}/{num_train_epochs}], "
                      f"Train Loss: {avg_train_loss:.4f}, "
                      f"Val Loss: {val_loss:.4f}, "
                      f"Val MSE: {acc:.4f}")

    # Return the trained model and training history
    return model, history

.

Regression Task Application

.

# =========================
# Toy Regression: y = sin(1/x)
# =========================

import numpy as np
import torch
import matplotlib.pyplot as plt

# Import the training function and model architecture from your implementation
from implementation import train, Feedforward

# Fix random seeds for reproducibility
torch.manual_seed(137)
np.random.seed(137)

# -------------------------
# 1) Generate data: x is in the range (0.05, 1.05), skewed toward small values (power=4 biases the distribution)
# -------------------------
def target_func(x):
    """Target function: y = sin(1/x), oscillates more rapidly as x approaches 0"""
    return np.sin(1 / x)

# Training and validation sets: 2000 points each
x_train = np.power(np.random.random_sample([2000, 1]), 4) + 0.05
y_train = target_func(x_train)

x_val = np.power(np.random.random_sample([2000, 1]), 4) + 0.05
y_val = target_func(x_val)

# -------------------------
# 2) Visualize the training data: observe the target function shape and sample distribution
# -------------------------
sort_ind = np.argsort(x_train[:, 0])
plt.figure()
plt.plot(x_train[sort_ind, 0], y_train[sort_ind, 0], label="target curve")
plt.plot(x_train[sort_ind, 0], y_train[sort_ind, 0], '.', alpha=0.5, label="train samples")
plt.title("Toy regression data: y = sin(1/x)")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()

# -------------------------
# 3) Define model hyperparameters and create the model
#    - input_size=1: because x is one-dimensional
#    - output_size=1: regression outputs a single continuous value
#    - depth=4, hidden_sizes=[64,64,64]: 3 hidden layers + 1 output layer
# -------------------------
input_size = x_train.shape[-1]
output_size = 1

depth = 4
hidden_sizes = [64, 64, 64]

num_train_epochs = 2000
batch_size = 64
learning_rate = 0.01
weight_decay = 1e-5

model = Feedforward(input_size, depth, hidden_sizes, output_size)

# -------------------------
# 4) Train the model (regression task -> SquaredError -> MSELoss)
# -------------------------
model, history = train(
    x_train, y_train,
    x_val, y_val,
    loss_type="SquaredError",
    model=model,
    num_train_epochs=num_train_epochs,
    batch_size=batch_size,
    lr=learning_rate,
    weight_decay=weight_decay
)

# Print final and best validation MSE
print("final MSE: ", f"{history['val_loss'][-1]:.6f}")
print("best  MSE: ", f"{min(history['val_loss']):.6f}")

# -------------------------
# 5) Plot training curves: training loss and validation loss
# -------------------------
plt.figure()
plt.plot(history['loss'], label='train loss')
plt.plot(history['val_loss'], label='val loss')
plt.title('Training curve (MSE)')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()

# -------------------------
# 6) Save the model and verify it can be correctly loaded
#    Assignment requirement: torch.save(model, 'sin_inv_x.sav') and torch.load it back
# -------------------------
torch.save(model, 'sin_inv_x.sav')
model = torch.load('sin_inv_x.sav', weights_only=False)

# -------------------------
# 7) Visualize the model's fit on the validation set
# -------------------------
model.eval()
with torch.no_grad():
    y_pred = model(torch.tensor(x_val, dtype=torch.float32))

plt.figure()
plt.plot(x_val[:, 0], y_pred.numpy()[:, 0], '.', alpha=0.5)
plt.title("Predictions on validation set")
plt.xlabel("x")
plt.ylabel("predicted y")
plt.show()

.

MNIST Classification Application

.

# =========================
# MNIST Classification (Fully Connected NN)
# =========================

import numpy as np
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

from torchvision import datasets as dts
from torchvision.transforms import ToTensor
from sklearn.model_selection import train_test_split

from math import sqrt, ceil

# Import the training function and feedforward network from your implementation
from implementation import train, Feedforward

# -------------------------
# 1) Data loading and flattening
#    MNIST images are 28x28; here we flatten them into 784-dim vectors
# -------------------------
def transform(x):
    """torchvision converts images to tensors; here we additionally flatten to a 784-dim vector"""
    return ToTensor()(x).flatten()

traindt = dts.MNIST(
    root='data',
    train=True,
    transform=transform,
    download=True
)
testdt = dts.MNIST(
    root='data',
    train=False,
    transform=transform
)

# torchvision's data/targets are torch tensors; convert to numpy for convenient splitting
x_tr = traindt.data.numpy().reshape(-1, 28 * 28)
x_test = testdt.data.numpy().reshape(-1, 28 * 28)
y_tr = traindt.targets.numpy()
y_test = testdt.targets.numpy()

# Split into training/validation sets (stratified sampling to maintain class proportions)
x_train, x_val, y_train, y_val = train_test_split(
    x_tr, y_tr,
    train_size=0.8,
    stratify=y_tr,
    random_state=137
)

print('Shape of training input: ', x_train.shape)
print('Shape of training labels: ', y_train.shape)
print('Shape of validation input: ', x_val.shape)
print('Shape of validation labels: ', y_val.shape)
print('Shape of test input: ', x_test.shape)
print('Shape of test labels: ', y_test.shape)
print('Number of channels: ', np.max(y_train) + 1)
print('Data range:', np.max(x_train), np.min(x_train))

# -------------------------
# 2) Normalization module: scale input from [0,255] -> [0,1] -> standardize
#    Note: this performs standardization using MNIST's global mean/std
# -------------------------
MNIST_MEAN = 0.1307
MNIST_STD = 0.3081

class MNISTNormalizer(nn.Module):
    """A standardization layer that can be inserted into nn.Sequential (embedding preprocessing into the model)"""
    def __init__(self):
        super().__init__()

    def forward(self, x):
        # 1) Scale to [0,1]
        x = x / 255.0
        # 2) Standardize
        x = (x - MNIST_MEAN) / MNIST_STD
        return x

mnist_normalizer = MNISTNormalizer()

# -------------------------
# 3) Build the network architecture
#    - input_size=784
#    - output_size=10 (digits 0-9)
#    - depth=3, hidden_sizes=[256,128]: 2 hidden layers + 1 output layer
# -------------------------
input_size = x_train.shape[-1]
output_size = 10
depth = 3
hidden_sizes = [256, 128]

ff_net = Feedforward(input_size, depth, hidden_sizes, output_size)

# Chain the preprocessor (normalizer) and network (ff_net) into a single model
# Advantage: normalization is applied automatically during both training and inference
model = nn.Sequential(mnist_normalizer, ff_net)

# -------------------------
# 4) Train the classification model (CrossEntropy)
# -------------------------
model, history = train(
    x_train, y_train,
    x_val, y_val,
    loss_type="CrossEntropy",
    model=model,
    batch_size=64,
    num_train_epochs=50,
    lr=0.001,
    weight_decay=1e-4
)

# -------------------------
# 5) Plot training curves: loss and validation accuracy
# -------------------------
plt.figure(figsize=(10, 8))

plt.subplot(2, 1, 1)
plt.plot(history['loss'], label='train loss')
plt.plot(history['val_loss'], label='val loss')
plt.legend()
plt.xlabel('Epoch')
plt.ylabel('Loss')

plt.subplot(2, 1, 2)
plt.plot(history['accuracy'], label='val accuracy')
plt.ylim(0.0, 1.0)
plt.xlabel('Epoch')
plt.ylabel('Classification accuracy')
plt.legend()

plt.show()

# -------------------------
# 6) Visualize first-layer weights (check whether "stroke structures" are learned)
#    Here model[1] is the Feedforward (since model[0] is the normalizer)
# -------------------------
def visualize_grid(Xs, ubound=255.0, padding=1):
  """
  Reshape a 4D tensor of image data to a grid for easy visualization.

  Inputs:
  - Xs: Data of shape (N, H, W)
  - ubound: Output grid will have values scaled to the range [0, ubound]
  - padding: The number of blank pixels between elements of the grid
  """
  (N, H, W) = Xs.shape
  #grid_size = int(ceil(sqrt(N)))
  num_grid_h = 2
  num_grid_w = int(ceil(N / 2))

  grid_height = H * num_grid_h + padding * (num_grid_h - 1)
  grid_width = W * num_grid_w + padding * (num_grid_w - 1)
  grid = np.zeros((grid_height, grid_width))
  next_idx = 0
  y0, y1 = 0, H
  for y in range(num_grid_h):
    x0, x1 = 0, W
    for x in range(num_grid_w):
      if next_idx < N:
        img = Xs[next_idx]
        low, high = np.min(img), np.max(img)
        grid[y0:y1, x0:x1] = ubound * (img - low) / (high - low)
        # grid[y0:y1, x0:x1] = Xs[next_idx]
        next_idx += 1
      x0 += W + padding
      x1 += W + padding
    y0 += H + padding
    y1 += H + padding

  # grid_max = np.max(grid)
  # grid_min = np.min(grid)
  # grid = ubound * (grid - grid_min) / (grid_max - grid_min)
  return grid


with torch.no_grad():
    W1 = model[1].layers[0].W.numpy()     # First-layer weights: shape (784, 256)
W1 = W1.transpose()                       # Transpose to (256, 784), one "weight map" per neuron
W1 = np.reshape(W1, [W1.shape[0], 28, 28])

plt.figure()
plt.imshow(visualize_grid(W1))
plt.title("First-layer weights visualization")
plt.axis('off')
plt.show()

# -------------------------
# 7) Save the model and verify it can be correctly loaded
# -------------------------
torch.save(model, 'mnist_cls.sav')
model = torch.load('mnist_cls.sav', weights_only=False)

# -------------------------
# 8) Evaluate accuracy on the test set
# -------------------------
model.eval()
with torch.no_grad():
    y_pred = model(torch.tensor(x_test.astype(np.float32)))
    acc = np.mean(y_test == np.argmax(y_pred.numpy(), axis=1))

print('The test accuracy is ', acc)

.

CNN Development Example

LeNet

LeNet-5 is the pioneering architecture of deep learning. Its structure is remarkably simple:

Input (32x32x1) -- Grayscale image, no RGB channels
   |
Conv C1 (6@5x5) -- 6 kernels of size 5x5, stride=1, output 28x28x6, extracts low-level features
   |
AvgPool S2 -- Average pooling for downsampling
   |
Conv C3 (16@5x5) -- 16 kernels of size 5x5, output 10x10x16, extracts complex features
   |
AvgPool S4 -- 2x2 average pooling for downsampling
   |
Conv C5 (120@5x5) -- 120 kernels of size 5x5, effectively fully connected, output 1x1x120
   |
FC F6 (84) -- Fully connected layer with 84 neurons
   |
FC Output (10) -- Output layer with 10 class predictions

We can quickly build this with PyTorch:

import torch
import torch.nn as nn
import torch.nn.functional as F

class LeNet5(nn.Module):

    def __init__(self):
        super(LeNet5, self).__init__()

        # C1: 1 -> 6, kernel=5
        self.conv1 = nn.Conv2d(
            in_channels=1,
            out_channels=6,
            kernel_size=5,
            stride=1
        )

        # S2: AvgPool
        self.pool1 = nn.AvgPool2d(kernel_size=2, stride=2)

        # C3: 6 -> 16
        self.conv2 = nn.Conv2d(
            in_channels=6,
            out_channels=16,
            kernel_size=5
        )

        # S4
        self.pool2 = nn.AvgPool2d(2, 2)

        # C5 (equivalent to FC)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)

        # F6
        self.fc2 = nn.Linear(120, 84)

        # Output
        self.fc3 = nn.Linear(84, 10)


    def forward(self, x):

        # Input: (batch, 1, 32, 32)

        x = self.conv1(x)      # -> (batch, 6, 28, 28)
        x = torch.tanh(x)

        x = self.pool1(x)      # -> (batch, 6, 14, 14)

        x = self.conv2(x)      # -> (batch, 16, 10, 10)
        x = torch.tanh(x)

        x = self.pool2(x)      # -> (batch, 16, 5, 5)

        x = x.view(x.size(0), -1)   # flatten -> (batch, 400)

        x = self.fc1(x)        # -> 120
        x = torch.tanh(x)

        x = self.fc2(x)        # -> 84
        x = torch.tanh(x)

        x = self.fc3(x)        # -> 10

        return x

It can also be defined directly using nn.Sequential:

model = nn.Sequential(
    nn.Conv2d(1,6,5),
    nn.Tanh(),
    nn.AvgPool2d(2),

    nn.Conv2d(6,16,5),
    nn.Tanh(),
    nn.AvgPool2d(2),

    nn.Flatten(),

    nn.Linear(400,120),
    nn.Tanh(),

    nn.Linear(120,84),
    nn.Tanh(),

    nn.Linear(84,10)
)

The most suitable style for assignments and interviews:

class LeNet5(nn.Module):

    def __init__(self):
        super().__init__()

        self.features = nn.Sequential(
            nn.Conv2d(1,6,5),
            nn.Tanh(),
            nn.AvgPool2d(2),

            nn.Conv2d(6,16,5),
            nn.Tanh(),
            nn.AvgPool2d(2)
        )

        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(400,120),
            nn.Tanh(),
            nn.Linear(120,84),
            nn.Tanh(),
            nn.Linear(84,10)
        )

    def forward(self,x):
        x = self.features(x)
        x = self.classifier(x)
        return x

This style is best suited for CNNs because virtually all classic CNNs follow this organizational pattern:

features   -> Feature extraction (Conv + Pool)
classifier -> Classification (FC)

.

AlexNet

VGG

ResNet

Using CIFAR-10 as the dataset, we introduce the implementation and training of ResNet.

First, we load the core PyTorch components:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt
from IPython.display import display, clear_output
import pandas as pd

Then load the dataset:

from torch.utils.data import Subset

# === Data Augmentation ===
# Note: augmentation should only be applied to the training set's transform, not to val/test.
# Reason: val and test require stable, reproducible evaluation results;
#         if random cropping/flipping is applied to them, results would vary each time, losing reference value.
# Therefore, we define two separate transforms for training and evaluation sets.

# Training set transform: incorporates random augmentation to artificially increase training data diversity
train_transform = transforms.Compose([
    # RandomCrop: first pads the image by 4 pixels on each side (padding=4), then randomly crops back to 32x32
    # Simulates image translation, helping the model learn to recognize objects at different positions
    transforms.RandomCrop(32, padding=4),
    # RandomHorizontalFlip: randomly flips the image horizontally with 50% probability
    # For most CIFAR-10 classes (car, bird, ship, etc.), flipping preserves semantic meaning
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

# Evaluation set transform: only perform necessary normalization, no random transformations
eval_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

# Load two copies of the original dataset (same data files, different transforms)
# full_trainset is used to inspect attributes like .classes
full_trainset     = torchvision.datasets.CIFAR10(root='./data', train=True,
                                                  download=True, transform=train_transform)
full_trainset_eval = torchvision.datasets.CIFAR10(root='./data', train=True,
                                                   download=True, transform=eval_transform)

# Split indices with a fixed random seed to ensure consistent train/val partitions
val_size   = 5000
train_size = len(full_trainset) - val_size
generator  = torch.Generator().manual_seed(42)   # Fixed seed for reproducibility
indices    = torch.randperm(len(full_trainset), generator=generator).tolist()

# Train split: uses the augmented full_trainset
# Val split: uses the non-augmented full_trainset_eval (same indices, different transforms)
trainset = Subset(full_trainset,      indices[val_size:])
valset   = Subset(full_trainset_eval, indices[:val_size])

trainloader = DataLoader(trainset, batch_size=64, shuffle=True)
valloader   = DataLoader(valset,   batch_size=64, shuffle=False)

# Test set: uses eval_transform only, no augmentation
testset    = torchvision.datasets.CIFAR10(root='./data', train=False,
                                          download=True, transform=eval_transform)
testloader = DataLoader(testset, batch_size=64, shuffle=False)

Inspect the dataset:

import numpy as np
import matplotlib.pyplot as plt

# full_trainset retains the .classes attribute; trainset/valset are Subsets and lack it
classes = full_trainset.classes

# === Basic information ===
print(f"Full train set : {len(full_trainset)}")
print(f"  -> Train split: {len(trainset)}")
print(f"  -> Val   split: {len(valset)}")
print(f"Test set       : {len(testset)}")

# Examine a single image's format (from full_trainset, since Subset uses different indexing)
sample_img, sample_label = full_trainset[0]
print(f"\nSingle image tensor shape : {sample_img.shape}")
print(f"Pixel range               : [{sample_img.min():.3f}, {sample_img.max():.3f}]")
print(f"Label example             : {sample_label} -> '{classes[sample_label]}'")

# === Check the shape of one batch ===
sample_batch_imgs, sample_batch_labels = next(iter(trainloader))
print(f"\nOne batch shape : {sample_batch_imgs.shape}")
print(f"Label shape     : {sample_batch_labels.shape}")

# === Visualize sample images from each class ===
fig, axes = plt.subplots(2, 5, figsize=(12, 5))
shown = {i: False for i in range(10)}

for img, label in full_trainset:
    label = int(label)
    if not shown[label]:
        ax = axes[label // 5][label % 5]
        npimg = (img.numpy() * 0.5 + 0.5).transpose(1, 2, 0)
        ax.imshow(np.clip(npimg, 0, 1))
        ax.set_title(classes[label])
        ax.axis('off')
        shown[label] = True
    if all(shown.values()):
        break

plt.suptitle("CIFAR-10 Sample Images (32x32)", fontsize=13)
plt.tight_layout()
plt.show()

# === Class distribution in the dataset ===
from collections import Counter
label_counts = Counter([label for _, label in full_trainset])
print("\nFull training set class distribution:")
for i, cls in enumerate(classes):
    print(f"  {cls:>10}: {label_counts[i]}")

Now we can implement a basic ResNet:

import torch
from torch import nn

class BasicBlock(nn.Module):
    """
    The basic building block of ResNet (Basic Block).
    Structure: Conv -> BN -> ReLU -> Conv -> BN -> (+shortcut) -> ReLU

    Core idea: residual connection (skip connection)
    The output H(x) = F(x) + x, so the model only needs to learn the residual F(x) = H(x) - x.
    This allows gradients to flow directly through the shortcut, solving the vanishing gradient
    problem in deep networks.

    When stride != 1 or the number of channels changes, the shortcut uses a 1x1 Conv to match dimensions.
    """
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        # Main path: two 3x3 convolutions, Conv -> BN -> ReLU -> Conv -> BN
        self.conv1 = nn.Conv2d(in_channels, out_channels,
                               kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn1   = nn.BatchNorm2d(out_channels)
        self.relu  = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(out_channels, out_channels,
                               kernel_size=3, stride=1, padding=1, bias=False)
        self.bn2   = nn.BatchNorm2d(out_channels)

        # Shortcut path: when stride != 1 or channels change, use 1x1 Conv to align dimensions
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels,
                          kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels)
            )

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))  # Conv -> BN -> ReLU
        out = self.bn2(self.conv2(out))            # Conv -> BN
        out += self.shortcut(x)                   # Residual addition: F(x) + x
        out = self.relu(out)
        return out


class ResNet18(nn.Module):
    """
    ResNet-18 for CIFAR-10.
    Architecture:
      stem:    Conv3x3(3->64, stride=1) + BN + ReLU   [no downsampling, preserves 32x32]
      layer1:  2x BasicBlock(64->64,   stride=1)       [32x32]
      layer2:  2x BasicBlock(64->128,  stride=2)       [16x16]
      layer3:  2x BasicBlock(128->256, stride=2)       [ 8x8]
      layer4:  2x BasicBlock(256->512, stride=2)       [ 4x4]
      head:    AdaptiveAvgPool2d(1) + Flatten + Linear(512->10)
    """
    def __init__(self, num_classes=10):
        super().__init__()
        # stem: use 3x3 conv(stride=1) instead of the original 7x7 conv(stride=2), adapted for 32x32 input
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True)
        )
        self.layer1 = self._make_layer(64,  64,  num_blocks=2, stride=1)
        self.layer2 = self._make_layer(64,  128, num_blocks=2, stride=2)
        self.layer3 = self._make_layer(128, 256, num_blocks=2, stride=2)
        self.layer4 = self._make_layer(256, 512, num_blocks=2, stride=2)

        # Global average pooling followed by flatten, then a single fully connected layer for 10-class output
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),  # Any spatial size -> 1x1
            nn.Flatten(),
            nn.Linear(512, num_classes)
        )

    def _make_layer(self, in_channels, out_channels, num_blocks, stride):
        # The first block handles downsampling (stride); subsequent blocks use stride=1
        layers = [BasicBlock(in_channels, out_channels, stride)]
        for _ in range(1, num_blocks):
            layers.append(BasicBlock(out_channels, out_channels, stride=1))
        return nn.Sequential(*layers)

    def forward(self, x):
        x = self.stem(x)
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        x = self.head(x)
        return x


net = ResNet18(num_classes=10)
print(net)

# Verify the forward pass produces the correct output dimensions
dummy = torch.zeros(1, 3, 32, 32)
print(f"\nOutput shape: {net(dummy).shape}")  # Should be torch.Size([1, 10])

Training:

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
## Code Starting Here
# The training code references the training code I used in CS130 and CS137.

# First, the dataset is divided into multiple portions, where count = total dataset size / batch size.
# We set batch_size = 64, meaning 64 images are fed to the model at a time.
# iterations = len(trainset) / 64 = 781
# This length means each epoch requires 781 batches to traverse the entire dataset.
# One iteration = processing one batch = seeing 64 images.
# One epoch = processing all images = completing 781 iterations.

model = net.to(device)

criterion = nn.CrossEntropyLoss()

# Optimizer: SGD + Momentum + Weight Decay (original ResNet paper configuration, better generalization than AdamW)
# - lr=0.1: SGD's initial learning rate is typically much larger than Adam's
# - momentum=0.9: momentum gives the gradient update direction "inertia", helping escape local optima and accelerating convergence
# - weight_decay=5e-4: L2 regularization coefficient, keeps weights small to prevent overfitting
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.1,
                            momentum=0.9,
                            weight_decay=5e-4)

# Learning rate scheduler: MultiStepLR
# At specified epochs, multiply the learning rate by gamma (i.e., reduce to 1/10)
# Use a large lr early for rapid descent, then a small lr later for fine-grained convergence:
#   epoch  1-59: lr = 0.1
#   epoch 60-79: lr = 0.01
#   epoch 80+:   lr = 0.001
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[60, 80], gamma=0.1)

import os
os.makedirs('./resnet_checkpoints', exist_ok=True)

train_losses = []
test_accuracies = []
best_acc = 0.0

# 100 epochs: coordinated with MultiStepLR to decay learning rate at epochs 60 and 80
num_epochs = 100
for epoch in range(num_epochs):
    # Using nn.Module, toggle the .train() switch
    # .train() here is a switch mainly for special layers like dropout and BatchNorm
    # that behave differently during training and testing:
    # dropout: does not randomly deactivate neurons during evaluation
    # BatchNorm: stops computing running statistics (more on this later)
    model.train()

    # running_loss is a variable for manually tracking training error
    running_loss = 0.0

    # for image, label in trainloader:
    for images, labels in trainloader:
        images, labels = images.to(device), labels.to(device)

        # Clear gradients from the previous batch, since PyTorch accumulates gradients by default
        optimizer.zero_grad()

        # Forward pass to get predictions
        outputs = model(images)

        # Compare predictions with ground truth to compute loss (cross-entropy loss)
        loss = criterion(outputs, labels)

        # Backpropagation to compute gradients for each layer
        loss.backward()

        # Update parameters
        optimizer.step()

        # In PyTorch, loss is a tensor attached to the computation graph; directly accumulating it is inappropriate
        # .item() extracts the concrete number from the tensor as a lightweight Python float
        running_loss += loss.item()

    avg_loss = running_loss / len(trainloader)
    train_losses.append(avg_loss)

    # Evaluate on the validation set (carved from the training set for monitoring during training)
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for images, labels in valloader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    val_acc = 100 * correct / total
    test_accuracies.append(val_acc)

    # scheduler.step() must be called after each epoch to trigger the scheduled learning rate decay
    scheduler.step()
    current_lr = scheduler.get_last_lr()[0]

    print(f"Epoch [{epoch+1}/{num_epochs}]  Loss: {avg_loss:.4f}  Val Acc: {val_acc:.2f}%  LR: {current_lr:.5f}")

    # Whenever validation accuracy reaches a new high, overwrite and save the best model (for deployment)
    if val_acc > best_acc:
        best_acc = val_acc
        torch.save(model.state_dict(), './best_model_resnet.pth')
        print(f"  -> New best val accuracy: {best_acc:.2f}%, best model saved.")

    # Save a checkpoint every 5 epochs (includes scheduler state for resuming training from interruption)
    if (epoch + 1) % 5 == 0:
        torch.save({
            'epoch': epoch + 1,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'scheduler_state_dict': scheduler.state_dict(),
            'train_losses': train_losses,
            'test_accuracies': test_accuracies,
        }, f'./resnet_checkpoints/checkpoint_epoch{epoch+1}.pth')
        print(f"  -> Checkpoint saved at epoch {epoch+1}")

# Save final weights after training completes
torch.save(model.state_dict(), './resnet_final_weights.pth')
print("Training complete. Final weights saved to resnet_final_weights.pth")

Test evaluation:

# Final evaluation on test set (with data augmentation)
model.eval()
correct = 0
total = 0
with torch.no_grad():
    for images, labels in testloader:
        images, labels = images.to(device), labels.to(device)
        outputs = model(images)
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

test_acc = 100 * correct / total
print(f"Final Test Accuracy (ResNet-18 on CIFAR-10): {test_acc:.2f}%")
print(f"Best Val Accuracy during training:           {best_acc:.2f}%")

Results:

Final Test Accuracy (ResNet-18 on CIFAR-10): 93.69%
Best Val Accuracy during training:           94.46%

We also typically plot the training loss and test accuracy over epochs:

import matplotlib.pyplot as plt

plt.figure(figsize=(12, 5))

# plot loss
plt.subplot(1, 2, 1)
plt.plot(train_losses, marker='o', label="Training Loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.title("Training Loss over Epochs")
plt.legend()

# plot accuracy
plt.subplot(1, 2, 2)
plt.plot(test_accuracies, marker='o', color='orange', label="Test Accuracy")
plt.xlabel("Epoch")
plt.ylabel("Accuracy (%)")
plt.title("Test Accuracy over Epochs")
plt.legend()

plt.tight_layout()
plt.show()

These training logs can also be saved for future analysis:

torch.save(model.state_dict(), './model_architecture_1.pth')

## Save your log file and upload to Canvas
# Your Code Here

# csv log
import pandas as pd
df = pd.DataFrame({
    'epoch': range(1, len(train_losses) + 1),
    'train_loss': train_losses,
    'val_acc': test_accuracies
})
df.to_csv('./training_log_model1.csv', index=False)

# json log
import json
log = {'train_losses': train_losses, 'val_accuracies': test_accuracies}
with open('./training_log_model1.json', 'w') as f:
    json.dump(log, f)

###

.

Custom CNN Model

We can build a CNN model like assembling building blocks.

For example, suppose we design the following architecture:

Input -> Conv Layer 1 -> ReLU -> Max Pooling ->
      -> Conv Layer 2 -> ReLU -> Max Pooling ->
      -> Conv Layer 3 -> ReLU -> Max Pooling ->
      -> Flatten -> FC1 -> ReLU -> FC2 -> ReLU -> Output

We can use PyTorch to assemble it directly, just like stacking building blocks:

# ============================================================
# Key concepts: nn.Conv2d(in_channels, out_channels, kernel_size, padding)
#   - in_channels : Number of input channels
#       First layer: CIFAR-10 uses RGB images, fixed at 3
#       Subsequent layers: must equal the previous layer's out_channels (dimensions must align)
#   - out_channels: Number of output feature maps, a hyperparameter you choose
#       Common choices: 32, 64, 128... typically increasing with depth
#   - kernel_size=3 : 3x3 convolution kernel, commonly used for CIFAR-10
#   - padding=1     : with kernel_size=3, padding=1 preserves spatial dimensions
#
# Key concepts: nn.MaxPool2d(kernel_size)
#   - kernel_size=2 : 2x2 window, halves spatial dimensions each time
#   - CIFAR-10 input is 32x32; after 3 MaxPool2d(2) operations: 32 -> 16 -> 8 -> 4
#
# Key concepts: nn.Linear(in_features, out_features)
#   - in_features  : Length of the flattened vector = last Conv layer's out_channels x 4 x 4
#       Because 32x32 becomes 4x4 after 3 halvings, multiplied by channel count gives the flattened dimension
#   - out_features : FC2's output is fixed at 10 (CIFAR-10 has 10 classes)
#
# Key concepts: forward method signature
#   def forward(self, x):   <- self and x are two separate parameters separated by a comma, not self.x
#
# Required architecture flow (from assignment instructions):
#   Input -> Conv1 -> ReLU -> MaxPool
#         -> Conv2 -> ReLU -> MaxPool
#         -> Conv3 -> ReLU -> MaxPool
#         -> Flatten -> FC1 -> ReLU -> FC2 -> ReLU -> Output (10 classes)
# ============================================================

class MyCNN(nn.Module):
    def __init__(self):
        super(MyCNN, self).__init__()
        # Your Code Here

        # --- Convolutional part: 3 Conv->ReLU->MaxPool blocks ---
        self.features = nn.Sequential(
            # Block 1
            nn.Conv2d(3, 64, kernel_size=3, padding=1),  # in=3 (RGB, fixed), out=64
            nn.ReLU(),
            nn.MaxPool2d(2),                             # Pooling window size

            # Block 2
            # Note: in_channels must equal the previous layer's out_channels
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),

            # Block 3
            nn.Conv2d(128, 256, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )

        # --- Fully connected part: Flatten is done in forward; only Linear layers here ---
        # FC1's in_features = Conv3's out_channels x 4 x 4
        # (32x32 becomes 4x4 after 3 MaxPool(2) operations)
        self.classifier = nn.Sequential(
            # The flattened vector from the convolutional layers contains rich spatial features,
            # but with significant redundancy.
            # We use FC layers to compress and distill these features:
            # 4096-dim (low-level spatial features) -> 512-dim (mid-level semantic features) -> 128-dim (high-level abstract features) -> 10-dim (class scores)
            nn.Flatten(), # Flattens 256x4x4 to 4096
            nn.Linear(4096, 512),
            nn.ReLU(),
            nn.Linear(512, 128),
            nn.ReLU(),
            nn.Linear(128, 10)
        )

    def forward(self, x):           # Note: (self, x) not (self.x)
        x = self.features(x)        # Through 3 conv blocks, shape: (B, C3, 4, 4)
        x = self.classifier(x)     # Through FC layers, shape: (B, 10)
        return x

###

model = MyCNN()
print(model)
####

Building Models with PyTorch

Modular Development

Tensor

nn.Parameter

Layer

nn.Module

Model

MLP Development Example

Layer

Model

Data Wrapping

Training Function

Regression Task Application

MNIST Classification Application

CNN Development Example

LeNet

AlexNet

VGG

ResNet

Custom CNN Model

RNN Development Example

Transformer Development Example

ViT

评论 #