CNN Architectures
Typical CNN Architecture Pattern
A typical CNN consists of two parts: convolutional blocks (Feature Learning) and a classification head (Classification). Each convolutional block follows this pattern:
[Conv] -> [Activation] -> [Conv] -> [Activation] -> [Pooling]
The model parameters include the weights of all convolutional filters and the parameters of the fully connected layers.
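A minimal PyTorch sketch of this block pattern follows; the 3×3 kernel size, ReLU activation, and 2×2 max pooling are illustrative assumptions, since the pattern above does not fix them.

```python
import torch.nn as nn

# One convolutional block: [Conv] -> [Activation] -> [Conv] -> [Activation] -> [Pooling]
# (3x3 kernels, ReLU, and 2x2 max pooling are assumed here for illustration)
def conv_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),   # [Conv]
        nn.ReLU(inplace=True),                                 # [Activation]
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),   # [Conv]
        nn.ReLU(inplace=True),                                 # [Activation]
        nn.MaxPool2d(kernel_size=2, stride=2),                 # [Pooling]
    )
```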

How to Read Network Architecture Diagrams
When reading or designing CNN architectures, pay attention to the following notation conventions:
- The input shape is determined by the preceding layer
- Dense layer: [Dense, output-size] — a fully connected layer with the specified output size
- Convolutional layer: [Conv, output-channel, padding-size, stride] — a convolutional layer with the specified output channels, padding, and stride
LeNet-5
LeCun et al., "Gradient-Based Learning Applied to Document Recognition", 1998
LeNet-5 is the pioneer of CNNs. Proposed by Yann LeCun in 1998 for handwritten digit recognition (MNIST), it established the fundamental paradigm of Convolution (Conv) → Pooling → Fully Connected (FC).
Architecture:
[Input 28×28] -> [Conv 5×5, 6 filters] -> [Activation] -> [Pool 2×2]
-> [Conv 5×5, 16 filters] -> [Activation] -> [Pool 2×2]
-> [FC 120] -> [FC 84] -> [Output 10]
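A PyTorch sketch of this layout for 28×28 inputs is given below. ReLU and max pooling are modern stand-ins (the 1998 paper used sigmoid/tanh units and average-style subsampling), and the convolutions are assumed unpadded.

```python
import torch.nn as nn

# LeNet-5 layout for 28x28 MNIST inputs (ReLU / max pooling assumed;
# the original used sigmoid/tanh activations and subsampling layers)
lenet5 = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),    # 28x28 -> 24x24
    nn.ReLU(),
    nn.MaxPool2d(2),                   # 24x24 -> 12x12
    nn.Conv2d(6, 16, kernel_size=5),   # 12x12 -> 8x8
    nn.ReLU(),
    nn.MaxPool2d(2),                   # 8x8 -> 4x4
    nn.Flatten(),                      # 16 * 4 * 4 = 256 features
    nn.Linear(16 * 4 * 4, 120), nn.ReLU(),
    nn.Linear(120, 84), nn.ReLU(),
    nn.Linear(84, 10),
)
```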

Key Contributions:
- First demonstration of the effectiveness of convolution + pooling
- Established the two-stage paradigm of "feature extraction → classification"
Limitations: Due to limited computational resources at the time, LeNet could not be scaled to larger image tasks.
AlexNet
Krizhevsky et al., "ImageNet Classification with Deep Convolutional Neural Networks", NeurIPS 2012
In 2012, AlexNet won the ImageNet competition by a dominant margin (reducing the top-5 error rate from 25.8% to 16.4%), marking the beginning of deep learning's dominance in computer vision. AlexNet was the first modern CNN, with each layer designed individually — the concept of "blocks" had not yet emerged.
Architecture (bottom to top):
[Input 224×224×3]
-> [Conv 11×11, 96, stride 4] -> [ReLU] -> [MaxPool 3×3, stride 2]
-> [Conv 5×5, 256, pad 2] -> [ReLU] -> [MaxPool 3×3, stride 2]
-> [Conv 3×3, 384, pad 1] -> [ReLU]
-> [Conv 3×3, 384, pad 1] -> [ReLU]
-> [Conv 3×3, 256, pad 1] -> [ReLU] -> [MaxPool 3×3, stride 2]
-> [FC 4096] -> [Dropout] -> [FC 4096] -> [Dropout] -> [FC 1000]
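A PyTorch sketch of this stack (single-GPU version) appears below. One assumption: padding=2 on the first 11×11 convolution, as in common re-implementations, because a 224×224 input does not divide evenly with stride 4 and no padding.

```python
import torch.nn as nn

# AlexNet layout from the diagram above (single-GPU variant).
# Assumption: padding=2 on the first conv, as in common re-implementations.
alexnet = nn.Sequential(
    nn.Conv2d(3, 96, 11, stride=4, padding=2), nn.ReLU(),
    nn.MaxPool2d(3, stride=2),
    nn.Conv2d(96, 256, 5, padding=2), nn.ReLU(),
    nn.MaxPool2d(3, stride=2),
    nn.Conv2d(256, 384, 3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 384, 3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(3, stride=2),
    nn.Flatten(),                                  # 256 * 6 * 6 = 9216 features
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(4096, 1000),
)
```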
Key Contributions:
- ReLU activation function: First large-scale use of ReLU in place of Sigmoid/Tanh in CNNs, alleviating the vanishing gradients caused by saturating activations and achieving approximately 6x faster training
- Dropout: Applied Dropout (p=0.5) in the fully connected layers to prevent overfitting
- GPU training: Trained across two GTX 580 GPUs using model parallelism, demonstrating that GPU acceleration makes large-scale CNN training practical
- Data augmentation: Employed random cropping, horizontal flipping, color jittering, and other augmentation techniques
- LRN (Local Response Normalization): Simulated lateral inhibition in biological neurons (later superseded by BatchNorm)
Limitations: The first layer used an oversized 11×11 convolution kernel with stride 4, causing significant spatial information loss; the fully connected layers accounted for over 95% of the total parameters.
VGGNet
Simonyan & Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition", ICLR 2015
VGGNet (2014) demonstrated the importance of depth — given the same computational resources, increasing depth is more effective than increasing width.
Core Idea — Block Structure:
VGG introduced the concept of block structure, elevating network design to a higher level of abstraction. Each VGG block consists of several \(3 \times 3\) convolutions (padding=1) followed by a \(2 \times 2\) MaxPooling layer (stride=2):
VGG Block:
[3×3 Conv, pad 1] -> [ReLU]
[3×3 Conv, pad 1] -> [ReLU]
[2×2 MaxPool, stride 2]
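A sketch of one such block as a PyTorch helper (the function name vgg_block is ours, not from the paper):

```python
import torch.nn as nn

# One VGG block: n_convs 3x3 convolutions (padding=1) followed by 2x2 max pooling
def vgg_block(n_convs, in_ch, out_ch):
    layers = []
    for _ in range(n_convs):
        layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
        in_ch = out_ch
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

# VGG-16's feature extractor stacks five such blocks:
# vgg_block(2, 3, 64), vgg_block(2, 64, 128), vgg_block(3, 128, 256),
# vgg_block(3, 256, 512), vgg_block(3, 512, 512)
```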
Why use 3×3 instead of larger kernels?
- Two stacked \(3 \times 3\) convolutions have the same receptive field as a single \(5 \times 5\) convolution
- Three stacked \(3 \times 3\) convolutions have the same receptive field as a single \(7 \times 7\)
- But with fewer parameters: \(3 \times (3^2 C^2) = 27C^2\) < \(7^2 C^2 = 49C^2\)
- Meanwhile, more nonlinearities are introduced (more ReLU layers), yielding stronger representational power
VGG-16 Architecture:
| Stage | Layers | Channels | Output Size |
|---|---|---|---|
| Block 1 | 2×Conv | 64 | 112×112 |
| Block 2 | 2×Conv | 128 | 56×56 |
| Block 3 | 3×Conv | 256 | 28×28 |
| Block 4 | 3×Conv | 512 | 14×14 |
| Block 5 | 3×Conv | 512 | 7×7 |
| Head | 3×FC | 4096, 4096, 1000 | — |

Limitations: Extremely large parameter count (VGG-16 has approximately 138 million parameters), with the fully connected layers accounting for over 80%. Training and inference are relatively slow.
Inception / GoogLeNet
Szegedy et al., "Going Deeper with Convolutions", CVPR 2015
GoogLeNet (2014) did not merely deepen the network — it widened it. The core idea is to use convolution kernels of different sizes simultaneously within the same layer, allowing the network to learn which scale is most effective on its own.
Inception Module:
Input
/ | \ \
1×1 3×3 5×5 MaxPool
Conv Conv Conv 3×3
\ | / /
Concatenate
Output
A \(1 \times 1\) convolution is applied before each branch for channel dimensionality reduction, drastically lowering the computational cost:
Input (256ch)
/ | \ \
1×1(64) 1×1(96) 1×1(16) MaxPool
| | |
3×3(128) 5×5(32) 1×1(32)
\ | / /
Concatenate (256ch)
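A PyTorch sketch of the module with the branch layout from the diagram; the parameter names (c1, c3_red, etc.) are illustrative.

```python
import torch
import torch.nn as nn

# Inception module: four parallel branches concatenated along the channel axis.
# c3_red / c5_red are the 1x1 bottleneck widths that precede the 3x3 / 5x5 convs.
class Inception(nn.Module):
    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        self.b1 = nn.Sequential(nn.Conv2d(in_ch, c1, 1), nn.ReLU())
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, c3_red, 1), nn.ReLU(),
                                nn.Conv2d(c3_red, c3, 3, padding=1), nn.ReLU())
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, c5_red, 1), nn.ReLU(),
                                nn.Conv2d(c5_red, c5, 5, padding=2), nn.ReLU())
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU())

    def forward(self, x):
        # Concatenate the four branch outputs along the channel dimension
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

# With the channel counts above: Inception(256, 64, 96, 128, 16, 32, 32)
# -> 64 + 128 + 32 + 32 = 256 output channels
```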
Key Contributions:
- Multi-scale feature extraction: Parallel use of \(1 \times 1\), \(3 \times 3\), and \(5 \times 5\) convolutions within the same layer
- The power of \(1 \times 1\) convolutions: Used as bottleneck layers for channel dimensionality reduction, cutting computational cost by an order of magnitude
- Global Average Pooling: Replaced large fully connected layers with global average pooling; GoogLeNet has only about 5 million parameters (1/27 of VGG)
- Auxiliary classifiers: Added auxiliary losses at intermediate layers to alleviate vanishing gradients in deep networks
ResNet
He et al., "Deep Residual Learning for Image Recognition", CVPR 2016
When networks exceed 20 layers, the vanishing gradient and degradation problems cause performance to actually decline — this is not overfitting, as even the training error is higher for deeper networks. In 2015, Kaiming He proposed ResNet, introducing residual connections (skip connections) that made it possible to train networks with hundreds or even thousands of layers.
ResNet is the core backbone of modern CNNs. Most subsequent architectures are essentially improvements built upon it. Its significance is on par with backpropagation and the Attention mechanism.
Core Formula:
\(y = F(x) + x\)
where \(F(x)\) is the transformation computed by the block's layers and \(x\) is the input carried over by the skip connection. That is: the output of a layer = the layer's computation + the original input.
Why do residual connections work?
- Direct gradient flow: During backpropagation, gradients can flow directly back to earlier layers through the skip connections, alleviating vanishing gradients
- Identity mapping as a fallback: If a layer does not need to learn any additional transformation, \(F(x)\) can learn to be zero, making the output simply \(x\) (an identity mapping). This is much easier than having the network learn an identity function from scratch
- Ensemble perspective: ResNet can be viewed as an ensemble of many paths of different depths
ResNet residual blocks come in two forms:
- Basic Block: x -> [3×3 Conv] -> [BN] -> [ReLU] -> [3×3 Conv] -> [BN] -> (+x) -> [ReLU]
- Bottleneck Block: a [1×1 Conv] -> [3×3 Conv] -> [1×1 Conv] stack used in the deeper ResNets (ResNet-50 and beyond). In either form, when the input and output dimensions differ, a 1×1 Conv is used to project \(x\) to the matching dimension before the addition (see the sketch after this list).
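A minimal PyTorch sketch of the Basic Block, assuming a 1×1 projection (with BN) on the shortcut whenever the stride or channel count changes:

```python
import torch.nn as nn

# ResNet Basic Block: two 3x3 convs with BN, plus the skip connection.
# Assumption: a 1x1 conv + BN projects x when stride > 1 or channels change.
class BasicBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        self.proj = None
        if stride != 1 or in_ch != out_ch:
            self.proj = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch))

    def forward(self, x):
        identity = x if self.proj is None else self.proj(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)   # F(x) + x
```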

ResNet-18 Architecture:
| Stage | Layers | Channels | Output Size |
|---|---|---|---|
| Conv1 | 1×Conv 7×7, stride 2 | 64 | 112×112 |
| Pool | MaxPool 3×3, stride 2 | 64 | 56×56 |
| Stage 1 | 2×BasicBlock | 64 | 56×56 |
| Stage 2 | 2×BasicBlock | 128 | 28×28 |
| Stage 3 | 2×BasicBlock | 256 | 14×14 |
| Stage 4 | 2×BasicBlock | 512 | 7×7 |
| Head | AvgPool + FC | 1000 | — |

A key design choice in ResNet is the inclusion of Batch Normalization after every convolutional layer to improve training.
Receptive Field: The region in the input image that corresponds to a particular output unit. By stacking multiple layers of small convolution kernels, the receptive field grows progressively larger.

DenseNet
Huang et al., "Densely Connected Convolutional Networks", CVPR 2017
While ResNet solved the problem of training deeper networks, DenseNet went further, aiming to ensure that no feature goes to waste.
Core Idea — Dense Connection:
In DenseNet, every layer receives feature maps from all preceding layers as input, not just the immediately previous one:
x0 → x1 → x2 → x3 → x4
└────┘ | | |
└─────────┘ | |
└──────────────┘ |
└───────────────────┘
That is, the input to layer \(\ell\) is the concatenation of all preceding layers' outputs: \(x_\ell = H_\ell([x_0, x_1, \ldots, x_{\ell-1}])\)
Dense Block:
Layer 1: BN -> ReLU -> 1×1 Conv(4k) -> BN -> ReLU -> 3×3 Conv(k)
Layer 2: receives [x0, x1] as input
Layer 3: receives [x0, x1, x2] as input
...
Here \(k\) is called the growth rate, meaning each layer adds only \(k\) new channels. Between Dense Blocks, a Transition Layer (\(1 \times 1\) Conv + \(2 \times 2\) AvgPool) is used for downsampling and channel compression.
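A PyTorch sketch of a Dense Block built from such layers; the class and function names are ours, and each layer follows the BN -> ReLU -> 1×1 Conv(4k) -> BN -> ReLU -> 3×3 Conv(k) pattern described above.

```python
import torch
import torch.nn as nn

# One DenseNet layer: bottleneck to 4k channels, then a 3x3 conv producing k channels
def dense_layer(in_ch, k):
    return nn.Sequential(
        nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, 4 * k, 1, bias=False),          # 1x1 bottleneck: 4k channels
        nn.BatchNorm2d(4 * k), nn.ReLU(inplace=True),
        nn.Conv2d(4 * k, k, 3, padding=1, bias=False),   # adds k new channels
    )

class DenseBlock(nn.Module):
    def __init__(self, n_layers, in_ch, k):
        super().__init__()
        # layer i sees in_ch + i*k input channels (all previous feature maps)
        self.layers = nn.ModuleList(
            dense_layer(in_ch + i * k, k) for i in range(n_layers))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # each layer receives the concatenation of all preceding outputs
            out = layer(torch.cat(features, dim=1))
            features.append(out)
        return torch.cat(features, dim=1)   # in_ch + n_layers*k channels
```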
Key Contributions:
- Feature reuse: All feature maps are shared with all subsequent layers, preventing information loss
- Parameter efficiency: Thanks to feature reuse, each layer only needs to learn a small number of new features (\(k\) is typically 12 or 32), resulting in far fewer parameters than ResNet
- Implicit deep supervision: Shallow features can be directly accessed by deeper layers and the loss function, producing an effect similar to auxiliary losses
- Smooth gradient flow: Every layer has a "short path" to the loss, making training more stable
MobileNet Series
Howard et al., "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications", 2017
MobileNet is designed specifically for mobile and embedded devices. Its core innovation is the Depthwise Separable Convolution.
Standard Convolution vs. Depthwise Separable Convolution:
Computational cost of standard convolution: \(k^2 \times C_{in} \times C_{out} \times H \times W\)
Depthwise separable convolution decomposes it into two steps:
Standard Convolution (single-step):
Input [H×W×Cin] --[k×k×Cin×Cout]--> Output [H×W×Cout]
Depthwise Separable Convolution (two steps):
Step 1 - Depthwise Conv (independent convolution per channel):
Input [H×W×Cin] --[k×k×1×Cin]--> [H×W×Cin]
Step 2 - Pointwise Conv (1×1 convolution to mix channels):
[H×W×Cin] --[1×1×Cin×Cout]--> Output [H×W×Cout]
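A PyTorch sketch of the two-step factorization; the BN and ReLU after each step are assumptions in the style of MobileNet V1.

```python
import torch.nn as nn

# Depthwise separable convolution: per-channel k x k depthwise conv
# (groups = C_in) followed by a 1x1 pointwise conv that mixes channels.
def depthwise_separable(c_in, c_out, k=3):
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, k, padding=k // 2, groups=c_in, bias=False),  # depthwise
        nn.BatchNorm2d(c_in), nn.ReLU(inplace=True),
        nn.Conv2d(c_in, c_out, 1, bias=False),                              # pointwise
        nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
    )
```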
Computational Cost Comparison:
Depthwise separable convolution costs \(k^2 \times C_{in} \times H \times W + C_{in} \times C_{out} \times H \times W\), i.e. a fraction \(\frac{1}{C_{out}} + \frac{1}{k^2}\) of the standard convolution's cost. For \(3 \times 3\) convolutions, this is approximately \(\frac{1}{8} \sim \frac{1}{9}\) that of standard convolutions.
MobileNet V2 Improvements:
Introduced the Inverted Residual Block:
Inverted Residual Block:
Input(thin) -> [1×1 Conv expand] -> [BN+ReLU6]
-> [3×3 Depthwise Conv] -> [BN+ReLU6]
-> [1×1 Conv compress] -> [BN]
-> (+Input)
This is the opposite of ResNet's residual block: MobileNet V2 "expands" in the middle (thin → fat → thin), while ResNet "compresses" in the middle (fat → thin → fat).
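A sketch of the inverted residual block in PyTorch; the expansion factor t=6 and the rule of applying the skip connection only when stride is 1 and the channel counts match are assumptions taken from the MobileNet V2 design.

```python
import torch.nn as nn

# Inverted residual block: expand (1x1) -> depthwise (3x3) -> compress (1x1, no activation)
class InvertedResidual(nn.Module):
    def __init__(self, c_in, c_out, stride=1, t=6):
        super().__init__()
        hidden = c_in * t
        self.use_skip = (stride == 1 and c_in == c_out)
        self.block = nn.Sequential(
            nn.Conv2d(c_in, hidden, 1, bias=False),                 # 1x1 expand
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1,  # 3x3 depthwise
                      groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, c_out, 1, bias=False),                # 1x1 compress (linear)
            nn.BatchNorm2d(c_out),
        )

    def forward(self, x):
        return x + self.block(x) if self.use_skip else self.block(x)
```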
ShuffleNet
Zhang et al., "ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices", CVPR 2018
ShuffleNet further optimizes upon MobileNet by introducing the Channel Shuffle operation.
Problem: Although Group Convolution reduces computational cost, each group operates in isolation with no cross-group information exchange, limiting feature expressiveness.
Solution — Channel Shuffle:
Channel shuffle after group convolution:
Group 1: [a1, a2, a3]                      [a1, b1, c1]
Group 2: [b1, b2, b3]  --after shuffle-->  [a2, b2, c2]
Group 3: [c1, c2, c3]                      [a3, b3, c3]
By rearranging the channel order, information from different groups can mix and interact in the next layer, with virtually no additional computational cost.
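A sketch of channel shuffle as a reshape-transpose-reshape, which is how it is typically implemented:

```python
import torch

# Channel shuffle: view channels as (groups, channels_per_group),
# transpose the two axes, then flatten back to a single channel axis.
def channel_shuffle(x, groups):
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

# e.g. 9 channels in 3 groups: [a1,a2,a3, b1,b2,b3, c1,c2,c3]
#      -> [a1,b1,c1, a2,b2,c2, a3,b3,c3]
```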
Key Contributions:
- Channel Shuffle: Enables cross-group information exchange with zero computational overhead
- Pointwise Group Convolution: Applies grouping to \(1 \times 1\) convolutions as well, further reducing computational cost
- Achieves higher accuracy than MobileNet V1 under the same computational budget
Summary of CNN Architecture Evolution
| Year | Architecture | Depth | Core Innovation | ImageNet Top-5 Error |
|---|---|---|---|---|
| 1998 | LeNet-5 | 5 | Conv+Pool paradigm | — |
| 2012 | AlexNet | 8 | ReLU, Dropout, GPU | 16.4% |
| 2014 | VGGNet | 16/19 | Stacked 3×3 small kernels, block structure | 7.3% |
| 2014 | GoogLeNet | 22 | Inception module, 1×1 convolution | 6.7% |
| 2015 | ResNet | 152+ | Residual connections | 3.6% |
| 2017 | DenseNet | 121+ | Dense connections, feature reuse | — |
| 2017 | MobileNet | 28 | Depthwise separable convolution | — |
| 2018 | ShuffleNet | — | Channel shuffle | — |
Evolution Trajectory:
- Going deeper: LeNet(5) → AlexNet(8) → VGG(19) → ResNet(152+)
- Getting smarter: Inception (multi-scale) → ResNet (residual) → DenseNet (dense connections)
- Getting lighter: MobileNet (depthwise separable) → ShuffleNet (channel shuffle)