# Deep Learning Milestones

## Overview

The history of deep learning is a story of continuous breakthroughs. From AlexNet igniting the deep learning revolution in 2012 to MoE and SSM reshaping architecture paradigms in 2024, each milestone solved a critical problem and opened new directions.

```mermaid
timeline
    title Deep Learning Milestones
    2012 : AlexNet
    2014 : GAN, VGG, LSTM Seq2Seq
    2015 : ResNet
    2017 : Transformer
    2018 : BERT, GPT
    2019 : GPT-2
    2020 : GPT-3, ViT, DDPM
    2021 : CLIP, DALL-E
    2022 : ChatGPT, Stable Diffusion, S4
    2023 : GPT-4, LLaMA, Mamba
    2024 : Mixtral, Sora, Flux, DeepSeek-V3
    2025 : DeepSeek-R1, Llama 4
```

## 1. Convolutional Network Era (2012-2017)

### 1.1 AlexNet (2012)

| Item | Details |
| --- | --- |
| Problem | ImageNet classification accuracy had stagnated |
| Insight | Deep CNN + GPU training + ReLU + Dropout |
| Impact | Cut Top-5 error from 26.2% to 15.3%, launching the deep learning era |
| Paper | Krizhevsky et al., NeurIPS 2012 |
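
The recipe is easy to reproduce in outline. Below is a minimal PyTorch sketch (PyTorch is our choice here, not something the original used; layer sizes are illustrative, not AlexNet's actual configuration) showing the three ingredients the table names: a strided convolution stack, ReLU activations, and dropout on the classifier head.

```python
import torch
import torch.nn as nn

# Sketch of the AlexNet recipe: conv + ReLU features, dropout-regularized head.
net = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2),
    nn.ReLU(inplace=True),                # non-saturating activation
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Flatten(),
    nn.Dropout(p=0.5),                    # regularizes the classifier head
    nn.LazyLinear(1000),                  # 1000 ImageNet classes
)

logits = net(torch.randn(1, 3, 224, 224))  # -> (1, 1000)
```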

### 1.2 VGG / GoogLeNet (2014)

- VGG: Demonstrated the importance of depth (16-19 layers) and small 3x3 kernels; two stacked 3x3 convolutions cover a 5x5 receptive field with fewer parameters and an extra nonlinearity (see the sketch below)
- GoogLeNet: Inception modules for multi-scale features
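
To make the small-kernel argument concrete, here is a hypothetical PyTorch snippet (not from the paper) that counts parameters for the two options:

```python
import torch.nn as nn

C = 64  # channel count, illustrative

# One 5x5 conv vs. two stacked 3x3 convs: same 5x5 receptive field,
# fewer parameters, plus a nonlinearity in between (the VGG insight).
conv5 = nn.Conv2d(C, C, kernel_size=5, padding=2)
stack3 = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(C, C, kernel_size=3, padding=1),
)

params = lambda m: sum(p.numel() for p in m.parameters())
print(params(conv5))   # 64*64*25 + 64      = 102464
print(params(stack3))  # 2 * (64*64*9 + 64) = 73856
```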

### 1.3 ResNet (2015)

| Item | Details |
| --- | --- |
| Problem | Degradation: deeper plain networks train worse than shallower ones |
| Insight | Residual connections \(y = F(x) + x\); learning the residual is easier than learning the full mapping |
| Impact | 152-layer networks; 3.57% Top-5 error, below the oft-cited ~5.1% human estimate |
| Paper | He et al., CVPR 2016 |
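
The residual connection is a one-line change to a forward pass. A minimal PyTorch sketch of an identity-shortcut block, simplified from the paper's design (no downsampling or projection shortcuts):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal sketch of y = F(x) + x with an identity shortcut."""
    def __init__(self, channels: int):
        super().__init__()
        self.f = nn.Sequential(                 # the residual branch F(x)
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.f(x) + x)         # residual F(x) plus identity x

x = torch.randn(1, 64, 32, 32)
print(ResidualBlock(64)(x).shape)               # torch.Size([1, 64, 32, 32])
```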

### 1.4 GAN (2014)

| Item | Details |
| --- | --- |
| Problem | Generative models produced poor-quality samples |
| Insight | Generator vs. discriminator adversarial game |
| Impact | Created the adversarial training paradigm; StyleGAN and others followed |
| Paper | Goodfellow et al., NeurIPS 2014 |
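
A toy PyTorch sketch of one step of the two-player game, assuming 1-D Gaussian "data" and tiny MLPs for both players (real GANs use deep conv nets and many stabilization tricks):

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))  # generator
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))  # discriminator
bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

real = torch.randn(64, 1) * 0.5 + 2.0          # toy "real" data: N(2, 0.5)
z = torch.randn(64, 8)                         # latent noise

# Discriminator step: push real -> 1, fake -> 0
fake = G(z).detach()                           # don't backprop into G here
loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: fool D into predicting 1 on fakes (non-saturating loss)
loss_g = bce(D(G(z)), torch.ones(64, 1))
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```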

## 2. Transformer Revolution (2017-2020)

### 2.1 Transformer (2017)

| Item | Details |
| --- | --- |
| Problem | RNNs compute sequentially, limiting parallelism and long-range dependencies |
| Insight | Self-attention, abandoning recurrence entirely |
| Impact | Became the foundational architecture across nearly all AI domains |
| Paper | Vaswani et al., "Attention Is All You Need," NeurIPS 2017 |
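
The core operation fits in a few lines. A minimal single-head PyTorch sketch (no masking, no multi-head split, no output projection):

```python
import math
import torch

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention: softmax(QK^T / sqrt(d)) V."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)  # (seq, seq) affinities
    return torch.softmax(scores, dim=-1) @ v         # weighted mix of values

seq_len, d_model = 5, 16
x = torch.randn(seq_len, d_model)
w = lambda: torch.randn(d_model, d_model) / math.sqrt(d_model)
print(self_attention(x, w(), w(), w()).shape)        # torch.Size([5, 16])
```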

### 2.2 BERT (2018)

| Item | Details |
| --- | --- |
| Problem | Language-model pretraining conditioned on left-to-right context only |
| Insight | Bidirectional masked-language-model pretraining |
| Impact | Established the NLP pretraining paradigm; SOTA on 11 tasks |
| Paper | Devlin et al., NAACL 2019 |
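
Masked-LM pretraining reduces to corrupting inputs and scoring predictions only at the corrupted positions. A simplified PyTorch sketch (real BERT replaces 80% of chosen tokens with [MASK], 10% with random tokens, and keeps 10%; this sketch masks all of them):

```python
import torch

MASK_ID = 103                            # [MASK] id in BERT's WordPiece vocab
tokens = torch.randint(1000, 2000, (1, 12))  # placeholder token ids
labels = tokens.clone()

mask = torch.rand(tokens.shape) < 0.15   # choose ~15% of positions to hide
inputs = tokens.masked_fill(mask, MASK_ID)
labels[~mask] = -100                     # ignore unmasked positions in the loss

# loss = cross_entropy(model(inputs).view(-1, V), labels.view(-1))
```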

### 2.3 GPT / GPT-2 (2018-2019)

- GPT: Autoregressive pretraining + task-specific fine-tuning
- GPT-2: 1.5B parameters, initially withheld as "too dangerous to release"; demonstrated zero-shot capabilities

### 2.4 GPT-3 (2020)

| Item | Details |
| --- | --- |
| Problem | Every task required its own fine-tuned model |
| Insight | 175B parameters + in-context learning |
| Impact | Few-shot/zero-shot general capabilities; evidence for scaling laws |
| Paper | Brown et al., NeurIPS 2020 |
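
In-context learning needs no gradient updates at all: the demonstrations live in the prompt and act as the "training set." The example below reuses the English-French pairs from the GPT-3 paper's own illustration:

```python
# Few-shot prompting: the model continues the pattern it sees in the prompt,
# with no fine-tuning and no weight updates.
prompt = """Translate English to French.

sea otter => loutre de mer
cheese => fromage
peppermint =>"""
# A capable LLM completing this prompt should output " menthe poivrée".
```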

## 3. Vision Revolution (2020-2022)

### 3.1 ViT (2020)

| Item | Details |
| --- | --- |
| Problem | Vision was still dominated by CNNs |
| Insight | Image patches fed to a standard Transformer; "an image is a sequence" |
| Impact | Unified vision and NLP architectures |
| Paper | Dosovitskiy et al., ICLR 2021 |
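
Patch embedding is just a reshape plus a linear projection. A minimal PyTorch sketch with ViT-Base-style sizes (16x16 patches, 768-dim tokens); the class token and position embeddings are omitted:

```python
import torch

# Split a 224x224 image into 16x16 patches and project each flattened
# patch to d_model, yielding a 196-token sequence for the Transformer.
B, C, H, W, P, d_model = 1, 3, 224, 224, 16, 768
img = torch.randn(B, C, H, W)

patches = img.unfold(2, P, P).unfold(3, P, P)          # (B, C, 14, 14, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)
tokens = torch.nn.Linear(C * P * P, d_model)(patches)  # (B, 196, 768)
print(tokens.shape)
```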

### 3.2 DDPM (2020)

| Item | Details |
| --- | --- |
| Problem | GAN training is unstable; VAEs generate blurry images |
| Insight | Fixed forward noising + learned reverse denoising |
| Impact | Launched the diffusion-model era, displacing GANs |
| Paper | Ho et al., NeurIPS 2020 |
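
Thanks to the closed form of the forward process, training never iterates over noising steps. A minimal PyTorch sketch using the paper's linear beta schedule:

```python
import torch

# Forward (noising) process in closed form:
# x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # linear schedule from the paper
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative signal retention

x0 = torch.randn(1, 3, 32, 32)                 # stand-in for a clean image
t = 500
eps = torch.randn_like(x0)
xt = alpha_bar[t].sqrt() * x0 + (1 - alpha_bar[t]).sqrt() * eps

# Training target: a network eps_theta(x_t, t) regresses eps with an MSE loss.
```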

### 3.3 CLIP (2021)

| Item | Details |
| --- | --- |
| Problem | Vision models required task-specific labeled data |
| Insight | Contrastive pretraining on 400M image-text pairs |
| Impact | Zero-shot visual classification; a multimodal foundation |
| Paper | Radford et al., ICML 2021 |
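
The training objective is a symmetric cross-entropy over an image-text similarity matrix. A minimal PyTorch sketch with random stand-in embeddings (in CLIP the temperature is a learned parameter, capped so the scale never exceeds 100):

```python
import torch
import torch.nn.functional as F

# Matched image-text pairs sit on the diagonal of the similarity matrix.
B, d = 8, 512
img = F.normalize(torch.randn(B, d), dim=-1)   # unit-norm image embeddings
txt = F.normalize(torch.randn(B, d), dim=-1)   # unit-norm text embeddings
logit_scale = torch.tensor(100.0)              # temperature scaling

logits = logit_scale * img @ txt.t()           # (B, B) cosine similarities
targets = torch.arange(B)                      # i-th image matches i-th text
loss = (F.cross_entropy(logits, targets) +
        F.cross_entropy(logits.t(), targets)) / 2
```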

### 3.4 Stable Diffusion (2022)

| Item | Details |
| --- | --- |
| Problem | Diffusion in pixel space is too slow |
| Insight | Diffusion in a VAE latent space (VAE encoder + UNet denoiser) |
| Impact | Open-source text-to-image revolution |
| Paper | Rombach et al., CVPR 2022 |
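
The speedup is visible from the shapes alone: Stable Diffusion v1 denoises a 64x64x4 latent instead of a 512x512x3 image.

```python
# Why latent diffusion is cheap: denoising runs in the VAE's latent space.
pixel = 512 * 512 * 3   # RGB image: 786,432 values
latent = 64 * 64 * 4    # 8x-downsampled, 4-channel latent: 16,384 values
print(pixel / latent)   # 48.0 -> each denoising step touches ~48x less data
```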

## 4. LLM Era (2022-2024)

### 4.1 ChatGPT / InstructGPT (2022)

| Item | Details |
| --- | --- |
| Problem | Raw LLMs follow instructions poorly and can generate harmful content |
| Insight | SFT + RLHF alignment |
| Impact | Brought AI into mainstream awareness; triggered the LLM boom |
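
One concrete piece of the RLHF pipeline is the reward model, trained on human preference pairs. A minimal PyTorch sketch of its pairwise loss (the scalar rewards here are made-up stand-ins for reward-model outputs):

```python
import torch
import torch.nn.functional as F

# Reward-model objective: given a human preference "chosen > rejected",
# maximize sigma(r_chosen - r_rejected).
r_chosen = torch.tensor([1.3, 0.2])     # rewards for preferred responses
r_rejected = torch.tensor([0.4, -0.5])  # rewards for dispreferred responses
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
# The policy is then optimized (e.g., with PPO) against this learned reward.
```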

### 4.2 LLaMA (2023)

| Item | Details |
| --- | --- |
| Problem | Capable large models were monopolized by a few big companies |
| Insight | Compute-efficient training + open weights |
| Impact | Explosion of the open-source LLM ecosystem |
| Paper | Touvron et al., 2023 |

### 4.3 GPT-4 (2023)

| Item | Details |
| --- | --- |
| Problem | Need for stronger reasoning and multimodal capabilities |
| Insight | Larger scale + multimodal input + better alignment |
| Impact | Broad, human-competitive performance on professional and academic benchmarks |

## 5. Architecture Innovation (2022-2025)

### 5.1 S4 / Mamba (2022-2023)

| Item | Details |
| --- | --- |
| Problem | The Transformer's quadratic attention cost limits sequence length |
| Insight | Structured state spaces + input-dependent (selective) dynamics |
| Impact | A linear-complexity alternative to Transformers |
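
At its core an SSM layer is a linear recurrence, which is why its cost grows linearly with sequence length. A toy PyTorch sketch with fixed (non-selective) dynamics; Mamba additionally makes B, C, and the step size functions of the input:

```python
import torch

# Discretized state-space recurrence:
# h_t = A h_{t-1} + B x_t,   y_t = C h_t
d_state, seq_len = 4, 10
A = torch.eye(d_state) * 0.9                 # toy stable state matrix
B = torch.randn(d_state, 1)
C = torch.randn(1, d_state)

x = torch.randn(seq_len)
h = torch.zeros(d_state, 1)
ys = []
for t in range(seq_len):                     # O(seq_len): no attention matrix
    h = A @ h + B * x[t]
    ys.append((C @ h).item())
print(ys)
```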

### 5.2 Mixtral / MoE (2024)

| Item | Details |
| --- | --- |
| Problem | Scaling dense models is compute-inefficient |
| Insight | Sparse mixture of experts: many parameters, little compute per token |
| Impact | MoE became a mainstream large-model architecture |
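
A minimal PyTorch sketch of Mixtral-style top-2 routing (hypothetical sizes; real MoE layers batch tokens per expert instead of looping, and add load-balancing losses):

```python
import torch
import torch.nn as nn

# Each token is processed by only 2 of 8 expert FFNs, so active compute per
# token stays small while total parameters grow with the expert count.
n_experts, top_k, d = 8, 2, 16
experts = nn.ModuleList(nn.Linear(d, d) for _ in range(n_experts))
router = nn.Linear(d, n_experts)

x = torch.randn(5, d)                                  # 5 tokens
weights, idx = router(x).softmax(-1).topk(top_k, -1)   # per-token expert choice
weights = weights / weights.sum(-1, keepdim=True)      # renormalize over top-k

out = torch.zeros_like(x)
for t in range(x.size(0)):                             # loop for clarity only
    for i in range(top_k):
        out[t] += weights[t, i] * experts[int(idx[t, i])](x[t])
```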

### 5.3 DeepSeek-V3 / R1 (2024-2025)

| Item | Details |
| --- | --- |
| Problem | Strong reasoning remained scarce, especially in open models |
| Insight | MoE + Multi-head Latent Attention (MLA) + reinforcement learning for reasoning |
| Impact | Open-source models caught up with closed-source; reasoning breakthrough |

### 5.4 Sora / Flux (2024)

| Item | Details |
| --- | --- |
| Problem | Long, coherent video generation is difficult |
| Insight | Diffusion Transformer (DiT) + flow matching + spatiotemporal attention |
| Impact | Video generation entered a new phase |
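
Flow matching replaces iterative noising with a simple regression target along a straight path between noise and data. A minimal PyTorch sketch (`v_model` is a hypothetical neural network, left as a comment):

```python
import torch

# (Rectified) flow matching: train a velocity field v(x_t, t) to point from
# noise x0 toward data x1 along the straight interpolation path.
x1 = torch.randn(16, 8)              # stand-in for data samples
x0 = torch.randn(16, 8)              # noise samples
t = torch.rand(16, 1)                # random times in [0, 1]

xt = (1 - t) * x0 + t * x1           # linear interpolation path
v_target = x1 - x0                   # constant velocity along that path
# loss = ((v_model(xt, t) - v_target) ** 2).mean()  # v_model: a neural net
```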

## 6. Trends

### 6.1 Architecture Evolution

```mermaid
graph LR
    A[CNN 2012] --> B[RNN/LSTM 2015]
    B --> C[Transformer 2017]
    C --> D[Sparse MoE 2024]
    C --> E[SSM/Mamba 2023]
    D --> F[Hybrid Architectures 2024+]
    E --> F
```

### 6.2 Paradigm Shifts

| Period | Paradigm | Core |
| --- | --- | --- |
| 2012-2017 | Supervised learning | Big data + deep networks |
| 2018-2020 | Pretrain + fine-tune | BERT, GPT |
| 2020-2022 | Scaling laws | Bigger is better |
| 2022-2024 | Alignment + instruction | RLHF, SFT |
| 2024-2025 | Reasoning scaling | Test-time compute, RL |
| 2025+ | Agents + tools | AI as intelligent agents |

## References

See citations within each section above for individual milestone papers.

