# Deep Learning Milestones

## Overview

The history of deep learning is a story of continuous breakthroughs. From AlexNet igniting the deep learning revolution in 2012 to MoE and SSM reshaping architecture paradigms in 2024, each milestone solved a critical problem and opened new directions.
```mermaid
timeline
    title Deep Learning Milestones
    2012 : AlexNet
    2014 : GAN, VGG, LSTM Seq2Seq
    2015 : ResNet
    2017 : Transformer
    2018 : BERT, GPT
    2019 : GPT-2
    2020 : GPT-3, ViT, DDPM
    2021 : CLIP, DALL-E
    2022 : ChatGPT, Stable Diffusion, S4
    2023 : GPT-4, LLaMA, Mamba
    2024 : Mixtral, Sora, Flux, DeepSeek-V3
    2025 : DeepSeek-R1, Llama 4
```
## 1. Convolutional Network Era (2012-2017)

### 1.1 AlexNet (2012)

| Item | Details |
|------|---------|
| Problem | ImageNet classification accuracy had stagnated |
| Insight | Deep CNN + GPU training + ReLU + Dropout |
| Impact | Cut top-5 error from 26% to ~16%, launching the deep learning era |
| Paper | Krizhevsky et al., NeurIPS 2012 |
### 1.2 VGG / GoogLeNet (2014)

- VGG: Demonstrated the importance of depth (16-19 layers) and small 3x3 kernels; two stacked 3x3 convolutions cover a 5x5 receptive field with fewer parameters (see the comparison below)
- GoogLeNet: Inception module, multi-scale features
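
A quick back-of-the-envelope check of the small-kernel argument (a toy calculation; `C` is an arbitrary channel count and biases are ignored):

```python
# Two stacked 3x3 convs see the same 5x5 receptive field as one 5x5 conv,
# with fewer parameters and an extra nonlinearity in between.
C = 256                                  # arbitrary channel count (in == out)
params_two_3x3 = 2 * (3 * 3 * C * C)     # 1,179,648
params_one_5x5 = 5 * 5 * C * C           # 1,638,400
print(params_two_3x3 / params_one_5x5)   # 0.72 -- ~28% fewer parameters
```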
### 1.3 ResNet (2015)

| Item | Details |
|------|---------|
| Problem | Accuracy degraded as plain networks grew deeper |
| Insight | Residual connections \(y = F(x) + x\); learning the residual is easier than learning the full mapping |
| Impact | 152-layer networks, 3.57% top-5 error, below the ~5% human-level error |
| Paper | He et al., CVPR 2016 |
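
A minimal PyTorch sketch of the idea (a simplified basic block; sizes are illustrative, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Minimal residual block: output = F(x) + x (identity shortcut)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = x                          # identity shortcut
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + residual)      # learn F(x), then add x back

x = torch.randn(1, 64, 32, 32)
print(BasicBlock(64)(x).shape)  # torch.Size([1, 64, 32, 32])
```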
### 1.4 GAN (2014)

| Item | Details |
|------|---------|
| Problem | Poor generative model quality |
| Insight | Generator-discriminator adversarial game |
| Impact | Created the adversarial training paradigm; StyleGAN and others followed |
| Paper | Goodfellow et al., NeurIPS 2014 |
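
A toy PyTorch sketch of one adversarial update (hypothetical tiny networks on 2-D data; real GANs use convolutional architectures and many stabilization tricks):

```python
import torch
import torch.nn as nn

# The adversarial game: D learns real -> 1, fake -> 0; G learns to make D
# output 1 on fakes (the non-saturating generator loss).
G = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))
bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

real = torch.randn(64, 2) + 3.0     # stand-in "real" samples
z = torch.randn(64, 16)             # latent noise

# Discriminator step (detach fakes so G gets no gradient here)
loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(G(z).detach()), torch.zeros(64, 1))
opt_d.zero_grad()
loss_d.backward()
opt_d.step()

# Generator step: fool D into predicting 1 on fakes
loss_g = bce(D(G(z)), torch.ones(64, 1))
opt_g.zero_grad()
loss_g.backward()
opt_g.step()
```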
## 2. Transformer Era (2017-2020)

### 2.1 Transformer (2017)

| Item | Details |
|------|---------|
| Problem | RNNs' sequential bottleneck and difficulty with long-range dependencies |
| Insight | Self-attention mechanism, completely abandoning recurrence |
| Impact | Became the foundational architecture across nearly all AI domains |
| Paper | Vaswani et al., "Attention Is All You Need," NeurIPS 2017 |
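
The core equation, \(\mathrm{Attention}(Q,K,V)=\mathrm{softmax}(QK^\top/\sqrt{d_k})V\), fits in a few lines; a minimal sketch with toy shapes:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# Toy shapes: batch=2, seq_len=5, d_k=8 (illustrative sizes only)
q = k = v = torch.randn(2, 5, 8)
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 5, 8])
```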
### 2.2 BERT (2018)

| Item | Details |
|------|---------|
| Problem | Language model pretraining was unidirectional only |
| Insight | Bidirectional masked language model pretraining |
| Impact | Established the NLP pretraining paradigm, SOTA on 11 tasks |
| Paper | Devlin et al., NAACL 2019 |
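
A simplified sketch of BERT's 80/10/10 masking rule (ignoring special tokens and padding; `mask_token_id` and sizes are illustrative):

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """Pick ~15% of positions as MLM targets; of those, 80% -> [MASK],
    10% -> random token, 10% -> left unchanged."""
    labels = input_ids.clone()
    masked = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
    labels[~masked] = -100                 # ignore non-masked positions in loss

    input_ids = input_ids.clone()
    replace = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & masked
    input_ids[replace] = mask_token_id     # 80% of targets become [MASK]

    rand = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & masked & ~replace
    input_ids[rand] = torch.randint(vocab_size, input_ids.shape)[rand]
    return input_ids, labels               # remaining 10% stay unchanged

ids = torch.randint(1000, (2, 12))
masked_ids, labels = mask_tokens(ids, mask_token_id=103, vocab_size=1000)
```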
### 2.3 GPT / GPT-2 (2018-2019)
- GPT: Autoregressive pretraining + fine-tuning
- GPT-2: 1.5B parameters, "too dangerous to release," demonstrated zero-shot capabilities
### 2.4 GPT-3 (2020)

| Item | Details |
|------|---------|
| Problem | Models required fine-tuning for each task |
| Insight | 175B parameters, in-context learning |
| Impact | Few-shot/zero-shot general capabilities, scaling laws |
| Paper | Brown et al., NeurIPS 2020 |
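
In-context learning means the "training examples" live in the prompt itself; a minimal illustration (hypothetical demonstrations, no API calls):

```python
# Few-shot prompting: the model infers the task from the pattern in the
# prompt -- no gradient updates, unlike fine-tuning.
examples = [("cheese", "fromage"), ("house", "maison")]   # hypothetical demos
query = "book"

prompt = "Translate English to French:\n"
prompt += "".join(f"{en} => {fr}\n" for en, fr in examples)
prompt += f"{query} =>"
print(prompt)   # a model completing this prompt should continue with "livre"
```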
## 3. Vision Revolution (2020-2022)

### 3.1 ViT (2020)

| Item | Details |
|------|---------|
| Problem | Vision was still dominated by CNNs |
| Insight | Image patches + standard Transformer; "an image is a sequence" |
| Impact | Unified vision and NLP architectures |
| Paper | Dosovitskiy et al., ICLR 2021 |
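
A minimal sketch of the patch step (before the linear projection, class token, and position embeddings are added):

```python
import torch

def patchify(images, patch_size=16):
    """Split (B, C, H, W) images into a sequence of flattened patches:
    (B, num_patches, patch_size*patch_size*C) -- the ViT input tokens."""
    B, C, H, W = images.shape
    p = patch_size
    x = images.reshape(B, C, H // p, p, W // p, p)
    x = x.permute(0, 2, 4, 3, 5, 1)        # (B, H/p, W/p, p, p, C)
    return x.reshape(B, (H // p) * (W // p), p * p * C)

imgs = torch.randn(2, 3, 224, 224)
tokens = patchify(imgs)
print(tokens.shape)  # torch.Size([2, 196, 768]) -- 14x14 patches of dim 768
```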
### 3.2 DDPM (2020)

| Item | Details |
|------|---------|
| Problem | GAN training is unstable; VAEs generate blurry images |
| Insight | Fixed forward noising + learned reverse denoising |
| Impact | Launched the diffusion model era, largely displacing GANs |
| Paper | Ho et al., NeurIPS 2020 |
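
The forward process has a closed form, \(x_t=\sqrt{\bar\alpha_t}\,x_0+\sqrt{1-\bar\alpha_t}\,\epsilon\); a minimal sketch using the paper's linear \(\beta\) schedule (toy data standing in for images):

```python
import torch

# q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I); the network is trained
# to predict eps from x_t.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # linear schedule from the paper
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def q_sample(x0, t, eps):
    abar = alphas_bar[t].view(-1, 1, 1, 1)
    return abar.sqrt() * x0 + (1.0 - abar).sqrt() * eps

x0 = torch.randn(4, 3, 32, 32)                  # stand-in "images"
t = torch.randint(T, (4,))
eps = torch.randn_like(x0)
x_t = q_sample(x0, t, eps)
# Training loss: || eps - eps_theta(x_t, t) ||^2 for a denoising network.
```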
### 3.3 CLIP (2021)

| Item | Details |
|------|---------|
| Problem | Vision models required task-specific labeled data |
| Insight | Contrastive pretraining on 400M image-text pairs |
| Impact | Zero-shot visual classification, a multimodal foundation |
| Paper | Radford et al., ICML 2021 |
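
A minimal sketch of the symmetric contrastive objective (random embeddings stand in for the image and text encoders; the temperature is learned in the real model):

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch: matched image-text pairs are positives,
    all other pairings in the batch are negatives."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (B, B) similarities
    targets = torch.arange(logits.size(0))            # diagonal = correct pair
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = clip_loss(torch.randn(8, 512), torch.randn(8, 512))
```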
### 3.4 Stable Diffusion (2022)

| Item | Details |
|------|---------|
| Problem | Diffusion models are too slow in pixel space |
| Insight | Latent-space diffusion (VAE encoding + UNet denoising) |
| Impact | Open-source text-to-image revolution |
| Paper | Rombach et al., CVPR 2022 |
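
A schematic of the latent-space idea with stand-in modules (not the real SD VAE or UNet; the shapes mimic SD v1's 8x downsampling to a 4-channel latent):

```python
import torch
import torch.nn as nn

# Diffuse in a compressed latent space instead of pixel space, then decode.
encoder = nn.Conv2d(3, 4, 8, stride=8)           # stand-in VAE encoder
decoder = nn.ConvTranspose2d(4, 3, 8, stride=8)  # stand-in VAE decoder

image = torch.randn(1, 3, 512, 512)
z = encoder(image)                    # (1, 4, 64, 64): 64x fewer positions
print(z.shape)
# A UNet denoises z over many steps (text-conditioned via cross-attention),
# which is far cheaper than denoising 512x512x3 pixels directly.
z_denoised = z                        # placeholder for the diffusion loop
out = decoder(z_denoised)             # back to pixel space
print(out.shape)                      # torch.Size([1, 3, 512, 512])
```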
## 4. LLM Era (2022-2024)

### 4.1 ChatGPT / InstructGPT (2022)

| Item | Details |
|------|---------|
| Problem | Raw LLMs don't reliably follow instructions and can generate harmful content |
| Insight | SFT + RLHF alignment techniques |
| Impact | AI entered mainstream awareness, triggering the LLM boom |
| Paper | Ouyang et al., NeurIPS 2022 |
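
One concrete piece of the pipeline is the reward model's pairwise preference loss (a sketch; the full recipe adds supervised fine-tuning before and PPO optimization against the reward after):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen, r_rejected):
    """Pairwise preference loss for RLHF reward models:
    -log sigmoid(r_chosen - r_rejected)  (a Bradley-Terry model)."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Stand-in scalar rewards for a batch of human preference pairs
r_chosen = torch.randn(16, requires_grad=True)
r_rejected = torch.randn(16)
loss = reward_model_loss(r_chosen, r_rejected)
loss.backward()
```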
### 4.2 LLaMA (2023)

| Item | Details |
|------|---------|
| Problem | Capable large models were closed and concentrated in a few big companies |
| Insight | Efficient training (smaller models, more tokens) + open weights |
| Impact | Open-source LLM ecosystem explosion |
| Paper | Touvron et al., 2023 |
### 4.3 GPT-4 (2023)

| Item | Details |
|------|---------|
| Problem | Need for stronger reasoning and multimodal capabilities |
| Insight | Larger scale + multimodal input + better alignment |
| Impact | Human-level performance on many professional and academic benchmarks |
| Paper | OpenAI, 2023 (technical report) |
## 5. Architecture Innovation (2022-2025)

### 5.1 S4 / Mamba (2022-2023)

| Item | Details |
|------|---------|
| Problem | The Transformer's quadratic attention complexity limits long sequences |
| Insight | Structured state spaces + a selective (input-dependent) mechanism |
| Impact | A linear-complexity alternative to Transformers |
| Paper | Gu et al., ICLR 2022; Gu & Dao, 2023 |
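
A toy single-channel sketch of the underlying recurrence \(h_t = A h_{t-1} + B x_t,\ y_t = C h_t\) (S4 computes this efficiently with structured matrices; Mamba additionally makes the parameters input-dependent, i.e. "selective"):

```python
import torch

def ssm_scan(A, B, C, x):
    """Sequential form of a discretized linear state-space model.
    O(L) in sequence length with O(1) state -- no quadratic attention."""
    h = torch.zeros(A.size(0))
    ys = []
    for x_t in x:                 # one recurrence step per timestep
        h = A @ h + B * x_t
        ys.append(C @ h)
    return torch.stack(ys)

N, L = 8, 100                     # toy state size and sequence length
A = torch.randn(N, N) * 0.1       # toy dynamics (not the structured S4 matrix)
B, C = torch.randn(N), torch.randn(N)
y = ssm_scan(A, B, C, torch.randn(L))
print(y.shape)  # torch.Size([100])
```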
### 5.2 Mixtral / MoE (2024)

| Item | Details |
|------|---------|
| Problem | Dense model scaling is compute-inefficient |
| Insight | Sparse mixture of experts: total parameters grow while per-token compute stays small |
| Impact | MoE became a mainstream architecture for large models |
| Paper | Jiang et al., 2024 |
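
A minimal sketch of top-k routing (Mixtral uses top-2 of 8 experts; sizes here are toy, and real implementations batch tokens per expert rather than looping):

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Sparse MoE layer: each token activates only k of n experts."""
    def __init__(self, dim=64, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts))

    def forward(self, x):                       # x: (tokens, dim)
        weights, idx = self.gate(x).topk(self.k, dim=-1)
        weights = torch.softmax(weights, dim=-1) # renormalize the top-k scores
        out = torch.zeros_like(x)
        for slot in range(self.k):               # send each token to its k experts
            for e, expert in enumerate(self.experts):
                sel = idx[:, slot] == e
                if sel.any():
                    out[sel] += weights[sel, slot:slot + 1] * expert(x[sel])
        return out

tokens = torch.randn(10, 64)
print(TopKMoE()(tokens).shape)  # torch.Size([10, 64])
```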
### 5.3 DeepSeek-V3 / R1 (2024-2025)

| Item | Details |
|------|---------|
| Problem | Frontier-level training is costly, and open models lagged in reasoning |
| Insight | MoE + MLA (multi-head latent attention) + reinforcement learning for reasoning |
| Impact | Open-weight models approach closed-source performance; a reasoning breakthrough |
| Paper | DeepSeek-AI, 2024 (V3); DeepSeek-AI, 2025 (R1) |
### 5.4 Sora / Flux (2024)

| Item | Details |
|------|---------|
| Problem | Long, coherent video generation is difficult |
| Insight | DiT + flow matching + spatiotemporal Transformer |
| Impact | Video generation enters a new phase |
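
Sora's and Flux's internals are only partially public; a generic (rectified) flow-matching sketch of the stated training objective, on toy 2-D data:

```python
import torch
import torch.nn as nn

# Interpolate x_t = (1 - t) x0 + t x1 between noise x0 and data x1, and
# regress the network onto the constant velocity x1 - x0.
model = nn.Sequential(nn.Linear(3, 64), nn.SiLU(), nn.Linear(64, 2))  # toy net

x1 = torch.randn(256, 2) + 2.0            # stand-in "data"
x0 = torch.randn(256, 2)                  # noise
t = torch.rand(256, 1)
x_t = (1 - t) * x0 + t * x1
v_target = x1 - x0                        # straight-line velocity
v_pred = model(torch.cat([x_t, t], dim=-1))
loss = ((v_pred - v_target) ** 2).mean()
loss.backward()
# Sampling then integrates dx/dt = v(x, t) from t=0 (noise) to t=1 (data).
```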
## 6. Key Trends Summary

### 6.1 Architecture Evolution

```mermaid
graph LR
    A[CNN 2012] --> B[RNN/LSTM 2015]
    B --> C[Transformer 2017]
    C --> D[Sparse MoE 2024]
    C --> E[SSM/Mamba 2023]
    D --> F[Hybrid Architectures 2024+]
    E --> F
```
### 6.2 Paradigm Shifts

| Period | Paradigm | Core |
|--------|----------|------|
| 2012-2017 | Supervised learning | Big data + deep networks |
| 2018-2020 | Pretrain + fine-tune | BERT, GPT |
| 2020-2022 | Scaling laws | Bigger is better |
| 2022-2024 | Alignment + instruction following | RLHF, SFT |
| 2024-2025 | Reasoning scaling | Test-time compute, RL |
| 2025+ | Agents + tools | AI as intelligent agents |
## References
See citations within each section above for individual milestone papers.