# Deep Learning Milestones

## Overview

The history of deep learning is a story of continuous breakthroughs. From AlexNet igniting the deep learning revolution in 2012 to MoE and SSM reshaping architecture paradigms in 2024, each milestone solved a critical problem and opened new directions.
```mermaid
timeline
    title Deep Learning Milestones
    2012 : AlexNet
    2014 : GAN, VGG, LSTM Seq2Seq
    2015 : ResNet
    2017 : Transformer
    2018 : BERT, GPT
    2019 : GPT-2
    2020 : GPT-3, ViT, DDPM
    2021 : CLIP, DALL-E
    2022 : ChatGPT, Stable Diffusion, S4
    2023 : GPT-4, LLaMA, Mamba
    2024 : Mixtral, Sora, Flux, DeepSeek-V3
    2025 : DeepSeek-R1, Llama 4
```
## 1. Convolutional Network Era (2012-2017)

### 1.1 AlexNet (2012)

| Item | Details |
|------|---------|
| Problem | ImageNet classification accuracy had stagnated |
| Insight | Deep CNN + GPU training + ReLU + Dropout |
| Impact | Cut top-5 error from 26% to ~16%, launching the deep learning era |
| Paper | Krizhevsky et al., NeurIPS 2012 |
### 1.2 VGG / GoogLeNet (2014)

- VGG: Demonstrated the importance of depth (16-19 layers) and small 3x3 kernels; two stacked 3x3 convolutions cover a 5x5 receptive field with fewer parameters (see the comparison below)
- GoogLeNet: Inception module, multi-scale features
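
A quick back-of-the-envelope check of the small-kernel argument (a toy calculation; `C` is an arbitrary channel count and biases are ignored):

```python
# Two stacked 3x3 convs see the same 5x5 receptive field as one 5x5 conv,
# with fewer parameters and an extra nonlinearity in between.
C = 256                                  # arbitrary channel count (in == out)
params_two_3x3 = 2 * (3 * 3 * C * C)     # 1,179,648
params_one_5x5 = 5 * 5 * C * C           # 1,638,400
print(params_two_3x3 / params_one_5x5)   # 0.72 -- ~28% fewer parameters
```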
### 1.3 ResNet (2015)

| Item | Details |
|------|---------|
| Problem | Accuracy degraded as plain networks grew deeper |
| Insight | Residual connections \(y = F(x) + x\); learning the residual is easier than learning the full mapping |
| Impact | 152-layer networks, 3.57% top-5 error, below the ~5% human-level error |
| Paper | He et al., CVPR 2016 |
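
A minimal PyTorch sketch of the idea (a simplified basic block; sizes are illustrative, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Minimal residual block: output = F(x) + x (identity shortcut)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = x                          # identity shortcut
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + residual)      # learn F(x), then add x back

x = torch.randn(1, 64, 32, 32)
print(BasicBlock(64)(x).shape)  # torch.Size([1, 64, 32, 32])
```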
### 1.4 GAN (2014)

| Item | Details |
|------|---------|
| Problem | Poor generative model quality |
| Insight | Generator-discriminator adversarial game |
| Impact | Created the adversarial training paradigm; StyleGAN and others followed |
| Paper | Goodfellow et al., NeurIPS 2014 |
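
A toy PyTorch sketch of one adversarial update (hypothetical tiny networks on 2-D data; real GANs use convolutional architectures and many stabilization tricks):

```python
import torch
import torch.nn as nn

# The adversarial game: D learns real -> 1, fake -> 0; G learns to make D
# output 1 on fakes (the non-saturating generator loss).
G = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))
bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

real = torch.randn(64, 2) + 3.0     # stand-in "real" samples
z = torch.randn(64, 16)             # latent noise

# Discriminator step (detach fakes so G gets no gradient here)
loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(G(z).detach()), torch.zeros(64, 1))
opt_d.zero_grad()
loss_d.backward()
opt_d.step()

# Generator step: fool D into predicting 1 on fakes
loss_g = bce(D(G(z)), torch.ones(64, 1))
opt_g.zero_grad()
loss_g.backward()
opt_g.step()
```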
## 2. Transformer Era (2017-2020)

### 2.1 Transformer (2017)

| Item | Details |
|------|---------|
| Problem | RNNs' sequential bottleneck and difficulty with long-range dependencies |
| Insight | Self-attention mechanism, completely abandoning recurrence |
| Impact | Became the foundational architecture across nearly all AI domains |
| Paper | Vaswani et al., "Attention Is All You Need," NeurIPS 2017 |
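
The core equation, \(\mathrm{Attention}(Q,K,V)=\mathrm{softmax}(QK^\top/\sqrt{d_k})V\), fits in a few lines; a minimal sketch with toy shapes:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# Toy shapes: batch=2, seq_len=5, d_k=8 (illustrative sizes only)
q = k = v = torch.randn(2, 5, 8)
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 5, 8])
```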
### 2.2 BERT (2018)

| Item | Details |
|------|---------|
| Problem | Language model pretraining was unidirectional only |
| Insight | Bidirectional masked language model pretraining |
| Impact | Established the NLP pretraining paradigm, SOTA on 11 tasks |
| Paper | Devlin et al., NAACL 2019 |
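
A simplified sketch of BERT's 80/10/10 masking rule (ignoring special tokens and padding; `mask_token_id` and sizes are illustrative):

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """Pick ~15% of positions as MLM targets; of those, 80% -> [MASK],
    10% -> random token, 10% -> left unchanged."""
    labels = input_ids.clone()
    masked = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
    labels[~masked] = -100                 # ignore non-masked positions in loss

    input_ids = input_ids.clone()
    replace = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & masked
    input_ids[replace] = mask_token_id     # 80% of targets become [MASK]

    rand = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & masked & ~replace
    input_ids[rand] = torch.randint(vocab_size, input_ids.shape)[rand]
    return input_ids, labels               # remaining 10% stay unchanged

ids = torch.randint(1000, (2, 12))
masked_ids, labels = mask_tokens(ids, mask_token_id=103, vocab_size=1000)
```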
### 2.3 GPT / GPT-2 (2018-2019)
- GPT: Autoregressive pretraining + fine-tuning
- GPT-2: 1.5B parameters, "too dangerous to release," demonstrated zero-shot capabilities
### 2.4 GPT-3 (2020)

| Item | Details |
|------|---------|
| Problem | Models required fine-tuning for each task |
| Insight | 175B parameters, in-context learning |
| Impact | Few-shot/zero-shot general capabilities, scaling laws |
| Paper | Brown et al., NeurIPS 2020 |
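
In-context learning means the "training examples" live in the prompt itself; a minimal illustration (hypothetical demonstrations, no API calls):

```python
# Few-shot prompting: the model infers the task from the pattern in the
# prompt -- no gradient updates, unlike fine-tuning.
examples = [("cheese", "fromage"), ("house", "maison")]   # hypothetical demos
query = "book"

prompt = "Translate English to French:\n"
prompt += "".join(f"{en} => {fr}\n" for en, fr in examples)
prompt += f"{query} =>"
print(prompt)   # a model completing this prompt should continue with "livre"
```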
## 3. Vision Revolution (2020-2022)

### 3.1 ViT (2020)

| Item | Details |
|------|---------|
| Problem | Vision was still dominated by CNNs |
| Insight | Image patches + standard Transformer; "an image is a sequence" |
| Impact | Unified vision and NLP architectures |
| Paper | Dosovitskiy et al., ICLR 2021 |
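
A minimal sketch of the patch step (before the linear projection, class token, and position embeddings are added):

```python
import torch

def patchify(images, patch_size=16):
    """Split (B, C, H, W) images into a sequence of flattened patches:
    (B, num_patches, patch_size*patch_size*C) -- the ViT input tokens."""
    B, C, H, W = images.shape
    p = patch_size
    x = images.reshape(B, C, H // p, p, W // p, p)
    x = x.permute(0, 2, 4, 3, 5, 1)        # (B, H/p, W/p, p, p, C)
    return x.reshape(B, (H // p) * (W // p), p * p * C)

imgs = torch.randn(2, 3, 224, 224)
tokens = patchify(imgs)
print(tokens.shape)  # torch.Size([2, 196, 768]) -- 14x14 patches of dim 768
```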
### 3.2 DDPM (2020)

| Item | Details |
|------|---------|
| Problem | GAN training is unstable; VAEs generate blurry images |
| Insight | Fixed forward noising + learned reverse denoising |
| Impact | Launched the diffusion model era, largely displacing GANs |
| Paper | Ho et al., NeurIPS 2020 |
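
The forward process has a closed form, \(x_t=\sqrt{\bar\alpha_t}\,x_0+\sqrt{1-\bar\alpha_t}\,\epsilon\); a minimal sketch using the paper's linear \(\beta\) schedule (toy data standing in for images):

```python
import torch

# q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I); the network is trained
# to predict eps from x_t.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # linear schedule from the paper
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def q_sample(x0, t, eps):
    abar = alphas_bar[t].view(-1, 1, 1, 1)
    return abar.sqrt() * x0 + (1.0 - abar).sqrt() * eps

x0 = torch.randn(4, 3, 32, 32)                  # stand-in "images"
t = torch.randint(T, (4,))
eps = torch.randn_like(x0)
x_t = q_sample(x0, t, eps)
# Training loss: || eps - eps_theta(x_t, t) ||^2 for a denoising network.
```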
### 3.3 CLIP (2021)

| Item | Details |
|------|---------|
| Problem | Vision models required task-specific labeled data |
| Insight | Contrastive pretraining on 400M image-text pairs |
| Impact | Zero-shot visual classification, a multimodal foundation |
| Paper | Radford et al., ICML 2021 |
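
A minimal sketch of the symmetric contrastive objective (random embeddings stand in for the image and text encoders; the temperature is learned in the real model):

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch: matched image-text pairs are positives,
    all other pairings in the batch are negatives."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (B, B) similarities
    targets = torch.arange(logits.size(0))            # diagonal = correct pair
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = clip_loss(torch.randn(8, 512), torch.randn(8, 512))
```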
### 3.4 Stable Diffusion (2022)

| Item | Details |
|------|---------|
| Problem | Diffusion models are too slow in pixel space |
| Insight | Latent-space diffusion (VAE encoding + UNet denoising) |
| Impact | Open-source text-to-image revolution |
| Paper | Rombach et al., CVPR 2022 |
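
A schematic of the latent-space idea with stand-in modules (not the real SD VAE or UNet; the shapes mimic SD v1's 8x downsampling to a 4-channel latent):

```python
import torch
import torch.nn as nn

# Diffuse in a compressed latent space instead of pixel space, then decode.
encoder = nn.Conv2d(3, 4, 8, stride=8)           # stand-in VAE encoder
decoder = nn.ConvTranspose2d(4, 3, 8, stride=8)  # stand-in VAE decoder

image = torch.randn(1, 3, 512, 512)
z = encoder(image)                    # (1, 4, 64, 64): 64x fewer positions
print(z.shape)
# A UNet denoises z over many steps (text-conditioned via cross-attention),
# which is far cheaper than denoising 512x512x3 pixels directly.
z_denoised = z                        # placeholder for the diffusion loop
out = decoder(z_denoised)             # back to pixel space
print(out.shape)                      # torch.Size([1, 3, 512, 512])
```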
## 4. LLM Era (2022-2024)

### 4.1 ChatGPT / InstructGPT (2022)

| Item | Details |
|------|---------|
| Problem | Raw LLMs don't reliably follow instructions and can generate harmful content |
| Insight | SFT + RLHF alignment techniques |
| Impact | AI entered mainstream awareness, triggering the LLM boom |
| Paper | Ouyang et al., NeurIPS 2022 |
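
One concrete piece of the pipeline is the reward model's pairwise preference loss (a sketch; the full recipe adds supervised fine-tuning before and PPO optimization against the reward after):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen, r_rejected):
    """Pairwise preference loss for RLHF reward models:
    -log sigmoid(r_chosen - r_rejected)  (a Bradley-Terry model)."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Stand-in scalar rewards for a batch of human preference pairs
r_chosen = torch.randn(16, requires_grad=True)
r_rejected = torch.randn(16)
loss = reward_model_loss(r_chosen, r_rejected)
loss.backward()
```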
### 4.2 LLaMA (2023)

| Item | Details |
|------|---------|
| Problem | Capable large models were closed and concentrated in a few big companies |
| Insight | Efficient training (smaller models, more tokens) + open weights |
| Impact | Open-source LLM ecosystem explosion |
| Paper | Touvron et al., 2023 |
### 4.3 GPT-4 (2023)

| Item | Details |
|------|---------|
| Problem | Need for stronger reasoning and multimodal capabilities |
| Insight | Larger scale + multimodal input + better alignment |
| Impact | Human-level performance on many professional and academic benchmarks |
| Paper | OpenAI, 2023 (technical report) |
## 5. Architecture Innovation (2022-2025)

### 5.1 S4 / Mamba (2022-2023)

| Item | Details |
|------|---------|
| Problem | The Transformer's quadratic attention complexity limits long sequences |
| Insight | Structured state spaces + a selective (input-dependent) mechanism |
| Impact | A linear-complexity alternative to Transformers |
| Paper | Gu et al., ICLR 2022; Gu & Dao, 2023 |
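
A toy single-channel sketch of the underlying recurrence \(h_t = A h_{t-1} + B x_t,\ y_t = C h_t\) (S4 computes this efficiently with structured matrices; Mamba additionally makes the parameters input-dependent, i.e. "selective"):

```python
import torch

def ssm_scan(A, B, C, x):
    """Sequential form of a discretized linear state-space model.
    O(L) in sequence length with O(1) state -- no quadratic attention."""
    h = torch.zeros(A.size(0))
    ys = []
    for x_t in x:                 # one recurrence step per timestep
        h = A @ h + B * x_t
        ys.append(C @ h)
    return torch.stack(ys)

N, L = 8, 100                     # toy state size and sequence length
A = torch.randn(N, N) * 0.1       # toy dynamics (not the structured S4 matrix)
B, C = torch.randn(N), torch.randn(N)
y = ssm_scan(A, B, C, torch.randn(L))
print(y.shape)  # torch.Size([100])
```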
### 5.2 Mixtral / MoE (2024)

| Item | Details |
|------|---------|
| Problem | Dense model scaling is compute-inefficient |
| Insight | Sparse mixture of experts: total parameters grow while per-token compute stays small |
| Impact | MoE became a mainstream architecture for large models |
| Paper | Jiang et al., 2024 |
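
A minimal sketch of top-k routing (Mixtral uses top-2 of 8 experts; sizes here are toy, and real implementations batch tokens per expert rather than looping):

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Sparse MoE layer: each token activates only k of n experts."""
    def __init__(self, dim=64, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts))

    def forward(self, x):                       # x: (tokens, dim)
        weights, idx = self.gate(x).topk(self.k, dim=-1)
        weights = torch.softmax(weights, dim=-1) # renormalize the top-k scores
        out = torch.zeros_like(x)
        for slot in range(self.k):               # send each token to its k experts
            for e, expert in enumerate(self.experts):
                sel = idx[:, slot] == e
                if sel.any():
                    out[sel] += weights[sel, slot:slot + 1] * expert(x[sel])
        return out

tokens = torch.randn(10, 64)
print(TopKMoE()(tokens).shape)  # torch.Size([10, 64])
```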
### 5.3 DeepSeek-V3 / R1 (2024-2025)

| Item | Details |
|------|---------|
| Problem | Frontier-level training is costly, and open models lagged in reasoning |
| Insight | MoE + MLA (multi-head latent attention) + reinforcement learning for reasoning |
| Impact | Open-weight models approach closed-source performance; a reasoning breakthrough |
| Paper | DeepSeek-AI, 2024 (V3); DeepSeek-AI, 2025 (R1) |
### 5.4 Sora / Flux (2024)

| Item | Details |
|------|---------|
| Problem | Long, coherent video generation is difficult |
| Insight | DiT + flow matching + spatiotemporal Transformer |
| Impact | Video generation enters a new phase |
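
Sora's and Flux's internals are only partially public; a generic (rectified) flow-matching sketch of the stated training objective, on toy 2-D data:

```python
import torch
import torch.nn as nn

# Interpolate x_t = (1 - t) x0 + t x1 between noise x0 and data x1, and
# regress the network onto the constant velocity x1 - x0.
model = nn.Sequential(nn.Linear(3, 64), nn.SiLU(), nn.Linear(64, 2))  # toy net

x1 = torch.randn(256, 2) + 2.0            # stand-in "data"
x0 = torch.randn(256, 2)                  # noise
t = torch.rand(256, 1)
x_t = (1 - t) * x0 + t * x1
v_target = x1 - x0                        # straight-line velocity
v_pred = model(torch.cat([x_t, t], dim=-1))
loss = ((v_pred - v_target) ** 2).mean()
loss.backward()
# Sampling then integrates dx/dt = v(x, t) from t=0 (noise) to t=1 (data).
```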
## 6. Key Trends Summary

### 6.1 Architecture Evolution

```mermaid
graph LR
    A[CNN 2012] --> B[RNN/LSTM 2015]
    B --> C[Transformer 2017]
    C --> D[Sparse MoE 2024]
    C --> E[SSM/Mamba 2023]
    D --> F[Hybrid Architectures 2024+]
    E --> F
```
### 6.2 Paradigm Shifts

| Period | Paradigm | Core |
|--------|----------|------|
| 2012-2017 | Supervised learning | Big data + deep networks |
| 2018-2020 | Pretrain + fine-tune | BERT, GPT |
| 2020-2022 | Scaling laws | Bigger is better |
| 2022-2024 | Alignment + instruction following | RLHF, SFT |
| 2024-2025 | Reasoning scaling | Test-time compute, RL |
| 2025+ | Agents + tools | AI as intelligent agents |
## References
See citations within each section above for individual milestone papers.