
Game AI

Overview

Games have long served as important testbeds for AI research. From simple Atari games to complex real-time strategy games, reinforcement learning has achieved a series of landmark breakthroughs in game AI.

Atari: From DQN to Rainbow

DQN (Deep Q-Network)

Mnih et al. (2015) first demonstrated that deep RL can learn to play Atari games directly from pixels:

Key innovations:

  • Convolutional neural network processing raw pixel input
  • Experience Replay to break data correlations
  • Target Network to stabilize training

Input: Stack of 4 most recent 84x84 grayscale frames

Output: Q-value for each action

Result: Surpassed human-level performance in 29 out of 49 Atari games
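
A minimal sketch of this architecture in PyTorch (layer sizes follow the Nature paper; names are illustrative):

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Q-network from Mnih et al. (2015): 4 stacked 84x84 frames -> one Q-value per action."""

    def __init__(self, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, 4, 84, 84), pixel values scaled to [0, 1]
        return self.net(frames)
```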

Double DQN

Van Hasselt et al. (2016) addressed Q-value overestimation:

\[y = r + \gamma\, Q_{\bar{\theta}}\!\left(s', \arg\max_{a'} Q_\theta(s', a')\right)\]

The online network \(Q_\theta\) selects the action and the target network \(Q_{\bar{\theta}}\) evaluates it, decoupling action selection from value estimation.
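
A sketch of the corresponding target computation, assuming `q_net` (online) and `target_net` are two copies of the Q-network above:

```python
import torch

@torch.no_grad()
def double_dqn_target(q_net, target_net, reward, next_state, done, gamma=0.99):
    # Online network selects the greedy action...
    best_action = q_net(next_state).argmax(dim=1, keepdim=True)
    # ...target network evaluates it (decoupling selection from evaluation).
    next_q = target_net(next_state).gather(1, best_action).squeeze(1)
    return reward + gamma * (1.0 - done) * next_q
```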

Dueling DQN

Wang et al. (2016) separated state value from action advantage:

\[Q(s, a) = V(s) + A(s, a) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s, a')\]
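
A minimal sketch of a dueling head implementing this recombination (sizes and names are illustrative):

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    def __init__(self, in_dim: int, n_actions: int):
        super().__init__()
        self.value = nn.Linear(in_dim, 1)               # V(s)
        self.advantage = nn.Linear(in_dim, n_actions)   # A(s, a)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        v = self.value(features)
        a = self.advantage(features)
        # Subtract the mean advantage so V and A are identifiable.
        return v + a - a.mean(dim=1, keepdim=True)
```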

Prioritized Experience Replay

Schaul et al. (2016) prioritized sampling of experiences with large TD errors.
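
A toy sketch of this proportional prioritization; the paper uses a sum-tree for efficient sampling, while this linear version only illustrates the sampling probabilities and importance weights:

```python
import numpy as np

def sample_prioritized(td_errors, batch_size, alpha=0.6, beta=0.4, eps=1e-6):
    # Priority p_i = (|delta_i| + eps)^alpha; sampling prob P(i) = p_i / sum_j p_j.
    priorities = (np.abs(td_errors) + eps) ** alpha
    probs = priorities / priorities.sum()
    idx = np.random.choice(len(td_errors), batch_size, p=probs)
    # Importance-sampling weights correct the bias from non-uniform sampling.
    weights = (len(td_errors) * probs[idx]) ** (-beta)
    return idx, weights / weights.max()
```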

Rainbow

Hessel et al. (2018) integrated six improvements:

  1. Double DQN
  2. Prioritized replay
  3. Dueling architecture
  4. Multi-step returns
  5. Distributional RL (C51)
  6. NoisyNet

Result: Set a new state of the art on the Atari benchmark, with the highest median human-normalized score across the 57-game suite.
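
Of these, the multi-step return replaces the one-step TD target with an \(n\)-step bootstrap (shown here in its non-distributional form):

\[R_t^{(n)} = \sum_{k=0}^{n-1} \gamma^k r_{t+k+1} + \gamma^n \max_{a'} Q_{\bar{\theta}}(s_{t+n}, a')\]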

Board Games: The AlphaGo Family

AlphaGo

AlphaGo (Silver et al., 2016) was the first program to defeat a Go world champion:

Three-stage training:

  1. Supervised learning (SL) policy network \(p_\sigma(a|s)\): learned from human expert games
  2. RL policy network \(p_\rho(a|s)\): strengthened through self-play
  3. Value network \(v_\theta(s) \approx \mathbb{E}[z|s]\): predicts the probability of winning

MCTS (Monte Carlo Tree Search):

At inference time, combines policy and value networks to guide search:

\[a_t = \arg\max_a \left[Q(s_t, a) + c \cdot P(s_t, a) \cdot \frac{\sqrt{N(s_t)}}{1 + N(s_t, a)}\right]\]
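
A minimal sketch of this selection rule over a node's child statistics (plain Python; field names are illustrative):

```python
import math

def select_action(children, c_puct=1.0):
    """children: dict mapping action -> stats {N: visits, Q: mean value, P: prior}."""
    total_n = sum(child["N"] for child in children.values())

    def puct(child):
        # Exploration bonus is large for high-prior, rarely visited children.
        u = c_puct * child["P"] * math.sqrt(total_n) / (1 + child["N"])
        return child["Q"] + u

    return max(children, key=lambda a: puct(children[a]))
```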

AlphaZero

AlphaZero (Silver et al., 2018) removed human knowledge entirely, learning from scratch:

Key simplifications:

  • No human data: Learns entirely through self-play
  • Single network: Simultaneously outputs policy \(\mathbf{p}\) and value \(v\)
  • Unified framework: Same algorithm plays Go, chess, and shogi

Training objective:

\[\ell = (z - v)^2 - \boldsymbol{\pi}^T \log \mathbf{p} + c \|\theta\|^2\]

where \(z\) is the actual game outcome and \(\boldsymbol{\pi}\) is the MCTS search probability.
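
A sketch of this loss in PyTorch, assuming the network outputs policy logits and a scalar value; the \(c\|\theta\|^2\) term is typically handled by the optimizer's weight decay:

```python
import torch
import torch.nn.functional as F

def alphazero_loss(policy_logits, value, mcts_probs, outcome):
    # (z - v)^2: squared error against the game result in {-1, 0, +1}.
    value_loss = F.mse_loss(value.squeeze(-1), outcome)
    # -pi^T log p: cross-entropy against the MCTS visit distribution.
    policy_loss = -(mcts_probs * F.log_softmax(policy_logits, dim=1)).sum(dim=1).mean()
    return value_loss + policy_loss  # L2 term via optimizer weight_decay
```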

Result: Surpassed AlphaGo in Go and the strongest traditional engines in chess (Stockfish) and shogi (Elmo) within hours of training.

MuZero

MuZero (Schrittwieser et al., 2020) further removed the dependence on known environment rules:

Learned models:

  • Representation function: \(h(o_t) \to s_t\) (observation to hidden state)
  • Dynamics function: \(g(s_t, a_t) \to (r_{t+1}, s_{t+1})\) (state transition)
  • Prediction function: \(f(s_t) \to (p_t, v_t)\) (policy and value)
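
A toy sketch of this three-function interface (illustrative only; the real networks are conv/residual towers, and \(g\) is unrolled for several steps during training):

```python
import torch
import torch.nn as nn

class MuZeroNets(nn.Module):
    """Minimal stand-in for MuZero's three learned functions."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.h = nn.Linear(obs_dim, hidden)               # representation h(o_t) -> s_t
        self.g = nn.Linear(hidden + act_dim, hidden + 1)  # dynamics g(s_t, a_t) -> (s_{t+1}, r_{t+1})
        self.f_policy = nn.Linear(hidden, act_dim)        # prediction f(s_t) -> p_t
        self.f_value = nn.Linear(hidden, 1)               # prediction f(s_t) -> v_t

    def represent(self, obs):
        return torch.relu(self.h(obs))

    def dynamics(self, state, action_onehot):
        out = self.g(torch.cat([state, action_onehot], dim=-1))
        return torch.relu(out[..., :-1]), out[..., -1]    # next hidden state, reward

    def predict(self, state):
        return self.f_policy(state), self.f_value(state)
```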

Key innovations:

  • No need to know game rules
  • The learned model only needs to be useful for planning, not for reconstructing observations
  • Also achieved SOTA on Atari

```mermaid
graph LR
    A[AlphaGo<br/>2016] --> B[AlphaGo Zero<br/>2017]
    B --> C[AlphaZero<br/>2018]
    C --> D[MuZero<br/>2020]

    A1[Human data + RL] --> A
    A2[Pure self-play] --> B
    A3[Multi-game general] --> C
    A4[No rules needed] --> D

    style A fill:#faa
    style B fill:#fda
    style C fill:#ffa
    style D fill:#afa
```

Esports

OpenAI Five (Dota 2)

OpenAI Five (OpenAI, 2019) defeated the reigning world champion team, OG, in 5v5 Dota 2:

Scale:

  • Each hero controlled by its own replica of a shared LSTM policy
  • Observation space: ~20,000 dimensions (no visual input, uses game API)
  • Action space: ~170,000 (discretized)
  • Training: PPO on roughly 128,000 CPU cores + 256 GPUs
  • Training duration: Equivalent to 45,000 years of human gameplay

Key techniques:

  • Large-scale PPO: Distributed training infrastructure (see the sketch after this list)
  • Reward shaping: Carefully designed dense rewards
  • Surgery: Tools for carrying trained parameters across changes to the model architecture and environment without restarting training
  • Cooperative strategy: 5 agents share parameters but make independent decisions
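
For reference, the clipped surrogate objective at the heart of PPO, as a minimal sketch (not OpenAI Five's actual code):

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    # Probability ratio r_t(theta) = pi_new(a|s) / pi_old(a|s).
    ratio = torch.exp(log_probs_new - log_probs_old)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    # Take the pessimistic (min) bound so large policy updates are not rewarded.
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```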

Limitations:

  • Some rule restrictions (e.g., limited hero pool)
  • Relies on game API rather than visual input

AlphaStar (StarCraft II)

AlphaStar (Vinyals et al., 2019) reached Grandmaster level in StarCraft II:

Challenges:

  • Imperfect information (fog of war)
  • Real-time decisions (not turn-based)
  • Long-term strategic planning (a single game can run for tens of minutes)
  • Enormous action space

Key techniques:

  • League Training: Maintaining a league of agents (see the opponent-sampling sketch after this list)
    • Main Agent: Continuously evolving
    • Main Exploiter: Specifically targeting the main agent's weaknesses
    • League Exploiter: Exploiting weaknesses across the entire league
  • Imitation learning pretraining: Starting from human game data
  • Transformer architecture: Processing variable-length entity lists
  • Autoregressive policy: Sequentially outputting action type, target, etc.
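
A hedged sketch of prioritized fictitious self-play (PFSP)-style opponent sampling used in league training, where opponents the learner loses to are sampled more often (the weighting function shown is one illustrative choice):

```python
import numpy as np

def sample_opponent(league, win_rates, p=2.0):
    """Sample a league opponent, weighting by how hard it is to beat.

    win_rates[i]: estimated probability the learner beats league[i].
    """
    hardness = (1.0 - np.asarray(win_rates)) ** p  # focus on opponents we lose to
    probs = hardness / hardness.sum()
    return league[np.random.choice(len(league), p=probs)]
```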

Result: Reached 99.8th percentile (Grandmaster level), defeating top professional players.

Open-World Games

Voyager (Minecraft)

Wang et al. (2023) built Voyager, an LLM-driven agent that explores Minecraft continually:

Architecture:

  • Automatic curriculum: LLM generates increasingly complex tasks
  • Skill library: Stores successful behaviors as reusable code
  • Iterative prompting: Refines code based on execution feedback

Distinguishing feature: Does not use traditional RL training, instead leveraging LLM reasoning capabilities for exploration and adaptation.
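
The resulting interaction loop, sketched as hedged pseudocode (method names are illustrative, not Voyager's actual API):

```python
def voyager_loop(llm, env, skill_library, max_iters=3):
    while True:
        # Automatic curriculum: the LLM proposes the next task from current state.
        task = llm.propose_next_task(env.state, skill_library)
        code = llm.write_skill(task, skill_library.retrieve(task))
        for _ in range(max_iters):          # iterative prompting
            result = env.execute(code)
            if result.success:
                skill_library.add(task, code)   # store as a reusable skill
                break
            code = llm.refine(code, result.feedback)  # fix from execution feedback
```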

STEVE-1

STEVE-1 (Lifshitz et al., 2023) is a Minecraft agent that follows open-ended text instructions, built by combining pretrained foundation models (the VPT behavioral prior and MineCLIP) rather than by task-specific RL training.

Social and Communication Games

CICERO (Diplomacy)

CICERO (Meta AI, 2022) achieved human-level play in the board game Diplomacy:

Challenges:

  • Requires natural language communication
  • Requires strategic planning
  • Requires building trust and alliances
  • Involves deception and negotiation

Architecture:

  • Language model: Generates natural language messages
  • Strategic reasoning: Search-based action planning
  • Intent modeling: Predicting other players' behavior
  • Dialogue-action consistency: Ensuring words match actions

Result: Ranked in the top 10% in anonymous online games.

Game AI Evolution

```mermaid
graph TD
    A[2013: DQN<br/>Atari] --> B[2016: AlphaGo<br/>Go]
    B --> C[2017: AlphaZero<br/>Multiple board games]
    A --> D[2018: Rainbow<br/>Atari SOTA]
    C --> E[2019: OpenAI Five<br/>Dota 2]
    C --> F[2019: AlphaStar<br/>StarCraft]
    C --> G[2020: MuZero<br/>No rules needed]
    F --> H[2022: CICERO<br/>Diplomacy]
    G --> I[2023: Voyager<br/>Minecraft]

    style A fill:#faa
    style B fill:#fda
    style C fill:#ffa
    style E fill:#afa
    style F fill:#afa
    style H fill:#abf
    style I fill:#abf
```

Insights from Game AI

Technical Perspective

| Breakthrough | Key Technique | Generalizability |
|---|---|---|
| DQN | Deep Q-network + experience replay | High |
| AlphaGo | MCTS + RL + SL | Medium (requires self-play) |
| AlphaZero | Pure self-play | High (multi-game general) |
| MuZero | Learned world model | High (no rules needed) |
| OpenAI Five | Large-scale PPO | Medium (heavy engineering) |
| AlphaStar | League training | Medium |
| CICERO | RL + LLM | New paradigm |

From Games to Reality

Methods from game AI are transferring to real-world applications:

  • MCTS planning → Robot planning
  • Self-play → Adversarial training
  • League training → Multi-agent competition
  • LLM + RL → General-purpose agents

References

  • Mnih et al., "Human-level Control through Deep Reinforcement Learning" (Nature 2015)
  • Van Hasselt et al., "Deep Reinforcement Learning with Double Q-learning" (AAAI 2016)
  • Wang et al., "Dueling Network Architectures for Deep Reinforcement Learning" (ICML 2016)
  • Schaul et al., "Prioritized Experience Replay" (ICLR 2016)
  • Hessel et al., "Rainbow: Combining Improvements in Deep Reinforcement Learning" (AAAI 2018)
  • Silver et al., "Mastering the Game of Go with Deep Neural Networks and Tree Search" (Nature 2016)
  • Silver et al., "A General Reinforcement Learning Algorithm that Masters Chess, Shogi, and Go" (Science 2018)
  • Schrittwieser et al., "Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model" (Nature 2020)
  • OpenAI, "Dota 2 with Large Scale Deep Reinforcement Learning" (arXiv 2019)
  • Vinyals et al., "Grandmaster Level in StarCraft II Using Multi-Agent Reinforcement Learning" (Nature 2019)
  • Meta FAIR, "Human-Level Play in the Game of Diplomacy by Combining Language Models with Strategic Reasoning" (Science 2022)
  • Wang et al., "Voyager: An Open-Ended Embodied Agent with Large Language Models" (arXiv 2023)
  • Lifshitz et al., "STEVE-1: A Generative Model for Text-to-Behavior in Minecraft" (NeurIPS 2023)
