
Game AI

Overview

Games have long served as important testbeds for AI research. From simple Atari games to complex real-time strategy games, reinforcement learning has achieved a series of landmark breakthroughs in game AI.

Atari: From DQN to Rainbow

DQN (Deep Q-Network)

Mnih et al. (2015) first demonstrated that deep RL can learn to play Atari games directly from pixels:

Key innovations:

  • Convolutional neural network processing raw pixel input
  • Experience Replay to break data correlations
  • Target Network to stabilize training

Input: Stack of 4 most recent 84x84 grayscale frames

Output: Q-value for each action

Result: Surpassed human-level performance in 29 out of 49 Atari games
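
A minimal sketch of this architecture in PyTorch (layer sizes follow the Nature paper; names are illustrative):

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Q-network from Mnih et al. (2015): 4 stacked 84x84 frames -> one Q-value per action."""

    def __init__(self, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, 4, 84, 84), pixel values scaled to [0, 1]
        return self.net(frames)
```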

Double DQN

Van Hasselt et al. (2016) addressed Q-value overestimation:

\[y = r + \gamma\, Q_{\bar{\theta}}\!\left(s', \arg\max_{a'} Q_\theta(s', a')\right)\]

The online network \(Q_\theta\) selects the action and the target network \(Q_{\bar{\theta}}\) evaluates it, decoupling action selection from value estimation.
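
A sketch of the corresponding target computation, assuming `q_net` (online) and `target_net` are two copies of the Q-network above:

```python
import torch

@torch.no_grad()
def double_dqn_target(q_net, target_net, reward, next_state, done, gamma=0.99):
    # Online network selects the greedy action...
    best_action = q_net(next_state).argmax(dim=1, keepdim=True)
    # ...target network evaluates it (decoupling selection from evaluation).
    next_q = target_net(next_state).gather(1, best_action).squeeze(1)
    return reward + gamma * (1.0 - done) * next_q
```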

Dueling DQN

Wang et al. (2016) separated state value from action advantage:

\[Q(s, a) = V(s) + A(s, a) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s, a')\]
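
A minimal sketch of a dueling head implementing this recombination (sizes and names are illustrative):

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    def __init__(self, in_dim: int, n_actions: int):
        super().__init__()
        self.value = nn.Linear(in_dim, 1)               # V(s)
        self.advantage = nn.Linear(in_dim, n_actions)   # A(s, a)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        v = self.value(features)
        a = self.advantage(features)
        # Subtract the mean advantage so V and A are identifiable.
        return v + a - a.mean(dim=1, keepdim=True)
```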

Prioritized Experience Replay

Schaul et al. (2016) prioritized sampling of experiences with large TD errors.
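
A toy sketch of this proportional prioritization; the paper uses a sum-tree for efficient sampling, while this linear version only illustrates the sampling probabilities and importance weights:

```python
import numpy as np

def sample_prioritized(td_errors, batch_size, alpha=0.6, beta=0.4, eps=1e-6):
    # Priority p_i = (|delta_i| + eps)^alpha; sampling prob P(i) = p_i / sum_j p_j.
    priorities = (np.abs(td_errors) + eps) ** alpha
    probs = priorities / priorities.sum()
    idx = np.random.choice(len(td_errors), batch_size, p=probs)
    # Importance-sampling weights correct the bias from non-uniform sampling.
    weights = (len(td_errors) * probs[idx]) ** (-beta)
    return idx, weights / weights.max()
```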

Rainbow

Hessel et al. (2018) integrated six improvements:

  1. Double DQN
  2. Prioritized replay
  3. Dueling architecture
  4. Multi-step returns
  5. Distributional RL (C51)
  6. NoisyNet

Result: Set a new state of the art on the Atari benchmark, with the highest median human-normalized score across the 57-game suite.
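
Of these, the multi-step return replaces the one-step TD target with an \(n\)-step bootstrap (shown here in its non-distributional form):

\[R_t^{(n)} = \sum_{k=0}^{n-1} \gamma^k r_{t+k+1} + \gamma^n \max_{a'} Q_{\bar{\theta}}(s_{t+n}, a')\]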

Board Games: The AlphaGo Family

AlphaGo

AlphaGo (Silver et al., 2016) was the first program to defeat a Go world champion:

Three-stage training:

  1. Supervised learning (SL) policy network \(p_\sigma(a|s)\): learned from human expert games
  2. RL policy network \(p_\rho(a|s)\): strengthened through self-play
  3. Value network \(v_\theta(s) \approx \mathbb{E}[z|s]\): predicts the probability of winning

MCTS (Monte Carlo Tree Search):

At inference time, combines policy and value networks to guide search:

\[a_t = \arg\max_a \left[Q(s_t, a) + c \cdot P(s_t, a) \cdot \frac{\sqrt{N(s_t)}}{1 + N(s_t, a)}\right]\]
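
A minimal sketch of this selection rule over a node's child statistics (plain Python; field names are illustrative):

```python
import math

def select_action(children, c_puct=1.0):
    """children: dict mapping action -> stats {N: visits, Q: mean value, P: prior}."""
    total_n = sum(child["N"] for child in children.values())

    def puct(child):
        # Exploration bonus is large for high-prior, rarely visited children.
        u = c_puct * child["P"] * math.sqrt(total_n) / (1 + child["N"])
        return child["Q"] + u

    return max(children, key=lambda a: puct(children[a]))
```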

AlphaZero

AlphaZero (Silver et al., 2018) removed human knowledge entirely, learning from scratch:

Key simplifications:

  • No human data: Learns entirely through self-play
  • Single network: Simultaneously outputs policy \(\mathbf{p}\) and value \(v\)
  • Unified framework: Same algorithm plays Go, chess, and shogi

Training objective:

\[\ell = (z - v)^2 - \boldsymbol{\pi}^T \log \mathbf{p} + c \|\theta\|^2\]

where \(z\) is the actual game outcome and \(\boldsymbol{\pi}\) is the MCTS search probability.
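
A sketch of this loss in PyTorch, assuming the network outputs policy logits and a scalar value; the \(c\|\theta\|^2\) term is typically handled by the optimizer's weight decay:

```python
import torch
import torch.nn.functional as F

def alphazero_loss(policy_logits, value, mcts_probs, outcome):
    # (z - v)^2: squared error against the game result in {-1, 0, +1}.
    value_loss = F.mse_loss(value.squeeze(-1), outcome)
    # -pi^T log p: cross-entropy against the MCTS visit distribution.
    policy_loss = -(mcts_probs * F.log_softmax(policy_logits, dim=1)).sum(dim=1).mean()
    return value_loss + policy_loss  # L2 term via optimizer weight_decay
```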

Result: Surpassed AlphaGo in Go and the strongest traditional engines in chess (Stockfish) and shogi (Elmo) within hours of training.

MuZero

MuZero (Schrittwieser et al., 2020) further removed the dependence on known environment rules:

Learned models:

  • Representation function: \(h(o_t) \to s_t\) (observation to hidden state)
  • Dynamics function: \(g(s_t, a_t) \to (r_{t+1}, s_{t+1})\) (state transition)
  • Prediction function: \(f(s_t) \to (p_t, v_t)\) (policy and value)
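
A toy sketch of this three-function interface (illustrative only; the real networks are conv/residual towers, and \(g\) is unrolled for several steps during training):

```python
import torch
import torch.nn as nn

class MuZeroNets(nn.Module):
    """Minimal stand-in for MuZero's three learned functions."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.h = nn.Linear(obs_dim, hidden)               # representation h(o_t) -> s_t
        self.g = nn.Linear(hidden + act_dim, hidden + 1)  # dynamics g(s_t, a_t) -> (s_{t+1}, r_{t+1})
        self.f_policy = nn.Linear(hidden, act_dim)        # prediction f(s_t) -> p_t
        self.f_value = nn.Linear(hidden, 1)               # prediction f(s_t) -> v_t

    def represent(self, obs):
        return torch.relu(self.h(obs))

    def dynamics(self, state, action_onehot):
        out = self.g(torch.cat([state, action_onehot], dim=-1))
        return torch.relu(out[..., :-1]), out[..., -1]    # next hidden state, reward

    def predict(self, state):
        return self.f_policy(state), self.f_value(state)
```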

Key innovations:

  • No need to know game rules
  • The learned model only needs to be useful for planning, not for reconstructing observations
  • Also achieved SOTA on Atari

```mermaid
graph LR
    A[AlphaGo<br/>2016] --> B[AlphaGo Zero<br/>2017]
    B --> C[AlphaZero<br/>2018]
    C --> D[MuZero<br/>2020]

    A1[Human data + RL] --> A
    A2[Pure self-play] --> B
    A3[Multi-game general] --> C
    A4[No rules needed] --> D

    style A fill:#faa
    style B fill:#fda
    style C fill:#ffa
    style D fill:#afa
```

Esports

OpenAI Five (Dota 2)

OpenAI Five (OpenAI, 2019) defeated the reigning world champion team, OG, in 5v5 Dota 2:

Scale:

  • Each hero controlled by its own replica of a shared LSTM policy
  • Observation space: ~20,000 dimensions (no visual input, uses game API)
  • Action space: ~170,000 (discretized)
  • Training: PPO on roughly 128,000 CPU cores + 256 GPUs
  • Training duration: Equivalent to 45,000 years of human gameplay

Key techniques:

  • Large-scale PPO: Distributed training infrastructure (see the sketch after this list)
  • Reward shaping: Carefully designed dense rewards
  • Surgery: Tools for carrying trained parameters across changes to the model architecture and environment without restarting training
  • Cooperative strategy: 5 agents share parameters but make independent decisions
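
For reference, the clipped surrogate objective at the heart of PPO, as a minimal sketch (not OpenAI Five's actual code):

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    # Probability ratio r_t(theta) = pi_new(a|s) / pi_old(a|s).
    ratio = torch.exp(log_probs_new - log_probs_old)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    # Take the pessimistic (min) bound so large policy updates are not rewarded.
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```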

Limitations:

  • Some rule restrictions (e.g., limited hero pool)
  • Relies on game API rather than visual input

AlphaStar (StarCraft II)

AlphaStar (Vinyals et al., 2019) reached Grandmaster level in StarCraft II:

Challenges:

  • Imperfect information (fog of war)
  • Real-time decisions (not turn-based)
  • Long-term strategic planning (a single game can run for tens of minutes)
  • Enormous action space

Key techniques:

  • League Training: Maintaining a league of agents (see the opponent-sampling sketch after this list)
    • Main Agent: Continuously evolving
    • Main Exploiter: Specifically targeting the main agent's weaknesses
    • League Exploiter: Exploiting weaknesses across the entire league
  • Imitation learning pretraining: Starting from human game data
  • Transformer architecture: Processing variable-length entity lists
  • Autoregressive policy: Sequentially outputting action type, target, etc.
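
A hedged sketch of prioritized fictitious self-play (PFSP)-style opponent sampling used in league training, where opponents the learner loses to are sampled more often (the weighting function shown is one illustrative choice):

```python
import numpy as np

def sample_opponent(league, win_rates, p=2.0):
    """Sample a league opponent, weighting by how hard it is to beat.

    win_rates[i]: estimated probability the learner beats league[i].
    """
    hardness = (1.0 - np.asarray(win_rates)) ** p  # focus on opponents we lose to
    probs = hardness / hardness.sum()
    return league[np.random.choice(len(league), p=probs)]
```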

Result: Reached 99.8th percentile (Grandmaster level), defeating top professional players.

Open-World Games

Voyager (Minecraft)

Wang et al. (2023) built Voyager, an LLM-driven agent that explores Minecraft continually:

Architecture:

  • Automatic curriculum: LLM generates increasingly complex tasks
  • Skill library: Stores successful behaviors as reusable code
  • Iterative prompting: Refines code based on execution feedback

Distinguishing feature: Does not use traditional RL training, instead leveraging LLM reasoning capabilities for exploration and adaptation.
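
The resulting interaction loop, sketched as hedged pseudocode (method names are illustrative, not Voyager's actual API):

```python
def voyager_loop(llm, env, skill_library, max_iters=3):
    while True:
        # Automatic curriculum: the LLM proposes the next task from current state.
        task = llm.propose_next_task(env.state, skill_library)
        code = llm.write_skill(task, skill_library.retrieve(task))
        for _ in range(max_iters):          # iterative prompting
            result = env.execute(code)
            if result.success:
                skill_library.add(task, code)   # store as a reusable skill
                break
            code = llm.refine(code, result.feedback)  # fix from execution feedback
```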

STEVE-1

STEVE-1 (Lifshitz et al., 2023) is a Minecraft agent that follows open-ended text instructions, built by combining pretrained foundation models (the VPT behavioral prior and MineCLIP) rather than by task-specific RL training.

Social and Communication Games

CICERO (Diplomacy)

CICERO (Meta AI, 2022) achieved human-level play in the board game Diplomacy:

Challenges:

  • Requires natural language communication
  • Requires strategic planning
  • Requires building trust and alliances
  • Involves deception and negotiation

Architecture:

  • Language model: Generates natural language messages
  • Strategic reasoning: Search-based action planning
  • Intent modeling: Predicting other players' behavior
  • Dialogue-action consistency: Ensuring words match actions

Result: Ranked in the top 10% in anonymous online games.

Game AI Evolution

```mermaid
graph TD
    A[2013: DQN<br/>Atari] --> B[2016: AlphaGo<br/>Go]
    B --> C[2017: AlphaZero<br/>Multiple board games]
    A --> D[2018: Rainbow<br/>Atari SOTA]
    C --> E[2019: OpenAI Five<br/>Dota 2]
    C --> F[2019: AlphaStar<br/>StarCraft]
    C --> G[2020: MuZero<br/>No rules needed]
    F --> H[2022: CICERO<br/>Diplomacy]
    G --> I[2023: Voyager<br/>Minecraft]

    style A fill:#faa
    style B fill:#fda
    style C fill:#ffa
    style E fill:#afa
    style F fill:#afa
    style H fill:#abf
    style I fill:#abf
```

Insights from Game AI

Technical Perspective

| Breakthrough | Key Technique | Generalizability |
|---|---|---|
| DQN | Deep Q-network + experience replay | High |
| AlphaGo | MCTS + RL + SL | Medium (requires self-play) |
| AlphaZero | Pure self-play | High (multi-game general) |
| MuZero | Learned world model | High (no rules needed) |
| OpenAI Five | Large-scale PPO | Medium (heavy engineering) |
| AlphaStar | League training | Medium |
| CICERO | RL + LLM | New paradigm |

From Games to Reality

Methods from game AI are transferring to real-world applications:

  • MCTS planning → Robot planning
  • Self-play → Adversarial training
  • League training → Multi-agent competition
  • LLM + RL → General-purpose agents

References

  • Mnih et al., "Human-level Control through Deep Reinforcement Learning" (Nature 2015)
  • Van Hasselt et al., "Deep Reinforcement Learning with Double Q-learning" (AAAI 2016)
  • Wang et al., "Dueling Network Architectures for Deep Reinforcement Learning" (ICML 2016)
  • Schaul et al., "Prioritized Experience Replay" (ICLR 2016)
  • Hessel et al., "Rainbow: Combining Improvements in Deep Reinforcement Learning" (AAAI 2018)
  • Silver et al., "Mastering the Game of Go with Deep Neural Networks and Tree Search" (Nature 2016)
  • Silver et al., "A General Reinforcement Learning Algorithm that Masters Chess, Shogi, and Go" (Science 2018)
  • Schrittwieser et al., "Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model" (Nature 2020)
  • OpenAI, "Dota 2 with Large Scale Deep Reinforcement Learning" (arXiv 2019)
  • Vinyals et al., "Grandmaster Level in StarCraft II Using Multi-Agent Reinforcement Learning" (Nature 2019)
  • Meta FAIR, "Human-Level Play in the Game of Diplomacy by Combining Language Models with Strategic Reasoning" (Science 2022)
  • Wang et al., "Voyager: An Open-Ended Embodied Agent with Large Language Models" (arXiv 2023)
  • Lifshitz et al., "STEVE-1: A Generative Model for Text-to-Behavior in Minecraft" (NeurIPS 2023)
