

Adversarial Search and Games

This note covers adversarial search and game playing: the algorithms used to play games such as Tic-Tac-Toe, chess, and Go.

MiniMax

The MiniMax algorithm can solve two-player zero-sum games with a manageable state space, such as Tic-Tac-Toe.

A two-player zero-sum game is one in which two players compete, and the outcome is either a win for one player or a draw — there is no scenario in which both players win. I have already solved the Coin Game and Tic-Tac-Toe and introduced the Minimax idea in my notes under Algorithms >> Dynamic Programming.

AIMA refers to the two participants as MAX and MIN. The intuition is straightforward: if we consider ourselves as a participant, we are MAX and our opponent is MIN. The label "MIN" reflects our own perspective — the opponent, trying to maximize their own benefit, effectively minimizes ours on their turn.

Formal Definition

We formalize a game as a search problem with the following components:

  • Initial state \(s_0\): The starting configuration of the board and which player moves first.
  • Player function \(\text{TO-MOVE}(s)\): Returns the player whose turn it is in state \(s\).
  • Actions function \(\text{ACTIONS}(s)\): Returns the set of legal moves in state \(s\).
  • Transition model \(\text{RESULT}(s, a)\): Returns the state that results from taking action \(a\) in state \(s\).
  • Terminal test \(\text{IS-TERMINAL}(s)\): Determines whether the game is over (win, loss, or draw).
  • Utility function \(\text{UTILITY}(s, p)\): Assigns a numeric value to terminal state \(s\) for player \(p\). For example, in Tic-Tac-Toe: win = +1, loss = -1, draw = 0.

Algorithm

The MiniMax algorithm makes optimal decisions by recursively expanding the entire game tree. The core idea is:

  • When it is MAX's turn, MAX chooses the action that maximizes the utility value.
  • When it is MIN's turn, MIN chooses the action that minimizes the utility value.

The pseudocode is as follows:

function MINIMAX-SEARCH(game, state):
    player ← game.TO-MOVE(state)
    value, move ← MAX-VALUE(game, state)
    return move

function MAX-VALUE(game, state):
    if game.IS-TERMINAL(state):
        return game.UTILITY(state, MAX), null
    v ← -∞
    for each a in game.ACTIONS(state):
        v2, a2 ← MIN-VALUE(game, game.RESULT(state, a))
        if v2 > v:
            v, move ← v2, a
    return v, move

function MIN-VALUE(game, state):
    if game.IS-TERMINAL(state):
        return game.UTILITY(state, MAX), null
    v ← +∞
    for each a in game.ACTIONS(state):
        v2, a2 ← MAX-VALUE(game, game.RESULT(state, a))
        if v2 < v:
            v, move ← v2, a
    return v, move
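
For concreteness, here is a minimal Python sketch of the same algorithm, assuming a hypothetical game object that exposes actions(s), result(s, a), is_terminal(s), and utility(s, player) exactly as in the formal definition above.

import math

def minimax_search(game, state):
    """Return the move that maximizes MAX's utility under optimal play by both sides."""
    _, move = max_value(game, state)
    return move

def max_value(game, state):
    if game.is_terminal(state):
        return game.utility(state, "MAX"), None
    v, move = -math.inf, None
    for a in game.actions(state):
        v2, _ = min_value(game, game.result(state, a))
        if v2 > v:
            v, move = v2, a
    return v, move

def min_value(game, state):
    if game.is_terminal(state):
        return game.utility(state, "MAX"), None
    v, move = math.inf, None
    for a in game.actions(state):
        v2, _ = max_value(game, game.result(state, a))
        if v2 < v:
            v, move = v2, a
    return v, move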

Complexity Analysis

Let \(m\) be the maximum depth of the game tree and \(b\) be the branching factor (number of legal moves per state). Then:

  • Time complexity: \(O(b^m)\) — the entire game tree must be traversed.
  • Space complexity: \(O(bm)\) — depth-first search only needs to store the current path from the root, together with the unexplored sibling moves at each node along it.

For Tic-Tac-Toe, \(b \approx 5\) (average branching factor), \(m = 9\) (at most 9 moves), and the full game tree contains at most \(9! = 362{,}880\) move sequences, which is entirely tractable. However, for chess (\(b \approx 35, m \approx 80\)) or Go (\(b \approx 250, m \approx 150\)), naive MiniMax is infeasible.

Tic-Tac-Toe Example

Consider a simplified Tic-Tac-Toe position: it is MAX's (X's) turn, and the board already has several pieces. The MiniMax solving process works as follows:

  1. Expand the game tree: Starting from the current position, MAX enumerates all empty cells as possible moves.
  2. Recursive evaluation: For each of MAX's moves, it becomes MIN's (O's) turn; MIN similarly enumerates all moves, alternating recursively until the game ends.
  3. Back-propagate utility values: at leaf nodes (terminal states), assign values directly from the utility function (X wins = +1, O wins = -1, draw = 0); at MIN layers, propagate the minimum child value upward; at MAX layers, propagate the maximum child value upward.
  4. Select the optimal action: MAX chooses the action at the root with the highest backed-up value.

For instance, suppose MAX has three candidate positions A, B, and C. After full recursive expansion, A returns 0 (draw), B returns +1 (MAX wins), and C returns -1 (MIN wins). MAX selects B.

Optimality of MiniMax

Assuming both players play optimally, MiniMax computes an optimal strategy for each side; in a two-player zero-sum game this pair of strategies forms a Nash equilibrium. For Tic-Tac-Toe, MiniMax proves that optimal play by both sides always results in a draw.


Alpha-Beta Pruning

For games like chess, where the state space is too large for naive Minimax, we use Alpha-Beta pruning. The core idea is: if we already know that a branch cannot yield a better result than the current best option, there is no need to explore it further.

Pruning Mechanism

Alpha-Beta pruning maintains two values:

  • \(\alpha\) (Alpha): The best (highest) value that MAX is guaranteed along the current path — MAX's lower bound. Initialized to \(-\infty\).
  • \(\beta\) (Beta): The best (lowest) value that MIN is guaranteed along the current path, i.e., an upper bound on what MAX can achieve. Initialized to \(+\infty\).

Pruning conditions:

  • At a MIN node, if the current value \(v \leq \alpha\), an \(\alpha\) cutoff occurs — MAX would never choose this branch because MAX already has a better option at an ancestor node.
  • At a MAX node, if the current value \(v \geq \beta\), a \(\beta\) cutoff occurs — MIN would never choose this branch because MIN already has a better option at an ancestor node.

In short: pruning occurs whenever \(\alpha \geq \beta\).

Pseudocode

function ALPHA-BETA-SEARCH(game, state):
    player ← game.TO-MOVE(state)
    value, move ← MAX-VALUE(game, state, -∞, +∞)
    return move

function MAX-VALUE(game, state, α, β):
    if game.IS-TERMINAL(state):
        return game.UTILITY(state, MAX), null
    v ← -∞
    for each a in game.ACTIONS(state):
        v2, a2 ← MIN-VALUE(game, game.RESULT(state, a), α, β)
        if v2 > v:
            v, move ← v2, a
            α ← max(α, v)
        if v ≥ β:
            return v, move    // β cutoff
    return v, move

function MIN-VALUE(game, state, α, β):
    if game.IS-TERMINAL(state):
        return game.UTILITY(state, MAX), null
    v ← +∞
    for each a in game.ACTIONS(state):
        v2, a2 ← MAX-VALUE(game, game.RESULT(state, a), α, β)
        if v2 < v:
            v, move ← v2, a
            β ← min(β, v)
        if v ≤ α:
            return v, move    // α cutoff
    return v, move
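
The same sketch extends naturally to Alpha-Beta: only the two bound parameters and the cutoff checks are new. As before, the game interface is a hypothetical stand-in, not a specific library.

import math

def alpha_beta_search(game, state):
    """Alpha-Beta search: same result as MiniMax, but prunes provably irrelevant branches."""
    _, move = ab_max_value(game, state, -math.inf, math.inf)
    return move

def ab_max_value(game, state, alpha, beta):
    if game.is_terminal(state):
        return game.utility(state, "MAX"), None
    v, move = -math.inf, None
    for a in game.actions(state):
        v2, _ = ab_min_value(game, game.result(state, a), alpha, beta)
        if v2 > v:
            v, move = v2, a
            alpha = max(alpha, v)
        if v >= beta:              # beta cutoff: MIN would never let play reach here
            return v, move
    return v, move

def ab_min_value(game, state, alpha, beta):
    if game.is_terminal(state):
        return game.utility(state, "MAX"), None
    v, move = math.inf, None
    for a in game.actions(state):
        v2, _ = ab_max_value(game, game.result(state, a), alpha, beta)
        if v2 < v:
            v, move = v2, a
            beta = min(beta, v)
        if v <= alpha:             # alpha cutoff: MAX already has a better option elsewhere
            return v, move
    return v, move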

Complexity and Move Ordering

The efficiency of Alpha-Beta pruning depends critically on move ordering:

  • Worst case: Poor move ordering, no pruning occurs, and the algorithm degenerates to plain MiniMax with time complexity \(O(b^m)\).
  • Best case: Perfect move ordering (the best move is always searched first), reducing time complexity to \(O(b^{m/2})\). This means Alpha-Beta can search a game tree twice as deep in the same amount of time.

Move Ordering Heuristics

Common ordering strategies in practice include:

  • Iterative Deepening: Perform a shallow search first and use the results to guide move ordering in the deeper search (a minimal sketch follows this list).
  • Killer Heuristic: Prioritize moves that produced cutoffs at sibling nodes.
  • History Heuristic: Order moves based on how often they have produced cutoffs throughout the search tree.
  • Transposition Table: Use a hash table to store previously evaluated positions and avoid redundant search.
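
As a minimal illustration of the first heuristic, the sketch below scores every legal move with a cheap preliminary search and sorts the moves best-first before the deeper search examines them. The shallow_value callable is a hypothetical placeholder (for example, a depth-limited Alpha-Beta search), not part of any particular engine.

def order_moves_by_shallow_search(game, state, shallow_value):
    """Score each legal move with a cheap shallow search, then sort best-first.
    `shallow_value(state)` is a hypothetical depth-limited evaluation from MAX's point of view."""
    scored = [(shallow_value(game.result(state, a)), a) for a in game.actions(state)]
    scored.sort(key=lambda pair: pair[0], reverse=True)   # most promising moves first
    return [a for _, a in scored]

Searching the most promising moves first is exactly what pushes Alpha-Beta toward its best-case \(O(b^{m/2})\) behavior.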

In practical applications (such as chess engines), since it is impossible to search all the way to terminal states, we typically cut off the search at some depth \(d\) and use an evaluation function \(\text{EVAL}(s)\) to estimate the utility of non-terminal states. A good evaluation function should:

  1. Agree with the true utility at terminal states.
  2. Be fast to compute.
  3. Correlate strongly with the actual probability of winning.

In chess, the classic evaluation function is based on piece values: pawn = 1, knight/bishop = 3, rook = 5, queen = 9, plus correction terms for position, mobility, king safety, and so on.
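
A minimal sketch of such a material-count evaluation is shown below. The board representation (a dict mapping squares to piece letters, uppercase for White) is an assumption made for illustration, not any real engine's data structure.

# Classic piece values in pawn units; the king is omitted because it is never captured.
PIECE_VALUES = {"P": 1, "N": 3, "B": 3, "R": 5, "Q": 9}

def material_eval(board):
    """Material balance from White's point of view.
    `board` is assumed to map square names to piece letters, uppercase = White."""
    score = 0
    for piece in board.values():
        value = PIECE_VALUES.get(piece.upper(), 0)
        score += value if piece.isupper() else -value
    return score

A real engine layers positional terms (piece-square tables, mobility, king safety) on top of this material term.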


Monte Carlo Tree Search (MCTS)

For games like Go, whose state space exceeds the number of atoms in the observable universe, even Alpha-Beta pruning with carefully designed evaluation functions falls far short; instead, we turn to Monte Carlo Tree Search (MCTS).

The core idea of MCTS is fundamentally different: no evaluation function is needed. Instead, it uses a large number of random simulations (rollouts/playouts) to statistically estimate the win rate of each move.

Four Phases

Each iteration of MCTS consists of four phases:

1. Selection

Starting from the root, follow the tree policy down through already-expanded nodes until reaching a leaf node (a node with unexpanded children). The selection policy must balance exploitation (choosing moves with high observed win rates) and exploration (trying less-visited moves).

2. Expansion

At the selected leaf node, add one or more unexpanded child nodes to the tree.

3. Simulation (Rollout)

From the newly expanded node, perform a random simulation until the game terminates. In vanilla MCTS, moves during simulation are chosen uniformly at random (or using simple heuristics).

4. Backpropagation

Propagate the simulation result (win/loss/draw) back up the path, updating the visit count \(N\) and cumulative reward \(Q\) at each node along the way.

These four phases repeat until the computational budget is exhausted (time or iteration limit), after which the child of the root with the highest visit count is selected as the final move.

The UCB1 Formula

The most widely used selection policy is UCT (Upper Confidence Bounds for Trees), based on the UCB1 formula:

\[ \text{UCB1}(v_i) = \frac{Q(v_i)}{N(v_i)} + C \cdot \sqrt{\frac{\ln N(v)}{N(v_i)}} \]

where:

  • \(Q(v_i)\): Cumulative reward (total wins or score) of child node \(v_i\).
  • \(N(v_i)\): Visit count of child node \(v_i\).
  • \(N(v)\): Visit count of parent node \(v\).
  • \(C\): Exploration constant controlling the exploitation-exploration trade-off. \(C = \sqrt{2}\) is the value suggested by the UCB1 analysis (for rewards scaled to \([0, 1]\)), though it is often tuned in practice.

The first term \(\frac{Q(v_i)}{N(v_i)}\) is the exploitation term, favoring nodes with high historical win rates. The second term \(C \cdot \sqrt{\frac{\ln N(v)}{N(v_i)}}\) is the exploration term, favoring nodes that have been visited less frequently.
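
The sketch below ties the UCB1 formula to the four phases described earlier. It is a bare-bones UCT for a two-player game; the game interface is the same hypothetical one used in the earlier sketches, and utilities are assumed to be scaled to [0, 1] (win = 1, draw = 0.5, loss = 0) so they can be averaged as win rates.

import math
import random

class Node:
    def __init__(self, state, parent=None, move=None):
        self.state, self.parent, self.move = state, parent, move
        self.children = []
        self.untried = None      # unexpanded actions, filled lazily on first visit
        self.N = 0               # visit count
        self.Q = 0.0             # cumulative reward for the player who moved into this node

def ucb1(child, parent, c=math.sqrt(2)):
    return child.Q / child.N + c * math.sqrt(math.log(parent.N) / child.N)

def mcts(game, root_state, iterations=10000):
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend through fully expanded nodes via UCB1
        while node.untried == [] and node.children:
            node = max(node.children, key=lambda ch: ucb1(ch, node))
        # 2. Expansion: add one unexpanded child, if the node is non-terminal
        if node.untried is None:
            node.untried = [] if game.is_terminal(node.state) else list(game.actions(node.state))
        if node.untried:
            a = node.untried.pop()
            child = Node(game.result(node.state, a), parent=node, move=a)
            node.children.append(child)
            node = child
        # 3. Simulation (rollout): play uniformly random moves to the end of the game
        state = node.state
        while not game.is_terminal(state):
            state = game.result(state, random.choice(list(game.actions(state))))
        # 4. Backpropagation: credit each node from the viewpoint of the player who moved into it
        while node is not None:
            node.N += 1
            if node.parent is not None:
                node.Q += game.utility(state, game.to_move(node.parent.state))
            node = node.parent
    # Final decision: the most-visited child of the root
    return max(root.children, key=lambda ch: ch.N).move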

Theoretical Guarantee of UCB1

It can be shown that as the number of iterations approaches infinity, the move chosen by MCTS converges to the MiniMax-optimal move. This means that given sufficient computational resources, MCTS approaches optimal play in any finite perfect-information game.

MCTS and Go

The state space of Go is approximately \(10^{170}\) (roughly \(2 \times 10^{170}\) legal positions), far exceeding chess's \(10^{47}\). Go is particularly challenging for the following reasons:

  1. Extremely large branching factor: \(b \approx 250\), compared to roughly 35 for chess.
  2. Evaluation functions are extremely hard to design: Evaluating a Go position is far more complex than in chess and resists simple rule-based quantification.
  3. Long games: An average game lasts about 150 moves, each with an enormous number of options.

The success of MCTS in Go stems from the fact that it requires no evaluation function. Early Go programs (such as MoGo, Crazy Stone, and Fuego) were based on vanilla MCTS with domain knowledge injected into the rollout policy (e.g., common patterns, ladder reading). These programs reached strong amateur level on \(9 \times 9\) boards but were considerably weaker on the standard \(19 \times 19\) board.

AlphaGo: MCTS + Deep Reinforcement Learning

AlphaGo (DeepMind, 2016) made critical improvements to the core MCTS framework:

  • Policy Network: A deep convolutional neural network proposes promising moves, guiding the selection and expansion phases of MCTS and drastically reducing the effective search breadth.
  • Value Network: A deep network directly estimates the win probability of a position, reducing the reliance on rolling out to the end of the game and drastically reducing the effective search depth.
  • Reinforcement Learning via Self-Play: The policy and value networks are continuously improved through self-play.

The successor, AlphaGo Zero, dispensed entirely with human game data and learned exclusively through self-play from scratch, surpassing the version that defeated Lee Sedol after roughly 72 hours of training. AlphaZero further generalized this framework to chess and shogi, surpassing the strongest specialized engines in all three games.

From Search to Learning

The success of AlphaGo demonstrates an important paradigm: the combination of search (MCTS) + learning (deep reinforcement learning) can solve problems beyond the reach of traditional methods. This idea carries over to the era of large language models; OpenAI's o1 model, for example, is widely understood to improve reasoning by combining search-like test-time computation with learned models.


Expectiminimax

Many real-world games involve stochastic elements, such as dice rolls in Backgammon. Expectiminimax extends MiniMax by introducing chance nodes.

The game tree contains three types of nodes:

  • MAX nodes: Choose the action that maximizes value
  • MIN nodes: Choose the action that minimizes value
  • Chance nodes: Compute the expected value over all possible random outcomes
\[ \text{EXPECTIMINIMAX}(s) = \begin{cases} \text{UTILITY}(s) & \text{if IS-TERMINAL}(s) \\ \max_a \text{EXPECTIMINIMAX}(\text{RESULT}(s, a)) & \text{if TO-MOVE}(s) = \text{MAX} \\ \min_a \text{EXPECTIMINIMAX}(\text{RESULT}(s, a)) & \text{if TO-MOVE}(s) = \text{MIN} \\ \sum_r P(r) \cdot \text{EXPECTIMINIMAX}(\text{RESULT}(s, r)) & \text{if } s \text{ is a chance node} \end{cases} \]

Complexity: If there are \(n\) possible random outcomes, the branching factor becomes \(b \times n\), yielding time complexity \(O((bn)^m)\).
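
A direct Python transcription of this recursion is sketched below; it returns only the value. The is_chance_node and chance_outcomes methods (the latter yielding (probability, outcome) pairs) are hypothetical extensions of the game interface used in the earlier sketches.

def expectiminimax(game, state):
    """Value of `state` with MAX maximizing, MIN minimizing, and chance nodes averaged."""
    if game.is_terminal(state):
        return game.utility(state, "MAX")
    if game.is_chance_node(state):                      # hypothetical predicate for dice rolls etc.
        return sum(p * expectiminimax(game, game.result(state, r))
                   for p, r in game.chance_outcomes(state))
    values = [expectiminimax(game, game.result(state, a)) for a in game.actions(state)]
    return max(values) if game.to_move(state) == "MAX" else min(values)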

Limitations of Alpha-Beta Pruning

Alpha-Beta pruning cannot be directly applied to Expectiminimax trees because chance node values are weighted averages that cannot be simply bounded. Modified Star1/Star2 pruning algorithms use the range of the utility function to perform limited pruning.


Imperfect-Information Games

All algorithms above assume perfect information — all players can see the complete game state. However, many real-world games have imperfect information, such as poker and bridge.

Information Sets

In imperfect-information games, a player cannot distinguish between certain states. All states in the same information set are indistinguishable to that player.

\[ \text{Information set } I_i = \{s \in S : s \text{ is indistinguishable to player } i\} \]

Solution Methods

1. Determinization

The simplest approach: convert the imperfect-information game into multiple perfect-information games, solve each, and aggregate (a code sketch follows the steps below).

  • Randomly sample multiple possible complete states from the current information set
  • Solve each complete state using standard Alpha-Beta or MCTS
  • Vote or average the recommended actions across solutions
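
A minimal sketch of this sample-solve-vote loop is shown below. Both sample_state_from (drawing a complete state consistent with the information set) and solve (e.g., the Alpha-Beta search sketched earlier) are hypothetical stand-ins for game-specific code.

from collections import Counter

def determinized_move(info_set, sample_state_from, solve, n_samples=50):
    """Sample complete states consistent with `info_set`, solve each as a
    perfect-information game, and return the most frequently recommended move."""
    votes = Counter()
    for _ in range(n_samples):
        state = sample_state_from(info_set)   # hypothetical sampler over the hidden information
        votes[solve(state)] += 1              # e.g. alpha_beta_search(game, state)
    return votes.most_common(1)[0][0]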

Pitfalls of Determinization

Determinization ignores the strategic value of information asymmetry. In poker, for example, determinization will never recommend bluffing, because in each determinized world the opponent "knows" your hand. This is known as the Strategy Fusion problem.

2. Counterfactual Regret Minimization (CFR)

CFR is currently the most successful method for solving large-scale imperfect-information games:

  • Regret: Measures "how much more I could have won had I chosen a different action"
  • Counterfactual regret: The regret of each action at an information set, weighted by the probability that the other players (and chance) bring play to that information set
\[ R_i^T(I, a) = \sum_{t=1}^T \left[ v_i(\sigma^t|_{I \to a}, I) - v_i(\sigma^t, I) \right] \]
where \(\sigma^t|_{I \to a}\) is the strategy profile \(\sigma^t\) modified to always play \(a\) at \(I\), and \(v_i\) denotes the counterfactual value
  • Strategy update: Normalize positive regrets via Regret Matching to form the next iteration's mixed strategy (a small sketch follows this list)
  • Convergence guarantee: The average strategy converges to a Nash equilibrium
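
The regret-matching step can be made concrete. The sketch below converts a table of cumulative regrets at a single information set into the next iteration's mixed strategy; it is only one building block of CFR, not a full implementation, and the example regrets are hypothetical.

def regret_matching(cumulative_regret):
    """Turn {action: cumulative regret} into a probability distribution over actions.
    Positive regrets are normalized; if none are positive, play uniformly at random."""
    positive = {a: max(r, 0.0) for a, r in cumulative_regret.items()}
    total = sum(positive.values())
    if total > 0:
        return {a: r / total for a, r in positive.items()}
    return {a: 1.0 / len(cumulative_regret) for a in cumulative_regret}

# Example (made-up regrets): the action with the largest positive regret gets the most mass.
print(regret_matching({"fold": -1.2, "call": 0.6, "raise": 1.8}))
# {'fold': 0.0, 'call': 0.25, 'raise': 0.75}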

Notable CFR successes:

  • Libratus (CMU, 2017): Defeated top human professionals in Heads-Up No-Limit Texas Hold'em, with \(10^{161}\) decision points
  • Pluribus (CMU & Facebook, 2019): Extended imperfect-information game solving to six-player poker

3. Information-Set MCTS (IS-MCTS)

Extends MCTS to imperfect-information games:

  • At the start of each iteration, sample a determinized state from the current information set
  • Merge nodes in the search tree by information set rather than concrete state
  • Balances the trade-off between determinization and exact solving

Multiplayer Games

When more than two players are involved, the problem changes fundamentally: there is no longer a single opponent whose loss is exactly our gain, so the scalar MiniMax value and the MAX/MIN dichotomy no longer apply directly.

Extend the utility function from a scalar to a vector \(\mathbf{V}(s) = (V_1(s), V_2(s), \ldots, V_n(s))\). Each player \(i\) chooses the action that maximizes \(V_i\) on their turn.
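
A minimal sketch of this vector-valued recursion (often called the max^n algorithm) is shown below; game.utilities(state), returning the full tuple \((V_1, \ldots, V_n)\), and integer player indices are assumptions layered on the earlier interface.

def maxn_value(game, state):
    """Utility vector reached when every player greedily maximizes their own component."""
    if game.is_terminal(state):
        return game.utilities(state)            # hypothetical: tuple (V_1, ..., V_n)
    player = game.to_move(state)                # assumed to be an index 0..n-1
    best = None
    for a in game.actions(state):
        v = maxn_value(game, game.result(state, a))
        if best is None or v[player] > best[player]:
            best = v
    return best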

Coalitions and Betrayal

In multiplayer games, players may form temporary coalitions — two weaker players teaming up against a stronger one. This makes game analysis far more complex than the two-player case:

  • Cooperative game theory: Studies conditions for stable coalition formation
  • The Core: Allocations from which no sub-coalition has an incentive to deviate
  • Shapley value: A principled method for fairly distributing coalition payoffs

Modern Game AI Overview

Game Type | Representative Game | Core Algorithm | Milestone
Perfect info · Deterministic | Tic-Tac-Toe | MiniMax | Completely solved
Perfect info · Deterministic | Chess | Alpha-Beta + evaluation function | Deep Blue (1997)
Perfect info · Deterministic | Go | MCTS + deep RL | AlphaGo (2016)
Perfect info · Stochastic | Backgammon | Expectiminimax + TD learning | TD-Gammon (1992)
Imperfect info | Texas Hold'em | CFR + abstraction | Libratus (2017)
Imperfect info · Multiplayer | 6-player poker | CFR + depth-limited search | Pluribus (2019)
Cooperative | Hanabi | Theory of Mind + search | Open problem
Multiplayer · Large-scale | StarCraft II | Multi-agent RL | AlphaStar (2019)
Multiplayer · Large-scale | Dota 2 | PPO + self-play | OpenAI Five (2019)

1. Search in LLM Reasoning

Improving reasoning in the era of large language models increasingly relies on search:

  • Best-of-N sampling: Generate \(N\) responses and select the best via a verifier — essentially Monte Carlo sampling
  • Tree of Thoughts (ToT): Organize the reasoning process as a tree and use BFS/DFS to search for the optimal reasoning path
  • Process Reward Models (PRM): Score each step of reasoning rather than only the final answer — analogous to a value network in MCTS
  • OpenAI o1/o3 models: Reported to spend extra "thinking time" (test-time compute) on reasoning, trading more computation for better answers

This direction is called inference-time scaling, and its core philosophy is continuous with MCTS: use search to compensate for model limitations.

2. Multi-Agent Games and Self-Play

  • Population-Based Training (PBT): Maintain a population of policies that evolve through mutual play
  • League Training: AlphaStar used league-style training where policies iterate against diverse opponents
  • Neural Fictitious Self-Play (NFSP): Combines RL and supervised learning to approximate Nash equilibria

3. Human-AI Collaborative Games

  • Cooperative AI: Cooperative games requiring Theory of Mind, such as Hanabi, remain open challenges
  • Human-AI hybrid teams: How to coordinate decisions when humans and AI each have distinct strengths
  • Interpretable game strategies: Making AI strategies understandable to humans to foster trust
