
Multi-Agent Reinforcement Learning Survey

Overview

Multi-Agent Reinforcement Learning (MARL) studies the problem of multiple agents simultaneously learning and interacting in a shared environment. Compared to single-agent RL, MARL faces unique challenges such as environment non-stationarity, credit assignment difficulties, and scalability, while also exhibiting rich phenomena including cooperation, competition, and emergent behaviors.


1. Why Multi-Agent RL?

1.1 The Real World Is Inherently Multi-Agent

  • Traffic systems: Multiple autonomous vehicles coordinating driving
  • Robot teams: Multi-robot collaboration for carrying, searching, etc.
  • Economic markets: Strategic games among multiple participants
  • Game AI: Multiplayer games like Dota 2 and StarCraft
  • Communication networks: Multi-node resource allocation coordination
  • Social simulation: Simulating multi-agent social behaviors

1.2 Limitations of Single-Agent Methods

Treating other agents as part of the environment and directly applying single-agent RL runs into several problems:

  • The environment becomes non-stationary (other agents are also learning and changing strategies)
  • Curse of dimensionality: Joint action space grows exponentially
  • Credit assignment: Difficult to determine individual contributions under team rewards
  • Cannot model communication and coordination

2. Problem Formulation

2.1 Markov Game (Stochastic Game)

The most common mathematical framework for MARL is the Markov Game, defined as:

\[\mathcal{G} = (N, \mathcal{S}, \{\mathcal{A}_i\}_{i=1}^N, P, \{R_i\}_{i=1}^N, \gamma)\]
  • \(N\): Number of agents
  • \(\mathcal{S}\): State space (global state)
  • \(\mathcal{A}_i\): Action space of agent \(i\)
  • \(P(s'|s, a_1, \dots, a_N)\): Joint state transition
  • \(R_i(s, a_1, \dots, a_N)\): Reward function of agent \(i\)
  • \(\gamma\): Discount factor

Joint action space: \(\mathcal{A} = \mathcal{A}_1 \times \mathcal{A}_2 \times \dots \times \mathcal{A}_N\)
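
A tiny numerical sketch can make the tuple concrete. The snippet below builds a random two-agent Markov game with the components above; all sizes and the `step` helper are illustrative choices, not part of any standard library.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 2                 # number of agents
n_states = 3          # |S|
n_actions = (2, 2)    # |A_i| for each agent
gamma = 0.95          # discount factor

# Joint transition kernel: P[s, a1, a2] is a distribution over next states.
P = rng.random((n_states, *n_actions, n_states))
P /= P.sum(axis=-1, keepdims=True)

# Per-agent rewards: R[i, s, a1, a2] is agent i's reward.
R = rng.random((N, n_states, *n_actions))

def step(s, joint_action):
    """Sample s' ~ P(.|s, a_1, ..., a_N) and return all agents' rewards."""
    s_next = rng.choice(n_states, p=P[(s, *joint_action)])
    rewards = R[(slice(None), s, *joint_action)]
    return s_next, rewards

s_next, rewards = step(0, (1, 0))
print(s_next, rewards)  # next state plus one reward per agent
```

Note that both the transition and every agent's reward depend on the joint action, which is exactly what makes naive single-agent reasoning break down.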

2.2 Dec-POMDP: Decentralized Partially Observable Markov Decision Process

A more realistic formulation where each agent has only local observations:

\[\mathcal{M} = (N, \mathcal{S}, \{\mathcal{A}_i\}, \{\mathcal{O}_i\}, P, O, \{R_i\}, \gamma)\]

Additional components:

  • \(\mathcal{O}_i\): Observation space of agent \(i\)
  • \(O(o_i|s, i)\): Observation function

Key differences:

  • Each agent makes decisions based on its local observation \(o_i\) rather than the global state \(s\)
  • Solving a Dec-POMDP optimally is NEXP-hard
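
Under decentralized execution, each agent maps its own observation history to an action, never the global state. Below is a minimal illustrative sketch of that interface; the hash-based policy is a placeholder (a real agent would run an RNN or transformer over the history).

```python
from collections import deque

class DecentralizedAgent:
    """Acts on a truncated local observation history o_i, never on s."""

    def __init__(self, n_actions, history_len=4):
        self.n_actions = n_actions
        self.history = deque(maxlen=history_len)

    def act(self, obs):
        self.history.append(obs)
        # Placeholder policy over the history; deterministic but arbitrary.
        return hash(tuple(self.history)) % self.n_actions

agent = DecentralizedAgent(n_actions=4)
print([agent.act(o) for o in ["o1", "o2", "o3"]])
```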

2.3 Special Cases

| Model | Reward Structure | Information Structure | Example |
|---|---|---|---|
| Cooperative game | \(R_1 = R_2 = \dots = R_N\) | Partially observable | Multi-robot cooperation |
| Zero-sum game | \(R_1 = -R_2\) (two-player) | Full/partial observability | Go, poker |
| General-sum game | Independent rewards | Partially observable | Traffic, economics |
| Mean-field game | Depends on mean behavior | Local observation | Large-scale populations |

3. Core Challenges of MARL

3.1 Non-Stationarity

Problem: From any agent's perspective, the environment includes other learning agents, so the environment dynamics are constantly changing.

\[P_i(s'|s, a_i) = \sum_{a_{-i}} P(s'|s, a_i, a_{-i}) \prod_{j \neq i} \pi_j(a_j|o_j)\]

As \(\pi_j\) updates, \(P_i\) also changes, breaking the MDP stationarity assumption.
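
The effect is easy to see numerically. In the toy sketch below, the joint kernel \(P\) is fixed, yet agent 1's effective kernel \(P_1\) shifts as soon as agent 2's policy changes; all numbers are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
P = rng.random((2, 2, 2, 2))          # P[s, a1, a2, s']: 2 states, 2 actions each
P /= P.sum(axis=-1, keepdims=True)

def effective_kernel(pi2):
    """P_1(s'|s, a_1) = sum_{a_2} P(s'|s, a_1, a_2) * pi2(a_2|s)."""
    return np.einsum('sabt,sb->sat', P, pi2)

pi2_before = np.array([[0.9, 0.1], [0.9, 0.1]])  # pi2(a_2|s) before learning
pi2_after  = np.array([[0.1, 0.9], [0.1, 0.9]])  # after agent 2 updates

drift = np.abs(effective_kernel(pi2_before) - effective_kernel(pi2_after)).max()
print(f"max shift in P_1: {drift:.3f}")  # > 0: agent 1's "MDP" has moved
```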

Countermeasures:

  • Centralized training: Leverage global information during training
  • Opponent modeling: Explicitly model other agents' policies
  • Experience replay correction: Importance sampling to correct stale experiences

3.2 Credit Assignment

Problem: In cooperative tasks, the team receives a shared reward \(R_{team}\). How should each agent's individual contribution be determined?

\[R_{team} = R(s, a_1, a_2, \dots, a_N)\]

What is agent \(i\)'s contribution to the team reward?

Countermeasures:

  • Difference rewards: \(R_i = R_{team}(a_i, a_{-i}) - R_{team}(a_{-i})\), i.e., the team reward minus the counterfactual reward with agent \(i\)'s action removed (or replaced by a default)
  • Value decomposition: VDN, QMIX decompose team Q-values into individual contributions
  • Shapley values: Game-theoretic fair allocation methods
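
As a concrete instance of the Shapley-value idea in the last bullet, the sketch below computes exact Shapley credits for a made-up 3-agent coalition reward. The value table is hypothetical, and exact enumeration like this only scales to a handful of agents.

```python
from itertools import permutations
from math import factorial

agents = (0, 1, 2)

def R_team(coalition):
    """Hypothetical team reward earned by a subset of agents."""
    values = {(): 0, (0,): 1, (1,): 1, (2,): 0,
              (0, 1): 4, (0, 2): 2, (1, 2): 2, (0, 1, 2): 6}
    return values[tuple(sorted(coalition))]

def shapley(i):
    """Agent i's average marginal contribution over all join orders."""
    total = 0.0
    for order in permutations(agents):
        k = order.index(i)
        total += R_team(order[:k] + (i,)) - R_team(order[:k])
    return total / factorial(len(agents))

print([shapley(i) for i in agents])  # [2.5, 2.5, 1.0] -- sums to R_team(all) = 6
```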

3.3 Partial Observability

Problem: Agents can only observe local information about the environment.

  • Cannot access other agents' states/intentions
  • Need to infer hidden information from observation history
  • Communication can mitigate but not fully resolve this

3.4 Scalability

Problem: Joint action space grows exponentially with the number of agents.

\[|\mathcal{A}| = \prod_{i=1}^N |\mathcal{A}_i|\]

With \(N=10\), \(|\mathcal{A}_i|=5\): \(|\mathcal{A}| = 5^{10} \approx 10^7\)

Countermeasures:

  • Parameter sharing: All agents share network parameters
  • Mean-field approximation: Replace individual interactions with population mean behavior
  • Attention mechanisms: Dynamically select subsets of agents to attend to
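
As a minimal sketch of the first countermeasure above (parameter sharing): all \(N\) agents evaluate one shared policy, with a one-hot agent ID appended to the observation so behavior can still differ per agent. The shapes and the linear "network" are illustrative stand-ins for a real model.

```python
import numpy as np

N, obs_dim, n_actions = 10, 8, 5
rng = np.random.default_rng(2)
W = rng.standard_normal((obs_dim + N, n_actions)) * 0.1  # one shared weight matrix

def act(agent_id, obs):
    """All N agents reuse the same W; only the ID one-hot differs."""
    one_hot = np.eye(N)[agent_id]
    logits = np.concatenate([obs, one_hot]) @ W
    return int(np.argmax(logits))

actions = [act(i, rng.standard_normal(obs_dim)) for i in range(N)]
print(actions)  # N actions from a single O(obs_dim * n_actions) parameter set
```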

3.5 Coordinated Exploration

Problem: Multiple agents need coordinated exploration; independent exploration may never discover cooperative strategies.

  • Joint exploration space is enormous
  • Good cooperative strategies may require multiple agents to change behavior simultaneously
  • Local optima traps are more severe

4. MARL Training Paradigms

4.1 Paradigm Taxonomy

```mermaid
graph TD
    MARL[MARL Training Paradigms] --> IL[Independent Learners]
    MARL --> CTDE[Centralized Training<br>Decentralized Execution<br>CTDE]
    MARL --> FC[Fully Centralized]

    IL --> IQL[Independent Q-Learning]
    IL --> IPPO[Independent PPO]

    CTDE --> VD[Value Decomposition]
    CTDE --> CC[Centralized Critic]
    CTDE --> COMM[Communication Learning]

    VD --> VDN[VDN]
    VD --> QMIX[QMIX]

    CC --> MADDPG[MADDPG]
    CC --> MAPPO[MAPPO]

    COMM --> CommNet[CommNet]
    COMM --> TarMAC[TarMAC]

    FC --> CQL_M[Joint Q-Learning]

    style MARL fill:#e1f5fe
    style CTDE fill:#e8f5e9
```

4.2 Independent Learners

Idea: Each agent independently runs a single-agent RL algorithm.

Training: Decentralized, each agent uses only its own observations and rewards

Execution: Decentralized

Advantages:

  • Simple implementation
  • Good scalability
  • No communication needed

Disadvantages:

  • Ignores other agents
  • Non-stationary environment causes training instability
  • Difficult to learn cooperative strategies

Representative algorithms: IQL (Independent Q-Learning), IPPO (Independent PPO)
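
A minimal tabular IQL sketch, assuming each agent receives its own \((o, a, r, o')\) tuples from some multi-agent environment loop (not shown): the update is plain single-agent Q-learning, with the other agents folded into the environment.

```python
import numpy as np
from collections import defaultdict

class IQLAgent:
    """One independent tabular Q-learner; instantiate one per agent."""

    def __init__(self, n_actions, lr=0.1, gamma=0.99, eps=0.1):
        self.Q = defaultdict(lambda: np.zeros(n_actions))
        self.n_actions, self.lr, self.gamma, self.eps = n_actions, lr, gamma, eps

    def act(self, obs):
        if np.random.rand() < self.eps:          # epsilon-greedy exploration
            return np.random.randint(self.n_actions)
        return int(np.argmax(self.Q[obs]))

    def update(self, obs, action, reward, next_obs):
        # Standard single-agent TD target; other agents are just "environment",
        # which is precisely where the non-stationarity problem comes from.
        target = reward + self.gamma * self.Q[next_obs].max()
        self.Q[obs][action] += self.lr * (target - self.Q[obs][action])
```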

4.3 Centralized Training Decentralized Execution (CTDE)

Idea: Leverage global information during training; use only local observations during execution.

Training: Centralized, with access to global state and all agents' observations and actions

Execution: Decentralized, each agent uses only its own local observations

Advantages:

  • Can leverage additional information during training to improve learning efficiency
  • No communication needed during execution, suitable for real-world deployment
  • Currently the most mainstream paradigm

Disadvantages:

  • Requires centralized infrastructure during training
  • Information asymmetry between training and execution may cause issues

Representative algorithms: QMIX, MADDPG, MAPPO
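
To make the value-decomposition branch concrete, here is a pure-numpy sketch of the VDN additivity assumption (QMIX generalizes it with a monotonic mixing network); the Q-values are made up.

```python
import numpy as np

def vdn_q_tot(per_agent_qs, actions):
    """Q_tot(s, a) = sum_i Q_i(o_i, a_i) -- the VDN additivity assumption."""
    return sum(q[a] for q, a in zip(per_agent_qs, actions))

# Two agents, 3 actions each: each agent acts greedily and independently...
qs = [np.array([1.0, 2.0, 0.5]), np.array([0.2, 0.1, 3.0])]
greedy = [int(np.argmax(q)) for q in qs]
# ...yet argmax_a Q_tot factorizes into the per-agent argmaxes, which is why
# decentralized greedy execution stays consistent with the centralized value.
print(greedy, vdn_q_tot(qs, greedy))  # [1, 2] 5.0
```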

CTDE Is the Current Mainstream

CTDE achieves the best balance between practicality and performance, and is the most prevalent paradigm in current MARL research and applications.

4.4 Fully Centralized

Idea: Treat the multi-agent problem as a single super-agent's decision problem.

Training/Execution: Both centralized

Advantages:

  • Can theoretically find the global optimum
  • Full coordination

Disadvantages:

  • Joint action space explodes exponentially
  • Requires global communication
  • Not suitable for large-scale problems
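
A sketch of the super-agent view, with hypothetical sizes: one tabular Q-function indexed by the joint action. The table is already 125 columns wide for 3 agents with 5 actions each, and would be \(5^{10}\) for the earlier 10-agent example.

```python
import numpy as np
from itertools import product

n_states, n_agents, n_actions = 4, 3, 5
joint_actions = list(product(range(n_actions), repeat=n_agents))
Q = np.zeros((n_states, len(joint_actions)))  # width = n_actions ** n_agents

def greedy_joint_action(s):
    """The super-agent selects one fully coordinated joint action."""
    return joint_actions[int(np.argmax(Q[s]))]

print(len(joint_actions), greedy_joint_action(0))  # 125 (0, 0, 0)
```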


5. Cooperative vs Competitive vs Mixed

5.1 Cooperative Tasks

All agents share the same objective:

\[\max_{\pi_1, \dots, \pi_N} J = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_{team}(s_t, \mathbf{a}_t)\right]\]

Challenges: Credit assignment, coordinated exploration

Applications: Multi-robot cooperation, formation control, cooperative search

5.2 Competitive Tasks

Agents are adversarial:

  • Zero-sum game: \(R_1 + R_2 = 0\)
  • Solution concept: Nash equilibrium
\[\pi_i^* = \arg\max_{\pi_i} J_i(\pi_i, \pi_{-i}^*), \quad \forall i\]

Challenges: Non-transitivity (A > B > C > A), equilibrium computation

Applications: Go, poker, security games
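
For intuition on equilibrium computation, the sketch below runs fictitious play on rock-paper-scissors: each player best-responds to the opponent's empirical action frequencies, and in two-player zero-sum games those frequencies converge to a Nash equilibrium (here, uniform play).

```python
import numpy as np

# Player 1's payoff matrix for rock-paper-scissors (rows: P1, cols: P2).
A = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]])

counts = [np.ones(3), np.ones(3)]  # smoothed empirical action counts
for _ in range(20000):
    p2_freq = counts[1] / counts[1].sum()
    a1 = int(np.argmax(A @ p2_freq))       # P1 best-responds to P2's history
    p1_freq = counts[0] / counts[0].sum()
    a2 = int(np.argmax(-A.T @ p1_freq))    # zero-sum: P2's payoffs are -A^T
    counts[0][a1] += 1
    counts[1][a2] += 1

print(np.round(counts[0] / counts[0].sum(), 2))  # approx [0.33 0.33 0.33]
```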

5.3 Mixed Tasks

Both cooperation and competition:

  • Team competition: Intra-team cooperation, inter-team competition (e.g., Dota 2)
  • Social dilemmas: Individual vs collective rationality conflict (e.g., prisoner's dilemma)
  • Mechanism design: Designing incentives to promote cooperation

Applications: Multiplayer games, traffic systems, economic simulation


6. Evaluation and Benchmarks

6.1 Common Environments

| Environment | Type | Agents | Features |
|---|---|---|---|
| MPE | Cooperative/competitive | 2-10 | Simple continuous environment, classic benchmark |
| SMAC | Cooperative | 2-27 | StarCraft micromanagement |
| Google Football | Cooperative/competitive | 2-22 | Football simulation |
| Hanabi | Cooperative | 2-5 | Imperfect-information card game |
| Overcooked | Cooperative | 2 | Human-AI cooking cooperation |
| MAgent | Large-scale | 100+ | Large-scale confrontation |
| MetaDrive | Mixed | Multiple vehicles | Autonomous driving |

6.2 Evaluation Metrics

  • Team return: Core metric for cooperative tasks
  • Win rate: Comparison metric in competitive tasks
  • Social welfare: Sum of all agents' returns
  • Fairness: Uniformity of return distribution
  • Scalability: Performance change with increasing agent count
  • Communication overhead: Communication volume and bandwidth requirements
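
The first few metrics are straightforward to compute from per-agent episode returns. A small sketch with hypothetical numbers, using Jain's index as one common (but not the only) fairness measure:

```python
import numpy as np

returns = np.array([12.0, 9.5, 11.2, 3.1])  # hypothetical per-agent returns

social_welfare = returns.sum()
# Jain's fairness index: 1 when all returns are equal, 1/N when one agent
# receives everything.
jain_fairness = returns.sum() ** 2 / (len(returns) * (returns ** 2).sum())

print(f"welfare={social_welfare:.1f}, fairness={jain_fairness:.2f}")
```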

7. MARL and LLM Multi-Agent Systems

7.1 Emerging Directions

In the era of large language models, multi-agent systems are seeing new research directions:

  • LLM multi-agent collaboration: Multiple LLM roles collaborating to solve problems (AutoGen, CrewAI)
  • Debate and negotiation: Multiple LLMs improving reasoning through debate
  • Social simulation: Using LLM agents to simulate social behavior (Generative Agents)
  • RL-trained multi-agent LLMs: Using MARL methods to train LLM interactions

7.2 Classical MARL vs LLM Multi-Agent

| Dimension | Classical MARL | LLM Multi-Agent |
|---|---|---|
| Agents | Trained from scratch | Pre-trained large models |
| Communication | Learned vectors | Natural language |
| Policy space | Continuous/discrete actions | Text generation |
| Training method | Gradient optimization | Prompt engineering/fine-tuning |
| Interpretability | Low | High (natural language) |

8. Summary and Outlook

8.1 Current State

  • CTDE paradigm is mature, performing well across multiple benchmarks
  • MAPPO is surprisingly powerful in cooperative tasks
  • Value decomposition methods work well in discrete action spaces
  • Large-scale MARL (100+ agents) remains challenging

8.2 Future Directions

  1. Large-scale MARL: Handling hundreds or thousands of agents
  2. Heterogeneous agents: Cooperation among different types of agents
  3. Transfer and generalization: Generalizing across tasks/agent counts
  4. Safe multi-agent systems: Guaranteeing safety of multi-agent systems
  5. Human-AI hybrid teams: Collaboration between humans and AI agents
  6. LLM + MARL: Integration of large language models with classical MARL
