Multi-Agent Reinforcement Learning Survey
Overview
Multi-Agent Reinforcement Learning (MARL) studies the problem of multiple agents simultaneously learning and interacting in a shared environment. Compared to single-agent RL, MARL faces unique challenges such as environment non-stationarity, credit assignment difficulties, and scalability, while also exhibiting rich phenomena including cooperation, competition, and emergent behaviors.
1. Why Multi-Agent RL?
1.1 The Real World Is Inherently Multi-Agent
- Traffic systems: Multiple autonomous vehicles coordinating driving
- Robot teams: Multi-robot collaboration for carrying, searching, etc.
- Economic markets: Strategic games among multiple participants
- Game AI: Multiplayer games like Dota 2 and StarCraft
- Communication networks: Multi-node resource allocation coordination
- Social simulation: Simulating multi-agent social behaviors
1.2 Limitations of Single-Agent Methods
Treating other agents as part of the environment and directly applying single-agent RL leads to several problems:
- The environment becomes non-stationary (other agents are also learning and changing strategies)
- Curse of dimensionality: Joint action space grows exponentially
- Credit assignment: Difficult to determine individual contributions under team rewards
- Cannot model communication and coordination
2. Problem Formulation
2.1 Markov Game (Stochastic Game)
The most common mathematical framework for MARL is the Markov Game, defined by the tuple \((N, \mathcal{S}, \{\mathcal{A}_i\}_{i=1}^N, P, \{R_i\}_{i=1}^N, \gamma)\), where:
- \(N\): Number of agents
- \(\mathcal{S}\): State space (global state)
- \(\mathcal{A}_i\): Action space of agent \(i\)
- \(P(s'|s, a_1, \dots, a_N)\): Joint state transition
- \(R_i(s, a_1, \dots, a_N)\): Reward function of agent \(i\)
- \(\gamma\): Discount factor
Joint action space: \(\mathcal{A} = \mathcal{A}_1 \times \mathcal{A}_2 \times \dots \times \mathcal{A}_N\)
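To make the tuple concrete, here is a minimal sketch of a two-agent Markov game encoded as plain NumPy arrays; all sizes, values, and names are illustrative assumptions rather than anything prescribed above:

```python
import numpy as np

# Minimal two-agent Markov game: 2 states, 2 actions per agent (toy sizes).
N = 2
n_states, n_actions = 2, 2
gamma = 0.95

# P[s, a1, a2, s']: joint transition probabilities (uniform here for brevity)
P = np.full((n_states, n_actions, n_actions, n_states), 0.5)

# R[i, s, a1, a2]: reward of agent i for the joint action (a1, a2) in state s
R = np.random.default_rng(0).normal(size=(N, n_states, n_actions, n_actions))

def step(rng, s, a1, a2):
    """Sample s' ~ P(.|s, a1, a2) and return each agent's reward."""
    s_next = int(rng.choice(n_states, p=P[s, a1, a2]))
    return s_next, R[0, s, a1, a2], R[1, s, a1, a2]
```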
2.2 Dec-POMDP: Decentralized Partially Observable Markov Decision Process
A more realistic formulation where each agent has only local observations:
Additional components:
- \(\mathcal{O}_i\): Observation space of agent \(i\)
- \(O(o_i \mid s, i)\): Observation function
Key differences:
- Each agent makes decisions based on its local observation \(o_i\) rather than the global state \(s\)
- Solving a Dec-POMDP optimally is NEXP-hard
2.3 Special Cases
| Model | Reward Structure | Information Structure | Example |
|---|---|---|---|
| Cooperative game | \(R_1 = R_2 = \dots = R_N\) | Partially observable | Multi-robot cooperation |
| Zero-sum game | \(R_1 = -R_2\) (two-player) | Full/partial observability | Go, poker |
| General-sum game | Independent | Partially observable | Traffic, economics |
| Mean-field game | Depends on mean behavior | Local observation | Large-scale populations |
3. Core Challenges of MARL
3.1 Non-Stationarity
Problem: From any agent's perspective, the environment includes other learning agents, so the environment dynamics are constantly changing.
Formally, the transition kernel agent \(i\) experiences is \(P_i(s' \mid s, a_i) = \sum_{a_{-i}} P(s' \mid s, a_i, a_{-i}) \prod_{j \neq i} \pi_j(a_j \mid s)\). As the other policies \(\pi_j\) update, \(P_i\) changes with them, breaking the MDP stationarity assumption (made concrete in the sketch after the countermeasures below).
Countermeasures:
- Centralized training: Leverage global information during training
- Opponent modeling: Explicitly model other agents' policies
- Experience replay correction: Importance sampling to correct stale experiences
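To see the shifting kernel in code, the following minimal NumPy sketch marginalizes a joint transition tensor over another agent's current policy; every tensor shape here is an illustrative assumption, not something defined in the text:

```python
import numpy as np

def induced_kernel(P_joint, pi_other):
    """Effective transition P_i(s'|s, a_i) seen by agent i.

    P_joint[s, a_i, a_j, s'] is the joint transition tensor and
    pi_other[s, a_j] the other agent's current policy (assumed shapes).
    Every update to pi_other changes the returned kernel, which is
    exactly the non-stationarity described above.
    """
    return np.einsum("sijt,sj->sit", P_joint, pi_other)
```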
3.2 Credit Assignment
Problem: In cooperative tasks, the team receives a single shared reward \(R_{team}\). How can each agent \(i\)'s individual contribution to that reward be determined?
Countermeasures:
- Difference rewards: \(R_i = R_{team}(a_i, a_{-i}) - R_{team}(a_{-i})\), where the second term evaluates the team reward with agent \(i\)'s action removed or replaced by a default action (sketched below)
- Value decomposition: VDN, QMIX decompose team Q-values into individual contributions
- Shapley values: Game-theoretic fair allocation methods
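As a code-level illustration of the first countermeasure, a difference reward is just a counterfactual evaluation of the team reward; `team_reward` and `default_action` below are assumed placeholders, not APIs from any particular library:

```python
def difference_reward(team_reward, joint_action, i, default_action):
    """Difference reward for agent i: team reward under the actual joint
    action minus team reward with agent i's action replaced by a default.
    `team_reward` is an assumed callable over joint-action tuples."""
    counterfactual = list(joint_action)
    counterfactual[i] = default_action  # "remove" agent i's contribution
    return team_reward(tuple(joint_action)) - team_reward(tuple(counterfactual))
```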
3.3 Partial Observability
Problem: Agents can only observe local information about the environment.
- Cannot access other agents' states/intentions
- Need to infer hidden information from observation history (see the recurrent-policy sketch after this list)
- Communication can mitigate but not fully resolve this
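A standard remedy for the history-inference point above is to condition each agent's policy on its observation history with a recurrent network. A minimal PyTorch sketch, with all dimensions assumed:

```python
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    """Summarizes the observation history o_1..o_t in a GRU hidden state,
    approximating the belief an agent cannot compute from o_t alone."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.gru = nn.GRU(obs_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, obs_seq):           # obs_seq: (batch, time, obs_dim)
        h, _ = self.gru(obs_seq)
        return self.head(h[:, -1])        # action logits from the latest belief
```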
3.4 Scalability
Problem: Joint action space grows exponentially with the number of agents.
With \(N=10\), \(|\mathcal{A}_i|=5\): \(|\mathcal{A}| = 5^{10} \approx 10^7\)
Countermeasures:
- Parameter sharing: All agents share network parameters (sketched after this list)
- Mean-field approximation: Replace individual interactions with population mean behavior
- Attention mechanisms: Dynamically select subsets of agents to attend to
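Parameter sharing is straightforward to sketch: a single network serves every agent, and appending a one-hot agent ID lets the shared weights still express per-agent behavior. Dimensions and the ID-concatenation choice are assumptions of this sketch:

```python
import torch
import torch.nn as nn

class SharedPolicy(nn.Module):
    """One policy network shared by all agents: parameter count is constant
    in the number of agents. The one-hot agent ID disambiguates roles."""
    def __init__(self, obs_dim, n_agents, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + n_agents, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs, agent_id_onehot):
        return self.net(torch.cat([obs, agent_id_onehot], dim=-1))
```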
3.5 Coordinated Exploration
Problem: Multiple agents need coordinated exploration; independent exploration may never discover cooperative strategies.
- Joint exploration space is enormous
- Good cooperative strategies may require multiple agents to change behavior simultaneously
- Local optima traps are more severe
4. MARL Training Paradigms
4.1 Paradigm Taxonomy
```mermaid
graph TD
    MARL[MARL Training Paradigms] --> IL[Independent Learners]
    MARL --> CTDE[Centralized Training<br>Decentralized Execution<br>CTDE]
    MARL --> FC[Fully Centralized]
    IL --> IQL[Independent Q-Learning]
    IL --> IPPO[Independent PPO]
    CTDE --> VD[Value Decomposition]
    CTDE --> CC[Centralized Critic]
    CTDE --> COMM[Communication Learning]
    VD --> VDN[VDN]
    VD --> QMIX[QMIX]
    CC --> MADDPG[MADDPG]
    CC --> MAPPO[MAPPO]
    COMM --> CommNet[CommNet]
    COMM --> TarMAC[TarMAC]
    FC --> CQL_M[Joint Q-Learning]
    style MARL fill:#e1f5fe
    style CTDE fill:#e8f5e9
```
4.2 Independent Learners
Idea: Each agent independently runs a single-agent RL algorithm.
Training: Decentralized, each agent uses only its own observations and rewards
Execution: Decentralized
Advantages:
- Simple implementation
- Good scalability
- No communication needed
Disadvantages:
- Ignores other agents
- Non-stationary environment causes training instability
- Difficult to learn cooperative strategies
Representative algorithms: IQL (Independent Q-Learning), IPPO (Independent PPO)
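A minimal sketch of the independent-learners idea with tabular Q-learning; the `env` interface (reset/step returning per-agent observations and rewards) is hypothetical, and observations are assumed hashable:

```python
import numpy as np
from collections import defaultdict

def make_q_tables(n_agents, n_actions):
    # One independent Q-table per agent; unseen observations default to zeros.
    return [defaultdict(lambda: np.zeros(n_actions)) for _ in range(n_agents)]

def iql_episode(env, Q, n_actions, alpha=0.1, gamma=0.99, eps=0.1, rng=None):
    """One episode of independent Q-learning: each agent i updates its own
    Q[i] from its local observation and reward, treating all other agents
    as part of the environment (hence the non-stationarity of Section 3.1)."""
    rng = rng or np.random.default_rng()
    obs = env.reset()
    done = False
    while not done:
        acts = [int(rng.integers(n_actions)) if rng.random() < eps
                else int(np.argmax(Q[i][o])) for i, o in enumerate(obs)]
        next_obs, rewards, done = env.step(acts)
        for i in range(len(obs)):
            best_next = 0.0 if done else np.max(Q[i][next_obs[i]])
            td_target = rewards[i] + gamma * best_next
            Q[i][obs[i]][acts[i]] += alpha * (td_target - Q[i][obs[i]][acts[i]])
        obs = next_obs
```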
4.3 Centralized Training Decentralized Execution (CTDE)
Idea: Leverage global information during training; use only local observations during execution.
Training: Centralized, with access to global state and all agents' observations and actions
Execution: Decentralized, each agent uses only its own local observations
Advantages:
- Can leverage additional information to improve learning efficiency during training
- No communication needed during execution, suitable for real deployment
- Currently the most mainstream paradigm
Disadvantages:
- Requires centralized infrastructure during training
- Information asymmetry between training and execution may cause issues
Representative algorithms: QMIX, MADDPG, MAPPO
CTDE Is the Current Mainstream
CTDE achieves the best balance between practicality and performance, and is the most prevalent paradigm in current MARL research and applications.
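The simplest value-decomposition instance of CTDE is VDN: per-agent utilities computed from local observations are summed into a team value that is trained centrally, while execution needs only each agent's own head. A minimal sketch with assumed dimensions:

```python
import torch
import torch.nn as nn

class VDN(nn.Module):
    """VDN-style decomposition: Q_tot(s, a) = sum_i Q_i(o_i, a_i).
    The sum is trained centrally against the team reward; at execution
    time each agent greedily maximizes its own Q_i from local obs only."""
    def __init__(self, n_agents, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.agents = nn.ModuleList(
            nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, n_actions))
            for _ in range(n_agents)
        )

    def forward(self, obs, actions):      # obs: (B, N, obs_dim); actions: (B, N)
        q_i = torch.stack([net(obs[:, i]) for i, net in enumerate(self.agents)], 1)
        chosen = q_i.gather(-1, actions.unsqueeze(-1)).squeeze(-1)   # (B, N)
        return chosen.sum(-1)             # Q_tot, used for the centralized TD loss
```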
4.4 Fully Centralized
Idea: Treat the multi-agent problem as a single super-agent's decision problem.
Training/Execution: Both centralized
Advantages:
- Can theoretically find the global optimum
- Full coordination
Disadvantages:
- Joint action space explodes exponentially
- Requires global communication
- Not suitable for large-scale problems
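The blow-up is easy to see by materializing the joint action space, which is exactly what a fully centralized learner must index over (toy sizes below):

```python
import itertools

# A fully centralized learner treats each tuple below as ONE action of a
# super-agent. At 10 agents with 5 actions each this is 5**10 = 9,765,625.
n_agents, n_actions = 3, 5                       # kept small for the demo
joint = list(itertools.product(range(n_actions), repeat=n_agents))
print(len(joint))                                # 125 = 5**3
```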
5. Cooperative vs Competitive vs Mixed
5.1 Cooperative Tasks
All agents share the same objective: \(R_1 = R_2 = \dots = R_N = R_{team}\).
Challenges: Credit assignment, coordinated exploration
Applications: Multi-robot cooperation, formation control, cooperative search
5.2 Competitive Tasks
Agents are adversarial:
- Zero-sum game: \(R_1 + R_2 = 0\)
- Solution concept: Nash equilibrium
Challenges: Non-transitivity (A > B > C > A, illustrated below), equilibrium computation
Applications: Go, poker, security games
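Rock-paper-scissors is the canonical non-transitive game: each pure strategy beats one opponent and loses to another, and the unique Nash equilibrium mixes uniformly. A quick numerical check:

```python
import numpy as np

# Row player's payoff: A[i, j] = result of strategy i vs strategy j.
# Rock beats scissors, scissors beats paper, paper beats rock: a cycle,
# so "beats" is not transitive and no pure-strategy ordering exists.
A = np.array([[ 0, -1,  1],   # rock     vs (rock, paper, scissors)
              [ 1,  0, -1],   # paper
              [-1,  1,  0]])  # scissors
uniform = np.ones(3) / 3
print(A @ uniform)  # [0. 0. 0.]: every strategy earns 0 against the uniform mix
```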
5.3 Mixed Tasks
Both cooperation and competition:
- Team competition: Intra-team cooperation, inter-team competition (e.g., Dota 2)
- Social dilemmas: Individual vs collective rationality conflict (e.g., prisoner's dilemma)
- Mechanism design: Designing incentives to promote cooperation
Applications: Multiplayer games, traffic systems, economic simulation
6. Evaluation and Benchmarks
6.1 Common Environments
| Environment | Type | Agents | Features |
|---|---|---|---|
| MPE | Coop/Competitive | 2-10 | Simple 2D particle environments, classic benchmark |
| SMAC | Cooperative | 2-27 | StarCraft micromanagement |
| Google Football | Coop/Competitive | 2-22 | Football simulation |
| Hanabi | Cooperative | 2-5 | Imperfect information card game |
| Overcooked | Cooperative | 2 | Human-AI cooking cooperation |
| MAgent | Large-scale | 100+ | Large-scale confrontation |
| MetaDrive | Mixed | Multiple vehicles | Autonomous driving |
6.2 Evaluation Metrics
- Team return: Core metric for cooperative tasks
- Win rate: Comparison metric in competitive tasks
- Social welfare: Sum of all agents' returns (see the sketch after this list)
- Fairness: Uniformity of return distribution
- Scalability: Performance change with increasing agent count
- Communication overhead: Communication volume and bandwidth requirements
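Social welfare and fairness are one-liners over per-agent returns; the sketch below uses Jain's index as one possible fairness measure, which is a choice of ours rather than something the text specifies:

```python
import numpy as np

def social_welfare(returns):
    """Sum of all agents' episode returns."""
    return float(np.sum(returns))

def jain_fairness(returns):
    """Jain's fairness index in (0, 1]; 1 means perfectly uniform returns.
    One common fairness measure (an assumption here); expects nonnegative,
    not-all-zero returns."""
    r = np.asarray(returns, dtype=float)
    return float(r.sum() ** 2 / (len(r) * np.sum(r ** 2)))
```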
7. MARL and LLM Multi-Agent Systems
7.1 Emerging Directions
In the era of large language models, multi-agent systems are seeing new research directions:
- LLM multi-agent collaboration: Multiple LLM roles collaborating to solve problems (AutoGen, CrewAI)
- Debate and negotiation: Multiple LLMs improving reasoning through debate
- Social simulation: Using LLM agents to simulate social behavior (Generative Agents)
- RL-trained multi-agent LLMs: Using MARL methods to train LLM interactions
7.2 Classical MARL vs LLM Multi-Agent
| Dimension | Classical MARL | LLM Multi-Agent |
|---|---|---|
| Agents | Trained from scratch | Pre-trained large models |
| Communication | Learned vectors | Natural language |
| Policy space | Continuous/discrete actions | Text generation |
| Training method | Gradient optimization | Prompt engineering/fine-tuning |
| Interpretability | Low | High (natural language) |
8. Summary and Outlook
8.1 Current State
- CTDE paradigm is mature, performing well across multiple benchmarks
- MAPPO is surprisingly powerful in cooperative tasks
- Value decomposition methods work well in discrete action spaces
- Large-scale MARL (100+ agents) remains challenging
8.2 Future Directions
- Large-scale MARL: Handling hundreds or thousands of agents
- Heterogeneous agents: Cooperation among different types of agents
- Transfer and generalization: Generalizing across tasks/agent counts
- Safe multi-agent systems: Guaranteeing safety of multi-agent systems
- Human-AI hybrid teams: Collaboration between humans and AI agents
- LLM + MARL: Integration of large language models with classical MARL
Further Reading
- MARL Algorithms — Value decomposition, policy gradient, and specific algorithms
- RL Landscape — Global view of RL methodology
- RL Milestones — OpenAI Five, AlphaStar, and other milestones
- Multi-Agent Survey — Multi-agent systems from an AI Agent perspective