
Multi-Agent Reinforcement Learning Survey

Overview

Multi-Agent Reinforcement Learning (MARL) studies the problem of multiple agents simultaneously learning and interacting in a shared environment. Compared to single-agent RL, MARL faces unique challenges such as environment non-stationarity, credit assignment difficulties, and scalability, while also exhibiting rich phenomena including cooperation, competition, and emergent behaviors.


1. Why Multi-Agent RL?

1.1 The Real World Is Inherently Multi-Agent

  • Traffic systems: Multiple autonomous vehicles coordinating driving
  • Robot teams: Multi-robot collaboration for carrying, searching, etc.
  • Economic markets: Strategic games among multiple participants
  • Game AI: Multiplayer games like Dota 2 and StarCraft
  • Communication networks: Multi-node resource allocation coordination
  • Social simulation: Simulating multi-agent social behaviors

1.2 Limitations of Single-Agent Methods

Treating other agents as part of the environment and directly applying single-agent RL runs into several problems:

  • The environment becomes non-stationary (other agents are also learning and changing strategies)
  • Curse of dimensionality: Joint action space grows exponentially
  • Credit assignment: Difficult to determine individual contributions under team rewards
  • Cannot model communication and coordination

2. Problem Formulation

2.1 Markov Game (Stochastic Game)

The most common mathematical framework for MARL is the Markov Game, defined as:

\[\mathcal{G} = (N, \mathcal{S}, \{\mathcal{A}_i\}_{i=1}^N, P, \{R_i\}_{i=1}^N, \gamma)\]
  • \(N\): Number of agents
  • \(\mathcal{S}\): State space (global state)
  • \(\mathcal{A}_i\): Action space of agent \(i\)
  • \(P(s'|s, a_1, \dots, a_N)\): Joint state transition
  • \(R_i(s, a_1, \dots, a_N)\): Reward function of agent \(i\)
  • \(\gamma\): Discount factor

Joint action space: \(\mathcal{A} = \mathcal{A}_1 \times \mathcal{A}_2 \times \dots \times \mathcal{A}_N\)
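
A tiny numerical sketch can make the tuple concrete. The snippet below builds a random two-agent Markov game with the components above; all sizes and the `step` helper are illustrative choices, not part of any standard library.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 2                 # number of agents
n_states = 3          # |S|
n_actions = (2, 2)    # |A_i| for each agent
gamma = 0.95          # discount factor

# Joint transition kernel: P[s, a1, a2] is a distribution over next states.
P = rng.random((n_states, *n_actions, n_states))
P /= P.sum(axis=-1, keepdims=True)

# Per-agent rewards: R[i, s, a1, a2] is agent i's reward.
R = rng.random((N, n_states, *n_actions))

def step(s, joint_action):
    """Sample s' ~ P(.|s, a_1, ..., a_N) and return all agents' rewards."""
    s_next = rng.choice(n_states, p=P[(s, *joint_action)])
    rewards = R[(slice(None), s, *joint_action)]
    return s_next, rewards

s_next, rewards = step(0, (1, 0))
print(s_next, rewards)  # next state plus one reward per agent
```

Note that both the transition and every agent's reward depend on the joint action, which is exactly what makes naive single-agent reasoning break down.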

2.2 Dec-POMDP: Decentralized Partially Observable Markov Decision Process

A more realistic formulation where each agent has only local observations:

\[\mathcal{M} = (N, \mathcal{S}, \{\mathcal{A}_i\}, \{\mathcal{O}_i\}, P, O, \{R_i\}, \gamma)\]

Additional components:

  • \(\mathcal{O}_i\): Observation space of agent \(i\)
  • \(O(o_i|s, i)\): Observation function

Key differences:

  • Each agent makes decisions based on its local observation \(o_i\) rather than the global state \(s\)
  • Solving a Dec-POMDP optimally is NEXP-hard
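
Under decentralized execution, each agent maps its own observation history to an action, never the global state. Below is a minimal illustrative sketch of that interface; the hash-based policy is a placeholder (a real agent would run an RNN or transformer over the history).

```python
from collections import deque

class DecentralizedAgent:
    """Acts on a truncated local observation history o_i, never on s."""

    def __init__(self, n_actions, history_len=4):
        self.n_actions = n_actions
        self.history = deque(maxlen=history_len)

    def act(self, obs):
        self.history.append(obs)
        # Placeholder policy over the history; deterministic but arbitrary.
        return hash(tuple(self.history)) % self.n_actions

agent = DecentralizedAgent(n_actions=4)
print([agent.act(o) for o in ["o1", "o2", "o3"]])
```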

2.3 Special Cases

| Model | Reward Structure | Information Structure | Example |
|---|---|---|---|
| Cooperative game | \(R_1 = R_2 = \dots = R_N\) | Partially observable | Multi-robot cooperation |
| Zero-sum game | \(R_1 = -R_2\) (two-player) | Full/partial observability | Go, poker |
| General-sum game | Independent rewards | Partially observable | Traffic, economics |
| Mean-field game | Depends on mean behavior | Local observation | Large-scale populations |

3. Core Challenges of MARL

3.1 Non-Stationarity

Problem: From any agent's perspective, the environment includes other learning agents, so the environment dynamics are constantly changing.

\[P_i(s'|s, a_i) = \sum_{a_{-i}} P(s'|s, a_i, a_{-i}) \prod_{j \neq i} \pi_j(a_j|o_j)\]

As \(\pi_j\) updates, \(P_i\) also changes, breaking the MDP stationarity assumption.
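
The effect is easy to see numerically. In the toy sketch below, the joint kernel \(P\) is fixed, yet agent 1's effective kernel \(P_1\) shifts as soon as agent 2's policy changes; all numbers are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
P = rng.random((2, 2, 2, 2))          # P[s, a1, a2, s']: 2 states, 2 actions each
P /= P.sum(axis=-1, keepdims=True)

def effective_kernel(pi2):
    """P_1(s'|s, a_1) = sum_{a_2} P(s'|s, a_1, a_2) * pi2(a_2|s)."""
    return np.einsum('sabt,sb->sat', P, pi2)

pi2_before = np.array([[0.9, 0.1], [0.9, 0.1]])  # pi2(a_2|s) before learning
pi2_after  = np.array([[0.1, 0.9], [0.1, 0.9]])  # after agent 2 updates

drift = np.abs(effective_kernel(pi2_before) - effective_kernel(pi2_after)).max()
print(f"max shift in P_1: {drift:.3f}")  # > 0: agent 1's "MDP" has moved
```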

Countermeasures:

  • Centralized training: Leverage global information during training
  • Opponent modeling: Explicitly model other agents' policies
  • Experience replay correction: Importance sampling to correct stale experiences

3.2 Credit Assignment

Problem: In cooperative tasks, the team receives a shared reward \(R_{team}\). How should each agent's individual contribution be determined?

\[R_{team} = R(s, a_1, a_2, \dots, a_N)\]

What is agent \(i\)'s contribution to the team reward?

Countermeasures:

  • Difference rewards: \(R_i = R_{team}(a_i, a_{-i}) - R_{team}(a_{-i})\), i.e., the team reward minus the counterfactual reward with agent \(i\)'s action removed (or replaced by a default)
  • Value decomposition: VDN, QMIX decompose team Q-values into individual contributions
  • Shapley values: Game-theoretic fair allocation methods
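
As a concrete instance of the Shapley-value idea in the last bullet, the sketch below computes exact Shapley credits for a made-up 3-agent coalition reward. The value table is hypothetical, and exact enumeration like this only scales to a handful of agents.

```python
from itertools import permutations
from math import factorial

agents = (0, 1, 2)

def R_team(coalition):
    """Hypothetical team reward earned by a subset of agents."""
    values = {(): 0, (0,): 1, (1,): 1, (2,): 0,
              (0, 1): 4, (0, 2): 2, (1, 2): 2, (0, 1, 2): 6}
    return values[tuple(sorted(coalition))]

def shapley(i):
    """Agent i's average marginal contribution over all join orders."""
    total = 0.0
    for order in permutations(agents):
        k = order.index(i)
        total += R_team(order[:k] + (i,)) - R_team(order[:k])
    return total / factorial(len(agents))

print([shapley(i) for i in agents])  # [2.5, 2.5, 1.0] -- sums to R_team(all) = 6
```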

3.3 Partial Observability

Problem: Agents can only observe local information about the environment.

  • Cannot access other agents' states/intentions
  • Need to infer hidden information from observation history
  • Communication can mitigate but not fully resolve this

3.4 Scalability

Problem: Joint action space grows exponentially with the number of agents.

\[|\mathcal{A}| = \prod_{i=1}^N |\mathcal{A}_i|\]

With \(N=10\), \(|\mathcal{A}_i|=5\): \(|\mathcal{A}| = 5^{10} \approx 10^7\)

Countermeasures:

  • Parameter sharing: All agents share network parameters
  • Mean-field approximation: Replace individual interactions with population mean behavior
  • Attention mechanisms: Dynamically select subsets of agents to attend to
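
As a minimal sketch of the first countermeasure above (parameter sharing): all \(N\) agents evaluate one shared policy, with a one-hot agent ID appended to the observation so behavior can still differ per agent. The shapes and the linear "network" are illustrative stand-ins for a real model.

```python
import numpy as np

N, obs_dim, n_actions = 10, 8, 5
rng = np.random.default_rng(2)
W = rng.standard_normal((obs_dim + N, n_actions)) * 0.1  # one shared weight matrix

def act(agent_id, obs):
    """All N agents reuse the same W; only the ID one-hot differs."""
    one_hot = np.eye(N)[agent_id]
    logits = np.concatenate([obs, one_hot]) @ W
    return int(np.argmax(logits))

actions = [act(i, rng.standard_normal(obs_dim)) for i in range(N)]
print(actions)  # N actions from a single O(obs_dim * n_actions) parameter set
```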

3.5 Coordinated Exploration

Problem: Multiple agents need coordinated exploration; independent exploration may never discover cooperative strategies.

  • Joint exploration space is enormous
  • Good cooperative strategies may require multiple agents to change behavior simultaneously
  • Local optima traps are more severe

4. MARL Training Paradigms

4.1 Paradigm Taxonomy

```mermaid
graph TD
    MARL[MARL Training Paradigms] --> IL[Independent Learners]
    MARL --> CTDE[Centralized Training<br>Decentralized Execution<br>CTDE]
    MARL --> FC[Fully Centralized]

    IL --> IQL[Independent Q-Learning]
    IL --> IPPO[Independent PPO]

    CTDE --> VD[Value Decomposition]
    CTDE --> CC[Centralized Critic]
    CTDE --> COMM[Communication Learning]

    VD --> VDN[VDN]
    VD --> QMIX[QMIX]

    CC --> MADDPG[MADDPG]
    CC --> MAPPO[MAPPO]

    COMM --> CommNet[CommNet]
    COMM --> TarMAC[TarMAC]

    FC --> CQL_M[Joint Q-Learning]

    style MARL fill:#e1f5fe
    style CTDE fill:#e8f5e9
```

4.2 Independent Learners

Idea: Each agent independently runs a single-agent RL algorithm.

Training: Decentralized, each agent uses only its own observations and rewards

Execution: Decentralized

Advantages:

  • Simple implementation
  • Good scalability
  • No communication needed

Disadvantages:

  • Ignores other agents
  • Non-stationary environment causes training instability
  • Difficult to learn cooperative strategies

Representative algorithms: IQL (Independent Q-Learning), IPPO (Independent PPO)
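
A minimal tabular IQL sketch, assuming each agent receives its own \((o, a, r, o')\) tuples from some multi-agent environment loop (not shown): the update is plain single-agent Q-learning, with the other agents folded into the environment.

```python
import numpy as np
from collections import defaultdict

class IQLAgent:
    """One independent tabular Q-learner; instantiate one per agent."""

    def __init__(self, n_actions, lr=0.1, gamma=0.99, eps=0.1):
        self.Q = defaultdict(lambda: np.zeros(n_actions))
        self.n_actions, self.lr, self.gamma, self.eps = n_actions, lr, gamma, eps

    def act(self, obs):
        if np.random.rand() < self.eps:          # epsilon-greedy exploration
            return np.random.randint(self.n_actions)
        return int(np.argmax(self.Q[obs]))

    def update(self, obs, action, reward, next_obs):
        # Standard single-agent TD target; other agents are just "environment",
        # which is precisely where the non-stationarity problem comes from.
        target = reward + self.gamma * self.Q[next_obs].max()
        self.Q[obs][action] += self.lr * (target - self.Q[obs][action])
```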

4.3 Centralized Training Decentralized Execution (CTDE)

Idea: Leverage global information during training; use only local observations during execution.

Training: Centralized, with access to global state and all agents' observations and actions

Execution: Decentralized, each agent uses only its own local observations

Advantages:

  • Can leverage additional information during training to improve learning efficiency
  • No communication needed during execution, suitable for real-world deployment
  • Currently the most mainstream paradigm

Disadvantages:

  • Requires centralized infrastructure during training
  • Information asymmetry between training and execution may cause issues

Representative algorithms: QMIX, MADDPG, MAPPO
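
To make the value-decomposition branch concrete, here is a pure-numpy sketch of the VDN additivity assumption (QMIX generalizes it with a monotonic mixing network); the Q-values are made up.

```python
import numpy as np

def vdn_q_tot(per_agent_qs, actions):
    """Q_tot(s, a) = sum_i Q_i(o_i, a_i) -- the VDN additivity assumption."""
    return sum(q[a] for q, a in zip(per_agent_qs, actions))

# Two agents, 3 actions each: each agent acts greedily and independently...
qs = [np.array([1.0, 2.0, 0.5]), np.array([0.2, 0.1, 3.0])]
greedy = [int(np.argmax(q)) for q in qs]
# ...yet argmax_a Q_tot factorizes into the per-agent argmaxes, which is why
# decentralized greedy execution stays consistent with the centralized value.
print(greedy, vdn_q_tot(qs, greedy))  # [1, 2] 5.0
```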

CTDE Is the Current Mainstream

CTDE achieves the best balance between practicality and performance, and is the most prevalent paradigm in current MARL research and applications.

4.4 Fully Centralized

Idea: Treat the multi-agent problem as a single super-agent's decision problem.

Training/Execution: Both centralized

Advantages:

  • Can theoretically find the global optimum
  • Full coordination

Disadvantages:

  • Joint action space explodes exponentially
  • Requires global communication
  • Not suitable for large-scale problems
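
A sketch of the super-agent view, with hypothetical sizes: one tabular Q-function indexed by the joint action. The table is already 125 columns wide for 3 agents with 5 actions each, and would be \(5^{10}\) for the earlier 10-agent example.

```python
import numpy as np
from itertools import product

n_states, n_agents, n_actions = 4, 3, 5
joint_actions = list(product(range(n_actions), repeat=n_agents))
Q = np.zeros((n_states, len(joint_actions)))  # width = n_actions ** n_agents

def greedy_joint_action(s):
    """The super-agent selects one fully coordinated joint action."""
    return joint_actions[int(np.argmax(Q[s]))]

print(len(joint_actions), greedy_joint_action(0))  # 125 (0, 0, 0)
```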


5. Cooperative vs Competitive vs Mixed

5.1 Cooperative Tasks

All agents share the same objective:

\[\max_{\pi_1, \dots, \pi_N} J = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_{team}(s_t, \mathbf{a}_t)\right]\]

Challenges: Credit assignment, coordinated exploration

Applications: Multi-robot cooperation, formation control, cooperative search

5.2 Competitive Tasks

Agents are adversarial:

  • Zero-sum game: \(R_1 + R_2 = 0\)
  • Solution concept: Nash equilibrium
\[\pi_i^* = \arg\max_{\pi_i} J_i(\pi_i, \pi_{-i}^*), \quad \forall i\]

Challenges: Non-transitivity (A > B > C > A), equilibrium computation

Applications: Go, poker, security games
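
For intuition on equilibrium computation, the sketch below runs fictitious play on rock-paper-scissors: each player best-responds to the opponent's empirical action frequencies, and in two-player zero-sum games those frequencies converge to a Nash equilibrium (here, uniform play).

```python
import numpy as np

# Player 1's payoff matrix for rock-paper-scissors (rows: P1, cols: P2).
A = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]])

counts = [np.ones(3), np.ones(3)]  # smoothed empirical action counts
for _ in range(20000):
    p2_freq = counts[1] / counts[1].sum()
    a1 = int(np.argmax(A @ p2_freq))       # P1 best-responds to P2's history
    p1_freq = counts[0] / counts[0].sum()
    a2 = int(np.argmax(-A.T @ p1_freq))    # zero-sum: P2's payoffs are -A^T
    counts[0][a1] += 1
    counts[1][a2] += 1

print(np.round(counts[0] / counts[0].sum(), 2))  # approx [0.33 0.33 0.33]
```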

5.3 Mixed Tasks

Both cooperation and competition:

  • Team competition: Intra-team cooperation, inter-team competition (e.g., Dota 2)
  • Social dilemmas: Individual vs collective rationality conflict (e.g., prisoner's dilemma)
  • Mechanism design: Designing incentives to promote cooperation

Applications: Multiplayer games, traffic systems, economic simulation


6. Evaluation and Benchmarks

6.1 Common Environments

| Environment | Type | Agents | Features |
|---|---|---|---|
| MPE | Cooperative/competitive | 2-10 | Simple continuous environment, classic benchmark |
| SMAC | Cooperative | 2-27 | StarCraft micromanagement |
| Google Football | Cooperative/competitive | 2-22 | Football simulation |
| Hanabi | Cooperative | 2-5 | Imperfect-information card game |
| Overcooked | Cooperative | 2 | Human-AI cooking cooperation |
| MAgent | Large-scale | 100+ | Large-scale confrontation |
| MetaDrive | Mixed | Multiple vehicles | Autonomous driving |

6.2 Evaluation Metrics

  • Team return: Core metric for cooperative tasks
  • Win rate: Comparison metric in competitive tasks
  • Social welfare: Sum of all agents' returns
  • Fairness: Uniformity of return distribution
  • Scalability: Performance change with increasing agent count
  • Communication overhead: Communication volume and bandwidth requirements
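
The first few metrics are straightforward to compute from per-agent episode returns. A small sketch with hypothetical numbers, using Jain's index as one common (but not the only) fairness measure:

```python
import numpy as np

returns = np.array([12.0, 9.5, 11.2, 3.1])  # hypothetical per-agent returns

social_welfare = returns.sum()
# Jain's fairness index: 1 when all returns are equal, 1/N when one agent
# receives everything.
jain_fairness = returns.sum() ** 2 / (len(returns) * (returns ** 2).sum())

print(f"welfare={social_welfare:.1f}, fairness={jain_fairness:.2f}")
```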

7. MARL and LLM Multi-Agent Systems

7.1 Emerging Directions

In the era of large language models, multi-agent systems are seeing new research directions:

  • LLM multi-agent collaboration: Multiple LLM roles collaborating to solve problems (AutoGen, CrewAI)
  • Debate and negotiation: Multiple LLMs improving reasoning through debate
  • Social simulation: Using LLM agents to simulate social behavior (Generative Agents)
  • RL-trained multi-agent LLMs: Using MARL methods to train LLM interactions

7.2 Classical MARL vs LLM Multi-Agent

| Dimension | Classical MARL | LLM Multi-Agent |
|---|---|---|
| Agents | Trained from scratch | Pre-trained large models |
| Communication | Learned vectors | Natural language |
| Policy space | Continuous/discrete actions | Text generation |
| Training method | Gradient optimization | Prompt engineering/fine-tuning |
| Interpretability | Low | High (natural language) |

8. Summary and Outlook

8.1 Current State

  • CTDE paradigm is mature, performing well across multiple benchmarks
  • MAPPO is surprisingly powerful in cooperative tasks
  • Value decomposition methods work well in discrete action spaces
  • Large-scale MARL (100+ agents) remains challenging

8.2 Future Directions

  1. Large-scale MARL: Handling hundreds or thousands of agents
  2. Heterogeneous agents: Cooperation among different types of agents
  3. Transfer and generalization: Generalizing across tasks/agent counts
  4. Safe multi-agent systems: Guaranteeing safety of multi-agent systems
  5. Human-AI hybrid teams: Collaboration between humans and AI agents
  6. LLM + MARL: Integration of large language models with classical MARL
