
Multi-Agent Reinforcement Learning Algorithms

Overview

This article provides detailed coverage of core MARL algorithm families, including value decomposition methods (VDN, QMIX), multi-agent policy gradient (MAPPO, MADDPG), communication mechanisms (CommNet, TarMAC), and frontier directions such as emergent communication.


1. Value Decomposition Methods

1.1 Core Idea

In cooperative MARL, learning the joint action-value function \(Q_{tot}(s, \mathbf{a})\) faces the curse of dimensionality. Value decomposition factorizes the team Q-value into some combination of individual Q-values:

\[Q_{tot}(s, a_1, \dots, a_N) = f(Q_1(o_1, a_1), Q_2(o_2, a_2), \dots, Q_N(o_N, a_N))\]

Key constraint: Individual-Global-Max (IGM):

\[\arg\max_{\mathbf{a}} Q_{tot}(s, \mathbf{a}) = \begin{pmatrix} \arg\max_{a_1} Q_1(o_1, a_1) \\ \vdots \\ \arg\max_{a_N} Q_N(o_N, a_N) \end{pmatrix}\]

That is, if each agent independently chooses the action that maximizes its own Q-value, the resulting joint action also maximizes the team Q-value.

1.2 VDN: Value Decomposition Network

Method: Simple additive decomposition

\[Q_{tot}(s, \mathbf{a}) = \sum_{i=1}^N Q_i(o_i, a_i)\]

IGM satisfaction: Addition automatically satisfies IGM because:

\[\arg\max_{\mathbf{a}} \sum_i Q_i = \left(\arg\max_{a_1} Q_1, \dots, \arg\max_{a_N} Q_N\right)\]

Advantages:

  • Simple and straightforward
  • Naturally satisfies IGM
  • Fully decentralized execution

Disadvantages:

  • Limited representational capacity (can represent only additively decomposable value functions)
  • Cannot model interaction effects between agents
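
To make the additive decomposition concrete, here is a minimal PyTorch sketch of per-agent Q-networks and the joint TD loss on the shared team reward. The class and function names (`AgentQNet`, `vdn_td_loss`) and the tensor layout are illustrative assumptions, not code from the original paper.

```python
import torch
import torch.nn as nn

class AgentQNet(nn.Module):
    """Per-agent Q-network over local observations (illustrative)."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):              # obs: [batch, obs_dim]
        return self.net(obs)             # Q-values: [batch, n_actions]

def vdn_td_loss(agent_nets, target_nets, obs, actions, reward, next_obs, done, gamma=0.99):
    """Q_tot = sum_i Q_i; one-step TD loss against the shared team reward."""
    q_tot, next_q_tot = 0.0, 0.0
    for i, (net, tgt) in enumerate(zip(agent_nets, target_nets)):
        q_i = net(obs[i]).gather(1, actions[i].unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            next_q_i = tgt(next_obs[i]).max(dim=1).values  # per-agent greedy action (IGM)
        q_tot = q_tot + q_i
        next_q_tot = next_q_tot + next_q_i
    target = reward + gamma * (1 - done) * next_q_tot
    return nn.functional.mse_loss(q_tot, target)
```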

1.3 QMIX: Monotonic Mixing Network

Method: Combine individual Q-values with a mixing network, ensuring monotonicity:

\[Q_{tot} = f_{mix}(Q_1, Q_2, \dots, Q_N; s)\]

Monotonicity constraint:

\[\frac{\partial Q_{tot}}{\partial Q_i} \geq 0, \quad \forall i\]

Implementation: Mixing network weights are generated from the global state \(s\) via hypernetworks, with non-negative weights:

Hypernetworks:
  s → HyperNet_w1 → |abs| → W1 (mixing network layer 1 weights)
  s → HyperNet_b1 → B1 (mixing network layer 1 biases)
  s → HyperNet_w2 → |abs| → W2 (mixing network layer 2 weights)
  s → HyperNet_b2 → B2 (mixing network layer 2 biases)

Mixing Network:
  [Q1, Q2, ..., QN] → Linear(W1, B1) → ELU → Linear(W2, B2) → Q_tot

Key design choices:

  • Taking the absolute value of the weights ensures non-negativity → monotonicity
  • Biases are unconstrained → increased representational capacity
  • Hypernetworks are conditioned on the global state → leverage additional information during training

IGM satisfaction: Since \(\partial Q_{tot} / \partial Q_i \geq 0\), increasing any individual \(Q_i\) can never decrease \(Q_{tot}\). The joint action in which each agent takes its own \(\arg\max_{a_i} Q_i\) therefore also maximizes \(Q_{tot}\), so IGM holds. (Monotonicity is sufficient for IGM but not necessary.)

QMIX Architecture Diagram

graph TB
    subgraph Agents["Individual Agent Q-Networks"]
        O1[o1] --> Q1["Q1(o1, a1)"]
        O2[o2] --> Q2["Q2(o2, a2)"]
        ON[oN] --> QN["QN(oN, aN)"]
    end

    subgraph Mixing["Mixing Network"]
        Q1 --> MIX[Mixing Network]
        Q2 --> MIX
        QN --> MIX
        MIX --> QTOT["Q_tot"]
    end

    subgraph Hyper["Hypernetworks"]
        S[Global State s] --> HN[HyperNetworks]
        HN -->|weights/biases| MIX
    end

    style Agents fill:#e3f2fd
    style Mixing fill:#fff3e0
    style Hyper fill:#e8f5e9

1.4 Value Decomposition Methods Comparison

| Method | Decomposition | Representational Power | Extra Info | Year |
|--------|---------------|------------------------|------------|------|
| VDN | Additive | Weak | None | 2018 |
| QMIX | Monotonic mixing | Medium | Global state | 2018 |
| QTRAN | Linear constraints | Strong | Global state | 2019 |
| QPLEX | Duplex dueling | Strong | Global state | 2021 |
| WQMIX | Weighted QMIX | Medium+ | Global state | 2020 |

QMIX's Limitation

QMIX's monotonicity constraint means it cannot represent all possible joint Q-functions. For example, non-monotonic coordination tasks (such as certain matrix games) may cause QMIX to fail. QTRAN and QPLEX attempt to address this but at higher computational cost.
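
As a concrete illustration, consider the two-agent, one-step matrix game used in the QTRAN paper (payoffs reproduced here for reference), where each agent has three actions:

\[
\begin{array}{c|ccc}
 & A & B & C \\ \hline
A & 8 & -12 & -12 \\
B & -12 & 0 & 0 \\
C & -12 & 0 & 0
\end{array}
\]

The optimal joint action is \((A, A)\) with payoff 8, but the \(-12\) off-diagonal penalties drag down each agent's individual Q-value for \(A\). No monotonic mixing of individual Q-values can represent this payoff matrix exactly, so QMIX settles on a suboptimal joint action in the lower-right block.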


2. Multi-Agent Policy Gradient

2.1 MADDPG: Multi-Agent DDPG

Core idea: Multi-agent Actor-Critic under the CTDE framework. Each agent has an independent Actor, but the Critic uses information from all agents.

Architecture:

  • Actor \(\mu_{\theta_i}(o_i)\): uses only local observations, decentralized execution
  • Critic \(Q_{\phi_i}(s, a_1, \dots, a_N)\): uses global information, centralized training

Critic update:

\[\mathcal{L}(\phi_i) = \mathbb{E}\left[\left(Q_{\phi_i}(s, a_1, \dots, a_N) - y_i\right)^2\right]\]
\[y_i = r_i + \gamma Q_{\phi_i'}(s', a_1', \dots, a_N') \big|_{a_j' = \mu_{\theta_j'}(o_j')}\]

Actor update:

\[\nabla_{\theta_i} J = \mathbb{E}\left[\nabla_{a_i} Q_{\phi_i}(s, a_1, \dots, a_N) \big|_{a_i = \mu_{\theta_i}(o_i)} \nabla_{\theta_i} \mu_{\theta_i}(o_i)\right]\]

Features:

  • Applicable to cooperative, competitive, and mixed scenarios
  • Handles continuous action spaces
  • Requires all agents' observations and actions during training
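
A compressed PyTorch sketch of one MADDPG gradient step for agent \(i\) follows, matching the critic and actor updates above. The replay buffer, target-network soft updates, and the exact network interfaces (`critics[i](state, joint_action)`, actors returning actions directly) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def maddpg_update(i, actors, critics, target_actors, target_critics, batch, gamma=0.99):
    """One update for agent i: centralized critic, decentralized deterministic actor."""
    obs, actions, rewards, next_obs, state, next_state, done = batch

    # Critic: TD target built from ALL agents' target-actor actions.
    with torch.no_grad():
        next_actions = torch.cat([ta(next_obs[j]) for j, ta in enumerate(target_actors)], dim=-1)
        y = rewards[i] + gamma * (1 - done) * target_critics[i](next_state, next_actions)
    q = critics[i](state, torch.cat(actions, dim=-1))
    critic_loss = F.mse_loss(q, y)

    # Actor: deterministic policy gradient through the centralized critic;
    # other agents' actions are detached so only agent i's actor receives gradients.
    curr_actions = [a(obs[j]) if j == i else a(obs[j]).detach() for j, a in enumerate(actors)]
    actor_loss = -critics[i](state, torch.cat(curr_actions, dim=-1)).mean()
    return critic_loss, actor_loss
```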

2.2 MAPPO: Multi-Agent PPO

Core idea: Extend PPO to multi-agent settings; the approach is surprisingly simple yet remarkably effective.

Architecture:

  • Actor \(\pi_{\theta_i}(a_i|o_i)\): local observation, independent policy
  • Critic \(V_{\phi}(s)\): global state, shared or independent

Policy objective (per agent):

\[L_i^{CLIP} = \mathbb{E}\left[\min\left(r_t^i \hat{A}_t^i, \text{clip}(r_t^i, 1-\epsilon, 1+\epsilon) \hat{A}_t^i\right)\right]\]

Key design choices:

| Design Choice | Options | MAPPO Recommendation |
|---------------|---------|----------------------|
| Parameter sharing | Shared / Independent | Shared (with agent ID) |
| Critic input | Local / Global | Global state |
| Value normalization | PopArt / Standard | PopArt |
| Data usage | Single / Multiple passes | Multiple (5-15 epochs) |

Why is MAPPO so effective?

  • PPO's clipping mechanism naturally prevents excessively large updates, offering more stability in non-stationary environments
  • Parameter sharing significantly improves sample efficiency
  • Simple architecture is easy to tune and reproduce
  • Matches or exceeds QMIX on benchmarks like SMAC
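
A minimal sketch of the per-agent clipped surrogate loss with parameter sharing via a one-hot agent ID. It assumes the shared actor returns a `torch.distributions.Categorical`; advantage estimation and the centralized value loss are computed elsewhere, and all names are illustrative.

```python
import torch

def mappo_policy_loss(shared_actor, obs, agent_ids, actions, old_log_probs,
                      advantages, clip_eps=0.2):
    """PPO clipped surrogate for a batch that mixes all agents' transitions.
    obs: [B, obs_dim], agent_ids: [B, n_agents] one-hot, actions: [B]."""
    # Parameter sharing: one network, with agent identity appended to the observation.
    dist = shared_actor(torch.cat([obs, agent_ids], dim=-1))   # -> Categorical
    log_probs = dist.log_prob(actions)
    ratio = torch.exp(log_probs - old_log_probs)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(surr1, surr2).mean()
```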

2.3 MADDPG vs MAPPO Comparison

| Feature | MADDPG | MAPPO |
|---------|--------|-------|
| Policy type | Deterministic | Stochastic |
| Action space | Continuous | Discrete / Continuous |
| Data usage | Off-policy (replay buffer) | On-policy (no replay) |
| Critic | Independent per agent | Can be shared |
| Applicable scenarios | Competitive / Mixed | Primarily cooperative |
| Implementation complexity | Medium | Low |
| Performance | Good | Usually better (cooperative) |

3. Communication Mechanisms

3.1 Why Communication?

In partially observable environments, agents' local observations are insufficient for optimal decision-making. Communication allows agents to:

  • Share local information
  • Coordinate actions
  • Convey intentions
  • Request assistance

3.2 CommNet: Communication Neural Network

Core idea: Agents exchange information through continuous communication vectors, integrated into the forward pass.

Architecture:

Each agent \(i\) at each communication round \(k\):

\[h_i^{k+1} = \sigma\left(W^k h_i^k + C^k \bar{c}_i^k\right)\]

where the communication message is the mean of all other agents' hidden states:

\[\bar{c}_i^k = \frac{1}{N-1} \sum_{j \neq i} h_j^k\]

Features:

  • End-to-end differentiable, trainable via backpropagation
  • Predefined communication structure (fully connected)
  • Multi-round communication progressively refines information
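
A sketch of one CommNet communication round for a team of N agents, assuming PyTorch; the observation encoder, output head, and multi-round stacking are omitted, and the class name is illustrative.

```python
import torch
import torch.nn as nn

class CommNetRound(nn.Module):
    """One communication step: each agent receives the mean of the OTHER agents' hidden states."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.W = nn.Linear(hidden_dim, hidden_dim, bias=False)  # self transform W^k
        self.C = nn.Linear(hidden_dim, hidden_dim, bias=False)  # communication transform C^k

    def forward(self, h):                      # h: [n_agents, hidden_dim]
        n = h.size(0)
        # Mean of all other agents' hidden states: (sum - own) / (n - 1)
        c = (h.sum(dim=0, keepdim=True) - h) / max(n - 1, 1)
        return torch.tanh(self.W(h) + self.C(c))
```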

3.3 TarMAC: Targeted Multi-Agent Communication

Core improvement: Uses an attention mechanism for selective communication, so each agent can decide whose messages to attend to.

Attention-based communication:

  1. Each sender \(i\) generates:

    • Value \(v_i\): the message content to send
    • Key \(k_i\): a "tag" (signature) describing the message
  2. Each receiver \(j\) generates a query \(q_j\) from its hidden state and computes attention weights over senders:

\[\alpha_{j \leftarrow i} = \text{softmax}_i\left(\frac{q_j \cdot k_i}{\sqrt{d}}\right)\]
  3. The received message is the attention-weighted aggregation of values:
\[c_j = \sum_{i \neq j} \alpha_{j \leftarrow i} \cdot v_i\]

Advantages:

  • Dynamic selection of communication targets
  • Interpretable (attention weights can be visualized)
  • Communication protocols are learned automatically
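
The aggregation step can be sketched in a few lines of PyTorch; the exclusion of an agent's own message follows the description above, while the class name, key/value dimensions, and layer shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TarMACAttention(nn.Module):
    """Soft-attention message aggregation: each receiver attends over all senders' keys."""
    def __init__(self, hidden_dim, key_dim=16, value_dim=32):
        super().__init__()
        self.key = nn.Linear(hidden_dim, key_dim)
        self.query = nn.Linear(hidden_dim, key_dim)
        self.value = nn.Linear(hidden_dim, value_dim)
        self.scale = key_dim ** 0.5

    def forward(self, h):                                   # h: [n_agents, hidden_dim]
        k, q, v = self.key(h), self.query(h), self.value(h)
        scores = q @ k.t() / self.scale                     # [receiver, sender]
        mask = torch.eye(h.size(0), dtype=torch.bool)       # exclude own message
        alpha = torch.softmax(scores.masked_fill(mask, float('-inf')), dim=-1)
        return alpha @ v                                    # aggregated message c_j per receiver
```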

3.4 Communication Methods Comparison

| Method | Communication Structure | Content | Selective | Year |
|--------|-------------------------|---------|-----------|------|
| CommNet | Fully connected, mean pooling | Continuous vector | No | 2016 |
| IC3Net | Gated communication | Continuous vector | Yes (binary gate) | 2019 |
| TarMAC | Attention-weighted | Continuous vector | Yes (soft attention) | 2019 |
| DIAL | Direct gradient passing | Discrete / Continuous | No | 2016 |
| ATOC | Dynamic grouping | Continuous vector | Yes (grouping) | 2018 |

4. Emergent Communication

4.1 What Is Emergent Communication?

Rather than predefining a communication protocol, agents are left to learn a communication language spontaneously through RL:

  • Communication channel is part of the action space
  • Message content and semantics are entirely learned by agents
  • Can discover communication strategies humans never conceived

4.2 Key Findings

  • Language emergence: Agents can develop "languages" with compositionality
  • Referential games: Signal semantics emerge in Lewis signaling games
  • Language drift: After pre-training on human language, agents may drift away
  • Interpretability: Emergent languages are typically difficult for humans to decode

4.3 Challenges

  • Discrete messages block gradient flow through the channel → Gumbel-Softmax or REINFORCE (see the sketch after this list)
  • Communication protocol instability
  • Difficult to align with human language
  • Evaluating emergent language quality
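
As referenced in the first bullet above, here is a minimal sketch of keeping a discrete message channel differentiable with the straight-through Gumbel-Softmax, assuming PyTorch; the vocabulary size and class name are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiscreteMessageHead(nn.Module):
    """Emits a one-hot message token while keeping the channel differentiable."""
    def __init__(self, hidden_dim, vocab_size=16):
        super().__init__()
        self.logits = nn.Linear(hidden_dim, vocab_size)

    def forward(self, h, tau=1.0):
        # hard=True: one-hot sample in the forward pass,
        # straight-through (soft) gradient in the backward pass.
        return F.gumbel_softmax(self.logits(h), tau=tau, hard=True)
```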

5. Case Study: OpenAI Five

5.1 System Overview

OpenAI Five defeated the world champion team in Dota 2 (a 5v5 MOBA game), making it one of the most impactful applications of MARL to date.

5.2 Technical Approach

| Component | Choice |
|-----------|--------|
| Algorithm | Large-scale PPO (similar to IPPO) |
| Policy network | LSTM (one replica per hero) |
| Parameter sharing | Yes (all heroes share weights) |
| Communication | No explicit communication |
| Training compute | ~800 petaflop/s-days in total |
| Self-play | 80% current version + 20% historical versions |
| Reward | Team reward + individual reward shaping |

5.3 Key Insights

  1. Scale is everything: Massive compute compensates for algorithmic simplicity
  2. Parameter sharing works: All heroes share parameters + agent ID
  3. No explicit communication needed: Implicit coordination through shared training
  4. Long-term credit assignment: Mitigated through careful reward shaping
  5. PPO's robustness: Stable performance in large-scale non-stationary training

5.4 Takeaways

  • Simple algorithms + massive computation can solve extremely complex MARL problems
  • Parameter sharing is a key technique for efficiency
  • Reward engineering is crucial in MARL

6. Algorithm Summary and Selection

6.1 Algorithm Comparison Table

| Algorithm | Paradigm | Action Space | Scenario | Complexity | Performance |
|-----------|----------|--------------|----------|------------|-------------|
| IQL | Independent learning | Discrete | General | Low | Weak |
| VDN | CTDE, value decomposition | Discrete | Cooperative | Low | Medium |
| QMIX | CTDE, value decomposition | Discrete | Cooperative | Medium | Good |
| MADDPG | CTDE, policy gradient | Continuous | General | Medium | Good |
| MAPPO | CTDE, policy gradient | Discrete / Continuous | Cooperative | Low | Very good |
| CommNet | CTDE, communication | Discrete | Cooperative | Medium | Medium |
| TarMAC | CTDE, communication | Discrete | Cooperative | Medium | Good |

6.2 Selection Guide

Cooperative Tasks:
  ├── Discrete actions + value decomposition needed → QMIX
  ├── Discrete/Continuous + simple and efficient → MAPPO
  ├── Communication needed → TarMAC / CommNet
  └── Large-scale agents → MAPPO + parameter sharing

Competitive/Mixed Tasks:
  ├── Continuous actions → MADDPG
  ├── Discrete actions → MAPPO
  └── Diversity needed → League Training

When unsure:
  └── Try MAPPO first (simple, robust, usually good enough)

7. Practical Advice

7.1 Common Pitfalls

  1. Ignoring hyperparameter tuning: MARL is more sensitive to hyperparameters
  2. Insufficient evaluation: Need multi-seed, multi-opponent evaluation
  3. Poor reward design: Individual rewards conflicting with team objectives
  4. Ignoring scalability: An algorithm working at small scale doesn't guarantee large-scale success
  5. Communication overhead: Communication methods may introduce significant computational cost

7.2 Recommended Frameworks

| Framework | Features |
|-----------|----------|
| EPyMARL | SMAC benchmark, multiple algorithm implementations |
| MARLlib | Unified interface, 10+ environments, 20+ algorithms |
| PettingZoo | Standardized multi-agent environment interface |
| OpenSpiel | Game theory + MARL |
| Melting Pot | Social intelligence evaluation |
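
For reference, a minimal interaction loop with PettingZoo's parallel API, assuming a recent PettingZoo release with the MPE environments installed (`pip install pettingzoo[mpe]`); the random policy is only a placeholder, and the exact environment version may differ across releases.

```python
from pettingzoo.mpe import simple_spread_v3

# Parallel API: all agents act simultaneously at each step.
env = simple_spread_v3.parallel_env(N=3, max_cycles=25)
observations, infos = env.reset(seed=0)

while env.agents:
    # Placeholder policy: a random action for every live agent.
    actions = {agent: env.action_space(agent).sample() for agent in env.agents}
    observations, rewards, terminations, truncations, infos = env.step(actions)
env.close()
```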

References

  • Sunehag, P. et al. (2018). Value-Decomposition Networks for Cooperative Multi-Agent Learning. AAMAS.
  • Rashid, T. et al. (2018). QMIX: Monotonic Value Function Factorisation. ICML.
  • Lowe, R. et al. (2017). Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. NeurIPS.
  • Yu, C. et al. (2022). The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games. NeurIPS.
  • Sukhbaatar, S. et al. (2016). Learning Multiagent Communication with Backpropagation. NeurIPS.
  • Das, A. et al. (2019). TarMAC: Targeted Multi-Agent Communication. ICML.
  • Berner, C. et al. (2019). Dota 2 with Large Scale Deep Reinforcement Learning. arXiv preprint arXiv:1912.06680.
