
Multi-Agent Reinforcement Learning Algorithms

Overview

This article provides detailed coverage of core MARL algorithm families, including value decomposition methods (VDN, QMIX), multi-agent policy gradient (MAPPO, MADDPG), communication mechanisms (CommNet, TarMAC), and frontier directions such as emergent communication.


1. Value Decomposition Methods

1.1 Core Idea

In cooperative MARL, learning the joint action-value function \(Q_{tot}(s, \mathbf{a})\) faces the curse of dimensionality. Value decomposition factorizes the team Q-value into some combination of individual Q-values:

\[Q_{tot}(s, a_1, \dots, a_N) = f(Q_1(o_1, a_1), Q_2(o_2, a_2), \dots, Q_N(o_N, a_N))\]

Key constraint: Individual-Global-Max (IGM):

\[\arg\max_{\mathbf{a}} Q_{tot}(s, \mathbf{a}) = \begin{pmatrix} \arg\max_{a_1} Q_1(o_1, a_1) \\ \vdots \\ \arg\max_{a_N} Q_N(o_N, a_N) \end{pmatrix}\]

That is, if each agent independently chooses the action that maximizes its own Q-value, the resulting joint action also maximizes the team Q-value.

1.2 VDN: Value Decomposition Network

Method: Simple additive decomposition

\[Q_{tot}(s, \mathbf{a}) = \sum_{i=1}^N Q_i(o_i, a_i)\]

IGM satisfaction: Addition automatically satisfies IGM because:

\[\arg\max_{\mathbf{a}} \sum_i Q_i = \left(\arg\max_{a_1} Q_1, \dots, \arg\max_{a_N} Q_N\right)\]

Advantages:

  • Simple and straightforward
  • Naturally satisfies IGM
  • Fully decentralized execution

Disadvantages:

  • Limited representational capacity (can represent only additively decomposable value functions)
  • Cannot model interaction effects between agents
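
To make the additive decomposition concrete, here is a minimal PyTorch sketch of per-agent Q-networks and the joint TD loss on the shared team reward. The class and function names (`AgentQNet`, `vdn_td_loss`) and the tensor layout are illustrative assumptions, not code from the original paper.

```python
import torch
import torch.nn as nn

class AgentQNet(nn.Module):
    """Per-agent Q-network over local observations (illustrative)."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):              # obs: [batch, obs_dim]
        return self.net(obs)             # Q-values: [batch, n_actions]

def vdn_td_loss(agent_nets, target_nets, obs, actions, reward, next_obs, done, gamma=0.99):
    """Q_tot = sum_i Q_i; one-step TD loss against the shared team reward."""
    q_tot, next_q_tot = 0.0, 0.0
    for i, (net, tgt) in enumerate(zip(agent_nets, target_nets)):
        q_i = net(obs[i]).gather(1, actions[i].unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            next_q_i = tgt(next_obs[i]).max(dim=1).values  # per-agent greedy action (IGM)
        q_tot = q_tot + q_i
        next_q_tot = next_q_tot + next_q_i
    target = reward + gamma * (1 - done) * next_q_tot
    return nn.functional.mse_loss(q_tot, target)
```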

1.3 QMIX: Monotonic Mixing Network

Method: Combine individual Q-values with a mixing network, ensuring monotonicity:

\[Q_{tot} = f_{mix}(Q_1, Q_2, \dots, Q_N; s)\]

Monotonicity constraint:

\[\frac{\partial Q_{tot}}{\partial Q_i} \geq 0, \quad \forall i\]

Implementation: Mixing network weights are generated from the global state \(s\) via hypernetworks, with non-negative weights:

Hypernetworks:
  s → HyperNet_w1 → |abs| → W1 (mixing network layer 1 weights)
  s → HyperNet_b1 → B1 (mixing network layer 1 biases)
  s → HyperNet_w2 → |abs| → W2 (mixing network layer 2 weights)
  s → HyperNet_b2 → B2 (mixing network layer 2 biases)

Mixing Network:
  [Q1, Q2, ..., QN] → Linear(W1, B1) → ELU → Linear(W2, B2) → Q_tot

Key design choices:

  • Taking the absolute value of the weights ensures non-negativity → monotonicity
  • Biases are unconstrained → increased representational capacity
  • Hypernetworks are conditioned on the global state → leverage additional information during training

IGM satisfaction: Since \(\partial Q_{tot} / \partial Q_i \geq 0\), increasing any individual \(Q_i\) can never decrease \(Q_{tot}\). The joint action in which each agent takes its own \(\arg\max_{a_i} Q_i\) therefore also maximizes \(Q_{tot}\), so IGM holds. (Monotonicity is sufficient for IGM but not necessary.)

QMIX Architecture Diagram

graph TB
    subgraph Agents["Individual Agent Q-Networks"]
        O1[o1] --> Q1["Q1(o1, a1)"]
        O2[o2] --> Q2["Q2(o2, a2)"]
        ON[oN] --> QN["QN(oN, aN)"]
    end

    subgraph Mixing["Mixing Network"]
        Q1 --> MIX[Mixing Network]
        Q2 --> MIX
        QN --> MIX
        MIX --> QTOT["Q_tot"]
    end

    subgraph Hyper["Hypernetworks"]
        S[Global State s] --> HN[HyperNetworks]
        HN -->|weights/biases| MIX
    end

    style Agents fill:#e3f2fd
    style Mixing fill:#fff3e0
    style Hyper fill:#e8f5e9

1.4 Value Decomposition Methods Comparison

| Method | Decomposition | Representational Power | Extra Info | Year |
|--------|---------------|------------------------|------------|------|
| VDN | Additive | Weak | None | 2018 |
| QMIX | Monotonic mixing | Medium | Global state | 2018 |
| QTRAN | Linear constraints | Strong | Global state | 2019 |
| QPLEX | Duplex dueling | Strong | Global state | 2021 |
| WQMIX | Weighted QMIX | Medium+ | Global state | 2020 |

QMIX's Limitation

QMIX's monotonicity constraint means it cannot represent all possible joint Q-functions. For example, non-monotonic coordination tasks (such as certain matrix games) may cause QMIX to fail. QTRAN and QPLEX attempt to address this but at higher computational cost.
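
As a concrete illustration, consider the two-agent, one-step matrix game used in the QTRAN paper (payoffs reproduced here for reference), where each agent has three actions:

\[
\begin{array}{c|ccc}
 & A & B & C \\ \hline
A & 8 & -12 & -12 \\
B & -12 & 0 & 0 \\
C & -12 & 0 & 0
\end{array}
\]

The optimal joint action is \((A, A)\) with payoff 8, but the \(-12\) off-diagonal penalties drag down each agent's individual Q-value for \(A\). No monotonic mixing of individual Q-values can represent this payoff matrix exactly, so QMIX settles on a suboptimal joint action in the lower-right block.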


2. Multi-Agent Policy Gradient

2.1 MADDPG: Multi-Agent DDPG

Core idea: Multi-agent Actor-Critic under the CTDE framework. Each agent has an independent Actor, but the Critic uses information from all agents.

Architecture:

  • Actor \(\mu_{\theta_i}(o_i)\): uses only local observations, decentralized execution
  • Critic \(Q_{\phi_i}(s, a_1, \dots, a_N)\): uses global information, centralized training

Critic update:

\[\mathcal{L}(\phi_i) = \mathbb{E}\left[\left(Q_{\phi_i}(s, a_1, \dots, a_N) - y_i\right)^2\right]\]
\[y_i = r_i + \gamma Q_{\phi_i'}(s', a_1', \dots, a_N') \big|_{a_j' = \mu_{\theta_j'}(o_j')}\]

Actor update:

\[\nabla_{\theta_i} J = \mathbb{E}\left[\nabla_{a_i} Q_{\phi_i}(s, a_1, \dots, a_N) \big|_{a_i = \mu_{\theta_i}(o_i)} \nabla_{\theta_i} \mu_{\theta_i}(o_i)\right]\]

Features:

  • Applicable to cooperative, competitive, and mixed scenarios
  • Handles continuous action spaces
  • Requires all agents' observations and actions during training
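
A compressed PyTorch sketch of one MADDPG gradient step for agent \(i\) follows, matching the critic and actor updates above. The replay buffer, target-network soft updates, and the exact network interfaces (`critics[i](state, joint_action)`, actors returning actions directly) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def maddpg_update(i, actors, critics, target_actors, target_critics, batch, gamma=0.99):
    """One update for agent i: centralized critic, decentralized deterministic actor."""
    obs, actions, rewards, next_obs, state, next_state, done = batch

    # Critic: TD target built from ALL agents' target-actor actions.
    with torch.no_grad():
        next_actions = torch.cat([ta(next_obs[j]) for j, ta in enumerate(target_actors)], dim=-1)
        y = rewards[i] + gamma * (1 - done) * target_critics[i](next_state, next_actions)
    q = critics[i](state, torch.cat(actions, dim=-1))
    critic_loss = F.mse_loss(q, y)

    # Actor: deterministic policy gradient through the centralized critic;
    # other agents' actions are detached so only agent i's actor receives gradients.
    curr_actions = [a(obs[j]) if j == i else a(obs[j]).detach() for j, a in enumerate(actors)]
    actor_loss = -critics[i](state, torch.cat(curr_actions, dim=-1)).mean()
    return critic_loss, actor_loss
```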

2.2 MAPPO: Multi-Agent PPO

Core idea: Extend PPO to multi-agent settings; the approach is surprisingly simple yet remarkably effective.

Architecture:

  • Actor \(\pi_{\theta_i}(a_i|o_i)\): local observation, independent policy
  • Critic \(V_{\phi}(s)\): global state, shared or independent

Policy objective (per agent):

\[L_i^{CLIP} = \mathbb{E}\left[\min\left(r_t^i \hat{A}_t^i, \text{clip}(r_t^i, 1-\epsilon, 1+\epsilon) \hat{A}_t^i\right)\right]\]

Key design choices:

| Design Choice | Options | MAPPO Recommendation |
|---------------|---------|----------------------|
| Parameter sharing | Shared / Independent | Shared (with agent ID) |
| Critic input | Local / Global | Global state |
| Value normalization | PopArt / Standard | PopArt |
| Data usage | Single / Multiple passes | Multiple (5-15 epochs) |

Why is MAPPO so effective?

  • PPO's clipping mechanism naturally prevents excessively large updates, offering more stability in non-stationary environments
  • Parameter sharing significantly improves sample efficiency
  • Simple architecture is easy to tune and reproduce
  • Matches or exceeds QMIX on benchmarks like SMAC
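
A minimal sketch of the per-agent clipped surrogate loss with parameter sharing via a one-hot agent ID. It assumes the shared actor returns a `torch.distributions.Categorical`; advantage estimation and the centralized value loss are computed elsewhere, and all names are illustrative.

```python
import torch

def mappo_policy_loss(shared_actor, obs, agent_ids, actions, old_log_probs,
                      advantages, clip_eps=0.2):
    """PPO clipped surrogate for a batch that mixes all agents' transitions.
    obs: [B, obs_dim], agent_ids: [B, n_agents] one-hot, actions: [B]."""
    # Parameter sharing: one network, with agent identity appended to the observation.
    dist = shared_actor(torch.cat([obs, agent_ids], dim=-1))   # -> Categorical
    log_probs = dist.log_prob(actions)
    ratio = torch.exp(log_probs - old_log_probs)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(surr1, surr2).mean()
```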

2.3 MADDPG vs MAPPO Comparison

| Feature | MADDPG | MAPPO |
|---------|--------|-------|
| Policy type | Deterministic | Stochastic |
| Action space | Continuous | Discrete / Continuous |
| Data usage | Off-policy (replay buffer) | On-policy (no replay) |
| Critic | Independent per agent | Can be shared |
| Applicable scenarios | Competitive / Mixed | Primarily cooperative |
| Implementation complexity | Medium | Low |
| Performance | Good | Usually better (cooperative) |

3. Communication Mechanisms

3.1 Why Communication?

In partially observable environments, agents' local observations are insufficient for optimal decision-making. Communication allows agents to:

  • Share local information
  • Coordinate actions
  • Convey intentions
  • Request assistance

3.2 CommNet: Communication Neural Network

Core idea: Agents exchange information through continuous communication vectors, integrated into the forward pass.

Architecture:

Each agent \(i\) at each communication round \(k\):

\[h_i^{k+1} = \sigma\left(W^k h_i^k + C^k \bar{c}_i^k\right)\]

where the communication message is the mean of all other agents' hidden states:

\[\bar{c}_i^k = \frac{1}{N-1} \sum_{j \neq i} h_j^k\]

Features:

  • End-to-end differentiable, trainable via backpropagation
  • Predefined communication structure (fully connected)
  • Multi-round communication progressively refines information
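
A sketch of one CommNet communication round for a team of N agents, assuming PyTorch; the observation encoder, output head, and multi-round stacking are omitted, and the class name is illustrative.

```python
import torch
import torch.nn as nn

class CommNetRound(nn.Module):
    """One communication step: each agent receives the mean of the OTHER agents' hidden states."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.W = nn.Linear(hidden_dim, hidden_dim, bias=False)  # self transform W^k
        self.C = nn.Linear(hidden_dim, hidden_dim, bias=False)  # communication transform C^k

    def forward(self, h):                      # h: [n_agents, hidden_dim]
        n = h.size(0)
        # Mean of all other agents' hidden states: (sum - own) / (n - 1)
        c = (h.sum(dim=0, keepdim=True) - h) / max(n - 1, 1)
        return torch.tanh(self.W(h) + self.C(c))
```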

3.3 TarMAC: Targeted Multi-Agent Communication

Core improvement: Uses an attention mechanism for selective communication, so each agent can decide whose messages to attend to.

Attention-based communication:

  1. Each sender \(i\) generates:

    • Value \(v_i\): the message content to send
    • Key \(k_i\): a "tag" (signature) describing the message
  2. Each receiver \(j\) generates a query \(q_j\) from its hidden state and computes attention weights over senders:

\[\alpha_{j \leftarrow i} = \text{softmax}_i\left(\frac{q_j \cdot k_i}{\sqrt{d}}\right)\]
  3. The received message is the attention-weighted aggregation of values:
\[c_j = \sum_{i \neq j} \alpha_{j \leftarrow i} \cdot v_i\]

Advantages:

  • Dynamic selection of communication targets
  • Interpretable (attention weights can be visualized)
  • Communication protocols are learned automatically
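
The aggregation step can be sketched in a few lines of PyTorch; the exclusion of an agent's own message follows the description above, while the class name, key/value dimensions, and layer shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TarMACAttention(nn.Module):
    """Soft-attention message aggregation: each receiver attends over all senders' keys."""
    def __init__(self, hidden_dim, key_dim=16, value_dim=32):
        super().__init__()
        self.key = nn.Linear(hidden_dim, key_dim)
        self.query = nn.Linear(hidden_dim, key_dim)
        self.value = nn.Linear(hidden_dim, value_dim)
        self.scale = key_dim ** 0.5

    def forward(self, h):                                   # h: [n_agents, hidden_dim]
        k, q, v = self.key(h), self.query(h), self.value(h)
        scores = q @ k.t() / self.scale                     # [receiver, sender]
        mask = torch.eye(h.size(0), dtype=torch.bool)       # exclude own message
        alpha = torch.softmax(scores.masked_fill(mask, float('-inf')), dim=-1)
        return alpha @ v                                    # aggregated message c_j per receiver
```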

3.4 Communication Methods Comparison

| Method | Communication Structure | Content | Selective | Year |
|--------|-------------------------|---------|-----------|------|
| CommNet | Fully connected, mean pooling | Continuous vector | No | 2016 |
| IC3Net | Gated communication | Continuous vector | Yes (binary gate) | 2019 |
| TarMAC | Attention-weighted | Continuous vector | Yes (soft attention) | 2019 |
| DIAL | Direct gradient passing | Discrete / Continuous | No | 2016 |
| ATOC | Dynamic grouping | Continuous vector | Yes (grouping) | 2018 |

4. Emergent Communication

4.1 What Is Emergent Communication?

Rather than predefining a communication protocol, agents are left to learn a communication language spontaneously through RL:

  • Communication channel is part of the action space
  • Message content and semantics are entirely learned by agents
  • Can discover communication strategies humans never conceived

4.2 Key Findings

  • Language emergence: Agents can develop "languages" with compositionality
  • Referential games: Signal semantics emerge in Lewis signaling games
  • Language drift: After pre-training on human language, agents may drift away
  • Interpretability: Emergent languages are typically difficult for humans to decode

4.3 Challenges

  • Discrete messages block gradient flow through the channel → Gumbel-Softmax or REINFORCE (see the sketch after this list)
  • Communication protocol instability
  • Difficult to align with human language
  • Evaluating emergent language quality
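
As referenced in the first bullet above, here is a minimal sketch of keeping a discrete message channel differentiable with the straight-through Gumbel-Softmax, assuming PyTorch; the vocabulary size and class name are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiscreteMessageHead(nn.Module):
    """Emits a one-hot message token while keeping the channel differentiable."""
    def __init__(self, hidden_dim, vocab_size=16):
        super().__init__()
        self.logits = nn.Linear(hidden_dim, vocab_size)

    def forward(self, h, tau=1.0):
        # hard=True: one-hot sample in the forward pass,
        # straight-through (soft) gradient in the backward pass.
        return F.gumbel_softmax(self.logits(h), tau=tau, hard=True)
```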

5. Case Study: OpenAI Five

5.1 System Overview

OpenAI Five defeated the world champion team in Dota 2 (a 5v5 MOBA game), making it one of the most impactful applications of MARL to date.

5.2 Technical Approach

| Component | Choice |
|-----------|--------|
| Algorithm | Large-scale PPO (similar to IPPO) |
| Policy network | LSTM (one replica per hero) |
| Parameter sharing | Yes (all heroes share weights) |
| Communication | No explicit communication |
| Training compute | ~800 petaflop/s-days in total |
| Self-play | 80% current version + 20% historical versions |
| Reward | Team reward + individual reward shaping |

5.3 Key Insights

  1. Scale is everything: Massive compute compensates for algorithmic simplicity
  2. Parameter sharing works: All heroes share parameters + agent ID
  3. No explicit communication needed: Implicit coordination through shared training
  4. Long-term credit assignment: Mitigated through careful reward shaping
  5. PPO's robustness: Stable performance in large-scale non-stationary training

5.4 Takeaways

  • Simple algorithms + massive computation can solve extremely complex MARL problems
  • Parameter sharing is a key technique for efficiency
  • Reward engineering is crucial in MARL

6. Algorithm Summary and Selection

6.1 Algorithm Comparison Table

| Algorithm | Paradigm | Action Space | Scenario | Complexity | Performance |
|-----------|----------|--------------|----------|------------|-------------|
| IQL | Independent learning | Discrete | General | Low | Weak |
| VDN | CTDE, value decomposition | Discrete | Cooperative | Low | Medium |
| QMIX | CTDE, value decomposition | Discrete | Cooperative | Medium | Good |
| MADDPG | CTDE, policy gradient | Continuous | General | Medium | Good |
| MAPPO | CTDE, policy gradient | Discrete / Continuous | Cooperative | Low | Very good |
| CommNet | CTDE, communication | Discrete | Cooperative | Medium | Medium |
| TarMAC | CTDE, communication | Discrete | Cooperative | Medium | Good |

6.2 Selection Guide

Cooperative Tasks:
  ├── Discrete actions + value decomposition needed → QMIX
  ├── Discrete/Continuous + simple and efficient → MAPPO
  ├── Communication needed → TarMAC / CommNet
  └── Large-scale agents → MAPPO + parameter sharing

Competitive/Mixed Tasks:
  ├── Continuous actions → MADDPG
  ├── Discrete actions → MAPPO
  └── Diversity needed → League Training

When unsure:
  └── Try MAPPO first (simple, robust, usually good enough)

7. Practical Advice

7.1 Common Pitfalls

  1. Ignoring hyperparameter tuning: MARL is more sensitive to hyperparameters
  2. Insufficient evaluation: Need multi-seed, multi-opponent evaluation
  3. Poor reward design: Individual rewards conflicting with team objectives
  4. Ignoring scalability: An algorithm working at small scale doesn't guarantee large-scale success
  5. Communication overhead: Communication methods may introduce significant computational cost

7.2 Recommended Frameworks

| Framework | Features |
|-----------|----------|
| EPyMARL | SMAC benchmark, multiple algorithm implementations |
| MARLlib | Unified interface, 10+ environments, 20+ algorithms |
| PettingZoo | Standardized multi-agent environment interface |
| OpenSpiel | Game theory + MARL |
| Melting Pot | Social intelligence evaluation |
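
For reference, a minimal interaction loop with PettingZoo's parallel API, assuming a recent PettingZoo release with the MPE environments installed (`pip install pettingzoo[mpe]`); the random policy is only a placeholder, and the exact environment version may differ across releases.

```python
from pettingzoo.mpe import simple_spread_v3

# Parallel API: all agents act simultaneously at each step.
env = simple_spread_v3.parallel_env(N=3, max_cycles=25)
observations, infos = env.reset(seed=0)

while env.agents:
    # Placeholder policy: a random action for every live agent.
    actions = {agent: env.action_space(agent).sample() for agent in env.agents}
    observations, rewards, terminations, truncations, infos = env.step(actions)
env.close()
```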

References

  • Sunehag, P. et al. (2018). Value-Decomposition Networks for Cooperative Multi-Agent Learning. AAMAS.
  • Rashid, T. et al. (2018). QMIX: Monotonic Value Function Factorisation. ICML.
  • Lowe, R. et al. (2017). Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. NeurIPS.
  • Yu, C. et al. (2022). The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games. NeurIPS.
  • Sukhbaatar, S. et al. (2016). Learning Multiagent Communication with Backpropagation. NeurIPS.
  • Das, A. et al. (2019). TarMAC: Targeted Multi-Agent Communication. ICML.
  • Berner, C. et al. (2019). Dota 2 with Large Scale Deep Reinforcement Learning. arXiv preprint arXiv:1912.06680.
