Decision Making (AIMA Part IV)
These notes correspond to AIMA (Artificial Intelligence: A Modern Approach) Chapters 16–17, covering making simple decisions and making complex decisions. While the preceding probabilistic reasoning notes address "what is the world like" (inference), these chapters go further and answer "what should be done" (decision).
1. Utility Theory
1.1 Maximum Expected Utility
A rational agent should choose the action that maximizes expected utility (the principle of Maximum Expected Utility, MEU):

\[
a^* = \arg\max_{a} \text{EU}(a) = \arg\max_{a} \sum_{s} P(s \mid a)\, U(s)
\]

where \(U(s)\) is the utility function over state \(s\), and \(P(s \mid a)\) is the probability of reaching state \(s\) after taking action \(a\).
The MEU principle is the central criterion for rational agent design in AIMA — it unifies probabilistic reasoning (beliefs) and utility theory (preferences) within a single decision-making framework.
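As a concrete illustration, here is a minimal sketch of MEU action selection in Python; the route-choice actions, outcome states, probabilities, and utilities are invented purely for illustration:

```python
# Minimal MEU sketch: pick the action with the highest expected utility.
# The route-choice problem below is hypothetical.

P = {  # P(s | a): probability of each outcome state given the action
    "fast_route": {"arrive_early": 0.6, "arrive_late": 0.4},
    "safe_route": {"arrive_on_time": 1.0},
}
U = {"arrive_early": 100, "arrive_on_time": 80, "arrive_late": 0}  # U(s)

def expected_utility(action):
    return sum(p * U[s] for s, p in P[action].items())

best_action = max(P, key=expected_utility)
print(best_action, expected_utility(best_action))  # safe_route 80.0
```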
1.2 Utility Functions and Risk Attitudes
A utility function \(U: S \to \mathbb{R}\) maps outcomes to real values, reflecting the agent's preferences. Different utility function shapes reflect different risk attitudes:
| Risk Attitude | Utility Function Shape | Characteristic |
|---|---|---|
| Risk neutral | Linear: \(U(x) = x\) | Cares only about expected value |
| Risk averse | Concave: \(U(x) = \sqrt{x}\) | Prefers certainty; \(U(\mathbb{E}[x]) > \mathbb{E}[U(x)]\) |
| Risk seeking | Convex: \(U(x) = x^2\) | Prefers gambles; \(U(\mathbb{E}[x]) < \mathbb{E}[U(x)]\) |
Risk Attitude Example
Consider two options: (A) receive $100 for certain; (B) a 50% chance of $200 and a 50% chance of $0. Both have the same expected monetary value ($100), but a risk-averse agent chooses A while a risk-seeking agent chooses B.
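The short computation below reproduces this comparison for the three utility shapes in the table above; the lotteries are exactly those of the example:

```python
import math

# Option A: $100 for certain.  Option B: $200 or $0 with equal probability.
A = [(100, 1.0)]
B = [(200, 0.5), (0, 0.5)]

def expected_utility(lottery, U):
    return sum(p * U(x) for x, p in lottery)

for name, U in [("risk neutral U(x) = x   ", lambda x: x),
                ("risk averse  U(x) = √x  ", math.sqrt),
                ("risk seeking U(x) = x^2 ", lambda x: x ** 2)]:
    print(name, "EU(A) =", expected_utility(A, U), "EU(B) =", expected_utility(B, U))
# The risk-neutral agent is indifferent, the risk-averse agent prefers A,
# and the risk-seeking agent prefers B.
```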
1.3 Axiomatic Foundation of Utility Functions
Von Neumann and Morgenstern proved that if an agent's preferences satisfy the following axioms, then there exists a utility function such that acting to maximize its expected value is equivalent to acting according to those preferences:
| Axiom | Meaning |
|---|---|
| Completeness | For any two outcomes \(A, B\), the agent can determine \(A \succ B\), \(B \succ A\), or \(A \sim B\) |
| Transitivity | If \(A \succ B\) and \(B \succ C\), then \(A \succ C\) |
| Continuity | If \(A \succ B \succ C\), there exists a probability \(p\) such that \(B \sim [p: A;\ (1-p): C]\) |
| Independence | If \(A \succ B\), then for any \(C\) and \(p\), \([p: A;\ (1-p): C] \succ [p: B;\ (1-p): C]\) |
Behaviors that violate these axioms (such as the Allais paradox) are commonly observed in humans but are considered irrational.
2. Value of Information
2.1 Value of Perfect Information (VPI)
Before making a decision, an agent may choose to gather information first. The Value of Perfect Information (VPI) quantifies the expected benefit of learning the true value of some variable \(E_j\), given current evidence \(e\):

\[
\text{VPI}_e(E_j) = \left( \sum_{e_j} P(E_j = e_j \mid e)\, \text{EU}(\alpha_{e_j} \mid e, E_j = e_j) \right) - \text{EU}(\alpha \mid e)
\]

where \(\alpha_{e_j}\) is the optimal action after learning \(E_j = e_j\), and \(\alpha\) is the optimal action under the current information.
Key properties of VPI:
- \(\text{VPI}(E_j) \ge 0\): Information never has negative value (in expectation)
- VPI is not additive: \(\text{VPI}(E_1, E_2) \neq \text{VPI}(E_1) + \text{VPI}(E_2)\) (in general)
- If the current optimal decision would not change regardless of \(E_j\)'s value, then \(\text{VPI}(E_j) = 0\)
2.2 Information-Gathering Decisions
Based on VPI, an agent can rationally decide whether it is worth paying the cost to acquire information:
- If \(\text{VPI}(E_j) > \text{Cost}(E_j)\), the information should be gathered
- Among multiple available information sources, select the one with the highest net value
- This is a meta-reasoning problem: reasoning about how to reason
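A minimal sketch putting Sections 2.1 and 2.2 together on a hypothetical oil-drilling problem (all probabilities, utilities, and the survey cost are invented): it computes the VPI of a single hidden variable and applies the gather-if-worthwhile rule.

```python
# Hidden variable E_j = Oil; current belief and utilities are hypothetical.
P_oil = {"yes": 0.5, "no": 0.5}
actions = ["drill", "dont_drill"]
U = {("drill", "yes"): 200, ("drill", "no"): -100,
     ("dont_drill", "yes"): 0, ("dont_drill", "no"): 0}

def eu(action, belief):
    """Expected utility of an action under a belief over the hidden variable."""
    return sum(p * U[(action, e)] for e, p in belief.items())

# Expected utility of acting now, without further information.
eu_now = max(eu(a, P_oil) for a in actions)

# Expected utility if the variable is observed first: for each possible
# value e_j we act optimally, weighted by its current probability.
eu_informed = sum(p * max(U[(a, e)] for a in actions) for e, p in P_oil.items())

vpi = eu_informed - eu_now            # here 100 - 50 = 50
cost_of_survey = 30                   # hypothetical cost of the observation
print(f"VPI = {vpi}, gather information: {vpi > cost_of_survey}")
```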
3. Decision Networks
A Decision Network (also called an Influence Diagram) extends Bayesian networks for modeling decision problems. It contains three types of nodes:
| Node Type | Shape | Meaning |
|---|---|---|
| Chance node | Oval | Random variable (same as Bayesian network nodes) |
| Decision node | Rectangle | Action the agent can choose |
| Utility node | Diamond | Utility value given parent node values |
Solving a decision network:
- Set evidence variables
- For each possible decision value, compute the expected utility at the utility node
- Select the decision that maximizes expected utility
Decision networks unify probabilistic reasoning and decision optimization within a single graphical model, making it intuitive to model and solve complex decision problems.
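A minimal sketch of this three-step procedure on a hypothetical network (chance nodes Weather → Forecast, decision node Umbrella, utility node with parents Umbrella and Weather; all numbers are invented):

```python
# Chance nodes: prior on Weather and CPT for Forecast given Weather.
P_weather = {"rain": 0.3, "sun": 0.7}
P_forecast_given_weather = {("rainy", "rain"): 0.8, ("rainy", "sun"): 0.2,
                            ("sunny", "rain"): 0.2, ("sunny", "sun"): 0.8}
# Utility node: a table indexed by the values of its parents.
utility = {("take", "rain"): 70, ("take", "sun"): 60,
           ("leave", "rain"): 0, ("leave", "sun"): 100}

# Step 1: set the evidence (a rainy forecast was observed) and infer the
# posterior over the chance node with Bayes' rule.
evidence = "rainy"
joint = {w: P_forecast_given_weather[(evidence, w)] * P_weather[w] for w in P_weather}
posterior = {w: p / sum(joint.values()) for w, p in joint.items()}

# Steps 2-3: compute the expected utility of each decision value, pick the best.
def eu(decision):
    return sum(p * utility[(decision, w)] for w, p in posterior.items())

best = max(["take", "leave"], key=eu)
print(posterior, best, eu(best))   # the rainy forecast makes "take" optimal
```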
4. Sequential Decisions: Markov Decision Processes
4.1 MDP Definition
When decisions involve multiple time steps, we enter the realm of sequential decision making. The Markov Decision Process (MDP) is the standard framework for modeling sequential decisions:
| Element | Meaning |
|---|---|
| \(S\) | State space |
| \(A\) | Action space |
| \(T(s' \mid s, a)\) | State transition probability |
| \(R(s, a, s')\) | Reward function |
| \(\gamma \in [0, 1)\) | Discount factor |
4.2 Bellman Equation
The optimal value function satisfies the Bellman optimality equation:

\[
V^*(s) = \max_{a} \sum_{s'} T(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma\, V^*(s') \bigr]
\]

Writing \(Q^*(s, a) = \sum_{s'} T(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma\, V^*(s') \bigr]\) for the expected return of taking action \(a\) in state \(s\) and acting optimally thereafter, the optimal policy is \(\pi^*(s) = \arg\max_{a} Q^*(s, a)\).
4.3 Solution Algorithms
| Algorithm | Approach | Complexity |
|---|---|---|
| Value Iteration | Repeatedly apply Bellman updates until convergence | \(O(|S|^2 |A|)\) per iteration |
| Policy Iteration | Alternate between policy evaluation and policy improvement | Typically fewer iterations |
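A minimal value-iteration sketch on an invented three-state chain MDP (the transitions, rewards, discount, and convergence threshold are all illustrative):

```python
# Tiny MDP: s0 -> s1 -> goal, with a noisy "right" action and a "left" reset.
states = ["s0", "s1", "goal"]
actions = ["left", "right"]
gamma = 0.9

T = {  # T[(s, a)] = {s': probability}
    ("s0", "right"): {"s1": 0.9, "s0": 0.1},
    ("s0", "left"):  {"s0": 1.0},
    ("s1", "right"): {"goal": 0.9, "s1": 0.1},
    ("s1", "left"):  {"s0": 1.0},
    ("goal", "right"): {"goal": 1.0},   # goal is absorbing
    ("goal", "left"):  {"goal": 1.0},
}

def R(s, a, s2):
    return 10.0 if s2 == "goal" and s != "goal" else 0.0

V = {s: 0.0 for s in states}
for _ in range(1000):                       # repeated Bellman updates
    new_V = {s: max(sum(p * (R(s, a, s2) + gamma * V[s2])
                        for s2, p in T[(s, a)].items())
                    for a in actions)
             for s in states}
    delta = max(abs(new_V[s] - V[s]) for s in states)
    V = new_V
    if delta < 1e-6:                        # stop when values have converged
        break

# Extract the greedy policy pi*(s) = argmax_a Q*(s, a).
policy = {s: max(actions,
                 key=lambda a, s=s: sum(p * (R(s, a, s2) + gamma * V[s2])
                                        for s2, p in T[(s, a)].items()))
          for s in states}
print(V)
print(policy)
```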
Relationship to Reinforcement Learning
MDPs form the theoretical foundation of reinforcement learning. Value iteration and policy iteration assume that the transition probability \(T\) and reward function \(R\) are known (i.e., the model is known), whereas reinforcement learning addresses the model-unknown setting — the agent must learn through interaction with the environment. For detailed coverage, see the reinforcement learning notes on this site.
5. POMDPs: Decision Making Under Partial Observability
5.1 Definition
A Partially Observable Markov Decision Process (POMDP) extends the MDP with an observation model:

\[
\langle S, A, T, R, \Omega, O, \gamma \rangle
\]

where \(\Omega\) is the observation space and \(O(o \mid s', a)\) is the probability of observing \(o\) upon reaching state \(s'\) after taking action \(a\).
5.2 Belief State
In a POMDP, the agent cannot directly observe the state. Instead, it maintains a belief state \(b(s) = P(S_t = s \mid h_t)\), a probability distribution over states conditioned on the history of observations and actions.
Belief updates follow Bayes' rule: after taking action \(a\) and observing \(o\),

\[
b'(s') = \eta\, O(o \mid s', a) \sum_{s} T(s' \mid s, a)\, b(s)
\]

where \(\eta\) is a normalizing constant.
A POMDP can be transformed into a continuous-state MDP over the belief space, but solving it is generally computationally very hard (PSPACE-complete). In practice, approximate methods such as Point-Based Value Iteration (PBVI) are commonly used.
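A minimal sketch of the belief update on a hypothetical two-state, tiger-style problem (the transition and observation probabilities are invented):

```python
# b'(s') ∝ O(o | s', a) * Σ_s T(s' | s, a) * b(s)
states = ["tiger_left", "tiger_right"]

# Transition model: the "listen" action does not move the tiger.
T = {("tiger_left", "listen"): {"tiger_left": 1.0},
     ("tiger_right", "listen"): {"tiger_right": 1.0}}

# Observation model O(o | s', a): listening reports the correct side 85% of the time.
O = {("hear_left", "tiger_left", "listen"): 0.85,
     ("hear_left", "tiger_right", "listen"): 0.15,
     ("hear_right", "tiger_left", "listen"): 0.15,
     ("hear_right", "tiger_right", "listen"): 0.85}

def belief_update(b, a, o):
    new_b = {}
    for s2 in states:
        pred = sum(T[(s, a)].get(s2, 0.0) * b[s] for s in states)  # prediction step
        new_b[s2] = O[(o, s2, a)] * pred                            # correction step
    norm = sum(new_b.values())
    return {s: p / norm for s, p in new_b.items()}

b = {"tiger_left": 0.5, "tiger_right": 0.5}
b = belief_update(b, "listen", "hear_left")
print(b)   # belief shifts toward tiger_left (0.85 vs 0.15)
```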
6. Game Theory and Multi-Agent Decision Making
6.1 Game Theory Fundamentals
When the environment contains multiple rational agents, each agent's optimal decision depends on the behavior of others. Game theory provides a framework for analyzing such situations.
A normal-form game is defined by:
- A set of players \(\{1, 2, \dots, n\}\)
- A strategy set \(S_i\) for each player
- A utility function \(u_i(s_1, \dots, s_n)\) for each player
6.2 Nash Equilibrium
A Nash Equilibrium is a strategy profile \((s_1^*, \dots, s_n^*)\) such that no player can improve their utility by unilaterally changing their strategy:

\[
u_i(s_1^*, \dots, s_i^*, \dots, s_n^*) \;\ge\; u_i(s_1^*, \dots, s_i, \dots, s_n^*) \quad \text{for every player } i \text{ and every } s_i \in S_i
\]
| Concept | Description |
|---|---|
| Pure strategy equilibrium | Each player chooses a single deterministic action |
| Mixed strategy equilibrium | Each player randomizes, choosing a probability distribution over actions |
| Nash's existence theorem | Every finite game has at least one mixed strategy Nash equilibrium |
Classic Example: The Prisoner's Dilemma
Two suspects independently decide whether to cooperate (remain silent) or defect (betray the other). The Nash equilibrium is mutual defection — even though mutual cooperation would leave both better off. This illustrates that individual rationality does not necessarily lead to collective optimality.
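A brute-force check for pure-strategy Nash equilibria on a standard prisoner's dilemma payoff matrix (years in prison expressed as negative utilities; the exact numbers vary across textbooks):

```python
from itertools import product

strategies = ["cooperate", "defect"]
# payoff[(s1, s2)] = (u1, u2)
payoff = {("cooperate", "cooperate"): (-1, -1),
          ("cooperate", "defect"):    (-10, 0),
          ("defect",    "cooperate"): (0, -10),
          ("defect",    "defect"):    (-5, -5)}

def is_nash(s1, s2):
    """True if neither player gains by a unilateral deviation."""
    u1, u2 = payoff[(s1, s2)]
    best1 = all(payoff[(d, s2)][0] <= u1 for d in strategies)
    best2 = all(payoff[(s1, d)][1] <= u2 for d in strategies)
    return best1 and best2

equilibria = [p for p in product(strategies, repeat=2) if is_nash(*p)]
print(equilibria)   # [('defect', 'defect')] -- mutual defection is the only equilibrium
```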
6.3 Mechanism Design
Mechanism Design is the "inverse problem" of game theory: given a desired social outcome, how should the rules (mechanism) be designed so that the equilibrium behavior of rational participants produces exactly that outcome?
Typical applications include auction design (e.g., the truthful bidding incentive in Vickrey auctions) and voting mechanisms.
Cross-References
Connections to Other Notes on This Site
- MDPs and Reinforcement Learning: The reinforcement learning notes on this site provide detailed coverage of MDP solution methods (value iteration, policy iteration), Q-learning, policy gradients, Actor-Critic, and more. This document offers only a conceptual introduction.
- Probabilistic Reasoning Foundations: The decision networks and POMDPs in this document build on the Bayesian networks and HMMs introduced in the Probabilistic Reasoning notes.
- Adversarial Search: The Adversarial Search notes discuss two-player zero-sum games from a search perspective (Minimax, Alpha-Beta pruning), complementing the game-theoretic perspective presented here.
References
- Russell & Norvig, Artificial Intelligence: A Modern Approach, Chapters 16–17
- Kaelbling, Littman & Cassandra, "Planning and Acting in Partially Observable Stochastic Domains", Artificial Intelligence, 1998
- Myerson, Game Theory: Analysis of Conflict