Decision Making (AIMA Part IV)
These notes correspond to AIMA (Artificial Intelligence: A Modern Approach) Chapters 16–17, covering making simple decisions and making complex decisions. While the preceding probabilistic reasoning notes address "what is the world like" (inference), these chapters go further and answer "what should be done" (decision).
1. Utility Theory
1.1 Maximum Expected Utility
A rational agent should choose the action that maximizes expected utility (the principle of Maximum Expected Utility, MEU):

\[
a^* = \arg\max_{a} \text{EU}(a) = \arg\max_{a} \sum_{s} P(s \mid a)\, U(s)
\]

where \(U(s)\) is the utility function over state \(s\), and \(P(s \mid a)\) is the probability of reaching state \(s\) after taking action \(a\).
The MEU principle is the central criterion for rational agent design in AIMA — it unifies probabilistic reasoning (beliefs) and utility theory (preferences) within a single decision-making framework.
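As a concrete illustration, here is a minimal sketch of MEU action selection in Python; the route-choice actions, outcome states, probabilities, and utilities are invented purely for illustration:

```python
# Minimal MEU sketch: pick the action with the highest expected utility.
# The route-choice problem below is hypothetical.

P = {  # P(s | a): probability of each outcome state given the action
    "fast_route": {"arrive_early": 0.6, "arrive_late": 0.4},
    "safe_route": {"arrive_on_time": 1.0},
}
U = {"arrive_early": 100, "arrive_on_time": 80, "arrive_late": 0}  # U(s)

def expected_utility(action):
    return sum(p * U[s] for s, p in P[action].items())

best_action = max(P, key=expected_utility)
print(best_action, expected_utility(best_action))  # safe_route 80.0
```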
1.2 Utility Functions and Risk Attitudes
A utility function \(U: S \to \mathbb{R}\) maps outcomes to real values, reflecting the agent's preferences. Different utility function shapes reflect different risk attitudes:
| Risk Attitude | Utility Function Shape | Characteristic |
|---|---|---|
| Risk neutral | Linear: \(U(x) = x\) | Cares only about expected value |
| Risk averse | Concave: \(U(x) = \sqrt{x}\) | Prefers certainty; \(U(\mathbb{E}[x]) > \mathbb{E}[U(x)]\) |
| Risk seeking | Convex: \(U(x) = x^2\) | Prefers gambles; \(U(\mathbb{E}[x]) < \mathbb{E}[U(x)]\) |
Risk Attitude Example
Consider two options: (A) receive $100 for certain; (B) a 50% chance of $200 and a 50% chance of $0. Both have the same expected monetary value ($100), but a risk-averse agent chooses A while a risk-seeking agent chooses B.
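The short computation below reproduces this comparison for the three utility shapes in the table above; the lotteries are exactly those of the example:

```python
import math

# Option A: $100 for certain.  Option B: $200 or $0 with equal probability.
A = [(100, 1.0)]
B = [(200, 0.5), (0, 0.5)]

def expected_utility(lottery, U):
    return sum(p * U(x) for x, p in lottery)

for name, U in [("risk neutral U(x) = x   ", lambda x: x),
                ("risk averse  U(x) = √x  ", math.sqrt),
                ("risk seeking U(x) = x^2 ", lambda x: x ** 2)]:
    print(name, "EU(A) =", expected_utility(A, U), "EU(B) =", expected_utility(B, U))
# The risk-neutral agent is indifferent, the risk-averse agent prefers A,
# and the risk-seeking agent prefers B.
```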
1.3 Axiomatic Foundation of Utility Functions
Von Neumann and Morgenstern proved that if an agent's preferences satisfy the following axioms, then there exists a utility function such that acting to maximize its expected value is equivalent to acting according to those preferences:
| Axiom | Meaning |
|---|---|
| Completeness | For any two outcomes \(A, B\), the agent can determine \(A \succ B\), \(B \succ A\), or \(A \sim B\) |
| Transitivity | If \(A \succ B\) and \(B \succ C\), then \(A \succ C\) |
| Continuity | If \(A \succ B \succ C\), there exists a probability \(p\) such that \(B \sim [p: A;\ (1-p): C]\) |
| Independence | If \(A \succ B\), then for any \(C\) and \(p\), \([p: A;\ (1-p): C] \succ [p: B;\ (1-p): C]\) |
Behaviors that violate these axioms (such as the Allais paradox) are commonly observed in humans but are considered irrational.
2. Value of Information
2.1 Value of Perfect Information (VPI)
Before making a decision, an agent may choose to gather information first. The Value of Perfect Information (VPI) quantifies the expected benefit of learning the true value of some variable \(E_j\), given current evidence \(e\):

\[
\text{VPI}_e(E_j) = \left( \sum_{e_j} P(E_j = e_j \mid e)\, \text{EU}(\alpha_{e_j} \mid e, E_j = e_j) \right) - \text{EU}(\alpha \mid e)
\]

where \(\alpha_{e_j}\) is the optimal action after learning \(E_j = e_j\), and \(\alpha\) is the optimal action under the current information.
Key properties of VPI:
- \(\text{VPI}(E_j) \ge 0\): Information never has negative value (in expectation)
- VPI is not additive: \(\text{VPI}(E_1, E_2) \neq \text{VPI}(E_1) + \text{VPI}(E_2)\) (in general)
- If the current optimal decision would not change regardless of \(E_j\)'s value, then \(\text{VPI}(E_j) = 0\)
2.2 Information-Gathering Decisions
Based on VPI, an agent can rationally decide whether it is worth paying the cost to acquire information:
- If \(\text{VPI}(E_j) > \text{Cost}(E_j)\), the information should be gathered
- Among multiple available information sources, select the one with the highest net value
- This is a meta-reasoning problem: reasoning about how to reason
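A minimal sketch putting Sections 2.1 and 2.2 together on a hypothetical oil-drilling problem (all probabilities, utilities, and the survey cost are invented): it computes the VPI of a single hidden variable and applies the gather-if-worthwhile rule.

```python
# Hidden variable E_j = Oil; current belief and utilities are hypothetical.
P_oil = {"yes": 0.5, "no": 0.5}
actions = ["drill", "dont_drill"]
U = {("drill", "yes"): 200, ("drill", "no"): -100,
     ("dont_drill", "yes"): 0, ("dont_drill", "no"): 0}

def eu(action, belief):
    """Expected utility of an action under a belief over the hidden variable."""
    return sum(p * U[(action, e)] for e, p in belief.items())

# Expected utility of acting now, without further information.
eu_now = max(eu(a, P_oil) for a in actions)

# Expected utility if the variable is observed first: for each possible
# value e_j we act optimally, weighted by its current probability.
eu_informed = sum(p * max(U[(a, e)] for a in actions) for e, p in P_oil.items())

vpi = eu_informed - eu_now            # here 100 - 50 = 50
cost_of_survey = 30                   # hypothetical cost of the observation
print(f"VPI = {vpi}, gather information: {vpi > cost_of_survey}")
```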
3. Decision Networks
A Decision Network (also called an Influence Diagram) extends Bayesian networks for modeling decision problems. It contains three types of nodes:
| Node Type | Shape | Meaning |
|---|---|---|
| Chance node | Oval | Random variable (same as Bayesian network nodes) |
| Decision node | Rectangle | Action the agent can choose |
| Utility node | Diamond | Utility value given parent node values |
Solving a decision network:
- Set evidence variables
- For each possible decision value, compute the expected utility at the utility node
- Select the decision that maximizes expected utility
Decision networks unify probabilistic reasoning and decision optimization within a single graphical model, making it intuitive to model and solve complex decision problems.
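A minimal sketch of this three-step procedure on a hypothetical network (chance nodes Weather → Forecast, decision node Umbrella, utility node with parents Umbrella and Weather; all numbers are invented):

```python
# Chance nodes: prior on Weather and CPT for Forecast given Weather.
P_weather = {"rain": 0.3, "sun": 0.7}
P_forecast_given_weather = {("rainy", "rain"): 0.8, ("rainy", "sun"): 0.2,
                            ("sunny", "rain"): 0.2, ("sunny", "sun"): 0.8}
# Utility node: a table indexed by the values of its parents.
utility = {("take", "rain"): 70, ("take", "sun"): 60,
           ("leave", "rain"): 0, ("leave", "sun"): 100}

# Step 1: set the evidence (a rainy forecast was observed) and infer the
# posterior over the chance node with Bayes' rule.
evidence = "rainy"
joint = {w: P_forecast_given_weather[(evidence, w)] * P_weather[w] for w in P_weather}
posterior = {w: p / sum(joint.values()) for w, p in joint.items()}

# Steps 2-3: compute the expected utility of each decision value, pick the best.
def eu(decision):
    return sum(p * utility[(decision, w)] for w, p in posterior.items())

best = max(["take", "leave"], key=eu)
print(posterior, best, eu(best))   # the rainy forecast makes "take" optimal
```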
4. Sequential Decisions: Markov Decision Processes
4.1 MDP Definition
When decisions involve multiple time steps, we enter the realm of sequential decision making. The Markov Decision Process (MDP) is the standard framework for modeling sequential decisions:
| Element | Meaning |
|---|---|
| \(S\) | State space |
| \(A\) | Action space |
| \(T(s' \mid s, a)\) | State transition probability |
| \(R(s, a, s')\) | Reward function |
| \(\gamma \in [0, 1)\) | Discount factor |
4.2 Bellman Equation
The optimal value function satisfies the Bellman optimality equation:

\[
V^*(s) = \max_{a} \sum_{s'} T(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma\, V^*(s') \bigr]
\]

Writing \(Q^*(s, a) = \sum_{s'} T(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma\, V^*(s') \bigr]\) for the expected return of taking action \(a\) in state \(s\) and acting optimally thereafter, the optimal policy is \(\pi^*(s) = \arg\max_{a} Q^*(s, a)\).
4.3 Solution Algorithms
| Algorithm | Approach | Complexity |
|---|---|---|
| Value Iteration | Repeatedly apply Bellman updates until convergence | \(O(|S|^2 |A|)\) per iteration |
| Policy Iteration | Alternate between policy evaluation and policy improvement | Typically fewer iterations |
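A minimal value-iteration sketch on an invented three-state chain MDP (the transitions, rewards, discount, and convergence threshold are all illustrative):

```python
# Tiny MDP: s0 -> s1 -> goal, with a noisy "right" action and a "left" reset.
states = ["s0", "s1", "goal"]
actions = ["left", "right"]
gamma = 0.9

T = {  # T[(s, a)] = {s': probability}
    ("s0", "right"): {"s1": 0.9, "s0": 0.1},
    ("s0", "left"):  {"s0": 1.0},
    ("s1", "right"): {"goal": 0.9, "s1": 0.1},
    ("s1", "left"):  {"s0": 1.0},
    ("goal", "right"): {"goal": 1.0},   # goal is absorbing
    ("goal", "left"):  {"goal": 1.0},
}

def R(s, a, s2):
    return 10.0 if s2 == "goal" and s != "goal" else 0.0

V = {s: 0.0 for s in states}
for _ in range(1000):                       # repeated Bellman updates
    new_V = {s: max(sum(p * (R(s, a, s2) + gamma * V[s2])
                        for s2, p in T[(s, a)].items())
                    for a in actions)
             for s in states}
    delta = max(abs(new_V[s] - V[s]) for s in states)
    V = new_V
    if delta < 1e-6:                        # stop when values have converged
        break

# Extract the greedy policy pi*(s) = argmax_a Q*(s, a).
policy = {s: max(actions,
                 key=lambda a, s=s: sum(p * (R(s, a, s2) + gamma * V[s2])
                                        for s2, p in T[(s, a)].items()))
          for s in states}
print(V)
print(policy)
```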
Relationship to Reinforcement Learning
MDPs form the theoretical foundation of reinforcement learning. Value iteration and policy iteration assume that the transition probability \(T\) and reward function \(R\) are known (i.e., the model is known), whereas reinforcement learning addresses the model-unknown setting — the agent must learn through interaction with the environment. For detailed coverage, see the reinforcement learning notes on this site.
5. POMDPs: Decision Making Under Partial Observability
5.1 Definition
A Partially Observable Markov Decision Process (POMDP) extends the MDP with an observation model:

\[
\langle S, A, T, R, \Omega, O, \gamma \rangle
\]

where \(\Omega\) is the observation space and \(O(o \mid s', a)\) is the probability of observing \(o\) upon reaching state \(s'\) after taking action \(a\).
5.2 Belief State
In a POMDP, the agent cannot directly observe the state. Instead, it maintains a belief state \(b(s) = P(S_t = s \mid h_t)\), a probability distribution over states conditioned on the history of observations and actions.
Belief updates follow Bayes' rule: after taking action \(a\) and observing \(o\),

\[
b'(s') = \eta\, O(o \mid s', a) \sum_{s} T(s' \mid s, a)\, b(s)
\]

where \(\eta\) is a normalizing constant.
A POMDP can be transformed into a continuous-state MDP over the belief space, but solving it is generally computationally very hard (PSPACE-complete). In practice, approximate methods such as Point-Based Value Iteration (PBVI) are commonly used.
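A minimal sketch of the belief update on a hypothetical two-state, tiger-style problem (the transition and observation probabilities are invented):

```python
# b'(s') ∝ O(o | s', a) * Σ_s T(s' | s, a) * b(s)
states = ["tiger_left", "tiger_right"]

# Transition model: the "listen" action does not move the tiger.
T = {("tiger_left", "listen"): {"tiger_left": 1.0},
     ("tiger_right", "listen"): {"tiger_right": 1.0}}

# Observation model O(o | s', a): listening reports the correct side 85% of the time.
O = {("hear_left", "tiger_left", "listen"): 0.85,
     ("hear_left", "tiger_right", "listen"): 0.15,
     ("hear_right", "tiger_left", "listen"): 0.15,
     ("hear_right", "tiger_right", "listen"): 0.85}

def belief_update(b, a, o):
    new_b = {}
    for s2 in states:
        pred = sum(T[(s, a)].get(s2, 0.0) * b[s] for s in states)  # prediction step
        new_b[s2] = O[(o, s2, a)] * pred                            # correction step
    norm = sum(new_b.values())
    return {s: p / norm for s, p in new_b.items()}

b = {"tiger_left": 0.5, "tiger_right": 0.5}
b = belief_update(b, "listen", "hear_left")
print(b)   # belief shifts toward tiger_left (0.85 vs 0.15)
```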
6. Game Theory and Multi-Agent Decision Making
6.1 Game Theory Fundamentals
When the environment contains multiple rational agents, each agent's optimal decision depends on the behavior of others. Game theory provides a framework for analyzing such situations.
A normal-form game is defined by:
- A set of players \(\{1, 2, \dots, n\}\)
- A strategy set \(S_i\) for each player
- A utility function \(u_i(s_1, \dots, s_n)\) for each player
6.2 Nash Equilibrium
A Nash Equilibrium is a strategy profile \((s_1^*, \dots, s_n^*)\) such that no player can improve their utility by unilaterally changing their strategy:

\[
u_i(s_1^*, \dots, s_i^*, \dots, s_n^*) \;\ge\; u_i(s_1^*, \dots, s_i, \dots, s_n^*) \quad \text{for every player } i \text{ and every } s_i \in S_i
\]
| Concept | Description |
|---|---|
| Pure strategy equilibrium | Each player chooses a single deterministic action |
| Mixed strategy equilibrium | Each player randomizes, choosing a probability distribution over actions |
| Nash's existence theorem | Every finite game has at least one mixed strategy Nash equilibrium |
Classic Example: The Prisoner's Dilemma
Two suspects independently decide whether to cooperate (remain silent) or defect (betray the other). The Nash equilibrium is mutual defection — even though mutual cooperation would leave both better off. This illustrates that individual rationality does not necessarily lead to collective optimality.
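A brute-force check for pure-strategy Nash equilibria on a standard prisoner's dilemma payoff matrix (years in prison expressed as negative utilities; the exact numbers vary across textbooks):

```python
from itertools import product

strategies = ["cooperate", "defect"]
# payoff[(s1, s2)] = (u1, u2)
payoff = {("cooperate", "cooperate"): (-1, -1),
          ("cooperate", "defect"):    (-10, 0),
          ("defect",    "cooperate"): (0, -10),
          ("defect",    "defect"):    (-5, -5)}

def is_nash(s1, s2):
    """True if neither player gains by a unilateral deviation."""
    u1, u2 = payoff[(s1, s2)]
    best1 = all(payoff[(d, s2)][0] <= u1 for d in strategies)
    best2 = all(payoff[(s1, d)][1] <= u2 for d in strategies)
    return best1 and best2

equilibria = [p for p in product(strategies, repeat=2) if is_nash(*p)]
print(equilibria)   # [('defect', 'defect')] -- mutual defection is the only equilibrium
```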
6.3 Mechanism Design
Mechanism Design is the "inverse problem" of game theory: given a desired social outcome, how should the rules (mechanism) be designed so that the equilibrium behavior of rational participants produces exactly that outcome?
Typical applications include auction design (e.g., the truthful bidding incentive in Vickrey auctions) and voting mechanisms.
Cross-References
Connections to Other Notes on This Site
- MDPs and Reinforcement Learning: The reinforcement learning notes on this site provide detailed coverage of MDP solution methods (value iteration, policy iteration), Q-learning, policy gradients, Actor-Critic, and more. This document offers only a conceptual introduction.
- Probabilistic Reasoning Foundations: The decision networks and POMDPs in this document build on the Bayesian networks and HMMs introduced in the Probabilistic Reasoning notes.
- Adversarial Search: The Adversarial Search notes discuss two-player zero-sum games from a search perspective (Minimax, Alpha-Beta pruning), complementing the game-theoretic perspective presented here.
References
- Russell & Norvig, Artificial Intelligence: A Modern Approach, Chapters 16–17
- Kaelbling, Littman & Cassandra, "Planning and Acting in Partially Observable Stochastic Domains", Artificial Intelligence, 1998
- Myerson, Game Theory: Analysis of Conflict