Probabilistic Reasoning (AIMA Part IV)
These notes correspond to AIMA (Artificial Intelligence: A Modern Approach) Chapters 12–14, covering quantifying uncertainty, probabilistic reasoning, and probabilistic reasoning over time.
1. Overview of Reasoning Under Uncertainty
1.1 Why Logic Is Not Enough
In previous notes, we studied propositional logic and first-order logic. Logical reasoning is deterministic: a proposition is either true or false. However, real-world agents face several challenges:
- Incomplete information: An agent cannot observe the full state of the world. For example, a doctor cannot directly see a patient's internal pathology.
- Sensor noise: Observations themselves are subject to error.
- Nondeterministic effects: The same action may produce different outcomes under different conditions.
- Model simplification: Even the most detailed model cannot fully capture the real world.
Within a logical framework, handling these uncertainties requires enumerating all possible exceptional conditions (the qualification problem), which is infeasible in practice. We therefore turn to probability theory as a formal tool for representing degrees of belief.
1.2 Probability as Degree of Belief
Probability theory requires that a rational agent's degrees of belief satisfy the Kolmogorov axioms:
- \(0 \le P(A) \le 1\)
- \(P(\text{True}) = 1,\ P(\text{False}) = 0\)
- \(P(A \lor B) = P(A) + P(B) - P(A \land B)\)
AIMA points out that if an agent's beliefs violate these axioms, one can construct a set of bets that guarantee a sure loss (the Dutch Book theorem). This provides a normative argument for using probability to quantify uncertainty.
2. Bayes' Rule and Bayesian Reasoning
2.1 Conditional Probability and Bayes' Rule
Conditional probability is defined as:
\[P(A \mid B) = \frac{P(A \land B)}{P(B)}, \quad P(B) > 0\]
From this, we directly derive Bayes' Rule:
\[P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}\]
| Term | Meaning | Description |
|---|---|---|
| \(P(A)\) | Prior | Belief in \(A\) before observing evidence \(B\) |
| \(P(B \mid A)\) | Likelihood | Probability of observing \(B\) given that \(A\) is true |
| \(P(A \mid B)\) | Posterior | Updated belief in \(A\) after observing evidence \(B\) |
| \(P(B)\) | Marginal likelihood (Evidence) | Normalizing constant, \(P(B) = \sum_a P(B \mid A=a) P(A=a)\) |
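The four terms in the table can be made concrete with a worked example. The numbers below are hypothetical: a disease with 1% prevalence, a test with 90% sensitivity and a 5% false-positive rate.

```python
# Worked Bayes' rule example (hypothetical numbers).
prior = 0.01          # P(Disease): the prior
sensitivity = 0.90    # P(Positive | Disease): the likelihood
false_pos = 0.05      # P(Positive | No Disease)

# Marginal likelihood: P(Positive), summed over both hypotheses
evidence = sensitivity * prior + false_pos * (1 - prior)

# Posterior: P(Disease | Positive)
posterior = sensitivity * prior / evidence
print(round(posterior, 4))  # ≈ 0.1538
```

Despite the positive test, the posterior is only about 15%, because the prior is so low: most positives come from the large healthy population.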
2.2 Naive Bayes
When multiple pieces of evidence \(E_1, E_2, \dots, E_n\) are conditionally independent given hypothesis \(H\):
\[P(H \mid E_1, \dots, E_n) \propto P(H) \prod_{i=1}^{n} P(E_i \mid H)\]
This is the Naive Bayes model, widely applied in spam filtering, text classification, and other tasks.
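A minimal sketch of Naive Bayes for spam filtering, with made-up priors and per-word likelihoods (all numbers are hypothetical, chosen only to illustrate the computation):

```python
from math import prod

# P(H | E1..En) ∝ P(H) * Π P(Ei | H); normalize over the two classes.
prior = {"spam": 0.3, "ham": 0.7}
likelihood = {  # P(word | class), hypothetical values
    "spam": {"free": 0.30, "meeting": 0.02, "winner": 0.20},
    "ham":  {"free": 0.03, "meeting": 0.15, "winner": 0.01},
}

def posterior(words):
    score = {c: prior[c] * prod(likelihood[c][w] for w in words) for c in prior}
    z = sum(score.values())  # normalizing constant P(E)
    return {c: s / z for c, s in score.items()}

print(posterior(["free", "winner"]))  # "spam" dominates
```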
3. Bayesian Networks
3.1 Definition
A Bayesian Network is a directed acyclic graph (DAG) in which:
- Each node represents a random variable
- Each directed edge represents a direct probabilistic dependency
- Each node is associated with a conditional probability table (CPT)
The joint probability distribution factorizes as:
\[P(X_1, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid \text{Parents}(X_i))\]
Classic Example: The Alarm Network
Consider the following scenario: an earthquake or a burglary can trigger an alarm, and when the alarm sounds, neighbors John and Mary may call you.
[Burglary] [Earthquake]
\ /
v v
[Alarm]
/ \
v v
[JohnCalls] [MaryCalls]
This network requires only 10 independent parameters (compared to \(2^5 - 1 = 31\) for the full joint probability table).
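The factorization makes any full joint assignment cheap to evaluate from the CPTs alone. A sketch using the CPT values from the textbook's alarm example:

```python
# Joint probability of one full assignment in the Alarm network:
# P(B,E,A,J,M) = P(B) P(E) P(A|B,E) P(J|A) P(M|A)
P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}  # P(A=true | B, E)
P_J = {True: 0.90, False: 0.05}  # P(J=true | A)
P_M = {True: 0.70, False: 0.01}  # P(M=true | A)

def joint(b, e, a, j, m):
    p = P_B[b] * P_E[e]
    p *= P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p *= P_J[a] if j else 1 - P_J[a]
    p *= P_M[a] if m else 1 - P_M[a]
    return p

# Both neighbors call, alarm on, but no burglary or earthquake:
print(joint(False, False, True, True, True))  # ≈ 0.000628
```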
3.2 Conditional Independence
Bayesian networks encode rich conditional independence relationships. The core property is the Markov condition:
Given its parents, each node is conditionally independent of all its non-descendants.
Written as \(X \perp\!\!\!\perp Y \mid Z\), meaning \(P(X \mid Y, Z) = P(X \mid Z)\).
3.3 d-Separation
d-separation is a graphical criterion for determining conditional independence in Bayesian networks. The three fundamental connection structures are:
| Structure | Form | Middle node not observed | Middle node observed |
|---|---|---|---|
| Chain | \(X \to Z \to Y\) | \(X, Y\) dependent | \(X \perp\!\!\!\perp Y \mid Z\) |
| Fork | \(X \leftarrow Z \to Y\) | \(X, Y\) dependent | \(X \perp\!\!\!\perp Y \mid Z\) |
| Collider | \(X \to Z \leftarrow Y\) | \(X \perp\!\!\!\perp Y\) | \(X, Y\) become dependent |
The Counterintuitive Nature of Colliders
In a collider structure, observing the middle node actually introduces a dependency. This is known as the "explaining away" effect. For example, talent and attractiveness are independent, but if we only observe successful people (Talent → Success ← Attractiveness), talent and attractiveness become negatively correlated among the successful population.
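The explaining-away effect can be checked by simulation. The sketch below uses a toy assumption that "success" requires talent or attractiveness; conditioning on the collider then induces a negative correlation between two independently sampled traits:

```python
import random

random.seed(0)
# Talent and Attractiveness sampled independently (Bernoulli 0.5 each)
samples = [(random.random() < 0.5, random.random() < 0.5) for _ in range(100_000)]
# Condition on the collider: keep only "successful" people (toy model: T or A)
successful = [(t, a) for t, a in samples if t or a]

def corr(pairs):
    n = len(pairs)
    mt = sum(t for t, _ in pairs) / n
    ma = sum(a for _, a in pairs) / n
    cov = sum((t - mt) * (a - ma) for t, a in pairs) / n
    return cov / (mt * (1 - mt) * ma * (1 - ma)) ** 0.5

print(corr(samples))      # ≈ 0: independent in the full population
print(corr(successful))   # ≈ -0.5: dependent once we condition on Success
```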
4. Inference in Bayesian Networks
4.1 Exact Inference
Variable Elimination is the most fundamental exact inference algorithm:
| Step | Operation |
|---|---|
| 1 | Identify query variable \(Q\) and evidence variables \(E\) |
| 2 | Choose an elimination ordering (affects efficiency; finding the optimal ordering is NP-hard) |
| 3 | For each hidden variable to be eliminated, combine relevant factors and sum out |
| 4 | Normalize to obtain the posterior probability |
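The sum-out operation of step 3 can be sketched on a tiny chain \(X \to Y \to Z\) with hypothetical CPTs, computing \(P(Z)\) by eliminating \(X\) and then \(Y\):

```python
# Variable elimination on X -> Y -> Z (all CPT values hypothetical).
P_X = {0: 0.6, 1: 0.4}
P_Y_given_X = {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.8}
P_Z_given_Y = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.4, (1, 1): 0.6}

# Eliminate X: f1(y) = sum_x P(x) P(y | x)
f1 = {y: sum(P_X[x] * P_Y_given_X[(x, y)] for x in (0, 1)) for y in (0, 1)}
# Eliminate Y: f2(z) = sum_y f1(y) P(z | y)
f2 = {z: sum(f1[y] * P_Z_given_Y[(y, z)] for y in (0, 1)) for z in (0, 1)}

print(f2)  # P(Z); the two entries sum to 1
```

Each elimination multiplies the factors that mention the variable and sums it out, so the intermediate factors stay small when the ordering is good.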
Junction Tree (Clique Tree) Algorithm: Converts the Bayesian network into a junction tree structure and performs exact inference via message passing. For tree-structured networks, Belief Propagation can perform inference in linear time.
4.2 Approximate Inference
When the network is too large for exact inference, approximate methods are used:
Direct Sampling: Samples each node in topological order, producing samples from the joint distribution; rejection sampling extends this by discarding samples inconsistent with the evidence.
Likelihood Weighting:
- Fixes the values of evidence variables and samples only non-evidence variables
- Each sample is assigned a weight \(w = \prod_{E_i \in \mathbf{E}} P(e_i \mid \text{Parents}(E_i))\), the product of the likelihoods of the observed evidence values
- More efficient than rejection sampling
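A sketch of likelihood weighting on a minimal two-node network Rain → WetGrass (CPT values are hypothetical). The evidence WetGrass = true is clamped rather than sampled, and each sample carries the likelihood of that evidence as its weight:

```python
import random

random.seed(1)
P_RAIN = 0.2
P_WET = {True: 0.9, False: 0.2}  # P(WetGrass=true | Rain)

def estimate_p_rain_given_wet(n=100_000):
    num = den = 0.0
    for _ in range(n):
        rain = random.random() < P_RAIN   # sample the non-evidence variable
        w = P_WET[rain]                   # weight = P(WetGrass=true | Rain)
        num += w * rain
        den += w
    return num / den                      # weighted estimate of P(Rain | wet)

print(estimate_p_rain_given_wet())  # exact answer: 0.18 / 0.34 ≈ 0.529
```

No sample is ever rejected, which is why this beats rejection sampling when the evidence is unlikely (though the weights then become highly variable, as noted in the table below).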
MCMC (Markov Chain Monte Carlo):
- Gibbs Sampling: Samples each variable in turn from its conditional distribution
- \(X_i^{(t+1)} \sim P(X_i \mid X_1^{(t+1)}, \dots, X_{i-1}^{(t+1)}, X_{i+1}^{(t)}, \dots, X_n^{(t)})\)
- In a Bayesian network, each variable's conditional distribution depends only on its Markov blanket: its parents, children, and children's other parents
| Method | Advantage | Disadvantage |
|---|---|---|
| Variable Elimination | Exact | Worst-case exponential complexity |
| Likelihood Weighting | Simple to implement; biased but consistent | High variance when evidence probability is very low |
| Gibbs Sampling | Asymptotically exact | Convergence rate uncertain |
5. Temporal Probabilistic Models
5.1 Markov Chains
A Markov chain satisfies the first-order Markov property:
\[P(X_{t+1} \mid X_1, \dots, X_t) = P(X_{t+1} \mid X_t)\]
That is, the future state depends only on the current state, not on the history. State transitions are described by a transition matrix \(T\), where \(T_{ij} = P(X_{t+1} = j \mid X_t = i)\).
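Repeatedly applying the transition matrix evolves the state distribution; for a well-behaved chain it converges to a stationary distribution. A sketch with a hypothetical two-state weather chain:

```python
# Row-stochastic transition matrix: T[i][j] = P(X_{t+1}=j | X_t=i).
T = [[0.9, 0.1],   # sunny -> {sunny, rainy}
     [0.5, 0.5]]   # rainy -> {sunny, rainy}

def step(dist, T):
    n = len(dist)
    return [sum(dist[i] * T[i][j] for i in range(n)) for j in range(n)]

dist = [1.0, 0.0]          # start: certainly sunny
for _ in range(50):
    dist = step(dist, T)
print(dist)  # converges to the stationary distribution [5/6, 1/6]
```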
5.2 Hidden Markov Models (HMMs)
A Hidden Markov Model extends a Markov chain by introducing hidden (unobservable) states:
- Hidden state sequence \(Z_1, Z_2, \dots, Z_T\): satisfies first-order Markov property
- Observation sequence \(X_1, X_2, \dots, X_T\): each observation depends only on the current hidden state
Parameters \(\lambda = (\pi, A, B)\):
| Parameter | Meaning |
|---|---|
| \(\pi_i = P(Z_1 = i)\) | Initial state distribution |
| \(A_{ij} = P(Z_{t+1} = j \mid Z_t = i)\) | State transition probability |
| \(B_{ik} = P(X_t = k \mid Z_t = i)\) | Emission probability (observation probability) |
5.3 The Three Core HMM Problems
| Problem | Description | Algorithm | Complexity |
|---|---|---|---|
| Evaluation | Given model and observations, compute \(P(X \mid \lambda)\) | Forward algorithm | \(O(N^2 T)\) |
| Decoding | Given model and observations, find most likely hidden state sequence | Viterbi algorithm | \(O(N^2 T)\) |
| Learning | Given observations, estimate model parameters | Baum-Welch (EM) | Iterative convergence |
Forward Algorithm: Define the forward variable \(\alpha_t(i) = P(X_1, \dots, X_t, Z_t = i \mid \lambda)\), computed by the recursion \(\alpha_{t+1}(j) = \left[\sum_i \alpha_t(i) A_{ij}\right] B_{j,x_{t+1}}\); the evaluation answer is \(P(X \mid \lambda) = \sum_i \alpha_T(i)\).
Viterbi Algorithm: Uses dynamic programming, defining \(\delta_t(j) = \max_{z_1, \dots, z_{t-1}} P(z_1, \dots, z_{t-1}, Z_t = j, X_1, \dots, X_t)\)
The optimal path is recovered via backpointers.
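A compact Viterbi sketch for a two-state HMM; all parameters \(\pi, A, B\) are hypothetical, and states and observations are indexed 0 and 1:

```python
pi = [0.6, 0.4]                      # initial distribution
A = [[0.7, 0.3], [0.4, 0.6]]         # A[i][j] = P(Z_{t+1}=j | Z_t=i)
B = [[0.5, 0.5], [0.1, 0.9]]         # B[i][k] = P(X_t=k | Z_t=i)

def viterbi(obs):
    n = len(pi)
    delta = [pi[j] * B[j][obs[0]] for j in range(n)]  # delta_1(j)
    back = []
    for x in obs[1:]:
        prev, delta, ptr = delta, [], []
        for j in range(n):
            # Best predecessor state for landing in j at this step
            best_i = max(range(n), key=lambda i: prev[i] * A[i][j])
            delta.append(prev[best_i] * A[best_i][j] * B[j][x])
            ptr.append(best_i)
        back.append(ptr)
    # Recover the optimal path via backpointers
    path = [max(range(n), key=lambda j: delta[j])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi([0, 0, 1, 1]))  # → [0, 0, 1, 1]
```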
Baum-Welch Algorithm: A specialization of the EM algorithm for HMMs. The E-step computes forward-backward variables to obtain the posterior distribution over hidden states; the M-step updates parameters \(\pi, A, B\). Each iteration is guaranteed to monotonically increase the likelihood.
6. Kalman Filtering
The Kalman Filter is a method for temporal reasoning over continuous state spaces, assuming a linear Gaussian model:
State transition model (linear + Gaussian noise):
\[\mathbf{x}_t = F\mathbf{x}_{t-1} + B\mathbf{u}_t + \mathbf{w}_t, \quad \mathbf{w}_t \sim \mathcal{N}(\mathbf{0}, Q)\]
Observation model (linear + Gaussian noise):
\[\mathbf{z}_t = H\mathbf{x}_t + \mathbf{v}_t, \quad \mathbf{v}_t \sim \mathcal{N}(\mathbf{0}, R)\]
Kalman filtering alternates between two steps:
| Step | Operation | Description |
|---|---|---|
| Predict | \(\hat{\mathbf{x}}_{t\mid t-1} = F\hat{\mathbf{x}}_{t-1\mid t-1} + B\mathbf{u}_t\) | Predict next state from the transition model |
| Predict | \(P_{t\mid t-1} = FP_{t-1\mid t-1}F^T + Q\) | Predict the state covariance |
| Update | \(K_t = P_{t\mid t-1}H^T(HP_{t\mid t-1}H^T + R)^{-1}\) | Compute the Kalman gain |
| Update | \(\hat{\mathbf{x}}_{t\mid t} = \hat{\mathbf{x}}_{t\mid t-1} + K_t(\mathbf{z}_t - H\hat{\mathbf{x}}_{t\mid t-1})\) | Fuse the observation to update the state estimate |
| Update | \(P_{t\mid t} = (I - K_t H)P_{t\mid t-1}\) | Update the state covariance |
Kalman filtering is widely used in target tracking, navigation (GPS/INS fusion), and robot localization. For nonlinear systems, the Extended Kalman Filter (EKF) or the Unscented Kalman Filter (UKF) can be used.
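In one dimension all matrices become scalars, and the predict/update cycle fits in a few lines. A sketch tracking a constant value from noisy measurements (\(F = H = 1\), no control input; the noise variances `q`, `r` and the data are hypothetical):

```python
def kalman_1d(measurements, x0=0.0, p0=1.0, q=1e-4, r=0.25):
    x, p = x0, p0
    estimates = []
    for z in measurements:
        # Predict (F = 1: state assumed roughly constant)
        p = p + q
        # Update
        k = p / (p + r)            # Kalman gain
        x = x + k * (z - x)        # fuse measurement into the estimate
        p = (1 - k) * p            # updated covariance
        estimates.append(x)
    return estimates

noisy = [1.2, 0.8, 1.1, 0.9, 1.05, 0.95]
print(kalman_1d(noisy)[-1])  # settles near the true value ~1.0
```

Note how the gain `k` balances trust in the prediction against trust in the measurement: large `p` (uncertain prediction) pushes `k` toward 1, large `r` (noisy sensor) pushes it toward 0.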
7. Dynamic Bayesian Networks (DBNs)
A Dynamic Bayesian Network (DBN) is the temporal extension of Bayesian networks. Both HMMs and Kalman filters can be viewed as special cases of DBNs:
| Model | Hidden State | Observation | Transition/Observation Model |
|---|---|---|---|
| HMM | Single discrete variable | Single discrete/continuous variable | Tabular (CPT) |
| Kalman Filter | Continuous vector | Continuous vector | Linear Gaussian |
| General DBN | Arbitrary set of variables | Arbitrary set of variables | Arbitrary CPTs/functions |
Inference methods for DBNs include:
- Exact inference: Unroll into a static Bayesian network and apply variable elimination (feasible only for small-scale problems)
- Approximate inference: Particle filtering (sequential Monte Carlo), which represents the posterior distribution using a set of weighted particles
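A bootstrap particle filter sketch for a one-dimensional random-walk state with Gaussian observation noise (all parameters hypothetical): propagate particles through the transition model, weight them by the observation likelihood, then resample.

```python
import math
import random

random.seed(2)
N = 2000
PROC_STD, OBS_STD = 0.1, 0.5  # process / observation noise (hypothetical)

def gauss_pdf(x, mu, std):
    return math.exp(-0.5 * ((x - mu) / std) ** 2)  # unnormalized is fine

def particle_filter(observations):
    particles = [random.gauss(0.0, 1.0) for _ in range(N)]  # prior samples
    for z in observations:
        # Propagate through the transition model (random walk)
        particles = [p + random.gauss(0.0, PROC_STD) for p in particles]
        # Weight each particle by the observation likelihood
        weights = [gauss_pdf(z, p, OBS_STD) for p in particles]
        # Resample in proportion to the weights
        particles = random.choices(particles, weights=weights, k=N)
    return sum(particles) / N  # posterior mean estimate

print(particle_filter([1.0, 1.1, 0.9, 1.0]))  # near the true state ~1.0
```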
Cross-References
Connections to Other Notes on This Site
- The Probabilistic Reasoning notes on this site contain deeper coverage of probabilistic graphical models, including Markov Random Fields (MRFs), Conditional Random Fields (CRFs), variational inference, probabilistic programming, and causal inference.
- Variational Autoencoders (VAEs) and diffusion models in deep learning represent cutting-edge directions that combine probabilistic reasoning with neural networks.
- The POMDP (Partially Observable Markov Decision Process) discussed in the reinforcement learning notes builds directly on the foundation of HMMs.
References
- Russell & Norvig, Artificial Intelligence: A Modern Approach, Chapters 12–14
- Koller & Friedman, Probabilistic Graphical Models: Principles and Techniques
- Bishop, Pattern Recognition and Machine Learning, Chapters 8, 13