Probabilistic Reasoning (AIMA Part IV)
These notes correspond to AIMA (Artificial Intelligence: A Modern Approach) Chapters 12–14, covering quantifying uncertainty, probabilistic reasoning, and probabilistic reasoning over time.
1. Overview of Reasoning Under Uncertainty
1.1 Why Logic Is Not Enough
In previous notes, we studied propositional logic and first-order logic. Logical reasoning is deterministic: a proposition is either true or false. However, real-world agents face several challenges:
- Incomplete information: An agent cannot observe the full state of the world. For example, a doctor cannot directly see a patient's internal pathology.
- Sensor noise: Observations themselves are subject to error.
- Nondeterministic effects: The same action may produce different outcomes under different conditions.
- Model simplification: Even the most detailed model cannot fully capture the real world.
Within a logical framework, handling these uncertainties requires enumerating all possible exceptional conditions (the qualification problem), which is infeasible in practice. We therefore turn to probability theory as a formal tool for representing degrees of belief.
1.2 Probability as Degree of Belief
Probability theory requires that a rational agent's degrees of belief satisfy the Kolmogorov axioms:
- \(0 \le P(A) \le 1\)
- \(P(\text{True}) = 1,\ P(\text{False}) = 0\)
- \(P(A \lor B) = P(A) + P(B) - P(A \land B)\)
AIMA points out that if an agent's beliefs violate these axioms, one can construct a set of bets that guarantee a sure loss (the Dutch Book theorem). This provides a normative argument for using probability to quantify uncertainty.
2. Bayes' Rule and Bayesian Reasoning
2.1 Conditional Probability and Bayes' Rule
Conditional probability is defined as:
\[P(A \mid B) = \frac{P(A \land B)}{P(B)}, \quad P(B) > 0\]
From this, we directly derive Bayes' Rule:
\[P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}\]
| Term | Meaning | Description |
|---|---|---|
| \(P(A)\) | Prior | Belief in \(A\) before observing evidence \(B\) |
| \(P(B \mid A)\) | Likelihood | Probability of observing \(B\) given that \(A\) is true |
| \(P(A \mid B)\) | Posterior | Updated belief in \(A\) after observing evidence \(B\) |
| \(P(B)\) | Marginal likelihood (Evidence) | Normalizing constant, \(P(B) = \sum_a P(B \mid A=a) P(A=a)\) |
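The four terms in the table can be made concrete with a worked example. The numbers below are hypothetical: a disease with 1% prevalence, a test with 90% sensitivity and a 5% false-positive rate.

```python
# Worked Bayes' rule example (hypothetical numbers).
prior = 0.01          # P(Disease): the prior
sensitivity = 0.90    # P(Positive | Disease): the likelihood
false_pos = 0.05      # P(Positive | No Disease)

# Marginal likelihood: P(Positive), summed over both hypotheses
evidence = sensitivity * prior + false_pos * (1 - prior)

# Posterior: P(Disease | Positive)
posterior = sensitivity * prior / evidence
print(round(posterior, 4))  # ≈ 0.1538
```

Despite the positive test, the posterior is only about 15%, because the prior is so low: most positives come from the large healthy population.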
2.2 Naive Bayes
When multiple pieces of evidence \(E_1, E_2, \dots, E_n\) are conditionally independent given hypothesis \(H\):
\[P(H \mid E_1, \dots, E_n) \propto P(H) \prod_{i=1}^{n} P(E_i \mid H)\]
This is the Naive Bayes model, widely applied in spam filtering, text classification, and other tasks.
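A minimal sketch of Naive Bayes for spam filtering, with made-up priors and per-word likelihoods (all numbers are hypothetical, chosen only to illustrate the computation):

```python
from math import prod

# P(H | E1..En) ∝ P(H) * Π P(Ei | H); normalize over the two classes.
prior = {"spam": 0.3, "ham": 0.7}
likelihood = {  # P(word | class), hypothetical values
    "spam": {"free": 0.30, "meeting": 0.02, "winner": 0.20},
    "ham":  {"free": 0.03, "meeting": 0.15, "winner": 0.01},
}

def posterior(words):
    score = {c: prior[c] * prod(likelihood[c][w] for w in words) for c in prior}
    z = sum(score.values())  # normalizing constant P(E)
    return {c: s / z for c, s in score.items()}

print(posterior(["free", "winner"]))  # "spam" dominates
```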
3. Bayesian Networks
3.1 Definition
A Bayesian Network is a directed acyclic graph (DAG) in which:
- Each node represents a random variable
- Each directed edge represents a direct probabilistic dependency
- Each node is associated with a conditional probability table (CPT)
The joint probability distribution factorizes as:
\[P(X_1, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid \text{Parents}(X_i))\]
Classic Example: The Alarm Network
Consider the following scenario: an earthquake or a burglary can trigger an alarm, and when the alarm sounds, neighbors John and Mary may call you.
[Burglary] [Earthquake]
\ /
v v
[Alarm]
/ \
v v
[JohnCalls] [MaryCalls]
This network requires only 10 independent parameters (compared to \(2^5 - 1 = 31\) for the full joint probability table).
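The factorization makes any full joint assignment cheap to evaluate from the CPTs alone. A sketch using the CPT values from the textbook's alarm example:

```python
# Joint probability of one full assignment in the Alarm network:
# P(B,E,A,J,M) = P(B) P(E) P(A|B,E) P(J|A) P(M|A)
P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}  # P(A=true | B, E)
P_J = {True: 0.90, False: 0.05}  # P(J=true | A)
P_M = {True: 0.70, False: 0.01}  # P(M=true | A)

def joint(b, e, a, j, m):
    p = P_B[b] * P_E[e]
    p *= P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p *= P_J[a] if j else 1 - P_J[a]
    p *= P_M[a] if m else 1 - P_M[a]
    return p

# Both neighbors call, alarm on, but no burglary or earthquake:
print(joint(False, False, True, True, True))  # ≈ 0.000628
```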
3.2 Conditional Independence
Bayesian networks encode rich conditional independence relationships. The core property is the Markov condition:
Given its parents, each node is conditionally independent of all its non-descendants.
Written as \(X \perp\!\!\!\perp Y \mid Z\), meaning \(P(X \mid Y, Z) = P(X \mid Z)\).
3.3 d-Separation
d-separation is a graphical criterion for determining conditional independence in Bayesian networks. The three fundamental connection structures are:
| Structure | Form | Middle node not observed | Middle node observed |
|---|---|---|---|
| Chain | \(X \to Z \to Y\) | \(X, Y\) dependent | \(X \perp\!\!\!\perp Y \mid Z\) |
| Fork | \(X \leftarrow Z \to Y\) | \(X, Y\) dependent | \(X \perp\!\!\!\perp Y \mid Z\) |
| Collider | \(X \to Z \leftarrow Y\) | \(X \perp\!\!\!\perp Y\) | \(X, Y\) become dependent |
The Counterintuitive Nature of Colliders
In a collider structure, observing the middle node actually introduces a dependency. This is known as the "explaining away" effect. For example, talent and attractiveness are independent, but if we only observe successful people (Talent → Success ← Attractiveness), talent and attractiveness become negatively correlated among the successful population.
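The explaining-away effect can be checked by simulation. The sketch below uses a toy assumption that "success" requires talent or attractiveness; conditioning on the collider then induces a negative correlation between two independently sampled traits:

```python
import random

random.seed(0)
# Talent and Attractiveness sampled independently (Bernoulli 0.5 each)
samples = [(random.random() < 0.5, random.random() < 0.5) for _ in range(100_000)]
# Condition on the collider: keep only "successful" people (toy model: T or A)
successful = [(t, a) for t, a in samples if t or a]

def corr(pairs):
    n = len(pairs)
    mt = sum(t for t, _ in pairs) / n
    ma = sum(a for _, a in pairs) / n
    cov = sum((t - mt) * (a - ma) for t, a in pairs) / n
    return cov / (mt * (1 - mt) * ma * (1 - ma)) ** 0.5

print(corr(samples))      # ≈ 0: independent in the full population
print(corr(successful))   # ≈ -0.5: dependent once we condition on Success
```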
4. Inference in Bayesian Networks
4.1 Exact Inference
Variable Elimination is the most fundamental exact inference algorithm:
| Step | Operation |
|---|---|
| 1 | Identify query variable \(Q\) and evidence variables \(E\) |
| 2 | Choose an elimination ordering (affects efficiency; finding the optimal ordering is NP-hard) |
| 3 | For each hidden variable to be eliminated, combine relevant factors and sum out |
| 4 | Normalize to obtain the posterior probability |
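The sum-out operation of step 3 can be sketched on a tiny chain \(X \to Y \to Z\) with hypothetical CPTs, computing \(P(Z)\) by eliminating \(X\) and then \(Y\):

```python
# Variable elimination on X -> Y -> Z (all CPT values hypothetical).
P_X = {0: 0.6, 1: 0.4}
P_Y_given_X = {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.8}
P_Z_given_Y = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.4, (1, 1): 0.6}

# Eliminate X: f1(y) = sum_x P(x) P(y | x)
f1 = {y: sum(P_X[x] * P_Y_given_X[(x, y)] for x in (0, 1)) for y in (0, 1)}
# Eliminate Y: f2(z) = sum_y f1(y) P(z | y)
f2 = {z: sum(f1[y] * P_Z_given_Y[(y, z)] for y in (0, 1)) for z in (0, 1)}

print(f2)  # P(Z); the two entries sum to 1
```

Each elimination multiplies the factors that mention the variable and sums it out, so the intermediate factors stay small when the ordering is good.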
Junction Tree (Clique Tree) Algorithm: Converts the Bayesian network into a junction tree structure and performs exact inference via message passing. For tree-structured networks, Belief Propagation can perform inference in linear time.
4.2 Approximate Inference
When the network is too large for exact inference, approximate methods are used:
Direct Sampling: Samples each node in topological order, producing samples from the joint distribution; rejection sampling extends this by discarding samples inconsistent with the evidence.
Likelihood Weighting:
- Fixes the values of evidence variables and samples only non-evidence variables
- Each sample is assigned a weight \(w = \prod_{E_i \in \mathbf{E}} P(e_i \mid \text{Parents}(E_i))\), the product of the likelihoods of the observed evidence values
- More efficient than rejection sampling
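A sketch of likelihood weighting on a minimal two-node network Rain → WetGrass (CPT values are hypothetical). The evidence WetGrass = true is clamped rather than sampled, and each sample carries the likelihood of that evidence as its weight:

```python
import random

random.seed(1)
P_RAIN = 0.2
P_WET = {True: 0.9, False: 0.2}  # P(WetGrass=true | Rain)

def estimate_p_rain_given_wet(n=100_000):
    num = den = 0.0
    for _ in range(n):
        rain = random.random() < P_RAIN   # sample the non-evidence variable
        w = P_WET[rain]                   # weight = P(WetGrass=true | Rain)
        num += w * rain
        den += w
    return num / den                      # weighted estimate of P(Rain | wet)

print(estimate_p_rain_given_wet())  # exact answer: 0.18 / 0.34 ≈ 0.529
```

No sample is ever rejected, which is why this beats rejection sampling when the evidence is unlikely (though the weights then become highly variable, as noted in the table below).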
MCMC (Markov Chain Monte Carlo):
- Gibbs Sampling: Samples each variable in turn from its conditional distribution
- \(X_i^{(t+1)} \sim P(X_i \mid X_1^{(t+1)}, \dots, X_{i-1}^{(t+1)}, X_{i+1}^{(t)}, \dots, X_n^{(t)})\)
- In a Bayesian network, each variable's conditional distribution depends only on its Markov blanket: its parents, children, and children's other parents
| Method | Advantage | Disadvantage |
|---|---|---|
| Variable Elimination | Exact | Worst-case exponential complexity |
| Likelihood Weighting | Simple to implement; biased but consistent | High variance when evidence probability is very low |
| Gibbs Sampling | Asymptotically exact | Convergence rate uncertain |
5. Temporal Probabilistic Models
5.1 Markov Chains
A Markov chain satisfies the first-order Markov property:
\[P(X_{t+1} \mid X_1, \dots, X_t) = P(X_{t+1} \mid X_t)\]
That is, the future state depends only on the current state, not on the history. State transitions are described by a transition matrix \(T\), where \(T_{ij} = P(X_{t+1} = j \mid X_t = i)\).
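Repeatedly applying the transition matrix evolves the state distribution; for a well-behaved chain it converges to a stationary distribution. A sketch with a hypothetical two-state weather chain:

```python
# Row-stochastic transition matrix: T[i][j] = P(X_{t+1}=j | X_t=i).
T = [[0.9, 0.1],   # sunny -> {sunny, rainy}
     [0.5, 0.5]]   # rainy -> {sunny, rainy}

def step(dist, T):
    n = len(dist)
    return [sum(dist[i] * T[i][j] for i in range(n)) for j in range(n)]

dist = [1.0, 0.0]          # start: certainly sunny
for _ in range(50):
    dist = step(dist, T)
print(dist)  # converges to the stationary distribution [5/6, 1/6]
```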
5.2 Hidden Markov Models (HMMs)
A Hidden Markov Model extends a Markov chain by introducing hidden (unobservable) states:
- Hidden state sequence \(Z_1, Z_2, \dots, Z_T\): satisfies first-order Markov property
- Observation sequence \(X_1, X_2, \dots, X_T\): each observation depends only on the current hidden state
Parameters \(\lambda = (\pi, A, B)\):
| Parameter | Meaning |
|---|---|
| \(\pi_i = P(Z_1 = i)\) | Initial state distribution |
| \(A_{ij} = P(Z_{t+1} = j \mid Z_t = i)\) | State transition probability |
| \(B_{ik} = P(X_t = k \mid Z_t = i)\) | Emission probability (observation probability) |
5.3 The Three Core HMM Problems
| Problem | Description | Algorithm | Complexity |
|---|---|---|---|
| Evaluation | Given model and observations, compute \(P(X \mid \lambda)\) | Forward algorithm | \(O(N^2 T)\) |
| Decoding | Given model and observations, find most likely hidden state sequence | Viterbi algorithm | \(O(N^2 T)\) |
| Learning | Given observations, estimate model parameters | Baum-Welch (EM) | Iterative convergence |
Forward Algorithm: Define the forward variable \(\alpha_t(i) = P(X_1, \dots, X_t, Z_t = i \mid \lambda)\), computed by the recursion \(\alpha_{t+1}(j) = \left[\sum_i \alpha_t(i) A_{ij}\right] B_{j,x_{t+1}}\); the evaluation answer is \(P(X \mid \lambda) = \sum_i \alpha_T(i)\).
Viterbi Algorithm: Uses dynamic programming, defining \(\delta_t(j) = \max_{z_1, \dots, z_{t-1}} P(z_1, \dots, z_{t-1}, Z_t = j, X_1, \dots, X_t)\)
The optimal path is recovered via backpointers.
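A compact Viterbi sketch for a two-state HMM; all parameters \(\pi, A, B\) are hypothetical, and states and observations are indexed 0 and 1:

```python
pi = [0.6, 0.4]                      # initial distribution
A = [[0.7, 0.3], [0.4, 0.6]]         # A[i][j] = P(Z_{t+1}=j | Z_t=i)
B = [[0.5, 0.5], [0.1, 0.9]]         # B[i][k] = P(X_t=k | Z_t=i)

def viterbi(obs):
    n = len(pi)
    delta = [pi[j] * B[j][obs[0]] for j in range(n)]  # delta_1(j)
    back = []
    for x in obs[1:]:
        prev, delta, ptr = delta, [], []
        for j in range(n):
            # Best predecessor state for landing in j at this step
            best_i = max(range(n), key=lambda i: prev[i] * A[i][j])
            delta.append(prev[best_i] * A[best_i][j] * B[j][x])
            ptr.append(best_i)
        back.append(ptr)
    # Recover the optimal path via backpointers
    path = [max(range(n), key=lambda j: delta[j])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi([0, 0, 1, 1]))  # → [0, 0, 1, 1]
```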
Baum-Welch Algorithm: A specialization of the EM algorithm for HMMs. The E-step computes forward-backward variables to obtain the posterior distribution over hidden states; the M-step updates parameters \(\pi, A, B\). Each iteration is guaranteed to monotonically increase the likelihood.
6. Kalman Filtering
The Kalman Filter is a method for temporal reasoning over continuous state spaces, assuming a linear Gaussian model:
State transition model (linear + Gaussian noise):
\[\mathbf{x}_t = F\mathbf{x}_{t-1} + B\mathbf{u}_t + \mathbf{w}_t, \quad \mathbf{w}_t \sim \mathcal{N}(\mathbf{0}, Q)\]
Observation model (linear + Gaussian noise):
\[\mathbf{z}_t = H\mathbf{x}_t + \mathbf{v}_t, \quad \mathbf{v}_t \sim \mathcal{N}(\mathbf{0}, R)\]
Kalman filtering alternates between two steps:
| Step | Operation | Description |
|---|---|---|
| Predict | \(\hat{\mathbf{x}}_{t\mid t-1} = F\hat{\mathbf{x}}_{t-1\mid t-1} + B\mathbf{u}_t\) | Predict next state from the transition model |
| Predict | \(P_{t\mid t-1} = FP_{t-1\mid t-1}F^T + Q\) | Predict the state covariance |
| Update | \(K_t = P_{t\mid t-1}H^T(HP_{t\mid t-1}H^T + R)^{-1}\) | Compute the Kalman gain |
| Update | \(\hat{\mathbf{x}}_{t\mid t} = \hat{\mathbf{x}}_{t\mid t-1} + K_t(\mathbf{z}_t - H\hat{\mathbf{x}}_{t\mid t-1})\) | Fuse the observation to update the state estimate |
| Update | \(P_{t\mid t} = (I - K_t H)P_{t\mid t-1}\) | Update the state covariance |
Kalman filtering is widely used in target tracking, navigation (GPS/INS fusion), and robot localization. For nonlinear systems, the Extended Kalman Filter (EKF) or the Unscented Kalman Filter (UKF) can be used.
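In one dimension all matrices become scalars, and the predict/update cycle fits in a few lines. A sketch tracking a constant value from noisy measurements (\(F = H = 1\), no control input; the noise variances `q`, `r` and the data are hypothetical):

```python
def kalman_1d(measurements, x0=0.0, p0=1.0, q=1e-4, r=0.25):
    x, p = x0, p0
    estimates = []
    for z in measurements:
        # Predict (F = 1: state assumed roughly constant)
        p = p + q
        # Update
        k = p / (p + r)            # Kalman gain
        x = x + k * (z - x)        # fuse measurement into the estimate
        p = (1 - k) * p            # updated covariance
        estimates.append(x)
    return estimates

noisy = [1.2, 0.8, 1.1, 0.9, 1.05, 0.95]
print(kalman_1d(noisy)[-1])  # settles near the true value ~1.0
```

Note how the gain `k` balances trust in the prediction against trust in the measurement: large `p` (uncertain prediction) pushes `k` toward 1, large `r` (noisy sensor) pushes it toward 0.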
7. Dynamic Bayesian Networks (DBNs)
A Dynamic Bayesian Network (DBN) is the temporal extension of Bayesian networks. Both HMMs and Kalman filters can be viewed as special cases of DBNs:
| Model | Hidden State | Observation | Transition/Observation Model |
|---|---|---|---|
| HMM | Single discrete variable | Single discrete/continuous variable | Tabular (CPT) |
| Kalman Filter | Continuous vector | Continuous vector | Linear Gaussian |
| General DBN | Arbitrary set of variables | Arbitrary set of variables | Arbitrary CPTs/functions |
Inference methods for DBNs include:
- Exact inference: Unroll into a static Bayesian network and apply variable elimination (feasible only for small-scale problems)
- Approximate inference: Particle filtering (sequential Monte Carlo), which represents the posterior distribution using a set of weighted particles
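A bootstrap particle filter sketch for a one-dimensional random-walk state with Gaussian observation noise (all parameters hypothetical): propagate particles through the transition model, weight them by the observation likelihood, then resample.

```python
import math
import random

random.seed(2)
N = 2000
PROC_STD, OBS_STD = 0.1, 0.5  # process / observation noise (hypothetical)

def gauss_pdf(x, mu, std):
    return math.exp(-0.5 * ((x - mu) / std) ** 2)  # unnormalized is fine

def particle_filter(observations):
    particles = [random.gauss(0.0, 1.0) for _ in range(N)]  # prior samples
    for z in observations:
        # Propagate through the transition model (random walk)
        particles = [p + random.gauss(0.0, PROC_STD) for p in particles]
        # Weight each particle by the observation likelihood
        weights = [gauss_pdf(z, p, OBS_STD) for p in particles]
        # Resample in proportion to the weights
        particles = random.choices(particles, weights=weights, k=N)
    return sum(particles) / N  # posterior mean estimate

print(particle_filter([1.0, 1.1, 0.9, 1.0]))  # near the true state ~1.0
```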
Cross-References
Connections to Other Notes on This Site
- The Probabilistic Reasoning notes on this site contain deeper coverage of probabilistic graphical models, including Markov Random Fields (MRFs), Conditional Random Fields (CRFs), variational inference, probabilistic programming, and causal inference.
- Variational Autoencoders (VAEs) and diffusion models in deep learning represent cutting-edge directions that combine probabilistic reasoning with neural networks.
- The POMDP (Partially Observable Markov Decision Process) discussed in the reinforcement learning notes builds directly on the foundation of HMMs.
References
- Russell & Norvig, Artificial Intelligence: A Modern Approach, Chapters 12–14
- Koller & Friedman, Probabilistic Graphical Models: Principles and Techniques
- Bishop, Pattern Recognition and Machine Learning, Chapters 8, 13