Probabilistic Reasoning
Probabilistic reasoning is the core AI technique for handling uncertainty. In the real world, information is often incomplete, noisy, or ambiguous, and deterministic logical reasoning cannot effectively cope with these situations. Probabilistic reasoning provides a rigorous mathematical framework based on probability theory and graphical models for inference and decision-making under uncertainty.
From Bayesian networks to Hidden Markov Models, from exact inference to variational inference, probabilistic reasoning methods are widely applied in natural language processing, computer vision, medical diagnosis, robotics, and beyond. These notes cover Bayesian networks, Markov random fields, Hidden Markov Models, probabilistic programming, and connections to causal inference.
1. The Necessity of Reasoning Under Uncertainty
| Source of Uncertainty | Example |
|---|---|
| Incomplete observations | A doctor can only observe symptoms, not directly see the underlying disease |
| Sensor noise | A robot's distance sensor has measurement error |
| Model simplification | Weather models cannot capture every microscopic physical process |
| Inherent randomness | Quantum mechanical processes are fundamentally probabilistic |
| Incomplete knowledge | It is impossible to possess all relevant domain knowledge |
Core idea of probabilistic reasoning: represent uncertainty using probability distributions and update beliefs using Bayes' rule:
\[P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)}\]
where \(H\) is the hypothesis, \(E\) is the evidence, \(P(H)\) is the prior, \(P(E \mid H)\) is the likelihood, and \(P(H \mid E)\) is the posterior.
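A minimal worked example makes the update concrete; the prevalence and test-accuracy numbers below are invented for illustration:

```python
# Hypothetical illustration of Bayes' rule: a disease test with made-up numbers.
p_disease = 0.01            # prior P(H)
p_pos_given_disease = 0.95  # likelihood P(E | H)
p_pos_given_healthy = 0.05  # false-positive rate P(E | not H)

# Evidence probability P(E) by the law of total probability
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Posterior P(H | E) by Bayes' rule
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"P(disease | positive test) = {p_disease_given_pos:.3f}")  # ~0.161
```

Even with a fairly accurate test, the low prior keeps the posterior modest, which is exactly why the prior must be carried through the calculation.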
2. Bayesian Networks
2.1 Definition
A Bayesian Network (BN) is a directed acyclic graph (DAG) in which:
- Each node represents a random variable
- Each directed edge represents a causal or dependency relationship
- Each node has an associated conditional probability table (CPT)
The joint probability distribution factorizes according to the graph structure:
\[P(X_1, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Pa}(X_i))\]
where \(\mathrm{Pa}(X_i)\) denotes the parents of \(X_i\) in the graph.
Example: A simple medical diagnosis network
[Smoking] → [LungCancer] → [AbnormalXRay]
[Smoking] → [Bronchitis] → [Dyspnea]
[LungCancer] → [Dyspnea]
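This factorization can be written directly in code. The sketch below uses invented CPT values (the numbers are illustrative only) to evaluate the joint probability of one full assignment; each factor in the product is one node's CPT:

```python
# Sketch of the diagnosis network's factorized joint distribution.
# All CPT numbers below are invented for illustration.
P_smoking = {True: 0.3, False: 0.7}
P_cancer_given_smoking = {True: {True: 0.1, False: 0.9},
                          False: {True: 0.01, False: 0.99}}
P_bronchitis_given_smoking = {True: {True: 0.4, False: 0.6},
                              False: {True: 0.05, False: 0.95}}
P_xray_given_cancer = {True: {True: 0.9, False: 0.1},
                       False: {True: 0.05, False: 0.95}}
P_dyspnea_given = {  # P(Dyspnea=True | LungCancer, Bronchitis)
    (True, True): 0.95, (True, False): 0.7,
    (False, True): 0.6, (False, False): 0.1,
}

def joint(smoking, cancer, bronchitis, xray, dyspnea):
    """P(S, C, B, X, D) = P(S) P(C|S) P(B|S) P(X|C) P(D|C,B)."""
    p_d = P_dyspnea_given[(cancer, bronchitis)]
    return (P_smoking[smoking]
            * P_cancer_given_smoking[smoking][cancer]
            * P_bronchitis_given_smoking[smoking][bronchitis]
            * P_xray_given_cancer[cancer][xray]
            * (p_d if dyspnea else 1 - p_d))

print(joint(True, True, False, True, True))
```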
2.2 Conditional Independence
Bayesian networks encode rich conditional independence relationships. If \(X\) and \(Y\) are independent given \(Z\), we write:
\[X \perp\!\!\!\perp Y \mid Z \quad\Longleftrightarrow\quad P(X, Y \mid Z) = P(X \mid Z)\, P(Y \mid Z)\]
Markov condition: Given its parents, each node is conditionally independent of all its non-descendants.
2.3 d-Separation
d-separation is the graphical criterion for determining conditional independence in Bayesian networks.
Three basic connection structures:
| Structure | Form | Middle node unobserved | Middle node observed |
|---|---|---|---|
| Chain | \(X \to Z \to Y\) | \(X, Y\) are dependent | \(X \perp\!\!\!\perp Y \mid Z\) |
| Fork | \(X \leftarrow Z \to Y\) | \(X, Y\) are dependent | \(X \perp\!\!\!\perp Y \mid Z\) |
| Collider | \(X \to Z \leftarrow Y\) | \(X \perp\!\!\!\perp Y\) | \(X, Y\) become dependent (explaining away) |
The Counterintuitive Nature of Colliders
In a collider structure, observing the middle node actually introduces a dependency. For example: talent and attractiveness are independent, but if we only observe successful people (Talent → Success ← Attractiveness), then within the successful population, talent and attractiveness appear negatively correlated.
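This effect is easy to reproduce with a quick simulation; the additive model and the success threshold below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Talent and attractiveness are independent standard normals (toy model).
talent = rng.normal(size=n)
attractiveness = rng.normal(size=n)

# Success is driven by both (plus noise); the threshold is arbitrary.
success = (talent + attractiveness + rng.normal(size=n)) > 2.0

print("corr overall:  ", np.corrcoef(talent, attractiveness)[0, 1])
print("corr | success:", np.corrcoef(talent[success], attractiveness[success])[0, 1])
# Overall correlation is ~0; among the "successful" it is clearly negative.
```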
2.4 Exact Inference: Variable Elimination
Variable Elimination computes query probabilities by sequentially marginalizing out variables:
| Step | Operation |
|---|---|
| 1 | Identify query variables and evidence variables |
| 2 | Choose an elimination ordering |
| 3 | For each variable to be eliminated, multiply relevant factors and sum out |
| 4 | Normalize to obtain the posterior probability |
Complexity: Exponential in the size of the largest intermediate factor, which depends on the elimination ordering (equivalently, on the induced treewidth). Finding the optimal elimination ordering is NP-hard.
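The sketch below runs the multiply-and-sum-out loop on a toy three-variable chain with invented CPTs; the factor representation (a tuple of variable names plus a value table) is a simplification for illustration, not a production data structure:

```python
from itertools import product

# Minimal factor operations for variable elimination over binary variables.
# A factor is (variables_tuple, {assignment_tuple: value}).

def multiply(f, g):
    fv, ft = f
    gv, gt = g
    vars_ = tuple(dict.fromkeys(fv + gv))  # union of variables, order-preserving
    table = {}
    for assign in product([False, True], repeat=len(vars_)):
        env = dict(zip(vars_, assign))
        table[assign] = (ft[tuple(env[v] for v in fv)]
                         * gt[tuple(env[v] for v in gv)])
    return vars_, table

def sum_out(f, var):
    fv, ft = f
    keep = tuple(v for v in fv if v != var)
    table = {}
    for assign, val in ft.items():
        key = tuple(a for v, a in zip(fv, assign) if v != var)
        table[key] = table.get(key, 0.0) + val
    return keep, table

# Toy chain A -> B -> C with invented CPTs; query P(C).
fA  = (('A',), {(True,): 0.6, (False,): 0.4})
fBA = (('B', 'A'), {(True, True): 0.7, (False, True): 0.3,
                    (True, False): 0.2, (False, False): 0.8})
fCB = (('C', 'B'), {(True, True): 0.9, (False, True): 0.1,
                    (True, False): 0.3, (False, False): 0.7})

f = sum_out(multiply(fA, fBA), 'A')   # eliminate A
f = sum_out(multiply(f, fCB), 'B')    # eliminate B
print(f)                              # P(C); with evidence, renormalize at the end
```

On this chain any ordering is cheap; on densely connected graphs the intermediate factors, and hence the cost, depend heavily on which variable is eliminated first.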
2.5 Approximate Inference
When exact inference is infeasible, approximate methods are used.
MCMC (Markov Chain Monte Carlo):
- Gibbs sampling (sketched after this list): sample each variable in turn from its conditional distribution given the current values of all others: \(X_i^{(t+1)} \sim P(X_i \mid X_1^{(t+1)}, \dots, X_{i-1}^{(t+1)}, X_{i+1}^{(t)}, \dots, X_n^{(t)})\)
- Metropolis-Hastings: Propose-accept/reject mechanism; more general
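A minimal sketch of Gibbs sampling, assuming a toy bivariate Gaussian target whose full conditionals are known in closed form:

```python
import numpy as np

# Gibbs sampling for a bivariate Gaussian with correlation rho.
# Each full conditional is Gaussian, which is the situation Gibbs exploits.
rng = np.random.default_rng(1)
rho, n_samples = 0.8, 5000
x, y = 0.0, 0.0
samples = np.empty((n_samples, 2))

for t in range(n_samples):
    # Sample each variable given the current value of the other.
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))
    samples[t] = x, y

burn_in = 500  # discard early samples before the chain has mixed
print("empirical correlation:", np.corrcoef(samples[burn_in:].T)[0, 1])  # ~0.8
```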
Variational Inference (VI):
Cast inference as an optimization problem: within a tractable family \(\mathcal{Q}\), find the approximate distribution \(Q\) closest (in KL divergence) to the true posterior \(P\):
\[Q^* = \arg\min_{Q \in \mathcal{Q}} \mathrm{KL}\big(Q \,\|\, P\big)\]
| Method | Advantages | Disadvantages |
|---|---|---|
| MCMC | Asymptotically exact | Slow convergence, difficult diagnostics |
| Variational Inference | Fast, scalable | Approximation quality limited by \(\mathcal{Q}\) |
3. Markov Random Fields
3.1 Undirected Graphical Models
A Markov Random Field (MRF), also known as an undirected graphical model, uses an undirected graph to represent dependencies between variables.
The joint probability distribution is defined via potential functions over cliques:
\[P(x) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C), \qquad Z = \sum_{x} \prod_{C \in \mathcal{C}} \psi_C(x_C)\]
where \(\mathcal{C}\) is the set of maximal cliques in the graph, \(\psi_C \ge 0\) are the potential functions, and \(Z\) is the partition function (normalization constant).
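A brute-force sketch makes the role of \(Z\) concrete; the three-node chain and the "agreement" potential below are invented, and exhaustive enumeration only works because the model is tiny:

```python
from itertools import product
import math

# Toy pairwise MRF on a 3-node chain A - B - C with an agreement potential.
def psi(a, b, coupling=2.0):
    return math.exp(coupling if a == b else 0.0)

def unnormalized(a, b, c):
    return psi(a, b) * psi(b, c)

# Partition function: sum of the unnormalized measure over all configurations.
Z = sum(unnormalized(a, b, c) for a, b, c in product([0, 1], repeat=3))

p_all_equal = unnormalized(1, 1, 1) / Z
print(f"Z = {Z:.2f}, P(A=B=C=1) = {p_all_equal:.3f}")
```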
3.2 Comparison with Bayesian Networks
| Dimension | Bayesian Network | Markov Random Field |
|---|---|---|
| Graph type | Directed acyclic graph (DAG) | Undirected graph |
| Parameterization | Conditional probability tables | Potential functions (may be unnormalized) |
| Causal semantics | Naturally encodes causal relationships | Does not directly encode causality |
| Independence criterion | d-separation | Graph separation (node removal) |
| Partition function | Not needed (probabilities naturally normalize) | Must compute \(Z\) (typically intractable) |
| Typical applications | Medical diagnosis, causal inference | Image segmentation, physical system modeling |
3.3 Conditional Random Fields (CRF)
A CRF is a discriminative Markov random field that directly models \(P(Y \mid X)\). For the common linear-chain case:
\[P(y \mid x) = \frac{1}{Z(x)} \exp\!\left(\sum_{t=1}^{T} \sum_{k} \lambda_k\, f_k(y_{t-1}, y_t, x, t)\right)\]
where the \(f_k\) are feature functions and \(Z(x)\) normalizes over all label sequences.
Applications: Sequence labeling (NER, POS tagging), image segmentation. BiLSTM-CRF was once the dominant model for NER.
4. Hidden Markov Models
4.1 HMM Definition
A Hidden Markov Model (HMM) is a temporal probabilistic model:
- Hidden state sequence \(Z_1, Z_2, \dots, Z_T\): satisfies the first-order Markov property
- Observation sequence \(X_1, X_2, \dots, X_T\): each observation depends only on the current hidden state
Parameters: \(\lambda = (\pi, A, B)\)
| Parameter | Meaning |
|---|---|
| \(\pi_i = P(Z_1 = i)\) | Initial state distribution |
| \(A_{ij} = P(Z_{t+1} = j \mid Z_t = i)\) | State transition probability |
| \(B_{ik} = P(X_t = k \mid Z_t = i)\) | Emission probability |
4.2 Three Core Problems
| Problem | Description | Algorithm | Complexity |
|---|---|---|---|
| Evaluation | Given model and observations, compute \(P(X \mid \lambda)\) | Forward algorithm | \(O(N^2 T)\) |
| Decoding | Given model and observations, find the optimal hidden state sequence | Viterbi algorithm | \(O(N^2 T)\) |
| Learning | Given observations, estimate model parameters | Baum-Welch (EM) | Iterative convergence |
4.3 The Forward Algorithm
Define the forward variable \(\alpha_t(i) = P(X_1, \dots, X_t, Z_t = i \mid \lambda)\), computed recursively:
\[\alpha_1(i) = \pi_i\, B_{i X_1}, \qquad \alpha_{t+1}(j) = \left[\sum_{i=1}^{N} \alpha_t(i)\, A_{ij}\right] B_{j X_{t+1}}\]
Finally: \(P(X \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)\)
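A direct NumPy sketch of this recursion; the two-state, two-symbol parameters are invented for illustration:

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward algorithm: returns P(X | lambda) for a discrete HMM.

    pi: (N,) initial distribution, A: (N, N) transition matrix,
    B: (N, M) emission probabilities, obs: sequence of observation indices.
    """
    alpha = pi * B[:, obs[0]]               # alpha_1(i)
    for x_t in obs[1:]:
        alpha = (alpha @ A) * B[:, x_t]     # alpha_{t+1}(j)
    return alpha.sum()

# Toy 2-state, 2-symbol model with invented parameters.
pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])
print(forward(pi, A, B, [0, 1, 0]))
```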
4.4 The Viterbi Algorithm
Viterbi uses dynamic programming to find the most probable hidden state sequence. Define \(\delta_t(j)\) as the probability of the best partial state sequence ending in state \(j\) at time \(t\):
\[\delta_1(i) = \pi_i\, B_{i X_1}, \qquad \delta_{t+1}(j) = \max_{i} \big[\delta_t(i)\, A_{ij}\big]\, B_{j X_{t+1}}\]
The optimal path is recovered by following backpointers from \(\arg\max_i \delta_T(i)\).
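A matching NumPy sketch, reusing the toy parameters from the forward-algorithm example above:

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Viterbi decoding: most probable hidden state sequence for a discrete HMM."""
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))             # best path probability ending in each state
    back = np.zeros((T, N), dtype=int)   # backpointers

    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A   # scores[i, j] = delta_{t-1}(i) * A_ij
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]

    # Recover the optimal path by following backpointers from the best final state.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi(pi, A, B, [0, 1, 0, 0]))
```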
4.5 The Baum-Welch Algorithm
Baum-Welch is the specialization of the EM algorithm for HMMs:
- E-step: Compute forward-backward variables to obtain the posterior distribution over hidden states
- M-step: Update parameters \(\pi, A, B\) based on the posterior distribution
- Iterate until convergence (each iteration is guaranteed not to decrease the likelihood); a minimal sketch follows below
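A compact sketch of one possible implementation, without the numerical scaling a real implementation needs for long sequences; the random initialization and toy observation sequence are illustrative:

```python
import numpy as np

def baum_welch(obs, N, M, n_iter=50, seed=0):
    """Baum-Welch (EM) sketch for a discrete HMM; no scaling, so toy-sized only."""
    rng = np.random.default_rng(seed)
    pi = np.full(N, 1.0 / N)
    A = rng.dirichlet(np.ones(N), size=N)   # random row-stochastic init
    B = rng.dirichlet(np.ones(M), size=N)
    T = len(obs)

    for _ in range(n_iter):
        # E-step: forward and backward variables.
        alpha = np.zeros((T, N)); beta = np.zeros((T, N))
        alpha[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        beta[-1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])

        gamma = alpha * beta
        gamma /= gamma.sum(axis=1, keepdims=True)        # P(Z_t = i | X)
        xi = (alpha[:-1, :, None] * A[None] *
              (B[:, obs[1:]].T * beta[1:])[:, None, :])  # P(Z_t=i, Z_{t+1}=j | X)
        xi /= xi.sum(axis=(1, 2), keepdims=True)

        # M-step: re-estimate parameters from expected counts.
        pi = gamma[0]
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        for k in range(M):
            B[:, k] = gamma[np.array(obs) == k].sum(axis=0) / gamma.sum(axis=0)
    return pi, A, B

print(baum_welch([0, 1, 0, 0, 1, 1, 0], N=2, M=2))
```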
5. Probabilistic Programming
Probabilistic programming allows users to define probabilistic models programmatically, with the framework automatically performing inference.
5.1 Major Frameworks
| Framework | Inference Methods | Language | Characteristics |
|---|---|---|---|
| PyMC | NUTS (HMC variant), VI | Python | Bayesian modeling friendly |
| Stan | HMC/NUTS | Stan DSL | Industrial-grade stability |
| Pyro | SVI, MCMC | Python (PyTorch) | Deep probabilistic models |
| NumPyro | NUTS, SVI | Python (JAX) | High performance |
| Edward2/TFP | VI, MCMC | Python (TensorFlow) | TF ecosystem integration |
5.2 PyMC Example
```python
import arviz as az
import numpy as np
import pymc as pm

# Synthetic observations for the example; in practice `data` holds measured values.
data = np.random.normal(loc=1.0, scale=2.0, size=100)

with pm.Model() as model:
    # Priors
    mu = pm.Normal('mu', mu=0, sigma=10)
    sigma = pm.HalfNormal('sigma', sigma=5)
    # Likelihood
    obs = pm.Normal('obs', mu=mu, sigma=sigma, observed=data)
    # Inference
    trace = pm.sample(2000, tune=1000)

print(az.summary(trace))
```
5.3 Advantages of Probabilistic Programming
| Advantage | Description |
|---|---|
| Declarative modeling | Focus on model definition without hand-coding inference algorithms |
| Uncertainty quantification | Automatically obtain full posterior distributions over parameters |
| Model comparison | Information criteria such as WAIC and LOO-CV |
| Flexible composition | Build arbitrarily complex hierarchical models |
6. Causal Inference and Probabilistic Reasoning
6.1 From Correlation to Causation
Probabilistic reasoning answers "How likely is \(H\) after observing \(E\)?", whereas causal inference answers "How would \(H\) change if we intervene on \(E\)?"
Pearl's Causal Hierarchy:
| Level | Question Type | Mathematical Expression | Example |
|---|---|---|---|
| 1. Association | What if I observe? | \(P(Y \mid X)\) | What is the probability of lung cancer among smokers? |
| 2. Intervention | What if I act? | \(P(Y \mid \text{do}(X))\) | What would happen if someone were forced to smoke? |
| 3. Counterfactual | What if things had been different? | \(P(Y_{x'} \mid X=x, Y=y)\) | Would they have gotten lung cancer had they not smoked? |
6.2 The do-Calculus
The do-calculus is Pearl's formal tool for inferring causal effects from observational data. The core concepts are the backdoor criterion and the front-door criterion:
Backdoor criterion: If a set of variables \(Z\) blocks all backdoor paths from \(X\) to \(Y\) (and contains no descendants of \(X\)), the causal effect is identified by the adjustment formula:
\[P(Y \mid \text{do}(X = x)) = \sum_{z} P(Y \mid X = x, Z = z)\, P(Z = z)\]
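A synthetic-data sketch of the adjustment formula; the data-generating probabilities below are invented, and the point is only the contrast between the naive conditional and the adjusted estimate:

```python
import numpy as np

# Backdoor adjustment on synthetic data with a single binary confounder Z.
rng = np.random.default_rng(2)
n = 500_000

z = rng.random(n) < 0.5                                # confounder
x = rng.random(n) < np.where(z, 0.8, 0.2)              # Z strongly influences treatment X
y = rng.random(n) < np.where(z, 0.6, 0.2) + 0.1 * x    # Z and X both influence outcome Y

# Naive observational estimate (confounded by Z):
naive = y[x].mean()

# Backdoor adjustment: sum_z P(Y=1 | X=1, Z=z) P(Z=z)
adjusted = sum(y[x & (z == v)].mean() * (z == v).mean() for v in [True, False])

print(f"P(Y=1 | X=1)     ~ {naive:.3f}")     # inflated by confounding (~0.62 here)
print(f"P(Y=1 | do(X=1)) ~ {adjusted:.3f}")  # ~0.50 under this generating model
```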
6.3 Connections to Symbolic Methods
| Connection Point | Description |
|---|---|
| Structural causal models | Causal graphs are special Bayesian networks + interventional semantics |
| Knowledge graphs | Causal relationships can be encoded in knowledge graphs |
| Explainable AI | Causal reasoning provides stronger explanations than correlation |
| LLM reasoning | The causal reasoning capabilities of LLMs are an active research topic |
The Modern Fusion of Probabilistic Reasoning and Deep Learning
Modern AI increasingly combines probabilistic reasoning with deep learning: Variational Autoencoders (VAEs) merge deep generative models with variational inference, diffusion models learn to reverse a gradual noising process (which can be formulated as a stochastic differential equation) to model complex data distributions, and Bayesian deep learning places probability distributions over neural network weights.
References
- Koller & Friedman, Probabilistic Graphical Models: Principles and Techniques
- Bishop, Pattern Recognition and Machine Learning (Chapters 8-13)
- Pearl, Causality: Models, Reasoning, and Inference
- Murphy, Machine Learning: A Probabilistic Perspective
- Pyro Documentation
- PyMC Documentation