
Probabilistic Reasoning

Probabilistic reasoning is the core AI technique for handling uncertainty. In the real world, information is often incomplete, noisy, or ambiguous, and deterministic logical reasoning cannot effectively cope with these situations. Probabilistic reasoning provides a rigorous mathematical framework based on probability theory and graphical models for inference and decision-making under uncertainty.

From Bayesian networks to Hidden Markov Models, from exact inference to variational inference, probabilistic reasoning methods are widely applied in natural language processing, computer vision, medical diagnosis, robotics, and beyond. These notes cover Bayesian networks, Markov random fields, Hidden Markov Models, probabilistic programming, and connections to causal inference.


1. The Necessity of Reasoning Under Uncertainty

| Source of Uncertainty | Example |
| --- | --- |
| Incomplete observations | A doctor can only observe symptoms, not directly see the underlying disease |
| Sensor noise | A robot's distance sensor has measurement error |
| Model simplification | Weather models cannot capture every microscopic physical process |
| Inherent randomness | Quantum mechanical processes are fundamentally probabilistic |
| Incomplete knowledge | It is impossible to possess all relevant domain knowledge |

Core idea of probabilistic reasoning: Represent uncertainty using probability distributions and update beliefs using Bayes' rule:

\[ P(H \mid E) = \frac{P(E \mid H) \cdot P(H)}{P(E)} \]

where \(H\) is the hypothesis, \(E\) is the evidence, \(P(H)\) is the prior, and \(P(H \mid E)\) is the posterior.
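
As a worked example with made-up numbers, consider a diagnostic test with 99% sensitivity and a 5% false-positive rate for a disease with 1% prevalence:

```python
# Bayes' rule with illustrative numbers (all rates are assumptions,
# not real clinical data).
p_h = 0.01              # prior P(H): disease prevalence
p_e_given_h = 0.99      # likelihood P(E|H): test sensitivity
p_e_given_not_h = 0.05  # false-positive rate P(E|not H)

# Total probability of the evidence: P(E) = P(E|H)P(H) + P(E|not H)P(not H)
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)

# Posterior P(H|E) by Bayes' rule
posterior = p_e_given_h * p_h / p_e
print(round(posterior, 3))  # ≈ 0.167: a positive test is far from conclusive
```

Despite the accurate test, the low prior keeps the posterior below 17%, a classic illustration of why the prior matters.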


2. Bayesian Networks

2.1 Definition

A Bayesian Network (BN) is a directed acyclic graph (DAG) in which:

  • Each node represents a random variable
  • Each directed edge represents a causal or dependency relationship
  • Each node has an associated conditional probability table (CPT)

The joint probability distribution factorizes as:

\[ P(X_1, X_2, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid \text{Parents}(X_i)) \]

Example: A simple medical diagnosis network

  [Smoking] → [LungCancer] → [AbnormalXRay]
  [Smoking] → [Bronchitis] → [Dyspnea]
  [LungCancer] → [Dyspnea]
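
Abbreviating Smoking as \(S\), LungCancer as \(L\), Bronchitis as \(B\), AbnormalXRay as \(X\), and Dyspnea as \(D\), the factorization formula above gives for this network:

\[ P(S, L, B, X, D) = P(S) \cdot P(L \mid S) \cdot P(B \mid S) \cdot P(X \mid L) \cdot P(D \mid L, B) \]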

2.2 Conditional Independence

Bayesian networks encode rich conditional independence relationships. If \(X\) and \(Y\) are independent given \(Z\), we write:

\[ X \perp\!\!\!\perp Y \mid Z \]

Markov condition: Given its parents, each node is conditionally independent of all its non-descendants.

2.3 d-Separation

d-separation is the graphical criterion for determining conditional independence in Bayesian networks.

Three basic connection structures:

| Structure | Form | Middle node unobserved | Middle node observed |
| --- | --- | --- | --- |
| Chain | \(X \to Z \to Y\) | \(X, Y\) are dependent | \(X \perp\!\!\!\perp Y \mid Z\) |
| Fork | \(X \leftarrow Z \to Y\) | \(X, Y\) are dependent | \(X \perp\!\!\!\perp Y \mid Z\) |
| Collider | \(X \to Z \leftarrow Y\) | \(X \perp\!\!\!\perp Y\) | \(X, Y\) become dependent (explaining away) |

The Counterintuitive Nature of Colliders

In a collider structure, observing the middle node actually introduces a dependency. For example: talent and attractiveness are independent, but if we only observe successful people (Talent → Success ← Attractiveness), then within the successful population, talent and attractiveness appear negatively correlated.
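
This explaining-away effect can be checked with a quick simulation. The threshold model below is purely illustrative: talent and attractiveness are drawn independently, and "success" requires their sum to exceed an arbitrary cutoff.

```python
import random

random.seed(0)

# Collider: Talent -> Success <- Attractiveness, with independent causes.
n = 50_000
talent = [random.gauss(0, 1) for _ in range(n)]
attract = [random.gauss(0, 1) for _ in range(n)]
success = [t + a > 1.5 for t, a in zip(talent, attract)]  # toy threshold

def corr(xs, ys):
    """Pearson correlation, computed from scratch."""
    m = len(xs)
    mx, my = sum(xs) / m, sum(ys) / m
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Unconditionally, the two causes are (nearly) uncorrelated...
r_all = corr(talent, attract)

# ...but conditioning on the collider induces a negative correlation.
pairs = [(t, a) for t, a, s in zip(talent, attract, success) if s]
r_success = corr([t for t, _ in pairs], [a for _, a in pairs])
print(f"all: {r_all:.3f}, successful only: {r_success:.3f}")
```

Within the selected subpopulation, knowing someone succeeded despite low talent makes high attractiveness more likely, which is exactly the induced dependency.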

2.4 Exact Inference: Variable Elimination

Variable Elimination computes query probabilities by sequentially marginalizing out variables:

\[ P(Q \mid E = e) \propto \sum_{H} \prod_{i} P(X_i \mid \text{Parents}(X_i)) \bigg|_{E=e} \]

| Step | Operation |
| --- | --- |
| 1 | Identify query variables and evidence variables |
| 2 | Choose an elimination ordering |
| 3 | For each variable to be eliminated, multiply relevant factors and sum out |
| 4 | Normalize to obtain the posterior probability |

Complexity: Depends on the largest factor produced by the elimination ordering. Finding the optimal elimination ordering is NP-hard.
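
Eliminating a single variable can be sketched on the chain \(X \to Z \to Y\). The binary CPTs below are made up for illustration:

```python
# Variable elimination on X -> Z -> Y with illustrative CPTs.
p_x = {0: 0.7, 1: 0.3}                      # P(X)
p_z_given_x = {0: {0: 0.9, 1: 0.1},         # p_z_given_x[x][z] = P(Z=z | X=x)
               1: {0: 0.2, 1: 0.8}}
p_y_given_z = {0: {0: 0.8, 1: 0.2},         # p_y_given_z[z][y] = P(Y=y | Z=z)
               1: {0: 0.3, 1: 0.7}}

# Query P(Y | X=1): multiply the factors that mention Z, then sum Z out.
x = 1
f_y = {y: sum(p_z_given_x[x][z] * p_y_given_z[z][y] for z in (0, 1))
       for y in (0, 1)}

# Normalize (here f_y already sums to 1, since the evidence factor cancels).
total = sum(f_y.values())
posterior = {y: v / total for y, v in f_y.items()}
print(posterior)  # P(Y=1 | X=1) = 0.6
```

On larger networks the same multiply-and-sum step is repeated per eliminated variable, and the factor sizes, not the network size, drive the cost.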

2.5 Approximate Inference

When exact inference is infeasible, approximate methods are used.

MCMC (Markov Chain Monte Carlo):

  • Gibbs sampling: Sample each variable in turn from its conditional distribution - \(X_i^{(t+1)} \sim P(X_i \mid X_1^{(t+1)}, \dots, X_{i-1}^{(t+1)}, X_{i+1}^{(t)}, \dots, X_n^{(t)})\)
  • Metropolis-Hastings: Propose-accept/reject mechanism; more general
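
A minimal Metropolis-Hastings sketch, with an untuned random-walk proposal and a standard normal as the unnormalized target (all constants are illustrative):

```python
import math
import random

random.seed(1)

def log_target(x):
    return -0.5 * x * x  # log of the unnormalized N(0, 1) density

x = 0.0
samples = []
for _ in range(20_000):
    proposal = x + random.gauss(0, 1.0)  # symmetric Gaussian proposal
    # Accept with probability min(1, target(proposal) / target(x)).
    if random.random() < math.exp(min(0.0, log_target(proposal) - log_target(x))):
        x = proposal
    samples.append(x)

burned = samples[5_000:]  # discard burn-in
mean = sum(burned) / len(burned)
var = sum((s - mean) ** 2 for s in burned) / len(burned)
print(f"mean = {mean:.2f}, var = {var:.2f}")  # should approach 0 and 1
```

Because the proposal is symmetric, the Hastings correction ratio drops out and only the target ratio remains in the acceptance probability.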

Variational Inference (VI):

Cast inference as an optimization problem: find the approximate distribution \(Q\) closest to the true posterior \(P\):

\[ Q^* = \arg\min_{Q \in \mathcal{Q}} \text{KL}(Q \| P) = \arg\max_{Q \in \mathcal{Q}} \text{ELBO}(Q) \]

| Method | Advantages | Disadvantages |
| --- | --- | --- |
| MCMC | Asymptotically exact | Slow convergence, difficult diagnostics |
| Variational Inference | Fast, scalable | Approximation quality limited by \(\mathcal{Q}\) |

3. Markov Random Fields

3.1 Undirected Graphical Models

A Markov Random Field (MRF), also known as an undirected graphical model, uses an undirected graph to represent dependencies between variables.

The joint probability distribution is defined via potential functions:

\[ P(X_1, \dots, X_n) = \frac{1}{Z} \prod_{c \in \mathcal{C}} \psi_c(X_c) \]

where \(\mathcal{C}\) is the set of maximal cliques in the graph and \(Z\) is the partition function (normalization constant).
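
As a concrete illustration, consider a toy MRF on three binary variables in a chain A - B - C, with an Ising-style pairwise potential whose coupling strength \(J\) is chosen arbitrarily:

```python
import math
from itertools import product

J = 1.0  # illustrative coupling strength

def psi(u, v):
    """Edge potential: rewards agreement, penalizes disagreement."""
    return math.exp(J if u == v else -J)

# The maximal cliques of the chain A - B - C are the edges (A,B) and (B,C).
def unnorm(a, b, c):
    return psi(a, b) * psi(b, c)

# Partition function Z: sum of clique-potential products over all
# 2^3 joint configurations.
Z = sum(unnorm(a, b, c) for a, b, c in product((0, 1), repeat=3))

# Normalized probability of the all-zeros configuration.
p_000 = unnorm(0, 0, 0) / Z
print(round(p_000, 4))
```

Even in this tiny example, \(Z\) requires summing over every configuration; with \(n\) binary nodes that sum has \(2^n\) terms, which is why computing the partition function is typically intractable.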

3.2 Comparison with Bayesian Networks

| Dimension | Bayesian Network | Markov Random Field |
| --- | --- | --- |
| Graph type | Directed acyclic graph (DAG) | Undirected graph |
| Parameterization | Conditional probability tables | Potential functions (may be unnormalized) |
| Causal semantics | Naturally encodes causal relationships | Does not directly encode causality |
| Independence criterion | d-separation | Graph separation (node removal) |
| Partition function | Not needed (probabilities naturally normalize) | Must compute \(Z\) (typically intractable) |
| Typical applications | Medical diagnosis, causal inference | Image segmentation, physical system modeling |

3.3 Conditional Random Fields (CRF)

A CRF is a discriminative Markov random field that directly models \(P(Y \mid X)\):

\[ P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left(\sum_{k} \lambda_k f_k(\mathbf{y}, \mathbf{x})\right) \]

Applications: Sequence labeling (NER, POS tagging), image segmentation. BiLSTM-CRF was once the dominant model for NER.


4. Hidden Markov Models

4.1 HMM Definition

A Hidden Markov Model (HMM) is a temporal probabilistic model:

  • Hidden state sequence \(Z_1, Z_2, \dots, Z_T\): satisfies the first-order Markov property
  • Observation sequence \(X_1, X_2, \dots, X_T\): each observation depends only on the current hidden state

Parameters: \(\lambda = (\pi, A, B)\)

| Parameter | Meaning |
| --- | --- |
| \(\pi_i = P(Z_1 = i)\) | Initial state distribution |
| \(A_{ij} = P(Z_{t+1} = j \mid Z_t = i)\) | State transition probability |
| \(B_{ik} = P(X_t = k \mid Z_t = i)\) | Emission probability |

4.2 Three Core Problems

| Problem | Description | Algorithm | Complexity |
| --- | --- | --- | --- |
| Evaluation | Given model and observations, compute \(P(X \mid \lambda)\) | Forward algorithm | \(O(N^2 T)\) |
| Decoding | Given model and observations, find the optimal hidden state sequence | Viterbi algorithm | \(O(N^2 T)\) |
| Learning | Given observations, estimate model parameters | Baum-Welch (EM) | Iterative convergence |

4.3 The Forward Algorithm

Define the forward variable \(\alpha_t(i) = P(X_1, \dots, X_t, Z_t = i \mid \lambda)\):

\[ \begin{aligned} \alpha_1(i) &= \pi_i \cdot B_{i, X_1} \\ \alpha_{t+1}(j) &= \left[\sum_{i=1}^{N} \alpha_t(i) \cdot A_{ij}\right] \cdot B_{j, X_{t+1}} \end{aligned} \]

Finally: \(P(X \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)\)
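
The recursion translates directly into code. The 2-state, 2-symbol parameters below are illustrative, not fitted to any data:

```python
# Forward algorithm for a small HMM with assumed parameters.
pi = [0.6, 0.4]          # initial state distribution
A = [[0.7, 0.3],         # A[i][j] = P(Z_{t+1}=j | Z_t=i)
     [0.4, 0.6]]
B = [[0.9, 0.1],         # B[i][k] = P(X_t=k | Z_t=i)
     [0.2, 0.8]]

def forward(obs):
    """Return P(obs | lambda) via the forward recursion."""
    N = len(pi)
    # Initialization: alpha_1(i) = pi_i * B[i][x_1]
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]
    # Induction: alpha_{t+1}(j) = (sum_i alpha_t(i) * A[i][j]) * B[j][x_{t+1}]
    for x in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][x]
                 for j in range(N)]
    # Termination: sum over final states
    return sum(alpha)

print(forward([0, 1, 0]))  # likelihood of the observation sequence
```

The sum inside the induction step is what makes this \(O(N^2 T)\) instead of the exponential cost of enumerating all \(N^T\) state paths.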

4.4 The Viterbi Algorithm

Viterbi uses dynamic programming to find the most probable hidden state sequence:

\[ \delta_t(j) = \max_{i} \left[\delta_{t-1}(i) \cdot A_{ij}\right] \cdot B_{j, X_t} \]

The optimal path is recovered by following backpointers.
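
A sketch of the recursion with backpointer recovery, using illustrative 2-state parameters (assumed, not from a real model):

```python
# Viterbi decoding for a small HMM with assumed parameters.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]

def viterbi(obs):
    """Return the most probable hidden state sequence for obs."""
    N = len(pi)
    delta = [pi[i] * B[i][obs[0]] for i in range(N)]
    backptr = []
    for x in obs[1:]:
        step, new_delta = [], []
        for j in range(N):
            # Best predecessor i maximizing delta_{t-1}(i) * A[i][j]
            best_i = max(range(N), key=lambda i: delta[i] * A[i][j])
            step.append(best_i)
            new_delta.append(delta[best_i] * A[best_i][j] * B[j][x])
        delta = new_delta
        backptr.append(step)
    # Backtrack from the best final state along the stored pointers.
    state = max(range(N), key=lambda i: delta[i])
    path = [state]
    for step in reversed(backptr):
        state = step[state]
        path.append(state)
    return path[::-1]

print(viterbi([0, 1, 1]))
```

The structure mirrors the forward algorithm, with max replacing the sum and an extra table of backpointers for path recovery.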

4.5 The Baum-Welch Algorithm

Baum-Welch is the specialization of the EM algorithm for HMMs:

  • E-step: Compute forward-backward variables to obtain the posterior distribution over hidden states
  • M-step: Update parameters \(\pi, A, B\) based on the posterior distribution
  • Iterate until convergence (guaranteed to monotonically increase the likelihood)

5. Probabilistic Programming

Probabilistic programming allows users to define probabilistic models programmatically, with the framework automatically performing inference.

5.1 Major Frameworks

| Framework | Inference Methods | Language | Characteristics |
| --- | --- | --- | --- |
| PyMC | NUTS (HMC variant), VI | Python | Bayesian modeling friendly |
| Stan | HMC/NUTS | Stan DSL | Industrial-grade stability |
| Pyro | SVI, MCMC | Python (PyTorch) | Deep probabilistic models |
| NumPyro | NUTS, SVI | Python (JAX) | High performance |
| Edward2/TFP | VI, MCMC | Python (TensorFlow) | TF ecosystem integration |

5.2 PyMC Example

```python
import arviz as az
import pymc as pm

# `data` is assumed to be an array of observations defined elsewhere.
with pm.Model() as model:
    # Priors
    mu = pm.Normal('mu', mu=0, sigma=10)
    sigma = pm.HalfNormal('sigma', sigma=5)

    # Likelihood
    obs = pm.Normal('obs', mu=mu, sigma=sigma, observed=data)

    # Inference (NUTS by default)
    trace = pm.sample(2000, tune=1000)

az.summary(trace)
```

5.3 Advantages of Probabilistic Programming

| Advantage | Description |
| --- | --- |
| Declarative modeling | Focus on model definition without hand-coding inference algorithms |
| Uncertainty quantification | Automatically obtain full posterior distributions over parameters |
| Model comparison | Information criteria such as WAIC and LOO-CV |
| Flexible composition | Build arbitrarily complex hierarchical models |

6. Causal Inference and Probabilistic Reasoning

6.1 From Correlation to Causation

Probabilistic reasoning answers "How likely is \(H\) after observing \(E\)?", whereas causal inference answers "How would \(H\) change if we intervene on \(E\)?"

Pearl's Causal Hierarchy:

| Level | Question Type | Mathematical Expression | Example |
| --- | --- | --- | --- |
| 1. Association | What if I observe? | \(P(Y \mid X)\) | What is the probability of lung cancer among smokers? |
| 2. Intervention | What if I act? | \(P(Y \mid \text{do}(X))\) | What would happen if someone were forced to smoke? |
| 3. Counterfactual | What if things had been different? | \(P(Y_{x'} \mid X=x, Y=y)\) | Would they have gotten lung cancer had they not smoked? |

6.2 The do-Calculus

The do-calculus is Pearl's formal tool for inferring causal effects from observational data. The core concepts are the backdoor criterion and the front-door criterion:

Backdoor criterion: If a set of variables \(Z\) blocks all backdoor paths from \(X\) to \(Y\), then:

\[ P(Y \mid \text{do}(X=x)) = \sum_{z} P(Y \mid X=x, Z=z) \cdot P(Z=z) \]
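
The gap between observation and intervention can be made concrete with a toy binary model in which \(Z\) confounds \(X\) and \(Y\) (edges \(Z \to X\), \(Z \to Y\), \(X \to Y\)). All probabilities below are made up:

```python
# Backdoor adjustment on an illustrative binary model.
p_z = {0: 0.5, 1: 0.5}                     # P(Z=z)
p_x_given_z = {0: 0.2, 1: 0.8}             # P(X=1 | Z=z)
p_y_given_xz = {(0, 0): 0.1, (1, 0): 0.3,  # P(Y=1 | X=x, Z=z)
                (0, 1): 0.5, (1, 1): 0.7}

# Observational P(Y=1 | X=1): z is weighted by P(Z=z | X=1), so the
# confounder's influence leaks into the estimate.
p_x1 = sum(p_x_given_z[z] * p_z[z] for z in (0, 1))
p_y_obs = sum(p_y_given_xz[(1, z)] * p_x_given_z[z] * p_z[z] / p_x1
              for z in (0, 1))

# Interventional P(Y=1 | do(X=1)) via the backdoor formula:
# sum_z P(Y=1 | X=1, Z=z) * P(Z=z), with z weighted by its marginal.
p_y_do = sum(p_y_given_xz[(1, z)] * p_z[z] for z in (0, 1))

print(f"observational: {p_y_obs:.3f}, interventional: {p_y_do:.3f}")
```

Here the observational estimate (0.62) overstates the causal effect (0.50) because \(Z=1\) raises the chances of both smoking-like exposure and the outcome; the adjustment removes that bias by averaging over the marginal of \(Z\).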

6.3 Connections to Symbolic Methods

| Connection Point | Description |
| --- | --- |
| Structural causal models | Causal graphs are special Bayesian networks + interventional semantics |
| Knowledge graphs | Causal relationships can be encoded in knowledge graphs |
| Explainable AI | Causal reasoning provides stronger explanations than correlation |
| LLM reasoning | The causal reasoning capabilities of LLMs are an active research topic |

Modern Fusion of Probabilistic Reasoning

Modern AI increasingly emphasizes combining probabilistic reasoning with deep learning: Variational Autoencoders (VAEs) merge deep generative models with variational inference, diffusion models use stochastic differential equations to model probability distributions, and Bayesian deep learning assigns probabilistic interpretations to neural network weights.


References

  • Koller & Friedman, Probabilistic Graphical Models: Principles and Techniques
  • Bishop, Pattern Recognition and Machine Learning (Chapters 8-13)
  • Pearl, Causality: Models, Reasoning, and Inference
  • Murphy, Machine Learning: A Probabilistic Perspective
  • Pyro Documentation
  • PyMC Documentation
