Probabilistic Reasoning
Probabilistic reasoning is the core AI technique for handling uncertainty. In the real world, information is often incomplete, noisy, or ambiguous, and deterministic logical reasoning cannot effectively cope with these situations. Probabilistic reasoning provides a rigorous mathematical framework based on probability theory and graphical models for inference and decision-making under uncertainty.
From Bayesian networks to Hidden Markov Models, from exact inference to variational inference, probabilistic reasoning methods are widely applied in natural language processing, computer vision, medical diagnosis, robotics, and beyond. These notes cover Bayesian networks, Markov random fields, Hidden Markov Models, probabilistic programming, and connections to causal inference.
1. The Necessity of Reasoning Under Uncertainty
| Source of Uncertainty | Example |
|---|---|
| Incomplete observations | A doctor can only observe symptoms, not directly see the underlying disease |
| Sensor noise | A robot's distance sensor has measurement error |
| Model simplification | Weather models cannot capture every microscopic physical process |
| Inherent randomness | Quantum mechanical processes are fundamentally probabilistic |
| Incomplete knowledge | It is impossible to possess all relevant domain knowledge |
Core idea of probabilistic reasoning: represent uncertainty using probability distributions and update beliefs using Bayes' rule:
\[P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)}\]
where \(H\) is the hypothesis, \(E\) is the evidence, \(P(H)\) is the prior, \(P(E \mid H)\) is the likelihood, and \(P(H \mid E)\) is the posterior.
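A minimal worked example makes the update concrete; the prevalence and test-accuracy numbers below are invented for illustration:

```python
# Hypothetical illustration of Bayes' rule: a disease test with made-up numbers.
p_disease = 0.01            # prior P(H)
p_pos_given_disease = 0.95  # likelihood P(E | H)
p_pos_given_healthy = 0.05  # false-positive rate P(E | not H)

# Evidence probability P(E) by the law of total probability
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Posterior P(H | E) by Bayes' rule
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"P(disease | positive test) = {p_disease_given_pos:.3f}")  # ~0.161
```

Even with a fairly accurate test, the low prior keeps the posterior modest, which is exactly why the prior must be carried through the calculation.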
2. Bayesian Networks
2.1 Definition
A Bayesian Network (BN) is a directed acyclic graph (DAG) in which:
- Each node represents a random variable
- Each directed edge represents a causal or dependency relationship
- Each node has an associated conditional probability table (CPT)
The joint probability distribution factorizes according to the graph structure:
\[P(X_1, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Pa}(X_i))\]
where \(\mathrm{Pa}(X_i)\) denotes the parents of \(X_i\) in the graph.
Example: A simple medical diagnosis network
[Smoking] → [LungCancer] → [AbnormalXRay]
[Smoking] → [Bronchitis] → [Dyspnea]
[LungCancer] → [Dyspnea]
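This factorization can be written directly in code. The sketch below uses invented CPT values (the numbers are illustrative only) to evaluate the joint probability of one full assignment; each factor in the product is one node's CPT:

```python
# Sketch of the diagnosis network's factorized joint distribution.
# All CPT numbers below are invented for illustration.
P_smoking = {True: 0.3, False: 0.7}
P_cancer_given_smoking = {True: {True: 0.1, False: 0.9},
                          False: {True: 0.01, False: 0.99}}
P_bronchitis_given_smoking = {True: {True: 0.4, False: 0.6},
                              False: {True: 0.05, False: 0.95}}
P_xray_given_cancer = {True: {True: 0.9, False: 0.1},
                       False: {True: 0.05, False: 0.95}}
P_dyspnea_given = {  # P(Dyspnea=True | LungCancer, Bronchitis)
    (True, True): 0.95, (True, False): 0.7,
    (False, True): 0.6, (False, False): 0.1,
}

def joint(smoking, cancer, bronchitis, xray, dyspnea):
    """P(S, C, B, X, D) = P(S) P(C|S) P(B|S) P(X|C) P(D|C,B)."""
    p_d = P_dyspnea_given[(cancer, bronchitis)]
    return (P_smoking[smoking]
            * P_cancer_given_smoking[smoking][cancer]
            * P_bronchitis_given_smoking[smoking][bronchitis]
            * P_xray_given_cancer[cancer][xray]
            * (p_d if dyspnea else 1 - p_d))

print(joint(True, True, False, True, True))
```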
2.2 Conditional Independence
Bayesian networks encode rich conditional independence relationships. If \(X\) and \(Y\) are independent given \(Z\), we write:
\[X \perp\!\!\!\perp Y \mid Z \quad\Longleftrightarrow\quad P(X, Y \mid Z) = P(X \mid Z)\, P(Y \mid Z)\]
Markov condition: Given its parents, each node is conditionally independent of all its non-descendants.
2.3 d-Separation
d-separation is the graphical criterion for determining conditional independence in Bayesian networks.
Three basic connection structures:
| Structure | Form | Middle node unobserved | Middle node observed |
|---|---|---|---|
| Chain | \(X \to Z \to Y\) | \(X, Y\) are dependent | \(X \perp\!\!\!\perp Y \mid Z\) |
| Fork | \(X \leftarrow Z \to Y\) | \(X, Y\) are dependent | \(X \perp\!\!\!\perp Y \mid Z\) |
| Collider | \(X \to Z \leftarrow Y\) | \(X \perp\!\!\!\perp Y\) | \(X, Y\) become dependent (explaining away) |
The Counterintuitive Nature of Colliders
In a collider structure, observing the middle node actually introduces a dependency. For example: talent and attractiveness are independent, but if we only observe successful people (Talent → Success ← Attractiveness), then within the successful population, talent and attractiveness appear negatively correlated.
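This effect is easy to reproduce with a quick simulation; the additive model and the success threshold below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Talent and attractiveness are independent standard normals (toy model).
talent = rng.normal(size=n)
attractiveness = rng.normal(size=n)

# Success is driven by both (plus noise); the threshold is arbitrary.
success = (talent + attractiveness + rng.normal(size=n)) > 2.0

print("corr overall:  ", np.corrcoef(talent, attractiveness)[0, 1])
print("corr | success:", np.corrcoef(talent[success], attractiveness[success])[0, 1])
# Overall correlation is ~0; among the "successful" it is clearly negative.
```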
2.4 Exact Inference: Variable Elimination
Variable Elimination computes query probabilities by sequentially marginalizing out variables:
| Step | Operation |
|---|---|
| 1 | Identify query variables and evidence variables |
| 2 | Choose an elimination ordering |
| 3 | For each variable to be eliminated, multiply relevant factors and sum out |
| 4 | Normalize to obtain the posterior probability |
Complexity: Exponential in the size of the largest intermediate factor, which depends on the elimination ordering (equivalently, on the induced treewidth). Finding the optimal elimination ordering is NP-hard.
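The sketch below runs the multiply-and-sum-out loop on a toy three-variable chain with invented CPTs; the factor representation (a tuple of variable names plus a value table) is a simplification for illustration, not a production data structure:

```python
from itertools import product

# Minimal factor operations for variable elimination over binary variables.
# A factor is (variables_tuple, {assignment_tuple: value}).

def multiply(f, g):
    fv, ft = f
    gv, gt = g
    vars_ = tuple(dict.fromkeys(fv + gv))  # union of variables, order-preserving
    table = {}
    for assign in product([False, True], repeat=len(vars_)):
        env = dict(zip(vars_, assign))
        table[assign] = (ft[tuple(env[v] for v in fv)]
                         * gt[tuple(env[v] for v in gv)])
    return vars_, table

def sum_out(f, var):
    fv, ft = f
    keep = tuple(v for v in fv if v != var)
    table = {}
    for assign, val in ft.items():
        key = tuple(a for v, a in zip(fv, assign) if v != var)
        table[key] = table.get(key, 0.0) + val
    return keep, table

# Toy chain A -> B -> C with invented CPTs; query P(C).
fA  = (('A',), {(True,): 0.6, (False,): 0.4})
fBA = (('B', 'A'), {(True, True): 0.7, (False, True): 0.3,
                    (True, False): 0.2, (False, False): 0.8})
fCB = (('C', 'B'), {(True, True): 0.9, (False, True): 0.1,
                    (True, False): 0.3, (False, False): 0.7})

f = sum_out(multiply(fA, fBA), 'A')   # eliminate A
f = sum_out(multiply(f, fCB), 'B')    # eliminate B
print(f)                              # P(C); with evidence, renormalize at the end
```

On this chain any ordering is cheap; on densely connected graphs the intermediate factors, and hence the cost, depend heavily on which variable is eliminated first.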
2.5 Approximate Inference
When exact inference is infeasible, approximate methods are used.
MCMC (Markov Chain Monte Carlo):
- Gibbs sampling (sketched after this list): sample each variable in turn from its conditional distribution given the current values of all others: \(X_i^{(t+1)} \sim P(X_i \mid X_1^{(t+1)}, \dots, X_{i-1}^{(t+1)}, X_{i+1}^{(t)}, \dots, X_n^{(t)})\)
- Metropolis-Hastings: Propose-accept/reject mechanism; more general
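A minimal sketch of Gibbs sampling, assuming a toy bivariate Gaussian target whose full conditionals are known in closed form:

```python
import numpy as np

# Gibbs sampling for a bivariate Gaussian with correlation rho.
# Each full conditional is Gaussian, which is the situation Gibbs exploits.
rng = np.random.default_rng(1)
rho, n_samples = 0.8, 5000
x, y = 0.0, 0.0
samples = np.empty((n_samples, 2))

for t in range(n_samples):
    # Sample each variable given the current value of the other.
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))
    samples[t] = x, y

burn_in = 500  # discard early samples before the chain has mixed
print("empirical correlation:", np.corrcoef(samples[burn_in:].T)[0, 1])  # ~0.8
```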
Variational Inference (VI):
Cast inference as an optimization problem: within a tractable family \(\mathcal{Q}\), find the approximate distribution \(Q\) closest (in KL divergence) to the true posterior \(P\):
\[Q^* = \arg\min_{Q \in \mathcal{Q}} \mathrm{KL}\big(Q \,\|\, P\big)\]
| Method | Advantages | Disadvantages |
|---|---|---|
| MCMC | Asymptotically exact | Slow convergence, difficult diagnostics |
| Variational Inference | Fast, scalable | Approximation quality limited by \(\mathcal{Q}\) |
3. Markov Random Fields
3.1 Undirected Graphical Models
A Markov Random Field (MRF), also known as an undirected graphical model, uses an undirected graph to represent dependencies between variables.
The joint probability distribution is defined via potential functions over cliques:
\[P(x) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C), \qquad Z = \sum_{x} \prod_{C \in \mathcal{C}} \psi_C(x_C)\]
where \(\mathcal{C}\) is the set of maximal cliques in the graph, \(\psi_C \ge 0\) are the potential functions, and \(Z\) is the partition function (normalization constant).
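A brute-force sketch makes the role of \(Z\) concrete; the three-node chain and the "agreement" potential below are invented, and exhaustive enumeration only works because the model is tiny:

```python
from itertools import product
import math

# Toy pairwise MRF on a 3-node chain A - B - C with an agreement potential.
def psi(a, b, coupling=2.0):
    return math.exp(coupling if a == b else 0.0)

def unnormalized(a, b, c):
    return psi(a, b) * psi(b, c)

# Partition function: sum of the unnormalized measure over all configurations.
Z = sum(unnormalized(a, b, c) for a, b, c in product([0, 1], repeat=3))

p_all_equal = unnormalized(1, 1, 1) / Z
print(f"Z = {Z:.2f}, P(A=B=C=1) = {p_all_equal:.3f}")
```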
3.2 Comparison with Bayesian Networks
| Dimension | Bayesian Network | Markov Random Field |
|---|---|---|
| Graph type | Directed acyclic graph (DAG) | Undirected graph |
| Parameterization | Conditional probability tables | Potential functions (may be unnormalized) |
| Causal semantics | Naturally encodes causal relationships | Does not directly encode causality |
| Independence criterion | d-separation | Graph separation (node removal) |
| Partition function | Not needed (probabilities naturally normalize) | Must compute \(Z\) (typically intractable) |
| Typical applications | Medical diagnosis, causal inference | Image segmentation, physical system modeling |
3.3 Conditional Random Fields (CRF)
A CRF is a discriminative Markov random field that directly models \(P(Y \mid X)\). For the common linear-chain case:
\[P(y \mid x) = \frac{1}{Z(x)} \exp\!\left(\sum_{t=1}^{T} \sum_{k} \lambda_k\, f_k(y_{t-1}, y_t, x, t)\right)\]
where the \(f_k\) are feature functions and \(Z(x)\) normalizes over all label sequences.
Applications: Sequence labeling (NER, POS tagging), image segmentation. BiLSTM-CRF was once the dominant model for NER.
4. Hidden Markov Models
4.1 HMM Definition
A Hidden Markov Model (HMM) is a temporal probabilistic model:
- Hidden state sequence \(Z_1, Z_2, \dots, Z_T\): satisfies the first-order Markov property
- Observation sequence \(X_1, X_2, \dots, X_T\): each observation depends only on the current hidden state
Parameters: \(\lambda = (\pi, A, B)\)
| Parameter | Meaning |
|---|---|
| \(\pi_i = P(Z_1 = i)\) | Initial state distribution |
| \(A_{ij} = P(Z_{t+1} = j \mid Z_t = i)\) | State transition probability |
| \(B_{ik} = P(X_t = k \mid Z_t = i)\) | Emission probability |
4.2 Three Core Problems
| Problem | Description | Algorithm | Complexity |
|---|---|---|---|
| Evaluation | Given model and observations, compute \(P(X \mid \lambda)\) | Forward algorithm | \(O(N^2 T)\) |
| Decoding | Given model and observations, find the optimal hidden state sequence | Viterbi algorithm | \(O(N^2 T)\) |
| Learning | Given observations, estimate model parameters | Baum-Welch (EM) | Iterative convergence |
4.3 The Forward Algorithm
Define the forward variable \(\alpha_t(i) = P(X_1, \dots, X_t, Z_t = i \mid \lambda)\), computed recursively:
\[\alpha_1(i) = \pi_i\, B_{i X_1}, \qquad \alpha_{t+1}(j) = \left[\sum_{i=1}^{N} \alpha_t(i)\, A_{ij}\right] B_{j X_{t+1}}\]
Finally: \(P(X \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)\)
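A direct NumPy sketch of this recursion; the two-state, two-symbol parameters are invented for illustration:

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward algorithm: returns P(X | lambda) for a discrete HMM.

    pi: (N,) initial distribution, A: (N, N) transition matrix,
    B: (N, M) emission probabilities, obs: sequence of observation indices.
    """
    alpha = pi * B[:, obs[0]]               # alpha_1(i)
    for x_t in obs[1:]:
        alpha = (alpha @ A) * B[:, x_t]     # alpha_{t+1}(j)
    return alpha.sum()

# Toy 2-state, 2-symbol model with invented parameters.
pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])
print(forward(pi, A, B, [0, 1, 0]))
```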
4.4 The Viterbi Algorithm
Viterbi uses dynamic programming to find the most probable hidden state sequence. Define \(\delta_t(j)\) as the probability of the best partial state sequence ending in state \(j\) at time \(t\):
\[\delta_1(i) = \pi_i\, B_{i X_1}, \qquad \delta_{t+1}(j) = \max_{i} \big[\delta_t(i)\, A_{ij}\big]\, B_{j X_{t+1}}\]
The optimal path is recovered by following backpointers from \(\arg\max_i \delta_T(i)\).
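A matching NumPy sketch, reusing the toy parameters from the forward-algorithm example above:

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Viterbi decoding: most probable hidden state sequence for a discrete HMM."""
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))             # best path probability ending in each state
    back = np.zeros((T, N), dtype=int)   # backpointers

    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A   # scores[i, j] = delta_{t-1}(i) * A_ij
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]

    # Recover the optimal path by following backpointers from the best final state.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi(pi, A, B, [0, 1, 0, 0]))
```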
4.5 The Baum-Welch Algorithm
Baum-Welch is the specialization of the EM algorithm for HMMs:
- E-step: Compute forward-backward variables to obtain the posterior distribution over hidden states
- M-step: Update parameters \(\pi, A, B\) based on the posterior distribution
- Iterate until convergence (each iteration is guaranteed not to decrease the likelihood); a minimal sketch follows below
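A compact sketch of one possible implementation, without the numerical scaling a real implementation needs for long sequences; the random initialization and toy observation sequence are illustrative:

```python
import numpy as np

def baum_welch(obs, N, M, n_iter=50, seed=0):
    """Baum-Welch (EM) sketch for a discrete HMM; no scaling, so toy-sized only."""
    rng = np.random.default_rng(seed)
    pi = np.full(N, 1.0 / N)
    A = rng.dirichlet(np.ones(N), size=N)   # random row-stochastic init
    B = rng.dirichlet(np.ones(M), size=N)
    T = len(obs)

    for _ in range(n_iter):
        # E-step: forward and backward variables.
        alpha = np.zeros((T, N)); beta = np.zeros((T, N))
        alpha[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        beta[-1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])

        gamma = alpha * beta
        gamma /= gamma.sum(axis=1, keepdims=True)        # P(Z_t = i | X)
        xi = (alpha[:-1, :, None] * A[None] *
              (B[:, obs[1:]].T * beta[1:])[:, None, :])  # P(Z_t=i, Z_{t+1}=j | X)
        xi /= xi.sum(axis=(1, 2), keepdims=True)

        # M-step: re-estimate parameters from expected counts.
        pi = gamma[0]
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        for k in range(M):
            B[:, k] = gamma[np.array(obs) == k].sum(axis=0) / gamma.sum(axis=0)
    return pi, A, B

print(baum_welch([0, 1, 0, 0, 1, 1, 0], N=2, M=2))
```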
5. Probabilistic Programming
Probabilistic programming allows users to define probabilistic models programmatically, with the framework automatically performing inference.
5.1 Major Frameworks
| Framework | Inference Methods | Language | Characteristics |
|---|---|---|---|
| PyMC | NUTS (HMC variant), VI | Python | Bayesian modeling friendly |
| Stan | HMC/NUTS | Stan DSL | Industrial-grade stability |
| Pyro | SVI, MCMC | Python (PyTorch) | Deep probabilistic models |
| NumPyro | NUTS, SVI | Python (JAX) | High performance |
| Edward2/TFP | VI, MCMC | Python (TensorFlow) | TF ecosystem integration |
5.2 PyMC Example
```python
import arviz as az
import numpy as np
import pymc as pm

# Synthetic observations for the example; in practice `data` holds measured values.
data = np.random.normal(loc=1.0, scale=2.0, size=100)

with pm.Model() as model:
    # Priors
    mu = pm.Normal('mu', mu=0, sigma=10)
    sigma = pm.HalfNormal('sigma', sigma=5)
    # Likelihood
    obs = pm.Normal('obs', mu=mu, sigma=sigma, observed=data)
    # Inference
    trace = pm.sample(2000, tune=1000)

print(az.summary(trace))
```
5.3 Advantages of Probabilistic Programming
| Advantage | Description |
|---|---|
| Declarative modeling | Focus on model definition without hand-coding inference algorithms |
| Uncertainty quantification | Automatically obtain full posterior distributions over parameters |
| Model comparison | Information criteria such as WAIC and LOO-CV |
| Flexible composition | Build arbitrarily complex hierarchical models |
6. Causal Inference and Probabilistic Reasoning
6.1 From Correlation to Causation
Probabilistic reasoning answers "How likely is \(H\) after observing \(E\)?", whereas causal inference answers "How would \(H\) change if we intervene on \(E\)?"
Pearl's Causal Hierarchy:
| Level | Question Type | Mathematical Expression | Example |
|---|---|---|---|
| 1. Association | What if I observe? | \(P(Y \mid X)\) | What is the probability of lung cancer among smokers? |
| 2. Intervention | What if I act? | \(P(Y \mid \text{do}(X))\) | What would happen if someone were forced to smoke? |
| 3. Counterfactual | What if things had been different? | \(P(Y_{x'} \mid X=x, Y=y)\) | Would they have gotten lung cancer had they not smoked? |
6.2 The do-Calculus
The do-calculus is Pearl's formal tool for inferring causal effects from observational data. The core concepts are the backdoor criterion and the front-door criterion:
Backdoor criterion: If a set of variables \(Z\) blocks all backdoor paths from \(X\) to \(Y\) (and contains no descendants of \(X\)), the causal effect is identified by the adjustment formula:
\[P(Y \mid \text{do}(X = x)) = \sum_{z} P(Y \mid X = x, Z = z)\, P(Z = z)\]
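A synthetic-data sketch of the adjustment formula; the data-generating probabilities below are invented, and the point is only the contrast between the naive conditional and the adjusted estimate:

```python
import numpy as np

# Backdoor adjustment on synthetic data with a single binary confounder Z.
rng = np.random.default_rng(2)
n = 500_000

z = rng.random(n) < 0.5                                # confounder
x = rng.random(n) < np.where(z, 0.8, 0.2)              # Z strongly influences treatment X
y = rng.random(n) < np.where(z, 0.6, 0.2) + 0.1 * x    # Z and X both influence outcome Y

# Naive observational estimate (confounded by Z):
naive = y[x].mean()

# Backdoor adjustment: sum_z P(Y=1 | X=1, Z=z) P(Z=z)
adjusted = sum(y[x & (z == v)].mean() * (z == v).mean() for v in [True, False])

print(f"P(Y=1 | X=1)     ~ {naive:.3f}")     # inflated by confounding (~0.62 here)
print(f"P(Y=1 | do(X=1)) ~ {adjusted:.3f}")  # ~0.50 under this generating model
```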
6.3 Connections to Symbolic Methods
| Connection Point | Description |
|---|---|
| Structural causal models | Causal graphs are special Bayesian networks + interventional semantics |
| Knowledge graphs | Causal relationships can be encoded in knowledge graphs |
| Explainable AI | Causal reasoning provides stronger explanations than correlation |
| LLM reasoning | The causal reasoning capabilities of LLMs are an active research topic |
The Modern Fusion of Probabilistic Reasoning and Deep Learning
Modern AI increasingly combines probabilistic reasoning with deep learning: Variational Autoencoders (VAEs) merge deep generative models with variational inference, diffusion models learn to reverse a gradual noising process (which can be formulated as a stochastic differential equation) to model complex data distributions, and Bayesian deep learning places probability distributions over neural network weights.
References
- Koller & Friedman, Probabilistic Graphical Models: Principles and Techniques
- Bishop, Pattern Recognition and Machine Learning (Chapters 8-13)
- Pearl, Causality: Models, Reasoning, and Inference
- Murphy, Machine Learning: A Probabilistic Perspective
- Pyro Documentation
- PyMC Documentation