Probabilistic Graphical Models and Inference
Probabilistic Graphical Models (PGMs) are a powerful framework that combines probability theory with graph theory, using graph structures to represent dependencies among random variables and enabling efficient probabilistic inference and learning.
Learning path: Probability fundamentals → Bayes' theorem → PGM taxonomy → EM algorithm → Variational inference → Modern applications
Tribe-level perspective: full HMM derivations (forward-backward / Viterbi / Baum-Welch) and LDA Gibbs sampling are in The Master Algorithm — Graphical Models & HMMs.
Overview of Probabilistic Graphical Models
Directed vs. Undirected Models
Probabilistic graphical models fall into two main categories based on the type of graph:
| Dimension | Directed Models (Bayesian Networks) | Undirected Models (Markov Random Fields) |
|---|---|---|
| Graph type | Directed Acyclic Graph (DAG) | Undirected graph |
| Parameterization | Conditional probability distributions \(P(X_i \mid \text{Pa}(X_i))\) | Potential functions |
| Joint distribution | \(P(\mathbf{X}) = \prod_i P(X_i \mid \text{Pa}(X_i))\) | \(P(\mathbf{X}) = \frac{1}{Z}\prod_c \psi_c(\mathbf{X}_c)\) |
| Independence | d-separation criterion | Global/local/pairwise Markov properties |
| Typical applications | Causal reasoning, Naive Bayes, LDA | Image segmentation (CRF), physics modeling |
| Normalization | Automatically satisfied (conditional probabilities sum to 1) | Requires computing the partition function \(Z\) (often intractable) |
Directed models (Bayesian Networks) factorize the joint probability as:

$$P(X_1, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid \text{Pa}(X_i))$$

where \(\text{Pa}(X_i)\) denotes the set of parent nodes of \(X_i\).
Undirected models (Markov Random Fields) factorize the joint probability as:

$$P(\mathbf{X}) = \frac{1}{Z} \prod_{c \in \mathcal{C}} \psi_c(\mathbf{X}_c), \qquad Z = \sum_{\mathbf{X}} \prod_{c \in \mathcal{C}} \psi_c(\mathbf{X}_c)$$

where \(\mathcal{C}\) is the set of maximal cliques in the graph, \(\psi_c\) are non-negative potential functions, and \(Z\) is the partition function.
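As a concrete illustration, here is a minimal Python sketch of the directed factorization on the classic sprinkler network; the structure and all probability tables are made-up illustrative values, not something taken from this note.

```python
# Hypothetical 3-node network: Rain -> Sprinkler, and {Rain, Sprinkler} -> WetGrass.
P_rain = {True: 0.2, False: 0.8}
P_sprinkler_given_rain = {True: {True: 0.01, False: 0.99},
                          False: {True: 0.40, False: 0.60}}
P_wet_given_rain_sprinkler = {
    (True, True): {True: 0.99, False: 0.01},
    (True, False): {True: 0.80, False: 0.20},
    (False, True): {True: 0.90, False: 0.10},
    (False, False): {True: 0.00, False: 1.00},
}

def joint(rain, sprinkler, wet):
    """P(R, S, W) = P(R) * P(S | R) * P(W | R, S) -- the DAG factorization."""
    return (P_rain[rain]
            * P_sprinkler_given_rain[rain][sprinkler]
            * P_wet_given_rain_sprinkler[(rain, sprinkler)][wet])

# Marginalize out the unobserved causes to get P(WetGrass = True).
p_wet = sum(joint(r, s, True) for r in (True, False) for s in (True, False))
print(p_wet)
```

Any marginal or conditional query reduces to summing this product over the unobserved variables, which is exactly the structure that general-purpose inference algorithms exploit.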
Discriminative vs. Generative Models
| Dimension | Generative Models | Discriminative Models |
|---|---|---|
| Modeling target | Joint distribution \(P(X, Y)\) or \(P(X \mid Y)P(Y)\) | Conditional distribution \(P(Y \mid X)\) |
| Inference approach | Obtain posterior via Bayes' theorem | Directly learn the decision boundary |
| Advantages | Can generate new samples, handle missing data | Typically higher classification accuracy |
| Disadvantages | Requires stronger assumptions, higher computational cost | Cannot generate data |
| Typical algorithms | Naive Bayes, GMM, HMM, VAE | Logistic regression, SVM, CRF, neural networks |
The EM Algorithm
The Expectation-Maximization (EM) algorithm is the classical method for maximum-likelihood parameter estimation in probabilistic models with latent variables.
Problem Formulation
Given observed data \(\mathbf{X}\), latent variables \(\mathbf{Z}\), and model parameters \(\theta\), the goal is to maximize the incomplete-data log-likelihood:

$$\ell(\theta) = \log P(\mathbf{X} \mid \theta) = \log \sum_{\mathbf{Z}} P(\mathbf{X}, \mathbf{Z} \mid \theta)$$

Because the summation over latent variables sits inside the logarithm, direct optimization is typically very difficult.
E-Step and M-Step
The EM algorithm iteratively performs two steps to progressively increase the likelihood:
E-Step (Expectation Step): Fix parameters \(\theta^{(t)}\) and compute the posterior distribution of the latent variables:

$$P(\mathbf{Z} \mid \mathbf{X}, \theta^{(t)}) = \frac{P(\mathbf{X}, \mathbf{Z} \mid \theta^{(t)})}{P(\mathbf{X} \mid \theta^{(t)})}$$

Then compute the expected complete-data log-likelihood (the Q-function):

$$Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{\mathbf{Z} \sim P(\mathbf{Z} \mid \mathbf{X}, \theta^{(t)})}\left[\log P(\mathbf{X}, \mathbf{Z} \mid \theta)\right]$$

M-Step (Maximization Step): Maximize the Q-function to obtain new parameters:

$$\theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)})$$
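As a worked illustration of the two steps, the following sketch runs EM on the classic "two biased coins" mixture, where the coin used in each trial is latent. The flip counts and initial parameters are made up, and the mixing weights are assumed uniform (and not re-estimated) for brevity.

```python
import numpy as np

heads = np.array([5, 9, 8, 4, 7])   # heads observed in each of 5 trials (toy data)
n_flips = 10                         # flips per trial
theta = np.array([0.4, 0.6])         # initial guesses for P(heads) of coins A and B

for _ in range(20):
    # E-step: posterior responsibility of each coin for each trial.
    # The binomial coefficient cancels, so only p^h * (1-p)^(n-h) matters.
    lik = theta[None, :] ** heads[:, None] * (1 - theta[None, :]) ** (n_flips - heads[:, None])
    resp = lik / lik.sum(axis=1, keepdims=True)        # shape (trials, coins)

    # M-step: re-estimate each coin's bias as a responsibility-weighted head fraction.
    theta = (resp * heads[:, None]).sum(axis=0) / (resp.sum(axis=0) * n_flips)

print(theta)  # estimated biases of the two coins
```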
ELBO and Convergence
The theoretical foundation of the EM algorithm is the Evidence Lower Bound (ELBO). For any distribution \(q(\mathbf{Z})\):

$$\log P(\mathbf{X} \mid \theta) = \underbrace{\mathbb{E}_{q}\!\left[\log \frac{P(\mathbf{X}, \mathbf{Z} \mid \theta)}{q(\mathbf{Z})}\right]}_{\mathrm{ELBO}(q,\, \theta)} + \mathrm{KL}\!\left(q(\mathbf{Z}) \,\|\, P(\mathbf{Z} \mid \mathbf{X}, \theta)\right)$$

Since the KL divergence is non-negative, the ELBO is always a lower bound on the log-likelihood:

$$\log P(\mathbf{X} \mid \theta) \geq \mathrm{ELBO}(q, \theta)$$
- E-Step: Fix \(\theta\), set \(q(\mathbf{Z}) = P(\mathbf{Z} \mid \mathbf{X}, \theta)\), which makes the KL divergence zero so the ELBO equals the log-likelihood at the current parameters
- M-Step: Fix \(q\), maximize the ELBO with respect to \(\theta\)
Convergence guarantee: Each iteration never decreases the log-likelihood, i.e., \(\ell(\theta^{(t+1)}) \geq \ell(\theta^{(t)})\). However, EM only guarantees convergence to a local optimum, not the global optimum.
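The guarantee follows directly from the ELBO decomposition above: with \(q^{(t)}(\mathbf{Z}) = P(\mathbf{Z} \mid \mathbf{X}, \theta^{(t)})\),

$$\ell(\theta^{(t+1)}) \;\geq\; \mathrm{ELBO}\!\left(q^{(t)}, \theta^{(t+1)}\right) \;\geq\; \mathrm{ELBO}\!\left(q^{(t)}, \theta^{(t)}\right) \;=\; \ell(\theta^{(t)})$$

where the first inequality is the lower-bound property, the second is the M-step maximization, and the final equality is the E-step driving the KL term to zero.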
Relationship Between EM and Variational Inference
EM can be viewed as a special case of variational inference:
- The E-step of EM exactly computes the posterior \(P(\mathbf{Z} \mid \mathbf{X}, \theta)\)
- Variational inference (VI) uses a parameterized \(q_\phi(\mathbf{Z})\) to approximate the posterior when it is analytically intractable
- When the exact posterior is computable, VI reduces to EM
Naive Bayes
Conditional Independence Assumption
Naive Bayes is the simplest and most classic generative classifier. Its core assumption is that given the class label \(Y\), all features \(X_1, X_2, \dots, X_d\) are conditionally independent:

$$P(X_1, \dots, X_d \mid Y) = \prod_{j=1}^{d} P(X_j \mid Y)$$

The classification decision is based on Bayes' theorem:

$$P(Y = c \mid \mathbf{X}) = \frac{P(Y = c) \prod_{j=1}^{d} P(X_j \mid Y = c)}{P(\mathbf{X})} \;\propto\; P(Y = c) \prod_{j=1}^{d} P(X_j \mid Y = c)$$
At prediction time, we select the class with the highest posterior probability: \(\hat{y} = \arg\max_c P(Y = c) \prod_j P(X_j \mid Y = c)\).
Laplace Smoothing
In practice, if a feature value never co-occurs with some class in the training set, its estimated conditional probability is zero, which zeroes out the entire product for that class. Laplace smoothing addresses this by adding a small constant to every count:

$$P(X_j = v \mid Y = c) = \frac{N_{c,j,v} + \alpha}{N_c + \alpha |V_j|}$$

where \(N_{c,j,v}\) is the number of training samples of class \(c\) with \(X_j = v\), \(N_c\) is the number of training samples of class \(c\), \(\alpha\) is the smoothing parameter (\(\alpha = 1\) corresponds to Laplace smoothing), and \(|V_j|\) is the number of possible values of feature \(X_j\).
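A minimal sketch of these smoothed estimates, assuming NumPy and a made-up toy dataset with one categorical feature taking values in {0, 1, 2}:

```python
import numpy as np

y = np.array([0, 0, 0, 1, 1])   # toy class labels
x = np.array([0, 0, 1, 2, 2])   # toy feature values
alpha, n_values = 1.0, 3

def smoothed_conditionals(x, y, cls):
    # (count of value v in class cls + alpha) / (class size + alpha * |V_j|)
    counts = np.array([np.sum((y == cls) & (x == v)) for v in range(n_values)])
    return (counts + alpha) / (np.sum(y == cls) + alpha * n_values)

print(smoothed_conditionals(x, y, 0))  # P(X = v | Y = 0) for v = 0, 1, 2
print(smoothed_conditionals(x, y, 1))  # value 1 never occurs with class 1,
                                       # yet its smoothed probability stays > 0
```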
Text Classification Applications
Naive Bayes performs exceptionally well in text classification tasks (spam filtering, sentiment analysis):
| Variant | Feature representation | \(P(X_j \mid Y)\) distribution | Use case |
|---|---|---|---|
| Multinomial Naive Bayes | Word frequency / TF-IDF | Multinomial distribution | Text classification (most common) |
| Bernoulli Naive Bayes | Word presence (0/1) | Bernoulli distribution | Short texts, binary features |
| Gaussian Naive Bayes | Continuous values | Gaussian distribution | Classification with continuous features |
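For reference, a minimal usage sketch with scikit-learn's `MultinomialNB` (assuming scikit-learn is installed; the toy corpus below is made up):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["win a free prize now", "meeting rescheduled to monday",
         "free cash offer click now", "lunch tomorrow with the team"]
labels = ["spam", "ham", "spam", "ham"]

# alpha is the Laplace smoothing parameter from the formula above.
model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
model.fit(texts, labels)
print(model.predict(["free prize meeting"]))
```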
Gaussian Mixture Models
GMM Definition
A Gaussian Mixture Model (GMM) assumes that data is generated from a mixture of \(K\) Gaussian distributions:

$$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$

where \(\pi_k\) are the mixing weights (\(\sum_k \pi_k = 1\)), and \(\boldsymbol{\mu}_k\) and \(\boldsymbol{\Sigma}_k\) are the mean and covariance matrix of the \(k\)-th component, respectively.
Relationship to K-Means: K-Means can be viewed as a special case of GMM where all Gaussian components have covariance \(\sigma^2 I\) (isotropic) and \(\sigma \to 0\), causing GMM's soft assignment to degenerate into K-Means' hard assignment. For more on clustering, see the Unsupervised Learning notes.
EM Derivation for GMM
E-Step: Compute the responsibility of each data point belonging to the \(k\)-th component:

$$\gamma_{ik} = \frac{\pi_k \, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$$

M-Step: Update parameters using the responsibilities:

$$N_k = \sum_{i=1}^{N} \gamma_{ik}, \qquad \pi_k = \frac{N_k}{N}, \qquad \boldsymbol{\mu}_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik}\, \mathbf{x}_i, \qquad \boldsymbol{\Sigma}_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik}\, (\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^\top$$
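Putting the two steps together, here is a minimal from-scratch EM sketch for a 2-component GMM, assuming NumPy and SciPy and using synthetic data generated on the spot:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-2, 0], 0.8, size=(150, 2)),   # synthetic component 1
               rng.normal([3, 3], 1.2, size=(150, 2))])   # synthetic component 2
N, d, K = X.shape[0], X.shape[1], 2

pi = np.full(K, 1.0 / K)                      # mixing weights
mu = X[rng.choice(N, K, replace=False)]       # means initialized from random points
sigma = np.stack([np.eye(d)] * K)             # covariances initialized to identity

for _ in range(50):
    # E-step: responsibilities gamma_{ik} proportional to pi_k * N(x_i | mu_k, Sigma_k).
    dens = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], sigma[k])
                     for k in range(K)], axis=1)            # (N, K)
    gamma = dens / dens.sum(axis=1, keepdims=True)

    # M-step: re-estimate weights, means, and covariances from the responsibilities.
    Nk = gamma.sum(axis=0)
    pi = Nk / N
    mu = (gamma.T @ X) / Nk[:, None]
    for k in range(K):
        diff = X - mu[k]
        sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]

print(pi)
print(mu)
```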
Introduction to Variational Inference
Motivation
When the posterior distribution of latent variables \(P(\mathbf{Z} \mid \mathbf{X})\) cannot be computed analytically (as in complex Bayesian models), we need approximate inference methods. Variational Inference (VI) transforms the inference problem into an optimization problem.
ELBO Derivation
We introduce a variational distribution \(q(\mathbf{Z})\) to approximate the true posterior \(P(\mathbf{Z} \mid \mathbf{X})\). The objective is to minimize the KL divergence between them:

$$q^* = \arg\min_{q \in \mathcal{Q}} \mathrm{KL}\!\left(q(\mathbf{Z}) \,\|\, P(\mathbf{Z} \mid \mathbf{X})\right)$$

Expanding this yields:

$$\mathrm{KL}\!\left(q(\mathbf{Z}) \,\|\, P(\mathbf{Z} \mid \mathbf{X})\right) = \log P(\mathbf{X}) - \Big(\mathbb{E}_q[\log P(\mathbf{X}, \mathbf{Z})] - \mathbb{E}_q[\log q(\mathbf{Z})]\Big)$$

Since \(\log P(\mathbf{X})\) is a constant, minimizing the KL divergence is equivalent to maximizing the ELBO:

$$\mathrm{ELBO}(q) = \mathbb{E}_q[\log P(\mathbf{X}, \mathbf{Z})] + H[q]$$

where \(H[q] = -\mathbb{E}_q[\log q(\mathbf{Z})]\) is the entropy of \(q\).
Mean-Field Approximation
Mean-field variational inference assumes that the variational distribution factorizes into a product of independent distributions over individual latent variables (or groups of them):

$$q(\mathbf{Z}) = \prod_{j=1}^{M} q_j(\mathbf{Z}_j)$$

Under this assumption, the optimal update for each factor is:

$$\log q_j^*(\mathbf{Z}_j) = \mathbb{E}_{q_{-j}}\!\left[\log P(\mathbf{X}, \mathbf{Z})\right] + \text{const}$$

where \(q_{-j}\) denotes the product of all factors except \(q_j\). This yields a coordinate-ascent procedure that cycles through the factors until the ELBO converges.
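A minimal numerical sketch of this coordinate ascent on a made-up joint over two binary latent variables (small enough that the exact marginals are available for comparison):

```python
import numpy as np

# Made-up joint p(z1, z2); rows index z1, columns index z2.
p = np.array([[0.30, 0.20],
              [0.10, 0.40]])
log_p = np.log(p)

q1 = np.array([0.5, 0.5])   # factor q1(z1)
q2 = np.array([0.5, 0.5])   # factor q2(z2)

for _ in range(100):
    # q1(z1) proportional to exp( E_{q2}[ log p(z1, z2) ] )
    q1 = np.exp(log_p @ q2); q1 /= q1.sum()
    # q2(z2) proportional to exp( E_{q1}[ log p(z1, z2) ] )
    q2 = np.exp(log_p.T @ q1); q2 /= q2.sum()

print(q1, p.sum(axis=1))   # mean-field factor vs. exact marginal of z1
print(q2, p.sum(axis=0))   # they need not coincide: q ignores the z1-z2 coupling
```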
| Inference Method | Accuracy | Computational Efficiency | Applicable Scale |
|---|---|---|---|
| Exact inference | Exact | Exponential complexity | Small-scale |
| MCMC | Asymptotically exact | Slow, hard to diagnose convergence | Medium-scale |
| Variational inference | Approximate | Fast, scalable | Large-scale |
| EM | Point estimate | Moderate | Medium-scale |
PGMs in the ML Pipeline
Probabilistic graphical models do not exist in isolation -- they are tightly integrated with modern machine learning workflows:
- Data modeling stage: Using GMMs for density estimation, HMMs for temporal data modeling
- Feature engineering stage: Topic models like LDA extract latent features from documents
- Uncertainty quantification: Bayesian methods provide confidence intervals for predictions
- Causal inference: Bayesian networks for causal discovery and intervention analysis
- Generative models: VAEs combine variational inference with deep learning for image/text generation
Convergence with deep learning:
- VAE (Variational Autoencoder): Combines the ELBO framework of variational inference with neural network encoders/decoders (a numerical sketch of the resulting objective follows this list)
- Deep Bayesian networks: Using neural networks to parameterize conditional distributions
- Normalizing Flows: Constructing flexible variational distributions through invertible transformations
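To make the VAE connection concrete, here is a minimal numerical sketch (not a full VAE) of the per-example negative ELBO it minimizes, assuming a diagonal-Gaussian encoder \(q(\mathbf{z} \mid \mathbf{x})\) and a standard-normal prior; the inputs below are made-up placeholders standing in for encoder/decoder outputs.

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    # Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims.
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)

def negative_elbo(recon_log_likelihood, mu, logvar):
    # -ELBO = -E_q[log p(x|z)] + KL(q(z|x) || p(z)); in practice the expectation
    # is estimated with a single reparameterized sample of z.
    return -recon_log_likelihood + kl_to_standard_normal(mu, logvar)

# Placeholder encoder outputs for an 8-dimensional latent space.
print(negative_elbo(recon_log_likelihood=-35.2,
                    mu=np.zeros(8), logvar=np.zeros(8)))
```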
Probabilistic graphical models provide a principled framework for modeling uncertainty and incorporating structured prior knowledge. When combined with the representational power of deep learning, they form a critical pillar of modern AI.