Probabilistic Graphical Models and Inference
Probabilistic Graphical Models (PGMs) are a powerful framework that combines probability theory with graph theory, using graph structures to represent dependencies among random variables and enabling efficient probabilistic inference and learning.
Learning path: Probability fundamentals → Bayes' theorem → PGM taxonomy → EM algorithm → Variational inference → Modern applications
Tribe-level perspective: full HMM derivations (forward-backward / Viterbi / Baum-Welch) and LDA Gibbs sampling are in The Master Algorithm — Graphical Models & HMMs.
Overview of Probabilistic Graphical Models
Directed vs. Undirected Models
Probabilistic graphical models fall into two main categories based on the type of graph:
| Dimension | Directed Models (Bayesian Networks) | Undirected Models (Markov Random Fields) |
|---|---|---|
| Graph type | Directed Acyclic Graph (DAG) | Undirected graph |
| Parameterization | Conditional probability distributions \(P(X_i \mid \text{Pa}(X_i))\) | Potential functions |
| Joint distribution | \(P(\mathbf{X}) = \prod_i P(X_i \mid \text{Pa}(X_i))\) | \(P(\mathbf{X}) = \frac{1}{Z}\prod_c \psi_c(\mathbf{X}_c)\) |
| Independence | d-separation criterion | Global/local/pairwise Markov properties |
| Typical applications | Causal reasoning, Naive Bayes, LDA | Image segmentation (CRF), physics modeling |
| Normalization | Automatically satisfied (conditional probabilities sum to 1) | Requires computing the partition function \(Z\) (often intractable) |
Directed models (Bayesian Networks) factorize the joint probability as:

$$P(X_1, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid \text{Pa}(X_i))$$

where \(\text{Pa}(X_i)\) denotes the set of parent nodes of \(X_i\).
Undirected models (Markov Random Fields) factorize the joint probability as:

$$P(\mathbf{X}) = \frac{1}{Z} \prod_{c \in \mathcal{C}} \psi_c(\mathbf{X}_c), \qquad Z = \sum_{\mathbf{X}} \prod_{c \in \mathcal{C}} \psi_c(\mathbf{X}_c)$$

where \(\mathcal{C}\) is the set of maximal cliques in the graph, \(\psi_c\) are non-negative potential functions, and \(Z\) is the partition function.
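As a concrete illustration, here is a minimal Python sketch of the directed factorization on the classic sprinkler network; the structure and all probability tables are made-up illustrative values, not something taken from this note.

```python
# Hypothetical 3-node network: Rain -> Sprinkler, and {Rain, Sprinkler} -> WetGrass.
P_rain = {True: 0.2, False: 0.8}
P_sprinkler_given_rain = {True: {True: 0.01, False: 0.99},
                          False: {True: 0.40, False: 0.60}}
P_wet_given_rain_sprinkler = {
    (True, True): {True: 0.99, False: 0.01},
    (True, False): {True: 0.80, False: 0.20},
    (False, True): {True: 0.90, False: 0.10},
    (False, False): {True: 0.00, False: 1.00},
}

def joint(rain, sprinkler, wet):
    """P(R, S, W) = P(R) * P(S | R) * P(W | R, S) -- the DAG factorization."""
    return (P_rain[rain]
            * P_sprinkler_given_rain[rain][sprinkler]
            * P_wet_given_rain_sprinkler[(rain, sprinkler)][wet])

# Marginalize out the unobserved causes to get P(WetGrass = True).
p_wet = sum(joint(r, s, True) for r in (True, False) for s in (True, False))
print(p_wet)
```

Any marginal or conditional query reduces to summing this product over the unobserved variables, which is exactly the structure that general-purpose inference algorithms exploit.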
Discriminative vs. Generative Models
| Dimension | Generative Models | Discriminative Models |
|---|---|---|
| Modeling target | Joint distribution \(P(X, Y)\) or \(P(X \mid Y)P(Y)\) | Conditional distribution \(P(Y \mid X)\) |
| Inference approach | Obtain posterior via Bayes' theorem | Directly learn the decision boundary |
| Advantages | Can generate new samples, handle missing data | Typically higher classification accuracy |
| Disadvantages | Requires stronger assumptions, higher computational cost | Cannot generate data |
| Typical algorithms | Naive Bayes, GMM, HMM, VAE | Logistic regression, SVM, CRF, neural networks |
The EM Algorithm
The Expectation-Maximization (EM) algorithm is the classical method for maximum-likelihood parameter estimation in probabilistic models with latent variables.
Problem Formulation
Given observed data \(\mathbf{X}\), latent variables \(\mathbf{Z}\), and model parameters \(\theta\), the goal is to maximize the incomplete-data log-likelihood:

$$\ell(\theta) = \log P(\mathbf{X} \mid \theta) = \log \sum_{\mathbf{Z}} P(\mathbf{X}, \mathbf{Z} \mid \theta)$$

Because the summation over latent variables sits inside the logarithm, direct optimization is typically very difficult.
E-Step and M-Step
The EM algorithm iteratively performs two steps to progressively increase the likelihood:
E-Step (Expectation Step): Fix parameters \(\theta^{(t)}\) and compute the posterior distribution of the latent variables:

$$P(\mathbf{Z} \mid \mathbf{X}, \theta^{(t)}) = \frac{P(\mathbf{X}, \mathbf{Z} \mid \theta^{(t)})}{P(\mathbf{X} \mid \theta^{(t)})}$$

Then compute the expected complete-data log-likelihood (the Q-function):

$$Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{\mathbf{Z} \sim P(\mathbf{Z} \mid \mathbf{X}, \theta^{(t)})}\left[\log P(\mathbf{X}, \mathbf{Z} \mid \theta)\right]$$

M-Step (Maximization Step): Maximize the Q-function to obtain new parameters:

$$\theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)})$$
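As a worked illustration of the two steps, the following sketch runs EM on the classic "two biased coins" mixture, where the coin used in each trial is latent. The flip counts and initial parameters are made up, and the mixing weights are assumed uniform (and not re-estimated) for brevity.

```python
import numpy as np

heads = np.array([5, 9, 8, 4, 7])   # heads observed in each of 5 trials (toy data)
n_flips = 10                         # flips per trial
theta = np.array([0.4, 0.6])         # initial guesses for P(heads) of coins A and B

for _ in range(20):
    # E-step: posterior responsibility of each coin for each trial.
    # The binomial coefficient cancels, so only p^h * (1-p)^(n-h) matters.
    lik = theta[None, :] ** heads[:, None] * (1 - theta[None, :]) ** (n_flips - heads[:, None])
    resp = lik / lik.sum(axis=1, keepdims=True)        # shape (trials, coins)

    # M-step: re-estimate each coin's bias as a responsibility-weighted head fraction.
    theta = (resp * heads[:, None]).sum(axis=0) / (resp.sum(axis=0) * n_flips)

print(theta)  # estimated biases of the two coins
```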
ELBO and Convergence
The theoretical foundation of the EM algorithm is the Evidence Lower Bound (ELBO). For any distribution \(q(\mathbf{Z})\):

$$\log P(\mathbf{X} \mid \theta) = \underbrace{\mathbb{E}_{q}\!\left[\log \frac{P(\mathbf{X}, \mathbf{Z} \mid \theta)}{q(\mathbf{Z})}\right]}_{\mathrm{ELBO}(q,\, \theta)} + \mathrm{KL}\!\left(q(\mathbf{Z}) \,\|\, P(\mathbf{Z} \mid \mathbf{X}, \theta)\right)$$

Since the KL divergence is non-negative, the ELBO is always a lower bound on the log-likelihood:

$$\log P(\mathbf{X} \mid \theta) \geq \mathrm{ELBO}(q, \theta)$$
- E-Step: Fix \(\theta\), set \(q(\mathbf{Z}) = P(\mathbf{Z} \mid \mathbf{X}, \theta)\), which makes the KL divergence zero so the ELBO equals the log-likelihood at the current parameters
- M-Step: Fix \(q\), maximize the ELBO with respect to \(\theta\)
Convergence guarantee: Each iteration never decreases the log-likelihood, i.e., \(\ell(\theta^{(t+1)}) \geq \ell(\theta^{(t)})\). However, EM only guarantees convergence to a local optimum, not the global optimum.
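The guarantee follows directly from the ELBO decomposition above: with \(q^{(t)}(\mathbf{Z}) = P(\mathbf{Z} \mid \mathbf{X}, \theta^{(t)})\),

$$\ell(\theta^{(t+1)}) \;\geq\; \mathrm{ELBO}\!\left(q^{(t)}, \theta^{(t+1)}\right) \;\geq\; \mathrm{ELBO}\!\left(q^{(t)}, \theta^{(t)}\right) \;=\; \ell(\theta^{(t)})$$

where the first inequality is the lower-bound property, the second is the M-step maximization, and the final equality is the E-step driving the KL term to zero.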
Relationship Between EM and Variational Inference
EM can be viewed as a special case of variational inference:
- The E-step of EM exactly computes the posterior \(P(\mathbf{Z} \mid \mathbf{X}, \theta)\)
- Variational inference (VI) uses a parameterized \(q_\phi(\mathbf{Z})\) to approximate the posterior when it is analytically intractable
- When the exact posterior is computable, VI reduces to EM
Naive Bayes
Conditional Independence Assumption
Naive Bayes is the simplest and most classic generative classifier. Its core assumption is that given the class label \(Y\), all features \(X_1, X_2, \dots, X_d\) are conditionally independent:

$$P(X_1, \dots, X_d \mid Y) = \prod_{j=1}^{d} P(X_j \mid Y)$$

The classification decision is based on Bayes' theorem:

$$P(Y = c \mid \mathbf{X}) = \frac{P(Y = c) \prod_{j=1}^{d} P(X_j \mid Y = c)}{P(\mathbf{X})} \;\propto\; P(Y = c) \prod_{j=1}^{d} P(X_j \mid Y = c)$$
At prediction time, we select the class with the highest posterior probability: \(\hat{y} = \arg\max_c P(Y = c) \prod_j P(X_j \mid Y = c)\).
Laplace Smoothing
In practice, if a feature value never co-occurs with some class in the training set, its estimated conditional probability is zero, which zeroes out the entire product for that class. Laplace smoothing addresses this by adding a small constant to every count:

$$P(X_j = v \mid Y = c) = \frac{N_{c,j,v} + \alpha}{N_c + \alpha |V_j|}$$

where \(N_{c,j,v}\) is the number of training samples of class \(c\) with \(X_j = v\), \(N_c\) is the number of training samples of class \(c\), \(\alpha\) is the smoothing parameter (\(\alpha = 1\) corresponds to Laplace smoothing), and \(|V_j|\) is the number of possible values of feature \(X_j\).
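A minimal sketch of these smoothed estimates, assuming NumPy and a made-up toy dataset with one categorical feature taking values in {0, 1, 2}:

```python
import numpy as np

y = np.array([0, 0, 0, 1, 1])   # toy class labels
x = np.array([0, 0, 1, 2, 2])   # toy feature values
alpha, n_values = 1.0, 3

def smoothed_conditionals(x, y, cls):
    # (count of value v in class cls + alpha) / (class size + alpha * |V_j|)
    counts = np.array([np.sum((y == cls) & (x == v)) for v in range(n_values)])
    return (counts + alpha) / (np.sum(y == cls) + alpha * n_values)

print(smoothed_conditionals(x, y, 0))  # P(X = v | Y = 0) for v = 0, 1, 2
print(smoothed_conditionals(x, y, 1))  # value 1 never occurs with class 1,
                                       # yet its smoothed probability stays > 0
```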
Text Classification Applications
Naive Bayes performs exceptionally well in text classification tasks (spam filtering, sentiment analysis):
| Variant | Feature representation | \(P(X_j \mid Y)\) distribution | Use case |
|---|---|---|---|
| Multinomial Naive Bayes | Word frequency / TF-IDF | Multinomial distribution | Text classification (most common) |
| Bernoulli Naive Bayes | Word presence (0/1) | Bernoulli distribution | Short texts, binary features |
| Gaussian Naive Bayes | Continuous values | Gaussian distribution | Classification with continuous features |
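For reference, a minimal usage sketch with scikit-learn's `MultinomialNB` (assuming scikit-learn is installed; the toy corpus below is made up):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["win a free prize now", "meeting rescheduled to monday",
         "free cash offer click now", "lunch tomorrow with the team"]
labels = ["spam", "ham", "spam", "ham"]

# alpha is the Laplace smoothing parameter from the formula above.
model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
model.fit(texts, labels)
print(model.predict(["free prize meeting"]))
```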
Gaussian Mixture Models
GMM Definition
A Gaussian Mixture Model (GMM) assumes that data is generated from a mixture of \(K\) Gaussian distributions:

$$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$

where \(\pi_k\) are the mixing weights (\(\sum_k \pi_k = 1\)), and \(\boldsymbol{\mu}_k\) and \(\boldsymbol{\Sigma}_k\) are the mean and covariance matrix of the \(k\)-th component, respectively.
Relationship to K-Means: K-Means can be viewed as a special case of GMM where all Gaussian components have covariance \(\sigma^2 I\) (isotropic) and \(\sigma \to 0\), causing GMM's soft assignment to degenerate into K-Means' hard assignment. For more on clustering, see the Unsupervised Learning notes.
EM Derivation for GMM
E-Step: Compute the responsibility of each data point belonging to the \(k\)-th component:

$$\gamma_{ik} = \frac{\pi_k \, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$$

M-Step: Update parameters using the responsibilities:

$$N_k = \sum_{i=1}^{N} \gamma_{ik}, \qquad \pi_k = \frac{N_k}{N}, \qquad \boldsymbol{\mu}_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik}\, \mathbf{x}_i, \qquad \boldsymbol{\Sigma}_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik}\, (\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^\top$$
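Putting the two steps together, here is a minimal from-scratch EM sketch for a 2-component GMM, assuming NumPy and SciPy and using synthetic data generated on the spot:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-2, 0], 0.8, size=(150, 2)),   # synthetic component 1
               rng.normal([3, 3], 1.2, size=(150, 2))])   # synthetic component 2
N, d, K = X.shape[0], X.shape[1], 2

pi = np.full(K, 1.0 / K)                      # mixing weights
mu = X[rng.choice(N, K, replace=False)]       # means initialized from random points
sigma = np.stack([np.eye(d)] * K)             # covariances initialized to identity

for _ in range(50):
    # E-step: responsibilities gamma_{ik} proportional to pi_k * N(x_i | mu_k, Sigma_k).
    dens = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], sigma[k])
                     for k in range(K)], axis=1)            # (N, K)
    gamma = dens / dens.sum(axis=1, keepdims=True)

    # M-step: re-estimate weights, means, and covariances from the responsibilities.
    Nk = gamma.sum(axis=0)
    pi = Nk / N
    mu = (gamma.T @ X) / Nk[:, None]
    for k in range(K):
        diff = X - mu[k]
        sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]

print(pi)
print(mu)
```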
Introduction to Variational Inference
Motivation
When the posterior distribution of latent variables \(P(\mathbf{Z} \mid \mathbf{X})\) cannot be computed analytically (as in complex Bayesian models), we need approximate inference methods. Variational Inference (VI) transforms the inference problem into an optimization problem.
ELBO Derivation
We introduce a variational distribution \(q(\mathbf{Z})\) to approximate the true posterior \(P(\mathbf{Z} \mid \mathbf{X})\). The objective is to minimize the KL divergence between them:

$$q^* = \arg\min_{q \in \mathcal{Q}} \mathrm{KL}\!\left(q(\mathbf{Z}) \,\|\, P(\mathbf{Z} \mid \mathbf{X})\right)$$

Expanding this yields:

$$\mathrm{KL}\!\left(q(\mathbf{Z}) \,\|\, P(\mathbf{Z} \mid \mathbf{X})\right) = \log P(\mathbf{X}) - \Big(\mathbb{E}_q[\log P(\mathbf{X}, \mathbf{Z})] - \mathbb{E}_q[\log q(\mathbf{Z})]\Big)$$

Since \(\log P(\mathbf{X})\) is a constant, minimizing the KL divergence is equivalent to maximizing the ELBO:

$$\mathrm{ELBO}(q) = \mathbb{E}_q[\log P(\mathbf{X}, \mathbf{Z})] + H[q]$$

where \(H[q] = -\mathbb{E}_q[\log q(\mathbf{Z})]\) is the entropy of \(q\).
Mean-Field Approximation
Mean-field variational inference assumes that the variational distribution factorizes into a product of independent distributions over individual latent variables (or groups of them):

$$q(\mathbf{Z}) = \prod_{j=1}^{M} q_j(\mathbf{Z}_j)$$

Under this assumption, the optimal update for each factor is:

$$\log q_j^*(\mathbf{Z}_j) = \mathbb{E}_{q_{-j}}\!\left[\log P(\mathbf{X}, \mathbf{Z})\right] + \text{const}$$

where \(q_{-j}\) denotes the product of all factors except \(q_j\). This yields a coordinate-ascent procedure that cycles through the factors until the ELBO converges.
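A minimal numerical sketch of this coordinate ascent on a made-up joint over two binary latent variables (small enough that the exact marginals are available for comparison):

```python
import numpy as np

# Made-up joint p(z1, z2); rows index z1, columns index z2.
p = np.array([[0.30, 0.20],
              [0.10, 0.40]])
log_p = np.log(p)

q1 = np.array([0.5, 0.5])   # factor q1(z1)
q2 = np.array([0.5, 0.5])   # factor q2(z2)

for _ in range(100):
    # q1(z1) proportional to exp( E_{q2}[ log p(z1, z2) ] )
    q1 = np.exp(log_p @ q2); q1 /= q1.sum()
    # q2(z2) proportional to exp( E_{q1}[ log p(z1, z2) ] )
    q2 = np.exp(log_p.T @ q1); q2 /= q2.sum()

print(q1, p.sum(axis=1))   # mean-field factor vs. exact marginal of z1
print(q2, p.sum(axis=0))   # they need not coincide: q ignores the z1-z2 coupling
```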
| Inference Method | Accuracy | Computational Efficiency | Applicable Scale |
|---|---|---|---|
| Exact inference | Exact | Exponential complexity | Small-scale |
| MCMC | Asymptotically exact | Slow, hard to diagnose convergence | Medium-scale |
| Variational inference | Approximate | Fast, scalable | Large-scale |
| EM | Point estimate | Moderate | Medium-scale |
PGMs in the ML Pipeline
Probabilistic graphical models do not exist in isolation -- they are tightly integrated with modern machine learning workflows:
- Data modeling stage: Using GMMs for density estimation, HMMs for temporal data modeling
- Feature engineering stage: Topic models like LDA extract latent features from documents
- Uncertainty quantification: Bayesian methods provide confidence intervals for predictions
- Causal inference: Bayesian networks for causal discovery and intervention analysis
- Generative models: VAEs combine variational inference with deep learning for image/text generation
Convergence with deep learning:
- VAE (Variational Autoencoder): Combines the ELBO framework of variational inference with neural network encoders/decoders (a numerical sketch of the resulting objective follows this list)
- Deep Bayesian networks: Using neural networks to parameterize conditional distributions
- Normalizing Flows: Constructing flexible variational distributions through invertible transformations
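To make the VAE connection concrete, here is a minimal numerical sketch (not a full VAE) of the per-example negative ELBO it minimizes, assuming a diagonal-Gaussian encoder \(q(\mathbf{z} \mid \mathbf{x})\) and a standard-normal prior; the inputs below are made-up placeholders standing in for encoder/decoder outputs.

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    # Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims.
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)

def negative_elbo(recon_log_likelihood, mu, logvar):
    # -ELBO = -E_q[log p(x|z)] + KL(q(z|x) || p(z)); in practice the expectation
    # is estimated with a single reparameterized sample of z.
    return -recon_log_likelihood + kl_to_standard_normal(mu, logvar)

# Placeholder encoder outputs for an 8-dimensional latent space.
print(negative_elbo(recon_log_likelihood=-35.2,
                    mu=np.zeros(8), logvar=np.zeros(8)))
```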
Probabilistic graphical models provide a principled framework for modeling uncertainty and incorporating structured prior knowledge. When combined with the representational power of deep learning, they form a critical pillar of modern AI.