Bayesian Learning
Introduction
Bayesian learning uses probabilistic inference as the foundational framework for machine learning. Unlike point-estimation methods, Bayesian approaches maintain the full posterior distribution over parameters, naturally quantifying uncertainty. This article covers Bayesian inference, conjugate priors, MCMC, variational inference, Bayesian neural networks, and Gaussian processes.
Tribe-level perspective: this article focuses on the math and algorithms; for the Bayesian tribe as a whole, à la Domingos (graphical models, probabilistic programming, Bayesian deep learning), see The Master Algorithm notebook — Bayesians.
1. Bayesian Inference Fundamentals
1.1 Bayes' Rule
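Bayes' rule combines the prior with the likelihood of the observed data to give the posterior:

\[ p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} \]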
| Term | Name | Meaning |
|---|---|---|
| \(p(\theta \mid D)\) | Posterior | Belief about parameters after observing data |
| \(p(D \mid \theta)\) | Likelihood | Probability of data given parameters |
| \(p(\theta)\) | Prior | Belief about parameters before observing data |
| \(p(D)\) | Evidence | Marginal likelihood, normalization constant |
The evidence (marginal likelihood) in the denominator is the likelihood averaged over the prior:

\[ p(D) = \int p(D \mid \theta)\, p(\theta)\, d\theta \]
1.2 Bayesian Prediction
Predictions for a new input \(x^*\) are made by averaging over all possible parameter values, weighted by the posterior, rather than by plugging in a single parameter estimate:

\[ p(y^* \mid x^*, D) = \int p(y^* \mid x^*, \theta)\, p(\theta \mid D)\, d\theta \]
This naturally propagates parameter uncertainty into predictions.
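In practice this integral is usually approximated by a Monte Carlo average over posterior samples. A minimal sketch, where `posterior_samples` and `predict(x, theta)` are hypothetical stand-ins for parameter draws (e.g. from MCMC) and a per-draw model prediction:

import numpy as np

def posterior_predictive(x_new, posterior_samples, predict):
    # Average per-draw predictions to approximate the predictive integral
    preds = np.array([predict(x_new, theta) for theta in posterior_samples])
    return preds.mean(axis=0), preds.std(axis=0)  # predictive mean and spread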
2. MAP vs MLE
2.1 Maximum Likelihood Estimation (MLE)
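MLE picks the single parameter value that maximizes the likelihood of the observed data:

\[ \hat{\theta}_{\text{MLE}} = \arg\max_{\theta}\, p(D \mid \theta) \]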
- Ignores prior information
- May overfit (especially with limited data)
2.2 Maximum A Posteriori Estimation (MAP)
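MAP picks the mode of the posterior, which adds the log-prior to the log-likelihood:

\[ \hat{\theta}_{\text{MAP}} = \arg\max_{\theta}\, p(\theta \mid D) = \arg\max_{\theta}\, \big[ \log p(D \mid \theta) + \log p(\theta) \big] \]

For example, a Gaussian prior \(\theta \sim \mathcal{N}(0, \sigma^2 I)\) contributes \(-\log p(\theta) = \frac{\|\theta\|_2^2}{2\sigma^2} + \text{const}\) to the objective, which is exactly an L2 penalty.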
- The prior acts as a regularizer
- Gaussian prior \(\theta \sim \mathcal{N}(0, \sigma^2 I)\) is equivalent to L2 regularization
- Laplace prior \(\theta \sim \text{Laplace}(0, b)\) is equivalent to L1 regularization
2.3 Full Bayesian vs Point Estimates
| Method | Uses | Uncertainty |
|---|---|---|
| MLE | Single \(\hat{\theta}\) | None |
| MAP | Single \(\hat{\theta}\) | None (regularized via prior) |
| Full Bayesian | Entire \(p(\theta \mid D)\) | Complete uncertainty quantification |
3. Conjugate Priors
When the prior and posterior belong to the same distribution family, the prior is called a conjugate prior of the likelihood.
| Likelihood | Conjugate Prior | Posterior | Application |
|---|---|---|---|
| Bernoulli \(\text{Ber}(p)\) | Beta \(\text{Beta}(\alpha, \beta)\) | \(\text{Beta}(\alpha+k, \beta+n-k)\) | Coin bias |
| Multinomial | Dirichlet | Dirichlet | Category probabilities |
| Gaussian (known variance) | Gaussian | Gaussian | Mean estimation |
| Gaussian (known mean) | Inverse Gamma | Inverse Gamma | Variance estimation |
| Poisson | Gamma | Gamma | Rate estimation |
Example: Beta-Bernoulli
Flipping a coin \(n\) times and observing \(k\) heads, with a \(\text{Beta}(\alpha, \beta)\) prior on the bias \(p\), gives the posterior:

\[ p(p \mid D) = \text{Beta}(\alpha + k,\; \beta + n - k) \]

With \(\alpha = \beta = 1\) (uniform prior), observing 10 flips with 7 heads:

\[ p(p \mid D) = \text{Beta}(8, 4), \qquad \mathbb{E}[p \mid D] = \frac{8}{12} \approx 0.67 \quad \text{vs. MLE } \hat{p} = 0.7 \]
The posterior mean (0.67) is pulled from the MLE (0.7) toward the prior mean (0.5); the prior acts as automatic regularization.
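A minimal SciPy sketch of this update (the variable names below simply mirror the numbers above):

from scipy import stats

alpha, beta = 1, 1                                            # uniform Beta(1, 1) prior
heads, flips = 7, 10
posterior = stats.beta(alpha + heads, beta + flips - heads)   # Beta(8, 4)
print(posterior.mean())                                       # ~0.667, vs. MLE 7/10 = 0.7
print(posterior.interval(0.94))                               # central 94% credible interval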
4. MCMC Methods
When the posterior distribution has no closed form, Markov Chain Monte Carlo (MCMC) methods are used to draw samples from it.
4.1 Metropolis-Hastings
import numpy as np

def metropolis_hastings(log_posterior, initial, n_samples, proposal_std=1.0):
    samples = [initial]
    current = initial
    for _ in range(n_samples):
        # Propose a new state (symmetric Gaussian random walk, so the
        # Hastings correction cancels in the acceptance ratio)
        proposal = current + np.random.normal(0, proposal_std, size=current.shape)
        # Log acceptance ratio
        log_alpha = log_posterior(proposal) - log_posterior(current)
        # Accept or reject; on rejection the current state is repeated
        if np.log(np.random.uniform()) < log_alpha:
            current = proposal
        samples.append(current)
    return np.array(samples)
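Usage sketch, assuming a one-dimensional standard normal target:

log_post = lambda theta: -0.5 * np.sum(theta**2)    # unnormalized log N(0, 1)
samples = metropolis_hastings(log_post, np.zeros(1), n_samples=5000)
print(samples.mean(), samples.std())                # roughly 0 and 1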
4.2 Gibbs Sampling
Samples each dimension in turn, conditioned on the current values of all the others:

\[ \theta_i^{(t+1)} \sim p\big(\theta_i \mid \theta_1^{(t+1)}, \ldots, \theta_{i-1}^{(t+1)}, \theta_{i+1}^{(t)}, \ldots, \theta_d^{(t)}, D\big) \]
Applicable when each full conditional distribution is easy to sample from, as in the sketch below.
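A minimal sketch for a zero-mean bivariate Gaussian with correlation \(\rho\), where both conditionals are univariate Gaussians:

import numpy as np

def gibbs_bivariate_normal(rho, n_samples, seed=0):
    # Gibbs sampler for a zero-mean, unit-variance bivariate Gaussian
    rng = np.random.default_rng(seed)
    x1, x2 = 0.0, 0.0
    cond_std = np.sqrt(1.0 - rho**2)
    samples = np.empty((n_samples, 2))
    for t in range(n_samples):
        x1 = rng.normal(rho * x2, cond_std)   # x1 | x2 ~ N(rho*x2, 1 - rho^2)
        x2 = rng.normal(rho * x1, cond_std)   # x2 | x1 ~ N(rho*x1, 1 - rho^2)
        samples[t] = (x1, x2)
    return samples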
4.3 Hamiltonian Monte Carlo (HMC)
Simulates Hamiltonian dynamics to improve sampling efficiency in high-dimensional spaces:
- Treats the parameter space as a physical system
- Introduces "momentum" variables to assist exploration
- NUTS (No-U-Turn Sampler) automatically tunes the trajectory length and step size
# PyMC example (assumes `data` is a 1-D array of observations)
import pymc as pm
import arviz as az

with pm.Model() as model:
    # Priors
    mu = pm.Normal("mu", mu=0, sigma=10)
    sigma = pm.HalfNormal("sigma", sigma=5)
    # Likelihood
    y_obs = pm.Normal("y", mu=mu, sigma=sigma, observed=data)
    # NUTS sampling
    trace = pm.sample(2000, tune=1000)

# Posterior analysis (plotting and summaries live in ArviZ)
az.plot_trace(trace)
az.summary(trace)
4.4 MCMC Diagnostics
| Diagnostic | Method | Rule of thumb |
|---|---|---|
| Convergence | \(\hat{R}\) (Gelman-Rubin statistic) | \(\hat{R} < 1.01\) |
| Effective sample size | ESS | ESS > 400 |
| Autocorrelation | Autocorrelation plot | Rapid decay |
| Mixing | Trace plot | Stationary, well-mixed chains |
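These can be computed directly from a trace with ArviZ, e.g. using the `trace` from the PyMC example above:

import arviz as az

print(az.rhat(trace))    # per-variable R-hat, want < 1.01
print(az.ess(trace))     # effective sample size, want > 400
az.plot_autocorr(trace)  # autocorrelation should decay quickly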
5. Variational Inference
5.1 Core Idea
Approximate the true posterior \(p(\theta \mid D)\) with a simpler distribution \(q(\theta)\) from a tractable family \(\mathcal{Q}\) by minimizing the KL divergence:

\[ q^* = \arg\min_{q \in \mathcal{Q}} \text{KL}\big(q(\theta) \,\|\, p(\theta \mid D)\big) \]

This is equivalent to maximizing the ELBO (Evidence Lower Bound), since \(\log p(D) = \text{ELBO}(q) + \text{KL}\big(q \,\|\, p(\theta \mid D)\big)\) and \(\log p(D)\) does not depend on \(q\):

\[ \text{ELBO}(q) = \mathbb{E}_{q}\big[\log p(D \mid \theta)\big] - \text{KL}\big(q(\theta) \,\|\, p(\theta)\big) \]
5.2 Mean-Field Approximation
Assumes the approximate posterior factorizes into independent factors:

\[ q(\theta) = \prod_{i} q_i(\theta_i) \]

The optimal solution for each factor, holding the others fixed, is:

\[ \log q_j^*(\theta_j) = \mathbb{E}_{q_{-j}}\big[\log p(D, \theta)\big] + \text{const} \]
5.3 MCMC vs Variational Inference
| Dimension | MCMC | Variational Inference |
|---|---|---|
| Accuracy | Asymptotically exact | Approximate (limited by \(\mathcal{Q}\)) |
| Speed | Slow | Fast |
| Scalability | Poor (high-dimensional difficulty) | Good |
| Diagnostics | Standard methods available | Difficult to diagnose approximation quality |
| Suited for | Small-scale, precise posterior needed | Large-scale, rapid inference |
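In PyMC, the same model definition can be fit with mean-field ADVI instead of NUTS; a sketch assuming the `model` (and import) from the example in section 4.3:

with model:
    approx = pm.fit(n=20000, method="advi")   # mean-field variational approximation
    vi_trace = approx.sample(1000)            # draws from q(theta)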
6. Bayesian Neural Networks (BNN)
6.1 Motivation
Standard neural networks give point predictions and cannot quantify uncertainty. BNNs place a prior \(p(w)\) over the weights and predict by integrating over the weight posterior:

\[ p(y \mid x, D) = \int p(y \mid x, w)\, p(w \mid D)\, dw \]
6.2 Approximation Methods
| Method | Principle | Practicality |
|---|---|---|
| VI (Bayes by Backprop) | Variational inference, reparameterization trick | Moderate |
| MC Dropout | Dropout as approximate VI | High (simple to implement) |
| Deep Ensemble | Multiple independently trained models | High |
| SWAG | Gaussian approximation based on SGD trajectory | Moderate |
MC Dropout:
import numpy as np

# Train with Dropout as usual; at prediction time keep Dropout active and
# average over multiple stochastic forward passes (Keras-style model assumed)
predictions = []
for _ in range(100):
    pred = model(x, training=True)        # training=True keeps Dropout active
    predictions.append(np.asarray(pred))

predictions = np.stack(predictions)
mean = predictions.mean(axis=0)           # predictive mean
std = predictions.std(axis=0)             # predictive uncertainty
7. Gaussian Processes (GP)
7.1 Definition
A Gaussian process is a distribution over functions, written \(f(x) \sim \mathcal{GP}(m(x), k(x, x'))\). The function values at any finite set of points follow a multivariate Gaussian distribution:

\[ \big[f(x_1), \ldots, f(x_n)\big]^T \sim \mathcal{N}(\mathbf{m}, K), \qquad K_{ij} = k(x_i, x_j) \]
where \(m(x)\) is the mean function and \(k(x, x')\) is the kernel function.
7.2 GP Regression
Given noisy observations \((X, y)\) with \(y = f(X) + \varepsilon\), \(\varepsilon \sim \mathcal{N}(0, \sigma_n^2 I)\), the prediction for a new input \(x^*\) is Gaussian with:

\[ \mu^* = k_*^T (K + \sigma_n^2 I)^{-1} y, \qquad \sigma_*^2 = k(x^*, x^*) - k_*^T (K + \sigma_n^2 I)^{-1} k_* \]

where \(K = k(X, X)\) and \(k_* = k(X, x^*)\).
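A compact NumPy sketch of these equations, with a zero mean function and an RBF kernel (inputs as \((n, d)\) arrays):

import numpy as np

def rbf_kernel(X1, X2, length_scale=1.0, sigma_f=1.0):
    # Squared-exponential (RBF) kernel matrix
    sq_dists = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return sigma_f**2 * np.exp(-0.5 * sq_dists / length_scale**2)

def gp_predict(X_train, y_train, X_test, noise=1e-2):
    # Posterior mean and standard deviation of a zero-mean GP at the test inputs
    K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    K_s = rbf_kernel(X_train, X_test)
    K_ss = rbf_kernel(X_test, X_test)
    mean = K_s.T @ np.linalg.solve(K, y_train)
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
    return mean, np.sqrt(np.diag(cov))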
7.3 Common Kernel Functions
| Kernel | Expression | Characteristics |
|---|---|---|
| RBF (Radial Basis Function) | \(k(x, x') = \sigma^2 \exp\left(-\frac{\|x-x'\|^2}{2l^2}\right)\) | Smooth, most commonly used |
| Matern | \(k(x,x') = \frac{2^{1-\nu}}{\Gamma(\nu)}\left(\frac{\sqrt{2\nu}r}{l}\right)^\nu K_\nu\left(\frac{\sqrt{2\nu}r}{l}\right)\) | Tunable smoothness via \(\nu\) (\(r\) is the input distance) |
| Linear | \(k(x, x') = \sigma^2 x^T x'\) | Bayesian linear regression |
| Periodic | \(k(x, x') = \sigma^2 \exp\left(-\frac{2\sin^2(\pi\|x-x'\|/p)}{l^2}\right)\) | Periodic data |
7.4 GP Pros and Cons
| Advantages | Disadvantages |
|---|---|
| Natural uncertainty quantification | Computational complexity \(O(n^3)\) |
| No need to specify model structure | Not suitable for large-scale data |
| Flexible kernel functions | Difficult with high-dimensional inputs |
| Good performance with small data | Kernel selection requires expertise |