
Bayesian Learning

Introduction

Bayesian learning uses probabilistic inference as the foundational framework for machine learning. Unlike point-estimation methods, Bayesian approaches maintain the full posterior distribution over parameters, naturally quantifying uncertainty. This article covers Bayesian inference, conjugate priors, MCMC, variational inference, Bayesian neural networks, and Gaussian processes.

Tribe-level perspective: this article focuses on the math and algorithms; for the Bayesian tribe as a whole, à la Domingos (graphical models / probabilistic programming / Bayesian deep learning), see The Master Algorithm notebook — Bayesians.


1. Bayesian Inference Fundamentals

1.1 Bayes' Rule

\[ p(\theta | D) = \frac{p(D | \theta) \, p(\theta)}{p(D)} \]
| Term | Name | Meaning |
|---|---|---|
| \(p(\theta \mid D)\) | Posterior | Belief about parameters after observing data |
| \(p(D \mid \theta)\) | Likelihood | Probability of data given parameters |
| \(p(\theta)\) | Prior | Belief about parameters before observing data |
| \(p(D)\) | Evidence | Marginal likelihood; normalization constant |

Where the evidence (marginal likelihood) is:

\[ p(D) = \int p(D|\theta) p(\theta) \, d\theta \]

1.2 Bayesian Prediction

Predictions are made by taking a weighted average over all possible parameter values, rather than using a single parameter estimate:

\[ p(y^* | x^*, D) = \int p(y^* | x^*, \theta) \, p(\theta | D) \, d\theta \]

This naturally propagates parameter uncertainty into predictions.
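
In practice this integral is rarely available in closed form; the standard workaround is Monte Carlo averaging over posterior samples. Below is a minimal numpy sketch for the coin-flip model treated in Section 3 (all numbers illustrative):

import numpy as np

rng = np.random.default_rng(0)

# Posterior over heads-probability p is Beta(alpha + k, beta + n - k)
alpha, beta, k, n = 1, 1, 7, 10
p_samples = rng.beta(alpha + k, beta + n - k, size=10_000)

# Posterior predictive P(next flip = heads | data) = E_posterior[p],
# approximated by averaging over the posterior samples
print(p_samples.mean())  # ~0.667 = (alpha + k) / (alpha + beta + n)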


2. MAP vs MLE

2.1 Maximum Likelihood Estimation (MLE)

\[ \hat{\theta}_{MLE} = \arg\max_\theta p(D|\theta) = \arg\max_\theta \sum_{i=1}^{n} \log p(x_i | \theta) \]
  • Ignores prior information
  • May overfit (especially with limited data)

2.2 Maximum A Posteriori Estimation (MAP)

\[ \hat{\theta}_{MAP} = \arg\max_\theta p(\theta|D) = \arg\max_\theta \left[\log p(D|\theta) + \log p(\theta)\right] \]
  • The prior acts as a regularizer
  • Gaussian prior \(\theta \sim \mathcal{N}(0, \sigma^2 I)\) is equivalent to L2 regularization
  • Laplace prior \(\theta \sim \text{Laplace}(0, b)\) is equivalent to L1 regularization
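
To make the Gaussian-prior case concrete: for linear regression with noise variance \(\sigma_n^2\) and prior \(\theta \sim \mathcal{N}(0, \sigma^2 I)\), the MAP estimate is exactly ridge regression with \(\lambda = \sigma_n^2 / \sigma^2\). A minimal numpy sketch on synthetic data (all names illustrative):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
theta_true = np.array([2.0, -1.0, 0.5])
y = X @ theta_true + rng.normal(scale=0.5, size=50)

sigma_n2, sigma2 = 0.5**2, 1.0**2  # noise variance, prior variance
lam = sigma_n2 / sigma2            # equivalent L2 strength

# MAP under the Gaussian prior = ridge regression closed form
theta_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# MLE = ordinary least squares (the lam -> 0 limit)
theta_mle = np.linalg.lstsq(X, y, rcond=None)[0]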

2.3 Full Bayesian vs Point Estimates

| Method | Parameter estimate | Uncertainty |
|---|---|---|
| MLE | Single \(\hat{\theta}\) | None |
| MAP | Single \(\hat{\theta}\) | None (regularized via prior) |
| Full Bayesian | Entire \(p(\theta \mid D)\) | Complete uncertainty quantification |

3. Conjugate Priors

When the prior and posterior belong to the same distribution family, the prior is called a conjugate prior of the likelihood.

| Likelihood | Conjugate Prior | Posterior | Application |
|---|---|---|---|
| Bernoulli \(\text{Ber}(p)\) | Beta \(\text{Beta}(\alpha, \beta)\) | \(\text{Beta}(\alpha+k,\ \beta+n-k)\) | Coin bias |
| Multinomial | Dirichlet | Dirichlet | Category probabilities |
| Gaussian (known variance) | Gaussian | Gaussian | Mean estimation |
| Gaussian (known mean) | Inverse-Gamma | Inverse-Gamma | Variance estimation |
| Poisson | Gamma | Gamma | Rate estimation |

Example: Beta-Bernoulli

Flipping a coin \(n\) times with \(k\) heads:

\[ \text{Prior}: p \sim \text{Beta}(\alpha, \beta) \]
\[ \text{Posterior}: p | \text{data} \sim \text{Beta}(\alpha + k, \beta + n - k) \]
\[ \text{Posterior mean}: \hat{p} = \frac{\alpha + k}{\alpha + \beta + n} \]

With \(\alpha = \beta = 1\) (uniform prior), observing 10 flips with 7 heads:

\[ \hat{p}_{MAP} = \frac{7}{10} = 0.7, \quad \hat{p}_{Bayes} = \frac{8}{12} \approx 0.667 \]

With a uniform prior, the MAP estimate coincides with the MLE; the posterior mean, in contrast, shrinks toward the prior mean of 0.5, providing automatic regularization.
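
These numbers are easy to verify with scipy.stats (a quick sketch):

from scipy import stats

alpha, beta, k, n = 1, 1, 7, 10
posterior = stats.beta(alpha + k, beta + n - k)  # Beta(8, 4)

print(posterior.mean())                          # 0.667: posterior mean (Bayes estimate)
print((alpha + k - 1) / (alpha + beta + n - 2))  # 0.7: posterior mode (MAP)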


4. MCMC Methods

When the posterior distribution has no closed form, Markov chain Monte Carlo (MCMC) methods draw samples from it instead.

4.1 Metropolis-Hastings

A minimal numpy implementation with a symmetric Gaussian random-walk proposal:

import numpy as np

def metropolis_hastings(log_posterior, initial, n_samples, proposal_std=1.0):
    """Random-walk Metropolis sampler (symmetric Gaussian proposal)."""
    current = np.asarray(initial, dtype=float)
    samples = [current]

    for _ in range(n_samples):
        # Propose a new state by perturbing the current one
        proposal = current + np.random.normal(0, proposal_std, size=current.shape)

        # Log acceptance ratio; the proposal is symmetric, so the
        # Hastings correction cancels
        log_alpha = log_posterior(proposal) - log_posterior(current)

        # Accept with probability min(1, exp(log_alpha))
        if np.log(np.random.uniform()) < log_alpha:
            current = proposal

        samples.append(current)

    return np.array(samples)
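
A quick usage sketch, sampling a 2-D standard normal via its unnormalized log-density (illustrative numbers):

log_post = lambda theta: -0.5 * np.sum(theta**2)  # log N(0, I) up to a constant
samples = metropolis_hastings(log_post, np.zeros(2), n_samples=5000)
samples = samples[1000:]                           # discard burn-in
print(samples.mean(axis=0), samples.std(axis=0))   # ~[0, 0] and ~[1, 1]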

4.2 Gibbs Sampling

Samples each dimension in turn, conditioned on the others:

\[ \theta_j^{(t+1)} \sim p(\theta_j | \theta_1^{(t+1)}, \ldots, \theta_{j-1}^{(t+1)}, \theta_{j+1}^{(t)}, \ldots, \theta_d^{(t)}, D) \]

Applicable when each full conditional distribution is easy to sample from.
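
For example, in a bivariate Gaussian with correlation \(\rho\), each full conditional is a univariate Gaussian, so Gibbs sampling takes only a few lines (a minimal numpy sketch):

import numpy as np

def gibbs_bivariate_normal(rho, n_samples, seed=0):
    """Gibbs sampler for (x1, x2) ~ N(0, [[1, rho], [rho, 1]])."""
    rng = np.random.default_rng(seed)
    x1, x2 = 0.0, 0.0
    cond_std = np.sqrt(1 - rho**2)
    samples = np.empty((n_samples, 2))
    for t in range(n_samples):
        # x1 | x2 ~ N(rho * x2, 1 - rho^2), and symmetrically for x2 | x1
        x1 = rng.normal(rho * x2, cond_std)
        x2 = rng.normal(rho * x1, cond_std)
        samples[t] = (x1, x2)
    return samples

samples = gibbs_bivariate_normal(rho=0.8, n_samples=5000)
print(np.corrcoef(samples[1000:].T))  # off-diagonal entries ~0.8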

4.3 Hamiltonian Monte Carlo (HMC)

Simulates Hamiltonian dynamics to improve sampling efficiency in high-dimensional spaces:

  • Treats the parameter space as a physical system
  • Introduces "momentum" variables to assist exploration
  • NUTS (No-U-Turn Sampler) automatically tunes parameters
# PyMC example
import numpy as np
import pymc as pm
import arviz as az

# Observed data (synthetic here for illustration)
data = np.random.default_rng(0).normal(loc=1.0, scale=2.0, size=100)

with pm.Model() as model:
    # Priors
    mu = pm.Normal("mu", mu=0, sigma=10)
    sigma = pm.HalfNormal("sigma", sigma=5)

    # Likelihood
    y_obs = pm.Normal("y", mu=mu, sigma=sigma, observed=data)

    # NUTS sampling (PyMC's default sampler)
    trace = pm.sample(2000, tune=1000)

# Posterior analysis
az.plot_trace(trace)
az.summary(trace)

4.4 MCMC Diagnostics

| Diagnostic | Method | Rule of Thumb |
|---|---|---|
| Convergence | \(\hat{R}\) (Gelman-Rubin statistic) | \(\hat{R} < 1.01\) |
| Effective sample size | ESS | ESS > 400 |
| Autocorrelation | Autocorrelation plot | Rapid decay |
| Mixing | Trace plot | Stationary, well-mixed chains |
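
ArviZ computes \(\hat{R}\) and ESS directly; its diagnostic functions also accept raw arrays shaped (chain, draw), as in this sketch with synthetic chains (array-input behavior assumed from ArviZ's conventions):

import arviz as az
import numpy as np

# Four independent "chains" of 2000 draws each, shape (chain, draw)
chains = np.stack([np.random.default_rng(i).normal(size=2000) for i in range(4)])

print(az.rhat(chains))  # want < 1.01
print(az.ess(chains))   # want > 400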

5. Variational Inference

5.1 Core Idea

Approximate the true posterior \(p(\theta|D)\) with a simple distribution \(q(\theta)\) by minimizing KL divergence:

\[ q^*(\theta) = \arg\min_{q \in \mathcal{Q}} \text{KL}(q(\theta) \| p(\theta|D)) \]

Equivalent to maximizing the ELBO (Evidence Lower Bound):

\[ \text{ELBO}(q) = \mathbb{E}_{q}[\log p(D, \theta)] - \mathbb{E}_{q}[\log q(\theta)] = \mathbb{E}_{q}[\log p(D|\theta)] - \text{KL}(q(\theta) \| p(\theta)) \]
\[ \log p(D) = \text{ELBO}(q) + \text{KL}(q \| p) \geq \text{ELBO}(q) \]

5.2 Mean-Field Approximation

Assumes the posterior factorizes into independent factors:

\[ q(\theta) = \prod_{j=1}^{d} q_j(\theta_j) \]

The optimal solution for each factor:

\[ \log q_j^*(\theta_j) = \mathbb{E}_{q_{-j}}[\log p(\theta, D)] + \text{const} \]
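
To make the ELBO concrete, the sketch below fits \(q(\theta) = \mathcal{N}(m, s^2)\) to the posterior of a Gaussian-mean model (prior \(\theta \sim \mathcal{N}(0, 1)\), likelihood \(x_i \sim \mathcal{N}(\theta, 1)\)) by maximizing a Monte Carlo ELBO estimate over a grid; a real implementation would use stochastic gradients with the reparameterization trick, but the grid keeps the idea visible:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=20)  # data; model: x_i ~ N(theta, 1)
eps = rng.normal(size=2000)                  # common random numbers across grid points

def elbo(m, s):
    theta = m + s * eps  # reparameterized samples from q = N(m, s^2)
    # E_q[log p(D|theta)] up to an additive constant, by Monte Carlo
    exp_loglik = np.mean([-0.5 * np.sum((x - t)**2) for t in theta])
    # KL(q || N(0,1) prior), available in closed form for two Gaussians
    kl = 0.5 * (s**2 + m**2 - 1 - np.log(s**2))
    return exp_loglik - kl

grid = [(m, s) for m in np.linspace(0, 3, 31) for s in np.linspace(0.05, 1, 20)]
m_best, s_best = max(grid, key=lambda p: elbo(*p))

# Exact posterior for this conjugate model is N(n*mean(x)/(n+1), 1/(n+1))
n = len(x)
print((m_best, s_best), (n * x.mean() / (n + 1), np.sqrt(1 / (n + 1))))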

5.3 MCMC vs Variational Inference

| Dimension | MCMC | Variational Inference |
|---|---|---|
| Accuracy | Asymptotically exact | Approximate (limited by \(\mathcal{Q}\)) |
| Speed | Slow | Fast |
| Scalability | Poor (struggles in high dimensions) | Good |
| Diagnostics | Standard methods available | Approximation quality hard to assess |
| Suited for | Small-scale problems needing precise posteriors | Large-scale problems needing rapid inference |

6. Bayesian Neural Networks (BNN)

6.1 Motivation

Standard neural networks give point predictions and cannot quantify uncertainty. BNNs treat weights as random variables:

\[ p(w|D) \propto p(D|w) p(w) \]

6.2 Approximation Methods

| Method | Principle | Practicality |
|---|---|---|
| VI (Bayes by Backprop) | Variational inference with the reparameterization trick | Moderate |
| MC Dropout | Dropout as approximate variational inference | High (simple to implement) |
| Deep Ensemble | Multiple independently trained models | High |
| SWAG | Gaussian approximation of the SGD trajectory | Moderate |

MC Dropout:

import numpy as np

# Train with Dropout as usual; at prediction time keep Dropout active
# and average over multiple stochastic forward passes
predictions = []
for _ in range(100):
    pred = model(x, training=True)  # Keras-style call keeping Dropout active
    predictions.append(np.asarray(pred))

predictions = np.stack(predictions)
mean = predictions.mean(axis=0)  # predictive mean
std = predictions.std(axis=0)    # predictive uncertainty

7. Gaussian Processes (GP)

7.1 Definition

A Gaussian process is a distribution over functions. The function values at any finite set of points follow a multivariate Gaussian distribution:

\[ f(x) \sim \mathcal{GP}(m(x), k(x, x')) \]

where \(m(x)\) is the mean function and \(k(x, x')\) is the kernel function.

7.2 GP Regression

Given observations \((X, y)\), the prediction for a new input \(x^*\):

\[ f^* | X, y, x^* \sim \mathcal{N}(\mu^*, \sigma^{*2}) \]
\[ \mu^* = k(x^*, X)[k(X, X) + \sigma_n^2 I]^{-1} y \]
\[ \sigma^{*2} = k(x^*, x^*) - k(x^*, X)[k(X, X) + \sigma_n^2 I]^{-1} k(X, x^*) \]
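
A minimal sketch with scikit-learn, whose GaussianProcessRegressor implements these equations (its alpha parameter plays the role of \(\sigma_n^2\)); the data here is synthetic:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(20, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=20)

# alpha adds sigma_n^2 to the kernel diagonal, matching the formulas above
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=0.1**2)
gp.fit(X, y)

X_star = np.linspace(0, 5, 100).reshape(-1, 1)
mu_star, sigma_star = gp.predict(X_star, return_std=True)  # predictive mean and std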

7.3 Common Kernel Functions

| Kernel | Expression | Characteristics |
|---|---|---|
| RBF (Radial Basis Function) | \(k(x, x') = \sigma^2 \exp\left(-\frac{\lVert x-x'\rVert^2}{2l^2}\right)\) | Smooth; most commonly used |
| Matérn | \(k(x, x') = \frac{2^{1-\nu}}{\Gamma(\nu)}\left(\frac{\sqrt{2\nu}\,r}{l}\right)^{\nu} K_\nu\!\left(\frac{\sqrt{2\nu}\,r}{l}\right)\), with \(r = \lVert x - x'\rVert\) | Smoothness controlled by \(\nu\) |
| Linear | \(k(x, x') = \sigma^2 x^\top x'\) | Recovers Bayesian linear regression |
| Periodic | \(k(x, x') = \sigma^2 \exp\left(-\frac{2\sin^2(\pi\lVert x-x'\rVert/p)}{l^2}\right)\) | Periodic data |
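
As a concrete example, the RBF kernel is a few lines of numpy (a sketch):

import numpy as np

def rbf_kernel(X1, X2, sigma2=1.0, length=1.0):
    # k(x, x') = sigma^2 * exp(-||x - x'||^2 / (2 l^2)), for all pairs of rows
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :]
                - 2 * X1 @ X2.T)
    return sigma2 * np.exp(-sq_dists / (2 * length**2))

K = rbf_kernel(np.random.randn(5, 2), np.random.randn(3, 2))  # shape (5, 3)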

7.4 GP Pros and Cons

| Advantages | Disadvantages |
|---|---|
| Natural uncertainty quantification | \(O(n^3)\) computational complexity |
| No need to specify a parametric model structure | Not suitable for large-scale data |
| Flexible kernel choices | Struggles with high-dimensional inputs |
| Strong performance on small datasets | Kernel selection requires expertise |

