
Bayesian Learning

Introduction

Bayesian learning uses probabilistic inference as the foundational framework for machine learning. Unlike point-estimation methods, Bayesian approaches maintain the full posterior distribution over parameters, naturally quantifying uncertainty. This article covers Bayesian inference, conjugate priors, MCMC, variational inference, Bayesian neural networks, and Gaussian processes.

Tribe-level perspective: this article focuses on the math and algorithms; for the Bayesian tribe as a whole, à la Domingos (graphical models / probabilistic programming / Bayesian deep learning), see The Master Algorithm notebook — Bayesians.


1. Bayesian Inference Fundamentals

1.1 Bayes' Rule

\[ p(\theta | D) = \frac{p(D | \theta) \, p(\theta)}{p(D)} \]
| Term | Name | Meaning |
|---|---|---|
| \(p(\theta \mid D)\) | Posterior | Belief about parameters after observing data |
| \(p(D \mid \theta)\) | Likelihood | Probability of data given parameters |
| \(p(\theta)\) | Prior | Belief about parameters before observing data |
| \(p(D)\) | Evidence | Marginal likelihood; normalization constant |

Where the evidence (marginal likelihood) is:

\[ p(D) = \int p(D|\theta) p(\theta) \, d\theta \]

1.2 Bayesian Prediction

Predictions are made by taking a weighted average over all possible parameter values, rather than using a single parameter estimate:

\[ p(y^* | x^*, D) = \int p(y^* | x^*, \theta) \, p(\theta | D) \, d\theta \]

This naturally propagates parameter uncertainty into predictions.
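
In practice this integral is rarely available in closed form; the standard workaround is Monte Carlo averaging over posterior samples. Below is a minimal numpy sketch for the coin-flip model treated in Section 3 (all numbers illustrative):

import numpy as np

rng = np.random.default_rng(0)

# Posterior over heads-probability p is Beta(alpha + k, beta + n - k)
alpha, beta, k, n = 1, 1, 7, 10
p_samples = rng.beta(alpha + k, beta + n - k, size=10_000)

# Posterior predictive P(next flip = heads | data) = E_posterior[p],
# approximated by averaging over the posterior samples
print(p_samples.mean())  # ~0.667 = (alpha + k) / (alpha + beta + n)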


2. MAP vs MLE

2.1 Maximum Likelihood Estimation (MLE)

\[ \hat{\theta}_{MLE} = \arg\max_\theta p(D|\theta) = \arg\max_\theta \sum_{i=1}^{n} \log p(x_i | \theta) \]
  • Ignores prior information
  • May overfit (especially with limited data)

2.2 Maximum A Posteriori Estimation (MAP)

\[ \hat{\theta}_{MAP} = \arg\max_\theta p(\theta|D) = \arg\max_\theta \left[\log p(D|\theta) + \log p(\theta)\right] \]
  • The prior acts as a regularizer
  • Gaussian prior \(\theta \sim \mathcal{N}(0, \sigma^2 I)\) is equivalent to L2 regularization
  • Laplace prior \(\theta \sim \text{Laplace}(0, b)\) is equivalent to L1 regularization
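
To make the Gaussian-prior case concrete: for linear regression with noise variance \(\sigma_n^2\) and prior \(\theta \sim \mathcal{N}(0, \sigma^2 I)\), the MAP estimate is exactly ridge regression with \(\lambda = \sigma_n^2 / \sigma^2\). A minimal numpy sketch on synthetic data (all names illustrative):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
theta_true = np.array([2.0, -1.0, 0.5])
y = X @ theta_true + rng.normal(scale=0.5, size=50)

sigma_n2, sigma2 = 0.5**2, 1.0**2  # noise variance, prior variance
lam = sigma_n2 / sigma2            # equivalent L2 strength

# MAP under the Gaussian prior = ridge regression closed form
theta_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# MLE = ordinary least squares (the lam -> 0 limit)
theta_mle = np.linalg.lstsq(X, y, rcond=None)[0]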

2.3 Full Bayesian vs Point Estimates

| Method | Parameter estimate | Uncertainty |
|---|---|---|
| MLE | Single \(\hat{\theta}\) | None |
| MAP | Single \(\hat{\theta}\) | None (regularized via prior) |
| Full Bayesian | Entire \(p(\theta \mid D)\) | Complete uncertainty quantification |

3. Conjugate Priors

When the prior and posterior belong to the same distribution family, the prior is called a conjugate prior of the likelihood.

| Likelihood | Conjugate Prior | Posterior | Application |
|---|---|---|---|
| Bernoulli \(\text{Ber}(p)\) | Beta \(\text{Beta}(\alpha, \beta)\) | \(\text{Beta}(\alpha+k,\ \beta+n-k)\) | Coin bias |
| Multinomial | Dirichlet | Dirichlet | Category probabilities |
| Gaussian (known variance) | Gaussian | Gaussian | Mean estimation |
| Gaussian (known mean) | Inverse-Gamma | Inverse-Gamma | Variance estimation |
| Poisson | Gamma | Gamma | Rate estimation |

Example: Beta-Bernoulli

Flipping a coin \(n\) times with \(k\) heads:

\[ \text{Prior}: p \sim \text{Beta}(\alpha, \beta) \]
\[ \text{Posterior}: p | \text{data} \sim \text{Beta}(\alpha + k, \beta + n - k) \]
\[ \text{Posterior mean}: \hat{p} = \frac{\alpha + k}{\alpha + \beta + n} \]

With \(\alpha = \beta = 1\) (uniform prior), observing 10 flips with 7 heads:

\[ \hat{p}_{MAP} = \frac{7}{10} = 0.7, \quad \hat{p}_{Bayes} = \frac{8}{12} \approx 0.667 \]

With a uniform prior, the MAP estimate coincides with the MLE; the posterior mean, in contrast, shrinks toward the prior mean of 0.5, providing automatic regularization.
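
These numbers are easy to verify with scipy.stats (a quick sketch):

from scipy import stats

alpha, beta, k, n = 1, 1, 7, 10
posterior = stats.beta(alpha + k, beta + n - k)  # Beta(8, 4)

print(posterior.mean())                          # 0.667: posterior mean (Bayes estimate)
print((alpha + k - 1) / (alpha + beta + n - 2))  # 0.7: posterior mode (MAP)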


4. MCMC Methods

When the posterior distribution has no closed form, Markov chain Monte Carlo (MCMC) methods draw samples from it instead.

4.1 Metropolis-Hastings

A minimal numpy implementation with a symmetric Gaussian random-walk proposal:

import numpy as np

def metropolis_hastings(log_posterior, initial, n_samples, proposal_std=1.0):
    """Random-walk Metropolis sampler (symmetric Gaussian proposal)."""
    current = np.asarray(initial, dtype=float)
    samples = [current]

    for _ in range(n_samples):
        # Propose a new state by perturbing the current one
        proposal = current + np.random.normal(0, proposal_std, size=current.shape)

        # Log acceptance ratio; the proposal is symmetric, so the
        # Hastings correction cancels
        log_alpha = log_posterior(proposal) - log_posterior(current)

        # Accept with probability min(1, exp(log_alpha))
        if np.log(np.random.uniform()) < log_alpha:
            current = proposal

        samples.append(current)

    return np.array(samples)
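
A quick usage sketch, sampling a 2-D standard normal via its unnormalized log-density (illustrative numbers):

log_post = lambda theta: -0.5 * np.sum(theta**2)  # log N(0, I) up to a constant
samples = metropolis_hastings(log_post, np.zeros(2), n_samples=5000)
samples = samples[1000:]                           # discard burn-in
print(samples.mean(axis=0), samples.std(axis=0))   # ~[0, 0] and ~[1, 1]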

4.2 Gibbs Sampling

Samples each dimension in turn, conditioned on the others:

\[ \theta_j^{(t+1)} \sim p(\theta_j | \theta_1^{(t+1)}, \ldots, \theta_{j-1}^{(t+1)}, \theta_{j+1}^{(t)}, \ldots, \theta_d^{(t)}, D) \]

Applicable when each full conditional distribution is easy to sample from.
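
For example, in a bivariate Gaussian with correlation \(\rho\), each full conditional is a univariate Gaussian, so Gibbs sampling takes only a few lines (a minimal numpy sketch):

import numpy as np

def gibbs_bivariate_normal(rho, n_samples, seed=0):
    """Gibbs sampler for (x1, x2) ~ N(0, [[1, rho], [rho, 1]])."""
    rng = np.random.default_rng(seed)
    x1, x2 = 0.0, 0.0
    cond_std = np.sqrt(1 - rho**2)
    samples = np.empty((n_samples, 2))
    for t in range(n_samples):
        # x1 | x2 ~ N(rho * x2, 1 - rho^2), and symmetrically for x2 | x1
        x1 = rng.normal(rho * x2, cond_std)
        x2 = rng.normal(rho * x1, cond_std)
        samples[t] = (x1, x2)
    return samples

samples = gibbs_bivariate_normal(rho=0.8, n_samples=5000)
print(np.corrcoef(samples[1000:].T))  # off-diagonal entries ~0.8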

4.3 Hamiltonian Monte Carlo (HMC)

Simulates Hamiltonian dynamics to improve sampling efficiency in high-dimensional spaces:

  • Treats the parameter space as a physical system
  • Introduces "momentum" variables to assist exploration
  • NUTS (No-U-Turn Sampler) automatically tunes parameters
# PyMC example
import numpy as np
import pymc as pm
import arviz as az

# Observed data (synthetic here for illustration)
data = np.random.default_rng(0).normal(loc=1.0, scale=2.0, size=100)

with pm.Model() as model:
    # Priors
    mu = pm.Normal("mu", mu=0, sigma=10)
    sigma = pm.HalfNormal("sigma", sigma=5)

    # Likelihood
    y_obs = pm.Normal("y", mu=mu, sigma=sigma, observed=data)

    # NUTS sampling (PyMC's default sampler)
    trace = pm.sample(2000, tune=1000)

# Posterior analysis
az.plot_trace(trace)
az.summary(trace)

4.4 MCMC Diagnostics

| Diagnostic | Method | Rule of Thumb |
|---|---|---|
| Convergence | \(\hat{R}\) (Gelman-Rubin statistic) | \(\hat{R} < 1.01\) |
| Effective sample size | ESS | ESS > 400 |
| Autocorrelation | Autocorrelation plot | Rapid decay |
| Mixing | Trace plot | Stationary, well-mixed chains |
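
ArviZ computes \(\hat{R}\) and ESS directly; its diagnostic functions also accept raw arrays shaped (chain, draw), as in this sketch with synthetic chains (array-input behavior assumed from ArviZ's conventions):

import arviz as az
import numpy as np

# Four independent "chains" of 2000 draws each, shape (chain, draw)
chains = np.stack([np.random.default_rng(i).normal(size=2000) for i in range(4)])

print(az.rhat(chains))  # want < 1.01
print(az.ess(chains))   # want > 400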

5. Variational Inference

5.1 Core Idea

Approximate the true posterior \(p(\theta|D)\) with a simple distribution \(q(\theta)\) by minimizing KL divergence:

\[ q^*(\theta) = \arg\min_{q \in \mathcal{Q}} \text{KL}(q(\theta) \| p(\theta|D)) \]

Equivalent to maximizing the ELBO (Evidence Lower Bound):

\[ \text{ELBO}(q) = \mathbb{E}_{q}[\log p(D, \theta)] - \mathbb{E}_{q}[\log q(\theta)] = \mathbb{E}_{q}[\log p(D|\theta)] - \text{KL}(q(\theta) \| p(\theta)) \]
\[ \log p(D) = \text{ELBO}(q) + \text{KL}(q \| p) \geq \text{ELBO}(q) \]

5.2 Mean-Field Approximation

Assumes the posterior factorizes into independent factors:

\[ q(\theta) = \prod_{j=1}^{d} q_j(\theta_j) \]

The optimal solution for each factor:

\[ \log q_j^*(\theta_j) = \mathbb{E}_{q_{-j}}[\log p(\theta, D)] + \text{const} \]
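
To make the ELBO concrete, the sketch below fits \(q(\theta) = \mathcal{N}(m, s^2)\) to the posterior of a Gaussian-mean model (prior \(\theta \sim \mathcal{N}(0, 1)\), likelihood \(x_i \sim \mathcal{N}(\theta, 1)\)) by maximizing a Monte Carlo ELBO estimate over a grid; a real implementation would use stochastic gradients with the reparameterization trick, but the grid keeps the idea visible:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=20)  # data; model: x_i ~ N(theta, 1)
eps = rng.normal(size=2000)                  # common random numbers across grid points

def elbo(m, s):
    theta = m + s * eps  # reparameterized samples from q = N(m, s^2)
    # E_q[log p(D|theta)] up to an additive constant, by Monte Carlo
    exp_loglik = np.mean([-0.5 * np.sum((x - t)**2) for t in theta])
    # KL(q || N(0,1) prior), available in closed form for two Gaussians
    kl = 0.5 * (s**2 + m**2 - 1 - np.log(s**2))
    return exp_loglik - kl

grid = [(m, s) for m in np.linspace(0, 3, 31) for s in np.linspace(0.05, 1, 20)]
m_best, s_best = max(grid, key=lambda p: elbo(*p))

# Exact posterior for this conjugate model is N(n*mean(x)/(n+1), 1/(n+1))
n = len(x)
print((m_best, s_best), (n * x.mean() / (n + 1), np.sqrt(1 / (n + 1))))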

5.3 MCMC vs Variational Inference

| Dimension | MCMC | Variational Inference |
|---|---|---|
| Accuracy | Asymptotically exact | Approximate (limited by \(\mathcal{Q}\)) |
| Speed | Slow | Fast |
| Scalability | Poor (struggles in high dimensions) | Good |
| Diagnostics | Standard methods available | Approximation quality hard to assess |
| Suited for | Small-scale problems needing precise posteriors | Large-scale problems needing rapid inference |

6. Bayesian Neural Networks (BNN)

6.1 Motivation

Standard neural networks give point predictions and cannot quantify uncertainty. BNNs treat weights as random variables:

\[ p(w|D) \propto p(D|w) p(w) \]

6.2 Approximation Methods

| Method | Principle | Practicality |
|---|---|---|
| VI (Bayes by Backprop) | Variational inference with the reparameterization trick | Moderate |
| MC Dropout | Dropout as approximate variational inference | High (simple to implement) |
| Deep Ensemble | Multiple independently trained models | High |
| SWAG | Gaussian approximation of the SGD trajectory | Moderate |

MC Dropout:

import numpy as np

# Train with Dropout as usual; at prediction time keep Dropout active
# and average over multiple stochastic forward passes
predictions = []
for _ in range(100):
    pred = model(x, training=True)  # Keras-style call keeping Dropout active
    predictions.append(np.asarray(pred))

predictions = np.stack(predictions)
mean = predictions.mean(axis=0)  # predictive mean
std = predictions.std(axis=0)    # predictive uncertainty

7. Gaussian Processes (GP)

7.1 Definition

A Gaussian process is a distribution over functions. The function values at any finite set of points follow a multivariate Gaussian distribution:

\[ f(x) \sim \mathcal{GP}(m(x), k(x, x')) \]

where \(m(x)\) is the mean function and \(k(x, x')\) is the kernel function.

7.2 GP Regression

Given observations \((X, y)\), the prediction for a new input \(x^*\):

\[ f^* | X, y, x^* \sim \mathcal{N}(\mu^*, \sigma^{*2}) \]
\[ \mu^* = k(x^*, X)[k(X, X) + \sigma_n^2 I]^{-1} y \]
\[ \sigma^{*2} = k(x^*, x^*) - k(x^*, X)[k(X, X) + \sigma_n^2 I]^{-1} k(X, x^*) \]
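
A minimal sketch with scikit-learn, whose GaussianProcessRegressor implements these equations (its alpha parameter plays the role of \(\sigma_n^2\)); the data here is synthetic:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(20, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=20)

# alpha adds sigma_n^2 to the kernel diagonal, matching the formulas above
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=0.1**2)
gp.fit(X, y)

X_star = np.linspace(0, 5, 100).reshape(-1, 1)
mu_star, sigma_star = gp.predict(X_star, return_std=True)  # predictive mean and std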

7.3 Common Kernel Functions

| Kernel | Expression | Characteristics |
|---|---|---|
| RBF (Radial Basis Function) | \(k(x, x') = \sigma^2 \exp\left(-\frac{\lVert x-x'\rVert^2}{2l^2}\right)\) | Smooth; most commonly used |
| Matérn | \(k(x, x') = \frac{2^{1-\nu}}{\Gamma(\nu)}\left(\frac{\sqrt{2\nu}\,r}{l}\right)^{\nu} K_\nu\!\left(\frac{\sqrt{2\nu}\,r}{l}\right)\), with \(r = \lVert x - x'\rVert\) | Smoothness controlled by \(\nu\) |
| Linear | \(k(x, x') = \sigma^2 x^\top x'\) | Recovers Bayesian linear regression |
| Periodic | \(k(x, x') = \sigma^2 \exp\left(-\frac{2\sin^2(\pi\lVert x-x'\rVert/p)}{l^2}\right)\) | Periodic data |
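
As a concrete example, the RBF kernel is a few lines of numpy (a sketch):

import numpy as np

def rbf_kernel(X1, X2, sigma2=1.0, length=1.0):
    # k(x, x') = sigma^2 * exp(-||x - x'||^2 / (2 l^2)), for all pairs of rows
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :]
                - 2 * X1 @ X2.T)
    return sigma2 * np.exp(-sq_dists / (2 * length**2))

K = rbf_kernel(np.random.randn(5, 2), np.random.randn(3, 2))  # shape (5, 3)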

7.4 GP Pros and Cons

| Advantages | Disadvantages |
|---|---|
| Natural uncertainty quantification | \(O(n^3)\) computational complexity |
| No need to specify a parametric model structure | Not suitable for large-scale data |
| Flexible kernel choices | Struggles with high-dimensional inputs |
| Strong performance on small datasets | Kernel selection requires expertise |

