Bayesians
Tribe overview
The Bayesians hold that the essence of learning is probabilistic inference under uncertainty. Anything learnable — parameters, latent variables, model structures, future observations — is treated as a random variable; the goal of learning is, given observed data \(D\), to obtain the posterior distribution \(P(H \mid D)\) over the unknown quantity \(H\), rather than a single point estimate.
In The Master Algorithm, Pedro Domingos lists the Bayesians as one of the five tribes of machine learning, and points out that the Bayesian "master algorithm" is Bayes' theorem itself:
\[ P(H \mid D) = \frac{P(D \mid H)\,P(H)}{P(D)}. \]
This deceptively simple formula provides a unified framework for inductive inference:
- the prior \(P(H)\) encodes beliefs about the hypothesis before learning;
- the likelihood \(P(D \mid H)\) describes how the hypothesis generates data;
- the evidence \(P(D) = \int P(D\mid H)P(H)\,dH\) is the normalizing constant (marginal likelihood);
- the posterior \(P(H \mid D)\) is the output of learning.
The Bayesian core commitment: prior + likelihood = everything. From naive Bayes to LDA, from Kalman filtering to Bayesian neural networks, every method can be read as a specialization of Bayes' theorem under a particular model family and inference algorithm.
Fundamental disagreement with the frequentists: frequentists treat parameters as fixed but unknown quantities and interpret probability as a long-run frequency; Bayesians treat parameters as random variables and interpret probability as a degree of belief. This split determines the different forms the two camps take in confidence intervals, hypothesis testing, and model selection.
Tribe profile
| Dimension | Content |
|---|---|
| Ontology | The world is built of probability distributions; uncertainty is an intrinsic property of knowledge |
| Master algorithm | Bayes' theorem \(P(H\mid D) \propto P(D\mid H)P(H)\) |
| Evaluation criteria | Posterior probability, marginal likelihood, predictive log-likelihood |
| Optimizers | MCMC (Metropolis-Hastings, Gibbs, HMC, NUTS), variational inference (VI), EM, Laplace approximation |
| Representative methods | Naive Bayes, Bayesian networks, hidden Markov models (HMM), LDA, Gaussian processes, Kalman filtering, Bayesian neural networks |
| Modern branches | Probabilistic programming (PyMC/Stan/NumPyro), Bayesian deep learning (BNN/MC Dropout/Laplace), Bayesian optimization (BO), variational autoencoders (VAE) |
| Typical loss | Negative log-posterior, ELBO (variational lower bound), KL divergence |
| Overfitting control | Prior regularization, Bayesian model averaging (BMA) |
Algorithmic genealogy
flowchart TD
A["Bayes' theorem<br/>P(H|D) ∝ P(D|H)P(H)"] --> B["Naive Bayes<br/>(conditional independence)"]
A --> C["Bayesian networks / directed graphical models"]
A --> D["Markov random fields / undirected graphical models"]
C --> E["Hidden Markov Models (HMM)"]
C --> F["Topic model: LDA"]
C --> G["Kalman filter<br/>(linear Gaussian)"]
A --> H["Exact inference<br/>variable elimination / belief propagation"]
A --> I["Approximate inference"]
I --> J["MCMC<br/>MH / Gibbs / HMC / NUTS"]
I --> K["Variational inference (VI)<br/>ELBO optimization"]
I --> L["Laplace approximation"]
A --> M["Modern Bayesian deep learning"]
M --> N["Bayes by Backprop"]
M --> O["MC Dropout"]
M --> P["Deep Ensembles"]
M --> Q["SWAG / Laplace Redux"]
A --> R["Probabilistic programming<br/>PyMC / Stan / NumPyro"]
A --> S["Bayesian optimization (BO)<br/>(GP + acquisition function)"]
The whole genealogy can be summarized in three stages:
- Classical stage (from 1763): Bayes' theorem → naive Bayes → HMM (1960s-70s) → Pearl's Bayesian networks (1988).
- Algorithmic maturation (1990s-2000s): MCMC popularized (Geman & Geman 1984, Gelfand & Smith 1990) → LDA (Blei 2003) → variational inference standardized.
- Deep learning era (2015-): variational autoencoders VAE (Kingma 2014) → Bayes by Backprop (Blundell 2015) → MC Dropout (Gal 2016) → Laplace Redux (Daxberger 2021).
Frequentist vs. Bayesian
| Dimension | Frequentist | Bayesian |
|---|---|---|
| Interpretation of probability | Long-run relative frequency | Subjective degree of belief |
| Parameter \(\theta\) | Fixed but unknown | Random variable |
| Core estimator | Maximum likelihood \(\hat\theta_{\text{MLE}}\) | Posterior distribution \(P(\theta\mid D)\) |
| Confidence intervals | 95% CI: frequency with which random intervals cover the true value | 95% credible interval: posterior probability that the parameter lies in the interval |
| Hypothesis testing | \(p\)-value, Neyman-Pearson | Bayes factor |
| Prediction | Plug-in \(p(y\mid \hat\theta)\) | Posterior predictive \(\int p(y\mid\theta)p(\theta\mid D)d\theta\) |
| Model selection | AIC/BIC, cross-validation | Marginal likelihood, WAIC, LOO-CV |
| Regularization | Explicit L1/L2 penalty | Implicit through priors (Laplace/Gauss prior ↔ L1/L2) |
| Small samples | Unstable, requires bootstrap | Prior acts as constraint, natural shrinkage |
| Large samples | Converges with the Bayesian view (Bernstein-von Mises theorem) | Posterior concentrates around the MLE |
| Key figures | Fisher, Neyman, Pearson | Laplace, Jeffreys, de Finetti, Jaynes |
Pragmatic stance: in modern ML practice, the two camps have long since blended. L2 regularization is equivalent to MAP under a Gaussian prior; cross-validation can be viewed as an approximation to the marginal likelihood; dropout in deep network training admits a variational-inference interpretation. Tribes are philosophy; methods are tools.
Division of labor with existing site pages
This notebook is the tribe-level entry point, focusing on:
- placing Bayesian methods within the five-tribe framework of The Master Algorithm, and contrasting them with the Symbolist, Connectionist, Evolutionary, and Analogizer tribes;
- applied engineering and modern branches (probabilistic programming, BDL, BO);
- providing standard references and further reading for each branch.
For mathematical details (derivation of Bayes' theorem, conjugate priors, the MAP/MLE relationship, etc.), see ../../03_Machine_Learning/贝叶斯学习.md; for general probabilistic models and the basics of graphical models, see ../../03_Machine_Learning/probabilistic_models.md.
The reason both pages coexist: the tribe page emphasizes "why this is a unified research program", whereas the ML foundation page emphasizes "how to use it concretely in supervised/unsupervised learning".
Subpage navigation
This section contains three in-depth notes:
- 图模型与隐马尔可夫 — fundamentals of probabilistic graphical models, the three classical HMM problems (forward-backward, Viterbi, Baum-Welch), Kalman and particle filtering, LDA topic model.
- 概率编程与贝叶斯统计实战 — comparison of PyMC/Stan/NumPyro, hierarchical models, MCMC diagnostics, Bayesian A/B testing, Bayesian optimization, model comparison (WAIC/LOO).
- 贝叶斯深度学习与不确定性 — BNN, MC Dropout, Deep Ensembles, SWAG, Laplace approximation, calibration (ECE), OOD detection, relationship with VAE / diffusion models.
Suggested learning path
flowchart LR
A[Bayes' theorem + conjugate priors] --> B[Naive Bayes<br/>discriminative vs generative]
B --> C[Graphical model basics<br/>d-separation/I-Map]
C --> D[HMM / Kalman<br/>sequential inference]
C --> E[LDA / topic models]
D --> F[MCMC / VI]
E --> F
F --> G[Probabilistic programming<br/>PyMC/Stan]
G --> H[Bayesian optimization]
G --> I[Bayesian deep learning]
I --> J[VAE / diffusion models]
- Beginner: start with chapters 1, 2, 8 of PRML; implement Beta-Binomial and Bayesian linear regression in PyMC.
- Intermediate: derive the three HMM algorithms by hand; understand the unified view of ELBO and EM; run hierarchical models with NUTS and inspect R-hat / ESS.
- Advanced: the engineering trade-offs of the major BNN approximations (Bayes by Backprop / MC Dropout / Laplace); practical use of BO in hyperparameter search.
References
- Domingos, P. (2015). The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World. Basic Books.
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. (PRML, the standard textbook from a Bayesian viewpoint)
- Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.
- Murphy, K. P. (2022/2023). Probabilistic Machine Learning: An Introduction / Advanced Topics. MIT Press. (The most comprehensive modern Bayesian ML reference after PRML)
- Gelman, A., Carlin, J., Stern, H., Dunson, D., Vehtari, A., Rubin, D. (2013). Bayesian Data Analysis (3rd ed.). CRC Press. (BDA3, the bible of applied Bayesian statistics)
- Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann.
- Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press.
- MacKay, D. J. C. (2003). Information Theory, Inference, and Learning Algorithms. Cambridge University Press.
- McElreath, R. (2020). Statistical Rethinking (2nd ed.). CRC Press. (The first-choice introduction to Bayesian methods)
Graphical Models and Hidden Markov Models
This note covers the most representative structured models of the Bayesian tribe: probabilistic graphical models (PGM), hidden Markov models (HMM), Kalman / particle filtering, and the LDA topic model. They share a common feature: they use graph structure to encode conditional independence, factorizing high-dimensional joint distributions into products of local factors so that inference and learning become tractable.
1. Overview of probabilistic graphical models
1.1 Why graphs
The joint distribution of an arbitrary \(d\)-dimensional discrete random vector requires \(O(K^d)\) parameters (\(K\) = number of values per dimension), which is intractable in high dimensions. Graphical models exploit conditional independence to factor the joint into a product of local terms: if each node has at most \(k\) parents, the parameter count drops from \(O(K^d)\) to \(O(d K^{k+1})\) — an exponential reduction.
1.2 Bayesian networks (directed graphical models)
Definition: a directed acyclic graph (DAG) \(G=(V,E)\) in which each node carries a conditional probability distribution (CPD) \(P(X_i \mid \text{Pa}(X_i))\), and the joint distribution is
\[ P(X_1, \dots, X_d) = \prod_{i=1}^{d} P\big(X_i \mid \mathrm{Pa}(X_i)\big). \]
Conditional independence (d-separation): every triplet structure on a path falls into one of three categories —
- Chain \(A \to B \to C\): observing \(B\) blocks the path between \(A\) and \(C\);
- Fork \(A \leftarrow B \to C\): observing \(B\) blocks the path between \(A\) and \(C\);
- Collider / v-structure \(A \to B \leftarrow C\): observing \(B\) or any descendant of \(B\) opens the path between \(A\) and \(C\) ("explaining away").
If every path between \(X\) and \(Y\) is blocked given \(Z\), then \(X \perp\!\!\!\perp Y \mid Z\).
I-Map: graph \(G\) is an I-Map of distribution \(P\) iff every independence encoded by \(G\) holds in \(P\). A minimal I-Map always exists, but a perfect map (one that captures exactly the independencies of \(P\)) does not.
1.3 Markov random fields (undirected graphical models)
An undirected graph writes the joint as a product of clique potentials:
\[ P(x) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C), \qquad Z = \sum_x \prod_{C \in \mathcal{C}} \psi_C(x_C). \]
\(Z\) is called the partition function and is the main source of difficulty when learning MRFs.
For strictly positive distributions, the global, local, and pairwise Markov properties are equivalent, and by the Hammersley-Clifford theorem the Markov property is equivalent to the clique factorization above.
| Dimension | Bayesian network (BN) | Markov network (MRF) |
|---|---|---|
| Graph structure | Directed acyclic | Undirected |
| Factorization | Local CPDs (self-normalizing) | Clique potentials (require partition function \(Z\)) |
| Independence test | d-separation | Graph separation |
| Causal interpretation | Naturally supported | Not directly supported |
| Classical applications | Diagnostic networks, HMM, LDA | Image segmentation, CRF, Ising models |
| Learning difficulty | CPDs relatively easy to fit | Computing \(Z\) typically #P-hard |
2. Exact inference
Exact inference answers: given evidence \(e\), compute the posterior \(P(Q \mid E=e)\) or marginal \(P(Q)\).
2.1 Variable Elimination (VE)
Sum out non-query, non-evidence variables one by one according to an elimination order \(\pi\). Each elimination produces an intermediate factor. Complexity is governed by the elimination width (induced width / treewidth of the graph); finding the optimal ordering is itself NP-hard.
Input: factor set Φ, elimination order π
For each variable X_π(i):
    Collect every factor containing X_π(i) → Φ_i
    New factor g_i = Σ_{X_π(i)} ∏ Φ_i
    Φ ← (Φ \ Φ_i) ∪ {g_i}
Return ∏ Φ
2.2 Belief Propagation (BP)
Also called the sum-product algorithm. On a tree (including polytrees), two passes of message passing yield exact marginals. The message from node \(j\) to neighbor \(i\) is
\[ m_{j\to i}(x_i) = \sum_{x_j} \phi_j(x_j)\,\psi_{ij}(x_i, x_j) \prod_{k \in N(j)\setminus\{i\}} m_{k\to j}(x_j). \]
Final marginal: \(\;P(x_i) \propto \phi_i(x_i)\prod_{k \in N(i)} m_{k\to i}(x_i)\).
On graphs with cycles, this becomes loopy BP, an approximate algorithm that may not converge.
2.3 Junction Tree algorithm
Triangulate the original graph and construct the junction tree (clique tree); run exact BP on the tree. This is the most general exact inference algorithm; complexity is still controlled by treewidth.
3. Hidden Markov Models (HMM)
3.1 Model definition
An HMM is a class of temporal Bayesian network containing:
- Hidden state sequence \(z_{1:T}\), \(z_t \in \{1,\dots,K\}\);
- Observation sequence \(x_{1:T}\);
- Initial distribution \(\pi_k = P(z_1 = k)\);
- Transition matrix \(A_{ij} = P(z_{t+1}=j \mid z_t = i)\);
- Emission probability \(B_k(x) = P(x_t = x \mid z_t = k)\).
Parameters \(\lambda = (\pi, A, B)\). Joint distribution:
\[ P(x_{1:T}, z_{1:T} \mid \lambda) = \pi_{z_1} B_{z_1}(x_1) \prod_{t=2}^{T} A_{z_{t-1} z_t}\, B_{z_t}(x_t). \]
flowchart LR
Z1((z_1)) --> Z2((z_2)) --> Z3((z_3)) --> Zd((... z_T))
Z1 --> X1[x_1]
Z2 --> X2[x_2]
Z3 --> X3[x_3]
Zd --> XT[x_T]
3.2 The three classical problems
| Problem | Input | Output | Algorithm |
|---|---|---|---|
| Evaluation | \(\lambda, x_{1:T}\) | \(P(x_{1:T} \mid \lambda)\) | Forward algorithm |
| Decoding | \(\lambda, x_{1:T}\) | \(\arg\max_{z_{1:T}} P(z_{1:T} \mid x_{1:T})\) | Viterbi |
| Learning | \(x_{1:T}\) (no \(\lambda\)) | \(\hat\lambda\) | Baum-Welch (EM) |
3.3 Forward algorithm
Define the forward variable \(\alpha_t(i) = P(x_{1:t}, z_t = i \mid \lambda)\).
Initialization: \(\alpha_1(i) = \pi_i B_i(x_1)\). Recursion:
\[ \alpha_{t+1}(j) = \Big[\sum_{i=1}^{K} \alpha_t(i)\, A_{ij}\Big] B_j(x_{t+1}). \]
Termination: \(\;P(x_{1:T} \mid \lambda) = \sum_{i=1}^{K}\alpha_T(i)\).
Complexity: \(O(K^2 T)\).
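A minimal log-space implementation of the recursion, assuming the same discrete-emission parameterization as the Viterbi code in §3.5 (`pi` the initial distribution, `A` the \(K\times K\) transition matrix, `B` the \(K\times V\) emission matrix, `x` an integer-encoded observation sequence):
import numpy as np
from scipy.special import logsumexp

def forward_loglik(pi, A, B, x):
    """Log-likelihood log P(x_{1:T} | lambda) via the forward algorithm in log space."""
    log_A, log_B = np.log(A), np.log(B)
    log_alpha = np.log(pi) + log_B[:, x[0]]              # alpha_1(i)
    for t in range(1, len(x)):
        # alpha_{t+1}(j) = [sum_i alpha_t(i) A_ij] * B_j(x_{t+1}), computed in log space
        log_alpha = logsumexp(log_alpha[:, None] + log_A, axis=0) + log_B[:, x[t]]
    return logsumexp(log_alpha)                          # sum_i alpha_T(i)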
3.4 Backward algorithm
Define the backward variable \(\beta_t(i) = P(x_{t+1:T} \mid z_t = i, \lambda)\).
Initialization: \(\beta_T(i) = 1\). Recursion:
\[ \beta_t(i) = \sum_{j=1}^{K} A_{ij}\, B_j(x_{t+1})\, \beta_{t+1}(j). \]
Combining backward with forward variables yields the smoothed posterior at any time step:
\[ \gamma_t(i) = P(z_t = i \mid x_{1:T}, \lambda) = \frac{\alpha_t(i)\,\beta_t(i)}{\sum_{j}\alpha_t(j)\,\beta_t(j)}, \]
and the joint posterior at adjacent time steps:
\[ \xi_t(i,j) = P(z_t = i, z_{t+1} = j \mid x_{1:T}, \lambda) = \frac{\alpha_t(i)\, A_{ij}\, B_j(x_{t+1})\, \beta_{t+1}(j)}{P(x_{1:T}\mid\lambda)}. \]
3.5 Viterbi decoding
Find the optimal state sequence \(z^*_{1:T} = \arg\max_{z_{1:T}} P(z_{1:T}, x_{1:T} \mid \lambda)\).
Define \(\delta_t(i) = \max_{z_{1:t-1}} P(z_{1:t-1}, z_t = i, x_{1:t} \mid \lambda)\).
Initialization: \(\delta_1(i) = \pi_i B_i(x_1)\), \(\psi_1(i) = 0\). Recursion:
\[ \delta_t(j) = \Big[\max_{i} \delta_{t-1}(i)\, A_{ij}\Big] B_j(x_t), \qquad \psi_t(j) = \arg\max_{i}\, \delta_{t-1}(i)\, A_{ij}. \]
Termination: \(z^*_T = \arg\max_i \delta_T(i)\) Backtracking: \(z^*_{t} = \psi_{t+1}(z^*_{t+1})\).
Complexity is again \(O(K^2 T)\) — a textbook example of dynamic programming.
import numpy as np

def viterbi(pi, A, B, x):
    """Most likely state path for a discrete-emission HMM with lambda = (pi, A, B)."""
    T, K = len(x), len(pi)
    delta = np.zeros((T, K))           # delta[t, j]: best path probability ending in state j at time t
    psi = np.zeros((T, K), dtype=int)  # psi[t, j]:   argmax predecessor of state j at time t
    delta[0] = pi * B[:, x[0]]
    for t in range(1, T):
        for j in range(K):
            scores = delta[t-1] * A[:, j]
            psi[t, j] = np.argmax(scores)
            delta[t, j] = scores[psi[t, j]] * B[j, x[t]]
    # Backtrack the optimal state sequence
    z = np.zeros(T, dtype=int)
    z[-1] = np.argmax(delta[-1])
    for t in range(T-2, -1, -1):
        z[t] = psi[t+1, z[t+1]]
    return z
Numerical stability: real implementations use log probabilities, replacing multiplications by additions. Forward/backward likewise needs scaling factors or logsumexp; otherwise, large \(T\) leads to underflow.
3.6 Baum-Welch / EM training
In the unsupervised setting (only \(x_{1:T}\) available), use EM to iteratively estimate \(\lambda\).
E step: with current parameters \(\lambda^{(s)}\), run forward-backward to obtain \(\gamma_t(i), \xi_t(i,j)\). M step:
\[ \hat\pi_i = \gamma_1(i), \qquad \hat A_{ij} = \frac{\sum_{t=1}^{T-1}\xi_t(i,j)}{\sum_{t=1}^{T-1}\gamma_t(i)}, \qquad \hat B_j(v) = \frac{\sum_{t=1}^{T}\gamma_t(j)\,\mathbb{1}[x_t = v]}{\sum_{t=1}^{T}\gamma_t(j)}. \]
EM guarantees \(P(x_{1:T} \mid \lambda^{(s+1)}) \ge P(x_{1:T}\mid\lambda^{(s)})\), but only converges to a local optimum; it is sensitive to initialization and is often run with multiple restarts or k-means initialization for the emission means.
3.7 Extensions of HMMs
| Extension | Modification |
|---|---|
| GMM-HMM | Emission replaced by a Gaussian mixture, continuous observations |
| Autoregressive HMM | Emission depends on \(x_{t-1}\) |
| Input-output HMM | Transitions and emissions depend on external input \(u_t\) |
| Hierarchical HMM | The state itself is an HMM; structure is nested |
| Infinite HMM (HDP-HMM) | Unbounded number of states, nonparametric Bayes |
| Linear-Chain CRF | Discriminative version, modeling \(P(z\mid x)\) rather than \(P(z,x)\) |
4. Sequential Bayes: Kalman filtering and particle filtering
4.1 State-space models
HMMs cover the discrete-state case; for continuous states, the two most important variants are: linear Gaussian → Kalman filter; nonlinear / non-Gaussian → particle filter.
4.2 Kalman filter (linear Gaussian)
Suppose a linear Gaussian state-space model:
\[ z_t = F z_{t-1} + w_t, \quad w_t \sim \mathcal{N}(0, Q); \qquad x_t = H z_t + v_t, \quad v_t \sim \mathcal{N}(0, R). \]
Let \(\hat z_{t\mid s} = \mathbb{E}[z_t \mid x_{1:s}]\), \(P_{t\mid s} = \mathrm{Cov}[z_t \mid x_{1:s}]\).
Predict step:
\[ \hat z_{t\mid t-1} = F \hat z_{t-1\mid t-1}, \qquad P_{t\mid t-1} = F P_{t-1\mid t-1} F^\top + Q. \]
Update step: first compute the innovation and the innovation covariance:
\[ \tilde y_t = x_t - H \hat z_{t\mid t-1}, \qquad S_t = H P_{t\mid t-1} H^\top + R. \]
Kalman gain:
\[ K_t = P_{t\mid t-1} H^\top S_t^{-1}. \]
Posterior update:
\[ \hat z_{t\mid t} = \hat z_{t\mid t-1} + K_t \tilde y_t, \qquad P_{t\mid t} = (I - K_t H)\, P_{t\mid t-1}. \]
Intuition: \(K_t\) encodes "how much more we should trust the measurement than the prediction" — large \(R\) (high measurement noise) makes \(K_t\) small, leaning on the prediction; the reverse leans on the observation.
Nonlinear extensions: extended Kalman filter (EKF), unscented Kalman filter (UKF).
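A minimal sketch of one predict/update cycle, transcribing the equations above directly (the matrices F, H, Q, R and the previous posterior are assumed given as NumPy arrays):
import numpy as np

def kalman_step(z_prev, P_prev, x_t, F, H, Q, R):
    """One Kalman filter iteration: returns the filtered mean/covariance at time t."""
    # Predict
    z_pred = F @ z_prev
    P_pred = F @ P_prev @ F.T + Q
    # Update
    innov = x_t - H @ z_pred                    # innovation
    S = H @ P_pred @ H.T + R                    # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)         # Kalman gain
    z_post = z_pred + K @ innov
    P_post = (np.eye(len(z_prev)) - K @ H) @ P_pred
    return z_post, P_post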
4.3 Particle filter (Sequential Monte Carlo)
For nonlinear, non-Gaussian models, use \(N\) weighted samples \(\{(z_t^{(i)}, w_t^{(i)})\}_{i=1}^N\) to approximate \(P(z_t \mid x_{1:t})\).
SIS (Sequential Importance Sampling):
- Sample \(z_t^{(i)}\) from the proposal \(q(z_t \mid z_{t-1}^{(i)}, x_t)\);
- Update the weight
\[ w_t^{(i)} \propto w_{t-1}^{(i)}\, \frac{p\big(x_t \mid z_t^{(i)}\big)\, p\big(z_t^{(i)} \mid z_{t-1}^{(i)}\big)}{q\big(z_t^{(i)} \mid z_{t-1}^{(i)}, x_t\big)}; \]
- Normalize so that \(\sum_i w_t^{(i)} = 1\).
SIR (Sampling Importance Resampling): when weights degenerate (a few particles carry almost all weight), resample according to weights to obtain a new equally weighted particle set. The standard criterion is the effective sample size \(\hat N_{\text{eff}} = 1/\sum_i (w_t^{(i)})^2\); resample when it drops below a threshold (e.g. \(N/2\)).
Bootstrap filter: take \(q = p(z_t \mid z_{t-1})\), so the weight reduces to \(w_t^{(i)} \propto w_{t-1}^{(i)} p(x_t \mid z_t^{(i)})\).
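A minimal bootstrap-filter step under these conventions; `transition_sample` and `obs_loglik` are hypothetical user-supplied functions for sampling \(p(z_t\mid z_{t-1})\) and evaluating \(\log p(x_t\mid z_t)\):
import numpy as np

def bootstrap_step(particles, log_w, x_t, transition_sample, obs_loglik, rng):
    """One SIR step: propagate, reweight by the likelihood, resample if weights degenerate."""
    particles = transition_sample(particles, rng)       # sample z_t ~ p(z_t | z_{t-1})
    log_w = log_w + obs_loglik(x_t, particles)           # w_t ∝ w_{t-1} p(x_t | z_t)
    log_w -= np.logaddexp.reduce(log_w)                  # normalize in log space
    w = np.exp(log_w)
    n_eff = 1.0 / np.sum(w ** 2)                         # effective sample size
    if n_eff < len(w) / 2:                               # resample when ESS drops below N/2
        idx = rng.choice(len(w), size=len(w), p=w)
        particles = particles[idx]
        log_w = np.full(len(w), -np.log(len(w)))
    return particles, log_w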
Applications: robotic SLAM, target tracking, financial time series, epidemiology.
5. Topic models: LDA
5.1 Generative process
LDA (Latent Dirichlet Allocation, Blei et al. 2003) assumes each document is a mixture of several topics, and each topic is a distribution over the vocabulary.
Hyperparameters: \(\alpha\) (document-topic prior), \(\beta\) (topic-word prior), and the number of topics \(K\).
graph LR
subgraph Plate_K["Topic plate: K"]
betak["φ_k ~ Dir(β)"]
end
subgraph Plate_M["Document plate: M"]
thetad["θ_d ~ Dir(α)"]
subgraph Plate_N["Word plate: N_d"]
zdn["z_{d,n} ~ Cat(θ_d)"]
wdn["w_{d,n} ~ Cat(φ_{z_{d,n}})"]
end
end
thetad --> zdn
zdn --> wdn
betak --> wdn
Generative process:
- For each topic \(k = 1, \dots, K\): draw \(\varphi_k \sim \mathrm{Dir}(\beta)\).
- For each document \(d = 1, \dots, M\):
- Draw a topic distribution \(\theta_d \sim \mathrm{Dir}(\alpha)\);
- For each word position \(n = 1, \dots, N_d\):
- Draw a topic \(z_{d,n} \sim \mathrm{Cat}(\theta_d)\);
- Draw a word \(w_{d,n} \sim \mathrm{Cat}(\varphi_{z_{d,n}})\).
Joint distribution:
\[ p(\varphi_{1:K}, \theta_{1:M}, z, w \mid \alpha, \beta) = \prod_{k=1}^{K} p(\varphi_k \mid \beta)\, \prod_{d=1}^{M} p(\theta_d \mid \alpha) \prod_{n=1}^{N_d} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \varphi_{z_{d,n}}). \]
5.2 Collapsed Gibbs sampling
Exploiting Dirichlet-Multinomial conjugacy, analytically integrate out \(\theta\) and \(\varphi\), sampling only \(z\):
\[ P\big(z_{d,n} = k \mid z^{-(d,n)}, w\big) \;\propto\; \big(n_{d,k}^{-(d,n)} + \alpha\big)\, \frac{n_{k,v}^{-(d,n)} + \beta}{n_{k,\cdot}^{-(d,n)} + V\beta}, \]
where \(n_{d,k}\) is the number of words in document \(d\) assigned to topic \(k\), \(n_{k,v}\) is the count of word \(v\) under topic \(k\), \(V\) is the vocabulary size, and the superscript \(-(d,n)\) denotes excluding the current position.
Each token costs \(O(K)\) per update, with overall \(O(K \sum_d N_d)\) per sweep. Variants: variational inference (Blei's original paper), online LDA (Hoffman 2010), SVI.
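A minimal sketch of one collapsed-Gibbs sweep implementing the conditional above; `docs` is a list of integer word-id lists, `z` the current topic assignments in the same shape, and the count matrices are assumed to be kept consistent with `z`:
import numpy as np

def gibbs_sweep(docs, z, n_dk, n_kv, n_k, alpha, beta, V, rng):
    """One sweep of collapsed Gibbs sampling for LDA over all (d, n) positions."""
    K = n_kv.shape[0]
    for d, doc in enumerate(docs):
        for n, v in enumerate(doc):
            k_old = z[d][n]
            # remove the current token from the counts
            n_dk[d, k_old] -= 1
            n_kv[k_old, v] -= 1
            n_k[k_old] -= 1
            # conditional P(z = k | rest) ∝ (n_dk + alpha) * (n_kv + beta) / (n_k + V*beta)
            p = (n_dk[d] + alpha) * (n_kv[:, v] + beta) / (n_k + V * beta)
            k_new = int(rng.choice(K, p=p / p.sum()))
            # add the token back under the new topic
            z[d][n] = k_new
            n_dk[d, k_new] += 1
            n_kv[k_new, v] += 1
            n_k[k_new] += 1
    return z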
5.3 Model evaluation
- Perplexity: \(\exp(-\frac{1}{N}\sum \log p(w))\), lower is better;
- Topic coherence: \(C_v\), UMass, NPMI;
- Downstream tasks: use topic vectors for classification / clustering.
6. Application case studies
6.1 Part-of-speech tagging (HMM)
- Hidden states = POS tags (NN, VB, JJ, ...), observations = words;
- \(A\) encodes syntactic transition regularities (DT is highly likely to be followed by NN);
- \(B\) encodes word/POS associations;
- Viterbi decoding yields the most likely tag sequence;
- Modern baselines: CRF, BiLSTM-CRF, Transformer. HMMs remain a teaching tool and a baseline in low-resource scenarios.
6.2 GMM-HMM speech recognition
The classical acoustic model: each phone corresponds to an HMM (typically 3 left-to-right states), with emission probabilities modeled by GMMs over MFCC features. Sentence-level decoding uses a large WFST that composes the acoustic HMM, lexicon, and language model into a Viterbi beam search. Before DNN-HMM appeared this was the state of the art (the Kaldi toolchain remains widely used today).
6.3 Biological sequence alignment (profile HMM)
- Hidden states: match / insert / delete;
- Used for multiple sequence alignment (MSA) and remote homolog detection;
- HMMER is the canonical implementation and a standard tool in bioinformatics.
6.4 LDA in practice
- Topic discovery in large news corpora;
- Low-dimensional representation of user interests in recommender systems (the topic distribution is a dense vector);
- Compared with modern alternatives such as word2vec / BERT-topic, LDA still has value in scenarios that require strong interpretability.
7. Cross-references
- Mathematical foundations and conjugate priors: see ../../03_Machine_Learning/贝叶斯学习.md.
- Probabilistic models in supervised learning: see ../../03_Machine_Learning/probabilistic_models.md.
- Other notes in this section: 概率编程实战, 贝叶斯深度学习.
References
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning, Ch. 8 (Graphical Models), Ch. 13 (Sequential Data). Springer.
- Rabiner, L. R. (1989). "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition". Proceedings of the IEEE, 77(2): 257-286.
- Blei, D. M., Ng, A. Y., Jordan, M. I. (2003). "Latent Dirichlet Allocation". JMLR, 3: 993-1022.
- Koller, D., Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press.
- Murphy, K. P. (2023). Probabilistic Machine Learning: Advanced Topics, Ch. 8-10. MIT Press.
- Doucet, A., de Freitas, N., Gordon, N. (eds.) (2001). Sequential Monte Carlo Methods in Practice. Springer.
- Griffiths, T. L., Steyvers, M. (2004). "Finding Scientific Topics". PNAS, 101(suppl 1): 5228-5235. (Collapsed Gibbs for LDA)
- Welch, G., Bishop, G. (2006). An Introduction to the Kalman Filter. UNC-Chapel Hill TR.
Probabilistic Programming and Applied Bayesian Statistics
This section focuses on turning Bayesian ideas into code and practice. The central question is: how can we describe a generative model declaratively so that an inference algorithm runs automatically? That is what probabilistic programming languages (PPLs) provide.
1. The probabilistic programming paradigm
1.1 Models as programs
The traditional Bayesian modeling workflow is "write the formulas → derive the posterior → implement a sampler", and every change of likelihood or prior means redoing the derivation and the sampler. Probabilistic programming unifies this into:
Random variable = first-class citizen of the programming language
Model = a program that writes down priors and likelihoods
Inference = automatically performed by the PPL runtime
As long as you specify the generative process, the inference algorithms (HMC/NUTS, SVI, SMC) are produced automatically by the compiler / runtime. This decouples modeling from inference and makes Bayesian methods scalable in engineering.
1.2 Three categories of PPL
| Category | Examples | Characteristics |
|---|---|---|
| Static graph, strongly typed | Stan | Custom DSL, C++ backend, the most stable NUTS implementation |
| Python-based, dynamic graph | PyMC, NumPyro, Pyro | Reuse autodiff frameworks from deep learning (Theano/Aesara/PyTensor, JAX, PyTorch) |
| General-purpose PPL (Turing-complete) | Turing.jl, Pyro, Gen | Support stochastic control flow and open-universe models |
2. Comparison of major libraries
| Library | Host language | Backend | Default sampler | VI support | Use case |
|---|---|---|---|---|---|
| PyMC | Python | PyTensor (formerly Aesara/Theano) | NUTS | ADVI, normalizing-flow VI | Statistical modeling, regression, hierarchical models |
| Stan | DSL (compiled to C++) | C++ | NUTS (gold standard) | ADVI, Pathfinder | Academic statistics, reproducible research |
| NumPyro | Python | JAX | NUTS (GPU/TPU friendly) | SVI (Pyro-compatible) | Large-scale data, hardware acceleration required |
| Pyro | Python | PyTorch | HMC, NUTS | SVI (core) | Deep generative models, VAE-style |
| Edward2 | Python | TensorFlow Probability | HMC, NUTS | VI | TFP ecosystem, research prototypes |
| Turing.jl | Julia | Julia | HMC, NUTS, PG | ADVI | Julia ecosystem, custom samplers |
Selection guide:
- Medium-scale, tabular data, statistical style → PyMC or Stan (Stan's NUTS remains the most diagnostically stable implementation);
- Large scale, GPU required, coupled with JAX models → NumPyro;
- Deep probabilistic models, VAE / variational flows → Pyro;
- Turing-complete / open-universe → Turing.jl or Gen.
3. Hierarchical Bayesian models
3.1 The pooling spectrum
Consider multi-group data \(\{(x_{ij}, y_{ij})\}\) (\(j\) indexes groups, \(i\) indexes within-group observations):
| Strategy | Model | Bias-variance |
|---|---|---|
| Complete pooling | All groups share one \(\theta\) | High bias, low variance |
| No pooling | Each group estimated independently with its own \(\theta_j\) | Low bias, high variance |
| Partial pooling (hierarchical) | \(\theta_j \sim \mathcal{N}(\mu, \tau^2)\) with \(\mu, \tau\) as hyperparameters (with hyperpriors) | Compromise, adapts to "within-group information content" |
The key idea of hierarchical models is shrinkage: groups with few samples get pulled toward the global mean, while groups with many samples remain close to their own MLE.
3.2 Classic example: 8 schools
Gelman's 8 schools: estimate the effect of SAT coaching in 8 schools from observed effect estimates \(y_j\) with known standard errors \(\sigma_j\), using the hierarchical model \(\theta_j \sim \mathcal{N}(\mu, \tau^2)\), \(y_j \sim \mathcal{N}(\theta_j, \sigma_j^2)\).
Non-centered parameterization (avoids funnel geometry): write \(\theta_j = \mu + \tau\,\tilde\theta_j\) with \(\tilde\theta_j \sim \mathcal{N}(0, 1)\), so the sampler works on \(\tilde\theta\) rather than \(\theta\):
import pymc as pm
import numpy as np
y = np.array([28, 8, -3, 7, -1, 1, 18, 12])
sigma = np.array([15, 10, 16, 11, 9, 11, 10, 18])
with pm.Model() as eight_schools:
mu = pm.Normal("mu", 0, 10)
tau = pm.HalfCauchy("tau", 5)
theta_tilde = pm.Normal("theta_tilde", 0, 1, shape=8)
theta = pm.Deterministic("theta", mu + tau * theta_tilde)
obs = pm.Normal("obs", mu=theta, sigma=sigma, observed=y)
idata = pm.sample(2000, tune=1000, target_accept=0.95)
4. MCMC in practice
4.1 The sampler family
| Sampler | Core mechanism | Strengths | Weaknesses |
|---|---|---|---|
| Metropolis-Hastings | Proposal + acceptance ratio \(\alpha = \min(1, \frac{p(\theta') q(\theta\mid\theta')}{p(\theta) q(\theta'\mid\theta)})\) | General | Slow mixing in high dimensions |
| Gibbs | Sample one dimension at a time from \(p(\theta_i \mid \theta_{-i})\) | Efficient for conjugate models | Slow under strong parameter correlation |
| HMC | Introduces momentum, simulates Hamiltonian dynamics | Efficient in high dimensions | Step size \(\epsilon\) and number of steps \(L\) must be tuned |
| NUTS | HMC with auto-tuned \(L\) (U-turn criterion) | Nearly tuning-free | Implementation is intricate |
| SMC | Annealed sequence | Can estimate the marginal likelihood | Heavy computation |
| Riemannian HMC | Uses Fisher information as metric | More stable on ill-conditioned geometry | Even heavier |
4.2 HMC Hamiltonian dynamics
Introduce momentum \(r \sim \mathcal{N}(0, M)\) and define the Hamiltonian
\[ H(\theta, r) = -\log p(\theta \mid D) + \tfrac{1}{2}\, r^\top M^{-1} r. \]
Hamilton's equations:
\[ \frac{d\theta}{dt} = M^{-1} r, \qquad \frac{dr}{dt} = \nabla_\theta \log p(\theta \mid D). \]
Use the leapfrog integrator (half-step momentum, full-step position, half-step momentum) to simulate \(L\) steps and obtain the proposal \((\theta', r')\). Metropolis acceptance probability:
\[ \alpha = \min\big(1,\; \exp\big(H(\theta, r) - H(\theta', r')\big)\big). \]
An ideal Hamiltonian system conserves energy → acceptance rate \(\approx 1\); the leapfrog integrator's \(O(\epsilon^2)\) error introduces a small fraction of rejections. NUTS extends the trajectory dynamically at each step and uses "have the two ends of the trajectory begun a U-turn?" as the stopping criterion, eliminating the need to tune \(L\).
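A minimal leapfrog integrator for the dynamics above, assuming an identity mass matrix and a user-supplied `grad_log_post` (the gradient of \(\log p(\theta\mid D)\)):
import numpy as np

def leapfrog(theta, r, grad_log_post, eps, L):
    """Simulate L leapfrog steps of Hamiltonian dynamics; returns the proposal (theta', r')."""
    r = r + 0.5 * eps * grad_log_post(theta)     # half step for momentum
    for _ in range(L - 1):
        theta = theta + eps * r                  # full step for position (M = I)
        r = r + eps * grad_log_post(theta)       # full step for momentum
    theta = theta + eps * r
    r = r + 0.5 * eps * grad_log_post(theta)     # final half step
    return theta, -r                             # negate momentum for reversibility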
4.3 Diagnostics
- Trace plot: run four or more chains and overlay the traces; well-mixed chains look like a "fuzzy caterpillar";
- R-hat (\(\hat R\)): compares between-chain and within-chain variance; \(\hat R < 1.01\) is taken as converged;
- ESS (effective sample size): effective number of samples after accounting for autocorrelation. Bulk-ESS gauges mean precision, tail-ESS gauges quantile precision; aim for \(\ge 400\) per parameter;
- Divergent transitions: HMC's leapfrog diverges in funnels and narrow regions. When they appear, raise target_accept, switch to a non-centered parameterization, or shrink the step size;
- BFMI: how well the energy distribution mixes; \(< 0.3\) indicates insufficient momentum mixing;
- Posterior predictive checks (PPC): sample from the posterior to generate new data and compare with observations (a minimal ArviZ check is sketched after this list).
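A minimal sketch of these diagnostics with ArviZ, continuing the 8-schools `idata` from §3.2 (the call names follow the current ArviZ API as an assumption; older versions expose equivalent summary/plot functions):
import arviz as az

summary = az.summary(idata, var_names=["mu", "tau", "theta"])   # reports r_hat, ess_bulk, ess_tail
print(summary[["mean", "r_hat", "ess_bulk", "ess_tail"]])

n_div = int(idata.sample_stats["diverging"].sum())              # count divergent transitions
print(f"divergences: {n_div}, BFMI per chain: {az.bfmi(idata)}")

az.plot_trace(idata, var_names=["mu", "tau"])                   # "fuzzy caterpillar" check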
5. Bayesian linear regression
Model: \(y = X\beta + \epsilon\), \(\epsilon \sim \mathcal{N}(0, \sigma^2 I)\), prior \(\beta \sim \mathcal{N}(0, \tau^2 I)\).
The closed-form posterior is Gaussian, \(\beta \mid D \sim \mathcal{N}(\mu_n, \Sigma_n)\), with
\[ \Sigma_n = \big(\sigma^{-2} X^\top X + \tau^{-2} I\big)^{-1}, \qquad \mu_n = \sigma^{-2}\, \Sigma_n X^\top y. \]
Relation to ridge regression: the MAP estimate \(\hat\beta_{\text{MAP}} = \mu_n\) is equivalent to the ridge solution \(\hat\beta_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y\), with \(\lambda = \sigma^2 / \tau^2\).
Posterior predictive at a new input \(x_*\):
\[ p(y_* \mid x_*, D) = \mathcal{N}\big(x_*^\top \mu_n,\; x_*^\top \Sigma_n x_* + \sigma^2\big). \]
The predictive variance automatically incorporates irreducible noise + parameter uncertainty — this is the fundamental advantage of Bayesian methods over point estimates with regularization.
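The closed form above fits in a few lines of NumPy (a sketch assuming the noise variance `sigma2` and prior variance `tau2` are known):
import numpy as np

def bayes_linreg_posterior(X, y, sigma2, tau2):
    """Posterior N(mu_n, Sigma_n) over beta for y = X beta + eps with a N(0, tau2 I) prior."""
    d = X.shape[1]
    Sigma_n = np.linalg.inv(X.T @ X / sigma2 + np.eye(d) / tau2)
    mu_n = Sigma_n @ X.T @ y / sigma2
    return mu_n, Sigma_n

def predictive(x_star, mu_n, Sigma_n, sigma2):
    """Posterior predictive mean and variance at a new input x_star."""
    mean = x_star @ mu_n
    var = x_star @ Sigma_n @ x_star + sigma2    # parameter uncertainty + irreducible noise
    return mean, var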
6. Bayesian A/B testing
Business scenario: comparing the conversion rates of versions A and B.
6.1 Beta-Binomial model
For each version \(v \in \{A, B\}\):
\[ \theta_v \sim \mathrm{Beta}(\alpha_0, \beta_0), \qquad y_v \sim \mathrm{Binomial}(n_v, \theta_v). \]
The (conjugate) posterior:
\[ \theta_v \mid D \sim \mathrm{Beta}\big(\alpha_0 + y_v,\; \beta_0 + n_v - y_v\big). \]
We can directly read off interpretable quantities such as \(P(\theta_B > \theta_A \mid D)\), the lift \(\frac{\theta_B - \theta_A}{\theta_A}\), or the minimum detectable improvement — all far closer to business decisions than a \(p\)-value.
import pymc as pm
import numpy as np
obs = {"A": (120, 1000), "B": (135, 1000)} # (conversions, trials)
with pm.Model() as ab:
theta_A = pm.Beta("theta_A", 1, 1)
theta_B = pm.Beta("theta_B", 1, 1)
pm.Binomial("yA", n=obs["A"][1], p=theta_A, observed=obs["A"][0])
pm.Binomial("yB", n=obs["B"][1], p=theta_B, observed=obs["B"][0])
diff = pm.Deterministic("diff", theta_B - theta_A)
lift = pm.Deterministic("lift", (theta_B - theta_A) / theta_A)
idata = pm.sample(2000, tune=1000)
prob_B_better = (idata.posterior["diff"] > 0).mean().item()
print(f"P(B > A | D) = {prob_B_better:.3f}")
6.2 Sequential monitoring
Frequentist methods require committing to a sample size in advance — peeking inflates type-I error. The Bayesian framework can compute \(P(\theta_B > \theta_A)\) at any time and combine it with a preset loss function (stopping rule) to make sequential decisions.
6.3 Multi-armed bandits
Generalize A/B to many arms: Thompson Sampling draws \(\theta\) from each arm's posterior and picks the maximum, naturally balancing exploration and exploitation (regret close to optimal).
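A minimal Beta-Bernoulli Thompson sampling loop under these assumptions; the `pull(arm)` function returning 0/1 rewards is hypothetical:
import numpy as np

def thompson_sampling(pull, n_arms, n_rounds, rng):
    """Thompson sampling for Bernoulli arms with Beta(1, 1) priors."""
    wins = np.ones(n_arms)     # Beta posterior alpha (successes + 1)
    losses = np.ones(n_arms)   # Beta posterior beta  (failures + 1)
    for _ in range(n_rounds):
        theta = rng.beta(wins, losses)     # one posterior draw per arm
        arm = int(np.argmax(theta))        # play the arm with the largest draw
        reward = pull(arm)
        wins[arm] += reward
        losses[arm] += 1 - reward
    return wins, losses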
7. Bayesian Optimization (BO)
7.1 Framework
Goal: maximize a black-box, expensive function \(f: \mathcal{X} \to \mathbb{R}\), \(x^* = \arg\max f(x)\), where each evaluation is costly (e.g., training a deep network for a hyperparameter setting).
Iterate:
- Fit a surrogate model to the observations \(\{(x_i, y_i)\}\), typically a Gaussian process (GP).
- Construct an acquisition function \(a(x)\) that balances exploration and exploitation.
- Take \(x_{n+1} = \arg\max_x a(x)\), evaluate, and add to the dataset.
7.2 The GP surrogate model
With a GP prior \(f \sim \mathcal{GP}(0, k)\) and noisy observations \(y = f(X) + \epsilon\), the posterior at a new point \(x_*\) is Gaussian with
\[ \mu(x_*) = k_*^\top (K + \sigma^2 I)^{-1} y, \qquad \sigma^2(x_*) = k(x_*, x_*) - k_*^\top (K + \sigma^2 I)^{-1} k_*, \]
where \(K_{ij} = k(x_i, x_j)\) and \((k_*)_i = k(x_i, x_*)\).
7.3 Three popular acquisition functions
Let \(\mu(x), \sigma(x)\) denote the GP posterior mean and standard deviation, and \(f^+ = \max y_i\).
Probability of Improvement (PI):
\[ a_{\mathrm{PI}}(x) = \Phi\!\Big(\frac{\mu(x) - f^+}{\sigma(x)}\Big). \]
Expected Improvement (EI):
\[ a_{\mathrm{EI}}(x) = \big(\mu(x) - f^+\big)\,\Phi(Z) + \sigma(x)\,\phi(Z), \qquad Z = \frac{\mu(x) - f^+}{\sigma(x)}. \]
EI balances "expected magnitude of improvement" with "probability that improvement is possible" and is the most commonly used acquisition function.
Upper Confidence Bound (UCB):
\[ a_{\mathrm{UCB}}(x) = \mu(x) + \kappa\, \sigma(x). \]
\(\kappa\) controls the exploration level. Srinivas et al. (2010) gave sublinear-regret guarantees for GP-UCB.
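A minimal EI implementation over the GP posterior mean/std defined above (a sketch; `mu` and `sigma` can come from any fitted GP, e.g. a regressor that returns a predictive standard deviation):
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    """EI acquisition for candidate points, given GP posterior mean/std and incumbent f_best."""
    sigma = np.maximum(sigma, 1e-12)               # guard against zero predictive std
    z = (mu - f_best) / sigma
    return (mu - f_best) * norm.cdf(z) + sigma * norm.pdf(z)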
7.4 Application areas
- Hyperparameter search in deep learning (learning rate, layer width, regularization strength);
- Experimental design (materials, chemical reaction conditions);
- Tuning robotic control policies;
- Continuous-parameter optimization in A/B settings.
Tools: BoTorch (PyTorch backend), GPyOpt, Ax, scikit-optimize.
8. Engineering considerations
8.1 Prior sensitivity analysis
When reporting results, you must perturb the prior: replace \(\tau \sim \mathrm{HalfCauchy}(5)\) with \(\mathrm{HalfNormal}(2)\) and rerun to see whether the posterior remains stable. If it does not, either the data carry too little information or the prior is too strong.
Weakly informative priors: Gelman's recommended practice — neither use a flat prior (numerically unstable and not invariant to reparameterization) nor an overly tight one. Common choices are on the order of \(\mathcal{N}(0, 5)\), with HalfNormal/HalfCauchy on scale parameters.
8.2 Model comparison
| Criterion | Formula | Note |
|---|---|---|
| DIC | \(-2\log p(y\mid\hat\theta) + 2 p_D\) | Outdated, no longer recommended |
| WAIC | \(-2\sum_i \log\!\big(\frac{1}{S}\sum_s p(y_i\mid\theta^{(s)})\big) + 2 p_W\) | Fully Bayesian, pointwise |
| PSIS-LOO | Importance-weighted LOO likelihood | Recommended by Vehtari (2017) |
| Bayes factor | \(\frac{p(D\mid M_1)}{p(D\mid M_2)}\) | Strict but highly sensitive to priors |
| Posterior predictive checks | Visual / statistical comparison | Mandatory |
In PyMC: pm.compare({"m1": idata1, "m2": idata2}, ic="loo").
8.3 Reproducibility checklist
- Fix random seeds;
- Record the number of chains, warmup steps, target_accept, and the sampler used;
- Report \(\hat R\), ESS, and the number of divergences;
- Submit the model code (not just results);
- Put the data preprocessing pipeline under version control;
- Posterior summaries should report intervals (e.g. 89% HDI), not only the mean.
8.4 When not to use Bayes
- Vast amounts of data, prior influence negligible → MLE/MAP suffices, and running NUTS is wasteful;
- Real-time inference with millisecond latency → posterior sampling is too slow; use VI or fall back to MAP;
- A model whose priors are hard to specify (e.g. all weights of a black-box deep network) → consider the BDL approximations (see 贝叶斯深度学习).
9. Cross-references
- Tribe perspective and genealogy: this page, §1 (the tribe entry point)
- Graphical-model foundations of HMM, Kalman, and LDA: 图模型与隐马尔可夫
- Extending Bayesian ideas to deep networks: 贝叶斯深度学习与不确定性
- Mathematical foundations and conjugate priors: ../../03_Machine_Learning/贝叶斯学习.md
References
- Salvatier, J., Wiecki, T. V., Fonnesbeck, C. (2016). "Probabilistic programming in Python using PyMC3". PeerJ Computer Science, 2:e55.
- Carpenter, B., Gelman, A., Hoffman, M. D., et al. (2017). "Stan: A Probabilistic Programming Language". Journal of Statistical Software, 76(1).
- Phan, D., Pradhan, N., Jankowiak, M. (2019). "Composable Effects for Flexible and Accelerated Probabilistic Programming in NumPyro". arXiv:1912.11554.
- Bingham, E., Chen, J. P., Jankowiak, M., et al. (2019). "Pyro: Deep Universal Probabilistic Programming". JMLR, 20(28).
- Hoffman, M. D., Gelman, A. (2014). "The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo". JMLR, 15: 1593-1623.
- Neal, R. M. (2011). "MCMC Using Hamiltonian Dynamics". In Handbook of Markov Chain Monte Carlo. CRC Press.
- Gelman, A., Carlin, J., Stern, H., Dunson, D., Vehtari, A., Rubin, D. (2013). Bayesian Data Analysis (3rd ed.). CRC Press.
- Vehtari, A., Gelman, A., Gabry, J. (2017). "Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC". Statistics and Computing, 27: 1413-1432.
- Frazier, P. I. (2018). "A Tutorial on Bayesian Optimization". arXiv:1807.02811.
- Snoek, J., Larochelle, H., Adams, R. P. (2012). "Practical Bayesian Optimization of Machine Learning Algorithms". NeurIPS.
- Srinivas, N., Krause, A., Kakade, S. M., Seeger, M. (2010). "Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design". ICML.
- McElreath, R. (2020). Statistical Rethinking (2nd ed.). CRC Press.
Bayesian Deep Learning and Uncertainty
Deep networks already achieve high accuracy on i.i.d. test sets, but on out-of-distribution (OOD) data they often produce highly confident wrong predictions, posing a fundamental risk for safety-critical applications (autonomous driving, medical diagnostics, financial risk control). The goal of Bayesian Deep Learning (BDL) is to equip deep networks with well-calibrated uncertainty estimates.
1. Why deep networks need Bayes
1.1 Three symptoms
- Overconfidence: modern ResNets and Transformers routinely output softmax values \(>0.99\) on misclassified samples; the empirical study of Guo (2017) shows that ECE is far higher than for earlier shallow networks.
- OOD failure: a network trained on CIFAR-10 and tested on SVHN still classifies with high confidence — it cannot recognize "I have not seen this".
- Catastrophic errors: standard MLE/MAP networks output point estimates with no variance; downstream decisions (rejection, hand-off to humans) lack a principled basis.
1.2 What the Bayesian framework promises
The posterior predictive
\[ p(y \mid x, D) = \int p(y \mid x, w)\, p(w \mid D)\, dw \]
automatically integrates weight uncertainty and marginalizes it into the output distribution — in theory this delivers well-calibrated probabilities. The problem is that \(p(w\mid D)\) is intractable when there are millions of parameters; the entire BDL toolbox is about how to approximate this posterior.
2. Two kinds of uncertainty
| Type | Source | Reducible by more data? | Examples |
|---|---|---|---|
| Aleatoric (data noise) | Inherent randomness of the data, measurement noise, label ambiguity | No | Blurry images, dice rolls |
| Epistemic (model ignorance) | The model's ignorance over unseen regions, limited training data | Yes | OOD inputs, sparse regions of the training distribution |
Mathematically, the predictive variance decomposes (law of total variance) as
\[ \mathrm{Var}[y \mid x, D] = \underbrace{\mathbb{E}_{p(w\mid D)}\big[\sigma_w^2(x)\big]}_{\text{aleatoric}} \;+\; \underbrace{\mathrm{Var}_{p(w\mid D)}\big[\mu_w(x)\big]}_{\text{epistemic}}. \]
Operational meaning: high epistemic → trigger "abstention" or active-learning sampling; high aleatoric → provide a "probabilistic output" or ask for additional sensors.
For regression tasks, aleatoric uncertainty can be modeled as \(y = f_w(x) + \epsilon(x)\), letting the network output \((\mu, \sigma^2)\) to learn heteroscedastic noise.
3. Bayesian Neural Networks (BNN)
Treat each weight \(w_i\) as a random variable with prior \(p(w)\) (commonly \(\mathcal{N}(0, \sigma_p^2)\)), and update it with observed data to obtain the posterior \(p(w \mid D)\).
The ideal approach is to sample the posterior directly with HMC. Neal (1995) succeeded on shallow networks, but on deep networks the leapfrog cost and mixing difficulty are severe; only in the past few years (Izmailov 2021) have large-scale experiments appeared. In practice the following approximations are used.
4. Bayes by Backprop (Blundell 2015)
4.1 The variational inference (VI) framework
Introduce a parametric approximate posterior \(q_\phi(w)\) and maximize the ELBO:
\[ \mathcal{L}(\phi) = \mathbb{E}_{q_\phi(w)}\big[\log p(D \mid w)\big] - \mathrm{KL}\big(q_\phi(w)\,\|\,p(w)\big). \]
The KL identity:
\[ \log p(D) = \mathcal{L}(\phi) + \mathrm{KL}\big(q_\phi(w)\,\|\,p(w \mid D)\big). \]
Maximizing the ELBO ⇔ minimizing the KL ⇔ pushing \(q\) toward the true posterior.
4.2 Mean-field Gaussian approximation
Each weight is independent and Gaussian: \(q_\phi(w_i) = \mathcal{N}(\mu_i, \sigma_i^2)\). The parameter count doubles (\(\mu, \rho\) per weight, with \(\sigma = \log(1+e^\rho)\) keeping it positive).
4.3 The reparameterization trick
To make \(\nabla_\phi \mathbb{E}_{q_\phi}[\cdot]\) differentiable, push the sampling out of \(q\) into a parameter-free distribution:
\[ w = \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I). \]
Then
\[ \nabla_\phi\, \mathbb{E}_{q_\phi(w)}\big[f(w)\big] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\big[\nabla_\phi f(\mu + \sigma \odot \epsilon)\big], \]
so the gradient can be computed in a single forward + backward pass.
4.4 Training
In each mini-batch, draw a set of \(\epsilon\) → compute \(w\) → forward pass → loss = negative likelihood + KL regularizer → backprop to update \(\mu, \rho\). At prediction time, sample \(w\) multiple times and average to obtain the posterior predictive.
Pros: principled and SGD-compatible. Cons: the mean-field assumption ignores weight correlations and often substantially underestimates variance; the parameter count doubles.
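A minimal mean-field Bayesian linear layer in PyTorch illustrating §4.2-4.4 (a sketch under simplifying assumptions — no bias term, a \(\mathcal{N}(0,1)\) prior — not the original paper's implementation):
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesLinear(nn.Module):
    """Mean-field Gaussian weights with the reparameterization trick."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(d_out, d_in))
        self.rho = nn.Parameter(torch.full((d_out, d_in), -5.0))   # sigma = softplus(rho)

    def forward(self, x):
        sigma = F.softplus(self.rho)
        eps = torch.randn_like(sigma)
        w = self.mu + sigma * eps                                  # w = mu + sigma * eps
        # KL(q || N(0, 1)) summed over weights, stored for the loss term
        self.kl = (torch.log(1.0 / sigma) + (sigma**2 + self.mu**2) / 2 - 0.5).sum()
        return x @ w.t()

# Per mini-batch: loss = negative log-likelihood + KL / num_batches, e.g.
# loss = F.cross_entropy(layer(x), y) + layer.kl / num_batches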
5. MC Dropout (Gal & Ghahramani 2016)
5.1 The central claim
For a neural network with dropout, training with dropout and keeping dropout active at test time across multiple forward passes produces a predictive distribution that is equivalent to an approximate variational-inference posterior predictive.
5.2 Sketch of the derivation
Consider an \(L\)-layer network with weight matrices \(M_l\). Express the dropout mask \(z_l \in \{0,1\}^{K_l}\) (independent Bernoulli(\(p\)) entries) as a "sampled" weight:
\[ \widetilde W_l = M_l\, \mathrm{diag}(z_l). \]
Define the approximate posterior:
\[ q(W_l) = \prod_{k=1}^{K_l} \Big[ p\, \delta\big(W_{l,\cdot k} - M_{l,\cdot k}\big) + (1 - p)\, \delta\big(W_{l,\cdot k}\big) \Big], \]
i.e., each weight column either takes the value \(M_l\) or is zeroed. This is a highly restricted family of variational distributions.
The ELBO loss (per mini-batch):
\[ \hat{\mathcal{L}} = -\frac{1}{N}\sum_{i=1}^{N} \frac{1}{T}\sum_{t=1}^{T}\log p\big(y_i \mid x_i, \widetilde W^{(i,t)}\big) + \lambda \sum_{l} \|M_l\|_2^2. \]
First term: average over \(T\) dropout-mask samples per data point (in practice \(T = 1\), i.e. standard SGD with dropout). KL term: under a Gaussian prior, the analytic KL expansion equals L2 regularization plus a constant depending on \(p\) — meaning that dropout + L2 ≈ variational inference.
5.3 Inference
At test time, leave dropout on and perform \(T\) stochastic forward passes:
\[ \mathbb{E}[y] \approx \frac{1}{T}\sum_{t=1}^{T} f_{\widetilde w_t}(x), \qquad \mathrm{Var}[y] \approx \tau^{-1} I + \frac{1}{T}\sum_{t=1}^{T} f_{\widetilde w_t}(x)\, f_{\widetilde w_t}(x)^\top - \mathbb{E}[y]\,\mathbb{E}[y]^\top, \]
where \(\tau\) is the model precision (a function of weight decay, dropout rate, and dataset size).
# Minimal pseudocode (PyTorch)
import torch
model.train()  # key: keep dropout active at test time
preds = torch.stack([model(x) for _ in range(T)])  # T stochastic forward passes
mean = preds.mean(0)   # posterior predictive mean
var = preds.var(0)     # epistemic component of the predictive variance
Pros: zero extra parameters, drop-in compatibility with existing training pipelines, almost no overhead. Cons: the dropout rate \(p\) is a fixed prior; the approximation bias is uncontrolled. Follow-up work (Concrete Dropout) makes \(p\) learnable.
6. Deep Ensembles (Lakshminarayanan 2017)
6.1 Method
Independently train \(M\) networks (different random initializations and mini-batch orders) and average their predictions:
\[ p(y \mid x) \approx \frac{1}{M}\sum_{m=1}^{M} p\big(y \mid x, w_m\big). \]
6.2 Connection to Bayes
Although there is no explicit posterior, Deep Ensembles often empirically outperform BNN-VI and MC Dropout (in accuracy, calibration, and OOD detection). Wilson & Izmailov (2020) argue that the loss landscape of neural networks contains many equivalent local modes, and each training run lands in a different mode; multiple training runs are roughly an "informal sampling" of a multi-modal posterior. They can be viewed as a special case of MultiSWAG.
6.3 Engineering trade-offs
- Training cost \(\times M\) (typically \(M=5\));
- Inference cost \(\times M\);
- But each member can be trained in parallel;
- On OOD detection and long-tailed classification benchmarks they remain very hard to beat.
7. SWAG / SWA
7.1 SWA (Stochastic Weight Averaging, Izmailov 2018)
Late in training, under a constant or cyclic learning rate, average the weights every few epochs:
\[ \bar w_{\text{SWA}} = \frac{1}{K}\sum_{k=1}^{K} w_{t_k}. \]
This yields a flatter solution than a single SGD endpoint and generalizes better.
7.2 SWAG (Maddox 2019)
SWA upgraded to a Gaussian approximation over weights:
\[ p(w \mid D) \approx \mathcal{N}\Big(\bar w_{\text{SWA}},\; \tfrac{1}{2}\big(\Sigma_{\text{diag}} + \Sigma_{\text{lowrank}}\big)\Big). \]
The low-rank covariance is estimated from the deviation matrix along the SGD trajectory \(D = [w_{t_1} - \bar w, \dots, w_{t_K} - \bar w]\): \(\Sigma_{\text{lowrank}} = \frac{1}{K-1} D D^\top\).
At prediction time, sample \(w\) from this Gaussian and form the posterior predictive — performance approaches Deep Ensembles while training cost is close to a single model.
8. Laplace approximation / Laplace Redux
8.1 Second-order expansion
Take a second-order Taylor expansion of the negative log-posterior around the MAP solution \(w^*\):
\[ -\log p(w \mid D) \approx -\log p(w^* \mid D) + \tfrac{1}{2}(w - w^*)^\top H\,(w - w^*), \]
where \(H = -\nabla^2 \log p(w \mid D)\big|_{w^*}\) is the Hessian. This is equivalent to approximating the posterior as a Gaussian:
\[ p(w \mid D) \approx \mathcal{N}\big(w^*,\; H^{-1}\big). \]
8.2 Laplace Redux (Daxberger 2021)
Storing the full Hessian for a deep network is infeasible (\(O(P^2)\)). Common approximations:
- Last-layer Laplace: apply Laplace to the final linear layer only and keep all other layers at their MAP values;
- Diagonal Hessian: ignore off-diagonal terms;
- KFAC (Kronecker-factored): approximate the Hessian as a per-layer Kronecker product;
- GGN (Generalized Gauss-Newton): replace the Hessian with the Fisher information matrix — positive semidefinite and batch-computable.
The laplace-torch library (Daxberger 2021) wraps these options and applies Laplace post-hoc to an already trained model, leaving accuracy nearly unchanged while substantially improving calibration and OOD performance.
8.3 Approximating the posterior predictive
For classification, \(p(y\mid x^*) = \int \mathrm{softmax}(f_w(x^*)) p(w\mid D) dw\) has no closed form. Common options:
- MC: sample weights from the Gaussian and average;
- Probit approximation: replace softmax with a probit, giving a closed form (suitable for last-layer Laplace).
9. Calibration
9.1 The ECE metric
Bin the predicted confidence \(\hat p\) into \(M\) bins \(B_m\):
\[ \mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{N}\,\Big|\mathrm{acc}(B_m) - \mathrm{conf}(B_m)\Big|. \]
The ideal ECE is 0. MCE is the maximum bin deviation.
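A minimal ECE computation following the binning definition above (`probs` are softmax outputs of shape (N, classes), `labels` integer ground truth):
import numpy as np

def ece(probs, labels, n_bins=15):
    """Expected Calibration Error with equal-width confidence bins."""
    conf = probs.max(axis=1)                 # predicted confidence
    pred = probs.argmax(axis=1)
    correct = (pred == labels).astype(float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            total += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return total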
9.2 Temperature scaling
The simplest and most widely used post-hoc calibrator: fit a scalar \(T\) on a validation set,
\[ \hat p = \mathrm{softmax}(z / T), \]
minimizing the NLL with respect to \(T\). \(T > 1\) smooths; \(T < 1\) sharpens. Guo et al. (2017) showed that a single scalar \(T\) can drive the ECE of modern CNNs from 5%-10% down below 1%.
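Fitting the single scalar \(T\) by validation NLL takes only a few lines (a sketch; `val_logits` and `val_labels` are assumed to be precomputed tensors):
import torch

def fit_temperature(val_logits, val_labels, steps=200):
    """Learn one temperature T by minimizing validation NLL (Guo et al. 2017)."""
    log_T = torch.zeros(1, requires_grad=True)     # optimize log T so that T stays positive
    opt = torch.optim.LBFGS([log_T], lr=0.1, max_iter=steps)

    def closure():
        opt.zero_grad()
        loss = torch.nn.functional.cross_entropy(val_logits / log_T.exp(), val_labels)
        loss.backward()
        return loss

    opt.step(closure)
    return log_T.exp().item()                      # calibrated probs: softmax(z / T)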
9.3 Platt scaling
The binary version: a logistic regression mapping logits \(z\) to calibrated probabilities:
\[ \hat p = \sigma(a z + b). \]
9.4 Isotonic regression
A nonparametric monotonic mapping; more flexible than Platt but requires more data.
Caveat: post-hoc methods like temperature scaling only calibrate in-distribution; on OOD they remain overconfident. The selling point of BDL is producing higher uncertainty on OOD inputs, complementary to temperature scaling.
10. OOD detection
10.1 Baselines
- Max softmax probability (Hendrycks 2017): take \(\max_k \hat p_k\) as the score; OOD inputs should score lower.
- Energy score (Liu 2020): \(E(x) = -T \log \sum_k e^{f_k(x)/T}\); theoretically better than MSP.
- Mahalanobis distance (Lee 2018): under per-class Gaussians, compute \(D_M(x) = (\phi(x) - \mu_c)^\top \Sigma^{-1}(\phi(x) - \mu_c)\); OOD inputs are far away.
10.2 Bayesian methods
- Epistemic variance from MC Dropout / BNN: use directly as an OOD score;
- Difference of predictive entropies in a Deep Ensemble: \(\mathcal{H}[\bar p] - \frac{1}{M}\sum_m \mathcal{H}[p_m]\) is the BALD mutual information, specifically capturing epistemic uncertainty;
- GP \(\sigma_n(x)\): naturally grows away from training points.
10.3 Datasets and benchmarks
CIFAR-10 vs SVHN, CIFAR-10 vs CIFAR-100, ImageNet-O, OpenOOD benchmark. Metrics: AUROC, AUPR, FPR@95%TPR.
11. The BDL method genealogy
flowchart TD
A[Bayesian Neural Network<br/>BNN] --> B[Exact MCMC<br/>HMC, Neal 1995]
A --> C["Variational Inference (VI)"]
A --> D[Laplace approximation]
A --> E[Monte Carlo sampling approximations]
C --> C1[Bayes by Backprop<br/>Blundell 2015]
C --> C2[MC Dropout<br/>Gal 2016]
C --> C3[Concrete Dropout]
C --> C4[Functional VI / FVB]
D --> D1[Last-layer Laplace]
D --> D2[KFAC Laplace]
D --> D3[Laplace Redux 2021]
E --> E1[Deep Ensembles<br/>Lakshminarayanan 2017]
E --> E2[SWA / SWAG<br/>Izmailov 2018, Maddox 2019]
E --> E3[MultiSWAG]
A --> F[Bayesian generative models]
F --> F1[VAE<br/>Kingma 2014]
F --> F2[Diffusion models<br/>Sohl-Dickstein 2015, Ho 2020]
F --> F3[Normalizing Flows]
12. Connection to VAE / diffusion models
12.1 VAE: amortized variational inference
The VAE is the canonical latent-variable Bayesian generative model:
\[ p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz, \qquad p(z) = \mathcal{N}(0, I). \]
It introduces a recognition network \(q_\phi(z\mid x)\) that amortizes posterior inference, maximizing the ELBO:
\[ \mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z\mid x)}\big[\log p_\theta(x \mid z)\big] - \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big). \]
Reparameterization (\(z = \mu_\phi(x) + \sigma_\phi(x)\odot\epsilon\)) lets gradients flow back to \(\phi\). The VAE sits at the intersection of BDL and generative modeling; for details see VAE notes.
12.2 Diffusion models: hierarchical variational inference
The training objective of DDPM (Ho 2020) can be written as a (negative) ELBO:
\[ \mathcal{L} = \mathbb{E}_q\Big[ \mathrm{KL}\big(q(x_T \mid x_0)\,\|\,p(x_T)\big) + \sum_{t=2}^{T} \mathrm{KL}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big) - \log p_\theta(x_0 \mid x_1) \Big]. \]
Each denoising step is the KL term of a conditional Gaussian. From this perspective, diffusion models are hierarchical VAEs with \(T\)-step latent variables, closing the loop with the Bayesian tribe entirely.
12.3 Recent developments such as Bayesian Flow Networks
A number of recent works (Bayesian Flow Networks, Graves 2023; Diffusion Schrödinger Bridge) explicitly combine Bayesian inference with diffusion processes, and are at the active frontier of research.
13. Practitioner's checklist
| Task | Recommended method | Notes |
|---|---|---|
| Already-trained model, want to add uncertainty quickly | Last-layer Laplace or MC Dropout | Lowest integration cost |
| Retraining, ample budget | Deep Ensembles (\(M=5\)) | Usually the strongest baseline |
| Retraining budget constrained but want a posterior feel | SWAG | Single-model cost |
| Safety-critical and retrainable | Deep Ensembles + temperature scaling | Calibration + robustness |
| Large models (LLM) | Last-layer Laplace, LoRA-BNN, ensemble of LoRAs | Full BNN infeasible |
| OOD detection | Energy score or Mahalanobis + ensembles | Strong baselines |
14. Cross-references
- Tribe perspective and algorithmic genealogy: this page, §1 (the tribe entry point)
- Foundations of probabilistic programming and Bayesian statistics: 概率编程与贝叶斯统计实战
- Graphical models and sequential Bayes: 图模型与隐马尔可夫
- Detailed VAE derivation: ../../../1_DeepLearning/Generative_Models/VAE.md
References
- Blundell, C., Cornebise, J., Kavukcuoglu, K., Wierstra, D. (2015). "Weight Uncertainty in Neural Networks". ICML. (Bayes by Backprop)
- Gal, Y., Ghahramani, Z. (2016). "Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning". ICML.
- Lakshminarayanan, B., Pritzel, A., Blundell, C. (2017). "Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles". NeurIPS.
- Daxberger, E., Kristiadi, A., Immer, A., Eschenhagen, R., Bauer, M., Hennig, P. (2021). "Laplace Redux — Effortless Bayesian Deep Learning". NeurIPS.
- Guo, C., Pleiss, G., Sun, Y., Weinberger, K. Q. (2017). "On Calibration of Modern Neural Networks". ICML.
- Maddox, W. J., Izmailov, P., Garipov, T., Vetrov, D. P., Wilson, A. G. (2019). "A Simple Baseline for Bayesian Uncertainty in Deep Learning". NeurIPS. (SWAG)
- Izmailov, P., Vikram, S., Hoffman, M. D., Wilson, A. G. (2021). "What Are Bayesian Neural Network Posteriors Really Like?". ICML.
- Wilson, A. G., Izmailov, P. (2020). "Bayesian Deep Learning and a Probabilistic Perspective of Generalization". NeurIPS.
- Kingma, D. P., Welling, M. (2014). "Auto-Encoding Variational Bayes". ICLR.
- Ho, J., Jain, A., Abbeel, P. (2020). "Denoising Diffusion Probabilistic Models". NeurIPS.
- Hendrycks, D., Gimpel, K. (2017). "A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks". ICLR.
- Liu, W., Wang, X., Owens, J., Li, Y. (2020). "Energy-based Out-of-distribution Detection". NeurIPS.
- Lee, K., Lee, K., Lee, H., Shin, J. (2018). "A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks". NeurIPS.
- Neal, R. M. (1996). Bayesian Learning for Neural Networks. Springer. (Earliest systematic treatment of BNNs)