Bayesians
Tribe overview
The Bayesians hold that the essence of learning is probabilistic inference under uncertainty. Anything learnable — parameters, latent variables, model structures, future observations — is treated as a random variable; the goal of learning is, given observed data \(D\), to obtain the posterior distribution \(P(H \mid D)\) over the unknown quantity \(H\), rather than a single point estimate.
In The Master Algorithm, Pedro Domingos lists the Bayesians as one of the five tribes of machine learning, and points out that the Bayesian "master algorithm" is Bayes' theorem itself:
\[ P(H \mid D) = \frac{P(D \mid H)\,P(H)}{P(D)}. \]
This deceptively simple formula provides a unified framework for inductive inference:
- the prior \(P(H)\) encodes beliefs about the hypothesis before learning;
- the likelihood \(P(D \mid H)\) describes how the hypothesis generates data;
- the evidence \(P(D) = \int P(D\mid H)P(H)\,dH\) is the normalizing constant (marginal likelihood);
- the posterior \(P(H \mid D)\) is the output of learning.
The Bayesian core commitment: prior + likelihood = everything. From naive Bayes to LDA, from Kalman filtering to Bayesian neural networks, every method can be read as a specialization of Bayes' theorem under a particular model family and inference algorithm.
Fundamental disagreement with the frequentists: frequentists treat parameters as fixed but unknown quantities and interpret probability as a long-run frequency; Bayesians treat parameters as random variables and interpret probability as a degree of belief. This split determines the different forms the two camps take in confidence intervals, hypothesis testing, and model selection.
Tribe profile
| Dimension | Content |
|---|---|
| Ontology | The world is built of probability distributions; uncertainty is an intrinsic property of knowledge |
| Master algorithm | Bayes' theorem \(P(H\mid D) \propto P(D\mid H)P(H)\) |
| Evaluation criteria | Posterior probability, marginal likelihood, predictive log-likelihood |
| Optimizers | MCMC (Metropolis-Hastings, Gibbs, HMC, NUTS), variational inference (VI), EM, Laplace approximation |
| Representative methods | Naive Bayes, Bayesian networks, hidden Markov models (HMM), LDA, Gaussian processes, Kalman filtering, Bayesian neural networks |
| Modern branches | Probabilistic programming (PyMC/Stan/NumPyro), Bayesian deep learning (BNN/MC Dropout/Laplace), Bayesian optimization (BO), variational autoencoders (VAE) |
| Typical loss | Negative log-posterior, ELBO (variational lower bound), KL divergence |
| Overfitting control | Prior regularization, Bayesian model averaging (BMA) |
Algorithmic genealogy
flowchart TD
A["Bayes' theorem<br/>P(H|D) ∝ P(D|H)P(H)"] --> B["Naive Bayes<br/>(conditional independence)"]
A --> C["Bayesian networks / directed graphical models"]
A --> D["Markov random fields / undirected graphical models"]
C --> E["Hidden Markov Models (HMM)"]
C --> F["Topic model: LDA"]
C --> G["Kalman filter<br/>(linear Gaussian)"]
A --> H["Exact inference<br/>variable elimination / belief propagation"]
A --> I["Approximate inference"]
I --> J["MCMC<br/>MH / Gibbs / HMC / NUTS"]
I --> K["Variational inference (VI)<br/>ELBO optimization"]
I --> L["Laplace approximation"]
A --> M["Modern Bayesian deep learning"]
M --> N["Bayes by Backprop"]
M --> O["MC Dropout"]
M --> P["Deep Ensembles"]
M --> Q["SWAG / Laplace Redux"]
A --> R["Probabilistic programming<br/>PyMC / Stan / NumPyro"]
A --> S["Bayesian optimization (BO)<br/>(GP + acquisition function)"]
The whole genealogy can be summarized in three stages:
- Classical stage (from 1763): Bayes' theorem → naive Bayes → HMM (1960s-70s) → Pearl's Bayesian networks (1988).
- Algorithmic maturation (1990s-2000s): MCMC popularized (Geman & Geman 1984, Gelfand & Smith 1990) → LDA (Blei 2003) → variational inference standardized.
- Deep learning era (2015-): variational autoencoders VAE (Kingma 2014) → Bayes by Backprop (Blundell 2015) → MC Dropout (Gal 2016) → Laplace Redux (Daxberger 2021).
Frequentist vs. Bayesian
| Dimension | Frequentist | Bayesian |
|---|---|---|
| Interpretation of probability | Long-run relative frequency | Subjective degree of belief |
| Parameter \(\theta\) | Fixed but unknown | Random variable |
| Core estimator | Maximum likelihood \(\hat\theta_{\text{MLE}}\) | Posterior distribution \(P(\theta\mid D)\) |
| Confidence intervals | 95% CI: frequency with which random intervals cover the true value | 95% credible interval: posterior probability that the parameter lies in the interval |
| Hypothesis testing | \(p\)-value, Neyman-Pearson | Bayes factor |
| Prediction | Plug-in \(p(y\mid \hat\theta)\) | Posterior predictive \(\int p(y\mid\theta)p(\theta\mid D)d\theta\) |
| Model selection | AIC/BIC, cross-validation | Marginal likelihood, WAIC, LOO-CV |
| Regularization | Explicit L1/L2 penalty | Implicit through priors (Laplace/Gauss prior ↔ L1/L2) |
| Small samples | Unstable, requires bootstrap | Prior acts as constraint, natural shrinkage |
| Large samples | Converges with the Bayesian view (Bernstein-von Mises theorem) | Posterior concentrates around the MLE |
| Key figures | Fisher, Neyman, Pearson | Laplace, Jeffreys, de Finetti, Jaynes |
Pragmatic stance: in modern ML practice, the two camps have long since blended. L2 regularization is equivalent to MAP under a Gaussian prior; cross-validation can be viewed as an approximation to the marginal likelihood; dropout in deep network training admits a variational-inference interpretation. Tribes are philosophy; methods are tools.
Division of labor with existing site pages
This notebook is the tribe-level entry point, focusing on:
- placing Bayesian methods within the five-tribe framework of The Master Algorithm, and contrasting them with the Symbolist, Connectionist, Evolutionary, and Analogizer tribes;
- applied engineering and modern branches (probabilistic programming, BDL, BO);
- providing standard references and further reading for each branch.
For mathematical details (derivation of Bayes' theorem, conjugate priors, the MAP/MLE relationship, etc.), see ../../03_Machine_Learning/贝叶斯学习.md; for general probabilistic models and the basics of graphical models, see ../../03_Machine_Learning/probabilistic_models.md.
The reason both pages coexist: the tribe page emphasizes "why this is a unified research program", whereas the ML foundation page emphasizes "how to use it concretely in supervised/unsupervised learning".
Subpage navigation
This section contains three in-depth notes:
- 图模型与隐马尔可夫 — fundamentals of probabilistic graphical models, the three classical HMM problems (forward-backward, Viterbi, Baum-Welch), Kalman and particle filtering, LDA topic model.
- 概率编程与贝叶斯统计实战 — comparison of PyMC/Stan/NumPyro, hierarchical models, MCMC diagnostics, Bayesian A/B testing, Bayesian optimization, model comparison (WAIC/LOO).
- 贝叶斯深度学习与不确定性 — BNN, MC Dropout, Deep Ensembles, SWAG, Laplace approximation, calibration (ECE), OOD detection, relationship with VAE / diffusion models.
Suggested learning path
flowchart LR
A[Bayes' theorem + conjugate priors] --> B[Naive Bayes<br/>discriminative vs generative]
B --> C[Graphical model basics<br/>d-separation/I-Map]
C --> D[HMM / Kalman<br/>sequential inference]
C --> E[LDA / topic models]
D --> F[MCMC / VI]
E --> F
F --> G[Probabilistic programming<br/>PyMC/Stan]
G --> H[Bayesian optimization]
G --> I[Bayesian deep learning]
I --> J[VAE / diffusion models]
- Beginner: start with chapters 1, 2, 8 of PRML; implement Beta-Binomial and Bayesian linear regression in PyMC.
- Intermediate: derive the three HMM algorithms by hand; understand the unified view of ELBO and EM; run hierarchical models with NUTS and inspect R-hat / ESS.
- Advanced: the engineering trade-offs of the major BNN approximations (Bayes by Backprop / MC Dropout / Laplace); practical use of BO in hyperparameter search.
References
- Domingos, P. (2015). The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World. Basic Books.
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. (PRML, the standard textbook from a Bayesian viewpoint)
- Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.
- Murphy, K. P. (2022/2023). Probabilistic Machine Learning: An Introduction / Advanced Topics. MIT Press. (The most comprehensive modern Bayesian ML reference after PRML)
- Gelman, A., Carlin, J., Stern, H., Dunson, D., Vehtari, A., Rubin, D. (2013). Bayesian Data Analysis (3rd ed.). CRC Press. (BDA3, the bible of applied Bayesian statistics)
- Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann.
- Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press.
- MacKay, D. J. C. (2003). Information Theory, Inference, and Learning Algorithms. Cambridge University Press.
- McElreath, R. (2020). Statistical Rethinking (2nd ed.). CRC Press. (The first-choice introduction to Bayesian methods)
Graphical Models and Hidden Markov Models
This note covers the most representative structured models of the Bayesian tribe: probabilistic graphical models (PGM), hidden Markov models (HMM), Kalman / particle filtering, and the LDA topic model. They share a common feature: they use graph structure to encode conditional independence, factorizing high-dimensional joint distributions into products of local factors so that inference and learning become tractable.
1. Overview of probabilistic graphical models
1.1 Why graphs
The joint distribution of an arbitrary \(d\)-dimensional discrete random vector requires \(O(K^d)\) parameters (\(K\) = number of values per dimension), which is intractable in high dimensions. Graphical models exploit conditional independence to factor the joint into a product of local terms: if each node has at most \(k\) parents, the parameter count drops from \(O(K^d)\) to \(O(d K^{k+1})\) — an exponential reduction.
1.2 Bayesian networks (directed graphical models)
Definition: a directed acyclic graph (DAG) \(G=(V,E)\) in which each node carries a conditional probability distribution (CPD) \(P(X_i \mid \text{Pa}(X_i))\), and the joint distribution is
\[ P(X_1, \dots, X_d) = \prod_{i=1}^{d} P\big(X_i \mid \mathrm{Pa}(X_i)\big). \]
Conditional independence (d-separation): every triplet structure on a path falls into one of three categories —
- Chain \(A \to B \to C\): observing \(B\) blocks the path between \(A\) and \(C\);
- Fork \(A \leftarrow B \to C\): observing \(B\) blocks the path between \(A\) and \(C\);
- Collider / v-structure \(A \to B \leftarrow C\): observing \(B\) or any descendant of \(B\) opens the path between \(A\) and \(C\) ("explaining away").
If every path between \(X\) and \(Y\) is blocked given \(Z\), then \(X \perp\!\!\!\perp Y \mid Z\).
I-Map: graph \(G\) is an I-Map of distribution \(P\) iff every independence encoded by \(G\) holds in \(P\). A minimal I-Map always exists, but a perfect map (one that captures exactly the independencies of \(P\)) does not.
1.3 Markov random fields (undirected graphical models)
An undirected graph writes the joint as a product of clique potentials:
\[ P(x) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C), \qquad Z = \sum_x \prod_{C \in \mathcal{C}} \psi_C(x_C). \]
\(Z\) is called the partition function and is the main source of difficulty when learning MRFs.
For strictly positive distributions, the global, local, and pairwise Markov properties are equivalent, and by the Hammersley-Clifford theorem the Markov property is equivalent to the clique factorization above.
| Dimension | Bayesian network (BN) | Markov network (MRF) |
|---|---|---|
| Graph structure | Directed acyclic | Undirected |
| Factorization | Local CPDs (self-normalizing) | Clique potentials (require partition function \(Z\)) |
| Independence test | d-separation | Graph separation |
| Causal interpretation | Naturally supported | Not directly supported |
| Classical applications | Diagnostic networks, HMM, LDA | Image segmentation, CRF, Ising models |
| Learning difficulty | CPDs relatively easy to fit | Computing \(Z\) typically #P-hard |
2. Exact inference
Exact inference answers: given evidence \(e\), compute the posterior \(P(Q \mid E=e)\) or marginal \(P(Q)\).
2.1 Variable Elimination (VE)
Sum out non-query, non-evidence variables one by one according to an elimination order \(\pi\). Each elimination produces an intermediate factor. Complexity is governed by the elimination width (induced width / treewidth of the graph); finding the optimal ordering is itself NP-hard.
Input: factor set Φ, elimination order π
For each variable X_π(i):
    Collect every factor containing X_π(i) → Φ_i
    New factor g_i = Σ_{X_π(i)} ∏ Φ_i
    Φ ← (Φ \ Φ_i) ∪ {g_i}
Return ∏ Φ
2.2 Belief Propagation (BP)
Also called the sum-product algorithm. On a tree (including polytrees), two passes of message passing yield exact marginals. The message from node \(j\) to neighbor \(i\) is
\[ m_{j\to i}(x_i) = \sum_{x_j} \phi_j(x_j)\,\psi_{ij}(x_i, x_j) \prod_{k \in N(j)\setminus\{i\}} m_{k\to j}(x_j). \]
Final marginal: \(\;P(x_i) \propto \phi_i(x_i)\prod_{k \in N(i)} m_{k\to i}(x_i)\).
On graphs with cycles, this becomes loopy BP, an approximate algorithm that may not converge.
2.3 Junction Tree algorithm
Triangulate the original graph and construct the junction tree (clique tree); run exact BP on the tree. This is the most general exact inference algorithm; complexity is still controlled by treewidth.
3. Hidden Markov Models (HMM)
3.1 Model definition
An HMM is a class of temporal Bayesian network containing:
- Hidden state sequence \(z_{1:T}\), \(z_t \in \{1,\dots,K\}\);
- Observation sequence \(x_{1:T}\);
- Initial distribution \(\pi_k = P(z_1 = k)\);
- Transition matrix \(A_{ij} = P(z_{t+1}=j \mid z_t = i)\);
- Emission probability \(B_k(x) = P(x_t = x \mid z_t = k)\).
Parameters \(\lambda = (\pi, A, B)\). Joint distribution:
\[ P(x_{1:T}, z_{1:T} \mid \lambda) = \pi_{z_1} B_{z_1}(x_1) \prod_{t=2}^{T} A_{z_{t-1} z_t}\, B_{z_t}(x_t). \]
flowchart LR
Z1((z_1)) --> Z2((z_2)) --> Z3((z_3)) --> Zd((... z_T))
Z1 --> X1[x_1]
Z2 --> X2[x_2]
Z3 --> X3[x_3]
Zd --> XT[x_T]
3.2 The three classical problems
| Problem | Input | Output | Algorithm |
|---|---|---|---|
| Evaluation | \(\lambda, x_{1:T}\) | \(P(x_{1:T} \mid \lambda)\) | Forward algorithm |
| Decoding | \(\lambda, x_{1:T}\) | \(\arg\max_{z_{1:T}} P(z_{1:T} \mid x_{1:T})\) | Viterbi |
| Learning | \(x_{1:T}\) (no \(\lambda\)) | \(\hat\lambda\) | Baum-Welch (EM) |
3.3 Forward algorithm
Define the forward variable \(\alpha_t(i) = P(x_{1:t}, z_t = i \mid \lambda)\).
Initialization: \(\alpha_1(i) = \pi_i B_i(x_1)\). Recursion:
\[ \alpha_{t+1}(j) = \Big[\sum_{i=1}^{K} \alpha_t(i)\, A_{ij}\Big] B_j(x_{t+1}). \]
Termination: \(\;P(x_{1:T} \mid \lambda) = \sum_{i=1}^{K}\alpha_T(i)\).
Complexity: \(O(K^2 T)\).
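A minimal log-space implementation of the recursion, assuming the same discrete-emission parameterization as the Viterbi code in §3.5 (`pi` the initial distribution, `A` the \(K\times K\) transition matrix, `B` the \(K\times V\) emission matrix, `x` an integer-encoded observation sequence):
import numpy as np
from scipy.special import logsumexp

def forward_loglik(pi, A, B, x):
    """Log-likelihood log P(x_{1:T} | lambda) via the forward algorithm in log space."""
    log_A, log_B = np.log(A), np.log(B)
    log_alpha = np.log(pi) + log_B[:, x[0]]              # alpha_1(i)
    for t in range(1, len(x)):
        # alpha_{t+1}(j) = [sum_i alpha_t(i) A_ij] * B_j(x_{t+1}), computed in log space
        log_alpha = logsumexp(log_alpha[:, None] + log_A, axis=0) + log_B[:, x[t]]
    return logsumexp(log_alpha)                          # sum_i alpha_T(i)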
3.4 Backward algorithm
Define the backward variable \(\beta_t(i) = P(x_{t+1:T} \mid z_t = i, \lambda)\).
Initialization: \(\beta_T(i) = 1\). Recursion:
\[ \beta_t(i) = \sum_{j=1}^{K} A_{ij}\, B_j(x_{t+1})\, \beta_{t+1}(j). \]
Combining backward with forward variables yields the smoothed posterior at any time step:
\[ \gamma_t(i) = P(z_t = i \mid x_{1:T}, \lambda) = \frac{\alpha_t(i)\,\beta_t(i)}{\sum_{j}\alpha_t(j)\,\beta_t(j)}, \]
and the joint posterior at adjacent time steps:
\[ \xi_t(i,j) = P(z_t = i, z_{t+1} = j \mid x_{1:T}, \lambda) = \frac{\alpha_t(i)\, A_{ij}\, B_j(x_{t+1})\, \beta_{t+1}(j)}{P(x_{1:T}\mid\lambda)}. \]
3.5 Viterbi decoding
Find the optimal state sequence \(z^*_{1:T} = \arg\max_{z_{1:T}} P(z_{1:T}, x_{1:T} \mid \lambda)\).
Define \(\delta_t(i) = \max_{z_{1:t-1}} P(z_{1:t-1}, z_t = i, x_{1:t} \mid \lambda)\).
Initialization: \(\delta_1(i) = \pi_i B_i(x_1)\), \(\psi_1(i) = 0\). Recursion:
\[ \delta_t(j) = \Big[\max_{i} \delta_{t-1}(i)\, A_{ij}\Big] B_j(x_t), \qquad \psi_t(j) = \arg\max_{i}\, \delta_{t-1}(i)\, A_{ij}. \]
Termination: \(z^*_T = \arg\max_i \delta_T(i)\) Backtracking: \(z^*_{t} = \psi_{t+1}(z^*_{t+1})\).
Complexity is again \(O(K^2 T)\) — a textbook example of dynamic programming.
import numpy as np

def viterbi(pi, A, B, x):
    """Most likely state path for a discrete-emission HMM with lambda = (pi, A, B)."""
    T, K = len(x), len(pi)
    delta = np.zeros((T, K))           # delta[t, j]: best path probability ending in state j at time t
    psi = np.zeros((T, K), dtype=int)  # psi[t, j]:   argmax predecessor of state j at time t
    delta[0] = pi * B[:, x[0]]
    for t in range(1, T):
        for j in range(K):
            scores = delta[t-1] * A[:, j]
            psi[t, j] = np.argmax(scores)
            delta[t, j] = scores[psi[t, j]] * B[j, x[t]]
    # Backtrack the optimal state sequence
    z = np.zeros(T, dtype=int)
    z[-1] = np.argmax(delta[-1])
    for t in range(T-2, -1, -1):
        z[t] = psi[t+1, z[t+1]]
    return z
Numerical stability: real implementations use log probabilities, replacing multiplications by additions. Forward/backward likewise needs scaling factors or logsumexp; otherwise, large \(T\) leads to underflow.
3.6 Baum-Welch / EM training
In the unsupervised setting (only \(x_{1:T}\) available), use EM to iteratively estimate \(\lambda\).
E step: with current parameters \(\lambda^{(s)}\), run forward-backward to obtain \(\gamma_t(i), \xi_t(i,j)\). M step:
\[ \hat\pi_i = \gamma_1(i), \qquad \hat A_{ij} = \frac{\sum_{t=1}^{T-1}\xi_t(i,j)}{\sum_{t=1}^{T-1}\gamma_t(i)}, \qquad \hat B_j(v) = \frac{\sum_{t=1}^{T}\gamma_t(j)\,\mathbb{1}[x_t = v]}{\sum_{t=1}^{T}\gamma_t(j)}. \]
EM guarantees \(P(x_{1:T} \mid \lambda^{(s+1)}) \ge P(x_{1:T}\mid\lambda^{(s)})\), but only converges to a local optimum; it is sensitive to initialization and is often run with multiple restarts or k-means initialization for the emission means.
3.7 Extensions of HMMs
| Extension | Modification |
|---|---|
| GMM-HMM | Emission replaced by a Gaussian mixture, continuous observations |
| Autoregressive HMM | Emission depends on \(x_{t-1}\) |
| Input-output HMM | Transitions and emissions depend on external input \(u_t\) |
| Hierarchical HMM | The state itself is an HMM; structure is nested |
| Infinite HMM (HDP-HMM) | Unbounded number of states, nonparametric Bayes |
| Linear-Chain CRF | Discriminative version, modeling \(P(z\mid x)\) rather than \(P(z,x)\) |
4. Sequential Bayes: Kalman filtering and particle filtering
4.1 State-space models
HMMs cover the discrete-state case; for continuous states, the two most important variants are: linear Gaussian → Kalman filter; nonlinear / non-Gaussian → particle filter.
4.2 Kalman filter (linear Gaussian)
Suppose a linear Gaussian state-space model:
\[ z_t = F z_{t-1} + w_t, \quad w_t \sim \mathcal{N}(0, Q); \qquad x_t = H z_t + v_t, \quad v_t \sim \mathcal{N}(0, R). \]
Let \(\hat z_{t\mid s} = \mathbb{E}[z_t \mid x_{1:s}]\), \(P_{t\mid s} = \mathrm{Cov}[z_t \mid x_{1:s}]\).
Predict step:
\[ \hat z_{t\mid t-1} = F \hat z_{t-1\mid t-1}, \qquad P_{t\mid t-1} = F P_{t-1\mid t-1} F^\top + Q. \]
Update step: first compute the innovation and the innovation covariance:
\[ \tilde y_t = x_t - H \hat z_{t\mid t-1}, \qquad S_t = H P_{t\mid t-1} H^\top + R. \]
Kalman gain:
\[ K_t = P_{t\mid t-1} H^\top S_t^{-1}. \]
Posterior update:
\[ \hat z_{t\mid t} = \hat z_{t\mid t-1} + K_t \tilde y_t, \qquad P_{t\mid t} = (I - K_t H)\, P_{t\mid t-1}. \]
Intuition: \(K_t\) encodes "how much more we should trust the measurement than the prediction" — large \(R\) (high measurement noise) makes \(K_t\) small, leaning on the prediction; the reverse leans on the observation.
Nonlinear extensions: extended Kalman filter (EKF), unscented Kalman filter (UKF).
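A minimal sketch of one predict/update cycle, transcribing the equations above directly (the matrices F, H, Q, R and the previous posterior are assumed given as NumPy arrays):
import numpy as np

def kalman_step(z_prev, P_prev, x_t, F, H, Q, R):
    """One Kalman filter iteration: returns the filtered mean/covariance at time t."""
    # Predict
    z_pred = F @ z_prev
    P_pred = F @ P_prev @ F.T + Q
    # Update
    innov = x_t - H @ z_pred                    # innovation
    S = H @ P_pred @ H.T + R                    # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)         # Kalman gain
    z_post = z_pred + K @ innov
    P_post = (np.eye(len(z_prev)) - K @ H) @ P_pred
    return z_post, P_post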
4.3 Particle filter (Sequential Monte Carlo)
For nonlinear, non-Gaussian models, use \(N\) weighted samples \(\{(z_t^{(i)}, w_t^{(i)})\}_{i=1}^N\) to approximate \(P(z_t \mid x_{1:t})\).
SIS (Sequential Importance Sampling):
- Sample \(z_t^{(i)}\) from the proposal \(q(z_t \mid z_{t-1}^{(i)}, x_t)\);
- Update the weight
\[ w_t^{(i)} \propto w_{t-1}^{(i)}\, \frac{p\big(x_t \mid z_t^{(i)}\big)\, p\big(z_t^{(i)} \mid z_{t-1}^{(i)}\big)}{q\big(z_t^{(i)} \mid z_{t-1}^{(i)}, x_t\big)}; \]
- Normalize so that \(\sum_i w_t^{(i)} = 1\).
SIR (Sampling Importance Resampling): when weights degenerate (a few particles carry almost all weight), resample according to weights to obtain a new equally weighted particle set. The standard criterion is the effective sample size \(\hat N_{\text{eff}} = 1/\sum_i (w_t^{(i)})^2\); resample when it drops below a threshold (e.g. \(N/2\)).
Bootstrap filter: take \(q = p(z_t \mid z_{t-1})\), so the weight reduces to \(w_t^{(i)} \propto w_{t-1}^{(i)} p(x_t \mid z_t^{(i)})\).
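A minimal bootstrap-filter step under these conventions; `transition_sample` and `obs_loglik` are hypothetical user-supplied functions for sampling \(p(z_t\mid z_{t-1})\) and evaluating \(\log p(x_t\mid z_t)\):
import numpy as np

def bootstrap_step(particles, log_w, x_t, transition_sample, obs_loglik, rng):
    """One SIR step: propagate, reweight by the likelihood, resample if weights degenerate."""
    particles = transition_sample(particles, rng)       # sample z_t ~ p(z_t | z_{t-1})
    log_w = log_w + obs_loglik(x_t, particles)           # w_t ∝ w_{t-1} p(x_t | z_t)
    log_w -= np.logaddexp.reduce(log_w)                  # normalize in log space
    w = np.exp(log_w)
    n_eff = 1.0 / np.sum(w ** 2)                         # effective sample size
    if n_eff < len(w) / 2:                               # resample when ESS drops below N/2
        idx = rng.choice(len(w), size=len(w), p=w)
        particles = particles[idx]
        log_w = np.full(len(w), -np.log(len(w)))
    return particles, log_w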
Applications: robotic SLAM, target tracking, financial time series, epidemiology.
5. Topic models: LDA
5.1 Generative process
LDA (Latent Dirichlet Allocation, Blei et al. 2003) assumes each document is a mixture of several topics, and each topic is a distribution over the vocabulary.
Hyperparameters: \(\alpha\) (document-topic prior), \(\beta\) (topic-word prior), and the number of topics \(K\).
graph LR
subgraph Plate_K["Topic plate: K"]
betak["φ_k ~ Dir(β)"]
end
subgraph Plate_M["Document plate: M"]
thetad["θ_d ~ Dir(α)"]
subgraph Plate_N["Word plate: N_d"]
zdn["z_{d,n} ~ Cat(θ_d)"]
wdn["w_{d,n} ~ Cat(φ_{z_{d,n}})"]
end
end
thetad --> zdn
zdn --> wdn
betak --> wdn
Generative process:
- For each topic \(k = 1, \dots, K\): draw \(\varphi_k \sim \mathrm{Dir}(\beta)\).
- For each document \(d = 1, \dots, M\):
- Draw a topic distribution \(\theta_d \sim \mathrm{Dir}(\alpha)\);
- For each word position \(n = 1, \dots, N_d\):
- Draw a topic \(z_{d,n} \sim \mathrm{Cat}(\theta_d)\);
- Draw a word \(w_{d,n} \sim \mathrm{Cat}(\varphi_{z_{d,n}})\).
Joint distribution:
\[ p(\varphi_{1:K}, \theta_{1:M}, z, w \mid \alpha, \beta) = \prod_{k=1}^{K} p(\varphi_k \mid \beta)\, \prod_{d=1}^{M} p(\theta_d \mid \alpha) \prod_{n=1}^{N_d} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \varphi_{z_{d,n}}). \]
5.2 Collapsed Gibbs sampling
Exploiting Dirichlet-Multinomial conjugacy, analytically integrate out \(\theta\) and \(\varphi\), sampling only \(z\):
\[ P\big(z_{d,n} = k \mid z^{-(d,n)}, w\big) \;\propto\; \big(n_{d,k}^{-(d,n)} + \alpha\big)\, \frac{n_{k,v}^{-(d,n)} + \beta}{n_{k,\cdot}^{-(d,n)} + V\beta}, \]
where \(n_{d,k}\) is the number of words in document \(d\) assigned to topic \(k\), \(n_{k,v}\) is the count of word \(v\) under topic \(k\), \(V\) is the vocabulary size, and the superscript \(-(d,n)\) denotes excluding the current position.
Each token costs \(O(K)\) per update, with overall \(O(K \sum_d N_d)\) per sweep. Variants: variational inference (Blei's original paper), online LDA (Hoffman 2010), SVI.
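A minimal sketch of one collapsed-Gibbs sweep implementing the conditional above; `docs` is a list of integer word-id lists, `z` the current topic assignments in the same shape, and the count matrices are assumed to be kept consistent with `z`:
import numpy as np

def gibbs_sweep(docs, z, n_dk, n_kv, n_k, alpha, beta, V, rng):
    """One sweep of collapsed Gibbs sampling for LDA over all (d, n) positions."""
    K = n_kv.shape[0]
    for d, doc in enumerate(docs):
        for n, v in enumerate(doc):
            k_old = z[d][n]
            # remove the current token from the counts
            n_dk[d, k_old] -= 1
            n_kv[k_old, v] -= 1
            n_k[k_old] -= 1
            # conditional P(z = k | rest) ∝ (n_dk + alpha) * (n_kv + beta) / (n_k + V*beta)
            p = (n_dk[d] + alpha) * (n_kv[:, v] + beta) / (n_k + V * beta)
            k_new = int(rng.choice(K, p=p / p.sum()))
            # add the token back under the new topic
            z[d][n] = k_new
            n_dk[d, k_new] += 1
            n_kv[k_new, v] += 1
            n_k[k_new] += 1
    return z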
5.3 Model evaluation
- Perplexity: \(\exp(-\frac{1}{N}\sum \log p(w))\), lower is better;
- Topic coherence: \(C_v\), UMass, NPMI;
- Downstream tasks: use topic vectors for classification / clustering.
6. Application case studies
6.1 Part-of-speech tagging (HMM)
- Hidden states = POS tags (NN, VB, JJ, ...), observations = words;
- \(A\) encodes syntactic transition regularities (DT is highly likely to be followed by NN);
- \(B\) encodes word/POS associations;
- Viterbi decoding yields the most likely tag sequence;
- Modern baselines: CRF, BiLSTM-CRF, Transformer. HMMs remain a teaching tool and a baseline in low-resource scenarios.
6.2 GMM-HMM speech recognition
The classical acoustic model: each phone corresponds to an HMM (typically 3 left-to-right states), with emission probabilities modeled by GMMs over MFCC features. Sentence-level decoding uses a large WFST that composes the acoustic HMM, lexicon, and language model into a Viterbi beam search. Before DNN-HMM appeared this was the state of the art (the Kaldi toolchain remains widely used today).
6.3 Biological sequence alignment (profile HMM)
- Hidden states: match / insert / delete;
- Used for multiple sequence alignment (MSA) and remote homolog detection;
- HMMER is the canonical implementation and a standard tool in bioinformatics.
6.4 LDA in practice
- Topic discovery in large news corpora;
- Low-dimensional representation of user interests in recommender systems (the topic distribution is a dense vector);
- Compared with modern alternatives such as word2vec / BERT-topic, LDA still has value in scenarios that require strong interpretability.
7. Cross-references
- Mathematical foundations and conjugate priors: see ../../03_Machine_Learning/贝叶斯学习.md.
- Probabilistic models in supervised learning: see ../../03_Machine_Learning/probabilistic_models.md.
- Other notes in this section: 概率编程实战, 贝叶斯深度学习.
References
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning, Ch. 8 (Graphical Models), Ch. 13 (Sequential Data). Springer.
- Rabiner, L. R. (1989). "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition". Proceedings of the IEEE, 77(2): 257-286.
- Blei, D. M., Ng, A. Y., Jordan, M. I. (2003). "Latent Dirichlet Allocation". JMLR, 3: 993-1022.
- Koller, D., Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press.
- Murphy, K. P. (2023). Probabilistic Machine Learning: Advanced Topics, Ch. 8-10. MIT Press.
- Doucet, A., de Freitas, N., Gordon, N. (eds.) (2001). Sequential Monte Carlo Methods in Practice. Springer.
- Griffiths, T. L., Steyvers, M. (2004). "Finding Scientific Topics". PNAS, 101(suppl 1): 5228-5235. (Collapsed Gibbs for LDA)
- Welch, G., Bishop, G. (2006). An Introduction to the Kalman Filter. UNC-Chapel Hill TR.
Probabilistic Programming and Applied Bayesian Statistics
This section focuses on turning Bayesian ideas into code and practice. The central question is: how can we describe a generative model declaratively so that an inference algorithm runs automatically? That is what probabilistic programming languages (PPLs) provide.
1. The probabilistic programming paradigm
1.1 Models as programs
The traditional Bayesian modeling workflow is "write the formulas → derive the posterior → implement a sampler", and every change of likelihood or prior means redoing the derivation and the sampler. Probabilistic programming unifies this into:
Random variable = first-class citizen of the programming language
Model = a program that writes down priors and likelihoods
Inference = automatically performed by the PPL runtime
As long as you specify the generative process, the inference algorithms (HMC/NUTS, SVI, SMC) are produced automatically by the compiler / runtime. This decouples modeling from inference and makes Bayesian methods scalable in engineering.
1.2 Three categories of PPL
| Category | Examples | Characteristics |
|---|---|---|
| Static graph, strongly typed | Stan | Custom DSL, C++ backend, the most stable NUTS implementation |
| Python-based, dynamic graph | PyMC, NumPyro, Pyro | Reuse autodiff frameworks from deep learning (Theano/Aesara/PyTensor, JAX, PyTorch) |
| General-purpose PPL (Turing-complete) | Turing.jl, Pyro, Gen | Support stochastic control flow and open-universe models |
2. Comparison of major libraries
| Library | Host language | Backend | Default sampler | VI support | Use case |
|---|---|---|---|---|---|
| PyMC | Python | PyTensor (formerly Aesara/Theano) | NUTS | ADVI, normalizing-flow VI | Statistical modeling, regression, hierarchical models |
| Stan | DSL (compiled to C++) | C++ | NUTS (gold standard) | ADVI, Pathfinder | Academic statistics, reproducible research |
| NumPyro | Python | JAX | NUTS (GPU/TPU friendly) | SVI (Pyro-compatible) | Large-scale data, hardware acceleration required |
| Pyro | Python | PyTorch | HMC, NUTS | SVI (core) | Deep generative models, VAE-style |
| Edward2 | Python | TensorFlow Probability | HMC, NUTS | VI | TFP ecosystem, research prototypes |
| Turing.jl | Julia | Julia | HMC, NUTS, PG | ADVI | Julia ecosystem, custom samplers |
Selection guide:
- Medium-scale, tabular data, statistical style → PyMC or Stan (Stan's NUTS remains the most diagnostically stable implementation);
- Large scale, GPU required, coupled with JAX models → NumPyro;
- Deep probabilistic models, VAE / variational flows → Pyro;
- Turing-complete / open-universe → Turing.jl or Gen.
3. Hierarchical Bayesian models
3.1 The pooling spectrum
Consider multi-group data \(\{(x_{ij}, y_{ij})\}\) (\(j\) indexes groups, \(i\) indexes within-group observations):
| Strategy | Model | Bias-variance |
|---|---|---|
| Complete pooling | All groups share one \(\theta\) | High bias, low variance |
| No pooling | Each group estimated independently with its own \(\theta_j\) | Low bias, high variance |
| Partial pooling (hierarchical) | \(\theta_j \sim \mathcal{N}(\mu, \tau^2)\) with \(\mu, \tau\) as hyperparameters (with hyperpriors) | Compromise, adapts to "within-group information content" |
The key idea of hierarchical models is shrinkage: groups with few samples get pulled toward the global mean, while groups with many samples remain close to their own MLE.
3.2 Classic example: 8 schools
Gelman's 8 schools: estimate the effect of SAT coaching in 8 schools from observed effect estimates \(y_j\) with known standard errors \(\sigma_j\), using the hierarchical model \(\theta_j \sim \mathcal{N}(\mu, \tau^2)\), \(y_j \sim \mathcal{N}(\theta_j, \sigma_j^2)\).
Non-centered parameterization (avoids funnel geometry): write \(\theta_j = \mu + \tau\,\tilde\theta_j\) with \(\tilde\theta_j \sim \mathcal{N}(0, 1)\), so the sampler works on \(\tilde\theta\) rather than \(\theta\):
import pymc as pm
import numpy as np
y = np.array([28, 8, -3, 7, -1, 1, 18, 12])
sigma = np.array([15, 10, 16, 11, 9, 11, 10, 18])
with pm.Model() as eight_schools:
mu = pm.Normal("mu", 0, 10)
tau = pm.HalfCauchy("tau", 5)
theta_tilde = pm.Normal("theta_tilde", 0, 1, shape=8)
theta = pm.Deterministic("theta", mu + tau * theta_tilde)
obs = pm.Normal("obs", mu=theta, sigma=sigma, observed=y)
idata = pm.sample(2000, tune=1000, target_accept=0.95)
4. MCMC in practice
4.1 The sampler family
| Sampler | Core mechanism | Strengths | Weaknesses |
|---|---|---|---|
| Metropolis-Hastings | Proposal + acceptance ratio \(\alpha = \min(1, \frac{p(\theta') q(\theta\mid\theta')}{p(\theta) q(\theta'\mid\theta)})\) | General | Slow mixing in high dimensions |
| Gibbs | Sample one dimension at a time from \(p(\theta_i \mid \theta_{-i})\) | Efficient for conjugate models | Slow under strong parameter correlation |
| HMC | Introduces momentum, simulates Hamiltonian dynamics | Efficient in high dimensions | Step size \(\epsilon\) and number of steps \(L\) must be tuned |
| NUTS | HMC with auto-tuned \(L\) (U-turn criterion) | Nearly tuning-free | Implementation is intricate |
| SMC | Annealed sequence | Can estimate the marginal likelihood | Heavy computation |
| Riemannian HMC | Uses Fisher information as metric | More stable on ill-conditioned geometry | Even heavier |
4.2 HMC Hamiltonian dynamics
Introduce momentum \(r \sim \mathcal{N}(0, M)\) and define the Hamiltonian
\[ H(\theta, r) = -\log p(\theta \mid D) + \tfrac{1}{2}\, r^\top M^{-1} r. \]
Hamilton's equations:
\[ \frac{d\theta}{dt} = M^{-1} r, \qquad \frac{dr}{dt} = \nabla_\theta \log p(\theta \mid D). \]
Use the leapfrog integrator (half-step momentum, full-step position, half-step momentum) to simulate \(L\) steps and obtain the proposal \((\theta', r')\). Metropolis acceptance probability:
\[ \alpha = \min\big(1,\; \exp\big(H(\theta, r) - H(\theta', r')\big)\big). \]
An ideal Hamiltonian system conserves energy → acceptance rate \(\approx 1\); the leapfrog integrator's \(O(\epsilon^2)\) error introduces a small fraction of rejections. NUTS extends the trajectory dynamically at each step and uses "have the two ends of the trajectory begun a U-turn?" as the stopping criterion, eliminating the need to tune \(L\).
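A minimal leapfrog integrator for the dynamics above, assuming an identity mass matrix and a user-supplied `grad_log_post` (the gradient of \(\log p(\theta\mid D)\)):
import numpy as np

def leapfrog(theta, r, grad_log_post, eps, L):
    """Simulate L leapfrog steps of Hamiltonian dynamics; returns the proposal (theta', r')."""
    r = r + 0.5 * eps * grad_log_post(theta)     # half step for momentum
    for _ in range(L - 1):
        theta = theta + eps * r                  # full step for position (M = I)
        r = r + eps * grad_log_post(theta)       # full step for momentum
    theta = theta + eps * r
    r = r + 0.5 * eps * grad_log_post(theta)     # final half step
    return theta, -r                             # negate momentum for reversibility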
4.3 Diagnostics
- Trace plot: run four or more chains and overlay the traces; well-mixed chains look like a "fuzzy caterpillar";
- R-hat (\(\hat R\)): compares between-chain and within-chain variance; \(\hat R < 1.01\) is taken as converged;
- ESS (effective sample size): effective number of samples after accounting for autocorrelation. Bulk-ESS gauges mean precision, tail-ESS gauges quantile precision; aim for \(\ge 400\) per parameter;
- Divergent transitions: HMC's leapfrog diverges in funnels and narrow regions. When they appear, raise target_accept, switch to a non-centered parameterization, or shrink the step size;
- BFMI: how well the energy distribution mixes; \(< 0.3\) indicates insufficient momentum mixing;
- Posterior predictive checks (PPC): sample from the posterior to generate new data and compare with observations (a minimal ArviZ check is sketched after this list).
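A minimal sketch of these diagnostics with ArviZ, continuing the 8-schools `idata` from §3.2 (the call names follow the current ArviZ API as an assumption; older versions expose equivalent summary/plot functions):
import arviz as az

summary = az.summary(idata, var_names=["mu", "tau", "theta"])   # reports r_hat, ess_bulk, ess_tail
print(summary[["mean", "r_hat", "ess_bulk", "ess_tail"]])

n_div = int(idata.sample_stats["diverging"].sum())              # count divergent transitions
print(f"divergences: {n_div}, BFMI per chain: {az.bfmi(idata)}")

az.plot_trace(idata, var_names=["mu", "tau"])                   # "fuzzy caterpillar" check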
5. Bayesian linear regression
Model: \(y = X\beta + \epsilon\), \(\epsilon \sim \mathcal{N}(0, \sigma^2 I)\), prior \(\beta \sim \mathcal{N}(0, \tau^2 I)\).
The closed-form posterior is Gaussian, \(\beta \mid D \sim \mathcal{N}(\mu_n, \Sigma_n)\), with
\[ \Sigma_n = \big(\sigma^{-2} X^\top X + \tau^{-2} I\big)^{-1}, \qquad \mu_n = \sigma^{-2}\, \Sigma_n X^\top y. \]
Relation to ridge regression: the MAP estimate \(\hat\beta_{\text{MAP}} = \mu_n\) is equivalent to the ridge solution \(\hat\beta_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y\), with \(\lambda = \sigma^2 / \tau^2\).
Posterior predictive at a new input \(x_*\):
\[ p(y_* \mid x_*, D) = \mathcal{N}\big(x_*^\top \mu_n,\; x_*^\top \Sigma_n x_* + \sigma^2\big). \]
The predictive variance automatically incorporates irreducible noise + parameter uncertainty — this is the fundamental advantage of Bayesian methods over point estimates with regularization.
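The closed form above fits in a few lines of NumPy (a sketch assuming the noise variance `sigma2` and prior variance `tau2` are known):
import numpy as np

def bayes_linreg_posterior(X, y, sigma2, tau2):
    """Posterior N(mu_n, Sigma_n) over beta for y = X beta + eps with a N(0, tau2 I) prior."""
    d = X.shape[1]
    Sigma_n = np.linalg.inv(X.T @ X / sigma2 + np.eye(d) / tau2)
    mu_n = Sigma_n @ X.T @ y / sigma2
    return mu_n, Sigma_n

def predictive(x_star, mu_n, Sigma_n, sigma2):
    """Posterior predictive mean and variance at a new input x_star."""
    mean = x_star @ mu_n
    var = x_star @ Sigma_n @ x_star + sigma2    # parameter uncertainty + irreducible noise
    return mean, var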
6. Bayesian A/B testing
Business scenario: comparing the conversion rates of versions A and B.
6.1 Beta-Binomial model
For each version \(v \in \{A, B\}\):
\[ \theta_v \sim \mathrm{Beta}(\alpha_0, \beta_0), \qquad y_v \sim \mathrm{Binomial}(n_v, \theta_v). \]
The (conjugate) posterior:
\[ \theta_v \mid D \sim \mathrm{Beta}\big(\alpha_0 + y_v,\; \beta_0 + n_v - y_v\big). \]
We can directly read off interpretable quantities such as \(P(\theta_B > \theta_A \mid D)\), the lift \(\frac{\theta_B - \theta_A}{\theta_A}\), or the minimum detectable improvement — all far closer to business decisions than a \(p\)-value.
import pymc as pm
import numpy as np
obs = {"A": (120, 1000), "B": (135, 1000)} # (conversions, trials)
with pm.Model() as ab:
theta_A = pm.Beta("theta_A", 1, 1)
theta_B = pm.Beta("theta_B", 1, 1)
pm.Binomial("yA", n=obs["A"][1], p=theta_A, observed=obs["A"][0])
pm.Binomial("yB", n=obs["B"][1], p=theta_B, observed=obs["B"][0])
diff = pm.Deterministic("diff", theta_B - theta_A)
lift = pm.Deterministic("lift", (theta_B - theta_A) / theta_A)
idata = pm.sample(2000, tune=1000)
prob_B_better = (idata.posterior["diff"] > 0).mean().item()
print(f"P(B > A | D) = {prob_B_better:.3f}")
6.2 Sequential monitoring
Frequentist methods require committing to a sample size in advance — peeking inflates type-I error. The Bayesian framework can compute \(P(\theta_B > \theta_A)\) at any time and combine it with a preset loss function (stopping rule) to make sequential decisions.
6.3 Multi-armed bandits
Generalize A/B to many arms: Thompson Sampling draws \(\theta\) from each arm's posterior and picks the maximum, naturally balancing exploration and exploitation (regret close to optimal).
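A minimal Beta-Bernoulli Thompson sampling loop under these assumptions; the `pull(arm)` function returning 0/1 rewards is hypothetical:
import numpy as np

def thompson_sampling(pull, n_arms, n_rounds, rng):
    """Thompson sampling for Bernoulli arms with Beta(1, 1) priors."""
    wins = np.ones(n_arms)     # Beta posterior alpha (successes + 1)
    losses = np.ones(n_arms)   # Beta posterior beta  (failures + 1)
    for _ in range(n_rounds):
        theta = rng.beta(wins, losses)     # one posterior draw per arm
        arm = int(np.argmax(theta))        # play the arm with the largest draw
        reward = pull(arm)
        wins[arm] += reward
        losses[arm] += 1 - reward
    return wins, losses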
7. Bayesian Optimization (BO)
7.1 Framework
Goal: maximize a black-box, expensive function \(f: \mathcal{X} \to \mathbb{R}\), \(x^* = \arg\max f(x)\), where each evaluation is costly (e.g., training a deep network for a hyperparameter setting).
Iterate:
- Fit a surrogate model to the observations \(\{(x_i, y_i)\}\), typically a Gaussian process (GP).
- Construct an acquisition function \(a(x)\) that balances exploration and exploitation.
- Take \(x_{n+1} = \arg\max_x a(x)\), evaluate, and add to the dataset.
7.2 The GP surrogate model
With a GP prior \(f \sim \mathcal{GP}(0, k)\) and noisy observations \(y = f(X) + \epsilon\), the posterior at a new point \(x_*\) is Gaussian with
\[ \mu(x_*) = k_*^\top (K + \sigma^2 I)^{-1} y, \qquad \sigma^2(x_*) = k(x_*, x_*) - k_*^\top (K + \sigma^2 I)^{-1} k_*, \]
where \(K_{ij} = k(x_i, x_j)\) and \((k_*)_i = k(x_i, x_*)\).
7.3 Three popular acquisition functions
Let \(\mu(x), \sigma(x)\) denote the GP posterior mean and standard deviation, and \(f^+ = \max y_i\).
Probability of Improvement (PI):
\[ a_{\mathrm{PI}}(x) = \Phi\!\Big(\frac{\mu(x) - f^+}{\sigma(x)}\Big). \]
Expected Improvement (EI):
\[ a_{\mathrm{EI}}(x) = \big(\mu(x) - f^+\big)\,\Phi(Z) + \sigma(x)\,\phi(Z), \qquad Z = \frac{\mu(x) - f^+}{\sigma(x)}. \]
EI balances "expected magnitude of improvement" with "probability that improvement is possible" and is the most commonly used acquisition function.
Upper Confidence Bound (UCB):
\[ a_{\mathrm{UCB}}(x) = \mu(x) + \kappa\, \sigma(x). \]
\(\kappa\) controls the exploration level. Srinivas et al. (2010) gave sublinear-regret guarantees for GP-UCB.
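A minimal EI implementation over the GP posterior mean/std defined above (a sketch; `mu` and `sigma` can come from any fitted GP, e.g. a regressor that returns a predictive standard deviation):
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    """EI acquisition for candidate points, given GP posterior mean/std and incumbent f_best."""
    sigma = np.maximum(sigma, 1e-12)               # guard against zero predictive std
    z = (mu - f_best) / sigma
    return (mu - f_best) * norm.cdf(z) + sigma * norm.pdf(z)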
7.4 Application areas
- Hyperparameter search in deep learning (learning rate, layer width, regularization strength);
- Experimental design (materials, chemical reaction conditions);
- Tuning robotic control policies;
- Continuous-parameter optimization in A/B settings.
Tools: BoTorch (PyTorch backend), GPyOpt, Ax, scikit-optimize.
8. Engineering considerations
8.1 Prior sensitivity analysis
When reporting results, you must perturb the prior: replace \(\tau \sim \mathrm{HalfCauchy}(5)\) with \(\mathrm{HalfNormal}(2)\) and rerun to see whether the posterior remains stable. If it does not, either the data carry too little information or the prior is too strong.
Weakly informative priors: Gelman's recommended practice — neither use a flat prior (numerically unstable and not invariant to reparameterization) nor an overly tight one. Common choices are on the order of \(\mathcal{N}(0, 5)\), with HalfNormal/HalfCauchy on scale parameters.
8.2 Model comparison
| Criterion | Formula | Note |
|---|---|---|
| DIC | \(-2\log p(y\mid\hat\theta) + 2 p_D\) | Outdated, no longer recommended |
| WAIC | \(-2\sum_i \log\!\big(\frac{1}{S}\sum_s p(y_i\mid\theta^{(s)})\big) + 2 p_W\) | Fully Bayesian, pointwise |
| PSIS-LOO | Importance-weighted LOO likelihood | Recommended by Vehtari (2017) |
| Bayes factor | \(\frac{p(D\mid M_1)}{p(D\mid M_2)}\) | Strict but highly sensitive to priors |
| Posterior predictive checks | Visual / statistical comparison | Mandatory |
In PyMC: pm.compare({"m1": idata1, "m2": idata2}, ic="loo").
8.3 Reproducibility checklist
- Fix random seeds;
- Record the number of chains, warmup steps, target_accept, and the sampler used;
- Report \(\hat R\), ESS, and the number of divergences;
- Submit the model code (not just results);
- Put the data preprocessing pipeline under version control;
- Posterior summaries should report intervals (e.g. 89% HDI), not only the mean.
8.4 When not to use Bayes
- Vast amounts of data, prior influence negligible → MLE/MAP suffices, and running NUTS is wasteful;
- Real-time inference with millisecond latency → posterior sampling is too slow; use VI or fall back to MAP;
- A model whose priors are hard to specify (e.g. all weights of a black-box deep network) → consider the BDL approximations (see 贝叶斯深度学习).
9. Cross-references
- Tribe perspective and genealogy: this page, §1 (the tribe entry point)
- Graphical-model foundations of HMM, Kalman, and LDA: 图模型与隐马尔可夫
- Extending Bayesian ideas to deep networks: 贝叶斯深度学习与不确定性
- Mathematical foundations and conjugate priors: ../../03_Machine_Learning/贝叶斯学习.md
References
- Salvatier, J., Wiecki, T. V., Fonnesbeck, C. (2016). "Probabilistic programming in Python using PyMC3". PeerJ Computer Science, 2:e55.
- Carpenter, B., Gelman, A., Hoffman, M. D., et al. (2017). "Stan: A Probabilistic Programming Language". Journal of Statistical Software, 76(1).
- Phan, D., Pradhan, N., Jankowiak, M. (2019). "Composable Effects for Flexible and Accelerated Probabilistic Programming in NumPyro". arXiv:1912.11554.
- Bingham, E., Chen, J. P., Jankowiak, M., et al. (2019). "Pyro: Deep Universal Probabilistic Programming". JMLR, 20(28).
- Hoffman, M. D., Gelman, A. (2014). "The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo". JMLR, 15: 1593-1623.
- Neal, R. M. (2011). "MCMC Using Hamiltonian Dynamics". In Handbook of Markov Chain Monte Carlo. CRC Press.
- Gelman, A., Carlin, J., Stern, H., Dunson, D., Vehtari, A., Rubin, D. (2013). Bayesian Data Analysis (3rd ed.). CRC Press.
- Vehtari, A., Gelman, A., Gabry, J. (2017). "Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC". Statistics and Computing, 27: 1413-1432.
- Frazier, P. I. (2018). "A Tutorial on Bayesian Optimization". arXiv:1807.02811.
- Snoek, J., Larochelle, H., Adams, R. P. (2012). "Practical Bayesian Optimization of Machine Learning Algorithms". NeurIPS.
- Srinivas, N., Krause, A., Kakade, S. M., Seeger, M. (2010). "Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design". ICML.
- McElreath, R. (2020). Statistical Rethinking (2nd ed.). CRC Press.
Bayesian Deep Learning and Uncertainty
Deep networks already achieve high accuracy on i.i.d. test sets, but on out-of-distribution (OOD) data they often produce highly confident wrong predictions, posing a fundamental risk for safety-critical applications (autonomous driving, medical diagnostics, financial risk control). The goal of Bayesian Deep Learning (BDL) is to equip deep networks with well-calibrated uncertainty estimates.
1. Why deep networks need Bayes
1.1 Three symptoms
- Overconfidence: modern ResNets and Transformers routinely output softmax values \(>0.99\) on misclassified samples; the empirical study of Guo (2017) shows that ECE is far higher than for earlier shallow networks.
- OOD failure: a network trained on CIFAR-10 and tested on SVHN still classifies with high confidence — it cannot recognize "I have not seen this".
- Catastrophic errors: standard MLE/MAP networks output point estimates with no variance; downstream decisions (rejection, hand-off to humans) lack a principled basis.
1.2 What the Bayesian framework promises
The posterior predictive
\[ p(y \mid x, D) = \int p(y \mid x, w)\, p(w \mid D)\, dw \]
automatically integrates weight uncertainty and marginalizes it into the output distribution — in theory this delivers well-calibrated probabilities. The problem is that \(p(w\mid D)\) is intractable when there are millions of parameters; the entire BDL toolbox is about how to approximate this posterior.
2. Two kinds of uncertainty
| Type | Source | Reducible by more data? | Examples |
|---|---|---|---|
| Aleatoric (data noise) | Inherent randomness of the data, measurement noise, label ambiguity | No | Blurry images, dice rolls |
| Epistemic (model ignorance) | The model's ignorance over unseen regions, limited training data | Yes | OOD inputs, sparse regions of the training distribution |
Mathematically, the predictive variance decomposes (law of total variance) as
\[ \mathrm{Var}[y \mid x, D] = \underbrace{\mathbb{E}_{p(w\mid D)}\big[\sigma_w^2(x)\big]}_{\text{aleatoric}} \;+\; \underbrace{\mathrm{Var}_{p(w\mid D)}\big[\mu_w(x)\big]}_{\text{epistemic}}. \]
Operational meaning: high epistemic → trigger "abstention" or active-learning sampling; high aleatoric → provide a "probabilistic output" or ask for additional sensors.
For regression tasks, aleatoric uncertainty can be modeled as \(y = f_w(x) + \epsilon(x)\), letting the network output \((\mu, \sigma^2)\) to learn heteroscedastic noise.
3. Bayesian Neural Networks (BNN)
Treat each weight \(w_i\) as a random variable with prior \(p(w)\) (commonly \(\mathcal{N}(0, \sigma_p^2)\)), and update it with observed data to obtain the posterior \(p(w \mid D)\).
The ideal approach is to sample the posterior directly with HMC. Neal (1995) succeeded on shallow networks, but on deep networks the leapfrog cost and mixing difficulty are severe; only in the past few years (Izmailov 2021) have large-scale experiments appeared. In practice the following approximations are used.
4. Bayes by Backprop (Blundell 2015)
4.1 The variational inference (VI) framework
Introduce a parametric approximate posterior \(q_\phi(w)\) and maximize the ELBO:
\[ \mathcal{L}(\phi) = \mathbb{E}_{q_\phi(w)}\big[\log p(D \mid w)\big] - \mathrm{KL}\big(q_\phi(w)\,\|\,p(w)\big). \]
The KL identity:
\[ \log p(D) = \mathcal{L}(\phi) + \mathrm{KL}\big(q_\phi(w)\,\|\,p(w \mid D)\big). \]
Maximizing the ELBO ⇔ minimizing the KL ⇔ pushing \(q\) toward the true posterior.
4.2 Mean-field Gaussian approximation
Each weight is independent and Gaussian: \(q_\phi(w_i) = \mathcal{N}(\mu_i, \sigma_i^2)\). The parameter count doubles (\(\mu, \rho\) per weight, with \(\sigma = \log(1+e^\rho)\) keeping it positive).
4.3 The reparameterization trick
To make \(\nabla_\phi \mathbb{E}_{q_\phi}[\cdot]\) differentiable, push the sampling out of \(q\) into a parameter-free distribution:
\[ w = \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I). \]
Then
\[ \nabla_\phi\, \mathbb{E}_{q_\phi(w)}\big[f(w)\big] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\big[\nabla_\phi f(\mu + \sigma \odot \epsilon)\big], \]
so the gradient can be computed in a single forward + backward pass.
4.4 Training
In each mini-batch, draw a set of \(\epsilon\) → compute \(w\) → forward pass → loss = negative likelihood + KL regularizer → backprop to update \(\mu, \rho\). At prediction time, sample \(w\) multiple times and average to obtain the posterior predictive.
Pros: principled and SGD-compatible. Cons: the mean-field assumption ignores weight correlations and often substantially underestimates variance; the parameter count doubles.
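A minimal mean-field Bayesian linear layer in PyTorch illustrating §4.2-4.4 (a sketch under simplifying assumptions — no bias term, a \(\mathcal{N}(0,1)\) prior — not the original paper's implementation):
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesLinear(nn.Module):
    """Mean-field Gaussian weights with the reparameterization trick."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(d_out, d_in))
        self.rho = nn.Parameter(torch.full((d_out, d_in), -5.0))   # sigma = softplus(rho)

    def forward(self, x):
        sigma = F.softplus(self.rho)
        eps = torch.randn_like(sigma)
        w = self.mu + sigma * eps                                  # w = mu + sigma * eps
        # KL(q || N(0, 1)) summed over weights, stored for the loss term
        self.kl = (torch.log(1.0 / sigma) + (sigma**2 + self.mu**2) / 2 - 0.5).sum()
        return x @ w.t()

# Per mini-batch: loss = negative log-likelihood + KL / num_batches, e.g.
# loss = F.cross_entropy(layer(x), y) + layer.kl / num_batches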
5. MC Dropout (Gal & Ghahramani 2016)
5.1 The central claim
For a neural network with dropout, training with dropout and keeping dropout active at test time across multiple forward passes produces a predictive distribution that is equivalent to an approximate variational-inference posterior predictive.
5.2 Sketch of the derivation
Consider an \(L\)-layer network with weight matrices \(M_l\). Express the dropout mask \(z_l \in \{0,1\}^{K_l}\) (independent Bernoulli(\(p\)) entries) as a "sampled" weight:
\[ \widetilde W_l = M_l\, \mathrm{diag}(z_l). \]
Define the approximate posterior:
\[ q(W_l) = \prod_{k=1}^{K_l} \Big[ p\, \delta\big(W_{l,\cdot k} - M_{l,\cdot k}\big) + (1 - p)\, \delta\big(W_{l,\cdot k}\big) \Big], \]
i.e., each weight column either takes the value \(M_l\) or is zeroed. This is a highly restricted family of variational distributions.
The ELBO loss (per mini-batch):
\[ \hat{\mathcal{L}} = -\frac{1}{N}\sum_{i=1}^{N} \frac{1}{T}\sum_{t=1}^{T}\log p\big(y_i \mid x_i, \widetilde W^{(i,t)}\big) + \lambda \sum_{l} \|M_l\|_2^2. \]
First term: average over \(T\) dropout-mask samples per data point (in practice \(T = 1\), i.e. standard SGD with dropout). KL term: under a Gaussian prior, the analytic KL expansion equals L2 regularization plus a constant depending on \(p\) — meaning that dropout + L2 ≈ variational inference.
5.3 Inference
At test time, leave dropout on and perform \(T\) stochastic forward passes:
\[ \mathbb{E}[y] \approx \frac{1}{T}\sum_{t=1}^{T} f_{\widetilde w_t}(x), \qquad \mathrm{Var}[y] \approx \tau^{-1} I + \frac{1}{T}\sum_{t=1}^{T} f_{\widetilde w_t}(x)\, f_{\widetilde w_t}(x)^\top - \mathbb{E}[y]\,\mathbb{E}[y]^\top, \]
where \(\tau\) is the model precision (a function of weight decay, dropout rate, and dataset size).
# Minimal pseudocode (PyTorch)
import torch
model.train()  # key: keep dropout active at test time
preds = torch.stack([model(x) for _ in range(T)])  # T stochastic forward passes
mean = preds.mean(0)   # posterior predictive mean
var = preds.var(0)     # epistemic component of the predictive variance
Pros: zero extra parameters, drop-in compatibility with existing training pipelines, almost no overhead. Cons: the dropout rate \(p\) is a fixed prior; the approximation bias is uncontrolled. Follow-up work (Concrete Dropout) makes \(p\) learnable.
6. Deep Ensembles (Lakshminarayanan 2017)
6.1 Method
Independently train \(M\) networks (different random initializations and mini-batch orders) and average their predictions:
\[ p(y \mid x) \approx \frac{1}{M}\sum_{m=1}^{M} p\big(y \mid x, w_m\big). \]
6.2 Connection to Bayes
Although there is no explicit posterior, Deep Ensembles often empirically outperform BNN-VI and MC Dropout (in accuracy, calibration, and OOD detection). Wilson & Izmailov (2020) argue that the loss landscape of neural networks contains many equivalent local modes, and each training run lands in a different mode; multiple training runs are roughly an "informal sampling" of a multi-modal posterior. They can be viewed as a special case of MultiSWAG.
6.3 Engineering trade-offs
- Training cost \(\times M\) (typically \(M=5\));
- Inference cost \(\times M\);
- But each member can be trained in parallel;
- On OOD detection and long-tailed classification benchmarks they remain very hard to beat.
7. SWAG / SWA
7.1 SWA (Stochastic Weight Averaging, Izmailov 2018)
Late in training, under a constant or cyclic learning rate, average the weights every few epochs:
\[ \bar w_{\text{SWA}} = \frac{1}{K}\sum_{k=1}^{K} w_{t_k}. \]
This yields a flatter solution than a single SGD endpoint and generalizes better.
7.2 SWAG (Maddox 2019)
SWA upgraded to a Gaussian approximation over weights:
\[ p(w \mid D) \approx \mathcal{N}\Big(\bar w_{\text{SWA}},\; \tfrac{1}{2}\big(\Sigma_{\text{diag}} + \Sigma_{\text{lowrank}}\big)\Big). \]
The low-rank covariance is estimated from the deviation matrix along the SGD trajectory \(D = [w_{t_1} - \bar w, \dots, w_{t_K} - \bar w]\): \(\Sigma_{\text{lowrank}} = \frac{1}{K-1} D D^\top\).
At prediction time, sample \(w\) from this Gaussian and form the posterior predictive — performance approaches Deep Ensembles while training cost is close to a single model.
8. Laplace approximation / Laplace Redux
8.1 Second-order expansion
Take a second-order Taylor expansion of the negative log-posterior around the MAP solution \(w^*\):
\[ -\log p(w \mid D) \approx -\log p(w^* \mid D) + \tfrac{1}{2}(w - w^*)^\top H\,(w - w^*), \]
where \(H = -\nabla^2 \log p(w \mid D)\big|_{w^*}\) is the Hessian. This is equivalent to approximating the posterior as a Gaussian:
\[ p(w \mid D) \approx \mathcal{N}\big(w^*,\; H^{-1}\big). \]
8.2 Laplace Redux (Daxberger 2021)
Storing the full Hessian for a deep network is infeasible (\(O(P^2)\)). Common approximations:
- Last-layer Laplace: apply Laplace to the final linear layer only and keep all other layers at their MAP values;
- Diagonal Hessian: ignore off-diagonal terms;
- KFAC (Kronecker-factored): approximate the Hessian as a per-layer Kronecker product;
- GGN (Generalized Gauss-Newton): replace the Hessian with the Fisher information matrix — positive semidefinite and batch-computable.
The laplace-torch library (Daxberger 2021) wraps these options and applies Laplace post-hoc to an already trained model, leaving accuracy nearly unchanged while substantially improving calibration and OOD performance.
8.3 Approximating the posterior predictive
For classification, \(p(y\mid x^*) = \int \mathrm{softmax}(f_w(x^*)) p(w\mid D) dw\) has no closed form. Common options:
- MC: sample weights from the Gaussian and average;
- Probit approximation: replace softmax with a probit, giving a closed form (suitable for last-layer Laplace).
9. Calibration
9.1 The ECE metric
Bin the predicted confidence \(\hat p\) into \(M\) bins \(B_m\):
\[ \mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{N}\,\Big|\mathrm{acc}(B_m) - \mathrm{conf}(B_m)\Big|. \]
The ideal ECE is 0. MCE is the maximum bin deviation.
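A minimal ECE computation following the binning definition above (`probs` are softmax outputs of shape (N, classes), `labels` integer ground truth):
import numpy as np

def ece(probs, labels, n_bins=15):
    """Expected Calibration Error with equal-width confidence bins."""
    conf = probs.max(axis=1)                 # predicted confidence
    pred = probs.argmax(axis=1)
    correct = (pred == labels).astype(float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            total += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return total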
9.2 Temperature scaling
The simplest and most widely used post-hoc calibrator: fit a scalar \(T\) on a validation set,
\[ \hat p = \mathrm{softmax}(z / T), \]
minimizing the NLL with respect to \(T\). \(T > 1\) smooths; \(T < 1\) sharpens. Guo et al. (2017) showed that a single scalar \(T\) can drive the ECE of modern CNNs from 5%-10% down below 1%.
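Fitting the single scalar \(T\) by validation NLL takes only a few lines (a sketch; `val_logits` and `val_labels` are assumed to be precomputed tensors):
import torch

def fit_temperature(val_logits, val_labels, steps=200):
    """Learn one temperature T by minimizing validation NLL (Guo et al. 2017)."""
    log_T = torch.zeros(1, requires_grad=True)     # optimize log T so that T stays positive
    opt = torch.optim.LBFGS([log_T], lr=0.1, max_iter=steps)

    def closure():
        opt.zero_grad()
        loss = torch.nn.functional.cross_entropy(val_logits / log_T.exp(), val_labels)
        loss.backward()
        return loss

    opt.step(closure)
    return log_T.exp().item()                      # calibrated probs: softmax(z / T)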
9.3 Platt scaling
The binary version: a logistic regression mapping logits \(z\) to calibrated probabilities:
\[ \hat p = \sigma(a z + b). \]
9.4 Isotonic regression
A nonparametric monotonic mapping; more flexible than Platt but requires more data.
Caveat: post-hoc methods like temperature scaling only calibrate in-distribution; on OOD they remain overconfident. The selling point of BDL is producing higher uncertainty on OOD inputs, complementary to temperature scaling.
10. OOD detection
10.1 Baselines
- Max softmax probability (Hendrycks 2017): take \(\max_k \hat p_k\) as the score; OOD inputs should score lower.
- Energy score (Liu 2020): \(E(x) = -T \log \sum_k e^{f_k(x)/T}\); theoretically better than MSP.
- Mahalanobis distance (Lee 2018): under per-class Gaussians, compute \(D_M(x) = (\phi(x) - \mu_c)^\top \Sigma^{-1}(\phi(x) - \mu_c)\); OOD inputs are far away.
10.2 Bayesian methods
- Epistemic variance from MC Dropout / BNN: use directly as an OOD score;
- Difference of predictive entropies in a Deep Ensemble: \(\mathcal{H}[\bar p] - \frac{1}{M}\sum_m \mathcal{H}[p_m]\) is the BALD mutual information, specifically capturing epistemic uncertainty;
- GP \(\sigma_n(x)\): naturally grows away from training points.
10.3 Datasets and benchmarks
CIFAR-10 vs SVHN, CIFAR-10 vs CIFAR-100, ImageNet-O, OpenOOD benchmark. Metrics: AUROC, AUPR, FPR@95%TPR.
11. The BDL method genealogy
flowchart TD
A[Bayesian Neural Network<br/>BNN] --> B[Exact MCMC<br/>HMC, Neal 1995]
A --> C["Variational Inference (VI)"]
A --> D[Laplace approximation]
A --> E[Monte Carlo sampling approximations]
C --> C1[Bayes by Backprop<br/>Blundell 2015]
C --> C2[MC Dropout<br/>Gal 2016]
C --> C3[Concrete Dropout]
C --> C4[Functional VI / FVB]
D --> D1[Last-layer Laplace]
D --> D2[KFAC Laplace]
D --> D3[Laplace Redux 2021]
E --> E1[Deep Ensembles<br/>Lakshminarayanan 2017]
E --> E2[SWA / SWAG<br/>Izmailov 2018, Maddox 2019]
E --> E3[MultiSWAG]
A --> F[Bayesian generative models]
F --> F1[VAE<br/>Kingma 2014]
F --> F2[Diffusion models<br/>Sohl-Dickstein 2015, Ho 2020]
F --> F3[Normalizing Flows]
12. Connection to VAE / diffusion models
12.1 VAE: amortized variational inference
The VAE is the canonical latent-variable Bayesian generative model:
\[ p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz, \qquad p(z) = \mathcal{N}(0, I). \]
It introduces a recognition network \(q_\phi(z\mid x)\) that amortizes posterior inference, maximizing the ELBO:
\[ \mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z\mid x)}\big[\log p_\theta(x \mid z)\big] - \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big). \]
Reparameterization (\(z = \mu_\phi(x) + \sigma_\phi(x)\odot\epsilon\)) lets gradients flow back to \(\phi\). The VAE sits at the intersection of BDL and generative modeling; for details see VAE notes.
12.2 Diffusion models: hierarchical variational inference
The training objective of DDPM (Ho 2020) can be written as a (negative) ELBO:
\[ \mathcal{L} = \mathbb{E}_q\Big[ \mathrm{KL}\big(q(x_T \mid x_0)\,\|\,p(x_T)\big) + \sum_{t=2}^{T} \mathrm{KL}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big) - \log p_\theta(x_0 \mid x_1) \Big]. \]
Each denoising step is the KL term of a conditional Gaussian. From this perspective, diffusion models are hierarchical VAEs with \(T\)-step latent variables, closing the loop with the Bayesian tribe entirely.
12.3 Recent developments such as Bayesian Flow Networks
A number of recent works (Bayesian Flow Networks, Graves 2023; Diffusion Schrödinger Bridge) explicitly combine Bayesian inference with diffusion processes, and are at the active frontier of research.
13. Practitioner's checklist
| Task | Recommended method | Notes |
|---|---|---|
| Already-trained model, want to add uncertainty quickly | Last-layer Laplace or MC Dropout | Lowest integration cost |
| Retraining, ample budget | Deep Ensembles (\(M=5\)) | Usually the strongest baseline |
| Retraining budget constrained but want a posterior feel | SWAG | Single-model cost |
| Safety-critical and retrainable | Deep Ensembles + temperature scaling | Calibration + robustness |
| Large models (LLM) | Last-layer Laplace, LoRA-BNN, ensemble of LoRAs | Full BNN infeasible |
| OOD detection | Energy score or Mahalanobis + ensembles | Strong baselines |
14. Cross-references
- Tribe perspective and algorithmic genealogy: this page, §1 (the tribe entry point)
- Foundations of probabilistic programming and Bayesian statistics: 概率编程与贝叶斯统计实战
- Graphical models and sequential Bayes: 图模型与隐马尔可夫
- Detailed VAE derivation: ../../../1_DeepLearning/Generative_Models/VAE.md
References
- Blundell, C., Cornebise, J., Kavukcuoglu, K., Wierstra, D. (2015). "Weight Uncertainty in Neural Networks". ICML. (Bayes by Backprop)
- Gal, Y., Ghahramani, Z. (2016). "Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning". ICML.
- Lakshminarayanan, B., Pritzel, A., Blundell, C. (2017). "Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles". NeurIPS.
- Daxberger, E., Kristiadi, A., Immer, A., Eschenhagen, R., Bauer, M., Hennig, P. (2021). "Laplace Redux — Effortless Bayesian Deep Learning". NeurIPS.
- Guo, C., Pleiss, G., Sun, Y., Weinberger, K. Q. (2017). "On Calibration of Modern Neural Networks". ICML.
- Maddox, W. J., Izmailov, P., Garipov, T., Vetrov, D. P., Wilson, A. G. (2019). "A Simple Baseline for Bayesian Uncertainty in Deep Learning". NeurIPS. (SWAG)
- Izmailov, P., Vikram, S., Hoffman, M. D., Wilson, A. G. (2021). "What Are Bayesian Neural Network Posteriors Really Like?". ICML.
- Wilson, A. G., Izmailov, P. (2020). "Bayesian Deep Learning and a Probabilistic Perspective of Generalization". NeurIPS.
- Kingma, D. P., Welling, M. (2014). "Auto-Encoding Variational Bayes". ICLR.
- Ho, J., Jain, A., Abbeel, P. (2020). "Denoising Diffusion Probabilistic Models". NeurIPS.
- Hendrycks, D., Gimpel, K. (2017). "A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks". ICLR.
- Liu, W., Wang, X., Owens, J., Li, Y. (2020). "Energy-based Out-of-distribution Detection". NeurIPS.
- Lee, K., Lee, K., Lee, H., Shin, J. (2018). "A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks". NeurIPS.
- Neal, R. M. (1996). Bayesian Learning for Neural Networks. Springer. (Earliest systematic treatment of BNNs)