Statistical Learning Theory
Introduction
Statistical learning theory provides rigorous mathematical foundations for machine learning, answering fundamental questions such as "why does learning succeed" and "how much data is needed." This article covers the PAC learning framework, VC dimension, generalization bounds, bias-variance decomposition, and other core concepts.
1. PAC Learning Framework
1.1 Definition
PAC (Probably Approximately Correct) learning was proposed by Valiant (1984):
Goal: can a learner find a hypothesis that is approximately correct with high probability?
Formal definition: a hypothesis class \(\mathcal{H}\) is PAC learnable if there exists an algorithm \(A\) such that for any:
- Error parameter \(\epsilon > 0\)
- Confidence parameter \(\delta > 0\)
- Data distribution \(\mathcal{D}\)
as long as the sample size satisfies \(m \geq m_0(\epsilon, \delta)\), algorithm \(A\) outputs a hypothesis \(h\) satisfying:
\[\Pr_{S \sim \mathcal{D}^m}\left[R(h) \leq \epsilon\right] \geq 1 - \delta\]
(in the realizable setting, where some hypothesis in \(\mathcal{H}\) achieves zero risk), where \(R(h) = \mathbb{E}_{(x,y)\sim\mathcal{D}}[\ell(h(x), y)]\) is the true risk.
1.2 PAC Bound for Finite Hypothesis Classes
For a finite hypothesis class \(|\mathcal{H}| < \infty\) in the realizable setting, the required sample size is:
\[m \geq \frac{1}{\epsilon}\left(\ln|\mathcal{H}| + \ln\frac{1}{\delta}\right)\]
Interpretation:
- The larger the hypothesis space, the more samples are needed
- The higher the required accuracy (smaller \(\epsilon\)), the more samples are needed
- The higher the required confidence (smaller \(\delta\)), the more samples are needed
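As a quick sanity check on this trade-off, here is a small Python sketch; the function name `pac_sample_size` is illustrative, and it assumes the realizable-case form \(m \geq \frac{1}{\epsilon}(\ln|\mathcal{H}| + \ln\frac{1}{\delta})\):

```python
import math

def pac_sample_size(h_size: int, epsilon: float, delta: float) -> int:
    """Sufficient sample size for PAC learning a finite hypothesis class
    (realizable case): m >= (1/eps) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)

# Doubling |H| only adds about ln(2)/epsilon extra samples,
# while halving epsilon doubles the requirement.
m_small = pac_sample_size(1000, epsilon=0.05, delta=0.01)
m_big = pac_sample_size(2000, epsilon=0.05, delta=0.01)
```

Note how mild the dependence on \(|\mathcal{H}|\) is: the sample size grows only logarithmically in the size of the hypothesis space.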
2. VC Dimension
2.1 Shattering
Given a set of data points \(S = \{x_1, \ldots, x_m\}\), if hypothesis class \(\mathcal{H}\) can realize all \(2^m\) possible binary label combinations on \(S\), then \(\mathcal{H}\) shatters \(S\).
2.2 Growth Function
The growth function measures the maximum number of distinct labelings a hypothesis class can produce on \(m\) points:
\[\Pi_{\mathcal{H}}(m) = \max_{x_1, \ldots, x_m} \left|\left\{(h(x_1), \ldots, h(x_m)) : h \in \mathcal{H}\right\}\right|\]
Clearly \(\Pi_{\mathcal{H}}(m) \leq 2^m\).
2.3 VC Dimension Definition
The VC dimension (Vapnik-Chervonenkis dimension) \(d_{VC}\) of hypothesis class \(\mathcal{H}\) is the size of the largest set that can be shattered by \(\mathcal{H}\):
\[d_{VC}(\mathcal{H}) = \max\{m : \Pi_{\mathcal{H}}(m) = 2^m\}\]
2.4 Classic Examples
| Hypothesis Class | VC Dimension | Explanation |
|---|---|---|
| Threshold functions on \(\mathbb{R}\) | 1 | Can shatter 1 point, not 2 |
| Linear classifiers on \(\mathbb{R}^2\) | 3 | Can shatter 3 non-collinear points |
| Linear classifiers on \(\mathbb{R}^d\) | \(d + 1\) | \(d\) weights plus a bias term |
| Union of \(k\) intervals | \(2k\) | Each interval contributes 2 endpoints |
| Neural network with \(W\) weights | \(O(W \log W)\) | Approximate result |
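The first row of the table can be verified by brute force: enumerate all \(2^m\) labelings and check whether thresholds realize each one. A minimal sketch (helper names `threshold_labels` and `shatters_thresholds` are illustrative):

```python
from itertools import product

def threshold_labels(points, t):
    """Threshold hypothesis h_t(x) = 1 if x >= t else 0."""
    return tuple(1 if x >= t else 0 for x in points)

def shatters_thresholds(points):
    """Brute-force check: do threshold functions on R realize all 2^m
    labelings of `points`? Midpoints between sorted points, plus one
    threshold below and one above, cover every distinct behavior."""
    xs = sorted(points)
    candidates = ([xs[0] - 1.0]
                  + [(a + b) / 2 for a, b in zip(xs, xs[1:])]
                  + [xs[-1] + 1.0])
    realized = {threshold_labels(points, t) for t in candidates}
    return all(lab in realized for lab in product([0, 1], repeat=len(points)))
```

On any two points the labeling "left point positive, right point negative" is unreachable, which is exactly why \(d_{VC} = 1\) for thresholds.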
2.5 Sauer's Lemma
If \(d_{VC}(\mathcal{H}) = d < \infty\), then:
\[\Pi_{\mathcal{H}}(m) \leq \sum_{i=0}^{d} \binom{m}{i} \leq \left(\frac{em}{d}\right)^{d} \quad (m \geq d)\]
So when \(m > d_{VC}\), the growth function transitions from exponential to polynomial growth.
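Sauer's lemma is easy to evaluate numerically. The sketch below (function name `sauer_bound` is illustrative) computes \(\sum_{i=0}^{d}\binom{m}{i}\) so it can be compared against \(2^m\):

```python
from math import comb

def sauer_bound(m: int, d: int) -> int:
    """Sauer's lemma bound on the growth function:
    Pi_H(m) <= sum_{i=0}^{d} C(m, i) for a class of VC dimension d."""
    return sum(comb(m, i) for i in range(d + 1))

# For m <= d the bound equals 2^m (exponential regime);
# for m > d it grows only polynomially, like O(m^d).
```

For example, with \(d = 3\) the bound matches \(2^m\) at \(m = 3\) but is already far below \(2^{10} = 1024\) at \(m = 10\).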
3. Generalization Bounds
3.1 VC Generalization Bound
For a hypothesis class with VC dimension \(d\), with probability at least \(1-\delta\), for every \(h \in \mathcal{H}\):
\[R(h) \leq \hat{R}(h) + \sqrt{\frac{8}{m}\left(d \ln\frac{2em}{d} + \ln\frac{4}{\delta}\right)}\]
where \(\hat{R}(h) = \frac{1}{m}\sum_{i=1}^{m}\ell(h(x_i), y_i)\) is the empirical risk.
Interpretation:
- Complexity penalty \(\propto \sqrt{d/m}\) (up to logarithmic factors)
- When \(m \gg d\), training error approximates generalization error
3.2 Required Sample Size
For the generalization bound to be meaningful, approximately
\[m = O\!\left(\frac{d + \ln(1/\delta)}{\epsilon^2}\right)\]
samples are needed, where \(d = d_{VC}\) and \(\epsilon\) is the acceptable generalization gap.
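To get a feel for how quickly the gap closes, one can evaluate a standard form of the VC complexity term, \(\sqrt{\frac{8}{m}(d\ln\frac{2em}{d} + \ln\frac{4}{\delta})}\) (one of several versions in the literature that agree up to constants; the function name is illustrative):

```python
import math

def vc_complexity_term(m: int, d: int, delta: float) -> float:
    """Complexity term of the VC bound:
    sqrt((8/m) * (d * ln(2*e*m/d) + ln(4/delta)))."""
    return math.sqrt((8.0 / m)
                     * (d * math.log(2 * math.e * m / d)
                        + math.log(4.0 / delta)))

# With d = 10 the bound is vacuous (> 1) at m = 100 samples,
# then shrinks roughly like sqrt(d/m) as the sample grows.
for m in (100, 1_000, 10_000, 100_000):
    print(m, round(vc_complexity_term(m, d=10, delta=0.05), 3))
```

This also illustrates why worst-case VC bounds are often loose in practice: thousands of samples per unit of VC dimension are needed before the guarantee becomes tight.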
4. Bias-Variance Decomposition
For regression with squared loss, the expected prediction error at a point \(x\) can be decomposed into three parts:
\[\mathbb{E}\left[(y - \hat{f}(x))^2\right] = \underbrace{\left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]}_{\text{Variance}} + \sigma^2\]
| Component | Source | Meaning |
|---|---|---|
| Bias \(\text{Bias}^2\) | Model assumptions | Model is not flexible enough, systematically deviates from true values |
| Variance | Data randomness | Model is too sensitive to changes in training data |
| Noise \(\sigma^2\) | Data itself | Irreducible random error |
Bias-Variance Trade-off

```
Error
  │ \                        /
  │  \  Bias²      Variance /
  │   \                    /
  │    \______●______ ____/
  │           ↑
  │    Optimal complexity (minimum total error)
  └────────────────────────────→ Model complexity
    Underfitting  Moderate  Overfitting
```
| Model Complexity | Bias | Variance | Behavior |
|---|---|---|---|
| Low (e.g., linear regression) | High | Low | Underfitting |
| Moderate | Medium | Medium | Optimal |
| High (e.g., deep networks) | Low | High | Overfitting |
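The decomposition can be estimated by Monte Carlo: repeatedly resample a training set, refit the model, and measure how the fitted predictions vary across trials. A NumPy sketch, where the target \(\sin(2\pi x)\), the noise level, and the polynomial degrees are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(2 * np.pi * x)

def bias_variance(degree, n_trials=200, n_train=30, sigma=0.3):
    """Monte Carlo estimate of bias^2 and variance for polynomial fits:
    resample the training set, refit, and compare fits across trials."""
    x_test = np.linspace(0.0, 1.0, 50)
    preds = np.empty((n_trials, x_test.size))
    for t in range(n_trials):
        x = rng.uniform(0.0, 1.0, n_train)
        y = true_f(x) + sigma * rng.normal(size=n_train)
        preds[t] = np.polyval(np.polyfit(x, y, degree), x_test)
    bias_sq = np.mean((preds.mean(axis=0) - true_f(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

b_low, v_low = bias_variance(degree=1)    # rigid model
b_high, v_high = bias_variance(degree=9)  # flexible model
# Expected pattern from the table: b_low > b_high, v_high > v_low.
```

A straight line cannot track a full sine period (high bias, low variance), while a degree-9 polynomial tracks it closely but swings wildly between resampled training sets (low bias, high variance).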
5. Structural Risk Minimization (SRM)
A learning principle proposed by Vapnik that balances empirical risk and model complexity:
\[h^* = \arg\min_{h \in \mathcal{H}} \left[\hat{R}(h) + \Omega(\mathcal{H})\right]\]
where \(\Omega(\mathcal{H})\) is the complexity penalty term.
Examples:
| Method | Complexity Control \(\Omega\) |
|---|---|
| L2 regularization (Ridge) | \(\lambda\|w\|^2\) |
| L1 regularization (LASSO) | \(\lambda\|w\|_1\) |
| SVM | Maximizing margin ↔ \(\|w\|^2\) |
| Decision tree pruning | Number of leaf nodes |
| Neural networks | Dropout, Weight Decay, Early Stopping |
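Ridge regression from the first table row makes SRM concrete: its closed-form solution shows \(\lambda\) directly trading empirical fit against \(\|w\|^2\). A small NumPy sketch (the synthetic data and \(\lambda\) values are illustrative):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge regression: w = (X^T X + lam*I)^{-1} X^T y.
    The term lam * ||w||^2 plays the role of the SRM penalty Omega."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=50)

w_weak = ridge_fit(X, y, lam=0.01)     # weak penalty: close to least squares
w_strong = ridge_fit(X, y, lam=100.0)  # strong penalty: coefficients shrink
```

Increasing \(\lambda\) monotonically shrinks \(\|w\|\): the effective hypothesis class contracts, lowering variance at the cost of extra bias.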
6. Rademacher Complexity
A more refined complexity measure than VC dimension that accounts for the data distribution.
Definition (empirical Rademacher complexity on a sample \(S = \{x_1, \ldots, x_m\}\)):
\[\hat{\mathfrak{R}}_S(\mathcal{H}) = \mathbb{E}_{\sigma}\left[\sup_{h \in \mathcal{H}} \frac{1}{m}\sum_{i=1}^{m} \sigma_i h(x_i)\right]\]
where \(\sigma_i \in \{+1, -1\}\) are i.i.d. uniform random variables (Rademacher variables).
Intuition: Rademacher complexity measures the ability of a hypothesis class to fit random noise. A hypothesis class that can fit pure noise has high complexity.
Generalization bound: with probability at least \(1 - \delta\), for every \(h \in \mathcal{H}\):
\[R(h) \leq \hat{R}(h) + 2\mathfrak{R}_m(\mathcal{H}) + \sqrt{\frac{\ln(1/\delta)}{2m}}\]
where \(\mathfrak{R}_m(\mathcal{H}) = \mathbb{E}_{S}[\hat{\mathfrak{R}}_S(\mathcal{H})]\).
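The noise-fitting intuition can be checked numerically: draw random sign vectors \(\sigma\), and for each draw take the hypothesis best correlated with it. A Monte Carlo sketch for \(\pm 1\)-valued threshold classifiers (function and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_rademacher(hyp_outputs, n_draws=2000):
    """Monte Carlo estimate of E_sigma[ sup_h (1/m) sum_i sigma_i h(x_i) ].
    hyp_outputs: (n_hypotheses, m) array of {-1, +1} predictions."""
    n_h, m = hyp_outputs.shape
    sigmas = rng.choice([-1.0, 1.0], size=(n_draws, m))
    # For each sigma draw, the sup picks the best-correlated hypothesis.
    return float(np.mean(np.max(hyp_outputs @ sigmas.T, axis=0)) / m)

# Threshold classifiers h_t(x) = sign(x - t) on m sample points have only
# m + 1 distinct behaviors, so they cannot match most noise patterns.
m = 20
x = np.sort(rng.uniform(0.0, 1.0, m))
thresholds = np.concatenate([[-1.0], (x[:-1] + x[1:]) / 2, [2.0]])
H = np.sign(x[None, :] - thresholds[:, None])
rad = empirical_rademacher(H)
# A class containing every {-1,+1} labeling would always match sigma
# exactly, giving complexity 1; thresholds come out well below that.
```

The estimate lands far below 1, matching the intuition that a low-complexity class cannot fit pure noise.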
7. No Free Lunch Theorem (NFL)
7.1 Theorem Statement
Wolpert & Macready (1997):
Averaged over all possible problems, no learning algorithm outperforms any other. For every algorithm that performs well on some problems, there necessarily exist other problems on which it performs poorly.
7.2 Formalization
For any two learning algorithms \(A_1, A_2\), averaging over a uniform distribution on target functions \(f\):
\[\sum_{f} \mathbb{E}\left[\text{error}(A_1) \mid f\right] = \sum_{f} \mathbb{E}\left[\text{error}(A_2) \mid f\right]\]
That is, the off-training-set errors of the two algorithms are equal on average.
7.3 Practical Implications
- No universal algorithm exists: algorithms must be chosen based on problem characteristics
- Prior knowledge matters: inductive bias determines which problems an algorithm excels at
- Does not preclude comparison: on specific problems/distributions, algorithms can be ranked
| Inductive Bias | Assumption | Corresponding Method |
|---|---|---|
| Smoothness | Similar inputs → similar outputs | KNN, kernel methods |
| Sparsity | Only a few features matter | LASSO, sparse coding |
| Hierarchical structure | Data has hierarchical features | Deep learning |
| Temporal stationarity | Statistical properties are time-invariant | Time series models |
8. Summary
Core questions of statistical learning theory:
Q: Why does learning succeed?
A: If the hypothesis class is not too complex (finite VC dimension)
and the sample size is large enough,
then empirical risk converges to true risk.
Q: How much data is needed?
A: m = O(d_VC / ε²)
Q: How to choose model complexity?
A: Bias-variance trade-off / Structural risk minimization
Q: What is the best algorithm?
A: No free lunch -- it depends on the problem.
References
- "Foundations of Machine Learning" - Mohri, Rostamizadeh, Talwalkar
- "Understanding Machine Learning" - Shalev-Shwartz & Ben-David
- "The Nature of Statistical Learning Theory" - Vapnik
- "An Introduction to Computational Learning Theory" - Kearns & Vazirani