Statistical Learning Theory

Introduction

Statistical learning theory provides rigorous mathematical foundations for machine learning, answering fundamental questions such as "why does learning succeed" and "how much data is needed." This article covers the PAC learning framework, VC dimension, generalization bounds, bias-variance decomposition, and other core concepts.


1. PAC Learning Framework

1.1 Definition

PAC (Probably Approximately Correct) learning was proposed by Valiant (1984):

Goal: can a learner find a hypothesis that is approximately correct with high probability?

Formal definition: a hypothesis class \(\mathcal{H}\) is PAC learnable if there exists an algorithm \(A\) such that for any:

  • Error parameter \(\epsilon > 0\)
  • Confidence parameter \(\delta > 0\)
  • Data distribution \(\mathcal{D}\)

as long as the sample size satisfies \(m \geq m_0(\epsilon, \delta)\), algorithm \(A\) outputs a hypothesis \(h\) satisfying:

\[ P\left[R(h) - R(h^*) \leq \epsilon\right] \geq 1 - \delta \]

where \(R(h) = \mathbb{E}_{(x,y)\sim\mathcal{D}}[\ell(h(x), y)]\) is the true risk and \(h^* = \arg\min_{h' \in \mathcal{H}} R(h')\) is the best hypothesis in the class.

1.2 PAC Bound for Finite Hypothesis Classes

For a finite hypothesis class \(|\mathcal{H}| < \infty\) in the realizable setting (some hypothesis in \(\mathcal{H}\) attains zero error), a consistent learner achieves error at most \(\epsilon\) with probability at least \(1-\delta\) once the sample size satisfies:

\[ m \geq \frac{1}{\epsilon}\left(\ln|\mathcal{H}| + \ln\frac{1}{\delta}\right) \]

Interpretation:

  • The larger the hypothesis space, the more samples are needed
  • The higher the required accuracy (smaller \(\epsilon\)), the more samples are needed
  • The higher the required confidence (smaller \(\delta\)), the more samples are needed
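A minimal Python sketch (the function name `pac_sample_size` is our own) evaluates this bound and shows how the requirement scales with \(|\mathcal{H}|\):

```python
import math

def pac_sample_size(h_size: int, epsilon: float, delta: float) -> int:
    """Realizable-case PAC sample bound for a finite hypothesis class:
    m >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)

# Larger |H|, smaller epsilon, or smaller delta all increase the requirement.
for h_size in (10, 1_000, 1_000_000):
    print(h_size, pac_sample_size(h_size, epsilon=0.05, delta=0.01))
```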

2. VC Dimension

2.1 Shattering

Given a set of data points \(S = \{x_1, \ldots, x_m\}\), if hypothesis class \(\mathcal{H}\) can realize all \(2^m\) possible binary label combinations on \(S\), then \(\mathcal{H}\) shatters \(S\).

2.2 Growth Function

The growth function measures the maximum classification capacity of a hypothesis class on \(m\) points:

\[ \Pi_{\mathcal{H}}(m) = \max_{S:|S|=m} \left|\{(h(x_1), \ldots, h(x_m)) : h \in \mathcal{H}\}\right| \]

Clearly \(\Pi_{\mathcal{H}}(m) \leq 2^m\).

2.3 VC Dimension Definition

The VC dimension (Vapnik-Chervonenkis dimension) \(d_{VC}\) of hypothesis class \(\mathcal{H}\) is the largest set size that can be shattered by \(\mathcal{H}\):

\[ d_{VC}(\mathcal{H}) = \max\{m : \Pi_{\mathcal{H}}(m) = 2^m\} \]
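The definition can be verified by brute force for small classes. The sketch below (plain Python; the helper names are our own) checks that one-sided threshold functions \(h_t(x) = \mathbb{1}[x \geq t]\) on \(\mathbb{R}\) shatter any single point but no pair of points, so their VC dimension is 1 (compare the table in Section 2.4):

```python
def threshold_labelings(points):
    """All labelings of `points` realizable by h_t(x) = 1[x >= t]."""
    xs = sorted(points)
    # Candidate thresholds: below all points, between consecutive points, above all.
    thresholds = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]
    return {tuple(int(x >= t) for x in points) for t in thresholds}

def is_shattered(points):
    return len(threshold_labelings(points)) == 2 ** len(points)

print(is_shattered([0.5]))        # True: a single point can be shattered
print(is_shattered([0.2, 0.8]))   # False: the labeling (1, 0) is unreachable
```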

2.4 Classic Examples

| Hypothesis Class | VC Dimension | Explanation |
|---|---|---|
| Threshold functions on \(\mathbb{R}\) | 1 | Can shatter 1 point, not 2 |
| Linear classifiers on \(\mathbb{R}^2\) | 3 | Can shatter 3 non-collinear points |
| Linear classifiers on \(\mathbb{R}^d\) | \(d + 1\) | Can shatter \(d + 1\) points in general position, not \(d + 2\) |
| Union of \(k\) intervals on \(\mathbb{R}\) | \(2k\) | Can shatter \(2k\) suitably placed points |
| Neural network with \(W\) weights | \(O(W \log W)\) | Approximate result |

2.5 Sauer's Lemma

\[ \Pi_{\mathcal{H}}(m) \leq \sum_{i=0}^{d_{VC}} \binom{m}{i} \leq \left(\frac{em}{d_{VC}}\right)^{d_{VC}} \]

When \(m > d_{VC}\), the growth function transitions from exponential to polynomial growth.
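A short numeric check (plain Python; `growth_bound` is our own helper) makes the transition visible: the bound equals \(2^m\) up to \(m = d_{VC}\) and grows only polynomially afterwards.

```python
from math import comb

def growth_bound(m: int, d_vc: int) -> int:
    """Sauer's lemma upper bound on the growth function: sum_{i <= d_vc} C(m, i)."""
    return sum(comb(m, i) for i in range(min(d_vc, m) + 1))

d_vc = 3
for m in (1, 2, 3, 4, 5, 10, 20):
    # The bound matches 2^m while m <= d_vc, then falls behind it.
    print(m, growth_bound(m, d_vc), 2 ** m)
```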


3. Generalization Bounds

3.1 VC Generalization Bound

For a hypothesis class with VC dimension \(d\), with probability at least \(1-\delta\):

\[ R(h) \leq \hat{R}(h) + \sqrt{\frac{d(\ln(2m/d) + 1) + \ln(4/\delta)}{m}} \]

where \(\hat{R}(h) = \frac{1}{m}\sum_{i=1}^{m}\ell(h(x_i), y_i)\) is the empirical risk.

Interpretation:

\[ \text{Generalization error} \leq \text{Training error} + \text{Complexity penalty} \]
  • Complexity penalty \(\propto \sqrt{d/m}\)
  • When \(m \gg d\), training error approximates generalization error

3.2 Required Sample Size

For the generalization bound to be meaningful, approximately:

\[ m = O\left(\frac{d}{\epsilon^2}\right) \]

samples are needed (ignoring logarithmic factors), where \(d = d_{VC}\) and \(\epsilon\) is the acceptable generalization gap.
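To make both statements concrete, here is a small Python sketch (the helper names `vc_penalty` and `samples_for_gap` are our own) that evaluates the complexity term from Section 3.1 and searches for a sample size that pushes it below a target gap:

```python
import math

def vc_penalty(d: int, m: int, delta: float) -> float:
    """Complexity term of the VC generalization bound given above."""
    return math.sqrt((d * (math.log(2 * m / d) + 1) + math.log(4 / delta)) / m)

def samples_for_gap(d: int, epsilon: float, delta: float = 0.05) -> int:
    """Doubling search for a sample size whose penalty drops below epsilon."""
    m = d
    while vc_penalty(d, m, delta) > epsilon:
        m *= 2
    return m

print(round(vc_penalty(d=10, m=10_000, delta=0.05), 3))
print(samples_for_gap(d=10, epsilon=0.1))   # grows like d / epsilon^2, up to log factors
```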


4. Bias-Variance Decomposition

For regression problems, the prediction error can be decomposed into three parts:

\[ \mathbb{E}\left[(y - \hat{f}(x))^2\right] = \underbrace{\left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\left[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\right]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Noise}} \]

| Component | Source | Meaning |
|---|---|---|
| Bias (\(\text{Bias}^2\)) | Model assumptions | Model is not flexible enough; systematically deviates from the true values |
| Variance | Data randomness | Model is too sensitive to changes in the training data |
| Noise (\(\sigma^2\)) | Data itself | Irreducible random error |

Bias-Variance Trade-off

Error
  │  \                    /
  │   \  Bias²          / Variance
  │    \              /
  │     \_____●_____/    ← Optimal complexity
  │      Minimum total error
  │
  └───────────────────→ Model complexity
     Underfitting  Moderate  Overfitting

| Model Complexity | Bias | Variance | Behavior |
|---|---|---|---|
| Low (e.g., linear regression) | High | Low | Underfitting |
| Moderate | Medium | Medium | Optimal |
| High (e.g., deep networks) | Low | High | Overfitting |
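The decomposition can be reproduced numerically. The sketch below (NumPy; the target function, noise level, and helper name `bias_variance_at` are our own illustrative choices) refits polynomials of several degrees on freshly drawn training sets and estimates bias and variance at a single test point:

```python
import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: np.sin(np.pi * x)       # illustrative target function
sigma, n_train, n_trials = 0.3, 40, 2000
x_test = 0.5

def bias_variance_at(degree):
    """Monte Carlo estimate of bias^2 and variance of a degree-`degree`
    polynomial fit, evaluated at the single point x_test."""
    preds = np.empty(n_trials)
    for t in range(n_trials):
        x = rng.uniform(-1, 1, n_train)
        y = true_f(x) + rng.normal(0, sigma, n_train)
        coeffs = np.polyfit(x, y, degree)              # least-squares polynomial fit
        preds[t] = np.polyval(coeffs, x_test)
    bias_sq = (preds.mean() - true_f(x_test)) ** 2
    return bias_sq, preds.var()

# Low degree: high bias, low variance. High degree: the reverse.
for degree in (1, 3, 9):
    b2, var = bias_variance_at(degree)
    print(f"degree={degree}  bias^2={b2:.4f}  variance={var:.4f}  noise={sigma**2:.2f}")
```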

5. Structural Risk Minimization (SRM)

A learning principle proposed by Vapnik that balances empirical risk and model complexity:

\[ \hat{h} = \arg\min_{h \in \mathcal{H}} \left[\hat{R}(h) + \Omega(h)\right] \]

where \(\Omega(h)\) is a complexity penalty term. In Vapnik's original formulation, \(\mathcal{H}\) is organized into nested subclasses \(\mathcal{H}_1 \subset \mathcal{H}_2 \subset \cdots\), and the penalty for \(h\) depends on the complexity (e.g., VC dimension) of the smallest subclass containing it.

Examples:

| Method | Complexity Control \(\Omega\) |
|---|---|
| L2 regularization (Ridge) | \(\lambda\|w\|^2\) |
| L1 regularization (LASSO) | \(\lambda\|w\|_1\) |
| SVM | Maximizing the margin ↔ \(\|w\|^2\) |
| Decision tree pruning | Number of leaf nodes |
| Neural networks | Dropout, weight decay, early stopping |
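As a concrete instance of the penalized objective, the sketch below (NumPy; the synthetic data-generating setup is purely illustrative) fits ridge regression, i.e. \(\hat{R}(h) + \lambda\|w\|^2\) in closed form, and compares training and held-out error across penalty strengths:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 20
w_true = np.zeros(d)
w_true[:3] = (2.0, -1.0, 0.5)                      # only a few informative features
X = rng.normal(size=(50, d))
y = X @ w_true + rng.normal(0, 0.5, size=50)
X_val = rng.normal(size=(500, d))
y_val = X_val @ w_true + rng.normal(0, 0.5, size=500)

def ridge(X, y, lam):
    """Closed-form minimizer of ||Xw - y||^2 + lam * ||w||^2."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Structural risk minimization in miniature: trade empirical risk against the penalty.
for lam in (0.0, 0.1, 1.0, 10.0, 100.0):
    w = ridge(X, y, lam)
    print(f"lambda={lam:6.1f}  train MSE={np.mean((X @ w - y) ** 2):.3f}  "
          f"val MSE={np.mean((X_val @ w - y_val) ** 2):.3f}")
```

Typically the unpenalized fit has the lowest training error but not the lowest held-out error; a moderate \(\lambda\) gives up a little empirical risk for better generalization, which is exactly the trade-off SRM formalizes.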

6. Rademacher Complexity

A more refined complexity measure than VC dimension that accounts for the data distribution.

Definition:

\[ \hat{\mathfrak{R}}_m(\mathcal{H}) = \mathbb{E}_{\sigma}\left[\sup_{h \in \mathcal{H}} \frac{1}{m}\sum_{i=1}^{m}\sigma_i h(x_i)\right] \]

where \(\sigma_i \in \{+1, -1\}\) are independent, uniformly distributed random signs (Rademacher variables).

Intuition: Rademacher complexity measures the ability of a hypothesis class to fit random noise. A hypothesis class that can fit pure noise has high complexity.

Generalization bound:

\[ R(h) \leq \hat{R}(h) + 2\mathfrak{R}_m(\mathcal{H}) + \sqrt{\frac{\ln(1/\delta)}{2m}} \]
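The definition can be estimated directly by Monte Carlo. The sketch below (NumPy; `empirical_rademacher` and the finite grid of threshold classifiers are our own illustrative choices) compares a simple class with a much richer one on the same sample; the richer class correlates better with random signs and therefore has higher empirical Rademacher complexity:

```python
import numpy as np

rng = np.random.default_rng(2)
m = 100
x = np.sort(rng.uniform(0, 1, m))

# A finite hypothesis class: one-sided thresholds h_t(x) = sign(x - t) on a grid of t.
thresholds = np.linspace(0.0, 1.0, 21)
H_thresh = np.where(x[None, :] >= thresholds[:, None], 1.0, -1.0)   # shape (|H|, m)

def empirical_rademacher(H, n_draws=5000):
    """Monte Carlo estimate of E_sigma[ sup_h (1/m) sum_i sigma_i h(x_i) ]."""
    n_h, m = H.shape
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, m))   # Rademacher variables
    correlations = sigma @ H.T / m                       # (n_draws, |H|) correlations
    return correlations.max(axis=1).mean()

# A much richer "class": 4096 arbitrary sign patterns on the same sample.
H_rich = rng.choice([-1.0, 1.0], size=(4096, m))

print(f"thresholds: {empirical_rademacher(H_thresh):.3f}")   # low: cannot fit random signs
print(f"rich class: {empirical_rademacher(H_rich):.3f}")     # noticeably higher
```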

7. No Free Lunch Theorem (NFL)

7.1 Theorem Statement

Wolpert & Macready (1997):

Averaged over all possible problems, no learning algorithm outperforms any other. For every algorithm that performs well on some problems, there necessarily exist other problems on which it performs poorly.

7.2 Formalization

For any two learning algorithms \(A_1\) and \(A_2\), averaging the off-training-set risk \(R_f\) uniformly over all possible target functions \(f\) gives:

\[ \sum_{f} R_{f}(A_1) = \sum_{f} R_{f}(A_2) \]
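The averaging claim can be checked by brute force on a tiny domain. The sketch below (plain Python; the two toy learners are our own illustrative choices) enumerates all 16 binary target functions on a 4-point domain, trains on two fixed points, and confirms that both learners have the same average off-training-set error:

```python
from itertools import product

train_inputs, test_inputs = [0, 1], [2, 3]        # a 4-point input domain

def algo_majority(train):
    """Predict the majority training label everywhere (ties go to 1)."""
    ones = sum(label for _, label in train)
    prediction = 1 if 2 * ones >= len(train) else 0
    return lambda x: prediction

def algo_always_zero(train):
    """Ignore the data and always predict 0."""
    return lambda x: 0

def avg_off_training_error(algo):
    """Average test error over every possible binary target on the 4-point domain."""
    targets = list(product([0, 1], repeat=4))     # all 2^4 = 16 target functions
    total = 0.0
    for f in targets:
        h = algo([(x, f[x]) for x in train_inputs])
        total += sum(h(x) != f[x] for x in test_inputs) / len(test_inputs)
    return total / len(targets)

print(avg_off_training_error(algo_majority))      # 0.5
print(avg_off_training_error(algo_always_zero))   # 0.5
```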

7.3 Practical Implications

  • No universal algorithm exists: algorithms must be chosen based on problem characteristics
  • Prior knowledge matters: inductive bias determines which problems an algorithm excels at
  • Does not preclude comparison: on specific problems/distributions, algorithms can be ranked

| Inductive Bias | Assumption | Corresponding Method |
|---|---|---|
| Smoothness | Similar inputs → similar outputs | KNN, kernel methods |
| Sparsity | Only a few features matter | LASSO, sparse coding |
| Hierarchical structure | Data has hierarchical features | Deep learning |
| Temporal stationarity | Statistical properties are time-invariant | Time series models |

8. Summary

Core questions of statistical learning theory:

Q: Why does learning succeed?
A: If the hypothesis class is not too complex (finite VC dimension)
   and the sample size is large enough,
   then empirical risk converges to true risk.

Q: How much data is needed?
A: m = O(d_VC / ε²)

Q: How to choose model complexity?
A: Bias-variance trade-off / Structural risk minimization

Q: What is the best algorithm?
A: No free lunch -- it depends on the problem.

References

  • "Foundations of Machine Learning" - Mohri, Rostamizadeh, Talwalkar
  • "Understanding Machine Learning" - Shalev-Shwartz & Ben-David
  • "The Nature of Statistical Learning Theory" - Vapnik
  • "An Introduction to Computational Learning Theory" - Kearns & Vazirani
