Kernel Methods
Kernel methods are a family of algorithms that implicitly map data into high-dimensional feature spaces via the kernel trick. They represent one of the most theoretically elegant branches of classical machine learning. Support Vector Machines (SVMs) are the most successful representative of kernel methods, while Gaussian Processes (GPs) beautifully combine kernel methods with Bayesian inference.
Learning path: Linear inseparability → Feature mapping → Kernel trick → SVM dual problem → Kernel function theory → Gaussian Processes → Connections to neural networks
Tribe-level perspective: kernel methods are at the core of the Analogizers tribe in Domingos's five-tribe taxonomy — a kernel is an implicit similarity measure. For a unified tribe-level view of k-NN / SVM / metric learning / recommenders, see The Master Algorithm — Analogizers and Nearest Neighbors & Kernel Methods.
Overview of Kernel Methods
Why Do We Need Kernel Methods?
In many real-world problems, data is linearly inseparable in the original input space. An intuitive approach is to map the data into a higher-dimensional space and find a linear decision boundary there:

\[\phi: \mathbb{R}^d \to \mathbb{R}^D, \qquad \mathbf{x} \mapsto \phi(\mathbf{x}), \qquad D \gg d\]
However, explicitly computing the high-dimensional mapping \(\phi(\mathbf{x})\) can be prohibitively expensive (\(D\) can even be infinite). The key insight of the kernel trick is that many algorithms depend only on inner products \(\langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle\) between data points, rather than on the mapping itself. If there exists a function \(k\) satisfying:

\[k(\mathbf{x}_i, \mathbf{x}_j) = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle\]
then we can compute the kernel function directly in the original space without ever explicitly constructing \(\phi\). This is the kernel trick.
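To make this concrete, here is a minimal numpy sketch (not from the original text; numpy is assumed available) verifying that the homogeneous degree-2 polynomial kernel \(k(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^T \mathbf{z})^2\) on \(\mathbb{R}^2\) equals the inner product under the explicit map \(\phi(\mathbf{x}) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)\):

```python
# Illustrative demo: k(x, z) = (x^T z)^2 equals the inner product under
# the explicit degree-2 feature map in R^2.
import numpy as np

def phi(x):
    """Explicit feature map phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)

lhs = phi(x) @ phi(z)        # inner product computed in feature space
rhs = (x @ z) ** 2           # kernel computed directly in the input space
print(np.isclose(lhs, rhs))  # True: identical values, no explicit phi needed
```

The same identity is what lets the SVM dual below operate in very high-dimensional (even infinite-dimensional) feature spaces at the cost of a single input-space computation.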
SVM in Depth
Linear SVM: Maximum Margin Classifier
Given linearly separable training data \(\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}\) (\(y_i \in \{-1, +1\}\)), SVM seeks the hyperplane that maximizes the geometric margin:

\[\max_{\mathbf{w}, b}\ \min_{i} \frac{y_i(\mathbf{w}^T \mathbf{x}_i + b)}{\|\mathbf{w}\|}\]
Equivalently, this is reformulated as a minimization problem (the primal problem):

\[\min_{\mathbf{w}, b}\ \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{s.t.}\quad y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq 1, \quad i = 1, \dots, N\]
The Dual Problem
Introducing Lagrange multipliers \(\alpha_i \geq 0\), we construct the Lagrangian:

\[L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{N} \alpha_i \left[ y_i(\mathbf{w}^T \mathbf{x}_i + b) - 1 \right]\]
Taking partial derivatives with respect to \(\mathbf{w}\) and \(b\) and setting them to zero:

\[\mathbf{w} = \sum_{i=1}^{N} \alpha_i y_i \mathbf{x}_i, \qquad \sum_{i=1}^{N} \alpha_i y_i = 0\]
Substituting back yields the dual problem:

\[\max_{\boldsymbol{\alpha}}\ \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j\, \mathbf{x}_i^T \mathbf{x}_j \quad \text{s.t.}\quad \alpha_i \geq 0, \quad \sum_{i=1}^{N} \alpha_i y_i = 0\]
Notice that the dual problem involves only inner products \(\mathbf{x}_i^T \mathbf{x}_j\) between data points -- this is precisely where the kernel trick enters.
KKT Conditions
The KKT (Karush-Kuhn-Tucker) conditions are the optimality conditions linking the primal and dual problems:
- Stationarity: the zero-gradient conditions above, \(\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i\) and \(\sum_i \alpha_i y_i = 0\)
- Primal feasibility: \(y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq 1\)
- Dual feasibility: \(\alpha_i \geq 0\)
- Complementary slackness: \(\alpha_i [y_i(\mathbf{w}^T \mathbf{x}_i + b) - 1] = 0\)
The complementary slackness condition implies: either \(\alpha_i = 0\) (the point is not a support vector) or \(y_i(\mathbf{w}^T \mathbf{x}_i + b) = 1\) (the point lies exactly on the margin boundary and is a support vector).
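As a quick numerical check of complementary slackness, the following sketch (scikit-learn and numpy assumed; the synthetic dataset and the large-\(C\) hard-margin approximation are illustrative choices, not from the original text) verifies that support vectors satisfy \(y_i(\mathbf{w}^T \mathbf{x}_i + b) \approx 1\):

```python
# With a very large C approximating the hard margin, every support vector
# should satisfy y_i (w^T x_i + b) ~ 1, as complementary slackness predicts.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=40, centers=2, cluster_std=0.6, random_state=0)
y = 2 * y - 1                                 # relabel classes to {-1, +1}

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin
w, b = clf.coef_.ravel(), clf.intercept_[0]

margins = y[clf.support_] * (X[clf.support_] @ w + b)
print(margins)  # all ~ 1: support vectors sit on the margin boundary
```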
Soft-Margin SVM
For data that is not linearly separable, we introduce slack variables \(\xi_i \geq 0\) and a penalty parameter \(C > 0\):

\[\min_{\mathbf{w}, b, \boldsymbol{\xi}}\ \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{N} \xi_i \quad \text{s.t.}\quad y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0\]
The dual problem keeps exactly the same form as before; the only change is that the constraints become box constraints \(0 \leq \alpha_i \leq C\), with the upper bound on each multiplier controlled by the penalty parameter \(C\).
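A small illustrative sketch (scikit-learn assumed; the dataset parameters are arbitrary) of how \(C\) trades margin width against violations: smaller \(C\) tolerates more margin violations, so more points end up as support vectors, while larger \(C\) approaches hard-margin behavior.

```python
# Count support vectors as the penalty parameter C varies.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, flip_y=0.1, random_state=0)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: {clf.n_support_.sum()} support vectors")
```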
The Kernel Trick
Definition of Kernel Functions
A function \(k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}\) is called a kernel function if there exists a feature map \(\phi: \mathcal{X} \to \mathcal{H}\) such that:

\[k(\mathbf{x}, \mathbf{x}') = \langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle_{\mathcal{H}}\]
The kernelized SVM dual problem simply replaces inner products with kernel evaluations:

\[\max_{\boldsymbol{\alpha}}\ \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j\, k(\mathbf{x}_i, \mathbf{x}_j)\]
Mercer's Theorem
Mercer's theorem provides a necessary and sufficient condition for a function to be a valid kernel: \(k(\mathbf{x}, \mathbf{x}')\) is a valid kernel if and only if for any finite dataset \(\{\mathbf{x}_1, \dots, \mathbf{x}_N\}\), the kernel (Gram) matrix \(K\) with entries \(K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)\) is positive semi-definite:

\[\mathbf{c}^T K \mathbf{c} = \sum_{i=1}^{N} \sum_{j=1}^{N} c_i c_j\, k(\mathbf{x}_i, \mathbf{x}_j) \geq 0 \quad \text{for all } \mathbf{c} \in \mathbb{R}^N\]
This means all eigenvalues of the kernel matrix are non-negative.
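This is easy to probe empirically. The sketch below (numpy assumed; the \(\sin\) "kernel" is just a hypothetical counterexample chosen for illustration) checks the eigenvalues of a valid and an invalid kernel matrix:

```python
# Eigenvalues of an RBF kernel matrix are non-negative up to floating-point
# error, while sin(x^T z) -- a made-up similarity that is not a valid
# kernel -- typically produces clearly negative eigenvalues.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)

K_rbf = np.exp(-sq_dists / (2 * 1.0**2))  # RBF kernel matrix, sigma = 1
K_sin = np.sin(X @ X.T)                   # violates the Mercer condition

print(np.linalg.eigvalsh(K_rbf).min())    # ~0 or tiny negative: PSD
print(np.linalg.eigvalsh(K_sin).min())    # typically clearly negative
```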
Common Kernel Functions
| Kernel | Expression | Feature space dimension | Characteristics |
|---|---|---|---|
| Linear | \(k(\mathbf{x}, \mathbf{x}') = \mathbf{x}^T \mathbf{x}'\) | \(d\) | Equivalent to linear SVM |
| Polynomial | \(k(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^T \mathbf{x}' + c)^p\) | \(\binom{d+p}{p}\) | Captures feature interactions |
| RBF (Gaussian) | \(k(\mathbf{x}, \mathbf{x}') = \exp\left(-\frac{\|\mathbf{x}-\mathbf{x}'\|^2}{2\sigma^2}\right)\) | \(\infty\) | Most commonly used, local similarity |
| Sigmoid | \(k(\mathbf{x}, \mathbf{x}') = \tanh(\kappa \mathbf{x}^T \mathbf{x}' + c)\) | -- | Resembles neural networks; not PSD for all \(\kappa, c\) |
Intuition for the RBF kernel: \(\sigma\) controls the "radius of influence." Small \(\sigma\) means each support vector only affects nearby regions (prone to overfitting); large \(\sigma\) results in smoother decision boundaries (may underfit).
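The following sketch (scikit-learn assumed, whose RBF parameterization uses \(\gamma = 1/(2\sigma^2)\), so the grid below scans \(\sigma\) from large to small; dataset and grid values are illustrative) makes the trade-off visible via cross-validation:

```python
# Very large gamma (small sigma) overfits; very small gamma (large sigma)
# underfits; intermediate bandwidths do best on held-out folds.
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.25, random_state=0)

for gamma in [0.01, 1.0, 100.0]:
    acc = cross_val_score(SVC(kernel="rbf", gamma=gamma, C=1.0), X, y, cv=5).mean()
    print(f"gamma={gamma:>6}: 5-fold CV accuracy = {acc:.3f}")
```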
RKHS (Reproducing Kernel Hilbert Space)
Intuitive Explanation
The Reproducing Kernel Hilbert Space (RKHS) is the theoretical foundation of kernel methods. The intuition is as follows:
- Each kernel function \(k\) uniquely corresponds to a function space \(\mathcal{H}_k\) (the RKHS)
- In this space, the "inner product" of functions can be computed via the kernel function
- Reproducing property: \(f(\mathbf{x}) = \langle f, k(\mathbf{x}, \cdot) \rangle_{\mathcal{H}_k}\), meaning we can evaluate a function by "probing" with the kernel
The Representer Theorem: In an RKHS, the solution to regularized empirical risk minimization, \(\min_{f \in \mathcal{H}_k} \sum_{i=1}^{N} \ell(f(\mathbf{x}_i), y_i) + \lambda \|f\|_{\mathcal{H}_k}^2\), is always a linear combination of kernel evaluations at the training points:

\[f^*(\mathbf{x}) = \sum_{i=1}^{N} \alpha_i\, k(\mathbf{x}_i, \mathbf{x})\]
This means that no matter how high-dimensional (or even infinite-dimensional) the RKHS is, the optimal solution can still be represented by \(N\) coefficients.
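A from-scratch kernel ridge regression sketch (numpy assumed; the sine target and the value of \(\lambda\) are illustrative choices) shows this in action: despite the infinite-dimensional RBF feature space, the fit is fully described by \(N\) coefficients.

```python
# Kernel ridge regression: solve (K + lambda*I) alpha = y, then evaluate
# the fitted function as f*(x) = sum_i alpha_i k(x_i, x).
import numpy as np

def rbf(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=30)

lam = 1e-2
alpha = np.linalg.solve(rbf(X, X) + lam * np.eye(30), y)  # the N coefficients

X_new = np.linspace(-3, 3, 5).reshape(-1, 1)
f_new = rbf(X_new, X) @ alpha    # f*(x) = sum_i alpha_i k(x_i, x)
print(f_new)                     # approximately sin(X_new)
```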
Gaussian Processes
Definition
A Gaussian Process (GP) is a probability distribution over functions. A GP is fully specified by a mean function \(m(\mathbf{x})\) and a kernel (covariance) function \(k(\mathbf{x}, \mathbf{x}')\):

\[f(\mathbf{x}) \sim \mathcal{GP}\big(m(\mathbf{x}),\, k(\mathbf{x}, \mathbf{x}')\big)\]
For any finite set of input points \(\mathbf{x}_1, \dots, \mathbf{x}_N\), the function values \(\mathbf{f} = [f(\mathbf{x}_1), \dots, f(\mathbf{x}_N)]^T\) follow a multivariate Gaussian distribution.
GP Regression
Given training data \((\mathbf{X}, \mathbf{y})\) (where \(y_i = f(\mathbf{x}_i) + \epsilon\), \(\epsilon \sim \mathcal{N}(0, \sigma_n^2)\)), the predictive distribution for a new input \(\mathbf{x}_*\) is:

\[f_* \mid \mathbf{X}, \mathbf{y}, \mathbf{x}_* \sim \mathcal{N}\big(\bar{f}_*,\, \text{Var}(f_*)\big)\]
where:

\[\bar{f}_* = \mathbf{k}_*^T (K + \sigma_n^2 I)^{-1} \mathbf{y}, \qquad \text{Var}(f_*) = k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^T (K + \sigma_n^2 I)^{-1} \mathbf{k}_*\]
Here \(\mathbf{k}_*\) is the kernel vector between the test point and training points, and \(K\) is the kernel matrix of training points.
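These closed-form equations translate directly into a few lines of numpy. This sketch (zero mean function, RBF kernel, and noise level \(\sigma_n = 0.1\) are illustrative assumptions, not prescribed by the text) computes the predictive mean and variance:

```python
# GP regression posterior: mean = k_*^T (K + s^2 I)^{-1} y,
# var = k(x*, x*) - k_*^T (K + s^2 I)^{-1} k_*.
import numpy as np

def rbf(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(20, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=20)
sigma_n = 0.1

K = rbf(X, X) + sigma_n**2 * np.eye(len(X))   # K + sigma_n^2 I
X_star = np.linspace(-4, 4, 7).reshape(-1, 1)
k_star = rbf(X_star, X)                       # test-vs-train kernel vectors

mean = k_star @ np.linalg.solve(K, y)         # predictive mean
var = rbf(X_star, X_star).diagonal() - \
      np.einsum("ij,ji->i", k_star, np.linalg.solve(K, k_star.T))
print(mean)  # point predictions
print(var)   # small inside the data range, larger at the extrapolating ends
```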
Kernel Selection and Hyperparameter Optimization
The choice of kernel function in a GP directly determines the model's prior assumptions:
| Kernel | Prior assumption | Use case |
|---|---|---|
| Squared Exponential (SE) | Infinitely differentiable smooth functions | Default choice |
| Matérn kernel | Controllable smoothness | Physical process modeling |
| Periodic kernel | Periodic functions | Seasonal data |
| Kernel combinations (additive/multiplicative) | Composite structures | Complex patterns |
Hyperparameters \(\boldsymbol{\theta}\) are optimized by maximizing the (log) marginal likelihood:

\[\log p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\theta}) = -\frac{1}{2} \mathbf{y}^T K_y^{-1} \mathbf{y} - \frac{1}{2} \log |K_y| - \frac{N}{2} \log 2\pi, \qquad K_y = K_{\boldsymbol{\theta}} + \sigma_n^2 I\]
This automatically balances data fit (first term) and model complexity (second term), exhibiting a natural Occam's razor effect.
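A numpy sketch of this quantity, with the data-fit and complexity terms spelled out (the Cholesky-based evaluation is one standard numerically stable route, not necessarily how any particular library does it; the synthetic \(K\) and \(\mathbf{y}\) are illustrative):

```python
# Log marginal likelihood of a GP: data-fit term, complexity penalty,
# and normalization constant, evaluated via a Cholesky factorization.
import numpy as np

def log_marginal_likelihood(K, y, sigma_n):
    N = len(y)
    K_y = K + sigma_n**2 * np.eye(N)
    L = np.linalg.cholesky(K_y)                       # K_y = L L^T
    a = np.linalg.solve(L.T, np.linalg.solve(L, y))   # a = K_y^{-1} y
    return (-0.5 * y @ a                      # data-fit term
            - np.log(np.diag(L)).sum()        # 0.5 * log|K_y|: complexity
            - 0.5 * N * np.log(2 * np.pi))    # normalization constant

rng = np.random.default_rng(0)
X = rng.normal(size=(15, 1))
K = np.exp(-((X - X.T) ** 2) / 2)             # RBF Gram matrix on 1-D inputs
y = rng.multivariate_normal(np.zeros(15), K + 0.01 * np.eye(15))
print(log_marginal_likelihood(K, y, sigma_n=0.1))
```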
Uncertainty Quantification
One of the greatest advantages of GPs is built-in uncertainty quantification:
- The predictive mean \(\bar{f}_*\) provides point predictions
- The predictive variance \(\text{Var}(f_*)\) provides uncertainty estimates
- In regions where training data is dense, variance is small (high confidence)
- In regions where training data is sparse, variance is large (low confidence)
This makes GPs particularly well-suited for tasks that require uncertainty assessment, such as Active Learning and Bayesian Optimization.
Kernel Methods and Neural Networks
Neural Tangent Kernel (NTK)
Jacot et al. (2018) established a profound theoretical connection: in the infinite-width limit, the training dynamics of neural networks are equivalent to kernel methods.
For a neural network \(f(\mathbf{x}; \theta)\) with parameters \(\theta\), the Neural Tangent Kernel is defined as:

\[\Theta(\mathbf{x}, \mathbf{x}') = \big\langle \nabla_{\theta} f(\mathbf{x}; \theta),\, \nabla_{\theta} f(\mathbf{x}'; \theta) \big\rangle\]
As the network width approaches infinity, the NTK remains constant throughout training, and training the network is equivalent to performing kernel regression in the RKHS corresponding to the NTK.
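For intuition, the empirical (finite-width) NTK of a toy one-hidden-layer ReLU network \(f(\mathbf{x}) = \mathbf{v}^T \mathrm{relu}(W\mathbf{x})/\sqrt{m}\) can be assembled directly from the parameter gradients; the sketch below (numpy assumed; the width \(m\), data, and initialization scale are arbitrary choices) computes the Gram matrix \(\Theta(\mathbf{x}_i, \mathbf{x}_j)\):

```python
# Empirical NTK for f(x) = v^T relu(W x) / sqrt(m). The gradients are
# df/dv_j = relu(w_j.x)/sqrt(m) and df/dW_jk = v_j * 1[w_j.x > 0] * x_k / sqrt(m),
# so the Gram matrix splits into a v-part and a W-part. As the width m grows,
# this matrix concentrates around a fixed (width-independent) kernel.
import numpy as np

def empirical_ntk(X, m=4096, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(m, X.shape[1]))     # hidden weights at init
    v = rng.normal(size=m)                   # output weights at init
    pre = X @ W.T                            # (N, m) pre-activations
    act = np.maximum(pre, 0.0) / np.sqrt(m)  # gradients w.r.t. v
    gate = (pre > 0) * v / np.sqrt(m)        # v_j * relu'(w_j.x) / sqrt(m)
    # W-gradient inner products factor as (gate_i . gate_j) * (x_i . x_j)
    return act @ act.T + (gate @ gate.T) * (X @ X.T)

X = np.random.default_rng(1).normal(size=(5, 3))
print(empirical_ntk(X))   # 5x5 NTK Gram matrix Theta(x_i, x_j)
```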
Modern Insights from Kernel Methods
| Topic | Key insight |
|---|---|
| NTK theory | Infinite-width networks \(\approx\) kernel methods, providing a theoretical window into deep learning |
| Kernel approximation | Methods like Random Fourier Features enable kernel methods to scale to large datasets (see the sketch after this table) |
| Attention mechanism | Softmax attention can be viewed as a form of kernel function |
| GP and BNN | The prior of an infinite-width single-hidden-layer network is equivalent to a GP (Neal, 1996) |
| Limitations of kernels | Lack of feature learning capability (kernels are fixed) -- this is precisely where deep learning excels |
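To make the kernel-approximation row concrete, here is a sketch of Random Fourier Features (Rahimi & Recht, 2007): sampling frequencies from the RBF kernel's spectral density yields an explicit \(D\)-dimensional random map \(z\) with \(z(\mathbf{x})^T z(\mathbf{x}') \approx k(\mathbf{x}, \mathbf{x}')\), so a linear model on \(z(\mathbf{x})\) can stand in for an exact kernel method at scale (numpy assumed; \(D\) and \(\sigma\) below are illustrative):

```python
# Random Fourier Features: w ~ N(0, I/sigma^2) and a uniform phase b give
# z(x) = sqrt(2/D) * cos(W x + b) with E[z(x)^T z(x')] equal to the RBF kernel.
import numpy as np

def rff(X, D=2000, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / sigma, size=(X.shape[1], D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))

Z = rff(X)                                            # explicit random map
K_approx = Z @ Z.T
K_exact = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1) / 2.0)
print(np.abs(K_approx - K_exact).max())               # small approximation error
```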
Kernel methods provide an elegant mathematical framework for understanding machine learning. Although deep learning now dominates in practice, the theoretical tools of kernel methods -- RKHS, Mercer's theorem, the representer theorem -- remain essential foundations for understanding and analyzing modern deep learning models.