Kernel Methods
Kernel methods are a family of algorithms that implicitly map data into high-dimensional feature spaces via the kernel trick. They represent one of the most theoretically elegant branches of classical machine learning. Support Vector Machines (SVMs) are the most successful representative of kernel methods, while Gaussian Processes (GPs) beautifully combine kernel methods with Bayesian inference.
Learning path: Linear inseparability → Feature mapping → Kernel trick → SVM dual problem → Kernel function theory → Gaussian Processes → Connections to neural networks
Tribe-level perspective: kernel methods are at the core of the Analogizers tribe in Domingos's five-tribe taxonomy — a kernel is an implicit similarity measure. For a unified tribe-level view of k-NN / SVM / metric learning / recommenders, see The Master Algorithm — Analogizers and Nearest Neighbors & Kernel Methods.
Overview of Kernel Methods
Why Do We Need Kernel Methods?
In many real-world problems, data is linearly inseparable in the original input space. An intuitive approach is to map the data into a higher-dimensional space and find a linear decision boundary there:

\[\phi: \mathbb{R}^d \to \mathbb{R}^D, \qquad \mathbf{x} \mapsto \phi(\mathbf{x}), \qquad D \gg d\]
However, explicitly computing the high-dimensional mapping \(\phi(\mathbf{x})\) can be prohibitively expensive (\(D\) can even be infinite). The key insight of the kernel trick is that many algorithms depend only on inner products \(\langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle\) between data points, rather than on the mapping itself. If there exists a function \(k\) satisfying:

\[k(\mathbf{x}_i, \mathbf{x}_j) = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle\]
then we can compute the kernel function directly in the original space without ever explicitly constructing \(\phi\). This is the kernel trick.
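To make this concrete, here is a minimal numpy sketch (not from the original text; numpy is assumed available) verifying that the homogeneous degree-2 polynomial kernel \(k(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^T \mathbf{z})^2\) on \(\mathbb{R}^2\) equals the inner product under the explicit map \(\phi(\mathbf{x}) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)\):

```python
# Illustrative demo: k(x, z) = (x^T z)^2 equals the inner product under
# the explicit degree-2 feature map in R^2.
import numpy as np

def phi(x):
    """Explicit feature map phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)

lhs = phi(x) @ phi(z)        # inner product computed in feature space
rhs = (x @ z) ** 2           # kernel computed directly in the input space
print(np.isclose(lhs, rhs))  # True: identical values, no explicit phi needed
```

The same identity is what lets the SVM dual below operate in very high-dimensional (even infinite-dimensional) feature spaces at the cost of a single input-space computation.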
SVM in Depth
Linear SVM: Maximum Margin Classifier
Given linearly separable training data \(\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}\) (\(y_i \in \{-1, +1\}\)), SVM seeks the hyperplane that maximizes the geometric margin:

\[\max_{\mathbf{w}, b}\ \min_{i} \frac{y_i(\mathbf{w}^T \mathbf{x}_i + b)}{\|\mathbf{w}\|}\]
Equivalently, this is reformulated as a minimization problem (the primal problem):

\[\min_{\mathbf{w}, b}\ \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{s.t.}\quad y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq 1, \quad i = 1, \dots, N\]
The Dual Problem
Introducing Lagrange multipliers \(\alpha_i \geq 0\), we construct the Lagrangian:

\[L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{N} \alpha_i \left[ y_i(\mathbf{w}^T \mathbf{x}_i + b) - 1 \right]\]
Taking partial derivatives with respect to \(\mathbf{w}\) and \(b\) and setting them to zero:

\[\mathbf{w} = \sum_{i=1}^{N} \alpha_i y_i \mathbf{x}_i, \qquad \sum_{i=1}^{N} \alpha_i y_i = 0\]
Substituting back yields the dual problem:

\[\max_{\boldsymbol{\alpha}}\ \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j\, \mathbf{x}_i^T \mathbf{x}_j \quad \text{s.t.}\quad \alpha_i \geq 0, \quad \sum_{i=1}^{N} \alpha_i y_i = 0\]
Notice that the dual problem involves only inner products \(\mathbf{x}_i^T \mathbf{x}_j\) between data points -- this is precisely where the kernel trick enters.
KKT Conditions
The KKT (Karush-Kuhn-Tucker) conditions are the optimality conditions linking the primal and dual problems:
- Stationarity: the zero-gradient conditions above, \(\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i\) and \(\sum_i \alpha_i y_i = 0\)
- Primal feasibility: \(y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq 1\)
- Dual feasibility: \(\alpha_i \geq 0\)
- Complementary slackness: \(\alpha_i [y_i(\mathbf{w}^T \mathbf{x}_i + b) - 1] = 0\)
The complementary slackness condition implies: either \(\alpha_i = 0\) (the point is not a support vector) or \(y_i(\mathbf{w}^T \mathbf{x}_i + b) = 1\) (the point lies exactly on the margin boundary and is a support vector).
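As a quick numerical check of complementary slackness, the following sketch (scikit-learn and numpy assumed; the synthetic dataset and the large-\(C\) hard-margin approximation are illustrative choices, not from the original text) verifies that support vectors satisfy \(y_i(\mathbf{w}^T \mathbf{x}_i + b) \approx 1\):

```python
# With a very large C approximating the hard margin, every support vector
# should satisfy y_i (w^T x_i + b) ~ 1, as complementary slackness predicts.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=40, centers=2, cluster_std=0.6, random_state=0)
y = 2 * y - 1                                 # relabel classes to {-1, +1}

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin
w, b = clf.coef_.ravel(), clf.intercept_[0]

margins = y[clf.support_] * (X[clf.support_] @ w + b)
print(margins)  # all ~ 1: support vectors sit on the margin boundary
```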
Soft-Margin SVM
For data that is not linearly separable, we introduce slack variables \(\xi_i \geq 0\) and a penalty parameter \(C > 0\):

\[\min_{\mathbf{w}, b, \boldsymbol{\xi}}\ \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{N} \xi_i \quad \text{s.t.}\quad y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0\]
The dual problem keeps exactly the same form as before; the only change is that the constraints become box constraints \(0 \leq \alpha_i \leq C\), with the upper bound on each multiplier controlled by the penalty parameter \(C\).
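A small illustrative sketch (scikit-learn assumed; the dataset parameters are arbitrary) of how \(C\) trades margin width against violations: smaller \(C\) tolerates more margin violations, so more points end up as support vectors, while larger \(C\) approaches hard-margin behavior.

```python
# Count support vectors as the penalty parameter C varies.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, flip_y=0.1, random_state=0)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: {clf.n_support_.sum()} support vectors")
```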
The Kernel Trick
Definition of Kernel Functions
A function \(k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}\) is called a kernel function if there exists a feature map \(\phi: \mathcal{X} \to \mathcal{H}\) such that:

\[k(\mathbf{x}, \mathbf{x}') = \langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle_{\mathcal{H}}\]
The kernelized SVM dual problem simply replaces inner products with kernel evaluations:

\[\max_{\boldsymbol{\alpha}}\ \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j\, k(\mathbf{x}_i, \mathbf{x}_j)\]
Mercer's Theorem
Mercer's theorem provides a necessary and sufficient condition for a function to be a valid kernel: \(k(\mathbf{x}, \mathbf{x}')\) is a valid kernel if and only if for any finite dataset \(\{\mathbf{x}_1, \dots, \mathbf{x}_N\}\), the kernel (Gram) matrix \(K\) with entries \(K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)\) is positive semi-definite:

\[\mathbf{c}^T K \mathbf{c} = \sum_{i=1}^{N} \sum_{j=1}^{N} c_i c_j\, k(\mathbf{x}_i, \mathbf{x}_j) \geq 0 \quad \text{for all } \mathbf{c} \in \mathbb{R}^N\]
This means all eigenvalues of the kernel matrix are non-negative.
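This is easy to probe empirically. The sketch below (numpy assumed; the \(\sin\) "kernel" is just a hypothetical counterexample chosen for illustration) checks the eigenvalues of a valid and an invalid kernel matrix:

```python
# Eigenvalues of an RBF kernel matrix are non-negative up to floating-point
# error, while sin(x^T z) -- a made-up similarity that is not a valid
# kernel -- typically produces clearly negative eigenvalues.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)

K_rbf = np.exp(-sq_dists / (2 * 1.0**2))  # RBF kernel matrix, sigma = 1
K_sin = np.sin(X @ X.T)                   # violates the Mercer condition

print(np.linalg.eigvalsh(K_rbf).min())    # ~0 or tiny negative: PSD
print(np.linalg.eigvalsh(K_sin).min())    # typically clearly negative
```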
Common Kernel Functions
| Kernel | Expression | Feature space dimension | Characteristics |
|---|---|---|---|
| Linear | \(k(\mathbf{x}, \mathbf{x}') = \mathbf{x}^T \mathbf{x}'\) | \(d\) | Equivalent to linear SVM |
| Polynomial | \(k(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^T \mathbf{x}' + c)^p\) | \(\binom{d+p}{p}\) | Captures feature interactions |
| RBF (Gaussian) | \(k(\mathbf{x}, \mathbf{x}') = \exp\left(-\frac{\|\mathbf{x}-\mathbf{x}'\|^2}{2\sigma^2}\right)\) | \(\infty\) | Most commonly used, local similarity |
| Sigmoid | \(k(\mathbf{x}, \mathbf{x}') = \tanh(\kappa \mathbf{x}^T \mathbf{x}' + c)\) | -- | Resembles neural networks; not PSD for all \(\kappa, c\) |
Intuition for the RBF kernel: \(\sigma\) controls the "radius of influence." Small \(\sigma\) means each support vector only affects nearby regions (prone to overfitting); large \(\sigma\) results in smoother decision boundaries (may underfit).
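The following sketch (scikit-learn assumed, whose RBF parameterization uses \(\gamma = 1/(2\sigma^2)\), so the grid below scans \(\sigma\) from large to small; dataset and grid values are illustrative) makes the trade-off visible via cross-validation:

```python
# Very large gamma (small sigma) overfits; very small gamma (large sigma)
# underfits; intermediate bandwidths do best on held-out folds.
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.25, random_state=0)

for gamma in [0.01, 1.0, 100.0]:
    acc = cross_val_score(SVC(kernel="rbf", gamma=gamma, C=1.0), X, y, cv=5).mean()
    print(f"gamma={gamma:>6}: 5-fold CV accuracy = {acc:.3f}")
```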
RKHS (Reproducing Kernel Hilbert Space)
Intuitive Explanation
The Reproducing Kernel Hilbert Space (RKHS) is the theoretical foundation of kernel methods. The intuition is as follows:
- Each kernel function \(k\) uniquely corresponds to a function space \(\mathcal{H}_k\) (the RKHS)
- In this space, the "inner product" of functions can be computed via the kernel function
- Reproducing property: \(f(\mathbf{x}) = \langle f, k(\mathbf{x}, \cdot) \rangle_{\mathcal{H}_k}\), meaning we can evaluate a function by "probing" with the kernel
The Representer Theorem: In an RKHS, the solution to regularized empirical risk minimization, \(\min_{f \in \mathcal{H}_k} \sum_{i=1}^{N} \ell(f(\mathbf{x}_i), y_i) + \lambda \|f\|_{\mathcal{H}_k}^2\), is always a linear combination of kernel evaluations at the training points:

\[f^*(\mathbf{x}) = \sum_{i=1}^{N} \alpha_i\, k(\mathbf{x}_i, \mathbf{x})\]
This means that no matter how high-dimensional (or even infinite-dimensional) the RKHS is, the optimal solution can still be represented by \(N\) coefficients.
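A from-scratch kernel ridge regression sketch (numpy assumed; the sine target and the value of \(\lambda\) are illustrative choices) shows this in action: despite the infinite-dimensional RBF feature space, the fit is fully described by \(N\) coefficients.

```python
# Kernel ridge regression: solve (K + lambda*I) alpha = y, then evaluate
# the fitted function as f*(x) = sum_i alpha_i k(x_i, x).
import numpy as np

def rbf(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=30)

lam = 1e-2
alpha = np.linalg.solve(rbf(X, X) + lam * np.eye(30), y)  # the N coefficients

X_new = np.linspace(-3, 3, 5).reshape(-1, 1)
f_new = rbf(X_new, X) @ alpha    # f*(x) = sum_i alpha_i k(x_i, x)
print(f_new)                     # approximately sin(X_new)
```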
Gaussian Processes
Definition
A Gaussian Process (GP) is a probability distribution over functions. A GP is fully specified by a mean function \(m(\mathbf{x})\) and a kernel (covariance) function \(k(\mathbf{x}, \mathbf{x}')\):

\[f(\mathbf{x}) \sim \mathcal{GP}\big(m(\mathbf{x}),\, k(\mathbf{x}, \mathbf{x}')\big)\]
For any finite set of input points \(\mathbf{x}_1, \dots, \mathbf{x}_N\), the function values \(\mathbf{f} = [f(\mathbf{x}_1), \dots, f(\mathbf{x}_N)]^T\) follow a multivariate Gaussian distribution.
GP Regression
Given training data \((\mathbf{X}, \mathbf{y})\) (where \(y_i = f(\mathbf{x}_i) + \epsilon\), \(\epsilon \sim \mathcal{N}(0, \sigma_n^2)\)), the predictive distribution for a new input \(\mathbf{x}_*\) is:

\[f_* \mid \mathbf{X}, \mathbf{y}, \mathbf{x}_* \sim \mathcal{N}\big(\bar{f}_*,\, \text{Var}(f_*)\big)\]
where:

\[\bar{f}_* = \mathbf{k}_*^T (K + \sigma_n^2 I)^{-1} \mathbf{y}, \qquad \text{Var}(f_*) = k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^T (K + \sigma_n^2 I)^{-1} \mathbf{k}_*\]
Here \(\mathbf{k}_*\) is the kernel vector between the test point and training points, and \(K\) is the kernel matrix of training points.
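These closed-form equations translate directly into a few lines of numpy. This sketch (zero mean function, RBF kernel, and noise level \(\sigma_n = 0.1\) are illustrative assumptions, not prescribed by the text) computes the predictive mean and variance:

```python
# GP regression posterior: mean = k_*^T (K + s^2 I)^{-1} y,
# var = k(x*, x*) - k_*^T (K + s^2 I)^{-1} k_*.
import numpy as np

def rbf(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(20, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=20)
sigma_n = 0.1

K = rbf(X, X) + sigma_n**2 * np.eye(len(X))   # K + sigma_n^2 I
X_star = np.linspace(-4, 4, 7).reshape(-1, 1)
k_star = rbf(X_star, X)                       # test-vs-train kernel vectors

mean = k_star @ np.linalg.solve(K, y)         # predictive mean
var = rbf(X_star, X_star).diagonal() - \
      np.einsum("ij,ji->i", k_star, np.linalg.solve(K, k_star.T))
print(mean)  # point predictions
print(var)   # small inside the data range, larger at the extrapolating ends
```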
Kernel Selection and Hyperparameter Optimization
The choice of kernel function in a GP directly determines the model's prior assumptions:
| Kernel | Prior assumption | Use case |
|---|---|---|
| Squared Exponential (SE) | Infinitely differentiable smooth functions | Default choice |
| Matérn kernel | Controllable smoothness | Physical process modeling |
| Periodic kernel | Periodic functions | Seasonal data |
| Kernel combinations (additive/multiplicative) | Composite structures | Complex patterns |
Hyperparameters \(\boldsymbol{\theta}\) are optimized by maximizing the (log) marginal likelihood:

\[\log p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\theta}) = -\frac{1}{2} \mathbf{y}^T K_y^{-1} \mathbf{y} - \frac{1}{2} \log |K_y| - \frac{N}{2} \log 2\pi, \qquad K_y = K_{\boldsymbol{\theta}} + \sigma_n^2 I\]
This automatically balances data fit (first term) and model complexity (second term), exhibiting a natural Occam's razor effect.
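A numpy sketch of this quantity, with the data-fit and complexity terms spelled out (the Cholesky-based evaluation is one standard numerically stable route, not necessarily how any particular library does it; the synthetic \(K\) and \(\mathbf{y}\) are illustrative):

```python
# Log marginal likelihood of a GP: data-fit term, complexity penalty,
# and normalization constant, evaluated via a Cholesky factorization.
import numpy as np

def log_marginal_likelihood(K, y, sigma_n):
    N = len(y)
    K_y = K + sigma_n**2 * np.eye(N)
    L = np.linalg.cholesky(K_y)                       # K_y = L L^T
    a = np.linalg.solve(L.T, np.linalg.solve(L, y))   # a = K_y^{-1} y
    return (-0.5 * y @ a                      # data-fit term
            - np.log(np.diag(L)).sum()        # 0.5 * log|K_y|: complexity
            - 0.5 * N * np.log(2 * np.pi))    # normalization constant

rng = np.random.default_rng(0)
X = rng.normal(size=(15, 1))
K = np.exp(-((X - X.T) ** 2) / 2)             # RBF Gram matrix on 1-D inputs
y = rng.multivariate_normal(np.zeros(15), K + 0.01 * np.eye(15))
print(log_marginal_likelihood(K, y, sigma_n=0.1))
```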
Uncertainty Quantification
One of the greatest advantages of GPs is built-in uncertainty quantification:
- The predictive mean \(\bar{f}_*\) provides point predictions
- The predictive variance \(\text{Var}(f_*)\) provides uncertainty estimates
- In regions where training data is dense, variance is small (high confidence)
- In regions where training data is sparse, variance is large (low confidence)
This makes GPs particularly well-suited for tasks that require uncertainty assessment, such as Active Learning and Bayesian Optimization.
Kernel Methods and Neural Networks
Neural Tangent Kernel (NTK)
Jacot et al. (2018) established a profound theoretical connection: in the infinite-width limit, the training dynamics of neural networks are equivalent to kernel methods.
For a neural network \(f(\mathbf{x}; \theta)\) with parameters \(\theta\), the Neural Tangent Kernel is defined as:

\[\Theta(\mathbf{x}, \mathbf{x}') = \big\langle \nabla_{\theta} f(\mathbf{x}; \theta),\, \nabla_{\theta} f(\mathbf{x}'; \theta) \big\rangle\]
As the network width approaches infinity, the NTK remains constant throughout training, and training the network is equivalent to performing kernel regression in the RKHS corresponding to the NTK.
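For intuition, the empirical (finite-width) NTK of a toy one-hidden-layer ReLU network \(f(\mathbf{x}) = \mathbf{v}^T \mathrm{relu}(W\mathbf{x})/\sqrt{m}\) can be assembled directly from the parameter gradients; the sketch below (numpy assumed; the width \(m\), data, and initialization scale are arbitrary choices) computes the Gram matrix \(\Theta(\mathbf{x}_i, \mathbf{x}_j)\):

```python
# Empirical NTK for f(x) = v^T relu(W x) / sqrt(m). The gradients are
# df/dv_j = relu(w_j.x)/sqrt(m) and df/dW_jk = v_j * 1[w_j.x > 0] * x_k / sqrt(m),
# so the Gram matrix splits into a v-part and a W-part. As the width m grows,
# this matrix concentrates around a fixed (width-independent) kernel.
import numpy as np

def empirical_ntk(X, m=4096, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(m, X.shape[1]))     # hidden weights at init
    v = rng.normal(size=m)                   # output weights at init
    pre = X @ W.T                            # (N, m) pre-activations
    act = np.maximum(pre, 0.0) / np.sqrt(m)  # gradients w.r.t. v
    gate = (pre > 0) * v / np.sqrt(m)        # v_j * relu'(w_j.x) / sqrt(m)
    # W-gradient inner products factor as (gate_i . gate_j) * (x_i . x_j)
    return act @ act.T + (gate @ gate.T) * (X @ X.T)

X = np.random.default_rng(1).normal(size=(5, 3))
print(empirical_ntk(X))   # 5x5 NTK Gram matrix Theta(x_i, x_j)
```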
Modern Insights from Kernel Methods
| Topic | Key insight |
|---|---|
| NTK theory | Infinite-width networks \(\approx\) kernel methods, providing a theoretical window into deep learning |
| Kernel approximation | Methods like Random Fourier Features enable kernel methods to scale to large datasets (see the sketch after this table) |
| Attention mechanism | Softmax attention can be viewed as a form of kernel function |
| GP and BNN | The prior of an infinite-width single-hidden-layer network is equivalent to a GP (Neal, 1996) |
| Limitations of kernels | Lack of feature learning capability (kernels are fixed) -- this is precisely where deep learning excels |
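To make the kernel-approximation row concrete, here is a sketch of Random Fourier Features (Rahimi & Recht, 2007): sampling frequencies from the RBF kernel's spectral density yields an explicit \(D\)-dimensional random map \(z\) with \(z(\mathbf{x})^T z(\mathbf{x}') \approx k(\mathbf{x}, \mathbf{x}')\), so a linear model on \(z(\mathbf{x})\) can stand in for an exact kernel method at scale (numpy assumed; \(D\) and \(\sigma\) below are illustrative):

```python
# Random Fourier Features: w ~ N(0, I/sigma^2) and a uniform phase b give
# z(x) = sqrt(2/D) * cos(W x + b) with E[z(x)^T z(x')] equal to the RBF kernel.
import numpy as np

def rff(X, D=2000, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / sigma, size=(X.shape[1], D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))

Z = rff(X)                                            # explicit random map
K_approx = Z @ Z.T
K_exact = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1) / 2.0)
print(np.abs(K_approx - K_exact).max())               # small approximation error
```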
Kernel methods provide an elegant mathematical framework for understanding machine learning. Although deep learning now dominates in practice, the theoretical tools of kernel methods -- RKHS, Mercer's theorem, the representer theorem -- remain essential foundations for understanding and analyzing modern deep learning models.