
Analogizers

The tribe in one sentence: similar inputs should have similar outputs. The essence of learning is finding a good metric for "similar".

The Analogizers are one of the five tribes of machine learning identified by Pedro Domingos in The Master Algorithm. The tribe's core creed is plain yet profound: to predict a new sample, look for past samples that "look like it" and reuse their answers. From the venerable k-Nearest Neighbors (k-NN, 1950s), through the Support Vector Machines (SVMs) that ruled theoretical ML in the 1990s, to the CLIP / RAG ecosystem that swept industry after 2020, the Analogizers have woven the "metric + retrieval" paradigm through every decade of ML history.

The Analogizers' fundamental disagreement with the other tribes is this: they do not explicitly model the data-generating process (against the Bayesians), do not extract general rules (against the Symbolists), do not rely on differentiable parameterization (against the early Connectionists), and do not rely on population search (against the Evolutionaries). They reduce "learning" to "similarity": give me a good kernel \(K(x, x')\) and I can predict.


Tribe Profile

Dimension Analogizers
Ontology The world is made of concrete instances; universal laws are secondary
Epistemology Knowledge = case base + similarity metric; reasoning = retrieval + analogical extrapolation
Master Algorithm Support Vector Machines / Kernel Methods
Core question How to measure similarity?
Evaluator Margin, contrastive loss, Recall@K
Optimizer Quadratic programming (QP) / SMO / stochastic gradient descent (SGD)
Representative methods k-NN, SVM, kernel PCA, kernel ridge regression, LMNN, FaceNet, SimCLR, CLIP
Modern branches Metric learning, contrastive learning, ANN retrieval, RAG
Achilles' heel Curse of dimensionality, similarity itself requires a prior, retrieval cost
Key figures Vladimir Vapnik, Bernhard Schölkopf, Agnar Aamodt (CBR), Yann LeCun (Siamese), Geoffrey Hinton (t-SNE)

Master Equation

Almost every method of the Analogizers can be written in the same form — a weighted sum of similarities to training samples:

\[ \hat{y}(x) = f\!\left(\sum_{i=1}^{N} \alpha_i \, K(x, x_i)\right) \]

where:

  • \(K(x, x_i)\) is the similarity kernel: how alike the new sample \(x\) is to training sample \(x_i\);
  • \(\alpha_i\) is each training sample's contribution weight to the prediction (a 0/1 indicator in k-NN, a dual variable in SVM, a user/item preference in CF);
  • \(f\) is the output function (sign / softmax for classification, identity for regression).

The unifying power of this framework is striking:

Method \(K(x, x_i)\) \(\alpha_i\)
k-NN \(\mathbb{1}[x_i \in \mathrm{kNN}(x)]\) \(y_i\)
Nadaraya-Watson regression RBF kernel \(\exp(-\|x-x_i\|^2/2\sigma^2)\) \(y_i / \sum_j K_j\)
SVM any Mercer kernel Lagrange multipliers \(\alpha_i y_i\) (only support vectors are non-zero)
Kernel ridge regression any Mercer kernel \((K + \lambda I)^{-1} y\)
Collaborative filtering user similarity \(\mathrm{sim}(u, u_i)\) \(r_{u_i, j}\) (other users' ratings on item \(j\))
RAG \(\cos(\mathbf{e}_q, \mathbf{e}_{d_i})\) documents retrieved into the LLM context

Unified perspective: from k-NN to RAG, sixty years of methodological evolution have just been answering two questions — how to define \(K\) (how to measure similarity?) and how to define \(\alpha\) (which cases are worth borrowing from?).
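To make the master form concrete, here is a minimal NumPy sketch (the helper names `rbf_kernel` and `analogizer_predict` and the toy data are illustrative assumptions, not from the text): set \(\alpha_i = y_i\) and \(f = \mathrm{sign}\) and you get a crude kernel classifier.

import numpy as np

def rbf_kernel(x, X, sigma=1.0):
    # K(x, x_i) = exp(-||x - x_i||^2 / (2 sigma^2)) against every training point x_i
    return np.exp(-np.sum((X - x) ** 2, axis=1) / (2 * sigma ** 2))

def analogizer_predict(x, X_train, alpha, f=np.sign, sigma=1.0):
    # master form: y_hat(x) = f( sum_i alpha_i * K(x, x_i) )
    return f(np.dot(alpha, rbf_kernel(x, X_train, sigma)))

rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(+2, 1, (20, 2))])
y_train = np.array([-1] * 20 + [+1] * 20)          # alpha_i = y_i
print(analogizer_predict(np.array([2.5, 1.8]), X_train, alpha=y_train))   # -> 1.0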


Lineage

flowchart TD
    A["k-NN (Cover & Hart 1967)"] --> B["Kernel methods / Parzen window"]
    B --> C["SVM (Boser & Vapnik 1992)"]
    C --> D["Kernel PCA / Kernel ridge regression"]
    A --> E["Case-Based Reasoning CBR (Aamodt 1994)"]
    A --> F["Collaborative Filtering CF (GroupLens 1994)"]
    F --> G["Matrix Factorization FunkSVD (Netflix 2006)"]
    G --> H["BPR / NCF / Two-tower models"]
    A --> I["Metric learning LMNN/ITML (2007-2009)"]
    I --> J["Siamese / Triplet (FaceNet 2015)"]
    J --> K["Contrastive learning SimCLR / MoCo (2020)"]
    K --> L["Cross-modal CLIP (2021)"]
    L --> M["RAG / RETRO / Atlas (2022+)"]
    E --> M
    H --> N["ANN retrieval FAISS / HNSW"]
    N --> M

    style C fill:#fde68a
    style L fill:#bbf7d0
    style M fill:#bbf7d0

Three main streams:

  1. Kernel methods stream: k-NN → Parzen window → SVM → kernel PCA / GP, unified in the RKHS framework.
  2. Retrieval stream: k-NN → CBR → CF → matrix factorization → ANN → RAG, focused on "how to find similar cases efficiently".
  3. Metric learning stream: LMNN → Siamese → Triplet → InfoNCE → CLIP, focused on "how to make a neural network learn a good similarity".

After 2020 the three streams were unified: deep models learn embeddings (metric learning), ANN indexes store embeddings (retrieval), and LLMs reason with retrieved results (RAG).


Modern Renaissance

The Analogizers have suffered two declines:

  • First decline (after backprop was rediscovered in 1986): the Connectionists rose, and k-NN was dismissed as "primitive".
  • Second decline (after AlexNet in 2012): deep learning dominated vision and NLP, and SVMs all but vanished from mainstream papers.

Yet the Analogizers never died, and after 2020 they made a dramatic comeback, for three reasons:

  1. Contrastive learning became the self-supervised mainstream: SimCLR / MoCo / DINO train with contrastive losses, which are essentially metric learning on large unlabeled corpora. "Learning a good embedding" became the central goal of foundation models.
  2. CLIP broke the modality wall: the joint embedding learned from image-text contrastive training enabled zero-shot classification and semantic search as new paradigms. CLIP is the most successful Analogizer system to date.
  3. RAG replaced pure parametric memory: parameterizing knowledge inside LLM weights proved limited (hallucination, staleness, no provenance). RAG plugs an external knowledge base + vector retrieval back into the LLM — this is Aamodt's 1994 CBR reincarnated for the GPT era.

From obscure k-NN to the RAG era: a 2010 grad student would say "doing k-NN? too naive". A 2025 grad student says "our LLM agent uses HNSW to retrieve 1B documents for in-context learning" — same thing.


Division of Labour with Existing Pages

The three sub-pages in this directory expand on the three pillars of the Analogizers:

  • Nearest Neighbors and Kernel Methods: k-NN, distance metrics, the kernel trick, full SVM derivation, the Representer theorem, the GP perspective.
  • Metric Learning and Contrastive Learning: LMNN/ITML, Siamese/Triplet, InfoNCE, SimCLR/MoCo/DINO, CLIP/SigLIP.
  • Recommender Systems and Modern Retrieval: CF/MF, deep recommenders, CBR, ANN indexes, RAG engineering practice.



Sub-page Navigation

Sub-page Core question Keywords
Nearest Neighbors and Kernel Methods How to define "similar"? How to turn similarity into a predictor? k-NN, KD-Tree, Mercer condition, SVM dual, KKT, RKHS, Representer, kernel PCA, GP
Metric Learning and Contrastive Learning How to learn good similarity? LMNN, Siamese, Triplet, Contrastive, InfoNCE, SimCLR, MoCo, DINO, CLIP
Recommender Systems and Modern Retrieval How to retrieve similar cases efficiently? CF, FunkSVD, BPR, NCF, two-tower, CBR, FAISS, HNSW, RAG

References

  1. Domingos, P. (2015). The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World. Basic Books. — The original source of the five-tribes framework; Chapter 7 is devoted to the Analogizers.
  2. Cover, T. M., & Hart, P. E. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), 21–27. — The foundational k-NN paper, proving that the 1-NN error is at most twice the Bayes error.
  3. Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer. — VC dimension and the SVM theoretical framework.
  4. Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. COLT. — The pioneering paper of SVMs and the kernel trick.
  5. Schölkopf, B., & Smola, A. J. (2002). Learning with Kernels. MIT Press. — The standard textbook on kernel methods, the Representer theorem, and RKHS.
  6. Aamodt, A., & Plaza, E. (1994). Case-based reasoning: Foundational issues, methodological variations, and system approaches. AI Communications, 7(1), 39–59. — The foundational paper on the CBR 4R cycle.
  7. Radford, A., et al. (2021). Learning transferable visual models from natural language supervision. ICML. — CLIP, the landmark of the Analogizers' modern renaissance.
  8. Lewis, P., et al. (2020). Retrieval-Augmented Generation for knowledge-intensive NLP tasks. NeurIPS. — The RAG paper, plugging retrieval back into parametric models.

Nearest Neighbors and Kernel Methods

Scope of this page: lay out the "skeleton" of the Analogizers in full — from the most naive k-NN to the unifying framework of kernel methods / SVMs. The emphasis is on three threads: mathematical derivation (SVM dual, KKT, the Representer theorem), geometric intuition (the kernel trick = implicit lifting to higher dimensions), and engineering acceleration (KD-Tree / LSH).

The oldest method of the Analogizers is k-Nearest Neighbors (k-NN): to predict a new sample, find its \(k\) nearest neighbors and let the neighbors' labels vote. Proposed in the 1950s and given an error bound by Cover & Hart in 1967, this method remains the basic paradigm behind virtually every vector retrieval system today.

But k-NN has two fatal problems:

  1. What does "nearest" mean? — the distance metric directly determines the result, and naive Euclidean distance breaks down in high dimensions.
  2. How do we combine similarities into a "decision boundary"? — k-NN's boundary is patched together pointwise, with poor theoretical properties.

Kernel methods and SVMs answered both questions in the 1990s, elevating "similarity learning" to a paradigm with a complete mathematical foundation.


1. The k-NN Framework in Full

1.1 Algorithm Definition

Given a training set \(\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N\), a distance metric \(d\), and a new sample \(x\):

  1. Compute the distances from \(x\) to all training points \(\{d(x, x_i)\}_{i=1}^N\).
  2. Find the \(k\) nearest neighbors \(\mathrm{kNN}(x)\).
  3. Classification: \(\hat y = \mathrm{mode}\{y_i : x_i \in \mathrm{kNN}(x)\}\) (majority vote).
  4. Regression: \(\hat y = \frac{1}{k}\sum_{x_i \in \mathrm{kNN}(x)} y_i\) (average).

1.2 Weighted k-NN

Naive k-NN treats the \(k\) neighbors equally, but clearly "closer neighbors" should contribute more. The weighted version:

\[ \hat y(x) = \frac{\sum_{x_i \in \mathrm{kNN}(x)} w_i \, y_i}{\sum_{x_i \in \mathrm{kNN}(x)} w_i}, \quad w_i = \exp\!\left(-\frac{d(x, x_i)^2}{2\sigma^2}\right) \]

This is Nadaraya-Watson kernel regression. Pushing \(k\) all the way to \(N\) (using all points, but with weights vanishing for distant ones) yields the Parzen window estimator — a non-parametric density estimator. Note: k-NN \(\to\) Parzen window \(\to\) kernel regression is a continuous progression; the three are close mathematical relatives that differ mainly in how the weighting window is chosen.
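A minimal NumPy sketch of the weighted prediction above (the function name `nadaraya_watson` and the toy sine data are illustrative assumptions):

import numpy as np

def nadaraya_watson(x_query, X, y, sigma=0.5):
    # y_hat = sum_i w_i y_i / sum_i w_i with Gaussian weights w_i = exp(-d(x, x_i)^2 / (2 sigma^2))
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * sigma ** 2))
    return np.dot(w, y) / np.sum(w)

rng = np.random.default_rng(1)
X = rng.uniform(0, 2 * np.pi, (200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)
print(nadaraya_watson(np.array([np.pi / 2]), X, y, sigma=0.3))   # close to sin(pi/2) = 1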

1.3 Choosing k

\(k\) controls the bias–variance trade-off:

\(k\) Bias Variance Decision boundary Risk
\(k=1\) very low very high jagged overfitting (sensitive to noise)
moderate \(k\) balanced balanced smooth
\(k=N\) very high very low degenerates to majority class underfitting

In practice, \(k\) is chosen by cross-validation, with \(k = \sqrt{N}\) as a common rule of thumb. \(k\) should be odd (to avoid ties in binary classification).
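A small scikit-learn sketch of choosing \(k\) by cross-validation (the synthetic dataset and the odd-only grid are illustrative assumptions):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
# search odd k only, to avoid ties in binary classification
grid = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": list(range(1, 40, 2))}, cv=5)
grid.fit(X, y)
print("best k:", grid.best_params_["n_neighbors"], " CV accuracy:", round(grid.best_score_, 3))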

1.4 Theoretical Guarantee for 1-NN (Cover-Hart 1967)

Let \(R^*\) be the Bayes-optimal error rate. The asymptotic error rate \(R_{1\text{-NN}}\) of 1-NN as \(N \to \infty\) satisfies:

\[ R^* \leq R_{1\text{-NN}} \leq R^* (2 - R^*) \leq 2 R^* \]

In one sentence: the most naive 1-NN has error at most twice the Bayes optimum — a stunning theoretical guarantee that partly explains why the Analogizers never truly faded.


2. A Lineage of Distance Metrics

The choice of distance metric is the make-or-break factor for k-NN (and for every Analogizer method).

2.1 Common Metrics

Metric Formula Use case
Euclidean \(d(x,y) = \sqrt{\sum_i (x_i - y_i)^2}\) continuous features with comparable scale and meaning
Manhattan (\(\ell_1\)) \(\sum_i |x_i - y_i|\) more robust to outliers than Euclidean
Chebyshev (\(\ell_\infty\)) \(\max_i |x_i - y_i|\) worst-case difference across dimensions
Minkowski (\(\ell_p\)) \(\left(\sum_i |x_i - y_i|^p\right)^{1/p}\) generalizes \(\ell_1\)/\(\ell_2\) via the parameter \(p\)
Mahalanobis \(\sqrt{(x-y)^T \Sigma^{-1} (x-y)}\) features with different scales / correlations
Cosine \(1 - \frac{x^T y}{\|x\|\|y\|}\) text TF-IDF, embedding similarity
Hamming \(\sum_i \mathbb{1}[x_i \neq y_i]\) binary strings, categorical features
Edit distance (Levenshtein) minimal edit count by DP strings, biological sequences
Jaccard \(1 - \frac{|A \cap B|}{|A \cup B|}\) sets, binary features

2.2 The Mahalanobis Metric

Mahalanobis is an "adaptive" generalization of Euclidean distance:

\[ d_M(x, y) = \sqrt{(x-y)^T M (x-y)} \]

where \(M = L^T L\) is positive semi-definite. When \(M = I\) it reduces to Euclidean; when \(M = \Sigma^{-1}\) (the inverse data covariance), it corrects for differing scales and correlations across dimensions. Metric learning is precisely the act of learning this \(M\) from data — see Metric Learning and Contrastive Learning.

2.3 A Geometric Pitfall: Why Cosine Reigns in NLP

In text and embedding work, cosine is preferred over Euclidean because the norm of an embedding usually carries "frequency / confidence" information rather than "semantic" information. After normalizing onto the unit sphere (\(\|x\|=1\)), cosine distance is monotonically related to Euclidean:

\[ \|x - y\|^2 = 2 - 2 \cos\theta \quad \text{(when } \|x\|=\|y\|=1\text{)} \]

So modern deep metric learning (CLIP, SimCLR) almost always L2-normalizes embeddings and uses dot products.


3. The Curse of Dimensionality

Core proposition: in high-dimensional space "nearest neighbor" loses its meaning — distances between all points tend to be equal.

3.1 Volume Concentration

Consider a uniform distribution on the \(d\)-dimensional unit hypercube \([0,1]^d\). To capture \(1\%\) of the volume, the side length per dimension must be:

\[ \ell = (0.01)^{1/d} \]
\(d\) \(\ell\) (side length needed for \(1\%\) volume)
1 0.01
2 0.1
10 0.63
100 0.955
1000 0.9954

At \(d = 100\), capturing \(1\%\) of the "local" neighbors requires covering \(95.5\%\) of every dimension's range — there is no longer anything "local" about it.

3.2 Distance Concentration

For \(N\) uniform samples inside a \(d\)-dimensional unit ball, the ratio between the smallest distance \(D_{\min}\) and the largest distance \(D_{\max}\) satisfies:

\[ \lim_{d \to \infty} \frac{D_{\max} - D_{\min}}{D_{\min}} \to 0 \]

That is, all pairwise distances become essentially equal — the nearest neighbor is no longer "near" and the farthest is no longer "far". Beyer et al. (1999) proved this rigorously, and it is devastating for k-NN.
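The effect is easy to reproduce numerically. A short sketch (uniform points in the unit hypercube rather than the unit ball; the qualitative conclusion is the same):

import numpy as np

rng = np.random.default_rng(0)
for d in [2, 10, 100, 1000]:
    X = rng.uniform(size=(500, d))              # 500 points in the unit hypercube
    q = rng.uniform(size=d)                     # one query point
    dist = np.linalg.norm(X - q, axis=1)
    print(f"d={d:4d}  (D_max - D_min) / D_min = {(dist.max() - dist.min()) / dist.min():.3f}")
# the relative gap shrinks as d grows: "nearest" and "farthest" become indistinguishable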

3.3 Ways out of the Curse

The Analogizers fight the curse along three paths:

  1. Dimensionality reduction / manifold hypothesis: real data lies on a low-dimensional manifold (images, though tens of thousands of dimensions, have low effective dimension). t-SNE / UMAP / autoencoders all fall here.
  2. Learn a metric: metric learning projects data into a "semantically meaningful" low-dimensional embedding space — this is the entire premise of deep metric learning.
  3. The kernel trick: SVM's kernel trick goes the opposite way — lifting to infinite dimensions but accessing them only through similarity inner products, avoiding explicit storage.

4. k-NN Complexity and Acceleration

A naive k-NN query for a single point costs \(O(Nd)\) — unacceptable for large \(N\).

4.1 KD-Tree

The KD-Tree (k-dimensional tree) recursively bisects space using axis-aligned hyperplanes:

  • Construction: \(O(N \log N)\), depth \(O(\log N)\);
  • Query: \(O(\log N)\) in low dimensions (\(d \lesssim 20\)), degrading to \(O(N)\) in high dimensions;
  • Best for: low-dimensional dense data.

4.2 Ball Tree

Drops the axis-aligned constraint and uses nested hyperspheres instead. More robust than the KD-Tree at moderate dimensions (20–100), but with larger constants in construction and query.

4.3 LSH (Locality-Sensitive Hashing)

Hashes "close points" into the same bucket — trading accuracy for massive speedups. An LSH family \(\mathcal{H}\) satisfies:

\[ \Pr_{h \in \mathcal{H}}[h(x) = h(y)] = f(d(x, y)), \quad f \text{ monotonically decreasing} \]

For example, SimHash for cosine distance: pick a random hyperplane \(r\) and set \(h(x) = \mathrm{sign}(r^T x)\). Two vectors collide on that bit with probability \(1 - \theta / \pi\), where \(\theta\) is the angle between them — the smaller the angle, the more likely the collision.
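A minimal SimHash sketch (the helper `simhash` and the 64-bit code length are illustrative assumptions):

import numpy as np

def simhash(X, n_bits=64, seed=0):
    # random-hyperplane LSH: one bit per hyperplane, h_r(x) = sign(r^T x)
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(X.shape[1], n_bits))
    return X @ R > 0                            # boolean codes of shape [N, n_bits]

rng = np.random.default_rng(1)
a = rng.normal(size=128)
b = a + 0.1 * rng.normal(size=128)              # a slightly perturbed copy of a
c = rng.normal(size=128)                        # an unrelated vector
codes = simhash(np.vstack([a, b, c]))
print("Hamming(a, b):", int(np.sum(codes[0] != codes[1])))   # small: similar vectors share most bits
print("Hamming(a, c):", int(np.sum(codes[0] != codes[2])))   # around n_bits / 2 for unrelated vectors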

LSH is the iconic early ANN (approximate nearest neighbor) method, comprehensively surpassed in the 2010s by graph-based HNSW — see Section 8 of Recommender Systems and Modern Retrieval.


5. The Unifying Perspective of Kernel Methods

5.1 A Kernel Is an Implicit Similarity

A kernel \(K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}\) is a symmetric function that we may read as a "similarity". The core insight of the kernel trick is:

\[ K(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{H}} \]

i.e. \(K\) is equivalent to first mapping \(x\) into some (possibly infinite-dimensional) feature space \(\mathcal{H}\) and then taking an inner product. We never need to construct \(\phi\) explicitly, only compute \(K\).

flowchart LR
    A["Input space X<br/>(linearly inseparable)"] -->|"explicit map φ"| B["Feature space H<br/>(linearly separable)"]
    A -->|"K(x,x') = ⟨φ(x), φ(x')⟩<br/>(avoids constructing φ)"| C["Similarity matrix<br/>K ∈ R^{N×N}"]
    C --> D["Dual solution<br/>depends only on K"]
    style C fill:#fde68a
    style D fill:#bbf7d0

5.2 The Mercer Condition

Not every "similarity function" is a valid kernel. Mercer's theorem: \(K\) is a valid kernel \(\iff\) for any \(\{x_1, \dots, x_N\}\), the Gram matrix / kernel matrix

\[ \mathbf{K} = \begin{bmatrix} K(x_i, x_j) \end{bmatrix}_{i,j=1}^N \]

is symmetric positive semi-definite (PSD), i.e. for any \(\alpha \in \mathbb{R}^N\):

\[ \alpha^T \mathbf{K} \alpha \geq 0 \]

PSD-ness is the foundation of all kernel-method theory — it guarantees that the dual problem is convex, that the Representer theorem holds, and that the solution is unique.
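An empirical Mercer check is straightforward once the Gram matrix is built. A sketch (the helper `is_psd_gram` and the negative-distance "similarity" are illustrative assumptions; the latter is deliberately not a valid kernel):

import numpy as np

def is_psd_gram(kernel, X, tol=1e-10):
    # finite-sample Mercer check: the Gram matrix must be symmetric positive semi-definite
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    return np.allclose(K, K.T) and np.linalg.eigvalsh(K).min() >= -tol

X = np.random.default_rng(0).normal(size=(50, 3))
rbf = lambda x, y: np.exp(-np.sum((x - y) ** 2))        # a valid kernel
neg_dist = lambda x, y: -np.linalg.norm(x - y)          # a "similarity" that is NOT a valid kernel
print("RBF PSD:", is_psd_gram(rbf, X))                  # True
print("negative distance PSD:", is_psd_gram(neg_dist, X))  # False (zero trace forces a negative eigenvalue)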

5.3 Algebra of Kernels

Valid kernels combine into new valid kernels:

  • \(K_1 + K_2\) is a kernel;
  • \(c \cdot K\) (\(c > 0\)) is a kernel;
  • \(K_1 \cdot K_2\) is a kernel (pointwise product);
  • \(f(x) f(x')\) is a kernel for any \(f\);
  • \(K(\psi(x), \psi(x'))\) is a kernel.

In engineering, one can design separate kernels for different feature families (numeric, text, graph) and linearly combine them — this is Multiple Kernel Learning (MKL).


6. The SVM Derivation in Full

The SVM is the Analogizers' "master algorithm": it upgrades k-NN from "voting" to "maximum-margin hyperplane".

6.1 Hard-Margin Primal

Binary classification, \(y_i \in \{-1, +1\}\). Find a hyperplane \(w^T x + b = 0\) that separates the two classes and maximizes the margin. The distance from \(x_i\) to the hyperplane is \(\frac{|w^T x_i + b|}{\|w\|}\). Normalizing so that the closest point satisfies \(|w^T x_i + b| = 1\), the margin is \(\frac{2}{\|w\|}\).

Primal problem:

\[ \begin{aligned} \min_{w, b} \quad & \tfrac{1}{2} \|w\|^2 \\ \text{s.t.} \quad & y_i (w^T x_i + b) \geq 1, \quad i=1,\dots,N \end{aligned} \]

A convex quadratic program, directly solvable, but its variables live in the input space (\(w \in \mathbb{R}^{\dim(x)}\)), which leaves no opening for the kernel trick.

6.2 Lagrangian Dual

Introducing multipliers \(\alpha_i \geq 0\), the Lagrangian is:

\[ \mathcal{L}(w, b, \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_i \alpha_i \left[y_i(w^T x_i + b) - 1\right] \]

Setting partial derivatives with respect to \(w\) and \(b\) to zero:

\[ \frac{\partial \mathcal{L}}{\partial w} = 0 \Rightarrow w = \sum_i \alpha_i y_i x_i \]
\[ \frac{\partial \mathcal{L}}{\partial b} = 0 \Rightarrow \sum_i \alpha_i y_i = 0 \]

Substituting back yields the dual problem:

\[ \begin{aligned} \max_\alpha \quad & \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, \langle x_i, x_j \rangle \\ \text{s.t.} \quad & \alpha_i \geq 0, \quad \sum_i \alpha_i y_i = 0 \end{aligned} \]

Key observation: in the dual, \(x_i\) appears only through inner products \(\langle x_i, x_j \rangle\) — exactly the entry point for the kernel trick. Replace the inner product with \(K(x_i, x_j)\) and you instantly get an SVM in an infinite-dimensional feature space.

6.3 KKT Conditions

The optimum \((w^*, b^*, \alpha^*)\) must satisfy the Karush-Kuhn-Tucker (KKT) conditions:

  1. Stationarity: \(w^* = \sum_i \alpha_i^* y_i x_i\);
  2. Primal feasibility: \(y_i({w^*}^T x_i + b^*) \geq 1\);
  3. Dual feasibility: \(\alpha_i^* \geq 0\);
  4. Complementary slackness: \(\alpha_i^* \left[y_i({w^*}^T x_i + b^*) - 1\right] = 0\).

Complementary slackness yields a deep conclusion: \(\alpha_i^* > 0 \Rightarrow y_i({w^*}^T x_i + b^*) = 1\) — only points sitting exactly on the margin boundary have non-zero \(\alpha\). These points are called support vectors.

The SVM solution depends only on a subset of training points (the support vectors), and that is the source of its generalization power — the generalization bounds depend on the number of support vectors and on the margin, not on the dimension of the feature space.

6.4 Soft Margin and Hinge Loss

Real data is rarely linearly separable. Introduce slack variables \(\xi_i \geq 0\):

\[ \begin{aligned} \min_{w, b, \xi} \quad & \tfrac{1}{2} \|w\|^2 + C \sum_i \xi_i \\ \text{s.t.} \quad & y_i (w^T x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0 \end{aligned} \]

\(C\) is the regularization parameter: \(C \to \infty\) recovers the hard margin; small \(C\) tolerates more violations. The equivalent unconstrained form is:

\[ \min_{w, b} \tfrac{1}{2} \|w\|^2 + C \sum_i \max\left(0, 1 - y_i(w^T x_i + b)\right) \]

The second term is the hinge loss — the SVM's loss function.

The soft-margin dual is almost identical to the hard-margin dual, with the only change being the box constraint \(0 \leq \alpha_i \leq C\).

6.5 The Kernel SVM Decision Function

Once \(\alpha^*\) is learned, the prediction for a new sample \(x\) is:

\[ \hat y(x) = \mathrm{sign}\!\left(\sum_{i \in \mathrm{SV}} \alpha_i^* y_i \, K(x_i, x) + b^*\right) \]

A perfect match for the Analogizer master form \(\hat y = f(\sum_i \alpha_i K(x, x_i))\).
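A quick scikit-learn sketch that reconstructs this decision function from the fitted dual coefficients (scikit-learn's `dual_coef_` stores \(\alpha_i^* y_i\) for the support vectors; the toy data and \(\gamma\) are illustrative assumptions):

import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.15, random_state=0)
gamma = 0.5
svm = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X, y)

def rbf(A, B):
    # pairwise RBF kernel K(a_i, b_j) = exp(-gamma ||a_i - b_j||^2)
    return np.exp(-gamma * np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1))

x_new = X[:5]
# decision(x) = sum_{i in SV} (alpha_i y_i) K(x_i, x) + b
manual = rbf(x_new, svm.support_vectors_) @ svm.dual_coef_.ravel() + svm.intercept_[0]
print(np.allclose(manual, svm.decision_function(x_new)))   # True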

6.6 The SMO Algorithm

The standard way to solve the SVM dual is Sequential Minimal Optimization (SMO, Platt 1998):

SMO main loop:
  while not converged:
    pick a pair of multipliers (α_i, α_j) that violate KKT
    fix the others; reduce the dual to a 2-D QP in (α_i, α_j)
    solve analytically (1-D QP, closed form)
  end

The elegance of SMO: it optimizes only two variables at a time (necessarily two, because of the constraint \(\sum_i \alpha_i y_i = 0\)), decomposing an \(N\)-dimensional QP into many 2-D sub-problems. LibSVM is the canonical efficient implementation of SMO.


7. The Kernel Zoo

Kernel Formula Parameters Use case
Linear \(K(x,y) = x^T y\) high-dimensional sparse text (TF-IDF)
Polynomial \((x^T y + c)^d\) \(c, d\) feature interactions (NLP N-gram-like)
RBF / Gaussian \(\exp(-\gamma \|x-y\|^2)\) \(\gamma\) the universal default; infinite-dim feature space
Laplacian \(\exp(-\gamma \|x-y\|_1)\) \(\gamma\) more outlier-robust than RBF
Sigmoid \(\tanh(\kappa x^T y + c)\) \(\kappa, c\) historically mimicked two-layer nets (not strictly PSD)
String kernel shared-substring count \(k\) (substring length) text classification, biological sequences
Graph kernel random walk / subgraph count chemical molecules, social graphs
Fisher kernel \(\nabla_\theta \log p(x \mid \theta)^T \, \mathcal{I}^{-1} \, \nabla_\theta \log p(y \mid \theta)\) generative model \(p(\cdot \mid \theta)\) plugging generative models into discriminative kernel classifiers

Default choices: tabular features → RBF; text → linear kernel on high-dimensional sparse features; structured objects → string/graph kernel; if in doubt → RBF + grid search over \((\gamma, C)\).


8. Kernel PCA and Kernel Ridge Regression

8.1 Kernel PCA

Plain PCA finds the directions of maximum variance in the original space; kernel PCA does PCA in the feature space after the \(\phi\) map. Concretely, we find the eigenvectors of the centered kernel matrix \(\tilde{\mathbf{K}}\):

\[ \tilde{\mathbf{K}} \alpha = \lambda N \alpha \]

The \(k\)-th principal component corresponds to the eigenvector \(\alpha^{(k)}\), and the \(k\)-th principal coordinate of a new sample \(x\) is:

\[ \mathrm{PC}_k(x) = \sum_{i=1}^N \alpha^{(k)}_i \, K(x_i, x) \]

Significance: a manifold structure that is non-linear in the original space may become linear in the \(\phi\) space — kernel PCA extracts non-linear principal components.

8.2 Kernel Ridge Regression

The kernelized closed-form solution to ridge regression \(\min_w \|y - Xw\|^2 + \lambda \|w\|^2\) is:

\[ \hat y(x) = \sum_i \alpha_i K(x_i, x), \quad \alpha = (\mathbf{K} + \lambda I)^{-1} y \]

A textbook instance of the Analogizer master form. Connection to GP regression: the kernel ridge prediction mean is mathematically identical to the GP regression posterior mean — see Section 10.
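A short sketch checking the closed form against scikit-learn's `KernelRidge` (toy data and hyperparameters are illustrative assumptions; scikit-learn calls the regularizer `alpha`, which plays the role of \(\lambda\) here):

import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (100, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 100)
lam, gamma = 0.1, 1.0
X_test = np.linspace(-3, 3, 5).reshape(-1, 1)

# closed form: alpha = (K + lambda I)^{-1} y,  y_hat(x) = sum_i alpha_i K(x_i, x)
K = rbf_kernel(X, X, gamma=gamma)
coef = np.linalg.solve(K + lam * np.eye(len(X)), y)
y_manual = rbf_kernel(X_test, X, gamma=gamma) @ coef

y_sklearn = KernelRidge(alpha=lam, kernel="rbf", gamma=gamma).fit(X, y).predict(X_test)
print(np.allclose(y_manual, y_sklearn))   # True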


9. The Representer Theorem

Representer theorem (Schölkopf, Herbrich & Smola 2001): in a reproducing kernel Hilbert space (RKHS) \(\mathcal{H}_K\), the optimal solution to a regularized empirical risk minimization problem can always be written as a linear combination of kernel functions evaluated at the training points.

Formally, consider:

\[ \min_{f \in \mathcal{H}_K} \;\; \sum_{i=1}^N L(y_i, f(x_i)) + \Omega(\|f\|_{\mathcal{H}_K}) \]

where \(L\) is any loss and \(\Omega\) a monotonically increasing regularizer. Then the optimal solution must have the form:

\[ f^*(x) = \sum_{i=1}^{N} \alpha_i \, K(x_i, x) \]

Sketch of proof: decompose any \(f \in \mathcal{H}_K\) as \(f = f_\parallel + f_\perp\), with \(f_\parallel \in \mathrm{span}\{K(\cdot, x_i)\}\) and \(f_\perp\) orthogonal to it. By the reproducing property, \(f(x_i) = \langle f, K(\cdot, x_i) \rangle = f_\parallel(x_i)\), so \(f_\perp\) does not affect the loss term. Yet \(\|f\|^2 = \|f_\parallel\|^2 + \|f_\perp\|^2 \geq \|f_\parallel\|^2\), so the regularizer is minimized when \(f_\perp = 0\). \(\square\)

Significance:

  1. Theoretical foundation: reduces an infinite-dimensional optimization to a finite \(N\)-dimensional one in \(\alpha\);
  2. Algorithmic unity: SVMs, kernel ridge regression, kernel logistic regression, GPs — all of their solutions take the form \(\sum_i \alpha_i K(x_i, \cdot)\), differing only in how \(\alpha\) is determined;
  3. Legitimacy of the Analogizers: theoretically explains why "predicting via a weighted sum of training similarities" is the essential form of every RKHS method, not a coincidence.

10. Gaussian Processes as Kernel Methods

The Gaussian process (GP) is the Bayesians' regression method, but in essence it is a kernel method. A GP is fully specified by its mean function \(m(x)\) and covariance kernel \(K(x, x')\):

\[ f(x) \sim \mathcal{GP}(m(x), K(x, x')) \]

After observing \(\mathbf{y} = f(\mathbf{X}) + \epsilon\) (Gaussian noise \(\epsilon \sim \mathcal{N}(0, \sigma^2 I)\)), the posterior at a new point \(x_*\) is:

\[ \mathbb{E}[f(x_*) \mid \mathbf{X}, \mathbf{y}] = \mathbf{k}_*^T (\mathbf{K} + \sigma^2 I)^{-1} \mathbf{y} \]
\[ \mathrm{Var}[f(x_*) \mid \mathbf{X}, \mathbf{y}] = K(x_*, x_*) - \mathbf{k}_*^T (\mathbf{K} + \sigma^2 I)^{-1} \mathbf{k}_* \]

where \(\mathbf{k}_* = [K(x_*, x_i)]_{i=1}^N\) and \(\mathbf{K}\) is the training-point kernel matrix.

Key observation: the posterior mean

\[ \mathbb{E}[f(x_*) \mid \cdot] = \sum_i \alpha_i K(x_*, x_i), \quad \alpha = (\mathbf{K} + \sigma^2 I)^{-1} \mathbf{y} \]

has exactly the same form as the kernel ridge solution (with \(\sigma^2 = \lambda\)). The only difference between GPs and kernel ridge regression is that GPs additionally provide the predictive variance (uncertainty quantification), whereas kernel ridge only gives a point estimate.
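The equivalence is easy to verify numerically. A sketch using scikit-learn's `GaussianProcessRegressor` with hyperparameter optimization turned off so the kernels match exactly (the toy data, `length_scale`, and noise level are illustrative assumptions; for an RBF with length scale \(\ell\), the corresponding `rbf_kernel` gamma is \(1/(2\ell^2)\)):

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (80, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 80)
X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
noise = 0.01                                   # sigma^2, playing the role of lambda

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=noise, optimizer=None).fit(X, y)
gp_mean = gp.predict(X_test)

K = rbf_kernel(X, X, gamma=0.5)                # gamma = 1 / (2 * length_scale^2)
coef = np.linalg.solve(K + noise * np.eye(len(X)), y)
kr_mean = rbf_kernel(X_test, X, gamma=0.5) @ coef
print(np.allclose(gp_mean, kr_mean))           # True: GP posterior mean == kernel ridge prediction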

Tribal viewpoint: the Bayesians read a GP as "a prior over functions"; the Analogizers read a GP as "a non-parametric prediction weighted by training-point kernel similarity". The two perspectives are equivalent but tell different stories — and that is why the GP section in ../../03_Machine_Learning/贝叶斯学习.md and this section mirror each other.


11. Division of Labour with ../../03_Machine_Learning/kernel_methods.md

../../03_Machine_Learning/kernel_methods.md already covers the basics of kernel methods (what a kernel is, the intuition for the kernel trick, common kernels). This page does not repeat that material; instead it goes deeper:

Topic 03_Machine_Learning/kernel_methods.md This page
Kernel-trick concept introduces it assumes it and uses it
Mercer condition mentions it PSD properties + algebra of kernels
SVM derivation brief sketch full primal + dual + KKT + SMO
Representer theorem full statement with proof sketch
GP–kernel connection explicit derivation of the equivalence with kernel ridge
Curse of dimensionality full mathematical analysis (volume, distance concentration)
Algorithmic engineering overview of KD-Tree / LSH / SMO implementations

In short: that page asks "what are kernel methods", this page asks "why are kernel methods the master algorithm of the Analogizers, and what is their full mathematical skeleton".


Code Example: Running Everything Above with scikit-learn

import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.kernel_ridge import KernelRidge
from sklearn.decomposition import KernelPCA

X, y = make_moons(n_samples=500, noise=0.2, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# 1) k-NN
knn = KNeighborsClassifier(n_neighbors=5, weights="distance").fit(X_tr, y_tr)

# 2) RBF-SVM (Analogizer master algorithm)
svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_tr, y_tr)
print("support vectors / training samples:", len(svm.support_), "/", len(X_tr))

# 3) Kernel ridge regression (regression-flavored Analogizer)
krr = KernelRidge(alpha=0.1, kernel="rbf", gamma=1.0).fit(X_tr, y_tr)

# 4) Kernel PCA (non-linear dimensionality reduction)
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=1.0).fit(X_tr)
Z = kpca.transform(X_te)

print("k-NN acc:", knn.score(X_te, y_te))
print("SVM acc:", svm.score(X_te, y_te))



References

  1. Cover, T. M., & Hart, P. E. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), 21–27. — The foundational k-NN paper and 1-NN error bound.
  2. Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. COLT. — SVMs and the kernel trick.
  3. Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297. — Soft-margin SVM.
  4. Platt, J. (1998). Sequential Minimal Optimization: A fast algorithm for training support vector machines. Microsoft Research TR. — The SMO algorithm.
  5. Schölkopf, B., Herbrich, R., & Smola, A. J. (2001). A generalized representer theorem. COLT. — The modern statement and proof of the Representer theorem.
  6. Schölkopf, B., & Smola, A. J. (2002). Learning with Kernels. MIT Press. — The standard textbook on kernel methods.
  7. Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press. — The standard GP textbook; Chapter 6 covers the connection to kernel methods.
  8. Beyer, K., et al. (1999). When is "nearest neighbor" meaningful? ICDT. — The rigorous proof of distance concentration.
  9. Bentley, J. L. (1975). Multidimensional binary search trees used for associative searching. CACM. — The original KD-Tree paper.
  10. Indyk, P., & Motwani, R. (1998). Approximate nearest neighbors: towards removing the curse of dimensionality. STOC. — The foundation of LSH.

Metric Learning and Contrastive Learning

Scope of this page: a thorough treatment of how similarity itself is learned. From the shallow metric learning of the 2000s (LMNN/ITML) all the way to the contrastive learning of the 2020s (SimCLR/MoCo/BYOL/DINO) and the cross-modal CLIP — this is the renaissance of the Analogizers in the deep learning era.

Nearest Neighbors and Kernel Methods solved "given a similarity, how to predict". The deeper question is — where does the similarity itself come from?

A naive Euclidean distance in raw pixel space carries almost no semantic meaning: shifting a cat photo by a few pixels can yield a Euclidean distance larger than the gap between a cat image and a dog image. To make "similar inputs have similar outputs", we must learn an embedding in which similar inputs are genuinely close.

That is the entire mission of metric learning. When metric learning is married to deep networks and scaled to unlabeled data, it becomes the training paradigm of foundation models in the 2020s — contrastive learning.


1. The Goal of Metric Learning

Formally: find an embedding function \(f_\theta: \mathcal{X} \to \mathbb{R}^d\) such that for any "similar pair" \((x, x^+)\) and "dissimilar pair" \((x, x^-)\):

\[ d(f_\theta(x), f_\theta(x^+)) < d(f_\theta(x), f_\theta(x^-)) \]

Here \(d\) is usually Euclidean or cosine distance. Key questions:

  1. Who defines "similar"? — labels in supervised learning (same class is similar); data augmentation in self-supervised learning (different augmentations of the same image are similar).
  2. How to train? — convert "similarity-ordering constraints" into a differentiable loss.
  3. What is the geometry of the embedding space? — Euclidean / spherical / hyperbolic.

2. Classical Metric Learning

2.1 Mahalanobis Metric Learning

Learn a positive semi-definite matrix \(M = L^T L\) such that

\[ d_M(x, y) = \sqrt{(x-y)^T M (x-y)} = \|L(x-y)\|_2 \]

is task-meaningful. Note that \(L\) is a linear embedding — Mahalanobis metric learning is "learn a linear embedding, then use Euclidean distance".

2.2 LMNN (Large Margin Nearest Neighbor, Weinberger & Saul 2009)

LMNN directly optimizes the metric for k-NN, requiring each point's "target neighbors" (same class) to be at least one margin closer than its "impostors" (different class):

\[ \min_M \;\; \underbrace{\sum_{i,j \rightsquigarrow i} d_M(x_i, x_j)^2}_{\text{pull same class together}} + C \underbrace{\sum_{i,j \rightsquigarrow i, l} \left[1 + d_M(x_i, x_j)^2 - d_M(x_i, x_l)^2\right]_+ (1 - y_{il})}_{\text{push different classes apart}} \]

where \(j \rightsquigarrow i\) means \(j\) is one of \(i\)'s preselected target neighbors (k-nearest among the same class) and \(l\) ranges over all points. \(M\) must remain PSD (projection onto the PSD cone).

Contributions: imports the SVM "margin" idea into metric learning; can be combined with the kernel trick.

2.3 ITML (Information-Theoretic Metric Learning, Davis et al. 2007)

Measures the deviation of the metric from a prior metric using the KL divergence:

\[ \min_M \;\; D_{\mathrm{KL}}(\mathcal{N}(0, M^{-1}) \| \mathcal{N}(0, M_0^{-1})) \quad \text{s.t. similar/dissimilar pair constraints} \]

The solution form is unusually clean: iterative Bregman projection. Advantage: parameterizes naturally inside the PSD cone, avoiding explicit projection.

2.4 The Limits of Shallow Metric Learning

LMNN/ITML both assume a linear transform suffices — but semantic similarity in images and text is highly non-linear. Replacing the linear part with a deep network leads to the next section's Siamese / Triplet networks.


3. Deep Metric Learning Architectures

3.1 Siamese Networks

Proposed by Bromley & LeCun (1993) and exploding in popularity after 2014. Two weight-sharing subnetworks process two inputs and compare embeddings:

flowchart LR
    X1["x₁"] --> F1["fθ"]
    X2["x₂"] --> F2["fθ (shared weights)"]
    F1 --> Z1["z₁"]
    F2 --> Z2["z₂"]
    Z1 & Z2 --> D["d(z₁, z₂)"]
    D --> L["contrastive loss"]
    style F1 fill:#fde68a
    style F2 fill:#fde68a

Applications: face verification (DeepFace, FaceNet), signature verification, low-shot classification (expanding on the Siamese section in ../../../1_DeepLearning/Computer_Vision/meta_learning.md).

3.2 Triplet Networks (FaceNet, Schroff et al. 2015)

Three inputs: anchor \(a\), positive \(p\) (same class as \(a\)), negative \(n\) (different class from \(a\)). The loss pushes \(a\) at least one margin \(m\) closer to \(p\) than to \(n\):

\[ L_{\mathrm{triplet}} = \max\left(0, \; \|f(a) - f(p)\|^2 - \|f(a) - f(n)\|^2 + m\right) \]

FaceNet's recipe:

  • L2-normalize embeddings to the unit sphere (128 dimensions);
  • 99.63% verification accuracy on LFW.
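A minimal PyTorch sketch of the triplet loss above on L2-normalized embeddings (the helper name, batch size, and margin value are illustrative assumptions; `torch.nn.TripletMarginLoss` offers a built-in variant based on un-squared distances):

import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    # L = mean_i max(0, ||f(a)-f(p)||^2 - ||f(a)-f(n)||^2 + margin)
    d_ap = (anchor - positive).pow(2).sum(dim=1)
    d_an = (anchor - negative).pow(2).sum(dim=1)
    return F.relu(d_ap - d_an + margin).mean()

a, p, n = (F.normalize(torch.randn(32, 128), dim=1) for _ in range(3))
print(triplet_loss(a, p, n).item())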

3.3 Hard Negative Mining

The central difficulty of triplet training: random negatives mostly already satisfy the constraint (loss = 0), giving no gradient. We must pick hard negatives (the negatives closest to the anchor):

Strategy Description Note
Random random negative training stalls
Semi-hard choose \(\|f(a)-f(n)\| > \|f(a)-f(p)\|\) but still violating the margin FaceNet's default
Hardest the closest negative in the batch prone to collapse to a trivial solution
Online (within batch) mine on the fly inside a mini-batch efficient, mainstream

4. A Lineage of Loss Functions

The evolution of metric learning losses is a single thread, with each generation fixing problems left by the last.

4.1 Contrastive Loss (Hadsell, Chopra & LeCun 2006)

The earliest deep contrastive loss. Given a pair \((x_1, x_2)\) and a "similar" label \(y \in \{0, 1\}\) (0 = similar, 1 = dissimilar):

\[ L_{\mathrm{contrastive}} = (1 - y) \cdot D^2 + y \cdot \max(0, m - D)^2 \]

where \(D = \|f(x_1) - f(x_2)\|\) and \(m\) is the margin. Intuition:

  • similar pair (\(y=0\)) → minimize squared distance;
  • dissimilar pair (\(y=1\)) → push apart by at least \(m\).

4.2 Triplet Loss

See above — extends the contrastive "pair" to a "triplet", balancing positive and negative naturally.

4.3 Quadruplet / N-pair / Lifted Structure

Loss Form Improvement
Quadruplet (Chen 2017) one anchor + one positive + two negatives, requiring negatives to also be apart larger inter-class margins
N-pair (Sohn 2016) one anchor + one positive + (N-1) negatives, softmax better mini-batch utilization
Lifted Structure (Oh Song 2016) jointly optimize all positive and negative pairs in a batch global structure
Angular Loss (Wang 2017) constrains the inner angles of the triangle, scale-invariant more geometrically intuitive

4.4 InfoNCE / NT-Xent: the Core Loss of Modern Contrastive Learning

InfoNCE (van den Oord et al. 2018) connects contrastive learning to mutual information estimation, and NT-Xent (Normalized Temperature-scaled Cross Entropy, Chen et al. SimCLR 2020) is its standard form for self-supervised vision.

In a batch of \(N\) original samples, each producing 2 augmented views, there are \(2N\) samples total. The loss for view \(i\) paired with its positive \(j\) is:

\[ L_{i,j}^{\mathrm{NT\text{-}Xent}} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j) / \tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\mathrm{sim}(z_i, z_k) / \tau)} \]

where \(\mathrm{sim}(u, v) = \frac{u^T v}{\|u\|\|v\|}\) (cosine similarity) and \(\tau > 0\) is the temperature. The total batch loss is the average over all \((i, j)\) pairs.

4.4.1 Derivation: Connection to a Mutual-Information Lower Bound

InfoNCE is a lower-bound estimator of the mutual information \(I(X; Y)\):

\[ I(X; Y) \geq \log K - L_{\mathrm{InfoNCE}} \]

where \(K\) is the number of candidates in the denominator (the positive plus the negatives). This gives contrastive learning an information-theoretic interpretation: maximizing contrastive performance = maximizing the mutual information between two views. But the bound is loose — the relationship between theory and practice remains debated (Tschannen 2020).

4.4.2 The Role of the Temperature \(\tau\)

The temperature \(\tau\) controls the "sharpness" of the softmax:

\(\tau\) Behavior Effect
very small (\(\tau \to 0\)) softmax becomes argmax focuses only on the hardest negative; unstable training
very large (\(\tau \to \infty\)) softmax becomes uniform all negatives weighed equally; little learning signal
moderate (\(\tau \approx 0.05 \sim 0.5\)) balanced SimCLR default 0.5; MoCo default 0.07; CLIP learns it

Empirical: a too-small temperature is a common cause of early-stage collapse. CLIP makes the logit scale \(1/\tau\) learnable (as the exponential of a learnable scalar), initialized to \(1/0.07 \approx 14.3\) and clipped at \(100\), so \(\tau\) cannot drop below \(0.01\).

4.4.3 SupCon: Supervised Contrastive Loss

Extends NT-Xent to the labeled setting (Khosla et al. 2020) by treating all same-class samples as positives:

\[ L_i^{\mathrm{SupCon}} = -\frac{1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(z_i \cdot z_p / \tau)}{\sum_{a \neq i} \exp(z_i \cdot z_a / \tau)} \]

where \(P(i)\) is the set of samples sharing the class of \(i\). SupCon often beats cross-entropy in supervised classification.


5. Self-Supervised Contrastive Learning

Cross-page deep dive: ../../../1_DeepLearning/Computer_Vision/视觉自监督.md already lays out the overall paradigm of self-supervised learning. This section focuses on the internal evolution and algorithmic detail of the contrastive line.

5.1 SimCLR (Chen et al. 2020)

The first work to scale contrastive learning seriously. Key components:

  1. Augmentation composition: random crop + color jitter + Gaussian blur (ablations show that crop + color is the crucial pair).
  2. Projection head: a 2-layer MLP \(g\) stacked on top of the encoder; the contrastive loss is computed on \(g(h)\) while \(h\) is used downstream. A pivotal trick: the projection head discards augmentation-sensitive features, making \(h\) more general.
  3. NT-Xent loss.
  4. Large batch: 4096–8192, since each sample contributes \(2N - 2\) negatives within the batch.

SimCLR step:
  for x in batch:
    x1, x2 ← two random augmentations(x)
    h1, h2 ← encoder(x1), encoder(x2)
    z1, z2 ← projection_head(h1), projection_head(h2)
  L ← NT-Xent({(z1_i, z2_i)})
  backprop

5.2 MoCo (Momentum Contrast, He et al. 2020)

SimCLR depends on a large batch to provide negatives, demanding huge memory. MoCo's solution:

  1. Momentum encoder: maintain an EMA copy of the encoder:
\[ \theta_k \leftarrow m \theta_k + (1 - m) \theta_q, \quad m \in [0.99, 0.999] \]
  2. Queue: keep the key embeddings of the most recent mini-batches in a FIFO queue of size \(K\) (typically \(K = 65536\) keys) as a negative pool.
  3. InfoNCE loss: compare the query \(q\) with the positive key \(k_+\) and all negative keys in the queue.

Benefit: the number of negatives is decoupled from batch size; memory-friendly. Downside: momentum updates introduce staleness, requiring careful tuning of \(m\). MoCo v2 imports SimCLR's projection head and stronger augmentations for further gains.

5.3 BYOL (Bootstrap Your Own Latent, Grill et al. 2020)

BYOL's stunning observation: you can do without negatives at all.

Architecture: two networks, online + target, where target is the EMA of online. The online branch has a predictor \(q\) that predicts the target's output:

\[ L_{\mathrm{BYOL}} = \| \overline{q(z_\theta)} - \overline{z'_\xi} \|^2 \quad \text{(L2 normalized)} \]

Key tricks:

  • stop-gradient on target (target is not backpropagated through);
  • predictor asymmetry — remove the predictor and the model collapses immediately;
  • no negatives, but the triple asymmetry of stop-gradient + EMA + predictor prevents collapse to a trivial solution.

BYOL overturned the belief that "contrastive learning needs negatives" and triggered a wave of theoretical explanations (e.g. dynamics analyses by Tian 2021).

5.4 SimSiam (Chen & He 2021)

Even more extreme than BYOL: it drops the EMA target too, keeping only stop-gradient + predictor:

\[ L_{\mathrm{SimSiam}} = -\frac{1}{2}\left(\cos(p_1, \mathrm{sg}(z_2)) + \cos(p_2, \mathrm{sg}(z_1))\right) \]

Stop-gradient is the "soul" preventing collapse in BYOL/SimSiam — one of the most mysterious theoretical questions of 2021.

5.5 DINO (Caron et al. 2021)

DINO = Self-Distillation with NO labels. It marries contrastive learning to knowledge distillation and showcases the emergent capabilities of self-supervised ViTs for the first time.

  • A student network \(g_\theta\) and a teacher network \(g_\xi\) (EMA), both ViTs;
  • the student sees small (local) crops, the teacher sees large (global) crops;
  • the student matches the teacher's softmax output (not a contrastive loss but a cross-entropy);
  • centering + sharpening prevent collapse.

Striking finding: DINO ViT attention maps automatically focus on object boundaries — semantic segmentation emerges from purely unsupervised training. A landmark of ViT + self-supervised learning.

5.6 An Overview of Contrastive Methods

Method Year Loss Needs negatives? Key innovation
SimCLR 2020 NT-Xent yes (large batch) projection head + strong augmentation
MoCo v2 2020 InfoNCE yes (queue) momentum encoder
BYOL 2020 L2 (predicted vs EMA) no predictor + stop-gradient
SwAV 2020 cluster prediction no cluster prototypes + multi-view
SimSiam 2021 cosine + stop-gradient no minimalist, no EMA
Barlow Twins 2021 cross-correlation matrix no redundancy reduction
DINO 2021 cross-entropy distillation no self-distillation + ViT emergence

6. Cross-Modal Metric Learning

Cross-page deep dive: ../../../1_DeepLearning/Foundation_Models/vision_foundation.md covers the overall significance and downstream applications of CLIP. This section focuses on the algorithmic detail of its contrastive learning.

6.1 CLIP (Contrastive Language-Image Pre-training, Radford et al. 2021)

Setup: 400 million (image, caption) pairs scraped from the web. Task: bring an image and its caption close together in the embedding space.

Architecture:

  • image encoder \(f_I\) (ViT or ResNet);
  • text encoder \(f_T\) (Transformer);
  • both followed by a linear projection to a joint \(d\)-dim space;
  • L2 normalization.

Loss: symmetric InfoNCE. Given \(N\) in-batch pairs \((I_i, T_i)\), build the \(N \times N\) similarity matrix \(S\) with \(S_{ij} = f_I(I_i)^T f_T(T_j) / \tau\). Then:

\[ L_{\mathrm{CLIP}} = \frac{1}{2}\left( \mathrm{CE}(\mathrm{softmax}_{\text{row}}(S), \mathbf{y}) + \mathrm{CE}(\mathrm{softmax}_{\text{col}}(S), \mathbf{y}) \right), \qquad \mathbf{y} = (1, \dots, N) \]

i.e. "pair each image with its true caption" + "pair each caption with its true image".

Zero-shot classification: wrap each class name \(c\) as the prompt "a photo of a {c}" and encode it as \(f_T(t_c)\). The prediction for a new image \(I\) is:

\[ \hat c = \arg\max_c \; f_I(I)^T f_T(t_c) \]

CLIP matches or surpasses fully supervised models on 30+ vision datasets in the zero-shot setting, ushering in the foundation-model era of vision–language.

6.2 ALIGN (Jia et al. 2021)

Almost contemporaneous with CLIP, with even larger scale (1.8 billion noisy alt-text pairs) and the same loss. The lesson: data scale can compensate for data quality.

6.3 SigLIP (Zhai et al. 2023)

Key change: replace CLIP's softmax-based InfoNCE with a sigmoid-based binary classification:

\[ L_{\mathrm{SigLIP}} = -\frac{1}{N}\sum_{i,j} \log \sigma\left(z_{ij} \cdot (t \cdot s_{ij} + b)\right) \]

where \(z_{ij} = +1\) if \(i = j\) else \(-1\), \(s_{ij}\) is the similarity, and \(t, b\) are learnable temperature and bias. Advantages:

  • no need for full-batch softmax normalization, drastically reducing communication cost in distributed training;
  • a small batch (e.g. 8k) achieves what SimCLR/CLIP need a 32k batch for;
  • mathematically a per-pair independent binary classification — more robust.

SigLIP has become the de facto standard CLIP replacement after 2024.


7. Geometry of the Metric Space

The choice of embedding-space geometry is crucial for the Analogizers.

7.1 Euclidean Embedding

The most common. But Euclidean geometry is limited at modeling "hierarchical structure" — see the hyperbolic case below.

7.2 Spherical Embedding

Force \(\|z\| = 1\), reducing similarity to cosine. Advantages:

  • numerically stable (gradients do not explode through norms);
  • compatible with softmax (which assumes bounded logits);
  • geometric uniformity (hyperspherical uniformity, Wang & Isola 2020).

Wang & Isola (2020) famously decomposed the contrastive objective into:

\[ L_{\mathrm{contrastive}} \approx -\underbrace{\mathbb{E}[\mathrm{sim}(z, z^+)]}_{\text{alignment}} + \underbrace{\log \mathbb{E}[\exp(\mathrm{sim}(z, z^-) / \tau)]}_{\text{uniformity}} \]

Good contrastive models simultaneously achieve alignment (same-class together) and uniformity (uniform spread on the sphere); both are necessary.

7.3 Hyperbolic Embedding

Hyperbolic space (e.g. the Poincaré ball model) has volume that grows exponentially with radius, naturally suiting tree-like / hierarchical structures. Nickel & Kiela (2017) showed that hyperbolic embeddings precisely represent WordNet with far fewer dimensions. Gradient optimization on Riemannian manifolds requires special handling, however (Riemannian Adam).


8. Architectural Evolution of Metric Learning

flowchart TD
    A["k-NN<br/>fixed Euclidean"] --> B["Mahalanobis metric learning"]
    B --> C["LMNN (2009)"]
    B --> D["ITML (2007)"]
    C --> E["Siamese networks<br/>(Bromley 1993, revival 2014)"]
    D --> E
    E --> F["Triplet networks<br/>FaceNet (2015)"]
    F --> G["Contrastive loss NT-Xent<br/>SimCLR (2020)"]
    G --> H["MoCo (2020)<br/>queue + momentum"]
    G --> I["BYOL (2020)<br/>no negatives"]
    G --> J["DINO (2021)<br/>self-distillation"]
    G --> K["CLIP (2021)<br/>cross-modal"]
    K --> L["SigLIP (2023)<br/>sigmoid"]
    K --> M["BLIP/Flamingo/<br/>multimodal LLMs"]
    H --> N["RAG era<br/>retrieval embeddings"]

    style G fill:#fde68a
    style K fill:#bbf7d0
    style N fill:#bbf7d0

9. Evaluation

Metric learning cannot be evaluated by accuracy — what we evaluate is embedding quality, not classification. Common metrics:

9.1 Recall@K

For each query in the test set, check whether any of its \(K\) nearest neighbors share its class:

\[ \mathrm{Recall@K} = \frac{1}{|Q|} \sum_{q \in Q} \mathbb{1}\left[\exists i \in \mathrm{kNN}_K(q): y_i = y_q\right] \]

The standard metric for face recognition and image retrieval.
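A NumPy sketch of Recall@K under cosine similarity (the helper name and the synthetic class-clustered embeddings are illustrative assumptions):

import numpy as np

def recall_at_k(emb, labels, k=1):
    # fraction of queries whose k nearest neighbors (excluding the query itself) contain a same-class sample
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T
    np.fill_diagonal(sim, -np.inf)
    topk = np.argsort(-sim, axis=1)[:, :k]
    return (labels[topk] == labels[:, None]).any(axis=1).mean()

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(10), 20)                                      # 10 classes, 20 samples each
emb = rng.normal(size=(200, 64)) + 3 * rng.normal(size=(10, 64))[labels]   # class-clustered embeddings
print("Recall@1:", recall_at_k(emb, labels, k=1))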

9.2 mAP (mean Average Precision)

For each query, compute AP after sorting by similarity:

\[ \mathrm{AP}(q) = \frac{1}{|R_q|} \sum_{k=1}^{N} P(k) \cdot \mathrm{rel}(k) \]

mAP is the average of AP over all queries. More comprehensive than Recall@K but more expensive.

9.3 NMI (Normalized Mutual Information)

Cluster the embeddings and compute mutual information against the true labels — measuring the semantic consistency of the embedding space.

9.4 t-SNE / UMAP Visualization

Project high-dimensional embeddings to 2D for inspection. t-SNE emphasizes local structure, UMAP balances global structure — but visualization is auxiliary, not a primary evaluation.

9.5 Linear Probing

Freeze the encoder and train a linear classifier on the embeddings. The accuracy reflects the embedding's "linear separability" — the standard evaluation for self-supervised pretraining since SimCLR.

9.6 KNN Evaluation

Freeze the encoder and run k-NN classification directly. More "fundamentalist" than linear probing — it directly tests the Analogizer premise: good embedding ⇒ neighbors are useful.


Code Example: SimCLR-style NT-Xent Implementation

import torch
import torch.nn.functional as F

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """
    z1, z2: shape [N, d]; embeddings of the two augmented views in a batch
    returns: scalar NT-Xent loss
    """
    N = z1.shape[0]
    z = torch.cat([z1, z2], dim=0)            # [2N, d]
    z = F.normalize(z, dim=1)                 # L2 normalize to unit sphere
    sim = z @ z.T / tau                       # [2N, 2N] cosine similarities
    sim.fill_diagonal_(-1e9)                  # exclude self

    # Positive index: the positive of i is i+N (or i-N)
    targets = torch.cat([torch.arange(N, 2*N), torch.arange(0, N)]).to(z.device)
    return F.cross_entropy(sim, targets)



References

  1. Hadsell, R., Chopra, S., & LeCun, Y. (2006). Dimensionality reduction by learning an invariant mapping. CVPR. — The foundational paper on contrastive loss.
  2. Bromley, J., et al. (1993). Signature verification using a Siamese time delay neural network. NIPS. — The original Siamese network.
  3. Davis, J. V., et al. (2007). Information-theoretic metric learning. ICML. — ITML.
  4. Weinberger, K. Q., & Saul, L. K. (2009). Distance metric learning for large margin nearest neighbor classification. JMLR. — LMNN.
  5. Schroff, F., Kalenichenko, D., & Philbin, J. (2015). FaceNet: A unified embedding for face recognition and clustering. CVPR. — Triplet loss + large-scale face recognition.
  6. Sohn, K. (2016). Improved deep metric learning with multi-class N-pair loss. NIPS.
  7. van den Oord, A., Li, Y., & Vinyals, O. (2018). Representation learning with contrastive predictive coding. arXiv:1807.03748. — InfoNCE and the mutual-information lower bound.
  8. Chen, T., et al. (2020). A simple framework for contrastive learning of visual representations. ICML. — SimCLR / NT-Xent.
  9. He, K., et al. (2020). Momentum contrast for unsupervised visual representation learning. CVPR. — MoCo.
  10. Grill, J.-B., et al. (2020). Bootstrap your own latent: A new approach to self-supervised learning. NeurIPS. — BYOL, showing one can do without negatives.
  11. Chen, X., & He, K. (2021). Exploring simple Siamese representation learning. CVPR. — SimSiam.
  12. Caron, M., et al. (2021). Emerging properties in self-supervised vision Transformers. ICCV. — DINO.
  13. Radford, A., et al. (2021). Learning transferable visual models from natural language supervision. ICML. — CLIP.
  14. Zhai, X., et al. (2023). Sigmoid loss for language image pre-training. ICCV. — SigLIP.
  15. Khosla, P., et al. (2020). Supervised contrastive learning. NeurIPS. — SupCon.
  16. Wang, T., & Isola, P. (2020). Understanding contrastive representation learning through alignment and uniformity on the hypersphere. ICML. — Alignment + uniformity decomposition.
  17. Nickel, M., & Kiela, D. (2017). Poincaré embeddings for learning hierarchical representations. NeurIPS. — Hyperbolic embeddings.

Recommender Systems and Modern Retrieval

Scope of this page: this is the page with the largest gap relative to the rest of the site, and it is meant to be the deepest of the three Analogizer pages. From the collaborative filtering of the 1990s all the way to the RAG / RETRO of 2025 — recommender systems, case-based reasoning (CBR), and vector retrieval are the same thing: given a large case base, find the cases most similar to the current query and "borrow" their answers. This is the Analogizers' core productive force in industry.

The Analogizers' biggest industrial application has never been the SVM but the recommender system. Amazon's "people who looked at this also looked at...", Netflix's movie picks, YouTube's video feed — the kernel under all of them is finding similar users / similar items. The same idea was migrated to ANN retrieval (FAISS / HNSW) in the 2010s, and in the 2020s became the infrastructure layer of RAG.

This page tells that engineering thread end-to-end.


1. A Survey of Recommender Systems

By "information source", recommender systems fall into three paradigms:

Paradigm Inputs Idea Representative methods Strengths Weaknesses
Content-based item features + user history recommend items whose features resemble what the user liked TF-IDF + cosine handles new-item cold start; explainable bound by explicit feature engineering; narrow horizon
Collaborative Filtering (CF) user-item interaction matrix leverage "the wisdom of crowds" — similar users like similar items k-NN CF, matrix factorization no content features needed; latent patterns emerge user / item cold start; sparsity
Hybrid both weighted / switching / feature fusion Wide & Deep, two-tower best of both engineering complexity

Recommendation = a textbook Analogizer application: for a new query (user), find historically similar queries and reuse their answers (items).


2. Collaborative Filtering

2.1 The User-Item Matrix

Suppose we have \(U\) users and \(I\) items. The rating matrix \(R \in \mathbb{R}^{U \times I}\) has entries \(R_{ui}\) being user \(u\)'s rating on item \(i\) (mostly missing). Task: fill in the missing entries.

2.2 User-based CF (Goldberg 1992; GroupLens 1994)

The predicted rating of item \(i\) for user \(u\):

\[ \hat r_{ui} = \bar r_u + \frac{\sum_{v \in N(u)} \mathrm{sim}(u, v) \cdot (r_{vi} - \bar r_v)}{\sum_{v \in N(u)} |\mathrm{sim}(u, v)|} \]

where \(N(u)\) is the \(k\) users most similar to \(u\), and \(\bar r_u\) is \(u\)'s average rating (used to subtract a per-user bias).
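A small NumPy sketch of this prediction rule (the helper `predict_user_based`, the tiny rating matrix, and the use of `np.nan` for missing entries are illustrative assumptions):

import numpy as np

def predict_user_based(R, u, i, k=3):
    # user-based CF: r_hat_ui = r_bar_u + sum_v sim(u,v) (r_vi - r_bar_v) / sum_v |sim(u,v)|
    r_u_mean, mask_u = np.nanmean(R[u]), ~np.isnan(R[u])
    sims, devs = [], []
    for v in range(R.shape[0]):
        if v == u or np.isnan(R[v, i]):
            continue
        common = mask_u & ~np.isnan(R[v])           # items rated by both u and v
        if common.sum() < 2:
            continue
        a, b = R[u, common] - r_u_mean, R[v, common] - np.nanmean(R[v])
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        if denom > 0:
            sims.append(a @ b / denom)              # mean-centered cosine (Pearson-style) similarity
            devs.append(R[v, i] - np.nanmean(R[v]))
    if not sims:
        return r_u_mean
    order = np.argsort(np.abs(sims))[::-1][:k]      # keep the k most similar users
    sims, devs = np.array(sims)[order], np.array(devs)[order]
    return r_u_mean + sims @ devs / np.abs(sims).sum()

R = np.array([[5, 4, np.nan, 1], [4, 5, 4, 2], [1, 2, 5, np.nan], [np.nan, 3, 4, 5.0]])
print(predict_user_based(R, u=0, i=2))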

2.3 Item-based CF (Sarwar et al. 2001)

Amazon's pivotal innovation: use item similarity instead of user similarity — item similarity is more stable (user tastes drift, item attributes do not), and the number of items is usually far smaller than the number of users:

\[ \hat r_{ui} = \frac{\sum_{j \in N(i; u)} \mathrm{sim}(i, j) \cdot r_{uj}}{\sum_{j \in N(i; u)} |\mathrm{sim}(i, j)|} \]

where \(N(i; u)\) is the set of \(k\) items rated by \(u\) and most similar to \(i\).

2.4 Similarity Measures

| Measure | Formula | Note |
| --- | --- | --- |
| Pearson | \(\frac{\sum_i (r_{ui} - \bar r_u)(r_{vi} - \bar r_v)}{\sqrt{\sum_i (r_{ui}-\bar r_u)^2} \sqrt{\sum_i (r_{vi}-\bar r_v)^2}}\) | subtracts means automatically; handles different rating scales |
| Cosine | \(\frac{r_u^T r_v}{\lVert r_u \rVert \lVert r_v \rVert}\) | simple and fast; implicitly assumes missing = 0 (problematic) |
| Adjusted Cosine | cosine after subtracting each user's mean rating | best choice for item-based CF |
| Jaccard | \(\frac{\lvert R_u \cap R_v \rvert}{\lvert R_u \cup R_v \rvert}\), where \(R_u\) is the set of items user \(u\) interacted with | for binary / implicit feedback |
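
As a companion sketch for item-based CF (§2.3), adjusted cosine in NumPy; note that the mean being subtracted is each user's average rating, which is exactly what corrects for different rating scales (names are illustrative):

import numpy as np

def adjusted_cosine(R, i, j):
    """Adjusted cosine between items i and j on a dense R with np.nan for missing."""
    mask = ~np.isnan(R)
    user_means = np.array([row[m].mean() if m.any() else 0.0
                           for row, m in zip(R, mask)])
    co = mask[:, i] & mask[:, j]                  # users who rated both items
    if co.sum() < 2:
        return 0.0
    a = R[co, i] - user_means[co]                 # center by the *user* mean, not the item mean
    b = R[co, j] - user_means[co]
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0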

2.5 The Cold-Start Problem

CF's Achilles' heel:

| Type | Description | Mitigation |
| --- | --- | --- |
| New-user cold start | no history | onboarding ratings / demographics / fall back to content recommendation |
| New-item cold start | nobody has rated it | content recommendation / use item features as side information |
| Sparsity | the matrix is 99.9% missing | matrix factorization (learn low-dimensional representations that generalize) |

3. Matrix Factorization

CF's sparsity problem is gracefully solved by matrix factorization — approximating \(R\) by a low-rank factorization.

3.1 Truncated SVD

\[ R \approx U_k \Sigma_k V_k^T \]

But classical SVD assumes a dense matrix and cannot handle missing entries.

3.2 FunkSVD (Funk 2006, Netflix Prize)

Fits only on observed entries:

\[ \hat r_{ui} = \mu + b_u + b_i + p_u^T q_i \]

Parameters:

  • \(\mu\): global average;
  • \(b_u\): user bias ("strict / lenient");
  • \(b_i\): item bias ("popular / unpopular");
  • \(p_u, q_i \in \mathbb{R}^k\): latent factor vectors for users and items.

Objective:

\[ \min_{p, q, b} \sum_{(u,i) \in \mathcal{O}} (r_{ui} - \hat r_{ui})^2 + \lambda (\|p_u\|^2 + \|q_i\|^2 + b_u^2 + b_i^2) \]

The sum is taken only over the observed set \(\mathcal{O}\). Solving by SGD:

# assumes mu, eta, lam and the arrays b_u, b_i, P, Q are already initialized
for u, i, r in observed:                     # SGD over the observed set O
    e = r - (mu + b_u[u] + b_i[i] + P[u] @ Q[i])
    b_u[u] += eta * (e - lam * b_u[u])
    b_i[i] += eta * (e - lam * b_i[i])
    p_old = P[u].copy()                      # keep the old p_u for q_i's gradient
    P[u] += eta * (e * Q[i] - lam * P[u])
    Q[i] += eta * (e * p_old - lam * Q[i])

Funk reported a large RMSE improvement over Netflix's Cinematch baseline with this method, a phenomenal result in 2006, and matrix factorization of this flavor went on to anchor the eventual winning ensembles (Koren et al. 2009).

3.3 ALS (Alternating Least Squares)

Joint optimization of \(P, Q\) is non-convex, but fixing one makes the other a convex least-squares problem — alternate the updates:

\[ P \leftarrow \arg\min_P \|R - PQ^T\|_F^2 + \lambda \|P\|_F^2 \quad \text{(closed form)} \]
\[ Q \leftarrow \arg\min_Q \|R - PQ^T\|_F^2 + \lambda \|Q\|_F^2 \quad \text{(closed form)} \]

ALS parallelizes naturally (each row independent) and is the default recommender in Spark MLlib.
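
A minimal NumPy sketch of one pair of alternating closed-form updates on a small dense matrix (production ALS masks the missing entries or uses the implicit-feedback weighting of §4.1; names and sizes here are illustrative):

import numpy as np

def als_step(R, P, Q, lam=0.1):
    """One ALS sweep for R ≈ P Q^T: each update is a ridge-regression closed form."""
    k = P.shape[1]
    P = R @ Q @ np.linalg.inv(Q.T @ Q + lam * np.eye(k))      # fix Q, solve for P
    Q = R.T @ P @ np.linalg.inv(P.T @ P + lam * np.eye(k))    # fix P, solve for Q
    return P, Q

rng = np.random.default_rng(0)
R = rng.random((100, 50))                        # toy dense "rating" matrix
P, Q = rng.normal(size=(100, 8)), rng.normal(size=(50, 8))
for _ in range(20):
    P, Q = als_step(R, P, Q)
print(np.linalg.norm(R - P @ Q.T))               # reconstruction error shrinks over sweeps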

3.4 NMF (Non-negative Matrix Factorization)

Constrains \(P, Q \geq 0\). More interpretable (factors can be read as "topics") but less expressive than unconstrained MF.


4. Implicit Feedback

Key observation: in reality explicit ratings (1–5 stars) are scarce; implicit signals (clicks, watch time, purchases) dominate. The difficulty of implicit feedback: unobserved ≠ disliked — possibly just unseen.

4.1 WMF (Weighted Matrix Factorization, Hu et al. 2008)

Separates a binary preference matrix (entries \(p_{ui}\)) from a confidence matrix \(C\):

  • \(p_{ui} = 1\) if \(r_{ui} > 0\) (interaction occurred);
  • \(c_{ui} = 1 + \alpha r_{ui}\) (more interaction means more confidence in "liked").

Objective:

\[ \min_{X, Y} \sum_{u, i} c_{ui} \left(p_{ui} - x_u^T y_i\right)^2 + \lambda \left(\|X\|_F^2 + \|Y\|_F^2\right) \]

with \(x_u\) and \(y_i\) the user and item latent factor vectors (rows of \(X\) and \(Y\)).

Note: the sum runs over all \((u, i)\), including unobserved pairs (treated as 0 with low confidence). The cost is kept linear via the algebraic trick \(Y^T C^u Y = Y^T Y + Y^T (C^u - I) Y\) inside ALS.
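
A hedged sketch of the per-user closed-form solve this trick enables, written densely for readability (in the real algorithm \(C^u - I\) is zero on unobserved entries, so only a user's observed items contribute to the correction term; variable names are illustrative):

import numpy as np

def wmf_user_step(Rimp, Y, alpha=40.0, lam=0.1):
    """Solve every user vector x_u for a fixed item matrix Y (Hu et al. 2008 style)."""
    n_users, k = Rimp.shape[0], Y.shape[1]
    Pref = (Rimp > 0).astype(float)               # binary preference p_ui
    C = 1.0 + alpha * Rimp                        # confidence c_ui
    YtY = Y.T @ Y                                 # shared across all users
    X = np.zeros((n_users, k))
    for u in range(n_users):
        Cu = np.diag(C[u])
        A = YtY + Y.T @ (Cu - np.eye(Y.shape[0])) @ Y + lam * np.eye(k)
        b = Y.T @ Cu @ Pref[u]
        X[u] = np.linalg.solve(A, b)              # x_u = (Y^T C^u Y + λI)^{-1} Y^T C^u p(u)
    return X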

4.2 BPR (Bayesian Personalized Ranking, Rendle et al. 2009)

Revolutionary insight: don't predict scores — directly optimize the ordering. With triplets \((u, i, j)\) where \(i\) is something user \(u\) interacted with (positive) and \(j\) is not (negative), BPR assumes the user prefers \(i > j\):

\[ L_{\mathrm{BPR}} = -\sum_{(u, i, j) \in D_S} \log \sigma(\hat x_{uij}) + \lambda \|\Theta\|^2 \]

where \(\hat x_{uij} = \hat r_{ui} - \hat r_{uj}\) (any scoring model, commonly MF).

Advantages:

  • directly optimizes AUC (pairwise ranking);
  • naturally fits implicit feedback;
  • form is reminiscent of SVM hinge loss (the "margin" idea).

BPR is still the baseline of implicit recommendation today — every new method must be compared to BPR-MF.


5. Deep Recommendation

After 2014 deep learning swept through recommender systems.

5.1 NCF (Neural Collaborative Filtering, He et al. 2017)

Replaces MF's "dot product \(p_u^T q_i\)" with an MLP:

\[ \hat r_{ui} = \mathrm{MLP}\left(\mathrm{concat}(\mathbf{e}_u, \mathbf{e}_i)\right) \]

where \(\mathbf{e}_u, \mathbf{e}_i\) are ID embeddings. NeuMF further fuses GMF (generalized MF) and MLP in parallel.
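
A tiny NumPy forward pass showing the shape of the idea (random, untrained weights, an illustrative hidden size, and none of NeuMF's parallel GMF branch):

import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, d = 1000, 500, 16
E_u = rng.normal(0, 0.01, (n_users, d))          # user ID embeddings
E_i = rng.normal(0, 0.01, (n_items, d))          # item ID embeddings
W1 = rng.normal(0, 0.1, (2 * d, 64))
W2 = rng.normal(0, 0.1, (64, 1))

def ncf_score(u, i):
    """MLP over the concatenated embeddings instead of a dot product."""
    h = np.concatenate([E_u[u], E_i[i]])
    h = np.maximum(h @ W1, 0.0)                  # ReLU hidden layer
    s = (h @ W2)[0]
    return 1.0 / (1.0 + np.exp(-s))              # sigmoid score in (0, 1)

print(ncf_score(3, 42))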

Controversy: Rendle et al. (2020) showed that a well-tuned MF beats NCF, sparking the "is NCF really better, or just more tuned" reflection — the "fake progress" problem in recommender systems.

5.2 AutoRec (Sedhain et al. 2015)

Casts recommendation as an autoencoding task: input is the user's rating vector \(r_u\) (with missing entries), output is the densely reconstructed vector \(\hat r_u\).

5.3 Wide & Deep (Cheng et al. 2016)

Google Play's deployed dual-track architecture:

  • Wide part: a linear model over crossed features that memorizes explicit rules ("people who installed A install B");
  • Deep part: embeddings + MLP that generalize to unseen combinations.

Trained jointly — an industry hit. Successors: DeepFM, xDeepFM, DCN — all aiming to handle feature interactions better.

5.4 DIN (Deep Interest Network, Zhou et al. 2018)

Alibaba e-commerce setting: a user's interests are diverse (multi-peaked) and dynamic. For a candidate item \(i\), attention-aggregate the user's behavior history:

\[ \mathbf{v}_u = \sum_{j \in H_u} a(\mathbf{e}_j, \mathbf{e}_i) \cdot \mathbf{e}_j \]

with weights \(a\) produced by a small MLP. An early successful application of attention in recommendation, predating Transformer. Successors DIEN and SIM tackle longer sequence modeling.
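
A rough NumPy sketch of the attention pooling; the real DIN scores each (history item, candidate) pair with a small MLP over concatenated and interaction features, so the single linear layer below is just an illustrative stand-in:

import numpy as np

def din_pool(history, target, w):
    """Attention-weighted sum of history embeddings, conditioned on the candidate item."""
    tiled = np.repeat(target[None, :], len(history), axis=0)
    feats = np.concatenate([history, tiled, history * target], axis=1)   # per-item features
    scores = feats @ w                           # stand-in for DIN's small scoring MLP
    return (scores[:, None] * history).sum(axis=0)   # weighted sum (DIN skips the softmax)

rng = np.random.default_rng(0)
d = 8
history = rng.normal(size=(5, d))                # five past behaviors
target = rng.normal(size=d)                      # candidate item embedding
w = rng.normal(size=3 * d)
print(din_pool(history, target, w).shape)        # -> (8,)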

5.5 Two-Tower Models: the De Facto Standard for Modern Retrieval

flowchart LR
    U["User features<br/>(ID, history, demographics)"] --> UT["User tower fU"]
    I["Item features<br/>(ID, content, attributes)"] --> IT["Item tower fI"]
    UT --> ZU["User embedding zu"]
    IT --> ZI["Item embedding zi"]
    ZU & ZI --> S["Similarity zu · zi"]
    S --> P["Prediction / ranking"]
    style UT fill:#fde68a
    style IT fill:#fde68a
    style S fill:#bbf7d0

Key property: user and item embeddings are computed in fully decoupled ways —

  • offline: batch-compute all item embeddings and store them in an ANN index;
  • online: when a user arrives, compute the user embedding in real time and run a k-NN query.

Per-query lookup latency can drop into the sub-millisecond range; this is industrial-grade retrieval. Representative papers: YouTube DNN (Covington et al. 2016) and the sampling-bias-corrected two-tower model (Yi et al. 2019).

Training: typically with in-batch negatives + InfoNCE (identical to contrastive learning):

\[ L = -\log \frac{\exp(z_u^T z_{i^+} / \tau)}{\sum_{j \in B} \exp(z_u^T z_j / \tau)} \]

Contrastive learning = metric learning = two-tower retrieval = modern RAG retriever training — all are essentially the same objective.
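
A minimal NumPy version of this in-batch loss, assuming row b of Zi is the positive item for row b of Zu and that both matrices are already L2-normalized (names are illustrative):

import numpy as np

def in_batch_infonce(Zu, Zi, tau=0.07):
    """Zu, Zi: (B, d) normalized user / item embeddings; positives sit on the diagonal."""
    logits = Zu @ Zi.T / tau                               # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)            # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_softmax)))

rng = np.random.default_rng(0)
Zu = rng.normal(size=(32, 64))
Zu /= np.linalg.norm(Zu, axis=1, keepdims=True)
Zi = rng.normal(size=(32, 64))
Zi /= np.linalg.norm(Zi, axis=1, keepdims=True)
print(in_batch_infonce(Zu, Zi))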

5.6 Ranking vs. Retrieval

Industrial recommendation has two stages:

| Stage | Task | Candidate scale | Mainstream methods |
| --- | --- | --- | --- |
| Retrieval | from billions, select thousands | \(10^9 \to 10^3\) | two-tower + ANN |
| Ranking | precise scoring of the retrieved thousands | \(10^3 \to 10^1\) | DIN / DCN / GBDT |

Two-tower cannot be used for ranking — ranking demands fine-grained crossed features that decoupled towers cannot express.


6. Graph-based Recommendation

User–item interactions form a bipartite graph naturally, so GNNs apply.

6.1 PinSage (Ying et al. 2018)

Pinterest's industrial GraphSAGE on a 3-billion-node graph:

  • Neighbor sampling: random walks compute importance;
  • Convolution: aggregate weighted neighbor embeddings into the central node;
  • MapReduce-style offline inference to generate embeddings for all nodes.

PinSage serves 200 million MAUs and is the iconic GNN production case.

6.2 LightGCN (He et al. 2020)

A radical simplification: ablate GCN's feature transformation and non-linear activation, keeping only neighbor aggregation:

\[ \mathbf{e}_u^{(k+1)} = \sum_{i \in N(u)} \frac{1}{\sqrt{|N(u)| |N(i)|}} \mathbf{e}_i^{(k)} \]

The final embedding is a weighted average across layers. Striking finding: dropping non-linearity actually helps — the features in recommendation are already ID embeddings, so non-linear transforms are unnecessary. LightGCN is a strong baseline beyond BPR-MF.
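
One propagation layer in NumPy on a toy dense interaction matrix (in practice the graph is sparse and the embeddings are trained with BPR on top of the layer-averaged outputs; names are illustrative):

import numpy as np

def lightgcn_layer(R, Eu, Ei):
    """Degree-normalized neighbor aggregation, with no feature transform and no non-linearity."""
    du = R.sum(axis=1, keepdims=True).clip(min=1)     # |N(u)| for each user
    di = R.sum(axis=0, keepdims=True).clip(min=1)     # |N(i)| for each item
    A = R / np.sqrt(du) / np.sqrt(di)                 # 1 / sqrt(|N(u)| |N(i)|) weights
    return A @ Ei, A.T @ Eu                           # new user / item embeddings

rng = np.random.default_rng(0)
R = (rng.random((100, 50)) < 0.05).astype(float)      # toy binary interaction matrix
Eu, Ei = rng.normal(size=(100, 16)), rng.normal(size=(50, 16))
layers_u, layers_i = [Eu], [Ei]
for _ in range(3):
    Eu, Ei = lightgcn_layer(R, Eu, Ei)
    layers_u.append(Eu)
    layers_i.append(Ei)
final_u = np.mean(layers_u, axis=0)                   # layer-averaged final embeddings
final_i = np.mean(layers_i, axis=0)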


7. Case-Based Reasoning (CBR)

CBR is the symbolic-AI era's "proto-RAG" — to understand CBR is to understand the conceptual roots of RAG.

7.1 Aamodt & Plaza 1994: The 4R Cycle

flowchart LR
    P["New problem"] --> R1["1. Retrieve<br/>find similar cases"]
    R1 --> R2["2. Reuse<br/>reuse the case's solution"]
    R2 --> R3["3. Revise<br/>adapt to the new problem"]
    R3 --> R4["4. Retain<br/>store the new case"]
    R4 --> CB[("Case Base")]
    CB --> R1
    style R1 fill:#fde68a
    style R4 fill:#bbf7d0

| Stage | Key questions |
| --- | --- |
| Retrieve | what similarity? how to retrieve efficiently? |
| Reuse | reuse the solution directly, or extract experience from it? |
| Revise | adapt the solution with domain knowledge (manual / rules / models) |
| Retain | what to keep (to avoid case-base bloat)? how to organize it? |

7.2 Case Base Structures

  • Flat case base: cases stored linearly, depending on a good index (k-d tree, inverted index);
  • Hierarchical case base: organized by abstraction level (e.g. medicine: organ → system → symptom → diagnosis);
  • Prototypes + instances: each prototype represents a category, instances hang under the prototype.

7.3 Classical CBR Systems

| System | Domain | Era | Notes |
| --- | --- | --- | --- |
| CASEY (Koton 1989) | cardiology diagnosis | 1980s | hybrid with MYCIN-style rule systems |
| HYPO (Ashley 1990) | legal reasoning | 1990s | trade-secret cases; introduced the "factor" notion |
| CHEF (Hammond 1986) | cooking planning | 1980s | recipe adaptation |
| Customer-service helpdesks | ticket matching | 1990s+ | still in use: "find similar past tickets" |

7.4 CBR vs. ML

CBR has long been overlooked by the ML mainstream, but its paradigm is robust:

| Dimension | Traditional ML | CBR |
| --- | --- | --- |
| Knowledge form | parametric model | cases + similarity |
| Learning cost | high (retraining) | low (incrementally add cases) |
| Interpretability | usually poor | "because of the similar case X" |
| Adapting to new data | retraining needed | just Retain |

Modern view: the Retrieve-Reuse-Revise-Retain cycle of CBR is reborn in the RAG / Agent era — using LLMs as the Reuse + Revise engine and a vector database as the case base yields the "LLM agent + memory" paradigm.


8. Modern Retrieval: the Bridge from the Analogizers to the Era of Large Models

Cross-page deep dive: ../../../AI_Agents/04_Memory_Systems/长期记忆与向量数据库.md covers the role of vector databases in agent memory. This section focuses on the retrieval algorithms themselves, which that page builds on.

8.1 Retrieval Paradigms: dense vs. sparse

| Paradigm | Representation | Similarity | Representatives |
| --- | --- | --- | --- |
| Sparse | high-dim sparse (bag of words) | BM25, TF-IDF | Lucene, Elasticsearch |
| Dense | low-dim dense embedding | cosine, dot product | DPR, ColBERT, Sentence-BERT |
| Hybrid | both combined | weighted fusion / RRF | Vespa, Weaviate |

BM25 (1994)

The classical sparse retrieval method:

\[ \mathrm{BM25}(q, d) = \sum_{t \in q} \mathrm{IDF}(t) \cdot \frac{\mathrm{tf}(t, d) (k_1 + 1)}{\mathrm{tf}(t, d) + k_1 \left(1 - b + b \frac{|d|}{\mathrm{avgdl}}\right)} \]

with \(k_1 \approx 1.2 \sim 2.0\) and \(b \approx 0.75\). Still one of the strongest baselines — many papers tout fancy dense retrieval, only to find BM25 trailing by just a few points.
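
A self-contained sketch of this scoring function over a toy tokenized corpus; the IDF uses the common "+1" smoothing, and a real system would of course use Lucene / Elasticsearch rather than this loop:

import math
from collections import Counter

def bm25_score(query, doc, corpus, k1=1.5, b=0.75):
    """Score one document for a query; query/doc are token lists, corpus is a list of token lists."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc)
    score = 0.0
    for t in set(query):
        df = sum(1 for d in corpus if t in d)                 # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1.0)     # smoothed IDF
        denom = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf[t] * (k1 + 1) / denom
    return score

corpus = [["analogy", "kernel", "svm"],
          ["vector", "retrieval", "hnsw"],
          ["retrieval", "augmented", "generation"]]
print(bm25_score(["retrieval", "kernel"], corpus[1], corpus))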

DPR (Dense Passage Retrieval, Karpukhin et al. 2020)

Dual BERT towers trained for QA retrieval:

  • query encoder + passage encoder (no shared weights);
  • in-batch negatives + InfoNCE;
  • after training, passages are encoded offline into FAISS.

DPR beats BM25 on Natural Questions, opening the era of dense retrieval.

8.2 Comparison of ANN Algorithms

When the vector count reaches the billions, exact k-NN is infeasible. ANN (Approximate Nearest Neighbor) trades accuracy for speed.

| Algorithm | Idea | Accuracy | Speed | Memory | Build cost | Use case |
| --- | --- | --- | --- | --- | --- | --- |
| Brute-force (Flat) | compare one by one | 100% | slow (\(O(N)\)) | \(O(N)\) | \(O(1)\) | \(N < 10^6\) or ground truth |
| IVF (Inverted File) | cluster (\(k\)-means) then invert | 90–99% | fast | \(O(N)\) | \(O(Nk)\) | large data, coarse + fine ranking |
| PQ (Product Quantization) | sub-vector quantization compression | medium | very fast | very small | medium | memory-constrained settings |
| IVF-PQ | IVF + PQ combined | medium-high | very fast | very small | medium | FAISS default for large scale |
| HNSW | hierarchical small-world graph | very high | very fast | large (\(O(N \log N)\) edges) | slow | the near-universal standard choice |
| LSH | locality-sensitive hashing | medium | fast | medium | fast | historical method, surpassed by HNSW |
| ScaNN (Google) | IVF + anisotropic quantization | high | very fast | small | medium | Google's internal workhorse |
| Annoy (Spotify) | random-projection forest | medium | fast | medium | fast | static indexes |

Industrial decision tree:

  • Data < 1M, accuracy paramount → Flat (FAISS IndexFlatIP);
  • 1M – 100M, memory sufficient → HNSW (default choice);
  • 100M+, memory tight → IVF-PQ;
  • need online insertion → HNSW (IVF doesn't update easily).
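
Following this decision tree, a hedged FAISS sketch of the IVF-PQ route; the parameter values (nlist, m, nprobe) are illustrative and need to be tuned to the data and recall target:

import numpy as np
import faiss

d, nlist, m = 128, 1024, 16                      # dim, coarse clusters, PQ sub-vectors
quantizer = faiss.IndexFlatL2(d)                 # coarse quantizer for the IVF layer
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)   # 8 bits per sub-vector code

xb = np.random.randn(200_000, d).astype("float32")
index.train(xb)                                  # IVF k-means + PQ codebooks need a training pass
index.add(xb)

index.nprobe = 16                                # clusters visited per query (recall/speed knob)
D, I = index.search(xb[:5], 10)                  # top-10 neighbors for 5 queries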

8.3 HNSW (Hierarchical Navigable Small World) in Detail

Malkov & Yashunin's 2016 HNSW is the "textbook ANN choice" — near-100% accuracy, sub-millisecond queries, friendly to incremental insertion.

Core idea: organize data points as a multi-layer graph, where each layer is a small-world graph (nearby neighbors plus a few long-range links); the upper layers are sparse (for quick jumps) and the lowest layer is dense (for fine-grained search).

flowchart TD
    subgraph L2["Layer 2 (sparse, long jumps)"]
      A2["A"] -.- F2["F"]
    end
    subgraph L1["Layer 1 (medium density)"]
      A1["A"] --- B1["B"] --- F1["F"]
      B1 --- D1["D"]
    end
    subgraph L0["Layer 0 (all points, dense)"]
      A0["A"] --- B0["B"] --- C0["C"] --- D0["D"]
      D0 --- E0["E"] --- F0["F"]
      C0 --- E0
    end
    A2 -.-> A1 -.-> A0
    F2 -.-> F1 -.-> F0

Search algorithm:

HNSW search query q:
  current ← entry point (top layer)
  for level from top down to 1:
    greedy search to the nearest point of q at this layer → current
  on layer 0, run a best-first beam search bounded by ef-search
  return top-k

Insertion:

  1. Randomly choose a level \(\ell\) (exponential distribution, \(\ell = \lfloor -\ln(r) \cdot m_L \rfloor\));
  2. greedily search from the top down to layer \(\ell + 1\);
  3. from layer \(\ell\) down to 0, connect to the \(M\) nearest neighbors at each level (using a heuristic that preserves diversity).

Key parameters:

  • \(M\): maximum links per node (typically 16–48);
  • ef-construction: search width during construction (typically 200);
  • ef-search: search width during query (the accuracy–speed knob; typically 50–200).

Distance functions: HNSW does not fix a metric; common choices include L2, inner product, and cosine — as long as the triangle inequality holds or approximately holds (in practice it often works even for non-metric distances).

8.4 RAG = the Renaissance of the Analogizers

flowchart LR
    Q["User query"] --> EM["Embedding model<br/>(metric-learning trained)"]
    EM --> QV["Query vector"]
    QV --> ANN["ANN index<br/>(HNSW/FAISS)"]
    DB[("Document base<br/>(pre-embedded)")] --> ANN
    ANN --> TOPK["Top-K documents"]
    TOPK --> LLM["LLM<br/>(in-context reasoning)"]
    Q --> LLM
    LLM --> ANS["Answer + citations"]
    style ANN fill:#fde68a
    style LLM fill:#bbf7d0

RAG (Retrieval-Augmented Generation, Lewis et al. 2020) feeds the top-k retrieved documents as LLM context, so that the LLM answers "having read the relevant material".

RAG vs. classical CBR side by side:

| Dimension | Classical CBR (Aamodt 1994) | RAG (2020+) |
| --- | --- | --- |
| Case base | structured cases | documents / passages (chunks) |
| Retrieval | hand-crafted similarity (feature matching) | embedding + ANN |
| Reuse engine | rule system / human | LLM (Transformer) |
| Revise | manual editing | LLM in-context |
| Retain | append to the case base | append to the document base (incremental indexing) |
| Evaluation | case-hit rate | Recall@K + end-to-end QA accuracy |

Tribal viewpoint: RAG is not a "patch" on the LLM — it is the return of the Analogizers against parametric methods. The LLM paradigm of compressing knowledge into weights has fundamental limits (hallucination, staleness, no provenance), and the Analogizers' "case base + retrieval" is reborn in the era of large models under the name RAG.
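
To make the flowchart above concrete, a deliberately minimal sketch of the loop: embed below is a toy hashed bag-of-words stand-in for a real embedding model, and llm_generate is a hypothetical LLM client, not an API from this page:

import zlib
import numpy as np
import faiss

def embed(texts, d=256):
    """Toy stand-in for a metric-learned embedding model: hashed bag of words, unit-normalized."""
    vecs = np.zeros((len(texts), d), dtype="float32")
    for r, t in enumerate(texts):
        for tok in t.lower().split():
            vecs[r, zlib.crc32(tok.encode()) % d] += 1.0
    return vecs / np.maximum(np.linalg.norm(vecs, axis=1, keepdims=True), 1e-9)

def llm_generate(prompt):
    """Hypothetical LLM call; plug in whatever client you actually use."""
    raise NotImplementedError

def build_index(docs):
    vecs = embed(docs)
    index = faiss.IndexFlatIP(vecs.shape[1])     # inner product = cosine on unit vectors
    index.add(vecs)
    return index

def rag_answer(query, docs, index, k=4):
    _, idx = index.search(embed([query]), k)     # Retrieve
    context = "\n\n".join(docs[i] for i in idx[0])
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
    return llm_generate(prompt)                  # Reuse + Revise happen inside the LLM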

8.5 Retrieval-Augmented Training

Push retrieval from "inference time" back into "training time":

RETRO (Borgeaud et al. 2022, DeepMind)

Idea: when training the LLM, every chunk retrieves top-k neighbors from a 2-trillion-token database, and these are injected via cross-attention into the Transformer.

  • Effect: 7.5B parameters + retrieval ≈ 175B parameters without retrieval;
  • Significance: a parameter-efficiency revolution — knowledge no longer has to be stuffed entirely into weights.

Atlas (Izacard et al. 2022, Meta)

End-to-end training of retrieval + generation for knowledge-intensive NLP tasks (QA, fact checking). Few-shot performance surpasses 50× larger PaLM.

REPLUG (Shi et al. 2023)

Even lighter: train only the retriever to "cooperate" with a frozen black-box LLM. The LLM's outputs serve as supervision for the retriever — the retriever learns to "pick documents most useful to the LLM".


9. Engineering Practice: Choosing a Vector Database

9.1 Mainstream Vector Databases Compared

| DB | Type | Indexes | Persistence | Metadata filtering | Distributed | Best for |
| --- | --- | --- | --- | --- | --- | --- |
| FAISS | library | full Flat/IVF/PQ/HNSW family | DIY | weak | none | offline, research, advanced users |
| Milvus | service | HNSW/IVF/DiskANN | strong | strong | strong (cloud-native) | large-scale production |
| Qdrant | service (Rust) | HNSW | strong | strong (payload filters) | clustering | medium scale, filter-heavy workloads |
| Weaviate | service | HNSW | strong | strong (GraphQL) | clustering | multi-modal, schema-driven data |
| pgvector | Postgres extension | IVF/HNSW | reuses PG | SQL | reuses PG | existing PG stack, medium scale |
| Pinecone | SaaS | proprietary | hosted | strong | hosted | teams that want zero ops |
| Chroma | embedded | HNSW | file | medium | weak | prototypes, small apps |

Selection guide:

  • prototype / small RAG → Chroma or FAISS in-memory;
  • existing PG / scale < 10M → pgvector;
  • need filtering + medium scale → Qdrant;
  • large scale / multi-tenant → Milvus or Pinecone (the former self-hosted, the latter managed).

9.2 The Latency Budget for Real-Time Retrieval

The end-to-end budget for industrial search / recommendation is roughly 100ms. Allocation by stage:

| Stage | Latency budget | Notes |
| --- | --- | --- |
| Network + deserialization | 5–10 ms | gRPC + binary protocol |
| Query encoder (embedding model) | 10–30 ms | model distillation / quantization is key |
| ANN retrieval (top-1000) | 5–20 ms | HNSW stays < 20 ms even at the billion scale |
| Metadata filtering | 1–5 ms | inverted index |
| Ranking (deep model) | 30–50 ms | the dominant cost |
| Post-processing / diversity | 5–10 ms | re-rank |

Engineering pain point: embedding-model latency. Common remedies: model distillation (a small two-tower model), ONNX Runtime / TensorRT inference acceleration, INT8 quantization, caching of common queries.

9.3 Index Update Strategies

The "freshness" challenge of vector databases:

| Strategy | Description | Pros / cons |
| --- | --- | --- |
| Full rebuild | rebuild the index periodically offline | simple; not real-time |
| Incremental insertion | HNSW supports online add | real-time; graph quality degrades over time |
| Dual-index swap | hot index (online, incremental) + cold index (offline, full), swapped periodically | mainstream in industry; high ops cost |
| Delta index | main (cold) index + a small hot delta index, merged at query time | Milvus / Vespa style |

10. Formula Cheatsheet

10.1 BPR Loss

\[ L_{\mathrm{BPR}} = -\sum_{(u, i, j)} \log \sigma\left(\hat r_{ui} - \hat r_{uj}\right) + \lambda \|\Theta\|^2 \]

10.2 FunkSVD Objective

\[ \min_{p, q, b} \sum_{(u,i) \in \mathcal{O}} (r_{ui} - \mu - b_u - b_i - p_u^T q_i)^2 + \lambda \cdot \mathrm{Reg} \]

10.3 Two-Tower In-Batch InfoNCE

\[ L = -\frac{1}{|B|} \sum_{(u, i) \in B} \log \frac{\exp(z_u^T z_i / \tau)}{\sum_{j \in B} \exp(z_u^T z_j / \tau)} \]

10.4 HNSW Distances (any metric; three common ones)

\[ d_2(x, y) = \|x - y\|_2, \quad d_{\text{IP}}(x, y) = -x^T y, \quad d_{\cos}(x, y) = 1 - \frac{x^T y}{\|x\|\|y\|} \]

Code Example: FAISS HNSW Index + Query

import numpy as np
import faiss

d = 128                            # embedding dimension
N = 1_000_000                      # number of vectors

# 1. Build the HNSW index
index = faiss.IndexHNSWFlat(d, 32)     # M = 32 neighbors per node (default metric: L2)
index.hnsw.efConstruction = 200
index.hnsw.efSearch = 64

# 2. Insert vectors (already unit-normalized embeddings)
embeddings = np.random.randn(N, d).astype("float32")
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)
index.add(embeddings)

# 3. Query
query = np.random.randn(1, d).astype("float32")
query /= np.linalg.norm(query)
D, I = index.search(query, 10)         # top-10 nearest neighbors
print("top-10 distances:", D[0])
print("top-10 indices:", I[0])

Code Example: BPR-MF in NumPy

import numpy as np

def bpr_mf(triples, U, I, k=32, lr=0.05, lam=0.01, epochs=20):
    """
    triples: list of (u, i, j); u prefers i over j
    U, I: total numbers of users and items
    """
    P = np.random.randn(U, k) * 0.01
    Q = np.random.randn(I, k) * 0.01
    for ep in range(epochs):
        np.random.shuffle(triples)
        for (u, i, j) in triples:
            x_uij = P[u] @ (Q[i] - Q[j])
            sig = 1.0 / (1.0 + np.exp(x_uij))        # σ(-x_uij) = 1 - σ(x_uij)
            pu = P[u].copy()                          # cache so the Q updates use the old p_u
            P[u] += lr * (sig * (Q[i] - Q[j]) - lam * P[u])
            Q[i] += lr * (sig * pu - lam * Q[i])
            Q[j] += lr * (-sig * pu - lam * Q[j])
    return P, Q

Cross-references

  • The Analogizers as a tribe: this page, §1 (the tribe entry point)
  • Metric learning (how the embedding is trained): Metric Learning and Contrastive Learning
  • The mathematical basis of kernels and k-NN: Nearest Neighbors and Kernel Methods
  • Vector-database engineering for agent long-term memory: ../../../AI_Agents/04_Memory_Systems/长期记忆与向量数据库.md

References

  1. Goldberg, D., et al. (1992). Using collaborative filtering to weave an information tapestry. CACM. — The earliest proposal of collaborative filtering.
  2. Resnick, P., et al. (1994). GroupLens: An open architecture for collaborative filtering of netnews. CSCW. — User-based CF.
  3. Sarwar, B., et al. (2001). Item-based collaborative filtering recommendation algorithms. WWW. — Item-based CF (the heart of Amazon).
  4. Funk, S. (2006). Netflix Update: Try this at home. — The original blog post for FunkSVD.
  5. Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix factorization techniques for recommender systems. Computer. — A modern survey of MF.
  6. Hu, Y., Koren, Y., & Volinsky, C. (2008). Collaborative filtering for implicit feedback datasets. ICDM. — WMF.
  7. Rendle, S., et al. (2009). BPR: Bayesian Personalized Ranking from implicit feedback. UAI. — BPR.
  8. He, X., et al. (2017). Neural collaborative filtering. WWW. — NCF.
  9. Cheng, H.-T., et al. (2016). Wide & Deep learning for recommender systems. DLRS@RecSys. — Wide & Deep.
  10. Zhou, G., et al. (2018). Deep Interest Network for click-through rate prediction. KDD. — DIN.
  11. Covington, P., Adams, J., & Sargin, E. (2016). Deep neural networks for YouTube recommendations. RecSys. — YouTube two-tower retrieval.
  12. Ying, R., et al. (2018). Graph convolutional neural networks for web-scale recommender systems. KDD. — PinSage.
  13. He, X., et al. (2020). LightGCN: Simplifying and powering graph convolution network for recommendation. SIGIR. — LightGCN.
  14. Aamodt, A., & Plaza, E. (1994). Case-based reasoning: Foundational issues, methodological variations, and system approaches. AI Communications. — The foundation of the CBR 4R cycle.
  15. Robertson, S., & Zaragoza, H. (2009). The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval.
  16. Karpukhin, V., et al. (2020). Dense Passage Retrieval for open-domain question answering. EMNLP. — DPR.
  17. Malkov, Y. A., & Yashunin, D. A. (2016). Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. TPAMI. — HNSW.
  18. Johnson, J., Douze, M., & Jégou, H. (2019). Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3), 535–547. — The FAISS paper.
  19. Lewis, P., et al. (2020). Retrieval-Augmented Generation for knowledge-intensive NLP tasks. NeurIPS. — RAG.
  20. Borgeaud, S., et al. (2022). Improving language models by retrieving from trillions of tokens. ICML. — RETRO.
  21. Izacard, G., et al. (2022). Atlas: Few-shot learning with retrieval augmented language models. arXiv:2208.03299. — Atlas.
  22. Shi, W., et al. (2023). REPLUG: Retrieval-augmented black-box language models. arXiv:2301.12652.
  23. Rendle, S., et al. (2020). Neural collaborative filtering vs. matrix factorization revisited. RecSys. — The NCF vs. MF controversy.
