Analogizers
The tribe in one sentence: similar inputs should have similar outputs. The essence of learning is finding a good metric for "similar".
The Analogizers are one of the five tribes of machine learning identified by Pedro Domingos in The Master Algorithm. The tribe's core creed is plain yet profound: to predict a new sample, look for past samples that "look like it" and reuse their answers. From the venerable k-Nearest Neighbors (k-NN, 1950s), through the Support Vector Machines (SVMs) that ruled theoretical ML in the 1990s, to the CLIP / RAG ecosystem that swept industry after 2020, the Analogizers have woven the "metric + retrieval" paradigm through every decade of ML history.
The Analogizers' fundamental disagreement with the other tribes is this: they do not explicitly model the data-generating process (against the Bayesians), do not extract general rules (against the Symbolists), do not rely on differentiable parameterization (against the early Connectionists), and do not rely on population search (against the Evolutionaries). They reduce "learning" to "similarity": give me a good kernel \(K(x, x')\) and I can predict.
Tribe Profile
| Dimension | Analogizers |
|---|---|
| Ontology | The world is made of concrete instances; universal laws are secondary |
| Epistemology | Knowledge = case base + similarity metric; reasoning = retrieval + analogical extrapolation |
| Master Algorithm | Support Vector Machines / Kernel Methods |
| Core question | How to measure similarity? |
| Evaluator | Margin, contrastive loss, Recall@K |
| Optimizer | Quadratic programming (QP) / SMO / stochastic gradient descent (SGD) |
| Representative methods | k-NN, SVM, kernel PCA, kernel ridge regression, LMNN, FaceNet, SimCLR, CLIP |
| Modern branches | Metric learning, contrastive learning, ANN retrieval, RAG |
| Achilles' heel | Curse of dimensionality, similarity itself requires a prior, retrieval cost |
| Key figures | Vladimir Vapnik, Bernhard Schölkopf, Agnar Aamodt (CBR), Yann LeCun (Siamese), Geoffrey Hinton (t-SNE) |
Master Equation
Almost every method of the Analogizers can be written in the same form — a weighted sum of similarities to training samples:

$$\hat y = f\left(\sum_{i=1}^{N} \alpha_i \, K(x, x_i)\right)$$

where:
- \(K(x, x_i)\) is the similarity kernel: how alike the new sample \(x\) is to training sample \(x_i\);
- \(\alpha_i\) is each training sample's contribution weight to the prediction (a 0/1 indicator in k-NN, a dual variable in SVM, a user/item preference in CF);
- \(f\) is the output function (sign / softmax for classification, identity for regression).
The unifying power of this framework is striking:
| Method | \(K(x, x_i)\) | \(\alpha_i\) |
|---|---|---|
| k-NN | \(\mathbb{1}[x_i \in \mathrm{kNN}(x)]\) | \(y_i\) |
| Nadaraya-Watson regression | RBF kernel \(\exp(-\|x-x_i\|^2/2\sigma^2)\) | \(y_i / \sum_j K_j\) |
| SVM | any Mercer kernel | Lagrange multipliers \(\alpha_i y_i\) (only support vectors are non-zero) |
| Kernel ridge regression | any Mercer kernel | \((K + \lambda I)^{-1} y\) |
| Collaborative filtering | user similarity \(\mathrm{sim}(u, u_i)\) | \(r_{u_i, j}\) (other users' ratings on item \(j\)) |
| RAG | \(\cos(\mathbf{e}_q, \mathbf{e}_{d_i})\) | documents retrieved into the LLM context |
Unified perspective: from k-NN to RAG, sixty years of methodological evolution have just been answering two questions — how to define \(K\) (how to measure similarity?) and how to define \(\alpha\) (which cases are worth borrowing from?).
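The master form is concrete enough to run. A minimal NumPy sketch (toy 1-D data of our own choosing) writes k-NN classification as \(\hat y = f(\sum_i \alpha_i K(x, x_i))\), with an indicator kernel and \(\alpha_i = y_i\):

```python
import numpy as np

# Toy 1-D training set (illustration only)
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 0.0, 1.0, 1.0, 1.0])

def knn_predict(x, k=3):
    """k-NN in the master form: K = neighbor indicator, alpha_i = y_i."""
    d = np.abs(X - x)            # distances to all training points
    nn = np.argsort(d)[:k]       # the k nearest neighbors
    K = np.zeros_like(y)
    K[nn] = 1.0                  # K(x, x_i) = 1[x_i in kNN(x)]
    vote = (y * K).sum() / k     # weighted sum of similarities
    return float(vote > 0.5)     # f = threshold

print(knn_predict(3.5), knn_predict(0.5))  # 1.0 0.0
```

Swapping the indicator kernel for an RBF and \(\alpha\) for dual variables turns the same three lines into Nadaraya-Watson regression or an SVM decision function.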
Lineage
```mermaid
flowchart TD
A["k-NN (Cover & Hart 1967)"] --> B["Kernel methods / Parzen window"]
B --> C["SVM (Boser & Vapnik 1992)"]
C --> D["Kernel PCA / Kernel ridge regression"]
A --> E["Case-Based Reasoning CBR (Aamodt 1994)"]
A --> F["Collaborative Filtering CF (GroupLens 1994)"]
F --> G["Matrix Factorization FunkSVD (Netflix 2006)"]
G --> H["BPR / NCF / Two-tower models"]
A --> I["Metric learning LMNN/ITML (2007-2009)"]
I --> J["Siamese / Triplet (FaceNet 2015)"]
J --> K["Contrastive learning SimCLR / MoCo (2020)"]
K --> L["Cross-modal CLIP (2021)"]
L --> M["RAG / RETRO / Atlas (2022+)"]
E --> M
H --> N["ANN retrieval FAISS / HNSW"]
N --> M
style C fill:#fde68a
style L fill:#bbf7d0
style M fill:#bbf7d0
```
Three main streams:
- Kernel methods stream: k-NN → Parzen window → SVM → kernel PCA / GP, unified in the RKHS framework.
- Retrieval stream: k-NN → CBR → CF → matrix factorization → ANN → RAG, focused on "how to find similar cases efficiently".
- Metric learning stream: LMNN → Siamese → Triplet → InfoNCE → CLIP, focused on "how to make a neural network learn a good similarity".
After 2020 the three streams were unified: deep models learn embeddings (metric learning), ANN indexes store embeddings (retrieval), and LLMs reason with retrieved results (RAG).
Modern Renaissance
The Analogizers have suffered two declines:
- First decline (after backprop was rediscovered in 1986): the Connectionists rose, and k-NN was dismissed as "primitive".
- Second decline (after AlexNet in 2012): deep learning dominated vision and NLP, and SVMs all but vanished from mainstream papers.
Yet the Analogizers never died, and after 2020 they made a dramatic comeback, for three reasons:
- Contrastive learning became the self-supervised mainstream: SimCLR / MoCo / DINO train with contrastive losses, which are essentially metric learning on large unlabeled corpora. "Learning a good embedding" became the central goal of foundation models.
- CLIP broke the modality wall: the joint embedding learned from image-text contrastive training enabled zero-shot classification and semantic search as new paradigms. CLIP is the most successful Analogizer system to date.
- RAG replaced pure parametric memory: parameterizing knowledge inside LLM weights proved limited (hallucination, staleness, no provenance). RAG plugs an external knowledge base + vector retrieval back into the LLM — this is Aamodt's 1994 CBR reincarnated for the GPT era.
From obscure k-NN to the RAG era: a 2010 grad student would say "doing k-NN? too naive". A 2025 grad student says "our LLM agent uses HNSW to retrieve 1B documents for in-context learning" — same thing.
Division of Labour with Existing Pages
The three sub-pages in this directory expand on the three pillars of the Analogizers:
- Nearest Neighbors and Kernel Methods: k-NN, distance metrics, the kernel trick, full SVM derivation, the Representer theorem, the GP perspective.
- Metric Learning and Contrastive Learning: LMNN/ITML, Siamese/Triplet, InfoNCE, SimCLR/MoCo/DINO, CLIP/SigLIP.
- Recommender Systems and Modern Retrieval: CF/MF, deep recommenders, CBR, ANN indexes, RAG engineering practice.
Coordination with other pages:
- ../../03_Machine_Learning/supervised.md introduces the k-NN/SVM interface from a "supervised algorithms catalog" angle. This directory organizes them by tribe philosophy and lineage.
- ../../03_Machine_Learning/kernel_methods.md covers kernel basics. The Nearest Neighbors and Kernel Methods page in this directory adds: a proof of the Representer theorem, the SVM dual derivation, the GP connection, and the SMO algorithm.
- The GP section in ../../03_Machine_Learning/贝叶斯学习.md and the kernel methods section here mirror each other (Bayesian view vs. Analogizer view).
Sub-page Navigation
| Sub-page | Core question | Keywords |
|---|---|---|
| Nearest Neighbors and Kernel Methods | How to define "similar"? How to turn similarity into a predictor? | k-NN, KD-Tree, Mercer condition, SVM dual, KKT, RKHS, Representer, kernel PCA, GP |
| Metric Learning and Contrastive Learning | How to learn good similarity? | LMNN, Siamese, Triplet, Contrastive, InfoNCE, SimCLR, MoCo, DINO, CLIP |
| Recommender Systems and Modern Retrieval | How to retrieve similar cases efficiently? | CF, FunkSVD, BPR, NCF, two-tower, CBR, FAISS, HNSW, RAG |
References
- Domingos, P. (2015). The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World. Basic Books. — The original source of the five-tribes framework; Chapter 7 is devoted to the Analogizers.
- Cover, T. M., & Hart, P. E. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), 21–27. — The foundational k-NN paper, proving that the 1-NN error is at most twice the Bayes error.
- Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer. — VC dimension and the SVM theoretical framework.
- Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. COLT. — The pioneering paper of SVMs and the kernel trick.
- Schölkopf, B., & Smola, A. J. (2002). Learning with Kernels. MIT Press. — The standard textbook on kernel methods, the Representer theorem, and RKHS.
- Aamodt, A., & Plaza, E. (1994). Case-based reasoning: Foundational issues, methodological variations, and system approaches. AI Communications, 7(1), 39–59. — The foundational paper on the CBR 4R cycle.
- Radford, A., et al. (2021). Learning transferable visual models from natural language supervision. ICML. — CLIP, the landmark of the Analogizers' modern renaissance.
- Lewis, P., et al. (2020). Retrieval-Augmented Generation for knowledge-intensive NLP tasks. NeurIPS. — The RAG paper, plugging retrieval back into parametric models.
Nearest Neighbors and Kernel Methods
Scope of this page: lay out the "skeleton" of the Analogizers in full — from the most naive k-NN to the unifying framework of kernel methods / SVMs. The emphasis is on three threads: mathematical derivation (SVM dual, KKT, the Representer theorem), geometric intuition (the kernel trick = implicit lifting to higher dimensions), and engineering acceleration (KD-Tree / LSH).
The oldest method of the Analogizers is k-Nearest Neighbors (k-NN): to predict a new sample, find its \(k\) nearest neighbors and let the neighbors' labels vote. Proposed in the 1950s and given an error bound by Cover & Hart in 1967, this method remains the basic paradigm behind virtually every vector retrieval system today.
But k-NN has two fatal problems:
- What does "nearest" mean? — the distance metric directly determines the result, and naive Euclidean distance breaks down in high dimensions.
- How do we combine similarities into a "decision boundary"? — k-NN's boundary is patched together pointwise, with poor theoretical properties.
Kernel methods and SVMs answered both questions in the 1990s, elevating "similarity learning" to a paradigm with a complete mathematical foundation.
1. The k-NN Framework in Full
1.1 Algorithm Definition
Given a training set \(\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N\), a distance metric \(d\), and a new sample \(x\):
- Compute the distances from \(x\) to all training points \(\{d(x, x_i)\}_{i=1}^N\).
- Find the \(k\) nearest neighbors \(\mathrm{kNN}(x)\).
- Classification: \(\hat y = \mathrm{mode}\{y_i : x_i \in \mathrm{kNN}(x)\}\) (majority vote).
- Regression: \(\hat y = \frac{1}{k}\sum_{x_i \in \mathrm{kNN}(x)} y_i\) (average).
1.2 Weighted k-NN
Naive k-NN treats the \(k\) neighbors equally, but clearly closer neighbors should contribute more. The weighted version:

$$\hat y(x) = \frac{\sum_{x_i \in \mathrm{kNN}(x)} K(x, x_i)\, y_i}{\sum_{x_j \in \mathrm{kNN}(x)} K(x, x_j)}$$

This is Nadaraya-Watson kernel regression. Pushing \(k\) all the way to \(N\) (using all points, with weights vanishing for distant ones) yields the Parzen window estimator — a non-parametric density estimator. Note: k-NN \(\to\) Parzen window \(\to\) kernel regression is a continuous progression, not three unrelated ideas.
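A minimal NumPy sketch of the weighted estimator with \(k = N\) and an RBF kernel (the toy sine data is our own illustration):

```python
import numpy as np

def nadaraya_watson(x, X, y, sigma=0.5):
    """RBF-weighted average of all training labels (k pushed to N)."""
    K = np.exp(-(x - X) ** 2 / (2 * sigma ** 2))  # K(x, x_i)
    return (K * y).sum() / K.sum()                # alpha_i = y_i / sum_j K_j

X = np.linspace(0, 2 * np.pi, 50)
y = np.sin(X)
est = nadaraya_watson(np.pi / 2, X, y)
print(est)  # close to sin(pi/2) = 1, slightly shrunk by the smoothing
```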
1.3 Choosing k
\(k\) controls the bias–variance trade-off:
| \(k\) | Bias | Variance | Decision boundary | Risk |
|---|---|---|---|---|
| \(k=1\) | very low | very high | jagged | overfitting (sensitive to noise) |
| moderate \(k\) | balanced | balanced | smooth | — |
| \(k=N\) | very high | very low | degenerates to majority class | underfitting |
In practice, \(k\) is chosen by cross-validation, with \(k = \sqrt{N}\) as a common rule of thumb. \(k\) should be odd (to avoid ties in binary classification).
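In scikit-learn that cross-validation is one call; the snippet below (a toy `make_moons` setup of our own choosing) grid-searches odd values of \(k\):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
# Search over odd k only, to avoid binary-vote ties
grid = GridSearchCV(KNeighborsClassifier(),
                    {"n_neighbors": [1, 3, 5, 9, 15, 25]}, cv=5)
grid.fit(X, y)
print("best k:", grid.best_params_["n_neighbors"],
      "cv accuracy:", round(grid.best_score_, 3))
```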
1.4 Theoretical Guarantee for 1-NN (Cover-Hart 1967)
Let \(R^*\) be the Bayes-optimal error rate. For binary classification, the asymptotic error rate \(R_{1\text{-NN}}\) of 1-NN as \(N \to \infty\) satisfies:

$$R^* \;\leq\; R_{1\text{-NN}} \;\leq\; 2R^*(1 - R^*) \;\leq\; 2R^*$$
In one sentence: the most naive 1-NN has error at most twice the Bayes optimum — a stunning theoretical guarantee that partly explains why the Analogizers never truly faded.
2. A Lineage of Distance Metrics
The choice of distance metric is the make-or-break factor for k-NN (and for every Analogizer method).
2.1 Common Metrics
| Metric | Formula | Use case |
|---|---|---|
| Euclidean | \(d(x,y) = \sqrt{\sum_i (x_i - y_i)^2}\) | continuous features with comparable scale and meaning |
| Manhattan (\(\ell_1\)) | \(\sum_i \lvert x_i - y_i \rvert\) | more outlier-robust than Euclidean; grid-like data |
| Chebyshev (\(\ell_\infty\)) | \(\max_i \lvert x_i - y_i \rvert\) | when the worst single dimension dominates |
| Minkowski (\(\ell_p\)) | \(\big(\sum_i \lvert x_i - y_i \rvert^p\big)^{1/p}\) | interpolates between \(\ell_1\) and \(\ell_\infty\) |
| Mahalanobis | \(\sqrt{(x-y)^T \Sigma^{-1} (x-y)}\) | features with different scales / correlations |
| Cosine | \(1 - \frac{x^T y}{\lVert x\rVert \lVert y\rVert}\) | text TF-IDF, embedding similarity |
| Hamming | \(\sum_i \mathbb{1}[x_i \neq y_i]\) | binary strings, categorical features |
| Edit distance (Levenshtein) | minimal edit count by DP | strings, biological sequences |
| Jaccard | \(1 - \frac{\lvert A \cap B \rvert}{\lvert A \cup B \rvert}\) | sets, shingled documents, binary features |
2.2 The Mahalanobis Metric
Mahalanobis is an "adaptive" generalization of Euclidean distance:

$$d_M(x, y) = \sqrt{(x-y)^T M (x-y)}$$

where \(M = L^T L\) is positive semi-definite. When \(M = I\) it reduces to Euclidean; when \(M = \Sigma^{-1}\) (the inverse data covariance), it corrects for differing scales and correlations across dimensions. Metric learning is precisely the act of learning this \(M\) from data — see Metric Learning and Contrastive Learning.
2.3 A Geometric Pitfall: Why Cosine Reigns in NLP
In text and embedding work, cosine is preferred over Euclidean because the norm of an embedding usually carries "frequency / confidence" information rather than "semantic" information. After normalizing onto the unit sphere (\(\|x\|=1\)), cosine distance is monotonically related to Euclidean:

$$\|x - y\|^2 = 2 - 2\cos(x, y)$$
So modern deep metric learning (CLIP, SimCLR) almost always L2-normalizes embeddings and uses dot products.
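The identity is easy to verify numerically (random vectors of our own choosing):

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=128), rng.normal(size=128)
x, y = x / np.linalg.norm(x), y / np.linalg.norm(y)  # project onto the unit sphere

cos_sim = x @ y
sq_dist = np.sum((x - y) ** 2)
print(np.isclose(sq_dist, 2 - 2 * cos_sim))  # True: ||x-y||^2 = 2 - 2 cos
```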
3. The Curse of Dimensionality
Core proposition: in high-dimensional space "nearest neighbor" loses its meaning — distances between all points tend to be equal.
3.1 Volume Concentration
Consider a uniform distribution on the \(d\)-dimensional unit hypercube \([0,1]^d\). To capture \(1\%\) of the volume, the side length per dimension must be:

$$\ell = 0.01^{1/d}$$
| \(d\) | \(\ell\) (side length needed for \(1\%\) volume) |
|---|---|
| 1 | 0.01 |
| 2 | 0.1 |
| 10 | 0.63 |
| 100 | 0.955 |
| 1000 | 0.9954 |
At \(d = 100\), capturing \(1\%\) of the "local" neighbors requires covering \(95.5\%\) of every dimension's range — there is no longer anything "local" about it.
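The table's numbers follow directly from \(\ell = 0.01^{1/d}\):

```python
# l = 0.01 ** (1/d): side length per dimension capturing 1% of the unit hypercube
for d in [1, 2, 10, 100, 1000]:
    print(f"d={d:5d}  l={0.01 ** (1 / d):.4f}")
```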
3.2 Distance Concentration
For \(N\) uniform samples inside a \(d\)-dimensional unit ball, the gap between the smallest distance \(D_{\min}\) and the largest distance \(D_{\max}\) from a query point satisfies:

$$\frac{D_{\max} - D_{\min}}{D_{\min}} \;\xrightarrow{\;d \to \infty\;}\; 0 \quad \text{(in probability)}$$
That is, all pairwise distances become essentially equal — the nearest neighbor is no longer "near" and the farthest is no longer "far". Beyer et al. (1999) proved this rigorously, and it is devastating for k-NN.
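A quick simulation (sample sizes are our own choice) makes the concentration visible — the relative contrast \((D_{\max} - D_{\min})/D_{\min}\) collapses as \(d\) grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def contrast(d, n=200):
    """Relative contrast (D_max - D_min) / D_min from one query to n-1 uniform points."""
    P = rng.random((n, d))
    D = np.linalg.norm(P[0] - P[1:], axis=1)
    return (D.max() - D.min()) / D.min()

c_low, c_high = contrast(2), contrast(1000)
print(round(c_low, 2), round(c_high, 2))  # large contrast in 2-D, tiny in 1000-D
```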
3.3 Ways out of the Curse
The Analogizers fight the curse along three paths:
- Dimensionality reduction / manifold hypothesis: real data lies on a low-dimensional manifold (images, though tens of thousands of dimensions, have low effective dimension). t-SNE / UMAP / autoencoders all fall here.
- Learn a metric: metric learning projects data into a "semantically meaningful" low-dimensional embedding space — this is the entire premise of deep metric learning.
- The kernel trick: SVM's kernel trick goes the opposite way — lifting to infinite dimensions but accessing them only through similarity inner products, avoiding explicit storage.
4. k-NN Complexity and Acceleration
A naive k-NN query for a single point costs \(O(Nd)\) — unacceptable for large \(N\).
4.1 KD-Tree
The KD-Tree (k-dimensional tree) recursively bisects space using axis-aligned hyperplanes:
- Construction: \(O(N \log N)\), depth \(O(\log N)\);
- Query: \(O(\log N)\) in low dimensions (\(d \lesssim 20\)), degrading to \(O(N)\) in high dimensions;
- Best for: low-dimensional dense data.
4.2 Ball Tree
Drops the axis-aligned constraint and uses nested hyperspheres instead. More robust than the KD-Tree at moderate dimensions (20–100), but with larger constants in construction and query.
4.3 LSH (Locality-Sensitive Hashing)
Hashes "close points" into the same bucket — trading accuracy for massive speedups. An \((r_1, r_2, p_1, p_2)\)-sensitive LSH family \(\mathcal{H}\) satisfies, for \(h\) drawn uniformly from \(\mathcal{H}\):

$$d(x, y) \leq r_1 \Rightarrow \Pr[h(x) = h(y)] \geq p_1, \qquad d(x, y) \geq r_2 \Rightarrow \Pr[h(x) = h(y)] \leq p_2, \qquad p_1 > p_2$$

For example, SimHash for cosine distance: pick a random hyperplane \(r\) and set \(h(x) = \mathrm{sign}(r^T x)\). Two vectors at angle \(\theta\) map to the same bit with probability \(1 - \theta/\pi\) — a monotone function of their cosine similarity.
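A minimal SimHash sketch (the dimension and bit count are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def simhash(x, planes):
    """One bit per random hyperplane: h(x) = sign(r^T x)."""
    return (planes @ x > 0).astype(int)

planes = rng.normal(size=(64, 10))        # 64 random hyperplanes in R^10
x = rng.normal(size=10)
x_near = x + 0.01 * rng.normal(size=10)   # tiny perturbation -> tiny angle
x_far = rng.normal(size=10)               # unrelated vector -> angle near 90 degrees

ham = lambda a, b: int((a != b).sum())
h_near = ham(simhash(x, planes), simhash(x_near, planes))
h_far = ham(simhash(x, planes), simhash(x_far, planes))
print(h_near, h_far)  # few differing bits vs roughly half of 64
```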
LSH is the iconic early ANN (approximate nearest neighbor) method, comprehensively surpassed in the 2010s by graph-based HNSW — see Section 8 of Recommender Systems and Modern Retrieval.
5. The Unifying Perspective of Kernel Methods
5.1 A Kernel Is an Implicit Similarity
A kernel \(K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}\) is a symmetric function that we may read as a "similarity". The core insight of the kernel trick is:

$$K(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{H}}$$
i.e. \(K\) is equivalent to first mapping \(x\) into some (possibly infinite-dimensional) feature space \(\mathcal{H}\) and then taking an inner product. We never need to construct \(\phi\) explicitly, only compute \(K\).
```mermaid
flowchart LR
A["Input space X<br/>(linearly inseparable)"] -->|"explicit map φ"| B["Feature space H<br/>(linearly separable)"]
A -->|"K(x,x') = ⟨φ(x), φ(x')⟩<br/>(avoids constructing φ)"| C["Similarity matrix<br/>K ∈ R^{N×N}"]
C --> D["Dual solution<br/>depends only on K"]
style C fill:#fde68a
style D fill:#bbf7d0
```
5.2 The Mercer Condition
Not every "similarity function" is a valid kernel. Mercer's theorem: \(K\) is a valid kernel \(\iff\) for any \(\{x_1, \dots, x_N\}\), the Gram matrix / kernel matrix

$$\mathbf{K} = \big[K(x_i, x_j)\big]_{i,j=1}^{N}$$

is symmetric positive semi-definite (PSD), i.e. for any \(\alpha \in \mathbb{R}^N\):

$$\alpha^T \mathbf{K} \alpha = \sum_{i,j} \alpha_i \alpha_j K(x_i, x_j) \geq 0$$
PSD-ness is the foundation of all kernel-method theory — it guarantees that the dual problem is convex, that the Representer theorem holds, and that the solution is unique.
5.3 Algebra of Kernels
Valid kernels combine into new valid kernels:
- \(K_1 + K_2\) is a kernel;
- \(c \cdot K\) (\(c > 0\)) is a kernel;
- \(K_1 \cdot K_2\) is a kernel (pointwise product);
- \(f(x) f(x')\) is a kernel for any real-valued function \(f\);
- \(K(\psi(x), \psi(x'))\) is a kernel for any map \(\psi\) into the domain of \(K\).
In engineering, one can design separate kernels for different feature families (numeric, text, graph) and linearly combine them — this is Multiple Kernel Learning (MKL).
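The closure rules can be checked numerically on Gram matrices (a sketch with two hand-picked kernels; `is_psd` tests the smallest eigenvalue up to numerical tolerance):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))

def gram(kfun):
    return np.array([[kfun(a, b) for b in X] for a in X])

def is_psd(K, tol=1e-6):
    return bool(np.linalg.eigvalsh((K + K.T) / 2).min() > -tol)

K1 = gram(lambda a, b: a @ b)                          # linear kernel
K2 = gram(lambda a, b: np.exp(-np.sum((a - b) ** 2)))  # RBF kernel
# Sum, positive scaling, and pointwise (Schur) product all stay PSD
print(is_psd(K1 + K2), is_psd(3.0 * K1), is_psd(K1 * K2))  # True True True
```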
6. The SVM Derivation in Full
The SVM is the Analogizers' "master algorithm": it upgrades k-NN from "voting" to "maximum-margin hyperplane".
6.1 Hard-Margin Primal
Binary classification, \(y_i \in \{-1, +1\}\). Find a hyperplane \(w^T x + b = 0\) that separates the two classes and maximizes the margin. The distance from \(x_i\) to the hyperplane is \(\frac{|w^T x_i + b|}{\|w\|}\). Normalizing so that the closest point satisfies \(|w^T x_i + b| = 1\), the margin is \(\frac{2}{\|w\|}\).
Primal problem:

$$\min_{w, b} \; \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w^T x_i + b) \geq 1, \;\; i = 1, \dots, N$$

A convex quadratic program, directly solvable, but its dimension equals \(\dim(x)\) — no kernels possible.
6.2 Lagrangian Dual
Introducing multipliers \(\alpha_i \geq 0\), the Lagrangian is:

$$\mathcal{L}(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{N} \alpha_i \left[y_i(w^T x_i + b) - 1\right]$$
Setting partial derivatives with respect to \(w\) and \(b\) to zero:

$$\frac{\partial \mathcal{L}}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i} \alpha_i y_i x_i, \qquad \frac{\partial \mathcal{L}}{\partial b} = 0 \;\Rightarrow\; \sum_{i} \alpha_i y_i = 0$$
Substituting back yields the dual problem:

$$\max_{\alpha} \; \sum_{i} \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle \quad \text{s.t.} \quad \alpha_i \geq 0, \;\; \sum_i \alpha_i y_i = 0$$
Key observation: in the dual, \(x_i\) appears only through inner products \(\langle x_i, x_j \rangle\) — exactly the entry point for the kernel trick. Replace the inner product with \(K(x_i, x_j)\) and you instantly get an SVM in an infinite-dimensional feature space.
6.3 KKT Conditions
The optimum \((w^*, b^*, \alpha^*)\) must satisfy the Karush-Kuhn-Tucker (KKT) conditions:
- Stationarity: \(w^* = \sum_i \alpha_i^* y_i x_i\);
- Primal feasibility: \(y_i({w^*}^T x_i + b^*) \geq 1\);
- Dual feasibility: \(\alpha_i^* \geq 0\);
- Complementary slackness: \(\alpha_i^* \left[y_i({w^*}^T x_i + b^*) - 1\right] = 0\).
Complementary slackness yields a deep conclusion: \(\alpha_i^* > 0 \Rightarrow y_i({w^*}^T x_i + b^*) = 1\) — only points sitting exactly on the margin boundary have non-zero \(\alpha\). These points are called support vectors.
The SVM solution depends only on a subset of training points (the support vectors), and that is the source of its generalization power — the expected leave-one-out error is bounded by the fraction of training points that are support vectors.
6.4 Soft Margin and Hinge Loss
Real data is rarely linearly separable. Introduce slack variables \(\xi_i \geq 0\):

$$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\xi_i \quad \text{s.t.} \quad y_i(w^T x_i + b) \geq 1 - \xi_i, \;\; \xi_i \geq 0$$

\(C\) is the regularization parameter: \(C \to \infty\) recovers the hard margin; small \(C\) tolerates more violations. The equivalent unconstrained form is:

$$\min_{w, b} \; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\max\left(0, \; 1 - y_i(w^T x_i + b)\right)$$
The second term is the hinge loss — the SVM's loss function.
The soft-margin dual is almost identical to the hard-margin dual, with the only change being the box constraint \(0 \leq \alpha_i \leq C\).
6.5 The Kernel SVM Decision Function
Once \(\alpha^*\) is learned, the prediction for a new sample \(x\) is:

$$\hat y = \mathrm{sign}\left(\sum_{i \in \mathrm{SV}} \alpha_i^* y_i K(x, x_i) + b^*\right)$$
A perfect match for the Analogizer master form \(\hat y = f(\sum_i \alpha_i K(x, x_i))\).
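This can be verified against a fitted scikit-learn model: `SVC` exposes the support vectors and `dual_coef_` (which stores \(\alpha_i^* y_i\)), so the decision function can be rebuilt by hand (a sketch on toy data of our own choosing):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=42)
svm = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)

def decision(x):
    """f(x) = sum_{i in SV} alpha_i* y_i K(x, x_i) + b*, rebuilt from the model.
    sklearn's dual_coef_ stores alpha_i* y_i for the support vectors only."""
    K = np.exp(-1.0 * np.sum((svm.support_vectors_ - x) ** 2, axis=1))  # RBF, gamma=1
    return float(svm.dual_coef_[0] @ K + svm.intercept_[0])

print(np.isclose(decision(X[0]), svm.decision_function(X[:1])[0]))  # True
```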
6.6 The SMO Algorithm
The standard way to solve the SVM dual is Sequential Minimal Optimization (SMO, Platt 1998):
SMO main loop:

```text
while not converged:
    pick a pair of multipliers (α_i, α_j) that violate KKT
    fix the others; reduce the dual to a 2-D QP in (α_i, α_j)
    solve analytically (1-D QP after the equality constraint, closed form)
end
```
The elegance of SMO: it optimizes only two variables at a time (necessarily two, because of the constraint \(\sum_i \alpha_i y_i = 0\)), decomposing an \(N\)-dimensional QP into many 2-D sub-problems. LibSVM is the canonical efficient implementation of SMO.
7. The Kernel Zoo
| Kernel | Formula | Parameters | Use case |
|---|---|---|---|
| Linear | \(K(x,y) = x^T y\) | — | high-dimensional sparse text (TF-IDF) |
| Polynomial | \((x^T y + c)^d\) | \(c, d\) | feature interactions (NLP N-gram-like) |
| RBF / Gaussian | \(\exp(-\gamma \|x-y\|^2)\) | \(\gamma\) | the universal default; infinite-dim feature space |
| Laplacian | \(\exp(-\gamma \|x-y\|_1)\) | \(\gamma\) | more outlier-robust than RBF |
| Sigmoid | \(\tanh(\kappa x^T y + c)\) | \(\kappa, c\) | historically mimicked two-layer nets (not strictly PSD) |
| String kernel | shared-substring count | \(k\) (substring length) | text classification, biological sequences |
| Graph kernel | random walk / subgraph count | — | chemical molecules, social graphs |
| Fisher kernel | \(\nabla_\theta \log p(x \mid \theta)^T \, \mathcal{I}^{-1} \, \nabla_\theta \log p(y \mid \theta)\) | generative model \(p(\cdot \mid \theta)\) | plugging generative models into discriminative classifiers |
Default choices: tabular features → RBF; text → linear plus high-dimensional features; structured objects → string/graph kernel; if in doubt → RBF + grid search over \((\gamma, C)\).
8. Kernel PCA and Kernel Ridge Regression
8.1 Kernel PCA
Plain PCA finds the directions of maximum variance in the original space; kernel PCA does PCA in the feature space after the \(\phi\) map. Concretely, we find the eigenvectors of the centered kernel matrix \(\tilde{\mathbf{K}}\):

$$\tilde{\mathbf{K}} \, \alpha^{(k)} = N\lambda_k \, \alpha^{(k)}$$
The \(k\)-th principal component corresponds to the eigenvector \(\alpha^{(k)}\), and the \(k\)-th principal coordinate of a new sample \(x\) is:

$$z_k(x) = \sum_{i=1}^{N} \alpha_i^{(k)} \, \tilde{K}(x, x_i)$$
Significance: a manifold structure that is non-linear in the original space may become linear in the \(\phi\) space — kernel PCA extracts non-linear principal components.
8.2 Kernel Ridge Regression
The kernelized closed-form solution to ridge regression \(\min_w \|y - Xw\|^2 + \lambda \|w\|^2\) is:

$$\alpha = (\mathbf{K} + \lambda I)^{-1} y, \qquad \hat f(x) = \sum_{i=1}^{N} \alpha_i K(x, x_i)$$
A textbook instance of the Analogizer master form. Connection to GP regression: the kernel ridge prediction mean is mathematically identical to the GP regression posterior mean — see Section 10.
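The closed form is three lines of NumPy, and matches scikit-learn's `KernelRidge` (toy data of our own choosing):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)
lam, gamma = 0.1, 0.5

K = np.exp(-gamma * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))  # RBF Gram matrix
alpha = np.linalg.solve(K + lam * np.eye(40), y)   # alpha = (K + lambda I)^{-1} y
pred_manual = K @ alpha                            # f(x_i) = sum_j alpha_j K(x_i, x_j)

krr = KernelRidge(alpha=lam, kernel="rbf", gamma=gamma).fit(X, y)
print(np.allclose(pred_manual, krr.predict(X)))  # True
```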
9. The Representer Theorem
Representer theorem (Schölkopf, Herbrich & Smola 2001): in a reproducing kernel Hilbert space (RKHS) \(\mathcal{H}_K\), the optimal solution to a regularized empirical risk minimization problem can always be written as a linear combination of kernel functions evaluated at the training points.
Formally, consider:

$$\min_{f \in \mathcal{H}_K} \; L\big(f(x_1), \dots, f(x_N)\big) + \Omega\big(\|f\|_{\mathcal{H}_K}\big)$$

where \(L\) is any loss and \(\Omega\) a monotonically increasing regularizer. Then the optimal solution must have the form:

$$f^*(\cdot) = \sum_{i=1}^{N} \alpha_i K(\cdot, x_i)$$
Sketch of proof: decompose any \(f \in \mathcal{H}_K\) as \(f = f_\parallel + f_\perp\), with \(f_\parallel \in \mathrm{span}\{K(\cdot, x_i)\}\) and \(f_\perp\) orthogonal to it. By the reproducing property, \(f(x_i) = \langle f, K(\cdot, x_i) \rangle = f_\parallel(x_i)\), so \(f_\perp\) does not affect the loss term. Yet \(\|f\|^2 = \|f_\parallel\|^2 + \|f_\perp\|^2 \geq \|f_\parallel\|^2\), so the regularizer is minimized when \(f_\perp = 0\). \(\square\)
Significance:
- Theoretical foundation: reduces an infinite-dimensional optimization to a finite \(N\)-dimensional one in \(\alpha\);
- Algorithmic unity: SVMs, kernel ridge regression, kernel logistic regression, GPs — all of their solutions take the form \(\sum_i \alpha_i K(x_i, \cdot)\), differing only in how \(\alpha\) is determined;
- Legitimacy of the Analogizers: theoretically explains why "predicting via a weighted sum of training similarities" is the essential form of every RKHS method, not a coincidence.
10. Gaussian Processes as Kernel Methods
The Gaussian process (GP) is the Bayesians' regression method, but in essence it is a kernel method. A GP is fully specified by its mean function \(m(x)\) and covariance kernel \(K(x, x')\):

$$f \sim \mathcal{GP}\big(m(x), \, K(x, x')\big)$$

After observing \(\mathbf{y} = f(\mathbf{X}) + \epsilon\) (Gaussian noise \(\epsilon \sim \mathcal{N}(0, \sigma^2 I)\)), the posterior at a new point \(x_*\) is:

$$f_* \mid \mathbf{y} \;\sim\; \mathcal{N}\Big(\mathbf{k}_*^T(\mathbf{K} + \sigma^2 I)^{-1}\mathbf{y}, \;\; K(x_*, x_*) - \mathbf{k}_*^T(\mathbf{K} + \sigma^2 I)^{-1}\mathbf{k}_*\Big)$$
where \(\mathbf{k}_* = [K(x_*, x_i)]_{i=1}^N\) and \(\mathbf{K}\) is the training-point kernel matrix.
Key observation: the posterior mean

$$\mu(x_*) = \mathbf{k}_*^T(\mathbf{K} + \sigma^2 I)^{-1}\mathbf{y}$$
has exactly the same form as the kernel ridge solution (with \(\sigma^2 = \lambda\)). The only difference between GPs and kernel ridge regression is that GPs additionally provide the predictive variance (uncertainty quantification), whereas kernel ridge only gives a point estimate.
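A NumPy sketch of the equivalence (toy 1-D data; \(\sigma^2\) plays the role of \(\lambda\)): the GP posterior mean coincides with the kernel ridge prediction, while the GP additionally yields a variance.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(0, 5, 20)[:, None]
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=20)
sigma2 = 0.01  # noise variance; plays the role of lambda

rbf = lambda A, B: np.exp(-0.5 * (A - B.T) ** 2)
K = rbf(X, X)
x_star = np.array([[2.5]])
k_star = rbf(X, x_star)[:, 0]

inv = np.linalg.inv(K + sigma2 * np.eye(20))
gp_mean = k_star @ inv @ y                                   # GP posterior mean
gp_var = rbf(x_star, x_star)[0, 0] - k_star @ inv @ k_star   # GP posterior variance

alpha = np.linalg.solve(K + sigma2 * np.eye(20), y)          # kernel ridge, lambda = sigma^2
print(np.isclose(gp_mean, k_star @ alpha), gp_var > 0)       # same mean; GP adds variance
```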
Tribal viewpoint: the Bayesians read a GP as "a prior over functions"; the Analogizers read a GP as "a non-parametric prediction weighted by training-point kernel similarity". The two perspectives are equivalent but tell different stories — and that is why the GP section in ../../03_Machine_Learning/贝叶斯学习.md and this section mirror each other.
11. Division of Labour with [../../03_Machine_Learning/kernel_methods.md]
../../03_Machine_Learning/kernel_methods.md already covers the basics of kernel methods (what a kernel is, the intuition for the kernel trick, common kernels). This page does not repeat that material; instead it goes deeper:
| Topic | 03_Machine_Learning/kernel_methods.md | This page |
|---|---|---|
| Kernel-trick concept | introduces it | assumes it and uses it |
| Mercer condition | mentions it | PSD properties + algebra of kernels |
| SVM derivation | brief sketch | full primal + dual + KKT + SMO |
| Representer theorem | — | full statement with proof sketch |
| GP–kernel connection | — | explicit derivation of the equivalence with kernel ridge |
| Curse of dimensionality | — | full mathematical analysis (volume, distance concentration) |
| Algorithmic engineering | — | overview of KD-Tree / LSH / SMO implementations |
In short: that page asks "what are kernel methods", this page asks "why are kernel methods the master algorithm of the Analogizers, and what is their full mathematical skeleton".
Code Example: Running Everything Above with scikit-learn
```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.kernel_ridge import KernelRidge
from sklearn.decomposition import KernelPCA

X, y = make_moons(n_samples=500, noise=0.2, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# 1) k-NN
knn = KNeighborsClassifier(n_neighbors=5, weights="distance").fit(X_tr, y_tr)

# 2) RBF-SVM (Analogizer master algorithm)
svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_tr, y_tr)
print("support vectors / training samples:", len(svm.support_), "/", len(X_tr))

# 3) Kernel ridge regression (regression-flavored Analogizer)
krr = KernelRidge(alpha=0.1, kernel="rbf", gamma=1.0).fit(X_tr, y_tr)

# 4) Kernel PCA (non-linear dimensionality reduction)
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=1.0).fit(X_tr)
Z = kpca.transform(X_te)

print("k-NN acc:", knn.score(X_te, y_te))
print("SVM acc:", svm.score(X_te, y_te))
```
Cross-references
- The Analogizers as a tribe: the Analogizers overview, §1 (the tribe's entry page)
- Kernel methods basics on the site: ../../03_Machine_Learning/kernel_methods.md
- Supervised algorithms catalog (k-NN/SVM interface): ../../03_Machine_Learning/supervised.md
- The Bayesian view of GPs: ../../03_Machine_Learning/贝叶斯学习.md
- How the metric is learned: Metric Learning and Contrastive Learning
- How to retrieve similar samples efficiently: Recommender Systems and Modern Retrieval
References
- Cover, T. M., & Hart, P. E. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), 21–27. — The foundational k-NN paper and 1-NN error bound.
- Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. COLT. — SVMs and the kernel trick.
- Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297. — Soft-margin SVM.
- Platt, J. (1998). Sequential Minimal Optimization: A fast algorithm for training support vector machines. Microsoft Research TR. — The SMO algorithm.
- Schölkopf, B., Herbrich, R., & Smola, A. J. (2001). A generalized representer theorem. COLT. — The modern statement and proof of the Representer theorem.
- Schölkopf, B., & Smola, A. J. (2002). Learning with Kernels. MIT Press. — The standard textbook on kernel methods.
- Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press. — The standard GP textbook; Chapter 6 covers the connection to kernel methods.
- Beyer, K., et al. (1999). When is "nearest neighbor" meaningful? ICDT. — The rigorous proof of distance concentration.
- Bentley, J. L. (1975). Multidimensional binary search trees used for associative searching. CACM. — The original KD-Tree paper.
- Indyk, P., & Motwani, R. (1998). Approximate nearest neighbors: towards removing the curse of dimensionality. STOC. — The foundation of LSH.
Metric Learning and Contrastive Learning
Scope of this page: a thorough treatment of how similarity itself is learned. From the shallow metric learning of the 2000s (LMNN/ITML) all the way to the contrastive learning of the 2020s (SimCLR/MoCo/BYOL/DINO) and the cross-modal CLIP — this is the renaissance of the Analogizers in the deep learning era.
Nearest Neighbors and Kernel Methods solved "given a similarity, how to predict". The deeper question is — where does the similarity itself come from?
A naive Euclidean distance in raw pixel space carries almost no semantic meaning: shifting a cat photo by one pixel yields a Euclidean distance larger than the gap between "cat vs. dog". To make "similar inputs have similar outputs", we must learn an embedding in which similar inputs are genuinely close.
That is the entire mission of metric learning. When metric learning is married to deep networks and scaled to unlabeled data, it becomes the training paradigm of foundation models in the 2020s — contrastive learning.
1. The Goal of Metric Learning
Formally: find an embedding function \(f_\theta: \mathcal{X} \to \mathbb{R}^d\) such that for any "similar pair" \((x, x^+)\) and "dissimilar pair" \((x, x^-)\):

$$d\big(f_\theta(x), f_\theta(x^+)\big) < d\big(f_\theta(x), f_\theta(x^-)\big)$$
Here \(d\) is usually Euclidean or cosine distance. Key questions:
- Who defines "similar"? — labels in supervised learning (same class is similar); data augmentation in self-supervised learning (different augmentations of the same image are similar).
- How to train? — convert "similarity-ordering constraints" into a differentiable loss.
- What is the geometry of the embedding space? — Euclidean / spherical / hyperbolic.
2. Classical Metric Learning
2.1 Mahalanobis Metric Learning
Learn a positive semi-definite matrix \(M = L^T L\) such that
\[ d_M(x, x') = \sqrt{(x - x')^T M (x - x')} = \|Lx - Lx'\| \]
is task-meaningful. Note that \(L\) is a linear embedding — Mahalanobis metric learning is "learn a linear embedding, then use Euclidean distance".
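To make the equivalence concrete, here is a minimal numpy sketch (matrix values are arbitrary illustrations) verifying that the Mahalanobis distance under \(M = L^T L\) equals the Euclidean distance after the linear embedding \(L\):

```python
import numpy as np

rng = np.random.default_rng(0)
L = rng.normal(size=(2, 3))   # a linear embedding R^3 -> R^2 (illustrative values)
M = L.T @ L                   # the induced positive semi-definite matrix

x, y = rng.normal(size=3), rng.normal(size=3)

d_mahalanobis = np.sqrt((x - y) @ M @ (x - y))
d_embedded = np.linalg.norm(L @ x - L @ y)   # Euclidean distance after embedding Lx
```

Both quantities coincide, which is why learning \(M\) and learning a linear embedding are the same problem.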
2.2 LMNN (Large Margin Nearest Neighbor, Weinberger & Saul 2009)
LMNN directly optimizes the metric for k-NN, requiring each point's "target neighbors" (same class) to be at least one margin closer than its "impostors" (different class):
\[ \min_{M \succeq 0} \; \sum_{j \rightsquigarrow i} d_M^2(x_i, x_j) \;+\; \mu \sum_{i,\; j \rightsquigarrow i,\; l} (1 - y_{il}) \big[\, 1 + d_M^2(x_i, x_j) - d_M^2(x_i, x_l) \,\big]_+ \]
where \(j \rightsquigarrow i\) means \(j\) is one of \(i\)'s preselected target neighbors (k-nearest among the same class) and \(l\) ranges over all points. \(M\) must remain PSD (projection onto the PSD cone).
Contributions: imports the SVM "margin" idea into metric learning; can be combined with the kernel trick.
2.3 ITML (Information-Theoretic Metric Learning, Davis et al. 2007)
Measures the deviation of the metric from a prior metric \(M_0\) using the KL divergence between the corresponding Gaussians:
\[ \min_{M \succeq 0} \; \mathrm{KL}\big(p(x; M_0) \,\|\, p(x; M)\big) \quad \text{s.t.} \quad d_M(x_i, x_j) \le u \;\; \text{(similar pairs)}, \qquad d_M(x_i, x_j) \ge \ell \;\; \text{(dissimilar pairs)} \]
The solution form is unusually clean: iterative Bregman projection. Advantage: parameterizes naturally inside the PSD cone, avoiding explicit projection.
2.4 The Limits of Shallow Metric Learning
LMNN/ITML both assume a linear transform suffices — but semantic similarity in images and text is highly non-linear. Replacing the linear part with a deep network leads to the next section's Siamese / Triplet networks.
3. Deep Metric Learning Architectures
3.1 Siamese Networks
Proposed by Bromley & LeCun (1993), Siamese networks exploded in popularity after 2014. Two weight-sharing subnetworks process two inputs and compare their embeddings:
flowchart LR
X1["x₁"] --> F1["fθ"]
X2["x₂"] --> F2["fθ (shared weights)"]
F1 --> Z1["z₁"]
F2 --> Z2["z₂"]
Z1 & Z2 --> D["d(z₁, z₂)"]
D --> L["contrastive loss"]
style F1 fill:#fde68a
style F2 fill:#fde68a
Applications: face verification (DeepFace, FaceNet), signature verification, low-shot classification (a deeper treatment of Siamese networks appears in ../../../1_DeepLearning/Computer_Vision/meta_learning.md).
3.2 Triplet Networks (FaceNet, Schroff et al. 2015)
Three inputs: anchor \(a\), positive \(p\) (same class as \(a\)), negative \(n\) (different class from \(a\)). The loss pushes \(a\) at least one margin \(m\) closer to \(p\) than to \(n\):
\[ \mathcal{L} = \big[\, \|f(a) - f(p)\|^2 - \|f(a) - f(n)\|^2 + m \,\big]_+ \]
FaceNet's recipe:
- L2-normalize embeddings to the unit sphere (128 dimensions);
- 99.63% verification accuracy on LFW.
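A minimal numpy sketch of the triplet hinge on squared distances (function name and example points are mine, not from FaceNet):

```python
import numpy as np

def triplet_loss(a, p, n, m=0.2):
    """Hinge on squared distances: max(0, ||a-p||^2 - ||a-n||^2 + m)."""
    d_ap = np.sum((a - p) ** 2)
    d_an = np.sum((a - n) ** 2)
    return max(0.0, d_ap - d_an + m)

# anchor close to its positive and far from its negative -> zero loss
a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])
n = np.array([2.0, 0.0])
```

With the roles of \(p\) and \(n\) swapped, the hinge activates and the loss becomes large, which is exactly the gradient signal the network trains on.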
3.3 Hard Negative Mining
The central difficulty of triplet training: random negatives mostly already satisfy the constraint (loss = 0), giving no gradient. We must pick hard negatives (the negatives closest to the anchor):
| Strategy | Description | Risk |
|---|---|---|
| Random | random negative | training stalls |
| Semi-hard | choose \(\|f(a)-f(n)\| > \|f(a)-f(p)\|\) but still violating the margin | FaceNet's default |
| Hardest | the closest negative in the batch | prone to collapse to a trivial solution |
| Online (within batch) | mine on the fly inside a mini-batch | efficient, mainstream |
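The semi-hard rule in the table can be sketched as follows (a minimal numpy version with my own function name; FaceNet does this mining per mini-batch on GPU):

```python
import numpy as np

def semi_hard_negative(anchor, positive, negatives, margin=0.2):
    """Pick a negative farther than the positive but still violating the margin:
    d(a,p) < d(a,n) < d(a,p) + margin. Falls back to the hardest negative."""
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(negatives - anchor, axis=1)
    mask = (d_an > d_ap) & (d_an < d_ap + margin)
    if mask.any():
        # closest qualifying negative
        return int(np.where(mask)[0][np.argmin(d_an[mask])])
    return int(np.argmin(d_an))  # hardest negative as fallback
```

The fallback branch corresponds to the "Hardest" row of the table, which is why pure hardest mining is risky: early in training it dominates and can collapse the embedding.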
4. A Lineage of Loss Functions
The evolution of metric learning losses is a single thread, with each generation fixing problems left by the last.
4.1 Contrastive Loss (Hadsell, Chopra & LeCun 2006)
The earliest deep contrastive loss. Given a pair \((x_1, x_2)\) and a "similar" label \(y \in \{0, 1\}\) (0 = similar, 1 = dissimilar):
\[ \mathcal{L}(x_1, x_2, y) = (1 - y)\,\tfrac{1}{2} D^2 + y\,\tfrac{1}{2}\big[\max(0,\, m - D)\big]^2 \]
where \(D = \|f(x_1) - f(x_2)\|\) and \(m\) is the margin. Intuition:
- similar pair (\(y=0\)) → minimize squared distance;
- dissimilar pair (\(y=1\)) → push apart by at least \(m\).
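The two branches of the intuition map directly onto two terms of the loss; a minimal numpy sketch (my own variable names):

```python
import numpy as np

def contrastive_loss(x1, x2, y, m=1.0):
    """Hadsell et al. 2006 pairwise loss: y = 0 similar, y = 1 dissimilar."""
    D = np.linalg.norm(x1 - x2)
    return (1 - y) * 0.5 * D ** 2 + y * 0.5 * max(0.0, m - D) ** 2

a, b = np.array([0.0, 0.0]), np.array([3.0, 0.0])  # a pair at distance D = 3
```

A dissimilar pair already beyond the margin contributes zero loss; the same pair labeled similar is penalized quadratically in its distance.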
4.2 Triplet Loss
See above — extends the contrastive "pair" to a "triplet", balancing positive and negative naturally.
4.3 Quadruplet / N-pair / Lifted Structure
| Loss | Form | Improvement |
|---|---|---|
| Quadruplet (Chen 2017) | one anchor + one positive + two negatives, requiring negatives to also be apart | larger inter-class margins |
| N-pair (Sohn 2016) | one anchor + one positive + (N-1) negatives, softmax | better mini-batch utilization |
| Lifted Structure (Oh Song 2016) | jointly optimize all positive and negative pairs in a batch | global structure |
| Angular Loss (Wang 2017) | constrains the inner angles of the triangle, scale-invariant | more geometrically intuitive |
4.4 InfoNCE / NT-Xent: the Core Loss of Modern Contrastive Learning
InfoNCE (van den Oord et al. 2018) connects contrastive learning to mutual information estimation, and NT-Xent (Normalized Temperature-scaled Cross Entropy, Chen et al. SimCLR 2020) is its standard form for self-supervised vision.
In a batch of \(N\) original samples, each producing 2 augmented views, there are \(2N\) samples total. The loss for view \(i\) paired with its positive \(j\) is:
\[ \ell_{i,j} = -\log \frac{\exp\big(\mathrm{sim}(z_i, z_j)/\tau\big)}{\sum_{k=1}^{2N} \mathbb{1}[k \neq i] \, \exp\big(\mathrm{sim}(z_i, z_k)/\tau\big)} \]
where \(\mathrm{sim}(u, v) = \frac{u^T v}{\|u\|\|v\|}\) (cosine similarity) and \(\tau > 0\) is the temperature. The total batch loss is the average over all \((i, j)\) pairs.
4.4.1 Derivation: Connection to a Mutual-Information Lower Bound
InfoNCE is a lower-bound estimator of the mutual information \(I(X; Y)\):
\[ I(X; Y) \;\ge\; \log K - \mathcal{L}_{\mathrm{InfoNCE}} \]
where \(K\) is the number of negatives. This gives contrastive learning an information-theoretic interpretation: maximizing contrastive performance = maximizing the mutual information between two views. But the bound is loose — the relationship between theory and practice remains debated (Tschannen 2020).
4.4.2 The Role of the Temperature \(\tau\)
The temperature \(\tau\) controls the "sharpness" of the softmax:
| \(\tau\) | Behavior | Effect |
|---|---|---|
| very small (\(\tau \to 0\)) | softmax becomes argmax | focuses only on the hardest negative; unstable training |
| very large (\(\tau \to \infty\)) | softmax becomes uniform | all negatives weighed equally; little learning signal |
| moderate (\(\tau \approx 0.05 \sim 0.5\)) | balanced | SimCLR default 0.5; MoCo default 0.07; CLIP learns it |
Empirically, a too-small temperature is a common cause of early-stage collapse. CLIP makes the logit scale \(\log \tau^{-1}\) learnable, initialized to \(\log(1/0.07) \approx \log 14.3\) and clipped at \(\log 100\) to keep \(\tau\) from shrinking too far.
4.4.3 SupCon: Supervised Contrastive Loss
Extends NT-Xent to the labeled setting (Khosla et al. 2020) by treating all same-class samples as positives:
\[ \mathcal{L} = \sum_i \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(z_i \cdot z_p / \tau)}{\sum_{a \neq i} \exp(z_i \cdot z_a / \tau)} \]
where \(P(i)\) is the set of samples sharing the class of \(i\). SupCon often beats cross-entropy in supervised classification.
5. Self-Supervised Contrastive Learning
Cross-page deep dive: ../../../1_DeepLearning/Computer_Vision/视觉自监督.md already lays out the overall paradigm of self-supervised learning. This section focuses on the internal evolution and algorithmic detail of the contrastive line.
5.1 SimCLR (Chen et al. 2020)
The first work to scale contrastive learning seriously. Key components:
- Augmentation composition: random crop + color jitter + Gaussian blur (ablations show that crop + color is the crucial pair).
- Projection head: a 2-layer MLP \(g\) stacked on top of the encoder; the contrastive loss is computed on \(g(h)\) while \(h\) is used downstream. A pivotal trick: the projection head discards augmentation-sensitive features, making \(h\) more general.
- NT-Xent loss.
- Large batch: 4096–8192, since each sample contributes \(2N - 2\) negatives within the batch.
SimCLR step:
for x in batch:
x1, x2 ← two random augmentations(x)
h1, h2 ← encoder(x1), encoder(x2)
z1, z2 ← projection_head(h1), projection_head(h2)
L ← NT-Xent({(z1_i, z2_i)})
backprop
5.2 MoCo (Momentum Contrast, He et al. 2020)
SimCLR depends on a large batch to provide negatives, demanding huge memory. MoCo's solution:
- Momentum encoder: maintain an EMA copy of the encoder: \(\theta_k \leftarrow m\,\theta_k + (1 - m)\,\theta_q\), typically with \(m = 0.999\).
- Queue: keep the key embeddings from the most recent mini-batches in a FIFO pool of size \(K\) (typically \(K = 65536\)) as negatives.
- InfoNCE loss: compare the query \(q\) with the positive key \(k_+\) and all negative keys in the queue.
Benefit: the number of negatives is decoupled from batch size; memory-friendly. Downside: momentum updates introduce staleness, requiring careful tuning of \(m\). MoCo v2 imports SimCLR's projection head and stronger augmentations for further gains.
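The two mechanisms above can be sketched in a few lines of numpy (class and function names are mine; the real queue holds \(K = 65536\) embeddings):

```python
import numpy as np

def momentum_update(theta_q, theta_k, m=0.999):
    """EMA update of the key encoder's parameters from the query encoder's."""
    return m * theta_k + (1 - m) * theta_q

class NegativeQueue:
    """FIFO pool of key embeddings (tiny capacity here for illustration)."""
    def __init__(self, K, d):
        self.K = K
        self.buf = np.zeros((0, d))

    def enqueue(self, keys):
        # append the newest keys, drop the oldest beyond capacity K
        self.buf = np.concatenate([self.buf, keys])[-self.K:]

q = NegativeQueue(K=4, d=2)
q.enqueue(np.ones((3, 2)))
q.enqueue(2 * np.ones((3, 2)))  # the queue now holds only the 4 newest keys
```

Because the key encoder moves slowly (EMA), keys from a few mini-batches ago remain consistent enough to serve as negatives, which is what makes the queue trick work.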
5.3 BYOL (Bootstrap Your Own Latent, Grill et al. 2020)
BYOL's stunning observation: you can do without negatives at all.
Architecture: two networks, online + target, where target is the EMA of online. The online branch has a predictor \(q\) that predicts the target's output:
\[ \mathcal{L} = \big\| \bar q_\theta(z_\theta) - \bar z'_\xi \big\|_2^2 = 2 - 2\, \frac{\langle q_\theta(z_\theta),\, z'_\xi \rangle}{\|q_\theta(z_\theta)\|\,\|z'_\xi\|} \]
where the bars denote L2 normalization.
Key tricks:
- stop-gradient on target (target is not backpropagated through);
- predictor asymmetry — remove the predictor and the model collapses immediately;
- no negatives, but the triple asymmetry of stop-gradient + EMA + predictor prevents collapse to a trivial solution.
BYOL overturned the belief that "contrastive learning needs negatives" and triggered a wave of theoretical explanations (e.g. dynamics analyses by Tian 2021).
5.4 SimSiam (Chen & He 2021)
Even more extreme than BYOL: it drops the EMA target too, keeping only stop-gradient + predictor:
\[ \mathcal{L} = \tfrac{1}{2}\, D\big(p_1, \mathrm{sg}(z_2)\big) + \tfrac{1}{2}\, D\big(p_2, \mathrm{sg}(z_1)\big), \qquad D(p, z) = -\frac{p^T z}{\|p\|\,\|z\|} \]
Stop-gradient is the "soul" preventing collapse in BYOL/SimSiam — one of the most mysterious theoretical questions of 2021.
5.5 DINO (Caron et al. 2021)
DINO = Self-Distillation with NO labels. It marries contrastive learning to knowledge distillation and showcases the emergent capabilities of self-supervised ViTs for the first time.
- A student network \(g_\theta\) and a teacher network \(g_\xi\) (EMA), both ViTs;
- the student sees small (local) crops, the teacher sees large (global) crops;
- the student matches the teacher's softmax output (not a contrastive loss but a cross-entropy);
- centering + sharpening prevent collapse.
Striking finding: DINO ViT's attention maps automatically focus on object boundaries — semantic segmentation emerges from purely self-supervised training. A landmark of ViT + self-supervised learning.
5.6 An Overview of Contrastive Methods
| Method | Year | Loss | Needs negatives? | Key innovation |
|---|---|---|---|---|
| SimCLR | 2020 | NT-Xent | yes (large batch) | projection head + strong augmentation |
| MoCo v2 | 2020 | InfoNCE | yes (queue) | momentum encoder |
| BYOL | 2020 | L2 (predicted vs EMA) | no | predictor + stop-gradient |
| SwAV | 2020 | cluster prediction | no | cluster prototypes + multi-view |
| SimSiam | 2021 | cosine + stop-gradient | no | minimalist, no EMA |
| Barlow Twins | 2021 | cross-correlation matrix | no | redundancy reduction |
| DINO | 2021 | cross-entropy distillation | no | self-distillation + ViT emergence |
6. Cross-Modal Metric Learning
Cross-page deep dive: ../../../1_DeepLearning/Foundation_Models/vision_foundation.md covers the overall significance and downstream applications of CLIP. This section focuses on the algorithmic detail of its contrastive learning.
6.1 CLIP (Contrastive Language-Image Pre-training, Radford et al. 2021)
Setup: 400 million (image, caption) pairs scraped from the web. Task: bring an image and its caption close together in the embedding space.
Architecture:
- image encoder \(f_I\) (ViT or ResNet);
- text encoder \(f_T\) (Transformer);
- both followed by a linear projection to a joint \(d\)-dim space;
- L2 normalization.
Loss: symmetric InfoNCE. Given \(N\) in-batch pairs \((I_i, T_i)\), build the \(N \times N\) similarity matrix \(S\) with \(S_{ij} = f_I(I_i)^T f_T(T_j) / \tau\). Then:
\[ \mathcal{L} = \frac{1}{2N} \sum_{i=1}^{N} \left[ -\log \frac{e^{S_{ii}}}{\sum_{j} e^{S_{ij}}} \;-\; \log \frac{e^{S_{ii}}}{\sum_{j} e^{S_{ji}}} \right] \]
i.e. "pair each image with its true caption" + "pair each caption with its true image".
Zero-shot classification: wrap each class name \(c\) as the prompt "a photo of a {c}" and encode it as \(f_T(t_c)\). The prediction for a new image \(I\) is:
\[ \hat c = \arg\max_{c} \; f_I(I)^T f_T(t_c) \]
CLIP matches or surpasses fully supervised models on 30+ vision datasets in the zero-shot setting, ushering in the foundation-model era of vision–language.
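The symmetric loss and zero-shot rule can both be sketched in numpy (function names are mine; embeddings here stand in for encoder outputs):

```python
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def log_softmax(A):
    A = A - A.max(axis=1, keepdims=True)  # numerical stability
    return A - np.log(np.exp(A).sum(axis=1, keepdims=True))

def clip_loss(img_emb, txt_emb, tau=0.07):
    """Symmetric InfoNCE over the N x N similarity matrix."""
    S = l2norm(img_emb) @ l2norm(txt_emb).T / tau
    loss_i2t = -np.mean(np.diag(log_softmax(S)))    # each image vs. all captions
    loss_t2i = -np.mean(np.diag(log_softmax(S.T)))  # each caption vs. all images
    return 0.5 * (loss_i2t + loss_t2i)

def zero_shot(img_emb, class_txt_emb):
    """Predict the class whose prompt embedding is most similar to the image."""
    return np.argmax(l2norm(img_emb) @ l2norm(class_txt_emb).T, axis=1)
```

Correctly matched pairs along the diagonal of \(S\) produce a lower loss than shuffled pairs, and zero-shot prediction is just a nearest-neighbor lookup among class prompts — the Analogizer recipe again.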
6.2 ALIGN (Jia et al. 2021)
Almost contemporaneous with CLIP, with even larger scale (1.8 billion noisy alt-text pairs) and the same loss. The lesson: data scale can compensate for data quality.
6.3 SigLIP (Zhai et al. 2023)
Key change: replace CLIP's softmax-based InfoNCE with a sigmoid-based binary classification:
\[ \mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N} \log \sigma\big( z_{ij} \,(t\, s_{ij} + b) \big) \]
where \(z_{ij} = +1\) if \(i = j\) else \(-1\), \(s_{ij}\) is the similarity, and \(t, b\) are learnable temperature and bias. Advantages:
- no need for full-batch softmax normalization, drastically reducing communication cost in distributed training;
- a small batch (e.g. 8k) achieves what SimCLR/CLIP need a 32k batch for;
- mathematically a per-pair independent binary classification — more robust.
SigLIP has become the de facto standard CLIP replacement after 2024.
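A minimal numpy sketch of the per-pair sigmoid loss (my function name; \(t = 10, b = -10\) roughly follow SigLIP's reported initialization):

```python
import numpy as np

def siglip_loss(S, t=10.0, b=-10.0):
    """Per-pair sigmoid loss: z_ij = +1 on the diagonal (matched), -1 elsewhere.
    S is the cosine-similarity matrix between images (rows) and texts (cols)."""
    N = len(S)
    z = 2 * np.eye(N) - 1                       # +1 if i == j else -1
    logits = z * (t * S + b)
    return np.mean(np.log1p(np.exp(-logits)))   # = mean of -log sigmoid(logits)
```

Each of the \(N^2\) pairs is scored independently, which is exactly why no cross-device softmax normalization is needed.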
7. Geometry of the Metric Space
The choice of embedding-space geometry is crucial for the Analogizers.
7.1 Euclidean Embedding
The most common. But Euclidean geometry is limited at modeling "hierarchical structure" — see the hyperbolic case below.
7.2 Spherical Embedding
Force \(\|z\| = 1\), reducing similarity to cosine. Advantages:
- numerically stable (gradients do not explode through norms);
- compatible with softmax (which assumes bounded logits);
- geometric uniformity (hyperspherical uniformity, Wang & Isola 2020).
Wang & Isola (2020) famously decomposed the contrastive objective into:
\[ \mathcal{L}_{\mathrm{align}} = \mathbb{E}_{(x, x^+) \sim p_{\mathrm{pos}}} \big\| f(x) - f(x^+) \big\|_2^2, \qquad \mathcal{L}_{\mathrm{uniform}} = \log \mathbb{E}_{x, y \sim p_{\mathrm{data}}} \, e^{-2 \| f(x) - f(y) \|_2^2} \]
Good contrastive models simultaneously achieve alignment (same-class together) and uniformity (uniform spread on the sphere); both are necessary.
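The two quantities are easy to measure on any embedding set; a minimal numpy sketch (function names mine, using Wang & Isola's default exponent \(t = 2\)):

```python
import numpy as np

def alignment(x, y):
    """Mean squared distance between positive pairs (lower = better aligned)."""
    return np.mean(np.sum((x - y) ** 2, axis=1))

def uniformity(x, t=2.0):
    """Log of the mean pairwise Gaussian potential (lower = more uniform)."""
    sq = np.sum((x[:, None] - x[None, :]) ** 2, axis=-1)
    iu = np.triu_indices(len(x), k=1)           # distinct pairs only
    return np.log(np.mean(np.exp(-t * sq[iu])))
```

Points spread evenly on the unit circle score better on uniformity than a collapsed embedding where every point coincides — the failure mode these metrics are designed to detect.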
7.3 Hyperbolic Embedding
Hyperbolic space (e.g. the Poincaré ball model) has volume that grows exponentially with radius, naturally suiting tree-like / hierarchical structures. Nickel & Kiela (2017) showed that hyperbolic embeddings precisely represent WordNet with far fewer dimensions. Gradient optimization on Riemannian manifolds requires special handling, however (Riemannian Adam).
8. Architectural Evolution of Metric Learning
flowchart TD
A["k-NN<br/>fixed Euclidean"] --> B["Mahalanobis metric learning"]
B --> C["LMNN (2009)"]
B --> D["ITML (2007)"]
C --> E["Siamese networks<br/>(Bromley 1993, revival 2014)"]
D --> E
E --> F["Triplet networks<br/>FaceNet (2015)"]
F --> G["Contrastive loss NT-Xent<br/>SimCLR (2020)"]
G --> H["MoCo (2020)<br/>queue + momentum"]
G --> I["BYOL (2020)<br/>no negatives"]
G --> J["DINO (2021)<br/>self-distillation"]
G --> K["CLIP (2021)<br/>cross-modal"]
K --> L["SigLIP (2023)<br/>sigmoid"]
K --> M["BLIP/Flamingo/<br/>multimodal LLMs"]
H --> N["RAG era<br/>retrieval embeddings"]
style G fill:#fde68a
style K fill:#bbf7d0
style N fill:#bbf7d0
9. Evaluation
Metric learning cannot be evaluated by accuracy — what we evaluate is embedding quality, not classification. Common metrics:
9.1 Recall@K
For each query in the test set, check whether any of its \(K\) nearest neighbors share its class:
\[ \mathrm{Recall@}K = \frac{1}{|Q|} \sum_{q \in Q} \mathbb{1}\big[\, \exists\, x \in \mathrm{top}\text{-}K(q) : y_x = y_q \,\big] \]
The standard metric for face recognition and image retrieval.
9.2 mAP (mean Average Precision)
For each query, compute AP after sorting by similarity:
\[ \mathrm{AP} = \frac{1}{|\mathcal{R}|} \sum_{k} P(k)\, \mathrm{rel}(k) \]
where \(P(k)\) is the precision of the top \(k\) results, \(\mathrm{rel}(k)\) indicates whether rank \(k\) is relevant, and \(\mathcal{R}\) is the set of relevant items.
mAP is the average of AP over all queries. More comprehensive than Recall@K but more expensive.
9.3 NMI (Normalized Mutual Information)
Cluster the embeddings and compute mutual information against the true labels — measuring the semantic consistency of the embedding space.
9.4 t-SNE / UMAP Visualization
Project high-dimensional embeddings to 2D for inspection. t-SNE emphasizes local structure, UMAP balances global structure — but visualization is auxiliary, not a primary evaluation.
9.5 Linear Probing
Freeze the encoder and train a linear classifier on the embeddings. The accuracy reflects the embedding's "linear separability" — the standard evaluation for self-supervised pretraining since SimCLR.
9.6 KNN Evaluation
Freeze the encoder and run k-NN classification directly. More "fundamentalist" than linear probing — it directly tests the Analogizer premise: good embedding ⇒ neighbors are useful.
Code Example: SimCLR-style NT-Xent Implementation
import torch
import torch.nn.functional as F
def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
"""
z1, z2: shape [N, d]; embeddings of the two augmented views in a batch
returns: scalar NT-Xent loss
"""
N = z1.shape[0]
z = torch.cat([z1, z2], dim=0) # [2N, d]
z = F.normalize(z, dim=1) # L2 normalize to unit sphere
sim = z @ z.T / tau # [2N, 2N] cosine similarities
sim.fill_diagonal_(-1e9) # exclude self
# Positive index: the positive of i is i+N (or i-N)
targets = torch.cat([torch.arange(N, 2*N), torch.arange(0, N)]).to(z.device)
return F.cross_entropy(sim, targets)
Cross-references
- The Analogizers as a tribe: this page, §1 (the tribe's entry point)
- The overarching narrative of visual self-supervision: ../../../1_DeepLearning/Computer_Vision/视觉自监督.md
- Siamese in meta-learning and few-shot: ../../../1_DeepLearning/Computer_Vision/meta_learning.md
- CLIP's place in the foundation-model lineage: ../../../1_DeepLearning/Foundation_Models/vision_foundation.md
- How to deploy learned embeddings (vector databases, ANN, RAG): Recommender Systems and Modern Retrieval
References
- Hadsell, R., Chopra, S., & LeCun, Y. (2006). Dimensionality reduction by learning an invariant mapping. CVPR. — The foundational paper on contrastive loss.
- Bromley, J., et al. (1993). Signature verification using a Siamese time delay neural network. NIPS. — The original Siamese network.
- Davis, J. V., et al. (2007). Information-theoretic metric learning. ICML. — ITML.
- Weinberger, K. Q., & Saul, L. K. (2009). Distance metric learning for large margin nearest neighbor classification. JMLR. — LMNN.
- Schroff, F., Kalenichenko, D., & Philbin, J. (2015). FaceNet: A unified embedding for face recognition and clustering. CVPR. — Triplet loss + large-scale face recognition.
- Sohn, K. (2016). Improved deep metric learning with multi-class N-pair loss. NIPS.
- van den Oord, A., Li, Y., & Vinyals, O. (2018). Representation learning with contrastive predictive coding. arXiv:1807.03748. — InfoNCE and the mutual-information lower bound.
- Chen, T., et al. (2020). A simple framework for contrastive learning of visual representations. ICML. — SimCLR / NT-Xent.
- He, K., et al. (2020). Momentum contrast for unsupervised visual representation learning. CVPR. — MoCo.
- Grill, J.-B., et al. (2020). Bootstrap your own latent: A new approach to self-supervised learning. NeurIPS. — BYOL, showing one can do without negatives.
- Chen, X., & He, K. (2021). Exploring simple Siamese representation learning. CVPR. — SimSiam.
- Caron, M., et al. (2021). Emerging properties in self-supervised vision Transformers. ICCV. — DINO.
- Radford, A., et al. (2021). Learning transferable visual models from natural language supervision. ICML. — CLIP.
- Zhai, X., et al. (2023). Sigmoid loss for language image pre-training. ICCV. — SigLIP.
- Khosla, P., et al. (2020). Supervised contrastive learning. NeurIPS. — SupCon.
- Wang, T., & Isola, P. (2020). Understanding contrastive representation learning through alignment and uniformity on the hypersphere. ICML. — Alignment + uniformity decomposition.
- Nickel, M., & Kiela, D. (2017). Poincaré embeddings for learning hierarchical representations. NeurIPS. — Hyperbolic embeddings.
Recommender Systems and Modern Retrieval
Scope of this page: of the three Analogizer pages, this one overlaps least with the rest of the site and is meant to be the deepest. From the collaborative filtering of the 1990s all the way to the RAG / RETRO line of the 2020s — recommender systems, case-based reasoning (CBR), and vector retrieval are the same thing: given a large case base, find the cases most similar to the current query and "borrow" their answers. This is the Analogizers' core productive force in industry.
The Analogizers' biggest industrial application has never been the SVM but the recommender system. Amazon's "people who looked at this also looked at...", Netflix's movie picks, YouTube's video feed — the kernel under all of them is finding similar users / similar items. The same idea was migrated to ANN retrieval (FAISS / HNSW) in the 2010s, and in the 2020s became the infrastructure layer of RAG.
This page tells that engineering thread end-to-end.
1. A Survey of Recommender Systems
By "information source", recommender systems fall into three paradigms:
| Paradigm | Inputs | Idea | Representative methods | Strengths | Weaknesses |
|---|---|---|---|---|---|
| Content-based | item features + user history | recommend items whose features resemble what the user liked | TF-IDF + cosine | strong cold-start for new items; explainable | bound by explicit feature engineering; narrow horizon |
| Collaborative Filtering (CF) | user-item interaction matrix | leverage "the wisdom of crowds" — similar users like similar items | k-NN CF, matrix factorization | no content features needed; latent patterns emerge | user / item cold start; sparsity |
| Hybrid | both | weighted / switching / feature fusion | Wide & Deep, two-tower | best of both | engineering complexity |
Recommendation = a textbook Analogizer application: for a new query (user), find historically similar queries and reuse their answers (items).
2. Collaborative Filtering
2.1 The User-Item Matrix
Suppose we have \(U\) users and \(I\) items. The rating matrix \(R \in \mathbb{R}^{U \times I}\) has entries \(R_{ui}\) being user \(u\)'s rating on item \(i\) (mostly missing). Task: fill in the missing entries.
2.2 User-based CF (Goldberg 1992; GroupLens 1994)
The predicted rating of item \(i\) for user \(u\):
\[ \hat r_{ui} = \bar r_u + \frac{\sum_{v \in N(u)} \mathrm{sim}(u, v)\,(r_{vi} - \bar r_v)}{\sum_{v \in N(u)} |\mathrm{sim}(u, v)|} \]
where \(N(u)\) is the \(k\) users most similar to \(u\), and \(\bar r_u\) is \(u\)'s average rating (used to subtract a per-user bias).
2.3 Item-based CF (Sarwar et al. 2001)
Amazon's pivotal innovation: use item similarity instead of user similarity — item similarity is more stable (user tastes drift, item attributes do not), and the number of items is usually far smaller than the number of users:
\[ \hat r_{ui} = \frac{\sum_{j \in N(i; u)} \mathrm{sim}(i, j)\, r_{uj}}{\sum_{j \in N(i; u)} |\mathrm{sim}(i, j)|} \]
where \(N(i; u)\) is the set of \(k\) items rated by \(u\) and most similar to \(i\).
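A minimal numpy sketch of the item-based prediction (function name and toy data are mine; \(R\) uses 0 for missing ratings):

```python
import numpy as np

def predict_item_cf(u, i, R, sim, k=2):
    """Predict r_ui from the k items rated by u that are most similar to i.
    R: rating matrix with 0 = missing; sim: item-item similarity matrix."""
    rated = np.where(R[u] > 0)[0]
    rated = rated[rated != i]
    if len(rated) == 0:
        return 0.0
    top = rated[np.argsort(-sim[i, rated])][:k]   # k most similar rated items
    w = sim[i, top]
    return float(w @ R[u, top] / np.abs(w).sum()) # similarity-weighted average
```

In production the similarity matrix is precomputed offline (this is exactly Amazon's item-to-item table), so prediction at serving time is a cheap weighted lookup.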
2.4 Similarity Measures
| Measure | Formula | Note |
|---|---|---|
| Pearson | \(\frac{\sum_i (r_{ui} - \bar r_u)(r_{vi} - \bar r_v)}{\sqrt{\sum (r_{ui}-\bar r_u)^2} \sqrt{\sum (r_{vi}-\bar r_v)^2}}\) | subtracts means automatically; handles different rating scales |
| Cosine | \(\frac{r_u^T r_v}{\|r_u\|\|r_v\|}\) | simple and fast; implicitly assumes missing = 0 (problematic) |
| Adjusted Cosine | cosine after subtracting each item's mean | best choice for item-based CF |
| Jaccard | \(\lvert R_u \cap R_v \rvert \,/\, \lvert R_u \cup R_v \rvert\) | for binary / implicit feedback; ignores rating values |
2.5 The Cold-Start Problem
CF's Achilles' heel:
| Type | Description | Mitigation |
|---|---|---|
| New-user cold start | no history | onboarding ratings / demographics / fall back to content recommendation |
| New-item cold start | nobody rated it | content recommendation / use item features as side information |
| Sparse cold start | matrix is 99.9% missing | matrix factorization (learn low-dim representations, generalize) |
3. Matrix Factorization
CF's sparsity problem is gracefully solved by matrix factorization — approximating \(R\) by a low-rank factorization.
3.1 Truncated SVD
\[ R \approx U_k \Sigma_k V_k^T \]
Keep only the top \(k\) singular values for a low-rank approximation.
But classical SVD assumes a dense matrix and cannot handle missing entries.
3.2 FunkSVD (Funk 2006, Netflix Prize)
Fits only on observed entries:
\[ \hat r_{ui} = \mu + b_u + b_i + p_u^T q_i \]
Parameters:
- \(\mu\): global average;
- \(b_u\): user bias ("strict / lenient");
- \(b_i\): item bias ("popular / unpopular");
- \(p_u, q_i \in \mathbb{R}^k\): latent factor vectors for users and items.
Objective:
\[ \min_{p, q, b} \; \sum_{(u, i) \in \mathcal{O}} \big( r_{ui} - \mu - b_u - b_i - p_u^T q_i \big)^2 + \lambda \big( \|p_u\|^2 + \|q_i\|^2 + b_u^2 + b_i^2 \big) \]
The sum is taken only over the observed set \(\mathcal{O}\). Solving by SGD:
for (u, i, r) in observed pairs:
e = r - μ - b_u - b_i - p_u^T q_i
b_u ← b_u + η (e - λ b_u)
b_i ← b_i + η (e - λ b_i)
p_u ← p_u + η (e q_i - λ p_u)
q_i ← q_i + η (e p_u - λ q_i)
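The update loop above runs as-is; a minimal numpy version on a toy matrix (data and hyperparameters are illustrative, not from Funk's original run):

```python
import numpy as np

rng = np.random.default_rng(0)
R = np.array([[5, 4, 0, 1, 0],
              [4, 0, 0, 1, 1],
              [1, 1, 0, 5, 4],
              [0, 1, 5, 4, 0]], dtype=float)           # toy ratings, 0 = missing
obs = [(u, i, R[u, i]) for u in range(4) for i in range(5) if R[u, i] > 0]

k, eta, lam = 2, 0.05, 0.02                            # illustrative hyperparameters
mu = np.mean([r for _, _, r in obs])                   # global average
b_u, b_i = np.zeros(4), np.zeros(5)                    # user / item biases
P = 0.1 * rng.normal(size=(4, k))                      # user latent factors
Q = 0.1 * rng.normal(size=(5, k))                      # item latent factors

for epoch in range(200):
    for u, i, r in obs:
        e = r - mu - b_u[u] - b_i[i] - P[u] @ Q[i]
        b_u[u] += eta * (e - lam * b_u[u])
        b_i[i] += eta * (e - lam * b_i[i])
        P[u], Q[i] = (P[u] + eta * (e * Q[i] - lam * P[u]),
                      Q[i] + eta * (e * P[u] - lam * Q[i]))

rmse = np.sqrt(np.mean([(r - mu - b_u[u] - b_i[i] - P[u] @ Q[i]) ** 2
                        for u, i, r in obs]))
```

After a couple hundred epochs the training RMSE on the observed entries drops well below the raw rating spread, and the missing entries can be predicted from \(\mu + b_u + b_i + p_u^T q_i\).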
Funk used this method to improve Netflix Prize RMSE by 7% — a phenomenal result in 2006.
3.3 ALS (Alternating Least Squares)
Joint optimization of \(P, Q\) is non-convex, but fixing one makes the other a convex least-squares problem — alternate the updates:
\[ p_u \leftarrow \Big( \sum_{i \in \mathcal{O}_u} q_i q_i^T + \lambda I \Big)^{-1} \sum_{i \in \mathcal{O}_u} r_{ui}\, q_i, \qquad q_i \leftarrow \Big( \sum_{u \in \mathcal{O}_i} p_u p_u^T + \lambda I \Big)^{-1} \sum_{u \in \mathcal{O}_i} r_{ui}\, p_u \]
ALS parallelizes naturally (each row independent) and is the default recommender in Spark MLlib.
3.4 NMF (Non-negative Matrix Factorization)
Constrains \(P, Q \geq 0\). More interpretable (factors can be read as "topics") but less expressive than unconstrained MF.
4. Implicit Feedback
Key observation: in reality explicit ratings (1–5 stars) are scarce; implicit signals (clicks, watch time, purchases) dominate. The difficulty of implicit feedback: unobserved ≠ disliked — possibly just unseen.
4.1 WMF (Weighted Matrix Factorization, Hu et al. 2008)
Separates a binary preference matrix \(P\) from a confidence matrix \(C\):
- \(p_{ui} = 1\) if \(r_{ui} > 0\) (interaction occurred);
- \(c_{ui} = 1 + \alpha r_{ui}\) (more interaction means more confidence in "liked").
Objective:
\[ \min_{x, y} \; \sum_{u, i} c_{ui}\, \big( p_{ui} - x_u^T y_i \big)^2 + \lambda \Big( \sum_u \|x_u\|^2 + \sum_i \|y_i\|^2 \Big) \]
Note: the sum runs over all \((u, i)\), including unobserved pairs (treated as 0 with low confidence). The cost is kept linear via the algebraic trick \(Y^T C^u Y = Y^T Y + Y^T (C^u - I) Y\) inside ALS.
4.2 BPR (Bayesian Personalized Ranking, Rendle et al. 2009)
Revolutionary insight: don't predict scores — directly optimize the ordering. With triplets \((u, i, j)\) where \(i\) is something user \(u\) interacted with (positive) and \(j\) is not (negative), BPR assumes the user prefers \(i > j\):
\[ \max_\Theta \; \sum_{(u, i, j)} \ln \sigma(\hat x_{uij}) - \lambda \|\Theta\|^2 \]
where \(\hat x_{uij} = \hat r_{ui} - \hat r_{uj}\) (any scoring model, commonly MF).
Advantages:
- directly optimizes AUC (pairwise ranking);
- naturally fits implicit feedback;
- form is reminiscent of SVM hinge loss (the "margin" idea).
BPR is still the baseline of implicit recommendation today — every new method must be compared to BPR-MF.
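The per-triplet BPR objective is just a logistic loss on the score difference; a minimal numpy sketch (function name mine):

```python
import numpy as np

def bpr_loss(x_ui, x_uj):
    """-ln sigmoid(x_ui - x_uj): small when the positive item outscores the negative."""
    return np.log1p(np.exp(-(x_ui - x_uj)))  # numerically stable form
```

Plugging in MF scores \(\hat x_{ui} = p_u^T q_i\) and taking SGD steps on sampled \((u, i, j)\) triplets recovers the standard BPR-MF baseline.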
5. Deep Recommendation
After 2014 deep learning swept through recommender systems.
5.1 NCF (Neural Collaborative Filtering, He et al. 2017)
Replaces MF's "dot product \(p_u^T q_i\)" with an MLP:
\[ \hat r_{ui} = \mathrm{MLP}\big( [\mathbf{e}_u ; \mathbf{e}_i] \big) \]
where \(\mathbf{e}_u, \mathbf{e}_i\) are ID embeddings. NeuMF further fuses GMF (generalized MF) and MLP in parallel.
Controversy: Rendle et al. (2020) showed that a well-tuned MF beats NCF, sparking the "is NCF really better, or just more tuned" reflection — the "fake progress" problem in recommender systems.
5.2 AutoRec (Sedhain et al. 2015)
Casts recommendation as an autoencoding task: input is the user's rating vector \(r_u\) (with missing entries), output is the densely reconstructed vector \(\hat r_u\).
5.3 Wide & Deep (Cheng et al. 2016)
Google Play's deployed dual-track architecture:
- Wide part: a linear model over crossed features that memorizes explicit rules ("people who installed A install B");
- Deep part: embeddings + MLP that generalize to unseen combinations.
Trained jointly — an industry hit. Successors: DeepFM, xDeepFM, DCN — all aiming to handle feature interactions better.
5.4 DIN (Deep Interest Network, Zhou et al. 2018)
Alibaba e-commerce setting: a user's interests are multi-modal and dynamic — for query item \(i\), attention-aggregate the user's history sequence \(\{\mathbf{e}_1, \dots, \mathbf{e}_H\}\):
\[ \mathbf{u} = \sum_{j=1}^{H} a(\mathbf{e}_j, \mathbf{e}_i)\, \mathbf{e}_j \]
with weights \(a\) produced by a small MLP. An early successful application of attention in recommendation, predating Transformer. Successors DIEN and SIM tackle longer sequence modeling.
5.5 Two-Tower Models: the De Facto Standard for Modern Retrieval
flowchart LR
U["User features<br/>(ID, history, demographics)"] --> UT["User tower fU"]
I["Item features<br/>(ID, content, attributes)"] --> IT["Item tower fI"]
UT --> ZU["User embedding zu"]
IT --> ZI["Item embedding zi"]
ZU & ZI --> S["Similarity zu · zi"]
S --> P["Prediction / ranking"]
style UT fill:#fde68a
style IT fill:#fde68a
style S fill:#bbf7d0
Key property: user and item embeddings are computed in fully decoupled ways —
- offline: batch-compute all item embeddings and store them in an ANN index;
- online: when a user arrives, compute the user embedding in real time and run a k-NN query.
Latency can drop into the sub-millisecond range — this is industrial-grade retrieval. Representative papers: YouTube DNN (Covington 2016), Sampling-bias-corrected (Yi 2019).
Training: typically with in-batch negatives + InfoNCE (identical to contrastive learning):
\[ \mathcal{L} = -\frac{1}{|B|} \sum_{(u, i) \in B} \log \frac{e^{z_u \cdot z_i / \tau}}{\sum_{i' \in B} e^{z_u \cdot z_{i'} / \tau}} \]
Contrastive learning = metric learning = two-tower retrieval = modern RAG retriever training — all are essentially the same objective.
5.6 Ranking vs. Retrieval
Industrial recommendation has two stages:
| Stage | Task | Candidate scale | Mainstream methods |
|---|---|---|---|
| Retrieval | from billions select thousands | \(10^9 \to 10^3\) | two-tower + ANN |
| Ranking | precise scoring of thousands | \(10^3 \to 10^1\) | DIN / DCN / GBDT |
Two-tower cannot be used for ranking — ranking demands fine-grained crossed features that decoupled towers cannot express.
6. Graph-based Recommendation
User–item interactions form a bipartite graph naturally, so GNNs apply.
6.1 PinSage (Ying et al. 2018)
Pinterest's industrial GraphSAGE on a 3-billion-node graph:
- Neighbor sampling: random walks compute importance;
- Convolution: aggregate weighted neighbor embeddings into the central node;
- MapReduce-style offline training.
PinSage serves 200 million MAUs and is the iconic GNN production case.
6.2 LightGCN (He et al. 2020)
Reverse engineering: ablate GCN's non-linear transform and feature transform, keeping only neighbor aggregation:
\[ e_u^{(l+1)} = \sum_{i \in N(u)} \frac{1}{\sqrt{|N(u)|\,|N(i)|}}\, e_i^{(l)}, \qquad e_i^{(l+1)} = \sum_{u \in N(i)} \frac{1}{\sqrt{|N(i)|\,|N(u)|}}\, e_u^{(l)} \]
The final embedding is a weighted average across layers. Striking finding: dropping non-linearity actually helps — the features in recommendation are already ID embeddings, so non-linear transforms are unnecessary. LightGCN is a strong baseline beyond BPR-MF.
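One propagation layer is a single normalized matrix product; a minimal numpy sketch (function name mine; \(A\) is the binary user-item interaction matrix):

```python
import numpy as np

def lightgcn_layer(E_u, E_i, A):
    """One LightGCN propagation layer on the user-item bipartite graph.
    No feature transform, no non-linearity -- only normalized aggregation."""
    deg_u = A.sum(axis=1, keepdims=True)              # |N(u)| per user
    deg_i = A.sum(axis=0, keepdims=True)              # |N(i)| per item
    norm = A / np.sqrt(np.maximum(deg_u, 1) * np.maximum(deg_i, 1))
    return norm @ E_i, norm.T @ E_u                   # users <- items, items <- users
```

Stacking a few such layers and averaging the per-layer embeddings reproduces LightGCN's final representation.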
7. Case-Based Reasoning (CBR)
CBR is the symbolic-AI era's "proto-RAG" — to understand CBR is to understand the conceptual roots of RAG.
7.1 Aamodt & Plaza 1994: The 4R Cycle
flowchart LR
P["New problem"] --> R1["1. Retrieve<br/>find similar cases"]
R1 --> R2["2. Reuse<br/>reuse the case's solution"]
R2 --> R3["3. Revise<br/>adapt to the new problem"]
R3 --> R4["4. Retain<br/>store the new case"]
R4 --> CB[("Case Base")]
CB --> R1
style R1 fill:#fde68a
style R4 fill:#bbf7d0
| Stage | Key questions |
|---|---|
| Retrieve | what similarity? how to retrieve efficiently? |
| Reuse | reuse directly? or extract experience? |
| Revise | adapt the solution with domain knowledge (manual / rules / models) |
| Retain | what to keep (avoid base bloat)? how to organize? |
7.2 Case Base Structures
- Flat case base: cases stored linearly, depending on a good index (k-d tree, inverted index);
- Hierarchical case base: organized by abstraction level (e.g. medicine: organ → system → symptom → diagnosis);
- Prototypes + instances: each prototype represents a category, instances hang under the prototype.
7.3 Classical CBR Systems
| System | Domain | Era | Notes |
|---|---|---|---|
| CASEY (Koton 1989) | cardiology diagnosis | 1980s | hybrid with MYCIN-style rule systems |
| HYPO (Ashley 1990) | legal reasoning | 1990s | trade-secret cases; introduced "factor" notion |
| CHEF (Hammond 1986) | cooking planning | 1980s | recipe adaptation |
| Customer service helpdesk | ticket matching | 1990s+ | still in use — "find similar past tickets" |
7.4 CBR vs. ML
CBR has long been overlooked by the ML mainstream, but its paradigm is robust:
| Dimension | Traditional ML | CBR |
|---|---|---|
| Knowledge form | parametric model | cases + similarity |
| Learning cost | high (retraining) | low (incrementally add cases) |
| Interpretability | usually poor | "because of the similar case X" |
| Adapting to new data | retraining needed | just Retain |
Modern view: the Retrieve-Reuse-Revise-Retain cycle of CBR is reborn in the RAG / Agent era — using LLMs as the Reuse + Revise engine and a vector database as the case base yields the "LLM agent + memory" paradigm.
8. Modern Retrieval: the Bridge from the Analogizers to the Era of Large Models
Cross-page deep dive: ../../../AI_Agents/04_Memory_Systems/长期记忆与向量数据库.md covers the role of vector databases in agent memory. This section focuses on the algorithmic level of retrieval methods, which is its prerequisite.
8.1 Retrieval Paradigms: dense vs. sparse
| Paradigm | Representation | Similarity | Representatives |
|---|---|---|---|
| Sparse | high-dim sparse (bag of words) | BM25, TF-IDF | Lucene, Elasticsearch |
| Dense | low-dim dense embedding | cosine, dot product | DPR, ColBERT, Sentence-BERT |
| Hybrid | both combined | weighted fusion / RRF | Vespa, Weaviate |
BM25 (1994)
The classical sparse retrieval method:
\[ \mathrm{score}(q, d) = \sum_{t \in q} \mathrm{IDF}(t) \cdot \frac{f(t, d)\,(k_1 + 1)}{f(t, d) + k_1 \big( 1 - b + b\, \frac{|d|}{\mathrm{avgdl}} \big)} \]
with \(k_1 \approx 1.2 \sim 2.0\) and \(b \approx 0.75\). Still one of the strongest baselines — many papers tout fancy dense retrieval, only to find BM25 trailing by just a few points.
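The whole scoring function fits in a few lines of pure Python (my function name and toy corpus; real engines precompute document frequencies and use an inverted index):

```python
import math

def bm25_score(query, doc, docs, k1=1.5, b=0.75):
    """BM25 score of `doc` for `query`; query and docs are lists of tokens."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    score = 0.0
    for term in query:
        df = sum(term in d for d in docs)                  # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)    # smoothed, kept positive
        tf = doc.count(term)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

docs = [["cat", "sat", "mat"], ["dog", "sat", "log"], ["cat", "cat", "hat"]]
```

Note the term-frequency saturation: doubling `tf` increases the score, but with diminishing returns governed by \(k_1\), while \(b\) controls how strongly long documents are penalized.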
DPR (Dense Passage Retrieval, Karpukhin et al. 2020)
Dual BERT towers trained for QA retrieval:
- query encoder + passage encoder (no shared weights);
- in-batch negatives + InfoNCE;
- after training, passages are encoded offline into FAISS.
DPR beats BM25 on Natural Questions, opening the era of dense retrieval.
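DPR's in-batch InfoNCE objective is a few lines of NumPy. This is a sketch of the loss only, not the full training loop; `Q` and `P` stand for the query-tower and passage-tower output embeddings of one batch:

```python
import numpy as np

def in_batch_infonce(Q, P, tau=0.05):
    """In-batch InfoNCE: query i's positive is passage i; the other B-1
    passages in the batch serve as free negatives (DPR's training setup)."""
    sims = (Q @ P.T) / tau                      # (B, B) score matrix
    sims -= sims.max(axis=1, keepdims=True)     # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))         # NLL of the matched pairs

B, d = 8, 16
rng = np.random.default_rng(0)
Q = rng.standard_normal((B, d))
Q /= np.linalg.norm(Q, axis=1, keepdims=True)
aligned = in_batch_infonce(Q, Q)       # every query matches its own passage
shuffled = in_batch_infonce(Q, Q[::-1])  # positives scrambled
print(aligned, shuffled)
```

Aligned pairs give a near-zero loss, scrambled positives a large one, which is exactly the signal that shapes the embedding space.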
8.2 Comparison of ANN Algorithms
When the vector count reaches the billions, exact k-NN is infeasible. ANN (Approximate Nearest Neighbor) trades accuracy for speed.
| Algorithm | Idea | Accuracy | Speed | Memory | Build cost | Use case |
|---|---|---|---|---|---|---|
| Brute-force (Flat) | compare one by one | 100% | slow (\(O(N)\)) | \(O(N)\) | \(O(1)\) | \(N < 10^6\) or ground truth |
| IVF (Inverted File) | cluster (\(k\)-means) then invert | 90–99% | fast | \(O(N)\) | \(O(Nk)\) | large data, coarse + fine ranking |
| PQ (Product Quantization) | sub-vector quantization compression | medium | very fast | very small | medium | memory-constrained settings |
| IVF-PQ | IVF + PQ combined | medium-high | very fast | very small | medium | FAISS default for large scale |
| HNSW | hierarchical small-world graph | very high | very fast | large (\(O(M \cdot N)\) edges) | slow | the universal best; the standard choice |
| LSH | locality-sensitive hashing | medium | fast | medium | fast | historical method, surpassed by HNSW |
| ScaNN (Google) | IVF + anisotropic quantization | high | very fast | small | medium | Google's internal workhorse |
| Annoy (Spotify) | random-projection forest | medium | fast | medium | fast | static indexes |
Industrial decision tree:
- Data < 1M, accuracy paramount → Flat (FAISS IndexFlatIP);
- 1M – 100M, memory sufficient → HNSW (the default choice);
- 100M+, memory tight → IVF-PQ;
- need online insertion → HNSW (IVF indexes are hard to update incrementally).
8.3 HNSW (Hierarchical Navigable Small World) in Detail
Malkov & Yashunin's 2016 HNSW is the "textbook ANN choice" — near-100% accuracy, sub-millisecond queries, friendly to incremental insertion.
Core idea: organize data points as a multi-layer graph, where each layer is a small-world graph (nearby neighbors plus a few long-range links); the upper layers are sparse (for quick jumps) and the lowest layer is dense (for fine-grained search).
```mermaid
flowchart TD
    subgraph L2["Layer 2 (sparse, long jumps)"]
        A2["A"] -.- F2["F"]
    end
    subgraph L1["Layer 1 (medium density)"]
        A1["A"] --- B1["B"] --- F1["F"]
        B1 --- D1["D"]
    end
    subgraph L0["Layer 0 (all points, dense)"]
        A0["A"] --- B0["B"] --- C0["C"] --- D0["D"]
        D0 --- E0["E"] --- F0["F"]
        C0 --- E0
    end
    A2 -.-> A1 -.-> A0
    F2 -.-> F1 -.-> F0
```
Search algorithm:
```
HNSW search for query q:
    current ← entry point on the top layer
    for layer from the top down to layer 1:
        greedily move current to the point nearest q on this layer
    on layer 0, run best-first beam search with candidate-list width ef-search
    return top-k
```
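The greedy routing used on the upper layers can be sketched on a plain adjacency list. This is a didactic toy, not the real data structure; actual HNSW keeps an ef-bounded candidate heap rather than a single current node:

```python
def greedy_search(graph, dist, entry, query):
    """Greedy routing: hop to the neighbor closest to `query`
    until no neighbor improves (a local minimum of `dist`)."""
    current = entry
    improved = True
    while improved:
        improved = False
        for nb in graph[current]:
            if dist(nb, query) < dist(current, query):
                current, improved = nb, True
    return current

# Toy 1-D "layer": points on a line with chain adjacency
points = {"A": 0.0, "B": 2.0, "C": 5.0, "D": 9.0}
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
dist = lambda node, q: abs(points[node] - q)
print(greedy_search(graph, dist, "A", 8.0))  # routes A → B → C → D
```

The long-range links of a small-world graph exist precisely so this greedy walk needs only a logarithmic number of hops instead of crawling the chain.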
Insertion:
- Randomly choose a level \(\ell\) (exponential distribution, \(\ell = \lfloor -\ln(r) \cdot m_L \rfloor\));
- greedily search from the top down to layer \(\ell + 1\);
- from layer \(\ell\) down to 0, connect to the \(M\) nearest neighbors at each level (using a heuristic that preserves diversity).
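The level draw in the first step is geometric and can be checked numerically. `sample_level` below is an illustrative helper, using the common choice \(m_L = 1/\ln M\) so that each successive layer holds roughly a \(1/M\) fraction of the points:

```python
import math
import random

def sample_level(M=16, rng=random):
    """HNSW insertion level: l = floor(-ln(U) * mL) with mL = 1/ln(M)."""
    mL = 1.0 / math.log(M)
    return int(-math.log(rng.random()) * mL)

random.seed(0)
levels = [sample_level() for _ in range(100_000)]
frac_above_0 = sum(l > 0 for l in levels) / len(levels)
print(round(frac_above_0, 4))  # expect roughly 1/16 = 0.0625
```

Algebraically, \(P(\ell \ge 1) = P(U \le e^{-\ln M}) = 1/M\), which is what the empirical fraction confirms.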
Key parameters:
- \(M\): maximum links per node (typically 16–48);
- ef-construction: candidate-list width during construction (typically 200);
- ef-search: candidate-list width during query, the accuracy–speed knob (typically 50–200).
Distance functions: HNSW does not fix a metric; common choices include L2, inner product, and cosine — as long as the triangle inequality holds or approximately holds (in practice it often works even for non-metric distances).
8.4 RAG = the Renaissance of the Analogizers
```mermaid
flowchart LR
    Q["User query"] --> EM["Embedding model<br/>(metric-learning trained)"]
    EM --> QV["Query vector"]
    QV --> ANN["ANN index<br/>(HNSW/FAISS)"]
    DB[("Document base<br/>(pre-embedded)")] --> ANN
    ANN --> TOPK["Top-K documents"]
    TOPK --> LLM["LLM<br/>(in-context reasoning)"]
    Q --> LLM
    LLM --> ANS["Answer + citations"]
    style ANN fill:#fde68a
    style LLM fill:#bbf7d0
```
RAG (Retrieval-Augmented Generation, Lewis et al. 2020) feeds the top-k retrieved documents as LLM context, so that the LLM answers "having read the relevant material".
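The pipeline in the diagram can be sketched end to end. The bag-of-words `embed` below is a toy stand-in for a metric-learned encoder, and the brute-force matmul stands in for the ANN index; only the prompt assembly step is shown in place of a real LLM call:

```python
import numpy as np

docs = [
    "HNSW builds a hierarchical small world graph for ANN search",
    "BM25 is a sparse lexical ranking function",
    "BPR optimizes pairwise ranking from implicit feedback",
]

# Toy embedding: normalized bag-of-words over the corpus vocabulary.
vocab = {t: i for i, t in enumerate(sorted({t for d in docs for t in d.lower().split()}))}

def embed(text):
    v = np.zeros(len(vocab))
    for tok in text.lower().split():
        if tok in vocab:
            v[vocab[tok]] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

doc_vecs = np.stack([embed(d) for d in docs])    # offline "indexing"

def retrieve(query, k=1):
    sims = doc_vecs @ embed(query)               # cosine (vectors normalized)
    return [docs[i] for i in np.argsort(-sims)[:k]]

context = "\n".join(retrieve("what does HNSW build"))
prompt = f"Answer using the context:\n{context}\n\nQ: what does HNSW build"
# `prompt` would now be sent to the LLM together with the query.
```

Every box in the diagram maps onto one line here: embed the query, score against the pre-embedded base, take top-k, stuff it into the context.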
RAG vs. classical CBR side by side:
| Dimension | Classical CBR (Aamodt 1994) | RAG (2020+) |
|---|---|---|
| Case base | structured cases | documents / passages (chunks) |
| Retrieval | hand-crafted similarity (feature matching) | embedding + ANN |
| Reuse engine | rule system / human | LLM (Transformer) |
| Revise | manual editing | LLM in-context |
| Retain | append to case base | append to document base (incremental indexing) |
| Evaluation | case-hit rate | Recall@K + end-to-end QA accuracy |
Tribal viewpoint: RAG is not a "patch" on the LLM — it is the return of the Analogizers against parametric methods. The LLM paradigm of compressing knowledge into weights has fundamental limits (hallucination, staleness, no provenance), and the Analogizers' "case base + retrieval" is reborn in the era of large models under the name RAG.
8.5 Retrieval-Augmented Training
Push retrieval from "inference time" back into "training time":
RETRO (Borgeaud et al. 2022, DeepMind)
Idea: when training the LLM, every chunk retrieves top-k neighbors from a 2-trillion-token database, and these are injected via cross-attention into the Transformer.
- Effect: 7.5B parameters + retrieval ≈ 175B parameters without retrieval;
- Significance: a parameter-efficiency revolution — knowledge no longer has to be stuffed entirely into weights.
Atlas (Izacard et al. 2022, Meta)
End-to-end training of retrieval + generation for knowledge-intensive NLP tasks (QA, fact checking). Few-shot performance surpasses 50× larger PaLM.
REPLUG (Shi et al. 2023)
Even lighter: train only the retriever to "cooperate" with a frozen black-box LLM. The LLM's outputs serve as supervision for the retriever — the retriever learns to "pick documents most useful to the LLM".
9. Engineering Practice: Choosing a Vector Database
9.1 Mainstream Vector Databases Compared
| DB | Type | Indexes | Persistence | Metadata filtering | Distributed | Best for |
|---|---|---|---|---|---|---|
| FAISS | library | full Flat/IVF/PQ/HNSW | DIY | weak | none | offline, research, advanced users |
| Milvus | service | HNSW/IVF/DiskANN | strong | strong | strong (cloud-native) | large-scale production |
| Qdrant | service (Rust) | HNSW | strong | strong (payload filters) | clustering | medium scale, filter-heavy |
| Weaviate | service | HNSW | strong | strong (GraphQL) | clustering | multi-modal, schema-driven data |
| pgvector | Postgres extension | IVF/HNSW | reuse PG | SQL | reuse PG | existing PG stack, medium scale |
| Pinecone | SaaS | proprietary | hosted | strong | hosted | want zero ops |
| Chroma | embedded | HNSW | file | medium | weak | prototype, small apps |
Selection guide:
- prototype / small RAG → Chroma or FAISS in-memory;
- existing PG / scale < 10M → pgvector;
- need filtering + medium scale → Qdrant;
- large scale / multi-tenant → Milvus or Pinecone (the former self-hosted, the latter managed).
9.2 The Latency Budget for Real-Time Retrieval
The end-to-end budget for industrial search / recommendation is roughly 100ms. Allocation by stage:
| Stage | Latency budget | Notes |
|---|---|---|
| Network + deserialization | 5–10ms | gRPC + binary protocol |
| Query encoder (embedding model) | 10–30ms | model distillation / quantization is key |
| ANN retrieval (top-1000) | 5–20ms | HNSW remains < 20ms at the billion scale |
| Metadata filtering | 1–5ms | inverted index |
| Ranking (deep model) | 30–50ms | the dominant cost |
| Post-processing / diversity | 5–10ms | re-rank |
Engineering pain point: embedding-model latency. Common remedies: model distillation (a small two-tower model), ONNX Runtime / TensorRT inference acceleration, INT8 quantization, caching of common queries.
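The last remedy, caching common queries, is trivial to sketch with `functools.lru_cache`; `slow_encode` below is a hypothetical stand-in for the embedding model's forward pass:

```python
import time
from functools import lru_cache

def slow_encode(text):
    """Stand-in for a 10-30 ms embedding-model forward pass."""
    time.sleep(0.01)
    return (float(len(text)),)     # dummy vector; a tuple, so it is hashable

@lru_cache(maxsize=100_000)
def cached_encode(text):
    return slow_encode(text)

t0 = time.perf_counter(); cached_encode("head query"); cold = time.perf_counter() - t0
t0 = time.perf_counter(); cached_encode("head query"); warm = time.perf_counter() - t0
print(f"cold: {cold*1e3:.1f} ms, warm: {warm*1e3:.4f} ms")
```

Because query traffic is heavily head-skewed, even a modest cache can absorb a large share of encoder calls; production systems typically use a shared cache (e.g. Redis) rather than per-process `lru_cache`.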
9.3 Index Update Strategies
The "freshness" challenge of vector databases:
| Strategy | Description | Pros / cons |
|---|---|---|
| Full rebuild | rebuild the index periodically offline | simple; not real-time |
| Incremental insertion | HNSW supports add | real-time; graph quality degrades over time |
| Dual-index swap | hot index (online incremental) + cold index (offline full), swap periodically | mainstream in industry; high ops cost |
| Delta index | main index (cold) + delta (small hot index), merge at query time | Milvus / Vespa style |
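The dual-index swap pattern reduces to a few lines of bookkeeping. The `DualIndex` class below is an illustrative sketch; plain lists stand in for real ANN indexes, and `match` stands in for vector search:

```python
class DualIndex:
    """Dual-index swap: queries fan out to a big cold index plus a small
    hot index of recent additions; an offline job periodically folds the
    hot set into a rebuilt cold index and swaps it in."""

    def __init__(self, cold_items=()):
        self.cold = list(cold_items)   # stands in for the offline-built ANN index
        self.hot = []                  # recent, incrementally inserted items

    def add(self, item):
        self.hot.append(item)          # real-time writes only touch the hot index

    def search(self, match):
        # query both indexes and merge (a real system would merge by score)
        return [x for x in self.cold + self.hot if match(x)]

    def swap(self):
        self.cold = self.cold + self.hot   # offline rebuild absorbs the hot set
        self.hot = []

idx = DualIndex(["a", "b"])
idx.add("c")                           # visible immediately via the hot index
print(idx.search(lambda x: True))
```

The delta-index row in the table is the same idea with the merge pushed to query time permanently instead of a periodic swap.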
10. Formula Cheatsheet
10.1 BPR Loss
\[
\mathcal{L}_{\text{BPR}} = -\sum_{(u,i,j)} \ln \sigma\!\left(p_u^\top (q_i - q_j)\right) + \lambda \lVert \Theta \rVert^2
\]
10.2 FunkSVD Objective
\[
\min_{P,Q} \sum_{(u,i) \in \Omega} \left(r_{ui} - p_u^\top q_i\right)^2 + \lambda \left(\lVert p_u \rVert^2 + \lVert q_i \rVert^2\right)
\]
10.3 Two-Tower In-Batch InfoNCE
\[
\mathcal{L} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp(\mathrm{sim}(u_i, v_i)/\tau)}{\sum_{j=1}^{B} \exp(\mathrm{sim}(u_i, v_j)/\tau)}
\]
10.4 HNSW Distances (any metric; three common ones)
\[
d_{L2}(x,y) = \lVert x - y \rVert_2, \qquad
d_{IP}(x,y) = -\,x^\top y, \qquad
d_{\cos}(x,y) = 1 - \frac{x^\top y}{\lVert x \rVert\,\lVert y \rVert}
\]
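The three common HNSW distance choices (L2, negated inner product, cosine) in NumPy, as a minimal sketch:

```python
import numpy as np

def l2(x, y):
    return np.linalg.norm(x - y)

def neg_ip(x, y):
    return -np.dot(x, y)               # inner product, negated to act as a distance

def cos_dist(x, y):
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0])
# For unit vectors, ||x - y||^2 = 2 * cos_dist(x, y), so L2 and cosine
# induce the same nearest-neighbor ranking.
print(l2(x, y) ** 2, 2 * cos_dist(x, y))
```

This identity is why the FAISS example below can use a plain L2 HNSW index for cosine search: it just normalizes all vectors first.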
Code Example: FAISS HNSW Index + Query
```python
import numpy as np
import faiss

d = 128          # embedding dimension
N = 1_000_000    # number of vectors

# 1. Build the HNSW index (M = 32 links per node, passed positionally — the
#    SWIG bindings do not accept keyword arguments here). The metric is L2;
#    with unit-normalized vectors, L2 ranking is equivalent to cosine ranking.
index = faiss.IndexHNSWFlat(d, 32)
index.hnsw.efConstruction = 200   # build-time candidate-list width
index.hnsw.efSearch = 64          # query-time candidate-list width

# 2. Insert vectors (unit-normalize the embeddings first)
embeddings = np.random.randn(N, d).astype("float32")
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)
index.add(embeddings)

# 3. Query
query = np.random.randn(1, d).astype("float32")
query /= np.linalg.norm(query)
D, I = index.search(query, 10)    # top-10 nearest neighbors
print("top-10 distances:", D[0])
print("top-10 indices:", I[0])
```
Code Example: BPR-MF in NumPy
```python
import numpy as np

def bpr_mf(triples, n_users, n_items, k=32, lr=0.05, lam=0.01, epochs=20):
    """BPR matrix factorization via SGD.
    triples: list of (u, i, j) meaning user u prefers item i over item j.
    n_users, n_items: total numbers of users and items."""
    P = np.random.randn(n_users, k) * 0.01
    Q = np.random.randn(n_items, k) * 0.01
    for _ in range(epochs):
        np.random.shuffle(triples)
        for (u, i, j) in triples:
            x_uij = P[u] @ (Q[i] - Q[j])
            sig = 1.0 / (1.0 + np.exp(x_uij))   # sigma(-x_uij)
            Pu = P[u].copy()                    # gradients must use pre-update values
            P[u] += lr * (sig * (Q[i] - Q[j]) - lam * P[u])
            Q[i] += lr * (sig * Pu - lam * Q[i])
            Q[j] += lr * (-sig * Pu - lam * Q[j])
    return P, Q

# Toy smoke test: user 0 prefers item 0 over item 1
P, Q = bpr_mf([(0, 0, 1)] * 200, n_users=1, n_items=2)
assert P[0] @ Q[0] > P[0] @ Q[1]
```
Cross-references
- The Analogizers as a tribe: this page, §1 (the tribe's entry point)
- Metric learning (how the embedding is trained): Metric Learning and Contrastive Learning
- The mathematical basis of kernels and k-NN: Nearest Neighbors and Kernel Methods
- Vector-database engineering for agent long-term memory: ../../../AI_Agents/04_Memory_Systems/长期记忆与向量数据库.md
References
- Goldberg, D., et al. (1992). Using collaborative filtering to weave an information tapestry. CACM. — The earliest proposal of collaborative filtering.
- Resnick, P., et al. (1994). GroupLens: An open architecture for collaborative filtering of netnews. CSCW. — User-based CF.
- Sarwar, B., et al. (2001). Item-based collaborative filtering recommendation algorithms. WWW. — Item-based CF (the heart of Amazon).
- Funk, S. (2006). Netflix Update: Try this at home. — The original blog post for FunkSVD.
- Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix factorization techniques for recommender systems. Computer. — A modern survey of MF.
- Hu, Y., Koren, Y., & Volinsky, C. (2008). Collaborative filtering for implicit feedback datasets. ICDM. — WMF.
- Rendle, S., et al. (2009). BPR: Bayesian Personalized Ranking from implicit feedback. UAI. — BPR.
- He, X., et al. (2017). Neural collaborative filtering. WWW. — NCF.
- Cheng, H.-T., et al. (2016). Wide & Deep learning for recommender systems. DLRS@RecSys. — Wide & Deep.
- Zhou, G., et al. (2018). Deep Interest Network for click-through rate prediction. KDD. — DIN.
- Covington, P., Adams, J., & Sargin, E. (2016). Deep neural networks for YouTube recommendations. RecSys. — YouTube two-tower retrieval.
- Ying, R., et al. (2018). Graph convolutional neural networks for web-scale recommender systems. KDD. — PinSage.
- He, X., et al. (2020). LightGCN: Simplifying and powering graph convolution network for recommendation. SIGIR. — LightGCN.
- Aamodt, A., & Plaza, E. (1994). Case-based reasoning: Foundational issues, methodological variations, and system approaches. AI Communications. — The foundation of the CBR 4R cycle.
- Robertson, S., & Zaragoza, H. (2009). The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval.
- Karpukhin, V., et al. (2020). Dense Passage Retrieval for open-domain question answering. EMNLP. — DPR.
- Malkov, Y. A., & Yashunin, D. A. (2016). Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. TPAMI. — HNSW.
- Johnson, J., Douze, M., & Jégou, H. (2019). Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3), 535–547. — The FAISS paper.
- Lewis, P., et al. (2020). Retrieval-Augmented Generation for knowledge-intensive NLP tasks. NeurIPS. — RAG.
- Borgeaud, S., et al. (2022). Improving language models by retrieving from trillions of tokens. ICML. — RETRO.
- Izacard, G., et al. (2022). Atlas: Few-shot learning with retrieval augmented language models. arXiv:2208.03299. — Atlas.
- Shi, W., et al. (2023). REPLUG: Retrieval-augmented black-box language models. arXiv:2301.12652.
- Rendle, S., et al. (2020). Neural collaborative filtering vs. matrix factorization revisited. RecSys. — The NCF vs. MF controversy.