Unsupervised Learning
Unsupervised learning is an important branch of machine learning whose defining characteristic is that the training data has no labels. The model must discover the underlying structure, patterns, and regularities in the data on its own.
Overview of Unsupervised Learning
Comparison with Supervised Learning
| Dimension | Supervised Learning | Unsupervised Learning |
|---|---|---|
| Labels | Requires labeled data | No labels needed |
| Objective | Learn a mapping from inputs to outputs | Discover inherent data structure |
| Evaluation | Clear evaluation metrics | Evaluation is difficult; often requires human judgment |
| Typical tasks | Classification, regression | Clustering, dimensionality reduction, anomaly detection, density estimation |
| Data cost | High labeling cost | Low data acquisition cost |
Application Scenarios
- Customer segmentation: Grouping customers based on purchasing behavior
- Anomaly detection: Credit card fraud detection, network intrusion detection
- Data visualization: Low-dimensional visualization of high-dimensional data
- Feature learning: Autoencoders learning data representations
- Recommender systems: Discovering latent similarities among users or items
Clustering Algorithms
Clustering is one of the most fundamental tasks in unsupervised learning. The goal is to partition data points into groups (clusters) such that points within the same group are as similar as possible and points in different groups are as dissimilar as possible.
K-Means Algorithm
K-Means is the most classic clustering algorithm, known for its simplicity and efficiency.
Algorithm Steps:
- Randomly initialize \(K\) cluster centroids \(\boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \dots, \boldsymbol{\mu}_K\)
- Assignment step (E-step): Assign each data point \(\mathbf{x}_i\) to its nearest centroid: \(c_i = \arg\min_{k} \|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2\)
- Update step (M-step): Recompute each centroid as the mean of the points assigned to it: \(\boldsymbol{\mu}_k = \frac{1}{|C_k|} \sum_{\mathbf{x}_i \in C_k} \mathbf{x}_i\)
- Repeat the assignment and update steps until convergence (centroids no longer change or changes fall below a threshold).
Objective Function (SSE / Inertia):
K-Means essentially minimizes the Within-Cluster Sum of Squares (WCSS): \(J = \sum_{k=1}^{K} \sum_{\mathbf{x}_i \in C_k} \|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2\)
Choosing K:
- Elbow Method: Plot \(K\) versus SSE and select the "elbow" point where the decrease in SSE slows sharply.
- Silhouette Score: For each sample \(i\), define \(s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}\),
where \(a(i)\) is the mean distance to other points in the same cluster, and \(b(i)\) is the mean distance to the nearest neighboring cluster. \(s(i) \in [-1, 1]\); values closer to 1 are better.
Limitations of K-Means:
- Requires specifying \(K\) in advance
- Sensitive to initialization (K-Means++ can mitigate this)
- Can only discover convex (spherical) clusters
- Sensitive to outliers
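As a minimal sketch of the workflow above (assuming scikit-learn is available), the snippet below runs K-Means with K-Means++ initialization and compares several values of \(K\) using the SSE (inertia) and the silhouette score; the dataset and the range of \(K\) are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 well-separated convex clusters
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=42)

# Try several K and record SSE (inertia) and silhouette score
for k in range(2, 7):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
    labels = km.fit_predict(X)
    print(f"K={k}: SSE={km.inertia_:.1f}, silhouette={silhouette_score(X, labels):.3f}")
```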
DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm capable of discovering arbitrarily shaped clusters and automatically identifying noise points.
Core Concepts:
- \(\varepsilon\)-neighborhood: All points within radius \(\varepsilon\) of a point \(p\)
- Core point: A point whose \(\varepsilon\)-neighborhood contains at least \(\text{MinPts}\) points
- Directly density-reachable: Point \(q\) is in the \(\varepsilon\)-neighborhood of core point \(p\)
- Density-reachable: There exists a chain of core points where consecutive points are directly density-reachable
- Density-connected: There exists a core point \(o\) such that both \(p\) and \(q\) are density-reachable from \(o\)
Parameter Selection:
- \(\varepsilon\): Typically chosen using a \(k\)-distance plot by sorting the \(k\)-th nearest neighbor distances and finding the "knee"
- MinPts: A common rule of thumb is \(\text{MinPts} \geq d + 1\) (where \(d\) is the data dimensionality); a typical value is 5
Advantages: No need to specify the number of clusters; discovers arbitrarily shaped clusters; identifies noise points.
Disadvantages: Sensitive to parameters; struggles with data of varying densities.
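A short sketch (again assuming scikit-learn) of DBSCAN recovering non-convex clusters and marking noise with the label -1; the `eps` and `min_samples` values here are illustrative and would normally be tuned with a k-distance plot.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: non-convex clusters that K-Means cannot separate
X, _ = make_moons(n_samples=400, noise=0.07, random_state=0)

labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 marks noise points
print("clusters found:", n_clusters, "| noise points:", int(np.sum(labels == -1)))
```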
Hierarchical Clustering
Hierarchical clustering builds a cluster tree (dendrogram). It can proceed bottom-up by merging (agglomerative) or top-down by splitting (divisive).
Agglomerative Hierarchical Clustering Steps:
- Start with each sample as its own cluster
- Compute distances between all pairs of clusters
- Merge the two closest clusters
- Repeat until only one cluster remains
Common Linkage Criteria:
| Linkage Method | Definition | Characteristics |
|---|---|---|
| Single linkage | \(d(A,B) = \min_{a \in A, b \in B} d(a,b)\) | Prone to chaining effects |
| Complete linkage | \(d(A,B) = \max_{a \in A, b \in B} d(a,b)\) | Tends to produce compact clusters |
| Average linkage | \(d(A,B) = \frac{1}{\lvert A\rvert\,\lvert B\rvert} \sum_{a \in A}\sum_{b \in B} d(a,b)\) | A compromise between single and complete linkage |
| Ward linkage | Minimizes the increase in within-cluster variance | Tends to produce equal-sized clusters |
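For illustration (assuming SciPy is installed), the following sketch builds an agglomerative clustering with Ward linkage and then cuts the resulting dendrogram into a fixed number of clusters.

```python
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=1)

# Bottom-up merging with Ward linkage; Z encodes the full dendrogram
Z = linkage(X, method="ward")

# Cut the tree so that exactly 3 clusters remain
labels = fcluster(Z, t=3, criterion="maxclust")
print("cluster sizes:", [int((labels == c).sum()) for c in sorted(set(labels))])
```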
Gaussian Mixture Models (GMM)
GMM assumes that data is generated by a mixture of \(K\) Gaussian distributions: \(p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\)
where \(\pi_k\) are mixing weights satisfying \(\sum_k \pi_k = 1\).
EM Algorithm Solution:
E-step: Compute the posterior probability (responsibility) that each sample belongs to each Gaussian component: \(\gamma_{ik} = \frac{\pi_k\, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}\)
M-step: Update the parameters using the responsibilities: \(\boldsymbol{\mu}_k = \frac{\sum_i \gamma_{ik}\, \mathbf{x}_i}{\sum_i \gamma_{ik}}\), \(\quad \boldsymbol{\Sigma}_k = \frac{\sum_i \gamma_{ik} (\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^\top}{\sum_i \gamma_{ik}}\), \(\quad \pi_k = \frac{1}{n} \sum_i \gamma_{ik}\)
Compared to K-Means, GMM can model ellipsoidal clusters and provides probabilistic soft assignments.
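A minimal sketch with scikit-learn's GaussianMixture, which runs the EM updates above internally; `predict_proba` exposes the soft responsibilities \(\gamma_{ik}\), while `predict` gives the corresponding hard assignments.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=600, centers=3, cluster_std=[0.5, 1.0, 1.5], random_state=7)

# Full covariance matrices allow ellipsoidal clusters
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=7).fit(X)

hard_labels = gmm.predict(X)              # hard assignment: argmax over components
responsibilities = gmm.predict_proba(X)   # soft assignment: gamma_ik for each sample
print(responsibilities[:3].round(3))
```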
Spectral Clustering
Spectral clustering leverages the spectrum (eigenvalues) of the Laplacian matrix of a similarity graph over the data.
Algorithm Steps:
- Construct a similarity matrix \(\mathbf{W}\) (e.g., using a Gaussian kernel: \(w_{ij} = \exp(-\|\mathbf{x}_i - \mathbf{x}_j\|^2 / 2\sigma^2)\))
- Compute the degree matrix \(\mathbf{D}\), where \(D_{ii} = \sum_j w_{ij}\)
- Compute the normalized Laplacian \(\mathbf{L}_{\text{sym}} = \mathbf{I} - \mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2}\)
- Take the eigenvectors corresponding to the \(K\) smallest eigenvalues of \(\mathbf{L}_{\text{sym}}\) and stack them as the columns of \(\mathbf{U} \in \mathbb{R}^{n \times K}\)
- Apply K-Means clustering to the rows of \(\mathbf{U}\) (typically after normalizing each row to unit length)
Spectral clustering can discover non-convex clusters and is well-suited for data with manifold structure.
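The steps above are sketched below with scikit-learn's SpectralClustering, which builds the similarity graph, computes the Laplacian eigenvectors, and runs K-Means on them internally. Here a k-nearest-neighbor similarity graph is used instead of the Gaussian kernel, a common and more robust choice for this kind of data.

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_circles

# Two concentric circles: a classic non-convex case where K-Means fails
X, _ = make_circles(n_samples=400, factor=0.4, noise=0.05, random_state=0)

# k-NN affinity graph instead of the Gaussian kernel similarity
sc = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                        n_neighbors=10, random_state=0)
labels = sc.fit_predict(X)
print("cluster sizes:", [int((labels == c).sum()) for c in set(labels)])
```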
Dimensionality Reduction
Dimensionality reduction maps high-dimensional data into a lower-dimensional space while preserving the most important structural information.
PCA (Principal Component Analysis)
PCA finds the directions of maximum variance in the data and projects onto those directions.
Mathematical Derivation:
Given a data matrix \(\mathbf{X} \in \mathbb{R}^{n \times d}\) (already centered), compute the covariance matrix: \(\mathbf{C} = \frac{1}{n} \mathbf{X}^\top \mathbf{X}\)
Perform eigendecomposition of \(\mathbf{C}\): \(\mathbf{C}\,\mathbf{v}_i = \lambda_i \mathbf{v}_i\)
Sort eigenvalues in descending order \(\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_d\) and take the first \(K\) eigenvectors as projection directions.
Explained Variance Ratio: the proportion of total variance captured by the first \(K\) components, \(\frac{\sum_{i=1}^{K} \lambda_i}{\sum_{i=1}^{d} \lambda_i}\).
A common practice is to choose the number of dimensions such that the cumulative explained variance ratio reaches 95%.
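A short sketch with scikit-learn: passing a float to `n_components` keeps just enough components to reach that cumulative explained variance ratio (centering is handled internally); the digits dataset is used purely for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)   # 64-dimensional digit images

# Keep the smallest number of components whose cumulative variance ratio >= 95%
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print("original dims:", X.shape[1], "-> reduced dims:", X_reduced.shape[1])
print("cumulative explained variance:", pca.explained_variance_ratio_.sum().round(3))
```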
t-SNE
t-SNE (t-distributed Stochastic Neighbor Embedding) is a nonlinear dimensionality reduction method particularly well-suited for visualizing high-dimensional data.
Core Idea:
- In the high-dimensional space, compute pairwise similarities using a Gaussian kernel: \(p_{j|i} = \frac{\exp(-\|\mathbf{x}_i - \mathbf{x}_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|\mathbf{x}_i - \mathbf{x}_k\|^2 / 2\sigma_i^2)}\)
Symmetrize: \(p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}\)
- In the low-dimensional space, compute similarities using a Student-t distribution (1 degree of freedom): \(q_{ij} = \frac{(1 + \|\mathbf{y}_i - \mathbf{y}_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|\mathbf{y}_k - \mathbf{y}_l\|^2)^{-1}}\)
- Minimize the KL divergence: \(\text{KL}(P \| Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}\)
Perplexity: Controls the Gaussian kernel bandwidth \(\sigma_i\); typically set between 5 and 50. Higher perplexity considers a wider neighborhood.
Important caveats: t-SNE does not preserve global structure (distances between clusters in the embedding are not meaningful), results may vary across runs due to random initialization, and it is not well suited as a general-purpose preprocessing step for downstream models.
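A minimal visualization sketch using scikit-learn's TSNE; the perplexity value and PCA initialization are illustrative choices, and matplotlib is assumed to be available for plotting.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

# Embed the 64-dimensional digits into 2D, for visualization only
emb = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(X)

plt.scatter(emb[:, 0], emb[:, 1], c=y, s=5, cmap="tab10")
plt.title("t-SNE embedding of the digits dataset")
plt.show()
```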
UMAP
UMAP (Uniform Manifold Approximation and Projection) is grounded in the theory of topological data analysis and Riemannian geometry.
Core Idea:
- Assume the data is uniformly distributed on a Riemannian manifold
- Construct a fuzzy simplicial set in the high-dimensional space to represent the topological structure
- Find a low-dimensional embedding that preserves this topological structure as faithfully as possible
UMAP vs t-SNE Comparison:
| Dimension | t-SNE | UMAP |
|---|---|---|
| Speed | Slow (\(O(n^2)\) or Barnes-Hut \(O(n \log n)\)) | Fast (approximate nearest neighbors) |
| Global structure | Not preserved | Better preserved |
| Scalability | Tens of thousands of points | Handles millions of points |
| Theoretical basis | Probability distribution matching | Topology / category theory |
| Use cases | Visualization only | Visualization + general-purpose dimensionality reduction |
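For comparison, a sketch using the umap-learn package (assumed to be installed as `umap-learn`); `n_neighbors` and `min_dist` are its two main knobs.

```python
import umap  # provided by the umap-learn package
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

# Larger n_neighbors emphasizes global structure; min_dist controls cluster tightness
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=42)
embedding = reducer.fit_transform(X)
print(embedding.shape)  # (1797, 2)
```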
Anomaly Detection
The goal of anomaly detection is to identify "anomalous" samples that differ significantly from the majority of the data.
Isolation Forest
Core idea: Anomalous points are easier to "isolate" than normal points (i.e., they require fewer random splits).
- Randomly select features and split values to build isolation trees (iTrees)
- The anomaly score is based on the average path length across all trees: \(s(\mathbf{x}, n) = 2^{-\frac{E[h(\mathbf{x})]}{c(n)}}\)
where \(h(\mathbf{x})\) is the path length, \(E[h(\mathbf{x})]\) is its average over the trees, and \(c(n)\) is a normalization factor (the average path length of an unsuccessful search in a binary search tree of \(n\) points). Scores close to 1 indicate anomalies; scores near 0.5 indicate normal points.
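A minimal scikit-learn sketch; `contamination` (the assumed fraction of anomalies) is an illustrative value, and `predict` returns +1 for inliers and -1 for anomalies.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))   # inlier cloud
X_outliers = rng.uniform(low=-6, high=6, size=(15, 2))      # scattered anomalies
X = np.vstack([X_normal, X_outliers])

iso = IsolationForest(n_estimators=100, contamination=0.03, random_state=42).fit(X)
pred = iso.predict(X)            # +1 = normal, -1 = anomaly
print("flagged anomalies:", int((pred == -1).sum()))
```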
LOF (Local Outlier Factor)
LOF detects anomalies by comparing a sample's local density to the local densities of its neighbors: \(\text{LOF}_k(p) = \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \frac{\text{lrd}_k(o)}{\text{lrd}_k(p)}\)
where \(N_k(p)\) is the set of \(k\)-nearest neighbors of \(p\) and \(\text{lrd}_k\) denotes the local reachability density. An LOF score significantly greater than 1 indicates that the point has lower density than its neighborhood and may be anomalous.
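A brief scikit-learn sketch; note that `negative_outlier_factor_` stores the negated LOF values for the training samples, so negating it again recovers the LOF scale described above.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(size=(300, 2)),          # dense inlier cloud
               rng.uniform(-6, 6, size=(10, 2))])  # sparse outliers

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.03)
pred = lof.fit_predict(X)                    # +1 = normal, -1 = anomaly
lof_scores = -lof.negative_outlier_factor_   # LOF scale: values well above 1 suggest outliers
print("max LOF score:", lof_scores.max().round(2))
```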
One-Class SVM
One-Class SVM finds a hyperplane in the feature space that separates the majority of data from the origin: \(\min_{\mathbf{w},\, \boldsymbol{\xi},\, \rho}\; \frac{1}{2}\|\mathbf{w}\|^2 + \frac{1}{\nu n}\sum_{i=1}^{n} \xi_i - \rho \quad \text{s.t.}\quad \mathbf{w}^\top \phi(\mathbf{x}_i) \geq \rho - \xi_i,\; \xi_i \geq 0\)
The parameter \(\nu \in (0, 1]\) provides an upper bound on the fraction of training points treated as anomalies (and a lower bound on the fraction of support vectors).
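A short scikit-learn sketch: the model is fit on (mostly) normal data and then used to flag new points, with `nu` set to the expected anomaly fraction; the data and parameter values are illustrative.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = rng.normal(size=(400, 2))              # assumed mostly-normal training data
X_new = np.vstack([rng.normal(size=(5, 2)),      # new normal points
                   np.array([[5.0, 5.0], [-6.0, 4.0]])])  # obvious anomalies

oc_svm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_train)
print(oc_svm.predict(X_new))   # +1 = normal, -1 = anomaly
```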
Comprehensive Comparison of Methods
| Method | Type | Best Suited For | Complexity | Key Parameters |
|---|---|---|---|---|
| K-Means | Clustering | Convex clusters, large datasets | \(O(nKt)\) | \(K\) |
| DBSCAN | Clustering | Arbitrary shapes, noisy data | \(O(n \log n)\) | \(\varepsilon\), MinPts |
| GMM | Clustering | Ellipsoidal clusters, soft assignments | \(O(nK d^2)\) | \(K\), covariance type |
| Spectral clustering | Clustering | Non-convex manifold structures | \(O(n^3)\) | \(K\), \(\sigma\) |
| PCA | Dim. reduction | Linear reduction, decorrelation | \(O(nd^2)\) | Number of components |
| t-SNE | Dim. reduction | High-dimensional data visualization | \(O(n^2)\) | Perplexity |
| UMAP | Dim. reduction | Visualization + general reduction | \(O(n^{1.14})\) | n_neighbors, min_dist |
| Isolation Forest | Anomaly detection | High-dimensional anomaly detection | \(O(n \log n)\) | Number of trees, sample size |
| LOF | Anomaly detection | Local anomaly detection | \(O(n^2)\) | \(k\) (number of neighbors) |
| One-Class SVM | Anomaly detection | Small data, high dimensions | \(O(n^2) \sim O(n^3)\) | \(\nu\), kernel function |