
Unsupervised Learning

Unsupervised learning is an important branch of machine learning whose defining characteristic is that the training data has no labels. The model must discover the underlying structure, patterns, and regularities in the data on its own.

Overview of Unsupervised Learning

Comparison with Supervised Learning

| Dimension | Supervised Learning | Unsupervised Learning |
| --- | --- | --- |
| Labels | Requires labeled data | No labels needed |
| Objective | Learn a mapping from inputs to outputs | Discover inherent data structure |
| Evaluation | Clear evaluation metrics | Evaluation is difficult; often requires human judgment |
| Typical tasks | Classification, regression | Clustering, dimensionality reduction, anomaly detection, density estimation |
| Data cost | High labeling cost | Low data acquisition cost |

Application Scenarios

  • Customer segmentation: Grouping customers based on purchasing behavior
  • Anomaly detection: Credit card fraud detection, network intrusion detection
  • Data visualization: Low-dimensional visualization of high-dimensional data
  • Feature learning: Autoencoders learning data representations
  • Recommender systems: Discovering latent similarities among users or items

Clustering Algorithms

Clustering is one of the most fundamental tasks in unsupervised learning. The goal is to partition data points into groups (clusters) such that points within the same group are as similar as possible and points in different groups are as dissimilar as possible.

K-Means Algorithm

K-Means is the most classic clustering algorithm, known for its simplicity and efficiency.

Algorithm Steps:

  1. Randomly initialize \(K\) cluster centroids \(\boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \dots, \boldsymbol{\mu}_K\)
  2. Assignment step (E-step): Assign each data point \(\mathbf{x}_i\) to its nearest centroid:
\[ c_i = \arg\min_{k} \|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2 \]
  3. Update step (M-step): Recompute the centroid of each cluster:
\[ \boldsymbol{\mu}_k = \frac{1}{|C_k|} \sum_{\mathbf{x}_i \in C_k} \mathbf{x}_i \]
  4. Repeat steps 2–3 until convergence (centroids no longer change or changes fall below a threshold).

Objective Function (SSE / Inertia):

\[ J = \sum_{k=1}^{K} \sum_{\mathbf{x}_i \in C_k} \|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2 \]

K-Means essentially minimizes the Within-Cluster Sum of Squares (WCSS).
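
As a minimal sketch of the steps above (NumPy only; the function and variable names are our own, not a library API):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Minimal K-Means: returns labels, centroids, and the SSE objective J."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids as K distinct random data points
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # Step 2 (assignment): each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3 (update): each centroid becomes the mean of its cluster
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        # Step 4: stop when centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    sse = ((X - centroids[labels]) ** 2).sum()  # the objective J (inertia)
    return labels, centroids, sse
```

In practice one would use `sklearn.cluster.KMeans`, which adds K-Means++ initialization and multiple restarts.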

Choosing K:

  • Elbow Method: Plot \(K\) versus SSE and select the "elbow" point where the rate of decrease sharply levels off.
  • Silhouette Score: For each sample \(i\), define:
\[ s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} \]

where \(a(i)\) is the mean distance from sample \(i\) to the other points in its own cluster, and \(b(i)\) is the smallest mean distance from \(i\) to the points of any other cluster (its nearest neighboring cluster). \(s(i) \in [-1, 1]\); values closer to 1 indicate better-separated clusters.
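
Both heuristics are easy to try with scikit-learn (a sketch; the four well-separated blobs are synthetic data chosen for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Four well-separated synthetic clusters
X, _ = make_blobs(n_samples=400,
                  centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
                  cluster_std=0.6, random_state=0)

for K in range(2, 7):
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)
    # km.inertia_ is the SSE; the elbow plot is K versus this value
    print(K, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 3))
```

On this data the silhouette score peaks at \(K = 4\), matching the true number of clusters.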

Limitations of K-Means:

  • Requires specifying \(K\) in advance
  • Sensitive to initialization (K-Means++ can mitigate this)
  • Can only discover convex (spherical) clusters
  • Sensitive to outliers

DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm capable of discovering arbitrarily shaped clusters and automatically identifying noise points.

Core Concepts:

  • \(\varepsilon\)-neighborhood: All points within radius \(\varepsilon\) of a point \(p\)
  • Core point: A point whose \(\varepsilon\)-neighborhood contains at least \(\text{MinPts}\) points
  • Directly density-reachable: Point \(q\) is in the \(\varepsilon\)-neighborhood of core point \(p\)
  • Density-reachable: There exists a chain of core points where consecutive points are directly density-reachable
  • Density-connected: There exists a core point \(o\) such that both \(p\) and \(q\) are density-reachable from \(o\)

Parameter Selection:

  • \(\varepsilon\): Typically chosen using a K-distance plot by sorting the \(k\)-th nearest neighbor distances and finding the "knee"
  • MinPts: A common rule of thumb is \(\text{MinPts} \geq d + 1\) (where \(d\) is the data dimensionality); a typical value is 5

Advantages: No need to specify the number of clusters; discovers arbitrarily shaped clusters; identifies noise points.

Disadvantages: Sensitive to parameters; struggles with data of varying densities.
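
The classic two-moons dataset shows both strengths at once (a sketch; \(\varepsilon = 0.2\) is tuned to this data's scale):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-circles: non-convex clusters that K-Means cannot separate
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_                                  # label -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters)
```

DBSCAN recovers the two crescents without being told the number of clusters; K-Means with \(K = 2\) would cut each crescent in half.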


Hierarchical Clustering

Hierarchical clustering builds a cluster tree (dendrogram). It can proceed bottom-up by merging (agglomerative) or top-down by splitting (divisive).

Agglomerative Hierarchical Clustering Steps:

  1. Start with each sample as its own cluster
  2. Compute distances between all pairs of clusters
  3. Merge the two closest clusters
  4. Repeat until only one cluster remains

Common Linkage Criteria:

| Linkage Method | Definition | Characteristics |
| --- | --- | --- |
| Single linkage | \(d(A,B) = \min_{a \in A, b \in B} d(a,b)\) | Prone to chaining effects |
| Complete linkage | \(d(A,B) = \max_{a \in A, b \in B} d(a,b)\) | Tends to produce compact clusters |
| Average linkage | \(d(A,B) = \frac{1}{\lvert A\rvert\,\lvert B\rvert} \sum_{a \in A} \sum_{b \in B} d(a,b)\) | A compromise between single and complete linkage |
| Ward linkage | Minimizes the increase in within-cluster variance | Tends to produce equal-sized clusters |
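
The agglomerative procedure is available in SciPy; a sketch with Ward linkage on two synthetic groups:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two well-separated groups of 2-D points
X = np.vstack([rng.normal(0, 0.5, (20, 2)),
               rng.normal(8, 0.5, (20, 2))])

# Z encodes the full dendrogram: each row records one merge (step 3 above)
Z = linkage(X, method="ward")

# Cut the tree to recover exactly two flat clusters
labels = fcluster(Z, t=2, criterion="maxclust")
```

`scipy.cluster.hierarchy.dendrogram(Z)` would plot the cluster tree itself.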

Gaussian Mixture Models (GMM)

GMM assumes that data is generated by a mixture of \(K\) Gaussian distributions:

\[ p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \]

where \(\pi_k\) are mixing weights satisfying \(\sum_k \pi_k = 1\).

EM Algorithm Solution:

E-step: Compute the posterior probability (responsibility) that each sample belongs to each Gaussian component:

\[ \gamma_{ik} = \frac{\pi_k \, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)} \]

M-step: Update parameters:

\[ \boldsymbol{\mu}_k = \frac{\sum_i \gamma_{ik} \mathbf{x}_i}{\sum_i \gamma_{ik}}, \quad \boldsymbol{\Sigma}_k = \frac{\sum_i \gamma_{ik} (\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^\top}{\sum_i \gamma_{ik}}, \quad \pi_k = \frac{\sum_i \gamma_{ik}}{N} \]

Compared to K-Means, GMM can model ellipsoidal clusters and provides probabilistic soft assignments.
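
A sketch with scikit-learn on two elongated synthetic clusters, showing the soft assignments (the responsibilities \(\gamma_{ik}\)):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two elongated (ellipsoidal) clusters: std 3 along x, 0.5 along y
A = rng.normal(0, 1, (200, 2)) * np.array([3.0, 0.5])
B = rng.normal(0, 1, (200, 2)) * np.array([3.0, 0.5]) + np.array([0.0, 8.0])
X = np.vstack([A, B])

gmm = GaussianMixture(n_components=2, covariance_type="full",
                      random_state=0).fit(X)
resp = gmm.predict_proba(X)   # responsibilities gamma_ik; each row sums to 1
hard = gmm.predict(X)         # hard assignment = argmax_k gamma_ik
```

With `covariance_type="full"`, each component learns its own \(\boldsymbol{\Sigma}_k\), which is what lets GMM fit the elongated shapes here.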


Spectral Clustering

Spectral clustering leverages the spectrum (eigenvalues) of the Laplacian matrix of a similarity graph over the data.

Algorithm Steps:

  1. Construct a similarity matrix \(\mathbf{W}\) (e.g., using a Gaussian kernel: \(w_{ij} = \exp(-\|\mathbf{x}_i - \mathbf{x}_j\|^2 / 2\sigma^2)\))
  2. Compute the degree matrix \(\mathbf{D}\), where \(D_{ii} = \sum_j w_{ij}\)
  3. Compute the normalized Laplacian \(\mathbf{L}_{\text{sym}} = \mathbf{I} - \mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2}\)
  4. Extract the eigenvectors corresponding to the \(K\) smallest eigenvalues of \(\mathbf{L}_{\text{sym}}\) to form \(\mathbf{U} \in \mathbb{R}^{n \times K}\) (the rows of \(\mathbf{U}\) are typically normalized to unit length)
  5. Apply K-Means clustering to the rows of \(\mathbf{U}\)

Spectral clustering can discover non-convex clusters and is well-suited for data with manifold structure.
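
A sketch on the two-moons data, where K-Means fails but spectral clustering succeeds (the nearest-neighbors affinity used here is a common alternative to the Gaussian kernel of step 1):

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, y = make_moons(n_samples=300, noise=0.05, random_state=0)

# Similarity graph from a k-nearest-neighbor graph instead of an RBF kernel
sc = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                        n_neighbors=10, random_state=0)
labels = sc.fit_predict(X)
print(adjusted_rand_score(y, labels))
```

The adjusted Rand index against the true moon labels is close to 1 on this data, since the two crescents form nearly disconnected components of the neighbor graph.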


Dimensionality Reduction

Dimensionality reduction maps high-dimensional data into a lower-dimensional space while preserving the most important structural information.

PCA (Principal Component Analysis)

PCA finds the directions of maximum variance in the data and projects onto those directions.

Mathematical Derivation:

Given a data matrix \(\mathbf{X} \in \mathbb{R}^{n \times d}\) (already centered), compute the covariance matrix:

\[ \mathbf{C} = \frac{1}{n-1} \mathbf{X}^\top \mathbf{X} \]

Perform eigendecomposition of \(\mathbf{C}\):

\[ \mathbf{C} \mathbf{v}_k = \lambda_k \mathbf{v}_k \]

Sort eigenvalues in descending order \(\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_d\) and take the first \(K\) eigenvectors as projection directions.

Explained Variance Ratio:

\[ \text{EVR}_k = \frac{\lambda_k}{\sum_{i=1}^{d} \lambda_i} \]

A common practice is to choose the number of dimensions such that the cumulative explained variance ratio reaches 95%.
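
A sketch of this practice: 5-D data that is essentially 2-D, recovered by thresholding the cumulative explained variance ratio (the embedding matrix `W` is made up for the example; scikit-learn centers the data internally):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Latent 2-D signal with different variances along each axis
latent = rng.normal(size=(500, 2)) * np.array([3.0, 2.0])
# Fixed embedding of the 2-D signal into 5 dimensions, plus tiny noise
W = np.array([[1.0, 0.0, 1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0, 1.0, 0.0]])
X = latent @ W + 0.01 * rng.normal(size=(500, 5))

# A float in (0, 1) tells PCA to keep just enough components
# for that cumulative explained variance ratio
pca = PCA(n_components=0.95)
Xr = pca.fit_transform(X)
print(pca.n_components_, pca.explained_variance_ratio_)
```

PCA correctly keeps two components: the first alone explains well under 95% of the variance, while the first two together explain nearly all of it.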


t-SNE

t-SNE (t-distributed Stochastic Neighbor Embedding) is a nonlinear dimensionality reduction method particularly well-suited for visualizing high-dimensional data.

Core Idea:

  1. In the high-dimensional space, compute pairwise similarities using a Gaussian kernel:
\[ p_{j|i} = \frac{\exp(-\|\mathbf{x}_i - \mathbf{x}_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i}\exp(-\|\mathbf{x}_i - \mathbf{x}_k\|^2 / 2\sigma_i^2)} \]

Symmetrize: \(p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}\)

  2. In the low-dimensional space, compute similarities using a Student-t distribution (1 degree of freedom):
\[ q_{ij} = \frac{(1 + \|\mathbf{y}_i - \mathbf{y}_j\|^2)^{-1}}{\sum_{k \neq l}(1 + \|\mathbf{y}_k - \mathbf{y}_l\|^2)^{-1}} \]
  3. Minimize the KL divergence: \(\text{KL}(P \| Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}\)

Perplexity: Controls the Gaussian kernel bandwidth \(\sigma_i\); typically set between 5 and 50. Higher perplexity considers a wider neighborhood.

Important caveats: t-SNE does not preserve global structure (distances between clusters and apparent cluster sizes in the embedding are not meaningful), results may vary across runs, and it is intended for visualization rather than as a general-purpose preprocessing step.
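
A minimal usage sketch (the parameter values and synthetic clusters are illustrative):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Three well-separated 10-D Gaussian clusters, 50 points each
X = np.vstack([rng.normal(c, 1.0, (50, 10)) for c in (0.0, 10.0, 20.0)])

# Embed to 2-D; perplexity must be smaller than the number of samples
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(emb.shape)   # one 2-D point per sample
```

The resulting `emb` is what one would scatter-plot; per the caveats above, only the grouping of points is interpretable, not the distances between groups.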


UMAP

UMAP (Uniform Manifold Approximation and Projection) is grounded in the theory of topological data analysis and Riemannian geometry.

Core Idea:

  1. Assume the data is uniformly distributed on a Riemannian manifold
  2. Construct a fuzzy simplicial set in the high-dimensional space to represent the topological structure
  3. Find a low-dimensional embedding that preserves this topological structure as faithfully as possible

UMAP vs t-SNE Comparison:

| Dimension | t-SNE | UMAP |
| --- | --- | --- |
| Speed | Slow (\(O(n^2)\) or Barnes-Hut \(O(n \log n)\)) | Fast (approximate nearest neighbors) |
| Global structure | Not preserved | Better preserved |
| Scalability | Tens of thousands of points | Handles millions of points |
| Theoretical basis | Probability distribution matching | Topology / category theory |
| Use cases | Visualization only | Visualization + general-purpose dimensionality reduction |

Anomaly Detection

The goal of anomaly detection is to identify "anomalous" samples that differ significantly from the majority of the data.

Isolation Forest

Core idea: Anomalous points are easier to "isolate" than normal points (i.e., they require fewer random splits).

  • Randomly select features and split values to build isolation trees (iTrees)
  • The anomaly score is based on the average path length across all trees:
\[ s(\mathbf{x}, n) = 2^{-\frac{E[h(\mathbf{x})]}{c(n)}} \]

where \(h(\mathbf{x})\) is the path length and \(c(n)\) is a normalization factor (the average path length of an unsuccessful search in a binary tree of \(n\) nodes). Scores close to 1 indicate anomalies; scores well below 0.5 indicate normal points.
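
A sketch with scikit-learn (the two injected outliers are synthetic):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(500, 2))                 # normal data
outliers = np.array([[8.0, 8.0], [-9.0, 7.0]])      # obvious anomalies

iso = IsolationForest(n_estimators=200, random_state=0).fit(X)
# predict returns +1 for normal points and -1 for anomalies
pred = iso.predict(np.vstack([X, outliers]))
print(pred[-2:])
```

The two injected points fall far outside the training distribution, so they are isolated with very short paths and flagged as anomalies.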

LOF (Local Outlier Factor)

LOF detects anomalies by comparing a sample's local density to the local densities of its neighbors:

\[ \text{LOF}_k(\mathbf{x}) = \frac{\sum_{\mathbf{o} \in N_k(\mathbf{x})} \frac{\text{lrd}_k(\mathbf{o})}{\text{lrd}_k(\mathbf{x})}}{|N_k(\mathbf{x})|} \]

An LOF greater than 1 indicates that the point has lower density than its neighborhood and may be anomalous.
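
A sketch (note that in scikit-learn, LOF is fit and predicted in a single step by default; the planted outliers are synthetic):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (500, 2)),
               np.array([[8.0, 8.0], [-9.0, 7.0]])])   # two planted outliers

lof = LocalOutlierFactor(n_neighbors=20)
pred = lof.fit_predict(X)                    # -1 marks anomalies
lof_scores = -lof.negative_outlier_factor_   # the LOF_k values; > 1 suggests an outlier
print(pred[-2:], lof_scores[-2:].round(1))
```

The planted points sit in a region of near-zero density relative to their 20 nearest neighbors, so their LOF values are far above 1.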

One-Class SVM

One-Class SVM finds a hyperplane in the feature space that separates the majority of data from the origin:

\[ \min_{\mathbf{w}, \rho, \boldsymbol{\xi}} \frac{1}{2}\|\mathbf{w}\|^2 + \frac{1}{\nu n}\sum_{i=1}^{n}\xi_i - \rho \]
\[ \text{s.t.} \quad \mathbf{w}^\top \phi(\mathbf{x}_i) \geq \rho - \xi_i, \quad \xi_i \geq 0 \]

The parameter \(\nu\) is an upper bound on the fraction of training points treated as outliers (and a lower bound on the fraction of support vectors).
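
A sketch; with \(\nu = 0.05\), at most roughly 5% of the training points end up outside the learned boundary (the test points are synthetic):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, size=(500, 2))          # assumed mostly normal data
X_new = np.array([[0.1, -0.2],                     # a typical point
                  [9.0, 9.0]])                     # an obvious anomaly

oc = OneClassSVM(nu=0.05, kernel="rbf", gamma="scale").fit(X_train)
print(oc.predict(X_new))   # +1 = normal, -1 = anomaly
```

With the RBF kernel, a point far from all training data has near-zero kernel values, so its decision value is negative and it is flagged as an anomaly.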


Comprehensive Comparison of Methods

| Method | Type | Best Suited For | Complexity | Key Parameters |
| --- | --- | --- | --- | --- |
| K-Means | Clustering | Convex clusters, large datasets | \(O(nKt)\) | \(K\) |
| DBSCAN | Clustering | Arbitrary shapes, noisy data | \(O(n \log n)\) | \(\varepsilon\), MinPts |
| GMM | Clustering | Ellipsoidal clusters, soft assignments | \(O(nKd^2)\) | \(K\), covariance type |
| Spectral clustering | Clustering | Non-convex manifold structures | \(O(n^3)\) | \(K\), \(\sigma\) |
| PCA | Dim. reduction | Linear reduction, decorrelation | \(O(nd^2)\) | Number of components |
| t-SNE | Dim. reduction | High-dimensional data visualization | \(O(n^2)\) | Perplexity |
| UMAP | Dim. reduction | Visualization + general reduction | \(O(n^{1.14})\) | n_neighbors, min_dist |
| Isolation Forest | Anomaly detection | High-dimensional anomaly detection | \(O(n \log n)\) | Number of trees, sample size |
| LOF | Anomaly detection | Local anomaly detection | \(O(n^2)\) | \(k\) (number of neighbors) |
| One-Class SVM | Anomaly detection | Small data, high dimensions | \(O(n^2)\)–\(O(n^3)\) | \(\nu\), kernel function |
