Unsupervised Learning
Unsupervised learning is an important branch of machine learning whose defining characteristic is that the training data has no labels. The model must discover the underlying structure, patterns, and regularities in the data on its own.
Overview of Unsupervised Learning
Comparison with Supervised Learning
| Dimension | Supervised Learning | Unsupervised Learning |
|---|---|---|
| Labels | Requires labeled data | No labels needed |
| Objective | Learn a mapping from inputs to outputs | Discover inherent data structure |
| Evaluation | Clear evaluation metrics | Evaluation is difficult; often requires human judgment |
| Typical tasks | Classification, regression | Clustering, dimensionality reduction, anomaly detection, density estimation |
| Data cost | High labeling cost | Low data acquisition cost |
Application Scenarios
- Customer segmentation: Grouping customers based on purchasing behavior
- Anomaly detection: Credit card fraud detection, network intrusion detection
- Data visualization: Low-dimensional visualization of high-dimensional data
- Feature learning: Autoencoders learning data representations
- Recommender systems: Discovering latent similarities among users or items
Clustering Algorithms
Clustering is one of the most fundamental tasks in unsupervised learning. The goal is to partition data points into groups (clusters) such that points within the same group are as similar as possible and points in different groups are as dissimilar as possible.
K-Means Algorithm
K-Means is the most classic clustering algorithm, known for its simplicity and efficiency.
Algorithm Steps:
- Randomly initialize \(K\) cluster centroids \(\boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \dots, \boldsymbol{\mu}_K\)
- Assignment step (E-step): Assign each data point \(\mathbf{x}_i\) to its nearest centroid: \(c_i = \arg\min_{k} \|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2\)
- Update step (M-step): Recompute each centroid as the mean of the points assigned to it: \(\boldsymbol{\mu}_k = \frac{1}{|C_k|} \sum_{\mathbf{x}_i \in C_k} \mathbf{x}_i\)
- Repeat the assignment and update steps until convergence (centroids no longer change or changes fall below a threshold).
Objective Function (SSE / Inertia):
K-Means essentially minimizes the Within-Cluster Sum of Squares (WCSS): \(J = \sum_{k=1}^{K} \sum_{\mathbf{x}_i \in C_k} \|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2\)
Choosing K:
- Elbow Method: Plot \(K\) versus SSE and select the "elbow" point where the decrease in SSE slows sharply.
- Silhouette Score: For each sample \(i\), define \(s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}\),
where \(a(i)\) is the mean distance to other points in the same cluster, and \(b(i)\) is the mean distance to the nearest neighboring cluster. \(s(i) \in [-1, 1]\); values closer to 1 are better.
Limitations of K-Means:
- Requires specifying \(K\) in advance
- Sensitive to initialization (K-Means++ can mitigate this)
- Can only discover convex (spherical) clusters
- Sensitive to outliers
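As a minimal sketch of the workflow above (assuming scikit-learn is available), the snippet below runs K-Means with K-Means++ initialization and compares several values of \(K\) using the SSE (inertia) and the silhouette score; the dataset and the range of \(K\) are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 well-separated convex clusters
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=42)

# Try several K and record SSE (inertia) and silhouette score
for k in range(2, 7):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
    labels = km.fit_predict(X)
    print(f"K={k}: SSE={km.inertia_:.1f}, silhouette={silhouette_score(X, labels):.3f}")
```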
DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm capable of discovering arbitrarily shaped clusters and automatically identifying noise points.
Core Concepts:
- \(\varepsilon\)-neighborhood: All points within radius \(\varepsilon\) of a point \(p\)
- Core point: A point whose \(\varepsilon\)-neighborhood contains at least \(\text{MinPts}\) points
- Directly density-reachable: Point \(q\) is in the \(\varepsilon\)-neighborhood of core point \(p\)
- Density-reachable: There exists a chain of core points where consecutive points are directly density-reachable
- Density-connected: There exists a core point \(o\) such that both \(p\) and \(q\) are density-reachable from \(o\)
Parameter Selection:
- \(\varepsilon\): Typically chosen using a \(k\)-distance plot by sorting the \(k\)-th nearest neighbor distances and finding the "knee"
- MinPts: A common rule of thumb is \(\text{MinPts} \geq d + 1\) (where \(d\) is the data dimensionality); a typical value is 5
Advantages: No need to specify the number of clusters; discovers arbitrarily shaped clusters; identifies noise points.
Disadvantages: Sensitive to parameters; struggles with data of varying densities.
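A short sketch (again assuming scikit-learn) of DBSCAN recovering non-convex clusters and marking noise with the label -1; the `eps` and `min_samples` values here are illustrative and would normally be tuned with a k-distance plot.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: non-convex clusters that K-Means cannot separate
X, _ = make_moons(n_samples=400, noise=0.07, random_state=0)

labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 marks noise points
print("clusters found:", n_clusters, "| noise points:", int(np.sum(labels == -1)))
```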
Hierarchical Clustering
Hierarchical clustering builds a cluster tree (dendrogram). It can proceed bottom-up by merging (agglomerative) or top-down by splitting (divisive).
Agglomerative Hierarchical Clustering Steps:
- Start with each sample as its own cluster
- Compute distances between all pairs of clusters
- Merge the two closest clusters
- Repeat until only one cluster remains
Common Linkage Criteria:
| Linkage Method | Definition | Characteristics |
|---|---|---|
| Single linkage | \(d(A,B) = \min_{a \in A, b \in B} d(a,b)\) | Prone to chaining effects |
| Complete linkage | \(d(A,B) = \max_{a \in A, b \in B} d(a,b)\) | Tends to produce compact clusters |
| Average linkage | \(d(A,B) = \frac{1}{\lvert A\rvert\,\lvert B\rvert} \sum_{a \in A}\sum_{b \in B} d(a,b)\) | A compromise between single and complete linkage |
| Ward linkage | Minimizes the increase in within-cluster variance | Tends to produce equal-sized clusters |
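For illustration (assuming SciPy is installed), the following sketch builds an agglomerative clustering with Ward linkage and then cuts the resulting dendrogram into a fixed number of clusters.

```python
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=1)

# Bottom-up merging with Ward linkage; Z encodes the full dendrogram
Z = linkage(X, method="ward")

# Cut the tree so that exactly 3 clusters remain
labels = fcluster(Z, t=3, criterion="maxclust")
print("cluster sizes:", [int((labels == c).sum()) for c in sorted(set(labels))])
```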
Gaussian Mixture Models (GMM)
GMM assumes that data is generated by a mixture of \(K\) Gaussian distributions: \(p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\)
where \(\pi_k\) are mixing weights satisfying \(\sum_k \pi_k = 1\).
EM Algorithm Solution:
E-step: Compute the posterior probability (responsibility) that each sample belongs to each Gaussian component: \(\gamma_{ik} = \frac{\pi_k\, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}\)
M-step: Update the parameters using the responsibilities: \(\boldsymbol{\mu}_k = \frac{\sum_i \gamma_{ik}\, \mathbf{x}_i}{\sum_i \gamma_{ik}}\), \(\quad \boldsymbol{\Sigma}_k = \frac{\sum_i \gamma_{ik} (\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^\top}{\sum_i \gamma_{ik}}\), \(\quad \pi_k = \frac{1}{n} \sum_i \gamma_{ik}\)
Compared to K-Means, GMM can model ellipsoidal clusters and provides probabilistic soft assignments.
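A minimal sketch with scikit-learn's GaussianMixture, which runs the EM updates above internally; `predict_proba` exposes the soft responsibilities \(\gamma_{ik}\), while `predict` gives the corresponding hard assignments.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=600, centers=3, cluster_std=[0.5, 1.0, 1.5], random_state=7)

# Full covariance matrices allow ellipsoidal clusters
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=7).fit(X)

hard_labels = gmm.predict(X)              # hard assignment: argmax over components
responsibilities = gmm.predict_proba(X)   # soft assignment: gamma_ik for each sample
print(responsibilities[:3].round(3))
```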
Spectral Clustering
Spectral clustering leverages the spectrum (eigenvalues) of the Laplacian matrix of a similarity graph over the data.
Algorithm Steps:
- Construct a similarity matrix \(\mathbf{W}\) (e.g., using a Gaussian kernel: \(w_{ij} = \exp(-\|\mathbf{x}_i - \mathbf{x}_j\|^2 / 2\sigma^2)\))
- Compute the degree matrix \(\mathbf{D}\), where \(D_{ii} = \sum_j w_{ij}\)
- Compute the normalized Laplacian \(\mathbf{L}_{\text{sym}} = \mathbf{I} - \mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2}\)
- Take the eigenvectors corresponding to the \(K\) smallest eigenvalues of \(\mathbf{L}_{\text{sym}}\) and stack them as the columns of \(\mathbf{U} \in \mathbb{R}^{n \times K}\)
- Apply K-Means clustering to the rows of \(\mathbf{U}\) (typically after normalizing each row to unit length)
Spectral clustering can discover non-convex clusters and is well-suited for data with manifold structure.
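The steps above are sketched below with scikit-learn's SpectralClustering, which builds the similarity graph, computes the Laplacian eigenvectors, and runs K-Means on them internally. Here a k-nearest-neighbor similarity graph is used instead of the Gaussian kernel, a common and more robust choice for this kind of data.

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_circles

# Two concentric circles: a classic non-convex case where K-Means fails
X, _ = make_circles(n_samples=400, factor=0.4, noise=0.05, random_state=0)

# k-NN affinity graph instead of the Gaussian kernel similarity
sc = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                        n_neighbors=10, random_state=0)
labels = sc.fit_predict(X)
print("cluster sizes:", [int((labels == c).sum()) for c in set(labels)])
```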
Dimensionality Reduction
Dimensionality reduction maps high-dimensional data into a lower-dimensional space while preserving the most important structural information.
PCA (Principal Component Analysis)
PCA finds the directions of maximum variance in the data and projects onto those directions.
Mathematical Derivation:
Given a data matrix \(\mathbf{X} \in \mathbb{R}^{n \times d}\) (already centered), compute the covariance matrix: \(\mathbf{C} = \frac{1}{n} \mathbf{X}^\top \mathbf{X}\)
Perform eigendecomposition of \(\mathbf{C}\): \(\mathbf{C}\,\mathbf{v}_i = \lambda_i \mathbf{v}_i\)
Sort eigenvalues in descending order \(\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_d\) and take the first \(K\) eigenvectors as projection directions.
Explained Variance Ratio: the proportion of total variance captured by the first \(K\) components, \(\frac{\sum_{i=1}^{K} \lambda_i}{\sum_{i=1}^{d} \lambda_i}\).
A common practice is to choose the number of dimensions such that the cumulative explained variance ratio reaches 95%.
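A short sketch with scikit-learn: passing a float to `n_components` keeps just enough components to reach that cumulative explained variance ratio (centering is handled internally); the digits dataset is used purely for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)   # 64-dimensional digit images

# Keep the smallest number of components whose cumulative variance ratio >= 95%
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print("original dims:", X.shape[1], "-> reduced dims:", X_reduced.shape[1])
print("cumulative explained variance:", pca.explained_variance_ratio_.sum().round(3))
```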
t-SNE
t-SNE (t-distributed Stochastic Neighbor Embedding) is a nonlinear dimensionality reduction method particularly well-suited for visualizing high-dimensional data.
Core Idea:
- In the high-dimensional space, compute pairwise similarities using a Gaussian kernel: \(p_{j|i} = \frac{\exp(-\|\mathbf{x}_i - \mathbf{x}_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|\mathbf{x}_i - \mathbf{x}_k\|^2 / 2\sigma_i^2)}\)
Symmetrize: \(p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}\)
- In the low-dimensional space, compute similarities using a Student-t distribution (1 degree of freedom): \(q_{ij} = \frac{(1 + \|\mathbf{y}_i - \mathbf{y}_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|\mathbf{y}_k - \mathbf{y}_l\|^2)^{-1}}\)
- Minimize the KL divergence: \(\text{KL}(P \| Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}\)
Perplexity: Controls the Gaussian kernel bandwidth \(\sigma_i\); typically set between 5 and 50. Higher perplexity considers a wider neighborhood.
Important caveats: t-SNE does not preserve global structure (distances between clusters in the embedding are not meaningful), results may vary across runs due to random initialization, and it is not well suited as a general-purpose preprocessing step for downstream models.
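A minimal visualization sketch using scikit-learn's TSNE; the perplexity value and PCA initialization are illustrative choices, and matplotlib is assumed to be available for plotting.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

# Embed the 64-dimensional digits into 2D, for visualization only
emb = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(X)

plt.scatter(emb[:, 0], emb[:, 1], c=y, s=5, cmap="tab10")
plt.title("t-SNE embedding of the digits dataset")
plt.show()
```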
UMAP
UMAP (Uniform Manifold Approximation and Projection) is grounded in the theory of topological data analysis and Riemannian geometry.
Core Idea:
- Assume the data is uniformly distributed on a Riemannian manifold
- Construct a fuzzy simplicial set in the high-dimensional space to represent the topological structure
- Find a low-dimensional embedding that preserves this topological structure as faithfully as possible
UMAP vs t-SNE Comparison:
| Dimension | t-SNE | UMAP |
|---|---|---|
| Speed | Slow (\(O(n^2)\) or Barnes-Hut \(O(n \log n)\)) | Fast (approximate nearest neighbors) |
| Global structure | Not preserved | Better preserved |
| Scalability | Tens of thousands of points | Handles millions of points |
| Theoretical basis | Probability distribution matching | Topology / category theory |
| Use cases | Visualization only | Visualization + general-purpose dimensionality reduction |
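For comparison, a sketch using the umap-learn package (assumed to be installed as `umap-learn`); `n_neighbors` and `min_dist` are its two main knobs.

```python
import umap  # provided by the umap-learn package
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

# Larger n_neighbors emphasizes global structure; min_dist controls cluster tightness
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=42)
embedding = reducer.fit_transform(X)
print(embedding.shape)  # (1797, 2)
```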
Anomaly Detection
The goal of anomaly detection is to identify "anomalous" samples that differ significantly from the majority of the data.
Isolation Forest
Core idea: Anomalous points are easier to "isolate" than normal points (i.e., they require fewer random splits).
- Randomly select features and split values to build isolation trees (iTrees)
- The anomaly score is based on the average path length across all trees: \(s(\mathbf{x}, n) = 2^{-\frac{E[h(\mathbf{x})]}{c(n)}}\)
where \(h(\mathbf{x})\) is the path length, \(E[h(\mathbf{x})]\) is its average over the trees, and \(c(n)\) is a normalization factor (the average path length of an unsuccessful search in a binary search tree of \(n\) points). Scores close to 1 indicate anomalies; scores near 0.5 indicate normal points.
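A minimal scikit-learn sketch; `contamination` (the assumed fraction of anomalies) is an illustrative value, and `predict` returns +1 for inliers and -1 for anomalies.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))   # inlier cloud
X_outliers = rng.uniform(low=-6, high=6, size=(15, 2))      # scattered anomalies
X = np.vstack([X_normal, X_outliers])

iso = IsolationForest(n_estimators=100, contamination=0.03, random_state=42).fit(X)
pred = iso.predict(X)            # +1 = normal, -1 = anomaly
print("flagged anomalies:", int((pred == -1).sum()))
```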
LOF (Local Outlier Factor)
LOF detects anomalies by comparing a sample's local density to the local densities of its neighbors: \(\text{LOF}_k(p) = \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \frac{\text{lrd}_k(o)}{\text{lrd}_k(p)}\)
where \(N_k(p)\) is the set of \(k\)-nearest neighbors of \(p\) and \(\text{lrd}_k\) denotes the local reachability density. An LOF score significantly greater than 1 indicates that the point has lower density than its neighborhood and may be anomalous.
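A brief scikit-learn sketch; note that `negative_outlier_factor_` stores the negated LOF values for the training samples, so negating it again recovers the LOF scale described above.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(size=(300, 2)),          # dense inlier cloud
               rng.uniform(-6, 6, size=(10, 2))])  # sparse outliers

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.03)
pred = lof.fit_predict(X)                    # +1 = normal, -1 = anomaly
lof_scores = -lof.negative_outlier_factor_   # LOF scale: values well above 1 suggest outliers
print("max LOF score:", lof_scores.max().round(2))
```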
One-Class SVM
One-Class SVM finds a hyperplane in the feature space that separates the majority of data from the origin: \(\min_{\mathbf{w},\, \boldsymbol{\xi},\, \rho}\; \frac{1}{2}\|\mathbf{w}\|^2 + \frac{1}{\nu n}\sum_{i=1}^{n} \xi_i - \rho \quad \text{s.t.}\quad \mathbf{w}^\top \phi(\mathbf{x}_i) \geq \rho - \xi_i,\; \xi_i \geq 0\)
The parameter \(\nu \in (0, 1]\) provides an upper bound on the fraction of training points treated as anomalies (and a lower bound on the fraction of support vectors).
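A short scikit-learn sketch: the model is fit on (mostly) normal data and then used to flag new points, with `nu` set to the expected anomaly fraction; the data and parameter values are illustrative.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = rng.normal(size=(400, 2))              # assumed mostly-normal training data
X_new = np.vstack([rng.normal(size=(5, 2)),      # new normal points
                   np.array([[5.0, 5.0], [-6.0, 4.0]])])  # obvious anomalies

oc_svm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_train)
print(oc_svm.predict(X_new))   # +1 = normal, -1 = anomaly
```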
Comprehensive Comparison of Methods
| Method | Type | Best Suited For | Complexity | Key Parameters |
|---|---|---|---|---|
| K-Means | Clustering | Convex clusters, large datasets | \(O(nKt)\) | \(K\) |
| DBSCAN | Clustering | Arbitrary shapes, noisy data | \(O(n \log n)\) | \(\varepsilon\), MinPts |
| GMM | Clustering | Ellipsoidal clusters, soft assignments | \(O(nK d^2)\) | \(K\), covariance type |
| Spectral clustering | Clustering | Non-convex manifold structures | \(O(n^3)\) | \(K\), \(\sigma\) |
| PCA | Dim. reduction | Linear reduction, decorrelation | \(O(nd^2)\) | Number of components |
| t-SNE | Dim. reduction | High-dimensional data visualization | \(O(n^2)\) | Perplexity |
| UMAP | Dim. reduction | Visualization + general reduction | \(O(n^{1.14})\) | n_neighbors, min_dist |
| Isolation Forest | Anomaly detection | High-dimensional anomaly detection | \(O(n \log n)\) | Number of trees, sample size |
| LOF | Anomaly detection | Local anomaly detection | \(O(n^2)\) | \(k\) (number of neighbors) |
| One-Class SVM | Anomaly detection | Small data, high dimensions | \(O(n^2) \sim O(n^3)\) | \(\nu\), kernel function |