Ensemble Learning
Ensemble learning constructs a more powerful model by combining multiple "weak learners." Its core premise is that a committee of diverse models can outperform any of its individual members.
Principles of Ensemble Learning
Why Does Ensembling Work?
The bias-variance decomposition is central to understanding why ensembles are effective. The generalization error of a single model decomposes as
\[
\mathbb{E}\big[(y - \hat{f}(\mathbf{x}))^2\big] = \text{Bias}^2\big[\hat{f}(\mathbf{x})\big] + \text{Var}\big[\hat{f}(\mathbf{x})\big] + \sigma^2
\]
where \(\sigma^2\) is the irreducible noise. The two major ensemble families attack different terms:
- Bagging (e.g., Random Forest) primarily reduces variance: by averaging predictions from multiple high-variance models to smooth out fluctuations
- Boosting (e.g., AdaBoost, XGBoost) primarily reduces bias: by progressively focusing on hard-to-classify samples and correcting errors from previous models
Conditions for ensembles to be effective (generalization of Condorcet's jury theorem):
- Each base learner's accuracy must exceed 50% (better than random guessing)
- Base learners must exhibit diversity (i.e., their errors should not be perfectly correlated)
Suppose we have \(T\) independent base learners, each with error rate \(\epsilon < 0.5\). The ensemble errs under majority voting only when more than half of the learners err, so the ensemble error rate is:
\[
\epsilon_{\text{ensemble}} = \sum_{k=\lfloor T/2 \rfloor + 1}^{T} \binom{T}{k} \epsilon^k (1-\epsilon)^{T-k}
\]
As \(T \to \infty\), \(\epsilon_{\text{ensemble}} \to 0\).
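The shrinking ensemble error can be checked numerically. Below is a minimal sketch (the function name `ensemble_error` is ours, not from the text) that evaluates the majority-vote error for independent learners with \(\epsilon = 0.3\):

```python
from math import comb

def ensemble_error(T, eps):
    # Majority vote is wrong when more than half of the T learners err.
    return sum(comb(T, k) * eps**k * (1 - eps)**(T - k)
               for k in range(T // 2 + 1, T + 1))

# Error shrinks as the committee grows (requires eps < 0.5):
errs = [ensemble_error(T, 0.3) for T in (1, 5, 25, 101)]
```

Note that independence is the crucial assumption here: in practice base learners are correlated, which is why diversity mechanisms (bootstrap sampling, feature subsampling) matter.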
Bagging
Bootstrap Sampling
Bagging (Bootstrap Aggregating) is built around bootstrap sampling:
- From the original dataset (\(n\) samples), draw \(n\) samples with replacement to form a new training set
- Repeat \(T\) times to obtain \(T\) distinct training sets
- Train one base learner on each training set
- At prediction time, use majority voting (classification) or averaging (regression)
In each bootstrap sample, approximately \(1 - (1-1/n)^n \approx 63.2\%\) of samples are selected; the remaining ~36.8% are out-of-bag (OOB) samples.
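The \(1 - (1-1/n)^n \to 1 - 1/e \approx 63.2\%\) figure is easy to verify empirically; this small simulation (pure standard library) draws bootstrap samples and measures the in-bag fraction:

```python
import random

random.seed(0)
n, trials = 10_000, 20
in_bag_fracs = []
for _ in range(trials):
    # Draw n indices with replacement; distinct ones are "in bag".
    sample = [random.randrange(n) for _ in range(n)]
    in_bag_fracs.append(len(set(sample)) / n)

avg_in_bag = sum(in_bag_fracs) / trials  # ~0.632; the rest are OOB
```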
Random Forest
Random Forest extends Bagging with random feature subspace selection:
- At each node split, randomly select \(m\) features (rather than all \(d\) features) as candidates
- For classification, the recommended setting is \(m = \lfloor\sqrt{d}\rfloor\); for regression, \(m = \lfloor d/3 \rfloor\)
OOB Error Estimation:
For each sample \(\mathbf{x}_i\), predict using only the trees that did not include it in their bootstrap sample. The OOB error is the average error across all such OOB predictions and serves as an unbiased estimate of the generalization error, eliminating the need for a separate validation set.
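In scikit-learn (assuming that is the reader's toolkit), both the feature subsampling and the OOB estimate described above are exposed directly; a minimal sketch:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
rf = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",  # m = floor(sqrt(d)) candidate features per split
    oob_score=True,       # estimate generalization error from OOB samples
    random_state=0,
)
rf.fit(X, y)
oob_accuracy = rf.oob_score_  # no separate validation set needed
```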
Advantages of Random Forest:
- Resistant to overfitting (adding more trees does not cause overfitting)
- Handles high-dimensional data well
- Provides feature importance rankings
- Supports parallel training
- Reasonably robust to missing values and outliers
Feature Importance:
- Mean Decrease in Impurity (MDI): Sum of impurity decreases across all trees for each feature's splits
- Mean Decrease in Accuracy (MDA): Observe the increase in OOB error when a feature's values are randomly permuted
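MDA can be computed by hand in a few lines: permute one feature column on held-out data and record the accuracy drop. A sketch (MDI is available for free as `feature_importances_` in scikit-learn; the `mda` list is our own construction):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=8, n_informative=3,
                           n_redundant=0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)
rf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_tr, y_tr)

mdi = rf.feature_importances_     # Mean Decrease in Impurity (sums to 1)

base_acc = rf.score(X_te, y_te)
rng = np.random.default_rng(1)
mda = []
for j in range(X_te.shape[1]):
    X_perm = X_te.copy()
    rng.shuffle(X_perm[:, j])     # break this feature's link to the target
    mda.append(base_acc - rf.score(X_perm, y_te))  # accuracy drop
```

`sklearn.inspection.permutation_importance` implements the same idea with repeated permutations.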
Boosting
AdaBoost
AdaBoost (Adaptive Boosting) iteratively adjusts sample weights so that subsequent learners focus more on samples that previous learners misclassified.
Algorithm Steps:
- Initialize sample weights: \(w_i^{(1)} = \frac{1}{n}\)
- For \(t = 1, 2, \dots, T\):
  - Train base learner \(h_t\) using weight distribution \(\mathbf{w}^{(t)}\)
  - Compute the weighted error rate: \(\epsilon_t = \sum_{i=1}^{n} w_i^{(t)} \mathbb{1}\big[h_t(\mathbf{x}_i) \ne y_i\big]\)
  - Compute the learner weight: \(\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}\)
  - Update sample weights: \(w_i^{(t+1)} = \frac{w_i^{(t)} \exp\big(-\alpha_t\, y_i\, h_t(\mathbf{x}_i)\big)}{Z_t}\), where \(Z_t\) is the normalization factor ensuring the weights sum to 1
- Final prediction: \(H(\mathbf{x}) = \text{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(\mathbf{x})\right)\)
AdaBoost can be shown to be equivalent to a forward stagewise additive model under exponential loss.
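One round of the weight update is easy to trace on a toy example. The sketch below (labels in \(\{-1,+1\}\), a fixed stump prediction `h` we made up for illustration) reproduces the \(\epsilon_t\), \(\alpha_t\), and reweighting steps; a known property is that after normalization the misclassified samples carry exactly half the total weight:

```python
import numpy as np

y = np.array([1, 1, -1, -1, 1])    # true labels in {-1, +1}
h = np.array([1, -1, -1, -1, -1])  # stump predictions (2 mistakes)
w = np.full(len(y), 1 / len(y))    # uniform initial weights w_i = 1/n

eps = w[h != y].sum()                  # weighted error rate eps_t
alpha = 0.5 * np.log((1 - eps) / eps)  # learner weight alpha_t
w = w * np.exp(-alpha * y * h)         # up-weight mistakes, down-weight hits
w /= w.sum()                           # normalize (divide by Z_t)
```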
Gradient Boosting
The core idea of Gradient Boosting is that each step fits the negative gradient of the loss with respect to the current model's predictions.
General Framework:
Given a loss function \(L(y, F(\mathbf{x}))\), initialize \(F_0(\mathbf{x}) = \arg\min_c \sum_i L(y_i, c)\).
For \(t = 1, 2, \dots, T\):
- Compute the negative gradient (pseudo-residuals): \(r_{it} = -\left[\dfrac{\partial L\big(y_i, F(\mathbf{x}_i)\big)}{\partial F(\mathbf{x}_i)}\right]_{F = F_{t-1}}\)
- Fit a base learner \(h_t\) to the pseudo-residuals \(\{(\mathbf{x}_i, r_{it})\}\)
- Determine the step size \(\eta_t\) (optionally via line search)
- Update the model: \(F_t(\mathbf{x}) = F_{t-1}(\mathbf{x}) + \eta_t h_t(\mathbf{x})\)
When \(L\) is the squared loss, \(r_{it} = y_i - F_{t-1}(\mathbf{x}_i)\) reduces to the actual residuals.
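For the squared-loss case, the whole framework fits in a short loop: each round fits a shallow regression tree to the current residuals and adds it with a fixed step size. A minimal sketch (synthetic 1-D data; `eta = 0.1` and `max_depth=2` are illustrative choices, not prescribed values):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 300)

F = np.full_like(y, y.mean())  # F_0 = argmin_c sum_i L(y_i, c) = mean
eta, trees = 0.1, []
for t in range(100):
    r = y - F                                        # pseudo-residuals
    h = DecisionTreeRegressor(max_depth=2).fit(X, r) # fit h_t to residuals
    trees.append(h)
    F += eta * h.predict(X)                          # F_t = F_{t-1} + eta*h_t

mse_final = np.mean((y - F) ** 2)
mse_const = np.mean((y - y.mean()) ** 2)  # error of the F_0 baseline
```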
XGBoost
XGBoost (eXtreme Gradient Boosting) is a highly efficient implementation of Gradient Boosting with several key innovations.
Regularized Objective Function
At iteration \(t\), XGBoost minimizes
\[
\mathcal{L}^{(t)} = \sum_{i=1}^{n} L\big(y_i,\ \hat{y}_i^{(t-1)} + f_t(\mathbf{x}_i)\big) + \Omega(f_t)
\]
where the regularization term is:
\[
\Omega(f) = \gamma\, T_{\text{leaves}} + \frac{1}{2} \lambda \sum_{j=1}^{T_{\text{leaves}}} w_j^2
\]
Here \(T_{\text{leaves}}\) is the number of leaf nodes, \(w_j\) is the leaf weight, \(\gamma\) controls tree complexity, and \(\lambda\) is the L2 regularization coefficient.
Second-Order Taylor Expansion
The objective function is approximated using a second-order Taylor expansion around \(F_{t-1}\) (dropping terms constant in \(f_t\)):
\[
\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \left[ g_i f_t(\mathbf{x}_i) + \frac{1}{2} h_i f_t^2(\mathbf{x}_i) \right] + \Omega(f_t)
\]
where \(g_i = \frac{\partial L(y_i, \hat{y}_i^{(t-1)})}{\partial \hat{y}_i^{(t-1)}}\) and \(h_i = \frac{\partial^2 L(y_i, \hat{y}_i^{(t-1)})}{\partial (\hat{y}_i^{(t-1)})^2}\).
Optimal leaf weights and split gain: minimizing the approximated objective gives, for leaf \(j\) with sample set \(I_j\),
\[
w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}
\]
and the gain of splitting a node \(I\) into children \(I_L\) and \(I_R\) is
\[
\text{Gain} = \frac{1}{2} \left[ \frac{\big(\sum_{i \in I_L} g_i\big)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{\big(\sum_{i \in I_R} g_i\big)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{\big(\sum_{i \in I} g_i\big)^2}{\sum_{i \in I} h_i + \lambda} \right] - \gamma
\]
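The closed-form leaf weight and split gain are just sums of the per-sample gradient statistics; a sketch (the helper names `leaf_weight` and `split_gain` are ours). For squared loss, where every \(h_i = 1\), the optimal leaf weight with \(\lambda = 0\) reduces to the mean residual of the leaf:

```python
import numpy as np

def leaf_weight(g, h, lam):
    # w*_j = -G_j / (H_j + lambda)
    return -g.sum() / (h.sum() + lam)

def split_gain(g_l, h_l, g_r, h_r, lam, gamma):
    def score(g, h):                    # G^2 / (H + lambda)
        return g.sum() ** 2 / (h.sum() + lam)
    parent = score(np.concatenate([g_l, g_r]), np.concatenate([h_l, h_r]))
    return 0.5 * (score(g_l, h_l) + score(g_r, h_r) - parent) - gamma
```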
Approximate Split Finding
For large datasets, XGBoost uses a Weighted Quantile Sketch to approximately find optimal split points, avoiding enumeration of all possible split values.
Sparsity-Aware Algorithm
XGBoost natively handles sparse data by learning a default direction for each node. When a feature value is missing, the sample is directed along the default branch.
Engineering Optimizations
- Column Block: Data is stored sorted by column to accelerate split-point search
- Cache-Aware Access: Optimizes memory access patterns to exploit CPU caches
- Out-of-Core Computation: Supports computation when data does not fit entirely in memory
LightGBM
LightGBM, developed by Microsoft, is optimized for large-scale data scenarios.
GOSS (Gradient-based One-Side Sampling)
Retains all samples with large gradients (high information content) and randomly samples from those with small gradients:
- Sort samples by absolute gradient value
- Keep the top \(a\%\) large-gradient samples
- Randomly sample \(b\%\) from the remaining samples
- Up-weight the sampled small-gradient samples by a factor of \(\frac{1-a}{b}\) to maintain the original distribution
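The steps above can be sketched in a few lines of NumPy (the function name `goss_sample` and the defaults `a=0.2, b=0.1` are our illustrative choices; here `a` and `b` are fractions rather than percentages):

```python
import numpy as np

def goss_sample(grads, a=0.2, b=0.1, rng=None):
    """Keep the top-a fraction of samples by |gradient|; randomly sample a
    fraction b of the rest and up-weight them by (1 - a) / b."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(grads)
    order = np.argsort(-np.abs(grads))   # sort by absolute gradient, descending
    top_k = int(a * n)
    top, rest = order[:top_k], order[top_k:]
    sampled = rng.choice(rest, size=int(b * n), replace=False)
    idx = np.concatenate([top, sampled])
    weights = np.ones(len(idx))
    weights[top_k:] = (1 - a) / b        # restore the original distribution
    return idx, weights
```

The up-weighting keeps the total sample weight close to \(n\), so the information-gain estimates on the reduced set remain approximately unbiased.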
EFB (Exclusive Feature Bundling)
Bundles mutually exclusive features (features that are rarely non-zero simultaneously) together, reducing the effective number of features. This is highly effective for sparse, high-dimensional data.
Leaf-Wise Growth Strategy
Unlike XGBoost's level-wise strategy, LightGBM adopts a leaf-wise strategy: at each step, it splits the leaf with the largest gain.
- Advantage: For the same number of leaves, leaf-wise typically achieves lower loss
- Risk: More prone to overfitting; `max_depth` should be used to constrain depth
CatBoost
CatBoost, developed by Yandex, is specifically designed to handle categorical features and address the prediction shift problem.
Ordered Boosting
Traditional boosting suffers from prediction shift: the model used to compute residuals overlaps with the training data used to fit the next tree.
CatBoost's Ordered Boosting:
- Randomly permute the data
- For the \(i\)-th sample, compute its residual using a model trained only on the first \(i-1\) samples
- This ensures unbiased residuals
Categorical Feature Handling
CatBoost uses Ordered Target Encoding: given a random permutation \(\sigma\) of the data, the categorical value \(x_i\) of the \(i\)-th sample is replaced by
\[
\hat{x}_i = \frac{\sum_{j:\,\sigma(j) < \sigma(i)} \mathbb{1}[x_j = x_i]\, y_j + a\, p}{\sum_{j:\,\sigma(j) < \sigma(i)} \mathbb{1}[x_j = x_i] + a}
\]
where \(p\) is the prior mean of the target variable and \(a > 0\) is a smoothing parameter. Because each sample is encoded using only the samples that precede it in the permutation, the encoding prevents target leakage.
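A running-statistics sketch of the idea (the function name `ordered_target_encode` is ours; the input is assumed to be already randomly permuted): each sample is encoded from the category's smoothed target mean over the samples seen so far, and only then does its own target update the statistics.

```python
def ordered_target_encode(categories, targets, prior, a=1.0):
    """Encode each sample using only the samples BEFORE it in the
    (already permuted) sequence, preventing target leakage."""
    sums, counts, encoded = {}, {}, []
    for c, y in zip(categories, targets):
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        encoded.append((s + a * prior) / (n + a))  # smoothed running mean
        sums[c] = s + y                            # update AFTER encoding
        counts[c] = n + 1
    return encoded
```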
Symmetric Trees
CatBoost uses oblivious (symmetric) decision trees by default: all nodes at the same depth use the same splitting condition. This leads to extremely fast prediction and acts as a form of regularization.
Stacking
Stacking (Stacked Generalization) uses a meta-learner to combine the predictions of multiple base learners.
Basic Procedure
- First level: Train \(M\) different base learners \(h_1, h_2, \dots, h_M\)
- Generate predictions from the base learners to form a new feature matrix
- Second level: Train a meta-learner (e.g., logistic regression, linear regression) to learn how to combine the base learners' predictions
Cross-Validated Stacking
To avoid overfitting, \(K\)-fold cross-validation is typically used to generate first-level predictions:
- Split the data into \(K\) folds
- For each fold, train the base learners on the other \(K-1\) folds and predict on the held-out fold
- Concatenate predictions across all folds to form the meta-learner's training data
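In scikit-learn, `cross_val_predict` generates exactly these out-of-fold predictions; a minimal sketch of cross-validated stacking with two base learners (the choice of learners and `cv=5` are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=500, random_state=0)

base_learners = [RandomForestClassifier(n_estimators=50, random_state=0),
                 GradientBoostingClassifier(random_state=0)]

# Out-of-fold predictions: each row is predicted by models that never saw it.
meta_X = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_learners
])
meta = LogisticRegression().fit(meta_X, y)  # simple second-level learner
```

`sklearn.ensemble.StackingClassifier` wraps this whole procedure, including refitting the base learners on the full training set for prediction time.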
Choice of meta-learner: A simple model (e.g., linear regression, logistic regression) is typically preferred, since the first level already captures complexity. An overly complex meta-learner risks overfitting.
XGBoost vs LightGBM vs CatBoost Comparison
| Dimension | XGBoost | LightGBM | CatBoost |
|---|---|---|---|
| Split strategy | Level-wise | Leaf-wise | Symmetric tree |
| Categorical features | Requires manual encoding | Native support (split-by-group) | Native support (Ordered TS) |
| Missing value handling | Learns default direction | Learns default direction | Multiple modes |
| Training speed | Moderate | Fastest | Slower (ordered boosting overhead) |
| Prediction speed | Moderate | Moderate | Fastest (symmetric trees) |
| GPU support | Yes | Yes | Yes |
| Overfitting control | Regularization + pruning | max_depth + regularization | Ordered boosting + symmetric trees |
| Large data performance | Good | Best (GOSS + EFB) | Good |
| Small data performance | Good | Prone to overfitting | Best (ordered boosting) |
| Best suited for | General purpose | Large-scale, high-dimensional sparse data | Many categorical features, small datasets |
Selection Guidelines
- Large datasets with many features (especially sparse): Prefer LightGBM
- Many categorical features: Prefer CatBoost
- Need fine-grained tuning with extensive community resources: XGBoost is the classic choice
- Competitions and production: Often train all three and stack or select the best
Ensemble Methods Overview
| Method | Core Idea | Reduces Bias/Variance | Base Learner Requirements | Parallelizable |
|---|---|---|---|---|
| Bagging | Bootstrap + voting/averaging | Reduces variance | High-variance models (e.g., deep decision trees) | Yes |
| Random Forest | Bagging + feature randomization | Reduces variance | Decision trees | Yes |
| AdaBoost | Sample reweighting | Reduces bias | Weak learners (e.g., decision stumps) | No |
| Gradient Boosting | Fit residuals | Reduces bias | Shallow decision trees | No |
| XGBoost | Regularization + second-order gradients | Reduces bias | CART | Within-tree parallelism |
| LightGBM | GOSS + EFB | Reduces bias | CART | Within-tree parallelism |
| CatBoost | Ordered Boosting | Reduces bias | Symmetric trees | Within-tree parallelism |
| Stacking | Meta-learner combination | Both | Heterogeneous models | First level parallelizable |