
Ensemble Learning

Ensemble learning constructs a more powerful model by combining multiple "weak learners." Its core idea is that a committee of diverse models can outperform any of its individual members.

Principles of Ensemble Learning

Why Does Ensembling Work?

The bias-variance decomposition is central to understanding why ensembles are effective. For the generalization error of a single model:

\[ \text{Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Noise} \]
  • Bagging (e.g., Random Forest) primarily reduces variance: averaging the predictions of multiple high-variance models smooths out their fluctuations
  • Boosting (e.g., AdaBoost, XGBoost) primarily reduces bias: each new model focuses on the samples its predecessors misclassified, correcting their errors

Conditions for ensembles to be effective (generalization of Condorcet's jury theorem):

  • Each base learner's accuracy must exceed 50% (better than random guessing)
  • Base learners must exhibit diversity (i.e., their errors should not be perfectly correlated)

Suppose we have \(T\) independent base learners, each with error rate \(\epsilon < 0.5\). The ensemble error rate via majority voting is:

\[ \epsilon_{\text{ensemble}} = \sum_{k=\lceil T/2 \rceil}^{T} \binom{T}{k} \epsilon^k (1-\epsilon)^{T-k} \]

Under the independence assumption, \(\epsilon_{\text{ensemble}} \to 0\) as \(T \to \infty\).
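
The formula is easy to check numerically. A minimal Python sketch (it assumes the base learners are truly independent, as the formula does):

```python
import math

def ensemble_error(T: int, eps: float) -> float:
    """Majority-vote error of T independent base learners, each with error rate eps."""
    return sum(
        math.comb(T, k) * eps**k * (1 - eps) ** (T - k)
        for k in range(math.ceil(T / 2), T + 1)
    )

# With eps = 0.4, the ensemble error shrinks quickly as T grows.
for T in (1, 5, 25, 101):
    print(T, round(ensemble_error(T, 0.4), 4))
```

In practice base learners are never fully independent, so the real decay is slower, but the qualitative effect is the same.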


Bagging

Bootstrap Sampling

Bagging (Bootstrap Aggregating) is built around bootstrap sampling:

  1. From the original dataset (\(n\) samples), draw \(n\) samples with replacement to form a new training set
  2. Repeat \(T\) times to obtain \(T\) distinct training sets
  3. Train one base learner on each training set
  4. At prediction time, use majority voting (classification) or averaging (regression)

In each bootstrap sample, approximately \(1 - (1-1/n)^n \approx 63.2\%\) of samples are selected; the remaining ~36.8% are out-of-bag (OOB) samples.
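
This fraction is easy to verify empirically; a minimal NumPy sketch (the dataset size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
indices = np.arange(n)

# One bootstrap sample: n draws with replacement.
boot = rng.choice(indices, size=n, replace=True)

# Out-of-bag samples are those never drawn; roughly 36.8% for large n.
oob_fraction = 1 - np.unique(boot).size / n
print(f"OOB fraction: {oob_fraction:.3f}")  # ≈ 0.368
```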

Random Forest

Random Forest extends Bagging with random feature subspace selection:

  • At each node split, randomly select \(m\) features (rather than all \(d\) features) as candidates
  • For classification, the recommended setting is \(m = \lfloor\sqrt{d}\rfloor\); for regression, \(m = \lfloor d/3 \rfloor\)

OOB Error Estimation:

For each sample \(\mathbf{x}_i\), predict using only the trees that did not include it in their bootstrap sample. The OOB error is the average error across all such OOB predictions and serves as an unbiased estimate of the generalization error, eliminating the need for a separate validation set.

Advantages of Random Forest:

  • Resistant to overfitting (adding more trees does not cause overfitting)
  • Handles high-dimensional data well
  • Provides feature importance rankings
  • Supports parallel training
  • Reasonably robust to missing values and outliers

Feature Importance:

  • Mean Decrease in Impurity (MDI): Sum of impurity decreases across all trees for each feature's splits
  • Mean Decrease in Accuracy (MDA): Observe the increase in OOB error when a feature's values are randomly permuted
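
As an illustration, a short scikit-learn sketch (the dataset choice is arbitrary): `oob_score=True` yields the OOB estimate described above, and `feature_importances_` exposes the MDI scores.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# max_features="sqrt" selects ~sqrt(d) candidate features at each split.
rf = RandomForestClassifier(
    n_estimators=300,
    max_features="sqrt",
    oob_score=True,      # OOB accuracy estimate, no separate validation set needed
    random_state=0,
)
rf.fit(X, y)

print("OOB accuracy:", rf.oob_score_)
# MDI importances: impurity decreases summed over all trees, normalized.
print("Most important feature index:", rf.feature_importances_.argmax())
```

MDA (permutation) importance is available separately via `sklearn.inspection.permutation_importance`.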

Boosting

AdaBoost

AdaBoost (Adaptive Boosting) iteratively adjusts sample weights so that subsequent learners focus more on samples that previous learners misclassified.

Algorithm Steps:

  1. Initialize sample weights: \(w_i^{(1)} = \frac{1}{n}\)
  2. For \(t = 1, 2, \dots, T\):
     • Train base learner \(h_t\) using weight distribution \(\mathbf{w}^{(t)}\)
     • Compute the weighted error rate:
\[ \epsilon_t = \sum_{i=1}^{n} w_i^{(t)} \cdot \mathbb{1}[h_t(\mathbf{x}_i) \neq y_i] \]
     • Compute the learner weight:
\[ \alpha_t = \frac{1}{2} \ln\frac{1-\epsilon_t}{\epsilon_t} \]
     • Update sample weights:
\[ w_i^{(t+1)} = \frac{w_i^{(t)} \exp(-\alpha_t y_i h_t(\mathbf{x}_i))}{Z_t} \]
     where \(Z_t\) is the normalization factor.
  3. Final prediction: \(H(\mathbf{x}) = \text{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(\mathbf{x})\right)\)

AdaBoost can be shown to be equivalent to a forward stagewise additive model under exponential loss.
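
The steps above translate almost line-for-line into code. A minimal sketch using decision stumps as base learners (labels must be encoded as ±1; function names are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """Minimal AdaBoost sketch; y must take values in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                    # step 1: uniform sample weights
    learners, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)       # train under current weight distribution
        pred = stump.predict(X)
        eps = np.sum(w * (pred != y))          # weighted error rate
        if eps >= 0.5:                         # no better than random guessing: stop
            break
        eps = max(eps, 1e-12)                  # avoid division by zero for perfect stumps
        alpha = 0.5 * np.log((1 - eps) / eps)  # learner weight
        w *= np.exp(-alpha * y * pred)         # re-weight samples
        w /= w.sum()                           # normalize (Z_t)
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(X, learners, alphas):
    scores = sum(a * h.predict(X) for a, h in zip(alphas, learners))
    return np.sign(scores)
```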


Gradient Boosting

The core idea of Gradient Boosting is that each step fits the negative gradient of the loss with respect to the current model's predictions.

General Framework:

Given a loss function \(L(y, F(\mathbf{x}))\), initialize \(F_0(\mathbf{x}) = \arg\min_c \sum_i L(y_i, c)\).

For \(t = 1, 2, \dots, T\):

  1. Compute the negative gradient (pseudo-residuals):
\[ r_{it} = -\frac{\partial L(y_i, F(\mathbf{x}_i))}{\partial F(\mathbf{x}_i)}\bigg|_{F=F_{t-1}} \]
  2. Fit a base learner \(h_t\) to the pseudo-residuals \(\{(\mathbf{x}_i, r_{it})\}\)
  3. Determine the step size \(\eta_t\) (optionally via line search)
  4. Update the model: \(F_t(\mathbf{x}) = F_{t-1}(\mathbf{x}) + \eta_t h_t(\mathbf{x})\)

When \(L\) is the squared loss, \(r_{it} = y_i - F_{t-1}(\mathbf{x}_i)\) reduces to the actual residuals.
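
With squared loss, the framework is only a few lines. A minimal sketch with regression trees as base learners and a fixed step size instead of a line search (function names are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbm_fit(X, y, T=100, eta=0.1, max_depth=3):
    """Minimal gradient boosting sketch for squared loss."""
    F0 = y.mean()                      # argmin_c sum_i (y_i - c)^2
    F = np.full(len(y), F0)
    trees = []
    for _ in range(T):
        residuals = y - F              # negative gradient = ordinary residuals
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)         # fit the pseudo-residuals
        F += eta * tree.predict(X)     # additive update of the model
        trees.append(tree)
    return F0, trees

def gbm_predict(X, F0, trees, eta=0.1):
    return F0 + eta * sum(tree.predict(X) for tree in trees)
```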


XGBoost

XGBoost (eXtreme Gradient Boosting) is a highly efficient implementation of Gradient Boosting with several key innovations.

Regularized Objective Function

\[ \mathcal{L} = \sum_{i=1}^{n} L(y_i, \hat{y}_i) + \sum_{t=1}^{T} \Omega(f_t) \]

where the regularization term is:

\[ \Omega(f) = \gamma \cdot T_{\text{leaves}} + \frac{1}{2}\lambda \sum_{j=1}^{T_{\text{leaves}}} w_j^2 \]

Here \(T_{\text{leaves}}\) is the number of leaf nodes, \(w_j\) is the leaf weight, \(\gamma\) controls tree complexity, and \(\lambda\) is the L2 regularization coefficient.
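
In the xgboost Python package (if used), these terms correspond to the `gamma` and `reg_lambda` hyperparameters; a hedged configuration sketch:

```python
import xgboost as xgb

model = xgb.XGBRegressor(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=6,
    gamma=1.0,       # per-leaf penalty: minimum gain required to keep a split
    reg_lambda=1.0,  # L2 regularization on leaf weights (lambda above)
)
# model.fit(X_train, y_train)
```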

Second-Order Taylor Expansion

The objective function is approximated using a second-order Taylor expansion around \(F_{t-1}\):

\[ \mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \left[ g_i f_t(\mathbf{x}_i) + \frac{1}{2} h_i f_t^2(\mathbf{x}_i) \right] + \Omega(f_t) \]

where \(g_i = \frac{\partial L(y_i, \hat{y}_i^{(t-1)})}{\partial \hat{y}_i^{(t-1)}}\) and \(h_i = \frac{\partial^2 L(y_i, \hat{y}_i^{(t-1)})}{\partial (\hat{y}_i^{(t-1)})^2}\).

Optimal leaf weights and split gain:

\[ w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}, \quad \text{Gain} = \frac{1}{2}\left[\frac{(\sum_{i \in I_L} g_i)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i + \lambda}\right] - \gamma \]
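
Both quantities are cheap to evaluate once the per-sample gradients \(g_i\) and Hessians \(h_i\) are known; a minimal NumPy sketch (function names are illustrative):

```python
import numpy as np

def leaf_weight(g, h, lam=1.0):
    """Optimal leaf weight w* for the samples falling into one leaf."""
    return -g.sum() / (h.sum() + lam)

def split_gain(g, h, left_mask, lam=1.0, gamma=0.0):
    """Gain of splitting a node into left/right children given a boolean mask."""
    def score(gs, hs):
        return gs.sum() ** 2 / (hs.sum() + lam)
    return 0.5 * (score(g[left_mask], h[left_mask])
                  + score(g[~left_mask], h[~left_mask])
                  - score(g, h)) - gamma
```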

Approximate Split Finding

For large datasets, XGBoost uses a Weighted Quantile Sketch to approximately find optimal split points, avoiding enumeration of all possible split values.

Sparsity-Aware Algorithm

XGBoost natively handles sparse data by learning a default direction for each node. When a feature value is missing, the sample is directed along the default branch.

Engineering Optimizations

  • Column Block: Data is stored sorted by column to accelerate split-point search
  • Cache-Aware Access: Optimizes memory access patterns to exploit CPU caches
  • Out-of-Core Computation: Supports computation when data does not fit entirely in memory

LightGBM

LightGBM, developed by Microsoft, is optimized for large-scale data scenarios.

GOSS (Gradient-based One-Side Sampling)

Retains all samples with large gradients (high information content) and randomly samples from those with small gradients:

  • Sort samples by absolute gradient value
  • Keep the top fraction \(a\) of large-gradient samples
  • Randomly sample a fraction \(b\) from the remaining samples
  • Up-weight the sampled small-gradient samples by a factor of \(\frac{1-a}{b}\) to preserve the original data distribution
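
A minimal sketch of the sampling step, with \(a\) and \(b\) expressed as fractions (function name is illustrative):

```python
import numpy as np

def goss_sample(gradients, a=0.2, b=0.1, rng=None):
    """GOSS sketch: keep the top-a fraction by |gradient|, sample a b fraction of the rest."""
    rng = rng or np.random.default_rng()
    n = len(gradients)
    order = np.argsort(-np.abs(gradients))   # sort by absolute gradient, descending
    n_top = int(a * n)
    top = order[:n_top]                      # always keep large-gradient samples
    rest = order[n_top:]
    sampled = rng.choice(rest, size=int(b * n), replace=False)
    weights = np.ones(n_top + len(sampled))
    weights[n_top:] = (1 - a) / b            # up-weight the small-gradient samples
    return np.concatenate([top, sampled]), weights
```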

EFB (Exclusive Feature Bundling)

Bundles mutually exclusive features (features that are rarely non-zero simultaneously) together, reducing the effective number of features. This is highly effective for sparse, high-dimensional data.

Leaf-Wise Growth Strategy

Unlike XGBoost's level-wise strategy, LightGBM adopts a leaf-wise strategy: at each step, it splits the leaf with the largest gain.

  • Advantage: For the same number of leaves, leaf-wise typically achieves lower loss
  • Risk: More prone to overfitting; max_depth should be used to constrain depth
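
With the lightgbm Python package (if used), the relevant knobs are `num_leaves` for leaf-wise complexity and `max_depth` as the depth constraint; a hedged configuration sketch:

```python
import lightgbm as lgb

model = lgb.LGBMClassifier(
    n_estimators=500,
    learning_rate=0.05,
    num_leaves=63,   # main complexity control under leaf-wise growth
    max_depth=8,     # depth cap to limit the overfitting risk noted above
)
# model.fit(X_train, y_train)
```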

CatBoost

CatBoost, developed by Yandex, is specifically designed to handle categorical features and address the prediction shift problem.

Ordered Boosting

Traditional boosting suffers from prediction shift: the residual (gradient) for each training sample is computed with a model that was itself fit on that sample, which biases the residual estimates.

CatBoost's Ordered Boosting:

  1. Randomly permute the data
  2. For the \(i\)-th sample, compute its residual using a model trained only on the first \(i-1\) samples
  3. This ensures unbiased residuals

Categorical Feature Handling

CatBoost uses Ordered Target Encoding:

\[ \hat{x}_i^k = \frac{\sum_{j < i, x_j^k = x_i^k} y_j + a \cdot p}{\sum_{j < i, x_j^k = x_i^k} 1 + a} \]

where \(p\) is the prior mean of the target variable and \(a\) is a smoothing parameter. Random permutations are used to prevent target leakage.
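
A minimal sketch of this encoding for a single categorical column, using one random permutation (CatBoost itself averages over several permutations; the function name is illustrative):

```python
import numpy as np

def ordered_target_encoding(cat_values, y, a=1.0, rng=None):
    """Encode one categorical column so each sample only sees earlier samples' targets."""
    rng = rng or np.random.default_rng()
    n = len(y)
    perm = rng.permutation(n)          # random ordering of the data
    prior = y.mean()                   # prior p
    sums, counts = {}, {}
    encoded = np.empty(n)
    for idx in perm:                   # only previously seen samples contribute
        c = cat_values[idx]
        s, k = sums.get(c, 0.0), counts.get(c, 0)
        encoded[idx] = (s + a * prior) / (k + a)
        sums[c] = s + y[idx]
        counts[c] = k + 1
    return encoded
```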

Symmetric Trees

CatBoost uses oblivious (symmetric) decision trees by default: all nodes at the same depth use the same splitting condition. This leads to extremely fast prediction and acts as a form of regularization.


Stacking

Stacking (Stacked Generalization) uses a meta-learner to combine the predictions of multiple base learners.

Basic Procedure

  1. First level: Train \(M\) different base learners \(h_1, h_2, \dots, h_M\)
  2. Generate predictions from the base learners to form a new feature matrix
  3. Second level: Train a meta-learner (e.g., logistic regression, linear regression) to learn how to combine the base learners' predictions

Cross-Validated Stacking

To avoid overfitting, \(K\)-fold cross-validation is typically used to generate first-level predictions:

  1. Split the data into \(K\) folds
  2. For each fold, train the base learners on the other \(K-1\) folds and predict on the held-out fold
  3. Concatenate predictions across all folds to form the meta-learner's training data

Choice of meta-learner: A simple model (e.g., linear regression, logistic regression) is typically preferred, since the first level already captures complexity. An overly complex meta-learner risks overfitting.
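
scikit-learn's `StackingClassifier` implements this out-of-fold scheme directly; a short sketch (the base learners and data are arbitrary):

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("svc", SVC(probability=True, random_state=0)),
    ],
    final_estimator=LogisticRegression(),  # simple meta-learner, as recommended above
    cv=5,  # K-fold CV generates the out-of-fold first-level predictions
)
# stack.fit(X_train, y_train); stack.predict(X_test)
```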


XGBoost vs LightGBM vs CatBoost Comparison

| Dimension | XGBoost | LightGBM | CatBoost |
| --- | --- | --- | --- |
| Split strategy | Level-wise | Leaf-wise | Symmetric tree |
| Categorical features | Requires manual encoding | Native support (split-by-group) | Native support (Ordered TS) |
| Missing value handling | Learns default direction | Learns default direction | Multiple modes |
| Training speed | Moderate | Fastest | Slower (ordered boosting overhead) |
| Prediction speed | Moderate | Moderate | Fastest (symmetric trees) |
| GPU support | Yes | Yes | Yes |
| Overfitting control | Regularization + pruning | max_depth + regularization | Ordered boosting + symmetric trees |
| Large data performance | Good | Best (GOSS + EFB) | Good |
| Small data performance | Good | Prone to overfitting | Best (ordered boosting) |
| Best suited for | General purpose | Large-scale, high-dimensional sparse data | Many categorical features, small datasets |

Selection Guidelines

  • Large datasets with many features (especially sparse): Prefer LightGBM
  • Many categorical features: Prefer CatBoost
  • Need fine-grained tuning with extensive community resources: XGBoost is the classic choice
  • Competitions and production: Often train all three and stack or select the best

Ensemble Methods Overview

| Method | Core Idea | Reduces Bias/Variance | Base Learner Requirements | Parallelizable |
| --- | --- | --- | --- | --- |
| Bagging | Bootstrap + voting/averaging | Reduces variance | High-variance models (e.g., deep decision trees) | Yes |
| Random Forest | Bagging + feature randomization | Reduces variance | Decision trees | Yes |
| AdaBoost | Sample reweighting | Reduces bias | Weak learners (e.g., decision stumps) | No |
| Gradient Boosting | Fit residuals | Reduces bias | Shallow decision trees | No |
| XGBoost | Regularization + second-order gradients | Reduces bias | CART | Within-tree parallelism |
| LightGBM | GOSS + EFB | Reduces bias | CART | Within-tree parallelism |
| CatBoost | Ordered Boosting | Reduces bias | Symmetric trees | Within-tree parallelism |
| Stacking | Meta-learner combination | Both | Heterogeneous models | First level parallelizable |
