Ensemble Learning
Ensemble learning constructs a more powerful model by combining multiple "weak learners." Its core premise is that a committee of diverse models can outperform any of its individual members.
Principles of Ensemble Learning
Why Does Ensembling Work?
The bias-variance decomposition is central to understanding why ensembles are effective. The generalization error of a single model decomposes as
\[
\mathbb{E}\big[(y - \hat{f}(\mathbf{x}))^2\big] = \text{Bias}^2\big[\hat{f}(\mathbf{x})\big] + \text{Var}\big[\hat{f}(\mathbf{x})\big] + \sigma^2
\]
where \(\sigma^2\) is the irreducible noise. The two major ensemble families attack different terms:
- Bagging (e.g., Random Forest) primarily reduces variance: by averaging predictions from multiple high-variance models to smooth out fluctuations
- Boosting (e.g., AdaBoost, XGBoost) primarily reduces bias: by progressively focusing on hard-to-classify samples and correcting errors from previous models
Conditions for ensembles to be effective (generalization of Condorcet's jury theorem):
- Each base learner's accuracy must exceed 50% (better than random guessing)
- Base learners must exhibit diversity (i.e., their errors should not be perfectly correlated)
Suppose we have \(T\) independent base learners, each with error rate \(\epsilon < 0.5\). The ensemble errs under majority voting only when more than half of the learners err, so the ensemble error rate is:
\[
\epsilon_{\text{ensemble}} = \sum_{k=\lfloor T/2 \rfloor + 1}^{T} \binom{T}{k} \epsilon^k (1-\epsilon)^{T-k}
\]
As \(T \to \infty\), \(\epsilon_{\text{ensemble}} \to 0\).
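The shrinking ensemble error can be checked numerically. Below is a minimal sketch (the function name `ensemble_error` is ours, not from the text) that evaluates the majority-vote error for independent learners with \(\epsilon = 0.3\):

```python
from math import comb

def ensemble_error(T, eps):
    # Majority vote is wrong when more than half of the T learners err.
    return sum(comb(T, k) * eps**k * (1 - eps)**(T - k)
               for k in range(T // 2 + 1, T + 1))

# Error shrinks as the committee grows (requires eps < 0.5):
errs = [ensemble_error(T, 0.3) for T in (1, 5, 25, 101)]
```

Note that independence is the crucial assumption here: in practice base learners are correlated, which is why diversity mechanisms (bootstrap sampling, feature subsampling) matter.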
Bagging
Bootstrap Sampling
Bagging (Bootstrap Aggregating) is built around bootstrap sampling:
- From the original dataset (\(n\) samples), draw \(n\) samples with replacement to form a new training set
- Repeat \(T\) times to obtain \(T\) distinct training sets
- Train one base learner on each training set
- At prediction time, use majority voting (classification) or averaging (regression)
In each bootstrap sample, approximately \(1 - (1-1/n)^n \approx 63.2\%\) of samples are selected; the remaining ~36.8% are out-of-bag (OOB) samples.
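The \(1 - (1-1/n)^n \to 1 - 1/e \approx 63.2\%\) figure is easy to verify empirically; this small simulation (pure standard library) draws bootstrap samples and measures the in-bag fraction:

```python
import random

random.seed(0)
n, trials = 10_000, 20
in_bag_fracs = []
for _ in range(trials):
    # Draw n indices with replacement; distinct ones are "in bag".
    sample = [random.randrange(n) for _ in range(n)]
    in_bag_fracs.append(len(set(sample)) / n)

avg_in_bag = sum(in_bag_fracs) / trials  # ~0.632; the rest are OOB
```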
Random Forest
Random Forest extends Bagging with random feature subspace selection:
- At each node split, randomly select \(m\) features (rather than all \(d\) features) as candidates
- For classification, the recommended setting is \(m = \lfloor\sqrt{d}\rfloor\); for regression, \(m = \lfloor d/3 \rfloor\)
OOB Error Estimation:
For each sample \(\mathbf{x}_i\), predict using only the trees that did not include it in their bootstrap sample. The OOB error is the average error across all such OOB predictions and serves as an unbiased estimate of the generalization error, eliminating the need for a separate validation set.
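In scikit-learn (assuming that is the reader's toolkit), both the feature subsampling and the OOB estimate described above are exposed directly; a minimal sketch:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
rf = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",  # m = floor(sqrt(d)) candidate features per split
    oob_score=True,       # estimate generalization error from OOB samples
    random_state=0,
)
rf.fit(X, y)
oob_accuracy = rf.oob_score_  # no separate validation set needed
```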
Advantages of Random Forest:
- Resistant to overfitting (adding more trees does not cause overfitting)
- Handles high-dimensional data well
- Provides feature importance rankings
- Supports parallel training
- Reasonably robust to missing values and outliers
Feature Importance:
- Mean Decrease in Impurity (MDI): Sum of impurity decreases across all trees for each feature's splits
- Mean Decrease in Accuracy (MDA): Observe the increase in OOB error when a feature's values are randomly permuted
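MDA can be computed by hand in a few lines: permute one feature column on held-out data and record the accuracy drop. A sketch (MDI is available for free as `feature_importances_` in scikit-learn; the `mda` list is our own construction):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=8, n_informative=3,
                           n_redundant=0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)
rf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_tr, y_tr)

mdi = rf.feature_importances_     # Mean Decrease in Impurity (sums to 1)

base_acc = rf.score(X_te, y_te)
rng = np.random.default_rng(1)
mda = []
for j in range(X_te.shape[1]):
    X_perm = X_te.copy()
    rng.shuffle(X_perm[:, j])     # break this feature's link to the target
    mda.append(base_acc - rf.score(X_perm, y_te))  # accuracy drop
```

`sklearn.inspection.permutation_importance` implements the same idea with repeated permutations.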
Boosting
AdaBoost
AdaBoost (Adaptive Boosting) iteratively adjusts sample weights so that subsequent learners focus more on samples that previous learners misclassified.
Algorithm Steps:
- Initialize sample weights: \(w_i^{(1)} = \frac{1}{n}\)
- For \(t = 1, 2, \dots, T\):
  - Train base learner \(h_t\) using weight distribution \(\mathbf{w}^{(t)}\)
  - Compute the weighted error rate: \(\epsilon_t = \sum_{i=1}^{n} w_i^{(t)} \mathbb{1}\big[h_t(\mathbf{x}_i) \ne y_i\big]\)
  - Compute the learner weight: \(\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}\)
  - Update sample weights: \(w_i^{(t+1)} = \frac{w_i^{(t)} \exp\big(-\alpha_t\, y_i\, h_t(\mathbf{x}_i)\big)}{Z_t}\), where \(Z_t\) is the normalization factor ensuring the weights sum to 1
- Final prediction: \(H(\mathbf{x}) = \text{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(\mathbf{x})\right)\)
AdaBoost can be shown to be equivalent to a forward stagewise additive model under exponential loss.
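One round of the weight update is easy to trace on a toy example. The sketch below (labels in \(\{-1,+1\}\), a fixed stump prediction `h` we made up for illustration) reproduces the \(\epsilon_t\), \(\alpha_t\), and reweighting steps; a known property is that after normalization the misclassified samples carry exactly half the total weight:

```python
import numpy as np

y = np.array([1, 1, -1, -1, 1])    # true labels in {-1, +1}
h = np.array([1, -1, -1, -1, -1])  # stump predictions (2 mistakes)
w = np.full(len(y), 1 / len(y))    # uniform initial weights w_i = 1/n

eps = w[h != y].sum()                  # weighted error rate eps_t
alpha = 0.5 * np.log((1 - eps) / eps)  # learner weight alpha_t
w = w * np.exp(-alpha * y * h)         # up-weight mistakes, down-weight hits
w /= w.sum()                           # normalize (divide by Z_t)
```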
Gradient Boosting
The core idea of Gradient Boosting is that each step fits the negative gradient of the loss with respect to the current model's predictions.
General Framework:
Given a loss function \(L(y, F(\mathbf{x}))\), initialize \(F_0(\mathbf{x}) = \arg\min_c \sum_i L(y_i, c)\).
For \(t = 1, 2, \dots, T\):
- Compute the negative gradient (pseudo-residuals): \(r_{it} = -\left[\dfrac{\partial L\big(y_i, F(\mathbf{x}_i)\big)}{\partial F(\mathbf{x}_i)}\right]_{F = F_{t-1}}\)
- Fit a base learner \(h_t\) to the pseudo-residuals \(\{(\mathbf{x}_i, r_{it})\}\)
- Determine the step size \(\eta_t\) (optionally via line search)
- Update the model: \(F_t(\mathbf{x}) = F_{t-1}(\mathbf{x}) + \eta_t h_t(\mathbf{x})\)
When \(L\) is the squared loss, \(r_{it} = y_i - F_{t-1}(\mathbf{x}_i)\) reduces to the actual residuals.
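For the squared-loss case, the whole framework fits in a short loop: each round fits a shallow regression tree to the current residuals and adds it with a fixed step size. A minimal sketch (synthetic 1-D data; `eta = 0.1` and `max_depth=2` are illustrative choices, not prescribed values):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 300)

F = np.full_like(y, y.mean())  # F_0 = argmin_c sum_i L(y_i, c) = mean
eta, trees = 0.1, []
for t in range(100):
    r = y - F                                        # pseudo-residuals
    h = DecisionTreeRegressor(max_depth=2).fit(X, r) # fit h_t to residuals
    trees.append(h)
    F += eta * h.predict(X)                          # F_t = F_{t-1} + eta*h_t

mse_final = np.mean((y - F) ** 2)
mse_const = np.mean((y - y.mean()) ** 2)  # error of the F_0 baseline
```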
XGBoost
XGBoost (eXtreme Gradient Boosting) is a highly efficient implementation of Gradient Boosting with several key innovations.
Regularized Objective Function
At iteration \(t\), XGBoost minimizes
\[
\mathcal{L}^{(t)} = \sum_{i=1}^{n} L\big(y_i,\ \hat{y}_i^{(t-1)} + f_t(\mathbf{x}_i)\big) + \Omega(f_t)
\]
where the regularization term is:
\[
\Omega(f) = \gamma\, T_{\text{leaves}} + \frac{1}{2} \lambda \sum_{j=1}^{T_{\text{leaves}}} w_j^2
\]
Here \(T_{\text{leaves}}\) is the number of leaf nodes, \(w_j\) is the leaf weight, \(\gamma\) controls tree complexity, and \(\lambda\) is the L2 regularization coefficient.
Second-Order Taylor Expansion
The objective function is approximated using a second-order Taylor expansion around \(F_{t-1}\) (dropping terms constant in \(f_t\)):
\[
\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \left[ g_i f_t(\mathbf{x}_i) + \frac{1}{2} h_i f_t^2(\mathbf{x}_i) \right] + \Omega(f_t)
\]
where \(g_i = \frac{\partial L(y_i, \hat{y}_i^{(t-1)})}{\partial \hat{y}_i^{(t-1)}}\) and \(h_i = \frac{\partial^2 L(y_i, \hat{y}_i^{(t-1)})}{\partial (\hat{y}_i^{(t-1)})^2}\).
Optimal leaf weights and split gain: minimizing the approximated objective gives, for leaf \(j\) with sample set \(I_j\),
\[
w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}
\]
and the gain of splitting a node \(I\) into children \(I_L\) and \(I_R\) is
\[
\text{Gain} = \frac{1}{2} \left[ \frac{\big(\sum_{i \in I_L} g_i\big)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{\big(\sum_{i \in I_R} g_i\big)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{\big(\sum_{i \in I} g_i\big)^2}{\sum_{i \in I} h_i + \lambda} \right] - \gamma
\]
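The closed-form leaf weight and split gain are just sums of the per-sample gradient statistics; a sketch (the helper names `leaf_weight` and `split_gain` are ours). For squared loss, where every \(h_i = 1\), the optimal leaf weight with \(\lambda = 0\) reduces to the mean residual of the leaf:

```python
import numpy as np

def leaf_weight(g, h, lam):
    # w*_j = -G_j / (H_j + lambda)
    return -g.sum() / (h.sum() + lam)

def split_gain(g_l, h_l, g_r, h_r, lam, gamma):
    def score(g, h):                    # G^2 / (H + lambda)
        return g.sum() ** 2 / (h.sum() + lam)
    parent = score(np.concatenate([g_l, g_r]), np.concatenate([h_l, h_r]))
    return 0.5 * (score(g_l, h_l) + score(g_r, h_r) - parent) - gamma
```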
Approximate Split Finding
For large datasets, XGBoost uses a Weighted Quantile Sketch to approximately find optimal split points, avoiding enumeration of all possible split values.
Sparsity-Aware Algorithm
XGBoost natively handles sparse data by learning a default direction for each node. When a feature value is missing, the sample is directed along the default branch.
Engineering Optimizations
- Column Block: Data is stored sorted by column to accelerate split-point search
- Cache-Aware Access: Optimizes memory access patterns to exploit CPU caches
- Out-of-Core Computation: Supports computation when data does not fit entirely in memory
LightGBM
LightGBM, developed by Microsoft, is optimized for large-scale data scenarios.
GOSS (Gradient-based One-Side Sampling)
Retains all samples with large gradients (high information content) and randomly samples from those with small gradients:
- Sort samples by absolute gradient value
- Keep the top \(a\%\) large-gradient samples
- Randomly sample \(b\%\) from the remaining samples
- Up-weight the sampled small-gradient samples by a factor of \(\frac{1-a}{b}\) to maintain the original distribution
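The steps above can be sketched in a few lines of NumPy (the function name `goss_sample` and the defaults `a=0.2, b=0.1` are our illustrative choices; here `a` and `b` are fractions rather than percentages):

```python
import numpy as np

def goss_sample(grads, a=0.2, b=0.1, rng=None):
    """Keep the top-a fraction of samples by |gradient|; randomly sample a
    fraction b of the rest and up-weight them by (1 - a) / b."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(grads)
    order = np.argsort(-np.abs(grads))   # sort by absolute gradient, descending
    top_k = int(a * n)
    top, rest = order[:top_k], order[top_k:]
    sampled = rng.choice(rest, size=int(b * n), replace=False)
    idx = np.concatenate([top, sampled])
    weights = np.ones(len(idx))
    weights[top_k:] = (1 - a) / b        # restore the original distribution
    return idx, weights
```

The up-weighting keeps the total sample weight close to \(n\), so the information-gain estimates on the reduced set remain approximately unbiased.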
EFB (Exclusive Feature Bundling)
Bundles mutually exclusive features (features that are rarely non-zero simultaneously) together, reducing the effective number of features. This is highly effective for sparse, high-dimensional data.
Leaf-Wise Growth Strategy
Unlike XGBoost's level-wise strategy, LightGBM adopts a leaf-wise strategy: at each step, it splits the leaf with the largest gain.
- Advantage: For the same number of leaves, leaf-wise typically achieves lower loss
- Risk: More prone to overfitting; `max_depth` should be used to constrain depth
CatBoost
CatBoost, developed by Yandex, is specifically designed to handle categorical features and address the prediction shift problem.
Ordered Boosting
Traditional boosting suffers from prediction shift: the model used to compute residuals overlaps with the training data used to fit the next tree.
CatBoost's Ordered Boosting:
- Randomly permute the data
- For the \(i\)-th sample, compute its residual using a model trained only on the first \(i-1\) samples
- This ensures unbiased residuals
Categorical Feature Handling
CatBoost uses Ordered Target Encoding: given a random permutation \(\sigma\) of the data, the categorical value \(x_i\) of the \(i\)-th sample is replaced by
\[
\hat{x}_i = \frac{\sum_{j:\,\sigma(j) < \sigma(i)} \mathbb{1}[x_j = x_i]\, y_j + a\, p}{\sum_{j:\,\sigma(j) < \sigma(i)} \mathbb{1}[x_j = x_i] + a}
\]
where \(p\) is the prior mean of the target variable and \(a > 0\) is a smoothing parameter. Because each sample is encoded using only the samples that precede it in the permutation, the encoding prevents target leakage.
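A running-statistics sketch of the idea (the function name `ordered_target_encode` is ours; the input is assumed to be already randomly permuted): each sample is encoded from the category's smoothed target mean over the samples seen so far, and only then does its own target update the statistics.

```python
def ordered_target_encode(categories, targets, prior, a=1.0):
    """Encode each sample using only the samples BEFORE it in the
    (already permuted) sequence, preventing target leakage."""
    sums, counts, encoded = {}, {}, []
    for c, y in zip(categories, targets):
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        encoded.append((s + a * prior) / (n + a))  # smoothed running mean
        sums[c] = s + y                            # update AFTER encoding
        counts[c] = n + 1
    return encoded
```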
Symmetric Trees
CatBoost uses oblivious (symmetric) decision trees by default: all nodes at the same depth use the same splitting condition. This leads to extremely fast prediction and acts as a form of regularization.
Stacking
Stacking (Stacked Generalization) uses a meta-learner to combine the predictions of multiple base learners.
Basic Procedure
- First level: Train \(M\) different base learners \(h_1, h_2, \dots, h_M\)
- Generate predictions from the base learners to form a new feature matrix
- Second level: Train a meta-learner (e.g., logistic regression, linear regression) to learn how to combine the base learners' predictions
Cross-Validated Stacking
To avoid overfitting, \(K\)-fold cross-validation is typically used to generate first-level predictions:
- Split the data into \(K\) folds
- For each fold, train the base learners on the other \(K-1\) folds and predict on the held-out fold
- Concatenate predictions across all folds to form the meta-learner's training data
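In scikit-learn, `cross_val_predict` generates exactly these out-of-fold predictions; a minimal sketch of cross-validated stacking with two base learners (the choice of learners and `cv=5` are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=500, random_state=0)

base_learners = [RandomForestClassifier(n_estimators=50, random_state=0),
                 GradientBoostingClassifier(random_state=0)]

# Out-of-fold predictions: each row is predicted by models that never saw it.
meta_X = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_learners
])
meta = LogisticRegression().fit(meta_X, y)  # simple second-level learner
```

`sklearn.ensemble.StackingClassifier` wraps this whole procedure, including refitting the base learners on the full training set for prediction time.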
Choice of meta-learner: A simple model (e.g., linear regression, logistic regression) is typically preferred, since the first level already captures complexity. An overly complex meta-learner risks overfitting.
XGBoost vs LightGBM vs CatBoost Comparison
| Dimension | XGBoost | LightGBM | CatBoost |
|---|---|---|---|
| Split strategy | Level-wise | Leaf-wise | Symmetric tree |
| Categorical features | Requires manual encoding | Native support (split-by-group) | Native support (Ordered TS) |
| Missing value handling | Learns default direction | Learns default direction | Multiple modes |
| Training speed | Moderate | Fastest | Slower (ordered boosting overhead) |
| Prediction speed | Moderate | Moderate | Fastest (symmetric trees) |
| GPU support | Yes | Yes | Yes |
| Overfitting control | Regularization + pruning | max_depth + regularization | Ordered boosting + symmetric trees |
| Large data performance | Good | Best (GOSS + EFB) | Good |
| Small data performance | Good | Prone to overfitting | Best (ordered boosting) |
| Best suited for | General purpose | Large-scale, high-dimensional sparse data | Many categorical features, small datasets |
Selection Guidelines
- Large datasets with many features (especially sparse): Prefer LightGBM
- Many categorical features: Prefer CatBoost
- Need fine-grained tuning with extensive community resources: XGBoost is the classic choice
- Competitions and production: Often train all three and stack or select the best
Ensemble Methods Overview
| Method | Core Idea | Reduces Bias/Variance | Base Learner Requirements | Parallelizable |
|---|---|---|---|---|
| Bagging | Bootstrap + voting/averaging | Reduces variance | High-variance models (e.g., deep decision trees) | Yes |
| Random Forest | Bagging + feature randomization | Reduces variance | Decision trees | Yes |
| AdaBoost | Sample reweighting | Reduces bias | Weak learners (e.g., decision stumps) | No |
| Gradient Boosting | Fit residuals | Reduces bias | Shallow decision trees | No |
| XGBoost | Regularization + second-order gradients | Reduces bias | CART | Within-tree parallelism |
| LightGBM | GOSS + EFB | Reduces bias | CART | Within-tree parallelism |
| CatBoost | Ordered Boosting | Reduces bias | Symmetric trees | Within-tree parallelism |
| Stacking | Meta-learner combination | Both | Heterogeneous models | First level parallelizable |