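Ensemble Methods and Credit Scoring

Overview

Ensemble methods combine multiple weak learners into a single strong learner and often perform very well on financial prediction tasks. This article covers random forests and the gradient boosting family of algorithms, then walks through a complete credit scoring workflow built on them.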

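Random Forest

A random forest uses bagging (bootstrap aggregating) to train many decision trees, then averages their outputs (regression) or takes a majority vote (classification):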

\[\hat{f}_{\text{RF}}(x) = \frac{1}{B}\sum_{b=1}^{B} T_b(x)\]

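where \(T_b\) is the decision tree trained on the \(b\)-th bootstrap sample and \(B\) is the number of trees. The key innovation is that each split considers only \(m \approx \sqrt{p}\) randomly chosen features out of the \(p\) available, which lowers the correlation between trees and therefore the variance of the ensemble.

For identically distributed trees, the variance of the ensemble decomposes as: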

\[\text{Var}\left(\frac{1}{B}\sum_{b=1}^B T_b\right) = \rho \sigma^2 + \frac{1-\rho}{B}\sigma^2\]

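where \(\rho\) is the pairwise correlation between trees and \(\sigma^2\) is the variance of a single tree. As \(B\) grows, the second term vanishes but the first term \(\rho\sigma^2\) remains, so once \(B\) is reasonably large, lowering \(\rho\) reduces variance more than adding trees does.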
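
To make the decomposition concrete, the short sketch below evaluates the formula for a few values of \(\rho\) and \(B\). The function name ensemble_variance and the chosen numbers are illustrative assumptions, not part of the original text.

```python
def ensemble_variance(rho: float, sigma2: float, n_trees: int) -> float:
    """Variance of the average of n_trees identically distributed trees
    with pairwise correlation rho and per-tree variance sigma2."""
    return rho * sigma2 + (1.0 - rho) / n_trees * sigma2


sigma2 = 1.0  # per-tree variance (arbitrary illustrative value)
for rho in (0.6, 0.3):
    for n_trees in (10, 100, 1000):
        var = ensemble_variance(rho, sigma2, n_trees)
        print(f"rho={rho:.1f}  B={n_trees:4d}  variance={var:.4f}")
```

In this sketch, halving \(\rho\) with only 10 trees gives a lower variance than keeping \(\rho = 0.6\) and growing the forest to 1000 trees.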

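Advantages of Random Forest

Trees are trained independently, so training parallelizes naturally
Fairly robust to outliers; tolerance of missing values depends on the implementation
Out-of-bag (OOB) error gives a built-in estimate of generalization error, similar in spirit to cross-validation
Less prone to overfitting than a single deep tree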
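
As a rough illustration of the parallel training and OOB points above, here is a minimal scikit-learn sketch. The synthetic dataset from make_classification and all hyperparameter values are assumptions chosen only for the demo.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, imbalanced stand-in for a credit dataset (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, n_informative=8,
    weights=[0.9, 0.1], random_state=0,
)

rf = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",  # consider about sqrt(p) features at each split
    oob_score=True,       # score each sample using only trees that did not see it
    n_jobs=-1,            # train trees in parallel on all available cores
    random_state=0,
)
rf.fit(X_demo, y_demo)

# OOB accuracy: an estimate of generalization accuracy without a separate holdout set
print(f"OOB accuracy: {rf.oob_score_:.3f}")
```

Note that oob_score_ reports accuracy by default, which can look flattering on imbalanced data; for credit scoring a ranking metric such as AUC is usually more informative.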

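Gradient Boosting

XGBoost

XGBoost builds an additive model in stages. Each new tree \(f_t\) is fit to the gradient of the loss at the current prediction (for squared-error loss this is the residual) and is added with learning rate \(\eta\):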

\[\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + \eta \cdot f_t(x_i)\]

The objective at step \(t\) includes a regularization term:

\[\mathcal{L}^{(t)} = \sum_{i=1}^{N} L(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) + \Omega(f_t)\]

\[\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2\]

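where \(T\) is the number of leaves, \(w_j\) is the weight of leaf \(j\), \(\gamma\) penalizes the number of leaves (tree complexity), and \(\lambda\) is the L2 penalty on the leaf weights.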
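
The text above does not show how \(\gamma\) and \(\lambda\) enter the tree construction. Under the standard second-order approximation of the loss, with \(G\) and \(H\) the sums of first and second derivatives in a leaf, the optimal leaf weight is \(-G/(H+\lambda)\) and a split is worth making only if its gain exceeds \(\gamma\). The sketch below is a minimal NumPy illustration of those two formulas under that assumption; the helper names leaf_weight and split_gain and the toy data are hypothetical.

```python
import numpy as np


def leaf_weight(g: np.ndarray, h: np.ndarray, lam: float) -> float:
    """Optimal leaf weight -G / (H + lambda) for the samples in one leaf."""
    return float(-g.sum() / (h.sum() + lam))


def split_gain(g, h, left_mask, lam: float, gamma: float) -> float:
    """Reduction in the regularized objective from splitting one leaf in two."""
    def score(gs, hs):
        return gs.sum() ** 2 / (hs.sum() + lam)

    left = score(g[left_mask], h[left_mask])
    right = score(g[~left_mask], h[~left_mask])
    parent = score(g, h)
    return float(0.5 * (left + right - parent) - gamma)


# Toy example with logistic loss: gradient = p - y, hessian = p * (1 - p)
y = np.array([0, 0, 1, 1, 1, 0])
p = np.full(6, 0.5)          # current predicted probabilities
g, h = p - y, p * (1 - p)
left_mask = (y == 0)         # a split that separates the classes perfectly

gain = split_gain(g, h, left_mask, lam=1.0, gamma=0.0)
w_left = leaf_weight(g[left_mask], h[left_mask], lam=1.0)
w_right = leaf_weight(g[~left_mask], h[~left_mask], lam=1.0)
print(gain, w_left, w_right)
```

A larger \(\lambda\) shrinks the leaf weights toward zero, and a larger \(\gamma\) makes marginal splits unprofitable, which is how the two penalties limit tree complexity.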

LightGBM

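LightGBM's main improvements over the original XGBoost algorithm:

GOSS (Gradient-based One-Side Sampling): keeps all samples with large gradients and randomly samples from those with small gradients
EFB (Exclusive Feature Bundling): bundles mutually exclusive features (features that are rarely nonzero at the same time) to reduce dimensionality
Histogram-based splitting: buckets continuous features into histogram bins to speed up the split search (XGBoost also offers a histogram mode, tree_method='hist')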
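
The GOSS idea in the list above can be sketched in a few lines of NumPy. This is a simplified illustration rather than LightGBM's actual implementation: the function name goss_sample is hypothetical, and the top_rate and other_rate arguments merely mirror the names of the corresponding LightGBM parameters.

```python
import numpy as np


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (indices, weights) for one GOSS-style subsample.

    Keeps the top_rate fraction of samples with the largest |gradient|,
    randomly draws other_rate * n samples from the remainder, and up-weights
    the drawn samples by (1 - top_rate) / other_rate so that gradient sums
    stay approximately unbiased.
    """
    rng = np.random.default_rng(seed)
    n = len(gradients)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)

    order = np.argsort(-np.abs(gradients))  # largest |gradient| first
    top_idx = order[:n_top]
    sampled_idx = rng.choice(order[n_top:], size=n_other, replace=False)

    weights = np.ones(n_top + n_other)
    weights[n_top:] = (1.0 - top_rate) / other_rate
    return np.concatenate([top_idx, sampled_idx]), weights


demo_gradients = np.random.default_rng(42).normal(size=1000)
idx, w = goss_sample(demo_gradients)
print(len(idx), w[0], w[-1])
```

With the default rates, each boosting round in this sketch would use 30% of the rows, while the small-gradient rows that are kept count eight times each.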

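import lightgbm as lgb
import numpy as np

# Assumes X_train, y_train, X_val, y_val already exist, with labels in {0, 1}
pos_count = int(np.sum(y_train == 1))
neg_count = int(np.sum(y_train == 0))

params = {
    'objective': 'binary',        # binary classification (default vs. non-default)
    'metric': 'auc',
    'learning_rate': 0.05,
    'num_leaves': 31,
    'max_depth': 6,
    'min_child_samples': 50,      # minimum samples per leaf; guards against overfitting
    'subsample': 0.8,             # row subsampling (alias of bagging_fraction)
    'subsample_freq': 1,          # needed: row subsampling is disabled while this is 0
    'colsample_bytree': 0.8,      # feature subsampling per tree
    'reg_alpha': 0.1,             # L1 regularization
    'reg_lambda': 1.0,            # L2 regularization
    'scale_pos_weight': neg_count / pos_count,  # reweight positives for class imbalance
}

dtrain = lgb.Dataset(X_train, label=y_train)
dval = lgb.Dataset(X_val, label=y_val, reference=dtrain)

# Stop when the validation AUC has not improved for 50 rounds
model = lgb.train(
    params, dtrain,
    num_boost_round=1000,
    valid_sets=[dval],
    callbacks=[lgb.early_stopping(50)]
)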

Feature Importance

Split-Based Importance

Count how many times each feature is used in a split, or sum the gain of those splits. The gain-weighted version is:

\[\text{Importance}_j = \sum_{t \in \text{trees}} \sum_{s \in \text{splits}(t)} \mathbb{1}[\text{feature}(s) = j] \cdot \Delta \text{Gain}(s)\]

Permutation Importance

Randomly shuffle the values of feature \(j\) and measure how much model performance drops:

\[\text{PI}_j = \text{Score}_{\text{original}} - \text{Score}_{\text{permuted}_j}\]
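
One way to compute this quantity is scikit-learn's permutation_importance, sketched below. The synthetic data (where, by construction, only the first three columns carry signal) and the random forest model are assumptions for the demo; any fitted estimator could take their place.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# shuffle=False keeps the 3 informative features in columns 0-2; the rest are noise
X_demo, y_demo = make_classification(
    n_samples=1500, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each column 10 times on held-out data and record the drop in AUC
result = permutation_importance(
    clf, X_te, y_te, scoring="roc_auc", n_repeats=10, random_state=0
)
for j in result.importances_mean.argsort()[::-1]:
    print(f"feature {j}: {result.importances_mean[j]:.4f} "
          f"+/- {result.importances_std[j]:.4f}")
```

Computing the importance on held-out data rather than the training set avoids rewarding features the model has merely memorized.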

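Pitfalls of Feature Importance

Split-based importance is biased toward high-cardinality features. In financial applications, cross-check it against both permutation importance and SHAP values. The way importance is divided among correlated features can also be misleading.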

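import shap

# Assumes model is the trained tree model from above and X_test is a held-out feature matrix
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# SHAP values give the marginal contribution of each feature to each individual prediction.
# Depending on the shap version, a binary classifier may return one array per class.
shap.summary_plot(shap_values, X_test, feature_names=feature_names)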

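End-to-End Credit Scoring Workflow

The following steps walk through an end-to-end credit scoring workflow based on ensemble methods.

Step 1: Data Preparation and Feature Engineering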

import pandas as pd
import numpy as np

# Assumes df is a pandas DataFrame of loan applications containing the columns below

# Feature groups
demographic_features = ['age', 'income', 'employment_years']
credit_history = ['num_accounts', 'avg_utilization', 'delinquency_count',
                  'months_since_last_delinquency', 'credit_age_months']
loan_features = ['loan_amount', 'interest_rate', 'dti_ratio', 'loan_purpose']

# Derived features
# A zero loan_amount becomes NaN instead of producing an infinite ratio
df['income_to_loan'] = df['income'] / df['loan_amount'].replace(0, np.nan)
df['utilization_x_delinquency'] = df['avg_utilization'] * df['delinquency_count']

Step 2: 数据预处理

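from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

# Numeric columns: fill missing values with the median, then standardize.
# Scaling does not change tree-based models; it only matters if the same
# pipeline also feeds a scale-sensitive model.
numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical columns: fill missing values with the mode, then one-hot encode.
# handle_unknown='ignore' encodes categories unseen during fit as all zeros.
categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# numeric_features and categorical_features are lists of column names,
# assumed to be defined beforehand (for example, 'loan_purpose' is categorical)
preprocessor = ColumnTransformer([
    ('num', numeric_pipeline, numeric_features),
    ('cat', categorical_pipeline, categorical_features)
])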

Step 3: 模型训练与调优

import lightgbm as lgb
import numpy as np
import optuna
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import TimeSeriesSplit

# Assumes X and y are NumPy arrays sorted by time (oldest first);
# for pandas objects, index with .iloc instead
tscv = TimeSeriesSplit(n_splits=5)

def objective(trial):
    params = {
        'num_leaves': trial.suggest_int('num_leaves', 15, 63),
        'max_depth': trial.suggest_int('max_depth', 3, 8),
        'learning_rate': trial.suggest_float('lr', 0.01, 0.1, log=True),
        'min_child_samples': trial.suggest_int('min_child', 20, 100),
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-3, 10, log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-3, 10, log=True),
    }
    # Time-series cross-validation: each fold validates on data later than its training data
    cv_scores = []
    for train_idx, val_idx in tscv.split(X):
        model = lgb.LGBMClassifier(**params, n_estimators=500)
        # Note: the fold used for early stopping is also the fold being scored,
        # so the resulting AUC is slightly optimistic
        model.fit(X[train_idx], y[train_idx],
                  eval_set=[(X[val_idx], y[val_idx])],
                  callbacks=[lgb.early_stopping(30)])
        y_prob = model.predict_proba(X[val_idx])[:, 1]
        cv_scores.append(roc_auc_score(y[val_idx], y_prob))
    return np.mean(cv_scores)

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
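
To see why time-series splitting matters here, the small sketch below prints the folds that scikit-learn's TimeSeriesSplit produces on twelve time-ordered rows. The toy array and the n_splits value are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_demo = np.arange(12).reshape(-1, 1)  # 12 observations, already in time order
tscv_demo = TimeSeriesSplit(n_splits=3)

folds = list(tscv_demo.split(X_demo))
for k, (train_idx, val_idx) in enumerate(folds):
    # Training rows always precede validation rows, so no future data leaks in
    print(f"fold {k}: train={train_idx.tolist()}  val={val_idx.tolist()}")
```

Each fold trains on an expanding window of past observations and validates on the block that follows it, which mimics scoring new applicants with a model built on historical loans.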

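Step 4: Model Evaluation

Metric	Meaning	Acceptance threshold
AUC	Discriminatory power (ranking quality)	> 0.75
KS	Maximum gap between the cumulative score distributions of the two classes	> 0.30
Gini	2*AUC - 1	> 0.50
PSI	Population stability index	< 0.10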
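
The first three metrics in the table can all be derived from the ROC curve. The sketch below shows one way to compute them with scikit-learn; the labels and scores are synthetic values generated purely for illustration, not real credit data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
y_true = rng.binomial(1, 0.2, size=5000)          # 1 = default (synthetic labels)
scores = rng.normal(loc=y_true * 1.0, scale=1.0)  # defaulters score higher on average

auc = roc_auc_score(y_true, scores)
fpr, tpr, _ = roc_curve(y_true, scores)

ks = float(np.max(tpr - fpr))  # KS: largest gap between the two cumulative distributions
gini = 2 * auc - 1             # Gini coefficient, as defined in the table

print(f"AUC={auc:.3f}  KS={ks:.3f}  Gini={gini:.3f}")
```

Because Gini is a linear function of AUC, the thresholds AUC > 0.75 and Gini > 0.50 express the same requirement.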

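Monitoring Model Drift with PSI

The population stability index (PSI) detects distribution drift in a feature or in the model score. It compares the share of observations falling in each of \(B\) bins (here \(B\) is the number of bins, not the number of trees) between a reference (old) sample and a recent (new) sample: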

\[\text{PSI} = \sum_{i=1}^{B}(p_i^{\text{new}} - p_i^{\text{old}}) \ln \frac{p_i^{\text{new}}}{p_i^{\text{old}}}\]

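As a common rule of thumb, PSI < 0.10 indicates a stable population, 0.10 to 0.25 a moderate shift worth investigating, and PSI > 0.25 a significant shift that usually calls for retraining the model.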
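
A minimal NumPy sketch of the PSI formula follows. The function name psi, the use of quantile bins taken from the reference sample, and the small eps that keeps empty bins from producing log(0) are all implementation assumptions rather than part of the definition above.

```python
import numpy as np


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population stability index between a reference sample and a new sample."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)

    # Interior bin edges from quantiles of the reference (old) sample;
    # values beyond the outermost edges fall into the first or last bin
    inner_edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))[1:-1]

    old_counts = np.bincount(np.searchsorted(inner_edges, expected), minlength=n_bins)
    new_counts = np.bincount(np.searchsorted(inner_edges, actual), minlength=n_bins)

    p_old = np.clip(old_counts / len(expected), eps, None)
    p_new = np.clip(new_counts / len(actual), eps, None)
    return float(np.sum((p_new - p_old) * np.log(p_new / p_old)))


rng = np.random.default_rng(0)
reference = rng.normal(size=10_000)          # e.g. scores at training time
stable = rng.normal(size=10_000)             # same distribution
shifted = rng.normal(loc=1.0, size=10_000)   # mean shifted by one standard deviation

psi_stable = psi(reference, stable)
psi_shifted = psi(reference, shifted)
print(f"stable: {psi_stable:.4f}  shifted: {psi_shifted:.4f}")
```

In practice the bin edges would be fixed once on the development sample and reused for every monitoring period, so that PSI values remain comparable over time.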

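Summary

With strong nonlinear fitting capacity and built-in safeguards against overfitting, ensemble methods have become a workhorse of financial modeling. In credit scoring, LightGBM or XGBoost combined with rigorous feature engineering and time-series cross-validation can produce scoring models that are robust and interpretable. Production deployments also need to address model monitoring, fairness, and regulatory compliance.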