Model Evaluation and Selection
Model evaluation is a critical stage in the machine learning workflow. Choosing appropriate evaluation metrics and validation strategies directly determines whether we can correctly assess a model's generalization ability.
Overview of Model Evaluation
The central question of model evaluation is: How well does a trained model perform on data it has never seen?
We need to answer the following questions:
- What metrics should we use to measure model quality?
- How can we reliably estimate generalization performance?
- How do we choose among multiple candidate models?
- How can we efficiently search for hyperparameters?
Cross-Validation
Cross-validation is the standard method for estimating generalization performance. It repeatedly trains and validates on different subsets of the training data to obtain more reliable performance estimates.
K-Fold Cross-Validation
Randomly partition the dataset into \(K\) equally sized subsets (folds). In each iteration, train on \(K-1\) folds and validate on the remaining fold. Repeat \(K\) times so that each fold serves as the validation set exactly once.
Common choices are \(K = 5\) or \(K = 10\).
Advantages: Makes full use of the data; relatively low variance.
Disadvantages: Computational cost is \(K\) times that of a single training run.
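As a sketch of the procedure, using scikit-learn with a synthetic dataset and a logistic regression chosen purely for illustration:

```python
# Sketch: 5-fold cross-validation with scikit-learn on a synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
model = LogisticRegression(max_iter=1000)

# Each of the 5 folds serves exactly once as the validation set.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print("Per-fold accuracy:", scores)
print("Mean +/- std:", scores.mean(), scores.std())
```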
Stratified K-Fold
In stratified K-fold cross-validation, each fold preserves the same class proportions as the overall dataset.
Best suited for: Classification tasks, especially with class imbalance (e.g., fraud detection where positive samples constitute only 1%). Without stratification, some folds may contain no minority-class samples at all.
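A minimal sketch of stratified splitting with scikit-learn; the 1% positive rate mirrors the fraud-detection example above, and the data are synthetic:

```python
# Sketch: stratified folds keep the minority-class proportion in every fold.
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.01).astype(int)   # ~1% positives, as in fraud detection
X = rng.normal(size=(10_000, 5))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    # Every validation fold contains roughly the same share of positives.
    print(f"Fold {fold}: positive rate in validation = {y[val_idx].mean():.4f}")
```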
Leave-One-Out (LOO)
LOO is the extreme case of K-Fold where \(K = n\) (the number of samples). Each iteration holds out one sample for validation and trains on the remaining \(n-1\) samples.
Advantages: Provides a nearly unbiased estimate of the generalization error.
Disadvantages: Extremely high computational cost (\(n\) models must be trained); high variance (training sets across folds are highly overlapping).
Time Series Splitting
For time series data, random splitting must not be used: the model would be trained on future observations and validated on past ones, leaking information that would not be available at prediction time. The validation data must always come after the training data in time.
Walk-Forward Validation (shown here with an expanding training window):
- Fold 1: [Train: t1-t3] [Validate: t4]
- Fold 2: [Train: t1-t4] [Validate: t5]
- Fold 3: [Train: t1-t5] [Validate: t6]
Expanding Window: The training set grows with each fold, as in the example above.
Sliding (Rolling) Window: The training set size stays fixed and the window slides forward.
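A sketch of these schemes with scikit-learn's TimeSeriesSplit; the expanding window is its default behaviour, and a sliding window can be obtained via its max_train_size argument. The data here are synthetic and purely illustrative.

```python
# Sketch: walk-forward splits for time-ordered data with scikit-learn.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(24).reshape(-1, 1)   # 24 time steps, oldest first

# Expanding window: the training set grows with each fold.
tscv = TimeSeriesSplit(n_splits=3)
for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
    print(f"Fold {fold}: train={train_idx.min()}..{train_idx.max()}, "
          f"validate={val_idx.min()}..{val_idx.max()}")

# Sliding window: cap the training size so the window moves forward instead.
sliding = TimeSeriesSplit(n_splits=3, max_train_size=6)
```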
Classification Metrics
Confusion Matrix
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | TP (True Positive) | FN (False Negative) |
| Actual Negative | FP (False Positive) | TN (True Negative) |
Fundamental Metrics
Accuracy:
\[\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}\]
Accuracy can be misleading under class imbalance. For example, if 99% of samples are negative, a model that always predicts negative still achieves 99% accuracy.
Precision:
\[\text{Precision} = \frac{TP}{TP + FP}\]
"Of all samples predicted as positive, how many are actually positive?" Focuses on the cost of false positives.
Recall (Sensitivity / TPR):
\[\text{Recall} = \frac{TP}{TP + FN}\]
"Of all actually positive samples, how many were correctly identified?" Focuses on the cost of false negatives.
F1 Score:
The harmonic mean of precision and recall:
\[F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}\]
More generally, the \(F_\beta\) score allows adjusting the relative importance of precision and recall:
\[F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}\]
When \(\beta > 1\), recall is weighted more heavily; when \(\beta < 1\), precision is weighted more heavily.
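A minimal sketch computing these metrics with scikit-learn; the label vectors are made up for illustration:

```python
# Sketch: accuracy, precision, recall, and F-scores from predicted labels.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             fbeta_score, precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

# confusion_matrix returns [[TN, FP], [FN, TP]] for binary labels.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, FP, FN, TN:", tp, fp, fn, tn)
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("F2 (recall-weighted):", fbeta_score(y_true, y_pred, beta=2))
```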
AUC-ROC Curve
The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR):
\[\text{TPR} = \frac{TP}{TP + FN}, \qquad \text{FPR} = \frac{FP}{FP + TN}\]
By varying the classification threshold, we obtain a series of points that form the ROC curve.
AUC (Area Under the ROC Curve): the area beneath the ROC curve, \(\text{AUC} = \int_0^1 \text{TPR}\, d(\text{FPR}) \in [0, 1]\).
Probabilistic interpretation of AUC: the probability that the model ranks a randomly chosen positive sample higher than a randomly chosen negative sample.
| AUC Range | Model Quality |
|---|---|
| 0.5 | Random guessing |
| 0.5 -- 0.7 | Poor |
| 0.7 -- 0.8 | Fair |
| 0.8 -- 0.9 | Good |
| 0.9 -- 1.0 | Excellent |
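A sketch of computing the ROC curve and AUC from predicted probabilities, using scikit-learn on a synthetic, mildly imbalanced dataset; the estimator is an illustrative choice:

```python
# Sketch: ROC curve points and AUC from predicted scores.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

fpr, tpr, thresholds = roc_curve(y_te, proba)   # one (FPR, TPR) point per threshold
print("AUC:", roc_auc_score(y_te, proba))
```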
PR Curve (Precision-Recall Curve)
Plots precision against recall. When class imbalance is severe, the PR curve is more informative than the ROC curve.
AUPRC (Area Under the PR Curve) is the area under the precision-recall curve, also known as Average Precision (AP).
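Continuing the ROC sketch above (reusing y_te and proba), the PR curve and Average Precision can be obtained analogously:

```python
# Sketch: precision-recall curve and Average Precision, reusing y_te and proba
# from the ROC example above.
from sklearn.metrics import average_precision_score, precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_te, proba)
print("Average Precision (AUPRC):", average_precision_score(y_te, proba))
```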
Log Loss (Cross-Entropy Loss)
\[\text{LogLoss} = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log \hat{p}_i + (1 - y_i)\log(1 - \hat{p}_i)\right]\]
where \(\hat{p}_i\) is the predicted probability that sample \(i\) belongs to the positive class. Log Loss evaluates not only whether the predicted class is correct but also the calibration of the predicted probabilities: samples predicted with high confidence but classified incorrectly incur heavy penalties.
Regression Metrics
MSE (Mean Squared Error)
\[\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2\]
Penalizes large errors more heavily (due to the squaring). The units of MSE are the square of the target variable's units.
RMSE (Root Mean Squared Error)
\[\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}\]
Shares the same units as the target variable, making it easier to interpret.
MAE (Mean Absolute Error)
\[\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|\]
More robust to outliers than MSE (linear rather than quadratic penalty).
\(R^2\) (Coefficient of Determination)
\[R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}\]
\(R^2 = 1\) indicates a perfect fit; \(R^2 = 0\) means the model is no better than predicting the mean; \(R^2\) can be negative (the model is worse than predicting the mean).
MAPE (Mean Absolute Percentage Error)
\[\text{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|\]
Expresses error as a percentage, which is convenient for business stakeholders. Caveat: MAPE approaches infinity when \(y_i\) is close to 0.
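A sketch computing these regression metrics; MAPE is computed by hand here to make the near-zero-target caveat explicit, and the numbers are illustrative:

```python
# Sketch: MSE, RMSE, MAE, R^2, and MAPE for a set of predictions.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0, 4.2])
y_pred = np.array([2.5,  0.0, 2.1, 7.8, 3.9])

mse = mean_squared_error(y_true, y_pred)
print("MSE :", mse)
print("RMSE:", np.sqrt(mse))
print("MAE :", mean_absolute_error(y_true, y_pred))
print("R^2 :", r2_score(y_true, y_pred))
# MAPE by hand; it blows up whenever a true value is close to zero.
print("MAPE:", 100 * np.mean(np.abs((y_true - y_pred) / y_true)), "%")
```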
Regression Metrics Comparison
| Metric | Sensitivity to Outliers | Interpretability | Range | Best Suited For |
|---|---|---|---|---|
| MSE | Very sensitive | Poor (squared units) | \([0, +\infty)\) | Optimization objective, theoretical analysis |
| RMSE | Sensitive | Good (original units) | \([0, +\infty)\) | General-purpose evaluation |
| MAE | Robust | Good (original units) | \([0, +\infty)\) | Data with outliers |
| \(R^2\) | Sensitive | Good (proportion) | \((-\infty, 1]\) | Explaining model performance |
| MAPE | Robust | Good (percentage) | \([0, +\infty)\) | Business reporting |
Learning Curves and Validation Curves
Learning Curves
A learning curve plots training set size against training/validation error:
- High bias (underfitting): Both training and validation errors are high and barely decrease with more data. Action: increase model complexity.
- High variance (overfitting): Training error is low but validation error is high, with a large gap between them. Action: add more training data or apply regularization.
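A sketch of producing a learning curve with scikit-learn's learning_curve helper; the dataset and estimator are illustrative, and matplotlib is assumed for plotting:

```python
# Sketch: training vs. validation score as a function of training set size.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5, scoring="accuracy")

# A large persistent gap suggests high variance; two low, flat curves suggest high bias.
plt.plot(sizes, train_scores.mean(axis=1), label="training score")
plt.plot(sizes, val_scores.mean(axis=1), label="validation score")
plt.xlabel("training set size")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```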
Validation Curves
A validation curve plots a specific hyperparameter against training/validation error:
- Helps identify the optimal range for a hyperparameter
- Hyperparameter too small: underfitting (high bias)
- Hyperparameter too large: overfitting (high variance)
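Similarly, a sketch sweeping a single hyperparameter with scikit-learn's validation_curve; the regularization parameter C of a logistic regression is an illustrative choice:

```python
# Sketch: training vs. validation score as one hyperparameter varies.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import validation_curve

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
param_range = np.logspace(-4, 3, 8)   # candidate values for C

train_scores, val_scores = validation_curve(
    LogisticRegression(max_iter=2000), X, y,
    param_name="C", param_range=param_range, cv=5, scoring="accuracy")

for c, tr, va in zip(param_range, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"C={c:8.4f}  train={tr:.3f}  val={va:.3f}")
```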
Hyperparameter Search
Grid Search
Exhaustively evaluates all hyperparameter combinations:
- Specify a list of candidate values for each hyperparameter
- Evaluate cross-validation performance for every combination
Advantages: Guarantees finding the best combination among the specified candidate values.
Disadvantages: Curse of dimensionality. With \(m\) hyperparameters, each having \(k\) candidate values, \(k^m\) combinations must be evaluated.
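A sketch of grid search with scikit-learn's GridSearchCV; the SVM and its 4 x 3 grid are illustrative assumptions:

```python
# Sketch: exhaustive grid search over an SVM's hyperparameters with 5-fold CV.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

param_grid = {
    "C": [0.1, 1, 10, 100],        # 4 values
    "gamma": [1e-3, 1e-2, 1e-1],   # x 3 values = 12 combinations to evaluate
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print("Best params:", search.best_params_)
print("Best CV accuracy:", search.best_score_)
```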
Random Search
Randomly samples a fixed number of points from the hyperparameter space:
Bergstra & Bengio (2012) demonstrated that random search is generally more efficient than grid search, for several reasons:
- Some hyperparameters are far more important than others
- Grid search wastes many evaluations on unimportant dimensions
- Random search achieves better coverage along each dimension
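A corresponding sketch with RandomizedSearchCV, drawing a fixed budget of 20 configurations from log-uniform distributions; the ranges are illustrative:

```python
# Sketch: random search over the same SVM, with a fixed budget of 20 draws.
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

param_distributions = {
    "C": loguniform(1e-2, 1e3),      # sample C on a log scale
    "gamma": loguniform(1e-4, 1e0),
}
search = RandomizedSearchCV(SVC(kernel="rbf"), param_distributions,
                            n_iter=20, cv=5, scoring="accuracy", random_state=0)
search.fit(X, y)
print("Best params:", search.best_params_)
print("Best CV accuracy:", search.best_score_)
```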
Bayesian Optimization
Bayesian optimization uses a surrogate model (typically a Gaussian process) to model the relationship between hyperparameters and performance, and an acquisition function (e.g., Expected Improvement) to intelligently select the next set of hyperparameters.
Basic Procedure:
1. Initialization: randomly evaluate several hyperparameter combinations.
2. Fit a surrogate model \(\hat{f}(\theta)\) to approximate the true objective function.
3. Select the next hyperparameters by maximizing the acquisition function, e.g. Expected Improvement:
   \[\theta_{\text{next}} = \arg\max_{\theta}\; \mathbb{E}\left[\max\left(0,\; \hat{f}(\theta) - f^{+}\right)\right]\]
   where \(f^+\) is the current best observed value.
4. Evaluate the new hyperparameters and update the surrogate model.
5. Repeat steps 2--4 until the evaluation budget is exhausted.
Popular Tools:
- Optuna: Supports pruning, multiple samplers (TPE, CMA-ES)
- Hyperopt: Based on TPE (Tree-structured Parzen Estimator)
- Bayesian Optimization: Based on Gaussian processes
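As a sketch, a minimal Optuna study; note that Optuna's default TPE sampler is a sequential model-based optimizer rather than a Gaussian process, and the model and search space below are illustrative:

```python
# Sketch: sequential model-based hyperparameter search with Optuna.
import optuna
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

def objective(trial):
    # The sampler proposes the next point based on all previous trials.
    c = trial.suggest_float("C", 1e-2, 1e3, log=True)
    gamma = trial.suggest_float("gamma", 1e-4, 1.0, log=True)
    model = SVC(kernel="rbf", C=c, gamma=gamma)
    return cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)
print("Best params:", study.best_params)
print("Best CV accuracy:", study.best_value)
```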
| Method | Efficiency | Parallelization | Best Suited For |
|---|---|---|---|
| Grid Search | Low | Easy | Few hyperparameters, small search space |
| Random Search | Medium | Easy | General purpose, many hyperparameters |
| Bayesian Optimization | High | More difficult | Expensive evaluations, many hyperparameters |
Model Selection
Bias-Variance Tradeoff
Bias: The systematic deviation of the model from the true relationship. High-bias models are too simple (underfitting).
Variance: The model's sensitivity to fluctuations in the training data. High-variance models are too complex (overfitting).
For regression, the expected generalization error at a point \(x\) can be decomposed as:
\[\mathbb{E}\left[\left(y - \hat{f}(x)\right)^2\right] = \underbrace{\left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]}_{\text{Variance}} + \sigma^2_\epsilon\]
where \(f\) is the true function, \(\hat{f}\) is the learned model (random over training sets), and \(\sigma^2_\epsilon\) is the irreducible noise.
Effect of Model Complexity:
- Model too simple: high bias, low variance -- underfitting
- Model too complex: low bias, high variance -- overfitting
- Optimal complexity: minimizes the sum of bias and variance
No Free Lunch Theorem
No single model is optimal for all problems. Averaged over all possible data-generating distributions, any two algorithms have the same expected performance.
This implies:
- The appropriate model must be chosen for each specific problem
- Prior knowledge and data characteristics determine the best model choice
- There is no "universal" algorithm
Statistical Tests
When comparing the performance of two or more models, examining average scores alone is insufficient. Statistical tests are needed to determine whether performance differences are significant.
Paired t-Test
Tests whether two models have significantly different mean performance across \(K\)-fold cross-validation.
Hypothesis: \(H_0\): There is no difference in performance between the two models.
Test statistic:
\[t = \frac{\bar{d}}{s_d / \sqrt{K}}\]
where \(d_k = \text{Score}_k^{A} - \text{Score}_k^{B}\), \(\bar{d}\) is the mean of the differences, and \(s_d\) is the standard deviation of the differences.
Degrees of freedom \(df = K - 1\); consult the t-distribution table to decide whether to reject \(H_0\).
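A sketch of this test on per-fold scores using scipy; the fold scores are made-up numbers for illustration:

```python
# Sketch: paired t-test on per-fold CV scores of two models.
import numpy as np
from scipy import stats

# Hypothetical accuracy of models A and B on the same 10 folds.
scores_a = np.array([0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.80, 0.85, 0.81])
scores_b = np.array([0.78, 0.77, 0.82, 0.79, 0.80, 0.76, 0.81, 0.79, 0.83, 0.80])

d = scores_a - scores_b
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))   # the formula above
t_scipy, p_value = stats.ttest_rel(scores_a, scores_b)    # same test, df = K - 1

print("t =", t_manual, "(scipy:", t_scipy, ")  p =", p_value)
```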
Corrected Resampled t-Test
The corrected resampled t-test (Nadeau & Bengio) accounts for the dependence between resampling runs caused by their overlapping training sets:
\[t = \frac{\bar{d}}{\sqrt{\left(\frac{1}{R} + \frac{n_{\text{test}}}{n_{\text{train}}}\right) s_d^2}}\]
where \(R\) is the number of resampling runs, and \(n_{\text{test}}\) and \(n_{\text{train}}\) are the number of test and training samples per run, respectively.
McNemar's Test
McNemar's test compares two classifiers on the same test set without requiring cross-validation.
Construct a contingency table:
| | Model B Correct | Model B Incorrect |
|---|---|---|
| Model A Correct | \(n_{00}\) | \(n_{01}\) |
| Model A Incorrect | \(n_{10}\) | \(n_{11}\) |
Test statistic (with continuity correction):
\[\chi^2 = \frac{\left(|n_{01} - n_{10}| - 1\right)^2}{n_{01} + n_{10}}\]
Under \(H_0\), \(\chi^2\) follows a chi-squared distribution with 1 degree of freedom.
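A sketch computing McNemar's statistic directly from the discordant counts, with the p-value taken from scipy's chi-squared distribution; the counts are illustrative:

```python
# Sketch: McNemar's test with continuity correction from a 2x2 contingency table.
from scipy.stats import chi2

# Hypothetical counts on a shared test set:
n01 = 25   # A correct, B incorrect
n10 = 10   # A incorrect, B correct

# Only the discordant cells enter the test statistic.
statistic = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
p_value = chi2.sf(statistic, df=1)   # upper tail of chi-squared with 1 dof

print("chi2 =", statistic, " p =", p_value)
```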
Statistical Test Selection Guide
| Scenario | Recommended Test |
|---|---|
| Two models, same dataset, K-Fold CV | Paired t-test |
| Two models, same test set | McNemar's test |
| Multiple models, same dataset | Friedman test + Nemenyi post-hoc |
| Two models, multiple datasets | Wilcoxon signed-rank test |
| Multiple models, multiple datasets | Friedman test |