Model Evaluation and Selection
Model evaluation is a critical stage in the machine learning workflow. Choosing appropriate evaluation metrics and validation strategies directly determines whether we can correctly assess a model's generalization ability.
Overview of Model Evaluation
The central question of model evaluation is: How well does a trained model perform on data it has never seen?
We need to answer the following questions:
- What metrics should we use to measure model quality?
- How can we reliably estimate generalization performance?
- How do we choose among multiple candidate models?
- How can we efficiently search for hyperparameters?
Cross-Validation
Cross-validation is the standard method for estimating generalization performance. It repeatedly trains and validates on different subsets of the training data to obtain more reliable performance estimates.
K-Fold Cross-Validation
Randomly partition the dataset into \(K\) equally sized subsets (folds). In each iteration, train on \(K-1\) folds and validate on the remaining fold. Repeat \(K\) times so that each fold serves as the validation set exactly once.
Common choices are \(K = 5\) or \(K = 10\).
Advantages: Makes full use of the data; relatively low variance.
Disadvantages: Computational cost is \(K\) times that of a single training run.
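As a sketch of the procedure, using scikit-learn with a synthetic dataset and a logistic regression chosen purely for illustration:

```python
# Sketch: 5-fold cross-validation with scikit-learn on a synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
model = LogisticRegression(max_iter=1000)

# Each of the 5 folds serves exactly once as the validation set.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print("Per-fold accuracy:", scores)
print("Mean +/- std:", scores.mean(), scores.std())
```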
Stratified K-Fold
In stratified K-fold cross-validation, each fold preserves the same class proportions as the overall dataset.
Best suited for: Classification tasks, especially with class imbalance (e.g., fraud detection where positive samples constitute only 1%). Without stratification, some folds may contain no minority-class samples at all.
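A minimal sketch of stratified splitting with scikit-learn; the 1% positive rate mirrors the fraud-detection example above, and the data are synthetic:

```python
# Sketch: stratified folds keep the minority-class proportion in every fold.
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.01).astype(int)   # ~1% positives, as in fraud detection
X = rng.normal(size=(10_000, 5))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    # Every validation fold contains roughly the same share of positives.
    print(f"Fold {fold}: positive rate in validation = {y[val_idx].mean():.4f}")
```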
Leave-One-Out (LOO)
LOO is the extreme case of K-Fold where \(K = n\) (the number of samples). Each iteration holds out one sample for validation and trains on the remaining \(n-1\) samples.
Advantages: Provides a nearly unbiased estimate of the generalization error.
Disadvantages: Extremely high computational cost (\(n\) models must be trained); high variance (training sets across folds are highly overlapping).
Time Series Splitting
For time series data, random splitting must not be used: the model would be trained on future observations and validated on past ones, leaking information that would not be available at prediction time. The validation data must always come after the training data in time.
Walk-Forward Validation (shown here with an expanding training window):
- Fold 1: [Train: t1-t3] [Validate: t4]
- Fold 2: [Train: t1-t4] [Validate: t5]
- Fold 3: [Train: t1-t5] [Validate: t6]
Expanding Window: The training set grows with each fold, as in the example above.
Sliding (Rolling) Window: The training set size stays fixed and the window slides forward.
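A sketch of these schemes with scikit-learn's TimeSeriesSplit; the expanding window is its default behaviour, and a sliding window can be obtained via its max_train_size argument. The data here are synthetic and purely illustrative.

```python
# Sketch: walk-forward splits for time-ordered data with scikit-learn.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(24).reshape(-1, 1)   # 24 time steps, oldest first

# Expanding window: the training set grows with each fold.
tscv = TimeSeriesSplit(n_splits=3)
for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
    print(f"Fold {fold}: train={train_idx.min()}..{train_idx.max()}, "
          f"validate={val_idx.min()}..{val_idx.max()}")

# Sliding window: cap the training size so the window moves forward instead.
sliding = TimeSeriesSplit(n_splits=3, max_train_size=6)
```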
Classification Metrics
Confusion Matrix
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | TP (True Positive) | FN (False Negative) |
| Actual Negative | FP (False Positive) | TN (True Negative) |
Fundamental Metrics
Accuracy:
\[\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}\]
Accuracy can be misleading under class imbalance. For example, if 99% of samples are negative, a model that always predicts negative still achieves 99% accuracy.
Precision:
\[\text{Precision} = \frac{TP}{TP + FP}\]
"Of all samples predicted as positive, how many are actually positive?" Focuses on the cost of false positives.
Recall (Sensitivity / TPR):
\[\text{Recall} = \frac{TP}{TP + FN}\]
"Of all actually positive samples, how many were correctly identified?" Focuses on the cost of false negatives.
F1 Score:
The harmonic mean of precision and recall:
\[F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}\]
More generally, the \(F_\beta\) score allows adjusting the relative importance of precision and recall:
\[F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}\]
When \(\beta > 1\), recall is weighted more heavily; when \(\beta < 1\), precision is weighted more heavily.
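A minimal sketch computing these metrics with scikit-learn; the label vectors are made up for illustration:

```python
# Sketch: accuracy, precision, recall, and F-scores from predicted labels.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             fbeta_score, precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

# confusion_matrix returns [[TN, FP], [FN, TP]] for binary labels.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, FP, FN, TN:", tp, fp, fn, tn)
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("F2 (recall-weighted):", fbeta_score(y_true, y_pred, beta=2))
```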
AUC-ROC Curve
The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR):
\[\text{TPR} = \frac{TP}{TP + FN}, \qquad \text{FPR} = \frac{FP}{FP + TN}\]
By varying the classification threshold, we obtain a series of points that form the ROC curve.
AUC (Area Under the ROC Curve): the area beneath the ROC curve, \(\text{AUC} = \int_0^1 \text{TPR}\, d(\text{FPR}) \in [0, 1]\).
Probabilistic interpretation of AUC: the probability that the model ranks a randomly chosen positive sample higher than a randomly chosen negative sample.
| AUC Range | Model Quality |
|---|---|
| 0.5 | Random guessing |
| 0.5 -- 0.7 | Poor |
| 0.7 -- 0.8 | Fair |
| 0.8 -- 0.9 | Good |
| 0.9 -- 1.0 | Excellent |
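A sketch of computing the ROC curve and AUC from predicted probabilities, using scikit-learn on a synthetic, mildly imbalanced dataset; the estimator is an illustrative choice:

```python
# Sketch: ROC curve points and AUC from predicted scores.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

fpr, tpr, thresholds = roc_curve(y_te, proba)   # one (FPR, TPR) point per threshold
print("AUC:", roc_auc_score(y_te, proba))
```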
PR Curve (Precision-Recall Curve)
Plots precision against recall. When class imbalance is severe, the PR curve is more informative than the ROC curve.
AUPRC (Area Under the PR Curve) is the area under the precision-recall curve, also known as Average Precision (AP).
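Continuing the ROC sketch above (reusing y_te and proba), the PR curve and Average Precision can be obtained analogously:

```python
# Sketch: precision-recall curve and Average Precision, reusing y_te and proba
# from the ROC example above.
from sklearn.metrics import average_precision_score, precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_te, proba)
print("Average Precision (AUPRC):", average_precision_score(y_te, proba))
```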
Log Loss (Cross-Entropy Loss)
\[\text{LogLoss} = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log \hat{p}_i + (1 - y_i)\log(1 - \hat{p}_i)\right]\]
where \(\hat{p}_i\) is the predicted probability that sample \(i\) belongs to the positive class. Log Loss evaluates not only whether the predicted class is correct but also the calibration of the predicted probabilities: samples predicted with high confidence but classified incorrectly incur heavy penalties.
Regression Metrics
MSE (Mean Squared Error)
\[\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2\]
Penalizes large errors more heavily (due to the squaring). The units of MSE are the square of the target variable's units.
RMSE (Root Mean Squared Error)
\[\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}\]
Shares the same units as the target variable, making it easier to interpret.
MAE (Mean Absolute Error)
\[\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|\]
More robust to outliers than MSE (linear rather than quadratic penalty).
\(R^2\) (Coefficient of Determination)
\[R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}\]
\(R^2 = 1\) indicates a perfect fit; \(R^2 = 0\) means the model is no better than predicting the mean; \(R^2\) can be negative (the model is worse than predicting the mean).
MAPE (Mean Absolute Percentage Error)
\[\text{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|\]
Expresses error as a percentage, which is convenient for business stakeholders. Caveat: MAPE approaches infinity when \(y_i\) is close to 0.
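A sketch computing these regression metrics; MAPE is computed by hand here to make the near-zero-target caveat explicit, and the numbers are illustrative:

```python
# Sketch: MSE, RMSE, MAE, R^2, and MAPE for a set of predictions.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0, 4.2])
y_pred = np.array([2.5,  0.0, 2.1, 7.8, 3.9])

mse = mean_squared_error(y_true, y_pred)
print("MSE :", mse)
print("RMSE:", np.sqrt(mse))
print("MAE :", mean_absolute_error(y_true, y_pred))
print("R^2 :", r2_score(y_true, y_pred))
# MAPE by hand; it blows up whenever a true value is close to zero.
print("MAPE:", 100 * np.mean(np.abs((y_true - y_pred) / y_true)), "%")
```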
Regression Metrics Comparison
| Metric | Sensitivity to Outliers | Interpretability | Range | Best Suited For |
|---|---|---|---|---|
| MSE | Very sensitive | Poor (squared units) | \([0, +\infty)\) | Optimization objective, theoretical analysis |
| RMSE | Sensitive | Good (original units) | \([0, +\infty)\) | General-purpose evaluation |
| MAE | Robust | Good (original units) | \([0, +\infty)\) | Data with outliers |
| \(R^2\) | Sensitive | Good (proportion) | \((-\infty, 1]\) | Explaining model performance |
| MAPE | Robust | Good (percentage) | \([0, +\infty)\) | Business reporting |
Learning Curves and Validation Curves
Learning Curves
A learning curve plots training set size against training/validation error:
- High bias (underfitting): Both training and validation errors are high and barely decrease with more data. Action: increase model complexity.
- High variance (overfitting): Training error is low but validation error is high, with a large gap between them. Action: add more training data or apply regularization.
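A sketch of producing a learning curve with scikit-learn's learning_curve helper; the dataset and estimator are illustrative, and matplotlib is assumed for plotting:

```python
# Sketch: training vs. validation score as a function of training set size.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5, scoring="accuracy")

# A large persistent gap suggests high variance; two low, flat curves suggest high bias.
plt.plot(sizes, train_scores.mean(axis=1), label="training score")
plt.plot(sizes, val_scores.mean(axis=1), label="validation score")
plt.xlabel("training set size")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```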
Validation Curves
A validation curve plots a specific hyperparameter against training/validation error:
- Helps identify the optimal range for a hyperparameter
- Hyperparameter too small: underfitting (high bias)
- Hyperparameter too large: overfitting (high variance)
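Similarly, a sketch sweeping a single hyperparameter with scikit-learn's validation_curve; the regularization parameter C of a logistic regression is an illustrative choice:

```python
# Sketch: training vs. validation score as one hyperparameter varies.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import validation_curve

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
param_range = np.logspace(-4, 3, 8)   # candidate values for C

train_scores, val_scores = validation_curve(
    LogisticRegression(max_iter=2000), X, y,
    param_name="C", param_range=param_range, cv=5, scoring="accuracy")

for c, tr, va in zip(param_range, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"C={c:8.4f}  train={tr:.3f}  val={va:.3f}")
```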
Hyperparameter Search
Grid Search
Exhaustively evaluates all hyperparameter combinations:
- Specify a list of candidate values for each hyperparameter
- Evaluate cross-validation performance for every combination
Advantages: Guarantees finding the best combination among the specified candidate values.
Disadvantages: Curse of dimensionality. With \(m\) hyperparameters, each having \(k\) candidate values, \(k^m\) combinations must be evaluated.
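A sketch of grid search with scikit-learn's GridSearchCV; the SVM and its 4 x 3 grid are illustrative assumptions:

```python
# Sketch: exhaustive grid search over an SVM's hyperparameters with 5-fold CV.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

param_grid = {
    "C": [0.1, 1, 10, 100],        # 4 values
    "gamma": [1e-3, 1e-2, 1e-1],   # x 3 values = 12 combinations to evaluate
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print("Best params:", search.best_params_)
print("Best CV accuracy:", search.best_score_)
```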
Random Search
Randomly samples a fixed number of points from the hyperparameter space:
Bergstra & Bengio (2012) demonstrated that random search is generally more efficient than grid search, for several reasons:
- Some hyperparameters are far more important than others
- Grid search wastes many evaluations on unimportant dimensions
- Random search achieves better coverage along each dimension
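A corresponding sketch with RandomizedSearchCV, drawing a fixed budget of 20 configurations from log-uniform distributions; the ranges are illustrative:

```python
# Sketch: random search over the same SVM, with a fixed budget of 20 draws.
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

param_distributions = {
    "C": loguniform(1e-2, 1e3),      # sample C on a log scale
    "gamma": loguniform(1e-4, 1e0),
}
search = RandomizedSearchCV(SVC(kernel="rbf"), param_distributions,
                            n_iter=20, cv=5, scoring="accuracy", random_state=0)
search.fit(X, y)
print("Best params:", search.best_params_)
print("Best CV accuracy:", search.best_score_)
```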
Bayesian Optimization
Bayesian optimization uses a surrogate model (typically a Gaussian process) to model the relationship between hyperparameters and performance, and an acquisition function (e.g., Expected Improvement) to intelligently select the next set of hyperparameters.
Basic Procedure:
1. Initialization: randomly evaluate several hyperparameter combinations.
2. Fit a surrogate model \(\hat{f}(\theta)\) to approximate the true objective function.
3. Select the next hyperparameters by maximizing the acquisition function, e.g. Expected Improvement:
   \[\theta_{\text{next}} = \arg\max_{\theta}\; \mathbb{E}\left[\max\left(0,\; \hat{f}(\theta) - f^{+}\right)\right]\]
   where \(f^+\) is the current best observed value.
4. Evaluate the new hyperparameters and update the surrogate model.
5. Repeat steps 2--4 until the evaluation budget is exhausted.
Popular Tools:
- Optuna: Supports pruning, multiple samplers (TPE, CMA-ES)
- Hyperopt: Based on TPE (Tree-structured Parzen Estimator)
- Bayesian Optimization: Based on Gaussian processes
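As a sketch, a minimal Optuna study; note that Optuna's default TPE sampler is a sequential model-based optimizer rather than a Gaussian process, and the model and search space below are illustrative:

```python
# Sketch: sequential model-based hyperparameter search with Optuna.
import optuna
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

def objective(trial):
    # The sampler proposes the next point based on all previous trials.
    c = trial.suggest_float("C", 1e-2, 1e3, log=True)
    gamma = trial.suggest_float("gamma", 1e-4, 1.0, log=True)
    model = SVC(kernel="rbf", C=c, gamma=gamma)
    return cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)
print("Best params:", study.best_params)
print("Best CV accuracy:", study.best_value)
```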
| Method | Efficiency | Parallelization | Best Suited For |
|---|---|---|---|
| Grid Search | Low | Easy | Few hyperparameters, small search space |
| Random Search | Medium | Easy | General purpose, many hyperparameters |
| Bayesian Optimization | High | More difficult | Expensive evaluations, many hyperparameters |
Model Selection
Bias-Variance Tradeoff
Bias: The systematic deviation of the model from the true relationship. High-bias models are too simple (underfitting).
Variance: The model's sensitivity to fluctuations in the training data. High-variance models are too complex (overfitting).
For regression, the expected generalization error at a point \(x\) can be decomposed as:
\[\mathbb{E}\left[\left(y - \hat{f}(x)\right)^2\right] = \underbrace{\left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]}_{\text{Variance}} + \sigma^2_\epsilon\]
where \(f\) is the true function, \(\hat{f}\) is the learned model (random over training sets), and \(\sigma^2_\epsilon\) is the irreducible noise.
Effect of Model Complexity:
- Model too simple: high bias, low variance -- underfitting
- Model too complex: low bias, high variance -- overfitting
- Optimal complexity: minimizes the sum of bias and variance
No Free Lunch Theorem
No single model is optimal for all problems. Averaged over all possible data-generating distributions, any two algorithms have the same expected performance.
This implies:
- The appropriate model must be chosen for each specific problem
- Prior knowledge and data characteristics determine the best model choice
- There is no "universal" algorithm
Statistical Tests
When comparing the performance of two or more models, examining average scores alone is insufficient. Statistical tests are needed to determine whether performance differences are significant.
Paired t-Test
Tests whether two models have significantly different mean performance across \(K\)-fold cross-validation.
Hypothesis: \(H_0\): There is no difference in performance between the two models.
Test statistic:
\[t = \frac{\bar{d}}{s_d / \sqrt{K}}\]
where \(d_k = \text{Score}_k^{A} - \text{Score}_k^{B}\), \(\bar{d}\) is the mean of the differences, and \(s_d\) is the standard deviation of the differences.
Degrees of freedom \(df = K - 1\); consult the t-distribution table to decide whether to reject \(H_0\).
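A sketch of this test on per-fold scores using scipy; the fold scores are made-up numbers for illustration:

```python
# Sketch: paired t-test on per-fold CV scores of two models.
import numpy as np
from scipy import stats

# Hypothetical accuracy of models A and B on the same 10 folds.
scores_a = np.array([0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.80, 0.85, 0.81])
scores_b = np.array([0.78, 0.77, 0.82, 0.79, 0.80, 0.76, 0.81, 0.79, 0.83, 0.80])

d = scores_a - scores_b
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))   # the formula above
t_scipy, p_value = stats.ttest_rel(scores_a, scores_b)    # same test, df = K - 1

print("t =", t_manual, "(scipy:", t_scipy, ")  p =", p_value)
```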
Corrected Resampled t-Test
The corrected resampled t-test (Nadeau & Bengio) accounts for the dependence between resampling runs caused by their overlapping training sets:
\[t = \frac{\bar{d}}{\sqrt{\left(\frac{1}{R} + \frac{n_{\text{test}}}{n_{\text{train}}}\right) s_d^2}}\]
where \(R\) is the number of resampling runs, and \(n_{\text{test}}\) and \(n_{\text{train}}\) are the number of test and training samples per run, respectively.
McNemar's Test
McNemar's test compares two classifiers on the same test set without requiring cross-validation.
Construct a contingency table:
| | Model B Correct | Model B Incorrect |
|---|---|---|
| Model A Correct | \(n_{00}\) | \(n_{01}\) |
| Model A Incorrect | \(n_{10}\) | \(n_{11}\) |
Test statistic (with continuity correction):
\[\chi^2 = \frac{\left(|n_{01} - n_{10}| - 1\right)^2}{n_{01} + n_{10}}\]
Under \(H_0\), \(\chi^2\) follows a chi-squared distribution with 1 degree of freedom.
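A sketch computing McNemar's statistic directly from the discordant counts, with the p-value taken from scipy's chi-squared distribution; the counts are illustrative:

```python
# Sketch: McNemar's test with continuity correction from a 2x2 contingency table.
from scipy.stats import chi2

# Hypothetical counts on a shared test set:
n01 = 25   # A correct, B incorrect
n10 = 10   # A incorrect, B correct

# Only the discordant cells enter the test statistic.
statistic = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
p_value = chi2.sf(statistic, df=1)   # upper tail of chi-squared with 1 dof

print("chi2 =", statistic, " p =", p_value)
```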
Statistical Test Selection Guide
| Scenario | Recommended Test |
|---|---|
| Two models, same dataset, K-Fold CV | Paired t-test |
| Two models, same test set | McNemar's test |
| Multiple models, same dataset | Friedman test + Nemenyi post-hoc |
| Two models, multiple datasets | Wilcoxon signed-rank test |
| Multiple models, multiple datasets | Friedman test |