Evaluation & Monitoring
Evaluation is integral to every stage of a model's lifecycle, from training to deployment. Offline evaluation determines whether a model is ready to go live, online evaluation verifies that its real-world performance meets expectations, and monitoring ensures the model continues to operate reliably in production.
Offline Evaluation
Classification Metrics
The Confusion Matrix is the foundation of all classification metrics:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | TP (True Positive) | FN (False Negative) |
| Actual Negative | FP (False Positive) | TN (True Negative) |
Common metrics:
- Accuracy = \(\frac{TP + TN}{TP + TN + FP + FN}\), suitable for class-balanced scenarios
- Precision = \(\frac{TP}{TP + FP}\), answers "of all positive predictions, how many are truly positive"
- Recall = \(\frac{TP}{TP + FN}\), answers "of all actual positives, how many were identified"
- F1 Score = \(\frac{2 \times P \times R}{P + R}\), the harmonic mean of Precision and Recall
- AUC-ROC: area under the ROC curve, measuring the model's ability to distinguish between positive and negative samples across different thresholds; range [0, 1], where 0.5 indicates random guessing
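To make these definitions concrete, here is a minimal sketch (plain Python; the counts are made up for illustration) that computes the first four metrics directly from confusion-matrix counts:

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute common classification metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Imbalanced example: accuracy looks high even though recall is poor.
print(classification_metrics(tp=10, fp=5, fn=40, tn=945))
# {'accuracy': 0.955, 'precision': 0.667, 'recall': 0.2, 'f1': 0.308} (approximately)
```

Note how accuracy looks strong on this imbalanced example even though recall is only 0.2, which is exactly the failure mode the selection guidelines below warn about.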
Selection guidelines:
- When classes are imbalanced, Accuracy can be misleading — use F1 or AUC-ROC instead
- In scenarios where missed detections are costly (e.g., medical diagnosis), prioritize Recall
- In scenarios where false alarms are costly (e.g., spam filtering), prioritize Precision
Text Generation Metrics
BLEU (Bilingual Evaluation Understudy):
- Core idea: measures n-gram overlap between generated text and reference text
- Commonly used for machine translation evaluation
- BLEU-4 considers 1-gram through 4-gram matches simultaneously
- Limitation: only checks exact word matches without understanding semantics
ROUGE (Recall-Oriented Understudy for Gisting Evaluation):
- ROUGE-N: recall based on n-gram overlap
- ROUGE-L: based on the Longest Common Subsequence (LCS)
- Commonly used for text summarization evaluation
- Key difference from BLEU: BLEU emphasizes Precision, while ROUGE emphasizes Recall
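The core ideas behind BLEU and ROUGE-L are easy to show in code. Below is a simplified sketch (illustrative only; it omits BLEU's brevity penalty, multi-reference handling, and the smoothing used by production implementations):

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Modified n-gram precision (BLEU's building block): clip candidate counts by reference counts."""
    cand_ngrams = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref_ngrams = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
    return matched / total

def lcs_length(a, b):
    """Longest common subsequence length, the core of ROUGE-L."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()
print(ngram_precision(candidate, reference, n=1))         # 1-gram precision: 5/6
print(lcs_length(candidate, reference) / len(reference))  # ROUGE-L recall: 5/6
```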
CIDEr (Consensus-based Image Description Evaluation):
- Specifically designed for Image Captioning
- Uses TF-IDF weighted n-gram matching
- Down-weights common phrases (e.g., "a picture of")
METEOR:
- Accounts for synonym matching and stemming
- Captures semantic similarity better than BLEU
- Typically shows higher correlation with human judgments than BLEU
Code Generation Metrics
pass@k:
- Definition: generate \(k\) candidate code samples; the task is considered solved if at least one passes all test cases
- Computation: naively sampling exactly \(k\) outputs gives a high-variance estimate, so pass@k is typically estimated from \(n \geq k\) samples per problem using the unbiased estimator
\[
\text{pass@}k = \mathbb{E}_{\text{problems}}\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right]
\]
where \(n\) is the total number of generated samples and \(c\) is the number that pass the tests.
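A numerically stable way to evaluate this estimator (following the widely used formulation from the Codex/HumanEval paper; numpy is assumed) is sketched below:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n generated samples of which c pass the tests.

    Evaluates 1 - C(n-c, k) / C(n, k) as a running product for numerical stability.
    """
    if n - c < k:
        return 1.0  # too few failing samples to fill k slots, so at least one always passes
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

print(pass_at_k(n=200, c=20, k=1))   # ≈ 0.10 (matches c / n when k = 1)
print(pass_at_k(n=200, c=20, k=10))  # ≈ 0.66
```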
HumanEval:
- A code generation benchmark proposed by OpenAI
- Contains 164 Python programming problems, each with accompanying test cases
- Typically reports pass@1 and pass@10
LLM Comprehensive Benchmarks
MMLU (Massive Multitask Language Understanding):
- Covers 57 subjects (mathematics, history, law, medicine, etc.)
- Multiple-choice format with four options
- Measures a model's breadth of knowledge and reasoning ability
HellaSwag:
- A commonsense reasoning benchmark
- Presents the beginning of a scenario and asks for the most plausible continuation
- Uses Adversarial Filtering to construct high-quality distractors
MT-Bench:
- Multi-turn dialogue evaluation
- Uses GPT-4 as the judge (LLM-as-Judge)
- Evaluates instruction following, reasoning, creative writing, and other capabilities
Chatbot Arena:
- Crowdsourced evaluation based on human preferences
- Users chat with two anonymous models and vote for the better response
- Uses an Elo rating system for ranking
- Widely considered the evaluation method closest to real user experience
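The Elo update behind such pairwise rankings is straightforward; below is a minimal sketch of the classic online update rule (the K-factor of 32 and starting ratings are illustrative, and the actual leaderboard computation may differ):

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie."""
    e_a = expected_score(rating_a, rating_b)
    return rating_a + k * (score_a - e_a), rating_b - k * (score_a - e_a)

# A lower-rated model beating a higher-rated one gains points; the loser gives up the same amount.
print(elo_update(1000, 1100, score_a=1.0))  # ≈ (1020.5, 1079.5)
```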
Online Evaluation
No matter how strong offline metrics look, they cannot guarantee real-world performance. Below are common online evaluation strategies.
A/B Testing
The most classic online evaluation method:
- Randomly split user traffic into a control group (old model) and a treatment group (new model)
- Collect key business metrics for both groups (click-through rate, retention rate, user satisfaction, etc.)
- Use statistical tests (e.g., t-test) to determine whether the difference is significant
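As an illustration, a two-proportion z-test on click-through rates could look like the following sketch (all numbers are made up; scipy is assumed to be available):

```python
import math
from scipy.stats import norm

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - norm.cdf(abs(z)))
    return z, p_value

# Control CTR 5.0% vs. treatment CTR 5.4%, 100k users per arm.
z, p = two_proportion_ztest(5_000, 100_000, 5_400, 100_000)
print(f"z = {z:.2f}, p-value = {p:.5f}")  # p-value < 0.05 suggests a significant difference
```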
Considerations:
- Sufficient sample size is needed to ensure adequate statistical power; underpowered experiments yield noisy estimates and can miss real improvements
- Traffic allocation must be truly random to avoid Selection Bias
- Monitor multiple metrics simultaneously to ensure gains on one metric do not come at the cost of another (metric trade-offs)
Shadow Deployment
Deploy the new model as a "shadow" that receives the same requests as the production model but does not serve its results to users:
- Advantage: zero risk, enabling evaluation on real traffic
- Disadvantage: requires double the compute resources and cannot assess user interaction metrics
Best suited for the validation phase before a new model's first production launch.
Canary Release
A gradual rollout deployment strategy:
- Route 1–5% of traffic to the new model
- Monitor key metrics and confirm no anomalies
- Gradually increase the traffic share (10% → 25% → 50% → 100%)
- Roll back immediately if issues are detected at any stage
Difference from A/B Testing: a canary release aims for safe deployment rather than rigorous statistical comparison.
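As a rough illustration of the routing step, a canary split can hash a stable user ID into buckets so that each user consistently hits the same model version (the model names and the 5% fraction below are hypothetical):

```python
import hashlib

CANARY_FRACTION = 0.05  # current rollout stage: 5% of traffic

def route_request(user_id: str) -> str:
    """Deterministically route a user to the canary or the stable model."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 10_000
    return "model-v2-canary" if bucket < CANARY_FRACTION * 10_000 else "model-v1-stable"

print(route_request("user-42"))  # the same user always gets the same version
```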
Risk Assessment
Red Teaming
Simulating adversarial attacks to uncover security vulnerabilities in the model:
- Manual Red Teaming: security experts manually craft adversarial inputs
- Automated Red Teaming: another LLM automatically generates attack prompts
- Testing dimensions: harmful content generation, information leakage, prompt injection, jailbreak attacks, etc.
Bias & Fairness
Common types of bias:
- Demographic Bias: performance disparities across different genders, races, or age groups
- Representation Bias: certain groups are underrepresented or misrepresented in training data
- Stereotyping: model outputs reinforce social stereotypes
Fairness metrics:
- Demographic Parity: the rate of positive predictions should be similar across groups
- Equal Opportunity: the True Positive Rate should be similar across groups
- Calibration: predicted probabilities should be equally accurate across groups
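These group-level checks take only a few lines; the sketch below assumes binary labels and predictions plus a group identifier per sample (in practice, dedicated libraries such as Fairlearn provide such metrics directly):

```python
import numpy as np

def group_fairness(y_true, y_pred, groups):
    """Per-group positive prediction rate (demographic parity) and TPR (equal opportunity)."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    report = {}
    for g in np.unique(groups):
        mask = groups == g
        actual_pos = y_true[mask] == 1
        report[str(g)] = {
            "positive_rate": float(y_pred[mask].mean()),
            "tpr": float(y_pred[mask][actual_pos].mean()) if actual_pos.any() else float("nan"),
        }
    return report

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(group_fairness(y_true, y_pred, groups))
# Large gaps in positive_rate or tpr between groups point to potential fairness issues.
```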
Safety Benchmarks
- TruthfulQA: tests whether a model generates plausible-sounding but incorrect information
- BBQ (Bias Benchmark for QA): tests for social biases in question answering
- RealToxicityPrompts: tests a model's tendency to generate harmful content
- SafetyBench: a multiple-choice safety benchmark with questions in both Chinese and English
Monitoring
After deployment, continuous monitoring is essential to maintaining service quality.
Model Drift
A decline in model performance over time after deployment. Common forms include:
- Concept Drift: the relationship between inputs and outputs changes (e.g., shifts in user behavior)
- Gradual Drift: slow, continuous change
- Sudden Drift: abrupt change (e.g., policy updates, unexpected events)
Detection methods:
- Periodically evaluate model metrics on fresh data
- Monitor changes in prediction distributions (e.g., KL divergence, PSI)
- Set alert thresholds on performance metrics
Data Drift
A shift in the input data distribution that may not immediately affect model performance:
- Covariate Shift: change in the distribution of input features
- Prior Probability Shift: change in the distribution of labels
Detection methods:
- Statistical tests: KS Test (Kolmogorov-Smirnov Test), Chi-square Test
- PSI (Population Stability Index): quantifies the difference between two distributions
- PSI < 0.1: no significant change
- 0.1 < PSI < 0.25: moderate change, warrants attention
- PSI > 0.25: significant change, requires investigation
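The sketch below computes PSI over fixed bins and runs a KS test with scipy (bin count, simulated data, and thresholds are illustrative, not a production implementation):

```python
import numpy as np
from scipy.stats import ks_2samp

def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a reference sample and a new sample."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected) + eps
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual) + eps
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, 10_000)  # feature distribution at training time
current = rng.normal(0.3, 1.0, 10_000)    # shifted production distribution

print(f"PSI = {psi(reference, current):.3f}")  # compare against the 0.1 / 0.25 rule of thumb above
print(f"KS p-value = {ks_2samp(reference, current).pvalue:.2e}")  # small p-value: distributions differ
```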
Latency / Throughput Monitoring
For online services, performance metrics are just as important as model quality metrics:
| Metric | Description | Typical Alert Threshold |
|---|---|---|
| P50 Latency | Median latency | Depends on the use case |
| P99 Latency | Tail latency | No more than 2x the SLA |
| Throughput (QPS) | Queries processed per second | Below 80% of expected capacity |
| Error Rate | Proportion of failed requests | > 1% |
| GPU Utilization | GPU usage rate | Sustained > 95% or < 30% |
| TTFT | Time To First Token, a key latency metric for LLM services | Depends on the use case |
| TPS | Tokens Per Second (LLM generation speed) | Depends on the model and hardware |
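Percentile latencies are easy to compute from raw request timings; here is a minimal numpy sketch (the SLA value and simulated data are illustrative):

```python
import numpy as np

def latency_report(latencies_ms, p99_sla_ms=500.0):
    """Summarize request latencies and flag a P99 breach against an illustrative SLA."""
    p50, p99 = np.percentile(np.asarray(latencies_ms), [50, 99])
    return {
        "p50_ms": round(float(p50), 1),
        "p99_ms": round(float(p99), 1),
        "p99_breach": bool(p99 > 2 * p99_sla_ms),  # the "no more than 2x the SLA" rule from the table
    }

# Simulated latencies: mostly fast requests with a heavy tail.
rng = np.random.default_rng(1)
latencies = rng.lognormal(mean=5.0, sigma=0.6, size=10_000)  # median around 150 ms
print(latency_report(latencies))
```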
Alert System
A robust alert system should include:
- Multi-level alerts: Info → Warning → Critical → Emergency
- Alert aggregation: prevent alert storms (massive duplicate alerts for the same issue)
- Alert routing: route different alert types to different teams
- Automated response: e.g., automatic traffic switching, automatic model rollback
Common tool stack:
- Metrics collection: Prometheus
- Visualization: Grafana
- Alert management: PagerDuty, OpsGenie
- Log analysis: ELK Stack (Elasticsearch + Logstash + Kibana)
References
- Guo et al., "Evaluating Large Language Models: A Comprehensive Survey", 2023
- Liang et al., "Holistic Evaluation of Language Models (HELM)", 2022
- Chatbot Arena Leaderboard