Evaluation & Monitoring
Evaluation is integral to every stage of a model's lifecycle, from training to deployment. Offline evaluation determines whether a model is ready to go live, online evaluation verifies that its real-world performance meets expectations, and monitoring ensures the model continues to operate reliably in production.
Offline Evaluation
Classification Metrics
The Confusion Matrix is the foundation of all classification metrics:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | TP (True Positive) | FN (False Negative) |
| Actual Negative | FP (False Positive) | TN (True Negative) |
Common metrics:
- Accuracy = \(\frac{TP + TN}{TP + TN + FP + FN}\), suitable for class-balanced scenarios
- Precision = \(\frac{TP}{TP + FP}\), answers "of all positive predictions, how many are truly positive"
- Recall = \(\frac{TP}{TP + FN}\), answers "of all actual positives, how many were identified"
- F1 Score = \(\frac{2 \times P \times R}{P + R}\), the harmonic mean of Precision and Recall
- AUC-ROC: area under the ROC curve, measuring the model's ability to distinguish between positive and negative samples across different thresholds; range [0, 1], where 0.5 indicates random guessing
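To make these definitions concrete, here is a minimal sketch (plain Python; the counts are made up for illustration) that computes the first four metrics directly from confusion-matrix counts:

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute common classification metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Imbalanced example: accuracy looks high even though recall is poor.
print(classification_metrics(tp=10, fp=5, fn=40, tn=945))
# {'accuracy': 0.955, 'precision': 0.667, 'recall': 0.2, 'f1': 0.308} (approximately)
```

Note how accuracy looks strong on this imbalanced example even though recall is only 0.2, which is exactly the failure mode the selection guidelines below warn about.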
Selection guidelines:
- When classes are imbalanced, Accuracy can be misleading — use F1 or AUC-ROC instead
- In scenarios where missed detections are costly (e.g., medical diagnosis), prioritize Recall
- In scenarios where false alarms are costly (e.g., spam filtering), prioritize Precision
Text Generation Metrics
BLEU (Bilingual Evaluation Understudy):
- Core idea: measures n-gram overlap between generated text and reference text
- Commonly used for machine translation evaluation
- BLEU-4 considers 1-gram through 4-gram matches simultaneously
- Limitation: only checks exact word matches without understanding semantics
ROUGE (Recall-Oriented Understudy for Gisting Evaluation):
- ROUGE-N: recall based on n-gram overlap
- ROUGE-L: based on the Longest Common Subsequence (LCS)
- Commonly used for text summarization evaluation
- Key difference from BLEU: BLEU emphasizes Precision, while ROUGE emphasizes Recall
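The core ideas behind BLEU and ROUGE-L are easy to show in code. Below is a simplified sketch (illustrative only; it omits BLEU's brevity penalty, multi-reference handling, and the smoothing used by production implementations):

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Modified n-gram precision (BLEU's building block): clip candidate counts by reference counts."""
    cand_ngrams = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref_ngrams = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
    return matched / total

def lcs_length(a, b):
    """Longest common subsequence length, the core of ROUGE-L."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()
print(ngram_precision(candidate, reference, n=1))         # 1-gram precision: 5/6
print(lcs_length(candidate, reference) / len(reference))  # ROUGE-L recall: 5/6
```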
CIDEr (Consensus-based Image Description Evaluation):
- Specifically designed for Image Captioning
- Uses TF-IDF weighted n-gram matching
- Down-weights common phrases (e.g., "a picture of")
METEOR:
- Accounts for synonym matching and stemming
- Captures semantic similarity better than BLEU
- Typically shows higher correlation with human judgments than BLEU
Code Generation Metrics
pass@k:
- Definition: generate \(k\) candidate code samples; the task is considered solved if at least one passes all test cases
- Computation: naively sampling exactly \(k\) outputs gives a high-variance estimate, so pass@k is typically estimated from \(n \geq k\) samples per problem using the unbiased estimator
\[
\text{pass@}k = \mathbb{E}_{\text{problems}}\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right]
\]
where \(n\) is the total number of generated samples and \(c\) is the number that pass the tests.
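A numerically stable way to evaluate this estimator (following the widely used formulation from the Codex/HumanEval paper; numpy is assumed) is sketched below:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n generated samples of which c pass the tests.

    Evaluates 1 - C(n-c, k) / C(n, k) as a running product for numerical stability.
    """
    if n - c < k:
        return 1.0  # too few failing samples to fill k slots, so at least one always passes
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

print(pass_at_k(n=200, c=20, k=1))   # ≈ 0.10 (matches c / n when k = 1)
print(pass_at_k(n=200, c=20, k=10))  # ≈ 0.66
```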
HumanEval:
- A code generation benchmark proposed by OpenAI
- Contains 164 Python programming problems, each with accompanying test cases
- Typically reports pass@1 and pass@10
LLM Comprehensive Benchmarks
MMLU (Massive Multitask Language Understanding):
- Covers 57 subjects (mathematics, history, law, medicine, etc.)
- Multiple-choice format with four options
- Measures a model's breadth of knowledge and reasoning ability
HellaSwag:
- A commonsense reasoning benchmark
- Presents the beginning of a scenario and asks for the most plausible continuation
- Uses Adversarial Filtering to construct high-quality distractors
MT-Bench:
- Multi-turn dialogue evaluation
- Uses GPT-4 as the judge (LLM-as-Judge)
- Evaluates instruction following, reasoning, creative writing, and other capabilities
Chatbot Arena:
- Crowdsourced evaluation based on human preferences
- Users chat with two anonymous models and vote for the better response
- Uses an Elo rating system for ranking
- Widely considered the evaluation method closest to real user experience
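The Elo update behind such pairwise rankings is straightforward; below is a minimal sketch of the classic online update rule (the K-factor of 32 and starting ratings are illustrative, and the actual leaderboard computation may differ):

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie."""
    e_a = expected_score(rating_a, rating_b)
    return rating_a + k * (score_a - e_a), rating_b - k * (score_a - e_a)

# A lower-rated model beating a higher-rated one gains points; the loser gives up the same amount.
print(elo_update(1000, 1100, score_a=1.0))  # ≈ (1020.5, 1079.5)
```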
Online Evaluation
No matter how strong offline metrics look, they cannot guarantee real-world performance. Below are common online evaluation strategies.
A/B Testing
The most classic online evaluation method:
- Randomly split user traffic into a control group (old model) and a treatment group (new model)
- Collect key business metrics for both groups (click-through rate, retention rate, user satisfaction, etc.)
- Use statistical tests (e.g., t-test) to determine whether the difference is significant
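As an illustration, a two-proportion z-test on click-through rates could look like the following sketch (all numbers are made up; scipy is assumed to be available):

```python
import math
from scipy.stats import norm

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - norm.cdf(abs(z)))
    return z, p_value

# Control CTR 5.0% vs. treatment CTR 5.4%, 100k users per arm.
z, p = two_proportion_ztest(5_000, 100_000, 5_400, 100_000)
print(f"z = {z:.2f}, p-value = {p:.5f}")  # p-value < 0.05 suggests a significant difference
```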
Considerations:
- Sufficient sample size is needed to ensure adequate statistical power; underpowered experiments yield noisy estimates and can miss real improvements
- Traffic allocation must be truly random to avoid Selection Bias
- Monitor multiple metrics simultaneously to ensure gains on one metric do not come at the cost of another (metric trade-offs)
Shadow Deployment
Deploy the new model as a "shadow" that receives the same requests as the production model but does not serve its results to users:
- Advantage: zero risk, enabling evaluation on real traffic
- Disadvantage: requires double the compute resources and cannot assess user interaction metrics
Best suited for the validation phase before a new model's first production launch.
Canary Release
A gradual rollout deployment strategy:
- Route 1–5% of traffic to the new model
- Monitor key metrics and confirm no anomalies
- Gradually increase the traffic share (10% → 25% → 50% → 100%)
- Roll back immediately if issues are detected at any stage
Difference from A/B Testing: a canary release aims for safe deployment rather than rigorous statistical comparison.
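As a rough illustration of the routing step, a canary split can hash a stable user ID into buckets so that each user consistently hits the same model version (the model names and the 5% fraction below are hypothetical):

```python
import hashlib

CANARY_FRACTION = 0.05  # current rollout stage: 5% of traffic

def route_request(user_id: str) -> str:
    """Deterministically route a user to the canary or the stable model."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 10_000
    return "model-v2-canary" if bucket < CANARY_FRACTION * 10_000 else "model-v1-stable"

print(route_request("user-42"))  # the same user always gets the same version
```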
Risk Assessment
Red Teaming
Simulating adversarial attacks to uncover security vulnerabilities in the model:
- Manual Red Teaming: security experts manually craft adversarial inputs
- Automated Red Teaming: another LLM automatically generates attack prompts
- Testing dimensions: harmful content generation, information leakage, prompt injection, jailbreak attacks, etc.
Bias & Fairness
Common types of bias:
- Demographic Bias: performance disparities across different genders, races, or age groups
- Representation Bias: certain groups are underrepresented or misrepresented in training data
- Stereotyping: model outputs reinforce social stereotypes
Fairness metrics:
- Demographic Parity: the rate of positive predictions should be similar across groups
- Equal Opportunity: the True Positive Rate should be similar across groups
- Calibration: predicted probabilities should be equally accurate across groups
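These group-level checks take only a few lines; the sketch below assumes binary labels and predictions plus a group identifier per sample (in practice, dedicated libraries such as Fairlearn provide such metrics directly):

```python
import numpy as np

def group_fairness(y_true, y_pred, groups):
    """Per-group positive prediction rate (demographic parity) and TPR (equal opportunity)."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    report = {}
    for g in np.unique(groups):
        mask = groups == g
        actual_pos = y_true[mask] == 1
        report[str(g)] = {
            "positive_rate": float(y_pred[mask].mean()),
            "tpr": float(y_pred[mask][actual_pos].mean()) if actual_pos.any() else float("nan"),
        }
    return report

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(group_fairness(y_true, y_pred, groups))
# Large gaps in positive_rate or tpr between groups point to potential fairness issues.
```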
Safety Benchmarks
- TruthfulQA: tests whether a model generates plausible-sounding but incorrect information
- BBQ (Bias Benchmark for QA): tests for social biases in question answering
- RealToxicityPrompts: tests a model's tendency to generate harmful content
- SafetyBench: a multiple-choice safety benchmark with questions in both Chinese and English
Monitoring
After deployment, continuous monitoring is essential to maintaining service quality.
Model Drift
A decline in model performance over time after deployment. Common forms include:
- Concept Drift: the relationship between inputs and outputs changes (e.g., shifts in user behavior)
- Gradual Drift: slow, continuous change
- Sudden Drift: abrupt change (e.g., policy updates, unexpected events)
Detection methods:
- Periodically evaluate model metrics on fresh data
- Monitor changes in prediction distributions (e.g., KL divergence, PSI)
- Set alert thresholds on performance metrics
Data Drift
A shift in the input data distribution that may not immediately affect model performance:
- Covariate Shift: change in the distribution of input features
- Prior Probability Shift: change in the distribution of labels
Detection methods:
- Statistical tests: KS Test (Kolmogorov-Smirnov Test), Chi-square Test
- PSI (Population Stability Index): quantifies the difference between two distributions
- PSI < 0.1: no significant change
- 0.1 < PSI < 0.25: moderate change, warrants attention
- PSI > 0.25: significant change, requires investigation
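The sketch below computes PSI over fixed bins and runs a KS test with scipy (bin count, simulated data, and thresholds are illustrative, not a production implementation):

```python
import numpy as np
from scipy.stats import ks_2samp

def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a reference sample and a new sample."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected) + eps
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual) + eps
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, 10_000)  # feature distribution at training time
current = rng.normal(0.3, 1.0, 10_000)    # shifted production distribution

print(f"PSI = {psi(reference, current):.3f}")  # compare against the 0.1 / 0.25 rule of thumb above
print(f"KS p-value = {ks_2samp(reference, current).pvalue:.2e}")  # small p-value: distributions differ
```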
Latency / Throughput Monitoring
For online services, performance metrics are just as important as model quality metrics:
| Metric | Description | Typical Alert Threshold |
|---|---|---|
| P50 Latency | Median latency | Depends on the use case |
| P99 Latency | Tail latency | No more than 2x the SLA |
| Throughput (QPS) | Queries processed per second | Below 80% of expected capacity |
| Error Rate | Proportion of failed requests | > 1% |
| GPU Utilization | GPU usage rate | Sustained > 95% or < 30% |
| TTFT | Time To First Token, a key latency metric for LLM services | Depends on the use case |
| TPS | Tokens Per Second (LLM generation speed) | Depends on the model and hardware |
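Percentile latencies are easy to compute from raw request timings; here is a minimal numpy sketch (the SLA value and simulated data are illustrative):

```python
import numpy as np

def latency_report(latencies_ms, p99_sla_ms=500.0):
    """Summarize request latencies and flag a P99 breach against an illustrative SLA."""
    p50, p99 = np.percentile(np.asarray(latencies_ms), [50, 99])
    return {
        "p50_ms": round(float(p50), 1),
        "p99_ms": round(float(p99), 1),
        "p99_breach": bool(p99 > 2 * p99_sla_ms),  # the "no more than 2x the SLA" rule from the table
    }

# Simulated latencies: mostly fast requests with a heavy tail.
rng = np.random.default_rng(1)
latencies = rng.lognormal(mean=5.0, sigma=0.6, size=10_000)  # median around 150 ms
print(latency_report(latencies))
```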
Alert System
A robust alert system should include:
- Multi-level alerts: Info → Warning → Critical → Emergency
- Alert aggregation: prevent alert storms (massive duplicate alerts for the same issue)
- Alert routing: route different alert types to different teams
- Automated response: e.g., automatic traffic switching, automatic model rollback
Common tool stack:
- Metrics collection: Prometheus
- Visualization: Grafana
- Alert management: PagerDuty, OpsGenie
- Log analysis: ELK Stack (Elasticsearch + Logstash + Kibana)
References
- Guo et al., "Evaluating Large Language Models: A Comprehensive Survey", 2023
- Liang et al., "Holistic Evaluation of Language Models (HELM)", 2022
- Chatbot Arena Leaderboard