
Evaluation & Monitoring

Evaluation is integral to every stage of a model's lifecycle, from training to deployment. Offline evaluation determines whether a model is ready to go live, online evaluation verifies that its real-world performance meets expectations, and monitoring ensures the model continues to operate reliably in production.


Offline Evaluation

Classification Metrics

The Confusion Matrix is the foundation of all classification metrics:

                    Predicted Positive      Predicted Negative
Actual Positive     TP (True Positive)      FN (False Negative)
Actual Negative     FP (False Positive)     TN (True Negative)

Common metrics:

  • Accuracy = \(\frac{TP + TN}{TP + TN + FP + FN}\), suitable for class-balanced scenarios
  • Precision = \(\frac{TP}{TP + FP}\), answers "of all positive predictions, how many are truly positive"
  • Recall = \(\frac{TP}{TP + FN}\), answers "of all actual positives, how many were identified"
  • F1 Score = \(\frac{2 \times P \times R}{P + R}\), the harmonic mean of Precision and Recall
  • AUC-ROC: area under the ROC curve, measuring the model's ability to distinguish between positive and negative samples across different thresholds; range [0, 1], where 0.5 indicates random guessing

Selection guidelines:

  • When classes are imbalanced, Accuracy can be misleading — use F1 or AUC-ROC instead
  • In scenarios where missed detections are costly (e.g., medical diagnosis), prioritize Recall
  • In scenarios where false alarms are costly (e.g., spam filtering), prioritize Precision
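As a quick illustration of how these formulas behave, here is a minimal Python sketch that computes them from raw confusion-matrix counts; the counts in the example call are made up to show the imbalanced-class pitfall mentioned above:

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute Accuracy, Precision, Recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Example: heavily imbalanced data where accuracy looks fine (95.5%) but recall is poor (20%).
print(classification_metrics(tp=10, fp=5, fn=40, tn=945))
```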

Text Generation Metrics

BLEU (Bilingual Evaluation Understudy):

  • Core idea: measures n-gram overlap between generated text and reference text
  • Commonly used for machine translation evaluation
  • BLEU-4 considers 1-gram through 4-gram matches simultaneously
  • Limitation: only checks exact word matches without understanding semantics

ROUGE (Recall-Oriented Understudy for Gisting Evaluation):

  • ROUGE-N: recall based on n-gram overlap
  • ROUGE-L: based on the Longest Common Subsequence (LCS)
  • Commonly used for text summarization evaluation
  • Key difference from BLEU: BLEU emphasizes Precision, while ROUGE emphasizes Recall
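To make the precision/recall contrast concrete, here is a toy Python sketch of clipped unigram overlap; it deliberately ignores BLEU's brevity penalty, higher-order n-grams, smoothing, and multiple references:

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_overlap(candidate, reference, n=1):
    """Count candidate n-grams that also appear in the reference (clipped by reference counts)."""
    cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
    return sum(min(count, ref[g]) for g, count in cand.items())

cand = "the cat sat on the mat".split()
ref = "there is a cat on the mat".split()

matches = clipped_overlap(cand, ref, n=1)
bleu1_precision = matches / len(ngrams(cand, 1))  # BLEU-style: divide by candidate length
rouge1_recall = matches / len(ngrams(ref, 1))     # ROUGE-style: divide by reference length
print(bleu1_precision, rouge1_recall)             # 0.667 vs 0.571
```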

CIDEr (Consensus-based Image Description Evaluation):

  • Specifically designed for Image Captioning
  • Uses TF-IDF weighted n-gram matching
  • Down-weights common phrases (e.g., "a picture of")

METEOR:

  • Accounts for synonym matching and stemming
  • Captures semantic similarity better than BLEU
  • Typically shows higher correlation with human judgments than BLEU

Code Generation Metrics

pass@k:

  • Definition: generate \(k\) candidate code samples; the task is considered solved if at least one passes all test cases
  • Computation: to reduce variance, generate \(n \ge k\) samples per problem and use the unbiased estimator
\[ \text{pass@k} = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \]

where \(n\) is the total number of generated samples and \(c\) is the number that pass the tests.
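A minimal Python implementation of this per-problem estimator, written in the numerically stable product form of the binomial ratio (the sample counts in the example are illustrative):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem, following the formula above.
    n: samples generated, c: samples passing all tests, k: evaluation budget."""
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one passing sample
    # Numerically stable expansion of 1 - C(n-c, k) / C(n, k)
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# e.g. 200 samples per problem, 30 of which pass; report the mean over all problems
print(pass_at_k(n=200, c=30, k=1), pass_at_k(n=200, c=30, k=10))
```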

HumanEval:

  • A code generation benchmark proposed by OpenAI
  • Contains 164 Python programming problems, each with accompanying test cases
  • Typically reports pass@1 and pass@10

LLM Comprehensive Benchmarks

MMLU (Massive Multitask Language Understanding):

  • Covers 57 subjects (mathematics, history, law, medicine, etc.)
  • Multiple-choice format with four options
  • Measures a model's breadth of knowledge and reasoning ability

HellaSwag:

  • A commonsense reasoning benchmark
  • Presents the beginning of a scenario and asks for the most plausible continuation
  • Uses Adversarial Filtering to construct high-quality distractors

MT-Bench:

  • Multi-turn dialogue evaluation
  • Uses GPT-4 as the judge (LLM-as-Judge)
  • Evaluates instruction following, reasoning, creative writing, and other capabilities

Chatbot Arena:

  • Crowdsourced evaluation based on human preferences
  • Users chat with two anonymous models and vote for the better response
  • Uses an Elo rating system for ranking
  • Widely considered the evaluation method closest to real user experience
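As a sketch of how pairwise votes turn into a ranking, here is the classic online Elo update; the live leaderboard relies on more sophisticated statistical estimation, but the intuition is the same (the starting ratings and K-factor below are illustrative choices):

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One Elo update from a single pairwise comparison.
    score_a: 1.0 if model A wins, 0.0 if it loses, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

# A 1200-rated model beating a 1300-rated one gains more points than the reverse case.
print(elo_update(1200, 1300, score_a=1.0))
```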

Online Evaluation

No matter how strong offline metrics look, they cannot guarantee real-world performance. Below are common online evaluation strategies.

A/B Testing

The most classic online evaluation method:

  1. Randomly split user traffic into a control group (old model) and a treatment group (new model)
  2. Collect key business metrics for both groups (click-through rate, retention rate, user satisfaction, etc.)
  3. Use statistical tests (e.g., t-test) to determine whether the difference is significant
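As a sketch of step 3 for a binary metric such as click-through rate, a two-proportion z-test can be used; the helper function and traffic numbers below are illustrative:

```python
from math import sqrt
from scipy.stats import norm

def two_proportion_ztest(clicks_a, n_a, clicks_b, n_b):
    """Two-sided z-test for a difference in click-through rate between two groups."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)            # pooled CTR under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))  # standard error of the difference
    z = (p_b - p_a) / se
    p_value = 2 * norm.sf(abs(z))
    return z, p_value

# Treatment CTR 5.4% vs control 5.0% over 100k users each.
z, p = two_proportion_ztest(clicks_a=5000, n_a=100_000, clicks_b=5400, n_b=100_000)
print(f"z={z:.2f}, p={p:.4f}")  # reject H0 at the 0.05 level if p < 0.05
```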

Considerations:

  • Sufficient sample size is needed so the test has adequate statistical power; underpowered experiments yield noisy, unreliable conclusions
  • Traffic allocation must be truly random to avoid Selection Bias
  • Monitor multiple metrics simultaneously to ensure gains on one metric do not come at the cost of another (metric trade-offs)

Shadow Deployment

Deploy the new model as a "shadow" that receives the same requests as the production model but does not serve its results to users:

  • Advantage: zero risk, enabling evaluation on real traffic
  • Disadvantage: requires double the compute resources and cannot assess user interaction metrics

Best suited for the validation phase before a new model's first production launch.

Canary Release

A gradual rollout deployment strategy:

  1. Route 1–5% of traffic to the new model
  2. Monitor key metrics and confirm no anomalies
  3. Gradually increase the traffic share (10% → 25% → 50% → 100%)
  4. Roll back immediately if issues are detected at any stage

Difference from A/B Testing: a canary release aims for safe deployment rather than rigorous statistical comparison.
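A minimal sketch of deterministic traffic splitting for such a rollout, assuming a stable per-user identifier is available (the routing function and percentages are illustrative):

```python
import hashlib

def route_to_canary(user_id: str, canary_pct: float) -> bool:
    """Deterministically bucket a user: the same user always hits the same model,
    and roughly canary_pct of users hit the canary."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # map hash to a uniform value in [0, 1]
    return bucket < canary_pct

# Ramp up by changing canary_pct: 0.01 -> 0.05 -> 0.25 -> 0.50 -> 1.0
print(route_to_canary("user-42", canary_pct=0.05))
```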


Risk Assessment

Red Teaming

Simulating adversarial attacks to uncover security vulnerabilities in the model:

  • Manual Red Teaming: security experts manually craft adversarial inputs
  • Automated Red Teaming: another LLM automatically generates attack prompts
  • Testing dimensions: harmful content generation, information leakage, prompt injection, jailbreak attacks, etc.

For more details, see the dedicated notes on Red Teaming and LLM Jailbreaking

Bias & Fairness

Common types of bias:

  • Demographic Bias: performance disparities across different genders, races, or age groups
  • Representation Bias: certain groups are underrepresented or misrepresented in training data
  • Stereotyping: model outputs reinforce social stereotypes

Fairness metrics:

  • Demographic Parity: the rate of positive predictions should be similar across groups
  • Equal Opportunity: the True Positive Rate should be similar across groups
  • Calibration: predicted probabilities should be equally accurate across groups
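A small Python sketch of the first two metrics, computed as absolute gaps between two groups; the arrays and group labels below are made up for illustration:

```python
import numpy as np

def fairness_gaps(y_true, y_pred, group):
    """Demographic-parity and equal-opportunity gaps between two groups labeled 0 and 1."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    rates, tprs = {}, {}
    for g in (0, 1):
        mask = group == g
        rates[g] = y_pred[mask].mean()                               # P(pred = 1 | group)
        pos = mask & (y_true == 1)
        tprs[g] = y_pred[pos].mean() if pos.any() else float("nan")  # TPR within the group
    return {
        "demographic_parity_gap": abs(rates[0] - rates[1]),
        "equal_opportunity_gap": abs(tprs[0] - tprs[1]),
    }

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
group  = [0, 0, 0, 0, 1, 1, 1, 1]
print(fairness_gaps(y_true, y_pred, group))
```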

Safety Benchmarks

  • TruthfulQA: tests whether a model generates plausible-sounding but incorrect information
  • BBQ (Bias Benchmark for QA): tests for social biases in question answering
  • RealToxicityPrompts: tests a model's tendency to generate harmful content
  • SafetyBench: a multiple-choice safety evaluation benchmark with both Chinese and English questions

Monitoring

After deployment, continuous monitoring is essential to maintaining service quality.

Model Drift

A decline in model performance over time after deployment, commonly caused by:

  • Concept Drift: the relationship between inputs and outputs changes (e.g., shifts in user behavior)
  • Gradual Drift: slow, continuous change
  • Sudden Drift: abrupt change (e.g., policy updates, unexpected events)

Detection methods:

  • Periodically evaluate model metrics on fresh data
  • Monitor changes in prediction distributions (e.g., KL divergence, PSI)
  • Set alert thresholds on performance metrics

Data Drift

A shift in the input data distribution that may not immediately affect model performance:

  • Covariate Shift: change in the distribution of input features
  • Prior Probability Shift: change in the distribution of labels

Detection methods:

  • Statistical tests: KS Test (Kolmogorov-Smirnov Test), Chi-square Test
  • PSI (Population Stability Index): quantifies the difference between two distributions
    • PSI < 0.1: no significant change
    • 0.1 < PSI < 0.25: moderate change, warrants attention
    • PSI > 0.25: significant change, requires investigation
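A sketch of both detection approaches on a one-dimensional feature or score distribution, assuming equal-width bins derived from the reference sample (the bin count and the synthetic data are illustrative):

```python
import numpy as np
from scipy.stats import ks_2samp

def psi(expected, actual, bins=10):
    """Population Stability Index between a reference sample and a recent sample."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    e_pct, a_pct = np.clip(e_pct, 1e-6, None), np.clip(a_pct, 1e-6, None)  # avoid log(0)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
train_scores = rng.normal(0.0, 1.0, 10_000)  # reference (training-time) distribution
live_scores = rng.normal(0.5, 1.0, 10_000)   # shifted production distribution

print("PSI:", psi(train_scores, live_scores))              # compare against the 0.1 / 0.25 thresholds above
print("KS p-value:", ks_2samp(train_scores, live_scores).pvalue)  # small p-value -> distributions differ
```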

Latency / Throughput Monitoring

For online services, performance metrics are just as important as model quality metrics:

Metric              Description                      Typical Alert Threshold
P50 Latency         Median latency                   Depends on the use case
P99 Latency         Tail latency                     No more than 2x the SLA
Throughput (QPS)    Queries processed per second     Below 80% of expected capacity
Error Rate          Proportion of failed requests    > 1%
GPU Utilization     GPU usage rate                   Sustained > 95% or < 30%
TTFT                Time To First Token              Critical metric for LLM services
TPS                 Tokens Per Second                LLM generation speed
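As a toy sketch, several of these metrics can be computed directly from a per-request log; the latency values and the one-second observation window below are made-up assumptions:

```python
import numpy as np

# Hypothetical per-request log: latency in milliseconds and a success flag.
latencies_ms = np.array([120, 95, 110, 430, 105, 98, 2100, 115, 102, 99])
succeeded = np.array([1, 1, 1, 1, 1, 1, 0, 1, 1, 1])

p50, p99 = np.percentile(latencies_ms, [50, 99])
error_rate = 1.0 - succeeded.mean()
qps = len(latencies_ms) / 1.0  # requests observed in an assumed 1-second window

print(f"P50={p50:.0f}ms  P99={p99:.0f}ms  error_rate={error_rate:.1%}  QPS={qps:.0f}")
```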

Alert System

A robust alert system should include:

  • Multi-level alerts: Info → Warning → Critical → Emergency
  • Alert aggregation: prevent alert storms (massive duplicate alerts for the same issue)
  • Alert routing: route different alert types to different teams
  • Automated response: e.g., automatic traffic switching, automatic model rollback

Common tool stack:

  • Metrics collection: Prometheus
  • Visualization: Grafana
  • Alert management: PagerDuty, OpsGenie
  • Log analysis: ELK Stack (Elasticsearch + Logstash + Kibana)

References

  • Guo et al., "Evaluating Large Language Models: A Comprehensive Survey", 2023
  • Liang et al., "Holistic Evaluation of Language Models (HELM)", 2022
  • Chatbot Arena Leaderboard
