LLM Evaluation
1. LLM Benchmarks
1.1 Knowledge and Reasoning Benchmarks
| Benchmark | Evaluates | Data Scale | Evaluation Method |
|---|---|---|---|
| MMLU | Multiple-choice knowledge across 57 subjects | ~16K questions | Accuracy |
| MMLU-Pro | Upgraded MMLU, 10-choice, harder | ~12K questions | Accuracy |
| ARC | Elementary science reasoning | ~7.8K questions | Accuracy |
| HellaSwag | Common sense reasoning (sentence completion) | ~10K questions | Accuracy |
| Winogrande | Common sense reasoning (pronoun resolution) | ~1.3K questions | Accuracy |
| TruthfulQA | Truthfulness evaluation (anti-hallucination) | ~817 questions | Truthfulness + informativeness |
1.2 Math and Coding Benchmarks
| Benchmark | Evaluates | Data Scale | Evaluation Method |
|---|---|---|---|
| GSM8K | Grade school math word problems | 8.5K problems | Accuracy |
| MATH | High school/competition math | 12.5K problems | Accuracy |
| HumanEval | Python code generation | 164 problems | Pass@K |
| MBPP | Python code generation (simple) | 974 problems | Pass@K |
| SWE-bench | Real software engineering tasks | 2294 tasks | Resolution rate |
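HumanEval and MBPP report pass@k: the probability that at least one of k sampled completions passes all unit tests. Estimating it naively from exactly k samples is high-variance, so Chen et al. (2021) compute an unbiased estimator from n ≥ k samples per problem; a minimal sketch:
```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al. (2021).

    n: samples generated per problem, c: samples that passed all tests.
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

print(pass_at_k(n=200, c=30, k=10))  # e.g., 200 samples per problem, 30 passing
```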
1.3 Comprehensive Evaluation Platforms
HELM (Holistic Evaluation of Language Models)
Evaluation dimensions:
- Accuracy
- Calibration
- Robustness
- Fairness
- Bias
- Toxicity
- Efficiency
Coverage:
- 42 scenarios including QA, summarization, translation, classification
- 59 metrics
Chatbot Arena
Evaluation method:
- Users chat with two anonymous models simultaneously
- Users vote for the better response
- Elo rating system for ranking (see the update sketch below)
Advantages:
- Reflects real user preferences
- Continuously updated rankings
- Covers open-ended conversational ability
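The Elo ranking mentioned above updates ratings after every vote; a minimal sketch of the standard update rule (the K-factor of 32 and the starting ratings are illustrative defaults, not Chatbot Arena's exact configuration):
```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Standard Elo update after one comparison.

    score_a: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    """
    expected_a = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))
    new_a = r_a + k * (score_a - expected_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

print(elo_update(1100.0, 1000.0, score_a=1.0))  # the favorite wins: small rating gain
```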
1.4 Limitations of Benchmarks
- Data contamination: Test items can leak into training data and inflate scores (a rough overlap check is sketched after this list)
- Goodhart's Law: Optimizing for a benchmark is not the same as genuine capability improvement
- Insufficient coverage: Benchmarks may not reflect actual application needs
- Static nature: Benchmarks are fixed snapshots and do not keep pace with evolving models and use cases
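A rough first-pass check for contamination is n-gram overlap between benchmark items and training documents (the GPT-3 paper, for instance, used 13-gram overlap); a minimal sketch, with the n-gram length and the in-memory corpus interface as illustrative simplifications:
```python
def ngrams(text: str, n: int = 13) -> set:
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def looks_contaminated(test_item: str, training_docs: list, n: int = 13) -> bool:
    """Flag a test item whose n-grams also appear in any training document."""
    test_grams = ngrams(test_item, n)
    return any(test_grams & ngrams(doc, n) for doc in training_docs)
```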
2. Evaluation Methods
2.1 Reference-Based Evaluation
BLEU
```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "sat", "on", "the", "mat"]]  # list of tokenized references
candidate = ["the", "cat", "is", "on", "the", "mat"]     # one tokenized hypothesis

# Smoothing avoids a hard zero when a higher-order n-gram has no match (common for short sentences)
score = sentence_bleu(reference, candidate, smoothing_function=SmoothingFunction().method1)
```
- Based on n-gram precision (typically up to 4-grams) with a brevity penalty
- Suitable for: Translation evaluation
- Limitation: Limited correlation with human judgment, especially at the sentence level
ROUGE
```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
# score(target, prediction): the reference string comes first
scores = scorer.score(
    "The cat sat on the mat",  # reference
    "The cat is on the mat",   # candidate
)
```
- ROUGE-1: Unigram overlap
- ROUGE-2: Bigram overlap
- ROUGE-L: Longest common subsequence
- Suitable for: Summarization evaluation (recall-oriented)
BERTScore
```python
from bert_score import score

# Returns precision, recall, and F1 tensors, one entry per candidate/reference pair
P, R, F1 = score(
    cands=["The cat is on the mat"],
    refs=["The cat sat on the mat"],
    lang="en",
)
print(f"BERTScore F1: {F1.item():.3f}")
```
- Semantic similarity based on BERT embeddings (token-level cosine matching)
- Captures semantic equivalence (e.g., paraphrases) better than BLEU/ROUGE
2.2 Reference-Free Evaluation
LLM-as-Judge
Using a strong model (e.g., GPT-4) to evaluate other models' outputs:
```python
# Literal braces in the JSON template are doubled so that str.format()
# leaves them intact and only fills {question} and {answer}.
judge_prompt = """
Please evaluate the quality of the following AI assistant's response.

User question: {question}
AI response: {answer}

Please score on the following dimensions (1-5):
1. Relevance: Does the response address the question?
2. Accuracy: Is the information accurate?
3. Completeness: Does it cover all aspects of the question?
4. Clarity: Is the expression clear and easy to understand?
5. Helpfulness: Is it practically helpful to the user?

Output in JSON format:
{{
    "relevance": <score>,
    "accuracy": <score>,
    "completeness": <score>,
    "clarity": <score>,
    "helpfulness": <score>,
    "overall": <score>,
    "reasoning": "<scoring rationale>"
}}
"""
```
LLM-as-Judge biases:
- Position bias: Tends to favor the first response
- Verbosity bias: Tends to prefer longer responses
- Self-preference: Models tend to prefer their own outputs
- Mitigation: Randomize response order and average over multiple evaluations (both sketched below)
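For the averaging part, parse the judge's JSON output and take the mean over several runs; a minimal sketch, assuming a hypothetical call_judge(prompt) helper that returns the judge model's raw text:
```python
import json
import statistics

def judge_scores(question: str, answer: str, runs: int = 3) -> dict:
    """Average per-dimension judge scores over several independent runs."""
    dims = ["relevance", "accuracy", "completeness", "clarity", "helpfulness"]
    all_runs = []
    for _ in range(runs):
        raw = call_judge(judge_prompt.format(question=question, answer=answer))  # hypothetical helper
        all_runs.append(json.loads(raw))
    return {d: statistics.mean(run[d] for run in all_runs) for d in dims}
```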
Pairwise Comparison
```python
comparison_prompt = """
Below are two AI assistants' responses to the same question. Please judge which is better.

Question: {question}
Response A: {answer_a}
Response B: {answer_b}

Please choose one of:
- A is better
- B is better
- About the same

Reasoning:
"""
```
2.3 Human Evaluation
Evaluation process:
1. Prepare evaluation dataset (100-500 samples)
2. Design evaluation criteria and scoring rubric
3. Train annotation team
4. Independent evaluation by multiple annotators (at least 2-3)
5. Calculate inter-annotator agreement (e.g., Cohen's Kappa; see the sketch after this list)
6. Aggregate and analyze results
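The agreement check in step 5 is a one-liner with scikit-learn; a minimal sketch for two annotators rating the same samples on a 1-5 scale:
```python
from sklearn.metrics import cohen_kappa_score

# Ratings from two annotators on the same eight samples (1-5 scale)
annotator_1 = [5, 4, 4, 2, 5, 3, 1, 4]
annotator_2 = [5, 4, 3, 2, 4, 3, 1, 4]

kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.2f}")  # above ~0.6 is usually read as substantial agreement
```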
Evaluation dimensions:
| Dimension | 1 point | 3 points | 5 points |
|---|---|---|---|
| Fluency | Incoherent | Mostly fluent | Natural and fluent |
| Relevance | Completely off-topic | Partially relevant | Highly relevant |
| Accuracy | Multiple errors | Partially correct | Fully correct |
| Helpfulness | Not helpful | Somewhat helpful | Very helpful |
3. Automated Evaluation Pipeline
3.1 Evaluation Pipeline Design
```python
from datetime import datetime

class LLMEvaluationPipeline:
    def __init__(self, model_name):
        self.model_name = model_name
        # Metric classes are assumed to share a .name attribute and a
        # .compute(test_set, model_output) method.
        self.metrics = {
            "reference_based": [BleuMetric(), RougeMetric(), BertScoreMetric()],
            "reference_free": [LLMJudgeMetric(), CoherenceMetric()],
            "safety": [ToxicityMetric(), BiasMetric(), HallucinationMetric()],
            "performance": [LatencyMetric(), ThroughputMetric(), CostMetric()],
        }

    def evaluate(self, test_set, model_output):
        results = {}
        for category, metrics in self.metrics.items():
            results[category] = {}
            for metric in metrics:
                score = metric.compute(test_set, model_output)
                results[category][metric.name] = score
        return results

    def generate_report(self, results):
        """Generate evaluation report"""
        report = {
            "timestamp": datetime.now().isoformat(),
            "model": self.model_name,
            "results": results,
            "summary": self.summarize(results),          # helper assumed elsewhere
            "recommendations": self.recommend(results),  # helper assumed elsewhere
        }
        return report
```
3.2 Continuous Evaluation
```python
import time

import schedule  # pip install schedule

def run_evaluation():
    # 1. Sample from production logs; samples are assumed to bundle
    #    inputs and model outputs (helpers assumed elsewhere)
    samples = sample_production_logs(n=500)
    # 2. Run evaluation
    results = pipeline.evaluate(samples)
    # 3. Compare with baseline
    baseline = load_baseline()
    regression = detect_regression(results, baseline)
    # 4. Alert
    if regression:
        alert(f"Performance regression detected: {regression}")
    # 5. Log results
    log_results(results)

# Periodic evaluation (daily at 02:00); the function must be defined
# before it is scheduled
schedule.every().day.at("02:00").do(run_evaluation)

while True:
    schedule.run_pending()
    time.sleep(60)
```
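The detect_regression step can start as a simple per-metric threshold check; a minimal sketch (the flat metric dictionaries and the 5% tolerance are illustrative):
```python
def detect_regression(results: dict, baseline: dict, tolerance: float = 0.05) -> dict:
    """Return metrics that dropped more than `tolerance` relative to the baseline."""
    regressions = {}
    for metric, base_value in baseline.items():
        current = results.get(metric)
        if current is not None and current < base_value * (1.0 - tolerance):
            regressions[metric] = {"baseline": base_value, "current": current}
    return regressions
```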
4. Domain-Specific Evaluation
4.1 Healthcare
Evaluation dimensions:
- Medical accuracy (consistency with clinical guidelines)
- Safety (no harmful advice)
- Disclaimers (recommend seeing a doctor)
- Privacy protection (no patient information leakage)
4.2 Legal
Evaluation dimensions:
- Legal accuracy (correctness of statute citations)
- Jurisdictional applicability (correctly identifying applicable law)
- Timeliness (whether statutes are current)
- Risk disclosure (clearly stating this does not constitute legal advice)
4.3 Education
Evaluation dimensions:
- Knowledge accuracy
- Explanation clarity (appropriate for target age group)
- Teaching method (guided vs direct answers)
- Encouragement (positive and constructive feedback)
4.4 Code Generation
```python
# Code generation evaluation
class CodeEvaluator:
    def evaluate(self, generated_code, test_cases):
        results = {
            "pass_rate": self.run_tests(generated_code, test_cases),   # e.g., sandboxed test runner
            "syntax_valid": self.check_syntax(generated_code),         # e.g., ast.parse (see below)
            "style_score": self.check_style(generated_code),           # PEP 8 linter, etc.
            "security_issues": self.security_scan(generated_code),     # e.g., a static analyzer
            "complexity": self.calculate_complexity(generated_code),   # e.g., cyclomatic complexity
        }
        return results
```
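Of these checks, syntax validation needs nothing beyond the standard library; a minimal sketch of how check_syntax could be implemented (the other helpers would typically wrap a sandboxed test runner, a linter, a security scanner, and a complexity tool):
```python
import ast

def check_syntax(generated_code: str) -> bool:
    """Return True if the code parses as valid Python."""
    try:
        ast.parse(generated_code)
        return True
    except SyntaxError:
        return False

print(check_syntax("def add(a, b):\n    return a + b"))  # True
print(check_syntax("def add(a, b) return a + b"))        # False
```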
5. Evaluation Best Practices
5.1 Evaluation Dataset Design
- Diversity: Cover different difficulties, topics, and lengths
- Representativeness: Reflect actual production traffic distribution
- Edge cases: Include known difficult cases
- Continuous updates: Update based on newly discovered failure cases
- Avoid contamination: Ensure evaluation data is not in the training set
5.2 Multi-Dimensional Evaluation
Don't look at a single metric!
Good evaluation =
Quality metrics (accuracy, relevance, completeness)
+ Safety metrics (hallucination rate, toxicity, bias)
+ Performance metrics (latency, throughput)
+ Cost metrics (cost per query)
+ User metrics (satisfaction, retention)
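One way to operationalize this, sketched here with illustrative weights, is a composite score in which safety acts as a hard gate rather than just another weighted term:
```python
def composite_score(metrics: dict, weights: dict, safety_floor: float = 0.9) -> float:
    """Weighted overall score, forced to 0 if safety is below the floor."""
    if metrics["safety"] < safety_floor:
        return 0.0  # a high-quality but unsafe model should not look good on average
    return sum(weights[k] * metrics[k] for k in weights)

print(composite_score(
    {"quality": 0.82, "safety": 0.97, "latency": 0.70, "cost": 0.60},
    weights={"quality": 0.5, "latency": 0.3, "cost": 0.2},
))
```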
5.3 Evaluation Pitfalls
- Overfitting to benchmarks: Good benchmark performance does not equal real-world usefulness
- Ignoring distribution: High average scores don't mean there are no severe failures
- Ignoring safety: A high-quality but unsafe model is dangerous
- Evaluation cost: LLM-as-Judge has its own API cost and latency, which must be balanced against evaluation coverage
6. Summary
| Method | Cost | Speed | Accuracy | Applicable Stage |
|---|---|---|---|---|
| BLEU/ROUGE | Very low | Very fast | Low | Quick screening |
| BERTScore | Low | Fast | Medium | Semantic evaluation |
| LLM-as-Judge | Medium | Medium | Medium-High | Regular evaluation |
| Human evaluation | High | Slow | High | Critical decisions |
| Chatbot Arena | Medium | Slow | Highest | Model ranking |
Recommended combinations:
- Daily development: Automatic metrics + LLM-as-Judge
- Version releases: + Human evaluation + A/B testing
- Model selection: + Benchmarks + domain-specific evaluation
References
- Hendrycks et al., "Measuring Massive Multitask Language Understanding", 2021
- Chen et al., "Evaluating Large Language Models Trained on Code", 2021 (HumanEval)
- Liang et al., "Holistic Evaluation of Language Models", 2023 (HELM)
- Zheng et al., "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena", 2023
- Benchmarks — Agent evaluation benchmarks
- A/B Testing and Deployment — A/B testing methods