LLM Evaluation

1. LLM Benchmarks

1.1 Knowledge and Reasoning Benchmarks

Benchmark    Evaluates                                       Data Scale       Evaluation Method
MMLU         Multiple-choice knowledge across 57 subjects    ~16K questions   Accuracy
MMLU-Pro     Harder MMLU variant with 10 answer choices      ~12K questions   Accuracy
ARC          Elementary science reasoning                    ~7.8K questions  Accuracy
HellaSwag    Commonsense reasoning (sentence completion)     ~10K questions   Accuracy
Winogrande   Commonsense reasoning (pronoun resolution)      ~1.3K questions  Accuracy
TruthfulQA   Truthfulness (anti-hallucination)               817 questions    Truthfulness + informativeness
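
Most of the benchmarks above score exact-match accuracy over multiple-choice answers. A minimal sketch (the dataset fields and answer-letter convention are illustrative assumptions, not any benchmark's official format):

def multiple_choice_accuracy(examples, answer_fn):
    """Exact-match accuracy over multiple-choice items.

    examples: dicts with "question", "choices", and a gold "answer" letter.
    answer_fn: returns the model's chosen letter, e.g. "B".
    """
    correct = sum(
        answer_fn(ex["question"], ex["choices"]).strip().upper() == ex["answer"]
        for ex in examples
    )
    return correct / len(examples)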

1.2 Math and Coding Benchmarks

Benchmark    Evaluates                           Data Scale      Evaluation Method
GSM8K        Grade-school math word problems     8.5K problems   Accuracy
MATH         High-school/competition math        12.5K problems  Accuracy
HumanEval    Python code generation              164 problems    Pass@k
MBPP         Python code generation (simpler)    974 problems    Pass@k
SWE-bench    Real software engineering tasks     2,294 tasks     Resolution rate
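
Pass@k is usually computed with the unbiased estimator from the HumanEval paper (Chen et al., 2021): generate n samples per problem, count the c samples that pass all tests, and estimate the probability that at least one of k drawn samples passes.

import numpy as np

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator from Chen et al. (2021).

    n: samples generated per problem, c: samples that passed, k: budget.
    """
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples per problem, 23 of which pass the tests
print(pass_at_k(n=200, c=23, k=10))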

1.3 Comprehensive Evaluation Platforms

HELM (Holistic Evaluation of Language Models)

Evaluation dimensions:
- Accuracy
- Calibration
- Robustness
- Fairness
- Bias
- Toxicity
- Efficiency

Coverage:
- 42 scenarios including QA, summarization, translation, classification
- 59 metrics

Chatbot Arena

Evaluation method:
- Users chat with two anonymous models simultaneously
- Users vote for the better response
- Elo rating system for ranking
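
The Elo update after each vote is a one-line rule; a minimal sketch (K = 32 is a conventional choice, not necessarily Arena's exact configuration):

def elo_update(rating_a, rating_b, score_a, k=32):
    """One Elo update. score_a is 1 if A wins, 0 if B wins, 0.5 for a tie."""
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - expected_a))
    return new_a, new_b

# Two models at 1000; A wins the vote
print(elo_update(1000, 1000, score_a=1))  # (1016.0, 984.0)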

Advantages:
- Reflects real user preferences
- Continuously updated rankings
- Covers open-ended conversational ability

1.4 Limitations of Benchmarks

  • Data contamination: Training data may include test sets (a simple overlap check is sketched after this list)
  • Goodhart's Law: Optimizing for benchmarks does not equal genuine capability improvement
  • Insufficient coverage: Benchmarks may not reflect actual application needs
  • Static nature: Benchmarks do not update over time
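
A crude contamination check looks for long n-gram overlap between test items and the training corpus. A minimal sketch (the 13-gram window follows a common convention; whitespace tokenization and a precomputed training-side n-gram set are simplifying assumptions):

def ngram_set(tokens, n=13):
    """All n-grams of a token list, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(test_text, training_ngrams, n=13):
    """Flag a test item if any of its n-grams also appears in training data."""
    return bool(ngram_set(test_text.lower().split(), n) & training_ngrams)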

2. Evaluation Methods

2.1 Reference-Based Evaluation

BLEU

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "sat", "on", "the", "mat"]]
candidate = ["the", "cat", "is", "on", "the", "mat"]

# Smoothing avoids a zero score when a higher-order n-gram has no match
score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
  • Based on n-gram overlap
  • Suitable for: Translation evaluation
  • Limitation: Limited correlation with human judgment

ROUGE

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'],
                                  use_stemmer=True)
scores = scorer.score(
    "The cat sat on the mat",  # reference
    "The cat is on the mat",   # candidate
)
  • ROUGE-1: Unigram overlap
  • ROUGE-2: Bigram overlap
  • ROUGE-L: Longest common subsequence
  • Suitable for: Summarization evaluation

BERTScore

from bert_score import score

# Downloads a pretrained model on first use; F1 is the value usually reported
P, R, F1 = score(
    cands=["The cat is on the mat"],
    refs=["The cat sat on the mat"],
    lang="en",
)
  • Semantic similarity based on BERT embeddings
  • Captures semantic equivalence better than BLEU/ROUGE

2.2 Reference-Free Evaluation

LLM-as-Judge

Using a strong model (e.g., GPT-4) to evaluate other models' outputs:

# JSON braces are doubled so str.format() only substitutes {question}/{answer}
judge_prompt = """
Please evaluate the quality of the following AI assistant's response.

User question: {question}

AI response: {answer}

Please score the response on the following dimensions (1-5):
1. Relevance: Does the response address the question?
2. Accuracy: Is the information accurate?
3. Completeness: Does it cover all aspects of the question?
4. Clarity: Is the expression clear and easy to understand?
5. Helpfulness: Is it practically helpful to the user?

Output in JSON format:
{{
  "relevance": <score>,
  "accuracy": <score>,
  "completeness": <score>,
  "clarity": <score>,
  "helpfulness": <score>,
  "overall": <score>,
  "reasoning": "<scoring rationale>"
}}
"""

LLM-as-Judge biases:

  • Position bias: Tends to favor the first response
  • Verbosity bias: Tends to prefer longer responses
  • Self-preference: Models tend to prefer their own outputs
  • Mitigation: randomize response order and average over multiple evaluations (see the sketch after the pairwise comparison prompt below)

Pairwise Comparison

comparison_prompt = """
Below are two AI assistants' responses to the same question. Please judge which is better.

Question: {question}

Response A: {answer_a}
Response B: {answer_b}

Please choose:
- A is better
- B is better
- About the same

Reasoning:
"""

2.3 Human Evaluation

Evaluation process:
1. Prepare evaluation dataset (100-500 samples)
2. Design evaluation criteria and scoring rubric
3. Train annotation team
4. Independent evaluation by multiple annotators (at least 2-3)
5. Calculate inter-annotator agreement (Cohen's kappa, sketched below)
6. Aggregate and analyze results
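
Cohen's kappa corrects raw agreement for chance agreement; with ordinal 1-5 scores, quadratic weighting is common. A minimal sketch (the ratings are made-up examples):

from sklearn.metrics import cohen_kappa_score

annotator_1 = [5, 4, 3, 5, 2, 4]
annotator_2 = [5, 3, 3, 4, 2, 4]

# weights="quadratic" penalizes large disagreements more, suiting ordinal scales
kappa = cohen_kappa_score(annotator_1, annotator_2, weights="quadratic")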

Evaluation dimensions:

Dimension     1 point               3 points            5 points
Fluency       Incoherent            Mostly fluent       Natural and fluent
Relevance     Completely off-topic  Partially relevant  Highly relevant
Accuracy      Multiple errors       Partially correct   Fully correct
Helpfulness   Not helpful           Somewhat helpful    Very helpful

3. Automated Evaluation Pipeline

3.1 Evaluation Pipeline Design

from datetime import datetime

# Metric classes (BleuMetric, LLMJudgeMetric, ...) are assumed to expose a
# .name attribute and a .compute(test_set, model_output) method.
class LLMEvaluationPipeline:
    def __init__(self, model_name):
        self.model_name = model_name
        self.metrics = {
            "reference_based": [BleuMetric(), RougeMetric(), BertScoreMetric()],
            "reference_free": [LLMJudgeMetric(), CoherenceMetric()],
            "safety": [ToxicityMetric(), BiasMetric(), HallucinationMetric()],
            "performance": [LatencyMetric(), ThroughputMetric(), CostMetric()],
        }

    def evaluate(self, test_set, model_output):
        results = {}
        for category, metrics in self.metrics.items():
            results[category] = {}
            for metric in metrics:
                score = metric.compute(test_set, model_output)
                results[category][metric.name] = score
        return results

    def generate_report(self, results):
        """Generate an evaluation report."""
        return {
            "timestamp": datetime.now().isoformat(),
            "model": self.model_name,
            "results": results,
            "summary": self.summarize(results),
            "recommendations": self.recommend(results),
        }

3.2 Continuous Evaluation

import time
import schedule

def run_evaluation():
    # 1. Sample from production logs (inputs and model outputs together)
    samples = sample_production_logs(n=500)

    # 2. Run evaluation
    results = pipeline.evaluate(samples)

    # 3. Compare with baseline
    baseline = load_baseline()
    regression = detect_regression(results, baseline)

    # 4. Alert
    if regression:
        alert(f"Performance regression detected: {regression}")

    # 5. Log results
    log_results(results)

# Periodic evaluation (daily at 02:00; weekly schedules work the same way)
schedule.every().day.at("02:00").do(run_evaluation)

while True:
    schedule.run_pending()
    time.sleep(60)

4. Domain-Specific Evaluation

4.1 Healthcare

Evaluation dimensions:
- Medical accuracy (consistency with clinical guidelines)
- Safety (no harmful advice)
- Disclaimers (recommend seeing a doctor)
- Privacy protection (no patient information leakage)

4.2 Legal

Evaluation dimensions:
- Legal accuracy (correctness of statute citations)
- Jurisdictional applicability (correctly identifying applicable law)
- Timeliness (whether statutes are current)
- Risk disclosure (clearly stating this does not constitute legal advice)

4.3 Education

Evaluation dimensions:
- Knowledge accuracy
- Explanation clarity (appropriate for target age group)
- Teaching method (guided vs direct answers)
- Encouragement (positive and constructive feedback)

4.4 Code Generation

# Code generation evaluation; run_tests, check_style, security_scan, and
# calculate_complexity are stubs for a sandboxed test runner, a linter
# (e.g. pycodestyle), a scanner (e.g. bandit), and a complexity measure.
class CodeEvaluator:
    def evaluate(self, generated_code, test_cases):
        results = {
            "pass_rate": self.run_tests(generated_code, test_cases),
            "syntax_valid": self.check_syntax(generated_code),
            "style_score": self.check_style(generated_code),  # PEP 8, etc.
            "security_issues": self.security_scan(generated_code),
            "complexity": self.calculate_complexity(generated_code),
        }
        return results

    def check_syntax(self, code):
        """Syntax check without executing the code."""
        try:
            compile(code, "<generated>", "exec")
            return True
        except SyntaxError:
            return False

5. Evaluation Best Practices

5.1 Evaluation Dataset Design

  • Diversity: Cover different difficulties, topics, and lengths
  • Representativeness: Reflect actual production traffic distribution
  • Edge cases: Include known difficult cases
  • Continuous updates: Update based on newly discovered failure cases
  • Avoid contamination: Ensure evaluation data is not in the training set

5.2 Multi-Dimensional Evaluation

Don't look at a single metric!

Good evaluation = 
  Quality metrics (accuracy, relevance, completeness)
  + Safety metrics (hallucination rate, toxicity, bias)
  + Performance metrics (latency, throughput)
  + Cost metrics (cost per query)
  + User metrics (satisfaction, retention)
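
One way to make this concrete is a weighted composite with hard safety gates (the weights, thresholds, and metric names below are illustrative assumptions, not recommendations):

def composite_score(metrics, weights=None):
    """Weighted composite; assumes each metric is normalized so higher is better.

    Safety failures gate the score to zero: a high-quality but unsafe
    model should not rank well (see 5.3).
    """
    if metrics["toxicity_rate"] > 0.01 or metrics["hallucination_rate"] > 0.05:
        return 0.0
    weights = weights or {
        "quality": 0.5,
        "speed": 0.2,
        "cost_efficiency": 0.1,
        "user_satisfaction": 0.2,
    }
    return sum(w * metrics[name] for name, w in weights.items())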

5.3 Evaluation Pitfalls

  • Overfitting to benchmarks: Good benchmark performance does not equal real-world usefulness
  • Ignoring distribution: High average scores don't mean there are no severe failures
  • Ignoring safety: A high-quality but unsafe model is dangerous
  • Evaluation cost: LLM-as-Judge also has costs that need to be balanced

6. Summary

Method            Cost      Speed      Accuracy     Applicable Stage
BLEU/ROUGE        Very low  Very fast  Low          Quick screening
BERTScore         Low       Fast       Medium       Semantic evaluation
LLM-as-Judge      Medium    Medium     Medium-high  Regular evaluation
Human evaluation  High      Slow       High         Critical decisions
Chatbot Arena     Medium    Slow       Highest      Model ranking

Recommended combinations:

  • Daily development: Automatic metrics + LLM-as-Judge
  • Version releases: + Human evaluation + A/B testing
  • Model selection: + Benchmarks + domain-specific evaluation

References

  • Hendrycks et al., "Measuring Massive Multitask Language Understanding", 2021 (MMLU)
  • Chen et al., "Evaluating Large Language Models Trained on Code", 2021 (HumanEval)
  • Liang et al., "Holistic Evaluation of Language Models", 2023 (HELM)
  • Zheng et al., "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena", 2023 (Chatbot Arena)
  • Benchmarks: agent evaluation benchmarks
  • A/B Testing and Deployment: A/B testing methods
