LLM Evaluation
1. LLM Benchmarks
1.1 Knowledge and Reasoning Benchmarks
| Benchmark | Evaluates | Data Scale | Evaluation Method |
|---|---|---|---|
| MMLU | Multiple-choice knowledge across 57 subjects | ~16K questions | Accuracy |
| MMLU-Pro | Upgraded MMLU, 10-choice, harder | ~12K questions | Accuracy |
| ARC | Elementary science reasoning | ~7.8K questions | Accuracy |
| HellaSwag | Common sense reasoning (sentence completion) | ~10K questions | Accuracy |
| Winogrande | Common sense reasoning (pronoun resolution) | ~1.3K questions | Accuracy |
| TruthfulQA | Truthfulness evaluation (anti-hallucination) | ~817 questions | Truthfulness + informativeness |
1.2 Math and Coding Benchmarks
| Benchmark | Evaluates | Data Scale | Evaluation Method |
|---|---|---|---|
| GSM8K | Grade school math word problems | 8.5K problems | Accuracy |
| MATH | High school/competition math | 12.5K problems | Accuracy |
| HumanEval | Python code generation | 164 problems | Pass@K |
| MBPP | Python code generation (simple) | 974 problems | Pass@K |
| SWE-bench | Real software engineering tasks | 2294 tasks | Resolution rate |
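HumanEval and MBPP report pass@k: the probability that at least one of k sampled completions passes all unit tests. Estimating it naively from exactly k samples is high-variance, so Chen et al. (2021) compute an unbiased estimator from n ≥ k samples per problem; a minimal sketch:
```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al. (2021).

    n: samples generated per problem, c: samples that passed all tests.
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

print(pass_at_k(n=200, c=30, k=10))  # e.g., 200 samples per problem, 30 passing
```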
1.3 Comprehensive Evaluation Platforms
HELM (Holistic Evaluation of Language Models)
Evaluation dimensions:
- Accuracy
- Calibration
- Robustness
- Fairness
- Bias
- Toxicity
- Efficiency
Coverage:
- 42 scenarios including QA, summarization, translation, classification
- 59 metrics
Chatbot Arena
Evaluation method:
- Users chat with two anonymous models simultaneously
- Users vote for the better response
- Elo rating system for ranking (see the update sketch below)
Advantages:
- Reflects real user preferences
- Continuously updated rankings
- Covers open-ended conversational ability
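The Elo ranking mentioned above updates ratings after every vote; a minimal sketch of the standard update rule (the K-factor of 32 and the starting ratings are illustrative defaults, not Chatbot Arena's exact configuration):
```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Standard Elo update after one comparison.

    score_a: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    """
    expected_a = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))
    new_a = r_a + k * (score_a - expected_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

print(elo_update(1100.0, 1000.0, score_a=1.0))  # the favorite wins: small rating gain
```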
1.4 Limitations of Benchmarks
- Data contamination: Test items can leak into training data and inflate scores (a rough overlap check is sketched after this list)
- Goodhart's Law: Optimizing for a benchmark is not the same as genuine capability improvement
- Insufficient coverage: Benchmarks may not reflect actual application needs
- Static nature: Benchmarks are fixed snapshots and do not keep pace with evolving models and use cases
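A rough first-pass check for contamination is n-gram overlap between benchmark items and training documents (the GPT-3 paper, for instance, used 13-gram overlap); a minimal sketch, with the n-gram length and the in-memory corpus interface as illustrative simplifications:
```python
def ngrams(text: str, n: int = 13) -> set:
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def looks_contaminated(test_item: str, training_docs: list, n: int = 13) -> bool:
    """Flag a test item whose n-grams also appear in any training document."""
    test_grams = ngrams(test_item, n)
    return any(test_grams & ngrams(doc, n) for doc in training_docs)
```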
2. Evaluation Methods
2.1 Reference-Based Evaluation
BLEU
```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "sat", "on", "the", "mat"]]  # list of tokenized references
candidate = ["the", "cat", "is", "on", "the", "mat"]     # one tokenized hypothesis

# Smoothing avoids a hard zero when a higher-order n-gram has no match (common for short sentences)
score = sentence_bleu(reference, candidate, smoothing_function=SmoothingFunction().method1)
```
- Based on n-gram precision (typically up to 4-grams) with a brevity penalty
- Suitable for: Translation evaluation
- Limitation: Limited correlation with human judgment, especially at the sentence level
ROUGE
```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
# score(target, prediction): the reference string comes first
scores = scorer.score(
    "The cat sat on the mat",  # reference
    "The cat is on the mat",   # candidate
)
```
- ROUGE-1: Unigram overlap
- ROUGE-2: Bigram overlap
- ROUGE-L: Longest common subsequence
- Suitable for: Summarization evaluation (recall-oriented)
BERTScore
```python
from bert_score import score

# Returns precision, recall, and F1 tensors, one entry per candidate/reference pair
P, R, F1 = score(
    cands=["The cat is on the mat"],
    refs=["The cat sat on the mat"],
    lang="en",
)
print(f"BERTScore F1: {F1.item():.3f}")
```
- Semantic similarity based on BERT embeddings (token-level cosine matching)
- Captures semantic equivalence (e.g., paraphrases) better than BLEU/ROUGE
2.2 Reference-Free Evaluation
LLM-as-Judge
Using a strong model (e.g., GPT-4) to evaluate other models' outputs:
```python
# Literal braces in the JSON template are doubled so that str.format()
# leaves them intact and only fills {question} and {answer}.
judge_prompt = """
Please evaluate the quality of the following AI assistant's response.

User question: {question}
AI response: {answer}

Please score on the following dimensions (1-5):
1. Relevance: Does the response address the question?
2. Accuracy: Is the information accurate?
3. Completeness: Does it cover all aspects of the question?
4. Clarity: Is the expression clear and easy to understand?
5. Helpfulness: Is it practically helpful to the user?

Output in JSON format:
{{
    "relevance": <score>,
    "accuracy": <score>,
    "completeness": <score>,
    "clarity": <score>,
    "helpfulness": <score>,
    "overall": <score>,
    "reasoning": "<scoring rationale>"
}}
"""
```
LLM-as-Judge biases:
- Position bias: Tends to favor the first response
- Verbosity bias: Tends to prefer longer responses
- Self-preference: Models tend to prefer their own outputs
- Mitigation: Randomize response order and average over multiple evaluations (both sketched below)
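For the averaging part, parse the judge's JSON output and take the mean over several runs; a minimal sketch, assuming a hypothetical call_judge(prompt) helper that returns the judge model's raw text:
```python
import json
import statistics

def judge_scores(question: str, answer: str, runs: int = 3) -> dict:
    """Average per-dimension judge scores over several independent runs."""
    dims = ["relevance", "accuracy", "completeness", "clarity", "helpfulness"]
    all_runs = []
    for _ in range(runs):
        raw = call_judge(judge_prompt.format(question=question, answer=answer))  # hypothetical helper
        all_runs.append(json.loads(raw))
    return {d: statistics.mean(run[d] for run in all_runs) for d in dims}
```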
Pairwise Comparison
```python
comparison_prompt = """
Below are two AI assistants' responses to the same question. Please judge which is better.

Question: {question}
Response A: {answer_a}
Response B: {answer_b}

Please choose one of:
- A is better
- B is better
- About the same

Reasoning:
"""
```
2.3 Human Evaluation
Evaluation process:
1. Prepare evaluation dataset (100-500 samples)
2. Design evaluation criteria and scoring rubric
3. Train annotation team
4. Independent evaluation by multiple annotators (at least 2-3)
5. Calculate inter-annotator agreement (e.g., Cohen's Kappa; see the sketch after this list)
6. Aggregate and analyze results
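The agreement check in step 5 is a one-liner with scikit-learn; a minimal sketch for two annotators rating the same samples on a 1-5 scale:
```python
from sklearn.metrics import cohen_kappa_score

# Ratings from two annotators on the same eight samples (1-5 scale)
annotator_1 = [5, 4, 4, 2, 5, 3, 1, 4]
annotator_2 = [5, 4, 3, 2, 4, 3, 1, 4]

kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.2f}")  # above ~0.6 is usually read as substantial agreement
```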
Evaluation dimensions:
| Dimension | 1 point | 3 points | 5 points |
|---|---|---|---|
| Fluency | Incoherent | Mostly fluent | Natural and fluent |
| Relevance | Completely off-topic | Partially relevant | Highly relevant |
| Accuracy | Multiple errors | Partially correct | Fully correct |
| Helpfulness | Not helpful | Somewhat helpful | Very helpful |
3. Automated Evaluation Pipeline
3.1 Evaluation Pipeline Design
```python
from datetime import datetime

class LLMEvaluationPipeline:
    def __init__(self, model_name):
        self.model_name = model_name
        # Metric classes are assumed to share a .name attribute and a
        # .compute(test_set, model_output) method.
        self.metrics = {
            "reference_based": [BleuMetric(), RougeMetric(), BertScoreMetric()],
            "reference_free": [LLMJudgeMetric(), CoherenceMetric()],
            "safety": [ToxicityMetric(), BiasMetric(), HallucinationMetric()],
            "performance": [LatencyMetric(), ThroughputMetric(), CostMetric()],
        }

    def evaluate(self, test_set, model_output):
        results = {}
        for category, metrics in self.metrics.items():
            results[category] = {}
            for metric in metrics:
                score = metric.compute(test_set, model_output)
                results[category][metric.name] = score
        return results

    def generate_report(self, results):
        """Generate evaluation report"""
        report = {
            "timestamp": datetime.now().isoformat(),
            "model": self.model_name,
            "results": results,
            "summary": self.summarize(results),          # helper assumed elsewhere
            "recommendations": self.recommend(results),  # helper assumed elsewhere
        }
        return report
```
3.2 Continuous Evaluation
```python
import time

import schedule  # pip install schedule

def run_evaluation():
    # 1. Sample from production logs; samples are assumed to bundle
    #    inputs and model outputs (helpers assumed elsewhere)
    samples = sample_production_logs(n=500)
    # 2. Run evaluation
    results = pipeline.evaluate(samples)
    # 3. Compare with baseline
    baseline = load_baseline()
    regression = detect_regression(results, baseline)
    # 4. Alert
    if regression:
        alert(f"Performance regression detected: {regression}")
    # 5. Log results
    log_results(results)

# Periodic evaluation (daily at 02:00); the function must be defined
# before it is scheduled
schedule.every().day.at("02:00").do(run_evaluation)

while True:
    schedule.run_pending()
    time.sleep(60)
```
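The detect_regression step can start as a simple per-metric threshold check; a minimal sketch (the flat metric dictionaries and the 5% tolerance are illustrative):
```python
def detect_regression(results: dict, baseline: dict, tolerance: float = 0.05) -> dict:
    """Return metrics that dropped more than `tolerance` relative to the baseline."""
    regressions = {}
    for metric, base_value in baseline.items():
        current = results.get(metric)
        if current is not None and current < base_value * (1.0 - tolerance):
            regressions[metric] = {"baseline": base_value, "current": current}
    return regressions
```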
4. Domain-Specific Evaluation
4.1 Healthcare
Evaluation dimensions:
- Medical accuracy (consistency with clinical guidelines)
- Safety (no harmful advice)
- Disclaimers (recommend seeing a doctor)
- Privacy protection (no patient information leakage)
4.2 Legal
Evaluation dimensions:
- Legal accuracy (correctness of statute citations)
- Jurisdictional applicability (correctly identifying applicable law)
- Timeliness (whether statutes are current)
- Risk disclosure (clearly stating this does not constitute legal advice)
4.3 Education
Evaluation dimensions:
- Knowledge accuracy
- Explanation clarity (appropriate for target age group)
- Teaching method (guided vs direct answers)
- Encouragement (positive and constructive feedback)
4.4 Code Generation
```python
# Code generation evaluation
class CodeEvaluator:
    def evaluate(self, generated_code, test_cases):
        results = {
            "pass_rate": self.run_tests(generated_code, test_cases),   # e.g., sandboxed test runner
            "syntax_valid": self.check_syntax(generated_code),         # e.g., ast.parse (see below)
            "style_score": self.check_style(generated_code),           # PEP 8 linter, etc.
            "security_issues": self.security_scan(generated_code),     # e.g., a static analyzer
            "complexity": self.calculate_complexity(generated_code),   # e.g., cyclomatic complexity
        }
        return results
```
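Of these checks, syntax validation needs nothing beyond the standard library; a minimal sketch of how check_syntax could be implemented (the other helpers would typically wrap a sandboxed test runner, a linter, a security scanner, and a complexity tool):
```python
import ast

def check_syntax(generated_code: str) -> bool:
    """Return True if the code parses as valid Python."""
    try:
        ast.parse(generated_code)
        return True
    except SyntaxError:
        return False

print(check_syntax("def add(a, b):\n    return a + b"))  # True
print(check_syntax("def add(a, b) return a + b"))        # False
```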
5. Evaluation Best Practices
5.1 Evaluation Dataset Design
- Diversity: Cover different difficulties, topics, and lengths
- Representativeness: Reflect actual production traffic distribution
- Edge cases: Include known difficult cases
- Continuous updates: Update based on newly discovered failure cases
- Avoid contamination: Ensure evaluation data is not in the training set
5.2 Multi-Dimensional Evaluation
Don't look at a single metric!
Good evaluation =
Quality metrics (accuracy, relevance, completeness)
+ Safety metrics (hallucination rate, toxicity, bias)
+ Performance metrics (latency, throughput)
+ Cost metrics (cost per query)
+ User metrics (satisfaction, retention)
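One way to operationalize this, sketched here with illustrative weights, is a composite score in which safety acts as a hard gate rather than just another weighted term:
```python
def composite_score(metrics: dict, weights: dict, safety_floor: float = 0.9) -> float:
    """Weighted overall score, forced to 0 if safety is below the floor."""
    if metrics["safety"] < safety_floor:
        return 0.0  # a high-quality but unsafe model should not look good on average
    return sum(weights[k] * metrics[k] for k in weights)

print(composite_score(
    {"quality": 0.82, "safety": 0.97, "latency": 0.70, "cost": 0.60},
    weights={"quality": 0.5, "latency": 0.3, "cost": 0.2},
))
```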
5.3 Evaluation Pitfalls
- Overfitting to benchmarks: Good benchmark performance does not equal real-world usefulness
- Ignoring distribution: High average scores don't mean there are no severe failures
- Ignoring safety: A high-quality but unsafe model is dangerous
- Evaluation cost: LLM-as-Judge has its own API cost and latency, which must be balanced against evaluation coverage
6. Summary
| Method | Cost | Speed | Accuracy | Applicable Stage |
|---|---|---|---|---|
| BLEU/ROUGE | Very low | Very fast | Low | Quick screening |
| BERTScore | Low | Fast | Medium | Semantic evaluation |
| LLM-as-Judge | Medium | Medium | Medium-High | Regular evaluation |
| Human evaluation | High | Slow | High | Critical decisions |
| Chatbot Arena | Medium | Slow | Highest | Model ranking |
Recommended combinations:
- Daily development: Automatic metrics + LLM-as-Judge
- Version releases: + Human evaluation + A/B testing
- Model selection: + Benchmarks + domain-specific evaluation
References
- Hendrycks et al., "Measuring Massive Multitask Language Understanding", 2021
- Chen et al., "Evaluating Large Language Models Trained on Code", 2021 (HumanEval)
- Liang et al., "Holistic Evaluation of Language Models", 2023 (HELM)
- Zheng et al., "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena", 2023
- Benchmarks — Agent evaluation benchmarks
- A/B Testing and Deployment — A/B testing methods