
Evaluation Methods Overview

Overview

Evaluating AI agents is fundamentally different from evaluating traditional LLMs. Traditional LLM evaluation focuses on the quality of single input-output pairs, whereas agent evaluation must account for multi-step interactions, tool usage, operations on an environment, and other complex factors. How to evaluate agent capabilities comprehensively and accurately is a key question for advancing agent technology.

Agent Evaluation vs. LLM Evaluation

| Dimension | LLM Evaluation | Agent Evaluation |
|---|---|---|
| Input/Output | Single prompt → response | Multi-turn interaction sequences |
| Evaluation target | Text quality | Task completion |
| Environment dependency | No external environment | Depends on tools and environment |
| Non-determinism | Relatively low | High (multiple paths, multiple strategies) |
| Evaluation cost | Relatively low | Higher (environment setup, runtime) |
| Intermediate process | Not considered | Must be evaluated (efficiency, safety) |

Evaluation Dimensions

graph TD
    A[Agent Evaluation Dimensions] --> B[Task Completion]
    A --> C[Efficiency]
    A --> D[Safety]
    A --> E[Cost]
    A --> F[User Satisfaction]

    B --> B1[Success Rate]
    B --> B2[Partial Completion]
    B --> B3[Correctness]

    C --> C1[Number of Steps]
    C --> C2[Token Consumption]
    C --> C3[Time Consumption]

    D --> D1[Safety Violations]
    D --> D2[Appropriate Permission Use]
    D --> D3[Data Protection]

    E --> E1[API Call Cost]
    E --> E2[Compute Resources]
    E --> E3[Human Intervention Cost]

    F --> F1[Subjective Ratings]
    F --> F2[Preference Ranking]
    F --> F3[Trustworthiness Perception]

    style A fill:#e3f2fd

Task Completion

The most critical evaluation dimension, measuring whether the agent successfully completed the target task.

Binary Success Rate:

\[ \text{Success Rate} = \frac{|\{t \in T : \text{completed}(t)\}|}{|T|} \]

Partial Completion:

Complex tasks can be decomposed into subgoals, with the completion ratio computed as:

\[ \text{Partial Completion}(t) = \frac{\sum_{i=1}^{n} w_i \cdot \mathbb{1}[\text{subgoal}_i \text{ achieved}]}{\sum_{i=1}^{n} w_i} \]

Where \(w_i\) is the weight of the \(i\)-th subgoal.
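The two formulas above can be sketched directly in code; the function names and the example subgoal weights are illustrative, not part of any specific benchmark.

```python
def success_rate(results):
    """Binary success rate: fraction of tasks marked completed."""
    return sum(1 for completed in results if completed) / len(results)

def partial_completion(subgoals):
    """Weighted partial completion.

    subgoals: list of (weight, achieved) pairs, one per subgoal.
    Returns the weight-normalized fraction of achieved subgoals.
    """
    total = sum(w for w, _ in subgoals)
    achieved = sum(w for w, done in subgoals if done)
    return achieved / total
```

For example, with subgoals weighted (2, 1, 1) and the middle one failed, `partial_completion([(2, True), (1, False), (1, True)])` yields 0.75.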

Efficiency

Measures the resources consumed by the agent to complete a task.

Step Efficiency:

\[ \text{Step Efficiency} = \frac{\text{Optimal Steps}}{\text{Actual Steps}} \]

Token Efficiency:

\[ \text{Token Cost} = \sum_{i=1}^{N} (c_{\text{input}} \cdot n_{\text{input}}^{(i)} + c_{\text{output}} \cdot n_{\text{output}}^{(i)}) \]

Where \(c\) is the per-token price, \(n\) is the token count, and \(N\) is the total number of steps.
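Both efficiency metrics are straightforward to compute from a logged run; a minimal sketch (prices and step counts below are placeholder values):

```python
def step_efficiency(optimal_steps, actual_steps):
    """Ratio of optimal to actual steps; 1.0 means no wasted steps."""
    return optimal_steps / actual_steps

def token_cost(steps, c_input, c_output):
    """Total token cost over a run.

    steps: list of (input_tokens, output_tokens) per step.
    c_input / c_output: per-token prices.
    """
    return sum(c_input * n_in + c_output * n_out for n_in, n_out in steps)
```

Note that step efficiency requires knowing the optimal step count, which is typically only available for tasks with annotated reference solutions.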

Safety

Evaluates the agent's safe behavior during task execution:

  • Permission compliance: Whether only authorized tools and operations are used
  • Side effect control: Whether unnecessary environment changes are produced
  • Sensitive information handling: Whether privacy data is properly protected
  • Content safety: Whether generated content is compliant

Cost

\[ \text{Total Cost} = C_{\text{LLM}} + C_{\text{tool}} + C_{\text{compute}} + C_{\text{human}} \]

Components:

| Cost Type | Description |
|---|---|
| \(C_{\text{LLM}}\) | LLM API call costs |
| \(C_{\text{tool}}\) | External tool/API call costs |
| \(C_{\text{compute}}\) | Compute resources (sandbox, GPU, etc.) |
| \(C_{\text{human}}\) | Human review and intervention labor costs |
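In practice it helps to keep the breakdown rather than just the sum, so cost drivers stay visible; the component keys and figures below are illustrative.

```python
def total_cost(costs):
    """Sum cost components; keys mirror the table above (same currency)."""
    return sum(costs.values())

# Illustrative per-run breakdown (USD):
run_costs = {"llm": 0.42, "tool": 0.05, "compute": 0.10, "human": 0.00}
```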

User Satisfaction

  • Subjective ratings: User satisfaction scores for results (1-5 scale)
  • Preference ranking: User preference comparison across multiple agent results
  • Trustworthiness: Level of user trust in agent output
  • Usability: Fluency and intuitiveness of the interaction experience

Evaluation Method Categories

Offline Evaluation

Evaluation using predefined test sets and standard answers:

  • Advantages: Reproducible, low cost, easy to compare
  • Disadvantages: Limited coverage, gap from real scenarios

Online Evaluation

Deploying agents in real environments and collecting user feedback:

  • Advantages: Most realistic evaluation
  • Disadvantages: High cost, difficult to control variables

Simulation-based Evaluation

Building simulated environments to evaluate agents:

  • Advantages: Controllable, reproducible, supports large-scale testing
  • Disadvantages: Gap between simulated and real environments

LLM-as-Judge

Using an LLM as the evaluator:

evaluation_prompt = """
Please evaluate the following agent's task completion:

Task objective: {task_description}
Agent execution process: {agent_trajectory}
Final result: {final_output}

Please rate on the following dimensions (1-5):
1. Task completion
2. Execution efficiency
3. Safety
4. Output quality

And provide an overall assessment.
"""

Note: LLM-as-Judge has biases (position bias, verbosity bias) and should be used in conjunction with human evaluation.
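Judge replies must also be parsed into usable scores. A minimal sketch, assuming the judge answers with "dimension: score" lines; real pipelines should request structured output (e.g. JSON) rather than rely on free-text parsing:

```python
import re

def parse_judge_scores(reply):
    """Extract 'dimension: score' lines (scores 1-5) from a judge reply.

    The line-based reply format is an assumption about the judge's output;
    lines whose value is not a 1-5 digit are ignored.
    """
    scores = {}
    for name, value in re.findall(r"^(.+?):\s*([1-5])\s*$", reply, re.MULTILINE):
        scores[name.strip()] = int(value)
    return scores
```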

Evaluation Frameworks

General Evaluation Framework Design

class AgentEvaluator:
    def evaluate(self, agent, task_suite):
        results = []
        for task in task_suite:
            # Set up environment
            env = self.setup_environment(task)

            # Run agent
            trajectory = agent.run(task, env)

            # Multi-dimensional evaluation
            score = {
                "completion": self.eval_completion(trajectory, task),
                "efficiency": self.eval_efficiency(trajectory),
                "safety": self.eval_safety(trajectory),
                "cost": self.eval_cost(trajectory),
            }
            results.append(score)

        return self.aggregate(results)
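The `aggregate` step above is left abstract; a minimal version averages each score dimension across tasks (per-dimension means are an assumption — a real framework might also report medians or confidence intervals):

```python
from statistics import mean

def aggregate(results):
    """Average each score dimension across all evaluated tasks.

    results: list of per-task score dicts with identical keys.
    """
    dims = results[0].keys()
    return {d: mean(r[d] for r in results) for d in dims}
```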

Trajectory Evaluation

The agent's execution trajectory contains rich evaluation information:

\[ \tau = (s_0, a_0, o_0, s_1, a_1, o_1, \ldots, s_T, a_T, o_T) \]

Where \(s_t\) is the state, \(a_t\) is the action, and \(o_t\) is the observation.

Evaluation dimensions:

  • Action reasonableness: Whether each step's action is reasonable
  • Recovery capability: Ability to recover when encountering errors
  • Planning quality: Whether there is a clear execution plan
  • Tool usage: Whether tool selection and usage is appropriate
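A trajectory can be represented as a list of (state, action, observation) steps, and simple checks run over it. The `Step` fields and the tool-name actions below are illustrative assumptions, shown here for a permission-compliance check like the tool-usage dimension above:

```python
from dataclasses import dataclass

@dataclass
class Step:
    """One (s_t, a_t, o_t) triple from a trajectory tau."""
    state: str
    action: str        # assumed to be a tool name for this sketch
    observation: str

def safety_violations(trajectory, allowed_tools):
    """Count actions that use a tool outside the authorized set."""
    return sum(1 for step in trajectory if step.action not in allowed_tools)
```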

Practical Recommendations

  1. Multi-dimensional evaluation: Do not focus solely on success rate; consider efficiency, safety, and cost comprehensively
  2. Layered evaluation: Evaluate progressively from simple to complex tasks
  3. Ablation studies: Evaluate the contribution of each component (memory, tools, planning)
  4. Baseline comparison: Compare with non-agent approaches (pure LLM)
  5. Statistical significance: Run multiple times and take averages, report confidence intervals
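For recommendation 5, reporting a confidence interval is straightforward from repeated runs; a sketch using a normal approximation (reasonable only when the number of runs is not too small):

```python
from math import sqrt
from statistics import mean, stdev

def mean_with_ci(scores, z=1.96):
    """Mean score with an approximate 95% normal confidence interval.

    scores: per-run scores (e.g. success rates); needs at least 2 runs.
    """
    m = mean(scores)
    half = z * stdev(scores) / sqrt(len(scores))
    return m, (m - half, m + half)
```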

References

  1. Liu, X., et al. "AgentBench: Evaluating LLMs as Agents." ICLR 2024.
  2. Mialon, G., et al. "GAIA: A Benchmark for General AI Assistants." ICLR 2024.
  3. Zhuge, M., et al. "Agent-as-a-Judge: Evaluate Agents with Agents." arXiv:2410.10934, 2024.

Cross-references:

  • Benchmark details → Benchmarks
  • Human evaluation → Human Evaluation and Alignment
  • Reliability → Reliability and Robustness

