Evaluation Methods Overview
Overview
Evaluating AI agents is fundamentally different from evaluating traditional LLMs. Traditional LLM evaluation focuses on the quality of single input-output pairs, whereas agent evaluation must account for multi-step interactions, tool usage, operations on the environment, and other complicating factors. Evaluating agent capabilities comprehensively and accurately is therefore a key question for advancing agent technology.
Agent Evaluation vs. LLM Evaluation
| Dimension | LLM Evaluation | Agent Evaluation |
|---|---|---|
| Input/Output | Single prompt → response | Multi-turn interaction sequences |
| Evaluation target | Text quality | Task completion |
| Environment dependency | No external environment | Depends on tools and environment |
| Non-determinism | Relatively low | High (multiple paths, multiple strategies) |
| Evaluation cost | Relatively low | Higher (environment setup, runtime) |
| Intermediate process | Not considered | Must be evaluated (efficiency, safety) |
Evaluation Dimensions
```mermaid
graph TD
    A[Agent Evaluation Dimensions] --> B[Task Completion]
    A --> C[Efficiency]
    A --> D[Safety]
    A --> E[Cost]
    A --> F[User Satisfaction]
    B --> B1[Success Rate]
    B --> B2[Partial Completion]
    B --> B3[Correctness]
    C --> C1[Number of Steps]
    C --> C2[Token Consumption]
    C --> C3[Time Consumption]
    D --> D1[Safety Violations]
    D --> D2[Permission Usage Reasonableness]
    D --> D3[Data Protection]
    E --> E1[API Call Cost]
    E --> E2[Compute Resources]
    E --> E3[Human Intervention Cost]
    F --> F1[Subjective Ratings]
    F --> F2[Preference Ranking]
    F --> F3[Trustworthiness Perception]
    style A fill:#e3f2fd
```
Task Completion
This is the most critical evaluation dimension: it measures whether the agent successfully completed the target task.

Binary Success Rate:

\[
\text{SR} = \frac{1}{M} \sum_{i=1}^{M} \mathbb{1}\left[\text{task}_i \text{ succeeded}\right]
\]

Where \(M\) is the number of test tasks.

Partial Completion:

For complex tasks, the task can be decomposed into subgoals and a weighted completion ratio computed:

\[
\text{PC} = \sum_{i} w_i \, g_i, \qquad \sum_{i} w_i = 1
\]

Where \(w_i\) is the weight of the \(i\)-th subgoal and \(g_i \in \{0, 1\}\) indicates whether it was completed.
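As a concrete illustration, here is a minimal sketch of weighted subgoal scoring; the subgoal names, weights, and dictionary layout are invented for the example:

```python
# Minimal sketch of weighted partial completion (PC); the subgoal names
# and weights below are illustrative, not from a real benchmark.
def partial_completion(subgoals: dict[str, tuple[float, bool]]) -> float:
    """subgoals maps name -> (weight w_i, completed g_i)."""
    total_weight = sum(w for w, _ in subgoals.values())
    return sum(w * done for w, done in subgoals.values()) / total_weight

score = partial_completion({
    "locate_file": (0.2, True),
    "apply_fix": (0.5, True),
    "tests_pass": (0.3, False),
})
print(score)  # 0.7
```

Normalizing by the total weight keeps the score well-defined even when the weights do not already sum to 1.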
Efficiency
Measures the resources consumed by the agent to complete a task.
Step Efficiency:

\[
E_{\text{step}} = \frac{N_{\text{optimal}}}{N}
\]

Token Efficiency:

\[
C_{\text{token}} = c \cdot n
\]

Where \(c\) is the per-token price, \(n\) is the token count, \(N\) is the total number of steps, and \(N_{\text{optimal}}\) is the number of steps an ideal solution would need. For the same task outcome, higher step efficiency and lower token cost are better.
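Both quantities are easy to compute from a recorded run; this sketch assumes the step counts and prices are already known, and all numbers are invented:

```python
# Minimal sketch of the efficiency metrics above; all values are invented.
def step_efficiency(actual_steps: int, reference_steps: int) -> float:
    """E_step = N_optimal / N (1.0 means the run matched the reference)."""
    return reference_steps / actual_steps

def token_cost(token_count: int, price_per_token: float) -> float:
    """C_token = c * n, in the same currency as the per-token price."""
    return token_count * price_per_token

print(step_efficiency(actual_steps=12, reference_steps=8))   # ≈ 0.67
print(token_cost(token_count=45_000, price_per_token=3e-6))  # 0.135
```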
Safety
Evaluates the agent's safe behavior during task execution (a minimal check sketch follows the list):
- Permission compliance: Whether only authorized tools and operations are used
- Side effect control: Whether unnecessary environment changes are produced
- Sensitive information handling: Whether privacy data is properly protected
- Content safety: Whether generated content is compliant
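A permission-compliance check, for example, can run directly over the recorded trajectory. This is a minimal sketch; the trajectory schema (a list of tool-call dicts) and the tool names are assumptions:

```python
# Minimal sketch of a permission-compliance check; the trajectory schema
# and tool names are illustrative assumptions.
ALLOWED_TOOLS = {"read_file", "search", "run_tests"}

def permission_violations(trajectory: list[dict]) -> list[dict]:
    """Return every tool call that used a tool outside the allow-list."""
    return [
        step for step in trajectory
        if step.get("type") == "tool_call" and step.get("tool") not in ALLOWED_TOOLS
    ]

trajectory = [
    {"type": "tool_call", "tool": "read_file", "args": {"path": "main.py"}},
    {"type": "tool_call", "tool": "delete_file", "args": {"path": "main.py"}},
]
print(permission_violations(trajectory))  # flags the delete_file call
```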
Cost

Total cost is the sum of its components:

\[
C_{\text{total}} = C_{\text{LLM}} + C_{\text{tool}} + C_{\text{compute}} + C_{\text{human}}
\]

Components:
| Cost Type | Description |
|---|---|
| \(C_{\text{LLM}}\) | LLM API call costs |
| \(C_{\text{tool}}\) | External tool/API call costs |
| \(C_{\text{compute}}\) | Compute resources (sandbox, GPU, etc.) |
| \(C_{\text{human}}\) | Human review and intervention labor costs |
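Aggregation is then a straightforward sum; the dollar amounts here are invented for the example:

```python
# Minimal sketch of total-cost aggregation per the formula above;
# all amounts are invented.
costs = {
    "llm": 0.42,      # C_LLM: model API spend
    "tool": 0.05,     # C_tool: external tool/API calls
    "compute": 0.10,  # C_compute: sandbox/GPU runtime
    "human": 1.50,    # C_human: amortized review time
}
total_cost = sum(costs.values())
print(f"C_total = ${total_cost:.2f}")  # C_total = $2.07
```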
User Satisfaction
- Subjective ratings: User satisfaction scores for results (1-5 scale)
- Preference ranking: User preference comparison across multiple agent results (see the win-rate sketch after this list)
- Trustworthiness: Level of user trust in agent output
- Usability: Fluency and intuitiveness of the interaction experience
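Pairwise preferences are commonly reduced to win rates; the vote data in this sketch is invented:

```python
# Minimal sketch of aggregating pairwise preference votes into a win rate;
# the votes are invented for the example.
from collections import Counter

votes = ["A", "B", "A", "A", "tie", "B", "A"]  # per-comparison user picks
counts = Counter(votes)
decisive = counts["A"] + counts["B"]           # ties excluded
print(f"Agent A win rate: {counts['A'] / decisive:.2f}")  # 0.67
```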
Evaluation Method Categories
Offline Evaluation
Evaluation using predefined test sets and standard answers:
- Advantages: Reproducible, low cost, easy to compare
- Disadvantages: Limited coverage, gap from real scenarios
Online Evaluation
Deploying agents in real environments and collecting user feedback:
- Advantages: Most realistic evaluation
- Disadvantages: High cost, difficult to control variables
Simulation-based Evaluation
Building simulated environments to evaluate agents (a sketch of a typical environment interface follows the list):
- Advantages: Controllable, reproducible, supports large-scale testing
- Disadvantages: Gap between simulated and real environments
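A simulated environment is typically exposed to the agent through a small reset/step interface. The Gym-style protocol below is an illustrative assumption, not a prescribed API:

```python
# A minimal, Gym-style interface a simulated evaluation environment might
# expose; method names and signatures are illustrative assumptions.
from typing import Any, Protocol

class SimulatedEnv(Protocol):
    def reset(self, task: dict) -> Any:
        """Initialize the environment for a task; return the first observation."""
        ...

    def step(self, action: dict) -> tuple[Any, bool]:
        """Apply an agent action; return (observation, done)."""
        ...

    def score(self) -> float:
        """Ground-truth task score computed from the final environment state."""
        ...
```

Because the environment owns the ground-truth state, scoring can be exact and every run can start from an identical snapshot, which is what makes this approach reproducible.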
LLM-as-Judge
Using an LLM as the evaluator:
```python
evaluation_prompt = """
Please evaluate the following agent's task completion:

Task objective: {task_description}
Agent execution process: {agent_trajectory}
Final result: {final_output}

Please rate on the following dimensions (1-5):
1. Task completion
2. Execution efficiency
3. Safety
4. Output quality

And provide an overall assessment.
"""
```
Note: LLM-as-Judge has biases (position bias, verbosity bias) and should be used in conjunction with human evaluation.
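Wiring the prompt to a judge model might look like the sketch below; `call_llm` is a hypothetical helper, and the score format the regex expects is an assumption:

```python
import re

def judge(task_description: str, agent_trajectory: str, final_output: str) -> dict[str, int]:
    """Fill the evaluation prompt, query the judge, and parse the 1-5 scores."""
    prompt = evaluation_prompt.format(
        task_description=task_description,
        agent_trajectory=agent_trajectory,
        final_output=final_output,
    )
    response = call_llm(prompt)  # hypothetical LLM client, not a real API
    # Assumes the judge answers with lines like "1. Task completion: 4".
    scores = re.findall(r"^\s*\d\.\s*(.+?):\s*([1-5])", response, flags=re.M)
    return {name.strip(): int(value) for name, value in scores}
```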
Evaluation Frameworks
General Evaluation Framework Design
```python
class AgentEvaluator:
    """Skeleton of a multi-dimensional agent evaluator.

    The setup_environment, eval_*, and aggregate hooks are implemented
    per benchmark.
    """

    def evaluate(self, agent, task_suite):
        results = []
        for task in task_suite:
            # Set up an isolated environment for this task
            env = self.setup_environment(task)
            # Run the agent and record its full trajectory
            trajectory = agent.run(task, env)
            # Score the trajectory along each evaluation dimension
            score = {
                "completion": self.eval_completion(trajectory, task),
                "efficiency": self.eval_efficiency(trajectory),
                "safety": self.eval_safety(trajectory),
                "cost": self.eval_cost(trajectory),
            }
            results.append(score)
        return self.aggregate(results)
```
Trajectory Evaluation
The agent's execution trajectory contains rich evaluation information:

\[
\tau = (s_0, a_0, o_0, s_1, a_1, o_1, \ldots, s_T)
\]

Where \(s_t\) is the state, \(a_t\) is the action, and \(o_t\) is the observation at step \(t\).
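In code, a trajectory can be stored as a plain list of steps mirroring \((s_t, a_t, o_t)\); this representation is an illustrative assumption:

```python
# A trajectory as a plain data structure mirroring (s_t, a_t, o_t);
# the field types are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Step:
    state: dict        # s_t: snapshot of the environment state
    action: dict       # a_t: the agent's chosen action (e.g., a tool call)
    observation: str   # o_t: what the environment returned

Trajectory = list[Step]
```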
Evaluation dimensions:
- Action reasonableness: Whether each step's action is reasonable
- Recovery capability: Ability to recover when encountering errors
- Planning quality: Whether there is a clear execution plan
- Tool usage: Whether tool selection and usage is appropriate
Practical Recommendations
- Multi-dimensional evaluation: Do not focus solely on success rate; consider efficiency, safety, and cost comprehensively
- Layered evaluation: Evaluate progressively from simple to complex tasks
- Ablation studies: Evaluate the contribution of each component (memory, tools, planning)
- Baseline comparison: Compare with non-agent approaches (pure LLM)
- Statistical significance: Run multiple times, report averages and confidence intervals (see the sketch below)
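For the last point, a bootstrap interval over per-run outcomes is a simple option; the run outcomes below are invented:

```python
# Minimal sketch: mean success rate with a bootstrap confidence interval;
# the per-run outcomes are invented for the example.
import random

def bootstrap_ci(outcomes: list[int], n_resamples: int = 10_000, alpha: float = 0.05):
    means = sorted(
        sum(random.choices(outcomes, k=len(outcomes))) / len(outcomes)
        for _ in range(n_resamples)
    )
    lo = means[int(n_resamples * alpha / 2)]
    hi = means[int(n_resamples * (1 - alpha / 2))]
    return sum(outcomes) / len(outcomes), (lo, hi)

runs = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]  # 1 = success, per independent run
mean, (lo, hi) = bootstrap_ci(runs)
print(f"success rate {mean:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```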
Cross-references:
- Benchmark details → Benchmarks
- Human evaluation → Human Evaluation and Alignment
- Reliability → Reliability and Robustness