Evaluation Methods Overview
Overview
Evaluating AI agents is fundamentally different from evaluating traditional LLMs. Traditional LLM evaluation focuses on the quality of single input-output pairs, whereas agent evaluation must account for multi-step interactions, tool usage, operations on the environment, and other complicating factors. Evaluating agent capabilities comprehensively and accurately is therefore a key question for advancing agent technology.
Agent Evaluation vs. LLM Evaluation
| Dimension | LLM Evaluation | Agent Evaluation |
|---|---|---|
| Input/Output | Single prompt → response | Multi-turn interaction sequences |
| Evaluation target | Text quality | Task completion |
| Environment dependency | No external environment | Depends on tools and environment |
| Non-determinism | Relatively low | High (multiple paths, multiple strategies) |
| Evaluation cost | Relatively low | Higher (environment setup, runtime) |
| Intermediate process | Not considered | Must be evaluated (efficiency, safety) |
Evaluation Dimensions
```mermaid
graph TD
    A[Agent Evaluation Dimensions] --> B[Task Completion]
    A --> C[Efficiency]
    A --> D[Safety]
    A --> E[Cost]
    A --> F[User Satisfaction]
    B --> B1[Success Rate]
    B --> B2[Partial Completion]
    B --> B3[Correctness]
    C --> C1[Number of Steps]
    C --> C2[Token Consumption]
    C --> C3[Time Consumption]
    D --> D1[Safety Violations]
    D --> D2[Permission Usage Reasonableness]
    D --> D3[Data Protection]
    E --> E1[API Call Cost]
    E --> E2[Compute Resources]
    E --> E3[Human Intervention Cost]
    F --> F1[Subjective Ratings]
    F --> F2[Preference Ranking]
    F --> F3[Trustworthiness Perception]
    style A fill:#e3f2fd
```
Task Completion
This is the most critical evaluation dimension: it measures whether the agent successfully completed the target task.

Binary Success Rate:

\[
\text{SR} = \frac{1}{M} \sum_{i=1}^{M} \mathbb{1}\left[\text{task}_i \text{ succeeded}\right]
\]

Where \(M\) is the number of test tasks.

Partial Completion:

For complex tasks, the task can be decomposed into subgoals and a weighted completion ratio computed:

\[
\text{PC} = \sum_{i} w_i \, g_i, \qquad \sum_{i} w_i = 1
\]

Where \(w_i\) is the weight of the \(i\)-th subgoal and \(g_i \in \{0, 1\}\) indicates whether it was completed.
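As a concrete illustration, here is a minimal sketch of weighted subgoal scoring; the subgoal names, weights, and dictionary layout are invented for the example:

```python
# Minimal sketch of weighted partial completion (PC); the subgoal names
# and weights below are illustrative, not from a real benchmark.
def partial_completion(subgoals: dict[str, tuple[float, bool]]) -> float:
    """subgoals maps name -> (weight w_i, completed g_i)."""
    total_weight = sum(w for w, _ in subgoals.values())
    return sum(w * done for w, done in subgoals.values()) / total_weight

score = partial_completion({
    "locate_file": (0.2, True),
    "apply_fix": (0.5, True),
    "tests_pass": (0.3, False),
})
print(score)  # 0.7
```

Normalizing by the total weight keeps the score well-defined even when the weights do not already sum to 1.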
Efficiency
Measures the resources consumed by the agent to complete a task.
Step Efficiency:

\[
E_{\text{step}} = \frac{N_{\text{optimal}}}{N}
\]

Token Efficiency:

\[
C_{\text{token}} = c \cdot n
\]

Where \(c\) is the per-token price, \(n\) is the token count, \(N\) is the total number of steps, and \(N_{\text{optimal}}\) is the number of steps an ideal solution would need. For the same task outcome, higher step efficiency and lower token cost are better.
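Both quantities are easy to compute from a recorded run; this sketch assumes the step counts and prices are already known, and all numbers are invented:

```python
# Minimal sketch of the efficiency metrics above; all values are invented.
def step_efficiency(actual_steps: int, reference_steps: int) -> float:
    """E_step = N_optimal / N (1.0 means the run matched the reference)."""
    return reference_steps / actual_steps

def token_cost(token_count: int, price_per_token: float) -> float:
    """C_token = c * n, in the same currency as the per-token price."""
    return token_count * price_per_token

print(step_efficiency(actual_steps=12, reference_steps=8))   # ≈ 0.67
print(token_cost(token_count=45_000, price_per_token=3e-6))  # 0.135
```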
Safety
Evaluates the agent's safe behavior during task execution (a minimal check sketch follows the list):
- Permission compliance: Whether only authorized tools and operations are used
- Side effect control: Whether unnecessary environment changes are produced
- Sensitive information handling: Whether privacy data is properly protected
- Content safety: Whether generated content is compliant
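A permission-compliance check, for example, can run directly over the recorded trajectory. This is a minimal sketch; the trajectory schema (a list of tool-call dicts) and the tool names are assumptions:

```python
# Minimal sketch of a permission-compliance check; the trajectory schema
# and tool names are illustrative assumptions.
ALLOWED_TOOLS = {"read_file", "search", "run_tests"}

def permission_violations(trajectory: list[dict]) -> list[dict]:
    """Return every tool call that used a tool outside the allow-list."""
    return [
        step for step in trajectory
        if step.get("type") == "tool_call" and step.get("tool") not in ALLOWED_TOOLS
    ]

trajectory = [
    {"type": "tool_call", "tool": "read_file", "args": {"path": "main.py"}},
    {"type": "tool_call", "tool": "delete_file", "args": {"path": "main.py"}},
]
print(permission_violations(trajectory))  # flags the delete_file call
```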
Cost

Total cost is the sum of its components:

\[
C_{\text{total}} = C_{\text{LLM}} + C_{\text{tool}} + C_{\text{compute}} + C_{\text{human}}
\]

Components:
| Cost Type | Description |
|---|---|
| \(C_{\text{LLM}}\) | LLM API call costs |
| \(C_{\text{tool}}\) | External tool/API call costs |
| \(C_{\text{compute}}\) | Compute resources (sandbox, GPU, etc.) |
| \(C_{\text{human}}\) | Human review and intervention labor costs |
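Aggregation is then a straightforward sum; the dollar amounts here are invented for the example:

```python
# Minimal sketch of total-cost aggregation per the formula above;
# all amounts are invented.
costs = {
    "llm": 0.42,      # C_LLM: model API spend
    "tool": 0.05,     # C_tool: external tool/API calls
    "compute": 0.10,  # C_compute: sandbox/GPU runtime
    "human": 1.50,    # C_human: amortized review time
}
total_cost = sum(costs.values())
print(f"C_total = ${total_cost:.2f}")  # C_total = $2.07
```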
User Satisfaction
- Subjective ratings: User satisfaction scores for results (1-5 scale)
- Preference ranking: User preference comparison across multiple agent results (see the win-rate sketch after this list)
- Trustworthiness: Level of user trust in agent output
- Usability: Fluency and intuitiveness of the interaction experience
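Pairwise preferences are commonly reduced to win rates; the vote data in this sketch is invented:

```python
# Minimal sketch of aggregating pairwise preference votes into a win rate;
# the votes are invented for the example.
from collections import Counter

votes = ["A", "B", "A", "A", "tie", "B", "A"]  # per-comparison user picks
counts = Counter(votes)
decisive = counts["A"] + counts["B"]           # ties excluded
print(f"Agent A win rate: {counts['A'] / decisive:.2f}")  # 0.67
```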
Evaluation Method Categories
Offline Evaluation
Evaluation using predefined test sets and standard answers:
- Advantages: Reproducible, low cost, easy to compare
- Disadvantages: Limited coverage, gap from real scenarios
Online Evaluation
Deploying agents in real environments and collecting user feedback:
- Advantages: Most realistic evaluation
- Disadvantages: High cost, difficult to control variables
Simulation-based Evaluation
Building simulated environments to evaluate agents (a sketch of a typical environment interface follows the list):
- Advantages: Controllable, reproducible, supports large-scale testing
- Disadvantages: Gap between simulated and real environments
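A simulated environment is typically exposed to the agent through a small reset/step interface. The Gym-style protocol below is an illustrative assumption, not a prescribed API:

```python
# A minimal, Gym-style interface a simulated evaluation environment might
# expose; method names and signatures are illustrative assumptions.
from typing import Any, Protocol

class SimulatedEnv(Protocol):
    def reset(self, task: dict) -> Any:
        """Initialize the environment for a task; return the first observation."""
        ...

    def step(self, action: dict) -> tuple[Any, bool]:
        """Apply an agent action; return (observation, done)."""
        ...

    def score(self) -> float:
        """Ground-truth task score computed from the final environment state."""
        ...
```

Because the environment owns the ground-truth state, scoring can be exact and every run can start from an identical snapshot, which is what makes this approach reproducible.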
LLM-as-Judge
Using an LLM as the evaluator:
```python
evaluation_prompt = """
Please evaluate the following agent's task completion:

Task objective: {task_description}
Agent execution process: {agent_trajectory}
Final result: {final_output}

Please rate on the following dimensions (1-5):
1. Task completion
2. Execution efficiency
3. Safety
4. Output quality

And provide an overall assessment.
"""
```
Note: LLM-as-Judge has biases (position bias, verbosity bias) and should be used in conjunction with human evaluation.
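Wiring the prompt to a judge model might look like the sketch below; `call_llm` is a hypothetical helper, and the score format the regex expects is an assumption:

```python
import re

def judge(task_description: str, agent_trajectory: str, final_output: str) -> dict[str, int]:
    """Fill the evaluation prompt, query the judge, and parse the 1-5 scores."""
    prompt = evaluation_prompt.format(
        task_description=task_description,
        agent_trajectory=agent_trajectory,
        final_output=final_output,
    )
    response = call_llm(prompt)  # hypothetical LLM client, not a real API
    # Assumes the judge answers with lines like "1. Task completion: 4".
    scores = re.findall(r"^\s*\d\.\s*(.+?):\s*([1-5])", response, flags=re.M)
    return {name.strip(): int(value) for name, value in scores}
```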
Evaluation Frameworks
General Evaluation Framework Design
```python
class AgentEvaluator:
    """Skeleton of a multi-dimensional agent evaluator.

    The setup_environment, eval_*, and aggregate hooks are implemented
    per benchmark.
    """

    def evaluate(self, agent, task_suite):
        results = []
        for task in task_suite:
            # Set up an isolated environment for this task
            env = self.setup_environment(task)
            # Run the agent and record its full trajectory
            trajectory = agent.run(task, env)
            # Score the trajectory along each evaluation dimension
            score = {
                "completion": self.eval_completion(trajectory, task),
                "efficiency": self.eval_efficiency(trajectory),
                "safety": self.eval_safety(trajectory),
                "cost": self.eval_cost(trajectory),
            }
            results.append(score)
        return self.aggregate(results)
```
Trajectory Evaluation
The agent's execution trajectory contains rich evaluation information:

\[
\tau = (s_0, a_0, o_0, s_1, a_1, o_1, \ldots, s_T)
\]

Where \(s_t\) is the state, \(a_t\) is the action, and \(o_t\) is the observation at step \(t\).
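In code, a trajectory can be stored as a plain list of steps mirroring \((s_t, a_t, o_t)\); this representation is an illustrative assumption:

```python
# A trajectory as a plain data structure mirroring (s_t, a_t, o_t);
# the field types are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Step:
    state: dict        # s_t: snapshot of the environment state
    action: dict       # a_t: the agent's chosen action (e.g., a tool call)
    observation: str   # o_t: what the environment returned

Trajectory = list[Step]
```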
Evaluation dimensions:
- Action reasonableness: Whether each step's action is reasonable
- Recovery capability: Ability to recover when encountering errors
- Planning quality: Whether there is a clear execution plan
- Tool usage: Whether tool selection and usage is appropriate
Practical Recommendations
- Multi-dimensional evaluation: Do not focus solely on success rate; consider efficiency, safety, and cost comprehensively
- Layered evaluation: Evaluate progressively from simple to complex tasks
- Ablation studies: Evaluate the contribution of each component (memory, tools, planning)
- Baseline comparison: Compare with non-agent approaches (pure LLM)
- Statistical significance: Run multiple times, report averages and confidence intervals (see the sketch below)
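For the last point, a bootstrap interval over per-run outcomes is a simple option; the run outcomes below are invented:

```python
# Minimal sketch: mean success rate with a bootstrap confidence interval;
# the per-run outcomes are invented for the example.
import random

def bootstrap_ci(outcomes: list[int], n_resamples: int = 10_000, alpha: float = 0.05):
    means = sorted(
        sum(random.choices(outcomes, k=len(outcomes))) / len(outcomes)
        for _ in range(n_resamples)
    )
    lo = means[int(n_resamples * alpha / 2)]
    hi = means[int(n_resamples * (1 - alpha / 2))]
    return sum(outcomes) / len(outcomes), (lo, hi)

runs = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]  # 1 = success, per independent run
mean, (lo, hi) = bootstrap_ci(runs)
print(f"success rate {mean:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```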
Cross-references:
- Benchmark details → Benchmarks
- Human evaluation → Human Evaluation and Alignment
- Reliability → Reliability and Robustness