Benchmarks

Overview

Benchmarks are standardized tools for evaluating AI agent capabilities. With the rapid development of agent technology, numerous benchmarks targeting different capability dimensions have emerged. This section reviews the major agent benchmarks, covering task descriptions, evaluation metrics, current state-of-the-art results, and limitations.

Benchmark Landscape

```mermaid
graph TD
    A[Agent Benchmarks] --> B[General Capabilities]
    A --> C[Code Capabilities]
    A --> D[Web Capabilities]
    A --> E[Tool Use]
    A --> F[Desktop Operations]

    B --> B1[AgentBench]
    B --> B2[GAIA]

    C --> C1[SWE-bench]
    C --> C2[HumanEval]

    D --> D1[WebArena]
    D --> D2[Mind2Web]

    E --> E1[τ-bench]
    E --> E2[ToolBench]

    F --> F1[OSWorld]

    style A fill:#e3f2fd
```

Comprehensive Benchmarks

AgentBench (Liu et al., 2024)

Overview: A multi-domain agent capability evaluation benchmark covering 8 different environments.

| Attribute | Details |
| --- | --- |
| Institution | Tsinghua University |
| Task count | 8 environments, thousands of tasks total |
| Release date | 2023.08 |
| Paper | ICLR 2024 |

Evaluation Environments:

| Environment | Task Type | Evaluation Metric |
| --- | --- | --- |
| OS (Operating System) | Terminal operations | Success rate |
| DB (Database) | SQL queries | Execution accuracy |
| KG (Knowledge Graph) | Knowledge graph reasoning | F1 |
| DCG (Digital Card Game) | Strategy game | Win rate |
| LTP (Lateral Thinking) | Lateral thinking puzzles | Success rate |
| HouseHold | Household environment operations | Success rate |
| WebShop | Online shopping | Reward score |
| WebBrowsing | Web browsing | Success rate |

SOTA Results (selected environments; each environment column uses its own metric from the table above, while Overall is AgentBench's weighted composite score, reported on a different scale):

| Model | OS | DB | KG | Overall |
| --- | --- | --- | --- | --- |
| GPT-4 | 42.4 | 32.0 | 60.0 | 4.01 |
| Claude 3 | 38.5 | 28.7 | 55.2 | 3.72 |
| Llama-2-70B | 8.3 | 3.2 | 20.1 | 1.08 |

Limitations:

  • Some environments are overly simplified and diverge from real-world scenarios
  • Metrics are mostly binary success rates, offering little fine-grained credit for partial progress
  • Significant gap between open-source and closed-source models

GAIA (Mialon et al., 2024)

Overview: An evaluation benchmark for general AI assistants, testing agents' ability to solve problems in real-world scenarios.

| Attribute | Details |
| --- | --- |
| Institution | Meta, Hugging Face |
| Task count | 466 questions |
| Release date | 2023.11 |
| Paper | ICLR 2024 |

Difficulty Levels:

| Level | Description | Steps Required | Human Accuracy | GPT-4 + Plugins |
| --- | --- | --- | --- | --- |
| Level 1 | Easy | 1-3 steps | 92% | 44.6% |
| Level 2 | Medium | 3-5 steps | 86% | 16.0% |
| Level 3 | Hard | 5+ steps | 81% | 0% |

Features:

  • Tasks come from the real world, requiring multi-step reasoning and tool use
  • Answers are deterministic (precisely verifiable; see the scoring sketch below)
  • Human performance far exceeds current AI systems
  • Requires comprehensive use of search, computation, file processing, and other tools
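Because answers are deterministic, scoring reduces to a normalized ("quasi") exact match. Below is a minimal sketch of that idea in Python; the normalization rules are illustrative simplifications, not GAIA's official scorer.

```python
import re

def normalize(ans: str) -> str:
    """Canonicalize an answer string (illustrative rules: lowercase,
    strip $ , % formatting characters, collapse whitespace)."""
    ans = re.sub(r"[,$%]", "", ans.strip().lower())
    return re.sub(r"\s+", " ", ans)

def is_correct(prediction: str, gold: str) -> bool:
    """Quasi-exact match: numbers compare numerically, everything else as strings."""
    p, g = normalize(prediction), normalize(gold)
    try:
        return float(p) == float(g)   # "1,000" matches "1000.0"
    except ValueError:
        return p == g

# Accuracy over (prediction, gold) pairs; 2 of these 3 are correct.
pairs = [("$1,000", "1000"), (" Paris", "paris"), ("7", "eight")]
print(sum(is_correct(p, g) for p, g in pairs) / len(pairs))  # 0.666...
```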

Code Benchmarks

SWE-bench (Jimenez et al., 2024)

Overview: A code repair benchmark based on real GitHub Issues.

| Attribute | Details |
| --- | --- |
| Institution | Princeton |
| Task count | 2294 (full) / 500 (Verified) |
| Release date | 2023.10 |
| Paper | ICLR 2024 |

Task Description:

  • Each task is a real GitHub Issue
  • The agent must understand the issue description, locate code, and write a fix patch
  • Fixes are verified through the project's test suite

Evaluation Metric:

\[ \text{Resolve Rate} = \frac{\text{Number of issues passing all tests}}{\text{Total issues}} \times 100\% \]
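Concretely, each SWE-bench instance carries two test sets: FAIL_TO_PASS (tests that reproduce the issue and must pass after the patch) and PASS_TO_PASS (regression tests that must keep passing). A minimal sketch of the per-instance check, assuming the candidate patch is already applied in `repo_dir`; the pytest invocation is a simplification (some repos use their own runners), and the official harness runs this in per-repo containers with pinned dependencies.

```python
import subprocess

def tests_pass(repo_dir: str, test_ids: list[str]) -> bool:
    """Run the given pytest node IDs inside the checked-out repo."""
    result = subprocess.run(
        ["python", "-m", "pytest", *test_ids],
        cwd=repo_dir, capture_output=True, timeout=1800,
    )
    return result.returncode == 0

def is_resolved(repo_dir: str, instance: dict) -> bool:
    """Resolved means the failing tests now pass AND no regression was introduced."""
    return (tests_pass(repo_dir, instance["FAIL_TO_PASS"])
            and tests_pass(repo_dir, instance["PASS_TO_PASS"]))

# resolve_rate = 100 * (number of resolved instances) / (total instances)
```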

SOTA Results (SWE-bench Verified):

| System | Resolve Rate (%) | Date |
| --- | --- | --- |
| RAG baseline | 2.7 | 2024.01 |
| SWE-Agent | 18.0 | 2024.04 |
| Agentless | 27.3 | 2024.07 |
| AutoCodeRover-v2 | 30.7 | 2024.08 |
| OpenAI Codex | 49.3 | 2025.05 |
| Claude Code | 72.7 | 2025 |

Limitations:

  • Only Python projects (primarily 12 popular repositories)
  • Focused on bug fixes, lacking feature development tasks
  • Inconsistent test case quality
  • Does not evaluate code quality and maintainability

HumanEval (Chen et al., 2021)

Overview: A function-level code generation benchmark.

| Attribute | Details |
| --- | --- |
| Institution | OpenAI |
| Task count | 164 programming problems |
| Evaluation metric | pass@k |

Evaluation Metric:

\[ \text{pass@}k = \mathbb{E}_{\text{problems}} \left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right] \]

Where \(n\) is the total number of generated samples and \(c\) is the number of samples passing tests.
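Computing the binomial coefficients directly overflows for realistic \(n\); the Codex paper evaluates this estimator with a numerically stable product form, as in the sketch below (the sample counts are illustrative).

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator for one problem.
    n: total samples generated, c: samples that passed, k: budget.
    Equals 1 - C(n-c, k) / C(n, k), computed as a stable product."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Benchmark score is the mean over problems, e.g. with n=200 samples each:
per_problem_c = [57, 0, 134]                                   # illustrative pass counts
print(np.mean([pass_at_k(200, c, 1) for c in per_problem_c]))  # 0.318...
```

Note that for k=1 the estimator reduces to c/n, the fraction of passing samples.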

SOTA Results:

| Model | pass@1 (%) |
| --- | --- |
| Codex (2021) | 28.8 |
| GPT-4 (2023) | 67.0 |
| GPT-4o (2024) | 90.2 |
| Claude 3.5 Sonnet (2024) | 92.0 |
| Claude Opus 4 (2025) | 93.0+ |

Limitations:

  • Problems are relatively simple (algorithm problem level)
  • Does not involve real project complexity
  • Does not evaluate multi-file interaction capabilities

Web Benchmarks

WebArena (Zhou et al., 2024)

Overview: A self-hosted real web environment benchmark.

| Attribute | Details |
| --- | --- |
| Institution | CMU |
| Task count | 812 tasks |
| Environment | 4 self-hosted websites |
| Release date | 2023.07 |
| Paper | ICLR 2024 |

Environment Composition:

| Website | Type | Task Example |
| --- | --- | --- |
| OneStopShop | E-commerce | Find the cheapest product |
| Reddit-like | Forum | Post, search, comment |
| GitLab-like | Code hosting | Create Issue, submit PR |
| CMS | Content management | Create pages, manage users |

SOTA Results:

| Method | Success Rate (%) |
| --- | --- |
| GPT-4 (text only) | 14.4 |
| GPT-4V (multimodal) | 16.4 |
| Agent-E | 25.2 |
| Human performance | 78.2 |
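WebArena's success rate is functional: each task ships an evaluator that checks the agent's answer or the final site state rather than the action sequence. Below is a minimal sketch of a string-match evaluator in that spirit; the spec format is illustrative, not WebArena's actual config schema.

```python
def evaluate(answer: str, spec: dict) -> float:
    """Return 1.0 if the agent's answer satisfies the task's matching rule."""
    if spec["match"] == "exact":
        return float(answer.strip() == spec["expected"])
    if spec["match"] == "must_include":
        return float(all(s.lower() in answer.lower() for s in spec["expected"]))
    raise ValueError(f"unknown matcher: {spec['match']}")

# Task: "Find the cheapest product"; both facts must appear in the answer.
spec = {"match": "must_include", "expected": ["$2.59", "OneStopShop"]}
print(evaluate("The cheapest item on OneStopShop costs $2.59.", spec))  # 1.0
```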

Mind2Web (Deng et al., 2023)

Overview: A large-scale web agent dataset on real websites.

  • 2,000+ tasks covering 137 websites
  • 31 domains (travel, shopping, social, etc.)
  • Evaluates element selection and action prediction accuracy (see the scoring sketch below)
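A predicted step is typically scored as correct only when both the selected element and the predicted action (with its value) match the reference, and a task succeeds only if every step does. A minimal sketch of that logic, with illustrative field names:

```python
from dataclasses import dataclass

@dataclass
class Step:
    element_id: str   # which page element to act on
    action: str       # e.g. "CLICK", "TYPE", "SELECT"
    value: str = ""   # typed text / selected option, if any

def step_success(pred: Step, gold: Step) -> bool:
    """A step succeeds only if element selection AND the action (with value) match."""
    return (pred.element_id == gold.element_id
            and pred.action == gold.action
            and pred.value == gold.value)

def task_success(pred_steps: list[Step], gold_steps: list[Step]) -> bool:
    """A task succeeds only if every step in the trajectory is correct."""
    return (len(pred_steps) == len(gold_steps)
            and all(step_success(p, g) for p, g in zip(pred_steps, gold_steps)))
```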

Tool Use Benchmarks

τ-bench (Yao et al., 2024)

Overview: A benchmark for evaluating agents' tool use in simulated user interactions, focused on customer-service domains (retail and airline).

Evaluation Dimensions:

  • Tool selection accuracy (a checking sketch follows this list)
  • Parameter filling correctness
  • Tool call ordering reasonableness
  • Error handling capability
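One common way to operationalize the first two dimensions is to compare each predicted call against a reference call by name and arguments. A minimal sketch; the call schema here is illustrative, not τ-bench's actual format (τ-bench itself also checks the resulting database state).

```python
import json

def call_matches(pred: dict, gold: dict) -> bool:
    """Tool selection: names must match. Parameter filling: arguments must
    match after canonicalization (illustrative; real harnesses often allow
    type coercion or ignore optional arguments)."""
    if pred["name"] != gold["name"]:
        return False
    canon = lambda args: json.dumps(args, sort_keys=True)
    return canon(pred["arguments"]) == canon(gold["arguments"])

pred = {"name": "get_order", "arguments": {"order_id": "W123", "user": "ada"}}
gold = {"name": "get_order", "arguments": {"user": "ada", "order_id": "W123"}}
print(call_matches(pred, gold))  # True: key order does not matter
```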

ToolBench (Qin et al., 2023)

Overview: A large-scale tool use benchmark.

  • 16,000+ real-world APIs
  • 49 categories
  • Supports single-tool and multi-tool scenarios

Desktop Operation Benchmarks

OSWorld (Xie et al., 2024)

Overview: An agent benchmark in desktop operating system environments.

| Attribute | Details |
| --- | --- |
| Environment | Ubuntu, Windows, macOS |
| Task count | 369 tasks |
| Applications | Common desktop apps (Office, browsers, etc.) |

Features:

  • Real operating system environments (virtual machines)
  • Covers file management, application operations, system settings, etc.
  • Multi-step, cross-application tasks scored by execution-based checks of the final system state (see the sketch below)
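Evaluation is execution-based: after the agent finishes, a task-specific script inspects the VM's final state instead of grading the action trace. A minimal sketch of one such check; the task, paths, and function are hypothetical.

```python
from pathlib import Path

def check_rename_task(vm_root: str) -> bool:
    """Hypothetical task: 'rename report.odt to report_final.odt in ~/Documents'.
    Success is judged purely from the resulting file system state."""
    docs = Path(vm_root) / "home" / "user" / "Documents"
    return (docs / "report_final.odt").exists() and not (docs / "report.odt").exists()

print(check_rename_task("/mnt/vm_snapshot"))  # True once the file was renamed
```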

SOTA Results:

| Model | Success Rate (%) |
| --- | --- |
| GPT-4V | 12.2 |
| Claude 3 Opus | 11.8 |
| Gemini Pro | 7.5 |
| Human performance | 72.4 |

Benchmark Comparison Summary

| Benchmark | Domain | Scale | Environment Type | Primary Metric | Human Baseline |
| --- | --- | --- | --- | --- | --- |
| AgentBench | General | 8 environments | Simulated + real | Composite score | - |
| GAIA | General | 466 questions | Real world | Accuracy | 92% (L1) |
| SWE-bench | Code | 2294 tasks | Real repos | Resolve rate | - |
| HumanEval | Code | 164 tasks | Sandbox | pass@k | ~90% |
| WebArena | Web | 812 tasks | Self-hosted | Success rate | 78.2% |
| OSWorld | Desktop | 369 tasks | Virtual machine | Success rate | 72.4% |
| τ-bench | Tools | Various | Simulated | Accuracy | - |

Limitations of Benchmarks

Common Issues

  1. Incomplete task coverage: Cannot cover all real-world scenarios
  2. Evaluation bias: systems easily overfit to a benchmark's specific test patterns
  3. Cost issues: Running complete benchmarks is expensive
  4. Update lag: Agent capabilities improve faster than benchmark updates
  5. Lack of process evaluation: Most only evaluate final results

Improvement Directions

  • Dynamic benchmarks: Task pools that update over time
  • Multi-dimensional evaluation: Comprehensive evaluation of efficiency, safety, and cost
  • Real environments: Closer to real usage scenarios
  • Human alignment: Focus on consistency with human preferences

References

  1. Liu, X., et al. "AgentBench: Evaluating LLMs as Agents." ICLR 2024.
  2. Mialon, G., et al. "GAIA: A Benchmark for General AI Assistants." ICLR 2024.
  3. Jimenez, C. E., et al. "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" ICLR 2024.
  4. Chen, M., et al. "Evaluating Large Language Models Trained on Code." arXiv:2107.03374, 2021.
  5. Zhou, S., et al. "WebArena: A Realistic Web Environment for Building Autonomous Agents." ICLR 2024.
  6. Xie, T., et al. "OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments." NeurIPS 2024.
  7. Deng, X., et al. "Mind2Web: Towards a Generalist Agent for the Web." NeurIPS 2023.
  8. Yao, S., et al. "τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains." arXiv:2406.12045, 2024.
  9. Qin, Y., et al. "ToolLLM: Facilitating Large Language Models to Master 16000+ Real-World APIs." arXiv:2307.16789, 2023.

Cross-references:

  • Code agents → Code Generation Agents
  • Web agents → Web Agents
  • Evaluation methods → Evaluation Methods Overview

