Benchmarks

Overview

Benchmarks are standardized tools for evaluating AI agent capabilities. With the rapid development of agent technology, numerous benchmarks targeting different capability dimensions have emerged. This section reviews the major agent benchmarks, covering task descriptions, evaluation metrics, current state-of-the-art results, and limitations.

Benchmark Landscape

```mermaid
graph TD
    A[Agent Benchmarks] --> B[General Capabilities]
    A --> C[Code Capabilities]
    A --> D[Web Capabilities]
    A --> E[Tool Use]
    A --> F[Desktop Operations]

    B --> B1[AgentBench]
    B --> B2[GAIA]

    C --> C1[SWE-bench]
    C --> C2[HumanEval]

    D --> D1[WebArena]
    D --> D2[Mind2Web]

    E --> E1[τ-bench]
    E --> E2[ToolBench]

    F --> F1[OSWorld]

    style A fill:#e3f2fd
```

Comprehensive Benchmarks

AgentBench (Liu et al., 2024)

Overview: A multi-domain agent capability evaluation benchmark covering 8 different environments.

| Attribute | Details |
| --- | --- |
| Institution | Tsinghua University |
| Task count | 8 environments, thousands of tasks total |
| Release date | 2023.08 |
| Paper | ICLR 2024 |

Evaluation Environments:

| Environment | Task Type | Evaluation Metric |
| --- | --- | --- |
| OS (Operating System) | Terminal operations | Success rate |
| DB (Database) | SQL queries | Execution accuracy |
| KG (Knowledge Graph) | Knowledge graph reasoning | F1 |
| DCG (Digital Card Game) | Strategy game | Win rate |
| LTP (Lateral Thinking) | Lateral thinking puzzles | Success rate |
| HouseHold | Household environment operations | Success rate |
| WebShop | Online shopping | Reward score |
| WebBrowsing | Web browsing | Success rate |

SOTA Results (selected environments; each environment column uses its own metric from the table above, while Overall is AgentBench's weighted composite score, reported on a different scale):

| Model | OS | DB | KG | Overall |
| --- | --- | --- | --- | --- |
| GPT-4 | 42.4 | 32.0 | 60.0 | 4.01 |
| Claude 3 | 38.5 | 28.7 | 55.2 | 3.72 |
| Llama-2-70B | 8.3 | 3.2 | 20.1 | 1.08 |

Limitations:

  • Some environments are overly simplified and diverge from real-world scenarios
  • Metrics are mostly binary success rates, offering little fine-grained credit for partial progress
  • Significant gap between open-source and closed-source models

GAIA (Mialon et al., 2024)

Overview: An evaluation benchmark for general AI assistants, testing agents' ability to solve problems in real-world scenarios.

| Attribute | Details |
| --- | --- |
| Institution | Meta, Hugging Face |
| Task count | 466 questions |
| Release date | 2023.11 |
| Paper | ICLR 2024 |

Difficulty Levels:

| Level | Description | Steps Required | Human Accuracy | GPT-4 + Plugins |
| --- | --- | --- | --- | --- |
| Level 1 | Easy | 1-3 steps | 92% | 44.6% |
| Level 2 | Medium | 3-5 steps | 86% | 16.0% |
| Level 3 | Hard | 5+ steps | 81% | 0% |

Features:

  • Tasks come from the real world, requiring multi-step reasoning and tool use
  • Answers are deterministic (precisely verifiable; see the scoring sketch below)
  • Human performance far exceeds current AI systems
  • Requires comprehensive use of search, computation, file processing, and other tools
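Because answers are deterministic, scoring reduces to a normalized ("quasi") exact match. Below is a minimal sketch of that idea in Python; the normalization rules are illustrative simplifications, not GAIA's official scorer.

```python
import re

def normalize(ans: str) -> str:
    """Canonicalize an answer string (illustrative rules: lowercase,
    strip $ , % formatting characters, collapse whitespace)."""
    ans = re.sub(r"[,$%]", "", ans.strip().lower())
    return re.sub(r"\s+", " ", ans)

def is_correct(prediction: str, gold: str) -> bool:
    """Quasi-exact match: numbers compare numerically, everything else as strings."""
    p, g = normalize(prediction), normalize(gold)
    try:
        return float(p) == float(g)   # "1,000" matches "1000.0"
    except ValueError:
        return p == g

# Accuracy over (prediction, gold) pairs; 2 of these 3 are correct.
pairs = [("$1,000", "1000"), (" Paris", "paris"), ("7", "eight")]
print(sum(is_correct(p, g) for p, g in pairs) / len(pairs))  # 0.666...
```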

Code Benchmarks

SWE-bench (Jimenez et al., 2024)

Overview: A code repair benchmark based on real GitHub Issues.

| Attribute | Details |
| --- | --- |
| Institution | Princeton |
| Task count | 2294 (full) / 500 (Verified) |
| Release date | 2023.10 |
| Paper | ICLR 2024 |

Task Description:

  • Each task is a real GitHub Issue
  • The agent must understand the issue description, locate code, and write a fix patch
  • Fixes are verified through the project's test suite

Evaluation Metric:

\[ \text{Resolve Rate} = \frac{\text{Number of issues passing all tests}}{\text{Total issues}} \times 100\% \]
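Concretely, each SWE-bench instance carries two test sets: FAIL_TO_PASS (tests that reproduce the issue and must pass after the patch) and PASS_TO_PASS (regression tests that must keep passing). A minimal sketch of the per-instance check, assuming the candidate patch is already applied in `repo_dir`; the pytest invocation is a simplification (some repos use their own runners), and the official harness runs this in per-repo containers with pinned dependencies.

```python
import subprocess

def tests_pass(repo_dir: str, test_ids: list[str]) -> bool:
    """Run the given pytest node IDs inside the checked-out repo."""
    result = subprocess.run(
        ["python", "-m", "pytest", *test_ids],
        cwd=repo_dir, capture_output=True, timeout=1800,
    )
    return result.returncode == 0

def is_resolved(repo_dir: str, instance: dict) -> bool:
    """Resolved means the failing tests now pass AND no regression was introduced."""
    return (tests_pass(repo_dir, instance["FAIL_TO_PASS"])
            and tests_pass(repo_dir, instance["PASS_TO_PASS"]))

# resolve_rate = 100 * (number of resolved instances) / (total instances)
```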

SOTA Results (SWE-bench Verified):

| System | Resolve Rate (%) | Date |
| --- | --- | --- |
| RAG baseline | 2.7 | 2024.01 |
| SWE-Agent | 18.0 | 2024.04 |
| Agentless | 27.3 | 2024.07 |
| AutoCodeRover-v2 | 30.7 | 2024.08 |
| OpenAI Codex | 49.3 | 2025.05 |
| Claude Code | 72.7 | 2025 |

Limitations:

  • Only Python projects (primarily 12 popular repositories)
  • Focused on bug fixes, lacking feature development tasks
  • Inconsistent test case quality
  • Does not evaluate code quality and maintainability

HumanEval (Chen et al., 2021)

Overview: A function-level code generation benchmark.

| Attribute | Details |
| --- | --- |
| Institution | OpenAI |
| Task count | 164 programming problems |
| Evaluation metric | pass@k |

Evaluation Metric:

\[ \text{pass@}k = \mathbb{E}_{\text{problems}} \left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right] \]

Where \(n\) is the total number of generated samples and \(c\) is the number of samples passing tests.
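Computing the binomial coefficients directly overflows for realistic \(n\); the Codex paper evaluates this estimator with a numerically stable product form, as in the sketch below (the sample counts are illustrative).

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator for one problem.
    n: total samples generated, c: samples that passed, k: budget.
    Equals 1 - C(n-c, k) / C(n, k), computed as a stable product."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Benchmark score is the mean over problems, e.g. with n=200 samples each:
per_problem_c = [57, 0, 134]                                   # illustrative pass counts
print(np.mean([pass_at_k(200, c, 1) for c in per_problem_c]))  # 0.318...
```

Note that for k=1 the estimator reduces to c/n, the fraction of passing samples.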

SOTA Results:

| Model | pass@1 (%) |
| --- | --- |
| Codex (2021) | 28.8 |
| GPT-4 (2023) | 67.0 |
| GPT-4o (2024) | 90.2 |
| Claude 3.5 Sonnet (2024) | 92.0 |
| Claude Opus 4 (2025) | 93.0+ |

Limitations:

  • Problems are relatively simple (algorithm problem level)
  • Does not involve real project complexity
  • Does not evaluate multi-file interaction capabilities

Web Benchmarks

WebArena (Zhou et al., 2024)

Overview: A self-hosted real web environment benchmark.

| Attribute | Details |
| --- | --- |
| Institution | CMU |
| Task count | 812 tasks |
| Environment | 4 self-hosted websites |
| Release date | 2023.07 |
| Paper | ICLR 2024 |

Environment Composition:

| Website | Type | Task Example |
| --- | --- | --- |
| OneStopShop | E-commerce | Find the cheapest product |
| Reddit-like | Forum | Post, search, comment |
| GitLab-like | Code hosting | Create Issue, submit PR |
| CMS | Content management | Create pages, manage users |

SOTA Results:

| Method | Success Rate (%) |
| --- | --- |
| GPT-4 (text only) | 14.4 |
| GPT-4V (multimodal) | 16.4 |
| Agent-E | 25.2 |
| Human performance | 78.2 |
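WebArena's success rate is functional: each task ships an evaluator that checks the agent's answer or the final site state rather than the action sequence. Below is a minimal sketch of a string-match evaluator in that spirit; the spec format is illustrative, not WebArena's actual config schema.

```python
def evaluate(answer: str, spec: dict) -> float:
    """Return 1.0 if the agent's answer satisfies the task's matching rule."""
    if spec["match"] == "exact":
        return float(answer.strip() == spec["expected"])
    if spec["match"] == "must_include":
        return float(all(s.lower() in answer.lower() for s in spec["expected"]))
    raise ValueError(f"unknown matcher: {spec['match']}")

# Task: "Find the cheapest product"; both facts must appear in the answer.
spec = {"match": "must_include", "expected": ["$2.59", "OneStopShop"]}
print(evaluate("The cheapest item on OneStopShop costs $2.59.", spec))  # 1.0
```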

Mind2Web (Deng et al., 2023)

Overview: A large-scale web agent dataset on real websites.

  • 2,000+ tasks covering 137 websites
  • 31 domains (travel, shopping, social, etc.)
  • Evaluates element selection and action prediction accuracy (see the scoring sketch below)
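A predicted step is typically scored as correct only when both the selected element and the predicted action (with its value) match the reference, and a task succeeds only if every step does. A minimal sketch of that logic, with illustrative field names:

```python
from dataclasses import dataclass

@dataclass
class Step:
    element_id: str   # which page element to act on
    action: str       # e.g. "CLICK", "TYPE", "SELECT"
    value: str = ""   # typed text / selected option, if any

def step_success(pred: Step, gold: Step) -> bool:
    """A step succeeds only if element selection AND the action (with value) match."""
    return (pred.element_id == gold.element_id
            and pred.action == gold.action
            and pred.value == gold.value)

def task_success(pred_steps: list[Step], gold_steps: list[Step]) -> bool:
    """A task succeeds only if every step in the trajectory is correct."""
    return (len(pred_steps) == len(gold_steps)
            and all(step_success(p, g) for p, g in zip(pred_steps, gold_steps)))
```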

Tool Use Benchmarks

τ-bench (Yao et al., 2024)

Overview: A benchmark for evaluating agents' tool use in simulated user interactions, focused on customer-service domains (retail and airline).

Evaluation Dimensions:

  • Tool selection accuracy (a checking sketch follows this list)
  • Parameter filling correctness
  • Tool call ordering reasonableness
  • Error handling capability
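One common way to operationalize the first two dimensions is to compare each predicted call against a reference call by name and arguments. A minimal sketch; the call schema here is illustrative, not τ-bench's actual format (τ-bench itself also checks the resulting database state).

```python
import json

def call_matches(pred: dict, gold: dict) -> bool:
    """Tool selection: names must match. Parameter filling: arguments must
    match after canonicalization (illustrative; real harnesses often allow
    type coercion or ignore optional arguments)."""
    if pred["name"] != gold["name"]:
        return False
    canon = lambda args: json.dumps(args, sort_keys=True)
    return canon(pred["arguments"]) == canon(gold["arguments"])

pred = {"name": "get_order", "arguments": {"order_id": "W123", "user": "ada"}}
gold = {"name": "get_order", "arguments": {"user": "ada", "order_id": "W123"}}
print(call_matches(pred, gold))  # True: key order does not matter
```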

ToolBench (Qin et al., 2023)

Overview: A large-scale tool use benchmark.

  • 16,000+ real-world APIs
  • 49 categories
  • Supports single-tool and multi-tool scenarios

Desktop Operation Benchmarks

OSWorld (Xie et al., 2024)

Overview: An agent benchmark in desktop operating system environments.

| Attribute | Details |
| --- | --- |
| Environment | Ubuntu, Windows, macOS |
| Task count | 369 tasks |
| Applications | Common desktop apps (Office, browsers, etc.) |

Features:

  • Real operating system environments (virtual machines)
  • Covers file management, application operations, system settings, etc.
  • Multi-step, cross-application tasks scored by execution-based checks of the final system state (see the sketch below)
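Evaluation is execution-based: after the agent finishes, a task-specific script inspects the VM's final state instead of grading the action trace. A minimal sketch of one such check; the task, paths, and function are hypothetical.

```python
from pathlib import Path

def check_rename_task(vm_root: str) -> bool:
    """Hypothetical task: 'rename report.odt to report_final.odt in ~/Documents'.
    Success is judged purely from the resulting file system state."""
    docs = Path(vm_root) / "home" / "user" / "Documents"
    return (docs / "report_final.odt").exists() and not (docs / "report.odt").exists()

print(check_rename_task("/mnt/vm_snapshot"))  # True once the file was renamed
```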

SOTA Results:

| Model | Success Rate (%) |
| --- | --- |
| GPT-4V | 12.2 |
| Claude 3 Opus | 11.8 |
| Gemini Pro | 7.5 |
| Human performance | 72.4 |

Benchmark Comparison Summary

| Benchmark | Domain | Scale | Environment Type | Primary Metric | Human Baseline |
| --- | --- | --- | --- | --- | --- |
| AgentBench | General | 8 environments | Simulated + real | Composite score | - |
| GAIA | General | 466 questions | Real world | Accuracy | 92% (L1) |
| SWE-bench | Code | 2294 tasks | Real repos | Resolve rate | - |
| HumanEval | Code | 164 tasks | Sandbox | pass@k | ~90% |
| WebArena | Web | 812 tasks | Self-hosted | Success rate | 78.2% |
| OSWorld | Desktop | 369 tasks | Virtual machine | Success rate | 72.4% |
| τ-bench | Tools | Various | Simulated | Accuracy | - |

Limitations of Benchmarks

Common Issues

  1. Incomplete task coverage: Cannot cover all real-world scenarios
  2. Evaluation bias: systems easily overfit to a benchmark's specific test patterns
  3. Cost issues: Running complete benchmarks is expensive
  4. Update lag: Agent capabilities improve faster than benchmark updates
  5. Lack of process evaluation: Most only evaluate final results

Improvement Directions

  • Dynamic benchmarks: Task pools that update over time
  • Multi-dimensional evaluation: Comprehensive evaluation of efficiency, safety, and cost
  • Real environments: Closer to real usage scenarios
  • Human alignment: Focus on consistency with human preferences

References

  1. Liu, X., et al. "AgentBench: Evaluating LLMs as Agents." ICLR 2024.
  2. Mialon, G., et al. "GAIA: A Benchmark for General AI Assistants." ICLR 2024.
  3. Jimenez, C. E., et al. "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" ICLR 2024.
  4. Chen, M., et al. "Evaluating Large Language Models Trained on Code." arXiv:2107.03374, 2021.
  5. Zhou, S., et al. "WebArena: A Realistic Web Environment for Building Autonomous Agents." ICLR 2024.
  6. Xie, T., et al. "OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments." NeurIPS 2024.
  7. Deng, X., et al. "Mind2Web: Towards a Generalist Agent for the Web." NeurIPS 2023.
  8. Yao, S., et al. "τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains." arXiv:2406.12045, 2024.
  9. Qin, Y., et al. "ToolLLM: Facilitating Large Language Models to Master 16000+ Real-World APIs." arXiv:2307.16789, 2023.

Cross-references:

  • Code agents → Code Generation Agents
  • Web agents → Web Agents
  • Evaluation methods → Evaluation Methods Overview

