Benchmarks
Overview
Benchmarks are standardized tools for evaluating AI Agent capabilities. With the rapid development of agent technology, numerous benchmarks targeting different capability dimensions have emerged. This section comprehensively reviews the major agent benchmarks, including task descriptions, evaluation metrics, current state-of-the-art results, and limitations.
Benchmark Landscape
```mermaid
graph TD
    A[Agent Benchmarks] --> B[General Capabilities]
    A --> C[Code Capabilities]
    A --> D[Web Capabilities]
    A --> E[Tool Use]
    A --> F[Desktop Operations]
    B --> B1[AgentBench]
    B --> B2[GAIA]
    C --> C1[SWE-bench]
    C --> C2[HumanEval]
    D --> D1[WebArena]
    D --> D2[Mind2Web]
    E --> E1[τ-bench]
    E --> E2[ToolBench]
    F --> F1[OSWorld]
    style A fill:#e3f2fd
```
Comprehensive Benchmarks
AgentBench (Liu et al., 2024)
Overview: A multi-domain agent capability evaluation benchmark covering 8 different environments.
| Attribute | Details |
|---|---|
| Institution | Tsinghua University |
| Task count | 8 environments, thousands of tasks total |
| Release date | 2023.08 |
| Paper | ICLR 2024 |
Evaluation Environments:
| Environment | Task Type | Evaluation Metric |
|---|---|---|
| OS (Operating System) | Terminal operations | Success rate |
| DB (Database) | SQL queries | Execution accuracy |
| KG (Knowledge Graph) | Knowledge graph reasoning | F1 |
| DCG (Digital Card Game) | Strategy game | Win rate |
| LTP (Lateral Thinking) | Lateral thinking puzzles | Success rate |
| HouseHold | Household environment operations | Success rate |
| WebShop | Online shopping | Reward score |
| WebBrowsing | Web browsing | Success rate |
SOTA Results (selected environments):
| Model | OS | DB | KG | Overall (composite score) |
|---|---|---|---|---|
| GPT-4 | 42.4 | 32.0 | 60.0 | 4.01 |
| Claude 3 | 38.5 | 28.7 | 55.2 | 3.72 |
| Llama-2-70B | 8.3 | 3.2 | 20.1 | 1.08 |
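The Overall column is a composite score rather than a percentage. As a rough illustration of how such a composite can be formed, the sketch below normalizes each environment's raw score by a reference value before averaging; the reference values are made up, and the actual weighting used in the AgentBench paper may differ.

```python
# Illustrative only: form a composite score by normalizing each environment's
# raw score against a reference value, then averaging. The reference values
# below are made up; AgentBench's actual weighting may differ.

def composite_score(env_scores: dict[str, float], reference: dict[str, float]) -> float:
    normalized = [env_scores[env] / reference[env] for env in env_scores]
    return sum(normalized) / len(normalized)

gpt4 = {"OS": 42.4, "DB": 32.0, "KG": 60.0}        # raw per-environment scores
reference = {"OS": 20.0, "DB": 15.0, "KG": 35.0}   # made-up normalization constants
print(round(composite_score(gpt4, reference), 2))  # composite on an arbitrary scale
```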
Limitations:
- Some environments are overly simplified, with gaps from real scenarios
- Evaluation metrics are primarily binary success rate, lacking fine-grained evaluation
- Significant gap between open-source and closed-source models
GAIA (Mialon et al., 2024)
Overview: An evaluation benchmark for general AI assistants, testing agents' ability to solve problems in real-world scenarios.
| Attribute | Details |
|---|---|
| Institution | Meta, HuggingFace |
| Task count | 466 questions |
| Release date | 2023.11 |
| Paper | ICLR 2024 |
Difficulty Levels:
| Level | Description | Steps Required | Human Accuracy | GPT-4 + Plugins |
|---|---|---|---|---|
| Level 1 | Easy | 1-3 steps | 92% | 44.6% |
| Level 2 | Medium | 3-5 steps | 86% | 16.0% |
| Level 3 | Hard | 5+ steps | 81% | 0% |
Features:
- Tasks come from the real world, requiring multi-step reasoning and tool use
- Answers are deterministic and precisely verifiable (see the scoring sketch after this list)
- Human performance far exceeds current AI systems
- Requires comprehensive use of search, computation, file processing, and other tools
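Because every answer is a single verifiable string or number, scoring reduces to normalized exact match. The sketch below illustrates that idea; the normalization rules (lowercasing, stripping commas, numeric tolerance) are assumptions for illustration, not GAIA's official scorer.

```python
# Minimal sketch of exact-match scoring over deterministic answers.
# Normalization rules here are illustrative, not GAIA's official scorer.

def normalize(answer: str) -> str:
    return answer.strip().lower().replace(",", "")

def is_correct(prediction: str, gold: str) -> bool:
    p, g = normalize(prediction), normalize(gold)
    try:
        # Compare numerically when both sides parse as numbers.
        return abs(float(p) - float(g)) < 1e-6
    except ValueError:
        return p == g

def accuracy(predictions: list[str], golds: list[str]) -> float:
    correct = sum(is_correct(p, g) for p, g in zip(predictions, golds))
    return correct / len(golds)
```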
Code Benchmarks
SWE-bench (Jimenez et al., 2024)
Overview: A code repair benchmark based on real GitHub Issues.
| Attribute | Details |
|---|---|
| Institution | Princeton |
| Task count | 2294 (full) / 500 (Verified) |
| Release date | 2023.10 |
| Paper | ICLR 2024 |
Task Description:
- Each task is a real GitHub Issue
- The agent must understand the issue description, locate code, and write a fix patch
- Fixes are verified through the project's test suite
Evaluation Metric: a task instance counts as resolved only if the generated patch applies cleanly and the project's test suite passes afterwards. The headline number is the resolve rate:

\[ \text{Resolve Rate} = \frac{\#\,\text{resolved instances}}{\#\,\text{total instances}} \times 100\% \]
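A minimal sketch of the aggregation, assuming per-instance test outcomes are already available; the field names below are illustrative and not the official harness schema.

```python
# Minimal sketch: decide "resolved" per instance, then aggregate.
# Field names are illustrative, not the official SWE-bench harness schema.
from dataclasses import dataclass

@dataclass
class InstanceResult:
    patch_applied: bool
    fail_to_pass_ok: bool   # previously failing tests now pass
    pass_to_pass_ok: bool   # previously passing tests still pass

def resolved(r: InstanceResult) -> bool:
    return r.patch_applied and r.fail_to_pass_ok and r.pass_to_pass_ok

def resolve_rate(results: list[InstanceResult]) -> float:
    return 100.0 * sum(resolved(r) for r in results) / len(results)
```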
SOTA Results (SWE-bench Verified):
| System | Resolve Rate (%) | Date |
|---|---|---|
| RAG baseline | 2.7 | 2024.01 |
| SWE-Agent | 18.0 | 2024.04 |
| Agentless | 27.3 | 2024.07 |
| AutoCodeRover-v2 | 30.7 | 2024.08 |
| OpenAI Codex | 49.3 | 2025.05 |
| Claude Code | 72.7 | 2025 |
Limitations:
- Python projects only (drawn from 12 popular repositories)
- Focused on bug fixes, lacking feature development tasks
- Inconsistent test case quality
- Does not evaluate code quality and maintainability
HumanEval (Chen et al., 2021)
Overview: A function-level code generation benchmark.
| Attribute | Details |
|---|---|
| Institution | OpenAI |
| Task count | 164 programming problems |
| Evaluation metric | pass@k |
Evaluation Metric: the unbiased pass@k estimator from the HumanEval paper:

\[ \text{pass@}k = \mathbb{E}_{\text{problems}}\!\left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right] \]

where \(n\) is the total number of generated samples per problem and \(c\) is the number of samples passing the tests.
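Chen et al. (2021) also give a numerically stable way to compute this estimator; the sketch below follows that formulation.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem:
    n = total samples generated, c = samples that pass the tests."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples per problem, 12 pass -> estimate pass@10.
print(round(pass_at_k(200, 12, 10), 3))
```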
SOTA Results:
| Model | pass@1 (%) |
|---|---|
| Codex (2021) | 28.8 |
| GPT-4 (2023) | 67.0 |
| Claude 3.5 Sonnet | 92.0 |
| GPT-4o (2024) | 90.2 |
| Claude Opus 4 | 93.0+ |
Limitations:
- Problems are relatively simple (algorithm problem level)
- Does not involve real project complexity
- Does not evaluate multi-file interaction capabilities
Web Benchmarks
WebArena (Zhou et al., 2024)
Overview: A self-hosted real web environment benchmark.
| Attribute | Details |
|---|---|
| Institution | CMU |
| Task count | 812 tasks |
| Environment | 4 self-hosted websites |
| Release date | 2024 |
Environment Composition:
| Website | Type | Task Example |
|---|---|---|
| OneStopShop | E-commerce | Find the cheapest product |
| Reddit-like | Forum | Post, search, comment |
| GitLab-like | Code hosting | Create Issue, submit PR |
| CMS | Content management | Create pages, manage users |
SOTA Results:
| Method | Success Rate (%) |
|---|---|
| GPT-4 (text only) | 14.4 |
| GPT-4V (multimodal) | 16.4 |
| Agent-E | 25.2 |
| Human performance | 78.2 |
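Task success in WebArena is decided by programmatic evaluators attached to each task, for example checking the agent's final answer or final page state against an expected value. The sketch below illustrates this outcome-based style of checking; the config fields and matcher names are simplified stand-ins rather than the repository's exact schema.

```python
# Minimal sketch of functional (outcome-based) task evaluation.
# Config fields and matcher names are illustrative, not WebArena's exact schema.

def evaluate_task(task: dict, final_answer: str, final_url: str) -> bool:
    spec = task["eval"]
    if spec["type"] == "string_match":
        return spec["expected"].lower() in final_answer.lower()
    if spec["type"] == "url_match":
        return final_url.rstrip("/") == spec["expected"].rstrip("/")
    raise ValueError(f"unknown evaluator: {spec['type']}")

task = {
    "intent": "Find the cheapest product in the Beauty category",
    "eval": {"type": "string_match", "expected": "$3.99"},  # illustrative values
}
print(evaluate_task(task, "The cheapest item costs $3.99.", "http://shop.local/beauty"))
```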
Mind2Web (Deng et al., 2023)
Overview: A large-scale web agent dataset on real websites.
- 2000+ tasks covering 137 websites
- 31 domains (travel, shopping, social, etc.)
- Evaluates element selection and action prediction accuracy
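A step is usually counted as correct only when both the selected element and the predicted operation match the reference action. A minimal sketch under that assumption:

```python
# Minimal sketch of step-level scoring: a step counts only if both the chosen
# element and the predicted operation (e.g., CLICK, TYPE) match the reference.
from dataclasses import dataclass

@dataclass
class Step:
    element_id: str
    operation: str   # e.g. "CLICK", "TYPE"
    value: str = ""  # text argument for TYPE operations

def step_correct(pred: Step, gold: Step) -> bool:
    return (pred.element_id == gold.element_id
            and pred.operation == gold.operation
            and pred.value == gold.value)

def step_accuracy(preds: list[Step], golds: list[Step]) -> float:
    return sum(step_correct(p, g) for p, g in zip(preds, golds)) / len(golds)
```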
Tool Use Benchmarks
τ-bench (Yao et al., 2024)
Overview: A benchmark for evaluating agents' tool-use capabilities in multi-turn interactions with simulated users.
Evaluation Dimensions:
- Tool selection accuracy
- Parameter filling correctness
- Tool call ordering reasonableness
- Error handling capability
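These dimensions can be illustrated by comparing a predicted tool-call trajectory against a reference one. The sketch below does exactly that; it is an illustration of the listed dimensions, not the scoring procedure actually used by τ-bench.

```python
# Illustrative only: compare a predicted tool-call trajectory against a
# reference, covering tool choice, parameters, and trajectory length.
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)

def trajectory_score(pred: list[ToolCall], gold: list[ToolCall]) -> dict:
    paired = list(zip(pred, gold))
    tool_ok = sum(p.name == g.name for p, g in paired)
    args_ok = sum(p.name == g.name and p.args == g.args for p, g in paired)
    return {
        "tool_selection": tool_ok / len(gold),
        "parameter_filling": args_ok / len(gold),
        "length_match": len(pred) == len(gold),
    }
```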
ToolBench (Qin et al., 2023)
Overview: A large-scale tool use benchmark.
- 16,000+ real-world APIs
- 49 categories
- Supports single-tool and multi-tool scenarios
Desktop Operation Benchmarks
OSWorld (Xie et al., 2024)
Overview: An agent benchmark in desktop operating system environments.
| Attribute | Details |
|---|---|
| Environment | Ubuntu, Windows, macOS |
| Task count | 369 tasks |
| Applications | Common desktop apps (Office, browsers, etc.) |
Features:
- Real operating system environments (virtual machines)
- Covers file management, application operations, system settings, etc.
- Multi-step, cross-application complex tasks
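Success in such environments is typically judged by inspecting the machine state after the agent finishes, rather than by grading its action transcript. The sketch below shows that style of execution-based check for a made-up task; the file path and condition are illustrative, not OSWorld's config format.

```python
# Minimal sketch of execution-based checking: inspect the machine state after
# the agent finishes rather than grading its transcript. The path and
# condition below are made up for illustration, not OSWorld's config format.
from pathlib import Path

def check_task_state() -> bool:
    """Example post-condition: the agent was asked to export a spreadsheet
    to ~/report.csv containing a 'total' column."""
    target = Path.home() / "report.csv"
    if not target.exists():
        return False
    lines = target.read_text(encoding="utf-8").splitlines()
    return bool(lines) and "total" in lines[0].lower()

print("task resolved" if check_task_state() else "task not resolved")
```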
SOTA Results:
| Model | Success Rate (%) |
|---|---|
| GPT-4V | 12.2 |
| Claude 3 Opus | 11.8 |
| Gemini Pro | 7.5 |
| Human performance | 72.4 |
Benchmark Comparison Summary
| Benchmark | Domain | Scale | Environment Type | Primary Metric | Human Baseline |
|---|---|---|---|---|---|
| AgentBench | General | 8 environments | Simulated+Real | Composite score | - |
| GAIA | General | 466 questions | Real world | Accuracy | 92% (L1) |
| SWE-bench | Code | 2294 tasks | Real repos | Resolve rate | - |
| HumanEval | Code | 164 tasks | Sandbox | pass@k | ~90% |
| WebArena | Web | 812 tasks | Self-hosted | Success rate | 78.2% |
| OSWorld | Desktop | 369 tasks | Virtual machine | Success rate | 72.4% |
| τ-bench | Tools | Various | Simulated | Accuracy | - |
Limitations of Benchmarks
Common Issues
- Incomplete task coverage: Cannot cover all real-world scenarios
- Evaluation bias: Easy to overfit to specific test patterns
- Cost issues: Running complete benchmarks is expensive
- Update lag: Agent capabilities improve faster than benchmark updates
- Lack of process evaluation: Most only evaluate final results
Improvement Directions
- Dynamic benchmarks: Task pools that update over time
- Multi-dimensional evaluation: Comprehensive evaluation of efficiency, safety, and cost
- Real environments: Closer to real usage scenarios
- Human alignment: Focus on consistency with human preferences
References
- Liu, X., et al. "AgentBench: Evaluating LLMs as Agents." ICLR 2024.
- Mialon, G., et al. "GAIA: A Benchmark for General AI Assistants." ICLR 2024.
- Jimenez, C. E., et al. "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" ICLR 2024.
- Chen, M., et al. "Evaluating Large Language Models Trained on Code." arXiv:2107.03374, 2021.
- Zhou, S., et al. "WebArena: A Realistic Web Environment for Building Autonomous Agents." ICLR 2024.
- Xie, T., et al. "OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments." NeurIPS 2024.
Cross-references:
- Code agents → Code Generation Agents
- Web agents → Web Agents
- Evaluation methods → Evaluation Methods Overview