Web Agents
Overview
Web agents are AI agents that autonomously browse web pages, interpret their content, and complete tasks on them. With the advancement of large multimodal models, web agents have evolved from simple web scraping tools into systems that operate a browser much as a human would.
Core Challenges
The web environment is fundamentally different from traditional programming environments:
- Dynamism: Web content changes in real time, and DOM structures are unstable
- Multimodality: Agents must understand text, images, and layout simultaneously
- Interaction complexity: Operations include clicking, scrolling, typing, and dragging
- Security: Agents encounter authentication, CAPTCHAs, and anti-scraping mechanisms
Three Technical Approaches for Web Agents
graph TD
A[Web Agent Technical Approaches] --> B[DOM-based]
A --> C[Screenshot-based]
A --> D[Accessibility Tree-based]
B --> B1[Parse HTML/DOM]
B --> B2[Extract structured information]
B --> B3[Operate via selectors]
C --> C1[Feed screenshots to multimodal LLM]
C --> C2[Visual page understanding]
C --> C3[Operate via coordinates]
D --> D1[Parse accessibility tree]
D --> D2[Semantic element descriptions]
D --> D3[Operate via roles and labels]
style B fill:#e1f5fe
style C fill:#fff3e0
style D fill:#e8f5e9
DOM-based Approach
Directly parses the DOM structure of web pages to understand page content and perform operations.
Advantages:
- Obtains precise element information (ID, class, text content)
- High operation determinism (via CSS selectors or XPath)
- Can access information about elements that are not visible on screen (hidden inputs, off-screen content)
Disadvantages:
- Complex DOM structures consume many tokens
- DOM structures vary greatly across websites
- Dynamically rendered content may not be in the initial DOM
- Visual layout information is lost
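The DOM-based approach can be illustrated with a minimal sketch that walks static HTML and derives a simple CSS selector for each interactive element. It uses only Python's standard `html.parser`; the class name `InteractiveElementCollector`, the helper `extract_interactive`, and the selector heuristic (id, then name, then class) are assumptions made for illustration. A real agent would typically query a live browser so that dynamically rendered content is included.

```python
from html.parser import HTMLParser

# Tags treated as interactive in this sketch (an assumption, not a standard list).
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}


class InteractiveElementCollector(HTMLParser):
    """Collect interactive elements and derive a simple CSS selector for each."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        if tag not in INTERACTIVE_TAGS:
            return
        attr = dict(attrs)
        if attr.get("id"):
            selector = "{}#{}".format(tag, attr["id"])
        elif attr.get("name"):
            selector = '{}[name="{}"]'.format(tag, attr["name"])
        elif attr.get("class"):
            selector = tag + "".join("." + c for c in attr["class"].split())
        else:
            selector = tag  # ambiguous: may match several elements
        self.elements.append({"tag": tag, "selector": selector})


def extract_interactive(html):
    """Return the interactive elements found in an HTML string."""
    collector = InteractiveElementCollector()
    collector.feed(html)
    return collector.elements
```

Selectors built from an `id` are the most deterministic; the bare-tag fallback shows why DOM variation across websites makes this approach brittle.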
Screenshot-based Approach
Uses multimodal large models to directly "see" web page screenshots for understanding and operation.
Advantages:
- Closest to human browsing behavior
- Not dependent on DOM structure, highly generalizable
- Can understand visual layout and design intent
Disadvantages:
- Precise positioning is difficult (coordinate prediction errors)
- Small fonts or dense elements are hard to identify
- A screenshot shows only the current viewport at a fixed resolution, so off-screen content requires scrolling
- Higher computational cost
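One practical source of coordinate error is that the model often sees a resized screenshot, so predicted points have to be mapped back to viewport pixels before clicking. The following is a minimal sketch of that mapping; the function name `screenshot_to_viewport` and the clamping behavior are illustrative assumptions, not part of any particular system.

```python
def screenshot_to_viewport(x, y, screenshot_size, viewport_size):
    """Map a point predicted on a (possibly resized) screenshot to viewport pixels."""
    shot_w, shot_h = screenshot_size
    view_w, view_h = viewport_size
    if shot_w <= 0 or shot_h <= 0:
        raise ValueError("screenshot dimensions must be positive")
    # Scale each axis independently by the ratio of viewport to screenshot size.
    vx = x * view_w / shot_w
    vy = y * view_h / shot_h
    # Clamp so that small prediction errors near an edge stay inside the viewport.
    vx = min(max(vx, 0), view_w - 1)
    vy = min(max(vy, 0), view_h - 1)
    return round(vx), round(vy)
```

For example, a point at the center of a 1280x720 screenshot maps to the center of a 1920x1080 viewport.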
Accessibility Tree-based Approach
Leverages the browser's Accessibility Tree to obtain a semantically structured representation of the page.
Advantages:
- High information density, good token efficiency
- Semantic descriptions (roles such as button, link, textbox)
- Balances structural and semantic information
Disadvantages:
- On websites with poor accessibility markup, the tree has missing or unlabeled nodes
- Cannot capture purely visual information (such as image content)
- Requires the browser to expose the tree, typically through an automation or debugging interface
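To make the token-efficiency argument concrete, here is a minimal sketch that flattens an accessibility tree into indented `role 'name'` lines. The node shape (nested dicts with `role`, `name`, and `children` keys) and the list of skipped roles are assumptions for illustration; the actual shape depends on the browser interface used to obtain the tree.

```python
def serialize_ax_tree(node, depth=0, lines=None, skip_roles=("generic", "none")):
    """Flatten an accessibility tree into indented text lines.

    Nodes whose role is in skip_roles are dropped, and their children are
    promoted to the parent's indentation level.
    """
    if lines is None:
        lines = []
    role = node.get("role", "")
    name = node.get("name", "")
    keep = role not in skip_roles
    if keep:
        label = f"{role} '{name}'" if name else role
        lines.append("  " * depth + label)
    for child in node.get("children", []):
        serialize_ax_tree(child, depth + (1 if keep else 0), lines, skip_roles)
    return lines
```

Dropping purely structural nodes is what gives this representation its high information density compared with raw HTML.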
Representative Systems
WebVoyager (He et al., 2024)
WebVoyager is a web agent based on multimodal LLMs, capable of completing tasks on real websites.
Core Design:
- Uses GPT-4V as the base model
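- Takes screenshots annotated with numbered interactive elements as the primary input, with auxiliary text about those elements; the paper's text-only baseline uses the accessibility tree instead
- Supported actions: click, type, scroll, wait, go_back, google_search, answer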
- Evaluated on 15 real websites
Action Space Definition:
# Each action name maps to its parameter names and a short description.
ACTION_SPACE = {
    "click": {"params": ["element_id"], "desc": "Click on specified element"},
    "type": {"params": ["element_id", "text"], "desc": "Type text in element"},
    "scroll": {"params": ["direction"], "desc": "Scroll page up/down"},
    "wait": {"params": [], "desc": "Wait for page to load"},
    "go_back": {"params": [], "desc": "Go back to previous page"},
    "google_search": {"params": ["query"], "desc": "Google search"},
    "answer": {"params": ["text"], "desc": "Provide final answer"},
}
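An action space like this is mainly useful as a contract for checking model output before it reaches the browser. The sketch below validates a proposed action against the parameter lists above. The action format (a dict with `name` and `args` keys) and the function `validate_action` are assumptions for illustration and are not taken from the WebVoyager code.

```python
# Parameter names per action, mirroring the action space described above.
ACTION_PARAMS = {
    "click": ["element_id"],
    "type": ["element_id", "text"],
    "scroll": ["direction"],
    "wait": [],
    "go_back": [],
    "google_search": ["query"],
    "answer": ["text"],
}


def validate_action(action):
    """Return a list of problems with a proposed action; an empty list means it is valid."""
    name = action.get("name")
    if name not in ACTION_PARAMS:
        return [f"unknown action: {name!r}"]
    args = action.get("args", {})
    expected = ACTION_PARAMS[name]
    problems = [f"missing parameter: {p}" for p in expected if p not in args]
    problems += [f"unexpected parameter: {p}" for p in args if p not in expected]
    # The scroll description only allows up/down.
    if name == "scroll" and args.get("direction") not in (None, "up", "down"):
        problems.append("direction must be 'up' or 'down'")
    return problems
```

Returning a list of problems rather than raising makes it easy to feed the errors back to the model and ask for a corrected action.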
Performance: Task success rate of approximately 55.7% on the WebVoyager benchmark, as reported in the initial arXiv version (a later revision of the paper reports 59.1%).
WebArena (Zhou et al., 2024)
WebArena is a standardized benchmark environment for evaluating web agents.
Environment Design:
| Component | Description |
|---|---|
| Websites | 4 self-hosted sites (e-commerce, forum, CMS, GitLab) |
| Task count | 812 human-annotated tasks |
| Evaluation | Functional evaluation based on final state |
| Difficulty | Simulates real website complexity |
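Functional evaluation judges a run by the state the environment ends up in, not by the sequence of actions taken. The following is a minimal sketch of that idea, assuming the final state is available as a nested dict and each check is a dotted path with an expected value. The names `lookup` and `evaluate_final_state` and the state layout are hypothetical; WebArena's own evaluators are more elaborate.

```python
def lookup(state, path):
    """Follow a dotted path such as 'forum.latest_post.title' through nested dicts."""
    current = state
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def evaluate_final_state(final_state, checks):
    """Compare the final environment state against expected values.

    `checks` maps dotted paths to expected values. The run succeeds only
    if every check passes, regardless of which actions produced the state.
    """
    failures = [path for path, want in checks.items() if lookup(final_state, path) != want]
    return {"success": not failures, "failed_checks": failures}
```

Because only the outcome is checked, two agents that reach the same state through different click paths receive the same score.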
Task Examples:
- "Post a thread about Python on the forum"
- "Find the cheapest laptop on the e-commerce site"
- "Create a new Issue on GitLab"
Baseline Results:
| Method | Task Success Rate (%) |
|---|---|
| GPT-4 (text) | 14.4 |
| GPT-4V + SoM (multimodal; figure as reported for VisualWebArena, a separate benchmark) | 16.4 |
| Human performance | 78.2 |
Other Representative Systems
| System | Institution | Technical Approach | Features |
|---|---|---|---|
| Mind2Web | OSU | DOM + LLM | Large-scale real website dataset |
| Agent Q | MultiOn | MCTS + DPO | RL-optimized web agent |
| Browser Use | Open-source | Python library | Lightweight browser control |
| Playwright MCP | Microsoft | MCP protocol | Standardized browser tools |
Key Technologies
Observation Space Processing
Web pages contain far more information than fits comfortably in a model's context window, so observations must be compressed before they are passed to the model.
Common compression strategies:
- DOM pruning: Removing invisible elements, script tags, and style tags
- Viewport clipping: Processing only content within the current viewport
- Element annotation: Annotating interactive elements on screenshots with numbers (Set-of-Mark)
- Summary generation: Using LLMs to generate page content summaries
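DOM pruning, the first strategy above, can be sketched with the standard library alone. The example drops `script`, `style`, and `noscript` subtrees along with elements hidden via the `hidden` attribute, `type="hidden"`, or an inline `display:none`. The class `DomPruner` and the helper `prune_dom` are illustrative names, and the sketch assumes well-formed HTML in which every non-void start tag has a matching end tag. It cannot see visibility set by external stylesheets.

```python
from html.parser import HTMLParser

SKIPPED_TAGS = {"script", "style", "noscript"}
# Void elements have no end tag, so they must not affect depth tracking.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class DomPruner(HTMLParser):
    """Re-emit HTML without scripts, styles, and inline-hidden subtrees."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a subtree being dropped

    @staticmethod
    def _hidden(attrs):
        attr = dict(attrs)
        style = (attr.get("style") or "").replace(" ", "").lower()
        input_type = (attr.get("type") or "").lower()
        return "hidden" in attr or input_type == "hidden" or "display:none" in style

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            if self.skip_depth == 0 and not self._hidden(attrs):
                self.out.append(self.get_starttag_text())
            return
        if self.skip_depth or tag in SKIPPED_TAGS or self._hidden(attrs):
            self.skip_depth += 1
            return
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.out.append(data.strip())


def prune_dom(html):
    """Return a pruned copy of an HTML string (assumes well-formed input)."""
    pruner = DomPruner()
    pruner.feed(html)
    pruner.close()
    return "".join(pruner.out)
```

Viewport clipping and element annotation need layout information from a live browser, which is why they are not covered by a static sketch like this one.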
Set-of-Mark (SoM) Method
Annotates interactive elements on screenshots, combining visual and structural information:
[1] Search box (input)
[2] Search button (button)
[3] Login link (link)
[4] Shopping cart icon (button)
...
This approach turns visual grounding from a coordinate prediction problem into choosing one of the displayed numbers, which generally makes element selection more reliable.
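The bookkeeping behind Set-of-Mark can be sketched without any image handling: number the elements, build the text legend, and map the model's chosen number back to a click point. The element format (a dict with `name`, `role`, and a `box` of `(x, y, width, height)`) and the three function names are assumptions for illustration. Drawing the numbered boxes onto the screenshot is omitted.

```python
def assign_marks(elements):
    """Number elements 1..N in reading order (top to bottom, then left to right)."""
    ordered = sorted(elements, key=lambda e: (e["box"][1], e["box"][0]))
    return {i: e for i, e in enumerate(ordered, start=1)}


def legend(marks):
    """Build the text legend that accompanies the annotated screenshot."""
    return "\n".join(f"[{i}] {e['name']} ({e['role']})" for i, e in marks.items())


def click_point(marks, chosen):
    """Translate the number chosen by the model into the center of that element's box."""
    x, y, w, h = marks[chosen]["box"]
    return (x + w / 2, y + h / 2)
```

The model only has to output a number such as `2`; the exact coordinates come from the browser's layout data, not from the model's prediction.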
Task Planning and Execution
Web tasks typically require multiple steps, so the agent must plan, act, observe the result, and replan when a step does not go as expected:
graph LR
A[Task Understanding] --> B[Step Decomposition]
B --> C[Execute Step 1]
C --> D[Observe Result]
D --> E{Task Complete?}
E -->|No| F[Execute Next Step]
F --> D
E -->|Yes| G[Return Result]
D --> H{Replanning Needed?}
H -->|Yes| B
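The loop in the diagram can be written as a small driver function. This is a minimal sketch in which planning, execution, and the two decision points are passed in as callables. The names `run_task`, `plan`, `execute`, `is_done`, and `needs_replan` are hypothetical, and in a real agent the planner and the checks would usually be LLM calls.

```python
def run_task(task, plan, execute, is_done, needs_replan, max_steps=20):
    """Run a plan-execute-observe loop with replanning.

    plan(task, history=...) returns an iterable of steps; execute(step)
    returns an observation; is_done and needs_replan are predicates.
    """
    steps = list(plan(task, history=[]))
    history = []
    while steps and len(history) < max_steps:
        step = steps.pop(0)
        observation = execute(step)
        history.append((step, observation))
        if is_done(task, history):
            return {"success": True, "history": history}
        if needs_replan(step, observation):
            # Discard the remaining steps and plan again from the current state.
            steps = list(plan(task, history=history))
    return {"success": False, "history": history}
```

The `max_steps` bound matters in practice: without it, an agent that keeps replanning on a broken page would loop indefinitely.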
Error Handling
Common error types in web environments:
| Error Type | Cause | Handling Strategy |
|---|---|---|
| Element not found | DOM changes, dynamic loading | Wait + retry |
| Page timeout | Network issues, slow server | Reload |
| Popup interruption | Cookie consent, ads | Detect and close |
| CAPTCHA | Anti-scraping mechanisms | Human intervention / skip |
| Authentication failure | Session expired | Re-authenticate |
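The "wait + retry" strategy in the table can be sketched as a wrapper with exponential backoff. The exception class `ElementNotFound` and the function `with_retry` are hypothetical names; a real browser automation library raises its own error types, which would be passed in through `retry_on`.

```python
import time


class ElementNotFound(Exception):
    """Hypothetical error raised when a target element is missing from the page."""


def with_retry(action, retries=3, base_delay=0.5,
               retry_on=(ElementNotFound, TimeoutError), sleep=time.sleep):
    """Call action(), retrying on transient errors with exponential backoff.

    Waits base_delay, then 2x, then 4x, and so on between attempts.
    The last error is re-raised once the retries are exhausted.
    """
    for attempt in range(retries + 1):
        try:
            return action()
        except retry_on:
            if attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt)
```

Errors outside `retry_on`, such as an authentication failure, propagate immediately so that a different handling strategy can take over.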
Challenges and Frontiers
Dynamic Page Handling
Modern web applications rely heavily on client-side JavaScript rendering, as in single-page applications (SPAs), which presents the following challenges:
- Lazy loading: Content loads only after the user scrolls to it
- Asynchronous updates: Parts of the page update after AJAX requests complete
- Client-side routing: The URL changes without a full page refresh
- Animations and transitions: The agent must wait for animations to finish before acting
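A common way to cope with asynchronous updates is to wait until the page stops changing before acting. The sketch below polls a snapshot function and returns once the content hash has stayed the same for several consecutive checks. The function `wait_until_stable` and its parameters are assumptions for illustration; `get_snapshot` stands in for whatever returns the current page content as a string.

```python
import hashlib
import time


def wait_until_stable(get_snapshot, quiet_checks=3, interval=0.2, timeout=10.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll get_snapshot() until it is unchanged for quiet_checks consecutive polls.

    Returns True once the page looks stable and False if the timeout expires first.
    """
    deadline = clock() + timeout
    last_digest = None
    stable = 0
    while clock() < deadline:
        digest = hashlib.sha256(get_snapshot().encode("utf-8")).hexdigest()
        # Count consecutive polls whose content matches the previous poll.
        stable = stable + 1 if digest == last_digest else 0
        if stable >= quiet_checks:
            return True
        last_digest = digest
        sleep(interval)
    return False
```

This heuristic does not distinguish a finished page from one that is merely idle between requests, so it is usually combined with explicit waits for a specific element.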
Authentication and Security
- Session management: Cookies, Sessions, OAuth tokens
- Multi-factor authentication: SMS verification codes, TOTP, etc.
- CAPTCHA recognition: reCAPTCHA, hCaptcha, etc.
- Privacy protection: Avoiding leakage of users' private data
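One simple privacy measure is to redact obviously sensitive strings from page text before it is sent to a model. The following is a minimal sketch that masks email addresses and long digit sequences such as card numbers. The patterns and the `redact` function are deliberately rough illustrative assumptions and would miss many kinds of personal data.

```python
import re

# Rough patterns for illustration only; real PII detection needs far more care.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# 12 to 19 digits, optionally separated by single spaces or hyphens.
LONG_DIGITS = re.compile(r"\b\d(?:[ -]?\d){11,18}\b")


def redact(text):
    """Replace email addresses and long digit sequences with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return LONG_DIGITS.sub("[NUMBER]", text)
```

Short numbers such as prices and order quantities are left intact, so the page remains usable for the task.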
Cross-tab Operations
Real web tasks often require multiple tabs:
- Searching for information in one tab, filling forms in another
- Comparing multiple product pages
- Managing data transfer between multiple windows
Frontier Directions
- Unified GUI Agent: Unified framework for web agents and desktop agents
- Self-evolution: Continuously improving capabilities through online learning
- Collaborative Web Agents: Multiple agents collaborating on complex web tasks
- Privacy preservation: Completing tasks without exposing user data
Application Scenarios
- Automated testing: End-to-end testing of web applications
- Data collection: Structured information extraction from websites
- Process automation: Automatically filling forms and submitting applications
- Price monitoring: Automatically tracking product price changes
- Competitive analysis: Automatically collecting and analyzing competitor information
References
- He, H., et al. "WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models." arXiv:2401.13919, 2024.
- Zhou, S., et al. "WebArena: A Realistic Web Environment for Building Autonomous Agents." ICLR 2024.
- Deng, X., et al. "Mind2Web: Towards a Generalist Agent for the Web." NeurIPS 2023.
- Yang, J., et al. "Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V." arXiv:2310.11441, 2023.
Cross-references:
- Browser tools → Browser and Computer Operation
- Multimodal perception → Vision and Multimodal Perception
- Evaluation benchmarks → Benchmarks