
Browser and Computer Operation

Introduction

Enabling an agent to operate a browser and a computer the way a human does is an important extension of agent capabilities. From browser automation to full desktop control, advances in Vision-Language Models (VLMs) have made GUI agents practical.

Browser Automation

The Traditional Approach: DOM Manipulation

Automation driven by the HTML DOM structure:

# Playwright example
from playwright.async_api import async_playwright

async def browser_automation():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()

        # Navigate
        await page.goto("https://example.com")

        # Click
        await page.click("button#submit")

        # Fill a form field
        await page.fill("input[name='search']", "AI Agent")

        # Extract content
        content = await page.text_content("div.result")

        # Take a screenshot
        await page.screenshot(path="screenshot.png")

        await browser.close()
        return content

Agent-Driven Browser Operation

Wrapping browser operations as agent tools:

class BrowserTools:
    def __init__(self):
        self.page = None

    async def navigate(self, url: str) -> str:
        """Navigate to the given URL."""
        await self.page.goto(url, wait_until="networkidle")
        return f"Navigated to {url}"

    async def click(self, selector: str) -> str:
        """Click a page element."""
        await self.page.click(selector)
        return f"Clicked {selector}"

    async def type_text(self, selector: str, text: str) -> str:
        """Type text into an input field."""
        await self.page.fill(selector, text)
        return f"Typed '{text}' into {selector}"

    async def get_text(self, selector: str) -> str:
        """Get an element's text content."""
        return await self.page.text_content(selector)

    async def screenshot(self) -> bytes:
        """Take a screenshot of the current page."""
        screenshot = await self.page.screenshot()
        return screenshot  # raw image bytes

    async def get_page_content(self) -> str:
        """Get the page's accessibility tree."""
        accessibility_tree = await self.page.accessibility.snapshot()
        return format_accessibility_tree(accessibility_tree)
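The methods on `BrowserTools` still need to be surfaced to the model as tool schemas. A minimal sketch, assuming a JSON-schema-style tool convention (the exact format varies by provider) and a hypothetical `dispatch` helper:

```python
# Hypothetical tool schemas for a subset of BrowserTools methods.
# The exact schema shape depends on the model provider's tool-use API.
BROWSER_TOOL_SCHEMAS = [
    {
        "name": "navigate",
        "description": "Navigate the browser to a URL",
        "input_schema": {
            "type": "object",
            "properties": {"url": {"type": "string"}},
            "required": ["url"],
        },
    },
    {
        "name": "click",
        "description": "Click a page element identified by a CSS selector",
        "input_schema": {
            "type": "object",
            "properties": {"selector": {"type": "string"}},
            "required": ["selector"],
        },
    },
    {
        "name": "type_text",
        "description": "Type text into an input field",
        "input_schema": {
            "type": "object",
            "properties": {
                "selector": {"type": "string"},
                "text": {"type": "string"},
            },
            "required": ["selector", "text"],
        },
    },
]

def dispatch(tools, name, args):
    """Route a model tool call to the matching method on a tools object.

    For async methods (as in BrowserTools) the caller must await the result.
    """
    return getattr(tools, name)(**args)
```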

GUI Agent Approaches

Approach 1: Screenshot-to-Action

Use a VLM to understand the interface directly from a screenshot and decide what to do:

Screenshot → VLM analysis → choose action (click coordinates / type text) → execute → new screenshot → ...

class ScreenshotAgent:
    def __init__(self, vlm):
        self.vlm = vlm

    async def step(self, task, screenshot):
        """Decide the next action from a screenshot."""
        response = self.vlm.generate(
            messages=[
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": f"Task: {task}\nAnalyze the screenshot and decide the next action."},
                        {"type": "image", "source": screenshot}
                    ]
                }
            ],
            tools=[
                {"name": "click", "parameters": {"x": "int", "y": "int"}},
                {"name": "type", "parameters": {"text": "string"}},
                {"name": "scroll", "parameters": {"direction": "up|down"}},
                {"name": "done", "parameters": {"result": "string"}},
            ]
        )
        return response.tool_calls
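A driver loop around `step` closes the screenshot → action cycle. A sketch, where `take_screenshot` and `execute_action` are hypothetical callbacks that bind the agent to a concrete environment:

```python
async def run_screenshot_agent(agent, task, take_screenshot, execute_action,
                               max_steps=20):
    """Run the screenshot -> VLM -> action loop until `done` is returned.

    `take_screenshot` and `execute_action` are assumed callbacks that
    connect the agent to an actual browser or desktop environment.
    """
    for _ in range(max_steps):
        screenshot = await take_screenshot()
        tool_calls = await agent.step(task, screenshot)
        for call in tool_calls:
            if call["name"] == "done":
                return call["parameters"]["result"]
            await execute_action(call)
    return "Stopped: step limit reached"
```

The step cap guards against the agent looping forever on pages where it cannot make progress.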

Approach 2: DOM-to-Action

Parse the HTML DOM and let the LLM decide actions from a text representation:

async def get_simplified_dom(page):
    """Get a simplified representation of the DOM."""
    elements = await page.evaluate("""
        () => {
            const interactable = document.querySelectorAll(
                'a, button, input, select, textarea, [role="button"], [onclick]'
            );
            return Array.from(interactable).map((el, idx) => ({
                id: idx,
                tag: el.tagName.toLowerCase(),
                text: el.textContent?.trim().substring(0, 100),
                type: el.type || '',
                href: el.href || '',
                placeholder: el.placeholder || '',
            }));
        }
    """)

    # Format as text the LLM can read
    dom_text = "Interactable elements:\n"
    for el in elements:
        dom_text += f"[{el['id']}] <{el['tag']}> {el['text']}"
        if el['type']:
            dom_text += f" (type={el['type']})"
        if el['href']:
            dom_text += f" (href={el['href']})"
        dom_text += "\n"

    return dom_text
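To act on the element the LLM picks, the same selector list can be re-queried so the indices line up with the `[id]` labels. A sketch, which assumes the page has not changed since the DOM snapshot was taken:

```python
async def click_element_by_id(page, element_id: int):
    """Click the N-th interactable element from get_simplified_dom.

    Re-runs the same query so indices match the [id] labels the LLM saw;
    this breaks if the page mutated in between.
    """
    await page.evaluate(
        """(idx) => {
            const interactable = document.querySelectorAll(
                'a, button, input, select, textarea, [role="button"], [onclick]'
            );
            interactable[idx].click();
        }""",
        element_id,
    )
```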

Approach 3: Accessibility Tree

Use the accessibility tree as a structured representation of the page:

async def get_accessibility_tree(page):
    """Get the page's accessibility tree as indented text."""
    tree = await page.accessibility.snapshot()

    def format_node(node, depth=0):
        indent = "  " * depth
        text = f"{indent}{node['role']}"
        if node.get('name'):
            text += f': "{node["name"]}"'
        if node.get('value'):
            text += f' (value: {node["value"]})'
        text += "\n"

        for child in node.get('children', []):
            text += format_node(child, depth + 1)
        return text

    return format_node(tree)

# Example output:
# WebArea: "Search - Google"
#   navigation: "Navigation"
#     link: "Gmail"
#     link: "Images"
#   search: "Search"
#     textbox: "Search" (value: "")
#     button: "Google Search"

Anthropic Computer Use

Overview

Anthropic's Computer Use lets Claude operate a computer the way a human does: look at the screen, move the mouse, click buttons, and type text.

Tool Definitions

# Anthropic Computer Use provides three built-in tools
computer_tools = [
    {
        "type": "computer_20241022",
        "name": "computer",
        "display_width_px": 1024,
        "display_height_px": 768,
    },
    {
        "type": "text_editor_20241022",
        "name": "str_replace_editor",
    },
    {
        "type": "bash_20241022",
        "name": "bash",
    }
]

Usage Example

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    tools=computer_tools,
    messages=[{
        "role": "user",
        "content": "Open Firefox and search for today's weather"
    }]
)

# Claude will respond with tool calls along these lines:
# 1. computer(action="screenshot") - take a screenshot to see the current state
# 2. computer(action="mouse_move", coordinate=[500, 400]) - move the mouse
# 3. computer(action="left_click") - click
# 4. computer(action="type", text="weather") - type text

Action Space

| Action | Description | Parameters |
|---|---|---|
| screenshot | Capture the screen | (none) |
| mouse_move | Move the mouse | coordinate: [x, y] |
| left_click | Left-click | coordinate: [x, y] |
| right_click | Right-click | coordinate: [x, y] |
| double_click | Double-click | coordinate: [x, y] |
| type | Type text | text: string |
| key | Press a key | key: string (e.g. "Return", "ctrl+c") |
| scroll | Scroll | coordinate, direction |
| drag | Drag | start_coordinate, end_coordinate |
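On the executor side, these actions have to be mapped onto a GUI automation backend. A sketch of a dispatcher over a pyautogui-style interface; the backend object is injected (rather than importing pyautogui directly) so it can be stubbed in tests, and the exact function names are an assumption about that backend:

```python
def execute_computer_action(gui, action: str, **params):
    """Map a Computer Use tool call onto a pyautogui-like backend.

    `gui` is assumed to expose screenshot/moveTo/click/... functions;
    coordinates are in the same screen space as the screenshots the
    model sees.
    """
    if action == "screenshot":
        return gui.screenshot()
    if action == "mouse_move":
        return gui.moveTo(*params["coordinate"])
    if action == "left_click":
        return gui.click(*params.get("coordinate", ()))
    if action == "right_click":
        return gui.rightClick(*params.get("coordinate", ()))
    if action == "double_click":
        return gui.doubleClick(*params.get("coordinate", ()))
    if action == "type":
        return gui.write(params["text"])
    if action == "key":
        # "ctrl+c" -> hotkey("ctrl", "c"); "Return" -> hotkey("return")
        return gui.hotkey(*params["key"].lower().split("+"))
    if action == "scroll":
        return gui.scroll(5 if params.get("direction") == "up" else -5)
    raise ValueError(f"Unsupported action: {action}")
```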

WebVoyager and Web Agents

WebVoyager (He et al., 2024)

Uses a VLM to complete complex tasks on real websites:

Task: "On Amazon, find the highest-rated noise-canceling headphones under $100"

Agent execution flow:
1. Screenshot → locate the search box → type "noise-canceling headphones"
2. Screenshot → locate the filters → click the price filter
3. Screenshot → set the price range → click Apply
4. Screenshot → locate the sort options → sort by rating
5. Screenshot → extract the results → return a recommendation

Web Agent Architecture

graph TB
    TASK[User task] --> PLAN[Task planning]
    PLAN --> LOOP{Agent loop}

    LOOP --> OBS[Observe<br/>screenshot/DOM/a11y]
    OBS --> THINK[Think<br/>analyze current state]
    THINK --> ACT[Act<br/>click/type/scroll]
    ACT --> ENV[Browser environment]
    ENV --> OBS

    THINK -->|task complete| RESULT[Return result]
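The observe node in the diagram typically bundles several channels into one observation. A hypothetical helper for a Playwright page:

```python
async def observe(page):
    """Collect a multi-channel observation of the browser state.

    Combines a screenshot (for the VLM), the current URL, and the
    accessibility tree (for precise grounding). `page` is assumed to be
    a Playwright Page object.
    """
    return {
        "screenshot": await page.screenshot(),
        "url": page.url,
        "a11y": await page.accessibility.snapshot(),
    }
```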

Screen Understanding with VLMs

Set-of-Mark (SoM)

Annotate interactable elements with numbered marks on the screenshot so the VLM can ground its actions precisely:

def add_set_of_mark(screenshot, elements):
    """Draw numbered marks on a screenshot."""
    from PIL import Image, ImageDraw

    img = Image.open(screenshot)
    draw = ImageDraw.Draw(img)

    for idx, el in enumerate(elements):
        x, y, w, h = el["bbox"]
        # Draw a bounding box
        draw.rectangle([x, y, x+w, y+h], outline="red", width=2)
        # Label with the mark number
        draw.text((x, y-15), str(idx), fill="red")

    return img
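With marks in place, the VLM only needs to return a mark number; mapping it back to pixel coordinates is then a lookup. A small sketch:

```python
def mark_to_click_point(elements, mark_id: int):
    """Convert a Set-of-Mark id chosen by the VLM into click coordinates.

    Returns the center of the element's bounding box, usually a safe
    click target.
    """
    x, y, w, h = elements[mark_id]["bbox"]
    return (x + w // 2, y + h // 2)
```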

Multimodal Input

# Combine the screenshot with structured page information
prompt = f"""
You are operating a browser to complete the task: {task}

A screenshot of the current page is attached.
Current URL: {current_url}
Interactable elements:
{accessibility_tree}

Choose the next action.
"""

Safety Considerations

Risks

  1. Sensitive-data exposure: the screen may show passwords or personal information
  2. Unintended actions: the agent may click the wrong button (e.g. delete, pay)
  3. Malicious websites: the agent can be tricked by phishing pages
  4. Excessive privileges: OS-level access grants a very large attack surface

Safeguards

class SafeBrowserAgent:
    BLOCKED_ACTIONS = [
        "confirm payment", "delete account", "send email",  # high-risk operations
    ]

    BLOCKED_DOMAINS = [
        "bank.com", "payment.com",  # financial sites
    ]

    async def safe_execute(self, action, context):
        """Execute an action after safety checks."""
        # 1. Domain check
        if any(d in context["url"] for d in self.BLOCKED_DOMAINS):
            return "Refused: no automated actions on financial sites"

        # 2. High-risk action check
        if any(a in str(action) for a in self.BLOCKED_ACTIONS):
            # Requires human confirmation
            confirmed = await self.request_human_confirmation(action)
            if not confirmed:
                return "User canceled the action"

        # 3. Execute
        return await self.execute(action)
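The `request_human_confirmation` call used by `safe_execute` is left undefined; a minimal console-based sketch (a real system would route approval through a UI or chat channel instead):

```python
import asyncio

async def request_human_confirmation(action) -> bool:
    """Ask a human to approve a high-risk action before it runs."""
    def _ask():
        answer = input(f"Agent wants to perform: {action!r}. Allow? [y/N] ")
        return answer.strip().lower() == "y"

    # input() blocks, so run it in a thread to keep the event loop responsive.
    return await asyncio.to_thread(_ask)
```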

Further Reading

  • Web Agents: concrete applications of web agents
  • He, H., et al. (2024). "WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models"
  • Anthropic. "Computer Use" documentation
  • Zheng, B., et al. (2024). "GPT-4V(ision) is a Generalist Web Agent, if Grounded"
