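Browser and Computer Use
Introduction
Letting an agent operate a browser or a whole computer the way a person does significantly extends what it can do. Progress in vision-language models (VLMs) has made GUI agents practical, from browser automation to full desktop control.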
Browser Automation
Traditional Approach: DOM Manipulation
Automation driven by the structure of the HTML DOM:
# Playwright example (the selectors are placeholders for a real page)
from playwright.async_api import async_playwright

async def browser_automation():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        # Navigate
        await page.goto("https://example.com")
        # Click
        await page.click("button#submit")
        # Fill in a form field
        await page.fill("input[name='search']", "AI Agent")
        # Extract content
        content = await page.text_content("div.result")
        # Take a screenshot
        await page.screenshot(path="screenshot.png")
        await browser.close()
        return content
Agent-Driven Browser Operation
Browser operations can be wrapped as agent tools:
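class BrowserTools:
    def __init__(self, page=None):
        # Must be set to a Playwright Page before any tool is called
        self.page = page

    async def navigate(self, url: str) -> str:
        """Navigate to the given URL."""
        await self.page.goto(url, wait_until="networkidle")
        return f"Navigated to {url}"

    async def click(self, selector: str) -> str:
        """Click a page element."""
        await self.page.click(selector)
        return f"Clicked {selector}"

    async def type_text(self, selector: str, text: str) -> str:
        """Type text into an input field."""
        await self.page.fill(selector, text)
        return f"Typed '{text}' into {selector}"

    async def get_text(self, selector: str) -> str:
        """Return the text of an element (empty string if it has none)."""
        return await self.page.text_content(selector) or ""

    async def screenshot(self) -> bytes:
        """Return a screenshot of the current page as raw image bytes."""
        return await self.page.screenshot()

    async def get_page_content(self) -> str:
        """Return the page's accessibility tree as text."""
        # Note: page.accessibility is deprecated in recent Playwright releases.
        accessibility_tree = await self.page.accessibility.snapshot()
        # format_accessibility_tree is a helper not shown here; see the
        # formatter in the Accessibility Tree approach for one way to write it.
        return format_accessibility_tree(accessibility_tree)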
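To connect a tool wrapper like this to a model's tool calls, a small dispatcher can map tool names onto methods. The sketch below is one possible arrangement, not part of Playwright or any agent framework: `dispatch_tool_call`, the `{"name": ..., "arguments": {...}}` call shape, and the `FakeBrowserTools` stand-in (used so the example runs without a browser) are all assumptions made for illustration.

```python
import asyncio
import inspect


class FakeBrowserTools:
    """Hypothetical stand-in for BrowserTools so the sketch runs without a browser."""

    async def navigate(self, url: str) -> str:
        return f"Navigated to {url}"

    async def click(self, selector: str) -> str:
        return f"Clicked {selector}"


async def dispatch_tool_call(tools, call: dict) -> str:
    """Route one tool call, shaped as {"name": ..., "arguments": {...}}, to a method."""
    name = call.get("name", "")
    method = getattr(tools, name, None)
    # Only public coroutine methods are treated as tools.
    if name.startswith("_") or not inspect.iscoroutinefunction(method):
        return f"Error: unknown tool '{name}'"
    try:
        return await method(**call.get("arguments", {}))
    except TypeError as exc:
        # Wrong or missing arguments are reported back instead of crashing the loop.
        return f"Error: bad arguments for '{name}': {exc}"


result = asyncio.run(
    dispatch_tool_call(FakeBrowserTools(), {"name": "click", "arguments": {"selector": "#go"}})
)
print(result)
```

Returning error strings rather than raising lets the model see what went wrong and try a different call on the next step.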
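GUI Agent Approaches
Approach 1: Screenshot-to-Action
A VLM interprets the interface directly from a screenshot and decides what to do next:
Screenshot → VLM analysis → choose an action (click coordinates / text input) → execute → new screenshot → ...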
class ScreenshotAgent:
    def __init__(self, vlm):
        # vlm is a generic client; the message and tool schemas below are
        # abbreviated and must be adapted to the provider actually used
        self.vlm = vlm

    async def step(self, task, screenshot):
        """Decide the next action from a screenshot."""
        response = self.vlm.generate(
            messages=[
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": f"Task: {task}\nAnalyze the screenshot and decide the next action."},
                        {"type": "image", "source": screenshot}
                    ]
                }
            ],
            tools=[
                {"name": "click", "parameters": {"x": "int", "y": "int"}},
                {"name": "type", "parameters": {"text": "string"}},
                {"name": "scroll", "parameters": {"direction": "up|down"}},
                {"name": "done", "parameters": {"result": "string"}},
            ]
        )
        return response.tool_calls
Approach 2: DOM-to-Action
Parse the HTML DOM and let the LLM choose an action from a text representation of it:
async def get_simplified_dom(page):
    """Return a simplified text representation of the DOM."""
    elements = await page.evaluate("""
        () => {
            const interactable = document.querySelectorAll(
                'a, button, input, select, textarea, [role="button"], [onclick]'
            );
            return Array.from(interactable).map((el, idx) => ({
                id: idx,
                tag: el.tagName.toLowerCase(),
                text: el.textContent?.trim().substring(0, 100),
                type: el.type || '',
                href: el.href || '',
                placeholder: el.placeholder || '',
            }));
        }
    """)
    # Format as text an LLM can read
    dom_text = "Interactable elements:\n"
    for el in elements:
        dom_text += f"[{el['id']}] <{el['tag']}> {el['text']}"
        if el['type']:
            dom_text += f" (type={el['type']})"
        if el['href']:
            dom_text += f" (href={el['href']})"
        dom_text += "\n"
    return dom_text
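Once the model has seen the `[id] <tag> text` listing, it needs a way to refer back to an element by its id. The sketch below parses a reply into an action; the reply grammar (`click [3]`, `type [5] "text"`) and the name `parse_dom_action` are assumptions chosen for this example, not a format defined by the original code.

```python
import re
from typing import Optional

# Hypothetical reply grammar: `click [3]` or `type [5] "some text"`.
ACTION_RE = re.compile(r'^(click|type)\s+\[(\d+)\](?:\s+"(.*)")?$')


def parse_dom_action(reply: str, num_elements: int) -> Optional[dict]:
    """Parse a model reply into an action dict, or return None if it is invalid."""
    match = ACTION_RE.match(reply.strip())
    if not match:
        return None
    kind, idx, text = match.group(1), int(match.group(2)), match.group(3)
    if idx >= num_elements:
        return None  # the model referred to an element that was not listed
    if kind == "type" and text is None:
        return None  # typing needs a text argument
    return {"action": kind, "element_id": idx, "text": text}


print(parse_dom_action('type [2] "AI Agent"', num_elements=5))
```

Because the ids are positions in the `querySelectorAll` result, the same selector string combined with Playwright's `locator(...).nth(idx)` can recover the element, provided the page has not changed between listing and acting.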
Approach 3: Accessibility Tree
Use the accessibility tree as a structured representation of the page:
async def get_accessibility_tree(page):
    """Return the page's accessibility tree as indented text."""
    tree = await page.accessibility.snapshot()
    if tree is None:
        return ""  # the snapshot can be empty

    def format_node(node, depth=0):
        indent = "  " * depth
        text = f"{indent}{node['role']}"
        if node.get('name'):
            text += f': "{node["name"]}"'
        if node.get('value'):
            text += f' (value: {node["value"]})'
        text += "\n"
        for child in node.get('children', []):
            text += format_node(child, depth + 1)
        return text

    return format_node(tree)

# Example output:
# WebArea: "Search - Google"
#   navigation: "Navigation"
#     link: "Gmail"
#     link: "Images"
#   search: "Search"
#     textbox: "Search" (value: "")
#     button: "Google Search"
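A full accessibility tree can be long, so it is often trimmed before it goes into a prompt. The sketch below keeps only interactive nodes plus the ancestors needed to reach them. It assumes the same node shape as the formatter above (`role`, `name`, optional `children`); the `INTERACTIVE_ROLES` set and the name `prune_tree` are illustrative choices, not a standard.

```python
from typing import Optional

# Illustrative set of roles treated as interactive; extend as needed.
INTERACTIVE_ROLES = {"link", "button", "textbox", "checkbox", "combobox"}


def prune_tree(node: dict) -> Optional[dict]:
    """Keep only interactive nodes and the ancestors needed to reach them."""
    kept_children = []
    for child in node.get("children", []):
        pruned = prune_tree(child)
        if pruned is not None:
            kept_children.append(pruned)
    if node.get("role") not in INTERACTIVE_ROLES and not kept_children:
        return None  # neither interactive nor a path to something interactive
    result = {k: v for k, v in node.items() if k != "children"}
    if kept_children:
        result["children"] = kept_children
    return result


tree = {
    "role": "WebArea", "name": "Search",
    "children": [
        {"role": "heading", "name": "Welcome"},
        {"role": "search", "name": "Search", "children": [
            {"role": "textbox", "name": "Search"},
            {"role": "text", "name": "Tip: use quotes"},
        ]},
    ],
}
pruned = prune_tree(tree)
print(pruned)
```

The trade-off is that static text such as headings is dropped, so tasks that require reading page content may still need the full tree or a screenshot.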
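Anthropic Computer Use
Overview
Anthropic's Computer Use lets Claude operate a computer the way a person does: looking at the screen, moving the mouse, clicking buttons, and typing text.
Tool Definitions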
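# Anthropic Computer Use provides three built-in tools.
# The "type" strings are versioned and tied to model generations,
# so check the documentation for the versions that match your model.
computer_tools = [
    {
        "type": "computer_20241022",
        "name": "computer",
        "display_width_px": 1024,
        "display_height_px": 768,
    },
    {
        "type": "text_editor_20241022",
        "name": "str_replace_editor",
    },
    {
        "type": "bash_20241022",
        "name": "bash",
    }
]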
Usage Example
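import anthropic

client = anthropic.Anthropic()
# Computer use is a beta feature, so the request goes through
# client.beta.messages.create with a betas flag. The flag and the tool
# "type" strings are versioned together; confirm the pairing for your model.
response = client.beta.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    tools=computer_tools,
    betas=["computer-use-2024-10-22"],
    messages=[{
        "role": "user",
        "content": "Open the Firefox browser and search for today's weather"
    }]
)
# Over successive turns, Claude returns tool calls along these lines:
# 1. computer(action="screenshot") - take a screenshot to see the current state
# 2. computer(action="mouse_move", coordinate=[500, 400]) - move the mouse
# 3. computer(action="left_click") - click
# 4. computer(action="type", text="weather") - type text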
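Each of these tool calls arrives as a `tool_use` block that the application must execute itself, then report back in a `tool_result` block on the next user turn. The helper below sketches how such a follow-up message might be assembled. `build_tool_result` is a hypothetical name, `"toolu_example"` is a made-up id, and the block layout reflects the Messages API tool-use format as understood here, so it should be checked against the current documentation.

```python
import base64


def build_tool_result(tool_use_id: str, screenshot_png: bytes = b"", text: str = "") -> dict:
    """Build the user message that reports one tool call's outcome back to the model."""
    content = []
    if text:
        content.append({"type": "text", "text": text})
    if screenshot_png:
        # Images are sent inline as base64-encoded data.
        content.append({
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": base64.b64encode(screenshot_png).decode("ascii"),
            },
        })
    return {
        "role": "user",
        "content": [{"type": "tool_result", "tool_use_id": tool_use_id, "content": content}],
    }


# Placeholder bytes stand in for a real PNG screenshot.
message = build_tool_result("toolu_example", screenshot_png=b"\x89PNG", text="clicked")
```

The loop then appends the assistant turn and this message to the conversation and calls the API again, repeating until the model stops requesting tools.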
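Action Space
| Action | Description | Parameters |
|---|---|---|
| `screenshot` | Capture the screen | none |
| `mouse_move` | Move the mouse | `coordinate`: [x, y] |
| `left_click` | Left-click | `coordinate`: [x, y] |
| `right_click` | Right-click | `coordinate`: [x, y] |
| `double_click` | Double-click | `coordinate`: [x, y] |
| `type` | Type text | `text`: string |
| `key` | Press a key | `key`: string (e.g. "Return", "ctrl+c") |
| `scroll` | Scroll | `coordinate`, `direction` |
| `drag` | Drag | `start_coordinate`, `end_coordinate` |

Action names and parameters differ between tool versions (for example, the API names the drag action `left_click_drag`, and the earliest tool version has no `scroll`), so treat this table as a summary and check the documentation for the version in use.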
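Coordinates in these actions refer to the display size declared in the tool definition (`display_width_px` by `display_height_px`). When screenshots are downscaled before being sent, the returned coordinates have to be mapped back to the real screen. The function below is a minimal sketch of that mapping; `scale_to_screen` and the default resolutions are assumptions for illustration.

```python
def scale_to_screen(x, y, declared=(1024, 768), actual=(1920, 1080)):
    """Map a point from the declared display size to real screen pixels."""
    scale_x = actual[0] / declared[0]
    scale_y = actual[1] / declared[1]
    # Clamp so that a slightly out-of-range prediction still lands on screen.
    px = min(max(round(x * scale_x), 0), actual[0] - 1)
    py = min(max(round(y * scale_y), 0), actual[1] - 1)
    return px, py


print(scale_to_screen(512, 384, declared=(1024, 768), actual=(2048, 1536)))
```

Keeping the declared size at the same aspect ratio as the real screen avoids distorting the mapping.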
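WebVoyager and Web Agents
WebVoyager (He et al., 2024)
WebVoyager uses a VLM to complete complex tasks on real websites:
Task: "Search Amazon for the highest-rated noise-cancelling headphones priced under $100"
Agent execution flow:
1. Screenshot → locate the search box → type "noise-cancelling headphones"
2. Screenshot → locate the filters → click the price filter
3. Screenshot → set the price range → click Apply
4. Screenshot → locate the sort options → sort by rating
5. Screenshot → extract the results → return a recommendation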
Web Agent Architecture
graph TB
TASK[User task] --> PLAN[Task planning]
PLAN --> LOOP{Agent loop}
LOOP --> OBS[Observe<br/>screenshot/DOM/A11y]
OBS --> THINK[Think<br/>analyze current state]
THINK --> ACT[Act<br/>click/type/scroll]
ACT --> ENV[Browser environment]
ENV --> OBS
THINK -->|Task complete| RESULT[Return result]
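The observe, think, act cycle in this diagram can be sketched as a small loop with a step budget. This is an illustration under simplifying assumptions: the planning stage is omitted, `run_agent` and the decision dict shape are hypothetical, and a counter stands in for the browser so the example runs on its own.

```python
def run_agent(task, observe, think, act, max_steps=10):
    """Run the observe -> think -> act loop until the task is done or steps run out."""
    history = []
    for _ in range(max_steps):
        observation = observe()                       # screenshot, DOM, or a11y tree
        decision = think(task, observation, history)  # normally a model call
        if decision["action"] == "done":
            return decision["result"], history
        act(decision)                                 # changes the environment
        history.append(decision)
    return None, history  # step budget exhausted


# Toy environment: a counter standing in for the browser.
state = {"page": 0}


def observe():
    return f"page {state['page']}"


def think(task, observation, history):
    if observation == "page 2":
        return {"action": "done", "result": "reached page 2"}
    return {"action": "click_next"}


def act(decision):
    state["page"] += 1


result, history = run_agent("go to page 2", observe, think, act)
print(result, len(history))
```

The step budget matters in practice: without it, an agent that misreads the page can loop indefinitely.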
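Screen Understanding and VLMs
Set-of-Mark (SoM)
Overlaying numbered marks on the interactable elements in a screenshot helps the VLM locate them precisely: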
def add_set_of_mark(screenshot, elements):
    """Draw a numbered box around each element on the screenshot."""
    from PIL import Image, ImageDraw

    img = Image.open(screenshot)  # a file path or file-like object
    draw = ImageDraw.Draw(img)
    for idx, el in enumerate(elements):
        x, y, w, h = el["bbox"]
        # Draw the bounding box
        draw.rectangle([x, y, x+w, y+h], outline="red", width=2)
        # Draw the number just above the box, staying inside the image
        draw.text((x, max(y-15, 0)), str(idx), fill="red")
    return img
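After the VLM answers with a mark number, that number has to be turned back into a click position. A minimal sketch, assuming the same `elements` list with `(x, y, w, h)` bounding boxes as above; the function name `mark_to_click` is hypothetical.

```python
def mark_to_click(mark_id, elements):
    """Return the centre of the marked element's bounding box as click coordinates."""
    if not 0 <= mark_id < len(elements):
        raise ValueError(f"no element with mark {mark_id}")
    x, y, w, h = elements[mark_id]["bbox"]
    return x + w // 2, y + h // 2


elements = [{"bbox": (10, 20, 100, 40)}, {"bbox": (200, 300, 50, 50)}]
print(mark_to_click(1, elements))
```

Rejecting unknown mark numbers guards against the model naming a mark that was never drawn.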
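Multimodal Input
# Combine the screenshot with structured information
prompt = f"""
You are operating a browser to complete this task: {task}
The current page is shown in the attached screenshot.
Current URL: {current_url}
Interactable elements:
{accessibility_tree}
Choose the next action.
"""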
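Safety Considerations
Risks
- Sensitive data exposure: passwords or personal information may be visible on screen
- Erroneous actions: the agent may click the wrong button (such as delete or pay)
- Malicious websites: the agent may be deceived by phishing sites
- Excessive privileges: the agent has operating-system-level access
Safeguards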
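class SafeBrowserAgent:
    # High-risk actions; the strings must match the UI language of the target sites
    BLOCKED_ACTIONS = [
        "Confirm payment", "Delete account", "Send email",
    ]
    # Financial sites (placeholder domains)
    BLOCKED_DOMAINS = [
        "bank.com", "payment.com",
    ]

    # request_human_confirmation() and execute() are not shown here and
    # must be provided by the concrete agent.
    async def safe_execute(self, action, context):
        """Run safety checks, then execute the action."""
        # 1. Domain check (a coarse substring match on the whole URL)
        if any(d in context["url"] for d in self.BLOCKED_DOMAINS):
            return "Refused: automated actions are not performed on financial sites"
        # 2. High-risk action check
        if any(a in str(action) for a in self.BLOCKED_ACTIONS):
            # Requires human confirmation
            confirmed = await self.request_human_confirmation(action)
            if not confirmed:
                return "The user cancelled the action"
        # 3. Execute
        return await self.execute(action)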
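A substring test on the full URL is easy to fool: it also matches `notbank.com`, or a query string that merely mentions a blocked domain. The sketch below compares against the parsed hostname instead. `is_blocked` is a hypothetical helper, and the domain list reuses the placeholder values from the class above.

```python
from urllib.parse import urlparse


def is_blocked(url: str, blocked_domains) -> bool:
    """Return True if the URL's host is a blocked domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in blocked_domains)


BLOCKED = ["bank.com", "payment.com"]
print(is_blocked("https://www.bank.com/login", BLOCKED))
print(is_blocked("https://example.com/?next=bank.com", BLOCKED))
```

For tightly scoped agents, an allowlist of permitted hosts is usually the safer default than a blocklist.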
Further Reading
- Web Agents - concrete applications of web agents
- He, H., et al. (2024). "WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models"
- Anthropic. "Computer Use" Documentation
- Zheng, B., et al. (2024). "GPT-4V(ision) is a Generalist Web Agent, if Grounded"