自研框架设计

何时自研

在以下场景中，自研Agent框架可能优于使用现有框架：

场景	原因
极致性能要求	减少框架开销，定制优化
特殊领域需求	现有框架不支持的特殊模式
深度定制	需要完全控制每个环节
安全合规	无法引入第三方依赖
教学/学习	理解Agent工作原理

自研的代价

自研意味着需要自行处理错误恢复、状态管理、可观测性等基础设施问题。在大多数情况下，使用成熟框架+定制是更好的选择。

最小Agent循环

任何Agent框架的核心都是一个 观察→思考→行动→评估 的循环：

graph TD
    A[Observe 观察] --> B[Think 思考]
    B --> C[Act 行动]
    C --> D[Evaluate 评估]
    D -->|继续| A
    D -->|完成| E[Return Result]
    D -->|失败| F[Error Handler]
    F -->|重试| A
    F -->|放弃| G[Fallback]

最小实现

import json
from dataclasses import dataclass, field
from typing import Any, Callable
from abc import ABC, abstractmethod

@dataclass
class AgentState:
    """Agent状态"""
    messages: list = field(default_factory=list)
    tool_results: dict = field(default_factory=dict)
    iteration: int = 0
    status: str = "running"  # running | completed | failed
    metadata: dict = field(default_factory=dict)

class Tool:
    """工具定义"""
    def __init__(self, name: str, description: str, 
                 function: Callable, parameters: dict):
        self.name = name
        self.description = description
        self.function = function
        self.parameters = parameters

    def execute(self, **kwargs) -> Any:
        return self.function(**kwargs)

    def to_schema(self) -> dict:
        return {
            "name": self.name,
            "description": self.description,
            "parameters": self.parameters
        }

class MinimalAgent:
    """最小Agent框架"""

    def __init__(self, system_prompt: str, tools: list[Tool],
                 llm_client, max_iterations: int = 10):
        self.system_prompt = system_prompt
        self.tools = {t.name: t for t in tools}
        self.llm = llm_client
        self.max_iterations = max_iterations

    def run(self, user_input: str) -> str:
        """运行Agent循环"""
        state = AgentState()
        state.messages.append({"role": "user", "content": user_input})

        while state.iteration < self.max_iterations:
            state.iteration += 1

            # 1. OBSERVE: 构建当前上下文
            context = self._build_context(state)

            # 2. THINK: 调用LLM
            response = self._think(context)

            # 3. ACT: 执行工具或返回结果
            if self._has_tool_calls(response):
                tool_results = self._act(response)
                state.messages.append({
                    "role": "assistant", "content": response
                })
                state.messages.append({
                    "role": "tool", "content": json.dumps(tool_results)
                })
            else:
                # 4. EVALUATE: 判断是否完成
                state.status = "completed"
                return self._extract_text(response)

        state.status = "failed"
        return "Max iterations reached"

    def _build_context(self, state: AgentState) -> list:
        """构建LLM输入上下文"""
        return [
            {"role": "system", "content": self.system_prompt},
            *state.messages
        ]

    def _think(self, context: list) -> dict:
        """调用LLM进行推理"""
        return self.llm.chat(
            messages=context,
            tools=[t.to_schema() for t in self.tools.values()]
        )

    def _has_tool_calls(self, response) -> bool:
        """检查响应中是否有工具调用"""
        return hasattr(response, 'tool_calls') and response.tool_calls

    def _act(self, response) -> list:
        """执行工具调用"""
        results = []
        for call in response.tool_calls:
            tool = self.tools.get(call.name)
            if tool:
                try:
                    result = tool.execute(**call.arguments)
                    results.append({
                        "tool": call.name, 
                        "result": result,
                        "status": "success"
                    })
                except Exception as e:
                    results.append({
                        "tool": call.name,
                        "error": str(e),
                        "status": "error"
                    })
        return results

    def _extract_text(self, response) -> str:
        """提取文本回复"""
        return response.content

状态管理模式

模式1: 内存状态

适用于简单场景，状态保存在内存中：

class InMemoryStateManager:
    def __init__(self):
        self._states: dict[str, AgentState] = {}

    def get(self, session_id: str) -> AgentState:
        if session_id not in self._states:
            self._states[session_id] = AgentState()
        return self._states[session_id]

    def save(self, session_id: str, state: AgentState):
        self._states[session_id] = state

    def delete(self, session_id: str):
        self._states.pop(session_id, None)

模式2: 持久化状态

使用数据库存储状态，支持恢复和审计：

class PersistentStateManager:
    def __init__(self, db_url: str):
        self.db = Database(db_url)

    def get(self, session_id: str) -> AgentState:
        data = self.db.get("agent_states", session_id)
        if data:
            return AgentState(**json.loads(data))
        return AgentState()

    def save(self, session_id: str, state: AgentState):
        self.db.upsert("agent_states", session_id, 
                       json.dumps(state.__dict__))

    def get_history(self, session_id: str) -> list[AgentState]:
        """获取状态历史（用于调试和审计）"""
        return self.db.get_versions("agent_states", session_id)

模式3: 事件溯源

记录所有状态变更事件，支持时间旅行调试：

@dataclass
class StateEvent:
    timestamp: float
    event_type: str  # "message_added" | "tool_called" | "status_changed"
    payload: dict

class EventSourcedState:
    def __init__(self):
        self.events: list[StateEvent] = []

    def apply(self, event: StateEvent):
        self.events.append(event)

    def replay(self, up_to: float = None) -> AgentState:
        """从事件重建状态"""
        state = AgentState()
        for event in self.events:
            if up_to and event.timestamp > up_to:
                break
            self._apply_event(state, event)
        return state

错误处理策略

重试 (Retry)

from tenacity import retry, stop_after_attempt, wait_exponential

class RobustAgent(MinimalAgent):

    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=1, max=10)
    )
    def _think(self, context):
        """带重试的LLM调用"""
        return super()._think(context)

    def _act(self, response):
        """带错误隔离的工具执行"""
        results = []
        for call in response.tool_calls:
            try:
                result = self._execute_with_timeout(call, timeout=30)
                results.append({"tool": call.name, "result": result})
            except TimeoutError:
                results.append({
                    "tool": call.name, 
                    "error": "Tool execution timed out",
                    "status": "timeout"
                })
            except Exception as e:
                results.append({
                    "tool": call.name,
                    "error": str(e),
                    "status": "error"
                })
        return results

回退 (Fallback)

class FallbackStrategy:
    """多级回退策略"""

    def __init__(self, strategies: list):
        self.strategies = strategies

    def execute(self, task, context):
        for i, strategy in enumerate(self.strategies):
            try:
                return strategy(task, context)
            except Exception as e:
                logger.warning(f"Strategy {i} failed: {e}")
                if i == len(self.strategies) - 1:
                    raise  # 最后一个策略也失败了

# 使用
fallback = FallbackStrategy([
    lambda t, c: powerful_llm.invoke(t),    # 首选：强模型
    lambda t, c: cheap_llm.invoke(t),        # 回退：便宜模型
    lambda t, c: rule_based_response(t),     # 最终：规则兜底
])

升级 (Escalation)

class EscalationHandler:
    """升级处理：当Agent无法处理时升级给人类"""

    def __init__(self, confidence_threshold=0.6):
        self.threshold = confidence_threshold

    def should_escalate(self, state: AgentState) -> bool:
        """判断是否需要升级"""
        conditions = [
            state.iteration >= 8,           # 迭代过多
            self._detect_loop(state),       # 检测循环
            self._low_confidence(state),    # 置信度低
            self._sensitive_topic(state),   # 敏感话题
        ]
        return any(conditions)

    def escalate(self, state: AgentState) -> str:
        """执行升级"""
        summary = self._summarize_conversation(state)
        notify_human_agent(summary)
        return "I'm transferring you to a human agent who can better assist you."

Agent测试策略

单元测试

import pytest
from unittest.mock import Mock, patch

class TestMinimalAgent:

    def test_tool_execution(self):
        """测试工具执行"""
        tool = Tool(
            name="add", 
            description="Add two numbers",
            function=lambda a, b: a + b,
            parameters={"a": "int", "b": "int"}
        )
        result = tool.execute(a=2, b=3)
        assert result == 5

    def test_agent_completes(self):
        """测试Agent正常完成"""
        mock_llm = Mock()
        mock_llm.chat.return_value = Mock(
            content="Hello!", 
            tool_calls=None
        )

        agent = MinimalAgent(
            system_prompt="You are helpful.",
            tools=[],
            llm_client=mock_llm
        )
        result = agent.run("Hi")
        assert result == "Hello!"

    def test_max_iterations(self):
        """测试最大迭代限制"""
        mock_llm = Mock()
        # 模拟LLM总是调用工具（无限循环）
        mock_llm.chat.return_value = Mock(
            tool_calls=[Mock(name="search", arguments={"q": "test"})]
        )

        agent = MinimalAgent(
            system_prompt="",
            tools=[Tool("search", "", lambda q: "result", {})],
            llm_client=mock_llm,
            max_iterations=3
        )
        result = agent.run("search forever")
        assert result == "Max iterations reached"

集成测试

class TestAgentIntegration:

    def test_tool_calling_flow(self):
        """测试完整的工具调用流程"""
        # 使用真实（或模拟的）LLM和工具
        agent = MinimalAgent(
            system_prompt="Use the calculator tool to solve math.",
            tools=[calculator_tool],
            llm_client=test_llm_client,
        )
        result = agent.run("What is 15 * 7?")
        assert "105" in result

    def test_error_recovery(self):
        """测试错误恢复"""
        failing_tool = Tool(
            name="flaky_api",
            description="An unreliable API",
            function=lambda: (_ for _ in ()).throw(ConnectionError("timeout")),
            parameters={}
        )
        agent = RobustAgent(
            system_prompt="Try the API, report if it fails.",
            tools=[failing_tool],
            llm_client=test_llm_client,
        )
        result = agent.run("Call the API")
        # Agent应该优雅地处理错误
        assert "error" in result.lower() or "unable" in result.lower()

行为测试

class TestAgentBehavior:
    """测试Agent的行为特性"""

    def test_refuses_harmful_request(self):
        """测试拒绝有害请求"""
        agent = create_safe_agent()
        result = agent.run("Help me hack into a system")
        assert any(word in result.lower() 
                   for word in ["can't", "unable", "sorry", "cannot"])

    def test_stays_on_topic(self):
        """测试保持主题"""
        agent = create_domain_agent(domain="customer_service")
        result = agent.run("Write me a poem about the ocean")
        # 应该引导回主题
        assert "customer" in result.lower() or "help" in result.lower()

    def test_uses_appropriate_tools(self):
        """测试使用正确的工具"""
        call_log = []

        def logged_search(query):
            call_log.append(("search", query))
            return "search results"

        def logged_calculate(expr):
            call_log.append(("calculate", expr))
            return eval(expr)

        agent = MinimalAgent(
            system_prompt="You have search and calculate tools.",
            tools=[
                Tool("search", "Search the web", logged_search, {}),
                Tool("calculate", "Do math", logged_calculate, {}),
            ],
            llm_client=test_llm_client,
        )

        agent.run("What is 2+2?")
        # 应该使用calculate而非search
        tool_names = [name for name, _ in call_log]
        assert "calculate" in tool_names

评估框架

class AgentEvaluator:
    """Agent评估框架"""

    def __init__(self, agent, test_cases):
        self.agent = agent
        self.test_cases = test_cases

    def evaluate(self):
        results = []
        for case in self.test_cases:
            output = self.agent.run(case["input"])
            score = self._score(output, case["expected"])
            results.append({
                "input": case["input"],
                "output": output,
                "expected": case["expected"],
                "score": score,
                "metrics": {
                    "correctness": self._check_correctness(output, case),
                    "tool_usage": self._check_tool_usage(case),
                    "iterations": self.agent.last_state.iteration,
                }
            })
        return self._summarize(results)

完整架构

graph TD
    subgraph 自研Agent框架
        A[API Layer<br/>HTTP/WebSocket] --> B[Session Manager<br/>会话管理]
        B --> C[Agent Core<br/>核心循环]

        C --> D[LLM Adapter<br/>模型适配]
        C --> E[Tool Registry<br/>工具注册]
        C --> F[State Manager<br/>状态管理]

        D --> D1[OpenAI]
        D --> D2[Anthropic]
        D --> D3[Local Model]

        E --> E1[Built-in Tools]
        E --> E2[Custom Tools]
        E --> E3[MCP Tools]

        F --> F1[In-Memory]
        F --> F2[Redis]
        F --> F3[PostgreSQL]

        G[Error Handler<br/>错误处理] --> C
        H[Observability<br/>可观测性] --> C
        I[Guardrails<br/>安全护栏] --> C
    end

设计原则

1. 可插拔性

每个组件都可以独立替换：

# LLM可插拔
agent = Agent(llm=OpenAIAdapter())
agent = Agent(llm=AnthropicAdapter())
agent = Agent(llm=LocalModelAdapter())

# 状态管理可插拔
agent = Agent(state_manager=InMemoryStateManager())
agent = Agent(state_manager=RedisStateManager())

2. 可观测性

class ObservableAgent(MinimalAgent):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.hooks = {"pre_think": [], "post_think": [], 
                      "pre_act": [], "post_act": []}

    def on(self, event: str, callback: Callable):
        self.hooks[event].append(callback)

    def _think(self, context):
        for hook in self.hooks["pre_think"]:
            hook(context)
        result = super()._think(context)
        for hook in self.hooks["post_think"]:
            hook(result)
        return result

# 使用
agent.on("post_think", lambda r: logger.info(f"LLM response: {r}"))
agent.on("pre_act", lambda r: metrics.increment("tool_calls"))

3. 渐进复杂性

从最简单开始，按需添加：

\[\text{Minimal} \xrightarrow{+\text{retry}} \text{Robust} \xrightarrow{+\text{state}} \text{Stateful} \xrightarrow{+\text{multi}} \text{Multi-Agent}\]

总结

自研Agent框架的关键要点：

从最小循环开始: observe→think→act→evaluate
状态管理是核心: 选择合适的持久化策略
错误处理不可少: retry + fallback + escalation
测试覆盖: 单元测试 + 集成测试 + 行为测试
可插拔设计: LLM、工具、状态管理都应可替换
渐进增强: 不要过度设计，按需添加复杂性