AI Agent Introduction

An AI agent comprises two layers:

  • Agentic Frameworks (Orchestration Layer): This layer serves as the agent's brain, defining its goals, roles, memory, and planning capabilities. It relies on prompt-engineering techniques such as ReAct and Chain of Thought to decompose complex tasks into a series of executable steps. The most widely adopted framework today is LangChain.
  • Action Layer: After the agent completes its orchestration and planning, it invokes tools in the action layer to interact with the external world (e.g., browsers or APIs).

AI Agents and Embodied Intelligence are two overlapping but independent research directions. They share foundational concepts (such as planning, memory, and tool use) but differ fundamentally in core problems, technology stacks, and evaluation methods:

  • AI Agents primarily operate in digital environments (APIs, web, code), with core challenges in reasoning, planning, and tool orchestration
  • Embodied Intelligence operates in 3D space through physical or virtual bodies, with core challenges in perception-action loops, continuous control, and physical interaction

These notes primarily focus on Autonomous Task-Solving AI Agents — software agents that orchestrate workflows and accomplish tasks in digital environments using AI technologies. This is the type of agent most commonly discussed in the industry today. Embodied intelligence is covered as an independent top-level section.

For detailed discussion of embodied intelligence, please refer to the Embodied Intelligence section of this site.

Throughout this site, unless otherwise specified, "AI Agent" refers to disembodied, task-oriented agents.

Lilian Weng, a former scientist at OpenAI, proposed a formula that has been widely cited across the entire industry and can be regarded as the most broadly accepted definition of AI agents today:

AI Agent = LLM + Planning + Memory + Tools

Let us now examine each component of this formula in detail.


Planning

Planning is one of the core capabilities of an agent, endowing it with the ability to decompose complex tasks into manageable subtasks.

Task Decomposition

An agent needs to break down a high-level user request into a series of specific, executable sub-goals. Common decomposition strategies include:

  • Top-down decomposition: Formulate a macro-level plan first, then progressively refine each step
  • Recursive decomposition: Split a large task into subtasks, which are further split into smaller subtasks, until each is simple enough to execute directly
  • Dependency-based decomposition: Analyze dependencies among subtasks, construct a task DAG (Directed Acyclic Graph), and identify steps that can be parallelized
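The dependency-based strategy can be sketched as a topological layering of the task DAG: any subtask whose dependencies are all satisfied joins the current stage, and subtasks within a stage can run in parallel. A minimal sketch (the task names are illustrative):

```python
from collections import defaultdict

def parallel_stages(deps):
    """Group subtasks into stages: every task in a stage depends only on
    tasks from earlier stages, so tasks within a stage can run in parallel.
    `deps` maps each task to the set of tasks it depends on."""
    indegree = {t: len(d) for t, d in deps.items()}  # unfinished dependencies
    dependents = defaultdict(list)
    for task, d in deps.items():
        for dep in d:
            dependents[dep].append(task)

    stage = [t for t, n in indegree.items() if n == 0]
    stages = []
    while stage:
        stages.append(sorted(stage))
        next_stage = []
        for task in stage:
            for child in dependents[task]:
                indegree[child] -= 1
                if indegree[child] == 0:
                    next_stage.append(child)
        stage = next_stage
    if sum(len(s) for s in stages) != len(deps):
        raise ValueError("dependency cycle detected")
    return stages

# Example: "write a report" decomposed into subtasks with dependencies
deps = {
    "gather_data": set(),
    "outline": set(),
    "analyze": {"gather_data"},
    "draft": {"outline", "analyze"},
    "review": {"draft"},
}
print(parallel_stages(deps))
# → [['gather_data', 'outline'], ['analyze'], ['draft'], ['review']]
```

The first stage shows the parallelization opportunity: gathering data and outlining have no mutual dependency, so an agent could dispatch both at once.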

Reasoning Patterns

An agent's planning capability relies on underlying reasoning patterns. The mainstream patterns include:

  • CoT (Chain of Thought): Decomposing problems through step-by-step reasoning
  • ReAct (Reason + Act): Alternating between reasoning and acting, with each step informed by the observation from the previous one
  • Plan-and-Execute: Generating a complete plan first, then executing step by step, with dynamic adjustments as needed

Detailed coverage of these reasoning patterns can be found in the Reasoning Patterns note in this chapter.
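The ReAct pattern's alternation of reasoning and acting can be sketched as a loop that feeds each tool observation back into the transcript before the next model decision. Here `scripted_llm` and the tuple-shaped decisions are stand-ins for a real model call, not any actual API:

```python
def react_loop(task, llm, tools, max_steps=5):
    """Minimal ReAct loop: the model alternates between deciding to act
    and finishing, and each tool observation is appended to the transcript
    before the next step. `llm(transcript)` is expected to return either
    ("act", tool_name, tool_input) or ("finish", answer)."""
    transcript = [f"Task: {task}"]
    for _ in range(max_steps):
        decision = llm(transcript)
        if decision[0] == "finish":
            return decision[1]
        _, tool_name, tool_input = decision
        observation = tools[tool_name](tool_input)  # act, then observe
        transcript.append(f"Action: {tool_name}({tool_input})")
        transcript.append(f"Observation: {observation}")
    return None  # step budget exhausted

# Scripted stand-in model: look up a fact, then answer from the observation.
def scripted_llm(transcript):
    if not any(line.startswith("Observation:") for line in transcript):
        return ("act", "search", "capital of France")
    return ("finish", transcript[-1].removeprefix("Observation: "))

tools = {"search": lambda q: "Paris"}
print(react_loop("What is the capital of France?", scripted_llm, tools))
# → Paris
```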

Sub-goal Setting and Adjustment

A capable agent not only sets sub-goals but also dynamically adjusts its plan based on feedback during execution. This includes:

  • Deciding whether to retry, skip, or revise subsequent plans when a subtask fails
  • Updating its understanding of the task and modifying the plan when new information is discovered
  • Continuously optimizing strategy across multiple interaction rounds
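The retry-or-skip decision for failing subtasks can be sketched as a small executor; real agents would typically ask the LLM to revise the plan rather than apply a fixed policy, so this is only the mechanical skeleton:

```python
def execute_plan(plan, run, max_retries=1):
    """Execute subtasks in order, retrying a failed step up to
    `max_retries` times and skipping it if it keeps failing.
    `run(step)` raises on failure; returns (results, skipped)."""
    results, skipped = {}, []
    for step in plan:
        for attempt in range(max_retries + 1):
            try:
                results[step] = run(step)
                break
            except Exception:
                if attempt == max_retries:
                    skipped.append(step)
    return results, skipped

# One step that fails once then succeeds, and one that always fails.
attempts = {"flaky": 0}
def run(step):
    if step == "flaky":
        attempts["flaky"] += 1
        if attempts["flaky"] == 1:
            raise RuntimeError("transient error")
    if step == "broken":
        raise RuntimeError("permanent error")
    return f"{step}: ok"

print(execute_plan(["fetch", "flaky", "broken"], run))
# → ({'fetch': 'fetch: ok', 'flaky': 'flaky: ok'}, ['broken'])
```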

Memory

The memory system is critical for an agent's ability to handle complex tasks and maintain contextual coherence. It can be categorized by time horizon and storage mechanism into short-term memory and long-term memory.

Short-term Memory

Short-term memory is essentially the LLM's context window. It includes:

  • The conversation history of the current session
  • Roles and instructions pre-configured in the system prompt
  • Intermediate results and observations from the current task

The primary limitation of short-term memory is the length of the context window. Even though current models support 128K tokens or more, processing extremely long conversations or large volumes of tool-call results can still lead to information loss.
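A common mitigation is to trim the history to a token budget, always keeping the system prompt and then the most recent turns that fit. A minimal sketch, approximating token count with word count (a real system would use the model's tokenizer):

```python
def trim_history(system_prompt, history, budget, count_tokens=None):
    """Fit a conversation into a token budget: always keep the system
    prompt, then keep the most recent turns that still fit."""
    count = count_tokens or (lambda text: len(text.split()))  # crude proxy
    remaining = budget - count(system_prompt)
    kept = []
    for turn in reversed(history):       # walk from newest to oldest
        cost = count(turn)
        if cost > remaining:
            break
        kept.append(turn)
        remaining -= cost
    return [system_prompt] + kept[::-1]  # restore chronological order

history = [
    "user: hello there",                          # 3 "tokens"
    "assistant: hi, how can I help you today",    # 8 "tokens"
    "user: summarize this report please",         # 5 "tokens"
]
print(trim_history("system: be concise", history, budget=12))
# → ['system: be concise', 'user: summarize this report please']
```

Note the failure mode this illustrates: the older turns are silently dropped, which is exactly the information loss described above. Summary compression (covered below under memory management) is the usual complement.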

Long-term Memory

Long-term memory enables an agent to store and retrieve information across sessions. The core technologies include:

  • Vector databases: Convert text into vector embeddings and store them in databases (e.g., Pinecone, Weaviate, ChromaDB), retrieving relevant information through similarity search
  • RAG (Retrieval-Augmented Generation): Inject retrieved relevant documents into the LLM's context, enabling the model to reason over external knowledge. RAG is essentially a mechanism for providing long-term memory to LLMs
  • Structured storage: Use relational databases or knowledge graphs to store structured facts and relationships
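The vector-database mechanism can be shown end to end with a toy in-memory store. The bag-of-words `embed` function and its five-word vocabulary are stand-ins for a real embedding model; only the store-then-retrieve-by-cosine-similarity flow carries over to production systems:

```python
import math

def embed(text):
    """Toy bag-of-words embedding over a tiny fixed vocabulary; a real
    system would call an embedding model instead."""
    vocab = ["deadline", "project", "coffee", "friday", "report"]
    words = text.lower().split()
    return [words.count(w) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

class VectorMemory:
    """Long-term memory as (embedding, text) pairs with similarity search."""
    def __init__(self):
        self.entries = []

    def store(self, text):
        self.entries.append((embed(text), text))

    def retrieve(self, query, k=1):
        scored = sorted(self.entries,
                        key=lambda e: cosine(e[0], embed(query)),
                        reverse=True)
        return [text for _, text in scored[:k]]

memory = VectorMemory()
memory.store("the project report is due friday")
memory.store("the user prefers coffee in the morning")
print(memory.retrieve("when is the project deadline?"))
# → ['the project report is due friday']
```

RAG is then the step not shown here: the retrieved text is injected into the LLM's prompt before generation.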

Memory Management Strategies

In production systems, memory management also involves the following strategies:

  • Summary compression: Summarize overly long conversation histories while retaining key information
  • Importance scoring: Assign importance weights to different memory entries, prioritizing retrieval of high-importance content
  • Forgetting mechanisms: Retire outdated or low-relevance memory entries
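Importance scoring and forgetting are often combined by decaying each entry's importance with age and retiring the lowest-scoring entries. A sketch under assumed parameters (the exponential decay and one-hour half-life are illustrative heuristics, not a standard):

```python
import math
import time

def prune_memory(entries, capacity, now=None):
    """Keep the `capacity` highest-scoring entries, where the score is
    importance decayed exponentially by age (half-life chosen arbitrarily)."""
    now = now if now is not None else time.time()
    half_life = 3600.0  # one hour, in seconds

    def score(entry):
        age = now - entry["timestamp"]
        return entry["importance"] * math.exp(-math.log(2) * age / half_life)

    return sorted(entries, key=score, reverse=True)[:capacity]

entries = [
    {"text": "user's name is Alice",   "importance": 0.9, "timestamp": 0},
    {"text": "weather was cloudy",     "importance": 0.2, "timestamp": 7000},
    {"text": "prefers short answers",  "importance": 0.8, "timestamp": 7000},
]
kept = prune_memory(entries, capacity=2, now=7200)
print([e["text"] for e in kept])
# → ['prefers short answers', "user's name is Alice"]
```

The example shows the intended trade-off: the high-importance but two-hour-old fact about Alice survives, while the recent but trivial weather note is forgotten.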

Tool Use

Tool use is the critical bridge that takes an agent from "pure reasoning" to "real-world action." By invoking external tools, an agent's capabilities are vastly extended.

Function Calling

Function Calling is the standard mechanism for LLM-to-tool interaction today. The workflow is:

  1. Define available tools in the system prompt (function names, parameter schemas, capability descriptions)
  2. The LLM decides whether a tool call is needed based on the user's request and generates a structured function call
  3. An external system executes the function call and returns the result
  4. The LLM integrates the tool's response into its answer
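The four steps above can be sketched as a single round trip. The dict shapes used here (`{"tool_call": ...}` / `{"content": ...}`) are a simplified stand-in that mirrors the general pattern, not any provider's exact schema, and `scripted_llm` substitutes for a real model call:

```python
import json

def run_with_tools(user_message, llm, tools):
    """One round of the function-calling workflow: the model either answers
    directly or emits a structured tool call, whose result is fed back to
    the model for the final answer."""
    messages = [{"role": "user", "content": user_message}]
    reply = llm(messages)
    if "tool_call" in reply:                      # step 2: model requests a tool
        call = reply["tool_call"]
        args = json.loads(call["arguments"])      # arguments arrive as JSON text
        result = tools[call["name"]](**args)      # step 3: executed externally
        messages.append({"role": "tool", "name": call["name"],
                         "content": json.dumps(result)})
        reply = llm(messages)                     # step 4: model uses the result
    return reply["content"]

# Scripted stand-in model and a single (hypothetical) weather tool.
def scripted_llm(messages):
    if messages[-1]["role"] == "user":
        return {"tool_call": {"name": "get_weather",
                              "arguments": '{"city": "Tokyo"}'}}
    result = json.loads(messages[-1]["content"])
    return {"content": f"It is {result['temp_c']}°C in Tokyo."}

tools = {"get_weather": lambda city: {"temp_c": 21}}
print(run_with_tools("What's the weather in Tokyo?", scripted_llm, tools))
# → It is 21°C in Tokyo.
```

Step 1 (declaring the tool schemas to the model) is omitted here because the scripted model needs no declaration; with a real provider the schemas would be passed alongside the messages.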

Major LLM providers including OpenAI, Anthropic, and Google all support native Function Calling capabilities.

Common Tool Types

Commonly used tool types for agents include:

  • API calls: Accessing external services for weather, news, financial data, etc.
  • Code execution: Running computations and data analysis via code interpreters (e.g., Python REPL)
  • Search engines: Retrieving real-time information through Google, Bing, etc.
  • File operations: Reading from and writing to file systems, operating databases
  • Browser automation: Automating web browsing, form filling, and information extraction

Tool Selection and Orchestration

When a large number of tools are available, an agent needs tool selection capability — choosing the most appropriate tool from its toolbox based on current task requirements. Advanced agent systems also support tool orchestration, combining multiple tool calls into a cohesive workflow.
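As a rough illustration of pre-filtering a toolbox, the sketch below scores tools by word overlap between the task and each tool's description. Real systems typically let the LLM choose directly or pre-filter with embedding similarity; the overlap heuristic and the tool names here are stand-ins:

```python
def select_tool(task, tool_descriptions):
    """Pick the tool whose description shares the most words with the task.
    A crude stand-in for embedding-similarity or LLM-driven selection."""
    task_words = set(task.lower().split())

    def overlap(item):
        return len(task_words & set(item[1].lower().split()))

    return max(tool_descriptions.items(), key=overlap)[0]

tool_descriptions = {
    "web_search": "search the web for recent news and information",
    "python_repl": "run python code for computation and data analysis",
    "read_file": "read the contents of a local file",
}
print(select_tool("run some python code to analyze this data", tool_descriptions))
# → python_repl
```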


Multi-Agent Collaboration

When a single agent's capabilities are insufficient for a complex task, multiple agents can be introduced to collaborate, each focusing on its area of expertise.

Role Specialization

In multi-agent systems, different agents typically assume distinct roles:

  • Manager Agent: Responsible for task allocation and global coordination
  • Specialist Agent: Handles domain-specific tasks (e.g., code writing, document search, data analysis)
  • Critic Agent: Reviews the outputs of other agents and provides feedback

Communication Mechanisms

The primary communication methods between multiple agents include:

  • Shared message queue: All agents communicate through a shared message channel
  • Point-to-point communication: Agents send messages directly to one another
  • Blackboard pattern: Agents write intermediate results to a shared space; other agents read as needed
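The blackboard pattern is the simplest of the three to sketch: agents never address each other directly, only the shared space. The two agent functions below are hypothetical placeholders for LLM-backed agents:

```python
class Blackboard:
    """Shared workspace for the blackboard pattern: agents post
    intermediate results under a key; other agents read what they need."""
    def __init__(self):
        self.space = {}

    def write(self, key, value, author):
        self.space[key] = {"value": value, "author": author}

    def read(self, key):
        entry = self.space.get(key)
        return entry["value"] if entry else None

board = Blackboard()

def researcher(board):
    # A specialist agent posts its findings to the shared space.
    board.write("findings", "vector databases enable long-term memory", "researcher")

def writer(board):
    # Another agent reads those findings and posts a derived result.
    findings = board.read("findings")
    board.write("draft", f"Summary: {findings}", "writer")

researcher(board)
writer(board)
print(board.read("draft"))
# → Summary: vector databases enable long-term memory
```

The decoupling is the point: the writer does not know which agent produced the findings, so agents can be added or swapped without changing the others.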

Representative Frameworks

  • AutoGen (Microsoft): An open-source framework supporting multi-agent conversations and collaboration, where agents interact through dialogue
  • CrewAI: A role-playing-based multi-agent framework where each agent has a defined role, goal, and backstory
  • MetaGPT: Simulates the organizational structure of a software company, with agents playing roles such as product manager, architect, and engineer

Agent Framework Overview

Current mainstream agent frameworks each have their own focus:

| Framework | Core Focus | Key Features |
| --- | --- | --- |
| LangChain / LangGraph | General-purpose agent orchestration | Most mature ecosystem; rich tool integrations; LangGraph supports stateful multi-step workflows |
| LlamaIndex | Data-driven agents | Excels at RAG and data indexing; well-suited for knowledge-intensive agents |
| AutoGPT | Autonomous agent exploration | Early representative of autonomous agents; emphasizes end-to-end autonomous task completion |
| Claude Code | Software engineering agent | Anthropic's coding agent with deep integration of code understanding and generation capabilities |
| OpenAI Agents SDK | Lightweight agent development | OpenAI's official framework providing Swarm-style multi-agent orchestration |
| Dify / Coze | Low-code agent platforms | Visual workflow builders for agents, lowering the development barrier |

Factors to consider when choosing a framework include: task complexity, whether multi-agent collaboration is needed, tool integration requirements, deployment environment, and the team's technology stack preferences.

In practice, many teams also choose to forgo specific frameworks entirely, building custom agent systems directly on top of LLM APIs and Function Calling to achieve greater flexibility and control.

