ReAct and Tool Reasoning

Overview

ReAct (Reasoning + Acting), proposed by Yao et al. (2022), is one of the most important architectural paradigms for LLM agents. It unifies reasoning (Thought) and action (Action) in an interleaved sequence, enabling LLMs to complete complex tasks through interaction with external environments. This article provides an in-depth analysis of ReAct's principles, implementation, and extensions.


1. Motivation: Unifying Reasoning and Action

1.1 Limitations of Pure Reasoning

Chain-of-Thought enables LLMs to show their reasoning process, but has fundamental limitations:

  • Stale knowledge: Knowledge in model parameters has a cutoff date
  • Factual hallucination: Reasoning may be based on incorrect premises
  • Unverifiable: Can only reason internally, cannot verify information
  • Cannot act: Can only generate text, cannot execute operations

1.2 Limitations of Pure Action

Directly having LLMs call tools (like Toolformer) also has problems:

  • Lack of planning: Does not know why a particular tool should be called
  • Cannot reason: Cannot analyze information returned by tools
  • Blind action: May repeat ineffective operations

1.3 ReAct's Core Insight

When humans solve problems, thinking and acting alternate: thinking guides action, and the results of action update thinking.

\[ \text{ReAct}: \underbrace{\text{Thought}}_{\text{Reasoning}} \rightarrow \underbrace{\text{Action}}_{\text{Acting}} \rightarrow \underbrace{\text{Observation}}_{\text{Observing}} \rightarrow \text{Thought} \rightarrow \cdots \]

2. ReAct Core Mechanism

2.1 Thought-Action-Observation Loop

graph TD
    Q[Question/Task] --> T1[Thought 1<br/>Analyze problem, formulate strategy]
    T1 --> A1[Action 1<br/>Call tool/Execute operation]
    A1 --> O1[Observation 1<br/>Tool returns result]
    O1 --> T2[Thought 2<br/>Analyze result, decide next step]
    T2 --> A2[Action 2<br/>Call another tool]
    A2 --> O2[Observation 2<br/>New result]
    O2 --> T3[Thought 3<br/>Synthesize information, draw conclusion]
    T3 --> ANS[Final Answer]

2.2 Formal Definition

Given task \(q\), ReAct generates an interleaved sequence:

\[ \tau = (t_1, a_1, o_1, t_2, a_2, o_2, \ldots, t_n, a_n, o_n) \]

where:

  • \(t_i \sim P_{\text{LLM}}(\cdot \mid q, t_1, a_1, o_1, \ldots, o_{i-1})\) -- LLM-generated thought
  • \(a_i \sim P_{\text{LLM}}(\cdot \mid q, t_1, a_1, o_1, \ldots, t_i)\) -- LLM-selected action
  • \(o_i = \text{Env}(a_i)\) -- Observation after environment executes action

Key: Thought \(t_i\) is generated internally by the LLM (not passed to the environment), used for:

  1. Tracking current progress and goals
  2. Reasoning about what action to take
  3. Analyzing the meaning of observation results
  4. Deciding whether sufficient information exists to answer

2.3 Comparison with CoT and Act-Only

| Method | Sequence Format | External Information | Reasoning Visible |
|--------|-----------------|----------------------|-------------------|
| Standard | \(q \rightarrow a\) | None | No |
| CoT | \(q \rightarrow t_1, t_2, \ldots \rightarrow a\) | None | Yes |
| Act-Only | \(q \rightarrow a_1, o_1, a_2, o_2, \ldots\) | Yes | No |
| ReAct | \(q \rightarrow t_1, a_1, o_1, t_2, a_2, o_2, \ldots\) | Yes | Yes |

3. Example Analysis

3.1 Knowledge QA Example

Question: What program runs on the device that Apple Remote can control?

Thought 1: I need to first look up what devices Apple Remote can control.
Action 1: Search[Apple Remote]
Observation 1: Apple Remote is a remote control by Apple that can control
    Front Row media software on Mac, Apple TV, and iPhone.

Thought 2: Apple Remote can control devices running Front Row.
    But what is Front Row? Let me verify.
Action 2: Search[Front Row software]
Observation 2: Front Row is a discontinued media center software by Apple
    for Mac, similar to Windows Media Center.

Thought 3: Based on the search results, the program running on
    the devices Apple Remote controls is Front Row.
Action 3: Finish[Front Row]

3.2 Comparative Analysis

Possible error with pure CoT:

Thought: Apple Remote is Apple's remote control...it probably controls iTunes...
Answer: iTunes  ← Hallucination!

Possible problem with pure Act-Only:

Action 1: Search[Apple Remote program]  ← Imprecise search term
Observation 1: (irrelevant results)
Action 2: Search[Apple Remote software]  ← Blind retry

ReAct's advantage: Thought guides more precise searches; Observation corrects possible hallucinations.


4. MRKL System

MRKL (Modular Reasoning, Knowledge and Language, pronounced "miracle"), proposed by Karpas et al. (2022), is another tool reasoning architecture.

4.1 Architecture

graph TD
    INPUT[User Input] --> ROUTER[Router/LLM]
    ROUTER --> M1[Math Module<br/>Calculator/Wolfram]
    ROUTER --> M2[Search Module<br/>Google/Wikipedia]
    ROUTER --> M3[Database Module<br/>SQL Query]
    ROUTER --> M4[Code Module<br/>Python Execution]
    ROUTER --> M5[Knowledge Module<br/>LLM's Own Knowledge]
    M1 --> AGG[Aggregator]
    M2 --> AGG
    M3 --> AGG
    M4 --> AGG
    M5 --> AGG
    AGG --> OUTPUT[Final Answer]

4.2 Differences from ReAct

| Dimension | ReAct | MRKL |
|-----------|-------|------|
| Control flow | Sequential iteration | Routing dispatch |
| Module selection | Dynamic at each step | One-time routing |
| Reasoning visibility | Thought explicitly visible | Routing decision implicit |
| Multi-step reasoning | Natively supported | Requires additional mechanisms |

5. Mathematical Analysis of Action Selection

5.1 Action Space

Define action space \(\mathcal{A} = \{a_1, a_2, \ldots, a_K\}\), where each action \(a_i\) may be:

  • A tool call (with parameters)
  • A termination action (output answer)
  • An internal action (reasoning, waiting)

5.2 Selection Probability

The LLM's probability of selecting action \(a_i\):

\[ P(a_i \mid c) = \frac{\exp(\text{logit}(a_i \mid c) / T)}{\sum_j \exp(\text{logit}(a_j \mid c) / T)} \]

where \(c\) is the current context (containing all previous thoughts, actions, and observations), and \(T\) is the temperature.
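As a sketch, the temperature-scaled softmax above can be computed directly; the logit values here are illustrative, not from any real model:

```python
import math

def action_probs(logits, T=1.0):
    """Softmax over action logits with temperature T (Section 5.2)."""
    # Subtract the max logit for numerical stability before exponentiating
    m = max(logits)
    exps = [math.exp((l - m) / T) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate actions
logits = [2.0, 1.0, 0.5]
print(action_probs(logits, T=1.0))  # peaked toward the top action
print(action_probs(logits, T=5.0))  # flatter distribution at high T
```

Raising \(T\) flattens the distribution, trading exploitation of the top-scoring action for exploration of alternatives.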

5.3 Grounding Effect

ReAct's Observations provide a grounding effect. Define reasoning accuracy:

\[ \text{Acc}_{\text{ReAct}} = P(\text{correct} \mid q, \{o_i\}) \geq P(\text{correct} \mid q) = \text{Acc}_{\text{CoT}} \]

When external observations provide relevant and correct information, fact-based reasoning is more reliable than pure internal reasoning.


6. Implementation Patterns

6.1 Prompt Template

REACT_PROMPT = """
Answer the following question by reasoning step by step
and using tools when needed.

Available tools:
- Search[query]: Search Wikipedia for information
- Lookup[keyword]: Look up a keyword in the current page
- Finish[answer]: Return the final answer

{examples}

Question: {input_question}
"""

6.2 Parsing and Execution Loop

def react_loop(question, tools, max_steps=10):
    # Fill in the template once; {examples} would hold few-shot ReAct traces
    prompt = REACT_PROMPT.format(
        examples="",  # few-shot traces go here
        input_question=question,
    )
    context = ""

    for step in range(max_steps):
        # LLM generates the next Thought + Action given the history so far
        response = llm.generate(prompt + context)
        thought, action = parse_response(response)

        context += f"Thought {step+1}: {thought}\n"
        context += f"Action {step+1}: {action}\n"

        # Termination action: return the final answer
        if action.startswith("Finish"):
            return extract_answer(action)

        # Execute the action against the environment to get an observation
        observation = execute_tool(action, tools)
        context += f"Observation {step+1}: {observation}\n"

    return "Maximum steps reached"
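The loop above leaves `parse_response` and `extract_answer` unspecified. A minimal regex-based sketch, assuming the model emits `Thought:`/`Action:` lines in the format of Section 3's examples:

```python
import re

def parse_response(response):
    """Extract the first Thought/Action pair from the model's output.

    Assumes lines like "Thought 1: ..." and "Action 1: Search[query]".
    """
    thought_m = re.search(r"Thought(?:\s*\d+)?:\s*(.*)", response)
    action_m = re.search(r"Action(?:\s*\d+)?:\s*(.*)", response)
    if not action_m:
        raise ValueError("No Action found in model output")
    thought = thought_m.group(1).strip() if thought_m else ""
    return thought, action_m.group(1).strip()

def extract_answer(action):
    """Pull the answer out of a Finish[...] action."""
    m = re.match(r"Finish\[(.*)\]", action)
    return m.group(1) if m else action

print(parse_response("Thought 1: look it up\nAction 1: Search[Apple Remote]"))
print(extract_answer("Finish[Front Row]"))  # → Front Row
```

Production parsers are usually stricter (e.g. rejecting outputs with multiple actions), but the structure is the same.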

6.3 Tool Definition Best Practices

tools = [
    {
        "name": "search",
        "description": "Search for information on the web. "
                      "Use when you need factual information.",
        "parameters": {
            "query": {
                "type": "string",
                "description": "The search query"
            }
        }
    },
    {
        "name": "calculator",
        "description": "Perform mathematical calculations. "
                      "Use when you need precise arithmetic.",
        "parameters": {
            "expression": {
                "type": "string",
                "description": "Mathematical expression to evaluate"
            }
        }
    }
]
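Schemas like these need a dispatcher that maps tool names to callables and turns failures into observations the model can read. A sketch, with illustrative tool bodies and a call signature chosen for this example:

```python
def search(query):
    # Placeholder: a real implementation would call a search API
    return f"(results for: {query})"

def calculator(expression):
    # Restricted eval for arithmetic only; a real tool should use a
    # proper expression parser instead of eval
    return str(eval(expression, {"__builtins__": {}}, {}))

TOOL_REGISTRY = {"search": search, "calculator": calculator}

def execute_tool(name, **kwargs):
    """Dispatch a tool call by name, returning the observation string.

    Errors are returned as observations rather than raised, so the
    agent can see the failure and recover in its next Thought.
    """
    if name not in TOOL_REGISTRY:
        return f"Error: unknown tool '{name}'"
    try:
        return TOOL_REGISTRY[name](**kwargs)
    except Exception as e:
        return f"Error: {e}"

print(execute_tool("calculator", expression="2 + 3 * 4"))  # → 14
```

Returning errors as observations, instead of crashing the loop, is what lets ReAct self-correct after a malformed tool call.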

7. ReAct Variants and Extensions

7.1 ReAct + Self-Consistency

Run ReAct multiple times and majority-vote on final answers:

\[ a^* = \arg\max_a \sum_{i=1}^{K} \mathbb{1}[\text{ReAct}_i(q) = a] \]
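The vote can be implemented with a counter over sampled runs. `run_react` below stands for any ReAct runner that returns an answer string (sampled at temperature > 0 so runs differ); the run outputs shown are hypothetical:

```python
from collections import Counter

def self_consistent_answer(run_react, question, k=5):
    """Run ReAct k times and majority-vote the final answers."""
    answers = [run_react(question) for _ in range(k)]
    best, votes = Counter(answers).most_common(1)[0]
    return best, votes / k  # winning answer and its vote share

# Hypothetical outputs from five independent runs:
fake_runs = iter(["Front Row", "iTunes", "Front Row", "Front Row", "QuickTime"])
answer, share = self_consistent_answer(lambda q: next(fake_runs), "q", k=5)
print(answer, share)  # → Front Row 0.6
```

The vote share doubles as a rough confidence signal: a low share suggests the runs disagree and the answer deserves scrutiny.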

7.2 ReAct + Reflexion

Reflect after failure, storing experience in memory:

[First attempt fails]
Reflection: Last time I searched with incorrect keywords, leading to
            irrelevant information. Next time I should analyze the key
            entities in the problem first, then construct search terms.
[Second attempt, with reflection memory]

7.3 ReWOO (Reasoning WithOut Observation)

Xu et al. (2023) proposed generating a complete reasoning plan first (without observations), then executing in batch:

\[ \text{Plan}: (t_1, a_1), (t_2, a_2), \ldots \quad \text{(generated at once)} \]
\[ \text{Execute}: o_1, o_2, \ldots \quad \text{(batch execution)} \]
\[ \text{Solve}: \text{answer} = \text{LLM}(\text{Plan}, \text{Observations}) \]

Advantage: reduces the number of LLM calls and token usage; best suited to tasks whose tool calls can be planned up front, without depending on intermediate observations.
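A minimal sketch of the plan-execute-solve flow. `plan_fn` and `solve_fn` are stubs standing in for the planner and solver LLM calls; the `#E1`-style evidence variables follow the convention used in the ReWOO paper:

```python
def rewoo(question, plan_fn, tools, solve_fn):
    """ReWOO sketch: plan once, run all tool calls, then solve."""
    # 1. Planner: a single call yields every (tool, argument, evidence-var)
    #    triple up front, with no intermediate observations.
    plan = plan_fn(question)

    # 2. Worker: execute the calls, substituting earlier evidence
    #    variables into later arguments.
    evidence = {}
    for tool, arg, var in plan:
        for k, v in evidence.items():
            arg = arg.replace(k, v)
        evidence[var] = tools[tool](arg)

    # 3. Solver: one final call sees the plan plus all observations.
    return solve_fn(question, plan, evidence)

# Toy stubs for illustration:
plan_fn = lambda q: [("Search", "Apple Remote", "#E1"),
                     ("Search", "#E1 software", "#E2")]
tools = {"Search": lambda q: f"result({q})"}
solve_fn = lambda q, p, e: e["#E2"]
print(rewoo("demo", plan_fn, tools, solve_fn))
# → result(result(Apple Remote) software)
```

Note the trade-off versus ReAct: only three LLM calls total (here, two stubs), but no chance to revise the plan if an observation contradicts it.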

Cross-Reference

For a detailed discussion of tool use, see Tool Use Survey.


8. Limitations and Challenges

  1. Context window: Long interaction histories consume many tokens, potentially exceeding window limits
  2. Tool selection errors: LLM may select the wrong tool or pass incorrect parameters
  3. Infinite loops: May get stuck in repetitive think-act cycles
  4. Observation noise: Information returned by tools may be inaccurate or irrelevant
  5. Planning depth: ReAct is essentially greedy, lacking global planning
  6. Cost: Each step requires LLM reasoning, consuming many tokens

References

  1. Yao, S. et al. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023.
  2. Karpas, E. et al. (2022). MRKL Systems: A Modular, Neuro-Symbolic Architecture that Combines Large Language Models, External Knowledge Sources and Discrete Reasoning. arXiv:2205.00445.
  3. Schick, T. et al. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. NeurIPS 2023.
  4. Xu, B. et al. (2023). ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models. arXiv:2305.18323.
  5. Nakano, R. et al. (2021). WebGPT: Browser-assisted Question-answering with Human Feedback. arXiv:2112.09332.
