
Reflection and Self-Improvement

Overview

Reflection is a key mechanism for agents to learn from experience. Unlike traditional parameter updates, LLM agent reflection improves future behavior through verbalized experience summaries. This article provides an in-depth analysis of Reflexion, Self-Refine, self-debugging, and other methods, exploring the new paradigm of "verbal reinforcement learning."


1. Why Is Reflection Needed?

1.1 The Learning Dilemma of LLM Agents

Traditional RL agents learn through gradient updates:

\[ \theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta_t) \]

But LLM agents face challenges:

| Problem | Description |
| --- | --- |
| Frozen parameters | Weights are typically not updated after deployment |
| Expensive fine-tuning | Fine-tuning large models is extremely costly |
| Catastrophic forgetting | Fine-tuning may harm general capabilities |
| Sample efficiency | RL fine-tuning requires extensive interaction |

Reflection's solution: Rather than updating parameters, store experience as natural language for use in future contexts.

1.2 Cognitive Basis of Reflection

Reflection corresponds to the human capacity for metacognition:

  • Monitoring: Evaluating the quality of one's own reasoning and actions
  • Regulation: Adjusting strategies based on evaluation results
  • Experience accumulation: Converting lessons into knowledge usable in the future

2. Reflexion

2.1 Core Idea

Reflexion, proposed by Shinn et al. (2023), replaces the scalar reward signal of traditional RL with verbalized reflection:

\[ \text{Traditional RL}: r \in \mathbb{R} \quad \longrightarrow \quad \text{Reflexion}: r \in \text{Natural Language} \]
graph TD
    TASK[Task] --> ACTOR[Actor<br/>Executing Agent]
    ACTOR --> TRAJ[Trajectory<br/>Action Sequence + Results]
    TRAJ --> EVAL[Evaluator]
    EVAL --> |Success| DONE[Complete]
    EVAL --> |Failure| REF[Self-Reflection]
    REF --> MEM[Experience Memory<br/>Store Reflections]
    MEM --> ACTOR

2.2 Three Core Components

Actor

Executes actions based on current memory and task description:

\[ a_t = \text{Actor}(\text{task}, \text{memory}, o_t) \]

The Actor uses a ReAct-style reasoning and action loop.
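
A minimal sketch of one Actor step follows. The prompt wording and the `llm_complete(prompt) -> str` client are assumptions for illustration, not the paper's exact implementation; the point is that stored reflections are injected verbatim into the context.

def actor_step(task: str, memory: list[str], observation: str, llm_complete) -> str:
    """One Actor step: a_t = Actor(task, memory, o_t)."""
    # Reflections from earlier trials are spliced directly into the prompt.
    lessons = "\n".join(f"- {m}" for m in memory) or "(none yet)"
    prompt = (
        f"Task: {task}\n"
        f"Lessons from previous attempts:\n{lessons}\n"
        f"Current observation: {observation}\n"
        "Think step by step, then output the next action as 'Action: ...'."
    )
    return llm_complete(prompt)  # assumed generic text-completion client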

Evaluator

Evaluates the success of the execution trajectory:

\[ \text{score} = \text{Evaluator}(\tau) \]

Evaluation methods depend on task type:

| Task Type | Evaluation Method |
| --- | --- |
| Programming | Run test cases and count the pass rate |
| Decision-making | Check whether the final state achieves the goal |
| Reasoning | Compare the answer with ground truth |
| Open-ended | LLM-as-Judge evaluation |
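
For programming tasks, a concrete Evaluator can simply run the candidate code against unit tests and report the pass rate. The sketch below is a simplified stand-in: the unsandboxed `exec`-based harness and the assert-string test format are assumptions for illustration.

def evaluate_code(code: str, test_cases: list[str]) -> tuple[bool, str]:
    """Return (success, feedback) by running assert-statement test cases."""
    namespace = {}
    try:
        exec(code, namespace)  # define the candidate function(s); no sandboxing here
    except Exception as e:
        return False, f"Code failed to execute: {e}"

    failures = []
    for test in test_cases:
        try:
            exec(test, namespace)  # each test is an executable assert statement
        except Exception as e:
            failures.append(f"{test} -> {type(e).__name__}: {e}")

    passed = len(test_cases) - len(failures)
    feedback = f"{passed}/{len(test_cases)} tests passed. " + "; ".join(failures)
    return not failures, feedback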

Self-Reflection

Generates verbalized experience summaries:

Reflection: In my last attempt, I made the following mistakes:
1. I searched for "Apple Remote controls" instead of "Apple Remote,"
   leading to imprecise search results.
2. At step three, I skipped the verification step and directly gave an answer.
Improvement strategy:
1. Use more concise search keywords.
2. Verify key information before giving a final answer.
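
The reflection itself is just another LLM call over the failed trajectory and the evaluator's feedback. A minimal prompt-construction sketch (the template wording is an assumption):

def reflect(task: str, trajectory: str, feedback: str, llm_complete) -> str:
    """Turn a failed trajectory plus feedback into a reusable verbal lesson."""
    prompt = (
        f"Task: {task}\n"
        f"Your previous attempt:\n{trajectory}\n"
        f"Evaluator feedback: {feedback}\n"
        "In a few sentences, diagnose what went wrong and state a concrete "
        "strategy to follow on the next attempt."
    )
    return llm_complete(prompt)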

2.3 Reflexion Algorithm

Algorithm: Reflexion
Input: Task task, Maximum attempts max_trials

memory = []

for trial in 1..max_trials:
    # Execute task (with memory)
    trajectory = Actor.execute(task, memory)

    # Evaluate result
    success, feedback = Evaluator.evaluate(trajectory)

    if success:
        return trajectory  # Success!

    # Reflect on failure
    reflection = SelfReflection.reflect(
        task, trajectory, feedback, memory
    )

    # Store reflection in memory
    memory.append(reflection)

return "Max trials exceeded"

2.4 Experimental Results

| Benchmark | Baseline (Pass@1) | Reflexion (Pass@1) |
| --- | --- | --- |
| HumanEval (programming) | 80.1% | 91.0% |
| MBPP (programming) | 73.1% | 77.1% |
| ALFWorld (decision-making) | 75.0% | 97.0% |
| HotpotQA (reasoning) | 29.4% | 51.0% |

Key Insight

The core contribution of Reflexion is not retrying per se (naive retries already yield some improvement), but that structured reflection makes each retry more directed than the last.


3. Self-Refine

3.1 Core Idea

Self-Refine, proposed by Madaan et al. (2023), is a simpler iterative improvement framework:

\[ \text{Generate} \rightarrow \text{Feedback} \rightarrow \text{Refine} \rightarrow \text{Feedback} \rightarrow \text{Refine} \rightarrow \cdots \]
graph LR
    GEN[Initial Generation<br/>y₀ = LLM(x)] --> FB[Self-Feedback<br/>f = LLM(x, y₀)]
    FB --> REF[Refinement<br/>y₁ = LLM(x, y₀, f)]
    REF --> FB2[Self-Feedback<br/>f' = LLM(x, y₁)]
    FB2 --> |Below standard| REF2[Refinement<br/>y₂ = LLM(x, y₁, f')]
    FB2 --> |Meets standard| DONE[Output y₁]
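
A minimal Self-Refine loop might look like the sketch below; the stopping convention (the critic replying "GOOD ENOUGH") and the prompt wording are assumptions, not the paper's exact protocol.

def self_refine(x: str, llm_complete, max_iters: int = 4) -> str:
    """Generate -> self-feedback -> refine, until the critic is satisfied."""
    y = llm_complete(f"Task: {x}\nProduce an initial answer.")
    for _ in range(max_iters):
        feedback = llm_complete(
            f"Task: {x}\nDraft:\n{y}\n"
            "Critique this draft. If no changes are needed, reply 'GOOD ENOUGH'."
        )
        if "GOOD ENOUGH" in feedback:  # assumed stopping signal
            break
        y = llm_complete(
            f"Task: {x}\nDraft:\n{y}\nFeedback:\n{feedback}\n"
            "Rewrite the draft, addressing the feedback."
        )
    return y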

3.2 Differences from Reflexion

| Dimension | Reflexion | Self-Refine |
| --- | --- | --- |
| Scope | Agent tasks (multi-step) | Single-generation tasks |
| Feedback source | Environment execution results | LLM self-evaluation |
| Memory | Cross-trial memory | Current iteration only |
| Tool use | Yes | Usually no |
| Typical applications | Programming, decision-making, QA | Writing, translation, code optimization |

3.3 Limitations of Self-Refine

  1. Self-evaluation accuracy: LLM may not detect its own errors
  2. Over-modification: May change correct parts to incorrect ones
  3. Convergence issues: No theoretical guarantee of converging to a better solution

4. Verbal Reinforcement Learning

4.1 Conceptual Framework

Reflexion pioneered the "Verbal Reinforcement Learning" paradigm:

Traditional RL:

\[ \text{State} \xrightarrow{\text{Policy}} \text{Action} \xrightarrow{\text{Environment}} \text{Reward} \xrightarrow{\text{Gradient}} \text{Parameter Update} \]

Verbal RL:

\[ \text{Context} \xrightarrow{\text{LLM}} \text{Action} \xrightarrow{\text{Environment}} \text{Feedback} \xrightarrow{\text{Reflection}} \text{Memory Update} \]

4.2 Comparative Analysis

| Dimension | Traditional RL | Verbal RL |
| --- | --- | --- |
| State representation | Vectors | Natural language |
| Policy update | Gradient descent | In-context experience |
| Reward signal | Scalar | Language description |
| Learning speed | Slow (many interactions required) | Fast (few attempts) |
| Generalization | In-domain | Cross-domain (language generalizes) |
| Forgetting | Catastrophic forgetting | Controlled by memory management |

4.3 Storage and Retrieval of Verbal Experience

\[ \text{Retrieve}(q) = \arg\max_{m \in \mathcal{M}} \text{sim}(\text{embed}(q), \text{embed}(m)) \cdot \text{recency}(m) \]

where \(\mathcal{M}\) is the experience memory store and \(q\) is the current query.
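
A sketch of this retrieval rule, assuming an `embed(text) -> list[float]` function and a timestamp stored with each memory entry; the exponential decay is one possible choice of recency weighting, not prescribed by the formula.

import math
import time

def retrieve(query: str, memories: list[dict], embed, half_life_s: float = 3600.0) -> dict:
    """Return the memory maximizing similarity(q, m) * recency(m)."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    q_vec = embed(query)
    now = time.time()

    def score(m):
        recency = 0.5 ** ((now - m["timestamp"]) / half_life_s)  # assumed decay
        return cosine(q_vec, embed(m["text"])) * recency

    return max(memories, key=score)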

Cross-Reference

For a detailed discussion of memory systems, see Episodic and Semantic Memory.


5. Self-Debugging

5.1 Code Self-Debugging

Chen et al. (2024) proposed having LLMs debug their own generated code through execution feedback:

Generate code → Run tests → Analyze errors → Fix code → Re-test

5.2 Debugging Strategies

| Strategy | Description | Applicable Scenario |
| --- | --- | --- |
| Simple feedback | Pass error messages back to the LLM | Compilation errors, runtime exceptions |
| Unit testing | Verify with test cases | Function-level debugging |
| Explanation tracing | Have the LLM explain the code's logic | Logic errors |
| Divide-and-conquer | Verify function by function | Complex programs |
| Rubber-duck debugging | Have the LLM explain the code line by line | Subtle logic errors |

5.3 Debugging Loop

def self_debug(task, test_cases, max_attempts=5):
    # llm and execute are assumed interfaces for code generation and test execution
    code = llm.generate_code(task)

    for attempt in range(max_attempts):
        result = execute(code, test_cases)

        if result.all_passed:
            return code

        # Analyze error
        error_analysis = llm.analyze_error(
            task=task,
            code=code,
            error=result.error,
            failed_tests=result.failed_tests
        )

        # Fix code
        code = llm.fix_code(
            task=task,
            code=code,
            analysis=error_analysis
        )

    return code  # Return last version

6. Advanced Reflection Patterns

6.1 Hierarchical Reflection

Meta-reflection: "Is my reflection strategy itself effective?"
   ├── Strategy reflection: "Is the search strategy I'm using correct?"
   │   ├── Action reflection: "Was this specific tool call appropriate?"
   │   └── Reasoning reflection: "Was this reasoning step correct?"
   └── Goal reflection: "Am I solving the right problem?"

6.2 Contrastive Reflection

Comparing successful and failed attempts:

Successful attempt: Used precise search terms → Got correct info → Correct answer
Failed attempt: Used vague search terms → Got irrelevant info → Wrong answer
Reflection: The key difference is search term precision. In the future,
      I should first analyze key entities in the problem, then construct
      precise search queries.
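
One way to operationalize contrastive reflection is to show the model one successful and one failed trajectory for the same task and ask for the distinguishing factor as a reusable rule; the prompt below is an illustrative sketch, not a prescribed format.

def contrastive_reflect(task: str, success_traj: str, failure_traj: str, llm_complete) -> str:
    """Extract the key difference between a successful and a failed attempt."""
    prompt = (
        f"Task: {task}\n"
        f"Successful attempt:\n{success_traj}\n"
        f"Failed attempt:\n{failure_traj}\n"
        "Identify the key difference that explains the outcome, and state it "
        "as a single reusable rule for future attempts."
    )
    return llm_complete(prompt)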

6.3 Preemptive Reflection

Reflecting before taking action:

\[ \text{Preflection}(s, a) = \text{LLM}(\text{"In state } s \text{, executing } a \text{, what could go wrong?"}) \]
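
In practice this can act as a gate in front of the executor: the model is asked what could go wrong, and the action is revised before execution if a risk is flagged. The "NO RISK" convention and the single revision round below are assumptions for illustration.

def preflect_and_act(state: str, action: str, llm_complete, execute_action):
    """Check a proposed action for risks before executing it."""
    risk = llm_complete(
        f"In state: {state}\nProposed action: {action}\n"
        "What could go wrong? Reply 'NO RISK' if the action looks safe; "
        "otherwise describe the risk."
    )
    if "NO RISK" in risk:  # assumed safe-signal convention
        return execute_action(action)
    # One revision round for simplicity; a real agent might loop or re-plan.
    revised = llm_complete(
        f"State: {state}\nOriginal action: {action}\nRisk analysis: {risk}\n"
        "Output a revised action that avoids this risk."
    )
    return execute_action(revised)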

7. Limitations and Future Directions

7.1 Current Limitations

  1. Unreliable self-evaluation: LLM may not identify its own errors
  2. Unstable reflection quality: Sometimes reflections are too general or inaccurate
  3. Memory management: As reflections accumulate, how to selectively forget?
  4. Convergence: No theoretical guarantee that reflection leads to improvement
  5. Computational cost: Multiple attempts and reflections increase total cost

7.2 Future Directions

  1. External verifiers: Combine with formal verification to ensure reflection correctness
  2. Reflection distillation: Distill successful reflection experiences into more efficient forms
  3. Collaborative reflection: Multiple agents evaluate and reflect on each other
  4. Continual learning: Persist reflection experiences, reuse across tasks

References

  1. Shinn, N. et al. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS 2023.
  2. Madaan, A. et al. (2023). Self-Refine: Iterative Refinement with Self-Feedback. NeurIPS 2023.
  3. Chen, X. et al. (2024). Teaching Large Language Models to Self-Debug. arXiv:2304.05128.
  4. Huang, J. et al. (2023). Large Language Models Cannot Self-Correct Reasoning Yet. ICLR 2024.
  5. Pan, L. et al. (2023). Automatically Correcting Large Language Models: Surveying the Landscape of Diverse Self-Correction Strategies. arXiv:2308.03188.
