
Reflection and Self-Improvement

Overview

Reflection is a key mechanism for agents to learn from experience. Unlike traditional parameter updates, LLM agent reflection improves future behavior through verbalized experience summaries. This article provides an in-depth analysis of Reflexion, Self-Refine, self-debugging, and other methods, exploring the new paradigm of "verbal reinforcement learning."


1. Why Is Reflection Needed?

1.1 The Learning Dilemma of LLM Agents

Traditional RL agents learn through gradient updates:

\[ \theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta_t) \]

But LLM agents face challenges:

| Problem | Description |
| --- | --- |
| Frozen parameters | Weights are typically not updated after deployment |
| Expensive fine-tuning | Fine-tuning large models is extremely costly |
| Catastrophic forgetting | Fine-tuning may harm general capabilities |
| Sample efficiency | RL fine-tuning requires extensive interaction |

Reflection's solution: Rather than updating parameters, store experience as natural language for use in future contexts.

1.2 Cognitive Basis of Reflection

Reflection corresponds to the human capacity for metacognition:

  • Monitoring: Evaluating the quality of one's own reasoning and actions
  • Regulation: Adjusting strategies based on evaluation results
  • Experience accumulation: Converting lessons into knowledge usable in the future

2. Reflexion

2.1 Core Idea

Reflexion, proposed by Shinn et al. (2023), replaces the scalar reward signal of traditional RL with verbalized reflection:

\[ \text{Traditional RL}: r \in \mathbb{R} \quad \longrightarrow \quad \text{Reflexion}: r \in \text{Natural Language} \]
graph TD
    TASK[Task] --> ACTOR[Actor<br/>Executing Agent]
    ACTOR --> TRAJ[Trajectory<br/>Action Sequence + Results]
    TRAJ --> EVAL[Evaluator]
    EVAL --> |Success| DONE[Complete]
    EVAL --> |Failure| REF[Self-Reflection]
    REF --> MEM[Experience Memory<br/>Store Reflections]
    MEM --> ACTOR

2.2 Three Core Components

Actor

Executes actions based on current memory and task description:

\[ a_t = \text{Actor}(\text{task}, \text{memory}, o_t) \]

The Actor uses a ReAct-style reasoning and action loop.
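
A minimal sketch of one Actor step follows. The prompt wording and the `llm_complete(prompt) -> str` client are assumptions for illustration, not the paper's exact implementation; the point is that stored reflections are injected verbatim into the context.

def actor_step(task: str, memory: list[str], observation: str, llm_complete) -> str:
    """One Actor step: a_t = Actor(task, memory, o_t)."""
    # Reflections from earlier trials are spliced directly into the prompt.
    lessons = "\n".join(f"- {m}" for m in memory) or "(none yet)"
    prompt = (
        f"Task: {task}\n"
        f"Lessons from previous attempts:\n{lessons}\n"
        f"Current observation: {observation}\n"
        "Think step by step, then output the next action as 'Action: ...'."
    )
    return llm_complete(prompt)  # assumed generic text-completion client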

Evaluator

Evaluates the success of the execution trajectory:

\[ \text{score} = \text{Evaluator}(\tau) \]

Evaluation methods depend on task type:

| Task Type | Evaluation Method |
| --- | --- |
| Programming | Run test cases and count the pass rate |
| Decision-making | Check whether the final state achieves the goal |
| Reasoning | Compare the answer with ground truth |
| Open-ended | LLM-as-Judge evaluation |
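
For programming tasks, a concrete Evaluator can simply run the candidate code against unit tests and report the pass rate. The sketch below is a simplified stand-in: the unsandboxed `exec`-based harness and the assert-string test format are assumptions for illustration.

def evaluate_code(code: str, test_cases: list[str]) -> tuple[bool, str]:
    """Return (success, feedback) by running assert-statement test cases."""
    namespace = {}
    try:
        exec(code, namespace)  # define the candidate function(s); no sandboxing here
    except Exception as e:
        return False, f"Code failed to execute: {e}"

    failures = []
    for test in test_cases:
        try:
            exec(test, namespace)  # each test is an executable assert statement
        except Exception as e:
            failures.append(f"{test} -> {type(e).__name__}: {e}")

    passed = len(test_cases) - len(failures)
    feedback = f"{passed}/{len(test_cases)} tests passed. " + "; ".join(failures)
    return not failures, feedback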

Self-Reflection

Generates verbalized experience summaries:

Reflection: In my last attempt, I made the following mistakes:
1. I searched for "Apple Remote controls" instead of "Apple Remote,"
   leading to imprecise search results.
2. At step three, I skipped the verification step and directly gave an answer.
Improvement strategy:
1. Use more concise search keywords.
2. Verify key information before giving a final answer.
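
The reflection itself is just another LLM call over the failed trajectory and the evaluator's feedback. A minimal prompt-construction sketch (the template wording is an assumption):

def reflect(task: str, trajectory: str, feedback: str, llm_complete) -> str:
    """Turn a failed trajectory plus feedback into a reusable verbal lesson."""
    prompt = (
        f"Task: {task}\n"
        f"Your previous attempt:\n{trajectory}\n"
        f"Evaluator feedback: {feedback}\n"
        "In a few sentences, diagnose what went wrong and state a concrete "
        "strategy to follow on the next attempt."
    )
    return llm_complete(prompt)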

2.3 Reflexion Algorithm

Algorithm: Reflexion
Input: Task task, Maximum attempts max_trials

memory = []

for trial in 1..max_trials:
    # Execute task (with memory)
    trajectory = Actor.execute(task, memory)

    # Evaluate result
    success, feedback = Evaluator.evaluate(trajectory)

    if success:
        return trajectory  # Success!

    # Reflect on failure
    reflection = SelfReflection.reflect(
        task, trajectory, feedback, memory
    )

    # Store reflection in memory
    memory.append(reflection)

return "Max trials exceeded"

2.4 Experimental Results

| Benchmark | Baseline (Pass@1) | Reflexion (Pass@1) |
| --- | --- | --- |
| HumanEval (programming) | 80.1% | 91.0% |
| MBPP (programming) | 73.1% | 77.1% |
| ALFWorld (decision-making) | 75.0% | 97.0% |
| HotpotQA (reasoning) | 29.4% | 51.0% |

Key Insight

The core contribution of Reflexion is not retrying per se (naive retries already yield some improvement), but that structured reflection makes each retry more directed than the last.


3. Self-Refine

3.1 Core Idea

Self-Refine, proposed by Madaan et al. (2023), is a simpler iterative improvement framework:

\[ \text{Generate} \rightarrow \text{Feedback} \rightarrow \text{Refine} \rightarrow \text{Feedback} \rightarrow \text{Refine} \rightarrow \cdots \]
graph LR
    GEN[Initial Generation<br/>y₀ = LLM(x)] --> FB[Self-Feedback<br/>f = LLM(x, y₀)]
    FB --> REF[Refinement<br/>y₁ = LLM(x, y₀, f)]
    REF --> FB2[Self-Feedback<br/>f' = LLM(x, y₁)]
    FB2 --> |Below standard| REF2[Refinement<br/>y₂ = LLM(x, y₁, f')]
    FB2 --> |Meets standard| DONE[Output y₁]
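
A minimal Self-Refine loop might look like the sketch below; the stopping convention (the critic replying "GOOD ENOUGH") and the prompt wording are assumptions, not the paper's exact protocol.

def self_refine(x: str, llm_complete, max_iters: int = 4) -> str:
    """Generate -> self-feedback -> refine, until the critic is satisfied."""
    y = llm_complete(f"Task: {x}\nProduce an initial answer.")
    for _ in range(max_iters):
        feedback = llm_complete(
            f"Task: {x}\nDraft:\n{y}\n"
            "Critique this draft. If no changes are needed, reply 'GOOD ENOUGH'."
        )
        if "GOOD ENOUGH" in feedback:  # assumed stopping signal
            break
        y = llm_complete(
            f"Task: {x}\nDraft:\n{y}\nFeedback:\n{feedback}\n"
            "Rewrite the draft, addressing the feedback."
        )
    return y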

3.2 Differences from Reflexion

| Dimension | Reflexion | Self-Refine |
| --- | --- | --- |
| Scope | Agent tasks (multi-step) | Single-generation tasks |
| Feedback source | Environment execution results | LLM self-evaluation |
| Memory | Cross-trial memory | Current iteration only |
| Tool use | Yes | Usually no |
| Typical applications | Programming, decision-making, QA | Writing, translation, code optimization |

3.3 Limitations of Self-Refine

  1. Self-evaluation accuracy: LLM may not detect its own errors
  2. Over-modification: May change correct parts to incorrect ones
  3. Convergence issues: No theoretical guarantee of converging to a better solution

4. Verbal Reinforcement Learning

4.1 Conceptual Framework

Reflexion pioneered the "Verbal Reinforcement Learning" paradigm:

Traditional RL:

\[ \text{State} \xrightarrow{\text{Policy}} \text{Action} \xrightarrow{\text{Environment}} \text{Reward} \xrightarrow{\text{Gradient}} \text{Parameter Update} \]

Verbal RL:

\[ \text{Context} \xrightarrow{\text{LLM}} \text{Action} \xrightarrow{\text{Environment}} \text{Feedback} \xrightarrow{\text{Reflection}} \text{Memory Update} \]

4.2 Comparative Analysis

| Dimension | Traditional RL | Verbal RL |
| --- | --- | --- |
| State representation | Vectors | Natural language |
| Policy update | Gradient descent | In-context experience |
| Reward signal | Scalar | Language description |
| Learning speed | Slow (many interactions required) | Fast (few attempts) |
| Generalization | In-domain | Cross-domain (language generalizes) |
| Forgetting | Catastrophic forgetting | Controlled by memory management |

4.3 Storage and Retrieval of Verbal Experience

\[ \text{Retrieve}(q) = \arg\max_{m \in \mathcal{M}} \text{sim}(\text{embed}(q), \text{embed}(m)) \cdot \text{recency}(m) \]

where \(\mathcal{M}\) is the experience memory store and \(q\) is the current query.
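
A sketch of this retrieval rule, assuming an `embed(text) -> list[float]` function and a timestamp stored with each memory entry; the exponential decay is one possible choice of recency weighting, not prescribed by the formula.

import math
import time

def retrieve(query: str, memories: list[dict], embed, half_life_s: float = 3600.0) -> dict:
    """Return the memory maximizing similarity(q, m) * recency(m)."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    q_vec = embed(query)
    now = time.time()

    def score(m):
        recency = 0.5 ** ((now - m["timestamp"]) / half_life_s)  # assumed decay
        return cosine(q_vec, embed(m["text"])) * recency

    return max(memories, key=score)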

Cross-Reference

For a detailed discussion of memory systems, see Episodic and Semantic Memory.


5. Self-Debugging

5.1 Code Self-Debugging

Chen et al. (2024) proposed having LLMs debug their own generated code through execution feedback:

Generate code → Run tests → Analyze errors → Fix code → Re-test

5.2 Debugging Strategies

| Strategy | Description | Applicable Scenario |
| --- | --- | --- |
| Simple feedback | Pass error messages back to the LLM | Compilation errors, runtime exceptions |
| Unit testing | Verify with test cases | Function-level debugging |
| Explanation tracing | Have the LLM explain the code's logic | Logic errors |
| Divide-and-conquer | Verify function by function | Complex programs |
| Rubber-duck debugging | Have the LLM explain the code line by line | Subtle logic errors |

5.3 Debugging Loop

def self_debug(task, test_cases, max_attempts=5):
    # llm and execute are assumed interfaces for code generation and test execution
    code = llm.generate_code(task)

    for attempt in range(max_attempts):
        result = execute(code, test_cases)

        if result.all_passed:
            return code

        # Analyze error
        error_analysis = llm.analyze_error(
            task=task,
            code=code,
            error=result.error,
            failed_tests=result.failed_tests
        )

        # Fix code
        code = llm.fix_code(
            task=task,
            code=code,
            analysis=error_analysis
        )

    return code  # Return last version

6. Advanced Reflection Patterns

6.1 Hierarchical Reflection

Meta-reflection: "Is my reflection strategy itself effective?"
   ├── Strategy reflection: "Is the search strategy I'm using correct?"
   │   ├── Action reflection: "Was this specific tool call appropriate?"
   │   └── Reasoning reflection: "Was this reasoning step correct?"
   └── Goal reflection: "Am I solving the right problem?"

6.2 Contrastive Reflection

Comparing successful and failed attempts:

Successful attempt: Used precise search terms → Got correct info → Correct answer
Failed attempt: Used vague search terms → Got irrelevant info → Wrong answer
Reflection: The key difference is search term precision. In the future,
      I should first analyze key entities in the problem, then construct
      precise search queries.
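
One way to operationalize contrastive reflection is to show the model one successful and one failed trajectory for the same task and ask for the distinguishing factor as a reusable rule; the prompt below is an illustrative sketch, not a prescribed format.

def contrastive_reflect(task: str, success_traj: str, failure_traj: str, llm_complete) -> str:
    """Extract the key difference between a successful and a failed attempt."""
    prompt = (
        f"Task: {task}\n"
        f"Successful attempt:\n{success_traj}\n"
        f"Failed attempt:\n{failure_traj}\n"
        "Identify the key difference that explains the outcome, and state it "
        "as a single reusable rule for future attempts."
    )
    return llm_complete(prompt)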

6.3 Preemptive Reflection

Reflecting before taking action:

\[ \text{Preflection}(s, a) = \text{LLM}(\text{"In state } s \text{, executing } a \text{, what could go wrong?"}) \]
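
In practice this can act as a gate in front of the executor: the model is asked what could go wrong, and the action is revised before execution if a risk is flagged. The "NO RISK" convention and the single revision round below are assumptions for illustration.

def preflect_and_act(state: str, action: str, llm_complete, execute_action):
    """Check a proposed action for risks before executing it."""
    risk = llm_complete(
        f"In state: {state}\nProposed action: {action}\n"
        "What could go wrong? Reply 'NO RISK' if the action looks safe; "
        "otherwise describe the risk."
    )
    if "NO RISK" in risk:  # assumed safe-signal convention
        return execute_action(action)
    # One revision round for simplicity; a real agent might loop or re-plan.
    revised = llm_complete(
        f"State: {state}\nOriginal action: {action}\nRisk analysis: {risk}\n"
        "Output a revised action that avoids this risk."
    )
    return execute_action(revised)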

7. Limitations and Future Directions

7.1 Current Limitations

  1. Unreliable self-evaluation: LLM may not identify its own errors
  2. Unstable reflection quality: Sometimes reflections are too general or inaccurate
  3. Memory management: As reflections accumulate, how to selectively forget?
  4. Convergence: No theoretical guarantee that reflection leads to improvement
  5. Computational cost: Multiple attempts and reflections increase total cost

7.2 Future Directions

  1. External verifiers: Combine with formal verification to ensure reflection correctness
  2. Reflection distillation: Distill successful reflection experiences into more efficient forms
  3. Collaborative reflection: Multiple agents evaluate and reflect on each other
  4. Continual learning: Persist reflection experiences, reuse across tasks

References

  1. Shinn, N. et al. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS 2023.
  2. Madaan, A. et al. (2023). Self-Refine: Iterative Refinement with Self-Feedback. NeurIPS 2023.
  3. Chen, X. et al. (2024). Teaching Large Language Models to Self-Debug. arXiv:2304.05128.
  4. Huang, J. et al. (2023). Large Language Models Cannot Self-Correct Reasoning Yet. ICLR 2024.
  5. Pan, L. et al. (2023). Automatically Correcting Large Language Models: Surveying the Landscape of Diverse Self-Correction Strategies. arXiv:2308.03188.
