2023 Reflexion: An Autonomous Agent with Dynamic Memory and Self-Reflection

From GM-RKB

Subject Headings: LLM Self-Reflection Method, Emergent Properties of LLMs.

Notes

Cited By

2023

Quotes

Abstract

Recent advancements in decision-making large language model (LLM) agents have demonstrated impressive performance across various benchmarks. However, these state-of-the-art approaches typically necessitate internal model fine-tuning, external model fine-tuning, or policy optimization over a defined state space. Implementing these methods can prove challenging due to the scarcity of high-quality training data or the lack of well-defined state space. Moreover, these agents do not possess certain qualities inherent to human decision-making processes, specifically the ability to learn from mistakes. Self-reflection allows humans to efficiently solve novel problems through a process of trial and error. Building on recent research, we propose Reflexion, an approach that endows an agent with dynamic memory and self-reflection capabilities to enhance its existing reasoning trace and task-specific action choice abilities. To achieve full automation, we introduce a straightforward yet effective heuristic that enables the agent to pinpoint hallucination instances, avoid repetition in action sequences, and, in some environments, construct an internal memory map of the given environment. To assess our approach, we evaluate the agent's ability to complete decision-making tasks in AlfWorld environments and knowledge-intensive, search-based question-and-answer tasks in HotPotQA environments. We observe success rates of 97% and 51%, respectively, and provide a discussion on the emergent property of self-reflection.

1 Introduction

Mastering decision-making and knowledge-intensive search tasks in novel environments is a crucial skill set for large-scale natural language agents. LLMs such as OpenAI’s GPT-3 (Brown et al., 2020), Google’s PaLM (Chowdhery et al., 2022), and others have achieved impressive results on various benchmarks (Kaplan et al., 2020; Rae et al., 2021; Nakano et al., 2021; Kojima et al., 2022; Ouyang et al., 2022; Chung et al., 2022). These models exhibit human-like abilities to understand tasks in given environments, marking significant progress in the field of natural language processing. Grounding complex tasks in natural language allows agents to overcome high syntactic barriers that may result in false-negative errors. However, learning optimal policies for natural language RL agents is challenging due to vast and mostly unbounded state spaces.

Several decision-making approaches have been proposed to enable natural language agents to select their next action without a learned policy in text-based environments. Chain-of-thought (CoT) reasoning leverages emergent properties such as reasoning and commonsense to solve tasks in a single action, reasoned through several steps (Huang et al., 2022a; Wei et al., 2022b). However, the accuracy of these approaches decreases as the number of required subtasks increases, because the model is more prone to hallucinate over longer sequences. ReAct (Yao et al., 2023) is an approach that utilizes emergent properties in LLMs, such as verbal reasoning traces, to solve problems by allowing the agent to reason and act, demonstrating strong performance across various text-based benchmarks. In addition, several recent works have aimed to allow natural language agents to exhibit reflective-like qualities to infer more intuitive future actions. The Describe, Explain, Plan, and Select (DEPS) approach uses multi-step reasoning and sub-task error correction to solve long-range tasks (Wang et al., 2023). DEPS demonstrates impressive performance due to its ability to explain mistakes in sub-tasks within trials, but it relies on immediate failure detection for subtasks and cannot explain mistakes that may have developed over a long range of actions and subtasks. Huang et al. (2022b) use inner monologue to further process next decisions within closed-loop feedback environments, employing a success detection approach in which the agent explicitly knows whether an executed action has led to a successful state. Other works (Huang et al., 2022a; Haluptzok et al., 2022) use self-generated solutions to fine-tune an LLM to improve performance without access to a labeled dataset.
Although these approaches have achieved remarkable accuracy across various decision- making tasks or knowledge-intensive tasks, they lack the ability to utilize success detection cues to improve their behavior over long trajectories. In addition, they often succumb to common mistakes, such as repetitive action choice, cyclic hallucination, or random action choice. In other words, while these methods achieve state-of-the-art results, a small subset of tasks remain unsolved due to the agent’s inability to learn from its own mistakes over long trajectories to correct future action sequence planning and execution.

To address common failure points, human-in-the-loop (HITL) approaches have commonly been used to improve performance (Fan et al., 2022; Wu et al., 2022). Yao et al. (2023) briefly explore a HITL approach to redirect the agent’s reasoning trace after erroneous actions. While this approach achieves improved performance with minimal human intervention, it is not fully autonomous due to its reliance on human trainers to monitor trajectories at each time step. Large-scale LLMs have been shown to exhibit advanced human-like qualities that enable natural language agents to solve tasks in more intuitive ways (Wei et al., 2022a). We hypothesize that LLMs possess an emergent property of self-reflection and could effectively utilize self-optimization grounded in natural language if given the opportunity to autonomously close the trial loop.

To test our hypothesis, we equip an LLM-based agent with a self-reflective LLM and a simple heuristic for detecting hallucination and inefficient action execution in an approach named Reflexion. We then challenge the agent to learn from its own mistakes on the AlfWorld text-based benchmark (Shridhar et al., 2021) and the HotPotQA question-answering benchmark (Yang et al., 2018). This results in improved performance in decision-making and knowledge-intensive tasks. When combined with the ReAct problem-solving technique (Yao et al., 2023), self-reflection guides the Reflexion agent to achieve a 97% success discovery rate on the AlfWorld benchmark in just 12 autonomous trials, outperforming the base ReAct agent with an accuracy of 75%. We also evaluated a Reflexion-based ReAct agent on 100 questions from HotPotQA. The agent achieved a 51% success discovery rate by iteratively refining its content search and content extraction by receiving advice from its memory, outperforming a base ReAct agent by 17%. It is essential to emphasize that Reflexion is not designed to achieve near-perfect accuracy scores; instead, its goal is to demonstrate learning through trial and error to enable discovery in tasks and environments previously considered nearly impossible to solve.

2 Architecture

The abstract architecture of our Reflexion agent is depicted in Figure 1. In this study, Reflexion leverages ReAct (Yao et al., 2023), but any decision-making approach can be used in future implementations. In the first trial, the agent is given a task from the environment, which composes the initial query. Then, the agent executes a series of actions generated by an LLM and receives observations and rewards from the environment. For environments that provide descriptive or continuous rewards, we constrain the output to a simple binary success status to ensure applicability; reward constraining is explained in further detail later. After every action a_t, the agent computes a heuristic h, which may suggest self-reflection. If self-reflection is recommended, the agent queries an LLM to reflect on its current task, trajectory history, and last reward (which, under the binary reward constraint, is simply the fact that the agent has failed in the given environment). Then, the agent resets the environment to retry in a subsequent trial. If no self-reflection is advised, the agent adds a_t and the observation o_t to its trajectory history and queries the LLM for the next action. In practice, we set a hyperparameter limit of three maximum reflections to be stored in the agent’s memory, to avoid queries beyond the context limit of the LLM. If the agent exceeds the maximum number of trials, fails to improve performance between two consecutive trials, or completes the task, the run is terminated.
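The trial loop described above can be sketched in pseudocode-style Python. This is a minimal illustration, not the authors' implementation: `run_policy_llm`, `run_reflection_llm`, and the `env` interface are hypothetical placeholders, and the heuristic h is reduced to a simple repeated-action check standing in for the paper's hallucination and inefficiency detection.

```python
# Hypothetical sketch of the Reflexion trial loop. Function and method
# names (run_policy_llm, run_reflection_llm, env.*) are placeholders,
# not the authors' actual code.

MAX_REFLECTIONS = 3   # hyperparameter limit on stored reflections
MAX_TRIALS = 12       # terminate after this many autonomous trials

def should_reflect(trajectory):
    """Toy heuristic h: flag an immediately repeated action, a stand-in
    for the paper's hallucination/repetition detection."""
    actions = [step[0] for step in trajectory]
    return len(actions) >= 2 and actions[-1] == actions[-2]

def reflexion_loop(env, task, run_policy_llm, run_reflection_llm):
    reflections = []  # long-term memory persisted across trials
    for trial in range(MAX_TRIALS):
        trajectory = []  # short-term memory within one trial
        obs = env.reset(task)
        while not env.done():
            # The policy LLM is conditioned on the task, past
            # reflections, and the current trajectory history.
            action = run_policy_llm(task, reflections, trajectory, obs)
            obs, reward = env.step(action)
            trajectory.append((action, obs))
            if should_reflect(trajectory):
                # Query the self-reflection LLM on the task, trajectory,
                # and binary failure signal; store the resulting advice,
                # keeping at most MAX_REFLECTIONS entries.
                advice = run_reflection_llm(task, trajectory)
                reflections.append(advice)
                reflections = reflections[-MAX_REFLECTIONS:]
                break  # reset and retry in a subsequent trial
        if env.success():
            return True  # task completed
    return False  # trial budget exhausted
```

The key design point is the split between short-term memory (the within-trial trajectory) and long-term memory (the bounded list of reflections), which lets advice from failed trials steer action selection in later ones.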

Figure 1: Reflexion can be added to any decision-making approach. We enable ReAct agents to use self-reflection to improve their own performance.

...

Figure 3: HotPotQA performance across 100 question and answer pairs showing cumulative proportions of correct EM answers.

References

Noah Shinn, Beck Labash, and Ashwin Gopinath (2023). "Reflexion: An Autonomous Agent with Dynamic Memory and Self-Reflection." doi:10.48550/arXiv.2303.11366.