2023 DERAEnhancingLargeLanguageModel

From GM-RKB

Subject Headings: Dialog-Enabled Resolving Agents (DERA), Medical NLG.

Notes

Cited By

Quotes

Abstract

Large language models (LLMs) have emerged as valuable tools for many natural language understanding tasks. In safety-critical applications such as healthcare, the utility of these models is governed by their ability to generate outputs that are factually accurate and complete. In this work, we present dialog-enabled resolving agents (DERA). DERA is a paradigm made possible by the increased conversational abilities of LLMs, namely GPT-4. It provides a simple, interpretable forum for models to communicate feedback and iteratively improve output. We frame our dialog as a discussion between two agent types - a Researcher, who processes information and identifies crucial problem components, and a Decider, who has the autonomy to integrate the Researcher's information and makes judgments on the final output. We test DERA against three clinically-focused tasks. For medical conversation summarization and care plan generation, DERA shows significant improvement over the base GPT-4 performance in both human expert preference evaluations and quantitative metrics. In a new finding, we also show that GPT-4's performance (70%) on an open-ended version of the MedQA question-answering (QA) dataset (Jin et al. 2021, USMLE) is well above the passing level (60%), with DERA showing similar performance. We release the open-ended MEDQA dataset at this https URL.

Introduction

Large language models (LLMs; Brown et al. (2020); Lewis et al. (2020)) are deep-learning models that have been trained to predict natural language text conditioned on an input. The use of these models has led to advances in natural language performance far beyond just language modeling tasks. Within the realm of medicine, LLM-powered methods have shown improvements in medical tasks such as question answering (Singhal et al., 2022; Liévin et al., 2022), information extraction (Agrawal et al., 2022), and summarization (Chintagunta et al., 2021).

LLM-powered methods use natural language instructions called prompts. These instruction sets often include a task definition, rules the predictions must follow, and optionally some examples of the task input and output (Reynolds and McDonell, 2021; Brown et al., 2020). The ability of generative language models to create output based on natural language instructions (or prompts) removes the need for task-specific training (Min et al., 2022) and allows non-experts to build upon this technology.

While many tasks can be formulated as a single prompt, later work has shown that breaking down single tasks into sub-tasks (called chaining) has benefits in terms of task performance and interpretability (Wu et al., 2022). Examples of chaining strategies include chain-of-thought (Wei et al., 2022) and other task-specific approaches (e.g., Agrawal et al. (2022)). Chain-of-thought strategies prompt the model to think through a problem as an expert might approach it, leading to improvements in some tasks (Liévin et al., 2022; Wang et al., 2022; Tafjord et al., 2022).

All of these approaches attempt to coerce the correct generation from the base language model. However, one fundamental limitation of this strategy is that these prompting architectures are restricted to a fixed set of prompts designed with specific tasks in mind, such as writing explanations or resolving anomalies within the output. Furthermore, they struggle with generating factually accurate text and often include hallucinations and omissions (Maynez et al., 2020; Dziri et al., 2022; Berezin and Batura, 2022). This poses a significant hurdle when applying them to real-world scenarios, especially in the clinical domain.

Figure 1: Overview of DERA. The method consists of two agents–a Researcher and a Decider. The Decider generates an initial output for the task (step 1). Then, the Decider and Researcher work through the problem via conversation (step 2), with the Researcher tasked to help identify crucial problem components. The Decider has the autonomy to integrate the Researcher's inputs and makes judgments on the final output (step 3). Neither agent has knowledge of the ideal final output.

We advocate for a different approach that has two essential elements. First, it consists of an iterative approach to refining the initial output. This allows the generation to be refined holistically, as opposed to conditional chaining. Second, it includes an advisor that can guide the process by suggesting areas to focus on in each iteration, adding interpretability to the process. With the advent of GPT-4 (OpenAI, 2023), which is capable of robust, realistic conversation, we can use dialog as the medium for this interaction.

We propose DERA: Dialog-Enabled Resolving Agents. DERA is a framework to explore how we can improve performance on natural language tasks using agents tasked with resolving (or improving) the output through dialog. We propose that scoping each agent in the dialog to a specific role will better enable them to focus on discrete portions of the task, and ensure their partner agent stays aligned with the overall goal. One agent role, the Researcher, works to identify relevant information for the problem and suggest areas of focus to the other agent. Another agent role, the Decider, has the autonomy to react to that information and make final decisions about the output.

Our paper makes the following contributions:

In both human-annotated evaluations, we find that DERA outperforms base GPT-4 performance in the care plan generation and medical conversation summarization tasks on a variety of metrics. In quantitative evaluations, we find that DERA successfully corrects medical conversation summaries containing large numbers of errors. Conversely, we find small to no improvement between GPT-4 performance and DERA on question answering.

2 DERA: Overview

DERA is a general chat framework that leverages dialog-capable agents to iteratively work through a task (Figure 1). We focus on agent setups that work to probe knowledge sources, whether internal (from within GPT-4) or external (from text, documents, etc.). In approaches like chain-of-thought, these roles are generally performed jointly. In contrast, we propose that pairing an information-focused agent with a decision-maker agent will lead to a higher-quality output. Furthermore, this approach allows DERA to alternate between processing knowledge and acting upon it, as opposed to doing both concurrently.

First, we propose the use of a Researcher agent. The goal of the Researcher agent is to review pieces of information – which can be internal to GPT-4 or external – and make suggestions on what is likely to be crucial in solving the problem. As we do not have a definitive source of what is and is not relevant, we rely on GPT-4's ability to identify relevancy in light of the current task. We do not treat this agent as the definitive source of truth. Rather, we task it with being helpful and constructive during the dialog.

Second, we propose the use of a Decider agent. In addition to starting the conversation, this agent is tasked with responding to the information provided by the Researcher agent, and deciding whether to integrate that information into the task output. This allows GPT-4 to make discrete decisions in reaction to the information highlighted by the Researcher. At no point, however, does the Decider defer to the Researcher. This agent is ultimately responsible for the final decision, and while it is tasked with reviewing all information highlighted by the Researcher, it does not have to use any of that information.

The specifics of each agent can vary for different tasks. For Question Answering, the Researcher is tasked with pulling information from the question, using the internal knowledge of GPT-4 alone. For summarization, the Researcher has access to external texts which contain the full patient encounter. Conversely, the edits to the text generation task are made incrementally by the Decider in the summarization task, while they are made more discretely in the question-answering task. In some settings, agents take a hybrid role, where they each have access to different information and jointly make decisions. Overall, the goal remains the same – that this approach allows for information to be processed in a manner that is both role-defined and iterative, producing better quality output.

We apply DERA to three natural language generation tasks. The first, medical conversation summarization (§3), probes the ability of DERA to create a summary of a doctor-patient chat. This requires the ability to identify and rewrite medically-relevant information in a concise format. The second, care plan generation (§4), tests whether DERA can generate doctor-facing suggestions for potential actions to address patient concerns. This requires similar abilities, with the added challenge of knowing the appropriate next steps for a variety of medical conditions. Finally, medical question-answering (§5) tests the ability of DERA to generate a wide variety of medical knowledge in a short format.

3 Medical Conversation Summarization

Overview
The task of medical conversation summarization is to encapsulate a patient-doctor conversation (Enarvi et al., 2020; Joshi et al., 2020; Zhang et al., 2021; Chintagunta et al., 2021). We focus on summarizing patient-doctor chats into six independent sections: Demographics and Social Determinants of Health, Medical Intent, Pertinent Positives, Pertinent Negatives, Pertinent Unknowns, and Medical History. This structured format requires the model to summarize the chat while placing each piece of information in the appropriate section. As these summaries are typically used by doctors for downstream tasks such as clinical decision-making, it is important that the generated summaries are both factually accurate (no hallucinations) and complete (no omissions).
DERA Setup
We formulate the DERA setup for medical conversation summarization as follows. Both Decider and Researcher have access to the full medical conversation between the patient and the physician. Both agents are prompted to converse with one another. The Decider agent generates an initial summary of the medical conversation (Prompt 1) and shares it with the Researcher agent. The Researcher agent's role (Prompt 4) is to "read" the summary and point out any discrepancies to Decider. Decider, using Prompt 3, either accepts or rejects those discrepancies, by agreeing with the suggestion or disagreeing and responding with some reasoning. Instead of regenerating the summary at each step of the conversation, Decider writes the accepted suggestions to a shared scratchpad, which acts like a memory that it uses at the end of the conversation to generate the final summary. The conversation terminates once Researcher is satisfied with the suggestions made to the scratchpad or a maximum conversation length is reached (set to 15 turns total). As the final step, the Decider generates (Prompt 5) the final summary using the contents of the scratchpad and the original summary.

GPT-4 prompts are run with the settings mentioned in Table 5.
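The loop described above can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration: the call_gpt4 helper stands in for a GPT-4 chat API call, and the inline prompt strings and ACCEPT/NO FURTHER SUGGESTIONS conventions are placeholders for Prompts 1, 3, 4, and 5, not the paper's actual prompts.

```python
# Minimal sketch of the DERA summarization loop, assuming a caller-supplied
# call_gpt4(prompt) -> str helper. Prompt strings are illustrative stand-ins.

MAX_TURNS = 15  # total conversation turns allowed in the setup above


def dera_summarize(chat: str, call_gpt4) -> str:
    # Step 1: the Decider drafts an initial summary (Prompt 1 in the paper).
    summary = call_gpt4(
        f"Summarize this patient-doctor chat into the six structured sections.\n\n{chat}"
    )

    scratchpad = []   # accepted corrections accumulate here
    transcript = []   # running Researcher/Decider dialog

    for turn in range(MAX_TURNS):
        if turn % 2 == 0:
            # Researcher turn (Prompt 4): flag a discrepancy between chat and summary.
            msg = call_gpt4(
                f"Chat:\n{chat}\n\nSummary:\n{summary}\n\nDialog so far:\n{transcript}\n\n"
                "Point out one discrepancy, or say 'NO FURTHER SUGGESTIONS'."
            )
            transcript.append(f"Researcher: {msg}")
            if "NO FURTHER SUGGESTIONS" in msg:
                break
        else:
            # Decider turn (Prompt 3): accept or reject the latest suggestion;
            # accepted suggestions are written to the shared scratchpad.
            msg = call_gpt4(
                f"Chat:\n{chat}\n\nSummary:\n{summary}\n\nDialog so far:\n{transcript}\n\n"
                "Reply 'ACCEPT: <note>' to add the suggestion to the scratchpad, "
                "or 'REJECT: <reason>'."
            )
            transcript.append(f"Decider: {msg}")
            if msg.startswith("ACCEPT:"):
                scratchpad.append(msg[len("ACCEPT:"):].strip())

    # Final step: the Decider regenerates the summary from the scratchpad (Prompt 5).
    return call_gpt4(
        f"Original summary:\n{summary}\n\nAccepted corrections:\n" + "\n".join(scratchpad)
        + "\n\nWrite the corrected final summary."
    )
```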

Dataset
We randomly sampled 500 medical encounters from a chat-based telehealth platform. Each encounter contains the patient's age, sex, and chat conversation with a licensed medical provider. Encounters in this dataset cover a wide variety of common presentations in telehealth, including urinary tract infections, back/abdominal pains, toothaches, and others. All data is de-identified and scrubbed for protected health information prior to experimentation. Conversations contain 27 dialog turns on average (min of 9 turns, max of 82 turns) and average 646 unigram tokens per encounter (min 42 tokens, max 2031 tokens).
Human Expert Evaluation
To evaluate the effectiveness of DERA at generating better summaries, we conducted human evaluation studies with four licensed physicians on a random subset of 50 out of the 500 encounters described above. We sampled a smaller, random subset due to the high labeling cost induced by using expert physicians.

The licensed physicians were provided with the encounter and the two summaries. These included the initial GPT-4 generated summary and the final generated summary produced using DERA. Each physician was asked to answer three main questions in light of the summary's clinical utility for themselves or another physician:

  1. Which summary do you prefer to use for the given patient and encounter? (Options: Initial, Final).
  2. What percentage of the overall clinical information in the dialog is captured by the summary? (Options: All, Most, Some, None)
  3. What percentage of the suggestions added to the DERA scratchpad do you agree with? (Options: All, Most, Some, None)

Figure 2 shows the results of our human expert evaluation. Physicians notably choose the summary produced after DERA over the initially generated summary 90% to 10%. Their preference for the DERA-produced summary is further corroborated by the fraction of medical information captured in the final DERA summary vs. the initial one: final summaries were rated as capturing "All" medical information from the patient-physician dialog in 86% of encounters, vs. the initial summaries capturing "All" medical information in just 56% of encounters. In general, we also find broad agreement with the suggestions in each encounter's scratchpad: physicians agreed with "All" corrections suggested for a given encounter's summary 63% of the time, "Most" 14% of the time, "Some" 5% of the time, and "None" 18% of the time. On average, each scratchpad contains 2-3 suggestions.

In addition to these questions, we also asked the physician-experts the following: If another clinical provider acted upon this summary, does this summary contain information that could potentially be harmful to the patient given their presentation? (Options: Yes, No). The proportion of summaries containing "harmful" information drops from 2% for the initial summary to 0% for the final DERA summary. We caution against drawing generalizations from these harmfulness numbers. Our evaluations are both limited in number and drawn from a patient population specific to the telehealth platform; thus we cannot predict the generalizability of these findings in other settings.

Figure 2: Results from physician-expert evaluations on the medical conversation summarization task. (Left) Physicians choose the final summary produced by DERA over the initial GPT-4 generated summary 90% to 10%. (Center) Final DERA summaries capture far more clinical information than initial GPT-4 generated summaries, with physicians rating "All" relevant clinical information from the patient-physician chat captured in 86% of DERA summaries vs. 56% of initial GPT-4 summaries. (Right) For summary correction suggestions in the scratchpad, physicians rate agreement with "All" suggestions in 63% of encounters, "Most" in 14%, "Some" in 5%, and "None" in 18%.
Corruption Level	Summary Version	Pertinent Positives	Pertinent Negatives	Pertinent Unknowns	Medical History	Average
low (3/10)	Initial	89.38	83.05	87.42	80.88	85.18
low (3/10)	DERA	95.65	96.77	97.10	97.35	96.71
medium (5/10)	Initial	83.12	81.60	71.14	73.82	77.42
medium (5/10)	DERA	94.29	95.31	96.17	98.12	95.97
high (7/10)	Initial	68.35	70.07	68.79	57.27	66.12
high (7/10)	DERA	92.96	90.86	94.81	95.16	93.45
Table 1: Medical conversation summarization task: Quantitative evaluation (GPT-F1 scores) of the initial summary with errors and the DERA corrected version. We show that by introducing synthetic corruption (hallucinations, omissions, etc.) into medical summaries, DERA can resolve these corruptions at low, medium, and high levels of corruption. GPT-F1 scores for the DERA-produced summary are consistently higher than the initial summaries.

Quantitative Evaluation We also perform a more large-scale study without the need for human annotation. We generate GPT-4 summaries for all 500 encounters and assume them to be ground truth. Then, we synthetically induce "corruptions" into the generated summary and use that as the initial input. These mistakes artificially lower the summary's quality and produce significant hallucinations and omissions. The goal is to quantitatively evaluate DERA's ability to write medical summaries by measuring the degree to which the Researcher and Decider agents can identify and fix the "corruptions" introduced to the medical summary.

Prompt 2 contains specific instructions for generating the corruptions. We can control the level of corruption desired by passing one of three levels of corruption as a variable to our corruption prompt: low (3/10), medium (5/10), or high (7/10). The higher the corruption level, the more symptoms could be rearranged. Similarly, hallucinated symptoms could be introduced, among other corruptions. For a qualitative example of this process of generating an initial summary, corrupting it, resolving with DERA, and generating a final summary, see Fig. 6.
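A minimal sketch of how a corruption level might be passed to such a prompt, assuming a hypothetical call_gpt4 helper and an illustrative prompt string in place of the paper's Prompt 2 (only the three fractional levels come from the description above):

```python
# Sketch of level-controlled corruption injection. Prompt wording and the
# call_gpt4 helper are assumptions, not the paper's Prompt 2.

CORRUPTION_LEVELS = {"low": "3/10", "medium": "5/10", "high": "7/10"}


def corrupt_summary(summary: str, level: str, call_gpt4) -> str:
    fraction = CORRUPTION_LEVELS[level]
    return call_gpt4(
        f"Corrupt roughly {fraction} of this medical summary by rearranging symptoms, "
        f"introducing hallucinated symptoms, or omitting details:\n\n{summary}"
    )
```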

We measure the degree to which corruptions are present by using a GPT-based metric that tracks the medical concept coverage of the medical summary, GPT-F1. To compute GPT-F1, we compute the harmonic mean of two sub-metrics: GPT-Recall and GPT-Precision. We describe each sub-metric below.

GPT-Recall
To compute GPT-Recall, we first extract medical entities from both the predicted text and the ground-truth text² of the same summary section (using Prompt 6) and use a verification prompt (Prompt 7) to infer whether the entities extracted from the ground-truth section are also present in the predicted text. This produces tp_gt and fn values, which are used to calculate GPT-Recall = tp_gt / (tp_gt + fn).

GPT-Precision
To compute GPT-Precision, we again first extract medical entities from the corresponding predicted and ground-truth summary sections and verify that the concepts extracted from the predicted section are also present in the ground-truth text, either as exact matches or re-phrasings. This produces tp_pred and fp values, which are used to calculate GPT-Precision = tp_pred / (tp_pred + fp).

We present the results of our quantitative evaluation using the GPT-F1 metric in Table 1. Specifically, we compare GPT-F1 on the initial summary with errors to the DERA-corrected summary. Note first how the higher levels of corruption manifest in the initial summary's GPT-F1: as the corruption level of the initial summary increases, its GPT-F1 score drops. We find that DERA can produce significantly improved summaries at low, medium, and high levels of corruption, as evidenced by increases in GPT-F1. This suggests that the interaction between the Researcher and Decider agents identifies hallucinations and omissions and resolves them through dialog, even when many such corruptions are present.
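The GPT-F1 computation can be sketched as follows. The entity-extraction (Prompt 6) and verification (Prompt 7) steps are abstracted behind caller-supplied extract_entities and is_present functions, which are assumptions for illustration; only the recall, precision, and harmonic-mean arithmetic follows the definitions above.

```python
# Sketch of GPT-F1: harmonic mean of GPT-Recall and GPT-Precision.
# extract_entities and is_present are assumed stand-ins for Prompts 6 and 7.

def gpt_f1(predicted: str, ground_truth: str, extract_entities, is_present) -> float:
    pred_entities = extract_entities(predicted)     # Prompt 6 on the predicted section
    gt_entities = extract_entities(ground_truth)    # Prompt 6 on the ground-truth section

    # GPT-Recall: fraction of ground-truth entities recovered in the predicted text.
    tp_gt = sum(1 for e in gt_entities if is_present(e, predicted))        # Prompt 7
    fn = len(gt_entities) - tp_gt
    recall = tp_gt / (tp_gt + fn) if gt_entities else 0.0

    # GPT-Precision: fraction of predicted entities supported by the ground-truth text.
    tp_pred = sum(1 for e in pred_entities if is_present(e, ground_truth))  # Prompt 7
    fp = len(pred_entities) - tp_pred
    precision = tp_pred / (tp_pred + fp) if pred_entities else 0.0

    # GPT-F1 is the harmonic mean of GPT-Recall and GPT-Precision.
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)
```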

4 Care Plan Generation

We also analyze the performance of DERA on the task of generating a care management plan. This care plan contains suggestions that are meant to be physician-facing - that is, we generate suggestions that a physician would be required to approve of and then communicate to a patient. Our care plans contain five sections: Medications, Referrals, Tests, Lifestyle, and Supportive Care.

DERA Setup
As in the medical conversation summarization task, the goal of DERA is to improve the quality of the generated care plan by suggesting more appropriate home care for the patient, recommending additional lab tests, or otherwise better aligning the generated care plan. The DERA setup is the same as in the medical conversation summarization task, with care plan-specific prompts.

² Note that the terms "predicted" and "ground-truth" are used loosely here, and can also refer to synthetic text that is used as ground-truth (as in the case of the corruption experiments).

The Decider starts with an initial care plan. The Researcher is prompted (Prompt 10) to converse with the Decider (Prompt 9). Finally, the Decider generates the final care plan (Prompt 11) by combining the initial care plan with the content of the 'scratchpad' accumulated during the conversation. We run DERA on the care plan generation task using GPT-4 with the settings mentioned in Table 5.

Dataset
We used the same set of 50 medical encounters used for the human expert evaluation of the medical conversation summarization task.

Human Expert Evaluation
We evaluated the effectiveness of DERA at generating care plans through human evaluation with four licensed physicians. We explicitly instructed the physician evaluators that the generated plan is defined as "meant to be provider-facing, meaning that not all suggested interventions will necessarily be recommended to the patient or followed by the patient." The physicians who evaluated the quality of these care plans were not those who provided care to the patients in the original encounters.

The experts were provided with the encounter and the two care plans – the baseline GPT-4 generated care plan and the DERA-generated care plan produced starting from the GPT-4 generated one. They were asked to answer the following three questions:

  1. Which care plan do you prefer to use for the given patient and encounter? (Options: Initial, Final).
  2. What fraction of the necessary care plan management steps are captured? (Options: All, Most, Some, None)
  3. What percentage of the suggestions added to the DERA scratchpad do you agree with? (Options: All, Most, Some, None)

Figure 4 shows the results. In a head-to-head comparison, the physicians prefer the final care plan produced by DERA 84% of the time. Furthermore, when asked what fraction of the care plan corrections were useful, they fully agreed with the suggestions for 72% of encounters and agreed with none of the suggestions only 14% of the time. Finally, they rated 92% of final care plans as complete, compared to 64% of initial care plans. In sum, applying DERA to care plan generation increased the resulting quality substantially.

Doctor-Patient Chat Patient: UTI Doctor: Hi NAME, thank you for starting a visit. My name is NAME, and I am your clinical associate for this live chat. I am sorry to hear that. Patient: Okay thank you . . . Doctor: When was your last sexual encounter? Patient: Two days ago. I’m not having any since then because of the pain with my bladder infection. Doctor: Was the encounter unprotected? Patient: No Doctor: How many sexual partners have you had in the past 6 months?

Patient: Several. I did however get tested for HIV and STIs a couple weeks ago and all came back negative. I get tested regularly since I am sexually active. Doctor: Do you have a new sexual partner? Patient: No new ones Doctor: Do you douche or use feminine hygiene products? Patient: No Sometimes I use baby wipes . . . Doctor: For what medical conditions you are taking BuSpar, LaMICtal, Zoloft? Patient: Buspar for Bipolar, Lamictal for anxiety, Zoloft for depression

Figure 3: Qualitative example of care plan generation with DERA abridged for space. Starting with the initial chat, the Decider generates an initial care plan. The Researcher and Decider agents in DERA then converse with one another, visible in DERA dialog. The Decider adds accepted suggestions to a scratchpad, which collects the final changes to make to the care plan. The final care plan is generated by the Decider using this scratchpad. Note the points in bold that were added to the final care plan.

Figure 4: Care plan generation task: Results from physician-expert evaluations. (Left) Physicians choose the final care plan produced by DERA over the initial GPT-4 generated care plan 84% to 16%. (Center) Final DERA care plans capture far more of the necessary care management steps than initial GPT-4 generated care plans, with physicians rating "All" relevant steps inferred from the patient-physician chat generated in 92% of DERA care plans vs. 64% of initial GPT-4 care plans. (Right) For care plan correction suggestions in the scratchpad, physicians rate agreement with "All" suggestions in 72% of encounters, "Most" in 14%, "Some" in 0%, and "None" in 14%.

In addition to these questions, we also asked the physician-experts the following: If this care plan were acted upon by the patient, does this care plan contain information that could potentially be harmful to the patient given their presentation? (Options: Yes, No). The proportion of care plans containing "harmful" information drops from 2% for the initial care plan to 0% for the final DERA care plan. We caution against drawing generalizations from these harmfulness numbers. Our evaluations are both limited in number and drawn from a patient population specific to the telehealth platform; thus we cannot predict the generalizability of these findings in other settings.

Qualitative Examples We show a qualitative example of the care plan generation task with DERA in Figure 3. The initial care plan generated by the Decider was originally rated as containing "Most" necessary care management steps by our physician-expert evaluator, suggesting there were still some improvements possible. In the DERA dialog, the Researcher highlights potential drug interactions with the patient's current medications and the recommendation to educate the patient on safe sexual practices. These corrections were accepted by the Decider, as evidenced by the notes written to the scratchpad. In turn, the corrections were manifested in the final care plan, with the three changes bolded. This final care plan was rated as containing "All" necessary care management steps by our physician-expert evaluator.

5 Open-Ended Medical Question Answering

Overview We also investigate the use of DERA for short-form medical reasoning. A commonly used dataset for this task is MedQA (Jin et al., 2021), which consists of USMLE-style practice multiple-choice questions. Previous approaches for this dataset have included using RoBERTa (Liu et al., 2019), refining chain-of-thought using GPT-3 (Liévin et al., 2022), and fine-tuning PaLM (Chowdhery et al., 2022; Singhal et al., 2022). While most previously-reported results achieved passing performance, GPT-4 has recently been shown to work at a near-expert level (Nori et al., 2023).

In all previous work, the primary focus was on the multiple-choice question format, which has limited applicability in the real world. If these models are to support doctors in decision-making, they need to operate without any answer options provided. To mimic this setting, we extend the MedQA dataset to be open-ended, evaluating the model in a more realistic and harder setting. In the open-ended form, the model must generate the correct answer free-form rather than choose from a given bank of options. We also evaluate a set of continuing education questions from the New England Journal of Medicine (NEJM), again in an open-ended setting.

A method that can perform at a high level on this task requires several attributes. First, it must be able to recall a large set of knowledge across multiple domains of medicine. Second, it must be able to reason over long questions, which will likely include both irrelevant and crucial facts needed to arrive at the solution.

Datasets We evaluate our approach using two medical question-answering datasets: the MedQA US dataset (Jin et al., 2021) and the New England Journal of Medicine Test Questions (NEJM). Both datasets consist of questions taken from practice or real medical exams (the United States Medical Licensing Examination for MedQA, and continuing education questions for NEJM). For both datasets, the questions are originally written in multiple-choice format (e.g., Which of the following is the best diagnosis?). Our goal is to test DERA's performance on open-ended question answering, where the task is to generate the answer free-form.

Therefore, we use GPT-4 to alter the questions to be open-ended. In most cases, this requires a simple rephrasing of the final sentence. For example, the previous question could be re-written as What is the best diagnosis?. In these cases, we restrict GPT-4 to rewrite only the final sentence of the question, so as to guard against hallucinations. When a more complex rewrite is required, we prompt GPT-4 to rewrite the entire question and find that it only changes the relevant sentence. Some questions could already be answered open-ended and required no rewriting. Although we performed quality checks, as the entire process is automated, there may be some errors. The prompts for rewriting the final sentence (Prompt 13) and the full question (Prompt 12) are included in the Appendix. We also release the full MedQA open-ended dataset at https://github.com/curai/curai-research/tree/main/DERA. We cannot release the NEJM dataset due to licensing issues.

³ Reproductive, Gastrointestinal, Neurologic/Psychogenic, Special Sensory, Endocrine, Musculoskeletal, and Maternity Care

For MedQA, we sample a portion of the training set (1178 questions) as a development set and maintain the integrity of the test set (1273 questions) as formulated by the authors. For NEJM, we split the datasets by area, reserving 7 areas³ as a development set (consisting of 639 questions), with the remainder serving as a test set (1112 questions). We do not exclude questions containing images.

DERA Setup
To generate an initial answer for DERA to discuss, we use a single-shot prompt which outputs a short answer (Prompt 14). We use a single-shot prompt to ensure a consistent output, which we were unable to achieve with a zero-shot prompt. Earlier work (Singhal et al., 2022) has shown that using a self-consistency strategy provides stronger results. We adopt this approach by running 5 completions of our single-shot prompt and selecting the answer with the most votes as the single-shot answer; we consider this our baseline [2].
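A minimal sketch of this self-consistency baseline is shown below, assuming a hypothetical call_gpt4 helper and a placeholder prompt string in place of the paper's Prompt 14.

```python
# Sketch of the self-consistency baseline: sample several single-shot answers
# and keep the most frequent one. Prompt text and call_gpt4 are assumptions.
from collections import Counter


def self_consistent_answer(question: str, call_gpt4, n: int = 5):
    answers = [call_gpt4(f"Give a short answer.\n\nQuestion: {question}") for _ in range(n)]
    votes = Counter(answers)             # each lexical form is counted separately
    answer, _ = votes.most_common(1)[0]  # ties fall back to the earliest completion
    return answer, votes                 # the vote distribution later seeds the Decider
```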

Instead of initializing our Decider with a single answer, we provide it with the distribution of votes. This provides DERA with a distribution that better captures the underlying uncertainty of the model⁵. A prompt (Prompt 15) is tasked with writing the reasoning behind the votes, which is used as the initial Decider message.

Starting with the initial Decider message, both the Decider (Prompt 17) and the Researcher have access only to the question and their own conversation as they iteratively discuss the problem and attempt to arrive at the right answer. The Researcher can stop the dialog when it has exhausted all relevant information; otherwise, the dialog ends after n = 3 turns. At each turn, the Decider must state its current answer and explain its reasoning, and it may choose to either confirm or change its answer.

We instruct both prompts to act as medical doctors who work at an expert level. To arrive at a final answer, a prompt is given the same information as the original one-shot prompt, with the exception that it is also given the full chat history to use as additional context. We generate n = 5 answers and use the most frequently generated answer as our final answer (see Prompt 18). We run DERA on open-ended question answering with the parameters noted in Table 6. For the multiple-choice setting, we use a very similar configuration; the primary prompt changes are limited to the fact that the Decider is given a set of options and asked to generate the letter (A-D) instead of a short phrase.
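The question-answering variant of DERA described above might be sketched as follows. The call_gpt4 helper, the prompt strings, and the 'DONE' termination sentinel are all assumptions for illustration; only the overall flow (a vote-seeded Decider, up to n = 3 turns, and a 5-way majority vote over the final prompt) follows the description above.

```python
# Sketch of the open-ended QA variant of DERA. All prompt strings are
# illustrative stand-ins for Prompts 15, 17, and 18.
from collections import Counter


def dera_qa(question: str, votes: Counter, call_gpt4, turns: int = 3) -> str:
    # Initial Decider message: reasoning over the vote distribution (Prompt 15).
    history = ["Decider: " + call_gpt4(
        f"Question: {question}\nVotes: {dict(votes)}\n"
        "Explain the reasoning behind these candidate answers."
    )]

    for _ in range(turns):
        # Researcher highlights information from the question to focus on.
        focus = call_gpt4(
            f"Question: {question}\nDialog:\n" + "\n".join(history)
            + "\nHighlight one relevant fact, or say 'DONE' if nothing remains."
        )
        history.append("Researcher: " + focus)
        if "DONE" in focus:
            break
        # Decider states (or revises) its current answer with reasoning (Prompt 17).
        history.append("Decider: " + call_gpt4(
            f"Question: {question}\nDialog:\n" + "\n".join(history)
            + "\nState your current answer and your reasoning."
        ))

    # Final answer (Prompt 18): the one-shot prompt plus the chat as context,
    # sampled 5 times with the most frequent answer selected.
    finals = [
        call_gpt4(f"Question: {question}\nDialog:\n" + "\n".join(history)
                  + "\nFinal short answer:")
        for _ in range(5)
    ]
    return Counter(finals).most_common(1)[0][0]
```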

Evaluation Metrics There are two main challenges when evaluating open-ended responses to these questions. First, some questions may have multiple correct answers, given that the answer is no longer constrained to a discrete set of options. More importantly, we need a mechanism to map the free-form answer to the correct answer text. While we only evaluate against the ground-truth correct option, we include metrics that attempt to account for the degree of similarity between the answer text and the correct option.

⁵ This also handles cases where closely related lexical forms receive separate votes, as the Decider output will conclude that the options are similar.

Figure 5: We provide several examples of chats between the Decider and Researcher agents taken from our MedQA development set (more shown in Appendix Section A.1). We include the Gold Answer, 1-shot answer, and DERA answer. This example is a case where the single-shot answer is not specific enough, and the DERA dialog changes the answer to the correct one.

To identify generated answers that are related to the gold-standard answer, we prompt GPT-4 to score the relatedness of the generated and gold answers (Prompt 19). To evaluate exact correctness, we use a binary prompt that judges whether the generated and gold answers are exactly the same and provides an explanation (Prompt 20). Finally, we evaluate the similarity of the generated and gold answers using BERTScore (Zhang et al. (2019), model scibert-basevocab-uncased). While this metric has limitations (Hanna and Bojar, 2021; Sun et al., 2022), it is commonly reported for generation tasks. We do not use a knowledge base such as UMLS (Aronson, 2001) based similarity (McInnes et al., 2009), as many answers do not map directly to a single entity in a medical knowledge base.
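For the BERTScore portion of this evaluation, a minimal sketch using the bert-score package is shown below; the exact Hugging Face checkpoint identifier and the layer choice are assumptions, as the paper only names a scibert-basevocab-uncased model.

```python
# Sketch of the BERTScore similarity computation between generated and gold answers.
from bert_score import score  # pip install bert-score


def bertscore_f1(generated: list, gold: list) -> float:
    P, R, F1 = score(
        generated, gold,
        model_type="allenai/scibert_scivocab_uncased",  # assumed checkpoint
        num_layers=9,                                   # assumed layer choice
    )
    return F1.mean().item()  # average F1, as reported in Table 3
```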

Model	Accuracy
PaLM (Singhal et al., 2022)	0.676
Nori et al. (2023)	0.814
GPT-4 0-shot	0.834
DERA	0.840

Table 2: Results on the multiple-choice (4-option) version of MedQA. The GPT-4 0-shot and DERA results were generated on a version of GPT-4 available in February 2023.

Results We compare DERA to single-shot performance using GPT-4, where n = 5 answers are generated and the one with the most votes is selected as the answer⁶. Due to the costs involved with running the experiments, we only report single runs. We include quantitative results for open-ended question answering in Table 3, and for multiple-choice question answering in Table 2.

⁶ If there are ties, the first completion of the highest-ranking answers is selected.

	MedQA			NEJM		
	BERTScore	GPT-4 Exact	GPT-4 Sim	BERTScore	GPT-4 Exact	GPT-4 Sim
GPT-4 1-shot	0.746	0.698	0.65	0.676	0.703	0.711
DERA	0.744	0.703	0.67	0.670	0.711	0.724

Table 3: Results on the Open-Ended versions of MedQA and NEJM. We evaluate using a variety of metrics, including GPT-4 prompts that identify exact matches and similar matches (using a 0-1 scale). In addition, we calculate the average BERTScore F1 to measure the similarity between the gold and generated answers using a separate model.

For the multiple-choice results, we find that GPT-4 outperforms the best previously published approaches out of the box on MedQA. This is in line with the results reported by Nori et al. (2023), which uses a very similar approach. We suspect that our results are slightly higher due to our use of a self-consistency approach. We do not see significant improvements when applying DERA in the multiple-choice setting.

In the open-ended setting, we see strong performance from both one-shot GPT-4 and DERA for both NEJM and MedQA. Liévin et al. (2022) note that the passing grade for the MedQA test set is 60%. For both GPT-4 one-shot and DERA, GPT-4 Exact Matching is above 60%, and the BERTScore and Similarity measures are above 0.6. This marks an impressive ability to generate open-ended answers to questions. Yet there still exists a gap between open-ended and multiple-choice performance, suggesting opportunities for future work. As in the multiple-choice setting, DERA shows small to no improvement over GPT-4, depending on the metric. The largest gain for DERA is in the similarity metric for both MedQA and NEJM, which suggests that DERA can lead to answers that are closer to the ground truth. Examples of the open-ended question-answering chats are included in Figure 5 and Appendix Section A.1.

Qualitative Analysis We include the first 10 examples from the MedQA development set (which we randomly drew from their training set) in Appendix Table 4⁷. In our analysis of these development examples, we see several patterns.

⁷ These results were generated with an earlier version of GPT-4 available in February 2023.

First, sometimes the agent successfully changes an incorrect answer to the correct answer. For example, in Question 4, shown in Figure 5, the original answer is Inherited bleeding disorder, and DERA changes it to the more specific Von Willebrand Disease. In other cases, DERA leaves the answer the same as in the original 1-shot generation (e.g., Questions 5, 9, 55, 94, 98). We also note that this does not occur in a majority of cases, as only 542 of the 1273 MedQA training examples have the exact same answer between DERA and one-shot.

In other cases, such as in Question 54, DERA adds additional details to the 1-shot answer (the 1-shot answer Smoking cessation counseling and support becomes the Decider's final answer Assessing for occupational lung disease and providing smoking cessation). There are some clear challenges with open-ended question answering that show up in both the DERA and 1-shot generations. Specifically, both often give a more general answer than the gold-standard answer. For example, in Question 74, the gold-standard answer text is a specific medication (Dantrolene), while both DERA and 1-shot produce more general answers (e.g., Immediate hospitalization and supportive care).

Overall, without the inclusion of a specific set of options, it is difficult for GPT-4 to generate an answer at the correct level of specificity (e.g., a specific treatment instead of a general approach) and the correct length (e.g., answering in short-answer format instead of long sentences). In some settings, these attributes may be useful, but they make the approach challenging to evaluate. We predict the need for additional work on methods that automatically evaluate the output of large language model-powered tools, given the inherent complexity present.

6 Discussion and Conclusion

We introduce a framework for agent-agent dialog called DERA. This approach allows agents to focus on specific roles, reducing the need for an LLM to achieve the correct answer in one or two generations. In this setup, we use two types of agents – Researcher, tasked with reviewing and selecting information, and Decider, tasked with integrating that information into the final output. Both discuss the problem in a chat format, with the goal of improving the output of GPT-4.

As found in Sections 3 and 4, DERA improves the quality of the generated text on a variety of metrics. Importantly, it reduces the number of hallucinations and omissions in the resulting text. This finding is important given the ability of large language models (LLMs), in particular GPT-4, to generate text that is fluent but potentially prone to errors. The ability of DERA to identify and correct these hallucinations and omissions is critical when applying these models to real-world scenarios. A key feature is that the same LLM can be harnessed in both roles.

We did not find similar improvements in the question-answering task. As discussed in Section 5, DERA produced little to no improvement over a GPT-4 baseline. We suggest this is due to several factors, including the requirement to generate a single, granular answer. DERA often adds information to an answer, which is not helpful for short text generation. These findings, paired with those discussed above, suggest this method is well-suited for longer-generation tasks.

Furthermore, the chat-based format of DERA allows for increased interpretability when auditing the results. Even though LLMs such as GPT-4 may achieve high performance in zero-shot or one-shot settings, generating long-form explanations does not provide a granular forum for understanding the resulting generations. Conversely, the chat-based format allows for discussions that are granular and could be verified by an end user for mistakes.

In the future, this setup could be altered to include human input in the discussion. Alternatively, different problems may dictate the inclusion of different types of agents. Overall, we believe that while LLM-based tools are critical in increasing the quality of natural language performance, additional research is required to ensure they are consistent and auditable.

Finally, we reiterate the need for further research on automated metrics for evaluating LLM output. Human-led qualitative evaluations can provide important insights, but it remains a challenge to measure improvement given the limited tools currently available.

7 Limitations

The experiments in this paper were performed using OpenAI's API, mostly using GPT-4 models.

While these models generate text at a higher quality than other previous models, there are still limitations. First, we do not have access to what the model has and has not been trained on. Specifically, we do not know if openly-released datasets, such as MedQA, were included in the training data. Second, we report results using the latest version of GPT-4 available at the time. As OpenAI does not persist models, this may make reproducing results challenging.

While we include a variety of quantitative evaluations, the task of automatically evaluating generated text needs further research. Previous methods, such as BERTScore, use models that are less powerful than GPT-4, yet using GPT-4 to evaluate itself is also potentially problematic. Similarly, evaluations of the ability of DERA to reduce the presence of harmful text in generations are promising, but given the limited amount of harmful content present to begin with, we caution against a broad interpretation of that result.

Another limitation is inherent to clinical text datasets. As stated in Section 3, our evaluations and style of summary and care plan are limited to a patient population specific to a single telehealth service, and may not be applicable to uses for the general population. Several of our evaluation datasets cannot be openly released for data privacy or licensing reasons, yet this is likely true for a broad amount of medically-focused research. Additionally, some other datasets that are openly available cannot be directly used with API-based models (Agrawal et al., 2022), further limiting options. We also acknowledge that while MedQA does probe medical knowledge, it likely does so in a different form than would be applied in a regular clinical setting.

References

Monica Agrawal, Stefan Hegselmann, Hunter Lang, Yoon Kim, and David Sontag. 2022. Large language models are zero-shot clinical information extractors. arXiv preprint arXiv:2205.12689.

Alan R Aronson. 2001. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. In Proceedings of the AMIA Symposium, page 17. American Medical Informatics Association.

Sergey Berezin and Tatiana Batura. 2022. Named entity inclusion in abstractive text summarization. In Proceedings of the Third Workshop on Scholarly Document Processing, pages 158–162, Gyeongju, Republic of Korea. Association for Computational Linguistics.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.

Bharath Chintagunta, Namit Katariya, Xavier Amatriain, and Anitha Kannan. 2021. Medically aware GPT-3 as a data generator for medical dialogue summarization. In Proceedings of the Second Workshop on Natural Language Processing for Medical Conversations, pages 66–76, Online. Association for Computational Linguistics.

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.

Nouha Dziri, Sivan Milton, Mo Yu, Osmar Zaiane, and Siva Reddy. 2022. On the origin of hallucinations in conversational models: Is it the datasets or the models? In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5271–5285, Seattle, United States. Association for Computational Linguistics.

Seppo Enarvi, Marilisa Amoia, Miguel Del-Agua Teba, Brian Delaney, Frank Diehl, Stefan Hahn, Kristina Harris, Liam McGrath, Yue Pan, Joel Pinto, Luca Rubini, Miguel Ruiz, Gagandeep Singh, Fabian Stemmer, Weiyi Sun, Paul Vozila, Thomas Lin, and Ranjani Ramamurthy. 2020. Generating medical reports from patient-doctor conversations using sequence-to-sequence models. In Proceedings of the First Workshop on Natural Language Processing for Medical Conversations, pages 22–30, Online. Association for Computational Linguistics.

Michael Hanna and Ondřej Bojar. 2021. A fine-grained analysis of BERTScore. In Proceedings of the Sixth Conference on Machine Translation, pages 507–517, Online. Association for Computational Linguistics.

Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. 2021. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421.

Anirudh Joshi, Namit Katariya, Xavier Amatriain, and Anitha Kannan. 2020. Dr. Summarize: Global summarization of medical dialogue by exploiting local structures. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3755–3763, Online. Association for Computational Linguistics.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.

Valentin Liévin, Christoffer Egeberg Hother, and Ole Winther. 2022. Can large language models reason about medical questions? arXiv preprint arXiv:2207.08143.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906–1919, Online. Association for Computational Linguistics.

Bridget T McInnes, Ted Pedersen, and Serguei VS Pakhomov. 2009. UMLS-Interface and UMLS-Similarity: open source software for measuring paths and semantic similarity. In AMIA Annual Symposium Proceedings, volume 2009, page 431. American Medical Informatics Association.

Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2022. Rethinking the role of demonstrations: What makes in-context learning work?

Harsha Nori, Nicholas King, Scott Mayer McKinney, Dean Carignan, and Eric Horvitz. 2023. Capabilities of GPT-4 on Medical Challenge Problems.

OpenAI. 2023. GPT-4 technical report.

Laria Reynolds and Kyle McDonell. 2021. Prompt programming for large language models: Beyond the few-shot paradigm. In Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, pages 1–7.

Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. 2022. Large language models encode clinical knowledge. arXiv preprint arXiv:2212.13138.

Tianxiang Sun, Junliang He, Xipeng Qiu, and Xuanjing Huang. 2022. BERTScore is unfair: On social bias in language model-based metrics for text generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3726–3739, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Oyvind Tafjord, Bhavana Dalvi Mishra, and Peter Clark. 2022. Entailer: Answering questions with faithful and truthful chains of reasoning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2078–2093, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Boshi Wang, Xiang Deng, and Huan Sun. 2022. Iteratively prompt pre-trained language models for chain of thought. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2714–2730, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. 2022. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903.

Tongshuang Wu, Michael Terry, and Carrie Jun Cai. 2022. AI chains: Transparent and controllable human-AI interaction by chaining large language model prompts. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, CHI '22, New York, NY, USA. Association for Computing Machinery.

Longxiang Zhang, Renato Negrinho, Arindam Ghosh, Vasudevan Jagannathan, Hamid Reza Hassanzadeh, Thomas Schaaf, and Matthew R Gormley. 2021. Leveraging pretrained models for automatic summarization of doctor-patient conversations. arXiv preprint arXiv:2109.12174.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2019. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675.

A Appendix

A.1 Question Answering Examples

We provide several examples of chats between the Decider and Researcher agents taken from our MedQA development set. We include the Gold Answer, 1-shot answer, and DERA answer. In the following example, the 1-shot prompt arrives at the correct answer, but DERA expands the answer to include additional information.

Question [ID 574]: Your colleague has been reading the literature on beta-carotene supplementation and the risk of heart disease. She thinks they may share a clinically relevant association and would like to submit an editorial to a top journal. Upon final literature review, she discovers a newly published study that refutes any association between beta-carotene and heart disease. Your colleague is upset; you suggest that she, instead, mathematically pool the results from all of the studies on this topic and publish the findings. What type of study design are you recommending to your colleague? Gold Answer: Meta-analysis. 1-shot: Meta-analysis

Decider: Both options are viable for pooling results from multiple studies, but a meta-analysis (80%) is more likely to be recommended as it allows for a quantitative synthesis of the data. A systematic review and meta-analysis (20%) would also be a good option, as it includes a comprehensive literature search and qualitative analysis, but it may be more time-consuming. Ultimately, the choice depends on the specific goals and resources of the researcher.

Researcher: Consider the fact that your colleague wants to 'mathematically pool the results from all of the studies on this topic' and think about which study design would best allow for this type of analysis.

Decider: Based on the fact that my colleague wants to mathematically pool the results from all of the studies, I believe a meta-analysis would be the best study design as it allows for a quantitative synthesis of the data.

Researcher: Think about the goal of your colleague, which is to 'submit an editorial to a top journal' and consider which study design would be most appropriate for achieving this goal.

Decider: After considering the goal of my colleague to submit an editorial to a top journal, I believe a systematic review and meta-analysis would be the most appropriate study design. This design includes both a comprehensive literature search and qualitative analysis, in addition to the quantitative synthesis of data, which would make it more suitable for publication in a top journal.

Researcher: Consider the fact that your colleague discovered a 'newly published study that refutes any association between beta-carotene and heart disease' and think about how this might impact the choice of study design.

Decider: After considering the fact that my colleague discovered a newly published study that refutes any association between beta-carotene and heart disease, I still believe a systematic review and meta-analysis would be the most appropriate study design. This design would allow for the inclusion of this new study in the comprehensive literature search and analysis, providing a more complete picture of the current state of research on this topic.

DERA answer: Systematic review and meta-analysis

In the following example, the 1-shot prompt arrives at the wrong answer, and while DERA changes the answer, it is also not correct.

Question [ID 2599]: A 15-year-old girl is brought to her pediatrician's office complaining of frequent diarrhea, fatigue, and inability to gain weight. Her vital signs are within normal limits, and her BMI is 17. She describes her stools as pale, malodorous, and bulky. She often has abdominal bloating. Her symptoms are most prominent after breakfast when she typically consumes cereal. After several weeks of careful evaluation and symptomatic treatment, the pediatrician recommends an esophagogastroduodenoscopy. A diagnostic biopsy shows blunting of intestinal villi and flat mucosa with multiple intraepithelial lymphocytes. What is the patient likely deficient in? Gold Answer: IgA. 1-shot: Gluten tolerance

Decider: Based on the patient's symptoms and the results of the esophagogastroduodenoscopy, it is likely that the patient has a gluten intolerance or Celiac disease. However, without further testing and confirmation, it is difficult to determine the exact diagnosis with certainty. Therefore, the options of Gluten intolerance/Celiac Disease, Gluten intolerance or celiac disease, and Gluten intolerance or Celiac disease are all rated at 20%. The option of Gluten tolerance is rated at 40% as it is possible that the patient's symptoms are caused by another condition.

Researcher: Consider the patient's symptoms of frequent diarrhea, fatigue, and inability to gain weight, along with her description of pale, malodorous, and bulky stools.

Decider: After considering the patient's symptoms of frequent diarrhea, fatigue, and inability to gain weight, along with her description of pale, malodorous, and bulky stools, I believe that Celiac Disease is the most likely diagnosis. The results of the esophagogastroduodenoscopy showing blunting of intestinal villi and flat mucosa with multiple intraepithelial lymphocytes also support this conclusion.

Researcher: Think about the patient's symptoms being most prominent after breakfast when she typically consumes cereal, and how this could be related to her condition.

Decider: After considering the patient's symptoms being most prominent after breakfast when she typically consumes cereal, I believe this further supports my conclusion that Celiac Disease is the most likely diagnosis. The consumption of cereal, which often contains gluten, could be triggering her symptoms.

Researcher: Consider the diagnostic biopsy results which show blunting of intestinal villi and flat mucosa with multiple intraepithelial lymphocytes, and how this could be related to the patient's condition.

Decider: After considering the patient's symptoms, the timing of her symptoms being most prominent after consuming cereal, and the diagnostic biopsy results showing blunting of intestinal villi and flat mucosa with multiple intraepithelial lymphocytes, I believe Celiac Disease is the most likely diagnosis.

DERA answer: Celiac Disease

Table 4: Examples from the MedQA Training set (used as a development set in our experiments). We include the id (or line number), the rewritten open-ended question, the correct original gold answer, the student predicted answer, and the 1-shot answer.

References


2023 DERAEnhancingLargeLanguageModel
Anitha Kannan, Varun Nair, Elliot Schumacher, Geoffrey Tso
DERA: Enhancing Large Language Model Completions with Dialog-Enabled Resolving Agents
doi:10.48550/arXiv.2303.17071, 2023
  1. https://github.com/curai/curai-research/tree/main/DERA
  2. We do not account for variations in the text; each lexical form is counted separately.