2023 How is ChatGPT's Behavior Changing over Time?

Subject Headings: LLM Benchmarking.

Notes

  • Significant drift was observed in the LLMs' performance and behavior over time. On some tasks, accuracy dropped substantially from March to June (e.g., GPT-4's accuracy on prime number identification fell from 84% to 51%); a minimal monitoring sketch for this kind of check appears after this list.
  • Conversely, performance improved on other tasks (e.g., GPT-3.5's accuracy on the same prime number questions rose from 50% to 76%).
  • The chain-of-thought reasoning approach became less effective over time, especially for GPT-4. This contributed to performance drops on math tasks.
  • GPT-4 became less willing to answer subjective questions and sensitive queries in June compared to March.
  • GPT-4 performed better on multi-hop reasoning questions in June, while GPT-3.5 performed worse on them.
  • Both models made more formatting mistakes in code generation in June than in March.
  • Overall, the behavior and performance of the "same" LLM can change substantially over a short period.
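The drift measurements above amount to re-running a fixed query set against pinned model snapshots and comparing accuracy. Below is a minimal sketch of such a monitor, assuming a hypothetical query_model(snapshot, prompt) helper that wraps whatever LLM API is in use; the snapshot names and prompt wording are illustrative, not the paper's exact harness.

```python
from typing import Callable

def is_prime(n: int) -> bool:
    """Ground-truth primality check by trial division."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def primality_accuracy(query_model: Callable[[str, str], str],
                       snapshot: str,
                       numbers: list[int]) -> float:
    """Ask one pinned model snapshot the same yes/no primality question
    for every number and return the fraction answered correctly."""
    prompt = "Is {n} a prime number? Think step by step and then answer [Yes] or [No]."
    correct = 0
    for n in numbers:
        reply = query_model(snapshot, prompt.format(n=n))
        said_yes = "[yes]" in reply.lower()
        correct += said_yes == is_prime(n)
    return correct / len(numbers)

# Usage: run the identical question set against two dated snapshots
# (e.g., "gpt-4-0314" vs. "gpt-4-0613") and compare the two accuracies.
```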

Cited By

Quotes

Abstract

GPT-3.5 and GPT-4 are the two most widely used large language model (LLM) services. However, when and how these models are updated over time is opaque. Here, we evaluate the March 2023 and June 2023 versions of GPT-3.5 and GPT-4 on several diverse tasks: 1) math problems, 2) sensitive/dangerous questions, 3) opinion surveys, 4) multi-hop knowledge-intensive questions, 5) generating code, 6) US Medical License tests, and 7) visual reasoning. We find that the performance and behavior of both GPT-3.5 and GPT-4 can vary greatly over time. For example, GPT-4 (March 2023) was reasonable at identifying prime vs. composite numbers (84% accuracy) but GPT-4 (June 2023) was poor on these same questions (51% accuracy). This is partly explained by a drop in GPT-4's amenity to follow chain-of-thought prompting. Interestingly, GPT-3.5 was much better in June than in March in this task. GPT-4 became less willing to answer sensitive questions and opinion survey questions in June than in March. GPT-4 performed better at multi-hop questions in June than in March, while GPT-3.5's performance dropped on this task. Both GPT-4 and GPT-3.5 had more formatting mistakes in code generation in June than in March. Overall, our findings show that the behavior of the "same" LLM service can change substantially in a relatively short amount of time, highlighting the need for continuous monitoring of LLMs.
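The "formatting mistakes" in code generation reported here were largely non-code decoration, such as markdown fences wrapped around the snippet, which keeps the raw output from running as-is. One rough way to operationalize that check is sketched below; the fence-stripping heuristic is an assumption for illustration, not the paper's exact evaluation code.

```python
import ast

def strip_markdown_fences(text: str) -> str:
    """Drop a surrounding ``` ... ``` fence pair if one is present."""
    lines = text.strip().splitlines()
    if lines and lines[0].startswith("```"):
        lines = lines[1:]
    if lines and lines[-1].startswith("```"):
        lines = lines[:-1]
    return "\n".join(lines)

def is_directly_executable(generation: str) -> bool:
    """True only if the raw generation parses as Python with no cleanup."""
    try:
        ast.parse(generation)
        return True
    except SyntaxError:
        return False

# A generation can contain perfectly good code yet fail the direct check:
# fenced = "```python\nprint(2 + 2)\n```"
# is_directly_executable(fenced)                          -> False
# is_directly_executable(strip_markdown_fences(fenced))   -> True
```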

Introduction

...

  • Figure 2: ... GPT-4 followed the chain-of-thought instruction to obtain the right answer in March, but ignored it in June with the wrong answer. ...
  • ... This interesting phenomenon indicates that the same prompting approach, even the widely adopted chain-of-thought strategy, could lead to substantially different performances due to LLM drifts. ... (a toy adherence check is sketched after this list)
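One way to measure this kind of drift is to test whether the model actually produced intermediate reasoning when instructed to think step by step. A crude adherence heuristic in that spirit is sketched below, purely as an illustration; the word-count threshold and bracketed answer-tag format are assumptions, not the paper's method.

```python
def followed_chain_of_thought(reply: str) -> bool:
    """Heuristic: a reply that followed the 'think step by step'
    instruction should contain some reasoning text before the final
    bracketed [Yes]/[No] answer, not just the answer tag alone."""
    answer_pos = reply.rfind("[")
    reasoning = reply[:answer_pos] if answer_pos != -1 else reply
    return len(reasoning.split()) >= 10  # arbitrary cutoff for "some steps"

# followed_chain_of_thought("[No]")  -> False (bare answer, instruction ignored)
# followed_chain_of_thought("17077 is odd and not divisible by 3, 7, or 11, so ... [Yes]")  -> True
```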

References

Lingjiao Chen, Matei Zaharia, and James Zou (2023). "How is ChatGPT's Behavior Changing over Time?" doi:10.48550/arXiv.2307.09009.