2024 SelfRewardingLanguageModels

From GM-RKB

Subject Headings: Self-Rewarding Language Models (SR-LMs), AlpacaEval 2.0 Leaderboard.

Notes

  • It introduces Self-Rewarding Language Models (SR-LMs), where language models use their own outputs for training and self-improvement.
  • The approach enables continual updates and advancements beyond initial training data limits, thus addressing the limitations of static training datasets.
  • It demonstrates that SR-LMs significantly improve instruction-following ability by iteratively generating, judging, and training on their own instruction-following examples.
  • Its training methodology seeds the model with Instruction Fine-Tuning (IFT) and Evaluation Fine-Tuning (EFT) data, then applies iterative DPO training.
  • After three iterations of this approach, the fine-tuned model outperforms many existing systems on the AlpacaEval 2.0 leaderboard.
  • It acknowledges the limitations of SR-LMs, particularly the unexplored scaling behavior of the effect and the need for comprehensive safety evaluations.
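The self-rewarding loop described above (sample candidate responses, have the same model score them via an LLM-as-a-Judge prompt, and form preference pairs for DPO) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `generate_responses` and `judge` are hypothetical deterministic stubs standing in for Llama 2 70B generation and the paper's 5-point additive judging rubric.

```python
def generate_responses(model, prompt, n=4):
    """Stand-in for sampling N candidate responses from the model."""
    return [f"{model}:{prompt}:candidate{i}" for i in range(n)]

def judge(model, prompt, response):
    """Stand-in for LLM-as-a-Judge: the same model scores its own response
    on a 0-5 scale (here a deterministic toy score, not a real rubric)."""
    return sum(map(ord, response)) % 6

def build_preference_pairs(model, prompts):
    """Self-reward step: for each prompt, pair the highest- and
    lowest-scored candidates as (chosen, rejected) DPO training data."""
    pairs = []
    for prompt in prompts:
        candidates = generate_responses(model, prompt)
        ranked = sorted(candidates, key=lambda r: judge(model, prompt, r))
        worst, best = ranked[0], ranked[-1]
        # Only keep prompts where the judge actually distinguishes candidates.
        if judge(model, prompt, best) > judge(model, prompt, worst):
            pairs.append({"prompt": prompt, "chosen": best, "rejected": worst})
    return pairs

pairs = build_preference_pairs("M1", ["Explain DPO", "Summarize RLHF"])
```

In the paper, each such iteration trains a new model on the preference pairs produced by the previous one, so both the policy and its judging ability improve together.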

Cited By

Quotes

Abstract

We posit that to achieve superhuman agents, future models require superhuman feedback in order to provide an adequate training signal. Current approaches commonly train reward models from human preferences, which may then be bottlenecked by human performance level, and secondly these separate frozen reward models cannot then learn to improve during LLM training. In this work, we study Self-Rewarding Language Models, where the language model itself is used via LLM-as-a-Judge prompting to provide its own rewards during training. We show that during Iterative DPO training that not only does instruction following ability improve, but also the ability to provide high-quality rewards to itself. Fine-tuning Llama 2 70B on three iterations of our approach yields a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613. While only a preliminary study, this work opens the door to the possibility of models that can continually improve in both axes.

References


Jason Weston, Kyunghyun Cho, Sainbayar Sukhbaatar, Weizhe Yuan, Richard Yuanzhe Pang, and Jing Xu. (2024). "Self-Rewarding Language Models." doi:10.48550/arXiv.2401.10020