2023 DirectPreferenceOptimizationYou


Subject Headings: Direct Preference Optimization (DPO).

Notes

  • It introduces Direct Preference Optimization (DPO), a novel approach for aligning language models with human preferences, as a simpler alternative to the complex and often unstable reinforcement learning from human feedback (RLHF).
  • It rests on the mathematical insight that, under the KL-constrained RLHF objective, each language model policy is the optimal policy for a specific reward function, so the reward can be expressed in terms of the policy itself and no separately represented reward model is needed (the resulting objective is written out after this list).
  • It simplifies aligning instruction-tuned LLMs with human preferences, since only the language model transformer is trained and the reward function is defined implicitly by it.
  • It can be computationally lighter and easier to implement than RLHF, which trains two transformer networks and is sensitive to hyperparameter choices.
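
Concretely, under the paper's KL-constrained objective the implicit reward of a policy π_θ relative to a reference model π_ref is β log(π_θ(y|x)/π_ref(y|x)), up to a prompt-dependent constant; plugging this into a Bradley-Terry preference model gives the DPO loss over preference pairs (x, y_w, y_l), where y_w is preferred over y_l, β controls the strength of the KL penalty, and σ is the logistic function:

```latex
\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
\]
```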

Cited By

Quotes

Abstract

While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF). However, RLHF is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and then fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting too far from the original model. In this paper we introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form, allowing us to solve the standard RLHF problem with only a simple classification loss. The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight, eliminating the need for sampling from the LM during fine-tuning or performing significant hyperparameter tuning. Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods. Notably, fine-tuning with DPO exceeds PPO-based RLHF in ability to control sentiment of generations, and matches or improves response quality in summarization and single-turn dialogue while being substantially simpler to implement and train.
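
The "simple classification loss" described above amounts to a logistic loss on the β-scaled log-probability margins against the frozen reference model. Below is a minimal PyTorch-style sketch under that reading; the function name dpo_loss, its argument names, and the default beta=0.1 are illustrative assumptions rather than the authors' reference implementation.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Minimal sketch of a DPO-style loss over a batch of preference pairs.

    Each argument is a 1-D tensor of summed per-token log-probabilities of a
    response under either the trainable policy or the frozen reference model;
    beta scales the implicit KL penalty toward the reference.
    """
    # Implicit rewards: beta-scaled log-ratios against the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry preference likelihood -> logistic loss on the reward margin.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

In practice, each log-probability is the sum of per-token log-probabilities of a response given its prompt, computed once under the trainable policy and once under the frozen reference model; no sampling from the LM is required during fine-tuning.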

References

Rafael Rafailov; Archit Sharma; Eric Mitchell; Christopher D. Manning; Stefano Ermon; Chelsea Finn (2023). "Direct Preference Optimization: Your Language Model is Secretly a Reward Model." doi:10.48550/arXiv.2305.18290