Alternatives: Direct Preference Optimization (DPO)
Deep Reinforcement Learning from Human Preferences, Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, Dario Amodei, 2017. Advances in Neural Information Processing Systems 30 (NeurIPS 2017). DOI: 10.48550/arXiv.1706.03741 - A foundational paper that proposes learning reward functions from human preferences to train reinforcement learning agents.
DPO (Direct Preference Optimization), Hugging Face, 2024 (Hugging Face) - Provides practical information and code examples for implementing DPO using the Hugging Face TRL library.