Deep Reinforcement Learning from Human Preferences, Paul F. Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei, 2017. Advances in Neural Information Processing Systems, Vol. 30. DOI: 10.48550/arXiv.1706.03741 - Introduces the method of learning a reward function from human comparisons and then using this learned reward function to train deep reinforcement learning agents. This work laid a significant foundation for modern RLHF.
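In this approach, the reward model is fit by treating each human comparison between two trajectory segments as a Bradley-Terry outcome and minimizing a cross-entropy loss. A minimal sketch of that objective, assuming PyTorch; the function name and tensor shapes are illustrative, not taken from the paper:

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_sum_1: torch.Tensor,
                    reward_sum_2: torch.Tensor,
                    pref: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over pairwise segment comparisons (Bradley-Terry style).

    reward_sum_1, reward_sum_2: predicted total reward of each segment, shape (batch,).
    pref: 1.0 where the human preferred segment 1, 0.0 where segment 2 was preferred.
    """
    # P[segment 1 preferred] = exp(r1) / (exp(r1) + exp(r2)) = sigmoid(r1 - r2)
    logits = reward_sum_1 - reward_sum_2
    return F.binary_cross_entropy_with_logits(logits, pref)
```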
Proximal Policy Optimization Algorithms, John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov, 2017. arXiv preprint. DOI: 10.48550/arXiv.1707.06347 - Describes Proximal Policy Optimization (PPO), an effective and widely used policy gradient method for reinforcement learning and a core algorithm in the RL fine-tuning stage of RLHF.
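PPO's central idea is a clipped surrogate objective that limits how far each update moves the policy away from the one that collected the data. A minimal sketch of that clipped loss, assuming PyTorch; the paper uses a clipping parameter of 0.2, but the names here are illustrative, and the full objective also adds value-function and entropy terms:

```python
import torch

def ppo_clip_loss(logp_new: torch.Tensor,
                  logp_old: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate objective (PPO-Clip), negated so it can be minimized."""
    ratio = torch.exp(logp_new - logp_old)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```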
Training language models to follow instructions with human feedback, Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe, 2022. Advances in Neural Information Processing Systems, Vol. 35. - Introduces the InstructGPT model and details the complete three-stage RLHF pipeline for aligning large language models: Supervised Fine-Tuning (SFT), Reward Model (RM) training on human preference comparisons, and Reinforcement Learning fine-tuning with PPO and a KL penalty against the SFT policy.
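In the final stage, the quantity maximized by PPO is the reward model's score minus a KL penalty that keeps the RL policy close to the SFT policy. A minimal sketch of that per-sequence reward, assuming PyTorch; the names and the beta value are illustrative, not taken from the paper:

```python
import torch

def rlhf_reward(rm_score: torch.Tensor,
                logp_policy: torch.Tensor,
                logp_sft: torch.Tensor,
                beta: float = 0.02) -> torch.Tensor:
    """Reward model score minus a KL penalty toward the frozen SFT policy.

    rm_score: reward model score per sampled response, shape (batch,).
    logp_policy, logp_sft: per-token log-probs of the response under the
    RL policy and the SFT policy, shape (batch, seq_len).
    """
    kl_per_token = logp_policy - logp_sft        # log [pi_RL / pi_SFT] per token
    return rm_score - beta * kl_per_token.sum(dim=-1)
```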