Proximal Policy Optimization Algorithms, John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov, 2017, arXiv preprint arXiv:1707.06347, DOI: 10.48550/arXiv.1707.06347 - Introduces the Proximal Policy Optimization (PPO) algorithm, a popular and effective method for reinforcement learning known for its stability and sample efficiency.
Training Language Models to Follow Instructions with Human Feedback, Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, et al., 2022, Advances in Neural Information Processing Systems, Vol. 35 - Details the Reinforcement Learning from Human Feedback (RLHF) process, including supervised fine-tuning, reward model training, and PPO-based policy optimization, for aligning language models with human preferences.
Hugging Face TRL Library Documentation, Hugging Face, 2024 - Provides practical guidance and implementations for Reinforcement Learning from Human Feedback (RLHF), including PPO, specifically for fine-tuning Transformer-based language models.
Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto, 2018, MIT Press - A comprehensive textbook covering the fundamental concepts, algorithms, and theory of reinforcement learning, essential for understanding PPO's underlying principles.