Training language models to follow instructions with human feedback, Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, et al., 2022. arXiv preprint arXiv:2203.02155. DOI: 10.48550/arXiv.2203.02155 - Presents the RLHF pipeline used to train InstructGPT, which this section identifies as hard to scale because it relies on continuous human preference labeling.
Constitutional AI: Harmlessness from AI Feedback, Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, et al., 2022. arXiv preprint arXiv:2212.08073. DOI: 10.48550/arXiv.2212.08073 - Directly introduces Constitutional AI as a method for scalable alignment, using AI feedback to reduce the need for extensive human supervision.
RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback, Harrison Lee, Samrat Phatale, Hassan Mansoor, et al., 2023. arXiv preprint arXiv:2309.00267. DOI: 10.48550/arXiv.2309.00267 - Focuses on Reinforcement Learning from AI Feedback (RLAIF), which scales alignment by training on AI-generated preference labels, directly addressing the human-feedback bottleneck.
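To make the mechanism shared by the Constitutional AI and RLAIF entries above concrete, the following is a minimal sketch of the AI-preference-labeling step, in which an AI judge replaces the human annotator when building the preference dataset used for reward-model training. It is an illustration under assumed names only: `PreferencePair`, `label_preferences`, and `judge_fn` are hypothetical and not taken from any of the cited papers.

```python
# Hypothetical sketch: building a preference dataset with an AI judge
# instead of human raters (the step RLAIF-style methods automate).
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PreferencePair:
    prompt: str
    chosen: str      # response the judge preferred
    rejected: str    # response the judge did not prefer


def label_preferences(
    samples: List[Tuple[str, str, str]],        # (prompt, response_a, response_b)
    judge_fn: Callable[[str, str, str], int],   # returns 0 if response_a is preferred, else 1
) -> List[PreferencePair]:
    """Label candidate response pairs with an AI judge to form preference data."""
    dataset = []
    for prompt, resp_a, resp_b in samples:
        preferred = judge_fn(prompt, resp_a, resp_b)
        chosen, rejected = (resp_a, resp_b) if preferred == 0 else (resp_b, resp_a)
        dataset.append(PreferencePair(prompt, chosen, rejected))
    return dataset


if __name__ == "__main__":
    # Toy judge: prefers the longer response, standing in for an LLM judge
    # prompted with rating instructions or a constitution.
    toy_judge = lambda prompt, a, b: 0 if len(a) >= len(b) else 1

    pairs = label_preferences(
        [("Explain RLHF briefly.",
          "RLHF trains a reward model on human preference comparisons.",
          "It uses feedback.")],
        toy_judge,
    )
    print(pairs[0].chosen)
```

In the cited work the judge is itself a language model, and the resulting pairs feed the same reward-model and policy-optimization stages as standard RLHF; only the source of the preference labels changes.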