Constitutional AI: Harmlessness from AI Feedback, Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, Jared Kaplan, 2022. arXiv preprint arXiv:2212.08073. DOI: 10.48550/arXiv.2212.08073 - Presents the Constitutional AI framework, which involves generating AI critiques and revisions, a core topic of this section.
Deep reinforcement learning from human preferences, Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, Dario Amodei, 2017. arXiv preprint arXiv:1706.03741. DOI: 10.48550/arXiv.1706.03741 - Foundational work on training reinforcement learning agents using human feedback, a precursor to RLAIF and essential for understanding preference labeling.
QLoRA: Efficient Finetuning of Quantized LLMs, Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer, 2023. arXiv preprint arXiv:2305.14314. DOI: 10.48550/arXiv.2305.14314 - Introduces an efficient fine-tuning approach for quantized large language models, relevant for optimizing feedback models through quantization.
Distilling the knowledge in a neural network, Geoffrey Hinton, Oriol Vinyals, Jeff Dean, 2015. arXiv preprint arXiv:1503.02531. DOI: 10.48550/arXiv.1503.02531 - A foundational paper introducing the concept of model distillation, where a smaller 'student' model is trained to mimic a larger 'teacher' model.