Language Models are Few-Shot Learners, Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei, 2020. Advances in Neural Information Processing Systems 33 (NeurIPS 2020). DOI: 10.48550/arXiv.2005.14165 - This paper details the data mixture used to train GPT-3, including the sampling proportions of sources such as Common Crawl, WebText2, Books1/Books2, and Wikipedia, and discusses their impact on model performance; a minimal mixture-sampling sketch follows this reference list.
The Pile: An 800GB Dataset of Diverse Text for Language Modeling, Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy, 2021. arXiv preprint arXiv:2101.00027. DOI: 10.48550/arXiv.2101.00027 - This paper presents a large, diverse dataset explicitly designed to foster generalization in language models by combining 22 distinct high-quality sources, illustrating strategic data-mixture composition.
PaLM: Scaling Language Modeling with Pathways, Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al., 2022. arXiv preprint arXiv:2204.02311. DOI: 10.48550/arXiv.2204.02311 - This paper describes PaLM's highly diverse training corpus, which includes text from web pages, books, conversations, and source code, emphasizing the role of the data mixture in scaling language models.
Llama 2: Open Foundation and Fine-Tuned Chat Models, Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al., 2023. arXiv preprint arXiv:2307.09288. DOI: 10.48550/arXiv.2307.09288 - This work details the pre-training data composition for Llama 2, underscoring the importance of data quality, diversity, and safety considerations in developing modern large language models.
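All four papers above revolve around the same practical question: how to compose a pre-training corpus from several weighted sources. As a rough illustration, the Python sketch below draws training documents according to fixed mixture weights. The corpus names, weight values, and helper functions are assumptions for illustration only, loosely echoing the proportions reported in the GPT-3 paper rather than reproducing any cited pipeline.

```python
import random

# Illustrative mixture weights, loosely following the sampling proportions
# reported for GPT-3 (Brown et al., 2020). The corpus names and exact values
# are placeholders, not any cited paper's actual configuration.
MIXTURE_WEIGHTS = {
    "common_crawl": 0.60,
    "webtext2": 0.22,
    "books1": 0.08,
    "books2": 0.08,
    "wikipedia": 0.03,
}


def build_batch(corpora, weights, batch_size, seed=0):
    """Draw a batch of documents: pick a source by weight, then a document from it.

    random.choices normalises the weights internally, so they need not sum to exactly 1.
    """
    rng = random.Random(seed)
    names = list(weights)
    sources = rng.choices(names, weights=[weights[n] for n in names], k=batch_size)
    return [rng.choice(corpora[source]) for source in sources]


if __name__ == "__main__":
    # Toy corpora standing in for the real datasets.
    corpora = {name: [f"{name}-doc-{i}" for i in range(100)] for name in MIXTURE_WEIGHTS}
    print(build_batch(corpora, MIXTURE_WEIGHTS, batch_size=8))
```

A production pipeline would stream shards per source and combine weighting with deduplication and quality filtering rather than holding documents in memory, but the per-source weighting shown here is the core idea of data-mixture composition that these papers discuss.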