LLaMA: Open and Efficient Foundation Language Models. Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample. 2023. arXiv preprint arXiv:2302.13971. DOI: 10.48550/arXiv.2302.13971 - Describes the composition of the training data mixture for the LLaMA models, providing specific sampling proportions for its diverse sources.
The Pile: An 800GB Dataset of Diverse Text for Language Modeling. Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy. 2020. arXiv preprint arXiv:2101.00027. DOI: 10.48550/arXiv.2101.00027 - Details the construction of a large, diverse dataset from 22 distinct sources, explicitly discussing the weighting and sampling strategies used.
PaLM: Scaling Language Modeling with Pathways. Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, et al. 2023. Journal of Machine Learning Research, Vol. 24 - Presents a large language model and describes its training data mixture, which involves weighting a combination of high-quality textual sources.
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu. 2020. Journal of Machine Learning Research, Vol. 21 - Discusses the creation of the C4 dataset and the combination of various public datasets for pre-training, highlighting the challenges of data composition.