The Pile: An 800GB Dataset of Diverse Text for Language Modeling, Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy, 2020, arXiv preprint, DOI: 10.48550/arXiv.2101.00027 - Describes a large-scale, diverse, and high-quality dataset specifically designed for training large language models, comprising many of the sources discussed in this section.
Hugging Face Datasets Library Documentation, Hugging Face, 2024 - Official guide for using the datasets library to access, process, and share datasets for machine learning, including many LLM pre-training corpora.
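As a minimal sketch of the workflow the documentation above covers: the snippet below loads a public corpus with the datasets library in streaming mode, which avoids downloading an entire pre-training corpus to disk. The specific dataset name (wikitext / wikitext-103-raw-v1) is only an illustrative choice, not one prescribed by the sources cited here.

```python
# Illustrative use of the Hugging Face `datasets` library (assumes `pip install datasets`).
from datasets import load_dataset

# Stream a text corpus instead of materializing it locally; useful for
# large LLM pre-training datasets. The dataset name here is just an example.
dataset = load_dataset(
    "wikitext",
    "wikitext-103-raw-v1",
    split="train",
    streaming=True,
)

# Iterate lazily over records; each record is a dict with a "text" field.
for i, record in enumerate(dataset):
    print(record["text"][:80])
    if i >= 2:  # preview only the first few records
        break
```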