The Pile: An 800GB Dataset of Diverse Text for Language Modeling, Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy, 2020 (arXiv preprint), DOI: 10.48550/arXiv.2101.00027 - Introduces and details the construction, components, and characteristics of The Pile dataset, a prominent resource for LLM pre-training.
🤗 Datasets Documentation, Hugging Face, 2024 (Hugging Face) - Provides comprehensive guides and API references for the datasets library, essential for efficiently accessing and processing open datasets for LLMs.
Open Data Handbook - Licensing Open Data, Open Knowledge Foundation, 2010 (Open Knowledge Foundation) - Offers practical guidance on understanding and choosing open licenses for data, clarifying terms like Creative Commons and Open Data Commons.