The Pile: An 800GB Dataset of Diverse Text for Language Modeling, Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy, 2020. arXiv preprint arXiv:2101.00027. DOI: 10.48550/arXiv.2101.00027 - Describes the extensive data curation, filtering, and deduplication processes used to create a large-scale dataset for language model pre-training.
On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜, Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell, 2021. FAccT '21: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (Association for Computing Machinery). DOI: 10.1145/3442188.3445922 - Raises concerns regarding bias, ethical implications, and environmental costs associated with large language models, particularly focusing on data provenance and societal impact.