Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu, 2020. Journal of Machine Learning Research, Vol. 21 (JMLR). DOI: 10.48550/arXiv.1910.10683 - Details the construction of the C4 dataset, a significant corpus for LLM pretraining, outlining various filtering techniques including heuristics and statistical methods used for quality control.
Bag of Tricks for Efficient Text Classification, Armand Joulin, Edouard Grave, Piotr Bojanowski, Tomas Mikolov, 2016. arXiv. DOI: 10.48550/arXiv.1607.01759 - Introduces fastText, an influential library for efficient text classification and language identification that is commonly used in data filtering pipelines.