The Pile: An 800GB Dataset of Diverse Text for Language Modeling, Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy, 2020, arXiv preprint, DOI: 10.48550/arXiv.2101.00027 - Describes a large-scale, diverse, and high-quality dataset specifically designed for training large language models, comprising many of the sources discussed in this section.
Hugging Face Datasets Library Documentation, Hugging Face, 2024 - Official guide for using the datasets library to access, process, and share datasets for machine learning, including many LLM pre-training corpora.
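As a minimal sketch of the workflow the documentation above covers: the snippet below loads a public corpus with the datasets library in streaming mode, which avoids downloading an entire pre-training corpus to disk. The specific dataset name (wikitext / wikitext-103-raw-v1) is only an illustrative choice, not one prescribed by the sources cited here.

```python
# Illustrative use of the Hugging Face `datasets` library (assumes `pip install datasets`).
from datasets import load_dataset

# Stream a text corpus instead of materializing it locally; useful for
# large LLM pre-training datasets. The dataset name here is just an example.
dataset = load_dataset(
    "wikitext",
    "wikitext-103-raw-v1",
    split="train",
    streaming=True,
)

# Iterate lazily over records; each record is a dict with a "text" field.
for i, record in enumerate(dataset):
    print(record["text"][:80])
    if i >= 2:  # preview only the first few records
        break
```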