Leveraging Open Licensed Datasets

References

The Pile: An 800GB Dataset of Diverse Text for Language Modeling, Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy, 2020 (arXiv preprint), DOI: 10.48550/arXiv.2101.00027 - Introduces and details the construction, components, and characteristics of The Pile dataset, a prominent resource for LLM pre-training.
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu, 2020 (JMLR, Vol. 21), DOI: 10.5555/3455716.3455823 - Describes the creation of the C4 dataset as part of the T5 model research, detailing its cleaning and filtering process from Common Crawl.
🤗 Datasets Documentation, Hugging Face, 2024 (Hugging Face) - Provides comprehensive guides and API references for the datasets library, essential for efficiently accessing and processing open datasets for LLMs.
Open Data Handbook - Licensing Open Data, Open Knowledge Foundation, 2010 (Open Knowledge Foundation) - Offers practical guidance on understanding and choosing open licenses for data, clarifying terms like Creative Commons and Open Data Commons.