The Pile: An 800GB Dataset of Diverse Text for Language Modeling, Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy, 2020. arXiv preprint arXiv:2101.00027. DOI: 10.48550/arXiv.2101.00027 - Describes the extensive data curation, filtering, and deduplication processes used to create a large-scale dataset for language model pre-training.
On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜, Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell, 2021. FAccT '21: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (Association for Computing Machinery). DOI: 10.1145/3442188.3445922 - Raises concerns regarding bias, ethical implications, and environmental costs associated with large language models, particularly focusing on data provenance and societal impact.