Scaling Instruction-Finetuned Language Models, Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, Jason Wei, 2022 (arXiv preprint), DOI: 10.48550/arXiv.2210.11416 - This paper scales instruction finetuning on the Flan collection of tasks, demonstrating that increasing the number and diversity of finetuning tasks improves generalization to unseen tasks in language models.
Dolly v2: Databricks’ First Open-Source, Instruction-Following LLM, Mike Conover, Matthew Hayes, Jonathan Frank, et al., 2023 (Databricks Blog) - Details the creation of the Databricks Dolly instruction dataset, notable for being written entirely by humans rather than derived from proprietary model outputs, and its use in training an open-source instruction-following LLM.