GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding, Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman, 2019 (International Conference on Learning Representations (ICLR); first posted 2018), DOI: 10.48550/arXiv.1804.07461 - The original paper introducing the General Language Understanding Evaluation (GLUE) benchmark, detailing its tasks and methodology.
Fine-tuning a pretrained model, Hugging Face, 2024 (Hugging Face NLP Course) - A chapter from the Hugging Face NLP Course that explains the practical process of fine-tuning pretrained transformer models for specific NLP tasks, including code examples relevant to GLUE/SuperGLUE evaluation; see the sketch below for the general shape of that workflow.
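
For orientation, here is a minimal sketch of the kind of workflow the chapter above covers: fine-tuning a pretrained checkpoint on a GLUE task with the `transformers` Trainer API. The task (MRPC), checkpoint (`bert-base-uncased`), and output directory are illustrative choices, not taken from the chapter verbatim, and the sketch assumes the `transformers`, `datasets`, and `evaluate` libraries are installed.

```python
# Minimal sketch: fine-tune a pretrained model on GLUE's MRPC task.
# Assumptions (not from the cited chapter verbatim): the checkpoint, task,
# and output directory are illustrative; requires transformers, datasets, evaluate.
import numpy as np
import evaluate
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

checkpoint = "bert-base-uncased"
raw_datasets = load_dataset("glue", "mrpc")  # paraphrase-detection sentence pairs
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_fn(batch):
    # MRPC examples are sentence pairs; pad dynamically via the collator below.
    return tokenizer(batch["sentence1"], batch["sentence2"], truncation=True)

tokenized = raw_datasets.map(tokenize_fn, batched=True)
metric = evaluate.load("glue", "mrpc")  # reports accuracy and F1 for MRPC

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    return metric.compute(predictions=np.argmax(logits, axis=-1), references=labels)

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mrpc-finetune"),  # hypothetical output dir
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorWithPadding(tokenizer),
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())  # GLUE validation metrics for the fine-tuned model
```

The same pattern transfers to the other GLUE tasks by swapping the dataset configuration name, the tokenized column names, and `num_labels`; SuperGLUE tasks follow the same Trainer-based recipe but with their own dataset and metric configurations.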