Universal Language Model Fine-tuning for Text Classification, Jeremy Howard and Sebastian Ruder, 2018. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics. DOI: 10.18653/v1/P18-1031 - Presents ULMFiT, a three-stage transfer learning method for NLP tasks built on a pre-trained language model, showing performance gains with limited task-specific data.
Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin, 2017. arXiv preprint arXiv:1706.03762. DOI: 10.48550/arXiv.1706.03762 - Introduces the Transformer architecture, which advanced sequence modeling and became the foundation for modern large language models.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, 2019. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics. DOI: 10.18653/v1/N19-1423 - Introduces BERT, a deep bidirectional Transformer pre-trained with masked language modeling and fine-tuned for many NLP tasks, popularizing the pre-train, fine-tune approach.