Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin, 2017, Advances in Neural Information Processing Systems, Vol. 30 (Curran Associates, Inc.) - Presents the Transformer architecture, which heavily influenced ASR models, and describes its associated learning rate schedule with warmup and decay.
Curriculum Learning, Yoshua Bengio, Jérôme Louradour, Ronan Collobert, Jason Weston, 2009, Proceedings of the 26th Annual International Conference on Machine Learning (ICML '09) (ACM Press), DOI: 10.1145/1553374.1553381 - The foundational paper that introduced the concept of curriculum learning, a training strategy that involves ordering examples from easy to hard.
An Overview of Multi-Task Learning in Deep Neural Networks, Sebastian Ruder, 2017, arXiv preprint arXiv:1706.05098 - A comprehensive survey that outlines the principles, applications, and benefits of multi-task learning in the context of deep neural networks.