Long Short-Term Memory, Sepp Hochreiter, Jürgen Schmidhuber, 1997, Neural Computation, Vol. 9 (MIT Press), DOI: 10.1162/neco.1997.9.8.1735 - Introduces the Long Short-Term Memory (LSTM) architecture, designed to overcome the vanishing gradient problem and effectively learn long-range dependencies in recurrent neural networks.
Deep Learning, Ian Goodfellow, Yoshua Bengio, Aaron Courville, 2016 (MIT Press) - Chapter 10 provides a detailed theoretical and practical overview of recurrent neural networks, including a thorough explanation of vanishing and exploding gradients, and various solutions.
Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, Yoshua Bengio, 2014, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (Association for Computational Linguistics), DOI: 10.3115/v1/D14-1179 - Introduces the Gated Recurrent Unit (GRU) as an efficient alternative to LSTMs for sequence modeling, particularly addressing challenges like vanishing gradients.