Long Short-Term Memory, Sepp Hochreiter, Jürgen Schmidhuber, 1997Neural Computation, Vol. 9 (The MIT Press)DOI: 10.1162/neco.1997.9.8.1735 - Introduces the Long Short-Term Memory (LSTM) architecture, which was designed to overcome the vanishing gradient problem in RNNs.
Deep Learning, Ian Goodfellow, Yoshua Bengio, Aaron Courville, 2016 (MIT Press) - A comprehensive textbook providing detailed explanations of recurrent neural networks, backpropagation through time, and gradient issues. See Chapter 10 for sequence modeling.