Learning long-term dependencies with gradient descent is difficult, Yoshua Bengio, Patrice Simard, Paolo Frasconi, 1994. IEEE Transactions on Neural Networks, Vol. 5 (IEEE). DOI: 10.1109/72.279181 - This foundational paper rigorously analyzes the vanishing and exploding gradient problems in recurrent neural networks, providing the mathematical basis for understanding why long-term dependencies are difficult to learn with gradient descent.
Long Short-Term Memory, Sepp Hochreiter, Jürgen Schmidhuber, 1997. Neural Computation, Vol. 9 (MIT Press). DOI: 10.1162/neco.1997.9.8.1735 - This seminal paper introduces the Long Short-Term Memory (LSTM) network, specifically designed to address the vanishing gradient problem in recurrent neural networks by incorporating a gating mechanism that allows for more stable gradient flow over long sequences.
Deep Learning, Ian Goodfellow, Yoshua Bengio, Aaron Courville, 2016 (MIT Press) - An authoritative textbook offering a comprehensive treatment of recurrent neural networks, including a detailed discussion of the vanishing gradient problem and its mathematical underpinnings.