Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, Yoshua Bengio, 2014. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). DOI: 10.48550/arXiv.1406.1078 - Introduces Gated Recurrent Units (GRUs) as a simplified alternative to LSTMs, with a focus on their architecture and early application in sequence modeling.
Long Short-Term Memory, Sepp Hochreiter and Jürgen Schmidhuber, 1997. Neural Computation, Vol. 9 (MIT Press). DOI: 10.1162/neco.1997.9.8.1735 - The original paper presenting the Long Short-Term Memory (LSTM) architecture, which GRUs simplify.
Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville, 2016 (MIT Press) - A standard textbook that details the theory and application of recurrent neural networks, including a comparison of LSTMs and GRUs and their computational characteristics.
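To make the comparison concrete, here is a minimal sketch of a single GRU step following the gating equations in Cho et al. (2014), where the update gate interpolates between the previous hidden state and a candidate state. All weight names and dimensions are illustrative assumptions, not taken from any of the works above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h_prev, params):
    """One GRU step (Cho et al., 2014 formulation, biases omitted).

    params maps illustrative weight names to matrices:
    W* project the input x, U* project the previous hidden state.
    """
    z = sigmoid(params["Wz"] @ x + params["Uz"] @ h_prev)        # update gate
    r = sigmoid(params["Wr"] @ x + params["Ur"] @ h_prev)        # reset gate
    h_tilde = np.tanh(params["W"] @ x + params["U"] @ (r * h_prev))  # candidate
    # Interpolate: z keeps the old state, (1 - z) admits the candidate
    return z * h_prev + (1.0 - z) * h_tilde

# Tiny demo with random weights (hypothetical sizes)
rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
params = {k: 0.1 * rng.standard_normal(
              (n_hid, n_in if k.startswith("W") else n_hid))
          for k in ("Wz", "Uz", "Wr", "Ur", "W", "U")}

h = np.zeros(n_hid)
for x in rng.standard_normal((5, n_in)):
    h = gru_cell(x, h, params)
```

Compared with an LSTM cell, this sketch has two gates instead of three and no separate cell state, which is the computational simplification the references above discuss.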