Long Short-Term Memory, Sepp Hochreiter, Jürgen Schmidhuber, 1997, Neural Computation, Vol. 9, DOI: 10.1162/neco.1997.9.8.1735 - Introduces the Long Short-Term Memory (LSTM) architecture to address long-term dependencies in RNNs.
Sequence to Sequence Learning with Neural Networks, Ilya Sutskever, Oriol Vinyals, Quoc V. Le, 2014, Advances in Neural Information Processing Systems (NIPS) 27 - Presents a general end-to-end approach for sequence learning, illustrating the encoder-decoder structure with a fixed-size context.
Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin, 2017, Advances in Neural Information Processing Systems (NIPS) 30 (Curran Associates, Inc.), DOI: 10.48550/arXiv.1706.03762 - Introduces the Transformer model, designed to overcome limitations of recurrent models such as sequential computation and difficulty capturing long-range dependencies.
Deep Learning, Ian Goodfellow, Yoshua Bengio, Aaron Courville, 2016 (MIT Press) - Provides a foundation on recurrent neural networks, backpropagation through time, and associated training difficulties.