Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, 2017. Advances in Neural Information Processing Systems (NeurIPS). DOI: 10.48550/arXiv.1706.03762 - This paper introduces the Transformer architecture, which relies entirely on attention mechanisms and removes recurrence, thus overcoming the limitations of RNNs for sequence modeling.
Long Short-Term Memory, Sepp Hochreiter and Jürgen Schmidhuber, 1997. Neural Computation, Vol. 9 (The MIT Press). DOI: 10.1162/neco.1997.9.8.1735 - This foundational paper introduces Long Short-Term Memory (LSTM) networks, a type of recurrent neural network designed to address the vanishing gradient problem and improve the handling of long-range dependencies in sequences.
Stanford CS224N: Natural Language Processing with Deep Learning, Diyi Yang, Tatsunori Hashimoto, 2025 (Stanford University) - This university course offers comprehensive materials on deep learning for natural language processing, covering RNNs, LSTMs, GRUs, attention mechanisms, and the Transformer architecture, explaining their architectural evolution and applications.
Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, Yoshua Bengio, 2014. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). DOI: 10.48550/arXiv.1406.1078 - This paper introduces the Gated Recurrent Unit (GRU), a simpler variant of the LSTM that effectively models sequences and also helps address long-range dependencies in recurrent networks.