Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, 2017. Advances in Neural Information Processing Systems (NeurIPS). DOI: 10.48550/arXiv.1706.03762 - This paper introduces the Transformer architecture, which relies entirely on attention mechanisms and removes recurrence, thus overcoming the limitations of RNNs for sequence modeling.
Long Short-Term Memory, Sepp Hochreiter and Jürgen Schmidhuber, 1997. Neural Computation, Vol. 9 (The MIT Press). DOI: 10.1162/neco.1997.9.8.1735 - This foundational paper introduces Long Short-Term Memory (LSTM) networks, a type of recurrent neural network designed to address the vanishing gradient problem and improve the handling of long-range dependencies in sequences.
Stanford CS224N: Natural Language Processing with Deep Learning, Diyi Yang, Tatsunori Hashimoto, 2025 (Stanford University) - This university course offers comprehensive materials on deep learning for natural language processing, covering RNNs, LSTMs, GRUs, attention mechanisms, and the Transformer architecture, explaining their architectural evolution and applications.
Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, Yoshua Bengio, 2014. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). DOI: 10.48550/arXiv.1406.1078 - This paper introduces the Gated Recurrent Unit (GRU), a simpler variant of the LSTM that effectively models sequences and also helps address long-range dependencies in recurrent networks.