Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, 2017. Advances in Neural Information Processing Systems, Vol. 30 (Neural Information Processing Systems). DOI: 10.5555/3295222.3295349 - Foundational paper introducing the Transformer architecture and the self-attention mechanism, highlighting its quadratic complexity with respect to sequence length (illustrated in the first sketch after this list).
Longformer: The Long-Document Transformer, Iz Beltagy, Matthew E. Peters, and Arman Cohan, 2020. arXiv preprint arXiv:2004.05150. DOI: 10.48550/arXiv.2004.05150 - Introduces the Longformer model, which combines sliding-window and global attention patterns to efficiently process very long sequences (see the second sketch after this list).
Big Bird: Transformers for Longer Sequences, Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed, 2020. Advances in Neural Information Processing Systems, Vol. 33 (Curran Associates, Inc.). DOI: 10.5555/3455793.3455928 - Presents BigBird, a sparse attention mechanism that combines global, local, and random attention to achieve linear complexity while maintaining strong performance on long-sequence tasks (see the third sketch after this list).
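
A minimal NumPy sketch of the scaled dot-product self-attention introduced in Vaswani et al., not the authors' reference implementation: the explicit n × n score matrix is what makes compute and memory grow quadratically with sequence length. The function name and projection matrices are illustrative only.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence x of shape (n, d).

    The score matrix has shape (n, n), so both memory and compute grow
    quadratically with the sequence length n.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v               # project to queries/keys/values
    scores = q @ k.T / np.sqrt(k.shape[-1])           # (n, n) pairwise scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ v                                # weighted sum of values

# Toy usage: a sequence of 8 tokens with 16-dimensional embeddings.
rng = np.random.default_rng(0)
n, d = 8, 16
x = rng.standard_normal((n, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (8, 16)
```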
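
A rough sketch, under simplifying assumptions, of the kind of attention pattern Longformer uses: a sliding local window plus a few globally attending tokens. The function `longformer_style_mask` and its parameters are hypothetical names, not the paper's or any library's API, and a real implementation avoids materializing a dense n × n mask.

```python
import numpy as np

def longformer_style_mask(n, window, global_tokens):
    """Boolean attention mask combining a sliding local window with a few
    globally attending positions. True entries mark allowed query-key pairs;
    their count grows linearly with n for fixed window and global set.
    """
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        mask[i, lo:hi] = True                 # local sliding window
    for g in global_tokens:
        mask[g, :] = True                     # global token attends everywhere
        mask[:, g] = True                     # and every token attends to it
    return mask

# Toy usage: 16 tokens, window of 2 on each side, token 0 (e.g. a [CLS]-like token) global.
mask = longformer_style_mask(16, window=2, global_tokens=[0])
print(mask.sum(), "allowed query-key pairs out of", mask.size)
```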
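
In the same spirit, a hedged sketch of a BigBird-style mask combining the three components the paper describes (global, local, and random attention). `bigbird_style_mask` is again an illustrative name; production implementations use blocked sparse kernels rather than a dense boolean matrix.

```python
import numpy as np

def bigbird_style_mask(n, window, global_tokens, num_random, seed=0):
    """Boolean mask combining global, sliding-window, and random attention.
    Each row keeps O(window + num_random + len(global_tokens)) entries,
    so the total number of allowed pairs scales linearly with n.
    """
    rng = np.random.default_rng(seed)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        mask[i, lo:hi] = True                                          # local window
        mask[i, rng.choice(n, size=num_random, replace=False)] = True  # random links
    for g in global_tokens:
        mask[g, :] = True                                              # global rows
        mask[:, g] = True                                              # global columns
    return mask

# Toy usage: 32 tokens, window of 3, two global tokens, three random keys per query.
mask = bigbird_style_mask(32, window=3, global_tokens=[0, 1], num_random=3)
print("density:", mask.mean())  # fraction of allowed pairs, well below 1.0
```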