RoFormer: Enhanced Transformer with Rotary Position Embedding, Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, Yunfeng Liu, 2021. arXiv preprint arXiv:2104.09864. DOI: 10.48550/arXiv.2104.09864 - This is the seminal paper introducing Rotary Position Embedding (RoPE), detailing its mathematical foundation and demonstrating its effectiveness in the Transformer architecture.
Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, 2017. Advances in Neural Information Processing Systems 30 (NIPS 2017). DOI: 10.48550/arXiv.1706.03762 - The foundational paper that introduced the Transformer architecture, providing essential background for understanding positional encoding mechanisms like RoPE.
Self-Attention with Relative Position Representations, Peter Shaw, Jakob Uszkoreit, Ashish Vaswani, 2018. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). DOI: 10.48550/arXiv.1803.02155 - This paper proposes an early method for incorporating relative position representations into self-attention, offering a point of comparison for RoPE's approach.
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context, Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov, 2019. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL). DOI: 10.48550/arXiv.1901.02860 - This work introduces a relative positional encoding scheme and segment-level recurrence, offering another perspective on handling sequence length and relative positions in Transformers.