Self-Attention with Relative Position Representations, Peter Shaw, Jakob Uszkoreit, Ashish Vaswani, 2018. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), Association for Computational Linguistics. DOI: 10.18653/v1/N18-2074 - The foundational paper introducing relative position representations into Transformer self-attention, detailing how both the attention scores and the value aggregation are modified (see the sketch after this list).
Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, 2017. Advances in Neural Information Processing Systems 30 (NeurIPS 2017). DOI: 10.48550/arXiv.1706.03762 - The original paper that introduced the Transformer architecture, providing the fundamental context for all subsequent positional encoding variations.
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu, 2019; published in the Journal of Machine Learning Research, Vol. 21, 2020. DOI: 10.48550/arXiv.1910.10683 - Describes the T5 model, which uses a different but related relative positional encoding scheme (learned scalar biases added to the attention logits, indexed by bucketed relative distances), offering practical insights into implementation in large models.
Speech and Language Processing (3rd ed. draft), Daniel Jurafsky and James H. Martin, 2025 - A comprehensive textbook treatment of the Transformer architecture, including discussion of various positional encoding techniques such as relative position embeddings; Chapter 10 covers Transformers and large language models.
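For quick orientation, the modification described in Shaw et al. (2018) can be sketched as follows. This is a brief summary in (roughly) the paper's notation, not a substitute for its full formulation: x_i is the input at position i, W^Q, W^K, W^V are the usual projection matrices, alpha_ij are the softmax-normalized attention weights, and w^K, w^V are learned embeddings of the relative distance j - i, clipped to a maximum distance k.

```latex
% Attention logits: a relative-position embedding is added to the key term.
e_{ij} \;=\; \frac{x_i W^Q \,\bigl(x_j W^K + a^{K}_{ij}\bigr)^{\top}}{\sqrt{d_z}}

% Value aggregation: a second relative-position embedding is added to the value term.
z_i \;=\; \sum_{j} \alpha_{ij}\,\bigl(x_j W^V + a^{V}_{ij}\bigr)

% Both embeddings are indexed by the clipped relative distance between positions.
a^{K}_{ij} = w^{K}_{\mathrm{clip}(j-i,\,k)}, \qquad
a^{V}_{ij} = w^{V}_{\mathrm{clip}(j-i,\,k)}, \qquad
\mathrm{clip}(x, k) = \max\bigl(-k, \min(k, x)\bigr)
```

T5 (Raffel et al.) takes a simpler route to a similar end: instead of modifying the keys and values, it adds a learned scalar bias, indexed by a bucketed relative distance, directly to each attention logit.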