Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context, Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov, 2019. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL). DOI: 10.48550/arXiv.1901.02860 - The original paper introducing the Transformer-XL architecture and its relative positional encoding scheme, including detailed mathematical formulations and implementation strategies.
Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, 2017. Advances in Neural Information Processing Systems (NeurIPS) 30. DOI: 10.48550/arXiv.1706.03762 - The seminal paper that introduced the Transformer architecture and its original absolute sinusoidal positional encoding, providing background for subsequent relative encoding methods.