Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context, Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, Ruslan Salakhutdinov, 2019. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Association for Computational Linguistics). DOI: 10.18653/v1/P19-1285 - Proposes a relative positional encoding scheme that allows for context reuse and generalizes better to longer sequences by making positional information dependent only on relative distances.
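For reference, a sketch of the attention-score decomposition this scheme rests on, in the paper's notation (E are content embeddings, R_{i-j} the relative positional encoding, u and v learned global biases); position enters the score only through the relative offset i-j:

$$
\mathbf{A}^{\mathrm{rel}}_{i,j}
= \underbrace{\mathbf{E}_{x_i}^{\top}\mathbf{W}_q^{\top}\mathbf{W}_{k,E}\,\mathbf{E}_{x_j}}_{(a)\ \text{content--content}}
+ \underbrace{\mathbf{E}_{x_i}^{\top}\mathbf{W}_q^{\top}\mathbf{W}_{k,R}\,\mathbf{R}_{i-j}}_{(b)\ \text{content--position}}
+ \underbrace{u^{\top}\mathbf{W}_{k,E}\,\mathbf{E}_{x_j}}_{(c)\ \text{global content bias}}
+ \underbrace{v^{\top}\mathbf{W}_{k,R}\,\mathbf{R}_{i-j}}_{(d)\ \text{global position bias}}
$$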
Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin, 2017. Advances in Neural Information Processing Systems 30 (NIPS 2017) (Neural Information Processing Systems Foundation, Inc. (NeurIPS)) - The foundational paper introducing the Transformer architecture and its absolute positional encoding, providing the context for the subsequent development of relative positional encoding schemes.
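For contrast, the fixed sinusoidal encoding defined in that paper, which is added to the input embeddings and indexed by the absolute position pos (i is the dimension index, d_model the embedding width):

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
$$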