Attention Is All You Need. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. Advances in Neural Information Processing Systems 30 (NIPS 2017), 2017. DOI: 10.48550/arXiv.1706.03762 - Foundational paper introducing the Transformer architecture, including the Post-LN residual structure and scaled dot-product attention.
On Layer Normalization in the Transformer Architecture. Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, Tie-Yan Liu. Proceedings of the 37th International Conference on Machine Learning (ICML 2020), Vol. 119, 2020. DOI: 10.48550/arXiv.2002.04745 - Analyzes how the placement of layer normalization (Pre-LN vs. Post-LN) affects Transformer training stability and gradient flow.
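For quick reference, a minimal sketch of the mechanisms these two papers cover: scaled dot-product attention as defined by Vaswani et al. (2017), and the Post-LN vs. Pre-LN residual orderings compared by Xiong et al. (2020). The layer index l and the generic Sublayer(·) notation are illustrative shorthand, not taken verbatim from either paper.

```latex
% Scaled dot-product attention (Vaswani et al., 2017):
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V

% Post-LN residual block (original Transformer): normalize after the residual sum.
x_{l+1} = \mathrm{LayerNorm}\bigl(x_l + \mathrm{Sublayer}(x_l)\bigr)

% Pre-LN residual block (analyzed by Xiong et al., 2020): normalize inside the residual branch.
x_{l+1} = x_l + \mathrm{Sublayer}\bigl(\mathrm{LayerNorm}(x_l)\bigr)
```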