Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin, 2017. Advances in Neural Information Processing Systems, Vol. 30 (Curran Associates, Inc.). DOI: 10.5555/3295222.3295349 - Introduces the original Transformer architecture, which uses Post-Layer Normalization.
Layer Normalization, Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton, 2016. arXiv preprint arXiv:1607.06450. DOI: 10.48550/arXiv.1607.06450 - Introduces the concept of Layer Normalization, a fundamental component in Transformer architectures (a minimal sketch of the operation follows this list).
Language Models are Unsupervised Multitask Learners, Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, 2019. OpenAI. - A seminal paper that popularized the use of Pre-Layer Normalization for stable training of very deep Transformers.
On Layer Normalization in the Transformer Architecture, Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, Tie-Yan Liu, 2020. ICML 2020. DOI: 10.48550/arXiv.2002.04745 - Provides an in-depth analysis of the training dynamics and stability differences between Pre-LN and Post-LN Transformer configurations (the placement difference is sketched below).
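For readers who want the normalization operation itself spelled out, the following is a minimal NumPy sketch of Layer Normalization as defined by Ba et al. (2016). It is illustrative only, not taken from any cited paper's code; the names `gamma`, `beta`, and `eps` are conventional choices rather than prescribed ones.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Layer Normalization (Ba et al., 2016): normalize each sample over its
    # feature dimension (last axis) to zero mean and unit variance, then apply
    # a learned elementwise affine transform (gamma, beta).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta
```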
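The Pre-LN vs. Post-LN distinction discussed in the entries above comes down to where that normalization sits relative to the residual connection. The sketch below shows the two residual-block orderings under that assumption; the function and argument names are illustrative, not drawn from the papers' implementations.

```python
def post_ln_block(x, sublayer, layer_norm):
    # Post-LN (original Transformer, Vaswani et al., 2017):
    # apply the sublayer, add the residual, then normalize the sum.
    return layer_norm(x + sublayer(x))

def pre_ln_block(x, sublayer, layer_norm):
    # Pre-LN (popularized by GPT-2, analyzed by Xiong et al., 2020):
    # normalize the sublayer input; the residual path itself is left
    # unnormalized, which tends to stabilize very deep stacks.
    return x + sublayer(layer_norm(x))
```

Xiong et al. (2020) tie the difference to gradient behavior at initialization: with Post-LN the gradients near the output layers are large, so learning-rate warm-up is needed, whereas Pre-LN keeps an identity residual path and trains stably without it.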