The Role of Layer Normalization and Residual Connections
Deep Residual Learning for Image Recognition, Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, 2016. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE. DOI: 10.1109/CVPR.2016.90 - Introduces residual connections (skip connections), which enable the training of very deep neural networks by mitigating the vanishing/exploding gradient problem.
Layer Normalization, Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton, 2016. arXiv preprint arXiv:1607.06450. DOI: 10.48550/arXiv.1607.06450 - Proposes layer normalization, an alternative to batch normalization that normalizes activations across the features of each data point independently, making it particularly suitable for recurrent neural networks and Transformers.
Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin, 2017. Advances in Neural Information Processing Systems, Vol. 30 (NeurIPS) - The original Transformer paper, which combines multi-head self-attention with residual connections and layer normalization to build deep sequence-processing models.
On Layer Normalization in the Transformer Architecture, Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, Tie-Yan Liu, 2020. ICML 2020. DOI: 10.48550/arXiv.2002.04745 - Examines where layer normalization is placed within Transformer blocks, discussing the differences between Pre-LN and Post-LN and the advantages of Pre-LN for training stability and performance, especially in deep models (a minimal code sketch of the two placements follows this list).
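To make the interplay of residual connections, layer normalization, and the Pre-LN/Post-LN placement concrete, here is a minimal sketch of a single Transformer block. It is not taken from any of the cited papers' implementations; the class name `TransformerBlock`, the `pre_ln` flag, and the chosen dimensions are illustrative assumptions, and PyTorch is used only for brevity.

```python
# Minimal sketch (illustrative, not the papers' reference code) of one Transformer
# block showing residual connections plus layer normalization, with a flag that
# switches between Pre-LN and Post-LN placement.
import torch
import torch.nn as nn


class TransformerBlock(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8, pre_ln: bool = True):
        super().__init__()
        self.pre_ln = pre_ln
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.pre_ln:
            # Pre-LN: normalize *before* each sublayer, so the residual path is a
            # clean identity; Xiong et al. (2020) associate this with more stable
            # training of deep stacks.
            h = self.norm1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]
            x = x + self.ffn(self.norm2(x))
        else:
            # Post-LN: the original "Attention Is All You Need" arrangement,
            # normalizing *after* the residual addition.
            x = self.norm1(x + self.attn(x, x, x, need_weights=False)[0])
            x = self.norm2(x + self.ffn(x))
        return x


# Usage: a batch of 2 sequences, 16 tokens each, 512-dimensional embeddings.
block = TransformerBlock(pre_ln=True)
out = block(torch.randn(2, 16, 512))
print(out.shape)  # torch.Size([2, 16, 512])
```

Note that in both variants the sublayer output is added back to its input; only where the layer normalization sits relative to that residual addition changes.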