The Role of Layer Normalization and Residual Connections
Deep Residual Learning for Image Recognition, Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, 2016. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE. DOI: 10.1109/CVPR.2016.90 - Introduces residual connections (skip connections), which enable the training of very deep neural networks by mitigating the vanishing/exploding gradient problem.
Layer Normalization, Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton, 2016. arXiv preprint arXiv:1607.06450. DOI: 10.48550/arXiv.1607.06450 - Proposes layer normalization, an alternative to batch normalization that normalizes activations across the features of each data point independently, making it particularly suitable for recurrent neural networks and Transformers.
Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin, 2017. Advances in Neural Information Processing Systems, Vol. 30 (NeurIPS) - The original Transformer paper, which combines multi-head self-attention with residual connections and layer normalization to build deep sequence-processing models.
On Layer Normalization in the Transformer Architecture, Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, Tie-Yan Liu, 2020. ICML 2020. DOI: 10.48550/arXiv.2002.04745 - Examines where layer normalization is placed within Transformer blocks, discussing the differences between Pre-LN and Post-LN and the advantages of Pre-LN for training stability and performance, especially in deep models (a minimal code sketch of the two placements follows this list).
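To make the interplay of residual connections, layer normalization, and the Pre-LN/Post-LN placement concrete, here is a minimal sketch of a single Transformer block. It is not taken from any of the cited papers' implementations; the class name `TransformerBlock`, the `pre_ln` flag, and the chosen dimensions are illustrative assumptions, and PyTorch is used only for brevity.

```python
# Minimal sketch (illustrative, not the papers' reference code) of one Transformer
# block showing residual connections plus layer normalization, with a flag that
# switches between Pre-LN and Post-LN placement.
import torch
import torch.nn as nn


class TransformerBlock(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8, pre_ln: bool = True):
        super().__init__()
        self.pre_ln = pre_ln
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.pre_ln:
            # Pre-LN: normalize *before* each sublayer, so the residual path is a
            # clean identity; Xiong et al. (2020) associate this with more stable
            # training of deep stacks.
            h = self.norm1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]
            x = x + self.ffn(self.norm2(x))
        else:
            # Post-LN: the original "Attention Is All You Need" arrangement,
            # normalizing *after* the residual addition.
            x = self.norm1(x + self.attn(x, x, x, need_weights=False)[0])
            x = self.norm2(x + self.ffn(x))
        return x


# Usage: a batch of 2 sequences, 16 tokens each, 512-dimensional embeddings.
block = TransformerBlock(pre_ln=True)
out = block(torch.randn(2, 16, 512))
print(out.shape)  # torch.Size([2, 16, 512])
```

Note that in both variants the sublayer output is added back to its input; only where the layer normalization sits relative to that residual addition changes.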