Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin, 2017. Advances in Neural Information Processing Systems, Vol. 30 (NeurIPS). DOI: 10.5555/3295222.3295252 - Introduces the Transformer architecture, which forms the basis for the discussion on component initialization.
Understanding the difficulty of training deep feedforward neural networks, Xavier Glorot, Yoshua Bengio, 2010. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Vol. 9 (PMLR) - Presents Xavier initialization, a method for stabilizing the training of deep networks by preserving signal variance across layers.
torch.nn.init, PyTorch Contributors, 2022 (PyTorch Foundation) - Official documentation for PyTorch's initialization functions, including normal_, kaiming_uniform_, zeros_, and ones_.