Attention Is All You Need. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. Advances in Neural Information Processing Systems 30 (NIPS 2017), 2017. DOI: 10.48550/arXiv.1706.03762 - Foundational paper introducing the Transformer architecture, including the Post-LN residual structure and scaled dot-product attention.
On Layer Normalization in the Transformer Architecture. Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, Tie-Yan Liu. Proceedings of the 37th International Conference on Machine Learning (ICML 2020), Vol. 119, 2020. DOI: 10.48550/arXiv.2002.04745 - Analyzes how the placement of layer normalization (Pre-LN vs. Post-LN) affects Transformer training stability and gradient flow.
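For quick reference, a minimal sketch of the mechanisms these two papers cover: scaled dot-product attention as defined by Vaswani et al. (2017), and the Post-LN vs. Pre-LN residual orderings compared by Xiong et al. (2020). The layer index l and the generic Sublayer(·) notation are illustrative shorthand, not taken verbatim from either paper.

```latex
% Scaled dot-product attention (Vaswani et al., 2017):
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V

% Post-LN residual block (original Transformer): normalize after the residual sum.
x_{l+1} = \mathrm{LayerNorm}\bigl(x_l + \mathrm{Sublayer}(x_l)\bigr)

% Pre-LN residual block (analyzed by Xiong et al., 2020): normalize inside the residual branch.
x_{l+1} = x_l + \mathrm{Sublayer}\bigl(\mathrm{LayerNorm}(x_l)\bigr)
```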