Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, 2017. Advances in Neural Information Processing Systems, Vol. 30 (Curran Associates, Inc.). DOI: 10.5555/3295222.3295349 - The foundational paper introducing the Transformer architecture, detailing the integration of residual connections and layer normalization within its encoder-decoder blocks.
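To make the annotation above concrete: the Transformer wraps each attention and feed-forward sublayer in a residual connection followed by layer normalization, i.e. LayerNorm(x + Sublayer(x)). Below is a minimal NumPy sketch of that post-norm wrapper; the function names and parameters are illustrative, not taken from any released implementation.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    # Normalize each position's feature vector to zero mean and unit
    # variance, then apply the learned scale (gamma) and shift (beta).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def transformer_sublayer(x, sublayer, gamma, beta):
    # Post-norm residual pattern from the paper: apply the sublayer
    # (attention or feed-forward), add the input back, then normalize.
    return layer_norm(x + sublayer(x), gamma, beta)

# Toy usage: an identity "sublayer" over a batch of 2 sequences,
# 4 tokens each, with a model dimension of 8.
x = np.random.randn(2, 4, 8)
y = transformer_sublayer(x, lambda h: h, gamma=np.ones(8), beta=np.zeros(8))
```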
Deep Residual Learning for Image Recognition, Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, 2016. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE). DOI: 10.1109/CVPR.2016.90 - Introduces residual networks (ResNets) and the residual connection itself, which enables the training of very deep neural networks by mitigating vanishing gradients.
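The residual connection itself is a one-line idea: instead of learning a direct mapping H(x), a block learns the residual F(x) = H(x) - x and outputs F(x) + x, giving gradients an identity path through the network. A hedged sketch, with names of my own choosing rather than the paper's:

```python
import numpy as np

def residual_block(x, f):
    # ResNet's identity shortcut: the block outputs y = F(x) + x.
    # If the optimal mapping is near the identity, F only needs to learn
    # a small correction, and the additive skip lets gradients flow back
    # through very deep stacks without vanishing along this path.
    return f(x) + x

# Toy usage: a "residual function" that just scales its input.
x = np.random.randn(4, 16)
y = residual_block(x, lambda h: 0.1 * h)
```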
Layer Normalization, Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton, 2016. arXiv preprint arXiv:1607.06450. DOI: 10.48550/arXiv.1607.06450 - Presents Layer Normalization, a technique for normalizing activations within a layer across features, which is crucial for stabilizing training in recurrent and sequence models like the Transformer.
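What distinguishes layer normalization from batch normalization is the axis the statistics are computed over: each example (or token position) is normalized across its own features, so the result does not depend on batch size or sequence length. A minimal sketch under that reading, again with illustrative names:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Mean and variance are taken over the feature axis for each example
    # independently; no statistics are shared across the batch, unlike
    # batch normalization.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

# Each row is normalized on its own: after the transform, every feature
# vector has (approximately) zero mean and unit variance.
x = np.random.randn(3, 10)
y = layer_norm(x, gamma=np.ones(10), beta=np.zeros(10))
assert np.allclose(y.mean(axis=-1), 0.0, atol=1e-6)
assert np.allclose(y.std(axis=-1), 1.0, atol=1e-3)
```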