Deep Residual Learning for Image Recognition, Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, 2016. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). DOI: 10.1109/CVPR.2016.90 - Introduces residual connections, a key architectural component adopted by the Transformer to facilitate the training of deep neural networks.
Layer Normalization, Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton, 2016. arXiv preprint arXiv:1607.06450. DOI: 10.48550/arXiv.1607.06450 - Presents layer normalization, a method used in the Transformer encoder to stabilize activations and improve training efficiency.
Speech and Language Processing (3rd ed. draft), Daniel Jurafsky and James H. Martin, 2023. Pearson. - A widely recognized textbook providing a comprehensive explanation of Transformer models, including detailed breakdowns of the encoder's internal workings.